INDIAN SCRIPT CHARACTER DATABASES

HANDWRITTEN / SCENE CHARACTER / DEGRADED DOCUMENT DATABASES OF INDIC SCRIPTS

Research on OCR systems for scanned printed documents in Indian scripts has been ongoing for quite some time. In contrast, comparatively little work on handwriting recognition, scene text recognition or degraded document analysis of Indian scripts is available in the literature. Unfortunately, printed-OCR technology cannot simply be extended to the recognition of such characters, images or documents because of the enormous variability in their samples.

Devanagari is the most widely used script of India, while Bangla is the second-most popular language and script of the Indian subcontinent and the fifth-most popular language in the world. Several other scripts, such as Tamil, Telugu, Kannada, Malayalam and a few others, are used by significant sections of the Indian population.

Most of the available work on handwriting or scene text recognition of Indian scripts is based on small or non-standard databases collected in laboratory environments. To change this situation, we developed a few large databases of handwritten or scene characters of major Indic scripts. The latest addition to this collection is ISIDDI, an image database of Bangla degraded documents. These databases have either already been made available free of cost to academic researchers or are being prepared for release. The databases are listed below.

1. Offline handwritten database

(a) Numerals

(i) Devanagari

(ii) Bangla

(iii) Oriya

(b) Basic characters

(i) Bangla

(ii) Devanagari

(c) Bangla vowel modifiers

(d) Bangla compound characters

2. Online handwritten database

(a) Bangla numerals