ISI Bangla Scene Character Database (Version 2014)



Bangla script is used to write Bangla and a few other languages of the eastern part of South Asia, such as Assamese and Manipuri. The script has official status in two neighboring countries, Bangladesh and India, and is the sixth most widely used script in the world. Its alphabet has several diacritics and a large number of conjunct characters in addition to 50 basic characters, which comprise 11 vowels and 39 consonants.

To meet the need for a standard dataset of Bangla scene characters for planned research on scene text recognition, a dataset of Bangla characters or their parts has recently been developed at the Computer Vision and Pattern Recognition Unit of the Indian Statistical Institute, Kolkata. Its samples have been extracted from 260 outdoor scene images captured at different times in the streets, lanes and by-lanes of the state of West Bengal, India, using a variety of digital cameras. Because several Bangla characters occur very rarely in real-life text, we added artificially created samples of these characters, produced with Microsoft PowerPoint. A small subset of this sample database may be downloaded by clicking here. (We hope to release the entire segmented character database soon.)

The filename of each real sample has the form

<Unicode of parent word>_<Graphical transliteration of parent word>_<File name of source scene image>_<Left column number of the character in its parent word>_<Right column number of the character in its parent word>.jpg

while the filename of each artificial sample has the form

<Character class number>_<Graphical transliteration of the character>_<Sample sequence number>.jpg
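The two file naming patterns described above could be parsed roughly as in the following Python sketch. It is an illustration only, not part of the dataset distribution. The class and function names (RealSample, ArtificialSample, parse_sample_name) are hypothetical, and the sketch assumes that the Unicode and transliteration fields contain no underscores and that the column, class and sequence numbers are plain decimal integers. The source image name is allowed to contain underscores. The exact encoding of the Unicode field is not specified here, so it is kept as an opaque string.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Union


@dataclass
class RealSample:
    parent_word_unicode: str   # kept as an opaque string; encoding is not specified
    parent_word_translit: str
    source_image: str          # may itself contain underscores
    left_col: int
    right_col: int


@dataclass
class ArtificialSample:
    class_number: int
    char_translit: str
    sequence_number: int


def parse_sample_name(filename: str) -> Union[RealSample, ArtificialSample]:
    """Split a sample filename into its fields (assumed layout, see lead-in)."""
    parts = Path(filename).stem.split("_")

    # Artificial sample: <class number>_<transliteration>_<sequence number>
    if len(parts) == 3 and parts[0].isdigit() and parts[2].isdigit():
        return ArtificialSample(int(parts[0]), parts[1], int(parts[2]))

    # Real sample: the last two fields are the left and right column numbers;
    # everything between the second field and those two is the image name.
    if len(parts) >= 5 and parts[-2].isdigit() and parts[-1].isdigit():
        return RealSample(
            parent_word_unicode=parts[0],
            parent_word_translit=parts[1],
            source_image="_".join(parts[2:-2]),
            left_col=int(parts[-2]),
            right_col=int(parts[-1]),
        )

    raise ValueError(f"Unrecognized sample filename: {filename!r}")
```

For example, with made-up names, parse_sample_name("12_ka_3.jpg") yields an ArtificialSample, while parse_sample_name("0995_ka_IMG_0012_40_85.jpg") yields a RealSample whose source image is "IMG_0012" and whose columns are 40 and 85. Distinguishing the two kinds of sample by field count is a design choice of this sketch, and it would need revisiting if the actual files deviate from the assumptions above.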

Since a piece of Bangla text has three distinct regions, namely the upper, middle and lower regions, and since the script has a large character set, a common approach to developing an automatic recognition system for this script is to segment each line of Bangla text into these three horizontal regions and to use a distinct recognizer for each of them. Accordingly, the present database provides samples of Bangla characters, or their segmented parts, belonging to each of the three regions, as can be seen in the figure below.


Reference: S. Tian, U. Bhattacharya, S. Lu, B. Su, Q. Wang, X. Wei, Y. Lu and C. L. Tan, Multilingual Scene Character Recognition with Co-occurrence of Histogram of Oriented Gradients, Pattern Recognition (available online).

Application form for obtaining "ISI BENGALI_CHARACTER_DATASET"
