  Resources for Indic Language NLP

English-Bengali Bilingual Dictionary: This bilingual English-Bengali dictionary contains about 32,000 unique English terms. Most of the entries come from the dictionary available with the Ankur project. We manually cleaned up the entries and added a few more. Most of the English terms in this dictionary have more than one Bengali translation: only 14,764 English terms have exactly one Bengali translation, while the others have multiple (up to 16) different translations. In total, there are 70,808 term pairs (English term, Bengali translation). Although all English terms are single words, many of the Bengali translations are multiword expressions: in 26,915 of the 70,808 term pairs, the Bengali translation includes more than one word.
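The counts quoted above (unique English terms, terms with a single translation, maximum number of translations, total term pairs, and multiword translations) can be recomputed from a list of term pairs. The following is only a minimal sketch: the function name `dictionary_stats`, the in-memory pair format, and the whitespace-based definition of "more than one word" are assumptions for illustration, not the distributed file format or the procedure actually used.

```python
from collections import defaultdict


def dictionary_stats(pairs):
    """Summarise (english, bengali) term pairs.

    `pairs` is an iterable of (english_term, bengali_translation) tuples.
    This in-memory format is an assumption made for illustration.
    """
    translations = defaultdict(set)
    for english, bengali in pairs:
        # A set drops exact duplicate pairs, so each pair is counted once.
        translations[english.strip()].add(bengali.strip())

    counts = [len(t) for t in translations.values()]
    # Assumption: a translation is "multiword" if it contains whitespace.
    multiword = sum(
        1
        for bengali_set in translations.values()
        for bengali in bengali_set
        if len(bengali.split()) > 1
    )
    return {
        "english_terms": len(translations),
        "single_translation_terms": sum(1 for c in counts if c == 1),
        "max_translations": max(counts, default=0),
        "term_pairs": sum(counts),
        "multiword_pairs": multiword,
    }


# Tiny made-up sample, not taken from the dictionary itself.
sample = [
    ("water", "জল"),
    ("water", "পানি"),
    ("book", "বই"),
    ("school", "শিক্ষা প্রতিষ্ঠান"),
]
stats = dictionary_stats(sample)
print(stats)
```

Run on the full dictionary, such a summary would be expected to reproduce the figures reported above, provided the pairs are loaded in the same way they were counted.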

Bengali POS tagger: We have adapted the Stanford POS tagger for Bengali POS tagging. It gives about 92% accuracy. See the Readme for usage instructions. Details about the POS tag labels can be found here.
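To check the tagger on your own annotated data, one common measure is token-level accuracy: the fraction of tokens whose predicted tag matches the gold tag. The sketch below assumes this measure; the page does not state how the 92% figure was computed. The function name `tagging_accuracy` and the tags `NN` and `VM` are illustrative placeholders, not necessarily labels from the tagset described above.

```python
def tagging_accuracy(gold, predicted):
    """Token-level accuracy over parallel lists of tag sequences.

    `gold` and `predicted` are lists of sentences, each sentence a list of
    tag strings. Assumption: both use the same tokenisation.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted have different sentence counts")

    correct = 0
    total = 0
    for gold_sent, pred_sent in zip(gold, predicted):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("sentence lengths differ; tokenisation mismatch")
        for gold_tag, pred_tag in zip(gold_sent, pred_sent):
            total += 1
            if gold_tag == pred_tag:
                correct += 1
    # Return 0.0 for empty input to avoid dividing by zero.
    return correct / total if total else 0.0


# Made-up example: two of the three tokens are tagged correctly.
gold_tags = [["NN", "VM"], ["NN"]]
predicted_tags = [["NN", "NN"], ["NN"]]
accuracy = tagging_accuracy(gold_tags, predicted_tags)
print(f"{accuracy:.2%}")
```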

E2B Transliteration: We have used the Joshua open-source statistical machine translation (SMT) system, as reconfigured by Irvine et al., for E2B transliteration. As part of this development, we generated a list of 6,000 E2B proper-name pairs and the Bengali language model required by the transliteration system. See the Readme for usage instructions.

Bengali Dependency Parser: We have used MaltParser to develop a dependency parser for Bengali. See the Readme for usage instructions. Details about the tag labels can be found here. This development is partially supported by SNLTR.

BenLem (Bengali Lemmatizer): A novel Bengali lemmatizer has been developed. The algorithm has been tested on a newly developed dataset, known as the BenLem dataset. This dataset can be used for non-profit research purposes only, and any use must cite the following paper: A. Chakrabarty and U. Garain (2015): BenLem (a Bengali Lemmatizer) and its Role in WSD, in ACM Trans. Asian and Low-Resource Language Information Processing (TALLIP). This work also reports preliminary Bengali WSD results. The Bengali WSD dataset used in this experiment is part of the BenLem dataset.

Resources for Document Image Analysis

EMERS: A tree-matching-based performance Evaluation of Mathematical Expression Recognition Systems. If you have developed a math recognizer, please have a look at our CROHME campaign.

Bengali Writer Identification/Verification Dataset: This dataset (about 342MB in size) contains samples from 40 subjects (native Bengali writers), each of whom wrote two samples. The content of one sample is of particular interest, as it contains almost all major characters (basic as well as conjuncts) usually used in Bengali text. Anyone using this dataset for academic/research (purely non-profit, non-commercial) purposes should cite the following two papers when reporting results on this dataset:
1. U. Garain, "A Stochastic Approach for Finding Optimal Context in a Contextual Pattern Analysis Task," in IEEE Intelligent Systems, vol. 31, no. 2 (March/April), 2016.
2. U. Garain and T. Paquet, "Off-Line Multi-Script Writer Identification using AR Coefficients", in Proc. of the 10th Int. Conf. on Document Analysis and Recognition (ICDAR), pp.991-995, Barcelona, Spain, July, 2009.

Resources for Medical Image Analysis

Psoriatic Plaque Segmentation: Thanks to Dr. Raghunath Chatterjee of the Human Genetics Unit of ISI, Kolkata, an annotated dataset of 75 images has been generated for conducting research on Psoriasis Image Analysis. You can view a set of images here. Anyone who wants to use this dataset for academic/research (purely non-profit, non-commercial) purposes should contact Dr. Chatterjee or me and must cite the following paper when reporting results on this dataset:

Anabik Pal, Anandarup Roy, Kushal Sen, Raghunath Chatterjee, Utpal Garain, and Swapan Senapati, "Mixture model based color clustering for psoriatic plaque segmentation", in Proc. of the 3rd Asian Conf. on Pattern Recognition (ACPR), pp.376-380, Kuala Lumpur, Malaysia, 2015.

