Resources
  NOTE: All resources listed below are free of charge for academic researchers, who may use these datasets and tools provided they cite the references given. For commercial use, please write to me regarding technology transfer, which involves a charge/license fee payable to the Indian Statistical Institute. Unauthorized use may invite legal consequences.

Resources for Indic Language NLP


BenLem (a Bengali Lemmatizer): A rule-based Bengali lemmatizer has been developed and tested on a newly developed dataset, the BenLem dataset. Anyone using this dataset should cite the following paper: A. Chakrabarty and U. Garain (2015): BenLem (a Bengali Lemmatizer) and its Role in WSD, in ACM Trans. Asian and Low-Resource Language Information Processing (TALLIP). This work also reports preliminary Bengali word sense disambiguation (WSD) results. The Bengali WSD dataset used in this experiment is part of the BenLem dataset.

NeuroLem (a Neural Lemmatizer): A novel neural lemmatizer, tested on several languages including Hindi and Bengali, was reported at ACL 2017. The implementation of this work is available here on GitHub.

A larger annotated Bengali dataset of 20,257 words and Bengali word vectors trained on a large Bengali corpus are now available. Any researcher using the above implementation of our ACL 2017 work, the Bengali dataset, or the word vectors is requested to cite the following paper: A. Chakrabarty, O.A. Pandit, and U. Garain (2017): Context Sensitive Lemmatization Using Two Successive Bidirectional Gated Recurrent Networks, in ACL 2017:1481-1491.

Bengali POS tagger: We have adapted the Stanford POS tagger for Bengali POS tagging. It gives about 92% accuracy. See the Readme to use it for your needs. Details about the POS tag labels can be found here.

E2B Transliteration: For English-to-Bengali (E2B) transliteration, we have used the Joshua open source statistical machine translation (SMT) system as reconfigured by Irvine et al. As part of this development, we generated a list of 6000 E2B proper name pairs and a Bengali language model required by the transliteration system. See the Readme to use it for your needs.

Bengali Dependency Parser: We have used MaltParser to develop a dependency parser for Bengali. See the Readme to use it for your needs. Details about the tag labels can be found here. This development is partially supported by SNLTR.

English-Bengali Bilingual Dictionary: Here you will get a bilingual English-Bengali dictionary containing about 32,000 unique English terms. Most of the entries come from the dictionary available with the Ankur project. We manually cleaned up the entries and added a few more. Most of the English terms have more than one Bengali translation: only 14,764 English terms have exactly one Bengali translation, and the others have multiple (up to 16) different translations. In total, there are 70,808 term pairs (English term - Bengali translation). Although all English terms are single words, many of the Bengali translations are multiword expressions: in 26,915 of the 70,808 term pairs, the Bengali translation includes more than one word.
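
The counts quoted above (unique English terms, terms with a single translation, maximum number of translations, total pairs, multiword translations) can be recomputed from any list of term pairs. The following is only a minimal sketch: the on-disk format of the dictionary is not described here, so the code assumes the pairs have already been loaded as (English term, Bengali translation) tuples, and both the function name `dictionary_stats` and the toy data are hypothetical.

```python
from collections import defaultdict


def dictionary_stats(pairs):
    """Summarize a list of (english_term, bengali_translation) pairs."""
    # Group the distinct translations under each English term.
    translations = defaultdict(set)
    for english, bengali in pairs:
        translations[english].add(bengali)

    counts = [len(t) for t in translations.values()]
    return {
        "unique_english_terms": len(translations),
        "single_translation_terms": sum(1 for c in counts if c == 1),
        "max_translations": max(counts, default=0),
        "total_pairs": sum(counts),
        # A translation counts as multiword if it contains whitespace.
        "multiword_translations": sum(
            1
            for t in translations.values()
            for bengali in t
            if len(bengali.split()) > 1
        ),
    }


# Toy data for illustration only; not taken from the actual dictionary.
sample_pairs = [
    ("water", "জল"),
    ("water", "পানি"),
    ("book", "বই"),
    ("thanks", "অনেক ধন্যবাদ"),
]

stats = dictionary_stats(sample_pairs)
print(stats)
```

Run on the full dictionary, such a summary should reproduce the figures given above if the pairs are loaded without duplicates; any mismatch would point to a difference in how entries are split or normalized.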


Resources for Medical Image Analysis

Psoriatic Plaque Segmentation: Thanks to Dr. Raghunath Chatterjee of the Human Genetics Unit of ISI, Kolkata, an annotated dataset of 75 images has been generated for research on psoriasis image analysis. You can view a set of images here. Anyone who wants to use this dataset for academic/research (purely non-profit, non-commercial) purposes should contact Dr. Chatterjee or me and must cite the following paper when reporting results on this dataset:

Anabik Pal, Anandarup Roy, Kushal Sen, Raghunath Chatterjee, Utpal Garain, and Swapan Senapati, "Mixture model based color clustering for psoriatic plaque segmentation", in Proc. of the 3rd Asian Conf. on Pattern Recognition (ACPR), pp.376-380, Kuala Lumpur, Malaysia, 2015.


Resources for Document Image Analysis

EMERS: A tree-matching-based performance Evaluation of Mathematical Expression Recognition Systems. If you have developed a math recognizer, please have a look at our CROHME campaign.

Bengali Writer Identification/Verification Dataset: This dataset (about 342MB in size) contains samples from 40 subjects (native Bengali writers), each of whom wrote two samples. The content of one sample is particularly interesting, as it contains almost all major characters (basic as well as conjuncts) commonly used in Bengali text. Anyone using this dataset for academic/research (purely non-profit, non-commercial) purposes should cite the following two papers when reporting results on this dataset:
1. U. Garain, "A Stochastic Approach for Finding Optimal Context in a Contextual Pattern Analysis Task," in IEEE Intelligent Systems, vol. 31, no. 2 (March/April), 2016.
2. U. Garain and T. Paquet, "Off-Line Multi-Script Writer Identification using AR Coefficients", in Proc. of the 10th Int. Conf. on Document Analysis and Recognition (ICDAR), pp.991-995, Barcelona, Spain, July, 2009.


  Computer Vision & Pattern Recognition [CVPR] Unit
Indian Statistical Institute
203, B.T. Road, Kolkata 700 108, INDIA
+91.33.25.75.28.60: phone +91.33.25.77.30.35: fax
utpal (at) isical (dot) ac (dot) in : email
 
 