Resource Centre for Indian Language Technology Solutions Bangla

Activities of the Resource Centre:

A. Corpus development:

B. System development:

C. Font generation:

D. R & D work:

E. Web site development

F. Training Programmes

G. End products:

  • CD-ROMs containing (i) corpus of Bangla document images, (ii) corpus of Bangla documents in electronic text form and (iii) corpus of Bangla speech data.
  • Web site on Eastern Indian Language Technologies.
  • Bangla fonts package for various platforms.
  • Prototypes of (i) OCR system for Oriya and Assamese, (ii) information retrieval system for Bangla electronic documents, (iii) multi-lingual script line separation system, (iv) automatic processing system for handprinted table-form documents and (v) Neural Network based tools for printed document (in Eastern regional scripts) processing.

Web site on Eastern Indian language and language technologies: We have started work on designing the web site. At present, this site contains (i) details about the MIT project (Resource Centre for Indian Language Technology Solutions - Bangla), (ii) a brief description of the products developed in our department and (iii) information about the Bangla language and script. We are currently collecting information about other institutions and companies engaged in technology development for Indian languages. We have contacted the local language academy (Bangla Academy). A prototype of a Web-based front-end to our spell-checker and phonetic dictionary has been developed using an evaluation copy of CDAC's GIST Software Development Kit (SDK). We are in the process of purchasing an official copy of the GIST SDK and iPlugin (or Modular Software's Shree Lipi), so that this front-end can be put on the Internet for public use. We are also exploring the issues related to hosting our web site on a server or Web hosting service. This site will be a part of the TDIL (Technology Development for Indian Languages) initiative.

Training programmes and workshops: We organized a 5-day training programme-cum-workshop from 26 to 30 March 2001. The first three days had introductory talks and tutorials on various areas of language technology. The last two days featured lectures by international experts on these subjects. A total of 29 speakers presented talks at the workshop, and there were 75 participants from India and abroad.

Electronic corpus of document images: The composition of existing document image databases for English (the University of Washington database) has been studied. Several Bangla books published by various publishers have been selected as sources (e.g. Maitreya Jaatak, Pratham Aalo, Mahabharat Katha, etc.), and several hundred pages have already been scanned. The textual content of the images is extracted by running our OCR system on the images. The output is corrected by hand to generate ground truth for the images. A set of over 400 page images, along with the corresponding ground truth, has been written to a CD-ROM.
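
Ground truth of this kind is commonly used to score OCR output at the character level. The following is a minimal sketch of one such measure, assuming accuracy is defined as one minus the edit distance divided by the ground-truth length. The source does not state which definition was used, and the function names are illustrative only.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(ocr_text: str, ground_truth: str) -> float:
    """Assumed definition: 1 - (edit distance / ground-truth length), floored at 0."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ocr_text, ground_truth)
    return max(0.0, 1.0 - errors / len(ground_truth))
```

For example, an OCR output that differs from a four-character ground truth by one substituted character would score 0.75 under this definition.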

Text corpus in electronic form: Some sources have been identified for inclusion in the corpus. Several novels (by Bankim Chandra Chattopadhyaya, Ishwar Chandra Vidyasagar, etc.) have been entered into the computer in ISCII format. About 10,000 words of a bilingual (Bangla-English) dictionary have also been entered. A comprehensive electronic dictionary for Bangla (with 65,000 words) has been constructed and checked. We have also started designing an electronic thesaurus for Bangla (based on WordNet, a well-known electronic resource for the English language).

Electronic corpus of speech data: The composition of existing speech databases for English has been studied. We have categorized speech data into several classes based on criteria such as sex, age, and region of speaker, place of data collection, whether the source of the spoken material is a written script, etc. Based on this, we have designed the composition of the database. We have also contacted All India Radio, Calcutta as a potential source of audio material.
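
The classification criteria listed above could be recorded as per-sample metadata, which makes it easy to check how balanced the collection is. The sketch below is only an illustration: the field names, the example values and the composition helper are assumptions, not the project's actual schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechSample:
    """Hypothetical metadata record; fields mirror the criteria named in the text."""
    speaker_sex: str        # e.g. "female", "male"
    age_group: str          # e.g. "18-30" (assumed banding)
    speaker_region: str     # region the speaker comes from
    recording_place: str    # where the data was collected
    read_from_script: bool  # True if the speaker read a written script


def composition(samples, *fields):
    """Count samples per combination of the chosen metadata fields."""
    return Counter(tuple(getattr(s, f) for f in fields) for s in samples)
```

Calling composition(samples, "speaker_sex", "age_group") would give the number of recordings in each sex and age cell, which can then be compared with the designed composition.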

Bangla font generation: Font design techniques for the PC (Windows) platform have been studied. A set of Bangla character glyphs has been designed, taking kerning constraints into account. A Bangla TrueType font file has been generated, and a font driver (the keyboard logic for writing with this font) has also been implemented. We have written a couple of conversion routines to convert between ISCII code and our font code. We are currently exploring the possibility of creating a dynamic version of our font for use on the WWW by using BitStream Wizard.
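
A conversion between a character code and a font code can, in its simplest form, be a table lookup in each direction. The sketch below shows only that idea. All byte values are placeholders invented for illustration; they are neither real ISCII assignments nor the project's font codes, and an actual converter would probably also have to handle conjuncts and reordered vowel signs, which a one-to-one table cannot express.

```python
from typing import Mapping

# Placeholder values for illustration only: NOT real ISCII or font codes.
ISCII_TO_FONT = {0x10: 0x41, 0x11: 0x42, 0x12: 0x43}

# Reverse table, valid only because the placeholder mapping is one-to-one.
FONT_TO_ISCII = {font: iscii for iscii, font in ISCII_TO_FONT.items()}


def convert(data: bytes, table: Mapping[int, int]) -> bytes:
    """Map each byte through the table; bytes without an entry pass through unchanged."""
    return bytes(table.get(b, b) for b in data)
```

Passing unmapped bytes through unchanged keeps punctuation and spaces intact, at the cost that a round trip is only guaranteed for bytes that do not collide with mapped values.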

OCR system for Oriya: A prototype OCR system for Oriya has been implemented. The system includes modules for skew correction; line, word, and character segmentation; and character recognition. In preliminary experiments, the system achieves about 96% character-level accuracy. More work is required for recognizing compound characters.
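
The text does not say how the line segmentation module works. One common approach is a horizontal projection profile, sketched below under the assumptions that the page is already binarized (1 = ink, 0 = background) and skew-corrected. The function name and the min_ink parameter are illustrative only.

```python
def segment_lines(image, min_ink=1):
    """Split a binary page image into text lines.

    image: list of rows, each a list of 0/1 ints (1 = ink).
    Returns (start_row, end_row) pairs with end_row exclusive.
    """
    lines = []
    start = None
    for y, row in enumerate(image):
        has_ink = sum(row) >= min_ink  # projection of this row onto the y-axis
        if has_ink and start is None:
            start = y                  # first inked row of a new line
        elif not has_ink and start is not None:
            lines.append((start, y))   # blank row closes the current line
            start = None
    if start is not None:              # page ends while still inside a line
        lines.append((start, len(image)))
    return lines
```

The same idea applied to columns within a line gives a first approximation to word segmentation, although touching or overlapping characters usually need more than a plain profile.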

Adaptation of Bangla OCR to Assamese: A prototype system has already been prepared. Further font-specific modifications are being made to bring it to a commercial level.

Information Retrieval system for Bangla documents: A Bangla stop-word list has been constructed by combining statistical and manual methods. An elementary stemming algorithm for Bangla has been implemented and is currently being refined.
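
As a rough illustration of these two steps, the sketch below proposes stop-word candidates by raw frequency (to be reviewed by hand) and stems words by stripping the longest matching suffix. It assumes Unicode text split on whitespace rather than ISCII, and the short suffix list is a hypothetical sample, not the project's actual rule set.

```python
from collections import Counter


def stopword_candidates(documents, top_n=50):
    """Return the top_n most frequent whitespace-separated tokens for manual review."""
    counts = Counter(word for doc in documents for word in doc.split())
    return [word for word, _ in counts.most_common(top_n)]


# Illustrative sample of Bangla inflectional suffixes (an assumption, far from complete).
SUFFIXES = ("গুলো", "দের", "কে", "রা", "ের", "র")


def stem(word, suffixes=SUFFIXES, min_stem_len=2):
    """Strip the longest matching suffix, keeping at least min_stem_len characters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word
```

The minimum stem length guards against over-stemming very short words, which is one of the usual refinements for a purely suffix-based stemmer.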

Script identification and separation from Indian multi-script documents: We have completed a survey of the literature on multi-lingual script separation. The distinguishing features of some Indian scripts have been identified. We are in the process of implementing a prototype for identifying lines of different scripts.

Automatic processing of handprinted table-form documents: We have started creating a collection of table-form documents. A neural-network based system for recognizing isolated handwritten numerals has been designed. We are in the process of implementing and testing the system.

Neural-Network based tools for printed document processing: We are currently surveying the literature on similar document analysis tools for English. Two different NN learning algorithms that are likely to be useful in this project have been implemented and tested.


For more information, mail us at rc_bangla@isical.ac.in