RISOT 2012: Retrieval from Indic Script OCR'd Text

 

Introduction:

 

After introducing RISOT in 2011, we plan to continue the event this year, as the major issues in retrieval of Indic script OCR'd text are yet to be solved. Last year only one task was offered, namely retrieval from OCR'd Bengali text. This year we expand the experiment by adding two more tasks: development of effective OCR post-processing techniques for two major Indic scripts, and retrieval from Devanagari (Hindi) OCR'd text. The tasks are described below.

 

TASK LIST:

 

Task 1: OCR Post-processing for Devanagari (Hindi) and Bengali

 

Development of a robust post-processing module is an integral part of any OCR system. Successful English OCR systems make extensive use of such modules. However, comparable research is lacking for Indic script OCR. This task aims to facilitate such research for two major Indic scripts, namely Devanagari (Hindi) and Bengali. Clean texts and their corresponding OCR'd versions are provided, and the task is to correct the OCR'd output so that it differs as little as possible from the corresponding clean text. About 50,000 clean vs. OCR'd text pairs (for each of Devanagari-Hindi and Bengali) are provided so that participants can learn what types of mistakes the OCR engine makes. These mistakes can be modeled and the model used to correct the errors. The OCR'd texts are generated by rendering the clean texts as images. All these images are OCR'd by a single OCR engine (the same OCR engine is trained separately for Devanagari-Hindi and Bengali).
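As an illustration only (not part of the official data kit), the following minimal Python sketch shows one way the released clean/OCR'd pairs could be used to collect character-level confusion statistics; the directory layout and file names (clean/0001.txt, ocr/0001.txt) are hypothetical.

from collections import Counter
from difflib import SequenceMatcher

def confusion_counts(clean, ocr, counts):
    """Align a clean string with its OCR'd version and tally the differences."""
    matcher = SequenceMatcher(None, clean, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            counts[(clean[i1:i2], ocr[j1:j2])] += 1   # misrecognized span
        elif tag == "delete":
            counts[(clean[i1:i2], "")] += 1           # characters the OCR dropped
        elif tag == "insert":
            counts[("", ocr[j1:j2])] += 1             # spurious characters

counts = Counter()
for doc_id in ["0001", "0002"]:                       # hypothetical document ids
    with open("clean/%s.txt" % doc_id, encoding="utf-8") as f_clean, \
         open("ocr/%s.txt" % doc_id, encoding="utf-8") as f_ocr:
        confusion_counts(f_clean.read(), f_ocr.read(), counts)

for (truth, observed), n in counts.most_common(20):
    print(truth, "->", observed, ":", n)

Confusion statistics of this kind can then drive, for example, a simple noisy-channel or dictionary-based correction step.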

 

For evaluation purposes, participants need to submit their post-processing module (say, post-proc-ben.exe), which will take an OCR'd text file as input and return the corrected text as output:

post-proc-ben <input.txt> <out.txt>

The out.txt will be compared with the corresponding clean text, and Unicode-level accuracy will be measured using a dynamic string matching utility. For instance, if a two-character sequence is misrecognized as a single (different) character, this incurs 2 Unicode-level errors; if a three-character sequence is misrecognized as a single character, 3 errors are counted. Evaluation will be done on a new dataset that is not used for learning the error patterns. As post-processing is expected to be language specific, participants are encouraged to submit separate post-processing tools for Bengali and Hindi OCR'd text.
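The official evaluation utility is not distributed here, but the Unicode-level error count described above corresponds to a standard edit (Levenshtein) distance over code points, which can be sketched in Python as follows; the final assertion mirrors the two-characters-read-as-one example.

def unicode_errors(clean, corrected):
    """Minimum number of insertions, deletions and substitutions needed to
    turn the corrected output into the clean reference text."""
    m, n = len(clean), len(corrected)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if clean[i - 1] == corrected[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[n]

# A two-character sequence misrecognized as one unrelated character costs 2 errors.
assert unicode_errors("ab", "x") == 2

Accuracy can then be reported as, for example, one minus the error count divided by the length of the clean text.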

 

Task 2: Retrieval from Bengali OCR'd Text

The RISOT 2011 data is reused for this task. Participants will be provided with a relevance-judged collection of 62,825 articles from a leading Bangla newspaper, Anandabazar Patrika (2004-2006). For each article, both the clean text and the OCR'd result are given. The text collection is relevance judged against 66 queries. Each query has three parts: title, description and narrative. The OCR output is obtained by converting each text document into a clean image, which is then read by a Bangla OCR engine. The character-level (more specifically, Unicode-level, as explained in Task 1) accuracy of the OCR engine is about 92%.

Task 3: Retrieval from Devanagari (Hindi) OCR'd Text

A new dataset is provided for experimenting with retrieval from Devanagari (Hindi) OCR'd text. This dataset is a relevance-judged collection of about 100,000 articles from a leading Hindi newspaper, Dainik Jagran (2005-2007). For each article, both the clean text and the OCR'd result are given. Text articles are rendered as images, which are then OCR'd using the same engine as in Task 2, retrained on Devanagari characters to produce the Task 3 collection. The OCR engine does not use any post-processing tool. The Hindi collections (both the text and the OCR'd one) are relevance judged against about 100 queries. Each query has three parts: title, description and narrative.

Participants are expected to develop IR techniques to retrieve documents from these collections. Retrieval from the OCR'd collections is expected to show some degradation in IR effectiveness, and the search algorithms are therefore expected to make use of additional techniques (e.g., OCR error correction, modelling of OCR errors for IR purposes, query expansion, n-gram based indexing) to improve the performance of IR from OCR'd data.
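As a concrete illustration of one such technique, the Python sketch below indexes documents by character n-grams, which tends to be more robust to OCR errors than whole-word indexing because a single misrecognized character corrupts only a few n-grams. The toy documents, query and scoring function are placeholders, not a recommended baseline.

import math
from collections import Counter

def char_ngrams(text, n=4):
    """Bag of character n-grams, with whitespace normalized to single spaces."""
    text = " ".join(text.split())
    if len(text) <= n:
        return Counter([text])
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(q, d):
    """Cosine similarity between two n-gram bags."""
    num = sum(q[g] * d[g] for g in set(q) & set(d))
    den = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return num / den if den else 0.0

docs = {"D1": "first OCR'd document text", "D2": "second OCR'd document text"}  # placeholders
index = {doc_id: char_ngrams(text) for doc_id, text in docs.items()}
query = char_ngrams("query built from the title and description fields")
ranking = sorted(index, key=lambda doc_id: cosine(query, index[doc_id]), reverse=True)
print(ranking)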

 

Submissions for Task 2 and Task 3:

Each submission file should contain 1000 documents per topic, ranked 0-999, in the usual TREC / CLEF submission format, i.e. each line in the file should have the following fields:

<Query id> Q0 <DOCNO> <RANK> <SIMILARITY> <Run-ID>

Participants are required to submit at least one run that uses only the title and description fields (no narrative) of the topics. There is no upper limit on the number of submitted runs.
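As an illustration, a run in this format can be written with a few lines of Python; the query id, document numbers, run id and file name shown are hypothetical placeholders to be replaced by the participant's own retrieval output.

def write_run(results, run_id, path):
    """results maps each query id to a list of (docno, similarity) pairs,
    sorted by decreasing similarity; at most 1000 documents are written per topic."""
    with open(path, "w", encoding="utf-8") as out:
        for qid, ranked in results.items():
            for rank, (docno, sim) in enumerate(ranked[:1000]):
                out.write("%s Q0 %s %d %.4f %s\n" % (qid, docno, rank, sim, run_id))

# Hypothetical example: one topic, two retrieved documents.
write_run({"26": [("doc0001", 12.37), ("doc0042", 11.85)]},
          run_id="myrun-titledesc-1", path="myrun-titledesc-1.txt")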

Important Dates

 

Corpus and Query Release: July 20, 2012

Submission of Results Due: Oct 20, 2012

Working Note Due: Nov 20, 2012

 

Data

 

Registered participants can download the corpus from http://www.isical.ac.in/fire/risot.

 

Task-1: A collection of 50,000 clean vs. OCR'd text pairs for Bengali is already available. Registered participants can request this data before its official release on July 20, 2012. The data for Hindi will be provided soon.

 

Task-2: As the RISOT 2011 data is reused, the collections (i.e. text and OCR'd) are available in two different directories. Registered participants can request this collection before its official release. A text document and its corresponding OCR'd document have the same name. The query set contains 66 queries. The relevance judgement report is also provided.

 

Task-3: Development of the collection is in progress. It is expected that this collection will contain about 100,000 document pairs (clean text and its corresponding OCR'd text). The collection will be relevance judged against about 100 queries.

 

Track Organizers

 

Utpal Garain, ISI, Kolkata.

Jiaul Paik, ISI, Kolkata.

Tamaltaru Pal, ISI, Kolkata.

Kripa Ghosh, ISI, Kolkata.

David Doermann, Univ. of Maryland, USA.

Douglas W. Oard, Univ. of Maryland, USA.

 

Registration

 

For registration (or for any queries), please email Utpal Garain (utpal@isical.ac.in), Jiaul Paik (jia.paik@gmail.com), or Kripa Ghosh (kripa.ghosh@gmail.com), giving the following details: 1) Name(s) of the participant(s); 2) Affiliation(s); and 3) Contact details (contact person, email, and telephone).