RISOT: Retrieval from Indic Script OCR'd Text


Introduction

RISOT focuses on evaluating IR effectiveness on Indic script OCR'd text. Participants will be provided with a relevance-judged collection of 62,825 articles from a leading Bangla newspaper, Anandabazar Patrika (2004-2006). For each article, both the original digital text and the corresponding OCR output are given. Relevance judgments are available for 92 topics. The OCR output is obtained by rendering each digital document as a document image, which is then processed by a Bangla OCR system. The document images vary in font face, character style, and size. The character-level (more specifically, Unicode-level) accuracy of the OCR engine is about 92%. Errors are counted at the level of individual Unicode code points: for instance, if <প><ু> is misrecognized as <এ>, this counts as 2 errors; if <স><্><ন> is misrecognized as <ল>, then 3 errors are counted.
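The error counts in these examples are consistent with a Unicode-level edit distance between the original text and the OCR output. A minimal Python sketch of that way of counting (illustrative only, not the organizers' official scoring code) follows.

# A minimal sketch of Unicode-level error counting: the edit distance
# between the original and OCR'd strings, computed over individual
# Unicode code points.  This is illustrative, not the official scorer.
def unicode_edit_distance(reference: str, ocr: str) -> int:
    """Levenshtein distance between two strings at the code-point level."""
    m, n = len(reference), len(ocr)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == ocr[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

# Examples from the task description: <প><ু> misrecognized as <এ>
# counts as 2 errors, <স><্><ন> misrecognized as <ল> counts as 3.
assert unicode_edit_distance("পু", "এ") == 2
assert unicode_edit_distance("স্ন", "ল") == 3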

Participants in the 2011 RISOT pilot task are expected to develop IR techniques to retrieve documents from these collections and to report MAP and Precision@10 separately for the digital text collection and for the OCR collection. Retrieval from the OCR collection is expected to show some degradation in IR effectiveness, so search algorithms are expected to make use of additional techniques (e.g., OCR error correction, modeling of OCR errors for IR purposes) to improve the performance of IR from OCR'd text.
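For reference, the sketch below shows one way to compute MAP and Precision@10 for a run. The data layout (a dict from topic id to a ranked list of document ids, and qrels as a dict from topic id to the set of relevant document ids) is illustrative only and is not a prescribed format.

# A minimal sketch of the two required measures.  The run/qrels layout
# used here is an assumption for illustration, not part of the task.
def average_precision(ranked_docs, relevant):
    """Average precision for one topic."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_docs, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def precision_at_k(ranked_docs, relevant, k=10):
    """Precision@k for one topic."""
    return sum(1 for d in ranked_docs[:k] if d in relevant) / k

def evaluate(run, qrels):
    """Mean Average Precision and mean Precision@10 over all judged topics."""
    topics = [t for t in qrels if t in run]
    map_score = sum(average_precision(run[t], qrels[t]) for t in topics) / len(topics)
    p10 = sum(precision_at_k(run[t], qrels[t], 10) for t in topics) / len(topics)
    return map_score, p10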

In subsequent years of FIRE we anticipate conducting an extended version of RISOT. Although we are asking participants to compute their own results using existing relevance judgments for the 2011 pilot task, in future years we expect to conduct blind evaluations using new relevance judgments. For the 2011 pilot task we have generated clean images from the text pages, but image degradation models could be applied before running the OCR. We could model the actual application with even higher fidelity by printing and then re-scanning at least a part of the collection, and with higher fidelity still by finding a subset of documents that were actually printed in the newspaper and scanning them. This could yield as many as four different versions of the OCR collection. Some participants in future years might also wish to contribute additional OCR results; in that case, participants would be provided with the image dataset along with the text collection. Adding documents in other Indic scripts such as Devanagari will also be considered in future years, as will additional evaluation measures. The specific design of the task in future years will, of course, be discussed among the potential participants. We therefore encourage the broadest possible participation in the 2011 pilot task in order to provide a basis for those discussions.

Important Dates

Corpus and Query Release: Aug 25, 2011
Submission of Results Due: Oct 25, 2011
Working Notes Due: Nov 25, 2011

Data

Registered participants will download the corpus from the following URL. The two collections (text and OCR) are provided in two separate directories; a text document and its corresponding OCR'd document have the same name. The topic set contains 92 topics, taken from the FIRE 2008 and 2010 topic sets. Each topic consists of three parts, namely title, description (desc), and narrative (narr), along with a unique query number. The title gives the focus of the information need, the description field states the information need somewhat more fully, and the narrative field provides content guidelines for identifying documents relevant to the topic. Here is the structure of a sample topic. Participants can build queries using these parts; for example, a query can be title-only, or title+description (the title and description fields combined). In their working notes, participants should explain how they constructed their queries. Existing relevance judgments have also been provided.
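As an illustration, the sketch below parses topics and builds title-only and title+description queries. It assumes FIRE's usual TREC-style topic markup (<top>, <num>, <title>, <desc>, <narr>); the tag names should be checked against the released topic file.

# A small sketch of building title-only and title+description queries
# from a topic file.  The markup and file name are assumptions for
# illustration; check the released topics for the exact format.
import re

def parse_topics(text):
    """Return a list of dicts with 'num', 'title', 'desc', 'narr' per topic."""
    topics = []
    for block in re.findall(r"<top>(.*?)</top>", text, re.S):
        topic = {}
        for field in ("num", "title", "desc", "narr"):
            match = re.search(rf"<{field}>(.*?)(?=<|$)", block, re.S)
            topic[field] = match.group(1).strip() if match else ""
        topics.append(topic)
    return topics

def build_query(topic, fields=("title",)):
    """Concatenate the chosen topic fields into a single query string."""
    return " ".join(topic[f] for f in fields if topic[f])

# Example usage (file name is illustrative):
# topics = parse_topics(open("topics.txt", encoding="utf-8").read())
# title_queries = [build_query(t, ("title",)) for t in topics]
# td_queries    = [build_query(t, ("title", "desc")) for t in topics]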



Task Organizers

Utpal Garain, ISI, Kolkata
Jiaul Paik, ISI, Kolkata
Tamaltaru Pal, ISI, Kolkata
Prasenjit Majumder, DAIICT, Gandhinagar
David Doermann, University of Maryland, College Park, USA
Doug Oard, University of Maryland, College Park, USA

Registration

For registration (or for any queries), please email Utpal Garain (utpal@isical.ac.in) or Jiaul Paik (jia.paik@gmail.com) with the following details:

1) Name(s) of the participant(s)
2) Affiliation(s)
3) Contact details (contact person, email, and telephone)

Run Submissions

Participants are expected to report MAP and Precision@10. The reported results will have the following format:
Number of queries = ???
Retrieved = ???
Relevant = ???
Relevant retrieved = ???
------------------------
Average Precision : ?????
R Precision : ?????
------------------------
Precision at 0: ?????
Precision at 10: ?????
Precision at 20: ?????
Precision at 30: ?????
Precision at 40: ?????
Precision at 50: ?????
Precision at 60: ?????
Precision at 70: ?????
Precision at 80: ?????
Precision at 90: ?????
Precision at 100: ?????

Participants will need to submit these results (i.e., MAP as well as Precision@10) for both the text and the OCR collections.
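
For reference, the sketch below fills in such a report from a run and the relevance judgments. It assumes the same run/qrels layout as the earlier evaluation sketch, and it interprets "Precision at 0 ... 100" as 11-point interpolated precision at 0%-100% recall (as in trec_eval output); that interpretation is an assumption rather than part of the task definition.

# A sketch of producing the reporting template above.  The run/qrels
# layout and the reading of "Precision at 0 ... 100" as interpolated
# precision at 11 recall levels are assumptions for illustration.
def average_precision(ranked_docs, relevant):
    """Average precision for one topic."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_docs, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def r_precision(ranked_docs, relevant):
    """Precision after |relevant| documents have been retrieved."""
    r = len(relevant)
    return sum(1 for d in ranked_docs[:r] if d in relevant) / r if r else 0.0

def interpolated_precision(ranked_docs, relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    hits, points = 0, []
    for rank, doc_id in enumerate(ranked_docs, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return [max((p for r, p in points if r >= i / 10), default=0.0)
            for i in range(11)]

def print_report(run, qrels):
    """Print one run's results in the requested layout."""
    topics = [t for t in qrels if t in run]
    n = len(topics)
    print(f"Number of queries = {n}")
    print(f"Retrieved = {sum(len(run[t]) for t in topics)}")
    print(f"Relevant = {sum(len(qrels[t]) for t in topics)}")
    print(f"Relevant retrieved = "
          f"{sum(len(set(run[t]) & qrels[t]) for t in topics)}")
    print("-" * 24)
    print(f"Average Precision : "
          f"{sum(average_precision(run[t], qrels[t]) for t in topics) / n:.4f}")
    print(f"R Precision : "
          f"{sum(r_precision(run[t], qrels[t]) for t in topics) / n:.4f}")
    print("-" * 24)
    per_topic = [interpolated_precision(run[t], qrels[t]) for t in topics]
    for i in range(11):
        print(f"Precision at {i * 10}: "
              f"{sum(ip[i] for ip in per_topic) / n:.4f}")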