RISOT: Retrieval from Indic Script OCR'd Text
RISOT focuses on evaluating IR effectiveness on Indic-script OCR'd text. Participants will be provided with a relevance-judged collection of 62,825 articles from a leading Bangla newspaper, Anandabazar Patrika (2004-2006). For each article, both the original digital text and the corresponding OCR output are given. Relevance judgments are available for 92 topics. The OCR output is obtained by rendering each digital document as a document image, which is then processed by a Bangla OCR system. The document images vary in font face, character style, and size. The character-level (more specifically, Unicode code-point-level) accuracy of the OCR engine is about 92%. Errors are counted per code point: for instance, if <প><ু> is misrecognized as <এ>, this incurs 2 Unicode-level errors; if <স><্><ন> is misrecognized as <ল>, then 3 errors are counted.
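The error counts in the examples above correspond to Levenshtein edit distance computed over Unicode code points. A minimal sketch of how such accuracy could be measured (the function names are illustrative, not part of the task's official tooling):

```python
def levenshtein(ref, hyp):
    """Edit distance between two strings, counted over Unicode code points."""
    n = len(hyp)
    prev = list(range(n + 1))  # distances for the empty reference prefix
    for i in range(1, len(ref) + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + (ref[i - 1] != hyp[j - 1]))  # substitution
        prev = cur
    return prev[n]

def unicode_accuracy(ref, hyp):
    """Code-point-level accuracy of OCR output `hyp` against reference `ref`."""
    return 1.0 - levenshtein(ref, hyp) / max(len(ref), 1)
```

Under this measure, `levenshtein("পু", "এ")` is 2 (one substitution plus one deletion) and `levenshtein("স্ন", "ল")` is 3, matching the error counts quoted above.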
Participants in the 2011 RISOT pilot task are expected to develop IR techniques to retrieve documents from these collections and to report MAP and Precision@10 separately for the digital text collection and for the OCR collection. Retrieval from the OCR collection is expected to show some degradation in IR effectiveness, so search algorithms are expected to employ additional techniques (e.g., OCR error correction, or modeling of OCR errors for IR purposes) to improve the performance of IR from OCR'd text.
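The official measures are typically computed with standard tools such as trec_eval, but their definitions are simple enough to sketch directly (a minimal illustration; document IDs and the run structure below are assumptions):

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """Mean of precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over a list of (ranked_list, relevant_set) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For example, a ranking `["d1", "d2", "d3", "d4"]` with relevant set `{"d1", "d3"}` has average precision (1/1 + 2/3) / 2 = 5/6.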
In subsequent years of FIRE we anticipate conducting an extended version of RISOT. Although we are asking participants to compute their own results using existing relevance judgments for the 2011 pilot task, in future years we would expect to conduct blinded evaluations using new relevance judgments. For the 2011 pilot task we have generated clean images from the text pages, but image degradation models could be applied before running the OCR. We could model the actual application with even higher fidelity by printing and then re-scanning at least part of the collection, and higher fidelity still could be achieved by finding a subset of documents that were actually printed in the newspaper and scanning them. This could generate as many as four different versions of the OCR collection. Some participants in future years might also wish to contribute additional OCR results; in that case, they would be provided with the image dataset along with the text collection. Adding documents in other Indic scripts, such as Devanagari, will also be considered in future years, and we may consider adopting additional evaluation measures. The specific design of the task in future years will, of course, be discussed among the potential participants. We therefore encourage the broadest possible participation in the 2011 pilot task in order to provide a basis for those discussions.
| Milestone | Date |
| --- | --- |
| Corpus and Query Release | Aug 25, 2011 |
| Submission of Results Due | Oct 25, 2011 |
| Working Note Due | Nov 25, 2011 |
Registered participants will download the corpus from the following URL. The two collections (i.e., text and OCR) are given in two separate directories; a text document and its corresponding OCR'd document have the same file name. The topic set contains 92 topics, taken from the FIRE 2008 and 2010 topic sets. Each topic consists of three parts, namely a title, a description (desc), and a narrative (narr), along with a unique query number. The title gives the focus of the information need, the description gives a more detailed statement of the information need, and the narrative provides content guidelines for judging which documents are relevant to the topic. Here is the structure of a sample topic. Participants may build queries from any of these parts; a query can be, for example, title-only, or title plus description (the title and description fields combined). Participants should therefore explain in their working notes how their queries were constructed. Existing relevance judgments have also been provided.
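As a rough illustration of building title-only or title+description queries from a topic, the sketch below assumes TREC-style SGML tags of the kind commonly used in TREC and FIRE topic files; the exact tag names in the released files, and the sample topic text itself, are hypothetical:

```python
import re

# Hypothetical topic in TREC-style markup; the real topic files
# may use different tag names or content.
SAMPLE_TOPIC = """<top>
<num>26</num>
<title>Singur land dispute</title>
<desc>Information about the land acquisition dispute in Singur.</desc>
<narr>Relevant documents discuss the dispute, its parties, and its outcome.</narr>
</top>"""

def parse_topic(text):
    """Extract the num, title, desc, and narr fields from one topic."""
    def field(tag):
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.S)
        return m.group(1).strip() if m else ""
    return {t: field(t) for t in ("num", "title", "desc", "narr")}

def build_query(topic, fields=("title",)):
    # fields=("title",) gives a title-only query;
    # fields=("title", "desc") combines title and description.
    return " ".join(topic[f] for f in fields)
```

A title-only run would use `build_query(topic)`, while a title+description run would use `build_query(topic, ("title", "desc"))`; the working notes should record which combination was used.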
Organizers: Utpal Garain (ISI, Kolkata), Jiaul Paik (ISI, Kolkata), Tamaltaru Pal (ISI, Kolkata), Prasenjit Majumder (DAIICT, Gandhinagar), David Doermann (University of Maryland, College Park, USA), and Doug Oard (University of Maryland, College Park, USA).
Number of queries = ???
Participants will need to submit their results (i.e., MAP as well as Precision@10) for both the text and the OCR collections.