Ad-hoc retrieval | Mailing lists & forums | Wikipedia-entity retrieval



Ad-hoc Retrieval:



The FIRE adhoc task is similar to the TREC adhoc task. Its objective is to evaluate the effectiveness of retrieval systems at returning accurate and complete ranked lists of documents in response to 50 one-time information needs. The FIRE 2010 adhoc task focuses specifically on South Asian languages. This is the second year of the FIRE Adhoc task. New participants are welcome!

The adhoc task has several sub-tasks:

  1. Mono-lingual retrieval in each of the following languages:
    • Bengali
    • Hindi
    • Marathi
  2. Cross-lingual retrieval:
    • queries in Bengali / English / Hindi / Marathi / Tamil / Telugu / Gujarati
    • documents in Bengali / English / Hindi / Marathi

The Bengali and Hindi topics will also be transliterated and made available in Roman script. All participants are encouraged to submit runs using these queries as well.

Administrivia:

All participants should periodically check this website (http://www.isical.ac.in/~fire) for announcements, updates, etc.

All participants should also be subscribed to the track mailing list to participate in any track discussion and to be informed of any late-breaking announcements. Contact fire-list (at) isical.ac.in to be added to the list.

Submissions:

Each submission file should contain 1000 documents per topic, ranked 0-999, in the usual TREC / CLEF submission format, i.e. each line in the file should have the following fields:

<Query id> Q0 <DOCNO> <RANK> <SIMILARITY> <Run-ID>

Participants will need to submit a gzipped file containing retrieval results in the above format.
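
For concreteness, here is a minimal Python sketch of writing a run in this format as a gzipped file; the "results" structure, run id and file name are placeholders, not part of the track specification.

# Minimal sketch: writing a run file in the TREC / CLEF format described above.
# The "results" structure, run id and file name are placeholders.
import gzip

def write_run(results, run_id, path):
    """results: dict mapping query id -> list of (docno, similarity),
    already sorted by decreasing similarity."""
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for qid, ranked in results.items():
            for rank, (docno, sim) in enumerate(ranked[:1000]):  # 1000 documents, ranked 0-999
                out.write(f"{qid} Q0 {docno} {rank} {sim:.4f} {run_id}\n")

# Example usage with made-up data:
# write_run({"26": [("doc-0001", 12.7)]}, "MyRun1", "MyRun1.txt.gz")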

All participants are required to submit at least one run that uses only the title and description fields (no narrative) of the topics. There is no upper limit on the number of submitted runs. However, please assign a priority to each of your submissions. Runs will be included in the pooling process in order of priority.



Retrieval and classification from mailing lists and forums:



About the track:

The mailing lists and forums we are considering typically consist of message threads, most of which are started by somebody seeking a solution to a technical problem (s)he faced. Other members seek clarifications / more details about the problem, or reply with proposed solutions. The initial poster may explain the problem in more detail in subsequent messages, if required. The other members may help the poster to eventually reach a solution; or the problem may remain unsolved in that thread. Sometimes, the poster may be referred to an earlier thread, where the solution can be found. Occasionally, the discussion digresses into other topics as well.

These aspects of the data from a mailing list or discussion forum make retrieval of the solution (i.e. finding a message, or a set of messages containing a legitimate solution) fairly complex. The objective of this track is to evaluate the effectiveness of retrieval and classification systems on this type of data. This is a pilot track starting this year and we eagerly look forward to your participation.

About the data:

The text collection is drawn from web discussion forums and mailing lists. The forum sub-collections are organized into files, each of which corresponds to a complete discussion thread. In contrast, each file in the mailing list sub-collections corresponds to an individual email message. The thread structure for these sub-collections may be reconstructed using the message ids in the "In-reply-to" and "References" fields of the messages. As usual, each file in the collection has a unique <DOCNO> field.
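
As a rough illustration, the sketch below groups mailing-list messages into threads by following their "In-Reply-To" / "References" headers; it assumes each file is a raw RFC 2822 message, which may differ from the exact markup of the released collection.

# Rough sketch: grouping mailing-list messages into threads by following the
# "In-Reply-To" / "References" headers. It assumes each file is a raw RFC 2822
# message; the actual sub-collections may use a different markup.
import glob
from email import message_from_binary_file

def build_threads(pattern):
    parent = {}                              # message id -> id of the message it replies to
    for path in glob.glob(pattern):
        with open(path, "rb") as f:
            msg = message_from_binary_file(f)
        msg_id = (msg.get("Message-ID") or "").strip()
        if not msg_id:
            continue
        refs = (msg.get("In-Reply-To") or msg.get("References") or "").split()
        parent[msg_id] = refs[-1] if refs else None   # last reference = direct parent

    def root(mid):
        hops = 0
        while parent.get(mid) and hops < 100:         # guard against malformed chains
            mid, hops = parent[mid], hops + 1
        return mid

    threads = {}
    for mid in parent:
        threads.setdefault(root(mid), []).append(mid)
    return threads                           # thread root id -> member message ids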

Task 1: Ad-hoc retrieval from mailing lists and forums:

The topics of this task will be in the standard TREC / CLEF format, and will describe a technical problem. The retrieval task will be to find a working solution to the problem from the corpus.

Since the complete solution to a problem may be spread across various messages in a thread, the unit of retrieval for this task will be a complete thread. A thread will be regarded as relevant if it contains a "correct" or working solution for the problem mentioned in the topic. To make the relevance assessment process easy and consistent, we will assume that a thread contains a correct solution only if a poster confirms in a follow-up message that the proposed solution indeed worked. The task for a retrieval system will thus be to find such threads containing a solution to the problem posed in the topic.

Note that a single file in the forum sub-collections corresponds to a complete thread, but in the mailing-list sub-collections, a thread is spread across multiple files. For the sake of uniformity, we will shortly be providing a table that maps the <DOCNO> of each such file (email message) to the corresponding THREAD-ID.

Task 2: Classification of messages in mailing lists and forums:

Most of the posts of a mailing list or a forum belong to one (in some cases more than one) of the following categories (MSG-CLASS):

  1. ASK_QUESTION: Asking a question, e.g. somebody posts a problem. This is usually, but not always, the first post of a thread.
  2. DITTO: Repeating a question, e.g. "Yes, I also have the same (or a very similar) problem".
  3. ASK_CLARIFICATION: Asking for more details about the problem, e.g. "Can you please provide more details? What kind of error message are you getting?"
  4. FURTHER_DETAILS: The person who is facing a problem provides more detailed information about it, possibly after somebody asks for more details.
  5. SUGGEST_SOLUTION: Suggesting a solution.
  6. SOLUTION_FEEDBACK_NEG: Somebody tries a suggested solution and says that it did not work for him or her.
  7. SOLUTION_FEEDBACK_POS: Somebody tries a suggested solution that works, and (s)he confirms that it works. Sometimes this may be the legitimate end of a thread.

Note that the same post may belong to more than one of the above categories. For example, when somebody repeats a question, (s)he may also provide more details. The goal of this task is to classify a set of given messages (identified by MSG-ID) into one or more of the above categories.

We will also provide a list of pre-classified messages as training data. Participants may (in fact, they are encouraged to) build a larger training set on their own. Naturally, they must exclude the messages that are in the test set from their training data.
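
As a rough illustration only (the track does not prescribe any method), a TF-IDF representation with a one-vs-rest linear classifier is one simple way to handle messages that belong to several classes; all names, parameters and the threshold below are placeholders.

# Illustrative baseline only (not prescribed by the track): TF-IDF features
# with a one-vs-rest linear classifier, which handles multi-label messages.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

CLASSES = ["ASK_QUESTION", "DITTO", "ASK_CLARIFICATION", "FURTHER_DETAILS",
           "SUGGEST_SOLUTION", "SOLUTION_FEEDBACK_NEG", "SOLUTION_FEEDBACK_POS"]

def train(train_texts, train_labels):
    """train_labels: list of label sets, e.g. [{"DITTO", "FURTHER_DETAILS"}, ...]"""
    binarizer = MultiLabelBinarizer(classes=CLASSES)
    y = binarizer.fit_transform(train_labels)
    vectorizer = TfidfVectorizer(min_df=2)
    X = vectorizer.fit_transform(train_texts)
    model = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
    return vectorizer, binarizer, model

def predict(vectorizer, binarizer, model, texts, threshold=0.5):
    """Return, for each message, the classes whose probability exceeds the threshold."""
    probs = model.predict_proba(vectorizer.transform(texts))
    return [[(c, p) for c, p in zip(binarizer.classes_, row) if p >= threshold]
            for row in probs]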

Submission guidelines:

Task 1: Ad-hoc retrieval from mailing lists and forums:

Each submission file should contain 1000 threads per topic, ranked 0-999, in the usual TREC / CLEF submission format, i.e. each line in the file should have the following fields:

<Query id> Q0 <THREAD-ID> <RANK> <SIMILARITY> <Run-ID>

The THREAD-ID for a document from the forum sub-collections will be the same as its DOCNO. For a document from the mailing-list sub-collections, the THREAD-ID will be obtained as described above.
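
Assuming the forthcoming table is a simple two-column (DOCNO, THREAD-ID) text file, one possible way to convert message-level results into a thread-level run is sketched below; the file format is an assumption until the table is released.

# Sketch of converting message-level results into a thread-level run using the
# forthcoming DOCNO -> THREAD-ID table. The two-column, whitespace-separated
# file format assumed here is a guess; the actual format has not been announced.
def load_thread_map(path):
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            docno, thread_id = line.split()
            mapping[docno] = thread_id
    return mapping

def to_thread_run(ranked_docnos, thread_map):
    """Keep each thread at the rank of its best-scoring message."""
    seen, threads = set(), []
    for docno in ranked_docnos:              # assumed sorted by decreasing score
        tid = thread_map.get(docno, docno)   # forum files: THREAD-ID == DOCNO
        if tid not in seen:
            seen.add(tid)
            threads.append(tid)
    return threads[:1000]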

Participants will need to submit a gzipped file for each run. There is no upper limit on the number of submitted runs. However, please assign a priority to each of your submissions. Runs will be included in the pooling process in order of priority.

All participants are required to submit at least one run that uses only the title and description fields (no narrative) of the topic.

Task 2: Classification of messages in mailing lists and forums:

Each line of a submission file should contain the following fields:

<MSG-ID> <MSG-CLASS> <CONFIDENCE-SCORE (optional)>

The CONFIDENCE-SCORE field is optional; if present, it must be a value in the range 0 to 1. If a confidence score is not provided, a default value of 1 will be assumed. If a message is classified into multiple classes, there will be one row per class for the same MSG-ID.
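
A minimal Python sketch of writing a submission file in this format is given below; the "predictions" structure and the file name are placeholders.

# Minimal sketch of writing a Task 2 submission file; the "predictions"
# structure and file name are placeholders.
def write_classification_run(predictions, path):
    """predictions: dict mapping MSG-ID -> list of (MSG-CLASS, confidence)."""
    with open(path, "w", encoding="utf-8") as out:
        for msg_id, labels in predictions.items():
            for msg_class, confidence in labels:      # one row per assigned class
                out.write(f"{msg_id} {msg_class} {confidence:.3f}\n")

# Example usage with made-up data:
# write_classification_run({"MSG-1234": [("DITTO", 0.8), ("FURTHER_DETAILS", 0.6)]},
#                          "task2_run1.txt")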



Ad-hoc Wikipedia-entity retrieval from news documents (WikEND):



Overview:

An interesting adhoc entity retrieval task involves identifying a set of entities from Wikipedia that are relevant to a given document, which we call the query document. Applications include linking specific entities in a news or scientific article to their corresponding Wikipedia page(s). The task for the WikEND track may be defined as follows: given a query document, find entities from Wikipedia that are related to it.

Note that, as opposed to the classical entity retrieval task, the context available from the query document determines the relevance of the retrieved entities.

An entity is represented by a Wikipedia article. Category pages, help pages, discussion pages and author information pages are not considered as entity pages. They can be used by the participants for disambiguating entity pages.
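
A simple heuristic for skipping such pages, assuming they can be recognized by the namespace prefix in their titles, is sketched below; the prefix list is illustrative and not part of the track definition.

# Simple heuristic (not mandated by the track) for skipping pages that are not
# entity pages, based on the namespace prefix of the page title. The prefix
# list is illustrative and may need to be extended.
NON_ENTITY_PREFIXES = ("Category:", "Help:", "Talk:", "User:",
                       "Wikipedia:", "File:", "Template:", "Portal:")

def is_entity_page(title):
    return not title.startswith(NON_ENTITY_PREFIXES)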

Example:

A few sample query documents from Yahoo News and lists of Wikipedia entities relevant to them are provided below.

Query-1 Query-2 Query-3

Dataset:

The track uses a Wikipedia dump dated 09/14/2009. The bz2 file is 5.2 GB and expands to about 24 GB when uncompressed. A set of Yahoo News articles in plain text format will be used as query documents.

Submissions:

Each submission file should contain at most 100 results per topic, ranked 0-99, in the usual TREC / CLEF submission format, i.e. each line in the file should have the following fields:

<Query id> Q0 <WIKI-PAGE-ID> <RANK> <SCORE (optional)> <Run-ID>

The <WIKI-PAGE-ID> is the id present in the <id> tag of each page in the Wikipedia dump. Each submitted WIKI-PAGE-ID must be the id of a page that does NOT redirect to another page. If a page that redirects to another page is selected as a candidate result, the id of the page it redirects to should be submitted instead. The redirect information is present in the <text> tag in the Wikipedia dump.
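
As a rough illustration (not part of the track guidelines), the following sketch streams the bz2 dump, records the id of each non-redirect page, and builds a redirect map so that candidate titles can be resolved to ids of non-redirect pages. The element handling is simplified and may need adjusting for the actual dump format.

# Rough sketch: stream the bz2 Wikipedia dump, build a title -> page id map for
# non-redirect pages and a redirect map, and resolve candidate titles to the
# ids of non-redirect pages. Element handling is simplified.
import bz2
import re
import xml.etree.ElementTree as ET

REDIRECT_RE = re.compile(r"#REDIRECT\s*\[\[([^\]|#]+)", re.IGNORECASE)

def localname(tag):
    return tag.rsplit("}", 1)[-1]            # drop the XML namespace, if any

def build_maps(dump_path):
    title_to_id, redirects = {}, {}
    with bz2.open(dump_path, "rb") as f:
        for _, elem in ET.iterparse(f):
            if localname(elem.tag) != "page":
                continue
            title = page_id = None
            text = ""
            for child in elem.iter():
                name = localname(child.tag)
                if name == "title" and title is None:
                    title = child.text
                elif name == "id" and page_id is None:
                    page_id = child.text     # the first <id> in a page is the page id
                elif name == "text":
                    text = child.text or ""
            m = REDIRECT_RE.match(text.lstrip())
            if m:
                redirects[title] = m.group(1).strip()
            else:
                title_to_id[title] = page_id
            elem.clear()                     # keep memory bounded
    return title_to_id, redirects

def resolve(title, title_to_id, redirects):
    """Follow redirects until a non-redirect page (or a dead end) is reached."""
    seen = set()
    while title in redirects and title not in seen:
        seen.add(title)
        title = redirects[title]
    return title_to_id.get(title)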

The SCORE field will not be considered in the evaluation this year and may be given the default value 0.

Participants will need to submit a gzipped file for each run. There is no upper limit on the number of submitted runs. However, please assign a priority to each of your submissions. Runs will be included in the pooling process in order of priority.
