Task Description


Morphology is the study of words and their grammatical structure. Morphemes are the smallest meaningful units of a language. The goal of morphological analysis is to understand the structure of a language and to use that knowledge in machine learning, natural language processing and information retrieval tasks, in particular to improve retrieval.

Indian languages are morphologically rich: a single word can appear in a large number of different forms. These inflected forms pose an important challenge for information retrieval in Indian languages, so morpheme analysis becomes an important step for information retrieval in these languages.

The objective of the Morpheme Extraction Task (MET) is to design algorithms and methods that discover morphemes in Indian languages.

Task 1: Language Independent Tasks (Unsupervised Stemmer)

Unsupervised stemmers are statistical or algorithmic and do not need any language-specific information. Develop a stemmer that works across different languages without relying on language-specific rules.
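
As an illustration of what "statistical and language independent" can mean in practice, the following is a minimal sketch of one possible approach, not a prescribed method. It treats an ending as a candidate suffix when removing it leaves another word from the same list, and keeps the endings that occur often enough. The function names and the thresholds (max_suffix_len, min_stem_len, min_count) are assumptions made for this example. Note also that Python slices strings by code point, so for Indic scripts a suffix boundary may fall inside a consonant-plus-vowel-sign cluster; a real system would probably need to handle that.

```python
from collections import Counter


def learn_suffixes(words, max_suffix_len=3, min_stem_len=3, min_count=2):
    """Collect endings whose removal leaves another word from the list.

    All thresholds are illustrative assumptions, not task requirements.
    """
    vocab = set(words)
    counts = Counter()
    for word in vocab:
        for k in range(1, max_suffix_len + 1):
            candidate_stem = word[:-k]
            # Count the ending only if the remaining stem is itself a known word.
            if len(candidate_stem) >= min_stem_len and candidate_stem in vocab:
                counts[word[-k:]] += 1
    # Keep endings seen at least min_count times.
    return {suffix for suffix, n in counts.items() if n >= min_count}


def stem(word, suffixes, min_stem_len=3):
    """Strip the longest learned suffix that leaves a long enough stem."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    toy_words = ["cat", "cats", "dog", "dogs", "walk", "walked", "talk", "talked", "bus"]
    learned = learn_suffixes(toy_words)
    for w in ["cats", "jumped", "bus"]:
        print(w, stem(w, learned), sep="\t")
```

The toy word list is English only to keep the example readable; nothing in the code depends on the language, which is the point of Task 1.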


Task 2: Language Dependent Tasks (Supervised System)

Supervised systems depend on known grammatical rules of a language. Develop a morphological analyzer, lemmatizer or stemmer that can discover morphemes in any of the specified languages: Bengali, Tamil, Hindi and Gujarati.



System Output Format:


Participants are expected to develop systems that take a large list of words as an input file and produce an output file containing every word along with its stem, root or morpheme analysis. The test data will contain a list of all unique words from the FIRE corpora for each language, which can be used as the input file when testing the systems. The output file should adhere to the following guidelines:

  1. Output format for Stemmer and Lemmatizer:
    • Print the first word. Print tab.
    • Print the stem/root of the word.
    • Print the next word on the next line. Repeat the process for each word in the list.
    • Word [tab] stem/root
    • Example:
      cats "\t" cat
  2. Output format for Morphological Analyzer:
    • Print the first word. Print Tab.
    • Print the first morpheme of the word. Print space.
    • Print the grammatical part of the first morpheme with a '+' sign before it. There is no standard set of rules for specifying the grammatical part. The actual terms will not be compared; instead, pairs of terms having the same grammatical structure will be compared in the language-dependent evaluation, so you can decide your own set of rules for specifying the grammatical structure. Print space.
    • Print the second morpheme. Print space.
    • Print the grammatical part of the second morpheme with a '+' sign before it. Print space.
    • Continue similarly for all morphemes that make up the word. After the first word is done, print the next word on the next line and continue similarly for all words in the input file.
    • Word [tab] morpheme1 [space] +grammatical structure [space] morpheme2 [space] +grammatical structure [space] ...
    • Example:
      titiliya "\t" titili " " +PL
      (titili means butterfly in Hindi; its plural form, titiliya, can be shown as +PL)
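
The two output formats above can be sketched as small formatting helpers. This is only an illustration: the function names, the (morpheme, tag) pair representation and the choice of UTF-8 for the output file are assumptions, since the guidelines fix only the tab, space and '+' layout.

```python
def format_stem_line(word, stem):
    """Stemmer/lemmatizer format: word [tab] stem/root."""
    return f"{word}\t{stem}"


def format_analysis_line(word, morphemes):
    """Morphological analyzer format.

    morphemes is a list of (morpheme, tag) pairs, with the tag given
    without its leading '+'. The tag names themselves are free-form.
    """
    parts = []
    for morpheme, tag in morphemes:
        parts.append(morpheme)
        parts.append(f"+{tag}")
    return f"{word}\t{' '.join(parts)}"


def write_output(path, lines):
    """Write one formatted entry per line (UTF-8 is an assumption)."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")


if __name__ == "__main__":
    print(format_stem_line("cats", "cat"))
    print(format_analysis_line("titiliya", [("titili", "PL")]))
```

Keeping the formatting separate from the analysis logic makes it easy to check the produced lines against the examples above before running on the full word list.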

Evaluation Technique


The output of the proposed systems will be compared against the Gold Standard data (which contains manually generated stems/roots/morpheme analysis). For the language-independent task, this comparison will be repeated over a couple of languages and the average will be treated as the final score. Test data will comprise 3,000 surface words in each language. The results will be evaluated manually.
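
Since the official evaluation is manual and no metric is specified here, participants may still want a rough self-check against the training gold standard. The sketch below assumes exact-match accuracy per language, averaged across languages for the language-independent task; the metric and all function names are assumptions for illustration and may differ from how the coordinators score submissions.

```python
def read_pairs(lines):
    """Parse 'word [tab] analysis' lines into a dict keyed by word."""
    pairs = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        word, _, analysis = line.partition("\t")
        pairs[word] = analysis.strip()
    return pairs


def accuracy(system, gold):
    """Fraction of gold words whose system output matches exactly."""
    if not gold:
        return 0.0
    correct = sum(1 for word, expected in gold.items() if system.get(word) == expected)
    return correct / len(gold)


def average_score(per_language_scores):
    """Mean of the per-language scores (language-independent task)."""
    return sum(per_language_scores) / len(per_language_scores)


if __name__ == "__main__":
    gold = read_pairs(["cats\tcat\n", "dogs\tdog\n"])
    system = read_pairs(["cats\tcat\n", "dogs\tdogs\n"])
    print(accuracy(system, gold))
```

Exact match is a strict criterion: for the morphological analyzer output, where tag names are free-form, it would only be meaningful if the system uses the same tag names as the gold standard.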


Submission Format


Participants are required to submit a zip folder named "InstituteName_FIRE-MET_[Task 1/2]_[Language]_2014". The [Language] part is optional and can be omitted for the language-independent task. The folder should contain the system (executable code or software) and a readme file containing the following:

  • Full name of the institute or research group
  • Full names and email IDs of team members
  • The specifics about the task (1 or 2), language and category (stemmer, lemmatizer or morphological analyzer)
  • A step-by-step procedure required to run the system
  • Any other parameters required for running the system


Training Data

The training data, consisting of the term list and the gold standard for each language, can be downloaded from the following link:

MET 2014 Training Data

The data is password protected; registered participants will be provided with the password.
(Gold standard for Gujarati will be sent soon)

Registration


Send an email to metfire2014@gmail.com with the details of the team members, contact details and your chosen task.

Task Coordinators


Mrs. Nilotpala Gandhi, Gujarat University

Prashasti Kapadia, DA-IICT, Gandhinagar, Gujarat
Kanika Mehta, DA-IICT, Gandhinagar, Gujarat
Adda Roshni, DA-IICT, Gandhinagar, Gujarat
Vaibhavi Sonavane, DA-IICT, Gandhinagar, Gujarat

For any queries, contact metfire2014@gmail.com.