SMS-Based FAQ Retrieval

FIRE 2011 Task


Joint Task Coordinators: COER and IBM Research

Working with UTF-8

The FIRE 2011 SMS Task has data in three languages:

  • English
  • Hindi
  • Malayalam

The data has been released in UTF-8 format, and we have received many queries on how to work with the non-English data sets. This page describes how UTF-8 data can be used in Java.

Using a BufferedReader, one can easily read UTF-8 encoded text files, as shown in the code snippet below:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;

// Create a File object pointing to the UTF-8 file you would like to read.
File f = new File("/home/UTF8file.txt");

// "UTF8" is a charset name that Java accepts as an alias for "UTF-8". Passing it to the
// InputStreamReader decodes the bytes as UTF-8 instead of the platform default encoding.
BufferedReader b1 = new BufferedReader(new InputStreamReader(new FileInputStream(f), "UTF8"));

String line;
while ((line = b1.readLine()) != null) {
    System.out.println(line); // Display the content of the file.
}
b1.close(); // Close the reader.
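
On Java 7 or later, the same read can be sketched with try-with-resources, so that the reader is closed even if an exception is thrown. This is only one possible variant, not part of the task requirements. The class name Utf8LineReader and the method readLines are hypothetical names chosen for this example, and the file path is taken from the command line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class Utf8LineReader {

    // Reads every line of a UTF-8 encoded text file into a list.
    // (Hypothetical helper name, used only for this illustration.)
    public static List<String> readLines(Path path) throws IOException {
        List<String> lines = new ArrayList<>();
        // try-with-resources closes the reader automatically.
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java Utf8LineReader <path-to-utf8-file>");
            return;
        }
        for (String line : readLines(Paths.get(args[0]))) {
            System.out.println(line); // Display the content of the file.
        }
    }
}
```

Note that Files.newBufferedReader throws an exception on byte sequences that are not valid UTF-8, whereas InputStreamReader silently substitutes a replacement character. Whether the printed Hindi or Malayalam text displays correctly also depends on the encoding of the console.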


Levenshtein distance calculation, character comparison, string comparison, tokenization, etc. can all be performed on strings read from UTF-8 files. However, case-conversion methods such as toLowerCase() and toUpperCase() have no effect on Hindi and Malayalam strings, because the Devanagari and Malayalam scripts have no upper or lower case.
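
As a rough illustration of the paragraph above, the sketch below computes an edit distance between two Hindi words and checks what toLowerCase() does to one of them. It is a minimal example, not a reference implementation for the task. The class name UnicodeStringOps and the method name levenshtein are hypothetical, the strings are written as Unicode escapes so that the source compiles regardless of the compiler's encoding setting, and the distance is computed over Unicode code points (an assumption of this sketch; comparing Java char values gives the same result for these scripts).

```java
import java.util.Locale;

public class UnicodeStringOps {

    // Dynamic-programming Levenshtein distance over Unicode code points,
    // keeping only two rows of the table in memory.
    public static int levenshtein(String a, String b) {
        int[] s = a.codePoints().toArray();
        int[] t = b.codePoints().toArray();
        int[] prev = new int[t.length + 1];
        int[] curr = new int[t.length + 1];
        for (int j = 0; j <= t.length; j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= s.length; i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= t.length; j++) {
                int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[t.length];
    }

    public static void main(String[] args) {
        String kamal = "\u0915\u092e\u0932";        // Hindi word "kamal"
        String kamala = "\u0915\u092e\u0932\u093e"; // Hindi word "kamala" (one extra vowel sign)

        System.out.println(levenshtein(kamal, kamala)); // prints 1

        // Devanagari has no letter case, so the string comes back unchanged.
        System.out.println(kamal.toLowerCase(Locale.ROOT).equals(kamal)); // prints true
    }
}
```

Because a vowel sign such as the one in the second word is a separate code point, it counts as its own edit. Whether that matches the intended notion of a "character" for SMS matching is a design choice left to the participant.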


Some useful Indian language resources have been provided on the FIRE Resources Page.