Talk Abstracts

  • Building Test Collections
    - Donna Harman, National Institute of Standards and Technology, USA.

    Test collections are one of the major tools for evaluation in information retrieval, starting in the 1960s with the Cranfield experiments. Over the years some of the methodology has changed, but the basic paradigm used in these experiments is still followed. This talk will discuss the philosophy behind the Cranfield paradigm, and, with a series of case studies, will show how it has been implemented in TREC and other evaluations. The talk will end with some general hints about building test collections.

  • User studies in IR: Toward user and task based evaluation
    - Kalervo Jarvelin, School of Information Sciences, University of Tampere, Finland

    The practical goal of information retrieval (IR) research is to create ways to support humans in their tasks and in their information interaction. To determine the effectiveness of IR techniques and systems in meeting this goal, evaluation is necessary. Evaluation of IR effectiveness is guided by a theory of the factors that affect effectiveness. The practical goal requires that, ultimately, the theory and factors cover users and their task performance. The talk discusses three kinds of theories that go beyond the traditional Cranfield paradigm: theories of searching, theories of information access, and theories of information interaction. All involve users. The talk closes by asking when user studies are (and are not) useful.

  • NTCIR-9 and Beyond
    - Hideo Joho, University of Tsukuba, Japan.

    NTCIR is a series of evaluation workshops designed to enhance research in Information Access (IA) technologies such as Information Retrieval, Question Answering, Text Summarisation, Opinion Analysis, etc. NTCIR-9, the latest cycle of NTCIR, had seven exciting evaluation tasks, of which six were new, under a new organising structure. The NTCIR-9 tasks successfully attracted one of the largest numbers of participants. In this talk, I present a brief history of NTCIR, give an overview of the evaluation tasks carried out in NTCIR-9, and finally discuss prospects for NTCIR-10.

  • MediaEval 2011 Evaluation Campaign
    - Gareth Jones, Dublin City University, Ireland.

    MediaEval is an international multimedia benchmarking initiative that offers innovative new content analysis, indexing and search tasks to the multimedia community. MediaEval focuses on social and human aspects of multimedia and strives to emphasize the ‘multi’ in multimedia, including the use of speech, audio, tags, users, and context, as well as visual content. MediaEval seeks to encourage novel and creative approaches to tackling these new and emerging multimedia tasks. Participation in MediaEval tasks is open to any research group wishing to sign up. MediaEval 2011 consisted of 6 tasks coordinated in cooperation with various research groups in Europe and elsewhere. The following tasks were offered in the 2011 season:
    • Placing Task: This task required participants to assign geographical coordinates (latitude and longitude) to each of a provided set of test videos. Participants could make use of metadata and audio and visual features as well as external resources.
    • Spoken Web Search Task: This task involved searching within audio content using an audio query. It addressed the challenge of search for multiple, resource-limited languages. The application domain was the Spoken Web being developed for low-literacy communities in the developing world.
    • Affect Task: This task required participants to deploy multimodal features to automatically detect portions of movies containing violent material. Violence is defined as "physical violence or accident resulting in human injury or pain". Any features automatically extracted from the video, including the subtitles, could be used by participants.
    • Social Event Detection Task: This task required participants to discover events and detect media items related to either a specific social event or an event class of interest. Social events of interest were planned by people, attended by people, and captured in social media content by people.
    • Genre Tagging Task: The task required participants to automatically assign tags to Internet videos using features derived from speech, audio, visual content or associated textual or social information. This year the task focused on labels that reflect the genre of the video.
    • Rich Speech Retrieval Task: The task went beyond conventional spoken content retrieval by requiring participants to deploy spoken content and its context in order to find jump-points in an audiovisual collection of Internet video for a given set of queries.
    This presentation will review the tasks and participation from MediaEval 2011 and overview plans for MediaEval 2012.

  • IR Evaluation in Context
    - Jaap Kamps, University of Amsterdam, The Netherlands.

    Standard evaluation benchmarks in the Cranfield/TREC paradigm allow us to study the generic retrieval effectiveness of systems by abstracting away from the specific document genre, use case and searcher stereotype. While this is of clear value for studying general hypotheses about IR system effectiveness, it also creates two problems. First, IR research is increasingly about improving a particular search engine, and by ignoring the unique content and user community the evaluation may be less informative about this exact application. How does this content differ, and how do the tasks and searchers differ? Can we tailor the evaluation to their unique characteristics? Second, there is a substantial impact of the individuals involved in building a test collection -- the "topic effect" or "user effect" is the largest source of variation, greater than the "system effect". Can we capture the context of the persons involved in test collection building? What are the consequences of moving toward "anonymous" and uncontrolled judges when we resort to crowdsourcing?

  • Utility estimation framework for query-performance prediction
    - Oren Kurland, Israel Institute of Technology, Israel.

    We present a novel framework for the query-performance prediction task: estimating the effectiveness of a search performed in response to a query in the absence of relevance judgments. The framework is based on estimating the utility that a given document ranking provides with respect to the information need expressed by the query. To address the uncertainty in inferring the information need, we estimate utility by the expected similarity between the given ranking and those induced by relevance models; the impact of each relevance model is based on its presumed representativeness of the information need. Specific query-performance predictors instantiated from the framework are shown to substantially outperform state-of-the-art predictors. In addition, we present an extension of the framework that results in a unified prediction model that can be used to derive and/or explain several previously proposed post-retrieval predictors which are presumably based on different principles.
    Joint work with Anna Shtok, David Carmel and Shay Hummel.
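    As a rough illustration of the expected-similarity idea sketched in the abstract (this is not the authors' actual estimator; the function names and the simple top-k overlap similarity are hypothetical simplifications), one can score a given ranking against rankings induced by several relevance models, weighting each model by its presumed representativeness of the information need:

```python
def overlap_at_k(ranking_a, ranking_b, k=10):
    """Similarity between two rankings: fraction of shared docs in the top k."""
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / k

def predicted_utility(ranking, rm_rankings, rm_weights, k=10):
    """Predict query performance as the expected similarity between
    `ranking` and the rankings induced by relevance models.

    rm_rankings : list of rankings, each induced by one relevance model
    rm_weights  : presumed representativeness of each model (sums to 1)
    """
    return sum(w * overlap_at_k(ranking, rm_rank, k)
               for rm_rank, w in zip(rm_rankings, rm_weights))

# Toy example: two relevance models, the first (heavier-weighted) one
# agreeing more with the ranking being evaluated.
ranking = ["d1", "d2", "d3", "d4"]
rms = [["d1", "d2", "d9", "d8"], ["d7", "d6", "d5", "d4"]]
score = predicted_utility(ranking, rms, [0.7, 0.3], k=4)
```

    A higher score suggests the ranking agrees with the plausible interpretations of the query, and hence that the search was more effective.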

  • The Recall Problem in Cross-Language Information Retrieval
    - Doug Oard, University of Maryland, College Park, USA

    All information retrieval effectiveness measures combine some characterization of specificity (e.g., precision) with some characterization of exhaustiveness (e.g., recall). In many cases, such as Web search, specificity appropriately receives emphasis. But there are also cases in which exhaustiveness deserves primacy, such as preparation of comprehensive reviews in medicine or the discovery of digital evidence in civil litigation (“e-discovery”). In such cases, two problems arise, neither of which is yet well characterized. First, recall-oriented sampling and estimation are understudied. In particular, recent experience in evaluation of classification effectiveness for email and other types of business records has demonstrated significant gaps in our present ability to optimize sampling and estimation strategies in the presence of reasonably expected rates of annotator error. Second, when the query and the documents are expressed in different languages, gaps in translation lexicons may yield as-yet-uncharacterized deficiencies in the exhaustiveness of Cross-Language Information Retrieval (CLIR) searches. As a result we have the worst possible situation: potentially important deficiencies in system behavior may be masked by as-yet-uncharacterized deficiencies in our evaluation measures. In this talk I will review our progress to date on these two challenges, describe two new projects that seek to advance our understanding of them, and suggest ways in which we might adapt CLIR evaluation venues such as FIRE to begin to add some focus on the evaluation of recall.
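    To make the first challenge concrete (this sketch and its numbers are illustrative, not from the talk), a common recall-oriented estimation strategy judges a random sample of the unretrieved documents and scales the observed relevance rate up to the full set:

```python
import random

def estimate_recall(retrieved_relevant, unretrieved, sample_size, judge, seed=0):
    """Estimate recall by judging a random sample of unretrieved documents.

    retrieved_relevant : count of relevant documents the system retrieved
    unretrieved        : doc ids the system did not retrieve
    judge              : doc_id -> bool relevance judgment; when judgments
                         are noisy, the hit rate (and hence the estimate)
                         is biased -- the gap the talk highlights
    """
    rng = random.Random(seed)
    sample = rng.sample(unretrieved, sample_size)
    hit_rate = sum(judge(d) for d in sample) / sample_size
    est_missed = hit_rate * len(unretrieved)  # scale the sample up
    return retrieved_relevant / (retrieved_relevant + est_missed)

# Toy example: 80 relevant docs retrieved; 400 unretrieved docs, of
# which the first 20 ids happen to be relevant.
recall = estimate_recall(80, list(range(400)), 100, lambda d: d < 20)
```

    The open question the abstract raises is how to choose sampling and estimation strategies so that such estimates remain trustworthy under realistic annotator error rates.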

  • The Importance of Evaluation in MLIR System Development
    - Carol Peters, ISTI-CNR, Italy.

    The aim of any Information Retrieval (IR) system is to provide users with easy access to, and interaction with, information that is relevant to their needs and to enable them to effectively use this information; the aim of a Multilingual Information Retrieval (MLIR) system is to do this in a multilingual and/or cross-language context, which implies additional complexities. The talk will provide broad coverage of the issues involved in designing and developing MLIR systems and the challenges they imply. It will stress the important role played by evaluation campaigns in stimulating the development of systems that give better results, not only in terms of performance in retrieving relevant information but also in satisfying the expectations of the user. The example cited will be that of the Cross-Language Evaluation Forum; some of the lessons that have been learnt as a result of the CLEF experience will be presented.

  • Why Recall Matters
    - Stephen Robertson, Microsoft Research, Cambridge, UK.

    The success of web search over the last decade-and-a-half has focussed attention on high-precision search, where the relevance of the first few ranked items matters a lot and what happens way down the ranking matters not at all. The conventional view is that, having traded high recall for high precision, we can more-or-less forget about recall. This is reinforced by the notion that what we cannot see, we need not care about – that the value of a search to the user resides in the results that they look at, not those that they don’t. However, there are several reasons why we should reject such a view. One is that there are some kinds of searches (legal discovery, patent search, maybe medical search) where high recall clearly is directly important. Another is that in the context of search environments other than the English-language web (enterprise, desktop and minority languages for example), we cannot rely on the huge variety of things that exist on the web to get us out of difficulty. But a more fundamental reason is that the tradeoff between precision and recall is not an opposition: it’s a mutual benefit situation. In addition to the practical reasons, I will present the tradeoff argument via a long-standing but relatively unfamiliar way of thinking about the traditional recall-precision curve.
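    For readers unfamiliar with the traditional recall-precision curve the talk builds on, its points can be computed from a single ranked list as follows (a standard, illustrative computation, not code from the talk):

```python
def recall_precision_points(ranking, relevant):
    """(recall, precision) at each rank cutoff of one ranked list."""
    relevant = set(relevant)
    points, hits = [], 0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / i))
    return points

# Toy example: two relevant docs, ranked 1st and 3rd.
pts = recall_precision_points(["r1", "n1", "r2"], {"r1", "r2"})
```

    Reading the curve from left to right shows how precision evolves as deeper cutoffs buy more recall, which is the tradeoff the talk revisits.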

  • Spoken Web Search
    - Nitendra Rajput

    For the developing world, limited internet reach and low literacy act as barriers to accessing and creating content on the World Wide Web. An alternative, therefore, is to use cell phones as the device and speech as the interaction modality. This is the basic premise of the Spoken Web, which uses the concept of VoiceSites that can be created by any user, in their own language, by speaking to the system over the phone. However, this leads to an interesting research challenge: finding relevant content in a growing quantity of audio content. We challenge ourselves to enable searching of user-generated audio content through an audio-only query-result interface. The motivation for performing such a search, and the challenges in terms of data, interface and users, will be presented in this talk. The hope is that the audience will be able to identify sub-problems in this large space of Spoken Web search.