Abstracts


Searching (almost) all the world’s words
Douglas W. Oard, University of Maryland, USA

Modern search technology evolved as a way of shifting the locus of control over information dissemination closer to the consumers of that content. However, most of the world’s words aren’t actually created with dissemination in mind. In this talk, I will invite us all to give some thought to the far-reaching consequences of that simple fact. In particular, I will argue that almost all the world’s words are found in what we might call “conversational content,” and moreover that searching conversational content calls for fundamentally rethinking the way we approach the task. I’ll focus my remarks on some implications for indexing, query processing, and result presentation. Although I will give examples of some early work on these issues, my real goal will be to stimulate our thinking about the consequences of these new challenges for evaluation design.

About the Speaker: Douglas Oard is a Professor at the University of Maryland, College Park, with joint appointments in the College of Information Studies and the Institute for Advanced Computer Studies. He is a General Co-Chair for NTCIR-10, and he has served as a track or task coordinator at TREC, CLEF, and FIRE. Additional information is available at http://terpconnect.umd.edu/~oard/.




Evaluation infrastructures for IR experimentation
Nicola Ferro, University of Padova, Italy

This talk will discuss the need for proper infrastructures to manage the experimental evaluation of IR systems. It will then present the most recent achievements of the PROMISE network of excellence in this respect: a modular, flexible, and extensible evaluation infrastructure for information access evaluation, enhanced by powerful annotation, visualization, and impact analysis functionalities.




Figurative language processing in social media: humor recognition and irony detection
Paolo Rosso, Universitat Politècnica de València, Spain

This talk aims at showing how two specific domains of figurative language, humor and irony, can be handled automatically by considering linguistic patterns. The focus is on discussing how underlying knowledge, drawn from both shallow and deep linguistic layers, can provide relevant information for automatically identifying figurative uses of language. Automatically detecting figurative language requires considering nearly every aspect of language, from pronunciation to lexical choice, syntactic structure, semantics, and conceptualization. It would therefore be unrealistic to look for a computational silver bullet for figurative language processing, because a general solution cannot be found. Rather, we try to identify specific aspects and forms of figurative language that are amenable to computational analysis, in an attempt to move gradually toward a broader solution. The objective is to detect textual patterns that can be applied to their automatic identification. The role of linguistic devices such as ambiguity and incongruity, and of meta-linguistic devices such as emotional scenarios and polarity, is also described. Special emphasis is given to the use of humor and irony in social media: Twitter, Amazon, TripAdvisor, etc.
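As a toy illustration of the kind of shallow, pattern-based signals mentioned above (a minimal sketch only, not the speaker's actual models), the Python snippet below computes a few surface features of a tweet, such as polarity contrast, emphatic punctuation, and irony-related hashtags; the tiny polarity lexicon is a hypothetical stand-in for a real one.

# Minimal, illustrative sketch of shallow features for irony/humor detection.
# The word lists are hypothetical placeholders for a real polarity lexicon.
import re

POSITIVE = {"love", "great", "wonderful", "perfect", "awesome"}
NEGATIVE = {"hate", "terrible", "awful", "boring", "broken"}

def shallow_features(tweet: str) -> dict:
    tokens = re.findall(r"[a-z']+", tweet.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return {
        # positive and negative words co-occurring is a crude incongruity cue
        "polarity_contrast": int(pos > 0 and neg > 0),
        # "!!!"-style emphasis and ellipses often accompany ironic statements
        "exclamation_runs": len(re.findall(r"!{2,}", tweet)),
        "ellipsis": int("..." in tweet),
        "uppercase_ratio": sum(c.isupper() for c in tweet) / max(len(tweet), 1),
        # self-labelled irony/sarcasm hashtags, common in Twitter corpora
        "irony_hashtag": int("#irony" in tweet.lower() or "#sarcasm" in tweet.lower()),
    }

if __name__ == "__main__":
    print(shallow_features("I just LOVE waiting 2 hours for a broken bus... #sarcasm!!!"))

Features of this kind would normally feed a standard classifier trained on labelled tweets; deeper linguistic layers (syntax, semantics, conceptual knowledge) are what the talk discusses beyond such surface cues.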

About the Speaker: Paolo Rosso received his Ph.D. degree in Computer Science from Trinity College Dublin, Ireland. He is currently an Associate Professor at Universitat Politècnica de València, Spain, where he leads the Natural Language Engineering Lab: http://www.dsic.upv.es/grupos/nle/. His research interests are mainly focused on plagiarism detection, irony detection in social media, and analysis of short texts. He has been collaborating in the organization of the PAN activities on Uncovering Plagiarism, Authorship, and Social Software Misuse at CLEF (plagiarism detection task since 2009; author profiling in social media from 2013) and at FIRE: CL!TR in 2011 (Cross-Language !ndian Text Reuse) and CL!NSS in 2012 (Cross-Language !ndian News Story Search).




Visual analytics and Information Retrieval
Giuseppe Santucci, Sapienza University of Rome, Italy

Visual Analytics is an emerging multi-disciplinary area that combines both ad-hoc and classical Data Mining algorithms with Information Visualization techniques, joining the strengths of human and electronic data processing. Visualization becomes the medium of a semi-automated analytical process, in which human beings and machines cooperate, each using their distinct capabilities, to obtain the most effective results. This talk will introduce the Visual Analytics research field, starting from a historical perspective and presenting its main issues, purposes, and challenges. In particular, the talk will clarify the relationships between this discipline and Information Visualization and Data Mining. After presenting some outstanding results in the area, the talk will show how to apply Visual Analytics to Information Retrieval evaluation activities, focusing on the CLEF data and describing the characteristics of a Visual Analytics prototype specifically designed to analyze such data.
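To make this concrete, here is a deliberately simple, hypothetical example (not the CLEF prototype itself) of the kind of per-topic, per-system effectiveness view that visual analytics for IR evaluation typically starts from; the scores are randomly generated stand-ins for real run data.

# Illustrative sketch only: a heatmap of (fake) per-topic average precision
# for a handful of hypothetical runs, the raw material for visual analysis
# of IR evaluation results.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
systems = [f"run_{i}" for i in range(8)]       # hypothetical run names
topics = [f"topic_{i}" for i in range(25)]     # hypothetical topic ids
ap = rng.beta(2, 5, size=(len(systems), len(topics)))  # fake AP scores in [0, 1]

fig, ax = plt.subplots(figsize=(10, 3))
im = ax.imshow(ap, aspect="auto", cmap="viridis", vmin=0, vmax=1)
ax.set_yticks(range(len(systems)))
ax.set_yticklabels(systems)
ax.set_xlabel("topics")
cbar = fig.colorbar(im, ax=ax)
cbar.set_label("average precision")
plt.tight_layout()
plt.show()

A static heatmap like this is only the starting point; visual analytics adds interaction, such as re-ordering, brushing, and drill-down, on top of such views.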




Social book search: the impact of professional and user generated content on book suggestions
Jaap Kamps, University of Amsterdam, The Netherlands

The Web and social media give us access to a wealth of information, different not only in quantity but also in character: traditional descriptions from professionals are now supplemented with user generated content. This challenges modern search systems based on the classical model of topical relevance and ad hoc search: how does their effectiveness transfer to the changing nature of information and to the changing types of information needs and search tasks?




Compressed data structures for annotated Web search
Soumen Chakrabarti, IIT Bombay, India

Entity relationship search at Web scale depends on adding dozens of entity annotations to each of billions of crawled pages and indexing the annotations at rates comparable to regular text indexing. Even small entity search benchmarks from TREC and INEX suggest that the entity catalog support thousands of entity types and tens to hundreds of millions of entities. The above targets raise many challenges, major ones being the design of highly compressed data structures in RAM for spotting and disambiguating entity mentions, and highly compressed disk-based annotation indices. These data structures cannot be readily built upon standard inverted indices. Here we present a Web scale entity annotator and annotation index. Using a new workload-sensitive compressed multilevel map, we fit statistical disambiguation models for millions of entities within 1.15GB of RAM, and spend about 0.6 core-milliseconds per disambiguation. In contrast, DBPedia Spotlight spends 158 milliseconds, Wikipedia Miner spends 21 milliseconds, and Zemanta spends 9.5 milliseconds. Our annotation indices use ideas from vertical databases to reduce storage by 30%. On 40×8 cores with 40×3 disk spindles, we can annotate and index, in about a day, a billion Web pages with two million entities and 200,000 types from Wikipedia. Index decompression and scan speed are comparable to MG4J.
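For readers unfamiliar with compressed posting structures, the sketch below shows one standard building block of such indices: gap (delta) encoding plus variable-byte compression of an entity's annotation postings. This is only an illustrative toy in Python, not the actual index layout or the workload-sensitive multilevel map described in the talk.

# Illustrative sketch: gap + variable-byte encoding of sorted document ids,
# a classic ingredient of compressed annotation and text indices.
from typing import Iterable, List

def varbyte_encode(numbers: Iterable[int]) -> bytes:
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)          # high bit marks the last byte of each number
    return bytes(out)

def varbyte_decode(data: bytes) -> List[int]:
    numbers, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:                  # final byte of the current number
            numbers.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return numbers

def encode_postings(doc_ids: List[int]) -> bytes:
    # Gap-encode sorted document ids so varbyte stores small deltas.
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return varbyte_encode(gaps)

def decode_postings(data: bytes) -> List[int]:
    docs, running = [], 0
    for g in varbyte_decode(data):
        running += g
        docs.append(running)
    return docs

if __name__ == "__main__":
    docs = [3, 7, 1024, 1030, 900000]
    blob = encode_postings(docs)
    assert decode_postings(blob) == docs
    print(len(blob), "bytes for", len(docs), "postings")

Real annotation indices also store entity types, mention spans, and confidence scores per posting, which is where the column-oriented ("vertical database") layout mentioned in the abstract pays off.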




The CLEF initiative: from 2010 to 2013 and onwards
Nicola Ferro, University of Padova, Italy

This talk will discuss the challenges faced by the CLEF Initiative and the innovations it has introduced, and it will present the main activities carried out in recent years as well as plans for the future.




Report on INEX 2012
Jaap Kamps, University of Amsterdam, The Netherlands

This talk reports on the goals, tasks, and outcomes of the INEX 2012 evaluation campaign, covering a total of five tracks: linked data, relevance feedback, snippet retrieval, social book search, and tweet contextualization.