International Workshop On Research Issues in Digital Libraries
Tentative Schedule
Contributors :
Information retrieval and digital libraries: lessons of research. Abstract This paper reviews lessons from the history of information retrieval research, with particular emphasis on recent developments. These have demonstrated the value of statistical techniques for retrieval, and have also shown that they have an important, though not exclusive, part to play in other information processing tasks, like question answering and summarising. The heterogeneous materials that digital libraries are expected to cover, their scale, and their changing composition, imply that statistical methods, which are general-purpose and very flexible, have significant potential value for the digital libraries of the future.
Vagueness and Uncertainty in Information Retrieval: How can Fuzzy Sets Help? Abstract The field of fuzzy information systems has grown and is maturing. In this paper, some applications of fuzzy set theory to information retrieval are described, as well as the more recent outcomes of research in this field. Fuzzy set theory is applied to information retrieval with the main aim of defining flexible systems, i.e., systems that can represent and manage the vagueness and subjectivity which characterize the process of information representation and retrieval, one of the main objectives of artificial intelligence.
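As an illustration of the flexible, vagueness-aware systems this abstract alludes to, here is a minimal sketch (not taken from the paper) of a classic fuzzy Boolean retrieval model: each document is a fuzzy set of index terms, and query operators are evaluated with min, max and complement over membership degrees. The documents and the membership function are invented for the example.

```python
# A minimal sketch of fuzzy Boolean retrieval: documents are fuzzy sets of
# index terms; AND/OR/NOT are interpreted as min/max/complement.

def membership(doc_terms, term):
    """Degree to which `term` indexes the document (here: normalized term frequency)."""
    if not doc_terms:
        return 0.0
    return doc_terms.count(term) / max(doc_terms.count(t) for t in set(doc_terms))

def fuzzy_and(*degrees):
    return min(degrees)

def fuzzy_or(*degrees):
    return max(degrees)

def fuzzy_not(degree):
    return 1.0 - degree

# Retrieval status value of each toy document for the query "fuzzy AND retrieval"
docs = {
    "d1": "fuzzy set theory applied to retrieval of documents".split(),
    "d2": "probabilistic retrieval retrieval models".split(),
}
for name, terms in docs.items():
    rsv = fuzzy_and(membership(terms, "fuzzy"), membership(terms, "retrieval"))
    print(name, round(rsv, 2))
```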
Digitizing, coding, annotating, disseminating, and preserving documents. Abstract We examine some research issues in pattern recognition and image processing that have been spurred by the needs of digital libraries. Broader (and not only linguistic) context must be introduced in character recognition on low-contrast, tightly-set documents because the conversion of documents to coded (searchable) form is lagging far behind conversion to image formats. At the same time, the prevalence of imaged documents over coded documents gives rise to interesting research problems in interactive annotation of document images. At the level of circulation, reformatting document images to accommodate diverse user needs remains a challenge.
Digital Representation of Cultural Heritage Materials - new possibilities for enhanced access and analysis. Abstract It is now feasible, both technologically and economically, to create extremely accurate digital surrogates of a wide variety of cultural heritage materials which capture and in many cases increase the scholarly value of the originals, while leaving them unharmed, intact and unaltered. Artifacts for which this has been successfully done include fragile manuscripts and inscriptions on a variety of media, images, sounds, 3-dimensional objects, ancient buildings and culturally significant sites and places.
Digital Audiovisual Repositories - an Introduction. Abstract This paper briefly describes the essential aspects of the digital world that audiovisual archives are entering - or being swallowed up by. The crucial issue is whether archives will sink or swim in this all-digital environment. The core issue is defining - and meeting - the requirements for a secure, sustainable digital repository.
Document Image Understanding for Digital Libraries. Abstract The rapid growth of digital libraries (DLs) worldwide poses many new challenges for document image understanding research and development. We are entering a period of book-scanning on an unprecedented scale: the Million Book Project, Amazon's Search Inside feature, Google Books, etc. However, images of pages, when they are accessed through DLs, lose many advantages of hardcopy and of course lack the advantages of richly encoded symbolic representations. Document image understanding technology can alleviate some of these problems and integrate book-images more fully into the digital world. But many technical problems remain. The broad diversity of document types poses serious challenges to the state of the art: we need new, more versatile technologies that can scale up to exploit the huge training sets that are rapidly coming on line. We also need new strategies for improving recognition accuracy by combining multiple independent contextual constraints. Another promising research direction seeks to adapt recognition algorithms to the particular book at hand. Also, our systems should accept correction by users in order to accelerate automatic understanding. These research frontiers will be illustrated concretely by recent results achieved at Lehigh University and elsewhere.
Multimedia Semantics: What Do We Really Know? Abstract In this talk, we discuss several issues related to multimedia semantics, including emergent semantics, the multimedia semantic web, monomedia versus multimedia, how context can be used in a multimedia search environment, and a look at just what should be considered a multimedia document in the context of multimedia search. Throughout the talk, we discuss cross-pollination of ideas from the fields of content-based retrieval, information retrieval, and the semantic web.
Speakers :
On the science of search: Statistical approaches, evaluation, optimisation. Abstract Evaluation of information retrieval (search) systems has a long history, but in the last 15 years the agenda has in large measure been set by the Text REtrieval Conference (TREC). I will talk about how TREC has moulded the field, and the strengths and limitations of the TREC approach. I will also discuss other efforts in a similar vein, including those happening within commercial organisations. Turning to the models and methods used in search, I will discuss the dominance of the various statistical approaches, including the vector space, probabilistic and 'language' models. Finally, I will discuss the confluence of these various ideas in the domain of optimisation: training algorithms using test data.
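To make the "language model" family mentioned above concrete, the following is a minimal query-likelihood sketch with Jelinek-Mercer smoothing against a background collection model. The toy documents, the smoothing parameter, and the floor probability for unseen terms are assumptions for illustration, not material from the talk.

```python
# A minimal sketch of query-likelihood ranking: score documents by
# P(query | document) with Jelinek-Mercer smoothing toward the collection.

import math
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    doc_tf, col_tf = Counter(doc), Counter(collection)
    doc_len, col_len = len(doc), len(collection)
    score = 0.0
    for term in query:
        p_doc = doc_tf[term] / doc_len if doc_len else 0.0
        p_col = col_tf[term] / col_len if col_len else 0.0
        p = lam * p_doc + (1 - lam) * p_col
        score += math.log(p) if p > 0 else math.log(1e-12)  # floor for unseen terms
    return score

docs = {
    "d1": "statistical methods for text retrieval".split(),
    "d2": "evaluation of search systems at trec".split(),
}
collection = [t for d in docs.values() for t in d]
query = "statistical retrieval".split()
ranking = sorted(docs, key=lambda d: query_likelihood(query, docs[d], collection), reverse=True)
print(ranking)
```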
Shallow syntax analysis in Sanskrit guided by semantic nets constraints. Abstract We present the state of the art of a computational platform for the analysis of classical Sanskrit. The platform comprises modules for phonology, morphology, segmentation and shallow syntax analysis, organized around a structured lexical database. It relies on the Zen toolkit for finite state automata and transducers, which provides data structures and algorithms for the modular construction and execution of finite state machines in a functional framework. Some of the layers proceed in bottom-up synthesis mode: for instance, the noun and verb morphological modules generate all inflected forms from stems and roots listed in the lexicon. Morphemes are assembled through internal sandhi, and the inflected forms are stored with morphological tags in dictionaries usable for lemmatizing. These dictionaries are then compiled into transducers implementing the analysis of external sandhi, the phonological process which merges words together by euphony. This provides a tagging segmenter, which analyses a sentence presented as a stream of phonemes and produces a stream of tagged lexical entries, hyperlinked to the lexicon. The next layer is a syntax analyser, guided by semantic net constraints capturing dependencies between the word forms. Finite verb forms demand semantic roles, according to subcategorization patterns depending on the voice (active, passive) of the form and the governance (transitive, etc.) of the root. Conversely, noun/adjective forms provide actors which may fill those roles, provided agreement constraints are satisfied. Tool words and particles are mapped to transducers operating on tagged streams, allowing the modeling of linguistic phenomena such as coordination by abstract interpretation of actor streams. The parser ranks the various interpretations (matching actors with roles) with a penalty, and returns to the user the minimum-penalty analyses, for final validation of ambiguities. Work is under way to adapt this machinery to the construction of a treebank of parsed sentences drawn from characteristic examples in Apte's Sanskrit Syntax manual. This work is in cooperation with Prof. Brendan Gillon of McGill University. It is expected that this treebank will be used to learn the parameters of the parser statistically in order to increase its precision. Other modules will attempt statistical tagging, needed for bootstrapping this prototype into a more robust analyser that does lexicon acquisition from the corpus. The whole platform is organized as a Web service, allowing the piecewise tagging of a Sanskrit text. The talk will essentially consist of an interactive demonstration of this software.
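The segmentation layer described above resolves external sandhi while splitting a phoneme stream into lexical entries. The toy sketch below is only loosely inspired by that idea and is not the Zen toolkit: a recursive segmenter over a hypothetical mini-lexicon with a single invented sandhi rule (a + i merging to e).

```python
# A toy sandhi-aware segmenter: recursively split a phoneme string into
# lexicon words, allowing a junction to have been produced by a sandhi rule.

LEXICON = {"rama", "iti", "eva", "gacchati"}          # hypothetical mini-lexicon
SANDHI_RULES = {"e": ("a", "i")}                      # surface form -> (left ending, right start)

def segment(s):
    if not s:
        return [[]]
    analyses = []
    for i in range(1, len(s) + 1):
        head, rest = s[:i], s[i:]
        # plain split: head is a lexicon word as written
        if head in LEXICON:
            analyses += [[head] + tail for tail in segment(rest)]
        # sandhi split: last surface character of head came from merging two sounds
        junction = head[-1]
        if junction in SANDHI_RULES:
            left, right = SANDHI_RULES[junction]
            if head[:-1] + left in LEXICON:
                analyses += [[head[:-1] + left] + tail for tail in segment(right + rest)]
    return analyses

print(segment("rameti"))   # -> [['rama', 'iti']]  via  a + i -> e
```

The real platform additionally tags each segment morphologically and passes the tagged stream to the constraint-based parser; the sketch only shows the splitting step.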
Open source, academic research and Web search engines. Abstract Search engines are the primary access to the information in the World Wide Web for millions of users. As of now, all the search engines able to deal with both the huge quantity of information and the huge number of Web users are driven by commercial companies. They use hidden algorithms that put the integrity of their results in doubt. Moreover, some specialized information needs are not easily served, so there is a need for open source Web search engines. On the other hand, academic research in the Information Retrieval field has been dealing with the problem of retrieving information in large repositories of digital documents since the middle of the 20th century, and in the past ten years several open source Information Retrieval systems have been developed. I will present some reasons why we want open source search. I will survey the major tools available under Open Source licenses and present their key points for use either in academic research or as a search engine. I will also discuss the interactions and mutual benefits between research and open source development: the problems faced, the building blocks available so far, and the strategies and opportunities for the future.
Finding an answer to a question. Abstract The huge quantity of available electronic information leads to a growing need for tools that are precise and selective. Such tools have to provide answers to requests quite rapidly, without requiring the user to explore each document, to reformulate her request, or to search for the answer inside documents. From that viewpoint, finding an answer consists not only in finding relevant documents but also in extracting relevant parts. This leads us to express the question-answering problem in terms of an information retrieval problem that can be solved using natural language processing (NLP) approaches. In my talk, I will focus on defining what a "good" answer is, and how a system can find it. A good answer has to give the required piece of information. However, that is not sufficient; it also has to be presented within its context of interpretation and to be justified, in order to give the user the means to evaluate whether the answer fits her needs and is appropriate. One can view searching for an answer to a question as a reformulation problem: according to what is asked, find one of the different linguistic expressions of the answer among all candidate sentences. Within this framework, interlingual question answering can also be seen as another kind of linguistic variation. The answer phrasing can be considered as an affirmative reformulation of the question, partly or totally, which entails the definition of models that match sentences containing the answer. The kinds of model and the matching criteria differ greatly across approaches. One approach consists of building a structured representation that makes explicit the semantic relations between the concepts of the question and comparing it to a similar representation of sentences. As this approach requires a syntactic parser and a semantic knowledge base, which are not always available in all languages, systems often apply a less formal approach based on a similarity measure between a passage and the question, and answers are extracted from the highest-scored passages. Similarity involves different criteria: question terms and their linguistic variations in passages, syntactic proximity, and answer type. We will see that, in such an approach, justifications can be envisioned by using the texts themselves, considered as repositories of semantic knowledge. I will focus on the approach the LIR group of LIMSI has taken for its monolingual and bilingual systems.
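As a concrete rendering of the passage-similarity view sketched above, here is a minimal scoring function (not the LIMSI system) that combines question-term overlap with an expected-answer-type check; the weighting, the example question, the year pattern and the passages are invented for illustration.

```python
# A minimal sketch of passage scoring for question answering: overlap with
# question terms plus a bonus if the passage contains an expression of the
# expected answer type (here, a four-digit year for a "when" question).

import re

def score_passage(question_terms, passage, answer_pattern):
    words = set(re.findall(r"\w+", passage.lower()))
    overlap = len(question_terms & words)
    has_answer_type = 1 if re.search(answer_pattern, passage) else 0
    return overlap + 2 * has_answer_type   # weight answer-type evidence higher

question = "when was the library founded"
question_terms = set(question.split()) - {"when", "was", "the"}
answer_pattern = r"\b(1[0-9]{3}|20[0-9]{2})\b"   # a four-digit year

passages = [
    "The library was founded in 1836 by the colonial administration.",
    "The library holds a large founded collection of manuscripts.",
]
best = max(passages, key=lambda p: score_passage(question_terms, p, answer_pattern))
print(best)
```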
How to Compose a Complex Document Recognition System. Abstract The technical challenges in document analysis and recognition have been to solve the problems of uncertainty and variability that exist in document images. From our experiences in developing OCRs, business form readers, and postal address recognition engines, I would like to present several design principles to cope with these problems of uncertainty and variability. When the targets of document recognition are complex and diversified, the recognition engine needs to solve many different kinds of pattern recognition (sub)problems, which are concrete phenomena of uncertainty and variability. Inevitably, the engine gets complex as a result. The question is how to combine the subcomponents of a recognition engine so that the total engine produces sufficiently accurate results. These principles will be explained together with examples.
Adaptive Search Systems and Experimentation Methodology. Abstract In this paper, I will explore the domain of adaptive search systems and their evaluation. Current evaluation methodologies, such as the traditional TREC methodology and interactive experimentation methodologies, are inadequate for the development of such systems. The classical approach is less expensive and facilitates comparison between systems; however, it is not suitable for the experimentation of adaptive retrieval systems. On the other hand, the interactive evaluation methodology is expensive in terms of time and other resources. In my presentation, I will explore these issues and propose an approach for the experimentation of adaptive retrieval systems, based on simulated-user experimentation.
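The following sketch illustrates, under invented assumptions, what simulated-user based experimentation can look like: an artificial user scans each ranked list to a fixed depth, "clicks" documents judged relevant in the test collection, and a toy adaptive component re-ranks using those clicks. It is a schematic stand-in, not the approach proposed in the talk.

```python
# A toy simulated-user evaluation loop: all components are illustrative.

import random

RELEVANT = {"d2", "d5", "d7"}                     # relevance judgments (qrels)
DOCS = [f"d{i}" for i in range(1, 11)]

def simulated_user(ranking, depth=5):
    """Relevant documents the simulated user would notice in the top `depth`."""
    return [d for d in ranking[:depth] if d in RELEVANT]

def adaptive_rerank(ranking, clicked):
    """Toy adaptation: promote documents 'similar' to clicked ones
    (similarity is simulated here by adjacency in the DOCS list)."""
    boosted = set()
    for d in clicked:
        idx = DOCS.index(d)
        boosted.update(DOCS[max(0, idx - 1): idx + 2])
    return sorted(ranking, key=lambda d: (d not in boosted, ranking.index(d)))

random.seed(0)
ranking = random.sample(DOCS, len(DOCS))          # initial (random) ranking
found = set()
for iteration in range(3):
    clicks = simulated_user(ranking)
    found.update(clicks)
    ranking = adaptive_rerank(ranking, clicks)
    print(f"iteration {iteration}: found {sorted(found)}")
```

The point of such a setup is that the "user" is cheap and repeatable, so adaptive and non-adaptive configurations can be compared over many simulated sessions.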
Document Image Analysis for Digital Libraries at PARC. Abstract Digital Libraries have many forms -- institutional libraries for information dissemination, document repositories for record-keeping, and personal digital libraries for organizing personal thoughts, knowledge, and course of action. Digital image content (scanned or otherwise) is a substantial component of all of these libraries. Document layout understanding, character recognition, functional role labeling, image enhancement, indexing, organizing, restructuring, summarizing, cross-linking, redaction, privacy management, and distribution are a few examples of tasks on document images in a library that computational assistance could facilitate. At the Palo Alto Research Center, we conduct research on several aspects of document analysis for Digital Libraries, ranging from raw image transformations to linguistic analysis to interactive sensemaking tools. DigiPaper, DataGlyphs, ScanScribe, Document Image Decoding, Ubitext, UpLib, and 3Book are a few examples of PARC research projects. I shall describe three recent research activities in the realm of document image analysis: (1) the Document Image Classification Engine (DICE); (2) robust OCR using Fisher scores from multi-level parameterized template models; and (3) efficient functional role labeling algorithms.
From CLIR to CLIE: Some Experiences in NTCIR Evaluation. Abstract Cross-language information retrieval (CLIR) facilitates the use of one language to access documents in other languages. Cross-language information extraction (CLIE) extracts relevant information at a finer granularity from multilingual documents for specific applications like summarization, question answering, opinion extraction, etc. NTCIR started evaluation of CLIR tasks on Chinese, English, Japanese and Korean languages in 2001. In these 5 years (2001-2005), three CLIR test collections, namely the NTCIR-3, NTCIR-4 and NTCIR-5 evaluation sets, have been developed. In NTCIR-5 (2004-2005), we extended the CLIR task to a CLQA task, which is an application of CLIE. In NTCIR-6 (2005-2006), we further reused the past NTCIR CLIR test collections to build a corpus for opinion analysis, which is another application of CLIE. This paper shows the design methodologies of the test collections for CLIR, CLQA and opinion extraction. In addition, some kernel technologies for these tasks are discussed.
Advances in XML Information Retrieval: The INEX Initiative. Abstract By exposing their logical structure, XML documents offer various opportunities for improved information access. Within the Initiative for the Evaluation of XML Retrieval (INEX), a number of different retrieval tasks are studied:
- Content-only queries aim at retrieving the smallest XML elements that answer the query.
- Content-and-structure queries specify the element types to be returned and also contain restrictions with regard to the elements that must satisfy single query conditions.
- Multimedia retrieval combines these types of queries with conditions referring to the content of images contained in the document.
- Interactive retrieval focuses on the design of appropriate user interfaces for these tasks.
- Document mining aims at the application of clustering, classification and information extraction methods to XML documents.
For each of these tasks, appropriate evaluation metrics have to be defined. Since 2002, INEX has organized annual evaluation campaigns in this area by providing testbeds and the evaluation infrastructure; with more than 70 participating groups from all over the world, INEX is the focal point of research on XML retrieval.
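As an illustration of the content-only task, the sketch below scores every element of a small XML document by query-term density, so the smallest elements answering the query rank first; the document, query and scoring function are invented and are not INEX evaluation code.

```python
# A minimal sketch of content-only XML element retrieval: score each element
# by query-term matches in its text, normalized by length, so small focused
# elements outrank large enclosing ones.

import xml.etree.ElementTree as ET

DOC = """<article>
  <title>XML retrieval</title>
  <section>
    <para>Structured documents allow focused XML retrieval.</para>
    <para>Evaluation needs dedicated metrics.</para>
  </section>
</article>"""

def element_text(elem):
    return " ".join(elem.itertext()).lower().split()

def score(elem, query_terms):
    words = element_text(elem)
    if not words:
        return 0.0
    matches = sum(1 for w in words for q in query_terms if q in w)
    return matches / len(words)          # favour small, focused elements

root = ET.fromstring(DOC)
query = ["xml", "retrieval"]
ranked = sorted(root.iter(), key=lambda e: score(e, query), reverse=True)
for e in ranked[:3]:
    print(e.tag, round(score(e, query), 2))
```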
Using eBooks to Break Down the Bars of Ignorance and Illiteracy. Abstract Michael S. Hart, the founder of Project Gutenberg, will talk about how the world's first electronic library came to be, and its role for the future. Opportunities abound for conference attendees to assist in the Project Gutenberg mission to encourage the creation and distribution of eBooks. This session will provide an overview of Project Gutenberg, including its principle of minimal regulation/administration and its ongoing efforts to foster like-minded efforts. The production and distribution of eBooks -- literature in electronic format -- is challenging and fun, presenting opportunities for curious and talented people to assist. Next-generation efforts include aiming for millions of unique titles, pursuing automated translation into 100 different languages, providing reformatting on the fly, and replicator technology.
How the dragons work: searching in a web. Abstract Search engines -- web dragons -- are the portals through which we access society's treasure trove of information. They do not publish the algorithms they use to sort and filter information, yet how they work is one of the most important questions of our time. Google's PageRank is a way of measuring the prestige of each web page in terms of who links to it: it reflects the experience of a surfer condemned to click randomly around the web forever. The HITS technique distinguishes 'hubs' that point to reputable sources from 'authorities', the sources themselves. This helps differentiate communities on the web, which in turn can tease out alternative interpretations of ambiguous query terms. RankNet uses machine learning techniques to rank documents by predicting relevance judgments based on training data. This talk explains in non-technical terms how the dragons work.
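The random-surfer intuition behind PageRank can be stated in a few lines of code: iteratively redistribute each page's score along its out-links, with a damping factor modelling an occasional jump to a random page. The tiny link graph below is invented for illustration.

```python
# A minimal sketch of PageRank by power iteration over a toy link graph.

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:                       # dangling page: spread score evenly
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
            else:
                for q in outs:
                    new_rank[q] += damping * rank[p] / len(outs)
        rank = new_rank
    return rank

# Tiny web: everyone links to A, and A links to B.
links = {"A": ["B"], "B": ["A"], "C": ["A"], "D": ["A"]}
for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```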
The Connection Between Scientific Literature and Data in Astronomy. Abstract For more than a century, journal articles have been the primary vector transporting scientific knowledge into the future; during this time scientists have also created and maintained complex systems of archives, preserving the primary information for their disciplines. Modern communications and information processing technologies are enabling a synergism between the (now fully digital) archives and journals which can have profound effects on the future of research. During roughly the last 20 years astronomers have been simultaneously building out the new digital systems for data and for literature, and have been merging these systems into a coherent, distributed whole. Currently the system consists of a network of journals, data centers, and indexing agencies, which interact via a massive sharing of meta-data between organizations. The system has been in active use for more than a decade; Peter Boyce named it Urania in 1997. Astronomers are now on the verge of making a major expansion of these capabilities. Besides the ongoing improvement in the capabilities and interactions of existing organizations, this expansion will entail the creation of new archiving and indexing organizations, as well as a new international supervisory structure for the development of meta-data standards. The nature of scientific communication is clearly being changed by these developments, and with these changes will come others, raising questions such as: How will information be accessed? How will the work of individual scientists be evaluated? How will the publishing process be funded?
Toward a common semantics between media and languages. Abstract Semantic interpretation of written texts, sound, speech, and still and moving images has been treated, in the past, by specialists of different backgrounds with little interaction between their approaches. The current growing need to index and retrieve multimedia documents, as well as to develop natural interfaces which must combine results from vision, sound and speech recognition and text understanding, provides a strong incentive for these specialists to share their disjoint experience and stimulates a cross-fertilization of research and development in understanding the semantics present in media. Concept recognition can profit from merging information coming from various media. For example, a person can be recognized using a combination of visual face recognition, sound-based speaker recognition, analysis of the narration track, as well as textual analysis of available subtitles. People from the text community of natural language processing have a long experience in developing indexing tools, developing, en route, extensive linguistic resources (dictionaries, grammars, thesauri, ontologies) to help in language processing. This is also the case for speech processing with its language models. But for images and sounds, the research community lacks large-scale semantic resources. Recognizing objects, situations
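One way to read the person-recognition example above is as late fusion of per-modality confidence scores. The sketch below shows such a weighted fusion; the modalities, scores and weights are made up for illustration and do not come from the talk.

```python
# A minimal sketch of late fusion: each modality reports a confidence per
# candidate identity, and a weighted sum combines the evidence.

def fuse(per_modality_scores, weights):
    """Weighted late fusion of per-modality confidence scores per identity."""
    identities = {name for scores in per_modality_scores.values() for name in scores}
    fused = {}
    for name in identities:
        fused[name] = sum(
            weights[m] * scores.get(name, 0.0)
            for m, scores in per_modality_scores.items()
        )
    return fused

scores = {
    "face":      {"Asha": 0.6, "Ravi": 0.3},
    "speaker":   {"Asha": 0.4, "Ravi": 0.5},
    "subtitles": {"Asha": 0.9},
}
weights = {"face": 0.5, "speaker": 0.3, "subtitles": 0.2}
fused = fuse(scores, weights)
print(max(fused, key=fused.get), fused)
```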
Large-Scale Evaluation Infrastructure for Information Access Technologies on East Asian Languages. Abstract Research and development (R & D) of information access technologies like information retrieval, question answering and summarization requires demonstrating the superiority of proposed systems over previous ones through experiments. Standard test collections are fundamental infrastructure for making such comparative evaluation feasible at reasonable cost; they have contributed to the enhancement of R & D in information access technologies and have sped up research, especially in its earlier stages. In this talk, I describe why and how test collections and evaluation are important for information access technologies, briefly introduce the activities of the NTCIR project, which has organized a series of evaluation workshops on information access technologies using East Asian languages and has attracted international participants, describe the test collections constructed through it, and indicate its implications. Finally, I will discuss some thoughts on the future direction of research in information access and the need for wider collaborations.
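To illustrate how a test collection enables low-cost comparative evaluation, the sketch below computes mean average precision (MAP) for two hypothetical systems from toy relevance judgments and ranked runs; none of the data is from NTCIR.

```python
# A minimal sketch of test-collection evaluation: qrels plus ranked runs
# yield a comparable effectiveness score (MAP) per system.

def average_precision(ranking, relevant):
    hits, precision_sum = 0, 0.0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / i
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs, qrels):
    return sum(average_precision(runs[q], qrels[q]) for q in qrels) / len(qrels)

qrels = {"q1": {"d1", "d3"}, "q2": {"d2"}}
systems = {
    "system_A": {"q1": ["d1", "d2", "d3"], "q2": ["d2", "d4"]},
    "system_B": {"q1": ["d2", "d1", "d3"], "q2": ["d4", "d2"]},
}
for name, runs in systems.items():
    print(name, round(mean_average_precision(runs, qrels), 3))
```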