International Workshop On Research Issues in Digital Libraries
Tentative Schedule
Contributors :
Information retrieval and digital libraries: lessons of research. Abstract This paper reviews lessons from the history of information retrieval research, with particular emphasis on recent developments. These have demonstrated the value of statistical techniques for retrieval, and have also shown that they have an important, though not exclusive, part to play in other information processing tasks, like question answering and summarising. The heterogeneous materials that digital libraries are expected to cover, their scale, and their changing composition, imply that statistical methods, which are general-purpose and very flexible, have significant potential value for the digital libraries of the future.
Vagueness and Uncertainty in Information Retrieval: How can Fuzzy Sets Help? Abstract The field of fuzzy information systems has grown and is maturing. In this paper, some applications of fuzzy set theory to information retrieval are described, as well as the more recent outcomes of research in this field. Fuzzy set theory is applied to information retrieval with the main aim of defining flexible systems, i.e., systems that can represent and manage the vagueness and subjectivity which characterize the process of information representation and retrieval, one of the main objectives of artificial intelligence.
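As an illustration of the flexible, vagueness-aware systems this abstract alludes to, here is a minimal sketch (not taken from the paper) of a classic fuzzy Boolean retrieval model: each document is a fuzzy set of index terms, and query operators are evaluated with min, max and complement over membership degrees. The documents and the membership function are invented for the example.

```python
# A minimal sketch of fuzzy Boolean retrieval: documents are fuzzy sets of
# index terms; AND/OR/NOT are interpreted as min/max/complement.

def membership(doc_terms, term):
    """Degree to which `term` indexes the document (here: normalized term frequency)."""
    if not doc_terms:
        return 0.0
    return doc_terms.count(term) / max(doc_terms.count(t) for t in set(doc_terms))

def fuzzy_and(*degrees):
    return min(degrees)

def fuzzy_or(*degrees):
    return max(degrees)

def fuzzy_not(degree):
    return 1.0 - degree

# Retrieval status value of each toy document for the query "fuzzy AND retrieval"
docs = {
    "d1": "fuzzy set theory applied to retrieval of documents".split(),
    "d2": "probabilistic retrieval retrieval models".split(),
}
for name, terms in docs.items():
    rsv = fuzzy_and(membership(terms, "fuzzy"), membership(terms, "retrieval"))
    print(name, round(rsv, 2))
```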
Digitizing, coding, annotating, disseminating, and preserving documents. Abstract We examine some research issues in pattern recognition and image processing that have been spurred by the needs of digital libraries. Broader (and not only linguistic) context must be introduced in character recognition on low-contrast, tightly-set documents because the conversion of documents to coded (searchable) form is lagging far behind conversion to image formats. At the same time, the prevalence of imaged documents over coded documents gives rise to interesting research problems in interactive annotation of document images. At the level of circulation, reformatting document images to accommodate diverse user needs remains a challenge.
Digital Representation of Cultural Heritage Materials - new possibilities for enhanced access and analysis. Abstract It is now feasible, both technologically and economically, to create extremely accurate digital surrogates of a wide variety of cultural heritage materials which capture and in many cases increase the scholarly value of the originals, while leaving them unharmed, intact and unaltered. Artifacts for which this has been successfully done include fragile manuscripts and inscriptions on a variety of media, images, sounds, 3-dimensional objects, ancient buildings and culturally significant sites and places.
Digital Audiovisual Repositories - an Introduction. Abstract This paper briefly describes the essential aspects of the digital world that audiovisual archives are entering - or being swallowed up by. The crucial issue is whether archives will sink or swim in this all-digital environment. The core issue is defining - and meeting - the requirements for a secure, sustainable digital repository.
Document Image Understanding for Digital Libraries. Abstract The rapid growth of digital libraries (DLs) worldwide poses many new challenges for document image understanding research and development. We are entering a period of book-scanning on an unprecedented scale: the Million Book Project, Amazon's Search Inside feature, Google Books, etc. However, images of pages, when they are accessed through DLs, lose many advantages of hardcopy and of course lack the advantages of richly encoded symbolic representations. Document image understanding technology can alleviate some of these problems and integrate book-images more fully into the digital world. But many technical problems remain. The broad diversity of document types poses serious challenges to the state of the art: we need new, more versatile technologies that can scale up to exploit the huge training sets that are rapidly coming on line. We also need new strategies for improving recognition accuracy by combining multiple independent contextual constraints. Another promising research direction seeks to adapt recognition algorithms to the particular book at hand. Also, our systems should accept correction by users in order to accelerate automatic understanding. These research frontiers will be illustrated concretely by recent results achieved at Lehigh University and elsewhere.
Multimedia Semantics: What Do We Really Know? Abstract In this talk, we discuss several issues related to multimedia semantics, including emergent semantics, the multimedia semantic web, monomedia versus multimedia, how context can be used in a multimedia search environment, and a look at just what should be considered a multimedia document in the context of multimedia search. Throughout the talk, we discuss cross-pollination of ideas from the fields of content-based retrieval, information retrieval, and the semantic web.
Speakers :
On the science of search: Statistical approaches, evaluation, optimisation. Abstract Evaluation of information retrieval (search) systems has a long history, but in the last 15 years the agenda has in large measure been set by the Text REtrieval Conference (TREC). I will talk about how TREC has moulded the field, and the strengths and limitations of the TREC approach. I will also discuss other efforts in a similar vein, including those happening within commercial organisations. Turning to the models and methods used in search, I will discuss the dominance of the various statistical approaches, including the vector space, probabilistic and 'language' models. Finally, I will discuss the confluence of these various ideas in the domain of optimisation: training algorithms using test data.
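To make the "language model" family mentioned above concrete, the following is a minimal query-likelihood sketch with Jelinek-Mercer smoothing against a background collection model. The toy documents, the smoothing parameter, and the floor probability for unseen terms are assumptions for illustration, not material from the talk.

```python
# A minimal sketch of query-likelihood ranking: score documents by
# P(query | document) with Jelinek-Mercer smoothing toward the collection.

import math
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    doc_tf, col_tf = Counter(doc), Counter(collection)
    doc_len, col_len = len(doc), len(collection)
    score = 0.0
    for term in query:
        p_doc = doc_tf[term] / doc_len if doc_len else 0.0
        p_col = col_tf[term] / col_len if col_len else 0.0
        p = lam * p_doc + (1 - lam) * p_col
        score += math.log(p) if p > 0 else math.log(1e-12)  # floor for unseen terms
    return score

docs = {
    "d1": "statistical methods for text retrieval".split(),
    "d2": "evaluation of search systems at trec".split(),
}
collection = [t for d in docs.values() for t in d]
query = "statistical retrieval".split()
ranking = sorted(docs, key=lambda d: query_likelihood(query, docs[d], collection), reverse=True)
print(ranking)
```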
Shallow syntax analysis in Sanskrit guided by semantic nets constraints. Abstract We present the state of the art of a computational platform for the analysis of classical Sanskrit. The platform comprises modules for phonology, morphology, segmentation and shallow syntax analysis, organized around a structured lexical database. It relies on the Zen toolkit for finite state automata and transducers, which provides data structures and algorithms for the modular construction and execution of finite state machines in a functional framework. Some of the layers proceed in bottom-up synthesis mode: for instance, the noun and verb morphological modules generate all inflected forms from stems and roots listed in the lexicon. Morphemes are assembled through internal sandhi, and the inflected forms are stored with morphological tags in dictionaries usable for lemmatizing. These dictionaries are then compiled into transducers implementing the analysis of external sandhi, the phonological process which merges words together by euphony. This provides a tagging segmenter, which analyses a sentence presented as a stream of phonemes and produces a stream of tagged lexical entries, hyperlinked to the lexicon. The next layer is a syntax analyser, guided by semantic net constraints capturing dependencies between the word forms. Finite verb forms demand semantic roles, according to subcategorization patterns depending on the voice (active, passive) of the form and the governance (transitive, etc.) of the root. Conversely, noun/adjective forms provide actors which may fill those roles, provided agreement constraints are satisfied. Tool words and particles are mapped to transducers operating on tagged streams, allowing the modeling of linguistic phenomena such as coordination by abstract interpretation of actor streams. The parser ranks the various interpretations (matching actors with roles) with a penalty, and returns to the user the minimum-penalty analyses, for final validation of ambiguities. Work is under way to adapt this machinery to the construction of a treebank of parsed sentences drawn from characteristic examples in Apte's Sanskrit Syntax manual. This work is in cooperation with Prof. Brendan Gillon of McGill University. It is expected that this treebank will be used to learn the parameters of the parser statistically in order to increase its precision. Other modules will attempt statistical tagging, needed for bootstrapping this prototype into a more robust analyser that does lexicon acquisition from the corpus. The whole platform is organized as a Web service, allowing the piecewise tagging of a Sanskrit text. The talk will essentially consist of an interactive demonstration of this software.
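The segmentation layer described above resolves external sandhi while splitting a phoneme stream into lexical entries. The toy sketch below is only loosely inspired by that idea and is not the Zen toolkit: a recursive segmenter over a hypothetical mini-lexicon with a single invented sandhi rule (a + i merging to e).

```python
# A toy sandhi-aware segmenter: recursively split a phoneme string into
# lexicon words, allowing a junction to have been produced by a sandhi rule.

LEXICON = {"rama", "iti", "eva", "gacchati"}          # hypothetical mini-lexicon
SANDHI_RULES = {"e": ("a", "i")}                      # surface form -> (left ending, right start)

def segment(s):
    if not s:
        return [[]]
    analyses = []
    for i in range(1, len(s) + 1):
        head, rest = s[:i], s[i:]
        # plain split: head is a lexicon word as written
        if head in LEXICON:
            analyses += [[head] + tail for tail in segment(rest)]
        # sandhi split: last surface character of head came from merging two sounds
        junction = head[-1]
        if junction in SANDHI_RULES:
            left, right = SANDHI_RULES[junction]
            if head[:-1] + left in LEXICON:
                analyses += [[head[:-1] + left] + tail for tail in segment(right + rest)]
    return analyses

print(segment("rameti"))   # -> [['rama', 'iti']]  via  a + i -> e
```

The real platform additionally tags each segment morphologically and passes the tagged stream to the constraint-based parser; the sketch only shows the splitting step.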
Open source, academic research and Web search engines. Abstract Search engines are the primary access to the information in the World Wide Web for millions of users. As of now, all the search engines able to deal with both the huge quantity of information and the huge number of Web users are driven by commercial companies. They use hidden algorithms that put the integrity of their results in doubt. Moreover, some specialized information needs are not easily served, so there is a need for open source Web search engines. On the other hand, academic research in the Information Retrieval field has been dealing with the problem of retrieving information in large repositories of digital documents since the middle of the 20th century, and in the past ten years several open source Information Retrieval systems have been developed. I will present some reasons why we want open source search. I will survey the major tools available under Open Source licenses and present their key points for use either in academic research or as a search engine. I will also discuss the interactions and mutual benefits between research and open source development: the problems faced, the building blocks available so far, and the strategies and opportunities for the future.
Finding an answer to a question. Abstract The huge quantity of available electronic information leads to a growing need for tools that are precise and selective. Such tools have to provide answers to requests quite rapidly, without requiring the user to explore each document, to reformulate her request, or to search for the answer inside documents. From that viewpoint, finding an answer consists not only in finding relevant documents but also in extracting relevant parts. This leads us to express the question-answering problem in terms of an information retrieval problem that can be solved using natural language processing (NLP) approaches. In my talk, I will focus on defining what a "good" answer is, and how a system can find it. A good answer has to give the required piece of information. However, that is not sufficient; it also has to be presented within its context of interpretation and to be justified, in order to give the user the means to evaluate whether the answer fits her needs and is appropriate. One can view searching for an answer to a question as a reformulation problem: according to what is asked, find one of the different linguistic expressions of the answer among all candidate sentences. Within this framework, interlingual question answering can also be seen as another kind of linguistic variation. The answer phrasing can be considered as an affirmative reformulation of the question, partly or totally, which entails the definition of models that match sentences containing the answer. The kinds of model and the matching criteria differ greatly across approaches. One approach consists of building a structured representation that makes explicit the semantic relations between the concepts of the question and comparing it to a similar representation of sentences. As this approach requires a syntactic parser and a semantic knowledge base, which are not always available in all languages, systems often apply a less formal approach based on a similarity measure between a passage and the question, and answers are extracted from the highest-scored passages. Similarity involves different criteria: question terms and their linguistic variations in passages, syntactic proximity, and answer type. We will see that, in such an approach, justifications can be envisioned by using the texts themselves, considered as repositories of semantic knowledge. I will focus on the approach the LIR group of LIMSI has taken for its monolingual and bilingual systems.
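As a concrete rendering of the passage-similarity view sketched above, here is a minimal scoring function (not the LIMSI system) that combines question-term overlap with an expected-answer-type check; the weighting, the example question, the year pattern and the passages are invented for illustration.

```python
# A minimal sketch of passage scoring for question answering: overlap with
# question terms plus a bonus if the passage contains an expression of the
# expected answer type (here, a four-digit year for a "when" question).

import re

def score_passage(question_terms, passage, answer_pattern):
    words = set(re.findall(r"\w+", passage.lower()))
    overlap = len(question_terms & words)
    has_answer_type = 1 if re.search(answer_pattern, passage) else 0
    return overlap + 2 * has_answer_type   # weight answer-type evidence higher

question = "when was the library founded"
question_terms = set(question.split()) - {"when", "was", "the"}
answer_pattern = r"\b(1[0-9]{3}|20[0-9]{2})\b"   # a four-digit year

passages = [
    "The library was founded in 1836 by the colonial administration.",
    "The library holds a large founded collection of manuscripts.",
]
best = max(passages, key=lambda p: score_passage(question_terms, p, answer_pattern))
print(best)
```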
How to Compose a Complex Document Recognition System. Abstract The technical challenges in document analysis and recognition have been to solve the problems of uncertainty and variability that exist in document images. From our experiences in developing OCRs, business form readers, and postal address recognition engines, I would like to present several design principles to cope with these problems of uncertainty and variability. When the targets of document recognition are complex and diversified, the recognition engine needs to solve many different kinds of pattern recognition (sub)problems, which are concrete phenomena of uncertainty and variability. Inevitably, the engine gets complex as a result. The question is how to combine the subcomponents of a recognition engine so that the total engine produces sufficiently accurate results. These principles will be explained together with examples.
Adaptive Search Systems and Experimentation Methodology. Abstract In this paper, I will explore the domain of adaptive search systems and their evaluation. Current evaluation methodologies, such as the traditional TREC methodology and interactive experimentation methodologies, are inadequate for the development of such systems. The classical approach is less expensive and facilitates comparison between systems; however, it is not suitable for the experimentation of adaptive retrieval systems. On the other hand, the interactive evaluation methodology is expensive in terms of time and other resources. In my presentation, I will explore these issues and propose an approach for the experimentation of adaptive retrieval systems, based on simulated-user experimentation.
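The following sketch illustrates, under invented assumptions, what simulated-user based experimentation can look like: an artificial user scans each ranked list to a fixed depth, "clicks" documents judged relevant in the test collection, and a toy adaptive component re-ranks using those clicks. It is a schematic stand-in, not the approach proposed in the talk.

```python
# A toy simulated-user evaluation loop: all components are illustrative.

import random

RELEVANT = {"d2", "d5", "d7"}                     # relevance judgments (qrels)
DOCS = [f"d{i}" for i in range(1, 11)]

def simulated_user(ranking, depth=5):
    """Relevant documents the simulated user would notice in the top `depth`."""
    return [d for d in ranking[:depth] if d in RELEVANT]

def adaptive_rerank(ranking, clicked):
    """Toy adaptation: promote documents 'similar' to clicked ones
    (similarity is simulated here by adjacency in the DOCS list)."""
    boosted = set()
    for d in clicked:
        idx = DOCS.index(d)
        boosted.update(DOCS[max(0, idx - 1): idx + 2])
    return sorted(ranking, key=lambda d: (d not in boosted, ranking.index(d)))

random.seed(0)
ranking = random.sample(DOCS, len(DOCS))          # initial (random) ranking
found = set()
for iteration in range(3):
    clicks = simulated_user(ranking)
    found.update(clicks)
    ranking = adaptive_rerank(ranking, clicks)
    print(f"iteration {iteration}: found {sorted(found)}")
```

The point of such a setup is that the "user" is cheap and repeatable, so adaptive and non-adaptive configurations can be compared over many simulated sessions.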
Document Image Analysis for Digital Libraries at PARC. Abstract Digital Libraries have many forms -- institutional libraries for information dissemination, document repositories for record-keeping, and personal digital libraries for organizing personal thoughts, knowledge, and course of action. Digital image content (scanned or otherwise) is a substantial component of all of these libraries. Document layout understanding, character recognition, functional role labeling, image enhancement, indexing, organizing, restructuring, summarizing, cross-linking, redaction, privacy management, and distribution are a few examples of tasks on document images in a library that computational assistance could facilitate. At the Palo Alto Research Center, we conduct research on several aspects of document analysis for Digital Libraries, ranging from raw image transformations to linguistic analysis to interactive sensemaking tools. DigiPaper, DataGlyphs, ScanScribe, Document Image Decoding, Ubitext, UpLib, and 3Book are a few examples of PARC research projects. I shall describe three recent research activities in the realm of document image analysis: (1) the Document Image Classification Engine (DICE); (2) robust OCR using Fisher scores from multi-level parameterized template models; and (3) efficient functional role labeling algorithms.
From CLIR to CLIE: Some Experiences in NTCIR Evaluation. Abstract Cross-language information retrieval (CLIR) facilitates the use of one language to access documents in other languages. Cross-language information extraction (CLIE) extracts relevant information at a finer granularity from multilingual documents for specific applications like summarization, question answering, opinion extraction, etc. NTCIR started evaluation of CLIR tasks on Chinese, English, Japanese and Korean languages in 2001. In these 5 years (2001-2005), three CLIR test collections, namely the NTCIR-3, NTCIR-4 and NTCIR-5 evaluation sets, have been developed. In NTCIR-5 (2004-2005), we extended the CLIR task to a CLQA task, which is an application of CLIE. In NTCIR-6 (2005-2006), we further reused the past NTCIR CLIR test collections to build a corpus for opinion analysis, which is another application of CLIE. This paper shows the design methodologies of the test collections for CLIR, CLQA and opinion extraction. In addition, some kernel technologies for these tasks are discussed.
Advances in XML Information Retrieval: The INEX Initiative. Abstract By exposing their logical structure, XML documents offer various opportunities for improved information access. Within the Initiative for the Evaluation of XML Retrieval (INEX), a number of different retrieval tasks are studied:
- Content-only queries aim at retrieving the smallest XML elements that answer the query.
- Content-and-structure queries specify the element types to be returned and also contain restrictions with regard to the elements that must satisfy single query conditions.
- Multimedia retrieval combines these types of queries with conditions referring to the content of images contained in the document.
- Interactive retrieval focuses on the design of appropriate user interfaces for these tasks.
- Document mining aims at the application of clustering, classification and information extraction methods to XML documents.
For each of these tasks, appropriate evaluation metrics have to be defined. Since 2002, INEX has organized annual evaluation campaigns in this area by providing testbeds and the evaluation infrastructure; with more than 70 participating groups from all over the world, INEX is the focal point of research on XML retrieval.
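As an illustration of the content-only task, the sketch below scores every element of a small XML document by query-term density, so the smallest elements answering the query rank first; the document, query and scoring function are invented and are not INEX evaluation code.

```python
# A minimal sketch of content-only XML element retrieval: score each element
# by query-term matches in its text, normalized by length, so small focused
# elements outrank large enclosing ones.

import xml.etree.ElementTree as ET

DOC = """<article>
  <title>XML retrieval</title>
  <section>
    <para>Structured documents allow focused XML retrieval.</para>
    <para>Evaluation needs dedicated metrics.</para>
  </section>
</article>"""

def element_text(elem):
    return " ".join(elem.itertext()).lower().split()

def score(elem, query_terms):
    words = element_text(elem)
    if not words:
        return 0.0
    matches = sum(1 for w in words for q in query_terms if q in w)
    return matches / len(words)          # favour small, focused elements

root = ET.fromstring(DOC)
query = ["xml", "retrieval"]
ranked = sorted(root.iter(), key=lambda e: score(e, query), reverse=True)
for e in ranked[:3]:
    print(e.tag, round(score(e, query), 2))
```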
Using eBooks to Break Down the Bars of Ignorance and Illiteracy. Abstract Michael S. Hart, the founder of Project Gutenberg, will talk about how the world's first electronic library came to be, and its role for the future. Opportunities abound for conference attendees to assist in the Project Gutenberg mission to encourage the creation and distribution of eBooks. This session will provide an overview of Project Gutenberg, including its principle of minimal regulation/administration and its ongoing efforts to foster like-minded efforts. The production and distribution of eBooks -- literature in electronic format -- is challenging and fun, presenting opportunities for curious and talented people to assist. Next-generation efforts include aiming for millions of unique titles, pursuing automated translation into 100 different languages, providing reformatting on the fly, and replicator technology.
How the dragons work: searching in a web. Abstract Search engines -- web dragons -- are the portals through which we access society's treasure trove of information. They do not publish the algorithms they use to sort and filter information, yet how they work is one of the most important questions of our time. Google's PageRank is a way of measuring the prestige of each web page in terms of who links to it: it reflects the experience of a surfer condemned to click randomly around the web forever. The HITS technique distinguishes 'hubs' that point to reputable sources from 'authorities', the sources themselves. This helps differentiate communities on the web, which in turn can tease out alternative interpretations of ambiguous query terms. RankNet uses machine learning techniques to rank documents by predicting relevance judgments based on training data. This talk explains in non-technical terms how the dragons work.
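The random-surfer intuition behind PageRank can be stated in a few lines of code: iteratively redistribute each page's score along its out-links, with a damping factor modelling an occasional jump to a random page. The tiny link graph below is invented for illustration.

```python
# A minimal sketch of PageRank by power iteration over a toy link graph.

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:                       # dangling page: spread score evenly
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
            else:
                for q in outs:
                    new_rank[q] += damping * rank[p] / len(outs)
        rank = new_rank
    return rank

# Tiny web: everyone links to A, and A links to B.
links = {"A": ["B"], "B": ["A"], "C": ["A"], "D": ["A"]}
for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```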
The Connection Between Scientific Literature and Data in Astronomy. Abstract For more than a century, journal articles have been the primary vector transporting scientific knowledge into the future; during this time scientists have also created and maintained complex systems of archives, preserving the primary information for their disciplines. Modern communications and information processing technologies are enabling a synergism between the (now fully digital) archives and journals which can have profound effects on the future of research. During roughly the last 20 years astronomers have been simultaneously building out the new digital systems for data and for literature, and have been merging these systems into a coherent, distributed whole. Currently the system consists of a network of journals, data centers, and indexing agencies, which interact via a massive sharing of meta-data between organizations. The system has been in active use for more than a decade; Peter Boyce named it Urania in 1997. Astronomers are now on the verge of making a major expansion of these capabilities. Besides the ongoing improvement in the capabilities and interactions of existing organizations, this expansion will entail the creation of new archiving and indexing organizations, as well as a new international supervisory structure for the development of meta-data standards. The nature of scientific communication is clearly being changed by these developments, and with these changes will come others, raising questions such as: How will information be accessed? How will the work of individual scientists be evaluated? How will the publishing process be funded?
Toward a common semantics between media and languages. Abstract Semantic interpretation of written texts, sound, speech, and still and moving images has been treated, in the past, by specialists of different backgrounds with little interaction between their approaches. The current growing need to index and retrieve multimedia documents, as well as to develop natural interfaces which must combine results from vision, sound and speech recognition and text understanding, provides a strong incentive for these specialists to share their disjoint experience and stimulates a cross-fertilization of research and development in understanding the semantics present in media. Concept recognition can profit from merging information coming from various media. For example, a person can be recognized using a combination of visual face recognition, sound-based speaker recognition, analysis of the narration track, as well as textual analysis of available subtitles. People from the text community of natural language processing have a long experience in developing indexing tools, developing, en route, extensive linguistic resources (dictionaries, grammars, thesauri, ontologies) to help in language processing. This is also the case for speech processing with its language models. But for images and sounds, the research community lacks large-scale semantic resources. Recognizing objects, situations
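One way to read the person-recognition example above is as late fusion of per-modality confidence scores. The sketch below shows such a weighted fusion; the modalities, scores and weights are made up for illustration and do not come from the talk.

```python
# A minimal sketch of late fusion: each modality reports a confidence per
# candidate identity, and a weighted sum combines the evidence.

def fuse(per_modality_scores, weights):
    """Weighted late fusion of per-modality confidence scores per identity."""
    identities = {name for scores in per_modality_scores.values() for name in scores}
    fused = {}
    for name in identities:
        fused[name] = sum(
            weights[m] * scores.get(name, 0.0)
            for m, scores in per_modality_scores.items()
        )
    return fused

scores = {
    "face":      {"Asha": 0.6, "Ravi": 0.3},
    "speaker":   {"Asha": 0.4, "Ravi": 0.5},
    "subtitles": {"Asha": 0.9},
}
weights = {"face": 0.5, "speaker": 0.3, "subtitles": 0.2}
fused = fuse(scores, weights)
print(max(fused, key=fused.get), fused)
```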
Large-Scale Evaluation Infrastructure for Information Access Technologies on East Asian Languages. Abstract Research and development (R & D) of information access technologies like information retrieval, question answering and summarization requires demonstrating the superiority of proposed systems over previous ones through experiments. Standard test collections are fundamental infrastructure for making such comparative evaluation feasible at reasonable cost; they have contributed to the enhancement of R & D in information access technologies and have sped up research, especially in its earlier stages. In this talk, I describe why and how test collections and evaluation are important for information access technologies, briefly introduce the activities of the NTCIR project, which has organized a series of evaluation workshops on information access technologies using East Asian languages and has attracted international participants, describe the test collections constructed through it, and indicate its implications. Finally, I will discuss some thoughts on the future direction of research in information access and the need for wider collaborations.
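To illustrate how a test collection enables low-cost comparative evaluation, the sketch below computes mean average precision (MAP) for two hypothetical systems from toy relevance judgments and ranked runs; none of the data is from NTCIR.

```python
# A minimal sketch of test-collection evaluation: qrels plus ranked runs
# yield a comparable effectiveness score (MAP) per system.

def average_precision(ranking, relevant):
    hits, precision_sum = 0, 0.0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / i
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs, qrels):
    return sum(average_precision(runs[q], qrels[q]) for q in qrels) / len(qrels)

qrels = {"q1": {"d1", "d3"}, "q2": {"d2"}}
systems = {
    "system_A": {"q1": ["d1", "d2", "d3"], "q2": ["d2", "d4"]},
    "system_B": {"q1": ["d2", "d1", "d3"], "q2": ["d4", "d2"]},
}
for name, runs in systems.items():
    print(name, round(mean_average_precision(runs, qrels), 3))
```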