Precise information retrieval in semantic scientific digital libraries

Main applicant: Gilles Falquet

PhD researcher: Hélène de Ribaupierre


This research project aims at developing models and tools for precise information retrieval and strategic reading in large scientific document collections.

When scientists or engineers look for information in document collections, or on the web, they generally have a precise objective in mind. Instead of looking for documents “about a topic T”, they try to answer specific needs such as finding a concept definition, finding results for a particular problem, checking if an idea has already been tested, or comparing the scientific conclusions of two articles. Helping these users accomplish these tasks requires at least three components: 1) a document indexing model that takes into account the scientific content of documents; 2) a query model to formulate these types of information needs; and 3) a way to efficiently represent query results. This research will cover these three closely related aspects.

The innovative part of the indexing model is the decomposition of documents into fragments that correspond to discourse elements (definition, hypothesis, method, result, etc.). Each type of discourse element will be modeled by defining its specific characteristics. This fragment-based model will be integrated into the domain-dimension-based indexing model that we developed in previous projects [1, 2]. Each fragment will be semantically annotated and indexed with relevant domain concepts and the role they play in the fragment, as well as with metadata.
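As an illustration only, the fragment-based indexing idea could be sketched in Python as follows. The class names, field names, role labels, and sample data are all hypothetical and are not taken from the project or from the models in [1, 2]; the sketch simply assumes that a fragment carries a discourse type, a list of concept/role annotations, and free-form metadata.

```python
from dataclasses import dataclass, field
from enum import Enum


class DiscourseType(Enum):
    # Discourse elements named in the text; the real model may define more.
    DEFINITION = "definition"
    HYPOTHESIS = "hypothesis"
    METHOD = "method"
    RESULT = "result"


@dataclass(frozen=True)
class ConceptAnnotation:
    concept: str  # identifier of a domain concept (hypothetical naming)
    role: str     # role the concept plays in the fragment (hypothetical labels)


@dataclass
class Fragment:
    document_id: str
    discourse_type: DiscourseType
    text: str
    annotations: list = field(default_factory=list)  # ConceptAnnotation items
    metadata: dict = field(default_factory=dict)     # e.g. author, year


def build_index(fragments):
    """Map (discourse type, concept) pairs to the fragments annotated with them."""
    index = {}
    for fragment in fragments:
        for annotation in fragment.annotations:
            key = (fragment.discourse_type, annotation.concept)
            index.setdefault(key, []).append(fragment)
    return index


# Made-up sample data, for illustration only.
example = Fragment(
    document_id="doc-1",
    discourse_type=DiscourseType.DEFINITION,
    text="Placeholder text of a definition fragment.",
    annotations=[ConceptAnnotation(concept="ontology", role="defined term")],
    metadata={"author": "A. Author", "year": "2010"},
)
index = build_index([example])
```

Keying the index on both the discourse type and the concept is one possible design choice; it makes a lookup such as "definitions of concept C" a single dictionary access.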

Our query model will be built on the same model as the indexing model, so that the user can formulate queries that include a discursive type, one or several domains, and several types of metadata, such as author or year of publication. We will implement different query techniques. One will allow the user to combine a discursive type and keywords to find an answer in a specific discursive part of the text. Another will be navigation queries, which allow the user to navigate through the several ontologies defined for indexing documents. The representation model will allow operations that merge or combine retrieved documents or fragments. We will propose an interface that supports strategic reading, such as parallel reading of fragments. To evaluate this approach, we will develop an indexing environment, a query processing system, and a system to generate result documents.
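A minimal sketch of such a query, under stated assumptions, is given below in Python. It assumes a simplified fragment with a discourse type, a set of concept identifiers, and a metadata dictionary; `Query`, `matches`, `search`, and `parallel_view` are hypothetical names, the matching rule (exact discourse type, all concepts present, all keywords found as substrings, all metadata equal) is only one possible interpretation, and the corpus is placeholder data.

```python
from dataclasses import dataclass, field


@dataclass
class Fragment:
    document_id: str
    discourse_type: str
    text: str
    concepts: set = field(default_factory=set)
    metadata: dict = field(default_factory=dict)


@dataclass
class Query:
    discourse_type: str
    concepts: set = field(default_factory=set)
    keywords: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


def matches(fragment, query):
    """Return True if the fragment satisfies every constraint of the query."""
    if fragment.discourse_type != query.discourse_type:
        return False
    if not query.concepts <= fragment.concepts:  # all queried concepts present
        return False
    text = fragment.text.lower()
    if not all(keyword.lower() in text for keyword in query.keywords):
        return False
    return all(fragment.metadata.get(k) == v for k, v in query.metadata.items())


def search(fragments, query):
    """Keep the fragments that match, in corpus order."""
    return [f for f in fragments if matches(f, query)]


def parallel_view(results):
    """Group fragment texts by document, e.g. for side-by-side reading."""
    grouped = {}
    for fragment in results:
        grouped.setdefault(fragment.document_id, []).append(fragment.text)
    return grouped


# Placeholder corpus, for illustration only.
corpus = [
    Fragment("doc-1", "definition", "Placeholder definition text.",
             {"retrieval"}, {"year": "2010"}),
    Fragment("doc-1", "result", "Placeholder result text about precision.",
             {"retrieval"}, {"year": "2010"}),
    Fragment("doc-2", "result", "Placeholder result text about precision and recall.",
             {"retrieval"}, {"year": "2011"}),
]
query = Query(discourse_type="result", concepts={"retrieval"}, keywords=["precision"])
results = search(corpus, query)
views = parallel_view(results)
```

Grouping the matching fragments by document, as `parallel_view` does, is one simple way to prepare the parallel reading of results from two articles; a real system would presumably rank fragments rather than only filter them.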

This research will be carried out in close collaboration with a researcher from outside computer science, who will participate in the definition of a relevant model, the building of a test corpus, and the evaluation of the retrieval system.