BioGAIL Text Mining Software

The BioGAIL text mining software comprises modules for information retrieval and information extraction which have been developed in connection with two ongoing PhD theses, one on information retrieval, the other on information extraction.

Information Retrieval

This thesis concerns the development of an ontology-based query reformulation system for information retrieval (IR). The system's life cycle starts with an initial ontology which evolves with system use. A failure-driven ontology refinement mechanism analyses system performance based on user feedback, focusing on irrelevant documents that were retrieved by mistake. The system then determines which portion of the ontology requires improvement in order to avoid such errors in the future. It calls on an information extraction module to provide candidate revisions and selects the best lexical items, concepts, or relations to incorporate into the ontology. The version available here uses the initial ontology, before any learning has taken place. The user can choose between standard search and search with ontology-based query reformulation.

  Click here to use the biological IR system. 
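
As an illustration of what ontology-based query reformulation can look like, here is a minimal sketch in Python. It assumes that reformulation takes the form of query expansion, where each query term known to the ontology is replaced by an OR-group of related lexical items; the text above does not say which reformulation strategy BioGAIL actually uses. The toy ontology, the function name reformulate, and the Boolean output syntax are all hypothetical.

```python
from typing import Dict, List

# Hypothetical toy ontology: term -> related lexical items.
# A real ontology would also hold concepts and relations.
TOY_ONTOLOGY: Dict[str, List[str]] = {
    "tumor": ["tumour", "neoplasm"],
    "kinase": ["phosphotransferase"],
}


def reformulate(query: str, ontology: Dict[str, List[str]]) -> str:
    """Expand every query term found in the ontology into an OR-group.

    Terms unknown to the ontology are passed through unchanged.
    The expanded terms are joined with AND (an assumption of this sketch).
    """
    parts = []
    for term in query.lower().split():
        related = ontology.get(term, [])
        if related:
            parts.append("(" + " OR ".join([term] + related) + ")")
        else:
            parts.append(term)
    return " AND ".join(parts)


print(reformulate("tumor kinase inhibitor", TOY_ONTOLOGY))
```

In this sketch, "standard search" corresponds to sending the query as typed, while the reformulated search sends the expanded string instead.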

A more elaborate version has been developed for researchers in biomarker discovery. The user typically enters a query concerning a specific disease. The system searches PubMed (performing query reformulation if the user so desires) and analyses the retrieved documents for occurrence patterns of putative protein biomarkers. Protein synonyms are conflated to ensure consistent and non-redundant biomarker identification. The candidate biomarkers are then ranked based on their co-occurrence with the target disease in the relevant documents, and the ranking is displayed in the form of an Excel file. The user is encouraged to use the checkboxes in the Excel file to provide feedback on the biological plausibility of the ranked proteins. A hyperbolic co-occurrence graph is also available for easy inspection of the candidate proteins.

 Click here to enter the biomarker-discovery oriented IR system.
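
The synonym conflation and co-occurrence ranking described above can be sketched as follows. This is only an illustration under stated assumptions: co-occurrence is measured here as the number of documents that mention both the disease and the protein, and mentions are found by naive case-insensitive substring matching. The actual scoring measure and matching method used by the system are not specified in the text. The function name rank_biomarkers and the synonym table format are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def rank_biomarkers(
    docs: List[str], disease: str, synonyms: Dict[str, str]
) -> List[Tuple[str, int]]:
    """Rank proteins by the number of documents mentioning them with the disease.

    `synonyms` maps every surface form (including the canonical name itself)
    to one canonical protein name, so that synonyms are conflated.
    """
    counts: Counter = Counter()
    for doc in docs:
        text = doc.lower()
        if disease.lower() not in text:
            continue  # only documents mentioning the target disease count
        # A set ensures each protein is counted at most once per document,
        # even if several of its synonyms occur. Substring matching is crude
        # and stands in for proper entity recognition.
        found = {canon for form, canon in synonyms.items() if form.lower() in text}
        counts.update(found)
    # Highest count first; ties broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

For example, with a synonym table mapping both "PSA" and "KLK3" to "KLK3", a document mentioning PSA and another mentioning KLK3 both contribute to the same candidate rather than to two separate entries.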

Information Extraction

The focus of this research is information extraction (IE) in the absence of pre-annotated corpora or pre-defined domain templates. The goal is to relieve the user of the time-consuming tasks of annotating large corpora and designing extraction templates to meet a given information need. Users need only provide corpora labeled at the sentence level, in which each sentence has been marked as relevant or not to their search focus.

Nevertheless, even the reduced burden of selecting and labeling training documents may prove infeasible for the casual user. This is where the idea of a Lego-like IE server comes in. Provided that the IE system has previously been trained on similar topics, new users can reuse the extraction patterns and templates derived from previously analysed labeled corpora. The LegoIE server proposes elementary building blocks in the form of single-slot extraction patterns; users can select any number of patterns and combine them into templates that fit their specific purposes, then submit these templates for completion by BioGAIL. Ideally, the IR module would query PubMed and pass the retrieved documents to the IE module, which would then extract the relevant sentences and parse and analyse them to fill the selected templates. The current prototype demonstrates system capabilities using pre-parsed sentences (either available on the server or uploaded by the user) concerning protein function and structure or protein-disease associations.

 Click here to enter the LegoIE system.
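
The building-block idea can be illustrated with a small sketch in which single-slot patterns are combined into a template and filled from a sentence. Note the assumptions: the patterns here are plain regular expressions over raw text, whereas the prototype works on pre-parsed sentences, and the actual pattern representation used by LegoIE is not described in the text. The names SlotPattern and fill_template are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class SlotPattern:
    """A single-slot extraction pattern: one slot name, one regex.

    The regex must contain exactly one capture group, whose match
    becomes the slot filler. (Regexes stand in for parse-based patterns.)
    """

    slot: str
    regex: str

    def extract(self, sentence: str) -> Optional[str]:
        match = re.search(self.regex, sentence)
        return match.group(1) if match else None


def fill_template(
    patterns: List[SlotPattern], sentence: str
) -> Dict[str, Optional[str]]:
    """Treat the selected patterns as one template and fill it from a sentence.

    Slots whose pattern does not match are left as None.
    """
    return {p.slot: p.extract(sentence) for p in patterns}


# A user-assembled template for protein-disease associations.
association_template = [
    SlotPattern("protein", r"(\w+) is associated with"),
    SlotPattern("disease", r"associated with ([\w ]+?)\."),
]
print(fill_template(association_template, "BRCA1 is associated with breast cancer."))
```

Because each pattern fills exactly one slot, a user can mix and match patterns freely; a different selection of patterns yields a different template without retraining.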

Last update: 2/11/2006 by Melanie Hilario