EU FP5 Quality of Life Project
BioMinT: Biological Text Mining
Abstract
Genome research has spawned unprecedented volumes of data, but characterisation
of DNA and protein sequences has not kept pace with the rate of data acquisition.
For anyone trying to learn more about a given sequence, the worldwide collection
of abstracts and papers remains the ultimate information source. The goal
of the BioMinT project is to develop a generic text mining tool that
(1) interprets diverse types of query, (2) retrieves relevant documents
from the biological literature, (3) extracts the required information,
and (4) outputs the result as a database slot filler or as a structured
report. The BioMinT tool will thus operate in two modes. As a curator's
assistant, it will be validated on SWISS-PROT and PRINTS; as a researcher's
assistant, its reports will be submitted to the scrutiny of biologists
in academia and industry. The project will be conducted by an interdisciplinary
team from biology, computational linguistics, and data/text mining.
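The four steps named above can be illustrated with a minimal sketch. It is not the BioMinT implementation: it assumes plain keyword matching in place of the semantic retrieval and relational learning the project proposes, and every name in it (Document, interpret_query, retrieve, extract, render, the "FUNCTION" slot) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str  # hypothetical identifier, e.g. an abstract number
    text: str


# Toy stopword list; a real system would use proper linguistic analysis.
STOPWORDS = {"the", "of", "a", "an", "is", "what", "in"}


def interpret_query(query):
    """Step 1: reduce a free-text query to a set of lowercase keywords."""
    words = query.lower().replace("?", "").split()
    return {w for w in words if w not in STOPWORDS}


def retrieve(keywords, corpus, top_k=2):
    """Step 2: rank documents by how many query keywords they contain."""
    scored = []
    for doc in corpus:
        tokens = set(doc.text.lower().replace(".", "").split())
        score = len(keywords & tokens)
        if score > 0:
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def extract(keywords, docs):
    """Step 3: keep the sentences that mention at least one keyword."""
    hits = []
    for doc in docs:
        for sentence in doc.text.split("."):
            sentence = sentence.strip()
            if sentence and keywords & set(sentence.lower().split()):
                hits.append((doc.doc_id, sentence))
    return hits


def render(hits, mode):
    """Step 4: emit a slot filler (curator) or a short report (researcher)."""
    if mode == "curator":
        # "FUNCTION" is an assumed slot name, not a real database field spec.
        return {"FUNCTION": [sentence for _, sentence in hits]}
    return "\n".join(f"[{doc_id}] {sentence}." for doc_id, sentence in hits)
```

Running the same hits through render with mode "curator" or "researcher" mirrors the two operating modes: one query, one retrieval and extraction pass, two output formats.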
Objectives
Overall, the objective is to develop a generic text mining tool which performs
information retrieval and extraction in one of two modes. As a curator's
assistant, it outputs a query result as a database field filler according
to a prespecified template; as a researcher's assistant, it generates
a structured report in readable prose. The underlying technological
challenges include: the investigation of semantic content-based information
retrieval and extraction; customization of natural language processing
techniques to biological texts; and application of relational data/text
mining techniques. The business objective is the commercialization of this
tool; the target market includes biotech and pharmaceutical companies that
maintain databases or otherwise depend on efficient and reliable information
retrieval and extraction.
Description of work
We will develop a generic text mining tool to support the professional
activities of biology researchers and database annotators. The core of
this system will consist of information retrieval (IR) techniques for identifying
documents that are relevant to a given query and information extraction
(IE) techniques for discovering the required answers. We will take a strictly
problem-oriented, rapid prototyping approach to system development. All
design decisions will be based on input from those who will use the final
product in their daily work: (1) curators of SWISS-PROT and PRINTS, and
(2) biology researchers from partner institutions and interested companies.
Initially, IR and IE technologies will be developed independently based
on training and evaluation data provided by the user partners (biologists).
The first prototype (month 12) will feature a graphical user interface
and will already be able to produce simple reports and annotate selected
SWISS-PROT and PRINTS fields. Subsequently, the IR and IE technologies
will be gradually integrated into a coherent system. The power of our IR
and IE techniques will come from their seamless integration of biochemical
background knowledge and natural language processing techniques, which
is enabled by the use of state-of-the-art relational learning algorithms.
After the completion of the second prototype (month 24), the system will
be extended with an automated update module that regularly checks existing
database annotations against the recent relevant literature and suggests
additional annotations to the database curator. It will also be generalized
to other applications based on input from the End User Club, which will
be formed at project kickoff. The final BioMinT text mining system (month
36) will thus be useful for a wide variety of bio-databases and other applications
that require automated information extraction from the scientific literature.
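The automated update module described above amounts to comparing what a database entry already says with what has been extracted from recent literature. A minimal sketch follows, assuming annotations are plain strings compared case-insensitively; the function name and the data layout are hypothetical, and a real module would need far more robust matching than string equality.

```python
def suggest_annotations(existing, extracted):
    """Return, per entry, the extracted statements not yet annotated.

    existing:  dict mapping an entry ID to its current annotation strings
    extracted: dict mapping an entry ID to statements mined from literature
    Both layouts are assumptions made for this illustration.
    """
    suggestions = {}
    for entry_id, candidates in extracted.items():
        # Normalise known annotations so trivial case differences do not count as new.
        known = {a.strip().lower() for a in existing.get(entry_id, [])}
        new = [c for c in candidates if c.strip().lower() not in known]
        if new:
            suggestions[entry_id] = new
    return suggestions
```

The result is only a list of suggestions; the decision to accept an annotation stays with the database curator, as in the workflow described above.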
Local contact: Melanie.Hilario@cui.unige.ch