Combining Machine Learning and Natural Language Processing for Knowledge
Discovery in Text Corpora
Partners:
| University of Geneva, Dept. of Computer Science | University of Geneva, Dept. of Linguistics |
| Dr. James Henderson | Prof. Paola Merlo |
| Prof. Christian Pellegrini | Prof. Eric Wehrli |
Summary:
Technology for accessing information from text collections is moving
beyond information retrieval to more complicated tasks such as
summarization and question answering. Consequently, document
representation must shift from representations encoding the topic of a
text to representations encoding its semantic content. Such a shift
would also be useful in text mining applications. Current text mining
systems can produce clusterings and visualizations based on the topics
of the text, using the same representations used in information
retrieval. To support applications like multi-document summarization,
such a clustering is not enough: different articles on the same topic
may say very different things about that topic, and may even contradict
each other.
In this project we propose to develop a text mining
system which can convey to a user both the range of topics in
a text collection and the range of statements made about those
topics. The hybrid Machine Learning and Natural Language Processing
(NLP) system we have developed for clustering, labeling, and
visualizing text collections will form the testbed for new document
representations and new data processing techniques.
- NLP Technology: We propose to use NLP technology to develop
text document representations which encode the information necessary
to cluster documents according to the summaries they should have.
Specifically, NLP techniques will be used to extract from text
documents the semantic relationships between verbs and their
arguments. These relationships will then become the terms in a
vectorial representation of the document, making it possible to apply
our existing system to these document representations. This stage of
the project will address an area of NLP that has only very recently
reached useful levels of robustness, and will push forward the
frontier of this research.
- Compression and Smoothing: Machine Learning techniques will be
central to the success of this project, due to the statistical
properties of the required document representations and the
computational demands of the text mining system. The vectorial
representations will be extremely large and extremely sparse. The size
of the vector requires compression techniques to make the system
computationally tractable. The sparseness of the vector requires
smoothing techniques to prevent unimportant distinctions, such as choice
of words, from being the dominant factor influencing the results of
the text mining system. In this phase of the project, we will profit
from the experience gained in implementing the previous version of the
system, and we will test the feasibility of our approach on an even
more demanding task, where the optimization needs will be extreme.
- Multi-document Summarization: Machine Learning will also be
important in the corpus-based multi-document summarization techniques
which we will add to the system's labeling methods. The clusters
found by the proposed system will be labeled using methods from the
emerging field of multi-document summarization. The result will be a
2-dimensional map of short summaries conveying the important
information in the document collection. We will then be able to
evaluate the validity of the document representations and the quality
of the clusters through the system's ability to support multi-document
summarization.
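As an illustration of the document representation described under "NLP Technology" above, the following is a minimal sketch, not the project's implementation. It assumes that a parser has already produced (verb, role, argument) triples for each document; the triple format, the role names, the function names, and the toy documents are all hypothetical. It contrasts a topic-style bag of words with a vector whose terms are verb-argument relations.

```python
from collections import Counter
from math import sqrt


def relation_terms(triples):
    """Count each (verb, role, argument) triple as one vector term."""
    return Counter(f"{verb}:{role}:{arg}" for verb, role, arg in triples)


def bag_of_words(triples):
    """Topic-style baseline: count the words and ignore who did what to whom."""
    words = Counter()
    for verb, _role, arg in triples:
        words[verb] += 1
        words[arg] += 1
    return words


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical parser output for two documents on the same topic
# that make opposite statements about it.
doc_a = [("acquire", "agent", "alpha"), ("acquire", "patient", "beta")]
doc_b = [("acquire", "agent", "beta"), ("acquire", "patient", "alpha")]

print(cosine(bag_of_words(doc_a), bag_of_words(doc_b)))
print(cosine(relation_terms(doc_a), relation_terms(doc_b)))
```

In this toy case the two documents contain the same words, so the word-based vectors are indistinguishable, while the relation-based vectors share no terms. This is the kind of distinction a topic-only representation cannot capture. It also shows why relation terms are much sparser than word terms.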
The result of combining multi-document summarization techniques with
Natural Language Processing technology and Machine Learning methods
will be a truly novel form of text mining system, based on the
semantic content of documents, rather than just their topic.
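The compression and smoothing requirements described in the list above can be illustrated with a small sketch. The summary does not say which techniques the project uses, so the two shown here are stand-ins chosen for brevity: feature hashing for compression, and backing off from a specific argument word to a word class for smoothing. The "verb:role:argument" term format, the word-class table, and all function and parameter names are assumptions made for this example.

```python
import hashlib
from collections import Counter


def bucket(term, n_buckets):
    """Map a term to a stable bucket index (the hashing trick)."""
    digest = hashlib.md5(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def compress(terms, n_buckets=1024):
    """Fold a large sparse term vector into a fixed-size dense vector."""
    dense = [0.0] * n_buckets
    for term, count in terms.items():
        dense[bucket(term, n_buckets)] += count
    return dense


def smooth(terms, word_class, backoff_weight=0.5):
    """Add a down-weighted copy of each term with its argument replaced
    by a coarser word class, so that near-synonymous word choices overlap."""
    smoothed = Counter(terms)
    for term, count in terms.items():
        verb, role, arg = term.split(":")
        cls = word_class.get(arg)
        if cls is not None:
            smoothed[f"{verb}:{role}:{cls}"] += backoff_weight * count
    return smoothed


# Hypothetical word classes; in practice these might come from a
# thesaurus or from distributional clustering.
classes = {"car": "VEHICLE", "truck": "VEHICLE"}
terms_a = smooth(Counter({"buy:patient:car": 1}), classes)
terms_b = smooth(Counter({"buy:patient:truck": 1}), classes)

shared = set(terms_a) & set(terms_b)
vector_a = compress(terms_a, n_buckets=64)
```

After smoothing, the two term vectors share the backed-off term even though the original words differ, so the choice of word no longer dominates the comparison. Hashing bounds the vector length regardless of vocabulary size, at the cost of occasional collisions between unrelated terms.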
Keywords: Text Mining, Statistical Natural Language Processing,
Clustering, Visualization, Hybrid Intelligent Systems,
Multi-Document Summarization.
Contact: J. Henderson