Combining Machine Learning and Natural Language Processing for Knowledge Discovery in Text Corpora


Partners:

University of Geneva, Dept. of Computer Science
 Dr. James Henderson
 Prof. Christian Pellegrini

University of Geneva, Dept. of Linguistics
 Prof. Paola Merlo
 Prof. Eric Wehrli

Summary:

 
Technology for accessing information from text collections is moving beyond information retrieval to more complex tasks such as summarization and question answering. Consequently, document representation must shift from representations encoding the topic of a text to representations encoding its semantic content. Such a shift would also be useful in text mining applications. Current text mining systems can produce clusterings and visualizations based on the topics of the text, using the same representations used in information retrieval. Such a clustering is not enough to support applications like multi-document summarization. Different articles on the same topic may say very different things about it, and may even contradict each other. In this project we propose to develop a text mining system that can convey to a user both the range of topics in a text collection and the range of statements made about those topics. The hybrid Machine Learning and Natural Language Processing (NLP) system we have developed for clustering, labeling, and visualizing text collections will form the testbed for new document representations and new data processing techniques. The result of combining multi-document summarization techniques with NLP technology and Machine Learning methods will be a novel form of text mining system, based on the semantic content of documents rather than just their topic.
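
The limitation of topic-based representations noted above can be illustrated with a minimal sketch. It is not the project's system: the bag-of-words vectors and cosine similarity below are a generic stand-in for the representations used in information retrieval, and the three documents and all function names are hypothetical, invented for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Term-frequency vector over lowercased words (a topic-style representation)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical documents: doc_a and doc_b share a topic but contradict
# each other; doc_c is about a different topic.
doc_a = "The central bank raised interest rates this quarter."
doc_b = "The central bank did not raise interest rates this quarter."
doc_c = "The football team won the final match on Sunday."

sim_ab = cosine(bag_of_words(doc_a), bag_of_words(doc_b))
sim_ac = cosine(bag_of_words(doc_a), bag_of_words(doc_c))

print(f"same topic, contradictory statements: {sim_ab:.2f}")
print(f"different topics:                     {sim_ac:.2f}")
```

Under this representation the two contradictory documents score as far more similar to each other than to the unrelated one, so a clustering built on it would group them together without revealing that they make opposite statements.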

Keywords: Text Mining, Statistical Natural Language Processing, Clustering, Visualization, Hybrid Intelligent Systems, Multi-Document Summarization.



Contact:  J. Henderson