Mass Spectrometry Data Mining for Early Diagnosis and Prognosis of Stroke


International Framework

Funding by the Swiss Office for Education and Science (OFES) in the framework of the European COST Action 282: Knowledge Exploration in Science and Technology KnowlEST.

Partners

Abstract

A cerebrovascular accident, also called stroke or brain attack, is an interruption of the blood supply to any part of the brain, caused by blockage or rupture of a blood vessel and resulting in damaged brain tissue. Early diagnosis of a stroke, combined with appropriate treatment, would reduce the risk of death and improve the chances of recovery. Two types of stroke exist, ischemic and hemorrhagic; the ischemic type accounts for 85% of cases. Once the diagnosis of a stroke is established, the physician needs to know the type, the extent and the location of the accident in order to prescribe the most suitable treatment. As there is no specific and unique symptom, no early diagnostic marker and no universal treatment, it is of major interest to develop new approaches for discovering early diagnostic and prognostic markers of stroke. More precisely, the goal of this project is the discovery of early diagnostic markers of stroke, that is, markers usable within three hours of the admission to the emergency ward of a patient with possible symptoms of a stroke. Ideally, the diagnostic test would identify the type of stroke, i.e. ischemic or hemorrhagic, and the extent of brain damage.

Mass spectrometry has become a tool of choice in the search for biomarkers which may lead to new methods of disease diagnosis, prognosis and therapy. By comparing mass spectra of diseased and control samples, we can extract discriminatory patterns which reflect disease-correlated alterations in the expression levels of proteins. The approach has been actively investigated with a view to the early detection of cancer.

The concrete deliverable of this joint collaboration is a prototype tool for mining mass spectra to uncover diagnostic patterns for stroke. It will provide the environment and software covering all the essential phases of mass-spectra-driven biomarker pattern extraction: mass spectra processing; dimensionality reduction; model construction, evaluation and interpretation. The prototype is not limited to stroke pathology; it can be applied to discover biomarkers from mass spectral data for any kind of pathology.

In the scope of this collaboration we developed a complete preprocessing pipeline of mass-spectral data in order to convert the raw mass spectral data to a form that is usable by machine learning and data mining algorithms (the learning task is that of classification).

Once the mass spectral data are preprocessed, they can be given to any machine learning or data mining algorithm for analysis. We have developed methods to measure the stability of the produced classification models, in order to present the domain specialists with a quantification of the sensitivity of the different algorithms to different training sets. Domain specialists prefer models that do not change significantly with perturbations of the training set.

In summary we tackle the complete cycle of proteomics mass spectral analysis, starting from the preprocessing of raw mass spectra data, to the construction of classification models and their evaluation both in terms of classification performance and model stability.

Objectives

Discover highly specific and highly sensitive biomarkers for the early diagnosis of stroke.

Description of work

  • Establish a complete preprocessing pipeline that takes the raw mass spectra as input and outputs a set of features that is uniform across all spectra. Each feature corresponds to a peak.
  • Build classification models from the preprocessed spectra with the goal of discovering a small set of sensitive and specific biomarkers.
  • Evaluate the produced models in terms of accuracy, sensitivity, specificity, and stability.

    Introduction to Proteomics and Mass Spectrometry

    Recent breakthroughs in genomics and proteomics have raised hopes of finding useful biomarkers, i.e., meaningful indicators of biological states or processes that can be used for disease diagnosis, monitoring, and control. Genomics-based approaches have made it possible to identify changes in gene expression as markers of certain diseases. However, these achievements must be qualified by a major caveat: gene expression does not always correlate with protein expression, which is more closely linked to the actual functional state of cells. A single gene can lead to a large number of protein products via a complex process which potentially involves, e.g., alternative splicing, post-translational modifications, and proteolytic cleavages.

    In contrast to the genome, the proteome (the ensemble of protein forms expressed in a biological sample at a given point in time) reflects both the intrinsic genetic program of the cell and the impact of its immediate environment. Clinical proteomics aims at investigating changes in protein expression in order to discover new disease markers and drug targets. A variety of proteomics workflows have been developed, the common denominator of which is the use of mass spectrometry.

    Mass spectrometry (MS) is emerging as an important tool for biomarker discovery. Body fluids such as serum or urine can be routinely used to generate protein profiles (m/z ratios versus signal intensities) replete with potential disease markers, whether individual proteins or sets of interacting proteins. To discover these biomarker patterns, the data miner must face a number of technical challenges, foremost among which is the extremely high dimensionality of mass spectra. A typical mass spectrum has several thousand attributes that exhibit a high degree of spatial redundancy. The exact number depends on the type of mass spectrometry instrument used, its resolution, and the mass range it covers. Mass spectra are acquired from a body fluid for all the individuals included in the study. Usually only a small number of individuals are included; the number of samples ranges from a couple of dozen to at most a few hundred, depending on ease of access to diseased individuals. The data miner therefore has to deal with the problem of high dimensionality combined with small sample size.

    Mass spectra preprocessing

    Briefly, when a biological sample is submitted to mass spectrometry it is applied to a surface known as a ProteinChip. The sample is mixed with an energy-absorbing matrix that makes it crystallize as it dries. The ProteinChip is then placed into a vacuum chamber and is hit by a laser. The matrix absorbs energy and transfers it to the proteins of the biological sample. As a result, proteins desorb and ionize. Next, a brief electric field is applied which accelerates the ionized proteins into a flight tube, where they drift until they strike a detector that records the time of flight. Given the length of the tube and the applied voltage, a quadratic transformation is used to derive the mass-to-charge ratio (m/z) of the protein from the time of flight. M/z values are directly related to the mass of the corresponding molecules. The spectral data that result from this experiment consist of the sequentially recorded number of ions arriving at the detector (the intensity) coupled with the corresponding m/z values, which essentially represent the distribution of masses in the biological sample. Peaks in the intensity plot ideally correspond to individual proteins and are related to their concentration in the biological sample.
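
    The quadratic relation between time of flight and m/z can be illustrated with an idealized drift-tube model, in which the ion's kinetic energy equals the charge times the accelerating voltage. This is only a sketch under that assumption: real instruments apply an empirical calibration with fitted coefficients, and the function and parameter names below are hypothetical.

```python
# Idealized time-of-flight model (an assumption, not the instrument's calibration):
# z * e * U = 1/2 * m * v^2 with v = L / t, hence m/z = 2 * e * U * t^2 / L^2.
ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs
DALTON = 1.66053906660e-27           # kilograms per dalton


def tof_to_mz(time_s, tube_length_m, voltage_v):
    """Return m/z in daltons per elementary charge for a given time of flight."""
    velocity = tube_length_m / time_s
    return 2.0 * ELEMENTARY_CHARGE * voltage_v / (DALTON * velocity ** 2)


# Doubling the time of flight multiplies m/z by four (quadratic dependence).
ratio = tof_to_mz(2e-5, 1.0, 20000.0) / tof_to_mz(1e-5, 1.0, 20000.0)
```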

    A mass spectrum can be represented by a vector whose dimensionality is equal to the number of distinct m/z values recorded by the spectrometer; the value of each dimension is the intensity at the corresponding m/z value. The intensities of neighboring m/z values are highly correlated, hence the high level of spatial redundancy within the features describing a mass spectrum.

    There are many different types of mass spectrometry instruments. We are dealing with spectra produced by SELDI mass spectrometers of Ciphergen Biosystems Inc, but the general principles and methods are the same for other types of instruments.

    Mass spectra have several imperfections which can complicate their interpretation, so a considerable amount of time and effort should be spent on preprocessing and feature extraction in order to improve their quality and reduce their dimensionality. Preprocessing of mass spectra can roughly be divided into several subtasks: baseline removal, denoising, smoothing, normalization, peak detection and peak calibration. In the next paragraphs we describe how we tackle these issues. For more details see PRA04, KAL05.

  • Baseline removal
  • The baseline is an offset of the intensities which occurs mainly at low masses and varies between spectra. It is mainly caused by molecules of the energy-absorbing matrix and has to be subtracted from the signal before any further processing takes place, so that comparisons between intensities of m/z values of different spectra are meaningful. To compute the baseline we apply a locally weighted quadratic fit to the list of local minima extracted from the spectrum; this first fit smooths out small variations. A new search for local minima is then performed on the fitted values. Using the new local minima, the signal is split into piecewise constant parts, and the final baseline is computed by reapplying the initial locally weighted quadratic fit to the piecewise constant signal.
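
    The general idea of estimating a baseline from local minima and subtracting it can be sketched as follows. This is a deliberately simplified stand-in: it replaces the locally weighted quadratic fit and the second pass over the minima with plain linear interpolation between local minima, and the function names are hypothetical.

```python
def local_minima(values):
    """Indices of interior local minima, plus both end points as anchors."""
    idx = [0]
    for i in range(1, len(values) - 1):
        if values[i] <= values[i - 1] and values[i] <= values[i + 1]:
            idx.append(i)
    idx.append(len(values) - 1)
    return idx


def estimate_baseline(intensities):
    """Linear interpolation between local minima (simplification of the fit)."""
    if len(intensities) < 2:
        return list(intensities)
    anchors = local_minima(intensities)
    baseline = [0.0] * len(intensities)
    for left, right in zip(anchors, anchors[1:]):
        span = right - left
        for i in range(left, right + 1):
            frac = (i - left) / span
            baseline[i] = (1 - frac) * intensities[left] + frac * intensities[right]
    return baseline


def remove_baseline(intensities):
    """Subtract the estimated baseline and clip negative values to zero."""
    baseline = estimate_baseline(intensities)
    return [max(0.0, y - b) for y, b in zip(intensities, baseline)]


corrected = remove_baseline([5.0, 6.0, 5.0, 9.0, 5.0, 6.0, 5.0])
```

    On a noisy spectrum every small fluctuation creates a local minimum, which is why a smoothing fit over the minima, as described in the text, is needed in practice.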

  • Denoising and Smoothing
  • Mass spectra are affected by two types of noise: electrical noise, a result of the instrument used, and chemical noise, a result of contaminants in the sample or of matrix molecules. Chemical noise manifests itself mainly in the baseline; electrical noise alters the intensity in a random manner. To denoise and smooth the signal we use wavelet decomposition coupled with a median filter; a detailed description is given in KAL05. One of the critical issues in denoising and smoothing is setting the values of the preprocessing parameters. In KAL05 we presented a method for parameter tuning that is based on cross-validation and selects the set of parameters that maximizes classification performance.
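
    A minimal sketch of the two ingredients named above, a median filter and wavelet-based denoising, is given here. The choice of the Haar wavelet, a single decomposition level, soft thresholding, and the window and threshold parameters are all assumptions made for illustration; the actual configuration is described in KAL05.

```python
import math
from statistics import median


def median_filter(values, window=3):
    """Sliding median with an odd window; edges are padded by replication."""
    half = window // 2
    n = len(values)
    return [
        median(values[min(max(j, 0), n - 1)] for j in range(i - half, i + half + 1))
        for i in range(n)
    ]


def haar_denoise(values, threshold):
    """One-level Haar transform, soft-threshold the details, then invert."""
    if len(values) % 2:
        raise ValueError("an even number of samples is required")
    s = math.sqrt(2.0)
    pairs = [(values[i], values[i + 1]) for i in range(0, len(values), 2)]
    approx = [(a + b) / s for a, b in pairs]
    detail = [(a - b) / s for a, b in pairs]
    # Soft thresholding shrinks every detail coefficient towards zero.
    shrunk = [math.copysign(max(abs(d) - threshold, 0.0), d) for d in detail]
    out = []
    for a, d in zip(approx, shrunk):
        out.extend([(a + d) / s, (a - d) / s])
    return out


spiky = [1.0, 9.0, 1.0, 1.0, 1.0]
despiked = median_filter(spiky)
```

    In a cross-validated setup such as the one described in the text, `window` and `threshold` would be among the parameters tuned for classification performance.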

  • Normalization
  • Even after baseline correction, denoising and smoothing, large experimental variations may remain in the data, since signal intensities can change between experiments due, for example, to different analyte concentration or ionization efficiency. To make them less dependent on experimental conditions, signal intensities are normalized by the total ion current, which is equivalent to normalizing with the $L_1$ norm of the spectrum.
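
    Total ion current normalization amounts to dividing every intensity by the sum of absolute intensities of the spectrum. A minimal sketch, with a hypothetical function name:

```python
def normalize_tic(intensities):
    """Divide each intensity by the L1 norm (total ion current) of the spectrum."""
    total = sum(abs(y) for y in intensities)
    if total == 0:
        raise ValueError("spectrum has zero total ion current")
    return [y / total for y in intensities]


normalized = normalize_tic([1.0, 3.0])
```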

  • Peak Detection
  • Peak detection is the detection of local maxima in the mass spectrum. This is a rather crude approach since it overestimates the number of peaks: some of them may be random fluctuations that remain even after denoising and smoothing. One possible way to reduce the number of spuriously detected peaks is to take into account their signal-to-noise ratio, retaining a peak if its ratio is higher than a given value and discarding it otherwise. Nevertheless, we opt for a conservative approach and retain all detected peaks, since we have no knowledge of the properties and the form of the peaks and of what really constitutes a peak. In this way we retain the maximum possible amount of information, but we also retain noise. We therefore need a learning algorithm that is robust to noise, since some of the features will be just that, and we let the learning algorithm determine what is of real discriminating value. A peak collectively represents all the m/z values that define it, that is, all values from its closest local minimum on the left to its closest local minimum on the right. The intensities of all these neighboring m/z values exhibit a high level of redundancy, so by representing a spectrum only via its peaks we considerably reduce the level of spatial redundancy.
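
    The conservative strategy described above, keeping every local maximum together with the span between its neighboring local minima, can be sketched as follows. The function name is hypothetical, and flat-topped peaks (plateaus) are ignored for simplicity because only strict maxima are reported.

```python
def detect_peaks(intensities):
    """Return (apex, left, right) index triples for every strict local maximum.

    `left` and `right` are the closest local minima on either side of the apex.
    """
    peaks = []
    n = len(intensities)
    for i in range(1, n - 1):
        if intensities[i] > intensities[i - 1] and intensities[i] > intensities[i + 1]:
            left = i
            while left > 0 and intensities[left - 1] < intensities[left]:
                left -= 1
            right = i
            while right < n - 1 and intensities[right + 1] < intensities[right]:
                right += 1
            peaks.append((i, left, right))
    return peaks


found = detect_peaks([0.0, 1.0, 3.0, 1.0, 0.0, 2.0, 0.0])
```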

  • Peak Calibration
  • Peak calibration establishes which peaks among different spectra correspond to the same peak, i.e. the same protein. The difficulty is that there is an instrument measurement error on the order of 0.03%-0.06% of the measured mass, which has to be accounted for when deciding whether two peaks appearing in different spectra are the same or not. We use the approach followed in PRA04, which is essentially complete linkage hierarchical clustering with some additional domain constraints. Detected peaks from the different spectra are pooled together and clustering is performed only on their mass dimension. The constraints state that two clusters of masses cannot be merged if their distance is greater than the measurement error or if the merge would result in two masses from the same spectrum appearing in the same cluster. The final clusters contain masses from different spectra that correspond to the same peak.
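
    A minimal sketch of constrained complete linkage clustering on the mass dimension follows. It is a naive quadratic-time illustration, not the implementation of PRA04; the function name, the input layout and the way the relative error is turned into a tolerance (a fraction of the smallest mass involved) are assumptions.

```python
def calibrate_peaks(peaks, rel_error=0.0006):
    """Cluster pooled peaks given as (spectrum_id, mass) pairs.

    Two clusters are merged only if their complete linkage distance is within
    the tolerance and they share no spectrum. Returns a list of clusters.
    """
    clusters = [[p] for p in sorted(peaks, key=lambda p: p[1])]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                # Constraint: at most one peak per spectrum in a cluster.
                if {s for s, _ in a} & {s for s, _ in b}:
                    continue
                # Complete linkage: largest pairwise distance between clusters.
                dist = max(abs(x - y) for _, x in a for _, y in b)
                # Assumed tolerance: relative error times the smallest mass.
                tol = rel_error * min(m for _, m in a + b)
                if dist <= tol and (best is None or dist < best[0]):
                    best = (dist, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]


pooled = [("s1", 1000.0), ("s2", 1000.25), ("s3", 1000.125),
          ("s1", 1002.0), ("s2", 1002.25)]
aligned = calibrate_peaks(pooled)
```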

    Feature Extraction

    Feature extraction is the combined effect of all the preprocessing steps. However, three of them are central: denoising-smoothing, peak detection and peak calibration. The first step determines how many peaks, and hence features, will be preserved in the preprocessed spectrum; it thus indirectly determines the dimensionality of the final feature space. The latter two are the actual feature extraction steps. Clusters established in the peak calibration step become the extracted features. For a given learning instance, the value of the feature associated with a given cluster is simply the intensity in the preprocessed mass spectrum at the mass value associated with that cluster.
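
    Reading off the feature values of one spectrum can be sketched as follows. The sketch assumes that each cluster has already been summarized by a single representative mass and that the intensity is taken at the recorded m/z value closest to it; both choices, like the function names, are assumptions for illustration.

```python
from bisect import bisect_left


def intensity_at(mz_axis, intensities, mass):
    """Intensity at the recorded m/z value closest to `mass` (mz_axis sorted)."""
    pos = bisect_left(mz_axis, mass)
    candidates = [i for i in (pos - 1, pos) if 0 <= i < len(mz_axis)]
    nearest = min(candidates, key=lambda i: abs(mz_axis[i] - mass))
    return intensities[nearest]


def feature_vector(mz_axis, intensities, cluster_masses):
    """One feature per calibrated cluster, in the order of `cluster_masses`."""
    return [intensity_at(mz_axis, intensities, m) for m in cluster_masses]


features = feature_vector([100.0, 101.0, 102.0, 103.0],
                          [0.1, 0.5, 0.2, 0.9],
                          [101.2, 102.9])
```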

    Biomarker Discovery

    Once the mass spectral data have been preprocessed and their features extracted, we can proceed to biomarker discovery using machine learning and data mining algorithms. We have experimented with a number of machine learning and data mining methods in order to extract a small set of specific and sensitive biomarkers, including support vector machines, neural networks, decision trees and more. For a thorough description of the results in the stroke problem see PRA04. For this type of pathology the best results were obtained with a set of 13 biomarkers, using a nearest neighbor algorithm, achieving a specificity of 85.7% and a specificity that ranged from 85.7% to 95%.
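
    As an illustration of the evaluation vocabulary used here, the following sketch shows a single nearest neighbor classifier together with the computation of sensitivity and specificity. The Euclidean distance, the use of one neighbor, the class labels and the toy data are assumptions; they are not the configuration or data of PRA04.

```python
import math


def nearest_neighbor_predict(train_x, train_y, query):
    """Label of the training instance closest to `query` (Euclidean distance)."""
    dists = [math.dist(x, query) for x in train_x]
    return train_y[dists.index(min(dists))]


def sensitivity_specificity(y_true, y_pred, positive):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)


# Toy feature vectors, purely illustrative.
train_x = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 6.0]]
train_y = ["control", "control", "stroke", "stroke"]
label = nearest_neighbor_predict(train_x, train_y, [5.5, 5.4])
```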

    Results in other application domains, namely ovarian and prostate cancer, can be found in KAL05.

    Model Stability

    A crucial issue in biomarker discovery from mass spectral data, and one that is most often neglected, is the quantification of the sensitivity of the machine learning and data mining methods to different training sets. That is: how different are the classification models produced when the algorithms are trained on different sets of data?

    We have developed a method for quantifying the syntactic stability of the classification models produced when the learning algorithms are trained on different datasets. The quantification is based on a measure of the correlation of the rankings (importance) that a given method assigns to the different features, i.e. possible markers. The higher the degree of correlation, the less sensitive the learning method is to perturbations of the learning set. The method also allows for a clear visualization of the stability of the produced models. For a complete description of the approach see KAL05b.
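
    One way to turn correlation of rankings into a single stability score is to average a rank correlation coefficient over all pairs of rankings obtained from different training sets. The sketch below assumes Spearman's coefficient and rankings without ties; the exact measure used is described in KAL05b, and the function names are hypothetical.

```python
from itertools import combinations


def spearman(rank_a, rank_b):
    """Spearman correlation of two tie-free rankings of the same features."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def ranking_stability(rankings):
    """Average pairwise Spearman correlation over all pairs of rankings."""
    pairs = list(combinations(rankings, 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)


# Each inner list gives the rank of features 0..2 for one training subset.
stability = ranking_stability([[1, 2, 3], [1, 2, 3], [3, 2, 1]])
```

    A score close to 1 indicates that the method ranks the candidate markers almost identically whatever the training subset; values near 0 or below indicate unstable rankings.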

    Publications

    KAL05 - Alexandros Kalousis, Julien Prados, Elton Rexhepaj and Melanie Hilario. Feature extraction from mass spectral data for the classification of pathological states. In Principles of Data Mining and Knowledge Discovery, Ninth European Conference. Springer 2005. Extended Version, pdf.

    KAL05b - Alexandros Kalousis, Julien Prados, and Melanie Hilario. Stability of feature selection algorithms. Accepted in Fifth IEEE International Conference on Data Mining. 2005.

    PRA04 - Julien Prados, Alexandros Kalousis, Jean-Charles Sanchez, Laure Allard, Odile Carrette and Melanie Hilario. Mining mass spectra for diagnosis and discovery of cerebral accidents. Proteomics, 4(8):2320-2332, 2004. pdf.

    HIL05 - Hilario, M., Kalousis, A., Muller, M., Pellegrini, C. Classification of Mass-Spectra, accepted for publication to Mass-Spectrometry Reviews, 2005.

    HIL04 - Hilario, M., Kalousis A., Prados, J., Binz, P.A., (2004). Data Mining for Mass-Spectra Based Diagnosis and Biomarker Discovery. Biosilico journal, 2(5): 171-222, 2004.

    Local contact: Alexandros.Kalousis@cui.unige.ch

    Last update : 04/08/2005 by Alexandros Kalousis