Mass Spectrometry Data Mining for Early Diagnosis and Prognosis of Stroke
International Framework
Funded by the Swiss Office for Education and Science (OFES) in the framework of the European COST Action 282: Knowledge Exploration in Science
and Technology (KnowlEST).
Partners
- Biomedical Proteomics Research Group (BPRG), Central Clinical Chemistry Laboratory, University Hospital of Geneva
- University of Geneva AI Lab (UNIGE-GAIL), Switzerland
Abstract
A cerebrovascular accident, also called stroke or brain attack, is an interruption of the blood supply
to any part of the brain, caused by the blockage or rupture of a blood vessel and resulting in damaged brain tissue.
An early diagnosis of stroke, combined with appropriate treatment, would reduce the risk of death
and improve the chances of recovery. There are two types of stroke, ischemic and hemorrhagic,
with the ischemic type accounting for 85% of cases. Once a stroke has been diagnosed, the physician
needs to know the type, the extent and the location of the accident in order to choose and prescribe the
most suitable treatment. Since there is no specific and unique symptom, no early diagnostic marker and no universal
treatment, it is of major interest to develop new approaches to the discovery
of early diagnostic and prognostic markers of stroke.
More precisely, the goal of this project is the discovery of early diagnostic markers of stroke,
that is, markers usable within three hours of the admission to the emergency ward of a patient with possible symptoms of a stroke.
Ideally, the diagnostic test would identify the type of stroke, i.e. ischemic or hemorrhagic, and the extent of
brain damage.
Mass spectrometry has become a tool of choice in the search for biomarkers which may lead to new methods
of disease diagnosis, prognosis and therapy. By comparing mass spectra of diseased and control samples,
we can extract discriminatory patterns which reflect disease-correlated alterations in the expression levels
of proteins. This approach has been actively investigated for the early detection of cancer.
The concrete deliverable of this collaboration is a prototype tool for mining mass spectra to uncover
diagnostic patterns for stroke. It will provide the environment and software covering all the essential
phases of mass-spectra-driven biomarker pattern extraction: mass spectra processing; dimensionality reduction;
model construction, evaluation and interpretation. The prototype is not limited
to stroke pathology; it can be applied to discover biomarkers from mass spectral data for any kind of pathology.
In the scope of this collaboration we developed a complete preprocessing pipeline of mass-spectral data in order
to convert the raw mass spectral data to a form that is usable by machine learning and data mining algorithms
(the learning task is that of classification).
Once the mass spectral data are preprocessed, they can be passed to any machine learning or data mining algorithm for
analysis. We have developed methods to measure the stability of the produced classification models in order to
provide domain specialists with a quantification of the sensitivity of the different algorithms to changes in the
training set. Domain specialists prefer models that do not change significantly under perturbations of the training
set.
In summary we tackle the complete cycle of proteomics mass spectral analysis, starting
from the preprocessing of raw mass spectra data, to the construction of classification models and their
evaluation both in terms of classification performance and model stability.
Objectives
Discover highly specific and highly sensitive biomarkers for the early diagnosis of stroke.
Description of work
Establish a complete preprocessing pipeline that takes the raw mass spectra as input and outputs a set of features
that is uniform across all spectra. Each feature corresponds to a peak.
Build classification models from the preprocessed spectra with the goal of discovering a small set of sensitive and specific biomarkers.
Evaluate the produced models in terms of accuracy, sensitivity, specificity, and stability.
Introduction to Proteomics and Mass Spectrometry
Recent breakthroughs in genomics and proteomics have raised hopes of finding useful biomarkers, i.e.,
meaningful indicators of biological states or processes that can be used for disease diagnosis,
monitoring, and control.
Genomics-based approaches have made it possible to identify changes in gene expression as markers of
certain diseases. However, these achievements must be qualified by a major caveat: gene expression does
not always correlate with protein expression, which is more closely linked to the actual functional state
of cells. A single gene can lead to a large number of protein products via a complex process which potentially
involves, e.g., alternative splicing, post-translational modifications, and proteolytic cleavages.
In contrast to the genome, the proteome (the ensemble of protein forms expressed in a biological sample at a given point
in time) reflects both the intrinsic genetic program of the cell and the impact of its immediate environment.
Clinical proteomics aims at investigating changes in protein expression in order to discover new disease markers and drug targets.
A variety of proteomics workflows have been developed, the common denominator of which is the use of mass spectrometry.
Mass spectrometry (MS) is emerging as an important tool for biomarker discovery.
Body fluids such as serum or urine can be routinely used to
generate protein profiles (m/z ratios versus signal intensities)
replete with potential disease markers, whether individual proteins or sets of interacting proteins.
To discover these biomarker patterns, the data miner must face a number of
technical challenges, foremost among which is the extremely high dimensionality of mass spectra.
A typical mass spectrum will have several thousands of attributes that exhibit a high degree of spatial redundancy. The exact number depends on
the type of mass spectrometry instrument that is used, its resolution, and the mass range it covers.
Usually only a small number of individuals are included in a study; the number of samples
ranges from a couple of dozen to at most a few hundred, depending on ease of access to diseased individuals.
The data miner therefore has to deal with the problem of high dimensionality combined with small sample size.
Mass spectra are acquired from a body fluid for all the individuals included in the study.
Mass spectra preprocessing
Briefly, when a biological sample is submitted to mass spectrometry, it is applied to a surface known as a ProteinChip. The sample is
mixed with an energy-absorbing matrix that makes it crystallize as it dries. The ProteinChip is then placed into a vacuum chamber
and is hit by a laser. The matrix absorbs energy and transfers it to the proteins of the biological sample. As a result, the proteins
desorb and ionize. Next, a brief electric field is applied which accelerates the ionized proteins into a flight tube where they drift
until they strike a detector that records the time of flight. Given the length of the tube and the applied voltage, a quadratic transformation
is used to derive the mass-to-charge ratio (m/z) of the protein from the time of flight. The m/z values are directly related to the mass of the
corresponding molecules. The spectral data that result from this
experiment consist of the sequentially recorded number of ions arriving at the detector (the intensity) coupled with the corresponding m/z values,
which essentially represent the distribution of masses in the biological sample. Peaks in the intensity plot ideally correspond to individual proteins
and are related to their concentration in the biological sample.
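The quadratic relation between time of flight and m/z can be sketched as follows. This is an idealized physical model (energy conservation in a field-free drift tube), not the calibration used by the instrument software; the function name `tof_to_mz` and the tube length and voltage values are assumptions chosen only for illustration. Real instruments typically fit calibration constants from reference masses instead.

```python
# Physical constants (SI units).
ELEMENTARY_CHARGE_C = 1.602176634e-19
DALTON_KG = 1.66053906660e-27


def tof_to_mz(time_of_flight_s, tube_length_m, voltage_v):
    """Idealized m/z (in Da per elementary charge) from a time of flight.

    Energy balance z*e*U = m*v**2 / 2 with v = L / t gives
    m/z = 2*e*U*t**2 / L**2, i.e. a quadratic function of t.
    """
    if time_of_flight_s < 0 or tube_length_m <= 0:
        raise ValueError("time must be non-negative and tube length positive")
    mass_per_charge_kg = (
        2.0 * ELEMENTARY_CHARGE_C * voltage_v * time_of_flight_s ** 2
        / tube_length_m ** 2
    )
    return mass_per_charge_kg / DALTON_KG


# Hypothetical instrument settings, chosen only for illustration.
for t_us in (10.0, 20.0, 40.0):
    mz = tof_to_mz(t_us * 1e-6, tube_length_m=1.0, voltage_v=20000.0)
    print(f"t = {t_us:5.1f} us  ->  m/z = {mz:10.1f}")
```

Doubling the time of flight multiplies the derived m/z by four, which is why heavier ions arrive later and why the m/z axis of a spectrum is not uniformly spaced in time.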
A mass spectrum can be represented by a vector
whose dimensionality is equal to the number of distinct m/z values recorded by the spectrometer and in which the value of each dimension is the
intensity of the corresponding m/z value. The intensities of neighboring m/z values are highly correlated, hence the high level
of spatial redundancy among the features describing a mass spectrum.
There are many different types of mass spectrometry instruments. We are dealing with spectra produced by
SELDI mass spectrometers of Ciphergen Biosystems Inc, but the general principles and methods are the same for other
types of instruments.
Mass spectra have several imperfections which can complicate their interpretation, so a considerable amount of
time and effort should be spent on preprocessing and feature extraction in order to improve their quality and reduce
their dimensionality. Preprocessing of mass spectra can roughly be divided into several subtasks: baseline removal,
denoising, smoothing, normalization, peak detection and peak calibration. In the next paragraphs
we describe how we tackle these issues. For more details see PRA04, KAL05.
Baseline removal
The baseline is an offset of the intensities of masses, which happens mainly at low masses, and varies between
different spectra. It is mainly a result of molecules of the energy absorbing matrix and has to be subtracted
from the signal before any further processing takes place, so that comparisons between intensities of m/z values
of different spectra can be meaningful.
To compute the baseline we apply a locally weighted quadratic fit to the list of local minima extracted from the spectrum; this first
fit smooths out small variations. A new search for local minima is then performed on the fitted values. Using
the new local minima, the signal is split into piecewise constant parts, and the final baseline is computed by
reapplying the initial locally weighted quadratic fit to the piecewise constant signal.
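The idea of estimating a baseline from local minima can be sketched as follows. This is a deliberately simplified stand-in: it keeps the two passes of local-minima search but replaces the locally weighted quadratic fit and the piecewise constant step with plain linear interpolation between the retained minima. All function names are hypothetical.

```python
def local_minima(values):
    """Indices of interior local minima, plus both end points."""
    idx = [0]
    for i in range(1, len(values) - 1):
        if values[i] <= values[i - 1] and values[i] <= values[i + 1]:
            idx.append(i)
    idx.append(len(values) - 1)
    return idx


def interpolate(knots_x, knots_y, n):
    """Piecewise-linear curve through the knots, evaluated at 0..n-1."""
    out = []
    k = 0
    for i in range(n):
        # Advance to the segment [knots_x[k], knots_x[k + 1]] containing i.
        while k + 1 < len(knots_x) - 1 and knots_x[k + 1] <= i:
            k += 1
        x0, x1 = knots_x[k], knots_x[k + 1]
        y0, y1 = knots_y[k], knots_y[k + 1]
        t = (i - x0) / (x1 - x0) if x1 > x0 else 0.0
        out.append(y0 + t * (y1 - y0))
    return out


def remove_baseline(intensities):
    """Subtract a baseline estimated from two passes of local minima.

    Simplification: linear interpolation is used instead of a locally
    weighted quadratic fit. Requires at least two intensity values.
    """
    first = local_minima(intensities)
    first_values = [intensities[i] for i in first]
    # Second pass: minima among the minima, mapped back to signal indices.
    second = [first[j] for j in local_minima(first_values)]
    baseline = interpolate(second, [intensities[i] for i in second],
                           len(intensities))
    # Negative values after subtraction are clipped to zero.
    return [max(0.0, y - b) for y, b in zip(intensities, baseline)]


print(remove_baseline([5, 5, 9, 5, 5]))
```

The second pass over the minima discards small dips that sit on the flank of a peak, so that the estimated baseline follows the slow offset and not the peaks themselves.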
Denoising and Smoothing
Mass spectra are affected by two types of noise: electrical noise, a result of the instrument used, and chemical noise, a result of
contaminants in the sample or matrix molecules. Chemical noise manifests itself mainly in the baseline;
electrical noise alters the intensity in a random manner. To denoise and smooth the signal we use wavelet
decomposition coupled with a median filter; a detailed description is given in KAL05. A critical
issue in denoising and smoothing is setting the values of the preprocessing parameters. In KAL05
we presented a method for parameter tuning that is based on cross-validation and selects the set
of parameters that maximizes classification performance.
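A minimal sketch of wavelet thresholding followed by a median filter is given below. The choice of the Haar wavelet, of soft thresholding, and of the parameter names `levels`, `threshold` and `half_width` are assumptions made for illustration; the actual wavelet family and settings are described in KAL05 and are not reproduced here. These parameters are the kind of values that a cross-validation based tuning procedure would select.

```python
import math
from statistics import median


def haar_forward(signal):
    """One level of the orthonormal Haar transform (even-length input)."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Inverse of haar_forward."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out


def soft_threshold(coeffs, threshold):
    """Shrink every coefficient towards zero by `threshold`."""
    return [math.copysign(max(abs(c) - threshold, 0.0), c) for c in coeffs]


def wavelet_denoise(signal, levels, threshold):
    """Soft-threshold the detail coefficients of `levels` Haar levels.

    len(signal) must be divisible by 2 ** levels.
    """
    if levels == 0:
        return list(signal)
    approx, detail = haar_forward(signal)
    approx = wavelet_denoise(approx, levels - 1, threshold)
    return haar_inverse(approx, soft_threshold(detail, threshold))


def median_filter(signal, half_width):
    """Running median over a window of up to 2 * half_width + 1 points."""
    n = len(signal)
    return [
        median(signal[max(0, i - half_width):min(n, i + half_width + 1)])
        for i in range(n)
    ]


# Synthetic signal, for illustration only.
noisy = [1.0, 1.2, 0.9, 1.1, 6.0, 6.3, 1.0, 0.8]
smoothed = median_filter(wavelet_denoise(noisy, levels=2, threshold=0.2), 1)
print(smoothed)
```

A larger threshold or a wider median window removes more noise but also flattens small peaks, which is why these settings are best chosen with respect to the downstream classification performance.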
Normalization
Even after baseline correction, denoising and smoothing, it is possible that large experimental
variations remain in the data, since the signal intensities can change between experiments due,
for example, to differences in analyte concentration or ionization efficiency.
To make them less dependent on experimental conditions, signal intensities are normalized
via total ion current, which is equivalent to normalizing with the $L_1$ norm of the spectrum.
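Total ion current normalization amounts to dividing every intensity by the sum of absolute intensities of the spectrum. The sketch below uses a hypothetical function name, `normalize_tic`, and synthetic values.

```python
def normalize_tic(intensities):
    """Divide each intensity by the total ion current (the L1 norm)."""
    total = sum(abs(v) for v in intensities)
    if total == 0:
        raise ValueError("cannot normalize a spectrum with zero total ion current")
    return [v / total for v in intensities]


# Two synthetic spectra that differ only by a global scale factor.
print(normalize_tic([1.0, 3.0, 4.0]))
print(normalize_tic([10.0, 30.0, 40.0]))
```

After normalization, two spectra that differ only by a global scale factor (for instance because more sample was loaded) become identical, so intensities can be compared across experiments.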
Peak Detection
Peak detection is the detection of local maxima in the mass spectrum.
This is a rather crude approach
since it overestimates the number of peaks: some of the detected maxima could
be random fluctuations that remain even after denoising and smoothing.
One possible way to reduce the number of spurious peaks is to take into account their signal-to-noise
ratio: peaks whose ratio is higher than a given value are retained and the others are discarded.
Nevertheless, we opt for a conservative approach and retain all detected peaks, since we
have no knowledge of the properties and the form of the peaks or of what really constitutes a peak.
In this way we retain the maximum possible amount of information, but we also retain noise.
This means that the learning algorithm should be robust to noise, since some of the features will
be simply that, and that it is left to the learning algorithm to determine what is of real discriminating value.
A peak collectively represents all the m/z values that define it, that is, all values
from its closest local minimum on the left to its closest local minimum on the right.
The intensities of all these neighboring m/z values exhibit a high level of redundancy, thus by representing
a spectrum only via its peaks we considerably reduce the level of spatial redundancy.
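Detecting peaks as local maxima, together with the span of m/z values that each peak represents, can be sketched as follows. The function name `detect_peaks` is hypothetical, only strict local maxima are reported (flat-topped peaks are ignored in this sketch), and no signal-to-noise filtering is applied, in line with the conservative choice described above.

```python
def detect_peaks(intensities):
    """Return (left, apex, right) index triples for every strict local maximum.

    `left` and `right` are the closest local minima on each side of the apex,
    found by walking downhill until the intensity stops decreasing.
    """
    peaks = []
    n = len(intensities)
    for i in range(1, n - 1):
        if intensities[i] > intensities[i - 1] and intensities[i] > intensities[i + 1]:
            left = i
            while left > 0 and intensities[left - 1] < intensities[left]:
                left -= 1
            right = i
            while right < n - 1 and intensities[right + 1] < intensities[right]:
                right += 1
            peaks.append((left, i, right))
    return peaks


# Synthetic spectrum, for illustration only.
spectrum = [0, 2, 5, 3, 1, 4, 9, 6, 2]
for left, apex, right in detect_peaks(spectrum):
    print(f"peak at index {apex}, spanning indices {left}..{right}")
```

Replacing the nine intensity values of this toy spectrum by its two apex intensities illustrates how a peak-based representation reduces the spatial redundancy of the raw signal.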
Peak Calibration
Peak calibration establishes which peaks among different spectra correspond to the same peak, i.e. the same protein.
The problem is that there is a machine measurement error on the order of 0.03%-0.06% of the measured mass, which has
to be accounted for when we decide whether two peaks appearing in different spectra are the same or not.
We used the approach followed in PRA04, which is essentially complete linkage hierarchical clustering
with some additional domain constraints.
Detected peaks from the different spectra are pooled together and clustering
is performed only on their mass dimension. The constraints state that no two clusters of masses can be merged if their
distance is greater than the measurement error or if this would result in two masses from the same spectrum
appearing in the same cluster.
The final clusters contain masses from different spectra that correspond to the same peak.
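The constrained clustering described above can be sketched with a naive agglomerative procedure. This is an illustration under stated assumptions, not the implementation of PRA04: the function names are hypothetical, the tolerance is taken as `rel_tol` times the mean mass of the candidate cluster, and the default value of 0.05% is simply a point inside the error range quoted above. The all-pairs search is only practical for small examples.

```python
def complete_linkage(a, b):
    """Largest mass difference between any peak of `a` and any peak of `b`."""
    return max(abs(x[1] - y[1]) for x in a for y in b)


def can_merge(a, b, rel_tol):
    """Check the two domain constraints for merging clusters `a` and `b`."""
    merged = a + b
    masses = [m for _, m in merged]
    tolerance = rel_tol * sum(masses) / len(masses)
    if max(masses) - min(masses) > tolerance:
        return False  # wider than the measurement error allows
    spectrum_ids = [sid for sid, _ in merged]
    return len(spectrum_ids) == len(set(spectrum_ids))  # one peak per spectrum


def calibrate_peaks(peaks, rel_tol=0.0005):
    """Cluster (spectrum_id, mass) pairs; returns a list of clusters."""
    clusters = [[p] for p in peaks]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if not can_merge(clusters[i], clusters[j], rel_tol):
                    continue
                d = complete_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best is None:
            return clusters  # no admissible merge is left
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]


# Synthetic peaks pooled from three spectra (ids 0, 1, 2).
pooled = [(0, 1000.0), (1, 1000.2), (2, 999.9),
          (0, 2000.0), (1, 2000.6), (0, 2001.5)]
for cluster in calibrate_peaks(pooled):
    print(sorted(mass for _, mass in cluster))
```

In this toy example the peak at 2001.5 stays in its own cluster: merging it with the cluster around 2000 would place two peaks of spectrum 0 in the same cluster and would also exceed the tolerance.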
Feature Extraction
Feature extraction is the combined effect of all the preprocessing steps. However, three of them are
central: denoising-smoothing, peak detection and peak calibration. The first step determines how many
peaks (features) will be preserved in the preprocessed spectrum; it thus indirectly determines
the dimensionality of the final feature space. The latter two steps are the actual steps of feature extraction.
Clusters established in the peak calibration step become the extracted features. For a given learning instance,
the value of the feature associated with a given cluster
is simply the intensity in the preprocessed mass spectrum at the mass value that is associated with that cluster.
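Turning calibrated clusters into a feature table can be sketched as follows. Two details are assumptions of this sketch and are not stated in the text: the mass associated with a cluster is taken to be the mean of its member masses, and the intensity is read at the recorded m/z value closest to that mass. The function names and the data are hypothetical.

```python
def cluster_mass(cluster):
    """Representative mass of a cluster (assumed here: mean of member masses)."""
    return sum(mass for _, mass in cluster) / len(cluster)


def intensity_at(mz_values, intensities, mass):
    """Intensity recorded at the m/z value closest to `mass`."""
    nearest = min(range(len(mz_values)), key=lambda i: abs(mz_values[i] - mass))
    return intensities[nearest]


def build_feature_table(spectra, clusters):
    """Map every spectrum id to one feature value per cluster.

    `spectra` maps a spectrum id to (mz_values, intensities) of the
    preprocessed spectrum; `clusters` holds lists of (spectrum_id, mass).
    """
    centers = sorted(cluster_mass(c) for c in clusters)
    table = {
        sid: [intensity_at(mz, inten, center) for center in centers]
        for sid, (mz, inten) in spectra.items()
    }
    return centers, table


# Synthetic preprocessed spectra and clusters, for illustration only.
spectra = {
    "s1": ([1000.0, 1000.5, 2000.0], [4.0, 1.0, 7.0]),
    "s2": ([1000.0, 1000.5, 2000.0], [3.0, 2.0, 0.5]),
}
clusters = [[("s1", 1000.0), ("s2", 1000.1)], [("s1", 2000.0)]]
centers, table = build_feature_table(spectra, clusters)
print(centers)
print(table)
```

Note that a spectrum receives a value for every cluster, including clusters to which it contributed no detected peak; in that case the value is whatever intensity the preprocessed spectrum has at that mass.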
Biomarker Discovery
Once the mass spectral data have been preprocessed and their features extracted we can proceed to biomarker
discovery using machine learning and data mining algorithms.
We have experimented with a number of machine learning and data mining methods, including support vector machines,
neural networks and decision trees, in order to extract a small set of specific and sensitive biomarkers. For
a thorough description of the results on the stroke problem see PRA04. For this type of pathology
the best results were obtained with a set of 13 biomarkers, using a nearest neighbor algorithm,
achieving a specificity of 85.7% and a specificity that ranged from 85.7% to 95%.
Results in other application domains, namely ovarian and prostate cancer, can be found in KAL05.
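How a nearest neighbor classifier is evaluated in terms of sensitivity and specificity can be sketched as follows. The sketch uses a 1-nearest-neighbor rule with Euclidean distance and leave-one-out evaluation; these choices, the function names, and the feature values are assumptions for illustration. The data are synthetic and are unrelated to the results reported in PRA04.

```python
import math


def nearest_neighbor_predict(train_x, train_y, query):
    """Label of the training instance closest to `query` (Euclidean distance)."""
    best = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return train_y[best]


def leave_one_out(x, y):
    """Predict every instance from all the other instances."""
    predictions = []
    for i in range(len(x)):
        rest_x = x[:i] + x[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        predictions.append(nearest_neighbor_predict(rest_x, rest_y, x[i]))
    return predictions


def sensitivity_specificity(y_true, y_pred, positive="stroke"):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


# Synthetic two-feature data set, for illustration only.
features = [[1.0, 0.2], [0.9, 0.1], [1.1, 0.3], [0.1, 1.0], [0.2, 0.9], [0.95, 0.25]]
labels = ["stroke", "stroke", "stroke", "control", "control", "control"]
predicted = leave_one_out(features, labels)
print(sensitivity_specificity(labels, predicted))
```

Sensitivity measures the fraction of stroke patients that are recognized as such, and specificity the fraction of controls that are correctly left unflagged; both are needed because either one can be made perfect on its own by a trivial classifier.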
Model Stability
A crucial issue in biomarker discovery from mass spectral data, and one that is most often neglected, is the
quantification of the sensitivity of the machine learning and data mining methods to the training set used,
that is, how different the produced classification models are when the algorithms are trained on different sets of
data.
We have developed a method for the quantification of syntactic stability of the produced classification models
when the learning algorithms are trained on different datasets. The quantification is based on a measure of the
correlation of the rankings (importance) that a given method assigns to different features, i.e. possible markers.
The higher the degree of correlation, the less sensitive the learning method is to perturbations of the training set.
The method also allows for a clear visualization of the stability of the produced models. For a complete description
of the approach see KAL05b.
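A stability score based on the correlation of feature rankings can be sketched as follows. Spearman's rank correlation, averaged over all pairs of training runs, is assumed here as one reasonable choice of correlation measure; the exact measures are described in KAL05b. The function names and the importance scores are hypothetical.

```python
import math
from itertools import combinations


def ranks(scores):
    """Rank 1 = most important feature; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def pearson(a, b):
    """Pearson correlation; undefined if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(scores_a, scores_b):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(scores_a), ranks(scores_b))


def stability(importance_runs):
    """Average pairwise rank correlation over models from different training sets."""
    pairs = list(combinations(importance_runs, 2))
    if not pairs:
        raise ValueError("at least two runs are needed")
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)


# Feature importances from three hypothetical training subsets.
runs = [[0.9, 0.5, 0.1, 0.3], [0.8, 0.6, 0.2, 0.1], [0.7, 0.2, 0.3, 0.4]]
print(stability(runs))
```

A score close to 1 means that the method ranks the candidate markers in nearly the same order regardless of which training subset it sees; lower scores indicate that the selected markers depend strongly on the particular sample of patients.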
Publications
KAL05 - Alexandros Kalousis, Julien Prados, Elton Rexhepaj and Melanie Hilario. Feature extraction from mass spectral data for the
classification of pathological states. In Principles of Data Mining and Knowledge Discovery, Ninth European Conference. Springer 2005. Extended version.
KAL05b - Alexandros Kalousis, Julien Prados, and Melanie Hilario. Stability of feature selection algorithms.
Accepted in Fifth IEEE International Conference on Data Mining. 2005.
PRA04 - Julien Prados, Alexandros Kalousis, Jean-Charles Sanchez, Laure Allard, Odile Carrette and Melanie Hilario.
Mining mass spectra for diagnosis and discovery of cerebral accidents. Proteomics, 4(8):2320-2332, 2004.
HIL05 - Hilario, M., Kalousis, A., Muller, M., Pellegrini, C.
Classification of Mass-Spectra, accepted for publication in Mass Spectrometry Reviews, 2005.
HIL04 - Hilario, M., Kalousis A., Prados, J., Binz, P.A., (2004).
Data Mining for Mass-Spectra Based Diagnosis and Biomarker Discovery.
Biosilico journal, 2(5): 171-222, 2004.
Local contact:
Alexandros.Kalousis@cui.unige.ch