ABSTRACT: In each sample run, comprehensive two-dimensional gas chromatography with dual secondary columns and detectors (GCx2GC) provides complementary information in two chromatograms generated by its two detectors. For example, a flame ionization detector (FID) produces data that is especially effective for quantification and a mass spectrometer (MS) produces data that is especially useful for chemical-structure elucidation and compound identification. The greater information capacity of two detectors is most useful for difficult analyses, such as metabolomics, but using the joint information offered by the two complex two-dimensional chromatograms requires data fusion. In the case that the second columns are equivalent but flow conditions vary (e.g., related to the operative pressure of their different detectors), data fusion can be accomplished by aligning the chromatographic data and/or chromatographic features such as peaks and retention-time windows. Chromatographic alignment requires a mapping from the retention times of one chromatogram to the retention times of the other. This paper considers general issues and experimental performance for global two-dimensional mapping functions to align pairs of GCx2GC chromatograms. Experimental results for GCx2GC with FID and MS for metabolomic analyses of human urine samples suggest that low-degree polynomial mapping functions outperform affine transformations (as measured by root-mean-square residuals for matched peaks) and achieve performance near a lower-bound benchmark of inherent variability. Third-degree polynomials slightly outperformed second-degree polynomials in these results, but second-degree polynomials performed nearly as well and may be preferred for parametric and computational simplicity as well as robustness.
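A low-degree polynomial mapping of the kind this abstract describes can be fit by ordinary least squares over matched peak retention times. The sketch below is a minimal illustration, not the authors' implementation (the function names and monomial basis are assumptions); it fits a bivariate polynomial from one retention-time plane to the other and computes the root-mean-square residual used above as the performance measure:

```python
import numpy as np

def fit_poly2d(src, dst, degree=2):
    """Least-squares fit of a 2-D polynomial mapping from matched peak
    retention times src -> dst; src and dst are (n, 2) arrays of
    (first-column, second-column) retention times."""
    u, v = src[:, 0], src[:, 1]
    # Design matrix of monomials u^i * v^j with i + j <= degree.
    terms = [(i, j) for i in range(degree + 1)
                    for j in range(degree + 1 - i)]
    A = np.column_stack([u**i * v**j for i, j in terms])
    coeffs, *_ = np.linalg.lstsq(A, dst, rcond=None)
    def mapping(pts):
        pu, pv = pts[:, 0], pts[:, 1]
        B = np.column_stack([pu**i * pv**j for i, j in terms])
        return B @ coeffs
    return mapping

def rms_residual(mapping, src, dst):
    """Root-mean-square residual of mapped vs. matched peak positions."""
    r = mapping(src) - dst
    return float(np.sqrt(np.mean(np.sum(r**2, axis=1))))
```

With degree=1 the same basis reduces to an affine transformation, so the models the abstract compares differ only in which monomial terms are included.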
ABSTRACT: Comprehensive two-dimensional chromatography is a powerful technology for analyzing the patterns of constituent compounds in complex samples, but matching chromatographic features for comparative analysis across large sample sets is difficult. Various methods have been described for pairwise peak matching between two chromatograms, but the peaks indicated by these pairwise matches commonly are incomplete or inconsistent across many chromatograms. This paper describes a new, automated method for post-processing the results of pairwise peak matching to address incomplete and inconsistent peak matches and thereby select chromatographic peaks that reliably correspond across many chromatograms. Reliably corresponding peaks can be used both for directly comparing relative compositions across large numbers of samples and for aligning chromatographic data for comprehensive comparative analyses. To select reliable features for a set of chromatograms, the Consistent Cliques Method (CCM) represents all peaks from all chromatograms and all pairwise peak matches in a graph, finds the maximal cliques, and then combines cliques with shared peaks to extract reliable features. The parameters of CCM are the minimum number of chromatograms with complete pairwise peak matches and the desired number of reliable peaks. A particular threshold for the minimum number of chromatograms with complete pairwise matches ensures that there are no conflicts among the pairwise matches for reliable peaks. Experimental results with samples of complex bio-oils analyzed by comprehensive two-dimensional gas chromatography (GCxGC) coupled with mass spectrometry (GCxGC-MS) indicate that CCM provides a good foundation for comparative analysis of complex chemical mixtures.
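The graph steps of CCM (enumerate maximal cliques, then combine cliques with shared peaks) can be illustrated with the classic Bron-Kerbosch algorithm. This is a generic sketch of those two operations, not the paper's implementation; the node representation and merge strategy are illustrative assumptions:

```python
def maximal_cliques(adj):
    """Bron-Kerbosch enumeration of maximal cliques.
    adj: dict mapping each node (e.g., a peak) to its set of neighbors
    (e.g., peaks it was pairwise-matched to)."""
    cliques = []
    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)   # r cannot be extended: maximal clique
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p.remove(v)
            x.add(v)
    expand(set(), set(adj), set())
    return cliques

def merge_shared(cliques):
    """Combine cliques that share any member, as CCM combines cliques
    with shared peaks to extract features."""
    merged = []
    for c in cliques:
        c = set(c)
        keep = []
        for m in merged:
            if m & c:
                c |= m          # absorb overlapping group
            else:
                keep.append(m)
        merged = keep + [c]
    return merged
```

In CCM the clique nodes would carry (chromatogram, peak) identity and the threshold parameters described above would filter which merged groups count as reliable features.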
ABSTRACT: This review surveys different approaches for generating features from comprehensive two-dimensional chromatography for non-targeted cross-sample analysis. The goal of non-targeted cross-sample analysis is to discover relevant chemical characteristics (such as compositional similarities or differences) from multiple samples. In non-targeted analysis, the relevant characteristics are unknown, so individual features for all chemical constituents should be analyzed, not just those for targeted or selected analytes. Cross-sample analysis requires matching the corresponding features that characterize each constituent across multiple samples so that relevant characteristics or patterns can be recognized. Non-targeted, cross-sample analysis requires generating and matching all features across all samples. Applications of non-targeted cross-sample analysis include sample classification, chemical fingerprinting, monitoring, sample clustering, and chemical marker discovery. Comprehensive two-dimensional chromatography is a powerful technology for separating complex samples and so is well suited for non-targeted cross-sample analysis. However, two-dimensional chromatographic data is typically large and complex, so the computational tasks of extracting and matching features for pattern recognition are challenging. This review examines five general approaches that researchers have applied to these difficult problems: visual image comparisons, datapoint feature analysis, peak feature analysis, region feature analysis, and peak-region feature analysis.
Journal of Chromatography A 07/2011; 1226:140-8. DOI:10.1016/j.chroma.2011.07.046 · 4.17 Impact Factor
ABSTRACT: Comprehensive two-dimensional gas chromatography (GC×GC) is a powerful technology for separating complex samples. The typical goal of GC×GC peak detection is to aggregate data points of analyte peaks based on their retention times and intensities. Two techniques commonly used for two-dimensional peak detection are the two-step algorithm and the watershed algorithm. A recent study compared the performance of the two-step and watershed algorithms for GC×GC data with retention-time shifts in the second-column separations. In that analysis, the peak retention-time shifts were corrected while applying the two-step algorithm but the watershed algorithm was applied without shift correction. The results indicated that the watershed algorithm has a higher probability of erroneously splitting a single two-dimensional peak than the two-step approach. This paper reconsiders the analysis by comparing peak-detection performance for resolved peaks after correcting retention-time shifts for both the two-step and watershed algorithms. Simulations with wide-ranging conditions indicate that when shift correction is employed with both algorithms, the watershed algorithm detects resolved peaks with greater accuracy than the two-step method.
Journal of Chromatography A 07/2011; 1218(38):6792-8. DOI:10.1016/j.chroma.2011.07.052 · 4.17 Impact Factor
ABSTRACT: We successfully detected halogenated compounds from several kinds of environmental samples by using a comprehensive two-dimensional gas chromatograph coupled with a tandem mass spectrometer (GC×GC-MS/MS). For the global detection of organohalogens, fly ash sample extracts were directly measured without any cleanup process. The global and selective detection of halogenated compounds was achieved by neutral loss scans of chlorine, bromine and/or fluorine using an MS/MS. It was also possible to search for and identify compounds using two-dimensional mass chromatograms and mass profiles obtained from measurements of the same sample with a GC×GC-high resolution time-of-flight mass spectrometer (HRTofMS) under the same conditions as those used for the GC×GC-MS/MS. In this study, novel software tools were also developed to help find target (halogenated) compounds in the data provided by a GC×GC-HRTofMS. As a result, many dioxin and polychlorinated biphenyl congeners and many other halogenated compounds were found in fly ash extract and sediment samples. By extracting the desired information (here, concerning organohalogens) from the huge quantities of data provided by the GC×GC-HRTofMS, we demonstrate the possibility of total global detection of compounds in a single GC measurement of a sample without any pre-treatment.
Journal of Chromatography A 06/2011; 1218(24):3799-810. DOI:10.1016/j.chroma.2011.04.042 · 4.17 Impact Factor
ABSTRACT: This paper describes informatics for cross-sample analysis with comprehensive two-dimensional gas chromatography (GCxGC) and high-resolution mass spectrometry (HRMS). GCxGC-HRMS analysis produces large data sets that are rich with information, but highly complex. The size of the data and volume of information require automated processing for comprehensive cross-sample analysis, but the complexity poses a challenge for developing robust methods. The approach developed here analyzes GCxGC-HRMS data from multiple samples to extract a feature template that comprehensively captures the pattern of peaks detected in the retention-times plane. Then, for each sample chromatogram, the template is geometrically transformed to align with the detected peak pattern and generate a set of feature measurements for cross-sample analyses such as sample classification and biomarker discovery. The approach avoids the intractable problem of comprehensive peak matching by using a few reliable peaks for alignment and peak-based retention-plane windows to define comprehensive features that can be reliably matched for cross-sample analysis. The informatics are demonstrated with a set of 18 samples from breast-cancer tumors, each from different individuals, six each for Grades 1-3. The features allow classification that matches grading by a cancer pathologist with 78% success in leave-one-out cross-validation experiments. The HRMS signatures of the features of interest can be examined for determining elemental compositions and identifying compounds.
ABSTRACT: This study examined how advanced fingerprinting methods (i.e., non-targeted methods) provide reliable and specific information about groups of samples based on their component distribution on the GC x GC chromatographic plane. The volatile fractions of roasted hazelnuts (Corylus avellana L.) from nine different geographical origins, comparably roasted for desirable flavor and texture, were sampled by headspace solid-phase microextraction (HS-SPME) and then analyzed by GC x GC-qMS. The resulting patterns were processed by: (a) "chromatographic fingerprinting", i.e., a pattern recognition procedure based on retention-time criteria, where peak correspondences were established through a comprehensive peak pattern covering the chromatographic plane; and (b) "comprehensive template matching" with reliable peak matching, where peak correspondences were constrained by retention-time and MS fragmentation-pattern similarity criteria. Fingerprinting results showed how the discrimination potential of GC x GC can be increased by including all the detected components in sample comparisons and correlations and, in addition, how it can provide reliable results in a comparative analysis by locating compounds with a significant role. Results were completed by a chemical speciation of volatiles, and sample profiling was extended to known markers whose distribution can be correlated to sensory properties, geographical origin, or the effect of thermal treatment on different classes of compounds. The comprehensive approach for data interpretation proposed here may be useful to assess product specificity and quality through measurable parameters strictly and consistently correlated to sensory properties and origin.
Journal of Chromatography A 09/2010; 1217(37):5848-58. DOI:10.1016/j.chroma.2010.07.006 · 4.17 Impact Factor
ABSTRACT: Comprehensive two-dimensional LC (LC x LC) is a powerful tool for analysis of complex biological samples. With its multidimensional separation power and increased peak capacity, LC x LC generates information-rich, but complex, chromatograms, which require advanced data analysis to produce useful information. An important analytical challenge is to classify samples on the basis of chromatographic features, e.g., to extract and utilize biomarkers indicative of health conditions, such as disease or response to therapy. This study presents a new approach to extract comprehensive non-target chromatographic features from a set of LC x LC chromatograms for sample classification. Experimental results with urine samples indicate that the chromatographic features generated by this approach can be used to effectively classify samples. Based on the extracted features, a support vector machine successfully classified urine samples by individual, before/after procedure, and concentration with leave-one-out and replicate K-fold cross-validation. The new method for comprehensive chromatographic feature analysis of LC x LC separations provides a potentially powerful tool for classifying complex biological samples.
ABSTRACT: The present study examines the ability of targeted and non-targeted methods to provide specific and complementary information on groups of samples on the basis of their component distribution on the two-dimensional gas chromatography (GCxGC) plane. The volatile fraction of Arabica green and roasted coffee samples differing in geographical origins and roasting treatments and the volatile fraction from juniper needles, sampled by headspace solid-phase microextraction, were analyzed by GCxGC-qMS and sample profiles processed by different approaches. In the target analysis profiling, samples submitted to different roasting cycles and/or differing in origin and post-harvest treatment are characterized on the basis of known constituents (botanical, technological, and/or aromatic markers). This approach provides highly reliable results on quali-quantitative compositional differences because of the authentic standard confirmation, extending and improving the specificity of the comparative procedure to trace and minor components. On the other hand, non-targeted data-processing methods (e.g., direct image comparison and template-based fingerprinting) include all detected sample components in the sample comparisons and correlations, offering an increased discrimination potential by identifying compounds that are comparatively significant but not known targets. Results demonstrate the ability of GCxGC to explore in depth the complexity of samples and emphasize the advantages of a comprehensive and multidisciplinary approach to improve the level of information provided by GCxGC separation.
ABSTRACT: Interactive visualization of data from a new generation of chemical imaging systems requires coding that is efficient and accessible. New technologies for secondary ion mass spectrometry (SIMS) generate large three-dimensional, hyperspectral datasets with high spatial and spectral resolution. Interactive visualization is important for chemical analysis, but the raw dataset size exceeds the memory capacities of typical current computer systems and is a significant obstacle. This paper reports the development of a lossless coding method that is memory efficient, enabling large SIMS datasets to be held in fast memory, and supports quick access for interactive visualization. The approach provides pixel indexing, as required for chemical imaging applications, and is based on the statistical characteristics of the data. The method uses differential time-of-flight to effect mass-spectral run-length encoding and uses a scheme for variable-length, byte-unit representations for both mass-spectral time-of-flight and intensity values. Experiments demonstrate high compression rates and fast access.
Rapid Communications in Mass Spectrometry 05/2009; 23(9):1229-33. DOI:10.1002/rcm.3962 · 2.25 Impact Factor
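The combination of differential time-of-flight and variable-length byte-unit representations described in the abstract above can be sketched with two standard techniques: delta coding plus a varint-style encoding that packs seven value bits per byte. This is a generic illustration of those ideas, not the paper's exact format:

```python
def encode_varint(values):
    """Variable-length byte-unit encoding: 7 value bits per byte, with
    the high bit set on every byte except the last byte of a value."""
    out = bytearray()
    for v in values:
        while v >= 0x80:
            out.append((v & 0x7F) | 0x80)
            v >>= 7
        out.append(v)
    return bytes(out)

def decode_varint(data):
    """Inverse of encode_varint."""
    values, v, shift = [], 0, 0
    for b in data:
        v |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            values.append(v)
            v, shift = 0, 0
    return values

def delta_encode(times):
    """Differential time-of-flight: store the first value, then the
    successive differences, which are small and varint-compress well."""
    return [times[0]] + [b - a for a, b in zip(times, times[1:])]
```

Small deltas occupy a single byte each, which is where the compression gain comes from for densely sampled time-of-flight channels.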
ABSTRACT: Comprehensive two-dimensional liquid chromatography (LC × LC) generates information-rich but complex peak patterns that require automated processing for rapid chemical identification and classification. This paper describes a powerful approach and specific methods for peak pattern matching to identify and classify constituent peaks in data from LC × LC and other multidimensional chemical separations. The approach records a prototypical pattern of peaks with retention times and associated metadata, such as chemical identities and classes, in a template. Then, the template pattern is matched to the detected peaks in subsequent data and the metadata are copied from the template to identify and classify the matched peaks. Smart Templates employ rule-based constraints (e.g., multispectral matching) to increase matching accuracy. Experimental results demonstrate Smart Templates, with the combination of retention-time pattern matching and multispectral constraints, are accurate and robust with respect to changes in peak patterns associated with variable chromatographic conditions.
Journal of Chromatography A 04/2009; 1216(16):3458-3466. DOI:10.1016/j.chroma.2008.09.058 · 4.17 Impact Factor
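The template-matching idea in the abstract above (match template peaks to detected peaks under a retention-time window plus a spectral-similarity constraint, then copy metadata) can be sketched as a greedy nearest-neighbor matcher. This is a minimal reading of the approach under stated assumptions, not the Smart Templates implementation; the window sizes, dict keys, and similarity threshold are illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity between two spectra given as intensity lists."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def match_template(template, peaks, rt_window=(0.5, 0.05), min_sim=0.8):
    """Greedy matching of template peaks to detected peaks, constrained
    by a retention-time window in each dimension and, when spectra are
    present, a spectrum-similarity threshold.
    template: dicts with 'rt1', 'rt2', 'meta', optional 'spectrum'
    peaks:    dicts with 'rt1', 'rt2', optional 'spectrum'
    Returns (metadata, peak index) pairs for matched peaks."""
    matches, used = [], set()
    for t in template:
        best, best_d = None, float('inf')
        for i, p in enumerate(peaks):
            if i in used:
                continue
            d1, d2 = abs(p['rt1'] - t['rt1']), abs(p['rt2'] - t['rt2'])
            if d1 > rt_window[0] or d2 > rt_window[1]:
                continue   # outside the retention-time window
            if 'spectrum' in t and cosine(t['spectrum'], p['spectrum']) < min_sim:
                continue   # fails the multispectral constraint
            d = math.hypot(d1 / rt_window[0], d2 / rt_window[1])
            if d < best_d:
                best, best_d = i, d
        if best is not None:
            used.add(best)
            matches.append((t['meta'], best))  # copy metadata to the peak
    return matches
```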
ABSTRACT: Identifying and separating subtly different biological samples is one of the most critical tasks in biological analysis. Time-of-flight secondary ion mass spectrometry (ToF-SIMS) is becoming a popular and important technique in the analysis of biological samples, because it can detect molecular information and characterize chemical composition. ToF-SIMS spectra of biological samples are enormously complex, with large mass ranges and many peaks. As a result, classification and cluster analysis are challenging. This study presents a new classification algorithm, the most similar neighbor with a probability-based spectrum similarity measure (MSN-PSSM), which uses all the information in the entire ToF-SIMS spectra. MSN-PSSM is applied to automatically classify bacterial samples from major causal agents of urinary tract infections. Experimental results show that MSN-PSSM is an accurate classification algorithm. It outperforms traditional techniques such as decision trees, principal component analysis (PCA) with discriminant function analysis (DFA), and soft independent modeling of class analogy (SIMCA). This study also applies a modern clustering algorithm, normalized spectral clustering, to automatically cluster the bacterial samples at the species level. Experimental results demonstrate that normalized spectral clustering is able to show accurate quantitative separations. It outperforms traditional techniques such as hierarchical clustering analysis, k-means, and PCA with k-means.
ABSTRACT: New technologies for Secondary Ion Mass Spectrometry (SIMS) produce three-dimensional hyperspectral chemical images with high spatial resolution and fine mass-spectral precision. SIMS imaging of biological tissues and cells promises to provide an informational basis for important advances in a wide variety of applications, including cancer treatments. However, the volume and complexity of data pose significant challenges for interactive visualization and analysis. This paper describes new methods and tools for computer-based visualization and analysis of SIMS data, including a coding scheme for efficient storage and fast access, interactive interfaces for visualizing and operating on three-dimensional hyperspectral images, and spatio-spectral clustering and classification.
Visual Information Processing XVIII, 14 April 2009, Orlando, Florida, USA; 01/2009
ABSTRACT: The multiple-instance learning (MIL) model has been successful in numerous application areas. Recently, a generalization of this model and an algorithm for it were introduced, showing significant advantages over the conventional MIL model on certain application areas. Unfortunately, that algorithm is not scalable to high dimensions. We adapt that algorithm to one using a support vector machine with our new kernel k∧. This reduces the time complexity from exponential in the dimension to polynomial. Computing our new kernel is equivalent to counting the number of boxes in a discrete, bounded space that contain at least one point from each of two multisets. We show that this problem is #P-complete, but then give a fully polynomial randomized approximation scheme (FPRAS) for it. We then extend k∧ by enriching its representation into a new kernel k_min, and also consider a normalized version of k∧ that we call k∧/∨ (which may or may not be a kernel, but whose approximation yielded positive semidefinite Gram matrices in practice). We then empirically evaluate all three measures on data from content-based image retrieval, biological sequence analysis, and the musk data sets. We found that our kernels performed well on all data sets relative to algorithms in the conventional MIL model.
IEEE Transactions on Pattern Analysis and Machine Intelligence 01/2009; 30(12):2084-98. DOI:10.1109/TPAMI.2007.70846 · 5.78 Impact Factor
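For intuition about the box-counting problem named in the abstract above, here is a brute-force count of the boxes in a discrete, bounded space that contain at least one point from each of two multisets. It is exponential in the dimension, which is exactly why the FPRAS matters; the function name and point representation are illustrative, and this is not the paper's approximation scheme:

```python
from itertools import product

def count_joint_boxes(n, d, P, Q):
    """Count axis-aligned boxes in {0,...,n-1}^d containing at least one
    point of multiset P and at least one point of multiset Q.
    Points are d-tuples. Brute force: enumerate every box as a d-tuple
    of (lower, upper) intervals and test containment."""
    intervals = [(l, u) for l in range(n) for u in range(l, n)]
    count = 0
    for box in product(intervals, repeat=d):
        def inside(pt):
            return all(l <= x <= u for x, (l, u) in zip(pt, box))
        if any(inside(p) for p in P) and any(inside(q) for q in Q):
            count += 1
    return count
```

For example, in the 1-D space {0, 1, 2} with P = {0} and Q = {2}, only the full interval [0, 2] contains a point from both multisets.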
ABSTRACT: A Markov chain Monte Carlo method has previously been introduced to estimate weighted sums in multiplicative weight update algorithms when the number of inputs is exponential. However, the original algorithm still required extensive simulation of the Markov chain in order to get accurate estimates of the weighted sums. We propose an optimized version of the original algorithm that produces exactly the same classifications while often using fewer Markov chain simulations. We also apply three other sampling techniques and empirically compare them with the original Metropolis sampler to determine how effective each is in drawing good samples in the least amount of time, in terms of accuracy of weighted sum estimates and in terms of Winnow's prediction accuracy. We found that two other samplers (Gibbs and Metropolized Gibbs) were slightly better than Metropolis in their estimates of the weighted sums. For prediction errors, there is little difference between any pair of MCMC techniques we tested. Also, on the data sets we tested, we discovered that all approximations of Winnow have no disadvantage when compared to brute-force Winnow (where weighted sums are exactly computed), so generalization accuracy is not compromised by our approximation. This is true even when very small sample sizes and mixing times are used.
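The Metropolis estimation of a weighted sum over exponentially many inputs can be illustrated on a toy problem: estimating a weighted mean over binary vectors with single-bit-flip proposals. This is a generic sketch of the Metropolis sampler the abstract compares against, not the paper's optimized algorithm; the weight function, step counts, and seed below are illustrative assumptions:

```python
import random

def metropolis_weighted_mean(weight, f, n_bits, n_steps, burn_in=1000, seed=0):
    """Metropolis estimate of sum_x w(x) f(x) / sum_x w(x) over binary
    vectors x of length n_bits, using single-bit-flip proposals.
    Proposals are symmetric, so acceptance probability is min(1, w(y)/w(x))."""
    rng = random.Random(seed)
    x = [rng.randrange(2) for _ in range(n_bits)]
    wx, total, count = weight(x), 0.0, 0
    for step in range(burn_in + n_steps):
        i = rng.randrange(n_bits)
        x[i] ^= 1                       # propose flipping one bit
        wy = weight(x)
        if rng.random() < min(1.0, wy / wx):
            wx = wy                     # accept the move
        else:
            x[i] ^= 1                   # reject: undo the flip
        if step >= burn_in:
            total += f(x)
            count += 1
    return total / count
```

With weights w(x) = 2^(number of set bits) over 3 bits, the exact weighted mean of the bit count is 2.0, and the estimate converges toward it as n_steps grows.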
ABSTRACT: This paper develops a method for automatic colorization of two-dimensional fields presented as images, in order to visualize local changes in values. In many applications, local changes in values are as important as magnitudes of values. For example, in topography, both elevation and slope often must be considered. Gradient-based value mapping for colorization is a technique to visualize both value (e.g., intensity or elevation) and gradient (e.g., local differences or slope). The method maps pixel values to a color scale in a manner that emphasizes gradients in the image. The value mapping function is monotonically non-decreasing, to maintain ordinal relationships of values on the color scale. The color scale can be a grayscale or pseudocolor scale. The first step of the method is to compute the gradient at each pixel. Then, the pixels (with computed gradients) are sorted by value. The value mapping function is the inverse of the relative cumulative gradient magnitude function computed from the sorted array. The value mapping method is demonstrated with data from comprehensive two-dimensional gas chromatography (GCxGC), using both grayscale and a pseudocolor scale to visualize local changes related to both small and large peaks in the GCxGC data.
Proceedings of SPIE - The International Society for Optical Engineering 06/2006; DOI:10.1117/12.669839 · 0.20 Impact Factor
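The recipe in the abstract above (compute the gradient at each pixel, sort pixels by value, and invert the relative cumulative gradient magnitude function) can be sketched in a few lines. This is a reading of the stated steps, not the authors' code; the interpolation and duplicate-value handling are my own assumptions:

```python
import numpy as np

def gradient_value_mapping(img, levels=256):
    """Monotonically non-decreasing mapping from pixel value to color-scale
    level that allocates resolution in proportion to cumulative gradient
    magnitude, so regions of change get more of the color scale."""
    gy, gx = np.gradient(img.astype(float))
    gmag = np.hypot(gx, gy).ravel()          # gradient magnitude per pixel
    vals = img.ravel().astype(float)
    order = np.argsort(vals)                 # sort pixels by value
    cum = np.cumsum(gmag[order])             # cumulative gradient magnitude
    cum /= cum[-1]                           # relative cumulative gradient
    sorted_vals = vals[order]
    # Collapse duplicate values, keeping the largest cumulative position,
    # so the interpolation grid is strictly increasing.
    uniq, first = np.unique(sorted_vals, return_index=True)
    last = np.append(first[1:], len(cum)) - 1
    def mapping(v):
        return np.interp(v, uniq, cum[last]) * (levels - 1)
    return mapping
```

Because both the unique values and the cumulative curve are non-decreasing, the mapping preserves ordinal relationships of values on the color scale, as the abstract requires.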