Using singular value decomposition to estimate how many molecules overlap in a GC-MS peak
In GC-MS (gas chromatography–mass spectrometry), each molecule produces a characteristic pattern across hundreds of ion channels as it elutes from the column. When molecules coelute (overlap in time), their signals mix into a tangle of overlapping peaks. The instrument gives you a matrix of intensities — scans × m/z channels — but no indication of how many molecules are hiding in there.
Here's a concrete example. This is what 4 overlapping molecules look like:
The top plot shows the raw ion channels — hundreds of signals overlapping in time, each colored by m/z. The middle plot is the TIC (sum of all ions) — it looks like one or two broad peaks, but there are actually 4 molecules hiding in there (bottom plot). Could you have guessed that from the TIC alone?
This is the challenge: given only the noisy intensity matrix, figure out how many independent components are present. It's the essential first step before any deconvolution can happen.
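To make the setup concrete, here is a minimal sketch of how such a matrix arises (the dimensions and peak shapes are illustrative, not the Part 1 generator): each molecule contributes a rank-1 term, its elution profile times its mass spectrum, and the instrument records the noisy sum.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: 100 scans x 300 m/z channels, k = 4 molecules.
n_scans, n_mz, k = 100, 300, 4

# Each molecule has a Gaussian elution profile over scans...
t = np.arange(n_scans)
centers = rng.uniform(30, 70, size=k)
profiles = np.exp(-0.5 * ((t[:, None] - centers) / 5.0) ** 2)  # (n_scans, k)

# ...and a sparse mass spectrum over m/z channels.
spectra = rng.random((k, n_mz)) * (rng.random((k, n_mz)) < 0.05)  # (k, n_mz)

# The recorded matrix is the sum of k rank-1 terms plus noise.
M = profiles @ spectra + 0.01 * rng.random((n_scans, n_mz))
print(M.shape)  # (100, 300)
```

Without the noise term, M would have rank exactly k; the noise smears that out, which is exactly what makes the counting problem nontrivial.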
The problem scales too — here are more examples with varying numbers of components:
Singular value decomposition (SVD) factors the intensity matrix M into three parts, M = U · diag(s) · Vt, where U spans the scan dimension, Vt spans the m/z dimension, and s holds the singular values in decreasing order:
U, s, Vt = np.linalg.svd(M, full_matrices=False)
The key insight: for a matrix with k independent molecules, the first k singular values will be large, and the rest will drop to noise level. The shape of this decay curve encodes the component count.
The drop-off is clearly visible — a 2-component mixture has a sharp drop after s[1], while a 10-component mixture stays elevated much longer.
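Here is a self-contained sketch of that decay on a toy rank-2 mixture with mild noise (not real GC-MS data): the first k normalized singular values stay large, and the rest collapse to the noise floor.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy mixture: k = 2 rank-1 components plus mild noise.
k, n_scans, n_mz = 2, 100, 300
M = sum(np.outer(rng.random(n_scans), rng.random(n_mz)) for _ in range(k))
M = M + 0.001 * rng.random((n_scans, n_mz))

s = np.linalg.svd(M, compute_uv=False)
s_norm = s / s[0]

# s_norm[0] and s_norm[1] carry the two components;
# from s_norm[2] onward the values sit near the noise floor.
print(np.round(s_norm[:5], 4))
```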
Here's the full picture — all SVD decay curves from our dataset, grouped by component count. Each faint line is one sample. You can see the clusters tighten as the pattern becomes consistent within each group:
The curves clearly separate by component count, but there's overlap and noise, especially between adjacent counts. Exactly where to draw the cutoff varies from sample to sample, which makes this a natural job for a classifier.
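To see why a single fixed cutoff struggles, here is a hypothetical hard-threshold baseline (my sketch, not the method used in this post). Once noise is present, the count it returns depends directly on where the threshold sits:

```python
import numpy as np

rng = np.random.default_rng(4)

def naive_count(M, tol=0.01):
    """Hypothetical baseline: count normalized singular values above tol."""
    s = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(s / s[0] > tol))

# A 3-component toy mixture with added noise.
M = sum(np.outer(rng.random(80), rng.random(200)) for _ in range(3))
M = M + 0.2 * rng.random((80, 200))

# The estimate shifts as the cutoff moves through the noise floor.
for tol in (0.05, 0.01, 0.002):
    print(f"tol={tol}: {naive_count(M, tol)} components")
```

A classifier trained on the whole decay curve sidesteps picking `tol` by hand.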
We keep it simple: extract the first 20 normalized singular values as features, and train a random forest classifier (200 trees) to predict the component count (1–10).
features = s[:20] / s[0]  # normalized singular value decay

model = RandomForestClassifier(n_estimators=200)
model.fit(X_train, y_train)
Training data: 1,000 synthetic samples generated with our data generator (800 train / 200 test, stratified split).
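Here is a runnable end-to-end sketch of this pipeline, with a simplified stand-in for the Part 1 generator (rank-k random peaks plus noise; names like `make_sample` are mine, and the sample count is scaled down):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

def make_sample(k, n_scans=60, n_mz=120):
    """Simplified stand-in for the data generator: k rank-1 peaks + noise."""
    M = sum(np.outer(rng.random(n_scans), rng.random(n_mz)) for _ in range(k))
    M = M + 0.01 * rng.random((n_scans, n_mz))
    s = np.linalg.svd(M, compute_uv=False)
    return s[:20] / s[0]  # first 20 normalized singular values

y = rng.integers(1, 11, size=300)            # component counts 1-10
X = np.stack([make_sample(k) for k in y])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```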
The confusion matrix shows a near-perfect diagonal — misclassifications only happen between adjacent counts (e.g. predicting 3 instead of 2), which makes sense for borderline cases.
The middle singular values (s[2]–s[7]) are the most informative — s[0] is always 1 (after normalization), s[1] is almost always high, and the late values are mostly noise. The discriminative signal lives in the transition zone.
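This kind of claim can be checked directly on a trained forest via `feature_importances_`. A toy illustration on simplified synthetic decay curves (not the post's dataset); note that after normalization s[0] is constant at 1, so it necessarily gets zero importance:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)

def decay_features(k, n=60, m=120):
    """Toy decay curve: SVD of a rank-k random mixture plus noise."""
    M = sum(np.outer(rng.random(n), rng.random(m)) for _ in range(k))
    M = M + 0.01 * rng.random((n, m))
    s = np.linalg.svd(M, compute_uv=False)
    return s[:20] / s[0]

y = rng.integers(1, 11, size=200)
X = np.stack([decay_features(k) for k in y])

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for i, imp in enumerate(model.feature_importances_):
    print(f"s[{i:2d}] importance: {imp:.3f}")
```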
from tools.estimate_components import estimate_components
import numpy as np
ms = np.load("data/synthetic_peaks/0042/ms.npy")
n = estimate_components(ms)
print(f"Estimated components: {n}") # -> 4
Now that we can estimate how many components are present, the next challenge is actually separating them — recovering each molecule's individual elution profile and mass spectrum from the mixed signal.
← Part 1: Building the Data Generator
Built with scikit-learn and Bokeh