Using singular value decomposition to estimate how many molecules overlap in a GC-MS peak
In GC-MS (gas chromatography–mass spectrometry), each molecule produces a characteristic pattern across hundreds of ion channels as it elutes from the column. When molecules coelute (overlap in time), their signals mix into a tangle of overlapping peaks. The instrument gives you a matrix of intensities — scans × m/z channels — but no indication of how many molecules are hiding in there.
Here's a concrete example. This is what 4 overlapping molecules look like:
The top plot shows the raw ion channels — hundreds of signals overlapping in time, each colored by m/z. The middle plot is the TIC (sum of all ions) — it looks like one or two broad peaks, but there are actually 4 molecules hiding in there (bottom plot). Could you have guessed that from the TIC alone?
This is the challenge: given only the noisy intensity matrix, figure out how many independent components are present. It's the essential first step before any deconvolution can happen.
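To make the setup concrete, here is a minimal sketch of how such a matrix arises (the dimensions and peak shapes are illustrative, not the Part 1 generator): each molecule contributes a rank-1 term, its elution profile times its mass spectrum, and the instrument records the noisy sum.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: 100 scans x 300 m/z channels, k = 4 molecules.
n_scans, n_mz, k = 100, 300, 4

# Each molecule has a Gaussian elution profile over scans...
t = np.arange(n_scans)
centers = rng.uniform(30, 70, size=k)
profiles = np.exp(-0.5 * ((t[:, None] - centers) / 5.0) ** 2)  # (n_scans, k)

# ...and a sparse mass spectrum over m/z channels.
spectra = rng.random((k, n_mz)) * (rng.random((k, n_mz)) < 0.05)  # (k, n_mz)

# The recorded matrix is the sum of k rank-1 terms plus noise.
M = profiles @ spectra + 0.01 * rng.random((n_scans, n_mz))
print(M.shape)  # (100, 300)
```

Without the noise term, M would have rank exactly k; the noise smears that out, which is exactly what makes the counting problem nontrivial.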
The problem scales too — here are more examples with varying numbers of components:
Singular value decomposition (SVD) factors the intensity matrix M into three parts, M = U · diag(s) · Vt, where U spans the scan dimension, Vt spans the m/z dimension, and s holds the singular values in decreasing order:
U, s, Vt = np.linalg.svd(M, full_matrices=False)
The key insight: for a matrix with k independent molecules, the first k singular values will be large, and the rest will drop to noise level. The shape of this decay curve encodes the component count.
The drop-off is clearly visible — a 2-component mixture has a sharp drop after s[1], while a 10-component mixture stays elevated much longer.
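Here is a self-contained sketch of that decay on a toy rank-2 mixture with mild noise (not real GC-MS data): the first k normalized singular values stay large, and the rest collapse to the noise floor.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy mixture: k = 2 rank-1 components plus mild noise.
k, n_scans, n_mz = 2, 100, 300
M = sum(np.outer(rng.random(n_scans), rng.random(n_mz)) for _ in range(k))
M = M + 0.001 * rng.random((n_scans, n_mz))

s = np.linalg.svd(M, compute_uv=False)
s_norm = s / s[0]

# s_norm[0] and s_norm[1] carry the two components;
# from s_norm[2] onward the values sit near the noise floor.
print(np.round(s_norm[:5], 4))
```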
Here's the full picture — all SVD decay curves from our dataset, grouped by component count. Each faint line is one sample. You can see the clusters tighten as the pattern becomes consistent within each group:
The curves clearly separate by component count, but there's overlap and noise, especially between adjacent counts. Exactly where to draw the cutoff varies from sample to sample, which makes this a natural job for a classifier.
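To see why a single fixed cutoff struggles, here is a hypothetical hard-threshold baseline (my sketch, not the method used in this post). Once noise is present, the count it returns depends directly on where the threshold sits:

```python
import numpy as np

rng = np.random.default_rng(4)

def naive_count(M, tol=0.01):
    """Hypothetical baseline: count normalized singular values above tol."""
    s = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(s / s[0] > tol))

# A 3-component toy mixture with added noise.
M = sum(np.outer(rng.random(80), rng.random(200)) for _ in range(3))
M = M + 0.2 * rng.random((80, 200))

# The estimate shifts as the cutoff moves through the noise floor.
for tol in (0.05, 0.01, 0.002):
    print(f"tol={tol}: {naive_count(M, tol)} components")
```

A classifier trained on the whole decay curve sidesteps picking `tol` by hand.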
We keep it simple: extract the first 20 normalized singular values as features, and train a random forest classifier (200 trees) to predict the component count (1–10).
features = s[:20] / s[0]  # normalized singular value decay

model = RandomForestClassifier(n_estimators=200)
model.fit(X_train, y_train)
Training data: 1,000 synthetic samples generated with our data generator (800 train / 200 test, stratified split).
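Here is a runnable end-to-end sketch of this pipeline, with a simplified stand-in for the Part 1 generator (rank-k random peaks plus noise; names like `make_sample` are mine, and the sample count is scaled down):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

def make_sample(k, n_scans=60, n_mz=120):
    """Simplified stand-in for the data generator: k rank-1 peaks + noise."""
    M = sum(np.outer(rng.random(n_scans), rng.random(n_mz)) for _ in range(k))
    M = M + 0.01 * rng.random((n_scans, n_mz))
    s = np.linalg.svd(M, compute_uv=False)
    return s[:20] / s[0]  # first 20 normalized singular values

y = rng.integers(1, 11, size=300)            # component counts 1-10
X = np.stack([make_sample(k) for k in y])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```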
The confusion matrix shows a near-perfect diagonal — misclassifications only happen between adjacent counts (e.g. predicting 3 instead of 2), which makes sense for borderline cases.
The middle singular values (s[2]–s[7]) are the most informative — s[0] is always 1 (after normalization), s[1] is almost always high, and the late values are mostly noise. The discriminative signal lives in the transition zone.
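This kind of claim can be checked directly on a trained forest via `feature_importances_`. A toy illustration on simplified synthetic decay curves (not the post's dataset); note that after normalization s[0] is constant at 1, so it necessarily gets zero importance:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)

def decay_features(k, n=60, m=120):
    """Toy decay curve: SVD of a rank-k random mixture plus noise."""
    M = sum(np.outer(rng.random(n), rng.random(m)) for _ in range(k))
    M = M + 0.01 * rng.random((n, m))
    s = np.linalg.svd(M, compute_uv=False)
    return s[:20] / s[0]

y = rng.integers(1, 11, size=200)
X = np.stack([decay_features(k) for k in y])

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for i, imp in enumerate(model.feature_importances_):
    print(f"s[{i:2d}] importance: {imp:.3f}")
```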
from tools.estimate_components import estimate_components
import numpy as np
ms = np.load("data/synthetic_peaks/0042/ms.npy")
n = estimate_components(ms)
print(f"Estimated components: {n}") # -> 4
Now that we can estimate how many components are present, the next challenge is actually separating them — recovering each molecule's individual elution profile and mass spectrum from the mixed signal.
← Part 1: Building the Data Generator
Built with scikit-learn and Bokeh