Why Deconvolute?

What happens when molecules overlap, and why it matters for identification

1. A Real GC-MS Run

A GC-MS instrument separates molecules over time (chromatography) and measures their mass fragmentation pattern (mass spectrometry). The result is an intensity matrix: scans × m/z channels. Here's a real run from the Copenhagen Soft Camel Cheese dataset:

Drag the selection box on the TIC (total ion chromatogram, the summed intensity across all m/z channels at each scan) to explore different regions. The default window shows a peak cluster where at least two molecules are visibly overlapping: you can see multiple distinct ion traces rising and falling at slightly different times. This kind of coelution is extremely common in GC-MS.
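To make the matrix view concrete, here is a minimal sketch of how a TIC and a single ion trace fall out of an intensity matrix. The numbers are a made-up toy matrix, not values from the camel cheese run, and the variable names are only illustrative.

```python
import numpy as np

# Hypothetical toy intensity matrix: 5 scans (rows) x 4 m/z channels (columns).
intensity = np.array([
    [0.0,  1.0, 0.0, 0.0],
    [2.0,  5.0, 1.0, 0.0],
    [6.0, 12.0, 4.0, 1.0],
    [3.0,  6.0, 9.0, 2.0],
    [0.0,  1.0, 3.0, 1.0],
])

# Total ion chromatogram: sum over all m/z channels at each scan.
tic = intensity.sum(axis=1)

# A single ion trace is one column of the matrix followed over time.
trace_ch2 = intensity[:, 2]

print("TIC:", tic)
print("TIC apex scan:", int(np.argmax(tic)))
print("Channel 2 apex scan:", int(np.argmax(trace_ch2)))
```

In this toy matrix the TIC peaks one scan earlier than channel 2 does. Ion traces that peak at different scans inside one TIC peak are the visual cue for coelution described above.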

2. A Simplified Example

Let's look at what happens when two molecules coelute. Here's a clean, synthetic example with no noise and no baseline — just two overlapping molecules:

The colored lines show the true elution profiles of each molecule. The instrument doesn't see these separately — it only records the combined signal (ion channels plot). The question is: can we identify what's in there?
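A synthetic example like this one can be sketched as a matrix product of elution profiles and pure spectra. The Gaussian peak shape, the peak width, and the spectra below are assumptions chosen for illustration; only the apex scans (22 and 30) and the 0–300 channel range come from this post.

```python
import numpy as np

num_scans, num_channels = 50, 301  # m/z channels 0..300
scans = np.arange(num_scans)

def gaussian_profile(apex, width):
    """Elution profile normalized to a peak height of 1 (assumed Gaussian)."""
    return np.exp(-0.5 * ((scans - apex) / width) ** 2)

# Two overlapping elution profiles, shape (num_scans, 2).
profiles = np.column_stack([gaussian_profile(22, 5.0), gaussian_profile(30, 5.0)])

# Made-up pure spectra, shape (2, num_channels); m/z 42 and 43 are shared.
spectra = np.zeros((2, num_channels))
spectra[0, [29, 42, 43, 73]] = [300.0, 800.0, 1000.0, 400.0]
spectra[1, [41, 42, 43, 56]] = [500.0, 900.0, 200.0, 1000.0]

# What the instrument records: the sum of both contributions at every scan.
mixed = profiles @ spectra  # shape (num_scans, num_channels)

print(mixed.shape)
print("m/z 42 at scan 22:", round(float(mixed[22, 42]), 1))
```

At scan 22 the m/z 42 value is larger than molecule A's own 800, because molecule B has already started eluting and adds to the same channel.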

3. The Problem: Contaminated Spectra

These two molecules share four ions: m/z 29, 42, 43, and 44. At any scan where both are eluting, the instrument records the sum of both contributions at these m/z channels, and there's no way to tell them apart just by looking at the raw signal.

The standard approach to identify a molecule is to extract the mass spectrum at its peak apex and match it against a reference library. Let's try that:

The left column shows the pure reference spectra from our library. The right column shows what we actually extract from the combined signal at each apex scan. They look similar but not identical — each extracted spectrum is contaminated by the other molecule's signal.
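The contamination can be reproduced in a few lines. This is a sketch with made-up six-channel spectra and assumed Gaussian profiles, so the similarity it prints is not one of the values from this post; it only shows the mechanism.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two spectra treated as vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up pure spectra over 6 m/z channels; channels 1 and 2 are shared.
ref_a = np.array([300.0, 800.0, 1000.0, 400.0, 0.0, 0.0])
ref_b = np.array([0.0, 900.0, 200.0, 0.0, 500.0, 1000.0])

# Assumed Gaussian elution profiles with apexes at scans 22 and 30.
scans = np.arange(50)
prof_a = np.exp(-0.5 * ((scans - 22) / 5.0) ** 2)
prof_b = np.exp(-0.5 * ((scans - 30) / 5.0) ** 2)
mixed = np.outer(prof_a, ref_a) + np.outer(prof_b, ref_b)

# The spectrum extracted at A's apex still carries part of B's signal.
apex_a = mixed[22]
similarity = cosine_similarity(apex_a, ref_a)
print("apex spectrum vs. pure reference:", round(similarity, 4))
```

The extracted spectrum scores below 1.0 against its own reference even though there is no noise at all. The gap comes entirely from the neighbouring molecule.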

What happens when we search our library of 9,971 spectra for the best match?

Library search at apex of molecule A (scan 22):

#	Molecule	Cosine Similarity
1	BUTYL ACETATE	0.9581
2	ISOBUTYL ACETATE	0.9558
3	2-METHYLPENTYL ACETATE	0.9391
4	LEVULINIC ACID	0.9317
5	ETHYLENE GLYCOL MONOACETATE ← correct	0.8965

Library search at apex of molecule B (scan 30):

#	Molecule	Cosine Similarity
1	BUTYL CHLORIDE	0.9468
2	3-METHYLTETRAHYDROFURAN ← correct	0.9372
3	4-METHYLPENTANOL	0.8842
4	METHYLCYCLOPENTANE	0.8840
5	BUTYL FORMATE	0.8635

Look at molecule A: the correct molecule (marked "← correct") ranks #5, not #1. Without deconvolution, we would identify this as Butyl Acetate, the wrong molecule. Molecule B fares slightly better at #2, but is still at risk of being misidentified. In a real analysis, these errors propagate silently.

4. The Solution: Deconvolution

Deconvolution is the process of separating the mixed signal back into its individual components. The key idea:

Recover the elution profiles — figure out how each molecule's signal varies over time
Separate the matrix — using the elution profiles, solve for each molecule's pure spectrum via NNLS (non-negative least squares)
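Written out, the two steps correspond to a simple linear mixing model. The symbols below are notation introduced here for illustration (they do not appear elsewhere in this post): X is the recorded matrix with T scans and M channels, C holds the K elution profiles as columns, S holds the K pure spectra as rows, and x_j and s_j are column j of X and S.

```latex
X \approx C\,S, \qquad
X \in \mathbb{R}^{T \times M},\quad
C \in \mathbb{R}_{\ge 0}^{T \times K},\quad
S \in \mathbb{R}_{\ge 0}^{K \times M}

\hat{s}_j = \arg\min_{s \,\ge\, 0} \; \lVert C\,s - x_j \rVert_2 ,
\qquad j = 1, \dots, M
```

Step 1 estimates C. Step 2 solves one small non-negative least-squares problem per m/z channel to fill in S.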

Let's see how NNLS works on a single ion. Take m/z 42 — it's present in both molecules. The combined signal is a mix of both elution profiles:

NNLS finds the intensity of each molecule's contribution to this ion. In code:

from scipy.optimize import nnls

# profiles: (num_scans, 2) array, the two elution profiles (each normalized to peak = 1)
# ion_signal: (num_scans,) array, the combined signal at m/z 42

# nnls returns the non-negative weights and the residual norm (unused here)
weights, _ = nnls(profiles, ion_signal)
# weights ≈ [8390, 13660]
# → molecule A contributes about 8,390 intensity at m/z 42
# → molecule B contributes about 13,660 intensity at m/z 42

Now we repeat this for every m/z channel (0–300). Collecting molecule A's weight from every channel gives its recovered mass spectrum, and likewise for molecule B.
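The per-channel loop might look like the sketch below. It is self-contained, so it rebuilds a small noiseless mixture first; the Gaussian profiles and the spectra are made-up stand-ins, not the data behind the plots in this post.

```python
import numpy as np
from scipy.optimize import nnls

# Assumed Gaussian elution profiles (known here because the data is synthetic).
scans = np.arange(50)
profiles = np.column_stack([
    np.exp(-0.5 * ((scans - 22) / 5.0) ** 2),
    np.exp(-0.5 * ((scans - 30) / 5.0) ** 2),
])

# Made-up pure spectra over m/z 0..300, mixed into one noiseless matrix.
num_channels = 301
true_spectra = np.zeros((2, num_channels))
true_spectra[0, [29, 42, 43, 73]] = [300.0, 800.0, 1000.0, 400.0]
true_spectra[1, [41, 42, 43, 56]] = [500.0, 900.0, 200.0, 1000.0]
mixed = profiles @ true_spectra

# One NNLS solve per channel; each solve yields one weight per molecule.
recovered = np.zeros_like(true_spectra)
for channel in range(num_channels):
    weights, _residual = nnls(profiles, mixed[:, channel])
    recovered[:, channel] = weights

print("spectra recovered:", np.allclose(recovered, true_spectra))
```

Each row of `recovered` is one molecule's spectrum. Because the profiles are exact and there is no noise, the weights reproduce the spectra that went into the mixture.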

5. The Payoff: Clean Spectra

Using the true elution profiles (which we know in this synthetic example), NNLS perfectly separates the mixed signal:

The recovered spectra match the reference perfectly, but keep in mind this is synthetic data with no noise. On real data the match won't be exact, though it should still be much closer to the reference than the contaminated apex spectrum. Let's run the library search again on the deconvoluted spectra:

Library search on recovered spectrum A:

#	Molecule	Cosine Similarity
1	ETHYLENE GLYCOL MONOACETATE ← correct	1.0000
2	ACETOMETHANOL	0.9629
3	METHYL ISOPENTYL ETHER	0.9440
4	PROPYL ACETATE	0.9390
5	ETHYLENE GLYCOL DIACETATE	0.9371

Library search on recovered spectrum B:

#	Molecule	Cosine Similarity
1	3-METHYLTETRAHYDROFURAN ← correct	1.0000
2	BUTYL CHLORIDE	0.9311
3	METHYLCYCLOPENTANE	0.9231
4	2,2-DIMETHYLCYCLOPENTANONE	0.9180
5	2,2,5-TRIMETHYLCYCLOPENTANONE	0.9122

Perfect matches — cosine similarity of 1.0000. The molecules are now correctly identified with no ambiguity.

And identification isn't the only benefit. Deconvolution also enables quantification: the NNLS weights tell us exactly how much each molecule contributes to the combined signal. We don't just know what's in the peak — we know how much of each molecule is there.
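One way to turn the weights into a "how much" figure is to sum each molecule's reconstructed signal over all scans and channels. The sketch below assumes the profiles and the per-channel weights are already in hand; the small arrays are hypothetical, and a real quantification workflow would usually also involve calibration, which is outside the scope of this post.

```python
import numpy as np

# Hypothetical elution profiles: 5 scans x 2 molecules.
profiles = np.array([
    [0.2, 0.0],
    [1.0, 0.3],
    [0.5, 1.0],
    [0.1, 0.4],
    [0.0, 0.1],
])

# Hypothetical NNLS weights (recovered spectra): 2 molecules x 3 m/z channels.
spectra = np.array([
    [100.0, 50.0,  0.0],
    [  0.0, 80.0, 40.0],
])

mixed = profiles @ spectra

# Total signal of each molecule: area under its profile times the sum of its spectrum.
component_totals = profiles.sum(axis=0) * spectra.sum(axis=1)

# Fraction of the whole recorded signal attributed to each molecule.
shares = component_totals / mixed.sum()

print("totals:", component_totals)
print("shares:", np.round(shares, 4))
```

The two totals add up to the sum of the mixed matrix, so the shares describe how the recorded signal splits between the molecules.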

6. The Challenge Ahead

Of course, in practice we don't know the elution profiles — that's the whole problem. The upcoming posts tackle this step by step:

Part 1: Building realistic synthetic training data
Part 2: Estimating how many components are present (98.5% accuracy)
Part 3: Recovering the elution profiles themselves
Part 4: Putting it all together on real data

Data sources: Copenhagen Soft Camel Cheese GC-MS dataset, MassBank mass spectral library