Alternating least squares to separate overlapping GC-MS peaks
Warning: The benchmark in this post clearly needs better metrics — cosine similarity alone painted too rosy a picture, and the scale ratio is a rough first attempt that can hide errors (one component scaled too high can compensate for another scaled too low). I don't have a great idea yet for a single number that captures everything. But this is a fun side project, so good enough for now — if you have suggestions, I'd love to hear them in the comments!
In Part 0 we saw why deconvolution matters — overlapping molecules contaminate each other's spectra and break library identification. In Part 2 we trained a model to estimate how many components are present (98.5% accuracy). Now comes the hard part: actually recovering the elution profiles and spectra. Well, the actual hard part will be to do it on real data.
MCR-ALS (Multivariate Curve Resolution – Alternating Least Squares) is a well-established method for resolving mixtures. For a thorough introduction, see de Juan, Jaumot & Tauler (2014).
The idea is simple: we want to factor the intensity matrix M (scans × m/z)
into two parts:
M ≈ C @ S
where C is the elution profiles (scans × n_components) and S
is the mass spectra (n_components × m/z). We solve this by alternating:
C = M @ S.T @ inv(S @ S.T)
S = inv(C.T @ C) @ C.T @ M

The initial guess comes from SVD. A key optimization: instead of calling NNLS per m/z column (slow), we solve the entire matrix at once and clip negatives. Not mathematically identical to true NNLS, but within ALS the alternation corrects the clipping errors.
```python
"""MCR-ALS (Multivariate Curve Resolution - Alternating Least Squares)."""
import numpy as np


def _enforce_unimodality(C):
    """Enforce unimodality on each column of C (each elution profile)."""
    for j in range(C.shape[1]):
        col = C[:, j]
        peak = np.argmax(col)
        # Before peak: enforce monotonic increase
        for i in range(peak - 1, -1, -1):
            if col[i] > col[i + 1]:
                col[i] = col[i + 1]
        # After peak: enforce monotonic decrease
        for i in range(peak + 1, len(col)):
            if col[i] > col[i - 1]:
                col[i] = col[i - 1]
    return C


def mcr_als(M, n_components, max_iter=100, tol=1e-6):
    """Run MCR-ALS on an intensity matrix.

    Args:
        M: intensity matrix, shape (scans, mz)
        n_components: number of components to resolve
        max_iter: maximum ALS iterations
        tol: convergence tolerance (relative change in residual)

    Returns:
        C: elution profiles, shape (scans, n_components)
        S: spectra, shape (n_components, mz)
    """
    M = M.astype(np.float64)
    n_scans, n_mz = M.shape

    # Initial guess: SVD
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    S = np.abs(Vt[:n_components])  # (n_components, mz)

    norm_M = np.linalg.norm(M)
    prev_residual = np.inf
    for iteration in range(max_iter):
        # Solve for C (profiles) given S
        # M ≈ C @ S  →  C = M @ S.T @ inv(S @ S.T)
        StS = S @ S.T
        try:
            C = M @ S.T @ np.linalg.inv(StS)
        except np.linalg.LinAlgError:
            C = M @ S.T @ np.linalg.pinv(StS)
        np.clip(C, 0, None, out=C)

        # Unimodality constraint
        C = _enforce_unimodality(C)

        # Solve for S (spectra) given C
        # M ≈ C @ S  →  S = inv(C.T @ C) @ C.T @ M
        CtC = C.T @ C
        try:
            S = np.linalg.solve(CtC, C.T @ M)
        except np.linalg.LinAlgError:
            S = np.linalg.pinv(CtC) @ C.T @ M
        np.clip(S, 0, None, out=S)

        # Check convergence
        residual = np.linalg.norm(M - C @ S) / norm_M
        change = abs(prev_residual - residual)
        if change < tol:
            break
        prev_residual = residual

    return C, S
```
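To make the clip-versus-NNLS point concrete, here's a small comparison on noise-free data (toy matrices, scipy assumed available). In the exact case both approaches recover the same spectra; with noise they differ per step, and it's the alternation that absorbs the difference:

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
C = np.abs(rng.normal(size=(50, 3)))        # known profiles (scans x components)
S_true = np.abs(rng.normal(size=(3, 100)))  # known spectra (components x mz)
M = C @ S_true                              # noise-free mixture

# True NNLS: one constrained solve per m/z column (slow for thousands of columns)
S_nnls = np.column_stack([nnls(C, M[:, k])[0] for k in range(M.shape[1])])

# Fast approximation: one batched least-squares solve, then clip negatives
S_clip = np.linalg.solve(C.T @ C, C.T @ M)
np.clip(S_clip, 0, None, out=S_clip)

print(np.allclose(S_nnls, S_clip, atol=1e-5))  # True on noise-free data
```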
Let's run MCR-ALS on a synthetic sample with 4 overlapping molecules. Dashed lines are the true profiles, solid lines are what MCR-ALS recovered:
The shapes match nearly perfectly. The similarity matrix shows each recovered component mapping cleanly to one true component, with cosine similarities > 0.999.
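That similarity matrix is cheap to compute: L2-normalize each profile column, take a dot product, and match components by the best score. A sketch with random stand-in profiles (for larger component counts, scipy's linear_sum_assignment would give a globally optimal matching instead of the greedy argmax):

```python
import numpy as np

rng = np.random.default_rng(1)
true_C = np.abs(rng.normal(size=(200, 4)))             # stand-in "true" profiles
perm = np.array([2, 0, 3, 1])                          # unknown component order
rec_C = true_C[:, perm] + 0.01 * rng.random((200, 4))  # "recovered": permuted + noise

# Cosine similarity matrix: sim[i, j] = cos(true_i, recovered_j)
tn = true_C / np.linalg.norm(true_C, axis=0)
rn = rec_C / np.linalg.norm(rec_C, axis=0)
sim = tn.T @ rn                                        # (4, 4)

# Match each true component to its most similar recovered one
match = sim.argmax(axis=1)
```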
The benchmark works like this:
We generated a fresh dataset of 1,000 synthetic samples (different random seed from the component estimator's training data) with 1–10 components each — 5,433 components total.
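For reference, a rough sketch of how such a synthetic sample can be built: Gaussian elution profiles at staggered retention times, multiplied by sparse nonnegative stick spectra. The parameters below are illustrative, not the benchmark's exact settings:

```python
import numpy as np

def make_sample(n_components, n_scans=300, n_mz=400, seed=0):
    """Build a synthetic M = C @ S with overlapping Gaussian elution peaks."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_scans)

    # Elution profiles: Gaussians with staggered centers and random widths
    centers = np.sort(rng.uniform(0.2, 0.8, n_components)) * n_scans
    widths = rng.uniform(5, 15, n_components)
    C = np.exp(-0.5 * ((t[:, None] - centers) / widths) ** 2)

    # Spectra: a few dozen random m/z sticks per component
    S = np.zeros((n_components, n_mz))
    for i in range(n_components):
        peaks = rng.choice(n_mz, size=30, replace=False)
        S[i, peaks] = rng.exponential(1.0, size=30)

    return C @ S, C, S

M, C_true, S_true = make_sample(4)
```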
Here's a grid of 12 samples — the 4 worst, 4 medium, and 4 best. Dashed = true, solid = recovered:
The worst sample in the dataset (0887, 9 components) scored 0.78 — it still recovers most profiles correctly, but some components get confused. This is likely caused by highly similar spectra or extreme overlap, where the algorithm can't distinguish between components:
Cosine similarity only measures shape — it's completely blind to scale. A profile that's 20% too large still gets cosine = 1.0. So we added a second metric: the scale ratio, defined as:
scale_ratio = sum(recovered_TIC) / sum(true_TIC)
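Given matched components, this factorizes neatly: the per-component TIC at each scan is C[:, i] times the spectrum's total intensity, so the sums separate. A minimal sketch with toy matrices (illustrative names, not the benchmark code) — note that cosine similarity would report 1.0 for both components here:

```python
import numpy as np

rng = np.random.default_rng(2)
C_true = np.abs(rng.normal(size=(100, 3)))
S_true = np.abs(rng.normal(size=(3, 50)))

# A "recovery" whose first elution profile is 10% too intense
C_rec = C_true.copy()
C_rec[:, 0] *= 1.1
S_rec = S_true.copy()

def scale_ratio(C_r, S_r, C_t, S_t, i):
    """sum(recovered_TIC) / sum(true_TIC) for matched component i.

    TIC_i(t) = C[t, i] * S[i].sum(), so the total factorizes into
    (profile sum) * (spectrum sum)."""
    return (C_r[:, i].sum() * S_r[i].sum()) / (C_t[:, i].sum() * S_t[i].sum())

print(scale_ratio(C_rec, S_rec, C_true, S_true, 0))  # ≈ 1.1
```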
A perfect recovery gives 1.0. Here's the distribution across all 5,433 components:
The median is 1.04 — a systematic ~4% overestimate (not sure why it's consistently over). The P5–P95 range spans 0.93–1.22, meaning some components are off by up to 22%. This is consistent across component counts.
This makes sense: MCR-ALS can shift intensity between components while keeping the total
reconstruction accurate (the overall C @ S residual is tiny). Cosine similarity
was giving us a false sense of precision because it only checks whether the shapes match.
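You can see the ambiguity directly: for any invertible matrix T, (C @ T) @ (inv(T) @ S) reconstructs M exactly, and an off-diagonal T moves intensity from one component into another. Nonnegativity rules out most such T (note the shifted spectra below go negative), but when profiles overlap heavily and spectra are similar, small shifts survive the constraints. A toy demonstration:

```python
import numpy as np

rng = np.random.default_rng(3)
C = np.abs(rng.normal(size=(100, 2)))
S = np.abs(rng.normal(size=(2, 50)))
M = C @ S

# Off-diagonal T: leaks 25% of component 1's profile into component 0
T = np.array([[1.0, 0.0],
              [0.25, 1.0]])
C2 = C @ T
S2 = np.linalg.inv(T) @ S

print(np.allclose(M, C2 @ S2))  # True: the residual can't see this trade
```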
The shapes are excellent, but the per-component intensities need work. In the next part, we'll try this pipeline on actual GC-MS data — where the real challenges begin.
Reference: de Juan, A., Jaumot, J. & Tauler, R. (2014). Multivariate Curve Resolution (MCR). Anal. Methods, 6, 4964–4976. DOI: 10.1039/c4ay00571f