Alternating least squares to separate overlapping GC-MS peaks
Warning: The benchmark in this post clearly needs better metrics — cosine similarity alone painted too rosy a picture, and the scale ratio is a rough first attempt that can hide errors (one component scaled too high can compensate for another scaled too low). I don't have a great idea yet for a single number that captures everything. But this is a fun side project, so good enough for now — if you have suggestions, I'd love to hear them in the comments!
In Part 0 we saw why deconvolution matters — overlapping molecules contaminate each other's spectra and break library identification. In Part 2 we trained a model to estimate how many components are present (98.5% accuracy). Now comes the hard part: actually recovering the elution profiles and spectra. Well, the actual hard part will be to do it on real data.
MCR-ALS (Multivariate Curve Resolution – Alternating Least Squares) is a well-established method for resolving mixtures. For a thorough introduction, see de Juan, Jaumot & Tauler (2014).
The idea is simple: we want to factor the intensity matrix M (scans × m/z)
into two parts:
M ≈ C @ S
where C is the elution profiles (scans × n_components) and S
is the mass spectra (n_components × m/z). We solve this by alternating:
C = M @ S.T @ inv(S @ S.T)
S = inv(C.T @ C) @ C.T @ M

The initial guess comes from SVD. A key optimization: instead of calling NNLS per m/z column (slow), we solve the entire matrix at once and clip negatives. Not mathematically identical to true NNLS, but within ALS the alternation corrects the clipping errors.
```python
"""MCR-ALS (Multivariate Curve Resolution - Alternating Least Squares)."""
import numpy as np


def _enforce_unimodality(C):
    """Enforce unimodality on each column of C (each elution profile)."""
    for j in range(C.shape[1]):
        col = C[:, j]
        peak = np.argmax(col)
        # Before peak: enforce monotonic increase
        for i in range(peak - 1, -1, -1):
            if col[i] > col[i + 1]:
                col[i] = col[i + 1]
        # After peak: enforce monotonic decrease
        for i in range(peak + 1, len(col)):
            if col[i] > col[i - 1]:
                col[i] = col[i - 1]
    return C


def mcr_als(M, n_components, max_iter=100, tol=1e-6):
    """Run MCR-ALS on an intensity matrix.

    Args:
        M: intensity matrix, shape (scans, mz)
        n_components: number of components to resolve
        max_iter: maximum ALS iterations
        tol: convergence tolerance (relative change in residual)

    Returns:
        C: elution profiles, shape (scans, n_components)
        S: spectra, shape (n_components, mz)
    """
    M = M.astype(np.float64)
    n_scans, n_mz = M.shape

    # Initial guess: SVD
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    S = np.abs(Vt[:n_components])  # (n_components, mz)

    norm_M = np.linalg.norm(M)
    prev_residual = np.inf
    for iteration in range(max_iter):
        # Solve for C (profiles) given S
        # M ≈ C @ S  →  C = M @ S.T @ inv(S @ S.T)
        StS = S @ S.T
        try:
            C = M @ S.T @ np.linalg.inv(StS)
        except np.linalg.LinAlgError:
            C = M @ S.T @ np.linalg.pinv(StS)
        np.clip(C, 0, None, out=C)

        # Unimodality constraint
        C = _enforce_unimodality(C)

        # Solve for S (spectra) given C
        # M ≈ C @ S  →  S = inv(C.T @ C) @ C.T @ M
        CtC = C.T @ C
        try:
            S = np.linalg.solve(CtC, C.T @ M)
        except np.linalg.LinAlgError:
            S = np.linalg.pinv(CtC) @ C.T @ M
        np.clip(S, 0, None, out=S)

        # Check convergence
        residual = np.linalg.norm(M - C @ S) / norm_M
        change = abs(prev_residual - residual)
        if change < tol:
            break
        prev_residual = residual

    return C, S
```
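To make the clip-versus-NNLS point concrete, here's a small comparison on noise-free data (toy matrices, scipy assumed available). In the exact case both approaches recover the same spectra; with noise they differ per step, and it's the alternation that absorbs the difference:

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
C = np.abs(rng.normal(size=(50, 3)))        # known profiles (scans x components)
S_true = np.abs(rng.normal(size=(3, 100)))  # known spectra (components x mz)
M = C @ S_true                              # noise-free mixture

# True NNLS: one constrained solve per m/z column (slow for thousands of columns)
S_nnls = np.column_stack([nnls(C, M[:, k])[0] for k in range(M.shape[1])])

# Fast approximation: one batched least-squares solve, then clip negatives
S_clip = np.linalg.solve(C.T @ C, C.T @ M)
np.clip(S_clip, 0, None, out=S_clip)

print(np.allclose(S_nnls, S_clip, atol=1e-5))  # True on noise-free data
```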
Let's run MCR-ALS on a synthetic sample with 4 overlapping molecules. Dashed lines are the true profiles, solid lines are what MCR-ALS recovered:
The shapes match nearly perfectly. The similarity matrix shows each recovered component mapping cleanly to one true component, with cosine similarities > 0.999.
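That similarity matrix is cheap to compute: L2-normalize each profile column, take a dot product, and match components by the best score. A sketch with random stand-in profiles (for larger component counts, scipy's linear_sum_assignment would give a globally optimal matching instead of the greedy argmax):

```python
import numpy as np

rng = np.random.default_rng(1)
true_C = np.abs(rng.normal(size=(200, 4)))             # stand-in "true" profiles
perm = np.array([2, 0, 3, 1])                          # unknown component order
rec_C = true_C[:, perm] + 0.01 * rng.random((200, 4))  # "recovered": permuted + noise

# Cosine similarity matrix: sim[i, j] = cos(true_i, recovered_j)
tn = true_C / np.linalg.norm(true_C, axis=0)
rn = rec_C / np.linalg.norm(rec_C, axis=0)
sim = tn.T @ rn                                        # (4, 4)

# Match each true component to its most similar recovered one
match = sim.argmax(axis=1)
```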
The benchmark works like this:
We generated a fresh dataset of 1,000 synthetic samples (different random seed from the component estimator's training data) with 1–10 components each — 5,433 components total.
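For reference, a rough sketch of how such a synthetic sample can be built: Gaussian elution profiles at staggered retention times, multiplied by sparse nonnegative stick spectra. The parameters below are illustrative, not the benchmark's exact settings:

```python
import numpy as np

def make_sample(n_components, n_scans=300, n_mz=400, seed=0):
    """Build a synthetic M = C @ S with overlapping Gaussian elution peaks."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_scans)

    # Elution profiles: Gaussians with staggered centers and random widths
    centers = np.sort(rng.uniform(0.2, 0.8, n_components)) * n_scans
    widths = rng.uniform(5, 15, n_components)
    C = np.exp(-0.5 * ((t[:, None] - centers) / widths) ** 2)

    # Spectra: a few dozen random m/z sticks per component
    S = np.zeros((n_components, n_mz))
    for i in range(n_components):
        peaks = rng.choice(n_mz, size=30, replace=False)
        S[i, peaks] = rng.exponential(1.0, size=30)

    return C @ S, C, S

M, C_true, S_true = make_sample(4)
```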
Here's a grid of 12 samples — the 4 worst, 4 medium, and 4 best. Dashed = true, solid = recovered:
The worst sample in the dataset (0887, 9 components) scored 0.78 — it still recovers most profiles correctly, but some components get confused. This is likely caused by highly similar spectra or extreme overlap, where the algorithm can't distinguish between components:
Cosine similarity only measures shape — it's completely blind to scale. A profile that's 20% too large still gets cosine = 1.0. So we added a second metric: the scale ratio, defined as:
scale_ratio = sum(recovered_TIC) / sum(true_TIC)
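Given matched components, this factorizes neatly: the per-component TIC at each scan is C[:, i] times the spectrum's total intensity, so the sums separate. A minimal sketch with toy matrices (illustrative names, not the benchmark code) — note that cosine similarity would report 1.0 for both components here:

```python
import numpy as np

rng = np.random.default_rng(2)
C_true = np.abs(rng.normal(size=(100, 3)))
S_true = np.abs(rng.normal(size=(3, 50)))

# A "recovery" whose first elution profile is 10% too intense
C_rec = C_true.copy()
C_rec[:, 0] *= 1.1
S_rec = S_true.copy()

def scale_ratio(C_r, S_r, C_t, S_t, i):
    """sum(recovered_TIC) / sum(true_TIC) for matched component i.

    TIC_i(t) = C[t, i] * S[i].sum(), so the total factorizes into
    (profile sum) * (spectrum sum)."""
    return (C_r[:, i].sum() * S_r[i].sum()) / (C_t[:, i].sum() * S_t[i].sum())

print(scale_ratio(C_rec, S_rec, C_true, S_true, 0))  # ≈ 1.1
```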
A perfect recovery gives 1.0. Here's the distribution across all 5,433 components:
The median is 1.04 — a systematic ~4% overestimate (not sure why it's consistently over). The P5–P95 range spans 0.93–1.22, meaning some components are off by up to 22%. This is consistent across component counts.
This makes sense: MCR-ALS can shift intensity between components while keeping the total
reconstruction accurate (the overall C @ S residual is tiny). Cosine similarity
was giving us a false sense of precision because it only checks whether the shapes match.
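You can see the ambiguity directly: for any invertible matrix T, (C @ T) @ (inv(T) @ S) reconstructs M exactly, and an off-diagonal T moves intensity from one component into another. Nonnegativity rules out most such T (note the shifted spectra below go negative), but when profiles overlap heavily and spectra are similar, small shifts survive the constraints. A toy demonstration:

```python
import numpy as np

rng = np.random.default_rng(3)
C = np.abs(rng.normal(size=(100, 2)))
S = np.abs(rng.normal(size=(2, 50)))
M = C @ S

# Off-diagonal T: leaks 25% of component 1's profile into component 0
T = np.array([[1.0, 0.0],
              [0.25, 1.0]])
C2 = C @ T
S2 = np.linalg.inv(T) @ S

print(np.allclose(M, C2 @ S2))  # True: the residual can't see this trade
```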
The shapes are excellent, but the per-component intensities need work. In the next part, we'll try this pipeline on actual GC-MS data — where the real challenges begin.
Reference: de Juan, A., Jaumot, J. & Tauler, R. (2014). Multivariate Curve Resolution (MCR). Anal. Methods, 6, 4964–4976. DOI: 10.1039/c4ay00571f