Building a Synthetic GC-MS Data Generator

From raw chromatography data to labeled training sets for component counting

1. Why Synthetic Data?

In GC-MS analysis, overlapping peaks are everywhere. Before you can deconvolve them, you need to know how many components are hiding in each peak cluster. Training a model to estimate that count requires labeled data, and real GC-MS data doesn't come with ground-truth labels.

The solution: build a generator that produces realistic synthetic intensity matrices where we control exactly how many molecules overlap and what their shapes look like.

2. Starting from Real Data

We started with 24 GC-MS runs from the Copenhagen Soft Camel Cheese dataset — freely available ANDI-MS NetCDF (.CDF) files. Each run has 12,004 scans across m/z 15–300.

We extracted each CDF file into NumPy arrays: time.npy (acquisition times) and ms.npy (the intensity matrix, scans × m/z bins).
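As a rough sketch of this extraction step: ANDI-MS files usually store every scan's points in flat arrays (commonly named mass_values and intensity_values, with scan_index marking where each scan starts). Assuming that layout, and assuming nearest-integer m/z binning (the binning rule is not stated here), the intensity matrix could be assembled as follows. The function name bin_scans is hypothetical.

```python
import numpy as np

def bin_scans(mass_values, intensity_values, scan_index, mz_min=15, mz_max=300):
    """Bin flat (m/z, intensity) points into a scans x m/z matrix."""
    mass_values = np.asarray(mass_values, dtype=float)
    intensity_values = np.asarray(intensity_values, dtype=float)
    scan_index = np.asarray(scan_index, dtype=int)
    n_scans = len(scan_index)
    n_bins = mz_max - mz_min + 1  # 286 bins for m/z 15..300

    # Scan i owns the points from scan_index[i] up to scan_index[i + 1] - 1.
    counts = np.diff(np.append(scan_index, len(mass_values)))
    rows = np.repeat(np.arange(n_scans), counts)

    # Assumption: each mass is assigned to the nearest integer m/z bin.
    cols = np.rint(mass_values).astype(int) - mz_min
    keep = (cols >= 0) & (cols < n_bins)  # drop points outside the m/z range

    ms = np.zeros((n_scans, n_bins))
    # np.add.at accumulates correctly when several points share a bin.
    np.add.at(ms, (rows[keep], cols[keep]), intensity_values[keep])
    return ms
```

The input arrays could be read with the netCDF4 package, for example ds = netCDF4.Dataset(path) followed by ds.variables["mass_values"][:], and the result saved with np.save("ms.npy", ms).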

24 samples · 12,004 scans each · 286 m/z bins · ~41 min acquisition time

3. Extracting Peak Shapes

For each of the 286 ion channels in each sample, we detected individual peaks:

Denoise with Gaussian filter (σ=1.0)
Detect peaks with scipy.signal.find_peaks (adaptive height/prominence thresholds)
Find boundaries with peak_widths(rel_height=0.95)
Subtract linear baseline so each peak starts and ends at zero
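The four steps above can be sketched for a single ion channel as follows. The SciPy calls are real, but the adaptive thresholds are an assumption: here they scale with a robust noise estimate (median absolute deviation), and the factors 5.0 and 3.0 as well as the function name extract_peaks are hypothetical.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import find_peaks, peak_widths

def extract_peaks(trace, sigma=1.0, height_factor=5.0, prominence_factor=3.0):
    """Return baseline-corrected peak profiles from one ion channel."""
    # Step 1: denoise with a Gaussian filter.
    smooth = gaussian_filter1d(np.asarray(trace, dtype=float), sigma=sigma)

    # Step 2: detect peaks. Assumption: thresholds scale with a MAD noise estimate.
    center = np.median(smooth)
    noise = max(1.4826 * np.median(np.abs(smooth - center)), 1e-12)
    apexes, _ = find_peaks(
        smooth,
        height=center + height_factor * noise,
        prominence=prominence_factor * noise,
    )

    # Step 3: find boundaries at 95% of the way down each peak's prominence.
    _, _, left, right = peak_widths(smooth, apexes, rel_height=0.95)

    peaks = []
    for apex, lo, hi in zip(apexes, left, right):
        start, stop = int(np.floor(lo)), int(np.ceil(hi))
        segment = smooth[start:stop + 1]
        # Step 4: subtract the straight line joining the two boundary points,
        # so the profile starts and ends at zero.
        baseline = np.linspace(segment[0], segment[-1], len(segment))
        peaks.append({
            "apex": int(apex),
            "start": start,
            "stop": stop,
            "profile": segment - baseline,
        })
    return peaks
```

Running this over every column of ms.npy for every sample would yield the pool of candidate peak shapes.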

~200,000 peaks extracted across all 24 samples

4. Clustering into Elution Profile Models

We filtered to high-intensity peaks (intensity > 100,000), which left 2,129 candidates. Each candidate was resampled to 100 points and normalized to unit height so that only its shape mattered, and the set was then clustered with k-means (30 clusters).

After manual review, 21 clusters were selected as clean elution profile models, comprising 1,996 peaks. Each model's averaged profile (shown below) represents a characteristic peak shape.

2,129 high-intensity peaks → 30 clusters → 21 selected models (1,996 peaks)
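A minimal sketch of the resampling, normalization, and clustering steps is given below. The k-means implementation used for this post is not named, so SciPy's kmeans2 with k-means++ initialization is swapped in as an assumption; the function names and the seed are hypothetical.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def resample_unit_height(profile, n_points=100):
    """Linearly resample a peak to n_points and scale its maximum to 1."""
    profile = np.asarray(profile, dtype=float)
    old_x = np.linspace(0.0, 1.0, len(profile))
    new_x = np.linspace(0.0, 1.0, n_points)
    resampled = np.interp(new_x, old_x, profile)
    peak = resampled.max()
    return resampled / peak if peak > 0 else resampled

def cluster_shapes(profiles, n_clusters=30, n_points=100, seed=0):
    """Cluster peaks by shape; each centroid acts as an averaged profile."""
    shapes = np.vstack([resample_unit_height(p, n_points) for p in profiles])
    # Assumption: k-means++ initialization with a fixed seed for repeatability.
    centroids, labels = kmeans2(shapes, n_clusters, minit="++", seed=seed)
    return shapes, centroids, labels
```

Because every peak is reduced to the same length and height, the Euclidean distance used by k-means compares shape only, not width or intensity. The manual review step would then inspect each centroid and keep the clean ones.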

5. Mass Spectra Library

For realistic molecular fingerprints, we downloaded the MassBank bulk export (NIST format, 130 MB). From 139,000 spectra, we filtered to electron impact (EI) ionization and deduplicated by InChIKey.

139,006 spectra parsed → 13,473 EI spectra → 9,971 unique compounds
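The filtering pipeline could look roughly like the sketch below. It assumes records separated by blank lines with "Key: value" header lines followed by "m/z intensity" peak lines, and it assumes the ionization method can be read from an Instrument_type field (for example "EI-B"); both should be checked against the actual export. All function names are hypothetical.

```python
import re

def parse_msp(text):
    """Parse MSP-style text into a list of dicts with lowercase keys."""
    records = []
    for block in re.split(r"\n\s*\n", text.strip()):
        fields, peaks = {}, []
        for line in block.splitlines():
            line = line.strip()
            if not line:
                continue
            if ":" in line and not line[0].isdigit():
                # Header line such as "InChIKey: ..."
                key, value = line.split(":", 1)
                fields[key.strip().lower()] = value.strip()
            else:
                # Peak line: m/z followed by intensity
                mz, intensity = line.split()[:2]
                peaks.append((float(mz), float(intensity)))
        if fields:
            fields["peaks"] = peaks
            records.append(fields)
    return records

def is_ei(record):
    # Assumption: "EI" appears as a dash-separated token, e.g. "EI-B", "GC-EI-TOF".
    return "EI" in record.get("instrument_type", "").upper().split("-")

def unique_ei_spectra(records):
    """Keep the first EI spectrum seen for each InChIKey."""
    seen, unique = set(), []
    for record in records:
        key = record.get("inchikey")
        if not is_ei(record) or not key or key in seen:
            continue
        seen.add(key)
        unique.append(record)
    return unique
```

Keeping the first spectrum per InChIKey is one possible deduplication rule; choosing the spectrum with the most peaks would be an equally plausible alternative.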

6. The Generator

Each synthetic sample is defined by a simple config:

GeneratorConfig(
    num_scans=80,
    molecules=[
        MoleculeConfig(model=5,  spectrum=0,   apex=-5,  width=50, intensity=1_200_000),
        MoleculeConfig(model=2,  spectrum=50,  apex=40,  width=60, intensity=900_000),
        MoleculeConfig(model=14, spectrum=100, apex=85,  width=50, intensity=700_000),
    ]
)

For each molecule, the elution profile is resampled to the desired width and placed at the apex position. Its outer product with the normalized mass spectrum, scaled by the target intensity, gives that molecule's contribution to the intensity matrix. The components are summed, then Poisson and Gaussian noise are added.

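Apex positions can lie outside the scan range (apex=-5 and apex=85 in the config above, with num_scans=80), so that only a component's tail or front falls inside the window. This creates realistic tailing/fronting edge components.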
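One way the generator described above might be written is sketched here. Several details are assumptions rather than facts about the actual implementation: the profile's maximum is aligned with the apex scan, spectra are normalized to a base peak of 1, and the gaussian_sigma and seed fields are hypothetical additions with defaults, so the config shown earlier still constructs unchanged.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class MoleculeConfig:
    model: int        # index into the elution profile models
    spectrum: int     # index into the spectra library
    apex: int         # apex scan index, may lie outside the scan range
    width: int        # number of scans the profile spans
    intensity: float  # target base-peak intensity at the apex

@dataclass
class GeneratorConfig:
    num_scans: int
    molecules: list = field(default_factory=list)
    gaussian_sigma: float = 0.0  # hypothetical: std of additive Gaussian noise
    seed: int = 0                # hypothetical: RNG seed

def generate(config, models, spectra):
    """models: list of 1-D profiles; spectra: array (n_spectra x n_mz)."""
    spectra = np.asarray(spectra, dtype=float)
    clean = np.zeros((config.num_scans, spectra.shape[1]))
    for mol in config.molecules:
        src = np.asarray(models[mol.model], dtype=float)
        # Resample the profile to the requested width, then scale to unit height.
        profile = np.interp(np.linspace(0, 1, mol.width),
                            np.linspace(0, 1, len(src)), src)
        profile = profile / profile.max()
        # Assumption: "normalized" means base peak scaled to 1.
        spectrum = spectra[mol.spectrum] / spectra[mol.spectrum].max()
        component = np.outer(profile, spectrum) * mol.intensity  # width x n_mz
        # Align the profile maximum with the apex scan and clip to the window.
        start = mol.apex - int(np.argmax(profile))
        lo, hi = max(start, 0), min(start + mol.width, config.num_scans)
        if lo < hi:
            clean[lo:hi] += component[lo - start:hi - start]
    clean = np.clip(clean, 0.0, None)  # Poisson rates must be non-negative
    rng = np.random.default_rng(config.seed)
    noisy = rng.poisson(clean).astype(float)
    noisy += rng.normal(0.0, config.gaussian_sigma, size=clean.shape)
    return clean, noisy
```

Returning the noise-free matrix alongside the noisy one keeps the ground truth available, and the label for component counting is simply len(config.molecules).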

Next: Part 2 — Counting Components with SVD →

Data sources: Copenhagen Soft Camel Cheese GC-MS dataset, MassBank mass spectral library