From raw chromatography data to labeled training sets for component counting
In GC-MS analysis, overlapping peaks are everywhere. Before you can deconvolve them, you need to know how many components are hiding in each peak cluster. Training a model to estimate that count requires labeled data, but real GC-MS data doesn't come with ground-truth labels.
The solution: build a generator that produces realistic synthetic intensity matrices where we control exactly how many molecules overlap and what their shapes look like.
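To make the idea concrete, here is a minimal sketch of such a generator. It is not the actual implementation: it assumes a bilinear model (elution profiles times mass spectra), Gaussian peak shapes, random sparse spectra, and small additive noise. The function name `synth_cluster` and all of its parameters are hypothetical.

```python
import numpy as np

def synth_cluster(n_components, n_scans=60, n_mz=286, noise=0.01, rng=None):
    """Return (X, label): a synthetic scans x m/z matrix and its component count.

    Assumptions (not from the original pipeline): Gaussian elution profiles,
    random sparse non-negative spectra, Gaussian noise clipped at zero.
    """
    rng = np.random.default_rng(rng)
    t = np.arange(n_scans)

    # Each component gets its own retention position, width, and height.
    centers = rng.uniform(0.3 * n_scans, 0.7 * n_scans, size=n_components)
    widths = rng.uniform(2.0, 5.0, size=n_components)
    heights = rng.uniform(0.5, 1.0, size=n_components)

    # Elution profiles, shape (n_scans, n_components).
    C = heights * np.exp(-0.5 * ((t[:, None] - centers) / widths) ** 2)

    # Sparse spectra, shape (n_components, n_mz): roughly 10% of bins are non-zero.
    mask = rng.random((n_components, n_mz)) < 0.1
    S = rng.exponential(1.0, size=(n_components, n_mz)) * mask

    # Overlapping peaks are the sum of each profile times its spectrum.
    X = C @ S
    X = np.clip(X + rng.normal(0.0, noise, size=X.shape), 0.0, None)
    return X, n_components
```

Calling `synth_cluster(3, rng=0)` yields one labeled example with three overlapping components. Drawing `n_components` at random across many calls would give a labeled training set.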
We started with 24 GC-MS runs from the Copenhagen Soft Camel Cheese dataset — freely available ANDI-MS NetCDF (.CDF) files. Each run has 12,004 scans across m/z 15–300.
We extracted each CDF file into two NumPy arrays: time.npy (acquisition times) and ms.npy (intensity matrix, scans × m/z bins).
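The extraction step might look roughly like the sketch below. It assumes the files use the standard ANDI-MS variable names (`scan_acquisition_time`, `scan_index`, `mass_values`, `intensity_values`), that intensities are summed into unit-mass bins from m/z 15 to 300, and that `scipy.io.netcdf_file` is the reader. The helper names `bin_scans` and `extract_cdf` are hypothetical, and the original binning scheme may differ.

```python
import os
import numpy as np

def bin_scans(scan_index, mass_values, intensity_values, mz_min=15, mz_max=300):
    """Turn flat ANDI-MS point lists into a (scans x m/z bins) matrix.

    Assumes unit-mass bins: each point is rounded to the nearest integer m/z.
    """
    scan_index = np.asarray(scan_index, dtype=np.int64)
    mass_values = np.asarray(mass_values, dtype=np.float64)
    intensity_values = np.asarray(intensity_values, dtype=np.float64)

    n_scans = scan_index.size
    n_bins = mz_max - mz_min + 1

    # scan_index holds the start offset of each scan in the flat arrays,
    # so consecutive differences give the number of points per scan.
    counts = np.diff(np.append(scan_index, mass_values.size))
    rows = np.repeat(np.arange(n_scans), counts)
    cols = np.rint(mass_values).astype(np.int64) - mz_min

    # Drop points outside the m/z range, then accumulate intensities.
    keep = (cols >= 0) & (cols < n_bins)
    ms = np.zeros((n_scans, n_bins))
    np.add.at(ms, (rows[keep], cols[keep]), intensity_values[keep])
    return ms

def extract_cdf(path, out_dir):
    """Read one .CDF run and write time.npy and ms.npy into out_dir."""
    from scipy.io import netcdf_file  # assumed reader for classic NetCDF files

    with netcdf_file(path, "r", mmap=False) as f:
        time = f.variables["scan_acquisition_time"][:].copy()
        scan_index = f.variables["scan_index"][:].copy()
        mass = f.variables["mass_values"][:].copy()
        intensity = f.variables["intensity_values"][:].copy()

    os.makedirs(out_dir, exist_ok=True)
    np.save(os.path.join(out_dir, "time.npy"), time)
    np.save(os.path.join(out_dir, "ms.npy"), bin_scans(scan_index, mass, intensity))
```

With unit-mass bins, m/z 15–300 gives 286 columns, so under these assumptions each run would produce a 12,004 × 286 matrix.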