Clustering of cell populations in flow cytometry data using a combination of Gaussian mixtures / Reiter, M.; Rota, P.; Kleber, F.; Diem, M.; Groeneveld-Krentz, S.; Dworzak, M.. - In: PATTERN RECOGNITION. - ISSN 0031-3203. - 60:(2016), pp. 1029-1040. [10.1016/j.patcog.2016.04.004]
Clustering of cell populations in flow cytometry data using a combination of Gaussian mixtures
Rota P.;
2016-01-01
Abstract
We propose a supervised learning approach to the automatic quantification of cell populations in flow cytometric samples. One sample contains up to millions of measurement vectors with a dimensionality between 10 and 20. Normally, each measurement vector corresponds to a single cell in the biological sample. Identifying biologically meaningful cell populations is essentially a clustering problem; however, standard clustering methods are impractical because the size, shape and location of corresponding clusters may vary strongly between samples, mainly due to phenotypic differences and inter-laboratory variations. In our holistic approach, we implicitly employ structural information (such as the relative locations and shapes of sub-populations). A new input sample is reconstructed by a linear combination of artificial reference samples, each represented by a Gaussian Mixture Model (GMM) in which the class label of the cluster of observations corresponding to each Gaussian component is known. The reference samples are calculated from a larger set of training samples by non-negative matrix factorization and can be regarded as the basis of a lower-dimensional feature space in which input samples are reconstructed. We show a method for calculating the feature space transformation based on minimizing the L2 distance defined between two GMMs. The feature space representation of the sample is then used to assign each observation to one of the specified sub-populations by a Bayes decision. We present classification results on a database of about 170 patients with Acute Lymphoblastic Leukemia (ALL), where high accuracy in the prediction of relatively small leukemic populations is crucial. The approach is not limited to our application: it can be employed wherever large, multi-dimensional, numerical data from a specific class of samples with related structure have to be analyzed.
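Two of the building blocks named in the abstract, the L2 distance between two GMMs and the Bayes decision over labelled Gaussian components, can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes diagonal covariances to stay within the standard library, and it leaves out both the non-negative matrix factorization and the minimization over combination weights. The function names (`gauss_diag`, `gmm_inner`, `gmm_l2_sq`, `bayes_assign`), the example mixtures and the labels are all hypothetical. The distance uses the standard closed form for the integral of a product of two Gaussians, namely N(m1; m2, S1 + S2).

```python
import math

# A mixture is a list of (weight, mean, variances) tuples. Diagonal
# covariances are a simplifying assumption made only for this sketch.


def gauss_diag(x, mean, var):
    """Density at x of a Gaussian with diagonal covariance `var`."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)


def gmm_inner(f, g):
    """Closed-form integral of f(x) * g(x) over x for two mixtures.

    Each pair of components contributes w_f * w_g * N(m_f; m_g, S_f + S_g).
    """
    total = 0.0
    for wf, mf, vf in f:
        for wg, mg, vg in g:
            summed_var = [a + b for a, b in zip(vf, vg)]
            total += wf * wg * gauss_diag(mf, mg, summed_var)
    return total


def gmm_l2_sq(f, g):
    """Squared L2 distance: integral of (f(x) - g(x))^2 over x."""
    return gmm_inner(f, f) - 2.0 * gmm_inner(f, g) + gmm_inner(g, g)


def bayes_assign(x, gmm, labels):
    """Return the class label with the largest posterior at x.

    labels[k] is the known class of component k. Component scores
    (weight * density) are summed per class before taking the maximum.
    """
    scores = {}
    for (w, m, v), label in zip(gmm, labels):
        scores[label] = scores.get(label, 0.0) + w * gauss_diag(x, m, v)
    return max(scores, key=scores.get)


# Hypothetical 2-D mixtures; real samples have 10 to 20 dimensions.
reference = [(0.7, [0.0, 0.0], [1.0, 1.0]), (0.3, [4.0, 4.0], [0.5, 0.5])]
shifted = [(0.7, [0.5, 0.0], [1.0, 1.0]), (0.3, [4.0, 4.5], [0.5, 0.5])]
labels = ["normal", "leukemic"]

print(gmm_l2_sq(reference, reference))
print(gmm_l2_sq(reference, shifted))
print(bayes_assign([3.8, 4.1], reference, labels))
```

Because the distance has a closed form, it can be evaluated without sampling or numerical integration over the millions of observations in a sample. That is one plausible reason for comparing samples at the level of their mixture models rather than their raw measurements.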