Project: PROJECT | Report date: 2023-11-07 | Lab data analyst(s): OAA/LN/RMH/MK | Report author(s): JLB



1 Imputation methods

Before analyzing and publishing targeted metabolomics data from the BEVITAL lab, please make sure that any missing values are processed properly. Before imputing, identify the main or most likely missing-data mechanism in your data set and impute accordingly. When a high proportion (70% or more) of the missing values are missing not at random (MNAR), MNAR imputation methods are preferred.

Left-censored missing values, caused by concentrations below the LOD (limit of detection) or LOQ (limit of quantification), are common in targeted metabolomics data and can be considered MNAR. Traditionally, left-censored missing values have in several research fields been imputed by 0, LOD, LOD/2, or LOD/\(\sqrt{2}\). However, guidelines have stated that these conservative methods should be used only if the percentage of censored data is less than 5% and if the data are mildly skewed, and some have advocated that such approaches should never be implemented. Replacement with zero or a fixed small value (such as LOD or LOD/2) may introduce biases, e.g., distortion of the distribution of the variables with missing values and underestimation of the standard deviation (SD). Thus, improper handling of missing values will adversely affect subsequent statistical analyses.

Although some new statistical methods have been developed that allow the existence of left-censored missing values, such as the ‘accelerated failure time’ model, a complete data set is required for most statistical analysis frameworks in metabolomics studies. A few imputation methods have been developed and applied to the situation of MNAR in the field of targeted metabolomics, e.g.:

  • GSimp
  • QRILC
  • MinProb
  • MinDet

Among these, we primarily use GSimp and, for some data sets, also QRILC. In addition, for comparison, we may have included a data set in which elements missing because of values below the LOD are imputed with random values between zero and the variable-specific LOD (if any) using the ‘runif’ function in R (runif(n = 1, min = 0, max = LOD)).
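The random LOD-based imputation can be illustrated with a minimal sketch. It is written in Python purely for illustration (the procedure in this report uses ‘runif’ in R); the function name `impute_below_lod`, the example values, and the LOD of 0.10 are hypothetical. The code -3 for values below the LOD follows the coding used in this report.

```python
import random

BELOW_LOD = -3  # missing-data code for values below the LOD, as used in this report


def impute_below_lod(values, lod, rng):
    """Replace each below-LOD code with a uniform random draw between 0 and lod."""
    return [rng.uniform(0.0, lod) if v == BELOW_LOD else v for v in values]


rng = random.Random(2023)                      # fixed seed so the sketch is reproducible
measured = [0.42, BELOW_LOD, 0.18, BELOW_LOD, 0.95]  # hypothetical concentrations
imputed = impute_below_lod(measured, lod=0.10, rng=rng)
```

Each censored element receives its own independent draw, which corresponds to calling runif with n = 1 per missing element.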

QRILC (quantile regression imputation of left-censored data) is specifically designed for MS-based targeted metabolomics studies with left-censored MNAR data caused by values below the LOD or LOQ. This method imputes missing elements with values drawn at random from a truncated distribution estimated by quantile regression. QRILC has been shown to preserve the overall data distribution and variances. However, this approach may generate purely stochastic values, since QRILC imputes missing values independently within each variable without using the predictive information from other variables.

An alternative approach is GSimp, a Gibbs sampler-based left-censored missing value imputation procedure for metabolomics data. GSimp utilizes the predictive information of other variables by employing a prediction model while simultaneously holding a truncated normal distribution for each missing element. The Gibbs sampler is a Markov Chain Monte Carlo (MCMC) technique that sequentially updates each parameter while the others are held fixed, and it can be used to generate posterior samples. For each variable with missing values, GSimp applies a Gibbs sampler to impute the missing values by sampling from a truncated normal distribution, with the fitted value of the prediction model as the mean and the root mean square deviation (RMSD) of the missing part as the standard deviation, truncated at the specified cut-points.
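The core sampling step described above, a draw from a normal distribution truncated at an upper cut-point, can be sketched as follows. This is not the GSimp implementation: it shows a single draw by inverse-CDF sampling, in Python for illustration only, and omits the prediction model and the iterative Gibbs updates. The function name, the fitted value, the standard deviation, and the cut-point are all hypothetical.

```python
import math
import random
from statistics import NormalDist


def draw_upper_truncated_normal(mean, sd, upper, rng):
    """Draw one value from N(mean, sd) restricted to (-inf, upper]."""
    dist = NormalDist(mu=mean, sigma=sd)
    p_upper = dist.cdf(upper)              # probability mass below the cut-point
    u = rng.uniform(0.0, p_upper)          # uniform draw on the allowed part of the CDF
    u = min(max(u, 1e-12), 1.0 - 1e-12)    # keep inv_cdf strictly inside (0, 1)
    return min(dist.inv_cdf(u), upper)     # guard against rounding above the cut-point


rng = random.Random(1)
log_upper = math.log(0.08)   # hypothetical minimum observed value, on the log scale
log_mean = math.log(0.05)    # hypothetical fitted value from a prediction model
log_draws = [draw_upper_truncated_normal(log_mean, 0.5, log_upper, rng)
             for _ in range(1000)]
imputed = [math.exp(x) for x in log_draws]  # back-transform to the original scale
```

Because the draws are made on the log scale with no lower bound, the back-transformed values are always positive and do not exceed the upper cut-point.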

If the data set contains elements missing both at random (MAR) and MNAR, a hybrid method is recommended, such as the MNAR/MAR MI SFI-hybrid approach.

Several Bayesian approaches have also been used for different missing data mechanisms, such as Bayesian estimation from left-censored data using the Markov Chain Monte Carlo (MCMC) method or Bayesian principal component analysis (BPCA).

Alternatively, you may use Tobit regression in the presence of left-censored data.

For further details about left-censored missing value imputation approaches for metabolomics data, see, e.g.:

In this report, we have tried to help you identify the nature of the missing values, but we recommend that you conduct your own thorough diagnosis of the missing data mechanisms and choose the most appropriate imputation method(s) based on this assessment. This diagnosis can operate at different levels, e.g.: (i) at the data set level, so that the imputation strategy is applied conditionally to the majority of missing values in the entire data set; and (ii) at the missing value level, so as to obtain a more refined categorization of the missing values across the data set.

If the data set contains missing elements primarily caused by values lower than LOD (coded by -3), i.e., left-censored missing values, then you may use a data set with missing values imputed by the GSimp approach as outlined here (section 2.1) or alternatively by QRILC (section 2.2). Notably, missing values should be imputed separately for each subgroup. This is done in the lab data report only when we have information about group allocation specified in a separate column in the Excel file following the samples. The group allocation should be blinded in randomized clinical trials.


2 Imputed data

Data are downloaded by clicking the table button(s) for your preferred file format(s).
You can choose between several options that handle missing values by different approaches, as explained in section 1 and in the table footnotes.

2.1 Data set with missing values imputed by GSimp

2.1.1 Pooled data


This table may be empty if not all metabolites have been measured for all groups/samples in the data set or because of other issues in the data.
Here, missing data points are imputed across groups by the GSimp approach as outlined here.
Missing values were initialized by QRILC (quantile regression imputation of left-censored data). We natural log-transformed the data before QRILC was conducted to improve the imputation accuracy and ensure positive values on the original scale after back-transformation. Elastic net from the R package ‘glmnet’ was used as the prediction model. We applied the minimum observed value of each variable with missing values as an informative upper truncation point and -Inf as a non-informative lower truncation point for left-censored missing values. Before GSimp, we did not follow the ‘80% rule’ or ‘modified 80% rule’, but removed metabolites with >60% missing values or <25 observations.
Subjects with sample missing (code -1 or -2) are excluded from the imputation procedure but included at the end of the data set with imputed values.
Metabolites with biologically meaningful zero values are excluded from the imputation procedure but included in the data set with imputed values.
Metabolites only measured in other matrices than plasma/serum may be excluded from the imputation procedure and associated tables/figures.
Platforms are designated by the uppercase letters after the two punctuation marks in column headings.
SampleID may be a combination of: 1) the lab’s ‘SampleID’ (before the punctuation marks) and the project’s ‘SubjectID’ (after the punctuation marks) or 2) the lab’s ‘SampleID’ (before the underscore) and the lab’s ‘SeriesNo’ (after the underscore).
As most circulating metabolites fit a log-normal (multiplicative) distribution as well as or better than a normal (additive) distribution, we strongly recommend log-transforming the positive-valued continuous outcome data, which are often positively skewed, before data exploration and statistical analyses.
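The metabolite filter mentioned in the notes above (removal of metabolites with >60% missing values or <25 observations) can be sketched as below. This is an illustration in Python, not the lab's R code. It assumes that ‘observations’ means non-missing values and that missing elements are represented by None; the function name and the example data are hypothetical.

```python
def keep_for_imputation(values, max_missing_fraction=0.60, min_observed=25):
    """Return True if a metabolite passes the filter applied before imputation."""
    observed = sum(v is not None for v in values)   # non-missing measurements
    missing_fraction = (len(values) - observed) / len(values)
    return missing_fraction <= max_missing_fraction and observed >= min_observed


# Hypothetical metabolites: lists of concentrations with None for missing values.
exactly_60_missing = [1.0] * 40 + [None] * 60   # 60% missing, 40 observed
over_60_missing = [1.0] * 39 + [None] * 61      # 61% missing
too_few_observed = [1.0] * 20 + [None] * 10     # only 20 observed values

kept = [keep_for_imputation(m)
        for m in (exactly_60_missing, over_60_missing, too_few_observed)]
```

Under these assumptions, a metabolite with exactly 60% missing values is kept, because only values above 60% trigger removal.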


2.1.2 Grouped data


This table is empty if we have no data on group allocation or it may be omitted because of other issues in the data.
Here, missing data points are imputed within groups by the GSimp approach as outlined here.
Missing values were initialized by QRILC (quantile regression imputation of left-censored data). We natural log-transformed the data before QRILC was conducted to improve the imputation accuracy and ensure positive values on the original scale after back-transformation. Elastic net from the R package ‘glmnet’ was used as the prediction model. We applied the minimum observed value of each variable with missing values as an informative upper truncation point and -Inf as a non-informative lower truncation point for left-censored missing values. Before GSimp, we did not follow the ‘80% rule’ or ‘modified 80% rule’, but removed metabolites with >60% missing values or <25 observations.
Subjects with sample missing (code -1 or -2) are excluded from the imputation procedure but included at the end of the data set with imputed values.
Metabolites with biologically meaningful zero values are excluded from the imputation procedure but included in the data set with imputed values.
Metabolites only measured in other matrices than plasma/serum may be excluded from the imputation procedure and associated tables/figures.
Platforms are designated by the uppercase letters after the two punctuation marks in column headings.
SampleID may be a combination of: 1) the lab’s ‘SampleID’ (before the punctuation marks) and the project’s ‘SubjectID’ (after the punctuation marks) or 2) the lab’s ‘SampleID’ (before the underscore) and the lab’s ‘SeriesNo’ (after the underscore).
As most circulating metabolites fit a log-normal (multiplicative) distribution as well as or better than a normal (additive) distribution, we strongly recommend log-transforming the positive-valued continuous outcome data, which are often positively skewed, before data exploration and statistical analyses.


2.2 Data set with missing values imputed by QRILC

2.2.1 Pooled data


This table may be empty if not all metabolites have been measured for all groups/samples in the data set or because of other issues in the data.
Here, missing data points are imputed across groups by QRILC (quantile regression imputation of left-censored data).
We natural log-transformed the data before QRILC was conducted to improve the imputation accuracy and ensure positive values on the original scale after back-transformation. The R package ‘MsCoreUtils’ (function ‘impute_matrix’ and method ‘QRILC’) was applied for this imputation approach. Before QRILC, we did not follow the ‘80% rule’ or ‘modified 80% rule’, but removed metabolites with >60% missing values or <25 observations.
Subjects with sample missing (code -1 or -2) are excluded from the imputation procedure but included at the end of the data set with imputed values.
Metabolites with biologically meaningful zero values are excluded from the imputation procedure but included in the data set with imputed values.
Metabolites only measured in other matrices than plasma/serum may be excluded from the imputation procedure and associated tables/figures.
Platforms are designated by the uppercase letters after the two punctuation marks in column headings.
SampleID may be a combination of: 1) the lab’s ‘SampleID’ (before the punctuation marks) and the project’s ‘SubjectID’ (after the punctuation marks) or 2) the lab’s ‘SampleID’ (before the underscore) and the lab’s ‘SeriesNo’ (after the underscore).
As most circulating metabolites fit a log-normal (multiplicative) distribution as well as or better than a normal (additive) distribution, we strongly recommend log-transforming the positive-valued continuous outcome data, which are often positively skewed, before data exploration and statistical analyses.


2.2.2 Grouped data


This table is empty if we have no data on group allocation or it may be omitted because of other issues in the data.
Here, missing data points are imputed within groups by QRILC (quantile regression imputation of left-censored data).
We natural log-transformed the data before QRILC was conducted to improve the imputation accuracy and ensure positive values on the original scale after back-transformation. The R package ‘MsCoreUtils’ (function ‘impute_matrix’ and method ‘QRILC’) was applied for this imputation approach. Before QRILC, we did not follow the ‘80% rule’ or ‘modified 80% rule’, but removed metabolites with >60% missing values or <25 observations.
Subjects with sample missing (code -1 or -2) are excluded from the imputation procedure but included at the end of the data set with imputed values.
Metabolites with biologically meaningful zero values are excluded from the imputation procedure but included in the data set with imputed values.
Metabolites only measured in other matrices than plasma/serum may be excluded from the imputation procedure and associated tables/figures.
Platforms are designated by the uppercase letters after the two punctuation marks in column headings.
SampleID may be a combination of: 1) the lab’s ‘SampleID’ (before the punctuation marks) and the project’s ‘SubjectID’ (after the punctuation marks) or 2) the lab’s ‘SampleID’ (before the underscore) and the lab’s ‘SeriesNo’ (after the underscore).
As most circulating metabolites fit a log-normal (multiplicative) distribution as well as or better than a normal (additive) distribution, we strongly recommend log-transforming the positive-valued continuous outcome data, which are often positively skewed, before data exploration and statistical analyses.


2.3 Data set imputed with random values between 0 and LOD


This table may be empty if not all metabolites have been measured for all groups/samples in the data set or because of other issues in the data.
Here, data points missing because of values below the limit of detection (LOD) are imputed with random values between zero and the LOD (if an LOD is available; otherwise the cell is empty) using the ‘runif’ function in R (runif(n = 1, min = 0, max = LOD)).
Table cells are empty if data points are missing for other reasons specified in part 3.
Metabolites only measured in other matrices than plasma/serum may be excluded from the imputation procedure and associated tables/figures.
Platforms are designated by the uppercase letters after the two punctuation marks in column headings.
SampleID may be a combination of: 1) the lab’s ‘SampleID’ (before the punctuation marks) and the project’s ‘SubjectID’ (after the punctuation marks) or 2) the lab’s ‘SampleID’ (before the underscore) and the lab’s ‘SeriesNo’ (after the underscore).


3 Descriptive statistics

3.1 GSimp-imputed data

3.1.1 Pooled data


This table may be empty if not all metabolites have been measured for all groups/samples in the data set or because of other issues in the data.
If we have data on group allocation, the data set with missing values imputed within groups is used to calculate the descriptive statistics.
Be aware of metabolites with biologically meaningful zero values. Here, gMean is usually equal to zero, and it will not be possible to calculate gSD.
Metabolites only measured in other matrices than plasma/serum may be excluded from this table.

Abbreviations:

  • n: number of samples included
  • IQR: interquartile range; measures the spread of the middle half of the data when values are ordered from lowest to highest
  • gMean: geometric mean
  • gSD: geometric SD
  • gSDrange: geometric SD range (1 SD range); calculated by dividing and multiplying the geometric mean (gMean) by the geometric SD factor (gSD) to obtain the lower (gSDlower) and upper (gSDupper) limits, respectively
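The geometric summary statistics defined above can be illustrated with a short sketch. It is written in Python for illustration only and assumes that gSD is the exponentiated sample standard deviation (n - 1 denominator) of the natural-log values; the report does not state which denominator is used. The function name and the example values are hypothetical.

```python
import math
import statistics


def geometric_summary(values):
    """Compute gMean, gSD and the 1-gSD range for strictly positive values."""
    logs = [math.log(v) for v in values]         # math.log raises ValueError for zero
    g_mean = math.exp(statistics.fmean(logs))    # geometric mean
    g_sd = math.exp(statistics.stdev(logs))      # geometric SD factor (sample SD assumed)
    return {
        "gMean": g_mean,
        "gSD": g_sd,
        "gSDlower": g_mean / g_sd,               # lower limit: divide by gSD
        "gSDupper": g_mean * g_sd,               # upper limit: multiply by gSD
    }


summary = geometric_summary([1.0, 10.0, 100.0])  # hypothetical concentrations
```

For these three values the log-scale mean and sample SD both equal ln(10), so gMean and gSD are both approximately 10 and the range runs from about 1 to about 100. Because the logarithm of zero is undefined, this sketch fails for metabolites with zero values rather than returning a gSD.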

3.1.2 Grouped data


This table is empty if we have no data on group allocation or it may be omitted because of other issues in the data.
If we have data on group allocation, the data set with missing values imputed within groups is used to calculate the descriptive statistics.
Be aware of metabolites with biologically meaningful zero values. Here, gMean is usually equal to zero, and it will not be possible to calculate gSD.
Metabolites only measured in other matrices than plasma/serum may be excluded from this table.

Abbreviations:

  • n: number of samples included
  • IQR: interquartile range; measures the spread of the middle half of the data when values are ordered from lowest to highest
  • gMean: geometric mean
  • gSD: geometric SD
  • gSDrange: geometric SD range (1 SD range); calculated by dividing and multiplying the geometric mean (gMean) by the geometric SD factor (gSD) to obtain the lower (gSDlower) and upper (gSDupper) limits, respectively

3.2 QRILC-imputed data

3.2.1 Pooled data


This table may be empty if not all metabolites have been measured for all groups/samples in the data set or because of other issues in the data.
If we have data on group allocation, the data set with missing values imputed within groups is used to calculate the descriptive statistics.
Be aware of metabolites with biologically meaningful zero values. Here, gMean is usually equal to zero, and it will not be possible to calculate gSD.
Metabolites only measured in other matrices than plasma/serum may be excluded from this table.

Abbreviations:

  • n: number of samples included
  • IQR: interquartile range; measures the spread of the middle half of the data when values are ordered from lowest to highest
  • gMean: geometric mean
  • gSD: geometric SD
  • gSDrange: geometric SD range (1 SD range); calculated by dividing and multiplying the geometric mean (gMean) by the geometric SD factor (gSD) to obtain the lower (gSDlower) and upper (gSDupper) limits, respectively

3.2.2 Grouped data


This table is empty if we have no data on group allocation or it may be omitted because of other issues in the data.
If we have data on group allocation, the data set with missing values imputed within groups is used to calculate the descriptive statistics.
Be aware of metabolites with biologically meaningful zero values. Here, gMean is usually equal to zero, and it will not be possible to calculate gSD.
Metabolites only measured in other matrices than plasma/serum may be excluded from this table.

Abbreviations:

  • n: number of samples included
  • IQR: interquartile range; measures the spread of the middle half of the data when values are ordered from lowest to highest
  • gMean: geometric mean
  • gSD: geometric SD
  • gSDrange: geometric SD range (1 SD range); calculated by dividing and multiplying the geometric mean (gMean) by the geometric SD factor (gSD) to obtain the lower (gSDlower) and upper (gSDupper) limits, respectively

3.3 LOD-imputed data (random method)

3.3.1 Pooled data


This table may be empty if not all metabolites have been measured for all groups/samples in the data set or because of other issues in the data.
Be aware of metabolites with biologically meaningful zero values. Here, gMean is usually equal to zero, and it will not be possible to calculate gSD.
Metabolites only measured in other matrices than plasma/serum may be excluded from this table.

Abbreviations:

  • n: number of samples included; excludes data points that were missing for reasons other than values below the LOD (missing data code -3)
  • IQR: interquartile range; measures the spread of the middle half of the data when values are ordered from lowest to highest
  • gMean: geometric mean
  • gSD: geometric SD
  • gSDrange: geometric SD range (1 SD range); calculated by dividing and multiplying the geometric mean (gMean) by the geometric SD factor (gSD) to obtain the lower (gSDlower) and upper (gSDupper) limits, respectively

3.3.2 Grouped data


This table is empty if we have no data on group allocation or it may be omitted because of other issues in the data.
This table may also be empty if not all metabolites have been measured for all groups/samples in the data set.
Be aware of metabolites with biologically meaningful zero values. Here, gMean is usually equal to zero, and it will not be possible to calculate gSD.
Metabolites only measured in other matrices than plasma/serum may be excluded from this table.

Abbreviations:

  • n: number of samples included; excludes data points that were missing for reasons other than values below the LOD (missing data code -3)
  • IQR: interquartile range; measures the spread of the middle half of the data when values are ordered from lowest to highest
  • gMean: geometric mean
  • gSD: geometric SD
  • gSDrange: geometric SD range (1 SD range); calculated by dividing and multiplying the geometric mean (gMean) by the geometric SD factor (gSD) to obtain the lower (gSDlower) and upper (gSDupper) limits, respectively

3.4 Comparison of data sets

3.4.1 Missing metabolites

3.4.1.1 Pooled data


This table may be empty if not all metabolites have been measured for all groups/samples in the data set or because of other issues in the data.

Abbreviations:

  • n: number of samples included
  • MeanSD: mean ± standard deviation
  • MedianIQR: median (interquartile range); IQR measures the spread of the middle half of the data when values are ordered from lowest to highest
  • gMeanSDrange: geometric mean (1 geometric SD range); gSDrange calculated by dividing and multiplying the geometric mean (gMean) by the geometric SD factor (gSD) to obtain the lower and upper limits, respectively
  • nonIMP: non-imputed data
  • GSimp: Gibbs sampler based left-censored missing value imputation procedure
  • QRILC: quantile regression imputation of left-censored data
  • LODrandom: LOD-imputed data (random method)

3.4.1.2 Grouped data


This table is empty if we have no data on group allocation or it may be omitted because of other issues in the data.

Abbreviations:

  • n: number of samples included
  • MeanSD: mean ± standard deviation
  • MedianIQR: median (interquartile range); IQR measures the spread of the middle half of the data when values are ordered from lowest to highest
  • gMeanSDrange: geometric mean (1 geometric SD range); gSDrange calculated by dividing and multiplying the geometric mean (gMean) by the geometric SD factor (gSD) to obtain the lower and upper limits, respectively
  • nonIMP: non-imputed data
  • GSimp: Gibbs sampler based left-censored missing value imputation procedure
  • QRILC: quantile regression imputation of left-censored data
  • LODrandom: LOD-imputed data (random method)

3.4.2 All metabolites

3.4.2.1 Pooled data


This table may be empty if not all metabolites have been measured for all groups/samples in the data set or because of other issues in the data.
Be aware of metabolites with biologically meaningful zero values. Here, gMean is usually equal to zero, and it will not be possible to calculate gSD.
Metabolites only measured in other matrices than plasma/serum may be excluded from this table.

Abbreviations:

  • n: number of samples included
  • MeanSD: mean ± standard deviation
  • MedianIQR: median (interquartile range); IQR measures the spread of the middle half of the data when values are ordered from lowest to highest
  • gMeanSDrange: geometric mean (1 geometric SD range); gSDrange calculated by dividing and multiplying the geometric mean (gMean) by the geometric SD factor (gSD) to obtain the lower and upper limits, respectively
  • nonIMP: non-imputed data
  • GSimp: Gibbs sampler based left-censored missing value imputation procedure
  • QRILC: quantile regression imputation of left-censored data
  • LODrandom: LOD-imputed data (random method)

3.4.2.2 Grouped data


This table is empty if we have no data on group allocation or it may be omitted because of other issues in the data.
Be aware of metabolites with biologically meaningful zero values. Here, gMean is usually equal to zero, and it will not be possible to calculate gSD.
Metabolites only measured in other matrices than plasma/serum may be excluded from this table.

Abbreviations:

  • n: number of samples included
  • MeanSD: mean ± standard deviation
  • MedianIQR: median (interquartile range); IQR measures the spread of the middle half of the data when values are ordered from lowest to highest
  • gMeanSDrange: geometric mean (1 geometric SD range); gSDrange calculated by dividing and multiplying the geometric mean (gMean) by the geometric SD factor (gSD) to obtain the lower and upper limits, respectively
  • nonIMP: non-imputed data
  • GSimp: Gibbs sampler based left-censored missing value imputation procedure
  • QRILC: quantile regression imputation of left-censored data
  • LODrandom: LOD-imputed data (random method)