Cardiopulmonary and metabolic responses during a 2-day CPET in [ME/CFS]: translating reduced oxygen consumption [...], Keller et al, 2024

Anyway, this is a bit beside the point.

I plan to write a blog post about this, because what the data show is quite different from what the paper reports and focuses on.

- I think the data show that there is no significant effect for any of the outcomes, whether you look at AT or max values, use matched pairs or not, or exclude patients who failed to meet the maximal-effort criteria or not.

- VO2 and workload differences do not correlate with severity as measured with the Bell Scale (they call it the Bell activity scale, but it is really a measure of disability rather than activity).

- As the authors point out, the differences in the ME/CFS group are often slightly larger than in the control group, but this was not statistically significant and possibly due to chance. If there is an effect, it is likely a very small one, one that can only be detected reliably with even larger sample sizes.

- Surprisingly, the largest effect sizes were seen at peak values, not at the ventilatory threshold. This could be due to the criteria used to determine peak effort, which have been questioned in the literature; some argue for using lactate measurements instead of relying on %predicted HR.

- The large overlap between ME/CFS patients and controls means that a 2-day CPET is not a useful measurement for ME/CFS disease activity or PEM.

I think that's a good analysis. At the very least, none of these metrics look like anything resembling a diagnostic biomarker to classify between groups. *Possibly* a decrease in VO2 or workload greater than around 20% might be fairly specific to ME/CFS, but it has very low sensitivity.

Edit: It'd be interesting to check all difference metrics in this regard, looking for cutoff values that are highly specific and at least around 20% sensitive, to maybe identify a subtype of ME/CFS. Or maybe those are the most severe patients, and the Bell score didn't capture that somehow.
 
Last edited:
Edit: I made a mistake, see following posts.

I had posted this then deleted it because I thought there was an error in the code, but I went through it and it's fine.

I calculated the sensitivity and specificity at every threshold for every difference feature, took the threshold for each feature where specificity was at least 90%, and then sorted the features by sensitivity. These are all the ones with sensitivity over 20%. (I removed max_VO2_t_diff_percentage and AT_VO2_t_diff_percentage because the VO2 and VO2_t percentages are identical.)

I also excluded, from all the max metrics, the participants we concluded didn't meet the maximal-effort criteria.
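Roughly, the per-feature search boils down to something like this (a simplified sketch, not the exact code I ran; it assumes the difference features are all in one table with a 'phenotype' column, like the mecfs_dataset.csv exported by the full code I post further down):
Code:
import numpy as np
import pandas as pd
from sklearn.metrics import roc_curve

# Hypothetical table: one row per participant with the *_diff_* columns and phenotype
df = pd.read_csv('mecfs_dataset.csv')
y = (df['phenotype'] == 'MECFS').astype(int)

def best_sensitivity(feature, min_specificity=0.90):
    # Highest sensitivity reachable for this feature while keeping specificity >= 90%
    mask = df[feature].notna()
    x, labels = df.loc[mask, feature], y[mask]
    # Flip the sign if ME/CFS tends to have lower values, so "higher score = more likely ME/CFS"
    score = -x if np.corrcoef(x, labels)[0, 1] < 0 else x
    fpr, tpr, thresholds = roc_curve(labels, score)
    ok = (1 - fpr) >= min_specificity  # specificity = 1 - false positive rate
    return tpr[ok].max() if ok.any() else None

print(best_sensitivity('max_VO2_diff_percentage'))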

Features sorted by sensitivity at >= 90% specificity:
max_VO2_diff_percentage: 30.3% sensitivity
max_VO2_diff_absolute: 30.3% sensitivity
AT_VO2_diff_percentage: 29.8% sensitivity
max_VCO2_diff_percentage: 28.9% sensitivity
AT_Ve_VCO2_diff_absolute: 28.6% sensitivity
AT_PETCO2_diff_percentage: 27.4% sensitivity
AT_wkld_diff_percentage: 25.0% sensitivity
max_wkld_diff_percentage: 25.0% sensitivity
AT_Ve_VCO2_diff_percentage: 23.8% sensitivity
max_VO2_t_diff_absolute: 23.7% sensitivity
AT_DBP_diff_percentage: 22.9% sensitivity
AT_VO2_HR_diff_percentage: 22.9% sensitivity
AT_PP_diff_percentage: 21.7% sensitivity
AT_VO2_t_diff_absolute: 21.4% sensitivity
max_Ve_BTPS_diff_percentage: 21.1% sensitivity
rest_HR_diff_percentage: 20.2% sensitivity
rest_PP_diff_percentage: 20.2% sensitivity
AT_VO2_diff_absolute: 20.2% sensitivity
AT_Ve_VO2_diff_percentage: 20.2% sensitivity

Here are the plots of the top 10:
[Attached: specificity/sensitivity histogram plots for the top 10 features above]

For anyone interested in an explanation: I calculated the value that each metric had to change where at least 90% of the people who had a larger change were ME/CFS. For example, in the first chart, max_VO2_diff_percentage, this shows the percent that the participant's VO2 changed on day 2. The red dotted vertical line shows the threshold (-9.70%) where for the participants that had a larger decrease (further left of the line) 91.3% of them were ME/CFS.

This could be important, because this might indicate that if you see someone has a larger decrease than around 10% on day 2, it might be very likely they have ME/CFS. In the first chart, you can see a lot of red (MECFS) and not a lot of blue (HC) left of the vertical line. (Purple bars are red and blue in the same location.)

The sensitivity is how many ME/CFS participants were left (or right for increasing metrics) of the threshold. In the first chart, that would be 30.3% of pwME had a larger decrease than about 10% of VO2. So 70% of them would not appear to have ME/CFS if using this threshold to diagnose. You can raise the threshold to include more pwME, but then you're also misdiagnosing more controls.

None of the above is conclusive, but I think it's a good thing to look into. Maybe a subtype of ME/CFS has a large change? Maybe these are the most severe, but the Bell scale didn't capture that somehow? Maybe these are only the ones that have entered PEM by 24 hours later?

Seeing these thresholds with high specificity and somewhat okay sensitivities (20-30%) doesn't seem like just random variation between groups, because out of 22 metrics, VO2, one of the metrics most associated with 2-day CPET decline in past studies, is at the top of the above list for sensitivity. And workload, the other main metric talked about, is the fifth most sensitive unique metric above. That seems like more than coincidence.

Group numbers are around 76-84 MECFS and 69-71 HC (it differs per feature depending on whether participants were excluded or had missing values).


-----

@ME/CFS Skeptic I think there's something weird with PI-026. Almost all the metrics are exactly the same on both days.
 
Last edited:
Thanks again for the impressive analysis @forestglip.

I do not get the same results though. For example if I use the threshold of -9.7% for max VO2 values, I get 25 ME/CFS and 7 healthy controls so that ME/CFS patients make up 78% of the sample rather than 90%.

Because of the large overlap between MECFS and HC, I find no values that had a specificity (EDIT: precision) > 90%, except at really high thresholds, because a couple of ME/CFS patients had large increases on day 2.

[attached chart]
 
Last edited:
Thanks again for the impressive analysis @forestglip.

I do not get the same results though. For example if I use the threshold of -9.7% for max VO2 values, I get 25 ME/CFS and 7 healthy controls so that ME/CFS patients make up 78% of the sample rather than 90%.

Because of the large overlap between MECFS and HC, I find no values that had a specificity > 90%, except at really high thresholds, because a couple of ME/CFS patients had large increases on day 2.

View attachment 23198

Oh I am dumb. I've had a misunderstanding of the term specificity. It doesn't measure true positives/(true positives + false positives). It measures true negatives/(true negatives + false positives). The correct measure for what I was looking for is precision, not specificity. I'm going to redo this.

Edit: For max VO2 percentage difference I get 22 ME/CFS and 6 HC below -9.7%. I think this is just because I excluded the participants who didn't satisfy the max criteria. The percentage is about the same though.
 
Last edited:
Yes I think we mixed up the terms.

You were right about the sensitivity and specificity values, but your description (how many of those with values lower than -9.7% would be ME/CFS patients) refers to precision rather than specificity.

Here's what I got for a threshold of -9.7, for example:

Total sample
84 MECFS
71 HC
155 Total

Below the -9.7 threshold
25 MECFS
7 HC
32 Total

Sensitivity: 25/84 = 29.76%
Specificity: (71-7)/71 = 90.14%

Positive Predictive Value (PPV, precision) = 25/32 = 78.13%
Negative Predictive Value = (71-7) / (155-32) = 52.03%

False Positive Rate = 1 - specificity = (7)/71 = 9.86%
False Negative Rate = 1 - sensitivity = (84-25) / 84 = 70.24%
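The same numbers can be reproduced from the raw counts with sklearn's confusion matrix (just a sketch to double-check the arithmetic):
Code:
import numpy as np
from sklearn.metrics import confusion_matrix

# 25 of 84 ME/CFS and 7 of 71 HC fall below the -9.7% threshold
y_true = np.array([1] * 84 + [0] * 71)                       # 1 = ME/CFS, 0 = HC
y_pred = np.array([1] * 25 + [0] * 59 + [1] * 7 + [0] * 64)  # 1 = below threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f'sensitivity = {tp / (tp + fn):.4f}')  # 25/84  = 0.2976
print(f'specificity = {tn / (tn + fp):.4f}')  # 64/71  = 0.9014
print(f'precision   = {tp / (tp + fp):.4f}')  # 25/32  = 0.7813
print(f'NPV         = {tn / (tn + fn):.4f}')  # 64/123 = 0.5203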
 
Thank you for checking that and making sure the false info got corrected!

I've redone it with precision and recall (recall means the same thing as sensitivity). I've calculated the highest recall where precision is at least 80% for every metric. Here are all the metrics where recall is greater than 10%:

AT_VO2_diff_percentage: 29.8% recall at >= 80% precision
max_VO2_diff_absolute: 27.6% recall at >= 80% precision
max_VO2_diff_percentage: 26.3% recall at >= 80% precision
AT_DBP_diff_percentage: 20.5% recall at >= 80% precision
AT_VO2_HR_diff_percentage: 20.5% recall at >= 80% precision
AT_PP_diff_percentage: 19.3% recall at >= 80% precision
AT_VO2_t_diff_absolute: 19.0% recall at >= 80% precision
AT_Ve_VO2_diff_percentage: 17.9% recall at >= 80% precision
AT_Ve_VO2_diff_absolute: 16.7% recall at >= 80% precision
AT_wkld_diff_percentage: 16.7% recall at >= 80% precision
AT_PETO2_diff_percentage: 14.3% recall at >= 80% precision
AT_PETO2_diff_absolute: 14.3% recall at >= 80% precision
AT_Ve_VCO2_diff_absolute: 14.3% recall at >= 80% precision
AT_VO2_diff_absolute: 13.1% recall at >= 80% precision
rest_DBP_diff_percentage: 11.9% recall at >= 80% precision
rest_DBP_diff_absolute: 11.9% recall at >= 80% precision
AT_VCO2_diff_percentage: 11.9% recall at >= 80% precision
AT_Ve_VCO2_diff_percentage: 11.9% recall at >= 80% precision
rest_HR_diff_percentage: 10.7% recall at >= 80% precision
rest_OUES_diff_percentage: 10.7% recall at >= 80% precision
max_RR_diff_percentage: 10.5% recall at >= 80% precision

And here are the plots for precision and recall for the top ten. I also added the percent of HC that are misclassified (false positive rate) to the legend.
[Attached: precision/recall histogram plots for the top 10 features above]

I checked with the data for max_VO2_diff_percentage.
With the 10 participants excluded, at a threshold of -10.593220:

20 MECFS and 5 HC below threshold. 20/(20+5) = 0.80, which matches precision.
20 MECFS are below threshold and 56 are above. 20/(20+56) = 0.263157895, which matches recall.
5 HC are below threshold and 64 are above. 5/(5+64) = 0.072463768, which matches the percent on the legend.

Not as impressive, but still there might be something here.

I think my argument still applies that it seems improbable due to chance that out of 22 metrics, the metric with the highest recall is VO2, and the 6th highest unique metric is workload. These are the two metrics named in a quote from MEpedia:
According to a 2015 report by The National Academy of Medicine, “ME/CFS patients have significantly lower results on CPET 2 than on CPET 1 on one or more of the following parameters: VO2max, VO2 at ventilatory threshold and maximal workload or workload at ventilatory threshold.”[9]

Or if splitting by timepoints, VO2 at AT is #1, VO2 at max is #2, workload at AT is #8, workload at max is #27. (This is out of 58 unique metrics)

Edit: I looked at 58 unique metrics. Double that for total because I did absolute difference and percentage difference for each one.

Full code:
Code:
import pandas as pd
import numpy as np
from typing import List, Dict, Any
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm

class MECFSDataset:
    def __init__(self, tsv_file: str, participants_excluded: List[str]):
        self.df = pd.read_csv(tsv_file, sep='\t', na_values=['na'])
        self.participants = self.df['ParticipantID'].unique()
        self.participants_excluded = participants_excluded
       
        self.metrics = {
            'rest': ['DBP', 'HR', 'OUES', 'PETCO2', 'PETO2', 'PP', 'RER', 'RR', 'SBP', 'VCO2', 'VO2', 'VO2_HR', 'Ve_BTPS', 'Ve_VCO2', 'Ve_VO2', 'Vt_BTPS_L'],
            'AT': ['DBP', 'HR', 'PETCO2', 'PETO2', 'PP', 'RER', 'RPE', 'RPM', 'RPP', 'RR', 'SBP', 'VCO2', 'VO2', 'VO2_t', 'VO2_HR', 'Ve_BTPS', 'Ve_VCO2', 'Ve_VO2', 'Vt_BTPS_L', 'wkld', 'time_sec'],
            'max': ['DBP', 'HR', 'PETCO2', 'PETO2', 'PP', 'RER', 'RPE', 'RPM', 'RPP', 'RR', 'SBP', 'VCO2', 'VO2', 'VO2_t', 'VO2_HR', 'Ve_BTPS', 'Ve_VCO2', 'Ve_VO2', 'Vt_BTPS_L', 'wkld', 'time_sec']
        }

        self.time_points = ['rest', 'AT', 'max']
        self.demographic_metrics = ['ParticipantID', 'test_site', 'matched_pair', 'sex', 'phenotype', 'race', 'age', 'height_in', 'weight_lb', 'bmi', 'bas_score', 'q_education', 'q_reclined', 'q_sleeprefreshing', 'q_hoursinbed']
        self.features = self._create_features()

    def _create_features(self) -> Dict[str, Dict[str, Any]]:
        features = {}
        for participant in self.participants:
            participant_data = self.df[self.df['ParticipantID'] == participant]
            features[participant] = {
                'single_day': self._get_single_day_features(participant_data),
                'difference': self._get_difference_features(participant_data),
                'demographics': self._get_demographic_features(participant_data),
                'phenotype': participant_data['phenotype'].iloc[0]
            }
        return features

    def _get_single_day_features(self, participant_data: pd.DataFrame) -> Dict[str, Dict[str, float]]:
        single_day = {}
        for day in ['D1', 'D2']:
            day_data = participant_data[participant_data['Study_Visit'] == day]
            single_day[day] = {}
            for time_point in self.time_points:
                time_point_data = day_data[day_data['Time_Point'] == time_point]
                if not time_point_data.empty:
                    for metric in self.metrics.get(time_point, []):
                        single_day[day][f"{time_point}_{metric}"] = time_point_data[metric].iloc[0]
        return single_day

    def _get_difference_features(self, participant_data: pd.DataFrame) -> Dict[str, float]:
        difference = {}
        d1_data = participant_data[participant_data['Study_Visit'] == 'D1']
        d2_data = participant_data[participant_data['Study_Visit'] == 'D2']
       
        participant_id = participant_data['ParticipantID'].iloc[0]
       
        for time_point in self.time_points:
            d1_time_point = d1_data[d1_data['Time_Point'] == time_point]
            d2_time_point = d2_data[d2_data['Time_Point'] == time_point]
           
            if not d1_time_point.empty and not d2_time_point.empty:
                for metric in self.metrics.get(time_point, []):
                    d1_value = pd.to_numeric(d1_time_point[metric].iloc[0], errors='coerce')
                    d2_value = pd.to_numeric(d2_time_point[metric].iloc[0], errors='coerce')
                    if pd.notna(d1_value) and pd.notna(d2_value):
                        diff_percentage = np.divide((d2_value - d1_value), d1_value, out=np.full_like(d2_value - d1_value, np.inf, dtype=np.float64), where=d1_value!=0) * 100
                        diff_absolute = d2_value - d1_value
                       
                        # Only add max diff metrics if participant is not in excluded list
                        if time_point != 'max' or participant_id not in self.participants_excluded:
                            difference[f"{time_point}_{metric}_diff_percentage"] = diff_percentage
                            difference[f"{time_point}_{metric}_diff_absolute"] = diff_absolute
       
        return difference
   
    def _get_demographic_features(self, participant_data: pd.DataFrame) -> Dict[str, Any]:
        demographics = {}
        for metric in self.demographic_metrics:
            demographics[metric] = participant_data[metric].iloc[0]
        return demographics

    def get_feature_vector(self, participant: str) -> pd.DataFrame:
        participant_features = self.features[participant]
        feature_vector = []
        columns = []
       
        for day in ['D1', 'D2']:
            for time_point in self.time_points:
                for metric in self.metrics.get(time_point, []):
                    feature_vector.append(participant_features['single_day'][day].get(f"{time_point}_{metric}", np.nan))
                    columns.append(f"{day}_{time_point}_{metric}")
       
        for metric in self.demographic_metrics:
            feature_vector.append(participant_features['demographics'].get(metric, np.nan))
            columns.append(metric)
       
        for time_point in self.time_points:
            for metric in self.metrics.get(time_point, []):
                feature_vector.append(participant_features['difference'].get(f"{time_point}_{metric}_diff_percentage", np.nan))
                columns.append(f"{time_point}_{metric}_diff_percentage")
                feature_vector.append(participant_features['difference'].get(f"{time_point}_{metric}_diff_absolute", np.nan))
                columns.append(f"{time_point}_{metric}_diff_absolute")
       
        return pd.DataFrame([feature_vector], columns=columns)

    def get_feature_names(self) -> List[str]:
        feature_names = []
        for day in ['D1', 'D2']:
            for time_point in self.time_points:
                for metric in self.metrics.get(time_point, []):
                    feature_names.append(f"{day}_{time_point}_{metric}")
       
        for metric in self.demographic_metrics:
            feature_names.append(metric)
       
        for time_point in self.time_points:
            for metric in self.metrics.get(time_point, []):
                feature_names.append(f"{time_point}_{metric}_diff_percentage")
                feature_names.append(f"{time_point}_{metric}_diff_absolute")
       
        return feature_names

    def get_labels(self) -> np.array:
        return np.array([1 if self.features[p]['phenotype'] == 'MECFS' else 0 for p in self.participants])

participants_excluded = [
    'PI-012',
    'PI-043',
    'PI-147',
    'PI-008',
    'PI-018',
    'PI-029',
    'PI-057',
    'PI-082',
    'PI-091',
    'PI-128'
]

dataset = MECFSDataset('cpet_clinical_data.tsv', participants_excluded)
feature_names = dataset.get_feature_names()
X = np.array([dataset.get_feature_vector(p) for p in dataset.participants])
y = dataset.get_labels()

# Create a DataFrame with the feature vectors and labels
feature_vectors = [dataset.get_feature_vector(p) for p in dataset.participants]
labels = dataset.get_labels()
df = pd.concat(feature_vectors, ignore_index=True)
#df['label'] = labels

# Output the DataFrame to a CSV file
df.to_csv('mecfs_dataset.csv', index=False)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.lines as lines
from sklearn.metrics import roc_curve
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import precision_recall_curve

def plot_feature(df, feature, y):
    # Remove rows with NaN values for this feature
    mask = ~df[feature].isna()
    X = df[feature][mask]
    y_masked = y[mask]
   
    correlation = np.corrcoef(X, y_masked)[0, 1]
    X_for_roc = -X if correlation < 0 else X
   
    precision, recall, thresholds = precision_recall_curve(y_masked, X_for_roc)  
   
    # Reverse the thresholds back to original if the feature was inverted
    if correlation < 0:
        thresholds = -thresholds
       
    precision_without_last = precision[:-1]
   
    # Get indices where precision_without_last is greater than or equal to 0.80
    indices = np.where(precision_without_last >= 0.80)[0]
   
    # If any threshold reaches >= 80% precision, take the one with the highest recall; otherwise mark as None
    if len(indices) > 0:
        precision_80_point = indices[np.argmax(recall[indices])]
        recall_80 = recall[precision_80_point]
   
        # Calculate True Negatives (TN), False Negatives (FN), and the percentage
        threshold_selected = (-thresholds[precision_80_point]) if correlation < 0 else (thresholds[precision_80_point])
        y_pred = X_for_roc >= threshold_selected
   
        TN = np.sum((y_pred == 0) & (y_masked == 0))
        FP = np.sum((y_pred == 1) & (y_masked == 0))
   
        HC_past_threshold_pct = FP/(TN+FP)
   
    else:
        precision_80_point = None
        recall_80 = None
        HC_past_threshold_pct = None

    fig, ax1 = plt.subplots(figsize=(12, 6))
   
    # Define bins based on the combined range of both classes
    min_value = X.min()
    max_value = X.max()
    bins = np.linspace(min_value, max_value, 30)
   
    # Plot the histogram (stratified by phenotype) on the primary y-axis
    class_0 = X[y_masked == 0]
    class_1 = X[y_masked == 1]
   
    # Plot histograms with counts
    ax1.hist(class_0, bins=bins, alpha=0.3, label='HC', color='blue', density=False)
    ax1.hist(class_1, bins=bins, alpha=0.3, label='MECFS', color='red', density=False)
   
    # Set the y-axis limit for the histogram
    max_hist_class_0 = np.histogram(class_0, bins=bins)[0].max()
    max_hist_class_1 = np.histogram(class_1, bins=bins)[0].max()
    max_hist = max(max_hist_class_0, max_hist_class_1)
    ax1.set_ylim(0, 1.1 * max_hist)
    ax1.set_xlabel(f'{feature}')
    ax1.set_ylabel('Count')
    ax1.legend(loc='upper right')
   
    # Create a secondary y-axis for the precision/recall curves
    ax2 = ax1.twinx()
    ax2.plot(thresholds, precision[:-1], label='Precision', color='green', linewidth=2)
    ax2.plot(thresholds, recall[:-1], label='Recall', color='purple', linewidth=2)
   
    # Set the y-axis limit for precision and recall
    ax2.set_ylim(0, 1.05)
    ax2.set_ylabel('Precision/Recall')  

    if precision_80_point is not None:
        ax2.axvline(x=thresholds[precision_80_point], color='red', linestyle=':',
                label=f'{precision[precision_80_point]*100:.1f}% precision: Recall = {recall[precision_80_point]*100:.1f}%')
       
        ax2.text(thresholds[precision_80_point], -0.05, f'{thresholds[precision_80_point]:.2f}',
                 ha='center', va='top', color='black', fontsize=10, rotation=45)
       
    # Combine legends from both axes
    lines_1, labels_1 = ax1.get_legend_handles_labels()
    lines_2, labels_2 = ax2.get_legend_handles_labels()
   
    # Include class_0 percentage in the legend
    if HC_past_threshold_pct is not None:
        dummy_handle = lines.Line2D([0], [0], linestyle="none", marker='o', markersize=6, label=f'HC passing threshold: {HC_past_threshold_pct*100:.1f}%')
        ax1.legend(lines_1 + lines_2 + [dummy_handle], labels_1 + labels_2 + [dummy_handle.get_label()], loc='upper right')
    else:
        ax1.legend(lines_1 + lines_2, labels_1 + labels_2, loc='upper right')

   
    # Save the figure
    plt.title(f'Precision and Recall vs Threshold: {feature}')
    plt.tight_layout()
    plt.savefig(f'images/{feature}_prec_rec_hist.png')
    plt.close()
   


   
    # Print some statistics
    print(f"\nFeature: {feature}")
    fpr, tpr, _ = roc_curve(y_masked, X_for_roc)  # ROC curve computed here; only precision/recall were computed above
    print(f"AUC: {np.trapz(tpr, fpr):.3f}")
    if precision_80_point is not None:
        print(f"Threshold at 80% precision: {thresholds[precision_80_point]:.3f}")
        print(f"Recall at 80% precision: {recall_80*100:.1f}%")
        print(f'HC passing threshold: {HC_past_threshold_pct*100:.1f}%')
    else:
        print("No precision >= 80%")
   
   

    return feature, recall_80



# Load the data
df = pd.read_csv('mecfs_dataset.csv')
labels = df['phenotype']  # Assuming 'phenotype' column exists in the CSV

# Encode labels
le = LabelEncoder()
y = le.fit_transform(labels)

# Get all difference features
diff_features = [col for col in df.columns if '_diff_' in col]

# Collect the best recall at >= 80% precision for each feature
feature_sensitivities = []

for feature in tqdm(diff_features, desc='Plotting features', unit='features'):
    feature_sensitivities.append(plot_feature(df, feature, y))

feature_sensitivities = [x for x in feature_sensitivities if x[1] is not None]

# Sort by recall at >= 80% precision in descending order
sorted_features = sorted(feature_sensitivities, key=lambda x: x[1], reverse=True)

# Print the sorted features
print("\nFeatures sorted by recall at >= 80% specificity:")
for feature, recall_80 in sorted_features:
    print(f"{feature}: {recall_80*100:.1f}% recall at >= 80% precision")
 
Last edited:
Thanks again for the impressive analysis @forestglip.

I do not get the same results though. For example if I use the threshold of -9.7% for max VO2 values, I get 25 ME/CFS and 7 healthy controls so that ME/CFS patients make up 78% of the sample rather than 90%.

Because of the large overlap between MECFS and HC, I find no values that had a specificity (EDIT: precision) > 90%, except at really high thresholds, because a couple of ME/CFS patients had large increases on day 2.

View attachment 23198
I think your chart might be checking the threshold backwards from mine, looking at precision for greater than the threshold value, not smaller. The numbers seem right though.
 
Thanks @forestglip. Your coding skills are impressive and very useful.

I think your analysis confirms that there might be a small effect. On the other hand, your approach is also iteratively looking for the best threshold, trying different things and looking for a value with the clearest separation. So it is at risk of overfitting the data and finding a pattern that might have happened by chance.

Therefore I prefer the t-test approach. It also suggested a small to moderate effect (for example, for VO2_max it showed a Cohen's d of 0.35), but because of the large number of outcomes and tests done, it was not considered statistically significant.
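For reference, the kind of computation I mean looks roughly like this (a sketch with placeholder arrays, not necessarily my exact code):
Code:
import numpy as np
from scipy import stats

def cohens_d(a, b):
    # Cohen's d for two independent groups, using the pooled standard deviation
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    pooled_sd = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled_sd

# Placeholder arrays standing in for the per-group day-2 vs day-1 changes of one metric
mecfs_diff = np.array([-12.0, -5.3, 2.1, -8.7])
hc_diff = np.array([-1.2, 3.4, -2.5, 0.8])

t, p = stats.ttest_ind(mecfs_diff, hc_diff, equal_var=False)  # Welch's t-test
print(f'd = {cohens_d(mecfs_diff, hc_diff):.2f}, t = {t:.2f}, p = {p:.3f}')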

I think my argument still applies that it seems improbable due to chance that out of 22 metrics, the metric with the highest recall is VO2, and the 6th highest unique metric is workload. These are the two metrics named in a quote from MEpedia:
I think that this is because most of the previous studies only reported these outcomes, not all the other ones (except for HR).

A 2020 meta-analysis tried to estimate the effect sizes. I think it is flawed because it focuses on the mean changes, but it still shows that the largest differences were found not at peak values but at AT, especially for workload (WR). The data by Keller et al. does not really support this, as VO2peak rather than WR_AT showed the largest effect sizes.

[attached figures from the meta-analysis]
 
I think your chart might be checking the threshold backwards from mine, looking at precision for greater than the threshold value, not smaller.
I tried to look at the precision level at different values of the VO2 percentage change. I had it switch around the 0 point (checking for smaller values when the threshold was negative and for larger values when it was positive).

Here's the code I used
Code:
import numpy as np
import pandas as pd
import seaborn as sns

df = df_max  # df_max: one row per participant with the max VO2 percentage difference and phenotype
values = np.arange(-40, 40, 0.1)
results = []
for value in values:
    # Below 0: participants with a decrease at least as large as the threshold;
    # above 0: participants with an increase at least as large as the threshold.
    if value <= 0:
        filtered_df = df[df['VO2_percentage_difference'] <= value]
    else:
        filtered_df = df[df['VO2_percentage_difference'] >= value]
    n_MECFS = filtered_df[filtered_df['phenotype'] == 'MECFS'].drop_duplicates(subset='ParticipantID').shape[0]
    n_total = filtered_df.drop_duplicates(subset='ParticipantID').shape[0]
    if n_total == 0:  # skip thresholds that no participant passes
        continue
    ratio_MECFS = n_MECFS / n_total
    results.append({
        'VO2_difference_value': value,
        'Number of MECFS patients': n_MECFS,
        'Percentage_MECFS': ratio_MECFS * 100,
    })

results = pd.DataFrame(results)
sns.lineplot(x='VO2_difference_value', y='Percentage_MECFS', data=results)
 
I think that this is because most of the previous studies only reported these outcomes, not all the other ones (except for HR).

Even if past studies only looked at these two metrics, it still seems unlikely to be chance that, after running the above script, those same two metrics end up near the top of a list of 58.

On the other hand, your approach is also iteratively looking for the best threshold, trying different things and looking for a value with the clearest separation. So it is at risk of overfitting the data and finding a pattern that might have happened by chance.

Valid concern, but I only ran this with a single precision threshold (>=80%) and never even looked at other precisions. I had considered doing 90%, but after your corrections I realized the recall would probably be very small.

I did mistakenly run it with a 90% specificity cutoff, as we saw, though, and that gave a similar result.

Based on the charts, I suspect that at most high precision thresholds, VO2 would be near the top, though.

Therefore I prefer the t-test approach. It also suggested a small to moderate effect (for example, for VO2_max it showed a Cohen's d of 0.35), but because of the large number of outcomes and tests done, it was not considered statistically significant.

Yes, I'm not sure how you could formally calculate an effect size based on my method, more than "seems improbable".

If there are subtypes, though, some with an effect and some without, I think a t-test on the whole group risks missing it.

Part of the low effect size may also be due to outliers. Looking at the chart for AT_wkld_diff_percentage, some of those increases seem weird. One increased by about 280% and another by about 450%. I'll have to check later if that's accurate.

Edit: Anyway, the "order thing" might represent a small effect. But the precision and recall numbers, and the histograms for VO2 and workload, where below a certain threshold you have many more MECFS than HC, make me think there is a large effect for some participants, and maybe a very small or nonexistent one for others.
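One crude way to probe the subgroup idea (just a sketch I haven't run; it uses the mecfs_dataset.csv my script above exports) would be to fit a two-component mixture to the ME/CFS day-2 changes and see whether a separate large-drop component falls out:
Code:
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

df = pd.read_csv('mecfs_dataset.csv')
vals = df.loc[df['phenotype'] == 'MECFS', 'max_VO2_diff_percentage'].dropna()

# Two-component Gaussian mixture on the day-2 vs day-1 percentage change
gmm = GaussianMixture(n_components=2, random_state=0).fit(vals.to_numpy().reshape(-1, 1))
for mean, var, weight in zip(gmm.means_.ravel(), gmm.covariances_.ravel(), gmm.weights_):
    print(f'component: mean = {mean:.1f}%, sd = {np.sqrt(var):.1f}, weight = {weight:.2f}')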
 
Last edited:
I tried to look at the precision level at different values of the VO2 percentage change. I had it switch around the 0 point (checking for smaller values when the threshold was negative and for larger values when it was positive).

Here's the code I used
Code:
import numpy as np
import pandas as pd
import seaborn as sns

df = df_max  # df_max: one row per participant with the max VO2 percentage difference and phenotype
values = np.arange(-40, 40, 0.1)
results = []
for value in values:
    # Below 0: participants with a decrease at least as large as the threshold;
    # above 0: participants with an increase at least as large as the threshold.
    if value <= 0:
        filtered_df = df[df['VO2_percentage_difference'] <= value]
    else:
        filtered_df = df[df['VO2_percentage_difference'] >= value]
    n_MECFS = filtered_df[filtered_df['phenotype'] == 'MECFS'].drop_duplicates(subset='ParticipantID').shape[0]
    n_total = filtered_df.drop_duplicates(subset='ParticipantID').shape[0]
    if n_total == 0:  # skip thresholds that no participant passes
        continue
    ratio_MECFS = n_MECFS / n_total
    results.append({
        'VO2_difference_value': value,
        'Number of MECFS patients': n_MECFS,
        'Percentage_MECFS': ratio_MECFS * 100,
    })

results = pd.DataFrame(results)
sns.lineplot(x='VO2_difference_value', y='Percentage_MECFS', data=results)

I'll have to look more carefully later, but I think the precision line should gradually converge to about 50 or 60 percent as you include more and more of the cohort. So if the threshold, for example, is lenient enough that almost all ME/CFS and HC are included, precision should be close to the balance of the groups.
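Quick arithmetic sketch of what I mean, using the group sizes from earlier:
Code:
# Once the threshold is lenient enough to include everyone,
# precision is just the ME/CFS share of the whole sample.
n_mecfs, n_hc = 84, 71
print(n_mecfs / (n_mecfs + n_hc))  # ~0.54, i.e. about 54%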
 
This could be nothing, and these results from a single study are very little evidence on their own. But I think that, considering the suspicion up to now was that a decrease of more than about 10-20% was good at classifying, and here we see the split become much more apparent at around that threshold, it might be a sign. There could very well be a flaw in my logic, though.
 
but I think the precision line should gradually converge to about 50 or 60 percent as you include more and more of the cohort.
True, my graph looked weird because I checked precision below a certain value if the value was negative but above the value if it was positive. If I don't do this then I get the convergence to the ratio of ME/CFS patients in the total sample that you mentioned:

[attached plot]
 
But I think that, considering the suspicion up to now was that a decrease of more than about 10-20% was good at classifying, and here we see the split become much more apparent at around that threshold, it might be a sign. There could very well be a flaw in my logic, though.
I think we largely agree. The data does not suggest that there is no difference at all. It's not 100% random noise.

I would summarise it as: the data suggests that there might be an effect. But the effect is quite small with poor separation between groups and no correlation with the Bell disability scale. The data presented is not large enough to rule out that this effect might have happened due to chance.

When it comes to 2 day CPET as a test to measure PEM or ME/CFS, I think the data is pretty clear that it has very low potential to do this.
 
I think we largely agree. The data does not suggest that there is no difference at all. It's not 100% random noise.

I would summarise it as: the data suggests that there might be an effect. But the effect is quite small with poor separation between groups and no correlation with the Bell disability scale. The data presented is not large enough to rule out that this effect might have happened due to chance.

When it comes to 2 day CPET as a test to measure PEM or ME/CFS, I think the data is pretty clear that it has very low potential to do this.

That all seems about right. I'd just like to see my suspicion tested that a CPET 48 hours later might capture an effect from more participants; 24 hours might just be too short.

Edit: That and maybe use a different symptom scale to check correlation, like FUNCAP.
 
@ME/CFS Skeptic I don't know if I can today, but I think it might be worth testing with the outliers removed. Both because a 400% increase seems strange, and because one assumption of the independent t-test is that there are no significant outliers. And maybe the groups should be tested for normal distributions.

I'd also remove PI-026, which is probably bad data.

You can see the outliers at least in AT_wkld_diff_percentage, and maybe also in the VO2 diffs.

https://www.datanovia.com/en/lessons/t-test-assumptions/#assumptions
  • no significant outliers in the two groups
  • the two groups of samples (A and B), being compared, should be normally distributed.
  • the variances of the two groups should not be significantly different. This assumption is made only by the original Student’s t-test. It is relaxed in the Welch’s t-test.

Edit: Or just a test that allows outliers, like Mann-Whitney.
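For example, something along these lines (a sketch using the mecfs_dataset.csv my earlier script exports, with PI-026 dropped as mentioned above):
Code:
import pandas as pd
from scipy import stats

# Mann-Whitney U on one difference feature, keeping the outliers,
# with the suspect participant PI-026 dropped
df = pd.read_csv('mecfs_dataset.csv')
df = df[df['ParticipantID'] != 'PI-026']

col = 'AT_wkld_diff_percentage'
sub = df.dropna(subset=[col])
mecfs = sub.loc[sub['phenotype'] == 'MECFS', col]
hc = sub.loc[sub['phenotype'] != 'MECFS', col]

u, p = stats.mannwhitneyu(mecfs, hc, alternative='two-sided')
print(f'{col}: U = {u:.1f}, p = {p:.4f}')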
 
Last edited:
You can see the outliers at least in AT_wkld_diff_percentage, and maybe also in the VO2 diffs.
Yes I think you make a good point. When expressed as a percentage, there are 4 ME/CFS patients (PI-029, PI-087, PI-114, and PI-166) that have extreme values:
upload_2024-9-10_23-2-31.png

That makes the distribution of percentage changes quite skewed:
upload_2024-9-10_23-2-54.png


With these 4 included, I found a Cohen's d of 0.008 and a p-value of 0.93. With them excluded, things change quite dramatically: the effect size goes to d = 0.44 and the p-value to 0.000151.

upload_2024-9-10_23-5-45.png

That looks like a reasonable difference. Also tried scipy.stats.mannwhitneyu on the full sample and got a p-value of 0.00503.
 
Yes I think you make a good point. When expressed as a percentage, there are 4 ME/CFS patients (PI-029, PI-087, PI-114, and PI-166) that have extreme values:
View attachment 23234

That makes the distribution of percentage changes quite skewed:
View attachment 23235


With these 4 included, I found a Cohen's d of 0.008 and a p-value of 0.93. With them excluded, things change quite dramatically: the effect size goes to d = 0.44 and the p-value to 0.000151.

View attachment 23237

That looks like a reasonable difference. Also tried scipy.stats.mannwhitneyu on the full sample and got a p-value of 0.00503.

Interesting. Could you check Mann-Whitney with PI-026 removed?

I'd think it's probably not best practice to just remove the outliers, though, for the t-test when making a final conclusion, if we're not sure there's some methodological problem with them.

Edit: Also, if possible, could you check VO2 at AT with Mann-Whitney and PI-026 removed? It seems VO2 is subject to less random variation than workload: most differences in workload span from about -50% to +50%, while for VO2 they're mostly contained within about -20% to +20%.
 
Last edited:
This is an interesting blog post about how Cohen's d is not robust to outliers, and some better alternatives: A single outlier could completely distort your Cohen's d value

It explains the slightly strange result where if you add a high outlier to the group that is already higher, it decreases the effect size:
[attached screenshot from the blog post]

Alternatives mentioned that are more robust to outliers are Mann-Whitney, Cliff's delta, the Kolmogorov-Smirnov test, Wilcox and Muska's Q, and mutual information (MI).
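Cliff's delta, for example, is simple enough to compute directly (a sketch with hypothetical placeholder arrays):
Code:
import numpy as np

def cliffs_delta(x, y):
    # Cliff's delta: P(X > Y) - P(X < Y) over all pairs; robust to outliers
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    greater = (x[:, None] > y[None, :]).sum()
    less = (x[:, None] < y[None, :]).sum()
    return (greater - less) / (len(x) * len(y))

# Placeholder arrays standing in for per-group day-2 vs day-1 changes
print(cliffs_delta([-12.0, -5.3, 2.1, -8.7], [-1.2, 3.4, -2.5, 0.8]))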

--------

From the thread's CPET paper:
Power analysis: The target comparison to determine the appropriate sample size for the present study was the difference in peak VO2 between ME/CFS and CTL. Previous research reported effect sizes as large as 0.66 (Hedges g) [30]. To ensure sufficient power, a more conservative estimated effect size (0.45) was chosen. Based on a power analysis with α at 0.05 and power at 0.80 (using G Power 2), the recommended sample size was 158 participants (i.e., 79 participants per group).

Continuous variables were summarized in means and standard deviations. Categorical variables were aggregated in means and percentages. Before the calculation of parametric statistics was conducted, normality of continuous variables was assessed. As none of the variables showed abnormality of data, parametric statistics were deemed appropriate. Participant characteristics were compared between phenotype groups by sex and between groups for the total sample using independent samples t-test with Cohen’s d effect sizes or χ2 tests with Cramer’s V effect sizes (for categorical variables), and 95% confidence intervals. Measures from CPET were compared between groups (ME/CFS, CTL) and within CPETs (CPET-1, CPET-2) using multiple repeated measures two-way ANOVAs. Post hoc adjustment of significant group by time interactions were done using the Holm-Bonferroni method [50] and Cohen’s d effect. Stability of CPET variables over tests was assessed using intraclass correlation coefficients with a 95% confidence interval. All analyses were conducted using JASP (version 0.17.2) with a set α level of 0.05.

So they found that the single-day metrics were all normally distributed, but we're looking at the difference between days as the feature.
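A quick way to check that for a difference feature would be something like this (a sketch using the mecfs_dataset.csv exported earlier):
Code:
import pandas as pd
from scipy import stats

# Shapiro-Wilk normality check of a between-day difference feature, per group
df = pd.read_csv('mecfs_dataset.csv')
col = 'AT_wkld_diff_percentage'
for group, vals in df.groupby('phenotype')[col]:
    vals = vals.dropna()
    w, p = stats.shapiro(vals)
    print(f'{group}: W = {w:.3f}, p = {p:.4g}, n = {len(vals)}')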
 
Last edited: