Cardiopulmonary and metabolic responses during a 2-day CPET in [ME/CFS]: translating reduced oxygen consumption [...], Keller et al, 2024

Discussion in 'ME/CFS research' started by Nightsong, Jul 5, 2024.

Tags:
  1. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    1,300
    I think that's a good analysis. At the very least, none of these metrics look like anything resembling a diagnostic biomarker to classify between groups. *Possibly* decrease in VO2 or workload greater than around 20% might be fairly specific to ME/CFS but has very low sensitivity.

    Edit: It'd be interesting to check all difference metrics in this regard - trying to find cutoff values where it is highly specific, and at least like 20% sensitive, to maybe identify a subtype of ME/CFS. Or maybe they're the most severe, and the Bell score didn't capture that somehow.
     
    Last edited: Sep 9, 2024
    RedFox and ME/CFS Skeptic like this.
  2. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    1,300
    Edit: I made a mistake, see following posts.

    I had posted this then deleted it because I thought there was an error in the code, but I went through it and its fine.

    I calculated the sensitivity and specificity at every threshold for every difference feature. I got the threshold for each feature where the specificity was at least 90%, then sorted by features with highest sensitivity. These are all the ones over 20%. (I removed max_VO2_t_diff_percentage and AT_VO2_t_diff_percentage because VO2 and VO2_t percentages are identical.)

    I also excluded the participants we concluded didn't meet criteria from all max metrics.

    Here are the plots of the top 10:
    max_VO2_diff_percentage_spec_sens_hist.png max_VO2_diff_absolute_spec_sens_hist.png AT_VO2_diff_percentage_spec_sens_hist.png max_VCO2_diff_percentage_spec_sens_hist.png AT_Ve_VCO2_diff_absolute_spec_sens_hist.png AT_PETCO2_diff_percentage_spec_sens_hist.png AT_wkld_diff_percentage_spec_sens_hist.png max_wkld_diff_percentage_spec_sens_hist.png AT_Ve_VCO2_diff_percentage_spec_sens_hist.png max_VO2_t_diff_absolute_spec_sens_hist.png

    For anyone interested in an explanation: I calculated the value that each metric had to change where at least 90% of the people who had a larger change were ME/CFS. For example, in the first chart, max_VO2_diff_percentage, this shows the percent that the participant's VO2 changed on day 2. The red dotted vertical line shows the threshold (-9.70%) where for the participants that had a larger decrease (further left of the line) 91.3% of them were ME/CFS.

    This could be important, because this might indicate that if you see someone has a larger decrease than around 10% on day 2, it might be very likely they have ME/CFS. In the first chart, you can see a lot of red (MECFS) and not a lot of blue (HC) left of the vertical line. (Purple bars are red and blue in the same location.)

    The sensitivity is how many ME/CFS participants were left (or right for increasing metrics) of the threshold. In the first chart, that would be 30.3% of pwME had a larger decrease than about 10% of VO2. So 70% of them would not appear to have ME/CFS if using this threshold to diagnose. You can raise the threshold to include more pwME, but then you're also misdiagnosing more controls.

    None of the above is conclusive, but I think it's a good thing to look into. Maybe a subtype of ME/CFS has a large change? Maybe these are the most severe, but the Bell scale didn't capture that somehow? Maybe these are only the ones that have entered PEM by 24 hours later?

    Seeing these thresholds with high specificity and somewhat okay sensitivities (20-30%) doesn't seem like it's just random variation between groups, because out of 22 metrics, VO2, one of the metrics most associated with 2-day CPET decline in past studies, is top of the above list for highest sensitivity. And workload, the other main metric talked about, is the fifth most sensitive unique metric above. Seems like more than coincidence.

    Group numbers are around 76-84 MECFS and 69-71 HC (it's different per feature depending on if participants were excluded or were missing values).


    -----

    @ME/CFS Skeptic I think there's something weird with PI-026. Almost all the metrics are exactly the same on both days.
     
    Last edited: Sep 10, 2024
    ME/CFS Skeptic likes this.
  3. ME/CFS Skeptic

    ME/CFS Skeptic Senior Member (Voting Rights)

    Messages:
    4,146
    Location:
    Belgium
    Thanks again for the impressive analysis @forestglip.

    I do not get the same results though. For example if I use the threshold of -9.7% for max VO2 values, I get 25 ME/CFS and 7 healthy controls so that ME/CFS patients make up 78% of the sample rather than 90%.

    Because of the large overlap between MECFS and HC I find no values that had a specificity EDIT: precision > 90% except for really high ones where because a couple of ME/CFS patients had high increases on day 2.

    upload_2024-9-10_12-21-19.png
     
    Last edited: Sep 10, 2024
  4. ME/CFS Skeptic

    ME/CFS Skeptic Senior Member (Voting Rights)

    Messages:
    4,146
    Location:
    Belgium
    Yes looks like an error with data entry - well spotted!
     
    MEMarge likes this.
  5. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    1,300
    Oh I am dumb. I've had a misunderstanding of the term specificity. It doesn't measure true positives/(true positives + false positives). It measures true negatives/(true negatives + false positives). The correct measure for what I was looking for is precision, not specificity. I'm going to redo this.

    Edit: For max VO2 percentage difference I get 22 ME/CFS and 6 HC at <9.7. I think this is just because I excluded the participants who didn't satisfy the max criteria. The percent is about the same though.
     
    Last edited: Sep 10, 2024
    bobbler and ME/CFS Skeptic like this.
  6. ME/CFS Skeptic

    ME/CFS Skeptic Senior Member (Voting Rights)

    Messages:
    4,146
    Location:
    Belgium
    Yes I think we mixed up the terms.

    You we're right about the sensitivity and specificity values, but your description (how many with lower values than -9.7% would be ME/CFS patients) refers to precision rather than specificity.

    Here's what I got for a threshold of -9.7, for example:

    Total sample
    84 MECFS
    71 HC
    155 Total

    Under threshold 9.7
    25 MECFS
    7 HC
    32 Total

    Sensitivity: 25/84 = 29.76%
    Specificity: (71-7)/71 = 90.14%

    Positive Predictive Value (PVV, precision) = 25/32 = 78.13%
    Negative Predictive Value = (71-7) / (155-32) = 62.14%

    False Positive Rate = 1 - specificity = (7)/71 = 9.86%
    False Negative Rate = 1 - sensitivity = (84-24) / 84 = 70.24%
     
    forestglip likes this.
  7. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    1,300
    Thank you for checking that and making sure the false info got corrected!

    I've redone it with precision and recall (recall means the same thing as sensitivity). I've calculated the highest recall where precision is greater than 80% for every metric. Here are all where recall is greater than 10%:

    AT_VO2_diff_percentage: 29.8% recall at >= 80% precision
    max_VO2_diff_absolute: 27.6% recall at >= 80% precision
    max_VO2_diff_percentage: 26.3% recall at >= 80% precision
    AT_DBP_diff_percentage: 20.5% recall at >= 80% precision
    AT_VO2_HR_diff_percentage: 20.5% recall at >= 80% precision
    AT_PP_diff_percentage: 19.3% recall at >= 80% precision
    AT_VO2_t_diff_absolute: 19.0% recall at >= 80% precision
    AT_Ve_VO2_diff_percentage: 17.9% recall at >= 80% precision
    AT_Ve_VO2_diff_absolute: 16.7% recall at >= 80% precision
    AT_wkld_diff_percentage: 16.7% recall at >= 80% precision
    AT_PETO2_diff_percentage: 14.3% recall at >= 80% precision
    AT_PETO2_diff_absolute: 14.3% recall at >= 80% precision
    AT_Ve_VCO2_diff_absolute: 14.3% recall at >= 80% precision
    AT_VO2_diff_absolute: 13.1% recall at >= 80% precision
    rest_DBP_diff_percentage: 11.9% recall at >= 80% precision
    rest_DBP_diff_absolute: 11.9% recall at >= 80% precision
    AT_VCO2_diff_percentage: 11.9% recall at >= 80% precision
    AT_Ve_VCO2_diff_percentage: 11.9% recall at >= 80% precision
    rest_HR_diff_percentage: 10.7% recall at >= 80% precision
    rest_OUES_diff_percentage: 10.7% recall at >= 80% precision
    max_RR_diff_percentage: 10.5% recall at >= 80% precision

    And here are the plots for precision and recall for the top ten. I also added the percent of HC that are misclassified (false positive rate) to the legend.
    AT_VO2_diff_percentage_prec_rec_hist.png max_VO2_diff_absolute_prec_rec_hist.png max_VO2_diff_percentage_prec_rec_hist.png AT_DBP_diff_percentage_prec_rec_hist.png AT_VO2_HR_diff_percentage_prec_rec_hist.png AT_PP_diff_percentage_prec_rec_hist.png AT_VO2_t_diff_absolute_prec_rec_hist.png AT_Ve_VO2_diff_percentage_prec_rec_hist.png AT_Ve_VO2_diff_absolute_prec_rec_hist.png AT_wkld_diff_percentage_prec_rec_hist.png

    I checked with the data for max_VO2_diff_percentage.
    With the 10 participants excluded, at a threshold of -10.593220:

    20 MECFS and 5 HC below threshold. 5/(5+20) = .80, which matches precision.
    20 MECFS are below threshold and 56 are above. 20/(20+56) = 0.263157895, which matches recall.
    5 HC are below threshold and 64 are above. 5/(5+64) = 0.072463768, which matches the percent on the legend.

    Not as impressive, but still there might be something here.

    I think my argument still applies that it seems improbable due to chance that out of 22 metrics, the metric with the highest recall is VO2, and the 6th highest unique metric is workload. These are the two metrics named in a quote from MEpedia:
    Or if splitting by timepoints, VO2 at AT is #1, VO2 at max is #2, workload at AT is #8, workload at max is #27. (This is out of 66 58 unique metrics)

    Edit: I looked at 58 unique metrics. Double that for total because I did absolute difference and percentage difference for each one.

    Full code:
    Code:
    import pandas as pd
    import numpy as np
    from typing import List, Dict, Any
    from sklearn.preprocessing import LabelEncoder
    from sklearn.metrics import roc_curve, auc
    import matplotlib.pyplot as plt
    from tqdm.notebook import tqdm
    
    class MECFSDataset:
        def __init__(self, tsv_file: str, participants_excluded: List[str]):
            self.df = pd.read_csv(tsv_file, sep='\t', na_values=['na'])
            self.participants = self.df['ParticipantID'].unique()
            self.participants_excluded = participants_excluded
           
            self.metrics = {
                'rest': ['DBP', 'HR', 'OUES', 'PETCO2', 'PETO2', 'PP', 'RER', 'RR', 'SBP', 'VCO2', 'VO2', 'VO2_HR', 'Ve_BTPS', 'Ve_VCO2', 'Ve_VO2', 'Vt_BTPS_L'],
                'AT': ['DBP', 'HR', 'PETCO2', 'PETO2', 'PP', 'RER', 'RPE', 'RPM', 'RPP', 'RR', 'SBP', 'VCO2', 'VO2', 'VO2_t', 'VO2_HR', 'Ve_BTPS', 'Ve_VCO2', 'Ve_VO2', 'Vt_BTPS_L', 'wkld', 'time_sec'],
                'max': ['DBP', 'HR', 'PETCO2', 'PETO2', 'PP', 'RER', 'RPE', 'RPM', 'RPP', 'RR', 'SBP', 'VCO2', 'VO2', 'VO2_t', 'VO2_HR', 'Ve_BTPS', 'Ve_VCO2', 'Ve_VO2', 'Vt_BTPS_L', 'wkld', 'time_sec']
            }
    
            self.time_points = ['rest', 'AT', 'max']
            self.demographic_metrics = ['ParticipantID', 'test_site', 'matched_pair', 'sex', 'phenotype', 'race', 'age', 'height_in', 'weight_lb', 'bmi', 'bas_score', 'q_education', 'q_reclined', 'q_sleeprefreshing', 'q_hoursinbed']
            self.features = self._create_features()
    
        def _create_features(self) -> Dict[str, Dict[str, Any]]:
            features = {}
            for participant in self.participants:
                participant_data = self.df[self.df['ParticipantID'] == participant]
                features[participant] = {
                    'single_day': self._get_single_day_features(participant_data),
                    'difference': self._get_difference_features(participant_data),
                    'demographics': self._get_demographic_features(participant_data),
                    'phenotype': participant_data['phenotype'].iloc[0]
                }
            return features
    
        def _get_single_day_features(self, participant_data: pd.DataFrame) -> Dict[str, Dict[str, float]]:
            single_day = {}
            for day in ['D1', 'D2']:
                day_data = participant_data[participant_data['Study_Visit'] == day]
                single_day[day] = {}
                for time_point in self.time_points:
                    time_point_data = day_data[day_data['Time_Point'] == time_point]
                    if not time_point_data.empty:
                        for metric in self.metrics.get(time_point, []):
                            single_day[day][f"{time_point}_{metric}"] = time_point_data[metric].iloc[0]
            return single_day
    
        def _get_difference_features(self, participant_data: pd.DataFrame) -> Dict[str, float]:
            difference = {}
            d1_data = participant_data[participant_data['Study_Visit'] == 'D1']
            d2_data = participant_data[participant_data['Study_Visit'] == 'D2']
           
            participant_id = participant_data['ParticipantID'].iloc[0]
           
            for time_point in self.time_points:
                d1_time_point = d1_data[d1_data['Time_Point'] == time_point]
                d2_time_point = d2_data[d2_data['Time_Point'] == time_point]
               
                if not d1_time_point.empty and not d2_time_point.empty:
                    for metric in self.metrics.get(time_point, []):
                        d1_value = pd.to_numeric(d1_time_point[metric].iloc[0], errors='coerce')
                        d2_value = pd.to_numeric(d2_time_point[metric].iloc[0], errors='coerce')
                        if pd.notna(d1_value) and pd.notna(d2_value):
                            diff_percentage = np.divide((d2_value - d1_value), d1_value, out=np.full_like(d2_value - d1_value, np.inf, dtype=np.float64), where=d1_value!=0) * 100
                            diff_absolute = d2_value - d1_value
                           
                            # Only add max diff metrics if participant is not in excluded list
                            if time_point != 'max' or participant_id not in self.participants_excluded:
                                difference[f"{time_point}_{metric}_diff_percentage"] = diff_percentage
                                difference[f"{time_point}_{metric}_diff_absolute"] = diff_absolute
           
            return difference
       
        def _get_demographic_features(self, participant_data: pd.DataFrame) -> Dict[str, Any]:
            demographics = {}
            for metric in self.demographic_metrics:
                demographics[metric] = participant_data[metric].iloc[0]
            return demographics
    
        def get_feature_vector(self, participant: str) -> pd.DataFrame:
            participant_features = self.features[participant]
            feature_vector = []
            columns = []
           
            for day in ['D1', 'D2']:
                for time_point in self.time_points:
                    for metric in self.metrics.get(time_point, []):
                        feature_vector.append(participant_features['single_day'][day].get(f"{time_point}_{metric}", np.nan))
                        columns.append(f"{day}_{time_point}_{metric}")
           
            for metric in self.demographic_metrics:
                feature_vector.append(participant_features['demographics'].get(metric, np.nan))
                columns.append(metric)
           
            for time_point in self.time_points:
                for metric in self.metrics.get(time_point, []):
                    feature_vector.append(participant_features['difference'].get(f"{time_point}_{metric}_diff_percentage", np.nan))
                    columns.append(f"{time_point}_{metric}_diff_percentage")
                    feature_vector.append(participant_features['difference'].get(f"{time_point}_{metric}_diff_absolute", np.nan))
                    columns.append(f"{time_point}_{metric}_diff_absolute")
           
            return pd.DataFrame([feature_vector], columns=columns)
    
        def get_feature_names(self) -> List[str]:
            feature_names = []
            for day in ['D1', 'D2']:
                for time_point in self.time_points:
                    for metric in self.metrics.get(time_point, []):
                        feature_names.append(f"{day}_{time_point}_{metric}")
           
            for metric in self.demographic_metrics:
                feature_names.append(metric)
           
            for time_point in self.time_points:
                for metric in self.metrics.get(time_point, []):
                    feature_names.append(f"{time_point}_{metric}_diff_percentage")
                    feature_names.append(f"{time_point}_{metric}_diff_absolute")
           
            return feature_names
    
        def get_labels(self) -> np.array:
            return np.array([1 if self.features[p]['phenotype'] == 'MECFS' else 0 for p in self.participants])
    
    participants_excluded = [
        'PI-012',
        'PI-043',
        'PI-147',
        'PI-008',
        'PI-018',
        'PI-029',
        'PI-057',
        'PI-082',
        'PI-091',
        'PI-128'
    ]
    
    dataset = MECFSDataset('cpet_clinical_data.tsv', participants_excluded)
    feature_names = dataset.get_feature_names()
    X = np.array([dataset.get_feature_vector(p) for p in dataset.participants])
    y = dataset.get_labels()
    
    # Create a DataFrame with the feature vectors and labels
    feature_vectors = [dataset.get_feature_vector(p) for p in dataset.participants]
    labels = dataset.get_labels()
    df = pd.concat(feature_vectors, ignore_index=True)
    #df['label'] = labels
    
    # Output the DataFrame to a CSV file
    df.to_csv('mecfs_dataset.csv', index=False)
    
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import matplotlib.lines as lines
    from sklearn.metrics import roc_curve
    from sklearn.preprocessing import LabelEncoder
    from sklearn.metrics import precision_recall_curve
    
    def plot_feature(df, feature, y):
        # Remove rows with NaN values for this feature
        mask = ~df[feature].isna()
        mask.value_counts()
        X = df[feature][mask]
        y_masked = y[mask]
       
        correlation = np.corrcoef(X, y_masked)[0, 1]
        X_for_roc = -X if correlation < 0 else X
       
        precision, recall, thresholds = precision_recall_curve(y_masked, X_for_roc)  
       
        # Reverse the thresholds back to original if the feature was inverted
        if correlation < 0:
            thresholds = -thresholds
           
        precision_without_last = precision[:-1]
       
        # Get indices where precision_without_last is greater than or equal to 0.80
        indices = np.where(precision_without_last >= 0.80)[0]
       
        # Check if there are any indices and return the max recall index if found, otherwise return 0
        if len(indices) > 0:
            precision_80_point = indices[np.argmax(recall[indices])]
            recall_80 = recall[precision_80_point]
       
            # Calculate True Negatives (TN), False Negatives (FN), and the percentage
            threshold_selected = (-thresholds[precision_80_point]) if correlation < 0 else (thresholds[precision_80_point])
            y_pred = X_for_roc >= threshold_selected
       
            TN = np.sum((y_pred == 0) & (y_masked == 0))
            FP = np.sum((y_pred == 1) & (y_masked == 0))
       
            HC_past_threshold_pct = FP/(TN+FP)
       
        else:
            precision_80_point = None
            recall_80 = None
            HC_past_threshold_pct = None
    
        fig, ax1 = plt.subplots(figsize=(12, 6))
       
        # Define bins based on the combined range of both classes
        min_value = X.min()
        max_value = X.max()
        bins = np.linspace(min_value, max_value, 30)
       
        # Plot the histogram (stratified by phenotype) on the primary y-axis
        class_0 = X[y_masked == 0]
        class_1 = X[y_masked == 1]
       
        # Plot histograms with counts
        ax1.hist(class_0, bins=bins, alpha=0.3, label='HC', color='blue', density=False)
        ax1.hist(class_1, bins=bins, alpha=0.3, label='MECFS', color='red', density=False)
       
        # Set the y-axis limit for the histogram
        max_hist_class_0 = np.histogram(class_0, bins=bins)[0].max()
        max_hist_class_1 = np.histogram(class_1, bins=bins)[0].max()
        max_hist = max(max_hist_class_0, max_hist_class_1)
        ax1.set_ylim(0, 1.1 * max_hist)
        ax1.set_xlabel(f'{feature}')
        ax1.set_ylabel('Count')
        ax1.legend(loc='upper right')
       
        # Create a secondary y-axis for the ROC curve
        ax2 = ax1.twinx()
        ax2.plot(thresholds, precision[:-1], label='Precision', color='green', linewidth=2)
        ax2.plot(thresholds, recall[:-1], label='Recall', color='purple', linewidth=2)
       
        # Set the y-axis limit for specificity and sensitivity
        ax2.set_ylim(0, 1.05)
        ax2.set_ylabel('Precision/Recall')  
    
        if(precision_80_point):
            ax2.axvline(x=thresholds[precision_80_point], color='red', linestyle=':',
                    label=f'{precision[precision_80_point]*100:.1f}% precision: Recall = {recall[precision_80_point]*100:.1f}%')
           
            ax2.text(thresholds[precision_80_point], -0.05, f'{thresholds[precision_80_point]:.2f}',
                     ha='center', va='top', color='black', fontsize=10, rotation=45)
           
        # Combine legends from both axes
        lines_1, labels_1 = ax1.get_legend_handles_labels()
        lines_2, labels_2 = ax2.get_legend_handles_labels()
       
        # Include class_0 percentage in the legend
        if HC_past_threshold_pct is not None:
            dummy_handle = lines.Line2D([0], [0], linestyle="none", marker='o', markersize=6, label=f'HC passing threshold: {HC_past_threshold_pct*100:.1f}%')
            ax1.legend(lines_1 + lines_2 + [dummy_handle], labels_1 + labels_2 + [dummy_handle.get_label()], loc='upper right')
        else:
            ax1.legend(lines_1 + lines_2, labels_1 + labels_2, loc='upper right')
    
       
        # Save the figure
        plt.title(f'Specificity and Sensitivity vs Threshold: {feature}')
        plt.tight_layout()
        plt.savefig(f'images/{feature}_prec_rec_hist.png')
        plt.close()
       
    
    
       
        # Print some statistics
        print(f"\nFeature: {feature}")
        print(f"AUC: {np.trapz(tpr, fpr):.3f}")
        if(precision_80_point):      
            print(f"Threshold at 80% precision: {thresholds[precision_80_point]:.3f}")
            print(f"Recall at 80% precision: {recall_80*100:.1f}")
            print(f'HC passing threshold: {HC_past_threshold_pct*100:.1f}%')
        else:
            print("No precision >= 80%")
       
       
    
        return feature, recall_80
    
    
    
    # Load the data
    df = pd.read_csv('mecfs_dataset.csv')
    labels = df['phenotype']  # Assuming 'phenotype' column exists in the CSV
    
    # Encode labels
    le = LabelEncoder()
    y = le.fit_transform(labels)
    
    # Get all difference features
    diff_features = [col for col in df.columns if '_diff_' in col]
    
    # Collect sensitivities at 90% specificity
    feature_sensitivities = []
    
    for feature in tqdm(diff_features, desc='Plotting features', unit='features'):
        feature_sensitivities.append(plot_feature(df, feature, y))
    
    feature_sensitivities = [x for x in feature_sensitivities if x[1] is not None]
    
    # Sort by sensitivity at 90% specificity in descending order
    sorted_features = sorted(feature_sensitivities, key=lambda x: x[1], reverse=True)
    
    # Print the sorted features
    print("\nFeatures sorted by recall at >= 80% specificity:")
    for feature, recall_80 in sorted_features:
        print(f"{feature}: {recall_80*100:.1f}% recall at >= 80% precision")
     
    Last edited: Sep 10, 2024
  8. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    1,300
    I think your chart might be checking the threshold backwards from mine, looking at precision for greater than the threshold value, not smaller. The numbers seem right though.
     
  9. ME/CFS Skeptic

    ME/CFS Skeptic Senior Member (Voting Rights)

    Messages:
    4,146
    Location:
    Belgium
    Thanks @forestglip. Your coding skills are impressive and very useful.

    I think your analysis confirms that there might be a small effect. On the other hand, your approach is also iteratively looking for the best threshold, trying different things and looking for a value with the clearest separation. So it is at risk of overfitting the data and finding a pattern that might have happened by chance.

    Therefore I prefer the t-test approach. It also suggested a small-moderate effect (for example for VO2_max it showed a cohen d of 0.35) but because of the large number of outcomes and tests done, it was not considered statistically significant.

    I think that this is because most of the previous studies only reported these outcomes, not all the other ones (except for HR).

    A 2020 meta-analysis tried to estimate the effect sizes. I think it is flawed because it focuses on the mean changes but it still shows that the largest differences were found not at peak values but at AT, especially for Workload (WR). The data by Keller et al. does not really support this as VO2peak rather than WR_AT showed the largest effect sizes.

    upload_2024-9-10_17-32-11.png

    upload_2024-9-10_17-40-40.png
    upload_2024-9-10_17-40-54.png
     
    forestglip likes this.
  10. ME/CFS Skeptic

    ME/CFS Skeptic Senior Member (Voting Rights)

    Messages:
    4,146
    Location:
    Belgium
    I tried to look at the precision level for different values for the VO2_percentage change. I had it switch around the 0 point (checked for smaller values in case of negative values and for bigger values when positive values were used).

    Here's the code I used
    Code:
    df = df_max
    values = np.arange(-40, 40, 0.1)
    results = []
    for value in values:
        if value <= 0:
            filtered_df = df[df['VO2_percentage_difference'] <= value]
        else:
            filtered_df = df[df['VO2_percentage_difference'] >= value]
        n_MECFS = filtered_df[filtered_df['phenotype'] == 'MECFS'].drop_duplicates(subset='ParticipantID').shape[0]
        n_total = filtered_df.drop_duplicates(subset='ParticipantID').shape[0]
        ratio_MECFS = n_MECFS / n_total
        results.append({
            'VO2_difference_value': value,
            'Number of MECFS patients': n_MECFS,
            'Percentage_MECFS': ratio_MECFS * 100,
        })
       
    results = pd.DataFrame(results)
    sns.lineplot(x='VO2_difference_value', y = 'Percentage_MECFS', data = results)
     
    forestglip likes this.
  11. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    1,300
    It seems that even if they only looked at these two metrics in the past, to then run the above script and the same two metrics others looked at are near the top of the list of 58 metrics seems improbable due to chance.

    Valid concern, but I only ran this with a single precision threshold (>=80%) and never even looked at other precisions. I suppose I had considered doing 90% but realized after your corrections that the recall would probably be very small, though.

    I did run it mistakenly with 90% sensitivity, as we saw, though, and that gave a similar result.

    Based on the charts, I suspect that at most high precision thresholds, VO2 would be near the top, though.

    Yes, I'm not sure how you could formally calculate an effect size based on my method, more than "seems improbable".

    If there are subtypes of those who do and don't have an effect, a t-test risks missing it for the group, though, I think.

    Part of the low effect size may also be due to outliers. Looking at the chart for AT_wkld_diff_percentage, some of those increases seem weird. One increased by about 280% and another by about 450%. I'll have to check later if that's accurate.

    Edit: Anyway, the "order thing" might represent a small effect. But looking at the numbers for precision and recall, and looking at the histograms for VO2 and workload, where below a certain threshold, you have many more MECFS than HC, makes me think there is a large effect for some participants, and maybe very small or nonexistent for others.
     
    Last edited: Sep 10, 2024
    ME/CFS Skeptic and MEMarge like this.
  12. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    1,300
    I'll have to look more carefully later, but I think the precision line should gradually verge to (can't find right word) about 50 or 60 percent, as you include more and more of the cohort. So if the threshold for example is high enough that almost all mecfs and hc are included, precision should be close to the balance of the groups.
     
    Amw66 likes this.
  13. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    1,300
    This could be nothing, and basing just on these results from this study is very little evidence. But I think considering the suspicion up to now was that a decrease of more than about 10-20% was good at classifying, and here we see that at around that threshold, the split becomes much more apparent, it might be a sign. There could very well be a flaw in my logic though.
     
    Amw66 and ME/CFS Skeptic like this.
  14. ME/CFS Skeptic

    ME/CFS Skeptic Senior Member (Voting Rights)

    Messages:
    4,146
    Location:
    Belgium
    True, my graph looked weird because I checked precision below a certain value if the value was negative but above the value if it was positive. If I don't do this then I get the convergence to the ratio of ME/CFS patients in the total sample that you mentioned:

    upload_2024-9-10_19-23-18.png
     
    forestglip likes this.
  15. ME/CFS Skeptic

    ME/CFS Skeptic Senior Member (Voting Rights)

    Messages:
    4,146
    Location:
    Belgium
    I think we largely agree. The data does not suggest that there is no difference at all. It's not 100% random noise.

    I would summarise it as: the data suggests that there might be an effect. But the effect is quite small with poor separation between groups and no correlation with the Bell disability scale. The data presented is not large enough to rule out that this effect might have happened due to chance.

    When it comes to 2 day CPET as a test to measure PEM or ME/CFS, I think the data is pretty clear that it has very low potential to do this.
     
    forestglip likes this.
  16. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    1,300
    That all seems about right. I'd just like to see my suspicion tested that maybe a CPET 48 hours later might capture an effect from more participants. 24 might just be too short.

    Edit: That and maybe use a different symptom scale to check correlation, like FUNCAP.
     
    Amw66, Lilas and ME/CFS Skeptic like this.
  17. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    1,300
    @ME/CFS Skeptic I don't know if I can today, but I think it might be worth testing with the outliers removed. Both because a 400% increase seems strange, and because it seems an assumption of the independent t-test is no significant outliers. And maybe the groups should be tested for normal distributions.

    I'd also remove PI-026, which is probably bad data.

    You can see the outliers at least in AT_work_diff_percentage, and maybe also in the VO2 diffs.

    https://www.datanovia.com/en/lessons/t-test-assumptions/#assumptions
    Edit: Or just a test that allows outliers, like Mann-Whitney.
     
    Last edited: Sep 10, 2024
    Lilas likes this.
  18. ME/CFS Skeptic

    ME/CFS Skeptic Senior Member (Voting Rights)

    Messages:
    4,146
    Location:
    Belgium
    Yes I think you make a good point. When expressed as a percentage, there are 4 ME/CFS patients (PI-029, PI-087, PI-114, and PI-166) that have extreme values:
    upload_2024-9-10_23-2-31.png

    That make the distribution of percentage changes quite skewed:
    upload_2024-9-10_23-2-54.png


    With these 4 included, I found a cohen_d of 0.008 and p_value of 0.93. With them excluded, things change quite dramatically. The effect sizes goes to d =0.44 and the p-value = 0.000151.

    upload_2024-9-10_23-5-45.png

    That looks like a reasonable difference. Also tried scipy.stats.mannwhitneyu on the full sample and got a p-value of 0.00503.
     
    Lilas and forestglip like this.
  19. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    1,300
    Interesting. Could you check Mann-Whitney with PI-026 removed?

    I'd think it's probably not best practice to just remove the outliers, though, for the t-test when making a final conclusion, if we're not sure there's some methodological problem with them.

    Edit: Also, if possible, could you check VO2 at AT with Mann-Whitney and PI-026 removed? It seems VO2 is subject to less random variation than workload. Most differences in workload span from about -50% to +50%, while for VO2 it's more contained within about -20% to 20%.
     
    Last edited: Sep 10, 2024
  20. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    1,300
    This is an interesting blog about how Cohen's D is not robust to outliers, and some better alternatives: A single outlier could completely distort your Cohen's d value

    It explains the slightly strange result where if you add a high outlier to the group that is already higher, it decreases the effect size:
    Screenshot_20240910-193317.png

    Alternatives more robust to variability mentioned are Mann-Whitney, Cliff's delta,
    Kolmogorov-Smirnov test, Wilcox and Muska's Q, and mutual information (MI).

    --------

    From the thread's CPET paper:
    So they found that the single day metrics were all normally distributed, but we're looking at difference between days as the feature.
     
    Last edited: Sep 11, 2024

Share This Page