Quote:
Dr. Tate mentioned that he would be modifying the results section to avoid giving the wrong impression.

That's good news and will go some way to making things clearer, although inclusion of that PCA is still a problem. Many of us went 'wow! that looks amazing' when we looked at the chart - and not everyone has the interest or time to think 'it's too good to be true' and look closer at things.
Quote:
I don't think PCA is a useful tool for this particular analysis. It doesn't need to be in the paper, it's circular and is a distraction. The manuscript would be better if it just acknowledged the very small sample sizes and gave more space to the consideration of whether the identified DMFs, especially those common to both disease groups, might tell us something about ME/CFS and ME/CFS-like LC.

For what it’s worth, the full PCA does actually tell an interesting story along PC2: the fact that there are distinct “Neapolitan stripes” between the three groups, despite some messy outliers, is impressive, since it is actually taking into consideration all 70K sites (even though the 5/5 selection down to 70K will introduce some skewing to begin with; whether you can pre-select sites/peaks based on a priori knowledge has been a lively debate in epigenomic sequencing for years).
What was the percentage variation explained by the PC1 and PC2 axes?
Quote:
I’m certainly not going to call that a definitive finding on the basis of n=15, but it could absolutely be the basis of a future study comparing people who got LC at the beginning of the pandemic to those recently afflicted, to assess the impact of disease duration with the same trigger.

Sure. There might be other sources of variation between cohorts too, such as whether a sample was taken in the afternoon or morning, or how the specimen was stored.
Quote:
I do understand why they didn’t show it like that though. If you don’t know to look along each axis separately, it just all looks like soup. But I think there were possible ways to finesse this: using only the PC2 score as its own latent variable and showing it as a box plot, for example.

I think it's reasonable to assume that readers of a paper like this will understand how to read a PCA chart, and, in any case, it's easy enough for the text to explain that it's PC2 that separates the groups, so that people without prior knowledge will understand. But I don't think it's reasonable to expect a PCA to separate out people on the basis of disease-specific DNA methylation when you only have 5 people per group and 70k fragments resulting from all sorts of biological processes.
I would give the authors the benefit of the doubt, let the manuscript be peer reviewed, and see whether the reviewers pick up on the issues we have been discussing and whether they deem the paper fit for publication.
Yes, definitely, there are better statistical analysis and presentation tools, and I don't think they have to be related to PCA at all. What we want to know from a study like this is 'what associated genes were found to be differentially methylated between the groups?', so that information can give us ideas about the disease mechanism.
R code:

library(tidyverse)
# Parameters
set.seed(1)  # fix the RNG seed so the simulated example is reproducible
n_features <- 72000
n_individuals <- 15
# Create group labels
groups <- factor(rep(c("group1", "group2", "group3"), each = n_individuals/3))
individuals <- paste0("ind", 1:n_individuals)
feature_names <- paste0("feature", 1:n_features)
# Set up the data matrix; note that no true group differences are simulated
data_matrix <- matrix(nrow = n_features, ncol = n_individuals)
rownames(data_matrix) <- feature_names
colnames(data_matrix) <- individuals
# Simulate pure noise: every value is drawn from the same distribution,
# regardless of group membership
for (i in 1:n_features) {
  feature_values <- rnorm(n_individuals, mean = 10, sd = 2)
  data_matrix[i, ] <- feature_values
}
# Transpose for ANOVA
data_df <- as.data.frame(t(data_matrix))
data_df$group <- groups
# Function to perform a one-way ANOVA for one feature and return its p-value
perform_anova <- function(feature) {
  aov_result <- aov(data_df[[feature]] ~ group, data = data_df)
  p_value <- summary(aov_result)[[1]]$"Pr(>F)"[1]
  return(p_value)
}
# Run an ANOVA for every feature (72,000 fits, so this step is slow)
p_values <- sapply(feature_names, perform_anova)
# Select "significant" features (p < 0.05, with no multiple-testing correction).
# With no real signal, about 5% of the 72,000 features (~3,600) pass by chance.
significant_features <- names(p_values)[p_values < 0.05]
n_significant <- length(significant_features)
significant_data <- data_matrix[significant_features, ]
# PCA on the pre-selected "significant" subset (circular) and on all features
pca_result_significant <- prcomp(t(significant_data), scale. = TRUE)
pca_result_all <- prcomp(t(data_matrix), scale. = TRUE)
# Plot PCA
pca_df_sig <- as.data.frame(pca_result_significant$x)
pca_df_sig$group <- groups
plot_PCA_sig <- ggplot(pca_df_sig, aes(x = PC1, y = PC2, color = group)) +
  geom_point(size = 3) +
  stat_ellipse(level = 0.95) +
  ggtitle("PCA of Significant Features") +
  theme_minimal()
pca_df_all <- as.data.frame(pca_result_all$x)
pca_df_all$group <- groups
plot_PCA_all <- ggplot(pca_df_all, aes(x = PC1, y = PC2, color = group)) +
  geom_point(size = 3) +
  stat_ellipse(level = 0.95) +
  ggtitle("PCA of All Features") +
  theme_minimal()
# Variance explained by each PC, then the two plots
summary(pca_result_significant)
summary(pca_result_all)
plot_PCA_sig
plot_PCA_all
@jnmaciuch described the PCA on the 72k data as Neapolitan layers with messy outliers on the PC2 axis. That chart from random variables is not far off that - in fact I can see a bit of separation of the "groups" on both axes. With so few data points, it isn't hard to find a pattern of some sort.

Note that doing the PCA on all the simulated results doesn't separate the groups; if Peppercorn/Tate's does, that would be of interest. I hope we can get to see that PCA soon.
Quote:
What was the percentage variation explained by the PC1 and PC2 axes?

A little over and a little under 10%, respectively. So there's a lot of noise even when selecting down to 70K, which is probably to be expected.
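For anyone who wants to check the equivalent figure on the simulated data, the proportion of variance explained can be read directly off a prcomp fit. A minimal sketch, reusing the pca_result_all object from the simulation code above:

# Percent of total variance captured by each principal component
pct_var <- 100 * pca_result_all$sdev^2 / sum(pca_result_all$sdev^2)
round(pct_var[1:2], 1)  # percentage along PC1 and PC2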
Quote:
Thank you @chillier! Yes, those charts are exactly what I was expecting.

Quote:
@jnmaciuch described the PCA on the 72k data as Neapolitan layers with messy outliers on the PC2 axis. That chart from random variables is not far off that - in fact I can see a bit of separation of the "groups" on both axes. With so few data points, it isn't hard to find a pattern of some sort.

[edit: sorry, hit post too soon] Yes, if you pulled more of the red dots into the middle and pushed the green and blue further apart, it wouldn't look too far off. It's a weak signal, so it would probably come down to an ANOVA between groups on the PC2 scores. I'd definitely like to see replication in a larger cohort.
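A minimal sketch of that box-plot-plus-ANOVA idea on the simulated data, reusing the pca_df_all data frame (which already carries the group labels) from the simulation code above; on the real data the same approach would use the PC2 scores from the full 70K-site PCA:

# Box plot of PC2 scores by group, plus a one-way ANOVA on those scores
ggplot(pca_df_all, aes(x = group, y = PC2, fill = group)) +
  geom_boxplot() +
  ggtitle("PC2 scores by group") +
  theme_minimal()
summary(aov(PC2 ~ group, data = pca_df_all))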
Quote:
I thought all it would take was to point it out and these researchers would go 'oh, yes' and remove the PCA.

It looks like Katie Peppercorn and Sayan Sharma did the actual analysis and might have a better understanding of what is being asked. I like to remember that many of our researchers are from an older generation with different skills to the younger ones.
Author Contributions
Conceptualization, W.P.T. and A.C.; methodology, A.C., E.J.R., P.A.S. and K.P.; software, P.A.S.; validation, E.J.R.; formal analysis, K.P. and S.S.; investigation, K.P., S.S. and C.D.E.; resources, W.P.T., A.C. and P.A.S.; data curation, E.J.R., S.S. and K.P.; writing—original draft/final preparation, W.P.T.; writing—editing, K.P., S.S., E.J.R. and A.C.; visualization, S.S.; supervision, W.P.T. and A.C.; project administration, W.P.T. and K.P.; funding acquisition, W.P.T. and A.C. K.P. and S.S. have made equal contributions and are joint first authors. All authors have read and agreed to the published version of the manuscript.
Quote:
It looks like Katie Peppercorn and Sayan Sharma did the actual analysis and might have a better understanding of what is being asked. I like to remember that many of our researchers are from an older generation with different skills to the younger ones.

The other last author is the BSS analysis expert Tate referenced in the email to @Hutan. It seems they earned their PhD more recently, and probably directed/advised the first authors on the analysis. Even if the first authors saw the logic in Hutan's point, ultimately issues like this (and whether or not to heed them as issues) would be up to the last authors.
Quote:
Bisulphite Sequencing (RRBS) was applied to the DNA of age- and sex-matched cohorts: ME/CFS (n=5), LC (n=5) and HC (n=5).

I appreciate all the thoughtful comments on this paper, but with only 5 ME/CFS patients and 5 healthy controls, I don't think it's possible to get useful results, no matter how you analyse the data.
I think for similar big data-driven methods, such as ATAC-seq and RNA-seq, it is possible to get worthwhile results even from such a small cohort. Information at an individual gene level is almost certainly going to be highly variable and frankly is not expected to generalize even from larger cohorts than this (unless you're talking several hundred or thousand participants). However, from my experience with the other two methods I mentioned, it is entirely possible to get surprisingly robust results at the meta-analysis level (i.e. pathway hits, or general trends such as "an overall trend towards hypermethylation rather than hypomethylation") even with such a small sample size.
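To make the pathway-level point concrete, here is a sketch of a hypergeometric over-representation test in R, the usual back-of-envelope check for a "pathway hit"; all of the counts are hypothetical, purely for illustration:

# Hypothetical counts, for illustration only
n_background <- 20000  # genes in the background set
n_pathway <- 150       # background genes annotated to the pathway
n_dm <- 400            # differentially methylated genes found
n_overlap <- 12        # DM genes that fall in the pathway
# P(overlap >= 12) if DM genes were drawn at random from the background
phyper(n_overlap - 1, n_pathway, n_background - n_pathway, n_dm, lower.tail = FALSE)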
Quote:
The paper and the discussion are over my head but just wondering, would it be worth the effort for someone to submit a comment raising the points discussed here, to be published beside the preprint? It might then catch the eyes of the reviewers.

That's a good idea. There is a comment facility there.