The supplementary figure schematic depicts the input data as the images themselves, but I think this is actually incorrect if they're just using a standard PLS model. The input would already have to be discrete data points, so going by the text it seems to be the voxel-wise SUVR values. Basically, for every identified "voxel" (a little portion of the brain), there's a corresponding signal value. Depending on how many "voxels" they defined, this could end up being a very large number of features.
Therefore I was wrong earlier: it's not 2400 image slices. Since it said they used leave-one-"pair"-out cross validation, I realized 2400 is the total number of LC/HC pairs: 30 LC x 80 HC participants.
In normal machine learning, leave-one-out cross validation is a method where you train the model N times, each time on N-1 data points, leaving out one data point and generating a predicted value for it. Theoretically this is the best option for reducing overfitting, since each data point is only used to generate a prediction once. Therefore, if a particular rare feature gives perfect classification for only 2 or 3 individuals, its contribution to the model should be basically nil for the vast majority of CV folds, and it will end up not having much importance in the final model that aggregates weights across the CV folds.
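To make the N fits / N-1 training points thing concrete, here's a minimal sketch of plain leave-one-out CV. The "model" is just a midpoint threshold between the two class means, which is my stand-in for illustration and not the PLS model from the paper; the values and labels are made up.

```python
from statistics import mean

def fit_threshold(values, labels):
    """Toy stand-in model: midpoint between the two class means.
    Assumes class 1 tends to have the higher values."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    return (mean(pos) + mean(neg)) / 2

def leave_one_out(values, labels):
    """Fit N models, each on N-1 points, and predict the one held-out point."""
    predictions = []
    n_fits = 0
    for i in range(len(values)):
        # Training set is everything except data point i
        train_x = values[:i] + values[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        threshold = fit_threshold(train_x, train_y)
        n_fits += 1
        # Data point i gets exactly one prediction, from a model that never saw it
        predictions.append(1 if values[i] > threshold else 0)
    return predictions, n_fits

# Made-up, well-separated toy data
values = [1.0, 1.2, 0.9, 1.1, 2.0, 2.2, 1.9, 2.1]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

predictions, n_fits = leave_one_out(values, labels)
print("model fits:", n_fits)
print("predictions:", predictions)
```

The key property is that every data point is predicted once and only once.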
So what they're doing for the cross-validation is training a model with all the voxel-wise SUVR values for every participant except the held-out LC and HC individuals in that CV fold. But this is not truly leave-one-out cross validation, since every participant is being "tested" multiple times across the combinatorial pairs: each LC participant is held out once per HC participant, and vice versa. That means the likelihood of overfitting features being retained in the final model goes way up.
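Here's a quick sketch of what the pair scheme implies, using only the 30 LC and 80 HC counts from the text. The participant IDs are made up, and I'm assuming every LC/HC combination is used as a fold, which is what the 2400 figure suggests.

```python
from collections import Counter
from itertools import product

N_LC, N_HC = 30, 80  # cohort sizes from the text

# Hypothetical participant IDs, for illustration only
lc_ids = [f"LC{i:02d}" for i in range(N_LC)]
hc_ids = [f"HC{i:02d}" for i in range(N_HC)]

# One fold per (LC, HC) pair: that pair is held out, everyone else trains
folds = list(product(lc_ids, hc_ids))

# Count how many folds hold out each participant
held_out_counts = Counter()
for lc, hc in folds:
    held_out_counts[lc] += 1
    held_out_counts[hc] += 1

train_size = (N_LC - 1) + (N_HC - 1)

print("number of folds:", len(folds))
print("training participants per fold:", train_size)
print("times each LC is held out:", held_out_counts["LC00"])
print("times each HC is held out:", held_out_counts["HC00"])
```

So each LC participant gets scored 80 times and each HC participant 30 times, with training sets that overlap almost completely from fold to fold.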
For the performance metric, it seems like both of the participants in the left-out pair are getting a score predicting the likelihood of being LC or HC. Then, according to the text, they are averaging the score for each participant across all the folds that held them out. The ROC in Fig 4B plots the sensitivity/specificity at increasing "threshold" values used to determine whether a participant with a given score is classified as LC or HC. At the best performing threshold, the reported sensitivity is 0.912 and specificity is.
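As far as I can tell from the text, the scoring works roughly like the sketch below: average each participant's scores over the folds that held them out, then pick a threshold and count hits. The scores, IDs, and threshold here are all made up, and the function names are mine, not anything from the paper.

```python
from collections import defaultdict
from statistics import mean

def average_scores(fold_scores):
    """fold_scores: (participant_id, score) tuples, one per fold that
    held that participant out. Returns the mean score per participant."""
    by_participant = defaultdict(list)
    for pid, score in fold_scores:
        by_participant[pid].append(score)
    return {pid: mean(scores) for pid, scores in by_participant.items()}

def sensitivity_specificity(avg_scores, true_labels, threshold):
    """Call a participant LC if their averaged score is >= threshold."""
    tp = fn = tn = fp = 0
    for pid, score in avg_scores.items():
        predicted_lc = score >= threshold
        if true_labels[pid] == "LC":
            tp += predicted_lc
            fn += not predicted_lc
        else:
            fp += predicted_lc
            tn += not predicted_lc
    return tp / (tp + fn), tn / (tn + fp)

# Made-up per-fold scores (higher = more LC-like)
fold_scores = [
    ("LC1", 0.875), ("LC1", 0.625),
    ("LC2", 0.5), ("LC2", 0.75),
    ("LC3", 0.25), ("LC3", 0.5),
    ("HC1", 0.125), ("HC1", 0.375),
    ("HC2", 0.25), ("HC2", 0.25),
    ("HC3", 0.5), ("HC3", 0.625),
    ("HC4", 0.0), ("HC4", 0.25),
]
true_labels = {pid: pid[:2] for pid, _ in fold_scores}

avg = average_scores(fold_scores)
sens, spec = sensitivity_specificity(avg, true_labels, threshold=0.5)
print("sensitivity:", sens, "specificity:", spec)

# Sweeping the threshold over every observed score traces out the ROC points
roc_points = [sensitivity_specificity(avg, true_labels, t)
              for t in sorted(set(avg.values()))]
```

Each point on the ROC curve is just one (sensitivity, specificity) pair like this at a different threshold.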
As I already noted, these metrics are not from a truly held-out test cohort, which is what you would want to see to determine whether this model is purely hung up on artifacts or actually generalizable.
Also, as already noted earlier in the thread, the healthy control images were from a prior study. There is no mention in the text of doing any sort of batch correction. The fact that this signal appears higher in every single brain region shown in Fig. 2 makes it pretty obvious it's a global skew. So if one group has consistently higher levels for every single one of your features, you can get a fantastic prediction model no matter which features you use.
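To illustrate why a global skew makes feature choice irrelevant, here's a toy simulation. Everything in it is an assumption for illustration: the number of features, the noise level, and the size of the uniform offset are made up and not estimated from the paper. The only thing distinguishing the two groups is one constant shift added to every feature, and each feature is then used on its own as a classifier.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the simulation is repeatable

N_LC, N_HC = 30, 80     # cohort sizes from the text
N_FEATURES = 50         # hypothetical number of voxel features
GLOBAL_OFFSET = 0.5     # hypothetical uniform shift between cohorts
NOISE_SD = 0.1          # hypothetical per-voxel noise

def simulate(n_participants, offset):
    """Every feature is pure noise around 1.0, plus the same offset."""
    return [[1.0 + offset + rng.gauss(0, NOISE_SD) for _ in range(N_FEATURES)]
            for _ in range(n_participants)]

lc = simulate(N_LC, GLOBAL_OFFSET)
hc = simulate(N_HC, 0.0)

def single_feature_accuracy(j):
    """Classify using feature j alone, with a midpoint threshold.
    This is in-sample accuracy, which is enough to make the point."""
    lc_vals = [row[j] for row in lc]
    hc_vals = [row[j] for row in hc]
    threshold = (mean(lc_vals) + mean(hc_vals)) / 2
    correct = (sum(v > threshold for v in lc_vals)
               + sum(v <= threshold for v in hc_vals))
    return correct / (N_LC + N_HC)

accuracies = [single_feature_accuracy(j) for j in range(N_FEATURES)]
print(f"worst single-feature accuracy: {min(accuracies):.3f}")
print(f"best single-feature accuracy:  {max(accuracies):.3f}")
```

None of these simulated features carries any region-specific information, yet every one of them separates the groups well. That's the pattern you'd expect from a scanner or processing difference between the two cohorts, which is why the missing batch correction matters.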