Test-Retest Reliability of Standardized Diagnostic Interviews for Common Adult Psychiatric Disorders, 2026, Xie

rvallee

Test-Retest Reliability of Standardized Diagnostic Interviews for Common Adult Psychiatric Disorders
Weiyi Xie, MPH; Julie Nordgaard, MD, PhD, DMSc; R. Christopher Sheldrick, PhD; Juwairiya F. Ahmad, MPH; Fabiano A. Gomes, MD, PhD; Benicio N. Frey, MD, PhD; Laura Duncan, PhD

Key Points
Question
What is the pooled test-retest reliability of standardized diagnostic interviews for common adult psychiatric disorders, including groups of mental disorders and substance use disorders (SUDs), and which factors explain between-study variation?

Findings
In this systematic review and meta-analysis of 57 studies, the overall test-retest reliability of standardized diagnostic interviews was moderate and highly heterogeneous, with significantly higher reliability for SUDs than for mental disorders. Only diagnostic criteria partially explained between-study variation in SUDs.

Meaning
The finding that standardized diagnostic interviews, although regarded as the gold standard of psychiatric disorder classification, had moderate and highly variable test-retest reliability highlights the need for careful selection and use.


Abstract

Importance
Standardized diagnostic interviews (SDIs) are structured assessments based on established criteria to improve the consistency and reliability of diagnoses. The pooled test-retest reliability of SDIs for adult psychiatric disorders is unknown.

Objectives
To estimate the test-retest reliability of SDIs used to classify common adult psychiatric disorders, examine variations in test-retest reliability between disorders, and assess prespecified factors associated with between-study heterogeneity.

Data Sources
MEDLINE, Embase, Emcare, PsycINFO, and Applied Social Sciences Index and Abstracts were searched without date or language limitations from inception until September 2025. References of eligible articles and relevant reviews were also screened.

Study Selection
Primary studies that evaluated test-retest reliability of SDIs assessing adult psychiatric disorders were selected. Disorders were selected based on estimated prevalence in the general adult population, clinical relevance, and frequent appearance in SDIs.

Data Extraction and Synthesis
Data were extracted and study quality was assessed based on the Consensus-based Standards for the Selection of Health Measurement Instruments checklist. Multilevel random-effects meta-analysis and meta-regression were performed. Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines were followed.

Main Outcomes and Measures
Test-retest reliability estimates (Cohen κ) of SDI-based adult psychiatric disorder diagnoses. Pooled estimates were calculated for 5 groups of mental disorders (anxiety, bipolar, depressive, personality, and nonaffective psychoses) and 8 groups of substance use disorders (SUDs; alcohol, cannabis, cocaine, hallucinogens, opioids, sedatives, stimulants, and tobacco).

Results
Fifty-seven studies were analyzed, 46 of which were included in the meta-analysis (535 κ estimates; N = 8146 participants [mean age range, 22.0-54.3 years]). The pooled estimate of SDI test-retest reliability was κ = 0.69 (95% CI, 0.66-0.72), with substantial between-study heterogeneity (Q₅₃₄ = 23 578.7; P < .001; I² = 93%). Reliability was higher for SUDs than for mental disorders (κ = 0.72 [95% CI, 0.69-0.72; 292 estimates] vs 0.65 [95% CI, 0.61-0.69; 243 estimates]; z = 3.74; P < .001) and varied among disorder types. Reliability for mental disorders ranged from κ = 0.55 (95% CI, 0.44-0.66) for nonaffective psychoses to κ = 0.74 (95% CI, 0.56-0.91) for bipolar disorders. Reliability for SUDs ranged from κ = 0.59 (95% CI, 0.49-0.70) for hallucinogens to κ = 0.81 (95% CI, 0.74-0.88) for opioids. Univariate meta-regression indicated that diagnostic criteria partially explained between-study variation in SUDs, whereas methodological quality indicators (eg, small sample size and retest interval) did not.

Conclusions and Relevance
In this systematic review and meta-analysis, SDIs showed moderate and heterogeneous test-retest reliability that varied substantially across common adult psychiatric disorders. The findings indicated that structural standardization alone may not be sufficient to ensure consistent psychiatric diagnosis and highlighted the importance of considering contextual and phenomenological information in diagnostic assessment and research practice.
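
For readers unfamiliar with the statistic: the abstract reports reliability as Cohen κ, which measures how often two administrations of an interview agree beyond what chance alone would produce. Below is a minimal sketch of that calculation for a single yes/no diagnosis. The function name `cohen_kappa` and the counts in the example are hypothetical and are not taken from the paper, and this is not the multilevel pooling the authors performed.

```python
def cohen_kappa(both_pos, test_only, retest_only, both_neg):
    """Cohen's kappa for a 2x2 test-retest table of a yes/no diagnosis.

    both_pos    : diagnosed at both interviews
    test_only   : diagnosed at the first interview only
    retest_only : diagnosed at the second interview only
    both_neg    : diagnosed at neither interview
    """
    n = both_pos + test_only + retest_only + both_neg
    if n == 0:
        raise ValueError("table is empty")

    # Observed agreement: share of people given the same result twice.
    p_observed = (both_pos + both_neg) / n

    # Agreement expected by chance, from the marginal positive rates.
    p_test_pos = (both_pos + test_only) / n
    p_retest_pos = (both_pos + retest_only) / n
    p_expected = (p_test_pos * p_retest_pos
                  + (1 - p_test_pos) * (1 - p_retest_pos))

    if p_expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")

    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical counts (illustration only, not data from the study):
# 200 people, 40 diagnosed twice, 10 + 10 discordant, 140 never diagnosed.
kappa = cohen_kappa(both_pos=40, test_only=10, retest_only=10, both_neg=140)
print(round(kappa, 2))  # 0.73
```

In this made-up example 90% of people get the same result at both interviews, yet κ is only about 0.73, because much of that agreement is expected by chance when most people are not diagnosed at either time.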
 
This is about interviews. Nothing surprising, although the exact same should be expected of standard questionnaires like GAD and PHQ: they are not reliable tools. Obviously the idea of using a test-retest approach on problems that fluctuate was always doomed to failure. It is only valid when the underlying answers don't change spontaneously and can be expected to remain identical across time and users. Otherwise this approach might as well be dowsing rods.

Labelling bad tools as "gold standard" because they're the best available, when they're actually lousy, is a terrible idea, especially when it's done explicitly to inspire confidence in those lousy tools. All it does is inspire justified suspicion about the validity of all other tools labelled as "gold standard". The usual outcome of doing this is to remain stuck in place, and sure enough this is what's happening in mental health care: they find comfortable local minima and just stay there, indifferent to being so far from the global minimum that they might as well not have bothered.
 