Institute for Quality and Efficiency in Health Care (IQWiG) / General Methods
https://www.iqwig.de/methoden/general-methods_version-6-1.pdf
9.3 Specific statistical aspects
9.3.1 Description of effects and risks
The description of intervention or exposure effects needs to be clearly linked to an explicit outcome variable. Choosing a different outcome variable also changes the description and the size of a possible effect. The choice of an appropriate effect measure depends in principle on the measurement scale of the outcome variable in question. For continuous variables, effects can usually be described using mean values and differences in mean values (if necessary, after suitable weighting). For binary outcome variables, the usual effect and risk measures derived from 2×2 tables apply [45]. Chapter 10 of the Cochrane Handbook for Systematic Reviews of Interventions [161] provides a well-structured summary of the advantages and disadvantages of typical effect measures in systematic reviews. Agresti [10,11] describes the specific aspects to be considered for ordinal data.
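As a minimal illustration (not part of the IQWiG text itself), the following Python sketch computes the usual effect and risk measures from a hypothetical 2×2 table; all counts are invented for illustration.

```python
# Hypothetical 2x2 table: events / no events under intervention and control.
events_t, no_events_t = 30, 70    # intervention group (n = 100)
events_c, no_events_c = 45, 55    # control group (n = 100)

n_t = events_t + no_events_t
n_c = events_c + no_events_c

risk_t = events_t / n_t           # absolute risk, intervention
risk_c = events_c / n_c           # absolute risk, control

risk_difference = risk_t - risk_c                                  # RD
risk_ratio = risk_t / risk_c                                       # RR
odds_ratio = (events_t * no_events_c) / (events_c * no_events_t)   # OR

print(f"RD = {risk_difference:.3f}, RR = {risk_ratio:.3f}, OR = {odds_ratio:.3f}")
```

Which of these measures is most appropriate depends on the research question; the Cochrane Handbook chapter cited above discusses the trade-offs.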
It is essential to describe the degree of statistical uncertainty for every effect estimate. For this purpose, the calculation of the standard error and the presentation of a confidence interval are frequently applied methods. Whenever possible, the Institute will state appropriate confidence intervals for effect estimates, including information on whether one- or two-sided confidence limits apply and on the confidence level chosen. In medical research, the two-sided 95% confidence level is typically applied; in some situations, 90% or 99% levels are used. Altman et al. [19] give an overview of the most common calculation methods for confidence intervals. To ensure that the nominal confidence level is actually maintained, the application of exact methods for the interval estimation of effects and risks should be considered, depending on the particular data situation (e.g. very small samples) and the research question posed. Agresti [12] provides an up-to-date discussion of exact methods.
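To make the contrast between large-sample and exact intervals concrete, the following sketch (assuming the statsmodels library; the counts are invented) compares a Wald interval with an exact Clopper-Pearson interval for a single proportion in a very small sample.

```python
from statsmodels.stats.proportion import proportion_confint

# Hypothetical small sample: 3 events out of 12 patients.
events, n = 3, 12

# Large-sample (Wald) interval; may undercover with samples this small.
lo_w, hi_w = proportion_confint(events, n, alpha=0.05, method="normal")

# Exact (Clopper-Pearson) interval; guarantees at least 95% coverage.
lo_e, hi_e = proportion_confint(events, n, alpha=0.05, method="beta")

print(f"Wald 95% CI:  [{lo_w:.3f}, {hi_w:.3f}]")
print(f"Exact 95% CI: [{lo_e:.3f}, {hi_e:.3f}]")
```

The exact interval is typically wider, which is the price paid for keeping the actual coverage at or above the nominal confidence level.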
9.3.2 Evaluation of statistical significance
Statistical significance tests make it possible to test hypotheses formulated a priori while controlling the type I error probability. The convention of speaking of a “statistically significant result” when the p-value is below the significance level of 0.05 (p < 0.05) may often be meaningful. Depending on the research question posed and the hypothesis formulated, a lower significance level may be required; conversely, there are situations where a higher significance level is acceptable. The Institute will always explicitly justify such exceptions. Several aspects should be considered when interpreting p-values. It must be absolutely clear which research question and data situation the significance level refers to and how the statistical hypothesis is formulated. In particular, it should be evident whether a one- or two-sided hypothesis applies [61] and whether the hypothesis tested is to be regarded as part of a multiple hypothesis testing problem [713]. Both aspects, whether a one- or two-sided hypothesis is to be formulated and whether adjustments for multiple testing need to be made, are a matter of repeated controversy in the scientific literature [240,430].
Regarding the hypothesis formulation, a two-sided test problem is traditionally assumed. Exceptions include non-inferiority studies. The formulation of a one-sided hypothesis problem is in principle always possible, but requires precise justification. In the case of a one-sided hypothesis formulation, the application of one-sided significance tests and the calculation of one-sided confidence limits are appropriate. For better comparability with two-sided statistical methods, some guidelines for clinical trials require that the typical significance level should be halved from 5% to 2.5% [371]. The Institute generally follows this approach. The Institute furthermore follows the central principle that the hypothesis formulation (one- or two-sided) and the significance level must be specified clearly a priori. In addition, the Institute will justify deviations from the usual specifications (one-sided instead of two-sided hypothesis formulation; significance level unequal to 5%, etc.) or consider the relevant explanations in the primary literature.
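The relation between one- and two-sided testing and the halved significance level can be illustrated with a small sketch (assuming scipy; the data are invented for illustration).

```python
from scipy import stats

# Hypothetical change scores in two groups (invented values).
treatment = [1.9, 2.4, 1.7, 2.8, 2.2, 2.6, 1.8, 2.5]
control   = [1.2, 1.6, 1.1, 1.9, 1.4, 1.8, 1.3, 1.5]

# Two-sided test at the conventional 5% level.
t_two = stats.ttest_ind(treatment, control, alternative="two-sided")

# One-sided test (treatment > control), evaluated at the halved 2.5% level
# for comparability with the two-sided procedure, as described above.
t_one = stats.ttest_ind(treatment, control, alternative="greater")

print(f"two-sided p = {t_two.pvalue:.4f} (compare with alpha = 0.05)")
print(f"one-sided p = {t_one.pvalue:.4f} (compare with alpha = 0.025)")
```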
If the hypothesis investigated clearly forms part of a multiple hypothesis problem, appropriate adjustment for multiple testing is required if the type I error is to be controlled for the whole multiple hypothesis problem [53]. The problem of multiplicity cannot be solved completely in systematic reviews, but should at least be considered in the interpretation of results [48]. If meaningful and possible, the Institute will apply methods to adjust for multiple testing; a sketch of such an adjustment follows below. In its benefit assessments (see Section 3.1), the Institute attempts to control type I errors separately for the conclusions on every single benefit outcome. A summarizing evaluation is not usually conducted in a quantitative manner, so formal methods of adjustment for multiple testing cannot be applied here either.
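As one possible adjustment procedure (assuming the statsmodels library; the p-values are invented for illustration), Holm's step-down method controls the family-wise type I error across the whole multiple hypothesis problem.

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from tests on several outcomes of one study.
p_values = [0.012, 0.034, 0.041, 0.200]

# Holm's step-down procedure keeps the family-wise type I error at 5%.
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")

for p, p_adj, rej in zip(p_values, p_adjusted, reject):
    print(f"p = {p:.3f} -> adjusted p = {p_adj:.3f}, reject: {rej}")
```

Note how a p-value that is nominally below 0.05 (here 0.041) may no longer be significant once the adjustment is applied.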
The Institute does not evaluate a statistically non-significant finding as evidence of the absence of an effect (proof of equality or equivalence) [17]. To demonstrate equivalence, the Institute will apply appropriate methods for equivalence hypotheses. In principle, Bayesian methods may be regarded as an alternative to statistical significance tests [670,671]. Depending on the research question posed, the Institute will, where necessary, also apply Bayesian methods (e.g. for indirect comparisons, see Section 9.3.8).
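One standard frequentist method for equivalence hypotheses is the two one-sided tests (TOST) procedure; the sketch below (assuming statsmodels; data and equivalence margin are invented for illustration) shows its basic use.

```python
from statsmodels.stats.weightstats import ttost_ind

# Hypothetical data; the groups are considered equivalent if the mean
# difference lies within the (invented) margin of +/- 0.5.
group_a = [5.1, 4.8, 5.3, 5.0, 4.9, 5.2, 5.1, 4.7]
group_b = [5.0, 5.2, 4.9, 5.1, 5.3, 4.8, 5.0, 5.2]

# Two one-sided tests (TOST): equivalence is demonstrated only if the
# overall p-value falls below the significance level.
p_value, lower_test, upper_test = ttost_ind(group_a, group_b, low=-0.5, upp=0.5)

print(f"TOST p-value for equivalence: {p_value:.4f}")
```

Unlike a non-significant result of an ordinary superiority test, a significant TOST result constitutes positive evidence that the true difference lies within the prespecified equivalence margin.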