The UK Biobank resource with deep phenotyping and genomic data, 2018, Bycroft et al

hotblack

The UK Biobank resource with deep phenotyping and genomic data

Clare Bycroft, Colin Freeman, Desislava Petkova, Gavin Band, Lloyd T. Elliott, Kevin Sharp, Allan Motyer, Damjan Vukcevic, Olivier Delaneau, Jared O’Connell, Adrian Cortes, Samantha Welsh, Alan Young, Mark Effingham, Gil McVean, Stephen Leslie, Naomi Allen, Peter Donnelly & Jonathan Marchini

Abstract
The UK Biobank project is a prospective cohort study with deep genetic and phenotypic data collected on approximately 500,000 individuals from across the United Kingdom, aged between 40 and 69 at recruitment. The open resource is unique in its size and scope.

A rich variety of phenotypic and health-related information is available on each participant, including biological measurements, lifestyle indicators, biomarkers in blood and urine, and imaging of the body and brain. Follow-up information is provided by linking health and medical records. Genome-wide genotype data have been collected on all participants, providing many opportunities for the discovery of new genetic associations and the genetic bases of complex traits.

Here we describe the centralized analysis of the genetic data, including genotype quality, properties of population structure and relatedness of the genetic data, and efficient phasing and genotype imputation that increases the number of testable variants to around 96 million. Classical allelic variation at 11 human leukocyte antigen genes was imputed, resulting in the recovery of signals with known associations between human leukocyte antigen alleles and many diseases.

Link (Nature) [status]
https://doi.org/10.1038/s41586-018-0579-z
 
After a comment from Simon in the thread on Fluge and Mella about DecodeME looking at HLA, I dug out the DecodeME Data Analysis Plan (link), which says this:

4.2 Human Leukocyte Antigen Complex
Classical human leukocyte antigen (HLA) alleles will be imputed using the HLA*IMP:02 algorithm as previously done for the UKB (8).

This is a reference to the methods in the above paper. The relevant section is copied below:

Imputation of classical HLA alleles
The major histocompatibility complex (MHC) on chromosome six is the most polymorphic region of the human genome and contains the largest number of genetic associations to common diseases [29]. We imputed HLA types at two-field (also known as four-digit) resolution for 11 classical HLA genes (HLA-A, HLA-B, HLA-C, HLA-DRB1, HLA-DRB3, HLA-DRB4, HLA-DRB5, HLA-DQA1, HLA-DQB1, HLA-DPA1 and HLA-DPB1) using the HLA*IMP:02 algorithm with a multi-population reference panel (Supplementary Tables 5 and 6) [30] and validated the accuracy using a cross-validation experiment. In a typical use-case, accuracy was estimated at better than 96% across all loci (see Methods and Supplementary Tables 7, 8).

To demonstrate the utility of the HLA imputation, we performed association tests for diseases known to have HLA associations. We analysed 409,724 individuals in the white British ancestry subset (see Methods) and focused on 11 self-reported immune-mediated diseases with known HLA associations. For each disease in our analysis, we identified the HLA allele with the strongest evidence of association. In all cases these were consistent with previous reports (see Methods and Supplementary Table 9). We further replicated independent HLA associations in a single disease study of multiple sclerosis (MS) susceptibility by the International Multiple Sclerosis Genetics Consortium (IMSGC) [31]. Here we observed evidence of association and effect size estimates for HLA alleles that are concordant in direction and relative magnitude with those found in the IMSGC study, although in 11 out of 14 cases this was closer to 1, consistent with regression dilution bias arising from a low rate of phenotypic error (Table 1).
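As a rough illustration of why phenotype error would pull effect sizes towards 1, here is a minimal sketch of how non-differential misclassification of self-reported disease status attenuates an odds ratio. All numbers (the disease risks, sensitivity and specificity) are hypothetical and chosen only for illustration; none come from the paper.

```python
def odds_ratio(p_exposed: float, p_unexposed: float) -> float:
    """Odds ratio comparing disease probability in carriers vs non-carriers."""
    return (p_exposed / (1 - p_exposed)) / (p_unexposed / (1 - p_unexposed))


def observed_probability(p_true: float, sensitivity: float, specificity: float) -> float:
    """Probability of being *recorded* as a case when the error rate
    is the same for carriers and non-carriers (non-differential error)."""
    return sensitivity * p_true + (1 - specificity) * (1 - p_true)


# Hypothetical values, not taken from the paper.
p_carrier, p_noncarrier = 0.006, 0.002  # true disease risk by allele carriage
sens, spec = 0.90, 0.999                # assumed accuracy of self-report

true_or = odds_ratio(p_carrier, p_noncarrier)
obs_or = odds_ratio(
    observed_probability(p_carrier, sens, spec),
    observed_probability(p_noncarrier, sens, spec),
)
print(f"true OR = {true_or:.2f}, observed OR = {obs_or:.2f}")
```

Under these assumptions the observed odds ratio stays on the same side of 1 as the true one but is smaller in magnitude, which is the pattern the authors describe for most of the MS alleles.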


And from the Methods:
HLA imputation and validation
For each individual we defined the HLA genotype at each locus as the pair of alleles with maximum posterior probability as reported by HLA*IMP:02. We performed association analysis (see, for example, ref. 31) for HLA alleles and each disease using logistic regression. The risk model (additive, dominant, recessive or general), as described previously [31], was used to enable comparison of effect size estimates. For validation and further details, see Supplementary Information section S5. We repeated the analysis, setting genotypes with a maximum posterior probability of <0.7 to missing. No significant differences were observed compared to the full analysis (data not shown). As a negative control, we ran association analyses in the HLA region with imputed HLA alleles for type 2 diabetes (2,849 cases) and myocardial infarction (9,725 cases) in a total of 409,724 individuals and we found no significant associations (all P > 2.40 × 10⁻⁴, the Bonferroni corrected level of association) with any HLA alleles, which is consistent with the lack of associations in the HLA region in recent analyses of each phenotype [44,45].
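To make the genotype-calling step concrete, here is a minimal sketch of picking the maximum-posterior allele pair, optionally setting low-confidence calls to missing, and coding the result for a regression under different risk models. The data layout, function names and example posteriors are hypothetical; this is not the actual HLA*IMP:02 output format, and the "general" model (which needs separate heterozygote and homozygote terms) is left out.

```python
from typing import Dict, Optional, Tuple

Genotype = Tuple[str, str]  # pair of alleles at one HLA locus


def call_genotype(posteriors: Dict[Genotype, float],
                  min_prob: float = 0.0) -> Optional[Genotype]:
    """Return the allele pair with maximum posterior probability,
    or None (missing) if that probability is below min_prob."""
    best = max(posteriors, key=posteriors.get)
    return best if posteriors[best] >= min_prob else None


def encode(genotype: Genotype, allele: str, model: str) -> int:
    """Code one called genotype as a covariate for the given allele."""
    copies = sum(a == allele for a in genotype)
    if model == "additive":
        return copies            # 0, 1 or 2 copies
    if model == "dominant":
        return int(copies >= 1)  # at least one copy
    if model == "recessive":
        return int(copies == 2)  # two copies only
    raise ValueError(f"unknown model: {model}")


# Hypothetical posteriors for one individual at HLA-DRB1.
example = {
    ("DRB1*15:01", "DRB1*03:01"): 0.65,
    ("DRB1*15:01", "DRB1*15:01"): 0.30,
    ("DRB1*03:01", "DRB1*03:01"): 0.05,
}

best_guess = call_genotype(example)              # full analysis
filtered = call_genotype(example, min_prob=0.7)  # sensitivity analysis
print(best_guess, filtered)
```

In this made-up example the individual keeps a heterozygous call in the full analysis but is set to missing once the 0.7 threshold is applied, because the best pair only reaches a posterior of 0.65.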

We estimated the accuracy of the imputation process using fivefold cross-validation in the reference panel samples. For samples of European ancestry, the estimated four-digit accuracy for the maximum posterior probability genotype is above 93.9% for all 11 loci (Supplementary Table 7). This accuracy improved to above 96.1% for all 11 loci after restricting to HLA allelic variant calls with a posterior probability greater than 0.70. This resulted in call rates above 95.1% for all loci (Supplementary Table 8).
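The trade-off between accuracy and call rate in that paragraph can be sketched as follows. This is only an illustration with invented data: accuracy is assumed here to mean the share of non-missing calls whose allele pair matches the known type, which may differ from the exact definition used in the Supplementary Information.

```python
from typing import List, Optional, Tuple

Genotype = Tuple[str, str]


def call_rate_and_accuracy(calls: List[Optional[Genotype]],
                           truth: List[Genotype]) -> Tuple[float, float]:
    """Call rate: fraction of samples with a non-missing call.
    Accuracy: fraction of non-missing calls matching the true pair
    (allele order within a pair is ignored)."""
    called = [(c, t) for c, t in zip(calls, truth) if c is not None]
    call_rate = len(called) / len(calls)
    correct = sum(sorted(c) == sorted(t) for c, t in called)
    accuracy = correct / len(called) if called else float("nan")
    return call_rate, accuracy


# Invented held-out samples with known HLA-A types.
truth = [("A*01:01", "A*02:01"), ("A*02:01", "A*02:01"),
         ("A*03:01", "A*24:02"), ("A*01:01", "A*03:01")]

# Best-guess calls: the last one is wrong and had a low posterior.
unfiltered = [("A*02:01", "A*01:01"), ("A*02:01", "A*02:01"),
              ("A*03:01", "A*24:02"), ("A*01:01", "A*11:01")]

# Same calls after setting the low-posterior call to missing.
thresholded = unfiltered[:3] + [None]

print(call_rate_and_accuracy(unfiltered, truth))
print(call_rate_and_accuracy(thresholded, truth))
```

Dropping the uncertain call raises accuracy among the remaining calls at the cost of a lower call rate, which is the same direction of change the paper reports when restricting to posteriors above 0.70.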
 