Advancing regulatory variant effect prediction with AlphaGenome 2026 Avsec et al

The batch variant scoring tool looks like it'll be useful to test many variants at once: https://www.alphagenomedocs.com/colabs/batch_variant_scoring.html

I think the goal is basically looking for variants that produce a large (or small if negatives are possible) quantile or raw score. For RNAseq, I think for each variant, the model creates one score for each gene in the interval. The quantile score would go from -1 to 1, which is a measure of how large of an effect the variant has on the gene. (0 is no effect.)

Edit: So I think what I'll try to get to doing is:

1. Filter DecodeME SNPs to log10p greater than something like 7 (produces about 300 variants).
2. Do batch scoring on all these variants with the RNAseq predictor on brain tissue.
3. Find the largest magnitude quantile scores to identify which variant-gene pairs might be important.
 
Last edited:
Ok, well I did that.

I didn't initially filter to brain - I just had the model do all RNAseq predictions for all tissues, using all the variants with p-value less than 1x10^-7 (342 variants). That resulted in 3,800,412 scores, which are all the combinations of variants-genes-tissues. (It's looking at genes within 1MB of the variants.)

I filtered to only protein-coding genes and only brain. That brought it down to 4,398 scores. I attached the excel file with those brain-protein coding gene scores. (The unfiltered file would be far too large.)

I don't think this will be as easy to interpret as I thought. There are way too many different variant-gene combinations with high predicted effect scores.

Just for the chr20 locus, after filtering for variant-gene pairs which have a predicted effect on a gene with a quantile score (absolute value) >= 0.9:
  • 191 variants were in the input data (passed the p-value threshold)
  • 154 variant-gene pairs passed the above score threshold, and consist of:
    • 36 unique variants
    • 9 unique genes
The 9 unique genes are:
PTGIS
DDX27
KCNB1
CSE1L
STAU1
ZNFX1
B4GALT5
ARFGEF2
PREX1

It seems to me that this is probably more useful when you only have one or a small handful of variants of interest. Too many variants have effects on genes in various ways, so if we just throw all of them at AlphaGenome, it'll give lots of uninteresting results.

For the example analysis, where they try to see whether the model predicts expected effects for known leukemia variants, they're working with a much smaller set of variants. I think these are variants where there is high confidence that they actually cause the disease.

Python:
from alphagenome.data import genome
from alphagenome.models import dna_client, variant_scorers
import pandas as pd
from tqdm import tqdm
from dotenv import load_dotenv
import os
import pandas as pd

load_dotenv()
ALPHAGENOME_KEY=os.getenv('ALPHAGENOME_KEY')
save_predictions = True

# Load file which only has variants above -log10p of 7 in tsv format.
df = pd.read_csv('gwas1_log10p_gt_7.tsv', sep='\t')
df.CHROM = 'chr' + df.CHROM.astype(str)

# Create alphagenome client
dna_model = dna_client.create(ALPHAGENOME_KEY)

# Set up parameters:
# Interval: 1MB
# Scorer: RNAseq
sequence_length = '1MB'  # @param ["16KB", "100KB", "500KB", "1MB"] { type:"string" }
sequence_length = dna_client.SUPPORTED_SEQUENCE_LENGTHS[
    f'SEQUENCE_LENGTH_{sequence_length}'
]

scorer_selections = {
    'rna_seq': True,
    'cage': False,
    'procap': False,
    'atac': False,
    'dnase': False,
    'chip_histone': False,
    'chip_tf': False,
    'polyadenylation': False,
    'splice_sites': False,
    'splice_site_usage': False,
    'splice_junctions': False,
}

all_scorers = variant_scorers.RECOMMENDED_VARIANT_SCORERS
selected_scorers = [
    all_scorers[key]
    for key in all_scorers
    if scorer_selections.get(key.lower(), False)
]

# Fetch scores
results = []
for i, row in tqdm(df.iterrows(), total=len(df)):
  variant = genome.Variant(
      chromosome=str(row.CHROM),
      position=int(row.GENPOS),
      reference_bases=row.ALLELE0,
      alternate_bases=row.ALLELE1,
      name=row.ID,
  )
  interval = variant.reference_interval.resize(sequence_length)

  variant_scores = dna_model.score_variant(
      interval=interval,
      variant=variant,
      variant_scorers=selected_scorers,
      organism=dna_client.Organism.HOMO_SAPIENS,
  )
  results.append(variant_scores)
 
df_scores = variant_scorers.tidy_scores(results)

# Filter to protein-coding genes in the brain and save results
ontologies = ['UBERON:0000955'] # Brain
gene_types = ['protein_coding']
df_scores = df_scores[df_scores['ontology_curie'].isin(ontologies) & df_scores['gene_type'].isin(gene_types)]

df_scores['abs_quantile_score'] = abs(df_scores['quantile_score'])

df_scores = df_scores.sort_values('abs_quantile_score', ascending=False)

if save_predictions:
  df_scores.to_excel('variant_scores.xlsx', index=False)
 

Attachments

Last edited:
Out of curiosity, I looked only at the most significant variant in DecodeME only (20:48914387:T:TA). The model predicts that it has effects on these genes. Positive scores mean the ME/CFS risk allele increases expression.

Interestingly, it seems to say it has large effects on almost all genes it returned a score for.
VariantGeneQuantile score
chr20:48914387:T>TACSE1L0.9995
chr20:48914387:T>TAKCNB10.9989
chr20:48914387:T>TADDX270.9985
chr20:48914387:T>TAZNFX1-0.9984
chr20:48914387:T>TAARFGEF2-0.9982
chr20:48914387:T>TASTAU1-0.9979
chr20:48914387:T>TAPREX10.5843

If I pick a variant pretty close by to that one, it gives similarly high scores. I notice the sign of ZNFX1 and STAU1 is switched, which is surprising.
VariantGeneQuantile Score
chr20:48935095:C>CTCTTTTTTDDX270.9997
chr20:48935095:C>CTCTTTTTTZNFX10.9987
chr20:48935095:C>CTCTTTTTTCSE1L0.9970
chr20:48935095:C>CTCTTTTTTSTAU10.9967
chr20:48935095:C>CTCTTTTTTARFGEF2-0.9966
chr20:48935095:C>CTCTTTTTTKCNB10.9612
chr20:48935095:C>CTCTTTTTTPREX10.5690

If I go even further away on the chr20 locus and pick a variant, the scores look more like what I was expecting, where they're not all extremely high. (In this case, the T allele is the risk allele, so a positive score means decreased expression with the ME/CFS allele.)
VariantGeneQuantile Score
chr20:49199407:T>CKCNB1-0.7772
chr20:49199407:T>CZNFX1-0.6808
chr20:49199407:T>CPTGIS-0.569
chr20:49199407:T>CB4GALT5-0.5203
chr20:49199407:T>CDDX27-0.4859
chr20:49199407:T>CPREX1-0.4123
chr20:49199407:T>CARFGEF20.3329
chr20:49199407:T>CCSE1L0.2807
chr20:49199407:T>CSTAU1-0.1825

So I'm not really sure what that means. It seems odd that a variant would strongly affect so many genes.

Edit: Oh, I think it's because the first two variants are insertions. I looked at a few other random variants, both those that are insertions/deletions, and those that are just a simple base pair swap.

All the insertion/deletion variants give high scores for many genes, while the single nucleotide substitutions give smaller scores, like the third table. So it might be that inserting or deleting base pairs is likely to affect many genes.
 
Last edited:
I don't think this will be as easy to interpret as I thought. There are way too many different variant-gene combinations with high predicted effect scores.
Yeah, that was what I was starting to think with my comments yesterday and it seemed implied by comments on the roundtable discussion. Although I’ve got a lot more to read/listen to and hadn’t got things to work to be sure. Nice work there, thanks for sharing your code snippets and findings (and @ME/CFS Science Blog ).

There may still be some fun we can have with the tool as it is, but if better support for GWAS is on their backlog it may be easiest to wait until they have that working rather than us trying to layer things on top? Understanding what it can do so better understanding what questions we can ask with it may still be useful…
 
Last edited:
So I'm not really sure what that means. It seems odd that a variant would strongly affect so many genes.
Nice work! When I looked at that locus a while ago I remember that the region of SNPs with the highest significance overlapped an area with a lot of different transcription factor binding motifs.

It seems to be an important regulatory region, so theoretically these findings are biologically plausible—some variants increase expression and others decrease despite being a few base pairs apart, with a whole suite of nearby genes affected. But we’re also dealing with a new tool using a method that’s known to take too many liberties so that may be giving the output too much credit.
 
@forestglip if you share the variant locations from your examples above I can check the overlap with TF binding sites when I have a chance to verify if that’s what causes the sign change
 
Back
Top Bottom