Advancing regulatory variant effect prediction with AlphaGenome 2026 Avsec et al

Andy

Senior Member (Voting rights)
Abstract
Deep learning models that predict functional genomic measurements from DNA sequences are powerful tools for deciphering the genetic regulatory code. Existing methods involve a trade-off between input sequence length and prediction resolution, thereby limiting their modality scope and performance [1-5]. We present AlphaGenome, a unified DNA sequence model, which takes as input 1 Mb of DNA sequence and predicts thousands of functional genomic tracks up to single-base-pair resolution across diverse modalities. The modalities include gene expression, transcription initiation, chromatin accessibility, histone modifications, transcription factor binding, chromatin contact maps, splice site usage and splice junction coordinates and strength. Trained on human and mouse genomes, AlphaGenome matches or exceeds the strongest available external models in 25 of 26 evaluations of variant effect prediction. The ability of AlphaGenome to simultaneously score variant effects across all modalities accurately recapitulates the mechanisms of clinically relevant variants near the TAL1 oncogene [6]. To facilitate broader use, we provide tools for making genome track and variant effect predictions from sequence.

Open access
 
AI model from Google's DeepMind reads recipe for life in DNA

An AI model developed by Google's DeepMind could transform our understanding of DNA - the complete recipe for building and running the human body - and its impact on disease and medicine discovery, according to researchers.

Called AlphaGenome, the model could help scientists discover why subtle differences in our DNA put us at risk of conditions such as high blood pressure, dementia and obesity.

It could also dramatically accelerate our understanding of genetic diseases and cancer.

The developers of the model acknowledge it's not perfect, but experts have described it as "an incredible feat" and "a major milestone".

Full article
 
This looks interesting. Perhaps it could help to interpret the findings of DecodeME and clear out some ambiguity about which genes are involved?

From what I can understand it looks like an incremental improvement over previous models rather than a big breakthrough. Part of this is likely because the gene expression data that we have is still too limited (as discussed on the DecodeME results thread).

@forestglip @ChronicallyOverIt @hotblack @jnmaciuch
 
Will definitely be interesting to see how this is used. DeepMind and Isomorphic are doing really interesting things I think, and it's great that this has now been released after being talked about last year. There’s an API too, with keys available for free for non-commercial use.
 
The API uses hg38 which is what DecodeME summary stats use so we don’t even need to do liftover
There’s a video tutorial here, lots more reading to do but definitely looks like something someone should give a go!


And a roundtable from the team
 
I fed the API documentation and details of the structure of the DecodeME summary statistics files to Gemini to get an idea of what may be needed, outline follows for those interested, treat with caution, etc etc. The first steps should be familiar from our experiments last year.
To move from the raw REGENIE summary statistics you have to functional insights in AlphaGenome, follow these steps:

Step A: Data Preparation and Quality Control (QC)

Before analysis, you must ensure the data is "clean."

  1. Filter by QC List: Use the provided gwas_qced.var file to keep only the variants that passed the study's quality thresholds.
  2. Filter by Significance: GWAS files contain millions of variants. You typically focus on "top hits", meaning variants with a high LOG10P (usually >7.3, which corresponds to a p-value below 5×10⁻⁸), or on variants in regions of interest.
  3. Validate Build: Ensure you are using GRCh38 coordinates, as AlphaGenome requires these to match its internal reference map.
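
As a rough illustration of the first two filters, here is a minimal pandas sketch. It assumes the summary statistics are already in a DataFrame with `ID` and `LOG10P` columns and that the QC list can be read as a plain list of variant IDs; both are assumptions about the file layout, and `filter_top_hits` is a made-up helper name, not part of any library.

```python
import math

import pandas as pd

# Genome-wide significance on the -log10(p) scale, roughly 7.30.
GENOME_WIDE_LOG10P = -math.log10(5e-8)


def filter_top_hits(sumstats, qc_ids, min_log10p=GENOME_WIDE_LOG10P):
    """Keep variants that are on the QC list and reach the threshold."""
    passed_qc = sumstats["ID"].isin(set(qc_ids))
    significant = sumstats["LOG10P"] >= min_log10p
    return sumstats[passed_qc & significant].reset_index(drop=True)


# Tiny made-up table standing in for the real summary statistics.
demo = pd.DataFrame({
    "ID": ["v1", "v2", "v3"],
    "LOG10P": [8.1, 9.0, 3.2],
})
top_hits = filter_top_hits(demo, qc_ids=["v1", "v3"])
print(top_hits)
```

Here `v2` is dropped because it is not on the QC list and `v3` because it is not significant. How the real `.regenie.gz` and `gwas_qced.var` files are best loaded depends on their actual delimiters and columns, which would need checking first.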

Step B: Extract Variant Information

AlphaGenome needs specific inputs for each variant. From your .regenie.gz file, you will need to extract:

  • CHROM: The chromosome.
  • GENPOS: The exact position (in base pairs).
  • ALLELE0 (Ref): The "reference" or non-effect version of the DNA.
  • ALLELE1 (Alt): The "alternate" or effect version found in the GWAS.
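
A small sketch of that extraction, assuming the four columns are named exactly as listed above and that CHROM is stored without a `chr` prefix (an assumption worth checking against the real file). The output keys mirror the keyword arguments used for `genome.Variant` in the code posted later in this thread; `to_variant_fields` is a hypothetical helper name.

```python
import pandas as pd


def to_variant_fields(row):
    """Map one summary-statistics row to the four fields a variant needs."""
    return {
        # Assumes CHROM holds a bare number such as 20, not 'chr20'.
        "chromosome": f"chr{row['CHROM']}",
        # Position in base pairs, treated here as 1-based.
        "position": int(row["GENPOS"]),
        "reference_bases": str(row["ALLELE0"]),
        "alternate_bases": str(row["ALLELE1"]),
    }


# One made-up row using the variant tested later in this thread.
demo = pd.DataFrame({
    "CHROM": [20],
    "GENPOS": [48914387],
    "ALLELE0": ["T"],
    "ALLELE1": ["TA"],
})
fields = [to_variant_fields(row) for row in demo.to_dict("records")]
print(fields[0])
```

One caveat: it is probably worth confirming that ALLELE0 really matches the hg38 reference base at each position, since a GWAS "reference" allele is not guaranteed to be the genome reference.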

Step C: Implementation with AlphaGenome

Using the AlphaGenome Python API, you would perform Variant Effect Prediction (VEP):

  1. Define the Interval: Create a genome.Interval object centered on your variant’s position (up to 1 Mb wide).
  2. Define the Variant: Create a genome.Variant object using the CHROM, GENPOS, ALLELE0 (Ref), and ALLELE1 (Alt) data from your GWAS file.
  3. Run score_variant: Use the DnaClient.score_variant command. This function compares the model's predictions for the "Reference" DNA versus the "Alternate" DNA.
  4. Analyze Outputs: Look at the multi-modal tracks. AlphaGenome will tell you if that specific DecodeME variant is predicted to change:
    • RNA-Seq/CAGE-seq: Gene activity levels.
    • DNase/ATAC-seq: How accessible that part of the DNA is to the cell's machinery.
    • Splice Junctions: How the gene's instructions are "cut and pasted" together.
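
To make the "interval centered on the variant" step concrete, here is a plain-Python sketch of the arithmetic, with no AlphaGenome import. It assumes the 1 Mb window is 2**20 bp and that intervals are 0-based and half-open while variant positions are 1-based; `centered_window` is a hypothetical helper and may not match how the library itself resizes intervals.

```python
WINDOW_1MB = 2**20  # 1,048,576 bp; assumed size of the 1 Mb input window


def centered_window(position, width=WINDOW_1MB):
    """Return a 0-based, half-open (start, end) window around a 1-based position."""
    center = position - 1  # convert the 1-based position to a 0-based index
    # Clamp at the chromosome start; the chromosome end is not checked here.
    start = max(0, center - width // 2)
    return start, start + width


start, end = centered_window(48914387)
print(start, end, end - start)
```

In practice the library's own interval helpers are likely the better choice; this only shows what "centered" means in coordinates.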
 
Deepmind and Isomorphic are doing really interesting things
Just saw their documentary about AlphaFold which figured out how human proteins are folded in 3D and earned them a Nobel prize in 2024. Impressive work, can only hope that AlphaGenome will have the same impact.

The documentary is called 'The Thinking Game' and worth a watch:
 
This looks interesting. Perhaps it could help to interpret the findings of DecodeME and clear out some ambiguity about which genes are involved?

From what I can understand it looks like an incremental improvement over previous models rather than a big breakthrough. Part of this is likely because the gene expression data that we have is still too limited (as discussed on the DecodeME results thread).

@forestglip @ChronicallyOverIt @hotblack @jnmaciuch
Yeah quantity, quality, and context-specificity of training data is really going to make or break a tool like this. I tend to reserve judgement until after several teams have used it and validated experimentally since the methods paper introducing the tool is always going to highlight the few instances where it works well.

It’s hard for me to get excited after seeing how tripped up AI models like this can get in my own thesis work, but would be happy if future testing shows it’s well worth the compute power
 
From what I can tell
- Focus is on scoring and visualising a single variant, you can do batches of variants though
- Apparently it currently predicts many many scores per variant, so there’s a lot of data to go through and interpret
- It looks like the team are hoping to support large scale analysis (for things like GWAS) in the future, they say in the roundtable it’s currently possible but maybe not ideally suited
- Some other people have used the tool on GWAS data, see https://github.com/Mirror-fish/AlphaGenome-Variant-Expression-Scanner
- There’s a lot of flexibility in the tool and things to understand!

I’m all for getting stuck in and will try to blag it as far as I can. I appreciate what @jnmaciuch said about wanting to be sure of the value before investing time/resource. That makes sense from a researcher perspective. From a patient perspective I suppose I’m a bit more cavalier and of the "why wait, given all the other barriers we face" mindset (I’m also impressed at how @jnmaciuch manages to balance those competing interests).

It would be great to see what someone with the required experience could make of the DecodeME data though. And even better SequenceME….
 
I don't understand most of what's written on these threads, but do find it interesting.

Could you train a model like this on an entirely different dataset, where there's known to be a genetic component but it isn't at all straightforward—autism, lupus, psoriasis, etc?

Or do you have to sacrifice a portion of your main data resource to train it, which then can't be used to generate results?

(Sorry if they're dumb questions!)
 
Interesting stuff. Unfortunately I’m away for a while (which is why I haven’t gotten back to decode FLAMES). I’ll try to check this out as well. I should have free time at the end of Feb; life and being sick do not leave much time for other activities.

Just wanted to say, I personally like the AlphaGo doc more, the human perspective is intense. Being the first Go players to be beaten by a computer took a psychological toll:

 
I've tried inserting the SNP with the lowest p-value from DecodeME and tested RNA expression in only one tissue, the brain (UBERON:0000955). But it didn't show an effect there if I understand the plot correctly (no difference between the red and grey lines).

Suspect that we would have to test all SNPs in all tissues inside a loop to get a better result.
[Attached plot: 1770136661778.png]

Python:
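import matplotlib.pyplot as plt
import pandas as pd
from alphagenome.data import gene_annotation
from alphagenome.data import genome
from alphagenome.data import transcript as transcript_utils
from alphagenome.models import dna_client
from alphagenome.visualization import plot_components
from google.colab import userdata

# 1. Initialize with your API key.
# Assumes the key is stored as a Colab secret; the secret name is a placeholder.
API_KEY = userdata.get('ALPHA_GENOME_API_KEY')
client = dna_client.create(api_key=API_KEY)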

# 2. Define the variant (1-based coordinate)
variant = genome.Variant(
    chromosome='chr20',
    position=48914387,
    reference_bases='T',
    alternate_bases='TA'
)

# 3. Create interval
interval = variant.reference_interval.resize(dna_client.SEQUENCE_LENGTH_1MB)

# 4. predict for RNA sequencing in brain
variant_output = client.predict_variant(
    interval=interval,
    variant=variant,
    requested_outputs=[dna_client.OutputType.RNA_SEQ],
    ontology_terms=['UBERON:0000955'], # The brain
)

# The GTF file contains information on the location of all transcripts.
gtf = pd.read_feather(
    'https://storage.googleapis.com/alphagenome/reference/gencode/'
    'hg38/gencode.v46.annotation.gtf.gz.feather'
)

# Set up transcript extractors using the information in the GTF file.
# MANE Select transcripts consist of one curated transcript per locus.
gtf_transcripts = gene_annotation.filter_protein_coding(gtf)
gtf_transcripts = gene_annotation.filter_to_mane_select_transcript(gtf_transcripts)
transcript_extractor = transcript_utils.TranscriptExtractor(gtf_transcripts)

# 5. plot the results
transcripts = transcript_extractor.extract(interval)

plot_components.plot(
    [
        plot_components.TranscriptAnnotation(transcripts),
        plot_components.OverlaidTracks(
            tdata={
                'REF': variant_output.reference.rna_seq,
                'ALT': variant_output.alternate.rna_seq,
            },
            colors={'REF': 'dimgrey', 'ALT': 'red'},
        ),
    ],
    interval=variant_output.reference.rna_seq.interval.resize(2**15),
    # Annotate the location of the variant as a vertical line.
    annotations=[plot_components.VariantAnnotation([variant], alpha=0.8)],
)
plt.show()
 
Suspect that we would have to test all SNPs in all tissues inside a loop to get a better result.
I haven't tried this yet. Been a bit confused about the docs so far.

But maybe it's better to test all the significant variants in a locus [edit: at the same time] instead of one? I'd be interested in each locus's variants' predicted effect on the brain.
 
I’m all for getting stuck in and will try to blag it as far as I can. I appreciate what @jnmaciuch said about wanting to be sure of the value before investing time/resource. That makes sense from a researcher perspective. From a patient perspective I suppose I’m a bit more cavalier and of the "why wait, given all the other barriers we face" mindset (I’m also impressed at how @jnmaciuch manages to balance those competing interests).
Not sure I have managed the right balance, but thank you :) To be clear, my conservative approach here is more so towards trusting the results more than other methods and whether it’s worth incorporating the tool into genomics pipelines for publishable projects. For just testing the results on DecodeME data and taking the output with an appropriate grain of salt, I don’t see the harm if someone has the time and energy.
 
But it didn't show an effect there if I understand the plot correctly (no difference between the red and grey lines).
Maybe the plot should be zoomed out more? It's only showing one gene at the moment, but the variant could be affecting something else

Edit: I think just a bigger number here:
Code:
interval=variant_output.reference.rna_seq.interval.resize(2**15),
 
Maybe the plot should be zoomed out more? It's only showing one gene at the moment, but the variant could be affecting something else

Edit: I think just a bigger number here:
Yes thanks. Zooming out to 800kb, I got this but not sure how to interpret it.
[Attached plot: 1770147604490.png]
My first impression is that AlphaGenome seems to focus on individual SNPs and their effects, while I would think that for many GWAS of diseases it's the pattern of SNPs across a region that tells the story.
 
Yeah, I'll need to try to understand this more. It looks interesting, but I'm not exactly sure what it's doing.

I think there might be two modes? 1. Check the difference in predicted effect between a single ref and alt allele. 2. Just give a long sequence of DNA and see what the prediction is.

It feels like there should be a way to look for the difference in predicted effect between a sequence with and without all the several dozen decodeme effect SNPs in a locus, but will need to read more when I get some time/energy.
 
Yes thanks. Zooming out to 800kb, I got this but not sure how to interpret it.
Judging from some of the example plots posted on the AlphaGenome forum, it looks like the expression differences from the variant are just small here. I see some very small instances where the grey line slightly peeks out behind the red, so you might be able to see some slight variation if you zoom in a lot.

Maybe just to confirm you could try running a positive control? Like one of the SNPs from the paper just to see if your code reproduces it https://www.nature.com/articles/s41586-025-10014-0/figures/3

Or if the data points from variant_output.reference.rna_seq and variant_output.alternate.rna_seq can be pulled out directly, just subtracting the values and finding points with the greatest difference between them to center the plot around
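
A minimal NumPy sketch of that subtraction idea, assuming the REF and ALT predictions can be obtained as arrays shaped (positions, tracks). Whether the track objects expose such an array directly (for example through a `.values` attribute) is an assumption to check in the docs, and `largest_difference` is a hypothetical helper name.

```python
import numpy as np


def largest_difference(ref_values, alt_values):
    """Return (position index, track index, ALT - REF) at the largest |ALT - REF|."""
    diff = np.asarray(alt_values, dtype=float) - np.asarray(ref_values, dtype=float)
    if diff.ndim == 1:
        # Treat a 1-D input as a single track.
        diff = diff[:, np.newaxis]
    flat_index = np.abs(diff).argmax()
    position, track = np.unravel_index(flat_index, diff.shape)
    return int(position), int(track), float(diff[position, track])


# Made-up values standing in for predicted RNA-seq signal at four positions.
ref = [1.0, 2.0, 3.0, 4.0]
alt = [1.0, 2.5, 1.0, 4.0]
position, track, change = largest_difference(ref, alt)
print(position, track, change)
```

The returned position is an index into the array, so it would still need to be offset by the interval start to get a genomic coordinate for re-centering the plot.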
 
while I would think that for many GWAS of diseases it's the pattern of SNPs across a region that tells the story.
An LD pattern tells a story in terms of helping identify where the causal SNP is, or helping determine if a phenotype might contain the same causal SNP as another phenotype, but I don't think that's what AlphaGenome is doing. If we knew the specific causal variant in ME/CFS, and gave it that in your code, I think that's all it would need to make a prediction of what that variant does. Other SNPs in LD, as long as they aren't actually interacting with the SNP, are just noise and shouldn't improve the prediction.

Given we don't know which specific variant is causal, maybe we can give it all the potentially causal SNPs in the locus separately to test each one.
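
A sketch of what testing each candidate separately could look like. The actual API call is hidden behind `predict_fn`, a hypothetical wrapper that returns REF and ALT arrays for one variant, so the loop can be tried here with a fake stand-in. The largest absolute difference is only a crude effect size chosen for illustration, not the scoring AlphaGenome itself uses.

```python
import numpy as np


def rank_variants(variants, predict_fn):
    """Score each candidate on its own and sort by largest |ALT - REF|."""
    scored = []
    for variant in variants:
        ref, alt = predict_fn(variant)
        effect = float(np.abs(np.asarray(alt) - np.asarray(ref)).max())
        scored.append((variant, effect))
    return sorted(scored, key=lambda item: item[1], reverse=True)


def fake_predict(variant):
    """Stand-in for a real prediction call; the shifts are made-up numbers."""
    shifts = {"snpA": 0.1, "snpB": 0.7, "snpC": 0.0}
    ref = np.ones(5)
    alt = ref.copy()
    alt[2] += shifts[variant]
    return ref, alt


ranking = rank_variants(["snpA", "snpB", "snpC"], fake_predict)
for variant, effect in ranking:
    print(variant, round(effect, 3))
```

With a real wrapper in place of `fake_predict`, the same loop could be repeated per tissue, though the `score_variant` route mentioned earlier in the thread may be the more principled way to get comparable scores.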

It might be that the top SNP, the one you tested, doesn't actually do anything, so that's why there might not be much effect. (Though I'm not positive I'm interpreting your plot correctly.)

Good idea from jnmaciuch to validate with some known variants. Like maybe a variant known to increase mRNA expression. (Specifically where the causal variant is known, not just a wide locus of many SNPs.)
 