Advancing regulatory variant effect prediction with AlphaGenome 2026 Avsec et al

Andy · Jan 28, 2026

Abstract
Deep learning models that predict functional genomic measurements from DNA sequences are powerful tools for deciphering the genetic regulatory code. Existing methods involve a trade-off between input sequence length and prediction resolution, thereby limiting their modality scope and performance [1–5]. We present AlphaGenome, a unified DNA sequence model, which takes as input 1 Mb of DNA sequence and predicts thousands of functional genomic tracks up to single-base-pair resolution across diverse modalities. The modalities include gene expression, transcription initiation, chromatin accessibility, histone modifications, transcription factor binding, chromatin contact maps, splice site usage and splice junction coordinates and strength. Trained on human and mouse genomes, AlphaGenome matches or exceeds the strongest available external models in 25 of 26 evaluations of variant effect prediction. The ability of AlphaGenome to simultaneously score variant effects across all modalities accurately recapitulates the mechanisms of clinically relevant variants near the TAL1 oncogene [6]. To facilitate broader use, we provide tools for making genome track and variant effect predictions from sequence.

Open access

Andy · Jan 28, 2026

AI model from Google's DeepMind reads recipe for life in DNA

An AI model developed by Google's DeepMind could transform our understanding of DNA - the complete recipe for building and running the human body - and its impact on disease and medicine discovery, according to researchers.

Called AlphaGenome, the model could help scientists discover why subtle differences in our DNA put us at risk of conditions such as high blood pressure, dementia and obesity.

It could also dramatically accelerate our understanding of genetic diseases and cancer.

The developers of the model acknowledge it's not perfect, but experts have described it as "an incredible feat" and "a major milestone".

Full article

ME/CFS Science Blog · Feb 3, 2026

This looks interesting. Perhaps it could help to interpret the findings of DecodeME and clear out some ambiguity about which genes are involved?

From what I can understand it looks like an incremental improvement over previous models rather than a big breakthrough. Part of this is likely because the gene expression data that we have is still too limited (as discussed on the DecodeME results thread).

@forestglip @ChronicallyOverIt @hotblack @jnmaciuch

hotblack · Feb 3, 2026

Will definitely be interesting to see how this is used. DeepMind and Isomorphic are doing really interesting things I think, and it's great that this has now been released after being talked about last year. There’s an API available too, with keys available for free for non-commercial use.

AlphaGenome

AlphaGenome – Access Google DeepMind’s unifying genomics model for deciphering DNA function.

deepmind.google.com

hotblack · Feb 3, 2026

The API uses hg38, which is what the DecodeME summary stats use, so we don’t even need to do a liftover.
There’s a video tutorial here, lots more reading to do but definitely looks like something someone should give a go!

And a roundtable from the team

hotblack · Feb 3, 2026

I fed the API documentation and details of the structure of the DecodeME summary statistics files to Gemini to get an idea of what may be needed. The outline follows for those interested; treat with caution, etc. The first steps should be familiar from our experiments last year.

To move from your raw REGENIE summary statistics to functional insights in AlphaGenome, follow these steps:

Step A: Data Preparation and Quality Control (QC)

Before analysis, you must ensure the data is "clean."

Filter by QC List: Use the provided gwas_qced.var file to keep only the variants that passed the study's quality thresholds.
Filter by Significance: GWAS files contain millions of variants. You typically focus on "top hits": variants with a high LOG10P (usually >7.3, which corresponds to a p-value below 5×10⁻⁸) or those in regions of interest.
Validate Build: Ensure you are using GRCh38 coordinates, as AlphaGenome requires these to match its internal reference map.
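
As a rough illustration of Step A, here is a minimal sketch of the two filters in plain Python. It assumes the .regenie file is space-delimited with an ID column alongside the CHROM, GENPOS, ALLELE0, ALLELE1 and LOG10P columns mentioned in this outline, and that gwas_qced.var boils down to a list of variant IDs; neither assumption has been checked against the DecodeME files, and the sample rows are made up.

```python
import csv
import io
import math

# Genome-wide significance: p < 5e-8 is the same as LOG10P > ~7.30.
GWS_LOG10P = -math.log10(5e-8)


def filter_hits(regenie_rows, qc_ids, min_log10p=GWS_LOG10P):
    """Keep rows whose ID passed QC and whose LOG10P clears the threshold."""
    return [
        row
        for row in regenie_rows
        if row["ID"] in qc_ids and float(row["LOG10P"]) > min_log10p
    ]


# Tiny made-up stand-in for a space-delimited .regenie file.
sample = """CHROM GENPOS ID ALLELE0 ALLELE1 LOG10P
1 1000 var_a A G 9.1
1 2000 var_b C T 3.2
2 3000 var_c G GA 8.4
"""
rows = list(csv.DictReader(io.StringIO(sample), delimiter=" "))
qc_ids = {"var_a", "var_b"}  # stand-in for the IDs listed in gwas_qced.var

hits = filter_hits(rows, qc_ids)
print([row["ID"] for row in hits])
```

For the real files you would read the gzipped table with `gzip.open(path, "rt")` instead of the in-memory string, and build `qc_ids` from whatever format gwas_qced.var actually uses.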

Step B: Extract Variant Information

AlphaGenome needs specific inputs for each variant. From your .regenie.gz file, you will need to extract:

CHROM: The chromosome.
GENPOS: The exact position (in base pairs).
ALLELE0 (Ref): The "reference" or non-effect version of the DNA.
ALLELE1 (Alt): The "alternate" or effect version found in the GWAS.
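
A small sketch of Step B follows: turning one summary-statistics row into the four fields a variant object needs. The field names mirror the keyword arguments used in the Python snippet posted further down this thread; the "chr" prefixing and the example row are assumptions for illustration, and it is worth checking separately that ALLELE0 really matches the reference genome base at that position.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantFields:
    """The four values needed to describe one variant."""

    chromosome: str
    position: int  # 1-based, as in the GENPOS column
    reference_bases: str
    alternate_bases: str


def row_to_variant_fields(row):
    """Convert one REGENIE-style row (a dict of strings) to VariantFields."""
    chrom = str(row["CHROM"])
    # Assumption: the summary stats use bare chromosome numbers ("20"),
    # while hg38-style names carry a "chr" prefix ("chr20").
    if not chrom.startswith("chr"):
        chrom = "chr" + chrom
    return VariantFields(
        chromosome=chrom,
        position=int(row["GENPOS"]),
        reference_bases=row["ALLELE0"],
        alternate_bases=row["ALLELE1"],
    )


# Illustrative row only; not copied from the DecodeME files.
row = {"CHROM": "20", "GENPOS": "48914387", "ALLELE0": "T", "ALLELE1": "TA"}
fields = row_to_variant_fields(row)
print(fields)
```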

Step C: Implementation with AlphaGenome

Using the AlphaGenome Python API, you would perform Variant Effect Prediction (VEP):

Define the Interval: Create a genome.Interval object centered on your variant’s position (up to 1 Mb wide).
Define the Variant: Create a genome.Variant object using the CHROM, GENPOS, ALLELE0 (Ref), and ALLELE1 (Alt) values from your GWAS file.
Run score_variant: Use the DnaClient.score_variant method. This function compares the model's predictions for the "Reference" DNA versus the "Alternate" DNA.
Analyze Outputs: Look at the multi-modal tracks. AlphaGenome will tell you if that specific DecodeME variant is predicted to change:
- RNA-Seq/CAGE-seq: Gene activity levels.
- DNase/ATAC-seq: How accessible that part of the DNA is to the cell's machinery.
- Splice Junctions: How the gene's instructions are "cut and pasted" together.
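
To make the "Define the Interval" step concrete, here is a sketch of the arithmetic behind a 1 Mb window centred on a variant. It deliberately avoids the AlphaGenome library: it assumes intervals are 0-based and half-open while variant positions are 1-based, and that the 1 Mb input length is 2**20 bases. Both are assumptions to verify against the API docs, and the library's own resize helper may round differently.

```python
SEQUENCE_LENGTH_1MB = 2**20  # 1,048,576 bp (assumed model input length)


def centered_window(position_1based, width=SEQUENCE_LENGTH_1MB):
    """Return a 0-based, half-open (start, end) window around a variant.

    No clamping is done here, so variants near a chromosome start
    could produce a negative start and would need special handling.
    """
    center = position_1based - 1  # 0-based index of the variant's first base
    start = center - width // 2
    end = start + width
    return start, end


start, end = centered_window(48914387)
print(start, end, end - start)
```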

ME/CFS Science Blog · Feb 3, 2026

hotblack said:
DeepMind and Isomorphic are doing really interesting things

Just saw their documentary about AlphaFold which figured out how human proteins are folded in 3D and earned them a Nobel prize in 2024. Impressive work, can only hope that AlphaGenome will have the same impact.

The documentary is called 'The Thinking Game' and worth a watch:

jnmaciuch · Feb 3, 2026

ME/CFS Science Blog said:
This looks interesting. Perhaps it could help to interpret the findings of DecodeME and clear out some ambiguity about which genes are involved?

From what I can understand it looks like an incremental improvement over previous models rather than a big breakthrough. Part of this is likely because the gene expression data that we have is still too limited (as discussed on the DecodeME results thread).

@forestglip @ChronicallyOverIt @hotblack @jnmaciuch

Yeah quantity, quality, and context-specificity of training data is really going to make or break a tool like this. I tend to reserve judgement until after several teams have used it and validated experimentally since the methods paper introducing the tool is always going to highlight the few instances where it works well.

It’s hard for me to get excited after seeing how tripped up AI models like this can get in my own thesis work, but would be happy if future testing shows it’s well worth the compute power

hotblack · Feb 3, 2026

From what I can tell
- Focus is on scoring and visualising a single variant, you can do batches of variants though
- Apparently it currently predicts many many scores per variant, so there’s a lot of data to go through and interpret
- It looks like the team are hoping to support large scale analysis (for things like GWAS) in the future, they say in the roundtable it’s currently possible but maybe not ideally suited
- Some other people have used the tool on GWAS data, see https://github.com/Mirror-fish/AlphaGenome-Variant-Expression-Scanner
- There’s a lot of flexibility in the tool and things to understand!

I’m all for getting stuck in and will try to blag it as far as I can. I appreciate what @jnmaciuch said about wanting to be sure of the value before investing time/resource. That makes sense from a researcher's perspective. From a patient perspective I suppose I’m a bit more cavalier and of the "why wait, given all the other barriers we face" mindset (I’m also impressed at how @jnmaciuch manages to balance those competing interests).

It would be great to see what someone with the required experience could make of the DecodeME data though. And even better SequenceME….

Kitty · Feb 3, 2026

I don't understand most of what's written on these threads, but do find it interesting.

Could you train a model like this on an entirely different dataset, where there's known to be a genetic component but it isn't at all straightforward—autism, lupus, psoriasis, etc?

Or do you have to sacrifice a portion of your main data resource to train it, which then can't be used to generate results?

(Sorry if they're dumb questions!)

ChronicallyOverIt · Feb 3, 2026

Interesting stuff. Unfortunately I’m away for a while (which is why I haven’t gotten back to the DecodeME FLAMES work). I’ll try to check this out as well. I should have free time at the end of Feb; life and being sick do not leave much time for other activities.

Just wanted to say, I personally like the AlphaGo doc more; the human perspective is intense. Being the first Go players to be beaten by a computer took a psychological toll:

ME/CFS Science Blog · Feb 3, 2026

I've tried the SNP with the lowest p-value from DecodeME and tested RNA expression in only one tissue, the brain (UBERON:0000955). But it didn't show an effect there if I understand the plot correctly (no difference between the red and grey lines).

Suspect that we would have to test all SNPs in all tissues inside a loop to get a better result.

Python:

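import matplotlib.pyplot as plt
import pandas as pd
from alphagenome.data import gene_annotation
from alphagenome.data import genome
from alphagenome.data import transcript as transcript_utils
from alphagenome.models import dna_client
from alphagenome.visualization import plot_components
from google.colab import userdata

# 1. Initialize with your API key.
# The Colab secret name below is an assumption; use the name you stored your key under.
API_KEY = userdata.get('ALPHA_GENOME_API_KEY')
client = dna_client.create(api_key=API_KEY)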

# 2. Define the variant (1-based coordinate)
variant = genome.Variant(
    chromosome='chr20',
    position=48914387,
    reference_bases='T',
    alternate_bases='TA'
)

# 3. Create interval
interval = variant.reference_interval.resize(dna_client.SEQUENCE_LENGTH_1MB)

# 4. predict for RNA sequencing in brain
variant_output = client.predict_variant(
    interval=interval,
    variant=variant,
    requested_outputs=[dna_client.OutputType.RNA_SEQ],
    ontology_terms=['UBERON:0000955'], # The brain
)

# The GTF file contains information on the location of all transcripts.
gtf = pd.read_feather(
    'https://storage.googleapis.com/alphagenome/reference/gencode/'
    'hg38/gencode.v46.annotation.gtf.gz.feather'
)

# Set up transcript extractors using the information in the GTF file.
# MANE Select transcripts consist of one curated transcript per locus.
gtf_transcripts = gene_annotation.filter_protein_coding(gtf)
gtf_transcripts = gene_annotation.filter_to_mane_select_transcript(gtf_transcripts)
transcript_extractor = transcript_utils.TranscriptExtractor(gtf_transcripts)

# 5. plot the results
transcripts = transcript_extractor.extract(interval)

plot_components.plot(
    [
        plot_components.TranscriptAnnotation(transcripts),
        plot_components.OverlaidTracks(
            tdata={
                'REF': variant_output.reference.rna_seq,
                'ALT': variant_output.alternate.rna_seq,
            },
            colors={'REF': 'dimgrey', 'ALT': 'red'},
        ),
    ],
    interval=variant_output.reference.rna_seq.interval.resize(2**15),
    # Annotate the location of the variant as a vertical line.
    annotations=[plot_components.VariantAnnotation([variant], alpha=0.8)],
)
plt.show()
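
One way to structure the "all SNPs in all tissues inside a loop" idea is sketched below. The `score_fn` argument is a hypothetical wrapper you would write around whatever AlphaGenome call you settle on (it is not part of the library), and the placeholder scorer here just returns made-up numbers so the looping and ranking logic can be run on its own.

```python
import itertools


def scan(variants, tissues, output_types, score_fn):
    """Score every variant/tissue/output combination and rank by |score|."""
    results = []
    for variant, tissue, output_type in itertools.product(
        variants, tissues, output_types
    ):
        results.append(
            {
                "variant": variant,
                "tissue": tissue,
                "output_type": output_type,
                "score": score_fn(variant, tissue, output_type),
            }
        )
    # Largest absolute effects first, whatever their direction.
    return sorted(results, key=lambda r: abs(r["score"]), reverse=True)


def placeholder_score(variant, tissue, output_type):
    """Stand-in scorer with made-up values; replace with a real model call."""
    made_up = {("variant_2", "UBERON:0000955", "RNA_SEQ"): -0.8}
    return made_up.get((variant, tissue, output_type), 0.1)


ranked = scan(
    variants=["variant_1", "variant_2"],
    tissues=["UBERON:0000955", "OTHER_TISSUE"],  # second entry is a placeholder
    output_types=["RNA_SEQ"],
    score_fn=placeholder_score,
)
for row in ranked:
    print(row)
```

Keeping the model call behind a single function makes it easy to add caching or rate limiting later, since the number of combinations grows quickly.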

forestglip · Feb 3, 2026

ME/CFS Science Blog said:
Suspect that we would have to test all SNPs in all tissues inside a loop to get a better result.

I haven't tried this yet. Been a bit confused about the docs so far.

But maybe it's better to test all the significant variants in a locus [edit: at the same time] instead of one? I'd be interested in each locus's variants' predicted effect on the brain.

jnmaciuch · Feb 3, 2026

hotblack said:
I’m all for getting stuck in and will try to blag it as far as I can. I appreciate what @jnmaciuch said about wanting to be sure of the value before investing time/resource. That makes sense from a researcher's perspective. From a patient perspective I suppose I’m a bit more cavalier and of the "why wait, given all the other barriers we face" mindset (I’m also impressed at how @jnmaciuch manages to balance those competing interests).

Not sure I have managed the right balance, but thank you

To be clear, my conservative approach here is more about whether to trust the results over other methods, and whether it’s worth incorporating the tool into genomics pipelines for publishable projects. For just trying it on DecodeME data and taking the output with an appropriate grain of salt, I don’t see the harm if someone has the time and energy.

hotblack · Feb 3, 2026

ME/CFS Science Blog said:
Suspect that we would have to test all SNPs in all tissues inside a loop to get a better result

That’s my understanding. All the groups of variants and all the different scorers (not just RNA_SEQ) and different tissues… Then visualise and interpret it and… there’s a lot of work I think!

forestglip · Feb 3, 2026

ME/CFS Science Blog said:
But it didn't show an effect there if I understand the plot correctly (no difference between the red and grey lines).

Maybe the plot should be zoomed out more? It's only showing one gene at the moment, but the variant could be affecting something else

Edit: I think just a bigger number here:

Code:

interval=variant_output.reference.rna_seq.interval.resize(2**15),

ME/CFS Science Blog · Feb 3, 2026

forestglip said:
Maybe the plot should be zoomed out more? It's only showing one gene at the moment, but the variant could be affecting something else

Edit: I think just a bigger number here:

Yes thanks. Zooming out to 800kb, I got this but not sure how to interpret it.

My first impression is that AlphaGenome seems to focus on individual SNPs and their effects, while I would think that for many GWAS of diseases it's the pattern of SNPs across a region that tells the story.

forestglip · Feb 3, 2026

Yeah, I'll need to try to understand this more. It looks interesting, but I'm not exactly sure what it's doing.

I think there might be two modes? 1. Check the difference in predicted effect between a single ref and alt allele. 2. Just give a long sequence of DNA and see what the prediction is.

It feels like there should be a way to look for the difference in predicted effect between a sequence with and without all of the several dozen DecodeME effect SNPs in a locus, but I will need to read more when I get some time/energy.
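
If the "long sequence" mode accepts arbitrary DNA strings, one way to approach this would be to build the alternate sequence yourself and predict on both versions. The sketch below only does the string editing, for single-base substitutions; the function name and coordinate conventions (0-based window start, 1-based variant positions) are assumptions for illustration, and insertions or deletions would shift coordinates and need extra handling.

```python
def apply_snvs(sequence, start, snvs):
    """Apply single-base substitutions to a reference sequence.

    sequence: reference bases covering 0-based positions [start, start + len).
    snvs: iterable of (position_1based, ref_base, alt_base).
    """
    bases = list(sequence)
    for position, ref, alt in snvs:
        if len(ref) != 1 or len(alt) != 1:
            raise ValueError(f"only single-base substitutions handled: {ref}>{alt}")
        index = position - 1 - start
        if not 0 <= index < len(bases):
            raise ValueError(f"position {position} is outside the sequence")
        if bases[index] != ref:
            raise ValueError(
                f"reference mismatch at {position}: expected {ref}, found {bases[index]}"
            )
        bases[index] = alt
    return "".join(bases)


# Toy example: a 10-base "reference" starting at 0-based position 100.
reference = "ACGTACGTAC"
alternate = apply_snvs(reference, 100, [(102, "C", "T"), (108, "T", "G")])
print(reference)
print(alternate)
```

The reference-mismatch check is there because a silent mismatch usually means the wrong genome build or swapped alleles.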

jnmaciuch · Feb 3, 2026

ME/CFS Science Blog said:
Yes thanks. Zooming out to 800kb, I got this but not sure how to interpret it.

Judging from some of the example plots posted on the AlphaGenome forum, it looks like the expression differences from the variant are just small here. I see some very small instances where the grey line slightly peeks out from behind the red, so you might be able to see some slight variation if you zoom in a lot.

Maybe just to confirm you could try running a positive control? Like one of the SNPs from the paper just to see if your code reproduces it https://www.nature.com/articles/s41586-025-10014-0/figures/3

Or if the data points from variant_output.reference.rna_seq and variant_output.alternate.rna_seq can be pulled out directly, just subtracting the values and finding points with the greatest difference between them to center the plot around
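
A minimal sketch of that subtraction idea, assuming the REF and ALT predictions can be pulled out as two equal-length per-base lists of numbers for a single track (how to extract them from the output objects is not confirmed in this thread). The toy values are made up.

```python
def largest_difference(ref_values, alt_values, interval_start):
    """Find where ALT differs most from REF.

    Returns (0-based genomic position, ALT minus REF at that position).
    """
    if len(ref_values) != len(alt_values):
        raise ValueError("REF and ALT tracks must have the same length")
    diffs = [alt - ref for ref, alt in zip(ref_values, alt_values)]
    index = max(range(len(diffs)), key=lambda i: abs(diffs[i]))
    return interval_start + index, diffs[index]


# Made-up per-base values for a 5-base window starting at position 1000.
ref_track = [0.0, 0.2, 1.5, 1.4, 0.1]
alt_track = [0.0, 0.2, 1.1, 1.4, 0.1]
position, delta = largest_difference(ref_track, alt_track, interval_start=1000)
print(position, round(delta, 3))
```

The returned position could then be used as the centre of the plotting interval instead of the variant itself.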

forestglip · Feb 3, 2026

ME/CFS Science Blog said:
while I would think that for many GWAS of diseases it's the pattern of SNPs across a region that tells the story.

An LD pattern tells a story in terms of helping identify where the causal SNP is, or helping determine if a phenotype might contain the same causal SNP as another phenotype, but I don't think that's what AlphaGenome is doing. If we knew the specific causal variant in ME/CFS, and gave it that in your code, I think that's all it would need to make a prediction of what that variant does. Other SNPs in LD, as long as they aren't actually interacting with the SNP, are just noise and shouldn't improve the prediction.

Given we don't know which specific variant is causal, maybe we can give it all the potentially causal SNPs in the locus separately to test each one.
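
As a sketch of how that candidate list might be assembled, the snippet below picks every variant on the same chromosome within a window around the lead SNP that clears a significance cutoff. The 500 kb flank and the LOG10P cutoff are arbitrary assumptions, the column names follow the REGENIE fields mentioned earlier in the thread, and the rows are made up.

```python
def locus_candidates(rows, lead_chrom, lead_pos, flank=500_000, min_log10p=7.3):
    """Return rows near the lead SNP that pass the significance cutoff."""
    return [
        row
        for row in rows
        if row["CHROM"] == lead_chrom
        and abs(int(row["GENPOS"]) - lead_pos) <= flank
        and float(row["LOG10P"]) >= min_log10p
    ]


# Made-up rows for illustration only.
rows = [
    {"CHROM": "20", "GENPOS": "48914387", "ID": "lead", "LOG10P": "11.0"},
    {"CHROM": "20", "GENPOS": "48950000", "ID": "near_sig", "LOG10P": "8.0"},
    {"CHROM": "20", "GENPOS": "48960000", "ID": "near_weak", "LOG10P": "2.0"},
    {"CHROM": "20", "GENPOS": "52000000", "ID": "far_sig", "LOG10P": "9.0"},
    {"CHROM": "1", "GENPOS": "48914400", "ID": "other_chrom", "LOG10P": "9.0"},
]

candidates = locus_candidates(rows, lead_chrom="20", lead_pos=48914387)
print([row["ID"] for row in candidates])
```

Each candidate would then be scored on its own, so that a variant with no predicted effect does not mask one that has.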

It might be that the top SNP, the one you tested, doesn't actually do anything, so that's why there might not be much effect. (Though I'm not positive I'm interpreting your plot correctly.)

Good idea from jnmaciuch to validate with some known variants. Like maybe a variant known to increase mRNA expression. (Specifically where the causal variant is known, not just a wide locus of many SNPs.)
