Advancing regulatory variant effect prediction with AlphaGenome 2026 Avsec et al

jnmaciuch · Feb 3, 2026

forestglip said:
maybe we can give it all the potentially causal SNPs in the locus separately to test each one.

If you can use a raw sequence of DNA as input it should be possible to just manually change all the base pairs to match the SNPs from DecodeME in a certain region and then run that string as your input. Might be too labor intensive, though
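A minimal sketch of that manual edit, assuming the raw sequence is held as a plain string and that the genomic coordinate of its first base is known. The function name `apply_variant` and the toy sequence and coordinates are hypothetical, not part of AlphaGenome.

```python
def apply_variant(sequence, seq_start, position, ref, alt):
    """Return `sequence` with `ref` replaced by `alt` at a 1-based genomic position.

    `seq_start` is the 1-based genomic coordinate of sequence[0]
    (an assumed convention for this sketch).
    """
    offset = position - seq_start
    if offset < 0 or offset + len(ref) > len(sequence):
        raise ValueError("variant falls outside the sequence")
    observed = sequence[offset:offset + len(ref)]
    # Refuse to edit if the sequence does not carry the expected reference allele.
    if observed.upper() != ref.upper():
        raise ValueError(
            f"reference mismatch at {position}: expected {ref}, found {observed}"
        )
    return sequence[:offset] + alt + sequence[offset + len(ref):]


# Toy 12-bp sequence starting at a made-up coordinate of 100.
reference = "ACGTTGCAACGT"
snv = apply_variant(reference, 100, 103, "T", "C")
insertion = apply_variant(reference, 100, 103, "T", "TA")
deletion = apply_variant(reference, 100, 104, "TGC", "T")
```

When several variants go into the same string, insertions and deletions shift every downstream coordinate, so applying them from the highest position to the lowest keeps the remaining offsets valid.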

forestglip · Feb 3, 2026

The batch variant scoring tool looks like it'll be useful to test many variants at once: https://www.alphagenomedocs.com/colabs/batch_variant_scoring.html

I think the goal is basically to look for variants that produce a quantile or raw score of large magnitude (strongly positive or, if negatives are possible, strongly negative). For RNAseq, I think the model creates one score per variant for each gene in the interval. The quantile score goes from -1 to 1 and measures how large an effect the variant has on the gene (0 is no effect).

Edit: So I think what I'll try to get to doing is:

1. Filter DecodeME SNPs to -log10(p) greater than something like 7 (produces about 300 variants).
2. Do batch scoring on all these variants with the RNAseq predictor on brain tissue.
3. Find the largest magnitude quantile scores to identify which variant-gene pairs might be important.
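Step 1 of the plan can be sketched with the standard library alone. The column name `LOG10P` and the toy rows are assumptions for illustration; the real summary-statistics file may use different headers.

```python
import csv
import io

# Made-up rows standing in for GWAS summary statistics (not DecodeME data).
TOY_TSV = (
    "CHROM\tGENPOS\tID\tALLELE0\tALLELE1\tLOG10P\n"
    "20\t1000\tvarA\tT\tTA\t12.4\n"
    "20\t2000\tvarB\tC\tT\t6.9\n"
    "20\t3000\tvarC\tG\tA\t7.3\n"
)


def filter_by_log10p(tsv_text, threshold=7.0):
    """Keep rows whose -log10(p) value is strictly above `threshold`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if float(row["LOG10P"]) > threshold]


kept = filter_by_log10p(TOY_TSV)
kept_ids = [row["ID"] for row in kept]
```

Here `kept_ids` holds the variants that would go on to batch scoring in step 2.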

forestglip · Feb 4, 2026

Ok, well I did that.

I didn't initially filter to brain - I just had the model do all RNAseq predictions for all tissues, using all the variants with p-value less than 1x10^-7 (342 variants). That resulted in 3,800,412 scores, which are all the combinations of variants-genes-tissues. (It's looking at genes within 1MB of the variants.)

I filtered to only protein-coding genes and only brain. That brought it down to 4,398 scores. I attached the excel file with those brain-protein coding gene scores. (The unfiltered file would be far too large.)

I don't think this will be as easy to interpret as I thought. There are way too many different variant-gene combinations with high predicted effect scores.

Just for the chr20 locus, after filtering for variant-gene pairs which have a predicted effect on a gene with a quantile score (absolute value) >= 0.9:

191 variants were in the input data (passed the p-value threshold)
154 variant-gene pairs passed the above score threshold, and consist of:
- 36 unique variants
- 9 unique genes

The 9 unique genes are:

PTGIS
DDX27
KCNB1
CSE1L
STAU1
ZNFX1
B4GALT5
ARFGEF2
PREX1
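The counting behind those numbers can be sketched as follows. The rows are a handful of brain quantile scores quoted elsewhere in this thread, not the full table, and the function name `summarize` is hypothetical.

```python
# (variant, gene, quantile_score) rows taken from scores posted in this thread.
rows = [
    ("chr20:48914387:T>TA", "CSE1L", 0.9995),
    ("chr20:48914387:T>TA", "ZNFX1", -0.9984),
    ("chr20:48914387:T>TA", "PREX1", 0.5843),
    ("chr20:48935095:C>CTCTTTTTT", "DDX27", 0.9997),
    ("chr20:48935095:C>CTCTTTTTT", "PREX1", 0.5690),
    ("chr20:49199407:T>C", "KCNB1", -0.7772),
]


def summarize(rows, threshold=0.9):
    """Count pairs, unique variants and unique genes with |score| >= threshold."""
    passing = [r for r in rows if abs(r[2]) >= threshold]
    return {
        "pairs": len(passing),
        "variants": len({variant for variant, _, _ in passing}),
        "genes": len({gene for _, gene, _ in passing}),
    }


summary = summarize(rows)
```

On this small subset, three pairs pass the 0.9 cutoff, coming from two variants and three genes.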

It seems to me that this is probably more useful when you only have one or a small handful of variants of interest. Too many variants have effects on genes in various ways, so if we just throw all of them at AlphaGenome, it'll give lots of uninteresting results.

For the example analysis, where they try to see whether the model predicts expected effects for known leukemia variants, they're working with a much smaller set of variants. I think these are variants where there is high confidence that they actually cause the disease.

Python:

from alphagenome.data import genome
from alphagenome.models import dna_client, variant_scorers
import pandas as pd
from tqdm import tqdm
from dotenv import load_dotenv
import os

load_dotenv()
ALPHAGENOME_KEY=os.getenv('ALPHAGENOME_KEY')
save_predictions = True

# Load file which only has variants above -log10p of 7 in tsv format.
df = pd.read_csv('gwas1_log10p_gt_7.tsv', sep='\t')
df.CHROM = 'chr' + df.CHROM.astype(str)

# Create alphagenome client
dna_model = dna_client.create(ALPHAGENOME_KEY)

# Set up parameters:
# Interval: 1MB
# Scorer: RNAseq
sequence_length = '1MB'  # @param ["16KB", "100KB", "500KB", "1MB"] { type:"string" }
sequence_length = dna_client.SUPPORTED_SEQUENCE_LENGTHS[
    f'SEQUENCE_LENGTH_{sequence_length}'
]

scorer_selections = {
    'rna_seq': True,
    'cage': False,
    'procap': False,
    'atac': False,
    'dnase': False,
    'chip_histone': False,
    'chip_tf': False,
    'polyadenylation': False,
    'splice_sites': False,
    'splice_site_usage': False,
    'splice_junctions': False,
}

all_scorers = variant_scorers.RECOMMENDED_VARIANT_SCORERS
selected_scorers = [
    all_scorers[key]
    for key in all_scorers
    if scorer_selections.get(key.lower(), False)
]

# Fetch scores
results = []
for i, row in tqdm(df.iterrows(), total=len(df)):
  variant = genome.Variant(
      chromosome=str(row.CHROM),
      position=int(row.GENPOS),
      reference_bases=row.ALLELE0,
      alternate_bases=row.ALLELE1,
      name=row.ID,
  )
  interval = variant.reference_interval.resize(sequence_length)

  variant_scores = dna_model.score_variant(
      interval=interval,
      variant=variant,
      variant_scorers=selected_scorers,
      organism=dna_client.Organism.HOMO_SAPIENS,
  )
  results.append(variant_scores)
 
df_scores = variant_scorers.tidy_scores(results)

# Filter to protein-coding genes in the brain and save results
ontologies = ['UBERON:0000955'] # Brain
gene_types = ['protein_coding']
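# .copy() gives an independent frame, so adding a column below does not
# trigger pandas' SettingWithCopyWarning.
df_scores = df_scores[
    df_scores['ontology_curie'].isin(ontologies)
    & df_scores['gene_type'].isin(gene_types)
].copy()

df_scores['abs_quantile_score'] = df_scores['quantile_score'].abs()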

df_scores = df_scores.sort_values('abs_quantile_score', ascending=False)

if save_predictions:
  df_scores.to_excel('variant_scores.xlsx', index=False)

forestglip · Feb 4, 2026

Out of curiosity, I looked at only the most significant variant in DecodeME (20:48914387:T:TA). The model predicts that it affects the genes below. Positive scores mean the ME/CFS risk allele increases expression.

Interestingly, it seems to say it has large effects on almost all genes it returned a score for.

Variant	Gene	Quantile score
chr20:48914387:T>TA	CSE1L	0.9995
chr20:48914387:T>TA	KCNB1	0.9989
chr20:48914387:T>TA	DDX27	0.9985
chr20:48914387:T>TA	ZNFX1	-0.9984
chr20:48914387:T>TA	ARFGEF2	-0.9982
chr20:48914387:T>TA	STAU1	-0.9979
chr20:48914387:T>TA	PREX1	0.5843

If I pick a variant pretty close by to that one, it gives similarly high scores. I notice the sign of ZNFX1 and STAU1 is switched, which is surprising.

Variant	Gene	Quantile Score
chr20:48935095:C>CTCTTTTTT	DDX27	0.9997
chr20:48935095:C>CTCTTTTTT	ZNFX1	0.9987
chr20:48935095:C>CTCTTTTTT	CSE1L	0.9970
chr20:48935095:C>CTCTTTTTT	STAU1	0.9967
chr20:48935095:C>CTCTTTTTT	ARFGEF2	-0.9966
chr20:48935095:C>CTCTTTTTT	KCNB1	0.9612
chr20:48935095:C>CTCTTTTTT	PREX1	0.5690

If I go even further away on the chr20 locus and pick a variant, the scores look more like what I was expecting, where they're not all extremely high. (In this case, the T allele is the risk allele, so a positive score means decreased expression with the ME/CFS allele.)

Variant	Gene	Quantile Score
chr20:49199407:T>C	KCNB1	-0.7772
chr20:49199407:T>C	ZNFX1	-0.6808
chr20:49199407:T>C	PTGIS	-0.569
chr20:49199407:T>C	B4GALT5	-0.5203
chr20:49199407:T>C	DDX27	-0.4859
chr20:49199407:T>C	PREX1	-0.4123
chr20:49199407:T>C	ARFGEF2	0.3329
chr20:49199407:T>C	CSE1L	0.2807
chr20:49199407:T>C	STAU1	-0.1825
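Because the scores compare the alternate allele against the reference, the sign has to be read differently when the risk allele is the reference. A minimal sketch of that re-orientation, assuming it is acceptable to simply negate the signed quantile score when the two alleles swap roles (the thread does not confirm this); `orient_to_risk_allele` is a hypothetical helper.

```python
def orient_to_risk_allele(score, ref, alt, risk_allele):
    """Return the score so that positive means higher expression with the risk allele.

    Assumes `score` describes the effect of `alt` relative to `ref`.
    """
    if risk_allele == alt:
        return score
    if risk_allele == ref:
        return -score
    raise ValueError("risk allele matches neither ref nor alt")


# chr20:49199407:T>C with T as the risk allele, KCNB1 score from the table above.
kcnb1_oriented = orient_to_risk_allele(-0.7772, "T", "C", "T")

# chr20:48914387:T>TA with the insertion as the risk allele, CSE1L score.
cse1l_oriented = orient_to_risk_allele(0.9995, "T", "TA", "TA")
```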

So I'm not really sure what that means. It seems odd that a variant would strongly affect so many genes.

Edit: Oh, I think it's because the first two variants are insertions. I looked at a few other random variants, both those that are insertions/deletions, and those that are just a simple base pair swap.

All the insertion/deletion variants give high scores for many genes, while the single nucleotide substitutions give smaller scores, like the third table. So it might be that inserting or deleting base pairs is likely to affect many genes.
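That comparison can be sketched by classifying each variant from its allele lengths and pooling the absolute scores per class. The scores are the brain quantile scores from the three tables above; the helper names are hypothetical, and three variants are far too few to draw a conclusion from.

```python
from statistics import mean

# Brain quantile scores copied from the three tables above.
scores_by_variant = {
    "chr20:48914387:T>TA": [0.9995, 0.9989, 0.9985, -0.9984, -0.9982, -0.9979, 0.5843],
    "chr20:48935095:C>CTCTTTTTT": [0.9997, 0.9987, 0.9970, 0.9967, -0.9966, 0.9612, 0.5690],
    "chr20:49199407:T>C": [-0.7772, -0.6808, -0.569, -0.5203, -0.4859,
                           -0.4123, 0.3329, 0.2807, -0.1825],
}


def variant_type(variant_id):
    """Classify a 'chrom:pos:REF>ALT' identifier by comparing allele lengths."""
    ref, alt = variant_id.rsplit(":", 1)[1].split(">")
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "other"  # same-length multi-base substitution


def mean_abs_score_by_type(scores_by_variant):
    """Pool absolute scores per variant type and average them."""
    pooled = {}
    for variant_id, scores in scores_by_variant.items():
        pooled.setdefault(variant_type(variant_id), []).extend(abs(s) for s in scores)
    return {kind: mean(values) for kind, values in pooled.items()}


summary = mean_abs_score_by_type(scores_by_variant)
```

On these rows the two insertions average above 0.9 in absolute quantile score, while the single substitution averages below 0.5.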

hotblack · Feb 4, 2026

forestglip said:
I don't think this will be as easy to interpret as I thought. There are way too many different variant-gene combinations with high predicted effect scores.

Yeah, that was what I was starting to think with my comments yesterday, and it seemed implied by comments in the roundtable discussion, although I’ve got a lot more to read/listen to and hadn’t got things working well enough to be sure. Nice work there, thanks for sharing your code snippets and findings (and @ME/CFS Science Blog).

There may still be some fun we can have with the tool as it is, but if better support for GWAS is on their backlog it may be easiest to wait until they have that working rather than us trying to layer things on top? Understanding what it can do, so that we better understand what questions we can ask with it, may still be useful…

jnmaciuch · Feb 4, 2026

forestglip said:
So I'm not really sure what that means. It seems odd that a variant would strongly affect so many genes.

Nice work! When I looked at that locus a while ago I remember that the region of SNPs with the highest significance overlapped an area with a lot of different transcription factor binding motifs.

It seems to be an important regulatory region, so theoretically these findings are biologically plausible—some variants increase expression and others decrease despite being a few base pairs apart, with a whole suite of nearby genes affected. But we’re also dealing with a new tool using a method that’s known to take too many liberties so that may be giving the output too much credit.

jnmaciuch · Feb 4, 2026

@forestglip if you share the variant locations from your examples above I can check the overlap with TF binding sites when I have a chance to verify if that’s what causes the sign change

forestglip · Feb 4, 2026

jnmaciuch said:
@forestglip if you share the variant locations from your examples above I can check the overlap with TF binding sites when I have a chance to verify if that’s what causes the sign change

Sure, they're in the first column of the tables. These are the first two:

chr20:48914387:T>TA
chr20:48935095:C>CTCTTTTTT

jnmaciuch · Feb 4, 2026

forestglip said:
Sure, they're in the first column of the tables. These are the first two:

chr20:48914387:T>TA
chr20:48935095:C>CTCTTTTTT

Oh duh, it's amazing the things you miss when you're just scrolling quickly on your phone.

chr20:48914387:T>TA
Most significant hit
In lncRNA ENSG00000294533, described as anti-sense to ARFGEF2. Meaning that it's a complementary sequence to part of the actual sequence in the gene--when the anti-sense gets transcribed, it can bind to the matching region on the gene and interfere with transcription. I will see if I can work out where in the ARFGEF2 gene this variant might be wreaking havoc for complementary binding (hopefully that will explain the discrepancy with this insertion causing downregulation of the ARFGEF2 gene)

chr20:48935095:C>CTCTTTTTT
Inside Encode TF-bound regulatory region (meaning a large-scale genome study found that this region had a lot of active transcription factor binding)
This is really interesting. The insertion adds in several binding motifs that wouldn't have otherwise been there. Additions are CEBPB, IRF2, and IRF1. It absolutely makes sense for the switched sign on ZNFX1--it's a known interferon responsive gene, and this insertion basically adds a new binding motif for some of the strongest mediators of the interferon response

Reference query:

TFBIND result for your sequence.

Insertion query:

TFBIND result for your sequence.

Are these all in the brain only? I think there are some extremely interesting implications here.

chr20:49199407:T>C
Not in any known regulatory region (TF or lncRNA), likely just in LD with other hits.

ME/CFS Science Blog · Feb 4, 2026

forestglip said:
If I pick a variant pretty close by to that one

Was chr20:48935095:C>CTCTTTTTT chosen randomly or was it based on the size of its effect?

jnmaciuch · Feb 4, 2026

ME/CFS Science Blog said:
Was chr20:48935095:C>CTCTTTTTT chosen randomly or was it based on the size of its effect?

Good question, I assume randomly because it was the next closest in the credible set outside of the island around the strongest hit. The TF binding data explains the switch in signs for some genes, but might not be relevant to ME/CFS

forestglip · Feb 4, 2026

jnmaciuch said:
Are these all in the brain only? I think there are some extremely interesting implications here.

The scores I posted are all for predicted brain expression, but looking at the scores across all tissues for the same variant-gene pair of chr20:48935095:C>CTCTTTTTT and ZNFX1, it's similarly high for pretty much all tissues. I attached all the predicted scores for RNAseq that include that variant and that gene.

ME/CFS Science Blog said:
Was chr20:48935095:C>CTCTTTTTT chosen randomly or was it based on the size of its effect?

Yes, sorry, semi-randomly, just based on location by looking at the LocusZoom plot and picking two other variants.

The first labeled variant is the most significant. The second variant labeled here, rs58306097, is chr20:48935095:C>CTCTTTTTT. The third, rs111386480, is chr20:49199407:T>C.

Zoomed in on first two:

jnmaciuch · Feb 4, 2026

forestglip said:
The scores I posted are all for predicted brain expression, but looking at the scores across all tissues for the same variant-gene pair of chr20:48935095:C>CTCTTTTTT and ZNFX1, it's similarly high for pretty much all tissues. I attached all the predicted scores for RNAseq that include that variant and that gene.

Thanks, that's helpful. It's interesting how much the sign flip flops across tissues, even though the quantile score is similarly high. The only thing that makes me less worried is that the signs from unrelated samples in the same tissue/cell-type seem to be in the same direction (at least from a quick spot check, looking at different brain regions, smooth muscle samples, and T cell subsets).

forestglip · Feb 4, 2026

jnmaciuch said:
It's interesting how much the sign flip flops across tissues, even though the quantile score is similarly high.

Yeah, it is odd. They seem to mostly be consistent within the same tissues, but I did find a few that flip within the same tissue. Some examples below. Again based on chr20:48935095:C>CTCTTTTTT and ZNFX1.

I see that at least for cerebellum and frontal cortex, there's a difference in signs but also a difference in life stage. But I don't see what would make the breast epithelium switch. The only differences I see are the data source and the assay, though I don't know what polyA means here.

Assay title	biosample_name	life stage	data_source	quantile_score
total RNA-seq	breast epithelium	adult	encode	-0.96204376
polyA plus RNA-seq	breast epithelium	adult	gtex	0.90917575

total RNA-seq	cerebellum	embryonic	encode	-0.999506
polyA plus RNA-seq	cerebellum	adult	gtex	0.99937785
polyA plus RNA-seq	cerebellum	embryonic	encode	-0.9976844

total RNA-seq	frontal cortex	embryonic	encode	-0.9996078
polyA plus RNA-seq	frontal cortex	adult	gtex	0.998573
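A quick way to check where the sign flips sit is to group the rows and flag any group containing both positive and negative scores. The sketch below uses the rows from the table above; `groups_with_sign_flips` is a hypothetical helper.

```python
# (assay, biosample, life stage, data source, quantile score) from the table above.
records = [
    ("total RNA-seq", "breast epithelium", "adult", "encode", -0.96204376),
    ("polyA plus RNA-seq", "breast epithelium", "adult", "gtex", 0.90917575),
    ("total RNA-seq", "cerebellum", "embryonic", "encode", -0.999506),
    ("polyA plus RNA-seq", "cerebellum", "adult", "gtex", 0.99937785),
    ("polyA plus RNA-seq", "cerebellum", "embryonic", "encode", -0.9976844),
    ("total RNA-seq", "frontal cortex", "embryonic", "encode", -0.9996078),
    ("polyA plus RNA-seq", "frontal cortex", "adult", "gtex", 0.998573),
]


def groups_with_sign_flips(records, key):
    """Return the sorted group keys whose scores include both signs."""
    signs = {}
    for record in records:
        signs.setdefault(key(record), set()).add(record[4] > 0)
    return sorted(group for group, seen in signs.items() if len(seen) == 2)


by_tissue = groups_with_sign_flips(records, key=lambda r: r[1])
by_tissue_and_stage = groups_with_sign_flips(records, key=lambda r: (r[1], r[2]))
```

Grouping by tissue alone flags all three tissues, while grouping by tissue and life stage leaves only adult breast epithelium, which matches the observation that life stage accounts for the cerebellum and frontal cortex flips but not the breast epithelium one.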

In case the file with all the scores in all tissues and all genes might be helpful now or in the future, I uploaded it to GitHub. Maybe when we have a better idea of which specific variants are involved, it'll be good to reference.

jnmaciuch · Feb 4, 2026

jnmaciuch said:
I will see if I can work out where in the ARFGEF2 gene this variant might be wreaking havoc for complementary binding (hopefully that will explain the discrepancy with this insertion causing downregulation of the ARFGEF2 gene)

Update: very difficult to tell. All the significant hits are in intronic regions of the lncRNA--if they're getting spliced out, they wouldn't be attached to the part of the lncRNA that has complementary specificity to ARFGEF2 (though small regions may have other complementarity...it's not something we have good tools to find out, really). This could explain the broad effect on lots of other genes in the region, or it could just be an LD issue.

Though the hit right next to the most significant one--chr20:48914264_TTGC/T--is interesting because it is in another known TF-bound region.
Looks like the deletion causes the site to lose an AP1 and ER (estrogen receptor) binding site. It's a leap to think that the variants in more interesting regions are automatically the causal ones, but it does seem to fit with the idea that these variants have such strong effects on a bunch of genes.

Reference query:

TFBIND result for your sequence.

Deletion query:

TFBIND result for your sequence.

forestglip · Feb 4, 2026

Can you give a simpler explanation for what you're doing? No idea what I'm looking at on the linked pages.

jnmaciuch said:
Reference query:
TFBIND result for your sequence.

AC	ID	Score	Loc.	Str.	Consensus Sequence	Signal Sequence
M00160	V$SRY_02	0.764827	1	(+)	NWWAACAAWANN	AAAAAAAATTGC
M00191	V$ER_Q6	0.730751	2	(+)	NNARGNNANNNTGACCYNN	AAAAAAATTGCTGACCAGG
M00042	V$SOX5_01	0.845980	6	(-)	NNAACAATNN	AAATTGCTGA
M00175	V$AP4_Q5	0.819097	7	(-)	NNCAGCTGNN	AATTGCTGAC
M00172	V$AP1FJ_Q2	0.854259	11	(+)	RSTGACTNMNW	GCTGACCAGGT
M00173	V$AP1_Q2	0.864422	11	(+)	RSTGACTNMNW	GCTGACCAGGT
M00174	V$AP1_Q6	0.819481	11	(+)	NNTGACTCANN	GCTGACCAGGT
M00188	V$AP1_Q4	0.838365	11	(+)	RSTGACTMANN	GCTGACCAGGT
M00176	V$AP4_Q6	0.764219	15	(-)	CWCAGCTGGN	ACCAGGTGCA
M00184	V$MYOD_Q6	0.942936	15	(-)	NNCANCTGNY	ACCAGGTGCA
M00217	V$USF_C	0.848038	16	(+)	NCACGTGN	CCAGGTGC
M00217	V$USF_C	0.843341	16	(-)	NCACGTGN	CCAGGTGC

jnmaciuch · Feb 4, 2026

forestglip said:
Can you give a simpler explanation for what you're doing? No idea what I'm looking at on the linked pages.

The website is just a quick way of cross referencing a given sequence with known binding motifs for TFs. A binding motif is like a footprint that the transcription factor physically binds to on the DNA (it's the last column in the links). Give the website a nucleotide sequence, it will give you a list of what TFs could potentially bind there.

Variants that just swap out one base tend not to affect the binding of a TF too much, since there's still some flexibility. But insertions and deletions can completely change the shape of that footprint, which is why it's interesting when one of those happens in a region that is known to have active TF binding activity (the "ENCODE cCRE" track on UCSC genome browser helps identify these).

So I'm running it two times with a short string of nucleotides centered around the variant location, and seeing what changes between the string from the reference genome and a string manually changed to match the variant from DecodeME. [Edit: that tells me whether the variant creates any changes in what transcription factors are able to bind there, which could have effects for lots of genes in the region]
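The reference-versus-variant comparison can be sketched as below. This is not how TFBIND scores sequences: it swaps in exact matching of IUPAC consensus strings on the forward strand only, which is much cruder than similarity scoring against weight matrices. The two motifs are simplified consensus strings and the sequences are toy examples, not the DecodeME locus.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

# Simplified consensus strings, used here purely for illustration.
TOY_MOTIFS = {"AP1_core": "TGASTCA", "EBOX": "CANNTG"}


def motif_hits(sequence, motifs):
    """Return the names of motifs whose consensus occurs in `sequence`."""
    hits = set()
    for name, consensus in motifs.items():
        pattern = "".join(IUPAC[base] for base in consensus)
        if re.search(pattern, sequence):
            hits.add(name)
    return hits


# Toy window around a variant, and the same window with a 2-bp deletion (ACT>A).
reference_window = "GGATGACTCAGGCAGCTGAA"
variant_window = "GGATGACAGGCAGCTGAA"

ref_hits = motif_hits(reference_window, TOY_MOTIFS)
alt_hits = motif_hits(variant_window, TOY_MOTIFS)
lost = ref_hits - alt_hits
gained = alt_hits - ref_hits
```

In this toy case the deletion removes the AP-1-like site and leaves the E-box intact, which is the kind of difference the two TFBIND queries are being compared for.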

forestglip · Feb 4, 2026

Thanks, very interesting.

ME/CFS Science Blog · Feb 5, 2026

The links to ZNFX1 and interferon transcription factors are very interesting but the insertion was chosen randomly.

Should we perhaps do the same thing for a couple of other insertions in this region? If they have a similar effect it would be quite interesting.

Take for example:

20:49012496_T/TTTTG
Ref. Allele: T
P Value: 2.65 × 10^-9

ME/CFS Science Blog · Feb 5, 2026

forestglip said:
An LD pattern tells a story in terms of helping identify where the causal SNP is, or helping determine if a phenotype might contain the same causal SNP as another phenotype, but I don't think that's what AlphaGenome is doing. If we knew the specific causal variant in ME/CFS, and gave it that in your code, I think that's all it would need to make a prediction of what that variant does. Other SNPs in LD, as long as they aren't actually interacting with the SNP, are just noise and shouldn't improve the prediction.

I've got a hunch that, because LD between SNPs often isn't 1 or 0 but a correlation value in between, the pattern of SNPs is helpful for finding out which SNPs are causal. In addition, I suspect that for complex diseases the causal effect often isn't restricted to a single SNP in a region, and that the other SNPs aren't all noise due to LD: some contribute to the effect or help show what it is doing.

I'd hoped that AlphaGenome would use its powerful AI on these GWAS patterns, but it seems mostly focused on the effect of isolated SNP variants.
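To make the "correlation value in between" concrete, LD between two SNPs is commonly summarised as the squared correlation (r²) of their genotype dosages across individuals. A minimal sketch with made-up dosages (0, 1 or 2 copies of the alternate allele) for eight hypothetical individuals:

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov * cov / (var_x * var_y)


# Toy dosages, not real data.
snp_a = [0, 1, 2, 0, 1, 2, 0, 1]
snp_b = [0, 1, 2, 0, 1, 2, 0, 1]  # identical to snp_a: perfect LD
snp_c = [0, 1, 2, 0, 1, 1, 0, 2]  # differs in two individuals: partial LD

ld_ab = r_squared(snp_a, snp_b)
ld_ac = r_squared(snp_a, snp_c)
```

With partial LD like `ld_ac`, the association signal at a non-causal SNP is weaker than at the causal one, which is the pattern fine-mapping methods try to exploit.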
