In Silico Functional Annotation of Rare CACNA1A Variants in Familial Hemiplegic Migraine
December 2025 - Current
ChanVar is a Python pipeline for gene-specific pathogenicity annotation of rare missense variants in CACNA1A. It's the first project in the MiSOF series and functions as the foundational variant annotation layer from which downstream MiSOF projects can draw per-variant functional context when modeling treatment trajectories in the migraine patient population.
The gap ChanVar targets is specific: generic variant pathogenicity predictors (CADD, SIFT, PolyPhen-2, REVEL, AlphaMissense) are trained on diverse variant-disease associations across the human proteome. They have no structural awareness of Cav2.1's domain architecture, no knowledge that voltage-sensor arginine residues operate under a distinct evolutionary and functional constraint relative to interdomain linkers, and no capacity to reflect the gain-of-function vs. loss-of-function mechanistic distinction that separates FHM1 from EA2 at the same locus. ChanVar is built around that specific biology.
The eight features composing the Composite Pathogenicity Score (CPS) are: gnomAD v4 allele frequency, FoldX/EvoEF2 thermodynamic stability change (ddG), local C-alpha RMSD in a +/-10 residue window, Jensen-Shannon divergence conservation from an 87-vertebrate ortholog alignment, ACMG evidence code weights (PS3/BS3, PM2, PP3/BP4), ClinVar prior classification, CADD-PHRED, and Grantham physicochemical distance. Each feature is normalized to [0, 1], where 1 represents the strongest pathogenic evidence. The composite score is a weighted mean with 1,000-iteration bootstrap confidence intervals.
Cheers,
Angie X.
*Note: this project is not fully polished yet. Sorry if there are discrepancies or content gaps; I'm working on it!
Your DNA is three billion letters long, and scattered throughout it are roughly 20,000 genes. One of those genes, CACNA1A, builds a protein that acts as a precisely calibrated gate in the walls of your neurons. When a nerve signal arrives, the gate opens, calcium ions rush in, and the neuron fires. Mutations in CACNA1A break the gate in different ways: sometimes it opens too easily, producing the hyperactive electrical cascade that underlies hemiplegic migraine; sometimes it barely opens at all, causing episodes of sudden weakness and cerebellar ataxia. The same gene, different amino acid, different disease.
The problem ChanVar addresses is this: genetic sequencing is now cheap enough that clinicians routinely sequence CACNA1A in patients with atypical migraines, and they keep finding variants (i.e., individual amino acid substitutions) that nobody has studied before. Is this particular variant causing the patient's disease, or is it a harmless quirk in their DNA? Most of those variants end up classified as a VUS, or variant of uncertain significance. ChanVar computes a Composite Pathogenicity Score (CPS) for any such variant, integrating eight lines of quantitative evidence into a single annotated score with a 95% confidence interval. It does not replace a neurogenetics clinician. It gives them one more calibrated signal to work with.
Cav2.1 and CACNA1A:
Voltage-gated calcium channels are protein complexes embedded in the cell membrane that respond to changes in membrane potential by opening a pore through which calcium ions (Ca2+) enter the cell. Cav2.1, the P/Q-type channel encoded by CACNA1A, is expressed throughout the central nervous system with especially high density at presynaptic terminals of cerebellar Purkinje cells and cortical neurons. Its function is mechanistically central: an arriving action potential depolarizes the presynaptic membrane, the voltage-sensing S4 segments physically displace outward, the channel opens, calcium enters, and calcium-dependent vesicle fusion releases neurotransmitter into the synapse. Cav2.1 is, at these synapses, the primary gate controlling when and how much neurotransmitter is released.
The CACNA1A protein has 2,505 amino acids organized across four homologous repeat domains (I through IV). Each domain contributes six transmembrane segments (S1 through S6). The S4 segments carry positively charged arginine and lysine residues at every third position: these are the voltage-sensing arginines. They move outward when the membrane depolarizes, coupling voltage change to pore opening. The P-loops between S5 and S6 in each domain form the channel pore, with one glutamate per domain contributing to the EEEE selectivity filter that discriminates Ca2+ from Na+ and K+. The C-terminal cytoplasmic tail hosts regulatory domains including calmodulin-binding sites, a synaptotagmin interaction region, and a polyglutamine repeat tract whose expansion causes Spinocerebellar Ataxia type 6 -- a separate disease at the same locus.
This structural specificity is the core problem for generic variant predictors. The voltage-sensing arginines in S4 are not like other positions: every one of them is under extreme purifying selection, every substitution disrupts the electrostatic coupling between membrane voltage and gate movement, and the functional consequence (gain-of-function producing FHM1; loss-of-function producing EA2) depends on whether the substitution shifts the activation curve leftward or rightward. A generic predictor trained on protein variation broadly cannot reflect this. ChanVar builds domain awareness directly into the feature weights and domain-specific pathogenicity multipliers.
Familial Hemiplegic Migraine Type 1 (FHM1):
Familial Hemiplegic Migraine (FHM) is an autosomal dominant monogenic migraine syndrome with an estimated prevalence of approximately 1 in 50,000. Its defining feature is migraine with motor aura: episodes include transient hemiparesis or hemiplegia in addition to the visual and sensory aura that characterize ordinary migraine with aura. Three causative genes have been established: CACNA1A (FHM1, roughly 50% of FHM families), ATP1A2 (FHM2, roughly 25%), and SCN1A (FHM3, fewer than 5%). FHM1 mutations predominantly produce gain-of-function in Cav2.1: the activation threshold is reduced, more calcium enters the presynaptic terminal per action potential, and the resulting glutamate excess lowers the threshold for cortical spreading depression -- the slowly propagating wave of depolarization that underlies migraine aura.
Beyond FHM1, CACNA1A variants cause at least three additional phenotypes: Episodic Ataxia type 2 (EA2), where loss-of-function variants produce episodes of cerebellar ataxia and vertigo; Spinocerebellar Ataxia type 6 (SCA6), from polyglutamine expansion in the C-terminal domain; and isolated absence epilepsy from isoform-specific variants. This clinical pleiotropy (the same gene causing migraine, ataxia, cerebellar degeneration, or epilepsy depending on which amino acid changes and in which direction) is what makes CACNA1A one of the most biologically interesting and clinically difficult loci in the human genome, and what makes a domain-aware variant predictor worth building.
The Variant Interpretation Gap:
Whole-exome and whole-genome sequencing are now routinely ordered for patients with severe or atypical migraine, familial migraine clusters, and suspected channelopathies. CACNA1A is included on essentially every epilepsy, ataxia, and migraine gene panel currently in clinical use. gnomAD v4, the Genome Aggregation Database release from 2023 covering approximately 730,000 exomes and 76,000 genomes, documents thousands of rare CACNA1A missense variants, the overwhelming majority of which have never been functionally characterized. When a neurologist encounters a patient with severe familial migraine and a candidate CACNA1A variant, they are usually looking at one of these uncharacterized variants, and the question of whether it causes the patient's disease is genuinely unanswerable with current tools.
The existing generic predictors help, but imperfectly. CADD computes a phred-scaled score from a genome-wide model (logistic regression in recent releases) trained to separate simulated de novo variants from variants that have become fixed in the human lineage; it has no knowledge of Cav2.1's domain architecture. PolyPhen-2 uses comparative sequence data and structural features but treats all protein positions as exchangeable in functional importance. AlphaMissense integrates AlphaFold2 structural embeddings and achieves substantially better performance than earlier tools, but it was trained on general human protein pathogenicity and its error modes at voltage-sensor arginine positions are not well characterized. ChanVar's contribution is gene-specific: every feature is calibrated to what is known about CACNA1A variant biology in particular, and every domain assignment carries an interpretation derived from the channel's functional architecture.
Ion Channel Biophysics and Structural Biology:
What are P/Q-type calcium channels? Who knows? I didn’t.
My learning path looked roughly like this: Hille's Ion Channels of Excitable Membranes for the basic electrophysiology, the Ophoff (1996) and Kors (2001) papers for FHM1 genetics, the Gao et al. cryo-EM paper and the AlphaFold2 structure for structural biology, and then a week of reading about ddG calculations before realizing that membrane proteins systematically violate the assumptions underlying every widely used stability predictor. This section describes what I learned that was not obvious going in.
The voltage-sensing mechanism is easier to understand than the literature makes it seem. The S4 helix has an arginine at every third position, and the surrounding membrane creates a focused electric field across the hydrophobic seal between S4 and the rest of the protein. When the membrane depolarizes (inside becomes less negative), the electrostatic force on those arginines is reduced, and S4 moves outward by roughly 10-15 Angstroms. This movement is mechanically coupled to the pore domain via the S4-S5 linker; the coupling opens the channel gate. Disrupt any of the charged arginines and you disrupt this coupling. Depending on which arginine and in which direction, you either make the gate easier to open (gain-of-function, FHM1) or harder (loss-of-function, EA2). This is why domain membership is among the strongest CPS features: a voltage-sensor arginine is not merely conserved; it is conserved specifically because it performs a voltage-sensing function that no other position can substitute for.
The thermodynamic stability calculation (f2) required learning why FoldX, despite being the best widely available ddG predictor, is systematically less reliable for transmembrane proteins. FoldX's energy function was parameterized on soluble protein structures; it computes solvation energies assuming aqueous surroundings. Membrane-embedded helices are instead surrounded by lipid bilayer, with hydrophobicity driving the thermodynamic logic in the opposite direction. A hydrophilic substitution in a TM helix will be flagged as destabilizing by FoldX -- correctly in an aqueous sense -- even though the actual functional consequence in the membrane context depends on whether the substitution disrupts packing with adjacent helices, not on solvation. ChanVar flags all TM-domain variants with is_tm_domain=True and halves the f2 weight accordingly.
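The halving rule is simple enough to sketch in a few lines. This is a minimal illustration, not the code in the repository: the function name apply_tm_correction and the example f2 weight are assumptions made up for the example; only the is_tm_domain flag and the factor of one half come from the description above.

```python
def apply_tm_correction(weights, is_tm_domain):
    """Return a copy of the weights with the f2 (ddG) weight halved for TM variants."""
    adjusted = dict(weights)  # copy, so the caller's defaults are never mutated
    if is_tm_domain and "f2" in adjusted:
        adjusted["f2"] *= 0.5
    return adjusted


# Illustrative weights only: the f2 value is a placeholder, not a ChanVar default.
example_weights = {"f1": 3.0, "f2": 2.0}
print(apply_tm_correction(example_weights, is_tm_domain=True))
print(apply_tm_correction(example_weights, is_tm_domain=False))
```

Returning a copy rather than editing the dictionary in place keeps the default weights intact when a batch mixes TM and non-TM variants.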
Evolutionary Conservation: Jensen-Shannon Divergence
Evolutionary conservation is the second-strongest CPS feature (weight 2.5) after allele frequency (weight 3.0). The intuition is as follows: if an amino acid position has been invariant across 400 million years of vertebrate evolution, selection is holding it in place. A mutation there is almost certainly disrupting a function that cannot be compensated. The rate4site algorithm, as used in ConSurf (Ashkenazy et al. 2010), formalizes this as maximum likelihood estimation of the per-position evolutionary rate under an amino acid substitution model.
The implementation in ChanVar estimates the Jensen-Shannon divergence (JSD) between the observed amino acid distribution at each alignment column and the background distribution under the JTT (Jones-Taylor-Thornton) substitution model. JSD is symmetric, bounded in [0, 1] when computed with base-2 logarithms, and does not blow up when a column has very low occupancy, properties that make it more numerically stable than raw Shannon entropy for this application. The conservation feature is f4 = 1 - rate/3, clipped to [0, 1], where rate is the per-position evolutionary rate; this maps a conserved position (rate near 0) to f4 near 1 and a rapidly evolving position (rate 3 or above) to f4 = 0. The alignment uses 87 vertebrate CACNA1A orthologs from Ensembl release 110.
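To make the two pieces concrete, here is a small sketch of a per-column JSD and the f4 mapping. It rests on several assumptions that are mine, not the pipeline's: the function names are hypothetical, the background is a uniform placeholder standing in for the JTT-derived frequencies, and the small pseudocount is just one way to keep sparse columns well behaved.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Placeholder background: uniform over the 20 amino acids. The text describes a
# JTT-derived background; those frequencies would be substituted here.
UNIFORM_BACKGROUND = [1.0 / 20] * 20


def column_distribution(column, pseudocount=1e-6):
    """Amino acid frequencies for one alignment column; gaps and unknowns are ignored."""
    counts = Counter(aa for aa in column.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return [(counts[aa] + pseudocount) / total for aa in AMINO_ACIDS]


def _kl_bits(p, q):
    """Kullback-Leibler divergence in bits (base-2 logarithm)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence, bounded in [0, 1] with base-2 logarithms."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl_bits(p, m) + 0.5 * _kl_bits(q, m)


def conservation_feature(rate):
    """f4 = 1 - rate/3, clipped to [0, 1]."""
    return min(1.0, max(0.0, 1.0 - rate / 3.0))


conserved = column_distribution("RRRRRRRRRR")
variable = column_distribution("ACDEFGHIKL")
print(jensen_shannon_divergence(conserved, UNIFORM_BACKGROUND))
print(jensen_shannon_divergence(variable, UNIFORM_BACKGROUND))
print(conservation_feature(0.3))
```

An all-gap column falls back to the pseudocounts alone, so it produces a flat distribution instead of a division by zero.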
The Grantham Matrix: Physicochemical Distance
Grantham (1974) is among the most elegantly minimal papers in computational biology: one formula, a 20x20 table, and a clear empirical grounding in observed amino acid exchangeability. The Grantham distance between two amino acids is computed from three physicochemical properties of the side chain: composition (c, the atomic weight ratio of non-carbon elements in end groups or rings to carbons in the side chain), polarity (p), and molecular volume (v).
d(i,j) = 50.723 * sqrt(1.833*(c_i - c_j)^2 + 0.1018*(p_i - p_j)^2 + 0.000399*(v_i - v_j)^2)
The coefficients are empirically derived to weight each property proportionally to its contribution to observed exchangeability in protein families. The maximum distance in the 20x20 matrix is 215 (C->W). ChanVar normalizes by 215 to produce f8 in [0, 1]. The canonical FHM1 variant R192Q has Grantham distance R->Q = 43 (moderate: charge lost, volume similar). Both this value and R->W = 101 are validated against the original Grantham (1974) Table 1 in the ChanVar unit test suite (TestGranthamDistance), because if you are going to cite a 1974 paper as your data source you should at least confirm that your table matches it.
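As a sanity check on the two benchmark values, the distance can be recomputed from the side-chain properties. This sketch assumes the commonly cited form of Grantham's formula (weights 1.833, 0.1018, and 0.000399 with an overall scale factor of 50.723) and property values for R, Q, and W as I recall them from the 1974 table; both should be checked against the paper before being relied on. The function names are hypothetical, and only three residues are included.

```python
import math

# (composition, polarity, volume) per residue. Values are assumptions recalled
# from Grantham (1974) and should be verified against the original table.
PROPERTIES = {
    "R": (0.65, 10.5, 124.0),
    "Q": (0.89, 10.5, 85.0),
    "W": (0.13, 5.4, 170.0),
}

ALPHA, BETA, GAMMA = 1.833, 0.1018, 0.000399  # property weights (assumed)
SCALE = 50.723        # scales the mean pairwise distance to about 100 (assumed)
MAX_DISTANCE = 215.0  # C->W, the largest entry in the matrix


def grantham_distance(a, b):
    """Grantham distance between two residues, before rounding."""
    (c1, p1, v1), (c2, p2, v2) = PROPERTIES[a], PROPERTIES[b]
    raw = ALPHA * (c1 - c2) ** 2 + BETA * (p1 - p2) ** 2 + GAMMA * (v1 - v2) ** 2
    return SCALE * math.sqrt(raw)


def grantham_feature(a, b):
    """f8: rounded Grantham distance normalized by the matrix maximum."""
    return min(1.0, round(grantham_distance(a, b)) / MAX_DISTANCE)


print(round(grantham_distance("R", "Q")))  # expected to match the table value 43
print(round(grantham_distance("R", "W")))  # expected to match the table value 101
print(grantham_feature("R", "Q"))
```

In practice a lookup into the published 20x20 table is simpler and avoids rounding drift; recomputing from properties is mainly useful as a cross-check on a transcribed table.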
The ChanVar pipeline draws from five external sources. gnomAD v4 is queried via the GraphQL API at gnomad.broadinstitute.org for per-variant allele frequencies and ancestry-stratified counts (AFR, AMR, ASJ, EAS, FIN, MID, NFE, SAS). For batch runs over 50 variants, the gnomAD CACNA1A VCF (chr19:13,206,000-13,512,000, hg38) is downloaded and indexed with tabix to avoid API rate limits. ClinVar variants for CACNA1A are parsed from the NCBI VCF; variants with 2+ star review status are separated into P/LP (positive training set) and B/LB (negative training set). The AlphaFold2 structure for O00555 is cached locally from alphafold.ebi.ac.uk; PDB 7MIY (cryo-EM Cav2.1) is used for transmembrane topology validation. Ortholog sequences for the conservation calculation are retrieved from Ensembl release 110 via REST and aligned with MUSCLE v5.
The CACNA1A domain boundary map is derived from UniProt feature annotations (O00555), the Ophoff (1996) transmembrane topology, and the 7MIY cryo-EM structure. Voltage-sensor S4 boundaries were adjusted relative to UniProt: S4_I spans residues 181-226 rather than the 207-226 listed in the UniProt annotation, because the UniProt range excludes R192 (the canonical FHM1 pathogenic position). This correction was made during validation, when the R192Q benchmark test returned interdomain_linker as the domain assignment. The lesson: verify domain assignments against known pathogenic variants before trusting the pipeline.
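The R192 problem is easy to reproduce with a toy lookup. This is a sketch under stated assumptions: the assign_domain helper, the single-entry maps, and the choice of interdomain_linker as the fallback label are all illustrative; only the two residue ranges and the domain labels come from the description above.

```python
def assign_domain(position, domain_map, default="interdomain_linker"):
    """Return the first domain whose inclusive residue range contains the position."""
    for name, (start, end) in domain_map.items():
        if start <= position <= end:
            return name
    return default  # assumed fallback label for unmapped positions


# One-entry maps for illustration; a real map would cover the whole protein.
uniprot_map = {"voltage_sensor": (207, 226)}   # S4_I range as listed in UniProt
adjusted_map = {"voltage_sensor": (181, 226)}  # S4_I range after the correction

for label, dmap in (("UniProt", uniprot_map), ("adjusted", adjusted_map)):
    print(label, "R192 ->", assign_domain(192, dmap))
    print(label, "S218 ->", assign_domain(218, dmap))
```

A check like this, run against every benchmark variant, is cheap insurance before any scoring happens.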
Feature assembly is handled by chanvar/scoring/features.py. The VariantFeatures dataclass holds all eight raw feature values, domain assignment, pLDDT at the variant position, data completeness (n_available divided by 8), and ClinVar classification if available. The scoring logic in chanvar/scoring/cps.py computes the weighted mean, applies TM-domain weight correction, runs the bootstrap CI, and classifies the result.
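For orientation, a container along these lines would hold what the paragraph describes. This is a guess at the shape, not a copy of chanvar/scoring/features.py: the field names other than is_tm_domain, n_available, and data_completeness are hypothetical, the f5 to f7 labels follow the order of the feature list and are an assumption, and the example values are invented apart from f8.

```python
from dataclasses import dataclass
from typing import Optional

FEATURE_NAMES = ("f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8")


@dataclass
class VariantFeatures:
    # Normalized feature values in [0, 1]; None marks a feature that is unavailable.
    f1: Optional[float] = None  # gnomAD allele frequency
    f2: Optional[float] = None  # ddG stability change
    f3: Optional[float] = None  # local C-alpha RMSD
    f4: Optional[float] = None  # conservation
    f5: Optional[float] = None  # ACMG evidence weights (label assumed from list order)
    f6: Optional[float] = None  # ClinVar prior (label assumed from list order)
    f7: Optional[float] = None  # CADD-PHRED (label assumed from list order)
    f8: Optional[float] = None  # Grantham distance
    domain: str = "unknown"
    is_tm_domain: bool = False
    plddt: Optional[float] = None
    clinvar_classification: Optional[str] = None

    @property
    def n_available(self) -> int:
        return sum(getattr(self, name) is not None for name in FEATURE_NAMES)

    @property
    def data_completeness(self) -> float:
        return self.n_available / len(FEATURE_NAMES)


# Hypothetical record with five of eight features present; all values are made
# up for illustration except f8, which is 43/215.
example = VariantFeatures(f1=0.95, f4=0.9, f5=0.8, f7=0.7, f8=43 / 215,
                          domain="voltage_sensor", is_tm_domain=True)
print(example.n_available, example.data_completeness)
```

Using None for a missing feature, rather than 0.0, keeps "absent" distinguishable from "present and benign" when completeness is computed.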
The bootstrap confidence interval uses feature-specific Gaussian noise perturbation rather than data resampling. Noise standard deviations reflect measurement uncertainty: f1 (gnomAD AF) receives sigma = 0.05 to model sampling noise at rare frequencies; f2 (ddG) receives sigma = 0.10 reflecting FoldX's reported RMSE of approximately 1 kcal/mol on soluble proteins; missing features receive sigma = 0.15 as an imputation uncertainty penalty. The 95% CI is the 2.5th-97.5th percentile across 1,000 iterations at seed = 42 by default.
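A compact sketch of that perturbation scheme follows. The sigmas for f1, f2, and missing features, the 1,000 iterations, and the seed come from the paragraph above; everything else is an assumption of mine, including the default sigma for the remaining features, imputing missing features at a neutral 0.5, clipping perturbed values to [0, 1], the nearest-rank percentile rule, and all weights other than f1 = 3.0 and f4 = 2.5.

```python
import random

SIGMA = {"f1": 0.05, "f2": 0.10}  # stated in the text
DEFAULT_SIGMA = 0.05              # assumption for features without a stated sigma
MISSING_SIGMA = 0.15              # stated imputation uncertainty penalty
NEUTRAL_VALUE = 0.5               # assumption: missing features are centered here


def weighted_mean(values, weights):
    total_weight = sum(weights[name] for name in values)
    return sum(values[name] * weights[name] for name in values) / total_weight


def bootstrap_ci(features, weights, n_iter=1000, seed=42):
    """95% interval from Gaussian perturbation of each feature (no data resampling)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_iter):
        perturbed = {}
        for name in weights:
            value = features.get(name)
            if value is None:
                center, sigma = NEUTRAL_VALUE, MISSING_SIGMA
            else:
                center, sigma = value, SIGMA.get(name, DEFAULT_SIGMA)
            # Clip so a perturbed feature stays a valid normalized score.
            perturbed[name] = min(1.0, max(0.0, rng.gauss(center, sigma)))
        scores.append(weighted_mean(perturbed, weights))
    scores.sort()
    lower = scores[int(0.025 * (n_iter - 1))]  # nearest-rank 2.5th percentile
    upper = scores[int(0.975 * (n_iter - 1))]  # nearest-rank 97.5th percentile
    return lower, upper


# Placeholder weights except f1 and f4; feature values are invented.
EXAMPLE_WEIGHTS = {"f1": 3.0, "f2": 1.0, "f3": 1.0, "f4": 2.5,
                   "f5": 1.0, "f6": 1.0, "f7": 1.0, "f8": 1.0}
EXAMPLE_FEATURES = {"f1": 0.95, "f2": None, "f3": None, "f4": 0.9,
                    "f5": 0.8, "f6": None, "f7": 0.7, "f8": 0.2}
print(bootstrap_ci(EXAMPLE_FEATURES, EXAMPLE_WEIGHTS))
```

Because missing features get the largest sigma, a record with low completeness naturally ends up with a wider interval, which is the behavior the interval is meant to expose.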
On the validation side, the test suite contains 58 unit tests, all passing. Key validations include Grantham R->Q = 43 and R->W = 101 against the 1974 paper, Shannon entropy = 0 for a fully conserved column and log2(20) for a uniformly distributed column, sigmoid inflection at the expected values for f2 and f3, and frequency feature monotonicity (more common -> lower pathogenicity score). Integration tests verify that R192Q scores above 0.65 and that a synthetic common variant (AF > 0.01) scores below 0.50; both thresholds were pre-registered before any pipeline runs were executed.
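The two entropy checks are the kind of test that fits in a dozen lines. The sketch below is written in the pytest style, with a stand-in shannon_entropy function; the function and test names are hypothetical and are not taken from the ChanVar test suite.

```python
import math
from collections import Counter


def shannon_entropy(column):
    """Shannon entropy in bits of the residues in one alignment column."""
    counts = Counter(column)
    n = len(column)
    # Adding 0.0 normalizes the negative zero produced by a fully conserved column.
    return -sum((c / n) * math.log2(c / n) for c in counts.values()) + 0.0


def test_fully_conserved_column_has_zero_entropy():
    assert shannon_entropy("R" * 87) == 0.0


def test_uniform_column_has_maximal_entropy():
    # One of each of the 20 amino acids gives the maximum, log2(20) bits.
    assert math.isclose(shannon_entropy("ACDEFGHIKLMNPQRSTVWY"), math.log2(20))
```

Boundary cases like these catch logarithm-base and normalization mistakes, which are the usual ways an entropy function goes quietly wrong.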
A preliminary ROC analysis yields an estimated AUC of approximately 0.87 for CPS, compared to 0.83 for AlphaMissense, 0.81 for REVEL, and 0.79 for CADD on this CACNA1A-specific task. These AUC estimates are based on synthetic data calibrated to ClinVar distributions, pending full API annotation of the ClinVar validation set; they should be treated as preliminary simulated estimates rather than final results, and the figure captions reflect this explicitly.
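For readers who want to see what the AUC number measures, it is the probability that a randomly chosen pathogenic variant scores higher than a randomly chosen benign one. The sketch below computes it by direct pairwise comparison; the function name is hypothetical and the labels and scores are toy values invented for the example, not ChanVar output.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (pathogenic, benign) pairs ranked correctly; ties count half."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data only: 1 = P/LP, 0 = B/LB, with invented scores.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.1]
print(roc_auc(toy_labels, toy_scores))
```

The pairwise form is quadratic in the number of variants, which is fine at a few hundred ClinVar entries; a rank-based implementation would be the usual choice for larger sets.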
Three known pathogenic FHM1 variants serve as primary benchmarks. R192Q (voltage sensor S4-I, domain I) was the first FHM1 variant described in Ophoff et al. (1996). It replaces a voltage-sensing arginine with glutamine, reducing the activation threshold in heterologous expression systems. ChanVar scores it at CPS = 0.791 (95% CI: 0.725–0.831), classified Possibly Pathogenic, at 62.5% data completeness (5 of 8 features present; f2, f3, and f6 absent without FoldX and AlphaMissense). S218L (also voltage sensor S4-I), which causes particularly severe FHM1 sometimes with permanent cerebellar atrophy, scores CPS = 0.831 (95% CI: 0.765–0.873), the highest of the three, consistent with its documented severity relative to R192Q. T666M (pore lining, domain II), initially classified as EA2 but identified in FHM families as well, scores CPS = 0.769 (95% CI: 0.703–0.811). The CPS ordering S218L > R192Q > T666M matches the clinical severity hierarchy reported in the FHM literature. The domain weight differential between voltage_sensor (1.0) and pore_lining (0.8) contributes to T666M's lower score relative to the S4-I variants, which is the expected behavior.
ChanVar's weakest area is C-terminal domain variants. The C-terminus of Cav2.1 is structurally flexible, AlphaFold2 pLDDT scores are lower there, and the functional logic is harder to capture with the features ChanVar uses. The C-terminal domain hosts calmodulin binding, EF-hand motifs, and a synaptotagmin interaction region, all of which mediate protein-protein interactions rather than channel gating. A variant that disrupts the calmodulin binding site can cause disease by uncoupling activity-dependent calcium regulation from channel inactivation while appearing structurally unremarkable to ddG or RMSD calculations. The domain weight for C-terminal positions (0.4) is calibrated to under-weight CPS in this region, but the fundamental gap is that the feature set does not contain a protein-protein interaction disruption predictor.
Transmembrane stability (f2) is the feature I trust least. FoldX on S4 segments should be treated as directional at best: it will correctly identify severe destabilization but will misclassify many pathogenic variants as structurally neutral because the membrane environment is not modeled. The planned improvement is integration with a membrane-aware stability predictor, but none currently has an accessible API compatible with the pipeline's architecture.
The following are the known gaps in ChanVar, their likely effect on the CPS, and the planned improvements where applicable.
Weight calibration: default weights are literature-informed priors, not fitted parameters. The fitted weights from logistic regression training should always be reported alongside CPS output in any downstream analysis. Until the full ClinVar training run is complete with FoldX-annotated features, the default weights should be treated as directional initializations.
Feature independence violation: CPS assumes conditional feature independence; this assumption is violated by f1-f5, f4-f7, and f2-f3 correlations. The logistic regression training will partially correct for this; the weighted mean formula does not. See Section 4.1.
f3 (RMSD) is under-validated: AlphaFold2 was not designed for mutation-induced structural perturbation. f3 is disabled when pLDDT < 70, but even at high pLDDT the structural proxy may not reflect the true mutant conformation, particularly in TM helices. Future work will replace f3 with output from a mutation-effect-aware predictor.
FoldX TM reliability: FoldX was parameterized on soluble proteins. ddG values in transmembrane helices systematically overestimate destabilization for hydrophilic substitutions. The TM weight correction (halving) is a pragmatic patch, not a rigorous correction.
Training data: approximately 200-300 CACNA1A ClinVar variants at 2+ review stars. Logistic regression on this sample size has meaningful coefficient uncertainty; the L2 regularization toward literature-informed priors is the stated mitigation.
ROC AUC estimates are simulated: the reported AUC values (~0.87 CPS, ~0.83 AlphaMissense) are derived from synthetic data calibrated to ClinVar distributions, not from annotation of the actual validation variants. They are simulated performance estimates, not benchmarks.
Missense variants only: nonsense, frameshift, splice-site, and structural variants require a different analytical framework.
C-terminal domain: the feature set does not include a protein-protein interaction disruption predictor, which is the primary pathogenic mechanism for calmodulin-binding site variants in this domain.
The variant interpretation problem in clinical genetics is a problem of asymmetric knowledge. For a small set of CACNA1A variants (R192Q, S218L, T666M, and a handful of others), there are electrophysiological data, knock-in mouse phenotypes, and decades of clinical family data. For the other several thousand rare missense variants in gnomAD, there is a sequence, a population frequency, a structure, and a conservation score. ChanVar is a principled attempt to extract as much signal as possible from that latter set of data sources, and to quantify the uncertainty in the resulting inference honestly.
The 95% bootstrap CI is not decorative. For variants in data-rich regions, the CI is narrow and thus the classification is meaningful. For variants in the C-terminal domain without ClinVar data and without FoldX output, the CI is wide, the data_completeness flag is set, and the right response is to treat the CPS as a weak prior, not a conclusion. The confidence_flag field in the CPSResult dataclass exists specifically to make this distinction legible in every report the pipeline generates. A CPS of 0.72 with a 95% CI of [0.63, 0.76] and data_completeness = 0.75 is a different statement than the same point estimate with data_completeness = 1.0 and CI = [0.70, 0.74].
A complete channelopathy variant interpretation workflow would integrate ChanVar's structural and evolutionary features with cell-type-specific expression data (cerebellar Purkinje cells and cortical neurons express distinct Cav2.1 splice isoforms with different biophysical properties), with published electrophysiological data from the primary FHM literature, and ultimately with patient clinical phenotype and treatment response. ChanVar builds the foundational annotation layer. The rest of that stack is future work, some of it in progress elsewhere in the MiSOF series.
This project is open source at github.com/axshoe/ChanVar.