KLK Gene Family Variant Pathogenicity Pipeline for hEDS
June 2026 - Current
Every standard diagnostic test comes back clean. No MRI finding. No blood test to run. Just a checklist: hypermobility here, family history there, absence of other connective tissue conditions. Check enough boxes, and you have hEDS. What you don't get is a molecular answer.
That's the reality for people like my friend Vivian, who I sometimes joke is the real-life Elastigirl. Hypermobile Ehlers-Danlos Syndrome (hEDS) means her joints are hyperflexible and move in ways they shouldn't, and the pain is constant, yet medicine has no test to point at. Just a syndrome and a name.
KLKVar was built to start closing this gap. It's a computational pathogenicity pipeline for all 15 genes in the kallikrein (KLK) family, and the first tool designed specifically for hEDS. It exists because the field finally has a credible genetic lead: the Norris lab at MUSC recently showed that rare damaging variants in KLK genes are significantly enriched in hEDS patients, with KLK15 p.Gly226Asp acting through a dominant-negative mechanism that disrupts collagen crosslinking in the extracellular matrix. When researchers find a rare KLK variant in patient sequencing data, generic predictors like CADD can say "this is likely damaging." They can't say whether it matters for connective tissue. KLKVar can. It combines nine disease-specific features into a single score that reflects hEDS-relevant biology, not just general evolutionary conservation.
This is Layer 1 of MoSEDS, a three-part computational series for molecular stratification of hEDS. It mirrors my previous work on ChanVar for CACNA1A, but the biology is different. KLK proteases are enzymes working in the extracellular matrix, not ion channels, and the dominant-negative mechanism in hEDS isn't simple loss of function.
It's a different challenge, for sure. But ain't no mountain high enough 🤠
Cheers,
Angie X.
Note: this project is actively updated. Apologies for any content gaps during this time.
Imagine your genome as a giant instruction manual for building your body. Most of the instructions are fine. But sometimes one letter in one instruction is wrong. That's a variant. Most variants do nothing, and some cause disease.
hEDS is an unusual sort of disease because, until 2024/2025, no one had found a single genetic instruction that was consistently wrong in patients. It was the only major heritable connective tissue disorder with no identified genetic cause, diagnosed purely by a checklist of physical signs.
The Norris lab at MUSC changed that. They found that a family of 15 genes called kallikreins (KLK1 through KLK15), which produce enzymes that cut other proteins in the connective tissue, contain rare damaging variants significantly more often in hEDS patients than in the general population. The specific smoking gun: KLK15 p.Gly226Asp, a single amino acid swap that disrupts how the KLK15 enzyme and a collagen-crosslinking enzyme called LOX organize themselves in the tissue. Without that organization, collagen fibers don't crosslink properly, and you get the lax connective tissue that defines hEDS.
KLKVar is a scoring tool for variants in these genes. If a patient or researcher has a rare change in KLK1, KLK5, KLK15, or any of the other 12 genes in this family, KLKVar computes a score (0 to 1) reflecting how likely that variant is to be pathogenic in an hEDS context. It draws on nine different kinds of evidence: how rare the variant is in the population, whether it destabilizes the protein structure, whether the changed amino acid position is conserved across species, whether it sits near the enzyme's active site, and several others.
The score output is not a diagnosis. KLKVar cannot tell you whether you have hEDS or whether a given variant caused your symptoms. What it can do is help researchers prioritize which variants from a patient sequencing study are most worth investigating further, and help clinicians contextualize genetic findings in patients with a clinical hEDS diagnosis. And if you’re interested in seeing how much KLKVar connects to my previous project ChanVar for CACNA1A variants (hint: a lot), click here.
Kallikreins are serine proteases. They share a conserved catalytic triad (histidine, aspartate, and serine) arranged to cleave peptide bonds in target substrates. The 15 human KLK genes occupy a contiguous 300-kilobase cluster on chromosome 19q13.33, which makes them an unusual genetic neighborhood: a family of biochemically similar enzymes, grouped, regulated in coordination, and collectively governing ECM remodeling, vascular tone, immune signaling, and tissue homeostasis.
KLK15 is expressed in connective and cardiac tissue. Its known substrates include fibronectin and lysyl oxidase (LOX), an enzyme that crosslinks collagen into load-bearing fibers. The 2024 Norris/Gensemer work at MUSC identified KLK15 p.Gly226Asp segregating with hEDS in two pedigrees. The variant alters how KLK15 and LOX compartmentalize within the ECM. It does not eliminate KLK15 activity outright; instead, it disrupts the spatial relationship between the mutant protease and LOX, which impairs the collagen crosslinking that gives connective tissue its tensile strength. This is a dominant-negative mechanism, meaning the mutant protein interferes with the normal protein's function, and one copy of the damaged gene is enough.
Burden analysis across 197-200 sporadic hEDS patients found rare variants (MAF < 1%) enriched in 11 of the 15 KLK genes individually (p < 0.05 each) and across the entire KLK cluster at p = 2.28 × 10⁻¹⁴. The effect isn't limited to KLK15: the signal is spread across the protease family as a whole.
This is the biological context KLKVar is designed to serve. A generic predictor like CADD or AlphaMissense assigns deleteriousness scores based on patterns trained across many disease contexts. When a researcher encounters KLK5 p.Arg34Cys or KLK12 p.Leu88Pro in a patient with hEDS, CADD can tell them the variant is likely damaging. It cannot tell them whether it is likely damaging in the specific way that matters for ECM remodeling in connective tissue. KLKVar can.
KLKVar computes nine features per variant and combines them using logistic regression into a Composite Pathogenicity Score (CPS). Each feature is normalized to 0-1 before entering the model. Higher scores indicate more evidence for pathogenicity.
f1: gnomAD Allele Frequency. Population frequency is the most powerful single predictor of pathogenicity for rare Mendelian variants. Variants absent from gnomAD or present at MAF < 0.0001 score 1.0. Score decays linearly to 0.0 at MAF = 0.01. Source: gnomAD v4 REST API, queried in real time with disk caching.
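The f1 mapping described above can be sketched as a small piecewise-linear function. This is a minimal illustration, not KLKVar's actual code: the function name is hypothetical, `None` is used here to mean "absent from gnomAD", and the linear decay is assumed to start at MAF = 0.0001.

```python
def f1_allele_frequency(maf):
    """Map a gnomAD minor allele frequency to an f1 score in [0, 1].

    maf=None is used here to mean the variant is absent from gnomAD.
    """
    rare_floor = 1e-4    # below this (or absent): score 1.0
    common_ceil = 1e-2   # at or above this: score 0.0
    if maf is None or maf < rare_floor:
        return 1.0
    if maf >= common_ceil:
        return 0.0
    # Linear decay between the two thresholds (assumed start point: rare_floor).
    return (common_ceil - maf) / (common_ceil - rare_floor)
```

Under these assumptions a variant at MAF = 0.00505, the midpoint of the decay range, scores 0.5.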
f2: FoldX Thermodynamic Stability (ΔΔG). FoldX PositionScan computes the change in folding free energy caused by a substitution on the protein's 3D structure. Variants with ΔΔG > +2.0 kcal/mol are strongly destabilizing and score 1.0; stabilizing mutations score 0.0. Runs on AlphaFold2 or PDB structures downloaded for each KLK gene.
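As a rough sketch of how a FoldX ΔΔG value might be turned into the f2 score: the two endpoints (stabilizing scores 0.0, above +2.0 kcal/mol scores 1.0) come from the description above, but the linear ramp between them is an assumption, and the function name is hypothetical.

```python
def f2_stability(ddg_kcal_per_mol):
    """Map a FoldX ddG (kcal/mol) to an f2 score in [0, 1].

    Assumption: scores rise linearly from 0.0 at ddG = 0 to 1.0 at ddG = +2.0.
    """
    strongly_destabilizing = 2.0
    if ddg_kcal_per_mol <= 0.0:
        return 0.0  # neutral or stabilizing substitution
    return min(ddg_kcal_per_mol / strongly_destabilizing, 1.0)
```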
f3: Jensen-Shannon Divergence Conservation. Evolutionarily constrained positions are more likely to be functionally critical. A multiple sequence alignment of KLK1-KLK15 plus vertebrate orthologs is built with MAFFT. For each alignment column, Jensen-Shannon divergence (JSD) measures how different the amino acid distribution is from the alignment-wide background. A strongly conserved column is dominated by one or two residues and so diverges sharply from the background: high JSD = high conservation = high f3 score.
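A per-column JSD calculation could look roughly like the following. It is a sketch on a toy alignment, assuming gaps are simply skipped and a small pseudocount keeps the logarithms finite; the function names and the pseudocount value are illustrative and not taken from KLKVar.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def residue_distribution(residues, pseudocount=1e-6):
    """Frequency vector over the 20 amino acids, with a small pseudocount."""
    counts = np.array([residues.count(a) for a in AMINO_ACIDS], dtype=float)
    counts += pseudocount
    return counts / counts.sum()

def jensen_shannon_divergence(p, q):
    """JSD in bits between two discrete distributions (0 = identical)."""
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log2(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def column_jsd(alignment, col):
    """JSD of one alignment column against the alignment-wide background."""
    column = [seq[col] for seq in alignment if seq[col] in AMINO_ACIDS]
    background = [r for seq in alignment for r in seq if r in AMINO_ACIDS]
    return jensen_shannon_divergence(
        residue_distribution(column), residue_distribution(background)
    )

# Toy alignment: column 0 is invariant (H), column 1 is fully variable.
toy_alignment = ["HACD", "HCDE", "HDEA", "HEAC"]
invariant_jsd = column_jsd(toy_alignment, 0)
variable_jsd = column_jsd(toy_alignment, 1)
```

In this toy case the invariant column diverges further from the background than the variable one, because its distribution is concentrated on a single residue.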
f4: Active Site Proximity. The catalytic triad (His/Asp/Ser) positions are known for each KLK gene from crystal structures and UniProt annotations. Variants within 8 Å of any catalytic triad residue (measured between Cα coordinates) score 1.0, decaying linearly to 0.0 at 16 Å.
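The distance-to-score mapping for f4 can be sketched as follows, assuming Cα coordinates are already available as (x, y, z) tuples in ångströms. The function name and argument layout are hypothetical.

```python
import math

def f4_active_site_proximity(variant_ca, triad_cas):
    """Score a residue by its Cα distance to the nearest catalytic triad Cα.

    variant_ca: (x, y, z) of the variant residue's Cα, in Å.
    triad_cas:  list of (x, y, z) for the His/Asp/Ser Cα atoms, in Å.
    """
    full_score_radius = 8.0    # within this distance: 1.0
    zero_score_radius = 16.0   # at or beyond this distance: 0.0
    nearest = min(math.dist(variant_ca, ca) for ca in triad_cas)
    if nearest <= full_score_radius:
        return 1.0
    if nearest >= zero_score_radius:
        return 0.0
    return (zero_score_radius - nearest) / (zero_score_radius - full_score_radius)
```

A residue whose nearest triad Cα is 12 Å away sits halfway along the decay and scores 0.5.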
f5: ECM Substrate Domain. Published biochemical characterization of KLK substrate specificity defines which protein regions interact with fibronectin, collagen, LOX, and elastin. Variants falling within these curated exosite regions score 1.0; all others score 0.0. For KLK15, the LOX-binding interface (~residues 220-240) receives special annotation as a dominant-negative hotspot.
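Because f5 is a binary lookup, a sketch is short. Only the approximate KLK15 range (residues 220-240) comes from the text above; the table layout and function name are assumptions, and the real curated regions for the other genes are not reproduced here.

```python
# Curated exosite regions per gene as inclusive (start, end) residue ranges.
# Only the approximate KLK15 range is taken from the text; a real table
# would hold curated entries for the other KLK genes as well.
ECM_EXOSITES = {
    "KLK15": [(220, 240)],
}

def f5_ecm_domain(gene, position, regions=ECM_EXOSITES):
    """Return 1.0 if the residue falls inside a curated exosite, else 0.0."""
    in_region = any(start <= position <= end for start, end in regions.get(gene, []))
    return 1.0 if in_region else 0.0
```

With this table, KLK15 position 226 (the site of p.Gly226Asp) scores 1.0.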
f6: CADD-PHRED. The CADD score (v1.7) provides a generic deleteriousness prior trained on human evolutionary patterns. PHRED-scaled CADD is normalized to 0-1 with a ceiling at PHRED 50. This contributes as one feature among nine, not as the primary score, which is the whole point of KLKVar's existence.
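The CADD normalization amounts to dividing by the ceiling and clamping. A minimal sketch, with a hypothetical function name:

```python
def f6_cadd(phred, ceiling=50.0):
    """Normalize a PHRED-scaled CADD score to [0, 1], capped at the ceiling."""
    clamped = min(max(phred, 0.0), ceiling)
    return clamped / ceiling
```

For example, a PHRED score of 25 maps to 0.5, and anything at or above 50 maps to 1.0.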
f7: Grantham Distance. The 1974 Grantham matrix encodes the physicochemical dissimilarity between pairs of amino acids based on composition, polarity, and molecular volume. Cys→Trp is the most radical substitution (distance 215, normalized to 1.0); Ile→Leu is conservative (distance 5, normalized to 0.023). Radical substitutions correlate with loss of protein function across disease contexts.
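The normalization is a division by the matrix maximum of 215. The sketch below carries only the two pairs quoted above as an excerpt; a real implementation would load the full 20×20 Grantham matrix, and the names used here are illustrative.

```python
# Excerpt only: the two pairs quoted in the text. Keys are unordered pairs,
# since Grantham distance is symmetric.
GRANTHAM_EXCERPT = {
    frozenset("CW"): 215,
    frozenset("IL"): 5,
}
GRANTHAM_MAX = 215

def f7_grantham(ref, alt, matrix=GRANTHAM_EXCERPT):
    """Normalize the Grantham distance between two residues to [0, 1]."""
    if ref == alt:
        return 0.0  # synonymous at the protein level
    return matrix[frozenset((ref, alt))] / GRANTHAM_MAX
```

Here `f7_grantham("I", "L")` evaluates to 5 / 215, which rounds to the 0.023 quoted above.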
f8: ClinVar Prior. ClinVar annotations for the same variant or same protein position are weighted into a prior score. A known pathogenic annotation at that position scores 1.0; benign scores 0.0; uncertain significance scores 0.5. The ClinVar flat file is downloaded and filtered to KLK genes on setup.
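The ClinVar prior can be sketched as a lookup on the clinical significance string. The three mapped values come from the description above; treating a missing or unrecognized annotation as 0.5 is an assumption made for this sketch, as are the names.

```python
CLINVAR_PRIOR = {
    "pathogenic": 1.0,
    "uncertain significance": 0.5,
    "benign": 0.0,
}
NO_ANNOTATION_PRIOR = 0.5  # assumption: no ClinVar record is treated as uninformative

def f8_clinvar_prior(significance):
    """Map a ClinVar clinical significance string to a prior in [0, 1]."""
    if significance is None:
        return NO_ANNOTATION_PRIOR
    return CLINVAR_PRIOR.get(significance.strip().lower(), NO_ANNOTATION_PRIOR)
```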
f9: GTEx Expression Context. Not all KLK genes are equally relevant to hEDS. KLK3 (PSA) is almost exclusively expressed in prostate tissue. KLK15 is expressed in fibroblasts and cardiac tissue. GTEx v10 median TPM values across hEDS-relevant tissues are combined with curated tissue priors to produce an expression context weight. A damaging variant in KLK3 is less likely to manifest in connective tissue than the same variant in KLK15; f9 captures that.
The logistic regression model is trained on 34 labeled variants: 15 pathogenic (from Norris/Gensemer lab supplementary data and ClinVar) and 19 benign (common gnomAD variants with no ClinVar pathogenic annotation). This is a small training set by the standards of genome-wide models, and the CPS should be interpreted accordingly. KLKVar is a disease-specific prior calibrated to KLK biology in hEDS, not a broad-population deleteriousness predictor.
Training uses scikit-learn's LogisticRegression with L2 regularization and balanced class weights to account for the pathogenic:benign imbalance. Calibration is applied via Platt scaling (sigmoid calibration with 5-fold cross-validation) to ensure that a CPS of 0.7 reflects roughly 70% probability of pathogenicity in the training distribution, not an uncalibrated logit output.
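The training and calibration setup described here might look roughly like this in scikit-learn. The feature matrix is synthetic placeholder data with the same shape as the training set (34 variants, 9 features), not real KLKVar features, and the variable names are illustrative.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the 34 x 9 feature matrix: 15 pathogenic, 19 benign.
rng = np.random.default_rng(0)
y = np.array([1] * 15 + [0] * 19)
X = np.clip(rng.uniform(0.0, 1.0, size=(34, 9)) + 0.3 * y[:, None], 0.0, 1.0)

# L2 regularization is LogisticRegression's default penalty.
base_model = LogisticRegression(class_weight="balanced", max_iter=1000)

# Platt scaling: sigmoid calibration fitted with 5-fold cross-validation.
calibrated_model = CalibratedClassifierCV(base_model, method="sigmoid", cv=5)
calibrated_model.fit(X, y)

# Calibrated probability of the pathogenic class, i.e. the CPS.
cps = calibrated_model.predict_proba(X)[:, 1]
```

With only 15 pathogenic examples, each of the five calibration folds holds out about three of them, which is one reason to read the calibrated probabilities cautiously.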
Model evaluation uses stratified 5-fold cross-validation with AUC as the primary metric. Per-variant confidence intervals are computed over 1,000 iterations by perturbing feature values with Gaussian noise (σ = 0.05) and recomputing predictions. Strictly this is a Monte Carlo perturbation rather than a bootstrap, since no data are resampled, and the resulting 95% CI represents uncertainty in the feature measurements themselves.
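A sketch of both evaluation steps, again on synthetic placeholder data. The helper name `perturbation_interval` is hypothetical, clipping perturbed features back to the 0-1 range is an assumption, and an uncalibrated logistic regression stands in for the full model to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

def perturbation_interval(model, features, n_iter=1000, sigma=0.05, seed=0):
    """95% interval of the predicted score under Gaussian feature noise."""
    rng = np.random.default_rng(seed)
    noisy = features + rng.normal(0.0, sigma, size=(n_iter, features.size))
    noisy = np.clip(noisy, 0.0, 1.0)  # assumption: keep features in [0, 1]
    scores = model.predict_proba(noisy)[:, 1]
    low, high = np.percentile(scores, [2.5, 97.5])
    return float(low), float(high)

# Synthetic stand-in for the 34 x 9 feature matrix: 15 pathogenic, 19 benign.
rng = np.random.default_rng(42)
y = np.array([1] * 15 + [0] * 19)
X = np.clip(rng.uniform(0.0, 1.0, size=(34, 9)) + 0.3 * y[:, None], 0.0, 1.0)

model = LogisticRegression(class_weight="balanced", max_iter=1000)

# Stratified 5-fold cross-validated AUC.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc_per_fold = cross_val_score(model, X, y, cv=folds, scoring="roc_auc")

# Interval for a single variant's score from the model fitted on all data.
model.fit(X, y)
ci_low, ci_high = perturbation_interval(model, X[0])
```

The interval width here reflects only the injected feature noise; it says nothing about uncertainty in the fitted weights, which would need resampling of the training variants.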
Feature weights from the logistic regression identify which signals dominate the score. Based on prior biological expectations: f1 (MAF), f4 (active site proximity), and f5 (ECM domain) should carry the most weight in hEDS-specific contexts. f6 (CADD) should be a modest contributor.