A Probabilistic Deep Learning Framework for Structure-Based Binding Affinity Prediction with Epistemic and Aleatoric Uncertainty Decomposition
March 2026 - May 2026
I did not begin this project from zero. Protein structures were deeply familiar to me from Archaeon, and I had spent enough time around molecular machine learning to be comfortable building models in this space. What was less familiar was the specific landscape of drug discovery, wherein the literature is extremely dense, highly specialized, and often written with assumptions I did not already have.
What captured my attention was not the biology itself, but a pattern in the tools built around it. Predictions are everywhere: binding affinities, docking scores, ranking metrics, clean numerical outputs that convey an illusion of multi-decimal-place precision and are treated as though that precision carries inherent meaning. What is rarely addressed is the fragility beneath those numbers: the assumptions, the hidden uncertainties, the ways in which they might fail. It is my belief that this omission reflects a deeper cultural tendency to mistake appearance for certainty, and to allow systems to assert authority simply by being everywhere.
Numbers have a peculiar power. Once written, they assert themselves, creating a sense of absolute resolution even when none may exist. In domains like drug discovery, this illusion is consequential. One prediction can shape the path of experimentation, dictate the allocation of resources, and ultimately influence what treatments reach patients. The system operates as if its outputs have accounted for their own fragility, yet often they have not.
What interested me was not merely whether a model could predict binding affinity, but whether it could discern the limits of its own knowledge. This question carries implications beyond molecular binding, and speaks to a principle I have come to recognize: that intelligence without self-awareness can be both brittle and misleading. Accuracy matters less than trustworthiness when decisions carry weight.
This is the gap that led to Affinex. My goal was not just to predict binding affinity, but to build a system that could express the limits of its own knowledge in a way that actually affects decisions. Not abstractly, but in a form that someone could act on. Accuracy, in this context, is not enough. A model that is occasionally correct but consistently overconfident is far more dangerous than one that is slightly less accurate but honest about uncertainty. Affinex is an attempt to build that honesty into the prediction itself.
Cheers,
Angie X.
At some point, I stopped taking the numbers at face value. They moved too easily through systems that were far more uncertain than their outputs suggested. A binding affinity score could originate from layers of approximation, limited data, and highly specific assumptions, yet arrive as a single, stable value that others were expected to act on. The issue was not that these numbers existed, but that nothing in their presentation reflected the conditions under which they should be trusted.
Affinex is built to reintroduce that missing layer. It takes the standard task of predicting protein–ligand binding affinity and pairs it with an explicit estimate of uncertainty grounded in the model’s own behavior. Each prediction is evaluated in terms of how familiar the input is relative to the training data, how sensitive the output is to small changes, and how coherent the model’s internal representations remain across variations. The result is not an isolated score, but a prediction that carries information about its own reliability.
This shifts how the output functions in practice. When uncertainty is visible, the number stops acting as a directive and starts acting as a signal that can be weighed. High-confidence predictions can be pursued with intent, while unstable ones can be deprioritized or examined further before resources are committed. In a process where individual predictions influence entire experimental trajectories, that distinction compounds quickly. Affinex is concerned with that compounding effect, ensuring that what appears precise is also proportionate to what is actually known.
Early-stage drug discovery is, at its core, a filtering problem.
Pharmaceutical pipelines begin with enormous chemical spaces. Millions of candidate molecules are computationally screened to identify a much smaller subset worth testing in the lab. Every decision in this pipeline depends on trust. Which predictions are reliable enough to justify real-world cost? Which compounds are worth the $10,000-per-assay investment of wet lab work?
Most binding affinity models answer with a single number. For example: "this molecule binds at 150 nanomolar." That number appears definitive, but it hides the underlying variability of the system. After all, binding affinity is not a fixed property. It emerges from a distribution of conformations, environmental conditions, and measurement noise. The same protein-ligand pair tested ten times in a lab can yield K_d values spanning 100-200 nM due to variation in temperature, buffer composition, and instrument calibration.
At the same time, ML models are not perfectly informed. They extrapolate from finite training data, often encountering molecules or proteins that differ substantially from anything they've seen before. When a model makes a prediction for a novel protein with no homologs in the training set, how certain should we be? Ignoring this problem leaves us operating blind.
This leads to two distinct sources of uncertainty. The first is epistemic uncertainty, which reflects what the model does not know due to limited training data in a given region of chemical space. Acquire more data, and epistemic uncertainty decreases. The second is aleatoric uncertainty, which reflects inherent noise in the system itself: measurement error, conformational heterogeneity, environmental effects. No amount of additional data reduces aleatoric uncertainty; it is, in essence, irreducible.
The gap was clear. Binding affinity prediction models had reached strong performance using graph neural networks, but uncertainty quantification remained fragmented across research papers and rarely appeared in practical tools. The literature was dense with Bayesian deep learning papers, Monte Carlo dropout studies, and calibration analyses, yet there existed no unified framework that brought these ideas together for molecular binding.
Project Objective: Build a system that predicts both affinity and uncertainty, validate it on real datasets, and make the results interpretable and usable for real-world drug discovery decisions.
SMILES and Molecular Graphs:
Before I could build a model, I had to understand how molecules are represented digitally. SMILES (Simplified Molecular Input Line Entry System) is a string notation for molecules: CC(=O)Oc1ccccc1C(=O)O is aspirin. It is concise but opaque to neural networks. I needed to convert SMILES into a structure that neural networks could actually learn from.
The solution: molecular graphs. A molecular graph has atoms as nodes and chemical bonds as edges. I used RDKit to parse SMILES and extract per-atom and per-bond features:
Atom features: atomic number (which element), formal charge, degree (how many bonds), hybridization state (sp, sp², sp³), aromaticity, hydrogen count.
Bond features: bond order (single, double, triple, aromatic), ring membership, conjugation.
This representation preserves chemical structure without explicit 3D coordinates. It captures what matters: connectivity and chemistry, not absolute positions. Two aspirin molecules thousands of Angstroms apart in space have identical graphs.
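To make the representation concrete, here is a toy molecular graph in plain Python. The feature values are illustrative stand-ins rather than RDKit output, and formaldehyde is used only because it is tiny:

```python
# Minimal sketch of a molecular graph: atoms as nodes, bonds as edges.
# Features are hand-written placeholders, not real RDKit-derived values.
# Molecule: formaldehyde (SMILES: C=O), two heavy atoms, one double bond.

atoms = [
    {"atomic_num": 6, "formal_charge": 0, "degree": 1, "aromatic": False, "h_count": 2},  # carbon
    {"atomic_num": 8, "formal_charge": 0, "degree": 1, "aromatic": False, "h_count": 0},  # oxygen
]

bonds = [
    # (atom_i, atom_j, bond features)
    (0, 1, {"order": 2.0, "in_ring": False, "conjugated": False}),
]

# Adjacency list: for each atom, the atoms it is bonded to.
adjacency = {i: [] for i in range(len(atoms))}
for i, j, _ in bonds:
    adjacency[i].append(j)
    adjacency[j].append(i)

print(adjacency)  # {0: [1], 1: [0]}
```

In the actual pipeline, RDKit parses the SMILES string and fills in these features per atom and per bond; the structure of the data is the same.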
Graph Neural Networks (GNNs):
GNNs learn by iterating message passing: each atom aggregates information from its neighbors. In one message passing step:
Each atom sends a 'message' to neighbors (via learned MLP)
Each atom aggregates incoming messages
Each atom updates its hidden state
After k layers, each atom's representation incorporates information from k-hop neighborhoods. I used 3-4 GNN layers, so atoms integrate information from atoms 3-4 bonds away. Then I pooled atoms via mean aggregation to get a single molecular embedding (128 dimensions).
This works because most chemical properties are local. Atoms influence the ones they are bonded to, and that influence propagates outward. A graph structure captures that naturally.
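The message-passing loop can be sketched in a few lines of NumPy. The random matrices stand in for learned MLPs and the three-atom chain is a placeholder graph; this illustrates the mechanism, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph: 3 atoms in a chain (0-1-2), each with a 4-dim feature vector.
h = rng.normal(size=(3, 4))                  # initial atom states
neighbors = {0: [1], 1: [0, 2], 2: [1]}      # bonds define the neighborhoods

W_msg = rng.normal(size=(4, 4))              # stand-in for a learned message MLP
W_upd = rng.normal(size=(8, 4))              # stand-in for a learned update MLP

def message_passing_step(h):
    new_h = np.empty_like(h)
    for i, nbrs in neighbors.items():
        # 1. each neighbor sends a message through the shared message network
        msgs = np.tanh(h[nbrs] @ W_msg)
        # 2. aggregate incoming messages (mean is permutation-invariant)
        agg = msgs.mean(axis=0)
        # 3. update the atom's hidden state from [own state, aggregated messages]
        new_h[i] = np.tanh(np.concatenate([h[i], agg]) @ W_upd)
    return new_h

for _ in range(3):                           # 3 layers -> 3-hop receptive field
    h = message_passing_step(h)

mol_embedding = h.mean(axis=0)               # mean pooling over atoms
print(mol_embedding.shape)                   # (4,)
```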
Proteins are a different kind of problem. Binding happens in three-dimensional space, so the representation has to reflect geometry.
PDB Format and Structure
To work with protein structure computationally, I used PDB files, which contain atomic coordinates derived from X-ray crystallography. Instead of modeling every atom, I simplified each protein to its alpha-carbon backbone, using one point per residue. This removes side-chain detail but preserves the overall fold and the spatial relationships that define most binding interactions.
Each residue was represented by both its position and its chemical context. Features included amino acid identity, charge, hydrophobicity, secondary structure classification, and B-factor as an indicator of structural flexibility. The result is best understood as a point cloud, an unordered set of residues embedded in 3D space.
Geometric Deep Learning on Point Clouds
To process this structure, I used a geometric deep learning approach based on local neighborhoods. For each residue, the model identifies nearby residues in three-dimensional space and aggregates their features using a learned transformation. This reflects the way proteins behave physically, since residues interact primarily with those that are spatially close rather than those that are adjacent in sequence.
Each layer applies the same pattern. The model gathers neighboring residues, combines their features through a small neural network, aggregates the result, and stabilizes it through normalization. Repeating this process allows information to propagate across the structure, gradually building a representation that encodes both geometry and chemical context. A final pooling step produces a fixed-size embedding that represents the protein regardless of its length.
The exact process:
Find k-nearest neighbors for each residue in 3D space
For each neighbor, concatenate residue features and apply MLP
Sum/average aggregation over neighbors
Layer normalization for stability
Repeat for several layers, then apply global mean pooling
Result: a fixed-size protein embedding, no matter how many residues.
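The steps above can be sketched as follows. The random coordinates, features, and weight matrix are placeholders, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy protein: 10 residues, each with a 3D alpha-carbon position and 6 features.
coords = rng.normal(size=(10, 3))
feats = rng.normal(size=(10, 6))
W = rng.normal(size=(12, 6))                 # stand-in for the learned MLP

def knn_layer(coords, feats, k=3):
    out = np.empty_like(feats)
    for i in range(len(coords)):
        # k-nearest neighbors in 3D space (excluding the residue itself)
        d = np.linalg.norm(coords - coords[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]
        # concatenate center and neighbor features, transform, then average
        pairs = np.concatenate(
            [np.repeat(feats[i][None], k, axis=0), feats[nbrs]], axis=1)
        out[i] = np.tanh(pairs @ W).mean(axis=0)
    # layer normalization for stability
    out = (out - out.mean(axis=1, keepdims=True)) / (out.std(axis=1, keepdims=True) + 1e-5)
    return out

h = knn_layer(coords, feats)
h = knn_layer(coords, h)                     # repeat layers
protein_embedding = h.mean(axis=0)           # global mean pooling: fixed size
print(protein_embedding.shape)               # (6,)
```

Because the pooling is a mean over residues, the output dimension is the same whether the protein has 50 residues or 500.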
At this stage, the model produces two separate embeddings. One represents the ligand as a molecular graph. The other represents the protein as a structured point cloud. The problem then becomes how to combine them in a way that reflects interaction rather than simple coexistence.
A direct concatenation treats both embeddings as independent summaries. It does not capture which parts of the protein are relevant for a specific ligand. Binding is not uniform across the protein surface, so the model needs a mechanism to focus on the regions that matter.
To address this, I used a cross-attention mechanism. The ligand embedding acts as a query over the protein. Each protein residue provides a key and a value. The model computes attention weights by comparing the ligand representation to each residue, then uses those weights to form a weighted combination of protein features.
This process effectively answers the question: given this ligand, which residues should influence the prediction?
The resulting attention weights can be interpreted as a heatmap over the protein structure. They indicate which regions the model considers important for binding. While this does not provide a full mechanistic explanation, it offers a useful and interpretable signal about where interactions are likely occurring.
To increase flexibility, I used multiple attention heads. Each head operates in a different subspace, allowing the model to capture different interaction patterns at the same time. The attended protein features are then combined with the ligand embedding through a small neural network. The result is a joint representation that reflects both the identity of the ligand and the specific regions of the protein that influence binding.
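A single-head version of this cross-attention can be sketched in NumPy. The projection matrices are random stand-ins for learned parameters, and the real model runs several heads in parallel:

```python
import numpy as np

rng = np.random.default_rng(2)

d = 8
ligand = rng.normal(size=(d,))               # ligand embedding (the query)
residues = rng.normal(size=(10, d))          # per-residue protein features

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))  # learned projections

def cross_attention(ligand, residues):
    q = ligand @ Wq                          # query from the ligand
    K = residues @ Wk                        # keys, one per residue
    V = residues @ Wv                        # values, one per residue
    scores = K @ q / np.sqrt(d)              # ligand-residue similarity
    w = np.exp(scores - scores.max())
    w /= w.sum()                             # softmax attention weights
    return w @ V, w                          # weighted protein summary + heatmap

attended, weights = cross_attention(ligand, residues)
joint = np.concatenate([ligand, attended])   # fused ligand-protein representation
print(weights.round(2))                      # interpretable per-residue weights
```

The `weights` vector is exactly the heatmap described above: it sums to one across residues, so each entry can be read as the fraction of attention the ligand places on that residue.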
The joint representation is passed to a regression head that predicts both a binding affinity and an associated uncertainty.
The model outputs two values. The first is the mean prediction, which represents the expected binding affinity. The second is a variance term, which represents aleatoric uncertainty. This captures variability that comes from the system itself, such as measurement noise and conformational changes.
Training is done using a Gaussian negative log-likelihood loss. This objective encourages accurate predictions while also penalizing unjustified confidence. If the model predicts a narrow uncertainty range and is wrong, the penalty increases significantly. As a result, the model learns to express uncertainty when appropriate instead of defaulting to overconfidence.
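The loss can be written down directly. A minimal sketch, assuming the model predicts log-variance for numerical stability (a common convention; the exact parameterization here is illustrative):

```python
import numpy as np

def gaussian_nll(y_true, mean, log_var):
    """Negative log-likelihood of y_true under N(mean, exp(log_var)).

    Predicting log-variance keeps the variance positive without constraints.
    """
    var = np.exp(log_var)
    return 0.5 * (log_var + (y_true - mean) ** 2 / var + np.log(2 * np.pi))

# Same prediction error, different claimed confidence:
confident_wrong = gaussian_nll(y_true=7.0, mean=5.0, log_var=np.log(0.1))
hedged_wrong = gaussian_nll(y_true=7.0, mean=5.0, log_var=np.log(2.0))
print(confident_wrong > hedged_wrong)  # True: overconfidence is penalized
```

The squared-error term is divided by the predicted variance, so claiming a tiny variance while being wrong is expensive, while the `log_var` term prevents the model from hedging everything with huge variances.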
Aleatoric uncertainty captures noise in the data, but it does not account for uncertainty in the model itself. To estimate this, I used Monte Carlo dropout during inference. Dropout remains active at test time, and the model is evaluated multiple times with different internal configurations.
Each forward pass produces a slightly different prediction. If these predictions vary widely, it indicates that the model is uncertain about the input. The variance across these predictions is taken as epistemic uncertainty.
The total uncertainty is the sum of these two components. One reflects inherent noise in the system. The other reflects the model’s lack of knowledge. Together, they provide a more complete picture of prediction reliability.
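A minimal sketch of the decomposition, using a toy two-output network with random weights in place of the trained model:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy regression head; random weights stand in for a trained network.
W1 = rng.normal(size=(8, 16)) * 0.5
W2 = rng.normal(size=(16, 2)) * 0.1          # outputs: [mean, log_variance]

def forward(x, dropout_rate=0.2):
    h = np.tanh(x @ W1)
    # Dropout stays ACTIVE at inference: a fresh random mask each pass.
    mask = rng.random(h.shape) > dropout_rate
    h = h * mask / (1 - dropout_rate)
    mean, log_var = h @ W2
    return mean, log_var

def mc_dropout_predict(x, n_passes=50):
    means, ale_vars = [], []
    for _ in range(n_passes):
        mean, log_var = forward(x)
        means.append(mean)
        ale_vars.append(np.exp(log_var))
    means = np.array(means)
    epistemic = means.var()                  # spread across stochastic passes
    aleatoric = float(np.mean(ale_vars))     # model's own predicted noise
    return means.mean(), epistemic, aleatoric

x = rng.normal(size=(8,))
pred, epi, ale = mc_dropout_predict(x)
total = epi + ale                            # total predictive uncertainty
print(f"prediction={pred:.3f}  epistemic={epi:.4f}  aleatoric={ale:.4f}")
```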
Training performance alone does not determine usefulness. The model must generalize to new data and behave reliably under realistic conditions. I therefore evaluated the system on two primary datasets. PDBbind provides curated protein–ligand complexes with relatively high-quality measurements. BindingDB offers broader coverage but includes more noise and variability.
The model achieves reasonable performance across both. Root mean squared error falls within expected ranges for this task, and rank correlation remains strong, which is important for screening applications. More importantly, the uncertainty estimates are well calibrated. Confidence intervals align closely with observed outcomes, which suggests that the model’s uncertainty estimates are meaningful rather than arbitrary.
To simulate real-world deployment, I performed a temporal split. The model was trained on older structures and tested on newer ones. This setup reflects the actual use case, where predictions are made on proteins that were not available during training. Performance degrades slightly under this setting, which is expected, but remains stable enough to be useful.
There are clear limitations. The model is trained primarily on X-ray crystallography data, so it does not transfer cleanly to other modalities such as cryo-EM. It struggles with weak binders and uncommon interaction patterns that are underrepresented in the data. Like most uncertainty methods, it can fail under extreme distribution shift, particularly when encountering proteins with no meaningful similarity to the training set.
There is also a computational tradeoff. Estimating epistemic uncertainty requires multiple forward passes, which increases inference time. This is manageable for screening but not ideal for real-time applications.
Interpretability remains partial. Attention highlights relevant regions of the protein, but it does not explain the underlying chemistry. Understanding specific interactions still requires additional analysis.
In my introduction to this project, I spoke about numbers asserting themselves, creating illusions of precision. Affinex is an attempt to resist that illusion. Every prediction includes not only an affinity estimate but a confidence interval, an uncertainty breakdown, and a set of interpretable attention weights.
This approach has limitations. The model can be wrong. The uncertainty can be underestimated. The attention weights might highlight irrelevant residues (or miss important ones). But at least the system is honest about what it knows and does not know.
In drug discovery (and global society in general), honesty is underrated. The field has adopted the language of precision without asking whether that precision means anything. Affinex asks: what do we actually know, and what should we admit we do not know?
The answer, it turns out, is: less than we think we know, and more uncertainty than we usually admit. Quite uncomfortable, but uncomfortable truths are much more useful than comfortable illusions when decisions carry weight.
This framework is open-sourced and available on my Github portfolio. The trained model is provided with clear documentation of its limitations, benchmarks, and use cases. I hope it is useful not because it is perfect, but because it embodies a principle: intelligence without uncertainty quantification is not really intelligence. It is just confident guessing. Affinex tries to be smarter than that.
Cheers,
Angie X.
This project is open source at github.com/axshoe/Stratum.