A Universal Safety Framework for Medical AI Using Worst-Case Bias Certification and Influence-Based Attribution Validated in Skin Cancer Screening
April 2025 – May 2026
There is a specific kind of failure in applied technology that is harder to perceive than outright malfunction. A system that breaks completely is legible. It fails loudly, leaves evidence, invites correction. But a system that works on most people while quietly failing the rest produces a deceptive coherence. It reports good numbers and passes tests. The people it harms most are, statistically speaking, least likely to be in the test set. This is the failure mode I spent the better part of two years trying to build tools to detect.
When I built Dermi, an AI skin cancer screening app, in 2024–25, I ran standard validation metrics on my CNN model: accuracy, recall, F1. The numbers looked reasonable. Then I started reading the literature more carefully, and discovered that what the numbers were hiding might kill someone. Specifically: less than 10% of dermatological training images in the most widely used datasets represent dark skin tones. Models trained on these datasets can exhibit differential false negative rates across skin tones at specific operating thresholds while reporting globally acceptable average accuracy. The average accuracy tells you nothing about who the model fails, at what rate, and under what clinical conditions.
The standard metric conceals what matters most. This is not just a technical problem. It is an epistemological one: the diagnostic question was being asked incorrectly. Asking "how accurate is this model?" when the correct question is "how much worse does this model perform on dark skin, at its worst operating point, and which training images made it that way?" is the difference between a thermometer and a pressure gauge. They both report a number. Only one tells you when the boiler is about to fail.
DermEquity is my attempt to build the pressure gauge. The framework has three components: a deliberately balanced benchmark dataset (the DermEquity Benchmark), a worst-case deployment stress-test metric I call the Worst-Case Underdiagnosis Gap (WCUG), and a data attribution system I call Influence-Based Bias Attribution. Together, they transform fairness evaluation from something you do after a model is trained to something that actively informs what data to use and what deployment thresholds to set. This is not incremental improvement over existing tools. It is a different question about what safety evaluation should mean.
Cheers,
Angie X.
*Note: this project is not fully polished yet. Sorry if there are discrepancies or content gaps; I'm working on it!
Most AI performance reports are like a student's GPA: they tell you average performance and almost nothing else. A model that is 80% accurate overall might be 90% accurate on light skin and 65% accurate on dark skin, and the aggregate number absorbs both into a smooth, misleading 80. This matters enormously in medical AI, where the cost of a missed diagnosis is not a grade reduction. It is a melanoma that isn't caught in time.
The compounding problem is that bias in medical AI is not fixed. Models can look fair at one detection threshold (the cutoff above which you flag something as malignant) and catastrophically unfair at another. Clinicians set operating thresholds based on context: a high-risk screening clinic might lower the threshold to catch more potential cancers. A model that appears equitable at the default threshold can spike to a 30% worse miss rate on dark skin when that threshold shifts. No existing tool was measuring this.
DermEquity has three components:
The DermEquity Benchmark is a curated set of 1,658 skin lesion images, deliberately balanced by both skin tone and diagnosis, designed specifically as a stress-test set (not a representative sample).
WCUG (worst-case underdiagnosis gap) sweeps all possible classification thresholds and finds the single worst-case moment: the maximum gap between the false negative rate on dark skin and the false negative rate on light skin, across every possible operating point. Think of it as asking: "at the worst possible threshold, how many more melanomas does this model miss on dark skin?"
Influence-Based Bias Attribution goes one step further: by systematically retraining models with individual training images removed, it identifies which specific images are making worst-case bias worse, and proves that removing them reduces it. The result is a framework that tells you not just that a model is biased, but how bad it gets, where it gets that bad, and whose data caused it.
Dermi began in 2024 as a fairly conventional machine learning project: I wanted to build a smartphone app that could screen skin lesions for potential malignancy using a CNN, and make it free and accessible. The personal motivation came from a friend's family member facing a delayed diagnosis. The technical motivation came from roughly five months of self-directed study in ML, deep learning, and medical image analysis, followed by around 20 months of actual building.
The model I trained, based on InceptionV3 with transfer learning from ImageNet, achieved 83% test accuracy with 87% sensitivity for malignant lesions. On paper, those are solid numbers. The app launched with a waitlist and real users. I was satisfied, and then I read Groh et al. (2021).
The Fitzpatrick17k dataset paper, which analyzed one of the largest publicly available dermatological image datasets, reported that only around 9.5% of clinically labeled skin lesion images with both skin tone and diagnosis labels represent dark skin tones. The implications took a while to land. My training data, sourced from ISIC and HAM10000, had similar representation problems. A model trained on 90% light skin images doesn't merely perform slightly worse on dark skin. At certain classification thresholds, it can produce dramatically different false negative rates across groups while still reporting acceptable global metrics. I had built something that worked. I had not verified for whom, at what operating points, and under what conditions it failed.
The first version of what became DermEquity started as a fairness audit of Dermi. It became something considerably more ambitious when I realized the audit tools available were not adequate for the questions I was asking. Average fairness metrics give average answers. Deployment decisions are not made on averages. They are made on specific thresholds, specific populations, specific clinical contexts. I needed a worst-case metric.
The original plan involved HAM10000, one of the most widely used dermoscopic datasets. After two weeks of data preparation work, I discovered there was effectively no usable overlap between HAM10000 and Fitzpatrick17k: different image ID systems, different sources, different annotation frameworks. Fitzpatrick17k was the only dataset with both skin tone labels (using the Fitzpatrick scale) and clinical diagnosis labels (benign/malignant) at scale. Everything pivoted.
After filtering Fitzpatrick17k to images with known skin tone and binary diagnosis labels, the available pool was 4,320 images. The skin tone distribution was stark: 53.5% light skin, 37.0% medium skin, and only 9.5% dark skin.
This distribution is the problem. It is also the data. There are exactly 411 dark skin images available with both tone and diagnosis labels in the entire filtered Fitzpatrick17k. DermEquity uses all of them for the benchmark, and none of them for baseline model training. This was a deliberate stress-test philosophy: you cannot evaluate worst-case bias by putting your worst-case cases in the training set. The design was intentional, and it required defending.
The final DermEquity Benchmark consists of 1,658 images (499 light, 750 medium, 409 dark), downloaded and verified at a 99.8% success rate. The training set consists of 2,658 images (light and medium skin only, with 0% dark skin representation at baseline). From this split, four InceptionV3 models were trained: a baseline (0% dark skin training), ablation A (5% dark skin), ablation B (10% dark skin), and an improved model with data augmentation.
*My partners and I do not claim 0% dark skin training is realistic. We claim it is the correct stress test. Most deployed dermatology AI models were trained on datasets with this level of underrepresentation.
Before building WCUG, I needed to understand what existing fairness metrics could and couldn't tell me. I spent several weeks reading the foundational fairness literature: Hardt, Price, and Srebro (2016) on equalized odds, Guo et al. (2017) on calibration error, Verma and Rubin (2018) on the multiplicity of fairness definitions, and various applied medical AI fairness papers from 2019 onward.
The central insight was uncomfortable: average fairness metrics are structurally incapable of capturing threshold-dependent bias. If you evaluate a model at a single threshold and compare false negative rates across groups, you are only measuring fairness at one operating point. Clinical deployment doesn't work that way. Different institutions set different thresholds. Screening programs for high-risk populations lower thresholds. The same model can appear equitable at one operating point and catastrophically unfair at another. No existing metric was sweeping thresholds to find the worst case.
Two supporting metrics, implemented before WCUG, helped establish that average metrics were truly missing signal: CCG and FPNAI.
CCG: Confidence Calibration Gap
CCG measures whether the model's confidence is differentially miscalibrated across skin tones. A miscalibrated model doesn't just get predictions wrong; it's wrong with inappropriate confidence. On dark skin lesions, overconfidence in incorrect predictions is a specific failure mode with clinical implications: a clinician trusting the model's confidence score would have even less reason to override a wrong prediction.
CCG Formula
calibration_error(group) = |mean_confidence(group) − actual_accuracy(group)|
CCG = calibration_error(dark) − calibration_error(light)
FPNAI: False Positive/Negative Asymmetry Index
FPNAI captures whether error is directional across skin tones. Negative values indicate underdiagnosis on dark skin (missed cancers), which is the clinically dangerous direction. Positive values indicate overdiagnosis on dark skin (unnecessary biopsies). The direction matters as much as the magnitude.
FPNAI Formula
FPNAI = (FPR_dark − FPR_light) − (FNR_dark − FNR_light)
At baseline, CCG = 0.0476 and FPNAI = +0.0273. Both metrics improved monotonically as dark skin representation increased through the ablation series. But here is the critical finding: standard accuracy metrics showed no alarm at any of the four model configurations. Average accuracy across all four models was around 76.4%, and the gap in average dark skin accuracy was modest. The supporting metrics revealed hidden miscalibration and directional error that standard evaluation was structurally unable to detect. This is the setup for WCUG.
The conceptual leap that produced WCUG came from thinking about how other high-stakes industries handle the difference between typical performance and worst-case performance. FDA pharmaceutical testing does not evaluate drugs at typical doses and declare them safe. It tests at extreme doses, under stress conditions, across vulnerable subpopulations. Aircraft are not certified based on performance at standard cruising altitude. They are stress-tested at operating limits.
Medical AI is deployed at a chosen threshold. That threshold is not fixed. Clinicians adjust it. Screening programs adjust it. Individual institutions adjust it based on local patient risk profiles. If you evaluate fairness at only one threshold, you are reporting the thermometer reading in one room of a burning building. WCUG asks a different question: across all possible operating thresholds, what is the maximum gap between the false negative rate on dark skin and the false negative rate on light skin?
WCUG Definition
For each threshold τ ∈ [0, 1]:
FNR_dark(τ) = P(prediction < τ | malignant, dark skin)
FNR_light(τ) = P(prediction < τ | malignant, light skin)
WCUG = max_τ |FNR_dark(τ) − FNR_light(τ)|
Also computed:
τ* = argmax_τ |FNR_dark(τ) − FNR_light(τ)|
CI_WCUG = bootstrapped 95% confidence interval (n=1,000 iterations)
The formal claim for WCUG is not mathematical novelty. The maximum of an absolute difference is not a new mathematical object. The claim is operational novelty: this is the first worst-case deployment stress-test metric designed for medical AI with explicit threshold-sweep interpretation. Prior work on worst-case fairness (Dwork et al. 2012, Hashimoto et al. 2018) focused on theoretical guarantees under distributionally robust optimization. DermEquity operationalizes this thinking into a clinical deployment diagnostic that produces actionable threshold guidance, not just theoretical bounds.
I also added a sensitivity analysis: WCUG was computed across benchmark subsamples of 500, 1,000, and 1,658 images, with bootstrapped confidence intervals at each size. The purpose was to establish that 1,658 images is sufficient for stable WCUG estimation. The intervals converge above approximately 1,000 images, validating the benchmark size as adequate for the metric.
The most striking result was the bias direction reversal. In the improved model (trained with augmentation and modestly more diverse data), dark skin patients were actually not disadvantaged at any threshold relative to light skin. FPNAI shifted from +0.0273 to −0.0207. CCG improved from 0.0476 to 0.0083, an 82.6% reduction. All three metrics confirmed the same story, and WCUG told the story that average accuracy alone could not.
Once WCUG established that worst-case bias existed and was measurable, the next question was causative, or at least as close to causative as a non-interventional study can get: which specific training images were amplifying worst-case bias, and could removing them reduce it?
The methodological foundation for this comes from Koh and Liang (2017), who introduced influence functions as a framework for tracing model predictions back to specific training examples. The core idea is elegant: if you train a model on all N images, then retrain it leaving out image i, and measure how much the outcome of interest changes, you get a direct estimate of that image's influence on the outcome. In the context of fairness rather than accuracy, you can substitute WCUG as the outcome of interest, and the same framework produces influence scores for each image with respect to worst-case bias.
Influence Score Definition
For each training image i:
1. Train M_all on all 2,658 training images
2. Compute WCUG_all = WCUG(M_all, DermEquity Benchmark)
3. Train M_{−i} on all images except i
4. Compute WCUG_{−i} = WCUG(M_{−i}, DermEquity Benchmark)
Influence_i = WCUG_all − WCUG_{−i}
Positive Influence_i → image i amplifies worst-case bias
Negative Influence_i → image i mitigates worst-case bias
Full Shapley values over 2,658 images would require 2^2,658 model retrainings, which is computationally intractable. The Leave-One-Out (LOO) approximation is tractable and well-grounded: it is a first-order approximation of the Shapley value, and provides the directional information needed for practical data curation decisions. I was careful throughout to call this "LOO approximation of Shapley values" rather than "Shapley values," because the distinction is real and the honest framing matters. The novelty here is not the existence of influence functions (those are Koh and Liang's contribution) but their first application to worst-case fairness certification in medical AI.
Computing 200 LOO models, each trained on 2,657 images with one excluded, running for 20 epochs each on Kaggle's GPU infrastructure, took approximately 50 hours of compute time across several overnight batches. Alex helped monitor these runs and log results to a tracking spreadsheet, restarting failed batches and organizing output. The actual code, model architecture, and analysis were mine.
The influence score distribution across 200 sampled training images produced several findings that surprised me more than they probably should have.
The top 50 bias-amplifying images (those with the highest positive influence scores) were not, predominantly, dark skin images. They were predominantly light skin images. 66% of the top-50 amplifiers depicted light skin tones. The intuition for this makes sense in retrospect: if the model is being trained primarily on light skin data, the images that most strongly pull its decision boundary toward light skin feature representations are the ones most likely to increase its error rate on dark skin at the worst-case threshold. The bias is not introduced by the presence of dark skin images. It is amplified by the excess weight of light skin images in determining what the model considers "normal."
The diagnosis split among the bias-amplifying images was nearly even: 77 malignant vs. 75 benign. This was important because it ruled out the hypothesis that bias was driven by class imbalance within a skin tone group. The dominant driver appeared to be skin tone, not diagnosis type.
The most important finding was that removing the top-50 bias-amplifying images reduced WCUG by 15.6% (from 0.0682 to 0.0576, p=0.008) without retraining on any new data, without changing the model architecture, and without collecting additional images. This has a direct practical implication: for datasets where new diverse data is impossible or expensive to obtain, strategic data curation based on influence scores is a tractable intervention.
Thus, bias in medical AI is not merely a model architecture problem. It is a data curation problem. You can reduce worst-case diagnostic disparity by systematically identifying and removing the training images that most strongly amplify it.
DermEquity, as a framework, answers the question of how to evaluate and improve medical AI fairness. DermiScope asks what it would look like to deploy that framework in a real-world context where the data scarcity problem originated.
Professional dermoscopes, the polarized-light imaging devices that produce the dermoscopic images used in AI training, cost between $800 and $2,000. There is approximately one dermatologist per 100,000 people in low- and middle-income countries. These two facts are related. The underrepresentation of dark skin tones in dermatological training data is not an accident or an oversight. It is a consequence of who had access to dermoscopic imaging infrastructure and who participated in the clinical studies that produced the datasets.
DermiScope is a sub-$30 smartphone dermoscope (you can read more about its development process here): a 3D-printed PLA housing, a $20 macro lens with polarized film overlay, and a universal smartphone mount. The polarized illumination enables subsurface dermoscopic imaging, which is what separates a dermoscope from a camera with a flashlight. This required understanding cross-polarized light physics: the illumination source and the camera lens are polarized at perpendicular angles, which suppresses surface glare and reveals subsurface skin structures. It took over 16 CAD iterations in Autodesk Fusion to get the geometry right, and involved more than a few conversations with people who know optics better than I do.
My lab partner and long-time childhood friend Alex assisted with design + assembly work during later prototype iterations. The design, CAD modeling, optics research, and iteration process were mine. The motivation for building it was clear: DermEquity identifies that dark skin data scarcity is the root cause of worst-case bias. DermiScope provides community-deployable imaging infrastructure that enables collection of diverse dermatological images in underserved settings. The two projects close a loop. DermEquity says "your data is the problem." DermiScope says "here is how to fix the data."
A framework whose conclusions only hold for models you trained yourself is not so much a framework as an audit of your own project. To establish that DermEquity detects bias at the architecture level rather than just the dataset level, I applied the benchmark and metrics to EfficientNetB3/ISIC 2019, a model pretrained on the ISIC 2019 challenge dataset and evaluated on 25,331 Fitzpatrick17k images.
Results: average accuracy of 73.6%, which standard metrics would call reasonable. FPNAI = 0.089, with melanomas missed on dark skin at a rate 8.9 percentage points higher at the threshold the EfficientNetB3 model optimizes for light skin. WCUG = 0.43 at a threshold of 0.45. For comparison, our improved model's worst case was WCUG = 0.0507. The external model's worst-case gap was roughly 8.5 times larger.
Average accuracy showed no alarm. The DermEquity metrics showed that this model, certified and widely benchmarked, exhibits deployment-level bias on dark skin that is substantially more severe than our improved model and would not be detectable by any single-threshold evaluation. This is the point. DermEquity is not a tool for auditing training mistakes. It is infrastructure for detecting deployment-level risk in any dermatology AI model.
Every WCUG value in DermEquity is reported with a bootstrapped 95% confidence interval computed over 1,000 resampling iterations. This was not a courtesy; it was a necessity. A point estimate of WCUG without uncertainty quantification is not meaningful for deployment decisions. The bootstrap CI establishes whether a measured difference in WCUG between two models is statistically robust or within sampling noise.
McNemar's test for paired predictions was used to assess whether observed performance differences between models were significant. The 26% relative FNR reduction in the improved model relative to baseline was significant at p=0.0118. The 15.6% WCUG reduction from the influence-based curation experiment was significant at p=0.008. These are not suggestive trends. They are statistically robust findings.
The sensitivity analysis deserves separate mention. Plotting WCUG against benchmark subsample size (500, 1,000, 1,658 images) with confidence intervals at each point established that the metric stabilizes above approximately 1,000 images. The DermEquity Benchmark at 1,658 images is above this threshold. Importantly, this analysis also addressed the most common methodological challenge from reviewers: dataset size. The benchmark is small by general ML standards. It is appropriately sized for WCUG estimation given the constraint that only 411 dark skin images existed in the source dataset.
I have now spent the better part of two years working on a single question that keeps revealing new layers: how do you know if a medical AI system is safe for everyone? Not just on average. Not just at one threshold. For everyone, at every operating point, under the worst conditions a clinician might reasonably set.
I do not have a complete answer. DermEquity gives you a framework for asking the question more precisely. WCUG tells you the worst it gets. Influence attribution tells you whose data made it that way. DermiScope offers a way to begin collecting the data that would fix it. Together, these are a diagnostic toolkit, not a cure. The cure would require regulatory frameworks that mandate pre-deployment bias stress testing, clinical outcome studies linking WCUG values to patient harm, and hardware infrastructure enabling dermatological data collection in underserved communities at scale. None of that exists yet.
What I keep thinking about is a specific property of the problem that feels important for reasons I can't fully articulate yet. The most dangerous products are not the ones that fail catastrophically. They are the ones that succeed partially and on a non-representative subset of users. The failure is invisible precisely because the success is real. An AI that correctly diagnoses melanoma 80% of the time has value. The fact that it fails disproportionately on dark skin at specific operating points can remain undetected across the entire lifetime of the product unless someone builds tools specifically designed to find it.
DermEquity is, at its core, an argument that the question "is this model accurate?" is not the same as the question "is this model safe for everyone?" and that confusing the two is not innocent ignorance. The data exists to ask the second question. The methods exist. The computational cost is tractable. What has been missing is the framework, the nomenclature, and the operational discipline to make worst-case fairness evaluation a standard part of medical AI development.
That gap is what DermEquity attempts to close, at least partially and provisionally, for at least this one domain.
But as you may have seen with my other projects, I fully intend to generalize it.
Angie X.