A Universal Safety Framework for Medical AI Using Worst-Case Bias Certification and Influence-Based Attribution Validated in Skin Cancer Screening
April 2025 - May 2026
There is a specific kind of failure in applied technology that is harder to perceive than outright malfunction. A system that breaks completely is legible: it fails loudly, leaves evidence, invites correction. But a system that works on most people while quietly failing the rest produces a deceptive coherence. It reports good numbers and passes tests. The people it harms most are, statistically speaking, least likely to be in the test set. This is the failure mode I spent the better part of two years trying to solve.
When I built Dermi, an AI skin cancer screening app, in 2024, I ran standard validation metrics on my CNN model: accuracy, recall, F1. The numbers looked reasonable. Then I started reading the literature more carefully, and discovered that what the numbers were hiding might kill someone. Specifically: fewer than 10% of dermatological training images in the most widely used datasets represent dark skin tones. Models trained on these datasets can exhibit differential false negative rates across skin tones at specific operating thresholds while reporting globally acceptable average accuracy. The average accuracy tells you nothing about who the model fails, at what rate, and under what clinical conditions.
The standard metric conceals what matters most. This is not just a technical problem. It is an epistemological one: the diagnostic question was being asked incorrectly. Asking "how accurate is this model?" when the correct question is "how much worse does this model perform on dark skin, at its worst operating point, and which training images made it that way?" is the difference between a thermometer and a pressure gauge. They both report a number. Only one tells you when the boiler is about to fail.
DermEquity is my attempt to build the pressure gauge. The framework has three components: a deliberately balanced benchmark dataset (the DermEquity Benchmark), a worst-case deployment stress-test metric I call the Worst-Case Underdiagnosis Gap (WCUG), and a data bias attribution system I've named Influence-Based Bias Attribution. Together, they transform fairness evaluation from something you do after a model is trained to something that actively informs what data to use and what deployment thresholds to set. The novelty is not incremental. It lies in asking a categorically different question about what safety evaluation should mean.
Note: this project is actively updated. Apologies for any content gaps during this time.
Cheers,
Angie X.
Most AI performance reports are like a student's GPA: they tell you average performance and almost nothing else. A model that is 80% accurate overall might be 90% accurate on light skin and 65% accurate on dark skin, and the aggregate number absorbs both into a smooth, misleading 80. This matters enormously in medical AI, where the cost of a missed diagnosis is not a grade reduction. It is a melanoma that does not get caught in time.
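To make the averaging effect concrete, here is a minimal sketch of the arithmetic. The per-group accuracies come from the example above; the 60/40 test-set mix is an assumption chosen purely for illustration, because it is the mix under which 90% and 65% average out to exactly 80%.

```python
def aggregate_accuracy(group_accuracy, group_share):
    """Population-weighted average of per-group accuracies."""
    return sum(group_accuracy[g] * group_share[g] for g in group_accuracy)

# Per-group accuracies from the example in the text.
group_accuracy = {"light": 0.90, "dark": 0.65}
# Hypothetical test-set mix (an assumption, not a measured value).
group_share = {"light": 0.60, "dark": 0.40}

overall = aggregate_accuracy(group_accuracy, group_share)
gap = group_accuracy["light"] - group_accuracy["dark"]

print(f"overall accuracy: {overall:.2f}")
print(f"light-dark gap:   {gap:.2f}")
```

The overall figure carries no trace of the 25-point gap, and the smaller the dark skin share of the test set, the less that gap moves the average.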
The compounding problem is that bias in medical AI is not fixed. Models can look fair at one detection threshold (the cutoff above which you flag something as malignant) and catastrophically unfair at another. Clinicians set operating thresholds based on context: a high-risk screening clinic might lower the threshold to catch more potential cancers. A model that appears equitable at the default threshold can spike to a 30% worse miss rate on dark skin when that threshold shifts. No existing tool was measuring this.
DermEquity has three components. The DermEquity Benchmark is a curated set of 1,658 skin lesion images, deliberately balanced by both skin tone and diagnosis, designed as a stress-test set rather than a representative sample. WCUG (Worst-Case Underdiagnosis Gap) sweeps all possible classification thresholds and finds the single worst-case operating point: the threshold with the maximum gap between the false negative rate on dark skin and the false negative rate on light skin. Think of it as asking: at the worst possible threshold, how many more melanomas does this model miss on dark skin? Influence-Based Bias Attribution goes further by systematically retraining models with individual training images removed, identifying which specific images make worst-case bias worse, and testing whether removing them reduces it. The result is a framework that tells you not just that a model is biased, but how bad it gets, where it gets that bad, and whose data caused it. Finally, cross-domain validation on NIH ChestXRay-14 shows that the framework transfers without modification to chest radiology with biological sex as the protected attribute, which is evidence that WCUG is domain-agnostic in a way that single-domain frameworks cannot claim.
Dermi began in 2024 as a fairly conventional machine learning project: I wanted to build a smartphone app that could screen skin lesions for potential malignancy using a CNN, and make it free and accessible. The personal motivation came from a friend's family member facing a delayed diagnosis. The technical motivation came from roughly five months of self-directed study in ML, deep learning, and medical image analysis, followed by around 20 months of actual building.
The model I trained, based on InceptionV3 with transfer learning from ImageNet, achieved 83% test accuracy with 87% sensitivity for malignant lesions. On paper, those are solid numbers. The app launched with a waitlist and real users. I was satisfied, and then I read Groh et al. (2021).
The Fitzpatrick17k dataset paper, which analyzed one of the largest publicly available dermatological image datasets, reported that only around 9.5% of clinically labeled skin lesion images with both skin tone and diagnosis labels represent dark skin tones. The implications took a while to land. My training data, sourced from ISIC and HAM10000, had similar representation problems. A model trained on 90% light skin images doesn't merely perform slightly worse on dark skin. At certain classification thresholds, it can produce dramatically different false negative rates across groups while still reporting acceptable global metrics. I had built something that worked. I had not verified for whom, at what operating points, and under what conditions it failed.
The first version of what became DermEquity started as a fairness audit of Dermi. It became something considerably more ambitious when I realized the audit tools available were not adequate for the questions I was asking. Average fairness metrics give average answers. Deployment decisions are not made on averages. They are made on specific thresholds, specific populations, specific clinical contexts. I needed a worst-case metric.
The original plan involved HAM10000, one of the most widely used dermoscopic datasets. After two weeks of data preparation work, I discovered there was effectively no usable overlap between HAM10000 and Fitzpatrick17k: different image ID systems, different sources, different annotation frameworks. Fitzpatrick17k was the only dataset with both skin tone labels (using the Fitzpatrick scale) and clinical diagnosis labels (benign/malignant) at scale. Everything pivoted.
After filtering Fitzpatrick17k to images with known skin tone and binary diagnosis labels, the available pool was 4,320 images. The skin tone distribution was stark: 53.5% light skin images, 37.0% medium skin, and only 9.5% dark skin images.
This distribution is the problem. It is also the data. There are exactly 411 dark skin images available with both tone and diagnosis labels in the entire filtered Fitzpatrick17k. DermEquity uses all of them for the benchmark, and none of them for baseline model training. This was a deliberate stress-test philosophy: you cannot evaluate worst-case bias by putting your worst-case cases in the training set. The design was intentional, and it required defending.
The final DermEquity Benchmark consists of 1,658 images (499 light, 750 medium, 409 dark), downloaded and verified at a 99.8% success rate. The training set consists of 2,658 images (light and medium skin only, with 0% dark skin representation at baseline). From this split, four InceptionV3 models were trained: a baseline (0% dark skin training), Ablation A (5% dark skin), Ablation B (10% dark skin), and an improved model with data augmentation.
*We do not claim 0% dark skin training is realistic. We claim it is the correct stress test: it is the limiting case of the severe underrepresentation found in the datasets that most deployed dermatology AI models were trained on.
Before building WCUG, I needed to understand what existing fairness metrics could and could not tell me. I spent several weeks reading the foundational fairness literature: Hardt, Price, and Srebro (2016) on equalized odds, Guo et al. (2017) on calibration error, Verma and Rubin (2018) on the multiplicity of fairness definitions, and various applied medical AI fairness papers from 2019 onward.
The central insight was uncomfortable: average fairness metrics are structurally incapable of capturing threshold-dependent bias. If you evaluate a model at a single threshold and compare false negative rates across groups, you are only measuring fairness at one operating point. Clinical deployment does not work that way. Different institutions set different thresholds. Screening programs for high-risk populations lower thresholds. The same model can appear equitable at one operating point and catastrophically unfair at another. No existing metric was sweeping thresholds to find the worst case.
Two supporting metrics I implemented before WCUG helped establish that average metrics were truly missing signal: CCG and FPNAI.
CCG: Confidence Calibration Gap
CCG measures whether the model's confidence is differentially miscalibrated across skin tones. A miscalibrated model does not just get predictions wrong; it is wrong with inappropriate confidence. On dark skin lesions, overconfidence in incorrect predictions is a specific failure mode with clinical implications: a clinician trusting the model's confidence score would have even less reason to override a wrong prediction.
CCG Formula:
calibration_error(group) = |mean_confidence(group) − actual_accuracy(group)|
CCG = calibration_error(dark) − calibration_error(light)
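As a minimal sketch, the CCG formula above can be computed as follows. The function names and the toy confidence values are illustrative assumptions, not the project's actual code or data; "confidence" here means the model's confidence in whichever class it predicted.

```python
def calibration_error(confidences, correct):
    """|mean confidence - accuracy| for one group.

    confidences: model confidence in its predicted class, one value per image
    correct:     1 if that prediction was right, 0 otherwise
    """
    mean_confidence = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return abs(mean_confidence - accuracy)

def ccg(dark, light):
    """CCG = calibration_error(dark) - calibration_error(light)."""
    return calibration_error(*dark) - calibration_error(*light)

# Toy inputs (hypothetical): (confidences, correct) per group.
dark_group = ([0.9, 0.8, 0.9, 0.8], [1, 0, 1, 0])    # confident, often wrong
light_group = ([0.9, 0.8, 0.7, 0.8], [1, 1, 1, 0])   # confidence near accuracy

ccg_value = ccg(dark_group, light_group)
print(f"CCG = {ccg_value:.2f}")
```

A positive value means the model is more miscalibrated on dark skin than on light skin. In the toy data the dark group error is 0.35 and the light group error is 0.05.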
FPNAI: False Positive/Negative Asymmetry Index
FPNAI captures whether error is directional across skin tones. Negative values indicate underdiagnosis on dark skin (missed cancers), which is the clinically dangerous direction. Positive values indicate overdiagnosis on dark skin (unnecessary biopsies). The direction matters as much as the magnitude.
FPNAI Formula:
FPNAI = (FPR_dark − FPR_light) − (FNR_dark − FNR_light)
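A minimal sketch of the FPNAI formula above, computed from binary labels and binary predictions at a single threshold. The helper names and the toy labels are assumptions made for illustration only.

```python
def error_rates(y_true, y_pred):
    """Return (FPR, FNR) for binary labels where 1 = malignant."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    negatives = sum(1 for t in y_true if t == 0)
    positives = sum(1 for t in y_true if t == 1)
    return fp / negatives, fn / positives

def fpnai(dark, light):
    """FPNAI = (FPR_dark - FPR_light) - (FNR_dark - FNR_light)."""
    fpr_dark, fnr_dark = error_rates(*dark)
    fpr_light, fnr_light = error_rates(*light)
    return (fpr_dark - fpr_light) - (fnr_dark - fnr_light)

# Toy data (hypothetical): four malignant and four benign lesions per group.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
dark_group = (labels, [1, 1, 0, 0, 0, 0, 0, 0])   # misses 2 of 4 cancers
light_group = (labels, [1, 1, 1, 0, 0, 0, 0, 1])  # misses 1, overcalls 1

fpnai_value = fpnai(dark_group, light_group)
print(f"FPNAI = {fpnai_value:+.2f}")
```

In the toy data the dark group has a higher miss rate and a lower false alarm rate, so both terms push the index negative, which is the underdiagnosis direction under the formula's sign convention.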
At baseline, CCG = 0.0476 and FPNAI = +0.0273. Both metrics improved monotonically as dark skin representation increased through the ablation series. The critical finding: standard accuracy metrics showed no alarm at any of the four model configurations. Average accuracy across all four models was approximately 76.4%, and the gap in average dark skin accuracy was modest. The supporting metrics revealed hidden miscalibration and directional error that standard evaluation was structurally unable to detect. This is the setup for WCUG.
The conceptual leap that produced WCUG came from thinking about how other high-stakes industries handle the difference between typical performance and worst-case performance. FDA pharmaceutical testing does not evaluate drugs at typical doses and declare them safe. It tests at extreme doses, under stress conditions, across vulnerable subpopulations. Aircraft are not certified based on performance at standard cruising altitude. They are stress-tested at operating limits.
Medical AI is deployed at a chosen threshold. That threshold is not fixed. Clinicians adjust it. Screening programs adjust it. Individual institutions adjust it based on local patient risk profiles. Evaluating fairness at only one threshold reports the thermometer reading in one room of a burning building. WCUG asks a different question: across all possible operating thresholds, what is the maximum gap between the false negative rate on dark skin and the false negative rate on light skin?
WCUG Definition:
For each threshold τ ∈ [0, 1]:
FNR_dark(τ) = P(prediction < τ | malignant, dark skin)
FNR_light(τ) = P(prediction < τ | malignant, light skin)
WCUG = max_τ |FNR_dark(τ) − FNR_light(τ)|
Also computed:
τ* = argmax_τ |FNR_dark(τ) − FNR_light(τ)|
CI_WCUG = bootstrapped 95% confidence interval (n=1,000 iterations)
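The definition above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the project's implementation: the evenly spaced threshold grid, the percentile bootstrap that resamples each group separately, and the toy scores are all choices made for this example.

```python
import random

def fnr_at(malignant_scores, tau):
    """Share of malignant cases scored below tau, i.e. missed at that threshold."""
    return sum(1 for s in malignant_scores if s < tau) / len(malignant_scores)

def wcug(dark_scores, light_scores, n_thresholds=101):
    """Return (WCUG, tau_star) over an evenly spaced grid of thresholds on [0, 1]."""
    best_gap, best_tau = 0.0, 0.0
    for k in range(n_thresholds):
        tau = k / (n_thresholds - 1)
        gap = abs(fnr_at(dark_scores, tau) - fnr_at(light_scores, tau))
        if gap > best_gap:  # keeps the first threshold that reaches the maximum
            best_gap, best_tau = gap, tau
    return best_gap, best_tau

def bootstrap_ci(dark_scores, light_scores, n_iter=1000, seed=0):
    """Percentile 95% interval for WCUG, resampling each group with replacement."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_iter):
        dark_s = rng.choices(dark_scores, k=len(dark_scores))
        light_s = rng.choices(light_scores, k=len(light_scores))
        stats.append(wcug(dark_s, light_s)[0])
    stats.sort()
    return stats[int(0.025 * n_iter)], stats[int(0.975 * n_iter) - 1]

# Toy predicted malignancy scores for truly malignant lesions (hypothetical).
dark_malignant = [0.20, 0.40, 0.55, 0.70, 0.90]
light_malignant = [0.60, 0.70, 0.80, 0.90, 0.95]

gap, tau_star = wcug(dark_malignant, light_malignant)
ci_low, ci_high = bootstrap_ci(dark_malignant, light_malignant)
print(f"WCUG = {gap:.2f} at tau* = {tau_star:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

A grid of 101 thresholds is a simplification. Sweeping the set of distinct predicted scores instead would give the exact maximum, since the FNR curves only change at those values.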
The formal claim for WCUG is not mathematical novelty. The maximum of an absolute difference is not a new mathematical object. The claim is operational novelty: this is the first worst-case deployment stress-test metric designed for medical AI with explicit threshold-sweep interpretation. Prior work on worst-case fairness focused on theoretical guarantees: Dwork et al. (2012) through individual fairness constraints, and Hashimoto et al. (2018) through distributionally robust optimization. DermEquity operationalizes this thinking into a clinical deployment diagnostic that produces actionable threshold guidance, not theoretical bounds.
I also ran a sensitivity analysis: WCUG was computed across benchmark subsamples of 500, 1,000, and 1,658 images, with bootstrapped confidence intervals at each size. The intervals converge above approximately 1,000 images, validating the benchmark size as adequate for stable WCUG estimation.
The most striking result was the bias direction reversal. In the improved model (trained with augmentation), dark skin patients were not disadvantaged at any threshold relative to light skin. FPNAI shifted from +0.0273 to −0.0207. CCG improved from 0.0476 to 0.0083, an 82.6% reduction. All three metrics confirmed the same story, and WCUG told the story that average accuracy alone could not: augmentation did not merely reduce the gap, it eliminated it at every clinically plausible operating point.
Once WCUG established that worst-case bias existed and was measurable, the next question was causative, or at least as close to causative as a non-interventional study can get: which specific training images were amplifying worst-case bias, and could removing them reduce it?
The methodological foundation comes from Koh and Liang (2017), who introduced influence functions as a framework for tracing model predictions back to specific training examples. Influence functions approximate a simpler quantity, and the idea behind that quantity is elegant: if you train a model on all N images, then retrain it leaving out image i, and measure how much the outcome of interest changes, you get a direct estimate of that image's influence on the outcome. In the context of fairness rather than accuracy, you can substitute WCUG as the outcome of interest, and the same leave-one-out logic produces influence scores for each image with respect to worst-case bias.
Influence Score Definition:
For each training image i:
Train M_all on all 2,658 training images
Compute WCUG_all = WCUG(M_all, DermEquity Benchmark)
Train M_{−i} on all images except i
Compute WCUG_{−i} = WCUG(M_{−i}, DermEquity Benchmark)
Influence_i = WCUG_all − WCUG_{−i}
Positive Influence_i means image i amplifies worst-case bias. Negative Influence_i means image i mitigates worst-case bias.
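The procedure above can be sketched as a small loop. The training and evaluation steps are passed in as callables because they are the expensive, project-specific parts. The `train_fn` and `wcug_fn` parameters, the image IDs, and the toy stand-ins below are hypothetical and exist only to make the sketch runnable.

```python
from typing import Callable, Dict, Hashable, Sequence

def loo_influence(
    train_ids: Sequence[Hashable],
    candidates: Sequence[Hashable],
    train_fn: Callable[[Sequence[Hashable]], object],
    wcug_fn: Callable[[object], float],
) -> Dict[Hashable, float]:
    """Influence_i = WCUG_all - WCUG_{-i} for each candidate image i."""
    wcug_all = wcug_fn(train_fn(train_ids))
    scores = {}
    for i in candidates:
        reduced = [j for j in train_ids if j != i]  # training set without image i
        scores[i] = wcug_all - wcug_fn(train_fn(reduced))
    return scores

# Toy stand-ins (hypothetical): the "model" is just the list of IDs it was
# trained on, and each image shifts WCUG by a fixed, made-up amount.
toy_effect = {"img_a": 0.02, "img_b": -0.01, "img_c": 0.0}

def toy_train(ids):
    return list(ids)

def toy_wcug(model):
    return 0.05 + sum(toy_effect[i] for i in model)

influence = loo_influence(list(toy_effect), list(toy_effect), toy_train, toy_wcug)
for image_id, score in influence.items():
    print(f"{image_id}: {score:+.3f}")
```

In a real run `train_fn` would train a network with fixed seeds and hyperparameters, and `candidates` would be the sampled subset of images rather than the whole training set.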
Full Shapley values over 2,658 images would require 2^2,658 model retrainings, which is computationally intractable. The Leave-One-Out (LOO) approximation is tractable and well-grounded: it is a first-order approximation of the Shapley value, and provides the directional information needed for practical data curation decisions. I was careful throughout to call this "LOO approximation of Shapley values" rather than "Shapley values," because the distinction is real. The novelty here is not the existence of influence functions (Koh and Liang's contribution) but their first application to worst-case fairness certification in medical AI.
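A quick way to see the scale difference is to count the models each approach needs. This is a back-of-the-envelope sketch: it counts one model per subset for exact Shapley values and one model per removed image plus the full model for LOO, and the variable names are illustrative.

```python
n_images = 2658

# Exact Shapley values need the outcome on every subset of the training set.
n_subsets = 2 ** n_images          # Python integers are arbitrary precision
shapley_digits = len(str(n_subsets))

# Leave-one-out needs the full model plus one model per removed image.
loo_models_full = n_images + 1
loo_models_sampled = 200 + 1       # the 200-image sample used in this project

print(f"subset count has {shapley_digits} decimal digits")
print(f"LOO models, full training set: {loo_models_full}")
print(f"LOO models, 200-image sample:  {loo_models_sampled}")
```

The subset count is a number with hundreds of digits, while LOO stays linear in the number of images examined.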
Computing 200 LOO models, each trained on 2,657 images with one excluded, running for 20 epochs each on Kaggle's GPU infrastructure, took approximately 50 hours of compute time across several overnight batches. Alex and Kris helped monitor these runs and log results to a tracking spreadsheet, restarting failed batches and organizing output. The analytical code, model architecture, and interpretation were mine.
The influence score distribution across 200 sampled training images produced several findings worth dwelling on.
The top 50 bias-amplifying images were predominantly light skin images: 66% of the top-50 amplifiers depicted light skin tones, with the remainder depicting medium skin tones, since the baseline training set contains no dark skin images. The intuition makes sense in retrospect. If the model is being trained primarily on light skin data, the images that most strongly pull its decision boundary toward light skin feature representations are the ones most likely to increase its error rate on dark skin at the worst-case threshold. Bias is not introduced by the presence of dark skin images. It is amplified by the excess weight of light skin images in determining what the model considers normal.
The diagnosis split among the top-50 amplifiers was nearly even: 77 malignant versus 75 benign. This ruled out the hypothesis that bias was driven by class imbalance within a skin tone group. The dominant driver appeared to be skin tone, not diagnosis type.
Removing the top-50 bias-amplifying images reduced WCUG by 15.6% (from 0.0682 to 0.0576, p=0.008) without retraining on any new data, without changing the model architecture, and without collecting additional images. The practical implication is direct: for datasets where new diverse data is impossible or expensive to obtain, strategic data curation based on influence scores is a tractable intervention.
Bias in medical AI is not merely a model architecture problem. It is a data curation problem. You can reduce worst-case diagnostic disparity by systematically identifying and removing the training images that most strongly amplify it.
DermEquity, as a framework, answers the question of how to evaluate and improve medical AI fairness. DermiScope asks what it would look like to deploy that framework in a real-world context where the data scarcity problem originated.
Professional dermoscopes, the polarized-light imaging devices that produce the dermoscopic images used in AI training, cost between $800 and $2,000. There is approximately one dermatologist per 100,000 people in low- and middle-income countries. These two facts are related. The underrepresentation of dark skin tones in dermatological training data is not an accident or an oversight. It is a consequence of who had access to dermoscopic imaging infrastructure and who participated in the clinical studies that produced the datasets.
DermiScope is a sub-$30 smartphone dermoscope prototype: a 3D-printed PLA housing, a $20 macro lens with polarized film overlay, and a universal smartphone mount. The polarized illumination enables subsurface dermoscopic imaging, which is what separates a dermoscope from a camera with a flashlight. This required understanding cross-polarized light physics: the illumination source and the camera lens are polarized at perpendicular angles, which suppresses surface glare and reveals subsurface skin structures. It took over 30 CAD iterations in Autodesk Fusion to get the geometry right, and involved conversations with clinicians who know optics at a level I definitely did not.
Alex and Kris assisted with design and assembly work during later prototype iterations. The original design, CAD modeling, optics research, and iteration process were mine. The motivation for building it was clear: DermEquity identifies that dark skin data scarcity is the root cause of worst-case bias. DermiScope provides community-deployable imaging infrastructure that enables collection of diverse dermatological images in underserved settings. The two projects close a loop. DermEquity says "your data is the problem." DermiScope says "here is how to fix the data."
Finally, our Dermi app integrates the full DermEquity AI framework into a mobile-first smartphone app, completing the pipeline from bias-certified model to patient-facing deployment. You can read more about my development of Dermi here. Real screenshots of finalized app screens are shown to the right.
*Note: DermiScope is currently only a prototype. We do not claim comparable image quality to professional dermatoscopes at this stage of development!
*Note: The Dermi app is not on App Stores yet. My team is planning beta testing and final production details.
A framework whose conclusions only hold for models you trained yourself is not really a framework. To establish that DermEquity detects bias at the architecture level, and across domains entirely, I applied the benchmark and metrics to two independent scenarios with zero formula modification.
Validation 1: EfficientNetB3 (Architecture Validation)
EfficientNetB3, a member of the EfficientNet family used in the winning entry of the ISIC 2019 Skin Lesion Classification Challenge (Gessert et al., 2020), was trained on the ISIC 2019 challenge dataset (6,000 images, completely separate from our Fitzpatrick17k training data). The full DermEquity pipeline was applied to its outputs on the DermEquity benchmark, without modification.
Results: WCUG = 0.2078 at τ = 0.61, representing a 204% increase over the DermEquity baseline (0.0682). CCG = 0.0601, 26.2% worse than baseline. FPNAI = +0.3367, indicating systematic overdiagnosis on dark skin at the standard clinical threshold. This last finding deserves careful interpretation: the positive FPNAI means dark skin patients are being overcalled (unnecessary biopsies), not undercalled. This is a different failure mode from the baseline InceptionV3's underdiagnosis pattern, and the direction of the FNR gap behind WCUG at τ = 0.61 reflects the same thing from another angle. At the worst-case threshold, light and medium skin patients carry the higher miss rate, because EfficientNetB3's confidence scores for malignancy cluster differently for different skin tones as a result of its ISIC 2019 training distribution.
The point is not that one failure mode is worse than the other. Both are bias. WCUG captured both, and FPNAI correctly identified the directional character of each. A framework that could only detect underdiagnosis would have missed EfficientNetB3's failure entirely. Average accuracy showed no alarm.
Validation 2: NIH ChestXRay-14 (Cross-Domain Validation)
To test whether WCUG transfers beyond dermatology, I applied it without modification to a DenseNet-121 model trained on 7,797 images from the NIH ChestXRay-14 dataset (Wang et al., 2017), with pleural effusion detection as the binary task and biological sex as the protected attribute. The disparity axis here is sex, not skin tone. The imaging modality is chest radiography, not dermoscopy. The disease is pulmonary, not cutaneous. The formula was unchanged.
At the standard clinical threshold (τ = 0.50), the FNR gap between female and male patients was 1.77 percentage points, with p = 0.619 (statistically invisible). Standard single-threshold analysis would report no concerning disparity. WCUG identified τ = 0.90 as the worst-case operating point, where female patients had a 10.81 percentage point higher miss rate than male patients (χ² = 7.80, p = 0.0052, 95% bootstrap CI [0.0616, 0.1786]). This matches the sex-based underdiagnosis pattern documented in Seyyed-Kalantari et al. (2021, Nature Medicine) and Larrazabal et al. (2020, PNAS), confirming that the framework detected a real disparity rather than a sampling artifact.
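The reported p-value can be sanity-checked from the reported statistic alone. The sketch below assumes the test has one degree of freedom, as it would for a 2x2 table of sex by missed/caught; that assumption is not stated in the text. With one degree of freedom the chi-squared tail probability has a closed form in terms of the complementary error function, so only the standard library is needed.

```python
import math

def chi2_sf_1dof(stat):
    """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom.

    If Z is standard normal, Z**2 is chi-squared with 1 dof, so
    P(X > stat) = P(|Z| > sqrt(stat)) = erfc(sqrt(stat / 2)).
    """
    return math.erfc(math.sqrt(stat / 2.0))

reported_stat = 7.80  # chi-squared value reported in the text
p_value = chi2_sf_1dof(reported_stat)
print(f"p = {p_value:.4f}")
```

Under that assumption the result agrees with the reported p = 0.0052 to the precision given.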
The finding has a specific implication worth stating plainly. A clinician or health system evaluating this model at the standard threshold would see no meaningful sex-based difference and have no reason to adjust their deployment approach. WCUG identified a threshold at which female patients' worst-case miss rate is more than ten percentage points higher than male patients': not a small or ambiguous effect, and one that the standard evaluation pipeline is architecturally incapable of surfacing.
Every WCUG value reported is accompanied by a bootstrapped 95% confidence interval computed across 1,000 resampling iterations, confirming that worst-case gap estimates reflect the underlying distribution rather than sampling noise. The sensitivity analysis (Figure 2) demonstrates WCUG stability above n = 1,000 images, validating the benchmark size as sufficient.
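For causal-adjacent claims, we used McNemar's test for the paired model comparisons and a chi-squared test for the NIH group comparison: influence removal yields p = 0.008, augmentation's FNR reduction yields p = 0.0118, and the NIH sex-based gap at the worst-case threshold yields p = 0.0052. All tests were run at α = 0.01 with Yates' continuity correction applied to contingency tables. At that level the influence removal and NIH results are significant, while the augmentation result (p = 0.0118) narrowly misses the threshold and is significant only at the conventional α = 0.05.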
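For readers unfamiliar with McNemar's test, here is a minimal sketch of the continuity-corrected version for two models evaluated on the same images. Only the discordant pairs enter the statistic. The counts below are hypothetical and are not the project's data, and the chi-squared approximation with one degree of freedom is assumed.

```python
import math

def mcnemar_corrected(b, c):
    """Continuity-corrected McNemar test on discordant pair counts.

    b: cases model A got right and model B got wrong
    c: cases model A got wrong and model B got right
    Returns (statistic, p_value) using the chi-squared approximation, 1 dof.
    """
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical discordant counts for two models on the same benchmark.
stat, p_value = mcnemar_corrected(b=25, c=9)
alpha = 0.01
print(f"statistic = {stat:.3f}, p = {p_value:.4f}, significant at {alpha}: {p_value < alpha}")
```

With small discordant counts an exact binomial version of the test is usually preferred over the chi-squared approximation.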
Two independent fairness metrics, CCG and FPNAI, confirm the bias signal from orthogonal directions: CCG measures confidence miscalibration across groups, FPNAI captures error asymmetry. Their agreement at every model configuration rules out isolated measurement artifacts and supports the convergent validity of WCUG as the primary metric.
LOO controls were enforced throughout: all 200 leave-one-out models use identical architecture, hyperparameters, random seeds, and training duration. The only variable is the excluded image, which isolates influence scores from architectural confounds.
The dual benchmark design enforces strict train/test separation. Dark skin images appear exclusively in the evaluation benchmark and never in training at baseline, maximizing stress-test severity and preventing data leakage.
I have spent the better part of two years working on a single question that keeps revealing new layers: how do you know if a medical AI system is safe for everyone? Not just on average. Not just at one threshold. For everyone, at every operating point, under the worst conditions a clinician might reasonably set.
I do not have a complete answer. DermEquity gives you a framework for asking the question more precisely. WCUG tells you the worst it gets. Influence attribution tells you whose data made it that way. Cross-domain validation on chest radiology confirms the framework transfers without modification across imaging modalities, model architectures, and protected demographic attributes. DermiScope offers a way to begin collecting the data that would fix the underlying scarcity.
Together, these are a diagnostic toolkit, not a cure. The cure would require regulatory frameworks that mandate pre-deployment bias stress testing, clinical outcome studies linking WCUG values to patient harm, and hardware infrastructure enabling dermatological data collection in underserved communities at scale. None of that exists yet.
What I keep returning to is a specific property of the problem that feels important for reasons I cannot fully articulate yet. The most dangerous products are not the ones that fail catastrophically. They are the ones that succeed partially and on a non-representative subset of users. The failure is invisible precisely because the success is real. An AI that correctly diagnoses melanoma 80% of the time has value. The fact that it fails disproportionately on dark skin at specific operating points, or on female patients in radiology at specific thresholds, can remain undetected across the entire lifetime of the product unless someone builds specifically designed tools to find it.
DermEquity is, at its core, an argument that "is this model accurate?" and "is this model safe for everyone?" are not the same question, and that confusing them is not innocent ignorance. The data exists to ask the second question. The methods exist. The computational cost is tractable. What has been missing is the framework, the nomenclature, and the operational discipline to make worst-case fairness evaluation a standard part of medical AI development.
That gap is what DermEquity attempts to close, at least partially and provisionally, for at least this one domain. And now, at least partially, for a second.
But as you may have seen with my other projects, I fully intend to generalize it.
Angie X.