A Large-Scale Audit Framework for Demographic Representation Gaps in U.S. Clinical Trials
March 2026 - April 2026
The problem with clinical trials is not that they exclude people on purpose. The exclusion is structural, and structures are harder to argue with than intentions.
The FDA has been issuing diversity guidance for clinical trials since 2022, with a sharpened version in 2024. The guidance is real, the language is specific, and yet there is, as of this writing, no open and reproducible large-scale audit of whether enrollment demographics in completed trials actually match the demographic distribution of the populations those trials are meant to help. Individual papers audit individual diseases. There is no unified framework that applies a consistent metric across thousands of trials simultaneously, joined to disease burden denominators, stratified by therapeutic area, phase, sponsor type, and decade.
LACUNA is my attempt to fill that gap. The name means a gap or missing portion, which is about as on-the-nose as I can get lol.
The project does one thing: it computes a Representation Gap Index (RGI) per trial per demographic dimension per group, defined as the difference between the proportion of enrolled participants from a demographic group and that group's share of the disease burden from CDC WONDER data. A negative RGI means a group is under-represented relative to the actual disease population. A positive RGI means the opposite. The goal is measurement at scale, not explanation. LACUNA is a measurement tool, and I want to be precise about that distinction before anything else.
This is probably the most sociologically direct project I have built so far, and the most explicit expression of the theme running through all of my projects: equitable performance. DermEquity started because I noticed that the original Dermi app, which I had built myself, was less accurate on darker skin tones. LACUNA is an abstraction of that same observation: if a medical imaging tool can embed demographic bias, so can the clinical evidence base that informs the tools and treatments deployed downstream of it. The data quality problem goes all the way up.
Cheers,
Angie X.
Clinical trials are the process by which a proposed medical treatment moves from hypothesis to evidence. The FDA requires that a drug demonstrate safety and efficacy in a trial before it can be prescribed to patients. This seems straightforward until you ask: efficacy in whom?
A trial that enrolls 1,000 participants, of whom 850 are white, produces evidence primarily about the treatment's behavior in white patients. If the condition being studied disproportionately affects Black patients (say, hypertension or sickle cell disease) then the treatment's efficacy in that population is undercharacterized relative to the population that actually needs it most. The trial's evidence base is structurally misaligned with the disease burden it is meant to address.
The Representation Gap Index (RGI) is a number that quantifies how large that misalignment is for each trial, demographic group, and dimension. It does not explain why the misalignment exists. It does not control for disease subtype or geographic clustering or the logistics of patient recruitment. It simply measures the gap between who was enrolled and who bears the burden. This is the appropriate starting point: you can't fix what you can't measure.
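To make the arithmetic concrete, here is the gap computation on made-up numbers: the 850-of-1,000 trial above, with burden shares invented purely for illustration (real denominators come from CDC WONDER).

```python
# Hypothetical numbers for illustration only: the 850/1000 trial from the
# text, with assumed disease-burden shares. Real burden shares come from
# CDC WONDER, not from these invented values.
enrolled = {"WHITE": 850, "BLACK": 100, "OTHER": 50}
burden_share = {"WHITE": 0.55, "BLACK": 0.35, "OTHER": 0.10}  # assumed

total = sum(enrolled.values())
rgi = {g: enrolled[g] / total - burden_share[g] for g in enrolled}
# rgi["BLACK"] = 0.10 - 0.35 = -0.25: under-represented by 25 points
# rgi["WHITE"] = 0.85 - 0.55 = +0.30: over-represented by 30 points
```

A signed gap of −0.25 reads directly: Black patients make up 25 percentage points less of the trial than of the disease burden.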
LACUNA computes this across thousands of completed U.S. trials using publicly available data from ClinicalTrials.gov and the CDC. The entire codebase is open source, the methodology is documented in this writeup, and the dataset is available for download.
I spent a few days reading through relevant policy literature, including the 2022 FDA draft guidance on Diversity Action Plans, the 2024 final rule, existing peer-reviewed literature on clinical trial enrollment disparities, and the ClinicalTrials.gov data documentation.
The existing literature is actually quite fragmented. Kim et al. (2021) documented racial disparities in oncology trial enrollment, Duma et al. (2018) documented them in lung cancer specifically, and Loree et al. (2019) found under-representation of Black and Hispanic patients across solid tumor trials. These are solid papers, but each is confined to a single disease area, and they use a mix of ad hoc methods to quantify the gap. None of them apply a consistent, reproducible metric across the entire ClinicalTrials.gov corpus.
The gap I was filling, therefore, was not the observation that disparities exist (that is already well established), but the absence of a systematic, cross-disease, open-source measurement framework that produces an interpretable metric per trial.
Two design decisions from this phase were load-bearing for everything else.
First, I anchored the denominator to disease burden rather than the general U.S. population. The general population baseline is easy to obtain and tempting to use, but it is the wrong denominator for most conditions. Sickle cell disease overwhelmingly affects Black patients; comparing trial enrollment to the general population's racial distribution would show a Black "over-representation" even in trials that enroll far fewer Black patients than the disease burden would justify. CDC WONDER's cause-of-death and prevalence data provides condition-specific denominators. This is more work to obtain and joins less cleanly to trial data, but it produces an RGI that actually means something.
Second, I decided to treat RGI as a purely descriptive signed gap, not to build a predictive model of enrollment determinants. The causal question (why do gaps exist?) is important and deserves its own research program. LACUNA answers the prior question: how large are the gaps, and where do they concentrate? These are separable questions and should be answered separately.
The ClinicalTrials.gov v2 API, launched in 2024, is considerably cleaner than its predecessor. It returns structured JSON with a resultsSection.baselineCharacteristicsModule for completed trials that have reported results, which is where enrollment demographics live when they are reported at all.
The phrase "when they are reported at all" is doing a lot of heavy lifting.
api_fetcher.py handles pagination across the full corpus of completed U.S. trials with results sections. It saves each page to disk as a numbered JSON file and checkpoints after every page, so a multi-hour fetch that gets interrupted at page 400 can resume from page 401 rather than starting over. The country filter is applied post-fetch at the study level because the v2 API does not expose a clean top-level country filter independently of location data.
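The resume logic can be sketched roughly as follows. Function and file names here are hypothetical and the real api_fetcher.py differs, but the endpoint and the pageToken/nextPageToken fields are the v2 API's actual pagination mechanism. The HTTP call is injected as a callable so the checkpointing logic stands on its own.

```python
import json
from pathlib import Path

API_URL = "https://clinicaltrials.gov/api/v2/studies"  # real v2 endpoint

def fetch_all_pages(fetch_page, out_dir):
    """Paginate with on-disk checkpointing: each page is saved as a numbered
    JSON file, so an interrupted run resumes after the last saved page.

    fetch_page(token) -> dict with "studies" and, on all but the last page,
    "nextPageToken" (the v2 API's response shape). Returns the page count.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Resume: count pages already on disk and recover the saved next token.
    done = sorted(out.glob("page_*.json"))
    page_num = len(done)
    token = None
    if done:
        token = json.loads(done[-1].read_text()).get("nextPageToken")
        if token is None:
            return page_num  # previous run already finished
    while True:
        payload = fetch_page(token)
        (out / f"page_{page_num:05d}.json").write_text(json.dumps(payload))
        page_num += 1
        token = payload.get("nextPageToken")
        if token is None:
            return page_num
```

In the real fetcher, fetch_page would wrap something like `requests.get(API_URL, params={"pageToken": token, ...}).json()`, plus the query filters for completed U.S. trials with results.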
demographic_parser.py is the messiest module in the codebase, which is appropriate because the underlying data is the messiest part of the project. ClinicalTrials.gov does not enforce controlled vocabulary for demographic measure titles or category labels. Some trials report "White or Caucasian." Some report "Caucasian." Some report "White, Non-Hispanic" and include Hispanic as a race category. Some report age in 10-year bins; some in 5-year bins; some in bins that don't align with either standard. A few trials report "Not Reported" for every demographic category, which is a technically valid entry that contributes nothing to the analysis.
The parser builds a canonical label mapping from the common variants to a consistent set of OMB 1997 race/ethnicity categories (the same standard the FDA guidance uses), sex categories, and a 5-bin age structure. Trials that cannot be mapped to at least one demographic dimension are flagged as unparseable and excluded. Nothing is imputed. I'd rather have a smaller clean dataset than a larger contaminated one.
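A fragment of what that canonical mapping looks like, with illustrative entries only (the real table in demographic_parser.py is much larger and covers sex and age bins as well):

```python
# Illustrative slice of the canonical label map; targets are the OMB 1997
# race categories. The real parser's table covers many more variants.
CANONICAL_RACE = {
    "white": "WHITE",
    "white or caucasian": "WHITE",
    "caucasian": "WHITE",
    "black or african american": "BLACK",
    "african american": "BLACK",
    "asian": "ASIAN",
    "american indian or alaska native": "AIAN",
    "native hawaiian or other pacific islander": "NHPI",
}

def normalize_race(label):
    """Map a free-text race label to a canonical category, or None if it is
    unmappable (the trial is then flagged and excluded, never imputed)."""
    key = label.strip().lower().rstrip(".")
    return CANONICAL_RACE.get(key)
```

Returning None rather than a best guess is what enforces the "nothing is imputed" rule downstream.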
condition_mapper.py maps each trial's condition list to one of roughly 20 therapeutic areas using a keyword index over condition strings. This is the most methodologically coarse step in the pipeline. A real MeSH-to-ICD10 crosswalk would be more precise, but it requires NLM's UMLS license, which is not publicly available without institutional affiliation. The keyword approach covers the substantial majority of trials in the main therapeutic areas and flags unclassifiable trials explicitly. I err toward explicit exclusion rather than silent misclassification.
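The keyword approach can be sketched like this, with a hypothetical index covering just three areas (the real condition_mapper.py covers roughly 20):

```python
# Hypothetical keyword index for three therapeutic areas; illustrative only.
# First matching area wins; unmatched trials are flagged, not guessed.
AREA_KEYWORDS = {
    "oncology": ["cancer", "carcinoma", "lymphoma", "leukemia", "tumor"],
    "cardiovascular": ["hypertension", "heart failure", "coronary", "stroke"],
    "psychiatry": ["depression", "schizophrenia", "bipolar", "anxiety"],
}

def map_conditions(conditions):
    """Assign a trial's condition list to one therapeutic area, or flag it."""
    text = " ".join(conditions).lower()
    for area, keywords in AREA_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return area
    return "UNCLASSIFIED"  # explicit exclusion, not silent misclassification
```

The "UNCLASSIFIED" sentinel is the coarse-but-honest part: those trials drop out of area-level analyses visibly rather than polluting the wrong bucket.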
burden_loader.py handles the CDC WONDER disease burden files. CDC WONDER does not have a clean REST API, so burden data requires manual batch downloads as TSV files. The module parses these, normalizes labels to the same canonical set used by the demographic parser, and computes proportions from raw counts rather than from pre-computed rates, so the denominator is consistent with what the trial parser produces.
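A minimal sketch of that parsing step. The column names ("Race", "Deaths") are assumptions standing in for whatever a given WONDER export uses; real exports vary by query, and the real burden_loader.py also normalizes group labels.

```python
import csv
import io

def burden_proportions(tsv_text, group_col="Race", count_col="Deaths"):
    """Parse a CDC WONDER-style TSV export and compute each group's share
    from raw counts, never from pre-computed rates. Column names are
    assumptions; real WONDER exports vary by query."""
    counts = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        group = (row.get(group_col) or "").strip()
        raw = (row.get(count_col) or "").strip()
        if not group or not raw.isdigit():
            continue  # skip totals, blank rows, and "Suppressed" cells
        counts[group] = counts.get(group, 0) + int(raw)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}
```

Computing shares from counts here mirrors the trial-side parser, so the two proportions entering the RGI subtraction have the same denominator structure.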
The math in rgi_calculator.py is, on purpose, extremely simple.
RGI = enrolled_proportion − disease_burden_proportion
That's it. Everything else in the module is bookkeeping: joining the right burden data to the right trial, propagating the right flags for missing data, and outputting a flat dataset where each row is one trial × one dimension × one group. The simplicity is intentional. RGI is not trying to be a causal estimator or a model output. It is a descriptive gap measure, and gap measures should be interpretable without a statistics background.
A few implementation notes, to be explicit:
Proportions are computed from raw counts, not accepted from any pre-computed percentages the trial may have reported. ClinicalTrials.gov results sometimes include percentage cells alongside count cells, but those percentages occasionally use a different denominator (for instance, excluding "Unknown" from the base rather than including it). Computing from counts ourselves guarantees that enrolled proportions and burden proportions use the same denominator structure.
Trials with total enrollment below 30 are excluded from RGI computation. Below that threshold, proportions are too noisy to interpret meaningfully; a single participant moving between demographic categories shifts the RGI by several percentage points.
The output dataset has one row per trial × dimension × group. This structure makes downstream filtering and aggregation straightforward: you can filter to dimension="race" and group="BLACK" and immediately have the full distribution of RGI values for Black participants across every parseable trial in the corpus.
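Putting the three notes together, the per-trial computation reduces to something like this hypothetical helper (the real rgi_calculator.py adds the burden join and richer missing-data flags):

```python
MIN_ENROLLMENT = 30  # below this, proportions are too noisy to interpret

def rgi_rows(nct_id, dimension, enrolled_counts, burden_props):
    """One output row per trial x dimension x group. Enrolled proportions
    are computed from raw counts so the enrolled and burden denominators
    share the same structure. Hypothetical helper, illustrative only."""
    total = sum(enrolled_counts.values())
    if total < MIN_ENROLLMENT:
        return []  # small-trial exclusion
    rows = []
    for group, count in enrolled_counts.items():
        if group not in burden_props:
            continue  # no burden denominator for this group: skip, don't impute
        rows.append({
            "nct_id": nct_id,
            "dimension": dimension,
            "group": group,
            "rgi": count / total - burden_props[group],
        })
    return rows
```

Filtering the output to, say, `dimension == "race"` and `group == "BLACK"` then yields the full RGI distribution for that group directly.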
analysis.py generates five figures and a regression summary. All summary statistics (mean, standard deviation, median, percentiles) are computed from scratch. scipy.stats is used only for t-statistics and p-values in the regression step, which I acknowledge explicitly in the code. The OLS slope coefficient is equivalent to the difference in mean |RGI| between the two groups being compared, which you could verify manually from the summary statistics output.
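That slope-equals-mean-difference equivalence is easy to verify from scratch. For a regression of |RGI| on a binary indicator (say, industry = 1 vs. non-industry = 0), the OLS slope cov(x, y)/var(x) is exactly the difference in group means; the numbers below are toy values, not real results:

```python
def ols_slope(x, y):
    """OLS slope computed from scratch: cov(x, y) / var(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var = sum((xi - mx) ** 2 for xi in x)
    return cov / var

x = [0, 0, 0, 1, 1]                  # toy binary sponsor-class indicator
y = [0.10, 0.20, 0.30, 0.40, 0.60]  # toy |RGI| values
group_diff = sum(y[3:]) / 2 - sum(y[:3]) / 3
# ols_slope(x, y) == group_diff == 0.30
```

This is why the regression output can be cross-checked against the summary-statistics table by hand.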
The figures are:
Figure 1 shows the distribution of RGI for Black participants across therapeutic areas, sorted by median. This is the main result figure. The expectation, based on the existing literature, is that oncology and cardiovascular trials will show the largest under-representation (most negative median RGI) because those diseases disproportionately burden Black patients while historically producing trial populations that skew white.
Figure 2 is a heatmap of mean RGI by therapeutic area and race group simultaneously. This is the figure where the structure of the problem becomes visually obvious: it shows, across all areas and groups at once, which cells are persistently negative (chronic under-representation) and which are positive (over-representation, which happens too and has its own implications).
Figure 3 plots median RGI by decade of trial completion, with IQR shading. The hypothesis is that RGI has improved over time as diversity guidance has accumulated, but the degree of improvement (if any) is an empirical question the data will answer.
Figure 4 compares mean |RGI| by sponsor class. The prediction here is genuinely mixed. On one hand, industry sponsors have larger and more consistent patient recruitment infrastructure, which should enable broader demographic reach. On the other, industry recruitment tends to concentrate in academic medical centers that disproportionately serve whiter, more affluent patient populations.
Figure 5 shows RGI by trial phase. Phase 3 trials are the most consequential for regulatory decision-making, and therefore the most important to characterize. The hypothesis is that Phase 3 trials may show smaller gaps because they require larger sample sizes and broader enrollment, but this is genuinely uncertain.
The regression analysis looks at whether sponsor class, trial phase, and decade are systematically predictive of |RGI| magnitude. The coefficients are reported with the caveat that these are univariate regressions on observational data. Any association found here is descriptive, not causal.
Here are LACUNA’s limitations.
The most significant is demographic underreporting. Not all completed trials report baseline demographics. The FDA's 2022 guidance encourages demographic reporting but does not make it retroactively mandatory for trials completed before the guidance took effect. The trials in the LACUNA dataset are the subset that chose to report demographics; they may not be representative of the full corpus of completed trials.
The condition mapping is keyword-based. Multi-indication trials (for instance, a trial of an anti-inflammatory in both cardiovascular and musculoskeletal indications) get assigned to a single primary area. The RGI computed for that trial uses the burden denominator for the primary area, which may not be the right denominator for all participants.
RGI uses the CDC WONDER cause-of-death database as the default burden denominator. Mortality data underestimates disease burden for conditions where patients survive for long periods (certain cancers with good prognoses, most psychiatric conditions, most musculoskeletal conditions). Prevalence data from NHANES and NHIS is available as a supplementary denominator for these cases, but coverage is incomplete and joining it requires manual work per condition.
None of these limitations are fatal to the analysis. LACUNA is most reliable for conditions with good CDC WONDER mortality data and clear condition-to-ICD10 mapping. It is less reliable for rare genetic diseases, psychiatric conditions, and dermatology. The documentation and code flag these cases explicitly so downstream users know where the analysis is on firmer ground versus where it is more approximate.
The question that has been running in the background the whole time I built this is: what does it mean to measure something that everyone already suspects is true?
The disparities LACUNA measures are not a secret. The FDA issued guidance because the problem was already documented. Researchers have been publishing individual-disease audits for years. In some sense I am building a more systematic version of a finding that is already in the literature, which might seem redundant.
I think it isn't, for two reasons.
The first is precision. "There are enrollment disparities" and "here is the exact RGI for Black patients in completed cardiovascular trials from 2010 to 2023, broken down by sponsor class and phase" are different levels of knowledge. The second sentence is actionable in ways the first isn't. You can argue with a direction; it's much harder to argue with a number and a confidence interval.
The second is infrastructure. A reproducible, openly documented pipeline that anyone can run on new data is different from a published paper. Papers age. Results freeze at the time of analysis. A codebase can be updated as new trials complete, as new burden data becomes available, as the FDA guidance evolves. The point of LACUNA is not just the result, it's the measurement apparatus.
This is the same logic that drove DermEquity. The point was not to reiterate bias existence, but to build a benchmark, a metric (the WCUG), and a dataset (the DermEquity Benchmark) that could be used to evaluate and compare models going forward. Fairness metrics are most useful as infrastructure for ongoing accountability, not as one-time demonstrations.
LACUNA is trying to do the same thing one level up: not just for model performance, but for the clinical evidence that the models are trained and validated on. It's a pretty neat full-circle moment.
It's interesting how systems develop blind spots, because it is typically not from malice. It is usually a combination of convenience (recruiting from the patient populations closest to the research institution), inertia (using the same recruitment infrastructure that worked for the last trial), and an unspoken assumption that the population studied is close enough to the population affected. That assumption is often wrong and is almost never examined explicitly at scale.
LACUNA is, in some form, a machine for examining that assumption explicitly. For every trial in the corpus, for every demographic dimension with available data, it asks: given who actually has this disease, who was in the room?
The answer, across thousands of trials, is going to be a distribution. Some trials will have near-zero RGI. Many will have substantially negative RGI for Black and Hispanic patients. Some will have positive RGI for groups that are heavily recruited in particular geographic or institutional contexts. The distribution will be the finding, and the distribution will be more honest than any summary statistic I could offer in its place.
The interesting question is not what LACUNA finds, exactly. It's what happens when the measurement exists and is public and reproducible and hard to argue away. Measurement does not fix structural problems. But it does make the problems harder to defer. You can say you'll address something later when the evidence is fuzzy. It becomes more difficult when the gap has a specific number and that number can be recomputed each year.
DermEquity, LACUNA, and everything I'm building in this space are participating in the same underlying project: making the invisible legible. And although legibility itself might not be sufficient, I am increasingly certain that it is necessary.
Cheers,
Angie X.
This project is open source at github.com/axshoe/LACUNA.