AI-Driven Exploration of Extremophile Metagenomes for Industrially Relevant Enzyme Discovery
January 2026 - April 2026
There is a peculiar asymmetry baked into the structure of industrial biotechnology. The organisms whose chemistry would be most useful to us, the ones whose proteins function at 100°C, at pH 2, in salt concentrations that would crystallize most cellular machinery, are also overwhelmingly organisms no one has ever grown in a laboratory. They exist in databases as fragments of DNA scraped via some extraordinary means from hydrothermal vents and volcanic soils, annotated and indexed behind interfaces that often require specialist fluency.
Our knowledge of their existence accumulates, yet our access doesn't democratize at the same pace. That gap, between what has been sequenced and what has been made more accessible to someone without substantial funding and a dedicated team, is where software can be of some help.
Archaeon is a Python pipeline I created that queries NCBI and MGnify for extremophile protein sequences, extracts five thermostability features from each, scores them with a composite metric I've knighted the "Industrial Applicability Score", optionally annotates the top candidates via BLAST against eight reference enzyme families + predicts their 3D structures through ESMFold, and generates a self-contained interactive HTML report. It runs locally and costs nothing beyond a free NCBI API key.
I built this tool with some prior background in molecular biology (I used to compete in USNCO in middle school + was obsessed with biology when I was 14). I've always had an inclination towards the bio-sciences, so this project was especially delightful from start to finish.
What I can say is this: the most interesting solution ideas I've encountered in my 17 years tend to dwell at the intersection of a field that has accumulated enormous knowledge and a field that has built tools for making knowledge free. Archaeon is yet another attempt, however imperfectly, to stand in that gap and build a little bridge.
Cheers,
Angie X.
Most of the enzymes that drive modern industry (e.g. the proteins that break down wood pulp into paper, help your laundry detergent work in cold water, or convert plant material into biofuel) were originally discovered by accident. Researchers stumbled across tiny organisms living in boiling hot springs or highly acidic mines and realized that anything surviving there must have evolved extraordinary chemical machinery. The problem is that most of these extreme environments harbor microbial life that has never been grown in a laboratory. We know it is there because we can detect its DNA, but we can't cultivate it, which means these enzymes remain elusive and uncharacterized.
Archaeon is a tool I built to change that. It queries 2 major public databases (NCBI and MGnify) for protein sequences collected directly from environmental samples in 8 extreme biome categories: deep-sea hydrothermal vents, hot springs, permafrost, acid mine drainage, hypersaline lakes, desert soils, alkaline lakes, and deep subsurface rock. For each candidate protein it retrieves, the pipeline computes 5 sequence-based thermostability features drawn from peer-reviewed biophysics literature, then combines them w/ a BLAST homology search against 8 industrial enzyme families to produce a single ranked score called the Industrial Applicability Score (IAS). Top candidates are optionally passed to ESMFold, Meta AI's protein structure prediction model, which predicts their 3D shape in seconds w/out requiring experimental data. The final output is a self-contained interactive HTML report with charts, ranked candidate tables, and embedded 3D structure viewers.
The whole thing runs locally, costs nothing, and requires only a free NCBI API key. It's important to note that Archaeon doesn't replace experimental biochemistry, and every candidate it identifies needs lab validation before it's industrially useful. What Archaeon does is compress months of database sifting + manual feature calculation into a single pipeline, giving researchers a solid starting point instead of just a guess, a hope, and a prayer.
You would think that, as a species, we'd have mastered enzymes by now. Drug discoveries, gene therapies, corn on steroids: there's much to be optimistic about (at least from my perspective) in the field of protein discovery.
Industrial biotech, I've found, is a bit different.
Industrial processes are extremely chemically brutal. Laundry detergent, bioethanol production, paper bleaching, pharmaceutical synthesis, wastewater treatment: these processes all run at temperatures, pH levels, and salt concentrations that would denature, probably in seconds, any protein evolved for life inside a normal cell. The global industrial enzyme market sits in the tens of billions of dollars annually, and the fundamental rate-limiting step remains what it was decades ago: find a protein that performs the function you need under the conditions your process actually imposes.
Extremophiles are the answer biology arrived at long before we thought to even ask the question. Sulfolobus acidocaldarius grows in volcanic hot springs at pH 2 and 80°C. Pyrococcus furiosus lives near deep-sea hydrothermal vents at 100°C. Halobacterium salinarum thrives in salt concentrations that'd crystallize most cells. Their enzymes are the products of billions of years of selection pressure applied to precisely the conditions industrial chemistry imposes, which is exactly why we want them.
The problem, therefore, is access. Culturing these organisms requires specialized, expensive equipment most labs don't own. Metagenomic sampling has produced enormous databases of sequences from organisms that have never actually been grown in a controlled setting and may very well never be, but sifting through these databases requires money, time, and considerable effort. The computational step that decides which experiments are worth doing has been, by default, an expert-only zone.
This is the problem Archaeon attempts to address.
You're likely familiar with my workflow by now, but here it is for old time's sake:
Beyond the AP Biology class I'm taking this year, which focuses more on general biology, I hadn't actually touched a molecular biology textbook in a few years. Thus, I picked up some books from the library and got to studying.
For roughly two weeks, running in parallel with actual coding, I rehashed some concepts on Khan Academy, MIT OpenCourseWare, and YouTube. I also, like always, read a few papers. There was the Kyte-Doolittle hydrophobicity scale paper from 1982, because I wanted to understand where the specific numerical values in the GRAVY formula came from before I implemented it. The Szilagyi and Zavodszky 2000 paper on charged residue analysis in thermophilic proteins, because I wanted to understand the physical mechanism before encoding it as a feature. The Guruprasad 1990 instability index paper was also a notable source, since BioPython's implementation uses a slightly different dipeptide weight table than the original, and I had to know which one I was actually computing.
There were 3 things that surprised me in this process. First, how much thermostability biochemistry reduces to basic thermodynamics once you strip away the jargon: proteins stay folded when folding is energetically favorable, tight hydrophobic packing lowers the energy of the folded state, and high proline content reduces the entropy of the unfolded state because proline's cyclic side chain is conformationally restricted, leaving the unfolded chain fewer microstates. The aliphatic index, the instability index, the charged residue fraction: none of these are mysterious black-box correlations (hello, diagnostic AI!). They are direct consequences of physics I could reason about from first principles, which made them feel comfortably intuitive.
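For concreteness, here is a minimal sketch of three of those features, computed as the original papers define them. This is illustrative code, not Archaeon's features.py; the instability index is omitted because it requires Guruprasad's full 400-entry dipeptide weight table.

```python
# Kyte-Doolittle (1982) hydropathy values, used for GRAVY.
KD = {"I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9,
      "A": 1.8, "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3,
      "P": -1.6, "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5,
      "K": -3.9, "R": -4.5}

def gravy(seq: str) -> float:
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    return sum(KD[aa] for aa in seq) / len(seq)

def aliphatic_index(seq: str) -> float:
    """Ikai (1980): AI = X(Ala) + 2.9*X(Val) + 3.9*(X(Ile) + X(Leu)),
    where X is the mole percent of each residue. Higher values track
    with tighter hydrophobic packing and thermostability."""
    n = len(seq)
    x = lambda aa: 100.0 * seq.count(aa) / n
    return x("A") + 2.9 * x("V") + 3.9 * (x("I") + x("L"))

def charged_fraction(seq: str) -> float:
    """Fraction of residues charged at neutral pH (D, E, K, R); elevated
    in thermophilic proteins per Szilagyi and Zavodszky (2000)."""
    return sum(seq.count(aa) for aa in "DEKR") / len(seq)
```

Each function is a direct transcription of a published formula, which is exactly why these features feel like physics rather than a black box.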
Second, how staggeringly large the uncultured fraction of microbial diversity actually is. By most estimates, fewer than one percent of environmental microbial species have ever been grown in isolation in a laboratory. MGnify contains protein sequences from organisms with no species-level taxonomy assigned, no culturing protocol, and no lab characterization, because no one has yet managed to grow them. This reinforced my belief that building a tool that queries this space directly, rather than only the subset that has been domesticated into petri dishes, was an important step for the field.
Third, and perhaps most immediately relevant: ESMFold is an unreasonably scrumptious piece of software to be so free and publicly accessible. We all know that AlphaFold2 changed structural biology permanently, but it requires a multiple sequence alignment, which is computationally expensive to build and not always feasible for novel metagenomic sequences with few known homologs. ESMFold predicts protein structure directly from a single sequence using a language model trained on 250 million protein sequences. It's like if Sam Altman and John Jumper had a baby. For exactly the metagenomic proteins from undersampled biomes that Archaeon targets, the absence of an MSA requirement is a feature, not a concession. It runs on Meta's servers, and there is a free REST API.
Gosh, I love the internet.
The core design question of this project was: given a list of protein sequences retrieved from a database, how do you rank them by industrial relevance without running any experiments?
The honest answer is, of course, that you cannot do this perfectly. Experimental validation is the only ground truth, plain and simple, and I won't pretend otherwise. What Archaeon does is make the pre-experimental prioritization step principled, reproducible, and fast, which is the actual job of a computational screening tool before any wet-lab commitment.
The Industrial Applicability Score (IAS) combines 3 independent signal sources:
Thermostability features at 40% weight, a composite of five sequence-level predictors each normalized against absolute biological reference ranges rather than batch min-max, because min-max normalization makes scores dependent on which candidates happen to be in the current run and destroys cross-run comparability.
Sequence quality at 20% weight, a piecewise linear penalty on length that filters against assembly fragments too short to be real enzymes and proteins too long to express cleanly in E. coli.
BLAST identity to reference thermostable enzymes at 40% weight, because function annotation is the single biggest bottleneck in industrial enzyme development, and percent identity to a characterized homolog is the most efficient available computational proxy for it.
As always, these weights are literature-informed priors. I did not fit them to a labeled training set of experimentally measured thermostability values. If I had such a dataset, logistic regression or a gradient-boosted model would be the appropriate tool. What I have instead is a theoretically grounded ranking system designed for relative prioritization within a candidate set, which is what this stage of a discovery pipeline actually needs. I say this not as a disclaimer but as a description of what the IAS is and is not. Understanding the difference between a principled prior and a calibrated prediction is, imo, one of the more underrated distinctions in applied science.
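The three-component combination can be sketched in a few lines. The 0.40 / 0.20 / 0.40 weights and the neutral-50 BLAST default are from the design described here; the specific reference ranges, length bounds, and taper width are illustrative placeholders, not Archaeon's actual values.

```python
def norm_absolute(value: float, lo: float, hi: float) -> float:
    """Normalize against a fixed biological reference range [lo, hi],
    clamped to 0-100. Unlike batch min-max, this keeps scores
    comparable across runs with different candidate sets."""
    return 100.0 * max(0.0, min(1.0, (value - lo) / (hi - lo)))

def length_score(n: int, lo: int = 100, hi: int = 800, ramp: int = 50) -> float:
    """Piecewise linear length penalty: full score inside [lo, hi],
    tapering to zero over `ramp` residues outside it. Filters assembly
    fragments (too short) and hard-to-express proteins (too long)."""
    if n < lo:
        return 100.0 * max(0.0, 1 - (lo - n) / ramp)
    if n > hi:
        return 100.0 * max(0.0, 1 - (n - hi) / ramp)
    return 100.0

def ias(thermo_features, seq_len, blast_identity=None):
    """thermo_features: list of (value, ref_lo, ref_hi) triples,
    one per thermostability predictor."""
    thermo = sum(norm_absolute(v, lo, hi)
                 for v, lo, hi in thermo_features) / len(thermo_features)
    quality = length_score(seq_len)
    # Unannotated candidates get a neutral 50 for the BLAST component.
    blast = 50.0 if blast_identity is None else blast_identity
    return 0.40 * thermo + 0.20 * quality + 0.40 * blast
```

The absolute-range normalization is the detail that matters most: a candidate's score depends only on its own biology, never on which other sequences happened to be in the same run.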
The architecture is deliberately simple. Four analysis modules, one CLI, one report generator, no substantial moving parts that require a server or a cloud account.
ncbi.py wraps Biopython's Entrez interface for sequence retrieval and NCBIWWW for remote BLAST. The nontrivial part was constructing search queries that reliably return extremophile sequences rather than noise. NCBI's organism metadata fields are inconsistently populated across its database, so the queries were tuned iteratively against live API responses until they produced reasonable diversity. BLAST is optional by design because remote NCBI BLAST runs on a shared public queue and takes 5 to 30 minutes per sequence. Personally, I think it's worth the wait but not worth making mandatory.
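Archaeon goes through Biopython's Entrez wrapper, but the underlying NCBI E-utilities REST calls look roughly like this dependency-free sketch. The biome keywords and length bounds are illustrative, not Archaeon's tuned queries.

```python
import urllib.parse
import urllib.request

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def build_query(biome_terms, min_len=100, max_len=800):
    """OR together biome keywords, then restrict by sequence length.
    [SLEN] is NCBI's sequence-length field for the protein database."""
    keywords = " OR ".join(f'"{t}"' for t in biome_terms)
    return f"({keywords}) AND {min_len}:{max_len}[SLEN]"

def esearch_url(query, retmax=20, api_key=None):
    """Build an esearch URL; a free API key raises NCBI's rate limit
    from 3 to 10 requests per second."""
    params = {"db": "protein", "term": query,
              "retmax": retmax, "retmode": "json"}
    if api_key:
        params["api_key"] = api_key
    return f"{EUTILS}/esearch.fcgi?" + urllib.parse.urlencode(params)

def fetch(url):
    """Network call, kept separate so the query builders stay testable offline."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode()
```

The iterative tuning mentioned above lives entirely in `build_query`: the code is trivial, the query string is where the work is.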
mgnify.py queries MGnify's JSON REST API. The key detail is that MGnify Proteins is a distinct endpoint from MGnify Samples, and the biome lineage strings used for filtering are exact, case-sensitive, and documented in a way that requires actually reading the API specification to get right. Getting them wrong silently returns empty results. I learned this by spending an afternoon confused about why the database appeared empty.
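The lineage pitfall is easier to see in code. This is a sketch against MGnify's v1 JSON API; the exact endpoint path and lineage spelling are assumptions to verify against the live API spec, not guaranteed to match Archaeon's mgnify.py.

```python
import urllib.parse

MGNIFY = "https://www.ebi.ac.uk/metagenomics/api/v1"

# Biome lineage strings are exact and case-sensitive, e.g.:
HYDROTHERMAL = "root:Environmental:Aquatic:Marine:Hydrothermal vents"

def samples_url(lineage, page_size=20):
    """Build a samples query for one biome. The lineage segment must be
    percent-encoded (it contains colons and spaces); a typo here does
    not raise an error, it just silently matches nothing, which is the
    afternoon-eating failure mode described above."""
    encoded = urllib.parse.quote(lineage, safe="")
    return f"{MGNIFY}/biomes/{encoded}/samples?page_size={page_size}"
```

Checking that a query returns a nonzero result count before trusting it downstream is the cheap defense against the silent-empty behavior.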
features.py implements all five thermostability features from scratch from original papers rather than using BioPython's ProteinAnalysis class. This was partly epistemic stubbornness: I wanted to understand the formulas rather than call them. It was also practical: BioPython's instability index implementation uses a slightly modified dipeptide weight table that differs from Guruprasad 1990, and I wanted the original.
scorer.py computes the IAS as a weighted sum of normalized component scores. The choice to anchor normalization to absolute biological reference ranges rather than batch statistics is the most important implementation detail for reproducibility across different runs with different candidate sets.
structure.py submits sequences to ESMFold's REST API and parses the returned PDB files. ESMFold returns pLDDT confidence scores as decimal fractions in the B-factor column of ATOM records, which requires rescaling before passing to 3Dmol.js for visualization. The report rescales these inline when rendering structure viewers.
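The rescaling step is mechanical enough to show. This sketch (column positions per the fixed-width PDB format, where the B-factor occupies columns 61-66 of ATOM records) is illustrative rather than Archaeon's exact code.

```python
def rescale_plddt(pdb_text: str) -> str:
    """Rescale ESMFold's 0-1 pLDDT fractions in the B-factor column to
    the conventional 0-100 scale expected by viewers like 3Dmol.js."""
    out = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")) and len(line) >= 66:
            b = float(line[60:66])  # B-factor field, columns 61-66
            if b <= 1.0:  # only rescale values that look like fractions
                line = line[:60] + f"{b * 100:6.2f}" + line[66:]
        out.append(line)
    return "\n".join(out)
```

The `b <= 1.0` guard makes the function idempotent: running it on a file that already carries 0-100 scores leaves it untouched.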
report.py generates a self-contained HTML file. Chart.js and 3Dmol.js load from CDN. All candidate data is embedded as JavaScript. PDB strings are stored in inert script tags and initialized in a single deferred block after page load, which solved the timing problem of 3Dmol initializing before its CDN script had finished loading. The file can be opened in any browser, hosted on GitHub Pages, or emailed w/out modification.
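The inert-script-tag plus deferred-init pattern looks roughly like this. A minimal sketch, assuming a candidates list of dicts with an "id" key and a single viewer div; the CDN URL, element ids, and field names are illustrative, not report.py's.

```python
import html
import json

def render_report(candidates, pdb_by_id):
    """Emit a self-contained HTML page with embedded data and one 3D viewer."""
    # PDB text parked in inert script tags: type="text/plain" stops the
    # browser from executing or parsing the contents.
    pdb_tags = "\n".join(
        f'<script type="text/plain" id="pdb-{html.escape(cid)}">{pdb}</script>'
        for cid, pdb in pdb_by_id.items()
    )
    return f"""<!DOCTYPE html>
<html>
<head>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/3Dmol/2.0.4/3Dmol-min.js"></script>
</head>
<body>
  <div id="viewer" style="width:400px;height:300px;position:relative"></div>
  {pdb_tags}
  <script>
    const CANDIDATES = {json.dumps(candidates)};
    // Single deferred block: the "load" event fires only after the CDN
    // scripts above have loaded, so $3Dmol exists before initialization.
    window.addEventListener("load", () => {{
      const v = $3Dmol.createViewer("viewer");
      const pdb = document.getElementById("pdb-" + CANDIDATES[0].id).textContent;
      v.addModel(pdb, "pdb");
      v.setStyle({{}}, {{cartoon: {{}}}});
      v.zoomTo();
      v.render();
    }});
  </script>
</body>
</html>"""
```

Everything the page needs is either inlined or fetched from a CDN, which is what lets the single file be opened locally, hosted, or emailed unchanged.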
A standard run with --max-per-biome 20 returns roughly 80 to 100 unique candidates after deduplication. Without BLAST, IAS scores cluster in the 65 to 75 range because the BLAST component defaults to a neutral 50 for unannotated candidates, compressing the distribution and limiting differentiation. With BLAST enabled, scores diverge meaningfully: candidates matching a known enzyme family jump to 85 to 95, while candidates with no recognizable homologs score lower even when their thermostability features are strong.
The most scientifically interesting candidates are not necessarily the highest-scoring ones. Sequences from MGnify metagenomic sources with weak BLAST identity but strong thermostability feature signals represent real computational novelty: proteins that show all the structural hallmarks of thermal adaptation but do not closely resemble anything well-characterized. They are the riskiest experimental targets and, arguably, the most extraordinary ones.
The honest limitation is that I cannot tell you whether any top-IAS candidate is actually thermostable without running an experiment. The aliphatic index threshold of above 80 for thermostability is a population-level observation from a 1980 paper. Individual proteins violate it regularly, and a real industrial pipeline would obviously follow this computational screen with differential scanning fluorimetry for rapid experimental thermostability measurement before committing to full expression + purification. Archaeon is the step before that step, and understanding this is a prerequisite for responsible use.
The molecular biology was pleasing and the pipeline was satisfying to build. But the question I keep returning to is narrower and more durable: when is an uncalibrated score useful?
This is, I think, an underexamined question in applied science. Most scientific scoring systems in active industrial use are not experimentally calibrated in any rigorous sense. They are structured compressions of prior knowledge applied to new observations: theoretically grounded rather than empirically fit. The IAS belongs to this category, with its limitations clearly documented, which I believe makes it more useful than either overstating its predictive power or dismissing it as merely theoretical hullabaloo.
What I can defensibly claim is that a candidate with an aliphatic index above 90, an instability index below 35, and a charged residue fraction above 0.25 is a more principled experimental target than a candidate with the opposite profile. That claim is mechanistically grounded, but the probability it holds for any specific individual protein is unknown. The space between defensible and certain is where computational screening is supposed to land, and it's best not to avoid it or pretend it's narrower than it is.
The other thing I confirmed at a level that now feels irreversible: the knowledge that actually moves things is almost always sitting in a place that rewards the people willing to read past the abstract. The formulas I implemented came from papers published between 1980 and 2000, written by researchers who spent years characterizing a handful of proteins at a time using equipment that would now be considered prehistoric. Those papers are not glamorous, yet they are where the ground truth lives. I find this encouraging in a way I did not entirely expect. Society is very loud about novelty. The real stuff is mostly just sitting there, waiting for someone to bother reading it.
I'm not an expert in the field of metagenomics or extremophiles in any shape or form (haha, get it). Regardless, I've finished this project with many thoughts and curiosities about what it means that the organisms living in the places most hostile to life are also the ones producing the most remarkable biochemistry.
The obvious explanation is evolutionary pressure. Survival in extreme environments demands extreme solutions. But I think there is something philosophically worth sitting with underneath that explanation. Constraints are not just limiting. They are generative. The absence of comfortable conditions, of easy solutions, of the standard approaches that work for everyone else, forces a kind of creativity which gentler environments rarely require. The enzyme that functions at 100°C has to solve a problem that no mesophilic protein has ever faced. The solution it arrived at over billions of years of iteration is something we cannot yet (easily) engineer from scratch, even with all the tools we have.
I think about this a lot in relation to my own work, not that I'm comparing myself to a hyperthermophilic archaeon (though I won't rule this out entirely), but because the projects I have found most generative have almost all emerged from friction I did not choose. Building DermEquity came from noticing a failure in something I had already built, not from setting out from the get-go to design a fairness framework. NEXUS came from being frustrated with how opaque financial jargon felt to a non-DECA bro surrounded by DECA bros. Archaeon came from a throwaway thought following an AP Bio lab.
None of these were designed in advance. They were responses to something that wasn't working, to a gap that ideally shouldn't've been there but was.
I do not think this pattern is accidental. I think friction is, in a non-trivial sense, the point. Not in some motivational-poster, alpha-male sense of romanticizing struggle, which I find personally quite reductive. More in the thermodynamic sense: systems forced to solve hard problems tend to develop cool, generalizing solutions. The enzyme that evolved in boiling acid does not only work in boiling acid. It often functions across a much wider range of conditions than the enzyme that evolved under gentler constraints. Capability built under difficulty tends to transfer more broadly than capability acquired in ease. I have noticed this is true in chemistry and I suspect it is true in people, though the sample size of my own life is still pretty limited.
Archaeon is, at its core, an itty bitty slice of infrastructure for a problem I am not positioned to solve all the way. I can write the computational screening step. I cannot run the experiments. I can rank the candidates. I cannot verify the rankings without a wet lab. There is something very humbling and inevitably unfortunate about building an answer you know is incomplete by design, something that is explicitly a first step and not a final cure. However, it does remove the pressure of totality. The work becomes what it is: a contribution to a chain that will require many other people, many other tools, and years of iteration to complete.
While I want to be a benevolent universal overlord someday, this seems right to me. Most of the work worth doing has this structure. Perhaps I am not destined to be Lord Palpatine, after all.
The things in the easy-to-reach places have already been found. The interesting ones are still out there in the vents and the acid pools, waiting for someone to build the tools and grow the guts to ask for them properly.
What are you waiting for?
Cheers,
Angie X.
This project is open source at github.com/axshoe/Archaeon.