SOLUM

A Cross-Land-Use Transferability Benchmark for Soil Organic Carbon Vis-NIR Prediction Models

March 2026 - May 2026

Introduction

I arrived in soil science tabula rasa, which is not false modesty but a statement of fact. Who knew dirt had a data problem? Not me.

The first time I read about pedogenesis, I wondered, at some length, why anyone had christened the upper horizon the solum. It sounds like a pharmaceutical you would see advertised on TV, or possibly a minor deity worshipped by a very small cult in a very remote village. It is neither. In fact, it's where carbon accumulates and cycles, and where the longue durée of land use history writes itself into the ground. And apparently, it is quite literally hell to predict.

Visible-near-infrared spectroscopy is the preferred shortcut. You illuminate soil, measure the reflectance across hundreds of wavelengths, and train a model to estimate carbon. Works beautifully within a given ecosystem. What happens when you cross an ecosystem boundary is less studied and, in practice, quietly elided. The pattern I kept encountering was train on cropland, test on cropland, excellent R squared. Nobody asked what happens when you apply that model to woodland soil. This is the kind of lacuna that makes you suspect the field has been engaged in a collective act of mass hysteria, or rather, strategic ignorance.

To my knowledge, no published work has produced a systematic all-pairs transferability matrix across the major LUCAS land use classes in a single unified benchmark. That is the gap. SOLUM is a measurement paper. The methods are standard by design. The contribution is scope and systematicity. I built PLSR from scratch because working through NIPALS forces an honest reckoning with what the algorithm actually does. Random Forest comes from scikit-learn, which I am comfortable acknowledging. The question is not whether the models work. The question is where they break, and whether that breaking follows a pattern we could have predicted but never bothered to inscribe.

This is SOLUM. *

Cheers,
Angie X.

*not the drug, deity, or soil horizon.

SOLUM: A Layman's Guide

Soil organic carbon is the carbon stored in decomposed plant and animal material in the top layer of soil. It matters for climate policy because soil holds a significant fraction of global terrestrial carbon, and changes in land use can rapidly shift that balance. It matters for agriculture because SOC correlates with soil fertility, water retention, and long-term productivity. It is also invisible from the surface and traditionally expensive to measure, as laboratory analysis requires chemical digestion and specialized equipment.

Visible-near-infrared spectroscopy offers a shortcut. A spectrometer shines light on a soil sample and measures how much light is reflected at each wavelength, from the visible range through the near-infrared. Different molecular bonds absorb light at characteristic wavelengths: water absorbs near 1400 and 1900 nanometers; carbon-hydrogen bonds absorb in the 2000-2200 nanometer range; iron minerals absorb in the visible range. The spectral fingerprint encodes a great deal of soil chemistry.

Machine learning models can learn the mapping from spectral fingerprints to SOC values if trained on enough labeled examples where both the spectrum and the true lab-measured SOC are known. The problem is that these models learn the mapping specific to the soils they were trained on, meaning a model trained on one land use type learns a mapping that works there. Whether that mapping holds for a different land use type is a separate question, and the one SOLUM investigates.

The Transferability Matrix is the core output: an N-by-N table where each row represents a training land use class and each column a testing land use class. Diagonal entries show in-domain performance; off-diagonal entries show what happens when the model crosses a boundary. Performance is measured by RPD (Ratio of Performance to Deviation): above 2.0 is good, 1.4 to 2.0 is moderate, below 1.4 is effectively not useful for quantitative prediction.

SOLUM Concepts Reference - A. Xiu.pdf

Imagine you have a magical scale that can tell you how much sugar is in a piece of fruit just by looking at its color. You train it only on apples (red and green), and it works great! But when you point it at a banana, the scale gets confused. Its "color logic" doesn't work on yellow, so it gives a wildly wrong sugar reading.

This is the exact problem in soil science. Scientists use a technique called visible-near-infrared spectroscopy (shining light on soil to read its "color" across hundreds of wavelengths) to quickly estimate Soil Organic Carbon (SOC). Often, a model is trained on one soil type with the assumption that it’ll work on other soil types as well. SOLUM’s job is to prove this assumption wrong.

PHASE 1: The Dataset + The Problem

The LUCAS (Land Use/Cover Area frame statistical Survey) 2015 Topsoil dataset is maintained by the EU Joint Research Centre and contains 21,859 georeferenced soil samples from EU member states. Each sample has measured SOC in grams per kilogram and Vis-NIR spectral absorbance recorded at 4,200 wavelength bands from 400 to 2499.5 nanometers in 0.5 nm steps. SOLUM uses the 2015 version because it is the most recent release to include the Vis-NIR spectral data as a public download.

The LUCAS 2015 dataset arrives as two separate files: the soil chemistry CSV and a set of per-country spectra CSVs. Each country file contains two replicate scans per sample. SOLUM averages the replicates before merging on POINT_ID. After filtering to samples with valid SOC, complete spectra, and known land use codes, the working dataset contains 21,677 samples across five retained land use classes. Arable cropland (A) has only 50 samples in LUCAS 2015 and is excluded for being below the 200-sample minimum.
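
The replicate averaging and merge can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's loader: only POINT_ID comes from the text above, while the column names OC (SOC) and LU (land use code) and the function name are hypothetical placeholders.

```python
import numpy as np
import pandas as pd


def merge_lucas(chem: pd.DataFrame, spectra: pd.DataFrame,
                min_samples: int = 200) -> pd.DataFrame:
    """Average replicate scans per POINT_ID, merge with chemistry, filter classes.

    Assumed layout (hypothetical): `spectra` has a POINT_ID column plus one
    column per wavelength band; `chem` has POINT_ID, OC, and LU columns.
    """
    band_cols = [c for c in spectra.columns if c != "POINT_ID"]
    # Two replicate scans per sample collapse to one mean spectrum per POINT_ID.
    mean_spectra = spectra.groupby("POINT_ID", as_index=False)[band_cols].mean()
    merged = chem.merge(mean_spectra, on="POINT_ID", how="inner")
    # Keep rows with valid SOC, a known land use code, and a complete spectrum.
    merged = merged.dropna(subset=["OC", "LU"] + band_cols)
    # Drop land use classes that fall below the minimum sample count.
    counts = merged["LU"].value_counts()
    keep = counts[counts >= min_samples].index
    return merged[merged["LU"].isin(keep)].reset_index(drop=True)
```

Averaging before the merge keeps one row per sample, so the 200-sample minimum counts soils rather than scans.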

The five retained classes and their SOC characteristics: Cropland (Permanent, B) with 8,941 samples and mean SOC of 17.6 g/kg; Woodland (C) with 4,286 samples and mean SOC of 93.1 g/kg; Shrubland (D) with 844 samples and mean SOC of 48.7 g/kg; Grassland (E) with 7,007 samples and mean SOC of 45.0 g/kg; Bare Land (F) with 599 samples and mean SOC of 16.7 g/kg.

The SOC range differences are substantial. Woodland mean SOC (93.1 g/kg) is more than five times the cropland mean (17.6 g/kg). This is not a subtle distributional difference that a model might paper over: it represents fundamentally different organic matter chemistry, dominated by aromatic litter-derived compounds in woodland versus aliphatic humic acids in managed cropland. Any model attempting to bridge that boundary is extrapolating in the most demanding sense.

Preprocessing applies Savitzky-Golay smoothing (window=11, polynomial degree=2) followed by Standard Normal Variate transformation. Both are implemented from scratch. SG smoothing removes high-frequency sensor noise while preserving absorption band positions and shapes. SNV mean-centers and scales each spectrum independently, removing multiplicative scatter effects caused by differences in particle size and surface texture between samples.
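
A minimal NumPy sketch of the two preprocessing steps, assuming spectra are stored as rows of a 2-D array. The edge handling (repeating the end values) and the sample standard deviation (ddof=1) in SNV are assumptions made for illustration; the text does not specify either.

```python
import numpy as np


def savgol_smooth(spectra: np.ndarray, window: int = 11, polyorder: int = 2) -> np.ndarray:
    """Savitzky-Golay smoothing along the wavelength axis (rows are samples)."""
    half = window // 2
    offsets = np.arange(-half, half + 1)
    # Least-squares polynomial fit over the window, evaluated at its center:
    # the first row of the pseudo-inverse of the Vandermonde matrix is the filter.
    vander = np.vander(offsets, polyorder + 1, increasing=True)
    coeffs = np.linalg.pinv(vander)[0]
    # Assumption: pad each spectrum by repeating its edge values.
    padded = np.pad(spectra, ((0, 0), (half, half)), mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, window, axis=1)
    return windows @ coeffs


def snv(spectra: np.ndarray) -> np.ndarray:
    """Standard Normal Variate: center and scale each spectrum independently."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std


def preprocess(spectra: np.ndarray) -> np.ndarray:
    """Smoothing first, then SNV, in the order described above."""
    return snv(savgol_smooth(spectra, window=11, polyorder=2))
```

Because SNV works row by row, two spectra that differ only by a multiplicative scale and an additive offset map to the same output, which is the scatter-removal property described above.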

Figure 1: SOC distribution by land use class (LUCAS 2015 Topsoil). Violin plots show the full distribution of soil organic carbon (g/kg) for each retained land use class. Horizontal bars indicate medians. Woodland (C) has the highest and most variable SOC (mean 93.1 g/kg), more than five times the cropland mean (17.6 g/kg). These distributional differences directly condition cross-domain prediction difficulty.

Figure 2: Mean SNV-preprocessed Vis-NIR absorbance spectra by land use class. Each curve is the mean across all samples following Savitzky-Golay smoothing (window=11, polynomial order=2) and Standard Normal Variate transformation. Shaded regions = ±1 SD. Vertical bands near 1400 nm and 1900 nm mark water absorption features. Systematic shape differences between woodland (C) and permanent cropland (B) in the 1800–2200 nm region are the primary spectral source of cross-domain prediction failure.

PHASE 2: Model Training

Partial Least Squares via NIPALS

PLSR (Partial Least Squares Regression) is the workhorse of chemometrics and soil spectroscopy, and it earns that status honestly. With thousands of spectral bands and hundreds to thousands of samples, ordinary least squares is numerically useless: the bands are massively collinear, so the matrix XᵀX that OLS must invert is rank-deficient or close to it. PLSR finds latent variables, called components, that maximize the covariance between projections of the spectral predictor matrix X and the SOC target y, so each component captures variance in X that is also relevant to y. This is a different objective from PCA, which maximizes variance in X alone without regard to y.

I implemented PLSR via the NIPALS (Nonlinear Iterative Partial Least Squares) algorithm in plsr.py. NIPALS extracts one component at a time. For each component, it initializes x-weights by computing the cross-product of the spectral residual matrix with the y residual vector, then iterates to convergence. Once scores and loadings are computed for that component, both matrices are deflated, meaning the variance explained by that component is subtracted, and the next component is extracted from the residuals. This continues until the requested number of components is reached.
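
The component-by-component extraction can be sketched for a single response (PLS1) as below. This is not the contents of plsr.py; the class and attribute names are hypothetical. One simplification is labeled in the code: with a single y column the NIPALS inner iteration converges in one pass, so the weights are computed directly rather than in a loop.

```python
import numpy as np


class NipalsPLS1:
    """Illustrative single-response PLS via NIPALS-style deflation."""

    def __init__(self, n_components: int):
        self.n_components = n_components

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        self.x_mean_ = X.mean(axis=0)
        self.y_mean_ = y.mean()
        E = X - self.x_mean_          # spectral residual matrix
        f = y - self.y_mean_          # y residual vector
        k, n_bands = self.n_components, X.shape[1]
        W = np.zeros((n_bands, k))    # x-weights
        P = np.zeros((n_bands, k))    # x-loadings
        q = np.zeros(k)               # y-loadings
        for a in range(k):
            # Weights from the cross-product of the residuals; for one
            # response this is already the converged NIPALS solution.
            w = E.T @ f
            w /= np.linalg.norm(w)
            t = E @ w                 # scores
            tt = t @ t
            p = E.T @ t / tt
            q[a] = f @ t / tt
            # Deflate: subtract what this component explains.
            E = E - np.outer(t, p)
            f = f - q[a] * t
            W[:, a], P[:, a] = w, p
        # Regression coefficients in the original spectral space.
        self.coef_ = W @ np.linalg.solve(P.T @ W, q)
        return self

    def predict(self, X):
        return (np.asarray(X, dtype=float) - self.x_mean_) @ self.coef_ + self.y_mean_
```

A quick sanity check for a sketch like this: when the number of components equals the rank of X, the fit coincides with ordinary least squares.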

The number of components is the key hyperparameter. Too few and the model is underfit; too many and it begins to overfit to noise in the training spectra. I select the optimal number via leave-one-out cross-validation on the training set for datasets up to 300 samples, and 5-fold cross-validation for larger datasets. The component count that minimizes RMSECV is used for the final model.

The NIPALS implementation also computes regression coefficients directly in the original spectral space. This is a very useful property: you can inspect which wavelength bands receive the largest regression weights, which gives a baseline understanding of which spectral regions drive SOC predictions before SHAP provides the more nuanced per-prediction attributions.
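
Inspecting the coefficients might look like the sketch below. The helper name is hypothetical, and the wavelength grid is rebuilt from the 400 to 2499.5 nm range and 0.5 nm step stated earlier rather than read from the data files.

```python
import numpy as np

# Wavelength grid implied by the dataset description: 4,200 bands.
WAVELENGTHS = np.arange(400.0, 2500.0, 0.5)


def top_weighted_bands(coef, wavelengths=WAVELENGTHS, k=10):
    """Return (wavelength_nm, coefficient) for the k largest |coefficients|."""
    coef = np.asarray(coef, dtype=float)
    order = np.argsort(np.abs(coef))[::-1][:k]
    return [(float(wavelengths[i]), float(coef[i])) for i in order]
```

Neighboring bands are highly correlated, so large coefficients tend to arrive in clusters; reading them as regions rather than as individual wavelengths is usually the safer interpretation.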

Random Forest

Random Forest is not implemented from scratch (surprise surprise). scikit-learn's RandomForestRegressor is the implementation used here. The decision is a matter of honesty about where the intellectual content lies: reimplementing tree splitting, bootstrapping, and aggregation would produce hundreds of lines of code that express nothing I do not already understand, and would introduce numerical differences relative to the benchmark without any scientific benefit. The NIPALS PLSR implementation is from scratch because implementing it explicitly forces a confrontation with what the algorithm actually does. Random Forest does not offer the same pedagogical payoff in this context.

Hyperparameter tuning uses a grid search over n_estimators (number of trees: 100, 200, 500) and max_features (features considered at each split: sqrt, log2, 0.1, 0.3 of total bands). For high-dimensional spectral data, restricting max_features to a fraction of the 4,200 bands is essential because without this constraint, every tree would have access to nearly identical sets of correlated bands, collapsing ensemble diversity. The grid is evaluated via 5-fold cross-validation scored by RMSE.
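
A sketch of that grid search with scikit-learn. The grid values come from the paragraph above; the function name, the random seed, and the shuffled 5-fold splitter are assumptions. Float values of max_features are interpreted by scikit-learn as a fraction of the available features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

PARAM_GRID = {
    "n_estimators": [100, 200, 500],
    "max_features": ["sqrt", "log2", 0.1, 0.3],
}


def tune_random_forest(X, y, param_grid=PARAM_GRID, seed=0):
    """Grid search scored by RMSE under 5-fold cross-validation."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    search = GridSearchCV(
        RandomForestRegressor(random_state=seed),
        param_grid=param_grid,
        # scikit-learn maximizes scores, so RMSE is exposed with a negative sign.
        scoring="neg_root_mean_squared_error",
        cv=KFold(n_splits=5, shuffle=True, random_state=seed),
    )
    search.fit(X, y)
    return search.best_estimator_, search.best_params_, -search.best_score_
```

With 4,200 bands the full grid is 12 configurations times 5 folds, so 60 forest fits per land use class; running a reduced grid first is a reasonable way to check the pipeline before committing to the full search.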

Figure 3: In-domain PLSR prediction performance: actual vs. predicted SOC (g/kg) on the 20% hold-out test set for each land use class. Dashed line = 1:1 reference. Woodland (C) achieves the strongest in-domain PLSR performance (RPD 2.33). Permanent cropland (B) achieves only moderate performance (RPD 1.41), reflecting wide within-class SOC variability across management systems and geographies.

Figure 4: In-domain RF prediction performance: actual vs. predicted SOC (g/kg). Random Forest consistently equals or exceeds PLSR within each land use class. The largest RF advantage is for permanent cropland (B), where RF RPD (1.86) substantially exceeds PLSR RPD (1.41), suggesting non-linear spectral-SOC relationships in this heterogeneous class.

PHASE 3: The Transferability Matrix

The transferability matrix is the primary contribution of SOLUM. For every ordered pair (source, target) of the five retained land use classes, both models trained on source are evaluated on the target test set, producing 25 (source, target) combinations per model family. The result is two 5x5 matrices (one for PLSR, one for RF) where diagonal entries are in-domain performance and off-diagonal entries are transfer performance.
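
The all-pairs evaluation reduces to a double loop over classes. The sketch below assumes one fitted model per source class (any object with a predict method) and one held-out test set per target class. The function names and dictionary layout are hypothetical, and RPD is computed as the sample standard deviation of the observed values divided by RMSE, which is one common convention.

```python
import numpy as np


def rpd(y_true, y_pred) -> float:
    """Ratio of Performance to Deviation: SD of observed values over RMSE."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return float(np.std(y_true, ddof=1) / rmse)


def tier(value: float) -> str:
    """Map an RPD value to the three tiers used in this write-up."""
    if value >= 2.0:
        return "good"
    if value >= 1.4:
        return "moderate"
    return "poor"


def transferability_matrix(models, test_sets, classes):
    """Rows are source classes (trained on), columns are target classes (tested on)."""
    n = len(classes)
    matrix = np.zeros((n, n))
    for i, source in enumerate(classes):
        for j, target in enumerate(classes):
            X_test, y_test = test_sets[target]
            matrix[i, j] = rpd(y_test, models[source].predict(X_test))
    return matrix
```

Calling it once with the PLSR models and once with the RF models, using classes = ["B", "C", "D", "E", "F"], would give the two 5x5 matrices; the diagonal reuses each class's own test set, so in-domain and transfer entries are scored on identical samples.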

The PLSR matrix reveals a stark asymmetry. Permanent cropland (B) in-domain achieves only moderate PLSR performance (RPD 1.41), which already indicates that SOC prediction within this class is not trivial: the class spans a wide range of geographies and management practices. Woodland (C) is the strongest in-domain PLSR class at RPD 2.33, reflecting the coherent spectral signature of high-SOC forest soils. The critical observation is the entire B column: every non-B source class produces poor PLSR RPD when evaluated on B's test set. C→B is 0.44, D→B is 0.73, E→B is 0.75, F→B is 1.15. Cropland is spectrally isolated from every other class under linear modeling.

The RF matrix shows the same directional pattern but with generally higher absolute values (D→B is an exception, slipping from 0.73 with PLSR to 0.68 with RF). B in-domain improves to RPD 1.86 with RF. The B-column failures persist: C→B is 0.87 and D→B is 0.68, both below 1.0, meaning the RF model trained on woodland or shrubland makes cropland SOC predictions worse than simply predicting the mean. E→B reaches 1.38 with RF, still below the moderate threshold. The RF advantage is most pronounced for transfers into woodland (C) and from grassland (E): E→C achieves 2.09 with RF versus 2.05 with PLSR, and E→F achieves 2.22 with RF, the strongest off-diagonal result in either matrix.

The directionality asymmetry is the secondary finding of interest. C→E achieves RPD 1.73 with RF while E→C achieves 2.09: grassland-to-woodland transfer outperforms woodland-to-grassland. This asymmetry makes mechanistic sense. Grassland SOC occupies an intermediate compositional range between cropland and woodland, so a grassland-trained model has encountered enough high-SOC chemistry to partially represent woodland samples. A woodland-trained model, calibrated on the very high end (mean 93.1 g/kg), tends to systematically overpredict when applied to grassland's lower-SOC range.

The train-test split for each class uses an 80/20 ratio stratified by SOC quartile, ensuring the SOC distribution is similar in both subsets. RPD thresholds follow Chang et al. (2001): above 2.0 is good quantitative prediction, 1.4 to 2.0 is moderate, below 1.4 is not useful for quantitative estimation.
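
One way to stratify a continuous target by quartile is to bin it first and pass the bin labels to scikit-learn's train_test_split. This is a sketch of that idea; the function name and seed are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def quartile_stratified_split(X, y, test_size=0.2, seed=0):
    """80/20 split that keeps SOC quartiles balanced across both subsets."""
    y = np.asarray(y, dtype=float)
    # Quartile edges of SOC within this land use class.
    edges = np.quantile(y, [0.25, 0.5, 0.75])
    # Labels 0..3: which quartile each sample falls into.
    strata = np.digitize(y, edges)
    return train_test_split(
        X, y, test_size=test_size, stratify=strata, random_state=seed
    )
```

The function returns X_train, X_test, y_train, y_test in that order, matching train_test_split.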

Figure 5: PLSR Transferability Matrix (RPD). Rows = source land use class (model trained here); columns = target land use class (model tested here). Bold borders mark diagonal (in-domain) entries. Color: red = Poor (RPD < 1.4), amber = Moderate (1.4–2.0), teal = Good (≥ 2.0). The entire B (Permanent Cropland) column is red under PLSR, with C→B RPD of 0.44 indicating prediction worse than predicting the mean. Woodland (C) is the strongest in-domain class but transfers poorly as a source to cropland.

Figure 6: RF Transferability Matrix (RPD). Same structure as Figure 5. Random Forest improves absolute RPD for most pairs but preserves the same directional pattern: the B column contains four of the five failed transfer pairs (RPD < 1.4). E→F (Grassland to Bare Land) achieves RPD 2.22, the strongest off-diagonal result. The woodland-to-cropland and shrubland-to-cropland transfers (C→B, D→B) remain below RPD 1.0 with RF, indicating categorical prediction failure.

PHASE 4: SHAP Attribution for Transfer Failures

Five transfer pairs were identified as failures under the RF model (RPD below 1.4): C→B (0.87), D→B (0.68), E→B (1.38), F→B (1.26), and F→D (1.29). SHAP (SHapley Additive exPlanations) attribution was computed for all five via TreeExplainer. For each pair, mean absolute SHAP values were computed on the source test set and separately on the target test set. The SHAP discrepancy at each wavelength band is the absolute difference between these two mean absolute values.
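
The discrepancy itself is a few lines of array arithmetic. The sketch below takes two SHAP arrays of shape (samples, bands) as input; in practice these would come from something like shap.TreeExplainer(rf_model).shap_values(X_test), evaluated once on the source test set and once on the target test set with the same source-trained model. The function names are hypothetical.

```python
import numpy as np


def shap_discrepancy(shap_source, shap_target):
    """Per-band |mean|SHAP| on source minus mean|SHAP| on target|."""
    mean_abs_source = np.abs(np.asarray(shap_source, dtype=float)).mean(axis=0)
    mean_abs_target = np.abs(np.asarray(shap_target, dtype=float)).mean(axis=0)
    return np.abs(mean_abs_source - mean_abs_target)


def top_discrepant_bands(discrepancy, wavelengths, k=5):
    """Return (wavelength_nm, discrepancy) for the k most discrepant bands."""
    order = np.argsort(discrepancy)[::-1][:k]
    return [(float(wavelengths[i]), float(discrepancy[i])) for i in order]
```

Taking the absolute value before averaging means the measure tracks how strongly the model relies on a band in each domain, not the direction in which that band pushes the prediction.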

The discrepancy concentrates in the 1800-2200 nm shortwave infrared region across all five failed pairs. This region encodes combination and overtone bands of C-H and C-O stretching vibrations in organic molecules. The spectral features here differ systematically between humic acids (dominant in managed permanent cropland, orchards, and vineyards) and the more aromatic, litter-derived organic matter of woodland and shrubland soils. The models learn calibrations for one chemistry and encounter another.

The F→D failure (Bare Land to Shrubland, RPD 1.29 with RF) shows a slightly different discrepancy profile: the primary concentration shifts toward the 1400-1800 nm range rather than 1800-2200. Bare land soils are predominantly mineral-dominated with very low organic matter, and the model trained on F has minimal exposure to the intermediate-SOC organic signatures of shrubland. The 1400-1800 nm range encodes clay and hydroxyl mineral absorption features that differ substantially between these classes.

These attributions do not constitute definitive mechanistic identification. SHAP values describe model behavior rather than underlying chemistry directly. But they provide grounded starting points for domain adaptation: if the 1800-2200 nm region is the primary source of cross-domain discrepancy for woodland-cropland transfers, then adaptation approaches focusing on that region (e.g. domain-invariant feature extraction, weighted retraining, or spectral basis alignment) are the most promising candidates for improvement.

Figure 7: SHAP discrepancy heatmap for failed transfer pairs (RF RPD < 1.4). Each row is a failed transfer pair; columns are wavelength bins across 400–2499.5 nm. Color intensity represents |ΔSHAP|, the absolute difference in mean feature attribution between source and target domains. Warm colors indicate wavelength regions where the model behaves most inconsistently across the domain boundary. The 1800–2200 nm region shows elevated discrepancy across all cropland-target failures; F→D shows a distinct pattern concentrated near 1400–1800 nm.

Figure 8: Top discrepant wavelength bands for each failed transfer pair. Horizontal bars show |ΔSHAP| magnitude for the five most discrepant spectral bands. The 1800–2200 nm shortwave infrared region dominates all cropland-target failures, corresponding to C-H and C-O combination and overtone bands that encode organic matter composition differences between ecosystems. F→D shows greater discrepancy in clay and hydroxyl mineral absorption features near 1400–1800 nm.

PHASE 5: Limitations

Here are SOLUM’s limitations:

The LUCAS dataset covers EU member states only. The transferability matrix reflects EU soil types, climate zones, and land use practices. Whether the same patterns hold in tropical forest soils or North American prairie systems is an open question. The transferability rankings (which class is most exportable, which is most isolated) may shift substantially in different biogeographic contexts.

Arable cropland (A) is absent from the SOLUM matrix because LUCAS 2015 contains only 50 samples in that category. This is a dataset artifact: the 2009 and 2018 surveys have better A-class coverage. The absence of arable cropland is a meaningful gap, since arable is the dominant agricultural land use in most EU countries and likely the most common deployment context for SOC prediction models.

PLSR implemented from scratch via NIPALS is correct but slower than scikit-learn's optimized PLSRegression. For production deployment, the sklearn implementation is preferable. The from-scratch version serves a pedagogical purpose and should not be interpreted as a recommendation.

SHAP values for Random Forest via TreeExplainer are exact given the tree structure, but the interpretation of SHAP discrepancy as indicating spectral sources of domain shift is inferential. High discrepancy means the model uses that wavelength region differently across domains; it does not prove that the chemistry of that region drives the failure.

Closing Remarks

If the dirt was predictable, why did nobody measure it? I suspect the answer is that measuring it required accepting an uncomfortable premise: that the global ambitions of soil carbon monitoring rest on an assumption the field never stress-tested. Train on everything, validate on everything, report the mean. That works until you ask what happens when a model trained on cropland meets a woodland soil it was never built for. It's quite striking how often this theme occurs, if you're slurping up what I'm smacking down.

Papers that show excellent PLSR performance on LUCAS typically train a single model across all land use classes combined. That averages across domain boundaries. It does not bridge them. The apparently high performance can mask poor performance on subgroups. Which is not a novel observation, but I am tired of reciting it as a warning.

What interests me more now is that the confirmation itself has a kind of value. You suspect permanent cropland is spectrally isolated, but suspicion and a table are different orders of knowing. One is a private hunch and the other is a constraint that anyone else can pick up and use.

As ever, the work is not finished. But at least the dirt breakage has been rendered legible.

Cheers,
Angie X.

This project is open source at github.com/axshoe/SOLUM.
