SOLUM

A Cross-Land-Use Transferability Benchmark for Soil Organic Carbon Vis-NIR Prediction Models

April 2026 - May 2026

Introduction

Cheers,
Angie X.

SOLUM: A Layman's Guide

Soil organic carbon is the carbon stored in decomposed plant and animal material in the top layer of soil. It matters for climate policy because soil holds a significant fraction of global terrestrial carbon, and changes in land use can rapidly shift that balance. It matters for agriculture because SOC correlates with soil fertility, water retention, and long-term productivity. It is also invisible from the surface and traditionally expensive to measure, as laboratory analysis requires chemical digestion and specialized equipment.

Visible-near-infrared spectroscopy offers a shortcut. A spectrometer shines light on a soil sample and measures how much light is reflected at each wavelength, from visible red to the near-infrared. Different molecular bonds absorb light at characteristic wavelengths: water absorbs near 1400 and 1900 nanometers; carbon-hydrogen bonds absorb in the 2000-2200 nanometer range; iron minerals absorb in the visible range. The spectral fingerprint encodes a great deal of soil chemistry.

Machine learning models can learn the mapping from spectral fingerprints to SOC values if trained on enough labeled examples where both the spectrum and the true lab-measured SOC are known. The problem is that these models learn the mapping specific to the soils they were trained on, meaning a model trained on EU cropland soils learns a mapping that works for cropland. Whether that mapping holds for woodland soil is a separate question, and the one which SOLUM investigates.

The Transferability Matrix is the core output. It is an N-by-N table where each row represents a training land use class and each column represents a testing land use class. The diagonal entries show how well the model performs when trained and tested on the same land use type. Off-diagonal entries show what happens when the model crosses a boundary. Performance is measured by RPD (Ratio of Performance to Deviation): values above 2.0 indicate good predictive ability, 1.4 to 2.0 is moderate, and below 1.4 is effectively not useful for quantitative prediction.

SHAP values then explain where the failures come from. For each spectral band, SHAP assigns a contribution score to each prediction. When a model trained on cropland is applied to woodland soil, certain wavelength bands show dramatically different contribution patterns since the model is using those bands in a way that is well-calibrated for one soil chemistry but not the other. Those bands are the spectral source of the domain shift.

SOLUM Concepts Reference - A. Xiu.pdf

Imagine you have a magical scale that can tell you how much sugar is in a piece of fruit just by looking at its color. You train it only on apples (red and green), and it works great! But when you point it at a banana, the scale gets confused. Its "color logic" doesn't work on yellow so it gives a wildly wrong sugar reading.This is the exact problem in soil science. Scientists use a technique called visible-near-infrared spectroscopy (shining light on soil to read its "color" across hundreds of wavelengths) to quickly estimate Soil Organic Carbon (SOC). Often, a model is trained on one soil type with the assumption that it’ll work on other soil types as well. SOLUM’s job is to prove this assumption wrong.

PHASE 1: The Dataset + The Problem

The LUCAS (Land Use/Cover Area frame statistical Survey) Topsoil dataset is maintained by the EU Joint Research Centre. It contains approximately 20,000 georeferenced soil samples from 25 EU member states collected across multiple survey years, with the 2018 version being the largest and most consistent. Each sample has measured SOC (in grams per kilogram), several other chemical properties, and Vis-NIR spectral reflectance recorded at approximately 4,200 wavelength bands from 400 to 2499 nanometers.

Every sample is classified by land use at the collection point. The major LUCAS land use codes are: A for arable cropland, B for permanent cropland (orchards, vineyards, olive groves), C for managed grassland, D for woodland, E for shrubland, and F for bare land. The central analytical constraint is that classes with fewer than 200 samples are excluded from the matrix, since reliable model training requires at minimum several hundred examples.

SOC distributions differ substantially across classes. Woodland soils have the highest median SOC and the widest spread, reflecting the accumulation of litter-derived organic matter under forest canopy. Arable soils have lower and less variable SOC, driven by tillage disturbance and residue management. Bare land and permanent cropland have the lowest median values. Ergo, these distributional differences are the mechanistic reason that cross-domain prediction fails. A model trained on arable soils with a SOC range of 5 to 30 grams per kilogram will not have learned to represent the 40 to 80 gram per kilogram woodland SOC values it encounters in a transfer scenario.

The preprocessing pipeline applies Savitzky-Golay smoothing followed by Standard Normal Variate transformation. SG smoothing fits a local polynomial to a sliding window of spectral points and replaces each value with the polynomial's prediction at that position. With a window of 11 and a polynomial degree of 2, this removes high-frequency sensor noise while preserving real spectral features like absorption bands. SNV then mean-centers and scales each spectrum independently, removing multiplicative scatter effects caused by differences in particle size and surface texture. Both are implemented from scratch in spectral_preprocessing.py.

PHASE 2: Model Training

Partial Least Squares via NIPALS

PLSR (Partial Least Squares Regression) is the workhorse of chemometrics and soil spectroscopy, and it earns that status honestly. With thousands of spectral bands and hundreds to thousands of samples, ordinary least squares is numerically useless (the system is massively collinear + the coefficient matrix is rank-deficient). PLSR finds latent variables, called components, that simultaneously maximize variance in the spectral predictor matrix X and covariance with the SOC target y. This is a different objective than PCA, which maximizes variance in X alone without regard to y.

I implemented PLSR via the NIPALS (Nonlinear Iterative Partial Least Squares) algorithm in plsr.py. NIPALS extracts one component at a time. For each component, it initializes x-weights by computing the cross-product of the spectral residual matrix with the y residual vector, then iterates to convergence. Once scores and loadings are computed for that component, both matrices are deflated, meaning the variance explained by that component is subtracted, and the next component is extracted from the residuals. This continues until the requested number of components is reached.

The number of components is the key hyperparameter. Too few and the model is underfit; too many and it begins to overfit to noise in the training spectra. I select the optimal number via leave-one-out cross-validation on the training set for datasets up to 300 samples, and 5-fold cross-validation for larger datasets. The component count that minimizes RMSECV is used for the final model.

The NIPALS implementation also computes regression coefficients directly in the original spectral space. This a very useful property as it means you can inspect which wavelength bands receive the largest regression weights, which gives a baseline understanding of which spectral regions drive SOC predictions before SHAP provides the more nuanced per-prediction attributions.

Random Forest

Random Forest is not implemented from scratch, and I want to be explicit about that rather than bury it. scikit-learn's RandomForestRegressor is the implementation used here. The decision is a matter of honesty about where the intellectual content lies: reimplementing tree splitting, bootstrapping, and aggregation would produce hundreds of lines of code that express nothing I do not already understand, and would introduce numerical differences relative to the benchmark without any scientific benefit. The NIPALS PLSR implementation is from scratch because implementing it explicitly forces a confrontation with what the algorithm actually does. Random Forest does not offer the same pedagogical payoff in this context.

Hyperparameter tuning uses a grid search over n_estimators (number of trees: 100, 200, 500) and max_features (features considered at each split: sqrt, log2, 0.1, 0.3 of total bands). For high-dimensional spectral data, restricting max_features to a fraction of the 4,200 bands is essential because without this constraint, every tree would have access to nearly identical sets of correlated bands, collapsing ensemble diversity. The grid is evaluated via 5-fold cross-validation scored by RMSE.

PHASE 3: The Transferability Matrix

The transferability matrix is the primary contribution of SOLUM. For every ordered pair (source, target) of retained land use classes, both models trained on source are evaluated on the target test set, and R-squared, RMSE, and RPD are recorded. The result is two N-by-N matrices (one for PLSR, one for RF) where diagonal entries are in-domain performance and off-diagonal entries are transfer performance.

The train-test split for each class uses an 80/20 ratio stratified by SOC quartile. Stratifying by SOC quartile is non-trivial: a random 80/20 split can accidentally place most high-SOC samples in the training set and most low-SOC samples in the test set, which would inflate apparent transfer performance when high-SOC target samples are evaluated against a model that never saw that SOC range in the source class. Stratification ensures that the SOC distribution is similar in both splits.

RPD thresholds come from Chang et al. (2001) and are standard in the soil spectroscopy literature. RPD above 2.0 constitutes good quantitative predictive ability. RPD between 1.4 and 2.0 allows approximate quantitative predictions with moderate accuracy. RPD below 1.4 indicates the model is not meaningfully better than predicting the mean. An RPD below 1.0 can occur when prediction error exceeds the standard deviation of the target set, i.e. the model is making the problem worse than no model at all.

The expected finding, consistent with the existing literature on related questions, is that diagonal entries substantially exceed off-diagonal entries for most class pairs, and that woodland is the most isolated domain: models trained on woodland transfer poorly to other classes, and models from other classes transfer poorly to woodland. This is mechanistically interpretable. Woodland SOC is dominated by aromatic, condensed organic matter from litter decomposition, which has a distinct spectral profile compared to the aliphatic, humic-acid-dominated SOC of tilled cropland. The models learn these domain-specific signatures during training and cannot generalize across them.

The more interesting secondary finding involves directionality. For many cross-domain pairs (A, B), (A, C), (B, C), moderate transfer is possible — RPD values in the 1.4 to 1.8 range, which indicates that closely related land use types share enough spectral-SOC structure that partial transfer works. This makes biogeochemical sense because arable and permanent cropland have similar mineral soil matrices and similar ranges of organic matter composition, even if management history differs. Grassland sits between cropland and woodland in this spectral-SOC continuum. Shrubland and woodland mark the point where the domain boundary becomes too wide for linear or ensemble methods to bridge without retraining.

PHASE 4: SHAP Attribution for Transfer Failures

SHAP (SHapley Additive exPlanations) provides per-feature attribution for each prediction. For a Random Forest model, TreeExplainer computes exact SHAP values efficiently using the tree path structure. Each SHAP value for feature j and sample i represents the marginal contribution of spectral band j to the prediction for sample i, averaged across all possible orderings of features entering the model.

The SHAP discrepancy for a cross-domain transfer pair is computed as follows: for the source-trained RF model, compute mean absolute SHAP values on the source test set and separately on the target test set. The discrepancy at each wavelength band is the absolute difference between these two mean absolute SHAP values. A high discrepancy indicates that the model is using that wavelength band with a substantially different pattern of attribution when applied to the target domain, meaning it is assigning contributions of a different sign or magnitude than it was calibrated to.

For failed transfers, the discrepancy concentrates systematically in the 1800-2200 nanometer shortwave infrared region. This region corresponds to combination overtone bands of C-H and C-O stretching vibrations in organic molecules. The spectral features in this region differ between humic acids (dominant in tilled soils) and fulvic acids or aromatic humic substances (dominant in forest floor organic matter). The model trained on cropland learns to use these bands in a way calibrated for one class of organic matter chemistry; when it encounters woodland soil, those same bands produce misleading signals because the underlying chemistry is different.

A secondary discrepancy region appears near 560-690 nanometers for transfers involving woodland. This visible range is sensitive to chlorophyll content and iron oxide mineralogy. Woodland soil samples often contain surface biomass contamination (leaf fragments and living moss) that creates a distinct visible-range spectral signature absent in mineral-dominated cropland samples. The model was never exposed to this in training and has no mechanism to handle it correctly.

These attributions do not constitute definitive mechanistic identification since SHAP values describe the model's behavior, not the underlying soil chemistry directly. But they provide a well-grounded starting point for domain adaptation research: if the 1800-2200 nm region is the primary source of cross-domain discrepancy, then transfer learning approaches that attend specifically to that region (or that use domain invariant feature extraction across that range) are the most promising candidates for improvement.

PHASE 5: Limitations

Here are SOLUM’s limitations:

The LUCAS dataset covers EU member states only. The transferability matrix produced by SOLUM reflects EU soil types, climate zones, and land use practices. Whether the same patterns hold in, for example, tropical forest soils or North American prairie systems is an open question. The spectral-SOC relationships learned from LUCAS data are not guaranteed to generalize geographically, and the transferability rankings (which class is most exportable, which is most isolated) may shift substantially in different biogeographic contexts.

PLSR implemented from scratch via NIPALS is correct but slower than scikit-learn's optimized PLSRegression. For production deployment or iteration across large parameter grids, the sklearn implementation is preferable. The from-scratch version here serves a pedagogical purpose and should not be interpreted as a recommendation for downstream use.

SHAP values for Random Forest with TreeExplainer are exact given the tree structure, but the interpretation of SHAP discrepancy as indicating spectral sources of domain shift is inferential. High discrepancy in a wavelength region means the model uses that region differently across domains; it does not definitively prove that the chemistry of that region drives the failure, because SHAP values capture model behavior rather than physical causality.

The minimum class size cutoff of 200 samples excludes bare land (F) from the analysis if it falls below threshold in the 2018 dataset. Results should be interpreted as representative of the main agricultural and natural land use classes in the EU, not as a complete survey of all LUCAS categories.

Closing Remarks

Cheers,
Angie X.

This project is open source at github.com/axshoe/SOLUM.

Page updated

Google Sites

Report abuse