A machine-learning framework for predicting microbial carbon-source utilisation (and other binary traits) from genomic feature data.
This is the core Python library behind the manuscript "From Spurious Shortcuts to Mechanistic Specialists: Data Coherence and Applicability Domains in Genome-Based Prediction of Microbial Carbon Utilisation." It integrates binary growth phenotypes across datasets and trains models under different train/test splitting regimes to study cross-dataset generalisation. The data-processing and experiment scripts used to produce the manuscript results live in the manuscript repository; this repository contains only the reusable library.
Uses uv for dependency management (Python 3.12–3.13):
uv sync

from trait_prediction.pipeline import run_pipeline

scores = run_pipeline(
    base_dir="<dir of split folders>",
    feature_file="<feature matrix>.parquet",
    split_types=["random", "out_of_clade"],
    model_type="catboost",
)

run_pipeline is a thin wrapper around PipelineRunner for quick experiments;
for full control construct a Config/ConfigSet and drive TrainingPipeline
directly. configs/default.yaml holds sensible defaults (CatBoost classifier,
χ² feature selection to 1000 features, variance/correlation filtering, 5-fold CV,
balanced-accuracy scoring) and configs/config_set.yaml sweeps multiple settings
via ConfigSet; the run_script.py entry point loads these with
Hydra.
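
To illustrate why balanced accuracy is a reasonable default score for imbalanced binary traits, here is a minimal pure-Python sketch. It is not the library's scoring code (which lives in split_ml.py); the function name balanced_accuracy and the toy labels are assumptions made up for this example.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present in y_true")
    # Average the recall on positives and the recall on negatives.
    return 0.5 * (tp / pos + tn / neg)


# Made-up toy labels: 8 strains that grow on a carbon source, 2 that do not.
y_true = [1] * 8 + [0] * 2
# A trivial predictor that always answers "grows".
always_grows = [1] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, always_grows)) / len(y_true)
print(plain_accuracy)                           # 0.8
print(balanced_accuracy(y_true, always_grows))  # 0.5
```

The trivial predictor looks good under plain accuracy but scores no better than chance under balanced accuracy, which is the behaviour one wants when comparing real models against null baselines.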
Run the test suite with the just recipe "just test", or invoke pytest directly.
The library lives under trait_prediction/:
- main/: the core data model (DataSet, Feature/FeatureSet, Phenotype/PhenotypeSet). Loads genomic features and binary phenotype labels and aligns them into trainable matrices. feature_corr.py provides a fast (Numba) correlation filter for redundant features.
- pipeline/: orchestration. Config/ConfigSet define a run; PipelineRunner/run_pipeline drive end-to-end training; TrainingPipeline handles preprocessing → feature selection → fit → evaluate; splitters.py (RandomSplitter, InCladeSplitter, OutOfCladeSplitter) implements the phylogeny-aware train/test regimes used to probe generalisation; split_ml.py holds scoring and feature-importance utilities.
- classifiers/: the make_classifier factory (CatBoost, random forest, decision tree, RFE/RFECV) plus null-model baselines (BernoulliClassifier, IdentityClassifier, NearestNeighborClassifier) for comparison.
- training/: Predictor and training/CV data containers; sampling and Optuna-based hyperparameter optimisation.
- visualization/: SHAP-based feature-importance plots.
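
As a rough illustration of what a phylogeny-aware split can look like, the sketch below holds out whole clades so that no clade in the test set is seen during training. It is not the library's OutOfCladeSplitter: the function name out_of_clade_split, the dict input format, and the reading of "out of clade" as "test clades absent from training" are all assumptions made for this example.

```python
import random


def out_of_clade_split(clade_of, test_fraction=0.25, seed=0):
    """Split genome ids so that test clades never appear in training.

    clade_of: dict mapping genome id -> clade label (hypothetical format).
    test_fraction: fraction of clades (not genomes) to hold out.
    """
    clades = sorted(set(clade_of.values()))
    rng = random.Random(seed)
    rng.shuffle(clades)
    # Always hold out at least one clade.
    n_test = max(1, round(len(clades) * test_fraction))
    held_out = set(clades[:n_test])
    train = sorted(g for g, c in clade_of.items() if c not in held_out)
    test = sorted(g for g, c in clade_of.items() if c in held_out)
    return train, test


# Made-up toy data: 8 genomes in 4 clades.
clade_of = {
    "g1": "A", "g2": "A",
    "g3": "B", "g4": "B",
    "g5": "C", "g6": "C",
    "g7": "D", "g8": "D",
}
train, test = out_of_clade_split(clade_of)
train_clades = {clade_of[g] for g in train}
test_clades = {clade_of[g] for g in test}
print(train_clades.isdisjoint(test_clades))  # True
```

A random split would instead shuffle genome ids directly, so close relatives could land on both sides of the split; comparing the two regimes is one way to probe how much a model relies on phylogenetic similarity.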
See LICENSE.