The validation layer
A system that both generates and evaluates starts fooling itself — finding patterns that don't exist, overfitting aggressively, mistaking selection bias for meaningful signal.
This is the heart of the platform. Every other component (LLM strategists, dashboard UI, REST API) exists to feed this layer or read from it. If you want to know why Gnosys exists at all, this is the page.
For a shorter, customer-facing summary of the six validators and the model-card artefact, see Honest evaluation. This page is the full theory.
The deception problem
When the same system both proposes ML pipelines AND evaluates them, selection bias accumulates unchecked. Concretely:
- A grid search over 100 hyperparameter values, picking the best by holdout AUC, overestimates that AUC in expectation, because the maximum of many noisy estimates is biased upward. The more values you try, the larger the bias.
- An LLM proposing pipelines while seeing the holdout (e.g. via cross-validation accuracy in the prompt) will eventually propose something that fits the holdout's noise, not its signal.
- A "feature engineer" pipeline that fits any preprocessor on concat(X_train, X_test) is leaking. Even if no human wrote that line, an LLM proposing a code template might.
- A model whose output happens to track the test labels for non-causal reasons (a row index, a constant feature, a dataset-wide statistic) won't break in honest CV, but will break in production.
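The grid-search bullet can be reproduced with pure noise. The sketch below is an illustration rather than Gnosys code: every "configuration" emits random scores, so its true AUC is 0.5, yet the best of 100 on a 200-row holdout comes out above chance. The function names and sizes are assumptions chosen for the demo.

```python
import random

def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_null_auc(n_configs, n_holdout, rng):
    """Best holdout AUC among n_configs scorers that output pure noise."""
    labels = [i % 2 for i in range(n_holdout)]  # balanced binary labels
    return max(
        auc(labels, [rng.random() for _ in labels])
        for _ in range(n_configs)
    )

rng = random.Random(0)
one_try = best_null_auc(1, 200, rng)        # a single noise scorer
best_of_100 = best_null_auc(100, 200, rng)  # the "winner" of a 100-way search
print(f"one config: {one_try:.3f}  best of 100: {best_of_100:.3f}")
```

No scorer here has any signal, so whatever lift the winner shows is selection bias alone.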
The Gnosys validation layer is built around catching these failures after the executor has produced a Result, with re-runs under adversarial conditions that destroy any genuine signal.
The four post-execute surfaces
1. Multi-calibrated ensembles
A model whose predict_proba averages out to calibrated probabilities
overall but is badly calibrated on a particular slice (a demographic
group, a value range of one feature, a cohort) is overconfident on
that slice. The promotion gates require calibrated probabilities,
not just calibrated averages.
The validator computes:
- ECE: equal-width-binned expected calibration error, Σ_b (|bin_b| / n) × |conf_b − acc_b|.
- MCE: worst-bin gap. Sensitive to a single badly-miscalibrated bucket where ECE would average it out.
- Slice-conditional ECE: recompute per slice. Worst slice's ECE is what drives the severity decision.
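A minimal sketch of these three quantities, assuming `confidences` holds the predicted probability of the chosen class and `correct` holds 1 when that prediction was right. The function names, the bin count of 10, and the toy data are assumptions for illustration, not the validator's actual code.

```python
def calibration_errors(confidences, correct, n_bins=10):
    """Return (ECE, MCE) using equal-width bins on [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 lands in the last bin
        bins[idx].append((conf, hit))
    n = len(confidences)
    ece, mce = 0.0, 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        gap = abs(avg_conf - accuracy)
        ece += len(members) / n * gap  # bin weight times |conf_b - acc_b|
        mce = max(mce, gap)            # worst single bin
    return ece, mce

def worst_slice_ece(confidences, correct, slices, n_bins=10):
    """Recompute ECE per slice and return (slice_name, ece) for the worst one."""
    grouped = {}
    for conf, hit, name in zip(confidences, correct, slices):
        confs, hits = grouped.setdefault(name, ([], []))
        confs.append(conf)
        hits.append(hit)
    per_slice = {
        name: calibration_errors(confs, hits, n_bins)[0]
        for name, (confs, hits) in grouped.items()
    }
    worst = max(per_slice, key=per_slice.get)
    return worst, per_slice[worst]

# Toy data: 80% confidence everywhere, 16 of 20 predictions correct overall,
# but slice "b" is right only 1 time in 5.
confidences = [0.8] * 20
correct = [1] * 15 + [0, 0, 0, 0, 1]
slices = ["a"] * 15 + ["b"] * 5

ece, mce = calibration_errors(confidences, correct)
worst, worst_ece = worst_slice_ece(confidences, correct, slices)
print(f"overall ECE={ece:.3f}  worst slice={worst} ECE={worst_ece:.3f}")
```

The toy data is calibrated on average while one slice is badly overconfident, which is the case the slice-conditional recomputation exists to expose.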
Severity:
- ECE ≥ threshold + max_slice_gap → BLOCKER
- ECE ≥ threshold → MEDIUM
- otherwise INFO
Default thresholds for tabular: the strict gate requires ECE ≤ 0.05; the relaxed gate requires ECE ≤ 0.10.
2. Distribution-shift decomposition
The train→holdout score gap is the headline metric the executor reports. Without decomposition, you don't know why the gap exists.
Three sources are possible:
- Label shift: P(Y) changed; covariates intact. Often benign; importance-reweighting can correct.
- Covariate shift: P(X) changed; conditional P(Y|X) intact. Typically reflects cohort change.
- Concept drift: P(Y|X) itself changed. The dangerous one: the relationship the model learned doesn't hold on the holdout.
The validator runs:
- Per-feature Kolmogorov-Smirnov tests
- RBF Maximum-Mean-Discrepancy
- Classifier-two-sample test (logistic AUC discriminating train vs. test under stratified k-fold)
- Symmetrised KL on label priors
It then attributes the gap in proportion to label_kl + covariate
magnitude; concept drift is the residual. When the gap itself is
nontrivial and concept_share_of_gap >= 50%, a BLOCKER fires: the
model learned a relationship that doesn't hold.
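Two of the tests listed above are small enough to sketch with the standard library: the per-feature Kolmogorov-Smirnov statistic and the symmetrised KL on label priors. This is an illustration under assumptions, not the validator's implementation: the helper names are hypothetical, the symmetrised KL is taken as KL(p‖q) + KL(q‖p), and the `eps` smoothing is added so empty classes do not divide by zero.

```python
import math
from collections import Counter

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap

def label_prior(labels, classes, eps=1e-9):
    """Smoothed class frequencies, in the order given by `classes`."""
    counts = Counter(labels)
    total = len(labels) + eps * len(classes)
    return [(counts[c] + eps) / total for c in classes]

def symmetrised_kl(p, q):
    """KL(p||q) + KL(q||p), which simplifies to sum((p - q) * log(p / q))."""
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

# Toy inputs: one shifted feature and a changed label balance.
train_feature = [0.1, 0.4, 0.5, 0.9, 1.2]
holdout_feature = [1.1, 1.5, 1.8, 2.0, 2.4]
feature_gap = ks_statistic(train_feature, holdout_feature)

train_labels = [0] * 80 + [1] * 20
holdout_labels = [0] * 50 + [1] * 50
label_kl = symmetrised_kl(
    label_prior(train_labels, [0, 1]),
    label_prior(holdout_labels, [0, 1]),
)
print(f"KS={feature_gap:.2f}  label_kl={label_kl:.3f}")
```

A large KS statistic on a feature points toward covariate shift; a large `label_kl` points toward label shift.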
3. Honest evaluation — the deception test
The bidirectional check. Re-runs the spec under conditions that must destroy any genuine signal. Four verifiers:
honest_eval.shuffled_label
Re-train with permuted y_train. A clean pipeline collapses to
chance. A pipeline whose features encode the holdout labels (data
snooping, target leakage) keeps scoring high because the leak is
independent of training labels.
This is the canonical regression test for the validation layer. A leaky model with observed AUC 1.0:
observed=+1.0000 chance=+0.5042 shuffled_mean=+1.0000 retained=+100.00%
→ BLOCKER
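The behaviour in that log line can be reproduced on toy data. The sketch below is an assumption-laden illustration, not the platform's verifier: a small k-NN scorer stands in for a clean pipeline, and `leaky_scores` stands in for a pipeline whose features encode the holdout labels. All names and sizes are hypothetical.

```python
import random

def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def knn_scores(x_train, y_train, x_test, k=5):
    """Clean pipeline: mean training label of the k nearest training rows."""
    scores = []
    for x in x_test:
        nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x))[:k]
        scores.append(sum(y_train[i] for i in nearest) / k)
    return scores

def shuffled_label_check(score_fn, x_train, y_train, x_test, y_test, rng, n_shuffles=20):
    """Return (observed AUC, mean AUC after re-training on permuted y_train)."""
    observed = auc(y_test, score_fn(x_train, y_train, x_test))
    shuffled = []
    for _ in range(n_shuffles):
        y_perm = list(y_train)
        rng.shuffle(y_perm)
        shuffled.append(auc(y_test, score_fn(x_train, y_perm, x_test)))
    return observed, sum(shuffled) / n_shuffles

rng = random.Random(0)

def make_split(n):
    y = [i % 2 for i in range(n)]
    x = [label + rng.gauss(0.0, 0.5) for label in y]  # genuine but noisy signal
    return x, y

x_train, y_train = make_split(200)
x_test, y_test = make_split(200)

def leaky_scores(x_tr, y_tr, x_te):
    """Leaky pipeline: ignores training and reads the holdout labels directly."""
    return [float(label) for label in y_test]

clean_observed, clean_shuffled = shuffled_label_check(
    knn_scores, x_train, y_train, x_test, y_test, rng)
leaky_observed, leaky_shuffled = shuffled_label_check(
    leaky_scores, x_train, y_train, x_test, y_test, rng)
print(f"clean: observed={clean_observed:.3f} shuffled_mean={clean_shuffled:.3f}")
print(f"leaky: observed={leaky_observed:.3f} shuffled_mean={leaky_shuffled:.3f}")
```

The clean scorer falls back toward 0.5 once its training labels are permuted, while the leaky one keeps its score because nothing it uses depends on `y_train`.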
honest_eval.randomized_feature
Re-train with column-wise permuted X_train. Catches column-
identity exploits — pipelines that latched onto column ordering or
constant features rather than predictive structure.
honest_eval.secondary_holdout
Predict on a held-out cohort the generator never saw. Catches overfitting to whichever holdout the generator did see (during LLM-driven proposal, that's the primary test split).
honest_eval.permutation_fwer
Empirical p-value for the headline score against the shuffled-label
null distribution. BLOCKER when p > 0.10 — the observed score is
statistically indistinguishable from random.
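A minimal sketch of the empirical p-value, assuming the null distribution is a list of scores from shuffled-label re-runs. The +1 in numerator and denominator is a common convention that keeps the estimate above zero; whether the platform uses it, and how it handles family-wise correction, is not stated here. The function names and toy null scores are hypothetical.

```python
def empirical_p_value(observed, null_scores):
    """Share of null scores at least as large as the observed one (+1 corrected)."""
    exceed = sum(score >= observed for score in null_scores)
    return (1 + exceed) / (1 + len(null_scores))

def is_blocker(p_value, alpha=0.10):
    """BLOCKER when the observed score is indistinguishable from the null."""
    return p_value > alpha

# 19 toy AUCs from shuffled-label re-runs, scattered around chance.
null_scores = [0.45, 0.47, 0.48, 0.49, 0.50, 0.50, 0.51, 0.52, 0.53, 0.55,
               0.46, 0.49, 0.51, 0.48, 0.52, 0.47, 0.50, 0.54, 0.49]

p_strong = empirical_p_value(0.91, null_scores)  # clearly above every null score
p_weak = empirical_p_value(0.52, null_scores)    # sits inside the null spread
print(p_strong, is_blocker(p_strong))
print(p_weak, is_blocker(p_weak))
```

With 19 permutations the smallest attainable p-value is 1/20, so the number of re-runs bounds how strong a claim the test can support.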
Severity translation
Each verifier uses the same logic:
observed_lift = max(observed - chance_baseline, 0)
shuffled_lift = max(shuffled_mean - chance_baseline, 0)
retained = shuffled_lift / observed_lift # if observed_lift > 0
shuffled_lift > chance_tolerance AND retained >= 0.5 → BLOCKER
shuffled_lift > chance_tolerance AND retained >= 0.3 → MEDIUM
shuffled_lift > chance_tolerance → LOW
otherwise → INFO
Tuned for n_test ≥ 200. Smaller test sets need a looser tolerance
because sampling variance dominates.
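The rules above translate directly into a small function. This is a sketch, not the platform's code: the function name is hypothetical, `chance_tolerance` is left as a required argument because its default is not given here, and the handling of `observed_lift == 0` is an assumption flagged in the comment.

```python
def honest_eval_severity(observed, shuffled_mean, chance_baseline, chance_tolerance):
    """Return (severity, retained) following the lift-retention rules."""
    observed_lift = max(observed - chance_baseline, 0.0)
    shuffled_lift = max(shuffled_mean - chance_baseline, 0.0)
    # Assumption: the rules leave `retained` undefined when observed_lift is 0;
    # this sketch uses 0.0 in that case.
    retained = shuffled_lift / observed_lift if observed_lift > 0 else 0.0

    if shuffled_lift <= chance_tolerance:
        return "INFO", retained
    if retained >= 0.5:
        return "BLOCKER", retained
    if retained >= 0.3:
        return "MEDIUM", retained
    return "LOW", retained

# The leaky example: observed and shuffled scores are both 1.0.
# The tolerance of 0.05 is an illustrative value, not a documented default.
leaky = honest_eval_severity(1.0, 1.0, chance_baseline=0.5042, chance_tolerance=0.05)
clean = honest_eval_severity(0.90, 0.51, chance_baseline=0.50, chance_tolerance=0.05)
print(leaky, clean)
```

The tolerance check runs first, so a shuffled score that stays within noise of chance never escalates regardless of the retained ratio.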
Pre-execute: the LLM critic
Beyond the four post-execute mechanical validators, an opt-in LLM critic runs before any executor compute. It serialises the spec, sends it to an LLM with a domain-specific checklist (target leakage, over-parameterization, regime fragility, statistical-power, …), and asks for adversarial review.
High-severity concerns in lookahead_bias or data_snooping
BLOCKER the spec — the executor never starts. Other high-severity
concerns become advisory annotations the next-iteration prompt
sees.
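The triage rule can be sketched as follows. The `Concern` type, the `triage` function, and the `over_parameterization` category string are hypothetical stand-ins; only the two blocking categories and the high-severity rule come from the description above. What happens to lower-severity concerns is not specified, so this sketch simply leaves them out of both lists.

```python
from dataclasses import dataclass

BLOCKING_CATEGORIES = {"lookahead_bias", "data_snooping"}

@dataclass
class Concern:
    category: str
    severity: str  # assumed values: "high", "medium", "low"
    note: str

def triage(concerns):
    """Split high-severity critic concerns into blockers and advisory annotations."""
    high = [c for c in concerns if c.severity == "high"]
    blockers = [c for c in high if c.category in BLOCKING_CATEGORIES]
    advisories = [c for c in high if c.category not in BLOCKING_CATEGORIES]
    return blockers, advisories

concerns = [
    Concern("data_snooping", "high", "preprocessor fitted on train and test together"),
    Concern("over_parameterization", "high", "more parameters than training rows"),
    Concern("lookahead_bias", "low", "timestamp column present but unused"),
]
blockers, advisories = triage(concerns)
executor_may_start = not blockers
print(executor_may_start, [c.category for c in advisories])
```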
The cost saved by catching a leak before execution vs. after is
typically 1000x. The platform's leaky_demo test fixture
demonstrates this: caught pre-execute in 0.00s vs. the post-
execute path's 1.21s.
Cross-validator correlation
The audit DB normalises every finding into one row, so cross-spec /
cross-validator queries are a normal SQL WHERE away. The defense-
in-depth corroboration query is:
specs = client.findings.correlations(
    validators=["llm_critic", "honest_eval.shuffled_label"],
    severity="blocker",
    mode="and",
)
Specs flagged by both validators are real leaks caught twice. Specs flagged by only one are either:
- A false positive from the firing one (LLM critic gets things wrong; honest-eval has finite test-set power), or
- A subtler leak the silent one missed.
Either way it's worth investigating. The dashboard's run-detail page surfaces this view automatically.
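For readers who prefer to see the underlying query shape, here is a sketch using an in-memory SQLite table. The `findings` schema (spec_id, validator, severity) and the sample rows are assumptions made for illustration; the real audit DB layout is not documented on this page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE findings (spec_id TEXT, validator TEXT, severity TEXT)")
conn.executemany(
    "INSERT INTO findings VALUES (?, ?, ?)",
    [
        ("spec-1", "llm_critic", "blocker"),
        ("spec-1", "honest_eval.shuffled_label", "blocker"),
        ("spec-2", "llm_critic", "blocker"),
        ("spec-3", "honest_eval.shuffled_label", "blocker"),
        ("spec-3", "llm_critic", "low"),
    ],
)

validators = ["llm_critic", "honest_eval.shuffled_label"]
placeholders = ", ".join("?" for _ in validators)

# "and" mode: keep only specs where every listed validator raised a blocker.
rows = conn.execute(
    f"""
    SELECT spec_id
    FROM findings
    WHERE severity = 'blocker' AND validator IN ({placeholders})
    GROUP BY spec_id
    HAVING COUNT(DISTINCT validator) = ?
    ORDER BY spec_id
    """,
    (*validators, len(validators)),
).fetchall()
flagged_by_both = [spec_id for (spec_id,) in rows]
print(flagged_by_both)
```

Dropping the `HAVING` clause gives the "flagged by at least one" view, which is where the single-validator cases described above show up.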
Validation outcomes
Every spec lands in one of these outcomes:
| Outcome | Meaning |
|---|---|
| validated | Cleared the promotion gate and no validator blocked; production-ready |
| rejected_validation | A validator BLOCKER fired |
| rejected_below_floor | Did not clear the minimum-quality floor |
| execution_error | Executor raised |
Default tabular gate: auc ≥ 0.85 AND ece ≤ 0.05 (strict) or
auc ≥ 0.70 AND ece ≤ 0.10 (relaxed). Both AUC and ECE must
clear together. See Validation outcomes for the full
threshold breakdown.
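The default tabular gate reads naturally as a two-tier check. The function name and return values below are hypothetical; the thresholds are the ones quoted above. How the tiers map onto the outcome table is not spelled out here, so the sketch only reports which gate, if any, a result clears.

```python
def promotion_gate(auc, ece):
    """Return "strict", "relaxed", or None for the default tabular thresholds."""
    if auc >= 0.85 and ece <= 0.05:
        return "strict"
    if auc >= 0.70 and ece <= 0.10:
        return "relaxed"
    return None  # AUC and ECE must clear together

print(promotion_gate(0.90, 0.03))  # strong and well calibrated
print(promotion_gate(0.90, 0.08))  # strong AUC, looser calibration
print(promotion_gate(0.95, 0.20))  # high AUC cannot offset poor calibration
```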
Why this differentiates Gnosys
Most AutoML platforms stop at "did this score well on holdout?".
Gnosys runs six additional checks against that holdout score's
honesty (multi_calibration, dist_shift, honest_eval.shuffled_label,
honest_eval.randomized_feature, honest_eval.secondary_holdout,
honest_eval.permutation_fwer) and applies subgroup-conditional
MCGrad calibration on the survivors. The mechanical validators
alone catch ~80% of common ML pitfalls before they ship; the LLM
critic catches another chunk pre-execute for cheap.
The result is a platform you can let run autonomously without worrying that it'll surface a "winning" pipeline that's actually just an artifact of the search procedure. That is exactly the failure mode the MLE-STAR paper (Table 5) documents in its own agent: test AUC drops from 0.803 to 0.734 once a downstream checker is added. Without that checker, the system grades its own homework. Honest-eval is the platform's checker, and it's statistical, not LLM-based.