The validation layer

A system that both generates and evaluates starts fooling itself — finding patterns that don't exist, overfitting aggressively, mistaking selection bias for meaningful signal.

This is the heart of the platform. Every other component (LLM strategists, dashboard UI, REST API) exists to feed this layer or read from it. If you want to know why Gnosys exists at all, this is the page.

For a shorter, customer-facing summary of the six validators and the model-card artefact, see Honest evaluation. This page is the full theory.

The deception problem

When the same system both proposes ML pipelines AND evaluates them, selection bias accumulates unchecked. Concretely:

  • A grid search over 100 hyperparameter values that picks the best by holdout AUC overestimates that AUC in expectation, because the winner is selected partly for its luck on that holdout. The more values you try, the larger the bias.
  • An LLM proposing pipelines while seeing the holdout (e.g. via cross-validation accuracy in the prompt) will eventually propose something that fits the holdout's noise, not its signal.
  • A "feature engineer" pipeline that fits any preprocessor on concat(X_train, X_test) is leaking. Even if no human wrote that line, an LLM proposing a code template might.
  • A model whose output happens to track the test labels for non-causal reasons (a row index, a constant feature, a dataset-wide statistic) won't break in honest CV, but will break in production.

The Gnosys validation layer is built around catching these failures after the executor has produced a Result, with re-runs under adversarial conditions that destroy any genuine signal.
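
The first failure mode above can be reproduced without any real model. The sketch below is illustrative only and is not part of Gnosys: it scores signal-free "configurations" (pure random scores) on a single holdout and keeps the best one. The names auc and best_of_k_auc, the holdout size and the seed are all assumptions made for the example.

```python
import random

def auc(scores, labels):
    """Rank-based AUC: chance that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_of_k_auc(k, n_holdout=200, seed=0):
    """Score k signal-free configurations on one holdout and keep the best AUC."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n_holdout)]  # balanced binary labels
    aucs = []
    for _ in range(k):
        scores = [rng.random() for _ in range(n_holdout)]  # pure noise, true AUC is 0.5
        aucs.append(auc(scores, labels))
    return max(aucs)

single = best_of_k_auc(k=1)         # one configuration: an unbiased estimate of 0.5
best_of_100 = best_of_k_auc(k=100)  # the winner of 100: biased upward by selection
print(f"single={single:.3f} best_of_100={best_of_100:.3f}")
```

Every configuration here has a true AUC of 0.5, so any amount by which the best-of-100 score exceeds 0.5 is selection bias rather than signal.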

The four post-execute surfaces

1. Multi-calibrated ensembles

A model's predict_proba can average out to calibrated probabilities overall while being badly calibrated on a particular slice (a demographic group, a value range of one feature, a cohort); on that slice its stated confidence cannot be trusted. The promotion gates require calibrated probabilities, not just calibrated averages.

The validator computes:

  • ECE — equal-width-binned expected calibration error. Σ_b (|bin_b| / n) × |conf_b − acc_b|.
  • MCE — worst-bin gap. Sensitive to a single badly-miscalibrated bucket where ECE would average it out.
  • Slice-conditional ECE — recompute per slice. Worst slice's ECE is what drives the severity decision.
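
A minimal sketch of the three metrics above, in plain Python. It assumes a binary classifier, with conf_b taken as the mean predicted positive-class probability in bin b and acc_b as the observed positive rate in that bin; the function names and the 10-bin default are assumptions for illustration, not the validator's actual code.

```python
def calibration_errors(probs, labels, n_bins=10):
    """Return (ECE, MCE) using equal-width bins over [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece, mce = 0.0, 0.0
    for members in bins:
        if not members:
            continue
        conf = sum(p for p, _ in members) / len(members)  # mean predicted probability
        acc = sum(y for _, y in members) / len(members)   # observed positive rate
        gap = abs(conf - acc)
        ece += len(members) / n * gap  # weighted by bin size
        mce = max(mce, gap)            # worst single bin
    return ece, mce

def slice_ece(probs, labels, slices, n_bins=10):
    """Recompute ECE separately for each slice label."""
    grouped = {}
    for p, y, s in zip(probs, labels, slices):
        ps, ys = grouped.setdefault(s, ([], []))
        ps.append(p)
        ys.append(y)
    return {s: calibration_errors(ps, ys, n_bins)[0] for s, (ps, ys) in grouped.items()}

# Calibrated on average, badly calibrated on each slice:
probs = [0.5] * 10
labels = [1] * 5 + [0] * 5
slices = ["a"] * 5 + ["b"] * 5
overall = calibration_errors(probs, labels)  # (0.0, 0.0)
per_slice = slice_ece(probs, labels, slices)  # {"a": 0.5, "b": 0.5}
```

The toy data shows why the slice-conditional view matters: the aggregate ECE is zero, yet each slice is off by 0.5.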

Severity:

  • ECE ≥ threshold + max_slice_gap → BLOCKER
  • ECE ≥ threshold → MEDIUM
  • otherwise INFO

Default thresholds for tabular: the strict gate requires ECE ≤ 0.05; the relaxed gate requires ECE ≤ 0.10.

2. Distribution-shift decomposition

The train→holdout score gap is the headline metric the executor reports. Without decomposition, you don't know why the gap exists.

Three sources are possible:

  • Label shift: P(Y) changed; covariates intact. Often benign; importance-reweighting can correct.
  • Covariate shift: P(X) changed; conditional P(Y|X) intact. Typically reflects cohort change.
  • Concept drift: P(Y|X) itself changed. The dangerous one: the relationship the model learned doesn't hold on the holdout.

The validator runs:

  • Per-feature Kolmogorov-Smirnov tests
  • RBF Maximum-Mean-Discrepancy
  • Classifier-two-sample test (logistic AUC discriminating train vs. test under stratified k-fold)
  • Symmetrised KL on label priors

The validator then attributes the gap in proportion to label_kl and the covariate magnitude; concept drift is the residual. When the gap itself is nontrivial and concept_share_of_gap >= 50%, a BLOCKER fires: the model learned a relationship that doesn't hold on the holdout.
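
Two of the tests listed above are simple enough to sketch in plain Python: the per-feature two-sample Kolmogorov-Smirnov statistic and the symmetrised KL on label priors. This is an illustration under assumptions, not the validator's implementation: symmetrised KL is taken here as KL(P‖Q) + KL(Q‖P), zero priors are floored at a small eps, and the function names are hypothetical. MMD, the classifier two-sample test and the attribution step are left out.

```python
import math
from collections import Counter

def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:  # advance past every value <= x (handles ties)
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def symmetrised_kl(train_labels, holdout_labels, eps=1e-9):
    """KL(P||Q) + KL(Q||P) between the two empirical label priors."""
    classes = sorted(set(train_labels) | set(holdout_labels))
    ct, ch = Counter(train_labels), Counter(holdout_labels)
    # Floor at eps so a class missing from one side does not produce log(0).
    p = [max(ct[c] / len(train_labels), eps) for c in classes]
    q = [max(ch[c] / len(holdout_labels), eps) for c in classes]
    kl_pq = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
    return kl_pq + kl_qp

# A feature whose train and holdout ranges do not overlap: maximal covariate shift.
ks = ks_statistic([0.1, 0.4, 0.5, 0.9], [1.1, 1.4, 1.5, 1.9])  # 1.0
# Label priors flipped from 75/25 to 25/75: clear label shift.
label_kl = symmetrised_kl([0, 0, 0, 1], [0, 1, 1, 1])
```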

3. Honest evaluation — the deception test

The bidirectional check. Re-runs the spec under conditions that must destroy any genuine signal. Four verifiers:

honest_eval.shuffled_label

Re-train with permuted y_train. A clean pipeline collapses to chance. A pipeline whose features encode the holdout labels (data snooping, target leakage) keeps scoring high because the leak is independent of training labels.

This is the canonical regression test for the validation layer. A leaky model with observed AUC 1.0:

observed=+1.0000 chance=+0.5042 shuffled_mean=+1.0000 retained=+100.00%
→ BLOCKER
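
The contrast between a clean and a leaky pipeline can be sketched on toy data. Everything below is hypothetical and chosen for illustration: the "honest" pipeline is a small k-nearest-neighbour scorer on one informative feature, and the "leaky" pipeline simply reads a column that, by construction, copies the holdout label. Neither is Gnosys code.

```python
import random

def auc(scores, labels):
    """Rank-based AUC with ties counted as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def knn_scores(X_train, y_train, X_test, k=15):
    """Honest pipeline: positive rate among the k nearest training rows (column 0 only)."""
    out = []
    for x in X_test:
        nearest = sorted(range(len(X_train)), key=lambda i: abs(X_train[i][0] - x[0]))[:k]
        out.append(sum(y_train[i] for i in nearest) / k)
    return out

def leaky_scores(X_train, y_train, X_test):
    """Leaky pipeline: ignores training entirely and reads the leaked column 1."""
    return [x[1] for x in X_test]

def shuffled_label_check(pipeline, X_train, y_train, X_test, y_test, n_shuffles=10, seed=0):
    """Return (observed AUC, mean AUC after re-training on permuted y_train)."""
    rng = random.Random(seed)
    observed = auc(pipeline(X_train, y_train, X_test), y_test)
    shuffled = []
    for _ in range(n_shuffles):
        y_perm = y_train[:]
        rng.shuffle(y_perm)  # break the link between features and training labels
        shuffled.append(auc(pipeline(X_train, y_perm, X_test), y_test))
    return observed, sum(shuffled) / n_shuffles

data_rng = random.Random(1)

def make_rows(n):
    rows, labels = [], []
    for i in range(n):
        y = i % 2
        rows.append([data_rng.gauss(2.0 * y, 1.0), float(y)])  # column 1 leaks the label
        labels.append(y)
    return rows, labels

X_train, y_train = make_rows(200)
X_test, y_test = make_rows(200)
clean_obs, clean_shuf = shuffled_label_check(knn_scores, X_train, y_train, X_test, y_test)
leaky_obs, leaky_shuf = shuffled_label_check(leaky_scores, X_train, y_train, X_test, y_test)
```

The honest pipeline loses its lift once y_train is permuted, while the leaky one keeps a perfect score because its signal never came from the training labels.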

honest_eval.randomized_feature

Re-train with column-wise permuted X_train. Catches column-identity exploits: pipelines that latched onto column ordering or constant features rather than predictive structure.
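
One plausible reading of "column-wise permuted" is that each column's values are shuffled independently across rows. Under that assumption, row-level predictive structure is destroyed while column order, per-column distributions and constant columns survive, so a pipeline exploiting only those would keep its score. The helper name permute_columns is hypothetical.

```python
import random

def permute_columns(X, seed=0):
    """Shuffle each column of a row-major table independently across rows."""
    rng = random.Random(seed)
    n_cols = len(X[0])
    columns = []
    for j in range(n_cols):
        col = [row[j] for row in X]
        rng.shuffle(col)  # each column gets its own permutation
        columns.append(col)
    # Transpose back to row-major layout.
    return [list(vals) for vals in zip(*columns)]

X_train = [[1, 7, 0.5], [2, 7, 0.6], [3, 7, 0.7], [4, 7, 0.8]]
X_perm = permute_columns(X_train, seed=3)
# Column 1 is constant, so it is unchanged by the shuffle.
```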

honest_eval.secondary_holdout

Predict on a held-out cohort the generator never saw. Catches overfitting to whichever holdout the generator did see (during LLM-driven proposal, that's the primary test split).

honest_eval.permutation_fwer

Empirical p-value for the headline score against the shuffled-label null distribution. BLOCKER when p > 0.10 — the observed score is statistically indistinguishable from random.
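
An empirical p-value against a permutation null is commonly computed with the add-one estimator shown below. This is a sketch under that assumption; how the verifier handles the family-wise part implied by its name is not described here, and the function name and sample null scores are hypothetical.

```python
def empirical_p_value(observed, null_scores):
    """Add-one permutation p-value: share of null scores at least as large as observed."""
    exceed = sum(1 for s in null_scores if s >= observed)
    return (1 + exceed) / (1 + len(null_scores))

# Hypothetical shuffled-label AUCs from nine re-runs.
null_aucs = [0.48, 0.52, 0.50, 0.47, 0.55, 0.51, 0.49, 0.53, 0.50]

p_strong = empirical_p_value(0.90, null_aucs)  # no null score reaches 0.90
p_weak = empirical_p_value(0.50, null_aucs)    # six null scores are >= 0.50

blocked_strong = p_strong > 0.10
blocked_weak = p_weak > 0.10
```

With this estimator the smallest achievable p-value is 1 / (n_shuffles + 1), so at least nine shuffles are needed before any score can reach p ≤ 0.10.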

Severity translation

Each verifier uses the same logic:

observed_lift = max(observed - chance_baseline, 0)
shuffled_lift = max(shuffled_mean - chance_baseline, 0)
retained = shuffled_lift / observed_lift   # if observed_lift > 0

shuffled_lift > chance_tolerance AND retained >= 0.5  → BLOCKER
shuffled_lift > chance_tolerance AND retained >= 0.3  → MEDIUM
shuffled_lift > chance_tolerance                        → LOW
otherwise                                               → INFO

Tuned for n_test ≥ 200. Smaller test sets need a looser tolerance because sampling variance dominates.
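
The rules above translate directly into a small function. In this sketch the default chance_tolerance of 0.05 is a placeholder, since the actual value is not given here, and returning a retained share of 0.0 when observed_lift is zero is likewise an assumption.

```python
def translate_severity(observed, shuffled_mean, chance_baseline, chance_tolerance=0.05):
    """Map an observed score and its shuffled re-run mean to (severity, retained)."""
    observed_lift = max(observed - chance_baseline, 0.0)
    shuffled_lift = max(shuffled_mean - chance_baseline, 0.0)
    # Retained share of lift; undefined in the source when observed_lift is 0.
    retained = shuffled_lift / observed_lift if observed_lift > 0 else 0.0

    if shuffled_lift > chance_tolerance:
        if retained >= 0.5:
            return "BLOCKER", retained
        if retained >= 0.3:
            return "MEDIUM", retained
        return "LOW", retained
    return "INFO", retained

# The leaky example from the shuffled-label section:
leaky = translate_severity(observed=1.0, shuffled_mean=1.0, chance_baseline=0.5042)
# A clean pipeline whose shuffled re-run falls back to chance:
clean = translate_severity(observed=0.90, shuffled_mean=0.51, chance_baseline=0.50)
```

For the leaky example both lifts are equal, so the retained share is 100% and the result is a BLOCKER, matching the output line shown earlier.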

Pre-execute: the LLM critic

Beyond the four post-execute mechanical validators, an opt-in LLM critic runs before any executor compute. It serialises the spec, sends it to an LLM with a domain-specific checklist (target leakage, over-parameterization, regime fragility, statistical-power, …), and asks for adversarial review.

High-severity concerns in lookahead_bias or data_snooping BLOCKER the spec — the executor never starts. Other high-severity concerns become advisory annotations the next-iteration prompt sees.

Catching a leak before execution rather than after typically saves around 1000x in cost. The platform's leaky_demo test fixture demonstrates this: the leak is caught pre-execute in 0.00s, versus 1.21s on the post-execute path.

Cross-validator correlation

The audit DB normalises every finding into one row, so cross-spec / cross-validator queries are a normal SQL WHERE away. The defense-in-depth corroboration query is:

specs = client.findings.correlations(
    validators=["llm_critic", "honest_eval.shuffled_label"],
    severity="blocker",
    mode="and",
)

Specs flagged by both validators are real leaks caught twice. Specs flagged by only one are either:

  • A false positive from the firing one (LLM critic gets things wrong; honest-eval has finite test-set power), or
  • A subtler leak the silent one missed.

Either way it's worth investigating. The dashboard's run-detail page surfaces this view automatically.

Validation outcomes

Every spec lands in one of these outcomes:

Outcome               Meaning
validated             Cleared the promotion gate and no validator blocked (production-ready)
rejected_validation   A validator BLOCKER fired
rejected_below_floor  Did not clear the minimum-quality floor
execution_error       Executor raised

Default tabular gate: auc ≥ 0.85 AND ece ≤ 0.05 (strict) or auc ≥ 0.70 AND ece ≤ 0.10 (relaxed). Both AUC and ECE must clear together. See Validation outcomes for the full threshold breakdown.
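
The default tabular gate can be expressed as a small predicate. The thresholds are the ones quoted above; the GATES table and the clears_gate name are hypothetical, and the sketch does not model the separate minimum-quality floor.

```python
# Default tabular thresholds as quoted in the text.
GATES = {
    "strict": {"min_auc": 0.85, "max_ece": 0.05},
    "relaxed": {"min_auc": 0.70, "max_ece": 0.10},
}

def clears_gate(auc, ece, gate="strict"):
    """True only when AUC and ECE both clear the chosen gate together."""
    g = GATES[gate]
    return auc >= g["min_auc"] and ece <= g["max_ece"]

# High AUC but mediocre calibration: fails strict, passes relaxed.
strict_ok = clears_gate(auc=0.90, ece=0.08, gate="strict")    # False
relaxed_ok = clears_gate(auc=0.90, ece=0.08, gate="relaxed")  # True
```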

Why this differentiates Gnosys

Most AutoML platforms stop at "did this score well on holdout?". Gnosys runs six additional checks against that holdout score's honesty (multi_calibration, dist_shift, honest_eval.shuffled_label, honest_eval.randomized_feature, honest_eval.secondary_holdout, honest_eval.permutation_fwer) and applies subgroup-conditional MCGrad calibration on the survivors. The mechanical validators alone catch ~80% of common ML pitfalls before they ship; the LLM critic catches another chunk pre-execute for cheap.

The result is a platform you can let run autonomously without worrying that it'll surface a "winning" pipeline that's actually just an artifact of the search procedure. This is exactly the failure mode the MLE-STAR paper (Table 5) documents in its own agent: test AUC drops from 0.803 to 0.734 once a downstream checker is added. Without that checker, the system grades its own homework. Honest-eval is the platform's checker, and it's statistical, not LLM-based.

