Honest evaluation
The validation layer is the product. This page is the permalink.
Every spec that survives a Gnosys run has cleared six independent
validators — none of them an LLM. A validated spec passed
all six and cleared the promotion gate. A rejected_validation
spec was dropped because one validator fired a blocker.
The validators run on the executor's results, not on the strategist's intentions. An LLM that proposes a leaky pipeline is caught here, not earlier.
The six validators
| # | Validator | Stage | What it catches |
|---|---|---|---|
| 1 | multi_calibration | post_execute | Confidence that doesn't match empirical frequency on a slice. Subgroup-conditional ECE, not just global. |
| 2 | dist_shift | post_execute | Train→holdout gap attributed to label / covariate / concept shift. Concept-dominated gaps raise a blocker. |
| 3 | honest_eval.shuffled_label | honest_eval | Retrain with permuted y_train. A pipeline that keeps scoring well is leaking the test labels. |
| 4 | honest_eval.randomized_feature | honest_eval | Retrain with column-wise permuted X_train. Catches column-identity exploits. |
| 5 | honest_eval.secondary_holdout | honest_eval | Score on a cohort the strategist never saw. Catches overfitting to the primary test split. |
| 6 | honest_eval.permutation_fwer | honest_eval | Empirical p-value vs the shuffled-label null. Per-batch BH-FDR correction handles selection bias across the cohort. |
The four honest_eval.* validators share an n_repeats parameter
(default 5). The orchestrator runs Benjamini-Hochberg FDR
correction across the batch's p-values before classifying outcomes,
so the empirical p that goes into the model card is already
multiple-testing-corrected.
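As a rough illustration of the two steps described above, the sketch below computes a one-sided empirical p-value against a set of shuffled-label scores and then applies the standard Benjamini-Hochberg adjustment across a batch. The function names, the spec names, and the scores are all made up for the example; this is not the orchestrator's code, only the textbook procedure it is described as using.

```python
from typing import List, Sequence


def empirical_p_value(observed: float, null_scores: Sequence[float]) -> float:
    """One-sided permutation p-value with the usual +1 smoothing."""
    at_least_as_good = sum(1 for s in null_scores if s >= observed)
    return (1 + at_least_as_good) / (1 + len(null_scores))


def bh_adjust(p_values: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical batch: observed holdout score plus a handful of
# shuffled-label (null) scores for each spec.
batch = {
    "spec_a": (0.89, [0.51, 0.49, 0.52, 0.50, 0.48]),
    "spec_b": (0.62, [0.55, 0.63, 0.58, 0.60, 0.57]),
    "spec_c": (0.53, [0.52, 0.54, 0.51, 0.55, 0.50]),
}
raw = {name: empirical_p_value(obs, null) for name, (obs, null) in batch.items()}
adjusted = dict(zip(raw, bh_adjust(list(raw.values()))))

for name in batch:
    print(f"{name}: raw p={raw[name]:.3f}, BH-adjusted p={adjusted[name]:.3f}")
```

With the +1 smoothing, the smallest p-value attainable from n null scores is 1/(n+1), so the number of null draws bounds how small a reported p can be.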
Severity ladder
Each validator emits at most one finding per spec. Severity rules:
- blocker: the spec is rejected (rejected_validation) regardless of headline score. One blocker is enough.
- medium: the spec is not blocked, but its outcome is capped at the relaxed gate (no strict-gate promotion).
- low / info: recorded for the audit trail; no effect on the outcome.
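The ladder can be read as a small decision function over the severities a spec has collected. The sketch below is only an illustration: `outcome_ceiling` and the two non-rejected labels are hypothetical names, and the real thresholds that decide which severity a validator emits are not modeled here.

```python
from typing import Iterable

# Only "rejected_validation" is a name taken from this page; the other
# two labels are placeholders invented for the sketch.
REJECTED = "rejected_validation"
RELAXED_ONLY = "relaxed_gate_only"
STRICT_ELIGIBLE = "strict_gate_eligible"


def outcome_ceiling(severities: Iterable[str]) -> str:
    """Return the best outcome a spec can still reach given its findings."""
    seen = set(severities)
    if "blocker" in seen:
        # One blocker is enough, whatever the headline score.
        return REJECTED
    if "medium" in seen:
        # Not blocked, but strict-gate promotion is off the table.
        return RELAXED_ONLY
    # low / info findings are audit-trail only.
    return STRICT_ELIGIBLE


print(outcome_ceiling(["info", "medium", "low"]))
```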
The exact thresholds are documented in The validation layer.
What you get back
Each completed run produces:
- Per-spec validation findings: GET /v1/findings?run_id=... or client.findings.list(run_id=...). Filterable by validator, severity, spec, and correlations() for cross-validator queries.
- A downloadable HTML model card: GET /v1/runs/{id}/model_card. Single self-contained file with charts, validator verdicts at both the strict and relaxed thresholds, the calibration curve before and after MCGrad, and the run's audit chain head. Save to disk as a compliance artefact.
- The kept-spec list: runs.{id}.kept_specs_json is the ensemble the predict endpoint averages over. Each entry includes the spec's primary score and validator verdicts.
- An audit chain: row-level SHA-256 over typed contracts. Reproducible from the database; suitable for handing to model-risk reviewers.
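To make the audit-chain idea concrete, here is a minimal sketch of a row-level SHA-256 hash chain. The row shape, the all-zero genesis value, and the canonical-JSON serialization are assumptions made for the example; the actual typed contracts and chaining format are not specified on this page.

```python
import hashlib
import json
from typing import Any, Dict, Iterable

GENESIS = "0" * 64  # assumed starting value, not taken from the product


def row_digest(prev_hash: str, row: Dict[str, Any]) -> str:
    """SHA-256 over the previous hash plus a canonical JSON form of the row."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + canonical).encode("utf-8")).hexdigest()


def chain_head(rows: Iterable[Dict[str, Any]]) -> str:
    """Fold every row into the chain and return the final hash."""
    head = GENESIS
    for row in rows:
        head = row_digest(head, row)
    return head


# Hypothetical finding rows, in database order.
rows = [
    {"spec_id": "s1", "validator": "dist_shift", "severity": "low"},
    {"spec_id": "s2", "validator": "multi_calibration", "severity": "medium"},
]
head = chain_head(rows)

# Editing any earlier row changes the head, which is what lets a
# reviewer recompute the chain from the database and compare.
tampered = [dict(rows[0], severity="blocker"), rows[1]]
print(head != chain_head(tampered))
```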
Cross-validator correlation
The defense-in-depth query: which specs are flagged by two validators that catch different failure modes?
    # Specs flagged by both shuffled-label retention AND a concept-drift
    # blocker. These are usually true leaks: independent validators
    # converging on the same spec is rare without a real underlying
    # problem.
    corroborated = client.findings.correlations(
        validators=["honest_eval.shuffled_label", "dist_shift"],
        severity="blocker",
        mode="and",
        run_id=run.run_id,
    )
    print(f"corroborated leaks: {len(corroborated)}")
The dashboard's run-detail view surfaces this automatically.
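The same "and" intersection can be approximated client-side over a plain list of findings, which is handy for checking what the query means. This is a sketch only: the dictionary keys (`spec_id`, `validator`, `severity`) and the function name are hypothetical, not the documented shape of a finding.

```python
from typing import Dict, Iterable, Sequence, Set


def corroborated_specs(
    findings: Iterable[Dict[str, str]],
    validators: Sequence[str],
    severity: str = "blocker",
) -> Set[str]:
    """Spec ids flagged at `severity` by every validator in `validators`."""
    flagged_by: Dict[str, Set[str]] = {v: set() for v in validators}
    for f in findings:
        if f["severity"] == severity and f["validator"] in flagged_by:
            flagged_by[f["validator"]].add(f["spec_id"])
    if not flagged_by:
        return set()
    return set.intersection(*flagged_by.values())


# Hypothetical findings for three specs.
findings = [
    {"spec_id": "s1", "validator": "honest_eval.shuffled_label", "severity": "blocker"},
    {"spec_id": "s1", "validator": "dist_shift", "severity": "blocker"},
    {"spec_id": "s2", "validator": "dist_shift", "severity": "blocker"},
    {"spec_id": "s3", "validator": "honest_eval.shuffled_label", "severity": "medium"},
]
both = corroborated_specs(findings, ["honest_eval.shuffled_label", "dist_shift"])
print(sorted(both))
```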
Why this is the moat
A model that "scored 0.89 AUC on holdout" is one decimal place on a slide. A model card that says "scored 0.89 AUC, shuffled-label retained 1% of lift, secondary-holdout delta 0.005, BH-FDR-adjusted p < 0.001, MCGrad ECE 0.019 with worst-slice ECE 0.041" is a document a model-risk team can defend.
The validation record is the durable asset. Generation is cheap; trusted measurement is not.
Why MCGrad
Most calibration approaches (Platt scaling, isotonic regression) calibrate globally — the model's average confidence matches empirical frequency overall, but can still be badly miscalibrated on a single subgroup.
MCGrad (Meta, KDD 2026 — arXiv:2509.19884) fits gradient-boosted
correctors per subgroup over the kept-spec ensemble's predict_proba
outputs. The result: the model card reports the worst-slice ECE after
subgroup-conditional calibration alongside the global figure, so a
headline "ECE 0.019" cannot hide a badly calibrated subgroup.
For regulated buyers, this is the calibration story compliance actually wants: the question is not whether the model is miscalibrated on average across the population, but whether it is miscalibrated on a specific cohort that happens to matter.
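A toy calculation shows why a global number is not enough. The sketch below computes expected calibration error (ECE) with equal-width bins, first over the whole population and then per subgroup. The data and the slice names are invented, and this is the plain ECE definition, not MCGrad itself.

```python
from typing import Dict, List, Sequence, Tuple


def ece(probs: Sequence[float], labels: Sequence[int], n_bins: int = 10) -> float:
    """Expected calibration error with equal-width confidence bins."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    err = 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)
        freq = sum(y for _, y in b) / len(b)
        err += len(b) / total * abs(conf - freq)
    return err


# Invented data: the model says 0.5 for everyone, but the base rate is
# 0.8 in one cohort and 0.2 in the other.
slices: Dict[str, Tuple[List[float], List[int]]] = {
    "cohort_a": ([0.5] * 10, [1] * 8 + [0] * 2),
    "cohort_b": ([0.5] * 10, [1] * 2 + [0] * 8),
}
all_probs = [p for probs, _ in slices.values() for p in probs]
all_labels = [y for _, labels in slices.values() for y in labels]

global_ece = ece(all_probs, all_labels)
slice_ece = {name: ece(probs, labels) for name, (probs, labels) in slices.items()}
worst_slice = max(slice_ece.values())
print(f"global ECE={global_ece:.3f}, worst-slice ECE={worst_slice:.3f}")
```

Here the global ECE is zero because the two cohorts' errors cancel, while each cohort on its own is off by about 0.3. That gap is what a worst-slice figure is meant to expose.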
What honest-eval does not do
- It does not check that your dataset is what you think it is. Garbage data still yields a garbage model with a passing audit chain. Read Bring your own data for the upload-side contract.
- It does not enforce fairness constraints. Multi-calibration detects per-subgroup miscalibration; it does not impose a particular fairness criterion. That's a policy decision your team owns.
- It does not replace domain review. An honest-eval-clean spec whose features are obviously inappropriate (e.g. using a post-event timestamp on a pre-event prediction task) is still a bad spec. The validators are statistical; some failure modes need a human reading the columns list.
Reading further
- The validation layer — full theory + severity math.
- Tabular quickstart — a worked example with the validators firing on a real leak.
- Validation outcomes — how validator verdicts compose into the validated / rejected outcome.