Honest evaluation

The validation layer is the product. This page is the permalink.

Every spec that survives a Gnosys run has cleared six independent validators, none of them an LLM. A validated spec passed all six and cleared the promotion gate. A rejected_validation spec was dropped because at least one validator fired a blocker.

The validators run on the executor's results, not on the strategist's intentions. An LLM that proposes a leaky pipeline is caught here, not earlier.

The six validators

#	Validator	Stage	What it catches
1	`multi_calibration`	post_execute	Confidence that doesn't match empirical frequency on a slice. Subgroup-conditional ECE, not just global.
2	`dist_shift`	post_execute	Train→holdout gap attributed to label, covariate, or concept shift. Concept-dominated gaps raise a blocker.
3	`honest_eval.shuffled_label`	honest_eval	Retrain with permuted `y_train`. A pipeline that still scores well after the shuffle is leaking the test labels.
4	`honest_eval.randomized_feature`	honest_eval	Retrain with column-wise permuted `X_train`. Catches column-identity exploits.
5	`honest_eval.secondary_holdout`	honest_eval	Score on a cohort the strategist never saw. Catches overfitting to the primary test split.
6	`honest_eval.permutation_fwer`	honest_eval	Empirical p-value vs the shuffled-label null. Per-batch BH-FDR correction handles selection bias across the cohort.
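
As a rough illustration of the shuffled-label check in row 3, the sketch below retrains two toy pipelines on permuted training labels and reports how much of their lift over chance survives. Everything here is an assumption made for illustration: the `honest_pipeline` and `leaky_pipeline` functions, the synthetic one-feature data, and the definition of retention as mean shuffled lift divided by real lift. It is not the Gnosys implementation.

```python
import random

def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def honest_pipeline(X_train, y_train, X_test, y_test):
    # 1-nearest-neighbour: predictions depend only on the training labels.
    preds = []
    for x in X_test:
        nearest = min(range(len(X_train)), key=lambda i: abs(X_train[i] - x))
        preds.append(y_train[nearest])
    return accuracy(preds, y_test)

def leaky_pipeline(X_train, y_train, X_test, y_test):
    # Contrived leak: the decision threshold is tuned on the test labels,
    # so the score does not depend on y_train at all.
    best = 0.0
    for t in X_test:
        preds = [1 if x >= t else 0 for x in X_test]
        best = max(best, accuracy(preds, y_test))
    return best

def shuffled_label_retention(pipeline, X_train, y_train, X_test, y_test,
                             n_repeats=5, chance=0.5, seed=0):
    """Fraction of lift over chance that survives permuting y_train."""
    rng = random.Random(seed)
    real = pipeline(X_train, y_train, X_test, y_test)
    shuffled_scores = []
    for _ in range(n_repeats):
        y_perm = list(y_train)
        rng.shuffle(y_perm)
        shuffled_scores.append(pipeline(X_train, y_perm, X_test, y_test))
    real_lift = real - chance
    if real_lift <= 0:
        return 0.0  # nothing to retain if the real score has no lift
    mean_shuffled = sum(shuffled_scores) / len(shuffled_scores)
    return max(0.0, (mean_shuffled - chance) / real_lift)

def make_split(n, rng):
    # Balanced binary labels; one feature centred at 0 or 4 by class.
    y = [i % 2 for i in range(n)]
    X = [rng.gauss(4.0 * label, 1.0) for label in y]
    return X, y

data_rng = random.Random(42)
X_train, y_train = make_split(200, data_rng)
X_test, y_test = make_split(100, data_rng)

honest_retention = shuffled_label_retention(
    honest_pipeline, X_train, y_train, X_test, y_test)
leaky_retention = shuffled_label_retention(
    leaky_pipeline, X_train, y_train, X_test, y_test)
print(f"honest pipeline retained lift: {honest_retention:.2f}")
print(f"leaky pipeline retained lift:  {leaky_retention:.2f}")
```

The leaky pipeline keeps all of its lift because its score never depended on `y_train`; the honest one falls back to roughly chance. That gap is the signal the validator looks for.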

The four honest_eval.* validators share an n_repeats parameter (default 5). The orchestrator runs Benjamini-Hochberg FDR correction across the batch's p-values before classifying outcomes, so the empirical p that goes into the model card is already multiple-testing-corrected.
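
The two statistical steps named above can be sketched in a few lines. This is a minimal illustration, assuming an add-one empirical p-value against the shuffled-label null and the standard Benjamini-Hochberg step-up adjustment; the function names and the example scores are hypothetical, and the orchestrator's actual code may differ.

```python
def empirical_p_value(observed, null_scores):
    """Add-one empirical p-value: share of null scores >= the observed score."""
    exceed = sum(1 for s in null_scores if s >= observed)
    return (exceed + 1) / (len(null_scores) + 1)

def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# One spec scored 0.89; five shuffled-label retrains scored near chance.
p_single = empirical_p_value(0.89, [0.50, 0.52, 0.49, 0.51, 0.50])

# A batch of four specs, corrected together.
raw_p = [0.01, 0.04, 0.03, 0.005]
adjusted = bh_adjust(raw_p)
print(p_single, adjusted)
```

Under this particular estimator the smallest attainable p-value is 1 / (number of shuffles + 1), so the resolution of the empirical p depends directly on how many permuted retrains are run.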

Severity ladder

Each validator emits at most one finding per spec. Severity rules:

blocker: the spec is rejected (rejected_validation) regardless of headline score. One blocker is enough.
medium: the spec is not blocked, but its outcome is capped at the relaxed gate (no strict-gate promotion).
low / info: recorded for the audit trail; no effect on the outcome.

The exact thresholds are documented in The validation layer.
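
The ladder above reduces to a small decision rule. The sketch below is one way to express it, assuming each spec's findings arrive as a list of severity strings; `gate_ceiling` is a hypothetical helper, not part of the Gnosys client, and it only returns the highest gate a spec may still be promoted at. Whether the spec actually clears that gate depends on thresholds this sketch does not model.

```python
from typing import Iterable, Optional

SEVERITIES = ("blocker", "medium", "low", "info")

def gate_ceiling(severities: Iterable[str]) -> Optional[str]:
    """Highest gate a spec can still reach, or None if it is rejected."""
    seen = set(severities)
    unknown = seen - set(SEVERITIES)
    if unknown:
        raise ValueError(f"unknown severities: {sorted(unknown)}")
    if "blocker" in seen:
        return None        # rejected_validation, regardless of score
    if "medium" in seen:
        return "relaxed"   # no strict-gate promotion
    return "strict"        # low / info findings do not affect the outcome

print(gate_ceiling(["low", "info"]))        # strict
print(gate_ceiling(["medium", "low"]))      # relaxed
print(gate_ceiling(["blocker", "medium"]))  # None
```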

What you get back

Each completed run produces:

Per-spec validation findings: GET /v1/findings?run_id=... or client.findings.list(run_id=...). Filterable by validator, severity, and spec; use correlations() for cross-validator queries.
A downloadable HTML model card: GET /v1/runs/{id}/model_card. A single self-contained file with charts, validator verdicts at both the strict and relaxed thresholds, the calibration curve before and after MCGrad, and the run's audit chain head. Save it to disk as a compliance artefact.
The kept-spec list: runs.{id}.kept_specs_json is the ensemble the predict endpoint averages over. Each entry includes the spec's primary score and validator verdicts.
An audit chain: row-level SHA-256 over typed contracts. Reproducible from the database; suitable for handing to model-risk reviewers.
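
To make the audit chain idea concrete, here is a minimal sketch of a row-level SHA-256 hash chain. It assumes each row's hash covers the previous hash plus a canonical JSON encoding of the row, and that the chain head is the last hash; the row fields, the genesis value, and the encoding are all illustrative assumptions rather than the format Gnosys stores.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for an empty chain

def row_hash(prev_hash, row):
    # Canonical encoding: sorted keys and fixed separators, so the same
    # row always serialises to the same bytes.
    payload = json.dumps(row, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()

def chain_head(rows):
    head = GENESIS
    for row in rows:
        head = row_hash(head, row)
    return head

rows = [
    {"spec_id": "s1", "validator": "dist_shift", "severity": "low"},
    {"spec_id": "s1", "validator": "multi_calibration", "severity": "info"},
]
head = chain_head(rows)

# Changing any earlier row changes the head, which is what lets a
# reviewer recompute the chain from the database and compare.
tampered = [dict(rows[0], severity="info"), rows[1]]
tampered_head = chain_head(tampered)
print(head != tampered_head)
```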

Cross-validator correlation

The defense-in-depth query: which specs are flagged by two validators that catch different failure modes?

# Specs flagged by both shuffled-label retention AND a concept-drift
# blocker. These are usually true leaks — independent validators
# converging on the same spec is rare without a real underlying
# problem.
corroborated = client.findings.correlations(
    validators=["honest_eval.shuffled_label", "dist_shift"],
    severity="blocker",
    mode="and",
    run_id=run.run_id,
)
print(f"corroborated leaks: {len(corroborated)}")

The dashboard's run-detail view surfaces this automatically.
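
If you have already pulled findings into memory, the same "and" query can be approximated locally. The sketch below assumes each finding is a dict with `spec_id`, `validator`, and `severity` keys; those field names and the `corroborated_specs` helper are hypothetical and may not match the objects the client returns.

```python
from collections import defaultdict

def corroborated_specs(findings, validators, severity="blocker"):
    """Spec ids flagged at `severity` by every validator in `validators`."""
    wanted = set(validators)
    hits = defaultdict(set)
    for finding in findings:
        if finding["severity"] == severity and finding["validator"] in wanted:
            hits[finding["spec_id"]].add(finding["validator"])
    return sorted(spec for spec, seen in hits.items() if seen == wanted)

sample_findings = [
    {"spec_id": "s1", "validator": "honest_eval.shuffled_label", "severity": "blocker"},
    {"spec_id": "s1", "validator": "dist_shift", "severity": "blocker"},
    {"spec_id": "s2", "validator": "dist_shift", "severity": "blocker"},
    {"spec_id": "s3", "validator": "honest_eval.shuffled_label", "severity": "medium"},
    {"spec_id": "s3", "validator": "dist_shift", "severity": "blocker"},
]
local_corroborated = corroborated_specs(
    sample_findings,
    validators=["honest_eval.shuffled_label", "dist_shift"],
)
print(local_corroborated)
```

Only `s1` qualifies: `s2` is flagged by one validator, and `s3` has one of its two findings below blocker severity.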

Why this is the moat

A model that "scored 0.89 AUC on holdout" is a single number on a slide. A model card that says "scored 0.89 AUC, shuffled-label retained 1% of lift, secondary-holdout delta 0.005, BH-FDR-adjusted p < 0.001, MCGrad ECE 0.019 with worst-slice ECE 0.041" is a document a model-risk team can defend.

The validation record is the durable asset. Generation is cheap; trusted measurement is not.

Why MCGrad

Most calibration approaches (Platt scaling, isotonic regression) calibrate globally — the model's average confidence matches empirical frequency overall, but can still be badly miscalibrated on a single subgroup.

MCGrad (Meta, KDD 2026; arXiv:2509.19884) fits gradient-boosted correctors per subgroup over the kept-spec ensemble's predict_proba outputs. The result: when the model card says "ECE 0.019", that figure is measured after subgroup-conditional calibration and is reported next to the worst-slice ECE, so a good global mean cannot hide a badly calibrated cohort.

For regulated buyers, this is the calibration story compliance actually wants: not "calibrated on average across the population", but "calibrated on the specific cohorts that happen to matter".
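
The gap between global and per-subgroup calibration is easy to reproduce. The sketch below computes a standard binned ECE overall and per slice on a toy dataset where the model always predicts 0.5; the data, the group labels, and the helper names are made up for illustration, and the code shows only the metric, not MCGrad itself.

```python
def ece(probs, labels, n_bins=10):
    """Expected calibration error with equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    total = len(probs)
    error = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        frequency = sum(y for _, y in members) / len(members)
        error += len(members) / total * abs(mean_conf - frequency)
    return error

def worst_slice_ece(probs, labels, groups, n_bins=10):
    """Largest ECE over the subgroups named in `groups`."""
    by_group = {}
    for p, y, g in zip(probs, labels, groups):
        ps, ys = by_group.setdefault(g, ([], []))
        ps.append(p)
        ys.append(y)
    return max(ece(ps, ys, n_bins) for ps, ys in by_group.values())

# Group A has a 20% positive rate, group B has 80%; the model says 0.5
# for everyone, which is right on average and wrong for both groups.
groups = ["A"] * 10 + ["B"] * 10
labels = [1, 1] + [0] * 8 + [1] * 8 + [0, 0]
probs = [0.5] * 20

global_ece = ece(probs, labels)
slice_ece = worst_slice_ece(probs, labels, groups)
print(f"global ECE: {global_ece:.3f}, worst-slice ECE: {slice_ece:.3f}")
```

A global recalibrator would leave these predictions untouched, since the global error is already zero; only a subgroup-aware method has anything to correct.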

What honest-eval does not do

It does not check that your dataset is what you think it is. Garbage data still yields a garbage model with a passing audit chain. Read Bring your own data for the upload-side contract.
It does not enforce fairness constraints. Multi-calibration detects per-subgroup miscalibration; it does not impose a particular fairness criterion. That's a policy decision your team owns.
It does not replace domain review. An honest-eval-clean spec whose features are obviously inappropriate (e.g. using a post-event timestamp on a pre-event prediction task) is still a bad spec. The validators are statistical; some failure modes need a human reading the columns list.

Reading further

The validation layer — full theory + severity math.
Tabular quickstart — a worked example with the validators firing on a real leak.
Validation outcomes — how validator verdicts compose into the validated / rejected outcome.
