Honest evaluation

The validation layer is the product. This page is the permalink.

Every spec that survives a Gnosys run has cleared six independent validators — none of them an LLM. A validated spec passed all six and cleared the promotion gate. A rejected_validation spec was dropped because one validator fired a blocker.

The validators run on the executor's results, not on the strategist's intentions. An LLM that proposes a leaky pipeline is caught here, not earlier.

The six validators

# | Validator | Stage | What it catches
--|-----------|-------|----------------
1 | multi_calibration | post_execute | Confidence that doesn't match empirical frequency on a slice. Subgroup-conditional ECE, not just global.
2 | dist_shift | post_execute | Train→holdout gap attributed to label / covariate / concept. Concept-dominated gaps raise a blocker.
3 | honest_eval.shuffled_label | honest_eval | Retrain with permuted y_train. A pipeline that keeps scoring is leaking the test labels.
4 | honest_eval.randomized_feature | honest_eval | Retrain with column-wise permuted X_train. Catches column-identity exploits.
5 | honest_eval.secondary_holdout | honest_eval | Score on a cohort the strategist never saw. Catches overfitting to the primary test split.
6 | honest_eval.permutation_fwer | honest_eval | Empirical p-value vs the shuffled-label null. Per-batch BH-FDR correction handles selection bias across the cohort.

The four honest_eval.* validators share an n_repeats parameter (default 5). The orchestrator runs Benjamini-Hochberg FDR correction across the batch's p-values before classifying outcomes, so the empirical p that goes into the model card is already multiple-testing-corrected.
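
As a rough illustration of the two statistical steps named above, the sketch below computes a one-sided empirical p-value against a set of shuffled-label null scores and then applies the Benjamini-Hochberg adjustment across a batch. The function names and the +1 smoothing in the p-value are assumptions made for this example; they are not taken from the Gnosys codebase.

```python
from typing import List, Sequence


def empirical_p_value(observed: float, null_scores: Sequence[float]) -> float:
    """One-sided permutation p-value with the usual +1 correction (assumed here)."""
    at_least_as_good = sum(1 for s in null_scores if s >= observed)
    return (1 + at_least_as_good) / (1 + len(null_scores))


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical batch of four specs.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
```

Under this estimator the smallest attainable p-value is 1/(n+1) for n null scores, so the resolution of the test depends on how many null scores are pooled.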

Severity ladder

Each validator emits at most one finding per spec. Severity rules:

  • blocker: the spec is rejected (rejected_validation) regardless of headline score. One blocker is enough.
  • medium: the spec is not blocked, but its outcome is capped at the relaxed gate (no strict-gate promotion).
  • low / info: recorded for the audit trail; no effect on the outcome.

The exact thresholds are documented in The validation layer.
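
The ladder above can be read as a small decision function over the severities a spec collected. The sketch below is one way to express it; the function name and the return labels "relaxed_only" and "strict_eligible" are hypothetical, and only rejected_validation is a name used on this page.

```python
from typing import Iterable


def gate_ceiling(severities: Iterable[str]) -> str:
    """Map a spec's finding severities to the best outcome it can still reach."""
    seen = set(severities)
    if "blocker" in seen:
        # One blocker is enough, regardless of headline score.
        return "rejected_validation"
    if "medium" in seen:
        # Not blocked, but strict-gate promotion is off the table.
        return "relaxed_only"  # hypothetical label
    # low / info findings are audit-trail only.
    return "strict_eligible"  # hypothetical label
```

The function only reports the ceiling: a spec with no blocker still has to clear the promotion gate on its own score.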

What you get back

Each completed run produces:

  1. Per-spec validation findings: GET /v1/findings?run_id=... or client.findings.list(run_id=...). Filterable by validator, severity, and spec; correlations() handles cross-validator queries.
  2. A downloadable HTML model card: GET /v1/runs/{id}/model_card. A single self-contained file with charts, validator verdicts at both the strict and relaxed thresholds, the calibration curve before and after MCGrad, and the run's audit chain head. Save it to disk as a compliance artefact.
  3. The kept-spec list: runs.{id}.kept_specs_json is the ensemble the predict endpoint averages over. Each entry includes the spec's primary score and validator verdicts.
  4. An audit chain: row-level SHA-256 over typed contracts. Reproducible from the database; suitable for handing to model-risk reviewers.
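
To make the audit chain idea concrete, here is a minimal sketch of a row-level SHA-256 hash chain. It assumes each row is a JSON-serialisable dict and that each digest covers the previous digest plus a canonical encoding of the row. The genesis value, the field names, and the encoding are all assumptions for illustration; the actual contract format is not described on this page.

```python
import hashlib
import json
from typing import Any, Dict, Iterable

GENESIS = "0" * 64  # assumed starting digest for an empty chain


def row_digest(prev_digest: str, row: Dict[str, Any]) -> str:
    """Hash the previous digest together with a canonical JSON form of the row."""
    payload = json.dumps(row, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_digest + payload).encode("utf-8")).hexdigest()


def chain_head(rows: Iterable[Dict[str, Any]]) -> str:
    """Fold all rows into a single head digest, in order."""
    digest = GENESIS
    for row in rows:
        digest = row_digest(digest, row)
    return digest


# Hypothetical finding rows.
rows = [
    {"spec_id": "s1", "validator": "dist_shift", "severity": "blocker"},
    {"spec_id": "s2", "validator": "multi_calibration", "severity": "low"},
]
head = chain_head(rows)
```

Recomputing the head from the stored rows and comparing it with the recorded one is what makes such a chain reproducible: changing any row, or the order of rows, changes the head.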

Cross-validator correlation

The defense-in-depth query: which specs are flagged by two validators that catch different failure modes?

# Specs flagged by both shuffled-label retention AND a concept-drift
# blocker. These are usually true leaks — independent validators
# converging on the same spec is rare without a real underlying
# problem.
corroborated = client.findings.correlations(
    validators=["honest_eval.shuffled_label", "dist_shift"],
    severity="blocker",
    mode="and",
    run_id=run.run_id,
)
print(f"corroborated leaks: {len(corroborated)}")

The dashboard's run-detail view surfaces this automatically.

Why this is the moat

A model that "scored 0.89 AUC on holdout" is a single number on a slide. A model card that says "scored 0.89 AUC, shuffled-label retained 1% of lift, secondary-holdout delta 0.005, BH-FDR-adjusted p < 0.001, MCGrad ECE 0.019 with worst-slice ECE 0.041" is a document a model-risk team can defend.

The validation record is the durable asset. Generation is cheap; trusted measurement is not.

Why MCGrad

Most calibration approaches (Platt scaling, isotonic regression) calibrate globally: the model's average confidence matches empirical frequency overall, but the model can still be badly miscalibrated on a single subgroup.
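
The gap between global and per-slice calibration is easy to reproduce on toy data. The sketch below computes expected calibration error (ECE) with equal-width bins, once over all rows and once per subgroup. The helper names and the binning scheme are assumptions for this example, not the multi_calibration validator's implementation.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def ece(probs: Sequence[float], labels: Sequence[int], n_bins: int = 10) -> float:
    """Expected calibration error with equal-width probability bins."""
    bins = defaultdict(list)
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    n = len(probs)
    total = 0.0
    for members in bins.values():
        conf = sum(p for p, _ in members) / len(members)
        freq = sum(y for _, y in members) / len(members)
        total += len(members) / n * abs(conf - freq)
    return total


def slice_ece(probs, labels, groups, n_bins: int = 10) -> Dict[Hashable, float]:
    """ECE computed separately for each subgroup label."""
    by_group = defaultdict(lambda: ([], []))
    for p, y, g in zip(probs, labels, groups):
        by_group[g][0].append(p)
        by_group[g][1].append(y)
    return {g: ece(ps, ys, n_bins) for g, (ps, ys) in by_group.items()}


# Toy data: every prediction is 0.5; group "a" is all positives, "b" all negatives.
probs = [0.5, 0.5, 0.5, 0.5]
labels = [1, 1, 0, 0]
groups = ["a", "a", "b", "b"]
global_ece = ece(probs, labels)
per_slice = slice_ece(probs, labels, groups)
```

On this toy data the global ECE is zero while both slices are off by 0.5, which is the failure mode a global-only calibration check cannot see.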

MCGrad (Meta, KDD 2026; arXiv:2509.19884) fits gradient-boosted correctors per subgroup over the kept-spec ensemble's predict_proba outputs. The result: when the model card says "ECE 0.019 with worst-slice ECE 0.041", both figures are measured after subgroup-conditional calibration, so the worst slice is reported next to the global mean rather than hidden behind it.

For regulated buyers, this answers the question compliance actually asks: not "is the model calibrated on average across the population?" but "is it miscalibrated on a specific cohort that happens to matter?"

What honest-eval does not do

  • It does not check that your dataset is what you think it is. Garbage data still yields a garbage model with a passing audit chain. Read Bring your own data for the upload-side contract.
  • It does not enforce fairness constraints. Multi-calibration detects per-subgroup miscalibration; it does not impose a particular fairness criterion. That's a policy decision your team owns.
  • It does not replace domain review. An honest-eval-clean spec whose features are obviously inappropriate (e.g. using a post-event timestamp on a pre-event prediction task) is still a bad spec. The validators are statistical; some failure modes need a human reading the columns list.
