Validation outcomes + promotion

Every spec the platform executes lands in one of four outcomes: validated, rejected_validation, rejected_below_floor, or execution_error. Validated specs are eligible for the kept set; rejected specs are dropped. The platform applies the thresholds for you, so you don't need to account for them when scoring predictions.

The outcomes

Outcome               When
validated             Cleared the promotion gate and no validator blocked
rejected_validation   A validator BLOCKER fired (typically honest-eval)
rejected_below_floor  Did not clear the minimum-quality floor
execution_error       Executor raised before producing a result

Tabular promotion thresholds (default)

A spec is validated when it clears at least the relaxed gate for its task type; clearing the strict gate additionally marks it as production-ready:

Classification:

  • auc ≥ 0.85 AND ece ≤ 0.05 — strict gate (production-ready)
  • auc ≥ 0.70 AND ece ≤ 0.10 — relaxed gate (kept, useful, but not production-ready alone)

Regression:

  • r2 ≥ 0.80 — strict gate
  • r2 ≥ 0.50 — relaxed gate

A spec that clears the AUC threshold but fails the ECE threshold is rejected — calibration is required for any kept spec.
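The default gates above can be sketched as a small lookup. This is an illustration only, not the platform's implementation: the function and constant names are hypothetical, and the only values taken from this page are the thresholds themselves.

```python
from typing import Optional

# Default tabular thresholds copied from the lists above.
# The names below are hypothetical, chosen for this illustration.
CLASSIFICATION_GATES = {
    "strict": {"auc": 0.85, "ece": 0.05},
    "relaxed": {"auc": 0.70, "ece": 0.10},
}
REGRESSION_GATES = {"strict": 0.80, "relaxed": 0.50}


def classification_gate(auc: float, ece: float) -> Optional[str]:
    """Return the tightest gate cleared, or None if neither is cleared."""
    for name in ("strict", "relaxed"):
        gate = CLASSIFICATION_GATES[name]
        # Both conditions must hold: good ranking AND good calibration.
        if auc >= gate["auc"] and ece <= gate["ece"]:
            return name
    return None


def regression_gate(r2: float) -> Optional[str]:
    """Return the tightest gate cleared, or None if neither is cleared."""
    for name in ("strict", "relaxed"):
        if r2 >= REGRESSION_GATES[name]:
            return name
    return None


print(classification_gate(auc=0.91, ece=0.03))  # strict
print(classification_gate(auc=0.91, ece=0.08))  # relaxed
print(classification_gate(auc=0.91, ece=0.20))  # None
print(regression_gate(r2=0.62))                 # relaxed
```

The third call shows the rule stated above: a high AUC does not rescue a spec whose ECE misses both gates.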

Validation BLOCKERs override gate decisions

If any validator returns Severity.BLOCKER, the spec is rejected_validation regardless of how the gates would have evaluated. This is by design: a model with AUC 1.0 that's secretly leaking the test labels would technically clear every gate, but the honest-eval validators flag it BLOCKER and the orchestrator drops it.

The most common BLOCKER causes:

  • honest_eval.shuffled_label — the model retains performance under permuted training labels (target leakage)
  • honest_eval.secondary_holdout — model overfit to the primary holdout it could see
  • multi_calibration — ECE far above threshold (over-confident probabilities)
  • distribution_shift — concept drift dominates the train→holdout gap
  • llm_critic — the pre-execute critic flagged the spec at high severity in lookahead_bias or data_snooping
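The precedence described in this section, where a BLOCKER wins over any gate result, might be sketched as follows. Only Severity.BLOCKER and the four outcome names come from this page; the other severity levels, the function name, and the mapping of a failed gate to rejected_below_floor are assumptions made for illustration.

```python
from enum import Enum
from typing import Iterable


class Severity(Enum):
    # Only BLOCKER is named on this page; INFO and WARNING are hypothetical.
    INFO = 1
    WARNING = 2
    BLOCKER = 3


def resolve_outcome(
    executed_ok: bool,
    cleared_gate: bool,
    severities: Iterable[Severity],
) -> str:
    """Hypothetical outcome resolution; not the orchestrator's actual code."""
    if not executed_ok:
        return "execution_error"
    # A BLOCKER is checked before the gate, so it overrides the gate result.
    if any(s is Severity.BLOCKER for s in severities):
        return "rejected_validation"
    # Assumption: a spec that fails the gate is reported as below the floor.
    if not cleared_gate:
        return "rejected_below_floor"
    return "validated"


# A leaky model that clears the gate is still rejected.
print(resolve_outcome(True, True, [Severity.INFO, Severity.BLOCKER]))
# A clean model that clears the gate is validated.
print(resolve_outcome(True, True, [Severity.WARNING]))
```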

Why use a composite gate rather than just AUC?

AUC alone is an unreliable promotion signal:

  • It doesn't reflect calibration. A model with AUC 0.95 but ECE 0.30 ranks well, yet its 90%-confidence predictions may be right only about 60% of the time, which makes it useless for thresholding decisions.
  • It's vulnerable to test-set overfit during model selection. The gate plus honest-eval check guards against this.
  • It doesn't catch concept drift. The distribution_shift validator does.
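To make the calibration point concrete, here is a minimal sketch of a binned ECE for binary predictions. It assumes equal-width bins over the predicted positive-class probability, which is one common definition; the platform's exact ECE computation is not documented on this page and may differ.

```python
from typing import Sequence


def expected_calibration_error(
    probs: Sequence[float],
    labels: Sequence[int],
    n_bins: int = 10,
) -> float:
    """Binned ECE: weighted mean of |observed positive rate - mean predicted prob|."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and the same length")

    # Assign each prediction to an equal-width bin; p == 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(p for p, _ in bucket) / len(bucket)
        pos_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(pos_rate - mean_conf)
    return ece


# Ten predictions at 90% confidence, of which only six are actually positive.
probs = [0.9] * 10
labels = [1] * 6 + [0] * 4
print(round(expected_calibration_error(probs, labels), 2))  # 0.3
```

The toy data reproduces the case described above: predictions made at 90% confidence that are right 60% of the time yield an ECE of about 0.30, far outside either gate.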

The composite gate (AUC AND ECE AND honest-eval AND multi-calibration AND dist-shift) shifts the question from "did this score well on holdout?" to "is this score honest, calibrated, and likely to hold on the secondary holdout?". That's the production-readiness question.

Customising thresholds

For now, thresholds are fixed at the platform level. Custom thresholds (e.g. requiring AUC ≥ 0.90 for medical imaging, where false negatives are catastrophic) are on the Team plan roadmap. Reach out if your use case requires this.

Reading outcomes

In the SDK:

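# `client` is assumed to be an already-initialised SDK client,
# and `run_id` the ID of an existing run.
run = client.runs.wait(run_id)

# Per-iteration breakdown
iters = client.runs.iterations(run_id)
for it in iters.iterations:
    print(it.outcome_counts)   # {"validated": 3, "rejected_validation": 1, ...}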
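If you want a run-level total rather than per-iteration counts, the dictionaries can be summed with the standard library. The sketch below uses hypothetical hard-coded counts shaped like the output above instead of a live SDK call, so it runs on its own.

```python
from collections import Counter

# Hypothetical per-iteration counts, shaped like `it.outcome_counts` above.
per_iteration = [
    {"validated": 3, "rejected_validation": 1},
    {"validated": 1, "rejected_below_floor": 2, "execution_error": 1},
]

# Counter.update adds counts key by key, so missing outcomes default to zero.
totals = Counter()
for counts in per_iteration:
    totals.update(counts)

executed = sum(totals.values())
validated_share = totals["validated"] / executed if executed else 0.0

print(dict(totals))
print(f"{validated_share:.0%} validated")  # 50% validated
```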

In the dashboard:

gnosyslabs.com/dashboard/runs/{run_id} renders per-iteration validated/rejected counts plus the full validation breakdown inline.
