Validation outcomes + promotion
Every spec the platform executes ends in one of three outcomes: validated, rejected, or execution error. Rejections are reported as either rejected_validation or rejected_below_floor, so four status values appear in total. Validated specs are eligible for the kept set; rejected specs are dropped. The platform applies the thresholds for you, so you don't need to account for them when scoring predictions.
The outcomes
| Outcome | When |
|---|---|
| validated | Cleared the promotion gate and no validator blocked |
| rejected_validation | A validator BLOCKER fired (typically honest-eval) |
| rejected_below_floor | Did not clear the minimum-quality floor |
| execution_error | Executor raised before producing a result |
Tabular promotion thresholds (default)
A spec is validated when it clears the standard gate for its task type:
Classification:
- auc ≥ 0.85 AND ece ≤ 0.05: strict gate (production-ready)
- auc ≥ 0.70 AND ece ≤ 0.10: relaxed gate (kept, useful, but not production-ready alone)
Regression:
- r2 ≥ 0.80: strict gate
- r2 ≥ 0.50: relaxed gate
A spec that clears the AUC threshold but fails even the relaxed ECE threshold is rejected; calibration is required for any kept spec.
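The default gates can be read as a small lookup: check the strict thresholds first, then the relaxed ones. The following is a minimal sketch of that logic, not the platform's implementation. The function and constant names are hypothetical; only the threshold values come from the defaults listed here.

```python
from typing import Optional

# Threshold values copied from the default gates described in this section.
CLASSIFICATION_GATES = [
    ("strict", 0.85, 0.05),   # (gate name, minimum AUC, maximum ECE)
    ("relaxed", 0.70, 0.10),
]
REGRESSION_GATES = [
    ("strict", 0.80),         # (gate name, minimum R^2)
    ("relaxed", 0.50),
]


def classification_gate(auc: float, ece: float) -> Optional[str]:
    """Return the strictest gate cleared, or None if neither gate is cleared."""
    for name, min_auc, max_ece in CLASSIFICATION_GATES:
        # Both conditions must hold: the gate is AUC AND ECE.
        if auc >= min_auc and ece <= max_ece:
            return name
    return None


def regression_gate(r2: float) -> Optional[str]:
    """Return the strictest gate cleared, or None if neither gate is cleared."""
    for name, min_r2 in REGRESSION_GATES:
        if r2 >= min_r2:
            return name
    return None


print(classification_gate(0.91, 0.03))  # strict
print(classification_gate(0.91, 0.08))  # relaxed: ECE misses the strict cap
print(classification_gate(0.91, 0.20))  # None: good ranking alone is not enough
print(regression_gate(0.62))            # relaxed
```

Checking the strict gate first means a spec is always labelled with the strongest gate it clears.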
Validation BLOCKERs override gate decisions
If any validator returns Severity.BLOCKER, the spec is
rejected_validation regardless of how the gates would have
evaluated. This is by design: a model with AUC 1.0 that's secretly
leaking the test labels would technically clear every gate, but
the honest-eval validators flag it BLOCKER and the orchestrator
drops it.
The most common BLOCKER causes:
- honest_eval.shuffled_label: the model retains performance under permuted training labels (target leakage)
- honest_eval.secondary_holdout: the model overfit to the primary holdout it could see
- multi_calibration: ECE far above threshold (over-confident probabilities)
- distribution_shift: concept drift dominates the train→holdout gap
- llm_critic: the pre-execute critic flagged the spec at high severity in lookahead_bias or data_snooping
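The precedence between validators and gates can be sketched in a few lines. This is an illustration under stated assumptions, not the orchestrator's code. The `Severity` enum below is a stand-in (only `BLOCKER` is named in this document; the other levels are hypothetical), `resolve_outcome` is a hypothetical helper, and it assumes that a spec which misses the gate is reported as `rejected_below_floor`.

```python
from enum import Enum
from typing import Iterable


class Severity(Enum):
    INFO = "info"        # hypothetical level, for illustration only
    WARNING = "warning"  # hypothetical level, for illustration only
    BLOCKER = "blocker"


def resolve_outcome(severities: Iterable[Severity], cleared_gate: bool) -> str:
    """Apply the documented precedence: any BLOCKER wins, then the gate decides."""
    if any(s is Severity.BLOCKER for s in severities):
        return "rejected_validation"
    # Assumption: missing the gate maps to the below-floor outcome.
    return "validated" if cleared_gate else "rejected_below_floor"


# A leaking model can clear every gate and is still dropped.
print(resolve_outcome([Severity.INFO, Severity.BLOCKER], cleared_gate=True))
# rejected_validation
print(resolve_outcome([Severity.WARNING], cleared_gate=True))
# validated
```

The `execution_error` outcome is left out because it arises before any result exists to validate.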
Why use a composite gate rather than just AUC?
AUC alone is an unreliable promotion signal:
- It doesn't reflect calibration. A model with AUC 0.95 but ECE 0.30 ranks well, yet its 90%-confidence predictions may be right only about 60% of the time, which makes it useless for thresholding decisions.
- It's vulnerable to test-set overfit during model selection. The gate plus honest-eval check guards against this.
- It doesn't catch concept drift. The dist-shift validator does.
The composite gate (AUC AND ECE AND honest-eval AND multi-calibration AND dist-shift) shifts the question from "did this score well on holdout?" to "is this score honest, calibrated, and likely to hold on the secondary holdout?". That's the production-readiness question.
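To make the calibration point concrete, here is a minimal sketch of expected calibration error using the common equal-width binning definition. The platform's exact binning scheme is not documented here, so treat the bin count and the function name as assumptions. The example reproduces the case of 90%-confidence predictions that are right 60% of the time.

```python
from typing import List, Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumption: ten equal-width bins
) -> float:
    """Weighted mean of |accuracy - mean confidence| over confidence bins."""
    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, c in enumerate(confidences):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append(i)

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        accuracy = sum(correct[i] for i in members) / len(members)
        mean_conf = sum(confidences[i] for i in members) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


# 100 predictions, each made with 0.9 confidence, of which only 60 are correct.
confidences = [0.9] * 100
correct = [True] * 60 + [False] * 40
print(round(expected_calibration_error(confidences, correct), 2))  # 0.3
```

An ECE of 0.30 is far above both the strict (0.05) and relaxed (0.10) caps, regardless of how well the model ranks.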
Customising thresholds
For now, thresholds are fixed at the platform level. Custom thresholds (e.g. medical-imaging AUC must be ≥ 0.90 because false negatives are catastrophic) are on the Team plan roadmap. Reach out if your use case requires this.
Reading outcomes
In the SDK:
```python
# Assumes an existing `client` and `run_id`.
run = client.runs.wait(run_id)

# Per-iteration breakdown
iters = client.runs.iterations(run_id)
for it in iters.iterations:
    print(it.outcome_counts)  # {"validated": 3, "rejected_validation": 1, ...}
```
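If you want run-level totals rather than per-iteration counts, the dictionaries can be summed with a `Counter`. This is a sketch that assumes each `outcome_counts` value is a plain mapping from outcome name to count, as the inline comment in the SDK snippet suggests. The data below is made-up stand-in data, not real output.

```python
from collections import Counter

# Stand-in for [it.outcome_counts for it in iters.iterations]; values are made up.
per_iteration = [
    {"validated": 3, "rejected_validation": 1},
    {"validated": 1, "rejected_below_floor": 2, "execution_error": 1},
]

# Sum the counts across iterations; missing keys are treated as zero.
totals = Counter()
for counts in per_iteration:
    totals.update(counts)

executed = sum(totals.values())
validated_share = totals["validated"] / executed if executed else 0.0

print(dict(totals))
print(f"{validated_share:.0%} validated")  # 50% validated
```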
In the dashboard:
gnosyslabs.com/dashboard/runs/{run_id}
renders per-iteration validated/rejected counts plus the full
validation breakdown inline.
See also
- The validation layer — what each validator actually checks.
- Plans + limits — what plan you need to enable the LLM strategist + critic.