Tabular quickstart
A worked example end-to-end: the LLM-driven agent strategist proposing gradient-boosting variants on a tabular classification problem, with the pre-execute LLM critic catching bad ideas before they run and all six honest-eval validators firing on every spec.
What this run produced for us: the same setup against the Spaceship Titanic Kaggle competition finished at AUC 0.8955 on the held-out test split, validated by all six honest-eval validators, at roughly $0.05 in LLM cost on Claude Sonnet. That is top-10% Kaggle territory, end-to-end on the platform.
What you'll build
A run that:
- Round 0 — the agent strategist (Claude / GPT / Ollama, your choice) proposes 3-5 pipelines (model family, hyperparameters, categorical encoder, feature transforms) extending your spec template.
- Pre-execute — an LLM critic adversarially reviews each spec against a tabular-specific checklist (target leakage, over-parameterization, etc.). High-severity concerns in lookahead_bias or data_snooping block the spec.
- Execute — each spec trains, predicts, scores.
- Post-execute validators — all six (multi_calibration, dist_shift, honest_eval.shuffled_label, honest_eval.randomized_feature, honest_eval.secondary_holdout, honest_eval.permutation_fwer) run on every result.
- Round 1+ — the strategist sees the full validation findings from prior records (with prose explanations) and proposes improvements that target the specific failure modes.
- Stops at the first spec to clear the strict promotion gate, or after 5 iterations, whichever comes first.
- Finalisation — kept specs are deduplicated by output correlation (ρ ≥ 0.95), ensembled, and run through MCGrad subgroup multicalibration. The result is a validated model card you can download from GET /v1/runs/{id}/model_card.
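The deduplication step in finalisation can be pictured as a greedy filter over prediction vectors. The following is a minimal sketch, assuming Pearson correlation on held-out predictions and a keep-first policy. The platform's actual correlation measure and tie-breaking are not described here, and `pearson` and `dedupe_by_correlation` are hypothetical helpers, not SDK functions.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant predictions carry no correlation signal
    return cov / sqrt(var_a * var_b)


def dedupe_by_correlation(predictions, threshold=0.95):
    """Keep a spec only if its predictions correlate below `threshold`
    with every spec already kept.

    `predictions` maps spec_id -> scores on the same rows, in priority
    order (an assumed convention for this sketch).
    """
    kept = []
    for spec_id, scores in predictions.items():
        if all(pearson(scores, predictions[k]) < threshold for k in kept):
            kept.append(spec_id)
    return kept


# Made-up prediction vectors for three specs on four rows.
preds = {
    "spec-a": [0.10, 0.40, 0.80, 0.90],
    "spec-b": [0.12, 0.41, 0.79, 0.91],  # near-duplicate of spec-a
    "spec-c": [0.90, 0.20, 0.60, 0.10],  # behaves differently
}
kept = dedupe_by_correlation(preds)
print(kept)  # ['spec-a', 'spec-c']
```

Dropping near-duplicates before ensembling keeps the ensemble from counting one model's opinion several times.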
Prerequisites
- A Gnosys account on Starter or above (the LLM strategist + critic cost LLM tokens — the Free plan is HP-sweep only).
- Either an Anthropic / OpenAI API key, or a running Ollama server with a coding model pulled.
The spec template
The LLM extends this template each round. It only emits the diffs
(hyperparameters, sometimes model_family); everything else stays
the same:
spec_template = {
    "spec_id": "_t",
    "name": "tabular CTR sweep",
    "hypothesis": (
        "calibrated gradient boosting on synthetic_binary should beat "
        "logistic at the relaxed promotion gate; check both AUC and ECE"
    ),
    "task": "classification",
    "dataset_id": "synthetic_binary",
    "model_family": "gradient_boosting",
    "hyperparameters": {"n_estimators": 100, "max_depth": 3},
    "test_size": 0.25,
    "secondary_holdout_size": 0.20,
}
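To make "only emits the diffs" concrete, here is a rough sketch of how a proposed diff could be folded back into the template. It assumes nested dictionaries such as hyperparameters are merged key by key while every other field is overwritten. The platform's real merge rules may differ, and `apply_diff` is a hypothetical helper, not part of the SDK.

```python
import copy


def apply_diff(template, diff):
    """Return a new spec: `template` with `diff` applied.

    Nested dicts (e.g. hyperparameters) are merged key by key;
    any other value replaces the template's value.
    """
    spec = copy.deepcopy(template)
    for key, value in diff.items():
        if isinstance(value, dict) and isinstance(spec.get(key), dict):
            spec[key] = apply_diff(spec[key], value)
        else:
            spec[key] = copy.deepcopy(value)
    return spec


template = {
    "task": "classification",
    "dataset_id": "synthetic_binary",
    "model_family": "gradient_boosting",
    "hyperparameters": {"n_estimators": 100, "max_depth": 3},
}

# Hypothetical proposal from the strategist: only the changed fields.
diff = {"hyperparameters": {"n_estimators": 300, "learning_rate": 0.05}}

spec = apply_diff(template, diff)
print(spec["hyperparameters"])
# {'n_estimators': 300, 'max_depth': 3, 'learning_rate': 0.05}
```

The template itself is left untouched, so every round starts from the same baseline.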
Submit the run
from gnosys import GnosysClient

client = GnosysClient()  # $GNOSYS_API_KEY
run = client.runs.create(
    domain="tabular",
    strategist={"kind": "llm"},
    spec_template=spec_template,
    max_iterations=5,
    tier_reached="paper",    # exit early on first spec clearing the strict promotion gate
    no_progress_window=2,    # also exit if 2 rounds without improvement
    llm={
        "provider": "anthropic",  # or "openai" or "ollama"
        "model": "claude-opus-4-7",
        "temperature": 0.4,
        "max_tokens": 4096,
        # api_key omitted → uses the platform's pool token
    },
    llm_critic={
        "enabled": True,
        # prompt_template / checklist omitted → use the tabular defaults
    },
)
runs.create returns immediately. Block until terminal:
run = client.runs.wait(run.run_id, timeout=600.0, poll_interval=5.0)
print(f"final status: {run.status}")
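For readers curious what a blocking wait with `timeout` and `poll_interval` usually amounts to, the sketch below shows a generic polling loop. It is not the SDK's implementation: `wait_until_terminal` is a hypothetical function, the terminal status names are assumptions, and the API call is replaced by a stub so the example runs offline.

```python
import time


def wait_until_terminal(fetch_status, timeout=600.0, poll_interval=5.0,
                        terminal=("completed", "failed", "cancelled"),
                        sleep=time.sleep, clock=time.monotonic):
    """Poll `fetch_status()` until it returns a terminal status.

    Raises TimeoutError if the deadline passes while the run is still
    in a non-terminal state. The status names are assumed, not taken
    from the platform.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"run still {status!r} after {timeout}s")
        sleep(poll_interval)


# Stub standing in for the API: "running" twice, then "completed".
responses = iter(["running", "running", "completed"])
status = wait_until_terminal(lambda: next(responses), sleep=lambda s: None)
print(status)  # completed
```

Injecting `sleep` and `clock` keeps the loop testable without real delays.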
Read the per-iteration breakdown
iters = client.runs.iterations(run.run_id)
for it in iters.iterations:
    validated = sum(
        n for k, n in (it.tier_counts or {}).items()
        if not k.startswith("rejected") and k != "execution_error"
    )
    rejected = sum(
        n for k, n in (it.tier_counts or {}).items()
        if k.startswith("rejected") or k == "execution_error"
    )
    print(
        f"iter {it.iteration}: proposed={it.n_proposed} "
        f"records={it.n_records} validated={validated} rejected={rejected}"
    )
Typical output:
iter 0: proposed=4 records=4 validated=3 rejected=1
iter 1: proposed=3 records=3 validated=2 rejected=1
iter 2: proposed=2 records=2 validated=2 rejected=0
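The validated/rejected tally above can be pulled into a small helper and checked offline. In this sketch `split_tier_counts` is a hypothetical function, and the tier names in the sample, other than the "rejected" prefix and "execution_error" used in the loop above, are made up for illustration.

```python
def split_tier_counts(tier_counts):
    """Return (validated, rejected) totals from a tier_counts mapping.

    Keys starting with "rejected", plus "execution_error", count as
    rejected; everything else counts as validated.
    """
    validated = rejected = 0
    for tier, n in (tier_counts or {}).items():
        if tier.startswith("rejected") or tier == "execution_error":
            rejected += n
        else:
            validated += n
    return validated, rejected


# Illustrative mapping; "exploratory" is an invented tier name.
sample = {"paper": 1, "exploratory": 2, "rejected_validation": 1}
print(split_tier_counts(sample))  # (3, 1)
print(split_tier_counts(None))    # (0, 0)
```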
The rejected records are interesting — that's where the validation layer caught something. Drill in:
rejected = [
    f for f in client.findings.list(run_id=run.run_id, severity="blocker")
]
for f in rejected:
    print(f"  spec={f.spec_id}")
    print(f"  validator={f.validator}")
    print(f"  detail={f.detail}")
    print()
Defense in depth — multi-validator correlations
Specs flagged by both the LLM critic AND the post-execute honest-eval are real leaks caught twice. Specs flagged by only one are worth investigating:
matches = client.findings.correlations(
    validators=["llm_critic", "honest_eval.shuffled_label"],
    severity="blocker",
    mode="and",
    pipeline_run_id=run.pipeline_run_id,
)
print(f"corroborated leaks: {len(matches)}")

# vs. specs the critic blocked but honest-eval missed:
critic_only = client.findings.correlations(
    validators=["llm_critic", "honest_eval.shuffled_label"],
    severity="blocker",
    mode="or",
    pipeline_run_id=run.pipeline_run_id,
)
critic_only_specs = [
    m for m in critic_only
    if "honest_eval.shuffled_label" not in m.matched_validators
]
print(f"critic-only blocks: {len(critic_only_specs)}")
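The two queries above boil down to set logic over which validators blocked each spec. The sketch below reproduces that logic on invented data, assuming mode="and" means "all listed validators fired" and mode="or" means "at least one fired". That reading of the modes is an assumption, and the spec ids are made up.

```python
# Hypothetical findings: spec_id -> validators that raised a blocker.
blockers = {
    "spec-1": {"llm_critic", "honest_eval.shuffled_label"},
    "spec-2": {"llm_critic"},
    "spec-3": {"honest_eval.shuffled_label"},
    "spec-4": set(),
}
wanted = {"llm_critic", "honest_eval.shuffled_label"}

# mode="and": every requested validator flagged the spec.
both = sorted(s for s, fired in blockers.items() if wanted <= fired)

# mode="or": at least one requested validator flagged the spec.
either = sorted(s for s, fired in blockers.items() if wanted & fired)

# Critic-only: in the "or" set, but honest-eval did not fire.
critic_only = sorted(
    s for s in either if "honest_eval.shuffled_label" not in blockers[s]
)

print(both)         # ['spec-1']
print(either)       # ['spec-1', 'spec-2', 'spec-3']
print(critic_only)  # ['spec-2']
```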
What the LLM saw on round 1
The validation findings flow back to the strategist's prompt as structured prose. A round-1 prompt against a leaky round-0 spec includes (truncated):
prior_records:
  - spec_id: ts-r0-leaky
    outcome: rejected_validation
    primary_score: 1.0
    blockers: ["honest_eval.shuffled_label"]
    findings:
      - validator: honest_eval.shuffled_label
        severity: blocker
        summary: "Shuffled-label re-runs retained 100% of the model's lift
                  over chance — the canonical signature of target leakage
                  or holdout snooping in the feature pipeline."
        recommendation: "Audit the feature pipeline for any reference to
                         the test set (combined-set normalisation,
                         target-conditional encodings fit on all rows,
                         post-hoc filtering by y). Drop those operations
                         and re-run."
The LLM reads the recommendation and adjusts its next proposal to avoid the failure mode. That's the closed loop.
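One way to read the "retained 100% of the model's lift over chance" figure in the finding above is as a ratio of lifts. This is a sketch under an assumed definition (lift measured as AUC minus 0.5), not the validator's documented formula. `retained_lift` is a hypothetical helper, and the AUC values are illustrative rather than taken from a real run.

```python
def retained_lift(real_auc, shuffled_auc, chance=0.5):
    """Share of the model's lift over chance that survives label shuffling.

    A pipeline without leakage should fall to roughly chance once the
    training labels are shuffled, giving a value near 0. A value near 1
    means the score does not depend on the real labels.
    """
    lift = real_auc - chance
    if lift <= 0:
        raise ValueError("model has no lift over chance to retain")
    return (shuffled_auc - chance) / lift


leaky = retained_lift(real_auc=1.0, shuffled_auc=1.0)
clean = retained_lift(real_auc=0.90, shuffled_auc=0.504)

print(f"leaky spec retained {leaky:.0%} of lift")  # 100%
print(f"clean spec retained {clean:.0%} of lift")  # 1%
```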
Cost notes
- The LLM strategist makes 1 call per round (round 0 generation + rounds 1+ rank-and-iterate + refine).
- The LLM critic adds 1 call per spec, so a 5-iteration run with 4 specs/round makes 20 critic calls (about 25 LLM calls in total once the 5 strategist calls are included).
- Tokens consumed: typical run lands at 50–200K tokens depending on iteration count and validation findings depth.
- The Starter plan includes 1M tokens / month; Team includes 10M.
To run cheaper: turn off the critic (llm_critic.enabled=False),
use a smaller model like gpt-4o-mini or claude-haiku-4-5, or
use --llm-provider ollama against a local model.
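The call counts in the notes above follow from simple arithmetic. The sketch below assumes one strategist call per round and one critic call per proposed spec, as stated above, and gives an upper bound because a run may stop early. `llm_calls` is a hypothetical helper.

```python
def llm_calls(iterations, specs_per_round, critic_enabled=True):
    """Upper bound on LLM calls for a run that uses every iteration."""
    strategist = iterations  # 1 call per round
    critic = iterations * specs_per_round if critic_enabled else 0
    return {
        "strategist": strategist,
        "critic": critic,
        "total": strategist + critic,
    }


with_critic = llm_calls(iterations=5, specs_per_round=4)
without_critic = llm_calls(iterations=5, specs_per_round=4, critic_enabled=False)

print(with_critic)     # {'strategist': 5, 'critic': 20, 'total': 25}
print(without_critic)  # {'strategist': 5, 'critic': 0, 'total': 5}
```

Disabling the critic removes most of the calls, which is why it is the first lever listed for cheaper runs.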
Reproducing the Spaceship Titanic result
The same runs.create call above, pointed at a Spaceship Titanic
dataset uploaded via client.datasets.upload(...):
spec_template = {
    "spec_id": "_t",
    "name": "spaceship-titanic agent sweep",
    "hypothesis": (
        "xgb/catboost on mixed-type passenger data; mediator features "
        "(cabin/group), per-group target encoding for high-cardinality "
        "categoricals"
    ),
    "task": "classification",
    "dataset_id": dataset.dataset_id,    # uploaded CSV
    "model_family": "xgb_classifier",
    "hyperparameters": {"n_estimators": 200, "max_depth": 6},
    "categorical_encoding": "native",    # xgb handles directly
    "feature_transforms": [              # parsed from spec_id
        {"op": "split_delimited", "column": "Cabin", "delim": "/"},
        {"op": "drop", "column": "Name"},
    ],
    "test_size": 0.20,
    "secondary_holdout_size": 0.20,
    "cv_folds": 5,
}

run = client.runs.create(
    domain="tabular",
    strategist={"kind": "agent"},
    spec_template=spec_template,
    max_iterations=5,
    tier_reached="paper",
    llm={"provider": "anthropic", "model": "claude-sonnet-4-6"},
    llm_critic={"enabled": True},
)
final = client.runs.wait(run.run_id, timeout=900.0)
card = client.runs.model_card(final.run_id)  # HTML bytes
Resulting promoted spec (real run from May 2026):
agent-xgb-conservative-iter0, AUC 0.8955, all six validators
clean (shuffled-label retained 1% of lift, dist-shift concept
share 12%, MCGrad ECE 0.019, worst-slice ECE 0.041).
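For intuition about the split_delimited transform in the template above, here is a plain-Python sketch of what splitting the Cabin column on "/" could look like for a single row. The output column names and the handling of missing values are assumptions, `split_delimited` here is a stand-in written for this example rather than the platform's implementation, and the rows are invented.

```python
def split_delimited(row, column, delim, names):
    """Replace row[column] with one field per delimited part.

    Missing values (None or empty string) yield None for every part.
    """
    out = {k: v for k, v in row.items() if k != column}
    raw = row.get(column)
    parts = raw.split(delim) if raw else []
    for i, name in enumerate(names):
        out[name] = parts[i] if i < len(parts) else None
    return out


# Hypothetical output names for the deck/num/side parts of Cabin.
names = ["Cabin_deck", "Cabin_num", "Cabin_side"]

full = split_delimited({"Cabin": "F/12/S", "Age": 30}, "Cabin", "/", names)
missing = split_delimited({"Cabin": None, "Age": 41}, "Cabin", "/", names)

print(full)
# {'Age': 30, 'Cabin_deck': 'F', 'Cabin_num': '12', 'Cabin_side': 'S'}
print(missing)
# {'Age': 41, 'Cabin_deck': None, 'Cabin_num': None, 'Cabin_side': None}
```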
Next
- Honest evaluation — the six validators in detail.
- The validation layer — full theory.
- Python SDK reference — every method, every param.