Tabular quickstart

A worked end-to-end example: the LLM-driven agent strategist proposes gradient-boosting variants on a tabular classification problem, the pre-execute LLM critic catches bad ideas before they run, and all six honest-eval validators fire on every spec.

What this run produced for us: the same setup, run against the Spaceship Titanic Kaggle competition, finished at AUC 0.8955 on the held-out test split, validated by all six honest-eval validators, at roughly $0.05 in LLM cost on Claude Sonnet. That is top-10% Kaggle territory, end-to-end on the platform.

What you'll build

A run that goes through these stages:

Round 0: the agent strategist (Claude / GPT / Ollama, your choice) proposes 3-5 pipelines (model family, hyperparameters, categorical encoder, feature transforms) that extend your spec template.
Pre-execute: an LLM critic adversarially reviews each spec against a tabular-specific checklist (target leakage, over-parameterization, etc.). A high-severity concern under lookahead_bias or data_snooping blocks the spec.
Execute: each spec trains, predicts, and scores.
Post-execute validators: all six (multi_calibration, dist_shift, honest_eval.shuffled_label, honest_eval.randomized_feature, honest_eval.secondary_holdout, honest_eval.permutation_fwer) run on every result.
Round 1+: the strategist sees the full validation findings from prior records (with prose explanations) and proposes improvements that target the specific failure modes.
Stop: the run ends on the first spec to clear the strict promotion gate, or after 5 iterations.
Finalisation: kept specs are deduplicated by output correlation (ρ ≥ 0.95), ensembled, and run through MCGrad subgroup multicalibration. The result is a validated model card you can download from GET /v1/runs/{id}/model_card.
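
The deduplication in the finalisation step can be pictured as a greedy filter over prediction vectors. The sketch below is an illustration only, not the platform's implementation: it assumes specs are visited best-first and that ρ is the Pearson correlation of held-out predictions. The names pearson and dedupe_by_correlation are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant predictions: treat as uncorrelated
    return cov / sqrt(var_a * var_b)


def dedupe_by_correlation(preds, threshold=0.95):
    """Keep a spec only if its correlation with every kept spec is below `threshold`.

    `preds` maps spec_id -> list of predictions, ordered best spec first
    (an assumption made for this sketch).
    """
    kept = []
    for spec_id, p in preds.items():
        if all(pearson(p, preds[k]) < threshold for k in kept):
            kept.append(spec_id)
    return kept


preds = {
    "a": [0.1, 0.4, 0.8, 0.9],
    "b": [0.12, 0.41, 0.79, 0.92],  # near-duplicate of "a"
    "c": [0.9, 0.2, 0.6, 0.1],      # genuinely different
}
print(dedupe_by_correlation(preds))
```

Here "b" is dropped because its predictions track "a" almost exactly, while "c" survives and would go on to the ensemble.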

Prerequisites

A Gnosys account on Starter or above (the LLM strategist and critic consume LLM tokens; the Free plan supports hyperparameter sweeps only).
Either an Anthropic / OpenAI API key, or a running Ollama server with a coding model pulled.

The spec template

The LLM extends this template each round. It only emits the diffs (hyperparameters, sometimes model_family); everything else stays the same:

spec_template = {
    "spec_id": "_t",
    "name": "tabular CTR sweep",
    "hypothesis": (
        "calibrated gradient boosting on synthetic_binary should beat "
        "logistic at the relaxed promotion gate; check both AUC and ECE"
    ),
    "task": "classification",
    "dataset_id": "synthetic_binary",
    "model_family": "gradient_boosting",
    "hyperparameters": {"n_estimators": 100, "max_depth": 3},
    "test_size": 0.25,
    "secondary_holdout_size": 0.20,
}
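
Because the strategist emits only diffs, each proposed spec can be thought of as the template with a patch applied. Here is a minimal sketch of that merge, assuming nested dicts such as hyperparameters are merged key by key. The platform's actual merge rules are not described in this guide, and apply_diff is a hypothetical helper.

```python
import copy


def apply_diff(template, diff):
    """Return a new spec: `template` with `diff` applied, recursing into nested dicts."""
    merged = copy.deepcopy(template)
    for key, value in diff.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_diff(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


template = {
    "model_family": "gradient_boosting",
    "hyperparameters": {"n_estimators": 100, "max_depth": 3},
    "test_size": 0.25,
}
# A hypothetical round-1 proposal: deeper trees plus one new hyperparameter.
diff = {"hyperparameters": {"max_depth": 5, "learning_rate": 0.05}}

spec = apply_diff(template, diff)
print(spec["hyperparameters"])
```

The template itself is left untouched, so every round starts from the same baseline.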

Submit the run

from gnosys import GnosysClient

client = GnosysClient()  # $GNOSYS_API_KEY

run = client.runs.create(
    domain="tabular",
    strategist={"kind": "llm"},
    spec_template=spec_template,
    max_iterations=5,
    tier_reached="paper",       # exit early on first spec clearing the strict promotion gate
    no_progress_window=2,        # also exit if 2 rounds without improvement
    llm={
        "provider": "anthropic",  # or "openai" or "ollama"
        "model": "claude-opus-4-7",
        "temperature": 0.4,
        "max_tokens": 4096,
        # api_key omitted → uses the platform's pool token
    },
    llm_critic={
        "enabled": True,
        # prompt_template / checklist omitted → use the tabular defaults
    },
)

runs.create returns immediately. To block until the run reaches a terminal status:

run = client.runs.wait(run.run_id, timeout=600.0, poll_interval=5.0)
print(f"final status: {run.status}")
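
Conceptually, a blocking wait like this is a poll loop with a deadline. The sketch below is a generic illustration rather than the SDK's code: wait_until_terminal, fetch_status, and the terminal status names are all assumptions made for the example.

```python
import time


def wait_until_terminal(fetch_status, timeout=600.0, poll_interval=5.0,
                        terminal=("succeeded", "failed", "cancelled")):
    """Call `fetch_status()` until it returns a terminal status or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        # Give up if the next poll would land past the deadline.
        if time.monotonic() + poll_interval > deadline:
            raise TimeoutError(f"still {status!r} after {timeout}s")
        time.sleep(poll_interval)


# Stand-in for a status lookup: reports "running" twice, then "succeeded".
_statuses = iter(["running", "running", "succeeded"])
print(wait_until_terminal(lambda: next(_statuses), timeout=1.0, poll_interval=0.01))
```

Using a monotonic clock keeps the deadline stable even if the system clock is adjusted mid-run.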

Read the per-iteration breakdown

iters = client.runs.iterations(run.run_id)
for it in iters.iterations:
    validated = sum(
        n for k, n in (it.tier_counts or {}).items()
        if not k.startswith("rejected") and k != "execution_error"
    )
    rejected = sum(
        n for k, n in (it.tier_counts or {}).items()
        if k.startswith("rejected") or k == "execution_error"
    )
    print(
        f"iter {it.iteration}: proposed={it.n_proposed} "
        f"records={it.n_records} validated={validated} rejected={rejected}"
    )

Typical output:

iter 0: proposed=4 records=4 validated=3 rejected=1
iter 1: proposed=3 records=3 validated=2 rejected=1
iter 2: proposed=2 records=2 validated=2 rejected=0

The rejected records are interesting — that's where the validation layer caught something. Drill in:

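# Blocker-severity findings explain why a spec was rejected.
# (Named `blockers` so it does not shadow the `rejected` count above.)
blockers = list(client.findings.list(run_id=run.run_id, severity="blocker"))
for f in blockers:
    print(f"  spec={f.spec_id}")
    print(f"  validator={f.validator}")
    print(f"  detail={f.detail}")
    print()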

Defense in depth — multi-validator correlations

Specs flagged by both the LLM critic AND the post-execute honest-eval are real leaks caught twice. Specs flagged by only one are worth investigating:

matches = client.findings.correlations(
    validators=["llm_critic", "honest_eval.shuffled_label"],
    severity="blocker",
    mode="and",
    pipeline_run_id=run.pipeline_run_id,
)
print(f"corroborated leaks: {len(matches)}")

# vs. specs the critic blocked but honest-eval missed:
critic_only = client.findings.correlations(
    validators=["llm_critic", "honest_eval.shuffled_label"],
    severity="blocker",
    mode="or",
    pipeline_run_id=run.pipeline_run_id,
)
critic_only_specs = [
    m for m in critic_only
    if "honest_eval.shuffled_label" not in m.matched_validators
]
print(f"critic-only blocks: {len(critic_only_specs)}")
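
The two queries differ only in how per-spec validator sets are combined. Here is a small sketch with plain sets, assuming "and" means every listed validator flagged the spec and "or" means at least one did. The spec IDs and the correlate helper are made up for illustration and are not part of the SDK.

```python
def correlate(flags, validators, mode):
    """Return spec_ids whose flagging validators satisfy `mode` over `validators`.

    `flags` maps spec_id -> set of validators that raised a blocker on it.
    """
    wanted = set(validators)
    if mode == "and":
        return sorted(s for s, v in flags.items() if wanted <= v)
    if mode == "or":
        return sorted(s for s, v in flags.items() if wanted & v)
    raise ValueError(f"unknown mode: {mode!r}")


flags = {
    "spec-1": {"llm_critic", "honest_eval.shuffled_label"},  # caught twice
    "spec-2": {"llm_critic"},                                # critic only
    "spec-3": {"dist_shift"},                                # unrelated validator
}
pair = ["llm_critic", "honest_eval.shuffled_label"]

both = correlate(flags, pair, "and")
either = correlate(flags, pair, "or")
critic_only = [s for s in either if "honest_eval.shuffled_label" not in flags[s]]
print(both, either, critic_only)
```

The critic-only list is the "or" result minus anything the shuffled-label validator also flagged, which mirrors the filter in the snippet above.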

What the LLM saw on round 1

The validation findings flow back to the strategist's prompt as structured prose. A round-1 prompt against a leaky round-0 spec includes (truncated):

prior_records:
  - spec_id: ts-r0-leaky
    outcome: rejected_validation
    primary_score: 1.0
    blockers: ["honest_eval.shuffled_label"]
    findings:
      - validator: honest_eval.shuffled_label
        severity: blocker
        summary: "Shuffled-label re-runs retained 100% of the model's lift
                  over chance — the canonical signature of target leakage
                  or holdout snooping in the feature pipeline."
        recommendation: "Audit the feature pipeline for any reference to
                         the test set (combined-set normalisation,
                         target-conditional encodings fit on all rows,
                         post-hoc filtering by y). Drop those operations
                         and re-run."

The LLM reads the recommendation and adjusts its next proposal to avoid the failure mode. That's the closed loop.
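
The "retained lift" figure in the shuffled-label summary can be read as a ratio of lifts over chance. The sketch below assumes that definition (shuffled-label lift divided by real-label lift, with chance at AUC 0.5). The validator's exact formula is not given in this guide, and retained_lift is a hypothetical helper.

```python
def retained_lift(real_score, shuffled_score, chance=0.5):
    """Fraction of the model's lift over chance that survives label shuffling."""
    real_lift = real_score - chance
    if real_lift <= 0:
        return 0.0  # no lift over chance, so nothing to retain
    return max(0.0, shuffled_score - chance) / real_lift


# A leaky pipeline: the score stays perfect even after labels are shuffled.
leaky = retained_lift(real_score=1.0, shuffled_score=1.0)

# A sound pipeline: the score collapses to roughly chance.
sound = retained_lift(real_score=0.90, shuffled_score=0.504)

print(f"leaky={leaky:.0%} sound={sound:.0%}")
```

A value near 100% means the score does not depend on the true labels at all, which is why it is treated as a blocker.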

Cost notes

The LLM strategist makes 1 call per round (round 0 generation + rounds 1+ rank-and-iterate + refine).
The LLM critic adds 1 call per spec (so a 5-iteration run with 4 specs/round = 20 critic calls, or about 25 LLM calls once the strategist's 5 are included).
Tokens consumed: a typical run lands at 50–200K tokens, depending on iteration count and the depth of the validation findings.
The Starter plan includes 1M tokens / month; Team includes 10M.
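
As a back-of-the-envelope check on these call counts, the sketch below multiplies them out. It assumes exactly one strategist call per round and one critic call per proposed spec, as described in this section. estimate_llm_calls is a hypothetical helper, not part of the SDK.

```python
def estimate_llm_calls(rounds, specs_per_round, critic_enabled=True):
    """Count LLM calls for a run: one strategist call per round, one critic call per spec."""
    strategist = rounds
    critic = rounds * specs_per_round if critic_enabled else 0
    return {"strategist": strategist, "critic": critic, "total": strategist + critic}


with_critic = estimate_llm_calls(rounds=5, specs_per_round=4)
without_critic = estimate_llm_calls(rounds=5, specs_per_round=4, critic_enabled=False)
print(with_critic)
print(without_critic)
```

For a 5-round run with 4 specs per round, that works out to 5 strategist calls plus 20 critic calls, so disabling the critic removes most of the calls.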

To run cheaper: turn off the critic (llm_critic.enabled=False), use a smaller model such as gpt-4o-mini or claude-haiku-4-5, or run against a local model with the ollama provider (llm.provider set to "ollama", or --llm-provider ollama).

Reproducing the Spaceship Titanic result

This uses the same kind of runs.create call as above, pointed at a Spaceship Titanic dataset uploaded via client.datasets.upload(...). In the snippet, dataset is assumed to be the object that upload returned:

spec_template = {
    "spec_id": "_t",
    "name": "spaceship-titanic agent sweep",
    "hypothesis": (
        "xgb/catboost on mixed-type passenger data; mediator features "
        "(cabin/group), per-group target encoding for high-cardinality "
        "categoricals"
    ),
    "task": "classification",
    "dataset_id": dataset.dataset_id,           # uploaded CSV
    "model_family": "xgb_classifier",
    "hyperparameters": {"n_estimators": 200, "max_depth": 6},
    "categorical_encoding": "native",            # xgb handles directly
    "feature_transforms": [                       # parsed from spec_id
        {"op": "split_delimited", "column": "Cabin", "delim": "/"},
        {"op": "drop", "column": "Name"},
    ],
    "test_size": 0.20,
    "secondary_holdout_size": 0.20,
    "cv_folds": 5,
}

run = client.runs.create(
    domain="tabular",
    strategist={"kind": "agent"},
    spec_template=spec_template,
    max_iterations=5,
    tier_reached="paper",
    llm={"provider": "anthropic", "model": "claude-sonnet-4-6"},
    llm_critic={"enabled": True},
)
final = client.runs.wait(run.run_id, timeout=900.0)
card = client.runs.model_card(final.run_id)  # HTML bytes

Resulting promoted spec (real run from May 2026): agent-xgb-conservative-iter0, AUC 0.8955, all six validators clean (shuffled-label retained 1% of lift, dist-shift concept share 12%, MCGrad ECE 0.019, worst-slice ECE 0.041).
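
For readers unfamiliar with the ECE figures quoted above, here is a minimal sketch of expected calibration error for a binary classifier. The binning scheme (10 equal-width bins over the predicted positive-class probability) is an assumption, and the platform's multi_calibration validator may compute it differently.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width-bin ECE for binary predictions.

    `probs` are predicted probabilities of the positive class, `labels` are 0/1.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(p for p, _ in bucket) / len(bucket)
        frac_pos = sum(y for _, y in bucket) / len(bucket)
        # Weight each bin's calibration gap by its share of the samples.
        ece += (len(bucket) / total) * abs(mean_conf - frac_pos)
    return ece


probs = [0.05, 0.15, 0.85, 0.95, 0.95, 0.55]
labels = [0, 0, 1, 1, 1, 0]
print(round(expected_calibration_error(probs, labels), 3))
```

A worst-slice ECE applies the same calculation to each subgroup separately and reports the largest value.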

Honest evaluation — the six validators in detail.
The validation layer — full theory.
Python SDK reference — every method, every param.
