OpenML benchmark — Gnosys vs FLAML / AutoGluon

We ship an example script that compares Gnosys Labs against two industry-standard AutoML baselines (FLAML and AutoGluon) on public OpenML classification tasks. Run it on your own laptop; publish the resulting markdown table; decide for yourself.

The script lives in the SDK package at client/examples/openml_benchmark.py and is pure customer code — it touches the Gnosys server via the same public SDK calls (client.datasets.upload, client.runs.create, client.runs.predict) that any real-data classification flow uses. Nothing benchmark-specific lives on the server side.

Install

pip install "gnosyslabs[benchmark]"

This pulls in openml, flaml[automl], autogluon.tabular, and scikit-learn. AutoGluon brings torch — install in a virtualenv.

Run

export GNOSYS_API_KEY=gn_live_…
python -m gnosys.examples.openml_benchmark \
    --tasks credit-g,blood-transfusion,kc2 \
    --time-budget-s 60 \
    --gnosys-iters 10 \
    --output benchmark.json

Curated task names available out of the box:

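| name              | OpenML task id | description                      |
|-------------------|----------------|----------------------------------|
| credit-g          | 31             | German credit risk (binary)      |
| blood-transfusion | 1464           | Blood Transfusion Service Center |
| qsar-biodeg       | 1494           | QSAR biodegradation              |
| kc2               | 1063           | NASA software defects            |
| ilpd              | 1480           | Indian Liver Patient             |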

Pass any integer OpenML task id directly: --tasks 31,1464,1494,1063,1480,2079.
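
As a rough illustration of how a --tasks value mixing curated names and raw ids could be resolved, here is a minimal sketch. The name-to-id mapping comes from the table above; the function name resolve_tasks and its error handling are assumptions for illustration, not the script's actual code.

```python
# Curated names and ids as listed in the table above.
CURATED_TASKS = {
    "credit-g": 31,
    "blood-transfusion": 1464,
    "qsar-biodeg": 1494,
    "kc2": 1063,
    "ilpd": 1480,
}


def resolve_tasks(spec: str) -> list[int]:
    """Turn a comma-separated --tasks value into a list of OpenML task ids.

    Hypothetical helper: accepts curated names and bare integers.
    """
    ids = []
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate stray commas
        if token.isdigit():
            ids.append(int(token))  # raw OpenML task id
        elif token in CURATED_TASKS:
            ids.append(CURATED_TASKS[token])
        else:
            raise ValueError(f"unknown task name: {token!r}")
    return ids
```

With this sketch, resolve_tasks("credit-g,1464,2079") yields the ids for one curated name and two raw ids, in the order given.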

What the script does, step by step

For each task:

1. Pull the OpenML dataset to your machine via the openml library.
2. Drop non-numeric columns; label-encode the binary target.
3. Stratified train/test split locally (sklearn, random_state=42, test_size=0.25). Test rows never leave the machine.
4. Upload only the training portion to Gnosys via client.datasets.upload(...).
5. Run client.runs.create(...); by default this is an HP sweep over logistic-regression C, 10 iterations.
6. Wait for completion, then call client.runs.predict(run_id, X_test) with the held-out rows.
7. Compute the local AUC: roc_auc_score(y_test, gnosys_probas).
8. Run FLAML on (X_train, y_train) with a wall-clock time budget; score on the same (X_test, y_test).
9. Same for AutoGluon.
10. Print one comparison row.

After all tasks: a markdown table to stdout plus a JSON dump.
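
The local preparation and scoring steps can be sketched with pandas and scikit-learn as follows. This is a minimal sketch, not the script itself: the helper names (prepare_split, score_backend, logreg_backend, make_toy_frame) are hypothetical, the data is synthetic, and a plain logistic regression stands in for whichever backend produces the probabilities.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder


def prepare_split(df: pd.DataFrame, target: str):
    """Drop non-numeric features, label-encode the target, split 75/25."""
    y = LabelEncoder().fit_transform(df[target])
    X = df.drop(columns=[target]).select_dtypes(include="number")
    # Same settings the steps above name: stratified, random_state=42, test_size=0.25.
    return train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)


def score_backend(fit_predict_proba, X_train, X_test, y_train, y_test) -> float:
    """Score any backend on the shared held-out rows with ROC AUC."""
    probas = fit_predict_proba(X_train, y_train, X_test)
    return roc_auc_score(y_test, probas)


def logreg_backend(X_train, y_train, X_test):
    # Stand-in backend (assumption): a real run would call Gnosys, FLAML or AutoGluon here.
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return model.predict_proba(X_test)[:, 1]


def make_toy_frame(n: int = 200, seed: int = 0) -> pd.DataFrame:
    """Synthetic binary task with one non-numeric column, for illustration only."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    label = np.where(x1 + 0.5 * rng.normal(size=n) > 0, "good", "bad")
    return pd.DataFrame({"x1": x1, "x2": x2, "note": ["free text"] * n, "label": label})


if __name__ == "__main__":
    X_train, X_test, y_train, y_test = prepare_split(make_toy_frame(), "label")
    auc = score_backend(logreg_backend, X_train, X_test, y_train, y_test)
    print(f"n_train={len(X_train)} n_test={len(X_test)} auc={auc:.3f}")
```

Because every backend goes through the same score_backend call with the same y_test, the comparison does not depend on how each backend splits data internally.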

Sample output

| task            | n_train | gnosys | flaml | autogluon | outcome   |
|-----------------|---------|--------|-------|-----------|-----------|
| credit-g        | 750     | 0.781  | 0.792 | 0.804     | validated |
| blood-transf.   | 561     | 0.722  | 0.711 | 0.733     | validated |
| kc2             | 391     | 0.831  | 0.819 | 0.842     | validated |

(Indicative numbers — your run will vary depending on time budget, strategist choice, and OpenML data refresh.)
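
A table in this shape could be rendered from per-task result records roughly as follows. This is a sketch under assumptions: the function name to_markdown and the dict-per-row record layout are illustrative, and the script's own formatting code may differ.

```python
def to_markdown(rows: list[dict], columns: list[str]) -> str:
    """Render result records as a padded pipe table (hypothetical helper)."""

    def fmt(value) -> str:
        # AUC-style floats get three decimals; everything else is printed as-is.
        return f"{value:.3f}" if isinstance(value, float) else str(value)

    cells = [[fmt(row.get(col, "")) for col in columns] for row in rows]
    widths = [
        max([len(col)] + [len(line[i]) for line in cells])
        for i, col in enumerate(columns)
    ]

    def render(values) -> str:
        padded = [v.ljust(w) for v, w in zip(values, widths)]
        return "| " + " | ".join(padded) + " |"

    separator = "|" + "|".join("-" * (w + 2) for w in widths) + "|"
    return "\n".join([render(columns), separator] + [render(line) for line in cells])


if __name__ == "__main__":
    # Values copied from the indicative sample row above.
    rows = [{"task": "credit-g", "n_train": 750, "gnosys": 0.781}]
    print(to_markdown(rows, ["task", "n_train", "gnosys"]))
```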

Apples-to-apples fairness

Every backend scores on the same y_test array. The split is done on your machine before anyone uploads anything; Gnosys never sees X_test until the predict call, and FLAML / AutoGluon are trained against the same X_train, y_train Gnosys received. The script prints the test-row count so you can confirm the splits match.

Gnosys's internal AUC (the score the server computes on its own validation split) is reported separately as gnosys_internal_score in the JSON output for transparency. It is not used in the comparison.

Tuning the budget

The --time-budget-s flag is wall-clock seconds, but the three backends parameterise their budgets differently:

Gnosys: --gnosys-iters is the orchestrator's max_iterations. Each iteration evaluates one HP setting; the default HP sweep has 6 values, so 10 iterations cover the whole grid plus refinement.
FLAML: --time-budget-s is the wall-clock seconds AutoML.fit gets.
AutoGluon: the same --time-budget-s is passed as time_limit.
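
For the two baselines, passing the budget through could look roughly like the sketch below. It uses FLAML's AutoML.fit(time_budget=...) and AutoGluon's TabularPredictor.fit(time_limit=...); the wrapper names flaml_probas and autogluon_probas, the "target" column name, and the choice of roc_auc as the tuning metric are assumptions for illustration, not necessarily what the script does.

```python
def flaml_probas(X_train, y_train, X_test, budget_s):
    """Fit FLAML under a wall-clock budget; return P(class 1) for X_test."""
    from flaml import AutoML  # imported lazily so this module loads without FLAML

    automl = AutoML()
    automl.fit(
        X_train=X_train,
        y_train=y_train,
        task="classification",
        metric="roc_auc",       # assumption: tune for the metric being reported
        time_budget=budget_s,   # wall-clock seconds
    )
    return automl.predict_proba(X_test)[:, 1]


def autogluon_probas(X_train, y_train, X_test, budget_s, label="target"):
    """Fit AutoGluon under the same budget; return P(positive class) for X_test."""
    from autogluon.tabular import TabularPredictor  # lazy: pulls in torch

    # AutoGluon expects features and label in one DataFrame.
    train_df = X_train.assign(**{label: y_train})
    predictor = TabularPredictor(label=label, eval_metric="roc_auc").fit(
        train_df, time_limit=budget_s
    )
    proba = predictor.predict_proba(X_test)  # DataFrame, one column per class
    return proba[predictor.positive_class].to_numpy()
```

Both wrappers expect X_train and X_test as pandas DataFrames and return a 1-D array that can be passed straight to roc_auc_score(y_test, ...).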

For a fairer comparison, scale --gnosys-iters down for short budgets and up for longer ones. The defaults (60 s vs. 10 iterations for HP sweep over logistic regression) are calibrated to take roughly the same wall-clock time on a laptop.
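
One simple way to keep the two budgets in step is to scale the iteration count linearly from the default pairing of 60 s and 10 iterations. Linear scaling is an assumption made for this sketch, not a rule the script applies, and scaled_iters is a hypothetical helper.

```python
DEFAULT_BUDGET_S = 60   # default --time-budget-s
DEFAULT_ITERS = 10      # default --gnosys-iters


def scaled_iters(budget_s: float, min_iters: int = 1) -> int:
    """Scale --gnosys-iters in proportion to the baselines' time budget."""
    return max(min_iters, round(DEFAULT_ITERS * budget_s / DEFAULT_BUDGET_S))


if __name__ == "__main__":
    for budget in (30, 60, 120):
        print(f"--time-budget-s {budget} --gnosys-iters {scaled_iters(budget)}")
```

Keep in mind that the default HP sweep has 6 values, so anything below 6 iterations no longer covers the whole grid.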

Skipping a baseline

python -m gnosys.examples.openml_benchmark --skip-autogluon …   # Gnosys vs FLAML only
python -m gnosys.examples.openml_benchmark --skip-flaml …        # Gnosys vs AutoGluon only

Skipping AutoGluon is useful if you don't want to install torch.
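
A sketch of how such skip flags are commonly wired with argparse is shown below. The flag names and the 60 s / 10 iteration defaults come from this page; the helper names build_parser and enabled_backends, and the overall structure, are assumptions rather than the script's actual code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Subset of the benchmark CLI relevant to budgets and skipping baselines."""
    parser = argparse.ArgumentParser(prog="openml_benchmark")
    parser.add_argument("--time-budget-s", type=int, default=60)
    parser.add_argument("--gnosys-iters", type=int, default=10)
    parser.add_argument("--skip-flaml", action="store_true")
    parser.add_argument("--skip-autogluon", action="store_true")
    return parser


def enabled_backends(args: argparse.Namespace) -> list[str]:
    """Gnosys always runs; each baseline runs unless its skip flag is set."""
    backends = ["gnosys"]
    if not args.skip_flaml:
        backends.append("flaml")
    if not args.skip_autogluon:
        backends.append("autogluon")
    return backends


if __name__ == "__main__":
    args = build_parser().parse_args(["--skip-autogluon"])
    print(enabled_backends(args))
```

Importing each baseline library only when its backend is enabled is what lets a skip flag avoid the corresponding install.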

Plan caps

The script makes one upload, one run, and one predict call per task. On the free plan each task therefore consumes 1 dataset, 1 run, and 1 predict, so the 5 runs/mo cap limits you to 5 tasks per month. Starter lifts both caps comfortably. Hitting a cap returns HTTP 402 with an upgrade link.

When the comparison surprises you

Gnosys tends to win on small-data tasks and on tasks where leakage is endemic: the validation layer rejects the model AutoGluon silently builds.
AutoGluon usually wins when there's a clean signal and big enough train data — its model zoo is wider.
FLAML lands close to Gnosys on the simple baselines our HP sweep covers; the gap widens once you turn on --strategist llm.

The point of the script isn't "Gnosys wins every time"; it's to let you measure on your data what the validation layer is worth.

Source

github.com/gnosyslabs/gnosys-client/examples/openml_benchmark.py — ~250 lines, copy-paste editable.
