OpenML benchmark — Gnosys vs FLAML / AutoGluon
We ship an example script that compares Gnosys Labs against two industry-standard AutoML baselines (FLAML and AutoGluon) on public OpenML classification tasks. Run it on your own laptop; publish the resulting markdown table; decide for yourself.
The script lives in the SDK package at
client/examples/openml_benchmark.py
and is pure customer code — it touches the Gnosys server via
the same public SDK calls (client.datasets.upload,
client.runs.create, client.runs.predict) that any real-data
classification flow uses. Nothing benchmark-specific lives on
the server side.
Install
pip install "gnosyslabs[benchmark]"
This pulls in openml, flaml[automl], autogluon.tabular, and
scikit-learn. AutoGluon brings torch — install in a virtualenv.
Run
export GNOSYS_API_KEY=gn_live_…
python -m gnosys.examples.openml_benchmark \
--tasks credit-g,blood-transfusion,kc2 \
--time-budget-s 60 \
--gnosys-iters 10 \
--output benchmark.json
Curated task names available out of the box:
| name | OpenML task id | description |
|---|---|---|
| credit-g | 31 | German credit risk (binary) |
| blood-transfusion | 1464 | Blood Transfusion Service Center |
| qsar-biodeg | 1494 | QSAR biodegradation |
| kc2 | 1063 | NASA software defects |
| ilpd | 1480 | Indian Liver Patient |
Pass any integer OpenML task id directly:
--tasks 31,1464,1494,1063,1480,2079.
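One way the name-or-id resolution could work is sketched below. The `CURATED_TASKS` dict mirrors the table above; the `resolve_tasks` helper is hypothetical and is not taken from the script itself, which may resolve names differently.

```python
# Curated names from the table above, mapped to OpenML task ids.
CURATED_TASKS = {
    "credit-g": 31,
    "blood-transfusion": 1464,
    "qsar-biodeg": 1494,
    "kc2": 1063,
    "ilpd": 1480,
}


def resolve_tasks(spec: str) -> list[int]:
    """Turn a --tasks value such as "credit-g,1464" into OpenML task ids.

    Hypothetical helper, written only to illustrate the lookup.
    """
    task_ids = []
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate stray commas
        if token.isdigit():
            task_ids.append(int(token))  # raw integer id passes through
        elif token in CURATED_TASKS:
            task_ids.append(CURATED_TASKS[token])
        else:
            raise ValueError(f"unknown task name: {token!r}")
    return task_ids


print(resolve_tasks("credit-g,blood-transfusion,kc2"))
print(resolve_tasks("31,1464,2079"))
```

Curated names and raw integer ids can be mixed in one `--tasks` value under this scheme; an unknown name fails fast instead of being silently skipped.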
What the script does, step by step
For each task:
- Pull the OpenML dataset to your machine via the openml library.
- Drop non-numeric columns; label-encode the binary target.
- Stratified train/test split locally (sklearn, random_state=42, test_size=0.25). Test labels never leave the machine; test features are sent only in the predict call below.
- Upload only the training portion to Gnosys via client.datasets.upload(...).
- Run client.runs.create(...); by default this is an HP sweep over logistic-regression C with 10 iterations.
- Wait for completion, then call client.runs.predict(run_id, X_test) with the held-out rows.
- Compute the local AUC: roc_auc_score(y_test, gnosys_probas).
- Run FLAML on (X_train, y_train) with a wall-clock time budget; score on the same (X_test, y_test).
- Same for AutoGluon.
- Print one comparison row.
After all tasks: a markdown table to stdout plus a JSON dump.
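The local half of the per-task steps (encode, split, score) can be sketched with scikit-learn alone. This is a minimal illustration under stated assumptions: the data is synthetic rather than pulled from OpenML, and a local `LogisticRegression` stands in for whichever backend produces the probabilities. The Gnosys SDK calls are left out because their exact signatures are not shown on this page.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-in for an OpenML binary task (assumption: the real
# script downloads the dataset with the openml library instead).
X, y_raw = make_classification(n_samples=400, n_features=8, random_state=0)
y_labels = np.where(y_raw == 1, "good", "bad")  # string class labels

# Label-encode the binary target: "bad" -> 0, "good" -> 1.
y = LabelEncoder().fit_transform(y_labels)

# Stratified split with the parameters quoted in the steps above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Stand-in backend: any model that returns positive-class probabilities
# for X_test can be scored the same way.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probas = model.predict_proba(X_test)[:, 1]

auc = roc_auc_score(y_test, probas)
print(f"n_train={len(X_train)} n_test={len(X_test)} auc={auc:.3f}")
```

Because the split happens before any backend is involved, every backend's probabilities can be passed to the same `roc_auc_score(y_test, ...)` call.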
Sample output
| task | n_train | gnosys | flaml | autogluon | outcome |
|-----------------|---------|--------|-------|-----------|-----------|
| credit-g | 750 | 0.781 | 0.792 | 0.804 | validated |
| blood-transf. | 561 | 0.722 | 0.711 | 0.733 | validated |
| kc2 | 391 | 0.831 | 0.819 | 0.842 | validated |
(Indicative numbers — your run will vary depending on time budget, strategist choice, and OpenML data refresh.)
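For readers who want to post-process results themselves, here is a rough sketch of turning per-task rows into the markdown table and a JSON string. The column names follow the sample output above; the `to_markdown` helper and the row layout are assumptions for illustration, not the script's actual JSON schema.

```python
import json

COLUMNS = ["task", "n_train", "gnosys", "flaml", "autogluon", "outcome"]


def to_markdown(rows: list[dict]) -> str:
    """Render result rows as a pipe-delimited markdown table."""

    def fmt(value):
        # AUCs are floats; print them to three decimals like the sample output.
        return f"{value:.3f}" if isinstance(value, float) else str(value)

    lines = [
        "| " + " | ".join(COLUMNS) + " |",
        "|" + "|".join("---" for _ in COLUMNS) + "|",
    ]
    for row in rows:
        lines.append("| " + " | ".join(fmt(row[c]) for c in COLUMNS) + " |")
    return "\n".join(lines)


# Illustrative row copied from the sample output above, not a measured result.
rows = [
    {"task": "credit-g", "n_train": 750, "gnosys": 0.781,
     "flaml": 0.792, "autogluon": 0.804, "outcome": "validated"},
]

print(to_markdown(rows))
print(json.dumps(rows, indent=2))
```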
Apples-to-apples fairness
Every backend scores on the same y_test array. The split is
done on your machine before anyone uploads anything; Gnosys never
sees X_test until the predict call, and FLAML / AutoGluon are
trained against the same X_train, y_train Gnosys received. The
script prints the test-row count so you can confirm the splits
match.
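The same-`y_test` guarantee can also be enforced in code. The sketch below scores every backend against one shared label array and refuses to score a prediction vector whose length does not match the test set. `score_all` and the backend names are hypothetical; the labels and probabilities are toy values, not benchmark output.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def score_all(y_test, probas_by_backend):
    """Score every backend's probabilities against one shared y_test."""
    y_test = np.asarray(y_test)
    scores = {}
    for name, probas in probas_by_backend.items():
        probas = np.asarray(probas, dtype=float)
        if probas.shape != y_test.shape:
            # A length mismatch means the backend saw a different split.
            raise ValueError(
                f"{name}: got {probas.size} predictions for {y_test.size} test rows"
            )
        scores[name] = roc_auc_score(y_test, probas)
    return scores


# Toy labels and probabilities for two hypothetical backends.
y_test = [0, 0, 1, 1]
scores = score_all(
    y_test,
    {
        "backend_a": [0.1, 0.4, 0.35, 0.8],
        "backend_b": [0.1, 0.2, 0.8, 0.9],
    },
)
print(scores)
```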
Gnosys's internal AUC (the one the server computes on its own
validation split) is reported separately as
gnosys_internal_score in the JSON output for transparency. It is
not used for the comparison.
Tuning the budget
The three backends parameterise their budgets differently: FLAML and AutoGluon take a wall-clock limit, while Gnosys takes an iteration count.
- Gnosys: --gnosys-iters is the orchestrator's max_iterations. Each iteration evaluates one HP setting; the default HP sweep has 6 values, so 10 iterations cover the whole grid plus refinement.
- FLAML: --time-budget-s is the wall-clock seconds AutoML.fit gets.
- AutoGluon: the same --time-budget-s is the time_limit.
For a fairer comparison, scale --gnosys-iters down for short
budgets and up for longer ones. The defaults (60 s vs. 10 iterations
for HP sweep over logistic regression) are calibrated to take
roughly the same wall-clock time on a laptop.
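A simple starting point for that scaling is to keep the default ratio of 10 iterations per 60 seconds. The linear rule below is an assumption for illustration, not something the script does for you; real iteration times depend on dataset size and strategist, so treat the result as a first guess for `--gnosys-iters`.

```python
DEFAULT_BUDGET_S = 60  # default --time-budget-s
DEFAULT_ITERS = 10     # default --gnosys-iters


def scaled_gnosys_iters(time_budget_s: float) -> int:
    """Scale the iteration count in proportion to the wall-clock budget.

    Hypothetical heuristic: assumes iteration time is roughly constant.
    """
    if time_budget_s <= 0:
        raise ValueError("time budget must be positive")
    iters = round(DEFAULT_ITERS * time_budget_s / DEFAULT_BUDGET_S)
    return max(1, iters)  # always run at least one iteration


for budget in (30, 60, 120):
    print(budget, scaled_gnosys_iters(budget))
```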
Skipping a baseline
python -m gnosys.examples.openml_benchmark --skip-autogluon … # FLAML only
python -m gnosys.examples.openml_benchmark --skip-flaml … # AutoGluon only
Useful if you don't want to install torch.
Plan caps
The script makes one upload, one run, and one predict per task. On
the free plan, the cap of 5 runs/mo therefore limits you to 5 tasks
per month. Starter lifts both caps comfortably.
Hitting the cap returns HTTP 402 with an upgrade link.
When the comparison surprises you
- Gnosys tends to win on small-data tasks or tasks where leakage is endemic: the validation layer rejects the leaky model that AutoGluon builds without warning.
- AutoGluon usually wins when there's a clean signal and big enough train data — its model zoo is wider.
- FLAML lands close to Gnosys on the simple baselines our HP
sweep covers; the gap widens once you turn on
--strategist llm.
The point of the script isn't "Gnosys wins every time"; it's to let you measure on your data what the validation layer is worth.
Source
github.com/gnosyslabs/gnosys-client/examples/openml_benchmark.py — ~250 lines, copy-paste editable.