Evidence Benchmark
==================

.. meta::
   :description: Reproduce MAMUT validation evidence diagnostics on public sklearn classification datasets.
   :keywords: MAMUT benchmark, validation evidence, baseline comparison, tabular classification

MAMUT includes a lightweight benchmark script for release diagnostics. The goal
is not to claim state-of-the-art AutoML performance. The goal is to verify that
the selected model beats trivial baselines, that stronger baselines are visible
when they challenge the selection, and that score stability is reported with
descriptive resampling intervals.

Run the benchmark from the repository root:

.. code-block:: sh

   uv run python scripts/benchmark_evidence.py --format markdown

The default run uses:

* sklearn ``breast_cancer``, ``digits``, and ``wine`` datasets
* fixed ``random_state=42``
* ``balanced_accuracy`` as the selection metric
* ``holdout_size=0.2`` for final evaluation
* final refit on all non-holdout modeling rows before holdout scoring
* one random-search iteration for speed
* three-fold repeated stratified CV with one repeat for score stability
* the lightweight ``quick`` candidate profile (logistic regression, random
  forest, extra trees, and Gaussian naive Bayes)

Example Diagnostic Output
-------------------------

The following output was generated from the locked development environment for
a release validation pass:

.. code-block:: text

   | dataset       | samples | features | classes | selected_model         | holdout_score | best_baseline       | best_baseline_score | repeated_cv_mean | repeated_cv_ci | guidance   | leakage_warnings |
   | ------------- | ------- | -------- | ------- | ---------------------- | ------------- | ------------------- | ------------------- | ---------------- | -------------- | ---------- | ---------------- |
   | breast_cancer | 569     | 30       | 2       | RandomForestClassifier | 0.943         | Logistic Regression | 0.953               | 0.958            | [0.879, 1.000] | challenged | 0                |
   | digits        | 1797    | 64       | 10      | RandomForestClassifier | 0.969         | Random Forest       | 0.969               | 0.962            | [0.939, 0.986] | challenged | 0                |
   | wine          | 178     | 13       | 3       | LogisticRegression     | 1.000         | Logistic Regression | 1.000               | 0.981            | [0.934, 1.000] | confirmed  | 0                |

Interpretation
--------------

``confirmed`` means no evidence baseline exceeded the selected model by the
configured practical margin. ``challenged`` means a baseline matched or beat the
selected model strongly enough to require review. A challenge is useful signal:
it prevents MAMUT from presenting a validation-selected model as stronger than
the evidence supports.

On ``breast_cancer``, a simple logistic-regression baseline beats the selected
random-forest candidate on the holdout split, and MAMUT surfaces that challenge.
On ``digits``, a random-forest baseline matches the selected candidate closely
enough to require review. On ``wine``, the holdout score is saturated while
score-stability evidence remains visible. This is the intended behavior:
attractive holdout results do not suppress baseline or stability checks.
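
As a rough illustration of the kind of evidence interpreted here, the sketch
below collects a holdout score, a trivial-baseline score, and fold-level CV
scores for ``wine`` using plain scikit-learn. It is not MAMUT's implementation
and will not necessarily reproduce the numbers in the table. The stratified
split, the scaling pipeline, the ``most_frequent`` dummy baseline, and all
variable names are assumptions made for this example; only the dataset, the
metric, the seed, the holdout fraction, and the fold count come from the
defaults listed earlier.

.. code-block:: python

   from sklearn.datasets import load_wine
   from sklearn.dummy import DummyClassifier
   from sklearn.linear_model import LogisticRegression
   from sklearn.metrics import balanced_accuracy_score
   from sklearn.model_selection import (
       RepeatedStratifiedKFold,
       cross_val_score,
       train_test_split,
   )
   from sklearn.pipeline import make_pipeline
   from sklearn.preprocessing import StandardScaler

   SEED = 42  # matches the documented random_state

   X, y = load_wine(return_X_y=True)

   # Hold out 20% of rows for final evaluation (stratification is an assumption).
   X_model, X_hold, y_model, y_hold = train_test_split(
       X, y, test_size=0.2, stratify=y, random_state=SEED
   )

   # Hypothetical candidate and trivial baseline, chosen for illustration only.
   candidate = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
   baseline = DummyClassifier(strategy="most_frequent")

   # Three-fold repeated stratified CV with one repeat, on modeling rows only.
   cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=1, random_state=SEED)
   cv_scores = cross_val_score(
       candidate, X_model, y_model, cv=cv, scoring="balanced_accuracy"
   )

   # Refit on all non-holdout rows, then score both models on the holdout split.
   candidate.fit(X_model, y_model)
   baseline.fit(X_model, y_model)
   holdout_score = balanced_accuracy_score(y_hold, candidate.predict(X_hold))
   baseline_score = balanced_accuracy_score(y_hold, baseline.predict(X_hold))

   print(f"holdout balanced accuracy:  {holdout_score:.3f}")
   print(f"baseline balanced accuracy: {baseline_score:.3f}")
   print(f"CV fold scores: {cv_scores.round(3)} (mean {cv_scores.mean():.3f})")

Comparing ``holdout_score`` with ``baseline_score`` and with the spread of
``cv_scores`` gives a manual version of the review that a ``challenged`` row
calls for. How MAMUT converts fold scores into ``repeated_cv_ci`` and which
practical margin it applies are not specified on this page, so the sketch
stops at the raw scores.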