User Guide
==========

.. meta::
   :description: Learn how MAMUT handles tabular classification data, preprocessing, model search, metrics, and reproducibility.
   :keywords: MAMUT user guide, tabular classification, preprocessing, model selection, Optuna

MAMUT exposes the main workflow through :class:`mamut.wrapper.Mamut`. The class
expects tabular features in a pandas ``DataFrame`` and a categorical target in a
pandas ``Series`` or compatible array.

Data Requirements
-----------------

* ``X`` should be a pandas ``DataFrame`` with numeric and/or categorical
  feature columns.
* ``y`` must represent classes. Floating-point targets are rejected because
  MAMUT is a classification package, not a regression package.
* With preprocessing enabled, MAMUT detects numeric and categorical columns
  automatically unless ``numeric_features`` or ``categorical_features`` are
  passed explicitly.
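
Class codes that were loaded as floats (for example ``0.0`` and ``1.0``) are a
common cause of a rejected target. The sketch below shows one way to catch and
fix this before calling ``fit``. The helper ``check_classification_target`` is
hypothetical and is not part of MAMUT; it only assumes that a float dtype is
what triggers the rejection.

.. code-block:: python

   import pandas as pd
   from pandas.api.types import is_float_dtype


   def check_classification_target(y: pd.Series) -> pd.Series:
       """Hypothetical pre-check: raise if ``y`` has a floating-point dtype."""
       if is_float_dtype(y):
           raise ValueError(
               "Floating-point target detected; cast class codes to int or "
               "use string/categorical labels before calling fit."
           )
       return y


   y_float = pd.Series([0.0, 1.0, 1.0, 0.0])  # class codes stored as floats
   # The cast is safe here because every value is a whole number.
   y_int = y_float.astype("int64")
   check_classification_target(y_int)  # passes and returns y_int unchanged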

Preprocessing
-------------

Preprocessing is enabled by default with ``preprocess=True``. Extra keyword
arguments passed to ``Mamut`` are forwarded to
:class:`mamut.preprocessing.preprocessing.Preprocessor`.

.. code-block:: python

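   from mamut.wrapper import Mamut

   # Preprocessing keyword arguments are forwarded to the Preprocessor.
   mamut = Mamut(
       num_imputation="mean",
       cat_imputation="most_frequent",
       scaling="standard",
       feature_selection=True,
       pca=False,
   )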

The preprocessing pipeline can handle missing numeric values, missing
categorical values, one-hot encoding, native categorical columns for supported
boosting models, skew correction, scaling, optional outlier filtering,
imbalanced target resampling, optional feature selection, and optional PCA.

By default, ``preprocessing_profile="auto"`` lets each candidate use a
model-aware preprocessing profile. Linear, kernel, and distance-based models
use the generic one-hot path. Tree models use one-hot encoded categoricals
without unnecessary numeric scaling. CatBoost and LightGBM use native
categorical columns when compatible with the rest of the preprocessing options.
Set ``preprocessing_profile="generic_ohe"`` to force the legacy shared one-hot
path for every candidate.

Automatic row removal for outliers is disabled by default because it can change
the target distribution and hurt external validity. Enable it only when that is
part of the intended experiment:

.. code-block:: python

   mamut = Mamut(outlier_removal=True)

Model Search
------------

MAMUT compares a set of supported classifiers and selects the best model by the
configured score metric on a validation split. Supported model families include
logistic regression, random forests, extremely randomized trees, histogram
gradient boosting, XGBoost, LightGBM, CatBoost, support vector machines,
multilayer perceptrons, Gaussian naive Bayes, and k-nearest neighbors.

Use ``search_profile`` to choose the candidate pool:

.. code-block:: python

   mamut = Mamut(search_profile="quick")      # small fast pool
   mamut = Mamut(search_profile="balanced")   # default tabular pool
   mamut = Mamut(search_profile="thorough")   # includes slower learners

Use ``include_models`` for an exact candidate set, or ``exclude_models`` to
remove expensive or unwanted estimators by class name:

.. code-block:: python

   mamut = Mamut(include_models=["RandomForestClassifier", "LGBMClassifier"])
   mamut = Mamut(exclude_models=["SVC", "MLPClassifier"])

``include_models`` and ``exclude_models`` are mutually exclusive. Set
``n_jobs`` to control parallelism for supported estimators.

Selection Strategy
------------------

The default ``selection_strategy="single_split"`` keeps runtime low by choosing
the best tuned candidate on one validation split. When the selection should not
hinge on a single split, use nested validation over the non-holdout modeling
data:

.. code-block:: python

   mamut = Mamut(
       search_profile="balanced",
       selection_strategy="nested_cv",
       selection_cv_splits=5,
       selection_cv_repeats=2,
       selection_practical_margin=0.005,
   )

Nested-CV selection performs tuning inside every outer training fold, fits
preprocessing inside those folds, and never uses final holdout rows. It selects
by mean outer-fold score; models within the practical margin are treated as
ties and resolved by lower score variance, then faster selection runtime.
``selection_strategy="repeated_cv"`` is retained only as a deprecated alias.
Inspect ``selection_summary_`` after ``fit`` to see the selection evidence.
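
The tie-breaking rule described above can be sketched in plain Python. This is
an illustration of the rule only, not MAMUT's implementation: the function
``select_candidate``, the layout of ``results``, and the use of population
variance are assumptions made for the example.

.. code-block:: python

   from statistics import mean, pvariance


   def select_candidate(results, practical_margin=0.005):
       """Pick a name from {name: {"scores": [...], "runtime": seconds}}."""
       summary = {
           name: (mean(r["scores"]), pvariance(r["scores"]), r["runtime"])
           for name, r in results.items()
       }
       best_mean = max(m for m, _, _ in summary.values())
       # Candidates whose mean is within the margin of the best count as ties.
       tied = [
           name
           for name, (m, _, _) in summary.items()
           if best_mean - m <= practical_margin
       ]
       # Break ties by lower score variance first, then by shorter runtime.
       return min(tied, key=lambda name: (summary[name][1], summary[name][2]))


   # Made-up outer-fold scores for illustration.
   results = {
       "RandomForestClassifier": {
           "scores": [0.90, 0.93, 0.91, 0.94, 0.89],
           "runtime": 40.0,
       },
       "LGBMClassifier": {
           "scores": [0.91, 0.915, 0.912, 0.913, 0.91],
           "runtime": 25.0,
       },
       "SVC": {"scores": [0.85, 0.86, 0.84, 0.85, 0.86], "runtime": 90.0},
   }
   selected = select_candidate(results)

In this made-up example the random forest has the highest mean score, but
LightGBM is within the 0.005 margin and has much lower fold-to-fold variance,
so ``selected`` is ``"LGBMClassifier"``.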

Hyperparameter Search
---------------------

Set the optimization method and iteration budget at initialization:

.. code-block:: python

   mamut = Mamut(
       optimization_method="bayes",
       n_iterations=30,
       random_state=42,
   )

Use ``optimization_method="random_search"`` for a simpler random search. Use
``optimization_method="bayes"`` for Optuna's tree-structured Parzen estimator.
Set ``verbose=True`` when you want model-search progress logging and Optuna
progress bars.

Metrics
-------

Choose a score metric with ``score_metric``:

.. code-block:: python

   mamut = Mamut(score_metric="balanced_accuracy")

Supported values are ``accuracy``, ``precision``, ``recall``, ``f1``,
``balanced_accuracy``, ``jaccard``, and ``roc_auc_score``. Classification
metrics are weighted when needed for multiclass problems.
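
To make "weighted" concrete, the sketch below computes a support-weighted F1
score by hand: each class's F1 is weighted by its share of the true labels.
This assumes MAMUT's weighting matches the convention of scikit-learn's
``average="weighted"``; the text above does not specify the exact averaging,
and ``weighted_f1`` is a hypothetical helper written for this example.

.. code-block:: python

   from collections import Counter


   def weighted_f1(y_true, y_pred):
       """Support-weighted mean of per-class F1 scores."""
       support = Counter(y_true)
       total = len(y_true)
       pairs = list(zip(y_true, y_pred))
       score = 0.0
       for label, n_true in support.items():
           tp = sum(1 for t, p in pairs if t == label and p == label)
           fp = sum(1 for t, p in pairs if t != label and p == label)
           fn = n_true - tp
           denom = 2 * tp + fp + fn
           f1 = 2 * tp / denom if denom else 0.0
           # Weight each class by its share of the true labels.
           score += (n_true / total) * f1
       return score


   y_true = ["a", "a", "a", "b", "b", "c"]
   y_pred = ["a", "a", "b", "b", "b", "c"]
   score = weighted_f1(y_true, y_pred)  # per-class F1: a=0.8, b=0.8, c=1.0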

Validation and Holdout Data
---------------------------

By default, ``fit`` creates a stratified train/validation split. The validation
split is used for model selection, ensemble selection, and
``validation_summary_``:

.. code-block:: python

   mamut = Mamut(validation_size=0.2, random_state=42)
   mamut.fit(X, y)

For final evaluation, reserve a holdout set that is never used during model or
ensemble selection:

.. code-block:: python

   mamut = Mamut(holdout_size=0.2, random_state=42)
   mamut.fit(X, y)
   mamut.evaluate()  # uses the holdout split automatically

You can also provide an explicit holdout set:

.. code-block:: python

   mamut.fit(X_train, y_train, X_holdout=X_holdout, y_holdout=y_holdout)

Use holdout scores for final reporting. Use validation scores for model
selection and debugging.

When observations share a subject, household, session, patient, or other unit,
pass group identifiers so no related rows cross validation boundaries:

.. code-block:: python

   mamut = Mamut(
       selection_strategy="nested_cv",
       holdout_size=0.2,
       refit_final_model=True,
   )
   # passenger_group holds one group identifier per row of X
   mamut.fit(X, y, groups=passenger_group)

For an explicit holdout, also pass ``groups_holdout=``; overlapping modeling
and holdout groups are rejected.
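
Before passing an explicit holdout, it can help to check for shared groups
yourself so the problem surfaces early. The helper below is a hypothetical
pre-check, not a MAMUT function, and the group labels are made up for
illustration.

.. code-block:: python

   def overlapping_groups(groups_model, groups_holdout):
       """Return the sorted group labels present on both sides of the split."""
       return sorted(set(groups_model) & set(groups_holdout))


   groups_model = ["fam_1", "fam_1", "fam_2", "fam_3"]
   clean_holdout = ["fam_4", "fam_4", "fam_5"]
   leaky_holdout = ["fam_3", "fam_6"]

   clean = overlapping_groups(groups_model, clean_holdout)  # nothing shared
   leaky = overlapping_groups(groups_model, leaky_holdout)  # fam_3 is shared

An empty result means no group spans the modeling and holdout data. Any label
in the result identifies rows that should move entirely to one side.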

Final Refit
-----------

By default, ``best_model_`` is the estimator selected on the validation split.
This keeps the selected model aligned with the validation evidence. If you want
the public prediction pipeline to refit on all non-holdout modeling data after
selection, set ``refit_final_model=True``:

.. code-block:: python

   mamut = Mamut(
       holdout_size=0.2,
       refit_final_model=True,
       random_state=42,
   )
   mamut.fit(X, y)

The final refit never uses holdout rows. Use this option for deployment
artifacts after you have accepted validation diagnostics. With
``selection_strategy="nested_cv"``, the selected model family is also retuned
by cross-validation on all non-holdout modeling rows before the final fit.

Prediction Contract
-------------------

``Mamut.predict`` and ``mamut.best_model_.predict`` return the original target
labels, even though MAMUT encodes labels internally for estimator training.
``predict_proba`` returns estimator probabilities in the class order exposed by
the fitted public model. Unknown categorical levels at prediction time are
encoded as all zeros for that categorical feature group instead of raising an
error.
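
The all-zeros behavior for unknown levels can be illustrated with a minimal
one-hot encoder. This is a sketch of the contract, not MAMUT's encoder; the
function ``one_hot`` and the level names are made up for the example.
scikit-learn's ``OneHotEncoder(handle_unknown="ignore")`` follows the same
convention.

.. code-block:: python

   def one_hot(value, known_levels):
       """Encode ``value`` against fit-time levels; unknown values give zeros."""
       return [1 if value == level else 0 for level in known_levels]


   known_levels = ["cherbourg", "queenstown", "southampton"]  # seen during fit

   seen = one_hot("queenstown", known_levels)  # one column is set
   unseen = one_hot("belfast", known_levels)   # no column is set

A row with an unseen level is therefore indistinguishable from "none of the
known levels", which keeps prediction running but can hide data drift. It is
worth monitoring how often this happens in production inputs.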

Evidence Checks
---------------

``evaluate`` includes an evidence layer by default. It is designed to answer
whether the reported score is trustworthy, not only which model scored
highest.

The evidence layer includes:

* basic leakage checks for target-like columns, exact target copies, identifier
  columns, duplicate feature rows, and class imbalance
* comparison against fitted MAMUT candidates plus dummy, logistic regression,
  and random forest baselines
* repeated stratified cross-validation, or group-disjoint stratified folds when
  ``groups=`` is supplied, for score stability
* descriptive t-based stability intervals over repeated fold scores, clipped
  to the valid metric range; these folds are dependent and the interval is not
  a confirmatory confidence claim
* selection guidance that uses this evidence to confirm, challenge, or block
  trust in the validation-selected model

.. code-block:: python

   mamut = Mamut(
       holdout_size=0.2,
       evidence_cv_splits=5,
       evidence_cv_repeats=3,
       evidence_confidence_level=0.95,
   )
   mamut.fit(X, y)
   mamut.evaluate()
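
With 5 splits and 3 repeats there are 15 fold scores. A textbook t-based
interval over those scores, clipped to the metric range, can be sketched as
follows. This is not MAMUT's code: the function ``stability_interval`` and the
scores are illustrative, and the critical value is passed in by hand
(``scipy.stats.t.ppf(0.975, df=14)`` gives about 2.145 for a 95% level).

.. code-block:: python

   from math import sqrt
   from statistics import mean, stdev


   def stability_interval(scores, t_crit, lower=0.0, upper=1.0):
       """Mean +/- t * s / sqrt(n), clipped to the valid metric range."""
       center = mean(scores)
       half_width = t_crit * stdev(scores) / sqrt(len(scores))
       return max(lower, center - half_width), min(upper, center + half_width)


   # 5 splits x 3 repeats = 15 fold scores, so 14 degrees of freedom.
   fold_scores = [
       0.91, 0.93, 0.90, 0.92, 0.94,
       0.89, 0.92, 0.91, 0.93, 0.90,
       0.92, 0.93, 0.91, 0.90, 0.92,
   ]
   T_CRIT_95_DF14 = 2.145  # two-sided 95% t quantile, 14 degrees of freedom
   low, high = stability_interval(fold_scores, T_CRIT_95_DF14)

   # Scores near the ceiling: the upper bound is clipped to 1.0.
   T_CRIT_95_DF4 = 2.776  # two-sided 95% t quantile, 4 degrees of freedom
   near_ceiling = stability_interval([0.99, 1.0, 0.97, 1.0, 1.0], T_CRIT_95_DF4)

Because repeated folds reuse the same rows, the scores are not independent and
the interval is descriptive only, as noted in the list above.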

You can compute the evidence tables without writing a report:

.. code-block:: python

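   evidence = mamut.generate_evidence()
   mamut.baseline_comparison_  # selected model vs. candidates and baselines
   mamut.score_stability_      # repeated cross-validation score stability
   mamut.leakage_checks_       # basic leakage and imbalance checks
   mamut.selection_guidance_   # confirm / challenge / block guidance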

For a locked final holdout confirmation, avoid comparing alternate MAMUT
candidates on that holdout:

.. code-block:: python

   evidence = mamut.generate_evidence(
       dataset="holdout",
       include_candidate_comparison=False,
   )

For lightweight evaluation in scripts or CI, disable expensive or file-writing
outputs while keeping evidence generation enabled:

.. code-block:: python

   result = mamut.evaluate(
       include_shap=False,
       write_html=False,
       save_plots=False,
   )

The score stability check refits the selected estimator and baseline models
with fold-local preprocessing. It does not retune hyperparameters inside each
fold, so treat it as a stability diagnostic rather than a full nested
cross-validation benchmark. Use ``selection_strategy="nested_cv"`` when model
selection itself needs nested evaluation.

The evidence-guided selection table is intentionally conservative. If a
baseline beats the selected model on final holdout data, MAMUT challenges the
selection and flags it for review, but keeps the selected candidate as the
recommendation; it does not silently promote the holdout winner. Treat such a
challenge as a reason to rerun model selection or to reserve a new final
holdout before deployment.

For a reproducible example of these diagnostics on public sklearn datasets, see
:doc:`benchmark_evidence`.

Reproducibility
---------------

Pass ``random_state`` to control the train/validation/holdout split,
preprocessing components, resampling, and supported model initializers:

.. code-block:: python

   mamut = Mamut(random_state=42)

Fitted candidate models are kept in memory by default. Set
``save_models=True`` to write them under ``fitted_models/<timestamp>/``:

.. code-block:: python

   mamut = Mamut(save_models=True)

Because this directory is created relative to the current working directory,
run experiments from a known project or experiment folder.
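
One way to pin the working directory for a run is a small context manager
built from the standard library. The helper ``working_directory`` and the
folder name ``experiment_01`` are hypothetical; a temporary directory is used
here only so the sketch leaves no files behind.

.. code-block:: python

   import os
   import tempfile
   from contextlib import contextmanager
   from pathlib import Path


   @contextmanager
   def working_directory(path):
       """Temporarily switch the working directory, creating it if needed."""
       target = Path(path)
       target.mkdir(parents=True, exist_ok=True)
       previous = Path.cwd()
       os.chdir(target)
       try:
           yield target
       finally:
           # Always restore the original directory, even after an error.
           os.chdir(previous)


   with tempfile.TemporaryDirectory() as tmp:
       with working_directory(Path(tmp) / "experiment_01"):
           # Fit a Mamut(save_models=True) instance here so that
           # fitted_models/<timestamp>/ is created inside experiment_01.
           inside = Path.cwd()

In a real project, replace the temporary directory with a versioned experiment
folder so saved models can be traced back to the run that produced them.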

Limitations
-----------

MAMUT currently targets supervised classification only. It is designed for
tabular data and does not implement time-series validation, regression,
multilabel classification, text pipelines, image pipelines, or custom model
registries. It should be treated as a transparent baseline and reporting
assistant, not as a replacement for larger AutoML systems.