Configuration Overview
======================

A typical LogML configuration file has the following structure:

.. code-block:: yaml

    version:

    # How to split the dataset into strata.
    stratification:

    # Kinds of analyses:

    # EDA.
    eda:

    # Survival analysis: univariate and multivariate Cox.
    survival_analysis:

    # ML-based feature importance calculation.
    modeling:

    # Main section for DAG steps configuration:
    # greedy split, modeling steps, etc.
    analysis:

    # Report building configuration.
    report:

Check the detailed documentation on each section at :py:mod:`logml.configuration`.

Here is the configuration file from an existing LogML example:

.. literalinclude:: ../../../examples/wine/modeling.yaml
    :language: yaml
    :linenos:

The default config file usually has most sections disabled:

.. literalinclude:: ../../../examples/default_config.yaml
    :language: yaml
    :linenos:

Top-level entries in the configuration file are mapped to the :py:class:`~.GlobalConfig` object. Please refer to the individual configuration-related classes: all of their attributes map directly to config entries.

.. autopydantic_model:: logml.configuration.global_config.GlobalConfig
    :noindex:
    :show-inheritance: False
    :model-show-json: False
    :model-show-config-summary: False
    :model-show-validator-members: False
    :model-show-validator-summary: False
    :model-show-field-summary: False
    :members: False
    :model-hide-paramlist: True

Most configuration-related classes are located in :py:mod:`logml.configuration`.

Input metadata configuration
------------------------------

One of the most important pieces of information we want to convey to LogML is the structure of the incoming data. Based on it, we define `Analysis Problems` for LogML to consider.

This section is covered by the class :py:class:`~.DatasetMetadataSection`. Using the :py:attr:`~.DatasetMetadataSection.columns_metadata` attribute, we set the proper data type and 'categorical' flag for the columns of the dataset.
By setting :py:attr:`~.DatasetMetadataSection.modeling_specs` we provide global definitions of the target variables and the kind of modeling problem to apply to each. Sample configuration:

.. code-block:: yaml

    dataset_metadata:
      modeling_specs:
        # Configuration for problems relating covariates to Overall Survival.
        OS:
          time_column: time
          event_query: 'cens == 0'
          event_column: cens
        # Model the relation of covariates to the "treatment outcome" value,
        # which is categorical, hence use a classification approach.
        Outcome:
          task: classification
          target: Outcome
          target_metric: rocauc
      key_columns:
        # Key column: this is neither a feature nor a target, just an indicator.
        - subj_id
      # Specify some predefined metadata.
      columns_metadata:
        - name: gender
          data_type: str
          is_categorical: true
        - name: birthdate
          data_type: datetime64[ns]

Modeling
------------------------------

This section is covered by the class :py:class:`~.ModelingSection`, which is essentially a list of :py:class:`~.ModelingSetup` items.

There is a predefined modeling setup, which is turned on by setting :code:`preset` to the enabled state. The modeling preset performs the following:

- Uses the default data preprocessing configuration (see the `Preprocessing` section).
- Generates 5 shuffled datasets.
- Enables the model selection process (applies to all models with the matching objective: classification, regression, or survival).
- Enables feature importance with the 3 best models.

Preprocessing
------------------------------

If the preset is enabled in the dataset preprocessing section, a lightweight configuration section is applied.

.. autopydantic_model:: logml.configuration.modeling.DatasetPreprocessingPresetSection
    :noindex:
    :show-inheritance: False
    :model-show-json: False
    :model-show-config-summary: False
    :model-show-validator-members: False
    :model-show-validator-summary: False
    :model-show-field-summary: False
    :members: False
    :model-hide-paramlist: True

Example:

.. code-block:: yaml

    eda:
      enable: true
      preprocessing_problem_id: ''
      dataset_preprocessing:
        preset:
          features_list:
            - .*
          remove_correlated_features: true
          nans_fraction_threshold: 0.7
          apply_log1p_to_target: false
          drop_datetime_columns: true

Configuration utilities
------------------------------

In addition to the ability to launch pipelines, the `log_ml.py` interface provides several commands that make config maintenance easier, including schema validation and other useful utilities. For config-related utilities, see the :program:`log_ml config` command on the :ref:`Command Line Parameters` page.

Dataset Queries
------------------------------

In some places in the config, such as strata selection or the survival event query, we use a dynamic approach to querying data from the dataset. To do this, we specify a line of text, somewhat similar to SQL, but with Python specifics. For example:

.. code:: yaml

    stratification:
      - strata_id: A_arm
        query: 'arm == "A"'
      - strata_id: BC_arms
        query: 'arm.isin(["B", "C"])'

Here are some basic rules for creating a proper query (there are more for advanced use, but they are out of scope for this guide).

- Use the general form :code:`<column> <operator> <value>`. (It is possible to compare one column to another, but do it only when you clearly understand the data.)
- As this is a complex string, always surround it with quotes in the YAML config, as in the example above.
- Use the feature name without quotes, and string constants in double quotes.
- If the feature name contains whitespace or special characters, use backtick quoting.
- For an equality check, use the double equals sign "==".
- Use simple operators: "<", ">", "==", "<=", ">=".
- The special function "isin" checks that a value is present in a list of values: :code:`arm.isin(["B", "C"])`. Use square brackets to define the list of values.
- Be sure to check the column's value type. If column A holds strings and you write :code:`A == 1`, Python treats 1 as a number, and naturally, the number 1 will never be equal to any of the string values contained in column A.
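The rules above can be tried out with :code:`pandas.DataFrame.query`, whose syntax LogML's queries resemble (this comparison is an assumption for illustration only; the data below is invented):

.. code-block:: python

    import pandas as pd

    # A toy dataset resembling the stratification examples above.
    df = pd.DataFrame({
        "arm": ["A", "B", "C", "A"],
        "cens": ["0", "1", "0", "1"],  # note: strings, not numbers
    })

    # Simple equality check: string constants go in double quotes.
    a_arm = df.query('arm == "A"')        # 2 rows

    # "isin" checks membership in a list of values.
    bc_arms = df.query('arm.isin(["B", "C"])')  # 2 rows

    # Pitfall: comparing a string column to a number matches nothing.
    wrong = df.query('cens == 1')         # 0 rows: 1 is a number, "1" is a string
    right = df.query('cens == "1"')       # 2 rows: correct string comparison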
So such a query will always return no records whatsoever.

Random States
-----------------

There are many places where LogML uses random states:

- Fitting models like RandomForest.
- Cross-validation splits.
- Random features cutoff test.

The rule of thumb is as follows: when a random state is not set (anywhere it is used), it is initialized from the main LogML random generator for the run and fixed in the '_dag' config for the run.
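This pattern, where a single run-level generator hands out fixed seeds to every consumer that did not set its own random state, can be sketched with NumPy (a minimal illustration only; LogML's actual implementation may differ):

.. code-block:: python

    from numpy.random import SeedSequence, default_rng

    # One run-level seed sequence plays the role of the main LogML random
    # generator for the run.
    run_seed = SeedSequence(12345)

    # Each consumer (model fitting, CV splits, random-features cutoff test)
    # without an explicit random state gets a child seed derived from it.
    model_seed, cv_seed, cutoff_seed = run_seed.spawn(3)

    # The derived seeds are deterministic for a fixed run seed, so they can
    # be recorded (as LogML does in the '_dag' config) to reproduce the run.
    cv_rng = default_rng(cv_seed)
    fold_assignment = cv_rng.integers(0, 5, size=10)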