Data Preprocessing
==================================================

The data preprocessing step transforms raw input data so that it becomes suitable
for machine learning models. Basic data transformations usually include:

- removing or imputing missing values
- encoding non-numerical/categorical variables
- normalizing numeric data
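
As a generic illustration of these three kinds of transformation, a minimal sketch in plain Python could look
like the one below. It uses hypothetical toy data and does not use the LogML API; the variable names are
assumptions made for the example.

.. code-block:: python

    from statistics import mean, pstdev

    # Hypothetical toy columns: one numeric (with a gap), one categorical.
    ages = [30.0, None, 50.0]
    colors = ["red", "blue", "red"]

    # 1. Impute missing numeric values with the mean of the observed ones.
    observed = [a for a in ages if a is not None]
    imputed = [mean(observed) if a is None else a for a in ages]

    # 2. One-hot encode the categorical column (one 0/1 flag per category).
    categories = sorted(set(colors))
    one_hot = [[int(c == cat) for cat in categories] for c in colors]

    # 3. Standardize the numeric column to zero mean and unit variance.
    mu, sigma = mean(imputed), pstdev(imputed)
    standardized = [(a - mu) / sigma for a in imputed]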

The :py:class:`~.DatasetPreprocessingSection` section provides a way to define the data transformations of
interest (the full list of available data transformations
is available here: :lml:ref:`Data Transformers`).


Preprocessing preset
---------------------

By default, LogML provides a predefined sequence of
data preprocessing steps.

Their details can be configured in the
:py:class:`~.DatasetPreprocessingPresetSection` section:

.. literalinclude:: /_static/modeling/sample_preset_data_preprocessing.yaml
    :language: yaml
    :linenos:
    :caption: Preset option for dataset preprocessing


The preprocessing preset includes the following steps:

- Select only needed columns (``features_list``).

- Drop columns of date/time data type.

- Drop rows where ``target`` values are missing.

- Drop rows where the fraction of missing feature values is greater than or equal to the
  ``nans_per_row_fraction_threshold`` parameter (90% by default).

- Drop columns where the fraction of missing feature values is greater than or equal to the
  ``nans_fraction_threshold`` parameter (70% by default).


- Numeric features:

    - Apply standardization to numeric features.
    - Apply MICE imputer for numeric columns.

- Categorical features:

    - Impute categorical columns using the 'most frequent' strategy.
    - Apply one-hot encoding to categorical features.


- If the ``remove_correlated_features`` flag is True, correlation groups are detected and
  correlated features are removed. For details see :ref:`Correlation Groups Detection`.

- Target transformation:

    - If the target column is numeric and the ``apply_log1p_to_target`` field is set, the ``log1p`` transformation
      is applied to the target column.
    - If the target column is categorical, the ``label_encoding`` transformation is applied.
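
To make the "greater than or equal to" semantics of the two missing-value thresholds concrete, here is a minimal
sketch in plain Python. It is not the LogML implementation: the function name ``drop_sparse``, the use of ``None``
to mark a missing value, and filtering rows before columns (following the order of the list above) are
assumptions made for illustration.

.. code-block:: python

    def drop_sparse(rows, features, row_threshold=0.9, column_threshold=0.7):
        """Return (kept_rows, kept_features); None marks a missing value."""

        def missing_fraction(values):
            values = list(values)
            return sum(v is None for v in values) / len(values) if values else 0.0

        # A row is dropped when its missing fraction is >= row_threshold.
        kept_rows = [
            row for row in rows
            if missing_fraction(row[f] for f in features) < row_threshold
        ]
        # A column is dropped when its missing fraction is >= column_threshold
        # (computed on the remaining rows, which is an assumption).
        kept_features = [
            f for f in features
            if missing_fraction(row[f] for row in kept_rows) < column_threshold
        ]
        return kept_rows, kept_features


    rows = [
        {"a": 1.0, "b": None},
        {"a": None, "b": None},  # all feature values missing: row is dropped
        {"a": 2.0, "b": None},
        {"a": 3.0, "b": 4.0},
        {"a": 5.0, "b": None},
    ]
    kept_rows, kept_features = drop_sparse(rows, ["a", "b"])

In this toy dataset one row is removed, after which column ``b`` is missing in 3 of the 4 remaining rows (75%),
so it is removed as well.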


Explicit preprocessing
------------------------

Let's take a look at the following sample data preprocessing configuration for survival analysis modeling:

.. literalinclude:: /_static/modeling/sample_data_preprocessing.yaml
    :language: yaml
    :linenos:
    :caption: Sample dataset preprocessing - explicit


For the list of available transformers see :lml:ref:`Data Transformers`.


Correlation Groups Detection
-----------------------------

Motivation
^^^^^^^^^^^^^
Data analysis often runs into correlated columns within the dataset of interest.
This can cause the following problems:

- Some machine learning models have strict requirements regarding the presence of correlated features
  (for example, a linear regression model assumes there is little or no multicollinearity in the data).

- Multiple correlated features may slow down computations without necessarily adding value
  to the results: keeping only one feature out of a correlated set provides almost the same predictive power /
  importance and reduces the noise introduced by 'redundant' features.

The 'correlation groups' approach addresses these issues using the following logic:

- A definition of 'feature A is correlated with feature B' is set. It includes a correlation metric
  (Pearson, Spearman, etc.), a threshold of interest, and additional constraints if needed.

- With the 'correlated' function defined, we can create an undirected graph where nodes are features and edges
  exist only between correlated features.

- To get a subset of features from the initial dataset in which the correlation issue is resolved, we need
  to find a `maximal independent set <https://en.wikipedia.org/wiki/Maximal_independent_set>`_ within the correlation
  graph.
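
The graph construction described above can be sketched in plain Python as follows. This is an illustration
only, not the LogML code: the helper names ``pearson`` and ``correlation_graph``, the absolute-value
comparison, and the 0.9 threshold are assumptions, and the toy data is hypothetical.

.. code-block:: python

    from itertools import combinations
    from math import sqrt


    def pearson(x, y):
        """Pearson correlation of two equally long, non-constant sequences."""
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
        norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
        return cov / (norm_x * norm_y)


    def correlation_graph(data, threshold=0.9):
        """Map each feature to the set of features it is correlated with."""
        graph = {name: set() for name in data}
        for left, right in combinations(data, 2):
            # An undirected edge exists only between correlated features.
            if abs(pearson(data[left], data[right])) >= threshold:
                graph[left].add(right)
                graph[right].add(left)
        return graph


    data = {
        "height_cm": [150, 160, 170, 180, 190],
        "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
        "age": [41, 23, 35, 52, 30],
    }
    graph = correlation_graph(data)

Here only the two height columns end up connected, so an independent set may contain ``age`` and one of the
height columns, but not both height columns.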


Configuration
^^^^^^^^^^^^^^^^^^^^

Correlation groups parameters are set via the ``eda`` section of the configuration file. A sample is
provided below:

.. literalinclude:: ./../../_static/correlation_groups/params_sample.yaml
    :language: yaml
    :linenos:
    :caption: sample_config.yaml

Please see :py:class:`logml.configuration.eda.EDAArtifactsGenerationParameters` class for
the details.

Once all required correlation groups parameters are set, so that the correlation EDA artifact
will be generated, the ``remove_correlated_features`` transformer can be used in the data preprocessing steps of
interest (at the moment only within the 'modeling' and 'survival_analysis' sections):

.. literalinclude:: ./../../_static/correlation_groups/data_preprocessing_template.yaml
    :language: yaml
    :linenos:
    :caption: data_preprocessing_template.yaml


Please see :py:class:`logml.feature_extraction.transformers.filtering.CorrelatedColumnsFilteringTransformer` class
for the details.


Correlation groups review
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Information on what the resulting correlation groups look like can be found in the
"EDA / Continuous Features: Distribution and Correlation" report section.

.. image:: /_static/correlation_groups/eda_report_structure.png

In addition to a high-level summary of the provided parameters, the section contains an interactive table that
helps to explore the correlation groups:

- Search by correlation group name

.. image:: /_static/correlation_groups/search_by_cg_name.png

- Search by feature name

.. image:: /_static/correlation_groups/search_by_feature_name.png

So whenever there is a need to understand which features belong to a correlation group of interest
(found either in modeling results or in survival analysis results, if enabled), the EDA subsection described above
can be used.

Implementation details
^^^^^^^^^^^^^^^^^^^^^^^^^

Correlation groups creation
""""""""""""""""""""""""""""

EDA artifacts generation produces a :py:class:`logml.eda.artifacts.correlation.CorrelationSummary`
object that stores information about correlation groups:

- ``correlation_groups`` property contains a correlation graph that was created under all parameters/constraints
  discussed above (see :py:class:`logml.eda.artifacts.correlation.CorrelationGraph` for details)

- ``correlation_groups`` property contains a list of correlation groups that were defined
  (see :py:class:`logml.eda.artifacts.correlation.CorrelationGroup` for details)

The main goal of defining correlation groups is to facilitate the removal of correlated features
(via the corresponding data preprocessing step), so we want to find a (maximal) subset of features that contains
no correlated features. To do that, the following approach is proposed:

1) Features are processed in descending order of their node degree in the correlation graph.

2) Until all features/nodes are processed, the next feature is picked and, if it has adjacent unmarked nodes, a new
   correlation group is assigned to it. All adjacent unmarked nodes are assigned to the same correlation group as
   well.

Please see the example below that shows the described approach:

.. image:: /_static/correlation_groups/groups_creation_schema.png
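
The two steps above can also be sketched in plain Python. This is a simplified illustration, not the LogML
implementation: the function name ``build_correlation_groups`` and the adjacency-dict representation of the
graph are hypothetical, and the sketch assumes that ties in node degree are broken by feature name and that
features already assigned to a group are skipped.

.. code-block:: python

    def build_correlation_groups(graph):
        """Greedy grouping; `graph` maps a feature to its correlated features."""
        # Step 1: highest-degree features first (ties broken by name: assumption).
        order = sorted(graph, key=lambda feature: (-len(graph[feature]), feature))

        marked = set()
        groups = []
        # Step 2: each picked feature starts a group with its unmarked neighbours.
        for feature in order:
            if feature in marked:
                continue  # already belongs to an earlier group (assumption)
            neighbours = sorted(n for n in graph[feature] if n not in marked)
            if not neighbours:
                continue  # no adjacent unmarked nodes, so no group is created
            group = [feature] + neighbours
            marked.update(group)
            groups.append(group)
        return groups


    graph = {
        "a": {"b", "c"},
        "b": {"a"},
        "c": {"a"},
        "d": {"e"},
        "e": {"d"},
        "f": set(),
    }
    groups = build_correlation_groups(graph)

With this toy graph, ``a`` has the highest degree and forms a group with ``b`` and ``c``, then ``d`` forms a group
with ``e``, and the isolated feature ``f`` is not assigned to any group.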

Removing correlated features
"""""""""""""""""""""""""""""

As mentioned above, when correlated features should be removed within some data preprocessing step, the
``remove_correlated_features`` transformer does the job. It implements the following logic:

- The "target" column is not affected (for survival analysis, both the "event" and "time" columns are left untouched).

- All correlation groups are checked sequentially: if some features of a correlation group are present in the current
  dataset, only one of them is kept and the others are filtered out.

Please see the example below that shows the described approach:

.. image:: /_static/correlation_groups/transformer_demo.png
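
The filtering logic above can be sketched in plain Python as follows. It is an illustration rather than the
actual transformer: the function name ``filter_correlated_columns`` and its parameters are hypothetical, and
keeping the first listed feature of each group that is present in the dataset is an assumption, since the
choice of the kept feature is not specified here.

.. code-block:: python

    def filter_correlated_columns(columns, groups, protected=("target",)):
        """Keep one present feature per correlation group, never touching `protected`."""
        to_drop = set()
        for group in groups:
            # Features of this group that exist in the current dataset.
            present = [c for c in group if c in columns and c not in protected]
            # Keep the first one (assumption) and filter out the others.
            to_drop.update(present[1:])
        return [c for c in columns if c not in to_drop]


    columns = ["a", "b", "c", "d", "target"]
    groups = [["a", "b", "c"], ["d", "e"]]
    kept = filter_correlated_columns(columns, groups)

In this example ``b`` and ``c`` are filtered out, while ``d`` stays because it is the only feature of its group
that is present in the dataset.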