Data Preprocessing
==================================================

The data preprocessing step transforms raw input data so that it becomes suitable for machine learning models.
Basic data transformations usually include:

- removing or imputing missing values
- encoding non-numerical/categorical variables
- normalizing numeric data

The :py:class:`~.DatasetPreprocessingSection` section provides a way to define the data transformations of interest
(the full list of available data transformations is here: :lml:ref:`Data Transformers`).

Preprocessing preset
---------------------

By default, LogML provides a predefined sequence of data preprocessing steps. Their details can be configured
in the :py:class:`~.DatasetPreprocessingPresetSection` section:

.. literalinclude:: /_static/modeling/sample_preset_data_preprocessing.yaml
    :language: yaml
    :linenos:
    :caption: Preset option for dataset preprocessing

The preprocessing preset includes the following steps:

- Select only the needed columns (``features_list``).
- Drop columns of date/time data type.
- Drop rows where ``target`` values are missing.
- Drop rows where the fraction of missing feature values is greater than or equal to the
  ``nans_per_row_fraction_threshold`` parameter (by default 90%).
- Drop columns where the fraction of missing feature values is greater than or equal to the
  ``nans_fraction_threshold`` parameter (by default 70%).
- Numeric features:

  - Apply standardization to numeric features.
  - Apply the MICE imputer to numeric columns.

- Categorical features:

  - Impute categorical columns using the 'most frequent' strategy.
  - Apply one-hot encoding to categorical features.

- If the ``remove_correlated_features`` flag is True, correlation groups detection and removal of correlated
  features are applied. For details see :ref:`Correlation Groups Detection`.
- Target transformation:

  - If the target column is numeric and the ``apply_log1p_to_target`` field is set, the ``log1p``
    transformation is applied to the target column.
  - If the target column is categorical, the ``label_encoding`` transformation is applied.

Explicit preprocessing
------------------------

Let's take a look at the following sample data preprocessing configuration for survival analysis modeling:

.. literalinclude:: /_static/modeling/sample_data_preprocessing.yaml
    :language: yaml
    :linenos:
    :caption: Sample dataset preprocessing - explicit

For the list of available transformers see :lml:ref:`Data Transformers`.

Correlation Groups Detection
-----------------------------

Motivation
^^^^^^^^^^^^^

During data analysis we often face correlated columns within the dataset of interest.
This can cause the following problems:

- Some machine learning models have strict requirements regarding the presence of correlated features
  (for example, a linear regression model assumes there is little or no multicollinearity in the data).
- Having multiple correlated features may slow down the required computations without necessarily adding value
  to the results: we could keep only one feature that provides almost the same predictive power / importance
  and reduce the noise introduced by 'redundant' features.

The 'correlation groups' approach addresses these issues using the following logic:

- A definition of 'feature A is correlated with feature B' is set. It includes a correlation metric
  (Pearson, Spearman, etc.), a threshold of interest and additional constraints if needed.
- With the 'correlated' function defined, we can create an undirected graph where nodes are features
  and edges exist only between correlated features.
- To get a subset of features from the initial dataset in which the correlation issue is resolved,
  we need to find a maximal independent set within the correlation graph.

Configuration
^^^^^^^^^^^^^^^^^^^^

Correlation groups parameters are set via the ``eda`` section of the configuration file. A sample is provided below:

.. literalinclude:: ./../../_static/correlation_groups/params_sample.yaml
    :language: yaml
    :linenos:
    :caption: sample_config.yaml

Please see the :py:class:`logml.configuration.eda.EDAArtifactsGenerationParameters` class for the details.

Once all required correlation groups parameters are set (so that the correlation EDA artifact is expected to be
generated), the ``remove_correlated_features`` transformer can be used in the data preprocessing steps of interest
(at the moment only within the 'modeling' and 'survival_analysis' sections):

.. literalinclude:: ./../../_static/correlation_groups/data_preprocessing_template.yaml
    :language: yaml
    :linenos:
    :caption: data_preprocessing_template.yaml

Please see the :py:class:`logml.feature_extraction.transformers.filtering.CorrelatedColumnsFilteringTransformer`
class for the details.

Correlation groups review
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Information on what the resulting correlation groups look like can be found in the
"EDA / Continuous Features: Distribution and Correlation" report section.

.. image:: /_static/correlation_groups/eda_report_structure.png

In addition to the high-level summary of the provided parameters, the section contains an interactive table
that can help to understand the correlation groups:

- Search by correlation group name

  .. image:: /_static/correlation_groups/search_by_cg_name.png

- Search by feature name

  .. image:: /_static/correlation_groups/search_by_feature_name.png

So if you need to understand which features belong to a correlation group of interest (found either in modeling
results or in survival analysis results, if enabled), the EDA subsection described above can be used.


Implementation details
^^^^^^^^^^^^^^^^^^^^^^^^^

Correlation groups creation
""""""""""""""""""""""""""""

As part of EDA artifacts generation, a :py:class:`logml.eda.artifacts.correlation.CorrelationSummary` object
is produced that stores information about correlation groups:

- ``correlation_groups`` property contains a correlation graph that was created under all parameters/constraints
  discussed above (see :py:class:`logml.eda.artifacts.correlation.CorrelationGraph` for details)
- ``correlation_groups`` property contains a list of correlation groups that were defined
  (see :py:class:`logml.eda.artifacts.correlation.CorrelationGroup` for details)

The main goal of defining correlation groups is to facilitate the removal of correlated features
(via the corresponding data preprocessing step), so we want to find a (maximal) subset of features
in which no two features are correlated. To do that, the following approach is proposed:

1) Features are processed in the order defined by the corresponding node degrees in the correlation graph
   (descending).
2) Until all features/nodes are processed, the next feature is picked (provided it has adjacent unmarked nodes)
   and a new correlation group is assigned to it. All adjacent unmarked nodes are assigned to the same
   correlation group as well.

Please see the example below that shows the described approach:

.. image:: /_static/correlation_groups/groups_creation_schema.png

Removing correlated features
"""""""""""""""""""""""""""""

As mentioned above, when correlated features should be removed within a data preprocessing step,
the ``remove_correlated_features`` transformer does the job. It implements the following logic:

- the "target" column is not affected (for survival analysis, both the "event" and "time" columns)
- all correlation groups are checked sequentially: if some features of a correlation group are present
  in the current dataset, only one of them is kept and the others are filtered out

Please see the example below that shows the described approach:

.. image:: /_static/correlation_groups/transformer_demo.png
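
The two steps described in this section (group creation and feature removal) can be illustrated with a small,
self-contained sketch. It is not the LogML implementation: the helper names (``pearson``,
``build_correlation_graph``, ``assign_correlation_groups``, ``remove_correlated``), the ``0.9`` threshold and
the toy data are hypothetical. The sketch also makes three assumptions that the text leaves open: only an
unmarked feature can start a new group, ties in node degree are broken by feature name, and the feature kept
from each group is the first one listed in it.

.. code-block:: python

    from itertools import combinations
    from math import sqrt


    def pearson(x, y):
        """Plain Pearson correlation coefficient of two equal-length sequences."""
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        var_x = sum((a - mean_x) ** 2 for a in x)
        var_y = sum((b - mean_y) ** 2 for b in y)
        return cov / sqrt(var_x * var_y)


    def build_correlation_graph(columns, threshold=0.9):
        """Undirected graph: an edge joins two features when |r| >= threshold."""
        graph = {name: set() for name in columns}
        for a, b in combinations(columns, 2):
            if abs(pearson(columns[a], columns[b])) >= threshold:
                graph[a].add(b)
                graph[b].add(a)
        return graph


    def assign_correlation_groups(graph):
        """Visit nodes by descending degree; group each with its unmarked neighbours."""
        groups, marked = [], set()
        for node in sorted(graph, key=lambda n: (-len(graph[n]), n)):
            unmarked = sorted(graph[node] - marked)
            # Assumption: skip nodes that are already marked or have nothing to group.
            if node in marked or not unmarked:
                continue
            groups.append([node] + unmarked)
            marked.update(groups[-1])
        return groups


    def remove_correlated(feature_names, groups, protected=("target",)):
        """Keep the first present feature of every group; never drop protected columns."""
        to_drop = set()
        for group in groups:
            present = [f for f in group if f in feature_names and f not in protected]
            to_drop.update(present[1:])
        return [f for f in feature_names if f not in to_drop]


    # Toy data (hypothetical): a, b, d are mutually correlated, c and e are correlated,
    # f is not correlated with anything at the chosen threshold.
    features = {
        "a": [1, 2, 3, 4, 5],
        "b": [2, 4, 6, 8, 10],
        "c": [2, 9, 4, 7, 3],
        "d": [5, 4, 3, 2, 1],
        "e": [4, 18, 8, 14, 6],
        "f": [1, 0, 0, 0, 1],
    }

    graph = build_correlation_graph(features, threshold=0.9)
    groups = assign_correlation_groups(graph)
    kept = remove_correlated(list(features) + ["target"], groups)

    print(groups)  # [['a', 'b', 'd'], ['c', 'e']]
    print(kept)    # ['a', 'c', 'f', 'target']

In this toy case no two of the kept features are joined by an edge in the correlation graph, and the isolated
feature ``f`` is never assigned to a group, so it passes through the filter untouched.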