Data Preprocessing

The data preprocessing step transforms the input (raw) data so that it becomes suitable for machine learning models. Typical data transformations are:

  • removing or imputing missing values

  • encoding non-numerical/categorical variables

  • normalizing numeric data

The DatasetPreprocessingSection section provides a way to define the data transformations of interest (the full list of available data transformations can be found in Data Transformers).

Preprocessing preset

By default, LogML provides a predefined sequence of data preprocessing steps.

Its details can be configured via the DatasetPreprocessingPresetSection section:

Preset option for dataset preprocessing
...

dataset_preprocessing:
    preset:
      enable:                True
      features_list:
        - .*
      remove_correlated_features: true
      nans_per_row_fraction_threshold: 0.9
      nans_fraction_threshold: 0.7
      apply_log1p_to_target: false
      drop_datetime_columns: true

...

The preprocessing preset includes the following steps (an approximate code sketch of the flow is given after the list):

  • Select only needed columns (features_list).

  • Drop columns of date/time data type.

  • Drop rows where target values are missing.

  • Drop rows where the fraction of missing feature values is greater than or equal to the nans_per_row_fraction_threshold parameter (90% by default).

  • Drop columns where the fraction of missing feature values is greater than or equal to the nans_fraction_threshold parameter (70% by default).

  • Numeric features:

    • Apply standardization to numeric features.

    • Apply the MICE imputer to numeric columns.

  • Categorical features:

    • Impute categorical columns using the ‘most frequent’ strategy.

    • Apply one-hot encoding to categorical features.

  • If the remove_correlated_features flag is True, correlation group detection is performed and correlated features are removed. For details see Correlation Groups Detection.

  • Target transformation:

    • If the target column is numeric and the apply_log1p_to_target field is set, the log1p transformation is applied to the target column.

    • If the target column is categorical, the label_encoding transformation is applied.
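
The exact behaviour is implemented by LogML transformers, but the overall flow can be approximated with the pandas/scikit-learn sketch below. The apply_preset function and its signature are illustrative assumptions only, not part of the LogML API; selection via features_list and correlated-feature removal are omitted.

# Illustrative approximation of the preprocessing preset (not the LogML implementation).
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.preprocessing import StandardScaler


def apply_preset(df: pd.DataFrame, target: str,
                 nans_per_row_fraction_threshold: float = 0.9,
                 nans_fraction_threshold: float = 0.7,
                 apply_log1p_to_target: bool = False) -> pd.DataFrame:
    # Drop columns of date/time data type.
    df = df.drop(columns=df.select_dtypes(include=["datetime", "datetimetz"]).columns)

    # Drop rows where the target value is missing.
    df = df.dropna(subset=[target])

    # Drop rows whose fraction of missing feature values is >= the threshold.
    row_nan_fraction = df.drop(columns=[target]).isna().mean(axis=1)
    df = df.loc[row_nan_fraction < nans_per_row_fraction_threshold]

    # Drop columns whose fraction of missing values is >= the threshold.
    col_nan_fraction = df.drop(columns=[target]).isna().mean()
    df = df.drop(columns=col_nan_fraction[col_nan_fraction >= nans_fraction_threshold].index)

    # Numeric features: standardization, then MICE-like iterative imputation.
    numeric_cols = df.drop(columns=[target]).select_dtypes(include="number").columns
    if len(numeric_cols):
        df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])
        df[numeric_cols] = IterativeImputer().fit_transform(df[numeric_cols])

    # Categorical features: 'most frequent' imputation, then one-hot encoding.
    cat_cols = df.drop(columns=[target]).select_dtypes(exclude="number").columns
    if len(cat_cols):
        df[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[cat_cols])
        df = pd.get_dummies(df, columns=list(cat_cols))

    # Target transformation (a categorical target would be label-encoded instead).
    if apply_log1p_to_target and pd.api.types.is_numeric_dtype(df[target]):
        df[target] = np.log1p(df[target])
    return df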

Explicit preprocessing

Let’s take a look at the following sample data preprocessing configuration for survival analysis modeling:

Sample dataset preprocessing - explicit
...

dataset_preprocessing:
    enable:                     True
    # Target list of transformations.
    steps:
        # Each step consists of the following parameters:
        #  - 'transformer' - alias for transformation
        #  - 'params' - optional, defines additional parameters for transformation

        # keeps only columns of interest
        - transformer:            select_columns
          params:
              columns_to_include:
                  - .*_DNA
                  - .*_RNA
                  - .*_clinical
                  - OS
                  - OS_censor

        # drops rows for which OS and OS_censor are undefined - can't do modeling without targets
        - transformer:            drop_nan_rows
          params:
              columns_to_include:
                  - OS
                  - OS_censor

        # drops all columns with %NaNs > 50, except OS and OS_censor
        - transformer:            drop_nan_columns
          params:
              threshold:          0.5
              columns_to_include: ['.*']
              columns_to_exclude:
                  - OS
                  - OS_censor

        # goes over correlation groups and keeps only one column per group, so that there are no
        # correlated columns within the result dataset
        - transformer:            remove_correlated_features

        # applies 'standard' normalization to all numerical columns (except OS and OS_censor)
        - transformer:            normalize_numericals
          params:
              normalization:      standard
              columns_to_exclude:
                  - OS
                  - OS_censor

        # for FMI/vardict data we want to binarize values by keeping only MUT and WT values
        - transformer:            replace_value
          params:
              columns_to_include:
                  - (.*)_DNA
              mapping:
                  AMP:             MUT
                  WT:              WT
                  VUS:             WT
                  SNP:             MUT
                  DEL:             MUT
                  REARG:           MUT
                  HETLOSS:         MUT

        # Remove columns for which #MUT / #(MUT | WT) < 0.05
        - transformer:            prevalence_filtering
          params:
              columns_to_include:
                  - (.*)_DNA
              threshold:          0.05
              # Values that will be used in numerator
              values:
                  - MUT

        # applies 'one-hot' encoding to all categorical columns (except OS and OS_censor)
        - transformer:            encode_categoricals
          params:
              encoding:           one_hot
              columns_to_exclude:
                  - OS
                  - OS_censor

        # Apply iterative imputation to replace NaN values
        - transformer:            mice
          params:
              columns_to_include:
                  - .*

...

For the list of available transformers see Data Transformers.
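
To give a feel for what an individual step does, the prevalence filter above can be approximated in pandas as follows. This is a simplified stand-in for illustration only, not the prevalence_filtering transformer itself, and regex-based column selection is omitted.

import pandas as pd


def prevalence_filter(df: pd.DataFrame, columns: list,
                      values: list, threshold: float) -> pd.DataFrame:
    """Drop the given columns where the fraction of (non-missing) rows that
    carry one of `values` is below `threshold`."""
    to_drop = []
    for col in columns:
        non_missing = df[col].dropna()
        if len(non_missing) == 0:
            continue
        if non_missing.isin(values).mean() < threshold:
            to_drop.append(col)
    return df.drop(columns=to_drop)


# Example: after binarization to MUT/WT, #MUT / #(MUT | WT) must be >= 0.05.
df = pd.DataFrame({"TP53_DNA": ["MUT", "WT", "WT", "WT"],
                   "KRAS_DNA": ["WT", "WT", "WT", "WT"]})
print(list(prevalence_filter(df, ["TP53_DNA", "KRAS_DNA"], ["MUT"], 0.05).columns))
# ['TP53_DNA']  (KRAS_DNA has no MUT values and is dropped)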

Correlation Groups Detection

Motivation

While conducting data analysis, we often face the issue of correlated columns within the dataset of interest. This can cause the following problems:

  • Some machine learning models have strict requirements regarding the presence of correlated features (e.g., a linear regression model assumes there is little or no multicollinearity in the data).

  • Having multiple correlated features may slow down the required computations without necessarily adding value to the results: we could keep only one feature per group, preserving almost the same predictive power / importance while reducing the noise introduced by ‘redundant’ features.

The ‘correlation groups’ approach addresses these issues using the following logic (a minimal code sketch follows the list):

  • A definition of ‘feature A is correlated with feature B’ is set. It includes a correlation metric (Pearson, Spearman, etc.), a threshold of interest, and additional constraints if needed.

  • With the ‘correlated’ relation defined, we can create an undirected graph where nodes are features and edges exist only between correlated features.

  • In order to get a subset of features from the initial dataset such that the correlation issue is resolved, we need to find a maximal independent set within the correlation graph.
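
As a minimal sketch of this idea (using pandas and networkx rather than LogML's own implementation; the function name is an illustrative assumption), one could build the correlation graph and pick a maximal independent set like this:

import networkx as nx
import pandas as pd


def uncorrelated_feature_subset(df: pd.DataFrame, threshold: float = 0.8,
                                method: str = "spearman") -> list:
    # Pairwise absolute correlations between numeric features.
    corr = df.select_dtypes(include="number").corr(method=method).abs()

    # Undirected graph: nodes are features, edges connect correlated pairs.
    graph = nx.Graph()
    graph.add_nodes_from(corr.columns)
    for i, a in enumerate(corr.columns):
        for b in corr.columns[i + 1:]:
            if corr.loc[a, b] >= threshold:
                graph.add_edge(a, b)

    # A maximal independent set contains no pair of correlated features.
    return nx.maximal_independent_set(graph, seed=0)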

Configuration

Correlation groups parameters are set via the eda section of the configuration file. A sample is provided below:

sample_config.yaml
...
eda:
    params:
        correlation_type:                 spearman
        correlation_threshold:            0.8
        correlation_min_samples_fraction: 0.2
        correlation_key_names:
            - TP53
            - KRAS
            - CDKN2A
            - CDKN2B
            - PIK3CA
            - ATM
            - BRCA1
            - SOX2
            - GNAS2
            - TERC
            - STK11
            - PDCD1
            - LAG3
            - TIGIT
            - HAVCR2
            - EOMES
            - MTAP
...

Please see the logml.configuration.eda.EDAArtifactsGenerationParameters class for details.

Once all required correlation groups parameters are properly set (so that the correlation EDA artifact will be generated), the remove_correlated_features transformer can be used in the data preprocessing steps of interest (at the moment only within the ‘modeling’ and ‘survival_analysis’ sections):

data_preprocessing_template.yaml
...
modeling:
    problems:
        y_regression:
            dataset_preprocessing:
                steps:
                  ...
                  - transformer: remove_correlated_features
                  ...
            ...
...

Please see the logml.feature_extraction.transformers.filtering.CorrelatedColumnsFilteringTransformer class for details.

Correlation groups review

Information on what the resulting correlation groups look like can be found in the “EDA / Continuous Features: Distribution and Correlation” report section.

../../_images/eda_report_structure1.png

In addition to a high-level summary of the provided parameters, the section contains an interactive table that helps to understand the correlation groups:

  • Search by correlation group name

../../_images/search_by_cg_name.png
  • Search by feature name

../../_images/search_by_feature_name.png

So if there is a need to understand which features belong to some correlation group of interest (found either in modeling results or in survival analysis, if it was enabled), the EDA subsection described above can be used.

Implementation details

Correlation groups creation

As part of EDA artifacts generation, a logml.eda.artifacts.correlation.CorrelationSummary object is produced that stores information about correlation groups.

The main goal of the correlation groups definition is to facilitate removing correlated features (via the corresponding data preprocessing step): we want to find a (maximal) subset of features such that no two features within the subset are correlated. In order to do that, the following approach is proposed:

  1. Features are processed in the order defined by their node degrees in the correlation graph (descending).

  2. Until all features/nodes are processed, the next feature is picked (provided it has adjacent unmarked nodes) and a new correlation group is assigned to it. All adjacent unmarked nodes are assigned to the same correlation group as well.
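
A minimal sketch of this grouping procedure, assuming the correlation graph is given as a plain adjacency dictionary (the real logic lives in the LogML EDA code and may differ in details):

def assign_correlation_groups(adjacency: dict) -> dict:
    """Greedy grouping: visit nodes by descending degree; each unassigned node
    that still has unassigned neighbours starts a new group and pulls them in."""
    groups = {}
    group_id = 0
    for node in sorted(adjacency, key=lambda n: len(adjacency[n]), reverse=True):
        if node in groups:
            continue
        unassigned_neighbours = [n for n in adjacency[node] if n not in groups]
        if not unassigned_neighbours:
            continue  # no correlated, still-unassigned partners -> no new group
        groups[node] = group_id
        for neighbour in unassigned_neighbours:
            groups[neighbour] = group_id
        group_id += 1
    return groups


# Example: A-B and B-C are correlated, D is not correlated with anything.
adjacency = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}, "D": set()}
print(assign_correlation_groups(adjacency))
# A, B and C end up in group 0; D stays ungrouped.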

The figure below illustrates the described approach:

../../_images/groups_creation_schema.png

Removing correlated features

As mentioned above, when correlated features should be removed within some data preprocessing, the remove_correlated_features transformer does the job. It implements the following logic:

  • the “target” column is not affected (for survival analysis, both the “event” and “time” columns)

  • all correlation groups are checked sequentially: if some features of a correlation group are present in the current dataset, only one of them is kept and the others are filtered out
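
A minimal sketch of this filtering logic (for illustration only; it is not the CorrelatedColumnsFilteringTransformer itself, and the group mapping is assumed to come from the EDA artifact):

import pandas as pd


def drop_correlated_features(df: pd.DataFrame, groups: dict,
                             protected: set) -> pd.DataFrame:
    """Keep at most one feature per correlation group; never touch protected
    columns (the target, or both the 'event' and 'time' columns for survival)."""
    seen_groups = set()
    to_drop = []
    for col in df.columns:
        if col in protected or col not in groups:
            continue
        if groups[col] in seen_groups:
            to_drop.append(col)  # another member of this group is already kept
        else:
            seen_groups.add(groups[col])
    return df.drop(columns=to_drop)


# Example: FEATURE_A and FEATURE_B share a correlation group; OS is the target.
df = pd.DataFrame({"FEATURE_A": [1, 2], "FEATURE_B": [2, 4], "OS": [10, 20]})
print(list(drop_correlated_features(df, {"FEATURE_A": 0, "FEATURE_B": 0}, {"OS"}).columns))
# ['FEATURE_A', 'OS']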

The figure below illustrates the described approach:

../../_images/transformer_demo.png