Feature Importance Analysis

The goal of this kind of analysis is to reduce the feature list to a minimal set that maximizes model quality in comparison with a baseline model. In other words, it drops as many noisy and uninformative features as possible, leaving only the “core” of the most important features.

(Historical note: this method was created for analysis in the bioinformatics area, which has its specifics. The most notable trait of such data is its shape. Due to the nature of disease or treatment-effectiveness research, there are not many samples, usually a few hundred. On the other hand, the amount of facts known about each sample may be overwhelming: physical properties, results of blood and tissue analyses, and, if we also count DNA and RNA features, it may easily exceed 50K features.

Naturally, in these circumstances being able to remove data noise is an advantage.)

Before we start, let's agree on the terminology used below.

Model

is a simple estimator in the sklearn sense, i.e. an instance of a class like sklearn.ensemble.RandomForestRegressor.

CV

is cross-validation, i.e. the process of splitting the dataset into non-overlapping train and test subsets. By default we use 100 CV iterations via the sklearn RepeatedKFold method.
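For illustration, 100 CV iterations can be produced with sklearn's RepeatedKFold; the 5 x 20 split below is only an assumption for this sketch, the actual defaults come from LogML's cross-validation configuration.

# Sketch: producing 100 train/test splits with RepeatedKFold.
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.random.rand(200, 10)  # toy dataset: 200 samples, 10 features

# 5 folds repeated 20 times -> 100 CV iterations in total
cv = RepeatedKFold(n_splits=5, n_repeats=20, random_state=42)
splits = list(cv.split(X))
print(len(splits))  # 100
train_idx, test_idx = splits[0]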

Feature Importance

is provided by some models natively, e.g. coefficients for a linear model or gain for tree-based models.
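For example, in plain sklearn these native importances are exposed as follows (an illustrative sketch on toy data, not LogML code):

# Sketch: native feature importances of sklearn models.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

X = np.random.rand(200, 5)
y = 3.0 * X[:, 0] + np.random.rand(200)

# Linear model: importance is derived from coefficient magnitudes.
linear = LinearRegression().fit(X, y)
print(np.abs(linear.coef_))

# Tree-based model: impurity/gain-based importances.
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
print(forest.feature_importances_)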

LogmlModel

is a wrapper object which trains many CV models. In essence it operates like some sklearn CV models: each LogML model holds a number of “simple” models, each trained and validated on one of the CV folds. Instead of a point estimate of model loss or feature importance, it uses a set of values (one per CV fold) and treats it as a distribution.
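The following sketch illustrates the idea only (it is not the actual LogML implementation, and the model, metric, and CV settings are assumptions): one “simple” model is trained per CV fold, and the per-fold losses and importances are kept as distributions.

# Sketch: per-fold models yield distributions of losses and feature importances.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import RepeatedKFold

X = np.random.rand(200, 5)
y = 3.0 * X[:, 0] + np.random.rand(200)

cv = RepeatedKFold(n_splits=5, n_repeats=20, random_state=42)
losses, importances = [], []
for train_idx, test_idx in cv.split(X):
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    losses.append(mean_absolute_error(y[test_idx], model.predict(X[test_idx])))
    importances.append(model.feature_importances_)

losses = np.array(losses)             # 100 values -> loss distribution
importances = np.vstack(importances)  # shape (100, n_features) -> FI distribution per feature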

Features Extraction Overview

High level steps:

  • Problem definition:
    • Specify each column of interest as a target. LogML will look for features which explain the target from an ML perspective.

  • Optional: data preprocessing.
    • Transform raw data to be suitable for ML models.

  • Model selection:

    • Train a baseline model (by default in LogML it is a sklearn dummy model).

    • Train and optimize all LogmlModels suitable for the problem (regression/classification/survival). This includes some sklearn and LightGBM models.

    • Pick those LogmlModels which are statistically better than the dummy model. This is done by comparing loss distributions with a u-test.

  • Feature Importance Extraction:

    • For each model, perform the feature selection procedure:

      • Filter features by comparing their Feature Importance Distribution.

      • Select those which maximize model quality.

      • Perform the random features cutoff procedure to make sure noisy features are dropped.

    • Aggregate results.

Problem definition

Problem definition consists of the following parts:

  • Modeling problem type definition (regression, classification, or survival).

  • Metric - it will be used for tuning/selecting models.

    See Modeling Metrics Registry for details.

  • Target name - the target column name should be specified. Note that for survival problems, time and event columns are specified instead.

Let’s take a look at the following example of how to set up dataset metadata and define the modeling problems of interest:

Sample metadata definition
...

dataset_metadata:
    modeling_specs:
        # Alias that will be referenced later
        Outcome_clf_problem:
            # We want to predict "Outcome" - classification problem.
            task:          classification  # Explicitly define the task
            target_column: Outcome  # Target column name
            target_metric: rocauc  # Metric / loss of interest

        age_reg_problem:
            # We want to predict "age" - regression problem.
            task:          regression  # Explicitly define the task
            target_column: age  # Target column name
            target_metric: mape  # Metric / loss of interest

        OS:
            # NOTE: here we don't need to explicitly specify 'survival' task
            time_column:    os  # Target time-to-event column
            event_column:   os_cens  # Target event (censoring) column
            # Interpretation of event column's values is study-specific,
            # so explicit definition of 'uncensored' is required
            event_query:    'os_cens == 0'
            target_metric:  cindex

...

The dataset_metadata section is defined by the DatasetMetadataSection class.

Data preprocessing step

The first step of the pipeline is (as usual) preprocessing.

See Data Preprocessing

Model selection

Model selection is the process of evaluating models against a baseline model and picking those which outperform it.

As baseline models, LogML by default uses the sklearn DummyRegressor and DummyClassifier.

Details of the available LogML models can be found here: Model Types.
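As mentioned in the overview, candidates are compared against the baseline by their loss distributions using a u-test. The sketch below shows the idea with plain sklearn and scipy; the model, metric, threshold, and CV settings are assumptions for illustration, not LogML's actual code.

# Sketch: is the candidate's per-fold loss statistically lower than the dummy baseline's?
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RepeatedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

def cv_losses(model, X, y):
    """Per-fold losses (1 - ROC AUC, lower is better)."""
    cv = RepeatedKFold(n_splits=5, n_repeats=20, random_state=42)
    losses = []
    for train_idx, test_idx in cv.split(X):
        model.fit(X[train_idx], y[train_idx])
        proba = model.predict_proba(X[test_idx])[:, 1]
        losses.append(1.0 - roc_auc_score(y[test_idx], proba))
    return np.array(losses)

candidate_losses = cv_losses(RandomForestClassifier(n_estimators=100, random_state=0), X, y)
baseline_losses = cv_losses(DummyClassifier(strategy="prior"), X, y)

# One-sided u-test: keep the candidate only if its losses are statistically lower.
_, p_value = mannwhitneyu(candidate_losses, baseline_losses, alternative="less")
keep_candidate = p_value < 0.05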

Configuration class - ModelSearchSection.

Example of model selection configuration
...

# ModelingSection
modeling:
    ...

    problems:
        Outcome_clf_problem:
            ...

            # ModelSearchSection
            model_search:
                enable: True
                # Candidate models
                models:
                    - RandomForestClassifierModel
                    - LogisticRegressionModel

                baseline_model: DummyClassifierModel

        age_reg_problem:
            ...

            model_search:
                enable: True
                # DummyRegressorModel will be used as baseline.
                # All regression models will be checked.

    ...

...

As part of the report LogML provides the following details regarding Model Selection results:

  • performance metrics for candidate and baseline models

  • ROC/PR curves for each candidate model (classification problems)

  • confusion matrices for each candidate model (classification problems)

The reporting part can give the user meaningful insight into which classes of machine learning models work better than others for a particular problem.

Feature Importance Extraction

Configuration class - FeatureImportanceSection.

Example of feature importance configuration
...

# ModelingSection
modeling:
    ...

    problems:
        Outcome_clf_problem:
            ...

            # Here we want to tune and check performance for all available models
            model_search:
                enable:         True

            # FeatureImportanceSection
            feature_importance:

                # Uses its own cross-validation configuration to run more iterations.
                cross_validation:
                  random_state: 42
                  split_type: kfold
                  n_folds: 100
                  test_size: 0.25

                perform_tier1_greedy: false
                fid_pvalue_threshold: 0.05
                n_random_iters: 5
                random_state: 42
                default_extractor: null
                default_n_perm_imp_iters: 10
                extractors: {}

                    # When empty - default extractors are applied to each model
                    # To configure a specific one:

                    # ModelName:
                    #    specific_extractorParams

Note that this section has its own cross-validation settings.

Method Details

As input we take a LogmlModel and the original dataset on which the model has been optimized.

  • Step 1: “All features” model.

    • Train the LogmlModel with the CV settings for feature importance extraction. This produces a set of FI values per feature. For example, with the default 100 cross-validation iterations, we get 100 FI values per feature.

    • Rank features according to the median FI value. (FeatureRanks data)

  • Step 2: Detect Tier1 features.

    • Define FeaturesSet = all dataset features

    • Define SelectedFeatures = empty

    • Define Losses = empty

    • Perform the following cycle:

      • (One-vs-all FI distribution comparison) For each feature in FeaturesSet:
        • Call it “Current feature” and its FI values are “Current feature FI set”.

        • Combine FI values for all features, except current. Call it “remaining FI set”.

        • Perform a u-test with the “greater” alternative hypothesis to calculate the p-value for the test that the “current feature FI set” is statistically greater than the “remaining FI set”.

      • As a result we have a set of p-values, one per feature in FeaturesSet. Apply FDR correction to this set (see the sketch at the end of this section).

      • Remove features whose p-values exceed the p-value threshold (0.05 by default).

      • If the remaining list is empty, break the cycle.

      • Pick feature with the minimal p-value:

        • Add it to SelectedFeatures.

        • Remove it from FeaturesSet.

      • Evaluate the model with SelectedFeatures and append its loss to Losses.

    • (This step happens after the cycle breaks.) Pick the feature set which corresponds to the minimal value in Losses: SelectedFeatures = the features with the minimal loss.

    • If the perform_tier1_greedy parameter is set, perform greedy feature selection:
      • Iteratively pick the feature from SelectedFeatures which minimizes the model loss and add it to the stable set.

    • Generate Tier1 Model:
      • Train and evaluate the model with SelectedFeatures.

      • Drop features whose FI value is zero.

  • Step 3: Random features cutoff.

    • Generate several subsets of the data with added random variables.
      • By default, generate 20% of random features for the “All features” model, and 100% for the “Tier1” model.

    • Train the model on each of the “noisy” datasets.

    • Collect feature ranks data and count, for each rank, how many times a real feature vs a random feature was placed at that rank.

    • Normalize the counts.

    • Calculate the random feature cutoff threshold as the rank at which random features take more than 50% (see the sketch at the end of this section).
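The two sketches below illustrate parts of the method described above. They are simplified assumptions written with scipy and statsmodels, not LogML's actual implementation; all names and thresholds are placeholders for illustration.

Sketch of the Step 2 one-vs-all FI comparison with FDR correction:

import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

def one_vs_all_pvalues(fi_values):
    """fi_values has shape (n_cv_iterations, n_features): one column of FI values per feature."""
    n_features = fi_values.shape[1]
    pvalues = np.empty(n_features)
    for i in range(n_features):
        current = fi_values[:, i]                            # "current feature FI set"
        remaining = np.delete(fi_values, i, axis=1).ravel()  # "remaining FI set"
        # H1: the current feature's FI values are greater than the remaining ones.
        _, pvalues[i] = mannwhitneyu(current, remaining, alternative="greater")
    # FDR correction over the whole set of p-values.
    _, corrected, _, _ = multipletests(pvalues, method="fdr_bh")
    return corrected

# Usage: keep features below the threshold, then move the most significant one to SelectedFeatures.
fi = np.abs(np.random.default_rng(0).normal(size=(100, 20)))  # toy FI values: 100 CV iterations, 20 features
corrected = one_vs_all_pvalues(fi)
candidates = np.where(corrected <= 0.05)[0]
best_feature = int(candidates[np.argmin(corrected[candidates])]) if candidates.size else None

Sketch of the Step 3 cutoff rule, assuming per-rank counts of real and random features have already been collected (the exact normalization LogML uses may differ):

import numpy as np

def random_cutoff_rank(real_counts, random_counts):
    """Each array holds, per rank position, how many times a real / random feature landed at that rank."""
    total = real_counts + random_counts
    # Share of random features at each rank (0 where nothing was placed).
    random_share = np.divide(random_counts, total,
                             out=np.zeros(total.shape, dtype=float), where=total > 0)
    # Cutoff = the first rank where random features take more than 50%.
    above = np.where(random_share > 0.5)[0]
    return int(above[0]) if above.size else len(random_share)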