Feature Importance Analysis
The goal of this kind of analysis is to reduce the feature list to the minimal set which maximizes model quality in comparison with a baseline model. In other words, it drops as many noisy and uninformative features as possible, leaving only the "core" of the most important features.
(Historical note: this method was created for analysis in the bioinformatics area, which has its specifics. The most notable trait of such data is its shape. Due to the nature of disease or treatment-effectiveness research, there are not many samples, usually on the order of several hundred. On the other hand, the amount of facts known about each sample may be overwhelming: physical properties, results of blood and tissue analyses. If we also include DNA and RNA features, the count may easily exceed 50K features.
Naturally, in these circumstances, being able to remove data noise is an advantage.)
Before we start, let's agree on the terminology used.
- Model: a simple estimator as sklearn understands it, i.e. an instance of a class like sklearn.ensemble.RandomForestRegressor.
- CV: cross-validation, i.e. the process of splitting the dataset into non-overlapping train and test subsets. We use 100 CV iterations by default via the sklearn RepeatedKFold method.
- Feature Importance: provided by some models naturally, e.g. coefficients for linear models or gain for tree-based models.
- LogmlModel: a wrapper object which trains many CV models. In essence it operates like some sklearn CV models: each LogML model holds a number of "simple" models, each trained and validated on one of the CV folds. Instead of a point estimate of model loss or feature importance, it uses a set of values (one per CV fold) and treats it as a distribution.
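The "distribution instead of point estimate" idea above can be sketched in plain sklearn. This is a minimal illustration, not the LogML API: one simple model is trained per CV fold, and the per-fold losses form the distribution a LogmlModel would keep.

```python
# Illustrative sketch: one simple model per CV fold, one loss value per fold.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RepeatedKFold

X, y = make_regression(n_samples=200, n_features=10, random_state=0)

# 100 CV iterations, matching the LogML default (10 splits x 10 repeats).
cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=0)

losses = []
for train_idx, test_idx in cv.split(X):
    model = RandomForestRegressor(n_estimators=25, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    losses.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

# One loss per fold -> treated as a distribution, not a single number.
losses = np.asarray(losses)
print(len(losses), losses.mean())
```

Downstream steps (model selection, feature filtering) then compare such distributions statistically instead of comparing two scalars.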
Features Extraction Overview
High level steps:
- Problem definition:
  Specify each column of interest as a target. LogML will look for features which explain the target from an ML perspective.
- Optional: data preprocessing.
  Transform raw data to be suitable for ML models.
- Model selection:
  - Train a baseline model (by default in LogML it is the sklearn dummy model).
  - Train and optimize all LogmlModels suitable for the problem (regression/classification/survival). This includes some sklearn and LightGBM models.
  - Pick those LogmlModels which are statistically better than the dummy model. This is done by comparing loss distributions using the u-test.
- Feature Importance Extraction:
  For each model, perform the feature selection procedure:
  - Filter features by comparing their Feature Importance Distributions.
  - Select those which maximize model quality.
  - Perform the Random features cutoff procedure to make sure noisy features are dropped.
- Aggregate results.
Problem definition
Problem definition consists of the following parts:
- Modeling problem type definition:
  - Regression - for numerical targets
  - Classification - for categorical targets
  - Survival regression - for combinations of time-to-event targets (like overall survival) and censoring indicators
- Metric - used for tuning/selecting models. See Modeling Metrics Registry for details.
- Target name - the target column name should be specified. Note that for survival problems, time and event columns should be specified.
Let's take a look at the following example of how to set up dataset metadata and define the modeling problems of interest:
```yaml
...

dataset_metadata:
  modeling_specs:
    # Alias that will be referenced later
    Outcome_clf_problem:
      # We want to predict "Outcome" - classification problem.
      task: classification    # Explicitly define the task
      target_column: Outcome  # Target column name
      target_metric: rocauc   # Metric / loss of interest

    age_reg_problem:
      # We want to predict "age" - regression problem.
      task: regression    # Explicitly define the task
      target_column: age  # Target column name
      target_metric: mape # Metric / loss of interest

    OS:
      # NOTE: here we don't need to explicitly specify 'survival' task
      time_column: os        # Target time-to-event column
      event_column: os_cens  # Target event (censoring) column
      # Interpretation of event column's values is study-specific,
      # so explicit definition of 'uncensored' is required
      event_query: 'os_cens == 0'
      target_metric: cindex

...
```
The dataset_metadata section is defined by the DatasetMetadataSection class.
Data preprocessing step
The first step of the pipeline is (as usual) preprocessing.
Model selection
Model selection is the process of evaluating candidate models against some baseline model and picking those which outperform it.
As a baseline model, LogML by default uses the sklearn DummyRegressor and DummyClassifier models.
The list of available LogML models can be found here: Model Types.
Configuration class - ModelSearchSection.
```yaml
...

# ModelingSection
modeling:
  ...

  problems:
    Outcome_clf_problem:
      ...

      # ModelSearchSection
      model_search:
        enable: True
        # Candidate models
        models:
          - RandomForestClassifierModel
          - LogisticRegressionModel

        baseline_model: DummyClassifierModel

    age_reg_problem:
      ...

      model_search:
        enable: True
        # DummyRegressorModel will be used as baseline.
        # All regression models will be checked.

      ...

...
```
As part of the report, LogML provides the following details regarding Model Selection results:
- performance metrics for candidate and baseline models
- ROC/PR curves per candidate model for classification problems
- confusion matrices per candidate model for classification problems
The reporting part can give the user meaningful insight into which classes of machine learning models work better than others for a particular problem.
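The "statistically better than the dummy model" check mentioned above compares per-fold loss distributions. The following is a hedged sketch of that comparison using a one-sided Mann-Whitney u-test on synthetic loss values; the threshold and variable names are illustrative, not LogML internals.

```python
# Sketch: is the candidate's per-fold loss distribution smaller than the baseline's?
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)
# Synthetic per-CV-fold losses (100 folds each), lower is better.
candidate_losses = rng.normal(loc=0.20, scale=0.02, size=100)
baseline_losses = rng.normal(loc=0.50, scale=0.02, size=100)

# H1: candidate losses are stochastically smaller than baseline losses.
stat, p_value = mannwhitneyu(candidate_losses, baseline_losses, alternative="less")
keep_candidate = p_value < 0.05  # illustrative significance threshold
print(keep_candidate)
```

A candidate model is kept only if this comparison is significant; otherwise it is considered no better than the dummy baseline.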
Feature Importance Extraction
Configuration class - FeatureImportanceSection.
```yaml
...

# ModelingSection
modeling:
  ...

  problems:
    Outcome_clf_problem:
      ...

      # Here we want to tune and check performance for all available models
      model_search:
        enable: True

      # FeatureImportanceSection
      feature_importance:

        # Uses own cross-validation configuration to run more iterations.
        cross_validation:
          random_state: 42
          split_type: kfold
          n_folds: 100
          test_size: 0.25

        perform_tier1_greedy: false
        fid_pvalue_threshold: 0.05
        n_random_iters: 5
        random_state: 42
        default_extractor: null
        default_n_perm_imp_iters: 10
        extractors: {}

        # When empty - default extractors applied to each model
        # To configure specific:

        # ModelName:
        #   specific_extractorParams
```
Note that this section has its own cross-validation settings.
Method Details
As input, the method accepts a LogmlModel and the original dataset for which the model has been optimized.
Step 1: “All features” model.
- Train the LogmlModel with the CV settings for feature importance extraction. This produces a set of FI values per feature. For example, with the default 100 cross-validation iterations, we have 100 FI values per feature.
- Rank features according to the median FI value (FeatureRanks data).
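Step 1 can be sketched with plain sklearn: collect one FI vector per CV fold, then rank features by their median importance. The dataset and model below are illustrative stand-ins, and fewer folds are used for brevity.

```python
# Sketch of Step 1: per-fold feature importances, ranked by median.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold

X, y = make_regression(n_samples=150, n_features=8, n_informative=3, random_state=0)

cv = RepeatedKFold(n_splits=5, n_repeats=4, random_state=0)  # 20 folds for brevity
fi_values = []  # will have shape (n_folds, n_features)
for train_idx, _ in cv.split(X):
    model = RandomForestRegressor(n_estimators=30, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    fi_values.append(model.feature_importances_)  # gain-based FI, one vector per fold

fi_values = np.asarray(fi_values)
median_fi = np.median(fi_values, axis=0)
ranks = np.argsort(-median_fi)  # feature indices, most important first
print(ranks)
```

The per-feature columns of `fi_values` are exactly the "FI distributions" that later steps compare statistically.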
Step 2: Detect Tier1 features.
- Define FeaturesSet = all dataset features.
- Define SelectedFeatures = empty.
- Define Losses = empty.
- Perform the following cycle:
  - (One-vs-all FI distribution comparison) For each feature in FeaturesSet:
    - Call it the "current feature"; its FI values are the "current feature FI set".
    - Combine the FI values of all features except the current one; call it the "remaining FI set".
    - Perform a u-test with the "greater" alternative hypothesis to calculate a p-value for the test that the "current feature FI set" is statistically greater than the "remaining FI set".
  - As a result we have a set of p-values, one per feature in FeaturesSet. Apply FDR correction to this set.
  - Remove features whose corrected p-values exceed the p-value threshold (0.05 by default). If the list is empty, break the cycle.
  - Pick the feature with the minimal p-value:
    - Add it to SelectedFeatures.
    - Remove it from FeaturesSet.
  - Evaluate the model with SelectedFeatures and save its loss to Losses.
- After the cycle breaks, pick the feature set which corresponds to the minimal value in Losses: SelectedFeatures = features with minimal loss.
- If the perform_tier1_greedy parameter is set, perform greedy feature selection: find the feature from SelectedFeatures which minimizes model loss and add it to the stable set.
- Generate the Tier1 Model: train and evaluate the model with SelectedFeatures. Drop features for which the FI value is zero.
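The core of one cycle iteration above (one-vs-all u-test plus FDR correction) can be sketched as follows. The FI distributions are synthetic, the Benjamini-Hochberg helper is a plain-numpy stand-in for whatever FDR routine LogML actually uses, and the 0.05 threshold follows the text.

```python
# Sketch of one Step 2 iteration: one-vs-all FI comparison with FDR correction.
import numpy as np
from scipy.stats import mannwhitneyu

def bh_correction(pvals):
    """Benjamini-Hochberg adjusted p-values (illustrative FDR correction)."""
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    order = np.argsort(pvals)
    adjusted = np.empty(m)
    cummin = 1.0
    for i, idx in enumerate(order[::-1]):  # from largest p to smallest
        rank = m - i
        cummin = min(cummin, pvals[idx] * m / rank)
        adjusted[idx] = cummin
    return adjusted

rng = np.random.default_rng(0)
# Synthetic FI values per feature across 100 CV folds: f0 is clearly stronger.
fi = {
    "f0": rng.normal(0.5, 0.05, 100),
    "f1": rng.normal(0.1, 0.05, 100),
    "f2": rng.normal(0.1, 0.05, 100),
}

p_values = {}
for name, current in fi.items():
    # "Remaining FI set": FI values of all features except the current one.
    remaining = np.concatenate([v for k, v in fi.items() if k != name])
    # H1: the current feature's FI set is statistically greater than the rest.
    _, p = mannwhitneyu(current, remaining, alternative="greater")
    p_values[name] = p

corrected = bh_correction(list(p_values.values()))
passed = [n for n, p in zip(p_values, corrected) if p < 0.05]
best = min(passed, key=lambda n: p_values[n]) if passed else None
print(passed, best)
```

In the full procedure, `best` would be moved into SelectedFeatures, the model re-evaluated, and the cycle repeated until no feature passes the corrected threshold.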
Step 3: Random features cutoff.
- Generate several subsets of data with random variables. By default, generate 20% of random features for the "All features" model, and 100% for the Tier1 Model.
- Train the model on each of the "noisy" datasets.
- Collect feature ranks data and count the ratio of real "features" vs "random" features placed on a given rank.
- Normalize the counts.
- Calculate the random feature cutoff threshold as the rank where random features take more than 50%.
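The counting logic of Step 3 can be sketched as follows. The dataset, model, and rank bookkeeping are illustrative simplifications; the 100% random-feature ratio (as for the Tier1 Model) and the 50% cutoff follow the text.

```python
# Sketch of Step 3: augment data with random features, count which kind of
# feature (real vs random) lands on each importance rank, find the cutoff.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=150, n_features=5, n_informative=3, random_state=0)

n_random = X.shape[1]            # 100% random features, as for the Tier1 Model
n_total = X.shape[1] + n_random

counts = np.zeros((n_total, 2))  # per rank: [real, random] occurrence counts
for _ in range(20):              # several "noisy" datasets / model fits
    X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], n_random))])
    model = RandomForestRegressor(n_estimators=30, random_state=0).fit(X_noisy, y)
    order = np.argsort(-model.feature_importances_)  # best rank first
    for rank, feat in enumerate(order):
        counts[rank, int(feat >= X.shape[1])] += 1   # columns past X are random

ratios = counts[:, 1] / counts.sum(axis=1)   # normalized share of random features
cutoff_rank = int(np.argmax(ratios > 0.5))   # first rank dominated by random features
print(cutoff_rank)
```

Features whose rank falls at or below the cutoff are then treated as indistinguishable from noise and dropped.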