Feature Importance Analysis
The goal of this kind of analysis is to reduce the feature list to the minimal set which maximizes model quality in comparison with a baseline model. In other words, it drops as many noisy and uninformative features as possible, leaving only the "core" of the most important features.
(Historical note: this method was created for analysis in the bioinformatics area, which has its specifics. The most notable trait of such data is its shape. Due to the nature of disease or treatment-effectiveness research, there are not many samples, usually on the order of several hundred. On the other hand, the amount of facts known about each sample may be overwhelming: physical properties, results of blood and tissue analyses. If we also include DNA and RNA features, the count may easily exceed 50K features.
Naturally, in these circumstances, being able to remove data noise is an advantage.)
Before we start, let's agree on the terminology used.
- Model: a simple estimator as sklearn understands it, i.e. an instance of a class like sklearn.ensemble.RandomForestRegressor.
- CV: cross-validation, i.e. the process of splitting the dataset into non-overlapping train and test subsets. We use 100 CV iterations by default via the sklearn RepeatedKFold method.
- Feature Importance: provided by some models naturally, e.g. coefficients for linear models or gain for tree-based models.
- LogmlModel: a wrapper object which trains many CV models. In essence it operates like some sklearn CV models: each LogML model holds a number of "simple" models, each trained and validated on one of the CV folds. Instead of a point estimate of model loss or feature importance, it uses a set of values (one per CV fold) and treats it as a distribution.
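The "distribution instead of point estimate" idea above can be sketched in plain sklearn. This is a minimal illustration, not the LogML API: one simple model is trained per CV fold, and the per-fold losses form the distribution a LogmlModel would keep.

```python
# Illustrative sketch: one simple model per CV fold, one loss value per fold.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RepeatedKFold

X, y = make_regression(n_samples=200, n_features=10, random_state=0)

# 100 CV iterations, matching the LogML default (10 splits x 10 repeats).
cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=0)

losses = []
for train_idx, test_idx in cv.split(X):
    model = RandomForestRegressor(n_estimators=25, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    losses.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

# One loss per fold -> treated as a distribution, not a single number.
losses = np.asarray(losses)
print(len(losses), losses.mean())
```

Downstream steps (model selection, feature filtering) then compare such distributions statistically instead of comparing two scalars.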
Features Extraction Overview
High level steps:
- Problem definition:
  Specify each column of interest as a target. LogML will look for features which explain the target from an ML perspective.
- Optional: data preprocessing.
  Transform raw data to be suitable for ML models.
- Model selection:
  - Train a baseline model (by default in LogML it is the sklearn dummy model).
  - Train and optimize all LogmlModels suitable for the problem (regression/classification/survival). This includes some sklearn and LightGBM models.
  - Pick those LogmlModels which are statistically better than the dummy model. This is done by comparing loss distributions using the u-test.
- Feature Importance Extraction:
  For each model, perform the feature selection procedure:
  - Filter features by comparing their Feature Importance Distributions.
  - Select those which maximize model quality.
  - Perform the Random features cutoff procedure to make sure noisy features are dropped.
- Aggregate results.
Problem definition
Problem definition consists of the following parts:
- Modeling problem type definition:
  - Regression - for numerical targets
  - Classification - for categorical targets
  - Survival regression - for combinations of time-to-event targets (like overall survival) and censoring indicators
- Metric - used for tuning/selecting models. See Modeling Metrics Registry for details.
- Target name - the target column name should be specified. Note that for survival problems, time and event columns should be specified.
Let's take a look at the following example of how to set up dataset metadata and define the modeling problems of interest:
```yaml
...

dataset_metadata:
  modeling_specs:
    # Alias that will be referenced later
    Outcome_clf_problem:
      # We want to predict "Outcome" - classification problem.
      task: classification    # Explicitly define the task
      target_column: Outcome  # Target column name
      target_metric: rocauc   # Metric / loss of interest

    age_reg_problem:
      # We want to predict "age" - regression problem.
      task: regression    # Explicitly define the task
      target_column: age  # Target column name
      target_metric: mape # Metric / loss of interest

    OS:
      # NOTE: here we don't need to explicitly specify 'survival' task
      time_column: os        # Target time-to-event column
      event_column: os_cens  # Target event (censoring) column
      # Interpretation of event column's values is study-specific,
      # so explicit definition of 'uncensored' is required
      event_query: 'os_cens == 0'
      target_metric: cindex

...
```
The dataset_metadata section is defined by the DatasetMetadataSection class.
Data preprocessing step
The first step of the pipeline is (as usual) preprocessing.
Model selection
Model selection is the process of evaluating candidate models against some baseline model and picking those which outperform it.
As a baseline model, LogML by default uses the sklearn DummyRegressor and DummyClassifier models.
The list of available LogML models can be found here: Model Types.
Configuration class - ModelSearchSection.
```yaml
...

# ModelingSection
modeling:
  ...

  problems:
    Outcome_clf_problem:
      ...

      # ModelSearchSection
      model_search:
        enable: True
        # Candidate models
        models:
          - RandomForestClassifierModel
          - LogisticRegressionModel

        baseline_model: DummyClassifierModel

    age_reg_problem:
      ...

      model_search:
        enable: True
        # DummyRegressorModel will be used as baseline.
        # All regression models will be checked.

      ...

...
```
As part of the report, LogML provides the following details regarding Model Selection results:
- performance metrics for candidate and baseline models
- ROC/PR curves per candidate model for classification problems
- confusion matrices per candidate model for classification problems
The reporting part can give the user meaningful insight into which classes of machine learning models work better than others for a particular problem.
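The "statistically better than the dummy model" check mentioned above compares per-fold loss distributions. The following is a hedged sketch of that comparison using a one-sided Mann-Whitney u-test on synthetic loss values; the threshold and variable names are illustrative, not LogML internals.

```python
# Sketch: is the candidate's per-fold loss distribution smaller than the baseline's?
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)
# Synthetic per-CV-fold losses (100 folds each), lower is better.
candidate_losses = rng.normal(loc=0.20, scale=0.02, size=100)
baseline_losses = rng.normal(loc=0.50, scale=0.02, size=100)

# H1: candidate losses are stochastically smaller than baseline losses.
stat, p_value = mannwhitneyu(candidate_losses, baseline_losses, alternative="less")
keep_candidate = p_value < 0.05  # illustrative significance threshold
print(keep_candidate)
```

A candidate model is kept only if this comparison is significant; otherwise it is considered no better than the dummy baseline.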
Feature Importance Extraction
Configuration class - FeatureImportanceSection.
```yaml
...

# ModelingSection
modeling:
  ...

  problems:
    Outcome_clf_problem:
      ...

      # Here we want to tune and check performance for all available models
      model_search:
        enable: True

      # FeatureImportanceSection
      feature_importance:

        # Uses own cross-validation configuration to run more iterations.
        cross_validation:
          random_state: 42
          split_type: kfold
          n_folds: 100
          test_size: 0.25

        perform_tier1_greedy: false
        fid_pvalue_threshold: 0.05
        n_random_iters: 5
        random_state: 42
        default_extractor: null
        default_n_perm_imp_iters: 10
        extractors: {}

        # When empty - default extractors applied to each model
        # To configure specific:

        # ModelName:
        #   specific_extractorParams
```
Note that this section has its own cross-validation settings.
Method Details
As input, the method accepts a LogmlModel and the original dataset for which the model has been optimized.
Step 1: “All features” model.
- Train the LogmlModel with the CV settings for feature importance extraction. This produces a set of FI values per feature. For example, with the default 100 cross-validation iterations, we have 100 FI values per feature.
- Rank features according to the median FI value (FeatureRanks data).
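Step 1 can be sketched with plain sklearn: collect one FI vector per CV fold, then rank features by their median importance. The dataset and model below are illustrative stand-ins, and fewer folds are used for brevity.

```python
# Sketch of Step 1: per-fold feature importances, ranked by median.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold

X, y = make_regression(n_samples=150, n_features=8, n_informative=3, random_state=0)

cv = RepeatedKFold(n_splits=5, n_repeats=4, random_state=0)  # 20 folds for brevity
fi_values = []  # will have shape (n_folds, n_features)
for train_idx, _ in cv.split(X):
    model = RandomForestRegressor(n_estimators=30, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    fi_values.append(model.feature_importances_)  # gain-based FI, one vector per fold

fi_values = np.asarray(fi_values)
median_fi = np.median(fi_values, axis=0)
ranks = np.argsort(-median_fi)  # feature indices, most important first
print(ranks)
```

The per-feature columns of `fi_values` are exactly the "FI distributions" that later steps compare statistically.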
Step 2: Detect Tier1 features.
- Define FeaturesSet = all dataset features.
- Define SelectedFeatures = empty.
- Define Losses = empty.
- Perform the following cycle:
  - (One-vs-all FI distribution comparison) For each feature in FeaturesSet:
    - Call it the "current feature"; its FI values are the "current feature FI set".
    - Combine the FI values of all features except the current one; call it the "remaining FI set".
    - Perform a u-test with the "greater" alternative hypothesis to calculate a p-value for the test that the "current feature FI set" is statistically greater than the "remaining FI set".
  - As a result we have a set of p-values, one per feature in FeaturesSet. Apply FDR correction to this set.
  - Remove features whose corrected p-values exceed the p-value threshold (0.05 by default). If the list is empty, break the cycle.
  - Pick the feature with the minimal p-value:
    - Add it to SelectedFeatures.
    - Remove it from FeaturesSet.
  - Evaluate the model with SelectedFeatures and save its loss to Losses.
- After the cycle breaks, pick the feature set which corresponds to the minimal value in Losses: SelectedFeatures = features with minimal loss.
- If the perform_tier1_greedy parameter is set, perform greedy feature selection: find the feature from SelectedFeatures which minimizes model loss and add it to the stable set.
- Generate the Tier1 Model: train and evaluate the model with SelectedFeatures. Drop features for which the FI value is zero.
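The core of one cycle iteration above (one-vs-all u-test plus FDR correction) can be sketched as follows. The FI distributions are synthetic, the Benjamini-Hochberg helper is a plain-numpy stand-in for whatever FDR routine LogML actually uses, and the 0.05 threshold follows the text.

```python
# Sketch of one Step 2 iteration: one-vs-all FI comparison with FDR correction.
import numpy as np
from scipy.stats import mannwhitneyu

def bh_correction(pvals):
    """Benjamini-Hochberg adjusted p-values (illustrative FDR correction)."""
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    order = np.argsort(pvals)
    adjusted = np.empty(m)
    cummin = 1.0
    for i, idx in enumerate(order[::-1]):  # from largest p to smallest
        rank = m - i
        cummin = min(cummin, pvals[idx] * m / rank)
        adjusted[idx] = cummin
    return adjusted

rng = np.random.default_rng(0)
# Synthetic FI values per feature across 100 CV folds: f0 is clearly stronger.
fi = {
    "f0": rng.normal(0.5, 0.05, 100),
    "f1": rng.normal(0.1, 0.05, 100),
    "f2": rng.normal(0.1, 0.05, 100),
}

p_values = {}
for name, current in fi.items():
    # "Remaining FI set": FI values of all features except the current one.
    remaining = np.concatenate([v for k, v in fi.items() if k != name])
    # H1: the current feature's FI set is statistically greater than the rest.
    _, p = mannwhitneyu(current, remaining, alternative="greater")
    p_values[name] = p

corrected = bh_correction(list(p_values.values()))
passed = [n for n, p in zip(p_values, corrected) if p < 0.05]
best = min(passed, key=lambda n: p_values[n]) if passed else None
print(passed, best)
```

In the full procedure, `best` would be moved into SelectedFeatures, the model re-evaluated, and the cycle repeated until no feature passes the corrected threshold.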
Step 3: Random features cutoff.
- Generate several subsets of data with random variables. By default, generate 20% of random features for the "All features" model, and 100% for the Tier1 Model.
- Train the model on each of the "noisy" datasets.
- Collect feature ranks data and count the ratio of real "features" vs "random" features placed on a given rank.
- Normalize the counts.
- Calculate the random feature cutoff threshold as the rank where random features take more than 50%.
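The counting logic of Step 3 can be sketched as follows. The dataset, model, and rank bookkeeping are illustrative simplifications; the 100% random-feature ratio (as for the Tier1 Model) and the 50% cutoff follow the text.

```python
# Sketch of Step 3: augment data with random features, count which kind of
# feature (real vs random) lands on each importance rank, find the cutoff.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=150, n_features=5, n_informative=3, random_state=0)

n_random = X.shape[1]            # 100% random features, as for the Tier1 Model
n_total = X.shape[1] + n_random

counts = np.zeros((n_total, 2))  # per rank: [real, random] occurrence counts
for _ in range(20):              # several "noisy" datasets / model fits
    X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], n_random))])
    model = RandomForestRegressor(n_estimators=30, random_state=0).fit(X_noisy, y)
    order = np.argsort(-model.feature_importances_)  # best rank first
    for rank, feat in enumerate(order):
        counts[rank, int(feat >= X.shape[1])] += 1   # columns past X are random

ratios = counts[:, 1] / counts.sum(axis=1)   # normalized share of random features
cutoff_rank = int(np.argmax(ratios > 0.5))   # first rank dominated by random features
print(cutoff_rank)
```

Features whose rank falls at or below the cutoff are then treated as indistinguishable from noise and dropped.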