logml.feature_importance.sfd

Statistical Features Detection.

Use statistical methods over ML models to detect valuable features.

Methods available:

  • Feature Importance Distribution Selection - FIDS. Compares feature importance (FI) distributions in one-vs-rest or one-vs-next mode, then filters the picked set by removing features whose median FI is 0.

  • Signal Detection by Features Removal - SDFR. Removes features from the “all features” model until model quality is indistinguishable from the baseline (tier1). Then picks features from the remaining set if doing so improves model quality.
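
The FIDS idea above can be illustrated with a minimal sketch. This is not the library's implementation. It assumes that FI values are collected over repeated fits in a DataFrame (rows are fits, columns are features), that "one-vs-rest" means comparing one feature against the pooled FI values of all other features with a one-sided Mann-Whitney U test, and that the median filter drops features whose median FI is exactly 0. The helper name `select_one_vs_rest` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu


def select_one_vs_rest(fi: pd.DataFrame, p_threshold: float = 0.05) -> list:
    """Hypothetical helper: keep features whose FI distribution is
    significantly greater than the pooled FI of all other features."""
    picked = []
    for feature in fi.columns:
        own = fi[feature].to_numpy()
        rest = fi.drop(columns=feature).to_numpy().ravel()
        # One-sided test: is this feature's importance stochastically greater?
        _, p_value = mannwhitneyu(own, rest, alternative="greater")
        if p_value < p_threshold:
            picked.append(feature)
    # Assumed reading of the "median 0 FI" filter: drop features whose
    # median importance is exactly zero.
    return [f for f in picked if fi[f].median() != 0]


rng = np.random.default_rng(0)
fi = pd.DataFrame({
    "signal": rng.uniform(0.5, 1.0, size=30),   # consistently important
    "noise_a": rng.uniform(0.0, 0.1, size=30),  # weak, noise-like importance
    "noise_b": rng.uniform(0.0, 0.1, size=30),
    "unused": np.zeros(30),                     # never used by the model
})
selected = select_one_vs_rest(fi)
print(selected)
```

The default `p_threshold` mirrors the `fid_pvalue_threshold=0.05` default of the FIDS class; whether the library applies it in exactly this way is an assumption.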

Functions

add_random_features(df, num_features_to_add)

Adds random features to the dataframe.

display_fds_features_plots(obj, df[, ...])

Display 3 plots per stable feature, explaining its relations to target and importance.

display_fds_result(fids_result, dataset[, ...])

Visualize result of FIDS work.

display_feature_evidense(i, f, obj, raw_fis, ...)

display_pdp_plot(obj, features, data[, ...])

Plot partial dependence chart for the given feature.

display_rank_composition(rank_comp[, limit, ...])

Display composition of ranks as real vs random features.

display_ranks_plot(rank_stats[, limit, ...])

Plot Feature importance distributions (aka Green-Blue plot).

entropy(values)

Calculates entropy for series.

extract_survival_data(dataset, time_col, ...)

get_correlated_cols(df[, prefix, thresh, method])

Returns list of correlated columns.

get_summary_ranks(fids, loss_name, objective)

Aggregate result of FIDS for multiple models into one summary table.

make_rnd_dataset(num_random_features, dataset)

Add num_random_features random features to the dataset, making sure that the correlation of random features with existing data does not exceed the threshold.

sanitize(dataframe)

train_and_eval(dataset, model_cls[, ...])

Train CV model on the dataset, evaluate, and save feature importance data.

u_test_diff(a, b)

Test if the distribution of B is different from that of A.

u_test_greater(a, b)

Test if the distribution of A is greater than that of B.

Classes

FIDS([dataset, objective, model_cls, ...])

FIDSEnv([fids_run_name, config_path, ...])

Class to run FIDS manually as a continuation of the regular logml experiment.

FIDSResult(model_name, objective_cfg[, ...])

FIDSRunner(cfg, global_params[, ...])

Feature importance extractors executor.

TrialResult(loss, model, fi_summary[, fis])

Result of one model training on a CV dataset.

logml.feature_importance.sfd.sanitize(dataframe)

logml.feature_importance.sfd.u_test_diff(a: pandas.core.series.Series, b: pandas.core.series.Series)

Test if the distribution of B is different from that of A.

logml.feature_importance.sfd.u_test_greater(a: pandas.core.series.Series, b: pandas.core.series.Series)

Test if the distribution of A is greater than that of B.
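
The names `u_test_diff` and `u_test_greater` suggest a Mann-Whitney U test, but the page does not confirm this, nor what the functions return. Under that assumption, the two comparisons could be sketched with SciPy as follows. The `_sketch` functions are hypothetical stand-ins that return p-values.

```python
import pandas as pd
from scipy.stats import mannwhitneyu


def u_test_diff_sketch(a: pd.Series, b: pd.Series) -> float:
    """Two-sided p-value: do A and B come from different distributions?"""
    return mannwhitneyu(a, b, alternative="two-sided").pvalue


def u_test_greater_sketch(a: pd.Series, b: pd.Series) -> float:
    """One-sided p-value: is A stochastically greater than B?"""
    return mannwhitneyu(a, b, alternative="greater").pvalue


# Toy samples with complete separation: every value in `a` exceeds every value in `b`.
a = pd.Series([0.9, 0.8, 0.85, 0.95, 0.7, 0.75, 0.88, 0.92])
b = pd.Series([0.1, 0.2, 0.15, 0.05, 0.3, 0.25, 0.12, 0.08])

p_diff = u_test_diff_sketch(a, b)        # small: distributions differ
p_greater = u_test_greater_sketch(a, b)  # small: A is greater than B
p_less = u_test_greater_sketch(b, a)     # large: B is not greater than A
```

The one-sided variant matters for feature selection: a feature should be kept only if its importance is larger than the reference, not merely different from it.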

logml.feature_importance.sfd.entropy(values: pandas.core.series.Series)

Calculates entropy for series.
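
The page does not say which entropy definition is used. A minimal sketch, assuming Shannon entropy in bits over the frequencies of distinct values in the series (`shannon_entropy` is a hypothetical stand-in, not the library function):

```python
import numpy as np
import pandas as pd


def shannon_entropy(values: pd.Series) -> float:
    """Shannon entropy (in bits) of the empirical value frequencies."""
    # value_counts drops NaN and never yields zero probabilities,
    # so log2 is always defined here.
    probs = values.value_counts(normalize=True).to_numpy()
    return float(-(probs * np.log2(probs)).sum())


balanced = shannon_entropy(pd.Series(["a", "a", "b", "b"]))  # two equally likely values
constant = shannon_entropy(pd.Series(["a", "a", "a", "a"]))  # a single value
```

A constant column carries no information (entropy 0), while two equally frequent values give exactly 1 bit.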

logml.feature_importance.sfd.add_random_features(df: pandas.core.frame.DataFrame, num_features_to_add: int, prefix='rand_') → pandas.core.frame.DataFrame

Adds random features to the dataframe.

logml.feature_importance.sfd.get_correlated_cols(df, prefix='rand_', thresh=0.2, method='spearman')

Returns list of correlated columns.
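
The exact rule is not documented here. One plausible reading of the signature is that columns starting with `prefix` are flagged when their absolute correlation with any other column exceeds `thresh`. The sketch below follows that assumption; `correlated_prefixed_cols` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def correlated_prefixed_cols(df, prefix="rand_", thresh=0.2, method="spearman"):
    """Prefixed columns whose absolute correlation with any non-prefixed
    column exceeds `thresh` (assumed behaviour)."""
    prefixed = [c for c in df.columns if c.startswith(prefix)]
    others = [c for c in df.columns if not c.startswith(prefix)]
    # Rows: prefixed columns; columns: all remaining columns.
    corr = df.corr(method=method).loc[prefixed, others].abs()
    return [c for c in prefixed if (corr.loc[c] > thresh).any()]


x = np.arange(50, dtype=float)
df = pd.DataFrame({
    "x": x,
    "rand_0": x ** 3,                    # monotone in x, so rank correlation is 1
    "rand_1": np.tile([1.0, -1.0], 25),  # alternating sign, nearly unrelated to x
})
flagged = correlated_prefixed_cols(df)
```

Spearman correlation is rank-based, so it also catches the monotone but non-linear relation between `x` and `rand_0`.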

logml.feature_importance.sfd.make_rnd_dataset(num_random_features: int, dataset: logml.data.datasets.cv_dataset.ModelingDataset, corr_th=0.3, n_attempts=20, rnd_prefix: str = 'rand_')

Add num_random_features random features to the dataset, making sure that the correlation of random features with existing data does not exceed the threshold.
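
The `corr_th` and `n_attempts` parameters suggest a generate-and-retry loop. The sketch below shows one way such a loop could work on a plain DataFrame rather than a `ModelingDataset`. The helper name, the normal distribution of the noise, the `seed` argument, and raising an error when attempts run out are all assumptions, not documented behaviour.

```python
import numpy as np
import pandas as pd


def add_uncorrelated_noise(df, num_random_features, corr_th=0.3,
                           n_attempts=20, rnd_prefix="rand_", seed=0):
    """Hypothetical stand-in: append noise columns that are weakly
    correlated with every original column."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    ranked = df.rank()  # Spearman correlation is Pearson correlation on ranks
    for i in range(num_random_features):
        for _ in range(n_attempts):
            candidate = pd.Series(rng.normal(size=len(df)), index=df.index)
            # Largest absolute rank correlation with the original columns.
            max_corr = ranked.corrwith(candidate.rank()).abs().max()
            if max_corr <= corr_th:
                out[f"{rnd_prefix}{i}"] = candidate
                break
        else:
            # Assumed failure mode: give up once the attempts are exhausted.
            raise RuntimeError(f"no acceptable candidate after {n_attempts} attempts")
    return out


rng = np.random.default_rng(1)
data = pd.DataFrame({"f1": rng.normal(size=200), "f2": rng.normal(size=200)})
noisy = add_uncorrelated_noise(data, num_random_features=3)
```

Rejecting accidentally correlated noise keeps the random features usable as a "no signal" reference when their importances are later compared with those of real features.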

class logml.feature_importance.sfd.TrialResult(loss: logml.models.base.CVMetricsResult, model: logml.models.base.BaseModel, fi_summary: pandas.core.frame.DataFrame, fis: Optional[pandas.core.frame.DataFrame] = None)

Bases: object

Result of one model training on a CV dataset.

loss: logml.models.base.CVMetricsResult
model: logml.models.base.BaseModel
fi_summary: pandas.core.frame.DataFrame
fis: pandas.core.frame.DataFrame = None
get_mean_loss(loss_name: str) → float
get_std_loss(loss_name: str) → float

logml.feature_importance.sfd.train_and_eval(dataset: logml.data.datasets.cv_dataset.ModelingDataset, model_cls: Type[logml.models.base.BaseModel], model_params: Optional[Dict] = None, model_fit_params: Optional[Dict] = None, fi_extractor: Optional[logml.feature_importance.base.BaseImportanceExtractor] = None, track_fi=True, logger=None, store_interim=False) → logml.feature_importance.sfd.TrialResult

Train CV model on the dataset, evaluate, and save feature importance data.

logml.feature_importance.sfd.extract_survival_data(dataset: logml.data.datasets.cv_dataset.ModelingDataset, time_col, event_col, event_query) → Tuple[pandas.core.frame.DataFrame, logml.data.datasets.cv_dataset.ModelingDataset]

class logml.feature_importance.sfd.FIDSResult(model_name: str, objective_cfg: logml.configuration.modeling.ModelingTaskSpec, num_initial_features: int = -1, stable_features: List[str] = None, candidate_stable_features: List[str] = None, stable_features_before_rand_threshold: List[str] = None, result_bl: logml.feature_importance.sfd.TrialResult = None, result_orig: logml.feature_importance.sfd.TrialResult = None, result_stable: logml.feature_importance.sfd.TrialResult = None, losses: pandas.core.frame.DataFrame = None, rank_composition_orig: pandas.core.frame.DataFrame = None, rank_composition_stable: pandas.core.frame.DataFrame = None, rand_rank_threshold: int = 0, status: str = '')

Bases: object

model_name: str
objective_cfg: logml.configuration.modeling.ModelingTaskSpec
num_initial_features: int = -1
stable_features: List[str] = None
candidate_stable_features: List[str] = None
stable_features_before_rand_threshold: List[str] = None
result_bl: logml.feature_importance.sfd.TrialResult = None
result_orig: logml.feature_importance.sfd.TrialResult = None
result_stable: logml.feature_importance.sfd.TrialResult = None
losses: pandas.core.frame.DataFrame = None
rank_composition_orig: pandas.core.frame.DataFrame = None
rank_composition_stable: pandas.core.frame.DataFrame = None
rand_rank_threshold: int = 0
status: str = ''
property objective: logml.common.ModelingTask
property loss_name: str
property target_column: str
property event_column: str
get_rel_loss(loss_name: Optional[str] = None, objective: Optional[logml.common.ModelingTask] = None, rel='baseline')

Returns the loss of the stable model relative to the baseline.

class logml.feature_importance.sfd.FIDS(dataset: Optional[logml.data.datasets.cv_dataset.ModelingDataset] = None, objective: Optional[logml.configuration.modeling.ModelingTaskSpec] = None, model_cls: Optional[Type[logml.models.base.BaseModel]] = None, model_params: Optional[dict] = None, baseline_model_cls: Optional[Type[logml.models.base.BaseModel]] = None, baseline_model_params: Optional[dict] = None, fi_extractor: Optional[logml.feature_importance.base.BaseImportanceExtractor] = None, fid_pvalue_threshold: float = 0.05, n_random_iters: int = 5, data_with_noise: Optional[List[logml.data.datasets.cv_dataset.ModelingDataset]] = None, n_random_features=0, rnd_feature_prefix: str = 'rand_', num_features_in_tier1_model_batch=1, logger=None, use_survival=False, survival_df=None, perform_tier1_greedy=True, store_interim=False)

Bases: object

get_rel_loss(rel='baseline')

Returns the loss of the stable model relative to the baseline.

get_tiered_ranks()

Get the ranked list of features for the original model, specified by Tier1 stable features.

calc_surival()

If survival data is present, calculate the optimal cutoff for all stable features.

fit() → logml.feature_importance.sfd.FIDSResult

Run features identification process.

logml.feature_importance.sfd.display_ranks_plot(rank_stats, limit=None, figsize=(16, 10), draw_iqrange=True, range_std=0, rank_stats2=None, title2=None, markers=None, title='Ranks', tier1_end=None, rand_f=True, knee_point=0)

Plot Feature importance distributions (aka Green-Blue plot).

logml.feature_importance.sfd.display_fds_features_plots(obj: logml.feature_importance.sfd.FIDSResult, df: pandas.core.frame.DataFrame, features=None, kde=False, figsize=(12, 4), fi_log_scale=False, target_col=None, use_orig=True, plot_fi_dist=True)

Display 3 plots per stable feature, explaining its relations to target and importance.

logml.feature_importance.sfd.display_feature_evidense(i, f, obj: logml.feature_importance.sfd.FIDSResult, raw_fis, final_ranks, df, kde, target_col, plot_fi_dist, figsize, fi_log_scale, use_orig)

logml.feature_importance.sfd.display_pdp_plot(obj: logml.feature_importance.sfd.FIDSResult, features, data, kind='both', figsize=(12, 12), limit=-1, ax=None, use_orig=False)

Plot partial dependence chart for the given feature.

logml.feature_importance.sfd.display_fds_result(fids_result: logml.feature_importance.sfd.FIDSResult, dataset: logml.data.datasets.cv_dataset.ModelingDataset, kde=True, plot_fi_dist=True, ranks_figsize=(14, 6))

Visualize result of FIDS work.

logml.feature_importance.sfd.display_rank_composition(rank_comp, limit=None, figsize=None, dist_threshold=-1, rand_threshold=-1, avg=False, window=5)

Display composition of ranks as real vs random features.

class logml.feature_importance.sfd.FIDSEnv(fids_run_name='', config_path=None, output_path=None, run_name=None, stratum_name=None, problem_name=None, rnd_state=42, n_folds=100, test_size=0.25, n_models=5, n_perm_imp_iters=10, corr_thresh=0.3, perform_tier1_greedy=True)

Bases: object

Class to run FIDS manually as a continuation of the regular logml experiment.

dump()
classmethod load(gparams)
load_exp_data(corr_thresh=0.5, drop_all=False, drop_cat=False, method='spearman')
run_fids(models: Optional[list] = None)
run_model_fids(name, model_params=None)
class logml.feature_importance.sfd.FIDSFeatureResult

Bases: pydantic.main.BaseModel

JSON schema:
{
   "title": "FIDSFeatureResult",
   "type": "object",
   "properties": {
      "name": {
         "title": "Name",
         "type": "string"
      },
      "medan_rank": {
         "title": "Medan Rank",
         "default": -1.0,
         "type": "number"
      },
      "num_models": {
         "title": "Num Models",
         "default": 0,
         "type": "integer"
      },
      "ranks": {
         "title": "Ranks",
         "type": "object",
         "additionalProperties": {
            "type": "number"
         }
      }
   },
   "required": [
      "name",
      "ranks"
   ]
}

Fields
field name: str [Required]
field medan_rank: float = -1.0
field num_models: int = 0
field ranks: Dict[str, float] [Required]
class logml.feature_importance.sfd.FIDSModelResult

Bases: pydantic.main.BaseModel

JSON schema:
{
   "title": "FIDSModelResult",
   "type": "object",
   "properties": {
      "name": {
         "title": "Name",
         "type": "string"
      },
      "features": {
         "title": "Features",
         "type": "object",
         "additionalProperties": {
            "type": "object"
         }
      },
      "qty_rel_baseline": {
         "title": "Qty Rel Baseline",
         "default": 0.0,
         "type": "number"
      },
      "qty_rel_orig": {
         "title": "Qty Rel Orig",
         "default": 0.0,
         "type": "number"
      }
   },
   "required": [
      "name",
      "features"
   ]
}

Fields
field name: str [Required]
field features: Dict[str, dict] [Required]
field qty_rel_baseline: float = 0.0
field qty_rel_orig: float = 0.0
class logml.feature_importance.sfd.FIDSummaryResult

Bases: pydantic.main.BaseModel

JSON schema:
{
   "title": "FIDSummaryResult",
   "type": "object",
   "properties": {
      "features": {
         "title": "Features",
         "default": [],
         "type": "array",
         "items": {
            "$ref": "#/definitions/FIDSFeatureResult"
         }
      },
      "models": {
         "title": "Models",
         "default": [],
         "type": "array",
         "items": {
            "$ref": "#/definitions/FIDSModelResult"
         }
      }
   },
   "definitions": {
      "FIDSFeatureResult": {
         "title": "FIDSFeatureResult",
         "type": "object",
         "properties": {
            "name": {
               "title": "Name",
               "type": "string"
            },
            "medan_rank": {
               "title": "Medan Rank",
               "default": -1.0,
               "type": "number"
            },
            "num_models": {
               "title": "Num Models",
               "default": 0,
               "type": "integer"
            },
            "ranks": {
               "title": "Ranks",
               "type": "object",
               "additionalProperties": {
                  "type": "number"
               }
            }
         },
         "required": [
            "name",
            "ranks"
         ]
      },
      "FIDSModelResult": {
         "title": "FIDSModelResult",
         "type": "object",
         "properties": {
            "name": {
               "title": "Name",
               "type": "string"
            },
            "features": {
               "title": "Features",
               "type": "object",
               "additionalProperties": {
                  "type": "object"
               }
            },
            "qty_rel_baseline": {
               "title": "Qty Rel Baseline",
               "default": 0.0,
               "type": "number"
            },
            "qty_rel_orig": {
               "title": "Qty Rel Orig",
               "default": 0.0,
               "type": "number"
            }
         },
         "required": [
            "name",
            "features"
         ]
      }
   }
}

Fields
field features: List[logml.feature_importance.sfd.FIDSFeatureResult] = []
field models: List[logml.feature_importance.sfd.FIDSModelResult] = []
logml.feature_importance.sfd.get_summary_ranks(fids: Collection[logml.feature_importance.sfd.FIDSResult], loss_name: str, objective: logml.common.ModelingTask) → Tuple[Optional[pandas.core.frame.DataFrame], Optional[pandas.core.frame.DataFrame], logml.feature_importance.sfd.FIDSummaryResult]

Aggregate result of FIDS for multiple models into one summary table.
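
How the aggregation is computed is not described on this page. As an illustration only, the sketch below builds per-feature records shaped like the `FIDSFeatureResult` schema (`name`, `medan_rank`, `num_models`, `ranks`) from per-model ranks. It assumes that `ranks` is keyed by model name and that `medan_rank` holds the median of those ranks. The helper `summarize_ranks`, the model names, and the sort order are hypothetical.

```python
from statistics import median


def summarize_ranks(ranks_by_model):
    """Turn {model: {feature: rank}} into records shaped like
    FIDSFeatureResult (assumed semantics, see the note above)."""
    per_feature = {}
    for model_name, ranks in ranks_by_model.items():
        for feature, rank in ranks.items():
            per_feature.setdefault(feature, {})[model_name] = float(rank)
    records = [
        {
            "name": feature,
            # Field name kept exactly as spelled in the schema.
            "medan_rank": float(median(ranks.values())),
            "num_models": len(ranks),
            "ranks": ranks,
        }
        for feature, ranks in per_feature.items()
    ]
    # Assumed ordering: features found by more models first,
    # then by best (lowest) median rank.
    return sorted(records, key=lambda r: (-r["num_models"], r["medan_rank"]))


# Hypothetical stable-feature ranks from three models.
ranks_by_model = {
    "rf": {"age": 1, "bmi": 2},
    "xgb": {"age": 2, "dose": 1},
    "lasso": {"age": 1},
}
summary = summarize_ranks(ranks_by_model)
```

Features selected by several models with consistently low ranks rise to the top of such a table, which is the usual purpose of a cross-model summary.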

class logml.feature_importance.sfd.FIDSRunner(cfg: GlobalConfig, global_params: dict, model_provider: Optional[logml.model_search.provider.ModelProvider] = None, logger=None)

Bases: logml.common.BaseRunner

Feature importance extractors executor.

FEATURE_SOURCE = 'Source'
FEATURE_CORRELATION_GROUP = 'Correlation Group'
aggregate_results(dump=True)

Generates high-level summaries for FIDS.

run()

Invokes required feature importance extractors according to ‘feature_importance’ cfg section.

run_single_model(dataset, model_name) → Optional[logml.feature_importance.sfd.FIDSResult]
load_fids_results() → Dict[str, logml.feature_importance.sfd.FIDSResult]

Loads serialized FIDS results.