logml.feature_importance.sfd
Statistical Features Detection.
Uses statistical methods on top of ML models to detect valuable features.
Methods available:
Feature Importance Distribution Selection (FIDS). Compares feature importance (FI) distributions in one-vs-rest or one-vs-next mode, then filters the picked set by dropping features whose median FI is 0.
Signal Detection by Features Removal (SDFR). Removes features from the “all features” model until model quality is indistinguishable from the baseline (tier1). Then picks features from the remaining set if doing so improves model quality.
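The SDFR description above can be read as a two-stage greedy loop. The sketch below is one possible reading of that description, not logml's implementation: `sdfr_sketch`, `importance`, `quality`, and `tol` are hypothetical names, quality is assumed to be "higher is better", and the toy additive score stands in for training and evaluating a real model.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def sdfr_sketch(
    features: Sequence[str],
    importance: Dict[str, float],
    quality: Callable[[List[str]], float],
    baseline: float,
    tol: float = 1e-9,
) -> Tuple[List[str], List[str]]:
    """Return (removed, selected) for a hypothetical higher-is-better quality score."""
    # Most important features first; they are removed first.
    remaining = sorted(features, key=lambda f: importance[f], reverse=True)
    removed: List[str] = []

    # Stage 1: drop the strongest feature until the rest is no better than baseline.
    while remaining and quality(remaining) - baseline > tol:
        removed.append(remaining.pop(0))

    # Stage 2: add a leftover feature only if it improves the selected set.
    selected = list(removed)
    for feature in remaining:
        if quality(selected + [feature]) - quality(selected) > tol:
            selected.append(feature)
    return removed, selected


# Toy stand-in for "train a model and score it": quality is the summed signal.
signal = {"dose": 0.5, "age": 0.3, "noise_1": 0.0, "noise_2": 0.0}
removed, selected = sdfr_sketch(
    features=list(signal),
    importance=signal,
    quality=lambda fs: sum(signal[f] for f in fs),
    baseline=0.0,
)
print(removed, selected)
```

In this toy setup the two informative features are removed in stage 1 and the noise features are never added back, because they do not change the score.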
Functions
| add_random_features | Adds random features to the dataframe. |
| display_fds_features_plots | Display 3 plots per stable feature, explaining its relations to target and importance. |
| display_fds_result | Visualize result of FIDS work. |
| display_feature_evidense | |
| display_pdp_plot | Plot partial dependence chart for the given feature. |
| display_rank_composition | Display composition of ranks as real vs random features. |
| display_ranks_plot | Plot Feature importance distributions (aka Green-Blue plot). |
| entropy | Calculates entropy for series. |
| extract_survival_data | |
| | Returns list of correlated columns. |
| get_summary_ranks | Aggregate result of FIDS for multiple models into one summary table. |
| make_rnd_dataset | Add num_random_features random features to the dataset, making sure that correlation of random features with existing data does not exceed the threshold. |
| sanitize | |
| train_and_eval | Train CV model on the dataset, evaluate, and save feature importance data. |
| u_test_diff | Test if the distribution of B differs from that of A. |
| u_test_greater | Test if the distribution of A is greater than that of B. |
Classes
| FIDS | |
| FIDSEnv | Class to run FIDS manually as a continuation of the regular logml experiment. |
| FIDSResult | |
| FIDSRunner | Feature importance extractors executor. |
| TrialResult | Result of one model training on a CV dataset. |
- logml.feature_importance.sfd.sanitize(dataframe)
- logml.feature_importance.sfd.u_test_diff(a: pandas.core.series.Series, b: pandas.core.series.Series)
Test if the distribution of B differs from that of A.
- logml.feature_importance.sfd.u_test_greater(a: pandas.core.series.Series, b: pandas.core.series.Series)
Test if the distribution of A is greater than that of B.
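The function names suggest a Mann-Whitney U test, although the page does not say so. Under that assumption, here is a minimal standard-library sketch of a one-sided and a two-sided comparison. The `_sketch` functions are hypothetical and work on plain sequences rather than pandas Series. The p-values use the normal approximation without tie or continuity correction, so they are only rough for small samples.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (U_a, one-sided p-value for 'a tends to be greater than b')."""
    n_a, n_b = len(a), len(b)
    if n_a == 0 or n_b == 0:
        raise ValueError("both samples must be non-empty")
    # U_a counts pairs (x, y) with x > y; ties count as half a pair.
    u_a = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    mean_u = n_a * n_b / 2
    std_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_a - mean_u) / std_u
    # Upper-tail probability of the standard normal: P(Z >= z).
    p_greater = 0.5 * math.erfc(z / math.sqrt(2))
    return u_a, p_greater


def u_test_greater_sketch(a: Sequence[float], b: Sequence[float]) -> float:
    """One-sided p-value: small when values in a tend to exceed values in b."""
    return mann_whitney_u(a, b)[1]


def u_test_diff_sketch(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sided p-value: small when the two samples differ in either direction."""
    p_greater = mann_whitney_u(a, b)[1]
    return min(1.0, 2.0 * min(p_greater, 1.0 - p_greater))


# Hypothetical importance values of a strong feature vs. a weak one.
strong = [0.9, 0.8, 0.85, 0.95, 0.7, 0.88, 0.92, 0.81]
weak = [0.1, 0.2, 0.15, 0.05, 0.3, 0.12, 0.08, 0.19]
print(u_test_greater_sketch(strong, weak) < 0.05)
print(u_test_diff_sketch(strong, weak) < 0.05)
```

A p-value below a chosen threshold (the FIDS class exposes `fid_pvalue_threshold`, default 0.05) would then mark the two importance distributions as distinguishable.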
- logml.feature_importance.sfd.entropy(values: pandas.core.series.Series)
Calculates entropy for series.
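The page does not state which entropy is computed or in which base. As an illustration only, the sketch below assumes Shannon entropy over the observed value frequencies, measured in bits. `shannon_entropy` is a hypothetical helper that takes any iterable of hashable values rather than a pandas Series.

```python
import math
from collections import Counter
from typing import Hashable, Iterable


def shannon_entropy(values: Iterable[Hashable]) -> float:
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = -sum(p * log2(p)) over the distinct observed values.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


print(shannon_entropy(["a", "a", "b", "b"]))  # two equally likely values
print(shannon_entropy(["a", "a", "a", "a"]))  # constant series
```

A constant column carries no information under this definition, while a column split evenly between two values carries one bit.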
- logml.feature_importance.sfd.add_random_features(df: pandas.core.frame.DataFrame, num_features_to_add: int, prefix='rand_') → pandas.core.frame.DataFrame
Adds random features to the dataframe.
Returns list of correlated columns.
- logml.feature_importance.sfd.make_rnd_dataset(num_random_features: int, dataset: logml.data.datasets.cv_dataset.ModelingDataset, corr_th=0.3, n_attempts=20, rnd_prefix: str = 'rand_')
Add num_random_features random features to the dataset, making sure that correlation of random features with existing data does not exceed the threshold.
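The threshold-and-retry idea described above can be sketched with rejection sampling. This is a simplified illustration, not the library's code. It assumes Pearson correlation, Gaussian noise, a plain dict of columns instead of a ModelingDataset, and a RuntimeError when all attempts fail. The helper names and the `seed` argument are hypothetical, while `corr_th`, `n_attempts`, and `rnd_prefix` mirror the documented parameters.

```python
import math
import random
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; returns 0.0 if either column is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def add_uncorrelated_random_columns(
    data: Dict[str, List[float]],
    num_random_features: int,
    corr_th: float = 0.3,
    n_attempts: int = 20,
    rnd_prefix: str = "rand_",
    seed: int = 0,
) -> Dict[str, List[float]]:
    """Return a copy of data with extra noise columns that pass the correlation check."""
    rng = random.Random(seed)
    n_rows = len(next(iter(data.values())))
    out = dict(data)
    for i in range(num_random_features):
        for _ in range(n_attempts):
            candidate = [rng.gauss(0.0, 1.0) for _ in range(n_rows)]
            # Accept only if the candidate is weakly correlated with every column so far.
            if all(abs(pearson(candidate, col)) <= corr_th for col in out.values()):
                out[f"{rnd_prefix}{i}"] = candidate
                break
        else:
            raise RuntimeError(f"no acceptable random column after {n_attempts} attempts")
    return out


data = {
    "x": [float(v) for v in range(200)],
    "y": [float((v * v) % 7) for v in range(200)],
}
augmented = add_uncorrelated_random_columns(data, num_random_features=3)
print(sorted(augmented))
```

Note that this sketch also checks each new random column against the random columns added before it, which the original description does not explicitly require.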
- class logml.feature_importance.sfd.TrialResult(loss: logml.models.base.CVMetricsResult, model: logml.models.base.BaseModel, fi_summary: pandas.core.frame.DataFrame, fis: Optional[pandas.core.frame.DataFrame] = None)
Bases: object
Result of one model training on a CV dataset.
- model: logml.models.base.BaseModel
- fi_summary: pandas.core.frame.DataFrame
- fis: pandas.core.frame.DataFrame = None
- get_mean_loss(loss_name: str) → float
- get_std_loss(loss_name: str) → float
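The two accessors above presumably summarize a loss across cross-validation folds. The structure of `CVMetricsResult` is not shown on this page, so the sketch below uses a hypothetical stand-in class that stores per-fold losses in a plain dict and assumes the sample standard deviation.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import Dict, List


@dataclass
class TrialResultSketch:
    """Hypothetical stand-in: per-fold loss values keyed by loss name."""

    fold_losses: Dict[str, List[float]]

    def get_mean_loss(self, loss_name: str) -> float:
        return mean(self.fold_losses[loss_name])

    def get_std_loss(self, loss_name: str) -> float:
        values = self.fold_losses[loss_name]
        # Sample standard deviation; a single fold has no spread to report.
        return stdev(values) if len(values) > 1 else 0.0


trial = TrialResultSketch(fold_losses={"mse": [1.0, 2.0, 3.0]})
print(trial.get_mean_loss("mse"), trial.get_std_loss("mse"))
```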
- logml.feature_importance.sfd.train_and_eval(dataset: logml.data.datasets.cv_dataset.ModelingDataset, model_cls: Type[logml.models.base.BaseModel], model_params: Optional[Dict] = None, model_fit_params: Optional[Dict] = None, fi_extractor: Optional[logml.feature_importance.base.BaseImportanceExtractor] = None, track_fi=True, logger=None, store_interim=False) → logml.feature_importance.sfd.TrialResult
Train CV model on the dataset, evaluate, and save feature importance data.
- logml.feature_importance.sfd.extract_survival_data(dataset: logml.data.datasets.cv_dataset.ModelingDataset, time_col, event_col, event_query) → Tuple[pandas.core.frame.DataFrame, logml.data.datasets.cv_dataset.ModelingDataset]
- class logml.feature_importance.sfd.FIDSResult(model_name: str, objective_cfg: logml.configuration.modeling.ModelingTaskSpec, num_initial_features: int = -1, stable_features: List[str] = None, candidate_stable_features: List[str] = None, stable_features_before_rand_threshold: List[str] = None, result_bl: logml.feature_importance.sfd.TrialResult = None, result_orig: logml.feature_importance.sfd.TrialResult = None, result_stable: logml.feature_importance.sfd.TrialResult = None, losses: pandas.core.frame.DataFrame = None, rank_composition_orig: pandas.core.frame.DataFrame = None, rank_composition_stable: pandas.core.frame.DataFrame = None, rand_rank_threshold: int = 0, status: str = '')
Bases: object
- model_name: str
- objective_cfg: logml.configuration.modeling.ModelingTaskSpec
- num_initial_features: int = -1
- stable_features: List[str] = None
- candidate_stable_features: List[str] = None
- stable_features_before_rand_threshold: List[str] = None
- result_bl: logml.feature_importance.sfd.TrialResult = None
- result_orig: logml.feature_importance.sfd.TrialResult = None
- result_stable: logml.feature_importance.sfd.TrialResult = None
- losses: pandas.core.frame.DataFrame = None
- rank_composition_orig: pandas.core.frame.DataFrame = None
- rank_composition_stable: pandas.core.frame.DataFrame = None
- rand_rank_threshold: int = 0
- status: str = ''
- property objective: logml.common.ModelingTask
- property loss_name: str
- property target_column: str
- property event_column: str
- get_rel_loss(loss_name: Optional[str] = None, objective: Optional[logml.common.ModelingTask] = None, rel='baseline')
Returns the loss of the stable model relative to the baseline.
- class logml.feature_importance.sfd.FIDS(dataset: Optional[logml.data.datasets.cv_dataset.ModelingDataset] = None, objective: Optional[logml.configuration.modeling.ModelingTaskSpec] = None, model_cls: Optional[Type[logml.models.base.BaseModel]] = None, model_params: Optional[dict] = None, baseline_model_cls: Optional[Type[logml.models.base.BaseModel]] = None, baseline_model_params: Optional[dict] = None, fi_extractor: Optional[logml.feature_importance.base.BaseImportanceExtractor] = None, fid_pvalue_threshold: float = 0.05, n_random_iters: int = 5, data_with_noise: Optional[List[logml.data.datasets.cv_dataset.ModelingDataset]] = None, n_random_features=0, rnd_feature_prefix: str = 'rand_', num_features_in_tier1_model_batch=1, logger=None, use_survival=False, survival_df=None, perform_tier1_greedy=True, store_interim=False)
Bases: object
- get_rel_loss(rel='baseline')
Returns the loss of the stable model relative to the baseline.
- get_tiered_ranks()
Get the ranked list of features for the original model, specified by Tier1 stable features.
- calc_surival()
If survival data is present, calculate the optimal cutoff for all stable features.
- fit() → logml.feature_importance.sfd.FIDSResult
Run features identification process.
- logml.feature_importance.sfd.display_ranks_plot(rank_stats, limit=None, figsize=(16, 10), draw_iqrange=True, range_std=0, rank_stats2=None, title2=None, markers=None, title='Ranks', tier1_end=None, rand_f=True, knee_point=0)
Plot Feature importance distributions (aka Green-Blue plot).
- logml.feature_importance.sfd.display_fds_features_plots(obj: logml.feature_importance.sfd.FIDSResult, df: pandas.core.frame.DataFrame, features=None, kde=False, figsize=(12, 4), fi_log_scale=False, target_col=None, use_orig=True, plot_fi_dist=True)
Display 3 plots per stable feature, explaining its relations to target and importance.
- logml.feature_importance.sfd.display_feature_evidense(i, f, obj: logml.feature_importance.sfd.FIDSResult, raw_fis, final_ranks, df, kde, target_col, plot_fi_dist, figsize, fi_log_scale, use_orig)
- logml.feature_importance.sfd.display_pdp_plot(obj: logml.feature_importance.sfd.FIDSResult, features, data, kind='both', figsize=(12, 12), limit=-1, ax=None, use_orig=False)
Plot partial dependence chart for the given feature.
- logml.feature_importance.sfd.display_fds_result(fids_result: logml.feature_importance.sfd.FIDSResult, dataset: logml.data.datasets.cv_dataset.ModelingDataset, kde=True, plot_fi_dist=True, ranks_figsize=(14, 6))
Visualize result of FIDS work.
- logml.feature_importance.sfd.display_rank_composition(rank_comp, limit=None, figsize=None, dist_threshold=-1, rand_threshold=-1, avg=False, window=5)
Display composition of ranks as real vs random features.
- class logml.feature_importance.sfd.FIDSEnv(fids_run_name='', config_path=None, output_path=None, run_name=None, stratum_name=None, problem_name=None, rnd_state=42, n_folds=100, test_size=0.25, n_models=5, n_perm_imp_iters=10, corr_thresh=0.3, perform_tier1_greedy=True)
Bases: object
Class to run FIDS manually as a continuation of the regular logml experiment.
- dump()
- classmethod load(gparams)
- load_exp_data(corr_thresh=0.5, drop_all=False, drop_cat=False, method='spearman')
- run_fids(models: Optional[list] = None)
- run_model_fids(name, model_params=None)
- class logml.feature_importance.sfd.FIDSFeatureResult
Bases: pydantic.main.BaseModel
JSON schema:
{ "title": "FIDSFeatureResult", "type": "object", "properties": { "name": { "title": "Name", "type": "string" }, "medan_rank": { "title": "Medan Rank", "default": -1.0, "type": "number" }, "num_models": { "title": "Num Models", "default": 0, "type": "integer" }, "ranks": { "title": "Ranks", "type": "object", "additionalProperties": { "type": "number" } } }, "required": [ "name", "ranks" ] }
- field name: str [Required]
- field medan_rank: float = -1.0
- field num_models: int = 0
- field ranks: Dict[str, float] [Required]
- class logml.feature_importance.sfd.FIDSModelResult
Bases: pydantic.main.BaseModel
JSON schema:
{ "title": "FIDSModelResult", "type": "object", "properties": { "name": { "title": "Name", "type": "string" }, "features": { "title": "Features", "type": "object", "additionalProperties": { "type": "object" } }, "qty_rel_baseline": { "title": "Qty Rel Baseline", "default": 0.0, "type": "number" }, "qty_rel_orig": { "title": "Qty Rel Orig", "default": 0.0, "type": "number" } }, "required": [ "name", "features" ] }
- field name: str [Required]
- field features: Dict[str, dict] [Required]
- field qty_rel_baseline: float = 0.0
- field qty_rel_orig: float = 0.0
- class logml.feature_importance.sfd.FIDSummaryResult
Bases: pydantic.main.BaseModel
JSON schema:
{ "title": "FIDSummaryResult", "type": "object", "properties": { "features": { "title": "Features", "default": [], "type": "array", "items": { "$ref": "#/definitions/FIDSFeatureResult" } }, "models": { "title": "Models", "default": [], "type": "array", "items": { "$ref": "#/definitions/FIDSModelResult" } } }, "definitions": { "FIDSFeatureResult": { "title": "FIDSFeatureResult", "type": "object", "properties": { "name": { "title": "Name", "type": "string" }, "medan_rank": { "title": "Medan Rank", "default": -1.0, "type": "number" }, "num_models": { "title": "Num Models", "default": 0, "type": "integer" }, "ranks": { "title": "Ranks", "type": "object", "additionalProperties": { "type": "number" } } }, "required": [ "name", "ranks" ] }, "FIDSModelResult": { "title": "FIDSModelResult", "type": "object", "properties": { "name": { "title": "Name", "type": "string" }, "features": { "title": "Features", "type": "object", "additionalProperties": { "type": "object" } }, "qty_rel_baseline": { "title": "Qty Rel Baseline", "default": 0.0, "type": "number" }, "qty_rel_orig": { "title": "Qty Rel Orig", "default": 0.0, "type": "number" } }, "required": [ "name", "features" ] } } }
- Fields
- field features: List[logml.feature_importance.sfd.FIDSFeatureResult] = []
- field models: List[logml.feature_importance.sfd.FIDSModelResult] = []
- logml.feature_importance.sfd.get_summary_ranks(fids: Collection[logml.feature_importance.sfd.FIDSResult], loss_name: str, objective: logml.common.ModelingTask) → Tuple[Optional[pandas.core.frame.DataFrame], Optional[pandas.core.frame.DataFrame], logml.feature_importance.sfd.FIDSummaryResult]
Aggregate result of FIDS for multiple models into one summary table.
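To illustrate the kind of aggregation this implies, the sketch below merges per-model feature ranks into records with the same field names as FIDSFeatureResult (`name`, `medan_rank`, `num_models`, `ranks`). It is an assumption that `medan_rank` is the median of a feature's ranks across the models that selected it. `summarize_ranks`, the input layout, and the sort order are hypothetical, and plain dicts are used instead of pydantic models.

```python
from statistics import median
from typing import Dict, List


def summarize_ranks(ranks_by_model: Dict[str, Dict[str, float]]) -> List[dict]:
    """Merge {model: {feature: rank}} into one record per feature."""
    per_feature: Dict[str, Dict[str, float]] = {}
    for model_name, ranks in ranks_by_model.items():
        for feature, rank in ranks.items():
            per_feature.setdefault(feature, {})[model_name] = float(rank)

    summary = [
        {
            "name": feature,
            "medan_rank": float(median(ranks.values())),  # field name as in the schema
            "num_models": len(ranks),
            "ranks": ranks,
        }
        for feature, ranks in per_feature.items()
    ]
    # Best (lowest) median rank first; ties broken by feature name.
    return sorted(summary, key=lambda row: (row["medan_rank"], row["name"]))


# Hypothetical stable-feature ranks from two models.
ranks_by_model = {
    "rf": {"age": 1, "bmi": 2},
    "lasso": {"age": 3, "dose": 1},
}
summary = summarize_ranks(ranks_by_model)
print([row["name"] for row in summary])
```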
- class logml.feature_importance.sfd.FIDSRunner(cfg: GlobalConfig, global_params: dict, model_provider: Optional[logml.model_search.provider.ModelProvider] = None, logger=None)
Bases: logml.common.BaseRunner
Feature importance extractors executor.
- FEATURE_SOURCE = 'Source'
- FEATURE_CORRELATION_GROUP = 'Correlation Group'
- aggregate_results(dump=True)
Generates high-level summaries for FIDS.
- run()
Invokes required feature importance extractors according to ‘feature_importance’ cfg section.
- run_single_model(dataset, model_name) → Optional[logml.feature_importance.sfd.FIDSResult]
- load_fids_results() → Dict[str, logml.feature_importance.sfd.FIDSResult]
Loads serialized FIDS results.