logml.data.transformers.filtering
Classes
|
For a given master set of values: 1) checks that the master set is present within the column's values; 2) removes values that are outside the master list. |
|
Provides column filtering functionality. |
Provides column filtering based on the presence of mutations. |
|
|
Provides column filtering based on mutual information with the target. |
|
Provides column filtering based on variance thresholding. |
|
Provides column filtering based on NA fraction thresholding. |
|
Provides row filtering based on the presence of NAs within target columns. |
|
Drops columns in which the prevalence of the specified values falls below the threshold. |
|
Removes correlated features based on predefined correlation groups. |
|
Provides column selection functionality. |
- class logml.data.transformers.filtering.DropColumnsTransformer(params: logml.data.config.BaseTransformerParams, metadata_cfg: logml.configuration.modeling.ModelingTaskSpec = None, cfg: GlobalConfig = None, global_params: Dict = None, logger=None)
Bases:
logml.data.base.BaseTransformer
Provides column filtering functionality.
- LABEL = 'drop_columns'
- CONFIG_CLASS
- fit(dataframe: pandas.core.frame.DataFrame, dataset_metadata: Optional[logml.data.metadata.DatasetMetadata] = None, **kwargs)
Fit by determining affected columns.
- transform(dataframe: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame
Drops columns from a given dataframe.
- update_transform_log(change: logml.data.utils.DataTransformLogItem)
Add custom data to the log.
- params: BaseTransformerParams
- global_params: Dict
- metadata_cfg: ModelingTaskSpec
- affected_columns_: List[str]
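The fit/transform split used by the transformers on this page can be illustrated with a small stand-alone sketch. The class below is hypothetical: it does not subclass `BaseTransformer`, and the `columns` argument is an assumed stand-in for whatever the real parameters object carries. It only shows the pattern of recording `affected_columns_` in `fit` and dropping them in `transform`.

```python
from typing import List, Optional

import pandas as pd


class DropColumnsSketch:
    """Hypothetical stand-in; it does not subclass logml's BaseTransformer."""

    def __init__(self, columns: List[str]):
        # Assumed configuration: the names of the columns to drop.
        self.columns = columns
        self.affected_columns_: Optional[List[str]] = None

    def fit(self, dataframe: pd.DataFrame) -> "DropColumnsSketch":
        # Only columns actually present in the dataframe are affected.
        self.affected_columns_ = [c for c in self.columns if c in dataframe.columns]
        return self

    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        if self.affected_columns_ is None:
            raise RuntimeError("call fit() before transform()")
        # errors="ignore" tolerates dataframes that lack a fitted column.
        return dataframe.drop(columns=self.affected_columns_, errors="ignore")


df = pd.DataFrame({"id": [1, 2], "age": [30, 41], "note": ["a", "b"]})
dropper = DropColumnsSketch(columns=["note", "missing_col"]).fit(df)
result = dropper.transform(df)
```

Fitting on one dataframe and transforming another keeps the set of dropped columns identical across train and test data.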
- class logml.data.transformers.filtering.DropLowVarianceColumnsTransformer(**kwargs)
Bases:
logml.data.base.BaseTransformer
Provides column filtering based on variance thresholding.
NOTE: only numerical columns are considered.
- LABEL = 'drop_low_var_columns'
- CONFIG_CLASS
- fit(dataframe: pandas.core.frame.DataFrame, dataset_metadata: Optional[logml.data.metadata.DatasetMetadata] = None, **kwargs)
Finds low-variance columns.
- transform(dataframe: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame
Drops columns from a given dataframe.
- update_transform_log(change: logml.data.utils.DataTransformLogItem)
See the parent class description.
- params: BaseTransformerParams
- global_params: Dict
- metadata_cfg: ModelingTaskSpec
- affected_columns_: List[str]
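As a rough illustration of variance thresholding restricted to numerical columns, the following pandas sketch may help. The helper name, the `threshold` argument, and the use of `<=` with the sample variance are assumptions made for the example, not the transformer's documented behaviour.

```python
from typing import List

import pandas as pd


def find_low_variance_columns(df: pd.DataFrame, threshold: float) -> List[str]:
    """Hypothetical helper: names of numeric columns with variance <= threshold."""
    # Non-numeric columns are skipped, mirroring the NOTE above.
    numeric = df.select_dtypes(include="number")
    variances = numeric.var()  # sample variance (ddof=1), pandas default
    return variances[variances <= threshold].index.tolist()


df = pd.DataFrame(
    {
        "constant": [1.0, 1.0, 1.0, 1.0],
        "varying": [0.0, 5.0, 10.0, 15.0],
        "label": ["x", "y", "x", "y"],  # non-numeric, never considered
    }
)
low_var = find_low_variance_columns(df, threshold=0.0)
filtered = df.drop(columns=low_var)
```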
- class logml.data.transformers.filtering.DropHighMutualInfoColumnsTransformer(**kwargs)
Bases:
logml.data.base.BaseTransformer
Provides column filtering based on mutual information with the target.
NOTE: only numerical columns are considered.
- LABEL = 'drop_high_mutual_info_columns'
- CONFIG_CLASS
- ESTIMATOR = {ModelingTask.CLF: <function mutual_info_classif>, ModelingTask.REG: <function mutual_info_regression>}
- get_affected_columns(dataframe: pandas.core.frame.DataFrame, dataset_metadata: Optional[logml.data.metadata.DatasetMetadata] = None) → List[str]
Returns a list of a given dataframe’s columns that would be affected by a transformer.
- fit(dataframe: pandas.core.frame.DataFrame, dataset_metadata: Optional[logml.data.metadata.DatasetMetadata] = None, **kwargs)
Save columns with high mutual information for target.
- transform(dataframe: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame
Drops columns from a given dataframe.
- params: BaseTransformerParams
- global_params: Dict
- metadata_cfg: ModelingTaskSpec
- affected_columns_: List[str]
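The `ESTIMATOR` mapping above points at scikit-learn's `mutual_info_classif` and `mutual_info_regression`. A minimal sketch of how such scores could be thresholded for a classification target is shown below. The helper name, the `threshold` argument, and the strict `>` comparison are assumptions for illustration only.

```python
from typing import List

import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif


def find_high_mutual_info_columns(
    features: pd.DataFrame, target: pd.Series, threshold: float
) -> List[str]:
    """Hypothetical helper: numeric columns whose MI with the target exceeds threshold."""
    numeric = features.select_dtypes(include="number")
    # random_state fixes the small noise scikit-learn adds to continuous features.
    scores = mutual_info_classif(numeric, target, random_state=0)
    return [col for col, score in zip(numeric.columns, scores) if score > threshold]


rng = np.random.default_rng(0)
target = pd.Series(rng.integers(0, 2, size=200))
features = pd.DataFrame(
    {
        # Nearly a copy of the target, so it carries almost all of its information.
        "leaky": target + rng.normal(0, 0.01, size=200),
        # Independent of the target.
        "noise": rng.normal(0, 1, size=200),
    }
)
high_mi = find_high_mutual_info_columns(features, target, threshold=0.3)
```

For a regression target, `mutual_info_regression` would take the place of `mutual_info_classif` in the same pattern.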
- class logml.data.transformers.filtering.DropNanColumnsTransformer(**kwargs)
Bases:
logml.data.base.BaseTransformer
Provides column filtering based on NA fraction thresholding.
- LABEL = 'drop_nan_columns'
- CONFIG_CLASS
- fit(dataframe: pandas.core.frame.DataFrame, dataset_metadata: Optional[logml.data.metadata.DatasetMetadata] = None, **kwargs)
See the parent class description.
- transform(dataframe: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame
Drops columns from a given dataframe.
- update_transform_log(change: logml.data.utils.DataTransformLogItem)
See the parent class description.
- params: BaseTransformerParams
- global_params: Dict
- metadata_cfg: ModelingTaskSpec
- affected_columns_: List[str]
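NA fraction thresholding can be sketched in a few lines of pandas. The helper name, the `threshold` argument, and the choice to drop columns whose NA fraction is strictly greater than the threshold are assumptions for this example.

```python
from typing import List

import numpy as np
import pandas as pd


def find_nan_columns(df: pd.DataFrame, threshold: float) -> List[str]:
    """Hypothetical helper: columns whose fraction of NA values exceeds threshold."""
    # isna() gives booleans; their column-wise mean is the NA fraction.
    na_fraction = df.isna().mean()
    return na_fraction[na_fraction > threshold].index.tolist()


df = pd.DataFrame(
    {
        "mostly_missing": [1.0, np.nan, np.nan, np.nan],  # NA fraction 0.75
        "mostly_present": [1.0, 2.0, 3.0, np.nan],  # NA fraction 0.25
    }
)
nan_columns = find_nan_columns(df, threshold=0.5)
filtered = df.drop(columns=nan_columns)
```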
- class logml.data.transformers.filtering.DropNanRowsTransformer(params: logml.data.config.BaseTransformerParams, metadata_cfg: logml.configuration.modeling.ModelingTaskSpec = None, cfg: GlobalConfig = None, global_params: Dict = None, logger=None)
Bases:
logml.data.base.BaseTransformer
Provides row filtering based on the presence of NAs within target columns.
- LABEL = 'drop_nan_rows'
- CONFIG_CLASS
- transform(dataframe: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame
Drops rows with NA within specified columns from a given dataframe.
- params: BaseTransformerParams
- global_params: Dict
- metadata_cfg: ModelingTaskSpec
- affected_columns_: List[str]
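Dropping rows that have NAs in specific columns corresponds closely to `DataFrame.dropna` with its `subset` argument. The sketch below uses an assumed `target_columns` list in place of the transformer's configuration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "outcome": [1.0, np.nan, 0.0, 1.0],
        "age": [30.0, 41.0, np.nan, 52.0],
    }
)

# Assumed configuration: only NAs in these columns cause a row to be dropped.
target_columns = ["outcome"]
filtered = df.dropna(subset=target_columns)
```

The row with a missing `age` survives because `age` is not among the target columns.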
- class logml.data.transformers.filtering.DropColumnsWithoutMutationsTransformer(params: logml.data.config.BaseTransformerParams, metadata_cfg: logml.configuration.modeling.ModelingTaskSpec = None, cfg: GlobalConfig = None, global_params: Dict = None, logger=None)
Bases:
logml.data.base.BaseTransformer
Provides column filtering based on the presence of mutations.
- LABEL = 'drop_columns_without_mutations'
- CONFIG_CLASS
alias of
logml.data.config.MutationsFilteringTransformerParams
- fit(dataframe: pandas.core.frame.DataFrame, dataset_metadata: Optional[logml.data.metadata.DatasetMetadata] = None, **kwargs)
Saves columns without mutations.
- transform(dataframe: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame
Drops columns from a given dataframe.
- params: BaseTransformerParams
- global_params: Dict
- metadata_cfg: ModelingTaskSpec
- affected_columns_: List[str]
- class logml.data.transformers.filtering.SelectColumnsTransformer(params: logml.data.config.BaseTransformerParams, metadata_cfg: logml.configuration.modeling.ModelingTaskSpec = None, cfg: GlobalConfig = None, global_params: Dict = None, logger=None)
Bases:
logml.data.base.BaseTransformer
Provides column selection functionality.
- LABEL = 'select_columns'
- CONFIG_CLASS
- transform(dataframe: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame
Selects only specified columns within a given dataframe.
- params: BaseTransformerParams
- global_params: Dict
- metadata_cfg: ModelingTaskSpec
- affected_columns_: List[str]
- class logml.data.transformers.filtering.PrevalenceFilteringTransformer(**kwargs)
Bases:
logml.data.base.BaseTransformer
Drops columns in which the prevalence of the specified values falls below the threshold.
Configuration class:
PrevalenceFilteringTransformerParams.
Filtering is performed as follows:
1. For a given column, count the occurrences of each value in params.values (if there is more than one value, sum the counts).
2. Divide this number by the total number of values in the column (ignoring NaNs); this gives a prevalence between 0 and 1.
3. If the prevalence is less than params.threshold, drop the column.
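The prevalence computation described above can be sketched with pandas as follows. The function name is hypothetical, the `values` and `threshold` variables stand in for `params.values` and `params.threshold`, and returning 0.0 for an all-NaN column is an assumption of this sketch.

```python
from typing import Iterable

import numpy as np
import pandas as pd


def prevalence(column: pd.Series, values: Iterable) -> float:
    """Hypothetical helper: share of non-NaN entries that equal one of `values`."""
    non_na = column.dropna()
    if non_na.empty:
        return 0.0  # assumption: an all-NaN column has zero prevalence
    return float(non_na.isin(list(values)).sum() / len(non_na))


df = pd.DataFrame(
    {
        "gene_a": [1, 0, 0, 0, np.nan],  # 1 of 4 non-NaN values -> 0.25
        "gene_b": [1, 1, 0, np.nan, np.nan],  # 2 of 3 non-NaN values -> 0.667
    }
)
values = [1]  # stands in for params.values
threshold = 0.3  # stands in for params.threshold

to_drop = [col for col in df.columns if prevalence(df[col], values) < threshold]
filtered = df.drop(columns=to_drop)
```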
- LABEL = 'prevalence_filtering'
- CONFIG_CLASS
alias of
logml.data.config.PrevalenceFilteringTransformerParams
- fit(dataframe: pandas.core.frame.DataFrame, dataset_metadata: Optional[logml.data.metadata.DatasetMetadata] = None, **kwargs)
Calculate columns’ prevalence numbers and identify which to drop.
- transform(dataframe: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame
Drops columns from a given dataframe.
- params: BaseTransformerParams
- global_params: Dict
- metadata_cfg: ModelingTaskSpec
- affected_columns_: List[str]
- class logml.data.transformers.filtering.DNASubsetFilteringTransformer(params: logml.data.config.BaseTransformerParams, metadata_cfg: logml.configuration.modeling.ModelingTaskSpec = None, cfg: GlobalConfig = None, global_params: Dict = None, logger=None)
Bases:
logml.data.base.BaseTransformer
For a given master set of values: 1) checks that the master set is present within the column’s values; 2) removes values that are outside the master list.
- LABEL = 'dna_subset_filtering'
- CONFIG_CLASS
alias of
logml.data.config.MutationsFilteringTransformerParams
- transform(dataframe: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame
Transforms a given dataframe - wipes out values that are out of master subset.
- params: BaseTransformerParams
- global_params: Dict
- metadata_cfg: ModelingTaskSpec
- affected_columns_: List[str]
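One possible reading of the two steps in the class description is sketched below. The helper name is hypothetical, and both raising an error when a master value is absent and replacing out-of-set values with NaN are assumptions; the actual transformer may handle these cases differently.

```python
from typing import Set

import pandas as pd


def filter_to_master_set(column: pd.Series, master: Set[str]) -> pd.Series:
    """Hypothetical helper: keep only values that belong to the master set."""
    # Step 1 (assumed behaviour): every master value must occur in the column.
    missing = set(master) - set(column.dropna().unique())
    if missing:
        raise ValueError(f"master values not found in column: {sorted(missing)}")
    # Step 2: values outside the master set are wiped out (set to NaN here).
    return column.where(column.isin(list(master)))


column = pd.Series(["A", "C", "G", "T", "N"])
cleaned = filter_to_master_set(column, master={"A", "C", "G", "T"})
```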
Bases:
logml.data.base.BaseTransformer
Removes correlated features based on predefined correlation groups.
Saves the correlation groups.
Keeps at most one column from each correlation group.
Update metadata according to the change made.
Add custom data to the log.
Returns correlation groups as dataframe.
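Keeping at most one column per correlation group can be sketched as below. The representation of groups as lists of column names and the rule of keeping the first column present in each group are assumptions for illustration; the source does not state which member is retained.

```python
from typing import List

import pandas as pd


def keep_one_per_group(df: pd.DataFrame, groups: List[List[str]]) -> pd.DataFrame:
    """Hypothetical helper: drop all but the first present column of each group."""
    to_drop: List[str] = []
    for group in groups:
        # Ignore group members that are absent from this dataframe.
        present = [col for col in group if col in df.columns]
        to_drop.extend(present[1:])
    return df.drop(columns=to_drop)


df = pd.DataFrame(
    {
        "height_cm": [170, 180],
        "height_in": [66.9, 70.9],
        "weight_kg": [65, 80],
        "weight_lb": [143.3, 176.4],
        "age": [30, 41],
    }
)
groups = [["height_cm", "height_in"], ["weight_kg", "weight_lb", "weight_st"]]
reduced = keep_one_per_group(df, groups)
```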