logml.data.transformers.filtering

Classes

DNASubsetFilteringTransformer(params[, ...])

For a given master set of values: 1) checks that the master set is present within the column's values; 2) removes values that are outside the master set.

DropColumnsTransformer(params[, ...])

Provides column filtering functionality.

DropColumnsWithoutMutationsTransformer(params)

Provides column filtering based on the presence of mutations within each column.

DropHighMutualInfoColumnsTransformer(**kwargs)

Provides column filtering based on mutual information with the target.

DropLowVarianceColumnsTransformer(**kwargs)

Provides column filtering based on variance thresholding.

DropNanColumnsTransformer(**kwargs)

Provides column filtering based on NA fraction thresholding.

DropNanRowsTransformer(params[, ...])

Provides row filtering based on the presence of NA values within target columns.

PrevalenceFilteringTransformer(**kwargs)

Drops columns in which the prevalence of the given values falls below the threshold.

RemoveCorrelatedColumnsTransformer(**kwargs)

Removes correlated features based on predefined correlation groups.

SelectColumnsTransformer(params[, ...])

Provides column selection functionality.

class logml.data.transformers.filtering.DropColumnsTransformer(params: logml.data.config.BaseTransformerParams, metadata_cfg: logml.configuration.modeling.ModelingTaskSpec = None, cfg: GlobalConfig = None, global_params: Dict = None, logger=None)

Bases: logml.data.base.BaseTransformer

Provides column filtering functionality.
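
The fit and transform methods documented for this class split the work in two: fit records which columns are affected, and transform drops them. The following is a minimal sketch of that pattern in plain pandas. DropColumnsSketch and its columns_to_drop argument are hypothetical stand-ins for illustration and are not part of logml; only the affected_columns_ attribute name is taken from this page.

```python
import pandas as pd


class DropColumnsSketch:
    """Hypothetical stand-in (not the logml class): drops a fixed list of columns."""

    def __init__(self, columns_to_drop):
        self.columns_to_drop = list(columns_to_drop)
        self.affected_columns_ = []

    def fit(self, dataframe: pd.DataFrame) -> "DropColumnsSketch":
        # Remember only the requested columns that actually exist in the data.
        self.affected_columns_ = [
            col for col in self.columns_to_drop if col in dataframe.columns
        ]
        return self

    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        # errors="ignore" keeps transform safe on frames that lack a column.
        return dataframe.drop(columns=self.affected_columns_, errors="ignore")


train = pd.DataFrame({"id": [1, 2], "age": [30, 41], "notes": ["a", "b"]})
sketch = DropColumnsSketch(["id", "notes", "missing_col"]).fit(train)
result = sketch.transform(train)
```

Storing the decision at fit time means the same set of columns can later be removed from any other dataframe passed to transform.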

LABEL = 'drop_columns'
CONFIG_CLASS

alias of logml.data.config.DropColumnsTransformerParams

fit(dataframe: pandas.core.frame.DataFrame, dataset_metadata: Optional[logml.data.metadata.DatasetMetadata] = None, **kwargs)

Fit by determining affected columns.

transform(dataframe: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame

Drops columns from a given dataframe.

update_transform_log(change: logml.data.utils.DataTransformLogItem)

Add custom data to the log.

params: BaseTransformerParams
global_params: Dict
metadata_cfg: ModelingTaskSpec
affected_columns_: List[str]
class logml.data.transformers.filtering.DropLowVarianceColumnsTransformer(**kwargs)

Bases: logml.data.base.BaseTransformer

Provides column filtering based on variance thresholding.

NOTE: only numerical columns are considered.
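
Variance thresholding can be sketched in plain pandas as follows. This is an illustration of the general technique, not the logml implementation: the helper name low_variance_columns, the threshold argument, and the choice of "less than or equal" as the comparison are all assumptions.

```python
import pandas as pd


def low_variance_columns(dataframe: pd.DataFrame, threshold: float) -> list:
    # Only numerical columns are considered, as noted above.
    numeric = dataframe.select_dtypes(include="number")
    # pandas computes the sample variance (ddof=1) and skips NaNs by default.
    variances = numeric.var()
    # Assumption: a column is dropped when its variance is <= threshold.
    return variances[variances <= threshold].index.tolist()


df = pd.DataFrame(
    {
        "constant": [1.0, 1.0, 1.0, 1.0],
        "varying": [0.0, 5.0, 10.0, 15.0],
        "label": ["a", "b", "a", "b"],  # non-numeric, so never considered
    }
)
to_drop = low_variance_columns(df, threshold=0.0)
filtered = df.drop(columns=to_drop)
```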

LABEL = 'drop_low_var_columns'
CONFIG_CLASS

alias of logml.data.config.FilteringTransformerParams

fit(dataframe: pandas.core.frame.DataFrame, dataset_metadata: Optional[logml.data.metadata.DatasetMetadata] = None, **kwargs)

Finds low-variance columns.

transform(dataframe: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame

Drops columns from a given dataframe.

update_transform_log(change: logml.data.utils.DataTransformLogItem)

See the parent class description.

params: BaseTransformerParams
global_params: Dict
metadata_cfg: ModelingTaskSpec
affected_columns_: List[str]
class logml.data.transformers.filtering.DropHighMutualInfoColumnsTransformer(**kwargs)

Bases: logml.data.base.BaseTransformer

Provides column filtering based on mutual information with the target.

NOTE: only numerical columns are considered.
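
The idea of dropping columns that carry too much information about the target (often a sign of leakage) can be sketched with scikit-learn's mutual_info_classif, the function named in ESTIMATOR for classification tasks. The helper name high_mutual_info_columns, the threshold argument, and the "greater than" comparison are assumptions for illustration and do not describe how logml wires this up.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif


def high_mutual_info_columns(features, target, threshold):
    # discrete_features=True treats every column as categorical, which keeps
    # the estimate deterministic for this toy data.
    scores = mutual_info_classif(features, target, discrete_features=True)
    # Assumption: a column is dropped when its score exceeds the threshold.
    return [col for col, score in zip(features.columns, scores) if score > threshold]


target = pd.Series([0, 1, 0, 1, 0, 1, 0, 1])
features = pd.DataFrame(
    {
        "leak": [0, 1, 0, 1, 0, 1, 0, 1],   # exact copy of the target
        "noise": [0, 0, 1, 1, 0, 0, 1, 1],  # statistically independent of it
    }
)
to_drop = high_mutual_info_columns(features, target, threshold=0.5)
filtered = features.drop(columns=to_drop)
```

For a regression target, mutual_info_regression from the same scikit-learn module plays the corresponding role.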

LABEL = 'drop_high_mutual_info_columns'
CONFIG_CLASS

alias of logml.data.config.FilteringTransformerParams

ESTIMATOR = {ModelingTask.CLF: <function mutual_info_classif>, ModelingTask.REG: <function mutual_info_regression>}
get_affected_columns(dataframe: pandas.core.frame.DataFrame, dataset_metadata: Optional[logml.data.metadata.DatasetMetadata] = None) → List[str]

Returns a list of the given dataframe’s columns that would be affected by the transformer.

fit(dataframe: pandas.core.frame.DataFrame, dataset_metadata: Optional[logml.data.metadata.DatasetMetadata] = None, **kwargs)

Saves the columns that have high mutual information with the target.

transform(dataframe: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame

Drops columns from a given dataframe.

params: BaseTransformerParams
global_params: Dict
metadata_cfg: ModelingTaskSpec
affected_columns_: List[str]
class logml.data.transformers.filtering.DropNanColumnsTransformer(**kwargs)

Bases: logml.data.base.BaseTransformer

Provides column filtering based on NA fraction thresholding.
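
NA fraction thresholding can be sketched in plain pandas as follows. The helper name nan_heavy_columns, the threshold argument, and the "greater than" comparison are assumptions made for illustration; the actual logml behaviour may differ.

```python
import numpy as np
import pandas as pd


def nan_heavy_columns(dataframe: pd.DataFrame, threshold: float) -> list:
    # isna().mean() gives the share of missing values in each column.
    na_fraction = dataframe.isna().mean()
    # Assumption: a column is dropped when its NA fraction exceeds the threshold.
    return na_fraction[na_fraction > threshold].index.tolist()


df = pd.DataFrame(
    {
        "mostly_missing": [np.nan, np.nan, np.nan, 1.0],  # NA fraction 0.75
        "complete": [1, 2, 3, 4],                         # NA fraction 0.0
    }
)
to_drop = nan_heavy_columns(df, threshold=0.5)
filtered = df.drop(columns=to_drop)
```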

LABEL = 'drop_nan_columns'
CONFIG_CLASS

alias of logml.data.config.FilteringTransformerParams

fit(dataframe: pandas.core.frame.DataFrame, dataset_metadata: Optional[logml.data.metadata.DatasetMetadata] = None, **kwargs)

See the parent class description.

transform(dataframe: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame

Drops columns from a given dataframe.

update_transform_log(change: logml.data.utils.DataTransformLogItem)

See the parent class description.

params: BaseTransformerParams
global_params: Dict
metadata_cfg: ModelingTaskSpec
affected_columns_: List[str]
class logml.data.transformers.filtering.DropNanRowsTransformer(params: logml.data.config.BaseTransformerParams, metadata_cfg: logml.configuration.modeling.ModelingTaskSpec = None, cfg: GlobalConfig = None, global_params: Dict = None, logger=None)

Bases: logml.data.base.BaseTransformer

Provides row filtering based on the presence of NA values within target columns.
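
In plain pandas, the equivalent operation is dropna with a subset argument. The sketch below assumes that only the columns being checked matter, so NA values elsewhere do not cause a row to be removed; the column names are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "outcome": [1.0, np.nan, 0.0],
        "feature": [np.nan, 2.0, 3.0],
    }
)
# Only NA values in the listed columns cause a row to be dropped;
# the NA in "feature" on the first row is left alone.
filtered = df.dropna(subset=["outcome"])
```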

LABEL = 'drop_nan_rows'
CONFIG_CLASS

alias of logml.data.config.DropNanRowsTransformerParams

transform(dataframe: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame

Drops rows that have NA values in the specified columns of a given dataframe.

params: BaseTransformerParams
global_params: Dict
metadata_cfg: ModelingTaskSpec
affected_columns_: List[str]
class logml.data.transformers.filtering.DropColumnsWithoutMutationsTransformer(params: logml.data.config.BaseTransformerParams, metadata_cfg: logml.configuration.modeling.ModelingTaskSpec = None, cfg: GlobalConfig = None, global_params: Dict = None, logger=None)

Bases: logml.data.base.BaseTransformer

Provides column filtering based on the presence of mutations within each column.

LABEL = 'drop_columns_without_mutations'
CONFIG_CLASS

alias of logml.data.config.MutationsFilteringTransformerParams

fit(dataframe: pandas.core.frame.DataFrame, dataset_metadata: Optional[logml.data.metadata.DatasetMetadata] = None, **kwargs)

Saves columns without mutations.

transform(dataframe: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame

Drops columns from a given dataframe.

params: BaseTransformerParams
global_params: Dict
metadata_cfg: ModelingTaskSpec
affected_columns_: List[str]
class logml.data.transformers.filtering.SelectColumnsTransformer(params: logml.data.config.BaseTransformerParams, metadata_cfg: logml.configuration.modeling.ModelingTaskSpec = None, cfg: GlobalConfig = None, global_params: Dict = None, logger=None)

Bases: logml.data.base.BaseTransformer

Provides column selection functionality.

LABEL = 'select_columns'
CONFIG_CLASS

alias of logml.data.config.BaseTransformerParams

transform(dataframe: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame

Selects only specified columns within a given dataframe.

params: BaseTransformerParams
global_params: Dict
metadata_cfg: ModelingTaskSpec
affected_columns_: List[str]
class logml.data.transformers.filtering.PrevalenceFilteringTransformer(**kwargs)

Bases: logml.data.base.BaseTransformer

Drops columns in which the prevalence of the given values falls below the threshold.

Configuration class: PrevalenceFilteringTransformerParams.

Filtering is performed as follows:

  • For a given column, count the occurrences of the values listed in params.values (if more than one value is listed, sum the counts).

  • Divide this count by the total number of values in the column (ignoring NaNs); this gives the prevalence, a number from 0 to 1.

  • If the prevalence is less than params.threshold, drop the column.
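
The three steps above can be sketched in plain pandas. The helper names prevalence and low_prevalence_columns are hypothetical, the values and threshold arguments mirror params.values and params.threshold, and returning 0.0 for a column with no non-NaN values is an assumption of this sketch.

```python
import numpy as np
import pandas as pd


def prevalence(column: pd.Series, values) -> float:
    non_null = column.dropna()  # NaNs are ignored in the denominator
    if non_null.empty:
        return 0.0  # assumption: an all-NaN column has zero prevalence
    # Counting membership in `values` sums the counts of all listed values.
    return float(non_null.isin(values).sum() / len(non_null))


def low_prevalence_columns(dataframe: pd.DataFrame, values, threshold: float) -> list:
    return [
        col
        for col in dataframe.columns
        if prevalence(dataframe[col], values) < threshold
    ]


df = pd.DataFrame(
    {
        "gene_a": [1, 0, 0, 0, np.nan],  # 1 of 4 non-NaN values
        "gene_b": [1, 1, 0, 1, 1],       # 4 of 5 values
    }
)
to_drop = low_prevalence_columns(df, values=[1], threshold=0.5)
filtered = df.drop(columns=to_drop)
```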

LABEL = 'prevalence_filtering'
CONFIG_CLASS

alias of logml.data.config.PrevalenceFilteringTransformerParams

fit(dataframe: pandas.core.frame.DataFrame, dataset_metadata: Optional[logml.data.metadata.DatasetMetadata] = None, **kwargs)

Calculate columns’ prevalence numbers and identify which to drop.

transform(dataframe: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame

Drops columns from a given dataframe.

params: BaseTransformerParams
global_params: Dict
metadata_cfg: ModelingTaskSpec
affected_columns_: List[str]
class logml.data.transformers.filtering.DNASubsetFilteringTransformer(params: logml.data.config.BaseTransformerParams, metadata_cfg: logml.configuration.modeling.ModelingTaskSpec = None, cfg: GlobalConfig = None, global_params: Dict = None, logger=None)

Bases: logml.data.base.BaseTransformer

For a given master set of values: 1) checks that the master set is present within the column’s values; 2) removes values that are outside the master set.
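
One possible reading of these two steps is sketched below in plain pandas. Several things here are assumptions rather than documented behaviour: that the check requires every master value to occur in the column, that a failed check raises an error, and that "removing" a value means replacing it with NA. The helper name filter_to_master_set is hypothetical.

```python
import pandas as pd


def filter_to_master_set(column: pd.Series, master_set) -> pd.Series:
    master = set(master_set)
    # Step 1 (assumed reading): every master value must occur in the column.
    missing = master - set(column.dropna().unique())
    if missing:
        raise ValueError(f"Master values absent from column: {sorted(missing)}")
    # Step 2 (assumed reading): values outside the master set become NA.
    return column.where(column.isin(master))


calls = pd.Series(["A", "C", "G", "T", "N", "-"])
cleaned = filter_to_master_set(calls, ["A", "C", "G", "T"])
```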

LABEL = 'dna_subset_filtering'
CONFIG_CLASS

alias of logml.data.config.MutationsFilteringTransformerParams

transform(dataframe: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame

Transforms a given dataframe by wiping out values that are outside the master subset.

params: BaseTransformerParams
global_params: Dict
metadata_cfg: ModelingTaskSpec
affected_columns_: List[str]
class logml.data.transformers.filtering.RemoveCorrelatedColumnsTransformer(**kwargs)

Bases: logml.data.base.BaseTransformer

Removes correlated features based on predefined correlation groups.
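
Keeping at most one column per predefined correlation group can be sketched in plain pandas as follows. The list-of-lists format for the groups, the helper name columns_to_remove, and the rule of keeping the first column of each group that is present in the data are assumptions of this sketch, not the documented logml behaviour.

```python
import pandas as pd


def columns_to_remove(dataframe: pd.DataFrame, correlation_groups) -> list:
    to_remove = []
    for group in correlation_groups:
        # Ignore group members that are not in this dataframe.
        present = [col for col in group if col in dataframe.columns]
        # Assumption: keep the first present column as the representative.
        to_remove.extend(present[1:])
    return to_remove


df = pd.DataFrame(
    {
        "height_cm": [170, 180],
        "height_in": [66.9, 70.9],
        "weight_kg": [65, 80],
    }
)
groups = [["height_cm", "height_in"], ["weight_kg", "weight_lb"]]
removed = columns_to_remove(df, groups)
filtered = df.drop(columns=removed)
```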

LABEL = 'remove_correlated_features'
CONFIG_CLASS

alias of logml.data.config.RemoveCorrelatedColumnsParams

fit(dataframe: pandas.core.frame.DataFrame, dataset_metadata: Optional[logml.data.metadata.DatasetMetadata] = None, **kwargs)

Saves the correlation groups.

transform(dataframe: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame

Keeps at most one column from each correlation group.

update_metadata(dataset_metadata: Optional[logml.data.metadata.DatasetMetadata] = None, dataframe: Optional[pandas.core.frame.DataFrame] = None) → None

Update metadata according to the change made.

update_transform_log(change: logml.data.utils.DataTransformLogItem)

Add custom data to the log.

corr_groups_to_df() → Optional[pandas.core.frame.DataFrame]

Returns the correlation groups as a dataframe.

params: BaseTransformerParams
global_params: Dict
metadata_cfg: ModelingTaskSpec
affected_columns_: List[str]