Data Transformers Registry

Data Transformers

Provides registry functionality for Transformers (feature extraction utilities). For implementation details see EligibleTransformers

normalize_numericals

Description

This transformer is obsolete. For implementation details see Obsolete_NN

impute_categoricals

Description

This transformer is obsolete. For implementation details see Obsolete_NN

impute_numericals

Description

This transformer is obsolete. For implementation details see Obsolete_NN

encode_categoricals

Description

This transformer is obsolete. For implementation details see Obsolete_NN

pca

Description

Provides PCA decomposition functionality. For implementation details see PCADecompositionTransformer

pca

nmf

Description

Provides NMF decomposition functionality. For implementation details see NMFDecompositionTransformer

nmf

binary_encoding

Description

Provides binary encoding functionality. For implementation details see BinaryEncodingTransformer

binary_encoding

label_encoding

Description

Provides label encoding functionality. For implementation details see LabelEncodingTransformer

label_encoding

one_hot

Description

Provides one-hot encoding functionality. For implementation details see OneHotEncodingTransformer

one_hot

multi_label_encoding

Description

Provides one-hot encoding functionality for multi-label dtypes. For implementation details see MultiLabelEncodingTransformer
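As an illustration of what one-hot encoding of multi-label values can look like, here is a minimal standard-library sketch. The function name ``multi_label_encode`` and the alphabetical column order are assumptions made for this example; this is not the code of MultiLabelEncodingTransformer.

```python
def multi_label_encode(rows):
    """Turn lists of labels into 0/1 indicator rows, one column per distinct label.

    Hypothetical helper for illustration only.
    """
    # Assumption: output columns are the distinct labels in sorted order.
    classes = sorted({label for row in rows for label in row})
    matrix = [[int(c in row) for c in classes] for row in rows]
    return classes, matrix


classes, matrix = multi_label_encode([["A", "C"], ["B"], []])
print(classes)
print(matrix)
```

Each input row may carry zero, one, or several labels, so a row can have several 1s, unlike plain one-hot encoding.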

multi_label_encoding

binarize_dna

Description

Provides binarization functionality for _DNA columns. For implementation details see DNAIndicatorsBinarizationTransformer

binarize_dna

map_encoding

Description

Encode values according to the map provided.

Sample config:

.. code-block:: yaml

    steps:
      - transformer: map_encoding
        params:
          columns_to_include:
            - .*_DNA$
          mapping:
            AMP: 0
            DEL: 1
            REARG: 2
            SNP: 3
            VUS: 4
            WT: 5
          unknown_values: -1

For implementation details see MapEncodingTransformer
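The effect of the sample config can be sketched in plain Python. The helper ``map_encode`` and the input value ``FUSION`` are hypothetical; the sketch assumes that any value missing from the mapping receives the ``unknown_values`` code, which is how the parameter name reads but is not confirmed here.

```python
def map_encode(values, mapping, unknown_value=-1):
    """Replace each value with its code; values absent from `mapping` get `unknown_value`."""
    return [mapping.get(v, unknown_value) for v in values]


mapping = {"AMP": 0, "DEL": 1, "REARG": 2, "SNP": 3, "VUS": 4, "WT": 5}
# "FUSION" is a made-up value that is not in the mapping.
encoded = map_encode(["WT", "SNP", "FUSION", "AMP"], mapping)
print(encoded)
```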

map_encoding

bucketize

Description

Encodes values based on a list of buckets (intervals + labels).

Sample config:

.. code-block:: yaml

    steps:
      - transformer: bucketize
        params:
          columns_to_include:
            - PDL1_score
          suffix: _1_50_bucketized
          buckets:
            - left_bound: 0
              right_bound: 1
              alias: '<1%'
            - left_bound: 1
              right_bound: 50
              alias: '>=1%-<50%'
            - left_bound: 50
              right_bound: 100
              alias: '>=50%'
          remove_base_columns: True

For implementation details see BucketizeTransformer
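A rough sketch of interval-based bucketing follows. It assumes the left bound is inclusive and the right bound exclusive, which matches aliases such as ``>=1%-<50%`` but is not confirmed by the source. How a value equal to the last right bound (100) is treated is likewise unknown, and the helper ``bucketize`` is hypothetical.

```python
def bucketize(value, buckets):
    """Return the alias of the first bucket containing `value`, or None if no bucket matches."""
    for bucket in buckets:
        # Assumption: intervals are [left_bound, right_bound).
        if bucket["left_bound"] <= value < bucket["right_bound"]:
            return bucket["alias"]
    return None


buckets = [
    {"left_bound": 0, "right_bound": 1, "alias": "<1%"},
    {"left_bound": 1, "right_bound": 50, "alias": ">=1%-<50%"},
    {"left_bound": 50, "right_bound": 100, "alias": ">=50%"},
]
labels = [bucketize(v, buckets) for v in [0.5, 1, 49.9, 50]]
print(labels)
```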

bucketize

encode_datetime

Description

Encode datetime columns for ML. A datetime column produces three new columns:

- {colname}_year_rel: relative number of years since the minimal value of the column.
- {colname}_year_day_sin and {colname}_year_day_cos: day of the year, cyclically encoded.

For implementation details see DateTimeEncodingTransformer
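The cyclic encoding can be sketched with the standard library. Two choices below are assumptions: the relative year is computed as a calendar-year difference, and the day of the year is mapped onto a circle with a period of 365.25 days. DateTimeEncodingTransformer may differ on both, and the function name is hypothetical.

```python
import math
from datetime import datetime


def encode_datetime(values):
    """Return (year_rel, year_day_sin, year_day_cos) for each datetime in `values`."""
    min_year = min(v.year for v in values)
    rows = []
    for v in values:
        day_of_year = v.timetuple().tm_yday  # 1..366
        # Assumption: one full turn of the circle per 365.25 days.
        angle = 2 * math.pi * day_of_year / 365.25
        rows.append((v.year - min_year, math.sin(angle), math.cos(angle)))
    return rows


rows = encode_datetime([datetime(2019, 12, 31), datetime(2021, 1, 1)])
for row in rows:
    print(row)
```

The sine and cosine pair places 31 December and 1 January next to each other on the unit circle, which a raw day-of-year number (365 versus 1) would not.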

encode_datetime

drop_columns

Description

Provides columns filtering functionality. For implementation details see DropColumnsTransformer

drop_columns

drop_low_var_columns

Description

Provides columns filtering based on variance thresholding. NOTE: only numerical columns are considered. For implementation details see DropLowVarianceColumnsTransformer

drop_low_var_columns

drop_high_mutual_info_columns

Description

Provides columns filtering based on mutual information for target. NOTE: only numerical columns are considered. For implementation details see DropHighMutualInfoColumnsTransformer

drop_high_mutual_info_columns

drop_nan_columns

Description

Provides columns filtering based on NA fraction thresholding. For implementation details see DropNanColumnsTransformer

drop_nan_columns

drop_nan_rows

Description

Provides rows filtering based on NAs presence within target columns. For implementation details see DropNanRowsTransformer

drop_nan_rows

drop_columns_without_mutations

Description

Provides columns filtering based on mutations presence within. For implementation details see DropColumnsWithoutMutationsTransformer

drop_columns_without_mutations

select_columns

Description

Provides columns selection functionality. For implementation details see SelectColumnsTransformer

select_columns

prevalence_filtering

Description

Drops columns for which the prevalence of the given values falls below the threshold. Configuration class: PrevalenceFilteringTransformerParams.

Filtering is performed as follows:

- for a given column, count the occurrences of params.values (if there is more than one value, sum the counts).
- divide this number by the total number of values in the column (ignoring NaNs); this gives the prevalence, a number from 0 to 1.
- if the prevalence is less than params.threshold, drop the column.

For implementation details see PrevalenceFilteringTransformer
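The filtering steps above can be sketched as follows. The helper names, the dict-of-lists table layout, the column names, and the handling of a column with no non-missing values (prevalence 0) are assumptions made for this example.

```python
import math


def is_missing(value):
    """True for None and float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def prevalence(column, values):
    """Share of non-missing entries of `column` that belong to `values`."""
    present = [v for v in column if not is_missing(v)]
    if not present:
        return 0.0  # assumption: a column with no data has zero prevalence
    return sum(v in values for v in present) / len(present)


def filter_by_prevalence(table, values, threshold):
    """Keep only the columns whose prevalence is not less than `threshold`."""
    return {name: col for name, col in table.items()
            if prevalence(col, values) >= threshold}


table = {
    "TP53_DNA": ["SNP", "WT", "SNP", math.nan],  # 2 of 3 non-missing values match
    "KRAS_DNA": ["WT", "WT", "WT", "AMP"],       # 1 of 4 values matches
}
kept = filter_by_prevalence(table, values={"SNP", "AMP"}, threshold=0.5)
print(list(kept))
```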

prevalence_filtering

dna_subset_filtering

Description

For a given master set of values:

1) checks that the master set is present within the column’s values
2) removes values that are outside of the master list

For implementation details see DNASubsetFilteringTransformer

dna_subset_filtering

remove_correlated_features

Description

Removes correlated features based on predefined correlation groups. For implementation details see RemoveCorrelatedColumnsTransformer

remove_correlated_features

impute

Description

Provides imputation functionality. For implementation details see SimpleImputeTransformer

impute

impute_mice

Description

Provides MICE imputation functionality (Multivariate Imputation by Chained Equations). NOTE: affected columns are additionally filtered to be numerical only. For implementation details see MICEImputeTransformer

impute_mice

replace_value

Description

Replace values according to the map provided. Values not listed in the map are not affected.

Note: YAML natively supports special values like .nan, see https://yaml.org/spec/1.2.2/

Sample config:

.. code-block:: yaml

    steps:
      - transformer: replace_value
        params:
          columns_to_include:
            - .*_DNA$
          mapping:
            # combine two categories into the same
            VUS: 'VUS_WT'
            WT: 'VUS_WT'

For implementation details see ReplaceValueTransformer

replace_value

log1p

Description

Applies ‘log1p’ transformation. For implementation details see Log1pLambdaTransformer

log1p

log

Description

Applies ‘log’ transformation. For implementation details see LogLambdaTransformer

log

binarization

Description

Binarizes all target columns using a given threshold. For implementation details see BinarizationLambdaTransformer

binarization

resolve_multiple_choice

Description

Resolves multi-value issue for list-type columns. For implementation details see ResolveMultipleChoiceTransformer

resolve_multiple_choice

convert_to_float

Description

Converts column values to floats by removing special characters and parsing the result. If a value cannot be cast, it is replaced with NaN. For implementation details see ConvertToFloatTransformer
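A possible shape of such a conversion is sketched below. Which characters count as "special" is not stated in the source, so the sketch assumes that everything except digits, the decimal point, and the sign is stripped; the function name ``to_float`` is hypothetical.

```python
import math
import re


def to_float(value):
    """Strip non-numeric characters and parse the rest; return NaN if parsing fails."""
    # Assumption: only digits, '.', '+' and '-' are kept.
    cleaned = re.sub(r"[^0-9.+-]", "", str(value))
    try:
        return float(cleaned)
    except ValueError:
        return math.nan


converted = [to_float(v) for v in ["45%", "1,200", "<0.5", "n/a"]]
print(converted)
```

Note that this simple rule also discards exponents, so a string in scientific notation would not survive; a stricter implementation would need a more careful pattern.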

convert_to_float

query_to_bool

Description

Transforms a column to boolean using a query. Puts 1 where the query result is True, 0 otherwise.

Sample config:

.. code-block:: yaml

    data_preprocessing:
      steps:
        - transformer: query_to_bool
          params:
            columns_to_include: ['single_column_here']
            query: "single_column_here == 'YES'"

For implementation details see QueryBooleanTransformer

query_to_bool

sanitize_columns

Description

Simple metadata transformation. For implementation details see SanitizeColumnsTransformer

standard

Description

Standard normalization. For implementation details see StandardNormalizationTransformer

standard

maxabs

Description

MaxAbs normalization. For implementation details see MaxAbsNormalizationTransformer

maxabs

minmax

Description

MinMax normalization. For implementation details see MinMaxNormalizationTransformer

minmax

minmax_neg

Description

MinMax normalization (result range is [-1; 1]). For implementation details see MinMaxNegNormalizationTransformer
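For reference, min-max scaling to a symmetric range can be written as x' = 2 * (x - min) / (max - min) - 1. The sketch below assumes the intended target range is [-1, 1] and maps a constant column to 0; both are assumptions, as is the function name.

```python
def minmax_neg(values):
    """Linearly rescale `values` so the minimum maps to -1 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no spread, so map it to 0.
        return [0.0 for _ in values]
    return [2 * (v - lo) / (hi - lo) - 1 for v in values]


scaled = minmax_neg([0, 5, 10])
print(scaled)
```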

minmax_neg

log_standard

Description

Log transformation + standard normalization. For implementation details see LogStandardNormalizationTransformer

log_standard

log1p_standard

Description

Log1p transformation + standard normalization. For implementation details see Log1pStandardNormalizationTransformer
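The two-step composition can be sketched with the standard library. The sketch assumes the population standard deviation is used for scaling and that a constant column is mapped to zeros; the actual transformer may make different choices, and the function name is hypothetical.

```python
import math
import statistics


def log1p_standard(values):
    """Apply log(1 + x), then shift to zero mean and scale to unit variance."""
    logged = [math.log1p(v) for v in values]
    mean = statistics.fmean(logged)
    std = statistics.pstdev(logged)  # assumption: population standard deviation
    if std == 0:
        return [0.0 for _ in logged]
    return [(x - mean) / std for x in logged]


normalized = log1p_standard([0, 1, 10, 100])
print(normalized)
```

Using log1p rather than log keeps zero values valid, since log1p(0) is 0 while log(0) is undefined.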

log1p_standard

shuffle

Description

Randomly permutes a given dataset’s rows. For implementation details see ShuffleTransformer

add_random_columns

Description

For a given set of target columns, creates additional ‘randomized’ columns for later significance assessment. For implementation details see AddRandomColumnsTransformer

add_random_columns

upsampling

Description

Implements ‘upsampling’ method for balancing classes. For implementation details see UpsamplingTransformer
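One common way to upsample is random oversampling: rows of the smaller classes are drawn with replacement until every class is as large as the biggest one. The sketch below illustrates that idea only; whether UpsamplingTransformer uses this scheme is an assumption, and the function name and ``seed`` parameter are hypothetical.

```python
import random
from collections import Counter, defaultdict


def upsample(rows, labels, seed=0):
    """Randomly duplicate rows of smaller classes until all classes have equal size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Draw the missing rows with replacement from the same class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


rows = [[1], [2], [3], [4], [5]]
labels = ["a", "a", "a", "a", "b"]
up_rows, up_labels = upsample(rows, labels)
print(Counter(up_labels))
```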

upsampling

downsampling

Description

Implements ‘downsampling’ method for balancing classes. For implementation details see DownsamplingTransformer

downsampling