Data Transformers Registry
- Data Transformers
Provides registry functionality for Transformers (feature extraction utilities). For implementation details see
EligibleTransformers
normalize_numericals
- Description
This transformer is obsolete. For implementation details see
Obsolete_NN
impute_categoricals
- Description
This transformer is obsolete. For implementation details see
Obsolete_NN
impute_numericals
- Description
This transformer is obsolete. For implementation details see
Obsolete_NN
encode_categoricals
- Description
This transformer is obsolete. For implementation details see
Obsolete_NN
pca
- Description
Provides PCA decomposition functionality. For implementation details see
PCADecompositionTransformer
pca
nmf
- Description
Provides NMF decomposition functionality. For implementation details see
NMFDecompositionTransformer
nmf
binary_encoding
- Description
Provides binary encoding functionality. For implementation details see
BinaryEncodingTransformer
binary_encoding
label_encoding
- Description
Provides label encoding functionality. For implementation details see
LabelEncodingTransformer
label_encoding
one_hot
- Description
Provides one-hot encoding functionality. For implementation details see
OneHotEncodingTransformer
one_hot
multi_label_encoding
- Description
Provides one-hot encoding functionality for multi-label dtypes. For implementation details see
MultiLabelEncodingTransformer
multi_label_encoding
binarize_dna
- Description
Provides binarization functionality for _DNA columns. For implementation details see
DNAIndicatorsBinarizationTransformer
binarize_dna
map_encoding
- Description
Encodes values according to the map provided. Sample config:

.. code-block:: yaml

    steps:
      - transformer: map_encoding
        params:
          columns_to_include:
            - .*_DNA$
          mapping:
            AMP: 0
            DEL: 1
            REARG: 2
            SNP: 3
            VUS: 4
            WT: 5
          unknown_values: -1

For implementation details see
MapEncodingTransformer
map_encoding
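As a rough illustration of the mapping semantics in the sample config, the sketch below encodes a list of values with a dictionary and falls back to a default code for values missing from the map. The helper name ``map_encode`` and the reading of ``unknown_values`` as a fallback code are assumptions made for illustration; they are not taken from MapEncodingTransformer.

```python
from typing import Hashable, Sequence


def map_encode(values: Sequence[Hashable], mapping: dict, unknown_value=-1) -> list:
    """Encode each value via `mapping`; unmapped values get `unknown_value`.

    Hypothetical helper: the fallback behaviour is an assumption, not the
    documented behaviour of MapEncodingTransformer.
    """
    return [mapping.get(value, unknown_value) for value in values]


# Mapping copied from the sample config above.
dna_mapping = {"AMP": 0, "DEL": 1, "REARG": 2, "SNP": 3, "VUS": 4, "WT": 5}

# "FUSION" is not in the mapping, so it receives the fallback code.
encoded = map_encode(["WT", "SNP", "FUSION"], dna_mapping)
```

Under these assumptions, a value outside the map is never dropped; it is only assigned the fallback code.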
bucketize
- Description
Encodes values based on a list of buckets (intervals + labels). Sample config:

.. code-block:: yaml

    steps:
      - transformer: bucketize
        params:
          columns_to_include:
            - PDL1_score
          suffix: _1_50_bucketized
          buckets:
            - left_bound: 0
              right_bound: 1
              alias: '<1%'
            - left_bound: 1
              right_bound: 50
              alias: '>=1%-<50%'
            - left_bound: 50
              right_bound: 100
              alias: '>=50%'
          remove_base_columns: True

For implementation details see
BucketizeTransformer
bucketize
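The bucket lookup described above can be sketched in a few lines of plain Python. The helper name ``bucketize_value`` is hypothetical, and the half-open interval convention ``[left_bound, right_bound)`` is an assumption inferred from the aliases in the sample config; BucketizeTransformer may treat the bounds differently.

```python
from typing import Optional


def bucketize_value(value: float, buckets: list) -> Optional[str]:
    """Return the alias of the first bucket containing `value`, else None.

    Assumption: each interval is half-open, [left_bound, right_bound).
    """
    for bucket in buckets:
        if bucket["left_bound"] <= value < bucket["right_bound"]:
            return bucket["alias"]
    return None


# Buckets copied from the sample config above.
pdl1_buckets = [
    {"left_bound": 0, "right_bound": 1, "alias": "<1%"},
    {"left_bound": 1, "right_bound": 50, "alias": ">=1%-<50%"},
    {"left_bound": 50, "right_bound": 100, "alias": ">=50%"},
]

labels = [bucketize_value(score, pdl1_buckets) for score in (0.5, 1, 49.9, 50)]
```

With half-open intervals, a score of exactly 100 matches no bucket in this sketch and yields ``None``; whether the real transformer includes the upper bound of the last bucket is not stated here.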
encode_datetime
- Description
Encodes datetime columns for ML. A datetime column produces three new columns:

- {colname}_year_rel: relative number of years since the minimal value of the column.
- {colname}_year_day_sin and {colname}_year_day_cos: day of the year, cyclically encoded.

For implementation details see
DateTimeEncodingTransformer
encode_datetime
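The three derived columns can be illustrated with a standard-library sketch. The function name ``encode_datetimes``, the 365.25-day year, and the exact angle formula are assumptions chosen for illustration; the formulas used by DateTimeEncodingTransformer may differ.

```python
import math
from datetime import datetime

DAYS_PER_YEAR = 365.25  # assumption: average year length


def encode_datetimes(values: list) -> list:
    """Return one dict per datetime with year_rel, year_day_sin, year_day_cos."""
    origin = min(values)  # the minimal value of the column
    rows = []
    for value in values:
        day_of_year = value.timetuple().tm_yday  # 1..366
        # Map the day of the year onto a circle so that 31 Dec and 1 Jan are close.
        angle = 2 * math.pi * (day_of_year - 1) / DAYS_PER_YEAR
        rows.append(
            {
                "year_rel": (value - origin).days / DAYS_PER_YEAR,
                "year_day_sin": math.sin(angle),
                "year_day_cos": math.cos(angle),
            }
        )
    return rows


rows = encode_datetimes([datetime(2020, 1, 1), datetime(2021, 1, 1)])
```

The sine and cosine pair is used instead of the raw day number so that the end of one year and the start of the next map to neighbouring points.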
drop_columns
- Description
Provides column filtering functionality. For implementation details see
DropColumnsTransformer
drop_columns
drop_low_var_columns
- Description
Provides column filtering based on variance thresholding. NOTE: only numerical columns are considered. For implementation details see
DropLowVarianceColumnsTransformer
drop_low_var_columns
drop_high_mutual_info_columns
- Description
Provides column filtering based on mutual information with the target. NOTE: only numerical columns are considered. For implementation details see
DropHighMutualInfoColumnsTransformer
drop_high_mutual_info_columns
drop_nan_columns
- Description
Provides column filtering based on NA fraction thresholding. For implementation details see
DropNanColumnsTransformer
drop_nan_columns
drop_nan_rows
- Description
Provides row filtering based on the presence of NAs within target columns. For implementation details see
DropNanRowsTransformer
drop_nan_rows
drop_columns_without_mutations
- Description
Provides column filtering based on the presence of mutations within each column. For implementation details see
DropColumnsWithoutMutationsTransformer
drop_columns_without_mutations
select_columns
- Description
Provides column selection functionality. For implementation details see
SelectColumnsTransformer
select_columns
prevalence_filtering
- Description
Drops columns in which the prevalence of the given values falls below the threshold. Configuration class:
PrevalenceFilteringTransformerParams

The filter is performed as follows:

- For a given column, count the occurrences of the values in params.values (if there is more than one value, sum the counts).
- Divide this count by the total number of values in the column (ignoring NaNs); this gives a prevalence between 0 and 1.
- If the prevalence is less than params.threshold, drop the column.

For implementation details see
PrevalenceFilteringTransformer
prevalence_filtering
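The filtering steps listed above can be sketched with plain Python containers. The helper names ``prevalence`` and ``filter_by_prevalence``, the dict-of-lists table layout, and the choice to report a prevalence of 0.0 for a column with no non-missing values are all assumptions for illustration, not details of PrevalenceFilteringTransformer.

```python
import math


def _is_missing(value) -> bool:
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def prevalence(column: list, values: set) -> float:
    """Fraction of non-missing entries of `column` that belong to `values`."""
    present = [v for v in column if not _is_missing(v)]
    if not present:
        return 0.0  # assumption: an all-missing column has zero prevalence
    hits = sum(1 for v in present if v in values)
    return hits / len(present)


def filter_by_prevalence(table: dict, values: set, threshold: float) -> dict:
    """Keep only the columns whose prevalence is at least `threshold`."""
    return {
        name: column
        for name, column in table.items()
        if prevalence(column, values) >= threshold
    }


# Hypothetical example data.
table = {
    "GENE_A_DNA": ["SNP", "WT", "WT", float("nan"), "AMP"],  # 2 of 4 -> 0.5
    "GENE_B_DNA": ["WT", "WT", "WT", "WT", "SNP"],           # 1 of 5 -> 0.2
}
kept = filter_by_prevalence(table, values={"SNP", "AMP"}, threshold=0.3)
```

In this example only ``GENE_A_DNA`` survives, because the NaN entry is excluded from its denominator.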
dna_subset_filtering
- Description
For a given master set of values: 1) checks that the master set is present within the column’s values; 2) removes values that are not in the master list. For implementation details see
DNASubsetFilteringTransformer
dna_subset_filtering
remove_correlated_features
- Description
Removes correlated features based on predefined correlation groups. For implementation details see
RemoveCorrelatedColumnsTransformer
remove_correlated_features
impute
- Description
Provides imputation functionality. For implementation details see
SimpleImputeTransformer
impute
impute_mice
- Description
Provides MICE imputation functionality (Multivariate Imputation by Chained Equations). NOTE: affected columns are additionally filtered to be numerical only. For implementation details see
MICEImputeTransformer
impute_mice
replace_value
- Description
Replaces values according to the map provided. Values that are not listed are not affected.

Note: YAML natively supports special values like .nan, see https://yaml.org/spec/1.2.2/

Sample config:

.. code-block:: yaml

    steps:
      - transformer: replace_value
        params:
          columns_to_include:
            - .*_DNA$
          mapping:
            # combine two categories into one
            VUS: 'VUS_WT'
            WT: 'VUS_WT'

For implementation details see
ReplaceValueTransformer
replace_value
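The pass-through behaviour for unlisted values can be shown with a minimal sketch. The helper name ``replace_values`` is hypothetical and the code operates on a plain list rather than on whatever column type ReplaceValueTransformer actually handles.

```python
def replace_values(values: list, mapping: dict) -> list:
    """Replace values found in `mapping`; leave every other value untouched."""
    return [mapping.get(value, value) for value in values]


# Mapping copied from the sample config above: merge VUS and WT into one category.
merge_vus_wt = {"VUS": "VUS_WT", "WT": "VUS_WT"}

replaced = replace_values(["VUS", "WT", "SNP"], merge_vus_wt)
```

Here ``SNP`` is returned unchanged because it does not appear in the mapping.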
log1p
- Description
Applies ‘log1p’ transformation. For implementation details see
Log1pLambdaTransformer
log1p
log
- Description
Applies ‘log’ transformation. For implementation details see
LogLambdaTransformer
log
binarization
- Description
Binarizes all target columns using a given threshold. For implementation details see
BinarizationLambdaTransformer
binarization
resolve_multiple_choice
- Description
Resolves multi-value issue for list-type columns. For implementation details see
ResolveMultipleChoiceTransformer
resolve_multiple_choice
convert_to_float
- Description
Converts column values to floats by removing special characters and parsing the result. If a value cannot be cast, it is replaced with NaN. For implementation details see
ConvertToFloatTransformer
convert_to_float
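A minimal sketch of this kind of conversion is shown below. The helper name ``to_float`` and the set of characters treated as "special" (comparison signs, percent signs, thousands separators, whitespace) are assumptions; the characters handled by ConvertToFloatTransformer are not listed here.

```python
import math
import re

# Assumption: these are the characters to strip before parsing.
_SPECIAL_CHARS = re.compile(r"[<>=%,\s]")


def to_float(value) -> float:
    """Strip special characters and parse as float; return NaN on failure."""
    cleaned = _SPECIAL_CHARS.sub("", str(value))
    try:
        return float(cleaned)
    except ValueError:
        return math.nan


converted = [to_float(raw) for raw in ["<5%", "1,250", " 3.5 ", "n/a"]]
```

The first three inputs parse to 5.0, 1250.0 and 3.5, while ``"n/a"`` cannot be cast and becomes NaN.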
query_to_bool
- Description
Transforms a column to boolean using a query. Puts 1 where the query result is True, 0 otherwise. Sample config:

.. code-block:: yaml

    data_preprocessing:
      steps:
        - transformer: query_to_bool
          params:
            columns_to_include: ['single_column_here']
            query: "single_column_here == 'YES'"

For implementation details see
QueryBooleanTransformer
query_to_bool
sanitize_columns
- Description
Simple metadata transformation. For implementation details see
SanitizeColumnsTransformer
standard
- Description
Standard normalization. For implementation details see
StandardNormalizationTransformer
standard
maxabs
- Description
MaxAbs normalization. For implementation details see
MaxAbsNormalizationTransformer
maxabs
minmax
- Description
MinMax normalization. For implementation details see
MinMaxNormalizationTransformer
minmax
minmax_neg
- Description
MinMax normalization (result range is [-1; 1]). For implementation details see
MinMaxNegNormalizationTransformer
minmax_neg
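Rescaling to the range [-1, 1] is commonly written as ``2 * (x - min) / (max - min) - 1``. The sketch below applies that formula to a plain list; the helper name ``minmax_neg_scale`` and the handling of a constant column (mapped to 0.0) are assumptions, not behaviour confirmed for MinMaxNegNormalizationTransformer.

```python
def minmax_neg_scale(values: list) -> list:
    """Linearly rescale `values` so the minimum maps to -1 and the maximum to 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Assumption: a constant column has no spread, so map it to the midpoint.
        return [0.0 for _ in values]
    span = highest - lowest
    return [2 * (value - lowest) / span - 1 for value in values]


scaled = minmax_neg_scale([0, 5, 10])
```

For the input ``[0, 5, 10]`` the result is ``[-1.0, 0.0, 1.0]``.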
log_standard
- Description
Log transformation + standard normalization. For implementation details see
LogStandardNormalizationTransformer
log_standard
log1p_standard
- Description
Log1p transformation + standard normalization. For implementation details see
Log1pStandardNormalizationTransformer
log1p_standard
shuffle
- Description
Randomly permutes a given dataset’s rows. For implementation details see
ShuffleTransformer
add_random_columns
- Description
For a given set of target columns, creates additional ‘randomized’ columns for later significance assessment. For implementation details see
AddRandomColumnsTransformer
add_random_columns
upsampling
- Description
Implements ‘upsampling’ method for balancing classes. For implementation details see
UpsamplingTransformer
upsampling
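One common way to upsample is random oversampling: rows of the smaller classes are drawn with replacement until every class matches the size of the largest one. The sketch below shows that idea only; the helper name ``upsample``, the list-of-dicts row layout, and the choice of random oversampling itself are assumptions, since the method used by UpsamplingTransformer is not described here.

```python
import random
from collections import defaultdict


def upsample(rows: list, label_key: str, seed: int = 0) -> list:
    """Randomly oversample smaller classes up to the size of the largest class."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw the missing rows with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Hypothetical example: four rows of class 0 and one row of class 1.
rows = [{"y": 0, "x": i} for i in range(4)] + [{"y": 1, "x": 99}]
balanced = upsample(rows, label_key="y")
class_counts = {label: sum(1 for r in balanced if r["y"] == label) for label in (0, 1)}
```

Downsampling is the mirror image under the same assumptions: larger classes are randomly reduced to the size of the smallest one.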
downsampling
- Description
Implements ‘downsampling’ method for balancing classes. For implementation details see
DownsamplingTransformer
downsampling