logml.analysis.items.data_transform
Classes
- DataTransformer: Wraps the analysis algorithm with data loading/saving and configuration handling.
- DataTransformerResult: Result of the transformer step (usually a single dataframe).
- class logml.analysis.items.data_transform.DataTransformerResult(data: Optional[logml.data.datasets.base.BaseDataset] = None)
Bases: object
Result of transformer step (usually a single dataframe)
- data: logml.data.datasets.base.BaseDataset = None
- class logml.analysis.items.data_transform.DataTransformerConfig
Bases: pydantic.main.BaseModel
Transformer step config: invokes data preprocessing.
JSON schema:
{ "title": "DataTransformerConfig", "description": "Transformer step config: invokes data preprocessing.", "type": "object", "properties": { "input_ref": { "title": "Input Ref", "description": "Refers to a source of data. Currently only default source is supported.", "default": "", "type": "string" }, "data_preprocessing": { "title": "Data Preprocessing", "default": { "enable": true, "preset": { "enable": false, "features_list": [], "remove_correlated_features": true, "nans_per_row_fraction_threshold": 0.9, "nans_fraction_threshold": 0.7, "apply_log1p_to_target": false, "drop_datetime_columns": true, "drop_dna_wt": false, "imputer": "median" }, "steps": [] }, "allOf": [ { "$ref": "#/definitions/DatasetPreprocessingSection" } ] }, "dataset_metadata": { "$ref": "#/definitions/DatasetMetadataSection" }, "modeling_task": { "$ref": "#/definitions/ModelingTaskSpec" }, "modeling_ref": { "title": "Modeling Ref", "type": "string" }, "dataset_type": { "title": "Dataset Type", "default": "base_dataset", "type": "string" } }, "definitions": { "DatasetPreprocessingPresetSection": { "title": "DatasetPreprocessingPresetSection", "description": "Defines 'syntax sugar' for semi-automated data preprocessing steps generation.", "type": "object", "properties": { "enable": { "title": "Enable", "description": "Whether to enable automated generation of preprocessing steps.", "default": true, "type": "boolean" }, "features_list": { "title": "Features List", "description": "Defines a list of features (referenced by regexps) that should be selected. 
Additional option\n is just to reference a configuration file that contains the required list of features:\n ...\n features_list: sub_cfg/features_list.yaml # a config file\n ...\n ", "default": [], "anyOf": [ { "type": "string" }, { "type": "array", "items": { "type": "string" } } ] }, "remove_correlated_features": { "title": "Remove Correlated Features", "description": "Whether to include a step that removes correlated features.", "default": true, "type": "boolean" }, "nans_per_row_fraction_threshold": { "title": "Nans Per Row Fraction Threshold", "description": "Defines maximum acceptable fraction of NaNs within a row.", "default": 0.9, "type": "number" }, "nans_fraction_threshold": { "title": "Nans Fraction Threshold", "description": "Defines maximum acceptable fraction of NaNs within a column.", "default": 0.7, "type": "number" }, "apply_log1p_to_target": { "title": "Apply Log1P To Target", "description": "Whether to apply log1p transformation to target column (applicable only for regression problems).", "default": false, "type": "boolean" }, "drop_datetime_columns": { "title": "Drop Datetime Columns", "description": "Whether to drop date time columns.", "default": true, "type": "boolean" }, "drop_dna_wt": { "title": "Drop Dna Wt", "description": "Whether to drop DNA WT values after one-hot-encoding.", "default": false, "type": "boolean" }, "imputer": { "title": "Imputer", "description": "Imputer to use. Possible values: (median, mice)", "default": "median", "type": "string" } } }, "PreprocessingStep": { "title": "PreprocessingStep", "description": "Defines data preprocessing step.", "type": "object", "properties": { "enable": { "title": "Enable", "description": "Whether to enable preprocessing step.", "default": true, "type": "boolean" }, "transformer": { "title": "Transformer", "description": "Alias of transformer to use. 
Please refer to :lml:ref:`Data Transformers` for details.", "type": "string" }, "params": { "title": "Params", "description": "Parameters that will be passed to the corresponding transformer instance.", "default": {}, "type": "object" } }, "required": [ "transformer" ] }, "DatasetPreprocessingSection": { "title": "DatasetPreprocessingSection", "description": "Defines data preprocessing section for modeling/survival setup.", "type": "object", "properties": { "enable": { "title": "Enable", "description": "Whether to enable Preprocessing Pipeline for dataset transformation.", "default": true, "type": "boolean" }, "preset": { "title": "Preset", "default": { "enable": false, "features_list": [], "remove_correlated_features": true, "nans_per_row_fraction_threshold": 0.9, "nans_fraction_threshold": 0.7, "apply_log1p_to_target": false, "drop_datetime_columns": true, "drop_dna_wt": false, "imputer": "median" }, "allOf": [ { "$ref": "#/definitions/DatasetPreprocessingPresetSection" } ] }, "steps": { "title": "Steps", "description": "Defines a list of preprocessing steps (transformations) to apply. See :lml:ref:`Data Transformers` for details.", "default": [], "type": "array", "items": { "$ref": "#/definitions/PreprocessingStep" } } } }, "SurvivalTimeSpec": { "title": "SurvivalTimeSpec", "description": "Configure right-censored time-to-event columns in the dataset", "type": "object", "properties": { "time_column": { "title": "Time Column", "description": "Column name that contains time-to-event values (usually OS or PFS).", "type": "string" }, "event_query": { "title": "Event Query", "description": "Query-like expression that indicates \"events\" (\"uncensored\") samples. For example: \"OS_CNSR == 1\". See :ref:`Dataset Queries` for details.", "type": "string" }, "event_column": { "title": "Event Column", "description": "Column used for event calculation. (We have to specify it so that it can be removed from the features list after the dataset preprocessing). 
If you specify `event_query: \"OS_CNSR == 1\"`, then also put `event_column: OS_CNSR`. ", "type": "string" }, "target_metric": { "title": "Target Metric", "description": "Metric (loss) that will be used to evaluate survival model performance. Please refer to :lml:ref:`ML Metrics` for details.", "default": "cindex", "type": "string" }, "SPEC_TYPE": { "title": "Spec Type", "default": "survival", "type": "string" } }, "required": [ "time_column", "event_query", "event_column" ] }, "ModelingTask": { "title": "ModelingTask", "description": "Defines supported modeling tasks.", "enum": [ "classification", "regression", "survival" ], "type": "string" }, "ModelingTargetSpec": { "title": "ModelingTargetSpec", "description": "Specification for modeling target (regression/classification)", "type": "object", "properties": { "target_column": { "title": "Target Column", "description": "Target column for modeling, also known as dependent variable or outcome. In case of survival modeling, this column should contain time-to-event values (usually OS or PFS).", "type": "string" }, "task": { "description": "Problem definition for modeling setup. Possible options: \"classification\", \"regression\", \"survival\".", "default": "regression", "allOf": [ { "$ref": "#/definitions/ModelingTask" } ] }, "target_metric": { "title": "Target Metric", "description": "Metric (loss) that will be used to evaluate model performance. Typical options per modeling objective: \"logloss\" for classification, \"mse\" for regression, \"cindex_inv\" for survival (inverted concordance index, so that it could be minimized). NOTE: at the moment only loss functions are supported (minimization problems). 
Please refer to :lml:ref:`ML Metrics` for details.", "default": "", "type": "string" }, "SPEC_TYPE": { "title": "Spec Type", "default": "reg-clf", "type": "string" } }, "required": [ "target_column" ] }, "ColumnSpec": { "title": "ColumnSpec", "description": "Configure special columns.", "type": "object", "properties": { "name": { "title": "Name", "description": "Special column name.", "type": "string" }, "comment": { "title": "Comment", "description": "Column description (e.g. \"Treatment Arm, used for stratification in such and such analysis.\")", "default": "", "type": "string" } }, "required": [ "name" ] }, "ColumnMetadataConfig": { "title": "ColumnMetadataConfig", "description": "Column-specific metadata, currently including data type.\n\nIn future this structure can be extended to contain field description,\nsemantics, display name, formatting, etc.", "type": "object", "properties": { "name": { "title": "Name", "description": "Column name. Used to refer to column in the dataframe directly.", "type": "string" }, "data_type": { "title": "Data Type", "description": "Data type for the field. Most frequent are `string`, `int`, `float`, `datetime64[ns]`.\n\nIf not specified, automatically detected while reading original dataset.\n\nSee `https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes` for the list of available standard pandas types.", "default": "", "type": "string" }, "is_categorical": { "title": "Is Categorical", "description": "Specify if a column should be considered as categorical (as opposed to continuous numeric). 
Not applicable to date-time types, but can be string, integer or float.", "default": false, "type": "boolean" }, "parent_name": { "title": "Parent Name", "description": "Name of the column from which the current column was produced by a transformation.", "type": "string" }, "description": { "title": "Description", "description": "Column description.", "type": "string" }, "group": { "title": "Group", "description": "Name of a group this column belongs to. If MCT config is provided, set to column input_source name.", "type": "string" } }, "required": [ "name" ] }, "DatasetMetadataSection": { "title": "DatasetMetadataSection", "description": "Defines metadata for modeling setup: modeling objective, target column and evaluation metric.", "type": "object", "properties": { "key_columns": { "title": "Key Columns", "description": "List of identifier fields for a row in the dataset.", "default": [], "type": "array", "items": { "type": "string" } }, "modeling_specs": { "title": "Modeling Specs", "description": "Collection of modeling specification for survival/regression/classification problems.\n NOTE: Modeling problem ids are expected to have corresponding values.", "default": {}, "type": "object", "additionalProperties": { "anyOf": [ { "$ref": "#/definitions/SurvivalTimeSpec" }, { "$ref": "#/definitions/ModelingTargetSpec" } ] } }, "columns_specs": { "title": "Columns Specs", "description": "Named collection of special columns (targets, groupings, etc.)", "default": {}, "type": "object", "additionalProperties": { "$ref": "#/definitions/ColumnSpec" } }, "columns_metadata": { "title": "Columns Metadata", "description": "Provide list of columns-specific metadata.", "default": [], "type": "array", "items": { "$ref": "#/definitions/ColumnMetadataConfig" } } } }, "ModelingTaskSpec": { "title": "ModelingTaskSpec", "description": "Defines metadata for modeling setup: modeling objective, target column and evaluation metric.", "type": "object", "properties": { "task": { "description": 
"Problem definition for modeling setup. Possible options: \"classification\", \"regression\", \"survival\".", "allOf": [ { "$ref": "#/definitions/ModelingTask" } ] }, "target": { "title": "Target", "description": "Target column for modeling, also known as dependent variable or outcome. In case of survival modeling, this column should contain time-to-event values (usually OS or PFS).", "type": "string" }, "target_metric": { "title": "Target Metric", "description": "Metric (loss) that will be used to evaluate model performance. Typical options per modeling objective: \"logloss\" for classification, \"mse\" for regression, \"cindex_inv\" for survival (inverted concordance index, so that it could be minimized). NOTE: at the moment only loss functions are supported (minimization problems). Please refer to :lml:ref:`ML Metrics` for details.", "default": "", "type": "string" }, "event_query": { "title": "Event Query", "description": "(Applies for survival problems.) Query-like expression that indicates \"events\" (\"uncensored\") samples. For example: \"OS_CNSR == 1\". See :ref:`Dataset Queries` for details.", "default": "", "type": "string" }, "event_column": { "title": "Event Column", "description": "(Applies for survival problems.) Column used for event calculation. (We have to specify it so that it can be removed from the features list after the dataset preprocessing). If you specify `event_query: \"OS_CNSR == 1\"`, then also put `event_column: OS_CNSR`. ", "default": "", "type": "string" } }, "required": [ "task", "target" ] } } }
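The "required" lists in the schema above are easy to overlook in a definition this long. The following is a minimal sketch, using only the standard library, of checking the entries of a modeling_specs mapping against them. It is not logml code: the helper missing_keys and the column names SUBJID, OS and OS_CNSR are illustrative assumptions, and the real library relies on pydantic validation instead.

```python
# Required keys copied from the "required" lists in the schema above,
# keyed by the documented SPEC_TYPE defaults.
REQUIRED_KEYS = {
    "survival": ("time_column", "event_query", "event_column"),  # SurvivalTimeSpec
    "reg-clf": ("target_column",),  # ModelingTargetSpec
}


def missing_keys(spec: dict, spec_type: str) -> list:
    """Return the required keys absent from a modeling spec (hypothetical helper)."""
    return [key for key in REQUIRED_KEYS[spec_type] if key not in spec]


# Hypothetical dataset_metadata section; the column names are made up.
dataset_metadata = {
    "key_columns": ["SUBJID"],
    "modeling_specs": {
        "os": {
            "time_column": "OS",
            "event_query": "OS_CNSR == 1",
            "event_column": "OS_CNSR",  # same column the event_query refers to
        },
        # Deliberately incomplete: target_column is required but missing.
        "response": {"task": "classification", "target_metric": "logloss"},
    },
}

specs = dataset_metadata["modeling_specs"]
problems = {
    "os": missing_keys(specs["os"], "survival"),
    "response": missing_keys(specs["response"], "reg-clf"),
}
print(problems)
```

Here the survival spec passes, while the classification spec is reported as missing target_column.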
- Fields
- field input_ref: str = ''
Refers to a source of data. Currently only default source is supported.
- field data_preprocessing: logml.configuration.modeling.DatasetPreprocessingSection = DatasetPreprocessingSection(enable=True, preset=DatasetPreprocessingPresetSection(enable=False, features_list=[], remove_correlated_features=True, nans_per_row_fraction_threshold=0.9, nans_fraction_threshold=0.7, apply_log1p_to_target=False, drop_datetime_columns=True, drop_dna_wt=False, imputer='median'), steps=[])
- field dataset_metadata: logml.configuration.modeling.DatasetMetadataSection = None
- field modeling_task: logml.configuration.modeling.ModelingTaskSpec = None
- field modeling_ref: str = None
- field dataset_type: str = 'base_dataset'
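The two NaN thresholds in the preset defaults of data_preprocessing are described only as "maximum acceptable" fractions. One possible reading, sketched below with the standard library alone, drops columns first and rows second, and treats the threshold as inclusive (a fraction equal to the threshold is kept). Both the ordering and the inclusiveness are assumptions, not documented logml behavior; drop_sparse and nan_fraction are hypothetical helpers, and the table is toy data.

```python
import math

# Defaults copied from the documented preset section.
PRESET = {"nans_fraction_threshold": 0.7, "nans_per_row_fraction_threshold": 0.9}


def nan_fraction(values):
    """Fraction of NaN entries in a non-empty sequence of floats."""
    values = list(values)
    return sum(math.isnan(v) for v in values) / len(values)


def drop_sparse(table, preset=PRESET):
    """Drop sparse columns, then sparse rows (this order is an assumption).

    `table` maps column name -> list of floats; all lists have equal length.
    """
    # Keep columns whose NaN fraction does not exceed the column threshold.
    kept = {
        name: col
        for name, col in table.items()
        if nan_fraction(col) <= preset["nans_fraction_threshold"]
    }
    n_rows = len(next(iter(table.values())))
    # Keep rows whose NaN fraction over the surviving columns is acceptable.
    keep_rows = [
        i
        for i in range(n_rows)
        if kept
        and nan_fraction(col[i] for col in kept.values())
        <= preset["nans_per_row_fraction_threshold"]
    ]
    return {name: [col[i] for i in keep_rows] for name, col in kept.items()}


nan = math.nan
table = {
    "age": [61.0, 54.0, nan, 70.0],
    "dose": [nan, nan, nan, 2.5],  # 75% NaN exceeds 0.7, so the column is dropped
    "weight": [80.0, nan, nan, 66.0],
}
cleaned = drop_sparse(table)
print(sorted(cleaned), len(cleaned["age"]))
```

With these toy values the dose column is removed, and the third row is removed because every remaining value in it is NaN.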
- class logml.analysis.items.data_transform.DataTransformer(cfg: logml.analysis.items.data_transform.DataTransformerConfig, logger=None, **kwargs)
Bases: logml.analysis.base_item.AnalysisItem
Wraps the analysis algorithm with data loading/saving and configuration handling.
- LABEL = 'data_transform'
- PARAMS_CLS
alias of logml.analysis.items.data_transform.DataTransformerConfig
- RESULT_CLS
alias of logml.analysis.items.data_transform.DataTransformerResult
- classmethod prepare_params(params: logml.analysis.items.data_transform.DataTransformerConfig, global_cfg)
Convert params to PARAMS_CLS if they are not already an instance of it.
- run()
Run data transformation.
- get_result() -> logml.analysis.items.data_transform.DataTransformerResult
Return the step's final result. It should be of RESULT_CLS type if RESULT_CLS is declared.
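The members listed above suggest a three-stage lifecycle: prepare_params, then run, then get_result. The sketch below mirrors that surface with stand-in classes so it can run without logml installed. SketchConfig, SketchResult and SketchTransformer are hypothetical names, the body of run is a placeholder rather than the real preprocessing pipeline, and the actual AnalysisItem base class may do more (for example data loading and saving).

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SketchConfig:
    """Stand-in for DataTransformerConfig (only two of the documented fields)."""
    input_ref: str = ""
    dataset_type: str = "base_dataset"


@dataclass
class SketchResult:
    """Stand-in for DataTransformerResult."""
    data: Optional[Any] = None


class SketchTransformer:
    """Stand-in mirroring the documented DataTransformer members."""
    LABEL = "data_transform"
    PARAMS_CLS = SketchConfig
    RESULT_CLS = SketchResult

    def __init__(self, cfg, logger=None, **kwargs):
        self.cfg = cfg
        self.logger = logger
        self._result = None

    @classmethod
    def prepare_params(cls, params, global_cfg):
        """Convert params to PARAMS_CLS if they are not already an instance."""
        if isinstance(params, cls.PARAMS_CLS):
            return params
        return cls.PARAMS_CLS(**params)

    def run(self):
        # Placeholder only: the real step would run the preprocessing pipeline.
        self._result = self.RESULT_CLS(data={"dataset_type": self.cfg.dataset_type})

    def get_result(self):
        """Return the result produced by run(), an instance of RESULT_CLS."""
        return self._result


# A plain dict is converted to the config class; global_cfg is unused here.
cfg = SketchTransformer.prepare_params({"input_ref": ""}, global_cfg=None)
item = SketchTransformer(cfg)
item.run()
result = item.get_result()
print(type(result).__name__, result.data)
```

Declaring PARAMS_CLS and RESULT_CLS as class attributes lets generic code convert raw configuration and check result types without knowing the concrete step.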