logml.analysis.items.data_transform

Classes

DataTransformer(cfg[, logger])

Wraps an analysis algorithm with data loading/saving and configuration handling.

DataTransformerResult([data])

Result of transformer step (usually a single dataframe)

class logml.analysis.items.data_transform.DataTransformerResult(data: Optional[logml.data.datasets.base.BaseDataset] = None)

Bases: object

Result of transformer step (usually a single dataframe)

data: logml.data.datasets.base.BaseDataset = None
class logml.analysis.items.data_transform.DataTransformerConfig

Bases: pydantic.main.BaseModel

Transformer step config: invokes data preprocessing.

JSON schema:
{
   "title": "DataTransformerConfig",
   "description": "Transformer step config: invokes data preprocessing.",
   "type": "object",
   "properties": {
      "input_ref": {
         "title": "Input Ref",
         "description": "Refers to a source of data. Currently only default source is supported.",
         "default": "",
         "type": "string"
      },
      "data_preprocessing": {
         "title": "Data Preprocessing",
         "default": {
            "enable": true,
            "preset": {
               "enable": false,
               "features_list": [],
               "remove_correlated_features": true,
               "nans_per_row_fraction_threshold": 0.9,
               "nans_fraction_threshold": 0.7,
               "apply_log1p_to_target": false,
               "drop_datetime_columns": true,
               "drop_dna_wt": false,
               "imputer": "median"
            },
            "steps": []
         },
         "allOf": [
            {
               "$ref": "#/definitions/DatasetPreprocessingSection"
            }
         ]
      },
      "dataset_metadata": {
         "$ref": "#/definitions/DatasetMetadataSection"
      },
      "modeling_task": {
         "$ref": "#/definitions/ModelingTaskSpec"
      },
      "modeling_ref": {
         "title": "Modeling Ref",
         "type": "string"
      },
      "dataset_type": {
         "title": "Dataset Type",
         "default": "base_dataset",
         "type": "string"
      }
   },
   "definitions": {
      "DatasetPreprocessingPresetSection": {
         "title": "DatasetPreprocessingPresetSection",
         "description": "Defines 'syntax sugar' for semi-automated data preprocessing steps generation.",
         "type": "object",
         "properties": {
            "enable": {
               "title": "Enable",
               "description": "Whether to enable automated generation of preprocessing steps.",
               "default": true,
               "type": "boolean"
            },
            "features_list": {
               "title": "Features List",
               "description": "Defines a list of features (referenced by regexps) that should be selected. Alternatively, reference a configuration file that contains the required list of features:\n            ...\n            features_list: sub_cfg/features_list.yaml  # a config file\n            ...\n        ",
               "default": [],
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "array",
                     "items": {
                        "type": "string"
                     }
                  }
               ]
            },
            "remove_correlated_features": {
               "title": "Remove Correlated Features",
               "description": "Whether to include a step that removes correlated features.",
               "default": true,
               "type": "boolean"
            },
            "nans_per_row_fraction_threshold": {
               "title": "Nans Per Row Fraction Threshold",
               "description": "Defines maximum acceptable fraction of NaNs within a row.",
               "default": 0.9,
               "type": "number"
            },
            "nans_fraction_threshold": {
               "title": "Nans Fraction Threshold",
               "description": "Defines maximum acceptable fraction of NaNs within a column.",
               "default": 0.7,
               "type": "number"
            },
            "apply_log1p_to_target": {
               "title": "Apply Log1P To Target",
               "description": "Whether to apply log1p transformation to target column (applicable only for regression problems).",
               "default": false,
               "type": "boolean"
            },
            "drop_datetime_columns": {
               "title": "Drop Datetime Columns",
               "description": "Whether to drop date time columns.",
               "default": true,
               "type": "boolean"
            },
            "drop_dna_wt": {
               "title": "Drop Dna Wt",
               "description": "Whether to drop DNA WT values after one-hot-encoding.",
               "default": false,
               "type": "boolean"
            },
            "imputer": {
               "title": "Imputer",
               "description": "Imputer to use. Possible values: (median, mice)",
               "default": "median",
               "type": "string"
            }
         }
      },
      "PreprocessingStep": {
         "title": "PreprocessingStep",
         "description": "Defines data preprocessing step.",
         "type": "object",
         "properties": {
            "enable": {
               "title": "Enable",
               "description": "Whether to enable preprocessing step.",
               "default": true,
               "type": "boolean"
            },
            "transformer": {
               "title": "Transformer",
               "description": "Alias of transformer to use. Please refer to :lml:ref:`Data Transformers` for details.",
               "type": "string"
            },
            "params": {
               "title": "Params",
               "description": "Parameters that will be passed to the corresponding transformer instance.",
               "default": {},
               "type": "object"
            }
         },
         "required": [
            "transformer"
         ]
      },
      "DatasetPreprocessingSection": {
         "title": "DatasetPreprocessingSection",
         "description": "Defines data preprocessing section for modeling/survival setup.",
         "type": "object",
         "properties": {
            "enable": {
               "title": "Enable",
               "description": "Whether to enable Preprocessing Pipeline for dataset transformation.",
               "default": true,
               "type": "boolean"
            },
            "preset": {
               "title": "Preset",
               "default": {
                  "enable": false,
                  "features_list": [],
                  "remove_correlated_features": true,
                  "nans_per_row_fraction_threshold": 0.9,
                  "nans_fraction_threshold": 0.7,
                  "apply_log1p_to_target": false,
                  "drop_datetime_columns": true,
                  "drop_dna_wt": false,
                  "imputer": "median"
               },
               "allOf": [
                  {
                     "$ref": "#/definitions/DatasetPreprocessingPresetSection"
                  }
               ]
            },
            "steps": {
               "title": "Steps",
               "description": "Defines a list of preprocessing steps (transformations) to apply. See :lml:ref:`Data Transformers` for details.",
               "default": [],
               "type": "array",
               "items": {
                  "$ref": "#/definitions/PreprocessingStep"
               }
            }
         }
      },
      "SurvivalTimeSpec": {
         "title": "SurvivalTimeSpec",
         "description": "Configure right-censored time-to-event columns in the dataset",
         "type": "object",
         "properties": {
            "time_column": {
               "title": "Time Column",
               "description": "Column name that contains time-to-event values (usually OS or PFS).",
               "type": "string"
            },
            "event_query": {
               "title": "Event Query",
               "description": "Query-like expression that indicates \"events\" (\"uncensored\") samples. For example: \"OS_CNSR == 1\". See :ref:`Dataset Queries` for details.",
               "type": "string"
            },
            "event_column": {
               "title": "Event Column",
               "description": "Column used for event calculation. (It must be specified so that it can be removed from the features list after dataset preprocessing.) If you specify `event_query: \"OS_CNSR == 1\"`, then also put `event_column: OS_CNSR`.",
               "type": "string"
            },
            "target_metric": {
               "title": "Target Metric",
               "description": "Metric (loss) that will be used to evaluate survival models performance. Please refer to :lml:ref:`ML Metrics` for details.",
               "default": "cindex",
               "type": "string"
            },
            "SPEC_TYPE": {
               "title": "Spec Type",
               "default": "survival",
               "type": "string"
            }
         },
         "required": [
            "time_column",
            "event_query",
            "event_column"
         ]
      },
      "ModelingTask": {
         "title": "ModelingTask",
         "description": "Defines supported modeling tasks.",
         "enum": [
            "classification",
            "regression",
            "survival"
         ],
         "type": "string"
      },
      "ModelingTargetSpec": {
         "title": "ModelingTargetSpec",
         "description": "Specification for modeling target (regression/classification)",
         "type": "object",
         "properties": {
            "target_column": {
               "title": "Target Column",
               "description": "Target column for modeling, also known as dependent variable or outcome. In case of survival modeling, this column should contain time-to-event values (usually OS or PFS).",
               "type": "string"
            },
            "task": {
               "description": "Problem definition for modeling setup. Possible options: \"classification\", \"regression\", \"survival\".",
               "default": "regression",
               "allOf": [
                  {
                     "$ref": "#/definitions/ModelingTask"
                  }
               ]
            },
            "target_metric": {
               "title": "Target Metric",
               "description": "Metric (loss) that will be used to evaluate model performance. Typical options per modeling objective: \"logloss\" for classification, \"mse\" for regression, \"cindex_inv\" for survival (inverted concordance index, so that it can be minimized). NOTE: at the moment only loss functions are supported (minimization problems). Please refer to :lml:ref:`ML Metrics` for details.",
               "default": "",
               "type": "string"
            },
            "SPEC_TYPE": {
               "title": "Spec Type",
               "default": "reg-clf",
               "type": "string"
            }
         },
         "required": [
            "target_column"
         ]
      },
      "ColumnSpec": {
         "title": "ColumnSpec",
         "description": "Configure special columns.",
         "type": "object",
         "properties": {
            "name": {
               "title": "Name",
               "description": "Special column name.",
               "type": "string"
            },
            "comment": {
               "title": "Comment",
               "description": "Column description (e.g. \"Treatment Arm, used for stratification in such and such analysis.\")",
               "default": "",
               "type": "string"
            }
         },
         "required": [
            "name"
         ]
      },
      "ColumnMetadataConfig": {
         "title": "ColumnMetadataConfig",
         "description": "Column-specific metadata, currently including data type.\n\nIn future this structure can be extended to contain field description,\nsemantics, display name, formatting, etc.",
         "type": "object",
         "properties": {
            "name": {
               "title": "Name",
               "description": "Column name. Used to refer to column in the dataframe directly.",
               "type": "string"
            },
            "data_type": {
               "title": "Data Type",
               "description": "Data type for the field. The most frequent are `string`, `int`, `float`, `datetime64[ns]`.\n\nIf not specified, it is detected automatically while reading the original dataset.\n\nSee `https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes` for a list of available standard pandas types.",
               "default": "",
               "type": "string"
            },
            "is_categorical": {
               "title": "Is Categorical",
               "description": "Specifies whether a column should be considered categorical (as opposed to continuous numeric). Not applicable to date-time types, but the column can be string, integer or float.",
               "default": false,
               "type": "boolean"
            },
            "parent_name": {
               "title": "Parent Name",
               "description": "Name of the column that was used to produce the current column as a result of a transformation.",
               "type": "string"
            },
            "description": {
               "title": "Description",
               "description": "Column description.",
               "type": "string"
            },
            "group": {
               "title": "Group",
               "description": "Name of a group this column belongs to. If MCT config is provided, set to column input_source name.",
               "type": "string"
            }
         },
         "required": [
            "name"
         ]
      },
      "DatasetMetadataSection": {
         "title": "DatasetMetadataSection",
         "description": "Defines metadata for modeling setup: modeling objective, target column and evaluation metric.",
         "type": "object",
         "properties": {
            "key_columns": {
               "title": "Key Columns",
               "description": "List of identifier fields for a row in the dataset.",
               "default": [],
               "type": "array",
               "items": {
                  "type": "string"
               }
            },
            "modeling_specs": {
               "title": "Modeling Specs",
               "description": "Collection of modeling specification for survival/regression/classification problems.\n            NOTE: Modeling problem ids are expected to have corresponding values.",
               "default": {},
               "type": "object",
               "additionalProperties": {
                  "anyOf": [
                     {
                        "$ref": "#/definitions/SurvivalTimeSpec"
                     },
                     {
                        "$ref": "#/definitions/ModelingTargetSpec"
                     }
                  ]
               }
            },
            "columns_specs": {
               "title": "Columns Specs",
               "description": "Named collection of special columns (targets, groupings, etc.)",
               "default": {},
               "type": "object",
               "additionalProperties": {
                  "$ref": "#/definitions/ColumnSpec"
               }
            },
            "columns_metadata": {
               "title": "Columns Metadata",
               "description": "List of column-specific metadata.",
               "default": [],
               "type": "array",
               "items": {
                  "$ref": "#/definitions/ColumnMetadataConfig"
               }
            }
         }
      },
      "ModelingTaskSpec": {
         "title": "ModelingTaskSpec",
         "description": "Defines metadata for modeling setup: modeling objective, target column and evaluation metric.",
         "type": "object",
         "properties": {
            "task": {
               "description": "Problem definition for modeling setup. Possible options: \"classification\", \"regression\", \"survival\".",
               "allOf": [
                  {
                     "$ref": "#/definitions/ModelingTask"
                  }
               ]
            },
            "target": {
               "title": "Target",
               "description": "Target column for modeling, also known as dependent variable or outcome. In case of survival modeling, this column should contain time-to-event values (usually OS or PFS).",
               "type": "string"
            },
            "target_metric": {
               "title": "Target Metric",
               "description": "Metric (loss) that will be used to evaluate model performance. Typical options per modeling objective: \"logloss\" for classification, \"mse\" for regression, \"cindex_inv\" for survival (inverted concordance index, so that it can be minimized). NOTE: at the moment only loss functions are supported (minimization problems). Please refer to :lml:ref:`ML Metrics` for details.",
               "default": "",
               "type": "string"
            },
            "event_query": {
               "title": "Event Query",
               "description": "(Applies for survival problems.) Query-like expression that indicates \"events\" (\"uncensored\") samples. For example: \"OS_CNSR == 1\". See :ref:`Dataset Queries` for details.",
               "default": "",
               "type": "string"
            },
            "event_column": {
               "title": "Event Column",
               "description": "(Applies to survival problems.) Column used for event calculation. (It must be specified so that it can be removed from the features list after dataset preprocessing.) If you specify `event_query: \"OS_CNSR == 1\"`, then also put `event_column: OS_CNSR`.",
               "default": "",
               "type": "string"
            }
         },
         "required": [
            "task",
            "target"
         ]
      }
   }
}
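
As an illustration of the schema above, a configuration can be written as a plain Python dictionary and checked against the schema's "required" lists. This is a minimal sketch, not LogML's own validation: the transformer alias `example_transformer` and the helper `check_config` are hypothetical, and only the key names, the required fields, and the allowed task values are taken from the schema.

```python
from typing import List

# Allowed values of the ModelingTask enum in the schema.
ALLOWED_TASKS = {"classification", "regression", "survival"}

# Hypothetical configuration mirroring the DataTransformerConfig schema.
config = {
    "input_ref": "",
    "data_preprocessing": {
        "enable": True,
        "preset": {"enable": False},
        "steps": [
            # "example_transformer" is a placeholder alias, not a real LogML transformer.
            {"transformer": "example_transformer", "params": {}},
        ],
    },
    "modeling_task": {
        "task": "survival",
        "target": "OS",
        "target_metric": "cindex_inv",
        "event_query": "OS_CNSR == 1",
        "event_column": "OS_CNSR",
    },
    "dataset_type": "base_dataset",
}


def check_config(cfg: dict) -> List[str]:
    """Return a list of problems found against the schema's 'required' lists."""
    problems = []
    # Every PreprocessingStep requires a "transformer" alias.
    steps = cfg.get("data_preprocessing", {}).get("steps", [])
    for index, step in enumerate(steps):
        if "transformer" not in step:
            problems.append(f"steps[{index}]: missing 'transformer'")
    # ModelingTaskSpec requires "task" and "target"; "task" must be an enum value.
    task = cfg.get("modeling_task")
    if task is not None:
        for key in ("task", "target"):
            if key not in task:
                problems.append(f"modeling_task: missing '{key}'")
        if "task" in task and task["task"] not in ALLOWED_TASKS:
            problems.append(f"modeling_task: unknown task '{task['task']}'")
    return problems


print(check_config(config))
```

An empty list means that no required key is missing. The real class is a pydantic model, so it presumably performs its own, stricter validation when it is instantiated.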

Fields
field input_ref: str = ''

Refers to a source of data. Currently only default source is supported.

field data_preprocessing: logml.configuration.modeling.DatasetPreprocessingSection = DatasetPreprocessingSection(enable=True, preset=DatasetPreprocessingPresetSection(enable=False, features_list=[], remove_correlated_features=True, nans_per_row_fraction_threshold=0.9, nans_fraction_threshold=0.7, apply_log1p_to_target=False, drop_datetime_columns=True, drop_dna_wt=False, imputer='median'), steps=[])
field dataset_metadata: logml.configuration.modeling.DatasetMetadataSection = None
field modeling_task: logml.configuration.modeling.ModelingTaskSpec = None
field modeling_ref: str = None
field dataset_type: str = 'base_dataset'
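
The preset defaults shown for `data_preprocessing` (`nans_fraction_threshold=0.7`, `nans_per_row_fraction_threshold=0.9`, `imputer='median'`) can be illustrated with a small pure-Python sketch. It is not LogML's implementation: the helper names are hypothetical, missing values are represented as `None`, and both the order of the steps (columns first, then rows) and the non-strict comparison against the thresholds are assumptions.

```python
from statistics import median
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[float]]


def missing_fraction(values: List[Optional[float]]) -> float:
    """Fraction of missing (None) entries; 0.0 for an empty list."""
    if not values:
        return 0.0
    return sum(value is None for value in values) / len(values)


def apply_nan_thresholds(
    rows: List[Row],
    columns: List[str],
    nans_fraction_threshold: float = 0.7,
    nans_per_row_fraction_threshold: float = 0.9,
) -> Tuple[List[Row], List[str]]:
    """Drop columns, then rows, whose missing fraction exceeds the threshold."""
    kept_columns = [
        c for c in columns
        if missing_fraction([row[c] for row in rows]) <= nans_fraction_threshold
    ]
    kept_rows = [
        {c: row[c] for c in kept_columns}
        for row in rows
        if missing_fraction([row[c] for c in kept_columns])
        <= nans_per_row_fraction_threshold
    ]
    return kept_rows, kept_columns


def impute_median(rows: List[Row], columns: List[str]) -> List[Row]:
    """Return copies of the rows with None replaced by the column median."""
    result = [dict(row) for row in rows]
    for c in columns:
        observed = [row[c] for row in result if row[c] is not None]
        if not observed:
            continue  # nothing to compute a median from
        fill = median(observed)
        for row in result:
            if row[c] is None:
                row[c] = fill
    return result


rows = [
    {"a": 1.0, "b": None, "c": 3.0},
    {"a": None, "b": None, "c": 5.0},
    {"a": 3.0, "b": None, "c": None},
    {"a": 5.0, "b": 2.0, "c": 7.0},
]
filtered_rows, kept_columns = apply_nan_thresholds(rows, ["a", "b", "c"])
imputed_rows = impute_median(filtered_rows, kept_columns)
```

In this toy data, column `b` is missing in three of four rows (0.75, above the 0.7 threshold) and is dropped. All rows stay below the per-row threshold, and the remaining gaps are filled with the column medians.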
class logml.analysis.items.data_transform.DataTransformer(cfg: logml.analysis.items.data_transform.DataTransformerConfig, logger=None, **kwargs)

Bases: logml.analysis.base_item.AnalysisItem

Wraps an analysis algorithm with data loading/saving and configuration handling.

LABEL = 'data_transform'
PARAMS_CLS

alias of logml.analysis.items.data_transform.DataTransformerConfig

RESULT_CLS

alias of logml.analysis.items.data_transform.DataTransformerResult

classmethod prepare_params(params: logml.analysis.items.data_transform.DataTransformerConfig, global_cfg)

Convert params, if they are not of PARAMS_CLS class.
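
The conversion described for `prepare_params` can be sketched with stand-in classes. `SketchConfig` and `SketchItem` are hypothetical and do not reproduce LogML's code. The sketch only shows the general pattern of returning the parameters unchanged when they already have the `PARAMS_CLS` type and building a `PARAMS_CLS` instance from a mapping otherwise. How the real method uses `global_cfg` is not documented here, so the sketch ignores it.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Union


@dataclass
class SketchConfig:
    """Stand-in for a PARAMS_CLS such as DataTransformerConfig (two fields only)."""

    input_ref: str = ""
    dataset_type: str = "base_dataset"


class SketchItem:
    """Stand-in for an analysis item that declares PARAMS_CLS."""

    PARAMS_CLS = SketchConfig

    @classmethod
    def prepare_params(
        cls,
        params: Union[SketchConfig, Mapping[str, Any]],
        global_cfg: Any = None,
    ) -> SketchConfig:
        # Already the right type: return it unchanged.
        if isinstance(params, cls.PARAMS_CLS):
            return params
        # Otherwise build a PARAMS_CLS instance from the mapping.
        # global_cfg is accepted but unused in this sketch.
        return cls.PARAMS_CLS(**dict(params))


from_mapping = SketchItem.prepare_params({"dataset_type": "base_dataset"})
already_typed = SketchConfig(input_ref="")
same_object = SketchItem.prepare_params(already_typed)
```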

run()

Run data transformation.

get_result() logml.analysis.items.data_transform.DataTransformerResult

Return step final result. Should be of RESULT_CLS type, if RESULT_CLS is declared.
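
The `run()` / `get_result()` contract above can be sketched as follows. The classes are hypothetical stand-ins rather than LogML code, and the transformation inside `run()` is a placeholder. The point is only that `get_result()` returns an object of the declared `RESULT_CLS`. Raising `TypeError` on a mismatch is an assumption of this sketch, not documented behaviour.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class SketchResult:
    """Stand-in for a RESULT_CLS such as DataTransformerResult."""

    data: Optional[List[float]] = None


class SketchTransformer:
    """Stand-in item exposing the documented run()/get_result() pair."""

    LABEL = "sketch_transform"
    RESULT_CLS = SketchResult

    def __init__(self, values: List[float]):
        self._values = values
        self._result: Any = None

    def run(self) -> None:
        # Placeholder transformation: shift the values so the minimum is zero.
        lowest = min(self._values)
        self._result = SketchResult(data=[v - lowest for v in self._values])

    def get_result(self) -> SketchResult:
        # Enforce the declared result type, if any.
        if self.RESULT_CLS is not None and not isinstance(
            self._result, self.RESULT_CLS
        ):
            raise TypeError("run() has not produced a RESULT_CLS instance yet")
        return self._result


item = SketchTransformer([3.0, 5.0, 4.0])
item.run()
result = item.get_result()
```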