logml.configuration.eda

Classes

CorrelationType(value)

Defines available correlation types.

class logml.configuration.eda.EDAArtifactSection

Bases: pydantic.main.BaseModel

Configuration for an EDA artifact.

JSON schema

{
   "title": "EDAArtifactSection",
   "description": "Configuration for an EDA artifact.",
   "type": "object",
   "properties": {
      "enable": {
         "title": "Enable",
         "description": "Whether to enable this EDA artifact generation.",
         "default": true,
         "type": "boolean"
      },
      "name": {
         "title": "Name",
         "description": "Registered artifact name. See lml:ref:`EDA Artifacts` for the complete list.",
         "type": "string"
      }
   },
   "required": [
      "name"
   ]
}

Fields

enable (bool)
name (str)

field enable: bool = True: Whether to enable this EDA artifact generation.

field name: str [Required]: Registered artifact name. See EDA Artifacts for the complete list.

class logml.configuration.eda.CorrelationType(value)

Bases: str, enum.Enum

Defines available correlation types.

PEARSON = 'pearson'

SPEARMAN = 'spearman'
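
Because CorrelationType derives from both str and enum.Enum, its members compare equal to their plain string values and can be looked up by value, which is presumably why a configuration can spell the choice as "pearson" or "spearman". The sketch below re-declares an equivalent enum locally (an assumption, so that it runs without logml installed) to illustrate that behaviour.

```python
from enum import Enum


class CorrelationType(str, Enum):
    """Local stand-in mirroring the documented enum (not imported from logml)."""

    PEARSON = "pearson"
    SPEARMAN = "spearman"


# Lookup by value returns the corresponding member.
selected = CorrelationType("spearman")

# The str mixin makes members compare equal to plain strings.
is_plain_string_equal = CorrelationType.PEARSON == "pearson"

# A value outside the enum raises ValueError.
try:
    CorrelationType("kendall")
    unknown_rejected = False
except ValueError:
    unknown_rejected = True
```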

class logml.configuration.eda.EDAArtifactsGenerationParameters

Bases: pydantic.main.BaseModel

Defines the hyperparameters and thresholds used for EDA artifacts generation.

JSON schema

{
   "title": "EDAArtifactsGenerationParameters",
   "description": "Defines a set of hyperparams and thresholds that will be used for EDA artifacts generation.",
   "type": "object",
   "properties": {
      "correlation_type": {
         "description": "Type of correlation that will be used to produce EDA artifacts as well as while removing\n            correlated features.",
         "default": "pearson",
         "allOf": [
            {
               "$ref": "#/definitions/CorrelationType"
            }
         ]
      },
      "correlation_threshold": {
         "title": "Correlation Threshold",
         "description": "Defines a correlation threshold that will be used to identify \"correlated\" features.",
         "default": 0.8,
         "type": "number"
      },
      "correlation_min_samples_fraction": {
         "title": "Correlation Min Samples Fraction",
         "description": "Additional parameter that defines the minimum fraction of samples that is required to calculate\n            correlation coefficient between two columns. As NaNs are ignored and correlation coefficient is calculated\n            on top of non-NaN subset of rows for a pair of columns - this parameter could help to make the results\n            more meaningful. Please see the reference of \"min_periods\" here:\n            https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html\n            ",
         "default": 0.2,
         "type": "number"
      },
      "correlation_group_level_cutoff": {
         "title": "Correlation Group Level Cutoff",
         "description": "Sets cutoff for how many levels of neighbours to consider when building correlation groups.\n        \n        For example consider the following correlation matrix:\n            \n        .. code-block::\n        \n                    a    b    c    d\n                a  1.0  0.8  0.8  0.7\n                b  0.8  1.0    0    0\n                c  0.8    0  1.0  0.8\n                d  0.7    0  0.8  1.0\n        \n        Let's say, we use threshold as ``> 0.7``. In this case `a` is correlated strongly with `b` and `c`, and \n        `c` correlated with `d`.\n        \n        When we set cutoff to `1`, we use direct neighbours only, so there is one group `'a', 'c', 'b'`. \n        In this case `d` is not included, because the group has been already formed around `a` column.\n        \n        If we set it to `-1` or anything more than 1, we use all reachable neighbours. In this case, correlation \n        group is formed as ``'a', 'c', 'b', 'd'`` due to fact that `d` is strongly correlated with `c`, disregarding \n        it weak connection to `a`. As you can see, it will result in larger groups, and possibility to assign to the \n        same group columns with correlation less than a threshold. It could reflect cross-correlation more\n        naturally in some cases.\n        ",
         "default": 1,
         "type": "integer"
      },
      "correlation_key_names": {
         "title": "Correlation Key Names",
         "description": "Defines a list of biologically rational gene names (subst) that\n            will be used for correlation groups naming. In case some of those names will appear in one of column names\n            within the same correlation group - the result correlation group identifier will contain those names.",
         "default": [
            "TP53",
            "KRAS",
            "CDKN2A",
            "CDKN2B",
            "PIK3CA",
            "ATM",
            "BRCA1",
            "SOX2",
            "GNAS2",
            "TERC",
            "STK11",
            "PDCD1",
            "LAG3",
            "TIGIT",
            "HAVCR2",
            "EOMES",
            "MTAP"
         ],
         "type": "array",
         "items": {
            "type": "string"
         }
      },
      "large_data_threshold": {
         "title": "Large Data Threshold",
         "description": "Threshold (rows, columns) to apply large dataset processing, simplifying certain analysis steps which may not make sense for large data. Any rows or columns number of the dataset shouldexceed the limit.",
         "default": "(500, 1000)",
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "array",
               "minItems": 2,
               "maxItems": 2,
               "items": [
                  {
                     "type": "integer"
                  },
                  {
                     "type": "integer"
                  }
               ]
            }
         ]
      },
      "huge_data_threshold": {
         "title": "Huge Data Threshold",
         "description": "Threshold (rows, columns) to omit certain analysis steps which may not make sense for huge data. Any rows or columns number of the dataset shouldexceed the limit.",
         "default": "(2000, 5000)",
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "array",
               "minItems": 2,
               "maxItems": 2,
               "items": [
                  {
                     "type": "integer"
                  },
                  {
                     "type": "integer"
                  }
               ]
            }
         ]
      }
   },
   "definitions": {
      "CorrelationType": {
         "title": "CorrelationType",
         "description": "Defines available correlation types.",
         "enum": [
            "pearson",
            "spearman"
         ],
         "type": "string"
      }
   }
}

Fields

correlation_group_level_cutoff (int)
correlation_key_names (List[str])
correlation_min_samples_fraction (float)
correlation_threshold (float)
correlation_type (logml.configuration.eda.CorrelationType)
huge_data_threshold (Union[str, Tuple[int, int]])
large_data_threshold (Union[str, Tuple[int, int]])

field correlation_type: logml.configuration.eda.CorrelationType = CorrelationType.PEARSON: Type of correlation used both to produce EDA artifacts and to remove correlated features.

field correlation_threshold: float = 0.8: Defines a correlation threshold that will be used to identify “correlated” features.

field correlation_min_samples_fraction: float = 0.2: Minimum fraction of samples required to calculate the correlation coefficient between two columns. NaNs are ignored, so the coefficient for a pair of columns is calculated only on the rows where both values are non-NaN; this parameter helps keep the results meaningful when few such rows remain. See the “min_periods” parameter reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html
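
The effect of a minimum-samples fraction can be sketched in plain Python. The function below is illustrative only and is not logml's implementation: the name pairwise_pearson is hypothetical, and converting the fraction to a row count with ceil(fraction * n_rows) is an assumption. It returns NaN when too few complete pairs remain, similar in spirit to the pandas min_periods behaviour.

```python
import math


def pairwise_pearson(x, y, min_samples_fraction=0.2):
    """Pearson correlation over rows where both values are present.

    Returns NaN when fewer than ceil(min_samples_fraction * n_rows)
    complete pairs remain (this fraction-to-count rule is an assumption).
    """
    pairs = [(a, b) for a, b in zip(x, y)
             if not (math.isnan(a) or math.isnan(b))]
    min_periods = math.ceil(min_samples_fraction * len(x))
    # At least two pairs are needed for a correlation to be defined.
    if len(pairs) < max(min_periods, 2):
        return float("nan")
    xs, ys = zip(*pairs)
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # constant column: correlation undefined
    return cov / math.sqrt(var_x * var_y)


nan = float("nan")
x = [1.0, 2.0, 3.0, nan, nan, nan, nan, nan, nan, nan]
y = [2.0, 4.0, 6.0, 1.0, nan, nan, nan, nan, nan, nan]

# Three complete pairs out of ten rows: enough for 0.2, too few for 0.5.
r_default = pairwise_pearson(x, y, min_samples_fraction=0.2)
r_strict = pairwise_pearson(x, y, min_samples_fraction=0.5)
```

With the default fraction the three overlapping rows are accepted; with a stricter fraction the same pair of columns yields NaN instead of a coefficient based on very few rows.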

field correlation_group_level_cutoff: int = 1: Sets the cutoff for how many levels of neighbours to consider when building correlation groups. For example, consider the following correlation matrix:

         a    b    c    d
    a  1.0  0.8  0.8  0.7
    b  0.8  1.0    0    0
    c  0.8    0  1.0  0.8
    d  0.7    0  0.8  1.0

Suppose the threshold is > 0.7. Then a is strongly correlated with b and c, and c is strongly correlated with d.

With the cutoff set to 1, only direct neighbours are used, so there is one group: 'a', 'c', 'b'. Here d is not included, because the group has already been formed around column a.

With the cutoff set to -1 or anything greater than 1, all reachable neighbours are used. The group then becomes 'a', 'c', 'b', 'd', because d is strongly correlated with c, regardless of its weak connection to a. This produces larger groups and can place columns whose pairwise correlation is below the threshold in the same group. In some cases it reflects cross-correlation more naturally.
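
The grouping described for correlation_group_level_cutoff can be sketched as a breadth-first walk over "strongly correlated" neighbours. This is a minimal illustration and not logml's actual algorithm: the function name correlation_groups is hypothetical, a negative cutoff is assumed to mean "no limit", and comparing absolute values against the threshold is also an assumption.

```python
from collections import deque


def correlation_groups(corr, threshold, level_cutoff=1):
    """Group columns by following strongly correlated neighbours.

    level_cutoff limits how many neighbour levels are followed from the
    column that seeds a group; a negative value means no limit.
    """
    assigned = set()
    groups = []
    for seed in corr:
        if seed in assigned:
            continue
        group = [seed]
        assigned.add(seed)
        queue = deque([(seed, 0)])
        while queue:
            column, depth = queue.popleft()
            if 0 <= level_cutoff <= depth:
                continue  # cutoff reached: do not expand this column
            for other, value in corr[column].items():
                # Assumption: strength is judged on the absolute value.
                if other not in assigned and abs(value) > threshold:
                    assigned.add(other)
                    group.append(other)
                    queue.append((other, depth + 1))
        groups.append(group)
    return groups


columns = ["a", "b", "c", "d"]
rows = [
    [1.0, 0.8, 0.8, 0.7],
    [0.8, 1.0, 0.0, 0.0],
    [0.8, 0.0, 1.0, 0.8],
    [0.7, 0.0, 0.8, 1.0],
]
corr = {name: dict(zip(columns, values)) for name, values in zip(columns, rows)}

direct_only = correlation_groups(corr, threshold=0.7, level_cutoff=1)
all_reachable = correlation_groups(corr, threshold=0.7, level_cutoff=-1)
```

On the example matrix, the cutoff of 1 leaves d outside the group seeded at a, while the unlimited walk pulls d in through c.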

field correlation_key_names: List[str] = ['TP53', 'KRAS', 'CDKN2A', 'CDKN2B', 'PIK3CA', 'ATM', 'BRCA1', 'SOX2', 'GNAS2', 'TERC', 'STK11', 'PDCD1', 'LAG3', 'TIGIT', 'HAVCR2', 'EOMES', 'MTAP']: Defines a list of biologically relevant gene names (matched as substrings) that are used for naming correlation groups. If any of these names appears in a column name within a correlation group, the resulting group identifier contains those names.
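
The substring matching described for correlation_key_names might look like the sketch below. The function name, the column names, and the identifier format ("corr_group" prefix joined with underscores) are all hypothetical; only the idea of matching key names inside column names comes from the field description.

```python
def name_correlation_group(columns, key_names, prefix="corr_group"):
    """Build a group identifier that mentions key names found in columns.

    The identifier format is hypothetical; only the substring matching
    follows the field description.
    """
    found = [key for key in key_names
             if any(key in column for column in columns)]
    return "_".join([prefix, *found]) if found else prefix


key_names = ["TP53", "KRAS", "CDKN2A"]
group = ["dna_TP53_mut", "rna_KRAS_expr", "rna_ACTB_expr"]  # made-up columns

identifier = name_correlation_group(group, key_names)
unnamed = name_correlation_group(["rna_ACTB_expr"], key_names)
```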

field large_data_threshold: Union[str, Tuple[int, int]] = '(500, 1000)': Threshold (rows, columns) above which large-dataset processing is applied, simplifying certain analysis steps that may not make sense for large data. The threshold is met when either the number of rows or the number of columns of the dataset exceeds its limit.

field huge_data_threshold: Union[str, Tuple[int, int]] = '(2000, 5000)': Threshold (rows, columns) above which certain analysis steps that may not make sense for huge data are omitted. The threshold is met when either the number of rows or the number of columns of the dataset exceeds its limit.
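
Both thresholds accept either a string such as '(500, 1000)' or a tuple of two integers. A minimal sketch of how such a value could be parsed and applied follows; the helper names and the "regular"/"large"/"huge" labels are hypothetical, and treating the threshold as met when either dimension exceeds its limit is one reading of the field descriptions.

```python
import ast


def parse_threshold(value):
    """Accept either a '(rows, columns)' string or a pair of ints."""
    if isinstance(value, str):
        value = ast.literal_eval(value)  # safely evaluates the tuple literal
    rows, columns = value
    return int(rows), int(columns)


def exceeds(shape, threshold):
    """True when the row count or the column count exceeds its limit."""
    limit_rows, limit_columns = parse_threshold(threshold)
    return shape[0] > limit_rows or shape[1] > limit_columns


def dataset_size_class(shape, large="(500, 1000)", huge="(2000, 5000)"):
    """Classify a (rows, columns) shape against the two thresholds."""
    if exceeds(shape, huge):
        return "huge"
    if exceeds(shape, large):
        return "large"
    return "regular"
```

For example, a dataset with 600 rows and 200 columns would be classified as "large" under the defaults, because the row count alone exceeds 500.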

dict(*args, **kwargs) → Dict[str, Any]: Convert object to dictionary.

class logml.configuration.eda.EDAArtifactsGenerationSection

Bases: pydantic.main.BaseModel

Configure Exploratory Data Analysis (EDA).

JSON schema

{
   "title": "EDAArtifactsGenerationSection",
   "description": "Configure Exploratory Data Analysis (EDA).",
   "type": "object",
   "properties": {
      "enable": {
         "title": "Enable",
         "description": "Whether to enable EDA artifacts generation. Tightly coupled with BaselineKit report generation - required step in case EDA sections are needed there.",
         "default": true,
         "type": "boolean"
      },
      "preprocessing_problem_id": {
         "title": "Preprocessing Problem Id",
         "description": "Existing within the config modeling problem id is expected. If not set, EDA artifacts will be built using \"raw\" dataframe. This options allows a user to reference some available modeling problem and reuse it's preprocessing pipeline.",
         "default": "",
         "type": "string"
      },
      "dataset_preprocessing": {
         "title": "Dataset Preprocessing",
         "description": "Declare preprocessing rules specific to EDA, e.g drop identifiers, null values, etc. This configuration has priority over `preprocessing_problem_id`.",
         "default": {
            "enable": false,
            "preset": {
               "enable": false,
               "features_list": [],
               "remove_correlated_features": true,
               "nans_per_row_fraction_threshold": 0.9,
               "nans_fraction_threshold": 0.7,
               "apply_log1p_to_target": false,
               "drop_datetime_columns": true,
               "drop_dna_wt": false,
               "imputer": "median"
            },
            "steps": []
         },
         "allOf": [
            {
               "$ref": "#/definitions/DatasetPreprocessingSection"
            }
         ]
      },
      "artifacts": {
         "title": "Artifacts",
         "description": "List of required items to generate. Leave empty to generate all registered items.",
         "default": [],
         "type": "array",
         "items": {
            "$ref": "#/definitions/EDAArtifactSection"
         }
      },
      "params": {
         "title": "Params",
         "default": {
            "correlation_type": "pearson",
            "correlation_threshold": 0.8,
            "correlation_min_samples_fraction": 0.2,
            "correlation_group_level_cutoff": 1,
            "correlation_key_names": [
               "TP53",
               "KRAS",
               "CDKN2A",
               "CDKN2B",
               "PIK3CA",
               "ATM",
               "BRCA1",
               "SOX2",
               "GNAS2",
               "TERC",
               "STK11",
               "PDCD1",
               "LAG3",
               "TIGIT",
               "HAVCR2",
               "EOMES",
               "MTAP"
            ],
            "large_data_threshold": "(500, 1000)",
            "huge_data_threshold": "(2000, 5000)"
         },
         "allOf": [
            {
               "$ref": "#/definitions/EDAArtifactsGenerationParameters"
            }
         ]
      }
   },
   "definitions": {
      "DatasetPreprocessingPresetSection": {
         "title": "DatasetPreprocessingPresetSection",
         "description": "Defines 'syntax sugar' for semi-automated data preprocessing steps generation.",
         "type": "object",
         "properties": {
            "enable": {
               "title": "Enable",
               "description": "Whether to enable automated generation of preprocessing steps.",
               "default": true,
               "type": "boolean"
            },
            "features_list": {
               "title": "Features List",
               "description": "Defines a list of features (referenced by regexps) that should be selected. Additional option\n            is just to reference a configuration file that contains the required list of features:\n            ...\n            features_list: sub_cfg/features_list.yaml  # a config file\n            ...\n        ",
               "default": [],
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "array",
                     "items": {
                        "type": "string"
                     }
                  }
               ]
            },
            "remove_correlated_features": {
               "title": "Remove Correlated Features",
               "description": "Whether to include a step that removes correlated features.",
               "default": true,
               "type": "boolean"
            },
            "nans_per_row_fraction_threshold": {
               "title": "Nans Per Row Fraction Threshold",
               "description": "Defines maximum acceptable fraction of NaNs within a row.",
               "default": 0.9,
               "type": "number"
            },
            "nans_fraction_threshold": {
               "title": "Nans Fraction Threshold",
               "description": "Defines maximum acceptable fraction of NaNs within a column.",
               "default": 0.7,
               "type": "number"
            },
            "apply_log1p_to_target": {
               "title": "Apply Log1P To Target",
               "description": "Whether to apply log1p transformation to target column (applicable only for regression problems).",
               "default": false,
               "type": "boolean"
            },
            "drop_datetime_columns": {
               "title": "Drop Datetime Columns",
               "description": "Whether to drop date time columns.",
               "default": true,
               "type": "boolean"
            },
            "drop_dna_wt": {
               "title": "Drop Dna Wt",
               "description": "Whether to drop DNA WT values after one-hot-encoding.",
               "default": false,
               "type": "boolean"
            },
            "imputer": {
               "title": "Imputer",
               "description": "Imputer to use. Possible values: (median, mice)",
               "default": "median",
               "type": "string"
            }
         }
      },
      "PreprocessingStep": {
         "title": "PreprocessingStep",
         "description": "Defines data preprocessing step.",
         "type": "object",
         "properties": {
            "enable": {
               "title": "Enable",
               "description": "Whether to enable preprocessing step.",
               "default": true,
               "type": "boolean"
            },
            "transformer": {
               "title": "Transformer",
               "description": "Alias of transformer to use. Please refer to :lml:ref:`Data Transformers` for details.",
               "type": "string"
            },
            "params": {
               "title": "Params",
               "description": "Parameters that will be passed to the correspoding transformer instance.",
               "default": {},
               "type": "object"
            }
         },
         "required": [
            "transformer"
         ]
      },
      "DatasetPreprocessingSection": {
         "title": "DatasetPreprocessingSection",
         "description": "Defines data preprocessing section for modeling/survival setup.",
         "type": "object",
         "properties": {
            "enable": {
               "title": "Enable",
               "description": "Whether to enable Preprocessing Pipeline for dataset transformation.",
               "default": true,
               "type": "boolean"
            },
            "preset": {
               "title": "Preset",
               "default": {
                  "enable": false,
                  "features_list": [],
                  "remove_correlated_features": true,
                  "nans_per_row_fraction_threshold": 0.9,
                  "nans_fraction_threshold": 0.7,
                  "apply_log1p_to_target": false,
                  "drop_datetime_columns": true,
                  "drop_dna_wt": false,
                  "imputer": "median"
               },
               "allOf": [
                  {
                     "$ref": "#/definitions/DatasetPreprocessingPresetSection"
                  }
               ]
            },
            "steps": {
               "title": "Steps",
               "description": "Defines a list of preprocessing steps (transformations) to apply. See :lml:ref:`Data Transformers` for details.",
               "default": [],
               "type": "array",
               "items": {
                  "$ref": "#/definitions/PreprocessingStep"
               }
            }
         }
      },
      "EDAArtifactSection": {
         "title": "EDAArtifactSection",
         "description": "Configuration for an EDA artifact.",
         "type": "object",
         "properties": {
            "enable": {
               "title": "Enable",
               "description": "Whether to enable this EDA artifact generation.",
               "default": true,
               "type": "boolean"
            },
            "name": {
               "title": "Name",
               "description": "Registered artifact name. See lml:ref:`EDA Artifacts` for the complete list.",
               "type": "string"
            }
         },
         "required": [
            "name"
         ]
      },
      "CorrelationType": {
         "title": "CorrelationType",
         "description": "Defines available correlation types.",
         "enum": [
            "pearson",
            "spearman"
         ],
         "type": "string"
      },
      "EDAArtifactsGenerationParameters": {
         "title": "EDAArtifactsGenerationParameters",
         "description": "Defines a set of hyperparams and thresholds that will be used for EDA artifacts generation.",
         "type": "object",
         "properties": {
            "correlation_type": {
               "description": "Type of correlation that will be used to produce EDA artifacts as well as while removing\n            correlated features.",
               "default": "pearson",
               "allOf": [
                  {
                     "$ref": "#/definitions/CorrelationType"
                  }
               ]
            },
            "correlation_threshold": {
               "title": "Correlation Threshold",
               "description": "Defines a correlation threshold that will be used to identify \"correlated\" features.",
               "default": 0.8,
               "type": "number"
            },
            "correlation_min_samples_fraction": {
               "title": "Correlation Min Samples Fraction",
               "description": "Additional parameter that defines the minimum fraction of samples that is required to calculate\n            correlation coefficient between two columns. As NaNs are ignored and correlation coefficient is calculated\n            on top of non-NaN subset of rows for a pair of columns - this parameter could help to make the results\n            more meaningful. Please see the reference of \"min_periods\" here:\n            https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html\n            ",
               "default": 0.2,
               "type": "number"
            },
            "correlation_group_level_cutoff": {
               "title": "Correlation Group Level Cutoff",
               "description": "Sets cutoff for how many levels of neighbours to consider when building correlation groups.\n        \n        For example consider the following correlation matrix:\n            \n        .. code-block::\n        \n                    a    b    c    d\n                a  1.0  0.8  0.8  0.7\n                b  0.8  1.0    0    0\n                c  0.8    0  1.0  0.8\n                d  0.7    0  0.8  1.0\n        \n        Let's say, we use threshold as ``> 0.7``. In this case `a` is correlated strongly with `b` and `c`, and \n        `c` correlated with `d`.\n        \n        When we set cutoff to `1`, we use direct neighbours only, so there is one group `'a', 'c', 'b'`. \n        In this case `d` is not included, because the group has been already formed around `a` column.\n        \n        If we set it to `-1` or anything more than 1, we use all reachable neighbours. In this case, correlation \n        group is formed as ``'a', 'c', 'b', 'd'`` due to fact that `d` is strongly correlated with `c`, disregarding \n        it weak connection to `a`. As you can see, it will result in larger groups, and possibility to assign to the \n        same group columns with correlation less than a threshold. It could reflect cross-correlation more\n        naturally in some cases.\n        ",
               "default": 1,
               "type": "integer"
            },
            "correlation_key_names": {
               "title": "Correlation Key Names",
               "description": "Defines a list of biologically rational gene names (subst) that\n            will be used for correlation groups naming. In case some of those names will appear in one of column names\n            within the same correlation group - the result correlation group identifier will contain those names.",
               "default": [
                  "TP53",
                  "KRAS",
                  "CDKN2A",
                  "CDKN2B",
                  "PIK3CA",
                  "ATM",
                  "BRCA1",
                  "SOX2",
                  "GNAS2",
                  "TERC",
                  "STK11",
                  "PDCD1",
                  "LAG3",
                  "TIGIT",
                  "HAVCR2",
                  "EOMES",
                  "MTAP"
               ],
               "type": "array",
               "items": {
                  "type": "string"
               }
            },
            "large_data_threshold": {
               "title": "Large Data Threshold",
               "description": "Threshold (rows, columns) to apply large dataset processing, simplifying certain analysis steps which may not make sense for large data. Any rows or columns number of the dataset shouldexceed the limit.",
               "default": "(500, 1000)",
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "array",
                     "minItems": 2,
                     "maxItems": 2,
                     "items": [
                        {
                           "type": "integer"
                        },
                        {
                           "type": "integer"
                        }
                     ]
                  }
               ]
            },
            "huge_data_threshold": {
               "title": "Huge Data Threshold",
               "description": "Threshold (rows, columns) to omit certain analysis steps which may not make sense for huge data. Any rows or columns number of the dataset shouldexceed the limit.",
               "default": "(2000, 5000)",
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "array",
                     "minItems": 2,
                     "maxItems": 2,
                     "items": [
                        {
                           "type": "integer"
                        },
                        {
                           "type": "integer"
                        }
                     ]
                  }
               ]
            }
         }
      }
   }
}

Fields

artifacts (List[logml.configuration.eda.EDAArtifactSection])
dataset_preprocessing (logml.configuration.modeling.DatasetPreprocessingSection)
enable (bool)
params (logml.configuration.eda.EDAArtifactsGenerationParameters)
preprocessing_problem_id (str)

field enable: bool = True: Whether to enable EDA artifacts generation. Tightly coupled with BaselineKit report generation - required step in case EDA sections are needed there.

field preprocessing_problem_id: str = '': Id of a modeling problem that exists in the config. If not set, EDA artifacts are built from the “raw” dataframe. This option lets a user reference an available modeling problem and reuse its preprocessing pipeline.

field dataset_preprocessing: logml.configuration.modeling.DatasetPreprocessingSection = DatasetPreprocessingSection(enable=False, preset=DatasetPreprocessingPresetSection(enable=False, features_list=[], remove_correlated_features=True, nans_per_row_fraction_threshold=0.9, nans_fraction_threshold=0.7, apply_log1p_to_target=False, drop_datetime_columns=True, drop_dna_wt=False, imputer='median'), steps=[]): Declares preprocessing rules specific to EDA, e.g. dropping identifiers, null values, etc. This configuration has priority over preprocessing_problem_id.
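
The stated priority of dataset_preprocessing over preprocessing_problem_id can be sketched as a small resolution function over a plain dict shaped like this section. The function name and the returned labels are hypothetical, as is the assumption that the priority only applies when dataset_preprocessing is enabled; this is not logml's code.

```python
def resolve_preprocessing_source(section):
    """Decide which preprocessing an EDA run would use.

    `section` is a plain dict shaped like EDAArtifactsGenerationSection.
    """
    preprocessing = section.get("dataset_preprocessing")
    # A provided section defaults to enable=True, as in its schema;
    # an absent one corresponds to the disabled default of the EDA section.
    if preprocessing is not None and preprocessing.get("enable", True):
        return "eda_specific"
    problem_id = section.get("preprocessing_problem_id", "")
    if problem_id:
        return f"problem:{problem_id}"
    return "raw"


both_set = resolve_preprocessing_source({
    "preprocessing_problem_id": "my_problem",  # hypothetical problem id
    "dataset_preprocessing": {"enable": True, "steps": []},
})
problem_only = resolve_preprocessing_source(
    {"preprocessing_problem_id": "my_problem"}
)
nothing_set = resolve_preprocessing_source({})
```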

field artifacts: List[logml.configuration.eda.EDAArtifactSection] = []: List of required items to generate. Leave empty to generate all registered items.
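
The "leave empty to generate all" rule for artifacts can be illustrated as follows. The artifact names are placeholders (the real ones are listed under EDA Artifacts), the function name is hypothetical, and skipping entries with enable set to false is an assumption based on the enable field of EDAArtifactSection.

```python
def select_artifacts(registered, configured):
    """Return the artifact names to generate.

    An empty `configured` list selects every registered artifact;
    otherwise only listed entries with enable=True (the default) are kept.
    """
    if not configured:
        return list(registered)
    return [item["name"] for item in configured if item.get("enable", True)]


registered = ["artifact_a", "artifact_b", "artifact_c"]  # placeholder names

everything = select_artifacts(registered, [])
subset = select_artifacts(registered, [
    {"name": "artifact_a"},
    {"name": "artifact_b", "enable": False},
])
```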

field params: logml.configuration.eda.EDAArtifactsGenerationParameters = EDAArtifactsGenerationParameters(correlation_type=<CorrelationType.PEARSON: 'pearson'>, correlation_threshold=0.8, correlation_min_samples_fraction=0.2, correlation_group_level_cutoff=1, correlation_key_names=['TP53', 'KRAS', 'CDKN2A', 'CDKN2B', 'PIK3CA', 'ATM', 'BRCA1', 'SOX2', 'GNAS2', 'TERC', 'STK11', 'PDCD1', 'LAG3', 'TIGIT', 'HAVCR2', 'EOMES', 'MTAP'], large_data_threshold=(500, 1000), huge_data_threshold=(2000, 5000))

static get_default_correlation_config(): Returns default configuration section.