logml.configuration.global_config

Functions

`flatten_errors`(errors, config[, loc])	Format pydantic validation message in a human-readable way (use class name instead of '__root__' string.
`format_validation_error`(ve)	Format pydantic error
`print_schema`([output, use_json])	Print configuration file schema.
`validate_config`(file, output)	Validate config file schema and content.

class logml.configuration.global_config.ConfigSources

Bases: pydantic.main.BaseModel

Internal information about logml configuration source files.

Show JSON schema

{
   "title": "ConfigSources",
   "description": "Internal information about logml configuration source files.",
   "type": "object",
   "properties": {
      "main_config": {
         "title": "Main Config",
         "description": "Path to main configuration file.",
         "default": "",
         "type": "string"
      },
      "refs": {
         "title": "Refs",
         "description": "Paths to files referred by the main config file.",
         "default": [],
         "type": "array",
         "items": {
            "type": "string"
         }
      }
   }
}

Fields

main_config (str)
refs (List[str])

field main_config: str = '': Path to main configuration file.

field refs: List[str] = []: Paths to files referred by the main config file.

copy_to(dest_folder: Union[str, pathlib.Path]) → None: Copies source config files to dest_folder location.

logml.configuration.global_config.flatten_errors(errors: Sequence[Any], config: Type['BaseConfig'], loc: Optional['Loc'] = None) → Generator[Dict[str, Any], None, None]

Format pydantic validation message in a human-readable way (use class name instead of ‘__root__’ string.

(In standard way, nested models errors are reported as this: “Error __root__ -> __root__ -> __root__ -> params: Required property missing.”. In our case you are able to observe class names.)

class logml.configuration.global_config.GlobalConfig

Bases: pydantic.main.BaseModel

Global config schema.

When configuring logml, an object of this class is populated from yaml config directly, meaning that all root-level yaml entries are mapped to properties of this class.

Show JSON schema

{
   "title": "GlobalConfig",
   "description": "Global config schema.\n\nWhen configuring logml, an object of this class is populated from yaml config directly,\nmeaning that all root-level yaml entries are mapped to properties of this class.",
   "type": "object",
   "properties": {
      "version": {
         "title": "Version",
         "description": "Corresponds to the last compatible LogML version, e.g. \"0.2.4\". This field is for information only, logml does not validate it.",
         "default": "unknown",
         "type": "string"
      },
      "random_state": {
         "title": "Random State",
         "description": "Random state for main random numbers generator.",
         "type": "integer"
      },
      "dataset_metadata": {
         "title": "Dataset Metadata",
         "description": "Metadata for the incoming dataset (identifiers, columns groups, etc).",
         "allOf": [
            {
               "$ref": "#/definitions/DatasetMetadataSection"
            }
         ]
      },
      "stratification": {
         "title": "Stratification",
         "description": "Stata are sub-sets of the original dataset, independent of one another.\n\n    Essentially, the complete analysis pipeline runs for a each single stratum independently (but all strata are\n    compared in a special report page named \u201cCross-strata comparison\u201d). Stratification can follow for example\n    treatment arms or test/control grouping of samples.\n\n    NOTE: `analysis` section does not follow stratification rules, but defines its own specific ones.",
         "default": [],
         "type": "array",
         "items": {
            "$ref": "#/definitions/Strata"
         }
      },
      "survival_analysis": {
         "title": "Survival Analysis",
         "default": {
            "enable": false,
            "problems": {}
         },
         "allOf": [
            {
               "$ref": "#/definitions/SurvivalAnalysisSection"
            }
         ]
      },
      "modeling": {
         "title": "Modeling",
         "default": {
            "enable": false,
            "cross_validation": {
               "random_state": null,
               "split_type": "kfold",
               "n_folds": 20,
               "test_size": 0.2,
               "type": "",
               "params": {}
            },
            "hpo": {
               "algorithm": "tpe",
               "max_evals": 3,
               "random_state": null
            },
            "problems": {}
         },
         "allOf": [
            {
               "$ref": "#/definitions/ModelingSection"
            }
         ]
      },
      "eda": {
         "title": "Eda",
         "default": {
            "enable": false,
            "preprocessing_problem_id": "",
            "dataset_preprocessing": {
               "enable": false,
               "preset": {
                  "enable": false,
                  "features_list": [],
                  "remove_correlated_features": true,
                  "nans_per_row_fraction_threshold": 0.9,
                  "nans_fraction_threshold": 0.7,
                  "apply_log1p_to_target": false,
                  "drop_datetime_columns": true,
                  "drop_dna_wt": false,
                  "imputer": "median"
               },
               "steps": []
            },
            "artifacts": [],
            "params": {
               "correlation_type": "pearson",
               "correlation_threshold": 0.8,
               "correlation_min_samples_fraction": 0.2,
               "correlation_group_level_cutoff": 1,
               "correlation_key_names": [
                  "TP53",
                  "KRAS",
                  "CDKN2A",
                  "CDKN2B",
                  "PIK3CA",
                  "ATM",
                  "BRCA1",
                  "SOX2",
                  "GNAS2",
                  "TERC",
                  "STK11",
                  "PDCD1",
                  "LAG3",
                  "TIGIT",
                  "HAVCR2",
                  "EOMES",
                  "MTAP"
               ],
               "large_data_threshold": "(500, 1000)",
               "huge_data_threshold": "(2000, 5000)"
            }
         },
         "allOf": [
            {
               "$ref": "#/definitions/EDAArtifactsGenerationSection"
            }
         ]
      },
      "report": {
         "title": "Report",
         "default": {
            "enable": true,
            "report_structure": {
               "master_summary": {
                  "enable": false,
                  "eda": false,
                  "feature_importance": "",
                  "survival_feature_importance": "",
                  "baseline_modeling": "",
                  "survival_analysis": ""
               },
               "eda": false,
               "modeling": [],
               "cross_strata_fi_summary": [],
               "survival_analysis": [],
               "greedy_split": [],
               "rnaseq_differential_expression": [],
               "rnaseq_enrichment_analysis": [],
               "report_diagnostics": true,
               "report_summary": true
            },
            "workflow": {
               "produce_and_execute_strata_notebooks": true,
               "produce_and_execute_global_notebooks": true,
               "generate_report": true
            }
         },
         "allOf": [
            {
               "$ref": "#/definitions/ReportSection"
            }
         ]
      },
      "analysis": {
         "title": "Analysis",
         "default": {
            "enable": false,
            "items": []
         },
         "allOf": [
            {
               "$ref": "#/definitions/AnalysisPipelineConfig"
            }
         ]
      },
      "is_dag_config": {
         "title": "Is Dag Config",
         "default": false,
         "type": "boolean"
      },
      "source_files": {
         "title": "Source Files",
         "default": {
            "main_config": "",
            "refs": []
         },
         "allOf": [
            {
               "$ref": "#/definitions/ConfigSources"
            }
         ]
      }
   },
   "definitions": {
      "SurvivalTimeSpec": {
         "title": "SurvivalTimeSpec",
         "description": "Configure right-censored time-to-event columns in the dataset",
         "type": "object",
         "properties": {
            "time_column": {
               "title": "Time Column",
               "description": "Column name that contains time-to-event values (usually OS or PFS).",
               "type": "string"
            },
            "event_query": {
               "title": "Event Query",
               "description": "Query-like expression that indicates \"events\" (\"uncensored\") samples. For example: \"OS_CNSR == 1\". See :ref:`Dataset Queries` for details.",
               "type": "string"
            },
            "event_column": {
               "title": "Event Column",
               "description": "Column used for event calculation. (We have to specify it so that is can be removed from features list after the dataset preprocessing). If you specify `event_query: \"OS_CNSR == 1\"`, then also put `event_column: OS_CNSR`. ",
               "type": "string"
            },
            "target_metric": {
               "title": "Target Metric",
               "description": "Metric (loss) that will be used to evaluate survival models performance. Please refer to :lml:ref:`ML Metrics` for details.",
               "default": "cindex",
               "type": "string"
            },
            "SPEC_TYPE": {
               "title": "Spec Type",
               "default": "survival",
               "type": "string"
            }
         },
         "required": [
            "time_column",
            "event_query",
            "event_column"
         ]
      },
      "ModelingTask": {
         "title": "ModelingTask",
         "description": "Defines supported modeling tasks.",
         "enum": [
            "classification",
            "regression",
            "survival"
         ],
         "type": "string"
      },
      "ModelingTargetSpec": {
         "title": "ModelingTargetSpec",
         "description": "Specification for modeling target (regression/classification)",
         "type": "object",
         "properties": {
            "target_column": {
               "title": "Target Column",
               "description": "Target column for modeling, also known as dependent variable or outcome. In case of survival modeling, this column should contain time-to-event values (usuallyOS or PFS).",
               "type": "string"
            },
            "task": {
               "description": "Problem definition for modeling setup. Possible options: \"classification\", \"regression\", \"survival\".",
               "default": "regression",
               "allOf": [
                  {
                     "$ref": "#/definitions/ModelingTask"
                  }
               ]
            },
            "target_metric": {
               "title": "Target Metric",
               "description": "Metric (loss) that will be used to evaluate models performance. Typical options per modeling objective: \"logloss\" for classification, \"mse\" for regression, \"cindex_inv\" for survival (inverted concordance index, so that it could be minimized). NOTE: at the moment only loss function are supported (minimization problems). Please refer to :lml:ref:`ML Metrics` for details.",
               "default": "",
               "type": "string"
            },
            "SPEC_TYPE": {
               "title": "Spec Type",
               "default": "reg-clf",
               "type": "string"
            }
         },
         "required": [
            "target_column"
         ]
      },
      "ColumnSpec": {
         "title": "ColumnSpec",
         "description": "Configure special columns.",
         "type": "object",
         "properties": {
            "name": {
               "title": "Name",
               "description": "Special column name.",
               "type": "string"
            },
            "comment": {
               "title": "Comment",
               "description": "Column description (e.g. \"Treatment Arm, used for stratification in such and such analysis.\")",
               "default": "",
               "type": "string"
            }
         },
         "required": [
            "name"
         ]
      },
      "ColumnMetadataConfig": {
         "title": "ColumnMetadataConfig",
         "description": "Column-specific metadata, currently including data type.\n\nIn future this structure can be extended to contain field description,\nsemantics, display name, formatting, etc.",
         "type": "object",
         "properties": {
            "name": {
               "title": "Name",
               "description": "Column name. Used to refer to column in the dataframe directly.",
               "type": "string"
            },
            "data_type": {
               "title": "Data Type",
               "description": "Data type for the field. Most frequent are `string`, `int`, `float`, `datetime64[ns]`.\n\nIf not specified, automatically detected while reading original dataset.\n\nSee `https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes` forlist of available standard pandas types.",
               "default": "",
               "type": "string"
            },
            "is_categorical": {
               "title": "Is Categorical",
               "description": "Specify if a column should be considered as categorical (opposed to continuous numeric). Not applicable to date-time types, but can be string, integer or float.",
               "default": false,
               "type": "boolean"
            },
            "parent_name": {
               "title": "Parent Name",
               "description": "Column name which used to produce current column as a result of transformation.",
               "type": "string"
            },
            "description": {
               "title": "Description",
               "description": "Column description.",
               "type": "string"
            },
            "group": {
               "title": "Group",
               "description": "Name of a group this column belongs to. If MCT config is provided, set to column input_source name.",
               "type": "string"
            }
         },
         "required": [
            "name"
         ]
      },
      "DatasetMetadataSection": {
         "title": "DatasetMetadataSection",
         "description": "Defines metadata for modeling setup: modeling objective, target column and evaluation metric.",
         "type": "object",
         "properties": {
            "key_columns": {
               "title": "Key Columns",
               "description": "List of identifier fields for a row in the dataset.",
               "default": [],
               "type": "array",
               "items": {
                  "type": "string"
               }
            },
            "modeling_specs": {
               "title": "Modeling Specs",
               "description": "Collection of modeling specification for survival/regression/classification problems.\n            NOTE: Modeling problem ids are expected to have corresponding values.",
               "default": {},
               "type": "object",
               "additionalProperties": {
                  "anyOf": [
                     {
                        "$ref": "#/definitions/SurvivalTimeSpec"
                     },
                     {
                        "$ref": "#/definitions/ModelingTargetSpec"
                     }
                  ]
               }
            },
            "columns_specs": {
               "title": "Columns Specs",
               "description": "Named collection of special columns (targets, groupings, etc.)",
               "default": {},
               "type": "object",
               "additionalProperties": {
                  "$ref": "#/definitions/ColumnSpec"
               }
            },
            "columns_metadata": {
               "title": "Columns Metadata",
               "description": "Provide list of columns-specific metadata.",
               "default": [],
               "type": "array",
               "items": {
                  "$ref": "#/definitions/ColumnMetadataConfig"
               }
            }
         }
      },
      "Strata": {
         "title": "Strata",
         "description": "Defines data subset (stratum) by the means of query to filter origina dataset.\n\nThis allows to run analysis against different subsets of data, and then comparing\n\nTypical example here: you would want to get per-arm report (assuming your input file contains a column arm):\n\n.. code-block:: yaml\n\n    stratification:\n    # Here we wanted to separate data by treatment arms into two groups.\n    - strata_id: A_arm query: \u2018arm == \u201cA\u201d\u2019\n    - strata_id: BC_arms query: \u2018arm.isin([\u201cB\u201d, \u201cC\u201d])\u2019",
         "type": "object",
         "properties": {
            "strata_id": {
               "title": "Strata Id",
               "description": "Unique identifier for a stratum. This identifier is also used as a folder name to store stratum-relared data on a disk, so it should not contain files-specific symbols like slashes. NOTE: spaces will be replaced with underscores.",
               "type": "string"
            },
            "query": {
               "title": "Query",
               "description": "Query-like expression that indicates how to select samples for corresponding stratum. This follows python syntax which is quite unlike SQL. See :ref:`Dataset Queries` for details.",
               "type": "string"
            }
         },
         "required": [
            "strata_id",
            "query"
         ]
      },
      "DatasetPreprocessingPresetSection": {
         "title": "DatasetPreprocessingPresetSection",
         "description": "Defines 'syntax sugar' for semi-automated data preprocessing steps generation.",
         "type": "object",
         "properties": {
            "enable": {
               "title": "Enable",
               "description": "Whether to enable automated generation of preprocessing steps.",
               "default": true,
               "type": "boolean"
            },
            "features_list": {
               "title": "Features List",
               "description": "Defines a list of features (referenced by regexps) that should be selected. Additional option\n            is just to reference a configuration file that contains the required list of features:\n            ...\n            features_list: sub_cfg/features_list.yaml  # a config file\n            ...\n        ",
               "default": [],
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "array",
                     "items": {
                        "type": "string"
                     }
                  }
               ]
            },
            "remove_correlated_features": {
               "title": "Remove Correlated Features",
               "description": "Whether to include a step that removes correlated features.",
               "default": true,
               "type": "boolean"
            },
            "nans_per_row_fraction_threshold": {
               "title": "Nans Per Row Fraction Threshold",
               "description": "Defines maximum acceptable fraction of NaNs within a row.",
               "default": 0.9,
               "type": "number"
            },
            "nans_fraction_threshold": {
               "title": "Nans Fraction Threshold",
               "description": "Defines maximum acceptable fraction of NaNs within a column.",
               "default": 0.7,
               "type": "number"
            },
            "apply_log1p_to_target": {
               "title": "Apply Log1P To Target",
               "description": "Whether to apply log1p transformation to target column (applicable only for regression problems).",
               "default": false,
               "type": "boolean"
            },
            "drop_datetime_columns": {
               "title": "Drop Datetime Columns",
               "description": "Whether to drop date time columns.",
               "default": true,
               "type": "boolean"
            },
            "drop_dna_wt": {
               "title": "Drop Dna Wt",
               "description": "Whether to drop DNA WT values after one-hot-encoding.",
               "default": false,
               "type": "boolean"
            },
            "imputer": {
               "title": "Imputer",
               "description": "Imputer to use. Possible values: (median, mice)",
               "default": "median",
               "type": "string"
            }
         }
      },
      "PreprocessingStep": {
         "title": "PreprocessingStep",
         "description": "Defines data preprocessing step.",
         "type": "object",
         "properties": {
            "enable": {
               "title": "Enable",
               "description": "Whether to enable preprocessing step.",
               "default": true,
               "type": "boolean"
            },
            "transformer": {
               "title": "Transformer",
               "description": "Alias of transformer to use. Please refer to :lml:ref:`Data Transformers` for details.",
               "type": "string"
            },
            "params": {
               "title": "Params",
               "description": "Parameters that will be passed to the correspoding transformer instance.",
               "default": {},
               "type": "object"
            }
         },
         "required": [
            "transformer"
         ]
      },
      "DatasetPreprocessingSection": {
         "title": "DatasetPreprocessingSection",
         "description": "Defines data preprocessing section for modeling/survival setup.",
         "type": "object",
         "properties": {
            "enable": {
               "title": "Enable",
               "description": "Whether to enable Preprocessing Pipeline for dataset transformation.",
               "default": true,
               "type": "boolean"
            },
            "preset": {
               "title": "Preset",
               "default": {
                  "enable": false,
                  "features_list": [],
                  "remove_correlated_features": true,
                  "nans_per_row_fraction_threshold": 0.9,
                  "nans_fraction_threshold": 0.7,
                  "apply_log1p_to_target": false,
                  "drop_datetime_columns": true,
                  "drop_dna_wt": false,
                  "imputer": "median"
               },
               "allOf": [
                  {
                     "$ref": "#/definitions/DatasetPreprocessingPresetSection"
                  }
               ]
            },
            "steps": {
               "title": "Steps",
               "description": "Defines a list of preprocessing steps (transformations) to apply. See :lml:ref:`Data Transformers` for details.",
               "default": [],
               "type": "array",
               "items": {
                  "$ref": "#/definitions/PreprocessingStep"
               }
            }
         }
      },
      "SurvivalAnalysisMethod": {
         "title": "SurvivalAnalysisMethod",
         "description": "Configures specific survival analysis method.\n\nThere are the following registered methods: :lml:ref:`Survival Analysis Methods`.",
         "type": "object",
         "properties": {
            "enable": {
               "title": "Enable",
               "description": "Enable or disable this analysis.",
               "default": true,
               "type": "boolean"
            },
            "method_id": {
               "title": "Method Id",
               "description": "Alias of survival analysis method to use. Refer to :lml:ref:`Survival Analysis Methods` for details.",
               "type": "string"
            },
            "params": {
               "title": "Params",
               "description": "Parameters specific for the chosen analysis method. Refer to :lml:ref:`Survival Analysis Methods` to find out exact configurationstructure for the method.",
               "default": {},
               "type": "object"
            }
         },
         "required": [
            "method_id"
         ]
      },
      "SurvivalAnalysisSetup": {
         "title": "SurvivalAnalysisSetup",
         "description": "Defines configuration for a \"survival analysis problem\", also called briefly \"setup\".",
         "type": "object",
         "properties": {
            "enable": {
               "title": "Enable",
               "description": "Enable or disable survival anaysis problem.",
               "default": true,
               "type": "boolean"
            },
            "survival_metric": {
               "title": "Survival Metric",
               "description": "Column of the input dataset that contains \"time-to-event\" values. Usually this is overall\n            survival (OS) or progression-free-survival (PFS) time. NOTE: deprecated, please use\n            \"dataset_metadata\" section.\n        ",
               "default": "",
               "type": "string"
            },
            "event_observed": {
               "title": "Event Observed",
               "description": "Query-like expression that indicates \"events\" (\"uncensored\") samples. For example:\n            \"OS_CNSR == 1\". See :ref:`Dataset Queries` for details. NOTE: deprecated, please use\n            \"dataset_metadata\" section.\n        ",
               "default": "",
               "type": "string"
            },
            "event_column": {
               "title": "Event Column",
               "description": "Column used for event calculation. (We have to specify it so that is can be removed\n            from features list after the dataset preprocessing).\n            If you specify `event_observed: \"OS_CNSR == 1\"`, then also put `event_column: OS_CNSR`.\n            If not specified, we attempt to extract column name from the `event_observed` expression.\n            NOTE: deprecated, please use \"dataset_metadata\" section.\n        ",
               "default": "",
               "type": "string"
            },
            "dataset_preprocessing": {
               "title": "Dataset Preprocessing",
               "description": "Defines dataset preprocessing configuration. It runs before `time` and `event` data extraction, so make sure the final dataset contains both columns.",
               "default": {
                  "enable": true,
                  "preset": {
                     "enable": false,
                     "features_list": [],
                     "remove_correlated_features": true,
                     "nans_per_row_fraction_threshold": 0.9,
                     "nans_fraction_threshold": 0.7,
                     "apply_log1p_to_target": false,
                     "drop_datetime_columns": true,
                     "drop_dna_wt": false,
                     "imputer": "median"
                  },
                  "steps": []
               },
               "allOf": [
                  {
                     "$ref": "#/definitions/DatasetPreprocessingSection"
                  }
               ]
            },
            "methods": {
               "title": "Methods",
               "description": "List of configurations for specific survival analysis methods to apply.",
               "default": [],
               "type": "array",
               "items": {
                  "$ref": "#/definitions/SurvivalAnalysisMethod"
               }
            }
         }
      },
      "SurvivalAnalysisSection": {
         "title": "SurvivalAnalysisSection",
         "description": "Defines survival analysis section.\n\nSurvival analysis goal is to consider the data from survival modeling perspecive and\nthen make a conclusion based on the results.\n\nExample is Kaplan-Meier univariate survival estimator, which shows how good median feature\nvalues separate data to two groups.",
         "type": "object",
         "properties": {
            "enable": {
               "title": "Enable",
               "description": "Enable or disable Survival Anaysis.",
               "default": true,
               "type": "boolean"
            },
            "problems": {
               "title": "Problems",
               "description": "Defines list of \"survival problem setup\" configurations. Usually problems are similar to one another, but have different target variables.",
               "default": {},
               "type": "object",
               "additionalProperties": {
                  "$ref": "#/definitions/SurvivalAnalysisSetup"
               }
            }
         }
      },
      "CVSplitType": {
         "title": "CVSplitType",
         "description": "Type of CV splits: k-fold or shuffle",
         "enum": [
            "kfold",
            "shuffle"
         ],
         "type": "string"
      },
      "CrossValidationSection": {
         "title": "CrossValidationSection",
         "description": "Configure CV application for the dataset.",
         "type": "object",
         "properties": {
            "random_state": {
               "title": "Random State",
               "description": "State to initialize random numbers generation.",
               "type": "integer"
            },
            "split_type": {
               "description": "Configures coverage of splits. 'kfold' covers dataset completely, 'shuffle' - does not guarantee it due to sampling.",
               "default": "kfold",
               "allOf": [
                  {
                     "$ref": "#/definitions/CVSplitType"
                  }
               ]
            },
            "n_folds": {
               "title": "N Folds",
               "description": "How many CV folds should be produced.",
               "default": 20,
               "type": "integer"
            },
            "test_size": {
               "title": "Test Size",
               "description": "Which portion of the dataset to leave for evaluation of the fold.",
               "default": 0.2,
               "type": "number"
            },
            "type": {
               "title": "Type",
               "description": "To be set automatically. Cross Validation strategy alias to use (\"kfold\", \"stratifiedkfold\", etc.). Reference: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection",
               "default": "",
               "type": "string"
            },
            "params": {
               "title": "Params",
               "description": "To be set automatically.Parameters that will be passed to corresponding Scikit-learn classes. Please refer to the official Scikit-learn documentation for details.",
               "default": {},
               "type": "object"
            }
         }
      },
      "HPOSection": {
         "title": "HPOSection",
         "description": "Configure hyper-params optimization for models selection process.",
         "type": "object",
         "properties": {
            "algorithm": {
               "title": "Algorithm",
               "description": "Target \"hyperopt\" algorithm that will be used for models hyper-parameter optimization.",
               "default": "tpe",
               "type": "string"
            },
            "max_evals": {
               "title": "Max Evals",
               "description": "Defines a target number of HPO trials for all models. The more trials - the better models (in theory), the less trials - the faster HPO is done.",
               "default": 3,
               "type": "integer"
            },
            "random_state": {
               "title": "Random State",
               "description": "Random state",
               "type": "integer"
            }
         }
      },
      "ModelingPresetConfiguration": {
         "title": "ModelingPresetConfiguration",
         "description": "Defines an approach for automatically configuring modeling sections.",
         "type": "object",
         "properties": {
            "enable": {
               "title": "Enable",
               "description": "Enables automatic presets for modeling pipeline sections.",
               "default": true,
               "type": "boolean"
            },
            "features_list": {
               "title": "Features List",
               "description": "Defines a list of features (referenced by regexps) that should be selected. Additional option\n            is just to reference a configuration file that contains the required list of features:\n            ...\n            features_list: sub_cfg/features_list.yaml  # a config file\n            ...\n        ",
               "default": [
                  ".*"
               ],
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "array",
                     "items": {
                        "type": "string"
                     }
                  }
               ]
            }
         }
      },
      "ModelingTaskSpec": {
         "title": "ModelingTaskSpec",
         "description": "Defines metadata for modeling setup: modeling objective, target column and evaluation metric.",
         "type": "object",
         "properties": {
            "task": {
               "description": "Problem definition for modeling setup. Possible options: \"classification\", \"regression\", \"survival\".",
               "allOf": [
                  {
                     "$ref": "#/definitions/ModelingTask"
                  }
               ]
            },
            "target": {
               "title": "Target",
               "description": "Target column for modeling, also known as dependent variable or outcome. In case of survival modeling, this column should contain time-to-event values (usuallyOS or PFS).",
               "type": "string"
            },
            "target_metric": {
               "title": "Target Metric",
               "description": "Metric (loss) that will be used to evaluate models performance. Typical options per modeling objective: \"logloss\" for classification, \"mse\" for regression, \"cindex_inv\" for survival (inverted concordance index, so that it could be minimized). NOTE: at the moment only loss function are supported (minimization problems). Please refer to :lml:ref:`ML Metrics` for details.",
               "default": "",
               "type": "string"
            },
            "event_query": {
               "title": "Event Query",
               "description": "(Applies for survival problems.) Query-like expression that indicates \"events\" (\"uncensored\") samples. For example: \"OS_CNSR == 1\". See :ref:`Dataset Queries` for details.",
               "default": "",
               "type": "string"
            },
            "event_column": {
               "title": "Event Column",
               "description": "(Applies for survival problems.) Column used for event calculation. (We have to specify it so that is can be removed from features list after the dataset preprocessing). If you specify `event_observed: \"OS_CNSR == 1\"`, then also put `event_column: OS_CNSR`. ",
               "default": "",
               "type": "string"
            }
         },
         "required": [
            "task",
            "target"
         ]
      },
      "ModelSelectionConfig": {
         "title": "ModelSelectionConfig",
         "description": "Configuration for particular model type selection.",
         "type": "object",
         "properties": {
            "name": {
               "title": "Name",
               "description": "Model's alias to use. Please refer to `EligibleModels` for available options.",
               "type": "string"
            },
            "use_hpo": {
               "title": "Use Hpo",
               "description": "Whether model should be fine-tuned (HPO). Otherwise the default paramameters will be used.",
               "default": true,
               "type": "boolean"
            },
            "hyper_params": {
               "title": "Hyper Params",
               "description": "Hyperparameters to use, in case a user wants to explicitly set those.",
               "default": {},
               "type": "object"
            },
            "params_space": {
               "title": "Params Space",
               "description": "Hyperparameters space to use within HPO (instead of predefined ones).",
               "default": {},
               "type": "object"
            }
         },
         "required": [
            "name"
         ]
      },
      "ModelSearchSection": {
         "title": "ModelSearchSection",
         "description": "Defines model search section.",
         "type": "object",
         "properties": {
            "enable": {
               "title": "Enable",
               "description": "Enable or disable Model Search/Selection process. It is recommended to enable `model_search` section in case `feature_importance` section is enabled. NOTE: Model Search section will be implicitly enabled in case `feature_importance` section is enabled.",
               "default": true,
               "type": "boolean"
            },
            "models_random_state": {
               "title": "Models Random State",
               "description": "Random state for models which require it.",
               "type": "integer"
            },
            "limit": {
               "title": "Limit",
               "description": "Limit number of selected models.",
               "default": 6,
               "type": "integer"
            },
            "models": {
               "title": "Models",
               "description": "Defines a list of models which are to be fine-tuned. NOTE: in case this option is unset, all available models for corresponding \"task\" from `metadata` section will be used. Please refer to :lml:ref:`Model Types` for available options.",
               "default": [],
               "type": "array",
               "items": {
                  "anyOf": [
                     {
                        "$ref": "#/definitions/ModelSelectionConfig"
                     },
                     {
                        "type": "string"
                     }
                  ]
               }
            },
            "baseline_model": {
               "title": "Baseline Model",
               "description": "Defines a model's alias that will be used to filter out models that don't perform better (in terms of averaged \"target_metric\" on cross-validation) than \"baseline\" model. NOTE: by default \"dummy\" model will be used for corresponding \"task\". Please refer to :lml:ref:`Model Types` for available options.",
               "default": "",
               "anyOf": [
                  {
                     "$ref": "#/definitions/ModelSelectionConfig"
                  },
                  {
                     "type": "string"
                  }
               ]
            },
            "pvalue_threshold": {
               "title": "Pvalue Threshold",
               "description": "Threshold for p-value when testing hypothesis that model loss is less than baseline loss. Applicable only when CV list is long enough (>=7)",
               "default": 0.05,
               "type": "number"
            },
            "cross_validation": {
               "title": "Cross Validation",
               "default": {
                  "random_state": null,
                  "split_type": "kfold",
                  "n_folds": 20,
                  "test_size": 0.2,
                  "type": "",
                  "params": {}
               },
               "allOf": [
                  {
                     "$ref": "#/definitions/CrossValidationSection"
                  }
               ]
            }
         }
      },
      "FeatureImportanceMethod": {
         "title": "FeatureImportanceMethod",
         "description": "Defines feature importance method.",
         "type": "object",
         "properties": {
            "enable": {
               "title": "Enable",
               "description": "DEPRECATED Enables feature importance method.",
               "default": true,
               "type": "boolean"
            },
            "extractor_id": {
               "title": "Extractor Id",
               "description": "Alias of feature importance extractor/method to use. Please refer to `EligibleFIExtractors` for details.",
               "type": "string"
            },
            "params": {
               "title": "Params",
               "description": "Parameters that will be passed to the extractor constructor.",
               "default": {},
               "type": "object"
            },
            "n_models": {
               "title": "N Models",
               "description": "DEPRECATED Implicitly use only the top N models (in terms of CV performance) from available \"selected\" candidates. Might make sense to use that option when different models perform better on different stratas. NOTE: ignored when \"models\" option is set. NOTE: should be non-negative. In case \"n_models\" is equal to 0 - all available candidate models are used.",
               "default": 0,
               "type": "integer"
            },
            "models": {
               "title": "Models",
               "description": "DEPRECATED Explicit list of models that should be used (in case Model Selection resulted in too many models - it is possible to narrow down the list).",
               "default": [],
               "type": "array",
               "items": {
                  "type": "string"
               }
            },
            "fallback_model": {
               "title": "Fallback Model",
               "description": "DEPRECATeD Alias of fallback model to use. Please refer to `EligibleModels` for available options. In case Model Selection resulted in no \"reasonable\" models, it still might make sense to use some model anyway for importances extraction.",
               "default": "",
               "type": "string"
            },
            "dump_raw_extractor": {
               "title": "Dump Raw Extractor",
               "description": "NEED REVIEW When set to True, the whole feature importance extractor is dumped to pickle file. Default location is \"extractors\" subfolder (see `FeatureImportanceOutputStructure.get_extractor_dump_path`",
               "default": false,
               "type": "boolean"
            }
         },
         "required": [
            "extractor_id"
         ]
      },
      "FeatureImportanceWorkflow": {
         "title": "FeatureImportanceWorkflow",
         "description": "Defines a workflow for FI execution.",
         "type": "object",
         "properties": {
            "generate_intermediate_results": {
               "title": "Generate Intermediate Results",
               "description": "Enables creation of per-dataset/per-repeat/per-bootstrap-iteration feature importance artifacts.The main purpose of this option is to enable parallelization of feature importance artifacts generation process.",
               "default": true,
               "type": "boolean"
            },
            "aggregate_intermediate_results": {
               "title": "Aggregate Intermediate Results",
               "description": "Enables aggregation of per-dataset feature importance artifacts to \"global\" level. Currently feature importance artifacts are represented as features ranks, aggregation is simply a ranks `averaging` across all dataset-level results. The main purpose of this option is to enable parallelization of feature importance artifacts generation process.",
               "default": true,
               "type": "boolean"
            },
            "generate_global_summary": {
               "title": "Generate Global Summary",
               "description": "Enables summarization of aggregated feature importance results across different method used. The main purpose of this option is to enable parallelization of feature importance artifacts generation process.",
               "default": true,
               "type": "boolean"
            }
         }
      },
      "FeatureImportanceSection": {
         "title": "FeatureImportanceSection",
         "description": "Defines feature importance section for modeling setup.",
         "type": "object",
         "properties": {
            "enable": {
               "title": "Enable",
               "description": "Enables feature importance artifacts generation.",
               "default": true,
               "type": "boolean"
            },
            "cross_validation": {
               "title": "Cross Validation",
               "default": {
                  "random_state": null,
                  "split_type": "kfold",
                  "n_folds": 100,
                  "test_size": 0.25,
                  "type": "",
                  "params": {}
               },
               "allOf": [
                  {
                     "$ref": "#/definitions/CrossValidationSection"
                  }
               ]
            },
            "perform_tier1_greedy": {
               "title": "Perform Tier1 Greedy",
               "default": false,
               "type": "boolean"
            },
            "fid_pvalue_threshold": {
               "title": "Fid Pvalue Threshold",
               "default": 0.05,
               "type": "number"
            },
            "n_random_iters": {
               "title": "N Random Iters",
               "default": 5,
               "type": "integer"
            },
            "random_state": {
               "title": "Random State",
               "description": "State to initialize random numbers generation.",
               "type": "integer"
            },
            "default_extractor": {
               "title": "Default Extractor",
               "description": "Feature importance extractor to be used by default. When not specified, importance is extracted from the model coefficients -  which is naturally possible only for models which support it.",
               "allOf": [
                  {
                     "$ref": "#/definitions/FeatureImportanceMethod"
                  }
               ]
            },
            "default_n_perm_imp_iters": {
               "title": "Default N Perm Imp Iters",
               "description": "Number of permutations for (default) permutation feature extractor.",
               "default": 10,
               "type": "integer"
            },
            "extractors": {
               "title": "Extractors",
               "description": "Map specific model to Feature importance extractor. If not specified, `default_extractor` is used.",
               "default": {},
               "type": "object",
               "additionalProperties": {
                  "$ref": "#/definitions/FeatureImportanceMethod"
               }
            },
            "workflow": {
               "title": "Workflow",
               "default": {
                  "generate_intermediate_results": true,
                  "aggregate_intermediate_results": true,
                  "generate_global_summary": true
               },
               "allOf": [
                  {
                     "$ref": "#/definitions/FeatureImportanceWorkflow"
                  }
               ]
            },
            "methods": {
               "title": "Methods",
               "description": "DEPREPCATED! Enumerates target feature imporatance methods/extractors to apply.",
               "default": [],
               "type": "array",
               "items": {
                  "$ref": "#/definitions/FeatureImportanceMethod"
               }
            }
         }
      },
      "ModelingSetup": {
         "title": "ModelingSetup",
         "description": "Defines parameters for modeling problem (also called \"setup\").\n\nTypical modeling workflow has the following steps:\n\n- metadata - key information for modeling (task, target, metric).\n- dataset preprocessing - preferred strategy for data preparation prior to modeling.\n- datasets - defined bootstrapping setup (number of iterations).\n- model search - defines target set of models to use for feature importance extraction,\n  models are tuned and only appropriate ones (in term of performance) are selected for upstream usage.\n- feature importance - defines target feature extractions methods.",
         "type": "object",
         "properties": {
            "enable": {
               "title": "Enable",
               "description": "Enable or disable modeling setup.",
               "default": true,
               "type": "boolean"
            },
            "preset": {
               "title": "Preset",
               "default": {
                  "enable": false,
                  "features_list": [
                     ".*"
                  ]
               },
               "allOf": [
                  {
                     "$ref": "#/definitions/ModelingPresetConfiguration"
                  }
               ]
            },
            "metadata": {
               "$ref": "#/definitions/ModelingTaskSpec"
            },
            "dataset_preprocessing": {
               "title": "Dataset Preprocessing",
               "default": {
                  "enable": false,
                  "preset": {
                     "enable": false,
                     "features_list": [],
                     "remove_correlated_features": true,
                     "nans_per_row_fraction_threshold": 0.9,
                     "nans_fraction_threshold": 0.7,
                     "apply_log1p_to_target": false,
                     "drop_datetime_columns": true,
                     "drop_dna_wt": false,
                     "imputer": "median"
                  },
                  "steps": []
               },
               "allOf": [
                  {
                     "$ref": "#/definitions/DatasetPreprocessingSection"
                  }
               ]
            },
            "model_search": {
               "title": "Model Search",
               "default": {
                  "enable": true,
                  "models_random_state": null,
                  "limit": 6,
                  "models": [],
                  "baseline_model": "",
                  "pvalue_threshold": 0.05,
                  "cross_validation": {
                     "random_state": null,
                     "split_type": "kfold",
                     "n_folds": 20,
                     "test_size": 0.2,
                     "type": "",
                     "params": {}
                  }
               },
               "allOf": [
                  {
                     "$ref": "#/definitions/ModelSearchSection"
                  }
               ]
            },
            "feature_importance": {
               "title": "Feature Importance",
               "default": {
                  "enable": true,
                  "cross_validation": {
                     "random_state": null,
                     "split_type": "kfold",
                     "n_folds": 100,
                     "test_size": 0.25,
                     "type": "",
                     "params": {}
                  },
                  "perform_tier1_greedy": false,
                  "fid_pvalue_threshold": 0.05,
                  "n_random_iters": 5,
                  "random_state": null,
                  "default_extractor": null,
                  "default_n_perm_imp_iters": 10,
                  "extractors": {},
                  "workflow": {
                     "generate_intermediate_results": true,
                     "aggregate_intermediate_results": true,
                     "generate_global_summary": true
                  },
                  "methods": []
               },
               "allOf": [
                  {
                     "$ref": "#/definitions/FeatureImportanceSection"
                  }
               ]
            }
         }
      },
      "ModelingSection": {
         "title": "ModelingSection",
         "description": "Machine Learning Modeling section definition.\n\nThis section configures \"modeling\" analysis type, which combines the following steps:\n\n- dataset preprocessing, which adjusts data to be suitable for models.\n- model selection, which determines best model to be used at the next steps.\n- feature importance extraction, which calculates features impact on the target variable from\n  ML standpoint. This may be done in several ways, which are indicated as \"FI methods\" below.",
         "type": "object",
         "properties": {
            "enable": {
               "title": "Enable",
               "description": "Enable or disable Modeling workflow.",
               "default": true,
               "type": "boolean"
            },
            "cross_validation": {
               "title": "Cross Validation",
               "default": {
                  "random_state": null,
                  "split_type": "kfold",
                  "n_folds": 20,
                  "test_size": 0.2,
                  "type": "",
                  "params": {}
               },
               "allOf": [
                  {
                     "$ref": "#/definitions/CrossValidationSection"
                  }
               ]
            },
            "hpo": {
               "title": "Hpo",
               "default": {
                  "algorithm": "tpe",
                  "max_evals": 3,
                  "random_state": null
               },
               "allOf": [
                  {
                     "$ref": "#/definitions/HPOSection"
                  }
               ]
            },
            "problems": {
               "title": "Problems",
               "description": "Defines list of \"modeling problem setup\" configurations. Usually problems are similar to one another, but have different target variables.",
               "default": {},
               "type": "object",
               "additionalProperties": {
                  "$ref": "#/definitions/ModelingSetup"
               }
            }
         }
      },
      "EDAArtifactSection": {
         "title": "EDAArtifactSection",
         "description": "Configuration for an EDA artifact.",
         "type": "object",
         "properties": {
            "enable": {
               "title": "Enable",
               "description": "Whether to enable this EDA artifact generation.",
               "default": true,
               "type": "boolean"
            },
            "name": {
               "title": "Name",
               "description": "Registered artifact name. See lml:ref:`EDA Artifacts` for the complete list.",
               "type": "string"
            }
         },
         "required": [
            "name"
         ]
      },
      "CorrelationType": {
         "title": "CorrelationType",
         "description": "Defines available correlation types.",
         "enum": [
            "pearson",
            "spearman"
         ],
         "type": "string"
      },
      "EDAArtifactsGenerationParameters": {
         "title": "EDAArtifactsGenerationParameters",
         "description": "Defines a set of hyperparams and thresholds that will be used for EDA artifacts generation.",
         "type": "object",
         "properties": {
            "correlation_type": {
               "description": "Type of correlation that will be used to produce EDA artifacts as well as while removing\n            correlated features.",
               "default": "pearson",
               "allOf": [
                  {
                     "$ref": "#/definitions/CorrelationType"
                  }
               ]
            },
            "correlation_threshold": {
               "title": "Correlation Threshold",
               "description": "Defines a correlation threshold that will be used to identify \"correlated\" features.",
               "default": 0.8,
               "type": "number"
            },
            "correlation_min_samples_fraction": {
               "title": "Correlation Min Samples Fraction",
               "description": "Additional parameter that defines the minimum fraction of samples that is required to calculate\n            correlation coefficient between two columns. As NaNs are ignored and correlation coefficient is calculated\n            on top of non-NaN subset of rows for a pair of columns - this parameter could help to make the results\n            more meaningful. Please see the reference of \"min_periods\" here:\n            https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html\n            ",
               "default": 0.2,
               "type": "number"
            },
            "correlation_group_level_cutoff": {
               "title": "Correlation Group Level Cutoff",
               "description": "Sets cutoff for how many levels of neighbours to consider when building correlation groups.\n        \n        For example consider the following correlation matrix:\n            \n        .. code-block::\n        \n                    a    b    c    d\n                a  1.0  0.8  0.8  0.7\n                b  0.8  1.0    0    0\n                c  0.8    0  1.0  0.8\n                d  0.7    0  0.8  1.0\n        \n        Let's say, we use threshold as ``> 0.7``. In this case `a` is correlated strongly with `b` and `c`, and \n        `c` correlated with `d`.\n        \n        When we set cutoff to `1`, we use direct neighbours only, so there is one group `'a', 'c', 'b'`. \n        In this case `d` is not included, because the group has been already formed around `a` column.\n        \n        If we set it to `-1` or anything more than 1, we use all reachable neighbours. In this case, correlation \n        group is formed as ``'a', 'c', 'b', 'd'`` due to fact that `d` is strongly correlated with `c`, disregarding \n        it weak connection to `a`. As you can see, it will result in larger groups, and possibility to assign to the \n        same group columns with correlation less than a threshold. It could reflect cross-correlation more\n        naturally in some cases.\n        ",
               "default": 1,
               "type": "integer"
            },
            "correlation_key_names": {
               "title": "Correlation Key Names",
               "description": "Defines a list of biologically rational gene names (subst) that\n            will be used for correlation groups naming. In case some of those names will appear in one of column names\n            within the same correlation group - the result correlation group identifier will contain those names.",
               "default": [
                  "TP53",
                  "KRAS",
                  "CDKN2A",
                  "CDKN2B",
                  "PIK3CA",
                  "ATM",
                  "BRCA1",
                  "SOX2",
                  "GNAS2",
                  "TERC",
                  "STK11",
                  "PDCD1",
                  "LAG3",
                  "TIGIT",
                  "HAVCR2",
                  "EOMES",
                  "MTAP"
               ],
               "type": "array",
               "items": {
                  "type": "string"
               }
            },
            "large_data_threshold": {
               "title": "Large Data Threshold",
               "description": "Threshold (rows, columns) to apply large dataset processing, simplifying certain analysis steps which may not make sense for large data. Any rows or columns number of the dataset shouldexceed the limit.",
               "default": "(500, 1000)",
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "array",
                     "minItems": 2,
                     "maxItems": 2,
                     "items": [
                        {
                           "type": "integer"
                        },
                        {
                           "type": "integer"
                        }
                     ]
                  }
               ]
            },
            "huge_data_threshold": {
               "title": "Huge Data Threshold",
               "description": "Threshold (rows, columns) to omit certain analysis steps which may not make sense for huge data. Any rows or columns number of the dataset shouldexceed the limit.",
               "default": "(2000, 5000)",
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "array",
                     "minItems": 2,
                     "maxItems": 2,
                     "items": [
                        {
                           "type": "integer"
                        },
                        {
                           "type": "integer"
                        }
                     ]
                  }
               ]
            }
         }
      },
      "EDAArtifactsGenerationSection": {
         "title": "EDAArtifactsGenerationSection",
         "description": "Configure Exploratory Data Analysis (EDA).",
         "type": "object",
         "properties": {
            "enable": {
               "title": "Enable",
               "description": "Whether to enable EDA artifacts generation. Tightly coupled with BaselineKit report generation - required step in case EDA sections are needed there.",
               "default": true,
               "type": "boolean"
            },
            "preprocessing_problem_id": {
               "title": "Preprocessing Problem Id",
               "description": "Existing within the config modeling problem id is expected. If not set, EDA artifacts will be built using \"raw\" dataframe. This options allows a user to reference some available modeling problem and reuse it's preprocessing pipeline.",
               "default": "",
               "type": "string"
            },
            "dataset_preprocessing": {
               "title": "Dataset Preprocessing",
               "description": "Declare preprocessing rules specific to EDA, e.g drop identifiers, null values, etc. This configuration has priority over `preprocessing_problem_id`.",
               "default": {
                  "enable": false,
                  "preset": {
                     "enable": false,
                     "features_list": [],
                     "remove_correlated_features": true,
                     "nans_per_row_fraction_threshold": 0.9,
                     "nans_fraction_threshold": 0.7,
                     "apply_log1p_to_target": false,
                     "drop_datetime_columns": true,
                     "drop_dna_wt": false,
                     "imputer": "median"
                  },
                  "steps": []
               },
               "allOf": [
                  {
                     "$ref": "#/definitions/DatasetPreprocessingSection"
                  }
               ]
            },
            "artifacts": {
               "title": "Artifacts",
               "description": "List of required items to generate. Leave empty to generate all registered items.",
               "default": [],
               "type": "array",
               "items": {
                  "$ref": "#/definitions/EDAArtifactSection"
               }
            },
            "params": {
               "title": "Params",
               "default": {
                  "correlation_type": "pearson",
                  "correlation_threshold": 0.8,
                  "correlation_min_samples_fraction": 0.2,
                  "correlation_group_level_cutoff": 1,
                  "correlation_key_names": [
                     "TP53",
                     "KRAS",
                     "CDKN2A",
                     "CDKN2B",
                     "PIK3CA",
                     "ATM",
                     "BRCA1",
                     "SOX2",
                     "GNAS2",
                     "TERC",
                     "STK11",
                     "PDCD1",
                     "LAG3",
                     "TIGIT",
                     "HAVCR2",
                     "EOMES",
                     "MTAP"
                  ],
                  "large_data_threshold": "(500, 1000)",
                  "huge_data_threshold": "(2000, 5000)"
               },
               "allOf": [
                  {
                     "$ref": "#/definitions/EDAArtifactsGenerationParameters"
                  }
               ]
            }
         }
      },
      "MasterSummarySection": {
         "title": "MasterSummarySection",
         "description": "Defines which sections to include to the Master Summary.",
         "type": "object",
         "properties": {
            "enable": {
               "title": "Enable",
               "description": "Enables Master Summary section that contains highlights from multiple sections.",
               "default": false,
               "type": "boolean"
            },
            "eda": {
               "title": "Eda",
               "description": "Turns on EDA section highlights. This requires EDA artifacts to be generated (see `ed` section.)",
               "default": false,
               "type": "boolean"
            },
            "feature_importance": {
               "title": "Feature Importance",
               "description": "Enables Feature Importance summary for a given modeling problem \"setup id\".",
               "default": "",
               "type": "string"
            },
            "survival_feature_importance": {
               "title": "Survival Feature Importance",
               "description": "Enables Survival Feature Importance summary for a given modeling problem \"setup id\".",
               "default": "",
               "type": "string"
            },
            "baseline_modeling": {
               "title": "Baseline Modeling",
               "description": "Enables Model Search/Selection summary for a given modeling problem \"setup id\".",
               "default": "",
               "type": "string"
            },
            "survival_analysis": {
               "title": "Survival Analysis",
               "description": "Enables Survival Analysis summary for a given survival problem \"setup id\".",
               "default": "",
               "type": "string"
            }
         }
      },
      "ReportStructure": {
         "title": "ReportStructure",
         "description": "Defines BaselineKit report structure (topics) and expected params.\nPlease refer to `EligibleBaselineKitNotebooks` for details on supported fields.",
         "type": "object",
         "properties": {
            "master_summary": {
               "title": "Master Summary",
               "default": {
                  "enable": false,
                  "eda": false,
                  "feature_importance": "",
                  "survival_feature_importance": "",
                  "baseline_modeling": "",
                  "survival_analysis": ""
               },
               "allOf": [
                  {
                     "$ref": "#/definitions/MasterSummarySection"
                  }
               ]
            },
            "eda": {
               "title": "Eda",
               "description": "Enables multiple EDA sections: dataset overview, numericals/categoricals analysis, dimensionality reduction, correlation analysis, etc.",
               "default": false,
               "type": "boolean"
            },
            "modeling": {
               "title": "Modeling",
               "description": "Target modeling problems/setups for which Modeling Report sections should be produced.",
               "default": [],
               "type": "array",
               "items": {
                  "type": "string"
               }
            },
            "cross_strata_fi_summary": {
               "title": "Cross Strata Fi Summary",
               "description": "Target modeling problems/setups for which cross-strata analysis/comparison sections should be produced. NOTE: survival feature importance is supported as well.",
               "default": [],
               "type": "array",
               "items": {
                  "type": "string"
               }
            },
            "survival_analysis": {
               "title": "Survival Analysis",
               "description": "Target survival problems/setups for which survival analysis sections should be produced.",
               "default": [],
               "type": "array",
               "items": {
                  "type": "string"
               }
            },
            "greedy_split": {
               "title": "Greedy Split",
               "description": "Target analysis items (referenced by \"name\") of \"greedy_split\" type.",
               "default": [],
               "type": "array",
               "items": {
                  "type": "string"
               }
            },
            "rnaseq_differential_expression": {
               "title": "Rnaseq Differential Expression",
               "description": "Target analysis items (referenced by \"name\") of \"rnaseq_differential_expression\" type.",
               "default": [],
               "type": "array",
               "items": {
                  "type": "string"
               }
            },
            "rnaseq_enrichment_analysis": {
               "title": "Rnaseq Enrichment Analysis",
               "description": "Target analysis items (referenced by \"name\") of \"rnaseq_enrichment_analysis\" type.",
               "default": [],
               "type": "array",
               "items": {
                  "type": "string"
               }
            },
            "report_diagnostics": {
               "title": "Report Diagnostics",
               "description": "Enables report diagnostic view.",
               "default": true,
               "type": "boolean"
            },
            "report_summary": {
               "title": "Report Summary",
               "description": "Enables report summary view.",
               "default": true,
               "type": "boolean"
            }
         }
      },
      "BaselineKitWorkflowSection": {
         "title": "BaselineKitWorkflowSection",
         "description": "Defines workflow for BaselineKit.",
         "type": "object",
         "properties": {
            "produce_and_execute_strata_notebooks": {
               "title": "Produce And Execute Strata Notebooks",
               "description": "Enables creation and execution of strata-level notebooks for BaselineKit report. The main purpose of this option is to enable parallelization of report generation process.",
               "default": true,
               "type": "boolean"
            },
            "produce_and_execute_global_notebooks": {
               "title": "Produce And Execute Global Notebooks",
               "description": "Enables creation and execution of global-level notebooks for BaselineKit report. The main purpose of this option is to enable parallelization of report generation process.",
               "default": true,
               "type": "boolean"
            },
            "generate_report": {
               "title": "Generate Report",
               "description": "Enables BaselineKit report generation: TOC is created, strata-level notebooks are moved to the target global folder, rendering with Jupyterbook. The main purpose of this option is to enable parallelization of report generation process, additionally it allows a user to manually re-run some notebooks and regenerate the report having \"produce_and_execute_*\" flags turned off.",
               "default": true,
               "type": "boolean"
            }
         }
      },
      "ReportSection": {
         "title": "ReportSection",
         "description": "Defines BaselineKit section.",
         "type": "object",
         "properties": {
            "enable": {
               "title": "Enable",
               "description": "Enables BaselineKit report generation: underlying Jupyter notebooks generation (for target scopes) + report rendering via JupyterBook.",
               "default": true,
               "type": "boolean"
            },
            "report_structure": {
               "title": "Report Structure",
               "default": {
                  "master_summary": {
                     "enable": false,
                     "eda": false,
                     "feature_importance": "",
                     "survival_feature_importance": "",
                     "baseline_modeling": "",
                     "survival_analysis": ""
                  },
                  "eda": false,
                  "modeling": [],
                  "cross_strata_fi_summary": [],
                  "survival_analysis": [],
                  "greedy_split": [],
                  "rnaseq_differential_expression": [],
                  "rnaseq_enrichment_analysis": [],
                  "report_diagnostics": true,
                  "report_summary": true
               },
               "allOf": [
                  {
                     "$ref": "#/definitions/ReportStructure"
                  }
               ]
            },
            "workflow": {
               "title": "Workflow",
               "default": {
                  "produce_and_execute_strata_notebooks": true,
                  "produce_and_execute_global_notebooks": true,
                  "generate_report": true
               },
               "allOf": [
                  {
                     "$ref": "#/definitions/BaselineKitWorkflowSection"
                  }
               ]
            }
         }
      },
      "PipelineItemConfig": {
         "title": "PipelineItemConfig",
         "description": "Analysis step configuration.",
         "type": "object",
         "properties": {
            "name": {
               "title": "Name",
               "description": "Step name defined by the author. This name can be used for cross-reference between different steps (for example to indicate that input data should be taken from an output of some other step). Also it works as a name for a subfolder on a disk which stores all the step-related data.",
               "type": "string"
            },
            "type": {
               "title": "Type",
               "description": "Analysis step type. Refer to :lml:ref:`Analysis Step Types` for details.",
               "type": "string"
            },
            "enable": {
               "title": "Enable",
               "default": true,
               "type": "boolean"
            },
            "params": {
               "title": "Params",
               "description": "Parameters for the analysis step. Refer to :lml:ref:`Analysis Step Types` for to understand structure required by a particular step.",
               "default": {},
               "type": "object"
            },
            "release_artifacts": {
               "title": "Release Artifacts",
               "description": "When set to True, artifacts produced by the step are published to the release folder.",
               "default": false,
               "type": "boolean"
            },
            "depends_on": {
               "title": "Depends On",
               "description": "List of items from which current item depends. Current item won't be executed untilall dependencies are successfully completed.",
               "default": [],
               "type": "array",
               "items": {
                  "type": "string"
               }
            }
         },
         "required": [
            "name",
            "type"
         ]
      },
      "AnalysisPipelineConfig": {
         "title": "AnalysisPipelineConfig",
         "description": "Analysis pipeline configuration.",
         "type": "object",
         "properties": {
            "enable": {
               "title": "Enable",
               "default": true,
               "type": "boolean"
            },
            "items": {
               "title": "Items",
               "description": "List of analysis steps to perform.",
               "default": [],
               "type": "array",
               "items": {
                  "$ref": "#/definitions/PipelineItemConfig"
               }
            }
         }
      },
      "ConfigSources": {
         "title": "ConfigSources",
         "description": "Internal information about logml configuration source files.",
         "type": "object",
         "properties": {
            "main_config": {
               "title": "Main Config",
               "description": "Path to main configuration file.",
               "default": "",
               "type": "string"
            },
            "refs": {
               "title": "Refs",
               "description": "Paths to files referred by the main config file.",
               "default": [],
               "type": "array",
               "items": {
                  "type": "string"
               }
            }
         }
      }
   }
}

Fields

analysis (logml.analysis.config.AnalysisPipelineConfig)
dataset_metadata (logml.configuration.modeling.DatasetMetadataSection)
eda (logml.configuration.eda.EDAArtifactsGenerationSection)
is_dag_config (bool)
modeling (logml.configuration.modeling.ModelingSection)
random_state (Optional[int])
report (logml.configuration.report.ReportSection)
source_files (logml.configuration.global_config.ConfigSources)
stratification (List[logml.configuration.stratification.Strata])
survival_analysis (logml.configuration.survival_analysis.SurvivalAnalysisSection)
version (str)

field version: str = 'unknown': Corresponds to the last compatible LogML version, e.g. “0.2.4”. This field is for information only, logml does not validate it.

field random_state: Optional[int] = None: Random state for main random numbers generator.

field dataset_metadata: logml.configuration.modeling.DatasetMetadataSection = None: Metadata for the incoming dataset (identifiers, columns groups, etc).

field stratification: List[logml.configuration.stratification.Strata] = []: Stata are sub-sets of the original dataset, independent of one another. Essentially, the complete analysis pipeline runs for a each single stratum independently (but all strata are compared in a special report page named “Cross-strata comparison”). Stratification can follow for example treatment arms or test/control grouping of samples. NOTE: analysis section does not follow stratification rules, but defines its own specific ones.

field survival_analysis: logml.configuration.survival_analysis.SurvivalAnalysisSection = SurvivalAnalysisSection(enable=False, problems={})

field modeling: logml.configuration.modeling.ModelingSection = ModelingSection(enable=False, cross_validation=CrossValidationSection(random_state=None, split_type=<CVSplitType.KFOLD: 'kfold'>, n_folds=20, test_size=0.2, type='', params={}), hpo=HPOSection(algorithm='tpe', max_evals=3, random_state=None), problems={})

field eda: logml.configuration.eda.EDAArtifactsGenerationSection = EDAArtifactsGenerationSection(enable=False, preprocessing_problem_id='', dataset_preprocessing=DatasetPreprocessingSection(enable=False, preset=DatasetPreprocessingPresetSection(enable=False, features_list=[], remove_correlated_features=True, nans_per_row_fraction_threshold=0.9, nans_fraction_threshold=0.7, apply_log1p_to_target=False, drop_datetime_columns=True, drop_dna_wt=False, imputer='median'), steps=[]), artifacts=[], params=EDAArtifactsGenerationParameters(correlation_type=<CorrelationType.PEARSON: 'pearson'>, correlation_threshold=0.8, correlation_min_samples_fraction=0.2, correlation_group_level_cutoff=1, correlation_key_names=['TP53', 'KRAS', 'CDKN2A', 'CDKN2B', 'PIK3CA', 'ATM', 'BRCA1', 'SOX2', 'GNAS2', 'TERC', 'STK11', 'PDCD1', 'LAG3', 'TIGIT', 'HAVCR2', 'EOMES', 'MTAP'], large_data_threshold=(500, 1000), huge_data_threshold=(2000, 5000)))

field report: logml.configuration.report.ReportSection = ReportSection(enable=True, report_structure=ReportStructure(master_summary=MasterSummarySection(enable=False, eda=False, feature_importance='', survival_feature_importance='', baseline_modeling='', survival_analysis=''), eda=False, modeling=[], cross_strata_fi_summary=[], survival_analysis=[], greedy_split=[], rnaseq_differential_expression=[], rnaseq_enrichment_analysis=[], report_diagnostics=True, report_summary=True), workflow=BaselineKitWorkflowSection(produce_and_execute_strata_notebooks=True, produce_and_execute_global_notebooks=True, generate_report=True))

field analysis: logml.analysis.config.AnalysisPipelineConfig = AnalysisPipelineConfig(enable=False, items=[])

field is_dag_config: bool = False

field source_files: logml.configuration.global_config.ConfigSources = ConfigSources(main_config='', refs=[])

classmethod populate_reporting_problems(values) → None: Automatically put modeling and survival problems into reporting section.

classmethod validate_report(values) → None: Baselinekit validation.

analysis_section_enabled(): Checks whether ‘analysis’ section is enabled.

modeling_section_enabled(): Checks whether ‘modeling’ section is enabled.

survival_section_enabled(): Checks whether ‘survival_analysis’ section is enabled.

get_target_survival_problems() → List[str]: Returns modeling setups for which a given section is enabled.

baselinekit_section_enabled(): Checks whether ‘report’ section is enabled.

classmethod load(path: Union[str, pathlib.Path]) → logml.configuration.global_config.GlobalConfig: Load config from yaml file.

generate_random_state() → int: Generates random state

logml.configuration.global_config.format_validation_error(ve: pydantic.error_wrappers.ValidationError): Format pydantic error

logml.configuration.global_config.validate_config(file, output) → int

Validate config file schema and content.

Returns: Zero in case if validation passed ok, other valuse in case of failure.
Return type: int

logml.configuration.global_config.print_schema(output=None, use_json=False): Print configuration file schema.