logml.configuration.modeling
Functions
get_default_feature_importance_section(): Returns a basic feature importance section.
- class logml.configuration.modeling.PreprocessingStep
Bases:
pydantic.main.BaseModel
Defines data preprocessing step.
- field enable: bool = True
Whether to enable preprocessing step.
- field transformer: str [Required]
Alias of transformer to use. Please refer to Data Transformers for details.
- field params: Dict = {}
Parameters that will be passed to the corresponding transformer instance.
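For illustration, a step can be declared programmatically; this is a minimal sketch, and the "impute_median" alias and its params are hypothetical (see Data Transformers for real aliases):

    from logml.configuration.modeling import PreprocessingStep

    # A single preprocessing step; "impute_median" is a hypothetical alias.
    step = PreprocessingStep(
        transformer="impute_median",
        params={"columns": ["AGE"]},  # forwarded to the transformer instance
    )
    assert step.enable  # steps are enabled by default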
- class logml.configuration.modeling.ModelingTargetSpec
Bases:
pydantic.main.BaseModel
Specification for modeling target (regression/classification)
- field target_column: str [Required]
Target column for modeling, also known as dependent variable or outcome. In case of survival modeling, this column should contain time-to-event values (usually OS or PFS).
- field task: logml.common.ModelingTask = ModelingTask.REG
Problem definition for modeling setup. Possible options: “classification”, “regression”, “survival”.
- field target_metric: str = ''
Metric (loss) that will be used to evaluate model performance. Typical options per modeling objective: “logloss” for classification, “mse” for regression, “cindex_inv” for survival (inverted concordance index, so that it can be minimized). NOTE: at the moment only loss functions are supported (minimization problems). Please refer to ML Metrics for details.
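As a sketch, a classification target could be declared as follows (the "BOR" outcome column is hypothetical):

    from logml.configuration.modeling import ModelingTargetSpec

    spec = ModelingTargetSpec(
        target_column="BOR",      # hypothetical outcome column
        task="classification",    # default is "regression"
        target_metric="logloss",  # typical loss for classification
    )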
- class logml.configuration.modeling.SurvivalTimeSpec
Bases:
pydantic.main.BaseModel
Configure right-censored time-to-event columns in the dataset
- field time_column: str [Required]
Column name that contains time-to-event values (usually OS or PFS).
- field event_query: str [Required]
Query-like expression that indicates “events” (“uncensored”) samples. For example: “OS_CNSR == 1”. See Dataset Queries for details.
- field event_column: str [Required]
Column used for event calculation. (We have to specify it so that it can be removed from the features list after dataset preprocessing.) If you specify event_query: “OS_CNSR == 1”, then also put event_column: OS_CNSR.
- field target_metric: str = 'cindex'
Metric (loss) that will be used to evaluate survival model performance. Please refer to ML Metrics for details.
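A minimal sketch using the OS/OS_CNSR columns mentioned above:

    from logml.configuration.modeling import SurvivalTimeSpec

    spec = SurvivalTimeSpec(
        time_column="OS",            # time-to-event values
        event_query="OS_CNSR == 1",  # rows matching the query are "events"
        event_column="OS_CNSR",      # removed from features after preprocessing
    )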
- class logml.configuration.modeling.HPOSection
Bases:
pydantic.main.BaseModel
Configure hyper-params optimization for models selection process.
- field algorithm: str = 'tpe'
Target “hyperopt” algorithm that will be used for models hyper-parameter optimization.
- field max_evals: int = 3
Defines the target number of HPO trials for all models. More trials yield better models (in theory); fewer trials make HPO faster.
- field random_state: Optional[int] = None
Random state
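For example, raising max_evals above the default of 3 trades runtime for (in theory) better models; a sketch:

    from logml.configuration.modeling import HPOSection

    hpo = HPOSection(algorithm="tpe", max_evals=50, random_state=42)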
- class logml.configuration.modeling.ModelingTaskSpec
Bases:
pydantic.main.BaseModel
Defines metadata for modeling setup: modeling objective, target column and evaluation metric.
- field task: logml.common.ModelingTask [Required]
Problem definition for modeling setup. Possible options: “classification”, “regression”, “survival”.
- field target: str [Required]
Target column for modeling, also known as dependent variable or outcome. In case of survival modeling, this column should contain time-to-event values (usually OS or PFS).
- field target_metric: str = ''
Metric (loss) that will be used to evaluate model performance. Typical options per modeling objective: “logloss” for classification, “mse” for regression, “cindex_inv” for survival (inverted concordance index, so that it can be minimized). NOTE: at the moment only loss functions are supported (minimization problems). Please refer to ML Metrics for details.
- field event_query: str = ''
(Applies for survival problems.) Query-like expression that indicates “events” (“uncensored”) samples. For example: “OS_CNSR == 1”. See Dataset Queries for details.
- field event_column: str = ''
(Applies for survival problems.) Column used for event calculation. (We have to specify it so that it can be removed from the features list after dataset preprocessing.) If you specify event_query: “OS_CNSR == 1”, then also put event_column: OS_CNSR.
- static from_target_spec(specification: Union[logml.configuration.modeling.SurvivalTimeSpec, logml.configuration.modeling.ModelingTargetSpec]) → logml.configuration.modeling.ModelingTaskSpec
Create new instance from modeling specification.
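A sketch of deriving task metadata from a survival spec (column names as in the examples above):

    from logml.configuration.modeling import ModelingTaskSpec, SurvivalTimeSpec

    surv = SurvivalTimeSpec(time_column="OS", event_query="OS_CNSR == 1",
                            event_column="OS_CNSR")
    task_spec = ModelingTaskSpec.from_target_spec(surv)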
- class logml.configuration.modeling.ModelSelectionConfig
Bases:
pydantic.main.BaseModel
Configuration for particular model type selection.
- field name: str [Required]
Model’s alias to use. Please refer to EligibleModels for available options.
- field use_hpo: bool = True
Whether the model should be fine-tuned (HPO). Otherwise the default parameters will be used.
- field hyper_params: dict = {}
Hyperparameters to use, in case a user wants to explicitly set those.
- field params_space: dict = {}
Hyperparameters space to use within HPO (instead of predefined ones).
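For instance, to pin hyperparameters explicitly and skip HPO (the "random_forest" alias and its parameters are hypothetical; see EligibleModels):

    from logml.configuration.modeling import ModelSelectionConfig

    cfg = ModelSelectionConfig(
        name="random_forest",               # hypothetical model alias
        use_hpo=False,                      # use hyper_params as-is, no tuning
        hyper_params={"n_estimators": 200},
    )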
- class logml.configuration.modeling.DatasetPreprocessingPresetSection
Bases:
pydantic.main.BaseModel
Defines ‘syntax sugar’ for semi-automated data preprocessing steps generation.
- field enable: bool = True
Whether to enable automated generation of preprocessing steps.
- field features_list: Union[str, List[str]] = []
Defines a list of features (referenced by regexps) that should be selected. Alternatively, reference a configuration file that contains the required list of features, e.g. features_list: sub_cfg/features_list.yaml.
- field remove_correlated_features: bool = True
Whether to include a step that removes correlated features.
- field nans_per_row_fraction_threshold: float = 0.9
Defines maximum acceptable fraction of NaNs within a row.
- field nans_fraction_threshold: float = 0.7
Defines maximum acceptable fraction of NaNs within a column.
- field apply_log1p_to_target: bool = False
Whether to apply log1p transformation to target column (applicable only for regression problems).
- field drop_datetime_columns: bool = True
Whether to drop date time columns.
- field drop_dna_wt: bool = False
Whether to drop DNA WT values after one-hot-encoding.
- field imputer: str = 'median'
Imputer to use. Possible values: median, mice.
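A sketch of a preset that selects features by regexp and tightens the per-column NaN threshold (the patterns are hypothetical):

    from logml.configuration.modeling import DatasetPreprocessingPresetSection

    preset = DatasetPreprocessingPresetSection(
        features_list=["^GENE_.*", "AGE"],  # hypothetical feature patterns
        nans_fraction_threshold=0.5,        # drop columns with >50% NaNs
        imputer="mice",
    )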
- class logml.configuration.modeling.DatasetPreprocessingSection
Bases:
pydantic.main.BaseModel
Defines data preprocessing section for modeling/survival setup.
- field enable: bool = True
Whether to enable Preprocessing Pipeline for dataset transformation.
- field preset: logml.configuration.modeling.DatasetPreprocessingPresetSection = DatasetPreprocessingPresetSection(enable=False, features_list=[], remove_correlated_features=True, nans_per_row_fraction_threshold=0.9, nans_fraction_threshold=0.7, apply_log1p_to_target=False, drop_datetime_columns=True, drop_dna_wt=False, imputer='median')
- field steps: List[logml.configuration.modeling.PreprocessingStep] = []
Defines a list of preprocessing steps (transformations) to apply. See Data Transformers for details.
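Putting the pieces together, a sketch with one explicit step (the transformer alias is hypothetical):

    from logml.configuration.modeling import (
        DatasetPreprocessingSection,
        PreprocessingStep,
    )

    preprocessing = DatasetPreprocessingSection(
        steps=[PreprocessingStep(transformer="impute_median")],
    )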
- class logml.configuration.modeling.ModelSearchSection
Bases:
pydantic.main.BaseModel
Defines model search section.
- field enable: bool = True
Enable or disable the Model Search/Selection process. It is recommended to enable the model_search section when the feature_importance section is enabled. NOTE: the Model Search section will be implicitly enabled when the feature_importance section is enabled.
- field models_random_state: Optional[int] = None
Random state for models which require it.
- field limit: int = 6
Limit number of selected models.
- field models: List[Union[logml.configuration.modeling.ModelSelectionConfig, str]] = []
Defines a list of models which are to be fine-tuned. NOTE: in case this option is unset, all available models for the corresponding “task” from the metadata section will be used. Please refer to Model Types for available options.
- field baseline_model: Union[logml.configuration.modeling.ModelSelectionConfig, str] = ''
Defines a model’s alias that will be used to filter out models that don’t perform better (in terms of averaged “target_metric” on cross-validation) than the “baseline” model. NOTE: by default the “dummy” model for the corresponding “task” will be used. Please refer to Model Types for available options.
- field pvalue_threshold: float = 0.05
Threshold for the p-value when testing the hypothesis that model loss is less than baseline loss. Applicable only when the CV fold list is long enough (>= 7).
- field cross_validation: logml.configuration.cross_validation.CrossValidationSection = CrossValidationSection(random_state=None, split_type=<CVSplitType.KFOLD: 'kfold'>, n_folds=20, test_size=0.2, type='', params={})
- set_default_models(metadata: logml.configuration.modeling.ModelingTaskSpec)
Use default models if not specified. Invoked by parent section (ModelingSetup) during validation.
NOTE: if using this section without a parent, make sure to invoke this method manually.
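A sketch of a narrowed model search (the model aliases are hypothetical; see Model Types). When used standalone, remember to call set_default_models as noted above.

    from logml.configuration.modeling import (
        ModelSearchSection,
        ModelSelectionConfig,
    )

    search = ModelSearchSection(
        models=["ridge", ModelSelectionConfig(name="xgboost")],  # hypothetical aliases
        limit=3,
    )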
- class logml.configuration.modeling.ColumnSpec
Bases:
pydantic.main.BaseModel
Configure special columns.
- field name: str [Required]
Special column name.
- field comment: str = ''
Column description (e.g. “Treatment Arm, used for stratification in such and such analysis.”)
- class logml.configuration.modeling.ColumnMetadataConfig
Bases:
pydantic.main.BaseModel
Column-specific metadata, currently including data type.
In future this structure can be extended to contain field description, semantics, display name, formatting, etc.
- field name: str [Required]
Column name. Used to refer to column in the dataframe directly.
- field data_type: str = ''
Data type for the field. The most frequent are string, int, float, datetime64[ns]. If not specified, it is automatically detected while reading the original dataset. See https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes for a list of available standard pandas types.
- field is_categorical: bool = False
Specify if a column should be considered as categorical (opposed to continuous numeric). Not applicable to date-time types, but can be string, integer or float.
- field parent_name: str = None
Name of the column that was used to produce the current column as a result of a transformation.
- field description: str = None
Column description.
- field group: str = None
Name of a group this column belongs to. If MCT config is provided, set to column input_source name.
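A sketch describing a hypothetical categorical column:

    from logml.configuration.modeling import ColumnMetadataConfig

    meta = ColumnMetadataConfig(
        name="ARM",                  # hypothetical column name
        data_type="string",
        is_categorical=True,
        description="Treatment arm",
    )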
- class logml.configuration.modeling.DatasetMetadataSection
Bases:
pydantic.main.BaseModel
Defines dataset-level metadata: key columns, modeling specifications and column-specific metadata.
- field key_columns: List[str] = []
List of identifier fields for a row in the dataset.
- field modeling_specs: Dict[str, Union[logml.configuration.modeling.SurvivalTimeSpec, logml.configuration.modeling.ModelingTargetSpec]] = {}
Collection of modeling specifications for survival/regression/classification problems. NOTE: keys are modeling problem ids, and values are the corresponding specifications.
- field columns_specs: Dict[str, logml.configuration.modeling.ColumnSpec] = {}
Named collection of special columns (targets, groupings, etc.)
- field columns_metadata: List[logml.configuration.modeling.ColumnMetadataConfig] = []
List of column-specific metadata entries.
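A sketch combining both spec types under named problem ids ("SUBJID" and "BOR" are hypothetical columns):

    from logml.configuration.modeling import (
        DatasetMetadataSection,
        ModelingTargetSpec,
        SurvivalTimeSpec,
    )

    metadata = DatasetMetadataSection(
        key_columns=["SUBJID"],
        modeling_specs={
            "os_survival": SurvivalTimeSpec(time_column="OS",
                                            event_query="OS_CNSR == 1",
                                            event_column="OS_CNSR"),
            "response": ModelingTargetSpec(target_column="BOR",
                                           task="classification"),
        },
    )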
- class logml.configuration.modeling.FeatureImportanceWorkflow
Bases:
pydantic.main.BaseModel
Defines a workflow for FI execution.
- field generate_intermediate_results: bool = True
Enables creation of per-dataset/per-repeat/per-bootstrap-iteration feature importance artifacts. The main purpose of this option is to enable parallelization of the feature importance artifacts generation process.
- field aggregate_intermediate_results: bool = True
Enables aggregation of per-dataset feature importance artifacts to the “global” level. Currently feature importance artifacts are represented as feature ranks; aggregation is simply rank averaging across all dataset-level results. The main purpose of this option is to enable parallelization of the feature importance artifacts generation process.
- field generate_global_summary: bool = True
Enables summarization of aggregated feature importance results across the different methods used. The main purpose of this option is to enable parallelization of the feature importance artifacts generation process.
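For example, a worker job could produce only the per-dataset artifacts, leaving aggregation and summarization to a separate job; a sketch:

    from logml.configuration.modeling import FeatureImportanceWorkflow

    worker_workflow = FeatureImportanceWorkflow(
        generate_intermediate_results=True,
        aggregate_intermediate_results=False,
        generate_global_summary=False,
    )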
- class logml.configuration.modeling.FeatureImportanceMethod
Bases:
pydantic.main.BaseModel
Defines feature importance method.
- field enable: bool = True
DEPRECATED. Enables the feature importance method.
- field extractor_id: str [Required]
Alias of feature importance extractor/method to use. Please refer to EligibleFIExtractors for details.
- field params: Dict = {}
Parameters that will be passed to the extractor constructor.
- field n_models: int = 0
DEPRECATED. Implicitly use only the top N models (in terms of CV performance) from the available “selected” candidates. This option might make sense when different models perform better on different strata. NOTE: ignored when the “models” option is set. NOTE: should be non-negative; when “n_models” is 0, all available candidate models are used.
- field models: List[str] = []
DEPRECATED Explicit list of models that should be used (in case Model Selection resulted in too many models - it is possible to narrow down the list).
- field fallback_model: str = ''
DEPRECATED. Alias of the fallback model to use. Please refer to EligibleModels for available options. In case Model Selection resulted in no “reasonable” models, it still might make sense to use some model anyway for importance extraction.
- field dump_raw_extractor: bool = False
NEED REVIEW: when set to True, the whole feature importance extractor is dumped to a pickle file. The default location is the “extractors” subfolder (see FeatureImportanceOutputStructure.get_extractor_dump_path).
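A sketch of a method declaration (the "permutation" extractor alias and its params are hypothetical; see EligibleFIExtractors):

    from logml.configuration.modeling import FeatureImportanceMethod

    method = FeatureImportanceMethod(
        extractor_id="permutation",  # hypothetical extractor alias
        params={"n_repeats": 10},    # forwarded to the extractor constructor
    )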
- class logml.configuration.modeling.FeatureImportanceSection
Bases:
pydantic.main.BaseModel
Defines feature importance section for modeling setup.
- field enable: bool = True
Enables feature importance artifacts generation.
- field cross_validation: logml.configuration.cross_validation.CrossValidationSection = CrossValidationSection(random_state=None, split_type=<CVSplitType.KFOLD: 'kfold'>, n_folds=100, test_size=0.25, type='', params={})
- field perform_tier1_greedy: bool = False
- field fid_pvalue_threshold: float = 0.05
- field n_random_iters: int = 5
- field random_state: Optional[int] = None
State to initialize random numbers generation.
- field default_extractor: Optional[logml.configuration.modeling.FeatureImportanceMethod] = None
Feature importance extractor to be used by default. When not specified, importance is extracted from the model coefficients - which is naturally possible only for models which support it.
- field default_n_perm_imp_iters: int = 10
Number of permutations for (default) permutation feature extractor.
- field extractors: Dict[str, logml.configuration.modeling.FeatureImportanceMethod] = {}
Map specific model to Feature importance extractor. If not specified, default_extractor is used.
- field workflow: logml.configuration.modeling.FeatureImportanceWorkflow = FeatureImportanceWorkflow(generate_intermediate_results=True, aggregate_intermediate_results=True, generate_global_summary=True)
- field methods: List[logml.configuration.modeling.FeatureImportanceMethod] = []
DEPRECATED! Enumerates target feature importance methods/extractors to apply.
- get_target_methods() → List[logml.configuration.modeling.FeatureImportanceMethod]
Returns a list of enabled methods.
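A sketch of overriding the default extractor (the "permutation" alias is hypothetical, as above):

    from logml.configuration.modeling import (
        FeatureImportanceMethod,
        FeatureImportanceSection,
    )

    fi = FeatureImportanceSection(
        default_extractor=FeatureImportanceMethod(extractor_id="permutation"),
        default_n_perm_imp_iters=20,
    )
    enabled = fi.get_target_methods()  # only methods with enable=True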
- logml.configuration.modeling.get_default_feature_importance_section() → logml.configuration.modeling.FeatureImportanceSection
Returns a basic feature importance section.
- class logml.configuration.modeling.ModelingPresetConfiguration
Bases:
pydantic.main.BaseModel
Defines an approach for automatically configuring modeling sections.
- field enable: bool = True
Enables automatic presets for modeling pipeline sections.
- field features_list: Union[str, List[str]] = ['.*']
Defines a list of features (referenced by regexps) that should be selected. Alternatively, reference a configuration file that contains the required list of features, e.g. features_list: sub_cfg/features_list.yaml.
- class logml.configuration.modeling.ModelingSetup
Bases:
pydantic.main.BaseModel
Defines parameters for modeling problem (also called “setup”).
Typical modeling workflow has the following steps:
metadata - key information for modeling (task, target, metric).
dataset preprocessing - preferred strategy for data preparation prior to modeling.
datasets - defines the bootstrapping setup (number of iterations).
model search - defines the target set of models to use for feature importance extraction; models are tuned and only appropriate ones (in terms of performance) are selected for upstream usage.
feature importance - defines target feature extraction methods.
- field enable: bool = True
Enable or disable modeling setup.
- field preset: logml.configuration.modeling.ModelingPresetConfiguration = ModelingPresetConfiguration(enable=False, features_list=['.*'])
- field metadata: logml.configuration.modeling.ModelingTaskSpec [Required]
Key information for modeling (task, target, metric).
- field dataset_preprocessing: logml.configuration.modeling.DatasetPreprocessingSection = DatasetPreprocessingSection(enable=False, preset=DatasetPreprocessingPresetSection(enable=False, features_list=[], remove_correlated_features=True, nans_per_row_fraction_threshold=0.9, nans_fraction_threshold=0.7, apply_log1p_to_target=False, drop_datetime_columns=True, drop_dna_wt=False, imputer='median'), steps=[])
- field model_search: logml.configuration.modeling.ModelSearchSection = ModelSearchSection(enable=True, models_random_state=None, limit=6, models=[], baseline_model='', pvalue_threshold=0.05, cross_validation=CrossValidationSection(random_state=None, split_type='kfold', n_folds=20, test_size=0.2, type='', params={}))
- field feature_importance: logml.configuration.modeling.FeatureImportanceSection = FeatureImportanceSection(enable=True, cross_validation=CrossValidationSection(random_state=None, split_type='kfold', n_folds=100, test_size=0.25, type='', params={}), perform_tier1_greedy=False, fid_pvalue_threshold=0.05, n_random_iters=5, random_state=None, default_extractor=None, default_n_perm_imp_iters=10, extractors={}, workflow=FeatureImportanceWorkflow(generate_intermediate_results=True, aggregate_intermediate_results=True, generate_global_summary=True), methods=[])
Please refer to :lml:ref:`Model Types` for available options.", "default": "", "anyOf": [ { "$ref": "#/definitions/ModelSelectionConfig" }, { "type": "string" } ] }, "pvalue_threshold": { "title": "Pvalue Threshold", "description": "Threshold for p-value when testing hypothesis that model loss is less than baseline loss. Applicable only when CV list is long enough (>=7)", "default": 0.05, "type": "number" }, "cross_validation": { "title": "Cross Validation", "default": { "random_state": null, "split_type": "kfold", "n_folds": 20, "test_size": 0.2, "type": "", "params": {} }, "allOf": [ { "$ref": "#/definitions/CrossValidationSection" } ] } } }, "FeatureImportanceMethod": { "title": "FeatureImportanceMethod", "description": "Defines feature importance method.", "type": "object", "properties": { "enable": { "title": "Enable", "description": "DEPRECATED Enables feature importance method.", "default": true, "type": "boolean" }, "extractor_id": { "title": "Extractor Id", "description": "Alias of feature importance extractor/method to use. Please refer to `EligibleFIExtractors` for details.", "type": "string" }, "params": { "title": "Params", "description": "Parameters that will be passed to the extractor constructor.", "default": {}, "type": "object" }, "n_models": { "title": "N Models", "description": "DEPRECATED Implicitly use only the top N models (in terms of CV performance) from available \"selected\" candidates. Might make sense to use that option when different models perform better on different stratas. NOTE: ignored when \"models\" option is set. NOTE: should be non-negative. In case \"n_models\" is equal to 0 - all available candidate models are used.", "default": 0, "type": "integer" }, "models": { "title": "Models", "description": "DEPRECATED Explicit list of models that should be used (in case Model Selection resulted in too many models - it is possible to narrow down the list).", "default": [], "type": "array", "items": { "type": "string" } }, "fallback_model": { "title": "Fallback Model", "description": "DEPRECATeD Alias of fallback model to use. Please refer to `EligibleModels` for available options. In case Model Selection resulted in no \"reasonable\" models, it still might make sense to use some model anyway for importances extraction.", "default": "", "type": "string" }, "dump_raw_extractor": { "title": "Dump Raw Extractor", "description": "NEED REVIEW When set to True, the whole feature importance extractor is dumped to pickle file. Default location is \"extractors\" subfolder (see `FeatureImportanceOutputStructure.get_extractor_dump_path`", "default": false, "type": "boolean" } }, "required": [ "extractor_id" ] }, "FeatureImportanceWorkflow": { "title": "FeatureImportanceWorkflow", "description": "Defines a workflow for FI execution.", "type": "object", "properties": { "generate_intermediate_results": { "title": "Generate Intermediate Results", "description": "Enables creation of per-dataset/per-repeat/per-bootstrap-iteration feature importance artifacts.The main purpose of this option is to enable parallelization of feature importance artifacts generation process.", "default": true, "type": "boolean" }, "aggregate_intermediate_results": { "title": "Aggregate Intermediate Results", "description": "Enables aggregation of per-dataset feature importance artifacts to \"global\" level. Currently feature importance artifacts are represented as features ranks, aggregation is simply a ranks `averaging` across all dataset-level results. 
The main purpose of this option is to enable parallelization of feature importance artifacts generation process.", "default": true, "type": "boolean" }, "generate_global_summary": { "title": "Generate Global Summary", "description": "Enables summarization of aggregated feature importance results across different method used. The main purpose of this option is to enable parallelization of feature importance artifacts generation process.", "default": true, "type": "boolean" } } }, "FeatureImportanceSection": { "title": "FeatureImportanceSection", "description": "Defines feature importance section for modeling setup.", "type": "object", "properties": { "enable": { "title": "Enable", "description": "Enables feature importance artifacts generation.", "default": true, "type": "boolean" }, "cross_validation": { "title": "Cross Validation", "default": { "random_state": null, "split_type": "kfold", "n_folds": 100, "test_size": 0.25, "type": "", "params": {} }, "allOf": [ { "$ref": "#/definitions/CrossValidationSection" } ] }, "perform_tier1_greedy": { "title": "Perform Tier1 Greedy", "default": false, "type": "boolean" }, "fid_pvalue_threshold": { "title": "Fid Pvalue Threshold", "default": 0.05, "type": "number" }, "n_random_iters": { "title": "N Random Iters", "default": 5, "type": "integer" }, "random_state": { "title": "Random State", "description": "State to initialize random numbers generation.", "type": "integer" }, "default_extractor": { "title": "Default Extractor", "description": "Feature importance extractor to be used by default. When not specified, importance is extracted from the model coefficients - which is naturally possible only for models which support it.", "allOf": [ { "$ref": "#/definitions/FeatureImportanceMethod" } ] }, "default_n_perm_imp_iters": { "title": "Default N Perm Imp Iters", "description": "Number of permutations for (default) permutation feature extractor.", "default": 10, "type": "integer" }, "extractors": { "title": "Extractors", "description": "Map specific model to Feature importance extractor. If not specified, `default_extractor` is used.", "default": {}, "type": "object", "additionalProperties": { "$ref": "#/definitions/FeatureImportanceMethod" } }, "workflow": { "title": "Workflow", "default": { "generate_intermediate_results": true, "aggregate_intermediate_results": true, "generate_global_summary": true }, "allOf": [ { "$ref": "#/definitions/FeatureImportanceWorkflow" } ] }, "methods": { "title": "Methods", "description": "DEPREPCATED! Enumerates target feature imporatance methods/extractors to apply.", "default": [], "type": "array", "items": { "$ref": "#/definitions/FeatureImportanceMethod" } } } } } }
- Fields
- field enable: bool = True
Enable or disable modeling setup.
- field preset: logml.configuration.modeling.ModelingPresetConfiguration = ModelingPresetConfiguration(enable=False, features_list=['.*'])
- field metadata: logml.configuration.modeling.ModelingTaskSpec = None
- field dataset_preprocessing: logml.configuration.modeling.DatasetPreprocessingSection = DatasetPreprocessingSection(enable=False, preset=DatasetPreprocessingPresetSection(enable=False, features_list=[], remove_correlated_features=True, nans_per_row_fraction_threshold=0.9, nans_fraction_threshold=0.7, apply_log1p_to_target=False, drop_datetime_columns=True, drop_dna_wt=False, imputer='median'), steps=[])
- field model_search: logml.configuration.modeling.ModelSearchSection = ModelSearchSection(enable=True, models_random_state=None, limit=6, models=[], baseline_model='', pvalue_threshold=0.05, cross_validation=CrossValidationSection(random_state=None, split_type=<CVSplitType.KFOLD: 'kfold'>, n_folds=20, test_size=0.2, type='', params={}))
- field feature_importance: logml.configuration.modeling.FeatureImportanceSection = FeatureImportanceSection(enable=True, cross_validation=CrossValidationSection(random_state=None, split_type=<CVSplitType.KFOLD: 'kfold'>, n_folds=100, test_size=0.25, type='', params={}), perform_tier1_greedy=False, fid_pvalue_threshold=0.05, n_random_iters=5, random_state=None, default_extractor=None, default_n_perm_imp_iters=10, extractors={}, workflow=FeatureImportanceWorkflow(generate_intermediate_results=True, aggregate_intermediate_results=True, generate_global_summary=True), methods=[])
- classmethod model_search_check(ms: logml.configuration.modeling.ModelSearchSection, rnd: logml.common.RandomGen) → None
Validate the model search section.
- classmethod expand_preset_steps(dataset_preprocessing, objective, rnd: logml.common.RandomGen) → None
Validate and expand preset preprocessing steps.
- classmethod validate_model_search(feature_importance, inconsistent_with_task_filter, md, values)
Validate model search.
- classmethod expand_preset(values)
Configure processing according to the preset.
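The schema above corresponds one-to-one to a YAML configuration. Below is a minimal sketch of a single modeling problem setup: the keys follow the ModelingSetup schema, while the column names (OS, OS_CNSR) and all values are illustrative placeholders, not defaults shipped with the library.
# Hypothetical "modeling problem setup" (YAML); keys follow the
# ModelingSetup schema above, values are illustrative only.
enable: true
metadata:
  task: survival             # "classification", "regression" or "survival"
  target: OS                 # time-to-event column for survival tasks
  target_metric: cindex_inv  # inverted concordance index (a loss to minimize)
  event_query: "OS_CNSR == 1"
  event_column: OS_CNSR
dataset_preprocessing:
  enable: true
  preset:
    enable: true
    nans_fraction_threshold: 0.7  # drop columns with more than 70% NaNs
    imputer: median               # or "mice"
model_search:
  limit: 6
  cross_validation:
    split_type: kfold
    n_folds: 20
    test_size: 0.2
feature_importance:
  enable: true
  default_n_perm_imp_iters: 10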
- class logml.configuration.modeling.ModelingSection
Bases:
pydantic.main.BaseModel
Machine Learning Modeling section definition.
This section configures the “modeling” analysis type, which combines the following steps:
- dataset preprocessing, which adjusts the data to be suitable for models.
- model selection, which determines the best model to be used in the next steps.
- feature importance extraction, which calculates each feature's impact on the target variable from an ML standpoint. This may be done in several ways, indicated as “FI methods” below.
Show JSON schema
{ "title": "ModelingSection", "description": "Machine Learning Modeling section definition.\n\nThis section configures \"modeling\" analysis type, which combines the following steps:\n\n- dataset preprocessing, which adjusts data to be suitable for models.\n- model selection, which determines best model to be used at the next steps.\n- feature importance extraction, which calculates features impact on the target variable from\n ML standpoint. This may be done in several ways, which are indicated as \"FI methods\" below.", "type": "object", "properties": { "enable": { "title": "Enable", "description": "Enable or disable Modeling workflow.", "default": true, "type": "boolean" }, "cross_validation": { "title": "Cross Validation", "default": { "random_state": null, "split_type": "kfold", "n_folds": 20, "test_size": 0.2, "type": "", "params": {} }, "allOf": [ { "$ref": "#/definitions/CrossValidationSection" } ] }, "hpo": { "title": "Hpo", "default": { "algorithm": "tpe", "max_evals": 3, "random_state": null }, "allOf": [ { "$ref": "#/definitions/HPOSection" } ] }, "problems": { "title": "Problems", "description": "Defines list of \"modeling problem setup\" configurations. Usually problems are similar to one another, but have different target variables.", "default": {}, "type": "object", "additionalProperties": { "$ref": "#/definitions/ModelingSetup" } } }, "definitions": { "CVSplitType": { "title": "CVSplitType", "description": "Type of CV splits: k-fold or shuffle", "enum": [ "kfold", "shuffle" ], "type": "string" }, "CrossValidationSection": { "title": "CrossValidationSection", "description": "Configure CV application for the dataset.", "type": "object", "properties": { "random_state": { "title": "Random State", "description": "State to initialize random numbers generation.", "type": "integer" }, "split_type": { "description": "Configures coverage of splits. 'kfold' covers dataset completely, 'shuffle' - does not guarantee it due to sampling.", "default": "kfold", "allOf": [ { "$ref": "#/definitions/CVSplitType" } ] }, "n_folds": { "title": "N Folds", "description": "How many CV folds should be produced.", "default": 20, "type": "integer" }, "test_size": { "title": "Test Size", "description": "Which portion of the dataset to leave for evaluation of the fold.", "default": 0.2, "type": "number" }, "type": { "title": "Type", "description": "To be set automatically. Cross Validation strategy alias to use (\"kfold\", \"stratifiedkfold\", etc.). Reference: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection", "default": "", "type": "string" }, "params": { "title": "Params", "description": "To be set automatically.Parameters that will be passed to corresponding Scikit-learn classes. Please refer to the official Scikit-learn documentation for details.", "default": {}, "type": "object" } } }, "HPOSection": { "title": "HPOSection", "description": "Configure hyper-params optimization for models selection process.", "type": "object", "properties": { "algorithm": { "title": "Algorithm", "description": "Target \"hyperopt\" algorithm that will be used for models hyper-parameter optimization.", "default": "tpe", "type": "string" }, "max_evals": { "title": "Max Evals", "description": "Defines a target number of HPO trials for all models. 
The more trials - the better models (in theory), the less trials - the faster HPO is done.", "default": 3, "type": "integer" }, "random_state": { "title": "Random State", "description": "Random state", "type": "integer" } } }, "ModelingPresetConfiguration": { "title": "ModelingPresetConfiguration", "description": "Defines an approach for automatically configuring modeling sections.", "type": "object", "properties": { "enable": { "title": "Enable", "description": "Enables automatic presets for modeling pipeline sections.", "default": true, "type": "boolean" }, "features_list": { "title": "Features List", "description": "Defines a list of features (referenced by regexps) that should be selected. Additional option\n is just to reference a configuration file that contains the required list of features:\n ...\n features_list: sub_cfg/features_list.yaml # a config file\n ...\n ", "default": [ ".*" ], "anyOf": [ { "type": "string" }, { "type": "array", "items": { "type": "string" } } ] } } }, "ModelingTask": { "title": "ModelingTask", "description": "Defines supported modeling tasks.", "enum": [ "classification", "regression", "survival" ], "type": "string" }, "ModelingTaskSpec": { "title": "ModelingTaskSpec", "description": "Defines metadata for modeling setup: modeling objective, target column and evaluation metric.", "type": "object", "properties": { "task": { "description": "Problem definition for modeling setup. Possible options: \"classification\", \"regression\", \"survival\".", "allOf": [ { "$ref": "#/definitions/ModelingTask" } ] }, "target": { "title": "Target", "description": "Target column for modeling, also known as dependent variable or outcome. In case of survival modeling, this column should contain time-to-event values (usuallyOS or PFS).", "type": "string" }, "target_metric": { "title": "Target Metric", "description": "Metric (loss) that will be used to evaluate models performance. Typical options per modeling objective: \"logloss\" for classification, \"mse\" for regression, \"cindex_inv\" for survival (inverted concordance index, so that it could be minimized). NOTE: at the moment only loss function are supported (minimization problems). Please refer to :lml:ref:`ML Metrics` for details.", "default": "", "type": "string" }, "event_query": { "title": "Event Query", "description": "(Applies for survival problems.) Query-like expression that indicates \"events\" (\"uncensored\") samples. For example: \"OS_CNSR == 1\". See :ref:`Dataset Queries` for details.", "default": "", "type": "string" }, "event_column": { "title": "Event Column", "description": "(Applies for survival problems.) Column used for event calculation. (We have to specify it so that is can be removed from features list after the dataset preprocessing). If you specify `event_observed: \"OS_CNSR == 1\"`, then also put `event_column: OS_CNSR`. ", "default": "", "type": "string" } }, "required": [ "task", "target" ] }, "DatasetPreprocessingPresetSection": { "title": "DatasetPreprocessingPresetSection", "description": "Defines 'syntax sugar' for semi-automated data preprocessing steps generation.", "type": "object", "properties": { "enable": { "title": "Enable", "description": "Whether to enable automated generation of preprocessing steps.", "default": true, "type": "boolean" }, "features_list": { "title": "Features List", "description": "Defines a list of features (referenced by regexps) that should be selected. 
Additional option\n is just to reference a configuration file that contains the required list of features:\n ...\n features_list: sub_cfg/features_list.yaml # a config file\n ...\n ", "default": [], "anyOf": [ { "type": "string" }, { "type": "array", "items": { "type": "string" } } ] }, "remove_correlated_features": { "title": "Remove Correlated Features", "description": "Whether to include a step that removes correlated features.", "default": true, "type": "boolean" }, "nans_per_row_fraction_threshold": { "title": "Nans Per Row Fraction Threshold", "description": "Defines maximum acceptable fraction of NaNs within a row.", "default": 0.9, "type": "number" }, "nans_fraction_threshold": { "title": "Nans Fraction Threshold", "description": "Defines maximum acceptable fraction of NaNs within a column.", "default": 0.7, "type": "number" }, "apply_log1p_to_target": { "title": "Apply Log1P To Target", "description": "Whether to apply log1p transformation to target column (applicable only for regression problems).", "default": false, "type": "boolean" }, "drop_datetime_columns": { "title": "Drop Datetime Columns", "description": "Whether to drop date time columns.", "default": true, "type": "boolean" }, "drop_dna_wt": { "title": "Drop Dna Wt", "description": "Whether to drop DNA WT values after one-hot-encoding.", "default": false, "type": "boolean" }, "imputer": { "title": "Imputer", "description": "Imputer to use. Possible values: (median, mice)", "default": "median", "type": "string" } } }, "PreprocessingStep": { "title": "PreprocessingStep", "description": "Defines data preprocessing step.", "type": "object", "properties": { "enable": { "title": "Enable", "description": "Whether to enable preprocessing step.", "default": true, "type": "boolean" }, "transformer": { "title": "Transformer", "description": "Alias of transformer to use. Please refer to :lml:ref:`Data Transformers` for details.", "type": "string" }, "params": { "title": "Params", "description": "Parameters that will be passed to the correspoding transformer instance.", "default": {}, "type": "object" } }, "required": [ "transformer" ] }, "DatasetPreprocessingSection": { "title": "DatasetPreprocessingSection", "description": "Defines data preprocessing section for modeling/survival setup.", "type": "object", "properties": { "enable": { "title": "Enable", "description": "Whether to enable Preprocessing Pipeline for dataset transformation.", "default": true, "type": "boolean" }, "preset": { "title": "Preset", "default": { "enable": false, "features_list": [], "remove_correlated_features": true, "nans_per_row_fraction_threshold": 0.9, "nans_fraction_threshold": 0.7, "apply_log1p_to_target": false, "drop_datetime_columns": true, "drop_dna_wt": false, "imputer": "median" }, "allOf": [ { "$ref": "#/definitions/DatasetPreprocessingPresetSection" } ] }, "steps": { "title": "Steps", "description": "Defines a list of preprocessing steps (transformations) to apply. See :lml:ref:`Data Transformers` for details.", "default": [], "type": "array", "items": { "$ref": "#/definitions/PreprocessingStep" } } } }, "ModelSelectionConfig": { "title": "ModelSelectionConfig", "description": "Configuration for particular model type selection.", "type": "object", "properties": { "name": { "title": "Name", "description": "Model's alias to use. Please refer to `EligibleModels` for available options.", "type": "string" }, "use_hpo": { "title": "Use Hpo", "description": "Whether model should be fine-tuned (HPO). 
Otherwise the default paramameters will be used.", "default": true, "type": "boolean" }, "hyper_params": { "title": "Hyper Params", "description": "Hyperparameters to use, in case a user wants to explicitly set those.", "default": {}, "type": "object" }, "params_space": { "title": "Params Space", "description": "Hyperparameters space to use within HPO (instead of predefined ones).", "default": {}, "type": "object" } }, "required": [ "name" ] }, "ModelSearchSection": { "title": "ModelSearchSection", "description": "Defines model search section.", "type": "object", "properties": { "enable": { "title": "Enable", "description": "Enable or disable Model Search/Selection process. It is recommended to enable `model_search` section in case `feature_importance` section is enabled. NOTE: Model Search section will be implicitly enabled in case `feature_importance` section is enabled.", "default": true, "type": "boolean" }, "models_random_state": { "title": "Models Random State", "description": "Random state for models which require it.", "type": "integer" }, "limit": { "title": "Limit", "description": "Limit number of selected models.", "default": 6, "type": "integer" }, "models": { "title": "Models", "description": "Defines a list of models which are to be fine-tuned. NOTE: in case this option is unset, all available models for corresponding \"task\" from `metadata` section will be used. Please refer to :lml:ref:`Model Types` for available options.", "default": [], "type": "array", "items": { "anyOf": [ { "$ref": "#/definitions/ModelSelectionConfig" }, { "type": "string" } ] } }, "baseline_model": { "title": "Baseline Model", "description": "Defines a model's alias that will be used to filter out models that don't perform better (in terms of averaged \"target_metric\" on cross-validation) than \"baseline\" model. NOTE: by default \"dummy\" model will be used for corresponding \"task\". Please refer to :lml:ref:`Model Types` for available options.", "default": "", "anyOf": [ { "$ref": "#/definitions/ModelSelectionConfig" }, { "type": "string" } ] }, "pvalue_threshold": { "title": "Pvalue Threshold", "description": "Threshold for p-value when testing hypothesis that model loss is less than baseline loss. Applicable only when CV list is long enough (>=7)", "default": 0.05, "type": "number" }, "cross_validation": { "title": "Cross Validation", "default": { "random_state": null, "split_type": "kfold", "n_folds": 20, "test_size": 0.2, "type": "", "params": {} }, "allOf": [ { "$ref": "#/definitions/CrossValidationSection" } ] } } }, "FeatureImportanceMethod": { "title": "FeatureImportanceMethod", "description": "Defines feature importance method.", "type": "object", "properties": { "enable": { "title": "Enable", "description": "DEPRECATED Enables feature importance method.", "default": true, "type": "boolean" }, "extractor_id": { "title": "Extractor Id", "description": "Alias of feature importance extractor/method to use. Please refer to `EligibleFIExtractors` for details.", "type": "string" }, "params": { "title": "Params", "description": "Parameters that will be passed to the extractor constructor.", "default": {}, "type": "object" }, "n_models": { "title": "N Models", "description": "DEPRECATED Implicitly use only the top N models (in terms of CV performance) from available \"selected\" candidates. Might make sense to use that option when different models perform better on different stratas. NOTE: ignored when \"models\" option is set. NOTE: should be non-negative. 
In case \"n_models\" is equal to 0 - all available candidate models are used.", "default": 0, "type": "integer" }, "models": { "title": "Models", "description": "DEPRECATED Explicit list of models that should be used (in case Model Selection resulted in too many models - it is possible to narrow down the list).", "default": [], "type": "array", "items": { "type": "string" } }, "fallback_model": { "title": "Fallback Model", "description": "DEPRECATeD Alias of fallback model to use. Please refer to `EligibleModels` for available options. In case Model Selection resulted in no \"reasonable\" models, it still might make sense to use some model anyway for importances extraction.", "default": "", "type": "string" }, "dump_raw_extractor": { "title": "Dump Raw Extractor", "description": "NEED REVIEW When set to True, the whole feature importance extractor is dumped to pickle file. Default location is \"extractors\" subfolder (see `FeatureImportanceOutputStructure.get_extractor_dump_path`", "default": false, "type": "boolean" } }, "required": [ "extractor_id" ] }, "FeatureImportanceWorkflow": { "title": "FeatureImportanceWorkflow", "description": "Defines a workflow for FI execution.", "type": "object", "properties": { "generate_intermediate_results": { "title": "Generate Intermediate Results", "description": "Enables creation of per-dataset/per-repeat/per-bootstrap-iteration feature importance artifacts.The main purpose of this option is to enable parallelization of feature importance artifacts generation process.", "default": true, "type": "boolean" }, "aggregate_intermediate_results": { "title": "Aggregate Intermediate Results", "description": "Enables aggregation of per-dataset feature importance artifacts to \"global\" level. Currently feature importance artifacts are represented as features ranks, aggregation is simply a ranks `averaging` across all dataset-level results. The main purpose of this option is to enable parallelization of feature importance artifacts generation process.", "default": true, "type": "boolean" }, "generate_global_summary": { "title": "Generate Global Summary", "description": "Enables summarization of aggregated feature importance results across different method used. The main purpose of this option is to enable parallelization of feature importance artifacts generation process.", "default": true, "type": "boolean" } } }, "FeatureImportanceSection": { "title": "FeatureImportanceSection", "description": "Defines feature importance section for modeling setup.", "type": "object", "properties": { "enable": { "title": "Enable", "description": "Enables feature importance artifacts generation.", "default": true, "type": "boolean" }, "cross_validation": { "title": "Cross Validation", "default": { "random_state": null, "split_type": "kfold", "n_folds": 100, "test_size": 0.25, "type": "", "params": {} }, "allOf": [ { "$ref": "#/definitions/CrossValidationSection" } ] }, "perform_tier1_greedy": { "title": "Perform Tier1 Greedy", "default": false, "type": "boolean" }, "fid_pvalue_threshold": { "title": "Fid Pvalue Threshold", "default": 0.05, "type": "number" }, "n_random_iters": { "title": "N Random Iters", "default": 5, "type": "integer" }, "random_state": { "title": "Random State", "description": "State to initialize random numbers generation.", "type": "integer" }, "default_extractor": { "title": "Default Extractor", "description": "Feature importance extractor to be used by default. 
When not specified, importance is extracted from the model coefficients - which is naturally possible only for models which support it.", "allOf": [ { "$ref": "#/definitions/FeatureImportanceMethod" } ] }, "default_n_perm_imp_iters": { "title": "Default N Perm Imp Iters", "description": "Number of permutations for (default) permutation feature extractor.", "default": 10, "type": "integer" }, "extractors": { "title": "Extractors", "description": "Map specific model to Feature importance extractor. If not specified, `default_extractor` is used.", "default": {}, "type": "object", "additionalProperties": { "$ref": "#/definitions/FeatureImportanceMethod" } }, "workflow": { "title": "Workflow", "default": { "generate_intermediate_results": true, "aggregate_intermediate_results": true, "generate_global_summary": true }, "allOf": [ { "$ref": "#/definitions/FeatureImportanceWorkflow" } ] }, "methods": { "title": "Methods", "description": "DEPREPCATED! Enumerates target feature imporatance methods/extractors to apply.", "default": [], "type": "array", "items": { "$ref": "#/definitions/FeatureImportanceMethod" } } } }, "ModelingSetup": { "title": "ModelingSetup", "description": "Defines parameters for modeling problem (also called \"setup\").\n\nTypical modeling workflow has the following steps:\n\n- metadata - key information for modeling (task, target, metric).\n- dataset preprocessing - preferred strategy for data preparation prior to modeling.\n- datasets - defined bootstrapping setup (number of iterations).\n- model search - defines target set of models to use for feature importance extraction,\n models are tuned and only appropriate ones (in term of performance) are selected for upstream usage.\n- feature importance - defines target feature extractions methods.", "type": "object", "properties": { "enable": { "title": "Enable", "description": "Enable or disable modeling setup.", "default": true, "type": "boolean" }, "preset": { "title": "Preset", "default": { "enable": false, "features_list": [ ".*" ] }, "allOf": [ { "$ref": "#/definitions/ModelingPresetConfiguration" } ] }, "metadata": { "$ref": "#/definitions/ModelingTaskSpec" }, "dataset_preprocessing": { "title": "Dataset Preprocessing", "default": { "enable": false, "preset": { "enable": false, "features_list": [], "remove_correlated_features": true, "nans_per_row_fraction_threshold": 0.9, "nans_fraction_threshold": 0.7, "apply_log1p_to_target": false, "drop_datetime_columns": true, "drop_dna_wt": false, "imputer": "median" }, "steps": [] }, "allOf": [ { "$ref": "#/definitions/DatasetPreprocessingSection" } ] }, "model_search": { "title": "Model Search", "default": { "enable": true, "models_random_state": null, "limit": 6, "models": [], "baseline_model": "", "pvalue_threshold": 0.05, "cross_validation": { "random_state": null, "split_type": "kfold", "n_folds": 20, "test_size": 0.2, "type": "", "params": {} } }, "allOf": [ { "$ref": "#/definitions/ModelSearchSection" } ] }, "feature_importance": { "title": "Feature Importance", "default": { "enable": true, "cross_validation": { "random_state": null, "split_type": "kfold", "n_folds": 100, "test_size": 0.25, "type": "", "params": {} }, "perform_tier1_greedy": false, "fid_pvalue_threshold": 0.05, "n_random_iters": 5, "random_state": null, "default_extractor": null, "default_n_perm_imp_iters": 10, "extractors": {}, "workflow": { "generate_intermediate_results": true, "aggregate_intermediate_results": true, "generate_global_summary": true }, "methods": [] }, "allOf": [ { "$ref": 
"#/definitions/FeatureImportanceSection" } ] } } } } }
- Fields
- field enable: bool = True
Enable or disable Modeling workflow.
- field cross_validation: logml.configuration.cross_validation.CrossValidationSection = CrossValidationSection(random_state=None, split_type=<CVSplitType.KFOLD: 'kfold'>, n_folds=20, test_size=0.2, type='', params={})
- field hpo: logml.configuration.modeling.HPOSection = HPOSection(algorithm='tpe', max_evals=3, random_state=None)
- field problems: Dict[str, logml.configuration.modeling.ModelingSetup] = {}
Defines a set of named “modeling problem setup” configurations. Problems are usually similar to one another but differ in their target variables.
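For instance, a modeling section with shared cross-validation and HPO settings and two named problems might look like the sketch below. The top-level modeling key, the problem names, and the target columns are assumptions for illustration; the nested keys come from the schema above.
# Hypothetical top-level modeling section (YAML); problem names and
# target columns are placeholders.
modeling:
  enable: true
  cross_validation:
    split_type: kfold
    n_folds: 20
    test_size: 0.2
  hpo:
    algorithm: tpe   # target "hyperopt" algorithm
    max_evals: 3     # more trials: better models (in theory) but slower HPO
  problems:
    response_regression:
      metadata:
        task: regression
        target: RESPONSE_SCORE
        target_metric: mse
    os_survival:
      metadata:
        task: survival
        target: OS
        target_metric: cindex_inv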
- get_target_dg_problems() → List[str]
Returns modeling setups for which the DG section is enabled.
- get_target_fi_problems() → List[str]
Returns modeling setups for which the FI (feature importance) section is enabled.
- get_target_ms_problems() → List[str]
Returns modeling setups for which the MS (model search) section is enabled.