Configuration Overview

A typical LogML configuration file has the following structure:

version:         # <logml version>
stratification:  # how to split the dataset into strata

# Kinds of analyses:

# Exploratory data analysis.
eda:

# Survival analysis: univariate and multivariate Cox models.
survival_analysis:

# ML-based feature importance calculation.
modeling:

# Main section for DAG steps configuration:
# greedy split, modeling steps, etc.
analysis:

# Report building configuration.
report:

See the detailed documentation on each section at logml.configuration.

Here is a configuration file from an existing LogML example:

# ./log_ml.sh pipeline run -p wine -c examples/wine/modeling.yaml -d examples/wine/wine.csv -n wine-modeling

version:                    1.0.0

dataset_metadata:
    modeling_specs:
        QUALITY:
            # We want to predict "quality" - a regression problem.
            task:           regression
            target_column:  quality
            target_metric:  rmse

modeling:
    hpo:
        max_evals:              4
    cross_validation:
        type:                                 kfold
        params:
            n_splits:                         8
    problems:
        QUALITY:
            preset:
                enable:     True
            model_search:
                models:  # limit to two models - linear and tree-based.
                    - LassoModel
                    - RandomForestRegressorModel
            # Reduce the amount of computation for testing.
            feature_importance:
                cross_validation:
                    random_state: 42
                    split_type: kfold
                    n_folds: 12
                    test_size: 0.25
                    type: ''
                    params: { }
                perform_tier1_greedy: false
                fid_pvalue_threshold: 0.1
                n_random_iters: 2
                random_state: 42
                default_n_perm_imp_iters: 2

The default config file has most of its sections disabled:

version: 1.0.0
dataset_metadata: null
stratification: []
survival_analysis:
  enable: false
  problems: {}
modeling:
  enable: false
  cross_validation:
    random_state: 42
    split_type: kfold
    n_folds: 20
    test_size: 0.2
    type: ''
    params: {}
  hpo:
    algorithm: tpe
    max_evals: 3
  problems: {}
eda:
  enable: false
  preprocessing_problem_id: ''
  dataset_preprocessing:
    enable: false
    preset:
      enable: false
      features_list: []
      remove_correlated_features: true
      nans_per_row_fraction_threshold: 0.9
      nans_fraction_threshold: 0.7
      apply_log1p_to_target: false
      drop_datetime_columns: true
      drop_dna_wt: false
    steps: []
  artifacts: []
  params:
    correlation_type: pearson
    correlation_threshold: 0.8
    correlation_min_samples_fraction: 0.2
    correlation_group_level_cutoff: 1
    correlation_key_names:
    - TP53
    - KRAS
    - CDKN2A
    - CDKN2B
    - PIK3CA
    - ATM
    - BRCA1
    - SOX2
    - GNAS2
    - TERC
    - STK11
    - PDCD1
    - LAG3
    - TIGIT
    - HAVCR2
    - EOMES
    - MTAP
    large_data_threshold: (500, 1000)
    huge_data_threshold: (2000, 5000)
report:
  enable: true
  report_structure:
    master_summary:
      enable: false
      eda: false
      feature_importance: ''
      survival_feature_importance: ''
      baseline_modeling: ''
      survival_analysis: ''
    eda: false
    modeling: []
    cross_strata_fi_summary: []
    survival_analysis: []
    greedy_split: []
    rnaseq_differential_expression: []
    rnaseq_enrichment_analysis: []
    report_diagnostics: true
    report_summary: true
  workflow:
    produce_and_execute_strata_notebooks: true
    produce_and_execute_global_notebooks: true
    generate_report: true
analysis:
  enable: false
  items: []
is_dag_config: false
source_files:
  main_config: ''
  refs: []

Top-level entries in the configuration file are mapped to the GlobalConfig object. Please refer to the individual configuration-related classes: all of their attributes map directly to config entries.

class logml.configuration.global_config.GlobalConfig

Bases: pydantic.main.BaseModel

Global config schema.

When configuring logml, an object of this class is populated from yaml config directly, meaning that all root-level yaml entries are mapped to properties of this class.

field version: str = 'unknown'

Corresponds to the last compatible LogML version, e.g. “0.2.4”. This field is for information only, logml does not validate it.

field random_state: Optional[int] = None

Random state for main random numbers generator.

field dataset_metadata: logml.configuration.modeling.DatasetMetadataSection = None

Metadata for the incoming dataset (identifiers, columns groups, etc).

field stratification: List[logml.configuration.stratification.Strata] = []

Strata are subsets of the original dataset, independent of one another. Essentially, the complete analysis pipeline runs for each stratum independently (but all strata are compared on a special report page named “Cross-strata comparison”). Stratification can follow, for example, treatment arms or the test/control grouping of samples. NOTE: the analysis section does not follow stratification rules, but defines its own specific ones.

field survival_analysis: logml.configuration.survival_analysis.SurvivalAnalysisSection = SurvivalAnalysisSection(enable=False, problems={})
field modeling: logml.configuration.modeling.ModelingSection = ModelingSection(enable=False, cross_validation=CrossValidationSection(random_state=None, split_type=<CVSplitType.KFOLD: 'kfold'>, n_folds=20, test_size=0.2, type='', params={}), hpo=HPOSection(algorithm='tpe', max_evals=3, random_state=None), problems={})
field eda: logml.configuration.eda.EDAArtifactsGenerationSection = EDAArtifactsGenerationSection(enable=False, preprocessing_problem_id='', dataset_preprocessing=DatasetPreprocessingSection(enable=False, preset=DatasetPreprocessingPresetSection(enable=False, features_list=[], remove_correlated_features=True, nans_per_row_fraction_threshold=0.9, nans_fraction_threshold=0.7, apply_log1p_to_target=False, drop_datetime_columns=True, drop_dna_wt=False, imputer='median'), steps=[]), artifacts=[], params=EDAArtifactsGenerationParameters(correlation_type=<CorrelationType.PEARSON: 'pearson'>, correlation_threshold=0.8, correlation_min_samples_fraction=0.2, correlation_group_level_cutoff=1, correlation_key_names=['TP53', 'KRAS', 'CDKN2A', 'CDKN2B', 'PIK3CA', 'ATM', 'BRCA1', 'SOX2', 'GNAS2', 'TERC', 'STK11', 'PDCD1', 'LAG3', 'TIGIT', 'HAVCR2', 'EOMES', 'MTAP'], large_data_threshold=(500, 1000), huge_data_threshold=(2000, 5000)))
field report: logml.configuration.report.ReportSection = ReportSection(enable=True, report_structure=ReportStructure(master_summary=MasterSummarySection(enable=False, eda=False, feature_importance='', survival_feature_importance='', baseline_modeling='', survival_analysis=''), eda=False, modeling=[], cross_strata_fi_summary=[], survival_analysis=[], greedy_split=[], rnaseq_differential_expression=[], rnaseq_enrichment_analysis=[], report_diagnostics=True, report_summary=True), workflow=BaselineKitWorkflowSection(produce_and_execute_strata_notebooks=True, produce_and_execute_global_notebooks=True, generate_report=True))
field analysis: logml.analysis.config.AnalysisPipelineConfig = AnalysisPipelineConfig(enable=False, items=[])
field is_dag_config: bool = False
field source_files: logml.configuration.global_config.ConfigSources = ConfigSources(main_config='', refs=[])
classmethod populate_reporting_problems(values) → None

Automatically put modeling and survival problems into reporting section.

classmethod validate_report(values) → None

Baselinekit validation.

analysis_section_enabled()

Checks whether ‘analysis’ section is enabled.

modeling_section_enabled()

Checks whether ‘modeling’ section is enabled.

survival_section_enabled()

Checks whether ‘survival_analysis’ section is enabled.

get_target_survival_problems() → List[str]

Returns modeling setups for which a given section is enabled.

baselinekit_section_enabled()

Checks whether ‘report’ section is enabled.

classmethod load(path: Union[str, pathlib.Path]) → logml.configuration.global_config.GlobalConfig

Load config from yaml file.

generate_random_state() → int

Generates a random state.

Most configuration-related classes are located in logml.configuration.

Input metadata configuration

One of the most important pieces of information we want to convey to LogML is the structure of the incoming data. Based on it, we define the Analysis Problems for LogML to consider.

This section is covered by class DatasetMetadataSection.

Using the columns_metadata attribute, we set the data types and ‘categorical’ flags for the columns of the dataset.

By setting modeling_specs, we define globally the target variables and the kind of modeling problem to apply to each.

Sample configuration:

dataset_metadata:
    modeling_specs:

        # Configuration for problems to find the relation of covariates to Overall Survival.
        OS:
            time_column:            time
            event_query:            'cens == 0'
            event_column:           cens

        # Model the relation of covariates to the "treatment outcome" value, which is categorical,
        # hence use a classification approach.
        Outcome:
            task:                  classification
            target:                Outcome
            target_metric:         rocauc

    key_columns:
        - subj_id              # Key column, this is neither feature,
                               # nor target, just an indicator.

    # Specify some predefined metadata.

    columns_metadata:
        - name: gender
          data_type: str
          is_categorical: true
        - name: birthdate
          data_type: datetime64[ns]

Modeling

This section is covered by class ModelingSection, which is essentially a list of ModelingSetup items.

There is a predefined modeling setup, which is turned on by setting preset to the enabled state. The modeling preset performs the following (see the sketch after this list):

  • Uses the default Data Preprocessing configuration (see the Preprocessing section).

  • Generates 5 shuffled datasets.

  • Enables the Model Selection process (applied to all models with the matching objective: classification, regression, or survival).

  • Enables Feature Importance with the 3 best models.
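
A minimal sketch that enables the preset for a single problem; MY_PROBLEM is a hypothetical problem id, which must match a key defined under dataset_metadata.modeling_specs:

modeling:
    enable: true
    problems:
        MY_PROBLEM:
            preset:
                enable: true  # turns on the predefined modeling setup described above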

Preprocessing

If preset is enabled in the dataset preprocessing section, then a lightweight configuration section is applied.

class logml.configuration.modeling.DatasetPreprocessingPresetSection

Bases: pydantic.main.BaseModel

Defines ‘syntactic sugar’ for semi-automated generation of data preprocessing steps.

field enable: bool = True

Whether to enable automated generation of preprocessing steps.

field features_list: Union[str, List[str]] = []

Defines a list of features (referenced by regexps) that should be selected. An additional option is to reference a configuration file that contains the required list of features: … features_list: sub_cfg/features_list.yaml  # a config file …

field remove_correlated_features: bool = True

Whether to include a step that removes correlated features.

field nans_per_row_fraction_threshold: float = 0.9

Defines maximum acceptable fraction of NaNs within a row.

field nans_fraction_threshold: float = 0.7

Defines maximum acceptable fraction of NaNs within a column.

field apply_log1p_to_target: bool = False

Whether to apply log1p transformation to target column (applicable only for regression problems).

field drop_datetime_columns: bool = True

Whether to drop date time columns.

field drop_dna_wt: bool = False

Whether to drop DNA WT values after one-hot-encoding.

field imputer: str = 'median'

Imputer to use. Possible values: median, mice.

Example:

eda:
  enable: true
  preprocessing_problem_id: ''
  dataset_preprocessing:
    preset:
      features_list:
      - .*
      remove_correlated_features: true
      nans_fraction_threshold: 0.7
      apply_log1p_to_target: false
      drop_datetime_columns: true

Configuration utilities

In addition to the ability to launch pipelines, the log_ml.py interface provides several commands that make config maintenance easier, including schema validation and other useful utilities.

For config-related utilities see log_ml config command in Command Line Parameters page.

Dataset Queries

In some places in the config, such as strata selection or the survival event query, we take a dynamic approach to querying data from the dataset. To do this, we specify a line of text, somewhat similar to SQL, but with Python specifics. For example:

stratification:
    - strata_id: A_arm
      query: 'arm == "A"'
    - strata_id: BC_arms
      query: 'arm.isin(["B", "C"])'

Here are some basic rules for creating a proper query. (There are more for advanced use, but they are out of scope of this guide.)

  • Use the general form <Feature name> <Operator> <Constant Value>. (It is possible to compare one column to another, but do it only when you clearly understand the data.)

  • As this is a complex string, always surround it with quotes in the yaml config, as in the example above.

  • Use the feature name without quotes and string constants in double quotes.

  • If the feature name contains whitespace or special characters, use backtick quoting (see the sketch after this list).

  • For an equality check, use the double equal sign "==".

  • Use simple operators: "<", ">", "==", "<=", ">=".

  • The special function "isin" checks that a value is present in a list of values: arm.isin(["B", "C"]). Use square brackets to define the list of values.

  • Be sure to check the column's value type. If a column contains strings and you type A == 1, Python treats 1 as a number, and naturally, the number 1 is never equal to any of the string values in column A, so such a query will always return no records whatsoever.
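
For instance, a query using backtick quoting for a column name that contains whitespace (the column name visit day and the threshold value are illustrative):

stratification:
    - strata_id: late_visits
      query: '`visit day` >= 28'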

Random States

There are many places where LogML uses random states:

  • Fitting models like RandomForest.

  • Cross validation splits.

  • Random features cutoff test.

The rule of thumb is as follows:

When a random state is not set (anywhere it is used), it is initialized from the main LogML random generator for the run and fixed in the ‘_dag’ config for the run.
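
For example, to make a run reproducible, the main generator can be pinned explicitly at the root level of the config (the value 1234 is arbitrary):

version: 1.0.0
random_state: 1234  # seeds the main LogML random generator; unset downstream random states derive from it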