Configuration Overview

A typical LogML configuration file has the following structure:

version:         # <logml version>
stratification:  # how to split the dataset into strata

# Kinds of analyses:

# Exploratory data analysis.
eda:

# Survival analysis: univariate and multivariate Cox models.
survival_analysis:

# ML-based feature importance calculation.
modeling:

# Main section for DAG steps configuration:
# greedy split, modeling steps, etc.
analysis:

# Report building configuration.
report:

See the detailed documentation on each section at logml.configuration.

Here is a configuration file from an existing LogML example:

# ./log_ml.sh pipeline run -p wine -c examples/wine/modeling.yaml -d examples/wine/wine.csv -n wine-modeling

version:                    1.0.0

dataset_metadata:
    modeling_specs:
        QUALITY:
            # We want to predict "quality" - a regression problem.
            task:           regression
            target_column:  quality
            target_metric:  rmse

modeling:
    hpo:
        max_evals:              4
    cross_validation:
        type:                                 kfold
        params:
            n_splits:                         8
    problems:
        QUALITY:
            preset:
                enable:     True
            model_search:
                models:  # limit to two models - linear and tree-based.
                    - LassoModel
                    - RandomForestRegressorModel
            # Reduce the amount of computation for testing.
            feature_importance:
                cross_validation:
                    random_state: 42
                    split_type: kfold
                    n_folds: 12
                    test_size: 0.25
                    type: ''
                    params: { }
                perform_tier1_greedy: false
                fid_pvalue_threshold: 0.1
                n_random_iters: 2
                random_state: 42
                default_n_perm_imp_iters: 2

The default config file has most of its sections disabled:

version: 1.0.0
dataset_metadata: null
stratification: []
survival_analysis:
  enable: false
  problems: {}
modeling:
  enable: false
  cross_validation:
    random_state: 42
    split_type: kfold
    n_folds: 20
    test_size: 0.2
    type: ''
    params: {}
  hpo:
    algorithm: tpe
    max_evals: 3
  problems: {}
eda:
  enable: false
  preprocessing_problem_id: ''
  dataset_preprocessing:
    enable: false
    preset:
      enable: false
      features_list: []
      remove_correlated_features: true
      nans_per_row_fraction_threshold: 0.9
      nans_fraction_threshold: 0.7
      apply_log1p_to_target: false
      drop_datetime_columns: true
      drop_dna_wt: false
    steps: []
  artifacts: []
  params:
    correlation_type: pearson
    correlation_threshold: 0.8
    correlation_min_samples_fraction: 0.2
    correlation_group_level_cutoff: 1
    correlation_key_names:
    - TP53
    - KRAS
    - CDKN2A
    - CDKN2B
    - PIK3CA
    - ATM
    - BRCA1
    - SOX2
    - GNAS2
    - TERC
    - STK11
    - PDCD1
    - LAG3
    - TIGIT
    - HAVCR2
    - EOMES
    - MTAP
    large_data_threshold: (500, 1000)
    huge_data_threshold: (2000, 5000)
report:
  enable: true
  report_structure:
    master_summary:
      enable: false
      eda: false
      feature_importance: ''
      survival_feature_importance: ''
      baseline_modeling: ''
      survival_analysis: ''
    eda: false
    modeling: []
    cross_strata_fi_summary: []
    survival_analysis: []
    greedy_split: []
    rnaseq_differential_expression: []
    rnaseq_enrichment_analysis: []
    report_diagnostics: true
    report_summary: true
  workflow:
    produce_and_execute_strata_notebooks: true
    produce_and_execute_global_notebooks: true
    generate_report: true
analysis:
  enable: false
  items: []
is_dag_config: false
source_files:
  main_config: ''
  refs: []

Top-level entries in the configuration file are mapped to the GlobalConfig object. Please refer to the individual configuration-related classes: all of their attributes map directly to config entries.

class logml.configuration.global_config.GlobalConfig

Bases: pydantic.main.BaseModel

Global config schema.

When configuring logml, an object of this class is populated from yaml config directly, meaning that all root-level yaml entries are mapped to properties of this class.

field version: str = 'unknown'

Corresponds to the last compatible LogML version, e.g. “0.2.4”. This field is for information only, logml does not validate it.

field random_state: Optional[int] = None

Random state for main random numbers generator.

field dataset_metadata: logml.configuration.modeling.DatasetMetadataSection = None

Metadata for the incoming dataset (identifiers, columns groups, etc).

field stratification: List[logml.configuration.stratification.Strata] = []

Strata are subsets of the original dataset, independent of one another. Essentially, the complete analysis pipeline runs for each stratum independently (but all strata are compared on a special report page named “Cross-strata comparison”). Stratification can follow, for example, treatment arms or the test/control grouping of samples. NOTE: the analysis section does not follow stratification rules, but defines its own specific ones.

field survival_analysis: logml.configuration.survival_analysis.SurvivalAnalysisSection = SurvivalAnalysisSection(enable=False, problems={})
field modeling: logml.configuration.modeling.ModelingSection = ModelingSection(enable=False, cross_validation=CrossValidationSection(random_state=None, split_type=<CVSplitType.KFOLD: 'kfold'>, n_folds=20, test_size=0.2, type='', params={}), hpo=HPOSection(algorithm='tpe', max_evals=3, random_state=None), problems={})
field eda: logml.configuration.eda.EDAArtifactsGenerationSection = EDAArtifactsGenerationSection(enable=False, preprocessing_problem_id='', dataset_preprocessing=DatasetPreprocessingSection(enable=False, preset=DatasetPreprocessingPresetSection(enable=False, features_list=[], remove_correlated_features=True, nans_per_row_fraction_threshold=0.9, nans_fraction_threshold=0.7, apply_log1p_to_target=False, drop_datetime_columns=True, drop_dna_wt=False, imputer='median'), steps=[]), artifacts=[], params=EDAArtifactsGenerationParameters(correlation_type=<CorrelationType.PEARSON: 'pearson'>, correlation_threshold=0.8, correlation_min_samples_fraction=0.2, correlation_group_level_cutoff=1, correlation_key_names=['TP53', 'KRAS', 'CDKN2A', 'CDKN2B', 'PIK3CA', 'ATM', 'BRCA1', 'SOX2', 'GNAS2', 'TERC', 'STK11', 'PDCD1', 'LAG3', 'TIGIT', 'HAVCR2', 'EOMES', 'MTAP'], large_data_threshold=(500, 1000), huge_data_threshold=(2000, 5000)))
field report: logml.configuration.report.ReportSection = ReportSection(enable=True, report_structure=ReportStructure(master_summary=MasterSummarySection(enable=False, eda=False, feature_importance='', survival_feature_importance='', baseline_modeling='', survival_analysis=''), eda=False, modeling=[], cross_strata_fi_summary=[], survival_analysis=[], greedy_split=[], rnaseq_differential_expression=[], rnaseq_enrichment_analysis=[], report_diagnostics=True, report_summary=True), workflow=BaselineKitWorkflowSection(produce_and_execute_strata_notebooks=True, produce_and_execute_global_notebooks=True, generate_report=True))
field analysis: logml.analysis.config.AnalysisPipelineConfig = AnalysisPipelineConfig(enable=False, items=[])
field is_dag_config: bool = False
field source_files: logml.configuration.global_config.ConfigSources = ConfigSources(main_config='', refs=[])
classmethod populate_reporting_problems(values) → None

Automatically put modeling and survival problems into reporting section.

classmethod validate_report(values) → None

Baselinekit validation.

analysis_section_enabled()

Checks whether ‘analysis’ section is enabled.

modeling_section_enabled()

Checks whether ‘modeling’ section is enabled.

survival_section_enabled()

Checks whether ‘survival_analysis’ section is enabled.

get_target_survival_problems() → List[str]

Returns modeling setups for which a given section is enabled.

baselinekit_section_enabled()

Checks whether ‘report’ section is enabled.

classmethod load(path: Union[str, pathlib.Path]) → logml.configuration.global_config.GlobalConfig

Load config from yaml file.

generate_random_state() → int

Generates a random state.

Most configuration-related classes are located in logml.configuration.

Input metadata configuration

One of the most important pieces of information we want to convey to LogML is the structure of the incoming data. Based on it, we define the Analysis Problems for LogML to consider.

This section is covered by class DatasetMetadataSection.

Using the columns_metadata attribute, we set the data types and ‘categorical’ flags for the columns of the dataset.

By setting modeling_specs, we define globally the target variables and the kind of modeling problem to apply to each.

Sample configuration:

dataset_metadata:
    modeling_specs:

        # Configuration for problems to find the relation of covariates to Overall Survival.
        OS:
            time_column:            time
            event_query:            'cens == 0'
            event_column:           cens

        # Model the relation of covariates to the "treatment outcome" value, which is categorical,
        # hence use a classification approach.
        Outcome:
            task:                  classification
            target:                Outcome
            target_metric:         rocauc

    key_columns:
        - subj_id              # Key column, this is neither feature,
                               # nor target, just an indicator.

    # Specify some predefined metadata.

    columns_metadata:
        - name: gender
          data_type: str
          is_categorical: true
        - name: birthdate
          data_type: datetime64[ns]

Modeling

This section is covered by class ModelingSection, which is essentially a list of ModelingSetup items.

There is a predefined modeling setup, which is turned on by setting preset to the enabled state. The modeling preset performs the following (see the sketch after this list):

  • Uses the default Data Preprocessing configuration (see the Preprocessing section).

  • Generates 5 shuffled datasets.

  • Enables the Model Selection process (applied to all models with the matching objective: classification, regression, or survival).

  • Enables Feature Importance with the 3 best models.
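
A minimal sketch that enables the preset for a single problem; MY_PROBLEM is a hypothetical problem id, which must match a key defined under dataset_metadata.modeling_specs:

modeling:
    enable: true
    problems:
        MY_PROBLEM:
            preset:
                enable: true  # turns on the predefined modeling setup described above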

Preprocessing

If preset is enabled in the dataset preprocessing section, then a lightweight configuration section is applied.

class logml.configuration.modeling.DatasetPreprocessingPresetSection

Bases: pydantic.main.BaseModel

Defines ‘syntactic sugar’ for semi-automated generation of data preprocessing steps.

field enable: bool = True

Whether to enable automated generation of preprocessing steps.

field features_list: Union[str, List[str]] = []

Defines a list of features (referenced by regexps) that should be selected. An additional option is to reference a configuration file that contains the required list of features: … features_list: sub_cfg/features_list.yaml  # a config file …

field remove_correlated_features: bool = True

Whether to include a step that removes correlated features.

field nans_per_row_fraction_threshold: float = 0.9

Defines maximum acceptable fraction of NaNs within a row.

field nans_fraction_threshold: float = 0.7

Defines maximum acceptable fraction of NaNs within a column.

field apply_log1p_to_target: bool = False

Whether to apply log1p transformation to target column (applicable only for regression problems).

field drop_datetime_columns: bool = True

Whether to drop date time columns.

field drop_dna_wt: bool = False

Whether to drop DNA WT values after one-hot-encoding.

field imputer: str = 'median'

Imputer to use. Possible values: median, mice.

Example:

eda:
  enable: true
  preprocessing_problem_id: ''
  dataset_preprocessing:
    preset:
      features_list:
      - .*
      remove_correlated_features: true
      nans_fraction_threshold: 0.7
      apply_log1p_to_target: false
      drop_datetime_columns: true

Configuration utilities

In addition to the ability to launch pipelines, the log_ml.py interface provides several commands that make config maintenance easier, including schema validation and other useful utilities.

For config-related utilities see log_ml config command in Command Line Parameters page.

Dataset Queries

In some places in the config, such as strata selection or the survival event query, we take a dynamic approach to querying data from the dataset. To do this, we specify a line of text, somewhat similar to SQL, but with Python specifics. For example:

stratification:
    - strata_id: A_arm
      query: 'arm == "A"'
    - strata_id: BC_arms
      query: 'arm.isin(["B", "C"])'

Here are some basic rules for creating a proper query. (There are more for advanced use, but they are out of scope of this guide.)

  • Use the general form <Feature name> <Operator> <Constant Value>. (It is possible to compare one column to another, but do it only when you clearly understand the data.)

  • As this is a complex string, always surround it with quotes in the yaml config, as in the example above.

  • Use the feature name without quotes and string constants in double quotes.

  • If the feature name contains whitespace or special characters, use backtick quoting (see the sketch after this list).

  • For an equality check, use the double equal sign "==".

  • Use simple operators: "<", ">", "==", "<=", ">=".

  • The special function "isin" checks that a value is present in a list of values: arm.isin(["B", "C"]). Use square brackets to define the list of values.

  • Be sure to check the column's value type. If a column contains strings and you type A == 1, Python treats 1 as a number, and naturally, the number 1 is never equal to any of the string values in column A, so such a query will always return no records whatsoever.
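
For instance, a query using backtick quoting for a column name that contains whitespace (the column name visit day and the threshold value are illustrative):

stratification:
    - strata_id: late_visits
      query: '`visit day` >= 28'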

Random States

There are many places where LogML uses random states:

  • Fitting models like RandomForest.

  • Cross validation splits.

  • Random features cutoff test.

The rule of thumb is as follows:

When a random state is not set (anywhere it is used), it is initialized from the main LogML random generator for the run and fixed in the ‘_dag’ config for the run.
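
For example, to make a run reproducible, the main generator can be pinned explicitly at the root level of the config (the value 1234 is arbitrary):

version: 1.0.0
random_state: 1234  # seeds the main LogML random generator; unset downstream random states derive from it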