Configuration Overview
======================

A typical LogML configuration file has the following structure:

.. code-block:: yaml

    version:

    # How to split the dataset into strata.
    stratification:

    # Kinds of analyses:

    # EDA.
    eda:

    # Survival analysis: univariate and multivariate Cox.
    survival_analysis:

    # ML-based feature importance calculation.
    modeling:

    # Main section for DAG steps configuration:
    # greedy split, modeling steps, etc.
    analysis:

    # Report building configuration.
    report:

Check the detailed documentation on each section at :py:mod:`logml.configuration`.

Here is the configuration file from an existing LogML example:

.. literalinclude:: ../../../examples/wine/modeling.yaml
    :language: yaml
    :linenos:

The default config file usually has most sections disabled:

.. literalinclude:: ../../../examples/default_config.yaml
    :language: yaml
    :linenos:

Top-level entries in the configuration file are mapped to the :py:class:`~.GlobalConfig` object. Please refer to the individual configuration-related classes: all of their attributes map directly to config entries.

.. autopydantic_model:: logml.configuration.global_config.GlobalConfig
    :noindex:
    :show-inheritance: False
    :model-show-json: False
    :model-show-config-summary: False
    :model-show-validator-members: False
    :model-show-validator-summary: False
    :model-show-field-summary: False
    :members: False
    :model-hide-paramlist: True

Most configuration-related classes are located in :py:mod:`logml.configuration`.

Input metadata configuration
------------------------------

One of the most important pieces of information we want to convey to LogML is the structure of the incoming data. Based on it, we define `Analysis Problems` for LogML to consider.

This section is covered by the class :py:class:`~.DatasetMetadataSection`. Using the :py:attr:`~.DatasetMetadataSection.columns_metadata` attribute, we set the proper data type and 'categorical' flag for the columns of the dataset.
By setting :py:attr:`~.DatasetMetadataSection.modeling_specs` we provide global definitions of the target variables and the kind of modeling problem to apply to each. Sample configuration:

.. code-block:: yaml

    dataset_metadata:
      modeling_specs:
        # Configuration for problems relating covariates to Overall Survival.
        OS:
          time_column: time
          event_query: 'cens == 0'
          event_column: cens
        # Model the relation of covariates to the "treatment outcome" value,
        # which is categorical, hence use a classification approach.
        Outcome:
          task: classification
          target: Outcome
          target_metric: rocauc
      key_columns:
        # Key column: this is neither a feature nor a target, just an indicator.
        - subj_id
      # Specify some predefined metadata.
      columns_metadata:
        - name: gender
          data_type: str
          is_categorical: true
        - name: birthdate
          data_type: datetime64[ns]

Modeling
------------------------------

This section is covered by the class :py:class:`~.ModelingSection`, which is essentially a list of :py:class:`~.ModelingSetup` items.

There is a predefined modeling setup, which is turned on by setting :code:`preset` to the enabled state. The modeling preset performs the following:

- Uses the default data preprocessing configuration (see the `Preprocessing` section).
- Generates 5 shuffled datasets.
- Enables the model selection process (applies to all models with the matching objective: classification, regression, or survival).
- Enables feature importance with the 3 best models.

Preprocessing
------------------------------

If the preset is enabled in the dataset preprocessing section, a lightweight configuration section is applied.

.. autopydantic_model:: logml.configuration.modeling.DatasetPreprocessingPresetSection
    :noindex:
    :show-inheritance: False
    :model-show-json: False
    :model-show-config-summary: False
    :model-show-validator-members: False
    :model-show-validator-summary: False
    :model-show-field-summary: False
    :members: False
    :model-hide-paramlist: True

Example:

.. code-block:: yaml

    eda:
      enable: true
      preprocessing_problem_id: ''
      dataset_preprocessing:
        preset:
          features_list:
            - .*
          remove_correlated_features: true
          nans_fraction_threshold: 0.7
          apply_log1p_to_target: false
          drop_datetime_columns: true

Configuration utilities
------------------------------

In addition to the ability to launch pipelines, the `log_ml.py` interface provides several commands that make config maintenance easier, including schema validation and other useful utilities. For config-related utilities, see the :program:`log_ml config` command on the :ref:`Command Line Parameters` page.

Dataset Queries
------------------------------

In some places in the config, such as strata selection or the survival event query, we use a dynamic approach to querying data from the dataset. To do this, we specify a line of text, somewhat similar to SQL, but with Python specifics. For example:

.. code:: yaml

    stratification:
      - strata_id: A_arm
        query: 'arm == "A"'
      - strata_id: BC_arms
        query: 'arm.isin(["B", "C"])'

Here are some basic rules for creating a proper query (there are more for advanced use, but they are out of scope for this guide).

- Use the general form :code:`<column> <operator> <value>`. (It is possible to compare one column to another, but do it only when you clearly understand the data.)
- As this is a complex string, always surround it with quotes in the YAML config, as in the example above.
- Use the feature name without quotes, and string constants in double quotes.
- If the feature name contains whitespace or special characters, use backtick quoting.
- For an equality check, use the double equals sign "==".
- Use simple operators: "<", ">", "==", "<=", ">=".
- The special function "isin" checks that a value is present in a list of values: :code:`arm.isin(["B", "C"])`. Use square brackets to define the list of values.
- Be sure to check the column's value type. If column A holds strings and you write :code:`A == 1`, Python treats 1 as a number, and naturally, the number 1 will never be equal to any of the string values contained in column A.
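The rules above can be tried out with :code:`pandas.DataFrame.query`, whose syntax LogML's queries resemble (this comparison is an assumption for illustration only; the data below is invented):

.. code-block:: python

    import pandas as pd

    # A toy dataset resembling the stratification examples above.
    df = pd.DataFrame({
        "arm": ["A", "B", "C", "A"],
        "cens": ["0", "1", "0", "1"],  # note: strings, not numbers
    })

    # Simple equality check: string constants go in double quotes.
    a_arm = df.query('arm == "A"')        # 2 rows

    # "isin" checks membership in a list of values.
    bc_arms = df.query('arm.isin(["B", "C"])')  # 2 rows

    # Pitfall: comparing a string column to a number matches nothing.
    wrong = df.query('cens == 1')         # 0 rows: 1 is a number, "1" is a string
    right = df.query('cens == "1"')       # 2 rows: correct string comparison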
So such a query will always return no records whatsoever.

Random States
-----------------

There are many places where LogML uses random states:

- Fitting models like RandomForest.
- Cross-validation splits.
- Random features cutoff test.

The rule of thumb is as follows: when a random state is not set (anywhere it is used), it is initialized from the main LogML random generator for the run and fixed in the '_dag' config for the run.
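This pattern, where a single run-level generator hands out fixed seeds to every consumer that did not set its own random state, can be sketched with NumPy (a minimal illustration only; LogML's actual implementation may differ):

.. code-block:: python

    from numpy.random import SeedSequence, default_rng

    # One run-level seed sequence plays the role of the main LogML random
    # generator for the run.
    run_seed = SeedSequence(12345)

    # Each consumer (model fitting, CV splits, random-features cutoff test)
    # without an explicit random state gets a child seed derived from it.
    model_seed, cv_seed, cutoff_seed = run_seed.spawn(3)

    # The derived seeds are deterministic for a fixed run seed, so they can
    # be recorded (as LogML does in the '_dag' config) to reproduce the run.
    cv_rng = default_rng(cv_seed)
    fold_assignment = cv_rng.integers(0, 5, size=10)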