About LogML

LogML is a modeling and data analysis automation framework.

Why use LogML?

There is a considerable amount of setup, boilerplate code, and analysis in every data-centric project. LogML takes care of most of these boring tasks, so you can focus on the work that is important and adds value.

LogML runs a consistent data analysis pipeline, keeping track of every action and saving all results and intermediate files automatically.

What does LogML do for me?

LogML is designed to:

  • Enforce best practices.

  • Perform a set of common, well-defined, well-tested analyses.

  • Quickly turn around the first analysis results.

  • Facilitate logging: no more writing down results in a notepad; LogML creates log files in a systematic manner.

  • Save models and results: LogML saves all your models, so you can always retrieve the best ones.

How does LogML work?

LogML has a standard “data analysis workflow”, a.k.a. “pipeline”.

The workflow includes several steps, such as data transformation, data exploration, running data analyses, and finally generation of the report. Each step of the workflow can be customized in a YAML configuration file.

Terms and abbreviations

Here is a short list of LogML-specific terms used to describe its configuration and expected outcome:

Analysis

A general term for a procedure performed by LogML to examine some aspect of the dataset. The most general type of analysis is EDA (exploratory data analysis). Other types of analysis are “Feature Importance”, “Survival Analysis”, etc. See LogML Pipeline for more information.

Strata

Subsets of the original input dataset; analysis is applied to each subset independently. For example, your dataset may contain data from two studies, in which case stratification makes sense if you want to analyse each study separately from the other.
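
As a minimal illustration of the idea (this is plain pandas code, not LogML's implementation, and the “study” column is hypothetical), stratification amounts to splitting the data by a grouping column and analysing each subset on its own:

    # Hypothetical example using pandas; LogML handles stratification itself.
    import pandas as pd

    df = pd.DataFrame({
        "study": ["A", "A", "B", "B"],
        "age": [34, 51, 47, 62],
        "outcome": [0, 1, 1, 0],
    })

    # Each stratum is analysed independently of the others.
    for study, stratum in df.groupby("study"):
        print(study, stratum["age"].mean())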

Problem

A.k.a. “modeling problem”: the definition of a Machine Learning or Statistical model, which includes:

  • Target: the modeling target, also known as “y”, “outcome”, or “dependent variable”. Usually the target is the result of a treatment or a survival time.

  • Target metric: the metric used to evaluate ML or statistical model performance. It is important to remember that there are two kinds of metrics, generally defined as “scores” (the larger the better) and “losses” (the smaller the better); see the sketch below.
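
A minimal sketch of the two metric kinds, using scikit-learn metrics purely as stand-ins (this is not LogML-specific code):

    from sklearn.metrics import accuracy_score, mean_squared_error

    y_true = [0, 1, 1, 0]
    y_pred = [0, 1, 0, 0]

    print(accuracy_score(y_true, y_pred))      # "score": the larger the better
    print(mean_squared_error(y_true, y_pred))  # "loss": the smaller the better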

CV

Cross-validation: a method of model evaluation in which the dataset is repeatedly split into training and evaluation subsets, each split producing a model that is trained on the training subset and evaluated on the held-out subset (so-called CV models). By averaging the metrics of the CV models we get an approximation of the resulting model quality.
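
A minimal sketch of the concept with scikit-learn, used here only as an illustration; the dataset and model are arbitrary choices, not LogML defaults:

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)

    # Five splits -> five CV models, each evaluated on its held-out subset.
    scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5)
    print(scores.mean())  # the averaged CV metric approximates model quality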

Feature

Input variable for a model, also known as an “independent variable” or a “covariate”.

Target

Target variable for a model: treatment result, outcome, etc. Numeric targets are usually used in Regression models and categorical targets in Classification models, while a combination of survival time and censoring flag is used in Survival models.

Feature Importance

A context-specific score assigned by an ML model to a feature; the higher, the better. Feature importances can generally be split into two groups: algebraic (a numeric coefficient expresses the linear relationship between a feature and the target) and statistical (importance is calculated from statistical model properties; for example, for Decision Trees importance is the sum of feature-specific information gains). Note that, in general, absolute feature importance values produced by different model instances cannot be compared directly.
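
A minimal sketch of the two flavours described above, using scikit-learn models purely as illustrations (not LogML's model set):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression

    X, y = load_breast_cancer(return_X_y=True)

    linear = LogisticRegression(max_iter=5000).fit(X, y)
    forest = RandomForestClassifier(random_state=0).fit(X, y)

    print(linear.coef_[0])              # algebraic: signed linear coefficients
    print(forest.feature_importances_)  # statistical: impurity-based importances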

Pipeline Workflow

The LogML processing pipeline consists of several steps which together produce the analysis result. Steps have dependencies, and in some cases can be executed in parallel (e.g. evaluation of different models on the same dataset).

The LogML pipeline is launched by the log_ml pipeline command. Here is an overview of the default pipeline steps.

  1. Run EDA and generate its artifacts.

  2. Run Modeling Analysis:

    1. Generate CV datasets for modeling problems.

    2. Run Model Search (train and evaluate models). This process picks several top-performing models (the number of models is configurable).

    3. Extract Feature Importance data and save result artifacts.

  3. Run Survival Analysis.

  4. Run other Analysis Items (see Analysis Step Types for details).

  5. Execute Jupyter notebooks and generate final report.

  6. Prepare report and artifacts for publishing (filter only the required files and create the final zip archive).