Quick Start

Conda Environment

Install miniconda, then create the conda environment:

./scripts/env_create.sh
./scripts/env_update.sh

Check LogML:

conda activate logml

python log_ml.py --help

This generates the following output:

Usage: log_ml.py [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  config            Configuration commands (Validation, schema, etc.)
  info              Print LogML version and environment info.
  models            Models commands.
  pipeline          Pipeline DAG commands.

LogML is now installed and ready to launch.

First Run

Prepare for run

To run LogML you need:

  • Dataset

  • LogML configuration file

The dataset is a CSV file with a single line per sample/patient. It should contain covariates (free variables) and targets (dependent variables). The ultimate goal of LogML is to investigate the connection between covariates and targets; a minimal sketch of such a dataset follows the list of typical targets below.

Covariates are usually a combination of clinical data and gene expressions, but can also be any other kind of data.

Targets are usually:

  • Censored Survival time: Overall survival (OS), Progression-free survival (PFS).

  • Treatment Response: Response rate or kind (complete, partial, etc).
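
For illustration, here is a minimal sketch of such a dataset built with pandas (the column names age, treatment, time and event are hypothetical and not required by LogML):

import pandas as pd

# Hypothetical layout: one row per patient, covariates plus a censored survival target.
df = pd.DataFrame({
    "age": [70, 56, 58],                 # covariate (clinical)
    "treatment": ["no", "yes", "yes"],   # covariate (clinical)
    "time": [1814.0, 2018.0, 712.0],     # target: follow-up time
    "event": [1, 0, 1],                  # target: 1 = event observed, 0 = censored
})

covariates = df[["age", "treatment"]]
targets = df[["time", "event"]]
print(covariates.shape, targets.shape)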

Configuration file

The configuration file specifies the parameters for the analysis to be executed. At its top level we define a set of sections, each of which configures a particular analysis kind, such as Modelling or Survival Analysis.

Next we prepare the data and familiarise ourselves with the sample dataset.

Sample Dataset

The LogML distribution includes a set of example datasets and configs which can be used for playing with LogML and understanding its basics. In this guide we’re using the GBSG2 dataset.

Copy the data and configs from the LogML distribution to a local folder:

cp -rvf ./examples ~/logml_examples
ls ~/logml_examples
head ~/logml_examples/gbsg2/GBSG2.csv
GBSG2 dataset

age   cens  estrec  horTh  menostat  pnodes  progrec  tgrade  time    tsize
70    1     66.0    no     Post      3.0     48.0     II      1814.0  21.0
56.0  1     77.0    yes    Post      7.0     61.0     II      2018.0  12.0
58.0  1     271.0   yes    Post      9.0     52.0     II      712.0   35.0
59.0  1     29.0    yes    Post      4.0     60.0     II      1807.0  17.0
73.0  1     65.0    no     Post      1.0     26.0     II      772.0   35.0
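
As a quick sanity check, you can peek at this structure with pandas (a sketch assuming pandas is available in your environment; it simply treats time and cens as the targets and everything else as covariates):

from pathlib import Path

import pandas as pd

# Load the sample dataset copied above.
df = pd.read_csv(Path("~/logml_examples/gbsg2/GBSG2.csv").expanduser())

# Survival targets: follow-up time and censoring indicator.
targets = df[["time", "cens"]]

# Treat the remaining columns as covariates.
covariates = df.drop(columns=["time", "cens"])

print(df.shape)                        # one row per patient
print(covariates.columns.tolist())     # clinical covariates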

Sample Configuration File

Consider the simplest possible configuration file for exploratory data analysis of our sample dataset:

# ./log_ml.sh pipeline run -p gbsg2 -c examples/gbsg2/eda.yaml -d examples/gbsg2/GBSG2.csv -n gbsg2-eda

version: 1.0.0

eda:
    params:
        correlation_threshold: 0.8

(The commented first line shows an example logml command for launching a run with this config file.)

It is quite short: it asks LogML to produce EDA artifacts and then generate a report.
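
To build intuition for the correlation_threshold parameter, here is a rough sketch of the underlying idea (this is not LogML’s implementation; it just flags pairs of numeric columns whose absolute Pearson correlation exceeds 0.8):

from pathlib import Path

import pandas as pd

df = pd.read_csv(Path("~/logml_examples/gbsg2/GBSG2.csv").expanduser())

# Absolute pairwise correlations of the numeric columns.
corr = df.select_dtypes("number").corr().abs()

# Report column pairs above the threshold.
threshold = 0.8
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        if corr.loc[a, b] > threshold:
            print(f"{a} ~ {b}: {corr.loc[a, b]:.2f}")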

Just in case, let’s validate the config file:

log_ml.sh config validate ~/logml_examples/gbsg2/eda.yaml
OK: ~/logml_examples/gbsg2/eda.yaml is a valid config file.

At this point we know that the file is OK; in the future, it makes sense to validate the config whenever you modify it manually.

Execute LogML

Now let’s run LogML to demonstrate this simple analysis.

log_ml.sh pipeline run --project-id gbsg2 \
    --config-path ~/logml_examples/gbsg2/eda.yaml \
    --dataset-path ~/logml_examples/gbsg2/GBSG2.csv \
    --output-path ~/logml_result  \
    --run-name gbsg2-eda

Let’s review this minimal set of parameters required for LogML to start working:

  • --run-name: gbsg2-eda in our case. In real projects this usually follows a pre-defined schema, like (project-name)(stage)(date-time); see the sketch after this list.

  • --project-id: A label used to distinguish different projects.

  • --config-path: Path to the configuration file.

  • --dataset-path: Path to the dataset CSV file. (We can additionally provide a path to dataset metadata, but that is out of scope for this guide.)

  • --output-path: Path to the root output folder. The specific output for the run will be available at {output path}/{run name}.
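
As an example of the run-name schema mentioned above, here is a tiny helper (purely illustrative; the schema is a project convention, not something LogML enforces):

from datetime import datetime

def make_run_name(project: str, stage: str) -> str:
    """Build a run name following the (project-name)(stage)(date-time) convention."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M")
    return f"{project}-{stage}-{stamp}"

run_name = make_run_name("gbsg2", "eda")
print(run_name)                         # e.g. gbsg2-eda-20240101-1200
print(f"~/logml_result/{run_name}")     # run output lands at {output path}/{run name}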

Warning

It is important to remember that the output folder will contain parts of the original data, as well as intermediate files and reports, so it should be protected with the same access level as the data itself. In short, never put output data in public folders.

After we launch LogML, it starts processing and generates quite a lot of output. When it is finished, there is a final message about report readiness:

===============================================================================
Finished generating HTML for book.
Your book's HTML pages are here:
    ~/logml_result/gbsg2-eda/report/notebooks/_build/html/
You can look at your book by opening this file in a browser:
    ~/logml_result/gbsg2-eda/report/notebooks/_build/html/index.html
===============================================================================

Review Report

Now open the index.html file mentioned above: it is the entry point to the complete LogML report.

(Screenshot: EDA report overview page.)
The report is organized in a top-down hierarchy with the following levels:
  • Stratum (Default in this case, which includes all the data).

  • Analysis (EDA in our case; for modeling there is an instance per modeling problem).

  • View (different aspects of the analysis; most analyses have a single view, while EDA is a more diverse example).

First Survival Analysis

Let us now run something more interesting: basic survival analysis. Consider the configuration file:

# ./log_ml.sh pipeline run -p gbsg2 -c examples/gbsg2/survival_analysis.yaml -d examples/gbsg2/GBSG2.csv -n gbsg2-survival-analysis

version: 1.0.0

dataset_metadata:
    modeling_specs:
        OS:
            time_column:    time
            event_query:    'cens == 1'
            event_column:   cens

survival_analysis:
    problems:
        OS:
            methods:
                - method_id: kaplan_meier
                - method_id: optimal_cut_off
                - method_id: cox

You should notice several things here:

  • We specify dataset metadata, i.e. we now give certain columns a special meaning. In the modeling_specs section we created a specification named OS which declares the time and event columns (see the sketch after this list).

  • Then we declare a problem for Survival Analysis. It is also named OS, which means we want to use the ‘OS’ specification defined above.

  • Again, we ask for the ‘OS’ survival problem to be included in the report.
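
To see what the OS specification encodes, here is a hedged sketch using the lifelines package (lifelines is an assumption here, not part of LogML, and this is not what LogML runs internally; it only illustrates the meaning of time_column, event_column and event_query):

from pathlib import Path

import pandas as pd
from lifelines import KaplanMeierFitter

df = pd.read_csv(Path("~/logml_examples/gbsg2/GBSG2.csv").expanduser())

# The spec above says: durations come from `time`, and a row counts as an
# observed event when `cens == 1`.
durations = df["time"]
events = df.eval("cens == 1")

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=events)
print(kmf.median_survival_time_)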

Now we run it:

log_ml.sh pipeline run --project-id gbsg2 \
    --config-path ~/logml_examples/gbsg2/survival_analysis.yaml \
    --dataset-path ~/logml_examples/gbsg2/GBSG2.csv \
    --output-path ~/logml_result  \
    --run-name gbsg2-sa

And again we review the report, which now contains the survival analysis results for the “OS” problem.

This concludes the Quick Start guide for LogML. We have walked through the environment setup, learned where to find and how to activate the LogML distribution, examined basic LogML configuration files, and run two kinds of analysis on the GBSG2 dataset.

Good luck, you’re now ready to explore LogML for your project.

Check other sections of this documentation for advanced topics.