Running - Advanced Topics

DAG - manual step execution

When you need to execute only one or two steps of the DAG, there are several possibilities:

  • Remove the /_dag/steps/{STEP_NAME}.json file, which marks the step as not executed, and restart the whole pipeline. NOTE: all steps that depend on this step (directly or indirectly) will re-execute as well. See the example after this list.

    After a successful run, the whole state of the experiment remains consistent.

  • Run a single step manually. LogML does not track dependencies in this case - you have to make sure that all of its dependencies have been executed beforehand. NOTE that only the step in question is executed - everything downstream of it (including the final report) remains unchanged, so the overall state of the experiment may become inconsistent. Consider this a debugging tool only.

    Command sequence:

    $ python log_ml.py pipeline generate_dag -c config.yaml -n run_name -o ../data/output
    
    DEBUG:DAG config file to ../data/output/run_name/configs/dag.yaml
    DEBUG:DAG schedule file dumped to ../data/output/run_name/configs/dag_schedule.json
    
    $ python log_ml.py pipeline run --step STEP_NAME -n run_name -o ../data/output \
        --dag-config-path ../data/output/run_name/configs/dag.yaml \
        -d dataset.csv
    

    Optionally, you can provide several step names (--step NAME1 --step NAME2 ... --step NAME_N); they are executed in the order in which they are passed.
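
For the first option above, the manual part is only removing the step's state file; everything else is a normal pipeline restart. A minimal example, assuming the example paths used above (../data/output/run_name):

    # mark STEP_NAME as not executed; its direct and indirect dependants
    # will be re-executed on the next pipeline run
    $ rm ../data/output/run_name/_dag/steps/STEP_NAME.json
    # then restart the pipeline with your usual `pipeline run` invocation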

Parallel jobs - internal scheduler

Pass the --n_jobs parameter, which makes LogML execute DAG steps in parallel, each as a separate process. (Each job can still use multithreading.) Note: this does not apply to individual step execution.
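
For example (a sketch - combine --n_jobs with whatever full-pipeline run invocation you normally use; the other flags below are simply the ones from the example above, and the exact combination is an assumption):

    $ python log_ml.py pipeline run --n_jobs 4 \
        -n run_name -o ../data/output \
        --dag-config-path ../data/output/run_name/configs/dag.yaml \
        -d dataset.csv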

External scheduler

It is also possible to use an external scheduler:

  1. Generate the DAG config and schedule files in the “_dag” folder of the experiment:

    $ python log_ml.py pipeline generate_dag -n RUN_NAME -c CONFIG_PATH -o OUTPUT_PATH -d DATASET_PATH
    $ ls $OUTPUT_PATH/$RUN_NAME/_dag
    
    dag.yaml
    schedule.json
    

    The generated schedule file contains all DAG steps and the parameters needed to invoke them:

    {
        "jobs": [
            {
              "unique_id": "modeling_data_transform-Module_1-p1-0",
              "type": "modeling_data_transform",
              "depends_on": [],
              "resources": {
                "cpu": 1,
                "mem": 4000,
                "timeout": 36000
              }
            },
    ...
    }
    
  2. Feed the jobs to the scheduler via its API; each job boils down to an invocation like:

    $ python log_ml.py pipeline run \
        --step $unique_id \
        --job-id $unique_id \
        --job-completion-file $tracker_file \
        --log-file $log_file \
        -n RUN_NAME \
        -c DAG_CONFIG_PATH -o OUTPUT_PATH -d DATASET_PATH
    

    To run a step, you need to know its identifier (passed via --step). Job-specific log files help keep the logs tidy. The job completion file is created when the step completes successfully - this is for schedulers that track dependencies as files.

    The scheduler has to track the dependencies, which are listed in each job's depends_on field, all by itself; see the sketch below.
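
    The following is a minimal sketch of such a wrapper, written as a plain shell script. Treat everything in it as an assumption rather than part of LogML: it uses jq to read schedule.json, invents a ".done" completion-file naming convention (the completion file is simply whatever you pass to --job-completion-file), ignores the resources block (a real scheduler would use cpu/mem/timeout for its own resource requests), and launches each job as a background process that waits for the completion files of its depends_on entries before invoking pipeline run.

    #!/usr/bin/env bash
    # Naive file-based "scheduler" sketch; paths follow the generate_dag example
    # above, and the done/ directory plus ".done" suffix are our own convention.
    set -euo pipefail

    RUN_NAME=run_name
    OUTPUT_PATH=../data/output
    DATASET_PATH=dataset.csv
    DAG_DIR=$OUTPUT_PATH/$RUN_NAME/_dag
    SCHEDULE=$DAG_DIR/schedule.json
    DONE_DIR=$DAG_DIR/done
    LOG_DIR=$DAG_DIR/logs
    mkdir -p "$DONE_DIR" "$LOG_DIR"

    for job_id in $(jq -r '.jobs[].unique_id' "$SCHEDULE"); do
        (
            # Wait until every dependency has produced its completion file.
            # (A failed dependency would make this wait forever - a real
            # scheduler also has to handle failures and timeouts.)
            for dep in $(jq -r --arg id "$job_id" \
                    '.jobs[] | select(.unique_id == $id) | .depends_on[]' "$SCHEDULE"); do
                while [ ! -f "$DONE_DIR/$dep.done" ]; do sleep 10; done
            done

            python log_ml.py pipeline run \
                --step "$job_id" \
                --job-id "$job_id" \
                --job-completion-file "$DONE_DIR/$job_id.done" \
                --log-file "$LOG_DIR/$job_id.log" \
                -n "$RUN_NAME" \
                -c "$DAG_DIR/dag.yaml" -o "$OUTPUT_PATH" -d "$DATASET_PATH"
        ) &
    done
    wait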