Running - Advanced Topics

DAG - manual step execution

When you need to execute only one or two steps of the DAG, there are several possibilities:

  • Remove the /_dag/steps/{STEP_NAME}.json file, which marks the step as not executed, and restart the whole pipeline. NOTE: all steps that depend on this step (directly or indirectly) will re-execute as well. See the example after this list.

    After a successful run, the whole state of the experiment remains consistent.

  • Run a single step manually. LogML does not track dependencies in this case - you have to make sure that all of its dependencies have been executed beforehand. NOTE that only the step in question is executed - everything downstream of it (including the final report) remains unchanged, so the overall state of the experiment may become inconsistent. Consider this a debugging tool only.

    Command sequence:

    $ python log_ml.py pipeline generate_dag -c config.yaml -n run_name -o ../data/output
    
    DEBUG:DAG config file to ../data/output/run_name/configs/dag.yaml
    DEBUG:DAG schedule file dumped to ../data/output/run_name/configs/dag_schedule.json
    
    $ python log_ml.py pipeline run --step STEP_NAME -n run_name -o ../data/output \
        --dag-config-path ../data/output/run_name/configs/dag.yaml \
        -d dataset.csv
    

    Optionally, you can provide several step names (--step NAME1 --step NAME2 ... --step NAME_N); they are executed in the order in which they are passed.
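
For the first option above, the manual part is only removing the step's state file; everything else is a normal pipeline restart. A minimal example, assuming the example paths used above (../data/output/run_name):

    # mark STEP_NAME as not executed; its direct and indirect dependants
    # will be re-executed on the next pipeline run
    $ rm ../data/output/run_name/_dag/steps/STEP_NAME.json
    # then restart the pipeline with your usual `pipeline run` invocation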

Parallel jobs - internal scheduler

Pass the --n_jobs parameter, which makes LogML execute DAG steps in parallel, each as a separate process. (Each job can still use multithreading.) Note: this does not apply to individual step execution.
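
For example (a sketch - combine --n_jobs with whatever full-pipeline run invocation you normally use; the other flags below are simply the ones from the example above, and the exact combination is an assumption):

    $ python log_ml.py pipeline run --n_jobs 4 \
        -n run_name -o ../data/output \
        --dag-config-path ../data/output/run_name/configs/dag.yaml \
        -d dataset.csv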

External scheduler

It is also possible to use an external scheduler:

  1. Generate the DAG config and schedule files in the “_dag” folder of the experiment:

    $ python log_ml.py pipeline generate_dag -n RUN_NAME -c CONFIG_PATH -o OUTPUT_PATH -d DATASET_PATH
    $ ls $OUTPUT_PATH/$RUN_NAME/_dag
    
    dag.yaml
    schedule.json
    

    The generated schedule file contains all DAG steps and the parameters needed to invoke them:

    {
        "jobs": [
            {
              "unique_id": "modeling_data_transform-Module_1-p1-0",
              "type": "modeling_data_transform",
              "depends_on": [],
              "resources": {
                "cpu": 1,
                "mem": 4000,
                "timeout": 36000
              }
            },
    ...
    }
    
  2. Feed the jobs to the scheduler via its API; each job boils down to an invocation like:

    $ python log_ml.py pipeline run \
        --step $unique_id \
        --job-id $unique_id \
        --job-completion-file $tracker_file \
        --log-file $log_file \
        -n RUN_NAME \
        -c DAG_CONFIG_PATH -o OUTPUT_PATH -d DATASET_PATH
    

    To run a step, you need to know its identifier (passed via --step). Job-specific log files help keep the logs tidy. The job completion file is created when the step completes successfully - this is for schedulers that track dependencies as files.

    The scheduler has to track the dependencies, which are listed in each job's depends_on field, all by itself; see the sketch below.
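
    The following is a minimal sketch of such a wrapper, written as a plain shell script. Treat everything in it as an assumption rather than part of LogML: it uses jq to read schedule.json, invents a ".done" completion-file naming convention (the completion file is simply whatever you pass to --job-completion-file), ignores the resources block (a real scheduler would use cpu/mem/timeout for its own resource requests), and launches each job as a background process that waits for the completion files of its depends_on entries before invoking pipeline run.

    #!/usr/bin/env bash
    # Naive file-based "scheduler" sketch; paths follow the generate_dag example
    # above, and the done/ directory plus ".done" suffix are our own convention.
    set -euo pipefail

    RUN_NAME=run_name
    OUTPUT_PATH=../data/output
    DATASET_PATH=dataset.csv
    DAG_DIR=$OUTPUT_PATH/$RUN_NAME/_dag
    SCHEDULE=$DAG_DIR/schedule.json
    DONE_DIR=$DAG_DIR/done
    LOG_DIR=$DAG_DIR/logs
    mkdir -p "$DONE_DIR" "$LOG_DIR"

    for job_id in $(jq -r '.jobs[].unique_id' "$SCHEDULE"); do
        (
            # Wait until every dependency has produced its completion file.
            # (A failed dependency would make this wait forever - a real
            # scheduler also has to handle failures and timeouts.)
            for dep in $(jq -r --arg id "$job_id" \
                    '.jobs[] | select(.unique_id == $id) | .depends_on[]' "$SCHEDULE"); do
                while [ ! -f "$DONE_DIR/$dep.done" ]; do sleep 10; done
            done

            python log_ml.py pipeline run \
                --step "$job_id" \
                --job-id "$job_id" \
                --job-completion-file "$DONE_DIR/$job_id.done" \
                --log-file "$LOG_DIR/$job_id.log" \
                -n "$RUN_NAME" \
                -c "$DAG_DIR/dag.yaml" -o "$OUTPUT_PATH" -d "$DATASET_PATH"
        ) &
    done
    wait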