Running - Advanced Topics
DAG - manual step execution
When you need to execute only one or two steps of the DAG, there are several options:
Remove the /_dag/steps/{STEP_NAME}.json file, which marks the step as not executed, and restart the whole pipeline. NOTE: this will cause all steps that depend on this step (directly or indirectly) to be re-executed. After a successful run the whole state of the experiment remains consistent. (A sketch of this approach is shown at the end of this section.)
Run a single step manually. LogML does not track dependencies in this case - you must make sure that all of its dependencies have been executed beforehand. NOTE that in this case only the step in question is executed - the steps that depend on it (including the final report) remain unchanged, and the whole state of the experiment may become inconsistent. Consider this a debugging tool only.
Command sequence:
$ python log_ml.py pipeline generate_dag -c config.yaml -n run_name -o ../data/output
DEBUG:DAG config file dumped to ../data/output/run_name/configs/dag.yaml
DEBUG:DAG schedule file dumped to ../data/output/run_name/configs/dag_schedule.json
$ python log_ml.py pipeline run --step STEP_NAME -n run_name -o ../data/output \
    --dag-config-path ../data/output/run_name/configs/dag.yaml \
    -d dataset.csv
Optionally, you can provide several step names (--step NAME1 --step NAME2 ... --step NAME_N), which are executed in the order they are passed; see the sketch below.
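For example, reusing the run_name, output path, and DAG config from the command sequence above (NAME1 and NAME2 are placeholders for step identifiers from your own DAG), a multi-step invocation might look like this:

$ python log_ml.py pipeline run --step NAME1 --step NAME2 -n run_name -o ../data/output \
    --dag-config-path ../data/output/run_name/configs/dag.yaml \
    -d dataset.csv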
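For the first option (re-running via the step state file), a minimal sketch is shown below. It assumes the step state files live under the experiment output folder (../data/output/run_name/_dag/steps/ here); the restart command is only a guess at how the pipeline was originally launched - use whatever command you ran initially.

# Mark the step as not executed (STEP_NAME is a placeholder for a real step name).
$ rm ../data/output/run_name/_dag/steps/STEP_NAME.json
# Restart the whole pipeline; steps depending on STEP_NAME are re-executed.
$ python log_ml.py pipeline run -c config.yaml -n run_name -o ../data/output -d dataset.csv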
Parallel jobs - internal scheduler
Pass the --n_jobs parameter, which makes LogML execute DAG steps in parallel, each as a separate process. (Each job can still use multithreading.) Note: this is not applicable to individual step execution.
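A minimal sketch, assuming --n_jobs is accepted by the regular pipeline run command and reusing the paths from the examples above (the value 4 is arbitrary):

$ python log_ml.py pipeline run -c config.yaml -n run_name -o ../data/output \
    -d dataset.csv --n_jobs 4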
External scheduler
It is also possible to use an external scheduler:
Generate the DAG config and schedule files in the “_dag” folder of the experiment:
$ python log_ml.py pipeline generate_dag -n RUN_NAME -c CONFIG_PATH -o OUTPUT_PATH -d DATASET_PATH
$ ls $OUTPUT_PATH/$RUN_NAME/_dag
dag.yaml  schedule.json
The generated schedule file contains all DAG steps and the parameters needed to invoke them:
{ "jobs": [ { "unique_id": "modeling_data_transform-Module_1-p1-0", "type": "modeling_data_transform", "depends_on": [], "resources": { "cpu": 1, "mem": 4000, "timeout": 36000 } }, ... }
Feed this data to the scheduler via its API. Each job is then invoked as:
python log_ml.py pipeline run \
    --step $unique_id --job-id $unique_id --job-completion-file $tracker_file \
    --log-file $log_file \
    -n RUN_NAME -c DAG_CONFIG_PATH -o OUTPUT_PATH -d DATASET_PATH
To run a step, you need to know its identifier (passed via --step). Job-specific log files help keep the logs tidy. The job completion file is created when the step completes successfully - this is for schedulers that track dependencies via files.
The scheduler itself must track the dependencies listed in the depends_on field.
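As an illustration of this contract, here is a minimal per-job wrapper sketch that such a scheduler could run; it is not part of LogML. It assumes jq is available and that completion files are named <unique_id>.done in a trackers/ directory of your choosing; the wrapper polls for the completion files of the job's dependencies before launching the step.

# Hypothetical per-job wrapper; job_id is supplied by the scheduler
# (the value below is the example unique_id from the schedule file).
job_id="modeling_data_transform-Module_1-p1-0"
schedule="$OUTPUT_PATH/$RUN_NAME/_dag/schedule.json"
mkdir -p trackers logs

# Wait until every dependency listed in depends_on has written its completion file.
for dep in $(jq -r --arg id "$job_id" \
      '.jobs[] | select(.unique_id == $id) | .depends_on[]' "$schedule"); do
  while [ ! -f "trackers/$dep.done" ]; do sleep 10; done
done

# Run the step; LogML creates the completion file on success.
python log_ml.py pipeline run \
    --step "$job_id" --job-id "$job_id" --job-completion-file "trackers/$job_id.done" \
    --log-file "logs/$job_id.log" \
    -n RUN_NAME -c DAG_CONFIG_PATH -o OUTPUT_PATH -d DATASET_PATH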