.. _advanced_running:

Running - Advanced Topics
===========================

DAG - manual step execution
------------------------------

When you need to execute only one or two steps of the DAG, there are several possibilities:

- Remove the ``/_dag/steps/{STEP_NAME}.json`` file, which marks the step as not executed, and restart the whole
  pipeline. NOTE: all steps that depend on this step (directly or indirectly) will be re-executed as well.
  After a successful run the whole state of the experiment remains consistent.

- Run a single step manually. LogML does not track dependencies in this case - you must make sure that all the
  dependencies have been executed beforehand.
  NOTE that in this case only the step in question is executed - the rest of the DAG remains unchanged (including
  the final report) and the whole state of the experiment may become inconsistent. Consider this a debugging tool
  only. Command sequence:

  .. code-block:: bash

    $ python log_ml.py pipeline generate_dag -c config.yaml -n run_name -o ../data/output
    DEBUG:DAG config file to ../data/output/run_name/configs/dag.yaml
    DEBUG:DAG schedule file dumped to ../data/output/run_name/configs/dag_schedule.json

    $ python log_ml.py pipeline run --step STEP_NAME -n run_name -o ../data/output \
        --dag-config-path ../data/output/run_name/configs/dag.yaml \
        -d dataset.csv

  Optionally, you can provide several step names (:code:`--step NAME1 --step NAME2 ... --step NAME_N`), which are
  executed in the order they are passed.

Parallel jobs - internal scheduler
------------------------------------

Pass the :code:`--n_jobs` parameter, which makes LogML execute DAG steps in parallel, each in a separate process
(each job can still use multithreading).
Note: this is not applicable to individual step execution.

External scheduler
--------------------

It is also possible to use an external scheduler:

1. Generate the DAG config and schedule files in the "_dag" folder of the experiment:

   .. code-block:: bash

     $ python log_ml.py pipeline generate_dag -n RUN_NAME -c CONFIG_PATH -o OUTPUT_PATH -d DATASET_PATH
     $ ls $OUTPUT_PATH/$RUN_NAME/_dag
     dag.yaml  schedule.json

   The generated schedule file contains all DAG steps and the parameters needed to invoke them:

   .. code-block:: javascript

     {
         "jobs": [
             {
                 "unique_id": "modeling_data_transform-Module_1-p1-0",
                 "type": "modeling_data_transform",
                 "depends_on": [],
                 "resources": {
                     "cpu": 1,
                     "mem": 4000,
                     "timeout": 36000
                 }
             },
             ...
         ]
     }

2. Feed the jobs to the scheduler via its API, invoking each step as follows:

   .. code-block:: bash

     python log_ml.py pipeline run \
         --step $unique_id --job-id $unique_id --job-completion-file $tracker_file \
         --log-file $log_file \
         -n RUN_NAME -c DAG_CONFIG_PATH -o OUTPUT_PATH -d DATASET_PATH

   To run a step, you need to know its identifier (passed via ``--step``).
   Job-specific log files help to keep the logs tidy.
   The job completion file is created when the step completes successfully - this is for schedulers that track
   dependencies through files.
   The scheduler itself is responsible for tracking the dependencies listed in the ``depends_on`` field; a minimal
   driver sketch is given below.
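The sketch below runs the jobs from ``schedule.json`` sequentially on the local machine, purely to illustrate how the
schedule file, the ``depends_on`` field and the job completion files fit together. The script itself, the local paths
and the ``.done``/``.log`` file naming are assumptions for illustration only, not part of LogML; a real scheduler would
submit the same per-job command and resolve dependencies with its own mechanisms.

.. code-block:: python

  # Hypothetical local driver for schedule.json (illustration only, not shipped with LogML).
  import json
  import pathlib
  import subprocess

  # Assumed run layout - substitute your own paths.
  OUTPUT_PATH = pathlib.Path("../data/output")
  RUN_NAME = "RUN_NAME"
  DATASET_PATH = "dataset.csv"
  DAG_DIR = OUTPUT_PATH / RUN_NAME / "_dag"

  schedule = json.loads((DAG_DIR / "schedule.json").read_text())
  jobs = {job["unique_id"]: job for job in schedule["jobs"]}
  done = set()

  while len(done) < len(jobs):
      # A job is runnable once every entry in its "depends_on" list has finished.
      ready = [
          job_id for job_id, job in jobs.items()
          if job_id not in done and all(dep in done for dep in job["depends_on"])
      ]
      if not ready:
          raise RuntimeError("No runnable jobs left - check 'depends_on' for cycles")

      for job_id in ready:
          tracker_file = DAG_DIR / f"{job_id}.done"   # assumed naming for the completion file
          subprocess.run(
              [
                  "python", "log_ml.py", "pipeline", "run",
                  "--step", job_id,
                  "--job-id", job_id,
                  "--job-completion-file", str(tracker_file),
                  "--log-file", str(DAG_DIR / f"{job_id}.log"),
                  "-n", RUN_NAME,
                  "-c", str(DAG_DIR / "dag.yaml"),
                  "-o", str(OUTPUT_PATH),
                  "-d", DATASET_PATH,
              ],
              check=True,
          )
          # The completion file is created only on success - use it as the "done" marker.
          if not tracker_file.exists():
              raise RuntimeError(f"Step {job_id} did not produce its completion file")
          done.add(job_id)

When submitting to a real scheduler, the ``resources`` section of each job (``cpu``, ``mem``, ``timeout``) can be
forwarded as the corresponding resource requests for that job.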