.. _advanced_running:

Running - Advanced Topics
===========================

DAG - manual step execution
------------------------------

When you need to execute only one or two steps of the DAG, there are several possibilities:

- Remove the ``/_dag/steps/{STEP_NAME}.json`` file, which marks the step as not executed, and restart the whole
  pipeline. NOTE: all steps that depend on this step (directly or indirectly) will be re-executed as well.
  After a successful run the whole state of the experiment remains consistent.

- Run a single step manually. LogML does not track dependencies in this case - you must make sure that all the
  dependencies have been executed beforehand.
  NOTE that in this case only the step in question is executed - the rest of the DAG remains unchanged (including
  the final report) and the whole state of the experiment may become inconsistent. Consider this a debugging tool
  only. Command sequence:

  .. code-block:: bash

    $ python log_ml.py pipeline generate_dag -c config.yaml -n run_name -o ../data/output
    DEBUG:DAG config file to ../data/output/run_name/configs/dag.yaml
    DEBUG:DAG schedule file dumped to ../data/output/run_name/configs/dag_schedule.json

    $ python log_ml.py pipeline run --step STEP_NAME -n run_name -o ../data/output \
        --dag-config-path ../data/output/run_name/configs/dag.yaml \
        -d dataset.csv

  Optionally, you can provide several step names (:code:`--step NAME1 --step NAME2 ... --step NAME_N`), which are
  executed in the order they are passed.

Parallel jobs - internal scheduler
------------------------------------

Pass the :code:`--n_jobs` parameter, which makes LogML execute DAG steps in parallel, each in a separate process
(each job can still use multithreading).
Note: this is not applicable to individual step execution.

External scheduler
--------------------

It is also possible to use an external scheduler:

1. Generate the DAG config and schedule files in the "_dag" folder of the experiment:

   .. code-block:: bash

     $ python log_ml.py pipeline generate_dag -n RUN_NAME -c CONFIG_PATH -o OUTPUT_PATH -d DATASET_PATH
     $ ls $OUTPUT_PATH/$RUN_NAME/_dag
     dag.yaml  schedule.json

   The generated schedule file contains all DAG steps and the parameters needed to invoke them:

   .. code-block:: javascript

     {
         "jobs": [
             {
                 "unique_id": "modeling_data_transform-Module_1-p1-0",
                 "type": "modeling_data_transform",
                 "depends_on": [],
                 "resources": {
                     "cpu": 1,
                     "mem": 4000,
                     "timeout": 36000
                 }
             },
             ...
         ]
     }

2. Feed the jobs to the scheduler via its API, invoking each step as follows:

   .. code-block:: bash

     python log_ml.py pipeline run \
         --step $unique_id --job-id $unique_id --job-completion-file $tracker_file \
         --log-file $log_file \
         -n RUN_NAME -c DAG_CONFIG_PATH -o OUTPUT_PATH -d DATASET_PATH

   To run a step, you need to know its identifier (passed via ``--step``).
   Job-specific log files help to keep the logs tidy.
   The job completion file is created when the step completes successfully - this is for schedulers that track
   dependencies through files.
   The scheduler itself is responsible for tracking the dependencies listed in the ``depends_on`` field; a minimal
   driver sketch is given below.
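The sketch below runs the jobs from ``schedule.json`` sequentially on the local machine, purely to illustrate how the
schedule file, the ``depends_on`` field and the job completion files fit together. The script itself, the local paths
and the ``.done``/``.log`` file naming are assumptions for illustration only, not part of LogML; a real scheduler would
submit the same per-job command and resolve dependencies with its own mechanisms.

.. code-block:: python

  # Hypothetical local driver for schedule.json (illustration only, not shipped with LogML).
  import json
  import pathlib
  import subprocess

  # Assumed run layout - substitute your own paths.
  OUTPUT_PATH = pathlib.Path("../data/output")
  RUN_NAME = "RUN_NAME"
  DATASET_PATH = "dataset.csv"
  DAG_DIR = OUTPUT_PATH / RUN_NAME / "_dag"

  schedule = json.loads((DAG_DIR / "schedule.json").read_text())
  jobs = {job["unique_id"]: job for job in schedule["jobs"]}
  done = set()

  while len(done) < len(jobs):
      # A job is runnable once every entry in its "depends_on" list has finished.
      ready = [
          job_id for job_id, job in jobs.items()
          if job_id not in done and all(dep in done for dep in job["depends_on"])
      ]
      if not ready:
          raise RuntimeError("No runnable jobs left - check 'depends_on' for cycles")

      for job_id in ready:
          tracker_file = DAG_DIR / f"{job_id}.done"   # assumed naming for the completion file
          subprocess.run(
              [
                  "python", "log_ml.py", "pipeline", "run",
                  "--step", job_id,
                  "--job-id", job_id,
                  "--job-completion-file", str(tracker_file),
                  "--log-file", str(DAG_DIR / f"{job_id}.log"),
                  "-n", RUN_NAME,
                  "-c", str(DAG_DIR / "dag.yaml"),
                  "-o", str(OUTPUT_PATH),
                  "-d", DATASET_PATH,
              ],
              check=True,
          )
          # The completion file is created only on success - use it as the "done" marker.
          if not tracker_file.exists():
              raise RuntimeError(f"Step {job_id} did not produce its completion file")
          done.add(job_id)

When submitting to a real scheduler, the ``resources`` section of each job (``cpu``, ``mem``, ``timeout``) can be
forwarded as the corresponding resource requests for that job.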