Exploratory Data Analysis
==========================

LogML provides EDA (Exploratory Data Analysis) capabilities that help users quickly assess a dataset and gain
initial insights.

Configuration
-------------

A typical configuration file for EDA has the following structure (the ``diabetes`` dataset is used here and below):

.. literalinclude:: /_static/eda/sample_config.yaml
    :language: yaml
    :linenos:
    :caption: sample_diabetes_config.yaml


The configuration sections are:

- 'stratification' section defines strata of interest. An EDA report is generated for each stratum.
  Please see the corresponding class :py:class:`logml.configuration.stratification.Strata` for details on how to
  define a stratum. The configuration file above defines two strata based on the 'Age' column.

- 'eda' section configures the conditions under which EDA artifacts are generated. Please see the
  :py:class:`logml.configuration.eda.EDAArtifactsGenerationSection` class for details. Important
  parameters to consider:

  - 'params' (:py:class:`logml.configuration.eda.EDAArtifactsGenerationParameters`) - defines key
    parameters for artifact generation (correlation artifacts are particularly affected).

  - 'dataset_preprocessing' (:py:class:`logml.configuration.data_preprocessing.DatasetPreprocessingSection`) - might be
    useful when additional transformations are required before doing EDA (apart from stratification-based
    filtering), for example column filtering.

- 'report' section defines (via the 'report_structure' subsection) the set of views to include
  in the resulting report. Please see :py:class:`logml.configuration.baselinekit.BaselineKitStructure` for details on
  which views are available.

For the configuration file above, LogML performs the following sequence of actions:

- The input dataset is stratified based on the given configuration file.

- EDA artifacts are produced for each stratum.

- The resulting report is generated (it includes EDA views for all strata).
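
The sequence above can be sketched in plain pandas. This is only an illustration under stated assumptions: the
stratum names, the ``Age`` threshold of 50, the toy data, and the helper functions ``stratify`` and
``make_artifacts`` are hypothetical and are not part of the LogML API.

.. code-block:: python

    import pandas as pd


    def stratify(df, strata):
        """Return a mapping of stratum id to the filtered sub-dataset."""
        return {name: df[mask_fn(df)] for name, mask_fn in strata.items()}


    def make_artifacts(df):
        """Hypothetical stand-in for artifact generation: descriptive statistics."""
        return df.describe()


    # Toy data standing in for the diabetes dataset.
    data = pd.DataFrame(
        {"Age": [25, 40, 55, 70], "Glucose": [90.0, 110.0, 130.0, 150.0]}
    )

    # Two hypothetical strata based on the 'Age' column.
    strata = {
        "age_below_50": lambda d: d["Age"] < 50,
        "age_50_and_above": lambda d: d["Age"] >= 50,
    }

    # 1. stratify, 2. produce artifacts per stratum, 3. collect them for a report.
    artifacts = {
        name: make_artifacts(subset) for name, subset in stratify(data, strata).items()
    }
    for name, summary in artifacts.items():
        print(name, summary.loc["mean"].to_dict())

Keeping the per-stratum artifacts in a dictionary mirrors the idea that a single report can then show views for
all strata.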


High-level schema of EDA:

.. image:: /_static/eda/eda_process_schema.png


Structure of EDA report (diabetes):

.. image:: /_static/eda/eda_report_structure.png


Produced artifacts
------------------

As mentioned above, LogML decouples artifact generation from visualization. All required EDA
artifacts are saved so that they can be reused later (either by LogML or by users, if needed).

For each stratum, EDA artifacts are saved within the ``{run_name}/{stratum id}/eda/artifacts/`` folder.
Example of how and where EDA artifacts are saved:

.. image:: /_static/eda/eda_artifacts_schema.png
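
As a small illustration of the folder layout, the path for one stratum could be assembled with ``pathlib``.
The output root, run name, and stratum id below are hypothetical values, and ``eda_artifacts_dir`` is a helper
written for this sketch, not a LogML function.

.. code-block:: python

    from pathlib import Path


    def eda_artifacts_dir(output_root, run_name, stratum_id):
        """Build the per-stratum artifacts folder following the layout above."""
        return Path(output_root) / run_name / stratum_id / "eda" / "artifacts"


    # Hypothetical values, for illustration only.
    folder = eda_artifacts_dir("logml_output", "diabetes_run", "age_below_50")
    print(folder.as_posix())

    # List artifact files if the folder exists (empty list otherwise).
    artifact_files = sorted(p.name for p in folder.glob("*")) if folder.exists() else []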


Metadata
^^^^^^^^

:py:class:`logml.eda.artifacts.metadata.DatasetMetadata` artifact provides very basic metadata-like information:

- list of numeric columns within the dataset
- list of categorical columns within the dataset

The artifact is used while producing other EDA artifacts.
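
A minimal sketch of how such column lists can be derived with pandas is shown below. It assumes that any
non-numeric column counts as categorical, which may differ from the rules LogML actually applies; the toy
columns are made up for the example.

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame(
        {
            "Age": [25, 40, 55],
            "BMI": [22.5, 30.1, 27.8],
            "Outcome": ["neg", "pos", "pos"],
        }
    )

    # Numeric dtypes (ints, floats) versus everything else.
    numeric_columns = df.select_dtypes(include="number").columns.tolist()
    categorical_columns = df.select_dtypes(exclude="number").columns.tolist()

    print(numeric_columns)
    print(categorical_columns)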

Correlation
^^^^^^^^^^^

:py:class:`logml.eda.artifacts.correlation.CorrelationSummary` artifact contains the following information:

- correlation matrix produced using the given EDA parameters. For visualization purposes, the linkage matrix is
  kept as well (to order dataset columns by correlation similarity).

- correlation groups information (please see the details here: :ref:`correlation_groups_overview`).
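
The idea of ordering columns by correlation similarity can be sketched with pandas and SciPy. The choices below
(Pearson correlation, ``1 - |r|`` as the distance, average linkage) are assumptions made for this example and
are not necessarily what LogML uses.

.. code-block:: python

    import numpy as np
    import pandas as pd
    from scipy.cluster.hierarchy import leaves_list, linkage
    from scipy.spatial.distance import squareform

    rng = np.random.default_rng(0)
    base = rng.normal(size=200)
    df = pd.DataFrame(
        {
            "a": base,
            "noise": rng.normal(size=200),
            "a_copy": base + 0.05 * rng.normal(size=200),
        }
    )

    corr = df.corr(method="pearson")

    # Turn correlation into a distance: strongly correlated columns are close.
    dist = 1.0 - corr.abs().to_numpy()
    np.fill_diagonal(dist, 0.0)

    # Hierarchical clustering on the condensed distance matrix.
    link = linkage(squareform(dist, checks=False), method="average")

    # Reorder the correlation matrix so that similar columns sit next to each other.
    order = leaves_list(link)
    ordered_corr = corr.iloc[order, order]
    print(list(ordered_corr.columns))

Storing ``link`` next to ``corr`` means the heatmap and dendrogram can be redrawn later without recomputing the
clustering.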

Missingness
^^^^^^^^^^^

:py:class:`logml.eda.artifacts.missingness.MissingnessSummary` artifact contains the following information:

- missing values summaries (per rows/columns)

- complete datasets for numerical/categorical columns. For a given number of columns N, a 'complete dataset'
  is a subset of N columns with the maximal number of rows such that the sub-dataset contains no NaN values.

- matrix of pairwise NaN distances (how similar the missingness patterns across rows are)
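
The three kinds of information above can be illustrated with a small pandas sketch. Several choices here are
assumptions: the brute-force search in the hypothetical helper ``complete_dataset`` (practical only for a few
columns), the Hamming metric, and the decision to compare rows rather than columns. To compare columns instead,
the NaN mask can be transposed.

.. code-block:: python

    from itertools import combinations

    import numpy as np
    import pandas as pd
    from scipy.spatial.distance import pdist, squareform

    df = pd.DataFrame(
        {
            "x": [1.0, np.nan, 3.0, 4.0],
            "y": [np.nan, np.nan, 1.0, 2.0],
            "z": [5.0, 6.0, 7.0, 8.0],
        }
    )

    # Missing value summaries per column and per row.
    missing_per_column = df.isna().sum()
    missing_per_row = df.isna().sum(axis=1)


    def complete_dataset(data, n_columns):
        """Pick the n_columns subset that keeps the most NaN-free rows."""
        best = max(
            combinations(data.columns, n_columns),
            key=lambda cols: len(data[list(cols)].dropna()),
        )
        return data[list(best)].dropna()


    complete = complete_dataset(df, n_columns=2)

    # Pairwise distances between the NaN patterns of rows
    # (fraction of columns in which two rows differ in missingness).
    nan_mask = df.isna().to_numpy()
    nan_distances = squareform(pdist(nan_mask, metric="hamming"))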

Statistics
^^^^^^^^^^

:py:class:`logml.eda.artifacts.stats_summary.StatisticsSummary` artifact contains the following information for
numerical columns:

- basic statistics (mean, std, min, max, 25%/50%/75% percentiles)
- custom statistics (number of unique values, skewness, kurtosis, corrected coefficient of variation (std/mean))
- distribution fitness statistics (normality / log-normality via Shapiro-Wilk test)
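
The statistics listed above can be approximated with pandas and SciPy as follows. This is a sketch, not the
LogML implementation: the coefficient of variation is computed as plain ``std / mean`` (the exact 'corrected'
variant is not specified here), and log-normality is assumed to be checked by applying the Shapiro-Wilk test to
the log-transformed values.

.. code-block:: python

    import numpy as np
    import pandas as pd
    from scipy import stats

    rng = np.random.default_rng(42)
    df = pd.DataFrame(
        {
            "normal_like": rng.normal(loc=50.0, scale=5.0, size=300),
            "lognormal_like": rng.lognormal(mean=0.0, sigma=1.0, size=300),
        }
    )

    # Basic statistics: count, mean, std, min, 25%/50%/75% percentiles, max.
    summary = df.describe().T

    # Custom statistics.
    summary["n_unique"] = df.nunique()
    summary["skewness"] = df.skew()
    summary["kurtosis"] = df.kurt()
    summary["cv"] = df.std() / df.mean()

    # Shapiro-Wilk p-values on raw values (normality) and on log values
    # (log-normality; requires strictly positive data).
    summary["shapiro_p"] = df.apply(lambda col: stats.shapiro(col)[1])
    summary["shapiro_log_p"] = df.apply(lambda col: stats.shapiro(np.log(col))[1])

    print(summary[["skewness", "kurtosis", "cv", "shapiro_p", "shapiro_log_p"]])

A small p-value suggests that the (log-)normality hypothesis is unlikely to hold for that column.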

Dimensionality reduction
^^^^^^^^^^^^^^^^^^^^^^^^

:py:class:`logml.eda.artifacts.dimensionality_reduction.DimensionalityReduction` artifact contains
the following information:

- based on numerical columns

  - `PCA <https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html>`_ output
  - `TSNE <https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html#sklearn.manifold.TSNE>`_ output
  - `UMAP <https://umap-learn.readthedocs.io/en/latest/>`_ output
  - `LDA <https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html#sklearn.discriminant_analysis.LinearDiscriminantAnalysis>`_ output

- based on categorical columns

  - `MCA <https://pypi.org/project/mca/>`_ output
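
As an illustration of one of the methods above, the following sketch runs PCA from scikit-learn on synthetic
numerical data. Standardizing the columns first and keeping two components are assumptions made for the example;
the other methods (t-SNE, UMAP, LDA, MCA) are not shown.

.. code-block:: python

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    # Make the second column strongly dependent on the first one.
    X[:, 1] = 2.0 * X[:, 0] + 0.1 * rng.normal(size=100)

    # Put all columns on the same scale before projecting.
    X_scaled = StandardScaler().fit_transform(X)

    pca = PCA(n_components=2)
    embedding = pca.fit_transform(X_scaled)  # 2-D coordinates for plotting

    # Share of variance captured by each component (what a scree plot shows).
    explained = pca.explained_variance_ratio_
    print(embedding.shape, explained)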

Distributions
^^^^^^^^^^^^^

:py:class:`logml.eda.artifacts.distributions.DistributionsSummary` artifact contains
the following information:

- histograms for numerical features, so that users can visually assess how distributions differ across features and
  apply appropriate transformations if needed (log1p, for example)
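
A minimal NumPy sketch of the idea: compute a histogram for a skewed feature, then for its ``log1p``-transformed
version. The synthetic data and the choice of 20 bins are assumptions made for the example.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(1)
    values = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # right-skewed feature

    # Histogram of the raw values: most mass falls into the first few bins.
    counts, bin_edges = np.histogram(values, bins=20)

    # Histogram after log1p, which compresses the long right tail.
    log_counts, log_bin_edges = np.histogram(np.log1p(values), bins=20)

    print(counts)
    print(log_counts)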


Report structure
----------------

This section describes the available EDA views and the visualizations they include.

Dataset Overview
^^^^^^^^^^^^^^^^

Introductory section that gives a brief overview of the given dataset.

- Dataset head/tail - helps to check whether the dataset's schema looks as expected:

.. image:: /_static/eda/dataset_overview_head_tail.png


- Lists of numerical and categorical features - to confirm the user's assumptions, if any,
  and to state those lists explicitly:

.. image:: /_static/eda/dataset_overview_features.png


- Basic dataset statistics:

.. image:: /_static/eda/dataset_overview_stats.png


Missingness
^^^^^^^^^^^

TODO


Continuous Features: Summary Statistics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Section with different statistics for numerical features.

- Interactive table with descriptive statistics:

.. image:: /_static/eda/statistics_summary_table.png

- Interactive plot to assess coefficient of variation across features:

.. image:: /_static/eda/statistics_summary_cv.png


Continuous Features: Distribution and Correlation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This section helps to understand the shapes of feature distributions and how features correlate.

- Feature distributions are shown in the form of a heatmap, which helps to spot categorical-like features, outliers, etc.:

.. image:: /_static/eda/distributions_histograms.png

- Interactive plot for assessing feature skewness and kurtosis:

.. image:: /_static/eda/distributions_sk.png

- Quantile-normal plots for comparing feature distributions and checking whether features are normally distributed:

.. image:: /_static/eda/distributions_qn.png

- Searchable table with the most correlated pairs of features:

.. image:: /_static/eda/distributions_corr_pairs.png

- Correlation matrix heatmap:

.. image:: /_static/eda/distributions_corr_heatmap.png

- Correlation matrix dendrogram (to assess groups of similar features - correlation similarity):

.. image:: /_static/eda/distributions_corr_dendrogram.png

- Correlation groups overview (please see the details here: :ref:`correlation_groups_overview`)
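
The quantile-normal comparison listed in this section can be reproduced outside of the report with
``scipy.stats.probplot``. The sketch below uses synthetic data and only computes the plot coordinates; it makes
no claim about how LogML builds its own plots.

.. code-block:: python

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    sample = rng.normal(loc=0.0, scale=1.0, size=500)

    # Theoretical normal quantiles versus ordered sample values, plus a
    # least-squares line fitted through them (r is its correlation coefficient).
    (theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(
        sample, dist="norm"
    )

    print(round(r, 4))

Points lying close to the fitted line (``r`` near 1) indicate that the feature is approximately normally
distributed.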


Continuous Features: Dimensionality Reduction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This section helps to understand whether a small number of descriptive dimensions can reveal any latent patterns of
interest.

- Interactive UMAP plot

.. image:: /_static/eda/dim_reduction_umap.png

- Interactive t-SNE plot

.. image:: /_static/eda/dim_reduction_tsne.png

- Interactive PCA plot

.. image:: /_static/eda/dim_reduction_pca.png

- Scree plot of PCA components

.. image:: /_static/eda/dim_reduction_pca_scre.png

Categorical Features: Summary, Distribution, Dim. Reduc.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

TODO

Data Exploration Tool
^^^^^^^^^^^^^^^^^^^^^

This section provides access to the `FACETS DIVE <https://github.com/pair-code/facets>`_ tool for in-browser data manipulation.

.. image:: /_static/eda/facets_dive.png