Exploratory Data Analysis
==========================

LogML provides EDA (Exploratory Data Analysis) capabilities that help users quickly assess a dataset and gain initial insights.

Configuration
-------------

A typical configuration file for EDA has the following structure (here and below, the ``diabetes`` dataset is used):

.. literalinclude:: /_static/eda/sample_config.yaml
   :language: yaml
   :linenos:
   :caption: sample_diabetes_config.yaml

What it is all about:

- The 'stratification' section defines the strata of interest; an EDA report is generated for each stratum. Please see the corresponding class :py:class:`logml.configuration.stratification.Strata` for details on how to define a stratum. The configuration file above defines 2 strata based on the 'Age' column.
- The 'eda' section configures the conditions under which EDA artifacts are generated; please see the :py:class:`logml.configuration.eda.EDAArtifactsGenerationSection` class for details. Important parameters to consider:

  - 'params' (:py:class:`logml.configuration.eda.EDAArtifactsGenerationParameters`) - defines key parameters for artifact generation (correlation artifacts are particularly affected).
  - 'dataset_preprocessing' (:py:class:`logml.configuration.data_preprocessing.DatasetPreprocessingSection`) - might be useful when additional transformations are required before doing EDA (apart from stratification-based filtering), for example column filtering.

- The 'report' section defines (via the 'report_structure' subsection) the set of views to include in the resulting report. Please see :py:class:`logml.configuration.baselinekit.BaselineKitStructure` for details on which views are available.

For the configuration file above, LogML performs the following sequence of actions:

- The input dataset is stratified based on the given configuration file.
- EDA artifacts are produced for each stratum.
- The resulting report is generated (it includes EDA views for all strata).

High-level schema of EDA:

.. image:: /_static/eda/eda_process_schema.png

Structure of EDA report (diabetes):

.. image:: /_static/eda/eda_report_structure.png

Produced artifacts
------------------

As mentioned above, LogML decouples the artifact generation and visualization processes. All required EDA artifacts are saved so that they can be reused later (either by LogML or by users, if needed).

For each stratum, EDA artifacts are saved within the ``{run_name}/{stratum id}/eda/artifacts/`` folder.

Example of how and where EDA artifacts are saved:

.. image:: /_static/eda/eda_artifacts_schema.png

Metadata
^^^^^^^^

The :py:class:`logml.eda.artifacts.metadata.DatasetMetadata` artifact provides very basic metadata-like information:

- list of numeric columns within the dataset
- list of categorical columns within the dataset

The artifact is used while producing other EDA artifacts.

Correlation
^^^^^^^^^^^

The :py:class:`logml.eda.artifacts.correlation.CorrelationSummary` artifact contains the following information:

- correlation matrix produced using the given EDA parameters. For visualization purposes, the linkage matrix is kept as well (to order dataset columns by correlation similarity).
- correlation groups information (please see the details here: :ref:`correlation_groups_overview`).

Missingness
^^^^^^^^^^^

The :py:class:`logml.eda.artifacts.missingness.MissingnessSummary` artifact contains the following information:

- missing values summaries (per rows/columns)
- complete datasets for numerical/categorical columns. For a given number of columns N, a 'complete dataset' is a subset of N columns together with the maximal number of rows such that the sub-dataset contains no NaN values.
- matrix of pairwise NaN distances (how similar the missingness patterns are across rows)

Statistics
^^^^^^^^^^

The :py:class:`logml.eda.artifacts.stats_summary.StatisticsSummary` artifact contains the following information for numerical columns:

- basic statistics (mean, std, min, max, 25%/50%/75% percentiles)
- custom statistics (# unique, skewness, kurtosis, corrected coefficient of variation (std/mean))
- distribution fitness statistics (normality / log-normality via the Shapiro-Wilk test)

Dimensionality reduction
^^^^^^^^^^^^^^^^^^^^^^^^

The :py:class:`logml.eda.artifacts.dimensionality_reduction.DimensionalityReduction` artifact contains the following information:

- based on numerical columns

  - `PCA <https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html>`_ output
  - `TSNE <https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html#sklearn.manifold.TSNE>`_ output
  - `UMAP <https://umap-learn.readthedocs.io/en/latest/>`_ output
  - `LDA <https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html#sklearn.discriminant_analysis.LinearDiscriminantAnalysis>`_ output

- based on categorical columns

  - `MCA <https://pypi.org/project/mca/>`_ output

Distributions
^^^^^^^^^^^^^

The :py:class:`logml.eda.artifacts.distributions.DistributionsSummary` artifact contains the following information:

- histograms for numerical features, so that the user can visually assess how distributions differ across features and apply appropriate transformations if needed (log1p, for example)

Report structure
----------------

Let's take a look at the available EDA views and the visualizations they include.

Dataset Overview
^^^^^^^^^^^^^^^^

Introductory section that gives a brief look at the given dataset.

- Dataset head/tail - helps to check whether the dataset's schema is fine:

  .. image:: /_static/eda/dataset_overview_head_tail.png

- Lists of numerical and categorical features - to confirm the user's assumptions, if any, and to explicitly 'define' those lists:

  .. image:: /_static/eda/dataset_overview_features.png

- Basic dataset statistics:

  .. image:: /_static/eda/dataset_overview_stats.png

Missingness
^^^^^^^^^^^

TODO

Continuous Features: Summary Statistics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Section with different statistics for numerical features.

- Interactive table with descriptive statistics:

  .. image:: /_static/eda/statistics_summary_table.png

- Interactive plot to assess the coefficient of variation across features:

  .. image:: /_static/eda/statistics_summary_cv.png

Continuous Features: Distribution and Correlation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Section helps to understand the shapes of feature distributions and how features correlate.

- Feature distributions are shown as a heatmap, which helps to spot categorical features, outliers, etc.:

  .. image:: /_static/eda/distributions_histograms.png

- Interactive plot for assessing feature skewness and kurtosis:

  .. image:: /_static/eda/distributions_sk.png

- Quantile-normal plots for comparing feature distributions and checking whether features are normally distributed:

  .. image:: /_static/eda/distributions_qn.png

- Searchable table with the most correlated pairs of features:

  .. image:: /_static/eda/distributions_corr_pairs.png

- Correlation matrix heatmap:

  .. image:: /_static/eda/distributions_corr_heatmap.png

- Correlation matrix dendrogram (to assess groups of similar features by correlation similarity):

  .. image:: /_static/eda/distributions_corr_dendrogram.png

- Correlation groups overview (please see the details here: :ref:`correlation_groups_overview`)

Continuous Features: Dimensionality Reduction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Section helps to understand whether a small number of descriptive dimensions could reveal any latent patterns of interest.

- Interactive UMAP plot

  .. image:: /_static/eda/dim_reduction_umap.png

- Interactive t-SNE plot

  .. image:: /_static/eda/dim_reduction_tsne.png

- Interactive PCA plot

  .. image:: /_static/eda/dim_reduction_pca.png

- Scree plot of PCA components

  .. image:: /_static/eda/dim_reduction_pca_scre.png

Categorical Features: Summary, Distribution, Dim. Reduc.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

TODO

Data Exploration Tool
^^^^^^^^^^^^^^^^^^^^^

Section provides access to the `FACETS DIVE <https://github.com/pair-code/facets>`_ tool for in-browser data manipulation.

.. image:: /_static/eda/facets_dive.png
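
When exploring a dataset by hand, the 'complete dataset' notion used by the missingness artifact can be made concrete with a small brute-force sketch. It is an illustration of the definition only, not LogML's implementation: the ``complete_dataset`` and ``is_missing`` helpers are hypothetical, the search over column combinations is a simplifying assumption that only suits small inputs, and the toy values and the ``BMI`` and ``Insulin`` column names are made up for the example.

.. code-block:: python

   from itertools import combinations
   from math import isnan


   def is_missing(value):
       """Treat None and float NaN as missing (an assumption of this sketch)."""
       return value is None or (isinstance(value, float) and isnan(value))


   def complete_dataset(rows, columns, n):
       """Brute-force search for the n columns that keep the most complete rows.

       rows    -- list of dicts mapping column name to value
       columns -- candidate column names
       n       -- number of columns to keep
       Returns (chosen_columns, indices_of_rows_without_missing_values).
       """
       best_cols, best_rows = None, []
       for cols in combinations(columns, n):
           # Rows that have no missing value in any of the candidate columns.
           kept = [
               i for i, row in enumerate(rows)
               if not any(is_missing(row[c]) for c in cols)
           ]
           if best_cols is None or len(kept) > len(best_rows):
               best_cols, best_rows = cols, kept
       return best_cols, best_rows


   nan = float("nan")
   # Toy data for illustration only; not taken from the diabetes dataset.
   rows = [
       {"Age": 50, "BMI": 33.6, "Insulin": nan},
       {"Age": 31, "BMI": nan, "Insulin": 94.0},
       {"Age": 32, "BMI": 23.3, "Insulin": nan},
       {"Age": 21, "BMI": 28.1, "Insulin": 168.0},
   ]
   cols, kept = complete_dataset(rows, ["Age", "BMI", "Insulin"], n=2)
   print(cols, kept)

For this toy input, the pair ``Age`` and ``BMI`` retains three of the four rows, more than any other pair of columns, so it forms the complete dataset for N = 2.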