logml.report.controllers.eda

Functions

get_missing_values_df(missing_values, _target)

Returns a dataframe with missingness stats using a given EDA artifact.

get_normalized_data(data, list_feat)

Takes the numeric data from a dataset and returns the normalized linear data, normalized log transformed data, and a randomly generated normal distribution with mean = 0, std = 1 and length = number of samples in the original dataset.

identify_correlated(df, threshold)

A function to identify highly correlated features.

make_discrete_colorscale(cmap_name, num_colors)

Generate colorscale with "discrete" colors, as opposed to "continuous" gradient-based default plotly colorscale. Note: this applies only to plotly colorscales. For matplotlib use seaborn.color_palette().

plot_complete_dataset(complete_data_set_df, ...)

Produces a plot on top of a given complete dataset.

plot_cv_mean(summary)

Plots the coefficient of variation vs. the mean of continuous features.

plot_dim_reduction(original_df, ...[, ...])

Generic function for plotting dim reduction that lets the user plot different components and color based on different continuous features.

plot_feature_completeness(missing_fraction, ...)

Plots feature completeness: for a number of features - how many samples can we have with NaNs?

plot_feature_loadings(numeric_columns, ...)

Takes the feature_weights matrix from logml DimensionalityReduction artifact.

plot_first_three_components(...)

Takes any dim_reduct_output from the loader and plots the first three components (if they exist).

plot_general_summary(general_summary, ...)

Returns a table that includes information on the number of features in the dataset and how they are divided in terms of categorical vs. numeric.

plot_highest_correlation(reduced_matrix, ...)

Takes a reduced correlation matrix and returns a df of: Feature | Highest Correlation | Feature's Correlated With.

plot_quantile_normal(scaled_linear_data, ...)

Plots a quantile-normal plot that lets the user select which features they want to see.

plot_scree(explained_variance)

Takes the explained variance from loader.dim_reduction.explained_variance and plots the components vs their respective explained variance from PCA.

plot_skew_kurt(continuous_summary)

Uses the continuous summary from stats_summary.py which returns a dataframe that includes the skewness and kurtosis.

plot_target_summary_classification(df, ...)

Plots the table for the target feature for a classification problem.

plot_target_summary_regression(df, target)

Plots the distribution of the target feature for a regression problem.

reorder_matrix(features, similarity_order, ...)

Maps index of a list of features to the similarity order and reorders the matrix based on the similarity order.

Classes

CategoricalSummary(df, categorical_features)

This class is used for plotting summaries related to categorical features.

EDAController(cfg, global_params[, setup_id])

Implements data handling and plotting API for EDA results.

class logml.report.controllers.eda.CategoricalSummary(df: pandas.core.frame.DataFrame, categorical_features: List[str])

Bases: object

This class is used for plotting summaries related to categorical features. These include:

  • Looking at the cardinality of the features and highlighting features with low cardinality (plot_cardinality).

  • Looking at the features and their individual categories (plot_class_distributions).

  • A pie chart that lets you select any categorical feature you want to look at (plot_pie_chart).

static get_cardinality(data: pandas.core.frame.DataFrame)

Returns a dictionary of features: cardinality.

static get_represented(data: pandas.core.frame.DataFrame)

Returns a tuple of two dictionaries keyed by feature: the percent of samples in the lowest represented category and the percent of samples in the highest represented category.
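As a rough illustration of what "lowest represented" and "highest represented" can mean, the sketch below computes the share of samples in the rarest and the most common category of each column. The function name `represented_shares` is hypothetical, and treating the result as percentages in the 0 to 100 range is an assumption; the actual `get_represented` implementation is not shown in this reference.

```python
import pandas as pd


def represented_shares(data: pd.DataFrame):
    """Return (lowest, highest) dicts: percent of samples in the rarest / most common category."""
    lowest, highest = {}, {}
    for col in data.columns:
        # Share of non-null samples per category, as a percentage.
        shares = data[col].value_counts(normalize=True) * 100
        lowest[col] = float(shares.min())
        highest[col] = float(shares.max())
    return lowest, highest


df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})
lowest, highest = represented_shares(df)
print(lowest, highest)
```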

plot_cardinality()

Plots a table showing cardinality of the features.

get_low_cardinality_features()

Returns a list of categorical features with low (< 20) cardinality.
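The cardinality helpers can be approximated with plain pandas. This is a minimal sketch, not the library's code: the function names are hypothetical, and counting only non-null distinct values is an assumption. The limit of 20 is the one quoted above.

```python
import pandas as pd

LOW_CARDINALITY_LIMIT = 20  # threshold quoted in the reference text


def cardinality(data: pd.DataFrame) -> dict:
    """Map each column name to its number of distinct non-null values."""
    return {col: int(data[col].nunique()) for col in data.columns}


def low_cardinality_features(data: pd.DataFrame, limit: int = LOW_CARDINALITY_LIMIT) -> list:
    """Return the columns whose cardinality is strictly below `limit`."""
    return [col for col, n in cardinality(data).items() if n < limit]


df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "patient_id": ["p1", "p2", "p3", "p4"],
})
print(cardinality(df))
print(low_cardinality_features(df, limit=4))
```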

plot_class_distributions()

Plots a bar chart with features on the y axis and, on the x axis, the fraction of samples that belongs to each category.

plot_pie_chart()

Plot a pie chart which allows the user to see how each categorical feature is divided. Has a drop down menu to select different features to look at.

plot_mca_scree_plot(mca_explained_var)

Takes the explained variance from mca components and plots them as a bar chart.

plot_mca_components(mca_components: numpy.array, target: str, title: str)

Plots the first two components of the mca.

logml.report.controllers.eda.plot_feature_loadings(numeric_columns: List[str], feature_weights: numpy.array, unused_target: str)

Takes the feature_weights matrix from logml DimensionalityReduction artifact. Plots the feature_weights for the first 3 principal components for the top 20 features that have the largest weights for the first Principal Component.

logml.report.controllers.eda.plot_first_three_components(dim_reduct_output: numpy.array, target: str, target_values: numpy.array, labels: List[str], task: str, title: str)

Takes any dim_reduct_output from the loader and plots the first three components (if they exist). Used in LDA and PCA for both regression and classification.

logml.report.controllers.eda.plot_scree(explained_variance: numpy.array)

Takes the explained variance from loader.dim_reduction.explained_variance and plots the components vs their respective explained variance from PCA. Does this for all the components that cumulatively account for 95% of the variance in the data.
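The 95% cutoff can be pictured as a cumulative sum over the per-component variance. The sketch below is only an assumption about how such a cutoff might be computed: `components_for_variance` is a hypothetical name, and the input is normalized so that it works for either raw variances or ratios.

```python
import numpy as np


def components_for_variance(explained_variance, cutoff=0.95):
    """Return how many leading components are needed to reach `cutoff` cumulative variance."""
    values = np.asarray(explained_variance, dtype=float)
    # Normalize so the function accepts raw variances as well as ratios.
    ratios = values / values.sum()
    cumulative = np.cumsum(ratios)
    # Indices of components at which the cumulative share has reached the cutoff.
    reached = np.nonzero(cumulative >= cutoff)[0]
    return int(reached[0]) + 1 if reached.size else len(ratios)


n_components = components_for_variance([0.5, 0.3, 0.17, 0.03])
print(n_components)
```

With these example values the first two components cover 80% and the first three cover 97%, so three components would be plotted.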

logml.report.controllers.eda.make_discrete_colorscale(cmap_name: str, num_colors: int) -> tuple

Generate colorscale with “discrete” colors, as opposed to “continuous” gradient-based default plotly colorscale. Note: this applies only to plotly colorscales. For matplotlib use seaborn.color_palette().

Parameters

cmap_name – One of the legitimate plotly colorscale names. See plotly.colors.named_colorscales() for the full list.

num_colors – Number of discrete colors in the resulting colorscale.

Returns

(colorscale, list of unique colors)

Return type

tuple
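To show what a "discrete" colorscale looks like structurally, here is a small sketch that does not call plotly at all. It relies only on the plotly convention that a colorscale is a list of [position, color] pairs with positions between 0 and 1. The helper name `discrete_colorscale` is hypothetical and, unlike the documented function, it takes an explicit list of colors rather than a colorscale name; how the real function samples colors from a named colorscale is not described here.

```python
def discrete_colorscale(colors):
    """Build a stepwise plotly-style colorscale from a list of color strings."""
    n = len(colors)
    if n == 0:
        raise ValueError("at least one color is required")
    scale = []
    for i, color in enumerate(colors):
        # Each color occupies the band [i/n, (i+1)/n]. Repeating the same
        # color at both ends of the band leaves no gradient inside it.
        scale.append([i / n, color])
        scale.append([(i + 1) / n, color])
    return scale


scale = discrete_colorscale(["#d73027", "#ffffbf", "#4575b4", "#000000"])
for position, color in scale:
    print(position, color)
```

Adjacent bands share a boundary position, so the color changes abruptly there instead of blending.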

logml.report.controllers.eda.plot_dim_reduction(original_df: pandas.core.frame.DataFrame, dim_reduct_output: numpy.array, continuous_columns: Optional[List[str]] = None, discrete_columns: Optional[List[str]] = None, labels: Optional[List[str]] = None, title: Optional[str] = None, max_label_len: int = 20, cmap_name: str = 'RdYlBu')

Generic function for plotting dim reduction that lets the user plot different components and color based on different continuous features.

Parameters

original_df – Original dataframe. Must contain all the continuous and discrete columns specified.

dim_reduct_output – Result of the dim reduction operation (PCA, etc.).

continuous_columns – List of (numeric) continuous columns. These get a smooth color bar.

discrete_columns – List of categorical (string or number) columns.

labels – Columns of the dim_reduct_output dataframe to use.

title – Chart title.

max_label_len – Limit for original df column names; used to limit the width of the drop-down list with these names.

cmap_name – Valid name of a plotly colorscale. See plotly.colors.named_colorscales() for the full list.

logml.report.controllers.eda.plot_skew_kurt(continuous_summary: pandas.core.frame.DataFrame)

Uses the continuous summary from stats_summary.py which returns a dataframe that includes the skewness and kurtosis. Plots the skewness vs the kurtosis of various features and outputs a heatmap of the features based on the skewness.
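For readers who want to reproduce the two statistics outside the report, pandas exposes them directly. This sketch assumes nothing about stats_summary.py; the function name `skew_kurt_table` and the column names are hypothetical. Note that pandas reports excess kurtosis (a normal distribution scores 0).

```python
import pandas as pd


def skew_kurt_table(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per numeric column with its skewness and excess kurtosis."""
    return pd.DataFrame({"skewness": df.skew(), "kurtosis": df.kurt()})


df = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_tailed": [1.0, 1.0, 1.0, 2.0, 10.0],
})
table = skew_kurt_table(df)
print(table)
```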

logml.report.controllers.eda.get_normalized_data(data: pandas.core.frame.DataFrame, list_feat: List[str])

Takes the numeric data from a dataset and returns the normalized linear data, normalized log transformed data, and a randomly generated normal distribution with mean = 0, std = 1 and length = number of samples in the original dataset. These will be used to generate the quantile normal plots.
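The three outputs can be sketched as follows. Several choices here are assumptions, since the reference does not spell them out: "normalized" is taken to mean z-scoring, the log transform is `log1p` (which requires values greater than -1), and the name `normalized_views` and the `seed` parameter are hypothetical.

```python
import numpy as np
import pandas as pd


def normalized_views(data: pd.DataFrame, features, seed=0):
    """Return (z-scored data, z-scored log data, standard normal sample)."""
    numeric = data[features].astype(float)
    # Z-score each column: zero mean, unit (population) standard deviation.
    linear = (numeric - numeric.mean()) / numeric.std(ddof=0)
    # Log-transform first, then z-score the transformed values.
    logged = np.log1p(numeric)
    log_scaled = (logged - logged.mean()) / logged.std(ddof=0)
    # Reference sample from N(0, 1) with one draw per row of the dataset.
    rng = np.random.default_rng(seed)
    normal = rng.normal(loc=0.0, scale=1.0, size=len(numeric))
    return linear, log_scaled, normal


df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [10, 100, 1000, 10000, 100000],
})
linear, log_scaled, normal = normalized_views(df, ["a", "b"])
print(linear.round(3))
```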

logml.report.controllers.eda.plot_quantile_normal(scaled_linear_data: pandas.core.frame.DataFrame, scaled_log_data: pandas.core.frame.DataFrame, normal: numpy.array, title: str)

Plots a quantile-normal plot that lets the user select which features they want to see.

logml.report.controllers.eda.identify_correlated(df, threshold)

A function to identify highly correlated features.
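One common way to find highly correlated features is to scan the upper triangle of the absolute correlation matrix. The sketch below illustrates that approach under stated assumptions: it uses Pearson correlation via `DataFrame.corr`, compares absolute values against the threshold, and returns a list of tuples. The return format of the real `identify_correlated` is not documented here, and `correlated_pairs` is a hypothetical name.

```python
import pandas as pd


def correlated_pairs(df: pd.DataFrame, threshold: float):
    """Return (feature_a, feature_b, |correlation|) for pairs above `threshold`."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        # Only look at columns after `a`, so each pair is reported once
        # and the diagonal (self-correlation) is skipped.
        for b in cols[i + 1:]:
            value = float(corr.loc[a, b])
            if value > threshold:
                pairs.append((a, b, value))
    return pairs


df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 1, 4, 2, 3],
})
pairs = correlated_pairs(df, threshold=0.8)
print(pairs)
```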

logml.report.controllers.eda.plot_highest_correlation(reduced_matrix, correlation_threshold: float)

Takes a reduced correlation matrix and returns a df of: Feature | Highest Correlation | Feature’s Correlated With.

logml.report.controllers.eda.plot_general_summary(general_summary, header_list, columns_to_highlight)

Returns a table that includes information on the number of features in the dataset, how they are divided in terms of categorical vs. numeric, and how many features have over 80%.

logml.report.controllers.eda.plot_target_summary_classification(df, target, represent_threshold)

Plots the table for the target feature for a classification problem. Shows the cardinality of the target feature.

logml.report.controllers.eda.plot_target_summary_regression(df: pandas.core.frame.DataFrame, target: str)

Plots the distribution of the target feature for a regression problem.

logml.report.controllers.eda.get_missing_values_df(missing_values, _target)

Returns a dataframe with missingness stats using a given EDA artifact.
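The structure of the `missing_values` artifact is not described in this reference, so the following sketch computes comparable per-feature statistics from a raw DataFrame instead. The function name `missingness_table` and the column names are hypothetical.

```python
import numpy as np
import pandas as pd


def missingness_table(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and fractions, most incomplete first."""
    counts = df.isna().sum()
    table = pd.DataFrame({
        "missing_count": counts,
        "missing_fraction": counts / len(df),
    })
    return table.sort_values("missing_fraction", ascending=False)


df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1, 2, 3, 4],
    "c": ["x", None, "y", "z"],
})
table = missingness_table(df)
print(table)
```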

logml.report.controllers.eda.plot_feature_completeness(missing_fraction, typing)

Plots feature completeness: for a number of features - how many samples can we have with NaNs?

logml.report.controllers.eda.plot_complete_dataset(complete_data_set_df, typing)

Produces a plot on top of a given complete dataset.

logml.report.controllers.eda.plot_cv_mean(summary)

Plots the coefficient of variation vs. the mean of continuous features.
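The coefficient of variation is the standard deviation divided by the mean. A minimal sketch of the table behind such a plot is given below; the name `cv_table` is hypothetical, and using the sample standard deviation (pandas default, `ddof=1`) is an assumption. The ratio is undefined for features whose mean is zero.

```python
import pandas as pd


def cv_table(df: pd.DataFrame) -> pd.DataFrame:
    """Return the mean and coefficient of variation (std / mean) per column."""
    means = df.mean()
    stds = df.std()  # sample standard deviation (ddof=1)
    return pd.DataFrame({"mean": means, "cv": stds / means})


df = pd.DataFrame({
    "a": [2.0, 4.0, 6.0],
    "b": [10.0, 10.0, 10.0],
})
summary = cv_table(df)
print(summary)
```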

logml.report.controllers.eda.reorder_matrix(features: List[str], similarity_order: List[str], matrix: numpy.array)

Maps index of a list of features to the similarity order and reorders the matrix based on the similarity order.
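The reordering step can be illustrated with NumPy indexing. This sketch assumes a square matrix whose rows and columns are both indexed by `features`, and that both axes are reordered; whether the real function does exactly that is not stated. The name `reorder_by_similarity` is hypothetical.

```python
import numpy as np


def reorder_by_similarity(features, similarity_order, matrix):
    """Reorder rows and columns of `matrix` to follow `similarity_order`."""
    # Position of each feature in the original matrix.
    position = {name: i for i, name in enumerate(features)}
    idx = [position[name] for name in similarity_order]
    matrix = np.asarray(matrix)
    # np.ix_ builds an index that selects the same order on both axes.
    return matrix[np.ix_(idx, idx)]


features = ["a", "b", "c"]
matrix = np.array([[0, 1, 2],
                   [3, 4, 5],
                   [6, 7, 8]])
reordered = reorder_by_similarity(features, ["c", "a", "b"], matrix)
print(reordered)
```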

class logml.report.controllers.eda.EDAController(cfg: GlobalConfig, global_params: dict, setup_id: str = '')

Bases: object

Implements data handling and plotting API for EDA results.

has_artifact(artifact_name) -> bool

Determine if a controller has access to a certain artifact data.

Parameters

artifact_name – One of the eligible artifact names.

Returns

True if artifact exists.

property metadata: Optional[logml.configuration.modeling.ModelingTaskSpec]

In case a modeling problem is provided for EDA - returns its metadata.

property task: Optional[str]

Retrieves a task from the corresponding metadata, if possible.

property target: Optional[str]

Retrieves a target from the corresponding metadata, if possible.

property categorical_features: List[str]

Returns a list of categorical features (excluding target, if set).

property numerical_features: List[str]

Returns a list of numerical features (excluding target, if set).

property categoricals_summary: logml.report.controllers.eda.CategoricalSummary

Returns CategoricalSummary (plotter) object.

dim_reduction_artifact_exists(artifact_name: str) -> bool

Checks whether a given dimensionality reduction artifact exists (the corresponding property is set within the EDA artifact).

show_mca_scree()

Shows scree plot for MCA result.

show_mca_components()

Shows MCA result’s components (the first 3).

show_mca_loadings()

Shows how principal components of the MCA result affect features.

show_mca_output()

Shows the MCA result (scatter plot).

show_dataset_head()

Shows the first 5 rows.

show_dataset_tail()

Shows the last 5 rows.

show_categorical_features()

Shows a list of categorical features.

show_continuous_features()

Shows a list of numerical features.

show_target_type()

Shows target’s type: numerical/categorical.

get_general_dataset_summary() -> Dict

Composes a brief summary with dataset’s statistics.

show_general_summary()

Plots a summary of the dataset’s statistics.

plot_target_distribution()

Shows target’s distribution.

show_pca_scree()

Shows scree plot for PCA result.

show_pca_components()

Shows PCA result’s components (the first 3).

show_pca_loadings()

Shows how principal components of the PCA result affect features.

show_tsne(max_columns=20)

Renders t-SNE results.

show_pca_output()

Shows the PCA result (scatter plot).

show_lda_plots()

Shows LDA results (basic visualizations - the first 3 components and scatter).

show_distributions_heatmap()

Shows a heatmap for numerical features distributions.

show_skew_kurt_plot()

Shows a scatter for numerical features: kurtosis vs skewness.

show_qn_plots()

Shows 2 identical QN plots so the user could compare distributions of numerical features.

show_highest_correlation_pairs()

Shows a list of feature pairs that are highly correlated (>0.8).

show_correlation_table_plots()

Shows plots for correlation table: correlation table itself and a dendrogram on top of it.

show_correlation_groups()

Displays available correlation groups in table format.

get_dataset_with_truncated_columns() -> pandas.core.frame.DataFrame

Returns a dataset with truncated columns (up to 18 chars kept).
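Column truncation of this kind can be expressed with `DataFrame.rename`. The sketch below is an assumption about the behavior: it keeps the first 18 characters and adds no ellipsis marker, and the name `truncate_columns` is hypothetical. Two long names that share the same first 18 characters would collide after truncation.

```python
import pandas as pd

MAX_COLUMN_CHARS = 18  # limit quoted in the reference text


def truncate_columns(df: pd.DataFrame, limit: int = MAX_COLUMN_CHARS) -> pd.DataFrame:
    """Return a copy of `df` whose column names are cut to at most `limit` characters."""
    return df.rename(columns=lambda name: str(name)[:limit])


df = pd.DataFrame({"a_very_long_feature_name_indeed": [1, 2], "age": [30, 40]})
truncated = truncate_columns(df)
print(list(truncated.columns))
```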

show_missing_data_matrix()

Shows a missing data matrix. Uses only columns with NA values.

show_missingness_per_feature()

Shows missingness stats per feature.

show_missingness_similarity()

Shows missingness similarity - heatmap.

show_feature_completeness_for_all()

Shows completeness for all features.

show_complete_dataset_for_all()

Shows complete dataset for all features.

render_missingness_overview() -> None

Renders summary numbers of missing data.

show_missingness_for_categoricals()

Shows completeness and complete dataset for categoricals only.

show_missingness_for_numericals()

Shows completeness and complete dataset for numericals only.

show_descriptive_stats_for_numericals()

Shows a table with descriptive statistics for numericals.

show_coefficient_of_variation_table()

Shows coefficients of variation for numericals (table).

show_coefficient_of_variation_plot()

Shows coefficients of variation vs mean for numericals (plot).