logml.report.controllers.eda

Functions

`get_missing_values_df`(missing_values, _target)	Returns a dataframe with missingness stats using a given EDA artifact.
`get_normalized_data`(data, list_feat)	Takes the numeric data from a dataset and returns the normalized linear data, normalized log transformed data, and a randomly generated normal distribution with mean = 0, std = 1 and length = number of samples in the original dataset.
`identify_correlated`(df, threshold)	A function to identify highly correlated features.
`make_discrete_colorscale`(cmap_name, num_colors)	Generates a colorscale with "discrete" colors, as opposed to the default gradient-based ("continuous") plotly colorscale.
`plot_complete_dataset`(complete_data_set_df, ...)	Produces a plot on top of a given complete dataset.
`plot_cv_mean`(summary)	Plots the coefficient of variation vs. the mean of continuous features.
`plot_dim_reduction`(original_df, ...[, ...])	Generic function for plotting dim reduction that lets the user plot different components and color based on different continuous features.
`plot_feature_completeness`(missing_fraction, ...)	Plots feature completeness: for a number of features - how many samples can we have with NaNs?
`plot_feature_loadings`(numeric_columns, ...)	Takes the feature_weights matrix from logml DimensionalityReduction artifact.
`plot_first_three_components`(...)	Takes any dim_reduct_output from loader and plots the first three components (if they exist).
`plot_general_summary`(general_summary, ...)	Returns a table that includes information on the number of features in the dataset and how they are divided in terms of categorical vs. numeric.
`plot_highest_correlation`(reduced_matrix, ...)	Takes a reduced correlation matrix and returns a df of: Feature | Highest Correlation | Feature's Correlated With.
`plot_quantile_normal`(scaled_linear_data, ...)	Plots a quantile-normal plot that lets the user select which features they want to see.
`plot_scree`(explained_variance)	Takes the explained variance from loader.dim_reduction.explained_variance and plots the components vs their respective explained variance from PCA.
`plot_skew_kurt`(continuous_summary)	Uses the continuous summary from stats_summary.py which returns a dataframe that includes the skewness and kurtosis.
`plot_target_summary_classification`(df, ...)	Plots the table for the target feature for a classification problem.
`plot_target_summary_regression`(df, target)	Plots the distribution of the target feature for a regression problem.
`reorder_matrix`(features, similarity_order, ...)	Maps index of a list of features to the similarity order and reorders the matrix based on the similarity order.

Classes

`CategoricalSummary`(df, categorical_features)	This class is used for plotting summaries related to categorical features.
`EDAController`(cfg, global_params[, setup_id])	Implements data handling and plotting API for EDA results.

class logml.report.controllers.eda.CategoricalSummary(df: pandas.core.frame.DataFrame, categorical_features: List[str])

Bases: object

This class is used for plotting summaries related to categorical features. These include:

Looking at the cardinality of the features and highlighting features with low cardinality (plot_cardinality).

Looking at the features and their individual categories (plot_class_distributions).

A pie chart that lets you select any categorical feature you want to look at (plot_pie_chart).

static get_cardinality(data: pandas.core.frame.DataFrame): Returns a dictionary of features: cardinality.

static get_represented(data: pandas.core.frame.DataFrame): Returns a tuple of two dictionaries keyed by feature: (percent of samples in the lowest represented category, percent of samples in the highest represented category).
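
As a rough illustration of what "lowest and highest represented" could mean, the sketch below counts categories per feature and reports the share of samples in the rarest and the most common category. It uses a plain dict of lists in place of a DataFrame to stay dependency-free; `represented_percentages` is a hypothetical name, not the logml implementation.

```python
from collections import Counter


def represented_percentages(columns):
    """columns: dict mapping feature name -> list of category values (hypothetical input)."""
    lowest, highest = {}, {}
    for feature, values in columns.items():
        counts = Counter(values)
        total = sum(counts.values())
        # Share of samples in the rarest and in the most common category.
        lowest[feature] = 100.0 * min(counts.values()) / total
        highest[feature] = 100.0 * max(counts.values()) / total
    return lowest, highest


columns = {"color": ["red", "red", "red", "blue"], "size": ["S", "M", "S", "M"]}
lowest, highest = represented_percentages(columns)
```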

plot_cardinality(): Plots a table showing cardinality of the features.

get_low_cardinality_features(): Returns a list of categorical features with low (< 20) cardinality.
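
The cardinality dictionary and the low-cardinality filter can be sketched as follows. This is a minimal illustration that assumes cardinality means the number of distinct values per feature; the `< 20` cut-off comes from the description above, while the function names and the dict-of-lists input are hypothetical stand-ins for the DataFrame-based methods.

```python
def cardinality_of(columns):
    """Map each feature to its number of distinct values."""
    return {feature: len(set(values)) for feature, values in columns.items()}


def low_cardinality(columns, limit=20):
    """Keep features whose cardinality is strictly below the limit."""
    return [f for f, n in cardinality_of(columns).items() if n < limit]


columns = {
    "country": [str(i) for i in range(25)],  # 25 distinct values
    "sex": ["f", "m", "f"],                  # 2 distinct values
}
cardinality = cardinality_of(columns)
low = low_cardinality(columns)
```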

plot_class_distributions(): Plots a bar chart with the features on the y axis and, on the x axis, the fraction of samples that belongs to each category.

plot_pie_chart(): Plots a pie chart which allows the user to see how each categorical feature is divided. Has a drop-down menu to select different features to look at.

plot_mca_scree_plot(mca_explained_var): Takes the explained variance from mca components and plots them as a bar chart.

plot_mca_components(mca_components: numpy.array, target: str, title: str): Plots the first two components of the mca.

logml.report.controllers.eda.plot_feature_loadings(numeric_columns: List[str], feature_weights: numpy.array, unused_target: str): Takes the feature_weights matrix from logml DimensionalityReduction artifact. Plots the feature_weights for the first 3 principal components for the top 20 features that have the largest weights for the first Principal Component.
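
The feature selection step described above can be sketched like this. It assumes that rows of the weights matrix are components and columns are features, and that "largest" means largest absolute value; neither is confirmed by the description. `top_loading_features` is a hypothetical helper.

```python
def top_loading_features(feature_names, feature_weights, top_n=20, n_components=3):
    """Rank features by |weight| on the first component (assumed layout: rows = components)."""
    first_pc = feature_weights[0]
    order = sorted(
        range(len(feature_names)),
        key=lambda i: abs(first_pc[i]),
        reverse=True,
    )[:top_n]
    # For each selected feature, collect its weights on the first n_components.
    return [
        (feature_names[i], [row[i] for row in feature_weights[:n_components]])
        for i in order
    ]


names = ["age", "height", "weight"]
weights = [[0.1, -0.9, 0.4], [0.7, 0.2, -0.3]]
ranked = top_loading_features(names, weights, top_n=2)
```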

logml.report.controllers.eda.plot_first_three_components(dim_reduct_output: numpy.array, target: str, target_values: numpy.array, labels: List[str], task: str, title: str): Takes any dim_reduct_output from loader and plots the first three components (if they exist). Used in LDA and PCA for both regression and classification.

logml.report.controllers.eda.plot_scree(explained_variance: numpy.array): Takes the explained variance from loader.dim_reduction.explained_variance and plots the components vs their respective explained variance from PCA. Does this for all the components that cumulatively account for 95% of the variance in the data.
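
The 95% cut-off can be illustrated with a cumulative sum. The sketch below normalizes by the total, so it accepts raw variances or ratios; `components_for_variance` is a hypothetical helper and only shows the selection step, not the plotting.

```python
from itertools import accumulate


def components_for_variance(explained_variance, target=0.95):
    """Return how many leading components reach the target cumulative share."""
    total = sum(explained_variance)
    cumulative = list(accumulate(v / total for v in explained_variance))
    for count, value in enumerate(cumulative, start=1):
        if value >= target:
            return count, cumulative[:count]
    # Target never reached (only possible through rounding): keep everything.
    return len(cumulative), cumulative


count, shares = components_for_variance([6.0, 3.0, 0.7, 0.3])
```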

logml.report.controllers.eda.make_discrete_colorscale(cmap_name: str, num_colors: int) → tuple

Generates a colorscale with “discrete” colors, as opposed to the default gradient-based (“continuous”) plotly colorscale. Note: this applies only to plotly colorscales. For matplotlib, use seaborn.color_palette().

Parameters: cmap_name – One of the valid plotly colorscale names. See plotly.colors.named_colorscales() for the full list.
num_colors – Number of discrete colors.
Returns: (colorscale, list of unique colors)
Return type: tuple
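
A discrete plotly colorscale is commonly expressed by repeating each color at both ends of its interval, so that no gradient appears between neighbours. The sketch below builds that structure from an explicit list of colors; how the real function samples colors from the named colorscale is not stated here, and `discrete_colorscale` is a hypothetical name.

```python
def discrete_colorscale(colors):
    """Build [[position, color], ...] with hard boundaries between colors."""
    n = len(colors)
    scale = []
    for i, color in enumerate(colors):
        # The same color at the start and end of its band gives a flat block.
        scale.append([i / n, color])
        scale.append([(i + 1) / n, color])
    return scale


scale = discrete_colorscale(["red", "green", "blue", "gold"])
```

The resulting list can be passed wherever plotly accepts a `colorscale` argument.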

logml.report.controllers.eda.plot_dim_reduction(original_df: pandas.core.frame.DataFrame, dim_reduct_output: numpy.array, continuous_columns: Optional[List[str]] = None, discrete_columns: Optional[List[str]] = None, labels: Optional[List[str]] = None, title: Optional[str] = None, max_label_len: int = 20, cmap_name: str = 'RdYlBu')

Generic function for plotting dim reduction that lets the user plot different components and color based on different continuous features.

Parameters:

original_df – Original dataframe. Must contain all the continuous and discrete columns specified.

dim_reduct_output – Result of the dimensionality reduction operation (PCA, etc.).

continuous_columns – List of (numeric) continuous columns. These will have a smooth color bar.

discrete_columns – List of categorical (string or number) columns.

labels – Columns of the dim_reduct_output dataframe to use.

title – Chart title.

max_label_len – Limit for original_df column names; used to limit the width of the drop-down list with these names.

cmap_name – Valid name of a plotly colorscale. See plotly.colors.named_colorscales() for the full list.

logml.report.controllers.eda.plot_skew_kurt(continuous_summary: pandas.core.frame.DataFrame): Uses the continuous summary from stats_summary.py which returns a dataframe that includes the skewness and kurtosis. Plots the skewness vs the kurtosis of various features and outputs a heatmap of the features based on the skewness.
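
For readers unfamiliar with the two statistics, the sketch below computes them from central moments. It uses the population (biased) estimators and reports excess kurtosis; which estimator stats_summary.py actually uses is not stated, so treat this as an assumption. `skew_and_kurtosis` is a hypothetical helper.

```python
def skew_and_kurtosis(values):
    """Population skewness and excess kurtosis from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


symmetric_skew, symmetric_kurt = skew_and_kurtosis([1, 2, 3, 4, 5])
right_skew, _ = skew_and_kurtosis([1, 1, 1, 10])
```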

logml.report.controllers.eda.get_normalized_data(data: pandas.core.frame.DataFrame, list_feat: List[str]): Takes the numeric data from a dataset and returns the normalized linear data, normalized log transformed data, and a randomly generated normal distribution with mean = 0, std = 1 and length = number of samples in the original dataset. These will be used to generate the quantile normal plots.
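
The three outputs can be sketched for a single feature as below. Several details are assumptions: normalization is taken to be a z-score, the log transform is taken to be log(1 + x) on non-negative data, and a fixed seed is used for reproducibility. `normalized_views` is a hypothetical helper, not the logml function.

```python
import math
import random
import statistics


def zscore(values):
    """Scale to mean 0 and standard deviation 1 (requires non-constant data)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def normalized_views(values, seed=0):
    linear = zscore(values)
    # Assumption: log1p is used so that zeros are handled; input must be >= 0.
    logged = zscore([math.log1p(v) for v in values])
    rng = random.Random(seed)
    # Reference sample from N(0, 1) with one draw per original sample.
    normal = [rng.gauss(0.0, 1.0) for _ in values]
    return linear, logged, normal


linear, logged, normal = normalized_views([1.0, 10.0, 100.0, 1000.0])
```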

logml.report.controllers.eda.plot_quantile_normal(scaled_linear_data: pandas.core.frame.DataFrame, scaled_log_data: pandas.core.frame.DataFrame, normal: numpy.array, title: str): Plots a quantile-normal plot that lets the user select which features they want to see.
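
The points behind a quantile-normal plot pair the sorted observations with quantiles of a standard normal. The sketch below uses theoretical quantiles from `statistics.NormalDist` rather than the randomly generated normal sample the function receives, which is a deliberate simplification; `quantile_normal_points` is a hypothetical helper.

```python
from statistics import NormalDist


def quantile_normal_points(scaled_values):
    """Pair each sorted observation with the matching standard normal quantile."""
    n = len(scaled_values)
    # Mid-point plotting positions (i + 0.5) / n avoid the infinite 0 and 1 quantiles.
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(scaled_values)))


points = quantile_normal_points([0.3, -1.2, 1.1])
```

Points close to the diagonal suggest the feature is approximately normal after scaling.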

logml.report.controllers.eda.identify_correlated(df, threshold): A function to identify highly correlated features.
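
One plausible reading of "highly correlated" is a pairwise Pearson correlation whose absolute value exceeds the threshold. The sketch below follows that reading; whether the real function uses Pearson, or absolute values, is an assumption, and both helper names are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(columns, threshold):
    """List (feature_a, feature_b, r) for every pair with |r| above the threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            pairs.append((a, b, r))
    return pairs


columns = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8.5], "c": [4, 1, 3, 2]}
pairs = correlated_pairs(columns, threshold=0.8)
```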

logml.report.controllers.eda.plot_highest_correlation(reduced_matrix, correlation_threshold: float): Takes a reduced correlation matrix and returns a df of: Feature | Highest Correlation | Feature’s Correlated With.

logml.report.controllers.eda.plot_general_summary(general_summary, header_list, columns_to_highlight): Returns a table that includes information on the number of features in the dataset, how they are divided in terms of categorical vs. numeric and how many features have over 80%.

logml.report.controllers.eda.plot_target_summary_classification(df, target, represent_threshold): Plots the table for the target feature for a classification problem. Shows the cardinality of the target feature.

logml.report.controllers.eda.plot_target_summary_regression(df: pandas.core.frame.DataFrame, target: str): Plots the distribution of the target feature for a regression problem.

logml.report.controllers.eda.get_missing_values_df(missing_values, _target): Returns a dataframe with missingness stats using a given EDA artifact.
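
The exact statistics in that dataframe are not listed here. As one example of a missingness statistic, the sketch below computes the fraction of missing entries per feature, with `None` standing in for NaN; `missing_fractions` and the dict-of-lists input are hypothetical.

```python
def missing_fractions(columns):
    """Fraction of missing (None) entries per feature."""
    return {
        name: sum(value is None for value in values) / len(values)
        for name, values in columns.items()
    }


fractions = missing_fractions(
    {"age": [31, None, 45, None], "sex": ["f", "m", "f", "m"]}
)
```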

logml.report.controllers.eda.plot_feature_completeness(missing_fraction, typing): Plots feature completeness: for a number of features - how many samples can we have with NaNs?

logml.report.controllers.eda.plot_complete_dataset(complete_data_set_df, typing): Produces a plot on top of a given complete dataset.

logml.report.controllers.eda.plot_cv_mean(summary): Plots the coefficient of variation vs. the mean of continuous features.
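
The coefficient of variation is the standard deviation divided by the mean. The sketch below uses the sample standard deviation; whether logml uses the sample or population form is an assumption. `coefficient_of_variation` is a hypothetical helper.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (undefined for mean 0)."""
    return statistics.stdev(values) / statistics.fmean(values)


cv = coefficient_of_variation([1.0, 2.0, 3.0])
cv_constant = coefficient_of_variation([10.0, 10.0, 10.0])
```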

logml.report.controllers.eda.reorder_matrix(features: List[str], similarity_order: List[str], matrix: numpy.array): Maps index of a list of features to the similarity order and reorders the matrix based on the similarity order.
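
The reordering can be sketched as an index lookup followed by a permutation. The sketch assumes a square matrix whose rows and columns both follow `features`, and that both axes are permuted; `reorder` is a hypothetical helper working on nested lists instead of a numpy array.

```python
def reorder(features, similarity_order, matrix):
    """Permute rows and columns of a square matrix into similarity order."""
    index_of = {name: i for i, name in enumerate(features)}
    order = [index_of[name] for name in similarity_order]
    return [[matrix[r][c] for c in order] for r in order]


features = ["a", "b", "c"]
matrix = [
    [1.0, 0.2, 0.3],
    [0.2, 1.0, 0.5],
    [0.3, 0.5, 1.0],
]
reordered = reorder(features, ["c", "a", "b"], matrix)
```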

class logml.report.controllers.eda.EDAController(cfg: GlobalConfig, global_params: dict, setup_id: str = '')

Bases: object

Implements data handling and plotting API for EDA results.

has_artifact(artifact_name) → bool

Determine if a controller has access to a certain artifact data.

Parameters: artifact_name – One of the eligible artifact names.
Returns: True if artifact exists.

property metadata: Optional[logml.configuration.modeling.ModelingTaskSpec]: In case a modeling problem is provided for EDA - returns its metadata.

property task: Optional[str]: Retrieves a task from the corresponding metadata, if possible.

property target: Optional[str]: Retrieves a target from the corresponding metadata, if possible.

property categorical_features: List[str]: Returns a list of categorical features (excluding target, if set).

property numerical_features: List[str]: Returns a list of numerical features (excluding target, if set).

property categoricals_summary: logml.report.controllers.eda.CategoricalSummary: Returns CategoricalSummary (plotter) object.

dim_reduction_artifact_exists(artifact_name: str) → bool: Checks whether a given dimensionality reduction artifact exists (the corresponding property is set within the EDA artifact).

show_mca_scree(): Shows scree plot for MCA result.

show_mca_components(): Shows MCA result’s components (the first 3).

show_mca_loadings(): Shows how principal components of the MCA result affect features.

show_mca_output(): Shows the MCA result (scatter plot).

show_dataset_head(): Shows the first 5 rows.

show_dataset_tail(): Shows the last 5 rows.

show_categorical_features(): Shows a list of categorical features.

show_continuous_features(): Shows a list of numerical features.

show_target_type(): Show target’s type: numerical/categorical.

get_general_dataset_summary() → Dict: Composes a brief summary with dataset’s statistics.

show_general_summary(): Plots a summary of the dataset’s statistics.

plot_target_distribution(): Shows target’s distribution.

show_pca_scree(): Shows scree plot for PCA result.

show_pca_components(): Shows PCA result’s components (the first 3).

show_pca_loadings(): Shows how principal components of the PCA result affect features.

show_tsne(max_columns=20): Renders t-SNE results.

show_pca_output(): Shows the PCA result (scatter plot).

show_lda_plots(): Shows LDA results (basic visualizations - the first 3 components and scatter).

show_distributions_heatmap(): Shows a heatmap for numerical features distributions.

show_skew_kurt_plot(): Shows a scatter for numerical features: kurtosis vs skewness.

show_qn_plots(): Shows 2 identical QN plots so the user could compare distributions of numerical features.

show_highest_correlation_pairs(): Shows a list of feature pairs that are highly correlated (>0.8).

show_correlation_table_plots(): Shows plots for correlation table: correlation table itself and a dendrogram on top of it.

show_correlation_groups(): Displays available correlation groups in table format.

get_dataset_with_truncated_columns() → pandas.core.frame.DataFrame: Returns a dataset with truncated columns (up to 18 chars kept).

show_missing_data_matrix(): Shows a missing data matrix. Uses only columns with NA values.

show_missingness_per_feature(): Shows missingness stats per feature.

show_missingness_similarity(): Shows missingness similarity - heatmap.

show_feature_completeness_for_all(): Shows completeness for all features.

show_complete_dataset_for_all(): Shows complete dataset for all features.

render_missingness_overview() → None: Renders summary numbers for missing data.

show_missingness_for_categoricals(): Shows completeness and complete dataset for categoricals only.

show_missingness_for_numericals(): Shows completeness and complete dataset for numericals only.

show_descriptive_stats_for_numericals(): Shows a table with descriptive statistics for numericals.

show_coefficient_of_variation_table(): Shows coefficients of variation for numericals (table).

show_coefficient_of_variation_plot(): Shows coefficients of variation vs mean for numericals (plot).