logml.report.controllers.eda
Functions
|
Returns a dataframe with missingness stats using a given EDA artifact. |
|
Takes the numeric data from a dataset and returns the normalized linear data, normalized log transformed data, and a randomly generated normal distribution with mean = 0, std = 1 and length = number of samples in the original dataset. |
|
A function to identify highly correlated features. |
|
Generate colorscale with "discrete" colors, as opposed to "continuous" gradient-based default plotly colorscale. Note: this applies only to plotly colorscales. For matplotlib use seaborn.color_palette(). |
|
Produces a plot on top of a given complete dataset. |
|
Plots the coefficient of variation vs. the mean of continuous features. |
|
Generic function for plotting dim reduction that lets the user plot different components and color based on different continuous features. |
|
Plots feature completeness: for a number of features - how many samples can we have with NaNs? |
|
Takes the feature_weights matrix from logml DimensionalityReduction artifact. |
|
Should take any dim_reduct_output from the loader and plot the first three components (if they exist). |
|
Returns a table that includes information on the number of features in the dataset, how they are divided in terms of categorical vs. numeric and how many features have over 80%. |
|
Takes a reduced correlation matrix and returns a df of: Feature | Highest Correlation | Feature's Correlated With. |
|
Plots a quantile-normal plot that lets the user select which features they want to see. |
|
Takes the explained variance from loader.dim_reduction.explained_variance and plots the components vs their respective explained variance from PCA. |
|
Uses the continuous summary from stats_summary.py which returns a dataframe that includes the skewness and kurtosis. |
|
Plots the table for the target feature for a classification problem. |
|
Plots the distribution of the target feature for a regression problem. |
|
Maps index of a list of features to the similarity order and reorders the matrix based on the similarity order. |
Classes
|
This class is used for plotting summaries related to categorical features. |
|
Implements data handling and plotting API for EDA results. |
- class logml.report.controllers.eda.CategoricalSummary(df: pandas.core.frame.DataFrame, categorical_features: List[str])
Bases:
object
This class is used for plotting summaries related to categorical features. These include:
- Looking at the cardinality of the features and highlighting features with low cardinality (plot_cardinality).
- Looking at the features and their individual categories (plot_class_distributions).
- A pie chart that lets you select any categorical feature you want to look at (plot_pie_chart).
- static get_cardinality(data: pandas.core.frame.DataFrame)
Returns a dictionary of features: cardinality.
- static get_represented(data: pandas.core.frame.DataFrame)
Returns a tuple of two dictionaries keyed by feature: (percent of samples in the lowest represented category, percent of samples in the highest represented category).
- plot_cardinality()
Plots a table showing cardinality of the features.
- get_low_cardinality_features()
Returns a list of categorical features with low (< 20) cardinality.
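As a rough illustration of what "cardinality" means in the methods above, the following sketch counts distinct values per column with pandas and filters on the < 20 threshold quoted in the docs. The helper names and the sample data are hypothetical; this is not the logml implementation.

```python
import pandas as pd

LOW_CARDINALITY_LIMIT = 20  # threshold quoted in the docs above


def cardinality(data: pd.DataFrame) -> dict:
    """Map each column name to its number of distinct non-null values."""
    return {column: int(count) for column, count in data.nunique().items()}


def low_cardinality(data: pd.DataFrame, limit: int = LOW_CARDINALITY_LIMIT) -> list:
    """Return the columns whose cardinality is strictly below the limit."""
    return [name for name, count in cardinality(data).items() if count < limit]


# Hypothetical sample data: one repeated category and one unique identifier.
df = pd.DataFrame({"color": ["red", "blue", "red"], "patient_id": ["a", "b", "c"]})
print(cardinality(df))
print(low_cardinality(df, limit=3))
```

A lower limit is passed in the usage line only so that the tiny sample frame shows both outcomes.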
- plot_class_distributions()
Plots a bar chart with features on the y axis; the x axis shows the fraction of samples that belongs to each category.
- plot_pie_chart()
Plot a pie chart which allows the user to see how each categorical feature is divided. Has a drop down menu to select different features to look at.
- plot_mca_scree_plot(mca_explained_var)
Takes the explained variance from mca components and plots them as a bar chart.
- plot_mca_components(mca_components: numpy.array, target: str, title: str)
Plots the first two components of the mca.
- logml.report.controllers.eda.plot_feature_loadings(numeric_columns: List[str], feature_weights: numpy.array, unused_target: str)
Takes the feature_weights matrix from logml DimensionalityReduction artifact. Plots the feature_weights for the first 3 principal components for the top 20 features that have the largest weights for the first Principal Component.
- logml.report.controllers.eda.plot_first_three_components(dim_reduct_output: numpy.array, target: str, target_values: numpy.array, labels: List[str], task: str, title: str)
Should take any dim_reduct_output from the loader and plot the first three components (if they exist). Used in LDA and PCA for both regression and classification.
- logml.report.controllers.eda.plot_scree(explained_variance: numpy.array)
Takes the explained variance from loader.dim_reduction.explained_variance and plots the components vs their respective explained variance from PCA. Does this for all the components that cumulatively account for 95% of the variance in the data.
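The 95% cut-off described for plot_scree can be sketched with a cumulative sum. The function name is hypothetical, and the sketch assumes the input holds explained variance ratios that sum to 1, which the docs do not state explicitly.

```python
import numpy as np


def components_for_variance(explained_variance_ratio, threshold=0.95):
    """Return how many leading components are needed to reach the threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # searchsorted returns the first index where cumulative >= threshold.
    count = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against a threshold that is never reached.
    return min(count, len(cumulative))


# Hypothetical ratios; cumulative sums are 0.5, 0.75, 0.875, 0.9375, 0.96875, 1.0.
ratios = np.array([0.5, 0.25, 0.125, 0.0625, 0.03125, 0.03125])
n_components = components_for_variance(ratios)
print(n_components)
```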
- logml.report.controllers.eda.make_discrete_colorscale(cmap_name: str, num_colors: int) tuple
Generate colorscale with “discrete” colors, as opposed to “continuous” gradient-based default plotly colorscale. Note: this applies only to plotly colorscales. For matplotlib use seaborn.color_palette().
- Parameters
cmap_name – One of the legitimate plotly colorscale names. See plotly.colors.named_colorscales() for the full list.
num_colors –
- Returns
(colorscale, list of unique colors)
- Return type
tuple
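A minimal sketch of how a "discrete" plotly-style colorscale can be built: each color is listed twice, at the start and end of its slice of the [0, 1] range, so no gradient is drawn within a slice. The function below is hypothetical and takes an explicit color list instead of a colorscale name; how make_discrete_colorscale samples its colors is not shown in this reference.

```python
def discrete_colorscale(colors):
    """Build a plotly-style colorscale with hard steps between colors.

    Every color occupies an equal slice of [0, 1] and appears at both ends
    of that slice, which removes the gradient inside the slice.
    """
    n = len(colors)
    scale = []
    for i, color in enumerate(colors):
        scale.append([i / n, color])
        scale.append([(i + 1) / n, color])
    return scale


# Hypothetical three-color palette.
palette = ["rgb(215,48,39)", "rgb(255,255,191)", "rgb(69,117,180)"]
colorscale = discrete_colorscale(palette)
for stop in colorscale:
    print(stop)
```

The resulting list of [position, color] pairs is in the shape plotly accepts for a `colorscale` argument.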
- logml.report.controllers.eda.plot_dim_reduction(original_df: pandas.core.frame.DataFrame, dim_reduct_output: numpy.array, continuous_columns: Optional[List[str]] = None, discrete_columns: Optional[List[str]] = None, labels: Optional[List[str]] = None, title: Optional[str] = None, max_label_len: int = 20, cmap_name: str = 'RdYlBu')
Generic function for plotting dim reduction that lets the user plot different components and color based on different continuous features.
- Parameters
original_df – original dataframe. Must contain all the continuous and discrete columns specified.
dim_reduct_output – result of dim reduction operation (PCA, etc).
continuous_columns – list of (numeric) continuous columns. Will have smooth color bar.
discrete_columns – list of categorical (string or number) columns.
labels – columns of dim_reduct_output dataframe to use.
title – chart title.
max_label_len – Limit for original df column names, used to limit width of drop-down list with these names.
cmap_name – Valid name of plotly colorscale. See plotly.colors.named_colorscales() for the full list.
- logml.report.controllers.eda.plot_skew_kurt(continuous_summary: pandas.core.frame.DataFrame)
Uses the continuous summary from stats_summary.py which returns a dataframe that includes the skewness and kurtosis. Plots the skewness vs the kurtosis of various features and outputs a heatmap of the features based on the skewness.
- logml.report.controllers.eda.get_normalized_data(data: pandas.core.frame.DataFrame, list_feat: List[str])
Takes the numeric data from a dataset and returns the normalized linear data, normalized log transformed data, and a randomly generated normal distribution with mean = 0, std = 1 and length = number of samples in the original dataset. These will be used to generate the quantile normal plots.
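The three outputs described for get_normalized_data could look roughly like the sketch below. All names here are hypothetical, and two choices are assumptions rather than documented behavior: z-score standardization as the "normalization", and log1p as the log transform (which requires non-negative values).

```python
import numpy as np
import pandas as pd


def standardize(frame: pd.DataFrame) -> pd.DataFrame:
    """Scale every column to zero mean and unit (population) standard deviation."""
    return (frame - frame.mean()) / frame.std(ddof=0)


def quantile_normal_inputs(data: pd.DataFrame, features, seed=0):
    """Return (scaled linear data, scaled log data, reference normal sample)."""
    numeric = data[features].astype(float)
    scaled_linear = standardize(numeric)
    # Assumption: log1p is used as the log transform; the real one may differ.
    scaled_log = standardize(np.log1p(numeric))
    # Reference sample with mean 0, std 1 and one draw per row of the dataset.
    rng = np.random.default_rng(seed)
    normal = rng.normal(loc=0.0, scale=1.0, size=len(numeric))
    return scaled_linear, scaled_log, normal


# Hypothetical sample data.
df = pd.DataFrame({"dose": [1.0, 2.0, 4.0, 8.0], "age": [30, 41, 52, 63]})
linear, log, normal = quantile_normal_inputs(df, ["dose", "age"])
print(linear.round(3))
```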
- logml.report.controllers.eda.plot_quantile_normal(scaled_linear_data: pandas.core.frame.DataFrame, scaled_log_data: pandas.core.frame.DataFrame, normal: numpy.array, title: str)
Plots a quantile-normal plot that lets the user select which features they want to see.
A function to identify highly correlated features.
- logml.report.controllers.eda.plot_highest_correlation(reduced_matrix, correlation_threshold: float)
Takes a reduced correlation matrix and returns a df of: Feature | Highest Correlation | Feature’s Correlated With.
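To make the Feature | Highest Correlation | Correlated With layout concrete, here is a small pandas sketch. The function name is hypothetical, and it assumes a square correlation matrix with matching row and column labels; what "reduced" means for the real input is not specified here.

```python
import numpy as np
import pandas as pd


def highest_correlations(corr: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """List each feature's strongest partner when |correlation| exceeds threshold."""
    # Blank out the diagonal, where every feature correlates perfectly with itself.
    off_diagonal = corr.abs().mask(np.eye(len(corr), dtype=bool))
    columns = ["Feature", "Highest Correlation", "Correlated With"]
    rows = []
    for feature in off_diagonal.index:
        partner = off_diagonal.loc[feature].idxmax()
        value = off_diagonal.loc[feature, partner]
        if value > threshold:
            rows.append(dict(zip(columns, [feature, value, partner])))
    return pd.DataFrame(rows, columns=columns)


# Hypothetical correlation matrix for three features.
corr = pd.DataFrame(
    [[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]],
    index=list("abc"),
    columns=list("abc"),
)
table = highest_correlations(corr, threshold=0.8)
print(table)
```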
- logml.report.controllers.eda.plot_general_summary(general_summary, header_list, columns_to_highlight)
Returns a table that includes information on the number features in the dataset, how they are divided in terms of categorical vs. numeric and how many features have over 80%.
- logml.report.controllers.eda.plot_target_summary_classification(df, target, represent_threshold)
Plots the table for the target feature for a classification problem. Shows the cardinality of the target feature.
- logml.report.controllers.eda.plot_target_summary_regression(df: pandas.core.frame.DataFrame, target: str)
Plots the distribution of the target feature for a regression problem.
- logml.report.controllers.eda.get_missing_values_df(missing_values, _target)
Returns a dataframe with missingness stats using a given EDA artifact.
- logml.report.controllers.eda.plot_feature_completeness(missing_fraction, typing)
Plots feature completeness: for a number of features - how many samples can we have with NaNs?
- logml.report.controllers.eda.plot_complete_dataset(complete_data_set_df, typing)
Produces a plot on top of a given complete dataset.
- logml.report.controllers.eda.plot_cv_mean(summary)
Plots the coefficient of variation vs. the mean of continuous features.
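For reference, the coefficient of variation is the standard deviation divided by the mean. The sketch below computes it per column with pandas; the function name and the use of the sample standard deviation (pandas default, ddof=1) are assumptions, and the structure of the `summary` argument is not documented here.

```python
import pandas as pd


def coefficient_of_variation(data: pd.DataFrame) -> pd.DataFrame:
    """Return the mean, standard deviation and CV (std / mean) of each column."""
    mean = data.mean()
    std = data.std()  # sample standard deviation (ddof=1)
    return pd.DataFrame({"mean": mean, "std": std, "cv": std / mean})


# Hypothetical sample data: one varying and one constant feature.
df = pd.DataFrame({"height": [1.0, 2.0, 3.0], "weight": [10.0, 10.0, 10.0]})
summary = coefficient_of_variation(df)
print(summary)
```

Note that the ratio is undefined for features with a mean of zero, so such columns would need separate handling.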
- logml.report.controllers.eda.reorder_matrix(features: List[str], similarity_order: List[str], matrix: numpy.array)
Maps index of a list of features to the similarity order and reorders the matrix based on the similarity order.
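One way to read the description of reorder_matrix is as a simultaneous permutation of rows and columns. The sketch below illustrates that reading with numpy; the function name is hypothetical and it assumes a square matrix whose rows and columns follow the order of `features`.

```python
import numpy as np


def reorder(features, similarity_order, matrix):
    """Permute rows and columns of a square matrix to follow similarity_order."""
    position = {name: index for index, name in enumerate(features)}
    indices = [position[name] for name in similarity_order]
    # np.ix_ builds an open mesh so rows and columns are permuted together.
    return matrix[np.ix_(indices, indices)]


features = ["a", "b", "c"]
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
reordered = reorder(features, ["c", "a", "b"], matrix)
print(reordered)
```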
- class logml.report.controllers.eda.EDAController(cfg: GlobalConfig, global_params: dict, setup_id: str = '')
Bases:
object
Implements data handling and plotting API for EDA results.
- has_artifact(artifact_name) bool
Determine if a controller has access to a certain artifact data.
- Parameters
artifact_name – One of the eligible artifact names.
- Returns
True if artifact exists.
- property metadata: Optional[logml.configuration.modeling.ModelingTaskSpec]
In case a modeling problem is provided for EDA - returns its metadata.
- property task: Optional[str]
Retrieves a task from the corresponding metadata, if possible.
- property target: Optional[str]
Retrieves a target from the corresponding metadata, if possible.
- property categorical_features: List[str]
Returns a list of categorical features (excluding target, if set).
- property numerical_features: List[str]
Returns a list of numerical features (excluding target, if set).
- property categoricals_summary: logml.report.controllers.eda.CategoricalSummary
Returns CategoricalSummary (plotter) object.
- dim_reduction_artifact_exists(artifact_name: str) bool
Checks whether a given dimensionality reduction artifact exists (the corresponding property is set within the EDA artifact).
- show_mca_scree()
Shows scree plot for MCA result.
- show_mca_components()
Shows MCA result’s components (the first 3).
- show_mca_loadings()
Shows how principal components of the MCA result affect features.
- show_mca_output()
Shows the MCA result (scatter plot).
- show_dataset_head()
Shows the first 5 rows.
- show_dataset_tail()
Shows the last 5 rows.
- show_categorical_features()
Shows a list of categorical features.
- show_continuous_features()
Shows a list of numerical features.
- show_target_type()
Show target’s type: numerical/categorical.
- get_general_dataset_summary() Dict
Composes a brief summary with dataset’s statistics.
- show_general_summary()
Plots a summary of the dataset's statistics.
- plot_target_distribution()
Shows target’s distribution.
- show_pca_scree()
Shows scree plot for PCA result.
- show_pca_components()
Shows PCA result’s components (the first 3).
- show_pca_loadings()
Shows how principal components of the PCA result affect features.
- show_tsne(max_columns=20)
Renders TSNE results.
- show_pca_output()
Shows the PCA result (scatter plot).
- show_lda_plots()
Shows LDA results (basic visualizations - the first 3 components and scatter).
- show_distributions_heatmap()
Shows a heatmap for numerical features distributions.
- show_skew_kurt_plot()
Shows a scatter for numerical features: kurtosis vs skewness.
- show_qn_plots()
Shows 2 identical QN plots so the user can compare distributions of numerical features.
- show_highest_correlation_pairs()
Shows a list of feature pairs that are highly correlated (>0.8).
- show_correlation_table_plots()
Shows plots for correlation table: correlation table itself and a dendrogram on top of it.
- show_correlation_groups()
Displays available correlation groups in table format.
- get_dataset_with_truncated_columns() pandas.core.frame.DataFrame
Returns a dataset with truncated columns (up to 18 chars kept).
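Truncating column names to 18 characters can be sketched with `DataFrame.rename`. The function name is hypothetical, and the sketch does not handle names that collide after truncation; whether logml does is not stated here.

```python
import pandas as pd

MAX_COLUMN_LENGTH = 18  # limit quoted in the docs above


def truncate_columns(data: pd.DataFrame, max_length: int = MAX_COLUMN_LENGTH) -> pd.DataFrame:
    """Return a copy of the frame whose column names are cut to max_length chars."""
    return data.rename(columns=lambda name: str(name)[:max_length])


# Hypothetical sample data with one overly long column name.
df = pd.DataFrame({"a_very_long_feature_name_indeed": [1], "short": [2]})
truncated = truncate_columns(df)
print(list(truncated.columns))
```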
- show_missing_data_matrix()
Shows a missing data matrix. Use only columns with NA values.
- show_missingness_per_feature()
Shows missingness stats per feature.
- show_missingness_similarity()
Shows missingness similarity - heatmap.
- show_feature_completeness_for_all()
Shows completeness for all features.
- show_complete_dataset_for_all()
Shows complete dataset for all features.
- render_missingness_overview() None
Renders summary numbers for missing data.
- show_missingness_for_categoricals()
Shows completeness and complete dataset for categoricals only.
- show_missingness_for_numericals()
Shows completeness and complete dataset for numericals only.
- show_descriptive_stats_for_numericals()
Shows a table with descriptive statistics for numericals.
- show_coefficient_of_variation_table()
Shows coefficients of variation for numericals (table).
- show_coefficient_of_variation_plot()
Shows coefficients of variation vs mean for numericals (plot).