logml.eda.artifacts_producers.stats_summary

Functions

calculate_corrected_variance_coef(df_column)

Returns Var(x) / Mean(x) for a given column.

check_normality(df_column[, p_threshold])

Runs the Shapiro-Wilk test to check whether a given column's values are normally distributed.

describe_column(df_column, column)

Describe a single field (i.e. a single column from a dataframe).

get_custom_stats_summary(df_column, column)

Calculates a set of custom statistics for a given column.

get_distributions_fit_summary(df_column, column)

Runs statistical tests to check whether a given column fits target distributions (normal, etc.).

Classes

StatisticsSummaryProducer(metadata_cfg, ...)

Produces multiple statistics for numerical columns.

class logml.eda.artifacts_producers.stats_summary.StatisticsSummaryProducer(metadata_cfg: logml.configuration.modeling.ModelingTaskSpec, global_params: dict, dataset_metadata: Optional[logml.data.metadata.DatasetMetadata] = None, logger=None, eda_params: Optional[logml.configuration.eda.EDAArtifactsGenerationParameters] = None)

Bases: logml.eda.artifacts_producers.base.BaseEDAArtifactsProducer

Produces:

  • multiple statistics for numerical columns

Dependencies:

  • metadata artifact

LABEL = 'statistics'
DEPENDENCIES = ['metadata']
ALIAS = 'Statistics summary producer'
produce(dataframe: pandas.core.frame.DataFrame)

Creates and dumps EDA artifact for a given dataframe.

logml.eda.artifacts_producers.stats_summary.describe_column(df_column: pandas.core.series.Series, column: str) → pandas.core.frame.DataFrame

Describe a single field (i.e. a single column from a dataframe).
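As a rough illustration of what describing a single column can look like, the sketch below uses pandas' built-in `Series.describe()`. It is an assumption that the library's function behaves similarly; `describe_column_sketch` is a hypothetical name and not part of logml.

```python
import pandas as pd


def describe_column_sketch(df_column: pd.Series, column: str) -> pd.DataFrame:
    """Hypothetical stand-in: summarize one column as a single-column dataframe."""
    # For numeric data, Series.describe() yields count, mean, std, min,
    # quartiles and max, indexed by statistic name.
    return df_column.describe().to_frame(name=column)


values = pd.Series([1.0, 2.0, 3.0, 4.0])
summary = describe_column_sketch(values, "feature_a")
```

The result keeps the statistic names in the index and uses the column name as the only column label.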

logml.eda.artifacts_producers.stats_summary.calculate_corrected_variance_coef(df_column: pandas.core.series.Series) → float

Returns Var(x) / Mean(x) for a given column.
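The ratio Var(x) / Mean(x) can be sketched with plain pandas as shown below. The helper name `variance_to_mean_ratio`, the use of the sample variance (`ddof=1`, the pandas default), and the NaN result for a zero mean are all assumptions made for illustration; the library's own choices are not documented here.

```python
import pandas as pd


def variance_to_mean_ratio(df_column: pd.Series) -> float:
    """Hypothetical helper: Var(x) / Mean(x) for one numeric column."""
    mean = df_column.mean()
    if mean == 0:
        # Assumption: the ratio is treated as undefined for a zero mean.
        return float("nan")
    # Series.var() uses ddof=1 (sample variance) by default.
    return float(df_column.var() / mean)


ratio = variance_to_mean_ratio(pd.Series([2.0, 4.0, 6.0]))
```

For the values 2, 4, 6 the sample variance is 4 and the mean is 4, so the ratio is 1.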

logml.eda.artifacts_producers.stats_summary.get_custom_stats_summary(df_column: pandas.core.series.Series, column: str) → pandas.core.frame.DataFrame

Calculates a set of custom statistics for a given column.

The result dataframe has the following format:

index    column
stat1    value1
stat2    value2
stat3    value3
statN    valueN
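A dataframe of this shape (statistic names in the index, one column named after the source column) could be built as in the sketch below. The statistics chosen here (skewness, kurtosis, number of unique values) are placeholders; which statistics the library actually computes is not stated, and `stats_frame_sketch` is a hypothetical name.

```python
import pandas as pd


def stats_frame_sketch(df_column: pd.Series, column: str) -> pd.DataFrame:
    """Hypothetical example of the result layout: one row per statistic."""
    # Placeholder statistics, chosen only to illustrate the format.
    stats = {
        "skewness": df_column.skew(),
        "kurtosis": df_column.kurt(),
        "n_unique": df_column.nunique(),
    }
    # The dict keys become the index; the column label is the column name.
    return pd.DataFrame({column: pd.Series(stats)})


frame = stats_frame_sketch(pd.Series([1.0, 2.0, 3.0, 4.0, 5.0]), "feature_a")
```

Keeping one column per source column makes it straightforward to concatenate the per-column results side by side with `pd.concat(..., axis=1)`.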

logml.eda.artifacts_producers.stats_summary.check_normality(df_column: pandas.core.series.Series, p_threshold=0.05) → dict

Runs the Shapiro-Wilk test to check whether a given column’s values are normally distributed.
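A minimal sketch of such a check, assuming SciPy's `scipy.stats.shapiro` is used and that the column is treated as normal when the p-value exceeds `p_threshold`. The function name `check_normality_sketch`, the dropping of missing values, and the keys of the returned dict are hypothetical; the library's actual dict layout is not documented here.

```python
import pandas as pd
from scipy import stats


def check_normality_sketch(df_column: pd.Series, p_threshold: float = 0.05) -> dict:
    """Hypothetical stand-in: Shapiro-Wilk test on one numeric column."""
    # Assumption: missing values are dropped before testing.
    values = df_column.dropna()
    statistic, p_value = stats.shapiro(values)
    return {
        "statistic": float(statistic),
        "p_value": float(p_value),
        # Failing to reject the null hypothesis is read as "looks normal".
        "is_normal": bool(p_value > p_threshold),
    }


sample = pd.Series([2.1, 2.5, 3.0, 3.2, 3.3, 3.9, 4.1, 4.4, 5.0, None])
result = check_normality_sketch(sample)
```

Note that a p-value above the threshold only means the test found no evidence against normality; it does not prove the data are normal, especially for small samples.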

logml.eda.artifacts_producers.stats_summary.get_distributions_fit_summary(df_column: pandas.core.series.Series, column: str) → Optional[pandas.core.frame.DataFrame]

Runs statistical tests to check whether a given column fits target distributions (normal, etc.).

The result dataframe has the following format:

index    column
stat1    value1
stat2    value2
stat3    value3
statN    valueN