logml.eda.artifacts_producers.stats_summary
Functions
calculate_corrected_variance_coef | Returns Var(x) / Mean(x) for a given column.
check_normality | Runs the Shapiro-Wilk test to check whether a given column's values are normally distributed.
describe_column | Describes a single field (i.e. a single column from a dataframe).
get_custom_stats_summary | Calculates a set of custom statistics for a given column.
get_distributions_fit_summary | Runs statistical tests to check whether a given column fits target distributions (normal, etc.).
Classes
StatisticsSummaryProducer | Produces: multiple statistics for numerical columns.
- class logml.eda.artifacts_producers.stats_summary.StatisticsSummaryProducer(metadata_cfg: logml.configuration.modeling.ModelingTaskSpec, global_params: dict, dataset_metadata: Optional[logml.data.metadata.DatasetMetadata] = None, logger=None, eda_params: Optional[logml.configuration.eda.EDAArtifactsGenerationParameters] = None)
Bases: logml.eda.artifacts_producers.base.BaseEDAArtifactsProducer
Produces: multiple statistics for numerical columns
Dependencies: metadata artifact
- LABEL = 'statistics'
- DEPENDENCIES = ['metadata']
- ALIAS = 'Statistics summary producer'
- produce(dataframe: pandas.core.frame.DataFrame)
Creates and dumps EDA artifact for a given dataframe.
- logml.eda.artifacts_producers.stats_summary.describe_column(df_column: pandas.core.series.Series, column: str) -> pandas.core.frame.DataFrame
Describe a single field (i.e. a single column from a dataframe).
- logml.eda.artifacts_producers.stats_summary.calculate_corrected_variance_coef(df_column: pandas.core.series.Series) -> float
Returns Var(x) / Mean(x) for a given column.
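The ratio Var(x) / Mean(x) is often called the index of dispersion. The sketch below only illustrates that formula; it is not the library's implementation. The helper name variance_to_mean_ratio, the use of the sample variance (ddof=1), and the NaN result for a zero mean are all assumptions made for this example.

```python
import pandas as pd


def variance_to_mean_ratio(values: pd.Series) -> float:
    """Hypothetical helper: Var(x) / Mean(x), ignoring missing values."""
    mean = values.mean()  # pandas skips NaN by default
    if pd.isna(mean) or mean == 0:
        # Assumption: the ratio is reported as NaN when it is undefined.
        return float("nan")
    # Assumption: sample variance (ddof=1); the source does not say which is used.
    return float(values.var(ddof=1) / mean)


ratio = variance_to_mean_ratio(pd.Series([2.0, 4.0, 4.0, 6.0]))
```

Here the mean is 4 and the sample variance is 8/3, so the ratio is 2/3.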
- logml.eda.artifacts_producers.stats_summary.get_custom_stats_summary(df_column: pandas.core.series.Series, column: str) -> pandas.core.frame.DataFrame
Calculates a set of custom statistics for a given column.
The result dataframe has the following format:
index | column
stat1 | value1
stat2 | value2
stat3 | value3
… | …
statN | valueN
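A frame in the layout above (statistic names in the index, one value column named after the source column) can be built with plain pandas. This is a minimal sketch: the helper name stats_frame and the three statistics are hypothetical placeholders, not the set the library actually computes.

```python
import pandas as pd


def stats_frame(values: pd.Series, column: str) -> pd.DataFrame:
    """Hypothetical helper: one row per statistic, one column named `column`."""
    stats = {
        # Placeholder statistics chosen for illustration only.
        "n_missing": int(values.isna().sum()),
        "n_unique": int(values.nunique()),
        "skewness": float(values.skew()),
    }
    # The Series name becomes the single column label of the resulting frame.
    return pd.Series(stats, name=column).to_frame()


frame = stats_frame(pd.Series([21, 35, 35, None, 50]), "age")
```

Naming the value column after the source column should make it straightforward to join the per-column frames side by side, for example with pandas.concat(..., axis=1).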
- logml.eda.artifacts_producers.stats_summary.check_normality(df_column: pandas.core.series.Series, p_threshold=0.05) -> dict
Runs the Shapiro-Wilk test to check whether a given column’s values are normally distributed.
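For orientation, a Shapiro-Wilk check with a p-value threshold can be sketched with scipy.stats.shapiro. The helper name shapiro_check and the keys of the returned dict are assumptions for this example; the source does not document what the real function returns beyond its dict type.

```python
import numpy as np
import pandas as pd
from scipy import stats


def shapiro_check(values: pd.Series, p_threshold: float = 0.05) -> dict:
    """Hypothetical helper: Shapiro-Wilk test on the non-missing values."""
    clean = values.dropna()
    statistic, p_value = stats.shapiro(clean)
    return {
        # Key names are illustrative, not the library's.
        "statistic": float(statistic),
        "p_value": float(p_value),
        # A p-value above the threshold means normality is not rejected.
        "is_normal": bool(p_value > p_threshold),
    }


# A strongly skewed sample, for which the test should reject normality.
rng = np.random.default_rng(0)
skewed = pd.Series(rng.exponential(size=200))
result = shapiro_check(skewed)
```

A p-value above the threshold does not prove the data are normal; it only means the test found no evidence against normality at that level.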
- logml.eda.artifacts_producers.stats_summary.get_distributions_fit_summary(df_column: pandas.core.series.Series, column: str) -> Optional[pandas.core.frame.DataFrame]
Runs statistical tests to check whether a given column fits target distributions (normal, etc.).
The result dataframe has the following format:
index | column
stat1 | value1
stat2 | value2
stat3 | value3
… | …
statN | valueN