logml.eda.artifacts_producers.missingness

Functions

get_complete_dateset_summary(dataframe, ...)

Returns a mapping from a number of columns to the maximum number of rows that have no NaNs across a column subset of that size.

get_missingness_summary_per_axes(dataframe, ...)

Returns summaries of missing values along both axes (per column and per row).

Classes

MissingnessSummaryProducer(metadata_cfg, ...)

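Produces missing-value summaries per column and per row, complete dataset summaries, and a similarity order by pairwise NaN distances.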

class logml.eda.artifacts_producers.missingness.MissingnessSummaryProducer(metadata_cfg: logml.configuration.modeling.ModelingTaskSpec, global_params: dict, dataset_metadata: Optional[logml.data.metadata.DatasetMetadata] = None, logger=None, eda_params: Optional[logml.configuration.eda.EDAArtifactsGenerationParameters] = None)

Bases: logml.eda.artifacts_producers.base.BaseEDAArtifactsProducer

Produces:

  • missing values per column summaries (for num/cat/all columns)

  • missing values per row summaries (for num/cat/all columns)

  • complete dataset summaries (for num/cat/all columns)

  • similarity order by pairwise NaN distances

Dependencies:

  • metadata artifact

LABEL = 'missingness'
DEPENDENCIES = ['metadata']
ALIAS = 'Missingness summary producer'
produce(dataframe: pandas.core.frame.DataFrame)

Generates the missing-data EDA artifacts for the given dataframe.

logml.eda.artifacts_producers.missingness.get_missingness_summary_per_axes(dataframe: pandas.core.frame.DataFrame, target_columns: List[str]) → Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame]

Returns summaries of missing values along both axes (per column and per row).
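
To illustrate the kind of result a per-axis missingness summary contains, here is a minimal sketch in plain pandas. It does not call logml and is not its implementation: the function name missingness_per_axes_sketch, the output column names missing_count and missing_fraction, and the (per column, per row) order of the returned tuple are all assumptions made for this example.

```python
import numpy as np
import pandas as pd


def missingness_per_axes_sketch(dataframe, target_columns):
    """Hypothetical stand-in for a per-axis summary, not the logml code."""
    # Boolean mask: True where a value is missing (NaN or None).
    mask = dataframe[target_columns].isna()

    # One row per column: how many values are missing, and what share.
    per_column = pd.DataFrame({
        "missing_count": mask.sum(axis=0),
        "missing_fraction": mask.mean(axis=0),
    })
    # One row per dataframe row: the same statistics across the columns.
    per_row = pd.DataFrame({
        "missing_count": mask.sum(axis=1),
        "missing_fraction": mask.mean(axis=1),
    })
    return per_column, per_row


# Toy data, invented for the example.
df = pd.DataFrame({
    "age": [31.0, np.nan, 47.0, np.nan],
    "weight": [70.5, 82.0, np.nan, np.nan],
    "site": ["A", "B", None, "A"],
})
per_column, per_row = missingness_per_axes_sketch(df, ["age", "weight", "site"])
print(per_column)
print(per_row)
```

Restricting the mask to target_columns first means columns outside that list do not affect either summary.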

logml.eda.artifacts_producers.missingness.get_complete_dateset_summary(dataframe: pandas.core.frame.DataFrame, target_columns: List[str]) → pandas.core.frame.DataFrame

Returns a mapping from a number of columns to the maximum number of rows that have no NaNs across a column subset of that size.
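
The mapping described above can be sketched with an exhaustive search over column subsets. This is only an illustration of the definition, not the logml implementation: the function name complete_dataset_summary_sketch and the output column names n_columns and max_complete_rows are assumptions, and the brute-force search grows exponentially with the number of columns, so it is suitable for small examples only.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def complete_dataset_summary_sketch(dataframe, target_columns):
    """Hypothetical brute-force version of the mapping, not the logml code."""
    # True where a value is present (not NaN or None).
    present = dataframe[target_columns].notna()

    records = []
    for size in range(1, len(target_columns) + 1):
        # For every column subset of this size, count the rows that are
        # complete within the subset, and keep the best count.
        best = max(
            int(present[list(cols)].all(axis=1).sum())
            for cols in combinations(target_columns, size)
        )
        records.append({"n_columns": size, "max_complete_rows": best})
    return pd.DataFrame(records)


# Toy data, invented for the example.
df = pd.DataFrame({
    "age": [31.0, np.nan, 47.0, np.nan],
    "weight": [70.5, 82.0, np.nan, np.nan],
    "site": ["A", "B", None, "A"],
})
summary = complete_dataset_summary_sketch(df, ["age", "weight", "site"])
print(summary)
```

For this toy data the best single column keeps 3 complete rows, the best pair keeps 2, and using all three columns keeps 1, which shows the trade-off between how many columns are retained and how many fully observed rows remain.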