logml.eda.artifacts_producers.correlation

Functions

create_correlation_graph(correlation_matrix, ...)

Given a correlation matrix and a threshold, builds a CorrelationGraph.

create_correlation_graph_nx(...)

Given a correlation matrix and a threshold, builds a correlation graph.

generate_corr_groups(dataframe[, columns, ...])

Generates correlation artifacts.

get_correlation_groups(correlation_graph, ...)

Returns a list of correlation groups.

get_correlation_groups_nx(correlation_graph)

Find groups from the graph connectivity.

Classes

CorrelationSummaryProducer(metadata_cfg, ...)

Produces Pearson/Spearman correlation for numerical columns, orders the artifact by similarity using AgglomerativeClustering, and lists correlation groups. See logml.eda.artifacts.correlation.CorrelationSummary.

logml.eda.artifacts_producers.correlation.create_correlation_graph(correlation_matrix: numpy.array, node_labels: List[str], threshold: float) → logml.eda.artifacts.correlation.CorrelationGraph

Given a correlation matrix and a threshold, builds a CorrelationGraph.

logml.eda.artifacts_producers.correlation.create_correlation_graph_nx(correlation_matrix: numpy.array, node_labels: List[str], threshold: float) → networkx.classes.graph.Graph

Given a correlation matrix and a threshold, builds a correlation graph.

The correlation matrix is binarized (an entry becomes 1 if it is greater than the threshold, 0 otherwise) and the result is used as the graph's adjacency matrix.
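The binarization step can be illustrated with a minimal NumPy sketch. It is not the logml implementation: the helper names binarize_correlation and adjacency_to_edges are hypothetical, dropping the diagonal (self-loops) is an assumption, and the comparison uses the raw value as stated above, since the page does not say whether absolute correlations are used.

```python
import numpy as np


def binarize_correlation(corr, threshold):
    """Return a boolean adjacency matrix: True where corr > threshold."""
    corr = np.asarray(corr, dtype=float)
    adjacency = corr > threshold
    # Assumption: self-loops are dropped, because every variable
    # correlates perfectly with itself.
    np.fill_diagonal(adjacency, False)
    return adjacency


def adjacency_to_edges(adjacency, node_labels):
    """List undirected edges as label pairs, reading the upper triangle only."""
    rows, cols = np.nonzero(np.triu(adjacency, k=1))
    return [(node_labels[i], node_labels[j]) for i, j in zip(rows, cols)]


corr = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.6],
    [0.1, 0.6, 1.0],
])
labels = ["a", "b", "c"]

adjacency = binarize_correlation(corr, threshold=0.5)
edges = adjacency_to_edges(adjacency, labels)
print(edges)
```

With a threshold of 0.5, only the pairs (a, b) and (b, c) exceed the cutoff, so the resulting graph is a path through b.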

logml.eda.artifacts_producers.correlation.get_correlation_groups_nx(correlation_graph: networkx.classes.graph.Graph, level: int = -1, key_names: List[str] = None, corr: Optional[pandas.core.frame.DataFrame] = None, threshold=None, target_weights: Dict[str, float] = None) → List[logml.eda.artifacts.correlation.CorrelationGroup]

Find groups from the graph connectivity.

Parameters
  • correlation_graph – Graph built from the correlation matrix, used as an adjacency matrix.

  • level – Number of neighbour levels to include in a group. If zero or less, the whole graph component is used (all reachable neighbours are traversed).

  • key_names – Configured correlation key names.

  • target_weights – Weights used to sort columns (correlation with the target).

Returns

List of correlation groups.
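The effect of the level parameter can be sketched as a depth-limited breadth-first traversal. This is a hedged illustration using a plain dict of adjacency lists instead of a networkx graph. The function name neighbourhood is hypothetical, and how logml chooses starting nodes or handles overlapping groups is not described here, so the sketch covers only the traversal from a single start node.

```python
from collections import deque


def neighbourhood(adjacency, start, level=-1):
    """Collect the nodes within `level` hops of `start`.

    A `level` of zero or less means no depth limit, so the whole
    connected component containing `start` is returned.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if level > 0 and depth >= level:
            continue  # depth limit reached: do not expand this node
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return seen


# Path graph a - b - c - d, plus an isolated node e.
graph = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c"],
    "e": [],
}

one_hop = neighbourhood(graph, "a", level=1)
two_hops = neighbourhood(graph, "a", level=2)
component = neighbourhood(graph, "a", level=-1)
print(sorted(one_hop), sorted(two_hops), sorted(component))
```

A small level keeps groups tight around each column, while a non-positive level merges everything that is connected through any chain of correlated columns.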

logml.eda.artifacts_producers.correlation.get_correlation_groups(correlation_graph: logml.eda.artifacts.correlation.CorrelationGraph, target_columns: List[str], key_names: Optional[List[str]] = None) → List[logml.eda.artifacts.correlation.CorrelationGroup]

Returns a list of correlation groups.

A greedy approach is used to find an approximate solution to the maximal independent set problem.
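A generic greedy independent-set selection can be sketched as follows. This is an assumption-laden illustration, not the logml code: the function name greedy_independent_set and the order argument are hypothetical, and the way logml orders nodes (for example by target_columns or key_names) is not documented on this page.

```python
def greedy_independent_set(adjacency, order=None):
    """Pick nodes one by one, skipping any node adjacent to an earlier pick."""
    order = list(adjacency) if order is None else list(order)
    selected = []
    blocked = set()
    for node in order:
        if node in blocked:
            continue  # already picked, or adjacent to a picked node
        selected.append(node)
        blocked.add(node)
        blocked.update(adjacency[node])
    return selected


# Path graph a - b - c - d.
graph = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c"],
}

default_pick = greedy_independent_set(graph)
other_pick = greedy_independent_set(graph, order=["b", "a", "c", "d"])
print(default_pick, other_pick)
```

The result depends on the visiting order. Each run yields a set that cannot be extended with another node, but a greedy pass does not guarantee the largest possible set, which is why the solution is approximate.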

logml.eda.artifacts_producers.correlation.generate_corr_groups(dataframe: pandas.core.frame.DataFrame, columns: Optional[List[str]] = None, corr_type: str = 'pearson', min_samples_fraction: float = 0.2, corr_thresh: float = 0.5, corr_graph_level_cutoff: int = 1, corr_key_names: Optional[List[str]] = None, target_values: Optional[numpy.ndarray] = None)

Generates correlation artifacts.

Returns

Tuple (correlation dataframe, correlation graph, correlation groups).

class logml.eda.artifacts_producers.correlation.CorrelationSummaryProducer(metadata_cfg: logml.configuration.modeling.ModelingTaskSpec, global_params: dict, dataset_metadata: Optional[logml.data.metadata.DatasetMetadata] = None, logger=None, eda_params: Optional[logml.configuration.eda.EDAArtifactsGenerationParameters] = None)

Bases: logml.eda.artifacts_producers.base.BaseEDAArtifactsProducer

Produces:
  • Pearson/Spearman correlation for numerical columns
  • artifact ordering by similarity, using AgglomerativeClustering
  • a list of correlation groups

See logml.eda.artifacts.correlation.CorrelationSummary.

Dependencies:
  • metadata artifact

LABEL = 'correlation'
DEPENDENCIES = ['metadata']
ALIAS = 'Correlation summary producer'
produce(dataframe: pandas.core.frame.DataFrame)

Creates and dumps the EDA artifact for a given dataframe.