logml.eda.artifacts_producers.correlation
Functions
create_correlation_graph | Given a correlation matrix and a threshold, builds a CorrelationGraph.
create_correlation_graph_nx | Given a correlation matrix and a threshold, builds a correlation graph.
generate_corr_groups | Generates correlation artifacts.
get_correlation_groups | Returns a list of correlation groups.
get_correlation_groups_nx | Find groups from the graph connectivity.
Classes
CorrelationSummaryProducer | Produces Pearson/Spearman correlation for numerical columns, an artifact ordered by similarity using AgglomerativeClustering, and a list of correlation groups. See logml.eda.artifacts.correlation.CorrelationSummary.
- logml.eda.artifacts_producers.correlation.create_correlation_graph(correlation_matrix: numpy.array, node_labels: List[str], threshold: float) logml.eda.artifacts.correlation.CorrelationGraph
Given a correlation matrix and a threshold, builds a CorrelationGraph.
- logml.eda.artifacts_producers.correlation.create_correlation_graph_nx(correlation_matrix: numpy.array, node_labels: List[str], threshold: float) networkx.classes.graph.Graph
Given a correlation matrix and a threshold, builds a correlation graph.
The correlation matrix is binarized (entries greater than the threshold become 1, all others 0) and the result is used as the graph adjacency matrix.
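The binarization step described above can be illustrated with a minimal sketch. It is not the logml implementation: the helper names binarize_correlation and adjacency_to_edges are hypothetical, dropping the diagonal is an assumption, and the raw value (not the absolute value) is compared against the threshold because that is all the docstring states.

```python
import numpy as np


def binarize_correlation(correlation_matrix, threshold):
    """Return a 0/1 adjacency matrix with 1 where correlation > threshold."""
    adjacency = (np.asarray(correlation_matrix) > threshold).astype(int)
    # Assumption: self-loops are dropped, since the diagonal of a
    # correlation matrix is always 1 and carries no information.
    np.fill_diagonal(adjacency, 0)
    return adjacency


def adjacency_to_edges(adjacency, node_labels):
    """List undirected edges as label pairs, each pair reported once."""
    # Only the strict upper triangle is scanned, so (a, b) is not
    # repeated as (b, a).
    rows, cols = np.nonzero(np.triu(adjacency, k=1))
    return [(node_labels[i], node_labels[j]) for i, j in zip(rows, cols)]


corr = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.6],
    [0.1, 0.6, 1.0],
])
adjacency = binarize_correlation(corr, threshold=0.5)
edges = adjacency_to_edges(adjacency, ["a", "b", "c"])
```

With a threshold of 0.5, columns a and b are connected, as are b and c, while a and c are not. A matrix like this could also be handed to networkx.from_numpy_array to obtain a networkx Graph.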
- logml.eda.artifacts_producers.correlation.get_correlation_groups_nx(correlation_graph: networkx.classes.graph.Graph, level: int = -1, key_names: List[str] = None, corr: Optional[pandas.core.frame.DataFrame] = None, threshold=None, target_weights: Dict[str, float] = None) List[logml.eda.artifacts.correlation.CorrelationGroup]
Find groups from the graph connectivity.
- Parameters
correlation_graph – Graph built from the correlation matrix used as an adjacency matrix.
level – Number of neighbour levels to include in a group. If zero or less, the whole graph component is used (all reachable neighbours are traversed).
key_names – Configured correlation key names.
target_weights – Weights used to sort columns (correlation with the target).
- Returns
List of correlation groups.
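The effect of the level parameter can be sketched with a plain breadth-first traversal. This is an illustration under stated assumptions, not the library code: the function name neighbourhood and the dict-of-lists graph representation are hypothetical, and the real function works on a networkx Graph and returns CorrelationGroup objects.

```python
from collections import deque


def neighbourhood(adjacency, start, level=-1):
    """Collect nodes reachable from `start` within `level` hops.

    A level of zero or less means no depth limit, so the whole connected
    component containing `start` is returned.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        # Stop expanding once the hop limit is reached (only if limited).
        if level > 0 and depth >= level:
            continue
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return seen


# Hypothetical graph: a chain a-b-c-d plus an isolated node e.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"], "e": []}
one_hop = neighbourhood(graph, "a", level=1)
component = neighbourhood(graph, "a", level=-1)
```

Here one_hop contains only a and its direct neighbour b, while component contains the whole chain a, b, c, d.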
- logml.eda.artifacts_producers.correlation.get_correlation_groups(correlation_graph: logml.eda.artifacts.correlation.CorrelationGraph, target_columns: List[str], key_names: Optional[List[str]] = None) List[logml.eda.artifacts.correlation.CorrelationGroup]
Returns a list of correlation groups.
A greedy approach is used to find an approximate solution to the maximal independent set problem.
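As a rough illustration of the greedy idea, the sketch below picks nodes one at a time and excludes their neighbours. The docstring does not say how nodes are ordered or how groups are built from the selected set, so the function name greedy_independent_set and the low-degree-first ordering are assumptions made for this example only.

```python
def greedy_independent_set(adjacency, order=None):
    """Greedily pick nodes such that no two picked nodes are adjacent."""
    if order is None:
        # Assumption: visit low-degree nodes first, ties broken by name.
        order = sorted(adjacency, key=lambda n: (len(adjacency[n]), n))
    chosen, blocked = [], set()
    for node in order:
        if node in blocked:
            continue
        chosen.append(node)
        # A chosen node blocks itself and all of its neighbours.
        blocked.add(node)
        blocked.update(adjacency[node])
    return chosen


# Hypothetical correlation graph: a chain a-b-c-d.
graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
picked = greedy_independent_set(graph)
```

On this chain the sketch selects a and d. Neither is adjacent to the other, and every remaining node is adjacent to one of them, so the result is a maximal independent set.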
- logml.eda.artifacts_producers.correlation.generate_corr_groups(dataframe: pandas.core.frame.DataFrame, columns: Optional[List[str]] = None, corr_type: str = 'pearson', min_samples_fraction: float = 0.2, corr_thresh: float = 0.5, corr_graph_level_cutoff: int = 1, corr_key_names: Optional[List[str]] = None, target_values: Optional[numpy.ndarray] = None)
Generates correlation artifacts.
- Returns
Tuple (correlation dataframe, correlation graph, correlation groups).
- class logml.eda.artifacts_producers.correlation.CorrelationSummaryProducer(metadata_cfg: logml.configuration.modeling.ModelingTaskSpec, global_params: dict, dataset_metadata: Optional[logml.data.metadata.DatasetMetadata] = None, logger=None, eda_params: Optional[logml.configuration.eda.EDAArtifactsGenerationParameters] = None)
Bases: logml.eda.artifacts_producers.base.BaseEDAArtifactsProducer
Produces:
  - Pearson/Spearman correlation for numerical columns
  - artifact ordered by similarity using AgglomerativeClustering
  - list of correlation groups
See logml.eda.artifacts.correlation.CorrelationSummary.
Dependencies:
  - metadata artifact
- LABEL = 'correlation'
- DEPENDENCIES = ['metadata']
- ALIAS = 'Correlation summary producer'
- produce(dataframe: pandas.core.frame.DataFrame)
Creates and dumps EDA artifact for a given dataframe.