logml.data.utils
Functions
apply_fdr_with_nas | For a given set of p-values, performs FDR correction (Benjamini/Hochberg).
column_to_np | Reshapes a given Series into 2D np.array.
filter_by_group | Filter columns by groups.
filter_by_keyword | Filter by a keyword.
filter_columns | DEPRECATED.
filter_columns2 | Improved version of filter_columns function.
get_matching_regex_columns | Get union of all columns matching regular expressions.
nullify_outliers | Replaces data outside the provided mean ± range with NaN.
parse_df_filter_string | Parse filter string into filter type and filter expression.
shuffle_column | Shuffles a subset of a given column's elements.
truncate_column_names | Truncates column names of the dataframe to the given length, adding identifying numbers if truncated names collide.
Exceptions
PreprocessingPipelineException | Exception raised when a FeatureExtractionStepException is caught while using the Preprocessing Pipeline API.
- exception logml.data.utils.PreprocessingPipelineException
Bases: Exception
Exception raised when a FeatureExtractionStepException is caught while using the Preprocessing Pipeline API.
- class logml.data.utils.DataTransformLogItem
Bases: pydantic.main.BaseModel
Describes one action of dataset transformation.
JSON schema:
{ "title": "DataTransformLogItem", "description": "Describes one action of dataset transformation.", "type": "object", "properties": { "action": { "title": "Action", "type": "string" }, "action_params": { "title": "Action Params", "type": "object" }, "action_data": { "title": "Action Data", "type": "object" }, "input_shape": { "title": "Input Shape", "type": "array", "items": {} }, "output_shape": { "title": "Output Shape", "type": "array", "items": {} } }, "required": [ "action" ] }
- Fields
- field action: str [Required]
- field action_params: Optional[dict] = None
- field action_data: Optional[dict] = None
- field input_shape: Optional[tuple] = None
- field output_shape: Optional[tuple] = None
- class logml.data.utils.DataTransformLog
Bases: pydantic.main.BaseModel
Contains history of dataset transformation.
JSON schema:
{ "title": "DataTransformLog", "description": "Contains history of dataset transformation.", "type": "object", "properties": { "items": { "title": "Items", "default": [], "type": "array", "items": { "$ref": "#/definitions/DataTransformLogItem" } } }, "definitions": { "DataTransformLogItem": { "title": "DataTransformLogItem", "description": "Describes one action of dataset transformation.", "type": "object", "properties": { "action": { "title": "Action", "type": "string" }, "action_params": { "title": "Action Params", "type": "object" }, "action_data": { "title": "Action Data", "type": "object" }, "input_shape": { "title": "Input Shape", "type": "array", "items": {} }, "output_shape": { "title": "Output Shape", "type": "array", "items": {} } }, "required": [ "action" ] } } }
- field items: List[logml.data.utils.DataTransformLogItem] = []
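As an illustration of how the two models fit together, the sketch below re-declares them locally from the field lists above rather than importing them from logml, so it runs on its own. The action name `drop_columns` and its parameters are made-up values for illustration only.

```python
from typing import List, Optional

from pydantic import BaseModel


class DataTransformLogItem(BaseModel):
    """Local mirror of the documented fields: one dataset transformation action."""

    action: str
    action_params: Optional[dict] = None
    action_data: Optional[dict] = None
    input_shape: Optional[tuple] = None
    output_shape: Optional[tuple] = None


class DataTransformLog(BaseModel):
    """Local mirror of the documented fields: history of transformations."""

    items: List[DataTransformLogItem] = []


log = DataTransformLog()
# Hypothetical action: record that one column was dropped from a 100 x 5 dataset.
log.items.append(
    DataTransformLogItem(
        action="drop_columns",
        action_params={"columns": ["id"]},
        input_shape=(100, 5),
        output_shape=(100, 4),
    )
)
print(log.items[0].action, log.items[0].input_shape, log.items[0].output_shape)
```

Only `action` is required; the remaining fields default to None, which matches the `required` list in the JSON schema.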
- logml.data.utils.nullify_outliers(data: numpy.ndarray, mul_std=3) -> numpy.ndarray
Replaces values outside the provided mean ± range with NaN.
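The masking idea can be sketched as follows. This is not the library's implementation: it assumes, based on the parameter name `mul_std`, that the range is `mul_std` standard deviations around the mean, and the function name `nullify_outliers_sketch` is hypothetical.

```python
import numpy as np


def nullify_outliers_sketch(data: np.ndarray, mul_std: float = 3) -> np.ndarray:
    """Return a float copy of `data` with values outside mean +- mul_std * std set to NaN."""
    values = np.asarray(data, dtype=float).copy()
    # nanmean/nanstd ignore NaNs that are already present in the input.
    mean = np.nanmean(values)
    half_width = mul_std * np.nanstd(values)
    # Assumption: "range" means mul_std standard deviations on each side of the mean.
    outside = np.abs(values - mean) > half_width
    values[outside] = np.nan
    return values


sample = np.array([1.0] * 20 + [100.0])
cleaned = nullify_outliers_sketch(sample)
print(cleaned[-1])  # the extreme value 100.0 has been replaced with nan
```

The sketch works on a copy, so the input array is left unchanged.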
- logml.data.utils.filter_by_keyword(dataframe: pandas.core.frame.DataFrame, keyword: str, dataset_metadata: Optional[logml.data.metadata.DatasetMetadata] = None, objective: Optional[logml.configuration.modeling.ModelingTaskSpec] = None, **_kwargs) -> Set[str]
Filter by a keyword.
- logml.data.utils.filter_by_group(unused_dataframe: pandas.core.frame.DataFrame, group: str, dataset_metadata: Optional[logml.data.metadata.DatasetMetadata] = None, **_kwargs) -> Set[str]
Filter columns by groups.
- logml.data.utils.filter_columns2(dataframe: pandas.core.frame.DataFrame, include: Optional[List[str]] = None, exclude: Optional[List[str]] = None, dataset_metadata: Optional[logml.data.metadata.DatasetMetadata] = None, objective: Optional[logml.configuration.modeling.ModelingTaskSpec] = None, **kwargs) -> Optional[List[str]]
Improved version of the filter_columns function. In addition to regex filtering, it supports specialized filters: group and keyword filters.
- logml.data.utils.filter_columns(dataframe: pandas.core.frame.DataFrame, regexps_to_include: Optional[List[str]] = None, regexps_to_exclude: Optional[List[str]] = None, metadata: Optional[logml.data.metadata.DatasetMetadata] = None, groups_include: Optional[List[str]] = None, groups_exclude: Optional[List[str]] = None)
DEPRECATED. For a given list of column regexps and group names, returns a list of all matching columns. Metadata is required.
Columns inclusion/exclusion schema:
1. include columns from groups_include
2. include columns from regexps_to_include
3. exclude columns from groups_exclude
4. exclude columns from regexps_to_exclude
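The inclusion/exclusion order above can be sketched with plain set operations. This is a minimal illustration, not the library's code: the `groups` dictionary is a hypothetical stand-in for the group information held in DatasetMetadata, the function name is made up, and full-match semantics for the regexps are an assumption.

```python
import re
from typing import Dict, Iterable, List, Optional, Set


def select_columns_sketch(
    columns: Iterable[str],
    groups: Dict[str, Set[str]],
    regexps_to_include: Optional[List[str]] = None,
    regexps_to_exclude: Optional[List[str]] = None,
    groups_include: Optional[List[str]] = None,
    groups_exclude: Optional[List[str]] = None,
) -> List[str]:
    columns = list(columns)

    def by_groups(names: Optional[List[str]]) -> Set[str]:
        found: Set[str] = set()
        for name in names or []:
            found |= groups.get(name, set())
        return found

    def by_regexps(patterns: Optional[List[str]]) -> Set[str]:
        return {c for c in columns for p in patterns or [] if re.fullmatch(p, c)}

    selected = by_groups(groups_include)        # 1. include by group
    selected |= by_regexps(regexps_to_include)  # 2. include by regexp
    selected -= by_groups(groups_exclude)       # 3. exclude by group
    selected -= by_regexps(regexps_to_exclude)  # 4. exclude by regexp
    # Keep the original column order and drop names that are not actual columns.
    return [c for c in columns if c in selected]


columns = ["age", "sex", "gene_a", "gene_b", "gene_c"]
groups = {"clinical": {"age", "sex"}, "genes": {"gene_a", "gene_b", "gene_c"}}
result = select_columns_sketch(
    columns,
    groups,
    regexps_to_include=[r"gene_.*"],
    regexps_to_exclude=[r"gene_c"],
    groups_include=["clinical"],
)
print(result)  # ['age', 'sex', 'gene_a', 'gene_b']
```

Because exclusions are applied after inclusions, a column matched by both an include and an exclude rule ends up excluded.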
- logml.data.utils.get_matching_regex_columns(dataframe: pandas.core.frame.DataFrame, regexp: Union[str, List[str]], **_kwargs) -> Set[str]
Get union of all columns matching regular expressions.
Applies each regular expression to all column names of the dataframe and returns those that fully match the regex.
Typical usage example:
    df = pd.DataFrame(data, columns=['a1', 'a2', 'b1', 'b2'])
    get_matching_regex_columns(df, r'^a.*')
Output:
    {'a1', 'a2'}
- Parameters
dataframe – Dataframe whose columns are filtered.
regexp – Regular expression definition.
- Returns
Set of matching columns.
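The full-match behaviour described above can be illustrated with the standard `re` module. The sketch below is hypothetical (the name `matching_columns_sketch` is not part of logml) and takes any iterable of column names, for example `df.columns`, instead of a dataframe so that it stays self-contained.

```python
import re
from typing import Iterable, List, Set, Union


def matching_columns_sketch(columns: Iterable[str], regexp: Union[str, List[str]]) -> Set[str]:
    """Return the union of column names that fully match any of the given patterns."""
    patterns = [regexp] if isinstance(regexp, str) else list(regexp)
    compiled = [re.compile(p) for p in patterns]
    # fullmatch requires the whole name to match, not just a prefix or a substring.
    return {col for col in columns if any(rx.fullmatch(col) for rx in compiled)}


names = ["a1", "a2", "b1", "b2", "xa1"]
single = matching_columns_sketch(names, r"a.*")
several = matching_columns_sketch(names, [r"a\d", r"b2"])
print(sorted(single))   # ['a1', 'a2']
print(sorted(several))  # ['a1', 'a2', 'b2']
```

Note that `xa1` is not returned for `a.*`: with full matching, the pattern has to cover the name from its first character to its last.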
- logml.data.utils.parse_df_filter_string(filter_str: str) -> Tuple[Callable, str]
Parse filter string into filter type and filter expression.
- logml.data.utils.apply_fdr_with_nas(p_values: numpy.ndarray, nan_p_value=1.0) -> Tuple[numpy.ndarray, numpy.ndarray]
For a given set of p-values, performs FDR correction (Benjamini/Hochberg).
Fills all NaN p-values with nan_p_value (1.0 by default).
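For readers unfamiliar with the procedure, here is a minimal NumPy sketch of Benjamini/Hochberg adjustment with NaN filling. It is not the library's implementation: the function name, the `alpha` threshold, and the order of the returned pair (rejection mask, adjusted p-values) are assumptions made for illustration.

```python
from typing import Tuple

import numpy as np


def fdr_bh_with_nas_sketch(
    p_values, alpha: float = 0.05, nan_p_value: float = 1.0
) -> Tuple[np.ndarray, np.ndarray]:
    """Return (rejected mask, BH-adjusted p-values); NaNs are first replaced by nan_p_value."""
    p = np.asarray(p_values, dtype=float).copy()
    p[np.isnan(p)] = nan_p_value
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i.
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, walking from the largest p-value downwards.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted <= alpha, adjusted


rejected, adjusted = fdr_bh_with_nas_sketch([0.01, np.nan, 0.04, 0.03])
print(rejected)  # only the first hypothesis is rejected at alpha = 0.05
print(adjusted)
```

Filling NaNs with 1.0 keeps them in the count of tests, so missing p-values make the correction more conservative rather than being silently dropped.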
- logml.data.utils.column_to_np(df_column: pandas.core.series.Series) -> numpy.array
Reshapes a given Series into 2D np.array.
- logml.data.utils.shuffle_column(df_column: pandas.core.series.Series, fraction: float) -> numpy.array
Shuffles a subset of a given column’s elements.
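One way to shuffle only part of a column is to pick a random subset of positions and permute the values among them. The sketch below shows that idea with NumPy. The function name, the `seed` parameter, and the rounding of `fraction * len(values)` are assumptions, and the library may select or permute elements differently.

```python
import numpy as np


def shuffle_fraction_sketch(values, fraction: float, seed=None) -> np.ndarray:
    """Return a copy of `values` where roughly `fraction` of the positions are permuted."""
    rng = np.random.default_rng(seed)
    result = np.array(values, copy=True)
    n_shuffled = int(round(fraction * len(result)))
    # Pick positions without replacement, then permute the values at those positions.
    positions = rng.choice(len(result), size=n_shuffled, replace=False)
    result[positions] = result[rng.permutation(positions)]
    return result


original = np.arange(10)
shuffled = shuffle_fraction_sketch(original, fraction=0.5, seed=0)
print(shuffled)
```

At most `n_shuffled` elements change position, and the multiset of values is preserved, so summary statistics of the column stay the same.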
- logml.data.utils.truncate_column_names(max_name_len: int, df: pandas.core.frame.DataFrame)
Truncates column names of the dataframe to the given length, adding identifying numbers if truncated names collide.
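The truncate-then-deduplicate idea can be sketched on a plain list of names. The suffix format (`_1`, `_2`, ...) and the choice to shorten the name further so the suffix fits within `max_name_len` are assumptions, not the library's documented behaviour, and the function name is hypothetical.

```python
from typing import List


def truncate_names_sketch(names: List[str], max_name_len: int) -> List[str]:
    """Truncate names to max_name_len, appending a numeric suffix when truncated names collide."""
    used = set()
    result = []
    for name in names:
        candidate = name[:max_name_len]
        counter = 0
        while candidate in used:
            counter += 1
            suffix = f"_{counter}"
            # Shorten the base so that base + suffix still fits into max_name_len.
            candidate = name[: max(max_name_len - len(suffix), 0)] + suffix
        used.add(candidate)
        result.append(candidate)
    return result


truncated = truncate_names_sketch(["measurement_alpha", "measurement_beta", "id"], 8)
print(truncated)  # ['measurem', 'measur_1', 'id']
```

With a dataframe, the same function could be applied as `df.columns = truncate_names_sketch(list(df.columns), 8)`.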