logml.data.utils

Functions

_remove_keys(cols, dataset_metadata)

apply_fdr_with_nas(p_values[, nan_p_value])

For a given set of p-values, performs FDR correction (Benjamini/Hochberg).

column_to_np(df_column)

Reshapes a given Series into 2D np.array.

filter_by_group(unused_dataframe, group[, ...])

Filter columns by groups.

filter_by_keyword(dataframe, keyword[, ...])

Filter by a keyword.

filter_columns(dataframe[, ...])

DEPRECATED.

filter_columns2(dataframe[, include, ...])

Improved version of the filter_columns function.

get_matching_regex_columns(dataframe, ...)

Get union of all columns matching regular expressions.

nullify_outliers(data[, mul_std])

Replaces data outside the mean ± mul_std * std range with NaN.

parse_df_filter_string(filter_str)

Parse filter string into filter type and filter expression.

shuffle_column(df_column, fraction)

Shuffles a subset of a given column's elements.

truncate_column_names(max_name_len, df)

Truncates column names of the dataframe to the given length, appending identifying numbers when truncated names collide.

Exceptions

PreprocessingPipelineException

Exception raised when a FeatureExtractionStepException is caught while using the Preprocessing Pipeline API.

exception logml.data.utils.PreprocessingPipelineException

Bases: Exception

Exception raised when a FeatureExtractionStepException is caught while using the Preprocessing Pipeline API.

class logml.data.utils.DataTransformLogItem

Bases: pydantic.main.BaseModel

Describes one action of dataset transformation.

Show JSON schema
{
   "title": "DataTransformLogItem",
   "description": "Describes one action of dataset transformation.",
   "type": "object",
   "properties": {
      "action": {
         "title": "Action",
         "type": "string"
      },
      "action_params": {
         "title": "Action Params",
         "type": "object"
      },
      "action_data": {
         "title": "Action Data",
         "type": "object"
      },
      "input_shape": {
         "title": "Input Shape",
         "type": "array",
         "items": {}
      },
      "output_shape": {
         "title": "Output Shape",
         "type": "array",
         "items": {}
      }
   },
   "required": [
      "action"
   ]
}

Fields
field action: str [Required]
field action_params: Optional[dict] = None
field action_data: Optional[dict] = None
field input_shape: Optional[tuple] = None
field output_shape: Optional[tuple] = None
class logml.data.utils.DataTransformLog

Bases: pydantic.main.BaseModel

Contains history of dataset transformation.

Show JSON schema
{
   "title": "DataTransformLog",
   "description": "Contains history of dataset transformation.",
   "type": "object",
   "properties": {
      "items": {
         "title": "Items",
         "default": [],
         "type": "array",
         "items": {
            "$ref": "#/definitions/DataTransformLogItem"
         }
      }
   },
   "definitions": {
      "DataTransformLogItem": {
         "title": "DataTransformLogItem",
         "description": "Describes one action of dataset transformation.",
         "type": "object",
         "properties": {
            "action": {
               "title": "Action",
               "type": "string"
            },
            "action_params": {
               "title": "Action Params",
               "type": "object"
            },
            "action_data": {
               "title": "Action Data",
               "type": "object"
            },
            "input_shape": {
               "title": "Input Shape",
               "type": "array",
               "items": {}
            },
            "output_shape": {
               "title": "Output Shape",
               "type": "array",
               "items": {}
            }
         },
         "required": [
            "action"
         ]
      }
   }
}

Fields
field items: List[logml.data.utils.DataTransformLogItem] = []
logml.data.utils.nullify_outliers(data: numpy.ndarray, mul_std=3) numpy.ndarray

Replaces data outside the mean ± mul_std * std range with NaN.
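A minimal sketch of the described behavior (a hypothetical re-implementation for illustration; the real function may differ in NaN and edge-case handling):

```python
import numpy as np

def nullify_outliers_sketch(data: np.ndarray, mul_std: float = 3) -> np.ndarray:
    """Replace values outside mean +/- mul_std * std with NaN (sketch)."""
    out = data.astype(float).copy()
    mean, std = np.nanmean(out), np.nanstd(out)
    out[np.abs(out - mean) > mul_std * std] = np.nan
    return out

cleaned = nullify_outliers_sketch(np.array([1.0, 2.0, 1.5, 100.0]), mul_std=1)
# only the extreme value 100.0 falls outside one standard deviation of the mean
```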

logml.data.utils.filter_by_keyword(dataframe: pandas.core.frame.DataFrame, keyword: str, dataset_metadata: Optional[logml.data.metadata.DatasetMetadata] = None, objective: Optional[logml.configuration.modeling.ModelingTaskSpec] = None, **_kwargs) Set[str]

Filter by a keyword.

logml.data.utils.filter_by_group(unused_dataframe: pandas.core.frame.DataFrame, group: str, dataset_metadata: Optional[logml.data.metadata.DatasetMetadata] = None, **_kwargs) Set[str]

Filter columns by groups.

logml.data.utils.filter_columns2(dataframe: pandas.core.frame.DataFrame, include: Optional[List[str]] = None, exclude: Optional[List[str]] = None, dataset_metadata: Optional[logml.data.metadata.DatasetMetadata] = None, objective: Optional[logml.configuration.modeling.ModelingTaskSpec] = None, **kwargs) Optional[List[str]]

Improved version of the filter_columns function. In addition to regex filtering, it supports specialized group and keyword filters.

logml.data.utils.filter_columns(dataframe: pandas.core.frame.DataFrame, regexps_to_include: Optional[List[str]] = None, regexps_to_exclude: Optional[List[str]] = None, metadata: Optional[logml.data.metadata.DatasetMetadata] = None, groups_include: Optional[List[str]] = None, groups_exclude: Optional[List[str]] = None)

DEPRECATED. For given lists of column regexps and group names, returns the list of all matching columns. Metadata is required.

Columns inclusion/exclusion schema:

  • include columns from groups_include

  • include columns matching regexps_to_include

  • exclude columns from groups_exclude

  • exclude columns matching regexps_to_exclude
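The four-step schema above can be sketched with plain sets (the group names and columns here are made up for illustration; the real function resolves groups via DatasetMetadata and columns via regexps):

```python
# Hypothetical groups and columns, for illustration only.
group_cols = {"clinical": {"age", "sex"}, "omics": {"gene_a", "gene_b"}}

selected = set()
selected |= group_cols["clinical"] | group_cols["omics"]  # 1. include groups
selected |= {"id"}                                        # 2. include columns
selected -= group_cols["omics"]                           # 3. exclude groups
selected -= {"sex"}                                       # 4. exclude columns
# selected == {"age", "id"}
```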

logml.data.utils.get_matching_regex_columns(dataframe: pandas.core.frame.DataFrame, regexp: Union[str, List[str]], **_kwargs) Set[str]

Get union of all columns matching regular expressions.

Applies the regular expression(s) to every column name of the dataframe and returns those that fully match the regex.

Typical usage example:

df = pd.DataFrame(columns=['a1', 'a2', 'b1', 'b2'])
get_matching_regex_columns(df, r'^a.*')

Output:

{'a1', 'a2'}
Parameters
  • dataframe – Dataframe whose columns are to be filtered.

  • regexp – Regular expression definition.

Returns

Set of matching columns.

logml.data.utils.parse_df_filter_string(filter_str: str) Tuple[Callable, str]

Parse filter string into filter type and filter expression.

logml.data.utils.apply_fdr_with_nas(p_values: numpy.ndarray, nan_p_value=1.0) Tuple[numpy.ndarray, numpy.ndarray]

For a given set of p-values, performs FDR correction (Benjamini/Hochberg).

Fills all NaN p-values with nan_p_value (1.0 by default).
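The combination of NaN filling and Benjamini/Hochberg step-up adjustment can be sketched as follows (a hypothetical re-implementation; the real function may delegate to a library routine and return values in a different order):

```python
import numpy as np

def apply_fdr_with_nas_sketch(p_values, nan_p_value=1.0, alpha=0.05):
    """Fill NaNs, then apply Benjamini/Hochberg step-up adjustment (sketch)."""
    p = np.asarray(p_values, dtype=float).copy()
    p[np.isnan(p)] = nan_p_value
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # enforce monotone non-decreasing adjusted p-values, largest-first
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.clip(adjusted, 0.0, 1.0)
    return q <= alpha, q

reject, q = apply_fdr_with_nas_sketch(np.array([0.01, 0.04, np.nan, 0.03]))
# the NaN entry is treated as p = 1.0 and is never rejected
```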

logml.data.utils.column_to_np(df_column: pandas.core.series.Series) numpy.ndarray

Reshapes a given Series into 2D np.array.
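The reshape is presumably the standard column-vector conversion (useful e.g. for scikit-learn estimators that require 2D input); a sketch of the equivalent operation:

```python
import numpy as np
import pandas as pd

col = pd.Series([1.0, 2.0, 3.0], name="x")
arr2d = col.to_numpy().reshape(-1, 1)
# arr2d.shape == (3, 1)
```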

logml.data.utils.shuffle_column(df_column: pandas.core.series.Series, fraction: float) numpy.ndarray

Shuffles a subset of a given column’s elements.
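A plausible sketch (the exact sampling strategy is an assumption): pick a random subset of positions covering the given fraction of the column and permute the values at those positions, leaving the rest in place:

```python
import numpy as np
import pandas as pd

def shuffle_column_sketch(df_column: pd.Series, fraction: float, seed: int = 0) -> np.ndarray:
    """Permute values at a randomly chosen fraction of positions (sketch)."""
    rng = np.random.default_rng(seed)
    values = df_column.to_numpy().copy()
    n = int(round(fraction * len(values)))
    idx = rng.choice(len(values), size=n, replace=False)
    values[idx] = rng.permutation(values[idx])
    return values

shuffled = shuffle_column_sketch(pd.Series(range(10)), fraction=0.5)
# same multiset of values, order partially randomized
```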

logml.data.utils.truncate_column_names(max_name_len: int, df: pandas.core.frame.DataFrame)

Truncates column names of the dataframe to the given length, appending identifying numbers when truncated names collide.
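A sketch of the likely deduplication scheme (hypothetical: the real function's suffixing rules may differ, and this version does not re-check secondary collisions):

```python
import pandas as pd

def truncate_column_names_sketch(max_name_len: int, df: pd.DataFrame) -> pd.DataFrame:
    """Truncate column names, appending a counter when truncation collides (sketch)."""
    counts: dict = {}
    new_cols = []
    for name in df.columns:
        short = str(name)[:max_name_len]
        if short in counts:
            counts[short] += 1
            suffix = str(counts[short])
            # make room for the suffix within the length budget
            short = short[: max_name_len - len(suffix)] + suffix
        else:
            counts[short] = 0
        new_cols.append(short)
    out = df.copy()
    out.columns = new_cols
    return out

df = pd.DataFrame(columns=["feature_alpha", "feature_beta", "id"])
renamed = truncate_column_names_sketch(8, df)
# columns become ['feature_', 'feature1', 'id']
```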