logml.data.metadata

Functions

get_dtype_by_name(data_type_name)

Get data type by name

load_incoming_metadata([configured_md, ...])

Attempt to load metadata from the MCT configuration file.

Classes

DatasetMetadata([metadata_config])

Store and manage dataset metadata.

DtypeKind(value)

Map of numpy dtype kinds.

logml.data.metadata.get_dtype_by_name(data_type_name: str) → numpy.dtype

Get data type by name

class logml.data.metadata.DtypeKind(value)

Bases: str, enum.Enum

Map of numpy dtype kinds. See https://numpy.org/doc/stable/reference/generated/numpy.dtype.kind.html.

BOOL = 'b'
INT = 'i'
UINT = 'u'
FLOAT = 'f'
COMPLEX = 'c'
TIMEDELTA = 'm'
DATETIME = 'M'
OBJECT = 'O'
BSTRING = 'S'
UNICODE = 'U'
VOID = 'V'
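
Because DtypeKind derives from both str and enum.Enum, its members compare equal to the one-character kind codes listed above. The sketch below illustrates that behaviour; it re-declares a few members in a local stand-in class so that it runs without logml or numpy installed, and the variable names are illustrative only.

```python
from enum import Enum


class DtypeKind(str, Enum):
    """Local stand-in that mirrors a few members of logml's DtypeKind."""

    BOOL = "b"
    INT = "i"
    FLOAT = "f"
    DATETIME = "M"


# A str-based enum member compares equal to the raw kind code.
int_matches_code = DtypeKind.INT == "i"

# A raw kind code can be converted back to a member by value.
kind = DtypeKind("M")

# Membership tests against a list of members accept plain strings too.
numeric_kinds = [DtypeKind.INT, DtypeKind.FLOAT]
is_numeric = "f" in numeric_kinds
```

This is why a kind code read from a numpy dtype can, in principle, be compared with a DtypeKind member directly, without converting either side first.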

class logml.data.metadata.ColumnMetadata

Bases: logml.configuration.modeling.ColumnMetadataConfig

Column metadata class.

JSON schema:
{
   "title": "ColumnMetadata",
   "description": "Column metadata class.",
   "type": "object",
   "properties": {
      "name": {
         "title": "Name",
         "description": "Column name. Used to refer to column in the dataframe directly.",
         "type": "string"
      },
      "data_type": {
         "title": "Data Type",
         "description": "Data type for the field. The most frequent are `string`, `int`, `float`, `datetime64[ns]`.\n\nIf not specified, it is detected automatically while reading the original dataset.\n\nSee `https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes` for a list of available standard pandas types.",
         "default": "",
         "type": "string"
      },
      "is_categorical": {
         "title": "Is Categorical",
         "description": "Specifies whether a column should be treated as categorical (as opposed to continuous numeric). Not applicable to date-time types, but applicable to string, integer, or float columns.",
         "default": false,
         "type": "boolean"
      },
      "parent_name": {
         "title": "Parent Name",
         "description": "Name of the column from which the current column was produced by a transformation.",
         "type": "string"
      },
      "description": {
         "title": "Description",
         "description": "Column description.",
         "type": "string"
      },
      "group": {
         "title": "Group",
         "description": "Name of a group this column belongs to. If MCT config is provided, set to column input_source name.",
         "type": "string"
      },
      "dtype": {
         "title": "Dtype"
      },
      "properties": {
         "title": "Properties",
         "description": "Stores extra properties - usually introduced by transformers.",
         "default": {},
         "type": "object"
      }
   },
   "required": [
      "name"
   ]
}

Fields
field dtype: numpy.dtype = None
field properties: Dict[str, Any] = {}

Stores extra properties - usually introduced by transformers.
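
To make the schema more concrete, here is a minimal sketch of a column entry that satisfies it. The helper apply_schema_defaults is hypothetical and not part of logml; it only mirrors the required field and the defaults listed in the schema, and the column values are made up for illustration.

```python
from typing import Any, Dict

# Defaults copied from the JSON schema of ColumnMetadata.
SCHEMA_DEFAULTS: Dict[str, Any] = {
    "data_type": "",
    "is_categorical": False,
    "properties": {},
}


def apply_schema_defaults(column: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: check the required field and fill in defaults."""
    if "name" not in column:
        raise ValueError("'name' is the only required field")
    # Copy dict defaults so that entries never share one mutable object.
    filled = {
        key: (dict(value) if isinstance(value, dict) else value)
        for key, value in SCHEMA_DEFAULTS.items()
    }
    filled.update(column)
    return filled


# Illustrative entry: only "name" is mandatory, the rest is optional.
age = apply_schema_defaults(
    {"name": "age", "data_type": "int", "group": "demographics"}
)
```

Fields without a schema default (parent_name, description, group) are simply absent unless the entry supplies them.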

class logml.data.metadata.DatasetMetadata(metadata_config: Optional[logml.configuration.modeling.DatasetMetadataSection] = None, **kwargs)

Bases: object

Store and manage dataset metadata.

get_all_special_names(subset: Optional[List[str]] = None) → List[str]

Returns names of all special columns.

get_target_names(subset: Optional[List[str]] = None) → List[str]

Get list of columns which participate in modeling as targets.

add_column_md(md: Optional[logml.data.metadata.ColumnMetadata] = None, dtype: Optional[numpy.dtype] = None, data_type: str = '', **kwargs) → None

Add metadata for a new column.

get_column_md(name: str) → Optional[logml.data.metadata.ColumnMetadata]

Fetch column metadata

fill_from_mct_config(mct_cfg: dict)

Populates metadata from MCT config. Existing columns are not overridden.

expand_regex(columns: List[str])

Examine list of real columns and replace regexes in metadata.
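
The description above suggests that metadata entries may be keyed by regular expressions, which are replaced by the real column names they match. The following is a rough sketch of that idea, not logml's implementation: the function expand_patterns is hypothetical, metadata is modelled as a plain dict, and full-string matching with re.fullmatch is an assumption (the source does not say how patterns are matched).

```python
import re
from typing import Dict, List


def expand_patterns(metadata: Dict[str, dict], columns: List[str]) -> Dict[str, dict]:
    """Hypothetical stand-in: replace regex keys with the real columns they match."""
    expanded: Dict[str, dict] = {}
    for pattern, md in metadata.items():
        if pattern in columns:
            # A key that is a real column name is kept as it is.
            expanded[pattern] = dict(md)
            continue
        for column in columns:
            # Assumption: the whole column name must match the pattern.
            if re.fullmatch(pattern, column):
                # Keep an entry that is already present for this column.
                expanded.setdefault(column, dict(md))
    return expanded


metadata = {"id": {"data_type": "string"}, r"gene_\d+": {"data_type": "float"}}
result = expand_patterns(metadata, ["id", "gene_1", "gene_2", "age"])
```

In this sketch a pattern that matches no real column is dropped, and a column without any matching entry (here "age") gets no metadata.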

fill_from_dataframe(dataframe: pandas.core.frame.DataFrame) → None

Match and update current metadata.

get_columns_by_dtype(dtypes: Optional[List[logml.data.metadata.DtypeKind]] = None, subset: Optional[List[str]] = None, categorical: Optional[bool] = None) → List[str]

Returns a subset of columns by dtype

Parameters
  • dtypes – column should match any of dtypes. If empty, not applied.

  • categorical – True: include, False: exclude. None: not applied.

  • subset – return only those found columns which match the list.
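
The three filters can be read as independent conditions that a column must pass. Below is a minimal sketch of that reading using a plain dict as stand-in metadata; columns_by_dtype and COLUMNS are hypothetical names, and the interpretation of categorical (True keeps only categorical columns, False keeps only non-categorical ones) is an assumption rather than documented behaviour.

```python
from typing import Dict, List, Optional, Tuple

# Stand-in metadata: column name -> (dtype kind code, is_categorical).
COLUMNS: Dict[str, Tuple[str, bool]] = {
    "age": ("i", False),
    "stage": ("i", True),
    "weight": ("f", False),
    "visit_date": ("M", False),
}


def columns_by_dtype(
    dtypes: Optional[List[str]] = None,
    subset: Optional[List[str]] = None,
    categorical: Optional[bool] = None,
) -> List[str]:
    """Hypothetical re-implementation of the three filters described above."""
    found = []
    for name, (kind, is_categorical) in COLUMNS.items():
        if dtypes and kind not in dtypes:
            continue  # dtype filter: the kind must match any of the given kinds
        if categorical is not None and is_categorical != categorical:
            continue  # assumption: True keeps categorical only, False the rest
        if subset is not None and name not in subset:
            continue  # subset filter: keep only the listed columns
        found.append(name)
    return found


numeric = columns_by_dtype(dtypes=["i", "f"], categorical=False)
categorical_ints = columns_by_dtype(dtypes=["i"], categorical=True)
restricted = columns_by_dtype(subset=["weight", "visit_date", "missing"])
```

Names in subset that are not present in the metadata (such as "missing" here) are ignored in this sketch.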

get_numerical_columns(subset: Optional[List[str]] = None) → List[str]

Returns a subset of numerical columns, excluding categorical

get_categorical_columns(subset: Optional[List[str]] = None) → List[str]

Returns a subset of categorical columns.

get_date_columns(subset: Optional[List[str]] = None) → List[str]

Returns a subset of datetime columns

to_dict() → dict

Pack all data to a dumpable entity.

logml.data.metadata.load_incoming_metadata(configured_md: Optional[logml.configuration.modeling.DatasetMetadataSection] = None, dataset_path: Optional[str] = None, mct_config_path: Optional[str] = None) → logml.data.metadata.DatasetMetadata

Attempt to load metadata from the MCT configuration file.

When dataset path is not provided, attempt to look up the config file by naming convention:

  • dataset: “mct.{project-date-version}.csv”

  • config: “config.{project-date-version}.yaml”
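
The naming convention can be illustrated by deriving the config file name from a dataset file name. This is only a sketch of the convention: guess_config_path is a hypothetical helper, not a logml function, and it assumes that the config file sits in the same directory as the dataset.

```python
from pathlib import Path
from typing import Optional

PREFIX = "mct."
SUFFIX = ".csv"


def guess_config_path(dataset_path: str) -> Optional[Path]:
    """Hypothetical helper: map mct.{tag}.csv to config.{tag}.yaml."""
    path = Path(dataset_path)
    name = path.name
    follows_convention = (
        name.startswith(PREFIX)
        and name.endswith(SUFFIX)
        and len(name) > len(PREFIX) + len(SUFFIX)
    )
    if not follows_convention:
        return None  # the file name does not follow the convention
    # The tag is the {project-date-version} part between prefix and suffix.
    tag = name[len(PREFIX):-len(SUFFIX)]
    # Assumption: the config file lives next to the dataset file.
    return path.with_name(f"config.{tag}.yaml")


guessed = guess_config_path("data/mct.demo-20200101-v1.csv")
unrelated = guess_config_path("data/other.csv")
```

The tag "demo-20200101-v1" is a made-up value standing in for {project-date-version}.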