logml.data.metadata
Functions
- get_dtype_by_name: Get data type by name.
- load_incoming_metadata: Attempt to load metadata from the MCT configuration file.
Classes
- DatasetMetadata: Store and manage dataset metadata.
- DtypeKind: Map of numpy dtype kinds.
- logml.data.metadata.get_dtype_by_name(data_type_name: str) -> numpy.dtype
Get data type by name.
- class logml.data.metadata.DtypeKind(value)
Bases: str, enum.Enum
Map of numpy dtype kinds. See https://numpy.org/doc/stable/reference/generated/numpy.dtype.kind.html.
- BOOL = 'b'
- INT = 'i'
- UINT = 'u'
- FLOAT = 'f'
- COMPLEX = 'c'
- TIMEDELTA = 'm'
- DATETIME = 'M'
- OBJECT = 'O'
- BSTRING = 'S'
- UNICODE = 'U'
- VOID = 'V'
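Because DtypeKind derives from both str and enum.Enum, its members compare equal to the one-character codes that numpy reports in dtype.kind. The sketch below illustrates that property with a local stand-in class that copies a few members from the listing above; it does not import logml, and the variable names are illustrative only.

```python
from enum import Enum


class DtypeKind(str, Enum):
    """Local stand-in with a few of the members listed above."""

    BOOL = "b"
    INT = "i"
    FLOAT = "f"
    DATETIME = "M"


# A str-mixin enum member compares equal to the plain kind code,
# which is the form numpy exposes as `dtype.kind`.
kind_code = "f"  # e.g. the value of numpy.dtype("float64").kind
assert DtypeKind.FLOAT == kind_code

# Looking a member up by value turns a raw code back into the enum.
assert DtypeKind("M") is DtypeKind.DATETIME

# Kind codes are case-sensitive: 'm' (timedelta) and 'M' (datetime) differ.
is_datetime = DtypeKind.DATETIME == "m"
```

Here is_datetime ends up False, which is why the TIMEDELTA and DATETIME members can coexist with codes that differ only in case.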
- class logml.data.metadata.ColumnMetadata
Bases: logml.configuration.modeling.ColumnMetadataConfig
Column metadata class.
JSON schema:
{
  "title": "ColumnMetadata",
  "description": "Column metadata class.",
  "type": "object",
  "properties": {
    "name": {
      "title": "Name",
      "description": "Column name. Used to refer to column in the dataframe directly.",
      "type": "string"
    },
    "data_type": {
      "title": "Data Type",
      "description": "Data type for the field. Most frequent are `string`, `int`, `float`, `datetime64[ns]`.\n\nIf not specified, automatically detected while reading original dataset.\n\nSee `https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes` for a list of available standard pandas types.",
      "default": "",
      "type": "string"
    },
    "is_categorical": {
      "title": "Is Categorical",
      "description": "Specify if a column should be considered as categorical (opposed to continuous numeric). Not applicable to date-time types, but can be string, integer or float.",
      "default": false,
      "type": "boolean"
    },
    "parent_name": {
      "title": "Parent Name",
      "description": "Name of the column that was used to produce the current column as a result of a transformation.",
      "type": "string"
    },
    "description": {
      "title": "Description",
      "description": "Column description.",
      "type": "string"
    },
    "group": {
      "title": "Group",
      "description": "Name of a group this column belongs to. If MCT config is provided, set to column input_source name.",
      "type": "string"
    },
    "dtype": {
      "title": "Dtype"
    },
    "properties": {
      "title": "Properties",
      "description": "Stores extra properties - usually introduced by transformers.",
      "default": {},
      "type": "object"
    }
  },
  "required": [
    "name"
  ]
}
- field dtype: numpy.dtype = None
- field properties: Dict[str, Any] = {}
Stores extra properties - usually introduced by transformers.
- class logml.data.metadata.DatasetMetadata(metadata_config: Optional[logml.configuration.modeling.DatasetMetadataSection] = None, **kwargs)
Bases: object
Store and manage dataset metadata.
- get_all_special_names(subset: Optional[List[str]] = None) -> List[str]
Returns names of all special columns.
- get_target_names(subset: Optional[List[str]] = None) -> List[str]
Get list of columns which participate in modeling as targets.
- add_column_md(md: Optional[logml.data.metadata.ColumnMetadata] = None, dtype: Optional[numpy.dtype] = None, data_type: str = '', **kwargs) -> None
Add metadata for a new column.
- get_column_md(name: str) -> Optional[logml.data.metadata.ColumnMetadata]
Fetch column metadata.
- fill_from_mct_config(mct_cfg: dict)
Populates metadata from MCT config. Existing columns are not overridden.
- expand_regex(columns: List[str])
Examine list of real columns and replace regexes in metadata.
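The entry above does not spell out how regexes are replaced. One plausible reading is that a metadata entry whose name is a regular expression gets replaced by one entry per real column that fully matches it. The sketch below illustrates that reading only; expand_patterns, the dict-based metadata layout, and the precedence rules are all assumptions made for illustration, not logml's implementation.

```python
import re
from typing import Dict, List


def expand_patterns(metadata: Dict[str, dict], columns: List[str]) -> Dict[str, dict]:
    """Replace pattern-named entries with one entry per matching real column."""
    expanded: Dict[str, dict] = {}
    for name, props in metadata.items():
        if name in columns:
            expanded[name] = dict(props)  # exact column name: keep as is
            continue
        pattern = re.compile(name)
        matches = [col for col in columns if pattern.fullmatch(col)]
        for col in matches:
            # Assumed precedence: an entry already present for this column wins.
            expanded.setdefault(col, dict(props))
        if not matches:
            expanded[name] = dict(props)  # nothing matched: leave untouched
    return expanded


# Hypothetical metadata: one exact name and one pattern.
metadata = {"age": {"is_categorical": False}, r"gene_\d+": {"group": "genes"}}
result = expand_patterns(metadata, ["age", "gene_1", "gene_22", "site"])
```

With these inputs, result holds entries for age, gene_1 and gene_22; the site column has no metadata entry because nothing in the metadata names or matches it.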
- fill_from_dataframe(dataframe: pandas.core.frame.DataFrame) -> None
Match and update current metadata.
- get_columns_by_dtype(dtypes: Optional[List[logml.data.metadata.DtypeKind]] = None, subset: Optional[List[str]] = None, categorical: Optional[bool] = None) -> List[str]
Returns a subset of columns by dtype.
- Parameters
dtypes – a column must match any of the listed dtypes. If empty, the filter is not applied.
categorical – True: include, False: exclude. None: not applied.
subset – return only those found columns which match the list.
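The three filters described above can be sketched as a plain loop over a column table. This is an illustration under stated assumptions, not the library's code: select_columns, the COLUMNS table, and the local DtypeKind stand-in are hypothetical, and categorical=True is read here as "keep only categorical columns" while False is read as "keep only non-categorical columns".

```python
from enum import Enum
from typing import Dict, List, Optional, Tuple


class DtypeKind(str, Enum):
    """Local stand-in with a few of the documented members."""

    INT = "i"
    FLOAT = "f"
    DATETIME = "M"


# Hypothetical metadata table: column name -> (dtype kind, is_categorical).
COLUMNS: Dict[str, Tuple[DtypeKind, bool]] = {
    "age": (DtypeKind.INT, False),
    "stage": (DtypeKind.INT, True),
    "weight": (DtypeKind.FLOAT, False),
    "visit_date": (DtypeKind.DATETIME, False),
}


def select_columns(
    dtypes: Optional[List[DtypeKind]] = None,
    subset: Optional[List[str]] = None,
    categorical: Optional[bool] = None,
) -> List[str]:
    """Apply each filter only when it is set; unset filters are skipped."""
    selected = []
    for name, (kind, is_categorical) in COLUMNS.items():
        if dtypes and kind not in dtypes:
            continue  # dtype filter: must match any of the requested kinds
        if categorical is not None and is_categorical != categorical:
            continue  # categorical filter: assumed to mean "only" / "none"
        if subset is not None and name not in subset:
            continue  # subset filter: restrict to the requested names
        selected.append(name)
    return selected


numeric = select_columns(dtypes=[DtypeKind.INT, DtypeKind.FLOAT], categorical=False)
```

In this sketch numeric contains age and weight: stage is dropped by the categorical filter and visit_date by the dtype filter.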
- get_numerical_columns(subset: Optional[List[str]] = None) -> List[str]
Returns a subset of numerical columns, excluding categorical
- get_categorical_columns(subset: Optional[List[str]] = None) -> List[str]
Returns a subset of categorical columns.
- get_date_columns(subset: Optional[List[str]] = None) -> List[str]
Returns a subset of datetime columns
- to_dict() -> dict
Pack all data to a dumpable entity.
- logml.data.metadata.load_incoming_metadata(configured_md: Optional[logml.configuration.modeling.DatasetMetadataSection] = None, dataset_path: Optional[str] = None, mct_config_path: Optional[str] = None) -> logml.data.metadata.DatasetMetadata
Attempt to load metadata from the MCT configuration file.
When dataset path is not provided, attempt to look up the config file by naming convention:
dataset: “mct.{project-date-version}.csv”
config: “config.{project-date-version}.yaml”
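The naming convention pairs a dataset file with a config file that shares the same {project-date-version} part. A minimal sketch of that mapping follows. guess_config_path is a hypothetical helper, not part of logml; it assumes the config file sits in the same directory as the dataset, and the example file name is made up.

```python
from pathlib import Path
from typing import Optional

DATASET_PREFIX = "mct."
CONFIG_PREFIX = "config."


def guess_config_path(dataset_path: str) -> Optional[Path]:
    """Map 'mct.{suffix}.csv' to a sibling 'config.{suffix}.yaml', else None."""
    path = Path(dataset_path)
    name = path.name
    if not (name.startswith(DATASET_PREFIX) and name.endswith(".csv")):
        return None  # file name does not follow the convention
    suffix = name[len(DATASET_PREFIX):-len(".csv")]
    if not suffix:
        return None  # 'mct.csv' carries no project-date-version part
    # Assumption: the config lives next to the dataset.
    return path.with_name(f"{CONFIG_PREFIX}{suffix}.yaml")


candidate = guess_config_path("data/mct.study-2021-01-v1.csv")
```

A caller would still need to check that the candidate path exists before trying to read it.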