logml.data.datasets.base

Functions

abstract_field()

Declare an abstract field that an inheriting class must set.

Classes

BaseDataset(*dont_use_positional_args[, ...])

Base dataset: provides dataframe with metadata.

CrossValidationMixin([cross_validator])

Defines cross-validation behaviour of dataset.

logml.data.datasets.base.abstract_field() property

Declare an abstract field that an inheriting class must set.
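The idea of a field that an inheritor must set can be illustrated with a small stand-in. This is only a sketch of the pattern, not the logml implementation: the function body and the `Base` and `Concrete` classes below are hypothetical, and the only thing taken from this page is that `abstract_field()` is documented as a property.

```python
def abstract_field():
    """Hypothetical stand-in: a property that fails until a subclass overrides it."""

    def _getter(self):
        raise NotImplementedError(
            f"{type(self).__name__} must set this field"
        )

    return property(_getter)


class Base:
    # Declared here, expected to be set by an inheritor.
    LABEL = abstract_field()


class Concrete(Base):
    # A plain class attribute replaces the abstract property.
    LABEL = "concrete"


label = Concrete().LABEL
```

Reading `LABEL` on `Base` raises `NotImplementedError` in this sketch, while `Concrete` returns the value it set.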

class logml.data.datasets.base.BaseDataset(*dont_use_positional_args, dataset_metadata: Optional[logml.data.metadata.DatasetMetadata] = None, dataframe: Optional[pandas.core.frame.DataFrame] = None, logger=None, generator_type: str = 'plain', **kwargs)

Bases: object

Base dataset: provides dataframe with metadata.

LABEL = 'base_dataset'
validate_metadata(raise_error=False) → None

Validates dataset metadata.

Raises

ValueError – If columns listed in metadata are not present in the dataframe.
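The kind of check described above can be sketched as follows. This is an illustration under assumptions, not the library's code: `find_missing_columns` is a hypothetical helper, the metadata is reduced to a plain list of column names, and returning the missing names when `raise_error` is false is a choice made for this sketch only (the documented method returns None).

```python
import pandas as pd


def find_missing_columns(dataframe, metadata_columns, raise_error=False):
    """Hypothetical helper: report metadata columns absent from the dataframe."""
    missing = [name for name in metadata_columns if name not in dataframe.columns]
    if missing and raise_error:
        raise ValueError(
            f"Columns listed in metadata are not present in the dataframe: {missing}"
        )
    return missing


df = pd.DataFrame({"age": [34, 51], "response": [0, 1]})

# "dose" is listed in the (hypothetical) metadata but is not a dataframe column.
missing = find_missing_columns(df, ["age", "dose", "response"])
```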

property dataframe: pandas.core.frame.DataFrame

Get underlying pandas dataframe.

get_hash() → str

Return a hash for the dataframe.

NOTE: the hash does not survive serialization, so after you dump and load a dataset, the loaded dataset's hash will differ from the original one.

dump(path: Union[str, pathlib.Path], metadata_path: Optional[Union[str, pathlib.Path]] = None) → None

Saves dataset to disk.

classmethod load(path: Union[str, pathlib.Path], metadata_path: Optional[Union[str, pathlib.Path]] = None) → logml.data.datasets.base.BaseDataset

Load dataset from pair of files (data + metadata).

get_features_list() → List[str]

Return a list of ‘feature’ columns. For the base dataset, this is all columns except the special ones.

get_features_dataframe(set_index=True) → pandas.core.frame.DataFrame

Return subset of current dataframe with feature columns.

get_targets_dataframe(set_index=True) → pandas.core.frame.DataFrame

Return subset of current dataframe with target columns.
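The split into feature and target subsets can be pictured with plain pandas. The sketch below makes several assumptions that this page does not confirm: the column names are invented, the "special" columns are taken to be an identifier column plus the target columns, and `set_index=True` is interpreted as indexing the result by the identifier column.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "patient_id": [1, 2, 3],
        "age": [34, 51, 47],
        "dose": [0.5, 1.0, 0.75],
        "response": [0, 1, 1],
    }
)

id_column = "patient_id"        # hypothetical identifier column
target_columns = ["response"]   # hypothetical target columns

# Features: every column that is not "special" (identifier or target here).
special_columns = {id_column, *target_columns}
feature_columns = [name for name in df.columns if name not in special_columns]

# Assumed meaning of set_index=True: index rows by the identifier column.
indexed = df.set_index(id_column)
features_df = indexed[feature_columns]
targets_df = indexed[target_columns]
```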

class logml.data.datasets.base.CrossValidationMixin(cross_validator: Optional[Union[sklearn.model_selection._split.BaseCrossValidator, Iterable]] = None, **kwargs)

Bases: abc.ABC

Defines cross-validation behaviour of dataset.

abstract property cv_dataframe: pandas.core.frame.DataFrame

Returns CV dataframe.

abstract property cv_features: numpy.array

Returns features from CV dataframe.

abstract property cv_targets: numpy.array

Returns targets from CV dataframe.

property n_folds: int

Get number of CV folds.

get_cv_generator() → Iterator[Tuple[numpy.ndarray, numpy.ndarray]]

Returns an iterable of CV train/test indices.

Yields

tuple – Pair of train and test indices: (train, test).
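For orientation, the `(train, test)` index pairs have the same shape as what a scikit-learn cross-validator produces. The sketch below uses scikit-learn's `KFold` directly rather than any logml call; how the mixin consumes the cross-validator internally is not described here and is assumed to be similar.

```python
import numpy as np
from sklearn.model_selection import KFold

features = np.arange(12).reshape(6, 2)  # 6 samples, 2 feature columns

# Without shuffling, KFold assigns contiguous blocks of samples to each test fold.
cross_validator = KFold(n_splits=3)

n_folds = cross_validator.get_n_splits(features)
index_pairs = list(cross_validator.split(features))

for train_idx, test_idx in index_pairs:
    print("train:", train_idx, "test:", test_idx)
```

Each pair holds two integer arrays that index into the rows of the data, which is the form the `Yields` entry above describes.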

get_folds_generator() → Iterator[Tuple[Tuple[numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray]]]

Returns an iterable of CV train/test data arrays (as opposed to indices).

Yields

tuple(tuple(x_train, y_train), tuple(x_test, y_test)) – Train-test folds, split to X and y parts.
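The difference between index pairs and data folds can be shown with a short generator. This is a sketch of the nesting only: `folds_from_indices` is a hypothetical helper, and the index pairs are given as a plain list, which also illustrates the `Iterable` form that `cross_validator` accepts according to the class signature.

```python
import numpy as np


def folds_from_indices(features, targets, index_pairs):
    """Hypothetical helper: turn (train, test) index pairs into data folds."""
    for train_idx, test_idx in index_pairs:
        yield (
            (features[train_idx], targets[train_idx]),
            (features[test_idx], targets[test_idx]),
        )


features = np.arange(12).reshape(6, 2)
targets = np.array([0, 1, 0, 1, 0, 1])

# Hand-written index pairs standing in for a cross-validator.
index_pairs = [
    (np.array([2, 3, 4, 5]), np.array([0, 1])),
    (np.array([0, 1, 4, 5]), np.array([2, 3])),
]

folds = list(folds_from_indices(features, targets, index_pairs))
(x_train, y_train), (x_test, y_test) = folds[0]
```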

set_cv(cross_validator)

Update the cross-validator used by the dataset.