logml.data.datasets.survival_dataset

Functions

make_survival_target(dataframe, event_query, ...)

Generate survival targets in scikit-survival library style.

Classes

SurvivalDataset(*dont_use_positional_args[, ...])

Dataset extension for survival analysis.

logml.data.datasets.survival_dataset.make_survival_target(dataframe: pandas.core.frame.DataFrame, event_query: str, time_column: str) → numpy.array

Generate survival targets in scikit-survival library style.

The result is a NumPy structured array with shape=(nrows,) and two fields: event (np.bool) and time (np.float64).

Parameters

dataframe – Dataframe to fetch columns from.
event_query – Event query. If the query result dtype is not bool, it is cast to bool.
time_column – Time column. Cast to float64.

Returns

Survival targets in scikit-survival style.

Return type

np.array
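As a rough illustration of the structure described above, the sketch below builds a comparable structured array with NumPy and pandas. It is not the LogML implementation: the helper name make_survival_target_sketch is hypothetical, and evaluating event_query with DataFrame.eval is an assumption about how the query might be applied.

```python
import numpy as np
import pandas as pd


def make_survival_target_sketch(dataframe, event_query, time_column):
    """Hypothetical stand-in that mimics the documented output layout."""
    # Assumption: the query is evaluated pandas-style, then cast to bool.
    events = dataframe.eval(event_query).astype(bool).to_numpy()
    # The time column is cast to float64, as the parameter docs state.
    times = dataframe[time_column].astype(np.float64).to_numpy()

    # Structured array with one (event, time) record per row.
    target = np.empty(
        len(dataframe),
        dtype=[("event", np.bool_), ("time", np.float64)],
    )
    target["event"] = events
    target["time"] = times
    return target


df = pd.DataFrame({"cens": [1, 0, 1], "time": [5, 12, 7]})
y = make_survival_target_sketch(df, "cens == 1", "time")
```

Each field can then be read by name, for example y["event"] for the boolean indicators and y["time"] for the durations.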

class logml.data.datasets.survival_dataset.UnivariateSurvivalContainer

Bases: pydantic.main.BaseModel

Wrapper for survival target (events and times) and one feature.

JSON schema

{
   "title": "UnivariateSurvivalContainer",
   "description": "Wrapper for survival target (events and times) and one feature.",
   "type": "object",
   "properties": {
      "column_name": {
         "title": "Column Name",
         "description": "Name of the only column of interest.",
         "type": "string"
      },
      "events": {
         "title": "Events",
         "description": "List of indicators of whether events occurred.",
         "type": "array",
         "items": {}
      },
      "times": {
         "title": "Times",
         "description": "List of time-to-event measurements (OS, PFS, etc.).",
         "type": "array",
         "items": {
            "type": "number"
         }
      },
      "values": {
         "title": "Values",
         "description": "List of values that correspond to the only variable.",
         "type": "array",
         "items": {}
      },
      "threshold": {
         "title": "Threshold",
         "description": "Threshold that is used to split the values into Low and High groups. NOTE: applicable to numerical values only.",
         "type": "number"
      }
   },
   "required": [
      "column_name",
      "events",
      "times",
      "values"
   ]
}

Fields

column_name (str)
events (List)
threshold (float)
times (List[float])
values (List)

field column_name: str [Required]: Name of the only column of interest.

field events: List [Required]: List of indicators of whether events occurred.

field times: List[float] [Required]: List of time-to-event measurements (OS, PFS, etc.).

field values: List [Required]: List of values that correspond to the only variable.

field threshold: float = None: Threshold that is used to split the values into Low and High groups. NOTE: applicable to numerical values only.

discretize_values()

Binarizes values based on a threshold.

If the groups are not already discrete, the median is used as the threshold to create two groups, ‘Low’ and ‘High’.
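The median-based split can be sketched in a few lines. This is only an illustration, not the method's actual code: the function name binarize_values is hypothetical, and treating values strictly above the cut-off as ‘High’ is an assumption, since the documentation does not say how ties are handled.

```python
from statistics import median


def binarize_values(values, threshold=None):
    """Hypothetical helper: label each value 'Low' or 'High'."""
    # Fall back to the median when no explicit threshold is given.
    cut = median(values) if threshold is None else threshold
    # Assumption: values strictly above the cut-off count as 'High'.
    return ["High" if v > cut else "Low" for v in values]


groups = binarize_values([0.2, 1.5, 0.9, 3.1])
groups_fixed = binarize_values([1, 2, 3], threshold=2.5)
```

With four input values the median falls between the two middle ones, so the sample above ends up with two values in each group.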

groups_to_str() → str: Returns a string representation of available values.

get_valid_cut_offs(n_percentiles: int, min_population: float) → List[float]: Returns a list of percentiles that split the range of the groups into valid parts.

property size: int: Returns the number of samples within the container.

class logml.data.datasets.survival_dataset.SurvivalDataset(*dont_use_positional_args, dataset_metadata: Optional[logml.data.metadata.DatasetMetadata] = None, dataframe: Optional[pandas.core.frame.DataFrame] = None, objective_cfg: Optional[logml.configuration.modeling.ModelingTaskSpec] = None, cross_validator: Optional[Union[sklearn.model_selection._split.BaseCrossValidator, Iterable]] = None, features: Optional[List[str]] = None, logger=None, **kwargs)

Bases: logml.data.datasets.cv_dataset.ModelingDataset, logml.data.datasets.base.CrossValidationMixin

Dataset extension for survival analysis. Expects the special field ‘event_column’ to be present; by default it is included in the target variable in the form (event, survival_time).

Example config:

    modeling:
      problems:
        y_regression:
          metadata:
            task: survival
            target: 'time'
            event_query: 'cens == 1'  # query to generate boolean value.
            # event column (not to mix it with features)
            event_column: 'cens'
            target_metric: cindex

Note about event column:

Use ‘event_query’ to specify which values map to an event (say, ‘x1 == "YES"’ or ‘zz == 0’).
Use ‘event_column’ to specify which column was used for the event query and should therefore be excluded from the general list of features.
The result of ‘event_query’ is cast to boolean and used as the event indicator for the downstream model.
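The interplay between the two settings can be illustrated with a small pandas sketch. The column names and data are made up for the example, and evaluating the query with DataFrame.eval is an assumption rather than a statement about how LogML processes it.

```python
import pandas as pd

# Made-up data: 'status' holds the raw event information.
df = pd.DataFrame({
    "time": [5.0, 12.0, 7.0],
    "status": ["YES", "NO", "YES"],
    "age": [61, 54, 70],
})

event_query = 'status == "YES"'
event_column = "status"
time_column = "time"

# Assumption: the query is evaluated pandas-style and cast to bool.
events = df.eval(event_query).astype(bool)

# The event column and the target time column stay out of the features.
features = [c for c in df.columns if c not in (event_column, time_column)]
```

Here only age remains as a feature, while status contributes solely through the boolean event indicator.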

LABEL = 'cv_survival_dataset'

get_target_values() → numpy.array: Fetches an array with shape (rows,) in which each element is a tuple (event_column, target_column).

property event_column: Event column for survival analysis.

property event_query: Event query for survival analysis.

get_target_columns() → List: Returns list of target columns.

get_univariate_container(column_name: str, drop_nans=True) → Optional[logml.data.datasets.survival_dataset.UnivariateSurvivalContainer]

Returns the targets and the values of a given column, wrapped in a container.

If drop_nans is True, samples with NaN in the column values are dropped.
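The NaN handling can be pictured as filtering events, times and values together so that they stay aligned. The helper below is a hypothetical sketch of that idea using only the standard library; it does not reproduce the method's real logic.

```python
import math


def drop_nan_samples(events, times, values):
    """Hypothetical helper: drop samples whose feature value is NaN."""
    kept = [
        (e, t, v)
        for e, t, v in zip(events, times, values)
        # Only float values can be NaN; other types are always kept.
        if not (isinstance(v, float) and math.isnan(v))
    ]
    if not kept:
        return [], [], []
    kept_events, kept_times, kept_values = zip(*kept)
    return list(kept_events), list(kept_times), list(kept_values)


events, times, values = drop_nan_samples(
    [True, False, True],
    [5.0, 12.0, 7.0],
    [0.3, float("nan"), 1.1],
)
```

Filtering the three lists in one pass keeps each remaining value paired with its own event indicator and time.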