lifelines.utils

lifelines.utils.qth_survival_times(q, survival_functions, cdf=False)

Find the times when one or more survival functions reach the qth percentile.

Parameters:
  • q (float or array) – a value between 0 and 1 (or an array of such values): the level q at which to find the time the survival function hits the qth percentile.
  • survival_functions (a (n,d) DataFrame or numpy array) – If a DataFrame, index values (actual times) are returned. If a numpy array, indices are returned.
  • cdf (boolean, optional) – Set cdf=True when working with left-censored data.
Returns:

If d == 1, returns a float (np.inf if the value is never crossed). If d > 1, returns a DataFrame containing the first times the value was crossed.

Return type:

float, or DataFrame
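
The lookup described above can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper name first_crossing_time and the toy curve are hypothetical, and it assumes that "reaching" q means the first time the curve is at or below q.

```python
import math


def first_crossing_time(q, times, survival_values):
    """Return the first time at which the survival curve is at or below q.

    Assumes `times` is sorted ascending and `survival_values` holds the
    (non-increasing) survival probability at each of those times.
    """
    for t, s in zip(times, survival_values):
        if s <= q:
            return t
    return math.inf  # the curve never drops to q


# Hypothetical survival curve, for illustration only.
times = [0, 6, 7, 9, 13, 15]
survival = [1.0, 0.9, 0.7, 0.45, 0.2, 0.0]

median = first_crossing_time(0.5, times, survival)
never = first_crossing_time(0.1, times[:4], survival[:4])
```

Here median is 9, the first time the curve is at or below 0.5, and never is math.inf because the truncated curve stops at 0.45.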

lifelines.utils.qth_survival_time(q, survival_function, cdf=False)

Returns the time when a single survival function reaches the qth percentile.

Parameters:
  • q (float) – a float between 0 and 1: the level q at which to find the time the survival function hits the qth percentile.
  • survival_function (Series or single-column DataFrame)
  • cdf (boolean, optional) – Set cdf=True when working with left-censored data.
Returns:

Return type:

float

lifelines.utils.median_survival_times(density_or_survival_function, left_censorship=False)

lifelines.utils.survival_table_from_events(death_times, event_observed, birth_times=None, columns=['removed', 'observed', 'censored', 'entrance', 'at_risk'], weights=None, collapse=False, intervals=None)
Parameters:
  • death_times ((n,) array) – represent the event times
  • event_observed ((n,) array) – 1 if the event was observed, 0 if it was censored.
  • birth_times (a (n,) array, optional) – representing when the subject was first observed. A subject’s death event is then at [birth times + duration observed]. If None (default), birth_times are set to be the first observation or 0, whichever is smaller.
  • columns (iterable, optional) – a 5-length array naming, in order, the removed individuals, observed deaths, censorships, entrances and at-risk counts.
  • weights ((n,1) array, optional) – Optional argument to use weights for individuals. Assumes weights of 1 if not provided.
  • collapse (boolean, optional (default=False)) – If True, collapses survival table into lifetable to show events in interval bins
  • intervals (iterable, optional) – Default None, otherwise a list/(n,1) array of interval edge measures. If left as None while collapse=True, the Freedman-Diaconis rule for histogram bins is used to determine the intervals.
Returns:

A pandas DataFrame indexed by the unique times (or intervals) in death_times. The column ‘removed’ is the number of individuals who were removed from the population by the end of the period. The column ‘observed’ is the number of removed individuals who were observed to have died (i.e. not censored). The column ‘censored’ is defined as ‘removed’ - ‘observed’: the number of individuals who left the population without an observed event.

Return type:

DataFrame

Example

>>> #Uncollapsed output
>>>           removed  observed  censored  entrance   at_risk
>>> event_at
>>> 0               0         0         0        11        11
>>> 6               1         1         0         0        11
>>> 7               2         2         0         0        10
>>> 9               3         3         0         0         8
>>> 13              3         3         0         0         5
>>> 15              2         2         0         0         2
>>> #Collapsed output
>>>          removed observed censored at_risk
>>>              sum      sum      sum     max
>>> event_at
>>> (0, 2]        34       33        1     312
>>> (2, 4]        84       42       42     278
>>> (4, 6]        64       17       47     194
>>> (6, 8]        63       16       47     130
>>> (8, 10]       35       12       23      67
>>> (10, 12]      24        5       19      32
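
The column definitions above can be illustrated with a small pure-Python sketch. It is not the library's code: the helper survival_table is hypothetical, it assumes every subject enters at time 0, and it omits the entrance column and the time-0 row. The toy durations are chosen to reproduce the counts in the uncollapsed output.

```python
from collections import Counter


def survival_table(durations, event_observed):
    """Tabulate removed/observed/censored/at_risk per unique duration.

    Simplifying assumption: every subject enters at time 0.
    """
    observed = Counter(t for t, e in zip(durations, event_observed) if e)
    censored = Counter(t for t, e in zip(durations, event_observed) if not e)
    at_risk = len(durations)
    rows = []
    for t in sorted(set(durations)):
        removed = observed[t] + censored[t]
        rows.append({
            "event_at": t,
            "removed": removed,
            "observed": observed[t],
            "censored": censored[t],
            "at_risk": at_risk,  # subjects still present just before time t
        })
        at_risk -= removed
    return rows


# Toy data: 11 subjects, all with an observed event.
durations = [6, 7, 7, 9, 9, 9, 13, 13, 13, 15, 15]
events = [1] * len(durations)
table = survival_table(durations, events)
```

Each row's at_risk is the previous row's at_risk minus the previous row's removed, which gives 11, 10, 8, 5, 2 for this data.
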
lifelines.utils.group_survival_table_from_events(groups, durations, event_observed, birth_times=None, limit=-1)

Joins multiple event series together into DataFrames. A generalization of survival_table_from_events to data with groups. Before version 0.2.3 this function was called group_event_series.

Parameters:
  • groups (a (n,) array) – individuals’ group ids.
  • durations (a (n,) array) – durations of each individual
  • event_observed (a (n,) array) – event observations, 1 if observed, 0 else.
  • birth_times (a (n,) array) – when the subject was first observed. A subject’s death event is then at [birth times + duration observed]. Normally set to all zeros, but can be positive or negative.
  • limit
Returns:

  • unique_groups (np.array) – array of all the unique groups present
  • removed (DataFrame) – DataFrame of removal count data at event_times for each group, column names are ‘removed:<group name>’
  • observed (DataFrame) – DataFrame of observed count data at event_times for each group, column names are ‘observed:<group name>’
  • censored (DataFrame) – DataFrame of censored count data at event_times for each group, column names are ‘censored:<group name>’

Example

>>> #input
>>> group_survival_table_from_events(waltonG, waltonT, np.ones_like(waltonT)) #data available in test_suite.py
>>> #output
>>> [
>>>     array(['control', 'miR-137'], dtype=object),
>>>               removed:control  removed:miR-137
>>>     event_at
>>>     6                       0                1
>>>     7                       2                0
>>>     9                       0                3
>>>     13                      0                3
>>>     15                      0                2
>>>     ,
>>>               observed:control  observed:miR-137
>>>     event_at
>>>     6                        0                 1
>>>     7                        2                 0
>>>     9                        0                 3
>>>     13                       0                 3
>>>     15                       0                 2
>>>     ,
>>>               censored:control  censored:miR-137
>>>     event_at
>>>     6                        0                 0
>>>     7                        0                 0
>>>     9                        0                 0
>>>     ,
>>> ]
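
As a rough illustration of the per-group counting, the sketch below builds only the observed table using plain dictionaries. The helper group_observed_table is hypothetical, and the toy arrays are chosen to reproduce the counts shown above; they are not the actual waltonG and waltonT data.

```python
def group_observed_table(groups, durations, event_observed):
    """Count observed events per (time, group); empty cells are 0."""
    unique_groups = sorted(set(groups))
    times = sorted(set(durations))
    table = {t: {f"observed:{g}": 0 for g in unique_groups} for t in times}
    for g, t, e in zip(groups, durations, event_observed):
        if e:
            table[t][f"observed:{g}"] += 1
    return unique_groups, table


# Toy data mirroring the example counts (assumed, not the real dataset).
groups = ["miR-137"] + ["control"] * 2 + ["miR-137"] * 8
durations = [6, 7, 7, 9, 9, 9, 13, 13, 13, 15, 15]
events = [1] * len(durations)

unique_groups, observed = group_observed_table(groups, durations, events)
```

The removed and censored tables follow the same pattern, counting all rows or only rows with event_observed equal to 0.
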
lifelines.utils.survival_events_from_table(survival_table, observed_deaths_col='observed', censored_col='censored')

This is the inverse of the function survival_table_from_events.

Parameters:
  • survival_table (DataFrame) – a pandas DataFrame with index as the durations and columns “observed” and “censored”, referring to the number of individuals that died and were censored at time t.

  • observed_deaths_col (str, optional (default: “observed”)) – the column in the survival table that represents the number of subjects that were observed to die at a specific time

  • censored_col (str, optional (default: “censored”)) – the column in the survival table that represents the number of subjects that were censored at a specific time

Returns:

  • T (array) – durations of observation, one element for each observed time
  • E (array) – event observations – 1 if observed, 0 else.
  • W (array) – weights - integer weights to “condense” the data

Example

>>> # Ex: The survival table, as a pandas DataFrame:
>>>
>>>                  observed  censored
>>>    index
>>>    1                1         0
>>>    2                0         1
>>>    3                1         0
>>>    4                1         1
>>>    5                0         1
>>>
>>> # would return
>>> T = np.array([ 1.,  2.,  3.,  4.,  4.,  5.])
>>> E = np.array([ 1.,  0.,  1.,  1.,  0.,  0.])
>>> W = np.array([ 1,  1,  1,  1,  1,  1])
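
The inversion can be sketched without pandas. This is a simplified illustration, not the library's code: the helper events_from_table is hypothetical, and it expands every count into individual rows instead of returning a weights array.

```python
def events_from_table(table):
    """Expand (duration, n_observed, n_censored) rows into T and E lists."""
    T, E = [], []
    for duration, n_observed, n_censored in table:
        # Observed deaths at this duration come first, then censored subjects.
        T.extend([duration] * n_observed)
        E.extend([1] * n_observed)
        T.extend([duration] * n_censored)
        E.extend([0] * n_censored)
    return T, E


# Same survival table as in the example above.
table = [(1, 1, 0), (2, 0, 1), (3, 1, 0), (4, 1, 1), (5, 0, 1)]
T, E = events_from_table(table)
# T == [1, 2, 3, 4, 4, 5]
# E == [1, 0, 1, 1, 0, 0]
```
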
lifelines.utils.datetimes_to_durations(start_times, end_times, fill_date=datetime.datetime(2019, 6, 23, 13, 41, 21, 647959), freq='D', dayfirst=False, na_values=None)

This is a very flexible function for transforming arrays of start_times and end_times to the proper format for lifelines: duration and event observation arrays.

Parameters:
  • start_times (an array, Series or DataFrame) – iterable representing start times. These can be strings, or datetime objects.
  • end_times (an array, Series or DataFrame) – iterable representing end times. These can be strings, or datetimes. These values can be None, or an empty string, which corresponds to censorship.
  • fill_date (datetime, optional (default=datetime.today())) – the date to use if an entry of end_times is None or an empty string. This corresponds to the last date of observation. Anything after this date is also censored.
  • freq (string, optional (default=’D’)) – the units of time to use. See Pandas ‘freq’. Default ‘D’ for days.
  • dayfirst (boolean, optional (default=False)) – convert assuming European-style dates, i.e. day/month/year.
  • na_values (list, optional) – list of values to recognize as NA/NaN. Ex: [‘’, ‘NaT’]
Returns:

  • T (numpy array) – array of floats representing the durations with time units given by freq.
  • C (numpy array) – boolean array of event observations: True if death observed, False otherwise.

Examples

>>> from lifelines.utils import datetimes_to_durations
>>>
>>> start_dates = ['2015-01-01', '2015-04-01', '2014-04-05']
>>> end_dates = ['2016-02-02', None, '2014-05-06']
>>>
>>> T, E = datetimes_to_durations(start_dates, end_dates, freq="D")
>>> # the second duration is measured up to fill_date, so it depends on when this is run
>>> T # array([ 397., 1414.,   31.])
>>> E # array([ True, False,  True])
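
To make the role of fill_date concrete, here is a standard-library sketch with a fixed fill date. It is not the library's implementation: the helper to_durations is hypothetical, it only handles 'YYYY-MM-DD' strings, and it always measures in whole days.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d"


def to_durations(start_times, end_times, fill_date):
    """Return (durations in days, observed flags) for string dates."""
    durations, observed = [], []
    for start, end in zip(start_times, end_times):
        start_dt = datetime.strptime(start, DATE_FORMAT)
        end_dt = datetime.strptime(end, DATE_FORMAT) if end else None
        # Missing end dates, and end dates after fill_date, count as censored.
        is_observed = end_dt is not None and end_dt <= fill_date
        stop = end_dt if is_observed else fill_date
        durations.append((stop - start_dt).days)
        observed.append(is_observed)
    return durations, observed


start_dates = ["2015-01-01", "2015-04-01", "2014-04-05"]
end_dates = ["2016-02-02", None, "2014-05-06"]

T, E = to_durations(start_dates, end_dates, fill_date=datetime(2019, 1, 1))
# T == [397, 1371, 31]
# E == [True, False, True]
```

With the fill date pinned to 2019-01-01 the censored subject's duration is 1371 days; a different fill date changes only that entry.
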
lifelines.utils.concordance_index(event_times, predicted_scores, event_observed=None)

Calculates the concordance index (C-index) between two series of event times. The first is the real survival times from the experimental data, and the other is the predicted survival times from a model of some kind.

The c-index is the average of how often a model says X is greater than Y when, in the observed data, X is indeed greater than Y. The c-index also accounts for censored values (if Y is censored, it is hard to know whether X is truly greater than Y).

The concordance index is a value between 0 and 1 where:

  • 0.5 is the expected result from random predictions,
  • 1.0 is perfect concordance, and
  • 0.0 is perfect anti-concordance (multiply predictions with -1 to get 1.0)
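Parameters:
  • event_times (iterable) – the observed survival times.
  • predicted_scores (iterable) – the predicted scores from a model, ordered so that a larger score corresponds to a longer predicted survival time.
  • event_observed (iterable, optional) – 1 if the event was observed, 0 if censored. If None, all events are treated as observed.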
Returns:

c-index – a value between 0 and 1.

Return type:

float

References

Harrell FE, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine 1996;15(4):361-87.

Examples

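>>> from lifelines import CoxPHFitter
>>> from lifelines.utils import concordance_index
>>> # df: a DataFrame with duration column 'T', event column 'E', and covariates
>>> cph = CoxPHFitter().fit(df, 'T', 'E')
>>> # negate the partial hazard: a higher hazard implies a shorter survival time
>>> concordance_index(df['T'], -cph.predict_partial_hazard(df), df['E'])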
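
The pairwise definition can be written out as a naive O(n^2) loop. This is a teaching sketch under stated assumptions, not the library's algorithm: the helper naive_concordance is hypothetical, pairs with tied event times are skipped, tied predictions score 0.5, and a pair is only comparable when the subject with the shorter time had an observed event.

```python
from itertools import combinations


def naive_concordance(event_times, predicted_scores, event_observed=None):
    """Fraction of comparable pairs whose predictions are ordered correctly."""
    if event_observed is None:
        event_observed = [1] * len(event_times)
    data = list(zip(event_times, predicted_scores, event_observed))
    concordant, comparable = 0.0, 0
    for a, b in combinations(data, 2):
        if a[0] == b[0]:
            continue  # simplification: skip tied event times
        shorter, longer = sorted((a, b), key=lambda row: row[0])
        if not shorter[2]:
            continue  # shorter time is censored: true ordering unknown
        comparable += 1
        if shorter[1] < longer[1]:
            concordant += 1.0
        elif shorter[1] == longer[1]:
            concordant += 0.5
    return concordant / comparable


perfect = naive_concordance([1, 2, 3, 4], [1, 2, 3, 4])
reversed_scores = naive_concordance([1, 2, 3, 4], [4, 3, 2, 1])
with_censoring = naive_concordance([1, 2, 3], [1, 3, 2], [1, 0, 1])
```

In the last call the pair (2, 3) is dropped because the subject with time 2 is censored, so the remaining two pairs are both concordant.
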
lifelines.utils.k_fold_cross_validation(fitters, df, duration_col, event_col=None, k=5, evaluation_measure=<function concordance_index>, predictor='predict_expectation', predictor_kwargs={}, fitter_kwargs={})

Perform cross validation on a dataset. If multiple models are provided, all models will train on each of the k subsets.

Parameters:
  • fitters (model) – one or several objects which possess a method: fit(self, data, duration_col, event_col). Note that the last two arguments will be given as keyword arguments, and that event_col is optional. The objects must also have the “predictor” method defined below.
  • df (DataFrame) – a Pandas DataFrame with necessary columns duration_col and (optional) event_col, plus other covariates. duration_col refers to the lifetimes of the subjects. event_col refers to whether the ‘death’ event was observed: 1 if observed, 0 else (censored).
  • duration_col ((n,) array) – the column in DataFrame that contains the subjects’ lifetimes.
  • event_col ((n,) array) – the column in DataFrame that contains the subject’s death observation. If left as None, assumes all individuals are non-censored.
  • k (int) – the number of folds to perform. In each fold, n/k of the data is withheld for testing.
  • evaluation_measure (function) – a function that accepts either (event_times, predicted_event_times), or (event_times, predicted_event_times, event_observed). Default: concordance_index (C-index).
  • predictor (string) – a string that matches a prediction method on the fitter instances. For example, predict_expectation or predict_percentile. Default is “predict_expectation”. The interface for the method is: predict(self, data, **optional_kwargs)
  • fitter_kwargs – keyword args to pass into fitter.fit method.
  • predictor_kwargs – keyword args to pass into predictor-method.
Returns:

results – (k,1) list of scores for each fold. The scores can be anything.

Return type:

list

See also

lifelines.utils.sklearn_adapter.sklearn_adapter()

lifelines.utils.to_long_format(df, duration_col)

This function converts a survival analysis DataFrame to the lifelines “long” format. The “long” format is the input expected by a common next step, add_covariate_to_timeline.

Parameters:
  • df (DataFrame) – a DataFrame in the standard survival analysis form (one row per observation, with covariates, duration and event flag)
  • duration_col (string) – string representing the column in df that represents the durations of each subject.
Returns:

long_form_df – A DataFrame with new columns. This can be fed into add_covariate_to_timeline

Return type:

DataFrame
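
A minimal sketch of the reshaping, using lists of dictionaries in place of a DataFrame. It assumes the long format represents each subject by a single interval from 0 to its duration, stored in the 'start' and 'stop' columns that add_covariate_to_timeline expects; the helper to_long and the sample rows are hypothetical.

```python
def to_long(rows, duration_col):
    """Replace the duration column with a (start, stop] interval per row."""
    long_rows = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != duration_col}
        new_row["start"] = 0                 # assumed: everyone starts at 0
        new_row["stop"] = row[duration_col]  # the interval ends at the duration
        long_rows.append(new_row)
    return long_rows


# Hypothetical input rows, for illustration only.
rows = [
    {"id": 1, "T": 10, "E": 1, "age": 34},
    {"id": 2, "T": 4, "E": 0, "age": 51},
]
long_rows = to_long(rows, "T")
```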

lifelines.utils.to_episodic_format(df, duration_col, event_col, id_col=None, time_gaps=1)

This function takes a “flat” dataset (that is, non-time-varying), and converts it into a time-varying dataset with static variables.

Useful if your dataset has variables that do not satisfy the proportional hazard assumption, and you need to create a time-varying dataset to include interaction terms with time.

Parameters:
  • df (DataFrame) – a DataFrame of the static dataset.
  • duration_col (string) – string representing the column in df that represents the durations of each subject.
  • event_col (string) – string representing the column in df that represents whether the subject experienced the event or not.
  • id_col (string, optional) – Specify the column that represents an id, else lifelines creates an auto-incrementing one.
  • time_gaps (float or int) – Specify a desired time gap. For example, if time_gaps is 2 and a subject lives for 10.5 units of time, then the final long form will have 5 + 1 rows for that subject: (0, 2], (2, 4], (4, 6], (6, 8], (8, 10], (10, 10.5]. Smaller time_gaps will produce larger DataFrames, and larger time_gaps will produce smaller DataFrames. In the limit of large time_gaps, the long DataFrame will be identical to the original DataFrame.
Returns:

Return type:

DataFrame
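
The interval arithmetic behind time_gaps can be sketched as follows. The helper to_episodes is hypothetical, and the sketch assumes the event flag is carried only by a subject's final interval, with earlier intervals marked 0.

```python
def to_episodes(duration, event, time_gap):
    """Split (0, duration] into (start, stop, event) rows of width time_gap."""
    rows = []
    start = 0
    while start < duration:
        stop = min(start + time_gap, duration)
        is_last = stop == duration
        # Assumed: only the final interval can contain the event.
        rows.append((start, stop, event if is_last else 0))
        start = stop
    return rows


episodes = to_episodes(10.5, 1, 2)
# [(0, 2, 0), (2, 4, 0), (4, 6, 0), (6, 8, 0), (8, 10, 0), (10, 10.5, 1)]

single = to_episodes(10.5, 1, 100)
# [(0, 10.5, 1)]
```

The second call shows the limiting case: a gap larger than the duration leaves one row per subject.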

Example

>>> from lifelines.datasets import load_rossi
>>> from lifelines.utils import to_episodic_format
>>> rossi = load_rossi()
>>> long_rossi = to_episodic_format(rossi, 'week', 'arrest', time_gaps=2.)
>>>
>>> from lifelines import CoxTimeVaryingFitter
>>> ctv = CoxTimeVaryingFitter()
>>> # age variable violates proportional hazard
>>> long_rossi['time * age'] = long_rossi['stop'] * long_rossi['age']
>>> ctv.fit(long_rossi, id_col='id', event_col='arrest', show_progress=True)
>>> ctv.print_summary()
lifelines.utils.add_covariate_to_timeline(long_form_df, cv, id_col, duration_col, event_col, start_col='start', stop_col='stop', add_enum=False, overwrite=True, cumulative_sum=False, cumulative_sum_prefix='cumsum_', delay=0)

This is a util function to help create a long form table tracking subjects’ covariate changes over time. It is meant to be used iteratively as one adds more and more covariates to track over time. Before using this function, it is recommended to view the documentation at https://lifelines.readthedocs.io/en/latest/Time%20varying%20survival%20regression.html#dataset-creation-for-time-varying-regression

Parameters:
  • long_form_df (DataFrame) – a DataFrame that has the initial or intermediate “long” form of time-varying observations. Must contain columns id_col, ‘start’, ‘stop’, and event_col. See function to_long_format to transform data into long form.
  • cv (DataFrame) – a DataFrame that contains (possibly more than) one covariate to track over time. Must contain columns id_col and duration_col. duration_col represents time since the start of the subject’s life.
  • id_col (string) – the column in long_form_df and cv representing a unique identifier for subjects.
  • duration_col (string) – the column in cv that represents the time-since-birth the observation occurred at.
  • event_col (string) – the column in long_form_df that represents if the event-of-interest occurred
  • add_enum (boolean, optional) – a Boolean flag to denote whether to add a column enumerating rows per subject. Useful to specify a specific observation, ex: df[df['enum'] == 1] will grab the first observations per subject.
  • overwrite (boolean, optional) – if True, covariate values in long_form_df will be overwritten by covariate values in cv if the column exists in both cv and long_form_df and the timestamps are identical. If False, the values are summed together instead.
  • cumulative_sum (boolean, optional) – sum over time the new covariates. Makes sense if the covariates are new additions, and not state changes (ex: administering more drugs vs taking a temperature.)
  • cumulative_sum_prefix (string, optional) – a prefix to add to calculated cumulative sum columns
  • delay (int, optional) – add a delay to covariates (useful for checking for reverse causality in analysis)
Returns:

long_form_df – A DataFrame with updated rows to reflect the novel time slices (if any) being added from cv, and novel (or updated) columns of new covariates from cv

Return type:

DataFrame
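
The idea of "novel time slices" can be illustrated for a single subject and a single covariate. This is a conceptual sketch only: the helper add_covariate is hypothetical, it does not model delay, cumulative_sum, overwrite or add_enum, and it assumes a covariate value recorded at time t applies from t until the next recorded value.

```python
def add_covariate(start, stop, event, changes, name):
    """Split (start, stop] at each covariate change time.

    changes: list of (time, value) pairs, time measured since birth.
    The covariate is None before its first recorded value.
    """
    values = dict(changes)
    cut_points = sorted(t for t in values if start < t < stop)
    edges = [start] + cut_points + [stop]
    rows, current = [], None
    for left, right in zip(edges, edges[1:]):
        current = values.get(left, current)  # new value takes effect at `left`
        rows.append({
            "start": left,
            "stop": right,
            name: current,
            "event": event if right == stop else 0,
        })
    return rows


# Hypothetical subject observed on (0, 10] with a dose change at time 4.
rows = add_covariate(0, 10, 1, [(0, 1.0), (4, 2.5)], "dose")
# [{'start': 0, 'stop': 4, 'dose': 1.0, 'event': 0},
#  {'start': 4, 'stop': 10, 'dose': 2.5, 'event': 1}]
```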

lifelines.utils.covariates_from_event_matrix(df, id_col)

This is a helper function that takes binary event datastreams in a specific format and converts them to a format that add_covariate_to_timeline will accept. For example, suppose you have a dataset that looks like:

   id  promotion  movement  raise
0   1        1.0       NaN    2.0
1   2        NaN       5.0    NaN
2   3        3.0       5.0    7.0

where the values (aside from the id column) represent when an event occurred for a specific user, relative to the subject’s birth/entry. This is a common format when pulling data from a SQL table. We call this a duration matrix, and we want to convert this DataFrame to a format that can be included in a long form DataFrame (see add_covariate_to_timeline for more details on this).

The duration matrix should have 1 row per subject (but not necessarily all subjects).

Parameters:
  • df (DataFrame) – the DataFrame we want to transform
  • id_col (string) – the column in df representing a unique identifier for subjects.

Example

>>> cv = covariates_from_event_matrix(duration_df, 'id')
>>> long_form_df = add_covariate_to_timeline(long_form_df, cv, 'id', 'duration', 'e', cumulative_sum=True)
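
The reshaping of a duration matrix can be sketched with plain dictionaries. This is an illustration of the idea rather than the library's output format: the helper events_to_long is hypothetical, it emits one row per (subject, time) with a 0/1 indicator per event type, and it simply skips events that never occurred, which may differ from how the library represents them.

```python
import math


def events_to_long(rows, id_col):
    """Turn one-row-per-subject event times into (id, duration, flags) rows."""
    out = []
    for row in rows:
        event_names = [k for k in row if k != id_col]
        by_time = {}
        for name in event_names:
            t = row[name]
            if t is None or (isinstance(t, float) and math.isnan(t)):
                continue  # assumed: a missing time means the event never happened
            by_time.setdefault(t, set()).add(name)
        for t in sorted(by_time):
            record = {id_col: row[id_col], "duration": t}
            record.update({name: int(name in by_time[t]) for name in event_names})
            out.append(record)
    return out


nan = float("nan")
duration_matrix = [
    {"id": 1, "promotion": 1.0, "movement": nan, "raise": 2.0},
    {"id": 2, "promotion": nan, "movement": 5.0, "raise": nan},
    {"id": 3, "promotion": 3.0, "movement": 5.0, "raise": 7.0},
]
cv = events_to_long(duration_matrix, "id")
```

For the duration matrix shown earlier this yields six rows, for example one for subject 1 at duration 1.0 with promotion set to 1 and the other indicators set to 0.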