utils

lifelines.utils.add_covariate_to_timeline(long_form_df, cv, id_col, duration_col, event_col, start_col='start', stop_col='stop', add_enum=False, overwrite=True, cumulative_sum=False, cumulative_sum_prefix='cumsum_', delay=0) → DataFrame

This is a util function to help create a long form table tracking subjects’ covariate changes over time. It is meant to be used iteratively as one adds more and more covariates to track over time. Before using this function, it is recommended to view the documentation at https://lifelines.readthedocs.io/en/latest/Time%20varying%20survival%20regression.html#dataset-creation-for-time-varying-regression

Parameters:
  • long_form_df (DataFrame) – a DataFrame that has the initial or intermediate “long” form of time-varying observations. Must contain columns id_col, ‘start’, ‘stop’, and event_col. See function to_long_format to transform data into long form.

  • cv (DataFrame) – a DataFrame that contains (possibly more than) one covariate to track over time. Must contain columns id_col and duration_col. duration_col represents time since the start of the subject’s life.

  • id_col (string) – the column in long_form_df and cv representing a unique identifier for subjects.

  • duration_col (string) – the column in cv that represents the time-since-birth the observation occurred at.

  • event_col (string) – the column in long_form_df that represents whether the event-of-interest occurred

  • add_enum (bool, optional) – a Boolean flag to denote whether to add a column enumerating rows per subject. Useful to specify a specific observation, ex: df[df[‘enum’] == 1] will grab the first observations per subject.

  • overwrite (bool, optional) – if True, covariate values in long_form_df will be overwritten by covariate values in cv if the column exists in both cv and long_form_df and the timestamps are identical. If False, the default behavior will be to sum the values together.

  • cumulative_sum (bool, optional) – sum over time the new covariates. Makes sense if the covariates are new additions, and not state changes (ex: administering more drugs vs taking a temperature.)

  • cumulative_sum_prefix (string, optional) – a prefix to add to calculated cumulative sum columns

  • delay (int, optional) – add a delay to covariates (useful for checking for reverse causality in analysis)

Returns:

long_form_df – A DataFrame with updated rows to reflect the novel times slices (if any) being added from cv, and novel (or updated) columns of new covariates from cv

Return type:

DataFrame
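To make the reshaping concrete, here is a minimal plain-Python sketch of what happens to one subject's row when a single covariate is added. It is not the lifelines implementation: the helper name split_on_covariate_changes and the dict-based rows are hypothetical, and the sketch ignores overwrite, cumulative_sum, add_enum and delay.

```python
def split_on_covariate_changes(stop, event, changes):
    """Split one subject's (0, stop] interval at each covariate change.

    `changes` is a list of (time_since_birth, value) pairs for a single
    covariate. This helper is illustrative and not part of lifelines.
    """
    # Keep only changes that happen before the subject's final stop time.
    changes = sorted(((t, v) for t, v in changes if t < stop), key=lambda p: p[0])
    # If nothing is known at time 0, the covariate is missing until the first change.
    if not changes or changes[0][0] > 0:
        changes.insert(0, (0, None))

    rows = []
    for i, (start, value) in enumerate(changes):
        is_last = i == len(changes) - 1
        row_stop = stop if is_last else changes[i + 1][0]
        # Only the final slice can carry the event; earlier slices end in a covariate change.
        rows.append({"start": start, "stop": row_stop, "value": value,
                     "event": event if is_last else 0})
    return rows


rows = split_on_covariate_changes(stop=10, event=1, changes=[(0, 1.0), (4, 3.0)])
for row in rows:
    print(row)
```

The subject's single (0, 10] row becomes two rows, (0, 4] and (4, 10], and only the last one keeps the event flag.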

lifelines.utils.concordance_index(event_times, predicted_scores, event_observed=None) → float

Calculates the concordance index (C-index) between a series of event times and a predicted score. The first is the real survival times from the observational data, and the other is the predicted score from a model of some kind.

The c-index is the proportion of comparable pairs in which the model ranks X above Y when, in the observed data, X is indeed greater than Y. It also accounts for censored values (if Y is censored, it is hard to know whether X is truly greater than Y).

The concordance index is a value between 0 and 1 where:

  • 0.5 is the expected result from random predictions,

  • 1.0 is perfect concordance and,

  • 0.0 is perfect anti-concordance (multiply predictions with -1 to get 1.0)

The calculation internally done is

>>> (pairs_correct + 0.5 * pairs_tied) / admissable_pairs

where pairs_correct is the number of pairs such that if t_x > t_y, then s_x > s_y; pairs_tied is the number of pairs where s_x = s_y; and admissable_pairs is the number of all possible pairs. The subtleties are in how censored observations are handled (ex: not all pairs can be evaluated due to censoring).

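Parameters:
  • event_times (iterable) – a length-n iterable of observed survival times.

  • predicted_scores (iterable) – a length-n iterable of predicted scores, one per subject, from a model of some kind.

  • event_observed (iterable, optional) – a length-n iterable of event flags: 1 if observed, 0 if censored. Default None assumes all observed.

Returns: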

c-index – a value between 0 and 1.

Return type:

float

References

Harrell FE, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine 1996;15(4):361-87.

Examples

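from lifelines import CoxPHFitter
from lifelines.utils import concordance_index
# df is assumed to hold durations in 'T', event flags in 'E', plus covariates
cph = CoxPHFitter().fit(df, 'T', 'E')
# negate the partial hazard: a higher hazard should mean a shorter survival time
concordance_index(df['T'], -cph.predict_partial_hazard(df), df['E'])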
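The ratio (pairs_correct + 0.5 * pairs_tied) / admissable_pairs can be illustrated with a naive O(n²) sketch in plain Python. This is not how lifelines computes it: the function name is hypothetical, pairs with tied event times are simply skipped, and a pair is treated as admissible only when the shorter of the two times is an observed event.

```python
from itertools import combinations


def naive_concordance_index(event_times, predicted_scores, event_observed=None):
    """O(n^2) concordance sketch; higher scores should mean longer survival."""
    n = len(event_times)
    if event_observed is None:
        event_observed = [1] * n  # assume every event was observed

    correct = tied = admissible = 0
    for i, j in combinations(range(n), 2):
        if event_times[i] == event_times[j]:
            continue  # simplification: skip pairs with tied times
        # Order the pair so that `a` has the shorter time.
        a, b = (i, j) if event_times[i] < event_times[j] else (j, i)
        if not event_observed[a]:
            continue  # the shorter time is censored, so the true order is unknown
        admissible += 1
        if predicted_scores[a] < predicted_scores[b]:
            correct += 1
        elif predicted_scores[a] == predicted_scores[b]:
            tied += 1
    if admissible == 0:
        raise ValueError("no admissible pairs")
    return (correct + 0.5 * tied) / admissible


print(naive_concordance_index([1, 2, 3, 4], [1, 2, 3, 4]))        # 1.0
print(naive_concordance_index([1, 2, 3], [2, 1, 3], [1, 0, 1]))   # 0.5
```

In the second call the pair formed by the second and third subjects is dropped, because the shorter time (2) is censored.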

lifelines.utils.covariates_from_event_matrix(df, id_col) → DataFrame

This is a helper function to handle binary event datastreams in a specific format and convert it to a format that add_covariate_to_timeline will accept. For example, suppose you have a dataset that looks like:

   id  promotion  movement  raise
0   1        1.0       NaN    2.0
1   2        NaN       5.0    NaN
2   3        3.0       5.0    7.0

where the values (aside from the id column) represent when an event occurred for a specific user, relative to the subject’s birth/entry. This is a common format for data pulled from a SQL table. We call this a duration matrix, and we want to convert this DataFrame to a format that can be included in a long form DataFrame (see add_covariate_to_timeline for more details on this).

The duration matrix should have 1 row per subject (but not necessarily all subjects).

Parameters:
  • df (DataFrame) – the DataFrame we want to transform

  • id_col (string) – the column in df representing a unique identifier for subjects.

Example

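from lifelines.utils import covariates_from_event_matrix, add_covariate_to_timeline
# duration_df is the duration matrix above; long_form_df is an existing long-form DataFrame
cv = covariates_from_event_matrix(duration_df, 'id')
long_form_df = add_covariate_to_timeline(long_form_df, cv, 'id', 'duration', 'e', cumulative_sum=True)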
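As a rough illustration of the reshaping, the plain-Python sketch below turns the duration matrix shown earlier into one row per (subject, time) with a 0/1 indicator per event type. The helper name and the exact output layout are assumptions made for illustration; the real function works on DataFrames and its output may differ in detail.

```python
def event_matrix_to_long(rows, id_col):
    """Turn one-row-per-subject event times into one-row-per-(subject, time) indicators."""
    event_names = [name for name in rows[0] if name != id_col]
    long_rows = []
    for row in rows:
        # Collect the distinct times at which this subject had any event.
        times = sorted({row[name] for name in event_names if row[name] is not None})
        for t in times:
            long_row = {id_col: row[id_col], "duration": t}
            for name in event_names:
                long_row[name] = 1 if row[name] == t else 0
            long_rows.append(long_row)
    return long_rows


# The duration matrix from the text, with None standing in for NaN.
duration_matrix = [
    {"id": 1, "promotion": 1.0, "movement": None, "raise": 2.0},
    {"id": 2, "promotion": None, "movement": 5.0, "raise": None},
    {"id": 3, "promotion": 3.0, "movement": 5.0, "raise": 7.0},
]
long_rows = event_matrix_to_long(duration_matrix, "id")
for row in long_rows:
    print(row)
```

Each indicator column can then be accumulated over time, which is what cumulative_sum=True asks add_covariate_to_timeline to do.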

lifelines.utils.datetimes_to_durations(start_times, end_times, fill_date=datetime.today(), freq='D', dayfirst=False, na_values=None, format=None)

This is a very flexible function for transforming arrays of start_times and end_times to the proper format for lifelines: duration and event observation arrays.

Parameters:
  • start_times (an array, Series or DataFrame) – iterable representing start times. These can be strings, or datetime objects.

  • end_times (an array, Series or DataFrame) – iterable representing end times. These can be strings, or datetimes. These values can be None, or an empty string, which corresponds to censorship.

  • fill_date (a datetime, array, Series or DataFrame, optional (default=datetime.today())) – the date to use if an entry of end_times is missing or empty. This corresponds to the last date of observation. Anything after this date is also censored.

  • freq (string, optional (default=’D’)) – the units of time to use. See Pandas ‘freq’. Default ‘D’ for days.

  • dayfirst (bool, optional (default=False)) – see Pandas to_datetime

  • na_values (list, optional) – list of values to recognize as NA/NaN. Ex: [‘’, ‘NaT’]

  • format – see Pandas to_datetime

Returns:

  • T (numpy array) – array of floats representing the durations with time units given by freq.

  • C (numpy array) – boolean array of event observations: 1 if death observed, 0 else.

Examples

from lifelines.utils import datetimes_to_durations

start_dates = ['2015-01-01', '2015-04-01', '2014-04-05']
end_dates = ['2016-02-02', None, '2014-05-06']

T, E = datetimes_to_durations(start_dates, end_dates, freq="D")
T # array([ 397., 1414.,   31.]); the middle value is measured up to fill_date (today by default), so it changes with the run date
E # array([ True, False,  True])
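The same conversion can be sketched with only the standard library, using a fixed fill_date so the result is reproducible. This is a simplified stand-in rather than the lifelines code: the function name is hypothetical, it only accepts ISO-formatted date strings, counts whole days, and does not handle end dates that fall after fill_date.

```python
from datetime import date


def days_and_event_flags(start_dates, end_dates, fill_date):
    """Return (durations in days, observed flags) for ISO-formatted date strings."""
    durations, observed = [], []
    for start, end in zip(start_dates, end_dates):
        missing = end is None or end == ""
        # A missing end date means the subject was still under observation at fill_date.
        end_day = fill_date if missing else date.fromisoformat(end)
        durations.append(float((end_day - date.fromisoformat(start)).days))
        observed.append(not missing)
    return durations, observed


T, E = days_and_event_flags(
    ["2015-01-01", "2015-04-01", "2014-04-05"],
    ["2016-02-02", None, "2014-05-06"],
    fill_date=date(2017, 1, 1),
)
print(T)  # [397.0, 641.0, 31.0]
print(E)  # [True, False, True]
```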
lifelines.utils.find_best_parametric_model(event_times, event_observed=None, scoring_method: str = 'AIC', additional_models=None, censoring_type='right', timeline=None, alpha=None, ci_labels=None, entry=None, weights=None, show_progress=False)

To quickly determine the best¹ univariate model, this function will iterate through each parametric model available in lifelines and select the one that minimizes a particular measure of fit.

¹Best, according to the measure of fit.

Parameters:
  • event_times (list, np.array, pd.Series) – a (n,) array of observed survival times. If interval censoring, a tuple of (lower_bound, upper_bound).

  • event_observed (list, np.array, pd.Series) – a (n,) array of censored flags, 1 if observed, 0 if not. Default None assumes all observed.

  • scoring_method (string) – one of {“AIC”, “BIC”}

  • additional_models (list) – list of other parametric models that implement the lifelines API.

  • censoring_type (str) – {“right”, “left”, “interval”}

  • timeline (list, optional) – return the model at the values in timeline (positively increasing)

  • alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.

  • ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>

  • entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered study when they were “born”: time zero.

  • weights (an array, or pd.Series, of length n) – integer weights per observation

Note

Due to instability, the GeneralizedGammaFitter is not tested here.

Return type:

tuple of fitted best_model and best_score
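The selection step amounts to scoring each fitted model and keeping the minimum. The sketch below uses the standard AIC and BIC definitions with made-up log-likelihoods; the model names and numbers are hypothetical, and no lifelines fitter is actually run.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: penalizes parameters more as n_obs grows."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical fitted models: name -> (maximized log-likelihood, number of parameters).
candidates = {
    "Exponential": (-520.0, 1),
    "Weibull": (-505.0, 2),
    "LogNormal": (-504.5, 2),
}

scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best_name = min(scores, key=scores.get)
print(best_name, scores[best_name])  # LogNormal 1013.0
```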

lifelines.utils.group_survival_table_from_events(groups, durations, event_observed, birth_times=None, weights=None, limit=-1) → Tuple[ndarray, DataFrame, DataFrame, DataFrame]

Joins multiple event series together into DataFrames. A generalization of survival_table_from_events to data with groups.

Parameters:
  • groups (a (n,) array) – individuals’ group ids.

  • durations (a (n,) array) – durations of each individual

  • event_observed (a (n,) array) – event observations, 1 if observed, 0 else.

  • birth_times (a (n,) array) – when the subject was first observed. A subject’s death event is then at [birth times + duration observed]. Normally set to all zeros, but can be positive or negative.

  • limit

Returns:

  • unique_groups (np.array) – array of all the unique groups present

  • removed (DataFrame) – DataFrame of removal count data at event_times for each group, column names are ‘removed:<group name>’

  • observed (DataFrame) – DataFrame of observed count data at event_times for each group, column names are ‘observed:<group name>’

  • censored (DataFrame) – DataFrame of censored count data at event_times for each group, column names are ‘censored:<group name>’

Example

#input
group_survival_table_from_events(waltonG, waltonT, np.ones_like(waltonT)) #data available in test_suite.py
#output
[
    array(['control', 'miR-137'], dtype=object),
              removed:control  removed:miR-137
    event_at
    6                       0                1
    7                       2                0
    9                       0                3
    13                      0                3
    15                      0                2
    ,
              observed:control  observed:miR-137
    event_at
    6                        0                 1
    7                        2                 0
    9                        0                 3
    13                       0                 3
    15                       0                 2
    ,
              censored:control  censored:miR-137
    event_at
    6                        0                 0
    7                        0                 0
    9                        0                 0
    ,
]
lifelines.utils.k_fold_cross_validation(fitters, df, duration_col, event_col=None, k=5, scoring_method='log_likelihood', fitter_kwargs={}, seed=None)

Perform cross validation on a dataset. If multiple models are provided, all models will train on each of the k subsets.

Parameters:
  • fitters (model) – one or several objects which possess a method: fit(self, data, duration_col, event_col) Note that the last two arguments will be given as keyword arguments, and that event_col is optional. The objects must also have the “predictor” method defined below.

  • df (DataFrame) – a Pandas DataFrame with necessary columns duration_col and (optional) event_col, plus other covariates. duration_col refers to the lifetimes of the subjects. event_col refers to whether the ‘death’ event was observed: 1 if observed, 0 else (censored).

  • duration_col (string) – the name of the column in DataFrame that contains the subjects’ lifetimes.

  • event_col (string, optional) – the name of the column in DataFrame that contains the subjects’ death observation. If left as None, assume all individuals are uncensored.

  • k (int) – the number of folds to perform. n/k data will be withheld for testing on.

  • scoring_method (str) – one of {‘log_likelihood’, ‘concordance_index’}. log_likelihood: returns the average unpenalized partial log-likelihood. concordance_index: returns the concordance index.

  • fitter_kwargs – keyword args to pass into fitter.fit method.

  • seed (optional) – a seed passed to np.random.seed so that the fold assignment is reproducible.

Returns:

results – (k,1) list of scores for each fold. The scores can be anything.

Return type:

list
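For intuition, the fold bookkeeping can be sketched without any model at all. The generator below is a hypothetical illustration using the standard library's random module rather than np.random.seed, and its assignment of rows to folds is not claimed to match lifelines.

```python
import random


def k_fold_indices(n_rows, k=5, seed=None):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # a fixed seed gives reproducible folds
    for fold in range(k):
        test = indices[fold::k]  # every k-th shuffled row, so roughly n_rows / k per fold
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


folds = list(k_fold_indices(10, k=5, seed=0))
for train, test in folds:
    print(sorted(test), len(train))
```

Each model would be fit on the train rows and scored on the test rows, giving one score per fold.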

lifelines.utils.median_survival_times(model_or_survival_function) → float

Compute the median survival time of survival function(s).

Parameters:

model_or_survival_function (lifelines model or DataFrame) – This can be a univariate lifelines model, or a DataFrame of one or more survival functions.

lifelines.utils.qth_survival_time(q: float, model_or_survival_function) → float

Returns the time when a single survival function reaches the qth percentile, that is, solves \(q = S(t)\) for \(t\).

Parameters:
  • q (float) – value between 0 and 1.

  • model_or_survival_function (Series, single-column DataFrame, or lifelines model)

lifelines.utils.qth_survival_times(q, survival_functions) → DataFrame | float

Find the times when one or more survival functions reach the qth percentile.

Parameters:
  • q (float or array) – a value (or values) between 0 and 1: the survival probability level at which the survival function hits the qth percentile, and whose crossing time is returned.

  • survival_functions (a (n,d) DataFrame, Series, or NumPy array.) – If DataFrame or Series, will return index values (actual times). If NumPy array, will return indices.

Returns:

if d==1, returns a float (np.inf if the survival function never reaches q). If d > 1, a DataFrame containing the first times the value was crossed.

Return type:

float, or DataFrame
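The idea behind these three functions can be sketched on a plain step function: scan the timeline and return the first time at which the survival curve is at or below q, with the median being the case q = 0.5. The helper name and survival values below are hypothetical, and the sketch assumes the curve is non-increasing.

```python
import math


def first_time_at_or_below(q, times, survival):
    """First time at which a non-increasing survival curve is <= q, else infinity."""
    for t, s in zip(times, survival):
        if s <= q:
            return t
    return math.inf  # the curve never drops to q within the observed timeline


times = [0, 6, 7, 9, 13, 15]
survival = [1.0, 0.9, 0.7, 0.45, 0.2, 0.1]

median = first_time_at_or_below(0.5, times, survival)
print(median)                                          # 9
print(first_time_at_or_below(0.05, times, survival))   # inf
```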

lifelines.utils.restricted_mean_survival_time(model_or_survival_function, t: float = inf, return_variance=False) → float | Tuple[float, float]

Compute the restricted mean survival time, RMST, of a survival function. This is defined as

\[\text{RMST}(t) = \int_0^t S(\tau) d\tau\]

The reason we use an upper bound, and not always \(\infty\), is that the tail of a survival function has high variance and strongly affects the RMST.

Parameters:
  • model_or_survival_function (lifelines model or DataFrame) – This can be a univariate model, or a pandas DataFrame. The former will provide a more accurate estimate however.

  • t (float) – The upper limit of the integration in the RMST.

Example

from lifelines import KaplanMeierFitter, WeibullFitter
from lifelines.utils import restricted_mean_survival_time

kmf = KaplanMeierFitter().fit(T, E)
restricted_mean_survival_time(kmf, t=3.5)
restricted_mean_survival_time(kmf.survival_function_, t=3.5)

wf = WeibullFitter().fit(T, E)
restricted_mean_survival_time(wf)
restricted_mean_survival_time(wf.survival_function_)

References

https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-13-152#Sec27
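For a step-shaped survival function, such as a Kaplan-Meier estimate, the RMST integral reduces to a sum of rectangle areas. The sketch below shows that sum in plain Python; the function name and the survival values are hypothetical, the timeline is assumed to start at 0, and t is assumed finite.

```python
def rmst_step(times, survival, t):
    """Integrate a right-continuous step survival function from 0 to t."""
    area = 0.0
    for i, (start, s) in enumerate(zip(times, survival)):
        if start >= t:
            break
        # Each step holds its value until the next time point (or until t).
        end = times[i + 1] if i + 1 < len(times) else t
        area += s * (min(end, t) - start)
    return area


times = [0.0, 1.0, 2.0, 4.0]
survival = [1.0, 0.8, 0.5, 0.2]

print(rmst_step(times, survival, 3.0))  # 1.0*1 + 0.8*1 + 0.5*1, about 2.3
print(rmst_step(times, survival, 5.0))  # adds 0.5*1 + 0.2*1, about 3.0
```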

lifelines.utils.survival_events_from_table(survival_table, observed_deaths_col='observed', censored_col='censored')

This is the inverse of the function survival_table_from_events.

Parameters:
  • survival_table (DataFrame) – a pandas DataFrame with index as the durations and columns “observed” and “censored”, referring to the number of individuals that died and were censored at time t.

  • observed_deaths_col (str, optional (default: “observed”)) – the column in the survival table that represents the number of subjects that were observed to die at a specific time

  • censored_col (str, optional (default: “censored”)) – the column in the survival table that represents the number of subjects that were censored at a specific time

Returns:

  • T (array) – durations of observation – one element for observed time

  • E (array) – event observations – 1 if observed, 0 else.

  • W (array) – weights - integer weights to “condense” the data

Example

# Ex: The survival table, as a pandas DataFrame:

                 observed  censored
   index
   1                1         0
   2                0         1
   3                1         0
   4                1         1
   5                0         1

# would return
T = np.array([ 1.,  2.,  3.,  4.,  4.,  5.])
E = np.array([ 1.,  0.,  1.,  1.,  0.,  0.])
W = np.array([ 1,  1,  1,  1,  1,  1])

See also

survival_table_from_events
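The expansion can be sketched in plain Python with a dict standing in for the DataFrame. The helper name is hypothetical, and the choice to emit one weighted row per (duration, status) pair is an assumption based on the description of W above.

```python
def events_from_table(table, observed_col="observed", censored_col="censored"):
    """`table` maps each duration to a dict of counts; returns (T, E, W) lists."""
    T, E, W = [], [], []
    for duration in sorted(table):
        counts = table[duration]
        # Deaths and censorings at the same duration become separate weighted rows.
        for flag, col in ((1, observed_col), (0, censored_col)):
            if counts[col] > 0:
                T.append(float(duration))
                E.append(flag)
                W.append(counts[col])
    return T, E, W


# The survival table from the example above.
table = {
    1: {"observed": 1, "censored": 0},
    2: {"observed": 0, "censored": 1},
    3: {"observed": 1, "censored": 0},
    4: {"observed": 1, "censored": 1},
    5: {"observed": 0, "censored": 1},
}
T, E, W = events_from_table(table)
print(T)  # [1.0, 2.0, 3.0, 4.0, 4.0, 5.0]
print(E)  # [1, 0, 1, 1, 0, 0]
print(W)  # [1, 1, 1, 1, 1, 1]
```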

lifelines.utils.survival_table_from_events(death_times, event_observed, birth_times=None, columns=['removed', 'observed', 'censored', 'entrance', 'at_risk'], weights=None, collapse=False, intervals=None) → DataFrame

Create a survival table from a right-censored dataset.

Parameters:
  • death_times ((n,) array) – represent the event times

  • event_observed ((n,) array) – 1 if observed event, 0 if censored event.

  • birth_times (a (n,) array, optional) – representing when the subject was first observed. A subject’s death event is then at [birth times + duration observed]. If None (default), birth_times are set to be the first observation or 0, whichever is smaller.

  • columns (iterable, optional) – a 5-length array of names for, in order, the removed individuals, observed deaths, censorships, entrances and at-risk counts.

  • weights ((n,1) array, optional) – Optional argument to use weights for individuals. Assumes weights of 1 if not provided.

  • collapse (bool, optional (default=False)) – If True, collapses survival table into lifetable to show events in interval bins

  • intervals (iterable, optional) – Default None, otherwise a list/(n,1) array of interval edge measures. If left as None while collapse=True, then Freedman-Diaconis rule for histogram bins will be used to determine intervals.

Returns:

Pandas DataFrame with index as the unique times or intervals in death_times. The column named ‘removed’ refers to the number of individuals who were removed from the population by the end of the period. The column ‘observed’ refers to the number of removed individuals who were observed to have died (i.e. not censored). The column ‘censored’ is defined as ‘removed’ - ‘observed’ (the number of individuals who left the population due to censoring).

Return type:

DataFrame

Example

#Uncollapsed output
          removed  observed  censored  entrance   at_risk
event_at
0               0         0         0        11        11
6               1         1         0         0        11
7               2         2         0         0        10
9               3         3         0         0         8
13              3         3         0         0         5
15              2         2         0         0         2
#Collapsed output
         removed observed censored at_risk
event_at
(0, 2]        34       33        1     312
(2, 4]        84       42       42     278
(4, 6]        64       17       47     194
(6, 8]        63       16       47     130
(8, 10]       35       12       23      67
(10, 12]      24        5       19      32
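The uncollapsed table can be reproduced with simple counting. The sketch below is a plain-Python illustration under stated assumptions: the function name is hypothetical, all subjects enter at time 0, durations are strictly positive, and weights are ignored. The durations are chosen to match the uncollapsed output above.

```python
from collections import Counter


def simple_survival_table(death_times, event_observed):
    """Rows of (event_at, removed, observed, censored, entrance, at_risk), births at 0."""
    observed = Counter(t for t, e in zip(death_times, event_observed) if e)
    censored = Counter(t for t, e in zip(death_times, event_observed) if not e)
    n = len(death_times)

    rows = [(0, 0, 0, 0, n, n)]  # everyone enters at time 0
    at_risk = n
    for t in sorted(set(death_times)):
        removed = observed[t] + censored[t]
        rows.append((t, removed, observed[t], censored[t], 0, at_risk))
        at_risk -= removed  # those removed at t are no longer at risk afterwards
    return rows


durations = [6, 7, 7, 9, 9, 9, 13, 13, 13, 15, 15]
table = simple_survival_table(durations, [1] * len(durations))
for row in table:
    print(row)
```

The at_risk column counts subjects still in the population just before each time, which is why it equals 11 at both 0 and 6.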

lifelines.utils.to_episodic_format(df, duration_col, event_col, id_col=None, time_gaps=1) → DataFrame

This function takes a “flat” dataset (that is, non-time-varying), and converts it into a time-varying dataset with static variables.

Useful if your dataset has variables that do not satisfy the proportional hazard assumption, and you need to create a time-varying dataset to include interaction terms with time.

Parameters:
  • df (DataFrame) – a DataFrame of the static dataset.

  • duration_col (string) – string representing the column in df that represents the durations of each subject.

  • event_col (string) – string representing the column in df that represents whether the subject experienced the event or not.

  • id_col (string, optional) – Specify the column that represents an id, else lifelines creates an auto-incrementing one.

  • time_gaps (float or int) – Specify a desired time gap. For example, if time_gaps is 2 and a subject lives for 10.5 units of time, then the final long form will have 5 + 1 rows for that subject: (0, 2], (2, 4], (4, 6], (6, 8], (8, 10], (10, 10.5]. Smaller time_gaps will produce larger DataFrames, and larger time_gaps will produce smaller DataFrames. In the limit of a very large time_gaps, the long DataFrame will be identical to the original DataFrame.

Example

from lifelines.datasets import load_rossi
from lifelines.utils import to_episodic_format
rossi = load_rossi()
long_rossi = to_episodic_format(rossi, 'week', 'arrest', time_gaps=2.)

from lifelines import CoxTimeVaryingFitter
ctv = CoxTimeVaryingFitter()
# age variable violates proportional hazard
long_rossi['time * age'] = long_rossi['stop'] * long_rossi['age']
ctv.fit(long_rossi, id_col='id', event_col='arrest', show_progress=True)
ctv.print_summary()
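The interval splitting described for time_gaps can be sketched for a single subject as follows. The helper is hypothetical and returns plain (start, stop, event) tuples, with the event flag kept only on the final interval; it is an illustration of the idea, not the lifelines implementation.

```python
def episodic_rows(duration, event, time_gap=1.0):
    """Split (0, duration] into consecutive intervals of width time_gap."""
    if time_gap <= 0:
        raise ValueError("time_gap must be positive")
    rows = []
    start = 0.0
    while start < duration:
        stop = min(start + time_gap, duration)
        # Only the interval that ends at the subject's duration can carry the event.
        rows.append((start, stop, event if stop == duration else 0))
        start = stop
    return rows


for row in episodic_rows(10.5, event=1, time_gap=2):
    print(row)
```

A subject with duration 10.5 and time_gap 2 yields the six intervals listed in the time_gaps description, ending with (10.0, 10.5, 1).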

lifelines.utils.to_long_format(df, duration_col) → DataFrame

This function converts a survival analysis DataFrame to the lifelines “long” format, which is the input expected by a common next step, add_covariate_to_timeline.

Parameters:
  • df (DataFrame) – a DataFrame in the standard survival analysis form (one row per observation, with covariates, duration and event flag)

  • duration_col (string) – string representing the column in df that represents the durations of each subject.

Returns:

long_form_df – A DataFrame with new columns. This can be fed into add_covariate_to_timeline

Return type:

DataFrame
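As a rough picture of the conversion, the sketch below assumes that each subject's duration is replaced by a single interval with start = 0 and stop = duration. That reading, the helper name, and the dict-based rows are assumptions for illustration; check the actual output columns before relying on them.

```python
def to_long_rows(rows, duration_col):
    """Replace the duration column with a (start, stop] interval starting at 0."""
    long_rows = []
    for row in rows:
        long_row = {k: v for k, v in row.items() if k != duration_col}
        long_row["start"] = 0
        long_row["stop"] = row[duration_col]
        long_rows.append(long_row)
    return long_rows


# Hypothetical flat dataset: duration in 'T', event flag in 'E', one covariate.
flat = [
    {"id": 1, "T": 10, "E": 1, "age": 34},
    {"id": 2, "T": 4, "E": 0, "age": 51},
]
long_rows = to_long_rows(flat, "T")
for row in long_rows:
    print(row)
```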