lifelines.utils package

Submodules

lifelines.utils.btree module

lifelines.utils.concordance module

lifelines.utils.concordance.concordance_index(event_times, predicted_scores, event_observed=None)

Calculates the concordance index (C-index) between two series of event times. The first is the real survival times from the experimental data, and the other is the predicted survival times from a model of some kind.

The concordance index is a value between 0 and 1, where 0.5 is the expected result from random predictions, 1.0 is perfect concordance, and 0.0 is perfect anti-concordance (multiply predictions by -1 to get 1.0).

Parameters:
  • event_times (iterable) – a length-n iterable of observed survival times.
  • predicted_scores (iterable) – a length-n iterable of predicted scores; these could be survival times, hazards, etc.
  • event_observed (iterable, optional) – a length-n iterable of censoring flags: 1 if observed, 0 if not. Default None assumes all events are observed.
Returns:

c-index – a value between 0 and 1.

Return type:

float

Notes

Harrell FE, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine 1996;15(4):361-87.
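
As a minimal sketch (the toy values below are purely illustrative), higher predicted scores should pair with longer survival times to yield a high C-index:

>>> from lifelines.utils import concordance_index
>>> event_times = [5, 10, 12, 15, 20]          # actual durations
>>> predicted_scores = [4, 11, 13, 14, 22]     # e.g. predicted survival times
>>> event_observed = [1, 1, 0, 1, 1]           # 0 marks a censored subject
>>> concordance_index(event_times, predicted_scores, event_observed)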

lifelines.utils.lowess module

This module implements the Lowess function for nonparametric regression.

Functions: lowess – fit a smooth nonparametric regression curve to a scatterplot.

For more information, see William S. Cleveland: “Robust locally weighted regression and smoothing scatterplots”, Journal of the American Statistical Association, December 1979, volume 74, number 368, pp. 829-836, and William S. Cleveland and Susan J. Devlin: “Locally weighted regression: An approach to regression analysis by local fitting”, Journal of the American Statistical Association, September 1988, volume 83, number 403, pp. 596-610.

lifelines.utils.lowess.lowess(x, y, f=2./3., iter=3) → yest

Lowess smoother: Robust locally weighted regression. The lowess function fits a nonparametric regression curve to a scatterplot. The arrays x and y contain an equal number of elements; each pair (x[i], y[i]) defines a data point in the scatterplot. The function returns the estimated (smooth) values of y. The smoothing span is given by f. A larger value for f will result in a smoother curve. The number of robustifying iterations is given by iter. The function will run faster with a smaller number of iterations.
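
A minimal usage sketch on synthetic data (the x and y arrays below are made up for illustration):

>>> import numpy as np
>>> from lifelines.utils.lowess import lowess
>>> x = np.linspace(0, 10, 100)
>>> y = np.sin(x) + 0.3 * np.random.randn(100)   # noisy scatterplot data
>>> # smaller f gives a wigglier curve; more iterations are more robust to outliers, but slower
>>> yest = lowess(x, y, f=0.25, iter=3)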

lifelines.utils.progress_bar module

Module contents

lifelines.utils.qth_survival_times(q, survival_functions, cdf=False)

Find the times when one or more survival functions reach the qth percentile.

Parameters:
  • q (float) – a float between 0 and 1. The function finds the time(s) at which the survival function(s) hit the qth percentile.
  • survival_functions (a (n,d) DataFrame or numpy array) – if a DataFrame, index values (actual times) are returned; if a numpy array, positional indices are returned.
  • cdf (boolean, optional) – set cdf=True when working with left-censored data.
Returns:

v – if d == 1, a float (np.inf if the qth percentile is never reached). If d > 1, a DataFrame containing the first times each curve crossed the value.

Return type:

float, or DataFrame
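
For example, given a DataFrame of survival curves indexed by time (the values below are made up), the first times each curve crosses the 50th percentile can be found with:

>>> import pandas as pd
>>> from lifelines.utils import qth_survival_times
>>> sf = pd.DataFrame({'group_A': [1.0, 0.8, 0.5, 0.2],
...                    'group_B': [1.0, 0.9, 0.7, 0.4]},
...                   index=[0, 5, 10, 15])
>>> # first times each curve falls to or below q = 0.5
>>> qth_survival_times(0.5, sf)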

lifelines.utils.qth_survival_time(q, survival_function, cdf=False)

Returns the time when a single survival function reaches the qth percentile.

Parameters:
  • q (float) – a float between 0 and 1. The function finds the time at which the survival function hits the qth percentile.
  • survival_function (Series or single-column DataFrame.)
  • cdf (boolean, optional) – set cdf=True when working with left-censored data.
Returns:

v – the first time the survival function reaches the qth percentile.

Return type:

float

lifelines.utils.median_survival_times(density_or_survival_function, left_censorship=False)
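
A minimal sketch using a Kaplan-Meier fit on the Waltons dataset that ships with lifelines:

>>> from lifelines import KaplanMeierFitter
>>> from lifelines.datasets import load_waltons
>>> from lifelines.utils import median_survival_times
>>> df = load_waltons()
>>> kmf = KaplanMeierFitter().fit(df['T'], event_observed=df['E'])
>>> # the first time the estimated survival function crosses 0.5
>>> median_survival_times(kmf.survival_function_)
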
lifelines.utils.survival_table_from_events(death_times, event_observed, birth_times=None, columns=['removed', 'observed', 'censored', 'entrance', 'at_risk'], weights=None, collapse=False, intervals=None)
Parameters:
  • death_times ((n,) array) – represent the event times
  • event_observed ((n,) array) – 1 if the event was observed, 0 if censored.
  • birth_times (a (n,) array, optional) – representing when the subject was first observed. A subject’s death event is then at [birth times + duration observed]. If None (default), birth_times are set to be the first observation or 0, whichever is smaller.
  • columns (iterable, optional) – the column names to use for, in order, the removed individuals, observed deaths, censorships, entrances, and at-risk counts.
  • weights ((n,1) array, optional) – Optional argument to use weights for individuals. Assumes weights of 1 if not provided.
  • collapse (boolean, optional (default=False)) – If True, collapses the survival table into a life table to show events in interval bins.
  • intervals (iterable, optional) – Default None, otherwise a list/(n,1) array of interval edge measures. If left as None while collapse=True, then Freedman-Diaconis rule for histogram bins will be used to determine intervals.
Returns:

output – a Pandas DataFrame with the unique times or intervals in event_times as its index. The column ‘removed’ refers to the number of individuals who were removed from the population by the end of the period. The column ‘observed’ refers to the number of removed individuals who were observed to have died (i.e. not censored). The column ‘censored’ is defined as ‘removed’ - ‘observed’ (the number of individuals who left the population without their death being observed).

Return type:

DataFrame

Example

>>> #Uncollapsed output
>>>           removed  observed  censored  entrance   at_risk
>>> event_at
>>> 0               0         0         0        11        11
>>> 6               1         1         0         0        11
>>> 7               2         2         0         0        10
>>> 9               3         3         0         0         8
>>> 13              3         3         0         0         5
>>> 15              2         2         0         0         2
>>> #Collapsed output
>>>          removed observed censored at_risk
>>>              sum      sum      sum     max
>>> event_at
>>> (0, 2]        34       33        1     312
>>> (2, 4]        84       42       42     278
>>> (4, 6]        64       17       47     194
>>> (6, 8]        63       16       47     130
>>> (8, 10]       35       12       23      67
>>> (10, 12]      24        5       19      32
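
A table like the uncollapsed one above can be produced directly from duration and event arrays; a minimal sketch with made-up data:

>>> from lifelines.utils import survival_table_from_events
>>> T = [6, 7, 7, 9, 9, 9, 13, 13, 13, 15, 15]   # durations
>>> E = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]        # all events observed
>>> survival_table_from_events(T, E)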

See also

group_survival_table_from_events()

lifelines.utils.datetimes_to_durations(start_times, end_times, fill_date=datetime.today(), freq='D', dayfirst=False, na_values=None)

This is a very flexible function for transforming arrays of start_times and end_times to the proper format for lifelines: duration and event observation arrays.

Parameters:
  • start_times (an array, Series or DataFrame) – iterable representing start times. These can be strings, or datetime objects.
  • end_times (an array, Series or Dataframe) – iterable representing end times. These can be strings, or datetimes. These values can be None, or an empty string, which corresponds to censorship.
  • fill_date (datetime, optional (default=datetime.today())) – the date to use if an entry in end_times is None or an empty string. This corresponds to the last date of observation; anything after this date is also censored.
  • freq (string, optional (default=’D’)) – the units of time to use. See Pandas ‘freq’. Default ‘D’ for days.
  • dayfirst (boolean, optional (default=False)) – convert assuming European-style dates, i.e. day/month/year.
  • na_values (list, optional) – list of values to recognize as NA/NaN. Ex: [‘’, ‘NaT’]
Returns:

  • T (numpy array) – array of floats representing the durations with time units given by freq.
  • C (numpy array) – boolean array of event observations: 1 if death observed, 0 else.
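
A minimal sketch (the dates are made up; None marks a subject still under observation at fill_date):

>>> from datetime import datetime
>>> from lifelines.utils import datetimes_to_durations
>>> start = ['2013-10-10', '2013-10-09', '2013-10-10']
>>> end = ['2013-10-13', '2013-10-10', None]
>>> T, E = datetimes_to_durations(start, end, fill_date=datetime(2013, 10, 15), freq='D')
>>> # T: durations in days; E: 1 if the end date was observed, 0 if censored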

lifelines.utils.concordance_index(event_times, predicted_scores, event_observed=None)

Calculates the concordance index (C-index) between two series of event times. The first is the real survival times from the experimental data, and the other is the predicted survival times from a model of some kind.

The concordance index is a value between 0 and 1, where 0.5 is the expected result from random predictions, 1.0 is perfect concordance, and 0.0 is perfect anti-concordance (multiply predictions by -1 to get 1.0).

Parameters:
  • event_times (iterable) – a length-n iterable of observed survival times.
  • predicted_scores (iterable) – a length-n iterable of predicted scores; these could be survival times, hazards, etc.
  • event_observed (iterable, optional) – a length-n iterable of censoring flags: 1 if observed, 0 if not. Default None assumes all events are observed.
Returns:

c-index – a value between 0 and 1.

Return type:

float

Notes

Harrell FE, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine 1996;15(4):361-87.

lifelines.utils.k_fold_cross_validation(fitters, df, duration_col, event_col=None, k=5, evaluation_measure=<function concordance_index>, predictor='predict_expectation', predictor_kwargs={}, fitter_kwargs={})

Perform cross validation on a dataset. If multiple models are provided, all models will train on each of the k subsets.

Parameters:
  • fitter(s) (model) –

    one or several objects which possess a method:

    fit(self, data, duration_col, event_col)

    Note that the last two arguments will be given as keyword arguments, and that event_col is optional. The objects must also have the “predictor” method defined below.

  • df (DataFrame) – a Pandas dataframe with the necessary columns duration_col and event_col, plus other covariates. duration_col refers to the lifetimes of the subjects. event_col refers to whether the ‘death’ event was observed: 1 if observed, 0 otherwise (censored).

  • duration_col (string) – the column in the dataframe that contains the subjects’ lifetimes.

  • event_col (string) – the column in the dataframe that contains the subjects’ death observations. If left as None, assumes all individuals are non-censored.

  • k (int) – the number of folds to perform; n/k of the data will be withheld for testing.

  • evaluation_measure (function) – a function that accepts either (event_times, predicted_event_times) or (event_times, predicted_event_times, event_observed) and returns something (could be anything). Default: lifelines.utils.concordance_index, the C-index between two series of event times.

  • predictor (string) – a string that matches a prediction method on the fitter instances. For example, “predict_expectation” or “predict_percentile”. Default is “predict_expectation”. The interface for the method is:

    predict(self, data, **optional_kwargs)

  • fitter_kwargs – keyword args to pass into fitter.fit method

  • predictor_kwargs – keyword args to pass into predictor-method.

Returns:

results – (k,1) list of scores for each fold. The scores can be anything.

Return type:

list
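
A minimal sketch using the bundled Rossi recidivism dataset and a Cox model:

>>> from lifelines import CoxPHFitter
>>> from lifelines.datasets import load_rossi
>>> from lifelines.utils import k_fold_cross_validation
>>> rossi = load_rossi()
>>> cph = CoxPHFitter()
>>> # returns one concordance score per fold
>>> scores = k_fold_cross_validation(cph, rossi, duration_col='week', event_col='arrest', k=3)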

lifelines.utils.to_long_format(df, duration_col)

This function converts a survival analysis dataframe to the lifelines “long” format. The “long” format is the input expected by a common next step, add_covariate_to_timeline.

Parameters:
  • df (DataFrame) – a Dataframe in the standard survival analysis form (one row per observation, with covariates, duration and event flag)
  • duration_col (string) – string representing the column in df that represents the durations of each subject.
Returns:

long_form_df – A DataFrame with new columns. This can be fed into add_covariate_to_timeline

Return type:

DataFrame
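
A minimal sketch using a bundled regression dataset (its duration column is 'T'):

>>> from lifelines.datasets import load_regression_dataset
>>> from lifelines.utils import to_long_format
>>> df = load_regression_dataset()
>>> base_df = to_long_format(df, duration_col='T')
>>> # base_df now contains 'start' and 'stop' columns and can be fed into add_covariate_to_timeline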

lifelines.utils.to_episodic_format(df, duration_col, event_col, id_col=None, time_gaps=1)

This function takes a “flat” dataset (that is, non-time-varying), and converts it into a time-varying dataset with static variables.

Useful if your dataset has variables that do not satisfy the proportional hazard assumption, and you need to create a time-varying dataset to include interaction terms with time.

Parameters:
  • df (DataFrame) – a DataFrame of the static dataset.
  • duration_col (string) – string representing the column in df that represents the durations of each subject.
  • event_col (string) – string representing the column in df that represents whether the subject experienced the event or not.
  • id_col (string, optional) – Specify the column that represents an id, else lifelines creates an autoincrementing one.
  • time_gaps (float or int) – Specify a desired time_gap. For example, if time_gap is 2 and a subject lives for 10.5 units of time, then the final long form will have 5 + 1 rows for that subject: (0, 2], (2, 4], (4, 6], (6, 8], (8, 10], (10, 10.5]. Smaller time_gaps will produce larger dataframes, and larger time_gaps will produce smaller dataframes. In the limit of a large time_gap, the long dataframe will be essentially identical to the original dataframe.
Returns:

Return type:

DataFrame

Example

>>> from lifelines.datasets import load_rossi
>>> from lifelines.utils import to_episodic_format
>>> rossi = load_rossi()
>>> long_rossi = to_episodic_format(rossi, 'week', 'arrest', time_gaps=2.)
>>>
>>> from lifelines import CoxTimeVaryingFitter
>>> ctv = CoxTimeVaryingFitter()
>>> # age variable violates proportional hazard
>>> long_rossi['time * age'] = long_rossi['stop'] * long_rossi['age']
>>> ctv.fit(long_rossi, id_col='id', event_col='arrest', show_progress=True)
>>> ctv.print_summary()
lifelines.utils.add_covariate_to_timeline(long_form_df, cv, id_col, duration_col, event_col, add_enum=False, overwrite=True, cumulative_sum=False, cumulative_sum_prefix='cumsum_', delay=0)

This is a util function to help create a long form table tracking subjects’ covariate changes over time. It is meant to be used iteratively as one adds more and more covariates to track over time. Before using this function, it is recommended to view the documentation at https://lifelines.readthedocs.io/en/latest/Survival%20Regression.html#dataset-creation-for-time-varying-regression.

Parameters:
  • long_form_df (DataFrame) – a DataFrame that has the initial or intermediate “long” form of time-varying observations. Must contain columns id_col, ‘start’, ‘stop’, and event_col. See function to_long_format to transform data into long form.
  • cv (DataFrame) – a DataFrame that contains (possibly more than) one covariate to track over time. Must contain columns id_col and duration_col. duration_col represents time since the start of the subject’s life.
  • id_col (string) – the column in long_form_df and cv representing a unique identifier for subjects.
  • duration_col (string) – the column in cv that represents the time-since-birth at which the observation occurred.
  • event_col (string) – the column in long_form_df that represents whether the event of interest occurred.
  • add_enum (boolean, optional) – a Boolean flag to denote whether to add a column enumerating rows per subject. Useful to specify a specific observation, ex: df[df[‘enum’] == 1] will grab the first observations per subject.
  • overwrite (boolean, optional) – if True, covariate values in long_form_df will be overwritten by covariate values in cv if the column exists in both cv and long_form_df and the timestamps are identical. If False, the default behaviour will be to sum the values together.
  • cumulative_sum (boolean, optional) – sum over time the new covariates. Makes sense if the covariates are new additions, and not state changes (ex: administering more drugs vs taking a temperature.)
  • cumulative_sum_prefix (string, optional) – a prefix to add to calculated cumulative sum columns
  • delay (int, optional) – add a delay to covariates (useful for checking for reverse causality in analysis)
Returns:

long_form_df – A DataFrame with updated rows to reflect the novel time slices (if any) being added from cv, and novel (or updated) columns of new covariates from cv.

Return type:

DataFrame
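
A minimal sketch (the column names and values below are made up for illustration): build a base long-form table, then fold in a time-varying covariate measured at various times since birth:

>>> import pandas as pd
>>> from lifelines.utils import to_long_format, add_covariate_to_timeline
>>> base_df = pd.DataFrame([{'id': 1, 'duration': 10, 'event': True, 'var1': 0.1},
...                         {'id': 2, 'duration': 12, 'event': False, 'var1': 0.5}])
>>> base_df = to_long_format(base_df, duration_col='duration')
>>> cv = pd.DataFrame([{'id': 1, 'time': 0, 'var2': 1.4},
...                    {'id': 1, 'time': 6, 'var2': 1.2},
...                    {'id': 2, 'time': 0, 'var2': 1.6}])
>>> long_df = add_covariate_to_timeline(base_df, cv, duration_col='time', id_col='id', event_col='event')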

lifelines.utils.covariates_from_event_matrix(df, id_col)

This is a helper function to handle binary event data streams in a specific format and convert them to a format that add_covariate_to_timeline will accept. For example, suppose you have a dataset that looks like:

   id  promotion  movement  raise
0   1        1.0       NaN    2.0
1   2        NaN       5.0    NaN
2   3        3.0       5.0    7.0

where the values (aside from the id column) represent when an event occurred for a specific user, relative to the subject’s birth/entry. This is a common format for data pulled from a SQL table. We call this a duration matrix, and we want to convert this dataframe to a format that can be included in a long form dataframe (see add_covariate_to_timeline for more details on this).

The duration matrix should have 1 row per subject (but not necessarily all subjects).

Parameters:
  • df (DataFrame) – the DataFrame we want to transform
  • id_col (string) – the column in long_form_df and cv representing a unique identifier for subjects.

Example

>>> from lifelines.utils import covariates_from_event_matrix, add_covariate_to_timeline
>>> cv = covariates_from_event_matrix(duration_df, 'id')
>>> long_form_df = add_covariate_to_timeline(long_form_df, cv, 'id', 'duration', 'e', cumulative_sum=True)