utils

lifelines.utils.
qth_survival_times
(q, survival_functions, cdf=False) Find the times when one or more survival functions reach the qth percentile.
Parameters:  q (float or array) – a float between 0 and 1: the survival level (the qth percentile) whose crossing time is sought.
 survival_functions (a (n,d) DataFrame or numpy array.) – If DataFrame, will return index values (actual times). If numpy array, will return indices.
 cdf (boolean, optional) – When working with left-censored data, cdf=True is used.
Returns: if d==1, returns a float, np.inf if infinity. If d > 1, a DataFrame containing the first times the value was crossed.
Return type: float, or DataFrame
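The lookup described above can be sketched in plain Python. This is a minimal illustration, not the lifelines implementation: the helper name first_crossing_time is hypothetical, and the crossing rule (first time the curve is at or below q) is an assumption.

```python
import math


def first_crossing_time(q, times, survival):
    """Return the first time at which the survival curve is <= q.

    Hypothetical helper: `times` and `survival` are parallel lists,
    with `survival` non-increasing. Returns math.inf if the curve
    never drops to q.
    """
    for t, s in zip(times, survival):
        if s <= q:
            return t
    return math.inf


times = [0, 6, 7, 9, 13, 15]
survival = [1.0, 0.9, 0.7, 0.45, 0.2, 0.0]

# The curve first falls to 0.5 or below at t = 9.
median = first_crossing_time(0.5, times, survival)

# A curve that never reaches 0.5 yields infinity.
never = first_crossing_time(0.5, [0, 1, 2], [1.0, 0.9, 0.8])
```

Returning infinity when the curve never reaches q mirrors the np.inf case mentioned in the return description.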

lifelines.utils.
qth_survival_time
(q, survival_function, cdf=False) Returns the time when a single survival function reaches the qth percentile.
Parameters:  q (float) – a float between 0 and 1: the survival level (the qth percentile) whose crossing time is sought.
 survival_function (Series or single-column DataFrame.)
 cdf (boolean, optional) – When working with left-censored data, cdf=True is used.
Return type: float

lifelines.utils.
median_survival_times
(density_or_survival_function, left_censorship=False)

lifelines.utils.
survival_table_from_events
(death_times, event_observed, birth_times=None, columns=['removed', 'observed', 'censored', 'entrance', 'at_risk'], weights=None, collapse=False, intervals=None)
Parameters:  death_times ((n,) array) – represent the event times
 event_observed ((n,) array) – 1 if observed event, 0 if censored.
 birth_times (a (n,) array, optional) – representing when the subject was first observed. A subject’s death event is then at [birth times + duration observed]. If None (default), birth_times are set to be the first observation or 0, whichever is smaller.
 columns (iterable, optional) – a 3-length array of names for, in order, the removed individuals, observed deaths and censorships.
 weights ((n,1) array, optional) – Optional argument to use weights for individuals. Assumes weights of 1 if not provided.
 collapse (boolean, optional (default=False)) – If True, collapses survival table into lifetable to show events in interval bins
 intervals (iterable, optional) – Default None, otherwise a list/(n,1) array of interval edge measures. If left as None while collapse=True, then the Freedman-Diaconis rule for histogram bins will be used to determine intervals.
Returns: Pandas DataFrame with index as the unique times or intervals in event_times. The column named ‘removed’ refers to the number of individuals who were removed from the population by the end of the period. The column ‘observed’ refers to the number of removed individuals who were observed to have died (i.e. not censored.) The column ‘censored’ is defined as ‘removed’ - ‘observed’ (the number of individuals who left the population due to event_observed)
Return type: DataFrame
Example
>>> # Uncollapsed output
>>>           removed  observed  censored  entrance  at_risk
>>> event_at
>>> 0               0         0         0        11       11
>>> 6               1         1         0         0       11
>>> 7               2         2         0         0       10
>>> 9               3         3         0         0        8
>>> 13              3         3         0         0        5
>>> 15              2         2         0         0        2
>>> # Collapsed output
>>>           removed  observed  censored  at_risk
>>>               sum       sum       sum      max
>>> event_at
>>> (0, 2]         34        33         1      312
>>> (2, 4]         84        42        42      278
>>> (4, 6]         64        17        47      194
>>> (6, 8]         63        16        47      130
>>> (8, 10]        35        12        23       67
>>> (10, 12]       24         5        19       32
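The counting behind such a table can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not the lifelines implementation: simple_survival_table is a hypothetical helper, every subject is assumed to enter at time 0, weights are all 1, and the entrance row is omitted.

```python
from collections import Counter


def simple_survival_table(durations, event_observed):
    """Tabulate removed/observed/censored/at_risk counts per unique time.

    Hypothetical helper: assumes all subjects enter at time 0 with weight 1.
    """
    observed = Counter(t for t, e in zip(durations, event_observed) if e)
    censored = Counter(t for t, e in zip(durations, event_observed) if not e)
    at_risk = len(durations)
    rows = []
    for t in sorted(set(durations)):
        removed = observed[t] + censored[t]
        rows.append({
            "event_at": t,
            "removed": removed,
            "observed": observed[t],
            "censored": censored[t],
            "at_risk": at_risk,  # subjects still under observation just before t
        })
        at_risk -= removed
    return rows


# Eleven subjects, all with observed events (made-up durations).
durations = [6, 7, 7, 9, 9, 9, 13, 13, 13, 15, 15]
events = [1] * len(durations)
table = simple_survival_table(durations, events)
```

Each row satisfies censored = removed - observed, and at_risk shrinks by the number removed at the previous time.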

lifelines.utils.
group_survival_table_from_events
(groups, durations, event_observed, birth_times=None, limit=1) Joins multiple event series together into DataFrames. A generalization of survival_table_from_events to data with groups.
Parameters:  groups (a (n,) array) – individuals’ group ids.
 durations (a (n,) array) – durations of each individual
 event_observed (a (n,) array) – event observations, 1 if observed, 0 else.
 birth_times (a (n,) array) – when the subject was first observed. A subject’s death event is then at [birth times + duration observed]. Normally set to all zeros, but can be positive or negative.
 limit
Returns:  unique_groups (np.array) – array of all the unique groups present
 removed (DataFrame) – DataFrame of removal count data at event_times for each group, column names are ‘removed:<group name>’
 observed (DataFrame) – DataFrame of observed count data at event_times for each group, column names are ‘observed:<group name>’
 censored (DataFrame) – DataFrame of censored count data at event_times for each group, column names are ‘censored:<group name>’
Example
>>> # input
>>> group_survival_table_from_events(waltonG, waltonT, np.ones_like(waltonT))  # data available in test_suite.py
>>> # output
>>> [
>>>     array(['control', 'miR137'], dtype=object),
>>>               removed:control  removed:miR137
>>>     event_at
>>>     6                       0               1
>>>     7                       2               0
>>>     9                       0               3
>>>     13                      0               3
>>>     15                      0               2
>>>     ,
>>>               observed:control  observed:miR137
>>>     event_at
>>>     6                        0                1
>>>     7                        2                0
>>>     9                        0                3
>>>     13                       0                3
>>>     15                       0                2
>>>     ,
>>>               censored:control  censored:miR137
>>>     event_at
>>>     6                        0                0
>>>     7                        0                0
>>>     9                        0                0
>>>     ,
>>> ]

lifelines.utils.
survival_events_from_table
(survival_table, observed_deaths_col='observed', censored_col='censored') This is the inverse of the function survival_table_from_events.
Parameters:  survival_table (DataFrame) – a pandas DataFrame with index as the durations and columns “observed” and “censored”, referring to the number of individuals that died and were censored at time t.
observed_deaths_col (str, optional (default: “observed”)) – the column in the survival table that represents the number of subjects that were observed to die at a specific time
censored_col (str, optional (default: “censored”)) – the column in the survival table that represents the number of subjects that were censored at a specific time
Returns:  T (array) – durations of observation – one element for observed time
 E (array) – event observations – 1 if observed, 0 else.
 W (array) – weights – integer weights to “condense” the data
Example
>>> # Ex: The survival table, as a pandas DataFrame:
>>>
>>>        observed  censored
>>> index
>>> 1             1         0
>>> 2             0         1
>>> 3             1         0
>>> 4             1         1
>>> 5             0         1
>>>
>>> # would return
>>> T = np.array([ 1., 2., 3., 4., 4., 5.])
>>> E = np.array([ 1., 0., 1., 1., 0., 0.])
>>> W = np.array([ 1, 1, 1, 1, 1, 1])
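The expansion in this example can be reproduced with a short stdlib-only sketch. It is an illustration rather than the lifelines implementation: events_from_table is a hypothetical helper that takes the table as a plain dict and emits one entry per subject with weight 1, without condensing repeated entries.

```python
def events_from_table(table):
    """Expand {duration: (n_observed, n_censored)} into parallel T, E, W lists.

    Hypothetical helper: one entry per subject, so every weight is 1.
    """
    T, E = [], []
    for t in sorted(table):
        n_observed, n_censored = table[t]
        # Observed deaths at time t come first, then censored subjects.
        T.extend([float(t)] * n_observed)
        E.extend([1.0] * n_observed)
        T.extend([float(t)] * n_censored)
        E.extend([0.0] * n_censored)
    W = [1] * len(T)
    return T, E, W


# Same table as in the example above.
survival_table = {1: (1, 0), 2: (0, 1), 3: (1, 0), 4: (1, 1), 5: (0, 1)}
T, E, W = events_from_table(survival_table)
```

With the table from the example, T, E and W come out as the three arrays listed there.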

lifelines.utils.
datetimes_to_durations
(start_times, end_times, fill_date=datetime.datetime(2019, 8, 10, 3, 36, 57, 632055), freq='D', dayfirst=False, na_values=None) This is a very flexible function for transforming arrays of start_times and end_times to the proper format for lifelines: duration and event observation arrays.
Parameters:  start_times (an array, Series or DataFrame) – iterable representing start times. These can be strings, or datetime objects.
 end_times (an array, Series or DataFrame) – iterable representing end times. These can be strings, or datetimes. These values can be None, or an empty string, which corresponds to censorship.
 fill_date (datetime, optional (default=datetime.Today())) – the date to use if end_times is a None or empty string. This corresponds to last date of observation. Anything after this date is also censored.
 freq (string, optional (default=’D’)) – the units of time to use. See Pandas ‘freq’. Default ‘D’ for days.
 dayfirst (boolean, optional (default=False)) – convert assuming European-style dates, i.e. day/month/year.
 na_values (list, optional) – list of values to recognize as NA/NaN. Ex: [‘’, ‘NaT’]
Returns:  T (numpy array) – array of floats representing the durations with time units given by freq.
 C (numpy array) – boolean array of event observations: 1 if death observed, 0 else.
Examples
>>> from lifelines.utils import datetimes_to_durations
>>>
>>> start_dates = ['2015-01-01', '2015-04-01', '2014-04-05']
>>> end_dates = ['2016-02-02', None, '2014-05-06']
>>>
>>> T, E = datetimes_to_durations(start_dates, end_dates, freq="D")
>>> T # array([ 397., 1414., 31.])
>>> E # array([ True, False, True])
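The arithmetic in this example can be checked with the standard datetime module. The sketch below is not the lifelines implementation: durations_in_days is a hypothetical helper that only handles ISO-formatted date strings and day units, and the fill date of 2019-02-13 is an assumption chosen so that the censored subject's duration matches the value shown above.

```python
from datetime import date


def durations_in_days(start_dates, end_dates, fill_date):
    """Convert ISO date strings to (durations in days, event-observed flags).

    Hypothetical helper: a missing end date (None or "") is censored at
    fill_date, and any end date after fill_date is also treated as censored.
    """
    T, E = [], []
    for start, end in zip(start_dates, end_dates):
        observed = end is not None and end != ""
        stop = date.fromisoformat(end) if observed else fill_date
        if stop > fill_date:
            stop, observed = fill_date, False
        T.append(float((stop - date.fromisoformat(start)).days))
        E.append(observed)
    return T, E


start_dates = ["2015-01-01", "2015-04-01", "2014-04-05"]
end_dates = ["2016-02-02", None, "2014-05-06"]

# Assumed last date of observation (not stated in the example).
T, E = durations_in_days(start_dates, end_dates, fill_date=date(2019, 2, 13))
```

The second subject has no end date, so its duration depends entirely on the fill date. This is why the default, the current date, gives a different value each day the function is run.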

lifelines.utils.
concordance_index
(event_times, predicted_scores, event_observed=None) Calculates the concordance index (C-index) between two series of event times. The first is the real survival times from the experimental data, and the other is the predicted survival times from a model of some kind.
The c-index is the average of how often a model says X is greater than Y when, in the observed data, X is indeed greater than Y. The c-index also accounts for censored values (obviously, if Y is censored, it’s hard to know if X is truly greater than Y).
The concordance index is a value between 0 and 1 where:
 0.5 is the expected result from random predictions,
 1.0 is perfect concordance and,
 0.0 is perfect anti-concordance (multiply predictions with -1 to get 1.0)
Parameters:  event_times (iterable) – a length-n iterable of observed survival times.
 predicted_scores (iterable) – a length-n iterable of predicted scores – these could be survival times, or hazards, etc. See https://stats.stackexchange.com/questions/352183/usemediansurvivaltimetocalculatecphcstatistic/352435#352435
 event_observed (iterable, optional) – a length-n iterable of censorship flags, 1 if observed, 0 if not. Default None assumes all observed.
Returns: c-index – a value between 0 and 1.
Return type: float
References
Harrell FE, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine 1996;15(4):361-87.
Examples
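>>> from lifelines import CoxPHFitter
>>> from lifelines.utils import concordance_index
>>> # df is assumed to be a DataFrame with duration column 'T' and event column 'E'
>>> cph = CoxPHFitter().fit(df, 'T', 'E')
>>> # negate the partial hazard: a higher hazard implies a shorter survival time
>>> concordance_index(df['T'], -cph.predict_partial_hazard(df), df['E'])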
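The pairwise definition above can be sketched as a naive O(n²) loop. This is a simplified illustration, not the lifelines implementation: naive_concordance_index is a hypothetical helper, pairs with tied event times are skipped, and tied predictions count as half-concordant, all of which are assumptions of this sketch.

```python
from itertools import combinations


def naive_concordance_index(event_times, predicted_scores, event_observed=None):
    """Fraction of comparable pairs whose predicted order matches the observed order."""
    n = len(event_times)
    if event_observed is None:
        event_observed = [1] * n  # assume every event was observed
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(n), 2):
        if event_times[i] == event_times[j]:
            continue  # tied times are skipped in this simplified sketch
        first, second = (i, j) if event_times[i] < event_times[j] else (j, i)
        if not event_observed[first]:
            continue  # earlier subject was censored: the true order is unknown
        comparable += 1
        if predicted_scores[first] < predicted_scores[second]:
            concordant += 1.0
        elif predicted_scores[first] == predicted_scores[second]:
            concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


times = [2, 4, 6, 8]
perfect = naive_concordance_index(times, [1, 2, 3, 4])
anti = naive_concordance_index(times, [4, 3, 2, 1])
# Multiplying anti-concordant predictions by -1 restores perfect concordance.
flipped = naive_concordance_index(times, [-s for s in [4, 3, 2, 1]])
```

The sign flip in the last call is the same reason hazard-like scores are negated before being compared with survival times.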

lifelines.utils.
k_fold_cross_validation
(fitters, df, duration_col, event_col=None, k=5, evaluation_measure=<function concordance_index>, predictor='predict_expectation', predictor_kwargs={}, fitter_kwargs={}) Perform cross validation on a dataset. If multiple models are provided, all models will train on each of the k subsets.
Parameters:  fitters (model) – one or several objects which possess a method: fit(self, data, duration_col, event_col). Note that the last two arguments will be given as keyword arguments, and that event_col is optional. The objects must also have the “predictor” method defined below.
 df (DataFrame) – a Pandas DataFrame with necessary columns duration_col and (optional) event_col, plus other covariates. duration_col refers to the lifetimes of the subjects. event_col refers to whether the ‘death’ event was observed: 1 if observed, 0 else (censored).
 duration_col ((n,) array) – the column in DataFrame that contains the subjects lifetimes.
 event_col ((n,) array) – the column in DataFrame that contains the subject’s death observation. If left as None, assumes all individuals are non-censored.
 k (int) – the number of folds to perform. n/k data will be withheld for testing on.
 evaluation_measure (function) – a function that accepts either (event_times, predicted_event_times), or (event_times, predicted_event_times, event_observed). Default: statistics.concordance_index (C-index).
 predictor (string) – a string that matches a prediction method on the fitter instances. For example, predict_expectation or predict_percentile. Default is “predict_expectation”. The interface for the method is: predict(self, data, **optional_kwargs)
 fitter_kwargs – keyword args to pass into fitter.fit method.
 predictor_kwargs – keyword args to pass into predictor method.
Returns: results – (k,1) list of scores for each fold. The scores can be anything.
Return type: list
See also
lifelines.utils.sklearn_adapter.sklearn_adapter()
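The fold logic described above (hold out roughly n/k rows, fit on the rest, score on the held-out rows) can be sketched without any library. This is an illustration only, not the lifelines implementation: k_fold_indices and cross_validate are hypothetical helpers, and the toy "model" is just a mean.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, test_indices) for k folds over n shuffled rows."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for test in folds:
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


def cross_validate(rows, k, fit, score):
    """Fit on each training split and score on the held-out split."""
    results = []
    for train_idx, test_idx in k_fold_indices(len(rows), k):
        model = fit([rows[i] for i in train_idx])
        results.append(score(model, [rows[i] for i in test_idx]))
    return results


def fit_mean(train):
    # Toy "model": the mean of the training durations.
    return sum(train) / len(train)


def score_mae(model, test):
    # Mean absolute error of the toy model on the held-out rows.
    return sum(abs(x - model) for x in test) / len(test)


rows = [float(x) for x in range(1, 11)]
scores = cross_validate(rows, 5, fit_mean, score_mae)
```

One score is produced per fold, matching the list of k scores described in the return value.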

lifelines.utils.
to_long_format
(df, duration_col) This function converts a survival analysis DataFrame to a lifelines “long” format. The lifelines “long” format is used in a common next function, add_covariate_to_timeline.
Parameters:  df (DataFrame) – a DataFrame in the standard survival analysis form (one row per observation, with covariates, duration and event flag)
 duration_col (string) – string representing the column in df that represents the durations of each subject.
Returns: long_form_df – A DataFrame with new columns. This can be fed into add_covariate_to_timeline
Return type: DataFrame

lifelines.utils.
to_episodic_format
(df, duration_col, event_col, id_col=None, time_gaps=1) This function takes a “flat” dataset (that is, non-time-varying), and converts it into a time-varying dataset with static variables.
Useful if your dataset has variables that do not satisfy the proportional hazard assumption, and you need to create a time-varying dataset to include interaction terms with time.
Parameters:  df (DataFrame) – a DataFrame of the static dataset.
 duration_col (string) – string representing the column in df that represents the durations of each subject.
 event_col (string) – string representing the column in df that represents whether the subject experienced the event or not.
 id_col (string, optional) – Specify the column that represents an id, else lifelines creates an auto-incrementing one.
 time_gaps (float or int) – Specify a desired time_gap. For example, if time_gap is 2 and a subject lives for 10.5 units of time, then the final long form will have 5 + 1 rows for that subject: (0, 2], (2, 4], (4, 6], (6, 8], (8, 10], (10, 10.5]. Smaller time_gaps will produce larger DataFrames, and larger time_gaps will produce smaller DataFrames. In the limit of very large time_gaps, the long DataFrame will be identical to the original DataFrame.
Return type: DataFrame
Example
>>> from lifelines.datasets import load_rossi
>>> from lifelines.utils import to_episodic_format
>>> rossi = load_rossi()
>>> long_rossi = to_episodic_format(rossi, 'week', 'arrest', time_gaps=2.)
>>>
>>> from lifelines import CoxTimeVaryingFitter
>>> ctv = CoxTimeVaryingFitter()
>>> # age variable violates proportional hazard
>>> long_rossi['time * age'] = long_rossi['stop'] * long_rossi['age']
>>> ctv.fit(long_rossi, id_col='id', event_col='arrest', show_progress=True)
>>> ctv.print_summary()
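The interval splitting described for time_gaps can be sketched for a single subject. This is an illustration under stated assumptions, not the lifelines implementation: episodic_rows is a hypothetical helper, and marking the event only on the final interval is an assumption of this sketch.

```python
def episodic_rows(duration, event, time_gap):
    """Split (0, duration] into consecutive (start, stop, event) rows.

    Hypothetical helper: the event flag is carried only by the last row.
    """
    rows = []
    start = 0.0
    while start < duration:
        stop = min(start + time_gap, duration)
        is_last = stop == duration
        rows.append((start, stop, bool(event) and is_last))
        start = stop
    return rows


# A subject living 10.5 time units with time_gap 2 yields 5 + 1 rows.
rows = episodic_rows(10.5, event=True, time_gap=2)

# A time_gap longer than the lifetime leaves a single row, as in the flat dataset.
single = episodic_rows(10.5, event=True, time_gap=100)
```

Static covariates would simply be repeated on every row. Time interactions such as the 'time * age' column in the example are then computed from the stop column.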

lifelines.utils.
add_covariate_to_timeline
(long_form_df, cv, id_col, duration_col, event_col, start_col='start', stop_col='stop', add_enum=False, overwrite=True, cumulative_sum=False, cumulative_sum_prefix='cumsum_', delay=0) This is a util function to help create a long form table tracking subjects’ covariate changes over time. It is meant to be used iteratively as one adds more and more covariates to track over time. Before using this function, it is recommended to view the documentation at https://lifelines.readthedocs.io/en/latest/Time%20varying%20survival%20regression.html#datasetcreationfortimevaryingregression
Parameters:  long_form_df (DataFrame) – a DataFrame that has the initial or intermediate “long” form of time-varying observations. Must contain columns id_col, ‘start’, ‘stop’, and event_col. See function to_long_format to transform data into long form.
 cv (DataFrame) – a DataFrame that contains (possibly more than) one covariate to track over time. Must contain columns id_col and duration_col. duration_col represents time since the start of the subject’s life.
 id_col (string) – the column in long_form_df and cv representing a unique identifier for subjects.
 duration_col (string) – the column in cv that represents the time-since-birth the observation occurred at.
 event_col (string) – the column in df that represents if the event-of-interest occurred
 add_enum (boolean, optional) – a Boolean flag to denote whether to add a column enumerating rows per subject. Useful to specify a specific observation, ex: df[df[‘enum’] == 1] will grab the first observations per subject.
 overwrite (boolean, optional) – if True, covariate values in long_form_df will be overwritten by covariate values in cv if the column exists in both cv and long_form_df and the timestamps are identical. If False, the default behavior will be to sum the values together.
 cumulative_sum (boolean, optional) – sum over time the new covariates. Makes sense if the covariates are new additions, and not state changes (ex: administering more drugs vs taking a temperature.)
 cumulative_sum_prefix (string, optional) – a prefix to add to calculated cumulative sum columns
 delay (int, optional) – add a delay to covariates (useful for checking for reverse causality in analysis)
Returns: long_form_df – A DataFrame with updated rows to reflect the novel time slices (if any) being added from cv, and novel (or updated) columns of new covariates from cv
Return type: DataFrame

lifelines.utils.
covariates_from_event_matrix
(df, id_col) This is a helper function to handle binary event data streams in a specific format and convert them to a format that add_covariate_to_timeline will accept. For example, suppose you have a dataset that looks like:
   id  promotion  movement  raise
0   1        1.0       NaN    2.0
1   2        NaN       5.0    NaN
2   3        3.0       5.0    7.0
where the values (aside from the id column) represent when an event occurred for a specific user, relative to the subject’s birth/entry. This is a common format for data pulled from a SQL table. We call this a duration matrix, and we want to convert this DataFrame to a format that can be included in a long form DataFrame (see add_covariate_to_timeline for more details on this).
The duration matrix should have 1 row per subject (but not necessarily all subjects).
Parameters:  df (DataFrame) – the DataFrame we want to transform
 id_col (string) – the column in df representing a unique identifier for subjects.
Example
>>> cv = covariates_from_event_matrix(duration_df, 'id')
>>> long_form_df = add_covariate_to_timeline(long_form_df, cv, 'id', 'duration', 'e', cumulative_sum=True)
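To make the reshaping concrete, here is a stdlib-only sketch that turns the duration matrix shown earlier into one row per (subject, event) with a duration and 0/1 indicator columns. The exact output layout of covariates_from_event_matrix is not specified in this section, so this layout is an assumption, and long_rows_from_event_matrix is a hypothetical helper.

```python
import math


def long_rows_from_event_matrix(rows, id_col):
    """Emit one row per non-missing event time: id, duration, 0/1 indicators."""
    event_names = [c for c in rows[0] if c != id_col]
    long_rows = []
    for row in rows:
        for name in event_names:
            t = row[name]
            if t is None or (isinstance(t, float) and math.isnan(t)):
                continue  # this subject never experienced this event
            new_row = {id_col: row[id_col], "duration": t}
            new_row.update({e: int(e == name) for e in event_names})
            long_rows.append(new_row)
    return long_rows


nan = float("nan")
duration_matrix = [
    {"id": 1, "promotion": 1.0, "movement": nan, "raise": 2.0},
    {"id": 2, "promotion": nan, "movement": 5.0, "raise": nan},
    {"id": 3, "promotion": 3.0, "movement": 5.0, "raise": 7.0},
]
cv_rows = long_rows_from_event_matrix(duration_matrix, "id")
```

Rows of this shape carry an id column and a duration column. These are the two columns that add_covariate_to_timeline requires of its cv argument.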