utils
- lifelines.utils.add_covariate_to_timeline(long_form_df, cv, id_col, duration_col, event_col, start_col='start', stop_col='stop', add_enum=False, overwrite=True, cumulative_sum=False, cumulative_sum_prefix='cumsum_', delay=0) → DataFrame
This is a util function to help create a long form table tracking subjects’ covariate changes over time. It is meant to be used iteratively as one adds more and more covariates to track over time. Before using this function, it is recommended to view the documentation at https://lifelines.readthedocs.io/en/latest/Time%20varying%20survival%20regression.html#dataset-creation-for-time-varying-regression
- Parameters:
long_form_df (DataFrame) – a DataFrame that has the initial or intermediate “long” form of time-varying observations. Must contain columns id_col, ‘start’, ‘stop’, and event_col. See function to_long_format to transform data into long form.
cv (DataFrame) – a DataFrame that contains (possibly more than) one covariate to track over time. Must contain columns id_col and duration_col. duration_col represents time since the start of the subject’s life.
id_col (string) – the column in long_form_df and cv representing a unique identifier for subjects.
duration_col (string) – the column in cv that represents the time-since-birth the observation occurred at.
event_col (string) – the column in long_form_df that represents whether the event of interest occurred.
add_enum (bool, optional) – a Boolean flag to denote whether to add a column enumerating rows per subject. Useful for selecting a specific observation, ex: df[df['enum'] == 1] will grab the first observation per subject.
overwrite (bool, optional) – if True, covariate values in long_form_df will be overwritten by covariate values in cv if the column exists in both cv and long_form_df and the timestamps are identical. If False, the default behavior will be to sum the values together.
cumulative_sum (bool, optional) – sum over time the new covariates. Makes sense if the covariates are new additions, and not state changes (ex: administering more drugs vs taking a temperature.)
cumulative_sum_prefix (string, optional) – a prefix to add to calculated cumulative sum columns
delay (int, optional) – add a delay to covariates (useful for checking for reverse causality in analysis)
- Returns:
long_form_df – A DataFrame with updated rows to reflect the novel time slices (if any) being added from cv, and novel (or updated) columns of new covariates from cv
- Return type:
DataFrame
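To make the idea of "novel time slices" concrete, here is a minimal pure-Python sketch of how one subject's timeline might be split when a covariate is observed partway through. It is not the lifelines implementation: the helper name `split_timeline`, the dictionary row layout, and the use of `None` for "not yet observed" are all assumptions made for illustration.

```python
def split_timeline(stop, event, observations, delay=0):
    """Split one subject's (0, stop] interval at each covariate observation.

    observations: list of (time_since_birth, value) pairs (hypothetical input).
    delay: shift applied to every observation time before splitting.
    """
    # Keep only observations that take effect before the subject's final time.
    changes = sorted((t + delay, v) for t, v in observations if t + delay < stop)
    rows, start, value = [], 0, None
    for time, new_value in changes:
        if time > start:
            # Close the current slice; the event can only occur on the last slice.
            rows.append({"start": start, "stop": time, "value": value, "event": False})
            start = time
        value = new_value
    rows.append({"start": start, "stop": stop, "value": value, "event": event})
    return rows


rows = split_timeline(stop=10, event=True, observations=[(4, 1.5), (0, 0.5)])
delayed = split_timeline(stop=10, event=True, observations=[(4, 1.5), (0, 0.5)], delay=2)
```

Only the final slice carries the event flag, which mirrors the long format described above, where each row covers one (start, stop] interval.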
- lifelines.utils.concordance_index(event_times, predicted_scores, event_observed=None) → float
Calculates the concordance index (C-index) between a series of event times and a predicted score. The first is the real survival times from the observational data, and the other is the predicted score from a model of some kind.
The c-index is the average of how often a model says X is greater than Y when, in the observed data, X is indeed greater than Y. The c-index also accounts for censored values (if Y is censored, it is hard to know whether X is truly greater than Y).
The concordance index is a value between 0 and 1 where:
0.5 is the expected result from random predictions,
1.0 is perfect concordance and,
0.0 is perfect anti-concordance (multiply predictions with -1 to get 1.0)
The calculation done internally is
>>> (pairs_correct + 0.5 * pairs_tied) / admissable_pairs
where pairs_correct is the number of pairs s.t. if t_x > t_y, then s_x > s_y; pairs_tied is the number of pairs where s_x = s_y; and admissable_pairs is all possible pairs. The subtleties are in how censored observations are handled (ex: not all pairs can be evaluated due to censoring).
- Parameters:
event_times (iterable) – a length-n iterable of observed survival times.
predicted_scores (iterable) – a length-n iterable of predicted scores - these could be survival times, or hazards, etc. See https://stats.stackexchange.com/questions/352183/use-median-survival-time-to-calculate-cph-c-statistic/352435#352435
event_observed (iterable, optional) – a length-n iterable of censoring flags, 1 if observed, 0 if not. Default None assumes all observed.
- Returns:
c-index – a value between 0 and 1.
- Return type:
float
References
Harrell FE, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine 1996;15(4):361-87.
Examples
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index

cph = CoxPHFitter().fit(df, 'T', 'E')
concordance_index(df['T'], -cph.predict_partial_hazard(df), df['E'])
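The formula above can be illustrated with a naive O(n²) pure-Python sketch. This is not how lifelines computes the value internally, and it makes two assumptions that should be checked against the library: pairs with tied event times are skipped, and a pair is admissible only when the subject with the shorter time had an observed event. The function name `naive_concordance_index` is hypothetical.

```python
from itertools import combinations


def naive_concordance_index(event_times, predicted_scores, event_observed=None):
    n = len(event_times)
    if event_observed is None:
        event_observed = [1] * n  # default: every event was observed
    correct = tied = admissible = 0
    for i, j in combinations(range(n), 2):
        if event_times[i] == event_times[j]:
            continue  # assumption: tied event times are skipped in this sketch
        # Order the pair so that `a` is the subject with the shorter time.
        a, b = (i, j) if event_times[i] < event_times[j] else (j, i)
        if not event_observed[a]:
            continue  # shorter time is censored, so the true ordering is unknown
        admissible += 1
        if predicted_scores[a] < predicted_scores[b]:
            correct += 1  # scores are ordered the same way as the times
        elif predicted_scores[a] == predicted_scores[b]:
            tied += 1
    if admissible == 0:
        raise ZeroDivisionError("no admissible pairs")
    return (correct + 0.5 * tied) / admissible


perfect = naive_concordance_index([1, 2, 3, 4], [1, 2, 3, 4])
reversed_scores = naive_concordance_index([1, 2, 3, 4], [4, 3, 2, 1])
constant_scores = naive_concordance_index([1, 2, 3, 4], [7, 7, 7, 7])
```

The three calls reproduce the reference points listed above: perfect concordance, perfect anti-concordance, and the value obtained when every prediction is tied.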
- lifelines.utils.covariates_from_event_matrix(df, id_col) → DataFrame
This is a helper function to handle binary event datastreams in a specific format and convert it to a format that add_covariate_to_timeline will accept. For example, suppose you have a dataset that looks like:
   id  promotion  movement  raise
0   1        1.0       NaN    2.0
1   2        NaN       5.0    NaN
2   3        3.0       5.0    7.0
where the values (aside from the id column) represent when an event occurred for a specific user, relative to the subject’s birth/entry. This is a common format in which to pull data from a SQL table. We call this a duration matrix, and we want to convert this DataFrame to a format that can be included in a long form DataFrame (see add_covariate_to_timeline for more details on this).
The duration matrix should have 1 row per subject (but not necessarily all subjects).
- Parameters:
df (DataFrame) – the DataFrame we want to transform
id_col (string) – the column in df representing a unique identifier for subjects.
Example
cv = covariates_from_event_matrix(duration_df, 'id')
long_form_df = add_covariate_to_timeline(long_form_df, cv, 'id', 'duration', 'e', cumulative_sum=True)
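As a rough illustration of what converting a duration matrix involves, the sketch below unpivots the example table into one record per (subject, event) pair using plain Python. The record layout (`duration` and `event` keys) and the helper name `unpivot_event_matrix` are assumptions for illustration; the DataFrame that lifelines actually returns may be laid out differently.

```python
import math


def unpivot_event_matrix(rows, id_col):
    """Turn one-row-per-subject event times into one record per occurrence."""
    records = []
    for row in rows:
        for event_name, when in row.items():
            if event_name == id_col:
                continue
            # A missing time (None or NaN) means the event never happened.
            if when is None or (isinstance(when, float) and math.isnan(when)):
                continue
            records.append({id_col: row[id_col], "duration": when, "event": event_name})
    return sorted(records, key=lambda r: (r[id_col], r["duration"]))


nan = float("nan")
duration_rows = [
    {"id": 1, "promotion": 1.0, "movement": nan, "raise": 2.0},
    {"id": 2, "promotion": nan, "movement": 5.0, "raise": nan},
    {"id": 3, "promotion": 3.0, "movement": 5.0, "raise": 7.0},
]
records = unpivot_event_matrix(duration_rows, "id")
```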
- lifelines.utils.datetimes_to_durations(start_times, end_times, fill_date=datetime.datetime(2023, 11, 15, 13, 3, 42, 931699), freq='D', dayfirst=False, na_values=None, format=None)¶
This is a very flexible function for transforming arrays of start_times and end_times to the proper format for lifelines: duration and event observation arrays.
- Parameters:
start_times (an array, Series or DataFrame) – iterable representing start times. These can be strings, or datetime objects.
end_times (an array, Series or DataFrame) – iterable representing end times. These can be strings, or datetimes. These values can be None, or an empty string, which corresponds to censorship.
fill_date (a datetime, array, Series or DataFrame, optional (default=datetime.today())) – the date to use if end_times is missing or empty. This corresponds to the last date of observation. Anything after this date is also censored.
freq (string, optional (default=’D’)) – the units of time to use. See Pandas ‘freq’. Default ‘D’ for days.
dayfirst (bool, optional (default=False)) – see Pandas to_datetime
na_values (list, optional) – list of values to recognize as NA/NaN. Ex: [‘’, ‘NaT’]
format (string, optional) – see Pandas to_datetime
- Returns:
T (numpy array) – array of floats representing the durations with time units given by freq.
C (numpy array) – boolean array of event observations: 1 if death observed, 0 else.
Examples
from lifelines.utils import datetimes_to_durations

start_dates = ['2015-01-01', '2015-04-01', '2014-04-05']
end_dates = ['2016-02-02', None, '2014-05-06']

# the second duration is censored, so its value depends on fill_date (today by default)
T, E = datetimes_to_durations(start_dates, end_dates, freq="D")
T # array([ 397., 1414.,   31.])
E # array([ True, False,  True])
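The censoring logic described above can be sketched in pure Python with the standard `datetime` module. This is a simplified illustration, not the library's code: the helper name `naive_durations` is hypothetical, it only handles ISO-formatted date strings and day units, and the fill date of 2019-02-13 is chosen because it reproduces the 1414-day value shown in the example.

```python
from datetime import date


def naive_durations(start_dates, end_dates, fill_date):
    """Return (durations in days, observed flags) for ISO date strings."""
    durations, observed = [], []
    for start, end in zip(start_dates, end_dates):
        start_d = date.fromisoformat(start)
        if end in (None, ""):
            # Missing end date: censored at the last date of observation.
            end_d, seen = fill_date, False
        else:
            end_d = date.fromisoformat(end)
            # An end date after fill_date is also treated as censored.
            seen = end_d <= fill_date
            end_d = min(end_d, fill_date)
        durations.append((end_d - start_d).days)
        observed.append(seen)
    return durations, observed


start_dates = ["2015-01-01", "2015-04-01", "2014-04-05"]
end_dates = ["2016-02-02", None, "2014-05-06"]
durations, observed = naive_durations(start_dates, end_dates, fill_date=date(2019, 2, 13))
```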
- lifelines.utils.find_best_parametric_model(event_times, event_observed=None, scoring_method: str = 'AIC', additional_models=None, censoring_type='right', timeline=None, alpha=None, ci_labels=None, entry=None, weights=None, show_progress=False)
To quickly determine the best¹ univariate model, this function will iterate through each parametric model available in lifelines and select the one that minimizes a particular measure of fit.
¹Best, according to the measure of fit.
- Parameters:
event_times (list, np.array, pd.Series) – a (n,) array of observed survival times. If interval censoring, a tuple of (lower_bound, upper_bound).
event_observed (list, np.array, pd.Series) – a (n,) array of censoring flags, 1 if observed, 0 if not. Default None assumes all observed.
scoring_method (string) – one of {“AIC”, “BIC”}
additional_models (list) – list of other parametric models that implement the lifelines API.
censoring_type (str) – {“right”, “left”, “interval”}
timeline (list, optional) – return the model at the values in timeline (positively increasing)
alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>
entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered study when they were “born”: time zero.
weights (an array, or pd.Series, of length n) – integer weights per observation
Note
Due to instability, the GeneralizedGammaFitter is not tested here.
- Return type:
tuple of fitted best_model and best_score
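The selection rule (pick the candidate with the smallest AIC or BIC) can be sketched without any fitting at all. In the sketch below the log-likelihoods and parameter counts are made-up numbers for illustration only, and `pick_best` is a hypothetical helper, not part of lifelines. It uses the standard definitions AIC = 2k - 2·ln(L) and BIC = k·ln(n) - 2·ln(L).

```python
import math


def aic(log_likelihood, n_params):
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    return n_params * math.log(n_obs) - 2 * log_likelihood


def pick_best(candidates, n_obs, scoring_method="AIC"):
    """candidates maps a model name to (log_likelihood, n_params)."""
    def score(name):
        ll, k = candidates[name]
        return aic(ll, k) if scoring_method == "AIC" else bic(ll, k, n_obs)

    best = min(candidates, key=score)  # smaller score means better fit
    return best, score(best)


# Hypothetical fitted results: (log-likelihood, number of parameters).
candidates = {
    "exponential": (-120.0, 1),
    "weibull": (-118.5, 2),
    "three_param": (-118.4, 3),
}
best_name, best_score = pick_best(candidates, n_obs=50, scoring_method="AIC")
best_name_bic, _ = pick_best(candidates, n_obs=50, scoring_method="BIC")
```

With these invented numbers the two criteria disagree, because BIC penalizes extra parameters more heavily than AIC once n is moderately large. This is why "best" is always relative to the chosen measure of fit.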
- lifelines.utils.group_survival_table_from_events(groups, durations, event_observed, birth_times=None, weights=None, limit=-1) → Tuple[ndarray, DataFrame, DataFrame, DataFrame]
Joins multiple event series together into DataFrames. A generalization of survival_table_from_events to data with groups.
- Parameters:
groups (a (n,) array) – individuals’ group ids.
durations (a (n,) array) – durations of each individual
event_observed (a (n,) array) – event observations, 1 if observed, 0 else.
birth_times (a (n,) array) – when the subject was first observed. A subject’s death event is then at [birth times + duration observed]. Normally set to all zeros, but can be positive or negative.
limit
- Returns:
unique_groups (np.array) – array of all the unique groups present
removed (DataFrame) – DataFrame of removal count data at event_times for each group, column names are ‘removed:<group name>’
observed (DataFrame) – DataFrame of observed count data at event_times for each group, column names are ‘observed:<group name>’
censored (DataFrame) – DataFrame of censored count data at event_times for each group, column names are ‘censored:<group name>’
Example
#input
group_survival_table_from_events(waltonG, waltonT, np.ones_like(waltonT)) #data available in test_suite.py

#output
[
    array(['control', 'miR-137'], dtype=object),
              removed:control  removed:miR-137
    event_at
    6                       0                1
    7                       2                0
    9                       0                3
    13                      0                3
    15                      0                2,
              observed:control  observed:miR-137
    event_at
    6                        0                 1
    7                        2                 0
    9                        0                 3
    13                       0                 3
    15                       0                 2,
              censored:control  censored:miR-137
    event_at
    6                        0                 0
    7                        0                 0
    9                        0                 0,
]
- lifelines.utils.k_fold_cross_validation(fitters, df, duration_col, event_col=None, k=5, scoring_method='log_likelihood', fitter_kwargs={}, seed=None)
Perform cross validation on a dataset. If multiple models are provided, all models will train on each of the k subsets.
- Parameters:
fitters (model) – one or several objects which possess a method: fit(self, data, duration_col, event_col). Note that the last two arguments will be given as keyword arguments, and that event_col is optional. The objects must also have the “predictor” method defined below.
df (DataFrame) – a Pandas DataFrame with necessary columns duration_col and (optional) event_col, plus other covariates. duration_col refers to the lifetimes of the subjects. event_col refers to whether the ‘death’ event was observed: 1 if observed, 0 else (censored).
duration_col (string) – the name of the column in DataFrame that contains the subjects’ lifetimes.
event_col (string, optional) – the name of the column in DataFrame that contains the subjects’ death observation. If left as None, assume all individuals are uncensored.
k (int) – the number of folds to perform. In each fold, n/k of the data will be withheld for testing.
scoring_method (str) – one of {‘log_likelihood’, ‘concordance_index’}. log_likelihood returns the average unpenalized partial log-likelihood; concordance_index returns the concordance index.
fitter_kwargs (dict, optional) – keyword args to pass into the fitter.fit method.
seed (int, optional) – fix a seed in np.random.seed
- Returns:
results – (k,1) list of scores for each fold. The scores can be anything.
- Return type:
list
See also
lifelines.utils.sklearn_adapter.sklearn_adapter
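The fold structure described above (each of the k folds withholds roughly n/k rows for testing) can be sketched with the standard library alone. This shows only the splitting step, under the assumption that rows are shuffled once and then dealt into folds; the way lifelines assigns rows to folds may differ, and `k_fold_indices` is a hypothetical helper.

```python
import random


def k_fold_indices(n, k, seed=None):
    """Return k (train_indices, test_indices) pairs covering range(n)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    folds = [indices[i::k] for i in range(k)]  # deal rows into k folds
    splits = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        splits.append((train, held_out))
    return splits


splits = k_fold_indices(n=10, k=5, seed=0)
```

A model would then be fitted on each `train` subset and scored on the matching held-out subset, giving one score per fold.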
- lifelines.utils.median_survival_times(model_or_survival_function) → float
Compute the median survival time of survival function(s).
- Parameters:
model_or_survival_function (lifelines model or DataFrame) – This can be a univariate lifelines model, or a DataFrame of one or more survival functions.
- lifelines.utils.qth_survival_time(q: float, model_or_survival_function) → float
Returns the time when a single survival function reaches the qth percentile, that is, solves \(q = S(t)\) for \(t\).
- Parameters:
q (float) – value between 0 and 1.
model_or_survival_function (Series, single-column DataFrame, or lifelines model)
- lifelines.utils.qth_survival_times(q, survival_functions) → DataFrame | float
Find the times when one or more survival functions reach the qth percentile.
- Parameters:
q (float or array) – a float between 0 and 1; the function finds the time at which the survival function hits the qth percentile.
survival_functions (a (n,d) DataFrame, Series, or NumPy array) – if a DataFrame or Series, index values (actual times) are returned; if a NumPy array, indices are returned.
- Returns:
if d==1, returns a float (np.inf if infinity); if d > 1, a DataFrame containing the first times the value was crossed.
- Return type:
float, or DataFrame
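For a single survival function sampled on a grid of times, finding the qth survival time amounts to locating the first time at which the curve drops to q or below. The pure-Python sketch below illustrates this under that assumption; it is not the library's implementation, and `first_time_at_or_below` is a hypothetical name. The median survival time is the special case q = 0.5.

```python
import math


def first_time_at_or_below(q, times, survival):
    """Return the first time t with S(t) <= q, or infinity if never reached."""
    for time, s in zip(times, survival):
        if s <= q:
            return time
    return math.inf  # the curve never drops to q within the observed times


times = [1, 2, 3, 4]
survival = [0.9, 0.6, 0.4, 0.1]
median = first_time_at_or_below(0.5, times, survival)
never = first_time_at_or_below(0.05, times, survival)
```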
- lifelines.utils.restricted_mean_survival_time(model_or_survival_function, t: float = inf, return_variance=False) → float | Tuple[float, float]
Compute the restricted mean survival time, RMST, of a survival function. This is defined as
\[\text{RMST}(t) = \int_0^t S(\tau) d\tau\]
The reason we use an upper bound, and not always \(\infty\), is that the tail of a survival function has high variance and strongly affects the RMST.
- Parameters:
model_or_survival_function (lifelines model or DataFrame) – This can be a univariate model, or a pandas DataFrame. The former will provide a more accurate estimate however.
t (float) – The upper limit of the integration in the RMST.
Example
from lifelines import KaplanMeierFitter, WeibullFitter
from lifelines.utils import restricted_mean_survival_time

kmf = KaplanMeierFitter().fit(T, E)
restricted_mean_survival_time(kmf, t=3.5)
restricted_mean_survival_time(kmf.survival_function_, t=3.5)

wf = WeibullFitter().fit(T, E)
restricted_mean_survival_time(wf)
restricted_mean_survival_time(wf.survival_function_)
References
https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-13-152#Sec27
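For a step-shaped survival function (such as a Kaplan-Meier estimate), the RMST integral reduces to a sum of rectangle areas. The sketch below shows that arithmetic in pure Python, assuming S(0) = 1 and that the curve holds each value until the next listed time. It is an illustration of the definition, not the routine lifelines uses, and `rmst_step` is a hypothetical name.

```python
def rmst_step(times, survival, t):
    """Integrate a right-continuous step survival function from 0 to t."""
    area, prev_time, prev_s = 0.0, 0.0, 1.0  # assumption: S(0) = 1
    for time, s in zip(times, survival):
        if time >= t:
            break
        area += prev_s * (time - prev_time)  # rectangle up to this drop
        prev_time, prev_s = time, s
    # Final partial rectangle from the last drop up to the limit t.
    return area + prev_s * (t - prev_time)


# S drops to 0.8 at t=1, 0.5 at t=2, 0.2 at t=3.
rmst = rmst_step([1, 2, 3], [0.8, 0.5, 0.2], t=2.5)
```

Here the area is 1·1 + 0.8·1 + 0.5·0.5 = 2.05.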
- lifelines.utils.survival_events_from_table(survival_table, observed_deaths_col='observed', censored_col='censored')
This is the inverse of the function survival_table_from_events.
- Parameters:
survival_table (DataFrame) – a pandas DataFrame with index as the durations and columns “observed” and “censored”, referring to the number of individuals that died and were censored at time t.
observed_deaths_col (str, optional (default: “observed”)) – the column in the survival table that represents the number of subjects that were observed to die at a specific time
censored_col (str, optional (default: “censored”)) – the column in the survival table that represents the number of subjects that were censored at a specific time
- Returns:
T (array) – durations of observation – one element for observed time
E (array) – event observations – 1 if observed, 0 else.
W (array) – weights - integer weights to “condense” the data
Example
# Ex: The survival table, as a pandas DataFrame:

       observed  censored
index
1             1         0
2             0         1
3             1         0
4             1         1
5             0         1

# would return
T = np.array([ 1.,  2.,  3.,  4.,  4.,  5.])
E = np.array([ 1.,  0.,  1.,  1.,  0.,  0.])
W = np.array([ 1,  1,  1,  1,  1,  1])
See also
survival_table_from_events
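The inversion can be sketched in pure Python: each time contributes one entry for its observed deaths and one for its censorships, with the count carried as a weight. This follows the description of W as weights that "condense" the data, but the exact behaviour of lifelines is an assumption here, and `events_from_table` is a hypothetical helper operating on plain tuples rather than a DataFrame.

```python
def events_from_table(table):
    """table: iterable of (time, observed_count, censored_count) tuples."""
    T, E, W = [], [], []
    for time, n_observed, n_censored in table:
        if n_observed > 0:
            T.append(time)
            E.append(1)           # event observed
            W.append(n_observed)  # how many subjects this entry stands for
        if n_censored > 0:
            T.append(time)
            E.append(0)           # censored
            W.append(n_censored)
    return T, E, W


# The survival table from the example above.
table = [(1, 1, 0), (2, 0, 1), (3, 1, 0), (4, 1, 1), (5, 0, 1)]
T, E, W = events_from_table(table)
```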
- lifelines.utils.survival_table_from_events(death_times, event_observed, birth_times=None, columns=['removed', 'observed', 'censored', 'entrance', 'at_risk'], weights=None, collapse=False, intervals=None) → DataFrame
Create a survival table from a right-censored dataset.
- Parameters:
death_times ((n,) array) – represent the event times
event_observed ((n,) array) – 1 if the event was observed, 0 if censored.
birth_times (a (n,) array, optional) – representing when the subject was first observed. A subject’s death event is then at [birth times + duration observed]. If None (default), birth_times are set to be the first observation or 0, whichever is smaller.
columns (iterable, optional) – a 3-length array naming, in order, the removed individuals, observed deaths, and censorships.
weights ((n,1) array, optional) – Optional argument to use weights for individuals. Assumes weights of 1 if not provided.
collapse (bool, optional (default=False)) – If True, collapses survival table into lifetable to show events in interval bins
intervals (iterable, optional) – Default None, otherwise a list/(n,1) array of interval edge measures. If left as None while collapse=True, then Freedman-Diaconis rule for histogram bins will be used to determine intervals.
- Returns:
Pandas DataFrame with index as the unique times or intervals in event_times. The column named ‘removed’ refers to the number of individuals who were removed from the population by the end of the period. The column ‘observed’ refers to the number of removed individuals who were observed to have died (i.e. not censored). The column ‘censored’ is defined as ‘removed’ - ‘observed’ (the number of individuals who left the population due to event_observed).
- Return type:
DataFrame
Example
#Uncollapsed output
          removed  observed  censored  entrance  at_risk
event_at
0               0         0         0        11       11
6               1         1         0         0       11
7               2         2         0         0       10
9               3         3         0         0        8
13              3         3         0         0        5
15              2         2         0         0        2

#Collapsed output
          removed  observed  censored  at_risk
event_at
(0, 2]         34        33         1      312
(2, 4]         84        42        42      278
(4, 6]         64        17        47      194
(6, 8]         63        16        47      130
(8, 10]        35        12        23       67
(10, 12]       24         5        19       32
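The counts in the uncollapsed table can be reproduced by hand. The sketch below is a simplified pure-Python version that ignores birth times, weights, and the entrance column, so it is only an illustration of how the removed, observed, censored, and at_risk columns relate to one another. The durations are reconstructed from the removed counts in the uncollapsed example, and `naive_survival_table` is a hypothetical name.

```python
from collections import Counter


def naive_survival_table(durations, event_observed):
    """One row per unique duration; everyone is assumed to enter at time 0."""
    observed = Counter(t for t, e in zip(durations, event_observed) if e)
    censored = Counter(t for t, e in zip(durations, event_observed) if not e)
    table = []
    for t in sorted(set(durations)):
        table.append({
            "event_at": t,
            "removed": observed[t] + censored[t],
            "observed": observed[t],
            "censored": censored[t],
            # Still under observation just before time t.
            "at_risk": sum(1 for d in durations if d >= t),
        })
    return table


durations = [6, 7, 7, 9, 9, 9, 13, 13, 13, 15, 15]
table = naive_survival_table(durations, [1] * len(durations))
```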
- lifelines.utils.to_episodic_format(df, duration_col, event_col, id_col=None, time_gaps=1) → DataFrame
This function takes a “flat” dataset (that is, non-time-varying), and converts it into a time-varying dataset with static variables.
Useful if your dataset has variables that do not satisfy the proportional hazard assumption, and you need to create a time-varying dataset to include interaction terms with time.
- Parameters:
df (DataFrame) – a DataFrame of the static dataset.
duration_col (string) – string representing the column in df that represents the durations of each subject.
event_col (string) – string representing the column in df that represents whether the subject experienced the event or not.
id_col (string, optional) – Specify the column that represents an id, else lifelines creates an auto-incrementing one.
time_gaps (float or int) – specify a desired time gap. For example, if time_gaps is 2 and a subject lives for 10.5 units of time, then the final long form will have 5 + 1 rows for that subject: (0, 2], (2, 4], (4, 6], (6, 8], (8, 10], (10, 10.5]. Smaller time_gaps will produce larger DataFrames, and larger time_gaps will produce smaller DataFrames. In the limit of very large time_gaps, the long DataFrame will be identical to the original DataFrame.
Example
from lifelines.datasets import load_rossi
from lifelines.utils import to_episodic_format

rossi = load_rossi()
long_rossi = to_episodic_format(rossi, 'week', 'arrest', time_gaps=2.)

from lifelines import CoxTimeVaryingFitter
ctv = CoxTimeVaryingFitter()

# age variable violates proportional hazard
long_rossi['time * age'] = long_rossi['stop'] * long_rossi['age']

ctv.fit(long_rossi, id_col='id', event_col='arrest', show_progress=True)
ctv.print_summary()
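The effect of time_gaps on a single subject can be sketched in a few lines of pure Python. This only illustrates how the (start, stop] intervals in the time_gaps description are formed; `split_into_episodes` is a hypothetical helper and not how lifelines builds the DataFrame.

```python
def split_into_episodes(duration, time_gaps):
    """Cut (0, duration] into consecutive (start, stop] pieces of width time_gaps."""
    episodes = []
    start = 0.0
    while start < duration:
        stop = min(start + time_gaps, duration)  # last piece may be shorter
        episodes.append((start, stop))
        start = stop
    return episodes


episodes = split_into_episodes(10.5, time_gaps=2)
single = split_into_episodes(10.5, time_gaps=100)
```

With a gap of 2 the subject yields 5 + 1 rows, and with a gap larger than the duration it yields a single row, which is the limiting case in which the long form matches the original data.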
- lifelines.utils.to_long_format(df, duration_col) → DataFrame
This function converts a survival analysis DataFrame to the lifelines “long” format. The lifelines “long” format is the input to a common next function, add_covariate_to_timeline.
- Parameters:
df (DataFrame) – a DataFrame in the standard survival analysis form (one row per observation, with covariates, duration and event flag)
duration_col (string) – string representing the column in df that represents the durations of each subject.
- Returns:
long_form_df – A DataFrame with new columns. This can be fed into add_covariate_to_timeline
- Return type:
DataFrame