statistics

class lifelines.statistics.StatisticalResult(p_value, test_statistic, name=None, test_name=None, **kwargs)

This class holds the results of statistical tests and provides a printer wrapper to display them.

Note

This class’ API changed in version 0.16.0.

Parameters:

p_value (iterable or float) – the p-value(s) of one or more statistical tests
test_statistic (iterable or float) – the test statistic(s) of one or more statistical tests. Must be the same size as p_value if iterable.
test_name (string) – the test that was used. lifelines should set this.
name (iterable or string) – if this class holds multiple results (ex: from a pairwise comparison), this can hold the names. Must be the same size as p-values if iterable.
kwargs – additional information to attach to the object and display in print_summary().

print_summary(decimals=2, style=None, **kwargs)

Print summary statistics describing the results.

Parameters:

decimals (int, optional (default=2)) – specify the number of decimal places to show
style (string, optional) – one of {html, ascii, latex}; default ascii
kwargs – print additional meta data in the output (useful to provide model names, dataset names, etc.) when comparing multiple outputs.

property summary

Returns:

a DataFrame containing the test statistics and the p-value

Return type:

DataFrame

to_ascii(decimals=2, **kwargs)

to_html(decimals=2, **kwargs)

to_latex(decimals=2, **kwargs)

lifelines.statistics.logrank_test(durations_A, durations_B, event_observed_A=None, event_observed_B=None, t_0=-1, weights_A=None, weights_B=None, weightings=None, **kwargs) → StatisticalResult

Measures and reports on whether two intensity processes are different. That is, given two event series, determines whether the data generating processes are statistically different. The test statistic is chi-squared distributed under the null hypothesis. Let \(h_i(t)\) be the hazard rate of group \(i\) at time \(t\), then:

\[\begin{split}\begin{align} & H_0: h_1(t) = h_2(t) \\ & H_A: h_1(t) = c h_2(t), \;\; c \ne 1 \end{align}\end{split}\]

This implicitly uses the log-rank weights.
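
To make the statistic concrete, here is a minimal pure-Python sketch of the unweighted two-group log-rank computation for right-censored data. It assumes the textbook form (observed minus expected events in the first group, with a hypergeometric variance at each event time). The function name logrank_statistic is hypothetical and this is not lifelines' implementation.

```python
import math

def logrank_statistic(durations_a, events_a, durations_b, events_b):
    """Return (chi-squared statistic, p-value) for the unweighted log-rank test."""
    # Tag each subject with its group: 0 for the first group, 1 for the second.
    data = [(t, e, 0) for t, e in zip(durations_a, events_a)]
    data += [(t, e, 1) for t, e in zip(durations_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})

    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n = sum(1 for s, _, _ in data if s >= t)                    # at risk, both groups
        n_a = sum(1 for s, _, g in data if s >= t and g == 0)       # at risk, first group
        d = sum(1 for s, e, _ in data if s == t and e)              # events, both groups
        d_a = sum(1 for s, e, g in data if s == t and e and g == 0) # events, first group
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

    chi2 = observed_minus_expected ** 2 / variance
    # Survival function of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

T1, E1 = [1, 4, 10, 12, 12, 3, 5.4], [1, 0, 1, 0, 1, 1, 1]
T2, E2 = [4, 5, 7, 11, 14, 20, 8, 8], [1, 1, 1, 1, 1, 1, 1, 1]
chi2, p_value = logrank_statistic(T1, E1, T2, E2)
print(round(chi2, 4), round(p_value, 4))
```

The sketch recomputes the risk sets at every event time for readability, which is quadratic in the sample size.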

Note

lifelines logrank implementation only handles right-censored data.
The logrank test has maximum power when the assumption of proportional hazards is true. As a consequence, if the survival curves cross, the logrank test will give an inaccurate assessment of differences.
This implementation is a special case of the function multivariate_logrank_test, which is used internally. See Survival and Event Analysis, page 108.
There are only disadvantages to using the log-rank test versus using the Cox regression. See more here for a discussion. To convert to using the Cox regression:

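import pandas as pd
from lifelines import CoxPHFitter

# stack both groups into one frame, with an indicator column for group A
dfA = pd.DataFrame({'E': event_observed_A, 'T': durations_A, 'groupA': 1})
dfB = pd.DataFrame({'E': event_observed_B, 'T': durations_B, 'groupA': 0})
df = pd.concat([dfA, dfB])

# the coefficient on groupA is the log hazard ratio between the two groups
cph = CoxPHFitter().fit(df, 'T', 'E')
cph.print_summary()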

Parameters:

durations_A (iterable) – a (n,) list-like of event durations (birth to death,…) for the first population.
durations_B (iterable) – a (n,) list-like of event durations (birth to death,…) for the second population.
event_observed_A (iterable, optional) – a (n,) list-like of censorship flags, (1 if observed, 0 if not), for the first population. Default assumes all observed.
event_observed_B (iterable, optional) – a (n,) list-like of censorship flags, (1 if observed, 0 if not), for the second population. Default assumes all observed.
weights_A (iterable, optional) – case weights
weights_B (iterable, optional) – case weights
t_0 (float, optional (default=-1)) – The final time period under observation, and subjects who experience the event after this time are set to be censored. Specify -1 to use all time.
weightings (str, optional) – apply a weighted logrank test: options are “wilcoxon” for Wilcoxon (also known as Breslow), “tarone-ware” for Tarone-Ware, “peto” for Peto test and “fleming-harrington” for Fleming-Harrington test. These are useful for testing for early or late differences in the survival curve. For the Fleming-Harrington test, keyword arguments p and q must also be provided with non-negative values.

Weightings are applied at the ith ordered failure time, \(t_{i}\), according to:
Wilcoxon: \(n_i\) Tarone-Ware: \(\sqrt{n_i}\) Peto: \(\bar{S}(t_i)\) Fleming-Harrington: \(\hat{S}(t_i)^p \times (1 - \hat{S}(t_i))^q\)

where \(n_i\) is the number at risk just prior to time \(t_{i}\), \(\bar{S}(t_i)\) is Peto-Peto’s modified survival estimate and \(\hat{S}(t_i)\) is the left-continuous Kaplan-Meier survival estimate at time \(t_{i}\).
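
The four weighting schemes can be sketched in a few lines of pure Python. This is an illustration only: the function name logrank_weights is hypothetical, and the Peto-Peto estimate is assumed to take its usual form, the running product of \(1 - d_j / (n_j + 1)\), where \(d_j\) is the number of events at \(t_j\).

```python
import math

def logrank_weights(n_at_risk, deaths, p=1.0, q=0.0):
    """Weights at each ordered failure time, given n_i and d_i at those times."""
    weights = {"wilcoxon": [], "tarone-ware": [], "peto": [], "fleming-harrington": []}
    km_left = 1.0  # Kaplan-Meier estimate just before t_i (left-continuous)
    peto = 1.0     # Peto-Peto modified survival estimate at t_i
    for n_i, d_i in zip(n_at_risk, deaths):
        peto *= 1 - d_i / (n_i + 1)
        weights["wilcoxon"].append(n_i)
        weights["tarone-ware"].append(math.sqrt(n_i))
        weights["peto"].append(peto)
        weights["fleming-harrington"].append(km_left ** p * (1 - km_left) ** q)
        km_left *= 1 - d_i / n_i  # update only after the weight for t_i is recorded
    return weights

# Three failure times with 4, 3 and 2 subjects at risk and one event at each.
early = logrank_weights([4, 3, 2], [1, 1, 1], p=1, q=0)
late = logrank_weights([4, 3, 2], [1, 1, 1], p=0, q=1)
```

With p=1, q=0 the Fleming-Harrington weights decrease over time and emphasize early differences, while p=0, q=1 does the opposite.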

Returns:

a StatisticalResult object with properties p_value, summary, test_statistic, print_summary

Return type:

StatisticalResult

Examples

T1 = [1, 4, 10, 12, 12, 3, 5.4]
E1 = [1, 0, 1,  0,  1,  1, 1]

T2 = [4, 5, 7, 11, 14, 20, 8, 8]
E2 = [1, 1, 1, 1,  1,  1,  1, 1]

from lifelines.statistics import logrank_test
results = logrank_test(T1, T2, event_observed_A=E1, event_observed_B=E2)

results.print_summary()
print(results.p_value)        # 0.7676
print(results.test_statistic) # 0.0872

lifelines.statistics.multivariate_logrank_test(event_durations, groups, event_observed=None, weights=None, t_0=-1, weightings=None, **kwargs) → StatisticalResult

This test is a generalization of the logrank_test: it can handle n>2 populations, and it should give the same result as logrank_test when n=2:

\[\begin{split}\begin{align} & H_0: h_1(t) = h_2(t) = h_3(t) = \dots = h_n(t) \\ & H_A: \text{there exists at least one group that differs from the others.} \end{align}\end{split}\]

Parameters:

event_durations (iterable) – a (n,) list-like representing the (possibly partial) durations of all individuals
groups (iterable) – a (n,) list-like of group labels, one per individual.
event_observed (iterable, optional) – a (n,) list-like of event indicators: 1 if the death was observed, 0 if censored. Defaults to all observed.
weights (iterable, optional) – case-weights
t_0 (float, optional (default=-1)) – The final time period under observation, and subjects who experience the event after this time are set to be censored. Specify -1 to use all time.
weightings (str, optional) – apply a weighted logrank test: options are “wilcoxon” for Wilcoxon (also known as Breslow), “tarone-ware” for Tarone-Ware, “peto” for Peto test and “fleming-harrington” for Fleming-Harrington test. These are useful for testing for early or late differences in the survival curve. For the Fleming-Harrington test, keyword arguments p and q must also be provided with non-negative values.

Weightings are applied at the ith ordered failure time, \(t_{i}\), according to:
Wilcoxon: \(n_i\) Tarone-Ware: \(\sqrt{n_i}\) Peto: \(\bar{S}(t_i)\) Fleming-Harrington: \(\hat{S}(t_i)^p \times (1 - \hat{S}(t_i))^q\)

where \(n_i\) is the number at risk just prior to time \(t_{i}\), \(\bar{S}(t_i)\) is Peto-Peto’s modified survival estimate and \(\hat{S}(t_i)\) is the left-continuous Kaplan-Meier survival estimate at time \(t_{i}\).
kwargs – add keywords and meta-data to the experiment summary.

Returns:

a StatisticalResult object with properties p_value, summary, test_statistic, print_summary

Return type:

StatisticalResult

Examples

import pandas as pd
from lifelines.statistics import multivariate_logrank_test

df = pd.DataFrame({
   'durations': [5, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
   'events': [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
   'groups': [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2]
})
result = multivariate_logrank_test(df['durations'], df['groups'], df['events'])
result.test_statistic
result.p_value
result.print_summary()

# the same test using plain list-likes instead of DataFrame columns
G = [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2]
T = [5, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7]
E = [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0]
result = multivariate_logrank_test(T, G, E)
result.test_statistic

See also

pairwise_logrank_test, logrank_test

lifelines.statistics.pairwise_logrank_test(event_durations, groups, event_observed=None, t_0=-1, weightings=None, **kwargs) → StatisticalResult

Perform the logrank test pairwise for all \(n \ge 2\) unique groups.

Parameters:

event_durations (iterable) – a (n,) list-like representing the (possibly partial) durations of all individuals
groups (iterable) – a (n,) list-like of group labels, one per individual.
event_observed (iterable, optional) – a (n,) list-like of event indicators: 1 if the death was observed, 0 if censored. Defaults to all observed.
t_0 (float, optional (default=-1)) – The final time period under observation, and subjects who experience the event after this time are set to be censored. Specify -1 to use all time.
weightings (str, optional) – apply a weighted logrank test: options are “wilcoxon” for Wilcoxon (also known as Breslow), “tarone-ware” for Tarone-Ware, “peto” for Peto test and “fleming-harrington” for Fleming-Harrington test. These are useful for testing for early or late differences in the survival curve. For the Fleming-Harrington test, keyword arguments p and q must also be provided with non-negative values.

Weightings are applied at the ith ordered failure time, \(t_{i}\), according to:
Wilcoxon: \(n_i\) Tarone-Ware: \(\sqrt{n_i}\) Peto: \(\bar{S}(t_i)\) Fleming-Harrington: \(\hat{S}(t_i)^p \times (1 - \hat{S}(t_i))^q\)

where \(n_i\) is the number at risk just prior to time \(t_{i}\), \(\bar{S}(t_i)\) is Peto-Peto’s modified survival estimate and \(\hat{S}(t_i)\) is the left-continuous Kaplan-Meier survival estimate at time \(t_{i}\).
kwargs – add keywords and meta-data to the experiment summary.

Returns:

a StatisticalResult object that contains all the pairwise comparisons (try StatisticalResult.summary or StatisticalResult.print_summary)

Return type:

StatisticalResult

lifelines.statistics.power_under_cph(n_exp, n_con, p_exp, p_con, postulated_hazard_ratio, alpha=0.05) → float

This computes the power of the hypothesis test that the two groups, experiment and control, have different hazards (that is, the hazard ratio is different from 1).

Parameters:

n_exp (integer) – size of the experiment group.
n_con (integer) – size of the control group.
p_exp (float) – probability of failure in experimental group over period of study.
p_con (float) – probability of failure in control group over period of study
postulated_hazard_ratio (float) – the postulated hazard ratio
alpha (float, optional (default=0.05)) – type I error rate

Returns:

power to detect the magnitude of the hazard ratio as small as that specified by postulated_hazard_ratio.

Return type:

float
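
The calculation can be sketched with the standard library alone. The sketch assumes the Schoenfeld-type approximation described in the powerSurvEpi manual; the function name approx_power_under_cph is hypothetical, and lifelines' own implementation may differ in detail.

```python
import math
from statistics import NormalDist

def approx_power_under_cph(n_exp, n_con, p_exp, p_con, postulated_hazard_ratio, alpha=0.05):
    """Approximate power of a two-sided test that the hazard ratio differs from 1."""
    std_normal = NormalDist()
    expected_events = n_exp * p_exp + n_con * p_con  # events expected over the study
    k = n_exp / n_con                                # allocation ratio
    effect = (
        math.sqrt(k * expected_events)
        * abs(postulated_hazard_ratio - 1)
        / (k * postulated_hazard_ratio + 1)
    )
    return std_normal.cdf(effect - std_normal.inv_cdf(1 - alpha / 2))

# Two groups of 421 subjects, failure probabilities 0.25 and 0.35, hazard ratio 0.7.
power = approx_power_under_cph(421, 421, p_exp=0.25, p_con=0.35, postulated_hazard_ratio=0.7)
print(round(power, 3))
```

Under this approximation, power grows with the expected number of events rather than with the raw sample size, so low failure probabilities call for larger groups.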

Notes

Reference.

See also

sample_size_necessary_under_cph

lifelines.statistics.proportional_hazard_test(fitted_cox_model, training_df, time_transform='rank', precomputed_residuals=None, **kwargs) → StatisticalResult

Test whether any variable in a Cox model breaks the proportional hazard assumption. This method uses an approximation that R’s survival package used to use. R changed its method in late 2019, so results will differ between lifelines and R.

Parameters:

fitted_cox_model (CoxPHFitter) – the fitted Cox model you wish to test, fitted with training_df. Currently only CoxPHFitter is supported; support for CoxTimeVaryingFitter is planned.
training_df (DataFrame) – the DataFrame used in the call to the Cox model’s fit. Optional if providing precomputed_residuals
time_transform (vectorized function, list, or string, optional (default=’rank’)) – {‘all’, ‘km’, ‘rank’, ‘identity’, ‘log’} One of the strings above, a list of strings, or a function to transform the time (must accept (time, durations, weights) however). ‘all’ will present all the transforms.
precomputed_residuals (DataFrame, optional) – specify the scaled Schoenfeld residuals, if already computed.
kwargs – additional parameters to add to the StatisticalResult

Notes

R uses km as its default transform; lifelines uses rank, as this performs well versus other transforms. See http://eprints.lse.ac.uk/84988/1/06_ParkHendry2015-ReassessingSchoenfeldTests_Final.pdf

lifelines.statistics.sample_size_necessary_under_cph(power, ratio_of_participants, p_exp, p_con, postulated_hazard_ratio, alpha=0.05)

This computes the sample size needed to achieve a given power when comparing two groups under a Cox Proportional Hazard model.

Parameters:

power (float) – power to detect the magnitude of the hazard ratio as small as that specified by postulated_hazard_ratio.
ratio_of_participants (float) – ratio of participants in experimental group over control group.
p_exp (float) – probability of failure in experimental group over period of study.
p_con (float) – probability of failure in control group over period of study
postulated_hazard_ratio (float) – the postulated hazard ratio
alpha (float, optional (default=0.05)) – type I error rate

Returns:

n_exp (integer) – the sample size needed for the experimental group to achieve the desired power
n_con (integer) – the sample size needed for the control group to achieve the desired power

Examples

from lifelines.statistics import sample_size_necessary_under_cph

desired_power = 0.8
ratio_of_participants = 1.
p_exp = 0.25
p_con = 0.35
postulated_hazard_ratio = 0.7
n_exp, n_con = sample_size_necessary_under_cph(desired_power, ratio_of_participants, p_exp, p_con, postulated_hazard_ratio)
# (421, 421)
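
The numbers in this example can be reproduced with the standard library. The sketch assumes the Schoenfeld-type formula from the powerSurvEpi manual referenced below; the function name approx_sample_size_under_cph is hypothetical, and lifelines' own code may differ in detail.

```python
import math
from statistics import NormalDist

def approx_sample_size_under_cph(power, ratio_of_participants, p_exp, p_con,
                                 postulated_hazard_ratio, alpha=0.05):
    """Approximate (n_exp, n_con) needed to reach the requested power."""
    z = NormalDist().inv_cdf
    k = ratio_of_participants
    hr = postulated_hazard_ratio
    # Total number of events required across both groups.
    required_events = (1 / k) * ((k * hr + 1) / (hr - 1)) ** 2 * (z(1 - alpha / 2) + z(power)) ** 2
    # Convert events to subjects using each group's probability of failure.
    n_con = required_events / (k * p_exp + p_con)
    n_exp = k * n_con
    return math.ceil(n_exp), math.ceil(n_con)

sizes = approx_sample_size_under_cph(0.8, 1.0, p_exp=0.25, p_con=0.35, postulated_hazard_ratio=0.7)
print(sizes)  # (421, 421)
```

The required number of events depends only on the hazard ratio, the allocation ratio, alpha and power. The failure probabilities then determine how many subjects are needed to observe that many events.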

References

https://cran.r-project.org/web/packages/powerSurvEpi/powerSurvEpi.pdf

See also

power_under_cph

lifelines.statistics.survival_difference_at_fixed_point_in_time_test(point_in_time, fitterA, fitterB, **result_kwargs) → StatisticalResult

Often analysts want to compare the survival of groups at specific times, rather than comparing the entire survival curves against each other. For example, analysts may be interested in 5-year survival. Statistically comparing the naive Kaplan-Meier points at a specific time actually has reduced power (see [1]). By transforming the survival function, we can recover more power. This function uses the log(-log(·)) transformation.

Parameters:

point_in_time (float) – the point in time to analyze the survival curves at.
fitterA – A lifelines univariate model fitted to the data. This can be a KaplanMeierFitter, WeibullFitter, etc.
fitterB – the second lifelines model to compare against.
result_kwargs – add keywords and meta-data to the experiment summary

Returns:

a StatisticalResult object with properties p_value, summary, test_statistic, print_summary

Return type:

StatisticalResult

Examples

from lifelines import KaplanMeierFitter

T1 = [1, 4, 10, 12, 12, 3, 5.4]
E1 = [1, 0, 1,  0,  1,  1, 1]
kmf1 = KaplanMeierFitter().fit(T1, E1)

T2 = [4, 5, 7, 11, 14, 20, 8, 8]
E2 = [1, 1, 1, 1,  1,  1,  1, 1]
kmf2 = KaplanMeierFitter().fit(T2, E2)

from lifelines.statistics import survival_difference_at_fixed_point_in_time_test
results = survival_difference_at_fixed_point_in_time_test(12.0, kmf1, kmf2)

results.print_summary()
print(results.p_value)        # 0.77
print(results.test_statistic) # 0.09

Notes

1. Other transformations are possible, but Klein et al. [1] showed that the log(-log(·)) transform has the most desirable statistical properties.

2. The API of this function changed in v0.25.3. This new API allows for right, left and interval censoring models to be tested.
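
For the Kaplan-Meier case, the log(-log(·)) statistic from Klein et al. [1] can be sketched in pure Python. This is an illustration under stated assumptions: variances come from Greenwood's formula, both survival estimates lie strictly between 0 and 1 at the chosen time, and the function names are hypothetical. lifelines computes the variances through the fitted models, so its output need not match this sketch.

```python
import math

def km_and_greenwood(durations, events, t):
    """Kaplan-Meier S(t) and the Greenwood sum of d_j / (n_j (n_j - d_j)) over t_j <= t."""
    survival, greenwood = 1.0, 0.0
    for t_j in sorted({s for s, e in zip(durations, events) if e and s <= t}):
        n_j = sum(1 for s in durations if s >= t_j)                       # at risk at t_j
        d_j = sum(1 for s, e in zip(durations, events) if s == t_j and e)  # events at t_j
        survival *= 1 - d_j / n_j
        greenwood += d_j / (n_j * (n_j - d_j))
    return survival, greenwood

def cloglog(s):
    """The log(-log(.)) transformation of a survival probability."""
    return math.log(-math.log(s))

def fixed_point_test(t, durations_a, events_a, durations_b, events_b):
    """Chi-squared statistic (1 degree of freedom) and p-value at time t."""
    s_a, g_a = km_and_greenwood(durations_a, events_a, t)
    s_b, g_b = km_and_greenwood(durations_b, events_b, t)
    # Delta-method variance of log(-log S): Greenwood sum divided by (log S)^2.
    variance = g_a / math.log(s_a) ** 2 + g_b / math.log(s_b) ** 2
    chi2 = (cloglog(s_a) - cloglog(s_b)) ** 2 / variance
    return chi2, math.erfc(math.sqrt(chi2 / 2))

T1, E1 = [1, 4, 10, 12, 12, 3, 5.4], [1, 0, 1, 0, 1, 1, 1]
T2, E2 = [4, 5, 7, 11, 14, 20, 8, 8], [1, 1, 1, 1, 1, 1, 1, 1]
chi2, p_value = fixed_point_test(12.0, T1, E1, T2, E2)
```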

References

[1] Klein, J. P., Logan, B., Harhoff, M. and Andersen, P. K. (2007), Analyzing survival curves at a fixed point in time. Statist. Med., 26: 4505-4519. doi:10.1002/sim.2864