# Survival Regression¶

Often we have additional data aside from the durations, and if applicable any censorships that occurred. In the regime dataset, we have the type of government the political leader was part of, the country they were head of, and the year they were elected. Can we use this data in survival analysis?

Yes, the technique is called *survival regression* – the name implies
we regress covariates (eg: year elected, country, etc.) against a
another variable – in this case durations and lifetimes. Similar to the
logic in the first part of this tutorial, we cannot use traditional
methods like linear regression.

There are two popular competing techniques in survival regression: Cox’s model and Aalen’s additive model. Both models attempt to represent the hazard rate \(\lambda(t)\). In Cox’s model, the relationship is defined:

On the other hand, Aalen’s additive model assumes the following form:

Warning

These are still experimental.

## Aalen’s Additive model¶

The estimator to fit unknown coefficients in Aalen’s additive model is
located in `estimators`

under `AalenAdditiveFitter`

. For this
exercise, we will use the regime dataset and include the categorical
variables `un_continent_name`

(eg: Asia, North America,...), the
`regime`

type (eg: monarchy, civilan,...) and the year the regime
started in, `start_year`

.

Aalens additive model typically does not estimate the individual
\(b_i(t)\) but instead estimates \(\int_0^t b_i(s) \; ds\)
(similar to estimate of the hazard rate using `NelsonAalenFitter`

above). This is important to keep in mind when analzying the output.

```
from lifelines import AalenAdditiveFitter
data.head()
```

ctryname | cowcode2 | politycode | un_region_name | un_continent_name | ehead | leaderspellreg | democracy | regime | start_year | duration | observed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | Afghanistan | 700 | 700 | Southern Asia | Asia | Mohammad Zahir Shah | Mohammad Zahir Shah.Afghanistan.1946.1952.Mona... | Non-democracy | Monarchy | 1946 | 7 | 1 |

2 | Afghanistan | 700 | 700 | Southern Asia | Asia | Sardar Mohammad Daoud | Sardar Mohammad Daoud.Afghanistan.1953.1962.Ci... | Non-democracy | Civilian Dict | 1953 | 10 | 1 |

3 | Afghanistan | 700 | 700 | Southern Asia | Asia | Mohammad Zahir Shah | Mohammad Zahir Shah.Afghanistan.1963.1972.Mona... | Non-democracy | Monarchy | 1963 | 10 | 1 |

4 | Afghanistan | 700 | 700 | Southern Asia | Asia | Sardar Mohammad Daoud | Sardar Mohammad Daoud.Afghanistan.1973.1977.Ci... | Non-democracy | Civilian Dict | 1973 | 5 | 0 |

5 | Afghanistan | 700 | 700 | Southern Asia | Asia | Nur Mohammad Taraki | Nur Mohammad Taraki.Afghanistan.1978.1978.Civi... | Non-democracy | Civilian Dict | 1978 | 1 | 0 |

5 rows × 12 columns

I’m using the lovely library `patsy`

<https://github.com/pydata/patsy>`__ here to create a
covariance matrix from my original dataframe.

```
import patsy
# the '-1' term
# refers to not adding an intercept column (a column of all 1s).
# It can be added to the Fitter class.
X = patsy.dmatrix('un_continent_name + regime + start_year -1', data, return_type='dataframe')
```

```
X.columns
```

```
['un_continent_name[Africa]',
'un_continent_name[Americas]',
'un_continent_name[Asia]',
'un_continent_name[Europe]',
'un_continent_name[Oceania]',
'regime[T.Military Dict]',
'regime[T.Mixed Dem]',
'regime[T.Monarchy]',
'regime[T.Parliamentary Dem]',
'regime[T.Presidential Dem]',
'start_year']
```

Below we create our Fitter class. Since we did not supply an intercept
column in our matrix we have included the keyword `fit_intercept=True`

(`True`

by default) which will append the column of ones to our
matrix. (Sidenote: the intercept term, \(b_0(t)\) in survival
regression is often referred to as the *baseline* hazard.)

We have also included the `coef_penalizer`

option. During the estimation, a
linear regression is computed at each step. Often the regression can be
unstable (due to high
co-linearity
or small sample sizes) – adding a penalizer term controls the stability. I recommend always starting with a small penalizer term – if
the estimates still appear to be too unstable, try increasing it.

```
aaf = AalenAdditiveFitter(coef_penalizer=1.0, fit_intercept=True)
```

Like the API syntax above, an instance of `AalenAdditiveFitter`

includes a `fit`

method that performs the inference on the coefficients. This method accepts a pandas DataFrame: each row is an individual and columns are the covariates and
two special columns: a *duration* column and a boolean *event occured* column (where event occured refers to the event of interest - expulsion from government in this case)

```
data = lifelines.datasets.load_dd()
X['T'] = data['duration']
X['E'] = data['observed']
```

**The api for .fit was different prior to lifelines 0.3, below refers to the 0.3+ versions**

```
aaf.fit(X, 'T', event_col='E')
```

After fitting, the instance exposes a `cumulative_hazards_`

DataFrame
containing the estimates of \(\int_0^t b_i(s) \; ds\):

```
figsize(12.5,8)
aaf.cumulative_hazards_.head()
```

un_continent_name[Africa] | un_continent_name[Americas] | un_continent_name[Asia] | un_continent_name[Europe] | un_continent_name[Oceania] | regime[T.Military Dict] | regime[T.Mixed Dem] | regime[T.Monarchy] | regime[T.Parliamentary Dem] | regime[T.Presidential Dem] | start_year | baseline | |
---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | -0.051595 | -0.082406 | 0.010666 | 0.154493 | -0.060438 | 0.075333 | 0.086274 | -0.133938 | 0.048077 | 0.127171 | 0.000116 | -0.029280 |

2 | -0.014713 | -0.039471 | 0.095668 | 0.194251 | -0.092696 | 0.115033 | 0.358702 | -0.226233 | 0.168783 | 0.121862 | 0.000053 | 0.143039 |

3 | 0.007389 | -0.064758 | 0.115121 | 0.170549 | 0.069371 | 0.161490 | 0.677347 | -0.271183 | 0.328483 | 0.146234 | 0.000004 | 0.297672 |

4 | -0.058418 | 0.011399 | 0.091784 | 0.205824 | 0.125722 | 0.220028 | 0.932674 | -0.294900 | 0.365604 | 0.422617 | 0.000002 | 0.376311 |

5 | -0.099282 | 0.106641 | 0.112083 | 0.150708 | 0.091900 | 0.241575 | 1.123860 | -0.391103 | 0.536185 | 0.743913 | 0.000057 | 0.362049 |

`AalenAdditiveFitter`

also has built in plotting:

```
aaf.plot( columns=[ 'regime[T.Presidential Dem]', 'baseline', 'un_continent_name[Europe]' ], ix=slice(1,15) )
```

Regression is most interesting if we use it on data we have not yet seen, i.e. prediction! We can use what we have learned to predict individual hazard rates, survival functions, and median survival time. The dataset we are using is limited to 2008, so let’s use this data to predict the (though already partly seen) possible duration of Canadian Prime Minister Stephen Harper.

```
ix = (data['ctryname'] == 'Canada') * (data['start_year'] == 2006)
harper = X.ix[ix]
print "Harper's unique data point", harper
```

```
Harper's unique data point
```

```
array([[ 0., 0., 1., 0., 0., 0., 0., 1.,
0., 0., 2003.]])
```

```
ax = plt.subplot(2,1,1)
aaf.predict_cumulative_hazard(harper).plot(ax=ax)
ax = plt.subplot(2,1,2)
aaf.predict_survival_function(harper).plot(ax=ax);
```

## Cox’s Proportional Hazard model¶

New in 0.4.0 is the implementation of the Propotional Hazard’s regression model (implemented in
R under `coxph`

). It has a similar API to Aalen’s Additive model. Like R, it has a `print_summary`

function that prints a tabuluar view of coefficients and related stats.

This example data is from the paper here.

```
from lifelines.datasets import load_rossi
from lifelines import CoxPHFitter
rossi_dataset = load_rossi()
cf = CoxPHFitter()
cf.fit(rossi_dataset, 'week', event_col='arrest')
cf.print_summary() # access the results using cf.summary
"""
n=432, number of events=114
coef exp(coef) se(coef) z p lower 0.95 upper 0.95
fin -1.897e-01 8.272e-01 9.579e-02 -1.981e+00 4.763e-02 -3.775e-01 -1.938e-03 *
age -3.500e-01 7.047e-01 1.344e-01 -2.604e+00 9.210e-03 -6.134e-01 -8.651e-02 **
race 1.032e-01 1.109e+00 1.012e-01 1.020e+00 3.078e-01 -9.516e-02 3.015e-01
wexp -7.486e-02 9.279e-01 1.051e-01 -7.124e-01 4.762e-01 -2.809e-01 1.311e-01
mar -1.421e-01 8.675e-01 1.254e-01 -1.134e+00 2.570e-01 -3.880e-01 1.037e-01
paro -4.134e-02 9.595e-01 9.522e-02 -4.341e-01 6.642e-01 -2.280e-01 1.453e-01
prio 2.639e-01 1.302e+00 8.291e-02 3.182e+00 1.460e-03 1.013e-01 4.264e-01 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Concordance = 0.642
"""
```

To access the coefficients and the baseline hazard, you can use `cf.hazards_`

and `cf.baseline_hazard_`

respectively. After fitting, you can use use the suite of prediction methods (similar to Aalen’s additve model above): `.predict_hazard(X)`

, `.predict_survival_function(X)`

, etc.

### Stratification¶

Sometimes a covariate may not obey the proportional hazard assumption. In this case, we can allow a factor to be adjusted for without estimating its effect. To specify categorical variables to be used in stratification, we specify them in the call to `fit`

:

```
cf.fit(rossi_dataset, 'week', event_col='arrest', strata=['race'])
cf.print_summary() # access the results using cf.summary
"""
n=432, number of events=114
coef exp(coef) se(coef) z p lower 0.95 upper 0.95
fin -1.890e-01 8.278e-01 9.576e-02 -1.973e+00 4.848e-02 -3.767e-01 -1.218e-03 *
age -3.503e-01 7.045e-01 1.343e-01 -2.608e+00 9.106e-03 -6.137e-01 -8.700e-02 **
wexp -7.107e-02 9.314e-01 1.053e-01 -6.746e-01 4.999e-01 -2.776e-01 1.355e-01
mar -1.452e-01 8.649e-01 1.255e-01 -1.157e+00 2.473e-01 -3.911e-01 1.008e-01
paro -4.079e-02 9.600e-01 9.524e-02 -4.283e-01 6.684e-01 -2.275e-01 1.459e-01
prio 2.661e-01 1.305e+00 8.319e-02 3.198e+00 1.381e-03 1.030e-01 4.292e-01 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Concordance = 0.638
"""
```

## Model Selection in Survival Regression¶

With censorship, it’s not correct to use a loss function like mean-squared-error or mean-absolute-loss. Instead, one measure is the c-index, or concordance-index. This measure evaluates the ordering of predicted times: how correct is the ordering? It is infact a generalization of AUC, another common loss function, and is interpretted similarly:

- 0.5 is the expected result from random predictions,
- 1.0 is perfect concordance and,
- 0.0 is perfect anti-concordance (multiply predictions with -1 to get 1.0)

The measure is implemented in lifelines under lifelines.utils.concordance_index and accepts the actual times (along with any censorships), and the predicted times.

### Cross Validation¶

Lifelines has an implementation of k-fold cross validation under lifelines.utils.k_fold_cross_validation. This function accepts an instance of a regression fitter (either `CoxPHFitter`

of `AalenAdditiveFitter`

), a dataset, plus k (the number of folds to perform, default 5). On each fold, it splits the data
into a training set and a testing set, fits itself on the training set, and evaluates itself on the testing set (using the concordance measure).

```
from lifelines import CoxPHFitter
from lifelines.datasets import load_regression_dataset
from lifelines.utils import k_fold_cross_validation
regression_dataset = load_regression_dataset()
cf = CoxPHFitter()
scores = k_fold_cross_validation(cf, regression_dataset, 'T', event_col='E', k=3)
print scores
print np.mean(scores)
print np.std(scores)
#[ 0.5896 0.5358 0.5028]
# 0.542
# 0.035
```