In one sentence

Kaplan–Meier estimates the probability of not having had the event by time \(t\)—the survival function \(S(t)\)—using only the observed event times and who was still at risk at each moment. The result is a step function: it stays flat between event times and drops at each time someone has the event.


What you need

Two things per subject:

What    Meaning
Time    Time from start until the event or until follow-up ended (censoring).
Event   1 = event happened at that time; 0 = censored (event not seen).

Censored means we don’t know what happened after that time (e.g. lost to follow-up, study ended). Censored subjects still count as “at risk” up to their censoring time; after that they drop out of the calculation.


A tiny example

Suppose we have 8 people. Time is in months; 1 = event (e.g. relapse), 0 = censored.

Example: 8 subjects, time (months) and event (1 = yes, 0 = censored).
id   time   event
1    2      1
2    3      0
3    4      1
4    4      1
5    5      0
6    6      1
7    8      1
8    10     0

Event times are 2, 4, 6, and 8 months (two people have the event at 4). One person is censored at 3, one at 5, and one at 10; censored times don’t create a “step” in the curve.
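As a minimal sketch, the example data can be held as (time, event) pairs and split into event times and censoring times. The variable names here are illustrative only and are not part of trcpetc.

```python
# (time in months, event) pairs copied from the example table; 1 = event, 0 = censored
subjects = [(2, 1), (3, 0), (4, 1), (4, 1), (5, 0), (6, 1), (8, 1), (10, 0)]

# Distinct times at which at least one event occurred (these create steps)
event_times = sorted({t for t, e in subjects if e == 1})

# Times at which someone was censored (these do not create steps)
censor_times = sorted(t for t, e in subjects if e == 0)

print(event_times)   # [2, 4, 6, 8]
print(censor_times)  # [3, 5, 10]
```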


The formula

At each event time \(t_i\), we know how many had the event there (\(d_i\)) and how many were still at risk just before (\(n_i\)). The probability of surviving past that moment is \(1 - d_i/n_i\). The Kaplan–Meier estimator is the product of these terms up to time \(t\):

\[ \hat{S}(t) = \prod_{t_i \leq t} \left( 1 - \frac{d_i}{n_i} \right) \]

So we start at 1 (100% “surviving”) and multiply by \((1 - d_i/n_i)\) at each event time. Censored subjects stay in \(n_i\) until their censoring time, then leave the at-risk set.
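The product formula can be sketched in a few lines of plain Python. This is an illustrative implementation, not the code used by trcpetc; the function name `kaplan_meier` is hypothetical, and it assumes the usual convention that a subject censored exactly at an event time still counts as at risk at that time.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return a list of (t_i, n_i, d_i, S_hat(t_i)), one row per distinct event time."""
    # d_i: number of events at each distinct event time
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    rows = []
    s = 1.0  # the estimate starts at 1 before any event
    for t in sorted(deaths):
        # n_i: subjects whose time is >= t are still at risk just before t
        n = sum(1 for u in times if u >= t)
        d = deaths[t]
        s *= 1 - d / n  # multiply in the factor (1 - d_i / n_i)
        rows.append((t, n, d, s))
    return rows

# The eight subjects from the example above
times = [2, 3, 4, 4, 5, 6, 8, 10]
events = [1, 0, 1, 1, 0, 1, 1, 0]

for t, n, d, s in kaplan_meier(times, events):
    print(t, n, d, round(s, 3))
```

Censored subjects never appear in `deaths`, so they add no factor to the product; they only affect the at-risk count `n` for event times up to their own censoring time.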


Step-by-step with our example

We order by time and at each event time compute \(n_i\) (at risk), \(d_i\) (events), and the running product \(\hat{S}(t)\):

At each event time: number at risk, number of events, and the Kaplan–Meier estimate S(t).
Event time t   At risk n   Events d   S(t)    S(t) %
2              8           1          0.875   87.5%
4              6           2          0.583   58.3%
6              3           1          0.389   38.9%
8              2           1          0.194   19.4%

Interpretation: just before the first event (t=2), all 8 are at risk. One has the event, so the proportion surviving that moment is \(1 - 1/8 = 0.875\). So \(\hat{S}(2) = 0.875\). By t=4, one person has had the event (at 2) and one was censored at 3, so 6 are at risk; 2 have the event at 4, so we multiply by \(1 - 2/6\). That gives \(\hat{S}(4) = 0.875 \times (4/6) \approx 0.583\). The table continues the same way.
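The last two rows of the table can be written out as exact fractions, using the at-risk counts from the table. After the two events at t=4, 4 people remain; the censoring at 5 leaves 3 at risk at t=6, and the event at 6 leaves 2 at risk at t=8:

\[ \hat{S}(6) = \frac{7}{8} \times \frac{4}{6} \times \frac{2}{3} = \frac{7}{18} \approx 0.389 \]

\[ \hat{S}(8) = \frac{7}{18} \times \frac{1}{2} = \frac{7}{36} \approx 0.194 \]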


How the table becomes the curve

The survival curve is a step function:

  • Flat between event times (nothing happens, so the estimated probability doesn’t change).
  • Drops at each event time (by the fraction \(d_i/n_i\) of the current height).
  • Censored times do not cause a drop; they only reduce the “at risk” count for later steps.
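The three rules above amount to a lookup: find the last event time at or before \(t\) and read off its estimate. A minimal sketch, using the rounded values from the calculation table (the helper name `survival_at` is hypothetical and not part of trcpetc):

```python
from bisect import bisect_right

# Event times and rounded estimates copied from the calculation table
event_times = [2, 4, 6, 8]
s_hat = [0.875, 0.583, 0.389, 0.194]

def survival_at(t):
    """Height of the step curve at time t (1.0 before the first event)."""
    k = bisect_right(event_times, t)  # number of event times <= t
    return 1.0 if k == 0 else s_hat[k - 1]

for t in [0, 2, 3, 5, 10]:
    print(t, survival_at(t))
```

The censoring times 3, 5, and 10 return the same height as the preceding event time, which is why they add no step to the curve.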

So the numbers in the table are exactly the heights of the curve at and after each event time. Here we produce the curve with trcpetc: estimate_cif_km() does the same calculation, and show_surv() draws it.

Kaplan–Meier survival curve (trcpetc). Steps at event times (2, 4, 6, 8); censored times (3, 5, 10) do not create steps.


The steps down at 2, 4, 6, and 8 match the calculation table above.