Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent 'date of entry' reported by survexp and questions on survexp #134

Open
qmarcou opened this issue Feb 25, 2021 · 4 comments
Open

Comments

@qmarcou
Copy link

qmarcou commented Feb 25, 2021

Dear @therneau ,
First of all, thank you very much for this great package !

Inconsistent date of entry

I have been trying to use the survexp function to compute individual expected survival and have been confused by one of the helper output of the function (more precisely by the $summ attribute). I'm using version 3.2.7 of the package.
I would expect the "date of entry" to correspond to the date passed for start of follow up passed through the year= attribute in the rmap list. However I have found out that the reported date of entry depends on the provided ratetable, and thus possibly returns wrong date of entries. Here is a minimal example reproducing this behavior

library(survival)
library(relsurv)
data("slopop") # import slovenian population mortality table from the relsurv package

# Build a mock data frame with a single 63 years old male patient with start of follow up on April 24th 2003
df <- data.frame(list("sex"=c("male"),
                     "diag"=c(as.Date.character("28/04/2003",format = "%d/%m/%Y")),
                     "age"=c(63*365)))

# Computed expected survival at 100 days for this single individual cohort
exp_surv_us <- survexp(formula = ~1, data = df,
                        rmap = list(sex = sex, year = diag, age = age),
                        times = (100),
                        ratetable = survexp.us)

exp_surv_slo <- survexp(formula = ~1, data = df,
        rmap = list(sex = sex, year = diag, age = age),
        times = (100),
        ratetable = slopop)

print(exp_surv_us$summ) # output the correct entry date
print(exp_surv_slo$summ) # output an incorrect entry date

Here is the obtained result

> print(exp_surv_us$summ) # output the correct date
[1] "  age ranges from 63 to 63 years\n  male: 1  female: 0 \n  date of entry from 2003-04-28 to 2003-04-28 \n"
> print(exp_surv_slo$summ) # output an incorrect date
[1] "  age ranges from 63 to 63 years\n  male: 1  female: 0 \n  date of entry from 27Apr93 to 27Apr93 \n"

Though after a thorough check I'm pretty sure that the correct hazard (the one provided for 2003 and not 1993) is used to compute survival even for the slovenian ratetable, which is reassuring, I still find this output very misleading. In particular I wouldn't expect the reported date of entry to depend on the provided ratetable object, but I might be missing something.

Other questions on survexp

I would also have several other questions on the survexp function, for which I failed to find clear answers in the documentation:

  1. Should all times (in the response term of the formula and in the times= arguments) be in "days" unit? I found elusive hints in the documentation that it is not necessarily the case e.g:

All numeric dimensions of a rate table must be in the same units. The survexp.us rate table contains daily hazard rates, the age cutpoints are in days, and the calendar year cutpoints are a Date.
ratetables documentation, R survival package v3.2.7

or

ptime <- mdy.date(4,1,74) - jasa$entry.dt
#max potential fu
ptime <- ifelse(jasa$fustat==1, ptime, jasa$futime) / 365.24
exp4 <- survexp(ptime ~ ratetable(age=(age=(age/365.24), status="Current",
amount=2, duration=1, sex='Male'),
data=jasa, ratetable=smoke.rate, conditional=F, scale=1)

This example does illustrate some points. For any factor variable, the ratetable function allows use of either a character name or the actual column number. Since I have chosen the current smoker category, duration is unimportant, and any value could have been specified. The most important point is to note that age has been rescaled. This table contains rates per year, whereas the US tables contained rates per day. It is crucial that all of the time variables (age, duration, etc) be scaled to the same units,
or the results may not be even remotely correct. The US rate tables were created using days as the basic unit since year of entry will normally be a julian date; for the smoking data years seemed more natural.

A Package for Survival Analysis in S, 2012, P.69

However if provided times can be in any unit, I fail to understand how the implementation knows which unit to use to perform the integral for the cumulated hazard when times are simply passed as numeric values. A similar question stands for the unit of provided hazards in ratetable objects.

  1. In the current documentation I could not find information on the 'conditional' method, see the correponding pieces of documentation of the survexp function:

weights: case weights. This is most useful when conditional survival for a known population is desired, e.g., the data set would contain all unique age/sex combinations and the weights would be the proportion of each.

method : computational method for the creating the survival curves. The individual option does not create a curve, rather it retrieves the predicted survival individual.s or cumulative hazard individual.h for each subject. The default is to use method='ederer' if the formula has no response, and method='hakulinen' otherwise.

conditional: logical value. This argument has been superseded by the method argument. To maintain backwards compatability, if it is present and TRUE it implies method='conditional'.

Details section:

Cohort survival is used to produce an overall survival curve. This is then added to the Kaplan-Meier plot of the study group for visual comparison between these subjects and the population at large. There are three common methods of computing cohort survival. In the "exact method" of Ederer the cohort is not censored, for this case no response variable is required in the formula. Hakulinen recommends censoring the cohort at the anticipated censoring time of each patient, and Verheul recommends censoring the cohort at the actual observation time of each patient. The last of these is the conditional method. These are obtained by using the respective time values as the follow-up time or response in the formula.

There's a bit more information in the survexp.fit documentation about this 'conditional', but I'm still not sure what is the computed quantity at the end.

Thank you very much for your help and again thank you very much for this great work!
Best regards

@qmarcou qmarcou changed the title Inconsistent 'date of entry' reported by survexp and questions Inconsistent 'date of entry' reported by survexp and questions on survexp Feb 25, 2021
@jgoungounga
Copy link

jgoungounga commented Mar 2, 2021

Dear @qmarcou ,

This is interesting questions for me also.

I tried to reproduce your first code and add some suggestions for using slopop rate table.

library(survival)
library(relsurv)

str(devtools::session_info()["packages"][[1]][c("survival","relsurv"), c("package", "loadedversion")])

Classes ‘packages_info’ and 'data.frame': 2 obs. of 2 variables:
$ package : chr "survival" "relsurv"
$ loadedversion: chr "3.2-7" "2.2-3"

data("slopop") # import slovenian population mortality table from the relsurv package

Build a mock data frame with a single 63 years old male patient with start of follow up on April 24th 2003

df <- data.frame(list(
"sex" = c("male"),
"diag" = c(as.Date.character("28/04/2003", format = "%d/%m/%Y")),
"age" = c(63 * 365)
))

Computed expected survival at 100 days for this single individual cohort

exp_surv_us <- survexp(
formula = ~ 1,
data = df,
rmap = list(sex = sex, year = diag, age = age),
times = (100),
ratetable = survexp.us
)

exp_surv_slo <- survexp(
formula = ~ 1,
data = df,
rmap = list(sex = sex, year = diag, age = age),
times = (100),
ratetable = slopop
)

exp_surv_slo2 <- survexp(
formula = ~ 1,
data = df,
rmap = list(sex = sex, year = diag-as.Date("1960-01-01"), age = age), # year correspond for this table to the number of years since 1960-01-01
times = (100),
ratetable = slopop
)

print(exp_surv_us$summ) # output the correct entry date
print(exp_surv_slo$summ) # output an incorrect entry date
print(exp_surv_slo2$summ)

print(exp_surv_us$summ) # output the correct entry date
[1] " age ranges from 63 to 63 years\n male: 1 female: 0 \n date of entry from 2003-04-28 to 2003-04-28 \n"
print(exp_surv_slo$summ) # output an incorrect entry date
[1] " age ranges from 63 to 63 years\n male: 1 female: 0 \n date of entry from 27Apr93 to 27Apr93 \n"
print(exp_surv_slo2$summ)
[1] " age ranges from 63 to 63 years\n male: 1 female: 0 \n date of entry from 28Apr2003 to 28Apr2003 \n"

exp_surv_slo2

Call:
survexp(formula = ~1, data = df, rmap = list(sex = sex, year = diag -
as.Date("1960-01-01"), age = age), times = (100), ratetable = slopop)

age ranges from 63 to 63 years
male: 1 female: 0
date of entry from 28Apr2003 to 28Apr2003

time n.risk survival
100 1 0.996

I also tried another way using the option for argument method "individual.h" using slopop rate table:

df2 <- data.frame(list(
time = 100, "sex" = c("male"),
"diag" = c(as.Date.character("28/04/2003", format = "%d/%m/%Y")),
"age" = c(63 * 365)))

exp_surv_slo3 <- survexp(
formula = time ~ 1, data = df2,
method = "individual.h",
rmap = list(sex = sex,
year = diag - as.Date("1960-01-01"), age = age ),
ratetable = slopop )

expected survival

exp(-exp_surv_slo3)
1
0.9960001

Thank you again for opening this issue, which allow us to go deeper in survexp function use.

@therneau
Copy link
Owner

therneau commented Mar 3, 2021

  1. This issue you raise is essentailly a documentation flaw. The low level routine match.ratetable checks all the user's input against the rate table for consistency: does the sex variable in the data have levels that are found in the sex variable in the ratetable, etc. With respect to dates, it uses the ratetableDate method to convert them all to a common scale, both the rate table and the user input.
    The problem you point out is due to the fact that match.ratetable routine calls the attr(slopop, "summary") AFTER it has done the checks and conversion. The help page for ?ratetable does not include this information.

Maja, could I get your opinion here? I can change the code to call summary earlier, which will fix the issue with slopop, but may break other rate tables that "figured out" what I was doing. Any guess as to which version is likely to break fewer things?

  1. A ratetable will include rates in whatever units the creator of the ratetable has chosen. The call to survexp (or pyears) that uses that rate table needs to make sure that any time related variables are in the same units as the ratetable.
    a. Any ratetable which is going to involve dates should almost certainly use days, since that is the natural unit of date and Date. That is, the user is going to have their dates in days.
    b. The routine has no way to guess what the user did. That is what the summmary is for. See ?ratetable

  2. The methods are found in an old technical report from the Department of Health Science Research at Mayo Clinic (1999). Mayo changes the web interface from time to time, which can make this hard to find.

@qmarcou
Copy link
Author

qmarcou commented Mar 3, 2021

Hi @jgoungounga ,
Thanks for the hint on this date origin thing!
Sadly I'm afraid this is more highlighting an issue in survexp (and probably in R in general) handling of dates. Your solution will provide correct date of entry but in turn it will return incorrect survival probability. See below:

# Compute expected survival for 100 days for 63 year old male in 2003 in a naive way
print(exp(-100*slopop["63","2003","male"]))

> print(exp(-100*slopop["63","2003","male"]))
[1] 0.9945673

Now if you look at the obtained values with the snippets you proposed:

print(exp_surv_slo$surv)
print(exp_surv_slo2$surv)
print(exp(-exp_surv_slo3))

> print(exp_surv_slo$surv)
[1] 0.9945052
> print(exp_surv_slo2$surv)
[1] 0.9960001
> print(exp(-exp_surv_slo3))
1
0.9960001

Since there seem to be already a slight difference between the naive computation and all others (maybe an extra conversion with 365.25 somewhere?), let's make a brute force check that the correct year is used:

test <- slopop
test["63","2003","male"]<-test["63","2003","male"]*100 #set the daily hazard for this year at an absurdly large value
exp_surv_test <- survexp(formula = ~1, data = df,
                        rmap = list(sex = sex, year = diag, age = age),
                        times = (100),
                        ratetable = test)

exp_surv_test2 <- survexp(
  formula = ~ 1,
  data = df,
  rmap = list(sex = sex, year = diag-as.Date("1960-01-01"), age = age), # year correspond for this table to the number of years since 1960-01-01
  times = (100),
  ratetable = test
)
print(exp_surv_test$surv)
print(exp_surv_test2$surv)

> print(exp_surv_test$surv)
[1] 0.6294348
> print(exp_surv_test2$surv)
[1] 0.9960001

Package versions:

> $ package : chr "survival" "relsurv"
> $ loadedversion: chr "3.2-7" "2.2-3"

@qmarcou
Copy link
Author

qmarcou commented Mar 3, 2021

Dear @therneau ,
Thank you very much for your detailed answer!

1. A ratetable will include rates in whatever units the creator of the ratetable has chosen.   The call to survexp (or pyears) that uses that rate table needs to make sure that any time related variables are in the same units as the ratetable.
   a. Any ratetable which is going to involve dates should almost certainly use days, since that is the natural unit of date and Date.  That is, the user is going to have their dates in days.
   b. The routine has no way to guess what the user did.  That is what the summmary is for.   See ?ratetable

Thank you for this note on the natural scale/unit of Dates, things make more sense now.
What about adding a "unit" attribute for ratetable objects? E.g "daily.haz","yearly.haz", etc. Maybe this could help lifting some ambiguities regarding times units? Another alternative would be to use objects such as lubridate::days etc, but I have no idea if that would be handled well in the Surv() object.
If this seems too complicated, I think adding a note in the documentation on the fact that Date units are by default days, and thus remaining arguments should be in the same units would be of great help (both in terms of usage and understanding).

2. The methods are found in an old technical report from the Department of Health Science Research at Mayo Clinic (1999).    Mayo changes the web interface from time to time, which can make this hard to find.

Yes I think I found the document here. Thank you for the pointer.

Best regards,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants