Why you can’t use logistic for non-rare events

4:42 AM, November 27, 2024

Ok, the title might be *slightly* clickbait, since any statistician who’s worked with logistic regression before would immediately click the title trying to figure out what the hell I’m talking about: it’s just fine to use logistic regression for non-rare events. But I wanted a title to pique the reader’s curiosity about the post, which will be relatively short today since I have a doctor’s appointment early tomorrow morning and it’s already past 3 AM.

First, some context. For the T32 Environmental Health training grant that I’m on, we have weekly informational meetings / informal seminars where we discuss things we are working on or specific subjects, and in a week we will each review one another’s preliminary analysis plans for current projects, which I’m looking forward to. These meetings are very helpful and build a great friendly atmosphere among us trainees on the grant. One fellow trainee, the amazing Claire Nurse, is the epidemiologist of the group, while the rest of us trainees are statisticians. For the informal seminar this past week Sally asked Claire to present some of her dissertation work on the association between PFAS exposure and certain health outcomes, as well as her work on the association between certain types of fish consumption and PFAS (all you need to know about PFAS is that it’s probably not good for you). The presentation was derailed several times by us statisticians asking all sorts of questions about the methods used (and I also got to learn a bit about survey statistics), but this only added to how much we all learned about her work. This post focuses on just one thing she said that led to a significant portion of the discussion.

PFAS exposure was coded as binary, and she explained that a Poisson GLM was used to model PFAS with predictors of sex, diet, etcetera. When I asked why she couldn’t have just used logistic regression, since Poisson is not ‘really’ the distribution of PFAS (a Poisson variable is a count, not a binary), she said that she used the Poisson GLM because the outcome was not sufficiently rare: about 25% of participants had an outcome of 1. I had a moment of existential crisis as I wondered how I could have possibly been doing logistic regression in circumstances where it shouldn’t be used, followed by the realization that this couldn’t possibly be right.

Luckily, when I asked what she meant by her statement about not using logistic for non-rare outcomes, she explained that there’s nothing wrong with using logistic regression for non-rare outcomes if you are fine with the log-odds interpretation of the coefficients (which, as a statistician, I think is among the most easily interpretable you can have), but that in epidemiology the quantity of interest is the Relative Risk, not the odds ratio. She then explained that when the event is rare, the odds ratio is a good approximation of the relative risk, but otherwise it is not. Poisson regression, however, produces coefficients that can give you a relative risk. An interesting little fact!

First, let’s explain what the odds ratio is. Let’s say you have an exposure of interest we call scallops, indicating you eat a lot of scallops (assume this is binary), and an outcome PFAS which is also binary. The odds of a thing are the probability of it divided by the probability of not it. So if you roll a 6-sided die the odds of a 5 are 1/5 = 0.2, since odds(5) = P(5)/P(not 5) = (1/6)/(5/6) = 1/5 = 0.2. The odds ratio is just the ratio of two odds, so the odds ratio in the PFAS scallops example would be

    \[OR = \frac{(P(PFAS = 1 | Scallops = 1)/P(PFAS = 0 | Scallops = 1))}{(P(PFAS = 1 | Scallops = 0)/P(PFAS = 0 | Scallops = 0))}\]

Now, the Relative Risk, or RR, is defined as the ratio of the probabilities of the event of interest in the exposed and unexposed groups. In the PFAS Scallops example, this would be:

    \[RR = \frac{P(PFAS = 1 | Scallops = 1)}{P(PFAS = 1 | Scallops = 0)}\]
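To put numbers on the two definitions, here is a minimal Python sketch. The probabilities are made up for illustration (they are not from Claire’s data), and the helper names `odds`, `odds_ratio` and `relative_risk` are just names chosen for this example.

```python
def odds(p):
    """Odds of an event with probability p: P(event) / P(not event)."""
    return p / (1.0 - p)


def relative_risk(p_exposed, p_unexposed):
    """Ratio of the event probabilities in the exposed and unexposed groups."""
    return p_exposed / p_unexposed


def odds_ratio(p_exposed, p_unexposed):
    """Ratio of the odds in the exposed and unexposed groups."""
    return odds(p_exposed) / odds(p_unexposed)


# Hypothetical (P(PFAS = 1 | Scallops = 1), P(PFAS = 1 | Scallops = 0)) pairs.
scenarios = {
    "rare": (0.02, 0.01),
    "non-rare": (0.40, 0.20),
}

for name, (p1, p0) in scenarios.items():
    rr = relative_risk(p1, p0)
    oratio = odds_ratio(p1, p0)
    print(f"{name:8s} RR = {rr:.3f}  OR = {oratio:.3f}")
```

In both scenarios the relative risk is 2, but the odds ratio is about 2.02 in the rare case and about 2.67 in the non-rare one.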

Now it should be easy to see why the odds ratio is approximately equal to the Relative Risk when the event is rare. Rearranging the definition, the OR is the RR multiplied by P(PFAS = 0 | Scallops = 0) / P(PFAS = 0 | Scallops = 1). When the event is rare, P(PFAS = 0 | Scallops = 1) is close to 1, as is P(PFAS = 0 | Scallops = 0), so that extra factor is close to 1 and the OR is close to the RR.

Now, onto a quick bit about logistic regression. Logistic regression models a binary outcome by assuming the following form (which is called the ‘canonical’ form because of the way that the likelihood of a Bernoulli can be reparameterized, which could be the subject of an entire blog post):

    \[\log\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p\]

Where \pi is the probability of the event. What’s nice about this form is that once you’ve fit your model, you can interpret a coefficient \beta as the change in the log of the odds of the event for a 1-unit change in the associated X (holding the other predictors fixed). This interpretation can be rewritten into one that’s more useful. Let’s say we have a logistic model in the PFAS example with only one predictor variable, Scallops (X_{scallops}), which is binary, and the outcome is PFAS, which has probability \pi that depends on scallops:

    \[\log\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 X_{scallops}\]

Now let’s say we’ve fit this model and want to interpret \hat{\beta}_1. Since X_{scallops} is binary, let’s call the probability of PFAS = 1 when scallops is 1 \pi_1, and the probability of PFAS = 1 when scallops is 0 \pi_0. Then we have:

    \[\log\left(\frac{\pi_0}{1 - \pi_0}\right) = \beta_0\]

    \[\log\left(\frac{\pi_1}{1 - \pi_1}\right) = \beta_0 + \beta_1\]

Thus:

    \[\log\left(\frac{\pi_1}{1 - \pi_1}\right) - \beta_0 = \beta_1 \Rightarrow \beta_1 = \log\left(\frac{\pi_1}{1 - \pi_1}\right) - \log\left(\frac{\pi_0}{1 - \pi_0}\right)\]

But wait, there’s more: we can now combine the two logs and exponentiate both sides to solve for the odds ratio, which comes out to just e^{\beta_1}:

    \[\beta_1 = \log\left(\frac{\pi_1}{1 - \pi_1}\right) - \log\left(\frac{\pi_0}{1 - \pi_0}\right) = \log\left(\frac{\pi_1}{1 - \pi_1} / \frac{\pi_0}{1 - \pi_0}\right)\]

    \[\exp(\beta_1) = \frac{\pi_1}{1 - \pi_1} / \frac{\pi_0}{1 - \pi_0}\]

Tada! And just like that, you have the odds ratio between scallops = 1 and scallops = 0. I remember first seeing this at RIT with Professor Fokoue and not really understanding what was going on at first because it was my first course ever on regression, and now I’ve seen log odds so many times I’ll be able to tell the nursing home staff about it when I’m 98.
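As a quick numerical check of the e^{\beta_1} result, here is a minimal sketch that fits the one-predictor logistic model by Newton-Raphson in plain Python and compares exp(\hat{\beta}_1) with the odds ratio computed directly from the counts. The 2x2 counts are invented for illustration, and `fit_logistic_2x2` is a hypothetical helper written for this post; in practice you would use a GLM routine from a statistics package rather than a hand-rolled loop.

```python
import math


def fit_logistic_2x2(events0, n0, events1, n1, iters=25):
    """Newton-Raphson MLE for log(pi / (1 - pi)) = b0 + b1 * x, with x in {0, 1}.

    events_g is the number of outcomes equal to 1 among the n_g subjects with x = g.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        p0 = 1.0 / (1.0 + math.exp(-b0))           # fitted pi when x = 0
        p1 = 1.0 / (1.0 + math.exp(-(b0 + b1)))    # fitted pi when x = 1
        # Score (gradient of the log-likelihood) with respect to (b0, b1).
        r0 = events0 - n0 * p0
        r1 = events1 - n1 * p1
        g0, g1 = r0 + r1, r1
        # Fisher information is [[w0 + w1, w1], [w1, w1]], determinant w0 * w1.
        w0 = n0 * p0 * (1.0 - p0)
        w1 = n1 * p1 * (1.0 - p1)
        det = w0 * w1
        # Newton step: (b0, b1) += information^{-1} @ score.
        b0 += (w1 * g0 - w1 * g1) / det
        b1 += (-w1 * g0 + (w0 + w1) * g1) / det
    return b0, b1


# Made-up counts: 20 of 100 non-scallop eaters and 40 of 100 scallop eaters have PFAS = 1.
logit_b0, logit_b1 = fit_logistic_2x2(events0=20, n0=100, events1=40, n1=100)
sample_or = (40 / 60) / (20 / 80)

print(f"exp(b1)   = {math.exp(logit_b1):.4f}")
print(f"sample OR = {sample_or:.4f}")
```

With these counts both numbers come out to about 2.6667, even though the ratio of the two proportions (the relative risk) is only 2.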

Epidemiologists want the Relative Risk though, and so unlike statisticians they aren’t completely satisfied with this when the event is not rare. So they instead do Poisson regression with a log link, which has this form:

    \[\log(\lambda) = \beta_0 + \beta_1 X_1 + \cdots\]

This of course assumes the outcome follows the Poisson distribution, which of course it doesn’t, but we can still fit the Poisson model because 0 and 1 are both possible outcomes of a Poisson. Why is this helpful? Because the rate parameter can be thought of as the risk.

For example, let’s say you’re sitting outside your door watching cars go by and counting the number that pass in ten minutes. This is a typical situation you might imagine following something like a Poisson distribution. The ‘rate’ is the rate at which cars come by: the expected number that show up every ten minutes. Now let’s say you instead count the number that pass in four seconds. This is still Poisson, but since you’re watching the road for a shorter stretch of time, your observations are much more likely to be only zero or one, and the rate parameter you would use is lower.

Now let’s say you want to compare the rate at 9 in the morning to the rate at 4AM (I use 4AM because it’s the current time). Most people (except irresponsible grad students doing math at 4AM like me) are asleep at this time of night and not out driving, but a few are. You can see how the ratio of the rates at 9AM and 4AM can be thought of as a relative risk: your ‘risk’ of seeing a car at 9AM is much higher than your risk at 4AM, and the ratio of the two risks is approximated by the observed rate at 9AM divided by the observed rate at 4AM.

As in the logistic model, we can get the regression coefficient \beta_1 to easily give us the ratio of the rate parameters, which estimates the relative risk when we fit the model:

    \[\log(\lambda) = \beta_0 + \beta_1 X_{scallops}\]

    \[\log(\lambda_1) = \beta_0 + \beta_1\]

    \[\log(\lambda_0) = \beta_0\]

    \[\log(\lambda_1) - \beta_0 = \beta_1\]

    \[\log(\lambda_1) - \log(\lambda_0) = \beta_1\]

    \[\beta_1 = \log(\lambda_1 / \lambda_0)\]

    \[\exp(\beta_1) = \lambda_1 / \lambda_0\]

Which gives us our estimate of the relative risk of PFAS for a scallop eater versus a non-scallop eater. The discussion we had during the meeting didn’t go through all of this, but after a couple of minutes back and forth I asked her (by paraphrasing the above) if this was what was going on and she told me it was. Which is very cool and something that wouldn’t have occurred to me had I not seen it in that meeting, since I’ve yet to need the relative risk instead of the odds ratio.
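The same kind of check works for the Poisson model. This is a minimal sketch, again with invented 2x2 counts and a hypothetical helper (`fit_poisson_log_2x2`), that fits the log-link Poisson model to a 0/1 outcome by Newton-Raphson and compares exp(\hat{\beta}_1) with the ratio of the two observed proportions.

```python
import math


def fit_poisson_log_2x2(events0, n0, events1, n1, iters=25):
    """Newton-Raphson MLE for log(lambda) = b0 + b1 * x, with x in {0, 1}.

    The outcome is 0/1, so events_g is just the sum of the outcomes
    over the n_g subjects with x = g.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        lam0 = math.exp(b0)         # fitted rate when x = 0
        lam1 = math.exp(b0 + b1)    # fitted rate when x = 1
        # Score (gradient of the Poisson log-likelihood) with respect to (b0, b1).
        r0 = events0 - n0 * lam0
        r1 = events1 - n1 * lam1
        g0, g1 = r0 + r1, r1
        # Fisher information is [[w0 + w1, w1], [w1, w1]], determinant w0 * w1.
        w0 = n0 * lam0
        w1 = n1 * lam1
        det = w0 * w1
        # Newton step: (b0, b1) += information^{-1} @ score.
        b0 += (w1 * g0 - w1 * g1) / det
        b1 += (-w1 * g0 + (w0 + w1) * g1) / det
    return b0, b1


# Made-up counts: 20 of 100 non-scallop eaters and 40 of 100 scallop eaters have PFAS = 1.
pois_b0, pois_b1 = fit_poisson_log_2x2(events0=20, n0=100, events1=40, n1=100)
sample_rr = (40 / 100) / (20 / 100)

print(f"exp(b1)   = {math.exp(pois_b1):.4f}")
print(f"sample RR = {sample_rr:.4f}")
```

Here exp(\hat{\beta}_1) matches the ratio of the observed proportions (2 with these counts). One caveat worth keeping in mind: this only speaks to the point estimate. A 0/1 outcome does not have the variance a Poisson model assumes, so the standard errors need separate care (robust standard errors are the usual choice).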

In other news, it looks like my plans for next semester make a bit more sense now. I still have to talk to the program director Matt McCall after the Thanksgiving break about TAing and a couple of other things. One of the professors (the first one I asked) had said they probably wouldn’t be able to give me a reading course in the spring, for a (very good) reason I will not announce online just yet. They have now told me they would talk to a couple of people and see if it might still be possible in spite of that reason, so I’m very excited about that. I don’t want to assume just yet that it will pan out, but I am cautiously optimistic.

More updates will come, presuming I don’t get in a fatal wreck driving sleep deprived to the early morning doctor’s appointment I have tomorrow.

Edit: Thinking it over again today I feel like I didn’t really justify that \lambda_1 / \lambda_0 estimates the RR, but this is not terribly difficult to do. When X \sim Poisson(\lambda), E[X] = \lambda, so

    \[\lambda_1 / \lambda_0 = E[PFAS | Scallop = 1]/ E[PFAS | Scallop = 0]\]

but since PFAS is binary, E[PFAS] = P(PFAS = 1), so

    \[E[PFAS | Scallop = 1]/ E[PFAS | Scallop = 0] = P(PFAS = 1 | Scallop = 1) / P(PFAS = 1 | Scallop = 0)\]

which is the Relative Risk.
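One further step, sketched here for the model with the single binary predictor X_{scallops}, shows why fitting the ‘wrong’ Poisson likelihood still recovers the group proportions. The notation y_i (participant i’s 0/1 PFAS value), x_i (their scallops indicator) and \bar{y}_g (the observed proportion with PFAS = 1 in group g) is introduced only for this sketch. The Poisson log-likelihood with \log(\lambda_i) = \beta_0 + \beta_1 x_i is

    \[\ell(\beta_0, \beta_1) = \sum_i \left( y_i \log(\lambda_i) - \lambda_i - \log(y_i!) \right)\]

Setting its derivatives with respect to \beta_0 and \beta_1 to zero gives

    \[\sum_i (y_i - \lambda_i) = 0, \qquad \sum_i x_i (y_i - \lambda_i) = 0\]

The second equation only involves the scallops = 1 group, where \lambda_i = \lambda_1, and subtracting it from the first leaves only the scallops = 0 group, where \lambda_i = \lambda_0. So

    \[\hat{\lambda}_1 = \bar{y}_1, \qquad \hat{\lambda}_0 = \bar{y}_0, \qquad \exp(\hat{\beta}_1) = \bar{y}_1 / \bar{y}_0\]

which is just the ratio of the observed proportions, i.e. the sample relative risk. Nothing in this step relies on the outcome actually being Poisson, only on the model being a model for the mean.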
