As I mentioned on the landing, the issue I sought to solve (or at least understand) is why my plots of logistic regression over aggregated data sometimes looked awful. Unfortunately, the article that my teammate gave me on marginal standardization didn’t solve the main issue I was trying to fix. The difference between marginally standardized plots and plots with predictions at the mean were often similar and didn’t address the weirdness I frequently saw.

I’m a considerably better programmer than statistician so I think I’ll simulate a data set and do some experiments with different types of covariates. We’ll get a chance to try out this simstudy library that I saw on a blog. Here’s the vignette for reference. The code for these experiments can be founds here.

Simulation Description

For this experiment I’m going to create two outcome variables. One will be a binary response with linear log odds changes through time. The incidence changes from the low 40s to high 50s through the entire time period. The other will be a normally distributed variable with a mean that moves from about -0.5 to 0.5 and a variance of 4. I’ve created roughly 1,000 observations for each of the 100 time points. Here are the aggregated values for each of the 100 times points:

After creating the outcome and time variables, I created a lot of binary and normal variables to act as covariates for our models. The binary versions were all created in sets of low, medium, and high incidence. For the predictors that were correlated with time and/or an outcome, they were created in sets of 20 where I gradually increased the strength of the relationships by 5% each time. The types were:

  1. Not correlated with either the time or outcome
  2. Correlated with only the time variable.
  3. Correlated with only the outcome
  4. Correlated with both time and the predictor.

Because I created the outcomes to be correlated with time, categories #2 & #3 are somewhat similar to the predictors in category #4. The difference is that the way I programmed this, the predictors that are listed as being only correlated with time or an outcome have a stronger, direct relationship.

A graph of the binary predictors that were of medium incidence and correlate with time only are depicted below:

Uncorrelated predictors

The first thing I want to do is look at some graphs of what happens when you use uncorrelated predictors (both normal and binary). We should probably think that this will not affect the shape of the graph very much because the effect on the outcome is constant through time. Here we’ll just look at a graph of the outcomes vs time while we successively sample a larger number of predictors that are not correlated with either the outcome or time.

All of the lines are basically on top of each other (the darkest shade is placed last) indicating that these graphs aren’t really affected by uncorrelated predictors in the model. There also does not seem to be a cummulative effect of having more predictors.

Time correlated predictors

Here I’m going to do the same thing with predictors that are correlated with time only. We’ll make 20 models, each time adding in a variable that has a slightly stronger correlation with time.

If you look very closely you can see that the lines appears thinner at the center. This is due to the different regression lines being slightly different and not completely overlapping. The time correlated predictors do affect these graphs but in a minor way.

Outcome correlated predictors

What about predictors that are (directly) correlated with the outcome only? Here I’ll sucessively add in a stronger predictor with each iteration:

Here we see basically the same thing, there is a little variation between the graphs but not a lot.

Time and outcome correlated predictors

Finally, we’ll look at the predictors that are correlated with both time and the outcome. To start off, I’ll create models with only time and one binary predictor of medium incidence, then successively increased the strength of the covariate. The unadjusted (time only) model is plotted in black. We can see that the unadjusted model is closer to the models with weaker relationships to the outcome and time variables:

Thisis a very controlled example where we’re just looking at the effect of one predictor and changing how strong its relationship is with both the outcome and time variable. In practice we’re likely to have many predictors of differing data types and incidences and strengths of relationships. Let’s try an example where we successively add in an additional predictor of 5% stronger relationship. This time we’ll randomly select from the various binary predictors and normal predictors at each level.

Here we can see some very ugly lines of best fit. These look even worse than the single predictor models and I interpret this is there being a cumulative effect as we add more predictors. The fact that the lines don’t fan out as uniformly is probably due to the fact that I sampled different incidences. This indicates that the incidence of the confounder variable also has an impact. I previously tried this experiment with some negatively correlated predictors as well and found that sometimes the effects pushed in opposite directions and the graph ended up not looking too bad.

Sometimes when I think about this, it makes intuitive sense. If the frequency or the mean of the confoudner increases with time and that variable is also predictive of the outcome, we would expect the predictions later in the time period to be greater than they would be if we used time alone as a predictor. Our model is predicting outcomes along multiple dimensions which are interrelated and when we look at only two axes, it hides these other dimensions.

Hopefully I will have some time to look at this a bit more later but for now my takeaways are:

  1. Predictors that are associated with only time or the outcome don’t affect the outcome much
  2. Predictors that are associated with both time and the outcome really mess up graph
  3. In practice, you are likely to want to include variables that are correlated with both the outcome and primary exposure, otherwise they wouldn’t be confounders! At the moment I don’t have any remedy for this at the moment so I suspect that this method of plotting won’t work in general.