Mark Rieke | Intro to Workboots: Make Prediction Intervals from Tidymodel Workflows | Posit (2022)
Sometimes, we want a model that generates a range of possible outcomes around each prediction. Other times, we just care about point predictions and may opt to use a fancy model like XGBoost. But what if we want the best of both worlds: getting a range of predictions while still using a fancy model? That’s where bootstrapping comes to the rescue! By using bootstrap resampling, we can create many models that produce a prediction distribution – regardless of the model type! In this talk, I’ll give an overview of bootstrap resampling for prediction, the pros/cons of this method, and how to implement it as a part of a tidymodel workflow with the workboots package. Talk materials are available at https://github.com/markjrieke/rstudio-conf-2022 Session: Machine learning
image: thumbnail.jpg
Transcript#
This transcript was generated automatically and may contain errors.
Hello, everyone. Thank you for coming to my talk. I'll just get running and rolling here. If you'd like to follow along as I go through this, you can find the slides on the conference app for this speaker session. I also just tweeted out the link. You can find that at markjrieke, that's I-E-K-E. But today, I'll be introducing Workboots, which is a tool and a package for generating prediction intervals from tidymodel workflows.
So I used to be a mechanical engineer. So before I jump into talking about this tool, I want to talk about some other niche and interesting tools that I've seen as a mechanical engineer.
So this is something interesting. If you look at it, you might not know what it's used for offhand. One of my friends who works at AB sent this to me. This is a pigtail-proof coil. It lets brewers take samples from these humongous beer tanks safely.
The second tool, again, really interesting-looking. What's it used for off the bat? I don't know. But for folks who do know, this is a wire crimper, and it's very useful for dealing with large electrical lugs, again, safely and efficiently. And this abomination, we don't really have to talk about too much. But one tool that I use often that is, again, niche and specific is this Kreg jig. On first glance, you might not know what it would be used for. It's got this odd cylinder on one side. But you can see that in a specific scenario, it's a perfect tool for this job.
And that's kind of what I like to think of Workboots as. It is a specific tool for a specific job. And that specific job is for when you are building a model and you would like to generate a prediction interval, but you're using a model type that on its own only generates point predictions. Workboots will let you take that model and generate prediction intervals.
The case for uncertainty
So why might you be interested in using this package? Well, maybe you're somebody like me. Again, my name is Mark Rieke. I am a Senior Consumer Experience Analyst for Memorial Hermann Health System in Houston, Texas. We are a large hospital system in the Houston area with 17 hospitals, 11 of which are acute care, hundreds of clinics and outpatient facilities, servicing about 30,000 employees. And my small but mighty team of four handles and fields requests from all 30,000 employees.
But what we do is we deal with our patient satisfaction survey data. Our job is to administer, understand, analyze this data to hopefully improve our patient satisfaction when they go through our hospital system. And the majority of the time, our day-to-day involves answering questions.
So some of the questions that come up are ones like this. We see a pretty consistent trend going upwards in our patient satisfaction score and then this drop off in April. What's going on there? If we look at scores across hospitals, maybe we're seeing differences. Are these real differences or is this just due to sampling variation?
And thirdly, we see different response rates depending on if we send invites out via email or if we send invites out via text. Which one should we use? So these are all questions that on their own, giving a point estimate for, you know, is a good first step. But to get the full picture, we really need to add in uncertainty because that will help us separate signal from noise.
So maybe we see by adding in confidence intervals that in April, perhaps there was a disruption in the survey distribution workflow and we only sent out a few surveys that month. So this is really just statistical noise.
You know, if we look across hospitals, maybe we see a lot of overlap in the potential underlying scores. So again, you know, nothing to be too concerned about here. But hey, if we look at the difference in response rates, these are pretty significantly different and it's readily apparent once we add in our uncertainty intervals. And this is something that we can action around.
I think when I say this out loud, we all understand at a base level that point estimates don't give the full picture. But we really need additional context to help that point sink in.
So this also extends to not just when I'm answering specific questions, but when I'm answering questions with a model. So, you know, what are the drivers of patient satisfaction? What will our score be? How many responses are we expecting? These are all questions that I can and have answered with models. But again, because I'm always wanting to layer in uncertainty, I'm really limited to the types of models that I can use.
My problem is that this is me, like, all the time. I see people using, like, fun and fancy models like XGBoost, but because of the work I'm doing, I'm stuck using plain old LM.
So that's where Workboots comes into play. This is a package that hopefully bridges the gap between simpler models that let you generate prediction intervals right out of the box and more powerful models that let you capture nonlinearity and complex interactions. Workboots hopefully gets the best of both worlds by letting you generate prediction intervals while still using your more powerful models.
Where Workboots fits in the model building timeline
So let's go into the package itself. And I think one thing we should do right out the gate is look at where Workboots sits in the model building timeline. If some of you recognize this slide, this is something that I totally stole from an old Julia Silge presentation that walks through how you might go from model conception and idea to model finalization. So maybe you're starting off on the left with doing some exploratory analysis, engineering new features, and then fitting, tuning, and evaluating a set of models. And perhaps you repeat this a few times before you finally get to your finalized model.
A lot of this is already handled by tidy models. Workboots really just comes in as that last leg of the journey. So you've done all your preprocessing, done all your feature engineering, you know, all of your hyperparameters are tuned. And now you're ready to answer that big question that your stakeholder has been asking you.
And part of the reason why it makes sense that Workboots is, you know, just this last step is that there are a couple of checks that you really want to make sure you're hitting before it makes sense to use Workboots. Again, it's a really specific tool for a really specific job. So firstly, you should make sure that the residuals of your model are normally distributed. This is an assumption that it makes under the hood. Secondly, you need to make sure you actually have enough time to run this model.
I'm an analyst. I wrote this for myself primarily. I'm never really concerned with putting models into production. So depending on the amount of preprocessing and the size of your data set, Workboots could take anywhere between 5 and 20 minutes to output prediction intervals. So please do not deploy this when you need to have something answer a question in a fraction of a second.
And thirdly and finally, most importantly, it really makes sense to make sure that you actually need this more powerful model that Workboots allows you to use. What I found a lot of the time is that good old plain Jane LM will, you know, really get you the answers that you need maybe 80, 90% of the time.
How Workboots works
But if you do find yourself in a scenario where you are meeting all three of these checks and you're ready to start using Workboots, how does it actually work? I'm going to gloss over this very quickly, but if you'd like to dig in deeper into some of the math that's underlying it, you can scan this QR code. That'll take you to a little bit of an explainer.
But at a very high level, Workboots takes your training data and generates a whole bunch of bootstrap resamples. Then it fits a model to each of these resamples, so that you end up with many models. Thirdly, it takes each of these many models and predicts on new data, so for each observation in your new dataset, you have many predictions. And then finally, it summarizes all of these with a prediction range.
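The procedure just described can be sketched in a few lines of base R using lm() on the mtcars dataset — this is just an illustration of the bootstrap idea (the data, seed, and formula are mine, not from the talk); workboots automates this for any tidymodels workflow:

```r
# Minimal sketch of bootstrap prediction intervals with base R.
set.seed(123)
train <- mtcars[1:24, ]     # "training" data
new_data <- mtcars[25:32, ] # data we want intervals for

n_boots <- 200
preds <- replicate(n_boots, {
  # 1. draw a bootstrap resample of the training data
  resample <- train[sample(nrow(train), replace = TRUE), ]
  # 2. fit one model per resample
  fit <- lm(mpg ~ wt + hp, data = resample)
  # 3. predict on the new data with that model
  predict(fit, newdata = new_data)
})

# 4. each row of `preds` holds many predictions for one observation;
# summarize them into a median and a 95% prediction range.
t(apply(preds, 1, quantile, probs = c(0.025, 0.5, 0.975)))
```

Each new observation ends up with 200 predictions, and the spread of those predictions gives the interval.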
Using Workboots with code
So now that we've checked that we can use Workboots, we kind of know what's going on, let's get into actually using it with some code. So there's three things that we're going to do. First of all, Workboots builds on top of tidy models. So what we'll do is set up a tidy model workflow. Next, we'll use Workboots to generate our predictions. And then thirdly, we'll use Workboots to summarize these results.
There is some great documentation on tidymodels and how to use it. But I will just set up a very quick workflow for explanation's sake. In this case, we'll be using the Palmer penguins dataset for this example. This is a dataset that contains a bunch of information and characteristics about penguins from different islands, such as their bill length, bill depth, what species they are, which island they came from, and — what we'll be predicting — their body mass.
As is best practice for this example, we'll split into a testing and training dataset.
So next, we'll set up a pretty basic preprocessing recipe. In this case, again, we'll be predicting each penguin's body mass based on all of the data that's available in the dataset.
And then we'll combine this preprocessing with a model in a workflow. In this case, using a boosted tree model for regression, which uses XGBoost under the hood. As a reminder, XGBoost can only generate point predictions on its own. So now that we have the workflow set up, we're done with tidy models. Now we can jump into Workboots.
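The setup just described might look something like this — a minimal sketch with tidymodels and palmerpenguins; the dummy-variable step, seed, and object names are my own additions, not taken from the talk's slides:

```r
library(tidymodels)
library(palmerpenguins)

# Drop rows with missing values so the example runs cleanly
penguins <- tidyr::drop_na(penguins)

# Split into training and testing sets
set.seed(123)
penguins_split <- initial_split(penguins)
penguins_train <- training(penguins_split)
penguins_test <- testing(penguins_split)

# Basic preprocessing recipe: predict body mass from everything else.
# xgboost needs numeric inputs, so convert factors to dummy variables.
penguins_recipe <-
  recipe(body_mass_g ~ ., data = penguins_train) %>%
  step_dummy(all_nominal_predictors())

# Combine the recipe with a boosted tree regression model
# (xgboost is the default engine) in a workflow.
penguins_wf <-
  workflow() %>%
  add_recipe(penguins_recipe) %>%
  add_model(boost_tree(mode = "regression"))
```

At this point, tidymodels has done its job and the workflow is ready to hand off to workboots.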
One of the large workhorse functions within the package is the predict_boots() function. This is going to take the workflow that we just set up; we'll pass that to predict_boots() and specify the number of bootstrap resamples that we'd like to take. And then, because we are fitting and predicting all in one fell swoop, we're going to pass both the training data and the new data that we're going to predict on. In this case, that's the penguins test dataset.
And what this returns is a table with nested predictions for each observation in the test dataset. So in this case, each row is a penguin in the penguins test set, and we have 2,000 estimations of its weight from the many models that we created. We can summarize this by passing it to the summarise_predictions() function, which returns a median along with a lower and upper prediction interval bound. And just like that, we have taken a model that only generates point predictions and returned prediction intervals.
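Assuming a workflow `penguins_wf` and splits `penguins_train`/`penguins_test` set up as described above (those names are illustrative), the two calls might look like this — note that with n = 2000 resamples this can take a while to run, as mentioned earlier in the talk:

```r
library(workboots)

# Fit a model to each of 2,000 bootstrap resamples of the training
# data, then predict on the test set with every one of those models.
set.seed(345)
penguins_preds <-
  penguins_wf %>%
  predict_boots(
    n = 2000,
    training_data = penguins_train,
    new_data = penguins_test
  )

# Collapse the nested predictions into a median and a
# lower/upper prediction interval for each penguin.
penguins_preds %>%
  summarise_predictions()
```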
By default, Workboots and predict_boots() will generate prediction intervals. But if we so desire, we can generate confidence intervals instead by setting interval to confidence. And just as a reminder of the difference between the two, here's an example looking at the Ames housing dataset. In this case, the example is predicting each home's sale price based on the first floor square footage. On the left, prediction intervals are the range in which we might expect to find any individual home's price. And on the right is an example of a confidence interval, which is the range in which we might expect to find the model output.
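Switching interval types is just a matter of the interval argument (again assuming the illustrative `penguins_wf` workflow and splits from earlier):

```r
# Same call as before, but return confidence intervals
# around the model output instead of prediction intervals.
penguins_wf %>%
  predict_boots(
    n = 2000,
    training_data = penguins_train,
    new_data = penguins_test,
    interval = "confidence"
  ) %>%
  summarise_predictions()
```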
Variable importance with Workboots
Thirdly and finally, in addition to generating prediction and confidence intervals, Workboots allows us to generate estimations of variable importance.
The function that does this, vi_boots(), is very similar to predict_boots(). It takes the workflow that we've created; we pass that to vi_boots(), specify, again, the number of resamples that we'd like to take, and give it our training data. And what this returns is a table with nested estimations of variable importance for each feature in the model.
And again, there's another summarize function that returns a median along with a lower and upper interval bound. We can pass this to summarise_importance() to get this output.
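The variable importance workflow mirrors the prediction workflow (the `penguins_wf` and `penguins_train` names are the illustrative ones assumed above):

```r
# Estimate variable importance once per bootstrap resample,
# returning nested importances for each feature.
set.seed(678)
penguins_importance <-
  penguins_wf %>%
  vi_boots(
    n = 2000,
    training_data = penguins_train
  )

# Collapse into a median and lower/upper interval per feature.
penguins_importance %>%
  summarise_importance()
```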
And what we can see from this example is that flipper length, and whether the penguin is of the Gentoo species, are pretty important to this set of XGBoost models in determining each penguin's weight — though there are wide ranges around those estimations of importance. On the other hand, we're pretty confident that which island the penguin comes from is not very important.
So in closing, my call to action for you all is to give Workboots a shot. Again, I think this is a very specific tool for a very specific job. But if you find yourself in the situation where you need to generate prediction intervals and you don't want to be limited by model type, I think Workboots might be the package for you.
So thank you. I'm happy to take any questions. I have some links to additional reading material if you're interested in digging in further. Thank you.