Ralph Asher | Intro to Monte Carlo Simulation | RStudio

Transcript

This transcript was generated automatically and may contain errors.

Good. All right. So again, I'll be talking about Monte Carlo Simulation today. First, a very brief introduction about myself. So like probably many of you on the call, I have a math and science background, a bachelor's in physics and a master's in operations research. After I graduated college, I was in the military on active duty and I continue to serve in a part-time capacity. Towards the end of my time on active duty, I ended up getting my master's in operations research. And in 2013, I moved here to Minneapolis, Minnesota and began working in the supply chain analytics field at General Mills. So if you've eaten Cheerios, you've eaten a General Mills product, they're headquartered here in Minneapolis.

Worked there for a couple of years and then I was recruited to work at Target, also headquartered here in Minneapolis, where I was the principal designer of e-commerce last-mile networks for Target. Worked there for about six years and just last month left to found my own business, Data-Driven Supply Chain. Essentially, it's my skill set, applying data science and artificial intelligence to the supply chain. I've been an R user since 2012, pretty much since I was looking for a job. And I've just found it to become an increasingly powerful language for all sorts of supply chain analytics. And I've loved math and maps since I was a little kid, which is probably indicative of me going into this career field.

All right. Today's agenda: how can we use data, math, and advanced analytical methods to understand risk and uncertainty? An introduction to the flaw of averages and thinking in terms of ranges, not single values. And then I'll give a conceptual introduction to Monte Carlo simulation and a very simple example of it. You know, this is by no means intended to be a comprehensive treatment of Monte Carlo simulation, but hopefully it'll get you excited about exploring it in your work.

Understanding risk and uncertainty

All right. First off, how can we use data to understand risk and uncertainty? You know, most people are inherently paralyzed by uncertainty and we tend to ignore it in our lives. There's just way too many things that could turn out to be different than what we're expecting. You know, so our decisions are typically made based upon either heuristics or, you know, good rules of thumb or a single estimate, which may or may not be useful in that particular situation. Like for example, before COVID, I flew quite regularly. And so I knew how long to plan for, right? I knew I needed to take 10 minutes at the airport, 15 minutes of security, five minutes for the flight, right? 30 minutes.

Well, things have changed, right? Next week, I'm actually going on my first flight since COVID, and I have no idea what security is like, I have no idea what to expect. And so my past experience actually is not useful for making decisions about the future. You know, and what's worse about this is that our ability to objectively evaluate decisions actually declines the more complex and uncertain the decision is about the important details. You know, for example, should I go to college? And if so, where? For some people, the answer is clearly yes, for some people it's clearly no, and for most people it's probably somewhere in the middle.

And so what's interesting is that the tools of Monte Carlo simulation and other analytical methods from my field of operations research are actually quite useful for understanding how do we make important decisions under uncertainty. And so while I'm not here to give you life advice on your education or your career or your marriage, hopefully I can give you a little bit of mathematics to help you with some other things.

The flaw of averages

Okay, so those of you working in the business world, you know that in nearly all business decisions, future projections, for example, sales forecasting, or how much a material is going to cost six months from now, may be presented as a single value, like we're going to sell 1,000 widgets next month, or the cost of paper will be $10. But really, nobody knows that precisely. Those would be more accurately represented as a range, you know, or a distribution, to actually use the math term. And so making decisions based upon a single point value of those inputs, typically the average, ignores the impact if those inputs come in much higher or much lower than that one value.

You know, Dr. Sam Savage, he's an operations research scientist, he popularized the phrase "the flaw of averages," of course a takeoff on the phrase "the law of averages" from mathematics, to describe this very common mistake. The example that Dr. Savage uses is that if you're six feet tall and you can't swim, it's okay to walk across a creek if it's four feet deep the entire way. But if it's an average of four feet, where some parts are two feet and some parts are 20 feet, and you can't swim, you need to understand the distribution of those depths, because otherwise you'll drown. And so purely looking at the average rather than the distribution of those inputs could literally be a killer.
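The creek example can be put into a few lines of Python as a rough sketch. The depth values are made up for illustration (they are not from the talk) and were chosen only so that their average comes out to four feet.

```python
from statistics import mean

# Hypothetical creek depths in feet along the crossing (illustrative numbers,
# chosen so that the average is exactly four feet).
depths_ft = [2, 2, 2, 2, 2, 2, 2, 2, 20]
WADER_HEIGHT_FT = 6  # the six-foot non-swimmer from the example

average_depth = mean(depths_ft)
deepest_point = max(depths_ft)

# Decision based on the single average value.
safe_by_average = average_depth < WADER_HEIGHT_FT
# Decision based on the whole distribution of depths.
safe_everywhere = all(d < WADER_HEIGHT_FT for d in depths_ft)

print(f"average depth: {average_depth:.1f} ft, looks safe: {safe_by_average}")
print(f"deepest point: {deepest_point} ft, actually safe: {safe_everywhere}")
```

The average says the crossing is fine; the distribution says it is not, which is the whole point of thinking in ranges rather than single values.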

By looking purely at my planning factors, I was going to show up five minutes early. By using Monte Carlo simulation, I can actually see that less than half the time I'm going to make it on time, and that's really, really powerful.

So on the left-hand side is the control panel, and I'll say I'm not a Shiny developer. I kind of just hacked my way into this for the presentation, so I apologize for the entry-level visuals here. Number of days to simulate: it defaults to 5,000, and you can put in whatever you want. We both depart at 5:30 p.m. This uses the shinyTime add-in, the R package for it. Our commute is 30 minutes, and you can adjust this, with a standard deviation of 10 minutes, and our walk time to the restaurant is 10 minutes with a standard deviation of two.

So again, this is a table, the result of all 5,000 of our simulations. Here is a histogram of our arrival time. If we made it in time, it's in blue, and if we didn't, it's red, and it breaks at that 6:15 mark. Here's the histogram of my commute time from the simulations. As you can see, it looks, you know, approximately normally distributed right around an average of 30 minutes, and then the one standard deviation and two standard deviation marks are those vertical lines. Bob's commute time is also a histogram, you know, average of 30 minutes, and our walk time is a histogram at an average of 10 minutes.
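The app itself was built in R with Shiny, and its source is not part of this transcript. As a minimal sketch of the same idea in Python, the following assumes that both commuters leave at 5:30 p.m., that they walk to the restaurant together once the later of the two has arrived, and that the reservation is at 6:15 p.m. (45 minutes after departure). The function name `simulate_dinner` and its parameters are hypothetical.

```python
import random

def simulate_dinner(n_days=5000, commute_mean=30.0, commute_sd=10.0,
                    walk_mean=10.0, walk_sd=2.0, minutes_available=45.0,
                    seed=1):
    """Return the fraction of simulated days on which we arrive on time."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    on_time = 0
    for _ in range(n_days):
        my_commute = rng.gauss(commute_mean, commute_sd)
        bobs_commute = rng.gauss(commute_mean, commute_sd)
        walk = rng.gauss(walk_mean, walk_sd)
        # Assumption: we walk together, so the later commuter sets the pace.
        total_minutes = max(my_commute, bobs_commute) + walk
        if total_minutes <= minutes_available:
            on_time += 1
    return on_time / n_days

# Single-point planning factors: 30 + 10 = 40 minutes, i.e. five minutes early.
print(f"planning-factor slack: {45 - (30 + 10)} minutes")
print(f"simulated on-time probability: {simulate_dinner():.2f}")
```

Under these assumptions the simulated probability should land a little under one half, in line with the figure quoted in the talk, even though the planning factors alone suggest arriving five minutes early.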

Okay. So question for the group. If we are only making it on time, you know, approximately 47% of the time, what can we do to make it more likely that we will arrive on time?

Leave earlier. Thank you. I was really hoping somebody would say that right away. Okay. So this is, again, the power of Monte Carlo simulation. All we need to do is change the inputs. I'm going to leave earlier. I'm going to leave at 5:15 p.m. Bob will still leave at 5:30, and now, instead of making it on time 47% of the time, we have a 67% probability. And if Bob also leaves at 5:15, the probability goes up to the high 90s. Okay. Here we go. 95.

And what you see is that once we actually start leaving even earlier than 5:15, the probability actually doesn't go up that much. All right. Joe and Brian, yeah, you guys hit the other one. You've got to lower the standard deviations of the commute. So one of the things about Monte Carlo simulation is you actually get to understand the impact of the variations of those input distributions. And so, yeah, I mean, if we have a more dependable commute that has not a 10-minute standard deviation, but rather a five-minute standard deviation, that improves our reliability from 47 to 67% right there.

If we can get it down even more, down to say a three-minute standard deviation, you know, we get it to an 86% reliability. What that essentially means is that by lowering the standard deviation of the commute, we're getting our commute time down to that one single point estimate of 30 minutes that we'd been planning on initially. Variability is the death of operations.
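Both levers discussed here, leaving earlier and reducing commute variability, amount to re-running the same simulation with different inputs. The sketch below is a Python approximation of that what-if loop, not the Shiny app's code; the function `on_time_probability` and its parameter names are hypothetical, and it assumes independent, normally distributed commutes and a shared walk.

```python
import random

def on_time_probability(my_head_start=0.0, bobs_head_start=0.0,
                        commute_sd=10.0, n_days=20000, seed=7):
    """Estimate the chance of reaching the restaurant by 6:15 p.m."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_days):
        # Minutes after 5:30 p.m. at which each commute ends; a head start
        # of 15 means leaving at 5:15 p.m. instead of 5:30 p.m.
        me = rng.gauss(30.0, commute_sd) - my_head_start
        bob = rng.gauss(30.0, commute_sd) - bobs_head_start
        walk = rng.gauss(10.0, 2.0)
        hits += max(me, bob) + walk <= 45.0
    return hits / n_days

scenarios = {
    "both leave 5:30, sd 10": on_time_probability(),
    "I leave 5:15":           on_time_probability(my_head_start=15),
    "both leave 5:15":        on_time_probability(15, 15),
    "both leave 5:30, sd 5":  on_time_probability(commute_sd=5.0),
    "both leave 5:30, sd 3":  on_time_probability(commute_sd=3.0),
}
for label, probability in scenarios.items():
    print(f"{label:<24} {probability:.0%}")
```

The exact percentages depend on the random draws and on these modeling assumptions, so they will not necessarily match the app's numbers digit for digit, but the ordering of the scenarios should be the same.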

All right. So question for the group. What is the magnitude of being five minutes early versus five minutes late? Or what are the different impacts of being five minutes early versus five minutes late? Yeah, missing a reservation. Yeah. If I show up five minutes early, I'm going to make my table. If I show up five minutes late, maybe, maybe not. If I show up 20 minutes early, I'm definitely going to make my table, but I just wasted 20 minutes of my life. If I show up 20 minutes late, I'm definitely going to miss my table.

And so even though you're wrong in the same way, you're 20 minutes off, the impact is very, very different. And that's one of the real powers of Monte Carlo simulation is you can see, hey, if I am wrong on my inputs, I'm going to be wrong on my outputs, but I'm going to be wrong in different ways. And that really helps drive the conversations around how does this impact our operations? Like, am I okay being 20 minutes early, but not 20 minutes late? Maybe I value my time so much that I don't want to be 20 minutes early.
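One way to make this asymmetry concrete is to look at the simulated margin (minutes early or late) and attach a different cost to each direction. In this sketch the cost figures are purely hypothetical, and treating every late arrival as a lost table is a simplification of the "maybe, maybe not" case for being five minutes late.

```python
import random

rng = random.Random(11)  # fixed seed so the sketch is repeatable
margins = []  # minutes of slack at 6:15 p.m.: positive = early, negative = late
for _ in range(10000):
    later_commute = max(rng.gauss(30, 10), rng.gauss(30, 10))
    walk = rng.gauss(10, 2)
    margins.append(45 - (later_commute + walk))

p_late = sum(m < 0 for m in margins) / len(margins)
p_20_late = sum(m <= -20 for m in margins) / len(margins)
p_20_early = sum(m >= 20 for m in margins) / len(margins)

# Hypothetical costs, only to show that equal-sized errors need not cost the same.
COST_PER_MINUTE_EARLY = 1.0   # value placed on one wasted minute
COST_OF_MISSED_TABLE = 60.0   # flat penalty for arriving after 6:15 p.m.

def cost(margin):
    """Early arrivals waste time in proportion; any late arrival loses the table."""
    return COST_PER_MINUTE_EARLY * margin if margin >= 0 else COST_OF_MISSED_TABLE

expected_cost = sum(cost(m) for m in margins) / len(margins)
print(f"late: {p_late:.1%}, 20+ min late: {p_20_late:.1%}, "
      f"20+ min early: {p_20_early:.1%}")
print(f"cost of 20 min early: {cost(20)}, cost of 20 min late: {cost(-20)}")
print(f"expected cost per evening: {expected_cost:.1f}")
```

Changing the two cost constants is a quick way to explore the question raised here: whether being 20 minutes early is acceptable when being 20 minutes late is not.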

What's unrealistic about this simulation?

All right. So quick question. What is unrealistic about this simulation? I intentionally put a few things in that were more unrealistic as a conversation starter. What are some of them?

Normality of the bus route. Absolutely. Yeah. Most transit times are exponentially distributed. Okay. Yeah. We don't exactly have, yeah, we don't have to arrive precisely on time. Absolutely. Yeah. So really it's more of who's the first person to show up. That's part of it. Okay. What's another thing? Let's go back to the very last example that I gave.

Okay. Brandon. Yeah. You got it. Independence of the commute time. Yeah. The fact is that if me and my neighbor are both leaving downtown Minneapolis and traveling to our neighborhood at the same time, the traffic conditions are essentially going to mean that we have very similar commutes. Now, we don't necessarily need to have the exact same commute time. Like, I've been on a bus that broke down on the side of the freeway. The bus right behind me just went right by us. And so my commute time was a lot longer than the bus that went right by. But for the most part, the conditions that impact me and my commute are going to impact my neighbor as well.

And so that actually leads to the, one of the most important things when doing an actual Monte Carlo simulation study is you need to both define the problem and you need to talk through the details because in this case, the two input variables, my commute time and Bob's commute time, they are not independent. They're heavily correlated to each other. And when we're going to draw from a distribution that represents those, we need to understand that correlation.
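A simple way to draw two correlated commute times, sketched here in Python, is to build each one from a shared "traffic" term plus an individual term. This is one common construction for correlated normal variables, not necessarily what the speaker would use, and the function name and the correlation values tried are assumptions for illustration.

```python
import math
import random

def on_time_probability(correlation, n_days=20000, seed=3):
    """On-time chance when both commutes share a common traffic factor."""
    rng = random.Random(seed)
    shared_weight = math.sqrt(correlation)       # weight on shared traffic
    own_weight = math.sqrt(1.0 - correlation)    # weight on individual luck
    hits = 0
    for _ in range(n_days):
        traffic = rng.gauss(0, 1)  # conditions that hit both commuters
        # Each commute keeps mean 30 and sd 10; their correlation is `correlation`.
        me = 30 + 10 * (shared_weight * traffic + own_weight * rng.gauss(0, 1))
        bob = 30 + 10 * (shared_weight * traffic + own_weight * rng.gauss(0, 1))
        walk = rng.gauss(10, 2)
        hits += max(me, bob) + walk <= 45
    return hits / n_days

for rho in (0.0, 0.5, 0.9, 1.0):
    print(f"correlation {rho:.1f}: on time {on_time_probability(rho):.0%}")
```

With independent commutes, either person having a bad day makes the pair late; as the correlation rises, bad days tend to coincide, so the estimated on-time probability increases even though each individual commute distribution is unchanged.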

Takeaways

Okay. So takeaways: Monte Carlo simulation is an incredibly helpful tool for understanding the potential outcomes of uncertain inputs. Really any platform that can generate random variables can be used: R, Python, if necessary, Excel. And it really helps spur important discussions on a range of outcomes and the impact of those decisions. So thank you all for your time. I'll leave my contact information if you care to reach me, and I look forward to any questions you have.

Thank you so much, Ralph. That was great. I'm scrolling through the chat to see if there were any questions yet. So feel free to ask live or put them into the chat. One comment I had, and I think it would be cool to take this application and apply it to the Boston train schedules.

Yeah, I know the transit agency here in Minneapolis keeps very good records of this. Stop times.

Sorry, my internet froze for one second. But I see a question from Joe, if you want to ask that live.

Yeah, sure. Um, I was just wondering what type of information about the relationship between the two commutes you would need to properly run the Monte Carlo simulation? Do you just, like, set them as correlated? Or do you need more information?

Yeah, I mean, I would say you would need to understand the correlation, but ideally, you would have some sort of predictive model that generates it. Like, for instance, in Minneapolis, if you're going home during a three-foot snowstorm, it's gonna be a lot slower than, you know, a Friday afternoon with no traffic. Right. So there's a lot of conditions that lead to that distribution that you then draw from.

I know you mentioned that was just like a basic Shiny application that you used, but I'm curious if you have any tips for people just getting started as well, or something that you learned from making that?

Yeah, reactive data frames are a tricky concept to get to. And actually, I really appreciate you talking through how you got around it. Because, I mean, I think you all saw as I was changing things in the app, it was lagging.

Yeah, it's a really fundamentally different way of thinking about coding. And I came from a pure sort of R for data analysis background. And I would say if you're just getting started in Shiny, the one part of it where I do recommend watching tutorials and reading things instead of just jumping into it is really getting your understanding of what reactivity means and how it works, because that is the fundamental difference. And it's really confusing. But it's very powerful once you learn it.

Yeah, for sure. I started out just with the tutorials on the RStudio website; there are a couple of different formats. I really like how RStudio makes tutorials available to people who prefer to read or people who prefer to watch videos. So I started by watching a video tutorial and following along with the examples. And I'm also going to link to some articles. Yeah. And then if anyone doesn't know about the R for Data Science Slack, they have a specific channel for help in Shiny. And so I use that a lot. And that's how I found people who are willing to sort of pair program with me as well.

Cool. Garrett, I see you have a question on the simulation. Do you want to ask that one live?

Sure. Yeah. So I was just thinking about, so you ran the simulation 5,000 times, which, I mean, for probably only two normally distributed variables makes sense. But as a heuristic, when you start, you know, just trying to apply this to, let's say, a Fortune 500 company that has probably tens of thousands of different distributions it could be tracking, how many simulations do you really need to start running at that point? And, you know, what is the best way to decide that?

Yeah, so there are formulas to understand the relationship between the number of trials and the confidence interval of your output variable, you know, and so if you want to say, I want to make sure that I get, you know, a 95 or 99% confidence interval, there are formulas to understand that. On just kind of a practical level, it comes down to what's the relationship between the compute time and how much you're willing to wait, you know, because if it's something that you're doing once a month, and you can run it overnight, there's no reason not to crank, you know, really jam up the number of trials, particularly if you really care about the tail risk. Okay, if I really want to understand what is the likelihood that we're going to get to the restaurant at 7 p.m., you're going to need a lot of trials to really quantify that.
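The speaker does not name a specific formula. One standard option, when the output of interest is a probability such as "on time or not," is the normal-approximation margin of error for a proportion, which gives n ≥ z² · p(1 − p) / E² trials for a margin E at a given confidence level. The Python sketch below applies it; the function name `trials_needed` and the example probabilities are assumptions.

```python
import math
from statistics import NormalDist

def trials_needed(p, margin, confidence=0.95):
    """Trials so that an estimated probability near p is within +/- margin.

    Uses the normal approximation to the binomial: n >= z^2 * p * (1 - p) / E^2.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # two-sided critical value
    return math.ceil(z * z * p * (1.0 - p) / margin ** 2)

# A common outcome: on-time probability near 47%, within one percentage point.
print(trials_needed(0.47, 0.01, confidence=0.95))
print(trials_needed(0.47, 0.01, confidence=0.99))

# A tail outcome: a hypothetical 1-in-1000 event, pinned down to within 10%
# of its own size, needs far more trials.
print(trials_needed(0.001, 0.0001, confidence=0.95))
```

This is why tail risks are expensive to quantify: the rarer the event and the tighter the relative precision wanted, the more the required number of trials grows.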

So I had a question in the chat about the functions that I use to handle the colors, the RGB and the hex codes. So I wanted to mention two things. First of all, I used the colourpicker package from Dean Attali, though I'm not sure if I'm pronouncing his name correctly. So that was the package, colourpicker. And then I used some functions, RGB to hex and hex to RGB, to convert back and forth. So that would be my other general advice for people working in Shiny: a lot of people have made a lot of really great infrastructure. And if you can rely on that sort of thing, it's possible people have already invented what you think should be invented, which is really exciting, and you can sort of stand on the shoulders of giants.

Awesome. Are there any other questions that I missed in the chat that you saw? Scrolling up, I don't, I don't see any others unless there were direct, direct ones to you, Kaya or Ralph. If, if anybody would want to, or if you have other questions, and maybe you want to connect with each other, it could be helpful if we put maybe your LinkedIn links right there, also, as Kaya mentioned, there's the R for data science, online learning community. I think that's a great way to connect with everyone as well. But thank you so much, Kaya and Ralph for awesome presentations. As I mentioned, this was recorded too. So I'll put the link to the recording on the meetup discussion page. But if anyone has any other questions for the speakers, or if you maybe want to give a talk next time, please let me know and reach out through meetup or LinkedIn. I'd love to hear from you.
