
Ralph Asher | Intro to Monte Carlo Simulation | RStudio

Introduction to Monte Carlo simulation, using Shiny. Presentation by Ralph Asher.

Monte Carlo Simulation is a powerful methodology to model and understand the impact of uncertainty upon real life. In this talk, I will introduce Monte Carlo simulation through a simple example: I'm meeting my neighbor after work for dinner in our neighborhood. Given the uncertain length of our commutes, will we make it in time for our reservation? I'll talk through the scenario, then walk through a simple Shiny app that explores the power of Monte Carlo Simulation to recommend decisions under uncertainty.

Bio: I am Ralph Asher, and I am the founder of Data Driven Supply Chain LLC, a Minnesota-based consultancy that helps organizations apply data science and AI methods, including simulation, to design and improve their supply chain. Prior to founding Data Driven Supply Chain, I worked as an Operations Research Scientist at Target, designing e-commerce supply chain networks, and at General Mills, designing warehousing networks. I have used R for supply chain analytics for over eight years at these companies. I live in the Minneapolis, MN area and love running in the (usually cool) Minnesota air. I can be reached at ralph@datadrivensupplychain.com

Jul 27, 2021
32 min


Transcript

This transcript was generated automatically and may contain errors.

Good. All right. So again, I'll be talking about Monte Carlo Simulation today. First, a very brief introduction about myself. So like probably many of you on the call, I have a math and science background, a bachelor's in physics and a master's in operations research. After I graduated college, I was in the military on active duty and I continue to serve in a part-time capacity. Towards the end of my time on active duty, I ended up getting my master's in operations research. And in 2013, I moved here to Minneapolis, Minnesota and began working in the supply chain analytics field at General Mills. So if you've eaten Cheerios, you've eaten a General Mills product, they're headquartered here in Minneapolis.

Worked there for a couple of years and then I was recruited to work at Target, also headquartered here in Minneapolis, where I was the principal designer of e-commerce last-mile networks for Target. Worked there for about six years and just last month, left to found my own business, Data Driven Supply Chain. Essentially, it's my skill set, applying data science and artificial intelligence to the supply chain. I've been an R user since 2012, pretty much since I was looking for a job. And I've just found it to become an increasingly powerful language for all sorts of supply chain analytics. And I've loved math and maps since I was a little kid, which is probably indicative of me going into this career field.

All right. Today's agenda: how can we use data, math, and advanced analytical methods to understand risk and uncertainty? An introduction to the flaw of averages and thinking in terms of ranges, not single values. And then I'll give a conceptual introduction to Monte Carlo simulation and a very simple example of it. You know, this is by no means intended to be a comprehensive treatment of Monte Carlo simulation, but hopefully it'll get you excited about exploring it in your work.

Understanding risk and uncertainty

All right. First off, how can we use data to understand risk and uncertainty? You know, most people are inherently paralyzed by uncertainty and we tend to ignore it in our lives. There's just way too many things that could turn out to be different than what we're expecting. You know, so our decisions are typically made based upon either heuristics or, you know, good rules of thumb or a single estimate, which may or may not be useful in that particular situation. Like for example, before COVID, I flew quite regularly. And so I knew how long to plan for, right? I knew I needed to take 10 minutes at the airport, 15 minutes of security, five minutes for the flight, right? 30 minutes.

Well, things have changed, right? Next week, I'm actually going on my first flights since COVID, and I have no idea what security is like, I have no idea what to expect. And so my past experience actually is not useful for making decisions about the future. You know, and what's worse about this is that our ability to objectively evaluate decisions actually declines the more complex and uncertain the decision is about the important details. You know, for example, should I go to college? And if so, where? For some people, the answer is clearly yes, for some people it's clearly no, for most people it's probably somewhere in the middle.

And so what's interesting is that the tools of Monte Carlo simulation and other analytical methods from my field of operations research are actually quite useful for understanding how do we make important decisions under uncertainty. And so while I'm not here to give you life advice on your education or your career or your marriage, hopefully I can give you a little bit of mathematics to help you with some other things.

The flaw of averages

Okay, so those of you working in the business world, you know that in nearly all business decisions, future projections, for example, sales forecasting, or how much a material is going to cost six months from now, may be presented as a single value, like we're going to sell 1000 widgets next month, or the cost of paper will be $10. But really, nobody knows that precisely. Those would be more accurately represented as a range, you know, or a distribution, to actually use the math terms. And so making decisions based upon a single point value of those inputs, typically the average, ignores the impact if those inputs come in much higher or much lower than that one value.

You know, Dr. Sam Savage, he's an operations research scientist, he popularized the phrase "the flaw of averages", of course a takeoff on the phrase "the law of averages" from mathematics, to describe this very common mistake. The example that Dr. Savage uses is that if you're six feet tall, and you can't swim, it's okay to walk across a creek if it's four feet deep the entire way. But if it's an average of four feet, some parts are two feet, some parts are 20 feet, and you can't swim, you need to understand the distribution of those depths, because you'll drown. And so purely looking at the average rather than the distribution of those inputs could literally be a killer.


What is Monte Carlo simulation?

Okay, so Monte Carlo simulation, what is it? For those of you who maybe have a little bit more background in it, this is not a textbook definition, but I think it's really useful for conveying the idea. It's an analytical method that evaluates a lot of potential combinations of uncertain inputs, drawing from the range of possibilities of each of those inputs, and calculates the overall outcome. So if you could think of a function k, or a value k being the function of three input variables, x, y, and z. If x, y, and z are not single point values, but rather are themselves distributions, Monte Carlo simulation helps you answer the question, well, what is the distribution of k?

And what are the real-world impacts of different segments of k's distribution? So if these variables yield an output variable k that's normally distributed, the middle of that distribution is going to have one potential outcome, but the high and low tails are going to have very different impacts upon your real-world operation. And so Monte Carlo simulation helps you understand both what the outcome distribution is and what the impacts of those outcomes are.
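As a rough illustration of that idea, here is a minimal Python sketch (the talk's own app is written in R with Shiny). The three input distributions and the function `k = x + y * z` are invented purely for the example; nothing about them comes from the talk.

```python
import random
import statistics

def simulate_k(n_trials, seed=0):
    """Draw x, y, z from assumed distributions and return the simulated values of k."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_trials):
        x = rng.gauss(10, 2)         # assumed: normal input, mean 10, sd 2
        y = rng.uniform(0, 5)        # assumed: uniform input between 0 and 5
        z = rng.expovariate(1 / 3)   # assumed: exponential input with mean 3
        results.append(x + y * z)    # hypothetical function k = f(x, y, z)
    return results

k_values = simulate_k(10_000)
k_sorted = sorted(k_values)

# Summarize the distribution of k rather than a single point estimate
print("mean of k:      ", round(statistics.mean(k_values), 2))
print("5th percentile: ", round(k_sorted[int(0.05 * len(k_sorted))], 2))
print("95th percentile:", round(k_sorted[int(0.95 * len(k_sorted))], 2))
```

The point of looking at the percentiles as well as the mean is the one made above: the tails of k can matter very differently from its middle.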

This is really commonly used in finance. You know, the stress testing that the Federal Reserve requires is an application of it. The oil and gas industry uses this because they need to understand what is the potential financial impact of, say, drilling a new well, or any sort of investment. And in my field, supply chain, Monte Carlo simulation can be used in a number of different ways. One, it can be used to understand the impact of uncertain sales in the future upon different warehouses and trucking segments in the supply chain. It can also be used to understand investment decisions around, hey, do I need to buy more trucks? Do I need to buy more inventory in anticipation of sales, right?

The normal distribution

All right, so a very brief review of the normal distribution. Many of you have probably seen this before. You know, the normal distribution is also called a Gaussian distribution or a bell curve. It's a continuous probability distribution that is described by two values, the mean or the average and the standard deviation. The image on the left represents a distribution of pizza delivery times after an order is placed. So the mean is 30 minutes, right down the middle, and the standard deviation is five minutes.

With a normal distribution, it is symmetric around the mean, and that means that a value of one standard deviation over the mean, in this case 35 minutes, is just as likely as a value that is one standard deviation below the mean, or 25 minutes. Standard deviation also can be used to describe the likelihood of a range of outcomes. Approximately 68% of the values are within plus or minus one standard deviation, and then 95.5% within two standard deviations, and nearly everything within three.
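The symmetry and the coverage percentages above can be checked with Python's standard-library `statistics.NormalDist`. This is a small sketch using the pizza example's mean of 30 minutes and standard deviation of 5 minutes; the helper name `share_within` is made up for the illustration.

```python
from statistics import NormalDist

delivery = NormalDist(mu=30, sigma=5)  # pizza delivery time in minutes

def share_within(dist, n_sd):
    """Probability that a value falls within n_sd standard deviations of the mean."""
    lo = dist.mean - n_sd * dist.stdev
    hi = dist.mean + n_sd * dist.stdev
    return dist.cdf(hi) - dist.cdf(lo)

# Symmetry: 35 minutes and 25 minutes have the same density
print(delivery.pdf(35) == delivery.pdf(25))

# Coverage for one, two, and three standard deviations
for n_sd in (1, 2, 3):
    print(n_sd, round(share_within(delivery, n_sd), 4))
```

The three coverage values come out at roughly 0.68, 0.95, and 0.997, matching the rule of thumb quoted in the talk.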

Tail risk is those unlikely but impactful outcomes that happen when you get to the real extremes of the probability distribution. So in this case, if I'm expecting a pizza to come in 30 minutes, and it comes in 15, well hey, maybe I was planning for a party, right, waiting for people to arrive, and now my pizza's there and people haven't arrived. If I'm anticipating it at 30 minutes, and it comes at 45, people are going to be hungry, right, that's going to have an impact in different ways. In my example for Monte Carlo simulation, all the inputs will be normally distributed, just for the convenience of the talk, but keep in mind that in real life, not everything is normally distributed.

The dinner reservation scenario

All right, so the Monte Carlo simulation example. This is what I'll use to show the power of Monte Carlo simulation. Keep in mind this is both fictional and pre-COVID, so it is doubly fictional. I neither work in downtown Minneapolis nor use the bus anymore. All right, so my neighbor Bob and I, we both work in downtown Minneapolis, and we take the same bus from our neighborhood to downtown in the morning, and then back home in the evening, so same bus route. Bob's a good guy, I'd even say he's a friend, so we decided that, hey, one weeknight, we're going to make reservations to go to a restaurant in our neighborhood.

We'll get our reservations at 6:15 p.m. We'll leave our office at 5:30. We don't work in the same place, but you know, we'll each leave our desk at 5:30. We'll go to the bus stops, take the buses back to our neighborhood, walk to our homes, and then, you know, we'll wait for the one that shows up later, then we'll walk to the restaurant. Will we make it on time? Well, I don't know. Bob sure doesn't know. You probably don't know. Maybe Monte Carlo simulation will help us know, though.

Okay, so a little bit more specifics of this. I'm going to leave my office at 5:30 p.m. I'm going to take the Minneapolis city bus back to my home, and my commute time from when I leave my office to when I get home is normally distributed with a mean of 30 minutes and a standard deviation of 10 minutes. My next door neighbor, Bob, he also has a 30-minute commute. Even though we don't work in the same office, they're close enough that our commutes are roughly equivalent, and so both of us will have a 30-minute commute from the moment we leave our office to the moment we get to our front doors, normally distributed with a 30-minute average and a 10-minute standard deviation. Once we both get home, we'll then walk to our restaurant in our neighborhood, and that walk is normally distributed with an average of 10 minutes and a standard deviation of two minutes.

So if we plan our schedule based purely on the average, we'll both leave at 5:30. The average commute is 30 minutes. That's 6 p.m. Average walk to the restaurant is 10 minutes. That's 6:10 p.m. Our reservation is 6:15, so if we plan purely on the averages, we will show up five minutes early.
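The averages-only plan is just arithmetic, which a few lines of Python can make explicit. The numbers are the ones from the scenario; the constant names and the `clock` helper are invented for this sketch.

```python
# All times are minutes after midnight; the values come from the scenario in the talk.
DEPART = 17 * 60 + 30        # 5:30 p.m.
RESERVATION = 18 * 60 + 15   # 6:15 p.m.
MEAN_COMMUTE = 30            # average commute, in minutes
MEAN_WALK = 10               # average walk to the restaurant, in minutes

def clock(minutes):
    """Format minutes after midnight as HH:MM on the 24-hour clock."""
    hours, mins = divmod(minutes, 60)
    return f"{hours:02d}:{mins:02d}"

planned_arrival = DEPART + MEAN_COMMUTE + MEAN_WALK
print(clock(planned_arrival))          # 18:10
print(RESERVATION - planned_arrival)   # 5 (minutes early)
```

Note that this plan uses only the means; the standard deviations of the commute and the walk do not appear anywhere in it.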

One thing that's important to keep in mind is that we may not take the exact same bus. The bus runs every few minutes, so if I show up to the bus stop right when the bus is pulling up, I'll get on the bus whether or not Bob's there, vice versa.

Okay, so if you're already in the app, you'll see a table like this at the top. I'll go over how to read it, and then we'll switch over to the app. So as I mentioned earlier, Monte Carlo simulation is a way to take input variables as distributions and understand the distribution of the outcome variable, whatever that function is. In this case, the input variables are my commute time, Bob's commute time, and our walk time to the restaurant. The outcome variable is our arrival time at the restaurant, and then whether or not we made it in time for our 6:15 reservation. With Monte Carlo simulation, we can actually simulate this many, many times, and we'll do it 5,000 times, actually.

In this first example, I left at 5:30 p.m., and this is all in the 24-hour clock. My commute is just a little bit above that average of 30 minutes, so I arrived home at 6:01 p.m. Bob also left at 5:30 p.m. His commute was 24 minutes, so he actually arrived a little bit earlier than me. So I was the later arrival, and we left our homes at 6:01 p.m. when I showed up. Our walk time to the restaurant was about 10 minutes, 10 and a half minutes, which meant we got to the restaurant at 6:11 p.m. Did we make it in time? Yes, we did, by four minutes.

In the second simulation, my commute was 21 minutes. Bob's was about 19 minutes, so we both got home actually quite a bit earlier than expected. So we both got home by 5:51 p.m. Again, we took about 11 minutes to walk to the restaurant, so we got there 13 minutes early, quite early, but we made it in time. Third simulation, my commute was dead on 30 minutes. Bob's was a little bit longer, but by the time we both showed up at home and walked to the restaurant, we got to the restaurant two minutes early, so we're good.

And in the fourth simulation, things kind of went sideways here. So my commute was only 17 minutes, but Bob's was nearly an hour. It was 50 minutes long, and so we didn't actually even leave our homes until 6:20 p.m., after our reservation. To add insult to injury, we took a lot longer than expected to walk to the restaurant, and so we actually missed our reservation by almost 20 minutes.
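Each row of that table follows the same recipe: both riders get home, the pair leaves when the later one arrives, and the walk time is added on top. Here is a small Python sketch of one such row. The function name `one_trial` and the calendar date are made up; the second example's inputs (21, 19, and about 11 minutes) are from the talk, while the 14-minute walk in the fourth example is an assumed value, since the talk only says the walk took "a lot longer than expected".

```python
from datetime import datetime, timedelta

DEPARTURE = datetime(2021, 7, 27, 17, 30)    # 5:30 p.m.; the date itself is arbitrary
RESERVATION = datetime(2021, 7, 27, 18, 15)  # 6:15 p.m.

def one_trial(my_commute, bob_commute, walk):
    """Return (arrival time, minutes early) for one set of commute and walk times in minutes."""
    my_home = DEPARTURE + timedelta(minutes=my_commute)
    bob_home = DEPARTURE + timedelta(minutes=bob_commute)
    leave_home = max(my_home, bob_home)              # wait for whoever gets home later
    arrival = leave_home + timedelta(minutes=walk)
    minutes_early = (RESERVATION - arrival).total_seconds() / 60
    return arrival, minutes_early

# Second simulated row from the talk: 21 and 19 minute commutes, 11 minute walk
arrival, minutes_early = one_trial(21, 19, 11)
print(arrival.strftime("%H:%M"), minutes_early)   # 18:02 13.0

# Fourth row: 17 and 50 minute commutes; the 14 minute walk is an assumption
arrival, minutes_early = one_trial(17, 50, 14)
print(arrival.strftime("%H:%M"), minutes_early)   # 18:34 -19.0
```

A negative value for minutes early means the reservation was missed by that many minutes.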

Okay. Now, I just talked through four. Instead of talking through 5,000 more, we'll actually just let the computer take care of it, and then we'll go from there.

Running the simulation

All right. So if we do this 5,000 times, we actually only made it on time for our reservation about 47 percent of the time. Said another way, you only have about a 47 percent chance of making it on time if you plan purely on the average. Okay. This is the power of Monte Carlo simulation. By looking purely at my planning factors, I was going to show up five minutes early. By using Monte Carlo simulation, I can actually see that I'm going to make it on time less than half the time, and that's really, really powerful.
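The 5,000-trial experiment can be sketched in a few lines of Python using only the standard library. This is not the speaker's Shiny app, just a minimal re-creation under the same stated assumptions: independent normal commutes (mean 30, standard deviation 10), a normal walk (mean 10, standard deviation 2), and times measured in minutes after 5:30 p.m. so that the 6:15 reservation is minute 45. The function and parameter names are invented for the sketch.

```python
import random

def simulate_on_time_share(n_trials=5000, my_depart=0.0, bob_depart=0.0,
                           commute_mean=30.0, commute_sd=10.0,
                           walk_mean=10.0, walk_sd=2.0,
                           deadline=45.0, seed=42):
    """Estimate the share of evenings on which the pair reaches the restaurant by the deadline.

    All times are minutes after 5:30 p.m., so the 6:15 reservation is minute 45.
    """
    rng = random.Random(seed)
    on_time = 0
    for _ in range(n_trials):
        my_home = my_depart + rng.gauss(commute_mean, commute_sd)
        bob_home = bob_depart + rng.gauss(commute_mean, commute_sd)
        # The walk starts only when the later of the two gets home
        arrival = max(my_home, bob_home) + rng.gauss(walk_mean, walk_sd)
        if arrival <= deadline:
            on_time += 1
    return on_time / n_trials

share = simulate_on_time_share()
print(f"on time in {share:.1%} of simulated evenings")
```

With these defaults the estimate should land close to the roughly 47 percent quoted in the talk, give or take about a percentage point from run to run. The gap from the averages-only plan comes from the `max`: the pair waits for whoever is later, so a slow commute for either rider delays both.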


So left-hand side, the control panel, and I'll say I'm not a Shiny developer. I kind of just hacked my way into this for the presentation, so I apologize for the kind of entry-level visuals here. Number of days to simulate. It defaults to 5,000. You can put whatever you want. We both depart at 5:30 p.m. This uses the shinyTime add-in, an R package for it, and our commute is 30 minutes, and you can adjust this, with a standard deviation of 10 minutes, and our walk time to the restaurant is 10 minutes with a standard deviation of two.

So again, this is a table, the result of all 5,000 of our simulations. Here is a histogram of our arrival time. If we made it in time, it's in blue, and if we didn't, it's red, and it breaks at that 6:15 mark. Here's the histogram of my commute time from the simulations. As you can see, it looks, you know, approximately normally distributed right around an average of 30 minutes, and then the one standard deviation and two standard deviation marks are those vertical lines. Bob's commute time is also a histogram, you know, average of 30 minutes, and our walk time is a histogram at an average of 10 minutes.

Okay. So question for the group. If we are only making it on time, you know, approximately 47% of the time, what can we do to make it more likely that we will arrive on time?

Leave earlier. Thank you. I was really hoping somebody would say that right away. Okay. So this is, again, the power of Monte Carlo simulation. All we need to do is change the inputs. I'm going to leave earlier. I'm going to leave at 5:15 p.m. Bob will still leave at 5:30, and now, instead of making it on time 47% of the time, we now have a 67% probability. And if Bob also leaves at 5:15, the probability goes up to the high 90s. Okay. Here we go. 95.

And what you see is that once we actually start leaving even earlier than 5:15, the probability actually doesn't go up that much. All right. Joe and Brian, yeah, you guys hit the other one. You've got to lower the standard deviations of the commute. So one of the things about Monte Carlo simulation is you actually get to understand the impact of the variations of those input distributions. And so, yeah, I mean, if we have a more dependable commute that has not a 10-minute standard deviation, but rather a five-minute standard deviation, that improves our reliability from 47 to 67% right there.

If we can get it down even more, down to say a three-minute standard deviation, you know, we get it to an 86% reliability. What that essentially means is that by lowering the standard deviation of the commute, we're getting our commute time down to that one single point estimate of 30 minutes that we'd been planning on initially. Variability is the death of operations.
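The what-if changes discussed above (leaving earlier, or making the commute more dependable) only require changing the inputs and rerunning. The following Python sketch does that for a handful of scenarios, keeping the same assumptions as the talk: independent normal commutes with a 30-minute mean, a normal walk with mean 10 and standard deviation 2, and a 6:15 reservation. The function name, the scenario labels, and the seed are invented for the example.

```python
import random

def on_time_share(my_depart=0.0, bob_depart=0.0, commute_sd=10.0,
                  n_trials=5000, seed=1):
    """Share of trials arriving by 6:15; departures are minutes relative to 5:30 p.m."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        my_home = my_depart + rng.gauss(30, commute_sd)
        bob_home = bob_depart + rng.gauss(30, commute_sd)
        arrival = max(my_home, bob_home) + rng.gauss(10, 2)
        hits += int(arrival <= 45)   # minute 45 after 5:30 is the 6:15 reservation
    return hits / n_trials

scenarios = {
    "baseline (both 5:30, sd 10)": on_time_share(),
    "I leave at 5:15":             on_time_share(my_depart=-15),
    "both leave at 5:15":          on_time_share(my_depart=-15, bob_depart=-15),
    "commute sd 5":                on_time_share(commute_sd=5),
    "commute sd 3":                on_time_share(commute_sd=3),
}

for label, share in scenarios.items():
    print(f"{label:30s} {share:.1%}")
```

The estimates should come out in the neighborhood of the figures read off the app in the talk (about 47%, 67%, and 95% for the departure changes, and roughly 67% and the mid-80s for the tighter commutes), with some simulation noise.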

All right. So question for the group. What is the magnitude of being five minutes early versus five minutes late? Or what are the different impacts of being five minutes early versus five minutes late? Yeah, missing a reservation. Yeah. If I show up five minutes early, I'm going to make my table. If I show up five minutes late, maybe, maybe not. If I show up 20 minutes early, I'm definitely going to make my table, but I just wasted 20 minutes of my life. If I show up 20 minutes late, I'm definitely going to miss my table.

And so even though you're wrong in the same way, you're 20 minutes off, the impact is very, very different. And that's one of the real powers of Monte Carlo simulation is you can see, hey, if I am wrong on my inputs, I'm going to be wrong on my outputs, but I'm going to be wrong in different ways. And that really helps drive the conversations around how does this impact our operations? Like, am I okay being 20 minutes early, but not 20 minutes late? Maybe I value my time so much that I don't want to be 20 minutes early.
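One way to make that asymmetry concrete is to attach a cost to each simulated outcome and compare departure times by average cost rather than by on-time percentage alone. The Python sketch below does this with an entirely made-up cost function (one unit per minute spent waiting, and a flat penalty of 60 for missing the table); the talk does not propose any particular costs, so treat these numbers as placeholders.

```python
import random

def arrival_margins(minutes_earlier, n_trials=5000, seed=7):
    """Minutes early (+) or late (-) versus 6:15 when both leave `minutes_earlier` before 5:30."""
    rng = random.Random(seed)
    margins = []
    for _ in range(n_trials):
        later_commute = max(rng.gauss(30, 10), rng.gauss(30, 10))
        arrival = later_commute - minutes_earlier + rng.gauss(10, 2)  # minutes after 5:30
        margins.append(45 - arrival)
    return margins

def cost(margin, early_cost_per_min=1.0, late_penalty=60.0):
    """Hypothetical cost: waiting costs 1 per minute, missing the table costs a flat 60."""
    return early_cost_per_min * margin if margin >= 0 else late_penalty

avg_costs = {}
for minutes_earlier in (0, 15, 30, 45):
    margins = arrival_margins(minutes_earlier)
    avg_costs[minutes_earlier] = sum(cost(m) for m in margins) / len(margins)
    print(f"leave {minutes_earlier:2d} min earlier: average cost {avg_costs[minutes_earlier]:.1f}")
```

Under these placeholder costs, leaving at 5:30 is expensive because the table is often missed, and leaving 45 minutes earlier is expensive because of the waiting, so an intermediate departure tends to score best. Different cost assumptions would move that balance, which is exactly the conversation the simulation is meant to prompt.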

What's unrealistic about this simulation?

All right. So quick question. What is unrealistic about this simulation? I intentionally put a few things in that were more unrealistic as a conversation starter. What are some of them?

Normality of the bus route. Absolutely. Yeah. Most transit times are exponentially distributed. Okay. Yeah. We don't exactly have, yeah, we don't have to arrive precisely on time. Absolutely. Yeah. So really it's more of who's the first person to show up. That's part of it. Okay. What's another thing? Let's go back to the very last example that I gave.

Okay. Brandon. Yeah. You got it. Independence of the commute time. Yeah. The fact is that if me and my neighbor are both leaving downtown Minneapolis and traveling to our neighborhood at the same time, the traffic conditions are essentially going to lead to us having very similar commutes. Now, we don't necessarily need to have the exact same commute time. Like I've been on a bus that broke down on the side of the freeway. The bus right behind me just went right by us. And so my commute time was a lot longer than the bus that went right by. But for the most part, the conditions that impact me and my commute are going to impact my neighbor as well.

And so that actually leads to one of the most important things when doing an actual Monte Carlo simulation study: you need to both define the problem and talk through the details, because in this case, the two input variables, my commute time and Bob's commute time, are not independent. They're heavily correlated with each other. And when we're going to draw from a distribution that represents those, we need to understand that correlation.
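To see how much the independence assumption matters, here is a Python sketch that draws the two commutes with a chosen correlation. It uses one common construction, a shared "traffic" component plus an individual component for each rider; this is an assumption made for illustration, not the method used in the talk's app, and the function name and the correlation values tried are invented.

```python
import math
import random

def on_time_share_correlated(rho, n_trials=20_000, seed=3):
    """On-time share when the two commutes have correlation `rho` (0 <= rho <= 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        shared = rng.gauss(0, 1)  # traffic conditions that both riders face
        # Each standardized commute mixes the shared part with an individual part,
        # which keeps unit variance and gives the pair correlation rho
        mine = math.sqrt(rho) * shared + math.sqrt(1 - rho) * rng.gauss(0, 1)
        bobs = math.sqrt(rho) * shared + math.sqrt(1 - rho) * rng.gauss(0, 1)
        my_commute = 30 + 10 * mine      # mean 30, sd 10, as in the talk
        bob_commute = 30 + 10 * bobs
        arrival = max(my_commute, bob_commute) + rng.gauss(10, 2)
        hits += int(arrival <= 45)       # 6:15 is minute 45 after 5:30
    return hits / n_trials

shares = {rho: on_time_share_correlated(rho) for rho in (0.0, 0.5, 0.9)}
for rho, share in shares.items():
    print(f"correlation {rho:.1f}: on time {share:.1%}")
```

With a correlation of zero this reproduces the independent case; as the correlation rises, a bad evening for one rider tends to be a bad evening for both, so waiting for the later arrival costs less on average and the on-time share goes up. The size of the effect depends on the correlation, which is why it has to be understood rather than assumed away.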

Takeaways

Okay. So takeaways. Monte Carlo simulation is an incredibly helpful tool for understanding the potential outcomes of uncertain inputs. Really any platform that can generate random variables can be used: R, Python, if necessary, Excel. And it really helps spur important discussions on a range of outcomes and the impact of those decisions. So thank you all for your time. I'll leave my contact information if you care to reach me, and I look forward to any questions you have.

Thank you so much, Ralph. That was great. I'm scrolling through the chat to see if there were any questions yet. So feel free to ask live or put them into the chat. One comment I had, and I think it would be cool to take this application and apply it to the Boston train schedules.

Yeah, I know the transit agency here in Minneapolis keeps very good records of this. Stop times.

Sorry, my internet froze for one second. But I see a question from Joe, if you want to ask that live. Yeah, sure. Um, I was just wondering what type of information about the relationship between the two commutes you would need to properly run the Monte Carlo simulation? Do you just, like, set them as correlated? Or do you need more information? Yeah, I mean, I would say you would need to understand the correlation, but ideally, you would have some sort of predictive model that generates it. Like, for instance, in Minneapolis, if you're going home during a three-foot snowstorm, it's going to be a lot slower than, you know, a Friday afternoon with no traffic. Right. So there's a lot of conditions that lead to that distribution that you then draw from.

I know you mentioned that was just like a basic Shiny application that you used, but I'm curious if you have any tips for people just getting started as well, or something that you learned from making that? Yeah, reactive data frames are a tricky concept to get to. And actually, I really appreciate you talking through how you got around it. Because I mean, I think you all saw as I was changing things in the app, it was lagging.

Yeah, it's a really fundamentally different way of thinking about coding. And I came from a pure sort of R for data analysis background. And I would say if you're just getting started in Shiny, the one part of it where I do recommend watching tutorials and reading things instead of just jumping into it is really getting your understanding of what reactivity means and how it works, because that is the fundamental difference. And it's really confusing. But it's very powerful once you learn it.

Yeah, for sure. I started out just with the tutorials on the RStudio website; there are a couple of different formats. I really like how RStudio makes tutorials available to people who prefer to read or people who prefer to watch videos. So I started by watching a video tutorial and following along with the examples. And I'm also going to link to some articles. Yeah. And then if anyone doesn't know about the R for Data Science Slack, they have a specific channel for help with Shiny. And so I use that a lot. And that's how I found people who are willing to sort of pair program with me as well.

Cool. Garrett, I see you have a question on the simulation. Do you want to ask that one live? Sure. Yeah. So I was just thinking about, so you ran the simulation 5,000 times, which I mean, for probably only two normally distributed variables makes sense. But as a heuristic, when you start, you know, just trying to apply this to, let's say, a Fortune 500 company that has probably tens of thousands of different distributions it could be tracking, how many simulations really do you need to start running at that point? And, you know, what is the best way to decide that?

Yeah, so there are formulas to understand the relationship between the number of trials and the confidence interval of your output variable, you know, and so if you want to say, I want to make sure that I get, you know, a 95 or 99% confidence interval, there are formulas to understand that. On just kind of a practical level, it comes down to the relationship between the compute time and how much you're willing to wait, you know, because if it's something that you're doing once a month, and you can run it overnight, there's no reason not to crank, you know, really jam up the number of trials, particularly if you really care about the tail risk. Okay, if I really want to understand what is the likelihood that we're going to get to the restaurant at 7 p.m., you're going to need a lot of trials to really quantify that.
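The speaker does not say which formulas he has in mind. One standard choice, when the output is a yes/no outcome such as "made the reservation", is the normal-approximation confidence interval for a proportion, whose half-width is z * sqrt(p * (1 - p) / n). The Python sketch below uses that approximation as an assumption; the function names are invented, and the 0.1% tail probability in the last line is a made-up figure for illustration.

```python
import math
from statistics import NormalDist

def half_width(p, n_trials, confidence=0.95):
    """Approximate half-width of the confidence interval for an estimated probability p."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * math.sqrt(p * (1 - p) / n_trials)

def trials_needed(p, target_half_width, confidence=0.95):
    """Approximate number of trials needed to reach a target half-width around p."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(z**2 * p * (1 - p) / target_half_width**2)

# Precision of the 47% on-time estimate after 5,000 trials
print(round(half_width(0.47, 5000), 4))

# Trials needed to pin that estimate down to plus or minus one percentage point
print(trials_needed(0.47, 0.01))

# A rare tail event (assumed 0.1% probability) measured to within half its own size
print(trials_needed(0.001, 0.0005))
```

With 5,000 trials the 47% estimate is good to roughly plus or minus 1.4 percentage points at 95% confidence, and it takes somewhat under 10,000 trials to tighten that to one point. The last line shows why tail questions are demanding: measuring a rare event to a useful relative precision takes far more trials than measuring something that happens about half the time.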

So I had a question in the chat about the functions that I use to handle the colors, the RGB and the hex codes. So I wanted to mention two things. First of all, I used the colourpicker package from Dean Attali, though I'm not sure if I'm pronouncing his name correctly. So that was a package for the color pickers. And then I used some functions, RGB to hex and hex to RGB, to convert back and forth. So that would be my other general advice for people working in Shiny: a lot of people have made a lot of really great infrastructure. And if you can rely on that sort of thing, it's possible people have already invented what you think should be invented, which is really exciting, and you can sort of stand on the shoulders of giants.

Awesome. Are there any other questions that I missed in the chat that you saw? Scrolling up, I don't see any others unless there were direct ones to you, Kaya or Ralph. If anybody has other questions, or maybe you want to connect with each other, it could be helpful if we put maybe your LinkedIn links right there. Also, as Kaya mentioned, there's the R for Data Science online learning community. I think that's a great way to connect with everyone as well. But thank you so much, Kaya and Ralph, for awesome presentations. As I mentioned, this was recorded too. So I'll put the link to the recording on the meetup discussion page. But if anyone has any other questions for the speakers, or if you maybe want to give a talk next time, please let me know and reach out through Meetup or LinkedIn. I'd love to hear from you.