Alex Gold - Avoid App Failures Through Code Promotion

Avoid App Failures Through Code Promotion by Alex Gold. Visit https://rstats.ai/gov/ to learn more.

Abstract: It’s all too easy to write an app or report or add an update and suddenly, cold bead of sweat running down your back, realize everything is broken. In this talk you’ll learn how to think about avoiding this moment with good R promotion practices. You’ll learn about a general framework for code promotion, as well as specific tools you can use to make deployments risk-free and easy.

Bio: Alex leads the Solutions Engineering team at Posit (formerly RStudio), where he helps organizations use R and Python in their enterprise environments. Alex loves all things #rstats and was a data science manager, data scientist, and economics researcher before coming to RStudio. He lives just outside Washington DC with his wife and their puppy and enjoys cooking, tai chi, and landscaping in his free time.

Twitter: https://twitter.com/alexkgold

Presented at the 2022 Government & Public Sector R Conference (December 1, 2022)

Jan 4, 2023
20 min

Transcript

This transcript was generated automatically and may contain errors.

Great. Thanks so much, Jared, for having me. Thanks so much, everybody, for being here. My name is Alex. I'm going to be talking a little about code promotion in R.

And just a little about me, I lead the solutions engineering team at a company formerly known as RStudio, now known as Posit. And so my team, primarily what we occupy ourselves with is, like, taking stuff to production in both R and Python.

So in your brains, we're going to rewind to, like, before David's talk, which was great, back to Mark's talk, because, like, these two talks are really, like, peas in a pod. So rewind your brain to, like, oh, Mark's talk was so good. He taught me how to, like, do code in production. It was so good. So this talk is not going to be about so much about R code. It's going to be sort of about the stuff you do around your R code to get it into production.

So I know I have lived this life. You probably have lived this life. Your very intimidating boss comes to you, and they're like, why is the dashboard down? This is not good. The meeting was today where we were supposed to look at the dashboard. The dashboard is down. It's because you made some update. You pushed some change. You did something, and you didn't quite check it through beforehand, right?

So what we're going to talk about today is code promotion, which is generally sort of the sequence of things you want to do to put something into production in a way where it's protected, right? And so there are sort of two parts to that. And that is being able to test, validate, and sort of make sure the thing is ready before it goes up. So that's part one. And then part two is making sure it only goes up once you've done those things, right? You have a validation plan, and it only goes there once it's validated.

Prerequisites for code promotion

So I will say this is a little bit of an advanced maneuver. And so I would suggest you do some other stuff first. So writing good code is really helpful. And Mark, I think, talked a bunch about that and also had some of these similar resources. Doing some basic project hygiene, right? Using projects, using file paths correctly, that kind of thing. If you know the What They Forgot to Teach You About R course, it's there. Using Git is good. And using renv, probably, would also be great. And I have resources here on all of these. They're linked on the slides if you want them later.

Three environments: dev, test, and prod

But OK, let's say you've done all that. Now you're ready to go, right? We're going to do code promotion. So the first thing you need is three environments. You want a dev, a test, and a prod environment, right? This is pretty, pretty, you've probably heard about this before. We're going to get a little deeper into how to do it. But dev, test, prod, right? You're going to develop stuff in dev. You're going to test it in test. And prod is for production. I have this little asterisk. Sometimes two. Depending on your use case, you can kind of smush dev and test together, right? Those could be one environment where you do your development and your testing. But prod should definitely be its own place.

OK, so you need three environments. And so how are these environments different, right? So one big difference is how easily can you make changes in each of those environments, right? In dev in particular, you want to be able to iterate quickly. You want to be able to try stuff out. Prod, like no, right? Only things that have been really validated go into prod.

Additionally, the level of validation, which sort of goes along with that, right? In dev, you're going to be trying things out. You're going to be testing them. Things might work. They might break. They might not work. That's fine. It's a sandbox, right? You want to be able to do that there. Stuff in prod, you want to have it validated. You validate it both for data science. You need to validate both for the empirical correctness of what you're doing and the code quality, right? Like this is going to stand up to some sort of scrutiny, some sort of rigor.

And then the last thing that comes up a lot is sort of the realness of the analysis, right? Are you accessing real data? Or is it some sort of fake data that you have instead because you don't want to have the real data in your dev environment? This is especially important if you're working with PII or PHI data, right? You've got to think about it. If you have a dev environment where you can use real data, that's great. But it has to be really sandboxed in that case, right? So you can't get anything out.

If you're doing writes from what you're doing, you probably don't want to be doing your testing and accidentally writing to the real data. That's bad. So you need to be able to figure out some way to not write to the real data as you're doing your development. And then the last piece is sort of like if you're doing things that take a while to run, it can be really annoying to have that happen as you're trying to iterate quickly in your dev environment. So you may want to do things like downsampling to be able to iterate quickly in dev in a way that you don't want to when you're going into production, obviously.

The config package

So there's an R package called config. It's one of the little loved R packages. It's one of my favorites. Actually, I should have brought it. One of my colleagues made a hex sticker for it. It has a fig on it, which I think is very clever. But it's a great package. It lets you sort of differentiate environments in a variety of ways. And we'll get into exactly how that works. But the basic idea is, as usual, there is a YAML file. Going to production is just like YAML. It's just YAML all the way down. Just YAML on YAML on YAML. So you take some YAML, and you load the YAML into your R session. And that's how you do config.

So the second thing you need, then, is a mechanism. You have your three environments. And you need a mechanism to go from one environment to the other to the other. And so the broad class of mechanisms, obviously, is something called CI/CD, continuous integration and continuous deployment. You've almost certainly heard of it. Some of you may have heard of it. And it sounds extremely intimidating. It is not so bad, I promise.

And in particular, today, I'm going to be talking a little bit about GitHub Actions, which is one CI/CD tool. I like it because it's really easy to use. If you use GitHub, it's already right there. But there are lots of other options, like GitLab, Azure DevOps, Jenkins. All of those are completely valid and work quite similarly. So I believe all of them are also YAML. It's YAML all the way down.

So how does CI/CD work? You have your CI/CD stuff. And it's watching your Git repo. It's watching for things to happen. Those are triggers. When those things get triggered, then it goes off and does something. And we'll get into what exactly that means. But usually, that something is either automated testing or a promotion from one place to another. Those are usually the two things that are going to happen when your CI/CD pipeline is triggered.

Live demo

OK. Now we're just going to show off some of this. So I'm going to go to, and the Wi-Fi has been a little spotty. So YOLO, here we go. This is a document that I've got. I'm going to render it. It is a Quarto doc, and I believe those have been coming up a lot today, because they're great.

So this is a little report based on the Palmer Penguins data set. If any of you know it, it's just a little toy data set in R. And so what you'll notice is that right up here, this is the dev configuration that I'm using. That's just something I manually put in there. It's not auto-generated. And it uses 166 data points. If any of you are familiar with Palmer Penguins, this is half the data set. So I've downsampled, obviously. On a data set with 300 and whatever data points, you don't need to downsample, but you get the idea.

But this is just on my machine. This is on localhost. This is on my laptop. I cannot show this to you unless I walk my laptop over to you. So if I want to share this, there are many options. Today, we're going to use something called Quarto Pub. If any of you haven't played with it, it is a place where you can host public Quarto documents for free. So that's a pretty cool resource if you didn't know about it. It's a little bit like shinyapps.io, but with shinyapps.io, there's a paid tier. There's no paid tier for Quarto Pub, at least not yet.

And so this is the test version of my app. You can tell because it says it is the test version and because it has this great slug that includes test in the name. And so that's cool, right? Here, I'm using the full data set, 333 data points. And then I've got the prod version, which is exactly the same, other than the fact that it was rendered like 30 seconds later and that it's at the prod URL, right? And this is actually really important that test and prod should look really similar, right? You want as few differences as possible between test and prod. Because very often, the reason the dashboard goes down is differences between test and prod. So when you try to do that promotion, everything broke, right? So you want to really minimize those differences as much as possible.

OK, so let's make a change. Who has a favorite ggplot theme, other than minimal, since we're using minimal? Black and white. I figured we were going there. All right, so let's check out a new branch. We're going to call it theme for our new theme. We're going to commit this new theme change to new theme.

OK, so now this is getting pushed. And this is where we are living on the edge, because the Wi-Fi is spotty. OK, so I'll show off what would happen if we got to GitHub. So here is my Git repository. What would happen is I go ahead and make a pull request. And when I made the pull request, this action would kick off. It would start doing things, which is pretty cool. I'll show you one that rendered already. So this is a job that ran a little while ago. It takes about two minutes to run the job, which is, I think, pretty good. It's longer than I want to show you during a 20-minute talk, but really not very long in the scheme of I'm trying to promote this into test. And you can see it does a bunch of different things. It sets up the job. It checks out the repository. It installs Quarto. You can think of it as starting on a totally bare server. I've got to set up Quarto. I've got to set up R. I've got to install my R packages. I have to set up something so that it knows where to publish to. And then it's going to render and publish.

So this published the test report that we looked at just a minute ago.

So now we've understood what we've done so far. We had our local change we made. We changed the theme. We were going to push that up to the test branch. It was going to do our testing. You would have seen, and you would have loved, that after the test ran, we could have merged it. And when we merged it, it was going to run another set of GitHub Actions to push it to the prod branch. So that's a pretty straightforward set of things. But how does it know how to do all that? How is that what happens? That's what we're going to get into for the last eight minutes.

How config and GitHub Actions work together

OK, so first, let's look at what's going on here with the config package. Remember, I said config is just a bunch of YAML. But we want to use it inside our R code. I used it here to downsample. So you can see I've got this config::get() call. And when I do that, it loads up this config object. And you can see that that is just a bunch of variables. It's just a bunch of R entities. It's a list, a named list. And so that's pretty useful. So now I can just use that in my code. So for example, here, where I'm slicing out part of the data, I'm taking half of it in some cases, the full data set in another. And then I have the config name and that sort of thing. So again, how does it know to do this? Well, I've got my handy YAML file. It's YAML all the way down.
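A minimal sketch of that downsampling step, not the code shown in the talk. The list `cfg` stands in for what `config::get()` returns, the field names `name` and `sample_frac` are assumptions, and the built-in `iris` data replaces Palmer Penguins so the snippet has no package dependencies.

```r
# Stand-in for the named list that config::get() returns.
# The field names are assumptions made for illustration.
cfg <- list(name = "dev", sample_frac = 0.5)

# Keep only the configured fraction of rows.
set.seed(1)  # make the random sample reproducible
n_keep <- floor(nrow(iris) * cfg$sample_frac)
dat <- iris[sample(nrow(iris), n_keep), ]

# Report which configuration and how many rows were used.
message("Config: ", cfg$name, ", rows used: ", nrow(dat))
```

Because the fraction comes from the config list, the same report code runs unchanged in every environment; only the values it reads differ.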

So in the YAML file, there are three definitions of environments. There's default, test, and prod, corresponding to dev, test, and prod. In config, they use the term default, because that's sort of a fallback. It's what's used when nothing else is specified. And you can see that in the default, in the dev place, I sample down to a half of the data set. I've also got it configured so that that goes to the test version of the report. If I were to go to the test configuration, there I go up to a full one, all of the sample. And then in the prod version, I add that it goes to the production version of the report.
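A file along those lines might look roughly like the sketch below. The keys `sample_frac` and `report` and their values are assumptions, not the contents of the file shown in the talk. The YAML is written to a temporary file so the snippet can run on its own, provided the config package is installed.

```r
# Hypothetical config.yml with one block per environment.
# "default" plays the role of dev; other blocks inherit from it.
yaml_text <- '
default:
  sample_frac: 0.5
  report: penguins-test
test:
  sample_frac: 1
prod:
  inherits: test
  report: penguins-prod
'
cfg_file <- file.path(tempdir(), "config.yml")
writeLines(yaml_text, cfg_file)

# Read each environment explicitly by name.
dev  <- config::get(config = "default", file = cfg_file, use_parent = FALSE)
test <- config::get(config = "test",    file = cfg_file, use_parent = FALSE)
prod <- config::get(config = "prod",    file = cfg_file, use_parent = FALSE)

dev$sample_frac   # fraction of the data used in dev
test$report       # not set in the test block, so it comes from default
prod$report       # overridden for production
```

Keeping the prod block down to the one value that differs from test is one way to keep those two environments as similar as possible.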

So how does config know which configuration to pick up? Well, when I do config::get(), what this does is it looks at an environment variable. It's in my environment. And if you haven't spent a lot of time playing around with environment variables, they're a really crucial part of how configuring multiple environments works. That is how things that are running know where they are, is by the environment variables that are set around them. So what I'm going to do here is set it manually, just to show you. I'll call Sys.setenv(), and the variable it uses by default is R_CONFIG_ACTIVE. And I'm going to set it to be test. So now when I set R_CONFIG_ACTIVE to be test, and I run config::get() again, you can see now I'm picking up the test config. You can see my sample fraction has updated to 1. And so it's just, if we think back about our YAML, it's now selecting this middle block based on that environment variable that I set to test. That's how config works. Like, that's the whole thing. It loads variables from a YAML file, depending on the value of an environment variable. It's like super simple, but really powerful if you have multiple environments and want different things to be true in each environment.
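The switch described here can be sketched as follows. The config file and its `sample_frac` key are hypothetical; `R_CONFIG_ACTIVE` is the variable that `config::get()` reads by default. The snippet assumes the config package is installed.

```r
# Minimal hypothetical config file with a default (dev) and a test block.
cfg_file <- file.path(tempdir(), "config.yml")
writeLines(c(
  "default:",
  "  sample_frac: 0.5",
  "test:",
  "  sample_frac: 1"
), cfg_file)

# With R_CONFIG_ACTIVE unset, config::get() falls back to "default".
Sys.unsetenv("R_CONFIG_ACTIVE")
frac_dev <- config::get("sample_frac", file = cfg_file, use_parent = FALSE)

# Setting the variable selects the matching block instead.
Sys.setenv(R_CONFIG_ACTIVE = "test")
frac_test <- config::get("sample_frac", file = cfg_file, use_parent = FALSE)

Sys.unsetenv("R_CONFIG_ACTIVE")  # tidy up so later calls see the default
```

In practice the variable would usually be set by the environment itself (the server, the CI job) rather than from inside the R session.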

One other nice thing you can actually do is inside your config file, you can actually run R code as well, right? So these are all just like values that I'm typing here. But if there's something you need to do that's a little more complex, you can actually run R code inside your config. I don't recommend doing anything complicated, but sometimes it's useful to do other kinds of things. Like, for example, sometimes you want to have a password that gets passed through but not appear in your config file, right? You could get the password from an environment variable and pass it right through inside your code.
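As a sketch of the password case: the config package evaluates values tagged with `!expr` as R code when the file is loaded. The key names and the `DB_PASSWORD` variable below are hypothetical, and the variable is set inside the script only so that the example runs on its own.

```r
# Hypothetical config file: the password is read from an environment
# variable when the config is loaded, so it never appears in the file.
cfg_file <- file.path(tempdir(), "config.yml")
writeLines(c(
  "default:",
  "  db_user: analyst",
  "  db_password: !expr Sys.getenv(\"DB_PASSWORD\")"
), cfg_file)

# In real use the variable is set outside the code (for example as a
# CI secret); it is set here only so the sketch is self-contained.
Sys.setenv(DB_PASSWORD = "not-a-real-password")

cfg <- config::get(file = cfg_file, use_parent = FALSE)
nchar(cfg$db_password) > 0  # the value was filled in at load time
```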

OK, so that's the config piece of this, right? That's the dividing up the environments. Now the GitHub part. So by convention, the .github/workflows folder is where my GitHub Actions go. And again, it's just YAML, YAML all the way down. And so what you'll see is if you look in this YAML file, it's all the things, right, if you remember. Here, this is all the big headers, right? Set up Quarto, install R, install R package dependencies. These are those steps. They're just defined here, right? That's all this does. It defines those steps. And what it does is it can use other actions that people have predefined for me, right? I can use the actions/checkout action, which checks out a repository. And both GitHub and other people can define actions, right? So the Quarto developers have defined a set up Quarto action. The r-lib group has defined a set up R action. And it actually allows me to provide a variable, right? What version do I want?

I can also just run arbitrary code, right? So I think Mark, right, you were talking about running an R script, right? I can run an R script here right through GitHub Actions by just telling it to run that R script. So that's really useful. So that's what this action does. Now, how does it know, right? Again, this is the test one. It knows it's the test one because I just have it set an environment variable here, right? You can define environment variables inside the definitions for your GitHub Actions workflows. It works really nicely like that. And the last piece is when does this run? So this runs on a pull request when it opens or reopens, which is, right, that's what you would have seen if the Wi-Fi were being a little nicer. You would have seen that when I opened a pull request, the GitHub Action kicked off right away. And that's my testing workflow. This is also a really common workflow if you're doing things like linting your code, right? You can put that in here. Doing things like running test suites, really great to put in a pull request triggered action.
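A workflow along the lines described might look like the sketch below. It is an illustration, not the file from the talk: the file name, job name, action versions, and the choice of renv for installing packages are assumptions, and publishing to Quarto Pub from CI also assumes an auth token stored as a repository secret and a publish record (`_publish.yml`) committed to the repository.

```yaml
# .github/workflows/publish-test.yml (hypothetical file name)
name: Render and publish (test)

# Trigger: run when a pull request is opened or reopened.
on:
  pull_request:
    types: [opened, reopened]

# Tells config::get() which block of config.yml to use.
env:
  R_CONFIG_ACTIVE: test

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      # Start from a bare runner: fetch the code, then install the tools.
      - uses: actions/checkout@v4
      - uses: quarto-dev/quarto-actions/setup@v2
      - uses: r-lib/actions/setup-r@v2
        with:
          r-version: "release"
      # Assumes the project has an renv.lock file.
      - uses: r-lib/actions/setup-renv@v2
      # Render the document and publish it to Quarto Pub.
      - uses: quarto-dev/quarto-actions/publish@v2
        with:
          target: quarto-pub
        env:
          QUARTO_PUB_AUTH_TOKEN: ${{ secrets.QUARTO_PUB_AUTH_TOKEN }}
```

Linting or test-suite steps could be added to the same job as extra `run:` steps before the publish step.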

And then there's another action, right? This one is identical. I will tell you, you can look through if you want to. But the only difference here is that it's using the prod config. And it runs actually on a push to the main branch, right? And a completed merge is a push to the main branch, right? So that's actually what it's really going to be triggered on. You can do more complex things like nesting them with variables. But the syntax gets kind of complicated. And I didn't want to have to explain the syntax nor actually figure it out for myself. So you can do that. But I just duplicated the same action twice.
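Under the same assumptions as any sketch of the test workflow (hypothetical file name, everything else duplicated), the parts of the prod workflow that differ would be the trigger and the environment variable:

```yaml
# .github/workflows/publish-prod.yml (hypothetical file name)
# The job and its steps are a copy of the test workflow's;
# only the trigger and the active config differ.
on:
  push:
    branches: [main]   # a completed merge is a push to main

env:
  R_CONFIG_ACTIVE: prod
```

Duplicating the file keeps each workflow readable on its own, at the cost of having to update both copies when a step changes.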

Recap and resources

So just to recap a little bit of where we've been. So the bottom line. Use config and environment variables to differentiate your different environments, right, one from the other. Use Git to manage when things move, right? The trigger is a Git branching strategy. And so like I said back towards the beginning, right, having good Git practices is kind of a prerequisite for this, or at least something you should do along with this. Sometimes I see people where this is actually a really good motivation to set up good Git practices, right? Like if your Git practices only matter because you're like, we're going to do Git, it's kind of hard to care. But if it's like, do good Git or else you're going to break the CI/CD, that's a much stronger motivator. Sometimes this can actually really go along nicely if you're trying to get people to use Git in a more powerful and intentional way. And then use CI/CD, GitHub Actions, to do the promotion mechanics.

Since I do work for Posit, I would be remiss if I didn't just say that Posit, I'm so used to saying RStudio. It's very hard to change my language. It's been really difficult. Posit Connect can do parts of this for you, right? Connect can do the thing where it watches the Git repo and picks up the changes for you. So you don't have to have a GitHub Action. That's personally my favorite feature of Connect, is that it can do that part for you.

The other thing I will share is if you thought this was cool and you were interested in this, I am writing a book. It's called DevOps for Data Science. It is drafted. The writing is still rough, but the entire book exists. It's online at do4ds.com. And if you're extra double interested, Jon Harmon, who maybe is watching online and has been mentioned a number of times, is actually running a book club where some folks are reading it. So if you're interested, you can reach out to Jon, and I'm sure they would be happy to have another couple people in their book club. Thank you very much, everybody.