Alex Gold - Avoid App Failures Through Code Promotion

Transcript

This transcript was generated automatically and may contain errors.

Great. Thanks so much, Jared, for having me. Thanks so much, everybody, for being here. My name is Alex. I'm going to be talking a little about code promotion in R.

And just a little about me, I lead the solutions engineering team at a company formerly known as RStudio, now known as Posit. And so my team, primarily what we occupy ourselves with is, like, taking stuff to production in both R and Python.

So in your brains, we're going to rewind to, like, before David's talk, which was great, back to Mark's talk, because, like, these two talks are really, like, peas in a pod. So rewind your brain to, like, oh, Mark's talk was so good. He taught me how to, like, do code in production. It was so good. So this talk is not going to be so much about R code. It's going to be sort of about the stuff you do around your R code to get it into production.

So I know I have lived this life. You probably have lived this life. Your very intimidating boss comes to you, and they're like, why is the dashboard down? This is not good. The meeting was today where we were supposed to look at the dashboard. The dashboard is down. It's because you made some update. You pushed some change. You did something, and you didn't quite check it through beforehand, right?

So what we're going to talk about today is code promotion, which is generally sort of the sequence of things you want to do to put something into production in a way where it's protected, right? And so there are sort of two parts to that. And that is being able to test, validate, and sort of make sure the thing is ready before it goes up. So that's part one. And then part two is making sure it only goes up once you've done those things, right? You have a validation plan, and it only goes there once it's validated.

Going to production is just like YAML. It's just YAML all the way down. Just YAML on YAML on YAML.

So the second thing you need, then, is a mechanism. You have your three environments. And you need a mechanism to go from one environment to the other to the other. And so the broad class of mechanisms, obviously, is something called CI/CD, continuous integration/continuous deployment. You've almost certainly heard of it. Some of you may have heard of it. And it sounds extremely intimidating. It is not so bad, I promise.

And in particular, today, I'm going to be talking a little bit about GitHub Actions, which is one CI/CD tool. I like it because it's really easy to use. If you use GitHub, it's already right there. But there are lots of other options, like GitLab, Azure DevOps, Jenkins. All of those are completely valid and work quite similarly. So I believe all of them are also YAML. It's YAML all the way down.

So how does CI/CD work? You have your CI/CD stuff. And it's watching your Git repo. It's watching for things to happen. Those are triggers. When those things get triggered, then it goes off and does something. And we'll get into what exactly that means. But usually, that something is either automated testing or a promotion from one place to another. Those are usually the two things that are going to happen when your CI/CD pipeline is triggered.

Live demo

OK. Now we're just going to show off some of this. So I'm going to go to, and the Wi-Fi has been a little spotty. So YOLO, here we go. This is a document that I've got. I'm going to render it. It is a Quarto doc, and I believe those have been coming up a lot today, because they're great.

So this is a little report based on the Palmer Penguins data set. If any of you know it, it's just a little toy data set in R. And so what you'll notice is that right up here, this is the dev configuration that I'm using. That's just something I manually put in there. It's not auto-generated. And it uses 166 data points. If any of you are familiar with Palmer Penguins, this is half the data set. So I've downsampled, obviously, on a data set with 300 and whatever, 22 data points. You don't need to downsample, but you get the idea.

But this is just on my machine. This is on localhost. This is on my laptop. I cannot show this to you unless I walk my laptop over to you. So if I want to share this, there are many options. Today, we're going to use something called Quarto Pub. If any of you haven't played with it, it is a place where you can host public Quarto documents for free. So that's a pretty cool resource if you didn't know about it. It's a little bit like shinyapps.io, but with shinyapps.io, there's a paid tier. There's no paid tier for Quarto Pub, at least not yet.

And so this is the test version of my app. You can tell because it says it is the test version and because it has this great slug that includes test in the name. And so that's cool, right? Here, I'm using the full data set, 333 data points. And then I've got the prod version, which is exactly the same, other than the fact that it was rendered like 30 seconds later and that it's at the prod URL, right? And this is actually really important that test and prod should look really similar, right? You want as few differences as possible between test and prod. Because very often, the reason the dashboard goes down is differences between test and prod. So when you try to do that promotion, everything broke, right? So you want to really minimize those differences as much as possible.

OK, so let's make a change. Who has a favorite ggplot theme, other than minimal, since we're using minimal? Black and white. I figured we were going there. All right, so let's check out a new branch. We're going to call it theme for our new theme. We're going to commit this new theme change to new theme.

OK, so now this is getting pushed. And this is where we are living on the edge, because the Wi-Fi is spotty. OK, so I'll show off what would happen if we got to Git. So here is my Git repository. Well, what would happen is I go ahead and make a pull request. And when I made the pull request, this action would kick off. It would start doing things, which is pretty cool. I'll show you one that rendered already. So this is a job that ran a little while ago. It takes about two minutes to run the job, which is, I think, pretty good. It's longer than I want to show you during a 20-minute talk, but really not very long in the scheme of I'm trying to promote this into test. And you can see it does a bunch of different things. It sets up the job. It checks out the repository. It installs Quarto. You can think of it as starting on a totally bare server. I've got to set up Quarto. I've got to set up R. I've got to install my R packages. I have to set up something so that it knows where to publish to. And then it's going to render and publish.

So this published the test report that we looked at just a minute ago.

So now we've understood what we've done so far. We had our local change we made. We changed the theme. We were going to push that up to the test branch. It was going to do our testing. You would have seen, and you would have loved, that after the test ran, we could have merged it. And when we merged it, it was going to run another set of GitHub Actions to push it to the prod branch. So that's a pretty straightforward set of things. But how does it know how to do all that? How is that what happens? That's what we're going to get into for the last eight minutes.

How config and GitHub Actions work together

OK, so first, let's look at what's going on here with the config package. Remember, I said config is just a bunch of YAML. But we want to use it inside our R code. I used it here to downsample. So you can see I've got this config::get. And when I do that, it loads up this config object. And you can see that that is just a bunch of variables. It's just a bunch of R entities. It's a list, a named list. And so that's pretty useful. So now I can just use that in my code. So for example, here, where I'm slicing out part of the data, I'm taking half of it in some cases, the full data set in another. And then I have the config name and that sort of thing. So again, how does it know to do this? Well, I've got my handy YAML file. It's YAML all the way down.
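As a rough illustration of that downsampling step, here is a minimal Python sketch (the talk itself does this in R with the config package). The names `downsample` and `sample_fraction` and the stand-in rows are assumptions made for the example, not the speaker's code.

```python
import random


def downsample(rows, sample_fraction, seed=0):
    """Return a random subset holding `sample_fraction` of `rows`."""
    # Round down, so half of 333 rows gives 166 rows.
    keep = int(len(rows) * sample_fraction)
    # A seeded generator keeps the subset stable between renders.
    return random.Random(seed).sample(rows, keep)


# Stand-in for the 333-row data set mentioned in the talk.
rows = list(range(333))

dev_rows = downsample(rows, 0.5)   # dev: half the data
test_rows = downsample(rows, 1.0)  # test and prod: everything
print(len(dev_rows), len(test_rows))
```

The report code only ever sees a fraction, so the same code runs unchanged in every environment and only the configured value differs.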

So in the YAML file, there are three definitions of environments. There's default, test, and prod, corresponding to dev, test, and prod. In config, they use the term default, because that's sort of a fallback. It's what gets used when nothing is specified elsewhere. And you can see that in the default, in the dev place, I sample down to a half of the data set. I've also got it configured so that that goes to the test version of the report. If I were to go to the test configuration, there I go up to a full one, all of the sample. And then in the prod version, I add going to the production version of the report.
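The fallback idea can be sketched in a few lines of Python. This is only an illustration of how named blocks can layer on top of a default block. The keys, values, and the `resolve` helper are hypothetical, and the real file in the talk is YAML read by the R config package.

```python
# Hypothetical settings mirroring the three blocks described above.
CONFIGS = {
    "default": {"name": "dev", "sample_fraction": 0.5, "report": "test"},
    "test": {"name": "test", "sample_fraction": 1.0},
    "prod": {"name": "prod", "sample_fraction": 1.0, "report": "prod"},
}


def resolve(configs, active):
    """Overlay the named block on top of the default block."""
    merged = dict(configs["default"])
    # In this sketch, unknown names leave the defaults unchanged.
    merged.update(configs.get(active, {}))
    return merged


for env in ("default", "test", "prod"):
    print(env, resolve(CONFIGS, env))
```

Here the test block never mentions the report target, so it inherits the default one, while prod overrides it.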

So how does config know which configuration to pick up? Well, when I do config::get, what this does is it looks at an environment variable. It's in my environment. And if you haven't spent a lot of time playing around with environment variables, they're a really crucial part of how configuring multiple environments works. That is how things that are running know where they are, by the environment variables that are set around them. So what I'm going to do here is set it manually, just to show you. I'll use Sys.setenv, and the variable it uses by default is R_CONFIG_ACTIVE. And I'm going to set it to be test. So now when I set R_CONFIG_ACTIVE to be test, and I'm going to run config::get again, you can see now I'm picking up the test config. You can see my sample fraction has updated to 1. And so it's just, if we think back about our YAML, it's now selecting this middle block based on that environment variable that I set to test. That's how config works. Like, that's the whole thing. It loads variables from a YAML file, depending on the value of an environment variable. It's like super simple, but really powerful if you have multiple environments and want different things to be true in each environment.
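The selection step on its own looks roughly like this in Python. It is a sketch of the idea, not the config package: only the variable name R_CONFIG_ACTIVE comes from the talk, while the settings and the `active_config` helper are made up for the example.

```python
import os

# Hypothetical settings; only the selection logic matters here.
CONFIGS = {
    "default": {"sample_fraction": 0.5},
    "test": {"sample_fraction": 1.0},
}


def active_config(configs, environ=os.environ):
    """Pick a block by environment variable, else use 'default'."""
    name = environ.get("R_CONFIG_ACTIVE", "default")
    return configs.get(name, configs["default"])


# Passing a plain dict stands in for the process environment.
before = active_config(CONFIGS, {})
after = active_config(CONFIGS, {"R_CONFIG_ACTIVE": "test"})
print(before["sample_fraction"], after["sample_fraction"])
```

Nothing in the code changes between the two calls. Only the environment around it does, which is what lets one code base run in dev, test, and prod.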

One other nice thing you can actually do is inside your config file, you can actually run R code as well, right? So these are all just like values that I'm typing here. But if there's something you need to do that's a little more complex, you can actually run R code inside your config. I don't recommend doing anything complicated, but sometimes it's useful to do other kinds of things. Like, for example, sometimes you want to have a password that gets passed through but does not appear in your config file, right? You could get the password from an environment variable and pass it right through inside your code.
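In the config package this is done with an `!expr` tag in the YAML file. The same pattern, sketched in Python, looks like the following. The variable name DB_PASSWORD and the `load_settings` helper are hypothetical.

```python
import os


def load_settings(environ=os.environ):
    """Build settings, reading the secret from the environment."""
    return {
        "report": "test",
        # DB_PASSWORD is a hypothetical variable name. Its value is set
        # outside the code, so it never lands in version control.
        "db_password": environ.get("DB_PASSWORD"),
    }


# A plain dict stands in for the real environment in this example.
settings = load_settings({"DB_PASSWORD": "s3cret"})
print(settings["db_password"] is not None)
```

The committed file records only where the secret comes from, and each environment supplies its own value.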

OK, so that's the config piece of this, right? That's the dividing up the environments. Now the GitHub part. So by convention, the .github/workflows folder is where my GitHub Actions go. And again, it's just YAML, YAML all the way down. And so what you'll see is if you look in this YAML file, it's all the things, right, if you remember. Here, this is all the big headers, right? Set up Quarto, install R, install R package dependencies. These are those steps. They're just defined here, right? That's all this does. It defines those steps. And what it does is it can use other actions that people have predefined for me, right? I can use the actions/checkout action to check out a repository. And both GitHub and standalone people can define actions. So the Quarto developers have defined a set up Quarto action. The r-lib group has defined a set up R action. And it actually allows me to provide a variable, right? What version do I want?

I can also just run arbitrary code, right? So I think Mark, right, you were talking about running Rscript, right? I can run Rscript here right through GitHub Actions by just telling it to run that R script. So that's really useful. So that's what this action does. Now, how does it know, right? Again, this is the test one. It knows it's the test one because I just have it set an environment variable here, right? You can define environment variables inside the definitions for your GitHub Actions workflows. It works really nicely like that. And the last piece is when does this run? So this runs on a pull request when it opens or reopens, which is, right, that's what you would have seen if the Wi-Fi were being a little nicer. You would have seen that when I opened a pull request, the GitHub Action kicked off right away. And that's my testing workflow. This is also a really common workflow if you're doing things like linting your code, right? You can put that in here. Doing things like running test suites, really great to put in a pull request triggered action.

And then there's another action, right? This one is identical. I will tell you, you can look through if you want to. But the only difference here is that it's using the prod config. And it runs actually on a push to the main branch, right? And a completed merge is a push to the main branch, right? So that's actually what it's really going to be triggered on. You can do more complex things like nesting them with variables. But the syntax gets kind of complicated. And I didn't want to have to explain the syntax nor actually figure it out for myself. So you can do that. But I just duplicated the same action twice.
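The two triggers described here amount to a small routing rule. The Python sketch below only models that rule. GitHub Actions expresses it in YAML, and the function name and argument names are assumptions for illustration.

```python
def target_environment(event, action=None, branch=None):
    """Map a repository event to the environment it should deploy to."""
    # Opening or reopening a pull request publishes the test report.
    if event == "pull_request" and action in ("opened", "reopened"):
        return "test"
    # A completed merge arrives as a push to main and publishes prod.
    if event == "push" and branch == "main":
        return "prod"
    # Anything else triggers no deployment.
    return None


print(target_environment("pull_request", action="opened"))
print(target_environment("push", branch="main"))
```

Because prod is reachable only through a push to main, nothing gets promoted without first going through the pull request that ran the test workflow.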

Recap and resources

So just to recap a little bit of where we've been. So the bottom line. Use config and environment variables to differentiate your different environments, right, one from the other. Use Git to manage when things move, right? The trigger is a Git branching strategy. And so like I said back towards the beginning, right, having good Git practices is kind of a prerequisite for this, or at least something you should do along with this. Sometimes I see people where this is actually a really good motivation to set up good Git practices, right? Like if your Git practices only matter because you're like, we're going to do Git, it's kind of hard to care. But if it's like, do good Git or else you're going to break the CI/CD, that's a much stronger motivator. Sometimes this can actually really go along nicely if you're trying to get people to use Git in a more powerful and intentional way. And then use CI/CD, like GitHub Actions, to do the promotion mechanics.

Since I do work for Posit, I would be remiss if I didn't just say that Posit, I'm so used to saying RStudio. It's very hard to change my language. It's been really difficult. Posit Connect can do parts of this for you, right? Connect can do the thing where it watches the Git repo and picks up the changes for you. So you don't have to have a GitHub Action. That's personally my favorite feature of Connect, is that it can do that part for you.

The other thing I will share is if you thought this was cool and you were interested in this, I am writing a book. It's called DevOps for Data Science. It is drafted. The writing is still rough, but the entire book exists. It's online at do4ds.com. And if you're extra double interested, Jon Harmon, who maybe is watching online and has been mentioned a number of times, is actually running a book club where some folks are reading it. So if you're interested, you can reach out to Jon, and I'm sure they would be happy to have another couple people in their book club. Thank you very much, everybody.
