Transcript
This transcript was generated automatically and may contain errors.
Welcome back to the Data Science Lab. I'm Libby. I run community here at Posit, and I'm joined by Isabella Velásquez. Hi, everyone. Thanks for joining.
I am so excited to be joined today by our lab manager, our guest lab manager, Davis Vaughan. Sure. Hi, everyone. It's nice to be here. I'm a software engineer at Posit. I work on the Positron teams and the tidyverse teams. I work on dplyr, tidyr. I also am the co-creator of air, the R formatter.
Nice. All right, Davis, what are you going to be showing us today on the lab?
So, yeah, I figured today we could talk about dplyr 1.2 a little bit. It'd be kind of fun to just kind of go over some of the new functions, like filter out and recode values and replace values, and then maybe also a little bit about like how functions in the tidyverse are lifecycled. Like, what does it mean to be superseded? What does it mean to be deprecated? And what is the difference?
How do decisions get made? What kind of conversations are had behind the scenes? That's what I really want to know. I'm going to ask some nosy questions about that. Another thing that we will briefly touch on is also like how – well, not briefly touch on. This is going to be the theme. How community-driven development happens and why community is such a huge part of the development of open source packages. And also, Davis will talk a little bit about wife-driven development.
Introducing recode values
All right, everybody, let's get ready to watch Davis share his screen and walk through a couple of things for us, the first one being we have a new function called recode values. This one's really exciting for me because I have had a janky workaround to recode values for a long time, and I'm not the only one who had that janky workaround. Crystal Lewis also had that janky workaround, and we've all been sort of circumnavigating a slight hole that was in the dplyr suite of functions.
So, it got a lot of traction, and, like, it got us thinking of, like, why is this so stupidly difficult? So, like, I have a named list here, right? This is like a Likert encoding, where, like, one means strongly disagree, five means strongly agree, and then there's everything in between. And you might have the data in the form of just the numbers, and you want to recode that using this mapping into the actual, like, word descriptions.
And the kind of hacky solution was using a superseded dplyr function called recode, plus this, like, triple bang type thing to get that to work. From rlang. It's an rlang operator. But it's a very cool rlang operator, okay? Like, it's still so valid. It's really, really cool.
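For reference, the workaround described here can be sketched roughly as follows. The `likert_key` vector and the `responses` tibble are made-up illustration data, and a named character vector stands in for the named list shown on screen.

```r
library(dplyr)

# Hypothetical Likert mapping: names are the numeric codes, values are the labels.
likert_key <- c(
  `1` = "Strongly disagree",
  `2` = "Disagree",
  `3` = "Neutral",
  `4` = "Agree",
  `5` = "Strongly agree"
)

# Hypothetical survey responses stored as numbers.
responses <- tibble(score = c(1, 5, 3, 2, 4))

# The "triple bang" (!!!) splices each element of likert_key into recode()
# as a separate named argument, as if it had been typed out by hand.
responses <- responses |>
  mutate(label = recode(score, !!!likert_key))
```

This works, but the mapping has to be smuggled in through the splice operator, which is the awkwardness the discussion is pointing at.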
So, the hole was filled by recode values. So, just, like, a little bit of how this works, right? So, you've got your Likert scores here. It's just a column in your tibble. And, like, there's some typical other way you might do this if you didn't have, like, a lookup table on the side, where you might go through case when, and you might say, you know, if score is one, strongly disagree. If score is two, disagree, and so on.
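The case when approach mentioned here might look something like the sketch below. The `responses` tibble is invented for illustration. Notice how `score ==` is repeated in every branch, which is the common bit the new function lets you pull to the front.

```r
library(dplyr)

# Hypothetical survey responses stored as numbers.
responses <- tibble(score = c(1, 5, 3, 2, 4))

# Each branch repeats the comparison against score.
responses <- responses |>
  mutate(
    label = case_when(
      score == 1 ~ "Strongly disagree",
      score == 2 ~ "Disagree",
      score == 3 ~ "Neutral",
      score == 4 ~ "Agree",
      score == 5 ~ "Strongly agree"
    )
  )
```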
And, like, one of the nice things about this new function called recode values is that it takes this, like, common bit of score and says, hey, you can actually, like, pipe that thing in. We recognize that all of the individual cases are going to be comparing to score. So, you can pull that to the front. So, now it's score, and I want to recode all of the individual values from score, where one becomes strongly disagree, two becomes disagree, and so on.
But remember, we had that, like, lookup table thing that we cared about. So, what we can do is actually, like, if you squint at recode values here, it kind of looks like a lookup table. Like, it's on the left-hand side, it's mapping to the right-hand side, and that looks pretty nice. So, what we can do is make that more explicit. Like, we can pull that out into a table called lookup. And instead of this, like, very familiar case-when-style interface, we can use an alternate interface where you supply from and to columns directly, where I can say, from this lookup table, like, where it's one through five, map that to these words, strongly disagree, disagree, and so on.
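To make the lookup-table idea concrete, here is a small sketch that runs on older dplyr versions by using base R's `match()` as a stand-in. The `lookup` and `responses` tibbles are invented for illustration, and the commented-out call shows how the from/to interface described in the talk would presumably be spelled. That spelling is an assumption and is not run here.

```r
library(dplyr)

# Hypothetical lookup table kept separate from the pipeline.
lookup <- tibble(
  from = 1:5,
  to = c("Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree")
)

# Hypothetical survey responses stored as numbers.
responses <- tibble(score = c(1, 5, 3, 2, 4))

# Stand-in technique: match() finds the lookup row for each score,
# and that row index picks out the corresponding label.
responses <- responses |>
  mutate(label = lookup$to[match(score, lookup$from)])

# Assumed dplyr 1.2 spelling of the same step (not run here):
# mutate(label = recode_values(score, from = lookup$from, to = lookup$to))
```

Either way, the mapping lives in its own table, so the pipeline only says that a recode happens, not what every individual value maps to.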
And that is kind of the answer to that, like, how do we do this recode bang, bang, bang thing? Like, why is this so hard? Like, we can now pull this really nice lookup table out separately and supply it in our pipeline, which is especially really nice if you have a really long pipeline, and things get, like, it just lets you focus on actually what you're doing rather than the mapping itself.
And the last thing I'll say about it before we stop is this little bit here. Like, it's very common, in my experience, for this lookup table to not be inline. It's actually from a CSV file somewhere else. So, that makes the recode bang, bang, bang, like, your only option, because you couldn't actually inline it with case when before. So, now, if you did have it in a CSV, it's just as easy as reading it in and supplying it here.
Community-driven development and the origin of new functions
Yes, and I want to reemphasize for everybody that everybody has a different type of need, right? Like, whatever your experience is of needing a function, most people need it in a different way. Everything is so, so unique, right? So, there are a lot of responses that are, like, why would I ever need this? I've never needed it before. You just specifically have not happened to need it, but then other people have really, really needed it, right?
So, the real thing that I want to talk about behind this with Davis is actually the different little bits and pieces that came together to tell him it was time to add a function to dplyr, because it wasn't just that Bluesky post, and it wasn't just me talking about it or Crystal Lewis making a blog post about it. Davis, what all was going on behind the scenes that came together for you to say, like, well, it's definitely time to take action?
Yeah, we've had, like, we've had issues on the dplyr tracker for years of, like, you know, you superseded recode, but, like, you didn't give us anything better. Like, it just feels like there's some missing holes where, like, you superseded this function, but there wasn't a complete replacement. And, like, at the time, we just kind of, like, looked at that and we were like, did we? I don't really know. But, like, you know, then you see all this community impact of, like, yeah, you really did. Like, here's the exact example of, like, the hole that you're missing some kind of answer for.
And that really, like, nails down the point of, like, yeah, we think that there really is a gap here. We think we could do something better. Like, we agree that recode probably wasn't the best API to begin with. We don't feel bad about that decision. But, like, we didn't fill the hole with something better.
Thumbs-upping GitHub issues
So there was already stuff going on behind the scenes. And this brings me to GitHub repos. Often when there's a feature that you want added to something, you will hear us in the sort of general open source community say go thumbs up that issue. It's the best way for us to know that you care about it. It's even more important than adding a comment. Right. If you add a comment on an issue that says this is really important to me, it's actually less impactful than putting a thumbs up.
So I work on Positron. Positron is the new data science IDE from Posit. And, like, we will go over in the issues list over here in Positron. And, like, something that we do, like, fairly often is go over here and sort by the number of thumbs up. And you probably won't be surprised if you're using Positron often that Quarto inline output is the number one thumbs up issue with over 106 thumbs up. So if you wanted to add your support to request this feature, you could hit the little smiley face here and thumbs up. And we use this a lot. Like, we go in here and we sort by, like, you know, how many thumbs up. And this is kind of some kind of impact or some kind of guide of, like, what does the community care about.
So, everybody, the main issue, post, reaction, add a thumbs up. That lets people know that you're really excited about it. And I really encourage you to go check out Positron's GitHub. Go look at all the open issues.
Tidy ups
Sure. So, one of the big ways that we get feedback from the community, in addition to Bluesky and LinkedIn and kind of talking with you all there, is the somewhat new-ish process that we use called tidy ups. If you're familiar with PEPs from the Python ecosystem, which is like a Python enhancement proposal, tidy ups are kind of our community-facing similar-ish idea. So, they live here under tidyverse/tidyups on GitHub. We have eight of them. And each of these is some kind of proposed large change to the tidyverse. Either it's some kind of breaking change or some kind of totally new feature that's pretty big and we'd like to get it right the first time around.
Since we got this wrong so many times before, we decided to do a tidy up to get a lot of community feedback to make sure that we got it right this time around. So, you can go through and read this. It's just like a markdown document. You can get a general vibe of our motivation for this function. All these examples of what you might have done instead. Here's how it's better than X, Y, and Z. And how are we going to preserve backwards compatibility? How do you teach this if you're a teacher? We try to think of as many things as possible in this tidy up of what could be interesting for people. And then we release it for the community to kind of give feedback on this markdown document. And a lot of people did, which is great.
I really want to encourage everybody, if you've never, ever participated in open source development before, to go through and read things. Read the comments of others. If you are ever nervous about commenting, hop on the Discord and let us know. Like, hey, I really want to add this comment, but I don't know if this is the right thing to do. Ask somebody, and they can help you. I'll chime in and help as well.
She said every time someone comments on an issue, every dev gets a notification. And it can be really overwhelming if you've got thousands of issues open for every single dev to be getting a notification for every single comment that comes through. So it's useful if the comment really adds something, a new example, a new angle. But it's not as helpful if the comment is just, hey, I'm having this problem too. If that is the point that you're trying to get across, those thumbs-up are a better and more efficient way to do that.
Introducing filter out
Yeah. In fact, let's see. We have issues from 2019 or something like that that was asking for filter out. And we were just like, no, we don't need this right now. We don't think we need this. We don't think we need this. We were the same as these people. We were the people on LinkedIn going, you don't need that. That's no good.
But eventually, there's an example that I'll show you here that kind of convinced me that we need filter out. So filter lets you specify which rows you want to keep. Filter out lets you specify which rows you want to drop.
Let's say we have some patient's data here. This is just tracking whether the person is dead or alive. So like we might have some question where you want to know, you want to drop the rows where two things are happening at once. The patient is deceased. And the year of that kind of information was before 2012. So like two things. Patient is deceased. That's the deceased column. And the year was before 2012. That goes with the date column.
So you could like directly translate this into a filter statement where it's like they're deceased. The date is before 2012. And rather than keeping those rows, I'm looking to drop those rows. So you could like wrap this whole thing in parentheses and just put a not in front of it. That's kind of your most direct translation. And that kind of works, except it's not exactly what you want.
But if you think about the question, what I want is to drop rows only when I know that the patient is deceased and only when I know that the year was before 2012. So if you look at the results here, what you'll notice is that my wife, Sarah, has been dropped out of this data set. We don't actually know that she's deceased or what the date was, but it has dropped her out of this data set kind of by accident. We dropped too many rows.
And the reason for that is kind of complicated, but it has to do with how filter works with missing values. What you end up having to do instead is start adding in these really, really ugly conditions, like deceased and it's not NA, it's before 2012 and it's not NA. And you end up kind of squinting at this for a while, and maybe you end up with the right answer. You're like, okay, now I finally got what I was looking for.
And typically that's what you're actually after when your question, your kind of problem statement up top, says I want to filter out specific rows. You typically don't want to drop the missing values. So what we've ended up with in filter out is a way for you to directly translate your question right into code, as it was before, and keep that intent of I don't want to drop the missing values either. This lets you say I only want to drop where I know they're deceased and I only want to drop where I know date is before 2012. Since you don't know that about, say, Max here, that row sticks around.
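The missing-value problem can be reproduced with a small sketch. The `patients` tibble below is invented for illustration and is not the data from the demo. The last comment shows how the new function would presumably be called; that spelling is an assumption and is not run here.

```r
library(dplyr)

# Hypothetical patient data; NA means "we don't know".
patients <- tibble(
  name = c("Anne", "Mark", "Sarah", "Davis", "Max"),
  deceased = c(FALSE, TRUE, NA, TRUE, NA),
  date = c(2005, 2010, NA, 2020, 2010)
)

# Direct translation: negate the "drop" condition.
# filter() only keeps rows where the condition is TRUE, and !NA is still NA,
# so rows with unknown values are dropped along with the intended ones.
naive <- patients |>
  filter(!(deceased & date < 2012))

# Workaround: spell out that each part must be known to be TRUE
# before a row qualifies for dropping.
careful <- patients |>
  filter(!(deceased & !is.na(deceased) & date < 2012 & !is.na(date)))

# Assumed dplyr 1.2 spelling of the same intent (not run here):
# filter_out(patients, deceased, date < 2012)
```

In this sketch the naive version keeps only Anne and Davis, while the careful version drops Mark alone, which matches the stated intent of dropping a row only when both facts are known.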
So in this case, these are equivalent. Anytime you have a comma in either filter out or filter, the comma is kind of a stand in for and. But whenever you have like multiple conditions, like they kind of look nicer, like especially as you add more and more conditions. If they're separated by a comma, because that means they kind of end up on their own line and they're kind of separated. And like you could think about them individually rather than having a whole bunch of and signs here.
If you wanted to do an or instead, then you could either write deceased or date is less than 2012 with an or sign, or you could use one of the other new things that we've added in dplyr 1.2, which is called when any. This lets you keep the commas and write it like this: I want to drop rows when any of these are true, they're deceased or the date is less than 2012. So when things are inside of when any, they're combined with or instead of and.
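The comma-versus-or distinction can be checked with plain filter, which has long treated commas as and. This sketch uses keep-style filtering rather than the new functions so that it runs on older dplyr versions, and the `patients` tibble is invented for illustration.

```r
library(dplyr)

# Hypothetical patient data; NA means "we don't know".
patients <- tibble(
  name = c("Anne", "Mark", "Sarah", "Davis", "Max"),
  deceased = c(FALSE, TRUE, NA, TRUE, NA),
  date = c(2005, 2010, NA, 2020, 2010)
)

# Conditions separated by commas are combined with "and"...
with_commas <- filter(patients, deceased, date < 2012)

# ...so this single condition using & selects the same rows.
with_ampersand <- filter(patients, deceased & date < 2012)

# An "or" has to be written explicitly with |.
with_or <- filter(patients, deceased | date < 2012)
```

The first two results are identical, and the or version keeps any row where at least one condition is known to be true.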
Wife-driven development
So my wife is kind of like a, she works for a nonprofit and she does like a lot of data analysis work. And she, we both work from home. So she's in the office, like right above mine. And I often will go up there and like look over her shoulder because she uses R and she uses like all the tidyverse stuff and all dplyr and all the packages that we've created. So I would look over her shoulder and she would be like, why does it do this? And I would just like shake my head and be like, gosh, darn it. Like you pointed out like another hole of like something that we need. I'll be back in like half an hour and I go and figure out like a function that she needs and add it.
So like with filter out in particular, like we aggregated like all of these examples of like where people really needed filter out. But like the most compelling to me was this one right here. My wife has brought this up to me at least three times. So that is called wife-driven development, ladies and gentlemen.
Function lifecycle: deprecated vs. superseded
Yeah, I think it's helpful to just have this little chart up. There's a package that we use called lifecycle, and this lets us manage the lifecycle stage of each function in any given package, including how functions get deprecated. Most of the time the tidyverse lives in this kind of green stable bucket here. It won't necessarily say stable or anything, but if it doesn't say anything special, you can just kind of assume it's stable.
Now, whenever we decide that we made a bad decision about something, or we want to change something and replace it with something else, things will get moved from stable to either deprecated or superseded. Now, the difference between these two is that superseded means it's frozen in time forever. You will be able to use this thing indefinitely, but we will not make any updates to it. We will not add new features to it. We will only fix the most critical bugs, but it will continue to work forever. Deprecated is the idea that we would eventually like to remove this. It might take us five years, but we would eventually like to remove this thing.
In tidyr, we superseded gather and spread because those existed from the beginning of tidyr. It's been so many years. They're embedded in people's documentation, in their slides if they're teaching, and in how people learned tidyr from the beginning, sometimes with gather and spread. So there was no way that we could ever get rid of those. But with case match, that was introduced in dplyr 1.1, which is the version right before 1.2. It's only been a few years. We don't think that many people actually use it when we look at packages, it was kind of a really bad name, and we have a really direct replacement with recode values. So all of those things kind of piled on and said, okay, it looks like we could probably remove this thing. So we've decided to deprecate case match instead.
Yes, and if you are ever on the dplyr documentation on the tidyverse.org website and you see a tag that says lifecycle something, it looks like a little pill with half one color, half the other. You can click on experimental or deprecated or whatever, and it's going to link you directly to the definition of what that means, which can be really helpful.
Introducing replace values
Yeah, so we added recode values, which is useful for replacing every single value with something else. But there's also this very common scenario where you're just patching up some column. In this case, we have some state column and things aren't quite right. We have some missing values, we have some unknowns, we have some not recorded. It would be nice if we could standardize all these things, but we still want it to be a character vector, and I still really just want all of the good values to stick around.
With replace values, it's a similar kind of case-when-style interface that you've seen before, except it starts with state, and you can say, whenever there's an NA, put unknown. And then the implication here is, but keep everything else the same. So the NAs become unknown, but all of the other values from state stick around. And that's something that you may have done with case when a lot. The similar kind of case-when-y thing here is a good old dot default.
Yeah, so coalesce comes from the SQL world. It's a direct translation from the SQL world. And all that it does is say, okay, anywhere you see NA, I want you to give me some other value to replace that with. So state has NAs, and I want to replace those with unknown. So in this case, the NAs in state get replaced with unknown. It's a little bit more complicated than that, but that's the general way that people use coalesce. Yeah, it's one of those that, if you're not coming from SQL, you might not ever think to look for it or know that it exists.
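The two older ways of patching a column mentioned here, case when with dot default and coalesce, can be compared in a short sketch. The `states` tibble is invented for illustration, and the `.default` argument assumes dplyr 1.1.0 or later. The final comment shows how the new function would presumably be spelled; that spelling is an assumption and is not run here.

```r
library(dplyr)

# Hypothetical state column with a few different "missing" spellings.
states <- tibble(state = c("NC", NA, "unknown", "NY", "not recorded"))

patched <- states |>
  mutate(
    # Fill in only the NAs; .default passes every other value through.
    via_case_when = case_when(is.na(state) ~ "Unknown", .default = state),
    # coalesce() does the same NA-only replacement.
    via_coalesce = coalesce(state, "Unknown"),
    # Standardize the other placeholder spellings as well.
    standardized = case_when(
      is.na(state) | state %in% c("unknown", "not recorded") ~ "Unknown",
      .default = state
    )
  )

# Assumed dplyr 1.2 spelling of the NA-only patch (not run here):
# mutate(state = replace_values(state, NA ~ "Unknown"))
```

The result stays a character vector in every case, and the good values such as NC and NY are untouched.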
Package versioning and reverse dependency checks
So, it's unclear if it's us that broke this package or not, from what he says. But if we are going to break your package, we typically know, because we run reverse dependency checks. dplyr has something like 6,000 packages that depend on it. We check the tests of all 6,000 of those packages before we send a new version of dplyr to CRAN. And if anything looks like it's going to break because of the changes that we make, we go ahead and send a pull request to that maintainer and say, hey, it looks like we broke this, here is how you fix it. And it's just our way of doing some kind of community outreach, community goodwill, and hopefully making any change we make a little bit easier to swallow.
The dot-dot-dot argument and anonymous functions
So this is actually something that, when I am teaching, is a super common question that will stop learners in their tracks because it's not intuitive. It's one of those things like, you know, hey, I'm looking in the help documentation and I keep seeing dot, dot, dot everywhere as an argument. What does that mean? But it's not explained anywhere. It's kind of tacit knowledge.
So what we're trying to accomplish here is to say, for every column that starts with score in its name, I want to do something with that column. In this case, I want to recode the values in that column from something to something else. But inside of recode values, we need to reference that column somehow. I need to pass it to recode values. And there wasn't really an easy way to do this in R for a long, long time.
So what we had created inside of rlang and the tidyverse was this very compact one-sided formula notation, where you put a formula, and on the right-hand side you get to reference this dot x here. And all that really is equivalent to is a function of dot x that calls recode values. It just allows you to get away without having to put the function up on the front side. And then any column that starts with score gets subbed in for dot x whenever you execute the code.
Now, as of relatively newish R, you can do this instead. And this was probably motivated by us adding this many years ago. But now, directly in R, without anything from the tidyverse, you can do these nice little anonymous functions. And this is still exactly the same as putting function here. This little backslash, parentheses x, is exactly the same as function x. And now you don't really need the tilde dot x anymore. You can just use this form instead. And we don't care what you use. We have no really strong preference, but we've probably started to migrate most of our docs to the new form. Same with the base pipe. Same kind of idea.
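The three spellings discussed here can be put side by side. This is a minimal sketch: the `survey` tibble and the `level_names` vector are invented for illustration, a simple index lookup stands in for the recode step, and the backslash shorthand assumes R 4.1 or later.

```r
library(dplyr)

# Hypothetical survey with two columns whose names start with "score".
survey <- tibble(id = 1:3, score_a = c(1, 2, 3), score_b = c(3, 1, 2))
level_names <- c("Low", "Medium", "High")

# 1. Formula shorthand: .x stands for whichever column is being processed.
with_formula <- survey |>
  mutate(across(starts_with("score"), ~ level_names[.x]))

# 2. A fully written-out anonymous function.
with_function <- survey |>
  mutate(across(starts_with("score"), function(x) level_names[x]))

# 3. Base R's backslash shorthand for the same anonymous function.
with_lambda <- survey |>
  mutate(across(starts_with("score"), \(x) level_names[x]))
```

All three produce the same result, so the choice between them is a matter of style.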

