How to contribute to TidyTuesday by curating a dataset | Jon Harmon

Transcript

This transcript was generated automatically and may contain errors.

Welcome back to the Data Science Lab, everybody. My name is Libby. I run community here at Posit. I'm joined by Isabella Velásquez, and you're about to hear Isabella say hello. Hi, everyone. Thanks for joining.

Yeah, I am joined today by our esteemed lab manager for the week, Jon Harmon. Jon, would you like to introduce yourself?

Sure. I'm Jon Harmon. The reason I'm here is I run the TidyTuesday project, which I don't know how much you wanted to go into it, Libby, but every week we release a new dataset. Semi-cleaned, usually clean enough that you can play with it a little bit, but often there's more you can do, and the idea is just to give something for people to learn with, to do a new data visualization, maybe make a new model, anything in between.

And I do that through the Data Science Learning Community. That's dslc.io, which I am the director of, and have been for the last eight years or so. I am also a principal developer at Atorus Research.

Yay! And if you are not familiar with DSLC, it used to be called R for DS, and it's where there are all kinds of cool things like book clubs and great ways to learn together, and there's an enormous Slack, so you can go to dslc.io and there is a link to join that Slack. Go make some buddies.

Yes, we are going to be talking about TidyTuesday today. I love TidyTuesday. I have benefited from it greatly. I teach people to code, and I have used TidyTuesday datasets to do that the whole way through. I just love it so much, and so we have Jon here today to show us what curating a dataset looks like. So here are the things we will cover today, or at least try to cover, because it's all live and we have questions and all that stuff. First, what is TidyTuesday? We'll talk a little bit about that, and really about how it got started. Then, examples of what a good TidyTuesday dataset might look like, because if you're wondering what you could submit, this will help you out. We'll also talk a little bit about why Jon set up some of the functions that he's going to live code and show us today.

We'll do that live demo of curating a dataset, and then if we have extra time, we can do a little bit more of a deep dive on how you can help. Like, how can you review existing submissions? How can you download and use the data? And then what happens when you do submit it and Jon is reviewing your PR in GitHub? So I will say, Jon, if you would like to share your screen and take it away for us with the TidyTuesday repo, and Isabella will be sticking stuff in the chat left and right, so if you want to follow along in the Discord with links, they are all there for you. And this will be recorded. It will be on YouTube. All these links will be in the description, so don't worry about trying to, you know, grab them all right now if you don't want to. All right, Jon.

Introducing TidyTuesday

All right. I guess the first thing I should show you is the main page of TidyTuesday. You can get here by typing tidytues.day as the URL. I'm very proud of that. What? I didn't even know that! Yeah. Secret lore. I saw that there was a .day top-level domain, and I was like, well, I have to get that one, so tidytues.day will take you to TidyTuesday. I also post it on Mastodon, Bluesky, and LinkedIn every week. I actually post it on Monday, and then it's reposted on Tuesday. The idea is that I want you to have it in hand if you choose to participate on Tuesday. You can participate whenever you want. It's just datasets to play with.

And yeah, like I said, we release a new dataset every week. Technically there is one week a year when I take a break and tell you to just catch up and use whatever dataset you feel like, so there are 51, or sometimes 52, datasets a year for you to play with, depending on how the days fall in the year. If you come in here, we have several years of datasets that you can go through, including this week's, which is this How Likely is Likely dataset.
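The "51 or sometimes 52" arithmetic can be checked with a short sketch. It assumes one dataset per Tuesday and exactly one break week per year, which is a reading of the description above rather than a documented rule; the function names are made up for illustration.

```python
from datetime import date, timedelta


def count_tuesdays(year: int) -> int:
    """Count how many Tuesdays fall in the given calendar year."""
    day = date(year, 1, 1)
    # Move forward to the first Tuesday (weekday() == 1 means Tuesday).
    day += timedelta(days=(1 - day.weekday()) % 7)
    count = 0
    while day.year == year:
        count += 1
        day += timedelta(days=7)
    return count


def datasets_in_year(year: int, break_weeks: int = 1) -> int:
    """Assumed rule: one dataset per Tuesday, minus the break week(s)."""
    return count_tuesdays(year) - break_weeks


for year in (2024, 2025):
    print(year, count_tuesdays(year), datasets_in_year(year))
```

A year has 53 Tuesdays only when January 1 is a Tuesday, or when it is a leap year starting on a Monday; otherwise it has 52.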

This one was actually curated by an outside curator. Oh yeah, I was like, I think it was Nicola, and it was: Nicola Rennie submitted this dataset. I love it. She found it online. Someone put out an online quiz about various words for "probable". The example here is, which conveys a higher probability, "likely" or "probable"? He had people rank those and made this dataset of 5,000 people answering questions like that, which lets you make a plot of what "likely" means.

We have the image that Nicola submitted with it, which was from the quiz author's original post about it. So if you're ever curious, "will happen" is, you know, of course, more likely than "better than even", but I do like that "better than even" is slightly above the 50% probability, which is good. But some people said that "better than even" is a little bit less likely than 50% probability. So yeah, this is all these phrases for likelihood and how they ranked.

This is a great example of what makes a good TidyTuesday dataset too, if you're wondering. First of all, it's something that doesn't already exist in TidyTuesday, so search through TidyTuesday to check. But also, can you imagine visualizing this in some way? And this is a good one for that: it's a clearly visualizable dataset. But it's not all about visualization. With 5,000 data points you could do a lot.

Yeah. I know a lot of people will use TidyTuesday to, you know, practice modeling because it's a new dataset. It can also be kind of interesting to play with AI tools with it because often the datasets don't exist before we curate them. And so like they exist, but not in a clean form. And so it can be an interesting thing to see what AI thinks is in the dataset if it's not properly actually using the data. And so that can be a useful thing to do. But again, whatever, it's there. You can do whatever you want with it.

We always include a simple data dictionary of what is in the data. It's in here, well, in the metadata, and we'll see how this gets used. We also include an article. And so this is the post that the original author made when he made the dataset, with, you know, his interpretation of it. This makes that question of "can you imagine a visualization?" much easier, because he provided one. So yes, I can imagine at least one. But it's always interesting to see: I've already seen some visualizations that people have put out that I think convey the same information, but a little clearer. And that's one of the things people will challenge themselves with. If the original visualization is really interesting to you, just try to reproduce it. That's, you know, one thing you can do with it. But then, as you're leveling up a little bit, try to make it easier to understand than the original version.
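As a rough illustration of what a simple data dictionary can look like, here is a minimal sketch that builds one as a small table with a variable, class, and description column. The three-column layout, the column names (`respondent_id`, `term`, `probability`), and the sample rows are all assumptions made up for this example; they are not the actual contents of the How Likely is Likely dataset, and this is not the tooling used in the demo.

```python
def infer_class(values):
    """Return a rough type label for a column from its non-missing values."""
    kinds = {type(v).__name__ for v in values if v is not None}
    if not kinds:
        return "unknown"
    if kinds <= {"bool"}:
        return "logical"
    if kinds <= {"int"}:
        return "integer"
    if kinds <= {"int", "float"}:
        return "double"
    if kinds <= {"str"}:
        return "character"
    return "mixed"


def build_data_dictionary(rows, descriptions):
    """Build a pipe-delimited table: one line per column in the data."""
    columns = list(rows[0].keys())
    lines = ["|variable|class|description|", "|:---|:---|:---|"]
    for col in columns:
        cls = infer_class([row.get(col) for row in rows])
        lines.append(f"|{col}|{cls}|{descriptions.get(col, '')}|")
    return "\n".join(lines)


# Hypothetical rows and descriptions, for illustration only.
rows = [
    {"respondent_id": 1, "term": "likely", "probability": 70.0},
    {"respondent_id": 2, "term": "probable", "probability": 65.5},
]
descriptions = {
    "respondent_id": "Identifier for the person answering.",
    "term": "Phrase being rated.",
    "probability": "Probability (0-100) the respondent assigned.",
}

dictionary_md = build_data_dictionary(rows, descriptions)
print(dictionary_md)
```

The descriptions still have to be written by hand; only the variable names and rough types can be read off the data.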