Transcript
This transcript was generated automatically and may contain errors.
Welcome back to the Data Science Lab, everybody. My name is Libby. I run community here at Posit. I'm joined by Isabella Velazquez, who you just heard say hello. Hi, everyone. Thanks for joining.
Yeah, I am joined today by our esteemed lab manager for the week, Jon Harmon. Jon, would you like to introduce yourself?
Sure. I'm Jon Harmon. The reason I'm here is I run the TidyTuesday project, which I don't know how much you wanted to go into it, Libby, but every week we release a new dataset. Semi-cleaned, usually clean enough that you can play with it a little bit, but often there's more you can do, and the idea is just to give something for people to learn with, to do a new data visualization, maybe make a new model, anything in between.
And I do that through the Data Science Learning Community. That's dslc.io, which I am the director of, have been for the last eight years or so, and then also I am a principal developer at Atorus Research.
Yay! And if you are not familiar with DSLC, it used to be called R4DS, and it's where there are all kinds of cool things like book clubs and great ways to learn together, and there's an enormous Slack, so you can go to dslc.io and there is a link to join that Slack. Go make some buddies.
Yes, we are going to be talking about TidyTuesday today. I love TidyTuesday. I have benefited from it greatly. I teach people to code, and I have used TidyTuesday datasets to do that the whole way through. I just love it so much, and so we have Jon here today to show us what curating a dataset looks like. So the things we will cover today, or at least we will try to cover, because it's all live and we have questions and all that stuff, is what is TidyTuesday. We'll talk a little bit about that. How did it get started, really, is sort of the thing I'm getting at. Examples of what a good TidyTuesday dataset might look like, because if you're trying to think of like, well, what am I going to submit? This will help you out. We'll talk a little bit about why Jon set up some of the functions that he's going to live code and show us today.
We'll do that live demo of curating a dataset, and then if we have extra time, we can do a little bit more of a deep dive on how you can help. Like, how can you review existing submissions? How can you download and use the data? And then what happens when you do submit it and Jon is reviewing your PR in GitHub? So I will say, Jon, if you would like to share your screen and take it away for us with the TidyTuesday repo, and Isabella will be sticking stuff in the chat left and right, so if you want to follow along in the Discord with links, they are all there for you. And this will be recorded. It will be on YouTube. All these links will be in the description, so don't worry about trying to, you know, grab them all right now if you don't want to. All right, Jon.
Introducing TidyTuesday
All right. I guess the first thing I should show you is the main page of TidyTuesday. You can get here by typing tidytues.day as the URL. I'm very proud of that. What? I didn't even know that! Yeah. Secret lore. I saw that there was a .day top-level domain, and I was like, well, I have to get that one, so tidytues.day will take you to TidyTuesday, or I post it on Mastodon, Bluesky, and LinkedIn every week. I actually post it on Monday, and then it's reposted on Tuesday. The idea is that I want you to have it if you choose to participate on Tuesday. You can participate whenever you want. It's just datasets to play with.
And yeah, like I said, we release a new dataset every week. Technically there is one week a year when I take a break and tell you to just catch up, use whatever dataset you feel like, because there are 51 or sometimes 52, depending on how the days fall in the year. There are 51 datasets for you to play with. If you come in here, we have several years of datasets that you can go through, including right now, this week's is this How Likely is Likely dataset.
This one was actually curated by an outside curator. Oh yeah, I was like, I think it was Nicola, and it was. Nicola Rennie submitted this dataset. I love it. She found it online. Someone put out an online quiz of various words for probable. So the example here being, which conveys a higher probability, likely or probable, and had people rank those and made this dataset of 5,000 people answering questions like that to allow you to make a plot of what does likely mean.
We have the image that Nicola submitted with it, which was from the quiz author's original post about it. So if you're ever curious, will happen is, you know, of course, more likely than better than even, but I do like that better than even is slightly above the 50% probability, which is good. But some people said that better than even is a little bit less likely than 50% probability. So, but yeah, so this is, you know, like all these phrases for likelihood and then how they ranked.
This is a great example of what makes a good TidyTuesday dataset too, if you're wondering. It's something, first of all, that doesn't exist already in TidyTuesday. So search through TidyTuesday to see, but also can you imagine visualizing this in some way? And that's a good, that's a good one that we're like, this is a clearly visualizable dataset, but it's not all about visualization. 5,000 data points you could do a lot with.
Yeah. I know a lot of people will use TidyTuesday to, you know, practice modeling because it's a new dataset. It can also be kind of interesting to play with AI tools with it because often the datasets don't exist before we curate them. And so like they exist, but not in a clean form. And so it can be an interesting thing to see what AI thinks is in the dataset if it's not properly actually using the data. And so that can be a useful thing to do. But again, whatever, it's there. You can do whatever you want with it.
We do also try to, or we always include a simple data dictionary of what is in the data. And it's in here, well, in the metadata, and we'll see how this gets used. We also include an article. And so this is his post that he made when he made the dataset of, you know, his interpretation of it. This makes that question of, can you imagine a visualization much easier? Because he provided one. So yes, I can imagine at least one. But it's always interesting to see, like, I've already seen some visualizations that people have put out that I think convey the same information, but a little clearer. And that's one of the things people will challenge themselves with is, like, if the original visualization is really interesting to you, just try to reproduce it. That's, you know, one thing you can do with it. But then, you know, as you're leveling up a little bit, trying to make it easier to understand than the original version.
Examples of good datasets
All right, perfect. So the next thing I would love to show really quickly, and then we're going to hop into curating is a couple of examples of other datasets. Because one of the things that I see holds people back from submitting is they're like, this isn't good enough. It's not official enough. It's not this or that enough. And a good example is Lisa's vegetable garden data, which is completely non-official. This is Lisa, her observations of her own garden, that she was wonderful enough to curate, put together and share with us so that we could look at them and we can make visualizations and stuff of them. It can be as simple as that, of her spending, her harvest, all that stuff. But then it can be as official as the Bureau of Labor Statistics data for employment from 2021.
This is super official data, right? And it already existed in some way, but needed to be put together, needed to be sort of gathered together. And I think this one, if you scroll down to the cleaning script, there's always a cleaning script at the bottom. You'll see the work that was done here to make this data usable. And I really encourage you, if you are up for a challenge, go get the raw data and see if you can go through this cleaning process and get to the same point that this person did, right?
There's also a great example. The Pokemon package data was submitted in 2025 by Frank. And I remember talking to Frank about this and he was like, I didn't think it would be a good one. I just thought it's fun, but I didn't do anything to clean it. I just pulled it from the Pokemon R package. It ended up being one of the most popular TidyTuesday datasets to visualize that I've ever seen. So many people did it. So if you are hesitant about like, oh, this isn't good enough or fun enough or whatever, I encourage you to go do it.
There's a really great question from Lauren on Discord as well, which is, I assume these TidyTuesday datasets do not get updated after they're published. Is that the case? Like the 2025 April 08 dataset on emergency room visits will almost certainly be constrained by an end date in 2025.
That is correct. I don't update datasets after they're published. There's a very slight asterisk: every once in a while we've had something where there's a major issue in the data that makes it unusable. And so I try to get them out a little bit early actually, so that some of the early adopters can tell me, oh, this column is completely empty. I have some checks in place to try to avoid that now, but that kind of thing. And actually, if you watch Joe Cheng's keynote from posit::conf this past year, he had an example where he was using Databot and found a mistake I had made in some data.
Yeah, I was like, come on. It's a lot of datasets and we're human beings. I saw Jeff had a follow on to that was like, how does it work if you use official data that can change? Like if it gets updated at certain intervals, I would say don't hesitate to just go ahead and use it and say, as of this date, this is what this dataset looks like. That can be, you know, job reports get postdated. They get changed the next month, right? They'll be like modified up or down. You could just take data that ends two months ago, right? And say that we're pretty confident that as of two months ago, this was right. And just use that and don't use the most recent, recent data if you're afraid that it might get updated in some like catastrophic way.
The curation process
But yeah, I think that all these are great examples and it is time. We better do it because we might run out of time. Let's go into the curation process. And while we are setting up this demo of curating a dataset, Jon, can you talk a little bit about why you created this process? Yeah. Well, you know, as we've hinted at a couple of times here that for the most part, I curate all of these datasets and, you know, people wanted to help and I wanted to make it easy for people to do that. As easy as possible. Often, as people started to help, I was doing kind of the same things in the reviews. And so, we already had this TidyTuesdayR package.
It's also on CRAN as tidytuesdayR. And so, we had this package and I was like, well, why not just put some functions in there to help with the curation step? So, the package also has functions. Like the main purpose of it was to allow you to easily download the dataset every week or download a particular week. But I added some functions just to guide you through the curation process.
No. I mean, just that like I got to help test these when you were creating them. And it was so wonderful. It was so helpful. That process of testing was when I submitted the Mount Vesuvius data and I had so much fun doing it. I really, really recommend everybody dive into this. So, let's follow along with Jon while we create a dataset. And he will go through all of these little steps. We are on the TidyTuesdayR, not GitHub repo, but the actual documentation site, the GitHub.io slash TidyTuesdayR. And in the article section right at the top, it's the only option there right now, curating a dataset. These are the instructions. And we're going to go through them zero through seven with a dataset today. Please stop us and ask any questions.
And this is showing the dev version. I think I have the dev version installed right now. I will be doing a CRAN update of this probably this week. I wish I had thought to do it last week. But so, the only changes in the functions is there's a little bit more safety net in the current dev version to kind of help you help make sure that you've got everything set up correctly. It was a little bit hard to test at first because I have everything set up correctly. And so, like, I could put tests that should have caught everything. But people did things that I wasn't expecting. And we were able to put some more safety nets in place.
It also I don't remember if I had to do anything or if it just was a matter of Positron being updated. But it was originally written to work with RStudio. It as of the dev version and maybe the CRAN version also will work with Positron. Because these curating steps actually, like, create and open files for you to edit so that you can go through step by step and make sure that you have everything you need.
We will not be walking through step zero, which is set up your GitHub account. I link over to or I have a section here kind of walking through the basics of how to do it. And mostly what I would tell you is use this. The usethis package makes a lot of these steps easier. And we actually use some usethis functions within TidyTuesdayR. Creating a GitHub account is free. Easy enough. And so, that's your first step.
That I've already done. So, we'll be going past that. And I'll just start with this tt_curate_data() so we can see what that does. So, those little steps at the top, those little bullet points, that is really just like a table of contents. You don't have to, like, go up there and click them. They will take you down to these steps. These are the ones starting at open, these instructions. This is the part where you're actually going to be putting code in and running it.
All right. So, just, yeah, some of these are the same idea as this article, in this thing that loads when you say tt_curate_data(). It loads this working document. This, yeah, I wasn't sure if it had all of my info. It does. But after you have tidytuesdayR, okay. Because I know when you have it installed, this file actually is just like a file in the package directory. And so, if you edit it, it will be updated for you. I'm pretty sure. I haven't done that in a little while. So, I'm not sure. But that way, in theory, all of this, once you have your info there once, I think it will stay there so that you can use it in the future.
All right. So, I will talk a little bit about, you know, first step here is wrangle. But really, the step before that is you have to have some dataset that you want to curate. And so, the one I'm talking about today came from someone on BlueSky. Actually, I think this was in response to a post by Libby, yes. That they had a dataset that they thought was interesting. But they weren't able to actually curate it and submit it. And so, they just posted about it on BlueSky. And I took that and turned it into an issue on the TidyTuesday repo.
If you say new issue and choose dataset suggestion, it will give you these fields to fill in. Which are just saying, you know, first kind of checklist of what you know about the dataset. And things like, okay, this is bird sightings at sea. That's the title we're going to use. I asked for an article and a data source. But all of these are optional. So, if you, you know, the one way to help out is to just submit datasets. If you see something interesting, let us know in an issue. And include as much information as you have.
Over time, I have edited this and filled in more information. Partly because I knew I was going to be using it today. And so, this one is pretty complete. Eventually, we want to have an image. We want to have I'm going to talk about the alt text when we get to it. But the minimum is, you know, enough information for me to know what you're talking about. So, like, the article about this probably would have been enough. The actual link to the dataset would also be enough. And so, just knowing what to do with that.
Cleaning and saving the dataset
And so, that is what I'm going to use. You know, that data is what I'll use when I go into this tt_clean(). And this is creating, sorry. And just to back up a little bit. What is happening here is as I run these, within whatever project I'm working in in RStudio, it will create a tt_submission folder. This is designed to be, like, standalone. And once it's merged, you can delete it. You can do it wherever. And so, if you already have a dataset, what we're looking at right now is I'm in a project where I actually made a little data package out of this dataset. Inspired by the submission. So, it's already clean. It's ready to go. I already have this dataset. But sharing it in TidyTuesday, I just create this little folder that we'll go through, we'll create it, we'll submit it. And then once I'm done, I'll just delete that folder. It didn't disturb my project at all.
And then, yeah, I include a little, like, starter cleaning script that shows you what to do. Tells you a little bit. And then asks you to delete that block of comments. The idea is to, you know, as we saw a couple examples of show how you got the data. Now, there are kind of two approaches we can do. Since this is a package now, I can just say, you know, clean data provided by, like, VC data package.
And this is the same as, like, the Pokemon one, right? The Pokemon one has, like, this was provided by this. We didn't need to do anything. So, I can I could do that. That there are these so, it's one, two, three, four datasets. Plus, I actually saved the dictionaries within here. Because, again, I wanted to have those handy.
Um, so, we could do something like, you know, Beaufort scale gets the seabird data Beaufort scale, et cetera. Technically you barely even need this. But I like to have some sort of record of where everything comes from. So, we're gonna do birds the same way. We're gonna do ships as... I keep typing date instead of data. And that's easier that way. I always do the opposite. Type data instead of date. That's funny. It's part of my, like, dates are my nemesis.
I did support at an education company. Some of our first students were in Arizona. Which is, like, smartest because they don't do daylight savings. But the rest of the U.S. does daylight savings. So, their due dates always confused them. Because half of them had the wrong time zone setting. Sometimes their instructor had the wrong time zone setting. All kinds of crazy things. Because when daylight savings switches, Arizona gets out of sync with the rest of the U.S. So, anyway, dates are my nemesis. That's probably why date sticks in my head.
So, again, you know, this is good enough for cleaning. Or I could always go into I actually have my cleaning scripts. And so, I could, you know, take this whole thing and copy paste it. I go back and forth on what I want to do here. So, I think what I will do is just reference that, say, specifically in the cleaning script. All cleaning scripts are available on GitHub.
So, kind of the balance. Like, you don't want it to be overwhelming necessarily. And you want to you don't have to go all out. You don't have to make a data package when you're doing this, by the way. So, if data package already exists. If the data if the script is, you know, if you have to do the cleaning yourself, it's perfectly acceptable. I just didn't want to go into this and find that the data didn't work. And so, I very carefully went through it. I actually was playing with the RStudio Posit Assistant to do this cleaning. So, the Posit Assistant did a fair amount of this.
But, yeah. So, whatever level you want to do. You know, there are certain places where I take things further. In this one, like, oh, in the original post, he followed up and mentioned that there was a typo in one of the IDs, which once I looked at it, it was really obvious. There was this ID 1184009 that didn't correspond to anything. And so, I was able to go through and figure out that, okay, yeah, this is supposed to be 1104009 because the ship was missing the bird and the bird was missing the ship. And so, yes, I used the fancy, relatively new replace_values() in dplyr, which was very nice because I just want to replace that one and leave everything else as it is.
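As a rough illustration of that one-value fix, here is a minimal sketch in base R rather than the dplyr function mentioned above. The two IDs come from the discussion; the table layout, the column names, and which table held the typo are assumptions made up for the example.

```r
# Toy stand-ins for the two tables. Column names are hypothetical, and
# putting the typo in `birds` is an assumption for illustration only.
ships <- data.frame(
  record_id = c(1104008, 1104009, 1104010),
  observer  = c("A", "B", "C"),
  stringsAsFactors = FALSE
)
birds <- data.frame(
  record_id = c(1104008, 1184009, 1104010),
  count     = c(3L, 12L, 1L)
)

# An ID present in one table but not the other points at the typo.
orphans_before <- setdiff(birds$record_id, ships$record_id)

# Replace only the mistyped ID and leave every other value untouched.
birds$record_id[birds$record_id == 1184009] <- 1104009

# After the fix, every bird record has a matching ship record.
orphans_after <- setdiff(birds$record_id, ships$record_id)
```

Checking for unmatched IDs before and after the replacement is a cheap way to confirm the fix did what was intended and nothing more.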
I did some, like, there was this no birds recorded signal in the data that basically means NA. And so, I cleaned that up. I'm really, like, particular about if something is a character or a double or something non-integer, I like to officially say, no, I know that this is an integer. You don't have to check through to see if there are a few of them that have a decimal point or something. And so, I did that, again, not absolutely necessary to do that level of cleaning.
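A small sketch of those two cleanups in base R, under stated assumptions: the marker string and the toy vectors below are invented for illustration and are not taken from the actual dataset.

```r
# Hypothetical raw column where a text marker stands in for "no value".
raw_count <- c("4", "NO BIRDS RECORDED", "12", "1")

# Turn the marker into a real NA before converting the type.
count <- raw_count
count[count == "NO BIRDS RECORDED"] <- NA
count <- as.integer(count)

# For a column stored as double, confirm it really holds whole numbers
# before declaring it an integer, so nothing is silently truncated.
raw_age <- c(1, 2, NA, 3)
stopifnot(all(raw_age == round(raw_age), na.rm = TRUE))
age <- as.integer(raw_age)
```

Replacing the marker first means the integer conversion does not have to coerce a non-numeric string, so it runs without a coercion warning.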
And, yeah, some other little recodes because I guess to, again, back up a little bit, this original data is it's in XLS. But it is human it's, like, from logbooks. And then those logbooks were copied by people into a database. So, it has a couple of opportunities. Yeah. And so, that's why that one dataset ID doesn't match up is probably in one of the places the zero looked like an eight. And so, it didn't match up nicely.
See if I can get rid of these weird split. Yeah. So, the data is a bunch of ship data, like, information about what the ships are. And, like, sorry, when the recording took place, uh, who did the recording? There's this there is a set of observers who were on these ships in, uh, like, around New Zealand. They're all identified by, like, first name or actually I think that is an abbreviation of their last name. There's a another one that is another table that is the birds themselves. What birds did they see? So, we can see that in this one record or, you know, this one time of recording, they counted these different types of birds.
And, you know, again, partially it's that actually I think that what they did is this is the actual log entry. And then the people doing the coding broke that apart into things like species name, species abbreviation, age, the type of plumage for a certain type of bird versus the phase for other birds and the sex. I'm going to talk about that in a second when I write the introduction. As far as I can tell from the data dictionary, that's what happened here. But it would be interesting to do some exploration of do these fields correspond with one another in the way I think they should. I didn't do that yet because I wanted to kind of let people play.
And then they have a data dictionary of sorts. And so, that was helpful that I could go through and see, you know, what do they say. But often when you're working with pretty much, I don't know, lots of datasets, they'll give you an idea of what's here. Might not always be great. Hopefully if it is your own dataset, you know a little bit better about what it is. But if it's something that you just find online, again, do your best. Because part of the goal here is to let people work with real data. And often real data isn't perfect. And so, telling people this is what we know about the dataset is fine as long as it's not like a complete unusable mess.
Yeah. And I think that there's plenty of things that you could put in the data dictionary or the readme that will allow people to do the data cleaning themselves. And do some transformations, which are fun. So, I get I am torn on whether or not to fully clean it. But I'm going to do a time check because we are at 35 minutes in and we are only at the cleaning step. So, we've got to move on to the next step.
All right. So, yeah. That's the clean data. I have said, okay, this is what it is. I want to make sure I have these objects in my session. And that that is saved. All right. So, then the next thing you do is save your dataset. And so, I've got four of them. So, I'm going to copy this down four times or three times. And just put each of those into this function. Oops. That will save.
Now, again, because I am who I am slash knew what we were doing today, I already have these data dictionaries, and so, Beaufort scale dictionary. I saved it in that format within my setup. And so we can see that, you know, the column name is filled in automatically. The class was filled in automatically as character here, which would also be completely fine. But again, we can tell you a little bit more information, that it's an ordered factor either way.
And then these are things mostly copied from the data dictionary here. And so we look at well, Beaufort scale actually is a dataset I created out of what they gave us because in the dictionary, they have a field that is wind speed, which it says is the ordinal Beaufort scale 0 to 12. And so I just split that off into a dataset that we could join to.
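To make the "split it off into a dataset that we could join to" idea concrete, here is a minimal base R sketch. The column names and the toy `ships` rows are assumptions; the labels are the commonly used names for Beaufort numbers 0 to 12, which may differ in wording from the dictionary used in the session.

```r
# Lookup table: one row per Beaufort number (0 to 12).
beaufort_scale <- data.frame(
  wind_speed  = 0:12,
  description = c(
    "Calm", "Light air", "Light breeze", "Gentle breeze",
    "Moderate breeze", "Fresh breeze", "Strong breeze", "Near gale",
    "Gale", "Strong gale", "Storm", "Violent storm", "Hurricane"
  ),
  stringsAsFactors = FALSE
)

# Toy observations that store only the ordinal code.
ships <- data.frame(
  record_id  = c(1L, 2L, 3L),
  wind_speed = c(3L, 0L, 8L)
)

# Left join: keep every ship record and attach the label for its code.
ships_labelled <- merge(ships, beaufort_scale, by = "wind_speed", all.x = TRUE)

# merge() sorts by the join key, so restore the original record order.
ships_labelled <- ships_labelled[order(ships_labelled$record_id), ]
```

Keeping the scale in its own table means the main data stays compact and the labels live in exactly one place.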
And so separately, we do one that's a little bit more normal. We have the birds and again, it's going to tell you just describe this field. And so I will grab the birds dictionary and describe those fields. And just realized that I used a different name for the top than what has happened or what was happening automatically. So I want to fix that.
I mean, all right, so the great thing about this is these, all of these Markdown files are created with this pre-filled thing that you don't have to worry about any of the formatting, right? So like all of those pipes that you see, it doesn't really matter where they are in space. You can leave your space as wacky. Just go through, make sure your classes are correct, and then go modify the description for each one. I myself just typed willy nilly into all of them for mine and got through it. That is absolutely fine.
So yeah, it does an automatic, like when it loads, it'll be, you know, it'll look nice to begin with, but as you type your descriptions, you know, often this will go on and on and on. Something that's useful in here is soft wrap long lines. In RStudio, for something like this, that can be useful. So you can actually still see everything you're typing if it goes off the edge of the screen. But yeah, you just need to fill the info in.
And so this is the sea states and I will keep that. And so we'll see, whoops, yeah, when I paste this in, yeah, I don't know if that's helpful necessarily to soft wrap long lines here and also just give it a little more space. And you can see that, you know, these are off. It doesn't matter. The spacing is just to make it a little easier to read as you're working on it. But this will actually get loaded and resaved by the scripts before it gets, like, printed into the repo. So it doesn't matter. Also it gets rendered by GitHub to these visual dictionaries if I can, yeah, like this. So again, it doesn't matter what you do.
And then, so those are the somewhat like that step is the most formulaic. Finding these definitions can be a pain. That's what led me to split out some separate datasets because they were actually standardized scales and I was looking into more info, but, you know, the best you can do, again, of what is in each of these fields. And so that's what all this was. This one was relatively easy as dictionaries go because they gave all this information.
Writing the introduction and metadata
I did some, well, technically RStudio did some recoding of these into sentences, but I could mostly copy paste. All right. And so then we go on to the most, I don't know, possibly the most intimidating piece of describing what this is. And this is somewhat freeform. I give some example of what it is within this script. And I actually have kind of fallen into using this semi just standardized way of doing it because it makes it easier and faster and usually actually better.
And so it will be something like this week we're exploring bird sightings at sea. And so this was the actual link that I started with was at this New Zealand government just data site. But going through this, that led me to the source of the data. So when I went down and dug into the source, it's at this Museum of New Zealand, or they always refer to it as Te Papa, and I have no idea if I'm pronouncing that correctly, but Te Papa is the Museum of New Zealand.
And I might say something like, oops, comes from, I think I wrote it out here. I'm going to say, comes from Te Papa Tongarewa, the Museum of New Zealand. And again, I'm going to use that soft wrap long lines because that is easier to read. It consists of logbook entries of bird sightings at sea near New Zealand from, what was it, 1969 to 1990.
And then, so part of the formula, the next little piece is I'll put, I'll find a quote on the site. So this one actually, both of them have a nice little description of what it is. So I'll just grab the basics here. The data was recorded using guidelines, et cetera, et cetera, et cetera. Read through that, you know, make sure that the quote has something to do with the dataset. And then, try to do one or more questions about the data.
And so, the first thing I'll do here is what I mentioned: the data was recorded by hand and split into standardized columns. Do entries always match? Actually, I want to look at, which one is it, ships, I think. Or no, it's birds. Do the entries in species common name in the birds dataset always match up with the split columns? Let's just do a, all names, birds. And what's really helpful is if you've already done your own analysis of the data. So while you're curating it and you're cleaning it, go ahead and analyze it so you know what questions can be asked of it. I think that's generally a good idea so that you're not suggesting questions that like literally can't be figured out from the data.
Also time check, we only have 14 minutes left. I'm almost done. It's okay. We'll get there. We are cooking. We're going. And for this particular one, at least for now, I'm just going to ask the one question. You can always, you know, whatever makes sense for what you're doing. But so, that's my intro.
I want to have a PNG image. I do like restrict it now to PNG. Makes my life, oops, makes my life easier. And so, we're going to also save an image, which we have in the submission, which is a screenshot of their Interactive Explorer. Let me do, this is a helper that I have to open the folder that I'm in in Windows, so oops. I want to do this and go to TT submission and say, you know, I'll just call it screenshot. And so, okay. I have that PNG image, and I also already did alt text for the image.
I want to talk about this a little of this is the thing that I probably have to edit the most often in submissions is I ask you to give me alt text for the image. The alt text, the tip I was given is that alt text should be able to replace the image, not just describe the image. So, a lot of times people will say an interactive plot or an interactive plot of seabird data. It's like, well, okay, but what does it show? And so, going to the details of what it shows, that it shows, you know, specific grid cells within the Southern Ocean that have data. This plot shows that there was no data from whatever years, 1970 to 1974, like, what do you actually get out of it with eyes when you look at the image so that, you know, it's actually replacing the image.
So, I'm going to use that. This is a good use for an LLM: pop it in if you happen to use something like that, and then read it and edit it, verify it, but it can be really helpful. Yes. Especially if you go back and forth a little bit, they can do a pretty good job. Again, this one did, I think, come about that way. I often do that as a step, but it's never perfect. Right. You always have to go back and look. But I've been surprised at how quickly it can get me to a better version than what I had written myself. So, that's the only time I like to use LLMs is when it can do it better than I can.
I think they are getting better at that, because I think there's been a fair amount of kind of writing on what they do wrong, and so those articles are getting incorporated into their training. All right. So, we are filling out metadata. Yes. And we didn't have to, like, create anything. It was already in that file. In this case, it's already in the submission. This might be something you have to go find. A lot of times for an article, I have to go find something, and the idea is just something that's related to it. Something that will give people some context of the data. Often that comes with the data, but sometimes I'll just find a Wikipedia article that's related or, you know, whatever it might be. So, just find something to use.
And then the image file name was screenshot.png. I already did the copy and paste of the alt text. This is all wrapped in a function, which is a good thing to realize. These are all options, or arguments, to a function. And technically, if you just call ttmeta, it will quiz you on these things, so you can do it that way, and it will walk through and have you enter the things one at a time, or you can use the template that comes up when you use the ttcurate dataset function and fill it in that way. Either way you want to do it. I like to do it this way, because, like I said, once I have this, like, this is the same every time.
These last sections are "how do you want me to mention you?" Oh, and actually, that "mention you" made me think that I also want to say, in the intro, yes, that I want to thank David Hood. Oh, nice handle, David. ThoughtfulNZ on Bluesky. For the dataset suggestion. And so, something like that.
Sometimes, I guess, in RStudio, it'll just turn that into a link if you paste over the text. Yeah, sometimes. I'm not sure what makes it. It does it in Slack nicely. Yeah. Anyway, so, it also will do it on GitHub. If you paste over text with a link, it'll do the markdown formatting for you. So, that's all set. This is all set. This has all the credit of who I am. Don't get caught up on, like, the formatting of your BlueSky, LinkedIn, Mastodon, GitHub links. If you include those, those are all optional. But if you do include them, I do some, like, figuring out of the various ways that you can enter those.
So, if I entered, you know, @jonthegeek@fosstodon.org, which is a valid way to enter your Mastodon handle, it should also, like, just deal with that. So, submit those. And then I will run that function, and it created this file, this meta.yaml. So, this is the thing that my script will use to post the dataset. So, it is helpful if this has everything. I'll also be checking that as I review any submissions.
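As a rough illustration of the kind of handle clean-up described here, the following Python sketch accepts a Mastodon handle in a few common forms and returns a profile URL. This is not the code behind the curation functions; `mastodon_url` is a hypothetical helper, `example.social` is a placeholder server, and the set of accepted forms is an assumption.

```python
import re


def mastodon_url(handle: str) -> str:
    """Turn a Mastodon handle or profile link into a profile URL.

    Hypothetical helper for illustration only; the accepted input
    forms are an assumption, not the real submission logic.
    """
    text = handle.strip()

    # Already a profile link, e.g. https://example.social/@user
    link = re.fullmatch(r"https?://([^/\s]+)/@([^/@\s]+)/?", text)
    if link:
        server, user = link.groups()
        return f"https://{server}/@{user}"

    # Handle form: @user@example.social or user@example.social
    parts = re.fullmatch(r"@?([^@\s]+)@([^@\s/]+)", text)
    if parts:
        user, server = parts.groups()
        return f"https://{server}/@{user}"

    raise ValueError(f"Unrecognized Mastodon handle: {handle!r}")


if __name__ == "__main__":
    for raw in ("@user@example.social", "user@example.social",
                "https://example.social/@user/"):
        print(raw, "->", mastodon_url(raw))
```

Raising an error on anything unrecognized, rather than guessing, keeps a malformed credit from quietly ending up in a post.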
Submitting the pull request
And then I make sure that I don't have any little asterisks up here. Everything looks good. And I can ttsubmit. And so, let me make sure, yeah, that's going to open up a browser window in a moment. This is going through and doing all the GitHub stuff for you. So, even if you have this project, you know, this is a package that I'm developing. It has its own GitHub repository. But for a moment, the script says, okay, this folder exists in a different GitHub repository. And it does all the check-ins, sorts that all out. It will create a fork of the TidyTuesday repository for you. So, you don't have to do any of the fancy GitHub stuff. And you can click create pull request. You can give it a title if you want. You can, like, change, you know, tell me if there's any details. But you don't have to. You can just hit create pull request.
We've got five minutes. So, those will be perfect. And I do ask that, you know, you go through here and go, okay, it's not already used. Not this one, but a recent one. I had to go through and fix this where all of the data, all of the files were big. Or actually, it was one big file. And so, I split it apart into separate files. Because GitHub will sometimes complain if it's more than 20 megabytes. People won't be able to download it sometimes if it's more than 20 megabytes. And so, I do this splitting.
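One way to picture the splitting step is the Python sketch below, which breaks CSV text into parts that each repeat the header row and stay under a byte limit. It is only an illustration: `split_csv_text` is a hypothetical helper, and splitting by rows is an assumption (the split in the talk may have been done differently, for example by columns). The 20-megabyte figure mentioned above would be passed as `max_bytes=20 * 1024 * 1024`.

```python
import csv
import io


def _encode_row(row: list[str]) -> str:
    """Serialize one row back to CSV text with a newline terminator."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(row)
    return buffer.getvalue()


def split_csv_text(csv_text: str, max_bytes: int) -> list[str]:
    """Split CSV text into parts that each start with the header row.

    A part is closed as soon as adding the next row would push it past
    max_bytes (measured as UTF-8). A single oversized row still gets a
    part of its own.
    """
    rows = csv.reader(io.StringIO(csv_text))
    header = _encode_row(next(rows))

    parts: list[str] = []
    current = header
    n_rows = 0
    for row in rows:
        line = _encode_row(row)
        if n_rows and len((current + line).encode("utf-8")) > max_bytes:
            parts.append(current)
            current, n_rows = header, 0
        current += line
        n_rows += 1

    # Keep the last part; a header-only input yields one header-only part.
    if n_rows or not parts:
        parts.append(current)
    return parts
```

Repeating the header in every part means each file can be read on its own and the parts can be stacked back together afterwards.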
And so, what's going to happen here? This is actually telling me that my personal fork is out of date. Doesn't really matter. But, you know, you don't have to deal with that. But it's going to do a check. And I just wanted to see this happen. Because often this check will find something. And when you do this, after you have your GitHub set up, it will probably email you if the check fails, to say, hey, something's wrong. I try to be clear in there that some of the things that it says are wrong are things for me to deal with. And some are things for you to deal with. If it says there's no image, like, that's saying, hey, can you give me an image? You know, like, maybe I'll try to go find one. But it's a lot easier if you do it yourself. And then other times, you know, it will say that this link can't be reached. A human should check it. Because sometimes GitHub gets blocked from reaching the data site or data source. But you can reach it when you click on it. Things like that. So just kind of follow the instructions in the comment that will eventually show up here. And that's it.
And, you know, we're not going to have time for me to go through the review step. But then I would take this and review it. Make sure that it does all the things that it needs to do. And then this one actually won't be next week's dataset. Because someone has a timely one that I'm going to try to get in. But it will be the dataset in two or three weeks.
Awesome. Well, we have a question in the chat really quickly from Becca. And we only have four minutes left. So let's really quickly. Becca says, how does one confirm a dataset has not been used before? You can search in the repo at the top. There's a little type search. But you could also Google. And then there's also a ttmeta package that Jon made that will help you sift through. Here we go. So, yeah, there is this package that every week it auto-updates with all of the data. So you can install this ttmeta package to make it a little bit easier to search through. But also you can just search. All I really ask is that you've done a search. So you've searched for the name, you know, "seabird data." And you've searched for, like, the URL. And if neither of those comes up, then we're good.
A lot of times what you search for, probably those words have been used in a previous dataset. But just do a little bit of diligence to make sure that it's not the same dataset. That said, we've run the same dataset twice with, like, re-cleaned. And it was a little bit different. Had newer data within it. Just make sure that you know that's happening and you explain it in the introduction. And that something about the data is different. Like, we don't ever want to run it exactly the same.
Reviewing a submitted PR
All right. Perfect. Becca, thank you for asking. So we have run through and we've submitted this. And with our two minutes left, Jon's going to show us what it looks like once he gets that submission as a PR. Yeah. So here's one. We'll see. We'll do this one, probably. This is one. Jen Richmond submitted it. She has done several datasets now. And so she's got it mostly down. And she actually had some of these things unchecked, actually. And I kind of dealt with them. This is an example of when she first submitted it, she didn't have an image. And so it ran through and said, hey, it's missing a file. And then she. Sorry. I can't remember if she did it or I did. Oh, yeah. So that's right. She did a JPEG. I just converted it to a PNG. Did a few other things. And it worked great. Except then I realized that the files, one of the files was huge. That was the repairs.csv. And I split off just some of the columns. So I'll do some work on it. If it's broken, the more you do, the better. But I can, you know, I'll work with you on that.
Another example of one is Novica submitted this earlier today, I think it was. Yeah. Four hours ago. And again, he went through a few steps of tweaking some things. I learned today that I need to update my checker script because this one that passed actually is a more recent update than this one that failed. It just took longer to run the one that failed. And so it made me think that it was broken, but it's fine.
But yeah. Then what I'll do is usually I start by just kind of looking at the cleaning script, see what I need to change. See what comments are kind of left over that I want to delete, things like that. Make sure everything's here. A common thing is, I say that your title should fit into the phrase "This week we're exploring blank." And a lot of times people will include the phrase "this week we're exploring." Daniel Chen's slipped through with the Olympic data a couple of weeks ago. And so when it posted, the post said, "This week we're exploring this week we're exploring the Olympics." It just doubled up. So I just need "the Olympics" or "the Olympic schedule" or whatever it is. So things like that I'll watch out for. Usually I'll catch them, but not always.
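A small guard against that doubled phrase could look like the Python sketch below. It is an assumption about how such a check might work, not the actual posting script; `clean_title` and `post_line` are hypothetical names.

```python
import re

# Matches a leading "This week we're exploring", with a straight or curly
# apostrophe and an optional comma after "week".
_LEAD_IN = re.compile(r"^\s*this week,?\s+we['’]?re\s+exploring\s+", re.IGNORECASE)


def clean_title(title: str) -> str:
    """Drop a leading 'This week we're exploring' so the title fills the blank."""
    return _LEAD_IN.sub("", title).strip()


def post_line(title: str) -> str:
    """Build the post sentence from a submitted title (hypothetical wording)."""
    return f"This week we're exploring {clean_title(title)}"


if __name__ == "__main__":
    print(post_line("This week we're exploring the Olympics"))
    print(post_line("the Olympics"))
```

Both calls print the same sentence, so a title submitted either way ends up with the phrase exactly once.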
And, you know, like if you don't include a credit for yourself, usually I'll hit you up and say, hey, how do you want me to mention you? Don't forget that. Let us know who you are, but I have to interrupt us, Jon, because we have reached the top of the hour. Okay. So I think that we got really, really far. Thank you, Jon, for walking us through this. We got all the way from opening up that curate article through submission. I hope that this helps you submit a TidyTuesday dataset. If you have used TidyTuesday or benefited from it in any way, please, please, please submit. I hope to see your submissions come through.
Thank you so much, Jon, for joining us. I cannot wait to see all the things people submit. I know I have something in mind to submit as well of my own data. I will see everybody on Thursday if you come to the Data Science Hangout, or I will see you next week on the Data Science Lab. Thank you, everybody. We'll see you next time. Hope you had a good time. See you on the Discord server.