Data dictionaries, parquet, & Claude | Hadley Wickham | Data Science Lab

Transcript

This transcript was generated automatically and may contain errors.

I am joined today by our lab manager Hadley Wickham. Hadley, would you introduce yourself?

Hi everyone, I'm Hadley. I make R packages and do R stuff.

Hadley makes R packages and does R stuff, and today Hadley is going to be talking about the combination of data dictionaries, Claude, and parquet files, in order to help us make our data sensible not only to human beings, but also to machines. Hadley, will you give us a little rundown and then feel free to share your screen and jump right in.

Okay, and I made a couple of very short notes before we get into it, just to remind me what we're talking about.

And basically, since the last data lab that I came on, I kind of realized if you're going to be using AI agents to help you do data analysis, you need to write down what you know about the data, so that AI agents can be as helpful as possible.

Of course, in some sense, like you should always have been doing this, because your human colleagues also benefit from this information. But for some reason, some unfortunate quirk of human psychology, it seems to be much more motivating to write this stuff down for non-humans than for humans.

So I've been thinking about this idea of data dictionaries. Certainly not a new idea, but I'm going to show you a little spec, pretty lightweight spec, for data dictionaries that I've been working on, and a workflow that I find super compelling, where you have three files: a data cleaning script, a data dictionary, and your final cleaned data. And you mostly edit them using Claude Code. And I think this combination just feels really, really cool to me. And I should say, plus Git, because that is super important.
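One way to picture what keeping the data dictionary and the cleaned data in sync could mean is a check that every column in the data is documented, and every documented column exists in the data. The sketch below is in Python purely for illustration (the session itself works in R). The dictionary layout, the column names, the sample row, and the `check_in_sync` helper are all hypothetical; this is not the spec discussed in the session.

```python
import csv
import io

# Hypothetical data dictionary: column name -> human-readable description.
DATA_DICTIONARY = {
    "device_id": "Identifier for the elevator device",
    "zip_code": "ZIP code of the building",
    "latitude": "Geocoded latitude in decimal degrees",
}

# Made-up cleaned data, inlined as CSV text so the sketch is self-contained.
CLEANED_CSV = """device_id,zip_code,latitude,longitude
1,10001,40.75,-73.99
"""

def check_in_sync(dictionary, csv_text):
    """Compare documented columns against the CSV header.

    Returns (undocumented, absent): columns present in the data but not in
    the dictionary, and columns in the dictionary but not in the data.
    """
    header = next(csv.reader(io.StringIO(csv_text)))
    data_columns = set(header)
    documented = set(dictionary)
    undocumented = sorted(data_columns - documented)
    absent = sorted(documented - data_columns)
    return undocumented, absent

undocumented, absent = check_in_sync(DATA_DICTIONARY, CLEANED_CSV)
print("In data but not documented:", undocumented)
print("Documented but not in data:", absent)
```

A check like this could run at the end of the cleaning script, so that a column added during cleaning is flagged until the dictionary describes it.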

And I'm going to be showing you these tools today in Positron, with Claude Code, and with a Claude Code tool called MCP REPL, which Isabella will hopefully drop a link to in the chat. So MCP REPL is an open source tool developed by my colleague Tomasz. It basically allows any AI tool to run a persistent R or Python session, so it can run data analysis code for you. You can also do this using RStudio with Posit AI and Posit Assistant, or Positron with Posit AI and Posit Assistant, and other tools, and get very similar results.

I am just playing around with this one at the moment, so do not read anything particular into my use of tools. This is just one of my experiments, trying different tools, and seeing how they feel, and how they should inform the design of my work and of future tools.

We've got at least three different, oh, OK, we have zero, 10,000, star, star, dash, dash, dash, dash, zero, all used to represent missing values in various places.

I feel like that's a really excellent use of Claude. Tell me what all these disparate missing values are in all these different variables. Yeah, and fix it for me.

Yeah, I remember that's one of the hardest things when I first started in data: I'd read my own data file in and I couldn't get anywhere because of all of the different and completely unique missing values that were used for all these different things. And this just makes it so fast. You don't need to have all of the code to do this at the top of your head. Claude can generate it, and it's still super, super important that you're reading this code, because you want to check that it's doing the right thing, but it's much, much faster than having to type all of that out by hand.
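The kind of code being described, turning column-specific placeholder strings into real missing values, can be sketched roughly as follows. This is a Python illustration rather than the R code generated in the session, and the column names, the placeholder sets, and the `replace_placeholders` helper are assumptions made up for the example.

```python
# Hypothetical mapping: column name -> strings that really mean "missing".
# Both the columns and the placeholder values are illustrative assumptions.
PLACEHOLDERS = {
    "speed_fpm": {"0", "10000"},
    "approval_date": {"**", "----"},
}

def replace_placeholders(rows, placeholders):
    """Return copies of the rows with placeholder strings replaced by None."""
    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy so the raw rows are left untouched
        for column, sentinels in placeholders.items():
            value = new_row.get(column)
            if value is not None and value.strip() in sentinels:
                new_row[column] = None
        cleaned.append(new_row)
    return cleaned

# Made-up rows standing in for the raw data.
rows = [
    {"speed_fpm": "350", "approval_date": "----"},
    {"speed_fpm": "10000", "approval_date": "1998-04-01"},
]
for row in replace_placeholders(rows, PLACEHOLDERS):
    print(row)
```

Keeping the placeholders per column matters because a value such as 0 can be a legitimate measurement in one column and a stand-in for missing in another, which is exactly the kind of detail worth checking when reading generated code.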

Okay, we have time for one more question from Nathan and it says, so in this case, the parquet file is checked into Git. Is that more advisable when working with LLMs? Like normally I've heard to avoid checking data into Git.

It's really more about the size of the data, I think. If you can commit the data into Git, it just makes the rest of your life so much easier because that data is now versioned along with all of your code. And this data happens to be public, non-private, non-sensitive too. So that's an aspect of this. So my advice is, if you can commit it to Git, you should commit it to Git, because it's going to make it easier to see how the data has changed over time. You can roll back to previous versions, all of those really nice features of using Git. There are lots of cases where you can't put the data into Git: it's too big, it's confidential, lots of reasons. This workflow still works there. You'll still have that clean.csv, you'll still have that data dictionary, you just won't wind up with the parquet file checked in.

Exploring geocoding and the Central Park mystery

Okay, so this is interesting. We've got a classic weird-ass scroll problem. I just asked it, what could I do? Convert this to numeric. We could fix the obviously wrong geocodes, with some of them in Pennsylvania. This is kind of interesting. This is a thing where I'd be like, okay, yeah. Yes, we could use a latitude and longitude bounding box to zoom in on New York City. I would have zero clue what those are.

Actually, let's do that one. Let's look at the non-NYC zip codes. Start by writing map.R. And I'm just going to do that. What I'm hoping it will do is create a plot of all of the latitudes and longitudes, just because I've given it enough context to say that I want a map and the file name. And we'll see what that does.

Okay, it's doing some, oh, interesting. So it's decided to look at New York City zip code prefixes. That's interesting. And it's actually sort of interesting: it has actually run that code and looked at the plot, but we can't see the plot. So I'm just going to source this in RStudio, sorry, in Positron, and look at that. And it is pretty obvious that something is wrong with those; there are some that are way outside.

So yeah, drop the points that are obviously out of range. Okay, and it's updated my map.R, which is fine. Let's just move around that. Okay, that looks more like New York City, I think, at least for my not terribly amazing knowledge of New York City.
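Dropping out-of-range points with a latitude and longitude bounding box can be sketched like this. It is a Python illustration, not the contents of the map.R script from the session. The box limits are rough approximations of New York City's extent chosen for the example, and the points and the `in_nyc_box` helper are made up.

```python
# Approximate bounding box for New York City in decimal degrees.
# These limits are a rough assumption, not values taken from the session.
NYC_LAT_RANGE = (40.47, 40.92)
NYC_LON_RANGE = (-74.26, -73.70)

def in_nyc_box(lat, lon):
    """True if the point falls inside the approximate NYC bounding box."""
    if lat is None or lon is None:
        return False  # treat missing coordinates as "cannot be mapped"
    return (NYC_LAT_RANGE[0] <= lat <= NYC_LAT_RANGE[1]
            and NYC_LON_RANGE[0] <= lon <= NYC_LON_RANGE[1])

# Made-up points: one in Manhattan, one in Philadelphia, one not geocoded.
points = [
    {"id": "a", "lat": 40.7794, "lon": -73.9632},
    {"id": "b", "lat": 39.9526, "lon": -75.1652},
    {"id": "c", "lat": None, "lon": None},
]
kept = [p for p in points if in_nyc_box(p["lat"], p["lon"])]
print([p["id"] for p in kept])
```

For a real analysis it may be better to set out-of-range coordinates to missing rather than drop the rows, so the rest of each record survives.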

Just so you know, if you're not looking in the chat Hadley, we currently have a vote going on on how much this vibe session is going to cost, but we have no idea if you're going to be able to tell us at the end what the token cost is.

And it looks like while we are waiting on that /cost, we do have a question that I think I missed. Jeff said, I'm curious to know how much this hour of using Claude Code will cost. Can we get an update at the end?

Well, that's dumb. I cannot tell you that because I'm using my Claude Code subscription.

Aha, well, David had also asked an easier one: which model is Hadley using?

/model. The recommended one, Opus 4.7, with its 1 million context window.

And I will say, I'm trying to think how to put this, I guess I don't care about money. If it saves me a minute or two and it costs me $5, I don't care. I am not a cost-conscious consumer of LLMs. And I 100% accept that it's coming from a place of privilege. But I do think for what we're doing here, I suspect that there are cheaper models that would do almost as well and substantially bring the cost down. But I don't care about that because I have little time and a lot of money, so I optimize for that.

So let's just, let's, oops, close that plot down to see what we've done.

I think we all want Hadley to use whatever he has to use to develop all of the cool things that we get to use.

And then Connor had asked a question in the chat that I didn't understand. What CRS is it using to determine what coordinates are inside New York? Connor, what is CRS? Coordinate reference system. Coordinate reference system. I don't know. I don't care, but I trust that plot. So again, if you're doing a real analysis, you would want to figure that out. But I'm guessing that Claude has enough knowledge to know the location of New York City, and based on what I saw on that map, that's pretty good.

I will show you one other thing I saw because it's kind of related. Can you update map.R to add a map of elevators around Central Park? Because this was kind of interesting and I spent a little bit of time looking at this previously. And again, this is something that would be so tedious to do by hand, and I will see how well Claude actually does here, but it probably knows the approximate latitude and longitude of Central Park and it can figure that out and update this code.

I'm definitely showing my lack of knowledge for spatial analysis here. So thank you for everybody who's adding context about CRS in the chat. I have no idea why I decided to put a dark green rectangle there. That seems pretty pointless, but this, yeah, I did this earlier and this is kind of interesting, right? Like, so what's this elevator like right in the middle of Central Park?

So let's ask. Can you pull out the rows for the elevators that are inside of Central Park? Let's see, you can park. Oh yeah. The bounding box is clearly totally wrong because it's square. So Claude just messed that up. Elevators inside the buildings, inside the park: the Met, the Delacorte, the Boathouse, the Tavern on the Green. I'm skeptical about this. I'm pretty sure that's the Met there. That's the Metropolitan Museum of Art. I'm sure that has elevators, but this one, like, right in the middle.
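One reason a plain latitude and longitude box is a poor fit for Central Park is that the park is a long rectangle tilted relative to north, so an axis-aligned box around it also covers blocks outside the park. A point-in-polygon test against the park's corners is one alternative. The sketch below uses Python and a standard ray-casting test; it is not the code Claude produced in the session. The corner coordinates are rough approximations, and the two test points are made up for illustration.

```python
# Approximate corners of Central Park as (longitude, latitude) pairs.
# These coordinates are rough assumptions for illustration only.
CENTRAL_PARK = [
    (-73.9819, 40.7681),  # southwest corner
    (-73.9730, 40.7644),  # southeast corner
    (-73.9493, 40.7969),  # northeast corner
    (-73.9581, 40.8005),  # northwest corner
]

def in_bounding_box(lon, lat, polygon):
    """Axis-aligned check: is the point within the polygon's lon/lat extent?"""
    lons = [x for x, _ in polygon]
    lats = [y for _, y in polygon]
    return min(lons) <= lon <= max(lons) and min(lats) <= lat <= max(lats)

def point_in_polygon(lon, lat, polygon):
    """Ray-casting test, treating lon/lat as plane coordinates."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

park_center = (-73.9654, 40.7829)  # roughly the middle of the park
east_of_park = (-73.9520, 40.7660)  # inside the box, but east of the park

for name, (lon, lat) in [("park_center", park_center),
                         ("east_of_park", east_of_park)]:
    print(name,
          "box:", in_bounding_box(lon, lat, CENTRAL_PARK),
          "polygon:", point_in_polygon(lon, lat, CENTRAL_PARK))
```

Treating longitude and latitude as flat coordinates ignores the coordinate reference system question raised in the chat; over an area this small that is a reasonable shortcut for a sketch, but a real analysis would use a spatial library and an explicit CRS.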

Oh, and now we're in, now we've gone into cool. Uh-oh. What is that? Cool gibberish mode. We are summoning the ancients now.

Connor, thank you in the chat for giving us some more context. CRS is how we know how to plot coordinates on the earth, adjusting for the fact that the earth is not flat.

Okay, we had a question, which I will scroll up to, from Nellie that says, is there a way to quantify environmental and electrical costs? Like not ones that affect one's personal finances, but still have an effect on the world?

That is a very hard question, I think. Yes, I think it is important to be worried about the environmental aspects of LLMs. For my usage of LLMs personally, I am reasonably certain that they are a pretty small proportion of my overall impact on the environment. And so what that means is, if I want to do good for the environment, then the place to start is not to decrease my LLM usage, it's to fly less, or, you know, we have electric cars and a solar-powered roof. So those are the things, like, you know, I really believe in climate change and we should all be doing things to try and make the environment better individually. And so if you want to do that, I think the way to do that is not to decrease your LLM usage, but to look for other things. But then obviously at a societal level, individually it might not be a big use, but across millions of people, it adds up to be a big impact. And now I'm out here writing tools to make LLMs easier to use, so how do I kind of square that with caring about the environment?

I think that Nellie has a good point, which I think about a lot. Nellie says, there's certainly a difference between people making AI slop videos en masse and people making code to do research.

And if it helps, I am offsetting all of our carbon footprints by never leaving my house. Okay, we have cooked. And I think that Claude has completed something, but I'm not sure. Yeah, I think, oh, it says no elevators are generally located inside Central Park. That's what Noor was asking. Is this supposed to be Central Park? Yes, I think so.

And so it's fair, I think, yeah, I mean, this is now, I'm like, how could I specify that that is that one point I really care about?

Because Central Park is definitely a big rectangle on the map. Yes. We know that.

Let's just try one more thing, let's see if I can get this on a leaflet map. I'm basically worried about doing these interactive things because it's like maybe thousands of points there and maybe they'll.

OK, this is where it's geocoded it to and it is a freight elevator and the address is 1000 Fifth Avenue. So I think that is a geocoding mistake. I do not believe there is an elevator in the Great Lawn or somewhere near it.

I guess it could be the New York Central Park Precinct, but given that the address. Let's just pull this out and now we could now like now it's just too.

Yeah, Dan says that point is in the middle of the Great Lawn or a reservoir.

I'm also seeing how this is the downside of using an AI for everything. I don't even know what the name of that variable is. So let's go to my data dictionary.

Is this workflow actually faster?

This is where we have to jump in with our last question that is unanswered, which is Marco's asking, how do you know this is actually faster or more productive than just writing the code? Like the back and forth seems like it's a lot.

Yeah, yeah. I think that's a good question. And like, or like, is it faster or is it just does it take the same amount of time, but it's just different work? I don't know. I kind of feel like I could have probably written this code fairly quickly. Like this is legit. This is legitimate time saving. Like I would have to go and do some research to find that out.

But yeah, I don't know. I think if I was less familiar with R code, this would have taken me longer to write. But as long as you can kind of read R code and interpret it, I think you don't need a particularly sophisticated understanding of R code to be able to read this and verify that it's doing what you wanted. So I think on the whole, this is much faster. It's also been much slower, because I'm explaining everything I'm doing to you all and answering questions. I do think it's fast. I think it's like 10 times faster. It's not faster for everything. But you also don't have to use it when it's slower. You can still do all of this by hand. I think it's still a useful workflow.

But that whole idea of keeping these three files in sync, I think is really powerful and really useful. And, you know, if you have more feedback about this, I'd love to hear about it, because this is something that I am going to be working on more over the coming weeks and months.

Yeah. Well, I will say Rachel and I have been talking about having a data science hangout all about these, like, existential AI questions with some Posit AI brains. And so look forward to that. I think in the future, it'll be a great space to have these kinds of conversations and talk about this stuff out loud.

And then also as an update, Dan Chen put a link in the chat to a Gizmodo article about an artificial cave beneath Central Park, and that maybe that is a maintenance elevator. The conspiracy continues.

Because inside the article, elevator is mentioned three times. Like there is definitely somebody riding an elevator beneath Central Park.

I guess the other thing we could do actually is if we go back to that code. So you thought you were coming to learn about data dictionaries. Actually, we have one minute left till the top of the hour, and we are solving a conspiracy theory about the elevator in the middle of Central Park.

Yeah, this is the official elevator data where this came from. Yeah, like no real information. I'm assuming 1000 Fifth Avenue is, yeah. The Met.

All right. Well, I have to end us. I have to stop us from what we're doing, and we have to move on.

Featured software

bookdown

bookdown.org

leaflet

pointblank