Transcript#

This transcript was generated automatically and may contain errors.

I am joined today by our lab manager Hadley Wickham. Hadley, would you introduce yourself?

Hi everyone, I'm Hadley. I make R packages and do R stuff.

Hadley makes R packages and does R stuff, and today Hadley is going to be talking about the combination of data dictionaries, Claude, parquet files, in order to help us make our data sensible to not only human beings, but also to machines. Hadley, will you give us a little rundown and then feel free to share your screen and jump right in.

Okay, and I made a couple of very short notes before we get into it, just to remind me what we're talking about.

And basically, since the last data lab that I came on, I kind of realized if you're going to be using AI agents to help you do data analysis, you need to write down what you know about the data, so that AI agents can be as helpful as possible.

Of course, in some sense, like you should always have been doing this, because your human colleagues also benefit from this information. But for some reason, some unfortunate quirk of human psychology, it seems to be much more motivating to write this stuff down for non-humans than for humans.

it seems to be much more motivating to write this stuff down for non-humans than for humans.

So I've been thinking about this idea of data dictionaries. Certainly not a new idea, but I'm going to show you a little spec, pretty lightweight spec, for data dictionaries that I've been working on, and a workflow that I find super compelling, where you have three files, a data cleaning script, a data dictionary, and your final cleaned data. And you mostly edit them using Claude code. And I think this combination just feels really, really cool to me. And I should say, fast get, because that is super important.

And I'm going to be showing you these tools today in Positron, with Claude code, and with a Claude code tool called MCP REPL, which Isabella will hopefully drop a link in the chat. So MCP REPL is an open source tool developed by my colleague Tomasz. It basically allows any AI tool to run a persistent R or Python session, so it can run data analysis code for you. You can also do this using RStudio, and Posit AI, Posit Assistant, or Positron, and Posit AI, and Posit Assistant, and other tools, and get very similar results.

I am just playing around with this one at the moment, so do not read anything particular into my use of tools. This is just one of my experiments, trying different tools, and seeing how they feel, and how they should inform the design of my work. And how they should inform the design of future tools.

What is MCP?

Okay, really quickly, since you mentioned MCP, can you talk a little bit about what that is? Noor had asked in the chat.

MCP, I can't remember what it stands for. Even if I did, it's a useless acronym, because it doesn't really explain what it does. It's Model Context Protocol. Yeah, it's basically a way of giving AI agents access to new tools to do things. I don't know, MCP servers can kind of do anything. So in this case, we're going to be using them to run R code. So this is just kind of a general extension or plugin system for AI coding tools. And we're using this MCP REPL, which works for R and Python.

The elevators dataset

And what I'm going to be doing is, I'm going to be working with this dataset called Elevators, that my colleague, Emil, gave to me, because he really likes it, because it's a fun dataset. We're going to start with a CSV file, with basically no knowledge about what's going on in this file. And kind of over the course of the next hour or so, sort of build up some knowledge and write all that down. And hopefully come up with a cleaner dataset at the end of it.

And so I'm going to kick this off by asking Claude, write me a plain R script to read Elevators and CSV, and save a parquet file.

So if you don't know what a parquet file is, you should learn about them, because they are basically like a CSV file, but basically better in every single possible way. They're smaller, they're faster. You don't have to worry about different types and different missing values. You can read them into every single programming language you can imagine.

So I've done that, and Claude is doing something. Okay, and it's come out with this clean.r script. Wow, okay, it's really gone all out there. It's done a bunch of stuff that I didn't want, and I didn't do 30 minutes ago when I tried this out. So I'm going to tell it, hey, ramp it back.

So I want us to start just as simply as possible. Yeah, okay, that is what I am looking for. Why has it chosen to skip that first line? I do not know, and I'm just going to delete that. Okay, so I can run that code, and now I get a parquet file. Positron has a built-in parquet file viewer, which makes it easier to explore.

Okay, so you probably could have written this code faster by yourself than with ThoughtCode, especially given that it gave us a bunch of crap we didn't want. But that is life working with AI.

Building the data dictionary

So what we're going to do next is we're going to describe this data set in a data dictionary. And I think we'll just kind of dive into that. I'm going to tell Claude to go read this.

And while I show you what I just told Claude to read, which is this spec. So data dictionary, or the data dictionary format that I've come up with, is just a YAML file. And we'll kind of see it very shortly. It's easier to understand with a concrete example. But it's basically going to describe all your tables, the fields inside those tables, relationships between those tables, and then any kind of domain-specific terminology that you might want.

And please make me a spec using what you know as our MOC elevators. And certainly, Claude knows a lot more about elevators than I do. So hopefully, it'll at least guess at what some of these possibly cryptic variable names actually do.

Okay, and it's come out with this data dictionary. Every elevator, escalator, dumbwaiter, and related conveyance device. So this data set has one table called elevators. That table has a bunch of columns. Each column has a name and a type. This is a string. We've got some information about constraints. Maybe it's a primary key, or it can never contain missing values. A description, and then some examples.

So the data dictionary includes examples, because that's super useful for humans. And it's super duper useful for LLMs as well.

Okay, and we've got device status. This is an enum, or like a factor. It has these five possible values, which mean these five possible things. So I'm not going to go through all of the variables in detail, but you can kind of see, like, this is a great place to record what you learn about the data set.

And what I'm going to do now, because I want to see how this changes over time, I'm going to just turn Git on, and I'm just going to say, first, select. Because I lied earlier, and we do actually have a little bit of context about this data set. And there's this readme file that comes with the original data set that points out a few things. So, like, zip code is frequently listed as a nine-digit zip plus four.

So what I want to do now is take that, tell Claude there's some extra context, and update.

And again, I'm going to try and use an LLM like I do normally, which is, I just kind of press my fingers on the keys in approximately the correct order, and hope that the LLM just figures it out. So I'm going to try not to worry about spelling mistakes, even though I'm presenting to you, because that's not what I do when I'm working by myself.

So again, we'll let that cook for a little bit. If you've got any more questions. We did have one question about Parquet files, and just so everybody knows, we are having a whole episode on Parquet files in June with Nick Crane, so we can deep dive then.

But the question was, are there any examples of applications that output Parquet files or scientific instrumentation that outputs to Parquet? Not that I know of. It's been a little, this Parquet files are mostly used by kind of big data engineering tools. They power like Snowflake and Databricks. You can use them with DuckDB. Less common in science, I think, and in open source data science, but these files are like really, really good. It's super duper fast. They compress really well. And so I strongly advise you learning that using them, particularly if you're doing any data cleaning. I think it's a really good idea to study if the result is Parquet.

And then Satyarishi had asked, what is a good way to store data dictionaries along with datasets? And I think that the answer is probably what we're doing right now. Exactly. So again, I think you should be putting these in a plain text format. You can use my data dictionary, not YAML format. I'm kind of, I'm reasonably confident this is a useful format and I'm going to keep working on it and make some more tools so you can easily turn this into a beautiful website, validate your data based on the spec, all that sort of stuff, but currently just kind of learning about it and trying it out and seeing what's happening.

And we have a question from Nathan also that says, is this particular data dict.yaml format specification useful for projects with only one table of data? Yes. And this indeed is one, this project only has one table of data and I think we'll see hopefully as we go along that it is useful.

Iterating on the data dictionary

So now we're just going to take a look at the diff of the data dict.yaml and see what information it has. Okay, we've got a point in time snapshot, not a live feed. Well, that's kind of obvious. So seven, so this is interesting. Seven test rows carry the literal street name dummy record and should be filtered out for analysis. So that's going to be something like, like, let's fix that shortly.

So now we've at least got all of this information from that readme and now we've encoded it in the same place. And so I'm not going to, again, I'm just going to skim this. We've got some information like, okay, this variable is really inconsistent. You know, be careful when you're using this. All of this is like super useful information if you're going to be analyzing this data set.

Okay, so I'm going to say update data dict. Okay, so now what I want to do next is start like diving into this data dictionary and fixing some of those issues. So I'm going to say, okay, remove the dummy records. Updating clean data dict.

So we're just going to now iterate, like, let's make this data cleaner. So I'm going to ask QWORD to update those three things. It's going to update my data cleaning script. It's going to put the results in the data dictionary. And then it's going to run my R code to create an updated parquet file. And then we'll look at that in the source code and see what it's done.

So it has added a line to strip out all of the street names. And then, okay, it's updated there. I'm just going to delete this. I don't think that's that important. It's removed the seven test rows. It's removed that later in the data dictionary as well.

Okay, so now I can check that and remove that many rows. And so I think, like, one of the things that's really nice about having these three files in concert is it's pretty easy for me to check that QWORD's done the right thing, right? Because I can just go in here, check that QWORD's done the right thing, right? Because I'm seeing it's updated my cleaning script and it's updated the data dictionary.

Becca had a question along that line, which was, is cloud also going to decide when it's a good time to make a commit? Uh, actually, well, let's try that out next time. Because we can, you can do that. I don't always do that. But yes, you can also tell it to commit.

So let's go back to that address and street name. Because I just noticed this white space is not normalized and may contain multiple internal spaces. So, okay, let's say normalize a white space and street.

I like the typo in normalize because it's very much what I would be doing. Cloud figures it out, man. The other thing that's nice is, like, I don't ever, like, I didn't hold down the shift key to type street underscore name.

So, okay. So again, what's it done? It's updated my cleaning script. It's using gsub to remove multiple sequences spaces with a single space. And it's trimming the white space on either end. It updates the dictionary. I'm just going to, I don't really care. I'm just going to delete that comment. Because I don't think you need to record that sort of information in the data dictionary when it's recorded in your clean.r.

We've done that. And now I can say commit it. And we'll see what core does. It's going to look at the current git status. It's going to look at the git log. I'm not sure why it's doing this. I'm just going to say, what did I go? What did it do? Add normalized street name white space. Okay. Sounds good.

So much trust.

We have a couple of questions. Do we want to fit some in right now while it's cooking? Yeah. Okay. So Brian had asked, if we see somebody scrolls way up, they'll see. Brian said, is there a way that the YAML format data dictionary could be used as an input? For example, recode underscore values, which is the new dplyr function.

Yes, probably not with recode values literally, but I think there's various things that you can't, that parquet files don't store very well. One of those things in particular is labels. Statisticians and data scientists love labeled values. Parquet files have no way to kind of add additional nice long labels to short information. So I'm imagining at some point, there will be an R package to accompany this file that will read that, read in your parquet file, read in the data dictionary and apply the labels.

Cleaning zip codes and dates

So now what I want to do next is start like diving into this data dictionary and fixing some of those issues. Looking at the zip code, like, frequently stored, I'm just going to say, like, let's just make this all five-digit zip codes. Well, no, actually, I don't like that. Like, let's get rid of, I mean, it says treat non-five-digit values as dirty. And it also says frequently stored as a nine-digit zip. OK, let's just tell me. Tell me about the sizes of zip code.

So there are most, so 82% of them are nine digits. So let's not, let's. Is it storing it as a number? Yeah, which is also. Jenna and I agree. They should be strings. You're not going to do any math on a zip code. Yes. OK, so let's do that. OK, so make it a string. And replace zero with and with proper missing values.

So again, you're going to imagine it's going to update that clean.R. It's going to update my data dictionary. And it's going to update the parquet file. And again, I just love this process because it makes it so clear, like, what's happening in every step. And then you can go back in your git history and see exactly what you've done.

Well, like last insp date is currently stored as a number. That's not good. db last per insp date is a string. That's also not good. And db approval date is also a string. OK, so let's fix that. Make sure you parse all the date columns as dates.

So this is a good example where the data dictionary is actually wrong. It says it's a date, but it's not actually recorded as a date in the file. This is typical, right? An LLM is not perfect. It's going to make mistakes. The goal is to create a process by which, over time, you discover those mistakes and fix them. And the whole thing gets better. And you can see exactly what's going on.

So now, again, we look at our git diff. And git is such an important piece of this because I can see exactly what the LLM has done at each point. It's used a for loop. Certainly, we can, I don't know if this is the way I would have written this code. But it's taking a very low-level base R approach here. But I kind of don't really care because I didn't have to write that code.

Data dictionaries and data contracts

All right, well, while we're watching McCulloch really quickly, Jenna Jordan had asked, do you see a meaningful distinction between data dicts and data contracts?

That is a very good question. I had initially thought of data dictionaries and data validation or data contracts as being very distinct. But as I have worked on this, I think they are actually very closely related. And one of my next steps is to start building out a tool. So you can take a data dictionary and compare it to a data file and discover if there are any inconsistencies. So I think that the data dictionary does actually end up specifying a lot of information that you want in a contract.

And so while here I'm focusing on the cleaning, I think you could also have written down a day or come up with a data dictionary some other way. And now you're using it to check that the data that you get is good.

So let's just quickly look at this and we'll come back to that idea. Because one thing I think we should record, one thing we want to record in this data is for dates and numbers. We should have the range here so we can at a glance see what are the valid values.

Fixing placeholder missing values

Okay, we've got a bunch of different placeholders in different columns. So let's eliminate all the placeholder values and use proper missing values instead. It's a very, very common exercise when you are converting data collected by non-data scientists who don't really have this idea of a missing value. They often use placeholder or sentinel values. Like, let's fix that and use proper missing values. So if we do counts or means, we get the correct thing without having to constantly remember, oh, in column seven, a missing value is actually recorded as a zero. So make sure you remember all the zeros before you compute a mean.

So again, that data, you see that data dictionary kind of updating. Let's see what it's done. Look, so listen to this. This is beautiful. We've got at least three different, oh, OK, we have zero, 10,000, star, star, dash, dash, dash, dash, zero, all used to represent missing values in various places.

We've got at least three different, oh, OK, we have zero, 10,000, star, star, dash, dash, dash, dash, zero, all used to represent missing values in various places.

I feel like that's a really excellent use of Claude. Tell me what all these disparate missing values are in all these different variables. Yeah, and fix it for me.

Yeah, I remember that's one of the hardest things when I first started in data was I'd read my own data file in and I couldn't get anywhere because of all of the different and completely unique missing values that were used for all these different things. And this just makes it so fast. You don't need to have all of the code to do this at the top of your head. Claude can generate it, and then you can still super, super important that you're reading this code. you can, you know, still super, super important that you're reading this code because like you wanna check that it's doing the right thing but much, much faster because you have to type all of that out by hand.

Okay, we have time for one more question from Nathan and it says, so in this case, the parquet file is checked into Git. Is that more advisable when working with LLMs? Like normally I've heard to avoid checking data into Git.

It's really more about the size of the data, I think. Like if you can commit the data into Git, it just makes the rest of your life like so much easier because that data is now versioned along with all of your code. And this data happens to be public, non-private, non-sensitive too. So that's an aspect of this. Right, so I think like that, if you can't, I think my advice is if you can commit it to Git, you should commit it to Git because it's gonna make everything, it's gonna make it easier to see like how the data has changed over time. You can roll back to previous versions, all of those really nice features of using Git. Lots of cases where you can't use Git, you can't put the data into Git, it's too big, it's confidential, lots of reasons. This workflow still works there. You'll still have that clean.csv, you'll still have that data dictionary, just don't wind up the parquet file checked in.

Exploring geocoding and the Central Park mystery

Okay, so this is interesting. We've got a classic weird-ass scroll problem. I just asked it, what could I do? Generate like this, convert this to numeric. We could fix the obviously wrong geocodes with some of the Pennsylvania. This is kind of interesting. Like this is a thing where I'd be like, okay, yeah. Like, yes, we could use a latitude and longitude bounding box to zoom in on New York City. I would have zero clue what those are.

Actually, let's do that one. Like let's look at the non-NYC zip code start by writing the map dot R. And I'm just going to do that. What I'm hoping it will do is create a plot of all of the latitudes and longitudes, just because I've given it enough context to say that I want a map and the file name. And we'll see what that does.

Okay, it's doing some, oh, interesting. So it's decided to look at New York City zip code prefixes. That's interesting. And then it's going to, so it's actually sort of interesting. Like it has actually run that code and looked at the plot, but we can't see the plot. So I'm just going to source this in RStudio and then Positron, sorry, and look at that. And that is like pretty obvious that something is wrong with those, like there's some ones that are way outside.

So yeah, drop the points that are obviously out of range. Okay, and it's updated my map.r, which is fine. Let's just move around that. Okay, that looks more like New York City. I think, at least for my not terribly amazing knowledge of New York City.

Just so you know, if you're not looking in the chat Hadley, we currently have a vote going on on how much this vibe session is going to cost, but we have no idea if you're going to be able to tell us at the end what the token cost is.

And it looks like while we are seasoning on that slash cost, we do have a question that I think I missed. Jeff said, I'm curious to know how much this hour of using Cloud Code will cost. Can we get an update at the end?

Well, that's dumb. I cannot tell you that because I'm using my Cloud Code subscription. Aha, well, David had also asked an easier one, which model is Hadley using? Slash model. The recommended to Opus 4.7. With its 1 million contacts window.

And I will say, like, I, I'm trying to think how to put this, like, I guess I don't care about money. Like, I get, like, whatever. Like, if it saves me a minute or two and it costs me $5, like, I don't care. I am not a cost-conscious consumer of LLMs. And I 100% accept that it's coming from a place of privilege. But I do think for, like, what we're doing here, I suspect, like, that there are cheaper models that would do almost as well and bring, substantially bring the cost down. But that, I don't care about that because I have little money. I have little time and a lot of money, so I optimize for that.

So let's just, let's, oops, close that plot down to see what we've done. I think we all want Hadley to use whatever he has to use to develop all of the cool things that we get to use.

And then Connor had asked a question in the chat that I didn't understand. What CRS is it using to determine what coordinates are inside New York? Connor, what is, what is CRS? Coordinate reference system. Coordinate reference system. I don't know. I don't care, but I trust that plot. So again, if you're doing a real analysis, you would want to figure that out. But I'm guessing that Claude has enough knowledge to know the location of New York City and based on what I saw on that map, but that's pretty good.

I will show you one other thing I saw because it's kind of related. When you update map.r to add a map of elevators around Central Park. Because this was kind of interesting and I spent a little bit of time looking at this previously. And again, this is something that like, this would be so tedious to do by hand and I will see how well Claude actually does here, but like it probably knows the approximate latitude and longitude of Central Park and it can figure that out and update this code.

I'm definitely showing my lack of knowledge for spatial analysis here. So thank you for everybody who's adding context about CRS in the chat. I have no idea why I decided to put a dark green rectangle there. That seems pretty pointless, but this, yeah, I did this earlier and this is kind of interesting, right? Like, so what's this elevator like right in the middle of Central Park?

So let's ask. What's, when you pull out the rows for the elevators that are inside of Central Park? Let's see, you can park. Oh yeah. The bounding box is clearly totally wrong because it's square. So Claude just messed that up. Elevators inside the buildings, inside the park, the Met, the Delacorte, the Boathouse, the Tavern on the Green, like I'm skeptical about this. I'm pretty sure that's the Met there. That's the Metropolitan Museum of Art. I'm sure that has elevators, but this one like right in the middle.

Oh, and now we're in, now we've gone into cool. Uh-oh. What is that? Cool gibberish mode. We are summoning the ancients now.

Connor, thank you in the chat for giving us some more context. CRS is how we know how to plot coordinates on the earth, adjusting for the fact that the earth is not flat.

Okay, we had a question, which I will scroll up to from Nelly that says, is there a way to quantify environmental and electrical costs? Like not ones that affect one's personal finances, but still have an effect on the world?

That is a very hard question, I think. Yes, that is, yeah, I think it is important to be worried about the environmental aspects of LLMs. I, yeah, like for me, like for my usage of LLMs personally, I am reasonably certain that they are a pretty small proportion of my overall impact on the environment. And so what that means is like, if I want to do good for the environment, then the place to start is not to decrease my LLM usage, it's to like fly less, or, you know, we have a solar, we have electric cars and a solar powered roof. So those are the things like, you know, I really believe in climate change and we should all be doing things to try and make the environment better individually. And so if you want to do that, I think the way to do that is not to decrease your LLM usage, but to look for other things. But then at like obviously a societal level, like individually, it might not be a big use, but across millions of people, it adds up to be a big impact. And now I'm like out here, like writing tools to make LLMs easier to use, like how do I kind of square that with caring about the environment?

I think that Nellie has a good point, which I think about a lot. Nellie says, there's certainly a difference between people making AI slop videos on mass and people making code to do research.

And if it helps, I am offsetting all of our carbon footprints by never leaving my house. Okay, we have cooked. And I think that Claude has completed something, but I'm not sure. Yeah, I think, oh, it says no elevators are generally located inside Central Park. That's what Noor was asking. Is this supposed to be Central Park? Yes, I think so.

And so it's fair, I think, yeah, I mean, this is now, I'm like, how could I specify that that is that one point I really care about?

Because Central Park is definitely a big rectangle on the map. Yes. We know that.

Let's just try one more thing, let's see if I can get on a leaflet map. I'm basically worried about doing anything with the map. I'm basically worried about doing these interactive things because it's like maybe thousands of points there and maybe they'll.

OK, this is where it's geocoded it to and it is a freight elevator and the address is 1000 Fifth Avenue. So I think that is a geocoding mistake. I do not believe there is an elevator in the Great Lawn or somewhere near it.

I guess it could be the New York Central Park Precinct, but given that the address. Let's just pull this out and now we could now like now it's just too.

Yeah, Dan says that point is in the middle of the Great Lawn or a reservoir.

I'm also seeing how this is the downside of using an AI for everything. I don't even know what the name of that variable is. So let's go to my data dictionary.

Is this workflow actually faster?

This is where we have to jump in with our last question that is unanswered, which is Marco's asking, how do you know this is actually faster or more productive than just writing the code? Like the back and forth seems like it's a lot.

Yeah, yeah. I think that's a good question. And like, or like, is it faster or is it just does it take the same amount of time, but it's just different work? I don't know. I kind of feel like I could have probably written this code fairly quickly. Like this is legit. This is legitimate time saving. Like I would have to go and do some research to find that out.

But yeah, I don't know. Like I think if I was less familiar with R code, this would have taken me longer to write. But as long as you can kind of read R code and interpret it, and you can be fairly, you know, I think you don't need a particularly sophisticated understanding of R code to be able to read this and verify that it's doing what you wanted. So I think like on the whole, like this is much faster. It's also been much slower, because I'm explaining everything I'm doing to you all and answering questions. Like I do think it's fast. I think it's like 10 times faster. It's not faster than everything. But you also don't, like you don't have to use it when it's slower. Like you can still do all of this by hand. I think it's still a useful workflow.

But that whole idea of keeping these three files in sync, I think is really powerful and really useful. And I, you know, if you have more feedback about this, I'd like love to hear about it, because this is something that I am going to be working on more and over the coming weeks and months.

But that whole idea of keeping these three files in sync, I think is really powerful and really useful.

Yeah. Well, I will say Rachel and I have been talking about having a data science hangout all about these, like existential AI questions with some posit AI brains. And so look forward to that. I think in the future, be a great space to have these kind of conversations and talk about this stuff out loud.

And then also as an update, Dan Chen put a link in the chat to a Gizmodo article about an artificial cave beneath Central Park, and that maybe that is a maintenance elevator. The conspiracy continues.

Because during inside the article, elevator is mentioned three times. Like there is definitely somebody riding an elevator beneath Central Park.

I guess the other thing we could do actually is if we go back to that code. So you thought you were coming to learn about data dictionaries. Actually, we have one minute left till the top of the hour, and we are solving a conspiracy theory about the elevator in the middle of Central Park.

Yeah, this is the official elevator data where this came from. Yeah, like no real information. I'm assuming 1005th Avenue is, yeah. The Met.

All right. Well, I have to end us. I have to stop us from what we're doing, and we have to move on.