Transcript
This transcript was generated automatically and may contain errors.
Alright, let's go ahead and get started. I am so excited to announce our lab leader today is Hadley Wickham. Hadley, would you like to introduce yourself really quick?
I am Hadley, Chief Scientist at Posit, and I make R packages.
So I am going to do a live analysis of this week's TidyTuesday data, which I suspect was influenced by this event. Slight suspicion.
Setting up the project
So I'm going to be using, unusually for me, RStudio to do this because I'm going to be sort of demoing slash trying out Posit Assistant because this is something that a bunch of folks at Posit have been working on recently and I want to kick the tires and see how it goes.
But I think what I'll show you, many of these ideas are going to apply like regardless of what AI assistant you use, but we're going to give it a go and see how it goes.
And so you'll notice, the way I'm going to start is I have this agents.md file. So this is a file that's going to be used by most coding agents. It's going to be loaded automatically. So this is a good place to put things that you want to apply in your analysis.
And so I'm going to put in one thing that I know is going to come up, which is: use the nanoparquet package for reading parquet files.
I'll talk about what a parquet file is in a little bit, if you haven't heard of them. And then I also promised Gen Z slang, so: please use Gen Z slang and emoji. So we'll see how that goes.
Okay, now I'm just going to quickly jump over to TidyTuesday. And to get started, I mean, there's lots of automated ways to get this data, but I'm just going to say, hey, Posit AI, can you please download the data from here and save it as parquet and create a bit of a data dictionary and save it in agents.md.
Let's see how it does. Okay, so I'm going to give it permission. So Posit AI has a pretty conservative permission model. It's not going to read anything from the web without your explicit permission.
But it tries something that didn't work. It's going to try something else. Got it. And again, it's going to ask me to run R code. If it's okay to run R code, that's kind of why we're here today. So I'm just going to say, go for it.
So you can see it's running a bunch of code. What's it doing? It loaded the tidyverse, it read that CSV file. It's just going to take a quick look at it to make sure. And then it's going to save it as a parquet file. So you can now see we've got the NZ agriculture parquet file. And then it's going to create a data dictionary.
Why parquet files?
So you might wonder why I used a parquet file.
Yeah, let's do a quick thumbs up or raise your hand if you've heard of parquet files before. Or used them before. I'm seeing lots of raised hands. Okay, quite a few.
So I'd say, like, I think parquet files are superior to CSV files in just about every possible way. Except for one, and that's you can't open it in a text editor and look. And that's because it's a binary file format, which makes it much smaller and generally much faster. And the other big advantage over CSV files is that CSV files don't really have types of columns. Like, you can kind of figure out if it's a number or a string, but there's no real way to record, like, if it's a date or a date time or a factor. And parquet files have all of that.
So what I do now, whenever I'm doing a data analysis, the first thing I do is create a parquet file. Because I know in every subsequent step of the analysis, I'm going to read that in. And I know as a parquet file, it's going to read in really quickly. And it's going to have all of the correct column types. So my factors are going to stay as factors. My dates are going to stay as dates. My date times are going to stay as date times. So really recommend doing this.
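To make the point about column types concrete, here is a minimal sketch. The session itself works in R; Python is used here only because the effect can be shown with nothing but the standard library. The row values are made up for illustration. The sketch writes a date and an integer to CSV and reads them back.

```python
import csv
import io
from datetime import date

# Made-up example row: a date and an integer count.
rows = [{"day": date(2024, 6, 30), "count": 42}]

# Write to an in-memory CSV "file".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["day", "count"])
writer.writeheader()
writer.writerows(rows)

# Read it back: CSV stores only text, so every value comes back as a string.
buffer.seek(0)
read_back = list(csv.DictReader(buffer))

original_types = {k: type(v).__name__ for k, v in rows[0].items()}
csv_types = {k: type(v).__name__ for k, v in read_back[0].items()}

print(original_types)  # {'day': 'date', 'count': 'int'}
print(csv_types)       # {'day': 'str', 'count': 'str'}
```

A parquet file stores the column types alongside the data, so a reader does not have to guess them again on every load.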
Also great if you're collaborating with anyone using any other language. Every language has tools for reading parquet files: Python, for example, and you can easily stick them into a DuckDB database.
The agents.md file and data dictionary
So I've got my parquet file. I have got a little data dictionary. And again, this agents.md file. This is going to be given as additional context every single time I use the Assistant.
So if I start a new session, I can say, what variables are in the agriculture data set? And it shouldn't have to do anything to answer this. Because it's just going to look at that agents file, that memory file. And it can kind of tell me what's going on.
So where were we? So now we've got the data loaded into R. We've got it as a parquet file. It's 32 kilobytes. So not dealing with a big data set.
So it's got five variables. Lowkey, no cap. What have we got? We've got the year, end of June. Oh, this is one of the most irritating things about New Zealand. The fiscal year ends in June, which is fiendishly confusing.
We've got some agricultural categories. We've got the value. We've got the unit and we've got the label.
Exploring sheep population data
So one of the motivating things about this was: it used to be, when I was growing up, that for every human in New Zealand there were approximately 22 sheep. And now that has radically dropped to about 4.5 sheep. So let's see if we can do a time series of that.
And one of the suggestions from Posit AI is create a visualization of sheep populations. So let's click on that. And see what happens.
So it's going to run some code to look at various different types of sheep. We've got a lot of very sheep specific terminology, which now I'm like, do people know what a hogget is? That's very sheep specific. Anyway, I think a hogget is a sheep that is two years or older. A lamb is less than one. A wether is a male sheep that's been neutered. So you're getting a lot of sheep terminology today.
From Derek in the chat real quick. Derek says, I see you're using Claude Sonnet 4.5. When would you switch to Opus?
That's a good question. Probably for stuff like this, even Haiku is fine. Like I know. For me, when I'm like programming, I'm just like Opus 4.5 all the time. It's not worth. Like for me, I don't really care about the cost. So I'd rather just spend whatever it takes.
But maybe if we see anything squirrelly or looking a bit weird, we will try switching to Opus and see if that does any better.
Okay, we've got this plot now. Kind of interestingly, I can say, well.
So this is one of these places where, like, yes, I do kind of know how to do this in ggplot2. Well, at least I did at some point.
We don't have yearly data here. So it's just doing a bunch of calculations.
Okay. So here's the tea. Oh, my goodness. Here's the Gen Z slang that that was promised.
So basically my entire life, the number of sheep in the sheep population in New Zealand has been dropping. But what I kind of what I really want to do is also like it's really it's not just the absolute number of sheep that I think is interesting. It's the ratio of sheep to people. So let's see what we can do there.
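The sheep-to-people ratio is just a per-period division of two series matched on a shared key. Here is a small sketch in Python (the session itself uses R). The counts are placeholders, not real data: they are chosen only so the ratios match the rough figures quoted in the talk, about 22 and about 4.5. The function name `sheep_per_person` and the period labels are made up for this illustration.

```python
def sheep_per_person(sheep, people):
    """Return sheep-per-person for every period present in both inputs."""
    shared = sorted(set(sheep) & set(people))
    return {period: sheep[period] / people[period] for period in shared}

# Placeholder counts (NOT real data), chosen only so the ratios match
# the rough figures quoted in the talk: about 22 and about 4.5.
sheep = {"childhood": 66_000_000, "today": 22_500_000}
people = {"childhood": 3_000_000, "today": 5_000_000, "unmatched": 1}

ratios = sheep_per_person(sheep, people)
print(ratios)  # {'childhood': 22.0, 'today': 4.5}
```

Periods that appear in only one series are dropped, which is why the quality of the population series matters as much as the sheep counts.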
And I'm going to try and do what I would do when I'm not doing this live, which is I'm just going to type. And I'm not going to fix any of my typos because I kind of assume that the LLM will like figure it out. So it might be confusing for you as a human, but hopefully the LLM will figure it out.
Getting real New Zealand population data
OK, so Jared asks, is that just correlation or is Hadley responsible for depleting the New Zealand sheep population? Is this a causal relationship here? Probably not. OK, well, legitimate question.
So the code says get New Zealand population data from a reliable source using World Bank UN data or similar. But for now, just approximated it. OK, so just made up the New Zealand population numbers and oh, and it just it just made up numbers for like some randomly picked sequence of years. And then linearly interpolated them.
So this is like one of the reasons, you know, like like AI is super useful for helping you with your data analyses. But it can make some questionable decisions. So let's say let's let's get some real New Zealand population data.
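To see why that shortcut is questionable, here is a minimal sketch of linear interpolation between a few anchor years, written in Python rather than the R used in the session. The anchor values are arbitrary placeholders, not population figures, and the function name `interpolate` is made up for this illustration.

```python
def interpolate(anchors, year):
    """Linearly interpolate a value for `year` from {year: value} anchors."""
    points = sorted(anchors.items())
    for (y0, v0), (y1, v1) in zip(points, points[1:]):
        if y0 <= year <= y1:
            fraction = (year - y0) / (y1 - y0)
            return v0 + fraction * (v1 - v0)
    raise ValueError(f"{year} is outside the anchor range")

# Arbitrary placeholder anchors in arbitrary units (NOT population data).
anchors = {2000: 100.0, 2010: 200.0, 2020: 250.0}

print(interpolate(anchors, 2005))  # 150.0
print(interpolate(anchors, 2015))  # 225.0
```

Every in-between value looks plausible, but it carries no information beyond the anchors themselves. If the anchors are invented, so is everything derived from them.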
Now, you see, you can use something built into your computer or some other add-on package. This is something we are, I think, pretty interested in. Joe Cheng, in particular, has done some experiments, and it is interesting how using voice changes the way you interact with data. You might have seen my ggbot2 demo that I did at posit::conf last year. Definitely interested in doing more of that.
So it's doing a bunch of work. OK, it looked to see if we could use the wbstats package. OK, then it just called the World Bank API directly. Which is pretty cool. Whoa, whoa. It's just like, wow. So it clearly knew to do that, which is interesting. And then it got a little JSON data and parsed it.
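A rough sketch of what "call the API and parse the JSON" can look like, in Python rather than the R the assistant actually wrote. The endpoint, the indicator code `SP.POP.TOTL`, and the response shape (a two-element list of paging metadata followed by records) are assumptions based on the public World Bank API v2, not something shown in the session. The function names and the sample payload values are made up.

```python
import json
from urllib.request import urlopen

# Assumed endpoint: World Bank API v2, total population indicator.
URL = ("https://api.worldbank.org/v2/country/NZL/indicator/SP.POP.TOTL"
       "?format=json&per_page=100")

def parse_population(payload):
    """Turn a World Bank style response into {year: population}.

    Assumes a two-element list: paging metadata, then records with a
    string "date" and a numeric (or null) "value".
    """
    _metadata, records = payload
    return {
        int(record["date"]): record["value"]
        for record in records
        if record["value"] is not None
    }

def fetch_population(url=URL):
    """Download and parse the data (needs network access; not called here)."""
    with urlopen(url, timeout=30) as response:
        return parse_population(json.load(response))

# Offline check with a tiny made-up payload in the assumed shape.
sample = json.loads(
    '[{"page": 1}, [{"date": "2001", "value": 200},'
    ' {"date": "2000", "value": 100}, {"date": "1999", "value": null}]]'
)
print(parse_population(sample))  # {2001: 200, 2000: 100}
```

Keeping the parsing separate from the download makes it possible to check the parsing logic without network access, which is also a handy way to review what an assistant wrote.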
So, you know, most of what I've been doing here is just kind of ephemeral in this chat, but of course, as a data scientist, you want to end up with some reproducible artifacts. So I'm telling it to create this New Zealand population file.
The agents.md file explained
What's the purpose of the agents.md file and what goes in there?
So this is kind of the memory for the agent. So this is where you put anything that you want the agent to always know every time you start with a clean chat. So I think there's kind of two kind of big things. Yeah, it's like code style. And I would say data dictionary.
So it's like a really involved system prompt. Yes, it's exactly a system prompt, added automatically. Obviously, Posit AI has a system prompt of its own, and this gets combined with it automatically. So this is just something that you should be building up as you work with a data set. I think this is a good place to put your basic data dictionary. You might actually want to put that in the readme.
There's a sort of balance you want to strike between like what information are you giving specifically to the LLM and what information do you want to share with your colleagues as well? And anything that you want to share with another human, you probably want to put that in the readme.
So I'm just going to keep this stuff more for agents. So when I notice a behavior that annoys me, I'm going to add something to this agents file to try and stop it from doing that thing.
Posit Assistant vs Positron assistant
Daniel asked, how does Posit AI in RStudio compare with the current Positron Assistant? And I think that the answer is that it's like an amalgamation of Databot and Positron Assistant. It's complicated. Like they are going to align. They're going to become the same thing eventually.
The big challenge is like Posit AI is basically like a consumer product. And then what that means is like, we're going to pick the LLM for you. And we believe that like Claude is the best right now, the best LLM for doing writing R code and doing data science. And we're going to pick that for you. And we're going to write all of our tooling to work specifically with Claude.
Now, Positron Assistant works inside Positron, which has to get installed inside lots of enterprise customers. Big companies, for obvious reasons, don't want to be sending their data out to some random LLM on the internet. They want to be using their own internal LLM, and companies have standardized on all sorts of different LLM providers. Sometimes the big ones, like Claude and Anthropic, or they're going through Azure or Databricks or Snowflake or something. Sometimes they'll pick some total wackadoo thing and they've chosen to inflict that on their employees, and Positron Assistant has to work with all of those. So in the long run, they're going to come together and just be one tool, but it's taking a little while to figure out how to do all of that.
Reproducibility and ephemeral code
This is a great question that I get every time we talk about this: where is all of this code that it's creating in this side panel going? How do you keep track of it? Yeah, it does not go anywhere. This is purely ephemeral code. You should treat it the same way as the code that you type into the console.
What I've done here is explicitly asked it to set up an R file. This is the file that I would then check in to Git. This is the thing that I can start with a clean session, run this, and now I've recreated that parquet file again. So nothing has changed in the sense of, how are you going to record what you're discovering? For your analysis, you should still be creating R files and QMDs or Markdown files. That's still your responsibility.
And I think it still makes sense to be like starting like clean sessions again, to ensure that you do have all that information.
Creating a sheep-to-people ratio plot
And now I can say, create a plot that shows the ratio of sheep to people on the screen. So you can see, because I said in the agents.md that there's a data dictionary in the readme.md, it knows to do that. It's going to, okay, I'm going to start by reading the readme. Always a good sign.
Then it's going to look for the sheep data again. It's picked total sheep. It's done. It's written a bunch of code and it's created this plot. Now again, like this plot is ephemeral. Like you should think of this chat pane, like you're typing stuff in the console. If I want to keep this, I should say like, okay, make me a Quarto report about this.
Okay. So it's going to give me a report. It kind of offers the sections and I'm just going to say, go for it.
It's going to slap. That's good.
Questions and answers
Does Posit Assistant also support skills files?
I think either it does or it will do in the near future. So skills files: basically, the idea is that instead of dumping absolutely everything into one giant agents file, you can say something like, if you need to know how to do a linear regression, read linear-regression.md. And then Posit AI will only read that if you need to do a linear regression. In a nutshell, that's all skills are: a reference to a markdown file that will get read on demand.
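The "read on demand" idea can be sketched as a small index that keeps only one-line descriptions in context and opens a file when it is asked for. This is a toy illustration in Python under stated assumptions, not how Posit Assistant or any other agent implements skills. The class `SkillIndex`, the file name, and the file contents are all hypothetical.

```python
import tempfile
from pathlib import Path

class SkillIndex:
    """Keeps short descriptions in memory; reads a skill's full text only on request."""

    def __init__(self, skills):
        self.skills = skills  # {name: (description, path)}
        self.loaded = []      # names whose files have actually been read

    def summary(self):
        """The small always-present context: one line per skill."""
        return "\n".join(f"{name}: {desc}" for name, (desc, _) in self.skills.items())

    def load(self, name):
        """Read the full skill file on demand."""
        _, path = self.skills[name]
        self.loaded.append(name)
        return Path(path).read_text(encoding="utf-8")

# Hypothetical skill file, created in a temp directory so the sketch runs anywhere.
folder = Path(tempfile.mkdtemp())
skill_path = folder / "linear-regression.md"
skill_path.write_text("Check residual plots before trusting the fit.\n", encoding="utf-8")

index = SkillIndex({"linear-regression": ("how to fit and check a linear model", skill_path)})
print(index.summary())  # only the one-line description is always in context
print(index.loaded)     # [] (nothing has been read yet)
text = index.load("linear-regression")
print(index.loaded)     # ['linear-regression']
```

The trade-off is the one described above: the always-loaded file stays short, and detailed instructions cost context only when the task needs them.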
So the agents.md, I'm kind of blanking on the convention. This is a convention where lots of different AI agents will look in this file. So Posit Assistant definitely will. I think if you use Claude Code in the terminal, it will also look there. If you use Codex or Gemini, a lot of these other ones are standardizing on this agents.md file, because for a while, a long, long time ago in AI times, it looked like you'd have a claude.md, a gemini.md, a codex.md, and then people decided that that was ridiculous and we should just have one kind of MD.
But I do think this idea of telling it to read your readme is a good principle, because the readme is aimed at humans. Most of that context, you want the agent to have as well. And I think one of the things that LLMs make us do more of, that is good for us and that we always should have been doing more of, is writing down more of these assumptions and being very clear. So when we're working with both our AI collaborators and our human collaborators, everyone knows what we're talking about. We're all on the same page.
Fixing AI-generated code quirks
So let's fix one thing that drives me bonkers about AI generated code.
So this is the type of thing you can put in your agents file, like your pet peeves. And now I can say, fix the thing.
So it's going to propose a change and it's going to delete that.
It's also nice. Like, it's always going to apologize for screwing up, even if it didn't do anything like that.
Okay. So I've got this QMD. Now, again, like this is the thing that I, this is the thing that's going to persist over time.
I don't love the fact that it's like added all of this text, but I do like the fact that it's talked about the data sources. The data prep was pretty simple. It extracted the sheep data, aligned it with population data.
Investigating why sheep numbers fell
So it's sticking to the data, we think at this point. So it's, yeah, it's sticking to the data here.
And then it says, look at key years. And it just picked 1982, 1985, 2000 and 2025. I've got no idea why it picked those years. Not evenly spaced. No, not at all. Very not evenly spaced.
And then it tells me that the 1980s were the turning point. Okay, let's take a peek. Okay, now it's giving me some speculation. Economic reforms.
Okay, now this is like, is my historical knowledge of New Zealand sheep agriculture up to this? I think that's correct. I do know that New Zealand agriculture used to be heavily subsidized by the government and then they dropped those subsidies in favor of more free market economics. I have no information about the global wool market.
Okay, well, this one is a question. Dairy conversion and land use changes. This seems a little implausible to me because, basically, sheep are fine on steep hills. So most of the sheep country in New Zealand is pretty steep hills and you can't put cows on steep hills. So I'm skeptical about this.
So let's see if we've got some data here. Look at the dairy cattle numbers to see if they rose as sheep fell.
I think it's important to point out as we're going through this, that like if you did not know how to code in R, you would not be able to filter through and quickly look back at the code that Posit Assistant is writing and tell whether or not it's BS. And so whenever somebody asks me like, do I still need to know how to code? I always say yes, because even if you are using an LLM, you have to be, like you're training an intern right now. You're training like a wildly talented intern who is just like super unhinged and chaotic and like you still have to look at everything that they do.
Do you still need to know how to code in 2026? Oh, yeah.
Just on the unhinged intern, like yes, this is an unhinged intern type thing. This is not literally the smoking gun.
So it's showing that there are lots more dairy cattle, but nowhere near enough for that to be explanatory, in the context of millions of animals.
Shiny apps for exploratory analysis
Like I looked at the previous TidyTuesday, which is the Astronomy Picture of the Day. And one of the things I did with it is a clustering, a text clustering on the image descriptions. And I could have figured this out eventually, but just getting something really, really quickly, giving me the results without me having to remember all the details of how do I use tidytext to do text clustering, that was super, super appealing.
Like it chooses the number of clusters. And as a reminder, this is a couple of weeks ago now. But like, so when you do cluster analysis, right? You have to choose the number of clusters and there's a bunch of heuristics. But I think whenever you do cluster analysis, you want to do a little bit of experimenting. Like is my, are my results like robust to different numbers of clusters? And so what I did, I was like, okay, well, well, like make me a Shiny app that lets me explore that.
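One way to picture "are my results robust to different numbers of clusters" is to fit the same data for several values of k and compare a tightness score. The sketch below is a toy in Python, not the text clustering or the Shiny app from the session. It runs a tiny one-dimensional k-means with deterministic starting centres on made-up points and reports the within-cluster sum of squares for each k. The function names are made up for this illustration.

```python
def kmeans_1d(values, k, iterations=20):
    """Tiny 1-D k-means (Lloyd's algorithm) with deterministic starting centres."""
    ordered = sorted(values)
    n = len(ordered)
    # Start from k evenly spaced points of the sorted data.
    centres = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centres[j]))
            groups[nearest].append(v)
        # Recompute each centre as its group's mean; keep the old centre if empty.
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return centres

def within_ss(values, centres):
    """Sum of squared distances from each point to its nearest centre."""
    return sum(min((v - c) ** 2 for c in centres) for v in values)

# Made-up points that fall into three obvious groups.
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]

scores = {k: within_ss(values, kmeans_1d(values, k)) for k in range(1, 5)}
for k, score in scores.items():
    print(k, round(score, 2))
```

A large drop in the score up to some k followed by little improvement is one common heuristic for picking k. As the discussion says, it is a starting point for experimenting, not an answer.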
This is one of my favorite uses for Shiny apps, y'all. I make apps all the time that let me step through something and look at it. Like one time I was doing a little project where I was classifying Bluesky posts. And so I was like, I'm going to pre-classify these with an LLM and then I want to build a Shiny app and have the Shiny app feed me each one and tell me what it had assigned it. And then I can decide whether or not it was something. I was trying to classify replies versus posts, basically just really basic stuff, just to see if I could do it. It was amazing.
And the ability to write a Shiny app like this in, you know, two or three minutes, this is really cool. Like, yes, I could have written this Shiny app. I probably never would have before, because I would have felt like the cost-benefit trade-off wasn't there. But now, just give it a go. And with an LLM, it's not a super complicated Shiny app and that's fine. And it's a throwaway, so it doesn't have to be pretty.
Do you still need to learn to code?
And that reminds me of a question we got to earlier. Like, does it still make sense to learn how to code in 2026?
Like, yes, I think understanding code is still really important. I do think the trade-off between the ability to read code and the ability to write code has changed, because now we can generate so much code quickly with an LLM. Your ability to read that code quickly, understand what it's doing and whether it's the right thing, and then either make small tweaks by hand or tell the LLM to fix it, matters more. But if you don't know what it's doing, that just seems like such a dangerous place to be in.
Like, to do a good cluster analysis, you still need to understand that, oh, actually the number of clusters is really important. And there's no magic way to figure out the correct number of clusters. So you have to know that the LLM will just give you some decent answer, and you need to go and interrogate that further. So you still need that subject matter knowledge, that ability to read code, that ability to go in.
You know, you still want to create these data sets, and the ability to go in there and actually modify them directly is super important because, you know, being able to type a one instead of a two there, that's much lower effort than saying to the LLM, hey, can you please go to this specific plot and make this change for me? One of the advantages of code is that it's this precise language, and LLMs are great at going from your vague human language to this precise code language. But if you already have that precision in mind, trying to tell an LLM what to do in human language to make a precise change, that's just inefficient.
So I think like code is still important, but your ability to write code, I think is less important than it used to be. Your ability to read code is still more important and your ability to like ask good questions, like even, even more important.
I will say we have two minutes left and I would love to end with just like a little, one more philosophical question from Ben, cause I've been thinking about this a lot, which is if one of the best ways to learn how to read code is to write code, what's gonna happen if we're writing it less?
Yeah, I don't know. Like one of the things, like we're gonna have to learn new ways, to make new tools to, you know, new ways of doing this. I think one of the things that's always sort of intrigued me, I read this article, I don't know, like maybe 10 years ago about masters of fine arts in programming. And like, I don't know, I can't remember if it was like a real program or just kind of a speculation, but this idea like, you know, when you go and become an artist, you spend a bunch of time, obviously you do spend a bunch of time like making art, but also like looking at the work of like old masters and attempting to copy that by hand. And of course you could take a photograph of a famous painting that's gonna give you like an exact reproduction, but that recreating it by hand, you know, the things you do in like art and design programs where you like collectively talk through things, I think all of these skills, like we're gonna have to figure out how to apply them to the data science and programming today, like super duper important.
All right, wonderful. Well, we have literally one minute left, so I will say let's wrap up and everybody say thank you to Hadley for joining us. This was so much fun.
If you wanna sign up for the waitlist for Posit AI stuff, Nick has put that in the chat actually, posit.co slash products slash AI. You can sign up for the waitlist for this private beta for the tool that Hadley is using.
That is, it's currently a free beta because we're trying to understand what's helpful for people. And so it is free. So if you do wanna learn a little bit of like AI-powered data analysis with like no money down, at least for a little while, like we can't do this forever because we are a business, but at least if you wanna get your toes wet in AI-supported, empowered data science, like Posit AI is a great way to do that right now.
All right, fantastic. And I will see everybody on Thursday at the Data Science Hangout. I will also see you next week at RainbowRConf where I am going to be with Domi Pak doing a trivia thing. If you are attending RainbowRConf, I'll see you there. Bye everybody. See you on Thursday and see you next week. Bye.


