Transcript#
This transcript was generated automatically and may contain errors.
Mark. Thanks for having us. I would like to kick this off with a bit of a very unfair trick question of the audience. And here's what the setup looks like. Around a year ago, we were working on this agent for data analysis, and we wanted to make sure that it could see the plots that it was making. And so we asked it to make a plot and then tell us what it saw. And we said, could you please plot horsepower versus miles per gallon in empty cars? So like every time I make this plot, I have to like think through what the correlation is supposed to be. If I'm in a super high horsepower, like Ferrari or something, the miles per gallon will be low. And then the fuel efficiency of like Prius, which has low horsepower is super high. So there should be a negative correlation between those two.
So I asked the agent, can you make this plot? And the agent makes the plot, and then the plots returned. And then the agent says, there's a strong negative association. So the trick question goes like this. This agent is wrong. Why is it wrong? And this is because the plot doesn't quite look like we would have expected it to. For some reason, the association looks positive. So our Prius has very low miles per gallon. I wouldn't have expected to see this. On the internet, there are like thousands of examples that are working with this built-in empty cars dataset inside of R. And this is not what that plot looks like.
So what we were trying to do with this plot was to be able to tell whether the agent could actually see what was going on inside of the plot. And so to do that, we tried to trick it. We take the value of horsepower inside of that column, and we just invert it such that every time the agent makes a plot that has both that variable and some other variable, the association will be flipped from what it expects to see. So our thought was that if we pass this image along, and it's somehow corrupted, or we're not using the right format for the API, but the association looks like the model thought it was going to, then it's just going to say that it looks like it expected it. But if we flip it, then we can actually tell whether the agent can see the plot or not. And so for whatever reason, it can't see the plot. It still thinks that there's a strong negative association.
So we revisit the API documentation, and we switch around the fields again and again. And for some reason, it seems like the agent can just never see this plot. Out of exasperation, we're like, okay, let's just like try one other way of making a plot with base R plot instead of ggplot, and maybe that's going to help. So we say, run this code, only this code, and nothing but this code. And what this code does is it takes some number of points between three and 10, which is what's on this first line. And then on the second line, we take a random color by randomly assigning red, green, and blue. And so without knowledge of the random seed, the agent has no idea how many points there will be or what color there'll be.
So the first time the agent makes this plot, there happens to be 10 violet points. We go ahead and let the agent rip. And we tell it to run the code. It runs the code. There are three cyan points inside of the plot. And the agent says, there are three cyan points. What? That's weird. Like I would have thought one capability is being able to see the trend in a plot. Another capability is to be able to see both the trend and the color. So this is weird to me that the agent can see that there are three cyan points. And so this revealed something counterintuitive to us. It appears that LLMs and agents based on LLMs can see plots just fine.
So what's going on here? We looked a bit further into this. Sarah and I wrote this up in a blog post called When Plotting. LLMs see what they expect to see. And this is the subject of an eval called BluffBench. What we did in this eval is we take a number of these well-known data sets that LLMs have seen thousands of times during their training. And we do something to contradict their expectations whenever they make plots. There are a bunch of different plots they make. And the percentage of times that they can tell that something is amiss and the plot doesn't look like they think it will look, that's a correct answer. But at the time that we wrote this eval, which was 8 or 10 months ago, the best models were getting single-digit scores correct.
So models often report what they expect to see, not what's plotted.
So models often report what they expect to see, not what's plotted. This is so weird. This is like so unlike humans. I think when we speak to things that seem to speak our language, like if English goes in, English comes out, we expect that the shape of the capabilities and the tendencies and the behaviors of the thing will be human-like. But this is so weird. If you put a plot in front of a human that shows a counterintuitive result, our reaction will probably most likely be to dig in further and try to figure out what's going on rather than brushing it aside.
Attempts to fix the eval
So once we had this eval put together, this is really an issue because in data science we come across counterintuitive results all the time. And this is really the process of data science is learning about the world from data and contradicting our priors. So we're trying to figure out how to drive these scores up now that we have a baseline.
One thing that we tried to do is something called a model in the middle. So on the left-hand side, we have the main thread of the conversation. And on the right-hand side, we have what we call the model in the middle. So we say, can you make the plot? The agent makes the plot. And instead of returning the plot itself as an image, we return a description of the plot. And that description is generated by a separate conversation thread that has no other access to the conversation. And it will describe the plot faithfully. So this showed to us, again, we can rely on the fact that LLMs can see plots just fine, but they can't see plots when they contradict their expectations. So we don't introduce any expectations at all. All we say is, please describe this image. The model in the middle says, there's a strong positive association. And the agent in the main thread, with its priors activated, goes ahead and says, there's a strong negative association, entirely contradicting the evidence from the plot that it just made.
Okay, so then we try something else. If we can't get the agent to say the real association, what if we make the agent say what the real association is? So we try a different version of the model in the middle, based on something called a prefill. We ask the agent to make a plot. It makes the plot. We return the plot. But we also do something called a prefill, where we tell the agent that it said something at the beginning of its response. And the thing that we tell it it said is the true association inside of the plot. So the agent receives the plot, and it's also told, I've already said there's a strong positive association. The agent continues that response by inserting a new line and then directly contradicting itself, saying there's a strong negative association between these two variables.
A bit exasperated, we tried one more thing. We told the agent that it had a private scratch pad, and that the user would not read the scratch pad, but that the agent could think through what it wanted to say to the user before it responded at all. So we say, could you please make the plot? And the agent makes the plot, and it doesn't look like it expects. And it says, huh, that's weird. The association is positive. And then it closes its private scratch pad and says that the association is negative.
So we tried a bunch of stuff to try to drive this eval score up, and we couldn't make it happen. Probably, as is usually the most effective way to make an eval score go up when you're working on AI agents, is to wait six months. And when we did that, we have seen, especially in the last few months, this huge jump in score on these evals. So again, at the time, the typical percent correct on this eval was something like 10%. And there were incremental improvements over that 10% over the course of the six months. But with the release of Opus 4.8 and 3.5 Flash, around a month ago or something like that, we started to see the first scores above 50%. And then with the brief release of Fable last week, we saw that score climb even higher.
This is so weird. These, again, when we speak in our language to some being, and that being responds back in the same language, I think it's very easy to impart this expectation that the shape of this being is similar to a human. And it turns out that they're actually very, very weird beings. They read 100 times faster and write 100 times faster, or probably more than that, than we can. And they also don't care about being correct.
It appears in these evals, what we're learning is that they are more concerned with making it appear as if a data analysis is progressing forward smoothly than they are about being correct.
Why correctness matters in data analysis
Okay. So, why is this a problem? I think probably all intuitively understand that correctness matters in data analysis. When you're working with real data, you have real world consequences, and being correct definitely matters. And correctness really is sort of a core value of data science. When you do an analysis, you want your results to actually line up with the world, or at least approximate it. And that's really sort of the whole point. But LLM behavior in this eval, and in other ways, doesn't necessarily line up with this value of correctness. LLMs don't inherently care about correctness necessarily.
But they do put on a convincing performance, right? We've seen in this eval that the model will interpret the plot, and it will provide details and supposed evidence for what they see. And it looks pretty legitimate. It looks like if you don't inspect carefully, it looks like they've seen a plot, they're interpreting it, they sound reasonable, they provide evidence, and they act like they are moving this analysis forward. The only problem is that, like in this case, where there is a clear positive association that is described as negative, it's a convincing performance, but one that occasionally fails to match reality.
And the problem isn't just that the models are sometimes wrong. It's also that wrong answers can be indistinguishable from right ones. So here we have another plot from this eval. This is tree height versus tree circumference. And if you're just looking at this, you can clearly tell it's nonlinear, it's a parabola. And here we have three answers from the same model. And you don't really need to read these to get a sense of what they're saying. They all have kind of the same length, they have the same tone, they use sort of similar words, and they look the same at first glance. But they actually have three very different descriptions in this plot. But if you didn't know what the plot looked like, and didn't have any sense of the data, you might read these descriptions and not be able to disentangle the correct ones from the incorrect ones. You wouldn't be able to know that the middle one is correct, this is a parabolic relationship, and that those first two are incorrect. The text is relatively indistinguishable from each other, unless you have the plot.
LLMs are still useful for data analysis
So what does this mean for data science if these values are often at odds with how LLMs behave? Can we still use agents for data analysis? So in the first part, Simon introduced this eval and talked about how LLMs can behave in these like deeply weird ways that break our intuition about how the model is supposed to work and introduce new models. And in the second half, I'm going to make the case that even though this is true, and it's not just true for this eval, but it's true in other ways, LLMs are still useful for data analysis. We don't need to throw the whole endeavor away. We shouldn't take this kind of evidence to mean that LLMs are useless anywhere that you value correctness, reproducibility, and transparency. Instead, the job is to figure out where they can fit and how to design around their limitations and for their strengths.
And I'm going to talk about two things that we did for our agent, which I'll introduce in a little bit, to make it sort of as useful as we could for data analysis. And so the first thing we're going to talk about is telling agents to care about being right, and then designing the scaffold, all the stuff that goes around the model that makes up the agent, so that it matters less if they do make an error and they are incorrect.
Okay, so we're going to be talking about agents. We already did a little bit, but just briefly, we say agent, we just generally mean a model hooked up in a way so that it can gather information about the world, like read files or see your environment, and then can also alter the world. It has the ability to do things like run code or write code to files, move things around in a directory, any of that. So it can gather information about the world and then affect change in the world.
And specifically, I'm going to be talking about Posit Assistant. So this is one of Posit's agents for general purpose coding and data analysis, and it lives in the screenshot. It's in RStudio. You can also use it in Positron. And we'll talk about this in a bit, but Posit Assistant can see your R or Python session, so it can run R or Python code right in your console. It can make plots that you can view in the chat pane and in the plots pane, and it can do both sort of iterative, exploratory analysis, and also longer running, more involved code.
Telling agents to care about being right
So an agent is more than just the model, right? It comes wrapped in a harness, and this includes things like its prompting, how it's told to act, what kinds of tasks it's told to work on, what tools it has access to. So, like, you know, broadly, what abilities does it have? Can it run code? What kind of code can it run? Can it read files? All of that. So what abilities it has and what information it has access to. Like, is it hooked up in a way that it can see your R session? Can it, you know, gain access to a file system? All of that and more.
So this is what we control when we control Posit Assistant or really any agent. Unless you're the model builder or you're fine-tuning a model, you don't usually have control or you don't have control over the underlying LLM, but you do have control over the harness. And so this is what we control when we, you know, adapt Posit Assistant. We have extensive prompting about prioritizing correctness, not just, you know, advancing analysis forward, but really making sure what is done is correct and sound, as well as things like statistical robustness, good modeling practices, all of that. As well as tools that make a difference for correctness, like the ability to see your active session. And it turns out that all of this, you know, work into the harness can make a difference. So even without affecting the underlying model.
So what this plot is showing is performance on that evaluation that Simon talked about in the beginning, the BluffBench eval. And this is comparing Posit Assistant's performance. So the, you know, the agent with all of this extra stuff that we have put into it versus the minimal harness, which is the base case, which formed the plots that Simon showed earlier. So generally, we find that Posit Assistant does better on this eval than the minimal harness. So all of this prompting and work that we have done into making Posit Assistant care about correctness seems to make a difference.
Code as a foundation
So we also want to design the scaffold or the harness so that it matters less when they are wrong. I'm going to talk about two things. The first is sort of using code as a foundation. So we're going to have the model write code in a shared environment and and then also have the model pause often to involve the user in the analysis.
So let's talk about code as a foundation first. And this might seem obvious for a coding agent, but you don't actually need the agent to write code to do data analysis. You could imagine an agent that you just give it like a CSV or some, you know, text file of your data, it ingests that and looks at it as text and then it spits out, you know, a summary of the data based on its, you know, read of the actual data file. This is not great. It's not good for transparency. You don't know what was done to the data. You don't know how it was analyzed. It's definitely not reproducible. Every time you would give that, you know, text file of the data to the agent is probably going to give you different results. And it is likely more error prone than writing code. Almost certainly.
Okay. And so this is probably not the right approach. I think code is a much better foundation for data analysis agents. For a couple reasons. The first is that models are actually very good at writing code. This really is one of their strengths and we should take advantage of that. Code, whether it's written by an agent or by a human can be run again. We know this. It's reproducible. It's one of the reasons it's so useful. And it's also auditable, which means that it's transparent. You can look at it and you can know what was done to the data and how the analysis was done.
And it's probably not enough just for the model to write code. In many cases, you also want to show that code to the user so that the user can audit it and take a look at it when it matters. So this is a screenshot of Posit Assistant. Notice in the chat on the left hand side, it's written some code to make a plot. We don't always show the code in this way to the user, but when we think that they're going to look at it and it matters for them to know what was done, we do. And so it's shown to the user, but we've also done all of these things to make it easy to see. So just small things like syntax highlighting, code styling, it's visible, you don't have to expand it. This might seem small, but it really helps you audit code when it matters.
A third sort of aspect of this code as foundation is operating in a shared environment. So what I mean by that is sharing the same R or Python session or whatever language your agent is writing code in with the user. So here, the sort of circle and arrow are just pointing out that when Posit Assistant is operating here, it's using R. So it has access to your R session. If you create an object, like you created that DF object, Posit Assistant can access that same object. If it creates an object, you can access it as well because you're operating in the same session. So when it runs code in the chat, it's actually running in the console. You can see that code, but you can also rerun it or build on it. So you're really seeing eye to eye with the agent because you're in the same session. And this turns out to be very useful, especially for data analysis, where often a lot of the process is like turning the same object around, looking at it from multiple angles, transforming it, all of that. It's very useful for you and the agent to be working in the same environment. This lets you see eye to eye instead of occupying two different worlds.
Pausing to involve the user
So we talked about having the agent write code, showing it to the user, and working in the same environment. The second piece of sort of catching the model if it is wrong and dealing with those consequences is having the model pause often. In this GIF, you'll notice I just asked it to explore some data. It does a few turns of tool calling. It writes little bits of code before stopping, summarizing it, and then offering suggestions for next steps. So Posit Assistant doesn't always behave in this way, but if it thinks you're doing something open-ended or exploring data, it's going to operate in this way where it does shorter turns. It stops and pauses and involves the user more often.
And this is in contrast to how it could behave, which is just, you know, you ask it a question and it goes off for as long as it thinks is necessary to completely solve the problem. So in this case, you know, if you ask it to explore your data, it could work for 10 minutes, explore every facet of that data, and then hand you all of that analysis at the end. But this has the risk of losing the user, obscuring what the model is uncertain about or what assumptions it's making, and deciding its own answers to questions that it really should have asked you. This makes it more likely that it would make mistakes, but also just go off in a direction that you might not want it to go off in.
There's also another reason for this pausing pattern with the shorter turns. And it's that, in many cases, the point of analysis is building an understanding of the data for you as the person to build an understanding of the data. This might not always be true with what you're doing for analysis. But in many cases, analysis has not been done if you don't learn something about the data. This is especially true for things like exploratory data analysis. Now, if the agent has done a bunch of work, but you have learned nothing about the data, you might say that the analysis has not been done. So having the model pause often and involve you is a way for you to keep pace with the agent and work on building up your understanding of what is actually in your data.
Okay. So in the first part, Simon talked about how LLMs can often behave in these very alien ways. They make strange errors, and they can be more concerned with advancing the analysis forward, performing progress, instead of valuing correctness. And I tried to make the case that even though this is true, LLMs are still useful for data analysis. And we can still make a useful agent, especially if we work on telling it to care about being right and designing the scaffold so that it matters less if it's wrong.
Q&A
Yeah, I see one from Daniel Bauer. Is there any model in the middle built into the Posit Assistant harness? So this was the approach where a model that doesn't have access to the rest of the conversation history that the main thread of the conversation does would pre-interpret the plot and then send the results of that interpretation back into the conversation history. Because we see in evals that that doesn't drive the score up substantially, that's not something that's implemented inside of Posit Assistant.
Yeah, so maybe six months ago or something, our answer was different than I think it is now. With many of the agents that Posit has worked on, the answer has been that we allow folks to hook up locally running LLMs to these agents, but we wouldn't recommend doing so. And that was because in evals, we saw that models that are small enough to run on a local machine tended not to be strong enough to really offer a good experience. I think that has really changed in the last few months, especially with the releases of Gemma 4 and of Quen 3.5 and 3.6. These models that are small enough to run on a sort of high-end consumer laptop are now starting to be good enough to really offer a good experience for folks. So I would recommend running Posit Assistant with OLAMA. If you're using the Assistant inside of Positron, you'll be able to get up there. And just to set expectations, it's able to string together a few tool calls that are coherent, but beyond that, you're sort of gambling on whether you're going to get good responses from there.
Yeah, there's a question about anthropomorphization of agents, which I thought was interesting. And I sort of take your point that it might be bad to overly anthropomorphize them, and probably some of it is an outgrowth of marketing. I do think in some ways this is often how we talk about technology. I think saying that you tell the model something I don't think is that different from how we've talked about other things on the computer. Telling R to do something, or asking your thermostat to tell you the temperature. But I don't know, I thought that was interesting. I think we generally try to avoid too much of it, and I do think it can obscure what's happening, but it's often a useful metaphor, and at least when I say things like that, I don't think we mean it too seriously, like it is sort of a metaphor or a shorthand. But I do think one reason to talk about it this way is also to point out when they do very weird things, it is sort of surprising. I definitely don't think they are working like people, and I think we've tried to point that out. These are weird ways of failing that are fundamentally different.
Simon, do you have anything you want to add? I think that's always important to forefront. Because you interface with the thing via language, often the most clear shorthand for referring to pieces of that interaction is to speak about it like you would speak about the other thing that speaks to you in your language, which is a human. But again, they're not humans, they are some sort of strange bag of matrix calculus.
There's a question from Hong that says, in pharma, a very protected field with confidential data, how will you protect this? Are you using OLAMA at all? So this is a broader question of confidentiality and trust inside of data science agents. So at least to speak for Posit assistants, if you're working in Workbench inside of an organization, the assistant can hook up to whatever your organization's trusted model provider is, and Posit isn't sort of interjecting themselves to see responses in between. So that's really a relationship, a trust relationship between your organization and their model provider. If you're using Posit Assistant with the Posit AI model provider, which is the $20 a month subscription for individuals, the default data retention setting is that we would not be able to see the conversations that you have with the assistant and the model providers that are serving the requests under the hood, which right now are the cloud series and then a private deployment of Gemma. We have zero data retention agreements with those providers, so the data is protected to the extent that those agreements make that happen.
It still seems you need to have a high domain expertise to know when it's inaccurate. Is that a correct assumption or thought?
I think my answer would generally be it depends. And it might vary by field. One reason is that I think the models are better at some things than they are at others, or they know about some things more than they know about others. And in some fields, like it may be harder to spot inaccuracies than others as well.
I generally feel like I only am comfortable doing analysis on data that I have some general understanding of. Just because the model can write the code does not mean I guess you should necessarily trust it to give you insights into data in a field that you literally have no experience with.
There's a question that came in from Kevin. It said, can the Posit assistant use skills like a skills.md file? That's right. Yeah. So the assistant itself ships with a dozen or so built-in skills. And then you can also bring your own skills to the assistant.
Someone has a pretty interesting question. It says, if you're new to coding or stats, do you think the assistant would be good for teaching?
There's a line in this pretty popular AI newsletter that is repeated in almost every issue of the newsletter, which is something like, if you're very interested in learning, LLMs are one of the best tools for learning. And if you're not interested in learning, LLMs are one of the best tools for not learning. So I think the story of education with LLMs, I mean, I'm not a teacher, I'm not an instructor. So others have a much stronger expertise than I do here. But for folks that are really interested in learning with a Posit system or LLMs in general, I think you can really get a lot of mileage out of working with them.
Yeah. There's a question about the TUI, the terminal interface. But if you mean by side chat extension, like when it's in the sidebar and the chat interface in the in an IDE, I think like generally it should function similarly.
Yeah. So I think my read of this is like, is the harness different between the terminal interface and the interface that you would have inside of like Positron or something like that? Maybe. And the answer is yes, it is the same harness.
How do you recommend sourcing reliable raw datasets? We've explored public repositories, but we're having trouble finding suitable data. Are there best practices to test out data quality and reproducibility?
Yeah, I don't know if we have much to add. I guess to clarify, like we're not training, we don't train any models. We do make these evals that use datasets. For the one that we talked about today, these are like just, well, they're sort of split into three categories. One of those is common datasets that the model was going to know about. That was part of the point. And then we simulated some data that was supposed to be in domains that the model should have expectations about. And then generally for testing, I think we just use data that we come across or find. And so then do rely on other people testing it with their actual data for their actual work. We don't really do any work with like generating large datasets for testing specifically outside these evals.
If you launch RStudio or the most recent release of Positron, you'll be prompted at some point to download the assistant if you want to. And if you just want to like poke at it and get a feel for what these agents, what the feeling of working with these agents actually feels like, the TidyTuesday is a great source of just like data that you haven't seen before that you can maybe familiarize with just to get a sense for what working with the agents feels like.
I think there's actually, the PSI group has a wonderful Wednesdays maybe. I think that's the right name where it's a similar, but for pharma data. Oh, nice. Yeah. So both of them are great places to start.
Awesome. Well, I think we've got just 10 minutes here. We're going to give everyone a chance to stretch and get ready for the next set of sessions. Simon and Sarah, thank you so much for the great conversation, the great information.

