Building Reproducible LLM Workflows | Leslie Emery | Data Science Hangout

Transcript

This transcript was generated automatically and may contain errors.

Hey there, welcome to the Posit Data Science Hangout. I'm Libby Heeren, and this is a recording of our weekly community call that happens every Thursday at 12pm US Eastern Time. If you are not joining us live, you miss out on the amazing chat that's going on. So find the link in the description where you can add our call to your calendar and come hang out with the most supportive, friendly, and funny data community you'll ever experience.

I think it's time to introduce our featured leader today who also is joined by a pet. Leslie, would you like to introduce yourself? We are joined by Leslie Emery, who is Senior Principal Data Scientist at Bristol Myers Squibb, which we will be calling BMS, I think.

Yeah, thanks Libby. Definitely BMS all the time. So I have been working at BMS for about five years. I'm based in the Seattle office, and that's where I did grad school and worked as a research scientist afterwards as well. And at BMS, I work in the research organization, and specifically in the Informatics and Predictive Sciences team, which is, you can basically think of that as like computational biology or computational science. And then the smaller team that I'm in, Knowledge Science Research, we are focused on helping the other computational scientists get their work done in a more efficient, more reproducible way, and helping to bring together a bunch of different data sources from across the company to make it easier for computational scientists to work with our internal data.

What's something you like to do for fun?

Yeah, I have a lot of hobbies. The biggest two are probably knitting and reading. I'm in two different book clubs and spend a lot of time reading science fiction. I just voted to nominate for the Hugo Awards this year, so I'm excited to see how that goes. And I've always got probably at least three knitting projects going at a time.

Data harmonization in pharma

So at BMS, we have a lot of internal data from clinical trials that are going on, and then also more experimental research that's being done, either basic mechanism of action kinds of things or translational research on the trials that we've already done. And that's what I often help with, is people doing additional research questions on data from our clinical trials. And the data that we have from clinical trials, it's designed for the clinical trial first and foremost. So the data that is going to be the primary outcome in a trial is really well formatted, well curated, easy to work with. But experimental biomarkers, for instance, new assays that are just being developed, things like that, it's not as uniform in the format.

And we often want to do analyses across multiple studies. And each of these clinical trials is kind of its own thing with its own team, and so there can be significant differences between them. So I help to do a lot of data harmonization, combining multiple clinical trials together, solving differences in format, data cleaning issues, and then also trying to make decisions, like if there's an actual difference between the way an assay was done in study one versus study two, how do we combine that in a way that's still valid to analyze them together?

It's a lot of the data cleaning and data wrangling work that I think has a bad reputation, but it's so, so important. And I get really excited about it and about trying to make the data that we already have higher quality and easier to use. And so I've done a lot of these data harmonization projects, and it's kind of the same things over and over again. Every project, you have to identify which columns are compatible between studies or data sources, figure out whether the content matches, whether the units of measurement match, that kind of thing. And every project is its own unique thing, but there's also so much repetition in the kinds of things that you're doing.
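The repeated checks described here (which columns two sources share, and whether their units agree) could be sketched roughly as follows. This is only an illustration: the study names, column names, and units are invented, and `compare_columns` is a hypothetical helper, not something described in the talk.

```python
# Hypothetical column metadata for two studies: column name -> unit of measurement.
# All names and units here are invented for illustration.
study_one = {"subject_id": None, "crp": "mg/L", "albumin": "g/dL"}
study_two = {"subject_id": None, "crp": "mg/dL", "age": "years"}


def compare_columns(first, second):
    """Report which columns two sources share and where their units disagree."""
    shared = sorted(set(first) & set(second))
    return {
        "shared": shared,
        "only_in_first": sorted(set(first) - set(second)),
        "only_in_second": sorted(set(second) - set(first)),
        # A shared column whose recorded units differ needs a human decision.
        "unit_mismatches": {
            col: (first[col], second[col])
            for col in shared
            if first[col] != second[col]
        },
    }


report = compare_columns(study_one, study_two)
print(report)
```

A report like this only flags candidates. Deciding whether two columns are really comparable still depends on understanding how each study collected the data.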

Using LLMs for reproducible workflows

And so I'm really excited about ways to automate that. And so one thing that I've been working on recently is a way to use LLM assistants for these data harmonization projects in a reproducible way. So a lot of the work that people are doing with data in LLMs, it might be through something like DataBot and Positron, and you're having this interactive conversation with your data. But then when you're done with that, it only works for that one data set you were looking at, and it only captures exactly what you did in that session.

So what I'm working on and what I'll talk about in my PositConf talk is a way to capture the output of an LLM into config files, specifying what kinds of data transformations do you need to do, what parts of your data sets that you're starting with are compatible with one another to be put together into the same outputs. So capturing all of these steps into config files on a standard data model so that you can have the LLM work on the parts that it's good at, which is making these config files, and then put it into a reproducible, deterministic pipeline, and have that part do what it is good at, which is reproducibility.
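As a rough sketch of that split, assuming a JSON config with made-up field names: the config below stands in for what an LLM assistant might draft, and `apply_config` stands in for the deterministic pipeline step. The schema (`rename`, `scale`), the column names, and the function itself are assumptions for illustration, not the actual BMS data model or tooling.

```python
import json

# Hypothetical config of the kind an LLM might draft and a person would review.
# The keys "rename" and "scale" are assumptions for this sketch, not a real schema.
CONFIG_JSON = """
{
  "study_two": {
    "rename": {"SUBJ": "subject_id", "CRP_MGDL": "crp_mg_per_l"},
    "scale": {"crp_mg_per_l": 10.0}
  }
}
"""


def apply_config(rows, source_config):
    """Deterministically rename columns and rescale values as the config says."""
    rename = source_config.get("rename", {})
    scale = source_config.get("scale", {})
    harmonized_rows = []
    for row in rows:
        # Rename columns listed in the config; leave the others unchanged.
        new_row = {rename.get(key, key): value for key, value in row.items()}
        # Rescale numeric columns (here mg/dL to mg/L), skipping missing values.
        for column, factor in scale.items():
            if new_row.get(column) is not None:
                new_row[column] = new_row[column] * factor
        harmonized_rows.append(new_row)
    return harmonized_rows


config = json.loads(CONFIG_JSON)
raw_rows = [
    {"SUBJ": "S-001", "CRP_MGDL": 0.5},
    {"SUBJ": "S-002", "CRP_MGDL": None},
]
harmonized = apply_config(raw_rows, config["study_two"])
print(harmonized)
```

The point of the design is that no LLM runs inside `apply_config`: once the config is reviewed and saved, rerunning the pipeline on the same input gives the same output.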

This is one of the things that has made me most scared about LLMs in the past, is just like we can't put them into a reproducible pipeline and expect them to do the same thing. So you are focusing on solving that problem and making it so that the LLM is creating the instructions for the work that you are doing, and those instructions are reproducible, and you don't have to depend on the LLM to be a deterministic thing when it cannot be.

I think that's amazing and interesting, and I can't wait to hear more from your PositConf talk. I will not pump you for information on your PositConf talk. I will let you work on that.

That you still need to put the time in to understand the data that you're working with. And that is something with like a hard limit. There's, you know, there's often no way to speed that up other than spending the time digging in and looking at the data.