Transcript
This transcript was generated automatically and may contain errors.
Hey there, welcome to the Posit Data Science Hangout. I'm Libby Heron, and this is a recording of our weekly community call that happens every Thursday at 12pm US Eastern Time. If you are not joining us live, you miss out on the amazing chat that's going on. So find the link in the description where you can add our call to your calendar and come hang out with the most supportive, friendly, and funny data community you'll ever experience.
I think it's time to introduce our featured leader today who also is joined by a pet. Leslie, would you like to introduce yourself? We are joined by Leslie Emery, who is Senior Principal Data Scientist at Bristol Myers Squibb, which we will be calling BMS, I think.
Yeah, thanks Libby. Definitely BMS all the time. So I have been working at BMS for about five years. I'm based in the Seattle office, and that's where I did grad school and worked as a research scientist afterwards as well. And at BMS, I work in the research organization, and specifically in the Informatics and Predictive Sciences team, which is, you can basically think of that as like computational biology or computational science. And then the smaller team that I'm in, Knowledge Science Research, we are focused on helping the other computational scientists get their work done in a more efficient, more reproducible way, and helping to bring together a bunch of different data sources from across the company to make it easier for computational scientists to work with our internal data.
What's something you like to do for fun? Yeah, I have a lot of hobbies. The biggest two are probably knitting and reading. I'm in two different book clubs and spend a lot of time reading science fiction. I just voted to nominate for the Hugo Awards this year, so I'm excited to see how that goes. And I've always got probably at least three knitting projects going at a time.
Data harmonization in pharma
So at BMS, we have a lot of internal data from clinical trials that are going on, and then also more experimental research that's being done, either basic mechanism of action kinds of things or translational research on the trials that we've already done. And that's what I often help with, is people doing additional research questions on data from our clinical trials. And the data that we have from clinical trials, it's designed for the clinical trial first and foremost. So the data that is going to be the primary outcome in a trial is really well formatted, well curated, easy to work with. But experimental biomarkers, for instance, new assays that are just being developed, things like that, it's not as uniform in the format.
And we often want to do analyses across multiple studies. And each of these clinical trials is kind of its own thing with its own team, and so there can be significant differences between them. So I help to do a lot of data harmonization, combining multiple clinical trials together, solving differences in format, data cleaning issues, and then also trying to make decisions, like if there's an actual difference between the way an assay was done in study one versus study two, how do we combine that in a way that's still valid to analyze them together?
It's a lot of the data cleaning, data wrangling work that I think there's a bad reputation for it, but it's so, so important. And I get really excited about it and trying to make the data that we already have higher quality, easier to use. And so I've done a lot of these data harmonization projects, and it's kind of the same things over and over again. Every project you have to identify which columns are compatible between studies or data sources, figure out does the content match, do the units of measurement match, that kind of thing. And it's, every project is its own unique thing, but also there's so much repetition in the kinds of things that you're doing.
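The repeated checks described here (which columns are shared between studies, and whether their units match) could be sketched roughly as below. This is only an illustration: the study names, column names, and units are invented, and a real project would read this metadata from the data sets themselves.

```python
# Hypothetical per-study column metadata: column name -> unit of measurement.
# The columns and units are invented for illustration only.
study_a = {"subject_id": None, "age": "years", "crp": "mg/L", "weight": "kg"}
study_b = {"subject_id": None, "age": "years", "crp": "mg/dL", "height": "cm"}


def compare_columns(a, b):
    """Classify columns as compatible, unit-mismatched, or study-specific."""
    shared = sorted(set(a) & set(b))
    return {
        "compatible": [c for c in shared if a[c] == b[c]],
        "unit_mismatch": [(c, a[c], b[c]) for c in shared if a[c] != b[c]],
        "only_in_a": sorted(set(a) - set(b)),
        "only_in_b": sorted(set(b) - set(a)),
    }


report = compare_columns(study_a, study_b)
print(report)
```

A report like this only flags candidates. Deciding whether a unit mismatch can be converted, or whether two same-named columns really mean the same thing, still takes someone who knows the data.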
Using LLMs for reproducible workflows
And so I'm really excited about ways to automate that. And so one thing that I've been working on recently is a way to use LLM assistants for these data harmonization projects in a reproducible way. So a lot of the work that people are doing with data in LLMs, it might be through something like Databot in Positron, and you're having this interactive conversation with your data. But then when you're done with that, it only works for that one data set you were looking at, and it only captures exactly what you did in that session.
So what I'm working on and what I'll talk about in my PositConf talk is a way to capture the output of an LLM into config files, specifying what kinds of data transformations do you need to do, what parts of your data sets that you're starting with are compatible with one another to be put together into the same outputs. So capturing all of these steps into config files on a standard data model so that you can have the LLM work on the parts that it's good at, which is making these config files, and then put it into a reproducible, deterministic pipeline, and have that part do what it is good at, which is reproducibility.
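As a rough sketch of the split described here, the example below treats a config as the reviewed output of an LLM and applies it with plain deterministic code. It is not the pipeline from the talk: the config schema (the `rename` and `convert` keys), the study names, and the column names are all assumptions made up for illustration.

```python
import json

# Hypothetical config of the kind an LLM assistant might draft and a person
# would review. The schema (rename / convert keys) is an assumption.
CONFIG_JSON = """
{
  "study_1": {"rename": {"SUBJ": "subject_id", "CRP_MGL": "crp_mg_per_l"},
              "convert": {}},
  "study_2": {"rename": {"PATIENT": "subject_id", "CRP_MGDL": "crp_mg_per_l"},
              "convert": {"crp_mg_per_l": 10.0}}
}
"""


def harmonize(rows, spec):
    """Apply renames, then multiplicative unit conversions, to each row."""
    out = []
    for row in rows:
        # Keep only the columns the config maps into the shared data model.
        new = {spec["rename"][k]: v for k, v in row.items() if k in spec["rename"]}
        for col, factor in spec["convert"].items():
            new[col] = new[col] * factor
        out.append(new)
    return out


config = json.loads(CONFIG_JSON)

# Invented input rows standing in for two differently formatted studies.
raw = {
    "study_1": [{"SUBJ": "A-001", "CRP_MGL": 12.0}],
    "study_2": [{"PATIENT": "B-001", "CRP_MGDL": 0.5}],
}

combined = []
for study, rows in raw.items():
    for row in harmonize(rows, config[study]):
        combined.append({"study": study, **row})
print(combined)
```

Once the config is saved and reviewed, rerunning the pipeline does not involve the LLM at all, so the same inputs give the same outputs.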
This is one of the things that has made me most scared about LLMs in the past, is just like we can't put them into a reproducible pipeline and expect them to do the same thing. So you are focusing on solving that problem and making it so that the LLM is creating the instructions for the work that you are doing, and those instructions are reproducible, and you don't have to depend on the LLM to be a deterministic thing when it cannot be.
I think that's amazing and interesting, and I can't wait to hear more from your PositConf talk. I will not pump you for information on your PositConf talk. I will let you work on that.
Challenges and rewards of the role
Given your role at BMS, you also seem to be existing at the intersection of multiple departments and so forth and multiple needs and challenges. What are some of the challenging aspects and rewards of your role?
Yeah, that's a great question. So the most rewarding thing for me is to be working on something that someone else is specifically asking for and waiting for and going to be excited to use the moment I have it ready. So I came out of academia, I did a PhD, and then I worked as a research scientist for five years, where I would be working on something for months at a time, or sometimes years at a time, and then publish it in a paper and put it out into the world and never hear anything back about it. So I love the immediate feedback of, like, somebody else at the company needs something done, and I am the person who can help them.
You know, I see a lot of people who they're trying to do these kinds of things themselves in Excel spreadsheets, and they don't want to learn coding, they just want the data. And so I love to be able to find somebody like that and help get them an analysis ready data set. So they're not doing like merges in Excel.
BMS is an enormous organization. So that was a huge adjustment. When I moved to BMS, there's so many people. So everything I do has to be a collaboration. And there have often been times when, you know, I'm in a meeting, and we're talking about some new type of thing that needs to be done: a new analysis, a new type of tool that needs to be set up. And my academia mindset was like, oh, well, I can learn to do that. You know, I could spend a month or two learning to do that and do that for us. But in an enormous company, there's often somebody else who already has expertise. So I needed to learn how to instead think, like, is there somebody else who already has this expertise, and maybe I can consult with them and learn some of those skills from them. But I don't have to learn every single thing myself at BMS.
LLMs for data harmonization
What LLMs have you looked into for data harmonization? Have you found different LLMs work better for what you do?
Yeah, that's a great question. So I have not yet gotten to a point where I am fine-tuning things, like does model version one or model version two do better, does model A or model B do better. I'm still sort of working out the basic workflow and all the different moving parts of this automated data harmonization workflow. And so I've kind of just picked a model that others at BMS have used and liked and recommended and stuck with that. And what I've been using so far is, I think, Claude Sonnet still. And, you know, at an internal company, there are some constraints around what's available. You know, we're using internal LLM endpoints. And so, you know, sometimes we might need to wait a little bit to get a new version of some models.
I think that's a really common experience, though. There was a question in there that was, what LLM do you use? What's your LLM of choice? And like, what do you use inside of your IDE? And there were a lot of responses that were like, I can't use one. I'm really, really limited by what my employer lets me do. So I think that's pretty common.
Thank you. And I think we have actually been pretty lucky in that we have a lot of options and they are adding models pretty quickly. And we have a couple of different ways of interacting. The approach at BMS has kind of been, let's try all the things and keep everybody, you know, on the forefront, and then wait until we see how things shake out before we converge on any single one. Yeah. And then there's always trying to keep everybody safe, everybody's information and data safe as well.
AI assistance and evaluation
Yes, absolutely. And I would say I'm still cautious. I feel really lucky in that I have already had a chance to develop an expertise in the areas where I'm trying out LLMs. And so I have the background knowledge to be able to evaluate the output that I'm getting. So in the context of that answer, I'm thinking more about coding assistance as opposed to this specific LLM for data harmonization project. But there have been areas where, like, I used GitHub Copilot to learn to use the Playwright framework to write automated tests for a Shiny app. And that's an area where I didn't have expertise. But I knew kind of the design that I wanted for some tests. And I was able to tell, like, all the suggestions it's giving me here for writing these tests, I don't like the way these are architected. And so having that background knowledge first, I think, is really, really important.
So going back to using LLMs for data harmonization, one thing that is really key is having like an evaluation framework. So I have a colleague at BMS who is building like a custom like LLM output evaluation portal in a Shiny app. So we're really hoping to have something like that to help with the harmonization project.
And you are always going to need someone with the expertise in the data. So you can't just, you know, drop a couple of data sets into an LLM and say, like, harmonize these, and then evaluate the output without having understood the data that you're starting with. And I think this can often get overlooked when people are talking about how much time LLMs are going to save us: you still need to put the time in to understand the data that you're working with. And that is something with, like, a hard limit. There's often no way to speed that up other than spending the time digging in and looking at the data.
CDISC and data standards in pharma
Yeah, so that's a great question. I'm not familiar with OHDSI. All of the clinical, well, most of the data from clinical studies that I work with is coming from the CDISC formats, which there's SDTM, which is sort of like the initial data format, and then ADaM. I think it stands for Analysis Data Model. And the idea of that is it's supposed to be like the analysis ready data set and it's prepared by the statisticians.
So we do use those. And in my team and the computational biologists doing exploratory research that we're working on, we're often starting with the CDISC data formats. And they are standardized to a degree. It is such a flexible standard that when you try to do any specific analysis, you often need to do additional reformatting. And then there are also differences in the CDISC version from study to study. CDISC is so flexible that I imagine there are different company versions of using CDISC because they're all... When I started working at BMS, the office I started at had started as a company that was acquired by Celgene and then Celgene was acquired by BMS. So there's a Celgene formatted CDISC, a BMS formatted CDISC, and then any other smaller companies that BMS acquires, they often have their own CDISC format as well. So it's nice in that there's some standardization to start with, but to truly get to an analysis ready data set, you often have to do a lot more work.
Getting resources for infrastructure changes
Yeah. If anyone has suggestions for this, I would love to hear them. You know, especially at a really big company, it's really hard to make changes like that. One approach that my team has taken, I think, is to just focus on the parts that we can actually make change in. So one of my closest coworkers, Clara, she gave a talk at PositConf last year about an internal data standardization framework that we've developed called data as a product in R or DAPR. And so that has been something that we have had time invested in. And we have done that by getting our closest stakeholders and collaborators on board with us spending time on that as part of meeting our shared goals.
Tech stack and tools
Yeah. So I and my immediate team, we work mostly in R, a little bit of Python here and there, and there is support for Python in our larger team, the Informatics and Predictive Sciences team. For myself, personally, I used to work a lot more in Python, but there's kind of a lot of momentum behind R in my team right now. So since that's what most people are using, it's easiest to work there. I am probably working like one-third of the time in Positron now. I'm trying to use it for new projects, and for older projects that I already have set up, I'm sticking with RStudio for now. We do have the Posit commercial suite that we use: Posit Workbench I use occasionally, and Package Manager for publishing our internal packages and making them easy to install. And then Posit Connect, we use all the time. We've got so much stuff deployed there, reports just to make it easier to share and lots of Shiny apps there.
Harmonizing incompatible data
Yeah, that's such an interesting question, because the situation that I often end up in is the reverse of that. So I'm often having to say, no, we can harmonize these. You'll have to consider that there are differences going in when you interpret it, but it is possible to harmonize these and analyze them side by side. Because often the stakeholders that I'm working with, they're so close to the data. They're so close to one specific study or one specific disease that they are thinking about all the time. It's hard for them to think that you could compare multiple myeloma and lymphoma, because they are really different diseases. I'm sure they are, but also a lot of the data that you put together for these two different diseases has valid comparisons that you can make.
So I'm often the one arguing that we can harmonize it. And I think the situations that I have run into where you just can't harmonize it at all, it's often more an issue of we didn't capture the right data to begin with. So you want to analyze some particular biomarker across two dozen studies that you know have been done and that you know it would have been valid. But in half of the studies, that biomarker that is so interesting was only captured at one time point or was only captured in a dozen patients, because it was something they were just trying out.
Ontologies and data dictionaries
Yeah, such a great question. So about data dictionaries, a lot of the data that we start with from the clinical trials comes in SAS format, and the SAS format files have embedded data labels. So in our harmonized data sets, we'll pull those labels directly out of our starting data sets and maybe make some changes to make them compatible across studies. What I would like to do is to have an LLM help with combining the definitions that might be different across studies. I haven't gotten there yet.
And then as far as ontologies, this goes back to the question that somebody was asking about OMOP. You could drive yourself crazy trying to pick the one perfect ontology or controlled vocabulary to use for any particular project. I've been stuck in that kind of question with multiple projects at this point, in multiple different jobs, and no one perfect ontology exists. So you should also try to avoid the classic xkcd problem of coming up with another ontology to solve this. I think the solution that I'm happiest with at this point is trying to combine multiple ontologies: pick the first ontology that covers the largest proportion of what you need to include, and then if there's anything that's not covered by that, pick selectively from other ontologies that you have available.
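One way to picture this "primary ontology first, then fill the gaps" approach is a lookup that tries ontologies in priority order. The sketch below is only illustrative: the ontology names, terms, and codes are invented, and real mappings would come from actual ontology files.

```python
# Hypothetical ontologies, ordered by priority: the first one is assumed to
# cover the largest share of the terms needed. Names and codes are invented.
ONTOLOGIES = [
    ("primary", {"multiple myeloma": "P:001", "lymphoma": "P:002"}),
    ("secondary", {"lymphoma": "S:900", "amyloidosis": "S:901"}),
]


def map_term(term):
    """Return (ontology, code) from the first ontology that covers the term."""
    key = term.strip().lower()
    for name, codes in ONTOLOGIES:
        if key in codes:
            return name, codes[key]
    return None  # not covered anywhere; flag for manual review


terms = ["Multiple Myeloma", "Lymphoma", "Amyloidosis", "Sarcoma"]
mapping = {t: map_term(t) for t in terms}
print(mapping)
```

Because the list order encodes the priority, a term covered by both ontologies always resolves to the primary one, and anything left unmapped is easy to list for review.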
That is so hard, and especially at an enormous organization like BMS. The people putting the clinical data sets together to begin with are so far upstream of the analyses that my stakeholders are doing, and the specifications for these data sets are put together often years before the analyses are done. I think it is easier to get buy-in for trying to collect data that's already there, as opposed to trying to get it included from the beginning. I think the only thing that can really help is just building strong connections with those different parts of the organization. So I have tried to build connections to the biostatistics organization at BMS, which is in charge of building these initial clinical trial data sets, so that maybe when they're in a meeting next year for planning a new study they'll think, oh, you know, I had this conversation with Leslie last year about how they don't know what is in this data because we didn't capture all of the documentation on it. So maybe we need to include that from now on.
Career path and advice
Yeah absolutely. It's interesting because I had no intention to work at a pharma company. How did this happen? So you know I did my PhD in genomics. I wanted to be a professor I thought but then it turned out I didn't want to leave Seattle. I did not want to move. I didn't want to have to give up everything to move wherever academic jobs were. So right out of grad school I took a job as a research scientist in the biostatistics department at the same university I'd gone to and worked there for five years and it was great. I was in like sort of an analysis core team which is like you know our team did statistical analyses on genomics data and related clinical traits for the people that had genomic data.
And so I found this position that was like technical expertise but still working in science and that gave me some time to learn exactly like the kinds of tasks I wanted to work on. I was able to clarify that I really love being able to do this sort of thing where I'm getting a task done for a person in the scale of weeks or months and not years or a lifetime. So taking that time to figure out what things were important to me in a job and in a career and what I wanted to spend my time doing.
And then when I made my next career jump, I thought I wanted to work for a tech company. I was like, I really want to spend more time on building tools and working on data, that is what I find most gratifying, and so I thought I wanted to work as a data scientist at a tech company. And I kept looking at job listings and thinking, I don't want to look at ad data, I don't want to look at sales data, I don't want to look at website traffic data. And it turned out I still wanted to do something that I felt was having an impact on people and on their health. And so I found this job listing, and even though it was at a pharma company, and I wasn't particularly thinking I wanted to work at a pharma company, the job listing itself was like, oh, that's exactly what I want to be spending my time on. It was about building Shiny apps, it was about making analysis ready data, so I had a lot of those skills that I developed in academia already.
And in terms of positioning yourself to be in these roles that are at the intersection of things, I think the most useful skill to have is being able to talk to people in different specializations and different fields. So a lot of what I do in my job is talking to somebody who is not a coder, who doesn't know anything about R, and trying to figure out what their goal is. They often can't articulate it, so I need to be able to ask the right questions to figure out exactly what they need. And then it's often troubleshooting skills, so being able to, you know, jump on a screen share with somebody because they're having trouble with some Shiny app I built and they can't type to me accurately exactly what's going on. So I just jump on a screen share with them and I can walk them through: oh, I see what's going on, here's what you need to do, or here's what I need to change about the app to accomplish what you're asking for.
Long-term projects and staying motivated
Yeah, that's a great question. So I do have some really big long-term projects still. One of them is this automating data harmonization. You know, it started out as just the pipeline, just the config files. Before LLMs were even in the picture, I was putting my pipelines into config files so that I wasn't having to write the same dplyr filter statements over and over again. I could just say what parameters I wanted to use for that. And so I think that's motivating in and of itself, because it's continuing to help me to get my other projects done. So it is a big long-term project where putting effort in is also helping along all these smaller projects that do have timelines and deliverables that I'm handing off to other people.
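The idea of stating filter parameters in a config rather than rewriting filter statements might look something like the sketch below. The speaker's own pipelines are in R with dplyr; this is a Python analogue with an invented config format, invented column names, and a deliberately small set of allowed operators.

```python
import operator

# Comparison operators a config file is allowed to name.
OPS = {"==": operator.eq, "!=": operator.ne, ">=": operator.ge, "<=": operator.le}

# Hypothetical filter parameters; in practice these would be read from a file.
filter_config = [
    {"column": "visit", "op": "==", "value": "baseline"},
    {"column": "age", "op": ">=", "value": 18},
]


def apply_filters(rows, filters):
    """Keep rows that satisfy every filter in the config."""
    return [
        row for row in rows
        if all(OPS[f["op"]](row[f["column"]], f["value"]) for f in filters)
    ]


# Invented rows for illustration.
rows = [
    {"subject_id": "A-001", "visit": "baseline", "age": 54},
    {"subject_id": "A-002", "visit": "week_4", "age": 61},
    {"subject_id": "A-003", "visit": "baseline", "age": 16},
]
kept = apply_filters(rows, filter_config)
print([r["subject_id"] for r in kept])
```

Changing which rows are kept then means editing the parameters, not the code, which is what makes the same function reusable across projects.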
I also have an example of a big long-term harmonization project at BMS that has not had the amount of use to justify the amount of effort and time I put into it, and I think that academia mindset kept me from noticing that and realizing it as early as I should have. So now I'm working on looking at metrics, for example, for the Shiny apps that I've put together. So that was a big harmonized data set I put together and put into this really highly customized Shiny app with all these detailed visualizations that I spent tons of time on and everybody said they wanted, and then once it was built, they weren't there, they weren't looking at it.
Knitting data and wrapping up
Yeah, yeah, I absolutely do have a Ravelry. I'm 50,000 stitches on Ravelry, and one of my favorite parts about knitting is keeping track of my knitting data. So I have photos there, I have project details, and all of my yarn stash data is tracked there as well. So I do an export of all of my project data at the end of every year, and then I have a bunch of visualizations that I put together to track, like, how long did I spend on each of these projects, how long did I wait between buying some yarn for a project and actually knitting it up into that project. I don't want to know that. It can sometimes be embarrassing. So I do this annual extract of that data, and then I've got this Quarto report that I put together with a bunch of visualizations, and I do it kind of in January of every year. It's super fun, and then every year I add a little bit more to it.
Leslie, thank you so much for joining us. I had so much fun, I hope that you did too. I absolutely did, thank you so much. Amazing. Thank you, thank you for all your questions. Yes, everybody had amazing questions, thank you so much for asking them. And I hope that you will hang out with us on Tuesday at the Data Science Lab with Hadley Wickham, where we're talking about data dictionaries and Parquet files and using Claude Code, and then also next Thursday we're going to be joined by Hansel Palencia, manager of data science at DaVita. I cannot wait for that conversation, it's going to be fantastic, you are all going to love Hansel. I hope you have a wonderful weekend and enjoy your Friday. Goodbye, everybody.