Resources

Posit Meetup | Afshin Mashadi-Hossein, Bristol Myers Squibb | Framework for Data Collaboration

Led by Afshin Mashadi-Hossein, Sr Principal Scientist at Bristol Myers Squibb

GitHub link: https://github.com/amashadihossein/daapr
Other pharma use cases: rstudio.com/champion/life-science
RStudio for Clinical Reporting: rstudio.com/solutions/pharma
Chat with RStudio: rstd.io/chat-with-rstudio

Abstract: For data science teams, data preparation takes a substantial investment of time, data science expertise, and subject matter proficiency. However, as the name implies, data preparation is typically viewed merely as a means to an end, encouraging the creation of expensive but often single-use and fragile elements in data analysis workflows. Rather than seeing data preparation as an obstacle to be removed, we propose a framework that recognizes the time and expertise invested in data preparation and seeks to maximize the value that can be derived from it. Viewing analysis-ready data as a multi-purpose, modularly built product that should lend itself to collaborative development and maintenance, the framework of Data-as-a-Product (DaaP) aims to remove barriers to version tracking and collaborative data development and maintenance. Specifically, the framework, which is entirely implemented in R, enables joint code and data versioning based on git, standardizes metadata capture, tracks R packages used, and encourages best practices such as adherence to functional programming and use of data testing. Collectively, the patterns established by the DaaP framework can help data science teams transition from developing expensive, single-use “wrangled” datasets to building maintainable, version-controlled, and extendable data products that can serve as reliable components of their data analysis workflows.

Bio: Afshin is a data scientist who is passionate about putting engineering and computational tools to work to realize the potential of biomedical data in service to human health.

Aug 23, 2022
1h 19min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Thank you so much for joining us today. Welcome to the RStudio Enterprise Community Meetup. I'm Rachel Dempsey. I'm calling in from the RStudio office in Boston today.

Today we are streaming out to LinkedIn and YouTube Live. For anybody who's joining this group for the first time, a special welcome to you. If you are joining for the first time, this is a friendly and open meetup environment for teams to share the work they're doing within their organizations, teach lessons learned, network with each other, and really just allow us all to learn from each other.

Thank you all so much for making this a welcoming community. We really want to create a space where everybody can participate and we can hear from everyone. We love to hear from everyone, no matter your level of experience or the industry that you work in as well.

While you can ask questions on whichever platform you're watching from, you can also ask questions anonymously through the Slido link, which I'll pull up on the screen here in just a second.

But we are so excited to have so many members of the pharma community here to join us today as well. So I'd love to use this opportunity to share some information about a few other pharma communities and resources that exist today. And I have my colleague here, Phil Bowsher, joining us.

Hey, Phil. So a lot of exciting things are happening in the community. We have the work through R/Pharma. We have things through the champion page, which is fantastic for people that are setting up environments. We also have a lot of exciting packages coming out of the pharmaverse right now in the pharma ecosystem. So yeah, lots of new exciting things in the ecosystem.

So for today's meetup, we are really excited to be joined by Afshin Mashadi-Hossein, Senior Principal Scientist at Bristol Myers Squibb, who will be sharing their data science framework for data collaboration. And as mentioned earlier, we will have time for Q&A as well. So you can put those questions into wherever you're watching from, or use the Slido link for anonymous questions.

But with that, I'd love to turn it over to our speaker for today's meetup, Afshin.

Sounds great. Yeah, thanks a lot, Rachel. Thanks, Phil. Hi, everyone. My name is Afshin Mashadi-Hossein. And as Rachel mentioned, I'm from Bristol Myers Squibb.

Before going further, I wanted to, again, thank the RStudio Pharma Meetup community and organizers, particularly Rachel and Phil here, for giving me the opportunity to share with you all Data-as-a-Product, a data science framework for data collaborations. This is work that my co-authors, Julie Rutlowski and Garth McGrath, and I presented at R in Pharma back in late 2021.

So the hope in presenting it again and sharing it here again is that we have a little bit more time for more of a discussion and question and answer. And hopefully, some of those conversations will benefit from the experiences that we have accumulated since that presentation last year.

Defining data product and data as a product

So to start things off, I think it's quite critical to provide a definition of data product and data as a product. If you have heard of this term, you might have noticed that in different circles, in different groups, it could mean quite different concepts.

How we are defining data product is simply: a data object that, much like any other product, comes with a set of characteristics or attributes. So specifically, we take this notion of a data product to be a data object that is use-case driven. This is in contrast with, quote unquote, wrangled data that can serve any purpose: it's clean data, and you can use it for whatever purpose. A data product, instead, has to have a specific user story, or a set of coherent objectives that it achieves.

Now, if you have that level of specificity, you can imagine that you may have slightly different use cases leading to a variety of different data objects. So it becomes even more critical to be able to version control it and track it. So making it use case specific and being able, then, to branch it and version track it becomes quite critical.

It needs to be tested. It needs to be reproducible. And I would say not only reproducible, but easily reproducible, meaning that with a handful of commands, you can just reproduce the data. It should be easily maintainable, extensible. Think about this as your typical R package in this context.

So in this context of R users, you can use an R package for whatever end that you have in your analysis or otherwise. You can also use that to integrate it into a new R package and build upon it and extend it. Same should apply to data products. You should be able to integrate it into a new data product, bring a few of them together, build upon them, extend them. So it should be extensible.

It should be findable, accessible, and portable. A note on that last one: as data scientists, if I may speak for a lot of us, we may not necessarily be super conscious of or concerned about whether my data is coming from Azure or AWS S3 or whatever platform, whatever infrastructure I have. But our colleagues in IT or other places may be. So as the requirements of my organization change, I may need to take my data from, let's say, AWS S3 to Azure, or vice versa, or to some network drive.

All of these characteristics that we talked about should come along. So it shouldn't be something that is dependent on the platform or technology outside of the data product. This core set of characteristics and attributes defines what we call a data product.

Now, data as a product is a framework to make it practical for data science teams to pull together their data wrangling efforts towards collaboratively building data products. So on one hand, we have this product. On the other hand, we have a set of processes that allows us to easily, or at least methodically, build these data products.

So the processes could be agreements between the teams, like the coding conventions we want to use, as well as some automations and tools that allow us to implement those agreements. So you can think of data as a product as a framework, a way of thinking, as well as the tool set that allows us to implement that way of thinking. The data product is the output.

Motivation: the 80-20 rule of data science

So to focus on one, perhaps you can think about this 80-20 rule of data science, which is basically the observation, made by a lot of data scientists, that a typical week could involve four out of five working days engaged in data processing, or data wrangling, whatever you call it. And then on that last day, you may get a chance to actually do your data analysis.

So as a data scientist or leader of a data science team, if you are interested in adding more agility to your overall team efforts, it is fair to look at this data processing and try to bring in additional agility to it and efficiency to it.

And there are a variety of approaches that teams have implemented. Hopefully, we all work in teams where heroic data wrangling is not one of those approaches. No one is advocating for shrinking that 80% by just willpower and working extra hours. But then there are things like new data agreements that you may put in place with data vendors. You may ask your data platform to do more in the way of data validation. Even pushing that validation of data further upstream into your pipelines could help.

Perhaps a slightly different view would be, rather than seeking relief, to acknowledge that here's my team spending their precious time, resources, and expertise (and I should emphasize the expertise part) building something. So there's a lot of value in that. If my team spends 80% of its time building this artifact or output, that has a lot of value in it.

So alongside thinking about reducing that time, can I also think about how to drive more value from this precious resource? If I'm thinking in that sense, I'm not trying to remove the burden of this effort; I'm just trying to drive more value from it. But to help alleviate that burden, I can also make collaboration easy, so that big lift is shared across the team.

So you make collaboration easy. When permissible, you can add a dash of automated standards, and you put in place process frameworks for collaboration. So not surprisingly, perhaps, at this point you can imagine that this latter approach of driving more value from data wrangling is what the framework of data as a product, as we are defining it, advocates.

So the idea is that you're bringing in your usual input data, and you're adding your logic of data wrangling. And then you feed it into this process that out would come a data product, which then can be fed into a number of utilities, analysis, apps, automated reports. And this data product now is reusable and extensible. So that work is essentially something that lives on.

That is in contrast with a typical workflow of a data scientist, where the data wrangling may be so inseparably entangled with your utility, be it an analysis or otherwise, that once that analysis is done, or once that utility has run its course, that piece of data work is essentially disposable. So you have a very, very expensive effort that now has no more use.

Requirements for a data product

So up to this point, we have talked about this notion of data product in abstract. So let's think about what's our north star. What are we aiming for here? So as I alluded to earlier, it has to serve a coherent set of objectives. So this is one of the core requirements.

So that limits the scope of the effort. Even with a limited scope, you don't want to be building it from A to Z all in one shot. So think about, again, a product. Like you may have different components that you build and you assemble. It has to be built modularly.

And if you are building it modularly, and your buddy here is building some parts and you are building some parts of it and you're bringing it all together, it shouldn't be something that is an art where everybody does it their own way. So collaboration requires some blueprint of how we are building these elements.

Now, again, collaboration requires everything to be versioned so we don't get tangled up in what version of what elements we are talking about across the team. So by everything here, I mean code, environment, and data needs to be versioned.

So all of the things that I've talked about so far, as far as the requirements of the product, has to do with the build process. But from a user perspective, it also has to be easy to use. So connect, list, get is all you need to be able to pull the data down and use it. So as a user, you should have an easy interface. And lastly, as I mentioned, we want to have it be platform agnostic.

So if you're going from S3, and later on you need to go to file systems, or RStudio Connect is your next destination, the data science workflow, not the infrastructure, should dictate where your data is. Other requirements can be accommodated, and you don't have to worry about the functionality of your data being compromised.

Now, this is the product's requirement. And there should be a build tool that can enable building that product according to those attributes. So you can see the linkage here that a build tool, given all those attributes that I mentioned are requirements, should naturally support modular development, should minimize boilerplate, should enable reproducibility by capturing everything in the state of the code, data, and environment, should encourage best code style practices, should enable full traceability of everything, code, data, environment. And as a builder, you should not be learning a new language. If you know R, if you know Git, you should be good to go.

Data as code: the conceptual approach

Now, if instead of data as a product this was software as a product, you would rightly expect that the tool is already around. With the rich set of tools, experiences, and processes that have been developed and implemented by the DevOps community, when we are talking about software, we can accommodate all of those features of a build tool. The difference here is that we have not only code but also data. So this gives us an inspiration: if we were able to capture data as code, we would build a bridge to this rich set of tools and experiences that are available in the DevOps community and put those to use for our end.

So, building with data as code: what's the approach here? Let me walk through an example of how this could work conceptually.

So let's say you are here at point one. You have some data sets. Some of you in pharma are familiar with this nomenclature. So let's say you have some LB data, which is lab data collected on a cohort of patients. You have some medication data and a bunch more data sets at your disposal.

Now, you start your project. You set the initialization in motion. It starts capturing the environments that you're developing in. It adds the remote locations for your code, for your data. So that initialization is done.

Now, you go ahead and onboard your data into your data product. So let's start with this LB domain, which is the collection of lab measurements on a cohort of patients you have. And so you bring that in. And let's say you can now sync that to your remote location, which you have already defined at initialization.

So at this stage, the part that is perhaps worth attention is that you can capture this data that you're pushing to a remote location with a set of metadata that uniquely identifies it. So you're capturing some metadata, like the label for that data and some description. And this version, which could be essentially a SHA hash of that data, or any other flavor of hash for that data, uniquely identifies that data set.

And some additional set of metadata that then will help you retrieve this data when you have the need for it. So now you have converted your data to a set of metadata that uniquely identifies and gives some description about that data. And if you capture that, as you repeat adding additional data sets, you build a manifest that has all the information you need for your input data. And that is the approach of converting data to code.
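As an illustration of this data-to-metadata idea, here is a minimal base-R sketch of versioning a dataset by a content hash. This is not how daapr implements it internally (daapr builds on pins); the helper name `hash_dataset` is made up for this example, and while the talk mentions SHA hashes, this sketch uses MD5 via `tools::md5sum`, which ships with base R.

```r
# Toy sketch: identify a dataset by a hash of its serialized content.
# Identical content yields an identical version id; any change yields a new id.
hash_dataset <- function(df) {
  f <- tempfile(fileext = ".rds")
  on.exit(unlink(f))
  saveRDS(df, f, version = 2)      # fixed serialization version for stability
  unname(tools::md5sum(f))
}

v1 <- hash_dataset(head(cars))
v2 <- hash_dataset(head(cars))                               # same content
v3 <- hash_dataset(transform(head(cars), dist = dist * 2))   # changed content
```

A manifest entry would then pair such a version id with a label, a description, and the remote location, which is enough to retrieve that exact snapshot later.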

Then as you add additional layers of code for logic and you're capturing your environment, all of this is code. You can commit it. You can push it. And when you have done that pushing, in your remote GitHub repository, now you have the build logic, you have the package dependency, and you have all the information you need to identify the input data.

Now, this recipe is complete, in the sense that a colleague sitting at point four, or maybe a CI/CD machine sitting downstream that gets triggered by the push, can build this from scratch and push it again to the same remote data repository in the same fashion, capturing the same set of metadata. Now, everything here has been converted into code. So your data lives in its data location, your code lives in the code location, and as long as you have access to both, you can build it from scratch.

The daapr package family

We have implemented this framework in R as a family of R packages we call daapr (pronounced like dapper). And it is available for use to build data products, essentially implementing the framework that I just shared with you all.

So this family consists of three packages: dpi, dpdeploy, and dpbuild. Why three packages, you may ask. I can come up with a couple of different answers, historically, in terms of bulk and code management. But perhaps the most important one is that you can think about three different personas that would be using these packages.

So starting with dpi: it is meant to simplify access to data products. If you are just a user, whether it's an application or a report, you only need a handful of functions to be able to access the data products. So you don't need the bulk of the rest of the packages.

dpbuild is meant to simplify the building of data products. This is the bulkiest of the packages, and it is used by the data developer. So you can think about the persona here as the data developer. The user of dpdeploy can be a data developer, or it can be a workflow automation machine. The task of dpdeploy is to simplify sync and deployment of data products. So you can have a CI/CD process set up where, as your data comes in, it gets pushed.

End-to-end workflow walkthrough

So there are a variety of workflows that daapr needs to accommodate. When you write code, you don't always start from setting up your project and go all the way to deployment. You may set up your project, develop some, go do something else, come back, have some pull requests coming your way, make those changes, deploy.

But perhaps to provide an end-to-end example, if you were to start from the beginning and go to the end, it would look something like this. So you start with dp_init, which sets up your project. It's kind of like when you do git init.

You add your data, so you onboard your data. We are, again, encouraging best practices: we encourage folks to spend a little bit of time here making sure the data they're onboarding is actually what they want to onboard. Then you add the logic of your code, and then you deploy the code and data.

Now, what I want to emphasize here is that this step of adding your code is the step that should take the bulk of your time. Whether or not you're using daapr, you need to spend some time developing that logic of data processing. So the idea here is that if you today spend, let's say, an hour or two on a particular project just doing the data processing part, bringing that into the daapr framework adds maybe 5 or 10 more minutes on top of that. The bulk of your effort remains the same, so the boilerplate should be minimized.
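Sketched in code, the steps just described might look roughly like this. This is a hedged sketch, not runnable as-is: the function names follow what is said in the talk (dp_init, dp_input_map, dp_input_sync, dp_input_write), but the exact argument names are assumptions, so check the daapr vignettes for the real signatures.

```r
library(dpbuild)

# 1. Set up: git init, renv init, remotes for code and data, project metadata
dp_init(project_name = "cars_demo",        # argument names are assumptions
        branch_name  = "main",
        readme_general_note = "Convert cars stopping distance to metric")

# 2. Onboard: drop raw files into the input folder, then map, inspect, sync
input_map <- dp_input_map()                # metadata for everything in input/
dp_input_sync(input_map)                   # push selected datasets to remote
dp_input_write(input_map)                  # record the manifest (data as code)

# 3. Add the wrangling logic (the bulk of the work), then 4. deploy
```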

Now, this is the claim. And we're going to go through a toy example that hopefully shows this. The toy example is overly simplistic, admittedly. We are starting from the cars data set. And there is a column that is the distance from the point that a driver applies the brake to the point that the car stops.

So that column is in imperial units (feet). And let's say the user story that you have asks you to convert that to a metric unit. An overly simplistic example, and it is also in the vignette for the package, if you wanted to check it out more closely.
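The wrangling logic for this toy example is one line of base R. The cars dataset ships with R, and its dist column is stopping distance in feet (1 ft = 0.3048 m):

```r
# cars: speed (mph) and dist (stopping distance, ft), built into base R
ft_to_m <- function(ft) ft * 0.3048

cars_metric <- transform(cars, dist_m = ft_to_m(dist))
head(cars_metric, 2)
#   speed dist dist_m
# 1     4    2 0.6096
# 2     4   10 3.0480
```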

But being overly simplistic, still, the wiring and setup of the project are identical to a bigger, more realistic project with tons of data and tons of logic. So going through it hopefully gives us a sense of the degree of boilerplate and the time that we need to allow for it.

So we start with this setup step, again thinking about all the steps that I mentioned previously: setup, data, code, deploy. So we're going to go through that. For setup, simply, there's this function, dp_init, that sets up your project folder, runs git init, captures the environment you're developing in thanks to renv (so it does the renv init), adds remotes for code and data, and records project metadata.

Let's take a look at how this looks in practice. In this video, we are essentially going through that initialization step. So we bring in the library. We set up some promises for the credentials that we need for access. We are not providing a credential to the framework; we're just saying, these are the environment variables you will need when you are ready to build the data product. And we are putting that into this dp_init.

We are also providing the location of where the data needs to go. And then we execute this dp_init, which takes in as parameters some metadata: What's the branch name? What's the project name? What's the user story? What's the remote location? And those promises around environment variables that will be needed at build time. So again, here we have a single command, really, at line 10, that sets up your project.

Then we onboard the data. In this case, we have a very simple dataset that we are onboarding, but in general, you can onboard any datasets, bundled together as zip or otherwise, into your data product. One thing I want to mention here is that the versioned nature of a data product means it is a snapshot. So you may have started with live data, but you are snapshotting it at that point.

So there are some datasets that you need to have at hand as a snapshot, and then you build your data product. So you onboard your input data into your input folder. As I said, there are some environment variables that you promised you would be providing at build time, so you set those promised environment variables. Then there's this step of dp_input_map. So you run this command, dp_input_map.

This is a step that we added with the mindset of encouraging best practices, which in our mind means that if you have some bulk of data, it is beneficial to take a look before sending everything over. In some cases, you know that you want to send over all of the data that you have put in the input folder, and that's easy; in this toy example, that's going to be the case. But in other cases, you want to look at some metadata. Are there duplicate data? Are there names that could clash?

So there's this step where you do dp_input_map, and you get all of that metadata. You take a look at it, you pick the elements that you need to sync, and you call dp_input_sync, setting flags for which data sets you want to sync. And this syncing step could take a little while, right? Because at this point, you're literally pushing the data sets one at a time to your remote location. So if you have hundreds of those data sets, depending on the size, it could take a bit.

And in this last step, you do dp_input_write, which is actually pretty fast, because at this point all of those input data sets that you have synced are all metadata, that is, code. So you're just writing a manifest. So it's pretty quick.

So at this point, let's take a look and see how this looks. So we are just writing this cars data set into the input folder. And so that is the only input data that we want to onboard into our data product.

Next, we are gonna activate the project. This assumes that there was a gap between the initialization and this second step; if you are going all the way through from initialization, you don't need to activate again. Then you set up those environment variables that we promised, and we capture everything into this config object that has the metadata, all the elements that we need to have at hand for the project to proceed smoothly.

So now at this point, we only have one data set. So we do the input map, and then we essentially know that we want to sync this data; there's nothing that we want to exclude or change. So we just sync by calling dp_input_sync. Now your input data has been synced. This was pretty quick because it was just a small data set. So you get a version for that, then you do the write, and you're done at this point.

So now you have converted, you've set up your project, you've converted your input into code, onboarded that data, and now comes the part that you're gonna be building the data product itself, going from input to the output. So this is the part that you're gonna be spending your time, bringing your subject matter expertise to codify all the requirements and all of the user stories that you heard about into code that does the logic of data wrangling.

In our case, it's very simple, right? We are only changing one column, but in essence, it's gonna be exactly the same every time. So you have dp_input_read, which reads in all of the input data that you have synced and onboarded into your data product. You add the wrangling logic. Then you structure your data object: there's this function called dp_structure that structures the input, the output, and the metadata all into one object.

Then, similar to dp_input_write, you register that data as metadata now. So again, your input was converted to code, and now your output is also converted to code. And since everything is code, you do dp_commit and dp_push. And at this point, you have a repository with a README and all of the documentation that you need to build this data product from scratch.
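Pulled together, the build step might look like the following hedged sketch. Function names are as heard in the talk (dp_input_read, dp_structure, dp_commit, dp_push); the wrangling line and all argument details are illustrative assumptions, so treat the package vignette as the authority.

```r
# Build: read synced inputs, apply the wrangling logic, structure, publish
d_in  <- dp_input_read()                      # inputs resolved via manifest
d_out <- transform(d_in$cars, dist_m = dist * 0.3048)

dp <- dp_structure(input = d_in,              # argument names are assumptions
                   output = list(cars_metric = d_out))

dp_commit("Add metric stopping distance")     # everything is code now, so
dp_push()                                     # commit and push as usual
```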

Your data product hasn't left your workstation, but anybody who has access to the repo can build it on their side and push it, and you can include it in a CI/CD machine. So let's take a look to see how this goes. This is gonna be super fast. So you read all of the data sets, you added your logic, made the output, structured it in a data object, and then you write it as a YAML file, and then you do the dp_commit and dp_push, and that will generate your repo.

So at this point, the work is done as far as the data developer is concerned, and the data can be built. You don't have to deploy at this point; for example, you may commit and push a bunch of times until your data product is in a state that you want to publish. But let's say you are at that stage and you're ready. Then you call the dp_deploy function, a single function, and it will deploy the data product for you.

This may take, again, a little bit, because at this point you're also pushing your data product to your remote location, depending on the size. But that essentially concludes the A to Z of building the data product. Now, if you wanted to access it from the persona of a user, you only need these three commands: dp_connect, which logs into the library that has all the data products; dp_list, which lists all the available data and metadata; and dp_get, with which you can pick from the list of data products that you have available.
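The deploy-and-consume side, again as a hedged sketch using the function names from the talk (dp_deploy from dpdeploy; dp_connect, dp_list, dp_get from dpi), with argument details assumed rather than verified:

```r
# Developer or CI/CD machine: push the built data product to its remote
dpdeploy::dp_deploy()

# Consumer: three calls are all you need
library(dpi)
board <- dp_connect()        # log in to the board holding the data products
dp_list(board)               # list available products and their metadata
d <- dp_get(board)           # pull a specific product, pinned by version
```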

So let's take a look at how this works. So dp_deploy deploys the data product. It doesn't even need any parameters at this point, if you have followed the steps as I described. This was pretty quick, because it was a small data product.

Then, as a user, you can access this library by doing dp_connect. So now you're connected to that library, and then you can list the content. As the content is being listed, it's perhaps worth taking a quick look at it. It has a lot of useful metadata, like who committed the data, what the commit note was, what the git SHA is (so that if I needed to go back to the code, I know how to pull that specific code), and a number of other useful pieces of metadata.

And lastly, if you found the data product that you want to pull, you can just pull it by version. Now, by convention, what we have done is that the data product's input contents, we keep as what we call pinlinks. Essentially, a pinlink is a function call to a specific version of a dataset that was integrated. So the data itself is not integrated into the list that is the data object. The outputs, since there are fewer of them, we always keep as data frames within the data.

Data product structure and limitations

You can bring in your own structure as you wish, but by convention, this is the kind of structure that we typically follow. So by convention, we always start the data product name with dp and an underscore, perhaps the project name, and then some particular topic that deals with perhaps a branch of that data product.

And we have input, output, metadata, README; you can have other options. So it's comprised of data frames and functions that do various things. A family of functions worth mentioning is this pinlink that I mentioned earlier.

So let's say you have this data product, it has some input, and you look at the clinical component because you're interested in the demographics. At this point, at this leaf of the list, you really have a function call to a specific version of this demographic data. So it's not part of your data, but you can pull it in as needed.

So what typically happens is we have hundreds of these input data sets, each of them of fairly decent size, but the data product itself is pretty small in size, because all of these are actually function calls; they are not data. By convention, since we have fewer of the outputs, we keep them as data frames. So you have data for analysis A, data for analysis B, you have some metadata and README and such.

The output can absolutely also be pinlinks if needed, but the need hasn't been there so far.
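The pinlink idea, keeping a callable reference in the list instead of the data itself, can be illustrated with a plain closure. This is a toy stand-in (the helper `make_pin_link` is made up here); the real implementation resolves a specific pinned version via the pins package:

```r
# Toy pinlink: a function that materializes a frozen snapshot only when
# called, so the data product object itself stays small.
make_pin_link <- function(path) {
  force(path)
  function() readRDS(path)  # real daapr would fetch a pinned version instead
}

# "Freeze" a snapshot of a demographics table, keep only the link
f <- tempfile(fileext = ".rds")
saveRDS(data.frame(id = 1:3, age = c(34, 51, 29)), f)

dp_input <- list(clinical = list(demographics = make_pin_link(f)))

demographics <- dp_input$clinical$demographics()  # call to pull the data in
```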

So as you're thinking about this, some limitations and constraints: daapr deals with tabular data, but does not natively impose or accommodate relational data models. So that is a consideration. daapr is not intended for large data tables. If you have a bunch of smallish to medium-sized tables, hundreds of megabytes, and you have hundreds of them, that's totally fine. If you have single data tables that are gigantic, at this point we have some ideas for how to deal with that, but it hasn't been part of the use cases that we have worked on, and it's not supported.

daapr also does not address data governance. I would be glad to talk a little bit more about that, but instead of looking at this as a shortcoming, I look at it as an opportunity for collaboration across the team. You can work with the data platforms, which have a lot of knobs and levers that allow granular access to be managed. But that is not among the elements managed by daapr.

And most importantly, daapr and the packages it is built upon are very young. daapr itself is not that old, and pins and renv, which are integral to its functionality, are also quite young.

Summary and future work

So in summary, I'm hoping at this point some of you see the value and agree that with a minimal amount of work, which we showed in those videos doesn't take a whole lot of time, you can take your current data wrangling and trade it up for a data product that comes with all of those attributes and guarantees, which hopefully makes the experience of the user much more pleasant and builds efficiency across the team.

This is very much a project that we are both heavily using, as consumers of the framework, and actively developing. A lot of the development follows the DevOps cycle, but here we are applying it to data.

Just to mention some of the projects we are working on: we are working on information-based diffing of data; this is work where a great colleague joined us for the summer and contributed to building the ddiff package. We are also building vignettes around how to automate, so CI/CD. Setting up data testing is another element we want to build vignettes around, so folks can integrate data testing into the workflow, and there is some work around adding additional metadata to enable observability.

And lastly, something that is interesting: since the data is a recipe, you can spin it up and spin it down. So demand-based deployment could help save storage, if that is ever a concern.

So with that, I just wanted to take a moment again to thank our meetup organizers and community. I also need to acknowledge the efforts of a lot of colleagues who have contributed thoughts, ideas, and support one way or another to this project. And last but not least, all of you for tuning in today. I'd love to talk a little more and address some questions.

Q&A

Awesome, thank you so much, Afshin. There are a lot of questions starting to come in on Slido, so I just want to remind everybody that you can ask questions wherever you are.

But one question that I see over on Slido was what's your best advice for companies running into issues getting approval to use open source?

Typically, I would say it usually starts with a champion who helps drive some successes. Shiny is great for that because it brings new capabilities into the organization. In most of the big pharmas, and the small and even medium-sized ones, there's usually a spark that starts with someone. That helps create competency centers, or data science meetups or hangouts, which usually give people a way to get help, and it just starts to snowball from there. R is a great language for starting off because you can have a lot of small successes that can really grow, and that leads into Python and other languages, JavaScript and Julia and so on. So usually it starts with the right person at the company driving for it.

I see somebody had asked a question over on LinkedIn, and I think this was toward the very beginning of the presentation: when you mentioned the vanilla code, is it possible to discuss it here?

So the package itself generates code. It provides some template code, as I mentioned when I talked about the blueprint. When you're starting your project, it populates your folder with some base wiring that you can then add to. I don't know if that's what you were referring to as the vanilla code or not.

I see there was another question on Slido, which is, how has it been implementing the data as a product framework for a multilingual team? Does your environment support Python and R, and do you use Connect for this?

Yeah, I really love this question, because it is one of the things that is definitely on my mind. At this point, it is primarily R. Pins, which is an awesome package, so shout out to the RStudio developers for pins, is at the core of daapr. Pins supports both Python and R, so the data products, being pins, can in theory be pulled by Python as well as R.

Now, when it comes to development, we are talking about an end-to-end process: we take a recipe and build it up. Of course, that code is written in R, and the packages are written in R. In theory, it is totally feasible to write R code in a way that interfaces with Python, or vice versa, and have a sister package that manages that. If you just want to use the data product, you can absolutely read the pins at its core from Python or R. But if you're developing it, there is some dependency on the environment you're developing in. So it can be done in theory, but perhaps it's better to choose where you want to be.

What's it like for new users coming into this? Is there a certain type of training? Data as a product, data wrangling, and CI/CD are, I would say, newer methodologies coming in. What's it like for people coming to this new, or who are used to running these things more manually or in a batch mode?

Absolutely. Again, I want to distinguish between folks coming to this as users and folks coming to this as developers. As a user, I think it's pretty straightforward; the ramp-up to "okay, this is the data library and I can pull from it" is much easier. As a developer, of course, there's more involved.

And I want to say how critical this is; it is one of the things we have learned since the first time we presented on this. Technology only goes so far. When it comes to these DevOps practices, it's really about culture. It's really about the collaboration. So it's been absolutely critical for us to forge those relationships and commitments, and the change in mindset of how we are doing data. It is a commitment.

It is something where folks need to say, okay, I'm going to commit to incurring this cost and establishing my data infrastructure in this way. And that commitment leads to some changes, like in the question of when the data is done. It is parallel to the question of when we are done analyzing TCGA, for example. No one asks that question, because there are so many questions you can ask of that data set.

So building that mental bridge and cultural shift, I think, is something that takes time. From the technology perspective, learning the code and all of that, I feel someone going through it a couple of times will be able to pick it up. And one learning we have had is that it is really important for those in the team who are a little more experienced to lower the entry barriers a bit. So you bring in your wrangling code, you work with your colleagues to integrate it into daapr, and from that point on they can maintain it. So it's a bit of a commitment, but once folks make that commitment, I think within the course of a month or so they can fully be in this cycle of developing the data across the team.

And for surfacing the data, for people looking for and finding it, is there a data dictionary or a system, or do they go into an environment? Say someone says, I want this assay data or this research data. How do they go about locating it?

Yeah, absolutely. Building additional metadata that makes the data set searchable is certainly something there's room for. At this point, the owners of the data, the folks who are subject matter experts, are perhaps the best people to connect with. It is absolutely possible to put metadata out there, but a lot of times folks want to just go fishing for data, and a lot of the context may be missed. If they have specific questions, and they know which team owns a data product, it's just a matter of having that conversation up front, and that gets ironed out.

And do traditional data environments like databases, and maybe even larger distributed data sources, also play in here, or is this more for flat files?

No, absolutely. In the examples we have worked on, a data product brings in data from a diversity of sources. You may have a traditional relational database with a table you want, so you bring that in, and then you augment it with a table you have curated yourself, of medications and their classifications, for example. The key element is that unlike some of these data repositories or platforms, which are live data that could change at any time, here we are snapshotting. Every time you bring data in, you take a snapshot of its state and onboard it into your data product.
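A minimal sketch of that snapshotting idea, in Python for illustration (daapr itself is R; the table name and in-memory database here are placeholders for a real warehouse): instead of linking to a live table that can change at any time, you copy its current state into the product at build time.

```python
import sqlite3
import datetime

# Stand-in for a live relational source (illustrative schema and data)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE medications (drug TEXT, cls TEXT)")
con.execute("INSERT INTO medications VALUES ('aspirin', 'NSAID')")

# Capture the table's current state, stamped with when it was taken;
# this frozen copy, not the live connection, is what gets versioned
snapshot = {
    "taken_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "rows": con.execute("SELECT * FROM medications").fetchall(),
}
con.close()
```

The live source can keep changing; analyses built on the product always see the same snapshotted state, which is what makes the product reproducible.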

And do you have scheduling components? Like if they say, hey, pull this every day at 9 a.m., run this, and then when the scientists want to access the data it's routinely updated with the new lab data, for example.

Yeah, that can absolutely be done. We haven't had that kind of automation as a use case. One of the things we talk about with our colleagues is that the cadence of data products is at a slower pace. If you have a data set that you need to look at every day, perhaps you don't spend a huge amount of effort on feature engineering and all of those elements. We are going for something like a quarterly update to the data product: we are going deep, but the data evolution is not that rapid. In theory, you can absolutely do it; taking the snapshots on a schedule could just be part of the code that builds these data products.

I think it's something we see pretty often, especially with our developers: the idea of creating an ETL job, scheduling it, and the output of that being a data product or data set that is then fed downstream into a machine learning model or a Shiny app, for example.

I see there's a few other questions that are coming through on Slido. But one was, Afshin, could you explain more about using promises to manage credentials that will be needed in the user's environment?

Yeah, absolutely. Everything in a data product becomes code, right? So if you need to access something that is behind some authentication layer, one way is to provide that password, put it in the text, and then it goes in your repo. Of course, that is a big no-no. So what we can do instead is have a function call as text. Imagine an environment variable in R, accessed with Sys.getenv(), and there it is, the environment variable. These then become promises. You can imagine a configuration file that tells you where the data is and what password to use to get to it. It doesn't tell you the exact password; it says, run this command. And the promise is that when you run this command, this environment variable or key will be present in the environment. That's what I mean by a promise.
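The promise idea can be sketched like this, in Python for illustration (the talk's example uses R's Sys.getenv, and the config keys and variable name here are hypothetical): the config stores the code that retrieves a secret, never the secret itself, so the repo stays safe to commit.

```python
import os

config = {
    "data_url": "https://example.org/study/demographics.csv",  # illustrative
    # a promise, stored as text: code to run, not a password
    "credential": "os.environ.get('STUDY_API_KEY', '')",
}

def resolve(promise_text):
    # Evaluated at build time in the *user's* environment, where the
    # variable is expected to be set. Config is trusted first-party code.
    return eval(promise_text)

key = resolve(config["credential"])  # empty string unless STUDY_API_KEY is set
```

The repository only ever contains the instruction "run this command"; the secret itself lives in each user's environment.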

There's another question that I might reframe just a little bit to expand on it, but it was anonymous. But it's, are there certain tools that you've tried that haven't worked well for your requirements and how did you move towards getting the right tools that you needed?

So, is this in the context of building these data products? Apologies if I'm not addressing it, but one of the tools we have tried that hasn't quite done what we wanted relates to diffing, which is the motivation for ddiff. The idea is this: if you're thinking about your data and code as interchangeable, then when you write code and commit it, and someone asks what the difference is between this version and the previous one, you do a diff and it tells you the lines of code that have changed.

Now, imagine you have a column of data and everything is multiplied by two. Every single element in the million rows is going to pop up as a textual difference: you started with one, and one became two. But from an information perspective, nothing much has changed; you have just multiplied everything by a factor. So we would like a diff on the data side that is as actionable as a diff on the text side.

If you use a text-based diff, and there are a lot of good packages out there that do that in R, you don't get that information-based view. You don't see how the likelihood of the data has shifted. If you could come up with a score that says your data shifted by, say, 0.02, and I know that's perhaps a high bar, that would be actionable: I don't need to rerun my ML model because my data barely changed. Versus a score that says your data changed by 95 percent; some likelihood measure tells you your data really shifted. So that is something we are developing now.

I see somebody else asked on Slido. Someone said, there's a supportive community for open source and pharma, even though it's a competitive industry, other industries seem less open to sharing. Why do you think that is?

Well, I think, Rachel, you probably have as much or more insight than I do, because you work across industries. But one thing I want to say: I grew up in biotech, more on the technology side of biomedical research. So I watched pharma for a while and then I joined. And even back then, I was blown away by the generosity of pharma in putting code out there, even putting algorithms out there.

I don't know exactly, and I can't speak for all of pharma, but I feel like everyone in pharma is really, really focused on patients' lives. When you're thinking about those medicines, those chemistries, and how we are innovating on therapeutics, that is where a lot of the sensitivity and the desire to innovate is. And there's a huge amount of generosity when it comes to algorithms that are not a pharma company's core offering. If it helps the industry, they put it out there.

I think you hit the nail on the head. It's inspiring work; we're creating medicine and helping people's lives. I think everybody, personally or through family members, has been impacted. I know I have been, and I love the industry and I love supporting it. I've felt this way since day one of being involved in it. And I think, too, there's a strong connection to the academia space, and a lot of research comes out of that and carries over into the ecosystem, which also helps create some of the