Resources

Posit Meetup | Afshin Mashadi-Hossein, Bristol Myers Squibb | Framework for Data Collaboration

Led by Afshin Mashadi-Hossein, Sr Principal Scientist at Bristol Myers Squibb

GitHub link: https://github.com/amashadihossein/daapr
Other pharma use cases: rstudio.com/champion/life-science
RStudio for Clinical Reporting: rstudio.com/solutions/pharma
Chat with RStudio: rstd.io/chat-with-rstudio

Abstract: For data science teams, data preparation takes a substantial investment of time, data science expertise, and subject matter proficiency. However, as the name implies, data preparation is typically viewed merely as a means to an end, encouraging the creation of expensive but often single-use and fragile elements in data analysis workflows. Rather than seeing data preparation as an obstacle to be removed, we propose a framework that recognizes the time and expertise invested in data preparation and seeks to maximize the value that can be derived from it. Viewing analysis-ready data as a multi-purpose, modularly built product that should lend itself to collaborative development and maintenance, the framework of Data-as-a-Product (DaaP) aims to remove barriers to version tracking and collaborative data development and maintenance. Specifically, the framework, which is entirely implemented in R, enables joint code and data versioning based on git, standardizes metadata capture, tracks R packages used, and encourages best practices such as adherence to functional programming and use of data testing. Collectively, the patterns established by the DaaP framework can help data science teams transition from developing expensive, single-use “wrangled” datasets to building maintainable, version-controlled, and extendable data products that can serve as reliable components of their data analysis workflows.

Bio: Afshin is a data scientist who is passionate about putting engineering and computational tools to work to realize the potential of biomedical data in service to human health.

Aug 23, 2022
1h 19min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Thank you so much for joining us today. Welcome to the RStudio Enterprise Community Meetup. I'm Rachel Dempsey. I'm calling in from the RStudio office in Boston today.

Today we are streaming out to LinkedIn and YouTube Live. For anybody who's joining this group for the first time, a special welcome to you. If you are joining for the first time, this is a friendly and open meetup environment for teams to share the work they're doing within their organizations, teach lessons learned, network with each other, and really just allow us all to learn from each other.

Thank you all so much for making this a welcoming community. We really want to create a space where everybody can participate and we can hear from everyone. We love to hear from everyone, no matter your level of experience or the industry that you work in as well.

While you can ask questions on whichever platform you're watching from, you can also ask questions anonymously through the Slido link, which I'll pull up on the screen here in just a second.

But we are so excited to have so many members of the pharma community here to join us today as well. So I'd love to use this opportunity to share some information about a few other pharma communities and resources that exist today. And I have my colleague here, Phil Bowsher, joining us.

Hey, Phil. So a lot of exciting things are happening in the community. We have the work through R/Pharma. We have things through the champion page, which is fantastic for people that are setting up environments. We also have a lot of exciting packages coming out of the pharmaverse right now in the pharma ecosystem. So yeah, lots of new exciting things in the ecosystem.

So for today's meetup, we are really excited to be joined by Afshin Mashadi-Hossein, Senior Principal Scientist at Bristol Myers Squibb, who will be sharing their data science framework for data collaboration. And as mentioned earlier, we will have time for Q&A as well. So you can put those questions into wherever you're watching from, or use the Slido link for anonymous questions.

But with that, I'd love to turn it over to our speaker for today's meetup, Afshin.

Sounds great. Yeah, thanks a lot, Rachel. Thanks, Phil. Hi, everyone. My name is Afshin Mashadi-Hossein. And as Rachel mentioned, I'm from Bristol Myers Squibb.

Before going further, I wanted to, again, thank the RStudio Pharma Meetup community and organizers, particularly Rachel and Phil here, for giving me the opportunity to share with you all Data-as-a-Product, a data science framework for data collaborations. This is work that my co-authors, Julie Rutlowski and Garth McGrath, and I presented at R in Pharma back in late 2021.

So the hope in presenting it again and sharing it here again is that we have a little bit more time for more of a discussion and question and answer. And hopefully, some of those conversations will benefit from the experiences that we have accumulated since that presentation last year.

Defining data product and data as a product

So to start things off, I think it's quite critical to provide a definition of data product and data as a product. If you have heard of this term, you might have noticed that in different circles, in different groups, it could mean quite different concepts.

How we are defining data product is simply: a data object that, much like any other product, comes with a set of characteristics or attributes. So specifically, we take this notion of a data product to be a data object that is use-case driven. This is in contrast with, quote unquote, wrangled data that can serve any purpose: it's clean data, and you can use it for whatever purpose. A data product, instead, has to have a specific user story, or a set of coherent objectives that it achieves.

Now, if you have that level of specificity, you can imagine that you may have slightly different use cases leading to a variety of different data objects. So it becomes even more critical to be able to version control it and track it. So making it use case specific and being able, then, to branch it and version track it becomes quite critical.

It needs to be tested. It needs to be reproducible. And I would say not only reproducible, but easily reproducible, meaning that with a handful of commands, you can just reproduce the data. It should be easily maintainable, extensible. Think about this as your typical R package in this context.

So in this context of R users, you can use an R package for whatever end that you have in your analysis or otherwise. You can also use that to integrate it into a new R package and build upon it and extend it. Same should apply to data products. You should be able to integrate it into a new data product, bring a few of them together, build upon them, extend them. So it should be extensible.

It should be findable, accessible, and portable. A note on that last one: as data scientists, if I may speak for a lot of us, we may not necessarily be super conscious of or concerned about whether my data is coming from Azure or AWS S3 or whatever platform, whatever infrastructure I have. But our colleagues in IT or other places may be. So as the requirements of my organization change, I may need to take my data from, let's say, AWS S3 to Azure, or vice versa, or to some network drive.

All of these characteristics that we talked about should come along. So it shouldn't be something that is dependent on the platform or technology outside of the data product. This core set of characteristics and attributes defines what we call a data product.

Now, data as a product is a framework to make it practical for data science teams to pull together their data wrangling efforts towards collaboratively building data products. So on one hand, we have this product. On the other hand, we have a set of processes that allows us to easily, or at least methodically, build these data products.

So the processes could be agreements between the teams, like the coding conventions we want to use, as well as some automations and tools that allow us to implement those agreements. So you can think of data as a product as a framework, a way of thinking, as well as the tool set that allows us to implement that way of thinking. The data product is the output.

Motivation: the 80-20 rule of data science

So to focus on one, perhaps you can think about this 80-20 rule of data science, which is basically the observation, made by a lot of data scientists, that a typical week could involve four out of five working days engaged in data processing, or data wrangling, whatever you call it. And then on that last day, you may get a chance to actually do your data analysis.

So as a data scientist or leader of a data science team, if you are interested in adding more agility to your overall team efforts, it is fair to look at this data processing and try to bring in additional agility to it and efficiency to it.

And there are a variety of approaches that teams have implemented. Hopefully, we all work in teams where heroic data wrangling is not one of those approaches. No one is advocating for shrinking that 80% by just willpower and working extra hours. But then there are things like new data agreements that you may put in place with data vendors. You may ask your data platform to do more in the way of data validation. Even pushing that validation of data further upstream into your pipelines could help.

Perhaps a slightly different view would be, rather than seeking relief, to acknowledge that here's my team spending their precious time, resources, and expertise (and I should emphasize the expertise part) building something. So there's a lot of value in that. If my team spends 80% of its time building this artifact or output, that has a lot of value in it.

So alongside thinking about reducing that time, can I also think about how to drive more value from this precious resource? If I'm thinking in that sense, I'm not trying to remove the burden of this effort; I'm just trying to drive more value from it. But to help alleviate that burden, I can also make collaboration easy, so that big lift is shared across the team.

So you make collaboration easy. When permissible, you can add a dash of automated standards, and you put in place process frameworks for collaboration. So not surprisingly, perhaps, at this point you can imagine that this latter approach of driving more value from data wrangling is what the framework of data as a product, as we are defining it, advocates.

So the idea is that you're bringing in your usual input data, and you're adding your logic of data wrangling. And then you feed it into this process that out would come a data product, which then can be fed into a number of utilities, analysis, apps, automated reports. And this data product now is reusable and extensible. So that work is essentially something that lives on.

That is in contrast with a typical workflow of a data scientist, where the data wrangling may be so inseparably entangled with your utility, be it an analysis or otherwise, that once that analysis is done, or once that utility has run its course, that piece of data work is essentially disposable. So you have a very, very expensive effort that now has no more use.

Requirements for a data product

So up to this point, we have talked about this notion of data product in abstract. So let's think about what's our north star. What are we aiming for here? So as I alluded to earlier, it has to serve a coherent set of objectives. So this is one of the core requirements.

So that limits the scope of the effort. Even with a limited scope, you don't want to be building it from A to Z all in one shot. So think about, again, a product. Like you may have different components that you build and you assemble. It has to be built modularly.

And if you are building it modularly, and your buddy here is building some parts and you are building some parts of it and you're bringing it all together, it shouldn't be something that is an art where everybody does it their own way. So collaboration requires some blueprint of how we are building these elements.

Now, again, collaboration requires everything to be versioned so we don't get tangled up in what version of what elements we are talking about across the team. So by everything here, I mean code, environment, and data needs to be versioned.

So all of the things that I've talked about so far, as far as the requirements of the product, has to do with the build process. But from a user perspective, it also has to be easy to use. So connect, list, get is all you need to be able to pull the data down and use it. So as a user, you should have an easy interface. And lastly, as I mentioned, we want to have it be platform agnostic.

So if you're going from S3, and later on you need to go to file systems, or RStudio Connect is your next destination, the data science workflow, not the infrastructure, should dictate where your data is. Other requirements can be accommodated, and you don't have to worry about the functionality of your data being compromised.

Now, this is the product's requirement. And there should be a build tool that can enable building that product according to those attributes. So you can see the linkage here that a build tool, given all those attributes that I mentioned are requirements, should naturally support modular development, should minimize boilerplate, should enable reproducibility by capturing everything in the state of the code, data, and environment, should encourage best code style practices, should enable full traceability of everything, code, data, environment. And as a builder, you should not be learning a new language. If you know R, if you know Git, you should be good to go.

Data as code: the conceptual approach

Now, if instead of data as a product this was software as a product, you would rightly expect that the tool is already around. With the rich set of tools, experiences, and processes that have been developed and implemented by the DevOps community, when we are talking about software, we can accommodate all of those features of a build tool. The difference here is that we have not only code but also data. So this gives us an inspiration: if we were able to capture data as code, we would build a bridge to this rich set of tools and experiences that are available in the DevOps community and put those to use for our end.

So, building with data as code: what's the approach here? Let me walk through an example of how this could work conceptually.

So let's say you are here at point one. You have some data sets. Some of you in pharma are familiar with this nomenclature. So let's say you have some LB data, which is lab data collected on a cohort of patients. You have some medication data and a bunch more data sets at your disposal.

Now, you start your project. You set the initialization in motion. It starts capturing the environments that you're developing in. It adds the remote locations for your code, for your data. So that initialization is done.

Now, you go ahead and onboard your data into your data product. So let's start with this LB domain, which is the collection of lab measurements on a cohort of patients you have. And so you bring that in. And let's say you can now sync that to your remote location, which you have already defined at initialization.

So at this stage, the part that is perhaps worth attention is that you can capture this data that you're pushing to a remote location with a set of metadata that uniquely identifies it. So you're capturing some metadata, like the label for that data and some description. And this version, which could be essentially a SHA hash of that data, or any other flavor of hash for that data, uniquely identifies that data set.

And some additional set of metadata that then will help you retrieve this data when you have the need for it. So now you have converted your data to a set of metadata that uniquely identifies and gives some description about that data. And if you capture that, as you repeat adding additional data sets, you build a manifest that has all the information you need for your input data. And that is the approach of converting data to code.
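As an illustration of this data-to-metadata idea, here is a minimal base-R sketch of versioning a dataset by a content hash. This is not how daapr implements it internally (daapr builds on pins); the helper name `hash_dataset` is made up for this example, and while the talk mentions SHA hashes, this sketch uses MD5 via `tools::md5sum`, which ships with base R.

```r
# Toy sketch: identify a dataset by a hash of its serialized content.
# Identical content yields an identical version id; any change yields a new id.
hash_dataset <- function(df) {
  f <- tempfile(fileext = ".rds")
  on.exit(unlink(f))
  saveRDS(df, f, version = 2)      # fixed serialization version for stability
  unname(tools::md5sum(f))
}

v1 <- hash_dataset(head(cars))
v2 <- hash_dataset(head(cars))                               # same content
v3 <- hash_dataset(transform(head(cars), dist = dist * 2))   # changed content
```

A manifest entry would then pair such a version id with a label, a description, and the remote location, which is enough to retrieve that exact snapshot later.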

Then as you add additional layers of code for logic and you're capturing your environment, all of this is code. You can commit it. You can push it. And when you have done that pushing, in your remote GitHub repository, now you have the build logic, you have the package dependency, and you have all the information you need to identify the input data.

Now, this recipe is complete, in the sense that a colleague sitting at point four, or maybe a CI/CD machine sitting downstream that gets triggered by the push, can build this from scratch and push it again to the same remote data repository in the same fashion, capturing the same set of metadata. Now, everything here has been converted into code. So your data lives in its data location, your code lives in the code location, and as long as you have access to both, you can build it from scratch.

The daapr package family

We have implemented this framework in R as a family of R packages we call daapr (pronounced like dapper). And it is available for use to build data products, essentially implementing the framework that I just shared with you all.

So this family consists of three packages: dpi, dpdeploy, and dpbuild. Why three packages, you may ask. I can come up with a couple of different answers, historically, in terms of bulk and code management. But perhaps the most important one is that you can think about three different personas that would be using these packages.

So starting with dpi: it is meant to simplify access to data products. If you are just a user, whether it's an application or a report, you only need a handful of functions to be able to access the data products. So you don't need the bulk of the rest of the packages.

dpbuild is meant to simplify the building of data products. This is the bulkiest of the packages, and it is used by the data developer. So you can think about the persona here as the data developer. The user of dpdeploy can be a data developer, or it can be a workflow automation machine. The task of dpdeploy is to simplify sync and deployment of data products. So you can have a CI/CD process set up where, as your data comes in, it gets pushed.

End-to-end workflow walkthrough

So there are a variety of workflows that daapr needs to accommodate. When you write code, you don't always start from setting up your project and go all the way to deployment. You may set up your project, develop some, go do something else, come back, have some pull requests coming your way, make those changes, deploy.

But perhaps to provide an end-to-end example, if you were to start from the beginning and go to the end, it would look something like this. So you start with dp_init, which sets up your project. It's kind of like when you do git init.

You add your data, so you onboard your data. We are, again, encouraging best practices: we encourage folks to spend a little bit of time here making sure the data they're onboarding is actually what they want to onboard. Then you add the logic of your code, and then you deploy the code and data.

Now, what I want to emphasize here is that this step of adding your code is the step that should take the bulk of your time. Whether or not you're using daapr, you need to spend some time developing that logic of data processing. So the idea here is that if you today spend, let's say, an hour or two on a particular project just doing the data processing part, bringing that into the daapr framework adds maybe 5 or 10 more minutes on top of that. The bulk of your effort remains the same, so the boilerplate should be minimized.
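Sketched in code, the steps just described might look roughly like this. This is a hedged sketch, not runnable as-is: the function names follow what is said in the talk (dp_init, dp_input_map, dp_input_sync, dp_input_write), but the exact argument names are assumptions, so check the daapr vignettes for the real signatures.

```r
library(dpbuild)

# 1. Set up: git init, renv init, remotes for code and data, project metadata
dp_init(project_name = "cars_demo",        # argument names are assumptions
        branch_name  = "main",
        readme_general_note = "Convert cars stopping distance to metric")

# 2. Onboard: drop raw files into the input folder, then map, inspect, sync
input_map <- dp_input_map()                # metadata for everything in input/
dp_input_sync(input_map)                   # push selected datasets to remote
dp_input_write(input_map)                  # record the manifest (data as code)

# 3. Add the wrangling logic (the bulk of the work), then 4. deploy
```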

Now, this is the claim. And we're going to go through a toy example that hopefully shows this. The toy example is overly simplistic, admittedly. We are starting from the cars data set. And there is a column that is the distance from the point that a driver applies the brake to the point that the car stops.

So that column is in imperial units (feet). And let's say the user story that you have asks you to convert that to a metric unit. An overly simplistic example, and it is also in the vignette for the package, if you wanted to check it out more closely.
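The wrangling logic for this toy example is one line of base R. The cars dataset ships with R, and its dist column is stopping distance in feet (1 ft = 0.3048 m):

```r
# cars: speed (mph) and dist (stopping distance, ft), built into base R
ft_to_m <- function(ft) ft * 0.3048

cars_metric <- transform(cars, dist_m = ft_to_m(dist))
head(cars_metric, 2)
#   speed dist dist_m
# 1     4    2 0.6096
# 2     4   10 3.0480
```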

But being overly simplistic, still, the wiring and setup of the project are identical to a bigger, more realistic project with tons of data and tons of logic. So going through it hopefully gives us a sense of the degree of boilerplate and the time that we need to allow for it.

So we start with this setup step, again thinking about all the steps that I mentioned previously: setup, data, code, deploy. So we're going to go through that. For setup, simply, there's this function, dp_init, that sets up your project folder, runs git init, captures the environment you're developing in thanks to renv (so it does the renv init), adds remotes for code and data, and records project metadata.

Let's take a look at how this looks in practice. In this video, we are essentially going through that initialization step. So we bring in the library. We set up some promises for the credentials that we need for access. We are not providing a credential to the framework; we're just saying, these are the environment variables you will need when you are ready to build the data product. And we are putting that into this dp_init.

We are also providing the location of where the data needs to go. And then we execute this dp_init, which takes in as parameters some metadata: What's the branch name? What's the project name? What's the user story? What's the remote location? And those promises around environment variables that will be needed at build time. So again, here we have a single command, really, at line 10, that sets up your project.

Then we onboard the data. In this case, we have a very simple dataset that we are onboarding, but in general, you can onboard any datasets, bundled together as zip or otherwise, into your data product. One thing I want to mention here is that the versioned nature of a data product means it is a snapshot. So you may have started with live data, but you are snapshotting it at that point.

So there are some datasets that you need to have at hand as a snapshot, and then you build your data product. So you onboard your input data into your input folder. As I said, there are some environment variables that you promised you would be providing at build time, so you set those promised environment variables. Then there's this step of dp_input_map. So you run this command, dp_input_map.

This is a step that we added with the mindset of encouraging best practices, which in our mind means that if you have some bulk of data, it is beneficial to take a look before sending everything over. In some cases, you know that you want to send over all of the data that you have put in the input folder, and that's easy; in this toy example, that's going to be the case. But in other cases, you want to look at some metadata. Are there duplicate data? Are there names that could clash?

So there's this step where you do dp_input_map, and you get all of that metadata. You take a look at it, you pick the elements that you need to sync, and you call dp_input_sync, setting flags for which data sets you want to sync. And this syncing step could take a little while, right? Because at this point, you're literally pushing the data sets one at a time to your remote location. So if you have hundreds of those data sets, depending on the size, it could take a bit.

And in this last step, you do dp_input_write, which is actually pretty fast, because at this point all of those input data sets that you have synced are all metadata, that is, code. So you're just writing a manifest. So it's pretty quick.

So at this point, let's take a look and see how this looks. So we are just writing this cars data set into the input folder. And so that is the only input data that we want to onboard into our data product.

Next, we are gonna activate the project. This assumes that there was a gap between the initialization and this second step; if you are going all the way through from initialization, you don't need to activate again. Then you set up those environment variables that we promised, and we capture everything into this config object that has the metadata, all the elements that we need to have at hand for the project to proceed smoothly.

So now at this point, we only have one data set. So we do the input map, and then we essentially know that we want to sync this data; there's nothing that we want to exclude or change. So we just sync by calling dp_input_sync. Now your input data has been synced. This was pretty quick because it was just a small data set. So you get a version for that, then you do the write, and you're done at this point.

So now you have converted, you've set up your project, you've converted your input into code, onboarded that data, and now comes the part that you're gonna be building the data product itself, going from input to the output. So this is the part that you're gonna be spending your time, bringing your subject matter expertise to codify all the requirements and all of the user stories that you heard about into code that does the logic of data wrangling.

In our case, it's very simple, right? We are only changing one column, but in essence, it's gonna be exactly the same every time. So you have dp_input_read, which reads in all of the input data that you have synced and onboarded into your data product. You add the wrangling logic. Then you structure your data object: there's this function called dp_structure that structures the input, the output, and the metadata all into one object.

Then, similar to dp_input_write, you register that data as metadata now. So again, your input was converted to code, and now your output is also converted to code. And since everything is code, you do dp_commit and dp_push. And at this point, you have a repository with a README and all of the documentation that you need to build this data product from scratch.
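Pulled together, the build step might look like the following hedged sketch. Function names are as heard in the talk (dp_input_read, dp_structure, dp_commit, dp_push); the wrangling line and all argument details are illustrative assumptions, so treat the package vignette as the authority.

```r
# Build: read synced inputs, apply the wrangling logic, structure, publish
d_in  <- dp_input_read()                      # inputs resolved via manifest
d_out <- transform(d_in$cars, dist_m = dist * 0.3048)

dp <- dp_structure(input = d_in,              # argument names are assumptions
                   output = list(cars_metric = d_out))

dp_commit("Add metric stopping distance")     # everything is code now, so
dp_push()                                     # commit and push as usual
```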

Your data product hasn't left your workstation, but anybody who has access to the repo can build it on their side and push it, and you can include it in a CI/CD machine. So let's take a look to see how this goes. This is gonna be super fast. So you read all of the data sets, you added your logic, made the output, structured it in a data object, and then you write it as a YAML file, and then you do the dp_commit and dp_push, and that will generate your repo.

So at this point, the work is done as far as the data developer is concerned, and the data can be built. You don't have to deploy at this point; for example, you may commit and push a bunch of times until your data product is in a state that you want to publish. But let's say you are at that stage and you're ready. Then you call the dp_deploy function, a single function, and it will deploy the data product for you.

This may take, again, a little bit, because at this point you're also pushing your data product to your remote location, depending on the size. But that essentially concludes the A to Z of building the data product. Now, if you wanted to access it from the persona of a user, you only need these three commands: dp_connect, which logs into the library that has all the data products; dp_list, which lists all the available data and metadata; and dp_get, with which you can pick from the list of data products that you have available.
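The deploy-and-consume side, again as a hedged sketch using the function names from the talk (dp_deploy from dpdeploy; dp_connect, dp_list, dp_get from dpi), with argument details assumed rather than verified:

```r
# Developer or CI/CD machine: push the built data product to its remote
dpdeploy::dp_deploy()

# Consumer: three calls are all you need
library(dpi)
board <- dp_connect()        # log in to the board holding the data products
dp_list(board)               # list available products and their metadata
d <- dp_get(board)           # pull a specific product, pinned by version
```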

So let's take a look at how this works. So dp_deploy deploys the data product. It doesn't even need any parameters at this point, if you have followed the steps as I described. This was pretty quick, because it was a small data product.

Then, as a user, you can access this library by doing dp_connect. So now you're connected to that library, and then you can list the content. As the content is being listed, it's perhaps worth taking a quick look at it. It has a lot of useful metadata, like who committed the data, what the commit note was, what the git SHA is (so that if I needed to go back to the code, I know how to pull that specific code), and a number of other useful pieces of metadata.

And lastly, if you found the data product that you want to pull, you can just pull it by version. Now, by convention, what we have done is that the data product's input contents, we keep as what we call pinlinks. Essentially, a pinlink is a function call to a specific version of a dataset that was integrated. So the data itself is not integrated into the list that is the data object. The outputs, since there are fewer of them, we always keep as data frames within the data.

Data product structure and limitations

You can bring in your own structure as you wish, but by convention, this is the kind of structure that we typically follow. So by convention, we always start the data product name with dp and an underscore, perhaps the project name, and then some particular topic that deals with perhaps a branch of that data product.

And we have input, output, metadata, README; you can have other options. So it's comprised of data frames and functions that do various things. A family of functions worth mentioning is this pinlink that I mentioned earlier.

So let's say you have this data product, it has some input, and you look at the clinical component because you're interested in the demographics. At this point, at this leaf of the list, you really have a function call to a specific version of this demographic data. So it's not part of your data, but you can pull it in as needed.

So what typically happens is we have hundreds of these input data sets, each of them of fairly decent size, but the data product itself is pretty small in size, because all of these are actually function calls; they are not data. By convention, since we have fewer of the outputs, we keep them as data frames. So you have data for analysis A, data for analysis B, you have some metadata and README and such.

The output can absolutely also be pinlinks if needed, but the need hasn't been there so far.
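The pinlink idea, keeping a callable reference in the list instead of the data itself, can be illustrated with a plain closure. This is a toy stand-in (the helper `make_pin_link` is made up here); the real implementation resolves a specific pinned version via the pins package:

```r
# Toy pinlink: a function that materializes a frozen snapshot only when
# called, so the data product object itself stays small.
make_pin_link <- function(path) {
  force(path)
  function() readRDS(path)  # real daapr would fetch a pinned version instead
}

# "Freeze" a snapshot of a demographics table, keep only the link
f <- tempfile(fileext = ".rds")
saveRDS(data.frame(id = 1:3, age = c(34, 51, 29)), f)

dp_input <- list(clinical = list(demographics = make_pin_link(f)))

demographics <- dp_input$clinical$demographics()  # call to pull the data in
```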

So as you're thinking about this, some limitations and constraints: daapr deals with tabular data, but does not natively impose or accommodate relational data models. So that is a consideration. daapr is not intended for large data tables. If you have a bunch of smallish to medium-sized tables, hundreds of megabytes, and you have hundreds of them, that's totally fine. If you have single data tables that are gigantic, at this point we have some ideas for how to deal with that, but it hasn't been part of the use cases that we have worked on, and it's not supported.

daapr also does not address data governance. I would be glad to talk a little bit more about that, but instead of looking at this as a shortcoming, I look at it as an opportunity for collaboration across the team. You can work with the data platforms, which have a lot of knobs and levers that allow granular access to be managed. But that is not among the elements managed by daapr.

And most importantly, daapr and the packages it is built upon are very young. daapr itself is not that old, and pins and renv, which are integral to its functionality, are also quite young.

Summary and future work

So in summary, I'm hoping at this point some of you see the value and agree that with a minimal amount of work, which we showed in those videos doesn't take a whole lot of time, you can take your current data wrangling and trade it up for a data product that comes with all of those attributes and guarantees, which hopefully makes the experience of the user much more pleasant and builds efficiency across the team.

This is very much a project that we are both heavily using, as consumers of the framework, and actively developing. A lot of the development follows the DevOps cycle, but here we are applying it to data.

Just to mention some of the projects we are working on: we are working on information-based diffing of data; this is work where a great colleague joined us for the summer and contributed to building the ddiff package. We are also building vignettes around how to automate, so CI/CD. Setting up data testing is another element we want to build vignettes around, so folks can integrate data testing into the workflow, and there is some work around adding additional metadata to enable observability.

And lastly, something that is interesting: since the data is a recipe, you can spin it up and spin it down. So demand-based deployment could help save storage, if that is ever a concern.

So with that, I just wanted to take a moment again to thank our meetup organizers and community. I also need to acknowledge the efforts of a lot of colleagues who have contributed thoughts, ideas, and support one way or another to this project. And last but not least, all of you for tuning in today. I'd love to talk a little more and address some questions.

Q&A

Awesome, thank you so much, Afshin. There are a lot of questions starting to come in on Slido, so I just want to remind everybody that you can ask questions wherever you are.

But one question that I see over on Slido was what's your best advice for companies running into issues getting approval to use open source?

Typically, I would say it usually starts with a champion who helps drive some successes. Shiny is great for that because it brings new capabilities into the organization. In most of the big pharmas, and the small and even medium-sized ones, there's usually a spark that starts with someone. That helps create competency centers, or data science meetups or hangouts, which usually give people a way to get help, and it just starts to snowball from there. R is a great language for starting off because you can have a lot of small successes that can really grow, and that leads into Python and other languages, JavaScript and Julia and so on. So usually it starts with the right person at the company driving for it.

I see somebody had asked a question over on LinkedIn, and I think this was toward the very beginning of the presentation: when you mentioned the vanilla code, is it possible to discuss it here?

So the package itself generates code. It provides some template code, as I mentioned when I talked about the blueprint. When you're starting your project, it populates your folder with some base wiring that you can then add to. I don't know if that's what you were referring to as the vanilla code or not.

I see there was another question on Slido, which is, how has it been implementing the data as a product framework for a multilingual team? Does your environment support Python and R, and do you use Connect for this?

Yeah, I really love this question, because it is one of the things that is definitely on my mind. At this point, it is primarily R. Pins, which is an awesome package, so shout out to the RStudio developers for pins, is at the core of daapr. Pins supports both Python and R, so the data products, being pins, can in theory be pulled by Python as well as R.

Now, when it comes to development, we are talking about an end-to-end process: we take a recipe and build it up. Of course, that code is written in R, and the packages are written in R. In theory, it is totally feasible to write R code in a way that interfaces with Python, or vice versa, and have a sister package that manages that. If you just want to use the data product, you can absolutely read the pins at its core from Python or R. But if you're developing it, there is some dependency on the environment you're developing in. So it can be done in theory, but perhaps it's better to choose where you want to be.

What's it like for new users coming into this? Is there a certain type of training? Data as a product, data wrangling, and CI/CD are, I would say, newer methodologies coming in. What's it like for people coming to this new, or who are used to running these things more manually or in a batch mode?

Absolutely. Again, I want to distinguish between folks coming to this as users and folks coming to this as developers. As a user, I think it's pretty straightforward; the ramp-up to "okay, this is the data library and I can pull from it" is much easier. As a developer, of course, there's more involved.

And I want to say how critical this is; it is one of the things we have learned since the first time we presented on this. Technology only goes so far. When it comes to these DevOps practices, it's really about culture. It's really about the collaboration. So it's been absolutely critical for us to forge those relationships and commitments, and the change in mindset of how we are doing data. It is a commitment.

It is something where folks need to say, okay, I'm going to commit to incurring this cost and establishing my data infrastructure in this way. And that commitment leads to some changes, like in the question of when the data is done. It is parallel to the question of when we are done analyzing TCGA, for example. No one asks that question, because there are so many questions you can ask of that data set.

So building that mental bridge and cultural shift, I think, is something that takes time. From the technology perspective, learning the code and all of that, I feel someone going through it a couple of times will be able to pick it up. And one learning we have had is that it is really important for those in the team who are a little more experienced to lower the entry barriers a bit. So you bring in your wrangling code, you work with your colleagues to integrate it into daapr, and from that point on they can maintain it. So it's a bit of a commitment, but once folks make that commitment, I think within the course of a month or so they can fully be in this cycle of developing the data across the team.

And for surfacing the data, for people looking for and finding it, is there a data dictionary or a system, or do they go into an environment? Say someone says, I want this assay data or this research data. How do they go about locating it?

Yeah, absolutely. Building additional metadata that makes the data set searchable is certainly something there's room for. At this point, the owners of the data, the folks who are subject matter experts, are perhaps the best people to connect with. It is absolutely possible to put metadata out there, but a lot of times folks want to just go fishing for data, and a lot of the context may be missed. If they have specific questions, and they know which team owns a data product, it's just a matter of having that conversation up front, and that gets ironed out.

And do traditional data environments like databases, and maybe even larger distributed data sources, also play in here, or is this more for flat files?

No, absolutely. In the examples we have worked on, a data product brings in data from a diversity of sources. You may have a traditional relational database with a table you want, so you bring that in, and then you augment it with a table you have curated yourself, of medications and their classifications, for example. The key element is that unlike some of these data repositories or platforms, which are live data that could change at any time, here we are snapshotting. Every time you bring data in, you take a snapshot of its state and onboard it into your data product.
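A minimal sketch of that snapshotting idea, in Python for illustration (daapr itself is R; the table name and in-memory database here are placeholders for a real warehouse): instead of linking to a live table that can change at any time, you copy its current state into the product at build time.

```python
import sqlite3
import datetime

# Stand-in for a live relational source (illustrative schema and data)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE medications (drug TEXT, cls TEXT)")
con.execute("INSERT INTO medications VALUES ('aspirin', 'NSAID')")

# Capture the table's current state, stamped with when it was taken;
# this frozen copy, not the live connection, is what gets versioned
snapshot = {
    "taken_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "rows": con.execute("SELECT * FROM medications").fetchall(),
}
con.close()
```

The live source can keep changing; analyses built on the product always see the same snapshotted state, which is what makes the product reproducible.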

And do you have scheduling components? Like if they say, hey, pull this every day at 9 a.m., run this, and then when the scientists want to access the data it's routinely updated with the new lab data, for example.

Yeah, that can absolutely be done. We haven't had that kind of automation as a use case. One of the things we talk about with our colleagues is that the cadence of data products is at a slower pace. If you have a data set that you need to look at every day, perhaps you don't spend a huge amount of effort on feature engineering and all of those elements. We are going for something like a quarterly update to the data product: we are going deep, but the data evolution is not that rapid. In theory, you can absolutely do it; taking the snapshots on a schedule could just be part of the code that builds these data products.

I think it's something we see pretty often, especially with our developers: the idea of creating an ETL job, scheduling it, and the output of that being a data product or data set that is then fed downstream into a machine learning model or a Shiny app, for example.

I see there's a few other questions that are coming through on Slido. But one was, Afshin, could you explain more about using promises to manage credentials that will be needed in the user's environment?

Yeah, absolutely. Everything in a data product becomes code, right? So if you need to access something that is behind some authentication layer, one way is to provide that password, put it in the text, and then it goes in your repo. Of course, that is a big no-no. So what we can do instead is have a function call as text. Imagine an environment variable in R, accessed with Sys.getenv(), and there it is, the environment variable. These then become promises. You can imagine a configuration file that tells you where the data is and what password to use to get to it. It doesn't tell you the exact password; it says, run this command. And the promise is that when you run this command, this environment variable or key will be present in the environment. That's what I mean by a promise.
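The promise idea can be sketched like this, in Python for illustration (the talk's example uses R's Sys.getenv, and the config keys and variable name here are hypothetical): the config stores the code that retrieves a secret, never the secret itself, so the repo stays safe to commit.

```python
import os

config = {
    "data_url": "https://example.org/study/demographics.csv",  # illustrative
    # a promise, stored as text: code to run, not a password
    "credential": "os.environ.get('STUDY_API_KEY', '')",
}

def resolve(promise_text):
    # Evaluated at build time in the *user's* environment, where the
    # variable is expected to be set. Config is trusted first-party code.
    return eval(promise_text)

key = resolve(config["credential"])  # empty string unless STUDY_API_KEY is set
```

The repository only ever contains the instruction "run this command"; the secret itself lives in each user's environment.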

There's another question that I might reframe just a little bit to expand on it, but it was anonymous. But it's, are there certain tools that you've tried that haven't worked well for your requirements and how did you move towards getting the right tools that you needed?

So, is this in the context of building these data products? Apologies if I'm not addressing it, but one of the tools we have tried that hasn't quite done what we wanted relates to diffing, which is the motivation for ddiff. The idea is this: if you're thinking about your data and code as interchangeable, then when you write code and commit it, and someone asks what the difference is between this version and the previous one, you do a diff and it tells you the lines of code that have changed.

Now, imagine you have a column of data and everything is multiplied by two. Every single element in the million rows is going to pop up as a textual difference: you started with one, and one became two. But from an information perspective, nothing much has changed; you have just multiplied everything by a factor. So we would like a diff on the data side that is as actionable as a diff on the text side.

If you use a text-based diff, and there are a lot of good packages out there that do that in R, you don't get that information-based view. You don't see how the likelihood of the data has shifted. If you could come up with a score that says your data shifted by, say, 0.02, and I know that's perhaps a high bar, that would be actionable: I don't need to rerun my ML model because my data barely changed. Versus a score that says your data changed by 95 percent; some likelihood measure tells you your data really shifted. So that is something we are developing now.

I see somebody else asked on Slido. Someone said, there's a supportive community for open source and pharma, even though it's a competitive industry, other industries seem less open to sharing. Why do you think that is?

Well, I think, Rachel, you probably have as much or more insight than I do, because you work across industries. But one thing I want to say: I grew up in biotech, more on the technology side of biomedical research. So I watched pharma for a while and then I joined. And even back then, I was blown away by the generosity of pharma in putting code out there, even putting algorithms out there.

I don't know exactly, and I can't speak for all of pharma, but I feel like everyone in pharma is really, really focused on patients' lives. When you're thinking about those medicines, those chemistries, and how we are innovating on therapeutics, that is where a lot of the sensitivity and the desire to innovate is. And there's a huge amount of generosity when it comes to algorithms that are not a pharma company's core offering. If it helps the industry, they put it out there.

I think you hit the nail on the head. It's inspiring work; we're creating medicine and helping people's lives. I think everybody, personally or through family members, has been impacted. I know I have been, and I love the industry and I love supporting it. I've felt this way since day one of being involved in it. And I think, too, there's a strong connection to the academia space, and a lot of research comes out of that and carries over into the ecosystem, which also helps create some of the