Resources

Alok Pattani - Google | Sports Analytics Meetup | RStudio

Abstract: The increasing volume, variety, and velocity of sports data provide both great opportunities and challenges for data scientists working in sports. Using R with Google Cloud data science tools like BigQuery can help practitioners scale their analysis and impact in this "new era" of sports analytics. This presentation will include a demonstration of using R and Google Cloud together with an NCAA basketball data example, as well as a discussion of the application of such metrics and tools in the sports media and technology industries.

Speaker Bio: Alok is a Data Science Developer Advocate at Google, where he shows how to use Google Cloud tools for data science, in sports and otherwise. He is a sports analytics expert and a long-time user of R and RStudio. Before joining Google, Alok spent 8 years at ESPN, where he was a founding member of their Sports Analytics team and contributed significantly to the use of analytical content across all media platforms. Alok is originally from Cheshire, CT and earned a BA/MA in statistics from Boston University.

Alok's Slides: https://lnkd.in/gxQUf8nV

Packages shared:
- sportsdataverse: https://lnkd.in/g8UKAJgc
- wehoop: https://lnkd.in/grBgmc33
- hoopR: https://lnkd.in/gRHNWV4j
- glmnet: https://lnkd.in/g4cuxYzs
- googleAuthRverse: https://lnkd.in/gf5fRgcC
- All of Mark Edmondson's Google packages: https://lnkd.in/gmTpvMsY

Other resources shared:
- Sports data: https://www.spotrac.com/
- Analyzing NCAA Basketball with GCP: https://lnkd.in/gT-4vWwa
- 2020 Google Cloud March Madness Insights: https://lnkd.in/gEWd9xtu
- Alok's presentation on Innovating the MLB Fan Experience through Data: https://lnkd.in/g9XmFKsy
- NFL Player Tracking Data Meetup recording:
- Analyzing Soccer Data with BigQuery: https://lnkd.in/gbbCGKaK
- Sports channel on the R for Data Science Online Learning Community Slack: r4ds.io/join # chat-sports_analytics

Feb 17, 2022
1h 14min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

All right, thanks, Rachel and Mitch. Really appreciate you having me here to speak today. Been a big fan of R and RStudio for probably a decade now and been involved in sports analytics for longer than that. So excited to come talk about a couple of things related to both sports analytics, R and Google Cloud, where I currently work.

Like I said, I'm Alok Pattani. I work as a data science developer advocate at Google Cloud, which means that I talk to people outside of the company about how to use Google Cloud for data science. And often I like to do that using sports as an example, because I come from a sports background, and sports is fun and gives us a way to talk about cloud technology and data science in unique ways that appeal to a large section of sports fans.

So a little bit of background on myself. I went to Boston University and majored in statistics there. That's probably where I first got started with R. Very, very different than now, given that this was probably more than 15 years ago. I was fortunate to work in ESPN's stats and information group and was a founding member of the sports analytics team, where we created metrics like the total quarterback rating for the NFL and college football, as well as power indices for various sports. So I'll talk a little bit about that today, depending on where we go.

I moved on to Google in 2016. And for about the last three years, I've been in my current position at Google Cloud, where, as I said, I talk about how to use Google Cloud tools for data science and have worked on our sports partnerships, initially with the NCAA, as well as more recently with Major League Baseball, the Golden State Warriors, and a few other sports-related engagements. Outside of Google, I also do some sports and data science consulting, including for an NFL team, but not the one that won on Sunday. So I still have a lot of work to do to catch up to Mitch.

R and BigQuery for data science

So R and BigQuery for data science. So here is a chart that my teammate Priyanka put out recently about all the different tools on Google Cloud for data science. And you can see that there's a lot, right, quite frankly. So depending on how wide your definition of data science is, and here we picked a pretty wide one, there are lots of different categories of activities that fall into data science.

So you can see here we have things like data engineering, right, so all the aspects of getting data in: ingestion, pre-processing, storage, et cetera. Data analysis, which is fairly common, something that a lot of people using R will do too, is exploratory work, and maybe you go right from there into activation in a dashboard tool, or you want to go build a model. And depending on how you want to build a model, how large and sophisticated it is, and what the goal of that model is, there's a ton of tools here you may want to use. And then once you have your model up and running, now you're in the software engineering space, where you want to manage your model and all its endpoints.

The one product I do want to talk about is BigQuery. So Google Cloud BigQuery is a serverless, highly scalable, and cost-effective data warehouse. We have many customers using this that store 100-plus petabytes of data, right? So that's far larger than any of the data we're going to talk about today, but it's starting to become pretty common in the enterprise space, right? Where companies have more and more data. So it's a SQL-based, highly available data warehouse.

Some of the unique things it has is the ability to go multi-cloud. So you can query stuff that's on Google Cloud within BigQuery itself, but also in other clouds with a product called BigQuery Omni. There's also real-time insights. So based on streaming data into Google Cloud with other tools, you can set up dashboards or other interfaces where the data coming into your warehouse is available to you very, very quickly. There's also built-in machine learning. So through a product called BigQuery ML, BigQuery Machine Learning, you can fit a growing number of models, from linear regression and logistic regression on up to neural networks.

Using R and BigQuery together

So I like R and I like BigQuery. So this isn't, which one is better for what? You can do a lot of things in both of them. And a lot of that will be based on sort of the use case. When I go through the example, you'll see how I'm kind of weaving between the two. But in general, things I like to do in R include getting data from lots of different packages. We'll see on here, websites, APIs. I think the ecosystem around how to get data into R has been really awesome and getting better over the last several years.

General data manipulation. Obviously, the tidyverse is quite amazing for this. I use it all the time extensively. Exploratory data analysis. You know, whether that's just kind of looking at your data or plotting it with ggplot2, something like that. Statistical analysis and modeling. This is where I think R has a large depth of packages, right, depending on where your background is, what type of specific model you want. You can almost always find an R package or multiple that are doing, that are available to solve your problem.

And then what I call ad hoc data visualization. As I'm going, I want to look at something, quickly build a plot, maybe facet the plot by a couple of different dimensions or look at a few things as I'm doing analysis. R, I think, really stands out for that. BigQuery, like we said, is an enterprise data warehouse. So if you want a bunch of data to be available not just to R, but maybe to lots of other things, that's the place where we want that data to be stored.

The bigrquery package is really a strong way to integrate the two. So you want to get data from BigQuery into R, you want to take data from R into BigQuery. I use this package all the time to do this sort of stuff. Thanks to Jenny Bryan and Hadley Wickham at RStudio. I believe they're the developers and maintainers of that package.
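
Since bigrquery came up, here's a minimal sketch of one half of that round trip, pulling query results from BigQuery down into R. The project, dataset, and table names are hypothetical placeholders; the bigrquery calls themselves are real.

```r
library(bigrquery)

# Interactive email auth, as in the demo (use a service account on a VM)
bq_auth(email = "you@example.com")

# Run SQL in BigQuery, then download the result as a tibble
teams <- bq_project_query(
  "my-gcp-project",
  "SELECT team_id, team_name FROM `my-gcp-project.ncaa_basketball.teams`"
) |>
  bq_table_download()
```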

NCAA basketball player rating system

Okay, so let's dive into an example analysis here where we're going to use NCAA basketball data. March Madness is not too far around the corner. And I have an interest in college basketball data going back to, I don't know, probably when I was a kid, but also from ESPN, where I worked on the original version of our basketball power index, which launched almost 10 years ago to the day. So I thought it would be a cool way to get into the idea of using R and BigQuery together and see what we can learn about NCAA basketball.

So here's what we want to do today, which is to create a rating system for NCAA basketball players. So this is not novel. People have done this type of thing before, and a player rating system is among the top topics in sports analytics. But the difference here is that we want a couple of different things to be true about this system.

So we want to use multiple player box score statistics. So for example, we don't want to just take the player who scored the most points and say that's the best and something like that, right? There's many ways that people can contribute to the game. In the world of NBA, there's actually way more interesting statistics now that take into account sort of play-by-play level, who's on the court, all the way to the newer tracking data to do this. But we're going to assume we have more limited data, but still very good on the player box score level.

We want to take all of that into account or as much as we can. And then the second point is important as well. We want to represent it as a player's contribution to winning. So let's say we had really detailed data, we could say who jumps the highest. That would be a way to rate players, but it doesn't necessarily in and of itself contribute to winning. So we want to take into account statistics, but also have some relationship with winning.

The third point is important as well. So I've seen a lot of rating systems for men's college basketball, not quite as many for women's college basketball, but I think it's growing over time. We want a rating system that can be equally applicable to both. Many statistics are collected at both levels. Again, we want to talk about scaling: can we take the same system and create it for men's and women's basketball? And then we want to apply it to the current season, but again, this is about scale. We want to compare to players in past years as well.

And then this last point, which is really important: we want to adjust for schedule. We don't want to just take stats as they are. We know that in Division I college basketball, there are over 350 teams each for men and women, and teams play varying schedules of all different levels of quality. We want to account for not just what you did, but who you did it against. And that's really important.

So why would this be important or useful? There's some basis to the idea, which people raise, that all-in-one ratings are dangerous and shouldn't be consumed. And there definitely are reasons why you don't want them, particularly in certain instances where you want to learn more about a player, especially if you're coaching the players on your team. You want to talk more about what that player's strengths are, how they fit in with the rest of the team. It's not just an overall value thing. But let me take the other side and talk about why they really are useful sometimes.

So for media and fans, right? A lot of people in college basketball and otherwise are always talking about who's the best player, who should make a certain team, who should be an All-American, who should win the awards. And these all-in-one ratings are really useful for that, right? Because one player may be really good on defense, one's good on offense, one scores more, one rebounds more. How do we know, right? And these metrics give us a really good baseline for some of those discussions, as well as research, right?

So you may be watching a game, or in media, you need to call a game. College basketball, again, has 350-plus teams on either side. So that's thousands of players each year. The players change out often, right? Because they're moving through, and especially some of the best players go right on to the pros. So just having an idea of who the best players on certain teams are can be really useful.

And again, in my time at ESPN, this is a lot of the reason why we would develop these player ratings. College teams, right, would find this useful for managing their own roster, as well as looking at their opponents: who are the best players on those teams. Content companies. So I'm thinking more here of the companies getting into daily fantasy or more coverage of gambling in sports. Or even a company like Google: if you have an all-in-one rating, what could you do with that? You might use it to figure out signals for which players might be interesting, which players people are looking for content about, above and beyond how good they are at winning. It's just a measure of how interesting they are. It could be a signal in a machine learning model, right?

And then finally, we're talking about college ratings. Pro teams may find these ratings interesting because they want to evaluate, again, it could be NBA, it could be WNBA, evaluating the college players as potential draft prospects. And along with many other things, right, measurables and sort of how they do in interviews and things, you might want to use these all-in-one sort of ratings.

Analysis overview and data setup

So yeah, we're gonna talk about scaling sports analytics with R and Google Cloud. High-level, we're going to use open-source R packages, which exist, thankfully, to pull in NCAA basketball data. And then we're gonna upload it to BigQuery, kind of create our mini data warehouse. And then instead of sort of reinventing the wheel, we're gonna use existing basketball analytics theory for how to kind of take player stats and create a measure of how much they impact their team's winning. And we're gonna apply that in BigQuery.

We're gonna read that data back into R for our schedule adjustment. And then since it's there, we're gonna calculate some stuff there and get some final player ratings. And then we'll push it back into BigQuery. And then once we start looking at them there, we'll do the whole purpose of player ratings, which is argument, which is debate. So we'll say, well, this player's ranked too high, this player's ranked too low. And to some degree, once I got comfortable with that, I found it more interesting. No one's gonna take your player rating system and say, oh, this is perfect. And they shouldn't, and you shouldn't either.

So normally I don't include code directly on a slide, at least like this, but I really wanna point out how awesome the ecosystem is here. So major thanks, at the bottom, to Saiem Gilani. He's the leader of this initiative around the sportsdataverse and has been a developer on many packages to make sports data accessible in R, as well as Python and others, but we're focusing on R today. So the two you see here, hoopR and wehoop: one of them gives you access to men's basketball data and the other, as you might imagine, women's basketball data.

And this is the full code to get a bunch of data for the history of what's available in these packages for men's and women's college basketball. So it's almost 20 years for the men, though not as complete in the early years, and about 15 for the women. And you can get teams, schedules, team box, and player box scores just by running these very well-named functions: load the schedule, load the box. And the other good thing is that the men's and women's data comes in with the same schema. So when we talk about scaling, one thing we can do is actually store them together and add an extra field that says this is men's basketball, this is women's. And then every calculation we do, we can group by that sport category and everything else kind of flows along. And we don't have to create two separate code bases for rating these players.
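
As a sketch of what those loaders and the men's-plus-women's stacking trick look like: the loader functions are real hoopR/wehoop exports, while the season range and the `sport` column name are illustrative choices.

```r
library(hoopR)
library(wehoop)
library(dplyr)

# Pull player box scores for both men's and women's college basketball
mbb_player_box <- hoopR::load_mbb_player_box(seasons = 2014:2022)
wbb_player_box <- wehoop::load_wbb_player_box(seasons = 2014:2022)

# Same schema on both sides, so tag each with its sport and stack them
player_box <- bind_rows(
  mbb_player_box %>% mutate(sport = "MBB"),
  wbb_player_box %>% mutate(sport = "WBB")
)

# From here, every calculation just adds `sport` to the grouping, e.g.:
# player_box %>% group_by(sport, season, athlete_id) %>% summarise(...)
```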

Win shares framework

So the framework I use here is actually fairly old, at least as far as sports analytics goes. So this is based on using box score statistics. Again, if you have play-by-play or more detailed tracking data, you can do plenty of other things. But we have box score data going back, you know, about a decade or so for both men's and women's, thanks to the hoopR and wehoop packages.

So you have player box score stats, you have team box score stats. There's a whole lot of math developed by Dean Oliver, who I was fortunate enough to work with at ESPN and who is a pioneer in the basketball analytics space; he currently works as a coach for the Wizards. He wrote the book Basketball on Paper and did a bunch of stuff to show how you can get the most usage out of that data to create these individual player offensive and defensive ratings. If you're familiar with basketball analytics, there's a concept called offensive and defensive efficiency, points per 100 possessions. These are the individual analogs to those.

It's not the same as what the team does when the player's on the court; it's a much more detailed calculation, certainly on the offensive side, using points and assists relative to the team and how well the team did overall, but taking into account the player's credit for each of those things. The offensive one is much more detailed and useful than the defensive one, because the box score is traditionally very offense-based. But you can calculate those things: you can calculate how many possessions the offensive player used, as well as how many defensive possessions they were involved in.

And then with that and some other league-level stats (again, I'm hand-waving a bunch on these big green arrows), Sports Reference, among others, has a win shares calculation. So they've detailed it here. I hope to share these slides afterwards so y'all can click the links and everything. But you can take those metrics and then do a bunch more math and develop these calculations for offensive and defensive win shares, which you can then combine, with just addition, to get total win shares. And then I have this other adjustment here that I have seen and like, because for college basketball in particular, win shares has a low baseline: everyone gets credited with wins relative to the team. What I'm more interested in is how many wins the player adds above the average Division I player, whether that be men's or women's.
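
The last two green arrows are just arithmetic once the win shares pieces exist. Here's that arithmetic in R: the offensive/defensive win shares inputs come from the Sports Reference formulas linked in the slides, and the "above average" baseline shown (an average player accrues win shares in proportion to minutes played) is my reading of the adjustment, so treat it as an assumption rather than the talk's exact formula.

```r
# Total win shares is simple addition of the offensive and defensive pieces
total_win_shares <- function(ows, dws) ows + dws

# Wins above average: subtract what an average D-I player would produce
# in the same minutes, given a league-average win-shares-per-minute rate
wins_above_average <- function(ws, minutes, league_ws_per_min) {
  ws - minutes * league_ws_per_min
}
```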

Live demo: data gathering and BigQuery

All right, so hopefully you can see my RStudio. Yep, looks good. Great, so I have this first script, which is the data gathering script. Again, just look at the packages. We got hoopR, we got wehoop. We have bigrquery, which I talked about, that versatile package for interfacing R with BigQuery. I'm not gonna run this right now. I'm just gonna show you. It takes not that long to run, maybe a few minutes. Again, it gathers a large history of men's and women's college basketball data.

Authenticating to BigQuery, you can use email authentication, which I'm doing here because I'm on my own laptop. But if you're doing this on a virtual machine or somewhere else, you wanna use a service account. We can go into that another time. BigQuery has datasets, which I'll get into in a second. And I wrote this little helper function to upload a table into BigQuery using some of the functionality in the bigrquery package.

This little section here is the one I had copied. This is how you read in all the data. Like I said, relatively easy. Thankful to the community for developing that. The rest of this is a bunch of data munging, cleaning stuff up, separating out field goals, these sorts of things. As anyone who's worked with data knows, you still have to do some work to get it into the form you want to store in the database, especially. So this is kind of a bunch of that. And then for every set of data at the end, we have this little upload-and-replace-table-in-BigQuery helper function. And that will take, for example, this teams data and push it to a table in BigQuery called, in this case, teams.
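
That upload-and-replace helper might look something like this. The function name and arguments are hypothetical, but `bq_table()` and `bq_table_upload()` are the actual bigrquery calls, and the write/create dispositions are standard BigQuery load-job options.

```r
library(bigrquery)

upload_replace_table <- function(df, project, dataset, table_name) {
  bq_table(project, dataset, table_name) |>
    bq_table_upload(
      values = df,
      write_disposition  = "WRITE_TRUNCATE",   # replace existing contents
      create_disposition = "CREATE_IF_NEEDED"  # create the table on first run
    )
}

# e.g. upload_replace_table(teams, "data-science-demo", "ncaa_basketball", "teams")
```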

So now we're here in Google Cloud. So I think RStudio should be familiar to a lot of people here, probably why you're in the meetup. Google Cloud may not be. So I'll go over a little bit of stuff here. There's way more than I can cover in a small session. But Google Cloud Platform, this little logo is BigQuery. So you can see we're in a BigQuery SQL workspace.

And here you can see what's called a project, right? So I'm in this data science demo project, and then a dataset. In this case, I created one called NCAA basketball. And within my dataset can be many different things. So most prominently we have tables, which are these ones with the sort of dark calculator-like icon. And then these are views. We can also have models here, functions, which you'll see under routines, all sorts of different things: materialized views, stored procedures.

But, yeah, so this is kind of a way to encapsulate like a group of objects that will relate to the same thing. So this is called basketball data. And then what we'll look at is the tables we created from that R script into BigQuery. So we have a table for the teams. We have a table, and we can look here. So it's about 700 plus teams. That's men's and women's. We have a players table. So we pop that over. We look at the details. You can see there's quite a number of players in here, almost 90,000, again, men's and women's, going back several years.

And then we created tables called team game box. So this, again, is those box score statistics for teams. You can see this has a schema down here. It's fairly wide, lots of different team statistics, field goals, rebounds, all the way to points, and some information about the team and the opponent in the game. This one has like 250,000 rows, 73 megabytes. So, again, BigQuery is built for data way, way bigger than this. But just to give you an example of what we have in this case. And then player game box, which is our biggest table. You can see here, maybe half a gigabyte, and about 2.5 million rows. So, yes, you could process all this in R. In fact, we did to get it here. But if you want to use this for multiple things, you probably want to store it in some sort of database. I think BigQuery is one option that's pretty solid for that.

And then you can imagine running the same script we have, or some other job, daily to update your data warehouse with the freshest data. And then I want to go on to views and just how much you can actually use SQL to do these calculations. Like I said, it's a bunch of math, but it's a lot of joining different data amongst the teams and the players and things like that.

So, you can take your team box score data and kind of augment it in this type of way. I have this view. So, a view means you write a SQL query and it stores that intermediate output, so you can then call the view as if it's its own table. So, again, if there's a way you pre-process data all the time, maybe you want to use it and your teammates want to use it, or you use it for more than one thing, then you can abstract that in a view. This is something common in a lot of databases, and I think it's a really useful concept.

So, this view, if we click over to the details, you'll see is really long. Again, I won't go into the code, but it takes some team and game statistics and calculates advanced stats, sort of basketball analytics 101 stats: offensive efficiency, defensive efficiency, some of the four factors. So, this is a useful thing that I can see using for multiple things. In this case, we'll use it as a stepping stone to the player ratings.

And then we have another view called player team game stats. This one, again, if you go over to the details tab, you can see the code behind the view. Like I said, it's heavily based on Basketball on Paper. This one is really, really long. So, again, I'm not going to go totally in depth, but the last steps are where we calculate those individual offensive ratings and defensive ratings.

The one thing I should mention, actually, is this WITH statement for what are called common table expressions. This I use almost like the pipe in R, where you're using different data frames and then merging them together. With the WITH statement, you can essentially chain logic together. So, I have this team game stat calcs and I have player game stat calcs, right? So, I do some calculations on the team level, I do some on the player level, go down here, and finally we merge them together, team and player. And then you can just keep building on your outputs. So, one large script is a little more readable with these common table expressions, and it allows you to reuse pieces over and over.
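
To make the WITH pattern concrete, here's a toy version of that chained structure as it might be sent from R; the table and column names are illustrative, not the actual view's code.

```r
library(bigrquery)

sql <- "
WITH team_game_stat_calcs AS (
  SELECT game_id, team_id,
         points / NULLIF(possessions, 0) * 100 AS off_efficiency
  FROM `my-gcp-project.ncaa_basketball.team_game_box`
),
player_game_stat_calcs AS (
  SELECT game_id, team_id, player_id, points, assists
  FROM `my-gcp-project.ncaa_basketball.player_game_box`
)
-- Finally, merge team and player, building on the pieces above
SELECT p.*, t.off_efficiency
FROM player_game_stat_calcs AS p
JOIN team_game_stat_calcs AS t USING (game_id, team_id)
"
result <- bq_table_download(bq_project_query("my-gcp-project", sql))
```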

And then we can build a view on a view on a view. So, then the last view here is one called player season advanced stats. And I won't even show you the details, but the point is at the end here, we have this schema where we get these calculations for win shares. And like I said, this wins above average type of metric. So, to some degree, we might've answered the question right here all the way in BigQuery.

So, we got our answer here. Again, it's not necessarily the size of the data, but the logic is fairly complex amongst those views. Again, for women's basketball this year, these are the top 10 players and the ones above average. So, Aliyah Boston is a very well-known player at South Carolina, which is a top team. Caitlin Clark, really excellent player for Iowa, can shoot from Steph Curry range, amazing. So, this looks pretty good, honestly, right? And Ayoka Lee, I think, set the Division I single-game scoring record earlier this year. This looks really good, at least at the top of the list. I'm not a huge expert on women's basketball, but some of the names I recognize. And you can see they contribute about five to six wins above average at the top.

Schedule adjustment

So, we moved to part two, right? This is where we get really in-depth about how to take a thing, which, again, was a bunch of math to get here, and now we take it to the next level with sort of statistical analysis. And again, we'll see how R and BigQuery can help with that. So, we talked about this. Teams face varying levels of competition. So, there's 350 plus Division I teams that are organized into 32 conferences. They play largely different opponents. Some conferences are traditionally better than others. Also, if you think about what affects performance, home court advantage matters, right? So, if you're playing at home a lot, you may likely do better than if you play on the road more.

And in this middle section here, we have a model representation, kind of, of what happens. So, a player or a team stat is maybe loosely related to the intercept, sort of like a baseline, plus maybe the team effect. Again, plus could be plus or minus; maybe the team's above average or below average. The effect of the opponent, right, who they're playing, as well as maybe some impact from being at home or on the road or at a neutral site. And then, again, error, right? So, if we fit a model like this, again, a very high-level loose representation, we can kind of directly measure the effects of team versus opponent and, again, home court on every stat.

So, we use glmnet, which is a common package in the regularized regression space, to fit a ridge regression. Ridge regression helps with, like I said, regularization, accounting for outliers and small samples, while still giving us good estimates of these effects. And the interesting part here is we're not fitting a model for predictions. We're fitting a model to get coefficients that we can use as estimates of effective offensive and defensive efficiency for teams, and home court advantage as well. So, we're more interested in the coefficients of this model than any prediction, at least initially.
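
A minimal sketch of that coefficients-not-predictions fit with glmnet, for one stat in one season. The data frame and column names are hypothetical; `alpha = 0` is what makes glmnet fit a ridge (rather than lasso) penalty.

```r
library(glmnet)

fit_ridge_adjustment <- function(games) {
  # games: one row per team-game, with factor columns team and opponent,
  # an indicator home_ind, and the response stat_value (e.g. off. efficiency)
  x <- model.matrix(~ team + opponent + home_ind, data = games)
  fit <- cv.glmnet(x, games$stat_value, alpha = 0)  # alpha = 0 => ridge
  # Keep the coefficients at the cross-validated lambda: these are the
  # team, opponent, and home-court effect estimates we're after
  coef(fit, s = "lambda.min")
}
```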

So, that we use to adjust offensive and defensive efficiency for teams. But the last thing, we are interested in players. So, once we have an adjusted team rating, what do we do for the player side? Well, you can see, we wanna adjust at the rating level. So, we can adjust player's offensive and defensive ratings by the opponent's ratings on the other side of the ball. So, if I'm on offense, my rating should be adjusted for home court, as well as the opponent's defense. And then my defense should be adjusted for their offense. So, those are the two sort of formulas you see here. They're pretty actually straightforward. Once you have these ratings, the adjustments are essentially linear combinations of this on the game level.
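
Written out as code, those game-level adjustments are just linear combinations of the ridge effects. The exact signs and centering are my reading of the slide, so treat this as a sketch, not the talk's verbatim formulas.

```r
# Offense: strip out home-court help and credit for facing a weak defense
adjust_off_rating <- function(raw_off_rating, opp_def_effect, home_effect) {
  raw_off_rating - opp_def_effect - home_effect
}

# Defense: symmetric, adjusting for the opponent's offense instead
adjust_def_rating <- function(raw_def_rating, opp_off_effect, home_effect) {
  raw_def_rating - opp_off_effect + home_effect
}
```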

So, a player's adjusted offensive rating in a game might go up or down. If he's at home, it'll go down, usually by about two points or so, on a points per 100 possessions scale. And then the opponent adjustment can vary a lot. You may play a really hard opponent and your offensive rating will jump about 20 points or so. Again, it's on a points per 100 possessions level, so the average is about 100. But it also could go down 20 points or so. And the defensive rating works a little bit like that as well.

Rating calculation in RStudio

So, this is a different script now that just calculates the team and player ratings. So, we use a couple of the regular sort of packages: the tidyverse, janitor. We're doing parallel processing. bigrquery, which we talked about. glmnet, which is gonna fit this ridge regression. And then broom, which is a nice package that allows you to kind of clean up regression outputs.

This sets up for BigQuery. Similar thing, we need to authenticate. We'll write some helper functions, this time to download data from BigQuery. We'll specify the seasons we're gonna look at. So, in this case, the completeness of the data is pretty solid starting about 2013-14. So, we'll start there. That gives us about nine years of data, which is what we were looking at before. Here's another thing that I do commonly. Sometimes I'll write my SQL with parameters in R, or Python in some cases. And then, using something like glue or format, you can put in the parameters you want. So, if you change the seasons up top, you'll get different data. And these will write the SQL.
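
The glue pattern mentioned here looks roughly like this; the table name is a placeholder, but this is the standard way to template SQL strings from R parameters.

```r
library(glue)

seasons <- 2014:2022  # change this vector and the SQL below updates with it

sql <- glue("
  SELECT *
  FROM `my-gcp-project.ncaa_basketball.player_game_box`
  WHERE season BETWEEN {min(seasons)} AND {max(seasons)}
")
```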

One thing I wanted to show. So, we eventually get to this sort of table here, where we're gonna have a table, and I'll actually look at this in the console. This is the table of team game stats. So, we get to this point where this is one row per team, per game, per stat. So, you can see that this has a game ID. This is Jackson State. This is when they played Bethune-Cookman, and we're looking at their offensive efficiency, which was 118 over 60 possessions.

And then, what we're going to do here is, this is a fairly large data set on the team game level, about 300,000 rows, but we wanna do this adjustment by season. We're not adjusting this year's Jackson State team versus Bethune-Cookman over the last eight years. We're gonna adjust it versus this year. So, one of the tricks we'll go into is this nesting concept. And what I mean by nesting is, if you look here, we nest this team game data into different levels by sport, by season, and by stat. So, there are 36 rows here, 18 for men, 18 for women, and within those, nine each for offensive efficiency and defensive efficiency, one per season. And then, we nest the game-level stats in here. And what this does is help organize our data. This is using tidyverse concepts here as well, so that we can fit one regression per row, which you'll see in the next step.
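
The nesting step can be sketched like this with tidyr; `team_game_stats_long` (one row per team, game, and stat) and its column names are hypothetical stand-ins for the actual table.

```r
library(dplyr)
library(tidyr)

# Collapse ~300k team-game-stat rows into one row per sport x season x stat,
# each carrying its games as a nested tibble in the `data` list-column
nested <- team_game_stats_long %>%
  group_by(sport, season, stat_name) %>%
  nest() %>%
  ungroup()
# 2 sports x 9 seasons x 2 stats => 36 rows, one regression to fit per row
```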

So, this function, again, I'm hoping to share this code. There's a lot of details here I'm skipping over, but what it will do is fit that ridge regression we talked about using the glmnet library, and then we'll take the model coefficients, pull them out, and do a bunch of manipulations to get them back into the right form and get adjusted team stats off of this. What you see here is we run this using multiprocessing with the furrr package. This future_pmap step will basically take each row of the nested table, so: take this data of games for this season, for this stat, and give me back adjusted ratings.
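The per-row ridge fit could be sketched like this. This is an illustration of the technique, not the speaker's actual code: the fitting function, column names, and lambda value are all assumptions (the talk uses `future_pmap` over several columns; `future_map` over the one list-column is shown here for simplicity):

```r
library(glmnet)
library(furrr)
library(dplyr)
plan(multisession)  # parallel workers for the per-row fits

# Hypothetical fitting function: one ridge regression per nested group.
# `df` is one group's games: team/opponent factors, a home indicator,
# and the outcome (e.g. offensive efficiency) in stat_value.
fit_adjustment <- function(df) {
  x <- model.matrix(~ factor(team_id) + factor(opp_id) + home, data = df)
  fit <- glmnet(x, df$stat_value, alpha = 0, lambda = 0.1)  # alpha = 0 -> ridge
  coef(fit)  # team coefficients become the schedule-adjusted ratings
}

nested <- nested %>%
  mutate(model = future_map(data, fit_adjustment))
```

In practice lambda would be chosen by cross-validation (e.g. `cv.glmnet`), and the coefficients would then be reshaped back into a ratings table as described.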

And what you'll see in the next step, after we fit, again, I ran this before, is that we get adjusted regression results. For each one of these rows, we now have regression results, and if you look at one of them, for example, the first one, you can see it gives you two things. We have home-court advantage, with, like I said, an estimate of about two points, and then for every team, we get an adjusted rating, a couple of different versions of it, keyed by team ID.

We move on from there. We do a bunch of stuff to extract that, put it in place, and get the team ratings straight. Then we use those ratings to adjust the player stats. I'm going to, again, gloss over some of this. All those calculations we did initially in SQL, I can also do here in R. I love chaining together the pipe with tidyverse functionality. You can see a bunch of joining and more calculations. Eventually, we get down to player ratings, and I will show you this. So, we have this player season summary table, and it's pretty long, but you can see it has season, player, and the last column is that adjusted wins above average. So, now we've calculated the thing we want, and then finally, we write it back to BigQuery.
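Writing the final table back to BigQuery is a one-liner with bigrquery; the project, dataset, and table names below are placeholders:

```r
library(bigrquery)

# player_season_summary: the final table with season, player, and
# adjusted wins above average (names assumed from the walkthrough).
bq_table_upload(
  bq_table("my-project", "ncaa", "player_adjusted_ratings"),  # placeholder
  player_season_summary,
  write_disposition = "WRITE_TRUNCATE"  # replace the table on each run
)
```

Once it's back in BigQuery, the ratings are queryable by any downstream tool (Data Studio, Looker, etc.) without touching the R session.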

Results and discussion

Okay. So, does schedule adjusting matter? That's the question. You can see this is actually in that last script; I create this plot using ggplot2. Each dot here represents one player: their raw wins above average versus their adjusted wins above average. There's a very tight correlation, something like 0.9 in a lot of cases here. So, you may think this isn't really doing much, but what you will notice is that along each vertical line, which represents one value of raw wins above average, there's actually a good amount of spread in the adjusted wins above average. If you take a two-win player in women's basketball, the adjusted rating might have them close to an average player, or they could be four-ish wins, which is a really big difference. So, we can see that while there is a strong correlation, this does make a big difference in certain cases, and this is probably something we do want to account for going forward.
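A plot like the one described, raw versus adjusted wins above average, is a few lines of ggplot2. The column names (`raw_waa`, `adj_waa`, `sport`) are assumptions for illustration:

```r
library(ggplot2)

# One dot per player-season: raw WAA on x, schedule-adjusted WAA on y.
ggplot(player_season_summary, aes(x = raw_waa, y = adj_waa)) +
  geom_point(alpha = 0.3) +
  geom_abline(linetype = "dashed") +  # y = x: where adjustment changes nothing
  facet_wrap(~ sport) +
  labs(x = "Raw wins above average",
       y = "Schedule-adjusted wins above average")
```

The vertical spread around the dashed y = x line is exactly the "same raw rating, different adjusted rating" effect discussed above.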

Okay. So, now our more final results, like I said, the top players. A couple of things to notice: the top three over here for this season's women are exactly the same, but the numbers are higher. Before, they were closer to five wins; now, they're closer to six or seven. All of these players are on teams in tough conferences, so their schedules were above average, and therefore they get even more credit and greater differentiation. Some of these top players from the last nine years are pushing nine or ten wins on the women's side.

The men's ratings for this year look a little different. Again, Oscar Tshiebwe from Kentucky is there, and you see Malachi Smith. So, it isn't impossible to be here if you're at a smaller school; you just have your performance adjusted relative to the competition you face. You see some other bigger-school players here. And then, the historic list, as a fan of college basketball, going back, I think is really interesting. You see Zion, a marquee player who was amazing for his one year at Duke and won player of the year, and some other players of the year and more familiar names here. So, we like this list. It's pretty good.

Surfacing outputs and running R on Google Cloud

Loose ends. So, we went through the analysis. We showed R. We showed BigQuery. But we didn't talk about how to surface the outputs, which would be the endgame of this process: if we had a player rating system we really liked, how would we share it? We could do it in Google Sheets. We could do it in a tool like Data Studio, which is Google's free interactive dashboarding tool and one of my favorites; it's certainly a good way to build things, more point-and-click based. Looker is our enterprise business intelligence platform; you can build an API over the data. That's much more on the business side, but it's a really amazing tool as well. And then, of course, we're here: RStudio Shiny. Shiny allows you to do a ton, right? You can take the functionality of R and build dashboards that let you go deep into the data. Say you wanted to run some of that adjustment code on the fly; you can do that with something like Shiny. And if you build your interactive web application there, you can publish it to something like RStudio Connect to share within your organization, or shinyapps.io to share publicly.
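As a concrete sense of how little Shiny it takes to surface the ratings, here is a minimal sketch; `player_season_summary` and its columns are the hypothetical names used in the earlier steps:

```r
library(shiny)

# Minimal app: pick a season, see the top 10 by adjusted WAA.
ui <- fluidPage(
  selectInput("season", "Season", choices = 2014:2022),
  tableOutput("top_players")
)

server <- function(input, output, session) {
  output$top_players <- renderTable({
    df <- subset(player_season_summary, season == input$season)
    head(df[order(-df$adj_waa), ], 10)  # top 10 by adjusted rating
  })
}

shinyApp(ui, server)
```

A real app could query BigQuery on the fly instead of holding the summary table in memory, which is the "run the adjustment code live" idea mentioned above.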

And then, this is for the people coming for R on Google Cloud. I showed an example of how to use R with Google Cloud, because we had BigQuery, we had R, and we were moving data between the two. That's not the same as running R or RStudio on Google Cloud. So, here are a few options for how to do that. RStudio Workbench is the professional product, the most high-end version of this, which you can deploy directly on Google Cloud Platform. But if you're smaller scale and have access to Google Cloud, you can also install the open-source version of RStudio Server on Compute Engine.

If you're scheduling R scripts, there are ways to do that with Docker on Google Cloud, with tools like Container Registry. And another option, this last one, is one I actually like a lot, because I'm a Jupyter notebook fan: you can use the Google Cloud product called Vertex AI Workbench to run Jupyter notebooks backed by an R kernel. You can configure your memory up and down, so if you have some calculations where you run out of memory, you can scale up to handle them.

One other plug here: Mark Edmondson is a really tremendous Google Developer Expert in Denmark, focused on R and Google Cloud. He wrote some of the packages that let you interface with Google Cloud services like Cloud Storage from R, plus a bunch of tutorials, really, really helpful stuff. He is an expert in this space, so I would recommend following him and looking up his website if you're interested in the different ways to use R on Google Cloud.

Takeaways

So, a few takeaways on the tools and methods first. R and RStudio work well with BigQuery, and BigQuery is made for very large-scale storage and analytics. Then, we looked at regression; in this case, we did ridge regression. What's interesting here is that, with all the hype these days around modeling and prediction, regression is really good as a summarization tool, which is how we used it here; we're not explicitly building a prediction model. And I've found this very useful in lots of sports analytics tasks.

And yeah, this is pretty hard. I think there's a lot here that