Transcript

This transcript was generated automatically and may contain errors.

Let's talk about designing a socially-critical data science course. I have for you a play in three acts. Act 1, teaching about wealth inequality. Act 2, data analysis for social action. And Act 3, specific resources to help you teach.

Act 1: Teaching about wealth inequality

Welcome to your first day of my class. The very first thing we do is that everyone gets one penny, an actual physical penny. The very next thing we do is make a prediction about the following: everyone's going to do one trade with a classmate using their penny. How do you trade? You find another student, you bet against them on a random-outcome event, like a coin flip or rock-paper-scissors, and the loser pays the winner one penny.

Here's an abstract view of what my classroom would look like. Two students find each other and pair up. There's one penny at stake. They compete on a random-outcome event at even odds, and the loser pays the winner one penny. This happens across the classroom simultaneously, with different students in different pairs.

Before trading, we're all equal. If you were to look at a bar graph of everybody's wealth, it looks really uninteresting. Everybody's got one unit of money. It's a penny. And we've ordered these completely randomly, but each agent, as it were, has a bar.

So the first question I ask before we even do the trade is, what's going to happen after everyone does one trade? I'd like you to imagine that. You each have a penny. You're going to match up with somebody else. You're going to bet on something. A coin flip is pretty handy because you both have pennies. And the loser is going to pay the winner one penny.

So what happens after one round of trades is that half the people in the room go broke, and half the people in the room double their money instantly. Then you ask, what happens after lots of trades? That starts to look like this, where you can see that a few people have most of the wealth. There's even one person who's got five pennies, which means they quintupled their initial investment. But 71% of the people in this simulation are broke. Not poor, not impoverished, broke.
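The trading rules above can be sketched in a few lines of Python. The class size, round count, and seed here are arbitrary choices of mine, not numbers from the talk:

```python
import random

def simulate(n_agents=40, n_rounds=200, seed=1):
    """Each round, agents who still have money pair up at random and
    bet one penny on a fair coin flip; the loser pays the winner."""
    random.seed(seed)
    wealth = [1] * n_agents  # everyone starts with a single penny
    for _ in range(n_rounds):
        # only agents who can still stake a penny get to trade
        players = [i for i in range(n_agents) if wealth[i] > 0]
        random.shuffle(players)
        for a, b in zip(players[::2], players[1::2]):
            winner, loser = (a, b) if random.random() < 0.5 else (b, a)
            wealth[winner] += 1
            wealth[loser] -= 1
    return wealth

wealth = simulate()
broke = sum(1 for w in wealth if w == 0)
print(f"{broke}/{len(wealth)} agents are broke; top holding: {max(wealth)}")
```

Because broke agents are excluded from pairing, they never re-enter the game, and re-running with different seeds concentrates the pennies in different hands every time, just as in the classroom.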

Here's another case of us running it with slightly different rules and a greater money supply. I like this one a lot. It shows the same features: a few people with most of the wealth. But then you have people who go below the zero line, finishing with wealth as low as negative 20. And you might be saying to yourself, is that an error in the graph? What happened?

Loans happened. We allowed students to loan money at interest so they could loan four pennies with the expectation of getting five back, I think was the rule for that simulation. And the very first outcome was that the students constructed their own subprime loan crisis. Nobody was checking credit. So I watched one student hit the same person up for like five loans. And when she denied him a sixth loan, he walked across the classroom, found somebody else who didn't know him, and got more loans.
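A minimal extension of the same sketch adds that loan rule (borrow four pennies, owe five). The exact mechanics here, including the no-credit-check borrowing, are my assumptions about how to mechanize what the students did:

```python
import random

def simulate_with_loans(n_agents=40, n_rounds=200, seed=2):
    """Coin-flip betting as before, but a broke agent may borrow four
    pennies from a partner holding at least four, owing five back.
    Nobody checks credit, so net positions can go negative."""
    random.seed(seed)
    wealth = [4] * n_agents   # a larger money supply this time
    iou = [0] * n_agents      # net receivables: + for lenders, - for borrowers
    for _ in range(n_rounds):
        agents = list(range(n_agents))
        random.shuffle(agents)
        for a, b in zip(agents[::2], agents[1::2]):
            # a broke partner tries to borrow so the pair can trade
            for borrower, lender in ((a, b), (b, a)):
                if wealth[borrower] == 0 and wealth[lender] >= 4:
                    wealth[lender] -= 4
                    wealth[borrower] += 4
                    iou[lender] += 5      # interest baked into the IOU
                    iou[borrower] -= 5
            if wealth[a] == 0 or wealth[b] == 0:
                continue  # a partner is still broke; no trade happens
            winner, loser = (a, b) if random.random() < 0.5 else (b, a)
            wealth[winner] += 1
            wealth[loser] -= 1
    # net position = cash plus what you're owed minus what you owe
    return [w + i for w, i in zip(wealth, iou)]

net = simulate_with_loans()
print("worst net position:", min(net), "| best:", max(net))
```

Graphing the net positions is what produces bars below the zero line: cash never goes negative, but unpaid IOUs do.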

The second outcome is that the students discovered the value of a distribution graph. The bar graphs have the advantage of simplicity, but every time we run the simulation, we can't predict who's going to end up with most of the money; we just know that it will concentrate. And I do this on purpose, because I want them to see that it's very difficult to predict the outcome from a bar graph. But if there's some underlying property that says most of the money is going to go to a few people, that's the value of a distribution graph: one distribution graph can represent so many different situations.

So ultimately, we can't predict which agents will go broke, but we can predict that most will. Finally, I think it's important that a seemingly fair system can still produce unfair results. It seems fair because the coin toss is fair and everybody has an equal shot, but it's not fair because broke people routinely get denied the opportunity to trade again. If you had money, why would you trade with a broke person? You have a 50% chance of losing a penny and a 50% chance of winning nothing, because they can't pay. So broke people stay broke, and ultimately they stay out of the simulation.

Act 2: Data analysis for social action

So when I was teaching at Michigan State, a very local crisis was getting international attention, and it had to do with the poisoning of the water supply in Flint, Michigan. Flint's about an hour away from East Lansing, which is where I was teaching. The super short version is that, to save money, the city of Flint decided to stop importing treated water from the Great Lakes and draw from the Flint River instead, and the river's corrosive water leached lead out of the pipes.

This was happening while we were teaching the course, and a data set had just come out from folks at Virginia Tech, who had sent at-home testing kits to 270 families in Flint to get readings on the lead levels in their water. And we had students analyze this data set and write about it.

Here's one example of a student analysis, where what they've done is compute the number of samples that exceed the EPA's action limit. So the EPA says if more than 10% of your samples are above this threshold, which happens to be .015 milligrams per liter, I think, you've got a problem. And a student wrote some code to count how many of the raw samples exceed that amount, as well as the percentage that do. The other thing I want to point out, though it's perhaps difficult to see, is that the distribution has a real long tail. There are some folks out there at .12, .14 milligrams per liter, eight or nine times the action limit for lead. That's a huge problem.
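That computation is only a couple of lines. Here is a sketch with made-up readings; the real numbers come from the Virginia Tech dataset, which I'm not reproducing here:

```python
# Hypothetical lead readings in mg/L -- stand-ins, not the Flint data
readings = [0.002, 0.004, 0.016, 0.030, 0.005, 0.120,
            0.001, 0.018, 0.003, 0.007]

ACTION_LIMIT = 0.015  # EPA lead action level in mg/L (15 ppb)

# count the samples above the limit and the share they represent
exceedances = [r for r in readings if r > ACTION_LIMIT]
count = len(exceedances)
pct = 100 * count / len(readings)
print(f"{count} of {len(readings)} samples exceed the limit ({pct:.1f}%)")
```

With these invented numbers the sketch prints `4 of 10 samples exceed the limit (40.0%)`; on the real dataset the students found a smaller share, but still well past the EPA's 10% trigger.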

So we had our students write to the governor of Michigan with their analysis results. And here's an example from one student: "I constructed a histogram. I also wrote a simple program to compute the number of samples and the percentage above the limit. Forty-five samples exceeded the EPA limit, making 16.6% of the samples above the action limit. This indicates the seriousness of this problem and exactly why we need to urgently take action to help Flint."

Act 3: Resources to help you teach

You might be asking, what can I do? So I have three pieces of advice that I'm going to go into detail on. The first is try to choose data and activities that question systems of power. The second is do what you can to make bias issues in your data known and check for them. And the third, which I think is arguably the most important, is to adopt a critical framework to analyze and expose oppression.

So first up, you should use the opportunity of choosing data sets for your class to analyze to ask what issues demand our attention socially. Now, I don't want to stand here and say that every activity you do must always be focused on improving the social good. But I think even if you started with one activity, you'd find it has a ripple effect: it gets you thinking about more data that can do this, and it gets your students thinking about more ways they can try to help society through the newfound powers they're learning in your courses.

Second, you can check data for biases. You can ask who is being disadvantaged in this data. A tool for that is IBM's AI Fairness 360 (AIF360). I've got the link down there, but I'll also have a QR code you can scan with all the links related to this presentation, so don't panic. AIF360 has some demo data you can take a look at, and this is an example of what it can produce for you. This data was analyzed and has a privileged group, which happens to be female, and an unprivileged group, which happens to be male, because we have representational issues of unequal weighting.

So the tool goes on to compute things like the statistical parity difference, the equal opportunity difference, and the average odds difference. And then it goes so far as to offer strategies for addressing that bias. These are two of the automated strategies it proposes: reweighing and optimized preprocessing. I think a third is adversarial debiasing, but I don't know that much about it, so don't quote me on that. But the idea is that the tool lets you see where your data might be leading you to conclusions steered by biases that, in most cases, we want to correct for, and it offers automated solutions for doing so.
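AIF360 computes these metrics for you, but the definitions are simple. Here is a hand-rolled sketch of statistical parity difference on toy data; the group labels and outcomes are invented, and real analyses should use the library itself:

```python
def statistical_parity_difference(outcomes, groups,
                                  unprivileged="F", privileged="M"):
    """P(favorable | unprivileged) - P(favorable | privileged).
    Zero means parity; a negative value means the unprivileged
    group receives the favorable outcome (1) less often."""
    def favorable_rate(g):
        member_outcomes = [y for y, grp in zip(outcomes, groups) if grp == g]
        return sum(member_outcomes) / len(member_outcomes)
    return favorable_rate(unprivileged) - favorable_rate(privileged)

# toy data: 1 = favorable outcome (say, a loan approval)
outcomes = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
groups   = ["F", "M", "M", "M", "F", "M", "F", "F", "M", "M"]
spd = statistical_parity_difference(outcomes, groups)
print(f"statistical parity difference: {spd:.3f}")
```

Here the unprivileged group's approval rate is 1/4 against the privileged group's 5/6, so the metric comes out strongly negative, which is the kind of gap the tool flags.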

My third point is about adopting a critical framework. I think it's so important to recognize that data analyses are not value neutral. So this is a book called Data Feminism that came out in 2020 from MIT Press. You can read the entire thing for free at their website, which, again, will be in my list of links in my QR code thing. Here's a quote from a description of the book. Illustrating data feminism in action, D'Ignazio and Klein show how challenges to the male-female binary can help challenge other hierarchical and empirically wrong classification systems.

It's a little jargony, so let's do this. Raise your hand if you've either seen or yourself encoded gender as a dichotomous variable, zero or one. So what happens when your gender is not captured by zero or one, male or female? Well, you could add another option, like gender non-conforming, but what happens if your gender is non-binary, or your gender is fluid? You could add options until the cows come home. Basically, what's being revealed to you is an oppressive system that forces people into categories they don't necessarily identify with, or forces them to exclude themselves from the data analysis entirely just by not answering the question. And D'Ignazio and Klein are trying to point out that it's stuff like that, stuff we might not even be consciously aware of if we're not looking for it, that can have huge impacts on the people we're asking questions of.
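To make that concrete, here is a sketch of what a forced dichotomous encoding quietly does to a dataset; the responses and the 0/1 mapping are hypothetical:

```python
# Hypothetical survey responses, including ones a 0/1 scheme can't hold
responses = ["female", "male", "non-binary", "female",
             None, "genderfluid", "male"]

binary = {"female": 0, "male": 1}

# Forcing the dichotomy silently drops every other respondent
encoded = [binary[r] for r in responses if r in binary]
excluded = len(responses) - len(encoded)
print(f"{excluded} of {len(responses)} respondents excluded from the analysis")
```

The exclusions never raise an error, which is exactly why they go unnoticed: the analysis runs fine, just on fewer people than answered.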

My final example comes from mapping Detroit. I call this redlining in action. This was a map produced in 1929 by an all-white, all-male chamber of commerce and a majority-white, male federal home loan board. The purpose of the map was to mark the neighborhoods where it was risky to offer people loans and the neighborhoods where it was safe. The effect of the map is that an already racially segregated population became disproportionately affected by decisions not to lend; Black residents, for example, lived in neighborhoods deemed too risky to lend in.

So can you imagine if your credit worthiness depended not on yourself but on who's next to you? Imagine how that crushes social mobility. Imagine how that reifies social order.

D'Ignazio and Klein say early 20th-century redlining maps had an aura very similar to the big data approaches of today: these high-tech, scalable solutions were deployed across the nation, and they were one method among many that worked to ensure that wealth remained attached to the racial category of whiteness.

I think it's worth asking, if that's what data science was doing then, what's it doing now? Thanks, folks.