A few months ago, the New York Times featured a “juicy” study concerning the racial biases of NBA referees. Naturally, everyone went apeshit over the idea. I mean, how could they not? After all, the halo of authenticity provided by academic study gives you a free pass to run with the sexiest of all headlines:
NBA REFS: RACIST?
Obviously, I exaggerate. Still, it’s hard to deny that talking heads love the idea of research backing up stories on topics as volatile as race. University! Academia! STATISTICS! It’s all just so sexy… a columnist’s wet dream. Problem is, people rush to conclusions without really getting to the bottom of what the research says.
Still, you might be wondering why I’m talking about the NBA Ref study in August. Well, history tends to repeat itself: It’s baseball’s turn to share the headlines.
And here we are again. Another study, another claim concerning inter-racial call biases. Unfortunately, it’s probably going to prove to be another example of exaggerated columnist fodder.
So, like I did last time, I will throw out the following disclaimer: I am not an expert. I am not a statistician. I’m a former research assistant in psychology who decided to jump ship to the much more exciting world of trading. This qualifies me for…well, dick. And man did that sentence sound better in my head. Point is, I’m just a guy with some basic stats background. Do with this information what you will.
Anyway, unlike the NYT article (which provided a link to the actual NBA ref study), I had to do a little digging to find Hamermesh’s study/results. The study’s design is actually pretty interesting, and I found their comparison of QuesTec to non-QuesTec umpire calls fascinating – I like the idea of comparing ump decision making with and without scrutiny. Regardless, I spent most of my time going over their data and conclusions. Here’s a quick recap:
“The analysis of individual pitches and game outcomes suggests that baseball umpires express racial/ethnic preferences in their decisions about players’ performances. Pitches are more likely to be called strikes when the umpire shares the race/ethnicity of the starting pitcher, an effect that becomes significantly stronger when umpire behavior is less well monitored. The evidence also suggests that this bias is strong enough to affect measured performance and games’ outcomes. As in many other fields, racial/ethnic preferences work in all directions—most people give preference to members of their own group. The difference in MLB, as in so many other fields of endeavor, is that power belongs disproportionately to members of the majority—White—group (Hamermesh, 19).”
In studies with claims like these, I try to immediately look for the common test – the comparison across comparable conditions that truly shows me a difference BETWEEN groups. This is an important point if you’re going to claim bias; showing me that white umps call more strikes for white pitchers is only bias if, say, black umps AREN’T doing the same. If both seem to call more strikes for white players, there are questions to be asked, but they aren’t questions of racial bias. Table 2 (Page 26 of the PDF) has the basic percentages of pitches called for strikes by ump/pitcher ethnicity.
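To make that comparison concrete, here's a minimal sketch in Python. The function name, the dictionary layout, and every strike rate in it are made up for illustration; none of them come from Hamermesh's Table 2. The idea is just to subtract one group of umps' white-versus-black pitcher gap from the other's, so that a gap both groups share cancels out.

```python
def own_race_contrast(strike_rate):
    """Difference between two umpire groups' pitcher gaps.

    strike_rate maps (umpire_race, pitcher_race) to the fraction of
    called pitches that were called strikes. Hypothetical layout.
    """
    # How much more often white umps call strikes for white pitchers
    white_ump_gap = strike_rate[("white", "white")] - strike_rate[("white", "black")]
    # The same gap, as called by black umps
    black_ump_gap = strike_rate[("black", "white")] - strike_rate[("black", "black")]
    # If both groups favor the same pitchers equally, this is zero
    return white_ump_gap - black_ump_gap


# Made-up scenario 1: everybody calls more strikes for white pitchers.
both_favor_white = {
    ("white", "white"): 0.33, ("white", "black"): 0.31,
    ("black", "white"): 0.33, ("black", "black"): 0.31,
}

# Made-up scenario 2: each group favors pitchers of its own race.
each_favors_own = {
    ("white", "white"): 0.33, ("white", "black"): 0.31,
    ("black", "white"): 0.31, ("black", "black"): 0.33,
}

print(round(own_race_contrast(both_favor_white), 4))
print(round(own_race_contrast(each_favors_own), 4))
```

In the first scenario the contrast is zero even though white pitchers get more strikes across the board, which is the "questions to be asked, but not questions of racial bias" case. Only the second scenario, where the gaps point in opposite directions, produces a nonzero contrast (about 0.04 with these invented numbers).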
Just looking at the percentages, there’s some variability between umpires by race… but then you notice the number of total pitches analyzed by each umpire ethnicity. Holy mother of discrepancy! Here’s a simplified table from page 26, including only called pitch data and some simple sums:

Um. This looks like a data problem. A quick review of the study’s first table (which I should have spent some more time on initially) reveals that our umpire sample size is 93, and a whopping 85 of the 93 umps tracked are white. 5 are black, and 3 are Hispanic. And right here, at least in my mind, you can throw out the study’s results, regardless of how elegant the rest of the paper may seem.
You cannot make sweeping statements of race and racial bias with subject groups this small. There is nothing to infer. Sure, you might have enough total pitches viewed by umps of all ethnicities to generate a statistical comparison that looks legitimate… but, at the end of the day, you’re making key assumptions about racial attitudes based on the work of 3-5 people. An experiment/study’s conditions need to be comparable in number. You wouldn’t compare the averaged IQ tests of 85 students to the average of 5 other students and expect to gain any kind of brilliant insight. It’s the same deal here.
Here’s a nice FAQ about the study as composed by the authors. They address a lot of issues, including the point that their study does NOT make any assumptions concerning the consciousness of the ump’s bias. It’s worth a read, but it doesn’t address my point concerning their umpire sample sizes. In fact, they assert the opposite:
“The basic idea behind these tests is that, because randomness is completely unpredictable, its average effect will diminish as the size of the sample increases. An effect that persists despite a very large sample, as ours is, almost certainly is not random. All our results used standard methodology to account explicitly for the possible role of randomness (Hamermesh FAQ, 1).”
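The problem is that, while the sample of pitches is quite large, the umpire sample is not. Every one of those pitches was called by the same handful of people, so a huge pitch count tells you a lot about those particular umps and not much about the group they’re supposed to represent. Three umpires can’t possibly make for a full comparison group. It’s that simple.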
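If you want to see why the pitch count alone doesn't rescue you, here's a toy simulation in Python. To be loud about it: this is not the study's data or the study's method. The 85 and 3 are the umpire counts from the paper's first table; everything else (10,000 pitches per ump, a 32% base strike rate, umps differing from each other by a couple of percentage points, the function name) is an assumption I picked for illustration. There is zero racial bias built in. Each ump just has his own personal strike zone. The code then runs a naive two-group test that treats every pitch as an independent observation and counts how often it cries "significant."

```python
import math
import random


def naive_false_positive_rate(n_major=85, n_minor=3, pitches_per_ump=10_000,
                              base_rate=0.32, ump_sd=0.02,
                              trials=1000, seed=0):
    """Share of no-bias simulations where a pitch-level z-test has |z| > 1.96.

    All defaults except the 85/3 umpire split are illustrative assumptions.
    """
    rng = random.Random(seed)

    def group_strikes(n_umps):
        total = 0
        for _ in range(n_umps):
            # Each ump gets a personal strike rate, unrelated to anyone's race
            p = min(max(rng.gauss(base_rate, ump_sd), 0.01), 0.99)
            # Normal approximation to the binomial count of called strikes
            mean = pitches_per_ump * p
            sd = math.sqrt(pitches_per_ump * p * (1 - p))
            total += round(rng.gauss(mean, sd))
        return total

    n_a = n_major * pitches_per_ump
    n_b = n_minor * pitches_per_ump
    hits = 0
    for _ in range(trials):
        s_a = group_strikes(n_major)
        s_b = group_strikes(n_minor)
        pooled = (s_a + s_b) / (n_a + n_b)
        # Naive standard error: pretends every pitch is independent
        se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        z = (s_a / n_a - s_b / n_b) / se
        if abs(z) > 1.96:
            hits += 1
    return hits / trials


print("umps differ from each other:", naive_false_positive_rate())
print("umps all identical:         ", naive_false_positive_rate(ump_sd=0.0))
```

When every ump is identical, the naive test flags a difference about as often as it should by chance. Once umps are allowed to differ from one another, it flags "significant" differences far more often, because the small group's average is really just the average of three guys. Whether the paper's own standard errors handle this is a separate question; the sketch only shows why a big pitch count by itself isn't the answer.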
Look, I’m not here to try and deny that racism (or racial disparity) exists in the world of sports. If anything, the study’s presented ethnic breakdown of umpires and players emphasizes the discrepancy in baseball’s racial composition. That, however, does not mean we need to jump to conclusions about racism and its effect on the game – a study does not a reality make.
But hey, if SportsCenter needs a headline, I’ll just generate a few to make for good controversial conversation and general screaming:
Tour de France Riders: Terrorists?
NFL Refs: Baby-rapists?
NASCAR Drivers: Necrophiliac-Cult Leaders?
Talk amongst yourselves.