April 22, 2013

Clustering Pitchers by Similarity: Part 1

About six weeks ago I presented some of my latest research at the SABR Analytics Conference in Phoenix. The analysis focused on identifying pitchers who are similar to one another, grouping them into clusters, and determining how hitters have performed against the various clusters. I worked closely with George Ng, a data scientist at YarcData, and made use of their sophisticated Urika hardware appliance, which specializes in graph analytics. The intent of the project is to develop an alternative to the relatively uninformative one-on-one batter-pitcher matchup data that teams tend to use to inform their lineup, pinch-hitting and bullpen matchup decisions. There are numerous problems with relying on one-on-one batter-pitcher history, including small sample sizes and stale data. Is it relevant that Derek Jeter’s career stats vs. Roy Halladay include a 4-for-10 in 1999?

The process of creating pitcher clusters begins with determining the attributes that define “similarity” between pitchers. I chose to tackle this issue from the batter’s perspective. In other words, what criteria would hitters use to “type” a pitcher? I matched those criteria, framed as questions, with PITCHf/x attributes. The framework, which includes about 12 different attributes, is detailed in the chart below.

Keeping with the approach of judging similarity from the hitter’s perspective, I segmented each pitcher’s data by left-handed vs. right-handed hitters. In other words, Jered Weaver wasn’t profiled once on these attributes; he was profiled twice, vs. LHB and vs. RHB separately. Some pitchers (Jered Weaver, Hiroki Kuroda and Lance Lynn are particularly good examples) approach lefty and righty hitters completely differently. For example, at a very basic level, Weaver’s top two pitches against RHB are a four-seam fastball and slider, while his top two pitches against lefties are a sinker and change-up. Some pitchers alter not only their pitch selection but also their release point (by changing their starting point on the pitching rubber), their movement (adding a little more cut to their fastball or tilt to their slider), and many of the other attributes I include in the analysis. These nuances make it important to differentiate pitchers by their lefty-righty batter splits. I also separate pitchers by their own handedness, which leads to four separate categories of pitcher clusters: RHP vs. RHB, RHP vs. LHB, LHP vs. RHB, and LHP vs. LHB.
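To make the split profiling concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the record fields (`pitcher`, `throws`, `stands`, `pitch_type`, `velo`), the two summary attributes and the sample rows are all hypothetical, and they stand in for the roughly 12 attributes used in the actual analysis.

```python
from collections import defaultdict

# Hypothetical pitch-level records; field names and values are made up
# for illustration and are not the actual PITCHf/x schema.
pitches = [
    {"pitcher": "Pitcher A", "throws": "R", "stands": "R", "pitch_type": "FF", "velo": 90.0},
    {"pitcher": "Pitcher A", "throws": "R", "stands": "R", "pitch_type": "SL", "velo": 80.0},
    {"pitcher": "Pitcher A", "throws": "R", "stands": "L", "pitch_type": "SI", "velo": 90.0},
    {"pitcher": "Pitcher A", "throws": "R", "stands": "L", "pitch_type": "CH", "velo": 82.0},
]

def build_profiles(pitches):
    """Return one profile per (pitcher, pitcher hand, batter hand) split."""
    groups = defaultdict(list)
    for p in pitches:
        groups[(p["pitcher"], p["throws"], p["stands"])].append(p)

    profiles = {}
    for key, rows in groups.items():
        n = len(rows)
        # Average velocity across all pitches in this split.
        profile = {"avg_velo": sum(r["velo"] for r in rows) / n}
        # Share of each pitch type within this split.
        counts = defaultdict(int)
        for r in rows:
            counts[r["pitch_type"]] += 1
        for pitch_type, count in counts.items():
            profile["usage_" + pitch_type] = count / n
        profiles[key] = profile
    return profiles

def category(key):
    """Map a split key to one of the four cluster categories."""
    _, throws, stands = key
    return f"{throws}HP vs. {stands}HB"

profiles = build_profiles(pitches)
for key, profile in sorted(profiles.items()):
    print(key[0], "|", category(key), "|", profile)
```

Each pitcher ends up with two rows, one per batter hand, and every row falls into exactly one of the four categories, so similarity is only ever compared within a category.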

The results of the similarity analysis show that some pitcher pairs are similar against batters of one handedness but quite different against batters of the other. The Red Sox’ Felix Doubront and the Rangers’ Matt Harrison are similar when facing LHB, but less so when facing RHB. Other highly similar pairs of pitchers include Bruce Chen and Randy Wolf (vs. LHB), Jonathan Niese and Wandy Rodriguez (vs. RHB) and David Price and Felix Doubront (vs. RHB). Pairs who are least similar, or most opposite to one another, include Brandon Morrow and Kyle Lohse (vs. LHB) and Nathan Eovaldi and Shaun Marcum (vs. RHB).
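One simple way to rank pairs like these is sketched below. The post does not say which similarity measure was used, so this is an assumption: each attribute is standardized (z-scored) within a category, and pairs are ranked by Euclidean distance. The pitcher names, attribute names and numbers are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev

# Hypothetical profiles for a single category (say, LHP vs. LHB).
profiles = {
    "Pitcher A": {"avg_velo": 91.0, "fastball_share": 0.60, "release_x": 2.0},
    "Pitcher B": {"avg_velo": 91.5, "fastball_share": 0.58, "release_x": 2.1},
    "Pitcher C": {"avg_velo": 95.0, "fastball_share": 0.40, "release_x": 1.0},
}

def standardize(profiles):
    """Z-score each attribute so no single unit (mph, feet) dominates."""
    attrs = sorted(next(iter(profiles.values())))
    stats = {}
    for a in attrs:
        values = [p[a] for p in profiles.values()]
        # Fall back to 1.0 when an attribute has no spread.
        stats[a] = (mean(values), pstdev(values) or 1.0)
    return {
        name: [(p[a] - stats[a][0]) / stats[a][1] for a in attrs]
        for name, p in profiles.items()
    }

def distance(u, v):
    """Euclidean distance between two standardized profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def rank_pairs(profiles):
    """Return (distance, name_a, name_b) tuples, most similar pair first."""
    z = standardize(profiles)
    return sorted(
        (distance(z[a], z[b]), a, b) for a, b in combinations(sorted(z), 2)
    )

ranked = rank_pairs(profiles)
print("most similar: ", ranked[0][1:])
print("least similar:", ranked[-1][1:])
```

The smallest distances correspond to the "highly similar" pairs and the largest to the "most opposite" pairs. A different metric, or weighting of the attributes, could reorder the list.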

We can also see which pitchers are most similar to themselves when facing righty and lefty hitters. It’s not surprising to see R.A. Dickey as the pitcher who differentiates the least between RHB and LHB. Closers dominate this list, as they tend to have a limited pitch repertoire and use it in the same fashion regardless of whom they face. Starters who also rank high include A.J. Burnett, Wade Miley and Manny Parra. Those who are most opposite to themselves when pitching to LHB and RHB include Lance Lynn, Matt Cain and Wade Davis.
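The self-similarity idea can be sketched the same way, comparing a pitcher's vs. LHB profile with his own vs. RHB profile. For brevity this hypothetical example looks only at pitch mix and uses total variation distance, which is an assumption rather than the measure used in the analysis. The names and usage shares are made up.

```python
def mix_difference(vs_lhb, vs_rhb):
    """Total variation distance between two pitch-usage mixes.

    0.0 means the same mix against both hands; 1.0 means no pitch type is shared.
    """
    pitch_types = set(vs_lhb) | set(vs_rhb)
    return 0.5 * sum(
        abs(vs_lhb.get(t, 0.0) - vs_rhb.get(t, 0.0)) for t in pitch_types
    )

# Hypothetical pitch-usage shares by batter hand.
splits = {
    "Pitcher A": {"L": {"FF": 0.70, "SL": 0.30}, "R": {"FF": 0.65, "SL": 0.35}},
    "Pitcher B": {"L": {"SI": 0.50, "CH": 0.50}, "R": {"FF": 0.50, "SL": 0.50}},
}

# Sort from "most similar to himself" to "most opposite to himself".
ranking = sorted(
    splits, key=lambda name: mix_difference(splits[name]["L"], splits[name]["R"])
)
for name in ranking:
    print(name, round(mix_difference(splits[name]["L"], splits[name]["R"]), 3))
```

A pitcher who uses the same repertoire against everyone lands at the top of this list, while one who swaps pitch types by batter hand lands at the bottom.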

In future posts I’ll describe the process and share the results of pitcher clusters, as well as patterns of hitter performance against clusters.

3 comments

April 22, 2013 - 2:58 pm professorjack3

I can see practical value in this approach. Pence beat up a pitcher Friday night that he’d been 0-for-8 against. That lifetime history factoid broadcasters use is of little interest, but this metric might be much more predictive.

May 10, 2013 - 2:00 am Jon Roegele

I like this approach. Looking forward to the next parts of the series.

June 3, 2013 - 9:12 am Pingback: Clustering Pitchers By Similarity: Part 2 « Diamond Dollar$