April 2013

Clustering Pitchers by Similarity: Part 1

About six weeks ago I presented some of my latest research at the SABR Analytics Conference in Phoenix. The analysis focused on identifying pitchers who are similar to one another, grouping them into clusters, and determining how hitters have performed against various clusters. I worked closely with George Ng, a data scientist at YarcData, and made use of the company’s sophisticated Urika hardware appliance, which specializes in graph analytics. The intent of the project is to develop an alternative to the relatively uninformative one-on-one batter-pitcher matchup data that teams tend to use to inform their lineup, pinch-hitting and bullpen matchup decisions. There are numerous problems with relying on one-on-one batter-pitcher history, including small sample sizes and stale data. Is it relevant that Derek Jeter’s career stats vs. Roy Halladay include a 4-for-10 in 1999?

The process to create pitcher clusters begins with determining the attributes that will define “similarity” between pitchers. I chose to tackle this issue from the batter’s perspective. In other words, what criteria would hitters use to “type” a pitcher? I matched the criteria–in the form of questions–with Pitch f/x attributes. The framework, which includes about 12 different attributes, is detailed in the chart below. Keeping with the approach of judging similarity from the perspective of the hitter, I segmented the data for each pitcher based on left-handed vs. right-handed hitters. In other words, Jered Weaver wasn’t profiled once on these attributes. Instead, he was profiled twice–vs. LHB and vs. RHB, separately. Some pitchers–Jered Weaver, Hiroki Kuroda and Lance Lynn are particularly good examples–approach lefty and righty hitters completely differently. For example, at a very basic level, Weaver’s top two pitches against RHB are a 4-seam fastball and slider, while his top two pitches against lefties are a sinker and change-up. Some pitchers not only alter their pitch selection, but also change their release point (by altering their starting point on the pitching rubber), their movement (adding a little more cut to their fastball or tilt to their slider), and many of the other attributes I include in the analysis. These nuances make it important to differentiate pitchers by their lefty-righty batter splits. Furthermore, I separate pitchers by their own handedness, which leads to four separate categories of pitcher clusters–RHP vs. RHB, RHP vs. LHB, LHP vs. RHB, and LHP vs. LHB.
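To make the split-profile idea concrete, here is a minimal sketch of how pitch records might be bucketed into the four categories and turned into one profile per pitcher per batter side. The record layout, the pitcher names and the function `build_profiles` are all assumptions made for illustration, and pitch mix stands in for the roughly 12 attributes of the actual framework.

```python
from collections import defaultdict

# Hypothetical pitch records: (pitcher, throws, batter_side, pitch_type).
# The values are made up for illustration; they are not real Pitch f/x data.
PITCHES = [
    ("Pitcher A", "R", "R", "four-seam"),
    ("Pitcher A", "R", "R", "slider"),
    ("Pitcher A", "R", "R", "four-seam"),
    ("Pitcher A", "R", "L", "sinker"),
    ("Pitcher A", "R", "L", "change-up"),
    ("Pitcher B", "L", "R", "four-seam"),
    ("Pitcher B", "L", "L", "slider"),
]


def build_profiles(pitches):
    """Return {(throws, batter_side): {pitcher: {pitch_type: usage share}}}."""
    counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for pitcher, throws, side, pitch_type in pitches:
        counts[(throws, side)][pitcher][pitch_type] += 1

    profiles = {}
    for category, by_pitcher in counts.items():
        profiles[category] = {}
        for pitcher, mix in by_pitcher.items():
            total = sum(mix.values())
            # Convert raw counts to usage shares so pitchers are comparable.
            profiles[category][pitcher] = {p: n / total for p, n in mix.items()}
    return profiles


profiles = build_profiles(PITCHES)
```

In this sketch the same pitcher appears under two keys, such as `("R", "R")` and `("R", "L")`, so he is profiled twice rather than once.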

Clustering Pitchers

The results of the similarity analysis show that some pitcher pairs are similar against batters from one side of the plate, but very different against batters from the other side. The Red Sox’ Felix Doubront and the Rangers’ Matt Harrison are similar when facing LHB, but less so when facing RHB. Other highly similar pairs of pitchers include Bruce Chen and Randy Wolf (vs. LHB), Jonathon Niese and Wandy Rodriguez (vs. RHB) and David Price and Felix Doubront (vs. RHB). Pitchers who are least similar, or most opposite to one another, include Brandon Morrow and Kyle Lohse (vs. LHB) and Nathan Eovaldi and Shaun Marcum (vs. RHB).
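One way to picture how most-similar and least-similar pairs fall out of the data is a pairwise distance calculation within a single category. The sketch below is an assumption, not the method used in this research (which ran on a graph analytics appliance): it standardizes three made-up attributes and uses plain Euclidean distance, with hypothetical pitchers and values.

```python
import math
from itertools import combinations
from statistics import mean, pstdev

# Hypothetical attribute vectors for one category (e.g. LHP vs. LHB):
# (fastball velocity in mph, fastball usage share, horizontal release point in feet).
PROFILES = {
    "Pitcher A": (92.0, 0.60, 2.1),
    "Pitcher B": (91.5, 0.58, 2.0),
    "Pitcher C": (87.0, 0.35, 1.2),
    "Pitcher D": (96.0, 0.70, 2.8),
}


def standardize(profiles):
    """Convert each attribute to a z-score so no single unit dominates."""
    columns = list(zip(*profiles.values()))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) or 1.0 for col in columns]  # guard against zero spread
    return {
        name: tuple((v - m) / s for v, m, s in zip(vec, mus, sigmas))
        for name, vec in profiles.items()
    }


def pairwise_distances(profiles):
    """Euclidean distance between every pair of standardized profiles."""
    z = standardize(profiles)
    return {(a, b): math.dist(z[a], z[b]) for a, b in combinations(sorted(z), 2)}


distances = pairwise_distances(PROFILES)
most_similar = min(distances, key=distances.get)   # smallest distance
least_similar = max(distances, key=distances.get)  # largest distance
```

Running the same calculation separately for each of the four categories is what lets a pair look close against lefties and far apart against righties.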

We can also see which pitchers are most similar to themselves when facing righty and lefty hitters. It’s not surprising to see R.A. Dickey as the pitcher who differentiates the least between RHB and LHB. Closers dominate this list, as they tend to have a limited pitch repertoire and use it in the same fashion regardless of whom they face. But starters who also rank high are A.J. Burnett, Wade Miley and Manny Parra. Those who are most opposite to themselves when pitching to LHB and RHB include Lance Lynn, Matt Cain and Wade Davis.
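The self-similarity ranking can be sketched the same way, by comparing each pitcher’s vs. LHB profile with his own vs. RHB profile. The example below is a simplified assumption: it looks only at pitch usage, scores the gap as half the summed absolute difference in usage shares, and uses invented pitchers and numbers rather than the full attribute framework.

```python
# Hypothetical pitch-usage shares by batter side; numbers are illustrative only.
SPLITS = {
    "Knuckleballer": {
        "L": {"knuckleball": 0.85, "fastball": 0.15},
        "R": {"knuckleball": 0.86, "fastball": 0.14},
    },
    "Split-heavy starter": {
        "L": {"sinker": 0.50, "change-up": 0.40, "slider": 0.10},
        "R": {"four-seam": 0.55, "slider": 0.40, "sinker": 0.05},
    },
}


def split_gap(vs_left, vs_right):
    """Half the summed absolute difference in usage shares.

    0 means identical usage against both sides; 1 means no overlap at all.
    """
    pitch_types = set(vs_left) | set(vs_right)
    return 0.5 * sum(
        abs(vs_left.get(p, 0.0) - vs_right.get(p, 0.0)) for p in pitch_types
    )


gaps = {name: split_gap(s["L"], s["R"]) for name, s in SPLITS.items()}
ranked = sorted(gaps, key=gaps.get)  # most self-similar pitcher first
```

A pitcher who throws essentially one pitch to everyone lands at the top of `ranked`, while one who swaps most of his repertoire by batter side lands at the bottom.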

In future posts I’ll describe the process and share the results of pitcher clusters, as well as patterns of hitter performance against clusters.

It’s Time for the Yankees to Make the Big Move

With the news that Derek Jeter’s return is delayed until at least late July, guaranteeing he’ll miss 100 or more games this year, it may be time to go to Plan B. The perfect move for the Yankees may be to trade for the Texas Rangers’ Jurickson Profar, a shortstop and the top-rated prospect in all of baseball. When Jeter plays his next game as a Yankee, he will be 39 years old. Considering that many have questioned for several years his ability to play a credible shortstop, a 39-year-old version, coming off serious ankle surgery, does not seem to be a great fit for a championship-caliber team. On the other side of this potential trade we have a team that has two outstanding shortstops. Elvis Andrus, the incumbent Rangers shortstop, is a 24-year-old who has already made two All-Star teams and played in two World Series. Profar made his major league debut last September, as a 19-year-old, and promptly homered in his first MLB plate appearance. He is Baseball America’s #1 ranked prospect. He projects to be a legitimate major league shortstop with above-average power and a significantly above-average bat–a rare trifecta of skills.

I can’t think of a better time to gracefully slide Jeter to another role in the Yankee lineup. With his extended absence, uncertain return and even more uncertain physical capacity once he does return, it’s hard to argue against a move to acquire the top shortstop prospect since Troy Tulowitzki. At age 20, Profar would be under Yankee control at least through his age-26 season. His quick bat will likely amplify his left-handed power at Yankee Stadium, making him an even greater run producer than expected. The hope is that within a year or two–by age 22–Profar is a .280 hitter with 15 home runs, plus an above-average major league shortstop. His ultimate upside could be the second coming of Robinson Cano.

One question is what the Yankees can give up to induce the Rangers to trade baseball’s top prospect. The Yankees would need to assemble an impressive package of players to acquire Profar. The Yankees’ farm system is not depleted, but many of its top prospects are at lower levels. A package that includes 21-year-old outfielder Mason Williams and another highly rated prospect, like Tyler Austin, along with Brett Gardner, may at least get the Rangers’ attention. If the Yankees need to add Joba Chamberlain to the package, it’s worth considering. I realize that Brett Gardner is an integral part of the Yankee offense today, but with Granderson coming back soon, it might make sense to deal from a position of relative strength in order to solve the long-term problem of Jeter’s successor. I just don’t believe Eduardo Nunez has the defensive chops to be an everyday big league shortstop on a contending team. There may not be a cheaper option anytime soon, or one that has the chance to be an enduring, long-term solution like Profar.

The toughest question may be where Jeter will play when he returns. Making him the primary DH may be the best option, while easing him into 3B, a position that requires much less lateral range. When the Yankees acknowledge that Jeter cannot play shortstop at a high level, a logjam is inevitable at either DH or the position Jeter moves to. When (if?) A-Rod comes back, it gets even more complicated. A-Rod may be best suited for DH. Hafner can only be a DH. Youkilis is limited to 1B, 3B or DH. However, these problems are only marginally more complicated with Profar replacing Jeter at shortstop. The question of how to allocate playing time among players who have evolved into immobile, primarily offensive contributors is not going away for the Yankees of the next several years. Now may be the time to confront the issue head-on.

Stats vs. No Stats—a Controlled Experiment?

Over the last week, two articles appeared discussing two teams’ contrasting approaches to making baseball decisions. The Washington Nationals were called a “scouting first” organization that integrates statistical analyses into team decisions. By contrast, the Philadelphia Phillies seem proudly defiant of the trend to incorporate advanced metrics into their decision criteria. While a large number of MLB teams put significant energy and dollars into objective analysis of data, the other end of the spectrum is often a mystery. Who are the clubs, and how do they process information? In recent years teams like the Orioles, Dodgers and Giants have been accused of shunning stats in favor of intuition or the perspective and wisdom of career baseball people. However, when pressed, these teams typically deny an aversion to the numbers side of the game and in fact tout their otherwise low-profile prowess in this area. It now seems that the Phillies are willing to be the proud flag-bearers for a shrinking group of ballclubs that believe “new stats” fail to add value to decisions. We may finally have a controlled experiment of the stats team vs. the no-stats team. If two clubs that fit those descriptions were to maintain their loyalty to their respective internal decision processes, it would be interesting to see how they perform over the next four or five years.

So who is our poster child for the stats gurus? In the opposite corner, representing the stat heads, we have the Houston Astros. Truth be told, that corner is actually quite crowded with teams that strive to make statistical analysis a competitive advantage, with the Tampa Bay Rays at the top of the list, but we’ll choose the Astros as the subject for our controlled experiment. Under the leadership of former Cardinals executive Jeff Luhnow, the Astros have assembled a team that more closely resembles a NASA lab crew than a baseball front office. From former NASA engineer Sig Mejdal, the team’s Director of Decision Sciences, to Assistant GM David Stearns and Pitch f/x guru Mike Fast, Luhnow has attracted a top-notch staff. Team CEO George Postolos seems fully bought in to Luhnow’s approach, and the baseball world is watching to see how the Astros fare over the next five years.

I like matching the Astros against the Phillies, because this matchup also has a bit of handicapping embedded in it. The Phillies have been a competitive club that some believe can still contend for the NL East, while the Astros are thought to be the worst team in baseball, by a lot. Given the predictions of how each team is expected to perform in 2013, we’re probably giving the Phillies a 20-win head start for the coming season. We can see how long the Astros take to close the gap and try to assess whether the two teams’ approaches to decisions were responsible for the outcome.

My view is that well thought out problem solving—quantitative and qualitative—can add enormous value to decision processes. Over my career, I’ve seen analytics supplement intuitive judgment, experience and observation on hundreds of occasions, almost always leading to higher quality decisions. I’ve seen baseball teams integrate analytics with scouting information and the wisdom of veteran baseball people to improve the confidence in their decisions.

The baseball data world is changing rapidly. Just six years ago baseball was producing about 900,000 data points to capture the outcomes of each pitch thrown and ultimately of each plate appearance in a major league season. With the introduction of Pitch f/x and related datasets, beginning on a full-scale basis in 2008, we now have over 15 million annual data points that chronicle the baseball season, ranging from the angle of break on Derek Holland’s slider to the most popular two-pitch sequence by Jered Weaver. There are literally thousands of questions that we could only speculate on six years ago that we can answer objectively today. Even if you believe that statistical analysis may not have been a difference maker in 2006, the more than 15x increase in data we have today changes the game. It can help reduce the risk on $100 million contract decisions to a manageable level. I’m not arguing against the scouting perspective. The scouting perspective is critical and often the lead horse in a decision process. But that’s different from excluding statistical analysis from the ultimate decision.

My bet on how the controlled experiment turns out: I expect the experiment to be aborted before we reach our five-year timeframe, as the Phillies will eventually modify their decision processes to integrate more quantitative information. If that change occurs, it may itself be interpreted as the answer to the controlled experiment.
