Monday, October 26, 2015

State of the Art of the Stats

At the end of August, in a post discussing the exploration and filtering of shots from tennis matches, I referenced the fact that the majority of data generated by tennis matches hasn't been captured; venues where Hawk Eye is available are the exception, and even Hawk Eye doesn't capture every detail that could be useful in downstream analysis of matches.

Certainly more and more tennis data *is* being captured today, and the alternatives for capturing data are expanding steadily. When video is available it is even possible to capture additional data from historical matches (indeed this is being done to some extent by volunteers for the Match Charting Project, and automated video processing tools are in development which could facilitate this process). But if "progress" is to be made in understanding how the analysis of tennis data can contribute to the development of the game (which primarily means player development), at least two things need to happen:
  1. tennis data has to be made accessible, and 
  2. tennis data has to be transformed into a standard format
Obvious, right? 

As I've pursued my vision of a Tennis Analytics Integration Platform I've struggled with both of these issues at almost every turn.  My professional career was entirely focused on inter-application, cross-enterprise and intra-enterprise data and process integration, so this is very familiar, and unsurprising, territory.  But I *was* surprised when I initially began to track my boys' tennis matches and found that the elegant iPhone/iPad application ProTracker Tennis allows for export of match data in an easily parseable format.  And I was inspired, as many others have been, by the entirely open approach to tennis data and statistics taken by Jeff Sackmann, who makes his entire dataset available on GitHub.  (Here is a great Guardian article about Sackmann's effort.)

Still, the integration of data from only two sources requires careful consideration.  As an example, ProTracker Tennis captures a good amount of information, but only a handful of shots out of every point.  The Match Charting Project makes it possible to capture *many* attributes for *every* shot of every point, but doesn't attempt to capture shot coordinates.  The former tracks "Forcing Errors" (but not the error that was forced) while the latter charts "Forced Errors". The upshot of such differences is twofold: data must be coerced into a "standard" view, and it is often not possible to generate the same set of statistics (or visualizations) for matches from each source.
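
To make the "coercion" idea concrete, here is a minimal sketch in Python. It does not reflect the actual export formats of ProTracker Tennis or the Match Charting Project; the `PointOutcome` record and both converter functions are hypothetical names, and the sketch assumes that a "Forcing Error" credited to one player describes the same event as a "Forced Error" charged to the opponent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PointOutcome:
    """Hypothetical 'standard' record for a point that ended in a forced error."""
    point_winner: str   # player who won the point
    erring_player: str  # player charged with the error
    kind: str           # shared vocabulary, here always "forced_error"


def from_forcing_error(forcing_player: str, opponent: str) -> PointOutcome:
    """Source style A (assumed): the event is credited to the player who forced the error."""
    return PointOutcome(point_winner=forcing_player,
                        erring_player=opponent,
                        kind="forced_error")


def from_forced_error(erring_player: str, opponent: str) -> PointOutcome:
    """Source style B (assumed): the event is charged to the player who made the error."""
    return PointOutcome(point_winner=opponent,
                        erring_player=erring_player,
                        kind="forced_error")


# The same point, described from the two perspectives, maps to one record.
a = from_forcing_error("Player 1", "Player 2")
b = from_forced_error("Player 2", "Player 1")
print(a == b)
```

Once both sources land in the same record type, any statistic that depends only on the shared fields can be computed for matches from either source; statistics that need fields only one source captures still cannot.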

Admittedly, ProTracker Tennis and the Match Charting Project are, in general, targeting different audiences: ProTracker Tennis is mostly used to track amateur and junior matches while the Match Charting Project is focused on ATP and WTA matches.  Nevertheless, it is possible to use either tool to track any match; and it is further possible that comparing match play of amateurs, juniors and professionals in a common framework will produce useful insights for player development.

Disagreeable Data

This week I (finally) began adding additional data sources for ATP / WTA matches to the Tennis Analytics Integration Platform.  There are now more than 2,500 matches which can be visualized in TAVA, and that number is poised to grow rapidly.  Not only will the number of matches grow; the number of matches for which there are multiple "views" will also grow.  

In an ideal situation, two views of a single match would be entirely complementary, overlapping only for player names, tournament and venue names, and dates.  This does occur when point-progression data is pulled from betting sites (which tend to have score results and match odds but not traditional stats) and pre-calculated statistics are pulled from a tournament site (which don't ever seem to have point-progression data).  

But not all situations are ideal.

Merging data sources is always tricky, even when the data originates from the same domain and ostensibly covers the same conceptual territory. In addition to the challenges presented by the fact that various data sources have modified and extended their own data formats over the years, there is the "not insignificant" issue that a number of tennis statistics are generated from data that is very much subject to the interpretation of the individual doing the gathering.  (See articles by Carl Bialik and Jeff Sackmann on the topic of forced vs. unforced errors.)
“I think if you have two or three different people recording unforced errors, you’re going to get two or three different figures,” said Kevin Fischer, senior communications manager for the Women’s Tennis Association.  - NYT
The matches from data sources I've added this week overlap significantly with the Match Charting Project.  This is both a headache (from a design and programming point-of-view) and an opportunity. One of the potential weaknesses of the Match Charting Project (eloquently articulated by Stephanie Kovalchik in her blog On-the-T.com) is that match data is gathered by a single person working, often in real time, with a spreadsheet.  Errors can be introduced into the data that are hard to weed out. With a second and even third "view" of point-progression and final statistics, discrepancies can be automatically identified and flagged.  [An obvious addition to Tennis AiP is a Match Editor in which such discrepancies can be resolved - and indeed this is in the roadmap.]
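
As a rough illustration of automatic flagging, the sketch below compares two "views" of the same match and reports every shared statistic whose values disagree. The function name, the stat keys and the numbers are all made up for the example; they are not taken from Tennis AiP or from any real match.

```python
from typing import Dict, List, Tuple


def flag_discrepancies(view_a: Dict[str, int],
                       view_b: Dict[str, int],
                       tolerance: int = 0) -> List[Tuple[str, int, int]]:
    """Return (stat, value_a, value_b) for stats found in both views
    whose values differ by more than `tolerance`."""
    shared = sorted(set(view_a) & set(view_b))
    return [(stat, view_a[stat], view_b[stat])
            for stat in shared
            if abs(view_a[stat] - view_b[stat]) > tolerance]


# Invented figures: one hand-charted view and one "official" view.
charted = {"aces": 7, "double_faults": 3, "unforced_errors": 31}
official = {"aces": 7, "double_faults": 4, "unforced_errors": 24, "winners": 40}

flags = flag_discrepancies(charted, official)
for stat, a, b in flags:
    print(f"{stat}: charted={a} official={b}")
```

A non-zero `tolerance` is one possible way to treat interpretive statistics (such as unforced errors) more leniently than objective ones (such as aces), though where to draw that line is a judgment call.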

It's important to remember that even the "official" data for professional tennis matches is still being generated by teams of humans working "behind the scenes".  A 2014 article which appeared in The Guardian gives some perspective on the "State of the Art" of how tennis statistics are still gathered today: a team of 48 "data entry people" is scattered about Wimbledon, and most sit courtside.  They have technology, yes, but, apart from serve speed, there is a human generating every bit of data. And not every court can be covered to the extent that full stats can be generated. This is apparent even from a review of stats available on IBM's Slamtracker - matches between lower ranked players which take place on peripheral courts often have little to no data available.

All of this leads to the thought that there are further opportunities for crowdsourcing tennis data and generating statistics and match views that could push beyond what is currently being done for the professional tours, even by top-flight corporations.

Machine Dreams and Silo Silliness

But what about Hawk Eye, you may ask?  Yes, impressive technology, generating huge amounts of data, all of it inaccessible to statisticians and aficionados.  It's not even clear that the coaches of top tennis professionals actually know what to do with it all, yet.  Damien Saunder at GameSetMap.com has a number of articles looking at what kind of analysis Hawk Eye data makes possible.  At present, however, none of the official statistics available online appear to have been actually produced by systems such as Hawk Eye; it is used for adjudication and for the generation of graphics during televised performances, to "enhance viewer experience".  Certainly Hawk Eye can be viewed as "State of the Art" technology for raw data capture, but it is not clear that it represents the "State of the Art" for charting and statistical analysis.

It is also not clear to what extent a tennis match *can* be automatically charted, though companies such as Mojjo and PlaySight are now gathering impressive amounts of data and making it possible to deliver useful and usable systems which can be operated by club players.  Both companies are reducing the cost to end-users and, while not yet entirely portable, have made it possible to at least increase court coverage - the number of courts from which some portion of match data could potentially be accessed.  There are also companies such as Tennis Analytics (high end) and Tennis-Stat.com that offer services for post-match video processing from end-user video, but none appear to be as advanced as what Damien Saunder achieved using ArcGIS in this 2012/2013 effort (though Tennis Analytics is enabling broad and powerful exploration and filtering of match data - see the end of this article).

The big question, from my perspective, is whether these relatively new entries into Match Tracking (or Charting) will remain Data/Information Silos and be as inaccessible to third parties as Hawk Eye. If it is not possible to integrate data from such sources, then the "State of the Art" can only be advanced within the context of a single application environment (a single corporation). Barring a scenario where one technical "solution" becomes ubiquitous across all venues where one might potentially play tennis, we could never achieve a Big Data view, even for a single player (though a single player could potentially spend hundreds of thousands of dollars to pay a single service to generate their own data set).

I also see the "Silo Mentality" in practically every tennis application that is available for Tablets and Smart Phones (apart from the sad fact that the vast majority of them are inscrutable and/or useless).   The primary focus of every new tennis tracking application or product offering (including racquet sensors, the new Babolat POP and Pulse Play) is to "build a community" and to cash in on "social marketing"; only a very few such products that I've reviewed (out of several dozen) have any sort of intentional data export capability, and fewer still export any data that is useful for pushing the "State of the Art" in tennis analysis forward or in fact adding meaningfully to what little conversation there is about how any of the data or analysis will contribute to better tennis (to player development).

The most potent and relevant counterexample to the situation I've described above is probably the Developer API offered by FitBit, a leader in the "Activity Tracker" market.  An increasing number of such gadgets allow third parties, usually for a monthly subscription fee, to access and integrate data in real time.  Similar APIs are catalogued by ProgrammableWeb.  There is no reason why vendors of products and services related to tennis match data could not create similar offerings.  In the future, the Tennis Analytics Integration Platform will expose just such an API, and I have created a number of examples in the hope of inspiring collaborators with more data visualization experience than I have to join the effort.

Conclusion

There seems to be a general agreement that presenting more data is a good thing, and that surely the presentation and even visualization of data will be helpful, but I fear there is a real danger that it is all just a distraction and that the only ones who *may* benefit, for some period of time, are those selling the technology.  It certainly seems that IBM's and SAP's pursuit of Analytics for the ATP and WTA, respectively, is far more about marketing their brands and keeping the audience entertained than it is about actually providing meaningful insights. (Not that entertainment is a bad thing!) This is a critique that appears often, most recently in Nikita Taparia's very entertaining Tennis Note #24.

Ultimately the benefits to be found from analyzing data across a large number of tennis matches will be limited by the subset of common statistics and "views" that can be derived from whatever number of data sources are amenable to integration.  The goal of integration is to maximize the amount of data which can be usefully processed while minimizing the degree to which differences in the structure and "views" offered by each data source impact the scope of viable analysis.

I believe it would ultimately be to the benefit of all who are producing tennis-related applications as well as those who are working in the field of Sports Analytics if there were an Open Data approach to data generated from tennis matches. There have been, without a doubt, many passionate pleas for such an approach that preceded this screed. If you've made it this far, I'm surprised again!

Sunday, October 4, 2015

Game Tree: Point Progression


"Game Tree" is a depiction of Point Progression for a selection of games within a tennis match or across a series of tennis matches; it is a Sankey Diagram and possesses the "Markov property", meaning that the set of possible future "states" depends only on the current "state" (the point score at any moment in a game), not on how that score was reached. Here is a nice interactive explanation.
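
The state space behind this idea is small enough to write down. The following Python sketch is only an illustration, not the code behind Game Tree: the `(server points, receiver points)` encoding and the names `normalize`, `is_over`, `next_states` and `label` are assumptions made for the example. It folds every Deuce/Advantage score back onto three states, so the next possible scores can be derived from the current score alone.

```python
from typing import List, Tuple

State = Tuple[int, int]  # (server points, receiver points) in the current game

NAMES = ["0", "15", "30", "40"]


def normalize(state: State) -> State:
    """Fold every Deuce/Advantage score back onto 3-3, 4-3 or 3-4."""
    s, r = state
    if s >= 3 and r >= 3 and abs(s - r) <= 1:
        return (3 + max(s - r, 0), 3 + max(r - s, 0))
    return state


def is_over(state: State) -> bool:
    """A game ends when a player has at least 4 points and leads by 2."""
    s, r = state
    return max(s, r) >= 4 and abs(s - r) >= 2


def next_states(state: State) -> List[State]:
    """Scores reachable with one more point: [server wins it, receiver wins it]."""
    if is_over(state):
        return []
    s, r = state
    return [normalize((s + 1, r)), normalize((s, r + 1))]


def label(state: State) -> str:
    """Human-readable score from the server's perspective."""
    s, r = state
    if is_over(state):
        return "Game server" if s > r else "Game receiver"
    if s >= 3 and r >= 3:
        return "Deuce" if s == r else ("Ad-In" if s > r else "Ad-Out")
    return f"{NAMES[s]}-{NAMES[r]}"


print(label((3, 3)), "->", [label(n) for n in next_states((3, 3))])
```

Because Ad-In followed by a lost point returns to the same Deuce state, the diagram can loop between Deuce and Advantage without growing.
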
"Markov Chains" have often been applied to tennis games.  Google "Markov Tennis" and you'll find a large number of articles on statistics which use Tennis to explore probability.  A few of the results use Data Flow Diagrams to depict Point Progression: Wolfram Alpha has an attractive visualization (see above) which was reproduced in an article on Predictive Modeling; and NC State University produced a YouTube video, as part of an online course titled "Introduction to Finite Math", with a whiteboard explanation of state transitions in tennis games.  When visualizations are provided they are usually arranged like the GameFish in TennisVisuals.com, with either horizontal or vertical orientation:

As far as I can tell, the Game Tree design created by Damien Saunder and David Webb at GameSetMap.com is the first time a Sankey Diagram (or Harness Flow Map) was applied to Point Progression in Tennis. The primary innovation was to apply the idea of "quantitative flow lines" to the possible point paths through the tree such that the width of each line represents the frequency with which games passed through each possible "state" for the score, but the real power of the design comes from its interactive nature.  SVG (Scalable Vector Graphics) is used to:
  1. animate exploration of the data when it is filtered by selecting individual games, groups of games, or constraining the games to only service games for a chosen player
  2. provide contextual information when "hovering" over specific elements
The original implementation of Game Tree, presented as a celebration of Nadal's 2013 comeback, used match data downloaded in XML format from the William Hill Sports betting website.
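
The quantity behind each flow line is simply a count of how many times play moved from one score to the next. Here is a minimal Python sketch of that tally, under stated assumptions: each game is encoded as a string of point winners ("S" for server, "R" for receiver), the three sample games are invented, and `step` and `flow_counts` are hypothetical names rather than anything from the Saunder/Webb or TennisVisuals implementations. The sketch does not validate that a game string stops exactly when the game ends.

```python
from collections import Counter
from typing import Iterable, Tuple

State = Tuple[int, int]  # (server points, receiver points)


def step(score: State, server_won: bool) -> State:
    """Advance the score by one point, folding Deuce/Advantage onto 3-3, 4-3, 3-4."""
    s, r = score
    s, r = (s + 1, r) if server_won else (s, r + 1)
    if s >= 3 and r >= 3 and abs(s - r) <= 1:
        d = s - r
        s, r = 3 + max(d, 0), 3 + max(-d, 0)
    return (s, r)


def flow_counts(games: Iterable[str]) -> Counter:
    """Count transitions (from_score, to_score) across all games."""
    flows: Counter = Counter()
    for points in games:
        score = (0, 0)
        for winner in points:
            nxt = step(score, winner == "S")
            flows[(score, nxt)] += 1
            score = nxt
    return flows


# Invented service games: a love hold, a hold to 15, and a hold after one Deuce.
games = ["SSSS", "RSSSS", "SRSRSRSS"]
flows = flow_counts(games)

first_points_won = flows[((0, 0), (1, 0))]
first_points_total = first_points_won + flows[((0, 0), (0, 1))]
print(f"Server won the first point in {first_points_won} of {first_points_total} games")
```

In a Sankey layout each count would become the width of the link between the two score nodes; filtering by game or by server just changes which game strings are passed in.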


In the TennisVisuals version of Game Tree, data is retrieved in JSON format from the Mongo database which underpins TennisVisuals.com.  That data, in turn, is presently sourced from Jeff Sackmann's Match Charting Project (many other data sources will come online soon).

The inspiration for the Game Tree design seems to have been the same frustration that drove the development of the Points-to-Set chart: the final score of a tennis match reveals very little about how close a match actually may have been. Even a match with a 6-0, 6-0 score may have been "hotly contested".  Traditional stats miss the story every time. Percentage of Points Won for a 6-0, 6-0 match, for instance, provides only a very crude view of match intensity, ranging from 100% for complete dominance by one player to 62.5% for a match in which every game reached Deuce once and only once. It relates nothing of the drama and is of very little use for constructive analysis.
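
The arithmetic behind those two figures is easy to check. A game that reaches Deuce exactly once stands at three points each and is then won with two straight points, so the winner takes 5 points and the loser 3. The short Python sketch below (the function name `points_won_share` is just for illustration) applies that to the 12 games of a 6-0, 6-0 match.

```python
from fractions import Fraction


def points_won_share(won_per_game: int, lost_per_game: int, games: int = 12) -> Fraction:
    """Share of all points won by a player who wins every game with the same point tally."""
    won = won_per_game * games
    total = (won_per_game + lost_per_game) * games
    return Fraction(won, total)


love_games = points_won_share(4, 0)   # every game won to love
one_deuce = points_won_share(5, 3)    # every game reached Deuce exactly once
two_deuces = points_won_share(6, 4)   # every game reached Deuce exactly twice

print(float(love_games), float(one_deuce), float(two_deuces))
```

Each additional trip to Deuce adds one point to both players, so the share keeps drifting toward 50% even though the scoreline stays 6-0, 6-0.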

In the following Game Tree visualization of Nadal's service games in a match against Wawrinka at the 2013 Madrid Masters, it is easy to see that Nadal won the first point of his service games 77.8% of the time.  When he did lose the first point, he won the second point 100% of the time.


With Game Tree it is possible to see how often Deuce was reached during a match; the thickness of flow lines even indicates how often game scores ricocheted between Deuce and Advantage.  In the match with Wawrinka, for games both served and received, Nadal lost only one game that reached Deuce:


In the Saunder/Webb implementation of Game Tree, the "Nodes" of the tree are color-coded to indicate momentum.  Dark nodes represent positive momentum while Red nodes represent negative momentum.  In the TennisVisuals version of Game Tree these representations still hold true, but momentum is always viewed from the perspective of the primary player; when filtering for the opponent's service game, the tree is not "flipped", as occurs in the Saunder/Webb version of the Nadal-Djokovic Roland Garros 2014 final.

The relative importance of each point in tennis games, sets, and matches has been analyzed extensively, most famously by Carl Morris in his article "The most important points in tennis", which was published in Optimal Strategies in Sports in 1977.   It is probably impossible to publish analysis of points in tennis without referencing Morris...  here is one of many studies, notable for its visualization of relative point importance within the context of a set:
In 2014, Professors Franc Klaassen and Jan Magnus provided ample coverage of the topic in their book "Analyzing Wimbledon".  Most recently Jeff Sackmann wrote a series of blog posts ("How Important is the First Point of Each Game?", "The Pivotal Point of 15-30") drawing on a theoretical model which he has published and utilizing his extensive match database.
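
For readers who want to experiment, a common way to formalize point importance within a game, in the spirit of Morris's definition, is the difference between the probability of winning the game if the point is won and the probability of winning it if the point is lost. The Python sketch below rests on a strong simplifying assumption that is not claimed by any of the works cited here: the server wins every point independently with the same probability `p`. The function names are hypothetical.

```python
def game_win_prob(s: int, r: int, p: float) -> float:
    """Probability that the server wins the game from (s, r) points,
    assuming each point is won independently with probability p."""
    if s >= 4 and s - r >= 2:
        return 1.0
    if r >= 4 and r - s >= 2:
        return 0.0
    if s >= 3 and r >= 3:
        q = 1.0 - p
        deuce = p * p / (p * p + q * q)  # win two in a row before losing two in a row
        if s == r:
            return deuce
        return p + q * deuce if s > r else p * deuce
    return p * game_win_prob(s + 1, r, p) + (1.0 - p) * game_win_prob(s, r + 1, p)


def importance(s: int, r: int, p: float) -> float:
    """P(win game | win this point) minus P(win game | lose this point)."""
    return game_win_prob(s + 1, r, p) - game_win_prob(s, r + 1, p)


p = 0.6  # illustrative server point-win probability, not a measured value
for s, r, name in [(0, 0, "0-0"), (3, 0, "40-0"), (1, 2, "15-30"), (2, 3, "30-40"), (3, 3, "Deuce")]:
    print(f"{name:>6}: importance = {importance(s, r, p):.3f}")
```

Benchmark figures of this kind could be drawn next to the observed flows for a match, though real players do not win points with a constant probability, which is exactly why the empirical studies are interesting.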

My plan is to integrate the insights garnered from such analyses into the TennisVisuals version of Game Tree so that results for each match can be viewed in the context of benchmark figures. I'd like to auto-generate summary reports to go along with Game Tree visualizations, similar to what Saunder has done for the Nadal-Djokovic 2014 Roland Garros final.  I also plan to divide each point "flow line" into errors and winners and highlight "clutch" performance.

Shortly after releasing the initial version of Game Tree, Saunder published a follow-up entitled "Where are you most likely to win a point on Nadal's serve?" In this article he introduced a "Proportional Symbol Game Tree" which shows the percent chance at every possible "state" of the score that an opponent had of winning the point.  It's enticing to think about using a similar Game Tree to visualize a player's service game performance in one match relative to their average over the course of the past year...  perhaps overlapped Proportional Symbols of reduced opacity...