Monday, October 26, 2015

State of the Art of the Stats

At the end of August, in a post discussing the exploration and filtering of shots from tennis matches, I referenced the fact that the majority of data generated by tennis matches hasn't been captured; venues where Hawk Eye is available are the exception, and even Hawk Eye doesn't capture every detail that could be useful in downstream analysis of matches.

Certainly today more and more tennis data *is* being captured, and the alternatives for capturing data are expanding steadily - when there is video available it is even possible to capture additional data from historical matches (indeed this is being done to some extent by volunteers for the Match Charting Project, and automated video processing tools are in development which could facilitate this process) - but if "progress" is to be made in understanding how the analysis of tennis data can contribute to the development of the game (which primarily means player development), at least two things need to happen:
  1. tennis data has to be made accessible, and 
  2. tennis data has to be transformed into a standard format
Obvious, right? 

As I've pursued my vision of a Tennis Analytics Integration Platform I've struggled with both of these issues at almost every turn.  My professional career was entirely focused on inter-application, cross-enterprise and intra-enterprise data and process integration, so it is very familiar territory, and unsurprising.  But I *was* surprised when I initially began to track my boys' tennis matches and found that the elegant iPhone/iPad application ProTracker Tennis allows for export of match data in an easily parseable format.  And I was inspired, as many others have been, by the entirely open approach to tennis data and statistics taken by Jeff Sackmann, who makes his entire dataset available on GitHub.  (Here is a great Guardian article about Sackmann's effort).

Still, the integration of data from only two sources requires careful consideration.  As an example, ProTracker Tennis captures a good amount of information, but only a handful of shots out of every point.  The Match Charting Project makes it possible to capture *many* attributes for *every* shot of every point, but doesn't attempt to capture shot coordinates.  The former tracks "Forcing Errors" (but not the error that was forced) while the latter charts "Forced Errors". The "upshot" of such differences is not only that data must be coerced into a "standard" view, but also that it is often not possible to generate the same set of statistics (or visualizations) for matches from each source.
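To make the idea of a "standard" view a little more concrete, here is a minimal Python sketch. The field names, the `FIELD_MAP` table and both helper functions are hypothetical illustrations of my own; they are not the actual export formats of ProTracker Tennis or the Match Charting Project. The sketch renames each source's fields to a shared vocabulary and then works out which statistics can be compared across sources.

```python
# Hypothetical mapping from each source's field names to a shared vocabulary.
# These names are assumptions for illustration, not real export formats.
FIELD_MAP = {
    "protracker": {
        "Aces": "aces",
        "Double Faults": "double_faults",
        "Forcing Errors": "forcing_errors",
    },
    "match_charting": {
        "aces": "aces",
        "dfs": "double_faults",
        "forced": "forced_errors",
    },
}


def to_standard(source, raw):
    """Rename the fields of one raw record; unknown fields are dropped."""
    mapping = FIELD_MAP[source]
    return {mapping[key]: value for key, value in raw.items() if key in mapping}


def comparable_stats(*views):
    """Return the sorted names of statistics present in every view."""
    common = set(views[0])
    for view in views[1:]:
        common &= set(view)
    return sorted(common)


a = to_standard("protracker", {"Aces": 5, "Double Faults": 2, "Forcing Errors": 7})
b = to_standard("match_charting", {"aces": 4, "dfs": 3, "forced": 6})
print(comparable_stats(a, b))  # ['aces', 'double_faults']
```

Note that "forcing errors" and "forced errors" are deliberately kept as separate fields rather than merged, so they drop out of the comparable set. That is exactly the kind of loss described above.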

Admittedly, ProTracker Tennis and the Match Charting Project are, in general, targeting different audiences: ProTracker Tennis is mostly used to track amateur and junior matches while the Match Charting Project is focused on ATP and WTA matches.  Nevertheless, it is possible to use either tool to track any match; and it is further possible that comparing match play of amateurs, juniors and professionals in a common framework will produce useful insights for player development.

Disagreeable Data

This week I (finally) began adding additional data sources for ATP / WTA matches to the Tennis Analytics Integration Platform.  There are now more than 2,500 matches which can be visualized in TAVA, and that number is poised to grow rapidly.  Not only will the number of matches grow; the number of matches for which there are multiple "views" will also grow.  

In an ideal situation, two views of a single match would be entirely complementary, overlapping only for player names, tournament and venue names, and dates.  This does occur when point-progression data is pulled from betting sites (which tend to have score results and match odds but not traditional stats) and pre-calculated statistics are pulled from tournament sites (which never seem to have point-progression data).

But not all situations are ideal.

Merging data sources is always tricky, even when the data originates from the same domain and ostensibly covers the same conceptual territory. In addition to the challenges presented by the fact that various data sources have modified and extended their own data formats over the years, there is the "not insignificant" issue that a number of tennis statistics are generated from data that is very much subject to the interpretation of the individual doing the gathering.  (see articles by Carl Bialik and Jeff Sackmann on the topic of forced vs. unforced errors).
“I think if you have two or three different people recording unforced errors, you’re going to get two or three different figures,” said Kevin Fischer, senior communications manager for the Women’s Tennis Association.  - NYT
The matches from data sources I've added this week overlap significantly with the Match Charting Project.  This is both a headache (from a design and programming point-of-view) and an opportunity. One of the potential weaknesses of the Match Charting Project (eloquently articulated by Stephanie Kovalchik in her blog) is that match data is gathered by a single person working, often in real time, with a spreadsheet.  Errors can be introduced into the data that are hard to weed out. With a second and even third "view" of point-progression and final statistics, discrepancies can be automatically identified and flagged.  [An obvious addition to Tennis AiP is a Match Editor in which such discrepancies can be resolved - and indeed this is in the roadmap.]
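As a rough illustration of what automatic flagging could look like, here is a minimal Python sketch. It assumes that each "view" of a match has already been reduced to a dictionary of numeric statistics with shared names; the function name, the `tolerance` parameter and the sample numbers are hypothetical and do not describe how Tennis AiP is actually implemented.

```python
def flag_discrepancies(view_a, view_b, tolerance=0):
    """Compare the statistics two views share; return those that disagree.

    Each result is a (stat_name, value_in_a, value_in_b) tuple. Statistics
    present in only one view are ignored, since there is nothing to compare.
    """
    flags = []
    for stat in sorted(set(view_a) & set(view_b)):
        if abs(view_a[stat] - view_b[stat]) > tolerance:
            flags.append((stat, view_a[stat], view_b[stat]))
    return flags


# Made-up numbers for two views of the same match.
charted = {"aces": 5, "double_faults": 2, "winners": 20}
official = {"aces": 5, "double_faults": 3}
print(flag_discrepancies(charted, official))  # [('double_faults', 2, 3)]
```

A non-zero `tolerance` may be appropriate for judgment-based statistics such as unforced errors, where two observers are expected to differ somewhat.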

It's important to remember that even the "official" data for professional tennis matches is still being generated by teams of humans working "behind the scenes".  A 2014 article which appeared in The Guardian gives some perspective on the "State of the Art" of how tennis statistics are gathered still today...  A team of 48 "data entry people" are scattered about Wimbledon; most sit court side.  They have technology, yes, but, apart from serve speed, there is a human generating every bit of data. And not every court can be covered to the extent that full stats can be generated. This is apparent even from a review of stats available on IBM's Slamtracker - matches between lower ranked players which take place on peripheral courts often have little to no data available.

All of this leads to the thought that there are further opportunities for crowdsourcing tennis data and generating statistics and match views that could push beyond what is currently being done for the professional tours, even by top-flight corporations.

Machine Dreams and Silo Silliness

But what about Hawk Eye, you may ask?  Yes, impressive technology, generating huge amounts of data, all of it inaccessible to statisticians and aficionados.  It's not even clear that the coaches of top tennis professionals actually know what to do with it all, yet.  Damien Saunder has a number of articles looking at what kind of analysis Hawk Eye data makes possible.  At present, however, none of the official statistics available online appear to have been actually produced by systems such as Hawk Eye; it is used for adjudication and for the generation of graphics during televised performances, to "enhance viewer experience".  Certainly Hawk Eye can be viewed as "State of the Art" technology for raw data capture, but it is not clear that it represents the "State of the Art" for charting and statistical analysis.

It is also not clear to what extent a tennis match *can* be automatically charted, though companies such as Mojjo and PlaySight are now gathering impressive amounts of data and making it possible to deliver useful and usable systems which can be operated by club players.  Both companies are reducing the cost to end-users and, while not yet entirely portable, have made it possible to at least increase court coverage - the number of courts from which some portion of match data could potentially be accessed.  There are also companies, such as Tennis Analytics (high end), that offer services for post-match video processing from end-user video, but none appear to be as advanced as what Damien Saunder achieved using ArcGIS in this 2012/2013 effort (though Tennis Analytics is enabling broad and powerful exploration and filtering of match data - see the end of this article).

The big question, from my perspective, is whether these relatively new entries into Match Tracking (or Charting) will remain Data/Information Siloes and be as inaccessible to third parties as Hawk Eye. If it is not possible to integrate data from such sources then we are left in a situation where the "State of the Art" could only be advanced within the context of a single application environment (a single corporation), and, barring a scenario where one technical "solution" becomes ubiquitous across all venues where one might potentially play tennis, we could never achieve a Big Data view, even for a single player (though a single player could potentially spend hundreds of thousands of dollars to pay a single service to generate their own data set).

I also see the "Silo Mentality" in practically every tennis application that is available for Tablets and Smart Phones (apart from the sad fact that the vast majority of them are inscrutable and/or useless).   The primary focus of every new tennis tracking application or product offering (including racquet sensors, the new Babolat POP and Pulse Play) is to "build a community" and to cash in on "social marketing"; only a very few such products that I've reviewed (out of several dozen) have any sort of intentional data export capability, and fewer still export any data that is useful for pushing the "State of the Art" in tennis analysis forward or in fact adding meaningfully to what little conversation there is about how any of the data or analysis will contribute to better tennis (to player development).

The most potent and relevant counterexample to the situation I've described above is probably the Developer API offered by FitBit, a leader in the "Activity Tracker" market.  There is an increasing number of such gadgets that are allowing third parties, usually for a monthly subscription fee, to access and integrate data in real time.  Similar APIs are catalogued by ProgrammableWeb.  There is no reason why vendors of products and services related to tennis match data could not create similar offerings.  In the future, the Tennis Analytics Integration Platform will expose just such an API, and I have created a number of examples in the hope of inspiring collaborators with more data visualization experience than myself to join the effort.


There seems to be a general agreement that presenting more data is a good thing, and that surely the presentation and even visualization of data will be helpful, but I fear there is a real danger that it is all just a distraction and that the only ones who *may* benefit, for some period of time, are those selling the technology.  It certainly seems that IBM's and SAP's pursuit of Analytics for the ATP and WTA, respectively, is far more about marketing their brands and keeping the audience entertained than it is about actually providing meaningful insights. (Not that entertainment is a bad thing!) This is a critique that appears often, most recently in Nikita Taparia's very entertaining Tennis Note #24.

Ultimately the benefits to be found from analyzing data across a large number of tennis matches will be limited by the subset of common statistics and "views" that can be derived from whatever number of data sources are amenable to integration.  The goal of integration is to maximize the amount of data which can be usefully processed while minimizing the degree to which differences in the structure and "views" offered by each data source impact the scope of viable analysis.

I believe it would ultimately be to the benefit of all who are producing tennis-related applications as well as those who are working in the field of Sports Analytics if there were an Open Data approach to data generated from tennis matches. There have been, without a doubt, many passionate pleas for such an approach that preceded this screed. If you've made it this far, I'm surprised again!
