Tuesday, November 17, 2015

Rally Tree: Point Distribution and Win Percentage

Tennis is an "intermittent" sport. The level of intensity can vary greatly with the rally length of points and the time taken between points (among other factors including surface, ball type, sex and level of play). When rallies are visualized they are typically depicted temporally from the first point to the last, which gives a jagged chart where it is difficult to discern any pattern at all. "Rally Tree" is an attempt to bring a different perspective to the analysis of rallies.

"Rally Tree" depicts the distribution of points across various rally lengths, beginning at the top with rally lengths of Zero, which indicate either Aces, Serve Winners, or Double Faults. Color coding differentiates errors where balls were "netted" vs. hit long.
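As a rough illustration of the tally behind such a chart, here is a minimal Python sketch. The record layout and the result labels are assumptions made for the example; they are not the actual TAVA data format.

```python
from collections import Counter

# Hypothetical point records: (rally_length, result). The layout and the
# result labels are assumptions for illustration only.
points = [
    (0, "Ace"),
    (0, "Double Fault"),
    (1, "Winner"),
    (3, "Netted"),
    (3, "Long"),
    (5, "Winner"),
]


def rally_distribution(points):
    """Count points per rally length, split by how each point ended."""
    tree = {}
    for rally_length, result in points:
        tree.setdefault(rally_length, Counter())[result] += 1
    # Sort by rally length so the shortest rallies come first (top of the tree).
    return dict(sorted(tree.items()))


for length, results in rally_distribution(points).items():
    print(length, dict(results))
```

Each row of the tree would then be drawn with a width proportional to the total count for that rally length, with the per-result counts driving the color coding.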

There are several available views. The default view displays all points for a single match or selection of matches. You can filter by player to display only points served by either player (or composite of opponents). Notice the number of winners among the points won for servers vs. those receiving.

Additionally there is an overlay depicting the percentage chance that a point was won for any given rally length. The offset vertical lines represent 50% either side of center (0%).

For the "served points" views, this gives a graphic representation of the Persistence of Server Advantage, which varies greatly among players. Please note that this is not the same as percentage of points won for a given rally length.

The "Rally Tree" graphics in this post are of Novak Djokovic's matches at Wimbledon in 2014 and 2015. The last two images depict the persistence of Djokovic's server advantage on the left and a composite of his opponents' server advantage on the right. Djokovic's dominance is obvious. Apart from rallies of seven, he had a greater than 50% chance of winning all points with rallies up to sixteen. His opponents' composite server advantage only extended to rallies of five.

You can play around with a live version of Rally Tree and explore your favorite players at TennisVisuals.com

In the near future "Rally Tree" will be integrated with "Game Tree" and other TAVA components so that selections in one component can drive views in another. For instance, a "Point Progression" from 0-0 to 0-15 can be selected in "Game Tree" to view the distribution of points in the "Rally Tree" or "Points-to-Set". From this point it will be possible to explore whether there are certain points in a match when rally lengths increase...

To read more about "Persistence of Server Advantage" please follow the link to Jeff Sackmann's blog post on the topic.

Monday, October 26, 2015

State of the Art of the Stats

At the end of August, in a post discussing the exploration and filtering of shots from tennis matches, I referenced the fact that the majority of data generated by tennis matches hasn't been captured; venues where Hawk Eye is available are the exception, and even Hawk Eye doesn't capture every detail that could be useful in downstream analysis of matches.

Certainly today more and more tennis data *is* being captured, and the alternatives for capturing data are expanding steadily - when there is video available it is even possible to capture additional data from historical matches (indeed this is being done to some extent by volunteers for the Match Charting Project, and automated video processing tools are in development which could facilitate this process) - but if "progress" is to be made in understanding how the analysis of tennis data can contribute to the development of the game (which primarily means player development), at least two things need to happen:

tennis data has to be made accessible, and
tennis data has to be transformed into a standard format

Obvious, right?

As I've pursued my vision of a Tennis Analytics Integration Platform I've struggled with both of these issues at almost every turn. My professional career was entirely focused on inter-application, cross-enterprise and intra-enterprise data and process integration, so it is very familiar territory, and unsurprising. But I *was* surprised when I initially began to track my boys' tennis matches and found that the elegant iPhone/iPad application ProTracker Tennis allows for export of match data in an easily parseable format. And I was inspired, as many others have been, by the entirely open approach to tennis data and statistics taken by Jeff Sackmann, who makes his entire dataset available on GitHub. (Here is a great Guardian article about Sackmann's effort).

Still, the integration of data from only two sources requires careful consideration. As an example, ProTracker Tennis captures a good amount of information, but only a handful of shots out of every point. The Match Charting Project makes it possible to capture *many* attributes for *every* shot of every point, but doesn't attempt capture of shot coordinates. The former tracks "Forcing Errors" (but not the error that was forced) while the latter charts "Forced Errors". The "upshot" of such differences is that data must not only be coerced into a "standard" view, but also that it is often not possible to generate the same set of statistics (or visualizations) for matches from each source.

Admittedly, ProTracker Tennis and the Match Charting Project are, in general, targeting different audiences: ProTracker Tennis is mostly used to track amateur and junior matches while the Match Charting Project is focused on ATP and WTA matches. Nevertheless, it is possible to use either tool to track any match; and it is further possible that comparing match play of amateurs, juniors and professionals in a common framework will produce useful insights for player development.

Disagreeable Data

This week I (finally) began adding additional data sources for ATP / WTA matches to the Tennis Analytics Integration Platform. There are now more than 2,500 matches which can be visualized in TAVA, and that number is poised to grow rapidly. Not only will the number of matches grow; the number of matches for which there are multiple "views" will also grow.

In an ideal situation, two views of a single match would be entirely complementary, overlapping only for player names, tournament and venue names, and dates. This does occur when point-progression data is pulled from betting sites (which tend to have score results and match odds but not traditional stats) and pre-calculated statistics are pulled from a tournament site (which don't ever seem to have point-progression data).

But not all situations are ideal.

Merging data sources is always tricky, even when the data originates from the same domain and ostensibly covers the same conceptual territory. In addition to the challenges presented by the fact that various data sources have modified and extended their own data formats over the years, there is the "not insignificant" issue that a number of tennis statistics are generated from data that is very much subject to the interpretation of the individual doing the gathering. (see articles by Carl Bialik and Jeff Sackmann on the topic of forced vs. unforced errors).

“I think if you have two or three different people recording unforced errors, you’re going to get two or three different figures,” said Kevin Fischer, senior communications manager for the Women’s Tennis Association. - NYT

The matches from data sources I've added this week overlap significantly with the Match Charting Project. This is both a headache (from a design and programming point-of-view) and an opportunity. One of the potential weaknesses of the Match Charting Project (eloquently articulated by Stephanie Kovalchik in her blog On-the-T.com) is that match data is gathered by a single person working, often in real time, with a spreadsheet. Errors can be introduced into the data that are hard to weed out. With a second and even third "view" of point-progression and final statistics, discrepancies can be automatically identified and flagged. [An obvious addition to Tennis AiP is a Match Editor in which such discrepancies can be resolved - and indeed this is in the roadmap.]
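To make the idea of automatic flagging concrete, here is a minimal Python sketch. It assumes that each "view" of a game has already been reduced to a list of score strings; that representation, and the function name, are hypothetical and not taken from Tennis AiP.

```python
from itertools import zip_longest


def flag_discrepancies(view_a, view_b):
    """Compare two point-by-point score sequences for the same game.

    Each view is a list of score strings such as "15-0". Returns a list of
    (point_index, score_in_a, score_in_b) for every position that differs,
    including positions present in only one view (reported as None).
    """
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip_longest(view_a, view_b))
        if a != b
    ]


# Two hypothetical views of the same game that disagree on the second point.
charted = ["15-0", "30-0", "30-15", "40-15", "G"]
betting = ["15-0", "15-15", "30-15", "40-15", "G"]
print(flag_discrepancies(charted, betting))  # [(1, '30-0', '15-15')]
```

A comparison like this only says where two sources disagree, not which one is right; resolving the conflict would still fall to a person or to a third view.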

It's important to remember that even the "official" data for professional tennis matches is still being generated by teams of humans working "behind the scenes". A 2014 article which appeared in The Guardian gives some perspective on the "State of the Art" of how tennis statistics are gathered still today... A team of 48 "data entry people" are scattered about Wimbledon; most sit court side. They have technology, yes, but, apart from serve speed, there is a human generating every bit of data. And not every court can be covered to the extent that full stats can be generated. This is apparent even from a review of stats available on IBM's Slamtracker - matches between lower ranked players which take place on peripheral courts often have little to no data available.

All of this leads to the thought that there are further opportunities for crowdsourcing tennis data and generating statistics and match views that could push beyond what is currently being done for the professional tours, even by top-flight corporations.

Machine Dreams and Silo Silliness

But what about Hawk Eye, you may ask? Yes, impressive technology, generating huge amounts of data, all of it inaccessible to statisticians and aficionados. It's not even clear that the coaches of top tennis professionals actually know what to do with it all, yet. Damien Saunder at GameSetMap.com has a number of articles looking at what kind of analysis Hawk Eye data makes possible. At present, however, none of the official statistics available online appear to have been actually produced by systems such as Hawk Eye; it is used for adjudication and for the generation of graphics during televised performances, to "enhance viewer experience". Certainly Hawk Eye can be viewed as "State of the Art" technology for raw data capture, but it is not clear that it represents the "State of the Art" for charting and statistical analysis.

It is also not clear to what extent a tennis match *can* be automatically charted, though companies such as Mojjo and PlaySight are now gathering impressive amounts of data and making it possible to deliver useful and usable systems which can be operated by club players. Both companies are reducing the cost to end-users and, while not yet entirely portable, have made it possible to at least increase court coverage - the number of courts from which some portion of match data could potentially be accessed. There are also companies such as Tennis Analytics (high end) and Tennis-Stat.com that offer services for post-match video processing from end-user video, but none appear to be as advanced as what Damien Saunder achieved using ArcGIS in this 2012/2013 effort (though Tennis Analytics is enabling broad and powerful exploration and filtering of match data - see the end of this article).

The big question, from my perspective, is whether these relatively new entries into Match Tracking (or Charting) will remain Data/Information Silos and be as inaccessible to third parties as Hawk Eye. If it is not possible to integrate data from such sources then we are left in a situation where the "State of the Art" could only be advanced within the context of a single application environment (a single corporation), and, barring a scenario where one technical "solution" becomes ubiquitous across all venues where one might potentially play tennis, we could never achieve a Big Data view, even for a single player (though a single player could potentially spend hundreds of thousands of dollars to pay a single service to generate their own data set).

I also see the "Silo Mentality" in practically every tennis application that is available for Tablets and Smart Phones (apart from the sad fact that the vast majority of them are inscrutable and/or useless). The primary focus of every new tennis tracking application or product offering (including racquet sensors, the new Babolat POP and Pulse Play) is to "build a community" and to cash in on "social marketing"; only a very few such products that I've reviewed (out of several dozen) have any sort of intentional data export capability, and fewer still export any data that is useful for pushing the "State of the Art" in tennis analysis forward or in fact adding meaningfully to what little conversation there is about how any of the data or analysis will contribute to better tennis (to player development).

The most potent and relevant counterexample to the situation I've described above is probably the Developer API offered by Fitbit, a leader in the "Activity Tracker" market. There is an increasing number of such gadgets that are allowing third parties, usually for a monthly subscription fee, to access and integrate data in real time. Similar APIs are catalogued by ProgrammableWeb. There is no reason why vendors of products and services related to tennis match data could not create similar offerings. In the future, the Tennis Analytics Integration Platform will expose just such an API, and I have created a number of examples in the hope of inspiring collaborators with more data visualization experience than myself to join the effort.

Conclusion

There seems to be a general agreement that presenting more data is a good thing, and that surely the presentation and even visualization of data will be helpful, but I fear there is a real danger that it is all just a distraction and that the only ones who *may* benefit, for some period of time, are those selling the technology. It certainly seems that IBM's and SAP's pursuit of Analytics for the ATP and WTA, respectively, is far more about marketing their brands and keeping the audience entertained than it is about actually providing meaningful insights. (Not that entertainment is a bad thing!) This is a critique that appears often, most recently in Nikita Taparia's very entertaining Tennis Note #24.

Ultimately the benefits to be found from analyzing data across a large number of tennis matches will be limited by the subset of common statistics and "views" that can be derived from whatever number of data sources are amenable to integration. The goal of integration is to maximize the amount of data which can be usefully processed while minimizing the degree to which differences in the structure and "views" offered by each data source impact the scope of viable analysis.

I believe it would ultimately be to the benefit of all who are producing tennis-related applications as well as those who are working in the field of Sports Analytics if there were an Open Data approach to data generated from tennis matches. There have been, without a doubt, many passionate pleas for such an approach that preceded this screed. If you've made it this far, I'm surprised again!

Sunday, October 4, 2015

Game Tree: Point Progression

"Game Tree" is a depiction of Point Progression for a selection of games within a tennis match or across a series of tennis matches; it is a Sankey Diagram of a process that possesses the "Markov property", meaning that the set of possible future "states" is constrained only by the current "state", the point score at any moment in a game. Here is a nice interactive explanation.
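For readers who prefer code to diagrams, the constraint can be sketched in a few lines of Python. The state names and function names below are my own choices for illustration, not anything taken from the Game Tree implementation.

```python
# Calls for 0, 1, 2 and 3 points won within a game.
CALLS = ["0", "15", "30", "40"]


def name(s, r):
    """Name the state after the server has won s points and the receiver r."""
    if s == 3 and r == 3:
        return "Deuce"
    if s == 4:
        return "Game Server"
    if r == 4:
        return "Game Receiver"
    return f"{CALLS[s]}-{CALLS[r]}"


def next_states(state):
    """Return the states reachable from `state` after exactly one more point.

    The answer depends only on the current state, never on how the game
    arrived there; that is the Markov property.
    """
    if state in ("Game Server", "Game Receiver"):
        return []  # the game is over; nothing follows
    if state == "Deuce":
        return ["Ad Server", "Ad Receiver"]
    if state == "Ad Server":
        return ["Game Server", "Deuce"]
    if state == "Ad Receiver":
        return ["Deuce", "Game Receiver"]
    s, r = (CALLS.index(call) for call in state.split("-"))
    # First entry: server wins the point. Second entry: receiver wins it.
    return [name(s + 1, r), name(s, r + 1)]


print(next_states("0-0"))    # ['15-0', '0-15']
print(next_states("40-30"))  # ['Game Server', 'Deuce']
```

Every non-terminal state has exactly two successors, which is why the diagram branches in two at every node.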

"Markov Chains" have often been applied to tennis games. Google "Markov Tennis" and you'll find a large number of articles on statistics which use Tennis to explore probability. A few of the results use Data Flow Diagrams to depict Point Progression: Wolfram Alpha has an attractive visualization (see above) which was reproduced in an article on Predictive Modeling; and NC State University produced a YouTube video, as part of an online course titled "Introduction to Finite Math", with a whiteboard explanation of state transitions in tennis games. When visualizations are provided they are usually arranged like the GameFish in TennisVisuals.com, with either horizontal or vertical orientation:

As far as I can tell, the Game Tree design created by Damien Saunder and David Webb at GameSetMap.com is the first time a Sankey Diagram (or Harness Flow Map) was applied to Point Progression in Tennis. The primary innovation was to apply the idea of "quantitative flow lines" to the possible point paths through the tree such that the width of each line represents the frequency with which games passed through each possible "state" for the score, but the real power of the design comes from its interactive nature. SVG (Scalable Vector Graphics) is used to:

animate exploration of the data when it is filtered by selecting individual games, groups of games, or constraining the games to only service games for a chosen player
provide contextual information when "hovering" over specific elements
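The flow-line widths described above come down to counting state-to-state transitions. Here is a minimal Python sketch of that count. It assumes each game is available as a string of point winners ("S" for server, "R" for receiver), which is an encoding invented for this example rather than the format used by either implementation.

```python
from collections import Counter

CALLS = ["0", "15", "30", "40"]


def score_state(s, r):
    """Name the score after the server has won s points and the receiver r."""
    if s >= 3 and r >= 3:
        if s == r:
            return "Deuce"
        if abs(s - r) == 1:
            return "Ad Server" if s > r else "Ad Receiver"
    if max(s, r) >= 4 and abs(s - r) >= 2:
        return "Game Server" if s > r else "Game Receiver"
    return f"{CALLS[s]}-{CALLS[r]}"


def flow_widths(games):
    """Count how often each (state, next_state) transition occurred."""
    widths = Counter()
    for winners in games:
        s = r = 0
        state = score_state(s, r)
        for winner in winners:
            if winner == "S":
                s += 1
            else:
                r += 1
            nxt = score_state(s, r)
            widths[(state, nxt)] += 1
            state = nxt
    return widths


# Three hypothetical service games, all held by the server.
games = ["SSSS", "RSSSS", "SSRSS"]
widths = flow_widths(games)
print(widths[("0-0", "15-0")])  # 2
print(widths[("0-0", "0-15")])  # 1
```

Dividing a transition count by the total leaving its source state gives figures such as "the server won the first point in two of three games"; filtering the list of games before counting corresponds to selecting games in the interactive chart.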

The original implementation of Game Tree, presented as a celebration of Nadal's 2013 comeback, used match data downloaded in XML format from the William Hill Sports betting website.

In the TennisVisuals version of Game Tree, data is retrieved in JSON format from the Mongo database which underpins TennisVisuals.com. That data, in turn, is presently sourced from Jeff Sackmann's Match Charting Project (many other data sources will come online soon).

The inspiration for the Game Tree design seems to have been the same frustration that drove the development of the Points-to-Set chart: the final score of a tennis match reveals very little about how close a match actually may have been. Even a match with a 6-0, 6-0 score may have been "hotly contested". Traditional stats miss the story every time. Percentage of Points Won for a 6-0, 6-0 match, for instance, provides only a very crude view of match intensity - ranging from 100% for complete dominance by one player to 62.5% for a match in which every game reached Deuce once and only once - it relates nothing of the drama and is of very little use for constructive analysis.
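The two extremes quoted above are easy to check. In a 6-0, 6-0 match there are twelve games; the winner takes four points per game in the lopsided case, and five points to the loser's three when every game reaches Deuce exactly once. A small Python sketch of the arithmetic (the function name is just for this example):

```python
def points_won_pct(winner_points_per_game, loser_points_per_game, games=12):
    """Percentage of points won by the match winner, given identical games."""
    won = winner_points_per_game * games
    lost = loser_points_per_game * games
    return 100 * won / (won + lost)


print(points_won_pct(4, 0))  # 100.0 (every game won to love)
print(points_won_pct(5, 3))  # 62.5  (every game reached Deuce once)
```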

In the following Game Tree visualization of Nadal's service games in a match against Wawrinka at the 2013 Madrid Masters, it is easy to see that Nadal won the first point of his service games 77.8% of the time. When he did lose the first point of a service game, he won the second point 100% of the time.

With Game Tree it is possible to see how often Deuce was reached during a match; the thickness of flow lines even indicates how often game scores ricocheted between Deuce and Advantage. In the match with Wawrinka, for games both served and received, Nadal lost only one game that reached Deuce:

In the Saunder/Webb implementation of Game Tree, the "Nodes" of the tree are color-coded to indicate momentum. Dark nodes represent positive momentum while Red nodes represent negative momentum. In the TennisVisuals version of Game Tree these representations still hold true, but momentum is always viewed from the perspective of the primary player; when filtering for the opponent's service game, the tree is not "flipped", as occurs in the Saunder/Webb version of the Nadal-Djokovic Roland Garros 2014 final.

The relative importance of each point in tennis games, sets, and matches has been analyzed extensively, most famously by Carl Morris in his article "The most important points in tennis", which was published in Optimal Strategies in Sports in 1977. It is probably impossible to publish analysis of points in tennis without referencing Morris... here is one of many studies, notable for its visualization of relative point importance within the context of a set:

In 2014, Professors Franc Klaassen and Jan Magnus provided ample coverage of the topic in their book "Analyzing Wimbledon". Most recently Jeff Sackmann wrote a series of blog posts ("How Important is the First Point of Each Game?", "The Pivotal Point of 15-30") drawing on a theoretical model which he has published and utilizing his extensive match database.
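For a feel of what "importance" means here, the following Python sketch uses the definition usually attributed to Morris: the importance of a point is the probability that the server wins the game if they win the point, minus that probability if they lose it. It assumes every point is won by the server independently with the same probability p, which is a simplification of the textbook model and not a description of Sackmann's database-driven analysis.

```python
from functools import lru_cache


def game_win_prob(p, s=0, r=0):
    """Probability the server wins the game from s points to r points,
    assuming each point is won independently with probability p."""

    @lru_cache(maxsize=None)
    def win(s, r):
        if s >= 4 and s - r >= 2:
            return 1.0
        if r >= 4 and r - s >= 2:
            return 0.0
        if s >= 3 and s == r:
            # Closed form at Deuce: win two points in a row before losing two.
            return p * p / (p * p + (1 - p) * (1 - p))
        return p * win(s + 1, r) + (1 - p) * win(s, r + 1)

    return win(s, r)


def importance(p, s, r):
    """Swing in game-win probability riding on the next point."""
    return game_win_prob(p, s + 1, r) - game_win_prob(p, s, r + 1)


p = 0.6  # hypothetical chance of the server winning any given point
for label, (s, r) in {"0-0": (0, 0), "40-0": (3, 0), "30-40": (2, 3)}.items():
    print(label, round(importance(p, s, r), 3))
```

Under this model a point at 30-40 carries far more weight than one at 40-0, which matches the intuition that break points matter most.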

My plan is to integrate the insights garnered from such analyses into the TennisVisuals version of Game Tree so that results for each match can be viewed in the context of benchmark figures. I'd like to auto-generate summary reports to go along with Game Tree visualizations, similar to what Saunder has done for Nadal-Djokovic 2014 Roland Garros Final. I also plan to divide each point "flow line" into errors and winners and highlight "clutch" performance.

Shortly after releasing the initial version of Game Tree, Saunder published a follow-up entitled "Where are you most likely to win a point on Nadal's serve?" In this article he introduced a "Proportional Symbol Game Tree" which shows the percent chance at every possible "state" of the score that an opponent had of winning the point. It's enticing to think about using a similar Game Tree to visualize a player's service game performance in one match relative to their average over the course of the past year... perhaps overlapped Proportional Symbols of reduced opacity...

Friday, September 18, 2015

Makeover

Today TAVA moved to a new domain: TennisVisuals.com

There are now proper Instructions for using the latest version of the interface; Examples, which had previously only been noted on Twitter, now have their own page.

Most of the changes I've been working on haven't yet surfaced, but the project has graduated from a hobby hosted on my brother's minimally configured server, which I was capable of crashing, to a true application having a real, and expandable, home in the cloud.

I will be posting new visual elements to the Examples page before they are integrated into the application. The Examples page will also be a place to gather unique visualizations made by using the yet-to-be-published API, which will make a growing database of ATP and WTA matches generally available.

Personal matches charted using ProTracker Tennis will not be made public. If you use ProTracker Tennis, you can mail your matches to tennis.aip 'at' gmail.com. You will receive a link by email to the TAVA visualization of your match.

Monday, August 31, 2015

Shot Explorer: Parallel Sets

Tennis Matches generate a great deal of data, but the majority of data from Tennis Matches hasn't generally been captured. Even with Hawk Eye, at the elite level, what can be done with tennis data is still relatively uncharted territory. Witness the work being done by Damien Saunder at GameSetMap.com; cutting edge. In a very real sense we are still at the beginning of the era of the application of "knowledge discovery" tools to tennis data. As Jeff Sackmann says in his 2015 presentation at the MIT Sloan Sports Analytics Conference, "Tennis lags behind pretty much every other sport..." when it comes to what he calls "Actionable Analytics".

At the base of every tennis statistic there are Shots; many Shots. And every shot has a multitude of attributes that can be captured.

When I started my Tennis Analytics Integration Platform (AiP) project I was eager to try as many D3 visualizations as possible. Immediately after figuring out how to use the Sunburst chart to create a compact visualization of the Sets, Games, Points and Shots of a Tennis Match, I turned to the Sankey Diagram to try to build a visual filter for selecting Shots to display on a graphic representation of a Tennis Court. I was inspired by the Shot depiction capabilities of ProTracker Tennis. My goal was to build a tool that didn't require checkboxes and which enabled every shot to be seen at once.

Here is the result of my first attempt, using the D3 Sankey plugin and Sankey example created by Mike Bostock.

Sankey diagrams are typically used to represent "flows" within a system. A quantity of something is depicted as flowing or passing through a series of stages; at each stage there is a transformation and new categories emerge; as new categories emerge, quantities are divided between them; it is also possible for categories to emerge that recombine quantities.

When applied to the attributes of Shots within a Tennis Match, each attribute becomes a stage where a quantity of Shots is divided or combined; each attribute value becomes a category. In the Sankey diagram above, the attributes from left to right are "Stroke", "Stroke Type", "Trajectory", "Result" and "Endpoint". Hovering over a flow between any two attribute categories reveals the number of Shots within that flow; in other words, which Shots share those two attribute values.

I was quite excited by this early visualization, but it turned out that the re-combining of quantities made it impossible to follow a single Shot as it passed through each stage. In other words, at any one time I was only able to generate a collection of shots that shared the values of two attributes. What was really needed was a way to generate a collection of shots that had the same value for an arbitrary number of attributes. For instance, I wanted to be able to see all Second Serves that were "down the line" or "to the T" Service Winners, or all Backhand CrossCourt Drives that ended in the Net.

Parallel Sets ended up providing me with one possible answer. Parallel Sets were developed circa 2005 by Robert Kosara, Fabian Bendix and Helwig Hauser (see here and here) as a method of visualizing Categorical data. Parallel Sets "divide the flow path" at each stage/attribute; while flows do pass through subsequent stages together, they do not re-combine.
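The difference between the two layouts can be shown with a small Python sketch. The shot records below are invented, and only three of the dimensions are used; the point is how the counts are keyed, not the data.

```python
from collections import Counter

# Hypothetical shots as (Stroke, Stroke Type, Result) tuples.
shots = [
    ("Serve", "First Serve", "Ace"),
    ("Serve", "First Serve", "In"),
    ("Serve", "Second Serve", "In"),
    ("Forehand", "Drive", "In"),
    ("Backhand", "Drive", "In"),
    ("Backhand", "Drive", "Net"),
]


def sankey_links(shots):
    """Sankey-style links: counts keyed by adjacent pairs of values only."""
    links = Counter()
    for shot in shots:
        for i in range(len(shot) - 1):
            links[(i, shot[i], shot[i + 1])] += 1
    return links


def parallel_set_ribbons(shots):
    """Parallel Sets ribbons: counts keyed by the whole path of values so far."""
    ribbons = Counter()
    for shot in shots:
        for depth in range(1, len(shot) + 1):
            ribbons[shot[:depth]] += 1
    return ribbons


# The Sankey link Drive -> In mixes Forehands and Backhands together...
print(sankey_links(shots)[(1, "Drive", "In")])                    # 2
# ...while the Parallel Sets ribbon keeps them apart.
print(parallel_set_ribbons(shots)[("Backhand", "Drive", "In")])   # 1
```

Because each ribbon is keyed by its full path, selecting one ribbon selects a collection of shots that agree on every dimension up to that depth, which is exactly what the pairwise links could not provide.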

In the parlance of Parallel Sets, each Tennis Shot attribute becomes a "dimension" and each possible attribute value becomes a "category". The dimension "Stroke Type", for example, has the categories "Drive", "Slice", "Lob", "Drop Shot", "Smash", "First Serve" and "Second Serve". Now, admittedly, this looks like a mess of multi-colored spaghetti. Some of the categories have so few Shots in them that they are too narrow to read. Thankfully, each flow or "ribbon" is highlighted as the mouse hovers over it, and a helpful tooltip appears to list each attribute value which applies to the region of the ribbon, between two dimensions, where the mouse is hovering. In the screenshot above the mouse is hovering over a region between Stroke and Stroke Type where the categories are "Serve" and "First Serve". Seventy-one shots match these criteria, which is 79% of all Serves by one player during the match.

The Parallel Sets image above is from the 2015 Western & Southern Open Final between Serena Williams and Simona Halep. You can explore this match yourself here. (Click on either player's name to reveal the Parallel Sets diagram).

The seventy-one "First Serves" mentioned above are 79% of all Serves by Simona Halep, but this isn't a useful statistic. The real power of the Parallel Sets diagram can be seen by dragging dimensions vertically and categories horizontally to interactively explore the data. By reorganizing the dimensions, in this instance dragging "Stroke Type" to the top of the diagram, it is possible to see that of all "First Serves", 20% were "Serve Winners", 3% were "Aces" and 51% were "In", which totals 74% (due to rounding, the First Serve Percentage given in the Statistics is 73%).

Any number of other statistics can be derived using the above method, but this wasn't the original inspiration for making a Parallel Sets diagram part of TAVA. TAVA began with data from ProTracker Tennis and the Parallel Sets "Shot Explorer" initially enabled the visualization of shot placement on a graphic representation of a tennis court:

In the first court the selection is: "Serve" > "First Serve"; In the middle court: "Serve" > "First Serve" > "In" > "Cross Court" > "Ad Service Box"; In the last court: "Forehand" > "Drive".

ProTracker Tennis captures coordinate data for first and second serves, the return of serve, and the final "Key Shot" which ends a point, making court visualization possible. When I added support for matches captured by the Match Charting Project (MCP) I initially questioned the value of the Parallel Sets visualization; MCP data includes a great deal more shot detail, but it doesn't include shot coordinates, and only a rough estimation of shot placement can be derived (more about this in a future post). But recently I was inspired to connect the Parallel Sets "Shot Explorer" visualization with the Points-to-Set graphic via "Point Highlighting". The result is that when a collection of shots is selected in the "Shot Explorer" it is possible to view when during a match those shots occurred.

In this instance I selected the only two double faults made by Serena during the 2015 Western & Southern Open Final. I discovered that they were made during the same Game during the Second Set, and that Serena won the game anyway.

I have a few more ideas about how the value of the "Shot Explorer" can be enhanced in the future, particularly once a few more visualizations and control structures are added to TAVA.

In Summary, Parallel Sets are a useful frequency-based representation of data. Tennis Matches generate a large, complex data set, and most of it is not amenable to time-series analysis. Using a frequency-based visualization in combination with time-series views makes it possible to create collections of data elements (Shots) and visualize their distribution throughout a Tennis Match.

Wednesday, August 19, 2015

Match Radar

The graphic above is a “Match Radar” chart of the 2007 Wimbledon Final between Roger Federer and Rafael Nadal [TAVA link].

The Match Radar is intended to provide a compact visual comparison of the key statistics for players of a tennis match. This is in contrast to the Points-to-Set, Horizon and Radial Horizon charts which aim to depict the dynamics of a match with respect to the scoring of each set.

The Match Radar enables a quick assessment of whether and how one player dominated another, whether a match was lopsided, and where players differed on key statistics. It is not intended as a tool for in-depth analysis.

In the match shown above you can see that Federer (blue) and Nadal (purple) were very close in terms of 1st and 2nd serve statistics, with Federer having only a very slightly higher 1st serve percentage and percentage of 2nd serve points won. Similarly, both players were very close on percentage of return points won for both 1st and 2nd serves, with Federer again having only slightly better numbers. Where Federer really stood out was in Aces, Serve Winners, Percentage of Returns-in-play and Forcing Errors. Nadal had more outright winners and more breakpoints, but he failed to convert on enough of the breakpoints to win the match; there was only a difference of seven points at the end of the match.

In the current version of TAVA, the Match Radar appears as both a dashboard icon, with no legend, and a full-size chart; in both cases the graphic is interactive. Values appear in a “tooltip” when the mouse hovers over any point on the chart.

Here is a Match Radar for the 2013 US Open Final between Victoria Azarenka (blue) and Serena Williams (purple); Williams won 7-5, 6-7, 6-1. [TAVA link].

I made a number of changes to the Radar Chart examples found in the various D3 galleries (here and here). You will find a recent D3 Example here. The most notable addition I made to the Radar Chart is support for diverse types of axes. This addition was inspired by “Parallel Coordinates” charts, which you can read about here and here. You can find examples of Parallel Coordinates charts in the current version of TAVA. I haven't spent a great deal of time trying to optimize their use, but they do seem to be unwieldy; at a size where the labels could be read and the various matches being charted could be discerned I found it necessary to make the graphic horizontally scrollable.

In the Match Radar chart, the majority of the statistics are given as percentages, but there are some statistics (aces, serve winners, winners, forcing errors, and breakpoints) which are given as “extents” where the axis ranges from zero to the maximum value achieved by either player.

I have also modified the standard Radar Chart to support inverted axes, where the high value appears at the center of the Radar with the low value on the outer edge. This can be used to depict Unforced Errors or Double Faults, where the low value is deemed “better” and should enlarge the player's color area of the radar, rather than pull it toward the center. Additionally, the Match Radar supports “bounded extents” where the extent values can be set arbitrarily. This is appropriate when displaying Aggressive Margins, for instance, when values can range either side of zero.
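The axis types described above can be summarized in a short Python sketch. This is not the TAVA code (which is built on D3); the function names, the "kind" labels and the sample numbers are assumptions made for the example.

```python
def axis_range(kind, values=(), bounds=None):
    """Return the (low, high) range for one radar axis."""
    if kind == "percentage":
        return 0.0, 100.0
    if kind == "extent":
        # Zero to the maximum value achieved by either player.
        return 0.0, float(max(values))
    if kind == "bounded":
        # Arbitrary limits, for statistics that can fall either side of zero.
        return float(bounds[0]), float(bounds[1])
    raise ValueError(f"unknown axis kind: {kind}")


def radius(value, lo, hi, inverted=False):
    """Map a statistic to a distance from the centre, between 0 and 1."""
    if hi == lo:
        return 0.0
    t = (value - lo) / (hi - lo)
    t = min(1.0, max(0.0, t))  # clamp values that fall outside the range
    # On an inverted axis the low value sits on the outer edge.
    return 1.0 - t if inverted else t


# Aces as an extent: the player with the most aces reaches the outer edge.
print(radius(13, *axis_range("extent", values=[13, 5])))      # 1.0
# Unforced errors on an inverted extent: fewer errors, larger area.
print(radius(10, *axis_range("extent", values=[10, 30]), inverted=True))
# Aggressive Margin on a bounded extent from -20 to 20.
print(radius(5, *axis_range("bounded", bounds=(-20, 20))))    # 0.625
```

Normalizing every axis to the same 0 to 1 radius is what lets percentages, counts and signed margins share one chart.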

In a future version of TAVA I plan to make the Match Radar “dynamic” such that it can support the real-time display of a selection of points (“brushing” a range of points on the Horizon Chart, for instance); this capability would also make it possible to “Play” the match from the beginning and watch the changes in the shapes of each player's radar as the match progresses.

I also plan to enable users to configure their own views, selecting which statistics are most relevant for their purposes, and in which order they should appear. I haven't yet decided which statistics should appear as the default, and which order makes the most sense.

This is a selection of matches played by Novak Djokovic (blue). Once you are familiar with the layout of the axes on the Match Radar, you can begin to compare matches and to look for patterns. You might want to look for matches that appear very unbalanced, or very close, for instance.

The match below is the 2013 US Open Semifinal between Novak Djokovic and Stan Wawrinka. [TAVA link]. You can see an iconic representation of this match in the bottom row above (2nd from right).

The right side of the radar is dedicated to service statistics, while the left ranges from Returns-in-Play and Return Points Won (at the bottom) to Winners, Forcing Errors and Breakpoints (at the top). Djokovic won this match 2-6, 7-6, 3-6, 6-3, 6-4, so it was indeed close.

The Match Radar can also be used to quickly look for changes in key statistics across sets:

This may give an idea of how the “brushing” will work: dragging across a range of points in the horizon chart (below the Points-to-Set chart), would dynamically update the Match Radar to reflect the statistics for the selected range of points. A number of coaches have asked to be able to identify those moments in a match when a specific statistic changed dramatically... I'm not sure how that will surface, as yet, but these “discovery” tools may aid in the development of new ideas.
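As a rough sketch of what brushing would have to compute, the Python below recalculates a few statistics for an arbitrary range of points. Everything in it is an assumption made for illustration: the point records, their field names (`server`, `winner`, `ace`, `first_serve_in`) and the `range_stats` function are hypothetical and do not reflect how TAVA stores or processes points.

```python
def range_stats(points, start, end, player):
    """Statistics for `player` over points[start:end] (end is exclusive)."""
    selected = points[start:end]
    served = [p for p in selected if p["server"] == player]
    first_in = [p for p in served if p["first_serve_in"]]
    return {
        "points_played": len(selected),
        "points_won": sum(p["winner"] == player for p in selected),
        "aces": sum(p["ace"] for p in served),
        # None when the player served no points in the selected range.
        "first_serve_pct": (100.0 * len(first_in) / len(served)
                            if served else None),
    }


# A made-up four-point sequence between players "A" and "B".
sample_points = [
    {"server": "A", "winner": "A", "ace": True,  "first_serve_in": True},
    {"server": "A", "winner": "B", "ace": False, "first_serve_in": False},
    {"server": "B", "winner": "B", "ace": False, "first_serve_in": True},
    {"server": "B", "winner": "A", "ace": False, "first_serve_in": True},
]

# "Brushing" the first three points, then the last two.
print(range_stats(sample_points, 0, 3, "A"))
print(range_stats(sample_points, 2, 4, "A"))
```

Calling such a function once per player for the brushed range, or repeatedly for a growing range starting at the first point, would supply the values needed to redraw the radar for a selection or to "Play" a match.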

Going forward I also want to use the selection of statistics chosen for the Match Radar to drive views of a player's statistics across a series of matches, within a tournament, or within an arbitrary date range. These views would not employ the Match Radar.

Radar Charts have often been critiqued, along with other polar coordinate charts (see here and here). I understand these critiques and, for the most part, agree with them; nevertheless, I find the iconic version of the Radar Chart gives a good "at-a-glance" gut feel for how players differed on key statistics, particularly after spending time looking at a large number of matches and becoming familiar with how to interpret the layout. The Match Radar is far more compact than the alternatives when quickly comparing matches or sets and can serve as a control structure to drive other, more in-depth, analysis tools and visualizations.

I think a significant difference in the application of the Match Radar Chart in the TAVA application is that it is a standard visualization that can be used across potentially thousands of matches (I now have approximately 3000 matches in the Tennis AiP database), as opposed to a one-off article by a journalist or researcher where a novel analysis is being presented.

I'm on the lookout for compact alternatives to the Match Radar Chart, but I believe it will persist in future versions of TAVA. It is likely the Match Radar will become more of a control structure that drives a variety of complementary "drill-down" visualizations.

Thursday, August 13, 2015

Radial Horizon Graphs: An Original?!?

The logical conclusion to my exploration of Horizon and Corona Graphs. I've searched and can find no other examples on the "internets" of a radial Horizon Graph, so perhaps this is a first!

For your viewing pleasure, here are ten matches from Grand Slam Finals. I hope the Radial Horizon Graphs have captured something of the dynamics of the matches that you wouldn't otherwise see when glancing at the score. I've provided a brief (unprofessional) "reading" of each graph, making the (perhaps unwarranted) assumption that there are more winners and forced errors being made than unforced errors. In the next version of this graphic, or in the interactive version for TAVA, changes in momentum due to winners and losers should be easily discerned.

LEFT: Apart from the first few games of the first set, Kvitova dominated Bouchard.
RIGHT: Hingis was dominating until near the end of the 2nd set; Capriati had a strong finish.

LEFT: Navratilova and Evert were neck-and-neck in the first two sets; Martina dominated in the beginning of the 3rd, but Evert clawed her way back and took the lead at the very end, then lost.
RIGHT: Muguruza took an early lead but lost momentum; she began to recover near the end of the 2nd set, but it was too late.

LEFT: Safina was stronger at the beginning of both sets.
RIGHT: Sharapova was strong at the beginning of the 1st set; she lost ground steadily in the 2nd.

LEFT: Na Li almost gave up her early lead in the 1st set; she never looked back in the 2nd.
RIGHT: Notice that all three sets ended at 6-3; there are differences, but Cilic steadily advanced.

LEFT: Federer began and ended in control; Roddick led throughout the 2nd set; the 3rd set was a bit of a seesaw, but Federer held; the 4th was definitive.
RIGHT: Wozniacki had a chance to break in the 1st game of the 1st set; after that point she couldn't win points fast enough to keep up with Serena.

Wednesday, August 12, 2015

Points-to-Set: Horizon Corona

I've been searching for a representation of a tennis match that captures the dynamics of play yet remains simple enough and compact enough to use as either an icon or a control structure suitable for selecting a range of points within a match. I also wanted a graphic that could be used to quickly compare a series of matches, with enough detail to easily differentiate a 6-0, 6-0 win that was a "cakewalk" from a 6-0, 6-0 win where every game went to deuce and beyond.

The Corona/Horizon Graphs above are the result of my early attempts to use Points-to-Set data in a new way, charting the difference between the two players' Points-to-Set numbers rather than the absolute values.

Corona Graphs

Corona graphs are actually formally known as radial area graphs; there are also examples of radial histograms which I would describe as "Corona Graphs". These graphs share a lot in common with Polar Coordinate Graphs (such as TAVA's Radar Chart), but they look like the Corona that surrounds our Sun. I haven't seen the name used in the Visualization community as yet, but it is fitting, especially considering the formal definition of a Coronagraph: "A coronagraph is a telescope that can see things very close to the Sun. It uses a disk to block the Sun's bright surface, revealing the faint solar corona, stars, planets and sungrazing comets. In other words, a coronagraph produces an artificial solar eclipse". So, with Corona Graphs I hope to highlight important aspects of a match which normally are obscured by the quantity of data available within the match.

[Update: The term "Corona Charts" (here and here) is used in the financial community. But it is not a radial structure and doesn't resemble the graphs above.]

Horizon Graphs / Charts

Horizon graphs are a type of Time-Series graph which were developed relatively recently by Panopticon Software (now known as DataWatch). Here is a paper describing the development of the graph, and here is an in-depth analysis of the Horizon Graph by Stephen Few of Perceptual Edge, a "Visual Business Intelligence" company.

Horizon Graphs excel at displaying a large number of time series at one time. They are described as a tool for rapidly scanning huge amounts of data to quickly identify "points of concern"; they "preserve data density while preserving resolution." A Tennis Match can certainly be thought of as a time series, a progression of points through time. Horizon Graphs seem ideally suited for comparing matches, but it turns out they are also useful for comparing Sets within matches, and for identifying critical moments during play.

When I began this project I was overwhelmed by the variety of chart examples available. I wanted to try them all, but it wasn't immediately obvious how each type of chart could be meaningfully applied. It wasn't until I generated my first Corona graphs with Points-to-Set data that I realized how I could use Horizon Graphs, and how useful they could be.

Here is the progression from my first Match Corona visualization to my first Match Horizon:

In the first Corona graph, on the left, the difference in Points-to-Set values varies from positive to negative. For the second Corona graph I simply flipped the negative values and changed the color to represent the second player. Below you can see the same data values in a standard horizon graph.
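The two steps just described (taking the difference, then flipping the negative values) can be sketched in a few lines of Python. The function names are hypothetical, the Points-to-Set numbers are made up for illustration, and the sign convention (positive when the first player needs fewer points than the second) is an assumption.

```python
def pts_difference(pts_a, pts_b):
    """Point-by-point difference between two players' Points-to-Set numbers.

    Assumed convention: positive when player A needs fewer points than
    player B to win the set, i.e. A is ahead.
    """
    return [b - a for a, b in zip(pts_a, pts_b)]


def split_by_player(diff):
    """Flip negative values so that each player gets a non-negative series."""
    a_series = [max(d, 0) for d in diff]
    b_series = [max(-d, 0) for d in diff]
    return a_series, b_series


# Made-up Points-to-Set values after each of five points.
pts_a = [24, 23, 23, 23, 22]
pts_b = [24, 24, 23, 22, 22]

diff = pts_difference(pts_a, pts_b)
a_series, b_series = split_by_player(diff)
print(diff)
print(a_series)
print(b_series)
```

Each of the two resulting series can then be drawn in its own player color, since at any point at most one of them is non-zero.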

The horizon graph is then cut into bands and layered. The peaks are still visible and no space "under the curves" is wasted. Color gradations indicate distance from the baseline so that the greater values become darker.
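The banding step can be illustrated with a small Python sketch. It is not the code behind these graphics: `horizon_bands` is a hypothetical helper, and it assumes the series has already been made non-negative (one series per player).

```python
def horizon_bands(values, n_bands, band_height=None):
    """Slice a non-negative series into `n_bands` stacked bands.

    Band k holds the portion of each value that falls between k * h and
    (k + 1) * h, clipped to the range 0..h. Drawing the bands on top of
    one another, each darker than the last, produces the layered look.
    """
    peak = max(values, default=0)
    if band_height is None:
        # Default: divide the peak value evenly among the bands.
        band_height = peak / n_bands if peak else 1
    h = band_height
    return [[min(max(v - k * h, 0), h) for v in values]
            for k in range(n_bands)]


# A made-up series cut into three bands of height 3.
series = [0, 2, 5, 9, 4]
for band in horizon_bands(series, n_bands=3, band_height=3):
    print(band)
```

Summing the bands point by point recovers the original series, which is why no information is lost even though the chart needs only a fraction of the original height.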

With this realization it became possible to compare sets and matches with a very compact visual.

When you see a Horizon Graph for the first time you might find it to be somewhat confusing. But with a bit of study and experience I think you'll find them very valuable. Read the links above or this in-depth overview by a team at Berkeley: "Sizing the Horizon: The Effects of Chart Size and Layering on the Graphical Perception of Time Series Visualizations".

Set Comparison

Here are the sets from the 2001 R16 match at Wimbledon between Pete Sampras and Roger Federer. Federer won the match 7-6, 5-7, 6-4, 6-7, 7-5. Federer is in blue; Sampras is in green.

You can see the winner of each set by the final color of each graph. The depth of color at any given moment indicates the distance between the two Points-to-Set numbers: darker colors indicate a greater point difference. Turning the graphic into a control structure will enable point and game selection as well as "brushing" to select a range of points in a game. For the next version of TAVA I will add ticks and marks to optionally indicate breakpoints, aces, winners, errors, etc. I'll save the use of Horizon and Corona graphs as control structures for a future post.

To illustrate the ability of the Horizon Graph to enable rapid differentiation of sets which have the same score in games but which vary widely in the intensity of play and the distribution of points, here are Horizon Graphs for three sets which each finished at 6-0:

In the first example one player dominated completely, winning all points. In the second example, which is taken from the 2012 Olympics final between Serena Williams and Maria Sharapova, Serena gave up 12 points to Sharapova and needed 28 points to close out the set. In the third example every game of the set went to deuce and most games were at deuce more than once. Seventy-one points were played in the final example versus only twenty-four in the first example and forty in the second.

Match Comparisons

The screen real-estate provided by Blogger makes these a bit too compact, but I hope this gives some idea of the expressiveness of Horizon Graphs. You can click on each graph to see the full size image:

And finally, here is a link to a video about Interactive Horizon Graphs. This is a bit orthogonal to my intent to use Horizon Graphs as control structures, but it is interesting nevertheless and may provide some inspiration for a way to compare very large numbers of matches in the future. I'm discovering that there are many attributes of matches other than Points-to-Set which may be usefully visualized with Horizon Graphs.

Acknowledgements

I want to recognize again the work of Francis X. Diebold, Glenn Rudebusch and Professor Diebold's students at the University of Pennsylvania. As far as I and they can tell, their work on the concept of Points-To-Set is completely original.