Raffay Hamid



CVPR 2010 Paper: Player Localization Using Multiple Static Cameras for Sports Visualization

Authors: Raffay Hamid, Ram Krishan Kumar, Matthias Grundmann, Kihwan Kim, Irfan Essa, Jessica Hodgins



NEWS!: We recently (Aug. 2010) collected a new soccer data set at ESPN Wide World of Sports. We used scissor lifts with adjustable heights to mount 3 synchronized 720p HD cameras covering one half of the field. We collected two games with the cameras at a height of 60 feet; one of these games was recorded at night under floodlights. We captured the third game with the cameras at a height of about 25 feet. We tested our framework on 22,000 frames of this data (in addition to the results on 60,000 frames of PIXAR soccer data given in our CVPR paper). Results on the new data can be found in our journal paper (in preparation).


1. Introduction:

Visualizing multi-player sports has grown into a multimillion dollar industry. However, inferring the state of a multi-player game is still an open challenge. This is especially true when the context of the game changes in a dynamic and continuous manner. Examples of such sports include soccer, field hockey, and basketball. Our work is geared towards automatic visualization of this particular subset of sports.

One of the main technical challenges for sports visualization systems is to infer accurate player positions in the face of occlusion and visual clutter. One solution is to use multiple overlapping cameras, provided the observations from these cameras can be fused reliably. Our work explores this question of efficient and robust fusion of visual data observed from multiple synchronized cameras, and applies the fused information to generate sports visualizations. These include displaying a virtual offside line in soccer games, highlighting players in passive offside positions, and showing players' motion patterns.

Our key contribution is to model and analyze the problem of fusing corresponding players' positional information as finding minimum weight K-length cycles in complete K-partite graphs. The algorithm-class we propose to this end uses a dynamic programming based approach that varies over a continuum from maximally to minimally greedy in terms of the number of paths explored at each iteration. We use our proposed algorithm-class in an end-to-end sports visualization framework, and demonstrate its robustness by presenting results over 60,000 frames of real soccer footage captured over five different illumination conditions, play types, and team attire.


2. Framework Overview:

The main steps of our sports visualization framework are as follows:


2a. Background Subtraction

We begin by adaptively learning per-pixel Gaussian mixture models for the scene background. These models are used for foreground extraction by thresholding the appearance likelihoods of scene pixels. The input and output of this step are given in the following figure. Note that while this step extracts player pixels quite successfully, it also extracts shadow pixels as part of the foreground. Such shadow pixels can be problematic for player tracking, and therefore need to be removed.
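As a rough illustration of the thresholding idea, the sketch below keeps a single Gaussian per pixel instead of a full mixture, which is a deliberate simplification. The function names, the threshold `k`, the learning rate `lr`, and the toy frame are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def foreground_mask(frame, mean, var, k=2.5):
    """Mark pixels lying more than k standard deviations from the background mean.

    frame, mean, var: float arrays of shape (H, W); grayscale for simplicity.
    """
    std = np.sqrt(np.maximum(var, 1e-6))
    return np.abs(frame - mean) > k * std

def update_background(frame, mean, var, mask, lr=0.05):
    """Adapt the per-pixel model only where the pixel was labeled background."""
    background = ~mask
    diff = frame - mean
    new_mean = np.where(background, mean + lr * diff, mean)
    new_var = np.where(background, (1 - lr) * var + lr * diff ** 2, var)
    return new_mean, new_var

# Toy example: a flat background with one bright "player" pixel.
mean = np.full((4, 4), 100.0)
var = np.full((4, 4), 4.0)
frame = mean.copy()
frame[1, 2] = 180.0

mask = foreground_mask(frame, mean, var)
mean, var = update_background(frame, mean, var, mask)
```

A full mixture model would keep several weighted Gaussians per pixel and test the pixel against the dominant ones; the thresholding step itself has the same shape.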





2b. Shadow Removal

While there are numerous appearance based methods for shadow removal [21], they mostly work best for relatively soft shadows. In soccer games, however, shadows can be quite strong. We therefore rely on the geometric constraints of our multi-camera setup for robust shadow removal.

Consider the following figure, where only the shadow pixels of the player are view independent. This enables us to remove shadows by warping the extracted foreground in one view onto another, and filtering out the overlapping pixels. We begin by finding 3×3 planar homographies between each pair of views, such that for any point in one view, we know a distinct mapping for it in the second view.

In cases where a player is partially occluded by a shadow, simply relying on these geometric constraints might result in losing image regions belonging to the occluded parts of players. To avoid this, we apply chromatic similarity constraints on the original and projected pixels before classifying them as shadow versus non-shadow. The intuition here is that the appearance similarity of shadow pixels across multiple views is higher than that of non-shadow pixels.
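The geometric and chromatic tests described above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names, the nearest-pixel rounding, the Euclidean color difference, and the tolerance `color_tol` are all choices made for this example, not the paper's implementation.

```python
import numpy as np

def warp_points(H, pts):
    """Apply a 3x3 planar homography H to an (N, 2) array of (x, y) points."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

def shadow_mask(fg_a, fg_b, img_a, img_b, H_ab, color_tol=30.0):
    """Flag foreground pixels of view A that land on similarly colored foreground in view B."""
    h, w = fg_b.shape
    ys, xs = np.nonzero(fg_a)
    mask = np.zeros_like(fg_a, dtype=bool)
    if len(xs) == 0:
        return mask
    pts = np.column_stack([xs, ys]).astype(float)
    proj = np.rint(warp_points(H_ab, pts)).astype(int)
    px, py = proj[:, 0], proj[:, 1]
    inside = (px >= 0) & (px < w) & (py >= 0) & (py < h)
    xs, ys, px, py = xs[inside], ys[inside], px[inside], py[inside]
    overlap = fg_b[py, px]  # geometric test: foreground in both views
    color_diff = np.linalg.norm(
        img_a[ys, xs].astype(float) - img_b[py, px].astype(float), axis=-1
    )  # chromatic test: similar appearance across views
    is_shadow = overlap & (color_diff < color_tol)
    mask[ys[is_shadow], xs[is_shadow]] = True
    return mask

# Toy example: view B is view A shifted one pixel to the right.
H_ab = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
fg_a = np.zeros((4, 4), dtype=bool)
fg_b = np.zeros((4, 4), dtype=bool)
fg_a[2, 1] = fg_a[0, 0] = True
fg_b[2, 2] = fg_b[0, 1] = True
img_a = np.zeros((4, 4, 3))
img_b = np.zeros((4, 4, 3))
img_a[2, 1] = img_b[2, 2] = (40, 60, 40)  # shadow: looks the same in both views
img_a[0, 0] = (200, 0, 0)                 # overlapping pixel with a different
img_b[0, 1] = (0, 0, 200)                 # appearance in each view
shadows = shadow_mask(fg_a, fg_b, img_a, img_b, H_ab)
```

In the toy data, the pixel that overlaps geometrically but differs in color is kept as non-shadow, which is the role of the chromatic constraint.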





The input and output of the shadow removal step are shown in the following figure. Notice that some parts of the player are also removed along with the shadows. By and large, however, this method of shadow removal performs quite well.




2c. Player Tracking

We track the player blobs using a particle filter based framework. We represent the state of each player using a multi-modal distribution, which is sampled by a set of particles. To propagate the previous particle set to the next, we perform the three-step procedure of Selection, Prediction and Measurement. Selection is the importance sampling of a set of particles from the previous step based on how well they fit the measurement for the last frame. Prediction is the application of a dynamic model to the selected particles. Finally, Measurement ranks the particles in terms of how well they match the measurement from the current frame. These three steps are repeated for each frame in the video. This entire process is illustrated in the following figure.
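The Selection, Prediction and Measurement loop can be sketched for a single player as below. The random-walk dynamic model, the Gaussian measurement likelihood, the noise levels, and the function name are assumptions chosen for illustration; the text above does not specify the models actually used.

```python
import numpy as np

def particle_filter_step(particles, weights, measurement, rng,
                         motion_std=1.0, meas_std=2.0):
    """One Selection / Prediction / Measurement step for 2-D position particles."""
    n = len(particles)
    # Selection: resample particles in proportion to their previous weights.
    idx = rng.choice(n, size=n, p=weights)
    selected = particles[idx]
    # Prediction: apply a dynamic model (here an assumed random walk).
    predicted = selected + rng.normal(0.0, motion_std, size=selected.shape)
    # Measurement: weight particles by agreement with the current observation.
    d2 = np.sum((predicted - measurement) ** 2, axis=1)
    new_weights = np.exp(-0.5 * d2 / meas_std ** 2) + 1e-300  # avoid all-zero weights
    new_weights /= new_weights.sum()
    return predicted, new_weights

# Toy example: a blob drifting slowly across the image.
rng = np.random.default_rng(0)
particles = rng.normal(0.0, 5.0, size=(500, 2))
weights = np.full(500, 1.0 / 500)
for t in range(1, 21):
    measurement = np.array([0.2 * t, 0.1 * t])
    particles, weights = particle_filter_step(particles, weights, measurement, rng)
estimate = weights @ particles  # weighted mean of the particle set
```

Keeping the whole weighted particle set, rather than a single estimate, is what lets the state distribution stay multi-modal when a blob is ambiguous.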



2d. View Dependent Blob Classification

We classify the tracked blobs on a per-frame and per-view basis. We pre-compute the hue and saturation histograms of a few (~5) player-templates of both teams as observed from each view. During testing, we compute the hue and saturation histograms for the detected blobs, and find their Bhattacharyya distances from the player-templates of the corresponding view. We classify each blob into the offense or defense team based on the label of its nearest neighbor template. The pipeline of blob-classification for one particular view is shown in the following figure.
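A minimal sketch of this nearest-template classification is given below. It assumes a joint hue-saturation histogram with 8 bins per channel, hue and saturation scaled to [0, 1], and the sqrt(1 - BC) form of the Bhattacharyya distance; the bin count, the scaling, the function names, and the synthetic pixels are all assumptions for illustration.

```python
import numpy as np

def hs_histogram(hs_pixels, bins=8):
    """Joint hue-saturation histogram of an (N, 2) array, normalized to sum to 1."""
    hist, _, _ = np.histogram2d(
        hs_pixels[:, 0], hs_pixels[:, 1], bins=bins, range=[[0, 1], [0, 1]]
    )
    return hist.ravel() / max(hist.sum(), 1.0)

def bhattacharyya_distance(p, q):
    """Distance in the sqrt(1 - BC) form, where BC is the Bhattacharyya coefficient."""
    bc = np.sum(np.sqrt(p * q))
    return float(np.sqrt(max(0.0, 1.0 - bc)))

def classify_blob(blob_hist, templates):
    """Return the label of the nearest template; templates is a list of (label, hist)."""
    return min(templates, key=lambda t: bhattacharyya_distance(blob_hist, t[1]))[0]

def synthetic_pixels(rng, hue, n=200):
    """Hypothetical helper: pixels clustered around a given hue with high saturation."""
    h = np.clip(rng.normal(hue, 0.02, n), 0.0, 0.999)
    s = np.clip(rng.normal(0.8, 0.05, n), 0.0, 0.999)
    return np.column_stack([h, s])

rng = np.random.default_rng(1)
templates = [
    ("offense", hs_histogram(synthetic_pixels(rng, 0.05))),  # reddish kit
    ("defense", hs_histogram(synthetic_pixels(rng, 0.60))),  # bluish kit
]
blob = hs_histogram(synthetic_pixels(rng, 0.07))
label = classify_blob(blob, templates)
```

With a few templates per team and per view, the same function applies unchanged: the template list simply contains several entries with the same label.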



The output of the tracking and player classification on an example frame is shown in the following figure.




2e. Data Fusion for Player Classification

To transform players’ locations observed from multiple cameras into a shared space, we project the base-point of all blobs observed from each camera into real-world coordinates of the field. We pose the fusion of location evidence of players observed from multiple cameras as iteratively finding minimum weight K-length cycles in a complete K-partite graph. Nodes in each partition of this graph represent blobs of detected players in different cameras. The edge-weights in this graph are a function of the pair-wise similarity between blobs observed in camera-pairs, and their corresponding ground plane distances. Correspondence between a player’s blobs observed in different cameras is equivalent to a K-length cycle in this graph. This problem setup is illustrated in the following figure.




Specifically, we can state our problem as follows: given a complete K-partite graph G with K tiers, we want to find the minimum weight cycle c in G such that c passes through each of the K tiers once and only once. A complete K-partite graph and a node-cycle are shown in the leftmost and rightmost figures below, respectively. We iteratively find and remove K-length minimum weight cycles from G until no more cycles remain in the graph.



Note that as our problem is cyclic in nature, the path we find must start and end at the same node. With traditional dynamic programming, however, there is no guarantee that the shortest path returned by the algorithm ends at the same node as its source. We therefore need to modify our graph representation so that we can satisfy the cyclic constraint of our problem while still using a dynamic programming based scheme.

Let V be the set of all nodes in G, with |V| = n. For each node v in V, we can construct a sub-graph Gv with K + 1 tiers, such that the only node in the 1st and the (K + 1)st tiers of Gv is v. Apart from the 1st and the (K + 1)st tiers, the topology of Gv is the same as that of G. This is illustrated in the 2nd figure above.

Note that the shortest cycle in G involving node v is equivalent to the shortest path in Gv that has v as its source and destination. Our problem can now be restated as follows: given G, construct Gv for all v in V. Find the shortest K-length paths P = {pv in Gv for all v in V} that span each tier in Gv once and only once. Then find the shortest cycle in G by searching for the shortest path in P. There is an inherent tradeoff between efficiency and optimality of this search problem, which is analyzed in detail in the paper.
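The construction above can be sketched as a dynamic program over each Gv, followed by iterative removal of the cycle found. This is a simplified illustration, not the algorithm-class from the paper: it assumes the tiers are visited in one fixed order (0, 1, ..., K-1, back to 0), uses plain ground-plane distance as the edge weight, and only tries start nodes from the first tier, since every valid cycle visits that tier exactly once. All function names are hypothetical.

```python
import numpy as np

def min_weight_cycle(W):
    """W[k] is an (n_k, n_{k+1}) array of weights from tier k to tier (k + 1) % K.

    Returns (cost, cycle), where cycle[k] is the chosen node index in tier k.
    """
    K = len(W)
    best_cost, best_cycle = np.inf, None
    for v in range(W[0].shape[0]):        # G_v: v is both source and destination
        cost = W[0][v].astype(float)      # cheapest path from v to each tier-1 node
        back = []
        for k in range(1, K - 1):         # extend paths from tier k to tier k + 1
            total = cost[:, None] + W[k]
            back.append(np.argmin(total, axis=0))
            cost = np.min(total, axis=0)
        closing = cost + W[K - 1][:, v]   # close the cycle back at v
        last = int(np.argmin(closing))
        if closing[last] < best_cost:
            path = [last]
            for ptr in reversed(back):    # backtrack through the stored pointers
                path.append(int(ptr[path[-1]]))
            best_cost, best_cycle = float(closing[last]), [v] + path[::-1]
    return best_cost, best_cycle

def fuse_all(W):
    """Repeatedly extract the minimum weight cycle and delete its nodes."""
    K = len(W)
    W = [w.copy() for w in W]
    ids = [list(range(w.shape[0])) for w in W]  # original node ids per tier
    matches = []
    while all(len(tier) > 0 for tier in ids):
        _, cyc = min_weight_cycle(W)
        matches.append([ids[k][cyc[k]] for k in range(K)])
        for k in range(K):
            W[k] = np.delete(np.delete(W[k], cyc[k], axis=0), cyc[(k + 1) % K], axis=1)
            del ids[k][cyc[k]]
    return matches

def pairwise_dist(a, b):
    """Euclidean distance between every point of a and every point of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

# Toy example: 3 cameras, each seeing the same 2 players on the ground plane.
cams = [
    np.array([[0.0, 0.0], [10.0, 10.0]]),
    np.array([[10.5, 10.0], [0.2, 0.0]]),
    np.array([[0.0, 0.3], [9.8, 10.1]]),
]
W = [pairwise_dist(cams[k], cams[(k + 1) % 3]) for k in range(3)]
matches = fuse_all(W)  # each entry lists one blob index per camera
```

For K = 3 the fixed tier order loses nothing, because a cycle through three tiers is the same cycle in either direction. For larger K it is a restriction, which is one place where the efficiency versus optimality tradeoff mentioned above shows up.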


3. Multi-Player Sports Visualization

We use our framework to generate various automatic sports visualizations, three of which are listed below.


3a. Offside Line Visualization

An important foul in soccer is the offside call, where an offense player receives the ball while being behind the second-last defense player (SLD). We want to detect the SLD player and draw an offside line underneath him/her. To test the robustness of our proposed system, we ran it on approximately 60,000 frames of soccer footage captured over 5 different illumination conditions, play types, and teams’ attire.
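Once fused ground-plane positions are available, locating the SLD reduces to a sort along the length of the field. The sketch below assumes that x is measured from the defended goal line (x = 0) and ignores the ball position and the other conditions of the offside rule; the function names and the sample coordinates are hypothetical.

```python
def offside_line_x(defender_xs):
    """x-coordinate of the second-last defender, with the defended goal line at x = 0."""
    if len(defender_xs) < 2:
        raise ValueError("need at least two defenders")
    return sorted(defender_xs)[1]

def players_behind_line(attacker_xs, line_x):
    """Indices of offense players closer to the goal line than the offside line."""
    return [i for i, x in enumerate(attacker_xs) if x < line_x]

# Toy example: the goalkeeper is the last defender, at x = 1.0.
defenders = [1.0, 18.5, 12.0, 25.0]
attackers = [30.0, 11.2, 14.0]
line_x = offside_line_x(defenders)
behind = players_behind_line(attackers, line_x)
```

The same test also yields the players to highlight in a passive offside visualization, given some additional cue about who is involved in the play.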



We compared the performance of our proposed system with that of finding the SLD player in each camera individually, and with naively fusing this information by taking the average. Our proposed fusion mechanism outperforms the rest with an average accuracy of 92.6%. The naive fusion produces an average accuracy of 75.7%. The average accuracy across all 3 individual cameras over all 5 sets is 82.7%. To the best of our knowledge, this is the most thorough test of automatic offside-line visualization for soccer games available.


3b. Passive Offside Visualization

Offense players can be in an offside state either actively (getting directly involved in the play while being behind the SLD), or passively (being present behind the SLD without getting directly involved in the play). Fig. 10 shows an example illustrating the offense player in passive offside state automatically highlighted using our proposed framework. Visualizations such as these can help viewers predict whether or not an offside foul is likely to take place.




3c. Players' Motion Pattern Visualization

Visual broadcast of soccer games only shows an instantaneous representation of the sport, where no visual record of what happened over some preceding time is usually maintained. There are two important challenges in having a lapsed representation of a game. Firstly, automatic detection of players’ actions is hard. Secondly, summarizing these actions in an informative manner is non-obvious. To this end, we consider players’ movement as a basic representation of the state of a game, and use our framework to visualize the development of a game over a window of time (see Figure below). Visualizing such holistic movements of players accumulated over time can potentially help viewers understand how a game is progressing, identify the various defense and offense strategies being used, and predict the subsequent game-plan for each of the teams.




4. Conclusions and Future Work

We have presented a novel modeling and search method for fusing evidence from multiple information sources as iteratively finding minimum weight K-length cycles in complete K-partite graphs. As an application of the proposed algorithm-class, we have presented a framework for soccer player localization using multiple synchronized static cameras. We have used this fused information to generate various sports visualizations, including the virtual offside line, highlighting players in passive offside state, and showing players’ accumulated motion patterns. We have presented a thorough analysis of the robustness of our framework by testing it over a large and diverse set of soccer footage.

In the future we want to apply our algorithm-class to a wider set of correspondence finding problems, including matching for depth estimation, trajectory matching using multiple cameras, and motion capture reconstruction. Furthermore, we want to use our visualization framework for a variety of sports, including rugby, hockey, and baseball.




Copyright © 2010 Raffay Hamid. All rights reserved.