
2. BACKGROUND

2.3 Related Work

2.3.1 Sports Data Mining

Data mining is the process of uncovering hidden trends and patterns in data sources. The data sources can be structured, such as databases, or unstructured, such as videos. Data mining is applied to both kinds of data to learn hidden patterns in application domains such as business, medicine, and engineering.

A massive amount of data related to player performances, team performances, etc., is collected by organizations and sport-related associations. Various data mining techniques have been applied successfully to this sports data in the past few years. Sports data mining refers to the application of data mining techniques to sports data [32]. Teams use it for player performance analysis, team performance analysis, building game strategies, etc., to gain an advantage over their opponents. Sports data mining tasks in various sports are presented in Table 2.2. Box-score data, video data, and tracking data are the focus of analysis for sports researchers and analysts [61].

Sport: Data Mining Tasks

American Football: Estimating team strength [33]; Evaluating quarterbacks [34]; Evaluating place-kickers [35]; Forecasting the success [36]

Baseball: Batter’s performance [37]; Base runner’s performance [38]; Pitcher’s performance [39]; Fielder’s performance [40]

Basketball: Evaluating player contributions [41,42]; Optimal strategy planning [43]; Measuring offensive/defensive player abilities [44,45]

Cricket: Target resetting [7,8]; Match simulation [9,10,46,47]; Evaluating player performance [11,13,48,49]; Evaluating team strength [15,50]; Optimal lineups [16,17,51]; Tactics [18,19]

Hockey: Assessing player performance [52]; Optimal strategy planning [53]; Player contribution [54]; NHL drafting [55]

Soccer: Model outcomes [56,57]; Team quality [58]; Individual player ratings [59]; Referee bias [60]

Table 2.2: Data Mining Tasks in Various Sports.

Below, we present work carried out on each of these data types in various sports, including cricket.

Box-score Data Analysis

Box-score data are discrete records of in-game events. Most of these events and statistics (e.g., scorecards, points tables) are generated by humans. Statistical methods are primarily applied to box-score data to measure player and team performance. In baseball, a massive amount of data and statistics is generated in the traditional tabular presentation of a team’s schedule, and making sense of this data requires significant cognitive effort. SportsVis [62] uses a baseline bar display and a player map to explore a team’s and a player’s performance throughout a season, respectively. However, it considers only aggregate information for any particular game. In basketball, a treemap [63] is used to visualize NBA player statistics. In soccer, a gap chart [64] is used to visualize the temporal evolution of the ranks and scores of teams participating in a competition. A gap chart is a class of line chart in which the gaps between teams show the magnitude of their score difference, which ensures that tied entries do not overlap.

In cricket, box-score data are used for target resetting, match simulation, player performance analysis, team strength analysis, optimal lineup prediction, and devising match tactics. We present the literature related to limited-overs cricket (T20I and ODI) and Test cricket separately.

Limited-Overs Cricket: When a match is interrupted by bad weather, target resetting plays a major role. Duckworth and Lewis [7] proposed a method for target resetting (the DL method), which was adopted by the International Cricket Council (ICC). Stern [8] updated the DL method to account for recent changes in scoring. Bailey and Clarke [9] developed a predictive model of the game’s outcome by employing the DL method, with a linear model used to fit the resulting target scores. Allsopp and Clarke [10] identified how winning a game correlates with different batting combinations and run rate. Traditional statistics have blind spots: batting average fails to consider the number of balls faced, and strike rate fails to consider the number of dismissals. Croucher [48] proposed the batting index (batting average × strike rate), which takes both into account.
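The batting index can be illustrated with a short sketch. It assumes the conventional definitions of batting average (runs per dismissal) and strike rate (runs per 100 balls faced); the function names and the two sample batsmen are hypothetical and are not taken from [48].

```python
def batting_average(runs: int, dismissals: int) -> float:
    """Runs scored per dismissal (conventional definition)."""
    if dismissals <= 0:
        raise ValueError("batting average is undefined without a dismissal")
    return runs / dismissals


def strike_rate(runs: int, balls_faced: int) -> float:
    """Runs scored per 100 balls faced (conventional definition)."""
    if balls_faced <= 0:
        raise ValueError("strike rate is undefined without balls faced")
    return 100.0 * runs / balls_faced


def batting_index(runs: int, balls_faced: int, dismissals: int) -> float:
    """Product of batting average and strike rate."""
    return batting_average(runs, dismissals) * strike_rate(runs, balls_faced)


# Two hypothetical batsmen with the same number of runs.
# The first is dismissed less often; the second scores faster.
anchor = batting_index(runs=1200, balls_faced=1500, dismissals=30)   # 40 * 80
hitter = batting_index(runs=1200, balls_faced=1000, dismissals=40)   # 30 * 120
print(anchor, hitter)
```

Because the index multiplies the two statistics, a batsman cannot score well on it by excelling at only one of them.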

Saikia et al. [49] proposed the first quantitative investigation of fielding. To provide a weighted measure of fielding proficiency, they performed subjective assessments of every fielding play, such as catching, ground fielding, and run-outs. Theodoro et al. [13] proposed a Bayesian hidden Markov model for assessing batting in one-day cricket. Iyer and Sharda [11] employed neural networks to predict cricket players’ performance based on their past performances. Assessing team strength is also crucial. Davis et al. [15] proposed a match simulator that assesses team strength in T20I cricket. Jhanwar and Pudi [50] modeled relative team strength using players’ career statistics and recent performances. Using the relative team strength, toss decision, and match venue, they predicted the winner of an ODI cricket match. Optimal team selection has a major effect on a game’s outcome. Lemmer [16] and Ahmed et al. [17] proposed search algorithms to select teams. These approaches typically permit constraints on team selection, e.g., requiring a fixed number of pure batsmen, all-rounders, and bowlers when forming a team. Chhabra et al. [51] proposed a team recommendation system for cricket that models players as embeddings representing their strengths and weaknesses. Their model is based on a player’s past performances (quantitative factor) and opponent players’ strengths and weaknesses (qualitative factor).
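To make the idea of constrained team selection concrete, the following is a minimal sketch using exhaustive search over a small pool. It is not the algorithm of [16] or [17]; the player pool, the single numeric rating per player, and the role quota are all hypothetical.

```python
from itertools import combinations


def select_team(players, quota):
    """Return the highest-rated team that matches the role quota exactly.

    players: list of (name, role, rating) tuples (hypothetical ratings).
    quota:   dict mapping role -> required number of players in that role.
    """
    team_size = sum(quota.values())
    best_team, best_score = None, float("-inf")
    for team in combinations(players, team_size):
        # Count how many players of each role this candidate team has.
        counts = {}
        for _, role, _ in team:
            counts[role] = counts.get(role, 0) + 1
        if counts != quota:
            continue  # violates the fixed-composition constraint
        score = sum(rating for _, _, rating in team)
        if score > best_score:
            best_team, best_score = team, score
    return best_team, best_score


# Hypothetical pool of six players and a four-player quota.
pool = [
    ("A", "batsman", 45), ("B", "batsman", 38), ("C", "batsman", 52),
    ("D", "all-rounder", 30), ("E", "bowler", 25), ("F", "bowler", 33),
]
quota = {"batsman": 2, "all-rounder": 1, "bowler": 1}
team, score = select_team(pool, quota)
names = [name for name, _, _ in team]
print(names, score)
```

Exhaustive search is only feasible for small pools; for realistic squad sizes the number of combinations grows quickly, which is one reason dedicated search or optimization methods are used.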

Test Cricket: Brooks et al. [46] used an ordered probit model with batting and bowling strengths to predict match outcomes in Test cricket. Scarf and Shi [47] modeled the match outcome probabilities using logistic regression, given the position at the end of the third innings. Scarf and Akhtar [18] extended this to the positions at the end of the first and second innings. Their models have been used to study the declaration strategy and the follow-on decision in Test cricket. Scarf et al. [19] used negative binomial distributions to model the runs scored in innings and partnerships during Test matches.

Video Analysis

Gong et al. [65] proposed a system to parse soccer videos into various play categories. To achieve this, they employed four high-level detectors: line mark recognition, motion detection, ball detection, and player uniform color detection. Plays in the mid-field, penalty zones, and corner areas are identified with high accuracy, whereas accuracy is low for shots at goal and corner kicks. Assfalg et al. [66] employed a Hidden Markov Model (HMM) to detect and recognize highlights in soccer matches. Specifically, the authors investigated penalty kicks, free kicks next to the goal post, and corner kicks, three actions that are typical highlights of a soccer game. For the classification task, qualitative features are extracted from the video. Free kicks, penalty kicks, and corner kicks are identified with 80%, 90%, and 100% accuracy, respectively. Highlight recognition has also been examined successfully in other sports such as tennis [67], basketball [68], and baseball [69]. In cricket, Lazarescu et al. [14] carried out camera motion estimation to index cricket videos and to classify the shots played by batsmen based on the estimates. Low-scoring shots are classified more accurately than high-scoring shots.

Baillie and Jose [71] proposed audio-based event detection for soccer broadcasts. Mel Frequency Cepstral Coefficients (MFCC) are extracted from the soundtrack of the soccer commentary. The high correlation between crowd response and key events is used as a cue to detect events effectively.

The text present in the video is utilized for indexing, retrieval, efficient learning, and effective inference. The superimposed text provides vital information about the game’s proceedings. Zhang and Chang [72] employed caption-text detection and recognition to identify events in baseball videos. Text related to the score, outs, and ball counts is detected with higher accuracy than the inning number.

Xu et al. [73] proposed a combination of video and audio features to detect tennis events. In particular, low-level video features, namely motion vector field, texture, and color, are extracted. MFCC features and zero-crossing rate are employed to differentiate applause from commentary speech. Audio keywords, which are argued to be mid-level features, are extracted from audio commentary data in various sports. These are, in turn, used to detect semantic events.
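As an illustration of one of the audio features mentioned above, the sketch below computes the zero-crossing rate of a single audio frame. It is a generic textbook formulation, not the feature extraction pipeline of [73]; the function name and the toy frames are hypothetical.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ.

    frame: sequence of audio samples (floats or ints) for one analysis window.
    Samples equal to zero are treated as non-negative.
    """
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for prev, curr in zip(frame, frame[1:]) if (prev >= 0) != (curr >= 0)
    )
    return crossings / (len(frame) - 1)


# Toy frames: one that changes sign once, one that alternates every sample.
slow = [1, 1, 1, -1, -1, -1]
fast = [1, -1, 1, -1]
print(zero_crossing_rate(slow), zero_crossing_rate(fast))
```

In practice the rate would be computed per short window and combined with other features such as MFCCs; any threshold separating applause from speech would have to be learned from labeled data.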

Nepal et al. [68] used crowd cheer (audio), scorecard (text), and motion detection (video) for event detection in basketball. Sankar et al. [70] used keyword analysis on synchronized text commentary and video to identify the act of the game that a video frame corresponds to (ball being bowled or advertisement). They also identified interesting commentaries by maintaining a count of interesting words.
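Counting interesting words in commentary can be sketched as follows. This is only an illustration of the idea, not the method of [70]; the keyword set, the tokenization rule, and the sample commentary lines are hypothetical.

```python
import re
from collections import Counter

# Hypothetical list of "interesting" words; a real list would be curated.
INTERESTING_WORDS = {"six", "four", "wicket", "bowled", "caught"}


def interest_score(commentary, keywords=INTERESTING_WORDS):
    """Return (total keyword hits, per-keyword counts) for one commentary line."""
    tokens = re.findall(r"[a-z]+", commentary.lower())
    counts = Counter(token for token in tokens if token in keywords)
    return sum(counts.values()), counts


exciting, hits = interest_score(
    "Bowled him! Middle stump out of the ground, a huge wicket."
)
routine, _ = interest_score("Defended back to the bowler, no run.")
print(exciting, dict(hits), routine)
```

Exact token matching is deliberate here: "bowler" in the second line does not count as a hit for "bowled". A real system might add stemming or phrase matching.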

Tracking Data Analysis

Recent developments in tracking and sensing technologies make it possible to obtain spatio-temporal information (x and y coordinates at time t) about the players and equipment (e.g., ball, bat) in real time during play. Tracking data are the continuous spatio-temporal motion data generated by multi-camera tracking systems (e.g., Hawk-Eye, SportVU).

Many visualization tools have been introduced for finding hidden patterns in spatio-temporal data. In soccer, Wu et al. [74] proposed ForVizor to visualize player formation changes over time and reveal the continuous spatial flows of formations (formation flows) for in-depth analysis. In addition to formation flows, multiple coordinated components are designed to support the formation analysis. ForVizor uses a combination of manual annotation and algorithmic representation. It involves manual annotation of the entire video by experts, which requires significant effort and may not be scalable. In baseball, Dietrich et al. [75] proposed Baseball4D, which uses raw baseball tracking data (player and ball) over time and plots them as events on a dot map to reconstruct the entire game and visually explore each play. It combines time-varying player tracking and ball tracking data streams to generate nontrivial statistics and visualizations.

In basketball, Beshai [76] developed Buckets, which utilizes basketball shot data (spatial data) to view details about a single player, compare multiple players, and explore league trends. In cricket, Das et al. [12] proposed CricVis, a web-based visualization system that utilizes box-score data (scorecards) and tracking data (ball tracking) to construct visualizations such as pitch maps and stump maps to analyze the bowling overview and batting overview, respectively. Morgan et al. [20] predicted where a specific batsman would hit a specific bowler and delivery type in a specific game scenario. Spatio-temporal data-based analysis focuses on visualizing low-level information (e.g., player actions). High-level tactical strategies, e.g., team tactics, are hard to infer from this low-level information.

Sports data mining methods mainly focus on box-score data, tracking data, and video data. Box-score data are used for tasks such as target resetting, match simulation, and player performance analysis. While interesting, these methods have primarily operated at an aggregate level, i.e., they do not attend to the minute details of the game. Tracking data and video data have been applied successfully in many sports-related tasks. However, the main bottleneck with these data is that they are not publicly available, as they are highly expensive to capture in every match. This motivates us to propose a model that uses publicly available data, considers the game’s minute details, and learns player-specific strategies (strengths and weaknesses). Unstructured data in the form of cricket text commentary is used for this task. To use the text commentary data effectively, we review the literature on text (and short text) representation and visualization in the following sections.

Source document: Learning Player-specific Strategies Using Cricket Text Commentary (pages 36-40)