Fitting Ml Model On Bigquery
tags: #bigquery #machine-learning #gcp
Baseline Model: Logistic Regression
- The SELECT statement generates the isGoal outcome variable and extracts relevant features from event data, such as shot distance and angle. Joins are utilized to identify the competition of each shot, allowing for the exclusion of World Cup data from the current modeling, which will be used later after the model has been fitted.
```
CREATE MODEL
soccer.xg_logistic_reg_modelOPTIONS( model_type = ‘LOGISTIC_REG’, input_label_cols = [‘isGoal’] ) AS
SELECT Events.subEventName AS shotType,
/* 101 is known Tag for ‘goals’ from goals table */ (101 IN UNNEST(Events.tags.id)) AS isGoal,
soccer.GetShotDistanceToGoal(Events.positions[ORDINAL(1)].x,
Events.positions[ORDINAL(1)].y) AS shotDistance,
soccer.GetShotAngleToGoal(Events.positions[ORDINAL(1)].x,
Events.positions[ORDINAL(1)].y) AS shotAngle
FROM
soccer.events Events
LEFT JOIN
soccer.matches Matches ON
Events.matchId = Matches.wyId
LEFT JOIN
soccer.competitions Competitions ON
Matches.competitionId = Competitions.wyId
WHERE /* Filter out World Cup matches for model fitting purposes / Competitions.name != ‘World Cup’ AND / Includes both “open play” & free kick shots (including penalties) */ ( eventName = ‘Shot’ OR (eventName = ‘Free Kick’ AND subEventName IN (‘Free kick shot’, ‘Penalty’)) ) ;
- There is also evaluation (so cool)
![[Pasted image 20250111230710.png]]
- Understanding the model fit (via bigquery [ML weight function](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-weights))
![[Pasted image 20250111231141.png]]
- Each input in the logistic regression model has associated weights: a single numerical weight for continuous features and one for each category in categorical features. Interpreting these coefficients directly can be misleading due to potential correlations among predictors. However, based on general soccer knowledge, the following trends appear reasonable:
- penalty kicks have a higher success rate,
- shots taken from greater distances tend to have lower success,
- and shots taken at steeper angles to the goal are more likely to succeed.
-
#### XGBOOST
- using BigQuery ML's XGBoost to the data and compare its performance with the earlier fitted logistic regression model.
- Boosted trees theoretically offer enhanced accuracy due to their capability to capture nonlinear relationships and interactions among features, which logistic regression may not adequately address
CREATE MODEL soccer.xg_boosted_tree_model
OPTIONS(
model_type = ‘BOOSTED_TREE_CLASSIFIER’,
input_label_cols = [‘isGoal’]
) AS
SELECT Events.subEventName AS shotType,
/* 101 is known Tag for ‘goals’ from goals table */
(101 IN UNNEST(Events.tags.id)) AS isGoal,
soccer.GetShotDistanceToGoal(Events.positions[ORDINAL(1)].x,
Events.positions[ORDINAL(1)].y) AS shotDistance,
soccer.GetShotAngleToGoal(Events.positions[ORDINAL(1)].x,
Events.positions[ORDINAL(1)].y) AS shotAngle
FROM
soccer.events Events
LEFT JOIN
soccer.matches Matches ON
Events.matchId = Matches.wyId
LEFT JOIN
soccer.competitions Competitions ON
Matches.competitionId = Competitions.wyId
WHERE /* Filter out World Cup matches for model fitting purposes / Competitions.name != ‘World Cup’ AND / Includes both “open play” & free kick shots (including penalties) */ ( eventName = ‘Shot’ OR (eventName = ‘Free Kick’ AND subEventName IN (‘Free kick shot’, ‘Penalty’)) ) ;
- Evaluating results: The log loss and ROC AUC of the boosted tree model is pretty similar to that from the logistic regression model, but slightly worse (higher log loss and lower ROC AUC).
#### Prediction on new data
- Can use [BigQuery ML's prediction functionality](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-predict)
SELECT *
FROM
ML.PREDICT(
MODEL soccer.xg_logistic_reg_model,
(
SELECT
Events.subEventName AS shotType,
/* 101 is known Tag for 'goals' from goals table */
(101 IN UNNEST(Events.tags.id)) AS isGoal,
`soccer.GetShotDistanceToGoal`(Events.positions[ORDINAL(1)].x,
Events.positions[ORDINAL(1)].y) AS shotDistance,
`soccer.GetShotAngleToGoal`(Events.positions[ORDINAL(1)].x,
Events.positions[ORDINAL(1)].y) AS shotAngle
FROM
`soccer.events` Events
LEFT JOIN
`soccer.matches` Matches ON
Events.matchId = Matches.wyId
LEFT JOIN
`soccer.competitions` Competitions ON
Matches.competitionId = Competitions.wyId
WHERE
/* Look only at World Cup matches for model predictions */
Competitions.name = 'World Cup' AND
/* Includes both "open play" & free kick shots (including penalties) */
(
eventName = 'Shot' OR
(eventName = 'Free Kick' AND subEventName IN ('Free kick shot', 'Penalty'))
) ) ) ``` ![[Pasted image 20250111234801.png]]