Beating Bookmakers — Proof of Concept model that is good enough to start betting.

SquareGraph
11 min read · Sep 15, 2022


Several experiences in my life inspired me to start writing the following series of articles. Without a long introduction: reading some publications on Medium, along with a couple of scientific papers on various approaches to the problem in question, I decided to try my hand at predicting the results of football matches, with a little support from machine learning.

In a series of upcoming publications, I would like to share my experiences from the last weeks. I hope someone will find something valuable in this work.

Executive summary. What we are dealing with.

The aim of the project was to create several baseline models, determine how far they are from the market standard, and check what the real result would be if a given model bet on match outcomes at average bookmaker odds.

What is the market standard?

This was the first key question that I had to answer while doing research. To answer it, everyone has to talk about the same metric. In this case it is “Accuracy”, which measures how many times out of n cases the model was right. It is also worth noting that effectiveness can vary across different competitions, so for the sake of simplicity we will focus on the English Premier League first. And what are these market standards?

Below are some links to sources:

  1. The “Accuracy” of bookmaking algorithms oscillates around 54% — http://andrew.carterlunn.co.uk/programming/2018/02/20/beating-the-bookmakers-with-tensorflow.html
  2. The “Accuracy” of Octosport’s Premier League model is 54% — https://www.octosport.io/model-performance
  3. The “Accuracy” of a basic logistic regression model is below 50% — https://medium.com/geekculture/building-a-simple-football-prediction-model-using-machine-learning-f061e607bec5 by octosport.io

In the following article, Nicholas Utikal quotes papers about top ML models achieving 60–65% accuracy, and his own approach comes close to this performance — https://medium.com/@nicholasutikal/predicting-football-results-using-archetype-analysis-and-xgboost-1344027eae28. I also understand that there is a competition on Kaggle where people achieve nearly 100% accuracy using some feature engineering and LSTM networks. But I’m not taking it into account, as every problem in the world has a 100% accurate solution on Kaggle :)

Weight of each percentage of success.

In terms of effectiveness, each percentage point matters. It is enough to consider the table below, covering the English Premier League 21/22 season.

If we add up the products of the match counts and the odds assigned to them, and then divide by the number of games in the season (380), we will know what the return on investment would be if all the matches we bet on turned out to be successful. Here I am assuming a single bet on the result (home / away / draw). The sum of the products is 715.26 which, divided by 380, gives a multiplier of ~1.8822, that is, a return of roughly 88 percent on the staked amount.
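The arithmetic can be sanity-checked in a couple of lines (only the totals stated above are used; the per-match table itself is an image and is not reproduced here):

```python
# Hypothetical return if every one of the 380 bets in the season won.
# 715.26 is the stated sum of (number of matches x average winning odds)
# over the whole season; the per-match detail lives in the (image) table.
SUM_OF_PRODUCTS = 715.26
GAMES = 380

multiplier = SUM_OF_PRODUCTS / GAMES  # payout per unit staked
print(round(multiplier, 3))  # 1.882 -> ~88% return with perfect picks
```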

Of course, there is no point deluding yourself that you will predict all matches correctly, but it is worth knowing how much 1% of “Accuracy” weighs in this case, so you can realistically assess both your expectations and your possible achievements.

A quick look makes it clear that the “industry standard” is barely profitable, and that only getting close to 60% or above makes an algorithm interesting from an investment point of view.

Objective

The above calculation is of course a simplification, which does not take into account the bias with which betting odds are intentionally created, but it gives a picture of the potential we are talking about. In particular, it is possible to increase the profitability of such a venture without freezing all your cash across 380 individual bets, by applying a smoother approach in which the portfolio is replenished to cover missed bets. I’ll cover it in the following articles.

Therefore, a reasonable-sounding target to set for myself before starting the research and testing was an “Accuracy” level of 55%, as a baseline for further work. Profitability as a base is not the worst starting point.

Collection and processing of data.

I am lucky enough to have started a UEFA football coaching course some time ago, and this sport has been close to me for half my life. Therefore, I decided to use the knowledge I have to try to build a data set with the potential to be analytically relevant.

The soccerdata library (https://soccerdata.readthedocs.io/) came to the rescue; it is a data scraper covering several different sources. A detailed description of what was used from this library, and how, can be found in my notebook, linked below.

https://github.com/SquareGraph/FootballPredictionsModel/blob/main/DataGathering_from_Soccerdata.ipynb

Still, let me describe in a few points what happens there, and show what the data looks like.

Soccerdata FBref API

This is my basic API, onto which I merge data from other sources and engineer new features for the algorithm:

[Image: output of the read_schedule()-based function.]
  1. fbref.read_schedule() — the league schedule. The method returns a DataFrame of all games in the given season. I had to clean it up a bit, as it contained rows for postponed games filled in with “NaN”. The screenshot shows an example of what my function utilizing the soccerdata library returns. The important column is “game_id”, which will later allow you to link different data together.
  2. fbref.read_player_match_stats() — with this method we can obtain very detailed statistics from each game, accumulated per player. Many types of statistics are available, but for this model we will only use passing (also because when I ran a function that built a DataFrame of all statistics for all matches, after 5 hours of notebook work the resulting objects turned out to be so large that any operation on them, such as saving to CSV, crashed the notebook in Google Colab).
  3. fbref.read_shot_events() — this one groups all shot events per game. The information is both categorical and quantitative, so the categories were translated into numerical values and the shot distances were clustered into 10-meter bins.
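A minimal sketch of the schedule clean-up from point 1, on a synthetic frame (in the real notebook the frame comes from soccerdata’s FBref reader; the column names here are assumptions for illustration):

```python
import pandas as pd

# Synthetic stand-in for fbref.read_schedule(); the real frame comes
# from soccerdata. Postponed games arrive with NaN in the score column.
schedule = pd.DataFrame({
    "game_id": ["a1", "a2", "a3"],
    "home_team": ["Arsenal", "Chelsea", "Everton"],
    "away_team": ["Chelsea", "Everton", "Arsenal"],
    "score": ["2-0", None, "1-1"],  # second game was postponed
})

# Drop the postponed rows, keep game_id as the future merge key.
clean = schedule.dropna(subset=["score"]).reset_index(drop=True)
print(clean["game_id"].tolist())  # ['a1', 'a3']
```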

Two important annotations here, for anyone who wants to use this data themselves. The first is of a programming nature: team names differ between APIs. That is why you often have to work with mapping dictionaries, because one API wants “Manchester United” and another needs “Manchester Utd”.
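A minimal sketch of such a mapping (the entries and the direction of the mapping are illustrative, not the full dictionary from my notebook):

```python
import pandas as pd

# One API's spelling mapped onto another's; teams without an entry
# are left untouched by .replace().
FBREF_TO_MATCHHISTORY = {
    "Manchester United": "Manchester Utd",
    "Wolverhampton Wanderers": "Wolves",
}

df = pd.DataFrame({"home_team": ["Manchester United", "Arsenal"]})
df["home_team"] = df["home_team"].replace(FBREF_TO_MATCHHISTORY)
print(df["home_team"].tolist())  # ['Manchester Utd', 'Arsenal']
```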

The second, more important issue concerns the characteristics of these data. We cannot feed them directly into the calculation, because we will not have them at the moment a match comes up for prediction: the number of shots or passes is known only after the game has taken place. So the analytically valuable information for us is the moving average of the accumulated features.

[Image: rolling-average function for match stats.]

The code snippet above shows a function I created to build a DataFrame for each team with their home and away performances, respectively. I assumed a window of the last five games. An important parameter here is closed=“left”, which guarantees that only the previous matches, and not the current one, are taken into account. When calculating such an average, the first row comes out as NaN, so we fill it with scipy.stats.mode.
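On a toy series, the mechanics look like this (I use pandas mode() as a stand-in for the scipy.stats.mode fill mentioned above; taking the mode over the raw stat is my assumption):

```python
import pandas as pd

# Last-5-games rolling average of a per-match stat for one team.
# closed="left" excludes the current match, so no information from the
# game being predicted leaks into its own features.
goals = pd.Series([1, 3, 0, 2, 4, 1], name="goals_for")
form = goals.rolling(window=5, min_periods=1, closed="left").mean()
# form[0] is NaN (no previous games), so fill it with the mode.
form = form.fillna(goals.mode().iloc[0])
print(form.tolist())  # first entry filled, rest average past games only
```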

Soccerdata ELO API

Another element I found interesting is the ELO ranking. This parameter was added for each team participating in a match and is taken as of match day. I created a function and applied it through the pd.DataFrame.apply() method, adding two columns to the DataFrame: “home_rank” and “away_rank”.

[Image: the ELO API call.]
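A sketch of that apply() step on synthetic data (in the notebook the ratings come from soccerdata’s ClubElo reader; the lookup structure and the names here are illustrative):

```python
import pandas as pd

# Synthetic (date, team) -> ELO lookup standing in for the real API.
elo_on_matchday = {
    ("2021-08-14", "Arsenal"): 1780.0,
    ("2021-08-14", "Brentford"): 1610.0,
}

matches = pd.DataFrame({
    "date": ["2021-08-14"],
    "home_team": ["Brentford"],
    "away_team": ["Arsenal"],
})

def elo(row, side):
    # Rating of the team on the given side, as of match day.
    return elo_on_matchday[(row["date"], row[side])]

matches["home_rank"] = matches.apply(elo, axis=1, side="home_team")
matches["away_rank"] = matches.apply(elo, axis=1, side="away_team")
print(matches[["home_rank", "away_rank"]].iloc[0].tolist())  # [1610.0, 1780.0]
```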

Soccerdata MatchHistory API

This API turned out to be very important, because it contains the betting odds for each match, plus some additional data such as the FTR (Full Time Result) column, which will not only serve as our label later, but also allowed me to calculate the form of the teams, similarly to counting passes or shots with a moving average. Here, too, the team names differed from the FBref API, so before I could combine these data I had to prepare another dictionary and then execute pd.DataFrame.merge(on=[“home_team”, “away_team”]). One annotation: the MatchHistory API has broken match dates, so be aware.

[Image: full-season DataFrame from the MatchHistory API, note the broken dates.]
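The merge itself, sketched on a minimal example after team names have been unified (the feature column is illustrative; B365H is a typical odds column in football-data.co.uk-style data, which is my assumption about what the MatchHistory frame carries):

```python
import pandas as pd

# FBref-derived features on one side, MatchHistory odds + FTR label on
# the other, joined on the unified team-name pair.
fbref = pd.DataFrame({
    "home_team": ["Manchester Utd"],
    "away_team": ["Leeds United"],
    "passes_avg_home": [512.0],
})
history = pd.DataFrame({
    "home_team": ["Manchester Utd"],
    "away_team": ["Leeds United"],
    "FTR": ["H"],
    "B365H": [1.57],
})
merged = fbref.merge(history, on=["home_team", "away_team"])
print(merged.shape)  # (1, 5)
```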

Player parameters — FIFA game data

The entire market of modern football analytics tends to standardize player description with measurable parameters, so you can compare players using the same criteria. A simplified version of such measurements has already been made in FIFA, and it is worth considering in our case. As part of the base model, though, I will keep it simple. I won’t check the exact lineups and compare how a given defender performs against the opponent’s offensive formation. Instead, I averaged the team parameters and subtracted the visiting team’s values from the home team’s. This way I created a table where a positive statistic means the host team has an advantage; in the case of negative values, the guests are on top.
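In pandas terms, that reduction could look like this (clubs, columns and ratings are made up for illustration):

```python
import pandas as pd

# Average FIFA player ratings per club, then take home minus away,
# so positive entries mean a home advantage.
players = pd.DataFrame({
    "club": ["Liverpool", "Liverpool", "Burnley", "Burnley"],
    "overall": [88, 84, 74, 72],
    "pace": [90, 70, 72, 68],
})
team_avg = players.groupby("club").mean()

def team_diff(home, away):
    # Positive values: host advantage; negative: guests on top.
    return team_avg.loc[home] - team_avg.loc[away]

print(team_diff("Liverpool", "Burnley").tolist())  # [13.0, 10.0]
```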

Where does this data come from? Soccerdata also provides access to a SoFIFA scraper, but for some reason its data is incomplete. Kaggle, on the other hand, comes to help, along with the great work people have been doing for years (@StefanoLeone, thanks) on the following dataset:

https://www.kaggle.com/datasets/stefanoleone992/fifa-22-complete-player-dataset

Thanks to this, I avoided a lot of time spent writing a scraper. The final table with FIFA data consists of 43 columns filled with floating-point values.

Preparation of data for machine learning.

The data prepared in the previous steps looks as follows.

[Image: our main DataFrame.]

The shape parameter (380, 141) tells us that each row in the DataFrame corresponds to one game of the English Premier League season, and almost all columns will become features for the proposed models. Almost, because:

  1. main.FTR will be the y labels. FTR stands for Full Time Result. Using pd.factorize(), the original categorical values (H — home win, A — away win, D — draw) were assigned appropriate numerical values.
  2. We have two date columns, “date_x” and “date_y”, inherited from the two different DataFrames (MatchHistory and FBref) — for now we will not use them in any way, so we will drop them.
  3. The same goes for the game_id parameter we previously used to merge the two tables.
  4. There are a few more features that appeared during linking (and which I need to fix in the next iteration of the collecting notebook) — parameters like “home_point”, “away_points”, “draw”, “D_HT”, which were auxiliary for counting team form, plus a copy of the “Unnamed: 0” index.

In addition, we have a few columns almost completely filled with NaN values. Here I checked the API responses against the real matches and decided to fill them in with 0. Some of these columns are a consequence of missing data on the API side, and some of the fact that when the moving average starts from no data at all (there was nothing before game 1), the following rows were filled with NaN.

Since we want to define the baseline, I am not normalizing or otherwise modifying the data.

[Image: the resulting X features and y labels.]
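The steps above, sketched on a toy frame (the drop list is abbreviated to the columns named in the text):

```python
import pandas as pd

# Toy stand-in for the main (380, 141) frame; only columns named in
# the text are shown.
main = pd.DataFrame({
    "FTR": ["H", "A", "D", "H"],                   # full-time result
    "date_x": ["2021-08-14"] * 4,                  # duplicated dates
    "date_y": ["2021-08-14"] * 4,
    "game_id": ["g1", "g2", "g3", "g4"],           # merge key, not a feature
    "home_rank": [1700.0, None, 1650.0, 1800.0],   # an actual feature
})

# 1) labels: factorize the categorical result (codes follow the order
#    of first appearance, so H=0, A=1, D=2 here)
y, classes = pd.factorize(main["FTR"])

# 2) drop label and helper columns; 3) fill NaN leftovers with 0
X = main.drop(columns=["FTR", "date_x", "date_y", "game_id"]).fillna(0)
print(list(classes), X.shape)  # ['H', 'A', 'D'] (4, 1)
```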

Trained models and metrics.

On the subject of metrics, in addition to the aforementioned “Accuracy” I will also use one of my own, specific to this problem and niche. Namely, it checks how much the picks were really worth according to the algorithm, i.e. profitability. As the validation sample was 76 matches, this number is also the minimum value the model should achieve in order to be considered non-losing (remembering, all the time, the simplification that assumes naive portfolio management, where a constant money pot is split equally across all bets).

I have named this custom metric “score” and it is calculated as follows:
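The exact formula was shown in a snippet that is not reproduced here; a plausible reconstruction, consistent with the description (one unit staked per match, correct picks pay out their odds), is:

```python
import numpy as np

# Reconstructed "score": with one unit staked per match, sum the odds of
# the matches the model called correctly. A score above the number of
# bets (76 on the validation set) means the strategy is non-losing.
def score(y_true, y_pred, odds_of_predicted_outcome):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    odds = np.asarray(odds_of_predicted_outcome)
    return float(odds[y_true == y_pred].sum())

# 3 bets, 2 correct at odds 2.1 and 3.0 -> payout 5.1 on a stake of 3
print(score([0, 1, 2], [0, 1, 0], [2.1, 3.0, 1.8]))
```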

Selected models

For all the models below, hyperparameters were selected up front (with some basic, model-wise understanding) but without fine-tuning; fine-tuning is still ahead, in upcoming articles. At first, I decided to analyze four models. The first two are the well-known RandomForestClassifier from the Scikit-learn library, and XGBoost; both require no introduction: in the case of tabular data they are considered a market standard and among the most effective solutions.

The other two are TabNet (https://arxiv.org/abs/1908.07442), in the implementation from https://github.com/dreamquark-ai/tabnet, and a simplified version of DeepInsight (https://www.nature.com/articles/s41598-019-47765-6), an algorithm that translates flat tabular data into two-dimensional feature groupings for later use with a CNN.
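For the classical baselines, the workflow is roughly the following (synthetic data stands in for the real (380, 141) frame; note that a 20% split of 380 games gives exactly the 76-match validation sample mentioned earlier):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the season frame; hyperparameters left near
# defaults, as in the article (no fine-tuning yet).
rng = np.random.default_rng(0)
X = rng.normal(size=(380, 20))
y = rng.integers(0, 3, size=380)  # H / A / D encoded as 0 / 1 / 2

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)  # 76-match validation set
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_tr, y_tr)
acc = accuracy_score(y_val, model.predict(X_val))
print(f"accuracy: {acc:.3f}")  # meaningless on random data; shows the workflow
```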

A link to the notebook, which describes the entire process step by step, is attached below.

https://github.com/SquareGraph/FootballPredictionsModel/blob/main/BaselineModels_Football_Predictions_55_60_version_to_publish.ipynb

Summary

The most important thing, however, is the table comparing all the models against the target of 55% accuracy.

[Image: comparison of all models against the 55% accuracy target.]

As for DeepInsight, I ran it multiple times and the results are very random. I am not abandoning this concept yet, but I will definitely check LSTM networks before trying to refine this model, or the data fed to it, any further.

RandomForest and XGB showed potential; each of the assessment parameters puts them below the qualification threshold, but close enough that I will very strongly consider GridSearchCV or another method of hyperparameter optimization in further research.

TabNet in its basic version, on the other hand, turned out significantly above the assumptions and, what’s more, came close to the top market solutions. Extrapolating over the entire season, with the naive money management, the algorithm would potentially give a chance of an 11% return.

In conclusion: if even the test solution already looks this efficient, it is worth delving into the problem further.

Next steps? Fine tuning the data and fine tuning the algorithms :) !

Some articles I mentioned:

  1. https://medium.com/analytics-vidhya/beating-soccer-odds-using-machine-learning-project-walkthrough-a1c3445b285a by Arthur Caldas
  2. https://medium.com/geekculture/building-a-simple-football-prediction-model-using-machine-learning-f061e607bec5 by octosport.io
  3. https://medium.com/@nicholasutikal/predicting-football-results-using-archetype-analysis-and-xgboost-1344027eae28 by Nicholas Utikal


Written by SquareGraph

Experienced C-level manager turned Data Scientist & Deep Learning enthusiast. Husband, homegrown cook, photographer and a wakeboarder.
