I am trying to fit a random forest to predict the number of yellow cards from predictors such as referee, team, and opponent (all factors with many levels), along with other variables such as competition and season (year).
The dataset is fairly large because I reshaped it so that every game appears twice, once from each team's point of view. It has about 36k rows, covering 5 countries and seasons 2015 to 2019.
R crashes no matter how I partition the data, even when I take only a small subset.
My data looks like this:
team opponent home yellow_cards red_cards opposition_yc referee country season competition_level total_bookings match_date matchweek
<fct> <fct> <dbl> <int> <int> <int> <fct> <fct> <dbl> <fct> <int> <date> <int>
1 1860 Munch… Kaisers… 0 3 0 2 Bastia… Germany 2015 2 3 2014-08-04 1
2 1860 Munch… RB Leip… 1 3 0 1 Guido … Germany 2015 2 3 2014-08-10 2
3 1860 Munch… Heidenh… 0 1 0 2 Patric… Germany 2015 2 1 2014-08-22 3
4 1860 Munch… Darmsta… 1 2 0 2 Martin… Germany 2015 2 2 2014-08-31 4
5 1860 Munch… St Pauli 0 4 0 3 Robert… Germany 2015 2 4 2014-09-14 5
6 1860 Munch… Ingolst… 1 2 0 1 Sven J… Germany 2015 2 2 2014-09-20 6
and the structure is:
tibble [36,018 × 14] (S3: tbl_df/tbl/data.frame)
$ team : Factor w/ 241 levels "1860 Munchen",..: 1 1 1 1 1 1 1 1 1 1 ...
$ opponent : Factor w/ 241 levels "1860 Munchen",..: 114 181 103 70 210 110 197 99 3 81 ...
$ home : num [1:36018] 0 1 0 1 0 1 0 1 0 0 ...
$ yellow_cards : int [1:36018] 3 3 1 2 4 2 2 1 3 4 ...
$ red_cards : int [1:36018] 0 0 0 0 0 0 0 0 0 0 ...
$ opposition_yc : int [1:36018] 2 1 2 2 3 1 5 0 1 2 ...
$ referee : Factor w/ 217 levels "Alain Bieri",..: 23 87 159 134 177 201 21 187 135 208 ...
$ country : Factor w/ 5 levels "England","France",..: 3 3 3 3 3 3 3 3 3 3 ...
$ season : num [1:36018] 2015 2015 2015 2015 2015 ...
$ competition_level: Factor w/ 2 levels "1","2": 2 2 2 2 2 2 2 2 2 2 ...
$ total_bookings : int [1:36018] 3 3 1 2 4 2 2 1 3 4 ...
$ match_date : Date[1:36018], format: "2014-08-04" "2014-08-10" "2014-08-22" "2014-08-31" ...
$ matchweek : int [1:36018] 1 2 3 4 5 6 7 8 9 10 ...
$ yellow_card_lag1 : int [1:36018] NA 3 3 1 2 4 2 2 1 3 ...
There are also two other numeric columns (with some NAs) that I had to leave out of the output shown here.
I have already fitted GLMs and linear models, but I wanted to try random forests for better predictive performance and to capture more complex relationships between predictors. I have also read in The Elements of Statistical Learning that random forests cope well with missing values, which my data does contain.
Since yellow cards are a non-negative, discrete count, I tried using distRforest to fit the data with the Poisson method, but R crashes every time, even when I partition the data into smaller subsets, such as a single country, only rows without NAs, or even just the first 100 rows. What could be the reason?
distRforest::rforest(formula = yellow_cards ~ yellow_card_lag1 + referee,
                     data = data_2_reshaped, ## Data doesn't have to be preprocessed, unlike for parametric regression
                     method = "poisson", ## For YC count data
                     ntrees = 50,
                     track_oob = TRUE,
                     ncand = 5)
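
For reference, the subsets mentioned above (one country, no NAs, first 100 rows) can be built roughly as follows. This is only a minimal base-R sketch on toy data: `toy` and its values are hypothetical stand-ins for `data_2_reshaped`. The `droplevels()` and `as.data.frame()` calls are extra precautions assumed here, not known fixes for the crash.

```r
# Hypothetical toy stand-in for data_2_reshaped (the real data has 36,018 rows)
set.seed(1)
toy <- data.frame(
  yellow_cards     = rpois(200, 2),
  yellow_card_lag1 = c(NA, rpois(199, 2)),
  referee          = factor(sample(c("A", "B", "C", "D"), 200, replace = TRUE)),
  country          = factor(sample(c("England", "Germany"), 200, replace = TRUE))
)

# 1. One country only; droplevels() removes factor levels that no longer occur
one_country <- droplevels(subset(toy, country == "Germany"))

# 2. Complete cases, restricted to the variables used in the model formula
vars  <- c("yellow_cards", "yellow_card_lag1", "referee")
no_na <- droplevels(toy[complete.cases(toy[, vars]), vars])

# 3. First 100 rows, coerced to a plain data.frame (relevant if the input is a tibble)
first_100 <- as.data.frame(droplevels(head(toy, 100)))

# Sanity checks before passing any subset to a model
stopifnot(!anyNA(no_na))
stopifnot(nlevels(one_country$country) == 1)
str(first_100)
```

Each of these objects could then be passed as `data` in the `rforest()` call above to see whether the crash depends on the input.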