distRforest keeps crashing when building a random forest


I am trying to fit a random forest to predict the number of yellow cards given predictors such as referee, team, and opponent (highly ordinal), along with other variables such as competition and season (year).

The dataset is quite large because I reformatted it so that every game appears twice, once from the point of view of each team. It has about 36k rows, covering 5 countries and seasons from 2015 to 2019.

It crashes no matter how I partition the data, including when I fit on only a subset.

My data looks like this:

       team        opponent  home yellow_cards red_cards opposition_yc referee country season competition_level total_bookings match_date matchweek
   <fct>       <fct>    <dbl>        <int>     <int>         <int> <fct>   <fct>    <dbl> <fct>                      <int> <date>         <int>
 1 1860 Munch… Kaisers…     0            3         0             2 Bastia… Germany   2015 2                              3 2014-08-04         1
 2 1860 Munch… RB Leip…     1            3         0             1 Guido … Germany   2015 2                              3 2014-08-10         2
 3 1860 Munch… Heidenh…     0            1         0             2 Patric… Germany   2015 2                              1 2014-08-22         3
 4 1860 Munch… Darmsta…     1            2         0             2 Martin… Germany   2015 2                              2 2014-08-31         4
 5 1860 Munch… St Pauli     0            4         0             3 Robert… Germany   2015 2                              4 2014-09-14         5
 6 1860 Munch… Ingolst…     1            2         0             1 Sven J… Germany   2015 2                              2 2014-09-20         6

and the structure is

tibble [36,018 × 14] (S3: tbl_df/tbl/data.frame)
 $ team             : Factor w/ 241 levels "1860 Munchen",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ opponent         : Factor w/ 241 levels "1860 Munchen",..: 114 181 103 70 210 110 197 99 3 81 ...
 $ home             : num [1:36018] 0 1 0 1 0 1 0 1 0 0 ...
 $ yellow_cards     : int [1:36018] 3 3 1 2 4 2 2 1 3 4 ...
 $ red_cards        : int [1:36018] 0 0 0 0 0 0 0 0 0 0 ...
 $ opposition_yc    : int [1:36018] 2 1 2 2 3 1 5 0 1 2 ...
 $ referee          : Factor w/ 217 levels "Alain Bieri",..: 23 87 159 134 177 201 21 187 135 208 ...
 $ country          : Factor w/ 5 levels "England","France",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ season           : num [1:36018] 2015 2015 2015 2015 2015 ...
 $ competition_level: Factor w/ 2 levels "1","2": 2 2 2 2 2 2 2 2 2 2 ...
 $ total_bookings   : int [1:36018] 3 3 1 2 4 2 2 1 3 4 ...
 $ match_date       : Date[1:36018], format: "2014-08-04" "2014-08-10" "2014-08-22" "2014-08-31" ...
 $ matchweek        : int [1:36018] 1 2 3 4 5 6 7 8 9 10 ...
 $ yellow_card_lag1 : int [1:36018] NA 3 3 1 2 4 2 2 1 3 ...

There are also two other numeric columns (with some NAs) that I had to omit from the output shown here.

I have already fitted GLMs and linear models, but I wanted to use random forests for better predictive performance and to model more complex relationships between predictors. I have also read in The Elements of Statistical Learning that random forests handle missing values well, and my data does contain some.

Since the number of yellow cards is a non-negative, discrete count, I tried using distRforest to fit the data with the Poisson method, only to have R crash each time. This happens even when I partition the data into smaller subsets, such as using only one country, removing the NAs, or taking just the first 100 rows. What could be the reason?

distRforest::rforest(formula = yellow_cards ~ yellow_card_lag1 + referee,
                     data = data_2_reshaped,  # data doesn't have to be preprocessed, unlike for parametric regression
                     method = "poisson",      # for yellow-card count data
                     ntrees = 50,
                     track_oob = TRUE,
                     ncand = 5)
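
To make the subsets described above reproducible, they can be prepared along these lines. This is a minimal sketch in base R only: `prepare_subset` and `toy` are hypothetical names introduced for illustration, and the toy values are made up. It converts the tibble to a plain data.frame, drops rows with NAs, optionally keeps the first `n` rows, and removes unused factor levels. I have not confirmed that any of these is the cause of the crash.

```r
# Hypothetical helper (not part of distRforest): keep only the columns used
# in the formula and remove tibble / NA / unused-level complications.
prepare_subset <- function(data, cols, n = NULL) {
  df <- as.data.frame(data)[, cols, drop = FALSE]  # plain data.frame, not a tibble
  df <- df[complete.cases(df), , drop = FALSE]     # drop rows containing any NA
  if (!is.null(n)) df <- utils::head(df, n)        # optionally keep the first n rows
  df <- droplevels(df)                             # drop factor levels no longer present
  rownames(df) <- NULL                             # reset row names after subsetting
  df
}

# Toy data standing in for data_2_reshaped (values are invented).
toy <- data.frame(
  yellow_cards     = c(3L, 3L, 1L, 2L, 4L, 2L),
  yellow_card_lag1 = c(NA, 3L, 3L, 1L, 2L, 4L),
  referee          = factor(c("A", "B", "C", "A", "B", "C"),
                            levels = c("A", "B", "C", "D"))
)

small <- prepare_subset(toy, c("yellow_cards", "yellow_card_lag1", "referee"))
str(small)
```

The resulting data frame can then be passed as `data` to the same `distRforest::rforest()` call, for example `prepare_subset(data_2_reshaped, c("yellow_cards", "yellow_card_lag1", "referee"), n = 100)`.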