Machine learning algorithms to predict NHL fantasy statistics

Method

Broadly, our method is as follows:

  1. Shape the data
  2. Factor analysis
  3. Preliminary models
  4. Model ensembling
  5. Model testing

Shape the data

Before analysis can begin, we need a clean, functional set of data. This entails converting many of the statistics to per-game numbers, stripping out the seasons that won't enter the model, and converting each remaining player to a single observation with several seasons' data.
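
As a rough illustration of this shaping step, the sketch below converts counting statistics to per-game rates, drops seasons outside the modeling window, and folds each player's rows into one wide record. It is written in plain Python purely for illustration, not in the R used for the analysis; the field names (`player`, `season`, `games`, `goals`, `assists`), the sample numbers, and both helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical long-format input: one row per player per season.
rows = [
    {"player": "A", "season": 2011, "games": 70, "goals": 7, "assists": 14},
    {"player": "A", "season": 2012, "games": 80, "goals": 40, "assists": 20},
    {"player": "A", "season": 2013, "games": 82, "goals": 41, "assists": 41},
    {"player": "B", "season": 2012, "games": 40, "goals": 10, "assists": 30},
    {"player": "B", "season": 2013, "games": 50, "goals": 25, "assists": 10},
]


def to_per_game(row, count_stats=("goals", "assists")):
    """Return a new row with each counting stat divided by games played."""
    games = row["games"]
    out = {"player": row["player"], "season": row["season"]}
    for stat in count_stats:
        # Guard against players with zero games played.
        out[stat + "_pg"] = row[stat] / games if games else 0.0
    return out


def to_wide(rows, seasons):
    """Keep only the requested seasons and fold each player into one record."""
    wide = defaultdict(dict)
    for row in rows:
        if row["season"] not in seasons:
            continue  # strip seasons that will not enter the model
        for key, value in row.items():
            if key not in ("player", "season"):
                wide[row["player"]][f"{key}_{row['season']}"] = value
    return dict(wide)


per_game = [to_per_game(r) for r in rows]
wide = to_wide(per_game, seasons={2012, 2013})
```

Each value of `wide` is then a single observation per player, with keys such as `goals_pg_2012` and `goals_pg_2013` that can serve as predictor or outcome columns.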

Factor analysis

We begin by building several random forest models on all potential predictors. We then use the importance measures built into R to draw a candidate set of factors from the full pool. This much is automated; we then review the suggested factors and their importance metrics by hand to select the final factor set.
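
To make the idea of an importance measure concrete, here is a minimal sketch of permutation importance: shuffle one predictor at a time and record how much the prediction error grows. This is one measure commonly reported for random forests; whether it is among the measures used here is an assumption. The sketch is in plain Python for illustration, and the `predict` function is a hypothetical stand-in for a fitted model rather than a real random forest. All feature values are made up.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Average increase in MSE when each column of X is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted.
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(y, [predict(r) for r in shuffled]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical predictors: prior goals per game, shots per game, jersey number.
X = [
    [0.10, 1.5, 12.0],
    [0.20, 2.0, 91.0],
    [0.30, 2.5, 8.0],
    [0.40, 3.0, 29.0],
    [0.50, 3.2, 87.0],
    [0.60, 3.5, 19.0],
]


def predict(row):
    """Stand-in for a fitted model; it ignores the jersey number entirely."""
    return 0.8 * row[0] + 0.05 * row[1]


y = [predict(row) for row in X]  # outcomes the stand-in model fits exactly
importances = permutation_importance(predict, X, y)
```

A predictor the model never uses (the jersey number here) scores zero, while predictors the model relies on score above zero. Ranking the scores gives a candidate factor list that can then be reviewed by hand.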

Preliminary models

Having chosen a set of factors, we again reshape the data down to what is necessary and build models of several types. The algorithms used include random forests, boosting, k-nearest neighbors, and support vector machines.
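
Of the algorithms listed, k-nearest neighbors is the simplest to show in a few lines. The sketch below is a bare-bones regression version in plain Python, for illustration only; the two features (prior goals per game, shots per game), the training values, and the choice of `k` are all hypothetical, and a real model would normally scale the features first.

```python
import math


def knn_predict(train_X, train_y, query, k=3):
    """Predict the mean outcome of the k training rows closest to query."""
    # Pair each Euclidean distance with its outcome, then sort by distance.
    by_distance = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Hypothetical training data: [prior goals per game, shots per game].
train_X = [[0.2, 2.0], [0.3, 2.5], [0.5, 3.0], [0.6, 3.5], [0.1, 1.0]]
# Hypothetical outcome: goals per game in the following season.
train_y = [0.25, 0.30, 0.45, 0.60, 0.10]

prediction = knn_predict(train_X, train_y, query=[0.55, 3.2], k=2)
```

With `k=2`, the two nearest players are those with outcomes 0.45 and 0.60, so the prediction is their mean.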

Model ensembling

We combine the preliminary models in various ways, looking for a meta-model that reliably outperforms the individual base models.
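
One simple way to combine models is a weighted average of their predictions, with the weight chosen by a grid search on held-out data. The plain-Python sketch below shows that idea for two models. It illustrates one possible combination scheme, not necessarily the one used here; the predictions, the outcomes, and the use of mean absolute error are assumptions.

```python
def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def blend(preds_a, preds_b, weight):
    """Weighted average: weight on model A, (1 - weight) on model B."""
    return [weight * a + (1 - weight) * b for a, b in zip(preds_a, preds_b)]


def best_blend_weight(preds_a, preds_b, y_true, steps=10):
    """Grid-search the blend weight in [0, 1] that minimises MAE."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda w: mae(y_true, blend(preds_a, preds_b, w)))


# Hypothetical held-out outcomes and two models' predictions for them.
y_valid = [0.5, 0.3, 0.7, 0.2]
preds_a = [0.6, 0.4, 0.8, 0.3]  # model A runs consistently high
preds_b = [0.4, 0.2, 0.6, 0.1]  # model B runs consistently low

best_w = best_blend_weight(preds_a, preds_b, y_valid)
blended = blend(preds_a, preds_b, best_w)
```

Because the two hypothetical models err in opposite directions, the blend beats either one alone. The weight should be chosen on data that neither base model was trained on, otherwise the meta-model's advantage may be overstated.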

Model testing

Having built several models, we then examine how they perform outside the training set. We reshape the data again to predict the same outcome in a different season. As a baseline, we also include a naive model that assumes each player repeats exactly the outcome from the previous season. If the models do not perform to our standards, we tune the ensembling parameters, tune the parameters of the preliminary models, and try different combinations of factors to home in on the signal.
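
The comparison against the naive model can be sketched as follows, in plain Python for illustration. The player outcomes, the model predictions, and the choice of mean absolute error as the metric are all hypothetical; the point is only that a model is worth keeping if it beats "same as last season" on a season it has not seen.

```python
def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical goals per game for three players in consecutive seasons.
last_season = {"A": 0.50, "B": 0.25, "C": 0.40}
this_season = {"A": 0.55, "B": 0.20, "C": 0.45}

# Hypothetical model predictions for this season.
model_preds = {"A": 0.53, "B": 0.22, "C": 0.44}

players = sorted(this_season)
actual = [this_season[p] for p in players]

# Naive model: each player repeats last season's outcome.
naive = [last_season[p] for p in players]
model = [model_preds[p] for p in players]

naive_mae = mae(actual, naive)
model_mae = mae(actual, model)
beats_naive = model_mae < naive_mae
```

If `beats_naive` is false, the model adds nothing over carrying last season's numbers forward, which is the signal to revisit the factors or parameters.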