By Navarun Jain, Lux Actuaries
Say you’ve just received data for motor insurance pricing for a new company. You want to model claim counts, but you can’t tell which rating factors really affect counts just by looking at the data, because there are 2 million records. You could certainly run a sensitivity test on each variable to find out how much claim counts change when that variable changes slightly. But testing one variable at a time is not enough: you would need to look at combinations of rating factors. What if there are 8 possible on/off (or yes/no or 1/0) rating factors? That would mean up to 255 possible combinations of rating factors to test (2^8 - 1, leaving out the empty combination). More, if rating factors have greater depth, such as age. Sadly, our clients cannot pay actuarial rates for the 10 years of analysis required to consider every combination of rating factors in depth. But we still need to answer the question.
So we need a way to model claim counts using a combination of predictor variables. We need a model that shows how a small change in each variable affects the target variable (in this case, claim counts). In other words, we need a regression model. Thanks to modern statistics, we’ve got many such regression models! The one most actuaries currently use is the GLM.
Section 1: GLM’s
As the name suggests, Generalised Linear Models or GLM’s are an extension of the classical linear model, which is represented by the following equation:
y = α0 + α1x1 + α2x2 + … + ε
This model has two components – a random component, which refers to the probability distribution of the response variable, and a systematic component, which specifies the linear combination of the explanatory variables used in the predictor. In the case of a linear model, the systematic component is given by α0 + α1x1 + α2x2 + …, where the x’s represent the explanatory variables and the α’s represent their coefficients. The two components are connected by the link function, which specifies how the expected value of the random component relates to the systematic component.
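In symbols, the pieces described above can be sketched as follows. This is the standard textbook form written with the α notation from the text; the symbols μ, η and g are introduced here purely for illustration.

```latex
\begin{aligned}
\text{Random component:}\quad & Y \sim \text{a chosen distribution (e.g. Normal, Poisson)},\qquad \mathbb{E}[Y] = \mu \\
\text{Systematic component:}\quad & \eta = \alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \dots \\
\text{Link function:}\quad & g(\mu) = \eta
\end{aligned}
```

The classical linear model is the special case with a Normal random component and the identity link, g(μ) = μ. A claim count model would typically pair a Poisson random component with the log link, so that log μ = η.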
GLM’s are one of the simplest statistical models. The best thing about them is that they are transparent. They allow us to include many predictors in one model and consider all of their effects simultaneously. No more testing 255 different models!
However, a GLM is still basic. It relies on us specifying an underlying distribution. Now who’s to say that we’d always know whether a certain regression model should be Poisson or, say, binomial? What happens if we get the distribution wrong? The model could fit poorly and give unreliable predictions, and our clients can’t risk that. That’s not all though. How do we know what combination of variables best explains the target variable? A GLM can give us an estimate of how well it fits the data, but on its own there’s nothing to compare that estimate to. Ideally, we need to run multiple GLM’s and compare their outputs to decide which model works best.
So consider this – what if you could train a computer to do the following:
- decide what variables best predict claim counts,
- construct the best model that explains claim counts based on those variables, and
- accurately predict claim counts for any combination of those variables that you specify?
And what if you could do this for any variable you want to model? All you would have to do is input the data, and a computer takes care of the rest. And what’s more, you don’t need to be a rocket scientist to build such systems!
How do we do this, you ask?
Section 2: Enter Machine Learning.
It’s exactly what it sounds like – machines learn from the data we give them. There are two parts, or phases, to this – training and testing. The training phase is when we give the system a dataset and tell it how to learn, by specifying the algorithm to follow. The testing phase is when the system gets a chance to try out what it has learnt on new data. Think of it like teaching a 5-year-old how to add two numbers. First, we show them how it’s done. This is analogous to the training phase. Once they have learnt how to add, we give them different sets of numbers and they add them on their own. This is like the testing phase.
However, there’s one difference – in the case of the child, we tell them, for example, that 1+2 = 3 and 3+2 = 5. In the case of machine learning, we’d just input the following data:
Now all we do is tell the system that the result is X3, and that it is predicted by some model based on X1 and X2. Our algorithm then enables the system to independently analyze the data and find out what the relationship between X1, X2 and X3 is. It independently deduces that X3 = X1 + X2. Now we can give it, say, the following data:
The system will now be able to “fill in the blanks”, i.e., accurately predict the value of X3.
The first dataset is known as the training dataset, because this is what the system uses to deduce the best model based on the data. The second dataset is known as the test dataset, because the system tests the model on this dataset.
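The addition example can be sketched in a few lines of Python. This is only an illustration: the rows below are made-up values (the article's own tables are not reproduced here), and ordinary least squares stands in for "the algorithm". The helper name `fit_two_weights` is hypothetical.

```python
def fit_two_weights(rows):
    """Least-squares fit of x3 ~ w1*x1 + w2*x2 (no intercept) via the normal equations."""
    s11 = sum(x1 * x1 for x1, x2, x3 in rows)
    s12 = sum(x1 * x2 for x1, x2, x3 in rows)
    s22 = sum(x2 * x2 for x1, x2, x3 in rows)
    s1y = sum(x1 * x3 for x1, x2, x3 in rows)
    s2y = sum(x2 * x3 for x1, x2, x3 in rows)
    det = s11 * s22 - s12 * s12
    # Cramer's rule for the 2x2 system
    w1 = (s1y * s22 - s12 * s2y) / det
    w2 = (s11 * s2y - s12 * s1y) / det
    return w1, w2

# Hypothetical training data: (X1, X2, X3), where X3 happens to equal X1 + X2
train_rows = [(1, 2, 3), (3, 2, 5), (4, 1, 5), (2, 6, 8)]
w1, w2 = fit_two_weights(train_rows)

# Hypothetical test data: X3 is unknown and must be predicted
test_rows = [(5, 7), (10, 3)]
predictions = [w1 * x1 + w2 * x2 for x1, x2 in test_rows]
print(w1, w2, predictions)
```

The system is never told that the rule is addition. It recovers weights of 1 on both X1 and X2 from the training rows and then applies them to the test rows.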
At a high level, machine learning works like this – data goes into a black box, where it gets transformed, and predictions are generated from it. The “black box” here is the algorithm. Each algorithm defines a specific set of processes and manipulations that can be applied to the input data. These processes span a very wide range of complexity, from simple statistical methods such as linear regression models, through Decision Trees, to artificial intelligence algorithms such as Artificial Neural Networks. In this article, I will explain the high-level theory behind two such processes – Random Forests and Artificial Neural Networks (ANN’s).
Section 3: Data Preparation
Before these models are built, we need to make sure that we understand the data. It is, therefore, highly recommended that the following checks be performed on the data:
- The dataset is fairly large (has a large number of cases)
- All fields, especially the ones corresponding to the potential predictors, are completely and accurately populated. Any ‘Not Applicables’ need to be omitted or replaced. Further, all anomalies need to be eliminated. For example, filter out the cases for which age or exposure is negative.
- We must make sure that we do not have too many possible values for any predictor. For example, age is a continuous variable that can take any numerical value. Passing this into a model will overcomplicate it, because it has too many unique values, thereby making the model too large to reconcile. Hence, it is better to use age bands instead of ages.
- The most important check is making sure any categorical variables (‘red’, ‘blue’) being passed into the model can be interpreted by the system. We must be careful in dealing with such variables, especially those with many levels. GLM’s can work with categorical variables, but Random Forests and ANN’s cannot always interpret them (how much is ‘red’?). Hence, it is always recommended to transform them into binary (0/1) indicator variables, one for each level.
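The checks above can be sketched roughly as follows. The field names (`age`, `exposure`, `colour`), the band edges and the sample records are all assumptions made for illustration, not taken from the data used in this article.

```python
AGE_BANDS = ["<25", "25-39", "40-59", "60+"]   # illustrative band edges
COLOURS = ["red", "blue"]                      # illustrative categorical levels

def age_band(age):
    """Map an exact age to one of a handful of bands."""
    if age < 25:
        return "<25"
    if age < 40:
        return "25-39"
    if age < 60:
        return "40-59"
    return "60+"

def one_hot(prefix, value, levels):
    """Turn one categorical value into 0/1 indicator columns, one per level."""
    return {f"{prefix}_{level}": int(value == level) for level in levels}

def prepare(records):
    cleaned = []
    for rec in records:
        # Drop records with missing fields
        if any(rec.get(key) is None for key in ("age", "exposure", "colour")):
            continue
        # Drop anomalies such as negative age or exposure
        if rec["age"] < 0 or rec["exposure"] < 0:
            continue
        row = {"exposure": rec["exposure"]}
        row.update(one_hot("age", age_band(rec["age"]), AGE_BANDS))
        row.update(one_hot("colour", rec["colour"], COLOURS))
        cleaned.append(row)
    return cleaned

# Hypothetical raw records
raw = [
    {"age": 23, "exposure": 1.0, "colour": "red"},
    {"age": -4, "exposure": 1.0, "colour": "blue"},   # anomaly: negative age
    {"age": 45, "exposure": None, "colour": "red"},   # incomplete record
    {"age": 61, "exposure": 0.5, "colour": "blue"},
]
rows = prepare(raw)
print(rows)
```

Every predictor that reaches the model is now numeric, and age contributes four indicator columns instead of dozens of distinct values.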
Section 4: Decision Trees and Random Forests
To understand how a decision tree works, consider this: you’re trying to figure out how to spend your weekend. You’d consider all your possible choices, and then look into the specifics of each one. For example, will you go to the movies or go to a nice restaurant? If you choose movies, what kind of movie? You’ve got, say, action, comedy and romance to choose from. If you go to a restaurant, what kind of food do you want? East? West? If East then Indian? Chinese? Thai? If Indian then Veg or Meat? And so on.
The idea here is that, at every level, we make a choice, and every choice we make narrows our options down until we get the one that best fits all our choices. Decision trees work the exact same way. They involve stratifying and segmenting the predictor space into distinct subsets. This is done using a series of splitting rules. For example, consider a dataset used to predict salaries of baseball players based on a variety of factors, say number of years played and number of hits. Passing this data into a decision tree model may yield the following:
Fig.1: Example of a Decision Tree Model
Here, the top split assigns `Years < 4.5` to the left branch. For `Years >= 4.5`, there is a second split on `Hits`. The tree assigns `Hits < 117.5` to the left branch, and the rest to the right branch. The predicted salary at the end of each branch is the mean response value of the training cases that end up there.
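A tree like this is just a set of nested if-statements plus a mean per leaf. The sketch below hard-codes the two splitting rules described above. The player records and the resulting leaf means are made-up values, not the ones from the figure.

```python
from statistics import mean

def leaf_of(years, hits):
    """Apply the two splitting rules and return the leaf a player falls into."""
    if years < 4.5:
        return "Years < 4.5"
    if hits < 117.5:
        return "Years >= 4.5, Hits < 117.5"
    return "Years >= 4.5, Hits >= 117.5"

def fit_leaf_means(players):
    """Group training salaries by leaf and take the mean response in each leaf."""
    groups = {}
    for years, hits, salary in players:
        groups.setdefault(leaf_of(years, hits), []).append(salary)
    return {leaf: mean(salaries) for leaf, salaries in groups.items()}

def predict(leaf_means, years, hits):
    return leaf_means[leaf_of(years, hits)]

# Hypothetical training data: (years, hits, salary in thousands)
players = [
    (2, 90, 100), (3, 130, 140),
    (6, 100, 400), (8, 110, 500),
    (7, 150, 900), (10, 160, 1100),
]
leaf_means = fit_leaf_means(players)
print(predict(leaf_means, 3, 200), predict(leaf_means, 9, 100), predict(leaf_means, 9, 150))
```

In a real model the split points (4.5 and 117.5 here) are also learnt from the data, by searching for the splits that reduce prediction error the most.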
Random Forests extend this simple model by using a collection of decorrelated decision trees. Each tree is built on a random sample of the cases and considers only a random subset of the predictors when splitting, so that no 2 trees are the same, hence the term Random Forest. The forest’s prediction combines the predictions of all its trees (for a numerical target, by averaging them).
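The idea can be sketched with a deliberately tiny forest. To keep the example short, each "tree" here is a single split (a stump) on one randomly chosen predictor, whereas real implementations grow much deeper trees and draw a fresh random subset of predictors at every split. The data and function names are hypothetical.

```python
import random
from statistics import mean

def fit_stump(rows, feature):
    """Best single split on one feature, chosen to minimise squared error."""
    best = None
    values = sorted({x[feature] for x, _ in rows})
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in rows if x[feature] < threshold]
        right = [y for x, y in rows if x[feature] >= threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # The feature is constant in this sample: always predict the overall mean
        overall = mean(y for _, y in rows)
        return (feature, float("inf"), overall, overall)
    return (feature, best[1], best[2], best[3])

def fit_forest(rows, n_trees, n_features, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]   # bootstrap sample of the cases
        feature = rng.randrange(n_features)         # random predictor for this tree
        forest.append(fit_stump(sample, feature))
    return forest

def predict(forest, x):
    """Average the predictions of all trees."""
    return mean(left if x[f] < t else right for f, t, left, right in forest)

# Hypothetical data: ((driver age, vehicle age), claim frequency)
rows = [((22, 1), 0.30), ((25, 8), 0.28), ((31, 3), 0.18), ((40, 2), 0.12),
        ((47, 10), 0.15), ((55, 5), 0.10), ((63, 7), 0.11), ((70, 4), 0.14)]
forest = fit_forest(rows, n_trees=25, n_features=2)
prediction = predict(forest, (24, 6))
print(prediction)
```

Because every tree sees a different sample and a different predictor, their individual errors tend to differ, and averaging them gives a steadier prediction than any single tree.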
Random Forest models are useful in that they give us a visual idea of which variables are the strongest predictors in the data. Variable importance is measured using the Gini Index, which is an estimate of impurity: how mixed the outcomes are within the group of cases produced by a split. A variable counts as important if splitting on it reduces impurity substantially, added up over all the splits in all the trees that use it. We tested this using a model where we tried to predict motor claim frequency using driver age, vehicle age, engine capacity, No Claims Discount (NCD), branch and an agency repair indicator. The variable importance plot for this is shown below:
Fig.2: Variable Importance Plot
Thus, driver age was the strongest predictor, while engine capacity was the weakest.
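For a categorical outcome, the Gini Index behind such a plot can be computed in a few lines. The sketch below uses made-up labels and shows the quantity usually accumulated per variable, the decrease in impurity achieved by a split. It is not the calculation that produced Fig.2.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure group, larger for a more mixed group."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its two children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

# Hypothetical node: 4 policies with a claim, 4 without
parent = ["claim"] * 4 + ["no claim"] * 4

# A split that separates the two groups perfectly
perfect = gini_decrease(parent, ["claim"] * 4, ["no claim"] * 4)

# A split that leaves both children as mixed as the parent
useless = gini_decrease(parent, ["claim", "claim", "no claim", "no claim"],
                        ["claim", "claim", "no claim", "no claim"])
print(gini(parent), perfect, useless)
```

A variable whose splits look like the first case across many trees ends up at the top of the importance plot, and one whose splits look like the second ends up at the bottom.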
Random Forests are often hailed by computer science enthusiasts as being the “worry-free” approach to predictive modelling. The model trains fast and requires very little parameter tuning. This is, however, both a strength and a weakness. It’s a strength because it makes the model easy to train. On the other hand, the theory behind it is very complex, and hence it is hard to make sense of Random Forest outputs without spending a lot of time interpreting outcomes.
So, we’ve seen 2 different approaches to predictive modelling, both of which are useful and work fairly well for reasonably-sized datasets. GLM’s give us information on the model specifications, but they are restricted to the distributions that we must specify. Random Forests are quite sophisticated models, but they are hard to explain because the information about the model specs is so complex. And if your data contains as many as a few hundred thousand records, Random Forests can become unsustainable, since the fitted forest may grow too large for a typical statistical computing program to handle comfortably.
At this point the question becomes – is there a model that can beat all other models? Is there a model that is more sophisticated than Random Forests and gives us more information than GLM’s? Can this model be more accurate than both GLM’s and Forests, and still work even if I had 500,000 or 2,000,000 rows of data?
Section 5: Yes, there is! They’re called Neural Networks…
…or NN’s for short. Neural Networks are complex computing systems modelled on the human brain. They were, in fact, among the earliest algorithms that led to the development of Artificial Intelligence and modern robotics. The best thing about NN’s is that they can be visualized easily. The topology of a network gives us a very clear picture of what happens to data once it is passed through the model.
Consider the example we used for Random Forests. Training a neural network based on the same variables yields the following:
Fig. 3: An Example of a Neural Network
Each bubble you see there is called a neuron. There are 4 layers of neurons in the above model – the input layer (which is the leftmost layer), two hidden layers (with 10 and 4 neurons respectively), and finally, the output layer (on the right). Each of the predictors is assigned to a neuron at the input layer, where the data for each variable is transformed using what is known as an activation function. This is what enables each neuron to generate an output, which is then passed on to the next layer, where it is transformed again, and so on until it reaches the output layer. At every layer, a bias term is added (shown in blue). The bias terms are determined by the model, based on the relationships between the predictors and the data. They can also be specified externally as constant weights. For example, suppose we’re trying to model claim counts. We know that exposure acts as an offset here. Hence, we would specify exposure as a constant weight, which is supplied externally and added to the weights generated by the network. Thus, neural networks look at patterns within the data. This allows them to detect any kind of hidden relationships between the predictors and the data (and within the predictors themselves) – something GLM’s and Random Forests are not fully capable of doing.
Each neuron receives inputs from the left, and the inputs are multiplied by the weights of the connections as they travel to the right. At every subsequent layer, each unit adds up the weighted inputs it receives. If this sum is greater than a certain threshold (which we can specify), the unit “fires”, and an output is generated which travels to the next layer. There is also an element of feedback involved. At every iteration, the network computes the difference between its predictions and the actual values. Based on how large these differences are, the model adjusts its weights to correct its mistakes in the next iteration. Think of it like bowling – in the first round, you see how many skittles you knocked down. Then you can decide how to bowl in the next round so that you can knock down the most skittles.
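The two mechanisms just described, the left-to-right pass and the feedback, can be sketched as follows. Several things here are assumptions made for illustration: a sigmoid activation is used as a smooth stand-in for the "fires above a threshold" rule, the network is a tiny 2-3-1 one with made-up weights, and the feedback loop adjusts a single neuron by plain gradient descent rather than a full network.

```python
import math

def sigmoid(z):
    """Smooth version of a threshold: close to 0 for low sums, close to 1 for high sums."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def forward(inputs, layers):
    """Pass inputs left to right; each layer is a list of (weights, bias) neurons."""
    activations = inputs
    for layer in layers:
        activations = [neuron(activations, w, b) for w, b in layer]
    return activations

def train_step(inputs, target, weights, bias, rate=0.5):
    """One round of feedback for a single neuron: predict, measure the error, adjust."""
    output = neuron(inputs, weights, bias)
    error = output - target
    # Slope of 0.5 * error**2 with respect to the neuron's weighted sum
    gradient = error * output * (1.0 - output)
    new_weights = [w - rate * gradient * x for w, x in zip(weights, inputs)]
    return new_weights, bias - rate * gradient, error ** 2

# Forward pass through a hypothetical 2-3-1 network
layers = [
    [([0.2, -0.4], 0.1), ([0.7, 0.3], -0.2), ([-0.5, 0.9], 0.0)],   # hidden layer
    [([0.6, -0.1, 0.8], 0.05)],                                     # output layer
]
network_output = forward([1.0, 0.5], layers)

# Feedback loop on one made-up observation with target 0.9
weights, bias, errors = [0.0, 0.0], 0.0, []
for _ in range(200):
    weights, bias, squared_error = train_step([1.0, 0.5], 0.9, weights, bias)
    errors.append(squared_error)
print(network_output, errors[0], errors[-1])
```

The squared error shrinks from round to round, which is the bowling analogy in action: each throw is adjusted in light of how far off the previous one was.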
Neural Networks are, thus, the closest thing we have to fully automated learning. All we do is tell the network what the target and predictor variables are, and how many hidden neurons we want. The system can then automatically detect the relationships and patterns within the data, and generate predictions based on them. It considers the data holistically and not in chunks like Random Forests do.
Section 6: The million-dollar question – How do these models compare?
We modelled motor claim frequencies based on Driver Age, Vehicle Age, Engine Capacity, Branch, No Claim Discount (NCD) and an Agency/Non-Agency indicator. The data was reasonably large. We used the following models:
- A GLM (Poisson regression with log of exposure as offset)
- A Random Forest model with 500 Decision Trees
- A 6-10-6-1 Neural Network.
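For reference, a Poisson regression with log of exposure as offset is usually written as below. This is the standard form, not a record of the exact specification fitted here; the symbols N_i (claim count for policy i), e_i (its exposure) and x_ij (its rating factors) are introduced for illustration.

```latex
\begin{aligned}
N_i &\sim \text{Poisson}(\mu_i) \\
\log \mu_i &= \log e_i + \alpha_0 + \alpha_1 x_{i1} + \alpha_2 x_{i2} + \dots \\
\text{equivalently}\quad \frac{\mu_i}{e_i} &= \exp\left(\alpha_0 + \alpha_1 x_{i1} + \alpha_2 x_{i2} + \dots\right)
\end{aligned}
```

The offset enters with a fixed coefficient of 1, so the model is effectively fitted to claim frequency per unit of exposure.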
The table below shows how these models compared with each other on two counts – running time and Cross Validation MSE (mean squared error). Cross Validation is a process by which we train the model multiple times, each time holding back a different randomly chosen part of the data, and aggregate the prediction errors on the held-back parts over all the iterations. It is one of the most reliable ways of determining how accurately a model predicts new data. In this case, we have run a 10-fold Cross Validation, meaning the data is split into 10 parts and each part is held back once.
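The procedure can be sketched generically as follows. The two "models" compared here (a straight-line fit and a constant mean) and the data are made up for illustration and are much simpler than the three models in this article. `k_fold_mse` is a hypothetical helper, not a library function.

```python
import random

def fit_line(training):
    """Least-squares straight line through (x, y) pairs."""
    n = len(training)
    mean_x = sum(x for x, _ in training) / n
    mean_y = sum(y for _, y in training) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in training)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in training)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict_line(model, x):
    slope, intercept = model
    return slope * x + intercept

def fit_mean(training):
    """Baseline model: always predict the average response."""
    return sum(y for _, y in training) / len(training)

def predict_mean(model, x):
    return model

def k_fold_mse(rows, k, fit, predict, seed=0):
    """Split rows into k folds, hold each out once, and average the squared errors."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    squared_errors = []
    for i, held_out in enumerate(folds):
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = fit(training)
        squared_errors.extend((predict(model, x) - y) ** 2 for x, y in held_out)
    return sum(squared_errors) / len(squared_errors)

# Hypothetical data lying exactly on the line y = 2x + 1
rows = [(float(x), 2.0 * x + 1.0) for x in range(20)]
mse_line = k_fold_mse(rows, 10, fit_line, predict_line)
mse_mean = k_fold_mse(rows, 10, fit_mean, predict_mean)
print(mse_line, mse_mean)
```

Each model is always scored on cases it did not see during fitting, so a lower Cross Validation MSE points to better predictions on new data and not merely a closer fit to the training data.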
The model comparison summary is as follows:
From the above table, it is evident that the Neural Network is the best model, since it has the lowest prediction error. The next best model is the GLM. The Random Forest model does not seem to work best for this data, since it takes relatively longer to run, and it does not seem to generate reliable predictions.
Section 7: Is this for me?
If you’ve stuck with me this far, then you’re probably thinking, “When should I really use such sophisticated models? Is this the direction I need to head in right now?” The answer is – it depends. Remember – DO NOT RUN COMPLICATED MODELS JUST FOR THE SAKE OF IT. Try running these models only if your data can support them. If you’ve only got a few thousand rows, maybe GLM’s will work just fine. Increase the complexity of your model, and you will need more data to back it up. Given that the market is developing and we now have increased volumes of premiums and claims data, it appears that we might have to start considering machine learning methods, as traditional statistical methods are bound to struggle beyond a certain limit. Moreover, the world is moving forward and adapting to the changing data landscape, and models such as Neural Networks and Deep Learning are being heavily researched and invested in, led by companies such as Google and Facebook.
To conclude, I could tell you that it’s not worth the time and effort required. But that’s not entirely true. The need for advanced modelling is increasing at a rapid pace, and we need to start now if we are to be on par with our data. Let’s explore the models we just saw in detail. We might run into some difficulties at first, but trust me when I say that we will get results that are a lot more accurate and comprehensive – and a lot more profitable for our clients and employers! So knock yourselves out, and have fun with the data!