Artificial intelligence is rapidly entering actuarial workflows. Tools like ChatGPT, Claude, Gemini, and Gemma can now assist with coding, model interpretation, technical writing, and governance documentation. That raises an important question for actuaries, pricing teams, and decision-makers:
Can AI improve actuarial modelling in a way that is technically correct, commercially useful, and defensible?
To test this properly, I ran a practical experiment on a public motor insurance dataset of roughly 678,000 policies, comparing a GLM against a gradient boosting machine (GBM) on the same claim-frequency task.
I then asked multiple AI systems to interpret the results using the same core prompt.
The outcome was more interesting than a simple "AI versus traditional modelling" story.
The task was to model claim frequency on public motor insurance data.
The workflow included a train/validation/test split and explicit handling of policy exposure.
**Technical note:** exposure handling matters. In actuarial frequency modelling, exposure is not optional; if it is mishandled, the comparison becomes unfair immediately. A GLM incorporates the log-exposure offset naturally. A GBM does not do so by construction and must approximate the same logic through modelling choices. That distinction became one of the most important technical insights in the exercise.
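To make the offset point concrete, here is a minimal sketch on synthetic data (the variable names and coefficient values are illustrative, not taken from the actual study): a Poisson GLM with a log-exposure offset, fitted by Newton-Raphson in plain NumPy so the role of the offset is explicit.

```python
import numpy as np

# Synthetic portfolio: exposure is the fraction of a policy-year in force.
rng = np.random.default_rng(42)
n = 20_000
age = rng.uniform(18, 80, n)
exposure = rng.uniform(0.1, 1.0, n)
X = np.column_stack([np.ones(n), (age - 40.0) / 10.0])  # intercept + scaled age
beta_true = np.array([-2.0, -0.3])                      # illustrative values
y = rng.poisson(np.exp(X @ beta_true) * exposure)       # counts scale with exposure

# Poisson GLM with log link: E[y] = exp(X @ beta + log(exposure)).
# The offset enters with a fixed coefficient of 1, so beta describes
# the claim *rate per policy-year*, exactly as actuarial theory expects.
offset = np.log(exposure)
beta = np.zeros(2)
for _ in range(25):                                     # Newton-Raphson / IRLS
    mu = np.exp(X @ beta + offset)                      # fitted mean counts
    grad = X.T @ (y - mu)                               # score vector
    hess = X.T @ (X * mu[:, None])                      # Fisher information
    beta = beta + np.linalg.solve(hess, grad)

print(beta)  # recovers estimates close to beta_true
```

A GBM has no such slot for an offset by default; the same logic has to be supplied through the library's raw-score initialisation or through target construction.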
| Model | Validation Deviance | Test Deviance |
|---|---|---|
| GLM | 0.3185 | 0.3219 |
| GBM | 0.3590 | 0.3624 |
Lower deviance indicates better fit.
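For reference, the Poisson deviance behind figures like these can be computed in a few lines of NumPy. This is a generic implementation, not the exact code used in the study:

```python
import numpy as np

def mean_poisson_deviance(y, mu):
    """Mean Poisson deviance between observed counts y and fitted means mu.

    Lower is better; assumes mu > 0. The y * log(y / mu) term is taken
    as 0 when y == 0, its limiting value.
    """
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    ratio = np.where(y > 0, y / mu, 1.0)   # log(1) = 0 covers the y == 0 case
    return float(np.mean(2.0 * (y * np.log(ratio) - (y - mu))))

y = np.array([0, 1, 2, 0, 1])
print(mean_poisson_deviance(y, np.array([0.4, 0.8, 1.9, 0.2, 1.1])))  # small
print(mean_poisson_deviance(y, np.full(5, 5.0)))                      # much larger
```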
The result was clear:
The GLM outperformed the GBM by approximately 12–13%.
That is not a rounding error. It is a material gap.
Just as importantly, both models showed relatively tight validation-to-test consistency. That suggests the result is not primarily a story of catastrophic overfitting. The more likely interpretation is that the GLM was well aligned to the structure of the problem, while the GBM did not extract enough additional signal to justify its extra flexibility.
Insurance frequency data is sparse. In this dataset, around 95% of policies had no claims.
That means the informative events are rare: most policies contribute no claim signal, so the effective amount of information is far smaller than the row count suggests.
In that setting, a simpler model often has an advantage. A GLM imposes structure and reduces variance. A GBM is more flexible, but flexibility only helps if there is meaningful additional structure to discover.
When signal is limited, extra complexity does not create insight. It often just creates variance.
Motor insurance pricing problems are often well described by multiplicative, approximately log-linear relationships between rating factors and claim frequency.
That is exactly where GLMs are strongest.
If the data-generating process is approximately log-linear, then a GLM is not merely a legacy benchmark. It is a model class that is structurally aligned to the problem.
This is one of the biggest reasons "traditional" actuarial models continue to remain highly competitive in real pricing environments.
One of the clearest points raised by Gemma was the role of parsimony.
A GLM is constrained. That is often presented as a limitation. In actuarial work, it is frequently an advantage.
Parsimony gives you stability, interpretability, and models that are easier to govern and defend.
In other words, simplicity is not the opposite of sophistication. In actuarial modelling, simplicity is often a disciplined form of regularisation.
One of the most important observations from the AI comparison, especially from the corrected same-prompt Claude response, is that a GLM handles exposure correctly by design through the log offset, whereas a GBM must approximate that relationship through features or target construction.
That difference is not cosmetic. It goes to the heart of frequency modelling.
A well-specified GLM estimates a rate in the way actuarial theory expects. A GBM can still be built to compete, but the modeller must encode exposure deliberately, for example as a log-exposure offset on the raw score, as a weighted rate target, or as an input feature.
This means some of the observed advantage may reflect both model suitability and the fact that actuarial structure is naturally embedded in the GLM framework.
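The two standard GBM constructions are, reassuringly, equivalent at the level of the loss: modelling counts with a log-exposure offset added to the raw score yields the same per-row Poisson gradient as modelling the claim rate with exposure as a case weight. A quick numerical check, on synthetic numbers chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.poisson(0.2, size=1_000).astype(float)   # claim counts
e = rng.uniform(0.1, 1.0, size=1_000)            # exposures (policy-years)
f = rng.normal(0.0, 0.5, size=1_000)             # arbitrary raw boosting scores

# (a) counts target, with log(exposure) added to the raw score as an offset
grad_offset = np.exp(f + np.log(e)) - y
# (b) rate target y / e, with exposure as the Poisson case weight
grad_weighted = e * (np.exp(f) - y / e)

assert np.allclose(grad_offset, grad_weighted)   # identical up to rounding
```

This is why boosting libraries expose the same idea under different names (for example a raw-score initial margin versus sample weights on a rate target). Either route can reproduce the GLM's exposure logic, but only if the modeller wires it in deliberately.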
A fair critique is that the GBM may have been under-tuned.
That is true, and it should be acknowledged.
Hyperparameters such as the learning rate, tree depth, number of boosting rounds, and subsampling fractions can materially affect boosting performance.
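The tuning loop that critique implies can be sketched as follows. Here `val_deviance` is a hypothetical stand-in for "train a GBM with these settings and return validation deviance"; in practice it would wrap whatever boosting library is in use, and the response surface below is entirely made up for illustration.

```python
import itertools

def val_deviance(learning_rate, max_depth, n_trees):
    # Hypothetical, purely illustrative response surface standing in for
    # an actual train-and-validate run of a boosting model.
    return (0.36
            + 0.5 * (learning_rate - 0.05) ** 2   # sweet spot near 0.05
            - 0.02 / max_depth                    # on this surface, shallow wins
            + 1e-5 * abs(n_trees - 400))          # diminishing returns past 400

grid = itertools.product(
    [0.01, 0.05, 0.10],   # learning rate
    [2, 4, 6],            # tree depth
    [200, 400, 800],      # boosting rounds
)
best = min(grid, key=lambda params: val_deviance(*params))
print(best)  # (0.05, 2, 400) on this toy surface
```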
But that does not make the exercise meaningless. It actually strengthens the practical lesson.
In real actuarial and business settings, model complexity has a cost: tuning effort, governance burden, explainability work, and ongoing maintenance.
So the relevant question is not only whether a GBM can eventually be tuned to beat a GLM. It is whether the extra complexity is justified in the context of the problem.
In this exercise, the answer was clearly not yet.
Are GLMs still competitive, then? Yes. More than many industry conversations suggest.
There is a strong narrative in the market that machine learning will automatically outperform GLMs in pricing. That is too simplistic.
A more accurate statement is:
Machine learning outperforms GLMs when the data is rich enough, large enough, and complex enough to reward the extra flexibility.
Where that is not true, a well-built GLM can remain the stronger model.
That is especially likely when claims data is sparse, the underlying structure is approximately multiplicative, and the available predictors are classical tabular rating factors.
So this result should not be read as "ML is worse than GLM." It should be read as:
In this setting, the GLM was better aligned to the structure and signal content of the problem.
Machine learning has a genuine edge when one or more of the following are true:

- **Scale.** Very large datasets make it easier for boosting models and other ML methods to reliably detect patterns that would otherwise be unstable.
- **Complex risk behaviour.** If risk bends, shows thresholds, reverses, or interacts in ways that are hard to pre-specify, ML becomes much more attractive.
- **Rich data sources.** Telematics, geospatial variables, weather, behavioural data, and other high-dimensional external data are natural environments for ML.
- **Many weak predictors.** When many predictors individually add little but collectively matter, GBMs can aggregate signal more effectively than a manually specified GLM.
In some environments, marginal lift is commercially decisive. In others, governance, tariff translation, and regulatory defensibility matter more.
That distinction should always shape model choice.
After building the models, I asked four AI systems to interpret the same core result.
All four converged on the same high-level conclusion:
The GLM outperformed the GBM, and that outcome is plausible in actuarial pricing.
What differed was not the answer. It was the emphasis.
ChatGPT was strongest on the actuarial framing. Its best contribution was explaining why a GLM can be the right model class, not just a baseline.
Gemini leaned more heavily into the case for machine learning. Its value was in reminding us not to dismiss ML too quickly just because the first challenger lost.
Once Claude was re-run using the same prompt as the other models, its output became less "executive summary" and more technically comparable to the others. The earlier draft was more polished and consulting-oriented, but it was based on a different prompt, so it was not a like-for-like benchmark.
Gemma was strongest on first principles such as parsimony. It was less actuarially nuanced than ChatGPT or Claude, but it added a very useful layer of conceptual discipline.
The most important lesson is not that one AI model "won."
It is that different AI systems can enhance actuarial workflows in different, complementary ways.
That leads to a broader conclusion:
AI does not replace actuarial judgement. It changes how actuarial judgement can be applied.
Used well, AI can improve how models are built, tested, challenged, and explained.
But none of the models removed the need for actuarial judgement, domain context, and accountability for the final decision.
That is exactly where the actuary remains central.
A rigorous actuarial review should acknowledge the following limitations.
- **The GBM was not exhaustively tuned.** The exercise therefore does not prove that a tuned GBM could not close some of the gap.
- **The comparison used a specific train/validation/test workflow.** A more extensive cross-validation framework could add robustness.
- **Deviance was the primary metric.** That is appropriate, but not sufficient on its own; calibration, portfolio adequacy, and ranking stability also matter.
- **The modelling exercise focused on frequency.** A full pricing framework may also include severity and pure premium comparisons.
- **AI interpretation quality depends heavily on prompt design.** This became especially clear with Claude: different prompts produced meaningfully different styles of output.
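The point that deviance alone is not sufficient is easy to operationalise. One common supplementary check is actual-versus-expected by predicted-risk band; the sketch below uses synthetic predictions and outcomes purely to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000
pred = rng.lognormal(mean=-2.5, sigma=0.5, size=n)  # predicted claim frequency
claims = rng.poisson(pred)                          # simulated "actual" counts

# Sort policies by predicted risk and split into ten equal bands.
order = np.argsort(pred)
ae_by_decile = []
for i, idx in enumerate(np.array_split(order, 10), start=1):
    ae = claims[idx].sum() / pred[idx].sum()        # actual / expected claims
    ae_by_decile.append(ae)
    print(f"decile {i:2d}: A/E = {ae:.3f}")

# A well-calibrated model keeps A/E close to 1 in every band; systematic
# drift across deciles would flag a calibration or ranking problem that
# an aggregate deviance figure can hide.
```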
The biggest result was not simply that the GLM beat the GBM.
It was this:
The value of AI in actuarial work is not in replacing models. It is in improving how models are built, tested, challenged, and explained.
And on the modelling side:
The best model is not the most fashionable one. It is the one that performs well, matches the structure of the problem, and can be defended.
Sometimes that will be a machine-learning model.
Often, it will still be a GLM.
That is not a concession. It is a result.
Wizard & Co. provides independent actuarial and data-driven advisory support.
If you are reviewing your pricing methodology, considering machine learning, or trying to understand where AI genuinely adds value in actuarial work, get in touch.
Wizard & Co. helps organisations combine actuarial rigour with modern execution.
**Get in touch.** Let's discuss how Wizard & Co. can support your actuarial and pricing objectives.