When the Classic Beats the Machine | Wizard & Co. Actuarial Insights

When the Classic Beats the Machine: What AI Reveals About Actuarial Modelling

By Wizard & Co. | April 2026

Introduction

Artificial intelligence is rapidly entering actuarial workflows. Tools like ChatGPT, Claude, Gemini, and Gemma can now assist with coding, model interpretation, technical writing, and governance documentation. That raises an important question for actuaries, pricing teams, and decision-makers:

Can AI improve actuarial modelling in a way that is technically correct, commercially useful, and defensible?

To test this properly, I ran a practical experiment using a public motor insurance dataset of roughly 678,000 policies. The exercise compared:

  • A Poisson Generalised Linear Model (GLM) as the actuarial benchmark
  • A Gradient Boosting Machine (GBM) as the machine-learning challenger

I then asked multiple AI systems to interpret the results using the same core prompt.

The outcome was more interesting than a simple "AI versus traditional modelling" story.

The Modelling Exercise

The task was to model claim frequency on public motor insurance data.

The workflow included:

  • Identifying the claim count target
  • Using Exposure as time at risk
  • Fitting a Poisson GLM with a log exposure offset
  • Fitting a GBM challenger
  • Comparing performance using out-of-sample deviance

Technical Note:

This detail matters. In actuarial frequency modelling, exposure is not optional. If exposure is mishandled, the comparison becomes unfair immediately. A GLM incorporates the offset naturally. A GBM does not do that by construction and must approximate the same logic through modelling choices. That distinction became one of the most important technical insights in the exercise.

The Results

Model    Validation Deviance    Test Deviance
GLM      0.3185                 0.3219
GBM      0.3590                 0.3624

Lower deviance indicates better fit.
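As an illustration of the metric itself, the mean Poisson deviance on a held-out set can be computed with scikit-learn (toy numbers, not the figures above):

```python
import numpy as np
from sklearn.metrics import mean_poisson_deviance

# Toy held-out data: observed claim counts vs. model-predicted expected counts
y_true = np.array([0, 0, 1, 0, 2, 0, 0, 1])
mu_hat = np.array([0.05, 0.08, 0.60, 0.10, 0.90, 0.04, 0.07, 0.50])

# Mean Poisson deviance: (2/n) * sum( y*log(y/mu) - (y - mu) ),
# with the convention y*log(y/mu) = 0 when y = 0. Lower is better.
dev = mean_poisson_deviance(y_true, mu_hat)
print(round(dev, 4))
```

Comparing two models on the same held-out rows with this statistic is exactly the comparison reported in the table above.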

The result was clear:

The GLM outperformed the GBM by approximately 12–13%.

That is not a rounding error. It is a material gap.

Just as importantly, both models showed relatively tight validation-to-test consistency. That suggests the result is not primarily a story of catastrophic overfitting. The more likely interpretation is that the GLM was well aligned to the structure of the problem, while the GBM did not extract enough additional signal to justify its extra flexibility.

Why Did the GLM Outperform the GBM?

1 The problem appears to be signal-limited, not complexity-limited

Insurance frequency data is sparse. In this dataset, around 95% of policies had no claims.

That means:

  • The signal-to-noise ratio is low
  • True relationships are relatively weak
  • Randomness dominates a large part of the observed outcome

In that setting, a simpler model often has an advantage. A GLM imposes structure and reduces variance. A GBM is more flexible, but flexibility only helps if there is meaningful additional structure to discover.

When signal is limited, extra complexity does not create insight. It often just creates variance.

2 The structure of insurance pricing often suits GLMs extremely well

Motor insurance pricing problems are often well described by:

  • Multiplicative relationships
  • Additive effects on the log scale
  • Distributions aligned with Poisson, Gamma, or Tweedie-type frameworks

That is exactly where GLMs are strongest.

If the data-generating process is approximately log-linear, then a GLM is not merely a legacy benchmark. It is a model class that is structurally aligned to the problem.

This is one of the biggest reasons "traditional" actuarial models remain highly competitive in real pricing environments.

3 Parsimony is a strength, not a weakness

One of the clearest points raised by Gemma was the role of parsimony.

A GLM is constrained. That is often presented as a limitation. In actuarial work, it is frequently an advantage.

Parsimony gives you:

  • A lower-variance estimator
  • Clearer interpretability
  • Better auditability
  • A model that resists chasing noise

In other words, simplicity is not the opposite of sophistication. In actuarial modelling, simplicity is often a disciplined form of regularisation.

4 Exposure handling matters more than many ML comparisons acknowledge

One of the most important observations from the AI comparison, especially from the corrected same-prompt Claude response, is that a GLM handles exposure correctly by design through the log offset, whereas a GBM must approximate that relationship through features or target construction.

That difference is not cosmetic. It goes to the heart of frequency modelling.

A well-specified GLM is estimating a rate in the way actuarial theory expects. A GBM can still be built to compete, but the modeller has to be careful:

  • What is being predicted: count or rate
  • How exposure is incorporated
  • Whether the objective function aligns with the evaluation metric
  • Whether the setup gives the GBM a fair chance

This means some of the observed advantage may reflect both model suitability and the fact that actuarial structure is naturally embedded in the GLM framework.

5 The GBM may not have been fully optimised — but that does not invalidate the result

A fair critique is that the GBM may have been under-tuned.

That is true, and it should be acknowledged.

Hyperparameters such as:

  • Learning rate
  • Tree depth
  • Number of estimators
  • Early stopping
  • Loss function alignment

can materially affect boosting performance.

But that does not make the exercise meaningless. It actually strengthens the practical lesson.

In real actuarial and business settings, model complexity has a cost:

  • More tuning effort
  • More validation burden
  • More governance overhead
  • More explainability friction

So the relevant question is not only whether a GBM can eventually be tuned to beat a GLM. It is whether the extra complexity is justified in the context of the problem.

In this exercise, the answer was clearly not yet.

Is This Result Expected?

Yes, and more so than many industry conversations suggest.

There is a strong narrative in the market that machine learning will automatically outperform GLMs in pricing. That is too simplistic.

A more accurate statement is:

Machine learning outperforms GLMs when the data is rich enough, large enough, and complex enough to reward the extra flexibility.

Where that is not true, a well-built GLM can remain the stronger model.

That is especially likely when:

  • The data is structured
  • The core rating variables are already curated
  • Relationships are close to additive or multiplicative
  • Domain expertise has already shaped the feature set
  • Explainability matters

So this result should not be read as "ML is worse than GLM." It should be read as:

In this setting, the GLM was better aligned to the structure and signal content of the problem.

When Would Machine Learning Likely Win?

Machine learning has a genuine edge when one or more of the following are true:

1 The data is very large

Very large datasets make it easier for boosting models and other ML methods to reliably detect patterns that would otherwise be unstable.

2 The relationships are strongly non-linear

If risk behaves in a way that bends, thresholds, reverses, or interacts in ways that are hard to pre-specify, ML becomes much more attractive.

3 There are rich interaction effects

Telematics, geospatial variables, weather, behavioural data, and high-dimensional external data are natural environments for ML.

4 The feature space is wide

When there are many weak predictors that individually add little but collectively matter, GBMs can aggregate signal more effectively than a manually specified GLM.

5 The use case prioritises pure predictive lift over interpretability

In some environments, marginal lift is commercially decisive. In others, governance, tariff translation, and regulatory defensibility matter more.

That distinction should always shape model choice.

The AI Comparison: ChatGPT vs Gemini vs Claude vs Gemma

After building the models, I asked four AI systems to interpret the same core result.

All four converged on the same high-level conclusion:

The GLM outperformed the GBM, and that outcome is plausible in actuarial pricing.

What differed was not the answer. It was the emphasis.

ChatGPT: Structure and Actuarial Reasoning

ChatGPT was strongest on:

  • Model structure
  • Data-generating process intuition
  • Log-linear reasoning
  • Actuarial framing
  • Exposure awareness

Its strongest contribution was in explaining why a GLM can be the right model class, not just a baseline.

Gemini: Optimisation and Machine-Learning Diagnostics

Gemini leaned more heavily into:

  • Signal-to-noise discussion
  • Underfitting or insufficient tuning
  • Hyperparameter sensitivity
  • ML performance diagnosis

Its value was in reminding us not to dismiss ML too quickly just because the first challenger lost.

Claude: Implementation Fairness and Production Realism

Once Claude was re-run with the same prompt as the other models, its strengths were:

  • Practical comparison fairness
  • Exposure handling differences between GLM and GBM
  • Tuning fairness
  • Calibration considerations
  • Production-level recommendations such as stacking and calibration review

This version of Claude was less "executive summary" than the earlier draft and more technically comparable to the other models. The earlier output was much more polished and consulting-oriented, but it was also based on a different prompt, so it was not a like-for-like benchmark.

Gemma: Parsimony and Model Philosophy

Gemma was strongest on:

  • Clarity
  • Simplicity
  • Bias-variance trade-off intuition
  • Explaining why simpler models sometimes win

It was less actuarially nuanced than ChatGPT or Claude, but it added a very useful layer of conceptual discipline.

What This Reveals About AI in Actuarial Work

The most important lesson is not that one AI model "won."

It is that AI systems can enhance actuarial workflows in different ways:

  • One helps frame the model properly
  • One helps diagnose the challenger
  • One helps think about implementation fairness
  • One helps explain the role of simplicity

That leads to a broader conclusion:

AI does not replace actuarial judgement. It changes how actuarial judgement can be applied.

Used well, AI can improve:

  • Coding speed
  • Review quality
  • Technical explanation
  • Governance drafting
  • Communication with non-technical stakeholders

But none of the models removed the need for:

  • Correct exposure treatment
  • Metric selection
  • Fairness of comparison
  • Domain judgement
  • Actuarial accountability

That is exactly where the actuary remains central.

Methodological Caveats

A rigorous actuarial review should acknowledge the following limitations.

1. The GBM was not exhaustively tuned. The exercise therefore does not prove that a tuned GBM could not close some of the gap.

2. The comparison used a specific train/validation/test workflow. A more extensive cross-validation framework could add robustness.

3. Deviance was the primary metric. That is appropriate, but not sufficient on its own; calibration, portfolio adequacy, and ranking stability also matter.

4. The modelling exercise focused on frequency. A full pricing framework may also include severity and pure premium comparisons.

5. AI interpretation quality depends heavily on prompt design. This became especially clear with Claude: different prompts produced meaningfully different styles of output.

The Real Takeaway

The biggest result was not simply that the GLM beat the GBM.

It was this:

The value of AI in actuarial work is not in replacing models. It is in improving how models are built, tested, challenged, and explained.

And on the modelling side:

The best model is not the most fashionable one. It is the one that performs well, matches the structure of the problem, and can be defended.

Sometimes that will be a machine-learning model.

Often, it will still be a GLM.

That is not a concession. It is a result.

Work With Us

Wizard & Co. provides independent actuarial and data-driven advisory support across:

Pricing
Risk
Model Review
Growth Strategy
AI-Enabled Decision Support

If you are reviewing your pricing methodology, considering machine learning, or trying to understand where AI genuinely adds value in actuarial work, get in touch.

Wizard & Co. helps organisations combine actuarial rigour with modern execution.

Get in Touch


Ready to explore your pricing potential?

Let's discuss how Wizard & Co. can support your actuarial and pricing objectives.