Artificial intelligence is rapidly entering actuarial workflows. Tools like ChatGPT, Claude, Gemini, and Gemma can now assist with coding, model interpretation, technical writing, and governance documentation. That raises an important question for actuaries, pricing teams, and decision-makers:
Can AI improve actuarial modelling in a way that is technically correct, commercially useful, and defensible?
To test this properly, I ran a practical experiment on a public motor insurance dataset of roughly 678,000 policies, comparing a GLM against a gradient boosting machine (GBM) on the same claim-frequency task.
I then asked multiple AI systems to interpret the results using the same core prompt.
The outcome was more interesting than a simple "AI versus traditional modelling" story.
The task was to model claim frequency on public motor insurance data.
The workflow included a train/validation/test split and explicit handling of policy exposure.
**Technical note:** exposure handling matters. In actuarial frequency modelling, exposure is not optional; if it is mishandled, the comparison becomes unfair immediately. A GLM incorporates the log-exposure offset naturally. A GBM does not do so by construction and must approximate the same logic through modelling choices. That distinction became one of the most important technical insights in the exercise.
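To make the offset point concrete, here is a minimal sketch on synthetic data (the variable names and coefficient values are illustrative, not taken from the actual study): a Poisson GLM with a log-exposure offset, fitted by Newton-Raphson in plain NumPy so the role of the offset is explicit.

```python
import numpy as np

# Synthetic portfolio: exposure is the fraction of a policy-year in force.
rng = np.random.default_rng(42)
n = 20_000
age = rng.uniform(18, 80, n)
exposure = rng.uniform(0.1, 1.0, n)
X = np.column_stack([np.ones(n), (age - 40.0) / 10.0])  # intercept + scaled age
beta_true = np.array([-2.0, -0.3])                      # illustrative values
y = rng.poisson(np.exp(X @ beta_true) * exposure)       # counts scale with exposure

# Poisson GLM with log link: E[y] = exp(X @ beta + log(exposure)).
# The offset enters with a fixed coefficient of 1, so beta describes
# the claim *rate per policy-year*, exactly as actuarial theory expects.
offset = np.log(exposure)
beta = np.zeros(2)
for _ in range(25):                                     # Newton-Raphson / IRLS
    mu = np.exp(X @ beta + offset)                      # fitted mean counts
    grad = X.T @ (y - mu)                               # score vector
    hess = X.T @ (X * mu[:, None])                      # Fisher information
    beta = beta + np.linalg.solve(hess, grad)

print(beta)  # recovers estimates close to beta_true
```

A GBM has no such slot for an offset by default; the same logic has to be supplied through the library's raw-score initialisation or through target construction.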
| Model | Validation Deviance | Test Deviance |
|---|---|---|
| GLM | 0.3185 | 0.3219 |
| GBM | 0.3590 | 0.3624 |
Lower deviance indicates better fit.
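For reference, the Poisson deviance behind figures like these can be computed in a few lines of NumPy. This is a generic implementation, not the exact code used in the study:

```python
import numpy as np

def mean_poisson_deviance(y, mu):
    """Mean Poisson deviance between observed counts y and fitted means mu.

    Lower is better; assumes mu > 0. The y * log(y / mu) term is taken
    as 0 when y == 0, its limiting value.
    """
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    ratio = np.where(y > 0, y / mu, 1.0)   # log(1) = 0 covers the y == 0 case
    return float(np.mean(2.0 * (y * np.log(ratio) - (y - mu))))

y = np.array([0, 1, 2, 0, 1])
print(mean_poisson_deviance(y, np.array([0.4, 0.8, 1.9, 0.2, 1.1])))  # small
print(mean_poisson_deviance(y, np.full(5, 5.0)))                      # much larger
```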
The result was clear:
The GLM outperformed the GBM by approximately 12–13%.
That is not a rounding error. It is a material gap.
Just as importantly, both models showed relatively tight validation-to-test consistency. That suggests the result is not primarily a story of catastrophic overfitting. The more likely interpretation is that the GLM was well aligned to the structure of the problem, while the GBM did not extract enough additional signal to justify its extra flexibility.
Insurance frequency data is sparse. In this dataset, around 95% of policies had no claims.
That means the informative events are rare: most policies contribute no claim signal, so the effective amount of information is far smaller than the row count suggests.
In that setting, a simpler model often has an advantage. A GLM imposes structure and reduces variance. A GBM is more flexible, but flexibility only helps if there is meaningful additional structure to discover.
When signal is limited, extra complexity does not create insight. It often just creates variance.
Motor insurance pricing problems are often well described by multiplicative, approximately log-linear relationships between rating factors and claim frequency.
That is exactly where GLMs are strongest.
If the data-generating process is approximately log-linear, then a GLM is not merely a legacy benchmark. It is a model class that is structurally aligned to the problem.
This is one of the biggest reasons "traditional" actuarial models continue to remain highly competitive in real pricing environments.
One of the clearest points raised by Gemma was the role of parsimony.
A GLM is constrained. That is often presented as a limitation. In actuarial work, it is frequently an advantage.
Parsimony gives you stability, interpretability, and models that are easier to govern and defend.
In other words, simplicity is not the opposite of sophistication. In actuarial modelling, simplicity is often a disciplined form of regularisation.
One of the most important observations from the AI comparison, especially from the corrected same-prompt Claude response, is that a GLM handles exposure correctly by design through the log offset, whereas a GBM must approximate that relationship through features or target construction.
That difference is not cosmetic. It goes to the heart of frequency modelling.
A well-specified GLM estimates a rate in the way actuarial theory expects. A GBM can still be built to compete, but the modeller must encode exposure deliberately, for example as a log-exposure offset on the raw score, as a weighted rate target, or as an input feature.
This means some of the observed advantage may reflect both model suitability and the fact that actuarial structure is naturally embedded in the GLM framework.
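The two standard GBM constructions are, reassuringly, equivalent at the level of the loss: modelling counts with a log-exposure offset added to the raw score yields the same per-row Poisson gradient as modelling the claim rate with exposure as a case weight. A quick numerical check, on synthetic numbers chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.poisson(0.2, size=1_000).astype(float)   # claim counts
e = rng.uniform(0.1, 1.0, size=1_000)            # exposures (policy-years)
f = rng.normal(0.0, 0.5, size=1_000)             # arbitrary raw boosting scores

# (a) counts target, with log(exposure) added to the raw score as an offset
grad_offset = np.exp(f + np.log(e)) - y
# (b) rate target y / e, with exposure as the Poisson case weight
grad_weighted = e * (np.exp(f) - y / e)

assert np.allclose(grad_offset, grad_weighted)   # identical up to rounding
```

This is why boosting libraries expose the same idea under different names (for example a raw-score initial margin versus sample weights on a rate target). Either route can reproduce the GLM's exposure logic, but only if the modeller wires it in deliberately.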
A fair critique is that the GBM may have been under-tuned.
That is true, and it should be acknowledged.
Hyperparameters such as the learning rate, tree depth, number of boosting rounds, and subsampling fractions can materially affect boosting performance.
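The tuning loop that critique implies can be sketched as follows. Here `val_deviance` is a hypothetical stand-in for "train a GBM with these settings and return validation deviance"; in practice it would wrap whatever boosting library is in use, and the response surface below is entirely made up for illustration.

```python
import itertools

def val_deviance(learning_rate, max_depth, n_trees):
    # Hypothetical, purely illustrative response surface standing in for
    # an actual train-and-validate run of a boosting model.
    return (0.36
            + 0.5 * (learning_rate - 0.05) ** 2   # sweet spot near 0.05
            - 0.02 / max_depth                    # on this surface, shallow wins
            + 1e-5 * abs(n_trees - 400))          # diminishing returns past 400

grid = itertools.product(
    [0.01, 0.05, 0.10],   # learning rate
    [2, 4, 6],            # tree depth
    [200, 400, 800],      # boosting rounds
)
best = min(grid, key=lambda params: val_deviance(*params))
print(best)  # (0.05, 2, 400) on this toy surface
```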
But that does not make the exercise meaningless. It actually strengthens the practical lesson.
In real actuarial and business settings, model complexity has a cost: tuning effort, governance burden, explainability work, and ongoing maintenance.
So the relevant question is not only whether a GBM can eventually be tuned to beat a GLM. It is whether the extra complexity is justified in the context of the problem.
In this exercise, the answer was clearly not yet.
Are GLMs still competitive, then? Yes. More than many industry conversations suggest.
There is a strong narrative in the market that machine learning will automatically outperform GLMs in pricing. That is too simplistic.
A more accurate statement is:
Machine learning outperforms GLMs when the data is rich enough, large enough, and complex enough to reward the extra flexibility.
Where that is not true, a well-built GLM can remain the stronger model.
That is especially likely when claims data is sparse, the underlying structure is approximately multiplicative, and the available predictors are classical tabular rating factors.
So this result should not be read as "ML is worse than GLM." It should be read as:
In this setting, the GLM was better aligned to the structure and signal content of the problem.
Machine learning has a genuine edge when one or more of the following are true:

- **Scale.** Very large datasets make it easier for boosting models and other ML methods to reliably detect patterns that would otherwise be unstable.
- **Complex risk behaviour.** If risk bends, shows thresholds, reverses, or interacts in ways that are hard to pre-specify, ML becomes much more attractive.
- **Rich data sources.** Telematics, geospatial variables, weather, behavioural data, and other high-dimensional external data are natural environments for ML.
- **Many weak predictors.** When many predictors individually add little but collectively matter, GBMs can aggregate signal more effectively than a manually specified GLM.
In some environments, marginal lift is commercially decisive. In others, governance, tariff translation, and regulatory defensibility matter more.
That distinction should always shape model choice.
After building the models, I asked four AI systems to interpret the same core result.
All four converged on the same high-level conclusion:
The GLM outperformed the GBM, and that outcome is plausible in actuarial pricing.
What differed was not the answer. It was the emphasis.
ChatGPT was strongest on the actuarial framing. Its best contribution was explaining why a GLM can be the right model class, not just a baseline.
Gemini leaned more heavily into the case for machine learning. Its value was in reminding us not to dismiss ML too quickly just because the first challenger lost.
Once Claude was re-run using the same prompt as the other models, its output became less "executive summary" and more technically comparable to the others. The earlier draft was more polished and consulting-oriented, but it was based on a different prompt, so it was not a like-for-like benchmark.
Gemma was strongest on first principles such as parsimony. It was less actuarially nuanced than ChatGPT or Claude, but it added a very useful layer of conceptual discipline.
The most important lesson is not that one AI model "won."
It is that different AI systems can enhance actuarial workflows in different, complementary ways.
That leads to a broader conclusion:
AI does not replace actuarial judgement. It changes how actuarial judgement can be applied.
Used well, AI can improve how models are built, tested, challenged, and explained.
But none of the models removed the need for actuarial judgement, domain context, and accountability for the final decision.
That is exactly where the actuary remains central.
A rigorous actuarial review should acknowledge the following limitations.
- **The GBM was not exhaustively tuned.** The exercise therefore does not prove that a tuned GBM could not close some of the gap.
- **The comparison used a specific train/validation/test workflow.** A more extensive cross-validation framework could add robustness.
- **Deviance was the primary metric.** That is appropriate, but not sufficient on its own; calibration, portfolio adequacy, and ranking stability also matter.
- **The modelling exercise focused on frequency.** A full pricing framework may also include severity and pure premium comparisons.
- **AI interpretation quality depends heavily on prompt design.** This became especially clear with Claude: different prompts produced meaningfully different styles of output.
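The point that deviance alone is not sufficient is easy to operationalise. One common supplementary check is actual-versus-expected by predicted-risk band; the sketch below uses synthetic predictions and outcomes purely to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000
pred = rng.lognormal(mean=-2.5, sigma=0.5, size=n)  # predicted claim frequency
claims = rng.poisson(pred)                          # simulated "actual" counts

# Sort policies by predicted risk and split into ten equal bands.
order = np.argsort(pred)
ae_by_decile = []
for i, idx in enumerate(np.array_split(order, 10), start=1):
    ae = claims[idx].sum() / pred[idx].sum()        # actual / expected claims
    ae_by_decile.append(ae)
    print(f"decile {i:2d}: A/E = {ae:.3f}")

# A well-calibrated model keeps A/E close to 1 in every band; systematic
# drift across deciles would flag a calibration or ranking problem that
# an aggregate deviance figure can hide.
```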
The biggest result was not simply that the GLM beat the GBM.
It was this:
The value of AI in actuarial work is not in replacing models. It is in improving how models are built, tested, challenged, and explained.
And on the modelling side:
The best model is not the most fashionable one. It is the one that performs well, matches the structure of the problem, and can be defended.
Sometimes that will be a machine-learning model.
Often, it will still be a GLM.
That is not a concession. It is a result.
Wizard & Co. provides independent actuarial and data-driven advisory support.
If you are reviewing your pricing methodology, considering machine learning, or trying to understand where AI genuinely adds value in actuarial work, get in touch.
Wizard & Co. helps organisations combine actuarial rigour with modern execution.
**Get in touch.** Let's discuss how Wizard & Co. can support your actuarial and pricing objectives.