AI and Data Science

Always Innovating | Experiment 1: Loss Function Analysis

Lorena Piedras
Jun 27, 2024
9 minutes to read

Introduction

My name is Lorena, and I’m a data scientist at Syrup. I’m incredibly excited to welcome you to our first-ever AI experiment write-up, which kicks off a new series we’re calling Always Innovating.

Data science forms the foundation of our work here at Syrup. We are constantly testing the latest machine learning (ML) models and experimenting with new frameworks in our pursuit of the best AI forecasts and optimizations for fashion and apparel brands. Innovation is built into our daily rhythm.

Our first experiment write-up looks at an important question you might not have thought to ask: why does it matter how we train our models? Isn’t the final result more important? Keep reading to find out the answer.

Looking for a more technical write-up?

Download here

A Quick Primer on Model Testing, Plus Our Research Question

At Syrup, we use supervised machine learning methods to train our forecasting models. This means we train the model to predict a known quantity, which we call the target, and then provide feedback on its accuracy (hence: supervised).

The model gets better at predicting the target by learning from its errors. We measure the errors with something called a loss function.

There are many different types of loss functions, each designed to “punish” different types of errors. The most commonly used loss for predicting a numeric quantity is called mean squared error (MSE). It’s the default in most ML software, and it punishes the model more for larger errors than for smaller ones.
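For the curious, here is a minimal sketch in Python (using NumPy) of what mean squared error computes; the demand numbers are made up purely for illustration.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # Average of the squared differences between actuals and predictions.
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

actual = np.array([0, 0, 0, 2, 10])    # units sold in five weeks (made-up numbers)
forecast = np.array([0, 0, 1, 2, 6])   # the model's predictions (made-up numbers)

print(mean_squared_error(actual, forecast))  # 3.4
# The single 4-unit miss contributes 16 to the total, while the 1-unit miss contributes
# only 1, which is why MSE punishes large errors far more than small ones.
```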

So, the research question for this first experiment is a simple one on paper: is mean squared error the right loss to use in forecasting at Syrup, given that we always want to optimize for the best customer outcomes possible?

Looking for a vocabulary refresher?

Check out our blog that covers the AI basics every merchandise planning team should know.

Read more

Why Does How We Test Matter?

One powerful feature of Syrup forecasts is our ability to accurately predict demand at granular levels. When forecasting at the SKU x store x week level, zeroes make up a large chunk of the observations in our training data.

Which makes sense! For most of the hundreds of thousands of SKU x store x week combinations that emerge at this level of granularity, there are simply no sales in a given week.

In data science terms, this is known as data sparsity. To put it in retail terms, this is another way of saying that there is intermittent demand at the most granular levels.

When we train a model using MSE on a dataset that has mostly zeros, the model learns a bad habit: given a choice between under- and over-prediction, it’s more likely to under-predict because doing so allows it to consistently achieve a low squared error.
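A toy example makes this concrete. If nine out of ten weeks see zero sales, the constant prediction that minimizes MSE is the overall mean, which sits far below the weeks where demand actually materializes. The numbers below are hypothetical:

```python
import numpy as np

# Intermittent demand: nine weeks with zero sales, one week with 10 units (hypothetical).
sales = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 10])

def mse(prediction):
    return np.mean((sales - prediction) ** 2)

best_constant = sales.mean()   # 1.0, the constant prediction that minimizes MSE
print(mse(best_constant))      # 9.0
print(mse(10.0))               # 90.0, matching the rare sale is punished heavily
# Squared error rewards predictions pulled toward zero, so a model trained with MSE
# tends to under-predict the weeks where demand actually shows up.
```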

For retailers, under-forecasting lower-selling items may not be a huge issue. But under-forecasting items that are the biggest contributors to revenue — and therefore stocking out on them — causes a real headache.

It’s imperative that we find a model training system that encourages precision where it matters most: our customers’ financial performance.

Introducing Tweedie Loss (And How We Tested It)

Of course, MSE is not the only loss function available. For this experiment, we were particularly interested in evaluating something called a Tweedie loss function, which is based on Tweedie distributions.

Tweedie loss is a compelling candidate for a couple of reasons. First, it is specifically tailored to handle datasets with many zeros alongside positive values. Second, Tweedie loss provides greater flexibility than MSE thanks to its adjustable parameters, which allow it to better adapt to the data. You can think of parameters as settings or adjustments specific to the Tweedie loss function that control its behavior.
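As a rough illustration of what training with a Tweedie loss can look like, here is a sketch using LightGBM, which supports a Tweedie objective out of the box. The synthetic data, the variance power of 1.3, and the other settings are illustrative assumptions, not the configuration we run in production.

```python
import numpy as np
import lightgbm as lgb

# Synthetic stand-in for sparse demand: roughly 85% of the targets are zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 8))
y = np.where(rng.random(5000) < 0.85, 0, rng.poisson(5, 5000)).astype(float)

# LightGBM exposes a Tweedie objective; the variance power is the adjustable parameter
# mentioned above (values between 1 and 2). The 1.3 here is purely illustrative.
model = lgb.LGBMRegressor(
    objective="tweedie",
    tweedie_variance_power=1.3,
    n_estimators=300,
    learning_rate=0.05,
)
model.fit(X, y)
predictions = model.predict(X)
```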

We tested the Tweedie loss function with a few different parameters. Testing, for us, means training a model on historical data and evaluating its results on held-out periods where the targets are already known. This approach, known as "backtesting," ensures that the evaluation periods are not included in the training data. This is crucial because it allows us to gauge the model's ability to generalize to new, unseen data. We then calculate a suite of metrics that capture different aspects of our forecast.
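For illustration, here is a minimal sketch of a time-based backtest split; the table, column names, and eight-week holdout are hypothetical stand-ins rather than our actual pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical weekly demand history (52 weeks x 100 SKU-store combinations).
rng = np.random.default_rng(0)
weeks = pd.date_range("2023-01-02", periods=52, freq="W-MON")
history = pd.DataFrame({
    "week": np.tile(weeks, 100),
    "units_sold": rng.poisson(0.3, 52 * 100),
})

# Backtest split: train only on weeks before the cutoff and evaluate on the most
# recent eight weeks, so the model is scored on periods it has never seen.
cutoff = history["week"].max() - pd.Timedelta(weeks=8)
train = history[history["week"] <= cutoff]
test = history[history["week"] > cutoff]
# model.fit(train[feature_cols], train["units_sold"])
# predictions = model.predict(test[feature_cols])
```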

To see if the model has improved its under-prediction habit on the higher-valued targets (i.e., nonzero), we looked at lift curves. The lift curve compares the actual average units with the model's predictions across different target bins. These target bins are created by dividing the target values into quantiles.
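Here is one way such a lift curve could be computed, binning observations by target quantiles and comparing average actuals with average predictions per bin; the data and bin count below are illustrative.

```python
import numpy as np
import pandas as pd

def lift_curve(actuals, predictions, n_bins=10):
    # Average actual vs. average predicted units within quantile bins of the target.
    df = pd.DataFrame({"actual": actuals, "predicted": predictions})
    df["bin"] = pd.qcut(df["actual"].rank(method="first"), q=n_bins, labels=False)
    return df.groupby("bin")[["actual", "predicted"]].mean()

# Toy data: mostly zeros, with predictions that run slightly low.
rng = np.random.default_rng(0)
actual = np.where(rng.random(10_000) < 0.8, 0, rng.poisson(4, 10_000))
predicted = 0.8 * actual + rng.normal(0, 0.2, 10_000)
print(lift_curve(actual, predicted))
```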

Here’s the lift curve for the default loss, MSE — actuals are in blue, the MSE model’s predictions are in green.

[Figure: lift curve for the MSE-only model]

If the lines are close together, the model is performing well. In this case, the model’s predictions are consistently below the actual values.

Diving Into the Results

After some experimentation, we confirmed our suspicion that using Tweedie loss results in a better fit than squared error, through the narrow lens of reducing under-forecasting.

A good way to illustrate this is to compare Tweedie loss to standard squared error across a range of predicted values, using the same lift curve approach we covered above. Here’s a chart that looks at the Tweedie loss with the best-performing parameters — you can see how it compares to our baseline MSE loss.

[Figure: lift curves comparing the MSE and Tweedie models]

The improvement might seem modest in the graph, but this change allowed us to reduce under-forecasting from 24% to 19%, a relative decrease of roughly 21% ((24 - 19) / 24 ≈ 0.21). We still have an opportunity to get the forecasting line to match even more closely with the actuals, but using Tweedie loss is a step in the right direction.

You may note that the right side of the chart has a big gap. Why is that? In short, it comes back to data sparsity. In this last bucket, there are very few observations for the model to train on, which makes accurate predictions challenging. In addition, the few values the model can look at have a high degree of variance — individual values are farther away from the average than they are in other buckets.

As we covered earlier, for our customers' data, it's most important to have high-quality predictions when an observation is greater than 0. This approach optimizes model performance for the articles most likely to drive financial performance.

And, for certain parameter settings, Tweedie loss punishes the model much more heavily than MSE does for predicting close to zero when the real target value is 1.
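To make that asymmetry concrete, here is a small numeric comparison using scikit-learn's Tweedie deviance with an assumed variance power of 1.5 (an illustrative setting, not necessarily the one we use):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_tweedie_deviance

# How each loss scores an under-prediction when one unit actually sold.
y_true = np.array([1.0])
for y_pred in (0.05, 0.5, 1.0):
    tweedie = mean_tweedie_deviance(y_true, np.array([y_pred]), power=1.5)
    mse = mean_squared_error(y_true, np.array([y_pred]))
    print(f"pred={y_pred:4.2f}  tweedie={tweedie:6.2f}  mse={mse:4.2f}")

# pred=0.05  tweedie= 10.78  mse=0.90
# pred=0.50  tweedie=  0.49  mse=0.25
# pred=1.00  tweedie=  0.00  mse=0.00
# MSE barely notices a near-zero prediction when the true value is 1;
# the Tweedie deviance penalizes it heavily.
```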

This is a great reminder that just because something is the default, it doesn’t mean it’s always the best option. And if you’re running forecasts internally already, it might be worth doing a quick check to see if MSE is impacting your results!

Charting Our Next Steps

We are constantly running experiments just like this. When the impact is predictable and generalizable, our flexible platform allows us to deploy new approaches right away. For those experiments that are inconclusive or not quite production-ready, we continue building with additional experiments.

In the case of Tweedie, we can add our findings to the “interesting, but not ready for production” bucket. But that doesn’t mean this is a “failure”! On the contrary, each experiment we run provides additional information we can build on for future iterations of our models.

We’ve confirmed our hunch that the choice of loss function, and even the parameters of that function, can help correct for unwanted behavior, such as the model’s habit of under-predicting. And future experiments can build on this knowledge to keep pushing our model to be the best it can be, given the challenge of data sparsity.

I hope you learned something interesting today about model testing! We’ll be back with more experiments in the future, so make sure to check back.

In the meantime, here’s a piece written by my star boss Mike that walks through why data science is so important as we look to help brands perfect their inventory processes.

