Machine Learning in the Insurance Industry

Written by Zac Ernst | May 12th, 2020

4 minute read

For businesses starting a new machine learning initiative, the biggest challenge is often not having too much data, but having too little. They don’t have massive quantities of data in a data lake, but only a few thousand data points in a data puddle.

Machine learning in the data puddle poses severe challenges. Obviously, you can forget about certain techniques, such as deep neural nets, which typically require millions of data points. Furthermore, the risk of overfitting is increased because the number of features in your model could be large relative to the small size of your training data. Spurious correlations abound, confounding variables are difficult to pry apart, and small errors in your data can have a dramatic effect.
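To make that overfitting risk concrete, here is a minimal sketch on purely synthetic data (not insurance data): an ordinary least-squares fit with nearly as many features as training rows memorizes the training set while failing badly on held-out points.

```python
import numpy as np

rng = np.random.default_rng(0)

n_train, n_test, n_features = 40, 40, 35  # features nearly as numerous as training rows

# The true signal depends on only 3 of the 35 features; the rest are noise.
true_coef = np.zeros(n_features)
true_coef[:3] = [2.0, -1.0, 0.5]

X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))
y_train = X_train @ true_coef + rng.normal(size=n_train)
y_test = X_test @ true_coef + rng.normal(size=n_test)

# Ordinary least squares happily assigns weight to all 35 features.
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_mse = np.mean((X_train @ coef - y_train) ** 2)
test_mse = np.mean((X_test @ coef - y_test) ** 2)

print(f"train MSE: {train_mse:.3f}")  # small: the model has fit the training noise
print(f"test MSE:  {test_mse:.3f}")   # much larger: the spurious features don't generalize
```

With ample data (say, thousands of rows for those 35 features), the same fit would generalize fine; in the data puddle, regularization, feature pruning, and domain knowledge have to do that work instead.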

Startups such as Clearcover are especially prone to face these challenges. Despite our fast growth, we haven’t been around long enough to amass the quantity of data we’d like to have in a machine learning initiative. So we’ve had to be very creative and disciplined in our approach to machine learning. In this post, I’ll describe overarching processes that have allowed us to chalk up some early successes in our machine learning initiatives. They come down to three elements: (1) achieving alignment among all the stakeholders; (2) utilizing all the knowledge the business possesses; and (3) augmenting data with external data sources.

Alignment

The most crucial step is to set goals and expectations properly among everyone who is impacted by machine learning. Alignment on the right machine learning goal requires a few elements:

  1. Viability. The goal has to be possible to achieve, given the limitations of the business’s data and other resources.

  2. Value. The project must be able to show significant measurable value on the business’s bottom line.

  3. Vision. The machine learning project should help stakeholders see how it could be grown into more ambitious projects over time.

In our case, Clearcover had considered building an ambitious machine learning model to predict customer lifetime value. Unfortunately, our data puddle isn’t nearly big enough to support such a project: the company has only been in existence for a few years and doesn’t yet have sufficient data on lifetime value. This goal fails on viability.

We encouraged our stakeholders to think of a lifetime value model as a goal that we could achieve in a few years. But in the meantime, we challenged everyone involved to come up with ideas for other projects that could be framed as steps toward that goal. Customer churn, for example, is closely related to lifetime value because customers who churn quickly will obviously have very low lifetime value. It turned out that our business intelligence team had already surfaced some useful insights about the segment of customers churning within sixty days.

Predicting sixty-day defection turned out to be a project that we could all align on, which had the potential to make a big difference to the business, and which happened to be a step toward the creation of more ambitious models such as lifetime value. And because we only have to have a customer for sixty days in order to know whether they churn, the data puddle had at least a few thousand data points that could contribute to the model. So we quickly aligned on this goal.
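As an illustration of why the sixty-day window unlocks the data puddle, here is a sketch of how such a label might be derived from policy dates. The function name and the field semantics are hypothetical, not Clearcover’s actual schema; the key point is that any policy older than sixty days yields a usable training example.

```python
from datetime import date

def sixty_day_churn_label(start, cancel, as_of):
    """Return 1 if the customer churned within 60 days of start,
    0 if they survived past 60 days, or None if the policy is still
    too young for the outcome to be known as of `as_of`."""
    if cancel is not None and (cancel - start).days <= 60:
        return 1  # churned inside the window
    tenure_days = ((cancel or as_of) - start).days
    if tenure_days > 60:
        return 0  # survived the window
    return None   # active and still inside the window: label unknown

# Three example customers observed as of 2020-05-01.
as_of = date(2020, 5, 1)
print(sixty_day_churn_label(date(2020, 1, 1), date(2020, 2, 15), as_of))  # churned on day 45
print(sixty_day_churn_label(date(2020, 1, 1), None, as_of))              # active past day 60
print(sixty_day_churn_label(date(2020, 4, 20), None, as_of))             # too early to label
```

A lifetime value label, by contrast, stays “unknown” for years, which is exactly why that project failed the viability test.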

Utilize all the knowledge around you

A risk of doing machine learning in the data puddle is that with so few data points, it’s especially easy to overfit your data or make other mistakes that might cause you to generate models that aren’t predictive. So you can’t take a “throw everything at the wall and see what sticks” approach. Instead, you have to be guided by specific hypotheses and all the domain knowledge possessed by anyone in the business. Our small machine learning team worked hand-in-hand with product managers, actuaries, engineers, and business intelligence through every step of the process.

My experience has been that if a problem has been identified as important to the business, there are always people who have spent time thinking about possible solutions. Although it’s always a good idea to solicit input from domain experts, this step is absolutely necessary in the data puddle.

In previous jobs, I’ve seen data science teams select the data sources, profile the data, engineer all the features, document sources, train a model, and show it to the business’s domain experts only at the end. But that is a sure-fire way to fail.

Domain experts should be consulted in every single stage of the process. They can help in many ways, including:

  1. Data profiling. There is no purely mathematical test for whether the data contains errors. A spike in sales around Christmas could be normal for a retailer, but indicate an error for an insurance company.

  2. Feature engineering. Experts always have ideas about which features will be useful for the model. They can point you in the right direction when you’re building features, saving a lot of time.

  3. Modeling. Even if your domain experts aren’t data scientists or statisticians, they can help you select appropriate models. For example, they’ll know if certain features tend to interact linearly or not, which is very useful when choosing a modeling approach.

  4. Validation. You should always communicate even your most preliminary results to your domain experts because they’ll be able to tell you if there are any suspicious results.

  5. Communication. Your domain experts are probably closer to your stakeholders and can provide guidance about how to present your findings.

It should go without saying that this is an iterative process, not a linear one. I’ve seen very sophisticated teams approach these steps in a waterfall fashion by doing all the data profiling and all the feature engineering first, before beginning the modeling phase. But this is a blunder. You’ll get much more value from your business’s knowledge by iterating on these steps as many times as possible. Practically, this means that you should profile only a small part of your data, then engineer only the features that are most likely to yield results, and get into the “build, test, learn” loop right away. Even if your first models perform poorly, you’ll get better results sooner.

Augment your data

Another way to get value out of your data puddle is to grow the puddle by adding data from external sources. There are many high-quality external data sources that are freely available, and you should use them if possible.

In our project, United States Census data turned out to be helpful. The Census Bureau collects a huge amount of data and aggregates it at multiple levels (nation, state, county, zip code, etc.). If demographics is important to your business, you should definitely check this source.

Of course, there are other sources which can be valuable, depending on your specific problem. The Internal Revenue Service also provides data aggregated by zip code, for example;  OpenStreetMap can provide valuable information about locations, distances, businesses, and other features; Wikidata contains millions of data points about many different types of entities including people, places, and businesses. And it’s becoming more common for major cities to provide open data portals that contain very fine-grained downloadable tables about every aspect of the city.

Combining these data sources with each other and with your own first-party data can be powerful. If you know the zip codes of your customers, for example, you can learn essential data points that inform the car insurance application. I’m a frequent consumer of open data sets for the projects I work on. It’s always a good idea to check to see if one of these data sets can be useful. But in the data puddle, external data can make the difference between success and failure.
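A minimal sketch of that kind of enrichment, with made-up tables on both sides: a left join that keeps every customer and attaches zip-code-level aggregates where an external match exists.

```python
# Hypothetical zip-code-level aggregates, e.g. derived from Census tables.
census_by_zip = {
    "60601": {"median_income": 85000, "pct_renters": 0.71},
    "60614": {"median_income": 98000, "pct_renters": 0.55},
}

# Synthetic first-party customer records.
customers = [
    {"customer_id": 1, "zip": "60601"},
    {"customer_id": 2, "zip": "60614"},
    {"customer_id": 3, "zip": "99999"},  # zip with no external match
]

# Left join: keep every customer, attach external features where available,
# and fall back to explicit missing values so downstream code can handle them.
missing = {"median_income": None, "pct_renters": None}
enriched = [{**c, **census_by_zip.get(c["zip"], missing)} for c in customers]

for row in enriched:
    print(row)
```

In practice this join would run in your warehouse or a dataframe library rather than in plain Python, but the shape is the same: a small first-party table grows dozens of credible features from one zip-code key.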

Conclusion

All the hottest tools, blog posts, and conferences are heavily geared toward the challenge of working with “Big Data.” But in my opinion, machine learning in a data puddle is far more challenging. It forces you to deeply understand your business, customers, and stakeholders. But if you have a clearly articulated, comprehensive strategy that leverages every resource at your disposal, your team can overcome these challenges and have a major impact on the business.

Interested in joining our team and working with ML technologies? Check out our open spots!
