The Folly of Optimizing A, While Caring About B
R2 and MAE are not as helpful as you think at improving profitability
Nearly every organization that considers partnering with Verikai will go through a proof-of-concept exercise called our “validation process”. However, Verikai’s vast behavioral database enables new machine learning algorithms for which standard goodness-of-fit metrics are not ideal. For individuals accustomed only to regression-based modeling, understanding and interpreting the validation results can be challenging. R2 or MAE becomes the hammer, and every predictive model becomes the nail. This article addresses some common questions we encounter during validation and explains why goodness-of-fit metrics can be problematic.
BACKGROUND: Why We Validate
Verikai performs the validation process for two reasons:
- ROI ESTIMATION: For organizations evaluating Verikai, a proof-of-concept shows our ability to produce results on a new population. They can use the results to estimate profit margin & loss ratio impacts had they used Verikai on their historical business. Doing so allows prospective customers to estimate the future ROI of working with Verikai.
- PRODUCT CUSTOMIZATION: Verikai’s data science team will use the historical data collected during validation to calibrate our models for your business. Case management, network discounts, group-size discounts, etc., can all impact profitability. Given sufficient historical data as a training set, Verikai can use machine learning to quantify the factors that are unique to your business. Our model outputs will then be adjusted (“calibrated”) to reflect them.
Common evaluation metrics
Most customers evaluate models using one of a few standard metrics:
Coefficient of Determination (R2)
Any meaningful real-world data will have outcome values (Y) that vary between observations (X). R2 is the percentage of the variance in Y that can be explained by X. For example, a person studying education policy may be interested in understanding the degree to which SAT scores (Y) are explainable by household income (X).
R2 is the most commonly used evaluation metric because it is often the only evaluation metric students are exposed to in an undergraduate statistics course, and it is easily calculated in Microsoft Excel.
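As a minimal sketch of the definition above, R2 can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The function name and the data are hypothetical and chosen only for illustration.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical outcomes, for illustration only.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]

# A model that predicts every outcome exactly explains all of the variance.
r2_perfect = r_squared(actual, [1.0, 2.0, 3.0, 4.0, 5.0])  # 1.0

# A model that always predicts the mean explains none of it.
r2_mean_only = r_squared(actual, [3.0] * 5)  # 0.0
```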
Mean Absolute Error (MAE)
MAE is calculated by averaging the absolute values of the prediction errors. For example, if a model forecasted that a company would sell 100 widgets for the year and actual sales came out to be either 90 widgets or 110 widgets, the absolute error would be |90 - 100| = |110 - 100| = 10 widgets in both cases.
The advantage of looking at MAE over R2 is in interpretability. Since MAE is in the same units as the forecasted value, it is easier for non-technical decision-makers to evaluate models by comparing MAE values to the range of expected outcomes.
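The widget example can be reproduced in a few lines. This is a sketch with a hypothetical helper name; it only restates the arithmetic above.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute prediction errors, in the units of the outcome."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


forecast = 100  # widgets forecasted for the year

# Missing low (90 sold) and missing high (110 sold) are penalized equally.
mae_low = mean_absolute_error([90], [forecast])    # 10.0 widgets
mae_high = mean_absolute_error([110], [forecast])  # 10.0 widgets

# Averaged over both scenarios, the error is still 10 widgets.
mae_both = mean_absolute_error([90, 110], [forecast, forecast])  # 10.0
```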
Mean Square Error (MSE) / Root Mean Square Error (RMSE)
MSE and RMSE are similar to MAE except that the error terms are squared before the average is taken. The difference between MSE and RMSE is that RMSE will then take the square root to convert the error metric to the same units as the outcome variable.
The most significant difference between MSE/RMSE and MAE is in their sensitivity to outliers. For example, if one observation is off by 5 units and another is off by 10 units, the latter will have twice the influence on the MAE calculation. However, the latter observation will carry four times the weight (not two) when looking at squared errors (5² = 25 and 10² = 100).
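The difference in outlier sensitivity can be checked directly with the two errors from the example. The variable names below are hypothetical; the numbers come from the paragraph above.

```python
import math

errors = [5, 10]  # one observation off by 5 units, another off by 10

abs_errors = [abs(e) for e in errors]  # [5, 10]
sq_errors = [e ** 2 for e in errors]   # [25, 100]

mae = sum(abs_errors) / len(errors)  # 7.5
mse = sum(sq_errors) / len(errors)   # 62.5
rmse = math.sqrt(mse)                # back in the units of the outcome

# Relative influence of the larger error on each metric.
abs_weight_ratio = abs_errors[1] / abs_errors[0]  # 2.0
sq_weight_ratio = sq_errors[1] / sq_errors[0]     # 4.0
```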
What is problematic about R2/MAE/RMSE/etc.?
Nothing is wrong per se with using metrics like R2/MAE/RMSE, provided you only care about optimizing on that metric and suspect the underlying function is linear.
The problem arises when what people REALLY care about are other metrics, like underwriting profitability. In most cases, actuaries and underwriters will care about and be assessed against multiple metrics (e.g., improve loss ratio by X% while growing revenue by at least Y%). Charged with figuring out how to grow profitably, risk-bearing entities often lean on new or better predictive models as a way to better identify and price risk.
Underwriters and actuaries commonly assess models using a regression-based approach on the assumption (implicit and often unknowing) that a model scoring higher on a particular metric (e.g., R2) is a model that will produce better results for what really matters (e.g., profitability). Setting aside the caveat that academic research shows R2 to be meaningless for model selection when data is nonlinear (which may be the case for healthcare claims), is it necessarily true that a better R2 will produce better financial results?
An illustrative example
Imagine designing a prediction model for a particular medical condition that is both rare and extremely costly. Also, imagine that you have a population of 100 groups that you are considering insuring, one of which has a member with this hypothetical condition that will cause the group to be highly unprofitable.
If the condition were rare enough, a predictive model that says nobody ever has this hypothetical condition would appear to be a “good” model when judged on a single goodness-of-fit metric, because the model is rarely wrong. However, this defeats the purpose of building a model in the first place. The risk-bearer will be no better off when writing the 100 groups: the model would not have flagged the highly unprofitable group containing this condition, so the risk-bearer will be stuck paying the costs associated with it.
Alternatively, the risk-bearer may build a more aggressive model for this hypothetical condition. Suppose the model identifies five groups likely to have our hypothetical condition: the one group with the condition and four groups without. The underwriter or carrier may use this information to decline to quote the five flagged groups and only write the other 95 groups.
In this case, the more aggressive model may look worse according to a goodness-of-fit metric because it incorrectly flagged four groups. It is also true that this more aggressive model will have resulted in the risk-bearer turning away four groups that would have been slightly profitable. However, the overall performance of the book will be markedly better on both an absolute dollar basis and in percentage terms when using the more aggressive model. The extreme cost associated with this hypothetical condition means the best predictive model for the business takes a “better safe than sorry” approach to flagging the condition.
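To make the comparison concrete, here is a sketch of the 100-group example. The article gives no dollar amounts, so the premium and claims figures below are assumptions invented purely for illustration, as are all function and variable names.

```python
# Hypothetical dollar figures (assumptions, not taken from real data).
PREMIUM = 100_000              # annual premium per group
TYPICAL_CLAIMS = 90_000        # claims for a slightly profitable group
CATASTROPHIC_CLAIMS = 600_000  # claims for the group with the costly condition

# 100 groups; only group 0 contains the rare condition.
has_condition = [i == 0 for i in range(100)]

# Model 1 never flags anyone. Model 2 flags five groups:
# the one true case plus four false alarms.
flags_none = [False] * 100
flags_aggressive = [i < 5 for i in range(100)]


def evaluate(flags):
    """Score a model on accuracy and on the book it would lead us to write."""
    accuracy = sum(f == h for f, h in zip(flags, has_condition)) / len(flags)
    # Flagged groups are declined; everything else is written.
    written = [h for f, h in zip(flags, has_condition) if not f]
    premium = PREMIUM * len(written)
    claims = sum(CATASTROPHIC_CLAIMS if h else TYPICAL_CLAIMS for h in written)
    return {
        "accuracy": accuracy,
        "profit": premium - claims,
        "loss_ratio": claims / premium,
    }


passive = evaluate(flags_none)
# accuracy 0.99, profit 490,000, loss ratio 0.951
aggressive = evaluate(flags_aggressive)
# accuracy 0.96, profit 950,000, loss ratio 0.90
```

Under these assumed figures, the model that scores worse on accuracy produces nearly twice the profit and a loss ratio about five points lower, which is the disconnect the example describes.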
In other words, the best model for the business may be suboptimal on common evaluation metrics. The risk-bearer may purposely deploy a more aggressive model to identify this condition (i.e., with higher True Positive Rate) with the understanding that it will incorrectly flag some individuals/groups as having the condition when they do not (i.e., higher False Positive Rate). Part of the “art” of data science is determining the degree to which precision should be sacrificed for sensitivity. Finding the right balance becomes more complex as the discrepancy grows between the cost of a False Negative (e.g., writing a potentially massively unprofitable group) and the cost of a False Positive (e.g., not writing a slightly profitable group). Failing to account for the asymmetry in healthcare data creates a disconnect between goodness-of-fit metrics and the operational results that matter.
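One textbook way to reason about this trade-off is an expected-cost rule: flag a group when the expected cost of missing the condition exceeds the expected cost of a false alarm. The sketch below is not Verikai's method; the function name and the cost figures are hypothetical.

```python
def flag_threshold(cost_false_positive, cost_false_negative):
    """Probability above which flagging has the lower expected cost.

    Not flagging costs p * cost_false_negative in expectation;
    flagging costs (1 - p) * cost_false_positive.
    Flagging is cheaper when p > C_fp / (C_fp + C_fn).
    """
    return cost_false_positive / (cost_false_positive + cost_false_negative)


# With symmetric costs, flag only when the condition is more likely than not.
symmetric = flag_threshold(1, 1)  # 0.5

# Hypothetical asymmetric costs: a missed case loses 500,000, while a
# false alarm forgoes 10,000 of profit. The threshold falls to about 2%.
asymmetric = flag_threshold(10_000, 500_000)
```

The larger the gap between the two costs, the lower the probability at which flagging pays off, which is why an aggressive model can be the rational choice even though it produces more false positives.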
How does Verikai evaluate models?
First, it would be incorrect to say that Verikai does not use or recommend looking at traditional goodness-of-fit metrics. We use them extensively during internal model testing and feature engineering. However, it is crucial to understand their limitations and not be bound to them.
Second, all model development efforts at Verikai revolve around providing ROI through lower loss ratio and improved profitability. The final risk scores that customers receive are the result of hundreds of models identifying individual conditions and the associated cost. Because various models may be more or less aggressive depending on the conditions’ False Positive vs False Negative cost differentials, our analysis during the proof-of-concept/validation exercise largely ignores traditional goodness-of-fit metrics. Instead, it focuses on lift charts and quantile analyses, which make scenario analyses easy.
Academics largely ignore model evaluation tools like lift and gain charts. Still, they are popular among practicing data scientists because they allow a direct comparison of predicted vs actual on the metrics that matter (e.g., profitability), irrespective of the underlying goodness-of-fit metrics. Because lift charts assess book-level performance metrics, they make it easy to perform ROI analyses when analyzing various decision options and business scenarios. For underwriters and actuaries, this makes it easy to estimate what their loss ratio will look like after deploying Verikai and by how much their intended use will improve their financial results for the book.
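A quantile analysis of this kind can be sketched as follows: rank groups by predicted risk, split them into equal-sized buckets, and compare actual loss ratios bucket by bucket. The scores, premiums, and claims below are made-up numbers (in thousands of dollars), and the helper names are hypothetical; this is not Verikai's actual tooling.

```python
def loss_ratio(premiums, claims):
    return sum(claims) / sum(premiums)


def bucket_indices(scores, n_buckets):
    """Rank groups from riskiest to safest and split into equal buckets.

    Assumes len(scores) is divisible by n_buckets.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    size = len(order) // n_buckets
    return [order[b * size:(b + 1) * size] for b in range(n_buckets)]


# Hypothetical book of 10 groups.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
premiums = [100] * 10
claims = [300, 150, 120, 100, 90, 80, 80, 70, 60, 50]

buckets = bucket_indices(scores, n_buckets=5)
bucket_lr = [
    loss_ratio([premiums[i] for i in b], [claims[i] for i in b])
    for b in buckets
]
# bucket_lr is approximately [2.25, 1.1, 0.85, 0.75, 0.55]

overall_lr = loss_ratio(premiums, claims)  # 1.1
lift = [lr / overall_lr for lr in bucket_lr]

# Scenario analysis: what if the riskiest bucket had been declined?
kept = [i for b in buckets[1:] for i in b]
lr_after_decline = loss_ratio(
    [premiums[i] for i in kept], [claims[i] for i in kept]
)  # 0.8125
```

A useful model shows actual loss ratios falling steadily from the riskiest bucket to the safest, and the scenario line translates that ranking directly into a book-level loss ratio.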
Insurers are not alone if they struggle to generate value from their analytics teams; companies across all industries struggle with turning predictive models into improved KPIs. MIT researcher Kalyan Veeramachaneni has cataloged several reasons data science projects fail to deliver business results (Why You’re Not Getting Value from Your Data Science). Chief among them is that the experts building or analyzing models are too often disconnected from the underlying business problem.
With insurance carriers and underwriters, this disconnect frequently becomes apparent during Verikai’s validation process. Actuaries, underwriters, and data scientists often choose some evaluation metric (which may or may not be appropriate for the data at hand) and choose among models using only those scores. Verikai believes this is flawed when considering skewed data like healthcare claims, where costs for some small number of conditions, drugs, and insureds are disproportionately large relative to their frequency. Optimizing for some arbitrary metric may limit the number of high-cost individuals flagged by the models and reduce the ultimate value provided to the business.
Verikai is a “book of business” tool composed of hundreds of machine learning models. Each model and risk score is carefully evaluated with an eye on improving the metrics that matter for an entire book: loss ratio and profitability ($). Improving business KPIs may be achieved by sacrificing performance on common evaluation metrics.
Our forward-looking partners understand that the value in the Verikai platform – and the key to achieving profitable growth – is identifying the small number of highly unprofitable groups. Our partners have not only improved business results by dumping highly unprofitable groups onto other carriers, but they have also aggressively targeted and won new business. Verikai’s partners are winning groups by identifying those massively overpriced by traditional manuals, undercutting competitors, and still making a healthy profit. Finally, without the need to subsidize a small number of highly unprofitable groups, Verikai partners have room to be more competitive on price across the board. All of this in combination has helped our partners achieve the Holy Grail of insurance: lower loss ratio, higher revenue, and higher profit margin.
Interested in exploring ways to differentiate your underwriting capabilities and gain a competitive advantage? Email email@example.com to learn more and see our product in action.