
Credit Risk Meets Machine Learning: 6 Cautionary Takeaways for Banking Executives

  • Writer: Hannah Glasrud
  • Jul 29
  • 9 min read


Credit risk is rising, and with it the importance of accuracy and precision in risk assessment. In banking’s early days, lenders made credit decisions based on their own judgment and intuition. Today’s environment is increasingly complex, and assessing credit risk has become correspondingly more difficult.

 

While traditional statistical approaches (like logistic regression) have long been the norm in data-driven credit risk measurement, they are limited in their ability to fully exploit large, complex datasets. With the advent of big data, institutions of all sizes want to bring this wealth of information into their decision-making. Developers are turning to machine learning methods to unlock the data’s insights and improve every part of the credit lifecycle, from application to payoff (or loss).

 

These new methods may be most impactful when used in the underwriting process. In the race to compete, institutions must make competitive offers to prospective borrowers quickly, or even instantly. Technology can help with this, but not all tools are equally effective at assessing credit risk.

 

When an institution embeds a machine learning model in its decisioning process, the through-the-door borrower population is entirely dependent on the model’s effectiveness. At best, the model accurately captures the risk profile of each new offer. At worst, the model contains hidden biases that systematically increase credit risk in ways that are not easily detected.

 

This is the context in which Extreme Gradient Boosting (commonly known as “XGBoost”) has emerged as a popular and powerful modeling tool.


While the mechanics of XGBoost may sound complex, the fundamentals are not hard to understand. For a deeper dive, see XGBoost Fundamentals for Bankers: The Mechanics Behind the Algorithm.

 

What is XGBoost?


XGBoost is a supervised machine learning algorithm known for scalability and flexibility with modeling complex datasets. At a high level, XGBoost creates an ensemble (large group) of decision trees through gradient boosting (a calculus-based optimization method) and then uses the ensemble to make predictions.
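To make the “ensemble of trees” idea concrete, here is a toy sketch of gradient boosting in plain Python, using one-split decision stumps and squared error. This illustrates the boosting principle only; it is not XGBoost’s actual implementation, which adds regularization, second-order gradients, and much more. All function names here are our own.

```python
# Toy gradient boosting with decision stumps: each new stump is fit to the
# residual errors of the ensemble built so far.

def fit_stump(xs, residuals):
    """Find the single split on x that best reduces squared error of residuals."""
    best = None
    for split in xs:
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        if not left or not right:
            continue
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, split, lmean, rmean)
    _, split, lmean, rmean = best
    return lambda x: lmean if x <= split else rmean

def boost(xs, ys, n_trees=50, learning_rate=0.3):
    base = sum(ys) / len(ys)  # initial prediction: the sample mean
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_trees):
        # For squared error, the negative gradient is simply the residual
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.1, 0.9, 3.0, 3.1, 2.9]  # a noisy step function
model = boost(xs, ys)
print(model(2))  # close to 1
print(model(5))  # close to 3
```

Each stump corrects the residual errors of the ensemble built so far, and the learning rate shrinks each correction; that is the same role the learning rate plays in XGBoost (Takeaway 4 below).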


XGBoost has won numerous machine learning contests for its notable performance, flexibility, and efficiency. But behind all its shine and novelty, there are some hard truths. This new way of modeling requires more data, and machine learning algorithms like XGBoost are only as good as the data they are exposed to during the learning process. This brings us to the first of six key things that management needs to know.

 

Takeaway 1: XGBoost’s performance is entirely dependent on the data.

 

To use XGBoost to automate underwriting (or produce a score that is then used in the underwriting process), modelers must have significant data to make a sound and reliable model.

 

Many models can differentiate the extreme borrowers, i.e., those who are very unlikely to default and those who almost always default. What separates great underwriting models from average ones is the ability to correctly identify the “on the line” applicants. This middle ground is difficult to model because so many variables are required to separate the herd.

 

XGBoost can in theory perform this very difficult task (separating good from bad “on the line” applicants), but in practice it is often limited by the amount and quality of the input data.

 

Thus, the need for a large dataset of applicants. Note that “large” is not measured simply by the number of records, but also the robustness of the data used to train the model.  

 

Ideally, a data set will:

  • Cover all demographics that an institution lends to now and possibly in the future

  • Differentiate between loan offerings

  • Contain a significant number of records (think: millions, not thousands)

  • Cover performance of denied applications

  • Contain historical metrics of borrower performance

  • Span as many economic conditions as possible (such as periods of high inflation) that may affect a borrower’s likelihood of repayment, whether or not economic variables are factors considered by the model. This is particularly important for producing reliable estimates for a sub-pool such as the “on the line” applicants.

 

If an institution does not have a dataset this robust, that does not necessarily mean its credit department is not using an XGBoost model! DCG has seen (and validated) XGBoost models built on only a few years of data. If an institution has built a model on a less-than-ideal dataset, it is critical that its management team understand the limitations of its output.

 

Takeaway 2: An XGBoost model's performance is limited by its training.

 

Like any algorithm, XGBoost has its limits. It is important that management understand those limits and be prepared to override an XGBoost model’s output when it approaches them.

 

A key contributor to an XGBoost model’s limitations is its training process.

 

XGBoost can only base its estimates on what it has been exposed to. When operating in a new environment (i.e., extrapolating), an XGBoost model is likely to underperform. For example, DCG has seen XGBoost models built on only a few years of pre-COVID data, with no data from the Great Recession. Those models learned from a relatively calm and reliable lending period; their output may therefore be too optimistic during a credit downturn.

 

XGBoost models may also fail or underperform if there is significant drift (or shift) in the data between when the model was trained and when it is applied. As the name hints, “drift” is a movement in the distribution of the data. Model owners should continuously monitor for drift, which can occur either in the data fed into the model (the factors) or in its output (the response, i.e., the probability of default).
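As an illustration of drift monitoring, one widely used measure (our choice here; the article does not prescribe a specific metric) is the Population Stability Index (PSI), which compares the distribution of a variable at training time against its current distribution:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline (training-time) sample
    and a current (production) sample of a numeric variable."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def bucket_shares(values):
        counts = [0] * bins
        for v in values:
            idx = sum(v > e for e in edges)  # bin index 0..bins-1
            counts[idx] += 1
        # Floor each share so the log term below is always defined
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = bucket_shares(expected), bucket_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [i / 100 for i in range(100)]                    # scores at training time
shifted = [min(i / 100 + 0.3, 0.99) for i in range(100)]    # upward-drifted scores

print(psi(baseline, baseline))      # 0: identical populations
print(psi(baseline, shifted) > 0.25)  # True: population has moved materially
```

A common rule of thumb treats PSI below 0.1 as stable, 0.1 to 0.25 as worth investigating, and above 0.25 as significant drift, though each institution should set its own monitoring thresholds.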

 

Takeaway 3: An XGBoost model’s objective must dictate its data preparation.

 

We have seen that the broader the experience XGBoost can learn from, the better the model will perform when facing new information. Accordingly, during development, management must first consider the model’s objective in order to prepare the training data effectively.

 

For example, to model the likelihood of a car running properly today, one would consider how well the car has historically performed, its age, and other factors leading up to today. The car’s lifetime matters. In this instance, the data the model is learning from should be sampled sequentially because the order of the data matters in the problem the model is trying to solve.

 

In a through-the-door problem such as underwriting, where each applicant is independent of the others, order does not matter. Data preparation should instead give the model the most representative cross-section of experience to learn from. This points to a method like stratified or randomized sampling to create the training data.
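A minimal sketch of stratified sampling in Python (field names and counts are hypothetical): each stratum, such as a loan product, contributes the same fraction of records to the training set, so smaller products are not crowded out by random chance:

```python
import random

def stratified_sample(records, key, frac, seed=0):
    """Draw the same fraction of records from each stratum (e.g., loan product)."""
    rng = random.Random(seed)
    strata = {}
    for r in records:
        strata.setdefault(key(r), []).append(r)
    sample = []
    for group in strata.values():
        k = max(1, round(frac * len(group)))
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical applicant pool: 700 auto loans, 300 HELOCs
applicants = (
    [{"product": "auto", "score": s} for s in range(700)]
    + [{"product": "heloc", "score": s} for s in range(300)]
)
train = stratified_sample(applicants, key=lambda r: r["product"], frac=0.8)
print(len(train))  # 800 records: 560 auto + 240 heloc, matching the 70/30 mix
```

A purely random 80% draw would usually land near the same proportions, but stratification guarantees them, which matters most for rare sub-pools.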

 

Takeaway 4: Tuning matters – greatly! Do not settle for default values.

 

In machine learning models, including XGBoost, there is a process called tuning. This is where all the model’s parameters are adjusted to find the settings that produce the best predictions, like carefully turning the knob on an old radio to hear the best sound from an out-of-town station.  

 

Modelers may tune many parameters in XGBoost, and the model's performance is very sensitive to the tuning process. Developers and management should review various performance metrics to assess the quality of the model as it is being tuned.
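Tuning is, at its simplest, a search over candidate parameter settings, keeping the combination with the best validation metric. A minimal grid-search sketch follows; the evaluation function here is a stand-in of our own invention, where a real workflow would train the model and score a holdout set:

```python
import itertools

def evaluate(max_depth, eta):
    """Stand-in for training + validation logloss; lower is better.
    This toy surface has its minimum at max_depth=4, eta=0.1."""
    return (max_depth - 4) ** 2 * 0.01 + (eta - 0.1) ** 2

# Candidate settings for two hyperparameters
grid = {"max_depth": [2, 4, 6], "eta": [0.05, 0.1, 0.3]}

# Try every combination and keep the one with the best validation score
combos = itertools.product(grid["max_depth"], grid["eta"])
best = min(combos, key=lambda c: evaluate(*c))
print(best)  # (4, 0.1)
```

In practice the search space is larger and methods such as random search or Bayesian optimization are often preferred over an exhaustive grid, but the principle of selecting on a validation metric is the same.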

 

Here are the top 5 tuning parameters that affect model performance:

 

1)    Objective. The model owner chooses this function based on the problem being modeled. For example, to classify a potential borrower as likely or unlikely to default on a loan, a developer would set the objective function to "binary:logistic." Choosing the wrong objective function produces the wrong model, so this choice is essential.

 

2)    Evaluation. This is the metric used to evaluate the model’s performance on the validation data; its default value depends on the selected objective function. Classification problems (as in the example above of a borrower defaulting or not) default to logloss, the penalty paid for inaccurate predicted probabilities, but developers may use many other evaluation metrics. The important thing is to choose the metric that best measures the model’s objective.

 

3)    Gamma, max depth, min child weight, lambda. Modelers use these four parameters to limit a model’s complexity in order to help prevent overfitting, which leads to deteriorating performance.

 

4)    Learning rate. The learning rate controls how much the model adjusts (learns) after each iteration. Increasing the learning rate speeds up training, but the model may not learn optimally. A slower learning rate gives the model more exposure to the data, but requires more time to train.

 

5)    Scale pos weight. This parameter controls the weighting of the positive class relative to the negative class, which is helpful when the data is imbalanced (like most credit default data). The default value gives equal weight to both classes, so it is important that the modeler adjust it for imbalanced data.
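Pulling the five items above together, these parameters would appear in an XGBoost parameter dictionary roughly as follows. All values here are illustrative placeholders for discussion, not tuned recommendations, and the portfolio counts are hypothetical:

```python
# Illustrative XGBoost parameter dictionary for a default-prediction model.

n_defaults, n_non_defaults = 2_000, 98_000  # hypothetical imbalanced portfolio

params = {
    "objective": "binary:logistic",  # 1) classify: default / no default
    "eval_metric": "logloss",        # 2) validation metric for classification
    "gamma": 1.0,                    # 3) minimum loss reduction to make a split
    "max_depth": 4,                  # 3) cap tree depth to limit complexity
    "min_child_weight": 10,          # 3) minimum instance weight per leaf
    "lambda": 1.0,                   # 3) L2 regularization on leaf weights
    "eta": 0.1,                      # 4) learning rate: shrink each tree's step
    # 5) Upweight the rare default class by the imbalance ratio:
    "scale_pos_weight": n_non_defaults / n_defaults,
}
print(params["scale_pos_weight"])  # 49.0
```

A dictionary like this would typically be passed to XGBoost’s training call alongside the prepared training and validation data; the point is that every line above is a deliberate choice, not a default.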

 

Takeaway 5: As of today, the evidence indicates that underwriting models should not be purely autonomous and hands-free.

 

Sometimes the best defense is the first and second line. Machines are no substitute for human experience, wisdom, or insight. Humans should always oversee a machine learning-built model through active monitoring of its performance and data.

 

In our work with clients, DCG has observed several common instances when intervention could and should happen:


  • Model Performance Deterioration. Model owners should actively monitor a model’s ability to predict the performance of its borrowers. When the model’s overall performance, or its performance for a given sub-group, deteriorates, the model owner should perform manual reviews alongside the model output and may also adjust the classification threshold to make more conservative predictions. This is particularly relevant for “on the line” applicants.


  • New Products. Model owners should expect more volatile results for new products, as there is less robust historical information related to these products. Until the model can produce reliable results for new products, the model owner should institute manual reviews.


  • Noticeable Drift. If there is noticeable drift in either the applicant pool or the data factors, an XGBoost model may no longer be the best predictor. Drift can indicate a need for redevelopment (to incorporate the shifted information), more frequent human oversight, and potential intervention.
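The threshold adjustment mentioned under performance deterioration can be as simple as tightening the probability cutoff at which an application is routed to manual review. A hypothetical sketch, with function and variable names of our own:

```python
def decision(p_default, threshold):
    """Approve automatically only when predicted default probability
    is below the cutoff; otherwise route to a human underwriter."""
    return "approve" if p_default < threshold else "manual review"

applicant_scores = [0.02, 0.11, 0.18, 0.40]  # model-predicted default probabilities

normal = [decision(p, 0.20) for p in applicant_scores]
conservative = [decision(p, 0.10) for p in applicant_scores]  # tightened cutoff

print(normal)        # ['approve', 'approve', 'approve', 'manual review']
print(conservative)  # ['approve', 'manual review', 'manual review', 'manual review']
```

Lowering the cutoff sends more of the borderline applicants to human review, trading throughput for caution while the model’s reliability is in question.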

 

Takeaway 6: Management should require regular XGBoost model validations.

 

After development, wise management teams require effective challenge of these complex models by validators who thoroughly understand both banking and the mathematics involved. While many modelers can build an XGBoost model, not all understand each parameter's nuances for banking or can determine whether the model was built correctly and optimally.

 

A robust validation by an experienced bank validator should include:


  • Replication and Developmental Testing. This confirms that the model was built and performs as documented. Replication also allows the validator to build potential challenger models and to run tests around the model that the developers may or may not have provided.


  • Implementation Review. A thorough validation should include review of the model’s implementation beyond strictly training the algorithm. This encompasses how the model fits within the intended process, such as underwriting. The validation should consider classification thresholds or cutoffs, decision matrices or scorecards, and control testing.


  • Ongoing Performance Monitoring. XGBoost models are more of a black box than other modeling types, such as traditional regression models. Ongoing performance monitoring is the essential mechanism for assessing a model’s reliability and revealing model risk. The validation must critically evaluate management’s efforts to build a strong monitoring framework, including how monitoring is implemented for vendor-developed models.


Contact DCG if you have questions about credit modeling, XGBoost, or machine learning in banking.



ABOUT THE AUTHORS


Hannah Glasrud is a Quantitative Consultant at Darling Consulting Group. In her role, she performs model validations and runs DCG’s CECL model. Throughout her career she has worked with a variety of models including CECL, DFAST, capital aggregators, credit, and default rate models. Her work includes validating machine learning models.


Hannah began her career at DCG in the Data Analytics Group as a business systems analyst where she performed prepayment and deposit studies for community banks. From there, she was promoted to Quantitative Analyst where she assisted in model validations and model risk management engagements.


Using her technical knowledge and coding experience, Hannah produces automated reporting tools for various areas within the company and performs code reviews for models written in SAS, R, and Python. She also developed and maintains DCG’s CECL model.


Hannah earned a BA in Economics from Boston University and an MA in Analytics from Georgia Tech.

 

J. Chase Ogden is a Quantitative Consultant with Darling Consulting Group's Quantitative Risk Analysis and Strategy team. As a practitioner at large and mid-sized financial institutions, Chase has experience in a wide array of modeling approaches, applications, and techniques, including asset/liability models, pricing and profitability, capital models, credit risk and reserve models, operational risk models, deposit studies, prepayment models, branch site analytics, associate goals and incentives, customer attrition models, householding algorithms, and next-most-likely product association. 


Chase is a graduate of the University of Mississippi and holds master’s degrees in international commerce policy and applied statistics from George Mason University and the University of Alabama, respectively. A teacher at heart, Chase frequents as an adjunct instructor of mathematics and statistics.


© 2025 Darling Consulting Group, Inc.
