With a Credit Card Fraud Example
Machine Learning has become the buzzword of the past decade, and its popularity is not winding down. With more professionals from other fields and new graduates stepping into this area, I feel a thorough guide covering not only the ML perspective but also the business perspective would be very helpful. I learned a lot in the past few years developing machine learning models, and even more through understanding the business. It was a struggle for me to go through all of the challenges, and I hope you won't have to go through them too. In this article, I will try my best to talk about ML by combining both perspectives in an overview and highlighting some of the parts that are often overlooked. I will be as comprehensive as possible, with the following outline:
- Problem Statement
- Y clearly defined
- Data understanding
- Data collection
- Data cleaning
- Data Readiness
- Feature selection
- Model training
- Model validation
- Documentation & performance tracking
If you are tasked with building a machine learning model, there is a good chance that a problem is waiting to be solved. And understanding the problem is half the solution. Don't wait for the questions to find you; stay proactive and ask questions, and the answers will follow. If you don't know what to ask, you can always start with the business. What does your department do, what goals is it trying to achieve, who are your partners, and what are your upstream and downstream businesses? These will help you learn the big picture. Once you have learned the big picture, ask about the project. It will make your life much simpler in understanding the task.
Your stakeholders are the requesters, the users, and the engineers: basically everyone who will be impacted by the results of the model. Identifying them will enable you to have clear and comprehensive communication with each team, manage their expectations, and get the most accurate and genuine feedback from your partners.
Once you understand the task and have identified the right stakeholders, assess how much time you will need to develop the model, and update your stakeholders to align expectations. I think the biggest reason to have a timeline is not to deliver the project on time (if you can, great); it is to adjust and realign with your stakeholders when an unexpected event happens. In actual projects, these unexpected events happen more often than anyone can imagine.
Y — Clearly defined
While this is not much of a problem when you are in a traditional business like banking, insurance, or retail, it can be huge in startups. Taking my current job in a startup as an example, credit loss can be mixed with first-party fraud, which can in turn be mingled with third-party fraud. Yes, you can still model on all the bad actors mingled together, but it will impair your model's accuracy because the user behaviors of each fraud type are underrepresented. You can refer to my guide on how to build a fraud team before building models.
Once you have your fraud type clearly defined and properly recorded in a database, you will have a much easier life with modeling. However, there are still some overlooked constraints you will face when defining the Y: time-bound, data-bound, and business-bound constraints. We will talk about each in detail.
When I was working with new-application identity fraud, victims came in to report ID theft every day. Some reported within 5 days of account opening, some took up to 30 days, and some extreme cases were reported after 10 years. Yes, more than 3,650 days passed before those victims realized they had been defrauded. You get the point: the problem here is how to make a reasonable time cutoff to decide which cases count as fraud and which do not. One solution is to use an acceleration graph: plot the cumulative fraud rate against the reporting lag and pick the point where it tapers off and barely increases.
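To make the cutoff idea concrete, here is a minimal sketch with made-up reporting lags; `fraud_capture_cutoff` is a hypothetical helper that picks the smallest cutoff capturing a chosen share of reported fraud:

```python
import numpy as np

# Hypothetical reporting lags (days from account open to the ID-theft report).
lags = np.array([2, 3, 5, 5, 8, 12, 14, 21, 25, 30, 45, 60, 90, 180, 400])

def fraud_capture_cutoff(lags, capture=0.95):
    """Smallest cutoff (in days) that captures `capture` of reported fraud."""
    lags = np.sort(lags)
    # Cumulative share of fraud reports captured at each observed lag.
    cum_share = np.arange(1, len(lags) + 1) / len(lags)
    return int(lags[np.searchsorted(cum_share, capture)])

cutoff = fraud_capture_cutoff(lags, capture=0.9)
print(cutoff)  # 180: labeling fraud within 180 days captures 90% of reports
```

In practice you would plot `cum_share` against `lags` and eyeball where the curve flattens, rather than trusting a single threshold blindly.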
A data-bound constraint can come from a new vendor your company just signed with, or from a new feature just developed and implemented; in either case, the data was not available for the full time window of your dataset. In some odd cases, your business only sent a few selected cases to the vendor to reduce costs, resulting in an incomplete dataset. One solution is to leave the vendor data for the next iteration; alternatively, you can choose a time period that minimally impacts your model.
A business-bound constraint focuses more on the processes of the business, which then translates into a data-bound constraint. Let's use customer acquisition as an example: customer eligibility checks usually come before fraud checks, so if applications are denied for acquisition reasons, you don't need to consider them as fraud in the first place.
All in all, having a proper definition of your target variable, with all the constraints considered, is extremely important and can strongly influence your model's performance in the end.
Much like with the Y, it is also important that you understand the processes of your business for the independent variables. Why do you need to understand your data? Can't you just feed as much data into the model as possible and let the model do the work for you? Unfortunately, however intelligent it may sound, machine learning is not able to learn which process takes precedence. So in order to let the model help you, you first need to help the model by encoding those processes.
For example, transactional fraud checks usually come after policy checks, like CVV checks and expiration date checks. If these do not pass, you don't even need to consider whether a transaction is fraud.
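As a sketch of how that exclusion might look in code, assuming a hypothetical transaction table with boolean policy-check columns:

```python
import pandas as pd

# Hypothetical transaction log: policy checks run before any fraud labeling.
txns = pd.DataFrame({
    "txn_id":   [1, 2, 3, 4, 5],
    "cvv_pass": [True, True, False, True, True],
    "exp_pass": [True, True, True, False, True],
    "is_fraud": [0, 1, 0, 0, 1],
})

# Only transactions that passed every upstream policy check should enter the
# fraud-modeling population; the rest were never eligible to be fraud.
eligible = txns[txns["cvv_pass"] & txns["exp_pass"]]
print(len(eligible))  # 3 of 5 transactions remain in the modeling population
```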
Data is the new oil, and whoever possesses the largest quantity of it will be the champion of this era. So try to collect as much data as possible before building the next iteration of your model. Flip through your company's documentation and talk to your partners about what data and vendors they are using; whatever gets you more data, do it.
The idea that more data is better sits on the premise that it will bring you new information. Because you don't know which feature will bring that additional information, you tend to collect as much as possible and discard most of it through feature selection. So if you can collect quality data in the first place, you probably don't need big data; that is the new trend of "small data".
Unfortunately, data does not come as clean as you may think, especially in startups where MVP/agile is the go-to way to build the next unicorn. After your data collection, you need to ensure the integrity of your data. Completeness, consistency, and accuracy are crucial to a successful model. Major things to check:
- Missing values (depends on the model)
- Data conformity (Capitalization, Data types, Length etc.)
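A rough sketch of these checks, using a made-up applications table (the column names and rules are assumptions for illustration):

```python
import pandas as pd

# Hypothetical raw applications table with typical startup-data problems.
raw = pd.DataFrame({
    "state":  ["CA", "ca", " NY", "TX", None],
    "zip":    ["94105", "9410", "10001", "73301", "94105"],
    "income": ["72000", "55000", None, "88000", "61000"],
})

report = {
    "missing": raw.isna().sum().to_dict(),                  # completeness
    "bad_zip_len": int((raw["zip"].str.len() != 5).sum()),  # length conformity
}

clean = raw.copy()
clean["state"] = clean["state"].str.strip().str.upper()  # capitalization
clean["income"] = pd.to_numeric(clean["income"])         # data types
```

Whether to drop or impute the remaining missing values depends on the model, as noted above: tree-based GBMs can often handle them natively, while logistic regression cannot.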
The difference to me between data cleaning and data readiness is that data cleaning fixes the corruption of the data, whereas data readiness improves the performance of the model. Traditionally, data cleaning and data readiness are done together; however, with the emergence of big data and data engineering, these two processes are better separated. Let the data engineer do the integration and cleaning, and let the modeler do the readiness. The reason is that data cleaned by the data engineer serves all teams, while the modeler uses the data for a specific business task. We can separate data readiness into two parts: numeric and categorical.
Numeric:
- Transformation of the distribution (check with Box-Cox)
Categorical:
- Level reduction
- Dummy variable creation
- Converting discrete numeric variables
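The numeric and categorical readiness steps above might look like this minimal sketch (made-up data; the rare-level threshold of 2 is an arbitrary illustration):

```python
import pandas as pd
from scipy.stats import boxcox

df = pd.DataFrame({
    "amount":  [12.0, 250.0, 8.0, 1900.0, 55.0, 430.0],
    "channel": ["web", "web", "app", "kiosk", "app", "fax"],
})

# Numeric readiness: Box-Cox requires strictly positive values; it returns
# the transformed series and the fitted lambda.
df["amount_bc"], lam = boxcox(df["amount"])

# Categorical readiness: collapse rare levels first (level reduction)...
counts = df["channel"].value_counts()
rare = counts[counts < 2].index  # levels seen fewer than 2 times
df["channel"] = df["channel"].where(~df["channel"].isin(rare), "other")

# ...then create dummy variables.
df = pd.get_dummies(df, columns=["channel"])
```

Level reduction before dummy creation matters: without it, every rare level becomes its own near-empty dummy column.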
EDA is a process to understand the characteristics of the data; you can check out my article on EDA for details.
Feature selection is necessary because too much data is collected during data collection, and an overwhelming amount of it does not actually contribute to the performance of the model. Mathematically speaking, the more features you collect, the higher the likelihood of a new feature forming a linear combination with existing features (multicollinearity). A second reason is model overfitting, and a third is computational cost: handling a very large number of features is still slow, which is partly why new machine learning chips are being developed to process even more features and data.
First and foremost, before diving right into machine-learning-based feature selection, check for meaningless variables like IDs, keys, unary features, reason-code features, etc. These are low-hanging fruit, and you will see a lot of them. By removing them, your next round of feature selection will be much more accurate and meaningful.
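A minimal sketch of this first pass, with a made-up table (the heuristic of dropping object-typed columns that are unique per row is an illustrative assumption, and reason codes are passed in explicitly since they leak the outcome):

```python
import pandas as pd

df = pd.DataFrame({
    "application_id": ["A101", "A102", "A103", "A104"],  # key: unique per row
    "country":        ["US", "US", "US", "US"],          # unary feature
    "decline_reason": ["R1", None, "R7", None],          # reason code: leaks outcome
    "age":            [34, 51, 29, 42],
    "income":         [60, 90, 45, 80],
})

def drop_meaningless(df, leak_cols=()):
    drop = set(leak_cols)
    for col in df.columns:
        uniq = df[col].nunique(dropna=True)
        if uniq <= 1:
            drop.add(col)  # unary: carries no information
        elif uniq == len(df) and df[col].dtype == object:
            drop.add(col)  # string column unique per row: an ID or key
    return df.drop(columns=sorted(drop))

features = drop_meaningless(df, leak_cols=["decline_reason"])
```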
Next, you may be confused by all the online articles talking about feature selection: forward, backward, stepwise, decision tree, RFE, L1 regularization, etc. Forget all these theoretical selections and stick with GBM's variable importance: empirically, retain the features covering 95% of cumulative importance and remove the bottom 5%.
Lastly, run variable importance and a correlation test again through multiple iterations; if a correlation coefficient is greater than a threshold (empirically >0.6), drop the less important variable of the pair.
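The importance-then-correlation pruning described above could be sketched like this, with hypothetical importances standing in for a fitted GBM's `feature_importances_` and a deliberately near-duplicate feature to prune:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "util":  rng.normal(size=n),
    "age":   rng.normal(size=n),
    "spend": rng.normal(size=n),
})
X["util_copy"] = X["util"] * 0.98 + rng.normal(scale=0.05, size=n)  # near-duplicate

# Hypothetical importances (in practice: the fitted GBM's feature_importances_).
imp = pd.Series({"util": 0.45, "util_copy": 0.30, "spend": 0.21, "age": 0.04})

# Step 1: keep features covering 95% of cumulative importance.
sorted_imp = imp.sort_values(ascending=False)
cum_before = sorted_imp.cumsum() - sorted_imp
kept = list(sorted_imp[cum_before < 0.95].index)

# Step 2: among kept features, for any pair with |correlation| > 0.6,
# drop the less important one.
corr = X[kept].corr().abs()
for a in kept[:]:
    for b in kept[:]:
        if a != b and a in kept and b in kept and corr.loc[a, b] > 0.6:
            kept.remove(a if imp[a] < imp[b] else b)
```

Here `util_copy` survives the importance cut but is dropped by the correlation test, since it duplicates the more important `util`.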
Altogether, the listed steps of feature selection will reduce the feature set by about 70% – 80%. Once you have the final list of features, you can feed them into your model for training.
Model assumptions are often overlooked and ignored, but knowing them will help you understand what kinds of problems you will face and how to prevent them in the future. There are two types of assumptions: model assumptions and data assumptions.
Model assumptions come from the model itself; for example, linear regression needs five assumptions (linearity, independence of errors, homoscedasticity, normality of residuals, and no multicollinearity) to hold true for its results to be valid.
I sometimes call data assumptions external assumptions, because many of them are impacted by external factors: for example, that macroeconomic conditions are stable, that the model's predictability on historical data is reflective of future data, and that attributes remain stable.
Now that you have the final list of features, it is time to fit models. Academically, you can use the data to fit multiple models and select the best one based on model performance. In reality you still can, but most of the time we already know that neural nets or GBM will outperform logistic regression or a decision tree. So we can cut to the chase and go straight to the one you think will be the best. Sometimes, though, you can use logistic regression to form a baseline and see how much lift GBM brings you.
Because of the complexity (data + features), when we are tuning the hyper-parameter space we can segment it into sub-spaces and tune one sub-space at a time while the others are held fixed. When one sub-space is optimized, we move on to the next until all of the sub-spaces are tuned.
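A toy sketch of this sub-space-at-a-time tuning, with a made-up scoring surface standing in for a real cross-validated model score (the parameter names and grids are illustrative assumptions):

```python
from itertools import product

# Hypothetical scoring function; in practice this would be a cross-validated
# model score. The toy optimum is depth=6, lr=0.1, subsample=0.8.
def cv_score(params):
    return -((params["depth"] - 6) ** 2
             + 100 * (params["lr"] - 0.1) ** 2
             + (params["subsample"] - 0.8) ** 2)

subspaces = [
    {"depth": [3, 6, 9], "lr": [0.05, 0.1, 0.3]},  # tree structure + learning rate
    {"subsample": [0.6, 0.8, 1.0]},                # sampling, tuned afterwards
]

best = {"depth": 3, "lr": 0.3, "subsample": 1.0}   # starting point
for space in subspaces:
    keys = list(space)
    # Grid-search this sub-space while all other parameters stay fixed.
    top = max(product(*space.values()),
              key=lambda vals: cv_score({**best, **dict(zip(keys, vals))}))
    best.update(dict(zip(keys, top)))
```

This is a greedy coordinate-wise search: it is much cheaper than a full grid over all parameters, at the cost of possibly missing interactions between sub-spaces.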
After the model is fitted, we need to test the goodness of the model on different datasets: first on a validation set, and second on out-of-time datasets if available. These ensure your model is stable over time and will not deteriorate.
Earlier, in the data assumptions, we talked about the economy, data, and patterns being stable, but how can we track them? We can use the population stability index (PSI). We can also track individual features and how changes in their importance impact model performance. These are very important because very likely all of your business's application, transaction, and acquisition decisions are made off of these model results. If one thing goes down, everything can go wrong. You have already spent so much time and energy developing the model; don't let external factors ruin it.
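A common way to compute PSI is to bin the baseline scores into quantiles and compare bin frequencies between the baseline and the current population; here is a minimal sketch with simulated score distributions:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a current sample."""
    # Quantile edges from the baseline; outermost bins are open-ended.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))[1:-1]
    e_cnt = np.bincount(np.digitize(expected, edges), minlength=bins)
    a_cnt = np.bincount(np.digitize(actual, edges), minlength=bins)
    e_pct = np.clip(e_cnt / len(expected), 1e-6, None)  # avoid log(0)
    a_pct = np.clip(a_cnt / len(actual), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(1)
baseline = rng.normal(0, 1, 10_000)   # development-time scores
stable   = rng.normal(0, 1, 10_000)   # same population later
shifted  = rng.normal(0.5, 1, 10_000) # drifted population
```

A widely used rule of thumb reads PSI below 0.1 as stable and above 0.25 as a major shift; here `stable` lands well under 0.1 while `shifted` lands well above it.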
Finally, good documentation enables you to communicate with your stakeholders effectively and efficiently, and serves as the record of all the work you have done. Having all the details presented to the right parties also allows you to be properly audited (as you may be in the financial sector) for quality assurance as well as for sensitive information. Let documentation be your friend, and you will benefit from it in the long run.
This is what I have learned from my experience in the past few years. There are other guides that are also great, but not as detailed or as business-practical; that's why I wrote this one. I hope it can be helpful to whoever reads it. Lastly, machine learning is always changing and advancing as it becomes a go-to solution for most companies. MLOps was developed for exactly that purpose, to streamline the entire workflow, but the core steps will not change much. Hope this helps!