Supplier invoice fraud is a significant problem for businesses. Every year, millions are lost to incorrect or invalid invoices submitted by suppliers. The machine-learning approach outlined here classifies invoices as fraudulent or legitimate with up to 97% accuracy before they become due for payment.
The example used here is based on a database of ~90k invoices over a five-year period from a services company that, in turn, purchased other services and goods for resale. During this period, the company had a persistent issue with supplier invoices being submitted for inflated amounts, for mistaken amounts and occasionally for goods or services that were never provided.
After careful manual analysis, many invoices were tagged as fraudulent and these are used for training the model here. It’s worth noting that the term ‘fraudulent’ here does not necessarily mean intentionally fraudulent but also includes innocent data entry mistakes.
To begin, we will load the data into a pandas DataFrame and tidy up some missing values:
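As a minimal sketch of this loading step, the snippet below reads a small inline sample in place of the real invoice export. The column names (`supplier`, `invoice_date`, `amount`, `description`, `fraudulent`) and the rules for handling missing values are assumptions made for illustration, not the article's actual schema.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real invoice export; column names are assumptions.
SAMPLE_CSV = """supplier,invoice_date,amount,description,fraudulent
Acme Ltd,2019-01-15,1250.00,Monthly hosting services,0
Acme Ltd,2019-02-15,,Monthly hosting services,0
Bolt Supplies,2019-07-03,9000.00,,1
Bolt Supplies,2019-07-20,412.37,Cable and connectors,
"""


def load_invoices(source):
    """Read invoices into a DataFrame and tidy up missing values."""
    df = pd.read_csv(source, parse_dates=["invoice_date"])
    # An invoice without an amount cannot be scored, so drop it.
    df = df.dropna(subset=["amount"]).reset_index(drop=True)
    # A missing description becomes an empty string, so its length is 0.
    df["description"] = df["description"].fillna("")
    # Untagged invoices are treated as not fraudulent (an assumption).
    df["fraudulent"] = df["fraudulent"].fillna(0).astype(int)
    return df


invoices = load_invoices(io.StringIO(SAMPLE_CSV))
print(invoices)
```

With a real export, the same function could be given a file path instead of the in-memory sample.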
Before a model can be trained, a significant amount of feature engineering is needed. The first item explored is an application of Benford’s Law. This states that, in a naturally occurring dataset, the leading digit is more likely to be a small digit than a large one.
The calculation is based on the fact that the logarithms of the numbers are uniformly distributed, not the numbers themselves. As a result, the probability of a leading digit d is log10(1 + 1/d).
Based on this, the distribution of the first digit of each invoice amount should look like this:
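The expected proportions can be reproduced from the standard Benford formula, log10(1 + 1/d) for a leading digit d from 1 to 9. The short sketch below prints them; the function name `benford_expected` is chosen here for illustration.

```python
import math


def benford_expected(digit):
    """Expected probability that a number's leading digit equals `digit` (1 to 9)."""
    return math.log10(1 + 1 / digit)


# Print the expected share of each leading digit as a percentage.
for d in range(1, 10):
    print(d, f"{benford_expected(d):.1%}")
```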
Under Benford’s Law there is a roughly 30% chance of the first digit being a 1 and a roughly 4.6% chance of it being a 9. Analysis of the invoice dataset reveals the real-world distribution to be:
We can see immediately that the digit 9 occurs more often than we would expect. The digit 0 also appears, because invoices with amounts below 1 have 0 as their first digit.
To make use of Benford’s Law in the model, we will separate the first digit of the invoice amount into a new column called ‘first_digit’.
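One way to derive this column is sketched below on a handful of made-up amounts. It assumes the amount lives in a column named `amount`, and it takes the first digit of the amount as written, so that values below 1 yield 0, matching the behaviour described above.

```python
import pandas as pd

# Hypothetical sample; the column name `amount` is an assumption.
invoices = pd.DataFrame({"amount": [1250.00, 0.75, 9000.00, 412.37, 98.10]})


def first_digit(amount):
    """Return the first digit of the amount as written, so 0.75 gives 0."""
    # Fixed-point formatting avoids scientific notation for extreme values.
    return int(f"{abs(amount):.2f}"[0])


invoices["first_digit"] = invoices["amount"].apply(first_digit)

# Share of invoices per leading digit, for comparison with the Benford expectation.
observed = invoices["first_digit"].value_counts(normalize=True).sort_index()
print(invoices)
print(observed)
```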
Next we will generate a z-score for each invoice. This is a measure of how many standard deviations a value lies from the mean of its group. In this case we will group the invoices by supplier and calculate a mean and standard deviation of the invoice amount for each.
This is generated using the code below:
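A minimal sketch of a per-supplier z-score is shown here, assuming columns named `supplier` and `amount`. The guard for suppliers with no measurable spread (for example a single invoice) is an added assumption rather than something the article specifies.

```python
import pandas as pd

# Hypothetical sample data; column names are assumptions.
invoices = pd.DataFrame({
    "supplier": ["Acme", "Acme", "Acme", "Bolt", "Bolt", "Bolt"],
    "amount": [100.0, 110.0, 90.0, 500.0, 520.0, 900.0],
})

grouped = invoices.groupby("supplier")["amount"]
mean = grouped.transform("mean")
std = grouped.transform("std")  # sample standard deviation (ddof=1)

# Where the spread is zero or undefined, fall back to a z-score of 0.
invoices["zscore"] = ((invoices["amount"] - mean) / std.where(std > 0)).fillna(0.0)
print(invoices)
```

Using `transform` keeps the result aligned with the original rows, so the z-score can be assigned directly as a new column.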
Anecdotally, it was reported that the incidence of fraud was higher in the summer months. While there was no obvious explanation for this, it was decided to investigate further.
To do this, the invoice dates are converted into a new feature based on the number of days since the 1st of January of the invoice’s year. This allows all years to be processed equally.
Because the 31st of December is adjacent to the 1st of January, the day value needs to be converted into a circular value. This is achieved by dividing the day value by the 365 days of the year, scaling the result to a full cycle of 2π radians, and taking the cosine. Plotting this value gives the below chart where each value is between -1 and +1:
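The conversion can be sketched as follows, assuming a datetime column named `invoice_date`. The factor of 2π is what maps one year onto a full turn, so that the result stays between -1 and +1 and the end of December lands next to the start of January.

```python
import numpy as np
import pandas as pd

# Hypothetical sample dates; the column name `invoice_date` is an assumption.
invoices = pd.DataFrame({
    "invoice_date": pd.to_datetime(["2018-01-01", "2018-07-02", "2019-12-31"]),
})

# Days elapsed since 1 January of the invoice's own year (0 for 1 January).
invoices["day_of_year"] = invoices["invoice_date"].dt.dayofyear - 1

# Map the day onto a circle: one year corresponds to a full turn of 2*pi radians.
invoices["day_cos"] = np.cos(2 * np.pi * invoices["day_of_year"] / 365)
print(invoices)
```

Note that a cosine alone gives the same value to days equally far before and after the 1st of January, so it separates summer from winter but not spring from autumn.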
Finally, we will create two more features. Anecdotal reports also suggested that fraudulent invoices were often for whole-number amounts, without decimals, and that they tended to have shorter descriptions.
To test this, one new feature was added to flag whether the amount contained a decimal or not, and a further one for the length of the invoice description.
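Both features are straightforward to derive. The sketch below assumes columns named `amount` and `description`, and the output column names `has_decimal` and `description_length` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical sample data; column names are assumptions.
invoices = pd.DataFrame({
    "amount": [1500.00, 412.37, 9000.00],
    "description": ["Consulting", "Cable and connectors for site B", None],
})

# 1 if the amount has a fractional part, 0 for a whole-number amount.
# Converting to whole cents first avoids floating-point noise.
cents = (invoices["amount"] * 100).round().astype(int)
invoices["has_decimal"] = (cents % 100 != 0).astype(int)

# Length of the description in characters; missing descriptions count as 0.
invoices["description_length"] = invoices["description"].fillna("").str.len()
print(invoices)
```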
The dataset, including the new features, was then analysed to look for correlations:
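A correlation matrix of this kind can be produced with the pandas `corr` method. The sketch below uses a few made-up rows purely to show the mechanics; the values are illustrative and are not the article's data or results.

```python
import pandas as pd

# Made-up rows for illustration only; not the article's data.
invoices = pd.DataFrame({
    "amount": [120.0, 95.5, 9000.0, 310.2, 8700.0, 45.0],
    "description_length": [34, 41, 8, 29, 12, 38],
    "first_digit": [1, 9, 9, 3, 8, 4],
    "fraudulent": [0, 0, 1, 0, 1, 0],
})

# Pearson correlation between every pair of numeric columns.
correlations = invoices.corr()

# The row for the target, ordered by strength of correlation in either direction.
fraud_row = (
    correlations.loc["fraudulent"]
    .drop("fraudulent")
    .sort_values(key=abs, ascending=False)
)
print(correlations.round(2))
print(fraud_row)
```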
From the above correlation matrix, if we look at the row relating to the ‘fraudulent’ field we can observe a number of interesting things.
Firstly, there is a strong correlation with the invoice amount. This is somewhat to be expected, as a fraudster is unlikely to generate an invoice for a low value. We already know that the dataset contains many invoices with values below 1, so it is plausible that high-value invoices are more frequently fraudulent.
There also seems to be a correlation with the length of the invoice description. This indicates that fraudulent invoices are more likely to have shorter descriptions.
Consistent with the Benford’s Law analysis, we can see a correlation with the first digit of the invoice amount. Finally, we can observe that the z-score is also linked with an invoice being fraudulent or not.
The next step is to choose the best type of classifier to use. While we have a number of features that correlate well with being fraudulent or not, these features do not correlate strongly with each other. For example, the length of the invoice description has a correlation of 0.024 with the first digit of the invoice amount.
Because of this weak correlation across features, a Random Forest classifier was chosen. This type of classifier works by creating several separate decision trees, each one outputting its own prediction. These predictions are then combined to make a final prediction.
We will now split the data into test and training datasets and train a Random Forest classifier.
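The split and training step might look like the sketch below, which uses scikit-learn. The data is synthetic and generated only so that the example runs on its own; the feature names, the 75/25 split, the stratification and the number of trees are all assumptions, not the article's settings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the sketch runs on its own; not the real invoices.
rng = np.random.default_rng(42)
n = 1000
fraudulent = rng.random(n) < 0.1
features = pd.DataFrame({
    "amount": np.where(fraudulent, rng.uniform(5000, 10000, n), rng.uniform(1, 6000, n)),
    "zscore": np.where(fraudulent, rng.normal(2, 1, n), rng.normal(0, 1, n)),
    "description_length": np.where(fraudulent, rng.integers(5, 25, n), rng.integers(15, 60, n)),
    "has_decimal": np.where(fraudulent, 0, rng.integers(0, 2, n)),
})
labels = fraudulent.astype(int)

# Hold back 25% of the invoices for testing, keeping the fraud ratio the same in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print("Test accuracy:", model.score(X_test, y_test))
```

Stratifying the split is a design choice that matters when fraudulent invoices are a small minority, as it stops the test set from ending up with too few of them.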
After training, the model is able to make predictions against the test dataset with an impressive 97% accuracy (precision 92% and recall 78%).
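These three figures measure different things, which is why they can differ so much on a dataset where fraud is rare. A recall of 78%, for instance, means that roughly one in five fraudulent invoices is still missed. The sketch below computes each metric with scikit-learn on ten hypothetical labels; the numbers it prints are not the article's results.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels for ten invoices (1 = fraudulent); not results from the article.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)    # share of all invoices classified correctly
precision = precision_score(y_true, y_pred)  # share of flagged invoices that are fraudulent
recall = recall_score(y_true, y_pred)        # share of fraudulent invoices that are flagged

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```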
To assess the overall quality of the model we will now generate a ROC curve.
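A ROC curve plots the true positive rate against the false positive rate as the decision threshold is varied. The sketch below assumes scikit-learn and matplotlib are available and uses hypothetical scores. With a trained classifier, the scores would typically be the predicted probability of the fraudulent class, for example from `predict_proba`.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and fraud scores for ten invoices; not the article's results.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.80, 0.55, 0.90])

# One (false positive rate, true positive rate) point per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

fig, ax = plt.subplots()
ax.plot(fpr, tpr, label=f"ROC curve (AUC = {auc:.2f})")
ax.plot([0, 1], [0, 1], linestyle="--", label="Random guess")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.legend()
# Call plt.show() or fig.savefig(...) to view the chart.
```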
This will generate the below chart:
The model is able to detect over 80% of fraudulent invoices (the true positive rate) without generating significant numbers of false positives. A true positive rate of slightly over 90% represents a compromise, with a modest number of false positives being generated.
Overall, this approach is able to achieve very high detection rates, depending on the number of false positives you are willing to accept. The results do, however, depend on the nature of the source data, and not all companies will be the same.