Why my first job in data science was not what I expected
New shirt, new shoes. I was ready for my first job in one of Ireland’s biggest banks. I was excited. Looking back, I had good reason to be. I was able to work on impactful projects and I learned an immense amount. In fact, the biggest lesson was:
Data science was not what I expected.
I expected to work on the forefront of computer science, statistics and machine learning. Applying new methods to drive unique insights. Automating everything. In short, I fell victim to the hype around the profession.
So, I want to share my lessons with you. I hope we can cut through the hype and give you a clearer picture of what a data scientist actually does. Let’s dive into the first lesson.
My job involved building credit risk and fraud models. These were impactful models. They were used to automate lending on a large scale. I’m talking applications worth billions of euros a year. You may think that, with such high stakes, I would be doing advanced machine learning. You’d be wrong.
I built models exclusively using logistic regression. I was not alone. From banking to insurance, much of the financial world runs on regression. Why?
Because these models work.
The performance of the regression models was good enough. They were also widely understood and accepted at the bank. For a new algorithm to be adopted, it not only had to outperform regression; the improvement also had to justify the effort of explaining the algorithm.
With regression, I ended up with models that had 8 to 10 features. Each of these features had to be thoroughly explained. A non-technical colleague had to agree they captured a relationship that existed in reality.
With regression this was simple. Black box models would have been far harder to explain. Sure, I could have used methods like SHAP, partial dependence plots (PDPs) or ICE plots. The problem is they wouldn’t have given me the same level of certainty. I would also have needed to explain the method I used to explain my model.
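To make that explainability point concrete, here is a minimal sketch (the feature names and coefficient values are invented for illustration, not taken from any real bank model) of why a logistic regression coefficient is easy to defend: each coefficient maps directly to an odds ratio that a non-technical colleague can check against reality.

```python
import math

# Hypothetical fitted coefficients from a credit risk logistic regression.
# Feature names and values are illustrative only.
coefficients = {
    "debt_to_income_ratio": 1.2,   # higher ratio -> higher default odds
    "years_at_current_job": -0.4,  # longer tenure -> lower default odds
}

def odds_ratio(coef: float) -> float:
    """Multiplicative change in the odds of default per one-unit feature increase."""
    return math.exp(coef)

for feature, coef in coefficients.items():
    direction = "increases" if coef > 0 else "decreases"
    print(f"+1 in {feature} {direction} default odds by x{odds_ratio(coef):.2f}")
```

Each line reads as a plain statement — "a one-unit rise in debt-to-income multiplies the odds of default by about 3.3" — that a domain expert can sanity-check directly, which is exactly the conversation that a SHAP plot makes harder.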
This was a source of disappointment. Leaving uni, I had learned so much about random forests, XGBoost and neural networks. I was excited to apply these techniques. In the first week, I remember one of my senior colleagues saying:
“Forget about all those fancy models”
She was right. Many data scientists will never need them.
Less disappointing was the realisation of how useful machine learning is. It sank in when I saw all the applications in the banking industry alone. To name a few:
- Credit risk — predict default due to financial distress
- Fraud — predict if customers do not intend to repay a loan
- Pre-arrears — identify customers in financial distress
- Churn — identify customers who intend to leave the bank
- Marketing — identify the best customers to promote a product to
These models were used to automate processes across the bank. Working on them excited me. It gave me the opportunity to create something that could impact the world more than I ever could alone. This gave me a lot of motivation. Much needed motivation.
Building models at university was a breeze—clean datasets, pre-engineered features and automated hyper-parameter tuning. It took me a couple of hours to get 99.9% accuracy. Imagine my surprise when it took a team of 3 of us 8 months to build a credit risk model. 8 months!
Most of this time went into building our dataset. That meant more than just model features. I had to justify all of my modelling decisions, so I also included any variables needed for sampling and representation analysis, segmentation analysis, fairness analysis and model evaluation.
I had to build many of these variables from scratch. The underlying data fields were spread across multiple tables with inconsistent documentation (if there was any at all). Once the features were built came the debugging. Oh, the debugging. I still get chills thinking about it.
If mistakes were made (they were), they would cause a lot of pain down the line (they did). To minimise this, a lot of testing was done. The issue was that there was nothing to compare my model features to. The best I could do was:
- Sense check. This involves visualising feature trends and validating them with domain knowledge. Does a sudden drop in income make sense? Yes, Covid.
- Unit tests. That means calculating the feature values for a few customers manually.
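A minimal sketch of that second kind of check (the feature, schema and values here are all invented for illustration): build the feature in code from raw records, then assert it matches values worked out by hand for a few known customers.

```python
# Hypothetical raw payment records; the schema and values are illustrative only.
raw_payments = [
    {"customer_id": "A", "due": 100.0, "paid": 100.0},
    {"customer_id": "A", "due": 100.0, "paid": 0.0},
    {"customer_id": "B", "due": 250.0, "paid": 250.0},
]

def missed_payment_count(rows, customer_id):
    """Hypothetical feature: number of records where the customer paid less than was due."""
    return sum(
        1 for r in rows
        if r["customer_id"] == customer_id and r["paid"] < r["due"]
    )

# The "unit test": feature values calculated by hand for a couple of customers.
assert missed_payment_count(raw_payments, "A") == 1
assert missed_payment_count(raw_payments, "B") == 0
```

It is tedious, but when there is no existing dataset to reconcile against, a handful of hand-worked customers is often the only ground truth you have.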
I didn’t know about this side of data science. It was not the “sexiest job of 2019” I was told about. It was boring. Yet, it was worth it. Seeing the final model filled me with pride. It was my child. My child that I immediately sent off to sanction thousands of loans.
I quickly realised how critical non-technical skills would be. Communication is key. There were no assignment briefs or clearly worded exam questions. At times, tasks were described in a haphazard way. I didn’t expect that part of my job would be to understand what I was asked to do.
I needed to improve both my communication skills and domain knowledge to effectively apply my technical skills.
This became easier as I gained more experience. More specifically, as I gained knowledge of the banking industry. In the beginning, I didn’t even know what clarifying questions to ask. There was a lot of jargon and TLAs (three-letter acronyms). Once I grasped this language my life became easier.
Data science is a hot job. It is also just a job title. You could be expected to do a variety of tasks. Companies know that people want to be data scientists and they will market their positions appropriately.
I started my job with a bunch of fresh graduates. I was lucky. I ended up doing work that I would classify as data science. Some of my fellow graduates were not so lucky. Just SQL and Excel. Really, they should have been called data analysts.
Looking back, a warning sign was that all the seniors in the department had the title of “quantitative analyst”. The new juniors were all called “data scientists”. Had the work suddenly changed? No.
Going into my next job I would focus less on the job title. I would ask more questions about what work I would do on a day-to-day basis. The next lesson taught me to also ask about the tools used to do this work.
A common sentiment is that you should focus on process over tools. I think this comes from data scientists who have never had to work with outdated technology. I agree that process is important. It is equally important to have access to the best tools to implement those processes.
Old tools are draining. They are also abundant in the banking industry.
Coming from university, I had experience with Python. You can build complex models and interactive visualisations with a few lines of code. In banking, we had SAS. SAS can do a fraction of what Python can, with several times the effort. I found it demoralising. I knew I could do a better job with open-source tools but had no way of accessing them.
Working with old tools also made my skills less marketable. The industry moves quickly. I realised this when I started applying for new jobs. 95% of data science job listings mention tools like Python, PyTorch and TensorFlow. Companies want people who have experience with the latest technology.
In the end, all jobs have their downside. I am happy with my first experience. I completed interesting projects. I did work that had a material impact on the Irish economy. If only I had access to better tools to do that work.