The facts are unpleasant: 87% of data science projects never make it to production [1]. A project can fail for many reasons. How should you act, and can you prevent such situations? How do you deal with the negative emotions? And, at the end of the day, why did it happen?
First of all, let’s talk about what a failure looks like. A project has failed when it doesn’t meet its goals. For example, it could be an analysis that doesn’t answer a business question or doesn’t help anyone make a decision. In the case of machine learning, it could be a model that doesn’t work in production, or one that never gets deployed at all.
So failures take different forms, but their common feature is that the work becomes useless with respect to its objective. Although few people share their failures, this topic is very important in data work. Honest reflection on mistakes can help you avoid failures in the future, or determine what will succeed next time.
A failure can become a point of growth or remain a mere failure, depending on how you work through it.
Data science is like treasure hunting. Because it’s research and development, there is no safe or proven path. You need to go where no one has gone before, work on data no one has touched before, or even solve a problem no one has solved before. A novice treasure hunter may look for standard goods, while an experienced hunter finds the most legendary treasures. Data scientists seek out successful models, and once in a while their models and analyses work! Although a senior data scientist may work on more complicated or tricky projects, everyone fails continuously, and that’s just part of the job [2].
The systems of software developers and those of data scientists can be compared with the mathematical concepts of logic and probability, respectively. The logical statement “if A, then B” can be coded easily in any programming language, and in some sense every computer program consists of a very large number of such statements in various contexts. The probabilistic statement “if A, then probably B” isn’t nearly as straightforward. Yet any good data-centric application contains many such statements: consider the Google search engine (“These are probably the most relevant pages”), product recommendations on Amazon.com (“We think you’ll probably like these things”), or website analytics (“Your site visitors are probably from North America, and each views about three pages”) [3].
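To make the contrast concrete, here is a minimal Python sketch; both the shipping rule and the relevance scorer are toy functions invented for illustration, not real systems:

```python
# Deterministic logic: "if A, then B" -- the outcome is guaranteed.
def shipping_policy(order_total: float) -> str:
    if order_total > 100:          # A
        return "free shipping"     # then B, every single time
    return "standard shipping"

# Probabilistic statement: "if A, then probably B" -- the output is a score
# that can be wrong, not a guarantee.
def relevance_score(query_match: float, popularity: float) -> float:
    # Toy scoring rule standing in for a trained ranking model.
    score = 0.6 * query_match + 0.4 * popularity
    return min(max(score, 0.0), 1.0)

print(shipping_policy(120.0))     # free shipping, always
print(relevance_score(0.9, 0.5))  # 0.74 -> "probably relevant"
```

The first function is right by construction; the second is only right on average, and everything a data scientist ships inherits that uncertainty.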
With this understanding in place, let’s look more closely at the key reasons failures happen and at ways to deal with them.
The data needed for your project may be corrupted, unusable, or simply nonexistent. Imagine that you want to analyze whether retail clients tend to spend more when their social status changes from single to married, or from childless to young parents. Then it turns out that the company’s data contains only each person’s latest status without any history of changes, or has no status column at all. Issues like these can kill a project at an early stage.
What to do: Before starting a project, try to prevent the most common data issues by asking the right questions up front. The trouble is that you cannot imagine every possible way the data can be broken. The best solution is to get a data sample before the project starts. If that’s not possible, study the data as soon as you can and design an early “go/no-go” checkpoint into the project timeline.
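Here is a minimal sketch of what such a go/no-go check could look like, assuming pandas and hypothetical column names borrowed from the marital-status example above:

```python
import pandas as pd

# Assumed schema for the marital-status example; every name here is
# hypothetical and should be replaced with your project's real columns.
REQUIRED_COLUMNS = {"client_id", "marital_status", "status_changed_at", "spend"}
MAX_MISSING_RATE = 0.2  # assumption: above this, a column is too sparse to use

def go_no_go(df: pd.DataFrame) -> bool:
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        print(f"NO GO: missing columns {sorted(missing)}")
        return False
    sparse = [c for c in REQUIRED_COLUMNS if df[c].isna().mean() > MAX_MISSING_RATE]
    if sparse:
        print(f"NO GO: too many missing values in {sparse}")
        return False
    # The analysis needs *historical* status changes, not just a snapshot.
    if df.groupby("client_id")["marital_status"].nunique().max() <= 1:
        print("NO GO: only the latest status is stored; no history of changes")
        return False
    print("GO: the data looks feasible for this project")
    return True
```

Running a check like this on the very first sample you receive turns a vague risk into an explicit, early decision point.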
If the project has already started and a data issue reveals itself, notify the stakeholders as soon as possible; they will be upset if it only surfaces after one or two months of active work. Then figure out how to act in this situation. Is it possible to gather more data? Can you use additional sources? Can you start collecting the necessary data now, so the project becomes possible in the future?
Tip: check the data first and communicate with stakeholders immediately.
Your data may be perfectly fine, and not even very dirty, yet still be useless. There may be no signal in it; in other words, the available information cannot be used for prediction. This happens when the features have no influence on the target. For example, a company wants to predict customer satisfaction from demographic data such as gender, age, and so on. We can make assumptions about the signal in this data, but we cannot know for sure before starting the project. The features and the target may turn out to be unrelated, so the resulting model will perform at the level of random chance.
What to do: This situation can be tricky. Is it possible to gather more data or change the data source? In the example above, could you find or collect the reviews customers left for ordered items? If the data you have is all you can get, try to reframe the problem. The dataset may be no good for satisfaction prediction, but what if you use it for customer clustering? Or maybe you can find the most popular items per age and gender group and build a simple recommendation system? Reframing can save the project, albeit by changing it.
One mistake you can make here is to try ever more complex models in an attempt to find a signal. Unfortunately, you can spend a lot of time training neural networks and still end up with nothing. A more complex model can help increase accuracy, but it can’t make something out of nothing.
Tip: start with the simplest method, check the project’s feasibility, and only if it works move on to more complex solutions.
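As a minimal sketch of that tip, assuming scikit-learn and a classification task like the satisfaction example above, you could compare the simplest possible model against the chance level before committing to anything heavier:

```python
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def looks_feasible(X, y, margin: float = 0.02) -> bool:
    """Does a simple model beat random chance on this data?"""
    # Chance level: always predict the majority class.
    chance = cross_val_score(DummyClassifier(strategy="most_frequent"),
                             X, y, cv=5).mean()
    # Simplest reasonable model: no tuning, no feature engineering.
    simple = cross_val_score(LogisticRegression(max_iter=1000),
                             X, y, cv=5).mean()
    print(f"chance level: {chance:.3f}, simple model: {simple:.3f}")
    # The margin is an arbitrary assumption; decide what "clearly better
    # than chance" means for your problem before investing further.
    return simple > chance + margin
```

If a check like this fails, reframing the problem is usually a better investment than a deeper network.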
You can do your best, finish your project brilliantly, and get fairly high model metrics, yet the customer may end up not using your results. This happens all the time; such models may never even be deployed to production. Why? Because the project doesn’t provide value to the customer. For example, you have built a perfect demand prediction model, but the sales department doesn’t want to use it: they do their calculations in spreadsheets and refuse to trust the model. Under such circumstances, all the work can be in vain.
What to do: start communicating; it’s never too late to talk to your customers. What can you change so that your work becomes valuable? For example, will the sales department use your model if you add an explanation to each prediction, so they can see which numbers it was derived from? (One way to do this is sketched below.)
Try to get feedback as early as you can by building a proof-of-concept (POC) model and showing it at an early stage. Does the customer want it? Do they have suggestions or doubts? Collecting this information allows you to be flexible and productive.
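For instance, here is a minimal sketch of one way to add such clarification, using scikit-learn’s permutation importance on synthetic demand data (all feature names and numbers here are made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for demand data: in this fake setup, demand is driven
# almost entirely by the promo discount.
rng = np.random.default_rng(0)
feature_names = ["promo_discount", "day_of_week", "store_traffic"]
X = rng.random((500, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0.0, 0.1, 500)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Permutation importance: shuffle one feature at a time and measure how much
# the validation score drops. A big drop means the model relies on that input.
result = permutation_importance(model, X_valid, y_valid,
                                n_repeats=10, random_state=0)
for name, value in sorted(zip(feature_names, result.importances_mean),
                          key=lambda p: -p[1]):
    print(f"{name}: {value:.3f}")
```

A ranked list like this gives the sales team something they can argue with, which is often the first step toward trusting a model.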
Tip: instead of immediately diving into the data, exploring it, and building state-of-the-art models, make sure you understand the customer’s needs. What are their main concerns, and what effect do they expect? Don’t try to solve the problem until you know what the problem is.
When your project fails, you probably think you are the worst data scientist in the world, or that if you were a better professional, the project would not have failed. (In data science everyone makes mistakes, even senior specialists; don’t blame yourself.)
Take a short vacation or a few days off. Try something new, visit a new place, or get some physical activity. These things will leave positive impressions, lower your stress level, and make you feel fresher.
Talk to someone who will understand and support you. Ideally, find a more senior person and listen to their failure stories. Being sad and unhappy in such a situation is normal, because the project was important to you. But don’t let negative emotions take over. You have gained extremely useful experience; now you can draw conclusions and move on. After all, there are still plenty of interesting tasks in the data field!
References:
[1] “Why do 87% of data science projects never make it into production?”, VentureBeat, 2019
[2] Build a Career in Data Science, by Emily Robinson and Jacqueline Nolis (Manning)
[3] Think Like a Data Scientist, by Brian Godsey (Manning)