My take on the key transferable skills for those coming from academia to commercial data science
I spent 5 years working as a researcher in laser physics, nonlinear optics, and solid-state laser engineering. While I was fully immersed in the field and excited about what I was doing, at some point I made the transition into the commercial data science industry.
After another 6 years working in data science, I have the impression that the skill set I developed in applied physics translates remarkably well to commercial projects that have nothing at all to do with laser physics.
Plenty has been written about how useful academic experience might be, but I decided to express my personal opinion on the subject.
To make my point, I rate each skill group on how transferable it is and explain why.
Who is this article for?
I wrote it mostly for people thinking about transitioning from the academic environment into the commercial field, but also for myself, to reflect on the overlap of tools, skills, and mindsets between the two worlds.
Experience with literature review → 7/10
Why is literature review such a great and transferable skill (habit) for commercial data science?
In my opinion, literature review is a bit overlooked and misunderstood in commercial data science. And I'm not saying we don't read enough about brand-new model architectures and framework designs (that part is executed exceptionally well).
But when it comes to quickly and effectively building structured, valuable knowledge about the subject of a project, that, in my opinion, is where the biggest gap in the data science world lies.
A literature review might not even be the best term here. I could also call it background research, or state-of-the-art analysis.
When dealing with a business problem, in my opinion, it is essential to build at least some theoretical base on the subject. What a literature review does:
- Forms a foundation for solid decisions on data strategy. It acquaints you with the existing techniques and approaches in the domain.
- Speeds up the onboarding process. If you are new to the domain you are working on, getting knowledge on the subject as quickly as possible is the first step for getting to value generation.
- Improves communication quality with experts in the field. Domain experts, also called subject matter experts, are invaluable for solving data problems. But they typically don't program, and they are pretty busy. Thus data scientists must acquire some understanding of the domain-specific terminology and concepts to communicate effectively and collaborate seamlessly with these experts.
- Drastically improves the quality of your insights. In my experience, a literature review strengthens the foundation for decisions about data collection, preprocessing, modeling, and evaluation, ultimately improving the quality of the insights you deliver. It works more often than not, though not always.
Paying attention to a literature review, and investing time and effort into it, embodies a particular mindset: open-minded, humble, and inquisitive. A literature review helps keep you from reinventing the wheel and from falling into the trap of confirmation bias.
Transferring journaling practices from academia to commercial data science has been very rewarding for me. Beyond its many practical benefits, it gives you a priceless sense of continuity through the ups and downs of a researcher's work life. In my opinion, by adopting the keystone habit of maintaining a lab notebook, data scientists can easily track their experiments, jot down ideas and observations, and monitor their personal and professional growth. I wrote a whole separate piece on why it is such a great idea, feel free to check it out!
Knowledge of programming → 6/10
In my scientific journey, I worked on experimental data processing, numerical simulations, and statistical learning on an everyday basis. Programming was also essential for developing and evaluating new laser designs through numerical simulation before building physical prototypes.
I’ve used it constantly for typical data science stuff:
- experimental data processing (Python, Wolfram)
- numerical simulations (Wolfram, Matlab, Python)
- statistical learning (Wolfram, Matlab, Python)
- data visualization (Origin Pro, Python, R)
Wolfram (Wolfram Mathematica, more specifically) was my most heavily used tool because we had a license for it in the lab. It has a great toolset for solving nonlinear differential equations, and we used it extensively for numerical simulations.
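To give a flavor of that kind of work, here is a minimal sketch in Python with SciPy (not Mathematica, and a textbook toy model rather than anything from my actual lab code): the dimensionless single-mode laser rate equations, a small nonlinear ODE system of exactly the sort we solved numerically every day.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Dimensionless single-mode laser rate equations (a textbook toy model):
# N is the population inversion, S is the photon density in the cavity.
def laser_rate_eqs(t, y, pump):
    N, S = y
    dN = pump - N - N * S   # pumping, spontaneous decay, stimulated emission
    dS = N * S - S          # gain minus cavity loss (dimensionless units)
    return [dN, dS]

pump = 2.0                  # pump rate above threshold
sol = solve_ivp(laser_rate_eqs, (0, 100), [0.0, 1e-6],
                args=(pump,), rtol=1e-8, atol=1e-10)

N_final, S_final = sol.y[:, -1]
# After the relaxation oscillations die out, the system settles to the
# steady state N = 1, S = pump - 1.
print(f"steady state: N = {N_final:.3f}, S = {S_final:.3f}")
```

The toy model shows the characteristic relaxation oscillations of solid-state lasers before settling to steady state; real design work involved far messier systems, but the workflow (write down the equations, integrate, study the transients) was the same.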
Python was my tool of choice for wrangling data generated during experiments (beam shapes, oscillograms).
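A hypothetical sketch of that kind of wrangling (a synthesized trace, not a real oscillogram, so the example is self-contained): estimating the pulse duration as the full width at half maximum of an oscilloscope trace.

```python
import numpy as np

# Hypothetical example: estimate pulse duration (FWHM) from a scope trace.
# In practice the trace would come from a CSV export; here we synthesize
# a Gaussian pulse with a little detector noise.
t = np.linspace(0, 100e-9, 2000)                 # 100 ns acquisition window
sigma = 5e-9                                     # Gaussian pulse width
signal = np.exp(-((t - 50e-9) ** 2) / (2 * sigma ** 2))
signal += np.random.default_rng(0).normal(0, 0.01, t.size)  # noise

half_max = signal.max() / 2
above = np.where(signal >= half_max)[0]          # samples above half maximum
fwhm = t[above[-1]] - t[above[0]]
# For a Gaussian, FWHM = 2.355 * sigma, so about 11.8 ns here.
print(f"pulse FWHM = {fwhm * 1e9:.1f} ns")
```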
When it comes to data visualization, Origin was my primary tool because it allowed embedding visuals into text documents while keeping them editable. Line charts, histograms (including kernel density estimates), regression analysis: Origin handled all of it well. Origin has a GUI, so this is not really about coding; I just have to mention it so Python and R don't get all the data viz credit.
In general, I had solid experience with each of the tools mentioned above: I know the syntax and I can solve problems with decent efficiency. So why just 6/10? Why do programming skills gained in academia transfer relatively poorly to commercial data science? That is a pretty strong statement, but I think the downsides of academic experience may outweigh the upsides, mainly because good software practices are completely neglected in many scientific environments.
Caveat: this statement is based on my personal experience of working in the applied physics field, and it definitely does not apply to everyone in academia. Take everything in this section with a grain of salt!
On one hand, neglecting good software principles is a natural consequence of researchers optimizing for research speed and publication count, not for code quality and maintainability. On the other hand, almost nobody moves from professional software development into academia (for financial reasons), so there is no real production expertise around in the first place. I should also mention that designing experiments, doing literature reviews, collecting measurements, writing code to process them, and extracting valuable insights, all at the same time, is exhausting. As a consequence, you simply don't have the resources left to study software development.
Proficiency in conducting measurements → 9/10
This one is difficult to explain, so bear with me. Measuring stuff in applied laser physics is a discipline of its own. Delivering valuable measurements is a skill that takes years to train! There are many reasons for that: you have to understand the physics of the process, follow a measurement protocol, and have the specialized knowledge and training to operate complex and expensive instrumentation.
For example, I worked with diode-pumped pulsed solid-state lasers, measuring multiple parameters of the laser beam: pulse duration, pulse energy, repetition rate, beam profile, divergence, polarization, spectral content, temporal profile, and beam waist. Every one of these measurements is damn difficult. Let's say you want to measure the beam profile.
Beam profile refers to the spatial distribution of the laser beam’s intensity across its cross-section or transverse plane.
In theory, you just point the laser beam at a CCD camera and get your beam shape in seconds. In practice, it is a whole different story. If you are working with a pulsed solid-state laser of decent pulse energy, and you know what you are doing, you will direct the beam at a high-quality optical wedge, dump most of the pulse energy into a beam trap, and work with a reflection that carries only a fraction of the original energy. You do this to save the CCD camera from disaster. But the wedge alone is not enough: you will also install an adjustable beam attenuator, lock it at its darkest setting, and then gradually lower the absorption until you get correct exposure on the camera.
If you are working with an infrared laser that is invisible to the human eye, you face another problem: you have to steer the beam through small apertures without seeing the beam itself. This skill alone can only be acquired through training and practice. By the way, every step of beam manipulation has to be done with extreme care because of safety regulations: you have to wear appropriate protective goggles, use protective screens, and so on.
Okay, moving on: your beam is now attenuated and sits nicely on the CCD camera. But you still have plenty to do: wire the camera to the laser power unit to achieve synchronization and produce a stable image. If you've done everything correctly, you get your images. Wait, images?
Then you realize that if your laser operates at a pulse repetition rate of 50 Hz, it produces 50 pulses a second, and each pulse might have a slightly different beam profile. How do you report a result? Should you just pick a random shot and capture that image? Or should you average over a certain number of pulses? Oh, and averaging was enabled by default in the software driving the CCD camera?
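The single-shot versus averaged question is easy to simulate. Here is a hypothetical sketch (synthetic frames, illustrative parameters): 50 Gaussian beam spots whose centers jitter slightly from shot to shot, the way real pulse-to-pulse instability behaves.

```python
import numpy as np

# Simulate 50 pulses: each frame is a 2-D Gaussian beam whose center
# jitters from shot to shot (pulse-to-pulse pointing instability).
rng = np.random.default_rng(42)
y, x = np.mgrid[-32:32, -32:32]

def frame(cx, cy, w=8.0):
    return np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / w ** 2)

frames = np.stack([frame(*rng.normal(0, 2.0, size=2)) for _ in range(50)])

single_shot = frames[0]          # one arbitrary pulse
averaged = frames.mean(axis=0)   # what camera software often reports by default

def second_moment_width(img):
    """Second-moment (RMS) beam width along x, about the centroid."""
    total = img.sum()
    cx = (img * x).sum() / total
    return np.sqrt((img * (x - cx) ** 2).sum() / total)

# Averaging over jittering centers broadens the apparent beam.
print(second_moment_width(single_shot), second_moment_width(averaged))
```

The point is not that averaging is wrong; it is that if you don't know the software averaged for you, your reported beam width silently includes pointing jitter, which is exactly the kind of metadata trap the next paragraph is about.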
Let's wrap this "measuring beam shape" nonsense up. From all the measurements I have done in my life, I took away two key transferable qualities: vigilance (NEVER take anything at face value) and meticulous attention to metadata (how exactly the data was measured or recorded, which tools were used, and even why it was collected in the first place). Both are golden when it comes to working with real-life data, because they let you produce actual impact far more efficiently without disappearing down rabbit holes. And that is something valued both in academia and in commercial data science.
Data Communication Proficiency → 10/10
While I was in academia, I didn’t consider data communication to be a particularly noteworthy or valuable topic to write about. Working on data visualizations, chatting about data and theories, and writing scientific papers were just part of the job. But after years of doing research, you gain a solid skill set in data communication on different levels (both formal and informal).
Writing a scientific paper is one of the more challenging skills to obtain among formal data communication types. It takes a lot of practice to be able to compose a compelling piece that has a proper structure (abstract → intro → literature review → methodology → results → discussion → conclusion → acknowledgments). The structure of the article itself presumes that you have a story to write about. And it is not just about writing: you have to know your way around producing compelling and purposeful visual representations of data. All to get your message to the audience.
I rate this skill 10 out of 10 on transferability because commercial data science, unsurprisingly, depends on interactions between humans and on communicating your thoughts and results.
Overall, I believe that those with a scientific background can bring unique perspectives and valuable skills to the field of data science. To those in academia who believe that transitioning to a career in commercial data science means abandoning all their hard work and expertise, I offer a different perspective: you have a wealth of value to bring to the table. In my opinion, the best course of action is to leverage your existing skills while picking up new techniques and best practices of the field you transition into (we all know it is a lifelong journey).