It can save your life: believe me!
Coding can lead you into real messes. If you are a beginner, you are probably writing your first lines of code right now and dealing with warnings and errors; did I guess right?
As you learn and practice more, you will work on projects and, unfortunately, run into bigger problems. This is especially true if you work with Jupyter Notebooks (because you want to be a Data Scientist, right? So, start with Notebooks!): since cells can run independently of one another, it is easy to get into trouble.
Here’s a scenario: suppose you have worked for hours (maybe days!) and something goes so badly that you have to throw your project in the bin and start again from the beginning. It would be a real pity, wouldn’t it?
Well, this is where Git comes to help; but before talking about it, I want to tell you how I understood the importance of Git at the beginning of my learning path in Data Science. Then, I’ll tell you why GitHub is very important for you, and why you should learn it right now if you are a beginner (if you are not a beginner, I suppose you know how to use GitHub, don’t you? 🙂).
I worked as a Mechanical Designer for two years, designing complex mechanical parts. To give you an idea, I designed assemblies that can have tens or even hundreds of sub-parts.
As the market often moves very fast, we were asked to design the parts as quickly as possible; but you know clients: they often ask for changes while you work (or after the work is done!). Can you imagine what it means to change even 4 parts in an assembly of tens to hundreds of parts?
Well, yes: the CAD software may explode (and, in fact, sometimes it did!). And yes: if the CAD explodes after hours of work, you may lose those hours of work, unless you have a life buoy.
Luckily, I worked for a firm that did have a life buoy integrated into its CAD software. In simple words, this is what the life buoy could do:
- it saved the designed parts on a local server
- it autosaved locally, so that if the CAD crashed you could retrieve the work you had done (hooray!)
- it let you “freeze” a version of the project (that is, the assembly and all the sub-parts), so you could return to a previous version if needed (and that’s the magic!!)
All this stuff really saved my life and hours and hours of work. But, you know what? This is actually what the Git version control system allows you to do!
Since I knew the feeling of losing hours and hours of work, when I started learning Data Science and got familiar with the concept of Git, I immediately decided to pause my Data Science learning to learn and practice Git, because I didn’t want to lose hours of coding in the future.
So, what Git does is basically very simple: it saves your code (that is, your Data Science project with your Jupyter Notebook, your CSV with the data, the folders, and anything else you need) in a particular folder called a “repository”. This folder is “git-initialized” (when you type the command $ git init) and this allows you to control the versions of your software.
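To make the idea of "returning to a previous version" concrete, here is a minimal sketch of that workflow on the command line. The folder name, file name, commit messages, and the "Demo User" identity are placeholders invented for the illustration; it runs in a temporary folder, so it does not touch any real project.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a throwaway folder so the sketch does not touch real projects.
workdir="$(mktemp -d)"
cd "$workdir"

# Turn a new folder into a repository.
git init -q demo-project
cd demo-project

# Placeholder identity, set for this repository only.
git config user.name "Demo User"
git config user.email "demo@example.com"

# Save a first, working version of a (hypothetical) script.
echo "print('version 1')" > analysis.py
git add analysis.py
git commit -q -m "First working version"

# Save a second version that turns out to be a mess.
echo "print('a real mess')" > analysis.py
git commit -q -am "Experiment that went wrong"

# Bring the file back to how it was one commit earlier.
git checkout HEAD~1 -- analysis.py
restored="$(cat analysis.py)"
echo "$restored"
```

Each commit plays the role of a "frozen" version: the last command takes the file from the earlier commit, while the history of both commits stays in the repository.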
Also, beyond versioning your software, which lets you return to a previous version if your code becomes a real mess, you can distribute your code. This means that you can clone (with $ git clone) an existing repository on any machine, for example on the computer of one of your colleagues. They can change their own copy, but only you can modify the project in your repository on GitHub (unless you give permission to some of your colleagues).
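As a rough sketch of what cloning does, the example below clones one local folder into another instead of downloading from GitHub, so that it works without a network connection or an account. The names "original" and "colleague-copy", the data file, and the identity are all made up for the illustration; with a project hosted on GitHub you would pass the repository's address to git clone instead of a folder name.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a throwaway folder.
workdir="$(mktemp -d)"
cd "$workdir"

# A small repository standing in for a project hosted elsewhere.
git init -q original
git -C original config user.name "Demo User"
git -C original config user.email "demo@example.com"
echo "id,value" > original/data.csv
git -C original add data.csv
git -C original commit -q -m "Add data file"

# Clone it, as a colleague would do on their own machine.
git clone -q original colleague-copy

# The copy holds the files and the full history of the original.
commits_in_copy="$(git -C colleague-copy rev-list --count HEAD)"
echo "$commits_in_copy"
```

The clone is a complete, independent copy: it contains the files and every saved version, which is why a colleague can explore or change it without affecting the original.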
I believe there are two main reasons why you should start learning how GitHub works today:
- The first is version control. I hope I have convinced you: versioning your software will save you from starting from 0 (or almost from 0) if your code becomes a real mess, because you can restart from a previous version.
- The second is that on GitHub you can create as many repositories as you want, which lets you build your own portfolio. While you create your Data Science projects, you can store them in local repositories (on your PC) and then upload them to GitHub, so that you can show them to the world. Also, GitHub lets you host a website; commonly, we use it to host our portfolio. To give an example, here’s mine. As you click on the various projects I made, you can see that they are all stored on my GitHub (also, I’ve created an explanatory PDF for each project, so that you can follow it even if you are not yet a pro in the Data Science field).
How difficult is it to learn GitHub? Well, the basics and the main commands are very easy to learn: I think I learned them in 3–4 hours (and if I made it, you can make it too!).
You can also find online courses that teach you how to start with GitHub: pick one, it will help you for sure.
Also, I’d like to leave you a couple more recommendations, based on my experience:
- Start with GitHub Desktop. I started using GitHub from the command line; there is nothing wrong with that, but I advise you to use GitHub Desktop (you can download it from here) because it is more user-friendly and it helps you avoid some mistakes.
- Since you work locally (you develop your code on your PC), you’ll have to connect your local machine to your GitHub account to upload your repositories to GitHub. GitHub lets you connect the two in a secure way, using SSH keys. In this article, I’ve created a guide to help you with these steps (please note that the final part of that guide covers using GitHub from the command line: you can skip it, as I advised you to use GitHub Desktop).
In this article, we’ve seen the importance of GitHub; in my opinion, if you are a beginner you should learn it right now for two main reasons:
- it lets you version your code and your projects, saving you hours of work if anything goes wrong
- it lets you create a public portfolio to show to HR and managers, if you are looking for a job