Git with it: Git, GitHub, and more version control
An overview of Git and how developers use it to manage their code.
Welcome to Day to Data!
Did a friend send you this newsletter? Be sure to subscribe to get a weekly dose of musings by a data scientist turned venture investor, breaking down technical topics and themes in data science, artificial intelligence, and more. New post every Sunday.
When I first started coding, one of the scariest sights to me was a computer terminal. I’d watch as my computer science professor gracefully navigated through the operating system of a Mac computer. He knew every command, navigated with ease, and never touched his mouse. It was a fascinating sight! It felt like the keys to the computer’s kingdom. But I was terrified of it. I didn’t know how to use it, and felt like if I typed something wrong, I’d just about destroy my computer. The only commands I could safely use were “cd” and “ls”, computer talk for “change directory (aka folder)” and “list files” respectively.
It wasn’t until I started learning about a new program, git, that I felt like I could really move quickly and build some fun stuff, while using the terminal with ease. I could make changes without being worried about losing tested, working code. I could contribute incrementally to larger projects. And I could even contribute to projects that other people were working on. I’ve spoken about how the Jupyter notebook changed the course of software development, but many could say Git was the real catalyst. It’s one of the simplest and most amazing pieces of technology that I’ve interacted with. Let’s talk about it.
Before diving in, I’d take a quick read of my past post on open source software. It’ll set the stage for the energy of the dev community, excited about the idea of moving quickly by building in public.
Let’s git it started!
Some quick definitions: when we talk about version control, think about editing a Google Doc, where there are options to revert the document to older versions. You can also see who has made which changes to the document. Version control is the broad practice of tracking what changes are being made, and by whom, to a body of work (in our case, a code base), along with a running record of those changes.
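To make that definition a bit more concrete, here is a minimal sketch of the idea in Python. It is a toy, not how Git works under the hood: the `Version` and `History` names, the authors, and the sample text are all made up for illustration. It only shows the two things described above, a running record of who changed what and the ability to go back to an older version.

```python
from dataclasses import dataclass, field


@dataclass
class Version:
    author: str  # who made the change
    note: str    # short description of the change
    text: str    # full contents of the document after the change


@dataclass
class History:
    """Hypothetical running record of every saved version."""
    versions: list = field(default_factory=list)

    def save(self, author, note, text):
        """Record a new version on top of the existing ones."""
        self.versions.append(Version(author, note, text))

    def current(self):
        """Return the latest text, or an empty string if nothing is saved."""
        return self.versions[-1].text if self.versions else ""

    def revert_to(self, index, author):
        """Restore an older version by saving it again as the newest one."""
        old = self.versions[index]
        self.save(author, f"revert to version {index}", old.text)


history = History()
history.save("ana", "first draft", "print('hello')")
history.save("ben", "add a bug", "print('hello'")
history.revert_to(0, author="ana")

# Show who changed what, in order.
for number, version in enumerate(history.versions):
    print(number, version.author, version.note)
```

Notice that reverting does not erase anything. The buggy version is still in the record, which is exactly what makes it safe to experiment.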
In 2005, Linus Torvalds, creator of the Linux kernel, wrote the earliest version of Git. Before writing Git, Linus and the Linux team maintained the Linux kernel using a proprietary version control tool called BitKeeper, whose owner revoked the team’s free access in 2005. After he “disappeared over the weekend”, Linus “emerged the following week” with Git, a version control system that let the Linux team keep working. This software, now maintained by Junio Hamano (who has the iconic /gitster username on LinkedIn), was the start of something really. freaking. huge.
Why was Git such a big deal?
Let me take a step back. Software development was becoming a really big industry. Tools like BitKeeper were booming as software developers needed ways to maintain versions of their code and share those versions with their colleagues. As software development became more mainstream in the enterprise, teams were collaborating more readily, building massive software programs, and needed the tools to support this. BitKeeper enabled programmers to have a version of their code on their local computer and provided the basics of version control. However, it wasn’t open source and struggled with some basics (like renaming a project). There was so much code being written (cool visual here), and all that code was being collaborated on by lots of devs, iterated upon quickly, and constantly in need of repair. Imagine life before shared Google Docs and without an undo button: tough. Tools did exist, but they weren’t doing everything that was needed for the rapid pace of innovation.
For a quick laugh: one of Torvalds’s early commits added an epic README to the git repository. It read as follows:
"git" can mean anything, depending on your mood.
- random three-letter combination that is pronounceable, and not actually used by any common UNIX command. The fact that it is a mispronounciation of "get" may or may not be relevant.
- stupid. contemptible and despicable. simple. Take your pick from the dictionary of slang.
- "global information tracker": you're in a good mood, and it actually works for you. Angels sing, and a light suddenly fills the room.
- "goddamn idiotic truckload of sh*t": when it breaks
What is Git?
Git is a version control management system.
Git stores code in repositories, which act like folders for the files of code, text, and other information needed for a computer program. Most folders on your computer could be stored in a git repository. You can have a local repository, which holds the files on your personal machine, and also a remote repository, which holds the shared files in a remote location, like GitHub. These remote and local repositories can communicate with each other, which allows files to be updated and collaborated on by many developers.
In git, a branch is a separate line of work that starts from a copy of the master branch. Branches enable individuals to update code without changing the master branch. Branches can be merged back into master once approved by others through a merge request (GitHub calls this a pull request).
In the simple visual above, the green line is called the master branch. This is the “source of truth” for the given code base. The master branch is stored in a remote repository, on a site like GitHub, that all collaborators have access to. Everyone can see what is in the master branch at any given time. In order to add new code to the master branch, a merge request is required (not always, but it is a best practice!). A merge request is the process of collaborators reviewing new code to ensure that it’s quality code fit for the master branch. This process enables developers to work on separate parts of a larger project, and publish their work to the shared master branch after team approval.
If developer1, shown in blue, wants to make an update, developer1 will pull the most recent version of the master branch to their local machine, where they will make changes to the code. Then, they will push their updated code from their local machine to the remote repository, and create a merge request. The other developers will look at the merge request and say, ok, this looks good and we can now make this part of our master branch!
As you can see, developer2 is also making changes at the same time. Since the remote version of the code base is shared, that’s totally ok. developer2 can pull the latest version of master, make some edits on their local computer, then create another merge request to get approval from their collaborators on their new edits.
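If it helps to see that walkthrough as code, here is a rough sketch in Python. Fair warning: this is a toy model, not Git itself. `ToyRepo`, `pull`, `push`, `merge_request`, and the branch names are all hypothetical, and a real Git merge compares file contents and can run into conflicts, which this sketch skips entirely. It only mirrors the steps above: pull master, work on your own branch locally, push, and merge once approved.

```python
import copy


class ToyRepo:
    """A repository reduced to branch names mapped to lists of commits."""

    def __init__(self):
        self.branches = {"master": []}

    def branch(self, name, start="master"):
        """Start a new branch as a copy of an existing one."""
        self.branches[name] = list(self.branches[start])

    def commit(self, branch, author, message):
        """Record who changed what on the given branch."""
        self.branches[branch].append((author, message))


def pull(remote):
    """Copy the shared repository onto a developer's machine."""
    return copy.deepcopy(remote)


def push(local, remote, branch):
    """Upload one local branch to the shared repository."""
    remote.branches[branch] = list(local.branches[branch])


def merge_request(remote, branch, approved):
    """Add a branch's new commits to master, but only after approval."""
    if not approved:
        return False
    master = remote.branches["master"]
    for change in remote.branches[branch]:
        if change not in master:
            master.append(change)
    return True


# The shared "source of truth" that everyone can see.
remote = ToyRepo()
remote.commit("master", "maintainer", "initial commit")

# developer1 pulls, works on a branch locally, then pushes it.
dev1 = pull(remote)
dev1.branch("login-page")
dev1.commit("login-page", "developer1", "add login page")
push(dev1, remote, "login-page")

# developer2 does the same thing at the same time on another branch.
dev2 = pull(remote)
dev2.branch("fix-typo")
dev2.commit("fix-typo", "developer2", "fix typo in README")
push(dev2, remote, "fix-typo")

# Collaborators review and approve both merge requests.
merge_request(remote, "login-page", approved=True)
merge_request(remote, "fix-typo", approved=True)

for author, message in remote.branches["master"]:
    print(author, "-", message)
```

The thing to notice is that neither developer ever touches master directly. Each works on a private copy, and master only changes when a merge request is approved.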
And the coolest part - the details of all these changes are stored by git. The writers of the code, the exact characters changed each time, and lots of metadata are stored in the repository. This enables version control, backtracking, and easy comparisons as developers iterate. There are a lot of caveats to this process, but this is a simplified walkthrough of the process of updating a master branch of code.
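To get a feel for what “easy comparisons” means, here is a small sketch using Python’s built-in difflib module. This is not Git’s own comparison machinery, just an illustration of the same idea, and the `greet.py` file and its contents are made up for the example.

```python
import difflib

# Two versions of the same hypothetical file, one line per list item.
old_version = [
    "def greet(name):",
    "    print('Hello ' + name)",
]
new_version = [
    "def greet(name):",
    "    print(f'Hello, {name}!')",
]

# unified_diff marks removed lines with "-" and added lines with "+".
diff = list(
    difflib.unified_diff(
        old_version,
        new_version,
        fromfile="greet.py (before)",
        tofile="greet.py (after)",
        lineterm="",
    )
)

for line in diff:
    print(line)
```

Lines starting with a minus sign were removed and lines starting with a plus sign were added, which is the same convention you will see when reviewing changes in Git or on GitHub.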
In an interview with Linus, he was asked:
“What is the most interesting use you’ve seen for Git and/or GitHub?”
“I’m just happy that it made it so easy to start a new project. Project hosting used to be painful, and with git and GitHub it’s just so trivial to do a random small project. It doesn’t matter what the project is; what matters is that you can do it.” — Linus Torvalds
GitHub was born
Work on GitHub began in 2007, and the site launched publicly in 2008. GitHub is software that provides a user experience around git, storing repositories and enabling version control so that developers can truly use git with ease. It has enabled some incredible innovation and been the platform for sharing some of the most exciting products of our era - React, Llama, TensorFlow, Streamlit, MLflow, and so many more!
As of January 2023, GitHub has over 370M repositories and 100M+ developers, and is estimated to be making over $1B in annual revenue. The company, bought for $7B+ in 2018 by Microsoft, is a behemoth that keeps on growing, especially with their code generation tool, GitHub Copilot. I’m trying to keep the newsletters a bit shorter these days, so I’ll stop there, but that’s a story for another newsletter!
That’s a wrap for the week!
Thanks for tuning in. It’s been quite the busy few weeks. I’ve been thinking a lot about what I hope for the future of Day to Data, and I’ve got some ideas for some fun stuff in the future - collaborations with other friends, talking to engineers and builders, and continuing to make tech easy to understand. If you’ve read this far, share Day to Data with a friend! Or let me know what you’d like to read next!