Continuous Integration with Git

Lessons Learned: Continuous Integration with Git

Hi, everyone. This is James Shore with my lessons learned about continuous integration and Git. I'm recording this on August 3rd, 2012.

We're going to start out by reviewing version control fundamentals and how they apply to Git. Then we'll take a look at what continuous integration is and how it's often misunderstood. Next, we'll look at how to make continuous integration work with Git and a team of multiple developers. We'll close with a handy utility to make it easier for you to use continuous integration on your team.

By the way, although I'll be describing how to use parts of Git in this episode, this isn't meant to be a Git tutorial. I'm going to gloss over a lot of details. If you're new to Git, you should also read one of the many introductions to Git on the web, such as the GitHub tutorial or the Pro Git book. I also recommend Think Like (a) Git and A Visual Git Reference once you're familiar with the basics.

Version Control Fundamentals

First, some version control fundamentals. In this section, we're going to look at the motivation for version control, time travel, branches, merges, and teamwork.

Motivation

When you first started programming, you probably noticed something painful: when a program gets large enough, you can break your code so badly that you're unable to fix it again. One way to combat this is to back up your files every time you finish a significant change. This way, if you break something, you can go back in time to a previous backup and get things working again.

This is the fundamental idea behind version control. At their heart, version control systems like Git are just fancy backup systems for your source code. Every time you finish a significant change, you "commit" those changes. Each commit is a backup of your files at that point in time. In Git, you commit in two steps: first, git add . to tell Git about any files you've added, then git commit -a to actually make the commit.
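Here's a minimal sketch of that two-step commit in a throwaway repository. The file name, the commit messages, and the "Example Dev" identity are assumptions for illustration, and the -m option is added only so the commit doesn't open an editor.

```shell
set -e
# Assumed identity so the commits work on a machine with no Git configuration.
export GIT_AUTHOR_NAME="Example Dev" GIT_AUTHOR_EMAIL="dev@example.com"
export GIT_COMMITTER_NAME="Example Dev" GIT_COMMITTER_EMAIL="dev@example.com"

repo="$(mktemp -d)"                        # throwaway directory for the demo
cd "$repo"
git init -q

echo "first draft" > notes.txt
git add .                                  # tell Git about any new files
git commit -q -a -m "Add first draft"      # back up the current state

echo "second draft" >> notes.txt
git add .
git commit -q -a -m "Add second draft"

git log --oneline                          # one line per commit, newest first
```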

Each commit points back to the one that came before. All your commits, reaching back into history, form your "source code repository." It's also called your "version history."

Traveling Back in Time

One of the neat things about version control systems is that you can travel back in time to see what your source code looked like in the past. You can go back to any commit in your version history.

Git calls your current position in the version history the "HEAD." As you make commits, Git updates the HEAD to point to your most recent changes.

To go back in time, you use the git checkout command. When you do, Git moves the HEAD to the commit you asked for and updates your files to match.
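As a rough illustration, this sketch makes two commits, travels back to the first one, and then returns. The file contents and identity are assumptions. HEAD~1 means "one commit before HEAD," and git checkout - returns to the branch you were on before.

```shell
set -e
# Assumed identity so the commits work on a machine with no Git configuration.
export GIT_AUTHOR_NAME="Example Dev" GIT_AUTHOR_EMAIL="dev@example.com"
export GIT_COMMITTER_NAME="Example Dev" GIT_COMMITTER_EMAIL="dev@example.com"

repo="$(mktemp -d)"
cd "$repo"
git init -q

echo "version 1" > notes.txt
git add .
git commit -q -m "Version 1"
echo "version 2" > notes.txt
git commit -q -a -m "Version 2"

git checkout -q HEAD~1     # move HEAD back one commit; the files follow
cat notes.txt              # prints: version 1
git checkout -q -          # return to the branch you were on
cat notes.txt              # prints: version 2
```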

Branches

Because you have this ability to travel back in time, you also have the ability to create alternate histories--a sort of parallel dimension in your code. If you make a new commit when you've traveled back in time, your commit is added in parallel to your existing history. This is called a "branch." Each branch has its own independent version of your files. This is useful to keep multiple streams of work from interfering with each other.

In Git, every branch needs a name. By default, your first branch is called "master." You can give your current position a branch name by using the git branch command, then switch to that branch with git checkout.

The branch name is actually a pointer to a specific commit. When you commit, your current branch is automatically updated to point to that commit. This makes it easy to travel between branches. Rather than keeping track of long commit identifiers, you can just use branch names to get to the latest code in any particular branch.
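To make the pointer idea concrete, here's a small sketch that travels back one commit, names a branch there, and commits an alternate history. The branch name "experiment" and the file contents are made up for the example, and the first branch's name is read from Git rather than assumed.

```shell
set -e
# Assumed identity so the commits work on a machine with no Git configuration.
export GIT_AUTHOR_NAME="Example Dev" GIT_AUTHOR_EMAIL="dev@example.com"
export GIT_COMMITTER_NAME="Example Dev" GIT_COMMITTER_EMAIL="dev@example.com"

repo="$(mktemp -d)"
cd "$repo"
git init -q
echo "version 1" > notes.txt
git add .
git commit -q -m "Version 1"
echo "version 2" > notes.txt
git commit -q -a -m "Version 2"
start="$(git rev-parse --abbrev-ref HEAD)"   # name of the first branch

git checkout -q HEAD~1          # travel back one commit
git branch experiment           # create a pointer named "experiment" here
git checkout -q experiment      # make it the current branch
echo "alternate version 2" > notes.txt
git commit -q -a -m "Alternate version 2"    # "experiment" now points here

git checkout -q "$start"
cat notes.txt                   # prints: version 2
git checkout -q experiment
cat notes.txt                   # prints: alternate version 2
```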

Merges

Branches are useful for keeping work isolated, but eventually you'll want to combine everything back into a single program. This is called a "merge." To merge in Git, you first switch to the branch you want to merge into, then you tell Git to merge the other branch. Git applies some fancy logic under the covers to combine the alternate histories and then creates a new commit with the result.

Sometimes Git isn't able to do that merge automatically. This is called a merge conflict, and it usually happens when the two branches have both changed the same lines of code. When that happens, Git will mark the lines in question with the two different possibilities. You'll need to merge the lines manually and then commit to finish your merge.
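Here's a sketch of a merge that conflicts because both branches changed the same line, followed by a manual resolution. The branch name "feature" and the file contents are assumptions, and the resolution simply keeps one side to keep the example short.

```shell
set -e
# Assumed identity so the commits work on a machine with no Git configuration.
export GIT_AUTHOR_NAME="Example Dev" GIT_AUTHOR_EMAIL="dev@example.com"
export GIT_COMMITTER_NAME="Example Dev" GIT_COMMITTER_EMAIL="dev@example.com"

repo="$(mktemp -d)"
cd "$repo"
git init -q
echo "runJob()" > job.txt
git add .
git commit -q -m "Base version"
main="$(git rev-parse --abbrev-ref HEAD)"    # name of the first branch

git checkout -q -b feature                   # hypothetical second branch
echo "startJob()" > job.txt
git commit -q -a -m "Rename on feature"

git checkout -q "$main"                      # the branch we merge into
echo "executeJob()" > job.txt
git commit -q -a -m "Rename on the first branch"

# Both branches changed the same line, so the automatic merge fails.
if ! git merge -q -m "Merge feature" feature; then
  cat job.txt                                # shows both versions between conflict markers
  echo "executeJob()" > job.txt              # merge the lines by hand
  git add job.txt
  git commit -q -m "Merge feature, keeping executeJob()"   # finishes the merge
fi
```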

Teamwork

When you use Git, your repository is stored on your local computer in your project's .git directory. This has a lot of advantages, but it doesn't help your team coordinate. When you commit your changes to Git, those changes live on your computer alone. No one else can see them. So Git has the ability to "clone" a repository. When you clone a repository, you copy the entire repository to a new machine. The clone keeps track of where it came from, which allows you to easily send changes back and forth between multiple machines.

For example, if the origin repository makes changes and you want to get them, you can use git pull to merge those changes into your repository. Similarly, if you've made changes and you want to send them back to the origin repository, you can use git push to merge those changes back.
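As a sketch of that back-and-forth, the example below uses two clones of one origin repository on the same disk. The directory names are hypothetical, and the origin is created as a bare repository (one with no working files), which is my assumption rather than something from the video. It keeps the example simple because Git, by default, refuses a push into a branch that the receiving repository has checked out.

```shell
set -e
# Assumed identity so the commits work on a machine with no Git configuration.
export GIT_AUTHOR_NAME="Example Dev" GIT_AUTHOR_EMAIL="dev@example.com"
export GIT_COMMITTER_NAME="Example Dev" GIT_COMMITTER_EMAIL="dev@example.com"

work="$(mktemp -d)"
cd "$work"
git init -q --bare central.git             # the shared origin repository

git clone -q central.git workstation-a     # hypothetical first machine
cd workstation-a
echo "version 1" > notes.txt
git add .
git commit -q -m "Version 1"
git push -q origin HEAD                    # send the commit to the origin

cd "$work"
git clone -q central.git workstation-b     # second machine copies the whole history

cd "$work/workstation-a"
echo "version 2" > notes.txt
git commit -q -a -m "Version 2"
git push -q origin HEAD

cd "$work/workstation-b"
git pull -q --no-rebase                    # merge the origin's new commit
cat notes.txt                              # prints: version 2
```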

Copying and coordinating repositories between multiple machines is called "distributed version control" and it's a major innovation in version control systems. There's a lot more to it than I'm describing here, so check out your Git reference for more information.

Version Control Summary

That's a brief introduction to version control with Git. Fundamentally, version control is a way of keeping backups of all your significant changes. These backups, or "commits," form a version history that allows you to go back in time, create branching histories, and merge those histories back together.

Continuous Integration

When you have a bunch of people working on the same code, coordination can be a problem. Each person (or pair) works independently, but eventually, everyone's code has to be combined together. This is called "integration," and it's often difficult. Any miscommunication means the code won't work when it's integrated. For example, if Sarah writes a function named "runJob()" and Pradeep's code is written to call "startJob()" instead, the code will fail when it's integrated. Problems like this have caused big projects to spend weeks or even months in integration.

The core Agile practice for preventing integration problems is "continuous integration." Continuous integration is a simple idea: rather than waiting until the end of a project to integrate, let's integrate every few hours! That way, miscommunications will be flushed out and resolved early, before they've had a chance to do much damage.

So that's the first principle of continuous integration: integrate frequently. To put it another way, no branch should live for more than a few hours before being integrated.

But it's not enough to just merge your branches every few hours. Your code also needs to work--it has to be ready to ship, from a technical perspective, if not a market perspective. The idea is to flush out all the little issues that build up and prevent you from shipping when a project is supposed to be done.

So that's the second principle of continuous integration: integrated code must be known-good. Achieving this ideal makes the whole team more productive. You can ship any time you want, and you never have to wait for someone else to fix the build. When you pull integrated code, it's guaranteed to work, which means that any problems you encounter are your fault--which means that you can fix them.

This is what people so often get wrong about continuous integration. You should be able to integrate at any time, and integrated code must be known-good. There's a huge number of so-called "continuous integration tools," but most of these tools don't actually ensure known-good integrations. They're actually just build runners. Sure, they'll tell you when your integration failed, but that's not good enough. When you're doing continuous integration well, it should be literally impossible to integrate bad code.

It's actually a lot easier to guarantee known-good integrations than you might think. All you have to do is test your integration before you make it available to the rest of the team. I'll start with an overview of how this works, and then we'll get into the details.

Init: Set up Git for continuous integration.

First, you'll need to set up Git with a master integration machine and multiple development workstations.

1. Integrate locally.

When you're ready to integrate, you'll start by testing and committing everything you've been working on. You want to make sure your code is good before you integrate. That way you know any future problems are due to integration conflicts, not problems in your code.

Once everything builds clean, merge the known-good integration code into your local code. Test it to make sure everything merged properly. If you run into any problems, fix them and commit.

2. Push to private branch on integration machine.

Once you've integrated, you need to make sure it will work on other developers' machines as well. Push the code to the integration machine, but don't make it available to the other developers yet--it's still not known-good.

3. Promote known-good code to public integration branch.

When the code's on the integration machine, test it one last time. If the code doesn't pass, you know there's some sort of machine-specific issue. Maybe you forgot to check in a file, or there's a global variable that needs to be configured automatically, or something else. Whatever the problem is, don't just manually fix it on the integration machine, because then the other developers' builds will break when they run your code on their machines. Instead, go back to your local machine and fix the problem there.

Once everything's working on the integration machine, your code is known-good. Now you can promote it to the public integration branch and let everyone else know you've integrated. You're done!

Start over if anyone else integrates before you're done.

Now, there is one caveat to this. If anyone else integrates while you're doing this, you'll have to start over with step 1 and integrate their code before you continue. This won't happen too often in practice as long as your build is reasonably fast.

In fact, good builds are the real challenge with continuous integration. Because you're testing the code so often, you need a fast, automated build. A fast build is the hardest part of doing continuous integration well, but it is so worth it. A good, fast build will multiply your productivity as a developer, even if you aren't using continuous integration.

Setting Up Git for Continuous Integration

Now let's talk about the details. To use Git for continuous integration, we're going to set up a master integration machine and multiple development machines. The integration machine will perform double-duty: in addition to containing the canonical, known-good repository, it will also be used to test if code is known-good. The development machines will be used for, well, development.

The integration repository will have a master integration branch for known-good code and a separate test branch for each development workstation. The development workstations will each have one development branch, corresponding to the test branch on the integration server.

The development workstations will pull from the integration branch, do work, then push to their workstation's branch. Once the code in a workstation branch is proven to be known-good, it will be promoted to the master integration branch.

To set it up, create (or copy) a repository on the integration machine. Then create an integration branch and one branch for each development workstation. Finally, switch to the integration branch. That's everything you need to do on the integration machine.
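Here's one way the integration machine setup might look. The branch names "integration", "dev1", and "dev2" and the directory name are assumptions for illustration. Leaving the integration branch checked out matters because Git, by default, refuses a push into the branch a repository currently has checked out, so the workstation branches stay free to receive pushes.

```shell
set -e
# Assumed identity so the commits work on a machine with no Git configuration.
export GIT_AUTHOR_NAME="Example Dev" GIT_AUTHOR_EMAIL="dev@example.com"
export GIT_COMMITTER_NAME="Example Dev" GIT_COMMITTER_EMAIL="dev@example.com"

work="$(mktemp -d)"
cd "$work"
git init -q integration-machine            # hypothetical repository directory
cd integration-machine
echo "initial" > README.txt
git add .
git commit -q -m "Initial commit"          # branches need a commit to point to

git branch integration                     # public, known-good branch
git branch dev1                            # private test branch for workstation 1
git branch dev2                            # private test branch for workstation 2
git checkout -q integration                # leave the integration branch checked out

git branch                                 # the * marks integration as current
```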

On the development workstations, clone the integration machine repository. (For simplicity, I'm showing a local clone in this video. You'll need to clone over your local network. See your Git reference for a guide.) Once the repository is cloned, switch to the development workstation's named branch.

Repeat for each additional development workstation.
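A sketch of the workstation setup, again using local clones on one disk. The first half just recreates a stand-in integration repository so the example runs on its own. The branch and directory names are assumptions.

```shell
set -e
# Assumed identity so the commits work on a machine with no Git configuration.
export GIT_AUTHOR_NAME="Example Dev" GIT_AUTHOR_EMAIL="dev@example.com"
export GIT_COMMITTER_NAME="Example Dev" GIT_COMMITTER_EMAIL="dev@example.com"

work="$(mktemp -d)"
cd "$work"

# Stand-in for the integration machine's repository.
git init -q integration-machine
cd integration-machine
echo "initial" > README.txt
git add .
git commit -q -m "Initial commit"
git branch integration
git branch dev1
git branch dev2
git checkout -q integration

# On each workstation: clone, then switch to that workstation's branch.
for ws in dev1 dev2; do
  git clone -q "$work/integration-machine" "$work/$ws-workstation"
  cd "$work/$ws-workstation"
  git checkout -q "$ws"      # creates a local branch that tracks origin/<name>
done
```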

1. Integrate Locally

When you're ready to integrate, make sure your code builds clean and check in your changes. Then pull the latest known-good code from the integration machine. Make sure it builds clean. If it doesn't, fix the problems and commit before continuing.
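Step 1 might look something like the sketch below. The build.sh script is a hypothetical stand-in for your real automated build (it only checks that app.txt isn't empty), and the branch names are assumptions. The other developer's integration is simulated by committing directly in the integration repository.

```shell
set -e
# Assumed identity so the commits work on a machine with no Git configuration.
export GIT_AUTHOR_NAME="Example Dev" GIT_AUTHOR_EMAIL="dev@example.com"
export GIT_COMMITTER_NAME="Example Dev" GIT_COMMITTER_EMAIL="dev@example.com"

work="$(mktemp -d)"
cd "$work"

# Setup: integration repository with a stub build, plus a dev1 clone.
git init -q integration-machine
cd integration-machine
printf 'test -s app.txt\n' > build.sh      # hypothetical build script
echo "line 1" > app.txt
git add .
git commit -q -m "Initial commit"
git branch integration
git branch dev1
git checkout -q integration
git clone -q . "$work/dev1-workstation"

# Someone else integrates in the meantime (simulated directly here).
echo "docs" > README.txt
git add .
git commit -q -m "Other developer's integration"

# Step 1, on the dev1 workstation.
cd "$work/dev1-workstation"
git checkout -q dev1
echo "line 2" >> app.txt
sh ./build.sh                              # make sure your own code builds clean
git commit -q -a -m "dev1: add line 2"     # check in your changes
git pull -q --no-rebase --no-edit origin integration   # merge the known-good code
sh ./build.sh                              # make sure the merged code builds clean
```

Because the script starts with set -e, a failing build stops it before anything is committed or merged.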

2. Push to private branch

Once your code is integrated, push it to your private branch on the integration machine.
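A sketch of the push, with the same assumed branch names. Only the dev1 branch on the integration machine changes. The integration branch is untouched, so nobody else sees the code yet.

```shell
set -e
# Assumed identity so the commits work on a machine with no Git configuration.
export GIT_AUTHOR_NAME="Example Dev" GIT_AUTHOR_EMAIL="dev@example.com"
export GIT_COMMITTER_NAME="Example Dev" GIT_COMMITTER_EMAIL="dev@example.com"

work="$(mktemp -d)"
cd "$work"

# Setup: integration repository plus a dev1 clone with one commit of work.
git init -q integration-machine
cd integration-machine
echo "line 1" > app.txt
git add .
git commit -q -m "Initial commit"
git branch integration
git branch dev1
git checkout -q integration
git clone -q . "$work/dev1-workstation"
cd "$work/dev1-workstation"
git checkout -q dev1
echo "line 2" >> app.txt
git commit -q -a -m "dev1: add line 2"

# Step 2: push to the private dev1 branch on the integration machine.
git push -q origin dev1
```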

3. Promote to public integration branch

Now you need to make sure your code works on the integration machine. First, check out your workstation branch. As a precaution, merge the integration branch again, just in case you missed someone else's integration. This is just a precaution--you're really supposed to start over if someone else integrates. Next, run the build to make sure the code is known-good. If the merge or build fails, you'll need to switch back to the integration branch and go back to your workstation to fix the problem.

Assuming everything worked, you can promote your code to the integration branch. Switch back to the integration branch and merge your code. The options I'm using here ensure that your integration shows up in the commit log.
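Here's a sketch of the promotion step under the same assumptions (a hypothetical build.sh and the branch names "integration" and "dev1"). The pushed work is simulated with a direct commit on dev1. The --no-ff option shown here is my choice of a merge option that records a merge commit even when Git could fast-forward, so the integration appears as its own entry in the log. It isn't necessarily the exact set of options used in the video.

```shell
set -e
# Assumed identity so the commits work on a machine with no Git configuration.
export GIT_AUTHOR_NAME="Example Dev" GIT_AUTHOR_EMAIL="dev@example.com"
export GIT_COMMITTER_NAME="Example Dev" GIT_COMMITTER_EMAIL="dev@example.com"

repo="$(mktemp -d)"                        # stand-in for the integration machine
cd "$repo"
git init -q
printf 'test -s app.txt\n' > build.sh      # hypothetical build script
echo "line 1" > app.txt
git add .
git commit -q -m "Initial commit"
git branch integration
git checkout -q -b dev1
echo "line 2" >> app.txt
git commit -q -a -m "dev1: add line 2"     # stands in for work pushed from dev1
git checkout -q integration

# Step 3: test on the workstation branch, then promote.
git checkout -q dev1
if git merge -q -m "Merge integration into dev1" integration && sh ./build.sh; then
  git checkout -q integration
  git merge -q --no-ff -m "Integrate dev1" dev1    # always record a merge commit
else
  git merge --abort 2>/dev/null || true            # undo a conflicted merge, if any
  git checkout -q integration
  echo "Not known-good: fix it on the workstation and start over." >&2
fi
```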

That's it! Again, if someone else integrates while you're in the middle of this process, you'll have to start over with step 1.
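One way to check whether someone else has integrated is to fetch and ask Git whether the integration branch is already contained in your history. This is only a sketch with assumed branch names, and the other developer's promotion is simulated with a direct commit.

```shell
set -e
# Assumed identity so the commits work on a machine with no Git configuration.
export GIT_AUTHOR_NAME="Example Dev" GIT_AUTHOR_EMAIL="dev@example.com"
export GIT_COMMITTER_NAME="Example Dev" GIT_COMMITTER_EMAIL="dev@example.com"

work="$(mktemp -d)"
cd "$work"
git init -q integration-machine
cd integration-machine
echo "line 1" > app.txt
git add .
git commit -q -m "Initial commit"
git branch integration
git branch dev1
git checkout -q integration
git clone -q . "$work/dev1-workstation"

# dev1 commits work based on the current integration branch.
cd "$work/dev1-workstation"
git checkout -q dev1
echo "line 2" >> app.txt
git commit -q -a -m "dev1: add line 2"

# Meanwhile, another developer's code is promoted (simulated directly).
cd "$work/integration-machine"
echo "docs" > README.txt
git add .
git commit -q -m "Other developer's integration"

# Back on dev1: is everything on the integration branch already in our history?
cd "$work/dev1-workstation"
git fetch -q origin
if git merge-base --is-ancestor origin/integration HEAD; then
  result="continue"
else
  result="start over"
fi
echo "$result"                             # prints: start over
```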

A Convenient Integration Script

As a convenience, I've created a script to automate each of these three steps. You can find it at the URL on your screen. Let's see how it works when we have two developers working on the same code.

First, each workstation makes a change to the same file without realizing it.

The person at the dev1 workstation happens to integrate first. First, he looks at the quick reference as a reminder. Then he builds, integrates locally, confirms that it worked, and pushes to the integration machine. Next he walks over to the integration machine and promotes the build. The script automatically confirms that the build works before promoting. That's it--he's done.

Meanwhile, the person at the dev2 workstation is ready to integrate as well. He too looks at the quick reference. Then he builds and integrates locally. Unfortunately, his changes conflict with dev1's changes. He fixes them, confirms that the change worked, and commits.

Now he's ready to push to the integration machine. He pushes, walks over to the integration machine, promotes his code, and he's done as well.

Conclusion

So that's what I've learned so far about continuous integration with Git. To summarize, continuous integration is based on two principles: integrate frequently, and ensure your integrated code is known-good. You can use Git to guarantee a known-good build by integrating locally, testing your code in a private branch on an integration machine, then promoting the code to a master integration branch once it's been proven to be known-good. I've provided a script to make this process more convenient and you can find it at the link on your screen.

Thanks for watching, and I'll see you next time!