Writing a CDN Health Monitor in Node.js

One of the scariest things about running my online business is the reality that things can break, and break badly, without me ever knowing. I’m lucky to have enough customers that someone typically tells me when something breaks, but even so, most people just shrug their shoulders and move on. Net result: fewer subscribers, worries about money, and still no idea that something’s wrong.

Nowhere is this more true than in my content delivery network (CDN).

A CDN is critical for a content-heavy website like mine. In order for my videos to play smoothly, you need a solid 2500kbps connection. (Actually, it’s usually less than that, but I use 2500kbps as a hard minimum.) If my content server were here in Portland, I would see great performance, but people in Australia wouldn’t. And they would shrug their shoulders and move on.

A CDN resolves this problem by storing content on servers that are physically distributed around the world. Some sort of deep magic causes requests to be directed to the server that’s physically closest, which means better performance. For example, my CDN provider has servers in Amsterdam, Seattle, and Singapore, among others. That means the bandwidth-starved denizens of Australia actually have a shot at enjoying my videos.

Deep Magic, Invisible Problems

The deep magic is a problem. If I access a video, I’m directed to the server in Seattle, where everything’s hunky-dory and whole videos are served in ten seconds. But my customers might be accessing the server in Los Angeles, which has decided to take a long lunch and then declare a mental health holiday.

So not only can things go wrong without me knowing it, but things can go wrong for everyone else while continuing to work just fine for me. Lovely.

Worse, when things go wrong, they go wrong in obscure ways. Just recently, my origin server suffered some sort of inode exhaustion problem that evaded my provider’s monitoring. This resulted in a few of my less-frequently-used video files failing to mirror to a subset of edge servers. Luckily, a subscriber told me about it (thanks again, Alan!) and my CDN provider fixed the problem.

I’ve also seen things like connections that succeed but return empty files, connections that serve data only slightly faster than stamping cuneiform into clay tablets, and servers that just give in to the futility of life and refuse all connections.

(As an aside, SendFaster has been wonderfully responsive when things go wrong. And of course they have their own monitoring in place.)

I have extensive monitoring in place on my app server, and I use Pingdom for uptime monitoring, but Pingdom completely fails to detect CDN problems. When a connection fails, Pingdom confirms the failure by connecting from a different server. And of course that request ends up at a different edge server, and everything seems fine.

These are the kinds of things that keep me up at night.

A Health Monitor in Node.js

I didn’t want to write my own network monitor. But as I researched the problem, I had a hard time finding something that did what I needed. I needed something that would:

  • Check all CDN nodes, not just the closest edge server
  • Monitor all my files (remember, some can fail while others continue working)
  • Check a 100MB+ video file without downloading the whole damn thing
  • Tell me when bandwidth approaches or drops below the crucial 2500kbps mark
  • Inform me of “it technically works” problems like returning a nearly-empty file

So I wrote my own. Node made it pretty easy, actually.

Here’s the code. I’ll explain how it works in a moment.

"use strict";

var http = require("http");

exports.check = function check(server, bytesToRetrieve, timeoutInMillis, callback) {
  var start = Date.now();
  var errors = [];
  var bytesReceived = 0;

  var request = http.get({
    host: server.host,
    port: server.port,
    path: server.path,
    headers: {
      Host: server.hostname,
      "Cache-Control": "no-cache, no-store, no-transform"
    },
    agent: false   // Don't pool the connection (this line is not tested)
  });

  request.on("response", function(response) {
    if (response.statusCode !== 200) errors.push("Wrong status code: " + response.statusCode);

    response.on("data", function(chunk) {
      bytesReceived += chunk.length;
      if (bytesReceived > bytesToRetrieve) request.abort();
    });
    response.on("error", fail);
    response.on("end", done);
  });
  request.on("error", fail);
  setTimeout(request.abort.bind(request), timeoutInMillis);

  function fail(error) {
    errors.push(error);
    done();
  }

  function done() {
    callback(errors, bytesReceived, Date.now() - start);
  }
};

This code has a few notable points. First, it only downloads part of the file; I don’t want to download 100MB+ on each check. Second, it aborts the request after a timeout calculated so that anything slower than 2500kbps gets cut off.
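To illustrate that calculation (this is my own sketch, not code from the monitor): at 2500 kilobits per second, bytesToRetrieve bytes should arrive within (bytes × 8) ÷ 2500 milliseconds, since one kilobit per second is exactly one bit per millisecond.

```javascript
"use strict";

// Illustrative only: derive a timeout from the byte count at the
// 2500kbps floor. Dividing bits by kilobits-per-second gives
// milliseconds directly, because 1 kilobit/sec === 1 bit/ms.
var MIN_KBPS = 2500;

function timeoutFor(bytesToRetrieve) {
  var bits = bytesToRetrieve * 8;
  return Math.ceil(bits / MIN_KBPS);
}

console.log(timeoutFor(1000000));   // 3200ms to fetch 1MB at 2500kbps
```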

When the check is done, it tells the caller about any errors, how much data was received, and how long it took. That information gets passed into my monitoring subsystem, which crunches the data and alerts me if the node is down (errors occurred), too slow (size / time < 2500kbps), or just completely borked (fewer bytes received than expected). The whole thing runs on a timer that checks a random host each time (and eventually, a random video URL as well).
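Those three alert conditions can be sketched roughly like this (classify and its structure are my own illustration; the actual monitoring subsystem isn’t shown in this post):

```javascript
"use strict";

var MIN_KBPS = 2500;

// Rough sketch of the alerting rules. The function name and return
// values are invented; only the three conditions come from the
// description above.
function classify(errors, bytesReceived, elapsedMillis, bytesExpected) {
  if (errors.length > 0) return "down";

  // bits per millisecond is the same number as kilobits per second
  var kbps = (bytesReceived * 8) / elapsedMillis;
  if (kbps < MIN_KBPS) return "too slow";

  if (bytesReceived < bytesExpected) return "borked";
  return "ok";
}

console.log(classify([], 1000000, 1000, 1000000));   // "ok": 8000kbps, full size
```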

The code works well, but it’s not perfect: I’ve disabled checks of the Singapore server, for example, because the monitor runs out of my normal app server on the US east coast. I was constantly getting speed warnings about the Singapore server—presumably, because it was too far away, not because of actual problems. A better monitor would be geo-distributed, but that’s beyond my capability.

How It Works

It took some digging through the docs, and careful review of how Node handles HTTP errors, but in the end I was pleasantly surprised at how compact and simple the code was. Here’s how it works:

"use strict";

var http = require("http");

exports.check = function check(server, bytesToRetrieve, timeoutInMillis, callback) {
  var start = Date.now();
  var errors = [];
  var bytesReceived = 0;
  ⋮
};

This is just your normal Node boilerplate. We "use strict"; to catch errors. The require call loads Node’s http module. The exports.check line exports our check function for other modules to use. And finally, we set up some variables. Next!

exports.check = function check(server, bytesToRetrieve, timeoutInMillis, callback) {
  ⋮
  var request = http.get({
    host: server.host,
    port: server.port,
    path: server.path,
    headers: {
      Host: server.hostname,
      "Cache-Control": "no-cache, no-store, no-transform"
    },
    agent: false        // Don't pool the connection (this line is not tested)
  });
  ⋮

The http.get call performs the request. The host is actually the server’s IP address, because we want to bypass DNS and its CDN magic. Port and path are self-explanatory. The Host: header carries our normal hostname, which tells the server which site we’re asking for. The Cache-Control line is my attempt to make sure we’re actually talking to the server and not some intermediate cache. Similarly, the agent option attempts to ensure that we open a fresh connection every time, just in case the server starts rejecting new connections.

  ⋮
  request.on("response", function(response) {
    ⋮
  });
  request.on("error", fail);
  setTimeout(request.abort.bind(request), timeoutInMillis);
  ⋮

Now we’re getting to the meat of it. We do three things with our request:

  1. Wait for a response. I’ll describe that in a moment.

  2. Listen for any errors, such as the server refusing the connection, and run our fail function if they occur.

  3. Automatically abort the request after our time is up.

The setTimeout call is a bit weird if you’re not familiar with the idiom. The first parameter to setTimeout is the function we want to call when the timeout occurs. It’s basically a more compact version of this:

  setTimeout(function() {
    request.abort();
  }, timeoutInMillis);

Next, let’s look at what happens when we receive a response.

  ⋮
  request.on("response", function(response) {
    if (response.statusCode !== 200) errors.push("Wrong status code: " + response.statusCode);

    response.on("data", function(chunk) {
      bytesReceived += chunk.length;
      if (bytesReceived > bytesToRetrieve) request.abort();
    });
    response.on("error", fail);
    response.on("end", done);
  });
  ⋮

First, we check the status code of the response. If it isn’t 200 (“OK”), we add an error to our errors array.

Next, we set up an event handler that fires every time we receive any data from the server. We don’t care about the data—we just throw it away—but we do keep track of how much we receive. Once we’ve received as much as we were looking for, we abort the request.

Finally, we listen for the response to end or error out. We’ll call fail() or done() when it does.

Now for our last bit of code:

  ⋮
  function fail(error) {
    errors.push(error);
    done();
  }

  function done() {
    callback(errors, bytesReceived, Date.now() - start);
  }
};

These functions are called when the request ends. There are several ways that could happen:

  • The request or response could error out, which calls fail() with an error. The fail function simply logs the error and calls done().

  • Our timeout could trigger, which calls request.abort(). That gracefully shuts down the request, which results in the response’s end event firing, which calls done(). After that, my monitoring subsystem detects that bytes/time is low and sends me an alert.

  • We could get all bytes we expected, which calls request.abort(). That gracefully shuts down the request, as with the timeout.

  • The response could end normally, which results in the “end” event firing, which calls done(). As with the timeout, my monitoring subsystem detects that we didn’t receive enough data and alerts me.

A Simple Monitor

This is a pretty simple monitor, but it’s already alerted me to several problems. In an ideal world, I’d rather use a service like Pingdom to do this sort of monitoring, but I wasn’t able to find anything that handled everything I needed. In the absence of something better, this does the trick. And one nice thing about this tool compared to something like Pingdom is that my monitoring configuration is versioned along with all my other code.
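The server list itself can be nothing more than a module checked into the repository. This shape is my invention (the post doesn’t show the actual config format), and the IPs come from the documentation-only 192.0.2.x range:

```javascript
"use strict";

// Invented example of a versioned server list; the real config format
// isn't shown in this post.
var servers = [
  { host: "192.0.2.10", port: 80, path: "/videos/lesson-1.mp4", hostname: "www.example.com" },
  { host: "192.0.2.20", port: 80, path: "/videos/lesson-1.mp4", hostname: "www.example.com" }
];

// Pick a random server each timer tick, as the monitor does.
function randomServer() {
  return servers[Math.floor(Math.random() * servers.length)];
}

console.log(randomServer().host);
```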

I still worry about what I don’t know, but at least I’ve removed a good chunk of CDN issues from the list.
