# thraxil.org:

## anders pearson

I'm a programmer, artist, writer, and more. I wrote the code that powers this site. I work for CCNMTL as a programmer. I was a WaSP member (emeritus now). I paint and draw quite a lot. I code quite a bit.

## Using Make with Django

I would like to share how I use a venerable old technology, GNU Make, to manage the common tasks associated with a modern Django project.

As you'll see, I actually use Make for much more than just Django and that's one of the big advantages that keeps pulling me back to it. Having the same tools available on Go, node.js, Erlang, or C projects is handy for someone like me who frequently switches between them. My examples here will be Django related, but shouldn't be hard to adapt to other kinds of codebases. I'm going to assume basic unix commandline familiarity, but nothing too fancy.

Make has kind of a reputation for being complex and obscure. In fairness, it certainly can be. If you start down the rabbit hole of writing complex Makefiles, there really doesn't seem to be a bottom. Its reputation isn't helped by its association with autoconf, which does deserve its reputation for complexity, but we'll just avoid going there.

But Make can be simple and incredibly useful. There are plenty of tutorials online and I don't want to repeat them here. To understand nearly everything I do with Make, this sample stanza is pretty much all you need:

```make
ruleA: ruleB ruleC
	command1
	command2
```

A Makefile mainly consists of a set of rules, each of which has a list of zero or more prerequisites, followed by zero or more commands. At a high level, when a rule is invoked, Make first invokes any prerequisite rules (if needed; they may have their own stanzas defined elsewhere) and then runs the designated commands. There's some nonsense having to do with requiring actual TAB characters instead of spaces for indentation, plus syntax for comments (lines starting with `#`) and for defining and using variables, but that's the gist of it right there.

So you can put something extremely simple (and pointless) like

```make
say_hello:
	echo make says hello
```

in a file called Makefile, then go into the same directory and do:

```
$ make say_hello
echo make says hello
make says hello
```

Not exciting, but if you have some complex shell commands that you have to run on a regular basis, it can be convenient to package them up in a Makefile so you don't have to remember them or type them out every time.

Things get interesting when you add in the fact that rules are also interpreted as filenames, and Make is smart about looking at timestamps and keeping track of dependencies to avoid doing more work than necessary. I won't give trivial examples to explain that (again, there are other tutorials out there), but the general interpretation to keep in mind is that the first stanza above specifies how to generate or update a file called `ruleA`. Specifically, the other files, `ruleB` and `ruleC`, need to have first been generated (or already be up to date), then `command1` and `command2` are run. If `ruleA` has been updated more recently than `ruleB` and `ruleC`, we know it's up to date and nothing needs to be done. On the other hand, if the `ruleB` or `ruleC` files have newer timestamps than `ruleA`, `ruleA` needs to be regenerated. There will be meatier examples later, which I think will clarify this.

The original use-case for Make was handling complex dependencies in languages with compilers and linkers. You wanted to avoid re-compiling the entire project when only one source file changed. (If you've worked with a large C++ project, you probably understand why that would suck.) Make (with a well-written Makefile) is very good at avoiding unnecessary recompilation while you're developing.

So back to Django, which is written in Python, which is not a compiled language and shouldn't suffer from problems related to unnecessary and slow recompilation. Why would Make be helpful here? While Python is not compiled, a real world Django project has enough complexity in its day to day tasks that Make turns out to be incredibly useful.

A Django project requires that you've installed at least the Django python library.
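Make's freshness check is easy to model: a target needs rebuilding if it doesn't exist yet, or if any prerequisite has a newer modification time. A rough sketch in Python (the function name is mine, purely illustrative):

```python
import os


def needs_rebuild(target, prerequisites):
    """Mimic Make's decision: rebuild if the target file is missing
    or any prerequisite was modified more recently than it."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)
```

Make's real logic handles chains of these checks recursively, but each individual rule boils down to this comparison.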
If you're actually going to access a database, you'll also need a database driver library like psycopg2. Then Django has a whole ecosystem of pluggable applications that you can use for common functionality, each of which often requires a few other miscellaneous Python libraries. On many projects that I work on, the list of python libraries that get pulled in quickly runs up into the 50 to 100 range, and I don't believe that's uncommon.

I'm a believer in repeatable builds to keep deployment sane and reduce the "but it works on my machine" factor. So each Django project has a requirements.txt that specifies exact version numbers for every python library that the project uses. Running `pip install -r requirements.txt` should produce a very predictable environment (modulo entirely different OSes, etc.).

Working on multiple projects (between work and personal projects, I help maintain at least 50 different Django projects), it's not wise to have those libraries all installed globally. The Python community's standard solution is virtualenv, which does a nice job of letting you keep everything separate. But now you've got a bunch of virtualenvs that you need to manage. I have a separate little rant here about how the approach of "activating" virtualenvs in a particular shell environment (even via virtualenvwrapper or similar tools) is an antipattern and should be avoided. I'll spare you most of that, except to say that what I recommend instead is sticking with a standard location for your project's virtualenv and then just calling the python, pip, etc. commands in the virtualenv via their path. Eg, on my projects, I always do

```
$ virtualenv ve
```

at the top level of the project. Then I can just do:

```
$ ./ve/bin/python ...
```

and know that I'm using the correct virtualenv for the project without thinking about whether I've activated it yet in this terminal (I'm a person who tends to have lots and lots of terminals open at a given time and often runs commands from an ephemeral shell within emacs). For Django, where most of what you do is done via manage.py commands, I actually just change the shebang line in manage.py to `#!ve/bin/python` so I can always just run `./manage.py do_whatever`.

Let's get back to Make though. If my project has a virtualenv that was created by running something along the lines of:

```
$ virtualenv ve
$ ./ve/bin/pip install -r requirements.txt
```

and requirements.txt gets updated, the virtualenv needs to be updated as well. This is starting to get into Make's territory. Personally, I've encountered enough issues with pip messing up in-place updates on libraries that I prefer to just nuke the whole virtualenv from orbit and do a clean install anytime something changes. Let's add a rule to a Makefile that looks like this:

```make
ve/bin/python: requirements.txt
	rm -rf ve
	virtualenv ve
	./ve/bin/pip install -r requirements.txt
```

If that's the only rule in the Makefile (or just the first), typing

```
$ make
```

at the terminal will ensure that the virtualenv is all set up and ready to go. On a fresh checkout, ve/bin/python won't exist, so it will run the three commands, setting everything up. If it's run at any point after that, it will see that ve/bin/python is more recently updated than requirements.txt, and nothing needs to be done. If requirements.txt changes at some point, running make will trigger a wipe and reinstall of the virtualenv.

Already, that's actually getting useful. It's better when you consider that in a real project, the commands involved quickly get more complicated: specifying a custom --index-url, setting some things up so pip installs from wheels, and I even like to specify exact versions of virtualenv and setuptools so I don't have to think about what might happen on systems with different versions of those installed. The actual commands are complicated enough that I'm quite happy to have them written down in the Makefile so I only need to remember how to type `make`.

It all gets even better again when you realize that you can use ve/bin/python as a prerequisite for other rules.

Remember that if a target rule doesn't match a filename, Make will just always run the commands associated with it. Eg, on a Django project, to run a development server, I might run:

```
$ ./manage.py runserver 0.0.0.0:8001
```

Instead, I can add a stanza like this to my Makefile:

```make
runserver: ve/bin/python
	./manage.py runserver 0.0.0.0:8001
```

Then I can just type `make runserver` and it will run that command for me. Even better, since ve/bin/python is a prerequisite for runserver, if the virtualenv for the project hasn't been created yet (eg, if I just cloned the repo and forgot that I need to install libraries and stuff), it just does that automatically. And if I've done a git pull that updated my requirements.txt without noticing, it will automatically update my virtualenv for me. This sort of thing has been incredibly useful when working with designers who don't necessarily know the ins and outs of pip and virtualenv or want to pay close attention to the requirements.txt file. They just know they can run `make runserver` and it works (though sometimes it spends a few minutes downloading and installing stuff first).

I typically have a bunch more rules for common tasks set up in a similar fashion:

```make
check: ve/bin/python
	./manage.py check

migrate: check
	./manage.py migrate

flake8: ve/bin/python
	./ve/bin/flake8 project

test: check flake8
	./manage.py test
```

That demonstrates a bit of how rules chain together. If I run `make test`, check and flake8 are both prerequisites, so they each get run first. They, in turn, both depend on the virtualenv being created, so that will happen before anything else.

Perhaps you've noticed that there's also a little bug in the ve/bin/python stanza up above. ve/bin/python is created by the `virtualenv ve` step, but it's used as the target for the stanza. If the pip install step fails (because of a temporary issue with PyPI, or just a typo in requirements.txt, or something), the rule will still have "succeeded" in that ve/bin/python has a fresher timestamp than requirements.txt.
So the virtualenv won't really have the complete set of libraries installed, but subsequent runs of Make will consider everything fine (based on timestamp comparisons) and not do anything. Other rules that depend on the virtualenv being set up are going to have problems when they run. I get around that by introducing the concept of a sentinel file. So my stanza actually becomes something like:

```make
ve/sentinel: requirements.txt
	rm -rf ve
	virtualenv ve
	./ve/bin/pip install -r requirements.txt
	touch ve/sentinel
```

Ie, now there's a zero byte file named ve/sentinel that exists just to signal to Make that the rule completed successfully. If the pip install step fails for some reason, it never gets created, and Make won't try to keep going until that gets fixed.

My actual Makefile setup on real projects has grown more flexible and more complex, but if you've followed most of what's here, it should be relatively straightforward. In particular, I've taken to splitting functionality out into individual, reusable .mk files that are heavily parameterized with variables, which then just get `include`ed into the main Makefile where the project specific variables are set. Eg, here is a typical one. It sets a few variables specific to the project, then just does `include *.mk`. The actual Django related rules are in django.mk, which is a file that I use across all my django projects and should look similar to what's been covered here (just with a lot more rules and variables). Other .mk files in the project handle tasks for things like javascript (jshint and jscs checks, plus npm and webpack operations) or docker. They all take default values from config.mk and are set up so those can be overridden in the main Makefile or from the commandline. The rules in these files are a bit more sophisticated than the examples here, but not by much.
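The sentinel idea generalizes beyond Make: run a multi-step task, and record success only after every step has finished, so a timestamp check can be trusted later. A hypothetical Python helper sketching the same pattern (the function name is mine):

```python
import subprocess


def run_with_sentinel(commands, sentinel):
    """Run a list of shell commands; create the sentinel file only if
    every command succeeds, so later checks can trust its existence
    and timestamp."""
    for cmd in commands:
        subprocess.run(cmd, shell=True, check=True)  # raises on failure
    with open(sentinel, "w"):
        pass  # zero-byte file; its mtime records the successful run
```

If any command fails, `subprocess.run(..., check=True)` raises and the sentinel is never written, which is exactly the property the Make rule above relies on.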
[I should also point out here that I am by no means a Make expert, so you may not want to view my stuff as best practices; merely a glimpse of what's possible.]

I typically arrange things so the default rule in the Makefile runs the full suite of tests and linters. Running the tests after every change is the most common task for me, so having that available with a plain `make` command is nice. As an emacs user, it's even more convenient, since emacs' default setting for the compile command is to just run `make`. So it's always a quick keyboard shortcut away. (I use projectile, which is smart enough to run the compile command in the project root.)

Make was originally created in 1976, but I hope you can see that it remains relevant and useful forty years later.

## 2015 Music

It's the time of the year when everyone puts out their top albums of 2015 lists. 2015 has been a very good year for the kind of music that I like, so I thought about writing my own up, but decided that I am too lazy to pare it down to just 10 albums or so. Instead, here's a massive list of 60 albums (and 1 single) that came out in 2015 that I enjoyed. If you like dark, weird music (atmospheric black metal, doom, sludge, noise, and similar), perhaps you might too. I've only included albums that are up on Bandcamp (they're generous with previews and you can buy the album if you like it). There are way too many for me to write about each (and really, what's the point when you can listen to each of them for yourselves?). I also make no attempt at ranking them, so they are presented in alphabetical order by artist name.

TAGS: music

## A Curated List of London Tube Station Names That Amuse Me

- Barking
- Canada Water
- Chigwell
- Chorleywood
- Cockfosters
- East Ham
- Elephant & Castle
- Goodge Street
- Ickenham
- Upminster
- Uxbridge
- West Ham

Plus Honorable Mention to the Overground station "Wapping".
## Continuously Deploying Django with Docker

I run about a dozen personal Django applications (including this site) on some small servers that I admin. I also run a half dozen or so applications written in other languages and other frameworks. Since it's a heterogeneous setup and I have a limited amount of free time for side projects, container technology like Docker that lets me standardize my production deployment is quite appealing. I run a continuous deployment pipeline for all of these applications, so every git commit I make to master goes through a test pipeline and ends up deployed to the production servers (assuming all the tests pass). Getting Django to work smoothly in a setup like this is non-trivial. This post attempts to explain how I have it all working.

### Background

First, some background on my setup. I run Ubuntu 14.04 servers on Digital Ocean. Ubuntu 14.04 still uses upstart as the default init, so that's what I use to manage my application processes. I back the apps with Postgres and I run an Nginx proxy in front of them. I serve static assets via S3 and Cloudfront. I also use Salt for config management and provisioning, so if some of the config files here look a bit tedious or tricky to maintain and keep in sync, keep in mind that I'm probably actually using Salt to template and automate them. I also have a fairly extensive monitoring setup that I won't go into here, but it will generally let me know as soon as anything goes wrong.

I currently have three "application" servers where the django applications themselves run. Typically I run each application on two servers, which Nginx load balances between. A few of the applications also use Celery for background jobs and Celery Beat for periodic tasks. For those, the celery and celery beat processes run on the third application server.
My goal for my setup was to be able to deploy new versions of my Django apps automatically and safely just by doing `git push origin master` (which typically pushes to a github repo). That means that the code needs to be tested, a new Docker image needs to be built and distributed to the application servers, database migrations run, static assets compiled and pushed to S3, and the new version of the application started in place of the old. Preferably without any downtime for the users.

I'll walk through the setup for my web-based feedreader app, antisocial, since it is one of the ones with Celery processes. Other apps are all basically the same, except they might not need the Celery parts. I should also point out that I am perpetually tweaking stuff. This is what I'm doing at the moment, but it will probably be outdated soon after I publish this as I find other things to improve.

### Dockerfile

Let's start with the Dockerfile:

```dockerfile
FROM ccnmtl/django.base
ADD wheelhouse /wheelhouse
RUN apt-get update && apt-get install -y libxml2-dev libxslt-dev
RUN /ve/bin/pip install --no-index -f /wheelhouse -r /wheelhouse/requirements.txt
WORKDIR /app
COPY . /app/
RUN /ve/bin/python manage.py test
EXPOSE 8000
ADD docker-run.sh /run.sh
ENV APP antisocial
ENTRYPOINT ["/run.sh"]
CMD ["run"]
```

Like most people, I started using Docker by doing `FROM ubuntu:trusty` or something similar at the beginning of all my Dockerfiles. That's not really ideal though, and results in large docker images that are slow to work with, so I've been trying to get my docker images as slim and minimal as possible lately. Roughly following Glyph's approach, I split the docker image build process into a base image and a "builder" image, so the final image can be constructed without the whole compiler toolchain included. The base and builder images I have published as ccnmtl/django.base and ccnmtl/django.build respectively, and you can see exactly how they are made here.
Essentially, they are both built on top of Debian Jessie (quite a bit smaller than Ubuntu images and similar enough). The base image contains the bare minimum, while the build image contains a whole toolchain for building wheels out of python libraries.

I have a Makefile with some bits like this:

```make
ROOT_DIR := $(shell dirname $(realpath $(lastword $(MAKEFILE_LIST))))
APP = antisocial
REPO = thraxil
WHEELHOUSE = wheelhouse

$(WHEELHOUSE)/requirements.txt: $(REQUIREMENTS)
	mkdir -p $(WHEELHOUSE)
	docker run --rm \
		-v $(ROOT_DIR):/app \
		-v $(ROOT_DIR)/$(WHEELHOUSE):/wheelhouse \
		ccnmtl/django.build
	cp $(REQUIREMENTS) $(WHEELHOUSE)/requirements.txt
	touch $(WHEELHOUSE)/requirements.txt
```

```make
build: $(WHEELHOUSE)/requirements.txt
	docker build -t $(IMAGE) .
```

So when I do make build, if the requirements.txt has changed since the last time, it uses the build image to generate a directory with wheels for every library specified in requirements.txt, then runs docker build, which can do a very simple (and fast) pip install of those wheels.

Once the requirements are installed, it runs the application's unit tests. I expose port 8000 and copy in a custom script to use as an entry point.

### docker-run.sh

That script makes the container a bit easier to work with. It looks like this:

```bash
#!/bin/bash

cd /app/

if [[ "$SETTINGS" ]]; then
    export DJANGO_SETTINGS_MODULE="$APP.$SETTINGS"
else
    export DJANGO_SETTINGS_MODULE="$APP.settings_docker"
fi

if [ "$1" == "migrate" ]; then
    exec /ve/bin/python manage.py migrate --noinput
fi

if [ "$1" == "collectstatic" ]; then
    exec /ve/bin/python manage.py collectstatic --noinput
fi

if [ "$1" == "compress" ]; then
    exec /ve/bin/python manage.py compress
fi

if [ "$1" == "shell" ]; then
    exec /ve/bin/python manage.py shell
fi

if [ "$1" == "worker" ]; then
    exec /ve/bin/python manage.py celery worker
fi

if [ "$1" == "beat" ]; then
    exec /ve/bin/python manage.py celery beat
fi

# run arbitrary commands
if [ "$1" == "manage" ]; then
    shift
    exec /ve/bin/python manage.py "$@"
fi

if [ "$1" == "run" ]; then
    exec /ve/bin/gunicorn --env \
        DJANGO_SETTINGS_MODULE=$DJANGO_SETTINGS_MODULE \
        $APP.wsgi:application -b 0.0.0.0:8000 -w 3 \
        --access-logfile=- --error-logfile=-
fi
```

With the ENTRYPOINT and CMD set up that way in the Dockerfile, I can just run

```
$ docker run thraxil/antisocial
```

and it will run the gunicorn process, serving the app on port 8000. Or, I can do:

```
$ docker run thraxil/antisocial migrate
```

and it will run the database migration task. Similar for collectstatic, compress, celery, etc. Or, I can do:

```
$ docker run thraxil/antisocial manage some_other_command --with-flags
```

to run any other Django manage.py command (this is really handy for dealing with migrations that need to be faked out, etc.)
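The entrypoint script's behavior is essentially a lookup from the first container argument to a manage.py invocation, with a pass-through for arbitrary commands. A compact model in Python (the `COMMANDS` table and `resolve` function are mine, purely illustrative, and only cover the manage.py cases):

```python
# Map the container's first argument to the manage.py command
# that the entrypoint script would exec.
COMMANDS = {
    "migrate": ["manage.py", "migrate", "--noinput"],
    "collectstatic": ["manage.py", "collectstatic", "--noinput"],
    "compress": ["manage.py", "compress"],
    "shell": ["manage.py", "shell"],
}


def resolve(argv):
    """Return the manage.py command for the given container arguments,
    or None if the argument isn't a recognized manage.py task."""
    if argv and argv[0] == "manage":
        # "manage" passes the remaining arguments straight through
        return ["manage.py"] + argv[1:]
    if argv and argv[0] in COMMANDS:
        return COMMANDS[argv[0]]
    return None
```

The real script handles the gunicorn `run` case and the celery processes too, but the shape is the same: one well-known name per task, plus an escape hatch for everything else.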

### docker-runner

Of course, all of those exact commands would run into problems with needing various environment variables passed in, etc.

The settings_docker settings module that the script defaults to for the container is a fairly standard Django settings file, except that it pulls all the required settings out of environment variables. The bulk of it comes from a common library that you can see here.

This gives us a nice twelve-factor style setup and keeps the docker containers very generic and reusable. If someone else wants to run one of these applications, they can pretty easily just run the same container and just give it their own environment variables.
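A settings module in that style mostly just reads the environment. A rough sketch of the idea (the function and variable names here are generic illustrations, not necessarily what his shared settings library uses):

```python
def settings_from_env(env):
    """Build a Django-style settings dict from environment variables.
    Required settings fail loudly if missing; optional ones get defaults."""
    return {
        "SECRET_KEY": env["SECRET_KEY"],  # required: KeyError if absent
        "ALLOWED_HOSTS": env.get("ALLOWED_HOSTS", "localhost").split(","),
        "DB_NAME": env.get("DB_NAME", "app"),
        "DB_HOST": env.get("DB_HOST", "localhost"),
        "DB_PORT": int(env.get("DB_PORT", "5432")),
    }
```

In a real `settings_docker.py` you'd read `os.environ` directly at module level; failing fast on a missing required variable means a misconfigured container dies at startup instead of limping along.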

The downside though is that it gets a bit painful to actually run things from the commandline, particularly for one-off tasks like database migrations because you actually need to specify a dozen or so -e flags on every command.

I cooked up a little bit of shell script with a dash of convention over configuration to ease that pain.

All the servers get a simple docker-runner script that looks like:

```bash
#!/bin/bash

APP=$1
shift

IMAGE=
OPTS=

if [ -f /etc/default/$APP ]; then
    . /etc/default/$APP
fi

TAG="latest"
if [ -f /var/www/$APP/TAG ]; then
    . /var/www/$APP/TAG
fi

exec /usr/bin/docker run $OPTS $EXTRA $IMAGE:$TAG $@
```

That expects that every app has a file in /etc/default that defines $IMAGE and $OPTS variables. Eg, antisocial's looks something like:

/etc/default/antisocial:

```bash
export IMAGE="thraxil/antisocial"
export OPTS="--rm \
    -e SECRET_KEY=some_secret_key \
    -e AWS_S3_CUSTOM_DOMAIN=d115djs1mf98us.cloudfront.net \
    -e AWS_STORAGE_BUCKET_NAME=s3-bucket-name \
    -e AWS_ACCESS_KEY=... \
    -e AWS_SECRET_KEY=... \
    ... more settings ... \
    -e ALLOWED_HOSTS=.thraxil.org \
    -e BROKER_URL=amqp://user:pass@host:5672//antisocial"
```

With that in place, I can just do:

```
$ docker-runner antisocial migrate
```

And it fills everything in. So I can keep the common options in one place and not have to type them in every time. (I'll get to the TAG file that it mentions in a bit.)

### upstart

With those in place, the upstart config for the application can be fairly simple:

/etc/init/antisocial.conf:

```
description "start/stop antisocial docker"

start on filesystem and started docker-postfix and started registrator
stop on runlevel [!2345]

respawn

script
  export EXTRA="-e SERVICE_NAME=antisocial -p 192.81.1.1::8000"
  exec /usr/local/bin/docker-runner antisocial
end script
```

The Celery and Celery Beat services have very similar ones, except they run celery and beat tasks instead, and they don't need to have a SERVICE_NAME set or ports configured.

### Consul

Next, I use consul, consul-template, and registrator to rig everything up so Nginx automatically proxies to the appropriate ports on the appropriate application servers. Each app is registered as a service (hence the SERVICE_NAME parameter in the upstart config). Registrator sees the containers starting and stopping, and registers and deregisters them with consul as appropriate, inspecting them to get the IP and port info. consul-template runs on the Nginx server and has a template defined for each app that looks something like:

```nginx
{{if service "antisocial"}}
upstream antisocial {
    {{range service "antisocial"}}
    server {{.Address}}:{{.Port}};
    {{end}}
}
{{end}}

server {
    listen 80;
    server_name feeds.thraxil.org;
    client_max_body_size 40M;
    {{if service "antisocial"}}
    location / {
        proxy_pass http://antisocial;
        proxy_next_upstream error timeout invalid_header http_500;
        proxy_connect_timeout 2;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
    error_page 502 /502.html;
    location = /502.html {
        root /var/www/down/;
    }
    {{else}}
    root /var/www/down/;
    try_files $uri /502.html;
    {{end}}
}
```

That just dynamically creates an endpoint for each running app instance, pointing to the right IP and port. Nginx then round-robins between them. If none are running, it changes it out to serve a "sorry, the site is down" kind of page instead. Consul-template updates the nginx config and reloads nginx as soon as any changes are seen to the service. It's really nice. If I need more instances of a particular app running, I can just spin one up on another server and it instantly gets added to the pool. If one crashes or is shut down, it's removed just as quickly. As long as there's at least one instance running at any given time, visitors to the site should never be any the wiser (as long as it can handle the current traffic).
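What the consul-template + nginx combination provides can be modeled as a round-robin pool whose membership changes as consul reports services coming and going. A minimal sketch (the class and method names are mine):

```python
import itertools


class RoundRobinPool:
    """Minimal model of the dynamic upstream: a pool of live backends,
    cycled through per request, rebuilt whenever membership changes."""

    def __init__(self):
        self.backends = []
        self._cycle = None

    def set_backends(self, backends):
        # Called whenever the service registry reports a change.
        self.backends = list(backends)
        self._cycle = itertools.cycle(self.backends) if self.backends else None

    def pick(self):
        # None means no healthy backends: serve the "site is down" page.
        if self._cycle is None:
            return None
        return next(self._cycle)
```

The real work, of course, is that registrator and consul keep the membership list accurate without any manual config edits.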

That really covers the server and application setup.

What's left is the deployment part. Ie, how it gets from a new commit on master to running on the application servers.

### Jenkins

Jenkins is kind of a no-brainer for CI/CD stuff. I could probably rig something similar up with TravisCI or Wercker or another hosted CI, but I'm more comfortable keeping my credentials on my own servers for now.

So I have a Jenkins server running and I have a job set up there for each application. It gets triggered by a webhook from github whenever there's a commit to master.

Jenkins checks out the code and runs:

```
export TAG=build-$BUILD_NUMBER
make build
docker push thraxil/antisocial:$TAG
```

$BUILD_NUMBER is a built-in environment variable that Jenkins sets on each build. So it's just building a new docker image (which runs the test suite as part of the build process) and pushing it to the Docker Hub with a unique tag corresponding to this build.

When that completes successfully, it triggers a downstream Jenkins job called django-deploy, which is a parameterized build. It passes it the following parameters:

```
APP=antisocial
TAG=build-$BUILD_NUMBER
HOSTS=appserver1 appserver2
CELERY_HOSTS=appserver3
BEAT_HOSTS=appserver3
```

These are fairly simple apps that I run mostly for my own amusement so I don't have extensive integration tests. If I did, instead of triggering django-deploy directly here, I would trigger other jobs to run those tests against the newly created and tagged image first.

The django-deploy job runs the following script:

```bash
#!/bin/bash

hosts=(${HOSTS})
chosts=(${CELERY_HOSTS})
bhosts=(${BEAT_HOSTS})

for h in "${hosts[@]}"
do
    ssh $h docker pull ${REPOSITORY}thraxil/$APP:$TAG
    ssh $h cp /var/www/$APP/TAG /var/www/$APP/REVERT || true
    ssh $h "echo export TAG=$TAG > /var/www/$APP/TAG"
done

for h in "${chosts[@]}"
do
    ssh $h docker pull ${REPOSITORY}thraxil/$APP:$TAG
    ssh $h cp /var/www/$APP/TAG /var/www/$APP/REVERT || true
    ssh $h "echo export TAG=$TAG > /var/www/$APP/TAG"
done

for h in "${bhosts[@]}"
do
    ssh $h docker pull ${REPOSITORY}thraxil/$APP:$TAG
    ssh $h cp /var/www/$APP/TAG /var/www/$APP/REVERT || true
    ssh $h "echo export TAG=$TAG > /var/www/$APP/TAG"
done

h=${hosts[0]}
ssh $h /usr/local/bin/docker-runner $APP migrate
ssh $h /usr/local/bin/docker-runner $APP collectstatic
ssh $h /usr/local/bin/docker-runner $APP compress

for h in "${hosts[@]}"
do
    ssh $h sudo stop $APP || true
    ssh $h sudo start $APP
done

for h in "${chosts[@]}"
do
    ssh $h sudo stop $APP-worker || true
    ssh $h sudo start $APP-worker
done

for h in "${bhosts[@]}"
do
    ssh $h sudo stop $APP-beat || true
    ssh $h sudo start $APP-beat
done
```

It's a bit long, but straightforward.

First, it just pulls the new docker image down onto each server. This is done first because the docker pull is usually the slowest part of the process; might as well get it out of the way early. On each host, it also writes to the /var/www/$APP/TAG file that we saw mentioned back in docker-runner. The contents are just a variable assignment specifying the tag that we just built and pulled and are about to cut over to. The docker-runner script knows to use the specific tag that's set in that file. Of course, it first backs up the old one to a REVERT file that can then be used to easily roll back the whole deploy if something goes wrong.

Next, the database migrations and static asset tasks have to run. They only need to run on a single host though, so it just pulls the first one off the list and runs migrate, collectstatic, and compress on that one.

Finally, it goes host by host and stops and starts the service on each in turn. Remember that with the whole consul setup, as long as they aren't all shut off at the same time, overall availability should be preserved. Then, of course, it does the same thing for the celery and celery beat services.

If that all completes successfully, it's done. If it fails somewhere along the way, another Jenkins job is triggered that basically restores the TAG file from REVERT and restarts the services, putting everything back to the previous version.

### Conclusion and Future Directions

That's a lot to digest. Sorry. In practice, it really doesn't feel that complicated. Mostly stuff just works and I don't have to think about it. I write code, commit, and git push. A few minutes later I get an email from Jenkins telling me that it's been deployed. Occasionally, Jenkins tells me that I broke something, and I go investigate and fix it (while the site stays up). If I need more capacity, I provision a new server and it joins the consul cluster. Then I can add it to the list to deploy to, kick off a Jenkins job, and it's running.

I've spent almost as much time writing this blog post explaining everything in detail as it took to actually build the system. Provisioning servers is fast and easy because they barely need anything installed on them besides docker and a couple config files and scripts. If a machine crashes, the rest are unaffected and service is uninterrupted. Overall, I'm pretty happy with this setup. It's better than the statically managed approach I had before (no more manually editing nginx configs and hoping I don't introduce a syntax error that takes all the sites down until I fix it). Nevertheless, what I've put together is basically a low rent, probably buggy version of a PaaS. I knew this going in. I did it anyway because I wanted to get a handle on all of this myself. (I'm also weird and enjoy this kind of thing.) Now that I feel like I really understand the challenges here, when I get time, I'll probably convert it all to run on Kubernetes or Mesos or something similar.

## Docker and Upstart

Docker has some basic process management functionality built in. You can set restart policies and the Docker daemon will do its best to keep containers running and restart them if the host is rebooted. This is handy and can work well if you live in an all Docker world. Many of us need to work with Docker based services alongside regular non-Docker services on the same host, at least for the near future. Our non-Docker services are probably managed with Systemd, Upstart, or something similar, and we'd like to be able to use those process managers with our Docker services so dependencies can be properly resolved, etc.

I haven't used Systemd enough to have an opinion on it (according to the internet, it's either the greatest thing since sliced bread or the arrival of the antichrist, depending on who you ask). Most of the machines I work on are still running Ubuntu 14.04, and Upstart is the path of least resistance there and the tool that I know the best.
Getting Docker and Upstart to play nicely together is not quite as simple as it appears at first. Docker's documentation contains a sample upstart config:

```
description "Redis container"
author "Me"
start on filesystem and started docker
stop on runlevel [!2345]
respawn
script
  /usr/bin/docker start -a redis_server
end script
```

That works, but it assumes that the container named redis_server already exists. Ie, that someone has manually, or via some mechanism outside upstart, run the `docker run --name=redis_server ...` command (or a `docker create`), specifying all the parameters. If you need to change one of those parameters, you would need to stop the upstart job, do a `docker stop redis_server`, delete the container with `docker rm redis_server`, run a `docker create --name=redis_server ...` to make the new container with the updated parameters, then start the upstart job. That's a lot of steps and would be no fun to automate via your configuration management or as part of a deployment.

What I expect to be able to do with upstart is deploy the necessary dependencies and configuration to the host, drop an upstart config file in /etc/init/myservice.conf, and do `start myservice`, `stop myservice`, etc. I expect to be able to drop in a new config file and just restart the service to have it go into effect. Letting Docker get involved seems to introduce a bunch of additional steps to that process that just get in the way. Really, to get Docker and Upstart to work together properly, it's easier to just let upstart do everything and configure Docker to not try to do any process management.

First, make sure you aren't starting the Docker daemon with `--restart=always`. The default is `--restart=no` and is what we want.
Next, instead of building the container and then using `docker start` and `docker stop` even via upstart, we instead want to just use `docker run` so parameters can be specified right there in the upstart config (I'm going to leave out the description/author stuff from here on out):

```
start on filesystem and started docker
stop on runlevel [!2345]
respawn
script
  /usr/bin/docker run \
    -v /docker/host/dir:/data \
    redis
end script
```

This will work nicely. You can stop and start via upstart like a regular system service. Of course, we would probably like other services to be able to link to it and for that it will need to be named:

```
start on filesystem and started docker
stop on runlevel [!2345]
respawn
script
  /usr/bin/docker run \
    --name=redis_server \
    -v /docker/host/dir:/data \
    redis
end script
```

That will work. Once. Then we run into the issue that anyone who's used Docker and tried to run named containers has undoubtedly come across. If we stop that and try to start it again, it will fail and the logs will be full of complaints about:

```
Error response from daemon: Conflict. The name "redis_server" is already
in use by container 9ccc57bfbc3c. You have to delete (or rename) that
container to be able to reuse that name.
```

You have to do the whole dance of removing the container and restarting stuff. So we put a `--rm` in there:

```
start on filesystem and started docker
stop on runlevel [!2345]
respawn
script
  /usr/bin/docker run \
    --rm \
    --name=redis_server \
    -v /docker/host/dir:/data \
    redis
end script
```

This is much better and will mostly work. Sometimes, though, the container will get killed without a proper SIGTERM signal getting through and Docker won't clean up the container. Eg, it gets OOM-killed, or the server is abruptly power-cycled (it seems like sometimes even a normal stop just doesn't work right). The old container is left there and the next time it tries to start, you run into the same old conflict and have to manually clean it up.
There are numerous Stack Overflow questions and similar out there with suggestions for pre-stop stanzas, etc. to deal with this problem. However, in my experimentation, they all failed to reliably work across some of those trickier situations like OOM-killer rampages and power cycling. My current solution is simple and has worked well for me. I just add a couple more lines to the script section like so:

```
script
  /usr/bin/docker stop redis_server || true
  /usr/bin/docker rm redis_server || true
  /usr/bin/docker run \
    --rm \
    --name=redis_server \
    -v /docker/host/dir:/data \
    redis
end script
```

In the spirit of the World's Funniest Joke, before trying to revive the container, we first make sure it's dead. The `|| true` on each of those lines just ensures that it will keep going even if it didn't actually have to stop or remove anything. (I suppose that the `--rm` is now superfluous, but it doesn't hurt). So this is how I run things now. I tend to have two levels of Docker containers: these lower level named services that get linked to other containers, and higher level "application" containers (eg, Django apps), that don't need to be named, but probably link in one of these named containers. This setup is easy to manage (my CM setup can push out new configs and restart services with wild abandon) and robust.

## Circuit Breaker TCP Proxy

When you're building systems that combine other systems, one of the big lessons you learn is that failures can cause chain reactions. If you're not careful, one small piece of the overall system failing can cause catastrophic global failure. Even worse, one of the main mechanisms for these chain reactions is a well-intentioned attempt by one system to cope with the failure of another. Imagine Systems A and B, where A relies on B. System A needs to handle 100 requests per second, which come in at random, but normal intervals. Each of those implies a request to System B. So an average of one request every 10ms for each of them.
System A may decide that if a request to System B fails, it will just repeatedly retry the request until it succeeds. This might work fine if System B has a very brief outage. But then it goes down for longer and needs to be completely restarted, or just becomes completely unresponsive until someone steps in and manually fixes it. Let's say that it's down for a full second. When it finally comes back up, it now has to deal with 100 requests all coming in at the same time, since System A has been piling them up. That usually doesn't end well, making System B run extremely slowly or even crash again, starting the whole cycle over and over. Meanwhile, since System A is having to wait unusually long for a successful response from B, it's chewing up resources as well and is more likely to crash or start swapping. And of course, anything that relies on A is then affected and the failure propagates.

Going all the way the other way, with A never retrying and instead immediately noting the failure and passing it back, is a little better, but still not ideal. It's still hitting B every time while B is having trouble, which probably isn't helping B's situation. Somewhere back up the chain, a component that calls System A might be retrying, and effectively hammers B the same as in the previous scenario, just using A as the proxy.

A common pattern for dealing effectively with this kind of problem is the "circuit breaker". I first read about it in Michael Nygard's book Release It! and Martin Fowler has a nice in-depth description of it. As the name of the pattern should make clear, a circuit breaker is designed to detect a fault and immediately halt all traffic to the faulty component for a period of time, to give it a little breathing room to recover and prevent cascading failures. Typically, you set a threshold of errors and, when that threshold is crossed, the circuit breaker "trips" and no more requests are sent for a short period of time.
Any clients get an immediate error response from the circuit breaker, but the ailing component is left alone. After that short period of time, it tentatively allows requests again, but will trip again if it sees any failures, this time blocking requests for a longer period. Ideally, these periods of blocked requests will follow an exponential backoff, to minimize downtime while still easing the load as much as possible on the failed component.

Implementing a circuit breaker around whatever external service you're using usually isn't terribly hard, and the programming language you are using probably already has a library or two available to help. Eg, in Go, there's this nice one from Scott Barron at Github and this one inspired by Hystrix from Netflix, which includes circuit breaker functionality. Recently, Heroku released a nice Ruby circuit breaker implementation.

That's great if you're writing everything yourself, but sometimes you have to deal with components that you either don't have the source code to or just can't really modify easily. To help with those situations, I wrote cbp, which is a basic TCP proxy with circuit breaker functionality built in. You tell it what port to listen on, where to proxy to, and what threshold to trip at. Eg:

```
$ cbp -l localhost:8000 -r api.example.com:80 -t .05
```

sets up a proxy to port 80 of api.example.com on the local port 8000. You can then configure a component to make API requests to http://localhost:8000/. If more than 5% of those requests fail, the circuit breaker trips and it stops proxying for 1 second, 2 seconds, 4 seconds, etc. until the remote service recovers.

This is a raw TCP proxy, so even HTTPS requests or any other kind of TCP network traffic can be sent through it. The downside is that it can only detect TCP-level errors and remains ignorant of whatever protocol is on top. So if you proxy to an HTTP/HTTPS service and that service responds with a 500 or 502 error, cbp doesn't know that that response is considered an error; it sees a perfectly valid TCP exchange go by and assumes everything is fine. It would only trip if the remote service failed to establish the connection at all. (I may make an HTTP specific version later or add optional support for HTTP and/or other common protocols, but for now it's plain TCP).

## Dispatch van Utrecht 2: The Office

2014's almost over, I've been living in Utrecht for about five months now (minus my travelling all over the place), time for a small update.

It took until well into September for our shipping container of belongings to make it across the Atlantic and get delivered. Up to that point, my office setup looked like this:


Just my laptop on a basic IKEA desk. The rest of the room empty and barren.

Now it looks like this:

Same desk, but I've got my nice Humanscale Freedom chair, my monitor mounts, and a couple rugs.

The other half of the room is now my art and music studio:

Phoenix got me an easel for Christmas (as a not so subtle hint that I should do more larger pieces), and I set up another basic IKEA table for my guitar and recording gear. Clearly I still need to work out some cable management issues, but otherwise, it's quite functional.

Finally, behind my left shoulder when I'm at my desk is a little fireplace type thing that is currently serving as the closest thing I have to shelves in the office, so it's got some trinkets and my collection of essential programming books:

TAGS: utrecht

## Dispatch van Utrecht 1: Settling In

Earlier this year, my girlfriend, Phoenix was offered a tenure track teaching position at HKU, a respected university in the Netherlands. It was too good of an offer for her to pass up, despite having a pretty well established life in NYC.

My job at CCNMTL involves many aspects, but the bulk of it is programming and sysadmin work that I can ultimately do anywhere I have access to a computer and a good internet connection. I've been living in NYC since 1999 and as much as I love it, I was ready for a bit of a change. So I worked out a telecommuting arrangement, packed up my life, and headed across the Atlantic with her.

The last few months have been a whirlwind of paperwork, packing, and travelling. The process of moving internationally, I wouldn't wish on my worst enemy. It's been expensive, stressful, and exhausting. Frankly, we're both impressed that our relationship has remained intact through it (just don't put us in a rental car with me driving and Phoenix navigating, and everything will be fine). I may write more later about the trials and tribulations of international Apostille, international health insurance plans, bank transfers, Dutch immigration bureaucracy, and moving a cat to Europe, but take my word for it that all of those things should be avoided if you can manage it.

At the beginning of August, we finished clearing out of New York and landed in Schiphol airport with a couple suitcases and a cat. All of our belongings had been either given to friends, sent to family for storage, or packed into a shipping container that was going to be taking the slow boat across the Atlantic (it's still not here yet; hopefully early September). We had an air mattress and an apartment in Utrecht that we'd signed a lease on and transferred a significant amount of money over for sight unseen.

That moment pretty much compressed the stress and uncertainty of the whole process into one laser-focused point. We'd spent months dealing with setback after setback on every front of the move. We'd both lived in New York long enough to see just about every kind of scam and shadiness from landlords and rental agencies. Yet here we were, in a new country with barely more than the clothes on our backs, our finances nearly depleted, and difficult to access internationally, trusting that we'd have somewhere to live, based on some email conversations and an online listing with a couple small pictures.

Setting foot in our new apartment for the first time, I think we each would've cried if we weren't so exhausted and shocked.

Coming from cramped, dingy, loud NYC apartments, we felt like we'd just won the lottery. The pictures had not even done the place justice. Everything was perfect. It was huge. Two floors, so we could live in the bottom floor and keep offices upstairs. High ceilings, big windows, stained glass. Everything newly renovated, clean, and high quality. A small, closed in back yard with a shed (containing an electric lawn mower that looks like it's made by Fisher-Price, but gets the job done), a balcony. All located in a little neighborhood that was both quiet (I think entire days go by without a car driving down our street) but only a couple minutes bike ride from the city center and train station (which is then a 30 minute ride to Amsterdam Centraal). The landlord had even set up internet for us before we got there and left us fresh flowers on the counter, and wine, beer, and coffee in the fridge.

We spent the first few days taking care of basic necessities. We bought bicycles (cycling is the primary means of transportation here), explored our neighborhood and the downtown, located grocery stores and cafes, and made the pilgrimage to Ikea. Largely, for the first week before the new furniture we bought was delivered, we'd sort of wander around the empty apartment in a daze, not quite believing we had this much space.

I'll close this entry with a shot of the cat and I sitting in our new living room (the smallest room in the apartment) watching a dove that stopped by our back yard for a visit:

TAGS: utrecht

## Apple's SSL/TLS bug and programming language design

A big story in the tech world recently has been Apple's bug deep in the SSL/TLS stack. There's a good writeup of the code on ImperialViolet that tracks down the exact point in the code. It boils down to these couple lines of C code:
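The lines in question, from Apple's published sslKeyExchange.c (trimmed to the relevant part):

```c
if ((err = SSLHashSHA1.update(&hashCtx, &serverRandom)) != 0)
    goto fail;
if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)
    goto fail;
    goto fail;
if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)
    goto fail;
```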

The crux of the problem is that the indentation of the second goto fail; makes it appear to be part of the same clause as the first one. The double gotos look weird but harmless at a glance. But a closer look reveals that there are no braces after the if statement. The indentation of the second goto is a lie, so it actually parses like so:
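That is, the compiler sees the equivalent of:

```c
if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)
    goto fail;
goto fail;   /* always executed, with err still 0 */
if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)
    goto fail;
```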

The second goto fail; isn't in a conditional at all and so a goto will always be executed. Clearly not what the original programmer(s) intended.

Apple has the resources to hire top-notch programmers and focus intently on quality in their work. Being security-related code, this was probably subject to more scrutiny and code review than most other code. Still, this simple but brutal mistake slipped through and a lot of systems are vulnerable as a result. The point here isn't that the programmers responsible for this code were bad; it's that even the very best of us are human and will still miss things.

These kinds of mistakes fascinate me though and I like to think about how they can be prevented.

Designers of some more "modern" languages have solved this particular problem. Python, which catches its share of flak for having significant whitespace, eliminates this type of problem. In fact, preventing this kind of disconnect between the intent of the code and its appearance on screen is a large part of why Python has significant whitespace. Imagine if that block of code had been written in Python (of course Python doesn't have gotos, etc, etc, but you catch my drift):
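A sketch of what I mean, using a stand-in goto_fail() function since Python has no goto:

```python
calls = []

def goto_fail():
    # stand-in for C's "goto fail;" -- Python has no goto
    calls.append("fail")

err = 1

# "A": both statements are unambiguously inside the conditional
if err != 0:
    goto_fail()
    goto_fail()

# "B": the second statement is unambiguously outside it
if err != 0:
    goto_fail()
goto_fail()
```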

"A" and "B" there are the two possible ways it could be written. Both are unambiguous; they each work exactly like they look like they should work. It's much harder for Python code to develop those subtle traps. This isn't to say Python doesn't have its own share of blind spots. Arguably, Python's dynamic typing opens up a whole world of potential problems, but at the very least, it avoids this one.

Go, another language I spend a lot of time with, which is heavily C-influenced, also manages to sidestep this particular trap. In Go, it's simply required to use braces after an if statement (or for the body of a for loop). So a Go translation of the code would have to look like one of the following:

Again, there's really no way to write the code in a deceptive manner. You could indent it however you want, but the required closing brace would stick out and make a problem obvious long before it went out to half the computers and mobile devices on the planet.

Go even goes a step further by including go fmt, a code formatting tool that normalizes indentation (and does many other things) and the Go community has embraced it whole-heartedly.

A code formatting tool run over the original C code would've made the problem immediately obvious, but there doesn't seem to be a strong push in that community for promoting that kind of tool.

I try to always remember that programming complex systems and maintaining them across different environments and through changing requirements is one of the most difficult intellectual tasks that we can take on. Doing it without making mistakes is nearly impossible for mere humans. We should lean on the tools available to us for avoiding traps and automating quality wherever and whenever we can. Language features, compiler warnings, code formatters, linters, static analysis, refactoring tools, unit tests, and continuous integration are all there to help us. There may be no silver bullet, but we've got entire armories full of good old fashioned lead bullets that we might as well put to use.

## Erlang in Erlang

The FAQ for the Erlang Programming language says:

> Erlang is named after the mathematician Agner Erlang. Among other things, he worked on queueing theory. The original implementors also tend to avoid denying that Erlang sounds a little like "ERicsson LANGuage".

My guess though is that most programmers writing Erlang code don't really know much about who Agner Erlang was or why his work inspired programmers at Ericsson nearly a century later to name a programming language after him. I wasn't there, so obviously I can't speak to exactly what they were thinking when naming the programming language, but I think I can give a bit of background on Agner Erlang's contributions and the rest should be fairly obvious. In the process of explaining a bit about his work, I simply can't resist implementing some of his equations in Erlang.

Agner Erlang was a mathematician who worked for a telephone company. At the time, a very practical problem telephone companies faced was deciding how many circuits had to be deployed to support a given number of customers. Don't build enough and you get busy signals and angry customers. Build too many and you've wasted money on hardware that sits idle.

This is a pretty fundamental problem of capacity planning and has clear analogies in nearly every industry. Agner Erlang laid many of the fundamentals for modeling demand in terms of probabilities and developed basic equations for calculating quality of service given a known demand and capacity.

Erlang's first equation simply defines a unit of "offered traffic" in terms of an arrival rate and average holding time. This unit was named after him.

The equation's pretty rough. Brace yourself:

$$E = \lambda h$$

λ is the call arrival rate and h is the call holding time. They need to be expressed in compatible units. So if you have ten calls per minute coming in (λ) and each call lasts an average of 1 minute (h), you have 10 Erlangs of traffic that your system will need to handle. This is pretty fundamental stuff that would later develop into Little's Law, which I think of as the $$F = ma$$ of Queuing Theory.

Implementing this equation in Erlang is trivial:
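A one-liner suffices (the function name is my own; a sketch, not production code):

```erlang
%% Erlang-A: offered traffic (in Erlangs) = arrival rate * mean holding time
erlang_a(Lambda, H) ->
    Lambda * H.
```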

I call it "Erlang-A" even though I've never seen anyone else call it that. His two more well-known, substantial equations are called "Erlang-B" and "Erlang-C" respectively, so you see where I'm coming from.

Let's look at those two now, starting with Erlang-B.

Erlang-B tells you, given a traffic level (expressed in Erlangs) and a given number of circuits, what the probability of any incoming call has of getting a busy signal because all of the circuits are in use.

This one turns out to be non-trivial. $$E$$ is the traffic, $$m$$ is the number of circuits:

$$P_b = B(E,m) = \frac{\frac{E^m}{m!}} { \sum_{i=0}^m \frac{E^i}{i!}}$$

I won't go into any more detail on that equation or how it was derived. If you've taken a course or two on Probabilities and Statistics or Stochastic Processes, it probably doesn't look too strange. Just be thankful that Agner Erlang derived it for us so we don't have to.

It's perhaps instructive to plug in a few numbers and get an intuitive sense of it. That will be easier if we code it up.

That equation looks like a beast to program though as it's written. Luckily, it can be expressed a little differently:

$$B(E,0) = 1$$ $$B(E,m) = \frac{E B(E,m - 1)}{E B(E,m - 1) + m}$$

That actually looks really straightforward to implement in a programming language with recursion and pattern matching. Like Erlang...
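Something like this (a sketch; N holds the shared E * B(E, M - 1) subexpression):

```erlang
%% Erlang-B: probability of a call being blocked, given E Erlangs
%% of offered traffic and M circuits
erlang_b(_E, 0) ->
    1;
erlang_b(E, M) ->
    N = E * erlang_b(E, M - 1),
    N / (N + M).
```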

I just introduced an intermediate 'N' to avoid the double recursion.

Now we can run a few numbers through it and see if it makes sense.

The base case expresses an idea that should be intuitively obvious. If you don't have any circuits at all, it doesn't matter how much or how little traffic you are expecting, you're going to drop any and all calls coming in. $$P_b = 1$$.

Similarly, though not quite as obvious from looking at the formula is that if you have no expected traffic, any capacity at all will be able to handle that load. $$B(0, m) = 0$$ no matter what your $$m$$ is.

The next more complicated case is if you have 1 circuit available. A load of 1 Erlang makes for a 50% chance of dropping any given call. If your load rises to 10 Erlangs, that probability goes up to about 91%.

More circuits means more capacity and smaller chances of dropping calls. 1E of traffic with 10 circuits, $$B(1,10) = .00000001$$. Ie, a very, very low probability of dropping calls.

I encourage you to spend some time trying different values and getting a sense of what that landscape looks like. Also think about how that equation applies to systems other than telephone circuits like inventory management, air traffic, or staffing. Clearly this is wide-ranging, important stuff.

Erlang-C builds on this work and extends it with the idea of a wait queue. With Erlang-B, the idea was that if all circuits were in use, an incoming call "failed" in some way and was dropped. Erlang-C instead allows incoming calls to be put on hold until a circuit becomes available and tells you the probability of that happening, again given the expected traffic and number of circuits available:

$$P_W = \frac{\frac{E^m}{m!} \frac{m}{m - E}}{\sum_{i=0}^{m-1} \frac{E^i}{i!} + \frac{E^m}{m!} \frac{m}{m - E}}$$

Another rough looking one to implement. Once again though, there's an alternate expression, this time in terms of Erlang-B:

$$C(E,0) = 1$$ $$C(E,m) = \frac{m B(E,m)}{m - E(1 - B(E,m))}$$

Which again, is straightforward to implement in Erlang:
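A sketch, assuming an erlang_b/2 function implementing the Erlang-B recurrence:

```erlang
%% Erlang-C: probability that an arriving call has to wait in the queue
erlang_c(_E, 0) ->
    1;
erlang_c(E, M) ->
    B = erlang_b(E, M),
    (M * B) / (M - E * (1 - B)).
```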

Again, you should try out some values and get a feel for it. An important anomaly that you'll discover right away is that you can get results greater than 1 when the expected traffic is higher than the number of circuits. This is a little strange and of course in the real world you can never have a probability higher than 1. In this case, it simply represents the fact that essentially, as $${t \to +\infty}$$, the wait queue will be growing infinitely long.

I'll stop here, but it's important to note that these equations are just the tip of the iceberg. From them you can predict average queue wait times, throughput levels, etc. Given how many different types of systems can be represented in these ways and how applicable they are to real-world problems of capacity planning and optimization, I think it's now clear why Agner Erlang deserves (and receives) respect in the field.

I guess I should also point out the obvious, that you should do your own research and not trust any of my math or code here. I only dabble in Queuing Theory and Erlang programming so I've probably gotten some details wrong. The Erlang code here isn't meant for production use. In particular, I'm suspicious that it has some issues with Numerical Stability, but a proper implementation would be harder to understand. Ultimately, I just wanted an excuse to write about a topic that I'm fascinated by and make a very long, drawn out "Yo Dawg..." joke.