Late to the Party

A little behind the curve on this one (hey, I've been busy), but as of today, every Python app that I run is now on Python3. I'm sure there's still Python2 code running here and there which I will replace when I notice it, but basically everything is upgraded now.

TAGS: python

Samlare: expvar to Graphite

One of my favorite bits of the Go standard library is expvar. If you're writing services in Go you are probably already familiar with it. If not, you should fix that.

Expvar just makes it dead simple to expose variables in your program via an easily scraped HTTP and JSON endpoint. By default, it exposes some basic info about the runtime memory usage (allocations, frees, GC stats, etc) but also allows you to easily expose anything else you like in a standardized way.

This makes it easy to pull that data into a system like Prometheus (via the expvarCollector) or watch it in realtime with expvarmon.

I still use Graphite for a lot of my metrics collection and monitoring though and noticed a lack of simple tools for getting expvar metrics into Graphite. Peter Bourgon has a Get to Graphite library which is pretty nice, but it requires that you add the code to your applications, which isn't always ideal.

I just wanted a simple service that would poll a number of expvar endpoints and submit the results to Graphite.

So I made samlare to do just that.

You just run:

$ samlare -config=/path/to/config.toml

With a straightforward TOML config file that would look something like:

CarbonHost = ""
CarbonPort = 2003
CheckInterval = 60000
Timeout = 3000

[[endpoint]]
URL = "http://localhost:14001/debug/vars"
Prefix = "apps.app1"

[[endpoint]]
URL = "http://localhost:14002/debug/vars"
Prefix = "apps.app2"
FailureMetric = "apps.app2.failure"

That just tells samlare where your carbon server (the part of Graphite that accepts metrics) lives, how often to poll your endpoints (in ms), and how long to wait on them before timing out (in ms, again). Then you specify as many endpoints as you want. Each is just the URL to hit and what prefix to give the scraped metrics in Graphite. If you specify a FailureMetric, samlare will submit a 1 for that metric if polling that endpoint fails or times out.
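The core of what samlare does, polling an endpoint, flattening the nested expvar JSON, and emitting Graphite's plaintext protocol, can be sketched in a few lines of Python. This is just an illustration of the idea, not samlare's actual code (samlare is written in Go, and the function names here are mine):

```python
import json

def flatten(prefix, value):
    """Flatten nested expvar JSON into dotted Graphite metric paths."""
    if isinstance(value, dict):
        for k, v in value.items():
            yield from flatten("%s.%s" % (prefix, k), v)
    elif isinstance(value, (int, float)):
        yield (prefix, value)
    # non-numeric leaves (strings, lists) are simply skipped

def to_carbon_lines(prefix, expvar_json, timestamp):
    """Render (path, value) pairs in Graphite's plaintext protocol."""
    data = json.loads(expvar_json)
    return ["%s %s %d" % (path, value, timestamp)
            for path, value in flatten(prefix, data)]

# A trimmed example of what an expvar endpoint returns:
payload = '{"memstats": {"Alloc": 123456, "Frees": 42}}'
for line in to_carbon_lines("apps.app1", payload, 1480000000):
    print(line)
```

Each resulting line ("apps.app1.memstats.Alloc 123456 1480000000" and so on) is exactly what gets written to carbon's plaintext port.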

There are more options as well for renaming metrics, ignoring metrics, etc, that are described in the README, but that's the gist of it.

Anyone else who is using Graphite and expvar has probably cobbled something similar together for their purposes, but this has been working quite well for me, so I thought I'd share.

As a bonus, I also have a package, django-expvar that makes it easy to expose an expvar compatible endpoint in a Django app (which samlare will happily poll).
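An expvar-compatible endpoint is easy to picture: a registry of named callables that gets rendered as a JSON document at /debug/vars. Here's a minimal, framework-agnostic sketch of that idea (the names and structure are mine, not django-expvar's actual API):

```python
import json

# A minimal expvar-style registry: callables are polled at scrape time.
_vars = {}

def publish(name, fn):
    """Register a zero-argument callable under a metric name."""
    _vars[name] = fn

def expvar_payload():
    """Render all registered vars as an expvar-compatible JSON document."""
    return json.dumps({name: fn() for name, fn in _vars.items()})

# In a Django app this would be returned from a view wired to /debug/vars:
#   def debug_vars(request):
#       return HttpResponse(expvar_payload(), content_type="application/json")

hits = {"count": 0}
publish("hits", lambda: hits["count"])
hits["count"] += 1
print(expvar_payload())  # {"hits": 1}
```

Because the registry holds callables rather than values, every scrape sees the current state, which is the same property that makes Go's expvar so convenient.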

2016 Music

Once again, I can't be bothered to list my top albums of the year, but here's a massive list of all the music that I liked this year (that I could find on Bandcamp). No attempt at ranking, no commentary, just a firehose of weird, dark music that I think is worth checking out.

TAGS: music

Using Make with Django

I would like to share how I use a venerable old technology, GNU Make, to manage the common tasks associated with a modern Django project.

As you'll see, I actually use Make for much more than just Django and that's one of the big advantages that keeps pulling me back to it. Having the same tools available on Go, node.js, Erlang, or C projects is handy for someone like me who frequently switches between them. My examples here will be Django related, but shouldn't be hard to adapt to other kinds of codebases. I'm going to assume basic unix commandline familiarity, but nothing too fancy.

Make has kind of a reputation for being complex and obscure. In fairness, it certainly can be. If you start down the rabbit hole of writing complex Makefiles, there really doesn't seem to be a bottom. Its reputation isn't helped by its association with autoconf, which does deserve its reputation for complexity, but we'll just avoid going there.

But Make can be simple and incredibly useful. There are plenty of tutorials online and I don't want to repeat them here. To understand nearly everything I do with Make, this sample stanza is pretty much all you need:

ruleA: ruleB ruleC
    command1
    command2

A Makefile mainly consists of a set of rules, each of which has a list of zero or more prerequisites, and then zero or more commands. At a very high level, when a rule is invoked, Make first invokes any prerequisite rules if needed (which may have their own stanzas defined elsewhere) and then runs the designated commands. There's some nonsense having to do with requiring actual TAB characters for indentation instead of spaces, plus more syntax for comments (lines starting with #) and for defining and using variables, but that's the gist of it right there.

So you can put something extremely simple (and pointless) like

say_hello:
    echo make says hello

in a file called Makefile, then go into the same directory and do:

$ make say_hello
echo make says hello
make says hello

Not exciting, but if you have some complex shell commands that you have to run on a regular basis, it can be convenient to just package them up in a Makefile so you don't have to remember them or type them out every time.

Things get interesting when you add in the fact that rules are also interpreted as filenames and Make is smart about looking at timestamps and keeping track of dependencies to avoid doing more work than necessary. I won't give trivial examples to explain that (again, there are other tutorials out there), but the general interpretation to keep in mind is that the stanza above specifies how to generate or update a file called ruleA. Specifically, other files, ruleB and ruleC, need to have first been generated (or be already up to date), then command1 and command2 are run. If ruleA has been updated more recently than ruleB and ruleC, it's up to date and nothing needs to be done. On the other hand, if ruleB or ruleC have newer timestamps than ruleA, ruleA needs to be regenerated. There will be meatier examples later, which I think will clarify this.
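The freshness check Make performs for a rule like that can be sketched in a few lines of Python (a deliberate simplification for illustration; real Make also handles missing prerequisites, phony targets, and much more):

```python
import os

def needs_rebuild(target, prerequisites):
    """Decide whether `target` must be regenerated, Make-style."""
    if not os.path.exists(target):
        # No target file yet: always build.
        return True
    target_mtime = os.path.getmtime(target)
    # Rebuild if any prerequisite has been modified more recently
    # than the target was.
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)
```

That one comparison, applied recursively through the dependency graph, is essentially all the magic there is.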

The original use-case for Make was handling complex dependencies in languages with compilers and linkers. You wanted to avoid re-compiling the entire project when only one source file changed. (If you've worked with a large C++ project, you probably understand why that would suck.) Make (with a well-written Makefile) is very good at avoiding unnecessary recompilation while you're developing.

So back to Django, which is written in Python, which is not a compiled language and shouldn't suffer from problems related to unnecessary and slow recompilation. Why would Make be helpful here?

While Python is not compiled, a real world Django project has enough complexity involved in day to day tasks that Make turns out to be incredibly useful.

A Django project requires that you've installed at least the Django python library. If you're actually going to access a database, you'll also need a database driver library like psycopg2. Then Django has a whole ecosystem of pluggable applications that you can use for common functionality, each of which often require a few other miscellaneous Python libraries. On many projects that I work on, the list of python libraries that get pulled in quickly runs up into the 50 to 100 range and I don't believe that's uncommon.

I'm a believer in repeatable builds to keep deployment sane and reduce the "but it works on my machine" factor. So each Django project has a requirements.txt that specifies exact version numbers for every python library that the project uses. Running pip install -r requirements.txt should produce a very predictable environment (modulo entirely different OSes, etc.).

Working on multiple projects (between work and personal projects, I help maintain at least 50 different Django projects), it's not wise to have those libraries all installed globally. The Python community's standard solution is virtualenv, which does a nice job of letting you keep everything separate. But now you've got a bunch of virtualenvs that you need to manage.

I have a separate little rant here about how the approach of "activating" virtualenvs in a particular shell environment (even via virtualenvwrapper or similar tools) is an antipattern and should be avoided. I'll spare you most of that, except to say that what I recommend instead is sticking with a standard location for your project's virtualenv and then just calling the python, pip, etc. commands in the virtualenv via their path. Eg, on my projects, I always do

$ virtualenv ve

at the top level of the project. Then I can just do:

$ ./ve/bin/python ...

and know that I'm using the correct virtualenv for the project without thinking about whether I've activated it yet in this terminal (I'm a person who tends to have lots and lots of terminals open at a given time and often run commands from an ephemeral shell within emacs). For Django, where most of what you do is done via manage.py commands, I actually just change the shebang line in manage.py to #!ve/bin/python so I can always just run ./manage.py do_whatever.

Let's get back to Make though. If my project has a virtualenv that was created by running something along the lines of:

$ virtualenv ve
$ ./ve/bin/pip install -r requirements.txt

And requirements.txt gets updated, the virtualenv needs to be updated as well. This is starting to get into Make's territory. Personally, I've encountered enough issues with pip messing up in-place updates on libraries that I prefer to just nuke the whole virtualenv from orbit and do a clean install anytime something changes.

Let's add a rule to a Makefile that looks like this:

ve/bin/python: requirements.txt
    rm -rf ve
    virtualenv ve
    ./ve/bin/pip install -r requirements.txt

If that's the only rule in the Makefile (or just the first), typing

$ make

at the terminal will ensure that the virtualenv is all set up and ready to go. On a fresh checkout, ve/bin/python won't exist, so it will run the three commands, setting everything up. If it's run at any point after that, it will see that ve/bin/python is more recently updated than requirements.txt and nothing needs to be done. If requirements.txt changes at some point, running make will trigger a wipe and reinstall of the virtualenv.

Already, that's actually getting useful. It's better when you consider that in a real project, the commands involved quickly get more complicated: specifying a custom --index-url, setting up some things so pip installs from wheels, and I even like to pin exact versions of virtualenv and setuptools so I don't have to think about what might happen on systems with different versions of those installed. The actual commands are complicated enough that I'm quite happy to have them written down in the Makefile so I only need to remember how to type make.

It all gets even better again when you realize that you can use ve/bin/python as a prerequisite for other rules.

Remember that if a target rule doesn't match a filename, Make will just always run the commands associated with it. Eg, on a Django project, to run a development server, I might run:

$ ./manage.py runserver

Instead, I can add a stanza like this to my Makefile:

runserver: ve/bin/python
    ./manage.py runserver

Then I can just type make runserver and it will run that command for me. Even better, since ve/bin/python is a prerequisite for runserver, if the virtualenv for the project hasn't been created yet (eg, if I just cloned the repo and forgot that I need to install libraries and stuff), it just does that automatically. And if I've done a git pull that updated my requirements.txt without noticing, it will automatically update my virtualenv for me. This sort of thing has been incredibly useful when working with designers who don't necessarily know the ins and outs of pip and virtualenv or want to pay close attention to the requirements.txt file. They just know they can run make runserver and it works (though sometimes it spends a few minutes downloading and installing stuff first).

I typically have a bunch more rules for common tasks set up in a similar fashion:

check: ve/bin/python
    ./manage.py check

migrate: check
    ./manage.py migrate

flake8: ve/bin/python
    ./ve/bin/flake8 project

test: check flake8
    ./manage.py test

That demonstrates a bit of how rules chain together. If I run make test, check and flake8 are both prerequisites, so they each get run first. They, in turn, both depend on the virtualenv being created so that will happen before anything.

Perhaps you've noticed that there's also a little bug in the ve/bin/python stanza up above. ve/bin/python is created by the virtualenv ve step, but it's used as the target for the stanza. If the pip install step fails though (because of a temporary issue with PyPI or just a typo in requirements.txt or something), it will still have "succeeded" in that ve/bin/python has a fresher timestamp than requirements.txt. So the virtualenv won't really have the complete set of libraries installed there but subsequent runs of Make will consider everything fine (based on timestamp comparisons) and not do anything. Other rules that depend on the virtualenv being set up are going to have problems when they run.

I get around that by introducing the concept of a sentinel file. So my stanza actually becomes something like:

ve/sentinel: requirements.txt
    rm -rf ve
    virtualenv ve
    ./ve/bin/pip install -r requirements.txt
    touch ve/sentinel

Ie, now there's a zero byte file named ve/sentinel that exists just to signal to Make that the rule completed successfully. If the pip install step fails for some reason, it never gets created and Make won't try to keep going until that gets fixed.

My actual Makefile setup on real projects has grown more flexible and more complex, but if you've followed most of what's here, it should be relatively straightforward. In particular, I've taken to splitting functionality out into individual, reusable .mk files that are heavily parameterized with variables, which then just get included into the main Makefile where the project specific variables are set.

Eg, here is a typical one. It sets a few variables specific to the project, then just does include *.mk.

The actual Django related rules live in a .mk file that I use across all my django projects and should look similar to what's been covered here (just with a lot more rules and variables). Other .mk files in the project handle tasks for things like javascript (jshint and jscs checks, plus npm and webpack operations) or docker. They all take sensible default values and are set up so those can be overridden in the main Makefile or from the commandline. The rules in these files are a bit more sophisticated than the examples here, but not by much. [I should also point out here that I am by no means a Make expert, so you may not want to view my stuff as best practices; merely a glimpse of what's possible.]

I typically arrange things so the default rule in the Makefile runs the full suite of tests and linters. Running the tests after every change is the most common task for me, so having that available with a plain make command is nice. As an emacs user, it's even more convenient since emacs' default setting for the compile command is to just run make. So it's always a quick keyboard shortcut away. (I use projectile, which is smart enough to run the compile command in the project root).

Make was originally created in 1976, but I hope you can see that it remains relevant and useful forty years later.

2015 Music

It's the time of the year when everyone puts out their top albums of 2015 lists. 2015 has been a very good year for the kind of music that I like, so I thought about writing my own up, but decided that I am too lazy to pare it down to just 10 albums or so. Instead, here's a massive list of 60 albums (and 1 single) that came out in 2015 that I enjoyed. If you like dark, weird music (atmospheric black metal, doom, sludge, noise, and similar), perhaps you might too. I've only included albums that are up on Bandcamp (they're generous with previews and you can buy the album if you like it). There are way too many for me to write about each (and really, what's the point when you can listen to each of them for yourselves?). I also make no attempt at ranking them, thus they are presented in alphabetical order by artist name.

TAGS: music

A Curated List of London Tube Station Names That Amuse Me

Plus Honorable Mention to the Overground station "Wapping".

Continuously Deploying Django with Docker

I run about a dozen personal Django applications (including this site) on some small servers that I admin. I also run a half dozen or so applications written in other languages and other frameworks.

Since it's a heterogeneous setup and I have a limited amount of free time for side projects, container technology like Docker that lets me standardize my production deployment is quite appealing.

I run a continuous deployment pipeline for all of these applications so every git commit I make to master goes through a test pipeline and ends up deployed to the production servers (assuming all the tests pass).

Getting Django to work smoothly in a setup like this is non-trivial. This post attempts to explain how I have it all working.


First, some background on my setup. I run Ubuntu 14.04 servers on Digital Ocean. Ubuntu 14.04 still uses upstart as the default init, so that's what I use to manage my application processes. I back the apps with Postgres and I run an Nginx proxy in front of them. I serve static assets via S3 and Cloudfront. I also use Salt for config management and provisioning so if some of the config files here look a bit tedious or tricky to maintain and keep in sync, keep in mind that I'm probably actually using Salt to template and automate them. I also have a fairly extensive monitoring setup that I won't go into here, but will generally let me know as soon as anything goes wrong.

I currently have three "application" servers where the django applications themselves run. Typically I run each application on two servers which Nginx load balances between. A few of the applications also use Celery for background jobs and Celery Beat for periodic tasks. For those, the celery and celery beat processes run on the third application server.

My goal for my setup was to be able to deploy new versions of my Django apps automatically and safely just by doing git push origin master (which typically pushes to a github repo). That means that the code needs to be tested, a new Docker image needs to be built, distributed to the application servers, database migrations run, static assets compiled and pushed to S3, and the new version of the application started in place of the old. Preferably without any downtime for the users.

I'll walk through the setup for my web-based feedreader app, antisocial, since it is one of the ones with Celery processes. Other apps are all basically the same except they might not need the Celery parts.

I should also point out that I am perpetually tweaking stuff. This is what I'm doing at the moment, but it will probably be outdated soon after I publish this as I find other things to improve.


Let's start with the Dockerfile:


FROM ccnmtl/django.base
ADD wheelhouse /wheelhouse
RUN apt-get update && apt-get install -y libxml2-dev libxslt-dev
RUN /ve/bin/pip install --no-index -f /wheelhouse -r /wheelhouse/requirements.txt
COPY . /app/
RUN cd /app/ && /ve/bin/python manage.py test
EXPOSE 8000
# the entry point script described below (the name here is illustrative)
ADD docker-run.sh /run.sh
ENV APP antisocial
ENTRYPOINT ["/run.sh"]
CMD ["run"]

Like most, I started using Docker by doing FROM ubuntu:trusty or something similar at the beginning of all my Dockerfiles. That's not really ideal though and results in large docker images that are slow to work with so I've been trying to get my docker images as slim and minimal as possible lately.

Roughly following Glyph's approach, I split the docker image build process into a base image and a "builder" image so the final image can be constructed without the whole compiler toolchain included. I've published the base image as ccnmtl/django.base along with its builder counterpart, and you can see exactly how they are made here.

Essentially, they both are built on top of Debian Jessie (quite a bit smaller than Ubuntu images and similar enough). The base image contains the bare minimum while the build image contains a whole toolchain for building wheels out of python libraries. I have a Makefile with some bits like this:

ROOT_DIR:=$(shell dirname $(realpath $(lastword $(MAKEFILE_LIST))))

$(WHEELHOUSE)/requirements.txt: $(REQUIREMENTS)
        mkdir -p $(WHEELHOUSE)
        docker run --rm \
        -v $(ROOT_DIR):/app \
        -v $(ROOT_DIR)/$(WHEELHOUSE):/wheelhouse \
        $(BUILD_IMAGE) # the builder image described above
        cp $(REQUIREMENTS) $(WHEELHOUSE)/requirements.txt
        touch $(WHEELHOUSE)/requirements.txt

build: $(WHEELHOUSE)/requirements.txt
        docker build -t $(IMAGE) .

So when I do make build, if the requirements.txt has changed since the last time, it uses the build image to generate a directory with wheels for every library specified in requirements.txt, then runs docker build, which can do a very simple (and fast) pip install of those wheels.

Once the requirements are installed, it runs the application's unit tests. I expose port 8000 and copy in a custom script to use as an entry point.

That script makes the container a bit easier to work with. It looks like this:


#!/bin/bash
cd /app/

if [[ "$SETTINGS" ]]; then
    export DJANGO_SETTINGS_MODULE="$SETTINGS"
else
    export DJANGO_SETTINGS_MODULE="$APP.settings_docker"
fi

if [ "$1" == "migrate" ]; then
    exec /ve/bin/python manage.py migrate --noinput
fi

if [ "$1" == "collectstatic" ]; then
    exec /ve/bin/python manage.py collectstatic --noinput
fi

if [ "$1" == "compress" ]; then
    exec /ve/bin/python manage.py compress
fi

if [ "$1" == "shell" ]; then
    exec /ve/bin/python manage.py shell
fi

if [ "$1" == "worker" ]; then
    exec /ve/bin/python manage.py celery worker
fi

if [ "$1" == "beat" ]; then
    exec /ve/bin/python manage.py celery beat
fi

# run arbitrary manage.py commands
if [ "$1" == "manage" ]; then
    shift
    exec /ve/bin/python manage.py "$@"
fi

if [ "$1" == "run" ]; then
    exec /ve/bin/gunicorn --env \
        DJANGO_SETTINGS_MODULE=$DJANGO_SETTINGS_MODULE \
        $APP.wsgi:application -b \
        -w 3 --access-logfile=- --error-logfile=-
fi

With the ENTRYPOINT and CMD set up that way in the Dockerfile, I can just run

$ docker run thraxil/antisocial

and it will run the gunicorn process, serving the app on port 8000. Or, I can do:

$ docker run thraxil/antisocial migrate

and it will run the database migration task. Similar for collectstatic, compress, celery, etc. Or, I can do:

$ docker run thraxil/antisocial manage some_other_command --with-flags

to run any other Django command (this is really handy for dealing with migrations that need to be faked out, etc.)


Of course, all of those exact commands would run into problems with needing various environment variables passed in, etc.

The settings_docker settings module that the script defaults to for the container is a fairly standard Django settings file, except that it pulls all the required settings out of environment variables. The bulk of it comes from a common library that you can see here.

This gives us a nice twelve-factor style setup and keeps the docker containers very generic and reusable. If someone else wants to run one of these applications, they can pretty easily just run the same container and just give it their own environment variables.
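The pattern is simple: every deployment-specific setting is read from the environment, with a hard failure when a required one is missing. A stripped-down sketch of what such a settings module does (the helper and the variable names here are illustrative, not the actual contents of my settings library):

```python
import os

def env(name, default=None, required=False, cast=str):
    """Read a single setting from the environment, optionally casting it."""
    value = os.environ.get(name, default)
    if value is None:
        if required:
            raise RuntimeError("missing required setting: %s" % name)
        return None
    return cast(value)

# Simulate the environment that docker-runner would pass in (demo values):
os.environ["SECRET_KEY"] = "some_secret_key"
os.environ["DB_PORT"] = "5432"

# e.g. in settings_docker.py:
SECRET_KEY = env("SECRET_KEY", required=True)
DB_NAME = env("DB_NAME", default="antisocial")
DB_PORT = env("DB_PORT", default="5432", cast=int)
```

Failing fast on a missing SECRET_KEY at startup beats discovering the problem later via a mysterious runtime error.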

The downside though is that it gets a bit painful to actually run things from the commandline, particularly for one-off tasks like database migrations because you actually need to specify a dozen or so -e flags on every command.

I cooked up a little bit of shell script with a dash of convention over configuration to ease that pain.

All the servers get a simple docker-runner script that looks like:



#!/bin/bash
APP=$1
shift

if [ -f /etc/default/$APP ]; then
  . /etc/default/$APP
fi
if [ -f /var/www/$APP/TAG ]; then
  . /var/www/$APP/TAG
fi

exec /usr/bin/docker run $OPTS $EXTRA $IMAGE:$TAG "$@"

That expects that every app has a file in /etc/default that defines an $IMAGE and $OPTS variable. Eg, antisocial's looks something like:


export IMAGE="thraxil/antisocial"
export OPTS="--link postfix:postfix \
     --rm \
     -e SECRET_KEY=some_secret_key \
     -e AWS_STORAGE_BUCKET_NAME=s3-bucket-name \
     -e AWS_ACCESS_KEY=... \
     -e AWS_SECRET_KEY=... \
     ... more settings ... \
     -e BROKER_URL=amqp://user:pass@host:5672//antisocial"

With that in place, I can just do:

 $ docker-runner antisocial migrate

And it fills everything in. So I can keep the common options in one place and not have to type them in every time.

(I'll get to the TAG file that it mentions in a bit)


With those in place, the upstart config for the application can be fairly simple:


description "start/stop antisocial docker"

start on filesystem and started docker-postfix and started registrator
stop on runlevel [!2345]


script
  export EXTRA="-e SERVICE_NAME=antisocial -p 8000"
  exec /usr/local/bin/docker-runner antisocial
end script

The Celery and Celery Beat services have very similar ones except they run celery and beat tasks instead and they don't need to have a SERVICE_NAME set or ports configured.


Next, I use consul, consul-template, and registrator to rig everything up so Nginx automatically proxies to the appropriate ports on the appropriate application servers.

Each app is registered as a service (hence the SERVICE_NAME parameter in the upstart config). Registrator sees the containers starting and stopping and registers and deregisters them with consul as appropriate, inspecting them to get the IP and port info.

consul-template runs on the Nginx server and has a template defined for each app that looks something like:

{{if service "antisocial"}}
upstream antisocial {
{{range service "antisocial"}}    server {{.Address}}:{{.Port}};
{{end}}
}
{{end}}

server {
        listen   80;

        client_max_body_size 40M;
{{if service "antisocial"}}
        location / {
          proxy_pass http://antisocial;
          proxy_next_upstream     error timeout invalid_header http_500;
          proxy_connect_timeout   2;
          proxy_set_header        Host            $host;
          proxy_set_header        X-Real-IP       $remote_addr;
        }
        error_page 502 /502.html;
        location = /502.html {
          root   /var/www/down/;
        }
{{else}}
        root /var/www/down/;
        try_files $uri /502.html;
{{end}}
}

That just dynamically creates an endpoint for each running app instance pointing to the right IP and port. Nginx then round-robins between them. If none are running, it changes it out to serve a "sorry, the site is down" kind of page instead. Consul-template updates the nginx config and reloads nginx as soon as any changes are seen to the service. It's really nice. If I need more instances of a particular app running, I can just spin one up on another server and it instantly gets added to the pool. If one crashes or is shut down, it's removed just as quickly. As long as there's at least one instance running at any given time, visitors to the site should never be any the wiser (as long as it can handle the current traffic).

That really covers the server and application setup.

What's left is the deployment part. Ie, how it gets from a new commit on master to running on the application servers.


Jenkins is kind of a no-brainer for CI/CD stuff. I could probably rig something similar up with TravisCI or Wercker or another hosted CI, but I'm more comfortable keeping my credentials on my own servers for now.

So I have a Jenkins server running and I have a job set up there for each application. It gets triggered by a webhook from github whenever there's a commit to master.

Jenkins checks out the code and runs:

export TAG=build-$BUILD_NUMBER
make build
docker push thraxil/antisocial:$TAG

$BUILD_NUMBER is a built-in environment variable that Jenkins sets on each build. So it's just building a new docker image (which runs the test suite as part of the build process) and pushes it to the Docker Hub with a unique tag corresponding to this build.

When that completes successfully, it triggers a downstream Jenkins job called django-deploy which is a parameterized build. It passes it the following parameters:

HOSTS=appserver1 appserver2

These are fairly simple apps that I run mostly for my own amusement so I don't have extensive integration tests. If I did, instead of triggering django-deploy directly here, I would trigger other jobs to run those tests against the newly created and tagged image first.

The django-deploy job runs the following script:


#!/bin/bash
# hosts, chosts, and bhosts come from the parameters Jenkins passes in
# (the web, celery, and celery beat hosts respectively)
hosts=(${HOSTS})
chosts=(${CELERY_HOSTS})
bhosts=(${BEAT_HOSTS})

for h in "${hosts[@]}"
do
    ssh $h docker pull ${REPOSITORY}thraxil/$APP:$TAG
    ssh $h cp /var/www/$APP/TAG /var/www/$APP/REVERT || true
    ssh $h "echo export TAG=$TAG > /var/www/$APP/TAG"
done

for h in "${chosts[@]}"
do
    ssh $h docker pull ${REPOSITORY}thraxil/$APP:$TAG
    ssh $h cp /var/www/$APP/TAG /var/www/$APP/REVERT || true
    ssh $h "echo export TAG=$TAG > /var/www/$APP/TAG"
done

for h in "${bhosts[@]}"
do
    ssh $h docker pull ${REPOSITORY}thraxil/$APP:$TAG
    ssh $h cp /var/www/$APP/TAG /var/www/$APP/REVERT || true
    ssh $h "echo export TAG=$TAG > /var/www/$APP/TAG"
done

# migrations and static asset tasks only need to run on one host
h=${hosts[0]}
ssh $h /usr/local/bin/docker-runner $APP migrate
ssh $h /usr/local/bin/docker-runner $APP collectstatic
ssh $h /usr/local/bin/docker-runner $APP compress

for h in "${hosts[@]}"
do
    ssh $h sudo stop $APP || true
    ssh $h sudo start $APP
done

for h in "${chosts[@]}"
do
    ssh $h sudo stop $APP-worker || true
    ssh $h sudo start $APP-worker
done

for h in "${bhosts[@]}"
do
    ssh $h sudo stop $APP-beat || true
    ssh $h sudo start $APP-beat
done

It's a bit long, but straightforward.

First, it just pulls the new docker image down onto each server. This is done first because the docker pull is usually the slowest part of the process. Might as well get it out of the way. On each host, it also writes to the /var/www/$APP/TAG file that we saw mentioned back in docker-runner. The contents are just a variable assignment specifying the tag that we just built and pulled and are about to cut over to. The docker-runner script knows to use the specific tag that's set in that file. Of course, it first backs up the old one to a REVERT file that can then be used to easily roll back the whole deploy if something goes wrong.

Next, the database migrations and static asset tasks have to run. They only need to run on a single host though, so it just pulls the first one off the list and runs the migrate, collectstatic, and compress on that one.

Finally, it goes host by host and stops and starts the service on each in turn. Remember that with the whole consul setup, as long as they aren't all shut off at the same time, overall availability should be preserved.

Then, of course, it does the same thing for the celery and celery beat services.

If that all completes successfully, it's done. If it fails somewhere along the way, another Jenkins job is triggered that basically restores the TAG file from REVERT and restarts the services, putting everything back to the previous version.

Conclusion and Future Directions

That's a lot to digest. Sorry. In practice, it really doesn't feel that complicated. Mostly stuff just works and I don't have to think about it. I write code, commit, and git push. A few minutes later I get an email from Jenkins telling me that it's been deployed. Occasionally, Jenkins tells me that I broke something and I go investigate and fix it (while the site stays up). If I need more capacity, I provision a new server and it joins the consul cluster. Then I can add it to the list to deploy to, kick off a Jenkins job and it's running. I've spent almost as much time writing this blog post explaining everything in detail as it took to actually build the system.

Provisioning servers is fast and easy because they barely need anything installed on them besides docker and a couple config files and scripts. If a machine crashes, the rest are unaffected and service is uninterrupted. Overall, I'm pretty happy with this setup. It's better than the statically managed approach I had before (no more manually editing nginx configs and hoping I don't introduce a syntax error that takes all the sites down until I fix it).

Nevertheless, what I've put together is basically a low rent, probably buggy version of a PaaS. I knew this going in. I did it anyway because I wanted to get a handle on all of this myself. (I'm also weird and enjoy this kind of thing). Now that I feel like I really understand the challenges here, when I get time, I'll probably convert it all to run on Kubernetes or Mesos or something similar.

Docker and Upstart

Docker has some basic process management functionality built in. You can set restart policies and the Docker daemon will do its best to keep containers running and restart them if the host is rebooted.

This is handy and can work well if you live in an all Docker world. Many of us need to work with Docker based services alongside regular non-Docker services on the same host, at least for the near future. Our non-Docker services are probably managed with Systemd, Upstart, or something similar and we'd like to be able to use those process managers with our Docker services so dependencies can be properly resolved, etc.

I haven't used Systemd enough to have an opinion on it (according to the internet, it's either the greatest thing since sliced bread or the arrival of the antichrist, depending on who you ask). Most of the machines I work on are still running Ubuntu 14.04, where Upstart is the path of least resistance and the tool that I know best.

Getting Docker and Upstart to play nicely together is not quite as simple as it appears at first.

Docker's documentation contains a sample upstart config:

description "Redis container"
author "Me"
start on filesystem and started docker
stop on runlevel [!2345]
script
    /usr/bin/docker start -a redis_server
end script

That works, but it assumes that the container named redis_server already exists; i.e., that someone has already run the docker run --name=redis_server ... command (or a docker create), specifying all the parameters, either manually or via some mechanism outside upstart. If you need to change one of those parameters, you have to stop the upstart job, do a docker stop redis_server, delete the container with docker rm redis_server, run docker create --name=redis_server ... to make a new container with the updated parameters, and then start the upstart job again.

That's a lot of steps and would be no fun to automate via your configuration management or as part of a deployment. What I expect to be able to do with upstart is deploy the necessary dependencies and configuration to the host, drop an upstart config file in /etc/init/myservice.conf and do start myservice, stop myservice, etc. I expect to be able to drop in a new config file and just restart the service to have it go into effect. Letting Docker get involved seems to introduce a bunch of additional steps to that process that just get in the way.

Really, to get Docker and Upstart to work together properly, it's easier to just let upstart do everything and configure Docker to not try to do any process management.

First, make sure you aren't starting the Docker daemon with --restart=always. The default is --restart=no and is what we want.

Next, instead of building the container and then using docker start and docker stop even via upstart, we instead want to just use docker run so parameters can be specified right there in the upstart config (I'm going to leave out the description/author stuff from here on out):

start on filesystem and started docker
stop on runlevel [!2345]
script
    /usr/bin/docker run \
        -v /docker/host/dir:/data \
        redis
end script

This will work nicely. You can stop and start via upstart like a regular system service.

Of course, we would probably like other services to be able to link to it and for that it will need to be named:

start on filesystem and started docker
stop on runlevel [!2345]
script
    /usr/bin/docker run \
        --name=redis_server \
        -v /docker/host/dir:/data \
        redis
end script

That will work, at least the first time.

Then we run into the issue that anyone who's used Docker and tried to run named containers has undoubtedly come across. If we stop that job and try to start it again, it will fail and the logs will be full of complaints about:

Error response from daemon: Conflict. The name "redis_server" is
already in use by container 9ccc57bfbc3c. You have to delete (or
rename) that container to be able to reuse that name.

Then you have to do the whole dance of removing the container and restarting stuff. So we put a --rm in there:

start on filesystem and started docker
stop on runlevel [!2345]
script
    /usr/bin/docker run \
        --rm \
        --name=redis_server \
        -v /docker/host/dir:/data \
        redis
end script

This is much better and will mostly work.

Sometimes, though, the container will get killed without a proper SIGTERM signal getting through and Docker won't clean up the container. Eg, it gets OOM-killed, or the server is abruptly power-cycled (it seems like sometimes even a normal stop just doesn't work right). The old container is left there and the next time it tries to start, you run into the same old conflict and have to manually clean it up.

There are numerous Stack Overflow questions and similar out there with suggestions for pre-stop stanzas, etc. to deal with this problem. However, in my experimentation, they all failed to reliably work across some of those trickier situations like OOM-killer rampages and power cycling.

My current solution is simple and has worked well for me. I just add a couple more lines to the script section like so:

script
    /usr/bin/docker stop redis_server || true
    /usr/bin/docker rm redis_server || true
    /usr/bin/docker run \
        --rm \
        --name=redis_server \
        -v /docker/host/dir:/data \
        redis
end script

In the spirit of the World's Funniest Joke, before trying to revive the container, we first make sure it's dead. The || true on each of those lines just ensures that it will keep going even if it didn't actually have to stop or remove anything. (I suppose that the --rm is now superfluous, but it doesn't hurt).

So this is how I run things now. I tend to have two levels of Docker containers: these lower level named services that get linked to other containers, and higher level "application" containers (eg, Django apps), that don't need to be named, but probably link in one of these named containers. This setup is easy to manage (my CM setup can push out new configs and restart services with wild abandon) and robust.

Circuit Breaker TCP Proxy

When you're building systems that combine other systems, one of the big lessons you learn is that failures can cause chain reactions. If you're not careful, one small piece of the overall system failing can cause catastrophic global failure.

Even worse, one of the main mechanisms for these chain reactions is a well-intentioned attempt by one system to cope with the failure of another.

Imagine Systems A and B, where A relies on B. System A needs to handle 100 requests per second, which come in at random but fairly regular intervals. Each of those implies a request to System B. So each system sees an average of one request every 10ms.

System A may decide that if a request to System B fails, it will just repeatedly retry the request until it succeeds. This might work fine if System B has a very brief outage. But then B goes down for longer and needs to be completely restarted, or just becomes completely unresponsive until someone steps in and manually fixes it. Let's say that it's down for a full second. When it finally comes back up, it has to deal with 100 requests all arriving at once, since System A has been piling them up. That usually doesn't end well, making System B run extremely slowly or even crash again, starting the whole cycle over again.

Meanwhile, since System A is having to wait unusually long for a successful response from B, it's chewing up resources as well and is more likely to crash or start swapping. And of course, anything that relies on A is then affected and the failure propagates.

Going all the way to the other extreme, where A never retries and instead immediately notes the failure and passes it back, is a little better, but still not ideal. A is still hitting B every time while B is having trouble, which probably isn't helping B's situation. And somewhere back up the chain, a component that calls System A might be retrying, effectively hammering B the same as in the previous scenario, just using A as a proxy.

A common pattern for dealing effectively with this kind of problem is the "circuit breaker". I first read about it in Michael Nygard's book Release It! and Martin Fowler has a nice in-depth description of it.

As the name of the pattern should make clear, a circuit breaker is designed to detect a fault and immediately halt all traffic to the faulty component for a period of time, to give it a little breathing room to recover and prevent cascading failures. Typically, you set a threshold of errors and, when that threshold is crossed, the circuit breaker "trips" and no more requests are sent for a short period of time. Any clients get an immediate error response from the circuit breaker, but the ailing component is left alone. After that short period of time, it tentatively allows requests again, but will trip again if it sees any failures, this time blocking requests for a longer period. Ideally, these time periods of blocked requests will follow an exponential backoff, to minimize downtime while still easing the load as much as possible on the failed component.

Implementing a circuit breaker around whatever external service you're using usually isn't terribly hard and the programming language you are using probably already has a library or two available to help. Eg, in Go, there's this nice one from Scott Barron at Github and this one which is inspired by Hystrix from Netflix, which includes circuit breaker functionality. Recently, Heroku released a nice Ruby circuit breaker implementation.

That's great if you're writing everything yourself, but sometimes you have to deal with components that you either don't have the source code to or just can't really modify easily. To help with those situations, I wrote cbp, which is a basic TCP proxy with circuit breaker functionality built in.

You tell it what port to listen on, where to proxy to, and what threshold to trip at. Eg:

$ cbp -l localhost:8000 -r remote.example.com:80 -t .05

sets up a proxy on local port 8000 to port 80 of the remote host (remote.example.com is just a placeholder here). You can then configure a component to make its API requests to http://localhost:8000/. If more than 5% of those requests fail, the circuit breaker trips and it stops proxying for 1 second, then 2 seconds, then 4 seconds, etc., until the remote service recovers.

This is a raw TCP proxy, so even HTTPS requests or any other kind of (TCP) network traffic can be sent through it. The downside is that it can only detect TCP-level errors and remains ignorant of whatever protocol runs on top. That means, e.g., that if you proxy to an HTTP/HTTPS service and that service responds with a 500 or 502 error response, cbp doesn't know that that is considered an error; it sees a perfectly valid TCP exchange go by and assumes everything is fine. It would only trip if the remote service failed to establish the connection at all. (I may make an HTTP-specific version later, or add optional support for HTTP and/or other common protocols, but for now it's plain TCP.)

Dispatch van Utrecht 2: The Office

2014's almost over, I've been living in Utrecht for about five months now (minus my travelling all over the place), time for a small update.

It took until well into September for our shipping container of belongings to make it across the Atlantic and get delivered. Up to that point, my office setup looked like this:

My temporary office setup, waiting for my stuff.

Just my laptop on a basic IKEA desk. The rest of the room empty and barren.

Now it looks like this:

desk setup

Same desk, but I've got my nice Humanscale Freedom chair, my monitor mounts, and a couple rugs.

The other half of the room is now my art and music studio:

studio setup

Phoenix got me an easel for Christmas (as a not so subtle hint that I should do more larger pieces), and I set up another basic IKEA table for my guitar and recording gear. Clearly I still need to work out some cable management issues, but otherwise, it's quite functional.

Finally, behind my left shoulder when I'm at my desk is a little fireplace type thing that is currently serving as the closest thing I have to shelves in the office, so it's got some trinkets and my collection of essential programming books:


TAGS: utrecht