My Own Javascript Y Combinator

By anders pearson 15 Sep 2008

I came across a link to the slides for Xavier Leroy’s course on Functional programming languages this weekend and have been slowly making my way through them.

It’s just slides from the lectures so it can be a bit of a struggle to follow them without, you know, the actual lectures. On the other hand, having to struggle a bit and figure out what’s going on from just the slides has been kind of good for forcing me to really pay attention and not gloss over the parts that aren’t immediately obvious.

Judging by the slides, the lectures are fantastic. They’re incredibly ambitious in scope. The first lecture starts by introducing the Lambda Calculus (well, that part is fast enough that I think it’s assumed that the reader is already basically familiar), shows reduction strategies, implements a lambda calculus interpreter in a couple of lines of Caml, makes it efficient, adds lexical scoping and closures, and keeps it all tied back to the Lambda Calculus. Further lectures go on to describe abstract machines, compile the enriched Lambda Calculus for the abstract machine, do tail call elimination and lazy evaluation, prove the correctness of the abstract machine and compiler, get into exception-returning vs. state-passing vs. continuation-passing implementations, cover monads and monad transformers (including the concurrency monad transformer), and then dig into some even deeper optimizations. The course basically connects the theoretical, mathematical underpinnings of programming all the way to the tricks that are used in compiler optimization. Highly recommended if that sounds interesting to you.

Anyway, in the course of making my way through that material, I decided that I needed to refresh my understanding of the Lambda Calculus and combinators since I haven’t really done much of that since undergrad. I’ve been writing a lot of JavaScript lately for work and, while I still don’t really like the syntax that much, JavaScript does have proper closures so it’s suitable for lambda calculus experimentation.

So I decided to implement the Y Combinator in JavaScript as an exercise. I remember it being covered in class long ago, and, while I could see why it was interesting, I don’t think it ever quite clicked for me exactly how it worked. There’s a good page explaining the how and why of the Y Combinator here. The example code is all in Scheme, so I just went through the page and translated each example to JavaScript, making sure I understood what was going on at each step. I kind of like this approach to learning CS. Just reading a text with code samples, it’s all too easy to accept that the samples do what the text says they do and not really spend the extra effort to understand exactly what’s going on. Translating it into a different programming language forces you to really grasp every detail.

So, this is my JavaScript Y Combinator. There are many like it, but this one is mine:

:::javascript
function Y (X) {
  return (function(procedure) {
    return X(function(arg) {
      return procedure(procedure)(arg);
    });
  })(function(procedure) {
    return X(function(arg) {
      return procedure(procedure)(arg);
    });
  });
}

It’s not as pretty as the Scheme version:

:::scheme
(define Y
  (lambda (X)
    ((lambda (procedure)
       (X (lambda (arg) ((procedure procedure) arg))))
     (lambda (procedure)
       (X (lambda (arg) ((procedure procedure) arg)))))))

And neither is as pretty as the original untyped Lambda Calculus:

Y = λf·(λx·f (x x)) (λx·f (x x))

But it works.
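To convince yourself of that, here’s a quick sanity check (my own example, not from the page I was translating): using Y to define factorial with no explicit self-reference anywhere.

:::javascript
// the function handed to Y never refers to itself by name;
// it just receives "fact" as an argument and trusts Y to tie the knot
var factorial = Y(function (fact) {
  return function (n) {
    return n === 0 ? 1 : n * fact(n - 1);
  };
});

factorial(5); // => 120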

My Django Wishlist

By anders pearson 13 Sep 2008

Having built a few Django sites now, I’m ninety-nine percent happy with it as a framework. The remaining one percent is a couple issues I have with how it does things, but they’re relatively minor. Overall, I’m impressed at how well Django gets all the details right.

These are the things that are still on my wishlist for Django. No doubt some of them are things that Django already does and I just haven’t come across them in the documentation yet (and hopefully someone will add a comment to point me in the right direction). Other things are probably impossible given other design constraints. The rest are probably things that just bother me and no one else cares.

In no particular order:

Transactions per HTTP Request by default

Django’s transactions work on an autocommit model by default, so each time a model object is saved, it happens in a separate transaction. For web applications, I’ve always found that it makes far more sense (and eliminates some painfully sneaky bugs) if there’s a single database transaction per HTTP request. It’s fairly trivial to enable that model in Django by including the transaction middleware. I just don’t see why it isn’t included by default.
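Enabling it looks something like this (a sketch against the Django 1.0-era settings layout; the rest of the middleware list here is just illustrative):

:::python
# settings.py (sketch): one database transaction per HTTP request
MIDDLEWARE_CLASSES = (
    'django.middleware.common.CommonMiddleware',
    'django.contrib.sessions.middleware.SessionMiddleware',
    'django.middleware.transaction.TransactionMiddleware',
    # views (and any middleware listed after this point) run inside the
    # transaction; it commits on success and rolls back on an exception
)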

Better Support for ON DELETE CASCADE with PostgreSQL

Creating a foreign key relationship with Django’s ORM effectively does an ON DELETE CASCADE, but it doesn’t actually set up the database that way; the deletion is handled in the Python layer instead. If you only ever use the Django ORM to access the database, everything works smoothly and you’d never even notice. However, because the cascade isn’t done in the database, it’s more difficult to maintain data integrity if you ever want to modify the database directly with some other tool or language. My understanding is that the reason for this has to do with Django’s supporting certain RDBMSs that don’t implement cascades. But, dammit, I use PostgreSQL for all my sites and it does support it just fine. SQLObject works with the same broken RDBMSs yet it manages to set up a proper ON DELETE CASCADE when you use it with PostgreSQL.
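To make that concrete, here’s roughly the situation I mean (a made-up pair of models; the exact DDL Django generates varies by version and backend):

:::python
# sketch: an ordinary foreign key relationship
from django.db import models

class Post(models.Model):
    title = models.CharField(max_length=256)

class Comment(models.Model):
    post = models.ForeignKey(Post)

# Django creates the column as a plain REFERENCES constraint and does the
# cascading delete itself in Python when you call post.delete(). What I'd
# like it to emit on PostgreSQL is, roughly:
#
#   ALTER TABLE myapp_comment
#       ADD FOREIGN KEY (post_id) REFERENCES myapp_post (id)
#       ON DELETE CASCADE;
#
# so that deleting rows with psql or any other tool stays consistent too.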

Not Swallowing Template Errors

Django’s templates are a thing of beauty. They strike a really nice balance of simplicity and power. But if your template calls a method that raises an exception, it swallows the exception and just returns an empty string. You don’t see anything in the browser and there are no tracebacks in the console. On a production site, this is absolutely what you want to happen. But when I’m developing, this behavior masks errors that I would like to see, so I can fix them. From what I’ve read, there are backwards compatibility issues with having errors make it up to the browser, even in development mode. But it seems to me that it should be possible to get a traceback in the console when running ‘manage.py runserver’.

Easier HTTP Method Dispatch

I’m a huge fan of the REST architectural style, so it bugs me a bit to have code like this in all my views:

:::python
if request.method == "POST":
    # do some POST stuff
if request.method == "GET":
    # do some GET stuff
if request.method == "DELETE":
    # do some DELETE stuff
# ...

I’ve seen a few different projects out there to make REST method dispatch smoother, but I’d like to see it in the core. More hooks for supporting E-Tags and conditional requests would also make me happy.
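In the meantime, rolling a minimal dispatcher yourself is easy enough. Something like this sketch (the rest_view helper is my own invention, not an existing Django API) already removes most of the boilerplate:

:::python
from django.http import HttpResponseNotAllowed

def rest_view(**handlers):
    """Build a view that dispatches on request.method,
    e.g. rest_view(GET=show_item, POST=update_item, DELETE=delete_item)."""
    def view(request, *args, **kwargs):
        handler = handlers.get(request.method)
        if handler is None:
            # advertise which methods this resource actually supports
            return HttpResponseNotAllowed(handlers.keys())
        return handler(request, *args, **kwargs)
    return view

A urls.py entry can then point straight at rest_view(GET=..., POST=...) and each handler stays a plain, single-purpose function.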

Also see my note on CherryPy-style dispatching below.

Flexible Directory Templating

Django’s ‘startproject’ and ‘startapp’ commands are basically doing the same thing as Paste’s ‘create’. Paste has this wonderful ability to use templates though. TurboGears and Pylons both just use Paste to do their equivalent code generation tasks. So, when I was doing TurboGears at work, I was able to subclass the TG template and create one master application template that included the settings we pretty much always used, the scripts and configs for our automated deployment system, etc. Then, to start a new TG app, I could just do:

$ tg-admin quickstart --template=ourcustomtemplate

Super convenient.

I’ve actually gone as far as creating my own Paste template for starting Django projects. So instead of doing ‘django-admin.py startproject’, I do ‘paster create --template=mycustomdjangotemplate’ and I get a Django project with my custom setup and config. The problem though is that if Django changes the code that startproject generates, I’ll have to manually update my Paste template to match it. If Django just used Paste for that functionality, I could subclass the default Django template and pick up those changes for free.
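For anyone curious what a Paste template actually is, it’s little more than a class pointing at a directory of files. A sketch (all names here are made up):

:::python
# sketch: a paste.script template for our standard Django project layout
from paste.script.templates import Template

class CustomDjangoTemplate(Template):
    _template_dir = 'templates/custom_django'   # settings.py, deploy scripts, etc.
    summary = 'Django project with our standard settings and deployment config'

The class gets registered under the ‘paste.paster_create_template’ entry point in the package’s setup.py, and from then on it shows up when you run ‘paster create --list-templates’.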

Eggs / setuptools support

The Django community for the most part doesn’t seem too interested in supporting setuptools distribution or distributing eggs for their apps. Setuptools is notoriously difficult to get your head around, but once you do, you can use it to rig up some really slick distribution and deployment mechanisms. I have a script that will, in one step, set up a virtualenv and install into it all the eggs that I’ve placed in a directory. Our automated deployment uses that to reliably and repeatably install an application and all its dependencies (the precise versions that I’ve selected and dropped in the eggs directory) onto our production or staging servers. I never have the problem of something that works in development breaking in production because prod had a slightly different (and incompatible) version of some library installed. You can rig this sort of thing up with distutils packages too, but eggs make it much easier. As with Paste templates, I already use this approach with Django, but it involves the extra step of generating eggs for Django and the third-party Django apps I use. The plugin mechanism that setuptools provides is also a thing of beauty and I can imagine quite a few ways it could simplify building and deploying reusable Django apps.
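To give a sense of how little is involved, a setup.py for a hypothetical reusable app (the name and version are invented) is only a few lines:

:::python
# setup.py sketch for a hypothetical reusable Django app
from setuptools import setup, find_packages

setup(
    name='ourcompany-tagging',
    version='0.3.1',
    packages=find_packages(),
    include_package_data=True,          # ship templates, media, etc. inside the egg
    install_requires=['Django>=1.0'],
    zip_safe=False,
)

Running ‘python setup.py bdist_egg’ drops a versioned egg into dist/, which is exactly the kind of artifact my deployment script sweeps up into the virtualenv.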

Image Thumbnailing Support in Core

sorl.thumbnail rocks. It’s always the first Django app I add to a project (every project I work on seems to involve image uploading at some point). It is my opinion that sorl.thumbnail, or something that provides equivalent functionality, should make it into the core since it’s such a common need.

CherryPy Style URL Dispatch

Django’s regex dispatch in urls.py is very powerful and flexible. It’s basically the same as Routes, but the plain regex syntax seems much more sensible to someone like me (with years of Perl experience) than Routes’ custom rules syntax. Still, it’s overkill for a lot of applications. I’m still a fan of CherryPy’s approach of mapping the URLs to a tree of controller objects. So ‘/foo/bar/baz/?p1=v1’ turns into a call to the ‘SomeRootObject.foo.bar.baz()’ method with p1=v1 as an argument. It may look really restrictive, but in my experience with TurboGears, you almost never need to do anything more complicated than that (and CherryPy does have mechanisms to let you do more complicated things for those few times when you need to). On my todo list is to write a CherryPy-style dispatcher for Django (not to replace the default urls.py approach, but to augment it) or just figure out how to shoehorn CherryPy’s dispatcher directly into a Django project. I know that probably no one else will care, but it will make me happy.
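The rough shape of what I have in mind is a single catch-all view that walks the path segments down a tree of controller objects. This is purely a sketch of the idea (nothing like it ships with Django, and the names are mine):

:::python
# sketch: CherryPy-style dispatch: /foo/bar/baz/?p1=v1 becomes root.foo.bar.baz(request, p1='v1')
from django.http import Http404

def object_dispatch(request, root, path):
    node = root
    for segment in [s for s in path.split('/') if s]:
        node = getattr(node, segment, None)
        if node is None:
            raise Http404
    if not callable(node):
        raise Http404
    # query string parameters become keyword arguments on the controller method
    return node(request, **dict(request.GET.items()))

Hooked up with a catch-all pattern at the bottom of urls.py (passing the root controller object in the extra-arguments dictionary), it would coexist fine with the normal regex-based patterns above it.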

A Pony

Oh, wait, Django already has one. Nevermind.

myopica.org

By anders pearson 05 Sep 2008

One of my other little Django side-projects is ready to show now too: myopica.org

I’ve been posting my artwork on flickr but I also use that for all my other random photography and whatnot. All the sharing functionality and the community of flickr are great, but it’s not really a very good portfolio and you have to kind of hunt around to find my paintings and drawings.

So I set up myopica.org as a proper portfolio site for my artwork. It’s extremely basic and stripped down so nothing really distracts from the artwork.

On To Django

By anders pearson 05 Sep 2008

Django 1.0 being released reminded me that perhaps I should mention here that thraxil.org is now running on Django.

Whenever I move most of my development to a new technology, rewriting the backend for this site in it is usually the final vetting stage for me. Thraxil.org isn’t the biggest, most complicated piece of software on the net, but it’s different enough from most blogs and I’m particular enough about certain things that I feel like if I can build this site with a language or framework and not run into any serious issues, it’s good enough for me to do my day to day development with it. So over the years it’s gone through a few different Perl incarnations with different databases behind it, switched to Python CGIs that published static pages, then CherryPy + SQLObject + SimpleTAL, and finally TurboGears. I started on a Plone version at one point when I was into that, but never quite got the momentum to push that one far enough to actually use in production.

I’ve been keeping a close eye on Django since it came out a couple of years back. At that time, Django wasn’t very polished, did a lot of things I didn’t like (many of those things were fixed with the “magic removal” branch), and I was already using CherryPy and SQLObject pretty comfortably, so it didn’t appeal to me very much. TurboGears was announced soon after and, being based on CherryPy and SQLObject, it was a no-brainer for me to move to it. But Django kept chugging along the whole time, adding features, refactoring code, polishing things up, picking up a larger and larger community, and generally eliminating my objections one by one. Earlier this year, I took some time to play with it again and was impressed at how far along it had come, so I started building small test apps that got larger and less trivial over time. I built a few Django sites for myself, built and launched The FU for a friend, and basically decided it was ready to become my preferred framework (well, I’ll probably still use Bourbon for writing microapps). So it was time to re-write thraxil.org.

I’d also been growing increasingly frustrated with the unreliable OpenVZ-based virtual server that the site has been hosted on for the last couple of years, so the rewrite also included a move to a nice SliceHost Xen server.

Of course, no rewrite is ever just a rewrite. There are always design changes to make and new features that manage to creep their way into the scope of the rewrite. In this case, it was mostly things I wanted to simplify. I’ve mostly been using delicious and flickr for getting my bookmarks and images up on the web, so those sections of the site have been superfluous and needed to go (technically they’re still around, just made much less prominent in the display and site architecture). I’ve also become more aggressive in my hatred of comment spam (and the stupid, random, drive-by comments from strangers) so comments here are now moderated. I.e., when a comment is posted, I get an email and I have to approve the comment or it doesn’t show up on the site.

The switch to Django actually went even smoother than I was expecting, even after building a few Django sites beforehand. It felt a little like cheating. Since Django’s ORM and SQLObject have such similar requirements on the database schema (integer primary keys on every table, etc.) I could actually just keep the database as-is and not have to worry about redesigning the schema or migrating data. Django’s “inspectdb” functionality gave me a basic models.py in seconds. It took five minutes more to tweak models.py a bit and enable the Django admin application. So in literally minutes, I had a Django site running with all my data and a completely usable admin interface. From there, it was mostly just going through the list of different page types on the site, creating a view for each, mapping it to the right URL, and making a template for it. Altogether, after two evenings staying a bit late in my office, I had pretty much the whole public side of the site ported over and ready for production. That includes all my experimentation with the visual design of the site (probably more of my time was spent fighting with CSS than with any of the backend code). There are still a few odds and ends here and there that I need to put back in (search, per-tag atom feeds, etc.) but it’s mostly done.

I also have to admit that I’m impressed with the performance. The old site was TurboGears + SimpleTAL + PostgreSQL running proxied behind lighttpd on an OpenVZ server. I made heavy use of memcached so the vast majority of requests were just getting served out of cache and TG barely ever had to hit the database. That was good enough for it to survive a post hitting #1 on reddit without the server really breaking a sweat. The new setup is Django + PostgreSQL being served by Apache with mod_wsgi (in daemon mode) on a Xen server. I decided to benchmark before I bothered setting up memcached and found that the new setup, without caching, is already faster than the old one was with caching enabled. It’s not a totally fair benchmark since I don’t really know anything about the underlying hardware that each virtual server was running on and the new, simpler design is probably doing a little less work than the old one was. Still, unless I start to see performance issues, I’m just not even going to bother setting up caching.

I haven’t even begun to make use of the myriad of pluggable Django applications that are out there ready to be integrated into the site. I have a few ideas, but for now I’m digging the new, stripped down version of things.

Overall, though, I’m quite pleased with Django so far. I imagine I’ll have more to say about it in the near future.

Subprocess Hanging: PIPE is your enemy

By anders pearson 13 Mar 2008

Every once in a while, I run across a bug or a tricky problem where googling for a solution doesn’t turn up much. When I come up with a solution, I like to write it up and put it online so the next person to come across it hopefully will have an easier time figuring it out. This is one of those posts.

One of the internal applications I wrote at work does a lot of work via external programs. It’s basically glueing together a whole bunch of shell scripts and putting a nice interface on them.

Running an external program from Python isn’t very hard in the simple case. There’s actually a wealth of options available. The entry level is to use os.system() and give it the command as a string. That gives you the return code but doesn’t give you the output of the command.

For what I’m doing, I need to have access to the return code, STDOUT, and STDERR. Requirements like that lead to the [os.popen*][os-popen] functions. Basically, something like:

:::python
import os
# popen3 returns file objects wired to the child's stdin, stdout, and stderr
(c_stdin,c_stdout,c_stderr) = os.popen3(cmd,'r')
out = c_stdout.read()
err = c_stderr.read()
c_stdin.close()
c_stdout.close()
c_stderr.close()

There are still problems with that. The environment that the child command runs in (variables, cwd, tty, etc.) is the same environment that the parent is running in. So to set environment variables for the child, you have to put them into os.environ in the parent, and to set the cwd for the child command, you have to have the parent do an os.chdir(). That can be troublesome in some situations. E.g., if the parent is a CherryPy process, doing an os.chdir() makes it hopelessly lost and it will crash. So you have to fork() a child process, set up the environment there, do the above code, and then pass the results back to the parent process.

This has been enough of a pain point for Python programmers that Python 2.4 added the subprocess module. The code above can be replaced with:

:::python
from subprocess import Popen, PIPE
p = Popen(cmd,stdout=PIPE,stderr=PIPE)
(out,err) = p.communicate()

Aside from being a little shorter, subprocess.Popen() also takes additional arguments like cwd and env that let you manipulate the environment of the child process (it does the fork() for you). It basically gives you one very nice interface for doing anything and everything related to spawning external commands. Life is generally better with subprocess around.
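For instance, something along these lines (a trivial sketch, not the actual code from the app in question):

:::python
# sketch: run a command with its own working directory and environment
import os
from subprocess import Popen, PIPE

cmd = ["ls", "-l"]
child_env = dict(os.environ, LC_ALL="C")   # tweak only what the child sees
p = Popen(cmd, cwd="/tmp", env=child_env, stdout=PIPE, stderr=PIPE)
(out, err) = p.communicate()
print p.returncode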

Unfortunately, there is a subtle, devious bug in that code. I know this after encountering it and spending many hours trying to figure out what was going on.

Where I encountered it was when the command being run was doing an svn checkout. The checkout would run for a while and then the svn command would hang at some point. It wouldn’t use CPU, there would be no error messages. The process would still show up in ps or top. It would just stop and the parent process would sit and wait for it to finish. Complete deadlock. Running the exact svn command on the commandline, it would run with no problems. Doing an svn checkout of a different repository would work fine. Kill the hung svn process and the parent would complete and STDOUT would show most of the expected output from the svn checkout. With the particular repository, it would always hang at exactly the same spot; completely repeatable.

How could an svn checkout of a particular repository hang, but only when run via subprocess?

After much frustrating debugging, searching, and experimentation, I narrowed it down to the output of the svn command on STDOUT. If I added a -q (quiet) flag, it would complete without hanging. I eventually noticed that the output that had been collected in STDOUT after killing the svn process was right around 65,000 characters. Since 2^16 is 65536, that seemed like a coincidence worth investigating. I wrote a test script that just wrote 2^16 characters to STDOUT and ran it via subprocess. It hung. I modified it to print 2^16 - 1 characters to STDOUT. No hanging. The troublesome svn repository happened to have a lot of files in it, so a lot of verbose output on the checkout.

A closer inspection of the subprocess.Popen docs revealed a warning “Note: The data read is buffered in memory, so do not use this method if the data size is large or unlimited.” I’d probably read that before and assumed that it was a warning about possibly consuming a lot of memory and being inefficient if you try to pass a lot of data around. So I ignored it. The STDOUT chatter of shell scripts that I was collecting for logging purposes did not strike me as “large” (65K is positively tiny these days) and it isn’t an application where I’m particularly concerned about memory consumption or efficiency.

Apparently, that warning actually means “Note: if there’s any chance that the data read will be more than a couple pages, this will deadlock your code.” What was happening was that the pipe’s buffer has a fixed size (64KB on a typical Linux kernel, which lines up neatly with the 2^16 figure). When it filled up, the child process could no longer write to it. The child would then sit and patiently wait to be able to write the rest of its output, while the parent sat waiting for the child to finish.

Luckily the solution is fairly simple. Instead of setting stdout and stderr to PIPE, they need to be given proper file (or unix pipe) objects that will accept a reasonable amount of data. (A further hint for anyone who found this page because they encountered the same problem and are looking for a fix: Popen() needs real file objects with fileno() support so StringIO-type fake file objects won’t work; tempfile.TemporaryFile is your friend).
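In other words, the earlier snippet becomes something like this (a sketch, with error handling left out; the command here is just a chatty stand-in for the svn checkout):

:::python
# sketch: hand the child real file objects so it can never fill up a pipe buffer
import tempfile
from subprocess import Popen

cmd = ["ls", "-lR", "/usr/lib"]        # produces plenty of output
out_f = tempfile.TemporaryFile()
err_f = tempfile.TemporaryFile()
p = Popen(cmd, stdout=out_f, stderr=err_f)
returncode = p.wait()                  # safe: the child writes to files, not a pipe
out_f.seek(0)
err_f.seek(0)
out = out_f.read()
err = err_f.read()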

This strikes me as kind of a poor design. Subprocess is wonderful in most ways and a real improvement over the old alternatives. But with this kind of issue, the programmer using it will probably not encounter any problems in development and think everything is fine but some day will find their production app mysteriously deadlocked and have almost no clues as to what’s causing it. That seems like something that deserves a big flashing red warning in the docs every time PIPE is mentioned.

[os-popen]:http://docs.python.org/lib/os-newstreams.html#os-newstreams

books

By anders pearson 03 Dec 2007

It’s been a busy month. The NaGraNoWriMo was a success. I ruined my health, alienated my family, and none of my friends remember what I look like, but I made it all the way to 50 pages after all.

The result, “Error And Annihilation”, is up on Flickr here. It’s also available to order in dead tree format.

One book just wouldn’t be enough for me this year though. So I also now have available Myopica, which is the sequel to last year’s Nearsighted and Obsessive-Compulsive and contains pretty much everything I’ve done in 2007 (except what’s in Error And Annihilation). There’s also a corresponding (though not sorted) flickr set for convenient browsing.

As usual, those are both available as free PDF downloads and the books are for sale at cost (ie, I make no money on any of this). Happily, Lulu.com also now has an option for Public Domain licensing so that’s what they are.

Again, thanks to Marc Raymond for laying them out and making everything look more professional than it is.

Error And Annihilation

By anders pearson 01 Nov 2007

My NaGraNoWriMo shall now commence.

I’ve been convinced to scale back my ambitions to just 30 pages, or about three to four hours per day instead of my original goal of 50 pages.

I’ve got a fresh large Moleskine sketchbook with 30 pages pre-numbered, a pack of nonrepro blue leads for my mechanical pencil, a small collection of Sakura Micron pens, and I’m ready to go.

My working title for the book is “Error And Annihilation”. I’ll try to post an update now and then, but no guarantees.

A Simple Programming Puzzle Seen Through Three Different Lenses

By anders pearson 30 Oct 2007

The other day, I stumbled across Mark Nelson’s blog post describing a fairly simple NPR word puzzle: “Take the names of two U.S. States, mix them all together, then rearrange the letters to form the names of two other U.S. States. What states are these?”

Mark Nelson is a programmer, so his first instinct of course was to write a small program to solve it. Mine too. I immediately stopped reading his post, opened a new emacs buffer and spent five minutes coming up with this:

:::python
states = [
    "alabama", "alaska", "arizona", "arkansas", "california", "colorado",
    "connecticut", "delaware", "florida", "georgia", "hawaii", "idaho",
    "illinois", "indiana", "iowa", "kansas", "kentucky", "louisiana",
    "maine", "maryland", "massachusetts", "michigan", "minnesota",
    "mississippi", "missouri", "montana", "nebraska", "nevada",
    "newhampshire", "newjersey", "newmexico", "newyork", "northcarolina",
    "northdakota", "ohio", "oklahoma", "oregon", "pennsylvania", "rhodeisland",
    "southcarolina", "southdakota", "tennessee", "texas", "utah", "vermont",
    "virginia", "washington", "westvirginia", "wisconsin", "wyoming"]
seen = dict()
for state1 in states:
    for state2 in states:
        if state1 == state2:
            continue
        letters = list(state1 + state2)
        letters.sort()
        key = "".join(letters)
        if seen.has_key(key):
            (old1, old2) = seen[key]
            if old1 == state2 and old2 == state1:
                continue
            else:
                print "found it: %s + %s, %s + %s" % (
                    state1, state2, old1, old2)
                exit(0)
        else:
            seen[key] = (state1, state2)

Sure enough, it gave me the answer (and like Mark Nelson, seeing it made me realize that it probably would have been even more productive to just think about it for a minute instead of writing the code).

To me, that Python code is about as simple and direct as it gets. There’s nothing clever going on there. Just loop over the cartesian product of states x states, skip a couple of obvious cases, normalize each pair of state names by sorting their combined letters, and put the result into a dictionary. As soon as you hit a combination that’s already in the dictionary, you have your answer.

That approach to me would qualify as a “naive” or “brute force” approach. I figured there might be a more optimized approach, but with only 50 states it wasn’t really worth optimizing any more. Running it took all of 18 milliseconds.

When I went back and read the rest of the blog post, my jaw dropped. His first pass, in C++, involved a quadruply nested loop, STL libraries, and was taking upwards of an hour to run. Looking at his code, he really took the notion of a “brute force” approach to a level I hadn’t even considered. His second and third passes improved the runtime and even simplified the code, but they still run on the order of a few seconds. That’s for a compiled, low level language. My Python code, written without any particular thought toward efficiency and running in an interpreter (so the time to start up the Python process, parse the code, and interpret it directly, line by line, without compiler optimizations is all included in the running time), was beating it by a couple of orders of magnitude.

The key, of course, is the algorithm. Python, like Ruby, Perl, and other relatively high level languages, supports a dictionary (or hashtable if you prefer) data type at the syntax level. As a result, anyone programming in one of those languages quickly learns how to make the most of them and becomes familiar with a number of idioms, including the one I used of testing for uniqueness by keeping a ‘seen’ dictionary, inserting keys one by one and looking for a collision. It’s dead simple, commonly used in scripting languages, and extremely efficient since inserting into a hashtable is O(1) and tends to be one of the most finely tuned parts of a scripting language’s interpreter/VM.

There’s nothing Python specific about the algorithm. In fact, in the comments on the post, Vince Huston submits a C++ solution that’s basically identical to my approach (and probably outperforms everyone else handily). If I were forced to solve the same problem in C++, I would probably have come up with something similar. I would not be at all surprised if Vince Huston has had some experience coding in scripting languages and Mark Nelson hasn’t.

The point isn’t that Mark Nelson is a bad programmer. Far from it. Looking around at the rest of his site, at articles like his explanation of the The Byzantine Generals Problem (which was how I ended up on his site in the first place), and at the books he’s written, I’d guess that overall, he has more breadth and depth to his understanding of algorithms and programming than I do.

My point is really just to repeat the tired, broken record cry of advocates of higher level languages that 1) using a faster, more appropriate algorithm will usually go much further in optimization than low level optimizations (using a compiled language, static types, fretting over clock cycles in an inner loop, etc) and 2) learning different programming paradigms, languages, and idioms will improve your programming even if you end up going back to a lower level language. In this case, some familiarity with dictionary idioms common in scripting languages helps immensely in producing a fast, simple solution.

Another comment on his post goes even further. Tom Moertel submits a solution implemented in Haskell. From a performance standpoint, it’s pretty much the same approach as mine, using a dictionary (in this case, Haskell’s Data.Map library) to do the comparisons on normalized strings. What makes it a really nice solution though, is that he approaches it by starting with a “clusterBy” higher order function that takes a signature function and a list and returns a list of groups clustered by the signature (my explanation is bad; his examples make it much more clear). Essentially, instead of directly solving the problem, he creates a general purpose tool which can then trivially be applied to the specific problem at hand. clusterBy is the kind of function that I could see being useful to other problems. So not only does he end up with the solution to this problem, he also ends up with a useful tool to make future programming tasks simpler. Haskell’s smooth support of higher order functions makes this a natural approach and it seems to be the way that proficient Haskell programmers end up taking on most problems.

Python had the good taste to steal a few items from Haskell’s bag of tricks though, so I was inspired to try a Python version. Here’s a rough equivalent of clusterBy:

:::python
def cluster_by(f,lst):
    transformed = [f(x) for x in lst]
    d = dict()
    for t,i in zip(transformed,lst):
        d.setdefault(t,[]).append(i)
    return d.values()

Not as pretty as the Haskell, but works essentially the same way.
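A quick illustration of what it does (my example, not Tom Moertel’s):

:::python
# group words by a signature function; here the signature is just the length
words = ["one", "two", "three", "four", "five", "six"]
print cluster_by(len, words)
# => [['one', 'two', 'six'], ['four', 'five'], ['three']]  (cluster order may vary)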

Then, to solve the original problem, we need a normalize function:

:::python
def normalize(t):
    letters = list(t[0] + t[1])
    letters.sort()
    return "".join(letters)

It takes the tuple of two state names, sorts all the letters in them, and returns that as a string. The final piece is just to apply cluster_by with the normalize function to the cartesian product of states x states and find the resulting clusters with more than two entries; every pair shows up twice, as (A,B) and as (B,A), so a cluster bigger than two means a genuine second pair of states with the same letters (and since I was going Haskell-style, I decided to use list comprehensions instead of for loops as well):

:::python
clustered = cluster_by(normalize,[(x,y) for x in states for y in states])
print [x for x in clustered if len(x) > 2]

It runs in 25 milliseconds on my machine, so slightly more overhead than the procedural approach, but close enough to not matter and, IMO, it’s cleaner and would lend itself more to further reuse of the cluster_by or normalize functions.

So first knowing an idiom from a scripting language can produce better C++ code (see Vince Huston’s solution), then taking a lesson from Haskell I’m able to come up with better Python code.

I’m waiting for a Prolog programmer to come along and smoke everyone next.

NaGraNoWriMo

By anders pearson 25 Oct 2007

At some point a while back, my little meditational abstract doodles started occasionally getting boxes around them. When there were several of those on a page, I noticed that they sort of resembled the panel layout of a comic. I kind of liked that so I experimented more in that direction, playing with deliberate panel layouts and borrowing from the vocabulary of comics and sequential art while keeping my subject matter abstract. Now I’ve got a fairly large collection of these abstract comics and it’s a style that I think is still interesting and rich for experimentation.

Lately I’ve found myself in several conversations with friends about National Novel Writing Month (NaNoWriMo). The executive summary if you’re too lazy to click the link and aren’t familiar with it: you give yourself the month of November (30 days) to write a 50,000 word novel. It doesn’t have to be good or make much sense or anything, it just has to be started and finished within the month. I’ve always thought it was a cool idea. A bunch of my friends have attempted and at least one has even successfully completed a NaNoWriMo.

As a writer, I’m not really interested in writing fiction so it isn’t something I’d be up for myself. The idea continues to intrigue me though so I’ve decided that this year I want to try to do an abstract graphic novel NaNoWriMo style. A “NaGraNoWriMo” I guess (calling it a “NaAbsGraNoWriMo” would just be taking it too far.)

Obviously, I can’t use word count as a metric for my graphic novel. I’ve found that laying out, drawing, and inking a full page of panels takes me about three to four hours with the level of detail that my style involves. The notion that “a picture is worth a thousand words”, I think, actually works out pretty well for my purposes: 50,000 words = 50 pages = 1.67 pages per day = about 6 hours per day of drawing for 30 days. I think that’s a similar magnitude of challenge to what other NaNoWriMo writers are attempting.

That’s the goal, anyway. I’ve got a lot of other things going on in November so it’s not going to be easy to find all that time. I’ll try to post the pages to flickr as I go, but scanning and posting takes time that I could be drawing so that will be a lower priority than finishing.

If you don’t see me outside my apartment or hear from me next month, now you’ll know why.

[Yes, I know that someone else has used the term “NaGraNoWriMo” for something similar. Their approach was to write a “script” for a 175 page graphic novel. I honestly don’t know quite what a script for a graphic novel is but I’m pretty sure that it wouldn’t work for an abstract one. I like my idea better.]