verizon pain

By anders pearson 17 Aug 2006

As the programmer in the family, I’m the de facto tech support guy. No way around it. I don’t mind too much except that I really only know Linux stuff well. I haven’t had Windows installed on any computer of mine since about 1999. If something there doesn’t work immediately, I’m about as helpless as the average Windows user would be on Linux.

So when my sister up in Maine bought her first computer a few weeks ago, I got in there early and installed Ubuntu on it while I was visiting her. All she really wants to do is basic web surfing and email and maybe upload some pictures from a digital camera. Ubuntu is way more than enough for this, plus I know it well, it’s easy for me to set it up so I can remotely administer and troubleshoot it for her, and it’s pretty secure by default. If I’d left Windows on there it would have turned into a virus-infested mess in no time and I’d probably be stuck walking her through a reinstall every few months. Honestly, if I’d been able to talk to her before she bought her computer, I’d have steered her towards a Mac of some kind, but I couldn’t, so she just got a cheap eMachines box and Ubuntu will have to do.

Installation was pretty straightforward. I ran into some problems just because the machine only came with 256MB of memory (which I think is criminally low). The current version of Ubuntu (Dapper Drake) installs from a live CD, and live CDs like to have a bit more memory to work with. So the installer was really dragging and I had to get a bit creative about going in and killing off every process that wasn’t strictly necessary. If I’d known that she had that little memory, I would have either bought more for her before installing the OS, or I would have just used a slightly older version of Ubuntu that still had the separate install disk that does everything from the command line and doesn’t need lots of memory to run. Anyway, other than that, it was completely painless. Ubuntu detected all the hardware automatically and ran smoothly once it was installed on the drive. I plugged in my digital camera and iPod and it detected them both and started up the appropriate programs without any trouble. From what I’ve seen, the tables have turned and Ubuntu currently has better hardware detection and support than Windows. The camera that worked trivially on her computer is one I can’t get a Windows machine to recognize without manually installing drivers from a CD that I’ve long since lost.

Unfortunately, the one thing I couldn’t set up for her while I was visiting was the network connection. She’d called Verizon and ordered a DSL account from them. While I was there, she still hadn’t gotten the DSL modem/router yet in the mail. I tried to talk her into going with a cable modem instead of DSL because I know that it’s usually a much simpler setup, but the only option where she lived would have cost twice as much as the DSL and she would have had to have some things rewired to get cable to the room with her computer (or a really long ethernet cable I guess). Plus the DSL came with a phone line that she was planning on getting anyway (she’s been on just a cellphone for the last few years).

This left me in the position of having to walk her through the whole network setup over the phone this morning. Verizon, of course, doesn’t officially support anything other than Windows XP and Mac OS X, but then I have yet to meet an ISP that officially supports Linux and I’ve always been able to get my stuff connected, so I figured it couldn’t be too bad.

Oh god, was it bad. Ultimately we were able to get her online, but it was not a simple process and involved making use of a coworker’s Mac to complete one of the steps. I’m writing up the steps that we had to go through here in case anyone else out there is trying to get a similar setup working. Perhaps some of the information here will be useful.

The box she got in the mail contained a DSL modem/wireless router and a Windows/Mac CD-ROM installation disk. No instructions, no account info.

Got her to plug in the router and put the right cables into the right jacks and turn everything on. So far, so good. Ubuntu saw that it had an ethernet connection and got itself an IP address from the router via DHCP. That involved nothing beyond plugging things in and turning them on.

Then she starts up a browser and tries to go to google.com. What she gets is a website telling her that her operating system is unsupported. In other words, they set up a portal system like you sometimes see with for-pay wireless networks in airports, etc., where you get an IP and everything via DHCP, but the router always sends you to a login page when you first get on so you can enter your account info and it will authenticate you. Except with the additional twist that the login page was doing a browser detect and not letting her even access that.

The easiest thing at this point would have been for her to borrow a Windows or Mac laptop from someone and go through the activation process on that. My sister lives in rural Maine though and basically doesn’t know anyone within a few hundred miles with a laptop. So our options were to either figure out some way to do it from Linux, or give up, reinstall Windows on her computer for the DSL activation, and then reinstall Linux after that. I don’t give up that easily though.

I found some phone numbers that she could call to set up her account username and password over the phone and sent her off to do that while I did a bit more research on the problem.

My assumption at this point was that it would be like a typical portal setup and it was just a web page that she needed to get access to and there was an overly restrictive browser detect script keeping her from getting to something that would probably work just fine if she could get to it.

So next I took her through hacking her Firefox config to spoof the User-Agent string to make it report itself as Internet Explorer so she could get to the activation web page. The steps for this are:

  • enter ‘about:config’ into the browser url bar and hit enter
  • right-click anywhere on the preferences page that comes up and select ‘New -> String’
  • put in ‘general.useragent.override’ as the preference name and hit enter
  • enter ‘Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)’ as the value and hit enter
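Incidentally, the same spoofing trick works from a script if you just want to see what a browser-sniffing server sends back; here’s a rough sketch using Python’s standard library, with a placeholder URL standing in for whatever page is doing the detecting:

:::python
# fetch a page with a spoofed User-Agent and show what comes back
# (http://example.com/login is a placeholder, not Verizon's actual activation page)
import urllib.request

IE6_UA = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
req = urllib.request.Request("http://example.com/login",
                             headers={"User-Agent": IE6_UA})
with urllib.request.urlopen(req) as response:
    print(response.status, response.read()[:200])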

The override worked nicely and she now had a browser that would look exactly like IE6 to Verizon’s servers. When she went back to the login page, she was able to see it this time, although then we learned that they weren’t kidding about only supporting IE: the page was a big garbled mess in Firefox (web standards? what are those?). Just to see what she was encountering, I got the url from her, which was http://activatemydsl.verizon.net/, and loaded it on a Windows machine in our office. There’s a form there for entering a phone number and zip code to set up a new account, or entering a username/password to activate an existing account. So she probably hadn’t needed to do that stuff over the phone; we could have done it through the web here. But we already had the account info at this point, so I put that in and… got a page with an ActiveX control. Ouch. There was no way we were going to get that working on her Linux box. Actually, I couldn’t even get it to run on the Windows machine that I was on.

I got one of my coworkers with a Mac to go to the site; it gave him a .dmg file, which we unpacked, and we ran the Mac installer program it included. This program had us put in her account info and click through a Terms of Service agreement, etc. It ended by installing a bunch of crap on his computer and informing us that the account was now activated.

Back on her computer, she was still getting the same garbled login screen instead of websites, so I figured that one of the things the installer program had probably taken care of was putting the PPPoE username and password into the router’s settings.

This is usually easy enough to do manually though. I got her local IP address (192.168.1.65) and from that could figure out the router’s IP (192.168.1.1) which I had her connect to in Firefox. We hit a bit of a stumbling block here because the router didn’t come with a manual so she didn’t know the router’s admin username/password. They generally all come with the same defaults set for a given model though. I googled for the default for her router. At first I only searched for the manufacturer and not the exact model number and the default passwords listed for them didn’t work. I was starting to get worried that Verizon had installed custom firmware on the routers with a different admin password so users couldn’t mess with it or something. But when I got her to read off the exact model number to me I was able to find a different password that actually worked. From there, we just went through the router’s setup and entered her Verizon account info and everything worked after that.

At this point, I’m not sure if the step of running the installer program on my coworker’s computer was strictly necessary, or if it would have worked had we gone straight into the router and put in her username/password there. My guess is that running the installer was necessary so the TOS could be agreed to and a flag set in Verizon’s central database marking the account as valid. If we’d skipped that step, I think we probably wouldn’t have been able to connect even with the PPPoE username/password set. If that’s indeed the case, then it really is impossible to get a Verizon DSL connection set up without at least knowing someone with a Windows or Mac box who can do the activation for you. If it isn’t necessary, then we could have done it all from Linux (though without the router manual, we still would’ve needed an existing internet connection somewhere so we could google the default router password).

Now, if Verizon had just made their activation process a normal website that worked in any browser instead of a full desktop application that requires a particular OS, and had included the manual for the router they sent her, this whole process would have been almost completely painless.

My obligatory rant is about how this sort of thing makes Linux adoption harder than it should be. Ubuntu did absolutely everything it could to make it all easy and painless for the user. There’s nothing that any Linux programmer could have done to make it easier for us. The problem was entirely that Verizon was doing a number of things that made it very difficult for a Linux box to connect. But for someone a little less persistent than me and without the same knowledge and resources, if they’d installed Ubuntu on their computer and then tried to connect to Verizon with it, they’d come out of the experience thinking “Ubuntu must be broken. It can’t even connect to DSL.” Linux adoption is a chicken and egg problem. Until lots of people are running Linux, it doesn’t make sense for someone like Verizon to officially support it, but if it’s this hard to connect to their service, not very many of their customers are going to run Linux. As an open source developer myself, I feel some real empathic frustration for the Ubuntu developers. They’ve done an amazing job making an OS that just works out of the box for almost everything. If adoption were just a matter of writing code, they would have solved it by now. But instead, there are these external forces in play that make it an uphill battle for them no matter how great a product they create.

generic functions coming to python?

By anders pearson 06 Apr 2006

About once a year or so, I find myself spending some time immersing myself in the world of Lisp (and Scheme). Lisp is sort of a proto-programming language. Lisp fans make a strong argument that it’s on an entirely different level from all other languages. I pretty much buy into that argument, but I’ve never quite been able to get it to click for me to the point where I would feel comfortable writing a real, production application in Lisp. I always start out with the intention of writing something useful and large enough to get myself over that hump, but I always either get distracted with other stuff or hit some wall (usually involving just not being able to get some library I want to use installed on whatever Lisp interpreter I’ve chosen).

Nevertheless, my little excursions into the land of parentheses have always been worthwhile. It always seems that I pick up some new or deeper understanding of some aspect of programming. When I first went through the exercises in SICP, it was a revelation for me. Suddenly all kinds of things like recursion, first-class functions, data-driven programming, and constraint systems made much more sense. Later attempts taught me the power of code as data, of having full, unrestrained access to the abstract syntax tree of your running program, and the power of a true macro system (I still don’t feel comfortable with Lisp’s macro syntax, but I’ve at least decided that it’s worth learning at some point). Scheme was also my first real exposure to a proper, interactive REPL environment, and that style of development impressed me enough that when Python came along, I was probably more willing to give it a look because it had a nice interactive interpreter. Common Lisp and SLIME are still the gold standard in that department though.

Lisp certainly isn’t the only language that I’ve learned valuable lessons from though.

  • Haskell gave me a lot of respect for pattern matching, list comprehensions, lazy evaluation and (along with SML) powerful type inference. It’s another language that I intend to keep coming back to because I’m sure it has more to offer me. Haskell also changed my mind from “significant whitespace is evil!” to “significant whitespace is a bloody fantastic idea!” and again, probably helped butter me up for python.
  • Since C was my first real programming language, I won’t even attempt to list all that it taught me.
  • Assembly is invaluable for peeling back the curtain and knowing what really goes on in the black box. Plus, spending time down there will free you of all fear of pointers or hex math.
  • I can’t think of anything too specific that I learned from FORTH, but it definitely warped my mind and showed me just how much you can take out of a language and still have something powerful.
  • Perl taught me that if you have lists, hashes, and references, you can pretty much get away without any other data structures for 99% of real world applications. It was also my first real exposure to the power of closures.
  • Java taught me how easy and dangerous it is to over-engineer.
  • Smalltalk taught me about code blocks and how to think about OOP as messages instead of function invocations.
  • What little Ruby I’ve played with has taught me that I’m not really sure why anyone would bother writing in Ruby when Smalltalk exists, but I still plan on spending more time exploring there before I discount it entirely.
  • Erlang has taught me how to think about distributed architectures (forming much of the basis for my REST obsession and the microapp stuff like Tasty and Pita that I’ve been pushing lately). It also taught me that most “enterprise” approaches to building high-reliability, scalable systems are misguided.
  • Python is a constant reminder of the power of simplicity and elegance and, for me, has come to best embody the “make the easy problems easy to solve and the hard problems possible” principle.

Still, Lisp remains the wellspring of mindbending “AHA!” moments for me. My most recent delving into Lisp was with Peter Seibel’s book “Practical Common Lisp”. On that last pass through, generic functions finally clicked for me and I was able to see them as the conceptual underpinnings of object oriented programming. Suddenly, things like aspect oriented programming and python’s decorators were just trivial applications of this one root idea.

Now, I see that Guido’s thinking about adding generic functions to Python 3000. Clearly, I think this is a good idea and I’m really curious to see how the syntax comes out. PJE’s implementation in PEAK is interesting, but having it in the syntax of the language should allow it to be even smoother.
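To make the idea a little more concrete, here’s a toy sketch of a generic function in plain Python: implementations are registered per type, and the one that runs is picked by the type of the first argument. This is purely illustrative and has nothing to do with PJE’s implementation or whatever syntax Guido ends up with.

:::python
# toy single-dispatch generic function: which implementation runs is chosen
# by the type of the first argument (purely illustrative)
_registry = {}

def implement(typ):
    def decorator(func):
        _registry[typ] = func
        return func
    return decorator

def describe(obj):
    # walk the MRO so subclasses fall back to a base class implementation
    for cls in type(obj).__mro__:
        if cls in _registry:
            return _registry[cls](obj)
    raise TypeError("no implementation for %r" % type(obj))

@implement(int)
def _(n):
    return "the integer %d" % n

@implement(str)
def _(s):
    return "the string %r" % s

print(describe(42))      # -> the integer 42
print(describe("spam"))  # -> the string 'spam'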

migrating applications between servers with (virtually) zero downtime

By anders pearson 27 Mar 2006

The problem: You need to move a web application to a different server. E.g., an application accessible as app.example.com is currently running on a machine with the IP address 1.2.3.4 and it needs to be moved to 1.2.3.5. This will involve a DNS change. DNS updates, unfortunately, aren’t very predictable. Unless you administer your own DNS servers, you probably don’t have much control over exactly when a change goes out. Even if you can control DNS, it can take up to a day for the new DNS entry to propagate out to users (because it may be cached at several points in between).

If you admin your own DNS, you can lower the TTL (time-to-live) and do the update exactly when you want and that will probably be good enough. Most of us don’t have that level of control over our DNS though. At work, we don’t have direct control over DNS; the network admins take our requests and push everything out in a nightly batch, with a relatively high TTL. For thraxil.org and other sites that I run on my own, I rely on free DNS services which I also have no real control over. Pretty much anyone who isn’t an ISP is probably in a similar situation.

At work we’ve come up with a way to do the move with basically zero downtime and complete control over the exact time that the transition happens. It doesn’t require any admin access above and beyond control over the two web servers involved. IMO, it’s just about as simple as messing with DNS TTL stuff.

We’re probably not the first to have come up with this technique. It’s fairly simple and obvious if you’re familiar with apache configuration. I haven’t seen it mentioned anywhere before though and we had to figure it out ourselves, so clearly it’s not as well-known as it ought to be.

The technique basically revolves around apache’s mod_rewrite module and its ability to proxy requests. Make sure it’s installed and enabled on your servers. With some setups, you’ll also need to ensure that mod_proxy is installed for mod_rewrite to be able to do proxying, so make sure you understand how your apache server has been compiled and configured. Technically, you only need it set up on the old server (that’s where the proxying happens), but it’s useful in general so you’ll probably want it on both.

The procedure starts with getting the application running on both servers. On the new one, it will have to be running on some other hostname temporarily so you can test and make sure it’s all working. Then, at a point in time that’s convenient for you (late night, early morning, etc.), you replace the configuration on the old server with a proxying directive that proxies all requests coming in to that server over to the new one. Now, you can do the DNS change (or request the change from the DNS admins). Once the proxy is running, any request to either the old IP address or the new one will end up going to the new server. Once the DNS change has gone through and has had time to propagate out to everyone (checking the logs on the old server to make sure there are no more requests to it is a good way to be fairly sure), you can safely turn off the proxy and decommission the old server.

Here’s a more detailed example. Say we have an application running with a hostname of app.example.com, which is an alias for the IP address 1.2.3.4. We want to move it to 1.2.3.5 with as little downtime as possible.

First, we register app-new.example.com to point to 1.2.3.5 (or just set it up in /etc/hosts since it doesn’t really have to be public). We get the application installed and running, probably against a test database, on 1.2.3.5 and make sure that everything works like it should.

On 1.2.3.4, apache’s configuration probably has a virtual host section like:

:::apacheconf
<VirtualHost *>
   ServerName app.example.com
   # ... etc. 
</VirtualHost>

We duplicate the same virtual host section to the apache config on the new server, 1.2.3.5. It’s a good idea to then test it by overriding app.example.com in your /etc/hosts (or your OS’s equivalent) to point to the new server.

Then, on 1.2.3.4 we change it to:

:::apacheconf
<VirtualHost *>
  RewriteEngine On
  RewriteRule (.*) http://app-new.example.com/$1 [L,P]
</VirtualHost>

If there’s data to migrate, we then take apache down on 1.2.3.4, migrate the data to the new server, and bring apache back up. If data needs to be migrated, there isn’t really a way to avoid some downtime; otherwise you run the risk of losing data during the move. Having a script or two pre-written to handle the data migration is usually a good idea to ensure that it will go quickly and smoothly. There are other tricks to minimize this downtime, but for most of us a few seconds or a few minutes of downtime isn’t the end of the world (and is highly preferable to lost or inconsistent data).

At this point, requests to app.example.com will be proxied over to 1.2.3.5 without users noticing anything except perhaps a little extra latency.

Then the request is put in to the DNS admins to change the app.example.com alias to resolve to 1.2.3.5 instead of 1.2.3.4.

Eventually, that will go through and users will start hitting 1.2.3.5 directly when they go to app.example.com. For a while, there will still be a trickle of hits to 1.2.3.4, but those should fade away once the DNS change propagates.
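A cheap way to keep an eye on the cutover from any particular machine is to just resolve the name and compare it against the new address; here’s a small sketch using the made-up hostname and IP from the example:

:::python
# check whether app.example.com has started resolving to the new box
# (hostname and address are the made-up ones from the example above)
import socket

NEW_IP = "1.2.3.5"
ip = socket.gethostbyname("app.example.com")
if ip == NEW_IP:
    print("resolving to the new server:", ip)
else:
    print("still getting the old (or a cached) record:", ip)

Of course, that only tells you what your own resolver sees; watching the old server’s logs, as mentioned above, is the more reliable sign that everyone has moved over.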

That’s pretty much what we do. It’s worked well for us for quite a few server moves. Obviously, since this involves messing with the configuration on production servers, you shouldn’t attempt it without fully understanding what’s involved. You should also test it out on a test setup before trying it with real applications. Plus, you should always have a fallback plan in case any step of it doesn’t work like you expect it to for some reason.

tasty lightning

By anders pearson 08 Mar 2006

at pycon the other week, Jonah and i gave a lightning talk on tasty and some of the ideas we’ve been thinking about in the area of “mini apps”. a lightning talk is a super fast 5 minute presentation. the idea is that a lot of people can get an idea or two each out in front of a whole conference without much stress or prep work. in particular, people tend to give lightning talks to show off a cool little trick or to just put out an idea that might not be fully formed yet and get some feedback early on. that was pretty much why we gave one.

it occurred to me though that i should probably write up the basics of the talk and post it here partly to disseminate the ideas a little more and partly because writing it forces me to clarify some of my own thinking on the subject.

so this post will basically cover the material that Jonah and i raced through in our five minutes at pycon with a bit more elaboration and refinement.

our talk was titled “Using the Web as Your Application Platform” but could also really be thought of as “Taking REST to Extremes”. none of the ideas we’ve been exploring are particularly novel or cutting edge on their own, but i think we’ve been pushing things a bit further than most.

everything i’ve been building lately has been very strongly guided by the REST architectural principles. i drank as much of that particular kool-aid as i could find. the main concept behind REST is basically designing your application to take full advantage of HTTP and the inherent architecture of the web, which is much more thoughtfully designed than many people realize.

like most developers, i’ve been chasing the pipe dream of “reusable components” for a long time. every programming shop out there at some point dreams of creating a library of reusable components specific to their problem domain that they can just put together in different ways to quickly tackle any new projects that come along and never have to write the same thing more than once. there are a million different ways that we attempt to build our little programming utopia: libraries, frameworks, component architectures, fancy content management systems, etc. every hot new language promises to make this easier and every vendor bases their sales pitch around it. most of the time though, every solution comes up a little short in one way or another. with deadlines and annoying real world requirements and constraints, the utopia never quite seems to materialize.

eg, where i work, we develop in python, perl, java, and php and we use a number of frameworks. some of those because they’re productive and nice to work with, some because of legacy reasons, some because of the particular environment that an application has to run in, and some because they are local customizations of open source applications and we can’t exactly change the language they were implemented in to suit our whims. i don’t think this situation is all that uncommon for web shops. right away, this is clearly a huge limitation on how much reusability we can get out of a library or framework. if i write something in python it isn’t going to be much use in java or php. so for a common task, like authenticating against the university’s central system, we’ve got to implement it at least four times, once for each language.

REST offers a solution for this particular predicament though. every language out there and every framework that we might reasonably use comes with built in support for HTTP. if you build a small self-contained application where you might have written a library or a component, it’s immediately usable from all the other languages and frameworks.

this was the reasoning behind tasty’s implementation as an application with a REST API instead of just as a library. tasty was written in python with turbogears, which is my current high productivity environment, but it was written as part of a project which we have to eventually deploy in a java course management system and which we’re also doing UI prototyping for in Plone. to be blunt, i just can’t stand java. implementing a full featured tagging library in java on a j2ee stack would have taken me months and probably driven me nuts. i could have built it on Plone; that would have been reasonable, but it still would have taken me a bit longer and i would have had to wrestle with converting the relational model that tasty uses to the object database that Plone uses, and it probably would have ended up buggy or slow and my head might have exploded.

so instead i built it with the REST API and now it can be used by any application written in any language with just a little bit of glue. we already use it with python and perl “client” applications, Sky has written a javascript client, and i’m sure the java and php clients aren’t far off.

tasty isn’t particularly unique either. i’ve been writing applications that are designed to be tied together for a while now and i’ve got a small stable of them that i can and do mash together to build more complex applications.

this weblog is a good example. the main application is python and turbogears but it ties together at least four mini applications behind the scenes. full text search is handled by a separate application. when you type something into the searchbox here it makes an HTTP request to the search application behind the scenes. when you view any post, it makes an HTTP request to a tasty server to get the tags for that post. if you log in to the admin interface, it uses a separate application that does nothing but manage sessions. if you upload an image, it passes the image off via an HTTP POST request to a separate application that replies with a thumbnailed version of the image. and i’ve only started converting to this strategy. i plan on extracting a lot more of the functionality out into separate applications before i’m through. eg, i think that threaded comments can be their own application, i have an application that does just markdown/textile/re-structured text conversion that i plan on plugging in, templating and skinning can be a separate application, and i’ve got a nifty event dispatching application that opens up a number of potential doors.

one good example i can bring out to demonstrate just how far we’re taking this concept is a tiny little internal application that we call “pita”. pita implements “data pockets”. essentially, it’s an HTTP accessible hashtable. you can stick some data in at a particular ‘key’ using an HTTP POST request and then you can pull that data out with a ‘GET’ request to the same url.

using curl on the commandline, a sample session would look like this (assuming that pita.example.com was running a pita server):

% curl -X POST -F "value=this is some data to store" http://pita.example.com/service/example/item/foo
% curl -X POST -F "value=this is some more data" http://pita.example.com/service/example/item/bar
% curl http://pita.example.com/service/example/item/foo
this is some data to store
% curl http://pita.example.com/service/example/item/bar
this is some more data
% curl -X DELETE http://pita.example.com/service/example/item/foo
% curl -X DELETE http://pita.example.com/service/example/item/bar

it’s a fairly trivial application, but it comes in quite useful, particularly if you want to share a little data between multiple applications. the applications sharing data don’t have to all have access to the same database, they just have to all point at the same pita server and agree on a convention for naming keys.

doing the above sequence of pita operations from within a python session is fairly trivial too using my little restclient library:

:::pycon
>>> from restclient import POST, DELETE, GET
>>> base = "http://pita.example.com/service/example"
>>> POST(base + "/item/foo", params=dict(value="this is some data to store"))
'ok'
>>> POST(base + "/item/bar", params=dict(value="this is some more data"))
'ok'
>>> print GET(base + "/item/foo")
this is some data to store
>>> print GET(base + "/item/bar")
this is some more data
>>> DELETE(base + "/item/foo")
>>> DELETE(base + "/item/bar")

using JSON as the format for the payload (or occasionally XML for some of the really tough stuff) makes exchanging more complex structured data very easy.
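for instance, stashing a small structure in pita and pulling it back out is just a matter of encoding and decoding around the same calls shown above. a quick sketch with the json module (the key and values here are made up):

:::pycon
>>> import json
>>> from restclient import POST, GET
>>> base = "http://pita.example.com/service/example"
>>> POST(base + "/item/prefs", params=dict(value=json.dumps({"theme": "dark", "per_page": 20})))
'ok'
>>> json.loads(GET(base + "/item/prefs"))
{'theme': 'dark', 'per_page': 20}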

hopefully by now it’s clear that we’re really following through on the promise of the title of the talk. the web itself is our application platform, framework, or component framework (depending on how you look at it). it doesn’t matter much what each of the components are built out of, as long as they speak the same language to each other (which in our case is the language of the web: HTTP).

aside from the language independence win (which is certainly not insignificant for us, although it might not impress developers who stick to a single platform as much), this approach has a number of other nice qualities.

first, it positively forces you to design components that are loosely coupled and functionally cohesive, two primary goals of software design. it’s one thing to hide an implementation behind a library API, it’s quite another when the abstraction is so tight that a component could be swapped out with one written in a completely different language and running on a different physical computer and the other components would never be any the wiser.

second, it actually scales out really well. one of the first responses i usually get when i start explaining what we’re doing to other programmers is “but isn’t doing a bunch of HTTP requests all over the place really slow?”. that question sort of annoys me on the grounds that they’re implicitly optimizing prematurely (and many of you know by now how strongly i’m against premature optimization). yes, making an HTTP request is slower than a plain function invocation. however, in practice it isn’t so much slower that it comes anywhere close to being a bottleneck.

there are some good reasons for this. first, the nature of how these things tie together usually means only a couple of backchannel HTTP requests to each mini app involved on a page view, not hundreds or thousands. an HTTP request might be slower than calling a function from a library, but only by a few milliseconds. so using this approach will add a few milliseconds to each page view, but that’s pretty negligible. remember, most significant differences in performance are due to big O algorithm stuff, not differences of milliseconds on operations that aren’t performed very often. a second argument is that these sorts of components are almost always wrapping a database call or two. while an HTTP request might be much slower than a function call, it’s not much more overhead at all compared to a database hit.

i’ve done some informal benchmarks against mini apps running locally (i’m certainly not advocating that you spread your components out all over the web and introduce huge latencies between them; there should be a very fast connection between all your components) and the overhead of an HTTP request to local cherrypy or mod_python servers is pretty comparable to the overhead of a database hit. so if you’re comfortable making an extra database request or two on a page, hitting an external app shouldn’t be much different.

one of the most powerful arguments against the performance naysayers though is the scalability argument. yes, there’s a small performance hit doing this, but the tradeoff is that you can scale it out ridiculously easily. eg, if thraxil.org all of a sudden started getting massive amounts of traffic and couldn’t handle the load, i could easily set up four more boxes and move the tasty server, the full-text search server, the image resizer, and the session manager out to their own machines. the only thing that would change in the main application would be a couple urls in the config file (or, with the right level of DNS control, that wouldn’t even be necessary). the hardware for each application could also be optimized for it (eg, some applications would get more out of lots of memory while others are CPU or disk bound). plus, there exist a myriad of solutions and techniques for load balancing, caching, and fail-over for HTTP servers, so you pretty much get all that for free. instead of implementing caching in an application, you can just stick a caching proxy in front of it.

scaling out to lots of small, cheap boxes like that is a proven technique. it’s how google builds their stuff (they use proprietary RPC protocols instead of HTTP, but the idea is the same). large parts of the J2EE stack exist to allow J2EE applications to scale out transparently (IMO, it takes an overly complex approach to scaling, but again, the core idea is basically the same). you can think of the web itself as a really large, distributed application. REST is its architecture and it’s how it scales. so, arguably, this approach is the most proven, solid, tested, and well understood approach to scaling that’s ever been invented.

i really only cover the scalability thing though as a counter to the knee-jerk premature optimization reflex. i mostly view it as a pleasant side effect of an architecture that i’m mostly interested in because of its other nice attributes. even if it didn’t scale nearly as well, for me and probably the lower 80% of web developers who don’t need to handle massive amounts of traffic and for whom maintainability, loose coupling, and reusability trump performance, it would still be an approach worth trying. luckily though, it appears to be a win-win situation and no such tradeoff needs to be made.

i have a ton of other things that i’d like to say about this stuff. thoughts on different approaches to designing these mini applications, how the erlang OTP was a large inspiration to me, tricks and techniques for building, deploying and using them, recommendations for squeezing even more performance out of them and avoiding a couple common pitfalls, ideas for frameworks and libraries that could make this stuff even easier, and even one idea about using WSGI to get the best of both worlds. but i’ve already covered a lot more here than was in the lightning talk so i’ll save those ideas for future posts.

the only other thing i do want to note, and which i didn’t get a chance to mention at pycon, is Ian Bicking’s posts on the subject of small apps instead of frameworks. Ian, i think, has been thinking about some of the same stuff we have. i was building these small apps (of a sort) before i read his post, but i hadn’t really thought much of it. coming across Ian’s posts made me a lot more conscious of what i was doing and helped me crystallize some of my thoughts and really focus my development and experimentation.

anyway, expect more from me on this subject. and hopefully, by next year’s pycon, we’ll have enough nice demos, benchmarks, and experience building these things that we can give a proper full talk then (or actually at just about any other web/programming conference, since these ideas aren’t really python specific at all).

turbo thraxil

By anders pearson 04 Mar 2006

i finally got around to porting this site to the TurboGears framework, which i’ve been using a lot lately.

it was already CherryPy + SQLObject, so it wasn’t that big a deal.

i also used Routes to map the urls to controller methods. i’m generally ok with the default cherrypy approach, but for thraxil.org, the url structure is fairly complicated and Routes certainly made that a lot cleaner.
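a rough sketch of the kind of mapping Routes makes easy (the patterns here are made up for illustration, not this site’s actual url map):

:::python
# illustrative only: map a couple of url patterns to controller/action pairs
from routes import Mapper

mapper = Mapper()
mapper.connect("post", "/posts/{year}/{month}/{day}/{slug}",
               controller="posts", action="view")
mapper.connect("tag", "/tags/{tag}", controller="tags", action="index")

# match() returns a dict of routing variables for a url, or None
print(mapper.match("/posts/2006/03/04/turbo-thraxil"))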

next, i’m working on refactoring as much of the code out into separate mini REST applications as i can (like Tasty) to make things even more manageable. of course, i need a few more unit tests before i can really do that safely…

wrong again

By anders pearson 03 Feb 2006

i’m losing my knack for predicting what sort of horrors the universe has in store for us.

first, there were cockroaches controlling robots and robots controlling mice and i was worried about cockroach controlled rodents. then there were zombie dogs and i was worried about cockroach controlled zombie animals.

now, i learn that the cockroaches are merely pawns of the wasp, hairworms are controlling grasshoppers, and the Toxoplasma gondii protozoan is up to some kind of evil scheme involving controlling rats in order to subtly alter human personalities (and possibly drive us insane) via kitty litter.

there’s layer upon layer of conspiracy and intrigue in the animal world and humans are caught in the middle of it. i’m not sure we’ll ever figure out who’s really pulling the strings or why, but i for one would like to get the hell off this planet now. it’s clearly not safe here.

meanwhile, our fearless leader is worried about human-animal hybrids…

Scanner for Anders...

By Thanh Christopher Nguyen 01 Feb 2006

Hey anders. Check out the HP scanjet 4670. I got one for xmas. It sits on an easel and hinges forward. But it’s not actually attached to the easel, so you can simply pick it up and lay it over a larger image (like all your paintings), scanning pieces of it at a time. The top is made of glass, so you can see the alignment. It comes with software to realign larger images once it has all been scanned in. And there’s even a small device that connects to it to scan photo negatives and slides, which is cool if you like to work with film every now and then. Anyway… I keep meaning to mention it to you, but you’re never up as late as I am, so here is this crappy post on your cool site.

scaling tag clouds

By anders pearson 13 Dec 2005

While we’re on the subject of tagging, let’s talk a little bit about tag clouds and their display.

Tag clouds are nice visual representations of a user’s tags, with the more common tags displayed in a larger font and perhaps in a different color. Canonical examples are delicious’ tag cloud and flickr’s cloud.

It’s a clever and simple way to display the information.

Doing it right can be a little tricky too. A tagging backend (like Tasty) will generally give you a list of tags along with a count of how many times each appears.

The naive approach is to divide up the range (eg, if the least common tag has a count of 1 and the most common 100) into a couple discrete levels and assign each to a particular font-size (eg, between 12px and 30px).
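In code, the naive version looks something like this (the level count and pixel sizes are arbitrary):

:::python
# naive linear scaling: chop the count range into equal-width buckets
def linear_font_size(count, min_count, max_count, min_px=12, max_px=30, levels=5):
    if max_count == min_count:
        return min_px
    fraction = (count - min_count) / float(max_count - min_count)
    level = min(int(fraction * levels), levels - 1)  # 0 .. levels-1
    return int(min_px + level * (max_px - min_px) / (levels - 1))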

The problem with the naive approach (which I’m now noticing all over the place after spending so much time lately thinking about tag clouds) is that tags, like many real-world phenomena, typically follow a power law. The vast majority of tags in a cloud will appear very infrequently. A small number will appear very frequently.

Here’s an example of a cloud made with this sort of linear scaling.

It’s better than nothing, but as you can see, most tags are at the lowest level and there are just a couple really big ones. Since it’s dividing up an exponential curve into equal chunks, much of the middle ground is wasted.

To make a cloud more useful, it needs to be divided up into levels with a more equal distribution. The approach I found easiest to implement was to change the cutoff points for the levels. Conceptually, it’s sort of like graphing the distribution curve logarithmically. So instead of dividing 1-100 up as (1-20, 21-40, 41-60, 61-80, 81-100), it becomes something like (0-1, 2-6, 7-15, 16-39, 40-100).

That turns that same cloud into this, which I think makes better use of the size spectrum.

The actual algorithm for doing the scaling requires a bit of tuning but this is the prototype code I wrote for testing that produced that nicer scaled cloud:
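A rough sketch of that kind of logarithmic scaling (the cutoffs and sizes below are illustrative, not the exact prototype):

:::python
# log-scaled buckets: cutoffs grow roughly exponentially, so the many
# low-count tags get spread across more of the size range
import math

def log_font_size(count, max_count, min_px=12, max_px=30, levels=5):
    # map log(count) onto the available levels instead of count itself
    fraction = math.log(count) / math.log(max_count) if max_count > 1 else 0
    level = min(int(fraction * levels), levels - 1)
    return int(min_px + level * (max_px - min_px) / (levels - 1))

# made-up tag counts, just to show the spread
counts = {"python": 100, "lisp": 40, "rest": 15, "dsl": 6, "maine": 1}
for tag, count in counts.items():
    print(tag, log_font_size(count, max(counts.values())))

With a maximum count of 100, the implied cutoffs work out to roughly (1-2, 3-6, 7-15, 16-39, 40-100), which is close to the buckets described above.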

tasty

By anders pearson 12 Dec 2005

at work, we build lots of custom applications for specific classes or professors. this means that most of our stuff isn’t very generalizable and gets locked up in the ivory tower, which is a shame since we do (IMHO) good work. every once in a while though, something comes along that is a little more general purpose that other people might find useful.

lately, i’ve been on a kick of building small, self-contained web applications that are designed to be mixed and matched and used to build larger applications. exactly the kind of thing that Ian Bicking wrote about.

my most recent mini-application, and one that’s made it out of the ivory tower, is called Tasty and is a tagging engine. probably the easiest way to think of it is as del.icio.us but designed to be embedded inside an application. it supports a very rich tagging model (basically what’s explained on tagschema.com) and is very efficient.

Tasty was written in python using the excellent TurboGears framework. but Tasty’s interface is just HTTP and JSON, so it can be integrated with pretty much any application written in any language on any platform. there’s even a pure javascript Tasty client in the works.

also, fwiw, Tasty has been powering the tagging on thraxil.org for a few weeks now (it was sort of my sanity check to make sure that the API wasn’t too obnoxious to integrate into an existing architecture and to make sure it performed ok on a reasonably large set of tags).

(update: Ian Bicking has an interesting followup that gets more into the small applications vs frameworks/libraries discussion and even posits an approach to making integration between python applications even easier)

(replace-string "keyword" "tag")

By anders pearson 11 Dec 2005

i finally gave in and changed the nomenclature on the site from ‘keywords’ to ‘tags’.

this website had ‘keywords’ years before del.icio.us came along and everyone started calling them ‘tags’.

i’m stubborn so i stuck with “keywords” for quite a while, but i guess they finally wore me down.