By anders pearson, 10 Sep 2004
at work the other day, i gave a little mini presentation on my experience using python’s profiler. afterwards i wrote it up and posted it to our internal weblog. i thought it might be worth preserving here as well with a little editing.
one of the applications i work on is the “Image Annotation Tool” (or IAT), which allows users to manage collections of images and annotate them. there’s a lot more to it than that, but for our purposes here, that’s enough to understand.
the IAT had been functioning fine in the sandbox where i’ve been doing development. last week i migrated the data from last semester to the format that the new version needed and updated the IAT code on production. when i tried loading the page that shows every single image in my library (607 of them) on the production server, it was working but taking several minutes to load. my sandbox library with 60 images loaded in about 2 seconds. clearly there were some scalability issues that needed to be looked into. i decided it was time to figure out python’s profiler.
a profiler is a program (or library in python’s case) that will examine a running program and collect statistics on how many times functions are called and how long they took. there are other profilers that will track things like memory usage, files opened, and other resources, but for now, i’m just interested in CPU time.
the conventional wisdom is that premature optimization is the root of all evil and that code should be written in the most straightforward, maintainable fashion without much regard paid towards performance until it’s demonstrable that the code doesn’t run fast enough. at that point, a profiler should be used to find the bottlenecks in the code and the bare minimum of optimizations should be done to get the code running within acceptable limits. the rationale is that most optimizations make code harder to understand, maintain, and modify; since programmer time is far more expensive and limited than CPU time, they should only be a last resort. the other part of the reasoning is that programmers are notoriously bad at guessing where the bottlenecks are in a piece of code (and notoriously confident that they are good at guessing). it’s far too common to see programmers spend time complicating their code adding optimizations that speed up a part of the code that’s only responsible for half a percent of the overall execution time anyway. you don’t have to be a software engineer to understand this logic; it’s Cost Benefit Analysis 101.
this is how i’ve tried to write code for years now. my experience so far with web application and database development has been that optimizations are almost never necessary at all. i just write things in the most straightforward way i can think of, using convenient abstractions and high level interpreted languages, planning on optimizing another day. that other day rarely comes. at work we’re usually not handling large enough amounts of data, serving thousands of simultaneous users, or doing complicated enough things that the backend code ever approaches being the bottleneck (which is usually just network latency and browser rendering time). on the rare occasions that speed became a factor, i didn’t really bother using a profiler; i would just sprinkle a bunch of print statements through the code that printed labels and timestamps to the logs. then i could dig through the logs and get a rough idea of where things were bogging down. that strategy works fairly well if the code is simple enough that you have a good idea where the bottleneck might be. i’ve been pretty lucky with my guesses in the past and things have generally worked out, but i’ve chased a bottleneck up the wrong tree enough times that i’m now willing to accept that profilers could be useful.
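just to illustrate the crude approach, it amounts to something like this (a throwaway sketch; the time.sleep call is standing in for whatever chunk of code you suspect is slow):

import time

# crude profiling: wrap the suspect chunk with labeled timestamps and
# dig the numbers out of the logs afterwards
start = time.time()
time.sleep(0.5)   # stand-in for the chunk of code you suspect is slow
print("suspect chunk took %.3f seconds" % (time.time() - start))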
for the IAT bottleneck, i of course had my guesses as to what was slowing things down. i’m using SQLObject as an ORM (Object Relational Mapper), and my experience with ORMs is that when you fetch a large collection of objects from the database, they sometimes decide to fetch the data for each object in separate SQL SELECTs. so for 607 images, i wasn’t going to be surprised if i saw 607 separate database calls being made instead of one SELECT returning the data for all 607 in one pass.
python’s profiler runs in two stages. the first stage collects the data from running the app and stores it in a log. the second stage parses that data and lets you generate reports from it. the two reports i chose to generate were a list of the function calls sorted by the time spent in each function (“stats.txt”) and a list of which functions called which other functions (“callees.txt”), which is useful for tracing through and making sense of the results of the first report.
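for reference, the two stages look roughly like this using the standard profile and pstats modules (a sketch, not the actual harness i used; render_library_page is just a stand-in for the real code being profiled):

import profile
import pstats

def render_library_page():
    # stand-in for the real page-rendering code being profiled
    return "".join(str(i) for i in range(100000))

# stage one: run the code under the profiler and dump the raw timing data to a log
profile.run('render_library_page()', 'profile.out')

# stage two: parse that data and generate the reports (redirect stdout
# to a file to produce something like stats.txt or callees.txt)
stats = pstats.Stats('profile.out')
stats.strip_dirs().sort_stats('time', 'cumulative')
stats.print_stats(15)    # the stats.txt style report
stats.print_callees()    # the callees.txt style report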
the first run gave me this for stats.txt (i’m just going to show the top few lines for the sake of brevity):
632870 function calls (514488 primitive calls) in 220.097 CPU seconds
Ordered by: internal time, cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
6454 157.236 0.024 157.236 0.024 DBConnection.py:128(_queryAll)
1 33.282 33.282 35.143 35.143 htmltmpl.py:535(process)
4226 8.249 0.002 8.249 0.002 DBConnection.py:141(_queryOne)
42681 2.573 0.000 18.401 0.000 SQLObject.py:385(__new__)
42681 2.528 0.000 2.528 0.000 Cache.py:51(get)
4619 2.192 0.000 12.847 0.003 SQLObject.py:663(_init)
118029/1 1.544 0.000 220.097 220.097 <string>:1(?)
4619 1.475 0.000 1.684 0.000 SQLObject.py:810(_SO_selectInit)
85202 1.348 0.000 1.348 0.000 SQLObject.py:687(_SO_loadValue)
1001 1.259 0.001 179.976 0.180 iat.py:632(data)
3035 1.237 0.000 115.038 0.038 iat.py:845(data)
10749 1.175 0.000 1.175 0.000 DBConnection.py:79(getConnection)
7460 1.121 0.000 1.312 0.000 htmltmpl.py:911(escape)
68 0.430 0.006 0.448 0.007 DBConnection.py:352(__init__)
9372 0.414 0.000 0.414 0.000 htmltmpl.py:789(find_value)
each line has the number of times that a particular function was called, the total time spent executing that function (not including time spent in functions that it calls), the average time for each call of the function, the total cumulative time spent in that function (including time spent in functions that it calls) and the average cumulative time per call.
to make sense of a profiler report, you have to be a little familiar with the code that you’re profiling and the libraries that it calls. eg, it’s important to know that DBConnection.py, Cache.py and SQLObject.py are part of the SQLObject library and htmltmpl.py is the templating library that i use to generate the output. iat.py is the main file with IAT specific code.
so the report tells us that it took 220 seconds to run the code. we can also see that a lot of time (157 seconds) was spent in the _queryAll() function in DBConnection.py, which was called 6454 times. that’s the function that does a select from the database. so the code is basically doing over 6000 database calls. no wonder it’s running slow! the next step is to figure out why. so far, it seems to support my hypothesis that the ORM isn’t doing things in the most efficient manner. looking at the cumulative time column, we can see that there are a couple of big ones being called from iat.py (the mysterious 220 second <string>:1(?) line can be ignored; it’s just how the profiler reports string conversions. since the outermost layer of the code returns a string, it picks up all the others in the cumulative time, but you can see that it only uses a second and a half of total time itself). the data() methods on lines 632 and 845 of iat.py seem to account for a lot of the cumulative time.
line 632 is the data() method of the Image class. since there are 607 images that we need to get the info on, it’s not too surprising that that method is called a lot. the reason it’s called 1001 times instead of 607 is probably the sidebar with its list of libraries, collections, and image counts, which accounts for some extra calls beyond the 607.
but line 845 is a little more mysterious. it’s the data() method of the Annotation class. (the basic hierarchy of the IAT is Library -> Image -> Group -> Annotation -> Shape.) so Annotations are subcomponents of Images. the question is why we’re fetching Annotations’ data when we’re just making a page with an index of thumbnails. digging into the callees.txt report, we see:
...
%iat.py:632(data) <string>:1(?)(29029) 220.097
iat.py:22(visible_to_user)(1001) 1.038
iat.py:51(editable_by_user)(1001) 0.039
iat.py:734(num_annotations)(1001) 21.472
iat.py:774(data)(469) 149.159
iat.py:926(merge_dicts)(1001) 0.075
iat.py:1095(calculate_permissions)(1001) 1.454
%iat.py:845(data) <string>:1(?)(42490) 220.097
iat.py:907(data)(3193) 1.343
iat.py:926(merge_dicts)(3035) 0.075
iat.py:1095(calculate_permissions)(3035) 1.454
...
%iat.py:774(data) <string>:1(?)(6566) 220.097
iat.py:845(data)(3035) 115.038
iat.py:926(merge_dicts)(469) 0.075
iat.py:1095(calculate_permissions)(469) 1.454
...
Image.data() is calling Group.data() (line 774) 469 times, and Group.data() is calling Annotation.data() 3035 times. Group.data() calling Annotation.data() makes sense; pretty much any time you want the data for a group of annotations, you’re going to want the data for the annotations within that group as well. but why would Image.data() be calling Group.data() at all in this case? we have no use for the groups’ and annotations’ data here.
so i take a closer look at Image.data() and see that sure enough there’s a line:

groups = [g.data(user=user) for g in self.groups]

in it. so Image.data() is fetching the data for each group attached to that image. the interesting thing is that ‘groups’ is never used after that point. so the line is totally superfluous! it serves no purpose but seems to be responsible for a lot of database calls. i don’t know why i wrote it; i probably had an idea for something, got part of it in there, then got distracted and forgot about it. it happens. on the sandbox library, it wasn’t slowing things down enough to be noticeable, so it never got removed.
i removed the offending line and ran the profiler again.
295080 function calls (256736 primitive calls) in 64.093 CPU seconds
Ordered by: internal time, cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 32.306 32.306 33.384 33.384 htmltmpl.py:535(process)
1480 22.289 0.015 22.289 0.015 DBConnection.py:128(_queryAll)
2518 2.985 0.001 2.985 0.001 DBConnection.py:141(_queryOne)
36918 1.146 0.000 1.146 0.000 SQLObject.py:1103(instanceName)
4067 1.061 0.000 1.061 0.000 DBConnection.py:79(getConnection)
9372 0.630 0.000 0.630 0.000 htmltmpl.py:789(find_value)
68 0.408 0.006 0.426 0.006 DBConnection.py:352(__init__)
2911 0.380 0.000 1.552 0.001 SQLObject.py:810(_SO_selectInit)
18775 0.259 0.000 0.259 0.000 Cache.py:51(get)
18775 0.252 0.000 5.816 0.000 SQLObject.py:385(__new__)
1001 0.208 0.000 27.661 0.028 iat.py:632(data)
...
down from 220 seconds to 64 seconds (about a 70% improvement) with a one-line change. not too shabby. my guess that a pile of database calls was slowing things down was partially right, but the real culprit wasn’t the ORM being inefficient; it was a superfluous line of my own code. without the profiler, i would have caught the problem eventually, but i probably would have rewritten a lot of code first only to see it not speed things up very much. profiler: 1, anders: 0.
ok. a 70% improvement is pretty sweet, but 64 seconds to load a single page is still not really acceptable. the top line now is a single call to htmltmpl.py’s process() function, which is taking about 32 seconds. htmltmpl is the templating library that i use to separate the business logic from the display logic. it’s generating a pretty big page from a lot of data, but 32 seconds still sounds like a long time. a coworker and i have argued about the usefulness of templating libraries versus their overhead many times before, with my claim being that they would probably never be the bottleneck in one of our applications. was i going to have to eat some serious crow here?
at this point, i actually started re-reading the profiler documentation, sure that i was misinterpreting the numbers somehow. i couldn’t imagine how it could be taking up so much time. i was especially surprised since, as one of the developers of htmltmpl.py, i remembered merging a patch that had some performance enhancements, and the informal benchmarks i’d done at the time using much larger datasets didn’t take anywhere near that long to run. so i started digging into the copy of htmltmpl.py installed on our production server. the problem was immediately clear: i had never upgraded htmltmpl on that machine to the version with that patch; a much older version was installed. so i did an upgrade and re-ran the profiler.
295080 function calls (256736 primitive calls) in 34.888 CPU seconds
Ordered by: internal time, cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1480 20.828 0.014 20.828 0.014 DBConnection.py:128(_queryAll)
2518 2.942 0.001 2.942 0.001 DBConnection.py:141(_queryOne)
2911 2.357 0.001 2.523 0.001 SQLObject.py:810(_SO_selectInit)
1 1.873 1.873 3.745 3.745 htmltmpl.py:535(process)
2458 1.345 0.001 1.345 0.001 urllib.py:1130(quote)
37991/1 1.170 0.000 34.888 34.888 <string>:1(?)
2911 1.123 0.000 6.999 0.002 SQLObject.py:663(_init)
68 0.406 0.006 0.423 0.006 DBConnection.py:352(__init__)
9372 0.364 0.000 0.364 0.000 htmltmpl.py:789(find_value)
18775 0.258 0.000 0.258 0.000 Cache.py:51(get)
18775 0.252 0.000 7.722 0.000 SQLObject.py:385(__new__)
1001 0.209 0.000 27.199 0.027 iat.py:632(data)
...
that cut the overall time in half again, and the templates now take under 2 seconds to process. for a monster page of 600 thumbnails, that’s good enough performance from the templating engine. without the profiler, i may never have noticed that htmltmpl.py was a version behind; i never would have suspected it of being a bottleneck. so we’ve sped things up by almost an order of magnitude now by removing a single line of code and upgrading a library, both optimizations that i wouldn’t have come up with on my own in the same amount of time. profiler: 2, anders: 0.
actually the score should be even higher for the profiler. during this process there were a number of times when, despite the profiler data sitting right in front of me, i was tempted by my intuition and made changes here and there that i thought would at least shave off a few seconds. in every case, the profiler confirmed that my intuitions were wrong and the changes resulted in negligible improvements at best. all of them were backed out and took a little bit of my ego with them.
of course, 34 seconds still wasn’t quite fast enough for me (i’d be happy with something under 10 seconds; loading a page of 607 thumbnails will bring many browsers to their knees, so it’s not something we expect users to be doing frequently, and if the backend takes less than 10 seconds, the bottleneck will clearly be back to network latency and page rendering time, which we can’t do anything about). since Image.data() is taking 27 seconds of that time, it looks like we’re now at the point where convincing the ORM to do things more efficiently will make for a significant improvement.
i’ll spare you the profiler output, but with a few more passes, i’d gotten it down to between 8 and 12 seconds (depending on the server load). once again, my intuitions weren’t quite right.
we highlight thumbnails for images that have annotations attached to them, and it turned out that the code counting how many annotations each image has was what was running slow. the Image class has a method called num_annotations() which returns that count. num_annotations looked like:
def num_annotations(self):
    cnt = 0
    for g in self.groups:
        cnt += g.num_annotations()
    return cnt
so again, we were looping over all the groups for every image in the library. my first pass at fixing it was to do:
def num_annotations(self):
    return Annotation.select(AND(Annotation.q.groupID == Group.q.id,
                                 Group.q.imageID == self.id)).count()
ie, to let SQLObject do one query (with a relatively complex join) for each image instead of looping. well, imagine my surprise when i ran the profiler again and found that it actually worsened performance by a few seconds. since most images only had 1 or 2 groups at the most, the old way was really only doing 2 or 3 very simple queries for each image, while the newer one was doing 1 complex query. apparently the join was expensive enough that it was slower than a couple of simple queries.
my next approach actually worked. i got rid of num_annotations altogether and, at the higher level where we were getting the list of images in the library, did a single complex query to get every annotation in the library, then looped over the list of images and marked them as annotated if they showed up in that result. in that case, the expense of one really complicated query was much lower than the 600 - 1200 or so simple queries it replaced. roughly, the idea looks like the sketch below.
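(this is a sketch rather than the actual iat.py code: i’m selecting the annotated groups instead of the annotations themselves, and the libraryID column and the final marking step are my assumptions about the schema and the page-building code.)

# one complex query: every group in this library that has at least one
# annotation attached to it (libraryID is an assumed column name)
groups = Group.select(AND(Annotation.q.groupID == Group.q.id,
                          Group.q.imageID == Image.q.id,
                          Image.q.libraryID == self.id))
annotated = {}
for g in groups:
    annotated[g.imageID] = True

# then, while building the page data, mark each image instead of calling
# num_annotations() on it (images_data: the assumed list of per-image
# dicts handed to the template)
for image_data in images_data:
    image_data['annotated'] = image_data['id'] in annotated

the point is that the page now costs one join query total instead of a couple of simple queries per image.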
score yet another point for the profiler.
this is probably good enough for me. the backend takes around 10 seconds to load a page with 607 thumbnails but the browser takes at least that long just to render the page (more the first time since it has to fetch all the thumbnails). other pages take a second or two (so the gathering of sidebar data isn’t taking too long on large libraries). i could probably get it a bit faster, but the gains would be smaller and smaller and the costs (in terms of complexity and time invested) would keep increasing.
so the bottom line is that if you are optimizing code and not using a profiler, you’re probably wasting a lot of time. if you are optimizing code before verifying that there is even a problem, you’re definitely wasting time.
By anders pearson, 03 Sep 2004
Philips has a pretty neat new camera which they are advertising as “wearable”.
i’ve had a strong interest in the field of wearable computing for a long time now, so i was naturally interested. unfortunately, their idea of “wearable” is that it goes on your keychain. one of the attributes of wearable computing is that a device should be unmonopolizing, ie. it shouldn’t require the user to actively do anything to use it. with most cameras, including this one, you have to pull it out, point it at your subject and click a button to take the picture. all that this camera really has as an advantage at that point is that it’s small enough to always carry with you.
what i’d like to see is a really wearable camera in the wearable computing sense. with a bit of hacking on the hardware, this camera might be a good candidate. if you could modify it so it clips to your shirt pocket and just snaps a photo every 10 seconds or so, that would be pretty useful. you would have an automatic record of your day. you would get 99% boring, random, dark, and out of focus pictures, but every once in a while it might capture something interesting that you never would have shot if you’d had to pull it out of your pocket, aim, and shoot.
the reassuring thing about this camera is that in terms of size and power, we’re really almost there. the camera can be set to take low-res 640x480 photos and has 128MB of memory. that’s about 1000 pictures, which is about 1 picture per minute for 16 hours. so if you synched it up every night, you’d be fine. i’m sure that within a year or two, you could buy a version that had 1GB of space which pretty much gets you to the 10 seconds between shots point. with a bit more intelligence built in, you could probably also have the camera detect that light levels are too low or that nothing is in focus and not bother taking a bunch of the pictures that you would end up throwing away anyway. within 5 years, a device this size for a reasonable price could take decent quality video of your entire day.
probably the limiting factor on the camera is battery life. it might have space for 1000 pictures, but i doubt the battery could handle staying on for 16 hours. i haven’t been able to find any info on its battery life though.
i’m really tempted to buy one of these, glue a clip on the back of it, and rip it open and see if i could add some circuitry to automatically snap pictures once a minute. could be a fun project. need to find out more about the battery life though to know if it would be worth trying.
By anders pearson, 05 Jul 2004
since i digitally sign all of my outgoing email, i’m often put in the position of having to explain what digital signatures are and why one might want to use them. i’ve tried to find a good explanation on the web aimed at the layperson, but i haven’t found anything decent yet, so here’s my best shot at it.
summary: digital signatures are a way of ensuring that a document (such as an email) is from who it says it is and hasn’t been tampered with along the way.
the first thing that you should understand is that email is not at all secure. email protocols were designed back when the internet was used almost exclusively by the military and academic communities. it was a trusting environment and there wasn’t much thought given to malicious hackers. the result is that it is trivially easy to forge email; any clever 12-year-old could probably show you how. ie, just because the ‘From:’ line of an email says the message is from grandma doesn’t mean that it is. many of the Outlook worms that are floating around trick people into opening them by forging the ‘From:’ field so it looks like a message from a friend.
it’s also not very difficult for someone in the right position to change the contents of an email. as an email travels from the sender to the recipient, it usually passes through several different mail servers. anyone with administrative access (legitimate or not) to any of the servers on the path could substitute their own message and no one would be the wiser. another consequence of the insecure design of email systems is that there is no privacy. along the way there are many chances for nefarious people to read the email that you send. you should never write anything in an email that you wouldn’t be comfortable writing on the back of a postcard.
i shouldn’t really have to explain why this can be dangerous. there are an endless number of bad things that a criminal can do when they can impersonate someone. use your imagination or just google for “identity theft”.
now, perhaps you can see the need for some way of being sure that an email really is from who it says it is. this is what digital signatures do. i won’t try to bore you with too much explanation of how it works, but at a surface level, it’s fairly straightforward.
when you sign a message, it makes use of a ‘key’ that is unique to you. actually the key has two parts, both just extremely large numbers: one public, which anyone in the world can look at, and one private, which only you can see and which should be kept as a carefully guarded secret. the two parts are related to each other by a special mathematical relationship, but knowing the public key won’t give away any clues about the private key. indeed, to be useful, you need to make your public key widely available. mine is online here and there are servers on the web that just host public keys. my public key can also be downloaded from one of them here.
when a message is signed, it’s converted to a number (on a computer, everything is stored as a number at some level anyway, so this is a no-brainer) and a mathematical operation is performed using the message’s number and the sender’s private key. the result is the ‘signature’ of the message and it gets sent along with the message. if even a single byte of the message changes, the signature would be entirely different. since the private key was involved in the operation, no one else could generate the same signature for the message without stealing your private key. the actual math involved is fairly advanced and i won’t try to explain it here. the important part is that there is a similar operation that anyone else can perform using the message, the signature, and the public part of your key that will tell them whether the signature was really generated using the public key’s private counterpart (remember that there is a special mathematical relationship between the public and private keys). ie, they can use the signature and your public key to verify that it was signed with a particular key and hasn’t been modified since it was signed.
that’s a very rough sketch of the theory behind it. if you really want to understand more, read this introduction to cryptography and then just start googling. now for some more practical information. there are two main approaches to digitally signing email: PGP and S/MIME. both have their strengths and weaknesses. PGP is more ubiquitous and it’s what i use, so that’s what i’ll explain and recommend. both also let you encrypt email (solving the privacy problem), but for that, both the sender and recipient need to have keys. actually, the primary purpose of both is encryption, but i think that for most people, handling signatures is a more pressing need.
PGP stands for ‘Pretty Good Privacy’ and is pretty well known. it can be downloaded from pgpi.org for free. there are also commercial versions available and the open source GnuPG, which is what i use. if you want to sign your emails or verify the signatures of others’, download one of those and read its manual to see how you go about setting it up. if you plan on using it very frequently (which you should), you’ll want to figure out how to integrate it with your email client. i can’t possibly cover every possible combination of mail client + PGP implementation. usually doing a google search for “your mail client + PGP” or something will lead you to a plugin and tutorial. eg, this page covers PGP + MS Outlook and enigmail is a plugin that will let you use PGP with mozilla thunderbird. if you use web-based mail like hotmail or yahoo, you’re pretty much out of luck. sorry.
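to give you a flavor of the underlying tool, the basic GnuPG workflow from the command line looks something like this (a rough sketch; a mail client plugin will normally drive these steps for you, and the email address is just an example):

gpg --gen-key                                        # create your public/private key pair
gpg --armor --export you@example.com > pubkey.asc    # export your public key to share
gpg --clearsign message.txt                          # sign: writes message.txt.asc containing the text plus signature
gpg --verify message.txt.asc                         # anyone with your public key can check the signature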
one important point that i should also make is that if you start signing your email, you should get in the habit of signing all of it. if your correspondents get used to receiving unsigned email from you, they’ll have no reason to be suspicious if they get something forged. if you sign everything and they receive an unsigned email claiming to be from you, they might think twice before trusting its validity. i sign every email i send (with the exception of posts to a particular mailing list that blocks all attachments). if you get an email from me that isn’t signed, assume that it’s not really from me and verify things through some other channel.
i should also mention that some email clients (particularly Outlook) don’t always handle signed messages very gracefully. the signature shows up as a small plaintext attachment with an ‘application/pgp’ content-type, which can appear quite mysterious. Outlook even warns the user that since it doesn’t know what it is, it might be a virus (of course, had it actually been a virus, Outlook would have just infected itself and not said anything, but that’s a different rant). this is why i find myself having to explain digital signatures on a pretty regular basis.
as an added incentive to start signing your mail, keep in mind that if everyone always signed their mail and it was expected, spam could be pretty much totally eliminated.
By anders pearson, 01 Jul 2004
it’s gone on long enough. looking through the logs for this site, i can see that a lot of you are still using Microsoft Internet Explorer as your web browser.
stop it.
(if you’re not, pat yourself on the back and feel free to skip the rest of this)
i could talk about its poor support of web standards, how that makes life hard on web designers and programmers, and how it threatens the wonderful openness of the web, but probably none of you care. i don’t blame you. i could talk about how microsoft has stopped IE development (aside from the constant stream of security patches) and how there hasn’t been a new release since IE 6 (which was only a minor improvement over 5.5) came out around 3 years ago (a lifetime in software).
anyone who’s been paying attention for the last few years should know that IE is horribly insecure. new major vulnerabilities seem to come out every couple of months that expose IE users to worms, viruses, and spyware that damage and slow down their computers.
finally, there at least seem to be some journalists who are putting two and two together. maybe a good scare about having your bank account broken into will get everyone’s attention. or how about going to jail?
now that i’ve got your attention, perhaps i may be able to persuade you to try out an alternative.
firefox is my strongest recommendation. it’s small, fast, FREE, blocks popup windows by default, has a nice google search box right in the toolbar, and has a bunch of other nifty features like tabbed browsing and browser extensions. firefox is a better browser than IE for pretty much every value of “better”. it also has the best standards support of any browser out there and is open source, but you probably don’t care about that kind of thing (though you should).
if firefox is too “open-source commie” for you, maybe you’d prefer opera, which is $39 or free with ads. it’s got more features than any one person could ever use and is ridiculously fast.
if you’re on a Mac and you aren’t using Safari or firefox yet, you get no sympathy from me.
please. it’s time to switch to a better browser. if you’re using IE, go download firefox or opera now and at least try them out. if you’ve already switched, go find any of your friends who haven’t yet and help them out. if you care about them, you owe it to them.