anders vs. the profiler

By anders pearson 10 Sep 2004

at work the other day, i gave a little mini presentation on my experience using python’s profiler. afterwards i wrote it up and posted it to our internal weblog. i thought it might be worth preserving here as well with a little editing.

one of the applications i work on is the “Image Annotation Tool” (or IAT), which allows users to manage collections of images and annotate them. there’s a lot more to it than that, but for our purposes here, that’s enough to understand.

the IAT had been functioning fine in the sandbox where i’ve been doing development. last week i migrated the data from last semester to the format that the new version needed and updated the IAT code on production. when i tried loading the page that shows every single image in my library (607 of them) on the production server, it was working but taking several minutes to load. my sandbox library with 60 images loaded in about 2 seconds. clearly there were some scalability issues that needed to be looked into. i decided it was time to figure out python’s profiler.

a profiler is a program (or library in python’s case) that will examine a running program and collect statistics on how many times functions are called and how long they took. there are other profilers that will track things like memory usage, files opened, and other resources, but for now, i’m just interested in CPU time.

the conventional wisdom is that premature optimization is the root of all evil and that code should be written in the most straightforward, maintainable fashion without much regard paid towards performance until it’s demonstrable that the code doesn’t run fast enough. at that point, a profiler should be used to find the bottlenecks in the code and the bare minimum of optimizations should be done to get the code running within acceptable limits. the rationale is that most optimizations make code harder to understand, maintain, and modify; since programmer time is far more expensive and limited than CPU time, they should only be a last resort. the other part of the reasoning is that programmers are notoriously bad at guessing where the bottlenecks are in a piece of code (and notoriously confident that they are good at guessing). it’s far too common to see programmers spend time complicating their code adding optimizations that speed up a part of the code that’s only responsible for half a percent of the overall execution time anyway. you don’t have to be a software engineer to understand this logic; it’s Cost Benefit Analysis 101.

this is how i’ve tried to write code for years now. my experience so far with web application and database development has been that optimizations are almost never necessary at all. i just write things in the most straightforward way i can think of, using convenient abstractions and high level interpreted languages planning on optimizing another day. the other day rarely comes. at work we’re usually not working with large enough amounts of data, thousands of simultaneous users or doing complicated enough things that the backend code ever approaches being the bottleneck (which is usually just network latency and browser rendering time). on the rare occasions that speed became a factor, i didn’t really bother using a profiler; i would just sprinkle a bunch of print statements in the code that would print labels and timestamps to the logs. then i could dig through the logs and get a rough idea of where things were bogging down. that strategy works fairly well if the code is simple enough that you have a good idea where the bottleneck might be. i’ve been pretty lucky with my guesses in the past and things have generally worked out, but i’ve chased a bottleneck up the wrong tree enough times that i’m now willing to accept that profilers could be useful.

for the IAT bottleneck, i of course had my guesses as to what was slowing things down. i’m using SQLObject as an ORM(Object Relational Mapper) and my experience with ORMs is that when you fetch a large collection of objects from the database, they sometimes decide to fetch the data for each object in seperate SQL SELECTs. so for 607 images, i wasn’t going to be surprised if i saw 607 seperate database calls being made instead of one SELECT returning the data for all 607 in one pass.

python’s profiler runs in two stages. the first stage collects the data from running the app and stores it in a log. the second stage parses that data and lets you generate reports from it. the two reports i chose to generate were a list of the function calls sorted by the time spent in each function (“stats.txt“) and a list of which functions called which other functions (“callees.txt“) which is useful for tracing through and making sense of the results of the first report.

the first run gave me this for stats.txt (i’m just going to show the top few lines for the sake of brevity):

632870 function calls (514488 primitive calls) in 220.097 CPU seconds

  Ordered by: internal time, cumulative time

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    6454  157.236    0.024  157.236    0.024 DBConnection.py:128(_queryAll)
       1   33.282   33.282   35.143   35.143 htmltmpl.py:535(process)
    4226    8.249    0.002    8.249    0.002 DBConnection.py:141(_queryOne)
   42681    2.573    0.000   18.401    0.000 SQLObject.py:385(__new__)
   42681    2.528    0.000    2.528    0.000 Cache.py:51(get)
    4619    2.192    0.000   12.847    0.003 SQLObject.py:663(_init)
118029/1    1.544    0.000  220.097  220.097 <string>:1(?)
    4619    1.475    0.000    1.684    0.000 SQLObject.py:810(_SO_selectInit)
   85202    1.348    0.000    1.348    0.000 SQLObject.py:687(_SO_loadValue)
    1001    1.259    0.001  179.976    0.180 iat.py:632(data)
    3035    1.237    0.000  115.038    0.038 iat.py:845(data)
   10749    1.175    0.000    1.175    0.000 DBConnection.py:79(getConnection)
    7460    1.121    0.000    1.312    0.000 htmltmpl.py:911(escape)
      68    0.430    0.006    0.448    0.007 DBConnection.py:352(__init__)
    9372    0.414    0.000    0.414    0.000 htmltmpl.py:789(find_value)

each line has the number of times that a particular function was called, the total time spent executing that function (not including time spent in functions that it calls), the average time for each call of the function, the total cumulative time spent in that function (including time spent in functions that it calls) and the average cumulative time per call.

to make sense of a profiler report, you have to be a little familiar with the code that you’re profiling and the libraries that it calls. eg, it’s important to know that DBConnection.py, Cache.py and SQLObject.py are part of the SQLObject library and htmltmpl.py is the templating library that i use to generate the output. iat.py is the main file with IAT specific code.

so the report tells us that it took 220 seconds to run the code. we can also see that a lot of time (157 seconds) was spent in the _queryAll() function in DBConnection.py which was called 6454 times. that’s the function that does a select from the database. so the code is basically doing over 6000 database calls. no wonder it’s running slow! the next step is to figure out why. so far, it seems to support my hypothesis that the ORM isn’t doing things in the most efficient manner. looking at the cumulative time column, we can see that there are a couple big ones being called from in iat.py (the mysterious 220 second <string>:1(?) line can be ignored; it’s just how the profiler reports string conversions. since the outermost layer of the code returns a string, it picks up all others in the cumulative time, but you can see that it only uses a second and a half total time itself). data() on lines 632 and 845 of iat.py seem to account for a lot of the cumulative time.

line 632 is the data() method of the Image class. since there are 607 images that we need to get the info on, it’s not too surprising that that method is called a lot. the reason that it’s called 1001 times instead of 607 times is probably that the page has a sidebar with a list of libraries, collections and image counts, so that probably accounts for some extra calls beyond the 607.

but line 845 is a little more mysterious. it’s the data() method of the Annotation class. (the basic hierarchy of the IAT is Library -> Image -> Group -> Annotation -> Shape). so Annotations are subcomponents of Images. the question is why are we fetching Annotations’ data when we’re just making a page with an index of thumbnails. digging into the callees.txt report, we see:

...
%iat.py:632(data)                         <string>:1(?)(29029)  220.097
                                        iat.py:22(visible_to_user)(1001)    1.038
                                        iat.py:51(editable_by_user)(1001)    0.039
                                        iat.py:734(num_annotations)(1001)   21.472
                                        iat.py:774(data)(469)  149.159
                                        iat.py:926(merge_dicts)(1001)    0.075
                                        iat.py:1095(calculate_permissions)(1001)    1.454
%iat.py:845(data)                         <string>:1(?)(42490)  220.097
                                        iat.py:907(data)(3193)    1.343
                                        iat.py:926(merge_dicts)(3035)    0.075
                                        iat.py:1095(calculate_permissions)(3035)    1.454
...
%iat.py:774(data)                         <string>:1(?)(6566)  220.097
                                        iat.py:845(data)(3035)  115.038
                                        iat.py:926(merge_dicts)(469)    0.075
                                        iat.py:1095(calculate_permissions)(469)    1.454
...

Image.data() is calling Group.data() (line 774) 469 times and Group.data() is calling <code>Annotation.data()</code> 3035 times. Group.data() calling Annotation.data() makes sense; pretty much any time you want the data for a group of annotations, you’re going to want the data for the annotations within that group as well. but why would Image.data() be calling Group.data() at least in this case? we have no use for the groups and annotations data here.

so i take a closer look at Image.data() and see that sure enough there’s a line:

groups = [g.data(user=user) for g in self.groups]

in it. so Image.data() is fetching the data for each group attached to that image. the interesting thing is that ‘groups’ is never used after that point. so the line is totally superfluous! it serves no purpose but seems to be responsible for a lot of database calls. i don’t know why i wrote it; i probably had an idea for something, got part of it in there, then got distracted and forgot about it. it happens. on the sandbox library, it wasn’t slowing things down enough to be be noticable so it never got removed.

removed the offending line and ran the profiler again.

295080 function calls (256736 primitive calls) in 64.093 CPU seconds

  Ordered by: internal time, cumulative time

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       1   32.306   32.306   33.384   33.384 htmltmpl.py:535(process)
    1480   22.289    0.015   22.289    0.015 DBConnection.py:128(_queryAll)
    2518    2.985    0.001    2.985    0.001 DBConnection.py:141(_queryOne)
   36918    1.146    0.000    1.146    0.000 SQLObject.py:1103(instanceName)
    4067    1.061    0.000    1.061    0.000 DBConnection.py:79(getConnection)
    9372    0.630    0.000    0.630    0.000 htmltmpl.py:789(find_value)
      68    0.408    0.006    0.426    0.006 DBConnection.py:352(__init__)
    2911    0.380    0.000    1.552    0.001 SQLObject.py:810(_SO_selectInit)
   18775    0.259    0.000    0.259    0.000 Cache.py:51(get)
   18775    0.252    0.000    5.816    0.000 SQLObject.py:385(__new__)
    1001    0.208    0.000   27.661    0.028 iat.py:632(data)
...

down from 220 seconds to 64 seconds (about a 70% improvement) with a one-line change. not too shabby. while my guess about the ORM slowing things down with a lot of DB calls was partially right, we’ve already seen that it wasn’t the primary bottleneck at all. without the profiler, i probably would have caught the problem eventually, but i probably would have rewritten a lot of code first only to see it not speed things up very much. profiler: 1, anders: 0.

ok. a 70% improvement is pretty sweet, but 64 seconds to load a single page is still not really acceptable. the top line now seems to be a single call to htmltmpl.py‘s process() function which is taking about 32 seconds. htmltmpl is the templating library that i use to seperate the business logic from the display logic. it’s generating a pretty big page from a lot of data, but 32 seconds still sounds like a long time. a coworker and i have argued about the usefulness of templating libraries versus their overhead many times before with my claim being that they would probably never be the bottleneck in one of our applications. was i going to have to eat some serious crow here?

at this point, i actually started re-reading the profiler documentation, sure that i was misinterpreting the numbers somehow. i couldn’t imagine how it would be taking up so much time. i was especially surprised, since, as one of the developers for htmltmpl.py, i could remember merging a patch sent in that had some performance enhancements and the informal benchmarks i’d done at the time using much larger datasets didn’t take anywhere near that long to run. so i started digging into the copy of htmltmpl.py installed on our production server. the problem was immediately clear. i had never upgraded htmltmpl on that machine to the version with that patch; it was a much older version installed. so i did an upgrade and re-ran the profiler.

295080 function calls (256736 primitive calls) in 34.888 CPU seconds

  Ordered by: internal time, cumulative time

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1480   20.828    0.014   20.828    0.014 DBConnection.py:128(_queryAll)
    2518    2.942    0.001    2.942    0.001 DBConnection.py:141(_queryOne)
    2911    2.357    0.001    2.523    0.001 SQLObject.py:810(_SO_selectInit)
       1    1.873    1.873    3.745    3.745 htmltmpl.py:535(process)
    2458    1.345    0.001    1.345    0.001 urllib.py:1130(quote)
 37991/1    1.170    0.000   34.888   34.888 <string>:1(?)
    2911    1.123    0.000    6.999    0.002 SQLObject.py:663(_init)
      68    0.406    0.006    0.423    0.006 DBConnection.py:352(__init__)
    9372    0.364    0.000    0.364    0.000 htmltmpl.py:789(find_value)
   18775    0.258    0.000    0.258    0.000 Cache.py:51(get)
   18775    0.252    0.000    7.722    0.000 SQLObject.py:385(__new__)
    1001    0.209    0.000   27.199    0.027 iat.py:632(data)
...

that cut it in half again. down to under 2 seconds to process the templates. for a monster page of 600 thumbnails, that’s good enough performance from the templating engine. without the profiler, i may never have noticed that htmltmpl.py was a version behind. i never would have suspected it of being a bottleneck. so we’ve sped things up by almost an order of magnitude now by removing a single line of code and upgrading a library. both optimizations that i wouldn’t have come up with on my own in the same amount of time. profiler: 2, anders: 0.

actually the score should be even higher for the profiler. during this process, there were actually a number of times where, despite the profiler data sitting right in front of me, i was tempted by my intuition and made changes here and there that i thought would at least speed things up a few seconds. in every case, using the profiler confirmed that my intuitions were wrong and resulted in negligable improvements at best. all of those changes were backed out and took a little bit of my ego with them.

of course 34 seconds still wasn’t quite fast enough for me (i’d be happy with something under 10 seconds. loading a page of 607 thumbnails will bring many browsers to their knees, so it’s not something that we expect users to be doing that frequently and if the backend takes less than 10 seconds, the bottleneck will then clearly be back to network latency and page rendering time, which we can’t do anything about). since Image.data() is taking 27 seconds of the time, it looks like we’re now at the point where convincing the ORM to do things more efficiently will make for a significant improvement.

i’ll spare you the profiler output, but with a few more passes, i’d gotten it down to between 8 and 12 seconds (depending on the server load). once again, my intuitions weren’t quite right.

we highlight thumbnails for images that have annotations attached to them and it turned out to be the code that counted how many annotations each image has that was running slow. the Image class has a method called num_annotations() which would return that count. num_annotations looked like:

def num_annotations(self):
  cnt = 0
  for g in self.groups:
     cnt += g.num_annotations()
  return cnt

so again, we were looping over all the groups for every image in the library. my first pass at fixing it was to do:

def num_annotations(self):
   return Annotation.select(AND(Annotation.q.groupID == Group.q.id,
                                Group.q.imageID == self.id)).count()

ie, to let SQLObject do one query (with a relatively complex join) for each image instead of looping. well, imagine my surprise when i ran the profiler again and found that that actually worsened performance by a few seconds. since most images only had 1 or 2 groups at the most, the old way was really only doing 2 or 3 very simple queries for each image while the newer one was doing 1 complex query. apparently the joins were bad enough that it was slower than a couple simple queries.

my next approach actually worked. i got rid of num_annotations altogether and at a higher level, where we were getting the list of images in the library, did a single complex query to get a list of every annotation in the library and then looped over the list of images and marked them as annotated if they appeared in that list. in that case, the expense of one really complicated query was much lower than 600 - 1200 or so simple queries.

score yet another point for the profiler.

this is probably good enough for me. the backend takes around 10 seconds to load a page with 607 thumbnails but the browser takes at least that long just to render the page (more the first time since it has to fetch all the thumbnails). other pages take a second or two (so the gathering of sidebar data isn’t taking too long on large libraries). i could probably get it a bit faster, but the gains would be smaller and smaller and the costs (in terms of complexity and time invested) would keep increasing.

so the bottom line is that if you are optimizing code and not using a profiler, you’re probably wasting a lot of time. if you are optimizing code before verifying that there is even a problem, you’re definitely wasting time.

Tags: programming python optimization profilers software engineering