spent a little time this evening speeding up Corral. specifically, the part of it that downloads feeds, parses them and puts new items in the database. this is pretty light work, CPU-wise, but it was taking a good 15 or 20 minutes to handle the couple hundred feeds in the database.
the problem was that it did everything sequentially, and some of the sites that it's downloading are slow. so if feed A takes 58 seconds to download, feed B 1 minute, and feed C 2 seconds, the total would be 2 minutes. all wasted just waiting for slow webservers to respond. clearly, this was a job for multithreading.
what i wanted was to have it downloading feeds 5 or 10 at a time. that way the total time would be roughly the same as the time for the slowest feed or two, and not the cumulative total for all the feeds. in the above example, A,B,C would all be downloaded within about 1 minute (B's time). the more concurrent downloads, the greater the savings.
so the trick is to start up a bunch of threads (but not too many, i don't want to overwhelm the system with 100 threads all trying to download feeds at once) and let them download the feeds concurrently, collecting the results in a single list to be processed after all results are in.
collecting the results of multiple threads into a single data structure can be tricky if you haven't done it before. if two threads try to modify a list at the same time, you're in for all kinds of pain. one way to avoid it is to use synchronized variables or mutex locks. in this case, though, there was an easier way, since i was willing to wait for all the threads to finish before doing any processing of the results. in general, if you can find a way to avoid synchronized variables and mutex locks, you'll save yourself some headaches.
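for the record, the mutex approach isn't that bad either. here's a minimal sketch of it (modern python, nothing to do with Corral's actual code) where each thread grabs a lock before appending to a shared list -- squaring a number stands in for the actual download:

```python
import threading

results = []
results_lock = threading.Lock()

def worker(n):
    # pretend this is the result of downloading a feed
    value = n * n
    # only one thread at a time may touch the shared list
    with results_lock:
        results.append(value)

threads = [threading.Thread(target=worker, args=(n,)) for n in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # [0, 1, 4, 9, 16]
```

the results land in whatever order the threads happen to finish, which is why they get sorted before printing.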
python's threading module lets you create a new thread as an object instance. the interesting part is that the object sticks around after the thread terminates, just like a regular object in the main thread, so you can store the results as an object attribute. then, when all the download threads have finished, you can loop over them and 'harvest' the results.
there doesn't seem to be much in the way of beginner-oriented documentation, tutorials, or example code on threads in python, so here's a little toy script i wrote to test out my ideas and make sure i had things figured out. when the script finishes, the results list should hold the contents of all the pages. it should be pretty easy to follow (especially with the threading.Thread documentation for reference).
```python
from threading import Thread
import time
import urllib

MAX_THREADS = 2

# some urls to fetch
urls = ["http://www.google.com/",
        "http://thraxil.org/",
        "http://www.columbia.edu/~anders/",
        "http://www.slashdot.org/"]

getters = []

class Getter(Thread):
    """ just downloads a url and stores the results in self.contents """
    def __init__(self, url):
        Thread.__init__(self)
        self.url = url

    def run(self):
        # this is the part that can take a while if the site is slow
        self.contents = urllib.urlopen(self.url).read()
        # when this method returns, the thread exits

def count_active():
    """ returns the number of Getter threads that are alive """
    num_active = 0
    for g in getters:
        if g.isAlive():
            num_active += 1
    print "%d alive" % num_active
    return num_active

# spawn threads
for url in urls:
    print "fetching %s" % url
    while count_active() >= MAX_THREADS:
        print "  too many active"
        time.sleep(1)
    g = Getter(url)
    getters.append(g)
    g.start()

# harvest the results
results = []
for g in getters:
    print "waiting..."
    # join() causes the main thread to wait until g is finished
    g.join()
    results.append(g.contents)

print "done"
```
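a side note: the count_active()/sleep polling loop works, but the threading module also has a Semaphore that can do the throttling for you. here's a hedged sketch of that variant (modern python, not Corral's code, with squaring standing in for the download):

```python
from threading import Thread, Semaphore

MAX_THREADS = 2
throttle = Semaphore(MAX_THREADS)

class Worker(Thread):
    """simulates a slow download and stores the result in self.contents"""
    def __init__(self, n):
        Thread.__init__(self)
        self.n = n

    def run(self):
        # at most MAX_THREADS workers execute this block at any one time
        with throttle:
            self.contents = self.n * self.n  # stand-in for urlopen().read()

workers = [Worker(n) for n in range(6)]
for w in workers:
    w.start()  # all start immediately; the semaphore gates the work itself

# harvest, same as before: join each thread, then read its attribute
results = []
for w in workers:
    w.join()
    results.append(w.contents)

print(results)  # [0, 1, 4, 9, 16, 25]
```

the tradeoff is that this version creates all the thread objects up front and only limits how many are actually working at once. for a couple hundred feeds that's fine; for thousands you'd probably want a fixed pool of workers pulling urls off a Queue instead.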
after writing that toy script, i was able to modify Corral's feeder script to use the same tricks in just a few minutes and it ran correctly the first time. now, with MAX_THREADS set to 10, it fetches all the feeds in under 2 minutes. not a bad improvement over the 18 to 20 minutes it was taking before.