citation, original sources, and the sad state of mainstream media

By anders pearson 18 Oct 2003

a few days ago, the prime minister of malaysia gave a speech to an islamic summit conference. some of his remarks were reported in the media and <a href="http://releases.usnewswire.com/GetRelease.asp?id=139-10162003">condemned as anti-semitic</a>. Mahathir and others <a href="http://www.bernama.com/bernama/v3/news.php?id=24433">countered</a> that the quotes being thrown around were taken wildly out of context. passionate debate ensued about how to properly interpret his remarks. what has surprised me is that no news agency has done the obvious and linked to a complete transcript of the speech. it would seem to me that if one party is complaining that their words were taken out of context, the obvious way to resolve the dispute is to let people judge for themselves. but go ahead and look. i spent a lot of time digging and haven't been able to find any western mainstream media coverage that makes any attempt to link to the original transcript.


at this point, i couldn't care less whether the speech was anti-semitic. go read it and decide for yourself. most people are smart enough to figure it out without biased journalists telling them how to think.

i was surprised that such an obvious thing wasn't done, but i shouldn't have been. i'm optimistic against my better judgement; for some reason i keep expecting mainstream journalists to demonstrate some integrity every once in a while, even though experience tells me it will never happen.

my current beef with mainstream journalism is the complete lack of anything resembling an academic citation standard.

most of us have had it drilled into our heads since we first started writing research papers in high school that we have to cite our sources and provide a bibliography. having spent the last few years working in academia, i've developed a deep appreciation for the mindset behind this. the main impetus behind academic citation is to build a useful body of knowledge. if someone is researching subject X and comes across your paper on the topic, they can go further and read the same sources that you read if they need a deeper understanding. it also grounds research and keeps everything honest. you can't just assert something to be a fact without being able to back it up and expect to be taken seriously (in theory at least). science takes this idea one step further: not only do you need to cite everything, but your experiments need to be repeatable or your results will be ignored. mathematics takes it to the logical extreme and requires that everything be <em>provable</em>. essentially, the burden of proof in academia is on the author. they must take pains to show that their facts are solid.

journalism seems to have nothing like this. if a reporter is making something up, the burden of proof is on the public to show that they’re full of shit. when a journalist quotes the president’s speech, you never see a bibliography at the end of the article pointing to where you can find a transcript of the whole speech.

i don't really expect this out of print or television news. but news websites have no excuse. it essentially costs nothing to post complete transcripts online, or at least link to them. (all congressional hearings have online transcripts thanks to the <a href="http://www.fnsg.com/">FNS</a>).

the web makes citation simple. since so many people get their news online now, imagine how great it would be if everyone had access to the same original source material that the journalists access. people could actually start forming their own opinions without being spoonfed by the media. it just pisses me off that so much of the potential of online journalism is completely wasted.

this and better

By anders pearson 15 Oct 2003

my Grandmother passed away this afternoon.

her favorite phrase, by far, would be delivered any time we were just sitting around, not actually doing anything. she’d pound a fist on the table, inhale, take on a firm expression and say: “well… this and better <em>might</em> do… but this and worse will <em>never</em> do.” and we would all be expected to get up and actually do something productive rather than idle away.

Dorothy Pearson, 1919-2003, RIP.

python threading

By anders pearson 15 Oct 2003

spent a little time this evening speeding up Corral. specifically, the part of it that downloads feeds, parses them and puts new items in the database. this is pretty light work, CPU-wise, but it was taking a good 15 or 20 minutes to handle the couple hundred feeds in the database.

the problem was that it did everything sequentially, and some of the sites that it downloads from are slow. so if feed A takes 58 seconds to download, feed B 1 minute, and feed C 2 seconds, the total would be 2 minutes, most of it wasted just waiting for slow webservers to respond. clearly, this was a job for multithreading.

what i wanted was to have it download feeds 5 or 10 at a time. that way the total time would be roughly the same as the time for the slowest feed or two, not the cumulative total for all the feeds. in the above example, A, B, and C would all be downloaded within about 1 minute (B's time). the more concurrent downloads, the greater the savings.

so the trick is to start up a bunch of threads (but not too many, i don’t want to overwhelm the system with 100 threads all trying to download feeds at once) and let them download the feeds concurrently, collecting the results in a single list to be processed after all results are in.

collecting the results of multiple threads into a single data structure can be tricky if you haven't done it before. if two threads try to modify a list at the same time, you're in for all kinds of pain. one way to avoid it is to use synchronized variables or mutex locks. in this case, though, there was an easier way, since i was willing to wait for all the threads to finish before doing any processing of the results. in general, if you can find a way to avoid dealing with synchronized variables or mutex locks, you'll save yourself some headaches.
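for comparison, here's roughly what the mutex approach would look like (just a sketch to illustrate the idea; this isn't code from Corral): every thread appends to a shared list, so every append has to be wrapped in a lock.

from threading import Thread, Lock
import urllib

results = []
results_lock = Lock()

class LockingGetter(Thread):
    """ downloads a url and appends the contents to a shared list """
    def __init__(self, url):
        Thread.__init__(self)
        self.url = url
    def run(self):
        contents = urllib.urlopen(self.url).read()
        # only one thread at a time is allowed to touch the shared list
        results_lock.acquire()
        try:
            results.append(contents)
        finally:
            results_lock.release()

it works, but it's easy to forget the lock somewhere. the approach below sidesteps the problem entirely: no thread ever writes to a shared structure, so there's nothing to lock.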

python's threading module lets you create a new thread as an object instance. the interesting part is that the object is still around after the thread terminates, just like a regular object in the main thread, so you can store the results as an object attribute. then, when all the download threads are finished, you can loop over them and 'harvest' the results.

there doesn't seem to be much in the way of beginner-oriented documentation, tutorials, or example code on threads in python, so here's a little toy script that i wrote to test out my ideas and make sure i had things figured out. when the script finishes, the results list should have the contents of all the pages. it should be pretty easy to follow (especially if you make reference to the threading.Thread documentation).

from threading import Thread
import time
import urllib

MAX_THREADS = 2
# some urls to fetch
urls = ["http://www.google.com/",
        "http://thraxil.org/",
        "http://www.columbia.edu/~anders/",
        "http://www.slashdot.org/"]
getters = []

class Getter(Thread):
    """ just downloads a url and stores the results in self.contents """
    def __init__(self,url):
        Thread.__init__(self)
        self.url = url
    def run(self):
        # this is the part that can take a while if the site is slow
        self.contents = urllib.urlopen(self.url).read()
        # when this method returns, the thread exits

def count_active():
    """ returns the number of Getter threads that are alive """
    num_active = 0
    for g in getters:
        if g.isAlive():
            num_active += 1
    print "%d alive" % num_active
    return num_active

# spawn threads
for url in urls:
    print "fetching %s" % url
    while count_active() >= MAX_THREADS:
        print " too many active"
        time.sleep(1)
    g = Getter(url)
    getters.append(g)
    g.start()

# harvest the results
results = []
for g in getters:
    print "waiting..."
    # join() causes the main thread to wait until g is finished
    g.join()
    results.append(g.contents)
    print "done"

after writing that toy script, i was able to modify Corral’s feeder script to use the same tricks in just a few minutes and it ran correctly the first time. now, with MAX_THREADS set to 10, it fetches all the feeds in under 2 minutes. not a bad improvement over the 18 to 20 minutes it was taking before.

lost smell

By kamden 10 Oct 2003

Well, typically I have found myself in an unlikely situation. After a mishap in orgo. lab yesterday, involving heavy doses of smoking naphthalene, I have now lost my sense of smell. Totally. Pizza tastes like wet cardboard with a hint of mothball. I think that crap recrystallized deep in my nasal cavity. Losing a sense bites. I really hope it comes back. Tell Harvard to get better hoods.

eolas upside?

By anders pearson 07 Oct 2003

Eolas' patent lawsuit against microsoft has not made them popular with web developers lately (or with those of us who think that software patents are even more evil than microsoft). microsoft recently <a href="http://www.microsoft.com/presspass/press/2003/oct03/10-06EOLASPR.asp">announced</a> that they would be modifying IE's handling of plugins (such as ActiveX, Flash, Java, Quicktime, etc.) to eliminate any infringement.

their update page shows that the change will involve <a href="http://msdn.microsoft.com/ieupdate/activexchanges.asp#userexperience">adding a dialog</a> that pops up anytime the user loads a page with a plugin. after the user hits 'Ok', it will load as normal.

i read that and was immediately reminded of the <a href="http://www.texturizer.net/firebird/extensions/#Flash%20Click%20To%20View">Flash Click to View</a> extension for mozilla, which is incredibly useful for eliminating annoying animated banner ads. the mozilla extension is a little smoother than IE's dialog boxes because it puts the button right on the page where the plugin would have displayed, rather than popping up and getting in the way. but microsoft's page also says that users will be able to set a preference to have all plugins blocked by default; then, pages with plugins will have a little icon at the bottom that the user can click on to get them to load.

to me it looks like microsoft’s solution to the lawsuit will make IE a better browser. furthermore, not loading ActiveX content by default will be a nice security enhancement (this has been a source of many a hole in the past).

so all it took was a lawsuit to get microsoft to make a usability enhancement to their browser. now, if we can just get a few more companies to sue them, maybe they’ll also add tabbed browsing, popup blocking and better cookie controls.

The Legendary Couch-Moving Epic

By Eric Mattes 03 Oct 2003

I would like to share with the readers of thraxil my amazing adventure from last week:

<a href="http://www.ericmattes.com/couch/">http://www.ericmattes.com/couch/</a>

fsck

By anders pearson 03 Oct 2003

in the last couple years, ‘fsck’ and variations like ‘fscking’ have entered the common vocabulary, particularly in an online context, as a way of swearing without really swearing. you see it all over blogs and in email. occasionally, i’ve heard people trying to actually pronounce it in conversations.

what interests me is that there is an increasing number of people who use the slang but who aren’t hardcore unix geeks, and i wonder if any of them really have any idea where the term came from.

for those people, here’s a little education: <tt>fsck</tt> is a unix tool for doing a FileSystem CHeck. in the old days, linux filesystems like ext2 were almost as poorly designed as FAT, NTFS, or HFS. if you were unlucky enough to have the power go out on you while the computer was writing data to disk, you could end up with a corrupted disk, and possibly not be able to boot off it (if the corrupted part was something necessary to the OS). if that happened, the fix was to boot off a rescue disk with a minimal kernel and run fsck to repair the disk. the process was slow and tedious and somehow always had to be done at the most inopportune times. we now have reiserfs, ext3, XFS, and other journaling filesystems that prevent this kind of tragedy from occurring.

the superficial resemblance of 'fsck' to a certain other four-letter word, combined with the fact that fsck was mostly used in situations of frustration where swearing was already common, is what originated the use of 'fsck' as slang among the unix crowd.

from there i’d guess that it spread out to the non geek masses via slashdot, where it became common in the comments and sigs.

strangest. coincidence. ever.

By anders pearson 01 Oct 2003

a couple weeks ago, we hired a few CS students part-time to help out at <a href="http://www.ccnmtl.columbia.edu/">the center</a>.

this morning, one of them, Davang, a first-year grad student, stops me as i walk into the office. “is your full name ‘anders pearson’?” “yes, it is.” “is this yours?” and he hands me a Time Warner Cable bill with my name and (new) address on it. “where did you get this?” “it came in my mail.”

my first thought was that maybe he’d moved into my old apartment by some coincidence and they had somehow gotten things confused with my account. but he lives somewhere entirely different. apparently, my statement came in the same envelope as his, which was properly addressed to him.

so it looks like somehow, the people or machines stuffing envelopes got the two statements stuck together and put into the same envelope. the bizarre coincidence is that of all the 11 million people in new york that it could have gone to, it went to someone in my office.

well, they probably divide things up into neighborhoods so the fact that he lives relatively close to me made it more likely. also, i started the account in the last month, and since he moved into the city at around the same time, his account was probably started at nearly the same time, which would make the account numbers close. so it’s not quite the 1 in 11 million chance that it looks like (or 30 in 11 million if you consider that it could have been anyone in the office), but i think it was still pretty unlikely. i’d say probably worse than 1 in 100,000. still not odds that i’d have bet on.

dev

By anders pearson 30 Sep 2003

remember drachen, the mp3 database frontend that i wrote a while back? in case you don't, its motivation was that i have a huge collection of mp3s and ogg vorbis files, and all the other mp3 players i've tried just aren't designed to handle a collection that large; their track selection interfaces are virtually unusable. so drachen is a python frontend to a postgresql database of all my mp3s.

i've been working on porting it to PyGTK and Glade (see this article). the Tk interface that you can see in the old screenshots was showing its age, and from using it for a while i had some ideas on ways to improve it. GTK looks much nicer and has a nice multi-column list widget that i wanted to use. not to mention that i wanted a good excuse to do something with Glade and figure out whether it is really as slick as it looks.

i basically spent two nights getting up to speed on glade and (mostly) GTK. glade is a cinch; once i figured out that a lot of the python + glade tutorials on the web are out of date and refer to a libglade python library that has since been replaced by gtk.glade, it took me no time to get the shell of the interface built and turned into a glade file that python could read. GTK is a lot more complicated and, unfortunately, not quite as well documented as i'd like. in particular, it took me quite a while to figure out how to do some basic things with the ListStore and TreeView. the individual APIs are documented pretty well, but i had to combine information and example code from a bunch of different places to figure out the general usage patterns.
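for anyone else stuck on the same thing, the basic pattern that i eventually pieced together boils down to something like this (a minimal sketch from memory, not drachen's actual code; the column names and data are made up):

import gtk

# the ListStore is the model. declare the column types up front.
store = gtk.ListStore(str, str)
store.append(["karma police", "radiohead"])
store.append(["svefn-g-englar", "sigur ros"])

# the TreeView is the widget that displays a model.
view = gtk.TreeView(store)

# each visible column gets a renderer mapped to a model column by index.
view.append_column(gtk.TreeViewColumn("track", gtk.CellRendererText(), text=0))
view.append_column(gtk.TreeViewColumn("artist", gtk.CellRendererText(), text=1))

window = gtk.Window()
window.connect("destroy", lambda w: gtk.main_quit())
window.add(view)
window.show_all()
gtk.main()

the thing that none of the tutorials seemed to spell out is that the model and the view are completely separate: you append rows to the ListStore and the TreeView updates itself.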

then i spent this evening tying the new frontend to the old backend. luckily, i had designed it to be as MVC-like as is possible with Tkinter, so the database code was pretty cleanly split from the Tk code.

while playing with it, though, i started to get frustrated with how slow it was at searching the database. when you entered a query into the search box, it had to do a massive join (bringing together the tracks, artists, and albums tables) and do three LIKE comparisons on the fields. even with every index i could think of added, postgres just can't make that kind of query fast over 20,000 records; a LIKE pattern with a wildcard at the front can't use a btree index, so postgres ends up scanning everything anyway.

so i decided to go all the way and build a reverse index like a real search engine would have. i won't bother explaining the concept of a reverse index since this perl.com article does a better job than i probably could.

it turned out to be easier to implement than i expected. i just added a reverse_index table to the database and wrote a 10 line perl script to go through all the database records and index them (it took 10 minutes to run – longer than it took to write). now, searches are nice and fast.
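the actual script was perl, but the idea translates directly to python. here's a rough sketch (the table and column names are invented for illustration; drachen's real schema differs):

import re

def tokenize(s):
    """ lowercase a string and split it into words """
    return re.findall(r"[a-z0-9]+", s.lower())

def build_index(cursor):
    """ fill the reverse_index(word, track_id) table from scratch """
    cursor.execute("""select t.id, t.title, ar.name, al.name
                      from tracks t, artists ar, albums al
                      where t.artist_id = ar.id
                      and t.album_id = al.id""")
    for (track_id, title, artist, album) in cursor.fetchall():
        # collect the unique words across all three fields
        words = {}
        for word in tokenize(title) + tokenize(artist) + tokenize(album):
            words[word] = 1
        for word in words.keys():
            cursor.execute("insert into reverse_index (word, track_id) values (%s, %s)",
                           (word, track_id))

def search(cursor, query):
    """ return the ids of tracks matching every word in the query """
    ids = None
    for word in tokenize(query):
        cursor.execute("select track_id from reverse_index where word = %s",
                       (word,))
        matches = {}
        for (track_id,) in cursor.fetchall():
            matches[track_id] = 1
        if ids is None:
            ids = matches
        else:
            # intersect with the matches for the previous words
            for track_id in ids.keys():
                if track_id not in matches:
                    del ids[track_id]
    return (ids or {}).keys()

each lookup is then a simple equality match on an indexed column instead of a LIKE scan, which is why it's fast.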

here's a quick screenshot of the new GTK interface. much more modern-looking than the old stuff.