thraxil.org:

A Boggling Return to C

by anders pearson Sun 17 Apr 2011 16:21:05

My first real programming class, back in high school, was taught in C. I found it difficult and I remember spending the whole year thinking that I really had no aptitude for programming. I think I got an OK grade in the end, but I felt like I was really struggling compared to the other students in the class. It wasn't until a year or two later, taking CS classes in college and finding them easy, that I figured a few things out. First, that C programming class was incredibly intensive. By the end of the school year, I'd implemented, from scratch, a 3D wireframe graphics engine. Second, I felt like I was struggling compared to my friends in the class, because my friends were complete prodigies who had already been coding in other languages for years.

I haven't really coded in C for the last decade though. Aside from a couple classes that happened to involve a bit of C as an aside, college CS classes were getting on the Java bandwagon and by the time I graduated, I'd discovered Perl as a great language to just get stuff done. Since then it's just been Perl, Python, and JavaScript with excursions into other high-level functional languages for fun (ML, Haskell, Lisp, Erlang, etc).

Recently, I had the urge to get back to my roots and relearn C. I'm not sure exactly why. I think I've just had a few too many moments lately where I've felt like software I was using or working on was slower and more bloated feeling than it ought to be. I know writing in C won't automatically fix that, but I've had the urge to write small, efficient programs. To make my own data structures and libraries where every clock cycle and byte of memory used are essential to the problem being solved.

Any time I try to learn a new language, I like to start with a toy program or two. Something simple enough to get my head around, but meaty enough to not be completely trivial.

For my first little toy project this time, I've dusted of my copy of K&R and written a program that, given a Boggle board, will solve it, finding all the english words that appear.

I implemented this before in Python about a year ago for fun. You can see that program here.

The C version is functional but will probably make more experienced C programmers cringe. C syntax is pretty straightforward and not hard to pick back up, but I've got very little experience with the standard libraries available and I'm still re-familiarizing myself with the idioms of developing in that language.

I also took a fundamentally different approach this time, since I've had a year of background thinking on the most efficient way to do this.

Both versions rely on the standard unix 'words' file (/usr/share/dict/words on most Linux boxen) as a master dictionary. They also both expect a board to be defined in a textfile in a very basic manner, eg:

tntb
gdar
elte
sish

I'm not going to go over the rules of Boggle, since you can google it.

The approach I used in Python was to first come up with a reduced alphabet consisting only of the letters that actually show up on the board. In the example case above: abdegilnrst. Then I go through each line of the words file, skipping it if there's an apostrophe, a capital letter (no proper nouns in Boggle), or any letters in the word do not appear in the restricted alphabet. The ones that are left are potentially findable words. Then I do a fairly straightforward search at each position of the board, going neighbor to neighbor to see if that word appears. Then at the end, I take all the words that were found and sort them by length.

That's a really rough, fairly brute-force approach. I won't apologize. That's kind of how I like to do things in Python. It probably took me an hour to write that code and it runs fast enough that I could use it to cheat and get really high scores on online Boggle games (start a new game, quickly type the board into a text file, run the Python script, which took a couple seconds, then spend the next minute or two typing in the words it found, starting with the longest, highest scoring ones first). Basically a total success as far as I'm concerned. Python absolute excels for this approach to problem solving where programmer-time matters a lot and there's a certain performance threshold of "good enough" that, once you reach, you don't gain by exceeding. With Python and a couple basic built-in data structures like lists and hash tables, there are few problems you can't tackle by just taking a very straightforward approach and adding a few quick filters to reduce the search-space drastically. Sophistication is possible, but rarely necessary.

But C is different to me. I feel like since C usually requires more lines of code and more care to get a program working but offers far more control and efficiency, it requires a different approach. It's worth taking the time to get it right to begin with and avoid wasting cycles and memory.

I've had a few ideas in my head lately for things that I want to build that I think will actually form a nice foundation for other work. Things that would really benefit from being small, fast, efficient and "right" from the ground up. This is what's got me interested in getting back into C.

Back to Boggle.

The C version takes a completely different approach, that I thought up recently and seems elegant. It's based on a Prefix Tree, or "Trie" data structure.

What I do is first go through the words file (again throwing away words that have apostrophes or capital letters) and building a Trie out of it. The result is a Trie structure representing every word in the english language. Checking to see if some word exists is O(n) where n is the number of letters in the word, and further will always be very small in this application. Next, we go through each position on the board as a starting point and then go through all the possible word paths from that point, checking against the master word Trie.

Now, the nice thing about using the Trie instead of just a plain hash table lookup or binary search or something, is that in addition to being able to tell if a given word exists, we can also find out whether there are any words that start with a given prefix just as quickly. That means that as we check every possible path through the board, we can eliminate subpaths very quickly when the Trie tells us that there can't possibly be any valid words that start with that prefix. Eg, starting at the 't' in the upper left corner of the example board and looking at all the paths that can originate from there, we can pretty much instantly eliminate them all since there are no words in the english language that start with 'tn', 'tg', or 'td'. This is the sort of reasoning that a human mind can do quickly, and a Trie is how we can get a computer to be about as clever in this case. This rapid pruning of the search space pays dividends quickly.

The code obviously isn't perfect. I've actually completely ignored the case of 'Qu', which Boggle puts as a single tile. It wouldn't be hard to make my code handle that one, but I left it out as it wouldn't fundamentally change the approach; it just adds a bit of extra complexity that would make it less clear.

One enhancement I could possibly make would be to do the same filtering step I did in the Python version, coming up with the restricted alphabet of only letters that appear on the board first and skipping any words in the english dictionary that have any letters besides those. That would make the Trie construction step faster and the resulting, smaller Trie would use less memory and be faster to search. But building the Trie of the entire dictionary already only takes 82 milliseconds on my laptop (the second part of the program, walking all the subpaths through the board and checking for matches takes about 20 milliseconds for a 5x5 board. By way of comparison, the Python program takes 1.5 seconds to run, so that's about a 10X speedup).

The other enhancement which I might do once I've thought about it a bit more is to serialize a built Trie to a file on disk that could be read in quickly and completely skip the step of building the same Trie every time the program is run.

Writing the C version took me a few evenings, with a lot of stupid mistakes and sidetracks, but it was fun. I like looking at the code and knowing that any inefficiency in it is because I didn't design it well enough, not because I'm using a massive library or framework that has a thousand features that I don't actually need for this particular problem.

TAGS: programming c python

comments

abdegilnrst: you forgot the h ;)

Nice. You might be able to build the Trie more rapidly if you know the input word list is already sorted.

I made a boggle solver a few years ago. If you want to take a look: http://www.deanandara.com/bogl.py

I had this as an interview question once, they were looking for a recursive solution that used tries. Fun interview.

I wrote a web application for a boggle/scrabble solver using python.

The whole project took me about 3 hours and coding the solver was the most fun.

I also open-sourced the code:

http://words.gumyum.com

What of you used a hash like a trie data structure? I wonder what kind of speed boost you might get. Then compare that speed to your C program.

It's amazing how similar your story is to mine. (And thanks for writing it. It's encouraging.) When I was in college, they had really just finished moving away from C to Java. I had one or two classes where we used C, but it was nearly always apologized for or spoken of as "the old, dirty way". Not knowing any better, I believed this.

I recently came to the same conclusion you did, that I wanted to get "closer to the metal". I had gone so far as to dust off my copy of K&R when I discovered Go. It gave me everything I was hoping to get out of returning to C.

I even implemented boggle, but um...a little differently. http://i.imgur.com/enqM9.jpg

Nice work.

Yeah, I've been playing with Go for the last few months as well and I've been very impressed with it. It's the first language I've seen in a while that I feel like I can see myself actually being as productive with as I am with Python.

libtpl is a great serialization library. It can serialize a struct, an array and even a linked list. I use it in conjuction with memcache (TPL_MEM storage) or a file (TPL_FILE)

If your purpose is speed, don't use strlen inside the for loop; set a variable to the length beforehand, otherwise strlen will be called before every iteration of the for loop.

Thanks. That's exactly the kind of low level optimization stuff I was looking to get back into.

It does strike me though as something that the compiler ought to be able to optimize out.

Most compilers don't optimize across function boundaries, so it doesn't know that strlen does not have side effects. If it had side effects, the compiler has to assume that you wanted the function invoked for every iteration.

A hashtable for strings is always slower than a trie. In a hashtable you have to process the string with the hashfunction, and then acess the table. With a trie one can retrieve information just by processing the string.

This is the approach I used when I wrote anagramarama, it's such a fun little problem to solve and you feel like you've properly earned your wings after you've done it

If you are having to write things in C, I recommend you check out the D language, which has all the cool, modern features, yet is link-compatible with C. It's the perfect replacement. http://dlang.org/overview.html


formatting is with Markdown syntax. Comments are not displayed until they are approved by a moderator. Moderators will not approve unless the comment contributes value to the discussion.

namerequired
emailrequired
url
remember info?