Tuesday, December 1, 2009

The joy of collections

I recently discovered the collections library in python (http://docs.python.org/library/collections.html), which is a treasure trove of various handy data structures. As a fan of math, I had already found myself creating many of these classes and functionalities, and it is good to know others think the same way. These are implemented in C, which obviates any python scratchings I had been using.

In particular, here are a few cool tricks using the defaultdict type, which functions as a dictionary, but with a default value (saving the use of many d[k] = d.get(k, ) ... lines and helper functions).

defaultdict takes a function to call as the initalizer for the default value of a new key, since int() evaluates to 0, and list() evaluates to [], we save ourselves some lambda or helper functions here and use those.

import collections

d_counts = collections.defaultdict(int)
d_lists = collections.defaultdict(list)

sentence = "the quick brown fox jumps over the lazy dog"
for i, letter in enumerate(sentence):
d_counts[letter] += 1
d_lists[letter].append(i)

for letter, count in sorted(d_counts.iteritems()):
print "%s - %s" % (letter, count)
for letter, indicies in sorted(d_lists.iteritems()):
print "%s - %s" % (letter, indicies)


You can also nest the initializers, this code might be appropriate for use in a word generating program.

d_dict_int = collections.defaultdict(lambda: collections.defaultdict(int))

text = open("sample_text")
previous = ""
for line in text:
for letter in line:
d_dict_int[previous][letter] += 1
previous = (previous + letter)[-3:]
text.close()

# print the number of times any letter appeared after its (up to) 3 previous letters
for antecedents, letters in sorted(d_dict_int.iteritems()):
print antecedents
for letter, count in sorted(letters.iteritems()):
print " %s - %s" % (letter, count)


You can even be clever and allow for arbitrarily deep dictionaries, for those who don't like to initialize anything.

def inf_dict():
return collections.defaultdict(inf_dict)
d_infdict = inf_dict()
d_infdict[4][5]['foo'][6]['bar'] = 100


One caveat to note. While checking for the existence of an entry does not create it, referencing it will fill it up (you do not need to assign to it), making it unsuitable for try/excepts. It should be easy enough to catch though, as an IndexError clearly makes no sense.

d = collections.defaultdict(int)
print 4 in d # False
print d[4] # 0
print 10 in d # False
print 4 in d # True

No comments: