11.3 The Berkeley DB Module

Python comes with the bsddb module, which wraps the Berkeley Database library (also known as BSD DB) if that library is installed on your system and your Python installation is built to support it. With the BSD DB library, you can create hash, binary tree, or record-based files that generally behave like dictionaries. On Windows, Python includes a port of the BSD DB library, thus ensuring that module bsddb is always usable. To download BSD DB sources, binaries for other platforms, and detailed documentation on BSD DB, see http://www.sleepycat.com. Module bsddb supplies three factory functions, btopen, hashopen, and rnopen.

btopen, hashopen, rnopen

btopen(filename,flag='r',*many_other_optional_arguments)
hashopen(filename,flag='r',*many_other_optional_arguments)
rnopen(filename,flag='r',*many_other_optional_arguments)

btopen opens or creates the binary tree format file named by filename (a string that denotes any path to a file, not just a name), and returns a suitable BTree object to access and manipulate the file. Argument flag has exactly the same values and meaning as for anydbm.open. Other arguments indicate low-level options that allow fine-grained control, but are rarely used.

hashopen and rnopen work the same way, but open or create hash format and record format files, returning objects of type Hash and Record. hashopen is generally the fastest format and makes sense when you are using keys to look up records. However, if you also need to access records in sorted order, use btopen, or if you need to access records in the same order in which you originally wrote them, use rnopen. Using hashopen does not keep records in order in the file.

An object b of any of the types BTree, Hash, and Record can be indexed as a mapping, with both keys and values constrained to being strings. Further, b also supports sequential access through the concept of a current record. b supplies the following methods.

close

b.close(  )

Closes b. Call no other method on b after b.close( ).

first

b.first(  )

Sets b's current record to the first record, and returns a pair (key,value) for the first record. The order of records is arbitrary, except for BTree objects, which ensure records are sorted in alphabetical order of their keys. b.first( ) raises KeyError if b is empty.

has_key

b.has_key(key)

Returns True if string key is a key in b, otherwise returns False.

keys

b.keys(  )

Returns the list of b's key strings. The order is arbitrary, except for BTree objects, which return keys in alphabetical order.

last

b.last(  )

Sets b's current record to the last record and returns a pair (key,value) for the last record. Type Hash does not supply method last.

next

b.next(  )

Sets b's current record to the next record and returns a pair (key,value) for the next record. b.next( ) raises KeyError if b has no next record.

previous

b.previous(  )

Sets b's current record to the previous record and returns a pair (key,value) for the previous record. Type Hash does not supply method previous.

set_location

b.set_location(key)

Sets b's current record to the item with string key key, and returns a pair (key,value). If key is not a key in b, and b is of type BTree, b.set_location(key) sets b's current record to the item whose key is the smallest key larger than key and returns that key/value pair. For other object types, set_location raises KeyError if key is not a key in b.

11.3.1 Examples of Berkeley DB Use

The Berkeley DB is suited to tasks similar to those for which DBM-like files are appropriate. Indeed, anydbm uses dbhash, the DBM-like interface to the Berkeley DB, to create new DBM-like files. In addition, the Berkeley DB can also use other file formats when you use module bsddb explicitly. The binary tree format, while not quite as fast as the hashed format when all you need is keyed access, is excellent when you also need to access keys in alphabetical order.

The following example handles the same task as the DBM example shown earlier, but uses bsddb rather than anydbm:

import fileinput, os, bsddb
wordPos = {  }
sep = os.pathsep
for line in fileinput.input(  ):
    pos = '%s%s%s'%(fileinput.filename(  ), sep, fileinput.filelineno(  ))
    for word in line.split(  ):
        wordPos.setdefault(word,[  ]).append(pos)
btOut = bsddb.btopen('btindex','n')
sep2 = sep * 2
for word in wordPos:
    btOut[word] = sep2.join(wordPos[word])
btOut.close(  )

The differences between this example and the DBM one are minimal: writing a new binary tree format file with bsddb is basically the same task as writing a new DBM-like file with anydbm. Reading back the data using bsddb.btopen('btindex') rather than anydbm.open('indexfile') is similarly trivial. To illustrate the extra features of binary trees regarding access to keys in alphabetical order, we'll perform a slightly more general task. The following example treats its command-line arguments as specifying the beginning of words, and prints the lines in which any word with such a beginning appears:

import sys, os, bsddb, linecache
btIn = bsddb.btopen('btindex')
sep = os.pathsep
sep2 = sep * 2

for word in sys.argv[1:]:
    key, pos = btIn.set_location(word)
    if not key.startswith(word):
         sys.stderr.write('Word-start %r not found in index file\n' % word)
    while key.startswith(word):
        places = pos.split(sep2)
        for place in places:
            fname, lineno = place.split(sep)
            print "%r occurs in line %s of file %s:" % (word,lineno,fname)
            print linecache.getline(fname, int(lineno)),
        try: key, pos = btIn.next(  )
        except IndexError: break

This example exploits the fact that btIn.set_location sets btIn's current position to the smallest key larger than word, when word itself is not a key in btIn. When word is a word-beginning, and keys are words, this means that set_location sets the current position to the first word, in alphabetical order, that starts with word. The tests with key.startswith(word) let us check that we're still scanning words with that beginning, and terminate the while loop when that is no longer the case. We perform the first such test in an if statement, right before the while, because we want to single out the case where no word at all starts with the desired beginning, and output an error message in that specific case.