11.3 The Berkeley DB Module
Python comes with the
bsddb module, which wraps the Berkeley Database
library (also known as BSD DB) if that library is installed on your
system and your Python installation is built to support it. With the
BSD DB library, you can create hash, binary tree, or record-based
files that generally behave like dictionaries. On Windows, Python
includes a port of the BSD DB library, thus ensuring that module
bsddb is always usable. To download BSD DB
sources, binaries for other platforms, and detailed documentation on
BSD DB, see http://www.sleepycat.com. Module
bsddb supplies three factory functions,
btopen, hashopen, and
rnopen.
btopen(filename,flag='r',*many_other_optional_arguments)
hashopen(filename,flag='r',*many_other_optional_arguments)
rnopen(filename,flag='r',*many_other_optional_arguments)
|
|
btopen opens or creates the binary tree format
file named by filename (a string that
denotes any path to a file, not just a name), and returns a suitable
BTree object to access and manipulate the file.
Argument flag has exactly the same values
and meaning as for anydbm.open. Other arguments
indicate low-level options that allow fine-grained control, but are
rarely used.
hashopen and rnopen work the
same way, but open or create hash format and record format files,
returning objects of type Hash and
Record. hashopen is generally
the fastest format and makes sense when you are using keys to look up
records. However, if you also need to access records in sorted order,
use btopen, or if you need to access records in
the same order in which you originally wrote them, use
rnopen. Using hashopen does not
keep records in order in the file.
An object b of any of the types
BTree, Hash, and
Record can be indexed as a mapping, with both keys
and values constrained to being strings. Further,
b also supports sequential access through
the concept of a current record.
b supplies the following
methods.
Closes b. Call no other method on
b after
b.close( ).
Sets b's current record
to the first record, and returns a pair
(key,value)
for the first record. The order of records is arbitrary, except for
BTree objects, which ensure records are sorted in
alphabetical order of their keys.
b.first( ) raises
KeyError if b is empty.
Returns True if string
key is a key in
b, otherwise returns
False.
Returns the list of
b's key strings. The
order is arbitrary, except for BTree objects,
which return keys in alphabetical order.
Sets
b's current record to the
last record and returns a pair
(key,value)
for the last record. Type Hash does not supply
method last.
Sets b's current record
to the next record and returns a pair
(key,value)
for the next record. b.next(
) raises KeyError if
b has no next record.
Sets b's current record
to the previous record and returns a pair
(key,value)
for the previous record. Type Hash does not supply
method previous.
Sets b's current record
to the item with string key key, and
returns a pair
(key,value).
If key is not a key in
b, and b is of
type BTree,
b.set_location(key)
sets b's current record
to the item whose key is the smallest key larger than
key and returns that key/value pair. For
other object types, set_location raises
KeyError if key is not
a key in b.
11.3.1 Examples of Berkeley DB Use
The Berkeley DB is suited to tasks similar to those for which
DBM-like files are appropriate. Indeed, anydbm
uses dbhash, the DBM-like interface to the
Berkeley DB, to create new DBM-like files. In addition, the Berkeley
DB can also use other file formats when you use module
bsddb explicitly. The binary tree format, while
not quite as fast as the hashed format when all you need is keyed
access, is excellent when you also need to access keys in
alphabetical order.
The following example handles the same task as the DBM example shown
earlier, but uses bsddb rather than
anydbm:
import fileinput, os, bsddb
wordPos = { }
sep = os.pathsep
for line in fileinput.input( ):
pos = '%s%s%s'%(fileinput.filename( ), sep, fileinput.filelineno( ))
for word in line.split( ):
wordPos.setdefault(word,[ ]).append(pos)
btOut = bsddb.btopen('btindex','n')
sep2 = sep * 2
for word in wordPos:
btOut[word] = sep2.join(wordPos[word])
btOut.close( )
The differences between this example and the DBM one are minimal:
writing a new binary tree format file with bsddb
is basically the same task as writing a new DBM-like file with
anydbm. Reading back the data using
bsddb.btopen('btindex') rather than
anydbm.open('indexfile') is similarly trivial. To
illustrate the extra features of binary trees regarding access to
keys in alphabetical order, we'll perform a slightly
more general task. The following example treats its command-line
arguments as specifying the beginning of words, and prints the lines
in which any word with such a beginning appears:
import sys, os, bsddb, linecache
btIn = bsddb.btopen('btindex')
sep = os.pathsep
sep2 = sep * 2
for word in sys.argv[1:]:
key, pos = btIn.set_location(word)
if not key.startswith(word):
sys.stderr.write('Word-start %r not found in index file\n' % word)
while key.startswith(word):
places = pos.split(sep2)
for place in places:
fname, lineno = place.split(sep)
print "%r occurs in line %s of file %s:" % (word,lineno,fname)
print linecache.getline(fname, int(lineno)),
try: key, pos = btIn.next( )
except IndexError: break
This example exploits the fact that
btIn.set_location sets
btIn's current position to the
smallest key larger than word, when
word itself is not a key in
btIn. When word is a
word-beginning, and keys are words, this means that
set_location sets the current position to the
first word, in alphabetical order, that starts with
word. The tests with
key.startswith(word)
let us check that we're still scanning words with
that beginning, and terminate the while loop when
that is no longer the case. We perform the first such test in an
if statement, right before the
while, because we want to single out the case
where no word at all starts with the desired beginning, and output an
error message in that specific case.
|