11.1 SerializationPython supplies a number of modules that deal with I/O operations that serialize (save) entire Python objects to various kinds of byte streams, and deserialize (load and recreate) Python objects back from such streams. Serialization is also called marshaling. 11.1.1 The marshal ModuleThe marshal module supports the specific serialization tasks needed to save and reload compiled Python files (.pyc and .pyo). marshal only handles instances of fundamental built-in data types: None, numbers (plain and long integers, float, complex), strings (plain and Unicode), code objects, and built-in containers (tuples, lists, dictionaries) whose items are instances of elementary types. marshal does not handle instances of user-defined types, nor classes and instances of classes. marshal is faster than other serialization modules. Code objects are supported only by marshal, not by other serialization modules. Module marshal supplies the following functions.
dumps returns a string representing object value. dump writes the same string to file object fileobj, which must be opened for writing in binary mode. dump(v,f) is just like f.write(dumps(v)). fileobj cannot be a file-like object: it must be an instance of type file.
loads creates and returns the object v previously dumped to string str, so that, for any object v of a supported type, v equals loads(dumps(v)). If str is longer than dumps(v), loads ignores the extra bytes. load reads the right number of bytes from file object fileobj, which must be opened for reading in binary mode, and creates and returns the object v represented by those bytes. fileobj cannot be a file-like object: it must be an instance of type file. Functions load and dump are complementary. In other words, a sequence of calls to load(f) deserializes the same values previously serialized when f's contents were created by a sequence of calls to dump(v,f). Objects that are dumped and loaded in this way can be instances of any mix of supported types. Suppose you need to analyze several text files, whose names are given as your program's arguments, and record where each word appears in those files. The data you need to record for each word is a list of (filename, line-number) pairs. The following example uses marshal to encode lists of (filename, line-number) pairs as strings and store them in a DBM-like file (as covered later in this chapter). Since those lists contain tuples, each made up of a string and a number, they are within marshal's abilities to serialize. import fileinput, marshal, anydbm wordPos = { } for line in fileinput.input( ): pos = fileinput.filename( ), fileinput.filelineno( ) for word in line.split( ): wordPos.setdefault(word,[ ]).append(pos) dbmOut = anydbm.open('indexfilem','n') for word in wordPos: dbmOut[word] = marshal.dumps(wordPos[word]) dbmOut.close( ) We also need marshal to read back the data stored to the DBM-like file indexfilem, as shown in the following example: import sys, marshal, anydbm, linecache dbmIn = anydbm.open('indexfilem') for word in sys.argv[1:]: if not dbmIn.has_key(word): sys.stderr.write('Word %r not found in index file\n' % word) continue places = marshal.loads(dbmIn[word]) for fname, lineno in places: print "Word %r occurs in line %s of file %s:" % (word,lineno,fname) print linecache.getline(fname, lineno), 11.1.2 The pickle and cPickle ModulesThe pickle and cPickle modules supply factory functions, named Pickler and Unpickler, to generate objects that wrap file-like objects and supply serialization mechanisms. Serializing and deserializing via these modules is also known as pickling and unpickling. The difference between the modules is that in pickle, Pickler and Unpickler are classes, so you can inherit from these classes to create customized serializer objects, overriding methods as needed. In cPickle, Pickler and Unpickler are factory functions, generating instances of special-purpose types, not classes. Performance is therefore much better with cPickle, but inheritance is not feasible. In the rest of this section, I'll be talking about module pickle, but everything applies to cPickle too. Note that in releases of Python older than the ones covered in this book, unpickling from an untrusted data source was a security risk—an attacker could exploit this to execute arbitrary code. No such weaknesses are known in Python 2.1 and later. Serialization shares some of the issues of deep copying, covered in Section 8.5 in Chapter 8. Module pickle deals with these issues in much the same way as module copy does. Serialization, like deep copying, implies a recursive walk over a directed graph of references. pickle preserves the graph's shape when the same object is encountered more than once, meaning that the object is serialized only the first time, and other occurrences of the same object serialize references to a single copy. pickle also correctly serializes graphs with reference cycles. However, this implies that if a mutable object o is serialized more than once to the same Pickler instance p, any changes to o after the first serialization of o to p are not saved. For clarity and simplicity, I recommend you avoid altering objects that are being serialized while serialization to a single Pickler instance is in progress. pickle can serialize in either an ASCII format or a compact binary one. Although the ASCII format is the default for backward compatibility, you should normally request binary format, as it saves both time and storage space. When you reload objects, pickle transparently recognizes and uses either format. I recommend you always specify binary format: the size and speed savings can be substantial, and binary format has basically no downside except loss of compatibility with very old versions of Python. pickle serializes classes and functions by name, not by value. pickle can therefore deserialize a class or function only by importing it from the same module where the class or function was found when pickle serialized it. In particular, pickle can serialize and deserialize classes and functions only if they are top-level names for their module (i.e., attributes of their module). For example, consider the following: def adder(augend): def inner(addend, augend=augend): return addend+augend return inner plus5 = adder(5) This code binds a closure to name plus5 (as covered in Section 4.10.6.2 in Chapter 4), which is a nested function inner plus an appropriate nested scope. Therefore, trying to pickle plus5 raises a pickle.PicklingError exception: a function can be pickled only when it is top-level, and function inner, whose closure is bound to name plus5 in this code, is not top-level, but rather nested inside function adder. Similar issues apply to other uses of nested functions, and also to nested classes (i.e., classes that are not top-level). 11.1.2.1 Functions of pickle and cPickleModules pickle and cPickle expose the following functions.
dumps returns a string representing object value. dump writes the same string to file-like object fileobj, which must be opened for writing. dump(v,f,bin) is like f.write(dumps(v,bin)). If bin is true, dump uses binary format, so f must be open in binary mode. dump(v,f,bin) is also like Pickler(f,bin).dump(v).
loads creates and returns the object v represented by string str, so that for any object v of a supported type, v= =loads(dumps(v)). If str is longer than dumps(v), loads ignores the extra bytes. load reads the right number of bytes from file-like object fileobj and creates and returns the object v represented by those bytes. If two calls to dump are made in sequence on the same file, two later calls to load from that file deserialize the two objects that dump serialized. load and loads transparently support pickles performed in either binary or ASCII mode. If data is pickled in binary format, the file must be open in binary format for both dump and load. load(f) is like Unpickler(f).load( ).
Creates and returns an object p such that calling p.dump is equivalent to calling function dump with the fileobj and bin argument values passed to Pickler. To serialize many objects to a file, Pickler is more convenient and faster than repeated calls to dump. You can subclass pickle.Pickler to override Pickler methods (particularly method persistent_id) and create your own persistence framework. However, this is an advanced issue, and is not covered further in this book.
Creates and returns an object u such that calling u.load is equivalent to calling function load with the fileobj argument value passed to Unpickler. To deserialize many objects from a file, Unpickler is more convenient and faster than repeated calls to function load. You can subclass pickle.Unpickler to override Unpickler methods (particularly the method persistent_load) and create your own persistence framework. However, this is an advanced issue, and is not covered further in this book. 11.1.2.2 A pickling exampleThe following example handles the same task as the marshal example shown earlier, but uses cPickle instead of marshal to encode lists of (filename, line-number) pairs as strings: import fileinput, cPickle, anydbm wordPos = { } for line in fileinput.input( ): pos = fileinput.filename( ), fileinput.filelineno( ) for word in line.split( ): wordPos.setdefault(word,[ ]).append(pos) dbmOut = anydbm.open('indexfilep','n') for word in wordPos: dbmOut[word] = cPickle.dumps(wordPos[word], 1) dbmOut.close( ) We can use either cPickle or pickle to read back the data stored to the DBM-like file indexfilep, as shown in the following example: import sys, cPickle, anydbm, linecache dbmIn = anydbm.open('indexfilep') for word in sys.argv[1:]: if not dbmIn.has_key(word): sys.stderr.write('Word %r not found in index file\n' % word) continue places = cPickle.loads(dbmIn[word]) for fname, lineno in places: print "Word %r occurs in line %s of file %s:" % (word,lineno,fname) print linecache.getline(fname, lineno), 11.1.2.3 Pickling instance objectsIn order for pickle to reload an instance object x, pickle must be able to import x's class from the same module in which the class was defined when pickle saved the instance. By default, to save the instance-specific state of x, pickle saves x._ _dict_ _, and then, to restore state, reloads x._ _dict_ _. Therefore, all instance attributes (values in x._ _dict_ _) must be instances of types suitable for pickling and unpickling (i.e., a pickleable object). A class can supply special methods to control this process. By default, pickle does not call x._ _init_ _ to restore instance object x. If you do want pickle to call x._ _init_ _, x's class must supply the special method _ _getinitargs_ _. In this case, when pickle saves x, pickle then calls x._ _getinitargs_ _( ), which must return a tuple t. When pickle later reloads x, pickle calls x._ _init_ _(*t) (i.e., the items of tuple t are passed as positional arguments to x._ _init_ _). When x._ _init_ _ returns, pickle restores x._ _dict_ _, overriding attribute values bound by x._ _init_ _. Method _ _getinitargs_ _ is therefore useful only when x._ _init_ _ has other tasks to perform in addition to the task of giving initial values to x's attributes. When x's class has a special method _ _getstate_ _, pickle calls x._ _getstate_ _( ), which normally returns a dictionary d. pickle saves d instead of x._ _dict_ _. When pickle later reloads x, it sets x._ _dict_ _ from d. When x's class supplies special method _ _setstate_ _, pickle calls x._ _setstate_ _(d) for whatever d was saved, rather than x._ _dict_ _.update(d). When x's class supplies both methods _ _getstate_ _ and _ _setstate_ _, _ _getstate_ _ may return any pickleable object y, not just a dictionary, since pickle reloads x by calling x._ _setstate_ _(y). A dictionary is often the handiest type of object for this purpose. As mentioned in "The copy Module" in Chapter 8, special methods _ _getinitargs_ _, _ _getstate_ _, and _ _setstate_ _ are also used to control the way instance objects are copied and deep-copied. If a new-style class defines _ _slots_ _, the class should also define _ _getstate_ _ and _ _setstate_ _, otherwise the class's instances are not pickleable. 11.1.2.4 Pickling customization with the copy_reg moduleYou can control how pickle serializes and deserializes objects of an arbitrary type (not class) by registering factory and reduction functions with module copy_reg. This is useful when you define a type in a C-coded Python extension. Module copy_reg supplies the following functions.
Adds fcon to the table of safe constructors, which lists all factory functions that pickle may call. fcon must be callable, and is normally a function.
Registers function fred as the reduction function for type type, where type must be a type object (not a class). To save any object o of type type, module pickle calls fred(o) and saves fred's result. fred(o) must return a pair (fcon,t) or a tuple (fcon,t,d), where fcon is a safe constructor and t is a tuple. To reload o, pickle calls o=fcon(*t). Then, if fred returned a d, pickle uses d to restore o's state, as in "Pickling of instance objects" (o._ _setstate_ _(d) if o supplies _ _setstate_ _, otherwise o._ _dict_ _.update(d)). If fcon is not None, pickle also calls constructor(fcon) to register fcon as a safe constructor. 11.1.3 The shelve ModuleThe shelve module orchestrates modules cPickle (or pickle, when cPickle is not available in the current Python installation), cStringIO (or StringIO, when cStringIO is not available in the current Python installation), and anydbm (and its underlying modules for access to DBM-like archive files, as discussed later in this chapter) in order to provide a lightweight persistence mechanism. shelve supplies a function open that is polymorphic to anydbm.open. The mapping object s returned by shelve.open is less limited than the mapping object a returned by anydbm.open. a's keys and values must be strings. s's keys must also be strings, but s's values may be of any type or class that pickle can save and restore. pickle customizations (e.g., copy_reg, _ _getinitargs_ _, _ _getstate_ _, and _ _setstate_ _) also apply to shelve, since shelve delegates serialization to pickle. Beware a subtle trap when you use shelve and mutable objects. When you operate on a mutable object held in a shelf, the changes don't take unless you assign the changed object back to the same index. For example: import shelve s = shelve.open('data') s['akey'] = range(4) print s['akey'] # prints: [0, 1, 2, 3] s['akey'].append('moreover') # trying direct mutation print s['akey'] # doesn't take; prints: [0, 1, 2, 3] x = s['akey'] # fetch the object x.append('moreover') # perform mutation s['akey'] = x # store the object back print s['akey'] # now it takes, prints: [0, 1, 2, 3, 'moreover'] The following example handles the same task as the pickling example earlier, but uses shelve to persist lists of (filename, line-number) pairs: import fileinput, shelve wordPos = { } for line in fileinput.input( ): pos = fileinput.filename( ), fileinput.filelineno( ) for word in line.split( ): wordPos.setdefault(word,[ ]).append(pos) shOut = shelve.open('indexfiles','n') for word in wordPos: shOut[word] = wordPos[word] shOut.close( ) We must use shelve to read back the data stored to the DBM-like file indexfiles, as shown in the following example: import sys, shelve, linecache shIn = shelve.open('indexfiles') for word in sys.argv[1:]: if not shIn.has_key(word): sys.stderr.write('Word %r not found in index file\n' % word) continue places = shIn[word] for fname, lineno in places: print "Word %r occurs in line %s of file %s:" % (word,lineno,fname) print linecache.getline(fname, lineno), These two examples are the simplest and most direct of the various equivalent pairs of examples shown throughout this section. This reflects the fact that module shelve is higher level than the modules used in previous examples. |