10.6 Compressed Files
Although
storage space and transmission bandwidth are increasingly cheap and
abundant, in many cases you can save such resources, at the expense
of some computational effort, by using compression. Since
computational power grows cheaper and more abundant even faster than
other resources, such as bandwidth, compression's
popularity keeps growing. Python makes it easy for your programs to
support compression by supplying dedicated modules for compression as
part of every Python distribution.
10.6.1 The gzip Module
The gzip module lets
you read and write files compatible with those handled by the
powerful GNU compression programs gzip and
gunzip. The GNU programs support several
compression formats, but module gzip supports only
the highly effective native gzip format,
normally denoted by appending the extension .gz
to a filename. Module gzip supplies the
GzipFile class and an open
factory function.
class GzipFile(filename=None,mode=None,compresslevel=9,fileobj=None)
Creates and returns a file-like object f
that wraps the file or file-like object
fileobj. f
supplies all methods of built-in file objects except
seek and tell. Thus,
f is not seekable: you can only access
f sequentially, whether for reading or
writing. When fileobj is
None, filename must be
a string that names a file: GzipFile opens that
file with the given mode (by default,
'rb'), and f wraps the
resulting file object. mode should be one
of 'ab', 'rb',
'wb', or None. If
mode is None,
f uses the mode of
fileobj if it is able to find out the
mode; otherwise it uses 'rb'. If
filename is None,
f uses the filename of
fileobj if able to find out the name;
otherwise it uses ''.
compresslevel is an integer between
1 and 9: 1
requests modest compression but fast operation, and
9 requests the best compression feasible, even if
that requires more computation.
File-like object f generally delegates all
methods to the underlying file-like object
fileobj, transparently accounting for
compression as needed. However, f does not
allow non-sequential access, so f does not
supply methods seek and tell.
Moreover, calling
f.close does
not close fileobj
when f was created with an argument
fileobj that is not
None. This behavior of
f.close is very
important when fileobj is an instance of
StringIO.StringIO, since it means you can call
fileobj.getvalue after
f.close to get the
compressed data as a string. This behavior also means that you have
to call fileobj.close
explicitly after calling
f.close.
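The point that f.close leaves fileobj open can be sketched as follows. This is only an illustration, and it assumes a current Python 3 interpreter, where io.BytesIO takes the place of StringIO.StringIO and compressed data is a bytes object rather than a string. The names payload, buffer, compressed, and restored are hypothetical.

```python
import gzip
import io

payload = b"four score and seven years ago\n" * 100

# The in-memory buffer stands in for fileobj (assumption: Python 3's io.BytesIO).
buffer = io.BytesIO()
f = gzip.GzipFile(fileobj=buffer, mode="wb")
f.write(payload)
f.close()                        # finishes the gzip stream, leaves buffer open

compressed = buffer.getvalue()   # still allowed: f.close() did not close buffer
buffer.close()                   # closing fileobj remains the caller's job

# Reading back uses a second wrapper around a fresh in-memory buffer.
reader = gzip.GzipFile(fileobj=io.BytesIO(compressed), mode="rb")
restored = reader.read()
reader.close()
```

Calling buffer.getvalue before f.close would risk getting an incomplete stream, since the wrapper writes the final part of the compressed data only when it is closed.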
open(filename,mode='rb',compresslevel=9)
Like
GzipFile(filename,mode,compresslevel),
but filename is mandatory and there is no
provision for passing an already opened
fileobj.
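As a rough sketch of the open factory function, the following writes the same data at two compression levels and reads each file back. It assumes Python 3 (bytes data) and uses a temporary file so that the current directory need not be writable. The helper gzip_size and the variable names are hypothetical, not part of module gzip.

```python
import gzip
import os
import tempfile

data = b"spam and eggs\n" * 1000

def gzip_size(level):
    """Write data with gzip.open at the given level and return the file size."""
    fd, path = tempfile.mkstemp(suffix=".gz")
    os.close(fd)
    try:
        f = gzip.open(path, "wb", level)   # third argument is compresslevel
        try:
            f.write(data)
        finally:
            f.close()
        # Verify that the file decompresses to the original data.
        g = gzip.open(path, "rb")
        try:
            assert g.read() == data
        finally:
            g.close()
        return os.path.getsize(path)
    finally:
        os.remove(path)

fast_size = gzip_size(1)   # modest compression, fast operation
best_size = gzip_size(9)   # best compression feasible
```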
Say that you have some function
f(x)
that writes data to a text file object x,
typically by calling
x.write and/or
x.writelines. Getting
f to write data to a
gzip-compressed text file instead is easy:
import gzip
underlying_file = open('x.txt.gz', 'wb')
compressing_wrapper = gzip.GzipFile(fileobj=underlying_file, mode='wt')
f(compressing_wrapper)
compressing_wrapper.close( )
underlying_file.close( )
This example opens the underlying binary file
x.txt.gz and explicitly wraps it with
gzip.GzipFile, and thus, at the end, we need to
close each object separately. This is necessary because we want to
use two different modes: the underlying file must be opened in binary
mode (any translation of line endings would produce an invalid
compressed file), but the compressing wrapper must be opened in text
mode because we want the implicit translation of
os.linesep to \n. Reading back
a compressed text file, for example to display it on standard output,
is similar:
import gzip, xreadlines
underlying_file = open('x.txt.gz', 'rb')
uncompressing_wrapper = gzip.GzipFile(fileobj= underlying_file, mode='rt')
for line in xreadlines.xreadlines(uncompressing_wrapper):
    print line,
uncompressing_wrapper.close( )
underlying_file.close( )
This example uses module xreadlines, covered
earlier in this chapter, because GzipFile objects
(at least up to Python 2.2) are not iterable like true file objects,
nor do they supply an xreadlines method.
GzipFile objects do supply a
readlines method that closely emulates that of
true file objects, and therefore module xreadlines
is able to produce a lazy sequence that wraps a
GzipFile object and lets us iterate on the
GzipFile object's lines.
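For comparison, here is a hedged sketch of the same round trip on a much later interpreter. It assumes Python 3.3 or later, where gzip.open accepts the text modes 'wt' and 'rt' and the object it returns can be iterated directly, so no helper module is needed. In those versions GzipFile itself accepts only binary modes. The file name and variable names are hypothetical.

```python
import gzip
import os
import tempfile

lines_written = ["four score\n", "and seven\n", "years ago\n"]

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "x.txt.gz")

# Assumption: Python 3.3+, where gzip.open supports text modes.
with gzip.open(path, "wt", encoding="utf-8") as out:
    out.writelines(lines_written)

# The text-mode object is iterable line by line.
with gzip.open(path, "rt", encoding="utf-8") as src:
    lines_read = [line for line in src]

os.remove(path)
os.rmdir(tmpdir)
```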
10.6.2 The zipfile Module
The zipfile module
lets you read and write ZIP files (i.e., archive files compatible
with those handled by popular compression programs
zip and unzip,
pkzip and pkunzip,
WinZip, and so on). Detailed information on the
formats and capabilities of ZIP files can be found at http://www.pkware.com/appnote.html and
http://www.info-zip.org/pub/infozip/. You
need to study this detailed information in order to perform advanced
ZIP file handling with module
zipfile.
Module zipfile can't handle ZIP
files with appended comments, multidisk ZIP files, or
.zip archive members using compression types
besides the usual ones, known as stored (when a file is copied to the
archive without compression) and deflated (when a file is compressed
using the ZIP format's default algorithm). For
invalid .zip file errors, functions of module
zipfile raise exceptions that are instances of
exception class zipfile.error. Module
zipfile supplies the following classes and
functions.
is_zipfile(filename)
Returns True if the file named by string
filename appears to be a valid ZIP file,
judging by the first few bytes of the file; otherwise returns
False.
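A small sketch of this check, assuming Python 3 and hypothetical file names created in a temporary directory:

```python
import os
import tempfile
import zipfile

tmpdir = tempfile.mkdtemp()
zip_path = os.path.join(tmpdir, "sample.zip")
txt_path = os.path.join(tmpdir, "sample.txt")

# Build a tiny but valid archive.
z = zipfile.ZipFile(zip_path, "w")
try:
    # Assumption: Python 3 also accepts a plain member name here.
    z.writestr("hello.txt", b"hello")
finally:
    z.close()

# Build an ordinary text file that is not an archive.
with open(txt_path, "w") as plain:
    plain.write("just text, not an archive\n")

is_zip = zipfile.is_zipfile(zip_path)
is_txt_zip = zipfile.is_zipfile(txt_path)

for p in (zip_path, txt_path):
    os.remove(p)
os.rmdir(tmpdir)
```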
class ZipInfo(filename='NoName',date_time=(1980,1,1,0,0,0))
Methods getinfo and infolist of
ZipFile instances return instances of
ZipInfo to supply information about members of the
archive. The most useful attributes supplied by a
ZipInfo instance z are:
- comment
-
A string that is a comment on the archive member
- compress_size
-
Size in bytes of the compressed data for the archive member
- compress_type
-
An integer code recording the type of compression of the archive
member
- date_time
-
A tuple with 6 integers recording the time of last modification to
the file: the items are year, month, day (1 and
up), hour, minute, second (0 and up)
- file_size
-
Size in bytes of the uncompressed data for the archive member
- filename
-
Name of the file in the archive
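The attributes listed above can be inspected as in the following sketch. It assumes Python 3, where ZipFile also accepts a file-like object such as io.BytesIO in place of a filename. The member name and timestamp are made up for the illustration.

```python
import io
import zipfile

archive = io.BytesIO()   # assumption: Python 3 accepts a file-like object here

z = zipfile.ZipFile(archive, "w")
try:
    zinfo = zipfile.ZipInfo("notes.txt", (2001, 9, 9, 1, 46, 40))
    # A ZipInfo built by hand defaults to ZIP_STORED, so request deflation.
    zinfo.compress_type = zipfile.ZIP_DEFLATED
    z.writestr(zinfo, b"a" * 1000)
finally:
    z.close()

z = zipfile.ZipFile(archive, "r")
try:
    info = z.getinfo("notes.txt")
    summary = (info.filename, info.date_time, info.file_size, info.compress_type)
    shrunk = info.compress_size < info.file_size
finally:
    z.close()
```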
class ZipFile(filename,mode='r',compression=zipfile.ZIP_STORED)
Opens a ZIP file named by string filename.
mode can be 'r', to
read an existing ZIP file; 'w', to write a new ZIP
file or truncate and rewrite an existing one; or
'a', to append to an existing file.
When mode is 'a',
filename can name either an existing ZIP
file (in which case new members are added to the existing archive) or
an existing non-ZIP file. In the latter case, a new ZIP file-like
archive is created and appended to the existing file. The main
purpose of this latter case is to let you build a self-unpacking
.exe file (i.e., a Windows executable file that
unpacks itself when run). The existing file must then be a fresh copy
of an unpacking .exe prefix, as supplied by
www.info-zip.org or by other
purveyors of ZIP file compression tools.
compression is an integer code that can be
either of two attributes of module zipfile.
zipfile.ZIP_STORED requests that the archive use
no compression, and zipfile.ZIP_DEFLATED requests
that the archive use the deflation mode of
compression (i.e., the most usual and effective compression approach
used in .zip files).
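To see the effect of the compression argument, one might compare the two codes on the same data. The sketch below assumes Python 3 and an in-memory archive. The helper member_sizes and the member name are hypothetical.

```python
import io
import zipfile

data = b"four score and seven years ago\n" * 200

def member_sizes(compression):
    """Return (file_size, compress_size) for data stored with the given code."""
    archive = io.BytesIO()
    z = zipfile.ZipFile(archive, "w", compression)
    try:
        # A plain member name makes writestr use the archive's compression.
        z.writestr("speech.txt", data)
    finally:
        z.close()
    z = zipfile.ZipFile(archive, "r")
    try:
        info = z.getinfo("speech.txt")
        return info.file_size, info.compress_size
    finally:
        z.close()

stored = member_sizes(zipfile.ZIP_STORED)
deflated = member_sizes(zipfile.ZIP_DEFLATED)
```

With ZIP_STORED both sizes equal the length of the data, while ZIP_DEFLATED reports a smaller compress_size for repetitive input like this.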
A ZipFile instance z
supplies the following methods.
z.close()
Closes archive file
z. Make sure the close
method is called, or else an incomplete and unusable ZIP file might
be left on disk. Such mandatory finalization is generally best
performed with a try/finally
statement, as covered in Chapter 6.
z.getinfo(name)
Returns a ZipInfo instance that supplies
information about the archive member named by string
name.
z.infolist()
Returns a list of ZipInfo instances, one for each
member in archive z, in the same order as
the entries in the archive itself.
z.namelist()
Returns a list of strings, the names of each member in archive
z, in the same order as the entries in the
archive itself.
z.printdir()
Outputs a textual directory of the archive
z to file sys.stdout.
z.read(name)
Returns a string containing the uncompressed bytes of the file named
by string name in archive
z. z must be
opened for 'r' or 'a'. When the
archive does not contain a file named
name, read raises an
exception.
z.testzip()
Reads and checks the files in archive z.
Returns a string with the name of the first archive member that is
damaged, or None when the archive is intact.
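The inspection methods can be exercised together, as in this sketch. It assumes Python 3, where the exception raised for a missing member is KeyError and member data comes back as bytes. The member names are hypothetical.

```python
import io
import zipfile

archive = io.BytesIO()

z = zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED)
try:
    z.writestr("first.txt", b"one\n")
    z.writestr("second.txt", b"two\n")
finally:
    z.close()   # mandatory: close writes the archive's directory

z = zipfile.ZipFile(archive, "r")
try:
    names = z.namelist()                               # archive order
    sizes = [info.file_size for info in z.infolist()]  # same order
    first = z.read("first.txt")
    damaged = z.testzip()                              # None when intact
    try:
        z.read("missing.txt")
        missing_raised = False
    except KeyError:                                   # assumption: Python 3
        missing_raised = True
finally:
    z.close()
```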
z.write(filename,arcname=None,compress_type=None)
Writes the file named by string filename
to archive z, with archive member name
arcname. When
arcname is None,
write uses filename as
the archive member name. When
compress_type is None,
write uses
z's compression type;
otherwise, compress_type is
zipfile.ZIP_STORED or
zipfile.ZIP_DEFLATED, and specifies how to
compress the file. z must be opened for
'w' or 'a'.
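A sketch of write with both optional arguments, assuming Python 3 and hypothetical file and member names in a temporary directory:

```python
import os
import tempfile
import zipfile

tmpdir = tempfile.mkdtemp()
src_path = os.path.join(tmpdir, "report_draft.txt")
zip_path = os.path.join(tmpdir, "out.zip")

with open(src_path, "wb") as src:
    src.write(b"quarterly numbers\n")

# The archive defaults to ZIP_STORED; this one member overrides it.
z = zipfile.ZipFile(zip_path, "w", zipfile.ZIP_STORED)
try:
    z.write(src_path, arcname="docs/report.txt",
            compress_type=zipfile.ZIP_DEFLATED)
finally:
    z.close()

z = zipfile.ZipFile(zip_path, "r")
try:
    stored_names = z.namelist()
    member_type = z.getinfo("docs/report.txt").compress_type
    content = z.read("docs/report.txt")
finally:
    z.close()

os.remove(src_path)
os.remove(zip_path)
os.rmdir(tmpdir)
```

The member is recorded under arcname, so the path of the file on disk does not appear in the archive.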
z.writestr(zinfo,bytes)
zinfo must be a ZipInfo
instance specifying at least filename and
date_time.
bytes is a string of bytes.
writestr adds a member to archive
z, using the metadata specified by
zinfo and the data in
bytes. z must
be opened for 'w' or 'a'. When
you have data in memory and need to write the data to the ZIP file
archive z, it's simpler
and faster to use
z.writestr rather than
z.write. The latter
approach would require you to write the data to disk first, and later
remove the useless disk file. The following example shows both
approaches, each encapsulated into a function, polymorphic to each
other:
import zipfile
def data_to_zip_direct(z, data, name):
    import time
    zinfo = zipfile.ZipInfo(name, time.localtime( )[:6])
    z.writestr(zinfo, data)
def data_to_zip_indirect(z, data, name):
    import os
    flob = open(name, 'wb')
    flob.write(data)
    flob.close( )
    z.write(name)
    os.unlink(name)
zz = zipfile.ZipFile('z.zip', 'w', zipfile.ZIP_DEFLATED)
data = 'four score\nand seven\nyears ago\n'
data_to_zip_direct(zz, data, 'direct.txt')
data_to_zip_indirect(zz, data, 'indirect.txt')
zz.close( )
Besides being faster and more concise,
data_to_zip_direct is handier because, by working
in memory, it doesn't need to have the current
working directory be writable, as
data_to_zip_indirect does. Of course, method
write also has its uses, but
that's mostly when you already have the data in a
file on disk, and just want to add the file to the archive.
Here's how you can print a list of all files
contained in the ZIP file archive created by the previous example,
followed by each file's name and contents:
import zipfile
zz = zipfile.ZipFile('z.zip')
zz.printdir( )
for name in zz.namelist( ):
    print '%s: %r' % (name, zz.read(name))
zz.close( )
10.6.3 The zlib Module
The zlib module
lets Python programs use the free InfoZip zlib
compression library (see http://www.info-zip.org/pub/infozip/zlib/),
Version 1.1.3 or later. Module zlib is used by
modules gzip and zipfile, but
the module is also available directly for any special compression
needs. This section documents the most commonly used functions
supplied by module zlib.
Module zlib also supplies functions to compute
Cyclic-Redundancy Check (CRC) checksums, in order to detect possible
damage in compressed data. It also provides objects that can compress
and decompress data incrementally, and thus enable you to work with
data streams that are too large to fit in memory at once. For such
advanced functionality, consult the Python library's
online reference.
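As a brief, hedged illustration of those two facilities, the sketch below compresses data chunk by chunk, decompresses it in small pieces, and keeps a running CRC along the way. It assumes Python 3 (bytes data). The chunk contents and the 64-byte piece size are arbitrary choices for the example.

```python
import zlib

chunks = [b"alpha " * 500, b"beta " * 500, b"gamma " * 500]

# Incremental compression: feed chunks one at a time, then flush.
compressor = zlib.compressobj(9)
pieces = [compressor.compress(chunk) for chunk in chunks]
pieces.append(compressor.flush())
stream = b"".join(pieces)

# Incremental decompression in 64-byte slices, with a running CRC
# computed over the decompressed output.
decompressor = zlib.decompressobj()
restored = b""
running_crc = 0
for start in range(0, len(stream), 64):
    part = decompressor.decompress(stream[start:start + 64])
    running_crc = zlib.crc32(part, running_crc)
    restored += part
tail = decompressor.flush()
running_crc = zlib.crc32(tail, running_crc)
restored += tail

# CRC of the original data computed in one call, for comparison.
whole_crc = zlib.crc32(b"".join(chunks))
```

A real streaming program would write each decompressed part out instead of accumulating it in restored, which is done here only to make the result easy to check.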
Note that files containing data compressed with
zlib are not automatically interchangeable with
other programs, with the exception of files that use the
zipfile module and therefore respect the standard
format of ZIP file archives. You could write a custom program, with
any language able to use InfoZip's free
zlib compression library, in order to read files
produced by Python programs using the zlib module.
However, if you do need to interchange compressed data with programs
coded in other languages, I suggest you use modules
gzip or zipfile instead. Module
zlib may be useful when you want to compress some
parts of data files that are in some proprietary format of your own,
and need not be interchanged with any other program except those that
make up your own application.
compress(str,level=6)
Compresses string str and returns the
string of compressed data. level is an
integer between 1 and 9:
1 requests modest compression but fast operation,
and 9 requests compression as good as feasible,
thus requiring more computation.
decompress(str)
Decompresses the compressed data string
str and returns the string of uncompressed
data.
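A minimal round trip through these two functions might look like the following, assuming Python 3, where both take and return bytes objects. The variable names are hypothetical.

```python
import zlib

original = b"four score and seven years ago\n" * 50

fast = zlib.compress(original, 1)   # modest compression, fast operation
best = zlib.compress(original, 9)   # compression as good as feasible

roundtrip_ok = (zlib.decompress(fast) == original and
                zlib.decompress(best) == original)

# Data that was not produced by zlib is rejected with zlib.error.
try:
    zlib.decompress(b"not zlib data")
    rejected = False
except zlib.error:
    rejected = True
```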