The format known as MPEG-1, Layer III (or MP3 for short) was developed in the late 1980s and early 1990s and was finalized in November 1992 by the Moving Picture Experts Group (MPEG) as part of the original MPEG-1 standard. The MPEG committee is a gathering of scientists and engineers who work under the auspices of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). The members of the MPEG group are responsible for establishing standards for digital coding of moving pictures and audio. (See the sidebar "About the Motion Pictures Expert Group".)
MP3 is more than a simple compression scheme. Most people are familiar with file compressors such as zip. But if you've ever tried to zip up a WAV file, you've probably found that raw audio doesn't compress well at all. Compression shaves only a tiny percentage from the original file size. Instead, MP3 gets most of its compression from the science of psychoacoustics -- the modeling of human auditory perception. The theory is that uncompressed audio streams carry a lot of data that isn't actually perceived by humans, for a variety of reasons. The logic follows: why store data that can't be perceived? MP3 encoders analyze audio streams and compare them to mathematical models of human psychoacoustics -- a far more complex and mathematically intensive process than simple zip compression. The process is time- and processor-intensive (compared to zip, anyway), but it has the benefit of achieving more effective compression.
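You can get a rough feel for the problem with a few lines of Python. The sketch below is only an illustration: it builds one second of synthetic 16-bit audio (two sine tones plus low-level random noise, an assumption standing in for the fine detail of a real recording) and runs it through zlib, which implements the same Deflate method that zip uses. The function names are made up for this example.

```python
import math
import random
import zlib


def synthetic_pcm(seconds=1.0, sample_rate=44100, seed=0):
    """Return 16-bit mono PCM bytes: two sine tones plus low-level noise."""
    rng = random.Random(seed)
    out = bytearray()
    for n in range(int(seconds * sample_rate)):
        t = n / sample_rate
        tone = (8000 * math.sin(2 * math.pi * 440 * t)
                + 4000 * math.sin(2 * math.pi * 1320 * t))
        # Assumption: random noise stands in for real-world recording detail.
        noise = rng.randint(-512, 511)
        out += int(tone + noise).to_bytes(2, "little", signed=True)
    return bytes(out)


def compression_ratio(data):
    """Compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data, 9)) / len(data)


pcm = synthetic_pcm()
zeros = bytes(len(pcm))  # the same number of bytes, all zero

print(f"noisy PCM ratio: {compression_ratio(pcm):.2f}")
print(f"all-zero ratio:  {compression_ratio(zeros):.4f}")
```

The all-zero buffer shrinks to almost nothing, while the noisy audio keeps most of its size. Real recordings will give different numbers, but the pattern is the same: a general-purpose compressor finds little exact repetition in sampled sound.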
Of course, the act of discarding data results in an imperfect audio stream by definition. No MP3 file contains all the data found in the original uncompressed source stream. But in practice, MP3s can be created with high enough quality to render them indistinguishable from the source to even the most discerning listener. At mid-level bitrates (quality levels), MP3 streams can be indistinguishable from the source to the ears of most people. The trick is in finding the best possible balance or compromise between file size and quality.
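To put the size-versus-quality tradeoff in numbers, the following sketch compares uncompressed CD-quality audio with an MP3 stream. The CD parameters (44.1 kHz, 16 bits, two channels) are standard; the 128 kbps MP3 bitrate and the four-minute track length are assumptions chosen only for illustration, and the function names are hypothetical.

```python
def pcm_bitrate(sample_rate=44100, bits_per_sample=16, channels=2):
    """Bits per second of uncompressed PCM audio."""
    return sample_rate * bits_per_sample * channels


def file_size_bytes(bitrate_bps, seconds):
    """Approximate size of a stream at a constant bitrate (headers ignored)."""
    return bitrate_bps * seconds // 8


cd_bps = pcm_bitrate()    # uncompressed CD audio
mp3_bps = 128_000         # assumed MP3 bitrate for this example
track_seconds = 4 * 60    # assumed track length

print("CD audio bitrate:", cd_bps, "bits/s")
print("WAV size:", file_size_bytes(cd_bps, track_seconds), "bytes")
print("MP3 size:", file_size_bytes(mp3_bps, track_seconds), "bytes")
print("Reduction: about", round(cd_bps / mp3_bps, 1), "to 1")
```

Raising the bitrate shrinks that ratio and leaves the encoder more room to preserve detail; lowering it does the opposite.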
As stated earlier, MP3 is a perceptual audio coding scheme. MP3 encoders analyze an audio signal and compare it to psychoacoustic models representing limitations in human auditory perception. They then encode as much useful information as possible given the restrictions set by the bitrate and sampling frequency chosen in the encoder application. The encoding process draws on a number of distinct techniques, including the minimal audition threshold, the masking effect, the reservoir of bytes, joint stereo, and Huffman encoding, each of which is described in turn.
There are many aspects of audio to which the human ear is insensitive. For starters, most audio streams cover a frequency range much broader than that which humans can hear. The human hearing range generally falls between 20 Hz and 20 kHz, and the ear is most sensitive between 2 kHz and 4 kHz. As people grow older, their auditory acuity diminishes; many adults cannot hear tones higher than 16 kHz. MP3 encoders can immediately discard frequencies above or below the audible range. Within that range, the minimal audition threshold represents the lowest level at which the human ear will perceive a sound at a given frequency. It is not necessary to code sounds that fall below this threshold, because they won't be perceived.
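The threshold can be approximated with a simple formula. The sketch below uses Terhardt's widely quoted approximation of the threshold of hearing in quiet, which is not necessarily the model any particular MP3 encoder uses; the function names and the 20 Hz to 20 kHz cutoff check are assumptions made for this example.

```python
import math


def threshold_in_quiet_db(freq_hz):
    """Approximate threshold of hearing in dB SPL (Terhardt's formula)."""
    khz = freq_hz / 1000.0
    return (3.64 * khz ** -0.8
            - 6.5 * math.exp(-0.6 * (khz - 3.3) ** 2)
            + 1e-3 * khz ** 4)


def is_audible(freq_hz, level_db_spl):
    """True if a tone lies in the audible range and above the threshold."""
    in_range = 20 <= freq_hz <= 20000
    return in_range and level_db_spl > threshold_in_quiet_db(freq_hz)


# Find the frequency at which this model says the ear is most sensitive.
most_sensitive_hz = min(range(20, 20001, 10), key=threshold_in_quiet_db)

for freq in (100, 1000, 3300, 16000):
    print(f"{freq:>6} Hz: {threshold_in_quiet_db(freq):6.1f} dB SPL")
print("most sensitive near", most_sensitive_hz, "Hz")
```

The curve dips lowest in the 2 kHz to 4 kHz region and climbs steeply toward both ends of the spectrum, so a quiet tone that is easy to hear at 3 kHz may be inaudible at 100 Hz or 16 kHz. An encoder can skip any component for which a check like `is_audible` fails.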
When two or more sounds are played simultaneously and one sound is louder than the other, the louder sound hides or "masks" the softer one (this concept is also discussed in Chapter 2, "The Science of Sound and Digital Audio"). If you record both sounds, the softer sound is still present in the recorded spectrum. However, since the softer sound is masked and therefore imperceptible, it can be safely removed from the recording. Similarly, if two tones are close together on the frequency spectrum, the ear may not be able to tell them apart. However, if the two tones are sufficiently far apart, they will be independently perceptible and must both be encoded. Removing undetectable (or barely detectable) sounds from the recording cuts file size dramatically.
Masking between sounds that occur at the same time is called "auditory masking." A loud sound can also hide softer sounds that occur just before or after it, an effect called "temporal masking." Both may best be understood by analogy. If you watch a flying bird, its outline may be distinct against the sky. But if the bird passes in front of the sun, the sun's brightness completely overpowers the bird's outline. As the bird moves toward the other edge of the sun, it becomes visible again. The same principle applies to masking effects in MP3 encoding.
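A toy model can make the idea of simultaneous masking more concrete. The sketch below is deliberately simplified and is not the psychoacoustic model defined in the MPEG standard: it converts frequencies to the Bark (critical-band) scale using Zwicker's approximation, then lets a loud "masker" tone cast a triangular masking threshold over its neighbors. The offset and slope constants are illustrative assumptions, and all names are hypothetical.

```python
import math


def hz_to_bark(freq_hz):
    """Zwicker's approximation of the Bark (critical-band) scale."""
    return (13 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500) ** 2))


# Illustrative values only; these are assumptions, not figures from the standard.
MASKER_OFFSET_DB = 10   # threshold peak sits this far below the masker's level
SLOPE_BELOW_DB = 27     # falloff per Bark toward lower frequencies
SLOPE_ABOVE_DB = 10     # shallower falloff per Bark toward higher frequencies


def masking_threshold(masker_hz, masker_db, probe_hz):
    """Level below which a tone at probe_hz is hidden by the masker."""
    distance = hz_to_bark(probe_hz) - hz_to_bark(masker_hz)
    slope = SLOPE_ABOVE_DB if distance >= 0 else SLOPE_BELOW_DB
    return masker_db - MASKER_OFFSET_DB - slope * abs(distance)


def is_masked(masker, probe):
    """masker and probe are (frequency_hz, level_db) pairs."""
    masker_hz, masker_db = masker
    probe_hz, probe_db = probe
    return probe_db < masking_threshold(masker_hz, masker_db, probe_hz)


loud_tone = (1000, 80)
print(is_masked(loud_tone, (1100, 40)))  # quiet tone close in frequency
print(is_masked(loud_tone, (5000, 40)))  # same level, far away in frequency
```

In this model the quiet tone at 1,100 Hz disappears under the loud 1 kHz tone, while the same quiet tone at 5 kHz remains audible and would have to be encoded. Temporal masking would need a similar rule along the time axis, which this sketch leaves out.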
MP3 files are stored as a series of "frames," which can be thought of much like the frames that make up a movie. Each frame carries only a fraction of a second's worth of audio data and begins with a header describing the bitrate, encoding method, and other metadata pertaining to that frame. In some cases, a portion of the audio stream can be adequately encoded with room left over in its frame. The "reservoir of bytes" lets the MP3 encoder "borrow" that unused space to store data from nearby frames that need more room than their own frame provides. This space-lending scheme helps to keep both the data rate and the perceived quality consistent.
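The arithmetic behind frames is easy to sketch. An MPEG-1 Layer III frame holds 1,152 samples per channel, and its length in bytes follows from the bitrate and sampling frequency. The reservoir simulation that follows is a toy model under stated assumptions: it ignores header and side-information bytes, treats each frame's "demand" as a made-up number, and caps the reservoir at 511 bytes. The function names are hypothetical.

```python
SAMPLES_PER_FRAME = 1152  # MPEG-1 Layer III


def frame_length_bytes(bitrate_bps, sample_rate_hz, padding=0):
    """Frame length for MPEG-1 Layer III; padding is 0 or 1 byte."""
    return 144 * bitrate_bps // sample_rate_hz + padding


def frame_duration_ms(sample_rate_hz):
    """How much audio one frame represents, in milliseconds."""
    return 1000 * SAMPLES_PER_FRAME / sample_rate_hz


def simulate_reservoir(demands, frame_capacity, max_reservoir=511):
    """Toy model: unused bytes from earlier frames are lent to later ones."""
    reservoir = 0
    granted = []
    for need in demands:
        available = frame_capacity + reservoir
        used = min(need, available)
        granted.append(used)
        # Whatever is left over is carried forward, up to the cap.
        reservoir = min(max_reservoir, available - used)
    return granted


capacity = frame_length_bytes(128_000, 44_100)
print("frame length:", capacity, "bytes")
print("frame duration:", round(frame_duration_ms(44_100), 2), "ms")

# Two easy frames, then a demanding one (hypothetical byte counts).
print(simulate_reservoir([300, 300, 600, 417], capacity))
```

In the simulated run, the third frame receives all 600 bytes it asks for even though a single frame holds only 417, because the first two frames left space behind. Without the reservoir it would have been capped at the frame size, and quality would have dipped at exactly the moment the music became more complex.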
While not an essential part of the MP3 encoding process, joint stereo is an option in most encoders and is typically enabled by default. When joint stereo is enabled, stereo sound is represented using a mixture of true stereo and monophonic sound, along with some spatializing information. Joint stereo is useful because humans cannot locate very low and very high frequencies in space with the same precision as midrange frequencies. The MP3 format exploits this fact by encoding very low and very high frequencies in mono, thereby saving storage space in the resulting file. To save additional storage space, try encoding in mono. To make sure you capture all possible spatial data, encode in full stereo mode. Most users find that joint stereo is adequate for most purposes.
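One way to see why shared information between channels saves space is mid/side coding, a joint stereo technique related to the mono folding described above (it is a different technique, shown here because it is easy to demonstrate). Instead of storing left and right separately, the encoder stores their average ("mid") and half their difference ("side"). The sketch below is a minimal illustration with hypothetical function names; a real encoder applies this to frequency-domain data, not raw samples.

```python
def to_mid_side(left, right):
    """Convert left/right sample lists to mid (average) and side (difference)."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def from_mid_side(mid, side):
    """Recover the original left/right samples."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


# Mostly centered material: the two channels differ only in the last sample.
left = [0.5, -0.25, 0.75]
right = [0.5, -0.25, 0.5]

mid, side = to_mid_side(left, right)
print("mid: ", mid)
print("side:", side)
print("round trip ok:", from_mid_side(mid, side) == (left, right))
```

When the two channels are nearly identical, the side signal is almost all zeros and costs very little to store. That is the saving joint stereo is after.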
As mentioned earlier, compressing a WAV file with zip doesn't shave much off the file size, which is why psychoacoustics is employed. However, the MP3 encoding process does also employ the classic Huffman encoding algorithm. After all psychoacoustic methods have been applied, the Huffman encoding pass seeks out and compresses any remaining redundancies in the bit pattern. It's as though zip-type encoding were being run internally on the psychoacoustically encoded data. While psychoacoustic coding is great at dealing with polyphonic sections, it's not as efficient when dealing with highly repetitive, or "pure," sections. The Huffman pass, on the other hand, is great at handling redundancies, for the same reason a text file filled with a million zeros will compress to almost nothing.
The Huffman encoding pass is very rapid and saves about 20% in file size, on average. It therefore makes a perfect complement to perceptual coding techniques.
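The principle behind Huffman coding fits in a short sketch: frequent symbols get short bit patterns and rare symbols get long ones. The example below builds a code table from the data itself, which is the textbook form of the algorithm and an assumption made for illustration; MP3 encoders choose from predefined Huffman tables rather than building a new one for each file. The function name is hypothetical.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Return a dict mapping each symbol to its Huffman bit string."""
    freq = Counter(symbols)
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tiebreaker, {symbol: code so far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        w1, _, codes1 = heapq.heappop(heap)
        w2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


data = "aaaaaaaabbbc"
table = huffman_code(data)
encoded = "".join(table[s] for s in data)

print(table)
print(len(encoded), "bits instead of", 8 * len(data))
```

Here the common symbol "a" gets a one-bit code while "b" and "c" get two bits each, so twelve characters fit in 16 bits. The more lopsided the symbol frequencies, the greater the saving, which is why the pass works so well on repetitive material.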
A plethora of players
While we cover only a few players in this chapter, there are literally hundreds of MP3 players on the market for virtually every operating system. Some are free; some require a small fee; some are basic and lightweight; others are full-featured and sometimes even bloated. Some work from the command line, others run in a normal window, and still others operate within funky, irregularly shaped interfaces. Check the software libraries for a list of MP3 players for your operating system.
Note that when AOL purchased Nullsoft in late 1999, it made Winamp freeware. Traditional audio players have acquired MP3 playback capabilities as well, including Liquid Audio's LiquidPlayer (see the sidebar "Liquid Audio: building a viable e-music system" later in this chapter), RealNetworks' RealPlayer (and its popular RealJukebox), Apple's QuickTime, and Microsoft's Windows Media Player.
Copyright © 2002 O'Reilly & Associates. All rights reserved.