The number of bits used to represent the quantized DCT coefficients is reduced by ‘coding,’ which takes advantage of some of the statistical properties of the coefficients. After quantization, many of the DCT coefficients often – the vast majority of the high-frequency coefficients – are zero. A technique called ‘run-length coding’ takes advantage of this fact by grouping consecutive zero-valued coefficients (a ‘run’) and encoding the number of coefficients (the ‘length’) instead of encoding the individual zero-valued coefficients.
Run-length coding is typically followed by variable-length coding (VLC). In variable-length coding, commonly occurring symbols (representing quantized DCT coefficients or runs of zero-valued quantized coefficients) are represented using code words that contain only a few bits, while less common symbols are represented with longer code words. By using fewer bits for the most common symbols, VLC reduces the average number of bits required to encode a symbol thereby reducing the number of bits required to encode the entire image.
On the decompression side, variable-length decoding (VLD) reverses the steps performed by the VLC block in the compression algorithm. Variable-length decoding is much more computationally demanding than variable-length coding. VLC performs one table lookup per symbol (where a symbol is encoded using multiple bits); in contrast, the most straightforward implementation of VLD requires a table lookup and some simple decision making to be applied for each bit. VLD requires an average of about 11 operations per input bit. Thus, the processing requirements of VLD are proportional to the video codec’s selected bit rate. VLD can consume as much as 25% of the cycles spent in a video decoder implementation.
In a typical video decompression algorithm, the straightforward VLD implementation described above (which operates on one bit at a time) requires several kilobytes of lookup table memory. It is possible to improve the performance of the VLD by operating on multiple bits at a time, but this optimization requires the use of much larger lookup tables.
Some of the newer standards (such as H.264) replace or augment the run-length coding and VLC techniques described above to achieve greater compression. For example, H.264 supports both CAVLC (context-adaptive VLC) and CABAC (context-adaptive arithmetic coding). CAVLC augments VLC by adapting the coding scheme based on previously-coded coefficients. CABAC entirely replaces VLC by using instead a more efficient – but also more computationally demanding – scheme of arithmetic coding. CABAC can consume as many as 50% of the cycles in an H.264 decoder.
Looking at a bigger picture
All of the techniques described so far operate on each 8×8 block independently from any other block. Since images typically contain features that are much larger than an 8×8 block, more efficient compression can be achieved by taking into account the similarities between adjacent blocks in the image.
To take advantage of inter-block similarities, a prediction step is often added prior to quantization of the transform coefficients. In this step, codecs attempt to predict the image information within a block using the information from the surrounding blocks. Some codecs (such as MPEG-4) perform this step in the frequency domain, by predicting DCT coefficients. Other codecs (such as H.264) do this step in the spatial domain, and predict pixels directly. The latter approach is called ‘intra prediction.’
In this step, the encoder attempts to predict the values of some of the DCT coefficients (if done in the frequency domain) or pixel values (if done in the spatial domain) in each block based on the coefficients or pixels in the surrounding blocks. The encoder then computes the difference between the actual value and the predicted value and encodes the difference rather than the actual value. At the decoder, the coefficients are reconstructed by performing the same prediction and then adding the difference transmitted by the encoder. Because the difference tends to be small compared to the actual coefficient values, this technique reduces the number of bits required to represent the DCT coefficients.
In predicting the DCT coefficient or pixel values of a particular block, the decoder has access only to the values of surrounding blocks that have already been decoded. Therefore, the encoder must predict the DCT coefficients or pixel values of each block based only on the values from previously encoded surrounding blocks.
JPEG uses a very rudimentary DCT coefficient predic tion scheme, in which only the lowest-frequency coefficient (the ‘DC coefficient’) is predicted using simple differential coding. MPEG-4 video uses a more sophisticated scheme that attempts to predict the first DCT coefficient in each row and each column of the 8×8 block. This scheme is referred to as ‘AC-DC prediction.’
AC-DC prediction can require a substantial amount of processing power and some implementations require large data arrays. However, AC-DC prediction is typically used in a small fraction of time – when motion compensation (discussed later) is not used – and usually has a negligible impact on average processor load.
As mentioned earlier, in H.264 the prediction is done on pixels directly, and the DCT-like integer transform always processes a residual – either from motion estimation or from intra-prediction. In H.264, the pixel values are never transformed directly as they are in JPEG or MPEG-4 I-frames. As a result, the decoder has to decode the transform coefficients and perform the inverse transform in order to obtain the residual, which is added to the predicted pixels.