Understanding video compression technology

Images

An image can be considered as a two-dimensional matrix. If we take color into account, we can make a generalization: consider this image as a three-dimensional matrix - the extra dimension is used to store color information.

If we choose the three primary colors (red, green, and blue) to represent these colors, this defines three planes: the first is the red plane, the second is the green plane, and the last is the blue plane.

We call each point in this matrix a pixel (picture element). The color of a pixel is represented by the intensity (usually a numeric value) of each of the three primary colors. For example, a red pixel has green at intensity 0, blue at intensity 0, and red at maximum intensity. A pink pixel can be formed by a combination of all three colors. With intensity values ranging from 0 to 255, red 255, green 192, and blue 203 represent pink.

The following pictures are an example. The first one contains all the color planes. The others are the red, green, and blue planes, shown as gray tones (high intensities appear bright, low intensities appear dark).

We can see that the red plane contributes the most to the final image (it is the brightest of the three planes), while the blue plane (the last image) contributes mostly to Mario's eyes and part of his clothes. All color planes contribute little to Mario's mustache (the darkest part).

Storing the intensity of a color takes a certain amount of data, called the color depth. If the intensity of each color (plane) takes 8 bits (values from 0 to 255), the color depth is 24 bits (8 x 3), and we can represent 2 to the 24th power (16,777,216) different colors.
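As a small illustration of the ideas so far, the sketch below stores an image as rows of (red, green, blue) tuples, pulls out one color plane, and computes the color depth. The tiny 2x2 image and the helper name plane are made up for this example.

```python
# A 2x2 image: each pixel is a (red, green, blue) tuple with 8-bit intensities.
image = [
    [(255, 0, 0), (255, 192, 203)],   # a red pixel and a pink pixel
    [(0, 0, 0), (255, 255, 255)],     # a black pixel and a white pixel
]


def plane(image, channel):
    """Extract one color plane (0 = red, 1 = green, 2 = blue)."""
    return [[pixel[channel] for pixel in row] for row in image]


red_plane = plane(image, 0)

bits_per_channel = 8
color_depth = bits_per_channel * 3      # bits per pixel
number_of_colors = 2 ** color_depth     # distinct colors that fit in that depth

print(red_plane)
print(color_depth, number_of_colors)
```

The red plane is just a two-dimensional matrix of intensities, which is why it can be shown as a gray-tone picture.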

Another property of an image is resolution, which is the number of pixels in a plane. This is usually expressed as width * height.

Another property of an image or video is aspect ratio, which simply describes the proportional relationship between the width and height of an image or pixel.

When people say that a movie or photo is 16:9, they are usually referring to the Display Aspect Ratio (DAR). However, individual pixels can also have different shapes, which we call the Pixel Aspect Ratio (PAR).

Image compression strategies

To compress an image, we can take advantage of a visual property: we distinguish brightness more sharply than we do color.

Our eyes are more sensitive to brightness than to color. You can test this yourself by looking at the image below.

If you can't see that squares A and B on the left are the same color, that's fine: our brain is playing a trick on us, because it pays more attention to light and dark than to color. On the right there is a connector of the same color joining the two squares, so we (our brain) can easily see that they are in fact the same color.

Digital storage of color

The principles we first learned about color images use the RGB model, but there are other models as well. One model separates luminance (lightness) from chroma (color) and it is called YCbCr*.

* There are many models that do the same separation.

This color model uses Y for luminance and two color channels: Cb (blue chroma) and Cr (red chroma). YCbCr can be derived from RGB and converted back to RGB. Using this model we can create full-color images, as shown in the following figure.

Conversion between YCbCr and RGB

One might ask how we can represent all the colors without using green (chroma).

To answer this question, we will introduce the conversion from RGB to YCbCr. We will use the coefficients from the standard BT.601 recommended by the ITU-R group*.

The first step is to calculate the luminance. We use the constants recommended by the ITU and plug in the RGB values.

Once we have the luminance, we can split the colors (blue chroma and red chroma):

And we can also use YCbCr to convert back and even get the green color.
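As a rough sketch of these three steps, the following Python uses the BT.601 luma weights (0.299, 0.587, 0.114). The function names are illustrative, and the chroma values are left zero-centered and unrounded, which is an assumption that skips the offsets and range scaling that real codecs apply.

```python
# BT.601 luma weights for red, green and blue.
KR, KG, KB = 0.299, 0.587, 0.114


def rgb_to_ycbcr(r, g, b):
    """Step 1: luminance. Step 2: blue and red chroma as scaled differences."""
    y = KR * r + KG * g + KB * b
    cb = (b - y) / (2 * (1 - KB))   # same as 0.564 * (B - Y), rounded
    cr = (r - y) / (2 * (1 - KR))   # same as 0.713 * (R - Y), rounded
    return y, cb, cr


def ycbcr_to_rgb(y, cb, cr):
    """Step 3: convert back. Green is whatever luminance is left over."""
    r = y + 2 * (1 - KR) * cr       # R = Y + 1.402 * Cr
    b = y + 2 * (1 - KB) * cb       # B = Y + 1.772 * Cb
    g = (y - KR * r - KB * b) / KG  # roughly Y - 0.344 * Cb - 0.714 * Cr
    return r, g, b


y, cb, cr = rgb_to_ycbcr(255, 192, 203)   # the pink pixel from earlier
print(y, cb, cr)
print(ycbcr_to_rgb(y, cb, cr))
```

Green never needs its own chroma channel because Y already contains a weighted share of it, so it can be recovered from Y, Cb, and Cr.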

*Organizations and standards are common in the digital video world. They usually define what the standards are, e.g., what is 4K? What frame rate, resolution, and color model should we use?

Chroma subsampling

Once we can separate luminance and chroma from an image, we can take advantage of the fact that the human visual system is more sensitive to luminance than chroma to selectively cull information. Chroma subsampling is a technique for encoding images so that the chroma resolution is lower than the luminance.

How much should we reduce the chroma resolution? There are established schemes that define how to handle the resolution and the merge (final color = Y + Cb + Cr).

These schemes are known as subsampling systems and are expressed as a three-part ratio, a:x:y, which defines the chroma resolution relative to an a x 2 block of luma pixels.

a is the horizontal sample reference (usually 4).

x is the number of chroma samples in the first row (relative to the horizontal resolution of a), and

y is the number of chroma samples in the second row.

The one exception that exists is 4:1:0, which provides one chroma sample within each block of 4 x 4 luminance plane resolution.

Common schemes used in modern codecs are: 4:4:4 (no subsampling), 4:2:2, 4:1:1, 4:2:0, 4:1:0 and 3:1:1.

The image below shows the same image encoded with several of the main chroma subsampling schemes. The first row is the final YCbCr image, while the last row shows the chroma resolution. It is a big win for such a small loss in quality.

With YCbCr 4:2:0, each pixel needs 12 bits on average instead of 24, so the size is halved.
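To see where the saving comes from, here is a minimal sketch that turns an a:x:y ratio into an average number of bits per pixel, assuming 8 bits per sample. The function name is hypothetical, and the formula follows the a x 2 block definition above, so it does not apply to the 4:1:0 exception.

```python
def bits_per_pixel(scheme, bits_per_sample=8):
    """Average bits per pixel for an a:x:y scheme over an a x 2 luma block."""
    a, x, y = (int(part) for part in scheme.split(":"))
    luma_samples = a * 2            # one Y sample for every pixel in the block
    chroma_samples = 2 * (x + y)    # x + y samples for each of Cb and Cr
    return bits_per_sample * (luma_samples + chroma_samples) / luma_samples


for scheme in ("4:4:4", "4:2:2", "4:1:1", "4:2:0"):
    print(scheme, bits_per_pixel(scheme))
```

Under these assumptions 4:4:4 costs 24 bits per pixel and 4:2:0 costs 12, which is the halving mentioned above.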

Video

Now we can define a video as a succession of n frames per unit of time, where time can be seen as another dimension. n is the frame rate, or frames per second (FPS) if the unit of time is seconds.

The amount of data per second needed to play a video is its bit rate.

Bitrate = Width * Height * Color Depth * Frames Per Second

For example, a video with 30 frames per second, 24 bits per pixel, and a resolution of 480x240 would require 82,944,000 bits per second or 82.944 Mbps (30x480x240x24) if we did not do any compression.
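The same arithmetic can be written as a few lines of Python. The function name is made up for this sketch; the numbers are the ones from the example above.

```python
def uncompressed_bitrate(width, height, color_depth_bits, fps):
    """Bits per second needed to play raw, uncompressed video."""
    return width * height * color_depth_bits * fps


bps = uncompressed_bitrate(480, 240, 24, 30)
print(f"{bps:,} bit/s = {bps / 1_000_000} Mbps")
```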

When the bit rate is almost constant it is called constant bit rate (CBR); but it can also vary, and then it is called variable bit rate (VBR).

This graphic shows a constrained VBR that doesn't take much data when the frame is black.

Video Compression Strategies

Leaving video uncompressed is not an option: a single hour-long video at 720p and 30fps would require 278GB*. Simply using a lossless data compression algorithm such as DEFLATE (used by PKZIP, Gzip, and PNG) won't reduce the required bandwidth enough either, so we need to find other ways to compress video.

*We arrive at this number by multiplying 1280 x 720 x 24 x 30 x 3600 (width, height, bits per pixel, fps, and seconds).
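The product in the footnote gives a size in bits. The sketch below converts it to bytes and shows that the 278GB figure matches when a gigabyte is counted as 2^30 bytes, which is an assumption about the units used here; with 10^9 bytes per gigabyte the result would be about 298.6GB.

```python
width, height = 1280, 720
bits_per_pixel = 24
fps = 30
seconds = 3600

total_bits = width * height * bits_per_pixel * fps * seconds
total_bytes = total_bits // 8

print(total_bytes / 2**30)   # size in units of 2^30 bytes
print(total_bytes / 10**9)   # size in units of 10^9 bytes
```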

Exploiting visual properties: we distinguish brightness more sharply than we do color.

Temporal repetition: a video contains many images that change only slightly.

Intra-image repetition: each frame also contains many areas of the same or similar color.

The first of these, exploiting visual properties, is the same image compression strategy described above. If we use YCbCr 4:2:0 we can cut the size in half (139GB), but that is still far from ideal.

We arrive at this value by multiplying width, height, bits per pixel, fps, and seconds. Previously we needed 24 bits per pixel; now we only need 12.

Video Terminology

Now we'll go further and eliminate temporal redundancy, but before we do that let's define some basic terminology. Let's say we have a 30fps movie, and these are the first 4 frames.

An I-frame (also called a keyframe or intra-coded frame) is a self-contained frame. It doesn't rely on anything else to be rendered, so an I-frame is similar to a still image. The first frame is usually an I-frame, but we will see I-frames inserted periodically between other types of frames.

A P-frame (predicted) takes advantage of the fact that the current frame can almost always be rendered using the previous frame. For example, in the second frame, the only change is that the ball has moved forward. We can rebuild the second frame using only the difference and a reference to the previous frame.

What if we referenced both the previous and the next frames to get even better compression? That is basically what a B-frame (bi-directional prediction) does.

These frame types are used to provide better compression, and we will see how in the following sections. For now, we can think of I-frames as expensive, P-frames as cheap, and B-frames as the cheapest.

Temporal repetition

Let's explore how to remove repetition in time. The technique used to remove this type of redundancy is inter-frame prediction.

We will try to spend fewer bits encoding frames 0 and 1, which are consecutive in time.

We could do a subtraction where we simply subtract frame 0 from frame 1 to get the residuals, so that we only need to encode the residuals!

But there is a better way that uses even fewer bits. First, we treat frame 0 as a collection of well-defined blocks, and then we try to match the blocks of frame 1 against the blocks of frame 0. We can think of this as motion estimation.

We could estimate that the ball moved from x=0, y=25 to x=6, y=26; the x and y values are the motion vectors. A further way to save bits is to encode only the difference between the two positions, so the final motion vector is x=6 (6-0), y=1 (26-25).

In practice, this ball will be sliced into n partitions, but the processing is the same.

We can see that when we use motion estimation, the amount of data to encode is smaller than with the simple residual-frame technique.
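The comparison can be sketched with a toy example. The code below builds two 32x32 frames with a 4x4 "ball" at the positions used above, counts the non-zero residuals left by plain subtraction, and then finds the motion vector with an exhaustive block search based on the sum of absolute differences (SAD). The frame size, ball size, search range, and function names are all assumptions made for illustration; real encoders use faster searches and more elaborate cost functions.

```python
def make_frame(width, height, ball_x, ball_y, size=4):
    """Black frame (0) with a bright square 'ball' (200) at (ball_x, ball_y)."""
    frame = [[0] * width for _ in range(height)]
    for y in range(ball_y, ball_y + size):
        for x in range(ball_x, ball_x + size):
            frame[y][x] = 200
    return frame


def sad(cur, cx, cy, prev, px, py, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(
        abs(cur[cy + j][cx + i] - prev[py + j][px + i])
        for j in range(size)
        for i in range(size)
    )


def estimate_motion(prev, cur, cx, cy, size=4, search=8):
    """Search prev for the best match of the block at (cx, cy) in cur.

    Returns ((dx, dy), cost), where (dx, dy) is how far the block moved
    from its position in prev to its position in cur.
    """
    height, width = len(prev), len(prev[0])
    best_vector, best_cost = (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            px, py = cx - dx, cy - dy  # candidate position in the previous frame
            if not (0 <= px <= width - size and 0 <= py <= height - size):
                continue
            cost = sad(cur, cx, cy, prev, px, py, size)
            if best_cost is None or cost < best_cost:
                best_vector, best_cost = (dx, dy), cost
    return best_vector, best_cost


frame0 = make_frame(32, 32, ball_x=0, ball_y=25)
frame1 = make_frame(32, 32, ball_x=6, ball_y=26)

# Plain subtraction: every pixel that differs needs a non-zero residual.
plain_nonzero = sum(
    1
    for row0, row1 in zip(frame0, frame1)
    for a, b in zip(row0, row1)
    if a != b
)

vector, block_cost = estimate_motion(frame0, frame1, cx=6, cy=26)
print(plain_nonzero)       # non-zero residuals with plain subtraction
print(vector, block_cost)  # motion vector and leftover residual for the ball block
```

In this toy case plain subtraction leaves 32 non-zero residuals (the ball's old and new positions), while the matched block is described by the vector (6, 1) with nothing left over.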

Intra-image repetition

Let's explore how to remove repetition within a single image. The technique used to remove this type of redundancy is intra-frame prediction.

If we analyze each frame in a video, we will see that many regions are correlated with their neighbors.

Here is an I-frame that we can't predict using the previous frame, but we can still compress it. We're going to encode that red region we chose. If we look around it, we can estimate the change in color around it.

We predict that the colors in the frame will continue vertically, which means that each unknown pixel takes the color of its vertical neighbor.

Our prediction can be wrong, so we apply this technique (intra-frame prediction) first and then subtract the actual values to get the residual. The result is a matrix that is much easier to compress than the original data.
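A minimal sketch of vertical prediction follows, assuming a 4x4 block whose prediction is simply the row of already-known pixels above it copied downward. The sample values and function names are invented for illustration, and the residual is taken as predicted minus actual, following the wording above.

```python
def predict_vertical(top_row, height):
    """Copy the known row above the block down into every row of the block."""
    return [list(top_row) for _ in range(height)]


def residual(predicted, actual):
    """Residual as predicted - actual, element by element."""
    return [
        [p - a for p, a in zip(p_row, a_row)]
        for p_row, a_row in zip(predicted, actual)
    ]


def reconstruct(predicted, res):
    """Decoder side: recover the actual block from prediction and residual."""
    return [
        [p - r for p, r in zip(p_row, r_row)]
        for p_row, r_row in zip(predicted, res)
    ]


top_row = [100, 102, 104, 106]   # pixels just above the block (already known)
actual = [
    [100, 102, 104, 106],
    [100, 103, 104, 106],
    [101, 102, 104, 105],
    [100, 102, 104, 106],
]

predicted = predict_vertical(top_row, len(actual))
res = residual(predicted, actual)
print(res)
```

The residual matrix is mostly zeros with a few values of 1 or -1, which is far cheaper to encode than the original intensities around 100.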

Video Container VS Video Encoder

A container can be seen as a wrapper format that holds the metadata of the video (and possibly audio as well); MP4 is one example.

The compressed video produced by the encoder can be seen as the payload the container carries, for example H.264-encoded content.

Video Codec

What? It's the software or hardware used to compress or decompress digital video.

Why? People need to get better video quality out of limited bandwidth or storage space.

How?

We're going to go over the main mechanisms behind a generic video codec. Most of these concepts are useful and are used by modern codecs such as VP9, AV1, and HEVC. Note that we will simplify a lot. Sometimes we will use real examples (mainly H.264) to demonstrate a technique.

Step 1 - Partitioning the frames

The first step is to divide the frame into several partitions, sub-partitions, and beyond.

Codecs usually organize these partitions into slices (or tiles), macroblocks (or coding tree units), and many sub-partitions. The maximum partition size varies: HEVC allows up to 64x64 while AVC uses 16x16, and sub-partitions can be as small as 4x4.
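To get a feel for the numbers, this sketch counts how many top-level blocks a 1280x720 frame splits into for the two sizes mentioned above. The helper name is hypothetical, and rounding up assumes that partial blocks at the frame edge still count as a block.

```python
import math


def block_grid(width, height, block_size):
    """Number of block columns and rows needed to cover the frame."""
    cols = math.ceil(width / block_size)
    rows = math.ceil(height / block_size)
    return cols, rows


for name, size in (("16x16 blocks", 16), ("64x64 blocks", 64)):
    cols, rows = block_grid(1280, 720, size)
    print(name, cols, "x", rows, "=", cols * rows)
```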

Step 2 - Prediction

Once we have the partitions, we can make predictions on top of them. For inter-frame prediction, we need to send the motion vectors and the residual, and for intra-frame prediction, we need to send the prediction direction and the residual.

Step 3 - Transformation

After we get the residual block (predicted partition - real partition), we can transform it so that we know which pixels we can discard while keeping the overall quality. Several transforms can do this.

Although there are other transforms, we focus on the Discrete Cosine Transform (DCT). The main features of the DCT are:

Converts blocks of pixels into blocks of frequency coefficients of the same size.

Compacts energy, making it easier to eliminate spatial redundancy.

Reversible, which also means you can revert back to pixels.
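These three properties can be checked with a small pure-Python sketch of an orthonormal 2D DCT (DCT-II) and its inverse. This is only an illustration on a 4x4 block with made-up values; real codecs typically use fast integer approximations rather than floating-point cosines.

```python
import math


def dct_1d(values):
    """Orthonormal DCT-II of a list of samples."""
    n = len(values)
    out = []
    for k in range(n):
        total = sum(
            values[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)
        )
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * total)
    return out


def idct_1d(coeffs):
    """Inverse of dct_1d (an orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = 0.0
        for k in range(n):
            scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
            total += scale * coeffs[k] * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(total)
    return out


def transpose(matrix):
    return [list(row) for row in zip(*matrix)]


def dct_2d(block):
    """Apply the 1D transform to every row, then to every column."""
    rows = [dct_1d(row) for row in block]
    return transpose([dct_1d(col) for col in transpose(rows)])


def idct_2d(coeffs):
    rows = [idct_1d(row) for row in coeffs]
    return transpose([idct_1d(col) for col in transpose(rows)])


flat = [[10, 10, 10, 10] for _ in range(4)]
smooth = [[10, 11, 12, 13], [11, 12, 13, 14], [12, 13, 14, 15], [13, 14, 15, 16]]

flat_coeffs = dct_2d(flat)       # same 4x4 size as the input block
restored = idct_2d(dct_2d(smooth))
print(flat_coeffs[0][0])
print([[round(v) for v in row] for row in restored])
```

For the flat block all the energy lands in the single top-left coefficient, and the smooth block comes back unchanged after the inverse transform (up to floating-point rounding).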

Step 4 - Quantization

When we throw away some of the coefficients produced in the previous step (transform), we are doing a form of quantization. In this step we selectively discard information (the lossy part); put simply, we quantize the coefficients to achieve compression.

How do we quantize a block of coefficients? A simple method is uniform quantization, where we take a block, divide it by a single value (10), and round the results.
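Uniform quantization with a step of 10 can be sketched as below. The coefficient values and function names are invented for illustration, and rounding to the nearest integer is one reasonable choice among several.

```python
import math


def quantize(coeffs, step=10):
    """Divide every coefficient by the step and round to the nearest integer."""
    return [[int(math.floor(c / step + 0.5)) for c in row] for row in coeffs]


def dequantize(levels, step=10):
    """Decoder side: multiply back. The rounding error is lost for good."""
    return [[level * step for level in row] for row in levels]


coeffs = [
    [312.0, -41.0, 6.0, 2.0],
    [-38.0, 12.0, -3.0, 1.0],
    [4.0, -4.0, 2.0, 0.0],
    [1.0, 2.0, -1.0, 0.0],
]

levels = quantize(coeffs)
approx = dequantize(levels)
print(levels)
print(approx)
```

Most of the small coefficients become zero, which is what makes the block cheap to encode, and the reconstructed values differ from the originals by at most half a step.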

Step 5 - Entropy Coding

After we have quantized the data (image blocks/slices/frames), we can still compress it in a lossless way. There are many methods (algorithms) available to compress data. We will briefly go through a few of them and you can read this great book to understand them better: Understanding Compression: Data Compression for Modern Developers.
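The text does not single out one algorithm, so as one possible illustration here is Huffman coding, a classic lossless method that gives shorter bit strings to more frequent symbols. The input list stands in for a block of quantized coefficients and is made up for this sketch; it is not how any particular codec does its entropy coding.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Build a prefix code (symbol -> bit string) from symbol frequencies."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries: (count, tie_breaker, {symbol: code_so_far}).
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        count1, _, codes1 = heapq.heappop(heap)
        count2, _, codes2 = heapq.heappop(heap)
        merged = {sym: "0" + code for sym, code in codes1.items()}
        merged.update({sym: "1" + code for sym, code in codes2.items()})
        heapq.heappush(heap, (count1 + count2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(symbols, code):
    return "".join(code[sym] for sym in symbols)


def decode(bits, code):
    reverse = {bit_string: sym for sym, bit_string in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:
            out.append(reverse[current])
            current = ""
    return out


quantized = [15, 0, 0, 0, -1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
code = huffman_code(quantized)
bits = encode(quantized, code)
print(code)
print(len(bits), "bits instead of", 2 * len(quantized), "with a fixed 2-bit code")
```

Because zero is by far the most common value it gets a 1-bit code, and decoding the bit string returns exactly the original list, so nothing is lost in this step.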

Step 6 - Bitstream Format

After all these steps, we need to package the compressed frames and their context. The decoder needs to be explicitly informed about the encoding decisions, such as color depth, color space, resolution, prediction information (motion vectors, intra-frame prediction direction), profile, level, frame rate, frame type, frame number, and much more.

Video Compression Review

Previously we calculated that we would need 139GB to store a one-hour video file at 720p resolution and 30fps. If we use the techniques we've learned here, such as inter- and intra-frame prediction, transform, quantization, entropy coding, and others, we can achieve the same perceived quality with only 367.82MB instead of 139GB, assuming we spend 0.031 bits per pixel.
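The 367.82MB figure can be reproduced with the same product used earlier, swapping in 0.031 bits per pixel. As before, this sketch assumes a megabyte of 2^20 bytes; the variable names are illustrative.

```python
width, height = 1280, 720
fps = 30
seconds = 3600
bits_per_pixel = 0.031   # average spend assumed in the text

total_pixels = width * height * fps * seconds
compressed_bytes = total_pixels * bits_per_pixel / 8
compressed_mb = compressed_bytes / 2**20

# Ratio against the 12 bits per pixel of uncompressed YCbCr 4:2:0.
ratio = 12 / bits_per_pixel

print(round(compressed_mb, 2))
print(round(ratio))
```

Under these assumptions the compressed file is roughly 387 times smaller than the uncompressed 4:2:0 video.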

🌻

Core reference:

https://github.com/leandromoreira/digital_video_introduction/tree/master

https://en.wikipedia.org/wiki/Image_compression