A video is formed by playing a sequence of pictures quickly, so to understand video we must first understand images. And to understand how to compress, we must first understand how to store. The following article follows this line of thought.

Image

An image can be thought of as a 2D matrix. If we think about colors, we can extrapolate this idea by seeing the image as a 3D matrix, where the additional dimension is used to provide color data.
If we choose to represent these colors using the primary colors (red, green and blue), we define three planes: the first for red, the second for green, and the last for blue.
notion image
We'll call each point in this matrix a pixel (picture element). One pixel represents the intensity (usually a numeric value) of a given color. For example, a red pixel means 0 of green, 0 of blue and maximum of red. The pink color pixel can be formed with a combination of the three colors. Using a representative numeric range from 0 to 255, the pink pixel is defined by Red=255, Green=192 and Blue=203.
For instance, look at the picture down below. The first face is fully colored. The others are the red, green, and blue planes (shown as gray tones).
notion image
We can see that the red color is the one that contributes the most (the brightest parts in the second face) to the final color, while the blue color's contribution can mostly be seen only in Mario's eyes (last face) and part of his clothes. Notice how all planes contribute less (the darkest parts) to Mario's mustache.
Each color intensity requires a certain number of bits; this quantity is known as the bit depth. Say we spend 8 bits (accepting values from 0 to 255) per color (plane); then we have a color depth of 24 bits (8 bits * 3 planes R/G/B), and we can also infer that we could use 2 to the power of 24 different colors.
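A quick sketch in code of the bit-depth arithmetic (the 1280x720 frame is just an illustrative size):

```python
# Raw storage cost of an uncompressed RGB image (illustrative values).
BITS_PER_PLANE = 8   # bit depth per color plane
PLANES = 3           # R, G, B

color_depth = BITS_PER_PLANE * PLANES   # 24 bits per pixel
distinct_colors = 2 ** color_depth      # 2^24 distinct colors

# Raw size of one 1280x720 frame at this color depth, in bytes.
frame_bytes = 1280 * 720 * color_depth // 8

print(color_depth)      # 24
print(distinct_colors)  # 16777216
print(frame_bytes)      # 2764800
```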
Another property we can see while working with images or video is the aspect ratio, which simply describes the proportional relationship between the width and height of an image or pixel.
When people say a movie or picture is 16x9, they usually are referring to the Display Aspect Ratio (DAR); however, individual pixels can also have different shapes, and we call this the Pixel Aspect Ratio (PAR).
notion image

Picture compression strategy

Our eyes are more sensitive to brightness than to colors; you can test it for yourself by looking at this picture.
If you are unable to see that the colors of squares A and B are identical on the left side, that's fine; it's our brain playing tricks on us to pay more attention to light and dark than to color. On the right side there is a connector with the same color, so we (our brain) can easily spot that they are, in fact, the same color.
notion image

Color model

We first learned how colored images work using the RGB model, but there are other models too. In fact, there is a model that separates luma (brightness) from chrominance (colors); it is known as YCbCr*.
* there are more models which do the same separation.
This color model uses Y to represent the brightness and two color channels, Cb (chroma blue) and Cr (chroma red). YCbCr can be derived from RGB, and it can also be converted back to RGB. Using this model we can create fully colored images, as we can see down below.
notion image

Converting between YCbCr and RGB

Some may ask: how can we produce all the colors without using green?
To answer this question, we'll walk through a conversion from RGB to YCbCr. We'll use the coefficients from the standard BT.601 that was recommended by the group ITU-R*. The first step is to calculate the luma, using the constants suggested by the ITU and replacing the RGB values:
Y = 0.299R + 0.587G + 0.114B
Once we have the luma, we can split off the colors (chroma blue and red):
Cb = 0.564(B - Y)
Cr = 0.713(R - Y)
And we can also convert back, even recovering the green, by using only YCbCr:
R = Y + 1.402Cr
B = Y + 1.772Cb
G = (Y - 0.299R - 0.114B) / 0.587
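This conversion can be sketched in code with the BT.601 constants (0.299, 0.587 and 0.114 for luma; 0.564 and 0.713 to scale the chroma differences, with 1.772 and 1.402 as their inverses on the way back):

```python
# RGB <-> YCbCr round trip using the BT.601 coefficients.

def rgb_to_ycbcr(r, g, b):
    y = 0.299 * r + 0.587 * g + 0.114 * b  # luma
    cb = 0.564 * (b - y)                   # chroma blue
    cr = 0.713 * (r - y)                   # chroma red
    return y, cb, cr

def ycbcr_to_rgb(y, cb, cr):
    r = y + 1.402 * cr
    b = y + 1.772 * cb
    g = (y - 0.299 * r - 0.114 * b) / 0.587  # green recovered from Y, R and B
    return r, g, b

# The pink pixel from earlier: R=255, G=192, B=203.
y, cb, cr = rgb_to_ycbcr(255, 192, 203)
r, g, b = ycbcr_to_rgb(y, cb, cr)
print(round(r), round(g), round(b))  # 255 192 203
```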
* groups and standards are common in digital video; they usually define the standards, for instance: what is 4K? what frame rate should we use? what resolution? which color model?

Chroma subsampling

With the image represented as luma and chroma components, we can take advantage of the human visual system's greater sensitivity to luma resolution than to chroma and selectively remove information. Chroma subsampling is the technique of encoding images using less resolution for chroma than for luma.
How much should we reduce the chroma resolution? It turns out there are already some schemes that describe how to handle the resolution and the merge (final color = Y + Cb + Cr).
These schemes are known as subsampling systems and are expressed as a 3-part ratio a:x:y, which defines the chroma resolution in relation to an a × 2 block of luma pixels.
  • a is the horizontal sampling reference (usually 4)
  • x is the number of chroma samples in the first row of a pixels (horizontal resolution in relation to a)
  • y is the number of changes of chroma samples between the first and second rows of a pixels.
An exception to this exists with 4:1:0, which provides a single chroma sample within each 4 x 4 block of luma resolution.
Common schemes used in modern codecs are: 4:4:4 (no subsampling), 4:2:2, 4:1:1, 4:2:0, 4:1:0 and 3:1:1.
You can see the same image encoded by the main chroma subsampling types, images in the first row are the final YCbCr while the last row of images shows the chroma resolution. It's indeed a great win for such small loss.
notion image
For instance, with YCbCr 4:2:0 we can cut an image's raw size in half compared to 24-bit RGB, spending on average 12 bits per pixel instead of 24.
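A minimal sketch of 4:2:0 subsampling on a toy chroma plane (the sample values are made up):

```python
# 4:2:0 chroma subsampling: keep full-resolution luma, but only one chroma
# sample per 2x2 block of pixels.

def subsample_420(plane):
    """Keep one sample per 2x2 block (the top-left sample)."""
    return [row[::2] for row in plane[::2]]

# A toy 4x4 chroma plane.
cb = [[10, 11, 12, 13],
      [14, 15, 16, 17],
      [18, 19, 20, 21],
      [22, 23, 24, 25]]

cb_420 = subsample_420(cb)
print(cb_420)  # [[10, 12], [18, 20]]

# Average bits per pixel: 8 (Y) + 8/4 (Cb) + 8/4 (Cr) = 12, half of 24.
bits_per_pixel = 8 + 8 / 4 + 8 / 4
print(bits_per_pixel)  # 12.0
```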

Video

Finally, we can define a video as a succession of n pictures in time, which can be seen as another dimension; n is the frame rate, or frames per second (FPS).
The number of bits per second needed to show a video is its bit rate.
bit rate = width * height * bit depth * frames per second
For example, a video with 30 frames per second, 24 bits per pixel, resolution of 480x240 will need 82,944,000 bits per second or 82.944 Mbps (30x480x240x24) if we don't employ any kind of compression.
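The arithmetic, verified in code:

```python
# Raw bit rate of the example video: 480x240, 24 bits per pixel, 30 fps.
width, height, bit_depth, fps = 480, 240, 24, 30

bit_rate = width * height * bit_depth * fps
print(bit_rate)        # 82944000 bits per second
print(bit_rate / 1e6)  # 82.944 Mbps
```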
When the bit rate is nearly constant it's called constant bit rate (CBR); when it varies, it's called variable bit rate (VBR).
This graph shows a constrained VBR which doesn't spend too many bits while the frame is black.

Video compression strategy

We learned that it's not feasible to use video without any compression; a single one-hour video at 720p resolution and 30fps would require 278GB*. Since solely using lossless data compression algorithms like DEFLATE (used in PKZIP, Gzip, and PNG) won't decrease the required bandwidth sufficiently, we need to find other ways to compress video.
* We found this number by multiplying 1280 x 720 x 24 x 30 x 3600 (width, height, bits per pixel, fps and time in seconds)
To do this, we can exploit how our vision works: we're better at distinguishing brightness than colors; there are repetitions in time, since a video contains a lot of images with few changes; and there are repetitions within each image, since each frame also contains many areas using the same or similar colors.
These visual features are the same ones behind the image compression strategy mentioned above. If we use YCbCr 4:2:0 we can cut this size in half (139GB)*, but it is still far from ideal.
* we found this value by multiplying width, height, bits per pixel and fps. Previously we needed 24 bits, now we only need 12.

Frame types terminology

Now we can move on and try to eliminate the redundancy in time, but before that let's establish some basic terminology. Suppose we have a movie with 30fps; here are its first 4 frames.
We can see lots of repetitions within the frames, like the blue background, which doesn't change from frame 0 to frame 3. To tackle this problem, we can abstractly categorize frames into three types.
notion image
I Frame (intra, keyframe)
An I-frame (reference, keyframe, intra) is a self-contained frame. It doesn't rely on anything to be rendered, an I-frame looks similar to a static photo. The first frame is usually an I-frame but we'll see I-frames inserted regularly among other types of frames.
notion image
P Frame (predicted)
A P-frame takes advantage of the fact that almost always the current picture can be rendered using the previous frame. For instance, in the second frame, the only change was the ball that moved forward. We can rebuild frame 1 using only the difference and a reference to the previous frame.
notion image
B Frame (bi-predictive)
What about referencing the past and future frames to provide even a better compression?! That's basically what a B-frame is.
notion image
 
These frame types are used to provide better compression. We'll look at how this happens in the next section; for now, we can think of the I-frame as expensive, the P-frame as cheaper, and the B-frame as the cheapest.
notion image

Temporal redundancy (inter prediction)

Let's explore the options we have to reduce the repetitions in time; this type of redundancy can be tackled with techniques of inter prediction.
We will try to spend fewer bits to encode the sequence of frames 0 and 1.
One thing we can do is a subtraction: we simply subtract frame 1 from frame 0 and we get just what we need to encode, the residual.
notion image
But what if I told you that there is a better method which uses even fewer bits? First, let's treat frame 0 as a collection of well-defined partitions, and then we'll try to match the blocks from frame 0 on frame 1. We can think of it as motion estimation.
We could estimate that the ball moved from x=0, y=25 to x=6, y=26; the x and y values are the motion vectors. One further step we can take to save bits is to encode only the motion vector difference between the last block position and the predicted one, so the final motion vector would be x=6 (6-0), y=1 (26-25).
In a real-world situation, this ball would be sliced into n partitions but the process is the same.
The objects in the frame move in a 3D way; the ball can become smaller when it moves to the background. It's also normal that we won't find a perfect match for the block. Here's a superposed view of our estimation vs the real picture.
notion image
But we can see that when we apply motion estimation, the data to encode is smaller than when using the simple delta frame technique.
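A toy version of this block-matching search can make the idea concrete (the frames, block size and positions are made up for the sketch; real encoders search only a limited window around the block):

```python
# Exhaustive block matching: find where a block from frame 0 moved to in
# frame 1 by minimizing the sum of absolute differences (SAD).

def sad(frame, x, y, block):
    """Sum of absolute differences between `block` and `frame` at (x, y)."""
    return sum(abs(frame[y + j][x + i] - block[j][i])
               for j in range(len(block))
               for i in range(len(block[0])))

def find_motion(frame, block):
    """Return the (x, y) position in `frame` with the lowest SAD."""
    h, w = len(block), len(block[0])
    candidates = [(x, y)
                  for y in range(len(frame) - h + 1)
                  for x in range(len(frame[0]) - w + 1)]
    return min(candidates, key=lambda p: sad(frame, p[0], p[1], block))

ball = [[9, 9],
        [9, 9]]

frame0 = [[0] * 8 for _ in range(4)]
frame1 = [[0] * 8 for _ in range(4)]
for j in range(2):
    for i in range(2):
        frame0[1 + j][0 + i] = ball[j][i]  # ball at (x=0, y=1) in frame 0
        frame1[2 + j][6 + i] = ball[j][i]  # ball at (x=6, y=2) in frame 1

pos0 = (0, 1)
pos1 = find_motion(frame1, ball)
motion_vector = (pos1[0] - pos0[0], pos1[1] - pos0[1])
print(pos1)           # (6, 2)
print(motion_vector)  # (6, 1)
```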
notion image

Spatial redundancy (intra prediction)

If we analyze each frame in a video we'll see that there are also many areas that are correlated.
Let's walk through an example. This scene is mostly composed of blue and white colors.
This is an I-frame, so we can't use previous frames to predict from, but we can still compress it. We will encode the selected red block. If we look at its neighbors, we can estimate that there is a trend of colors around it.
 
notion image
We will predict that the frame will continue to spread the colors vertically; this means that the colors of the unknown pixels will hold the values of their neighbors.
Our prediction can be wrong; for that reason we need to apply this technique (intra prediction) and then subtract the real values, which gives us the residual block, resulting in a much more compressible matrix than the original.
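A minimal sketch of the vertical intra prediction and residual described above (the pixel values are made up):

```python
# Vertical intra prediction: predict each pixel of a block from the row just
# above it, then keep only the residual (real minus predicted).

def predict_vertical(top_row, height):
    """Copy the row above the block downwards."""
    return [list(top_row) for _ in range(height)]

def residual(block, prediction):
    return [[b - p for b, p in zip(brow, prow)]
            for brow, prow in zip(block, prediction)]

top_row = [50, 50, 200, 200]  # reconstructed neighbors above the block
block = [[51, 50, 199, 200],  # the real pixels we want to encode
         [50, 52, 201, 198],
         [49, 50, 200, 201]]

pred = predict_vertical(top_row, len(block))
res = residual(block, pred)
print(res)  # [[1, 0, -1, 0], [0, 2, 1, -2], [-1, 0, 0, 1]]
```

The residual values are tiny compared to the original pixels, which is exactly what makes the block cheaper to encode.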
notion image

Video container VS video codec

  • A container can be seen as a wrapper format that holds the video (and quite possibly audio too) and its metadata, e.g. MP4.
  • The codec-compressed video can be seen as the content the container carries, e.g. content compressed with the H.264 codec.

Video codec

What is it? Software or hardware that compresses or decompresses digital video.
Why? People need to improve video quality under limited bandwidth or storage space.

How?

We're going to introduce the main mechanics behind a generic video codec. Most of these concepts are useful and are used by modern codecs such as VP9, AV1 and HEVC. Be warned: we're going to simplify many things. Sometimes we'll use real examples (mostly H.264) to demonstrate the techniques.

1st step - picture partitioning

The first step is to divide the frame into several partitions, sub-partitions and beyond.
Usually, codecs organize these partitions into slices (or tiles), macroblocks (or coding tree units) and many sub-partitions. The maximum size of these partitions varies: HEVC sets it to 64x64 while AVC uses 16x16, but the sub-partitions can reach sizes as small as 4x4.
notion image

2nd step - predictions

Once we have the partitions, we can make predictions over them. For inter prediction we need to send the motion vectors and the residual; for intra prediction we need to send the prediction direction and the residual.

3rd step - transform

After we get the residual block (predicted partition - real partition), we can transform it in a way that tells us which pixels we should discard while still keeping the overall quality. There are a few transforms with this exact behavior.
Although there are other transforms, we'll focus on the discrete cosine transform (DCT). The DCT's main features are:
  • it converts blocks of pixels into same-sized blocks of frequency coefficients.
  • it compacts energy, making it easier to eliminate spatial redundancy.
  • it is reversible, meaning you can convert back to pixels.
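As an illustration of these features, here is a minimal orthonormal 2D DCT and its inverse (real codecs use fixed-point integer approximations; the 4x4 block values are just an example):

```python
import math
import numpy as np

# Build the orthonormal DCT-II basis matrix for an N x N block.
def dct_matrix(n):
    m = np.zeros((n, n))
    for k in range(n):
        for i in range(n):
            scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
            m[k, i] = scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
    return m

def dct2(block):
    c = dct_matrix(block.shape[0])
    return c @ block @ c.T  # forward 2D DCT

def idct2(coeffs):
    c = dct_matrix(coeffs.shape[0])
    return c.T @ coeffs @ c  # inverse: the basis matrix is orthonormal

block = np.array([[52, 55, 61, 66],
                  [70, 61, 64, 73],
                  [63, 59, 55, 90],
                  [67, 61, 68, 104]], dtype=float)

coeffs = dct2(block)
restored = idct2(coeffs)
print(np.allclose(restored, block))  # True: the transform is reversible
# Most of the energy lands in the top-left (low-frequency) coefficients.
```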

4th step - quantization

When we discarded some coefficients in the last step (transform), we already did a form of quantization. In this step we selectively throw away information (the lossy part); put simply, we quantize the coefficients to achieve compression.
How can we quantize a block of coefficients? A simple method is uniform quantization: we take a block, divide it by a single value (10) and round the result.
notion image
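The uniform quantization described above can be sketched as follows (the coefficient values are made up for the example):

```python
# Uniform quantization: divide every coefficient by a single quantization
# step (10 here) and round; dequantize by multiplying back.

QSTEP = 10

def quantize(block):
    return [[round(v / QSTEP) for v in row] for row in block]

def dequantize(qblock):
    return [[v * QSTEP for v in row] for row in qblock]

coeffs = [[-24, -6, 14, 2],
          [9, -11, 3, -1],
          [4, 2, -2, 0]]

q = quantize(coeffs)
rec = dequantize(q)
print(q)    # [[-2, -1, 1, 0], [1, -1, 0, 0], [0, 0, 0, 0]]
print(rec)  # [[-20, -10, 10, 0], [10, -10, 0, 0], [0, 0, 0, 0]]

# The reconstruction error is at most half the quantization step.
print(max(abs(a - b)
          for ra, rb in zip(coeffs, rec)
          for a, b in zip(ra, rb)))  # 4
```

Notice how many coefficients collapse to zero: this is the lossy part, and it's what makes the next (lossless) step so effective.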

5th step - entropy coding

After we quantize the data (image blocks/slices/frames), we can still compress it in a lossless way. There are many methods (algorithms) to compress data. We'll briefly experience a couple of them; for a deeper understanding, you can read the great book Understanding Compression: Data Compression for Modern Developers.

6th step - bitstream format

After all these steps, we need to pack the compressed frames and this context together. We need to explicitly inform the decoder about the decisions made by the encoder, such as bit depth, color space, resolution, prediction info (motion vectors, intra prediction direction), profile*, level*, frame rate, frame type, frame number and much more.

Video compression review

Earlier we calculated that we'd need 139GB to store a one-hour video file at 720p resolution and 30fps. If we use the techniques we learned here, such as inter and intra prediction, transform, quantization, entropy coding and others, we can achieve (assuming we spend 0.031 bit per pixel) a video of the same perceived quality requiring only 367.82MB versus the 139GB of storage.
 