The previous article introduced the basics of audio and video development. This article discusses the main parameters of audio frames and video frames, how to analyze them, and audio-video synchronization. The main contents are as follows:
- Audio Frames
- Video Frames
- DTS and PTS
- Audio-Video Synchronization
Audio Frames#
The concept of an audio frame is not as clear-cut as that of a video frame. In almost all video encoding formats, a frame can simply be regarded as one encoded image, whereas audio frames vary with the encoding format; a PCM audio stream, for example, can be played directly without any framing. In this article, the MPEG audio frame format is used as the example.
Frame Size#
Frame size refers to the number of samples in each frame. This value is constant for a given MPEG version and layer, as shown below:
| | MPEG 1 | MPEG 2 | MPEG 2.5 |
|---|---|---|---|
| Layer Ⅰ | 384 | 384 | 384 |
| Layer Ⅱ | 1152 | 1152 | 1152 |
| Layer Ⅲ | 1152 | 576 | 576 |
Frame Length#
Frame length refers to the length of each compressed frame, including the frame header and padding. Because of padding and bit rate changes, the frame length is not constant. Whether a frame is padded can be read from the 9th bit of the frame header: 0 means no padding, 1 means padding. Padding is explained as follows:
Padding is used to fit the bit rate exactly. For example, 128 kbps 44.1 kHz Layer II uses many 418-byte frames and some 417-byte frames to reach exactly 128 kbps. For Layer I the slot is 32 bits long; for Layer II and Layer III the slot is 8 bits long.
It can be seen that the padding unit for Layer Ⅰ is 4 bytes, and for Layer Ⅱ and Layer Ⅲ it is 1 byte. When reading an MPEG file, the frame length must be calculated in order to locate adjacent frames. The frame length is calculated as follows:
// Layer I(SampleSize = 384) in bytes
FrameLengthInBytes = SampleSize / 8 * BitRate / SampleRate + Padding * 4
FrameLengthInBytes = 48 * BitRate / SampleRate + Padding * 4
// Layer II & III(SampleSize = 1152) in bytes
FrameLengthInBytes = SampleSize / 8 * BitRate / SampleRate + Padding
FrameLengthInBytes = 144 * BitRate / SampleRate + Padding
Here, SampleSize represents the number of samples per frame, which is a fixed value and can be found in the frame size section. Padding represents the padding, BitRate represents the bit rate, and SampleRate represents the sample rate. The values of bit rate and sample rate can be obtained from the frame header.
If an MP3 audio file has a bit rate of 320 kbps, a sample rate of 44.1 kHz, and no padding, the frame length of the file is approximately 144 × 320 / 44.1 ≈ 1044 bytes.
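To make the formula concrete, below is a minimal C sketch that computes the frame length from the layer, bit rate, sample rate, and padding bit. The function name is illustrative, and the snippet only handles the simplified cases given above:

```c
#include <stdio.h>

/* Frame length in bytes (illustrative helper).
 * bitrate is in bits per second, sample_rate in Hz, padding is 0 or 1. */
static int frame_length_bytes(int layer, int bitrate, int sample_rate, int padding)
{
    if (layer == 1)                                  /* Layer I */
        return 48 * bitrate / sample_rate + padding * 4;
    return 144 * bitrate / sample_rate + padding;    /* Layer II & III */
}

int main(void)
{
    /* 320 kbps, 44.1 kHz, no padding: prints 1044, matching the estimate above. */
    printf("%d bytes\n", frame_length_bytes(3, 320000, 44100, 0));
    return 0;
}
```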
Bit Rate#
The bit rate can be obtained from bits 12 to 15 of the MPEG audio frame header, in kbps. The reference table is as follows:
bits | V1,L1 | V1,L2 | V1,L3 | V2,L1 | V2,L2 & L3 |
---|---|---|---|---|---|
0000 | free | free | free | free | free |
0001 | 32 | 32 | 32 | 32 | 8 |
0010 | 64 | 48 | 40 | 48 | 16 |
0011 | 96 | 56 | 48 | 56 | 24 |
0100 | 128 | 64 | 56 | 64 | 32 |
0101 | 160 | 80 | 64 | 80 | 40 |
0110 | 192 | 96 | 80 | 96 | 48 |
0111 | 224 | 112 | 96 | 112 | 56 |
1000 | 256 | 128 | 112 | 128 | 64 |
1001 | 288 | 160 | 128 | 144 | 80 |
1010 | 320 | 192 | 160 | 160 | 96 |
1011 | 352 | 224 | 192 | 176 | 112 |
1100 | 384 | 256 | 224 | 192 | 128 |
1101 | 416 | 320 | 256 | 224 | 144 |
1110 | 448 | 384 | 320 | 256 | 160 |
1111 | bad | bad | bad | bad | bad |
Regarding the explanations in the table:
- V1: MPEG Version 1
- V2: MPEG Version 2 and Version 2.5
- L1: Layer Ⅰ
- L2: Layer Ⅱ
- L3: Layer Ⅲ
MPEG files may use a variable bit rate (VBR), meaning the bit rate can change from frame to frame, so it is important to know how to read the bit rate from each frame header.
Sample Rate#
The sample rate can be obtained from bits 10 to 11 of the MPEG audio frame header, in Hz. The reference table is as follows:
bits | MPEG1 | MPEG2 | MPEG2.5 |
---|---|---|---|
00 | 44100 | 22050 | 11025 |
01 | 48000 | 24000 | 12000 |
10 | 32000 | 16000 | 8000 |
11 | reserv. | reserv. | reserv. |
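As a rough sketch of how these fields are read in practice, the following C snippet extracts the bit rate index (bits 12 to 15), the sample rate index (bits 10 to 11), and the padding bit (bit 9) from a 32-bit MPEG audio frame header. The variable names are illustrative, and the lookup tables only cover MPEG-1 Layer III (the V1,L3 column and the MPEG1 column of the tables above):

```c
#include <stdint.h>
#include <stdio.h>

/* Bit rate table for MPEG-1 Layer III only (kbps); 0 = free, -1 = bad. */
static const int kBitrateV1L3[16] = {
    0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, -1
};
/* Sample rate table for MPEG-1 (Hz); index 3 is reserved. */
static const int kSampleRateV1[4] = { 44100, 48000, 32000, -1 };

int main(void)
{
    /* Hypothetical 4-byte header: MPEG-1 Layer III, 128 kbps, 44.1 kHz, no padding. */
    uint32_t header = 0xFFFB9000u;

    int bitrate_index    = (header >> 12) & 0x0F; /* bits 12-15 */
    int samplerate_index = (header >> 10) & 0x03; /* bits 10-11 */
    int padding          = (header >> 9)  & 0x01; /* bit 9 */

    printf("bitrate=%d kbps, samplerate=%d Hz, padding=%d\n",
           kBitrateV1L3[bitrate_index], kSampleRateV1[samplerate_index], padding);
    return 0;
}
```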
Duration of Each Frame#
The duration of each frame can be calculated using the following formula:
// in ms
FrameTime = SampleSize / SampleRate * 1000
Here, SampleSize represents the number of samples, which is the frame size, and SampleRate represents the sample rate.
For example, for an MP3 audio file with a sample rate of 44.1 kHz, the duration of each frame is 1152 / 44100 × 1000 ≈ 26 ms. This is why the playback duration of an MP3 frame is often said to be fixed at 26 ms.
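Expressed in the same C style as the earlier sketch, with an illustrative function name:

```c
#include <stdio.h>

/* Frame duration in milliseconds: samples per frame divided by the sample rate. */
static double frame_duration_ms(int samples_per_frame, int sample_rate)
{
    return (double)samples_per_frame / sample_rate * 1000.0;
}

int main(void)
{
    printf("%.2f ms\n", frame_duration_ms(1152, 44100)); /* about 26.12 ms */
    return 0;
}
```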
Video Frames#
Video compression applies different compression algorithms to different video frames to reduce the data size. Usually only the differences between images are encoded, so information that is unchanged between frames does not need to be sent repeatedly. These different algorithms are generally referred to as picture types or frame types, and the three main types are I, P, and B frames, with the following characteristics:
- I frame: Intra-coded frame, usually the first frame of each GOP (Group of Pictures). It has the lowest compression and can be decoded without reference to other video frames; it can be regarded as a complete image. I frames are used for random access and serve as references for decoding other frames.
- P frame: Predictive-coded frame, representing the differences between the current frame and the previous I or P frame. It needs to refer to the previous I or P frame to generate a complete image. Compared to I frames, P frames have higher compression and save space, so they are also called delta frames.
- B frame: Bi-directional predictive-coded frame, representing the differences between the current frame and both the previous and following reference frames. It needs to refer to the preceding I or P frame and the following I or P frame to produce a complete image. B frames have the highest compression.
The frames or pictures mentioned above are usually divided into macroblocks. A macroblock is the basic unit of motion prediction, and a complete image is typically divided into a number of macroblocks; in MPEG-2 and earlier codecs, a macroblock covers 16x16 pixels. The prediction type is chosen per macroblock rather than being the same for the entire image, as follows:
- I frame: Contains only intra macroblocks.
- P frame: Can contain intra macroblocks or predicted macroblocks.
- B frame: Can contain intra, predicted, and bi-directional predicted macroblocks.
The following diagram shows the representation of I frames, P frames, and B frames:
In the H.264 / MPEG-4 AVC standard, the granularity of prediction types is reduced to the slice level. A slice is a spatially distinct region of a frame that is encoded separately from any other region in the same frame, and I slices, P slices, and B slices take the place of I, P, and B frames. A high-level understanding of this is sufficient for the purposes of this article.
As mentioned earlier, GOP stands for Group of Pictures. Each GOP starts with an I frame, followed by P frames and B frames. The following diagram shows the order:
The order shown in the diagram is:
I1, B2, B3, B4, P5, B6, B7, B8, P9, B10, B11, B12, I13
The decoding order is:
I1, P5, B2, B3, B4, P9, B6, B7, B8, I13, B10, B11, B12
The subscript numbers represent the PTS (Presentation Time Stamp) of the original frame data, which can be understood as the frame's display position within the GOP.
DTS and PTS#
- DTS (Decoding Time Stamp): It represents the decoding time of the compressed frame, indicating when the player should decode the frame data.
- PTS (Presentation Time Stamp): It represents the display time of the original frame obtained after decoding the compressed frame, indicating when the frame data should be displayed during playback.
For audio, DTS and PTS are identical. For video, because B frames are bi-directionally predicted, the decoding order differs from the display order: if a GOP contains no B frames, DTS and PTS are the same; otherwise they differ. Here is an example:
| | I | B | B | P | B | P |
|---|---|---|---|---|---|---|
| Display | I1 | B2 | B3 | P4 | B5 | P6 |
| Decoding | I1 | P4 | B2 | B3 | P6 | B5 |
| PTS | 1 | 2 | 3 | 4 | 5 | 6 |
| DTS | 1 | 4 | 2 | 3 | 6 | 5 |
When the receiver decodes the incoming bitstream, the frames come out in decoding order, which is obviously not the display order; they must be reordered by PTS before being displayed.
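As a simplified illustration of that reordering, the sketch below takes the frames of the table above in decoding order and sorts them by PTS to recover the display order. The Frame struct and the whole-array sort are my own simplifications; a real player reorders incrementally with a small buffer rather than sorting the entire stream.

```c
#include <stdio.h>
#include <stdlib.h>

/* Minimal frame record; real decoder output carries much more than this. */
typedef struct { char type; int pts; } Frame;

static int by_pts(const void *a, const void *b)
{
    return ((const Frame *)a)->pts - ((const Frame *)b)->pts;
}

int main(void)
{
    /* Frames in decoding order, using the PTS values from the table above. */
    Frame decoded[] = {
        {'I', 1}, {'P', 4}, {'B', 2}, {'B', 3}, {'P', 6}, {'B', 5}
    };
    size_t n = sizeof decoded / sizeof decoded[0];

    /* Reorder by PTS to obtain the display order: I1 B2 B3 P4 B5 P6. */
    qsort(decoded, n, sizeof decoded[0], by_pts);
    for (size_t i = 0; i < n; i++)
        printf("%c%d ", decoded[i].type, decoded[i].pts);
    printf("\n");
    return 0;
}
```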
Audio-Video Synchronization#
Let's briefly review the video playback process. After the microphone and camera capture data, the audio and video are encoded separately and then multiplexed, that is, formatted and packaged into a media file. On the receiving side, the media file is demultiplexed to separate the audio and video streams, each stream is decoded separately, and the two are played back independently. Because their playback rates differ, audio and video can drift out of sync. The corresponding rate for each is:
- Audio: Sample rate
- Video: Frame rate
Sound cards and graphics cards generally play data one frame at a time, so the playback duration of each audio frame and each video frame needs to be calculated. For example:
From the earlier calculation, for an MP3 audio file with a sample rate of 44.1 kHz, each frame lasts about 26 ms. If the video frame rate is 30 fps, each video frame lasts 1000 / 30 ≈ 33 ms. In the ideal case, if playback proceeds exactly according to these calculated values, the audio and video can be considered synchronized.
In practice, audio and video drift out of sync for various reasons: the decoding and rendering time varies from frame to frame (a frame with rich colors may take longer to decode and render than one with simple colors), and the calculated durations themselves carry rounding errors. There are three main approaches to audio-video synchronization:
- Video synchronization to audio
- Audio synchronization to video
- Audio-video synchronization to an external clock
Usually the video is synchronized to the audio clock, because human hearing is more sensitive to delay and stuttering than vision, so keeping the audio output steady is the priority. Synchronization allows a certain amount of delay, as long as it stays within an acceptable range, and it works like a feedback mechanism: when the video falls behind the audio, video playback is sped up, for example by dropping frames to catch up, which also reduces any accumulated delay; conversely, when the video runs ahead of the audio, video playback is slowed down. A rough sketch of this decision logic follows.
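Below is a minimal C sketch of that feedback idea for syncing video to the audio clock: compare the next video frame's PTS with the current audio clock and decide whether to drop the frame, show it, or wait. The threshold value, the sample timestamps, and the function name are all hypothetical:

```c
#include <stdio.h>

#define SYNC_THRESHOLD_MS 40.0   /* illustrative threshold, roughly one frame */

/* Decide what to do with a video frame given its PTS and the current audio clock. */
static const char *sync_decision(double video_pts_ms, double audio_clock_ms)
{
    double diff = video_pts_ms - audio_clock_ms;
    if (diff < -SYNC_THRESHOLD_MS) return "drop frame (video behind audio)";
    if (diff >  SYNC_THRESHOLD_MS) return "wait before showing (video ahead of audio)";
    return "show now (in sync)";
}

int main(void)
{
    /* Made-up sample points: (video PTS, audio clock) in milliseconds. */
    double samples[][2] = { {33.0, 33.0}, {66.0, 120.0}, {200.0, 130.0} };
    for (int i = 0; i < 3; i++)
        printf("pts=%.0f audio=%.0f -> %s\n",
               samples[i][0], samples[i][1],
               sync_decision(samples[i][0], samples[i][1]));
    return 0;
}
```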