The previous article introduced the basics of audio and video development. This article discusses the main parameters of audio frames and video frames, how to analyze them, and audio-video synchronization. The main contents are as follows:
- Audio Frames
- Video Frames
- DTS and PTS
- Audio-Video Synchronization
Audio Frames#
The concept of an audio frame is not as clear-cut as that of a video frame. In almost all video encoding formats, a frame can simply be regarded as one encoded image, whereas audio frames vary with the encoding format; a PCM audio stream, for example, can be played directly without any framing. In this article, the MPEG audio frame format is used as the example.
Frame Size#
Frame size refers to the number of samples in each frame. This value is constant for a given MPEG version and layer, as shown below:
| | MPEG 1 | MPEG 2 | MPEG 2.5 |
|---|---|---|---|
| Layer Ⅰ | 384 | 384 | 384 |
| Layer Ⅱ | 1152 | 1152 | 1152 |
| Layer Ⅲ | 1152 | 576 | 576 |
Frame Length#
Frame length refers to the length of each compressed frame, including the frame header and padding. Because of padding and bit rate changes, the frame length is not constant. Whether a frame is padded can be read from the 9th bit of the frame header: 0 means no padding, 1 means padding. Padding is explained as follows:
Padding is used to fit the bit rate exactly. For example, 128 kbps 44.1 kHz Layer II uses many 418-byte frames and some 417-byte frames to reach exactly 128 kbps. For Layer I the slot is 32 bits long; for Layer II and Layer III the slot is 8 bits long.
It can be seen that the padding unit for Layer Ⅰ is 4 bytes, and for Layer Ⅱ and Layer Ⅲ it is 1 byte. When reading an MPEG file, the frame length must be calculated in order to locate adjacent frames. The frame length is calculated as follows:
// Layer I(SampleSize = 384) in bytes
FrameLengthInBytes = SampleSize / 8 * BitRate / SampleRate + Padding * 4
FrameLengthInBytes = 48 * BitRate / SampleRate + Padding * 4
// Layer II & III(SampleSize = 1152) in bytes
FrameLengthInBytes = SampleSize / 8 * BitRate / SampleRate + Padding
FrameLengthInBytes = 144 * BitRate / SampleRate + Padding
Here, SampleSize represents the number of samples per frame, which is a fixed value and can be found in the frame size section. Padding represents the padding, BitRate represents the bit rate, and SampleRate represents the sample rate. The values of bit rate and sample rate can be obtained from the frame header.
If an MP3 audio file has a bit rate of 320 kbps, a sample rate of 44.1 kHz, and no padding, the frame length of the file is approximately 144 × 320 / 44.1 ≈ 1044 bytes.
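To make the formula concrete, below is a minimal C sketch that computes the frame length from the layer, bit rate, sample rate, and padding bit. The function name is illustrative, and the snippet only handles the simplified cases given above:

```c
#include <stdio.h>

/* Frame length in bytes (illustrative helper).
 * bitrate is in bits per second, sample_rate in Hz, padding is 0 or 1. */
static int frame_length_bytes(int layer, int bitrate, int sample_rate, int padding)
{
    if (layer == 1)                                  /* Layer I */
        return 48 * bitrate / sample_rate + padding * 4;
    return 144 * bitrate / sample_rate + padding;    /* Layer II & III */
}

int main(void)
{
    /* 320 kbps, 44.1 kHz, no padding: prints 1044, matching the estimate above. */
    printf("%d bytes\n", frame_length_bytes(3, 320000, 44100, 0));
    return 0;
}
```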
Bit Rate#
The bit rate can be obtained from bits 12 to 15 of the MPEG audio frame header, in kbps. The reference table is as follows:
bits | V1,L1 | V1,L2 | V1,L3 | V2,L1 | V2,L2 & L3 |
---|---|---|---|---|---|
0000 | free | free | free | free | free |
0001 | 32 | 32 | 32 | 32 | 8 |
0010 | 64 | 48 | 40 | 48 | 16 |
0011 | 96 | 56 | 48 | 56 | 24 |
0100 | 128 | 64 | 56 | 64 | 32 |
0101 | 160 | 80 | 64 | 80 | 40 |
0110 | 192 | 96 | 80 | 96 | 48 |
0111 | 224 | 112 | 96 | 112 | 56 |
1000 | 256 | 128 | 112 | 128 | 64 |
1001 | 288 | 160 | 128 | 144 | 80 |
1010 | 320 | 192 | 160 | 160 | 96 |
1011 | 352 | 224 | 192 | 176 | 112 |
1100 | 384 | 256 | 224 | 192 | 128 |
1101 | 416 | 320 | 256 | 224 | 144 |
1110 | 448 | 384 | 320 | 256 | 160 |
1111 | bad | bad | bad | bad | bad |
Regarding the explanations in the table:
- V1: MPEG Version 1
- V2: MPEG Version 2 and Version 2.5
- L1: Layer Ⅰ
- L2: Layer Ⅱ
- L3: Layer Ⅲ
MPEG files may use a variable bit rate (VBR), meaning the bit rate can change from frame to frame, so it is important to know how to read the bit rate from each frame header.
Sample Rate#
The sample rate can be obtained from bits 10 to 11 of the MPEG audio frame header, in Hz. The reference table is as follows:
bits | MPEG1 | MPEG2 | MPEG2.5 |
---|---|---|---|
00 | 44100 | 22050 | 11025 |
01 | 48000 | 24000 | 12000 |
10 | 32000 | 16000 | 8000 |
11 | reserv. | reserv. | reserv. |
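As a rough sketch of how these fields are read in practice, the following C snippet extracts the bit rate index (bits 12 to 15), the sample rate index (bits 10 to 11), and the padding bit (bit 9) from a 32-bit MPEG audio frame header. The variable names are illustrative, and the lookup tables only cover MPEG-1 Layer III (the V1,L3 column and the MPEG1 column of the tables above):

```c
#include <stdint.h>
#include <stdio.h>

/* Bit rate table for MPEG-1 Layer III only (kbps); 0 = free, -1 = bad. */
static const int kBitrateV1L3[16] = {
    0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, -1
};
/* Sample rate table for MPEG-1 (Hz); index 3 is reserved. */
static const int kSampleRateV1[4] = { 44100, 48000, 32000, -1 };

int main(void)
{
    /* Hypothetical 4-byte header: MPEG-1 Layer III, 128 kbps, 44.1 kHz, no padding. */
    uint32_t header = 0xFFFB9000u;

    int bitrate_index    = (header >> 12) & 0x0F; /* bits 12-15 */
    int samplerate_index = (header >> 10) & 0x03; /* bits 10-11 */
    int padding          = (header >> 9)  & 0x01; /* bit 9 */

    printf("bitrate=%d kbps, samplerate=%d Hz, padding=%d\n",
           kBitrateV1L3[bitrate_index], kSampleRateV1[samplerate_index], padding);
    return 0;
}
```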
Duration of Each Frame#
The duration of each frame can be calculated using the following formula:
// in ms
FrameTime = SampleSize / SampleRate * 1000
Here, SampleSize represents the number of samples, which is the frame size, and SampleRate represents the sample rate.
For example, for an MP3 audio file with a sample rate of 44.1 kHz, the duration of each frame is 1152 / 44100 × 1000 ≈ 26 ms. This is why the playback duration of an MP3 frame is often said to be fixed at 26 ms.
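Expressed in the same C style as the earlier sketch, with an illustrative function name:

```c
#include <stdio.h>

/* Frame duration in milliseconds: samples per frame divided by the sample rate. */
static double frame_duration_ms(int samples_per_frame, int sample_rate)
{
    return (double)samples_per_frame / sample_rate * 1000.0;
}

int main(void)
{
    printf("%.2f ms\n", frame_duration_ms(1152, 44100)); /* about 26.12 ms */
    return 0;
}
```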
Video Frames#
Video compression applies different compression algorithms to different video frames to reduce the data size. Usually only the differences between images are encoded, so information that is unchanged between frames does not need to be sent repeatedly. These different algorithms are generally referred to as picture types or frame types, and the three main types are I, P, and B frames, with the following characteristics:
- I frame: Intra-coded frame, usually the first frame of each GOP (Group of Pictures). It has the lowest compression and can be decoded without reference to other video frames; it can be regarded as a complete image. I frames are used for random access and serve as references for decoding other frames.
- P frame: Predictive-coded frame, representing the differences between the current frame and the previous I or P frame. It needs to refer to the previous I or P frame to generate a complete image. Compared to I frames, P frames have higher compression and save space, so they are also called delta frames.
- B frame: Bi-directional predictive-coded frame, representing the differences between the current frame and both the previous and following reference frames. It needs to refer to the preceding I or P frame and the following I or P frame to produce a complete image. B frames have the highest compression.
The frames or pictures mentioned above are usually divided into macroblocks. A macroblock is the basic unit of motion prediction, and a complete image is typically divided into a number of macroblocks; in MPEG-2 and earlier codecs, a macroblock covers 16x16 pixels. The prediction type is chosen per macroblock rather than being the same for the entire image, as follows:
- I frame: Contains only intra macroblocks.
- P frame: Can contain intra macroblocks or predicted macroblocks.
- B frame: Can contain intra, predicted, and bi-directional predicted macroblocks.
The following diagram shows the representation of I frames, P frames, and B frames:
In the H.264 / MPEG-4 AVC standard, the granularity of prediction types is reduced to the slice level. A slice is a spatially distinct region of a frame that is encoded separately from any other region in the same frame, and I slices, P slices, and B slices take the place of I, P, and B frames. A high-level understanding of this is sufficient for the purposes of this article.
As mentioned earlier, GOP stands for Group of Pictures. Each GOP starts with an I frame, followed by P frames and B frames. The following diagram shows the order:
The order shown in the diagram is:
I1, B2, B3, B4, P5, B6, B7, B8, P9, B10, B11, B12, I13
The decoding order is:
I1, P5, B2, B3, B4, P9, B6, B7, B8, I13, B10, B11, B12
The subscript numbers represent the PTS (Presentation Time Stamp) of the original frame data, which can be understood as the frame's display position within the GOP.
DTS and PTS#
- DTS (Decoding Time Stamp): It represents the decoding time of the compressed frame, indicating when the player should decode the frame data.
- PTS (Presentation Time Stamp): It represents the display time of the original frame obtained after decoding the compressed frame, indicating when the frame data should be displayed during playback.
For audio, DTS and PTS are identical. For video, because B frames are bi-directionally predicted, the decoding order differs from the display order: if a GOP contains no B frames, DTS and PTS are the same; otherwise they differ. Here is an example:
| | I | B | B | P | B | P |
|---|---|---|---|---|---|---|
| Display | I1 | B2 | B3 | P4 | B5 | P6 |
| Decoding | I1 | P4 | B2 | B3 | P6 | B5 |
| PTS | 1 | 2 | 3 | 4 | 5 | 6 |
| DTS | 1 | 4 | 2 | 3 | 6 | 5 |
When the receiver decodes the incoming bitstream, the frames come out in decoding order, which is obviously not the display order; they must be reordered by PTS before being displayed.
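As a simplified illustration of that reordering, the sketch below takes the frames of the table above in decoding order and sorts them by PTS to recover the display order. The Frame struct and the whole-array sort are my own simplifications; a real player reorders incrementally with a small buffer rather than sorting the entire stream.

```c
#include <stdio.h>
#include <stdlib.h>

/* Minimal frame record; real decoder output carries much more than this. */
typedef struct { char type; int pts; } Frame;

static int by_pts(const void *a, const void *b)
{
    return ((const Frame *)a)->pts - ((const Frame *)b)->pts;
}

int main(void)
{
    /* Frames in decoding order, using the PTS values from the table above. */
    Frame decoded[] = {
        {'I', 1}, {'P', 4}, {'B', 2}, {'B', 3}, {'P', 6}, {'B', 5}
    };
    size_t n = sizeof decoded / sizeof decoded[0];

    /* Reorder by PTS to obtain the display order: I1 B2 B3 P4 B5 P6. */
    qsort(decoded, n, sizeof decoded[0], by_pts);
    for (size_t i = 0; i < n; i++)
        printf("%c%d ", decoded[i].type, decoded[i].pts);
    printf("\n");
    return 0;
}
```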
Audio-Video Synchronization#
Let's briefly review the video playback process. After the microphone and camera capture data, the audio and video are encoded separately and then multiplexed, that is, formatted and packaged into a media file. On the receiving side, the media file is demultiplexed to separate the audio and video streams, each stream is decoded separately, and the two are played back independently. Because their playback rates differ, audio and video can drift out of sync. The corresponding rate for each is:
- Audio: Sample rate
- Video: Frame rate
Sound cards and graphics cards generally play data one frame at a time, so the playback duration of each audio frame and each video frame needs to be calculated. For example:
From the earlier calculation, for an MP3 audio file with a sample rate of 44.1 kHz, each frame lasts about 26 ms. If the video frame rate is 30 fps, each video frame lasts 1000 / 30 ≈ 33 ms. In the ideal case, if playback proceeds exactly according to these calculated values, the audio and video can be considered synchronized.
In practice, audio and video drift out of sync for various reasons: the decoding and rendering time varies from frame to frame (a frame with rich colors may take longer to decode and render than one with simple colors), and the calculated durations themselves carry rounding errors. There are three main approaches to audio-video synchronization:
- Video synchronization to audio
- Audio synchronization to video
- Audio-video synchronization to an external clock
Usually the video is synchronized to the audio clock, because human hearing is more sensitive to delay and stuttering than vision, so keeping the audio output steady is the priority. Synchronization allows a certain amount of delay, as long as it stays within an acceptable range, and it works like a feedback mechanism: when the video falls behind the audio, video playback is sped up, for example by dropping frames to catch up, which also reduces any accumulated delay; conversely, when the video runs ahead of the audio, video playback is slowed down. A rough sketch of this decision logic follows.
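Below is a minimal C sketch of that feedback idea for syncing video to the audio clock: compare the next video frame's PTS with the current audio clock and decide whether to drop the frame, show it, or wait. The threshold value, the sample timestamps, and the function name are all hypothetical:

```c
#include <stdio.h>

#define SYNC_THRESHOLD_MS 40.0   /* illustrative threshold, roughly one frame */

/* Decide what to do with a video frame given its PTS and the current audio clock. */
static const char *sync_decision(double video_pts_ms, double audio_clock_ms)
{
    double diff = video_pts_ms - audio_clock_ms;
    if (diff < -SYNC_THRESHOLD_MS) return "drop frame (video behind audio)";
    if (diff >  SYNC_THRESHOLD_MS) return "wait before showing (video ahead of audio)";
    return "show now (in sync)";
}

int main(void)
{
    /* Made-up sample points: (video PTS, audio clock) in milliseconds. */
    double samples[][2] = { {33.0, 33.0}, {66.0, 120.0}, {200.0, 130.0} };
    for (int i = 0; i < 3; i++)
        printf("pts=%.0f audio=%.0f -> %s\n",
               samples[i][0], samples[i][1],
               sync_decision(samples[i][0], samples[i][1]));
    return 0;
}
```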