Today we will cover some basic audio and video knowledge of the kind we run into in everyday development work. For example, in our current projects we use TSPlayer, IjkPlayer, and MediaPlayer to provide playback capability. Whichever player is used, the upper-level calls look quite similar, but the underlying implementations and supported capabilities differ, and going any deeper requires a solid grounding in audio and video fundamentals. The main directions in Android development include applications, the Framework, audio and video, the NDK, and so on; anyone who stays in the Android field will need to explore these areas sooner or later. The main topics are as follows:
- Video Encoding
- Audio Encoding
- Multimedia Playback Components
- Common Multimedia Frameworks and Solutions
- Frame Rate
- Resolution
- Refresh Rate
- Encoding Format
- Container Format
- Bitrate
- Color Space
- Sampling Rate
- Quantization Precision
- Channels
Video Encoding#
Video encoding refers to converting a video file from one format into another by means of a particular compression technique. The main codec standards used in video transmission are as follows:
- Motion JPEG (M-JPEG)
  - M-JPEG, short for Motion-JPEG, is an image compression coding standard. The JPEG standard is mainly used for still images, while M-JPEG treats a moving video sequence as a series of continuous still images. This approach compresses every frame in full and allows random access to any frame during editing, enabling frame-accurate editing. Because M-JPEG only removes spatial redundancy within a frame and does not exploit temporal redundancy between frames, its compression efficiency is not high.
- International Organization for Standardization (ISO) MPEG series standards
  - The MPEG standards mainly include five: MPEG-1, MPEG-2, MPEG-4, MPEG-7, and MPEG-21. MPEG video compression primarily uses motion-compensated inter-frame prediction to reduce temporal redundancy, DCT-based coding to reduce spatial redundancy, and entropy coding to reduce the statistical redundancy of the representation. The combined use of these techniques greatly improves compression performance.
- International Telecommunication Union (ITU-T) standards: H.261, H.263, H.264, etc.
  - H.261: The first practical digital video coding standard, based on a compression algorithm that combines motion-compensated inter-frame prediction with block DCT. Its motion compensation uses full-pixel accuracy plus a loop filter, and it supports the CIF and QCIF resolutions.
  - H.263: H.263 uses essentially the same coding algorithm as H.261 but with a number of improvements, which lets it deliver better image quality than H.261 at lower bitrates. Its motion compensation uses half-pixel accuracy, and it supports five resolutions: SQCIF, QCIF, CIF, 4CIF, and 16CIF.
  - H.264: H.264 is a digital video coding standard developed jointly by ISO and ITU-T through the Joint Video Team (JVT). It is therefore both ITU-T H.264 and ISO/IEC MPEG-4 Part 10, Advanced Video Coding (AVC); MPEG-4 AVC, MPEG-4 Part 10, and ISO/IEC 14496-10 all refer to the same standard. H.264 is a hybrid coding system built on the traditional block-based framework with many local optimizations, focusing on coding efficiency and reliability. It achieves high compression ratios while preserving smooth, high-quality images, so H.264-compressed video needs less bandwidth for network transmission, making it one of the most efficient and widely deployed video compression standards (a sketch after this list shows how to query an Android device for its H.264 decoders).
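On Android, H.264 (like most of the codecs above) is reached through MediaCodec. As a minimal Kotlin sketch, assuming API level 21 or higher, the snippet below lists the H.264 decoders a device exposes; "video/avc" is the MIME type Android uses for H.264:

```kotlin
import android.media.MediaCodecList

// List the names of the H.264 (AVC) decoders available on this device.
fun findAvcDecoders(): List<String> =
    MediaCodecList(MediaCodecList.ALL_CODECS).codecInfos
        .filter { info ->
            !info.isEncoder &&
                info.supportedTypes.any { it.equals("video/avc", ignoreCase = true) }
        }
        .map { it.name }
```

Typical results include a hardware decoder plus Google's software decoder, which is one concrete way these codec standards surface in the Android components discussed below.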
Audio Encoding#
Common audio codec standards are as follows:
- ITU: G.711, G.729, etc.
- MPEG: MP3, AAC, etc.
- 3GPP: AMR, AMR-WB, AMR-WB+, etc.
- There are also standards set by companies, such as Dolby AC-3, DTS, WMA, etc.
Brief introductions to the common ones are as follows:
- MP3 (MPEG-1 Audio Layer 3): An audio compression technology designed to dramatically reduce the amount of audio data. Using MPEG Audio Layer 3, music can be compressed at a ratio of 1:10 or even 1:12 into much smaller files, and for most listeners the playback quality shows no obvious degradation compared with the uncompressed original. MP3 exploits the human ear's relative insensitivity to high-frequency sound: the time-domain waveform is transformed into the frequency domain and split into multiple frequency bands, and different bands are compressed at different rates, with high frequencies compressed more aggressively (or even discarded) and low frequencies compressed gently to avoid audible distortion. In effect it throws away high-frequency content the ear can barely hear while keeping the audible low-frequency parts, which is how it achieves its compression. MP3 is a lossy compression format.
- AAC: Short for Advanced Audio Coding, initially built on MPEG-2 audio coding technology. After MPEG-4 appeared, AAC was reworked to incorporate its features and gained the SBR and PS technologies; to distinguish it from the traditional MPEG-2 AAC, this version is also called MPEG-4 AAC. AAC is a compression format designed specifically for audio data. Compared with MP3, AAC offers better sound quality at smaller file sizes, but it is still a lossy format, and with the spread of high-capacity storage its advantage is becoming less significant (a minimal MediaCodec configuration for an AAC encoder appears at the end of this section).
- WMA: Short for Windows Media Audio, a family of audio codecs developed by Microsoft and also the name of the corresponding audio coding format. WMA includes four distinct codecs: the original WMA codec, which competes with MP3 and RealAudio; WMA Pro, which supports more channels and higher-quality audio; WMA Lossless, a lossless codec; and WMA Voice, for storing speech at low bitrates. Some audio-only ASF files encoded with Windows Media Audio also use the WMA extension. A notable characteristic is built-in copy protection, so unauthorized copies cannot be played locally. Aside from WMA Lossless, WMA is likewise a lossy compression format.
More audio and video codec standards can be referenced at: Audio Codec Standards
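To connect these standards to the Android side, here is a minimal, hedged Kotlin sketch (assuming API level 21+) that configures an AAC-LC encoder through MediaCodec at 44.1 kHz, stereo, 128 kbps; the specific bitrate and profile values are illustrative choices, not requirements:

```kotlin
import android.media.MediaCodec
import android.media.MediaCodecInfo
import android.media.MediaFormat

// Configure an AAC-LC encoder: 44.1 kHz, stereo, 128 kbps.
// MIMETYPE_AUDIO_AAC corresponds to the "audio/mp4a-latm" MIME type.
fun createAacEncoder(): MediaCodec {
    val format = MediaFormat.createAudioFormat(MediaFormat.MIMETYPE_AUDIO_AAC, 44_100, 2).apply {
        setInteger(MediaFormat.KEY_AAC_PROFILE, MediaCodecInfo.CodecProfileLevel.AACObjectLC)
        setInteger(MediaFormat.KEY_BIT_RATE, 128_000)
    }
    return MediaCodec.createEncoderByType(MediaFormat.MIMETYPE_AUDIO_AAC).apply {
        configure(format, null, null, MediaCodec.CONFIGURE_FLAG_ENCODE)
    }
}
```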
Multimedia Playback Components#
Android multimedia playback components include MediaPlayer, MediaCodec, OMX, StageFright, AudioTrack, etc., specifically as follows:
- MediaPlayer: Provides playback control interfaces for the application layer.
- MediaCodec: Provides access to the underlying media codec interfaces.
- OpenMAX: Open Media Acceleration, abbreviated OMX, is an open multimedia acceleration layer and a standard for multimedia applications. Android's main multimedia engine, StageFright, talks to OpenMAX components over Binder for encoding and decoding.
- StageFright: Introduced in Android 2.2 to replace the previous media playback engine, OpenCORE. StageFright is a media playback engine at the Native layer that ships with software codecs for popular media formats. Its encoding and decoding go through the OpenMAX framework, and it incorporates the omx-component part of OpenCORE; in Android it exists as the shared library libstagefright.so.
- AudioTrack: Manages and plays a single audio resource. It accepts only PCM data; most WAV files, for example, contain PCM streams that AudioTrack can play directly (see the sketch after this list).
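As a minimal sketch of the AudioTrack path (Kotlin, assuming API level 23+ for AudioTrack.Builder), the following plays one second of a generated 440 Hz tone as 16-bit mono PCM at 44.1 kHz; the generated tone is just a stand-in for real PCM data such as decoded WAV content:

```kotlin
import android.media.AudioAttributes
import android.media.AudioFormat
import android.media.AudioTrack
import kotlin.math.PI
import kotlin.math.sin

// Play one second of a 440 Hz sine tone: 44.1 kHz, mono, 16-bit PCM —
// the raw format AudioTrack expects.
fun playTone() {
    val sampleRate = 44_100
    val samples = ShortArray(sampleRate) { i ->
        (sin(2.0 * PI * 440.0 * i / sampleRate) * Short.MAX_VALUE).toInt().toShort()
    }
    val track = AudioTrack.Builder()
        .setAudioAttributes(
            AudioAttributes.Builder().setUsage(AudioAttributes.USAGE_MEDIA).build()
        )
        .setAudioFormat(
            AudioFormat.Builder()
                .setEncoding(AudioFormat.ENCODING_PCM_16BIT)
                .setSampleRate(sampleRate)
                .setChannelMask(AudioFormat.CHANNEL_OUT_MONO)
                .build()
        )
        .setBufferSizeInBytes(samples.size * 2) // 2 bytes per 16-bit sample
        .setTransferMode(AudioTrack.MODE_STATIC)
        .build()
    track.write(samples, 0, samples.size)
    track.play()
}
```

MODE_STATIC is used here because the whole buffer is written up front; streaming playback would use MODE_STREAM and write data incrementally.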
Common Multimedia Frameworks and Solutions#
Common multimedia frameworks and solutions include VLC, FFmpeg, GStreamer, etc., specifically as follows:
- VLC: Short for VideoLAN Client, a free, open-source, cross-platform multimedia player and framework.
- FFmpeg: A multimedia solution, not a multimedia framework, widely used in audio and video development.
- GStreamer: An open-source multimedia framework for building streaming media applications.
Frame Rate#
Frame rate measures how many frames are displayed per second, expressed in frames per second (FPS) or hertz (Hz) — that is, the number of frames the graphics pipeline can produce or update each second. A higher frame rate yields smoother, more lifelike motion. In general 30 fps is acceptable, and raising it to 60 fps noticeably improves interactivity and realism, but beyond roughly 75 fps the gain in perceived smoothness is usually negligible. If the frame rate exceeds the screen's refresh rate, the extra frames simply waste graphics processing power, because the monitor cannot update that fast.
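A quick worked example of what a frame rate means for rendering time — the per-frame budget is simply the reciprocal of the frame rate:

```kotlin
// Per-frame time budget in milliseconds for a given frame rate.
fun frameBudgetMs(fps: Int): Double = 1000.0 / fps

fun main() {
    println(frameBudgetMs(30)) // ≈ 33.3 ms to produce each frame
    println(frameBudgetMs(60)) // ≈ 16.7 ms to produce each frame
}
```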
Resolution#
Video resolution refers to the pixel dimensions of the image a video device produces. What do common labels such as 1080p and 4K actually stand for? The "p" means progressive scan, and the number counts the rows of pixels, so 1080p has 1080 rows of pixels; the "K" counts the columns of pixels in thousands, so 4K has roughly 4000 columns. In practice, 1080p usually means 1920 x 1080, while 4K (UHD) means 3840 x 2160 — exactly four times the pixel count of 1080p.
Refresh Rate#
The refresh rate is the number of times the screen image is redrawn per second, measured in Hz (hertz). It is divided into a vertical refresh rate and a horizontal refresh rate, and the term usually refers to the vertical one. The higher the refresh rate, the more stable the image, the clearer the display, and the less strain on the eyes; a low refresh rate produces noticeable flicker and jitter and tires the eyes faster. Generally, a refresh rate above 80 Hz eliminates perceptible flicker and jitter, making eye fatigue much less likely.
Encoding Format#
For audio and video, the encoding format corresponds to audio encoding and video encoding. Referring to the previous audio and video encoding standards, each encoding standard corresponds to a specific encoding algorithm, aiming to achieve data compression and reduce data redundancy through certain encoding algorithms.
Container Format#
Quoting Baidu Baike's introduction: a container format (also called a wrapper) is a file format that packages already encoded and compressed video and audio tracks into a single file. In other words, it is just a shell — you can think of it as a folder that holds the video and audio tracks. To put it plainly, if the video track is the rice and the audio track is the dishes, the container format is the bowl or pot that holds them.
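On Android, MediaExtractor is the component that opens a container and exposes the tracks inside it. A minimal Kotlin sketch (the file path is a hypothetical placeholder):

```kotlin
import android.media.MediaExtractor
import android.media.MediaFormat

// Open a container file (e.g. MP4/MKV) and print the MIME type of each track it wraps.
fun listTracks(path: String) {
    val extractor = MediaExtractor()
    try {
        extractor.setDataSource(path)
        for (i in 0 until extractor.trackCount) {
            val format = extractor.getTrackFormat(i)
            println("Track $i: ${format.getString(MediaFormat.KEY_MIME)}")
        }
    } finally {
        extractor.release()
    }
}
```

For an MP4 holding H.264 video and AAC audio, the printed MIME types would typically be video/avc and audio/mp4a-latm.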
Bitrate#
Bitrate, also called bit rate, is the number of bits transmitted or processed per unit of time, measured in bps (bits per second), also written b/s. The higher the bitrate, the more data (in bits) is moved per unit of time. In the multimedia field, the data rate of audio or video per unit of time is usually expressed as a bitrate in kbps. As a rough rule of thumb, a nominal 1 Mbps broadband connection delivers at most about 125 KB per second, so it can only stream video whose bitrate stays at or below roughly 1 Mbps (125 KB/s) without buffering; anything higher has to buffer before it plays back smoothly.
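Bitrate also determines file size: the size in bytes is roughly the average bitrate times the duration, divided by 8. A small worked example (the 2000 kbps and 90 minute figures are arbitrary illustrations):

```kotlin
// Rough file-size estimate: size (bytes) ≈ bitrate (bits/s) × duration (s) / 8.
fun estimatedSizeBytes(bitrateKbps: Int, durationSeconds: Int): Long =
    bitrateKbps.toLong() * 1000 * durationSeconds / 8

fun main() {
    // A 90-minute video at an average of 2000 kbps ≈ 1.35 GB.
    println(estimatedSizeBytes(2000, 90 * 60)) // 1350000000
}
```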
Bitrate is generally divided into constant bitrate and variable bitrate:
- Constant bitrate ensures that the bitrate remains constant but sacrifices video quality. For example, to maintain a constant bitrate, some content with rich images may lose some image details and become blurry.
- Variable bitrate means that the output stream's bitrate is variable because the peak information content of the video source itself varies. From the perspective of ensuring video transmission quality and fully utilizing information, variable bitrate video encoding is the most reasonable.
Bitrate is broadly proportional to both video quality and file size, but once the bitrate exceeds a certain threshold, raising it further no longer improves perceived video quality.
Color Space#
- YUV: A color encoding method commonly used in image-processing pipelines. When encoding photos or video, YUV takes human perception into account, which allows the chroma bandwidth to be reduced. Y represents luminance (brightness), while U and V carry the chrominance (color) information. The terms Y′UV, YUV, YCbCr, and YPbPr overlap and are often confused: historically, YUV and Y′UV were used to encode analog television signals, whereas YCbCr describes digital image signals and is used in video and image compression and transmission such as MPEG and JPEG. Today "YUV" is used loosely for all of these in computer systems.
- RGB: The additive color model, also known as the RGB color model or red-green-blue color model, is a color model that combines red (R), green (G), and blue (B) light in different proportions to produce various colors. Most modern displays adopt the RGB color standard.
YUV was designed mainly to optimize the transmission of color video signals while remaining backward compatible with older black-and-white televisions. Compared with transmitting RGB video signals, its biggest advantage is that it needs less bandwidth, because the chroma components can be subsampled with little perceived loss.
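The two color spaces are related by a simple linear transform. A minimal sketch using approximate full-range BT.601 coefficients (one common variant; broadcast video often uses a limited-range form):

```kotlin
import kotlin.math.roundToInt

// RGB -> YUV conversion with approximate full-range BT.601 coefficients.
fun rgbToYuv(r: Int, g: Int, b: Int): Triple<Int, Int, Int> {
    val y = (0.299 * r + 0.587 * g + 0.114 * b).roundToInt()
    val u = (-0.169 * r - 0.331 * g + 0.500 * b + 128).roundToInt()
    val v = (0.500 * r - 0.419 * g - 0.081 * b + 128).roundToInt()
    return Triple(y.coerceIn(0, 255), u.coerceIn(0, 255), v.coerceIn(0, 255))
}

fun main() {
    println(rgbToYuv(255, 255, 255)) // white -> (255, 128, 128): full luma, neutral chroma
}
```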
Sampling Rate#
The sampling rate is the number of samples per second taken from a continuous signal to form a discrete signal, measured in hertz (Hz); it is the sampling frequency used when converting an analog signal into a digital one. The human ear can generally hear sounds from about 20 Hz to 20 kHz, and according to the sampling theorem, a signal sampled at more than twice its highest frequency can be fully reconstructed from the samples. Common sampling rates are as follows:
- 8000 Hz: The sampling rate used for telephones, sufficient for human speech.
- 11025 Hz: The sampling rate used for AM amplitude modulation broadcasting.
- 22050 Hz and 24000 Hz: The sampling rates used for FM frequency modulation broadcasting.
- 44100 Hz: The sampling rate for audio CDs, commonly used for MPEG-1 audio (VCD, SVCD, MP3).
- 47250 Hz: The sampling rate used for commercial PCM recorders.
- 48000 Hz: The sampling rate used for miniDV, digital television, DVD, DAT, movies, and professional audio.
The standard sampling frequency for CD audio is 44.1 kHz, which is also the rate most commonly used by sound cards and computer audio. Blu-ray audio goes much higher, up to 192 kHz. Most sound cards today support 44.1 kHz, 48 kHz, and 96 kHz, and high-end products support 192 kHz or even more. In short, the higher the sampling rate, the better the resulting audio quality, but also the more storage it occupies.
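The storage cost follows directly from the sampling rate, channel count, and bit depth. A small worked example for CD-quality PCM:

```kotlin
// Uncompressed PCM data rate: sampleRate × channels × (bitsPerSample / 8) bytes per second.
fun pcmBytesPerSecond(sampleRate: Int, channels: Int, bitsPerSample: Int): Int =
    sampleRate * channels * bitsPerSample / 8

fun main() {
    // CD-quality audio: 44.1 kHz, stereo, 16-bit ≈ 172 KB per second of audio.
    println(pcmBytesPerSecond(44_100, 2, 16)) // 176400
}
```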
Quantization Precision#
Converting sound waves into a digital signal depends not only on the sampling rate but also on another important factor: quantization precision. The sampling rate concerns how many samples are taken per second, while quantization precision concerns how finely the amplitude of each sample is divided: an n-bit quantizer divides the amplitude range into 2 to the power of n levels, and this bit depth is what we call the audio resolution.
The bit depth also determines the range of amplitudes that can be represented (the dynamic range, i.e., the difference between the loudest and quietest levels). More bits allow a wider range of values and describe the waveform more accurately. Each bit contributes roughly 6 dB of dynamic range, so 16-bit audio offers a theoretical maximum of about 96 dB (somewhat less in practice, around 92 dB), and by the same reasoning 20-bit audio reaches about 120 dB. A larger dynamic range is beneficial: dynamic range is the ratio between the loudest undistorted signal a system can output and its noise floor, and the larger this value, the wider the range of volume the system can handle.
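The 6 dB-per-bit rule comes from 20 · log10(2^n) ≈ 6.02 · n dB, which a short calculation confirms:

```kotlin
import kotlin.math.log10

// Theoretical dynamic range of n-bit quantization: 20 * log10(2^n) ≈ 6.02 * n dB.
fun dynamicRangeDb(bits: Int): Double = 20 * bits * log10(2.0)

fun main() {
    println(dynamicRangeDb(16)) // ≈ 96.3 dB
    println(dynamicRangeDb(20)) // ≈ 120.4 dB
}
```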
Channels#
Channels are the independent audio signals captured at different spatial positions during recording, or played back at different positions during playback; the channel count therefore corresponds to the number of sound sources when recording, or the number of speakers when playing back. Common configurations include mono, stereo, 4-channel, 5.1-channel, and 7.1-channel, as follows (a channel-mask sketch follows the list):
- Mono: A single channel, reproduced through one speaker.
- Stereo: Expands the single mono speaker into two symmetrically placed speakers, with the sound split into two independent channels during recording, which gives very good sound localization. This is especially valuable for music: listeners can clearly tell where each instrument is coming from, giving the music a stronger sense of space and bringing it closer to a live experience. Stereo has been supported by virtually every sound card since the Sound Blaster Pro and became a far-reaching audio standard.
- 4 Channels: 4-channel surround sound specifies 4 sound points: front left, front right, rear left, and rear right, surrounding the listener in the middle. It is also recommended to add a subwoofer to enhance the playback of low-frequency signals, which is why the 4.1 channel speaker system is now widely popular. Overall, the 4-channel system can provide listeners with surround sound from multiple different directions, offering a new experience.
- 5.1 Channels: The 5.1-channel system evolved from the 4.1 layout: the surround channel is split into left and right surround, a center channel is added at the front, and the ".1" remains the dedicated low-frequency (subwoofer) channel.
- 7.1 Channels: The 7.1-channel system adds two more sound points on top of 5.1 — center-left and center-right, at the listener's sides — essentially establishing a balanced sound field around the listener and reinforcing the rear sound field.
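On Android, the channel configuration is expressed as a channel mask (for example AudioFormat.CHANNEL_OUT_MONO, CHANNEL_OUT_STEREO, or CHANNEL_OUT_5POINT1). A minimal Kotlin sketch comparing the minimum AudioTrack buffer sizes for mono and stereo 16-bit PCM at 44.1 kHz:

```kotlin
import android.media.AudioFormat
import android.media.AudioTrack

// Query the minimum AudioTrack buffer size for mono vs. stereo 16-bit PCM at 44.1 kHz.
fun minBufferSizes(): Pair<Int, Int> {
    val mono = AudioTrack.getMinBufferSize(
        44_100, AudioFormat.CHANNEL_OUT_MONO, AudioFormat.ENCODING_PCM_16BIT
    )
    val stereo = AudioTrack.getMinBufferSize(
        44_100, AudioFormat.CHANNEL_OUT_STEREO, AudioFormat.ENCODING_PCM_16BIT
    )
    return mono to stereo // stereo needs roughly twice the buffer of mono
}
```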