For background on audio and video, you can first read the following articles:
- Fundamentals of Audio and Video Development
- Audio Frames, Video Frames, and Synchronization
- Camera2, MediaCodec Recording mp4
- In-depth Explanation of Android Native Encoding and Decoding Interface MediaCodec
This article summarizes the basics of audio and covers the following topics:
- Generation of sound
- Three elements of sound
- Analog-to-digital conversion
- Raw audio data
- PCM and WAV
- Audio processing flow
Generation of sound
Sound is produced by the vibration of objects. It is a mechanical wave that propagates through media such as air, solids, and liquids. The human ear can perceive sound waves between 20Hz and 20000Hz, known as audible sound waves. By frequency, sound waves are mainly divided into:
- Audible sound waves: 20Hz~20kHz
- Ultrasonic waves: > 20kHz
- Infrasound: < 20Hz
In addition, the frequency range of the human voice is generally 85Hz~1100Hz.
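The frequency ranges above can be expressed as a small lookup. This is only an illustrative sketch: the function name `classify_frequency` is made up for this example, and the 20Hz and 20kHz boundaries are treated as belonging to the audible range, which is an assumption rather than a strict definition.

```python
def classify_frequency(freq_hz: float) -> str:
    """Classify a sound wave by frequency, using the ranges listed above."""
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    if freq_hz < 20:
        return "infrasound"
    if freq_hz <= 20_000:
        return "audible"
    return "ultrasonic"


for f in (5, 440, 25_000):
    print(f, "Hz ->", classify_frequency(f))
```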
Three elements of sound
The three elements of sound are pitch, volume, and timbre:
- Pitch: How high or low a sound is perceived to be, which is determined by its frequency. The faster an object vibrates, the higher the pitch of the sound it produces; the slower it vibrates, the lower the pitch.
- Volume: Also known as loudness, it is the human ear's subjective perception of how strong a sound is, and it is determined mainly by the amplitude of the sound wave.
- Timbre: Also known as tone quality, it is the characteristic that makes sounds distinguishable even at the same pitch and volume. Different objects vibrate in different ways, so each produces a waveform with its own shape. Timbre is determined by harmonics: a pleasant sound is not a pure sine wave but a fundamental combined with harmonics.
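The three elements can be related to a synthesized waveform, as in the following minimal sketch. The helper `tone` and its parameters are hypothetical names chosen for this example: the fundamental frequency stands in for pitch, the overall amplitude for volume, and the list of harmonic weights for timbre. The weights are assumed to be positive.

```python
import math


def tone(freq_hz, amplitude, harmonics, sample_rate=8000, duration_s=0.01):
    """Return the samples of a tone built from a fundamental and its harmonics.

    freq_hz   -> pitch (fundamental frequency)
    amplitude -> volume (peak level of the result is at most this value)
    harmonics -> timbre (relative weights of the 1st, 2nd, 3rd... harmonic)
    """
    n = round(sample_rate * duration_s)
    total = sum(harmonics)  # normalize so the result never exceeds `amplitude`
    samples = []
    for i in range(n):
        t = i / sample_rate
        value = sum(
            w * math.sin(2 * math.pi * freq_hz * (k + 1) * t)
            for k, w in enumerate(harmonics)
        )
        samples.append(amplitude * value / total)
    return samples


pure = tone(440, 0.5, [1.0])             # a plain sine wave
rich = tone(440, 0.5, [1.0, 0.5, 0.25])  # same pitch, different timbre
```

Both lists describe a 440Hz tone, but the second waveform has a different shape because of the added harmonics, which is what the ear perceives as a different timbre.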
Analog-to-digital conversion
Sound is an analog audio signal. If you want to digitize sound, you need to convert the analog audio signal into a digital signal. This is called analog-to-digital conversion. The main process includes sampling, quantization, and encoding, as shown in the following figure:
- Sampling: The process of converting a signal that is continuous in time into a discrete one, where the value measured at a sampling instant stands in for the continuous signal over that sampling interval. The number of samples taken in 1 second is the sampling rate. For example, 8kHz is the sampling rate of telephone signals, which is sufficient for speech communication. Audio CDs generally use 44.1kHz, and digital TV generally uses 48kHz. The higher the sampling rate, the higher the fidelity of sound reproduction, because the sampling rate must be at least twice the highest frequency to be captured; 44.1kHz therefore covers the 20kHz upper limit of hearing.
- Quantization: The process of mapping each sampled amplitude, which can still take any value, onto one of a finite set of discrete levels. Quantization can be uniform or non-uniform. The figure above uses uniform quantization with 8 quantization levels.
- Encoding: The process of converting each quantized value into the corresponding binary code. The simplest scheme is natural binary code; other encoding schemes exist and are worth exploring if you are interested. The encoding shown in the figure is source encoding; there is also channel encoding.
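The three steps can be sketched in a few lines. This is a simplified illustration, not a model of a real converter: the function names `sample`, `quantize`, and `encode` are invented for this example, the "analog" input is just a Python function of time, and the input range is assumed to be [-1.0, 1.0]. It uses uniform quantization with 8 levels and natural binary code, matching the description above.

```python
import math


def sample(signal, sample_rate, duration_s):
    """Sampling: read the continuous signal at evenly spaced instants."""
    n = round(sample_rate * duration_s)
    return [signal(i / sample_rate) for i in range(n)]


def quantize(value, levels=8, lo=-1.0, hi=1.0):
    """Uniform quantization: map a value in [lo, hi] to a level 0..levels-1."""
    value = min(max(value, lo), hi)  # clip out-of-range input
    step = (hi - lo) / levels
    return min(int((value - lo) / step), levels - 1)


def encode(level, levels=8):
    """Encoding: write the level index in natural binary code."""
    bits = max(1, (levels - 1).bit_length())
    return format(level, f"0{bits}b")


def signal(t):
    """Stand-in for an analog input: a 1kHz sine wave."""
    return math.sin(2 * math.pi * 1000 * t)


samples = sample(signal, sample_rate=8000, duration_s=0.001)
codes = [encode(quantize(s)) for s in samples]
print(codes)
```

With 8 levels each sample becomes a 3-bit code. Using more levels (for example 65536, that is 16 bits) reduces the rounding error introduced by quantization.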
Raw audio data
PCM (Pulse Code Modulation) is a method of converting analog audio signals into digital audio signals. In audio and video, PCM refers to uncompressed audio sample data, that is, the raw audio data produced from an audio signal by sampling, quantization, and encoding. The key parameters that describe PCM data are as follows:
- Sample Size: The size of each sample, which is also the number of bits used to store each sample. The commonly used size is 16 bits.
- Sample Rate: The number of samples per second, in Hz. Common sample rates include 8kHz, 16kHz, 32kHz, 44.1kHz, and 48kHz.
- Number of Channels: The number of channels in the current PCM data, such as mono, stereo, multi-channel, etc.
- Byte Ordering: The byte order in which PCM data is stored, whether it is big-endian or little-endian. In order to improve data processing efficiency, little-endian is usually used.
- Sign: Indicates whether the samples are stored as signed or unsigned values.
- Integer or Floating Point: Indicates whether the PCM data is represented using integers or floating-point numbers.
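To make these parameters concrete, the sketch below packs one sample in a few different representations using Python's standard `struct` module. The sample values are arbitrary, and the interleaved left/right layout shown for stereo is the common convention for PCM streams, assumed here rather than taken from this article.

```python
import struct

sample_value = -12345  # one sample as a signed integer

little = struct.pack("<h", sample_value)  # 16-bit signed, little-endian
big = struct.pack(">h", sample_value)     # 16-bit signed, big-endian
as_float = struct.pack("<f", sample_value / 32768)  # 32-bit float sample

print(little.hex(), big.hex(), len(as_float))

# Stereo data is commonly interleaved: L0 R0 L1 R1 ...
left, right = [100, 200], [-100, -200]
frames = b"".join(struct.pack("<hh", l, r) for l, r in zip(left, right))
print(len(frames), "bytes for 2 stereo frames")
```

The same two bytes appear in opposite order in the little-endian and big-endian results, which is why a reader of raw PCM must know the byte order in advance: a raw PCM stream carries no header describing it.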
With these parameters in hand, how is the bitrate calculated? The bitrate is the number of bits of sample data produced per second:
Bitrate = Sample Rate * Sample Size * Number of Channels
For example:
For a PCM-encoded WAV file with a sample rate of 44.1kHz, a sample size of 16 bits, and two channels (stereo), the bitrate is 44.1k * 16 * 2 = 1411.2kbit/s. Transmitting such audio means handling more than 1Mbit (about 176KB) of data every second. In addition, upload speed is often much lower than download speed, so audio compression is needed.
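The calculation is simple enough to check in code. The function name `pcm_bitrate` is just a label for this sketch.

```python
def pcm_bitrate(sample_rate_hz, sample_size_bits, channels):
    """Bitrate of uncompressed PCM in bits per second."""
    return sample_rate_hz * sample_size_bits * channels


bitrate = pcm_bitrate(44_100, 16, 2)
print(bitrate / 1000, "kbit/s")         # 1411.2
print(bitrate / 8 / 1000, "kB/s")       # 176.4
print(bitrate // 8 * 60, "bytes/min")   # 10584000
```

One minute of CD-quality stereo PCM is therefore a little over 10MB, which shows why compressed formats are preferred for transmission.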
PCM and WAV
PCM was described in the previous section. WAV is an audio file format that is usually used to store audio losslessly. It does not mandate a particular audio encoding: the payload is usually PCM, but other encodings, such as MP3, are also possible. In summary:
- PCM: An encoding method. In the field of audio and video, it refers to the raw audio data stream.
- WAV: An audio file format that can store PCM data. A PCM WAV file is essentially the PCM data with a WAV header placed in front of it.
Finally, here is a schematic diagram of a WAV header:
More details will be added after further research.
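Until then, the relationship between PCM and WAV can be explored with Python's standard `wave` module, as in the sketch below. It writes a few PCM samples into an in-memory WAV file and unpacks the header. The 44-byte layout and the field names used here describe what the `wave` module produces for simple PCM; WAV files from other tools may contain extra chunks, so a fixed 44-byte header should not be assumed in general.

```python
import io
import struct
import wave

# Two stereo frames of 16-bit little-endian PCM (arbitrary sample values).
pcm = struct.pack("<4h", 0, 1000, -1000, 0)

buf = io.BytesIO()
with wave.open(buf, "wb") as wav:
    wav.setnchannels(2)      # stereo
    wav.setsampwidth(2)      # bytes per sample, i.e. 16 bits
    wav.setframerate(44100)  # sample rate in Hz
    wav.writeframes(pcm)

data = buf.getvalue()
header, body = data[:44], data[44:]

(riff_id, riff_size, wave_id,
 fmt_id, fmt_size, audio_format, channels, sample_rate,
 byte_rate, block_align, bits_per_sample,
 data_id, data_size) = struct.unpack("<4sI4s4sIHHIIHH4sI", header)

print(riff_id, wave_id, fmt_id, data_id)
print(audio_format, channels, sample_rate, bits_per_sample)
print(byte_rate, block_align, data_size)
print(body == pcm)  # the payload after the header is the unchanged PCM data
```

The header repeats the PCM parameters discussed earlier (channels, sample rate, bits per sample), which is exactly the information a raw PCM stream lacks.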
Audio Processing Flow
The audio processing flow can be summarized in two parts. First, generating an audio file: on Android, for example, the audio data captured with AudioRecord is PCM data, a digital audio signal in its raw, uncompressed form. (MediaRecorder also captures audio, but it encodes the data and writes the file itself rather than exposing the PCM stream.) The PCM data is encoded and then multiplexed into the corresponding audio file. Second, playing an audio file: the file is demultiplexed and decoded back into PCM, which is then played.