Introduction to Audio

I am currently developing an Automatic Speech Recognition system. When it comes to ML, many of us work with tabular data or images most of the time. So, for anyone who wants to know the true form of audio, how audio data is crunched and fed into an ML algorithm, this is a good place to start.

Sound

  • Sound is a continuous signal with infinitely many signal values
  • Digital devices require finite arrays, so we need to convert the continuous signal into a series of discrete values
  • This is known as a digital representation
  • Sound Power — Rate at which energy is transferred ($Watt$)
  • Sound Intensity — Sound Power per unit area ($Watt/m^2$)
  • Audio File Formats

    File formats are differentiated by the way they compress the digital representation of the audio signal. The following are the most widely used file formats:

    1. .wav (typically uncompressed)
    2. .flac (Free Lossless Audio Codec)
    3. .mp3 (lossy compression)

    Steps to Conversion

    • The microphone captures the sound wave as an analog signal.
    • The sound wave is thereby converted into an electrical signal.
    • This electrical signal is then digitized by an analog-to-digital converter (ADC).

    Sampling

    • It is the process of measuring the value of a signal at fixed time steps
    • Once sampled, the waveform is in a discrete format (see the sketch below)
    Fig.1: Sound Wave Representation
    Fig.2: Unit Time after which sample is taken
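
    A minimal NumPy sketch of sampling, where a 440 Hz sine tone (an arbitrary choice) stands in for the continuous signal:

    ```python
    import numpy as np

    frequency = 440.0      # Hz, an arbitrary test tone
    duration = 1.0         # seconds
    sampling_rate = 16000  # samples per second

    # Measure the signal value at fixed time steps, 1/sampling_rate apart
    t = np.arange(0, duration, 1 / sampling_rate)
    samples = np.sin(2 * np.pi * frequency * t)

    print(samples.shape)  # (16000,) -> one second of audio as discrete values
    ```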

    Sampling Rate

    • No. of samples taken per second, measured in Hertz (Hz)
    • If 1000 samples are taken per second, then the sampling rate (SR) = 1000 Hz
    • HIGHER SR -> BETTER AUDIO QUALITY
    Fig.3: Sampling Rate
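
    You can inspect the sampling rate of a real file with librosa; "speech.wav" below is a placeholder path, so substitute any audio file you have:

    ```python
    import librosa

    # sr=None keeps the file's native sampling rate
    # (librosa otherwise resamples to its 22050 Hz default)
    waveform, sr = librosa.load("speech.wav", sr=None)

    print(f"Sampling rate: {sr} Hz")
    print(f"Number of samples: {len(waveform)}")
    print(f"Duration: {len(waveform) / sr:.2f} s")
    ```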

    Sampling Rate Considerations

    • The sampling rate must be at least twice the highest frequency we want to capture from the signal (the Nyquist rate): $SR \geq 2 \times f_{max}$
    • For human speech, most of the useful frequency content lies below 8 kHz, hence a sampling rate (SR) of 16 kHz is common in speech recognition
    • Although a higher SR gives better audio quality, that does not mean we should keep increasing it
    • Beyond the required rate, it does not add any information and only increases the computation cost
    • A low SR, however, can cause a loss of information (aliasing), as the sketch below shows
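
    A small NumPy sketch of that loss: a 6 kHz tone sampled at 8 kHz (below its Nyquist rate of 12 kHz) produces exactly the same samples as an inverted 2 kHz tone, so the two become indistinguishable:

    ```python
    import numpy as np

    sr = 8000                       # too low for a 6 kHz tone (needs >= 12 kHz)
    t = np.arange(0, 0.01, 1 / sr)  # 10 ms of sample times

    tone_6k = np.sin(2 * np.pi * 6000 * t)
    tone_2k = np.sin(2 * np.pi * 2000 * t)

    # The 6 kHz tone aliases onto 2 kHz: information is lost
    print(np.allclose(tone_6k, -tone_2k))  # True
    ```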

    Some Points to Remember

    • While training, all audio samples should have the same sampling rate
    • If you are using a pre-trained model, your audio should be resampled to match the SR of the data the model was trained on (see the sketch below)
    • If data with different SRs is mixed, the model does not generalize well
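
    A minimal resampling sketch with librosa, assuming a hypothetical pre-trained model that expects 16 kHz input and a placeholder file path:

    ```python
    import librosa

    target_sr = 16000  # assumed SR of the pre-trained model's training data

    # sr=None loads the file at its native sampling rate
    waveform, native_sr = librosa.load("speech.wav", sr=None)

    if native_sr != target_sr:
        waveform = librosa.resample(waveform, orig_sr=native_sr, target_sr=target_sr)
    ```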

    Amplitude

    • Sound is made by changes in air pressure at frequencies humans can hear
    • Amplitude — the sound pressure level at a given instant, measured in dB (decibels)
    • Amplitude is a measure of loudness (see the sketch below)
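
    From digital samples we cannot measure sound pressure directly, but we can express amplitude on the same logarithmic dB scale relative to full scale (dBFS); a small NumPy sketch, assuming a float waveform in [-1, 1]:

    ```python
    import numpy as np

    # A test waveform with peak amplitude 0.5 (half of full scale)
    t = np.arange(16000) / 16000
    waveform = 0.5 * np.sin(2 * np.pi * 440 * t)

    peak = np.max(np.abs(waveform))
    peak_db = 20 * np.log10(peak)  # 0 dBFS = the largest representable amplitude

    print(f"Peak amplitude: {peak_db:.1f} dBFS")  # about -6.0 dBFS
    ```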

    Bit Depth

    • Describes the precision with which each sample value can be represented
    • The higher the bit depth, the more closely the digital representation resembles the original continuous sound wave
    • Common values of bit depth are 16-bit and 24-bit

    Quantizing

    Initially, audio is in continuous form, a smooth wave. To store it digitally, we need to represent its values in small discrete steps; quantizing does exactly that.
    Fig.4: Quantizing
    The bit depth determines how many steps are available to represent each sample value:
    • 16-bit audio — $2^{16} = 65536$ steps
    • 24-bit audio — $2^{24} = 16777216$ steps
    • Quantizing induces noise, hence a higher bit depth is preferred
    • In practice, though, this quantization noise is rarely a problem
    • 16-bit and 24-bit audio are stored as integer samples, whereas 32-bit audio samples are stored as floating-point values
    • Models require floating-point input, so we need to convert integer audio into floating point before we train the model (see the sketch below)
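
    A minimal sketch of that conversion, assuming 16-bit integer samples:

    ```python
    import numpy as np

    # 16-bit audio: integer samples in [-32768, 32767] (2**16 = 65536 steps)
    int16_audio = np.array([0, 16384, -32768, 32767], dtype=np.int16)

    # Divide by 2**15 to get float32 samples in [-1.0, 1.0)
    float_audio = int16_audio.astype(np.float32) / 32768.0

    print(float_audio)  # [ 0.  0.5  -1.  0.9999695]
    ```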

    Implementation

    Check out the notebook for the code, and play with it to get to know audio data better.