AI Sound Recognition: How To Train Your Sound Recognition Model

Statista Market Insights expects the Speech Recognition market to grow by 14.24% annually, reaching a market volume of US$15.87bn by 2030. From the ubiquitous Siri and Alexa to more exotic apps classifying bird songs, sound recognition technology is finding applications in dozens of business niches. 

Almost all of these apps are based on sound recognition AI – AI models trained to detect, identify, and classify particular sounds. 

In this article, I explain what AI audio processing is and how you can train an ML model for your sound recognition tasks. As an example, I use TensorFlow and a spoken digit dataset from TensorFlow Datasets. 

Let’s get started. 

What is sound recognition used for?

Sound recognition allows computers to identify and categorize sounds. Sound recognition software can recognize human speech, music pieces, bird songs, or other types of sounds. 

Sound recognition works by applying classification algorithms to audio signals. The technology recognizes which category the signals belong to and provides an accurate answer to the user. 

There are several types of sound recognition technologies. 

Voice recognition

Voice recognition allows apps to identify the user by pitch, tone, cadence and other characteristics of the voice. The software identifies who is speaking and reacts accordingly. It is often used in voice biometric identification and smart home kits. Siri and Google Voice Match also use this technology to identify the owner’s voice among others. 

Speech recognition

While voice recognition recognizes the person’s voice, speech recognition recognizes the meaning of the spoken words. Voice and speech recognition are mostly used together. They allow users to interact with an app by simply talking to it. Examples of implementation include virtual assistants, chatbots, and car infotainment systems. 

Trendy AI implementations like Samsung’s Galaxy AI use voice and speech recognition a lot. Features like note-taking from voice commands, live translations of calls, or transcription of voice recordings are all reliant on voice and speech recognition. 

Music recognition

Music recognition apps identify songs by processing a short audio recording. They then compare the snippet with an extensive database and find the music piece that matches the recording best. Music recognition technology in fact predates AI – the most popular music recognition app, Shazam, uses an older fingerprinting algorithm. More advanced music app functionality, such as genre recognition and personalized recommendations, requires the use of machine learning.  

Environmental sound recognition

Many industry-specific applications detect and analyze specific environmental noises. For example, manufacturing apps can detect and report abnormal noises during factory production. In the automotive industry, noise detection software detects approaching threats and helps prevent collisions. In construction, noise detection apps measure the level of noise on the construction site, which reduces complaints due to noise pollution. 

More information on how AI can be implemented in the construction business can be found here.

Overall, sound recognition technology is versatile and has diverse applications. This creates a high demand for AI sound recognition solutions on the market. 

In the next sections, I explain how you can effectively train an ML model for sound recognition tasks. 

Definition of sound

The first step in implementing sound recognition technology is to understand what sound is from a technical standpoint. 

Sounds are vibrations (waves) that spread through the environment, for example through the air. Sound waves spread similarly to how ripples spread toward the shore when a stone is thrown into the water. 

Sound waves have three basic characteristics: time period, amplitude, and frequency.

  • Time period is the time it takes to complete one cycle of vibration, measured in seconds. 

  • Amplitude is the sound intensity measured in decibels (dB) which we perceive as loudness.

  • Frequency measured in Hertz (Hz) indicates how many sound vibrations happen per second. People interpret frequency as low or high pitch.

The human ear can perceive sound waves in the range from 20 to 20,000 Hz.

So how does AI sound recognition work?

AI sound recognition works by transforming sound into a digital form that is understandable to the machine. 

During AI audio analysis, sound waves are captured by a sound sensor and converted into a digital representation – digital audio. In digital audio, the sound wave of the audio signal is encoded as a sequence of numerical samples taken at regular intervals. 

For example, a 1-second audio fragment sampled 44,100 times per second with a 16-bit sample depth can be represented as a vector of values ranging from -1 to 1:

[0.324, -0.234, 0.6534, -0.76234, -0.13534, …], where the number of samples n = 44,100

Graphically, this can be represented as a waveform. The graph displays the time on the horizontal (X) axis and the amplitude on the vertical (Y) axis.

Visualization of a sound signal in waveform

Normally, digital audio is stored in various audio formats. There are three major groups of audio file formats:

  • Uncompressed audio formats, such as WAV, AIFF, AU.

  • Formats with lossless compression, such as FLAC, Monkey's Audio, WavPack.

  • Formats with lossy compression, such as Opus, MP3, Vorbis, Musepack, AAC.

The most common format is MP3, but for sound recognition AI training, it’s best to use WAV. Uncompressed audio formats preserve the sound signal in its pure form without any loss, which increases the accuracy of recognition. 
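For illustration, here is a minimal sketch of reading an uncompressed WAV file into a vector of samples with TensorFlow; the file name is a placeholder.

```python
import tensorflow as tf

# A minimal sketch: read an uncompressed WAV file into a waveform tensor.
# 'recording.wav' is a placeholder file name.
audio_binary = tf.io.read_file('recording.wav')
waveform, sample_rate = tf.audio.decode_wav(audio_binary)

print(sample_rate.numpy())   # e.g. 44100 samples per second
print(waveform.shape)        # (number_of_samples, number_of_channels)
print(waveform[:5, 0])       # float32 sample values in the range [-1, 1]
```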

Why is machine learning needed for sound recognition?

It is possible to design a sound recognition app without the use of audio machine learning. Using ML is cheaper and more convenient though. Why?

With traditional programming, the sound recognition process would look like this. 

 

Traditional programming pipeline

 

To correctly identify sounds, the program should know the rules of categorizing data. 

However, it’s hard to devise exact rules that determine how the app should categorize signals. Most likely, it will be a combination of duration, frequency, tone, and a host of other factors. Programming and testing all these factors in a traditional approach would require writing a lot of code. This costs time and money. But if you flip this diagram and make the machine look for the rules itself, given the correct answers, everything changes dramatically.

AI Machine learning development pipeline

In a machine learning process, you feed the machine sound files together with their correct categories. The machine learns how each type of signal sounds (a dog bark, a bird song, etc.), derives the classification rules itself, and can then apply them to recognize new signals of the same type.

Audio machine learning is a very effective tool for such tasks, typically reaching a recognition accuracy of around 80-90%. 

How to use AI to build sound analysis software

Here are the 5 main steps you need to take to prepare an AI model for sound recognition tasks. 

  • Data collection – collecting specific audio data and storing it in standard file formats.

  • Data labeling (data annotation) – preparing the audio data for machine learning.

  • Data transformation – extracting audio features from the prepared audio data.

  • Training – training the selected machine learning model on the audio features.

  • Data validation – validating the trained model on test sets of data. 

Let’s discuss each in more detail. 

Step 1. Data collection.

First, you need to gather data on which the neural network model will be trained. In the case of audio recognition, this is a set of sound files containing the sounds the machine will need to recognize. This set will be different for each task. There are several ways to get the files for model training. 

Free data sources

You can use publicly available audio datasets, such as those distributed through TensorFlow Datasets, free of charge.

Commercial datasets

You can use recordings made by commercial providers such as prosoundeffects. They are well-structured and of high quality, but you need to pay for them. 

Own data sources

In some cases, you can’t find the necessary data online. You can always make your own recordings or order them from relevant companies.  

Step 2. Data labeling

Data labeling (or data annotation) is the process of identifying raw data (sounds) and providing a meaningful label so that the machine learning model can learn what predictions it is expected to make. Simply put, this is the process of putting specific sound signals into appropriate categories. For example, a fragment of a bird song is labeled as belonging to a nightingale, and this information is provided to the machine for learning. 

Creating a dataset from scratch is a tedious process because data labeling is mostly done by humans. There are ways to speed up the process by simplifying task interfaces and other means, but it still requires quite a lot of time and funds. Therefore, it’s recommended to use already existing datasets if possible. 

Technically, data labeling can be done in many ways. For example, you can assign a unique identifier to each file and create a table mapping files to categories. You can also place files in a directory and use the directory name as the label. For now, no standardized process has been devised, but groups like the Croissant Working Group are working on it. 
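As a small illustration of the “directory as label” approach, here is a sketch in Python; the dataset/ layout (e.g. dataset/nightingale/001.wav) is a hypothetical example.

```python
import pathlib

# A small sketch of "directory as label": the parent folder name of each audio file
# is treated as its category. The dataset/ layout here is hypothetical.
data_dir = pathlib.Path('dataset')
labeled_files = [(str(path), path.parent.name) for path in data_dir.glob('*/*.wav')]

print(labeled_files[:3])
# e.g. [('dataset/nightingale/001.wav', 'nightingale'), ...]
```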

Step 3. Data transformation

During the data transformation process, you need to extract audio features from prepared audio data. 

Audio features are pieces of meaningful information extracted from audio signals to produce a description of an audio file that a machine can process. This means that you need to transform the signal into a form suitable for machine learning. 

There are three most widespread types of audio features for machine learning:

  • Time domain representation – waveforms.
  • Frequency domain representation – spectrum plots.
  • Time and frequency domain representation – spectrograms.

Time domain representation refers to the digitized audio signal. The digitized audio signal is the closest to the physical reality of sound – it is a graphical representation of a sound wave as it moves through a medium over time. Most audio signals are stored in this format. 

Frequency domain representation shows how the signal is distributed across frequency bands over the range of frequencies. It is a plot in which the x-axis represents the vibration frequency and the y-axis represents the amplitude of each of the signal's frequency components. 

To convert waveforms into a spectrum plot, you need to use the Fourier transform. In digital sound processing, the Fast Fourier Transform (FFT) is used most often. 
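As a quick illustration, here is a sketch of moving from the time domain to the frequency domain with the FFT, using a synthetic 440 Hz tone and NumPy; the sampling rate is an arbitrary assumption.

```python
import numpy as np

# A minimal sketch: compute the frequency spectrum of a synthetic 440 Hz tone.
sample_rate = 8000                                   # samples per second (assumed)
t = np.linspace(0, 1, sample_rate, endpoint=False)   # 1 second of time stamps
waveform = np.sin(2 * np.pi * 440 * t)               # time-domain signal

spectrum = np.fft.rfft(waveform)                                  # FFT for a real-valued signal
frequencies = np.fft.rfftfreq(len(waveform), d=1 / sample_rate)
magnitudes = np.abs(spectrum)                                     # amplitude per frequency bin

print(frequencies[np.argmax(magnitudes)])   # ~440.0 Hz, the dominant frequency
```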

Time and frequency domain representation combines both time and frequency components and uses various types of spectrograms as a visual representation of a sound. To get a spectrogram of a sound, the Short-Time Fourier Transform (STFT) is used. 

 

Visualization of sound as a waveform, spectrum plot, and spectrogram

The most common way of presenting an audio signal as a feature for machine learning is a spectrogram. 
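A sketch of turning a waveform into such a spectrogram with TensorFlow's STFT might look like this; the frame sizes are illustrative choices, not fixed requirements.

```python
import tensorflow as tf

def waveform_to_spectrogram(waveform):
    # Split the signal into short overlapping frames and apply an FFT to each frame.
    stft = tf.signal.stft(waveform, frame_length=255, frame_step=128)
    spectrogram = tf.abs(stft)          # keep only the magnitudes
    # Add a channel dimension so the spectrogram can be fed to convolutional layers.
    return spectrogram[..., tf.newaxis]

# Example: a random 1-second waveform at 8 kHz produces a 2-D time-frequency image.
spectrogram = waveform_to_spectrogram(tf.random.uniform([8000], -1.0, 1.0))
print(spectrogram.shape)   # (time_frames, frequency_bins, 1)
```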

Step 4. Training

Now, you need to choose the neural network and train it. Two types of neural networks are commonly used for sound recognition tasks: 

Convolutional Neural Networks. These networks are used for classification and object recognition tasks. CNNs use principles from linear algebra, such as matrix multiplication, to identify patterns in an image at scale. Because a spectrogram is essentially an image of a sound, convolutional networks are very effective for audio classification as well as image processing. 

Long short-term memory networks. Unlike standard neural networks, these networks have feedback connections, which enables them to learn long-term dependencies in data. LSTM models have memory cells – containers that hold information for a long period. LSTM networks are well-suited for speech recognition, translation, and sound recognition tasks. 

Step 5. Validation. 

At this stage, you need to check the ability of the trained model to recognize and classify the audio signal. That is, you need to understand whether the model is able to correctly recognize this or that signal according to its data label. 

Here is an example of how you can train a model to recognize spoken digits. 

AI sound recognition implementation example

Step 1. Data collection. 

For training, we will use a dataset from TensorFlow Datasets. It’s a simple audio/speech dataset consisting of recordings of spoken digits in WAV files at 8 kHz. The recordings are trimmed so that they have near minimal silence at the beginnings and ends (5 speakers, 2,500 recordings, 50 of each digit per speaker, English pronunciations).
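A minimal sketch of loading this dataset, assuming the tensorflow-datasets package is installed and the dataset exposes (audio, label) pairs:

```python
import tensorflow_datasets as tfds

# A minimal sketch: load the spoken digit dataset from TensorFlow Datasets.
ds, info = tfds.load('spoken_digit', split='train', with_info=True, as_supervised=True)

print(info.features)           # raw audio samples plus an integer label 0-9
for audio, label in ds.take(1):
    print(audio.shape, label)  # one recording and the digit it contains
```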

Step 2. Data labeling

At this stage, you need to place each file from the dataset in a directory with the appropriate label. In the case of spoken digits, these are the directories dataset/0, dataset/1, …, dataset/9. 
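Once the files are arranged this way, the label of each clip can be inferred from its directory name. A sketch using tf.keras.utils.audio_dataset_from_directory (available in recent TensorFlow versions; batch size and split are assumptions) might look like this:

```python
import tensorflow as tf

# A sketch, assuming WAV files are already sorted into dataset/0 ... dataset/9:
# the label of each clip is inferred from its subdirectory name.
train_ds, val_ds = tf.keras.utils.audio_dataset_from_directory(
    directory='dataset',
    batch_size=64,
    validation_split=0.2,
    subset='both',
    seed=0,
    output_sequence_length=8000)   # pad or truncate each clip to 1 second at 8 kHz

print(train_ds.class_names)        # ['0', '1', ..., '9']
```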

Step 3. Data transformation. 

At this stage, you need to convert each element from the dataset into audio features for machine learning. You need to convert the elements into time and frequency domain spectrograms. To get a spectrogram, each element goes through the following stages:

  1. Get waveform from audio file (more here)
  2. Trim the noise from the start and end of the waveform (more here)
  3. Form spectrogram from a waveform (more here)

As a result, you will get recorded sound signals converted into spectrograms.
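The silence-trimming step can be sketched with a simple amplitude threshold; the threshold value is an assumption:

```python
import tensorflow as tf

# A rough sketch of trimming silence, assuming a mono float32 waveform in [-1, 1].
def trim_silence(waveform, threshold=0.01):
    loud = tf.where(tf.abs(waveform) > threshold)[:, 0]   # indices of non-silent samples
    return waveform[loud[0]:loud[-1] + 1]                 # drop the quiet start and end

# The trimmed waveform is then converted to a spectrogram (see the STFT sketch above).
```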

 

Transformation of a waveform into a spectrogram

Step 4. Training

Now, you need to “feed” the spectrograms to the model. For this, you can use Keras, the high-level API of the TensorFlow platform. The model is formed with tf.keras.Sequential by stacking layers through which all the data samples pass (more here). The main layers are convolutional and pooling layers followed by dense classification layers, as in the sketch below. 
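A minimal sketch of such a model, loosely following the structure of TensorFlow's audio recognition tutorial; the input shape and layer sizes are illustrative assumptions:

```python
import tensorflow as tf

# A minimal sketch of a CNN classifier for spectrograms; the input shape and layer
# sizes are illustrative assumptions, not prescribed values.
num_classes = 10   # digits 0-9

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(61, 129, 1)),    # (time_frames, frequency_bins, channels)
    tf.keras.layers.Conv2D(32, 3, activation='relu'),
    tf.keras.layers.Conv2D(64, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(num_classes),           # one score per digit
])

model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])

# Training is then a single call on the spectrogram dataset, e.g.:
# model.fit(train_spectrogram_ds, validation_data=val_spectrogram_ds, epochs=10)
```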

The last step is to export the model and save it to the file system for further use (more here), as sketched below. Thanks to TensorFlow, the trained model can be used on any mobile, web, desktop, or server platform.
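Exporting might look like this, assuming model is the trained Keras model from the previous sketch and the path is a placeholder:

```python
import tensorflow as tf

# Save the trained model to the file system (the .keras format requires a recent
# TensorFlow/Keras version; the path is a placeholder)...
model.save('saved_models/spoken_digit.keras')

# ...and load it back later on any platform that runs TensorFlow.
reloaded = tf.keras.models.load_model('saved_models/spoken_digit.keras')
```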

Step 5. Data validation

Finally, you can check the recognition accuracy. To do this, you will need to take several recordings of spoken digits and generate prediction results for each of them.
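A sketch of such a check, assuming model is the trained model and test_ds is a held-out dataset of (spectrogram, label) batches:

```python
import numpy as np
import tensorflow as tf

# Measure accuracy on held-out data; `model` and `test_ds` come from the previous steps.
accuracy = tf.keras.metrics.SparseCategoricalAccuracy()
for spectrograms, labels in test_ds:
    logits = model(spectrograms, training=False)
    accuracy.update_state(labels, logits)
print('Test accuracy:', float(accuracy.result()))

# For a single recording, the digit with the highest score is the prediction.
probabilities = tf.nn.softmax(model(spectrograms[:1]), axis=-1)
print('Predicted digit:', int(np.argmax(probabilities)))
```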

 

Results of audio recognition by the AI model

As you can see, the model is able to recognize each recording and assign it to the corresponding digit quite accurately. The figure shows the probability of a recording belonging to one or another label (digit). In the same way, the model can be used to process full audio recordings or for real-time audio processing.

Here, for example, is an application of the model to a recording that consists of three spoken digits. Each recognized digit is marked with a different color; the unrecognized parts are marked in gray. 

Results of recognition of a more complex recording

Conclusion

AI sound recognition is a versatile technology that is finding more and more applications in consumer and business markets. Building an AI sound recognition model involves data collection, labeling, and transformation. After that, the data is fed to a convolutional or long short-term memory network, which effectively learns the rules for identifying sound fragments.

At Apiko, we provide AI development services for a variety of industries, including construction, real estate, and others. If you have a sound recognition project in mind, don’t hesitate to reach out to our experts.