Sound Recognition: How to Make Your AI Model Actually Work?
Nowadays, there are numerous examples of how to create a simple artificial neural network model for sound recognition, e.g. Simple audio recognition: Recognizing keywords. However, the main weakness of such AI models is that they only interpolate: they work well strictly within the bounds of the task they were trained on. So, when you follow step-by-step tutorial instructions and apply them to your particular business problem, the model's behaviour often turns out to be totally unpredictable. What to do in this situation? How do you make the AI model actually work?
In this article, I'd like to share a few pieces of advice based on a real-life project. It consisted of building AI sound recognition software to detect the sound of a graffiti spray can being shaken in a CCTV camera stream.
We can define the three main challenges of AI model training:
- the dataset problem
- the problem of choosing the right audio features
- the problem of artificial neural network (ANN) training itself.
Problem with the dataset
The first and the main reason why the neural network model may work incorrectly is the data used for its training. There are a few problems related to data.
Problem of the recording environment
Often, the model's training dataset is recorded in a soundproof studio environment, while the AI model is expected to work outdoors. As a result, noise and extraneous sounds considerably affect the model's accuracy.
For instance, even the records you’ve made in the backyard will differ greatly from those made by the highway.
That's why, if possible, the training data should be recorded in conditions similar to the ones where you will use the model. If there's no such opportunity, you can try noise reduction or low-pass, high-pass, or band-pass filtering techniques for environmental noise suppression.
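As a minimal sketch of the filtering approach (assuming a SciPy environment; the 300 Hz cutoff and filter order are illustrative values, not taken from the project), a Butterworth high-pass filter can suppress low-frequency environmental rumble while leaving the informative band intact:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def highpass(signal, sr, cutoff_hz=300, order=5):
    """Suppress low-frequency environmental rumble below cutoff_hz."""
    sos = butter(order, cutoff_hz, btype="highpass", fs=sr, output="sos")
    return sosfiltfilt(sos, signal)  # zero-phase filtering, no time shift

# Example: a 50 Hz hum mixed with a 2 kHz tone; the filter keeps the tone.
sr = 8000
t = np.arange(sr) / sr
hum = np.sin(2 * np.pi * 50 * t)
tone = np.sin(2 * np.pi * 2000 * t)
clean = highpass(hum + tone, sr)
```

The same `butter` call with `btype="bandpass"` and a `(low, high)` cutoff pair gives a band-pass variant when both rumble and high-frequency hiss need to go.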
Sample rate problem
You should make sure that the audio signal parameters and recording settings, such as sample rate, sound channel configuration, and bit depth, correspond to the signal parameters that the model will have to work with in real life.
The sample rate is particularly important. Why?
Nyquist frequency = sample rate / 2
This means that at a standard sample rate of 44,100 Hz, sounds with frequencies up to 22.05 kHz can be recorded. At a rate of 8,000 Hz, you can record only signals up to 4 kHz. The figure below shows the same high-frequency signal recorded at different sample rates: 44,100 Hz and 8,000 Hz respectively.
As you can see, when the model has to work with high-frequency sounds, decreasing the sample rate may lead to data losses.
Here's a real-life example. Before developing the graffiti sound detection system, we had recorded a training dataset at a 44,100 Hz sample rate. However, during system implementation, we found out that the CCTV cameras provided streams at an 8,000 Hz sample rate.
As a result, the model exhibited 90+% accuracy when working with test data, but its operation in real-life conditions was rather incorrect. Only after the data set was downsampled to 8000 Hz, did the model start working properly.
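The downsampling step can be sketched like this (a hypothetical example using SciPy's `resample_poly`, which applies an anti-aliasing filter during resampling; `librosa.resample` offers the same for librosa-based pipelines):

```python
import numpy as np
from scipy.signal import resample_poly

def downsample(signal, orig_sr=44100, target_sr=8000):
    """Resample with an anti-aliasing polyphase filter (e.g. 44100 -> 8000 Hz)."""
    g = np.gcd(orig_sr, target_sr)
    return resample_poly(signal, target_sr // g, orig_sr // g)

# One second of a 1 kHz tone recorded at 44,100 Hz...
sr = 44100
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 1000 * t)
y = downsample(x)  # ...becomes 8,000 samples per second
```

Running the whole training set through the same function before feature extraction keeps training and inference conditions aligned.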
Defining the duration of informative signal fragments
Depending on the audio detection task, the length of the sound record fragment (its duration) may significantly influence the model’s operation accuracy. It is so because every specific sound has its own duration. For example, let’s take a look at the fragment of a sound signal waveform lasting for 350 ms.
About half of this record is uninformative noise, while the informative signal components last approximately 50 ms each.
Noise inside a signal can affect the accuracy of the model's operation. So, to capture a particular sound, it's best to measure the duration of a single informative fragment, from its very beginning to its end, and use that as the fragment length.
❗️Note: When you programmatically split the signal into fragments of a particular length, make sure all the fragments have been separated correctly. The informative components must be captured from beginning to end, not broken up into pieces or cut off.
Often, during such a process, noise-only fragments get into the dataset and reduce the model's accuracy. The more accurate the dataset is, the more accurate the prediction (sound detection) will be.
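A minimal, illustrative way to cut out only the informative fragments is to threshold per-frame RMS energy; the frame size, threshold, and minimum duration below are hypothetical values you would tune per task:

```python
import numpy as np

def extract_fragments(signal, sr, frame_ms=10, threshold=0.1, min_ms=30):
    """Keep only contiguous runs of high-energy frames as fragments.
    frame_ms, threshold, and min_ms are illustrative values to tune per task."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))       # energy per frame
    active = rms > threshold * rms.max()              # frames above the noise floor
    fragments, start = [], None
    for i, a in enumerate(np.append(active, False)):  # trailing False closes a run
        if a and start is None:
            start = i
        elif not a and start is not None:
            if (i - start) * frame_ms >= min_ms:      # drop too-short blips
                fragments.append(signal[start * frame_len : i * frame_len])
            start = None
    return fragments

# Example: 0.25 s of silence, a 0.125 s tone burst, then silence again.
sr = 8000
sig = np.zeros(sr)
sig[2000:3000] = np.sin(2 * np.pi * 440 * np.arange(1000) / sr)
fragments = extract_fragments(sig, sr)
```

Whatever segmentation you use, it is worth spot-checking the extracted fragments by ear or on a waveform plot before they go into the dataset.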
Lack of variety in the training dataset
If the AI model demonstrates good results on the test dataset, but in real life it often mixes up seemingly simple fragments, this may signal a lack of variety in the training dataset: limited diversity in training data leads to poor generalization.
It means that the dataset is not diverse enough for that particular task. And it's not about the quantity of the fragments, but about their variety.
For image recognition tasks, this problem is often solved with data augmentation: it “takes the approach of generating additional training data from your existing examples by augmenting them using random transformations that yield believable-looking images.” The current dataset gets extended using different transformations (resize, rotate, etc) of existing images.
For sound recognition tasks, there are more suitable approaches:
- Time shift: shifting the audio fragment by a random amount of time (moving the sound waveform to the left or to the right along the time axis)
- Time stretch: stretching an audio series in time by a fixed rate
- Pitch shifting is a sound recording technique in which the original pitch of a sound is raised or lowered
- Loudness normalization results in changing the signal loudness, making it lower or higher by a constant value throughout the signal
- Adding background noise.
Applying data augmentation allows you to increase the variety of the dataset when it's impossible to do so with real recordings.
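The simpler of these augmentations can be sketched with plain NumPy (the parameter values are illustrative; for time stretch and pitch shifting, `librosa.effects.time_stretch` and `librosa.effects.pitch_shift` are ready-made options):

```python
import numpy as np

rng = np.random.default_rng(0)

def time_shift(signal, max_shift):
    """Shift the waveform left or right by a random number of samples."""
    return np.roll(signal, rng.integers(-max_shift, max_shift + 1))

def change_loudness(signal, db):
    """Scale the whole signal up or down by a constant gain in dB."""
    return signal * 10 ** (db / 20)

def add_noise(signal, snr_db):
    """Mix in white noise at a given signal-to-noise ratio."""
    noise = rng.standard_normal(len(signal))
    scale = np.sqrt(np.mean(signal ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return signal + scale * noise

# Example: a 440 Hz test tone, shifted, quieted by 3 dB, and noised at 20 dB SNR.
x = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
augmented = add_noise(change_loudness(time_shift(x, 800), -3), snr_db=20)
```

Applying a random combination of these transforms to each original recording can multiply the effective size and variety of the dataset.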
The problem of choosing the right audio features
Why are audio features important? They help AI models distinguish one sound from another. Using an appropriate audio feature for the particular task can increase the model's efficiency considerably.
We can analyze audio features in time and frequency domains.
In the time domain, the audio signal is represented by its amplitude as a function of time. The first and simplest representation is the waveform. Below you can see raw audio - 1-dimensional arrays representing the normalized amplitude of the can-shake and spray sound signals over time, respectively.
Machine learning models can use raw audio data as a dataset for training. In some cases they demonstrate rather good results.
These features are very simple to calculate (here and below you will find respective Python code snippets for audio feature calculation).
# Amplitude envelope: maximum absolute amplitude in each frame
# (frame_length and hop_length are typical example values)
frames = librosa.util.frame(signal, frame_length=1024, hop_length=512)
np.max(np.abs(frames), axis=0)
# Root-mean-square (RMS) energy of each frame
np.sqrt(np.mean(frames ** 2, axis=0))
# Zero-crossing rate: how often the waveform changes sign
librosa.feature.zero_crossing_rate(signal)
Frequency-domain features describe an audio signal in terms of its frequency content rather than its variation over time. They represent how the signal’s energy or amplitude is distributed across different frequencies.
One of the simplest frequency-domain representations is the spectrum, typically obtained using the Fast Fourier Transform (FFT). It shows how the signal’s amplitude or power is distributed across different frequency components within a specified frequency range.
# Magnitude spectrum of the whole signal
np.abs(np.fft.fft(signal))
The spectrogram is one of the most commonly used audio features in audio-related machine learning tasks. It shows how the signal's spectrum evolves over time and is computed from the short-time Fourier transform (STFT).
# Power spectrogram from the short-time Fourier transform (STFT)
np.abs(librosa.stft(signal)) ** 2
In a mel spectrogram, the frequency axis is transformed to the mel scale to approximate human auditory perception.
The mel scale is a scale of pitches judged by listeners to be equally distant from one another, regardless of the actual frequency in Hz.
Mapping frequencies to a logarithmic scale that emphasizes lower frequencies provides improved resolution for audio analysis, making it essential for machine learning in speech recognition and music classification.
In some cases, it works much better than a spectrogram. A fundamental parameter here is the number of mel bands, which depends on the problem. Usually it ranges from 40 to 128.
# Mel spectrogram with n_mels mel bands
librosa.feature.melspectrogram(y=signal, n_mels=n_mels)
Mel-Frequency Cepstral Coefficients (MFCCs) are compact features derived from the log Mel power spectrum of an audio signal. They are computed by applying a discrete cosine transform (DCT) to the logarithm of Mel-filtered spectral energies, resulting in a decorrelated representation of the spectral envelope.
MFCCs are widely used in speech recognition, speaker identification, and music classification because they capture perceptually relevant spectral characteristics of sound.
# First n_mfcc Mel-frequency cepstral coefficients
librosa.feature.mfcc(y=signal, n_mfcc=n_mfcc)
There are many more audio features, such as band energy ratio, spectral centroid, and bandwidth, among others. The main idea here is to choose the right features that are the most informative and useful for the particular task.
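As an illustration, the spectral centroid (the amplitude-weighted mean frequency of the spectrum) takes just a few lines of NumPy, and for a pure tone it lands at that tone's frequency:

```python
import numpy as np

def spectral_centroid(signal, sr):
    """Amplitude-weighted mean frequency of the magnitude spectrum."""
    mag = np.abs(np.fft.rfft(signal))                # magnitude per frequency bin
    freqs = np.fft.rfftfreq(len(signal), d=1 / sr)   # bin center frequencies in Hz
    return np.sum(freqs * mag) / np.sum(mag)

# A pure 1 kHz tone has its centroid at 1 kHz.
sr = 8000
t = np.arange(sr) / sr
centroid = spectral_centroid(np.sin(2 * np.pi * 1000 * t), sr)
```

`librosa.feature.spectral_centroid` computes the same quantity per STFT frame rather than for the whole signal.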
Problem of AI model training
The first problem you should pay attention to when training the model is underfitting / overfitting.
Model underfitting occurs when the model performs poorly on both training and validation datasets. This means that the model is unable to capture the relationship between the input and output variables accurately. There are a few things you can do about it:
- Increase the variety of the data set (the above-mentioned lack of variety problem)
- Increase the complexity of the model (number of layers, number of neurons in a layer, etc.)
- Increase the number of epochs of training
- Decrease regularization.
Model overfitting occurs when a model performs well on training datasets but poorly on validation datasets. This scenario is somewhat harder to track. There are a few things you can do here:
- Simplify the model: an overly complex structure with a large number of layers and neurons may simply memorize the training data. So, it's better to start with 1 layer consisting of fewer neurons (e.g. 32 or 64), and gradually increase these numbers if needed.
- Dropout: randomly deactivate neurons during ANN training.
- Early stopping: halt the training process before the model begins to learn the noise in the data.
- Regularization: methods like L1 and L2 regularization penalize large weights, simplifying the model.
- Data augmentation: increase the variety of the data set.
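Early stopping is simple enough to sketch in a framework-agnostic way (in Keras it ships ready-made as the EarlyStopping callback); the patience and min_delta values below are illustrative:

```python
class EarlyStopping:
    """Stop training when validation loss hasn't improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=1e-4):
        self.patience, self.min_delta = patience, min_delta
        self.best, self.bad_epochs = float("inf"), 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best, self.bad_epochs = val_loss, 0  # improvement: reset counter
        else:
            self.bad_epochs += 1                      # no improvement this epoch
        return self.bad_epochs >= self.patience

# Example: validation loss plateaus after epoch 3, so training halts early.
stopper = EarlyStopping(patience=3)
losses = [1.0, 0.8, 0.7, 0.69, 0.695, 0.7, 0.71]
stopped_at = next(i for i, l in enumerate(losses) if stopper.should_stop(l))
```

In practice you would also checkpoint the weights from the best epoch and restore them after stopping, which is what Keras's `restore_best_weights=True` option does.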
Proof of the absence of underfitting or overfitting is high accuracy and low loss on both the training and validation sets. Moreover, the loss and accuracy curves for training and validation should be close to each other, like in the figure below.
When the model doesn’t have above-mentioned problems, but its accuracy is still low, try to change the type of the artificial neural network. There are a few ANN types that work well for sound recognition tasks:
- Convolutional Neural Network (CNN)
- Recurrent Neural Networks (RNN)
- Long Short-Term Memory (LSTM)
- Gated Recurrent Unit (GRU)
- Autoencoders
- Transformer
- Mamba
Different architectures fit a particular task differently. So, it's better to try a few of them and determine which one shows the best results.
Pre-trained models
One more way to improve the model's efficiency is to use a pre-trained model as a basis for your own. It can be useful when the model mixes up random sounds with target labels. Kaggle offers a large database of pre-trained models. For instance, YAMNet predicts 521 audio event classes and may considerably extend your model's knowledge base.
Conclusion
Unfortunately, there's no unified step-by-step tutorial on how to design an artificial neural network for sound recognition. Each task is unique and requires a combination of techniques to achieve an optimal result. Still, there are general best practices that can significantly increase the ANN model's efficiency. AI model design is more than a search for optimal parameters: it is a series of compromises on the way to the desired result.
Don’t hesitate to reach out to Apiko AI experts - we’ll be more than eager to share our knowledge and see to your project.