Sound Recognition: How to Make Your AI Model Actually Work?
Nowadays, there are numerous examples of how to create a simple artificial neural network model for sound recognition, e.g. Simple audio recognition: Recognizing keywords. However, the main weakness of such AI models is that they only interpolate: they work well strictly within the bounds of the task they were trained on. So, when you follow step-by-step tutorial instructions and apply them to your particular business problem, the model's behaviour often turns out to be totally unpredictable. What to do in this situation? How do you make the AI model actually work?
In this article, I'd like to share a few pieces of advice based on a real-life project. It consisted of building AI sound recognition software to detect the sound of a graffiti spray can being shaken in a CCTV camera stream.
We can define the three main challenges of AI model training:
- the dataset problem
- the problem of choosing the right audio features
- the problem of artificial neural network (ANN) training itself.
Problem with the dataset
The first and the main reason why the neural network model may work incorrectly is the data used for its training. There are a few problems related to data.
Problem of the recording environment
Often, the model's training dataset is recorded in a soundproof studio environment, while the AI model is expected to work outdoors. As a result, noise and extraneous sounds considerably affect the model's accuracy.
For instance, even the records you’ve made in the backyard will differ greatly from those made by the highway.
That's why, if possible, the training data should be recorded in conditions similar to the ones where you will use the model. If there's no such opportunity, you can try noise reduction or low-pass, high-pass, or band-pass filtering techniques for environmental noise suppression.
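As a minimal sketch of the filtering approach (assuming a SciPy environment; the 300 Hz cutoff and filter order are illustrative values, not taken from the project), a Butterworth high-pass filter can suppress low-frequency environmental rumble while leaving the informative band intact:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def highpass(signal, sr, cutoff_hz=300, order=5):
    """Suppress low-frequency environmental rumble below cutoff_hz."""
    sos = butter(order, cutoff_hz, btype="highpass", fs=sr, output="sos")
    return sosfiltfilt(sos, signal)  # zero-phase filtering, no time shift

# Example: a 50 Hz hum mixed with a 2 kHz tone; the filter keeps the tone.
sr = 8000
t = np.arange(sr) / sr
hum = np.sin(2 * np.pi * 50 * t)
tone = np.sin(2 * np.pi * 2000 * t)
clean = highpass(hum + tone, sr)
```

The same `butter` call with `btype="bandpass"` and a `(low, high)` cutoff pair gives a band-pass variant when both rumble and high-frequency hiss need to go.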
Sample rate problem
You should make sure that the audio signal parameters and recording settings, such as sample rate, sound channel configuration, and bit depth, correspond to the signal parameters that the model will have to work with in real life.
The sample rate is particularly important. Why?
Nyquist frequency = sample rate / 2
This means that at a standard sample rate of 44,100 Hz, sounds with frequencies up to 22.05 kHz can be recorded. At a rate of 8,000 Hz, you can record only signals up to 4 kHz. The figure below shows the same high-frequency signal recorded at different sample rates: 44,100 Hz and 8,000 Hz respectively.
As you can see, when the model has to work with high-frequency sounds, decreasing the sample rate may lead to data losses.
Here's a real-life example. Before developing the graffiti sound detection system, we had recorded a training dataset at a 44,100 Hz sample rate. However, during system implementation, we found out that the CCTV cameras provided streams at an 8,000 Hz sample rate.
As a result, the model exhibited 90+% accuracy when working with test data, but its operation in real-life conditions was rather incorrect. Only after the data set was downsampled to 8000 Hz, did the model start working properly.
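The downsampling step can be sketched like this (a hypothetical example using SciPy's `resample_poly`, which applies an anti-aliasing filter during resampling; `librosa.resample` offers the same for librosa-based pipelines):

```python
import numpy as np
from scipy.signal import resample_poly

def downsample(signal, orig_sr=44100, target_sr=8000):
    """Resample with an anti-aliasing polyphase filter (e.g. 44100 -> 8000 Hz)."""
    g = np.gcd(orig_sr, target_sr)
    return resample_poly(signal, target_sr // g, orig_sr // g)

# One second of a 1 kHz tone recorded at 44,100 Hz...
sr = 44100
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 1000 * t)
y = downsample(x)  # ...becomes 8,000 samples per second
```

Running the whole training set through the same function before feature extraction keeps training and inference conditions aligned.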
Defining the duration of informative signal fragments
Depending on the audio detection task, the length of the sound record fragment (its duration) may significantly influence the model’s operation accuracy. It is so because every specific sound has its own duration. For example, let’s take a look at the fragment of a sound signal waveform lasting for 350 ms.
About half of this record is uninformative noise, while the informative signal components last approximately 50 ms each.
Noise inside a signal can affect the accuracy of the model's operation. So, to capture a particular sound, it's best to measure the duration of a single informative fragment, from its very beginning to its end, and use that as the fragment length.
❗️Note: When you programmatically split the signal into fragments of a particular length, make sure all the fragments have been separated correctly. The informative components must be captured from beginning to end, not broken up into pieces or cut off.
Often, during such a process, noise-only fragments get into the dataset and reduce the model's accuracy. The more accurate the dataset is, the more accurate the prediction (sound detection) will be.
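A minimal, illustrative way to cut out only the informative fragments is to threshold per-frame RMS energy; the frame size, threshold, and minimum duration below are hypothetical values you would tune per task:

```python
import numpy as np

def extract_fragments(signal, sr, frame_ms=10, threshold=0.1, min_ms=30):
    """Keep only contiguous runs of high-energy frames as fragments.
    frame_ms, threshold, and min_ms are illustrative values to tune per task."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))       # energy per frame
    active = rms > threshold * rms.max()              # frames above the noise floor
    fragments, start = [], None
    for i, a in enumerate(np.append(active, False)):  # trailing False closes a run
        if a and start is None:
            start = i
        elif not a and start is not None:
            if (i - start) * frame_ms >= min_ms:      # drop too-short blips
                fragments.append(signal[start * frame_len : i * frame_len])
            start = None
    return fragments

# Example: 0.25 s of silence, a 0.125 s tone burst, then silence again.
sr = 8000
sig = np.zeros(sr)
sig[2000:3000] = np.sin(2 * np.pi * 440 * np.arange(1000) / sr)
fragments = extract_fragments(sig, sr)
```

Whatever segmentation you use, it is worth spot-checking the extracted fragments by ear or on a waveform plot before they go into the dataset.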
Lack of variety in the training dataset
If the AI model demonstrates good results on the test dataset, but in real life it often mixes up seemingly simple fragments, this may signal a lack of variety in the training dataset: limited diversity in training data leads to poor generalization.
It means that the dataset is not diverse enough for that particular task. And it's not about the quantity of the fragments, but about their variety.
For image recognition tasks, this problem is often solved with data augmentation: it “takes the approach of generating additional training data from your existing examples by augmenting them using random transformations that yield believable-looking images.” The current dataset gets extended using different transformations (resize, rotate, etc) of existing images.
For sound recognition tasks, there are more suitable approaches:
- Time shift: shifting the audio fragment by a random amount of time (moving the sound waveform to the left or to the right along the time axis)
- Time stretch: stretching an audio series in time by a fixed rate
- Pitch shifting is a sound recording technique in which the original pitch of a sound is raised or lowered
- Loudness normalization results in changing the signal loudness, making it lower or higher by a constant value throughout the signal
- Adding background noise.
Applying data augmentation allows you to increase the variety of the dataset when it's impossible to do so with real recordings.
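The simpler of these augmentations can be sketched with plain NumPy (the parameter values are illustrative; for time stretch and pitch shifting, `librosa.effects.time_stretch` and `librosa.effects.pitch_shift` are ready-made options):

```python
import numpy as np

rng = np.random.default_rng(0)

def time_shift(signal, max_shift):
    """Shift the waveform left or right by a random number of samples."""
    return np.roll(signal, rng.integers(-max_shift, max_shift + 1))

def change_loudness(signal, db):
    """Scale the whole signal up or down by a constant gain in dB."""
    return signal * 10 ** (db / 20)

def add_noise(signal, snr_db):
    """Mix in white noise at a given signal-to-noise ratio."""
    noise = rng.standard_normal(len(signal))
    scale = np.sqrt(np.mean(signal ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return signal + scale * noise

# Example: a 440 Hz test tone, shifted, quieted by 3 dB, and noised at 20 dB SNR.
x = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
augmented = add_noise(change_loudness(time_shift(x, 800), -3), snr_db=20)
```

Applying a random combination of these transforms to each original recording can multiply the effective size and variety of the dataset.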
The problem of choosing the right audio features
Why are audio features important? They help AI models distinguish one sound from another. Using an appropriate audio feature for the particular task can increase the model's efficiency considerably.
We can analyze audio features in time and frequency domains.
In the time domain, the audio signal is represented by its amplitude as a function of time. The first and simplest representation is the waveform. Below you can see raw audio - 1-dimensional arrays representing the normalized amplitude of the can-shake and spray sound signals over time, respectively.
Machine learning models can use raw audio data as a dataset for training. In some cases they demonstrate rather good results.
These features are very simple to calculate (here and below you will find respective Python code snippets for audio feature calculation).
# Amplitude envelope: maximum absolute amplitude in each frame
# (frame_length and hop_length are typical example values)
frames = librosa.util.frame(signal, frame_length=1024, hop_length=512)
np.max(np.abs(frames), axis=0)
# Root-mean-square (RMS) energy of each frame
np.sqrt(np.mean(frames ** 2, axis=0))
# Zero-crossing rate: how often the waveform changes sign
librosa.feature.zero_crossing_rate(signal)
Frequency-domain features describe an audio signal in terms of its frequency content rather than its variation over time. They represent how the signal’s energy or amplitude is distributed across different frequencies.
One of the simplest frequency-domain representations is the spectrum, typically obtained using the Fast Fourier Transform (FFT). It shows how the signal’s amplitude or power is distributed across different frequency components within a specified frequency range.
# Magnitude spectrum of the whole signal
np.abs(np.fft.fft(signal))
The spectrogram is one of the most commonly used audio features in audio-related machine learning tasks. It shows how the signal's spectrum evolves over time and is computed from the short-time Fourier transform (STFT).
# Power spectrogram from the short-time Fourier transform (STFT)
np.abs(librosa.stft(signal)) ** 2
In a mel spectrogram, the frequency axis is transformed to the mel scale to approximate human auditory perception.
The mel scale is a scale of pitches judged by listeners to be equally distant from one another, regardless of the actual frequency in Hz.
Mapping frequencies to a logarithmic scale that emphasizes lower frequencies provides improved resolution for audio analysis, making it essential for machine learning in speech recognition and music classification.
In some cases, it works much better than a spectrogram. A fundamental parameter here is the number of mel bands, which depends on the problem. Usually it ranges from 40 to 128.
# Mel spectrogram with n_mels mel bands
librosa.feature.melspectrogram(y=signal, n_mels=n_mels)
Mel-Frequency Cepstral Coefficients (MFCCs) are compact features derived from the log Mel power spectrum of an audio signal. They are computed by applying a discrete cosine transform (DCT) to the logarithm of Mel-filtered spectral energies, resulting in a decorrelated representation of the spectral envelope.
MFCCs are widely used in speech recognition, speaker identification, and music classification because they capture perceptually relevant spectral characteristics of sound.
# First n_mfcc Mel-frequency cepstral coefficients
librosa.feature.mfcc(y=signal, n_mfcc=n_mfcc)
There are many more audio features, such as band energy ratio, spectral centroid, and bandwidth, among others. The main idea here is to choose the right features that are the most informative and useful for the particular task.
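As an illustration, the spectral centroid (the amplitude-weighted mean frequency of the spectrum) takes just a few lines of NumPy, and for a pure tone it lands at that tone's frequency:

```python
import numpy as np

def spectral_centroid(signal, sr):
    """Amplitude-weighted mean frequency of the magnitude spectrum."""
    mag = np.abs(np.fft.rfft(signal))                # magnitude per frequency bin
    freqs = np.fft.rfftfreq(len(signal), d=1 / sr)   # bin center frequencies in Hz
    return np.sum(freqs * mag) / np.sum(mag)

# A pure 1 kHz tone has its centroid at 1 kHz.
sr = 8000
t = np.arange(sr) / sr
centroid = spectral_centroid(np.sin(2 * np.pi * 1000 * t), sr)
```

`librosa.feature.spectral_centroid` computes the same quantity per STFT frame rather than for the whole signal.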
Problem of AI model training
The first problem you should pay attention to when training the model is underfitting / overfitting.
Model underfitting occurs when the model performs poorly on both training and validation datasets. This means that the model is unable to capture the relationship between the input and output variables accurately. There are a few things you can do about it:
- Increase the variety of the data set (the above-mentioned lack of variety problem)
- Increase the complexity of the model (number of layers, number of neurons in a layer, etc.)
- Increase the number of epochs of training
- Decrease regularization.
Model overfitting occurs when a model performs well on training datasets but poorly on validation datasets. This scenario is somewhat harder to track. There are a few things you can do here:
- Simplify the model: an overly complex structure with a large number of layers and neurons may simply memorize the training data. So, it's better to start with 1 layer consisting of fewer neurons (e.g. 32 or 64), and gradually increase these numbers if needed.
- Dropout: randomly deactivate neurons during ANN training.
- Early stopping: halt the training process before the model begins to learn the noise in the data.
- Regularization: methods like L1 and L2 regularization penalize large weights, simplifying the model.
- Data augmentation: increase the variety of the data set.
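Early stopping is simple enough to sketch in a framework-agnostic way (in Keras it ships ready-made as the EarlyStopping callback); the patience and min_delta values below are illustrative:

```python
class EarlyStopping:
    """Stop training when validation loss hasn't improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=1e-4):
        self.patience, self.min_delta = patience, min_delta
        self.best, self.bad_epochs = float("inf"), 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best, self.bad_epochs = val_loss, 0  # improvement: reset counter
        else:
            self.bad_epochs += 1                      # no improvement this epoch
        return self.bad_epochs >= self.patience

# Example: validation loss plateaus after epoch 3, so training halts early.
stopper = EarlyStopping(patience=3)
losses = [1.0, 0.8, 0.7, 0.69, 0.695, 0.7, 0.71]
stopped_at = next(i for i, l in enumerate(losses) if stopper.should_stop(l))
```

In practice you would also checkpoint the weights from the best epoch and restore them after stopping, which is what Keras's `restore_best_weights=True` option does.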
Proof of the absence of underfitting or overfitting is high accuracy and low loss on both the training and validation sets. Moreover, the loss and accuracy curves for training and validation should be close to each other, like in the figure below.
When the model doesn’t have above-mentioned problems, but its accuracy is still low, try to change the type of the artificial neural network. There are a few ANN types that work well for sound recognition tasks:
- Convolutional Neural Network (CNN)
- Recurrent Neural Networks (RNN)
- Long Short-Term Memory (LSTM)
- Gated Recurrent Unit (GRU)
- Autoencoders
- Transformer
- Mamba
Different architectures fit a particular task differently. So, it's better to try a few of them and determine which one shows the best results.
Pre-trained models
One more way to improve the model's efficiency is to use a pre-trained model as a basis for your own. It can be useful when the model mixes up random sounds with target labels. Kaggle offers a large database of pre-trained models. For instance, YAMNet predicts 521 audio event classes and may considerably extend your model's knowledge base.
Conclusion
Unfortunately, there's no unified step-by-step tutorial on how to design an artificial neural network for sound recognition. Each task is unique and requires a combination of techniques to achieve an optimal result. Still, there are general best practices that can significantly increase the ANN model's efficiency. AI model design is more than a search for optimal parameters: it is a series of compromises on the way to the desired result.
Don’t hesitate to reach out to Apiko AI experts - we’ll be more than eager to share our knowledge and see to your project.