Is it possible to check the similarity of two recorded sounds, "throwing out" noise and a time offset?

This is, in general, still an open research problem.

That being said, you may be able to do something like training an open-source deep learning model to detect similar sounds. Two models that are good at related tasks are “Spleeter” (which by default separates vocals from accompaniment) and “Whisper” (which does speech recognition).

You might be able to do something by figuring out which kinds of sounds you want to detect, then which frequency ranges are most salient for those sounds, and then using an FFT as part of the detection. Because sounds may play back at slightly different speeds, a matching algorithm might start by computing a spectrogram of the reference sound, extracting a sequence of fingerprints you want to “hit” in the input sound, and then scanning the input sound for that sequence of fingerprints in order.

So, very briefly:

to record:

spectrograms = empty vector of spectrogram
lastspectrogram = empty
foreach block in original sound:
    blockspectrogram = calculate spectrogram from block
    if lastspectrogram is empty or different_enough(lastspectrogram, blockspectrogram) then
        spectrograms = append blockspectrogram to spectrograms
        lastspectrogram = blockspectrogram

to detect:

index = 0
foreach block in input sound:
    blockspectrogram = calculate spectrogram from block
    if not different_enough(blockspectrogram, spectrograms[index]) then
        index++
        if index == length(spectrograms) then
            return Success
return Failure

Something like that.
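As a concrete sketch in Python (using only NumPy; the block size and the distance threshold are illustrative values you would tune, and `different_enough` here is just a normalized spectral distance, not a recommendation):

```python
import numpy as np

BLOCK = 1024  # samples per analysis block; illustrative, tune for your material

def block_spectrogram(block):
    """Magnitude spectrum of one block (one column of a spectrogram)."""
    return np.abs(np.fft.rfft(block * np.hanning(len(block))))

def different_enough(a, b, threshold=0.5):
    """Normalized Euclidean distance between two magnitude spectra."""
    denom = np.linalg.norm(a) + np.linalg.norm(b)
    if denom == 0:
        return False  # two silent blocks count as "the same"
    return np.linalg.norm(a - b) / denom > threshold

def record_fingerprints(sound):
    """Keep a block's spectrum only when it differs enough from the last kept one."""
    fingerprints = []
    last = None
    for start in range(0, len(sound) - BLOCK + 1, BLOCK):
        spec = block_spectrogram(sound[start:start + BLOCK])
        if last is None or different_enough(last, spec):
            fingerprints.append(spec)
            last = spec
    return fingerprints

def detect(sound, fingerprints):
    """Scan the input for the fingerprint sequence, in order."""
    if not fingerprints:
        return False
    index = 0
    for start in range(0, len(sound) - BLOCK + 1, BLOCK):
        spec = block_spectrogram(sound[start:start + BLOCK])
        if not different_enough(spec, fingerprints[index]):
            index += 1
            if index == len(fingerprints):
                return True
    return False
```

Because leading silence never matches the first fingerprint, the scan simply skips past it, which is what gives you tolerance to a time offset.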

Obviously, “different_enough” needs to be tuned based on the kinds of sounds you’re interested in detecting, and “calculate spectrogram” should probably exclude buckets for frequencies you’re not interested in.
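For example, excluding uninteresting frequency buckets could look like the following sketch, where the 300 Hz–3 kHz band edges are placeholders you would pick for the sounds you actually care about:

```python
import numpy as np

def band_limited_spectrum(block, sample_rate, lo_hz=300.0, hi_hz=3000.0):
    """Magnitude spectrum keeping only bins inside [lo_hz, hi_hz].

    lo_hz/hi_hz are illustrative; choose them for your target sounds.
    """
    spectrum = np.abs(np.fft.rfft(block * np.hanning(len(block))))
    freqs = np.fft.rfftfreq(len(block), d=1.0 / sample_rate)
    keep = (freqs >= lo_hz) & (freqs <= hi_hz)
    return spectrum[keep]
```

Dropping out-of-band bins both shrinks the fingerprints and keeps broadband noise outside the band from inflating the distance measure.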
