[Technique] SegSNR: Properly add noise to synthesized speech

Published: January 17, 2020

Last Updated: 2020-01-17

I have recently been working on a project that uses synthetic speech data. We want to make the speech more realistic and evaluate the effect of noise at different signal-to-noise ratios (SNR), so I investigated how to add noise to a speech signal properly for a given target SNR.

Take Gaussian noise as an example. Given clean speech data $S$ and noise data $N$ drawn from a normal distribution (mean 0, standard deviation 1), our goal is to add the noise to the clean speech so that the resulting SNR equals $q$ dB:

\[S_i \leftarrow S_i + w \cdot N_i\]

$S_i$ and $N_i$ denote the $i$-th samples of the speech and noise vectors, respectively.

To solve this problem, let us have a look at the equation of SNR:

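\[q = 10\log_{10}\left(\frac{P_s}{w^2 P_n}\right) = 10\log_{10}\left(\frac{\sum_i S_i^2}{w^2 \sum_i N_i^2}\right)\]

where $P_s$ and $P_n$ are the power of the speech and of the unscaled noise, and $w$ is the weight we want to find. Both sums run over the same number of samples, so the normalization by length cancels in the ratio.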

Then we have:

\[w = \sqrt{\frac{\sum_i S_i^2}{10^{q/10} \sum_i N_i^2}}\]
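The formula for $w$ can be sketched in a few lines of NumPy. This is only a minimal illustration: the function name `add_noise_global_snr`, the 16 kHz sample rate, and the sine tone standing in for speech are assumptions made for the example, not part of the original experiments.

```python
import numpy as np

def add_noise_global_snr(speech, snr_db, rng=None):
    """Add white Gaussian noise so that the whole-signal SNR equals snr_db.

    Hypothetical helper for illustration; returns (noisy signal, scaled noise).
    """
    rng = np.random.default_rng() if rng is None else rng
    speech = np.asarray(speech, dtype=np.float64)
    noise = rng.standard_normal(speech.shape)  # N ~ Normal(mean 0, std 1)
    # w = sqrt( sum(S^2) / (10^(q/10) * sum(N^2)) )
    w = np.sqrt(np.sum(speech ** 2) / (10 ** (snr_db / 10) * np.sum(noise ** 2)))
    scaled_noise = w * noise
    return speech + scaled_noise, scaled_noise

# Toy stand-in for speech: a 440 Hz tone, 1 second at an assumed 16 kHz rate.
t = np.arange(16000) / 16000
clean = 0.5 * np.sin(2 * np.pi * 440 * t)

noisy, scaled_noise = add_noise_global_snr(clean, snr_db=10, rng=np.random.default_rng(0))

# Measure the SNR that was actually achieved over the whole signal.
measured = 10 * np.log10(np.sum(clean ** 2) / np.sum(scaled_noise ** 2))
print(measured)
```

The measured value matches the requested 10 dB up to floating-point rounding, because $w$ is solved from exactly this ratio.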

This is the most common method explained in other posts, and it was also my first idea for this task.

However, in practice I found this method not very effective on speech data: the noise is barely noticeable even with the SNR set to -5 dB. I think the reason is that the amplitude of a speech signal varies greatly, with very high peaks and low valleys, whereas the noise is generated at a constant level and spread evenly along the whole time axis. That is why the noise does not sound as strong as the speech to human ears.

Following my professor's advice, I switched to Segmental SNR [1] to generate and evaluate the results. Instead of working on the whole signal, the Segmental Signal-to-Noise Ratio (SNRseg) is the average of the SNR values of short segments (15 to 20 ms).
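As a rough sketch of that definition, the function below averages the per-segment SNR over non-overlapping 20 ms segments. The name `segmental_snr`, the 16 kHz sample rate, and the small `eps` guard against division by zero are assumptions for the example. Some implementations also clamp each segment's SNR to a fixed range before averaging; that step is left out here.

```python
import numpy as np

def segmental_snr(clean, noisy, sr=16000, seg_ms=20, eps=1e-10):
    """Mean of per-segment SNRs in dB (hypothetical helper for illustration)."""
    clean = np.asarray(clean, dtype=np.float64)
    noisy = np.asarray(noisy, dtype=np.float64)
    seg_len = int(sr * seg_ms / 1000)
    n_seg = len(clean) // seg_len  # a trailing partial segment is dropped
    c = clean[: n_seg * seg_len].reshape(n_seg, seg_len)
    e = (noisy - clean)[: n_seg * seg_len].reshape(n_seg, seg_len)
    sig_energy = np.sum(c ** 2, axis=1)
    err_energy = np.sum(e ** 2, axis=1)
    # eps keeps the ratio finite when a segment is completely silent.
    snr_per_seg = 10 * np.log10((sig_energy + eps) / (err_energy + eps))
    return float(np.mean(snr_per_seg))

# Toy check: an error that is 10% of the signal amplitude gives 20 dB per segment.
t = np.arange(16000) / 16000
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.1 * clean
print(segmental_snr(clean, noisy))
```

In this toy case every segment has the same SNR, so the segmental and whole-signal values coincide. On real speech they differ because quiet segments count as much as loud ones in the average.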

In practice, I cut the whole speech signal into 20 ms segments and scale the noise in each segment to reach the target SNR for that segment. As we can see below, the results are more reasonable: the noise adapts to the voice automatically.
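The per-segment scaling described above might look like the following sketch. It reuses the same weight formula, but solves for $w$ separately in each 20 ms segment. The function name, the 16 kHz sample rate, and the `min_weight` parameter are assumptions made for illustration. In particular, the post does not say how fully silent segments were treated: with `min_weight=0.0` they receive no noise, and a positive value keeps a small noise floor there.

```python
import numpy as np

def add_noise_segmental_snr(speech, snr_db, sr=16000, seg_ms=20,
                            min_weight=0.0, rng=None):
    """Scale Gaussian noise per segment so each segment has SNR = snr_db.

    Hypothetical helper; min_weight is an assumed option for silent segments.
    """
    rng = np.random.default_rng() if rng is None else rng
    speech = np.asarray(speech, dtype=np.float64)
    noise = rng.standard_normal(speech.shape)
    seg_len = int(sr * seg_ms / 1000)
    noisy = speech.copy()
    for start in range(0, len(speech), seg_len):
        s = speech[start:start + seg_len]
        n = noise[start:start + seg_len]
        # Same formula as the whole-signal case, applied to one segment.
        w = np.sqrt(np.sum(s ** 2) / (10 ** (snr_db / 10) * np.sum(n ** 2)))
        noisy[start:start + seg_len] = s + max(w, min_weight) * n
    return noisy

# Toy signal: loud in the first half, quiet in the second half.
t = np.arange(16000) / 16000
envelope = np.where(t < 0.5, 1.0, 0.05)
clean = envelope * np.sin(2 * np.pi * 440 * t)
noisy = add_noise_segmental_snr(clean, snr_db=15, rng=np.random.default_rng(0))

# Per-segment SNR actually achieved (320 samples = 20 ms at 16 kHz).
seg = 320
sig_energy = np.sum(clean.reshape(-1, seg) ** 2, axis=1)
err_energy = np.sum((noisy - clean).reshape(-1, seg) ** 2, axis=1)
per_seg_snr = 10 * np.log10(sig_energy / err_energy)
print(per_seg_snr.min(), per_seg_snr.max())
```

Both the loud and the quiet halves end up at the same per-segment SNR, so the noise level follows the speech envelope instead of staying constant.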

By the way, we found an interesting phenomenon during the experiments: the ASR performs extremely badly on the pure synthesized data, yet it achieves 100% accuracy when the Segmental SNR is 15 dB. The reason is that in synthesized speech the silent periods are completely silent. If feature extraction is based on log spectral power (log filter bank, MFCC, etc.), the silent periods yield $-\infty$ because of the log operation, i.e. $\lim_{x \to 0^+} \log(x) = -\infty$. Adding noise avoids the problem.
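The effect is easy to reproduce on a single frame. The sketch below uses a plain log power spectrum rather than a full filter bank or MFCC pipeline, and the frame length and noise level are arbitrary values chosen for the example.

```python
import numpy as np

def log_power_spectrum(frame):
    """Log of the power spectrum of one frame (simplified feature, no filter bank)."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    # Suppress NumPy's divide-by-zero warning so the -inf values are visible.
    with np.errstate(divide="ignore"):
        return np.log(power)

silent_frame = np.zeros(320)  # perfectly silent, as in synthesized speech
rng = np.random.default_rng(0)
noisy_frame = silent_frame + 1e-3 * rng.standard_normal(320)  # faint added noise

silent_feat = log_power_spectrum(silent_frame)
noisy_feat = log_power_spectrum(noisy_frame)

print(np.isneginf(silent_feat).all())  # every bin is -inf
print(np.isfinite(noisy_feat).all())   # every bin is finite
```

Even very faint noise is enough to make every bin finite, which is consistent with the recognizer recovering once noise is added.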

References

[1] Objective Speech Quality Measures. http://www.irisa.fr/armor/lesmembres/Mohamed/Thesis/node94.html

Wenxin Hou 侯汶昕
