[Technique] SegSNR: Properly add noise to synthesized speech
Last Updated: 2020-01-17
These days I have been working on a project that uses synthetic speech data. We want to make the speech data more realistic and evaluate the effect of noise at different signal-to-noise ratios (SNR). So I investigated how to properly add noise to a speech signal given a required SNR level.
Take adding Gaussian noise as an example: given clean speech data $S$ and noise data $N$ drawn from a standard normal distribution (mean 0, std 1), our goal is to add the noise to the clean speech so that the resulting SNR equals $q$ dB:
\[\tilde{S}_i = S_i + w \cdot N_i\]where $S_i$ and $N_i$ are the $i$-th samples of the speech and noise signals, $\tilde{S}_i$ is the resulting noisy speech, and $w$ is a scaling weight to be determined.
To solve this problem, let us have a look at the definition of SNR:
\[q = 10\log_{10}\left(\frac{P_s}{w^2 \, P_n}\right) = 10\log_{10}\left(\frac{\sum_i S_i^2}{w^2 \sum_i N_i^2}\right)\]where $P_s$ and $P_n$ denote the power of the speech and the noise, and $w$ is the weight we want to solve for.
Solving for $w$ gives:
\[w = \sqrt{\frac{\sum_i S_i^2}{\left(\sum_i N_i^2\right) \cdot 10^{q/10}}}\]This is the most common method explained in other posts, and it was also my first approach to this task.
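A minimal sketch of this whole-signal approach in Python with NumPy; the function name `add_noise_global_snr` and the use of `numpy.random` are my own illustrative choices, not code from the original experiments:

```python
import numpy as np

def add_noise_global_snr(speech, snr_db, rng=None):
    """Add standard Gaussian noise scaled so the whole-signal SNR equals snr_db."""
    rng = rng or np.random.default_rng()
    noise = rng.standard_normal(len(speech))           # N ~ Normal(0, 1)
    p_speech = np.sum(speech.astype(float) ** 2)       # sum of squared speech samples
    p_noise = np.sum(noise ** 2)                       # sum of squared noise samples
    # w = sqrt( sum(S^2) / (sum(N^2) * 10^(q/10)) )
    w = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + w * noise
```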
However, in practice I found this method not very effective for speech data: the noise is hardly noticeable even when the SNR is set to -5 dB. I think the reason is that the amplitude of a speech signal varies greatly, with very high peaks and quiet valleys, while the noise is generated uniformly and spread over the whole time axis. That is why the noise does not sound as strong as the speech to human ears.
Following my professor's advice, I turned to Segmental SNR [1] to generate and evaluate the results. Instead of working on the whole signal, the Segmental Signal-to-Noise Ratio (SNRseg) averages the SNR values of short segments (15 to 20 ms).
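Restated in the notation above (this is just the standard SNRseg definition from [1] written with the post's symbols, where $M$ is the number of segments and $L$ the segment length in samples):
\[\mathrm{SNRseg} = \frac{10}{M} \sum_{m=0}^{M-1} \log_{10} \frac{\sum_{i=mL}^{mL+L-1} S_i^2}{\sum_{i=mL}^{mL+L-1} (w \, N_i)^2}\]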
In practice, I cut the whole speech signal into 20 ms segments and scale the noise within each segment based on the segmental SNR. The results are more reasonable: the noise adapts to the voice automatically.
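A minimal sketch of this segment-wise scaling, again in Python/NumPy; the function name, the 16 kHz sample rate, and the treatment of all-zero segments (which simply receive no noise here) are my own assumptions, not necessarily the post's exact implementation:

```python
import numpy as np

def add_noise_segmental_snr(speech, snr_db, sample_rate=16000, seg_ms=20.0, rng=None):
    """Scale the noise inside each short segment so that every segment
    reaches the requested segmental SNR."""
    rng = rng or np.random.default_rng()
    noise = rng.standard_normal(len(speech))
    seg_len = int(sample_rate * seg_ms / 1000)     # 20 ms -> 320 samples at 16 kHz
    noisy = speech.astype(float)
    for start in range(0, len(speech), seg_len):
        s = speech[start:start + seg_len]
        n = noise[start:start + seg_len]
        # per-segment weight, same formula as before but restricted to this segment:
        # w_m = sqrt( sum(S^2) / (sum(N^2) * 10^(q/10)) )
        w = np.sqrt(np.sum(s ** 2) / (np.sum(n ** 2) * 10 ** (snr_db / 10)))
        noisy[start:start + seg_len] = s + w * n
    return noisy
```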
By the way, we found an interesting phenomenon during the experiments: the ASR performs extremely poorly on the pure synthesized data, yet achieves 100% accuracy at a segmental SNR of 15 dB. The reason is that in the synthesized speech, silent periods are completely silent. If the feature extraction is based on log spectral power (log filter bank, MFCC, etc.), these silent periods produce $-\infty$ values due to the log operation, i.e. $\lim_{x \to 0^+} \log(x) = -\infty$. Adding noise avoids the problem.
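A tiny NumPy illustration of that effect; the frame length, FFT-based power, and noise level are my own illustrative choices:

```python
import numpy as np

silent_frame = np.zeros(400)                      # a perfectly silent frame from TTS output
power = np.abs(np.fft.rfft(silent_frame)) ** 2    # spectral power is exactly zero
log_power = np.log(power)                         # all -inf (NumPy warns about divide by zero)

rng = np.random.default_rng(0)
noisy_frame = silent_frame + 1e-3 * rng.standard_normal(400)
log_power_noisy = np.log(np.abs(np.fft.rfft(noisy_frame)) ** 2)   # finite values
```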
References
[1] Objective Speech Quality Measures. http://www.irisa.fr/armor/lesmembres/Mohamed/Thesis/node94.html