Can biophysically inspired features improve neural network-based speech enhancement?
Recent neural network (NN)-based speech enhancement schemes outperform most conventional techniques and are less sensitive to increases in input dimensionality and to correlation between features. However, the performance of state-of-the-art speech enhancement systems in adverse conditions, such as negative signal-to-noise ratios (SNRs) and unseen noise types, has not been extensively studied. In addition, these systems still fall well short of human performance, especially in such difficult conditions. There is therefore growing interest in feature-related research that incorporates knowledge of human auditory processing into the NN framework.
This work investigates the use of various biophysically inspired cochlear models as input to an NN for speech enhancement in adverse and unseen noise conditions. The front ends investigated are Mel frequency spectral coefficients (FBANK), gammatone energies (GT), the dynamic-compressive gammachirp (DCGC), the dual resonance nonlinear filterbank (DRNL), the cascade of asymmetric resonators with fast-acting compression (CARFAC) and a non-linear transmission-line (TL) model. The deep NN (DNN) consists of three hidden layers with sigmoid activations. The NNs are trained to minimize the mean square error between the NN output and the desired mask for optimal noise suppression.
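As a rough illustration of the network and training objective described above, the following NumPy sketch builds a feed-forward network with three sigmoid hidden layers and computes the mean square error between its output and a target mask. The layer widths, the random initialization, the sigmoid on the output layer, and the function names are assumptions made for this example only; the source does not specify them, nor the exact mask definition.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(n_in, n_hidden, n_out, seed=0):
    """Weights for 3 hidden layers plus an output layer (sizes are assumed)."""
    rng = np.random.default_rng(seed)
    sizes = [n_in, n_hidden, n_hidden, n_hidden, n_out]
    return [(rng.normal(0.0, 0.1, (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, features):
    """Map cochlear-model features (frames x channels) to a mask estimate."""
    h = features
    for W, b in params:
        # Sigmoid on every layer; using it on the output layer is an
        # assumption that keeps the estimated mask in (0, 1).
        h = sigmoid(h @ W + b)
    return h

def mse_loss(params, features, target_mask):
    """Mean square error between the NN output and the desired mask."""
    return float(np.mean((forward(params, features) - target_mask) ** 2))
```

In this sketch, `features` would hold one row per time frame of whichever front end is under test (FBANK, GT, DCGC, DRNL, CARFAC or TL), and `target_mask` the corresponding desired mask values; training would then adjust `params` to reduce `mse_loss`.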
The NNs are trained on noisy speech containing babble noise at positive SNRs and are evaluated on noisy speech containing babble and ICRA noise at negative SNRs. The quality of noise suppression is evaluated using objective measures such as the short-time objective intelligibility measure (STOI), cepstral distance (CD), log-likelihood ratio (LLR) and segmental SNR (segSNR).
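To make the SNR conditions and one of the measures concrete, the sketch below shows one common way to mix speech and noise at a chosen SNR and to compute a segmental SNR. The frame length, the clamping range of -10 to 35 dB, and the function names are assumptions typical of such evaluations, not values taken from this work.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so the mixture has the requested overall SNR in dB."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise

def segmental_snr(clean, enhanced, frame_len=256, lo_db=-10.0, hi_db=35.0):
    """Average per-frame SNR in dB, each frame clamped to [lo_db, hi_db]."""
    n_frames = len(clean) // frame_len
    if n_frames == 0:
        raise ValueError("signal is shorter than one frame")
    eps = 1e-10  # avoids division by zero and log of zero
    frame_snrs = []
    for i in range(n_frames):
        seg = slice(i * frame_len, (i + 1) * frame_len)
        signal_energy = np.sum(clean[seg] ** 2) + eps
        error_energy = np.sum((clean[seg] - enhanced[seg]) ** 2) + eps
        snr = 10.0 * np.log10(signal_energy / error_energy)
        frame_snrs.append(np.clip(snr, lo_db, hi_db))
    return float(np.mean(frame_snrs))
```

Under these assumptions, a training mixture would be produced with a positive `snr_db` and a test mixture with a negative one, for example `mix_at_snr(clean, babble, -5.0)`, where `babble` is a hypothetical noise recording of the same length as `clean`.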