Understanding communication sounds processing in adverse acoustic conditions: Psychoacoustical and neurophysiological findings

Psychoacoustic studies have revealed that speech intelligibility can be preserved under severe acoustic degradation, such as that induced by masking noise or by a vocoder, which removes the temporal fine structure and only partially preserves the temporal envelope. At the neuronal level, many studies have pointed out that auditory cortex neurons display robust discrimination, but it seems crucial to investigate how subcortical auditory neurons respond in adverse conditions.


Introduction
Our ears are constantly bombarded by complex sound mixtures, which often create challenging acoustic conditions for understanding speech. These conditions include background noise, with the particular case of "cocktail party" noise (which requires segregating one sound source from a mixture of sources); room acoustics, which can cause reverberation; and environmental effects such as the attenuation of some frequencies by the environment [1][2][3] and spectro-temporal masking by competing sounds [4]. In humans, the partial elimination of acoustic features crucial for speech perception, such as the temporal fine structure (TFS) or the slow temporal envelope (E), also constitutes an adverse acoustic situation [5][6][7][8]. All of these conditions make it increasingly difficult for normal-hearing subjects to perceive target sounds such as speech, communication sounds and music. In addition, they impair speech understanding in subjects with mild to moderate hearing loss and are particularly penalizing for subjects with cochlear implants (neuroprostheses that restore hearing in people suffering from profound deafness).
Identifying the spectro-temporal acoustic cues that human subjects use in adverse conditions, and the neuronal mechanisms that allow the auditory system to extract the relevant cues for discriminating sounds in these conditions, are major aims of psychoacoustics and auditory neuroscience.

Psychoacoustical studies: impact of acoustic degradations on speech intelligibility
A large number of studies have used vocoders [9], that is, signal-processing devices designed to selectively alter specific acoustic features of the speech signal [5-7, 10,11]. Vocoders decompose incoming sounds into frequency bands, mimicking the spectral decomposition performed by the cochlea. For each band, the temporal-fine-structure component (i.e., a frequency-modulated signal) is degraded (replaced either by a pure tone or by a broadband noise), and this carrier is then amplitude modulated by the corresponding temporal-envelope component (i.e., an amplitude modulator, AM). The resulting AM carriers are finally summed and presented to human subjects for discrimination or identification. This literature has largely documented that slow AM fluctuations (<16 Hz) in a limited number of frequency bands (4-8) are sufficient to maintain almost perfect identification of speech in quiet [6,7]. Although vocoding reduces the spectral content and the harmonic structure of speech (leading to the loss of pitch and timbre information), slow temporal cues (<16 Hz) are sufficient to produce 90% correct recognition of consonants, vowels and words [6].
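The vocoding steps described above (band decomposition, envelope extraction, carrier substitution, summation) can be illustrated with a minimal sketch. It is not the processing chain of any cited study. The function names, the log-spaced band edges, the brick-wall FFT filters, the Hilbert-based envelope, the per-band level matching and all default parameter values are assumptions chosen for brevity; published vocoders typically use cochlea-inspired band spacing and smoother filters.

```python
import numpy as np

def fft_bandpass(x, fs, lo, hi):
    """Zero-phase brick-wall filter: keep frequencies in [lo, hi), zero the rest."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spec[(freqs < lo) | (freqs >= hi)] = 0.0
    return np.fft.irfft(spec, n=len(x))

def envelope(x, fs, cutoff_hz):
    """Temporal envelope: magnitude of the analytic signal, low-passed at cutoff_hz."""
    n = len(x)
    # Build the analytic signal by removing negative frequencies (Hilbert transform).
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    analytic = np.fft.ifft(np.fft.fft(x) * h)
    env = fft_bandpass(np.abs(analytic), fs, 0.0, cutoff_hz)
    return np.maximum(env, 0.0)  # low-pass ringing can dip slightly below zero

def vocode(x, fs, n_bands=4, f_lo=100.0, f_hi=7000.0,
           env_cutoff_hz=16.0, carrier="tone", seed=0):
    """Tone or noise vocoder sketch; every default value here is an assumption."""
    if f_hi >= fs / 2:
        raise ValueError("f_hi must be below the Nyquist frequency")
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)  # log-spaced analysis bands
    t = np.arange(len(x)) / fs
    out = np.zeros(len(x))
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = fft_bandpass(x, fs, lo, hi)
        env = envelope(band, fs, env_cutoff_hz)
        if carrier == "tone":
            # Pure tone at the geometric centre of the band replaces the TFS.
            chan = env * np.sin(2 * np.pi * np.sqrt(lo * hi) * t)
        else:
            # Envelope-modulated broadband noise, then restricted to the band.
            chan = fft_bandpass(env * rng.standard_normal(len(x)), fs, lo, hi)
        # Match each channel's RMS to that of the original band (assumption).
        chan_rms = np.sqrt(np.mean(chan ** 2))
        if chan_rms > 0:
            chan = chan * (np.sqrt(np.mean(band ** 2)) / chan_rms)
        out += chan
    return out
```

Calling `vocode(x, fs, n_bands=4, carrier="noise")` on a speech waveform `x` sampled at `fs` Hz would return a signal that keeps only the slow (<16 Hz) envelopes of four bands. Lowering `n_bands` makes the spectral degradation more severe, which corresponds to the manipulation discussed in this section.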
On the other hand, elderly persons often experience considerable difficulty in understanding speech in adverse listening situations, sometimes in the absence of elevated audiometric thresholds [12,13]. This may result from an alteration of supra-threshold auditory processing [14][15][16][17], which could be explained if high-threshold auditory-nerve fibers are the first to be affected during aging [18]. Initially, it was found that older listeners showed deficits in using complex amplitude modulation (AM) patterns to identify speech accurately [19,20]. This is possibly due to reduced sensitivity to AM cues, or to a reduced central ability to make optimal use of AM cues [12,21]. However, several studies did not find any effect of aging on AM sensitivity [13,22,23], suggesting that the processing of AM information is roughly comparable for younger and older listeners exhibiting similar audiometric thresholds.
When elderly subjects exhibit sensorineural hearing loss, AM sensitivity is generally improved compared to normal [24], presumably because of the loss in the fast-acting amplitude compression applied by outer hair cells in the cochlea [25]. Thus, the potentially detrimental effects of aging on AM processing may be confounded with (and counter-balanced by) the loss of compression associated with mild hearing loss.

Neurophysiological studies: impact of acoustic degradations on neuronal responses in the auditory pathway
The auditory system successfully maintains a detailed and precise neuronal representation of target acoustic stimuli such as speech, communication sounds and music, even in the presence of substantial acoustic alterations. Many studies describing the consequences of acoustic degradations have been performed in the primary auditory cortex (AI) with stimuli that differ in length, spectral content, and other acoustic parameters. All the studies using vocoded vocalizations showed little change in AI responses: in several species, cortical responses were not drastically reduced even when the number of frequency bands was reduced to two [26][27][28][29]. At the level of the secondary auditory cortex (SRAF area), Carruthers et al. [30] showed that neuronal populations code invariant representations of conspecific vocalizations despite substantial spectro-temporal degradations. In the only study performed at the subcortical level with vocoded stimuli, it was reported that, in terms of firing rate, the responses of inferior colliculus (IC) neurons were resistant to drastic spectral degradations [31].
Acoustically, the vocoded stimuli become spectrally more similar as the number of frequency bands decreases, whereas their temporal envelopes, although partially degraded, remain quite different. Auditory neurons thus remain sensitive to the temporal envelope fluctuations still present in the vocoded stimuli. If this interpretation is correct, masking the amplitude modulations of natural stimuli by adding noise should largely reduce the discriminative abilities of auditory neurons. In fact, this might be the best adverse condition in which to evaluate the abilities of auditory neurons to discriminate communication sounds.
Most of the studies describing the consequences of background noise on neuronal responses to target stimuli have also been performed at the level of the primary auditory cortex. Initially, Nagarajan et al. [26] reported that adding white noise reduced neuronal responses to communication sounds only at a 0 dB SNR. In bird field L (homologous to AI), neuronal responses to song motifs were strongly reduced by three different types of masking noise [4]. However, recent results revealed a more complex picture [32]. When vocalizations were embedded in broadband white noise or in babble noise, the responses of cortical neurons could be classified into four classes named robust, balanced, insensitive or brittle. However, a given neuron can fall into one class or another depending on the type of noise, demonstrating the existence of contextual effects. In fact, the initial results of Bar-Yosef et al. [33] in the cat primary auditory cortex had already pointed out that some cortical neurons are more sensitive to the noise background than to the actual communication sounds. In the avian homologue of a secondary auditory area (area NCM), cortical inhibitory microcircuits, which contribute to sparsifying the evoked discharges of pyramidal cells, allow the emergence of invariant neural representations of communication sounds in noise conditions [34]. In a quite interesting study, Shetake et al. [35] quantified the discriminative abilities of AI responses to similar speech sounds with and without added noise and found that the discrimination abilities of cortical cells can closely match behavioral performance. The discrimination performance of neuronal populations was not affected at an SNR of +12 dB, but fell close to chance level at an SNR of -12 dB [35]. This resistance of cortical discrimination is at variance with the strong impact of noise observed in the auditory thalamus. Indeed, a massive reduction in evoked firing rate and in the temporal reliability of evoked responses was observed in the auditory thalamus in the easier noise condition (SNR of +10 dB; [36]).
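The SNR values quoted above (+12, +10, 0 and -12 dB) refer to the level of the target sound relative to the masking noise. As an illustration only, the sketch below shows one common way of embedding a target in noise at a chosen SNR. It assumes that SNR is defined from the root-mean-square (RMS) amplitudes computed over the whole stimulus; the cited studies may define or calibrate SNR differently, and the function names are hypothetical.

```python
import numpy as np

def rms(x):
    """Root-mean-square amplitude of a signal."""
    return float(np.sqrt(np.mean(np.square(x))))

def mix_at_snr(target, noise, snr_db):
    """Add noise to target so that 20*log10(rms(target)/rms(noise)) equals snr_db.

    Assumption: SNR is computed from whole-stimulus RMS values. The target is
    left untouched and only the noise is rescaled.
    Returns the mixture and the rescaled noise.
    """
    target = np.asarray(target, dtype=float)
    noise = np.asarray(noise, dtype=float)
    if len(noise) < len(target):
        raise ValueError("noise must be at least as long as the target")
    noise = noise[:len(target)]
    noise_rms = rms(noise)
    if noise_rms == 0.0:
        raise ValueError("noise is silent; its level cannot be scaled")
    # Gain that brings the noise RMS to rms(target) / 10**(snr_db / 20).
    gain = rms(target) / (noise_rms * 10.0 ** (snr_db / 20.0))
    scaled_noise = gain * noise
    return target + scaled_noise, scaled_noise
```

With this definition, an SNR of 0 dB means that the target and the noise have equal RMS amplitude, while -12 dB means that the noise RMS is about four times that of the target.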
A direct comparison of the consequences of acoustic degradation across different structures is the most straightforward way to determine where invariant representations are generated. When measuring how different levels of noise alter neuronal coding in the auditory system, it was found that, from the auditory nerve to the IC and to AI, the neural representation of natural sounds became progressively more independent of the level of background noise [37]. At the population level, this tolerance to background noise was proposed to result from an adaptation to the noise statistics, which is much more pronounced at the cortical than at the subcortical level [37].

Conclusion
In both psychoacoustics and auditory neuroscience, how robust representations of communication sounds are generated, allowing humans and animals to react rapidly and efficiently in adverse acoustic conditions, has become an intense area of research. At present, most studies confirm that cortical neurons contribute to the invariant representation of speech-like stimuli despite very severe acoustic degradations. Almost no study has been performed so far at the subcortical level, but clarifying the processing mechanisms at the early stages of the auditory pathway may be crucial for understanding how these invariant features and representations emerge within the auditory system. Electrophysiological explorations combining state-of-the-art signal-processing analyses and large-scale neuronal recordings at different levels of the auditory system are necessary for progress in this field.