Encoding of mid-level speech features in MEG responses
Studying neuronal responses to naturalistic speech stimuli offers the prospect of tracing the increasingly abstract transformations of the auditory input as it is processed along the auditory pathway. One computational approach for exploring hypotheses about such transformations in Magneto- or Electroencephalography (MEEG) data relies on the estimation of multivariate temporal response functions (mTRFs). With this class of regularized linear models, previous studies have demonstrated the relevance of spectrally resolved stimulus energy and linguistically derived phonemic features for the prediction of MEEG signals.
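For illustration, the core of such a model can be sketched as ridge regression on a time-lagged design matrix; the array shapes, lag count, and regularisation value below are assumptions chosen for the example, not the settings used in the study.

    # Minimal mTRF sketch: ridge regression on time-lagged stimulus features.
    # Shapes, lag count, and lambda are illustrative assumptions.
    import numpy as np

    def lag_matrix(stimulus, n_lags):
        # stimulus: (n_times, n_features); returns (n_times, n_features * n_lags)
        n_times, n_features = stimulus.shape
        X = np.zeros((n_times, n_features * n_lags))
        for lag in range(n_lags):
            # each column block holds the stimulus delayed by `lag` samples
            X[lag:, lag * n_features:(lag + 1) * n_features] = stimulus[:n_times - lag]
        return X

    def fit_mtrf(stimulus, response, n_lags, lam=1.0):
        # closed-form ridge solution: w = (X'X + lam * I)^-1 X'y
        X = lag_matrix(stimulus, n_lags)
        w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ response)
        return w.reshape(n_lags, stimulus.shape[1])  # (lags, features)

    # Toy usage: an 8-band spectrogram predicting one source time course
    rng = np.random.default_rng(0)
    stim = rng.standard_normal((5000, 8))
    meg = stim @ rng.standard_normal(8) + 0.5 * rng.standard_normal(5000)
    print(fit_mtrf(stim, meg, n_lags=50).shape)  # (50, 8)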
This raises the question of which intermediate acoustic features could afford the listener such pre-lexical abstraction.
To address this question, we recorded MEG from young, healthy human participants (currently N=3) while they listened to a one-hour narrative. We performed source localisation using an LCMV beamformer and identified regions of interest based on retest-reliable responses to a repeated speech stimulus. In these regions surrounding bilateral auditory cortices, we estimated mTRFs from bandpass-filtered signals (1-15 Hz) in a 5-fold nested cross-validation framework and compared the predictive power of several feature spaces: plain stimulus energy, spectrograms, Gabor-filtered spectrograms, and a set of linguistic articulatory features.
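As a sketch of how mid-level features of this kind can be derived, a spectrogram can be convolved with 2D Gabor kernels tuned to spectro-temporal modulations; the kernel size and modulation rates below are assumptions for illustration and the study's exact filterbank may differ.

    # Illustrative Gabor filterbank over a (freq, time) spectrogram.
    # Rates, scales, and kernel size are assumptions, not the study's values.
    import numpy as np
    from scipy.signal import convolve2d

    def gabor_kernel_2d(rate, scale, size=15):
        # plane-wave carrier (rate: cycles/time bin, scale: cycles/freq bin)
        # under an isotropic Gaussian envelope
        ax = np.arange(size) - size // 2
        F, T = np.meshgrid(ax, ax, indexing="ij")
        envelope = np.exp(-(T**2 + F**2) / (2 * (size / 4) ** 2))
        return envelope * np.cos(2 * np.pi * (rate * T + scale * F))

    def gabor_features(spectrogram, rates=(0.05, 0.1, 0.2), scales=(0.05, 0.1)):
        # one filtered copy of the spectrogram per rate/scale combination
        return np.stack([
            convolve2d(spectrogram, gabor_kernel_2d(r, s), mode="same")
            for r in rates for s in scales
        ])  # (n_filters, n_freqs, n_times)

    # Toy usage on a random "spectrogram" with 32 bands
    spec = np.abs(np.random.default_rng(1).standard_normal((32, 1000)))
    print(gabor_features(spec).shape)  # (6, 32, 1000)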
In all participants, we replicated the finding that prediction performance improves when articulatory features are added to cochleagrams. We achieved a comparable increase in prediction performance when we instead used Gabor-filtered spectrograms.
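The comparison underlying these results can be illustrated with a nested cross-validation loop in which an inner grid search selects the ridge parameter and the outer folds score held-out prediction accuracy as the correlation between predicted and measured signals; the fold counts and lambda grid below are assumptions, not the study's exact settings.

    # Sketch of a 5-fold nested cross-validation score for one feature space.
    # Assumes X is an already time-lagged design matrix, y a source signal.
    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV, KFold

    def score_feature_space(X, y, lambdas=np.logspace(-2, 4, 7)):
        outer = KFold(n_splits=5)  # contiguous folds preserve temporal structure
        scores = []
        for train, test in outer.split(X):
            # inner CV selects the regularisation strength on training folds only
            search = GridSearchCV(Ridge(), {"alpha": lambdas}, cv=KFold(n_splits=5))
            search.fit(X[train], y[train])
            scores.append(np.corrcoef(search.predict(X[test]), y[test])[0, 1])
        return np.mean(scores)

Comparing such scores between, for example, a lagged cochleagram design matrix and the same matrix augmented with articulatory or Gabor features would quantify the gain contributed by the added feature space.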
Our results demonstrate that mTRFs combining multiple feature spaces can be estimated reliably for individual grid points from source-localised MEG data. This allowed us to extend previous acoustic models with additional mid-level auditory features. Taken together, this is a promising step toward bridging the gap between cochlear and linguistic representations of speech using non-invasive neuroimaging with high temporal resolution.