Modeling Attention and Memory for Auditory Selection in a Cocktail Party Environment

Authors

Jiaming Xu

Chinese Academy of Sciences, Institute of Automation

Jing Shi

Chinese Academy of Sciences, Institute of Automation

Guangcan Liu

Chinese Academy of Sciences, Institute of Automation

Xiuyi Chen

Chinese Academy of Sciences, Institute of Automation

Bo Xu

Chinese Academy of Sciences, Institute of Automation

Published:

2018-02-08

Proceedings:

Proceedings of the AAAI Conference on Artificial Intelligence, 32

Volume

Issue:

Thirty-Second AAAI Conference on Artificial Intelligence 2018

Track:

Main Track: Machine Learning Applications


Abstract:

Developing a computational auditory model to solve the cocktail party problem has long bedeviled scientists, especially for single-microphone recordings. Although recent deep learning frameworks have made significant progress in multi-talker mixed speech separation, most existing methods focus on separating all the speech channels rather than selectively attending to the target speech and ignoring other sounds. As a result, they may fail to offer a satisfactory solution in a complex auditory scene where the number of input sounds is usually uncertain and even dynamic. In this work, we draw on ideas from auditory selective attention in behavioral and cognitive neuroscience and on recent advances in memory-augmented neural networks. Specifically, we propose a unified Auditory Selection framework with Attention and Memory (dubbed ASAM). ASAM first accumulates prior knowledge (that is, the acoustic features of one specific speaker) into a life-long memory during the training phase; meanwhile, a speech perceptor is trained to extract temporal acoustic features and to update the memory online when a salient speech is given. Then, the learned memory interacts with the mixture input to attend to and filter the target frequencies out of the mixture stream. Finally, the network is trained to minimize the reconstruction error of the attended speech. We evaluate the proposed approach on the WSJ0 and THCHS-30 datasets, and the experimental results demonstrate that it successfully performs two auditory selection tasks: top-down task-specific attention (e.g., following a conversation with a friend) and bottom-up stimulus-driven attention (e.g., being attracted by a salient speech). Compared with deep clustering based methods, our method shows competitive advantages, especially in a real noise environment (e.g., a street junction). Our code is available at https://github.com/jacoxu/ASAM.
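The pipeline described in the abstract (store a speaker vector in memory, let it interact with the mixture to form an attention mask, then minimize reconstruction error) can be illustrated with a minimal sketch. This is not the authors' implementation, which is in the repository linked above. The function names, the moving-average memory update, and the dot-product-plus-sigmoid mask are all assumptions chosen for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def update_memory(memory, speaker_id, new_vec, alpha=0.5):
    """Blend a newly extracted speaker vector into the stored one.

    Hypothetical update rule (exponential moving average); the abstract
    only says the memory is updated online, not how.
    """
    old = memory.get(speaker_id)
    if old is None:
        memory[speaker_id] = list(new_vec)
    else:
        memory[speaker_id] = [(1 - alpha) * o + alpha * n
                              for o, n in zip(old, new_vec)]
    return memory[speaker_id]

def attention_mask(mix_embed, speaker_vec):
    """Score each time-frequency bin against the speaker vector.

    mix_embed is a [T][F][D] nested list of per-bin embeddings.
    Returns a [T][F] mask with values in (0, 1). The dot product
    followed by a sigmoid is an assumed form of the interaction.
    """
    return [[sigmoid(sum(e * s for e, s in zip(bin_vec, speaker_vec)))
             for bin_vec in frame]
            for frame in mix_embed]

def apply_mask(mix_spec, mask):
    """Element-wise product of the [T][F] mixture spectrogram and mask."""
    return [[m * x for m, x in zip(m_row, x_row)]
            for m_row, x_row in zip(mask, mix_spec)]

def reconstruction_error(estimate, target):
    """Sum of squared errors between estimated and target spectrograms."""
    return sum((e - t) ** 2
               for e_row, t_row in zip(estimate, target)
               for e, t in zip(e_row, t_row))

# Toy usage: one frame, two frequency bins, 2-D embeddings.
memory = {}
update_memory(memory, "target", [1.0, 0.0])
mix_embed = [[[10.0, 0.0], [-10.0, 0.0]]]  # bin 0 matches the target, bin 1 does not
mix_spec = [[2.0, 3.0]]
mask = attention_mask(mix_embed, memory["target"])
estimate = apply_mask(mix_spec, mask)
loss = reconstruction_error(estimate, [[2.0, 0.0]])
```

In this toy case the mask keeps the bin whose embedding aligns with the stored speaker vector and suppresses the other, so the reconstruction error against a target that contains only the first bin is close to zero. In a trained model the embeddings and the speaker vector would be learned rather than hand-set.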

DOI:

10.1609/aaai.v32i1.11879


ISSN 2374-3468 (Online) ISSN 2159-5399 (Print)
