Call for Special Sessions


The Organizing Committee of INTERSPEECH 2022 is proud to announce the following special sessions and challenges.

Special sessions and challenges focus on ‘special’ topics that may not be covered in regular conference sessions.

Papers must be submitted following the same schedule and procedure as regular papers, and they undergo the same review process by anonymous and independent reviewers.

List of sessions - in alphabetical order



The focus of this special session is to provide a forum for researchers working on the massive naturalistic audio collection stemming from the NASA Apollo missions. UTDallas-CRSS, under NSF support, has led the Fearless Steps Initiative, a sustained effort spanning eight years that has resulted in the digitization and recovery of over 50,000 hours of original analog audio data, as well as the development of algorithms to extract meaningful information from this naturalistic data resource, including an initial release of pipeline diarization metadata for all 30 channels of the Apollo-11 and Apollo-13 missions. More than 500 sites worldwide have accessed the initial data. A current NSF Community Resource project is continuing this effort to recover the remaining Apollo missions (A7-A17; estimated at 150,000 hours of data), in addition to motivating collaborative speech and language technology research through the Fearless Steps Challenge series.




  • John H.L. Hansen, Univ. of Texas at Dallas
  • Christopher Cieri, Linguistic Data Consortium
  • James Horan, NIST
  • Aditya Joglekar, Univ. of Texas at Dallas
  • Midia Yousefi, Univ. of Texas at Dallas
  • Meena Chandra Shekar, Univ. of Texas at Dallas


The INTERSPEECH 2022 Audio Deep Packet Loss Concealment Challenge is intended to stimulate research in the area of audio packet loss concealment (PLC).

PLC is an important part of audio telecommunications technology and codec development, and methods for performing PLC using machine learning approaches are now becoming viable for practical use. Packet loss, whether from missing packets or high packet jitter, is one of the top reasons for speech quality degradation in Voice over IP calls.

While there have been some groups publishing in this area, a lack of common datasets and evaluation procedures complicates the comparison of proposed methods and the establishment of clear baselines. With this challenge, we propose to address this situation: We will open source a dataset based on real-world (as opposed to the common synthetic) packet loss traces and bring the community together to, for the first time, compare approaches in this field on a unified test set.

As the gold standard for audio quality evaluation is human evaluator ratings, we will evaluate submissions using a crowdsourced ITU-T P.808 CCR approach. The three approaches that achieve the highest average Mean Opinion Score on the blind set will be declared the winners of the INTERSPEECH 2022 Audio Deep Packet Loss Concealment Challenge. As an additional metric, to ensure that approaches do not degrade intelligibility, we will use the speech recognition rate, calculated using the Microsoft Cognitive Services Speech Recognition Service.
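The ranking criterion above (highest average Mean Opinion Score on the blind set) can be sketched as follows. This is a minimal illustrative example, not official challenge scoring code; the system names and per-clip scores are invented.

```python
# Hypothetical sketch of the winner-selection rule described above:
# average the per-clip MOS ratings for each submission, then rank
# submissions by that average, highest first.

def rank_by_mean_mos(ratings):
    """ratings: {system_name: [per-clip MOS values]} -> (name, mean) pairs, best first."""
    mean_mos = {name: sum(scores) / len(scores) for name, scores in ratings.items()}
    return sorted(mean_mos.items(), key=lambda kv: kv[1], reverse=True)

# Invented example ratings for three hypothetical submissions.
ratings = {
    "baseline": [3.1, 3.4, 2.9],
    "system_a": [3.8, 4.0, 3.7],
    "system_b": [3.5, 3.6, 3.4],
}
top_three = rank_by_mean_mos(ratings)[:3]
print(top_three[0][0])  # name of the system with the highest average MOS
```

In the actual challenge the per-clip scores would come from the crowdsourced P.808 CCR listening test rather than a hard-coded dictionary.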

To help participants during the challenge, we will provide participants with access to our prototype "PLC-MOS" neural network model that provides estimates of human ratings of audio files with healed packet losses.

Challenge details:

Data and example scripts:


  • Ross Cutler, Microsoft, USA
  • Ando Saabas, Microsoft, Estonia
  • Lorenz Diener, Microsoft, Estonia
  • Sten Sootla, Microsoft, Estonia
  • Solomiya Branets, Microsoft, Estonia


The increasing proliferation of smart devices in our lives offers tremendous opportunities to improve the customer experience by leveraging spatial diversity and distributed computational and memory capability. At the same time, multi-sensor networks present unique challenges compared to single smart devices, such as synchronization, arbitration, and privacy.

The purpose of this special session is to promote research in multiple-device signal processing and machine learning by bringing together leading industry and academic experts to discuss topics including, but not limited to:

  • Multiple device audio datasets
  • Automatic speech recognition
  • Keyword spotting
  • Device arbitration (i.e. which device should respond to the user’s inquiry)
  • Speech enhancement: de-reverberation, noise reduction, echo reduction
  • Source separation
  • Speaker localization and tracking
  • Privacy sensitive signal processing and machine learning

The core motivation of this session is the recognition that "more is different". Robust speech recognition, enhancement, and analysis are foundational areas of speech signal processing with many publication outlets. The strength of the special session is to use the engineering specification of multiple devices as a backdrop against which creative solutions from these domains can be demonstrated. The session will co-locate top researchers working in the multi-sensor domain, and even though their specific applications may be different (e.g. enhancement vs acoustic event detection), the similarity of the problem space encourages cross pollination of techniques.


  • Jarred Barber, M.S., Amazon Alexa Speech
  • Gregory Ciccarelli, Ph.D., Amazon Alexa Speech
  • Israel Cohen, Ph.D., Amazon Alexa Speech, Technion-Israel Institute of Technology
  • Tao Zhang, Ph.D., Amazon Alexa Speech


Speech technologies are increasingly used and now power a very large range of applications. Automatic speech recognition systems have dramatically improved over the past decade thanks to the advances brought by deep learning and efforts in large-scale data collection. The speech technology community's relentless focus on minimum word error rate has thus resulted in a productivity tool that works well for some categories of the population, namely those of us whose speech patterns match its training data: typically, college-educated first-language speakers of a standardized dialect, with little or no speech disability.

For some groups of people, however, speech technology works less well, maybe because their speech patterns differ significantly from the standard dialect (e.g., because of regional accent), because of intra-group heterogeneity (e.g., speakers of regional African American dialects; second-language learners; and other demographic aspects such as age, gender, or race), or because the speech pattern of each individual in the group exhibits a large variability (e.g., people with severe disabilities).

The goal of this special session is (1) to discuss these biases and propose methods for making speech technologies more useful to heterogeneous populations and (2) to increase academic and industry collaborations to reach these goals.

Such methods include:

  • analysis of performance biases among different social/linguistic groups in speech technology,
  • new methods to mitigate these differences,
  • new approaches for data collection, curation and coding,
  • new algorithmic training criteria,
  • new methods for envisioning speech technology task descriptions and design criteria.

Moreover, the special session aims to foster cross-disciplinary collaboration between fairness and personalization research, which has the potential to improve both customer experiences and algorithmic fairness. The special session will bring together experts from both fields to advance the cross-disciplinary study of fairness and personalization, e.g., fairness-aware personalization.

The session promotes collaboration between academia and industry to identify the key challenges and opportunities of fairness research and shed light on future research directions.



  • Prof. Laurent Besacier, Naver Labs Europe, France, Principal Scientist,
  • Dr. Keith Burghardt, USC Information Sciences Institute, USA, Computer Scientist,
  • Dr. Alice Coucke, Sonos Inc., France, Head of Machine Learning Research,
  • Prof. Mark Allan Hasegawa-Johnson, University of Illinois, USA, Professor of Electrical and Computer Engineering,
  • Dr. Peng Liu, Amazon Alexa, USA, Senior Machine Learning Scientist,
  • Anirudh Mani, Amazon Alexa, USA, Applied Scientist,
  • Prof. Mahadeva Prasanna, IIT Dharwad, India, Professor, Dept of Electrical Engineering,
  • Prof. Priyankoo Sarmah, IIT Guwahati, India, Professor, Dept of Humanities and Social Sciences,
  • Dr. Odette Scharenborg, Delft University of Technology, the Netherlands, Associate professor,
  • Dr. Tao Zhang, Amazon Alexa, USA, Senior Manager.


The special session aims to bring together researchers from all sectors working on ASR (Automatic Speech Recognition) for low-resource languages and dialects to discuss the state of the art and future directions. It will allow for fruitful exchanges between participants in low-resource ASR challenges and evaluations and other researchers working on low-resource ASR development.

One such challenge is the OpenASR Challenge series conducted by NIST (National Institute of Standards and Technology) in coordination with IARPA’s (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program. The most recent challenge, OpenASR21, offered an ASR test of 15 low-resource languages for conversational telephone speech, with additional data genres and case-sensitive scoring for some of the languages.

Another challenge is the Hindi ASR Challenge that was recently opened to evaluate regional variations of Hindi with the use of spontaneous telephone speech recordings made available by Gram Vaani, a social technology enterprise company. The regional variations of Hindi, together with the spontaneity of the speech, natural background noise, and transcriptions of varying accuracy due to crowdsourcing, make it a unique corpus for automatic recognition of spontaneous telephone speech in low-resource regional variations of Hindi. A 1,000-hour audio-only dataset (no transcriptions) is also released with this challenge to explore self-supervised training in such a low-resource setting.

We invite contributions from the OpenASR21 Challenge participants, the MATERIAL performers, the Hindi ASR Challenge participants, and any other researchers with relevant work in the low-resource ASR problem space.


  • Reports of results from tests of low-resource ASR, such as (but not limited to) the NIST/IARPA OpenASR21 Challenge, IARPA MATERIAL evaluations, and the Hindi ASR Challenge.
  • Topics focused on aspects of challenges and solutions in low-resource settings, such as:
    • Zero- or few-shot learning methods
    • Transfer learning techniques
    • Cross-lingual training techniques
    • Use of pretrained models
    • Factors influencing ASR performance (such as dialect, gender, genre, variations in training data amount, or casing)
    • Any other topics focused on low-resource ASR challenges and solutions



  • Peter Bell, University of Edinburgh
  • Jayadev Billa, University of Southern California Information Sciences Institute
  • Prasanta Ghosh, Indian Institute of Science, Bangalore
  • William Hartmann, Raytheon BBN Technologies
  • Kay Peterson, National Institute of Standards and Technology
  • Aaditeshwar Seth, Indian Institute of Technology, Delhi


Progress in speech processing has been facilitated by shared datasets and benchmarks. Historically these have focused on automatic speech recognition (ASR), speaker identification, or other lower level tasks. Interest has been growing in higher-level spoken language understanding (SLU) tasks, including using end-to-end models, but there are fewer annotated datasets for such tasks, and the existing datasets tend to be relatively small. At the same time, recent work shows the possibility of pre-training generic representations and then fine-tuning for several tasks using relatively little labeled data. In this special session, we would like to foster a discussion and invite researchers in the field of SLU working on tasks such as named entity recognition (NER), sentiment analysis, intent classification, dialogue act tagging, or others, using either audio or ASR transcripts.

We invite contributions on any relevant work in the low-resource SLU problem space, including (but not limited to):

  • Training/fine-tuning approaches using self-/semi-supervised models for SLU tasks
  • Comparisons between pipeline and end-to-end SLU systems
  • Self-/semi-supervised learning approaches focusing on SLU
  • Multi-task/transfer/student-teacher learning focusing on SLU tasks
  • Theoretical or empirical studies of low-resource SLU problems


Special session website

Contact: Suwon Shon (


  • Suwon Shon - ASAPP
  • Felix Wu - ASAPP
  • Pablo Brusco - ASAPP
  • Kyu J. Han - ASAPP
  • Karen Livescu - TTI at Chicago
  • Ankita Pasad - TTI at Chicago
  • Yoav Artzi - Cornell University
  • Katrin Kirchhoff - Amazon
  • Samuel R. Bowman - New York University
  • Zhou Yu - Columbia University


The ConferencingSpeech 2022 challenge is proposed to stimulate research in non-intrusive speech quality assessment for online conferencing applications. For a long time, speech quality assessment of communication applications was carried out through subjective experiments or obtained via computational models relying on the reference clean and degraded speech in an intrusive manner. However, for quality monitoring purposes, a non-intrusive speech quality model (a so-called single-ended model), which does not need reference speech, is highly preferred and remains a difficult and challenging topic. The challenge aims to bring together researchers from all sectors working on speech quality to show the potential performance of different models, explore new ideas, and discuss the state of the art and future directions. We believe this could accelerate research on this topic, making non-intrusive speech quality assessment more reliable and increasing the possibility that such models are adopted by online conferencing applications in the near future.

This challenge will provide comprehensive training datasets, a comprehensive test dataset and a baseline system. The final ranking of the challenge will be decided by the accuracy of the predicted MOS scores from the submitted model or algorithm on the test dataset. More details about the data and the challenge can be found in the evaluation plan. Please let us know if you have questions or need clarification about any aspect of the challenge.
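Accuracy of predicted MOS scores is commonly quantified by comparing predictions against the ground-truth listening-test MOS, for instance with root-mean-square error and Pearson correlation. The following is a minimal sketch of those two measures, not the official ConferencingSpeech scoring script; the predicted and ground-truth values are invented.

```python
# Hypothetical sketch: two common accuracy measures for MOS prediction,
# RMSE (lower is better) and Pearson correlation (higher is better).
import math

def rmse(pred, true):
    """Root-mean-square error between predicted and ground-truth MOS."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))

def pearson(pred, true):
    """Pearson correlation between predicted and ground-truth MOS."""
    n = len(pred)
    mp, mt = sum(pred) / n, sum(true) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, true))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in true))
    return cov / (sp * st)

# Invented example values for four test clips.
predicted = [3.2, 4.1, 2.8, 3.9]
ground_truth = [3.0, 4.3, 2.5, 4.0]
print(rmse(predicted, ground_truth))
print(pearson(predicted, ground_truth))
```

The evaluation plan specifies the exact metric used for the final ranking; this sketch only illustrates the general idea of scoring a predictor against subjective ratings.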



  • Gaoxiong Yi, Tencent, China
  • Wei Xiao, Tencent, China
  • Yiming Xiao, Tencent, China
  • Babak Naderi, Technical University of Berlin, Germany
  • Sebastian Möller, Technical University of Berlin, Germany
  • Gabriel Mittag, Machine Learning Scientist, Microsoft
  • Ross Cutler, Partner Applied Scientist Manager, Microsoft
  • Zhuohuang Zhang, Indiana University Bloomington, USA
  • Donald S. Williamson, Assistant Professor, Indiana University Bloomington, USA
  • Fei Chen, Professor, Southern University of Science and Technology, China
  • Fuzheng Yang, Professor, Xidian University, China
  • Shidong Shang, Senior Director, Tencent, China


Style is becoming more important, as we increasingly deploy variations of one basic dialog system across domains and genres, and as we aim to better customize and individualize our dialog systems.

Style has been a focus of much recent work in speech synthesis, with remarkable advances also in style transfer, style discovery, style recognition, and style modeling, for utterance-level style properties as well as interaction-level and dialog-level properties. Nevertheless, more work is needed to improve and simplify our models, to generalize and systematize our understanding of style, and to translate research advances into value for users.

In this special session, we seek to promote interaction and collaboration between researchers working on different aspects of style and using different approaches. We encourage submissions that go beyond their technical or empirical contributions to also elaborate on how the work relates to the big picture of style in spoken dialog. We also welcome papers whose motivations, contributions, or implications highlight issues not commonly addressed at Interspeech.

Topics of interest include any aspects of speaking styles and interaction styles, including

  • style as it relates to expressiveness, pragmatic intents, genre, social role, social identity, stance, personality, entrainment, interpersonal dynamics, and so on
  • universal and language-specific aspects of style
  • style in monolog and dialog
  • how styles are realized through phonetic, prosodic, lexical, and turn-taking means
  • applications in dialog systems and beyond



  • Nigel Ward, University of Texas at El Paso
  • Kallirroi Georgila, University of Southern California
  • Yang Gao, Carnegie-Mellon University
  • Mark Hasegawa-Johnson, University of Illinois
  • Koji Inoue, Kyoto University
  • Simon King, University of Edinburgh
  • Rivka Levitan, City University of New York
  • Katherine Metcalf, Apple
  • Eva Szekely, KTH Royal Institute of Technology
  • Pol van Rijn, Max Planck Institute for Empirical Aesthetics
  • Rafael Valle, NVIDIA


Technological advancements have been rapidly transforming healthcare over the last several years, with speech and language tools playing an integral role. However, this brings a multitude of unique challenges that must be addressed to increase the generalisability, reliability, interpretability and utility of speech and language tools in healthcare and health research settings.

Many of these challenges are common to the two themes of this special session. The first theme, From Collection and Analysis to Clinical Translation, seeks to draw attention to all aspects of speech-health studies that affect the overall quality and reliability of any analysis undertaken on the data and thus affect user acceptance and clinical translation.

The second theme, Language Technology For Medical Conversations, covers a growing field of research in which automatic speech recognition and natural language processing tools are combined to automatically transcribe and interpret clinician-patient conversations and generate subsequent medical documentation.

By combining these themes, this session will bring the wider speech-health community together to discuss innovative ideas, challenges and opportunities for utilizing speech technologies within the scope of healthcare applications.

Suggested paper topics include, but are not limited to:

  • Data collection protocols and speech elicitation strategies
  • Device selection and related effects
  • Acceptance of data collection in different health cohorts
  • Longitudinal data collection and analysis
  • Patient and Public Involvement in speech research
  • User evaluation of speech technology in a healthcare setting
  • Feature extraction and novel representations that provide clinical interpretability
  • Advancements in analytics and machine learning methodologies that are clinically or biologically inspired
  • Fusion of linguistic and paralinguistic information
  • Health-related conversational analytics
  • Speech recognition and natural language processing in healthcare settings
  • Creation and annotation of medical conversation datasets
  • Role of medical conversation understanding in reducing documentation burden
  • Use of chatbots in healthcare
  • Spoken language technologies in real-world health settings
  • Utilising Electronic Health Records to personalise models in speech recognition or conversational analytics



  • Nicholas Cummins (King's College London and Thymia)
  • Thomas Schaaf (3M)
  • Heidi Christensen (University of Sheffield)
  • Judith Dineley (King’s College London and University of Augsburg)
  • Julien Epps (University of New South Wales)
  • Matt Gormley (Carnegie Mellon University)
  • Sandeep Konam (
  • Emily Mower Provost (University of Michigan)
  • Chaitanya Shivade (
  • Thomas Quatieri (MIT Lincoln Laboratory)



One of the greatest challenges for hearing-impaired listeners is understanding speech in the presence of background noise. Noise levels encountered in everyday social situations can have a devastating impact on speech intelligibility, and thus communication effectiveness, potentially leading to social withdrawal and isolation. Disabling hearing impairment affects 360 million people worldwide, with that number increasing because of the ageing population. Unfortunately, current hearing aid technology is often ineffective at restoring speech intelligibility in noisy situations.

To allow the development of better hearing aids, we need better ways to evaluate the speech intelligibility of audio signals. We need prediction models that can take audio signals and use knowledge of the listener's characteristics (e.g., an audiogram) to estimate the signal’s intelligibility. Further, we need models that can estimate intelligibility not just of natural signals, but also of signals that have been processed using hearing aid algorithms - whether current or under development.

The Clarity Prediction Challenge

As a focus for the session, we have launched the 'Clarity Prediction Challenge'. The challenge provides you with noisy speech signals that have been processed with a number of hearing aid signal processing systems, and corresponding intelligibility scores produced by a panel of hearing-impaired individuals. You are tasked with producing a model that can predict intelligibility scores given just the signals, their clean references and a characterisation of each listener's specific hearing impairment. The challenge will remain open until the Interspeech submission deadline and all entrants are welcome. (Note, the Clarity Prediction Challenge is part of a 5-year programme with further prediction and enhancement challenges planned for the future.)

Relevant Topics

The session welcomes submissions from entrants to the Clarity Prediction Challenge but also invites papers on topics in hearing impairment and speech intelligibility, including, but not limited to:

  • Statistical speech modelling for intelligibility prediction
  • Modelling energetic and informational noise masking
  • Individualising intelligibility models using audiometric data
  • Intelligibility prediction in online and low latency settings
  • Model-driven speech intelligibility enhancement
  • New methodologies for intelligibility model evaluation
  • Speech resources for intelligibility model evaluation
  • Applications of intelligibility modelling in acoustic engineering
  • Modelling interactions between hearing impairment and speaking style
  • Papers using the data supplied with the Clarity Prediction Challenge



  • Trevor Cox - University of Salford, UK
  • Fei Chen - Southern University of Science and Technology, China
  • Jon Barker - University of Sheffield, UK
  • Daniel Korzekwa - Amazon TTS
  • Michael Akeroyd - University of Nottingham, UK
  • John Culling - University of Cardiff, UK
  • Graham Naylor - University of Nottingham, UK


While spoofing countermeasures, promoted within the sphere of the ASVspoof challenge series, can help to protect reliability in the face of spoofing, they have been developed as independent subsystems for a fixed ASV subsystem. Better performance can be expected when countermeasures and ASV subsystems are both optimised to operate in tandem. The first spoofing-aware speaker verification (SASV) challenge aims to encourage the development of original solutions involving, but not limited to:

  • back-end fusion of pre-trained automatic speaker verification and pre-trained audio spoofing countermeasure subsystems;
  • integrated spoofing-aware automatic speaker verification systems that have the capacity to reject both non-target and spoofed trials.

While we invite the submission of general contributions in this direction, the Interspeech 2022 Spoofing-Aware Automatic Speaker Verification special session incorporates a challenge, SASV 2022. Potential authors are encouraged to evaluate their solutions using the SASV benchmarking framework, which comprises a common database, protocol and evaluation metric. Further details and resources can be found on the SASV challenge website.



  • Jee-weon Jung, Naver Corporation, South Korea
  • Hemlata Tak, EURECOM, France
  • Hye-jin Shim, University of Seoul, South Korea
  • Hee-Soo Heo, Naver Corporation, South Korea
  • Bong-Jin Lee, Naver Corporation, South Korea
  • Soo-Whan Chung, Naver Corporation, South Korea
  • Hong-Goo Kang, Yonsei University, South Korea
  • Ha-Jin Yu, University of Seoul, South Korea
  • Nicholas Evans, EURECOM, France
  • Tomi H. Kinnunen, University of Eastern Finland, Finland


Given the ubiquity of Machine Learning (ML) systems and their relevance in daily lives, it is important to ensure private and safe handling of data alongside equity in human experience. These considerations have gained considerable interest in recent times under the realm of Trustworthy ML. Speech processing in particular presents a unique set of challenges, given the rich information carried in linguistic and paralinguistic content, including speaker trait, interaction and state characteristics. This special session on Trustworthy Speech Processing (TSP) was created to bring together new and experienced researchers working on trustworthy ML and speech processing. We invite novel and relevant submissions from both academic and industrial research groups, showcasing advancements in theoretical, empirical, and real-world design of trustworthy speech applications.

Topics of interest cover a variety of papers centered on speech processing, including (but not limited to):

  • Differential privacy
  • Federated learning
  • Ethics in speech processing
  • Model interpretability
  • Quantifying & mitigating bias in speech processing
  • New datasets, frameworks and benchmarks for TSP
  • Discovery and defense against emerging privacy attacks
  • Trustworthy ML in applications of speech processing like ASR



  • Anil Ramakrishna, Amazon Inc.
  • Shrikanth Narayanan, University of Southern California
  • Rahul Gupta, Amazon Inc.
  • Isabel Trancoso, University of Lisbon
  • Rita Singh, Carnegie Mellon University


Human listening tests are the gold standard for evaluating synthesized speech. Objective measures of speech quality have low correlation with human ratings, and the generalization abilities of current data-driven quality prediction systems suffer significantly from domain mismatch. The VoiceMOS Challenge aims to encourage research in the area of automatic prediction of Mean Opinion Scores (MOS) for synthesized speech. This challenge has two tracks:

  • Main track: We recently collected a large-scale dataset of MOS ratings for a large variety of text-to-speech and voice conversion systems spanning many years, and this challenge releases this data to the public for the first time as the main track dataset.
  • Out-of-domain track: The data for this track comes from a different listening test than the main track. The purpose of this track is to study the generalization ability of proposed MOS prediction models to a different listening test context. A smaller amount of labeled data is made available to participants, along with unlabeled audio samples from the same listening test, to encourage exploration of unsupervised and semi-supervised approaches.

Participation is open to all. The main track is required for all participants, and the out-of-domain track is optional. Participants in the challenge are strongly encouraged to submit papers to the special session. The focus of the special session is on understanding and comparing MOS prediction techniques using a standardized dataset.


Challenge info page

CodaLab competition page (


  • Wen-Chin Huang (Nagoya University, Japan)
  • Erica Cooper (National Institute of Informatics, Japan)
  • Yu Tsao (Academia Sinica, Taiwan)
  • Hsin-Min Wang (Academia Sinica, Taiwan)
  • Tomoki Toda (Nagoya University, Japan)
  • Junichi Yamagishi (National Institute of Informatics, Japan)