Continuous Mobile Speaker Separation Based on Audio Data

Large streams of audio data facilitate research on everyday human behavior. Microphones, which are already integrated in most earables and smartphones, can unobtrusively record both environmental and person-related sounds, a modality that is less prone to social desirability bias than, for example, video-based sensing. Processing in-the-wild audio data with Affect Recognition [e.g., 1] and Activity Recognition [2] has implications for the ecological validity of studies [3], for clinical interventions (e.g., detecting depression [1]), and for user-adaptive wearable systems [4].

A major challenge for continuous audio recording is the data protection concern for third persons who did not actively agree to participate in the study. Therefore, in this thesis, we will develop a mobile prototype for privacy-preserving continuous audio monitoring.

To this end, we will build on existing Speaker Recognition and Verification tools [5-10]. Our approach focuses on mobility and on handling limited training data and computational power [11]. The student will implement a one-vs.-many speaker classification algorithm based on spoken-language features (e.g., MFCCs [12]). Such features can be obtained from public datasets (e.g., https://www.robots.ox.ac.uk/~vgg/data/voxceleb/).
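As a first orientation (not the method to be developed in the thesis), the following Python sketch illustrates one possible one-vs.-many baseline: the target speaker is enrolled from a few clips by averaging MFCC statistics, and new audio is accepted or rejected by cosine similarity against that template. The use of librosa, the file names, and the decision threshold are illustrative assumptions; the cited toolkits [8-10] provide much stronger speaker embeddings (e.g., x-vectors).

# Minimal one-vs.-many speaker verification baseline (illustrative sketch only).
# Assumptions: librosa and numpy are available, clips are mono WAV files, and the
# file paths and the decision threshold are placeholders, not project settings.
import numpy as np
import librosa


def clip_embedding(path, sr=16000, n_mfcc=20):
    """Represent a clip by the mean and std of its MFCCs (a crude embedding)."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])


def enroll(paths):
    """Average the embeddings of a few enrollment clips of the target speaker."""
    return np.mean([clip_embedding(p) for p in paths], axis=0)


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))


def is_target_speaker(template, path, threshold=0.85):
    """One-vs.-many decision: the enrolled participant vs. everyone else."""
    return cosine(template, clip_embedding(path)) >= threshold


if __name__ == "__main__":
    # Hypothetical enrollment clips of the study participant and one unknown clip.
    template = enroll(["participant_clip1.wav", "participant_clip2.wav"])
    print("keep audio:", is_target_speaker(template, "unknown_clip.wav"))

In a privacy-preserving pipeline, audio segments for which is_target_speaker returns False (i.e., third-party speech) would be discarded or never stored; the threshold would have to be calibrated on held-out data.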

Keywords: Audio data processing; Speech Recognition; Speaker Recognition; Speaker Verification; Machine Learning; In-the-Wild Data Collection; Mobile Computing

Tasks (Scope depends on the type of Thesis)

Literature review;
Implement Speaker Recognition algorithms in a mobile setting (e.g., smartphone or web application);
Design and carry out an in-the-wild evaluation to test performance, robustness, usability, social acceptance, etc.;
If necessary, the evaluation will compare several algorithms with regard to the criteria above;
Compare the pros and cons of AI tools for data collection.

What we offer

Professional advice on Data Science and Hardware at the TECO Lab and the KD² School (http://www.kd2school.info/);
Prior experience with Speaker Recognition;
A pleasant working atmosphere and constructive cooperation;
The chance to publish your work at top conferences;
Access to a large pool of participants and research material at the TECO Lab and the KD² Lab (https://www.kd2lab.kit.edu/).

Qualification

Proactive and communicative work style;
Good English reading and writing skills;
Experience with Machine Learning, Audio Signal Processing, and Speech Recognition;
Experience with Mobile Computing (e.g., Java, Kotlin).

Interested? Please contact: Tim Schneegans (schneegans@teco.edu)

References

[1] Ringeval, F., Schuller, B., Valstar, M., Cummins, N., Cowie, R., Tavabi, L., Schmitt, M., Alisamir, S., Amiriparian, S., Messner, E.-M., et al. (2019). AVEC 2019 workshop and challenge: State-of-mind, detecting depression with AI, and cross-cultural affect recognition. In Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop (pp. 3-12).
[2] Laput, G., Ahuja, K., Goel, M., & Harrison, C. (2018). Ubicoustics: Plug-and-play acoustic activity recognition. In Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology (pp. 213-224).
[3] Brunswik, E. (1956). Perception and the Representative Design of Psychological Experiments. University of California Press.
[4] Katayama, S., Mathur, A., Van den Broeck, M., Okoshi, T., Nakazawa, J., & Kawsar, F. (2019). Situation-aware emotion regulation of conversational agents with kinetic earables. In 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII) (pp. 725-731). IEEE.
[5] Kunz, M., Kasper, K., Reininger, H., Möbius, M., & Ohms, J. (2011). Continuous speaker verification in realtime. In BIOSIG 2011 – Proceedings of the Biometrics Special Interest Group.
[6] Kinnunen, T., & Li, H. (2010). An overview of text-independent speaker recognition: From features to supervectors. Speech Communication, 52(1), 12-40.
[7] Hansen, J. H. L., & Hasan, T. (2015). Speaker recognition by machines and humans: A tutorial review. IEEE Signal Processing Magazine, 32(6), 74-99.
[8] Lee, K. A., Vestman, V., & Kinnunen, T. (2021). ASVtorch toolkit: Speaker verification with deep neural networks. SoftwareX, 14, 100697.
[9] Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., & Khudanpur, S. (2018). X-vectors: Robust DNN embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5329-5333). IEEE.
[10] Wan, L., Wang, Q., Papir, A., & Moreno, I. L. (2018). Generalized end-to-end loss for speaker verification. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4879-4883). IEEE.
[11] Woo, R. H., Park, A., & Hazen, T. J. (2006). The MIT mobile device speaker verification corpus: Data collection and preliminary experiments. In 2006 IEEE Odyssey – The Speaker and Language Recognition Workshop (pp. 1-6). IEEE.
[12] Eyben, F., Scherer, K. R., Schuller, B. W., Sundberg, J., André, E., Busso, C., Devillers, L. Y., Epps, J., Laukka, P., Narayanan, S. S., et al. (2015). The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing, 7(2), 190-202.