CFN Original: The future of Automatic Speaker Verification, in conversation with Rosa González Hautäki

Supported By:

Net Patrol International Inc.: Data Investigation and Forensic Services, Bankruptcy and Insolvency Trustees

Voice recognition, and more recently facial recognition, has hit a fever pitch as a mainstream technology, in part because of the much-lauded (whether deservedly is another conversation) iPhone X and because of Amazon’s Alexa and Google Home. And while these new technologies have moved us forward on our societal journey toward a Jetsons reality, these new kinds of physical relationships with technology are also raising necessary questions about safety.

Most recently, the tech journalism floodgates have opened on the notion of manipulating these interactions to access our phones or home systems. Take the aforementioned Alexa, for example: researchers found a vulnerability that allowed hackers to listen in on conversations happening around the device. Luckily, Amazon has released patches that supposedly fix the issue, but the same sort of problem has been cropping up with Apple’s latest baby.

In a recent study published by the University of Eastern Finland titled “Acoustical and Perceptual Study of Voice,” researchers delved deep into the different aspects of “voice disguise by age modification in speaker verification.” We sat down with one of the principal authors of the study, Rosa González Hautäki, and asked her a number of questions about this field.

What spurred your interest in this type of research? Specifically, voice modification and automatic speaker verification.

“The common questions from the public relate to the accuracy of systems when the speaker has the flu, or whether impersonation is possible. Before we started our research, we could only answer those questions intuitively, and a big motivation for starting to research voice modification induced by the speaker was to be able to test systems and try to answer those questions based on experimentation and analysis of the data. I joined the research group at the precise time when this was one of the research goals, and I was happy to take on the challenge.

The research group at the University of Eastern Finland has more than 15 years of experience in speaker recognition research. Strong collaborations have been established with research institutes and groups internationally (Singapore, Japan, Edinburgh, and Georgia Tech in the USA, among others), so great advances were made in implementing and improving our Automatic Speaker Verification (ASV) system. Our group has participated in the National Institute of Standards and Technology (NIST) speaker recognition evaluations, whose goal is to provide a common evaluation for researchers worldwide and help the research community advance methods that improve the accuracy of automatic systems.

All this effort has focused on data from so-called ‘collaborative’ speakers, that is, speakers who wish to be verified, or on speech that is not intentionally modified.”
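At its core, the verification task Rosa describes comes down to comparing a test utterance against the enrolled model of a claimed speaker and accepting or rejecting the identity claim. As a minimal sketch (the embeddings and threshold below are invented for illustration; real ASV systems derive fixed-length embeddings from speech with models such as i-vectors or neural networks), the decision rule can look like this:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(enrolled_embedding, test_embedding, threshold=0.7):
    """Accept the identity claim if the embeddings are similar enough."""
    return cosine_similarity(enrolled_embedding, test_embedding) >= threshold

# Hypothetical 3-dimensional speaker embeddings, made up for this example.
enrolled = [0.9, 0.1, 0.4]
same_speaker = [0.85, 0.15, 0.35]      # verify(enrolled, same_speaker) -> True
different_speaker = [0.1, 0.9, 0.2]    # verify(enrolled, different_speaker) -> False
```

The threshold is the tuning knob: lowering it admits more impostors, raising it rejects more genuine users, which is exactly the trade-off Rosa discusses later in the interview.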

In the “Acoustical and perceptual study of voice,” it’s written that there is “interest to improve the robustness of speaker recognition against human-induced voice modifications.” What do you think this looks like moving forward, compared to where voice recognition is now?

“Earlier studies related to impersonation and disguise were conducted by phoneticians, who focus on modeling the voice production apparatus, but this knowledge was not considered or used in the technology for recognizing speakers. I think there is now more effort being put into incorporating knowledge from the acoustical and perceptual perspectives of speech into the speaker models of systems.

The research community has been aware that there are intra-speaker variations in speech; in other words, the same sentence is not said the same way by the same speaker every time. I see more awareness of the importance of looking into intra-speaker variations in order to treat speech as a strong person identifier. In practice, a multidisciplinary approach is necessary to advance speaker recognition technology: not only scientists with an engineering perspective, but also phoneticians and linguists with their valuable findings, must join efforts.”

How often are contextual voice characteristics taken into account when trying to distinguish the correct speaker from someone trying to scam the system? E.g., vocal effort, emotion, physical state.

“In practice these are not considered. There are multiple studies on detecting vocal effort, prominence of speech, and emotion, and on speech affected by health conditions (Parkinson’s, aphasia), but modern systems do not include a magic ‘block’ that processes speech to account for those variations.”

Amazon has started putting countermeasures in place for its home security and voice recognition software so it won’t be duped so easily. With fixes being rolled out so quickly after launch, are these just band-aids covering up a larger problem?

“Yes, there will be problems in the fast launching of technology; that is inevitable. But this will also surface problems that require solutions, and the research community is already looking into some of these, though in many cases without the real-world data.

I have no idea what type of countermeasures Amazon has put in place, but I think treating speech as a biometric requires a holistic approach. It requires not only technical and algorithmic solutions to the problem but also study of the complex speech signal, in order to represent it appropriately. In that sense, it is of key importance that technology incorporating speech applications, whether to interact with the user or to verify identity for transactions, deals with data from real-world applications and not only with controlled lab conditions.”

It seems that face and voice recognition are being hacked and manipulated rather quickly. Do you see this as technology evolving faster than ever before and, despite the potential security risks, causing us not to seriously consider the implications of using such tech?

“Biometrics, whether face, speech, or fingerprints, definitely has the possibility of making mistakes: either rejecting the genuine user (an error known as a miss) or accepting impostors (errors known as false alarms or false acceptances). That’s why the more data that’s used in training the systems’ models, the more accurate the systems can be. And we need to consider all possible factors, not only the involuntary modifications but also the intentional ones. The idea is to push forward, and as the systems become more robust, hopefully they will make fewer mistakes. But on the other hand, there are some conditions still under study, and that could be the reason why speaker verification applications are not deployed so widely: so that they don’t cause inconvenience to the user.”
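The two error types Rosa names can be made concrete with a small sketch. Assuming we have similarity scores from genuine trials (same speaker) and impostor trials (different speaker) — the score lists below are made up for illustration — the miss rate, the false-alarm rate, and the threshold where the two roughly coincide (the equal error rate often reported for ASV systems) can be computed like this:

```python
def error_rates(genuine, impostor, threshold):
    """Miss rate: genuine trials scored below threshold (wrongly rejected).
    False-alarm rate: impostor trials scored at/above threshold (wrongly accepted)."""
    miss = sum(s < threshold for s in genuine) / len(genuine)
    false_alarm = sum(s >= threshold for s in impostor) / len(impostor)
    return miss, false_alarm

def equal_error_rate(genuine, impostor):
    """Sweep candidate thresholds; return the one where |miss - false alarm| is smallest."""
    candidates = sorted(set(genuine) | set(impostor))
    best = min(candidates, key=lambda t: abs(
        error_rates(genuine, impostor, t)[0] - error_rates(genuine, impostor, t)[1]))
    return best, error_rates(genuine, impostor, best)

# Hypothetical verification scores (higher = more similar to the claimed speaker).
genuine = [0.82, 0.91, 0.78, 0.65, 0.88, 0.73]
impostor = [0.31, 0.45, 0.52, 0.68, 0.40, 0.25]

miss, fa = error_rates(genuine, impostor, threshold=0.6)  # miss = 0.0, fa ~ 0.167
```

Moving the threshold trades one error for the other, which is why her point about training on more varied data, including disguised and involuntary voice modifications, matters: better-separated score distributions lower both errors at once.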

You can find the full study that Rosa worked on, here.