We have been focusing a lot recently on the foundations of artificial intelligence, namely machine learning and deep learning, analyzing their impact on the job market, society, fundamental research, and various business areas such as travel and financial advisory.

Recent news and the ever-growing interest in language technologies and natural language processing have prompted me to return to one of the original subjects of our publications: speech recognition.

In this regard, one can turn back to Andrew Ng, who was mentioned last week. The artificial intelligence guru considers voice input to be the future of the human-machine interface and sees the future as being “all about talking to machines”[i], especially in the domain of the Internet of Things. Andrew Ng pointed to deep learning algorithms (or neural networks) as the tool for improving speech recognition, while the main challenge is to separate the voice from background noise (for instance, when talking to one’s smartphone while driving). The ability to filter out the useful conversation is known as the “cocktail party effect” and has so far seemed to be reserved to the human brain; at the moment, speech recognition performs poorly in noisy conditions.

MIT Technology Review reports on the work of Andrew Simpson at the University of Surrey in the UK, who “used some of the most recent advances associated with deep neural networks to separate human voices from the background in a wide range of songs”[ii]. The article describes training the algorithm on a set of songs and then applying the learned pattern recognition to new songs, with rather remarkable results. According to the article, “The outputs turned out to be impressive”: “These results demonstrate that a convolutional deep neural network approach is capable of generalizing voice separation, learned in a musical context, to new musical contexts.”[iii]
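The article does not publish Simpson’s code, but the general idea of mask-based source separation with a convolutional network can be sketched roughly as below. This is a minimal illustration in PyTorch, assuming magnitude spectrograms as input and a vocal mask as the training target; the layer sizes, names, and dummy data are invented for the example and are not taken from the paper.

```python
import torch
import torch.nn as nn

class VocalMaskNet(nn.Module):
    """Toy convolutional network mapping a magnitude spectrogram
    (1 x freq_bins x time_frames) to a soft vocal mask of the same shape.
    Multiplying the mixture spectrogram by this mask gives an estimate
    of the isolated voice."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),  # mask values between 0 and 1
        )

    def forward(self, spectrogram):
        return self.net(spectrogram)

# Training step sketch: real mixtures and vocal masks would come from a
# dataset of songs with separated stems (random tensors stand in here).
model = VocalMaskNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

mixture = torch.rand(8, 1, 513, 128)                       # fake spectrogram batch
target_mask = (torch.rand(8, 1, 513, 128) > 0.5).float()   # fake vocal masks

pred_mask = model(mixture)
loss = loss_fn(pred_mask, target_mask)
loss.backward()
optimizer.step()

estimated_vocals = mixture * pred_mask.detach()  # masked spectrogram of the voice
```

In practice the masked spectrogram would be converted back to audio using the mixture’s phase; the point here is only to illustrate that the separation is learned as a pattern-recognition problem over spectrogram patches, which is what allows it to generalize to songs the network has never seen.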

These advancements matter given the direction speech recognition is taking in the mobile world.

According to Android Police, Google is pushing towards full voice access in Android, with a developer session scheduled at I/O 2015. The session apparently proposes integrating voice commands into third-party apps: “basically, instead of touching the phone, you talk to it to control apps”[iv]. “Developers who take advantage of Voice Actions will be able to design apps that use voice commands instead of touch input.” (The session no longer appears in the schedule, but hopefully the Voice Actions kit will still be revealed.)

Another player in the field, Microsoft, is moving in the same direction. After introducing its digital assistant Cortana last year, Microsoft has released APIs that “make it possible to add image and speech processing to just about any application, often by using just a single Web request”[v]. The system goes further, proposing “speech and text intent detection”[vi] services that have not yet been made public but are clearly aimed at identifying the context and determining a more exact meaning of spoken phrases and the concepts behind the words.
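The Ars Technica article does not spell out the request format, but the “single Web request” pattern it describes generally looks something like the sketch below (Python with the requests library). The endpoint URL, API key, and JSON field names here are placeholders chosen for illustration, not Microsoft’s actual API.

```python
import requests

# Placeholder values: a real deployment would use the provider's actual
# endpoint and a subscription key issued for the speech service.
ENDPOINT = "https://api.example.com/speech/recognize"  # hypothetical URL
API_KEY = "your-subscription-key"

def transcribe(audio_path):
    """Send one HTTP request with raw audio and return the transcription.
    Illustrative only: header and field names are assumptions, not the real schema."""
    with open(audio_path, "rb") as f:
        audio_bytes = f.read()
    response = requests.post(
        ENDPOINT,
        headers={
            "Ocp-Apim-Subscription-Key": API_KEY,  # typical subscription-key header style
            "Content-Type": "audio/wav",
        },
        data=audio_bytes,
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("text", "")

if __name__ == "__main__":
    print(transcribe("sample.wav"))
```

The appeal of this model is precisely its simplicity: the heavy lifting (acoustic modeling, language modeling, and eventually intent detection) stays on the provider’s servers, and the application only ships audio and reads back structured results.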

All of this points to new advances in natural language processing and understanding, a field that has gained momentum with the development of deep learning algorithms and of mobile technologies and wearables that require simpler human-machine interaction methods. These techniques are also crucial for building the Internet of Things, i.e. connected objects that do not necessarily have keyboards, as predicted in our forecast of the technologies of the year to come.

image source: fotolia.fr


[i] Future of mobile, IoT driven by speech recognition: Andrew Ng by Rebecca Merrett for CIO, May 6, 2015, online http://www.cio.com.au/article/574317/future-mobile-iot-driven-by-speech-recognition-andrew-ng/, accessed on May 18, 2015

[ii] Deep Learning Machine Solves the Cocktail Party Problem, MIT Technology Review, April 29, 2015, online http://www.technologyreview.com/view/537101/deep-learning-machine-solves-the-cocktail-party-problem/, accessed on May 8, 2015

[iii] Deep Learning Machine Solves the Cocktail Party Problem, MIT Technology Review, April 29, 2015, online http://www.technologyreview.com/view/537101/deep-learning-machine-solves-the-cocktail-party-problem/, accessed on May 8, 2015

[iv] Google Will Unveil Voice Access At I/O, Allowing You To Control Apps Entirely By Voice by Ryan Whitman for Android Police, online http://www.androidpolice.com/2015/05/06/google-will-unveil-voice-access-at-io-allowing-you-to-control-apps-entirely-by-voice/, accessed on May 19, 2015

[v] Cortana for all: Microsoft’s plan to put voice recognition behind anything by Sean Gallagher for Ars Technica, May 15, 2015, online http://arstechnica.com/information-technology/2015/05/cortana-for-all-microsofts-plan-to-put-voice-recognition-behind-anything/, accessed on May 18, 2015

[vi] Cortana for all: Microsoft’s plan to put voice recognition behind anything by Sean Gallagher for Ars Technica, May 15, 2015, online http://arstechnica.com/information-technology/2015/05/cortana-for-all-microsofts-plan-to-put-voice-recognition-behind-anything/, accessed on May 18, 2015