07/05/2020

Paying by Voice: Combining the Payment Request and Speech APIs

by Alex Libby

Let me ask you a question: how often do you buy items online?

I’ll bet that, as part of this, you have to wade through checkout forms of varying lengths, or create an account. On some sites, this is enough to put you off, or at least make you question whether you want to go through with your purchase, as it seems like the retailer wants to make life difficult for you… sound familiar?

What if I were to tell you that research by the Baymard Institute suggests that the average checkout form contains 14.88 fields? Yes, you read that correctly: a staggering 14.88 fields! (Okay, this is a little too precise, but you get the picture.) Indeed, the same research ranks the length of checkout forms as the third-highest reason why customers drop out between initiating and completing the checkout process.

It’s something I touched on in one of my previous books, Checking out with the Payment Request API, available from Apress: I’m still amazed at how this fantastic API will shake up the payment process, even though it has yet to reach official status with the W3C! However, I wanted to take it even further, and prove something: can we use it to create something that could be controlled purely by the power of voice?

Although it might seem a little far-fetched, the reality is closer than we think: it is indeed possible to create something that is at least partially controllable using the power of speech. To help prove this, I’ve created a demo that illustrates how we might achieve this:

[Embedded CodePen demo]

The two APIs have been available for a few years – Web Speech first appeared on the scene around 2012, while the Payment Request API dates from the end of 2017. Both are gaining usage rapidly and are reasonably stable, even though they have yet to receive official status from the W3C. Although the Payment Request (PR) API operates as a standalone entity, the Speech API consists of two – for want of a better phrase – sub-APIs: one for speech synthesis, and the other for speech recognition. It’s the latter of the two that I want to focus on in this article.

So – how would we go about combining the PR and Speech Recognition APIs? The great thing about both APIs is that we can control them using nothing more than standard JavaScript. The example I’ve created in the CodePen demo uses standard markup and CSS styling attributes, but is a little lengthy; for this article, I want to focus on the Speech parts of the demo.

At the crux of this demo, we create a SpeechRecognition object, from around line 22:

// Chrome exposes the constructor under a webkit prefix, so fall back to it
const SpeechRecognition = window.SpeechRecognition ||
  window.webkitSpeechRecognition;

const recognition = new SpeechRecognition();
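
The constructor lookup above throws if the browser exposes neither name. As a minimal sketch of a more defensive version (the createRecognition helper is my own illustration and is not part of the demo), we might wrap the lookup like this:

```javascript
// Hypothetical helper (not from the demo): returns a recognition instance,
// or null when neither the standard nor the prefixed constructor exists.
function createRecognition(globalObject) {
  const Recognition =
    globalObject.SpeechRecognition || globalObject.webkitSpeechRecognition;
  return Recognition ? new Recognition() : null;
}

// In a browser we pass window; elsewhere there is nothing to look up.
const recognition =
  typeof window !== 'undefined' ? createRecognition(window) : null;

if (recognition === null) {
  console.log('Speech recognition is not available in this environment.');
}
```

Passing the global object in as a parameter keeps the helper easy to exercise outside a browser.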

Next, we define several properties – namely interimResults, maxAlternatives, and continuous – to control whether the API displays text as we speak, the maximum number of alternative transcriptions it returns for each result, and whether it should keep checking for speech continuously. The first one is of particular importance, as the API can either display results as you speak or wait until you’ve finished speaking. Let us just say that the former can lead to some weird results!
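
To make those three properties concrete, here is a small sketch. The values are assumptions chosen for illustration (the demo's actual settings may differ), and configureRecognition is a hypothetical helper of mine:

```javascript
// Assumed values for illustration only; the demo may use different ones.
const recognitionSettings = {
  interimResults: false, // only report results once we have finished speaking
  maxAlternatives: 1,    // ask for a single alternative per result
  continuous: false,     // stop listening after one utterance
};

// Hypothetical helper: copy the settings onto a recognition object.
function configureRecognition(recognition, settings) {
  return Object.assign(recognition, settings);
}

// Browser usage (assumes `recognition` was created as shown earlier):
// configureRecognition(recognition, recognitionSettings);
```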

That aside, we then set up some event handlers – the first, for voice-button, allows us to ask the browser to start the speech recognition engine. Chrome blocks automatic initiation of the engine; it has to be launched by a deliberate action (such as clicking a button), and cannot trigger on page load.
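
A rough sketch of that first handler might look like the following. I am assuming here that voice-button is the id of a button element; wireVoiceButton is a hypothetical name, not one taken from the demo:

```javascript
// Hypothetical helper: start recognition only in response to a click,
// never automatically on page load.
function wireVoiceButton(button, recognition) {
  button.addEventListener('click', () => {
    recognition.start();
  });
}

// Browser-only wiring; skipped when there is no document to work with.
if (typeof document !== 'undefined' && typeof window !== 'undefined') {
  const Recognition =
    window.SpeechRecognition || window.webkitSpeechRecognition;
  const button = document.getElementById('voice-button'); // id is assumed
  if (Recognition && button) {
    wireVoiceButton(button, new Recognition());
  }
}
```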

The next three event handlers take care of initiating the speech recognition service, determining whether speech has been detected, and handling the response; it’s the result handler that determines what action to take if it can match text against the result of the speech. We need to get this right, as the API isn’t perfect; when you play with it for a while, you will begin to see that it doesn’t always detect what you’ve said, so choosing the right conditions is essential. At the same time, we also read a confidence level from the confidence property of the recognised alternative (reached through the results list on the result event, rather than on the SpeechRecognition object itself), which we convert into a percentage value and render on-screen.
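
As a sketch of what a result handler could do, the two helpers below read the top alternative from a result event and match it against a list of commands. The helper names, the command words and the 50% threshold are all assumptions of mine for illustration; the demo's own conditions may well differ:

```javascript
// Hypothetical helper: read the top alternative of the latest result and
// express its confidence (a number from 0 to 1) as a whole percentage.
function readTopAlternative(event) {
  const latest = event.results[event.results.length - 1];
  const { transcript, confidence } = latest[0];
  return {
    transcript: transcript.trim().toLowerCase(),
    confidencePercent: Math.round(confidence * 100),
  };
}

// Hypothetical matcher: return the first command found in the transcript,
// or null if nothing matches or the confidence is below the threshold.
function matchCommand(heard, commands, minPercent) {
  if (heard.confidencePercent < minPercent) return null;
  return commands.find((command) => heard.transcript.includes(command)) || null;
}

// Browser usage (assumes `recognition` exists; command words are made up):
// recognition.onresult = (event) => {
//   const heard = readTopAlternative(event);
//   const command = matchCommand(heard, ['checkout', 'add'], 50);
//   console.log(command, heard.confidencePercent + '%');
// };
```

Keeping the matching separate from the event wiring makes it easier to experiment with different conditions when the engine mishears you.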

The next three event handlers look after what happens when no more speech is detected; the Speech API uses the typical on<name of action> naming (such as onspeechend) for the handler properties that tell us when an action has taken place, so (in this case) the speechend event determines the resulting outcome. We then feed values into the Payment Request API object from line 84 onwards, which we fire by saying “checkout” into our laptop or PC microphone, thereby completing the purchase process.
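
To round this off, here is a hedged sketch of how the speechend handler and the hand-off to the Payment Request API might fit together. Everything named here is illustrative rather than lifted from the demo: the helper names, the basket shape, the GBP currency, and the payment method URL (a placeholder that no real payment handler answers to). Note, too, that a browser may refuse to show the payment sheet unless the call follows a user gesture.

```javascript
// Hypothetical helper: stop listening once speech has ended.
function stopOnSpeechEnd(recognition) {
  recognition.onspeechend = () => {
    recognition.stop();
  };
}

// Hypothetical check: treat "checkout" (or "check out") as the trigger word.
function isCheckoutCommand(transcript) {
  return transcript.toLowerCase().replace(/\s+/g, '').includes('checkout');
}

// Hypothetical helper: build Payment Request details from basket items
// priced in pence, so the total is summed as whole numbers.
function buildPaymentDetails(items, currency) {
  const toAmount = (pence) => ({ currency, value: (pence / 100).toFixed(2) });
  const totalPence = items.reduce((sum, item) => sum + item.priceInPence, 0);
  return {
    displayItems: items.map((item) => ({
      label: item.label,
      amount: toAmount(item.priceInPence),
    })),
    total: { label: 'Total', amount: toAmount(totalPence) },
  };
}

// Browser-only: resolves to false if the word was not "checkout" or the
// Payment Request API is missing; otherwise shows the payment sheet.
async function checkoutByVoice(transcript, items) {
  if (!isCheckoutCommand(transcript)) return false;
  if (typeof PaymentRequest === 'undefined') return false;
  const methods = [{ supportedMethods: 'https://example.com/pay' }]; // placeholder
  const request = new PaymentRequest(methods, buildPaymentDetails(items, 'GBP'));
  const response = await request.show();
  await response.complete('success');
  return true;
}

// Example wiring inside a result handler (browser only, `basket` assumed):
// recognition.onresult = (event) => {
//   const transcript = event.results[event.results.length - 1][0].transcript;
//   checkoutByVoice(transcript, basket).catch(console.error);
// };
```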

This article is based on demos from my book, Introducing the HTML5 Web Speech API – this is available from Apress.

About the Author

Alex Libby is a front-end engineer and seasoned computer book author who hails from England. His passion for all things Open-Source dates back to the days of his degree studies, when he first came across web development; he has been hooked ever since. His daily work involves extensive use of JavaScript, React, HTML, and CSS as a front-end engineer for a major distributor. Alex enjoys tinkering with different open source libraries to see how they work. He has spent a stint maintaining the jQuery Tools library and enjoys writing about open-source technologies, principally for front-end UI development.

This article was contributed by Alex Libby, author of Introducing the HTML5 Web Speech API.