By Maggie Smith, Director, Product Marketing, Developer Platforms
VoiceXML (VXML) is an open-standard extensible markup language that grew out of the increasing demand for an easy way to create audio-based applications. Like its web counterpart, HTML, VoiceXML follows the markup language model; its main objective is to make voice-based dialogs easy to develop. VoiceXML is used primarily to create interactive voice response (IVR) applications and is a natural fit for integration with text-to-speech (TTS) synthesis and automatic speech recognition (ASR) servers. NMS's Vision VoiceXML Server supports voice and video VXML-based applications and, in Release 3.0, allows interactive voice and video response (IVVR) applications to use the same speech servers now used with interactive voice solutions.
Here is an example of a more traditional voice-prompted VoiceXML application, with VXML dialog prompts for choosing a beverage. The voice dialog asks the user to choose a drink (tea, coffee, or milk). The user can press a DTMF key or say the name of the preferred drink. Each <choice> element within the menu shows which DTMF key is assigned to that drink.
<menu>
  <prompt>
    <par>
      <media src="choices.3gp"/> <!-- Video only -->
      <media type="application/ssml+xml">
        <speak version="1.0">
          For tea, say tea or press 1.
          For coffee, say coffee or press 2.
          For milk, say milk or press 3.
        </speak>
      </media>
    </par>
  </prompt>
  <choice dtmf="1" next="#tea">tea</choice>
  <choice dtmf="2" next="#coffee">coffee</choice>
  <choice dtmf="3" next="#milk">milk</choice>
</menu>
When the user says “tea,” the speech recognizer sends a trigger to the voice browser. The trigger is then matched to the appropriate <choice> element, and the VXML script moves from the menu dialog to the next dialog, as defined by the next attribute of that <choice> element. Here the “#” symbol indicates that the next dialog is within the same document, much as an HTML link can point to an anchor within the same page.
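As a minimal sketch of how the “#” targets might be laid out, the document below places the menu and its three target dialogs in a single file. The dialog ids tea, coffee, and milk come from the next attributes in the menu; the menu id, the audio-only prompt, and the confirmation wording are assumptions added for illustration, and the video prompt is left out so that the sketch stays within standard VoiceXML 2.1.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">

  <!-- The first dialog in the document is entered first.
       The id "drinks" is an assumption made for this sketch. -->
  <menu id="drinks">
    <prompt>
      For tea, say tea or press 1.
      For coffee, say coffee or press 2.
      For milk, say milk or press 3.
    </prompt>
    <!-- "#tea" is a fragment reference to the dialog with id="tea" below -->
    <choice dtmf="1" next="#tea">tea</choice>
    <choice dtmf="2" next="#coffee">coffee</choice>
    <choice dtmf="3" next="#milk">milk</choice>
  </menu>

  <!-- Target dialogs in the same document; the prompt wording is illustrative only -->
  <form id="tea">
    <block>
      <prompt>You chose tea.</prompt>
      <exit/>
    </block>
  </form>

  <form id="coffee">
    <block>
      <prompt>You chose coffee.</prompt>
      <exit/>
    </block>
  </form>

  <form id="milk">
    <block>
      <prompt>You chose milk.</prompt>
      <exit/>
    </block>
  </form>

</vxml>
```

Saying “coffee” or pressing 2 would match the second <choice> element and move the dialog to the form whose id is coffee.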
Now imagine this same VXML script, but with an interactive video stream instructing the user to choose a particular beverage, either by pressing a key on the telephone keypad or by requesting the drink by voice. After the user selects a drink, a second clip is displayed, confirming the beverage choice. For a spoken request, the ASR server recognizes the voice command and maps it to the requested beverage, and the VXML server then displays the video clip confirming that choice.
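The confirmation step might be sketched as a target dialog like the one below. It reuses the <par> and <media> prompt markup from the menu listing; whether that markup is accepted inside a form's <block>, the clip name tea_confirm.3gp, and the spoken wording are all assumptions made for illustration, not documented behavior of the Vision VoiceXML Server.

```xml
<!-- Hypothetical confirmation dialog, reached through next="#tea" in the menu.
     It would sit in the same document as the menu. -->
<form id="tea">
  <block>
    <prompt>
      <par>
        <!-- Assumed clip name; a real application supplies its own file -->
        <media src="tea_confirm.3gp"/>
        <media type="application/ssml+xml">
          <speak version="1.0">
            You chose tea.
          </speak>
        </media>
      </par>
    </prompt>
  </block>
</form>
```

Similar dialogs with the ids coffee and milk would confirm the other two choices.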
Using video as a visual aid to ask for and confirm the choice can reduce the time the user spends listening to voice dialogs and prompts. In our example, the user can make a beverage selection faster because the options are shown on screen rather than only read aloud.
Video applications using the Vision VoiceXML Server Release 3.0 can also play audio tracks from an alternate source. This new feature allows additional customization, such as dubbing a translated language track over a local-language video clip. Mobile TV clips in a local language, such as news or sports broadcasts, might have wider appeal if a translated audio source is dubbed over the audio embedded in the clip.
The Vision VoiceXML Server Release 3.0 supports the latest industry standards, including VoiceXML 2.1 and 2.0, Media Resource Control Protocol (MRCP) for access to speech recognition services, and Call Control eXtensible Markup Language (CCXML) to simplify call flow and control. In addition, a wide range of voice and video encoders and file formats is available, including H.263, AMR, and .3gp, for mobile handset and Internet Protocol (IP) deployments. The Vision VoiceXML Server also extends its video capabilities to include access to streamed video content servers, which simplifies connections to real-time video streams transmitted in the .3gp and H.263 formats for both handset and SIP deployments.
NMS’s Vision VoiceXML Server simplifies the technically challenging task of building interactive voice and video response applications. It provides the key elements (e.g., VXML and CCXML) for rapid development of robust, dynamic applications that involve complex media processes, and its support for ASR and TTS servers expands the flexibility of an interactive voice and video application.