AWS Machine Learning Blog
Integrating Amazon Polly with legacy IVR systems by converting output to WAV format
Amazon Web Services (AWS) offers a rich stack of artificial intelligence (AI) and machine learning (ML) services that help automate several components of the customer service industry. Amazon Polly, an AI-powered text-to-speech service, enables you to automate and scale your interactive voice solutions, helping to improve productivity and reduce costs.
You might face implementation challenges when updating or modifying legacy interactive voice response (IVR) systems that don’t support file formats such as MP3 and PCM. To minimize response latency, Amazon Polly synthesizes speech in real time and streams the results back to the customer in a streamable format (MP3, Ogg/Vorbis, or raw PCM samples) while the request is still being processed. The WAV format stores the length of the audio data in the file header, so it is not streamable in the same way. However, you can easily create a WAV file from the PCM stream that Amazon Polly generates once synthesis has finished, when all samples are collected and the length of the result can be calculated. This post shows you how to convert Amazon Polly output to a common audio format like WAV.
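To illustrate why the length must be known first, here is a minimal sketch of the 44-byte header of a canonical PCM WAV file. The function name `wav_header` and its defaults are assumptions made for illustration; the defaults match the signed 16-bit, mono, little-endian PCM that Amazon Polly returns at a 16 kHz sample rate.

```python
import struct


def wav_header(data_size, sample_rate=16000, channels=1, bits_per_sample=16):
    """Build the 44-byte header of a canonical PCM WAV file (illustrative helper)."""
    block_align = channels * bits_per_sample // 8  # bytes per sample frame
    byte_rate = sample_rate * block_align          # bytes per second of audio
    return struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF",
        36 + data_size,   # size of everything after this field
        b"WAVE",
        b"fmt ",
        16,               # size of the fmt chunk for PCM
        1,                # audio format 1 = uncompressed PCM
        channels,
        sample_rate,
        byte_rate,
        block_align,
        bits_per_sample,
        b"data",
        data_size,        # size of the sample data in bytes
    )
```

Both size fields depend on `data_size`, which is only available after the last PCM sample has arrived.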
Converting Amazon Polly file output to WAV
One of the challenges with legacy systems is that they may not support Amazon Polly output formats like MP3. Some legacy IVRs expect audio in WAV file format, which the Amazon Polly SynthesizeSpeech API call doesn’t produce natively. Many of these applications are written in Python and Java.
The following sample code helps in situations where you need audio in WAV file format, which Amazon Polly doesn’t support natively. The sample code converts Amazon Polly output from PCM to WAV in Python for inputs given in both SSML and text.
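A minimal sketch of such a conversion, not necessarily the post's original listing, could look like the following. It assumes `boto3` is installed and AWS credentials are configured. The function names, the `Joanna` voice, and the 16 kHz sample rate are illustrative choices. The standard-library `wave` module writes the WAV header once all samples are available.

```python
import wave

# Amazon Polly PCM output is signed 16-bit, mono, little-endian.
CHANNELS = 1
SAMPLE_WIDTH_BYTES = 2


def pcm_to_wav(pcm_bytes, wav_path, sample_rate=16000):
    """Wrap raw PCM samples in a WAV container and write them to wav_path."""
    with wave.open(wav_path, "wb") as wav_file:
        wav_file.setnchannels(CHANNELS)
        wav_file.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)


def synthesize_to_wav(text, wav_path, text_type="text",
                      voice_id="Joanna", sample_rate=16000):
    """Synthesize plain text or SSML with Amazon Polly and save it as WAV.

    voice_id and sample_rate are illustrative defaults, not requirements.
    """
    import boto3  # imported lazily so pcm_to_wav works without boto3

    polly = boto3.client("polly")
    response = polly.synthesize_speech(
        Text=text,
        TextType=text_type,           # "text" or "ssml"
        OutputFormat="pcm",
        SampleRate=str(sample_rate),  # passed as a string; PCM allows 8000 or 16000
        VoiceId=voice_id,
    )
    # Read the whole stream first: the WAV header needs the final length.
    pcm_bytes = response["AudioStream"].read()
    pcm_to_wav(pcm_bytes, wav_path, sample_rate)
```

For plain text, you might call `synthesize_to_wav("Hello from Amazon Polly.", "hello.wav")`. For SSML, pass the markup as `text` and set `text_type="ssml"`.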
The following is the sample output from the preceding code:
Conclusion
You can convert Amazon Polly output from PCM to WAV so that you can use Amazon Polly in your legacy IVR, enabling it to support WAV file format output. Try this out for yourself and let us know how it goes in the comments!
You can further refine the synthesized speech using the powerful capabilities available in Amazon Polly, such as the options of the SynthesizeSpeech request, lexicon management, reserved characters in SSML, and control of volume, speaking rate, and pitch.
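As a small illustration of the SSML points above, the following sketch escapes reserved characters and wraps the text in a `prosody` tag. The helper name `to_ssml` and its defaults are assumptions for illustration, and some voice engines may not support every prosody attribute.

```python
from xml.sax.saxutils import escape


def to_ssml(text, volume="medium", rate="medium", pitch="medium"):
    """Escape reserved characters and wrap text in an SSML prosody tag."""
    # escape() handles &, <, and >; the extra mapping covers both quote characters.
    safe_text = escape(text, {'"': "&quot;", "'": "&apos;"})
    return (
        f'<speak><prosody volume="{volume}" rate="{rate}" pitch="{pitch}">'
        f"{safe_text}</prosody></speak>"
    )
```

The resulting string can be sent to the SynthesizeSpeech API with the text type set to SSML.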
About the Author
Abhishek Soni is a Partner Solutions Architect at AWS. He works with customers to provide technical guidance for the best outcome of workloads on AWS.