AWS Marketplace
Start building voice intelligence with AssemblyAI’s speech-to-text model from AWS Marketplace
Voice intelligence and speech-to-text (STT) technology have become essential as organizations collect thousands of hours of calls, meetings, and customer interactions daily. Raw audio alone doesn't drive decisions; organizations need intelligence to extract value from voice data at scale. Voice intelligence combines speech recognition, natural language processing (NLP), and machine learning (ML) to transform voice data into actionable insights. Modern STT models transcribe conversations accurately and work with additional tools to analyze sentiment, detect key topics, and generate automated summaries for deeper insights. Voice intelligence and STT technology serve many industry use cases, including call analysis and conversational intelligence, healthcare documentation, customer service, video content optimization, legal discovery and compliance, and sales intelligence and coaching. With the emergence of generative AI and improved models, the demand for effective STT models continues to grow across these applications.
AssemblyAI, an independent software vendor (ISV) in AWS Marketplace, is a research-oriented organization focused on advancing and democratizing speech AI technology for the world. Founded in 2017, they’ve built a team of interdisciplinary research leaders, scientists, and engineers dedicated to creating superhuman speech AI models that unlock new possibilities for voice data applications.
AssemblyAI technology serves thousands of customers and hundreds of thousands of developers worldwide through a simple, developer-friendly API. AssemblyAI provides comprehensive speech AI capabilities including:
- Core speech-to-text transcription
- Speaker detection
- Automatic language detection
- Sentiment analysis
- Chapter detection
- Personally identifiable information (PII) redaction
The Universal-2 model demonstrates AssemblyAI’s commitment to pushing the boundaries of what’s possible in speech AI. This model achieves high accuracy by addressing key challenges in speech recognition, improving proper noun accuracy, formatting and casing, and timestamp generation. AssemblyAI takes a research-focused approach to building accurate, capable speech AI models that integrate easily.
This post shows how to get started with AssemblyAI’s APIs from AWS Marketplace and build initial proofs of concept (POCs) by calling these model APIs in a few steps.
Solution overview
AssemblyAI’s speech-to-text service processes audio through a two-stage pipeline. The first stage uses the Universal-2 automatic speech recognition (ASR) model, a 600M parameter Conformer RNN-T model trained on 12.5M hours of multilingual audio data. This model converts speech to text while handling multiple speakers, accents, and background noise. The second stage employs neural models for text formatting, handling tasks like punctuation, capitalization, and text normalization to produce clean, readable transcripts.
Beyond basic transcription, customers can enable additional intelligence models that run alongside the core ASR process. These include speaker identification to track who said what, sentiment analysis to understand emotional context, topic detection to automatically categorize conversations, content summarization to extract key points, and PII redaction to maintain privacy compliance. All these models work together seamlessly through the same API interface. The following diagram shows the high-level architecture.

Figure 1: High-level architecture of AssemblyAI's transcription API
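To make this concrete, the following minimal sketch (using the AssemblyAI Python SDK set up in the prerequisites below) shows how several intelligence models can be enabled alongside core transcription in a single request; the specific combination of models and the audio URL are illustrative:

import assemblyai as aai

# Enable several intelligence models alongside core ASR in one configuration.
config = aai.TranscriptionConfig(
    speaker_labels=True,      # who said what
    sentiment_analysis=True,  # emotional context
    iab_categories=True,      # automatic topic detection
)
# Placeholder audio URL for illustration.
transcript = aai.Transcriber(config=config).transcribe("http://example.org/call.mp3")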
Prerequisites
Before starting, make sure you have the following prerequisites:
- An Amazon Web Services (AWS) account with access to Amazon Simple Storage Service (Amazon S3).
- You can purchase AssemblyAI's API in AWS Marketplace or visit AssemblyAI's site to request a trial account. Trial accounts are preloaded with credits that customers can use immediately for POC testing.
- After successfully creating an account with AssemblyAI, save your API key in a safe place.
- Execute the following Python code to prepare for the scenarios in the solution walkthrough:
!pip install assemblyai
import assemblyai as aai
aai.settings.api_key = "xxxxxxxx" #your AssemblyAI API key
Solution walkthrough
In this section, we dive into five cases where AssemblyAI's API delivers high value. Each case includes a code snippet that you can test in your own environment.
- Transcribe audio from a local file
- Transcribe an audio file from Amazon S3
- Speaker diarization
- Automatic language detection
- PII redaction
Transcribe audio from a local file
This is the basic setup, in which the audio files reside on the local machine where the code executes. The AssemblyAI API supports most common audio and video file formats, such as mp3, m4a, m4p, wav, or wma. It's recommended that your audio file stays in its native format without additional transcoding or file conversion. For a more detailed discussion of audio file formats, refer to this AssemblyAI blog. Download a publicly available audio file from an AssemblyAI hosted website and save it to a local folder. Execute the following code snippet to perform the transcription:
# Transcribe audio from a local file
transcriber = aai.Transcriber()
transcript = transcriber.transcribe("./Audios/ford_clip_trimmed.mp3")
print(transcript.text)
The result should be similar to the following transcript:
Good evening. Last January 15th, I went before your senators and representatives in Congress with a comprehensive plan to make our country independent of foreign sources of energy. By 1985. Such a program was long overdue. We have become increasingly at the mercy of others for the fuel on which our entire economy runs. Here are the facts and figures that will not go away. The United States is dependent on foreign sources for about 37% of its present petroleum needs. In 10 years, if we do nothing, we will be importing more than half our oil at prices fixed by others if they choose to sell to us at all. In two and a half years, we will be twice as vulnerable to a foreign oil embargo as we were two winters ago. We are now paying out $25 billion a year for foreign oil. Five years ago, we paid out only $3 billion annually. Five years from now, if we do nothing, who knows how many more billions will be flowing out of the United States.
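Transcription runs asynchronously, so in production code it's worth checking the job status before using the text. A minimal sketch, using the status and error fields the SDK exposes on the transcript object:

transcript = transcriber.transcribe("./Audios/ford_clip_trimmed.mp3")
if transcript.status == aai.TranscriptStatus.error:
    # The job failed; the error field explains why (bad URL, unsupported file, and so on).
    print(f"Transcription failed: {transcript.error}")
else:
    print(transcript.text)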
Transcribe an audio file from Amazon S3
In many organizations, audio data is saved in cloud storage, such as Amazon S3. To transcribe an audio file from an S3 bucket, AssemblyAI needs temporary access to the file. To provide this access, you need to generate a presigned URL, which is a URL that has temporary access rights built in. For more details on how to generate a presigned URL, refer to Sharing objects with presigned URLs.
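As a minimal sketch, you can generate a presigned URL with boto3 (the bucket name and object key below are placeholders):

import boto3

s3_client = boto3.client("s3")
# Grant temporary read access to the object (one hour in this example).
p_url = s3_client.generate_presigned_url(
    ClientMethod="get_object",
    Params={"Bucket": "your-bucket-name", "Key": "path/to/audio.mp3"},
    ExpiresIn=3600,
)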
Execute the following code snippet to perform the transcription:
import requests
import time

p_url = "S3 presigned URL"
assembly_key = "xxxxxxxx"  # your AssemblyAI API key
# Use your AssemblyAI API key for authorization.
headers = {"authorization": assembly_key, "content-type": "application/json"}
# Specify AssemblyAI's transcription API endpoint.
transcript_endpoint = "http://api.assemblyai.com/v2/transcript"
# Use the presigned URL as the `audio_url` in the POST request.
payload = {"audio_url": p_url}
# Queue the audio file for transcription with a POST request.
post_response = requests.post(transcript_endpoint, json=payload, headers=headers)
# Specify the endpoint of the transcription job.
get_endpoint = transcript_endpoint + "/" + post_response.json()["id"]
# GET request the transcription.
get_response = requests.get(get_endpoint, headers=headers)
# If the transcription has not finished, wait until it has.
while get_response.json()["status"] != "completed":
    time.sleep(5)
    get_response = requests.get(get_endpoint, headers=headers)
# Once the transcription is complete, print it out.
print(get_response.json()["text"])
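Alternatively, because the Python SDK accepts any reachable URL, you can pass the presigned URL directly to the transcriber used earlier and let the SDK handle the polling:

# The SDK polls the transcription job for you.
transcript = aai.Transcriber().transcribe(p_url)
print(transcript.text)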
Speaker diarization
Speaker diarization is a critical component in audio analysis because it establishes who spoke and when in an audio recording. This capability is essential for a wide range of tasks, such as adding clarity and structure to transcripts, enabling advanced analytics, and supporting personalization and customization.
Execute the following code snippet to perform the transcription:
config = aai.TranscriptionConfig(speaker_labels=True)
transcriber = aai.Transcriber(config=config)
FILE_URL = "http://github.com/AssemblyAI-Examples/audio-examples/raw/main/20230607_me_canadian_wildfires.mp3"
transcript = transcriber.transcribe(FILE_URL)
# Extract all utterances from the response
utterances = transcript.utterances
# For each utterance, print its speaker and what was said
for utterance in utterances:
    speaker = utterance.speaker
    text = utterance.text
    print(f"Speaker {speaker}: {text}")
The following transcript shows part of the result of this example:
Speaker A: Smoke from hundreds of wildfires in Canada is triggering air quality alerts throughout the US Skylines from Maine to Maryland to Minnesota are gray and smoggy. And in some places, the air quality warnings include the warning to stay inside. We wanted to better understand what’s happening here and why, so we called Peter DeCarlo, an associate professor in the Department of Environmental Health and Engineering at Johns Hopkins University. Good morning, Professor.
Speaker B: Good morning.
Speaker A: So what is it about the conditions right now that have caused this round of wildfires to affect so many people so far away?
Speaker B: Well, there’s a couple of things. The season has been pretty dry already, and then the fact that we’re getting hit in the US is because there’s a couple weather systems that are essentially channeling the smoke from those Canadian wildfires through Pennsylvania into the mid Atlantic and the Northeast and kind of just dropping the smoke there.
Speaker A: So what is it in this haze that makes it harmful? And I’m assuming it is harmful.
Automatic language detection
Automatic language detection is another important feature in audio analysis because it enables systems to process and interpret spoken content more accurately and efficiently. It can enhance user experience in many applications by enabling multilingual support and language-specific customization. Execute the following code snippet to perform the transcription:
config = aai.TranscriptionConfig(language_detection=True)
transcriber = aai.Transcriber(config=config)
FILE_URL = "http://assembly.ai/news.mp4"
transcript = transcriber.transcribe(FILE_URL)
print(transcript.json_response["language_code"])
The output in this example is short: en.
For a comprehensive list of supported languages, refer to the Supported languages documentation.
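Automatic detection also makes it straightforward to route mixed-language audio through a single pipeline. A minimal sketch that prints the detected language for a batch of files (the URLs are placeholders):

# Placeholder URLs; replace with your own audio files.
files = ["http://example.org/clip_en.mp3", "http://example.org/clip_es.mp3"]
transcriber = aai.Transcriber(config=aai.TranscriptionConfig(language_detection=True))

for url in files:
    transcript = transcriber.transcribe(url)
    print(f"{url}: {transcript.json_response['language_code']}")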
PII redaction
Security is the top priority for both AWS and AssemblyAI. AssemblyAI's PII redaction feature helps maintain the privacy and security of sensitive information so that customers can build safe and trustworthy applications without incurring legal and regulatory risks. Users can control which types of sensitive data, such as credit card numbers, email addresses, and phone numbers, they want to redact through the configuration settings, as shown in the following code snippet.
config = aai.TranscriptionConfig()
config.set_redact_pii(
    # What should be redacted
    policies=[
        aai.PIIRedactionPolicy.credit_card_number,
        aai.PIIRedactionPolicy.email_address,
        aai.PIIRedactionPolicy.location,
        aai.PIIRedactionPolicy.person_name,
        aai.PIIRedactionPolicy.phone_number,
    ],
    # How it should be redacted
    substitution=aai.PIISubstitutionPolicy.hash,
)
transcriber = aai.Transcriber(config=config)
# Use your own audio file which contains some fake PII info for testing
FILE_URL = "http://example.org/audio.mp3"
transcript = transcriber.transcribe(FILE_URL)
print(transcript.text)
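The SDK also supports substituting each detected entity with its type instead of a hash, which keeps redacted transcripts readable (a name becomes [PERSON_NAME], for example). A minimal variation on the configuration above:

# Replace detected PII with its entity type rather than a hash.
config.set_redact_pii(
    policies=[aai.PIIRedactionPolicy.person_name],
    substitution=aai.PIISubstitutionPolicy.entity_name,
)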
AssemblyAI offers several additional products not covered in this post, including a streaming API for real-time transcription and LeMUR for extracting downstream insights from transcripts using large language models (LLMs). For more detailed information, refer to AssemblyAI Documentation.
Conclusion
AssemblyAI is committed to building a high-quality API platform for developers to transform and understand voice data with AI, enabling the creation of innovative products and services. Their speech-to-text models address critical transcription challenges. AssemblyAI's latest Universal-2 model focuses on solving last-mile issues that impact real-world speech AI workflows, such as improving alphanumeric and rare-word accuracy.
- Learn more about Universal-2's advancements: Read the AssemblyAI blog
- See how AssemblyAI compares to competitors: View benchmarks
- Dive into the research behind Universal-2: Explore the research
You can start using AssemblyAI’s API by visiting their listing in AWS Marketplace or by creating an account on the AssemblyAI website.