Audio to text conversion using AWS Transcribe and Sentiment Analysis using Comprehend API
Amazon Transcribe is an Automatic Speech Recognition (ASR) service by Amazon. It can recognize speech from an existing audio or video file, from a stream of audio or video content, or from audio input coming directly from your computer's microphone.
Amazon Transcribe uses advanced machine learning technologies to recognize speech in audio files and transcribe it into text. You can use Amazon Transcribe to convert audio to text and to create applications that incorporate the content of audio files. For example, you can transcribe the audio track from a video recording to create closed captioning for the video.
Use cases of Amazon Transcribe
- Voice analytics
- Media and entertainment
- Advertising
- Search and compliance
What type of service is it?
It is a fully managed application service in the machine learning stack: you don't have to provision any servers or manage any infrastructure. You simply supply the source file through an S3 bucket and receive the transcribed output via the same bucket, a different bucket, or a bucket owned by Amazon.
1. Amazon Transcribe
14 Supported Languages for transcription
- Modern Standard Arabic (ar-SA), added to the supported list on May 28, 2019
- Australian English (en-AU)
- British English (en-GB)
- Indian English (en-IN), added to the supported list on May 15, 2019
- US English (en-US)
- French (fr-FR)
- Canadian French (fr-CA)
- German (de-DE)
- Indian Hindi (hi-IN), added to the supported list on May 15, 2019
- Italian (it-IT)
- Korean (ko-KR)
- Brazilian Portuguese (pt-BR)
- Spanish (es-ES), added to the supported list on April 19, 2019
- US Spanish (es-US)
https://docs.aws.amazon.com/transcribe/?id=docs_gateway
11 Supported Regions
Amazon Transcribe is supported in 11 regions. For those who don't know what an AWS region is: it is a geographical boundary defined by AWS that contains multiple Availability Zones (groups of data centres), giving fault-tolerance and load-balancing capabilities to AWS services within that region or across multiple regions simultaneously. That being said, not all services launched by AWS are made available in all regions.
- Asia Pacific (Sydney)
- Asia Pacific (Singapore)
- Asia Pacific (Mumbai)
- Canada (Central)
- EU (Ireland)
- EU (London)
- EU (Paris)
- US East (Northern Virginia)
- US East (Ohio)
- US West (Oregon)
- US West (N. California)
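Since not every service is available in every region, it can help to fail fast in code when a caller picks an unsupported region. A minimal sketch (the region codes are my mapping of the region names above; double-check them against the current AWS Region Table, since availability changes over time):

```python
# Regions where Amazon Transcribe was available when this was written;
# verify against the current AWS Region Table before relying on it.
TRANSCRIBE_REGIONS = {
    "ap-southeast-2",  # Asia Pacific (Sydney)
    "ap-southeast-1",  # Asia Pacific (Singapore)
    "ap-south-1",      # Asia Pacific (Mumbai)
    "ca-central-1",    # Canada (Central)
    "eu-west-1",       # EU (Ireland)
    "eu-west-2",       # EU (London)
    "eu-west-3",       # EU (Paris)
    "us-east-1",       # US East (N. Virginia)
    "us-east-2",       # US East (Ohio)
    "us-west-2",       # US West (Oregon)
    "us-west-1",       # US West (N. California)
}

def is_transcribe_region(region):
    """Return True if Amazon Transcribe is available in the given region."""
    return region in TRANSCRIBE_REGIONS

def make_transcribe_client(region):
    """Create a region-scoped Transcribe client, failing fast on bad regions."""
    import boto3  # requires `pip install boto3` and configured credentials
    if not is_transcribe_region(region):
        raise ValueError(f"Amazon Transcribe is not available in {region}")
    return boto3.client("transcribe", region_name=region)
```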
Key Features
- Recognize voices (identify multiple speakers in an audio clip)
- Transcribe separate audio channels (e.g., agent on the left channel, customer on the right)
- Transcribe streaming audio (real-time speech to text, e.g., from a microphone)
- Custom vocabulary (custom words such as EC2, S3, names, industry terms)
- Support for telephony audio (8 kHz audio with high accuracy)
- Timestamp generation and confidence scores (a timestamp for each word to locate it in the recording, along with a confidence score between 0.0 and 1.0)
Technical Specification of Speech Input
Supported formats: • FLAC, MP3, MP4, or WAV
Supported duration and size:
• Less than 4 hours in length and less than 2 GB of audio data
You must specify the language and format of the input file.
For best results:
• Use a lossless format, such as FLAC or WAV, with PCM 16 bit encoding.
• Use a sample rate of 8000 Hz for telephone audio.
You can specify that Amazon Transcribe identify between 2 and 10 speakers in the audio clip.
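These constraints can be encoded when building the request for Transcribe's StartTranscriptionJob API. A sketch using boto3's parameter names (the job name and S3 URI in the usage example are placeholders):

```python
SUPPORTED_FORMATS = ("flac", "mp3", "mp4", "wav")

def build_transcription_request(job_name, media_uri, language_code="hi-IN",
                                media_format="wav", max_speakers=None):
    """Build kwargs for boto3's start_transcription_job, enforcing the
    input constraints described above."""
    if media_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported media format: {media_format}")
    request = {
        "TranscriptionJobName": job_name,
        "LanguageCode": language_code,
        "MediaFormat": media_format,
        "Media": {"MediaFileUri": media_uri},
    }
    if max_speakers is not None:
        # Speaker identification accepts between 2 and 10 speakers
        if not 2 <= max_speakers <= 10:
            raise ValueError("max_speakers must be between 2 and 10")
        request["Settings"] = {
            "ShowSpeakerLabels": True,
            "MaxSpeakerLabels": max_speakers,
        }
    return request

# Usage (requires boto3 and AWS credentials):
# import boto3
# transcribe = boto3.client("transcribe")
# transcribe.start_transcription_job(
#     **build_transcription_request("demo-job", "s3://my-bucket/speech.wav",
#                                   max_speakers=2))
```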
Technical Specification of Custom Vocabulary
A custom vocabulary is a list of specific words that you want Amazon Transcribe to recognize in your audio input. These are generally domain-specific words and phrases, words that Amazon Transcribe isn't recognizing, or proper nouns.
You can have up to 100 custom vocabularies in your account, and the size limit for each is 50 KB. A vocabulary can be defined in either a list format or a table format.
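Creating a list-format vocabulary is a single API call, and the 50 KB cap can be checked locally before uploading. A sketch (the vocabulary name and phrases are examples):

```python
MAX_VOCAB_BYTES = 50 * 1024  # 50 KB limit per custom vocabulary

def validate_vocabulary(phrases):
    """Raise if the list-format vocabulary would exceed the 50 KB limit."""
    size = sum(len(p.encode("utf-8")) for p in phrases)
    if size > MAX_VOCAB_BYTES:
        raise ValueError(f"vocabulary is {size} bytes; the limit is {MAX_VOCAB_BYTES}")
    return phrases

def create_custom_vocabulary(name, phrases, language_code="en-US"):
    """Register a list-format custom vocabulary with Amazon Transcribe."""
    import boto3  # requires `pip install boto3` and configured credentials
    transcribe = boto3.client("transcribe")
    return transcribe.create_vocabulary(
        VocabularyName=name,
        LanguageCode=language_code,
        Phrases=validate_vocabulary(phrases),
    )

# e.g. create_custom_vocabulary("cloud-terms", ["EC2", "S3", "SageMaker"])
```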
Pricing:
The Architecture
Create an account
IAM Role Best for Programmatic Access
An IAM role is basically a set of permissions that can be assumed by someone (or an entity) to gain access to the allowed services, as befits their responsibility and allowed scope. Roles are a way of providing temporary credentials that AWS generates to ensure maximum security for our workloads. A role provides a temporary access key ID and secret access key, plus one additional component: a security token. These temporary keys are used to grant the desired access to the entity that assumes the role, and they are generally valid for 12 hours. The security-token component makes sure new keys are generated about 5 minutes before the end of the 12-hour window, so we don't have to worry about rotating the keys ourselves; it just happens automatically.
https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html
https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_common-scenarios.html
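The temporary credentials described above can be obtained programmatically with STS. A sketch, assuming a role ARN you are permitted to assume (the ARN and session name in the usage example are placeholders; STS AssumeRole accepts session durations from 15 minutes up to 12 hours):

```python
def valid_session_duration(seconds):
    """STS AssumeRole accepts 900 s (15 min) up to 43200 s (12 h)."""
    return 900 <= seconds <= 43200

def session_from_role(role_arn, session_name, duration_seconds=3600):
    """Assume a role and return a boto3 session built from the temporary
    access key ID, secret key, and security token it hands back."""
    import boto3  # requires `pip install boto3` and configured credentials
    if not valid_session_duration(duration_seconds):
        raise ValueError("duration must be between 900 and 43200 seconds")
    sts = boto3.client("sts")
    creds = sts.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,
        DurationSeconds=duration_seconds,
    )["Credentials"]
    return boto3.session.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )

# e.g. session = session_from_role(
#     "arn:aws:iam::123456789012:role/TranscribeRole",  # placeholder ARN
#     "transcribe-demo")
```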
AWS Python SDK Setup with code
https://docs.aws.amazon.com/transcribe/latest/dg/API_Operations.html
1. Import libraries
2. Link the name of each audio file to the speaker
3. Set key access with the AWS platform
4. Set S3 credentials and check the bucket
5. Create a new S3 bucket to upload the audio files
6. Upload the files to the created bucket
7. Define the file URLs on the bucket using the S3 convention for file paths
8. Create a vocabulary list for transcribing
9. Define a function to start an Amazon Transcribe job
10. Create a SageMaker role
11. Iterate over the audio file URLs on S3 and call the start_transcription function defined above
12. Download the JSON file from the S3 bucket after transcribing
13. Delete the Transcribe job, taking the name from the bucket
14. Verify the Amazon Transcribe jobs that are under the status COMPLETED
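The core of the steps above can be sketched end-to-end: build the S3 URI for an already-uploaded file, start the transcription job, and poll until it finishes (bucket, key, and job names are placeholders; requires boto3 and configured credentials):

```python
import time

def s3_uri(bucket, key):
    """Build the S3 URI form that Transcribe accepts as MediaFileUri."""
    return f"s3://{bucket}/{key}"

def media_format(key):
    """Infer the media format (flac/mp3/mp4/wav) from the file extension."""
    return key.rsplit(".", 1)[-1].lower()

def transcribe_file(bucket, key, job_name, language_code="hi-IN"):
    """Start a transcription job for a file already in the bucket, poll
    every 10 seconds, and return the final job status."""
    import boto3  # requires `pip install boto3` and configured credentials
    transcribe = boto3.client("transcribe")
    transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        LanguageCode=language_code,
        MediaFormat=media_format(key),
        Media={"MediaFileUri": s3_uri(bucket, key)},
        OutputBucketName=bucket,  # write the result JSON back to the same bucket
    )
    while True:
        job = transcribe.get_transcription_job(TranscriptionJobName=job_name)
        status = job["TranscriptionJob"]["TranscriptionJobStatus"]
        if status in ("COMPLETED", "FAILED"):
            return status
        time.sleep(10)

# e.g. transcribe_file("my-audio-bucket", "speeches/speaker1.wav", "speaker1-job")
```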
Result:
The outcome is a JSON file for the Hindi audio that comprises the Hindi transcript of the audio, diarization, and a timestamp for each word along with its confidence score.
2. Amazon Comprehend:
In the last part of our analysis we are going to use Amazon Comprehend for sentiment analysis of the speeches. As mentioned before, AWS offers a pre-trained model that returns the percentage of four different sentiments: positive, negative, mixed, or neutral.
To perform the sentiment analysis we simply need to provide the text as a string and the language. One limitation imposed by Amazon Comprehend is the size of the text.
Sentiment: Sentiment allows you to understand whether what the user is saying is positive or negative, or even neutral. Sometimes neutrality is important as well: knowing there is no strong sentiment can itself be a signal.
Entities: This feature goes through the unstructured text, extracts entities, and categorizes them for you, so things like people or organizations are each given a category.
Language detection: For a company with a multilingual application and a multilingual customer base, you can determine what language the text is in, so you know whether you have to translate the text or take some other kind of business action on it.
Key phrases: Think of these as noun phrases. Where entity extraction picks out proper nouns, key phrase extraction catches everything else in the unstructured text, so you can go deeper into the meaning: what were they saying about the person, or about the organization, for example?
Topic modeling: Topic modeling works over a large corpus of documents and helps you organize them into the topics they contain, so it's really useful for organization and information management.
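Each of the single-document features above maps to one boto3 call on the comprehend client. A sketch, plus a small helper that pulls the dominant label out of a sentiment response (the response shape is Comprehend's `{"Sentiment": ..., "SentimentScore": {...}}`):

```python
def top_sentiment(response):
    """Return the highest-scoring label from a detect_sentiment response."""
    scores = response["SentimentScore"]  # Positive / Negative / Neutral / Mixed
    return max(scores, key=scores.get)

def analyze_text(text, language_code="en"):
    """Run the Comprehend features described above on one piece of text."""
    import boto3  # requires `pip install boto3` and configured credentials
    comprehend = boto3.client("comprehend")
    return {
        "sentiment": comprehend.detect_sentiment(Text=text, LanguageCode=language_code),
        "entities": comprehend.detect_entities(Text=text, LanguageCode=language_code),
        "key_phrases": comprehend.detect_key_phrases(Text=text, LanguageCode=language_code),
        "language": comprehend.detect_dominant_language(Text=text),
    }
```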
An example use case is social analytics.
Detail step by step followed in sentiment analysis
1. Import libraries
2. Read the JSON file from the directory
3. Set key access with the AWS platform
4. Set the JSON input and output directories
5. Get the file paths from the input directory
6. Define the Comprehend function to compute sentiment in 5000-byte chunks. Comprehend can analyze at most 5000 bytes at a time (about 5000 characters for ASCII text; fewer for scripts like Devanagari, where UTF-8 uses multiple bytes per character). Since we are dealing with text transcripts larger than this limit, we created a start_comprehend_job function that splits the input text into smaller chunks and calls the sentiment analysis via boto3 for each independent part.
7. Define the transcribe function to use Amazon Comprehend for sentiment values in a dataframe
8. Define the main function
9. Run the main function
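The chunking step can be sketched as follows: split on whitespace so that no chunk's UTF-8 encoding exceeds 5000 bytes, then call detect_sentiment once per chunk (the language code and client setup are assumptions; a single word longer than the limit would still need special handling):

```python
MAX_COMPREHEND_BYTES = 5000  # detect_sentiment caps input at 5000 bytes

def chunk_text(text, max_bytes=MAX_COMPREHEND_BYTES):
    """Split text on whitespace into chunks whose UTF-8 size fits the limit."""
    chunks, current = [], ""
    for word in text.split():
        candidate = (current + " " + word).strip()
        if len(candidate.encode("utf-8")) > max_bytes and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

def start_comprehend_job(text, language_code="hi"):
    """Run sentiment analysis chunk by chunk and collect the responses."""
    import boto3  # requires `pip install boto3` and configured credentials
    comprehend = boto3.client("comprehend")
    return [comprehend.detect_sentiment(Text=chunk, LanguageCode=language_code)
            for chunk in chunk_text(text)]
```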
Result:
Sentiment analysis for each selected speech. The output has four numerical scores, each with a sentiment label, i.e., positive, negative, neutral, and mixed.
Reference:
Google images
https://docs.aws.amazon.com/pt_br/comprehend/latest/dg/guidelines-and-limits.html