AWS and NVIDIA Parakeet-TDT bring transcription in 25 languages for $0.00005 per minute
Why it matters
The AWS Machine Learning blog described how to use NVIDIA's open-source Parakeet-TDT-0.6B-v3 model for inexpensive multilingual audio transcription in the cloud. The model covers 25 European languages with automatic language detection. Combined with AWS Batch, processing one minute of audio costs just $0.00005 on Spot instances, or $0.00011 on on-demand g6.xlarge GPU instances, with a scale-to-zero policy and the ability to process recordings longer than ten hours through buffered streaming.
Open-source model with automatic detection of 25 languages
The AWS Machine Learning team published a detailed recipe whose architecture uses NVIDIA's open-source automatic speech recognition model Parakeet-TDT-0.6B-v3 for large-scale multilingual audio transcription. The 600-million-parameter model is released under an open license and covers 25 European languages, from Croatian and Serbian to Ukrainian and Finnish, with built-in automatic language detection. Users therefore do not need to label the language of each recording in advance; the model detects it on its own and returns the transcription.
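The no-pre-labeling workflow can be sketched with NVIDIA's NeMo toolkit. The model identifier comes from the recipe; the helper function and CLI wrapper below are illustrative, not code from the blog post, and the exact return type of `transcribe` varies between NeMo versions:

```python
# Minimal sketch: transcribing audio with Parakeet-TDT via NVIDIA NeMo.
# Assumes `pip install "nemo_toolkit[asr]"` and a CUDA-capable GPU.

MODEL_ID = "nvidia/parakeet-tdt-0.6b-v3"

def transcribe_files(paths):
    """Load the model once, then transcribe a batch of audio files.

    No language tag is passed: the model auto-detects the language
    among the 25 it supports.
    """
    import nemo.collections.asr as nemo_asr  # heavyweight, so imported lazily
    model = nemo_asr.models.ASRModel.from_pretrained(MODEL_ID)
    # Returns strings or hypothesis objects, depending on the NeMo version.
    return model.transcribe(paths)

if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        for result in transcribe_files(sys.argv[1:]):
            print(result)
```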
For companies processing multicultural content — for example media archives, contact centers, webinars, or podcasts — the absence of mandatory language pre-classification means significantly less work on the data input side. An additional advantage is that the model is small enough to run on a single consumer GPU, making it suitable for large-scale batch processing where large transformer models would be too expensive.
AWS Batch and scale-to-zero economics
AWS’s recommended architecture combines Parakeet-TDT with the AWS Batch service on g6.xlarge GPU instances. The key element of this architecture is the scale-to-zero policy: when there are no jobs in the queue, the cluster automatically scales down to zero GPU instances, so the user pays nothing except for storage. As soon as a new audio recording arrives in the queue, Batch automatically spins up an instance, processes the job, and returns the transcription result to an S3 bucket.
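The queueing step can be sketched with boto3. The queue and job-definition names here are hypothetical placeholders for whatever the recipe's CloudFormation stack creates; the `submit_job` fields themselves follow the real AWS Batch API, and `INPUT_S3_URI` is an assumed environment variable the container would read:

```python
# Sketch: queueing one recording for transcription with AWS Batch.
import os

def build_job_request(s3_uri,
                      job_queue="parakeet-spot-queue",
                      job_definition="parakeet-transcribe"):
    """Build the submit_job payload; the container is assumed to read
    INPUT_S3_URI and write the finished transcript back to S3."""
    name = os.path.splitext(os.path.basename(s3_uri))[0]
    return {
        "jobName": f"transcribe-{name}",
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            "environment": [{"name": "INPUT_S3_URI", "value": s3_uri}],
        },
    }

def submit(s3_uri):
    import boto3  # requires AWS credentials at runtime
    return boto3.client("batch").submit_job(**build_job_request(s3_uri))
```

Because the compute environment scales to zero, submitting the first job is what prompts Batch to launch a g6.xlarge instance; when the queue drains, the instance is terminated again.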
The economics are compelling: $0.00011 per minute of audio in on-demand mode and just $0.00005 per minute on Spot instances. Concretely, one hour of audio in Spot mode costs about three-tenths of a US cent, which is an order of magnitude cheaper than commercial transcription APIs. The blog post explicitly highlights that the combination of Spot instances and the scale-to-zero approach drastically reduces fixed costs, especially for organizations that periodically process large archives.
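The per-hour figure follows directly from the quoted per-minute rates:

```python
# Quick check of the quoted pricing: cost of one hour of audio.
SPOT_PER_MIN = 0.00005      # USD per audio minute on Spot
ONDEMAND_PER_MIN = 0.00011  # USD per audio minute on demand

spot_hour = 60 * SPOT_PER_MIN          # 0.003 USD, i.e. 0.3 US cents
ondemand_hour = 60 * ONDEMAND_PER_MIN  # 0.0066 USD

print(f"Spot: ${spot_hour:.4f}/h, on-demand: ${ondemand_hour:.4f}/h")
```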
Buffered streaming for long recordings and processing speed
One of the technical challenges of speech models is their limited context length, which normally forces long recordings to be split into segments by hand. In this recipe AWS implemented a buffered streaming mechanism that enables processing of audio recordings longer than ten hours without manual splitting. The model processes audio in sliding windows and merges the transcripts at logical boundaries, which is essential for podcasts, long lectures, and conference recordings.
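The blog does not publish the windowing code; as a sketch of the general idea, overlapping sliding windows (the window and overlap sizes below are arbitrary, not the recipe's values) might be generated like this:

```python
def window_spans(total_s, window_s=60.0, overlap_s=5.0):
    """Split a recording of total_s seconds into overlapping windows.

    Consecutive windows share overlap_s seconds so that the merge step
    can align transcripts at chunk boundaries instead of cutting words.
    """
    step = window_s - overlap_s
    spans, start = [], 0.0
    while start < total_s:
        spans.append((start, min(start + window_s, total_s)))
        start += step
    return spans

# A 10-hour recording (36,000 s) is covered by a few hundred windows,
# each overlapping the next by 5 s.
spans = window_spans(36_000)
```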
In terms of speed, the blog post states that average processing takes 0.49 seconds per minute of input audio, about 120 times faster than real time on a single GPU. That means Parakeet-TDT processes ten hours of audio in roughly five minutes, at a cost of roughly $0.03 in Spot mode. For newsrooms, law offices, or transcription teams, this speed and cost change the business model: transcription stops being a bottleneck and becomes an almost free step in the pipeline.
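The ten-hour figures check out against the quoted 0.49-seconds-per-minute rate:

```python
# Verifying the quoted throughput and cost for a 10-hour recording.
PROC_S_PER_AUDIO_MIN = 0.49   # seconds of GPU time per minute of audio
SPOT_PER_MIN = 0.00005        # USD per audio minute on Spot

audio_min = 10 * 60                                # 600 minutes of audio
proc_min = audio_min * PROC_S_PER_AUDIO_MIN / 60   # = 4.9 minutes of GPU time
speedup = 60 / PROC_S_PER_AUDIO_MIN                # ~122x faster than real time
cost = audio_min * SPOT_PER_MIN                    # = $0.03 on Spot
```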
What this means for international users
Support for 25 European languages built into Parakeet-TDT means that companies and media organizations across Europe for the first time have access to quality open-source transcription at a cost that is negligible even for daily volumes in the hundreds of hours. For media outlets this opens the possibility of automatic captioning of archive broadcasts, for law offices the cheap processing of court hearing recordings, and for educational institutions the transcription of lectures in real time. Since the model is open-source, there is no vendor lock-in — the same recipe can be transferred to proprietary GPU servers or other cloud platforms, as long as GPU instances and S3-compatible storage are available.