Free Text To Speech for Anki Flash Cards

Flash cards are great. Spaced repetition is great. Anki is amazing. But what if you need to learn something that isn’t just text and has non-visual elements? What if you need to learn how something sounds, but can’t easily tell its pronunciation from its spelling (looking at you, French)? In this case, we need to attach audio to our flash cards. (Alternatively, you could guess, and hope that the sound you have committed to long-term memory over several painstaking months of self-discipline with spaced repetition is correct. Good luck with the French word serrurier tho 👍)

Anki’s Built-in TTS and Speech Synthesis

Luckily for us, Anki supports text to speech (TTS) out of the box using your operating system’s built-in TTS voices. The problem? It sounds bad…

Consider the following audio generated for the French sentence: Je veux et j’exige des excuses exquises.

Blablabla
Blablabla but better

If you thought 2 sounded better than 1, you would be right. Unfortunately, No. 1 is the built-in TTS available through the Anki app. It utilises the underlying text to speech system of your operating system, which in most cases uses speech synthesis. Modern text to speech systems use machine learning models instead and tend to be far more accurate, intelligible and natural.

While the built-in text to speech functionality could suffice for learning simple words, it is not good enough for larger, more complex words and phrases, where pronunciation changes with emotional, grammatical and semantic context.

Modern TTS

Have you heard of ChatGPT…?

Unsurprisingly, modern text to speech systems almost exclusively use machine learning models to generate lifelike speech, and the second example above demonstrates the quality of the speech produced.

The question is then how we can use these models, either our own or a third party service’s, to produce speech for our Anki cards.

The first and easiest way to do this is via an Anki addon which in turn uses third parties to generate the audio. AwesomeTTS and HyperTTS are the two that I am aware of. They both offer a free tier and a paid tier. Although they are the most convenient option, they are significantly more expensive than using the third party text to speech providers directly. Not only do they charge more per character, you also lose the free tier which the majority of major cloud providers offer.

Using a Third Party TTS Provider Directly for Cheap (or Free)

Instead of using a paid extension, why don’t we take advantage of a cloud provider’s free tier? Here are the major cloud providers’ offerings:

GCP: Text-to-Speech AI
- ‘New customers get up to $300 in free credits to try Text-to-Speech and other Google Cloud products.’
AWS: AWS Polly
- ‘the free tier includes 1 million characters per month for speech or Speech Marks requests, for the first 12 months’
Azure: AI Speech
- ‘0.5 million characters free per month’

Note that even if you leave the free tier, either through usage or time, the pricing is still significantly cheaper than the Anki addons, and will likely come out to a dollar or two per month at most. It is best to calculate this yourself, though, as I do not know your usage.
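
As a rough way to do that calculation, here is a minimal sketch. The function name and all of the numbers in the example are placeholders I have made up for illustration, not quoted rates; check your provider's pricing page for the real per-character price and free allowance.

```python
def estimate_monthly_cost(
    cards_per_month,
    avg_chars_per_card,
    price_per_million_chars,
    free_chars_per_month=0,
):
    """Rough monthly TTS cost, in the same currency as the price given."""
    total_chars = cards_per_month * avg_chars_per_card
    # Only characters beyond the free allowance are billed
    billable_chars = max(0, total_chars - free_chars_per_month)
    return billable_chars / 1_000_000 * price_per_million_chars


# Placeholder numbers: 300 new cards a month, 40 characters each,
# and a hypothetical price of 16.0 per million characters.
cost = estimate_monthly_cost(300, 40, 16.0)
print(f"Estimated cost per month: {cost:.2f}")
```

Passing a non-zero free_chars_per_month shows what you would pay while still inside a free tier, which for most flash card workloads is likely to be nothing.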

Here is a little Python snippet to generate speech for some text using AWS Polly, assuming you have created an AWS account and configured credentials for programmatic access. It uses AWS’s boto3 library to interact with AWS. If you would like to use boto3, here is the quickstart guide which will take you through full installation and configuration.

import boto3

def french_tts(text):
    """Generate French speech from text using AWS Polly"""
    client = boto3.client("polly")
    # pass the function argument as the text to synthesise
    res = client.synthesize_speech(
        Engine="neural", OutputFormat="mp3", Text=text, VoiceId="Lea"
    )

    # AudioStream is a streaming body; read() returns the raw mp3 bytes
    return res["AudioStream"].read()
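
If you want to attach those bytes to a card by hand, one option is to write them into Anki's media folder (collection.media inside your profile folder) and reference the file from a field with a [sound:...] tag. Below is a minimal sketch of that step. The save_speech helper and its hash-based file naming are my own assumptions for illustration, and the path in the usage comment is a placeholder.

```python
import hashlib
from pathlib import Path


def save_speech(audio: bytes, text: str, media_dir: str) -> str:
    """Write mp3 bytes into media_dir and return an Anki sound tag for the file."""
    # Name the file after a hash of the text, so the same phrase
    # always maps to the same file name
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()[:16]
    filename = f"tts-{digest}.mp3"
    (Path(media_dir) / filename).write_bytes(audio)
    # Anki plays the referenced file wherever this tag appears in a field
    return f"[sound:{filename}]"


# Example usage (placeholder path, needs AWS credentials for french_tts):
# tag = save_speech(french_tts("serrurier"), "serrurier", "/path/to/collection.media")
```

Pasting the returned tag into a card field makes Anki play the audio when the card is shown.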

Using the Hanky Python Package

Disclaimer: I am the author of the hanky package, which I wrote when I first started to become frustrated with learning French without good audio. It is not necessary to use it, so feel free to look at the source code and just take the bits you want. I wrote it with the aim of making this process easier for others, as it has made it easier for me.

from hanky import Hanky
import boto3

def french_tts(text):
    """Generate French speech audio from text using AWS Polly"""
    client = boto3.client("polly")
    # pass the function argument as the text to synthesise
    res = client.synthesize_speech(
        Engine="neural", OutputFormat="mp3", Text=text, VoiceId="Lea"
    )

    # AudioStream is a streaming body; read() returns the raw mp3 bytes
    return res["AudioStream"].read()

hanky = Hanky()

@hanky.card_processor(
    "french-vocab-model", expected_args=[], card_fields=["native-lang", "target-lang"]
)
def add_speech(card: dict):
    """Add french speech to cards of type/model 'french-vocab-model'. We assume that
    the model/note type has already been created in anki with the following fields
        - native-lang
        - target-lang
        - target-lang-speech
    """

    # generate the speech
    speech = french_tts(card["target-lang"])

    # add the mp3 data to anki
    speech_ref = hanky.add_media(speech, file_ext=".mp3")

    # put the reference to that media inside a field in the lang-vocab model
    # in this case there is a specific field, 'target-lang-speech'
    card["target-lang-speech"] = speech_ref
    return card

hanky.run()

This is a work in progress, more to come…