I ran OpenAI Whisper on Säkerhetspodcasten, a podcast in the Swedish language: 300+ podcast episodes, typically a little less than an hour per episode. I spoke about this a bit in Säkerhetspodcasten #292 - Ostrukturerat V.46.

Using an RTX 5090 GPU: 36 hours of transcription effort, if my notes are correct. That is about 7 minutes of transcription effort per podcast episode (36*60/300 = 7.2). I noted about 12 GB of VRAM utilization for the Whisper Large model, operating system, etc. (So Whisper Large should be perfectly manageable on a lesser 16 GB VRAM system, for example an RTX 4080 or 5080 card. Maybe not so lucky for 8 GB - 12 GB cards?… YMMV)

This is a very efficient way of heating your apartment! The GPU was redlining at 80 - 85 degrees Celsius during the transcription. The fans were running constantly. The room was turning into a sauna. I had to open the window to let in the sweet release of freezing winter air.

Disk space utilization is a bit insane. The NVIDIA dependencies etc. for OpenAI Whisper consume about 7.0G. Note that you can easily need multiple instances of this, for example if you build containers, or if you just want multiple virtual Python environments. The model itself, stored under ~/.cache/whisper/large-v3.pt, is 2.9G. So you need to take some care to ensure you don’t generate too many redundant copies if you don’t have a ton of free disk space.


What is OpenAI Whisper?

OpenAI Whisper is a big funky AI that can transcribe speech to text. It supports English, Swedish and other languages. It includes:

  • A Model
  • A Python API for running the model
  • A Python script/executable for running the model from the command line

Links: pypi/openai-whisper, OpenAI: Whisper
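
As a taste, here is a minimal sketch of the Python API (the file name episode.mp3 is just a placeholder; the wrapper described later in this post shells out to the command line executable instead):

import whisper

# Load the large model; on first use this downloads ~2.9G into ~/.cache/whisper.
model = whisper.load_model("large")

# Transcribe a local MP3; "sv" = Swedish.
result = model.transcribe("episode.mp3", language="sv")
print(result["text"])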

Google defaults vs Whisper defaults

Anecdotal non-scientific observations from using Google and Whisper with default settings, minimal configuration, on Säkerhetspodcasten:

  • Google has really funny failure conditions with Swedish. If Google misidentifies us as speaking English, it can render a one hour transcription of us speaking insane English, constantly swearing. Whisper does not seem to exhibit such a failure mode, at least not in the way we run it.
  • Whisper transcripts tend to render short text fragments that look and sound like spoken words. The transcript looks like a believable representation of how we probably were speaking.
  • Google transcripts tend to render long paragraphs that make us seem like more professional speakers than we actually are. The transcripts rarely seem to hint at this being four or five people interrupting each other…
  • Whisper emits really weird foreign characters that do not at all belong in a transcript of Swedish speech. Google never exhibited such failures.
  • Whisper performs poorly on names etc. where the AI model has limited experience. This was an issue with Google as well.

Insane and odd foreign characters

Whisper occasionally emits weird stuff that isn’t Swedish, but Russian, Chinese, Japanese, Korean, emojis and what not.

Some fun examples: Nej,े 輸pleks men ja jag Sinろ Hyundai不老 det svenska�etet Ja, medan denna dag delades det av liknandeятиper kring liknande serveringar medthoughtsdelen. 雅per ju du ليst och en ren ningún www.spotify.com.々. Du-du-du-du-du-du-du-du 🎶 마cket B��ін And must하 a then the worst thing I ever knew. Game� Jag残 vi. Mon heißt. альные... bok misses x-y-br푼 э 載er 輟s

Failing on names

Names seem to be one of the hardest challenges for AI. It fails on names much more often, and much worse, than any human would.

Whisper misidentifies names a bunch:

  • Jidhage as “Vidager”, “Idage”, “Idagre”, “Widåge”, “Wirdhagen”
  • Bordforss as “Bortfors”
  • Möller as “Ribe”
  • Assured as “Short”, “Ashward”
  • Claude as “Clord”
  • 0x4a (noll ex fyra A) as Knowledge4A

In its defense, it sometimes gets names right. Especially common names.

Bugging out and repeating yourself

On rare occasions, Whisper spits out the same line over and over again:

  • 93 00:04:45,720 --> 00:04:47,720 Och är typ det nya svarta på internet sedan ganska länge sedan.
  • 94 00:04:47,720 --> 00:04:48,720 Och är typ det nya svarta på internet sedan ganska länge sedan.
  • 95 00:04:48,720 --> 00:04:49,720 Och är typ det nya svarta på internet sedan ganska länge sedan.
  • 96 00:04:49,720 --> 00:04:50,720 Och är typ det nya svarta på internet sedan ganska länge sedan.
  • 97 00:04:50,720 --> 00:04:51,720 Och är typ det nya svarta på internet sedan ganska länge sedan.

Whisper transcription wrapper

www-migrate/transcribe is a Markdown/MP3 transcription wrapper built around OpenAI Whisper (openai-whisper).

Usage

This is the basic usage of the application; you need to specify:

  • --dir-mp3 directory to download and store MP3s into.
  • --dir-trans directory to store transcription files into.
  • file [file ...] files to transcribe, e.g. episode1.md episode2.md episode3.md etc.
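
For example, a typical invocation could look like this (the directory and file names are made up):

./transcribe.py --dir-mp3 ./mp3 --dir-trans ./transcriptions episode1.md episode2.md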

Usage help:

usage: transcribe.py [-h] --dir-mp3 DIR_MP3 --dir-trans DIR_TRANS
                     [--transcribe-header TRANSCRIBE_HEADER]
                     [--transcribe-description TRANSCRIBE_DESCRIPTION]
                     [--whisper-bin WHISPER_BIN]
                     file [file ...]

transcription tool

positional arguments:
  file                  files to process

options:
  -h, --help            show this help message and exit
  --dir-mp3 DIR_MP3     directory to download mp3s to
  --dir-trans DIR_TRANS
                        directory to store transcriptions to
  --transcribe-header TRANSCRIBE_HEADER
                        directory prefix to append to file names
  --transcribe-description TRANSCRIBE_DESCRIPTION
                        A short comment about the transcription
  --whisper-bin WHISPER_BIN
                        whisper executable

Hope this help was helpful! :-)

The main function

The main() function does the usual stuff (command line parsing, etc.) and then branches into a few different cases. The main use case of the application heads into the gloriously named function main_normal_case().
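
As a rough sketch, the argument parsing could look something like the following, based on the usage help above (the defaults are my assumptions; the real main() may differ):

import argparse

def main():
    # Command line interface as described by the usage help above.
    parser = argparse.ArgumentParser(description="transcription tool")
    parser.add_argument("--dir-mp3", required=True, help="directory to download mp3s to")
    parser.add_argument("--dir-trans", required=True, help="directory to store transcriptions to")
    # Default values below are assumptions, not necessarily what the real script uses.
    parser.add_argument("--transcribe-header", default="Transcription", help="header marking a transcription")
    parser.add_argument("--transcribe-description", default="", help="a short comment about the transcription")
    parser.add_argument("--whisper-bin", default="whisper", help="whisper executable")
    parser.add_argument("file", nargs="+", help="files to process")
    args = parser.parse_args()
    # argparse exposes --dir-mp3 as args.dir_mp3, --whisper-bin as args.whisper_bin, etc.
    return main_normal_case(args)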

main_normal_case performs three steps for each file:

  • parse_md(file, args.transcribe_header) parses the Markdown file, detects if it is already transcribed, and locates the .mp3 link.
  • download(mp3, args.dir_mp3) downloads the MP3 file and returns an identifier.
  • transcribe(args, file, identifier) transcribes the MP3 file and updates the Markdown file.
def main_normal_case(args):
    for file in args.file:
        parsed = parse_md(file, args.transcribe_header)
        if parsed is None:
            continue
        (mp3, transcribed) = parsed
        if transcribed:
            continue
        print(f"{file}")
        print(f' * mp3: {mp3}')
        identifier = download(mp3, args.dir_mp3)
        transcribe(args, file, identifier)
    return SystemExit, 0

Parsing the Markdown files

parse_md parses a markdown file.

  • It looks for [mp3](link) Markdown code using a regexp… since this is how our podcast formats its MP3 links. (Room for improvement: allow alternative formatting.)
  • It detects transcribe_header, the header used to indicate that a transcription follows.
def parse_md(fname, transcribe_header):
    if not os.path.isfile(fname):
        print(f'Error: not a file {fname}', file=sys.stderr)
        return None
    mp3 = None
    transcribed = False
    with open(fname, 'r') as file:
        needle = transcribe_header.lower()
        for line in file:
            if line.startswith("#"):
                if needle in line.lower():
                    transcribed = True
            if mp3 is None:
                pattern = r'\[mp3\]\((.*?)\)'
                match = re.search(pattern, line)
                if match:
                    mp3 = match.group(1)
    return (mp3, transcribed)

Downloading the MP3

download(url, download_dir) writes the MP3 to a temporary file. It then persists the file into the download_dir (from --dir-mp3) directory using a name based on the hash of the MP3 file. This avoids duplicates.

def download(url, download_dir):
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with tempfile.TemporaryFile(dir=download_dir) as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)

            identifier = archive_file(download_dir, f)
            return identifier

archive_file(download_dir, f) is the helper function used to persist the MP3 file as download_dir / sha256(f.read()).

def archive_file(download_dir, f):
    f.seek(0)
    m = hashlib.sha256()
    while (chunk := f.read(8192)):
        if len(chunk) == 0:
            break
        m.update(chunk)

    identifier = m.hexdigest()

    filename = os.path.join(download_dir, identifier)

    if os.path.exists(filename):
        print(f'Warning, allready exists {filename}', file=sys.stderr)
        return identifier

    f.seek(0)
    with open(filename, 'wb') as f2:
        while (chunk := f.read(8192)):
            if len(chunk) == 0:
                break
            f2.write(chunk)

    return identifier

Transcribing with Whisper

transcribe(args, md, identifier) runs the whisper executable if an .srt transcription does not already exist. It then uses srt2md(file_in, file_out) to read the .srt transcription and append a Markdown conversion to the Markdown file.

def transcribe(args, md, identifier):
    mp3_filename = os.path.join(args.dir_mp3, identifier)
    srt = os.path.join(args.dir_trans, identifier + ".srt")
    if not os.path.exists(srt):
        subprocess.run([args.whisper_bin,
                        "--model", "large",
                        "--language", "Swedish",
                        "--task", "transcribe",
                        "--output_format", "all",
                        "--output_dir", args.dir_trans,
                        mp3_filename])
    else:
        print(f" * Allready transcribed: {srt}")

    with open(srt) as file_in:
        with open(md, "a") as file_out:
            print("", file=file_out)
            print(f"## {args.transcribe_header}", file=file_out)
            print("", file=file_out)
            print(f"_{args.transcribe_description}_", file=file_out)
            print("", file=file_out)
            srt2md(file_in, file_out)

srt2md(file_in, file_out) basically just concatenates onto the Markdown file, while dealing with some formatting issues.

Basically we convert SRT timestamp blocks like 1 00:00:00,000 --> 00:00:02,960 into equivalent Markdown.
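
For example, the first SRT entry from the repetition example above:

93
00:04:45,720 --> 00:04:47,720
Och är typ det nya svarta på internet sedan ganska länge sedan.

ends up in the Markdown file roughly like this (timestamp line wrapped in backticks, text followed by a blank line):

`93 00:04:45,720 --> 00:04:47,720`
Och är typ det nya svarta på internet sedan ganska länge sedan.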

def srt2md(file_in, file_out):
    buf = []
    for line in file_in:
        buf.append(line.rstrip())
        if len(buf) == 2:
            t0 = buf[0]
            t1 = buf[1]
            m0 = re.search('^[0-9]+$', t0)
            m1 = re.search('^[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]+(\\.[0-9]+)? --> [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]+(\\.[0-9]+)?$', t1)
            if m0 and m1:
                print(f"`{t0} {t1}`", file=file_out)
                buf.clear()
            else:
                md = cleanup_markdown(t0)
                print(md, file=file_out)
                print("", file=file_out)
                del buf[0]

    for line in buf:
        print(line, file=file_out)

We call cleanup_markdown(transcription) just to fix up a few Markdown formatting issues. For most texts, this is a pure no-op that returns the input as-is.

def cleanup_markdown(transcription):
    out = ""
    special_long  = "\\`*_{}[]<>()#+-.!|"
    special_short = "\\`*_{}[]<>()#+!|"
    special = special_long
    for c in transcription:
        if c in special:
            out += "\\"
        out += c
        if c != " ":
            # '-' no longer means start of list...
            # - and . looks annoying when replaced in general text.
            special = special_short
    return out
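
To illustrate, here is a made-up input line run through the function in a Python REPL. The leading - is escaped so Markdown does not turn the line into a list item; after the first non-space character, things like [ ] ( ) are still escaped while - and . are left alone:

>>> print(cleanup_markdown("- [mp3](https://example.com/episode.mp3)"))
\- \[mp3\]\(https://example.com/episode.mp3\)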