I used to finish a meeting feeling productive... and then lose another 30 minutes rebuilding what just happened.

My notes were always the same story:

  • half sentences,
  • missing action owners,
  • zero context for "we decided X",
  • and that one moment I needed to remember — gone.

So I replaced my meeting notes workflow with AI transcription.

Now, most meetings turn into a usable transcript in a few minutes, and I spend about 5 minutes doing a quick human pass (titles, action items, key decisions). It's not perfect, but it's consistent — and it saves me 2+ hours per week.

This post is a builder story + a practical guide:

  • why built-in speech recognition didn't cut it,
  • what I tested,
  • what actually improves accuracy,
  • and how I shipped a transcription workflow using the Whisper API (without building a giant infra project).

If you want to try the same workflow, everything you need is linked along the way.

1) Why built-in speech recognition wasn't enough

I started with the simplest thing: whatever the browser or OS gives you.

Web Speech API: great demo, fragile workflow

Chrome's Web Speech API feels magical... until you rely on it for real work.

The problems I hit quickly:

  • Punctuation is inconsistent (and varies by language).
  • Long sessions are fragile — tab suspends, mic permission glitches, or network hiccups can wipe progress.
  • It's designed for live speech, not for audio files you recorded earlier.
  • It's hard to produce something "archive-quality" (clean paragraphs, stable output, repeatable results).

For quick dictation, it's fine. For "meeting notes you trust," it was not.

That's why I split the product into two workflows:

  1. Real-time speech-to-text (fast capture) — /speech-to-text
  2. File-based transcription (recordings you want to keep) — /audio-to-text

2) The three approaches I tested (and why I chose Whisper API)

When people say "speech-to-text," they often mean very different things. I tested three routes.

Option A: Google Cloud Speech-to-Text

Pros: strong accuracy, enterprise-grade, lots of language support
Cons: billing adds up, you need backend + auth + quotas, more plumbing

It's a good choice if you're building a bigger B2B product with a backend anyway. But my goal was: make transcription feel like uploading a file and getting a transcript — no setup.

Option B: Self-hosted Whisper

Pros: powerful model, lots of tooling, full control
Cons: you'll want GPU for speed, deployment complexity, scaling headaches

Whisper is amazing, but self-hosting it becomes its own project:

  • GPU instances,
  • queueing,
  • retries,
  • storage,
  • security,
  • cost surprises.

Option C: Whisper API (what I shipped)

Pros: solid accuracy, fast enough, no GPUs to manage, simple backend
Cons: you must handle file upload, privacy, retention, and cost controls

This was the sweet spot. I could ship quickly and focus on UX instead of infrastructure.
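To make "handle upload and cost controls" concrete, here's a minimal Python sketch using the OpenAI SDK. The `is_uploadable` guard, `MAX_BYTES`, and the extension set are my own names and choices (not part of the SDK), though the 25 MB cap matches the Whisper API's documented file limit:

```python
import os

# Hypothetical pre-upload guard: reject files the API would refuse anyway,
# before spending bandwidth on the upload. The 25 MB limit is the API's cap.
MAX_BYTES = 25 * 1024 * 1024
ALLOWED_EXTENSIONS = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}

def is_uploadable(path: str, size_bytes: int) -> bool:
    """Return True if the file looks acceptable for the Whisper API."""
    ext = os.path.splitext(path)[1].lower()
    return ext in ALLOWED_EXTENSIONS and 0 < size_bytes <= MAX_BYTES

def transcribe(path: str, language: str = "en") -> str:
    """Upload one audio file and return plain text.

    Requires `pip install openai` and an OPENAI_API_KEY in the environment.
    """
    from openai import OpenAI
    client = OpenAI()
    with open(path, "rb") as f:
        result = client.audio.transcriptions.create(
            model="whisper-1", file=f, language=language
        )
    return result.text
```

Passing `language` explicitly rather than relying on auto-detect is deliberate; more on that in the accuracy section.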

Approach              Accuracy   Ongoing cost   Setup complexity   Best for
Google Cloud STT      High       $$-$$$         High               infra-heavy products
Self-hosted Whisper   High       $ (infra)      Medium-High        tinkerers / full control
Whisper API           High       $-$$           Low-Medium         shipping fast

3) What my accuracy tests taught me (real audio, real pain)

I stopped debating models and started testing with real recordings. Three scenarios:

Test 1: Quiet meeting room (best case)

  • clear voices, minimal overlap, consistent mic distance.

Result: Whisper API performed well and the transcript was readable with light editing.

Test 2: Cafe / background noise (realistic case)

Noise causes two failure modes:

  • missed words (especially quiet speakers),
  • hallucinated filler (noise interpreted as speech).

Result: accuracy dropped — less because of the model, more because of input quality.

Test 3: Accents + fast speech (hard case)

Accent + speed breaks the usual assumptions:

  • proper nouns get mangled,
  • sentence boundaries disappear,
  • speaker turns blend together.

Result: the biggest improvement wasn't "switch models." It was: prep audio + choose the correct language.

The 5 things that improved accuracy the most

  1. Set the correct language — Auto-detect is convenient, but when it guesses the wrong language it fails silently.
  2. Reduce cross-talk — Two people talking at once turns your transcript into "best effort."
  3. Get closer to the mic — Mic distance beats most model changes.
  4. Split long recordings — Splitting 60 minutes into three 20-minute chunks improves stability and review speed.
  5. Remove long silences — Dead air wastes processing and can confuse segmentation.
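Tip 4 is the easiest to automate. Here's a minimal sketch of the boundary math, assuming a tool like ffmpeg does the actual cutting; `chunk_spans` and the 5-second overlap are my own hypothetical choices, not part of any library:

```python
def chunk_spans(total_seconds: float, chunk_seconds: float = 20 * 60,
                overlap_seconds: float = 5.0) -> list:
    """Split a long recording into (start, end) spans, in seconds.

    A small overlap between chunks means words spoken right at a
    boundary are not lost; duplicates are easy to spot during review.
    """
    spans = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break
        start = end - overlap_seconds
    return spans
```

Each `(start, end)` pair then maps directly onto ffmpeg's `-ss` (seek) and `-t` (duration) flags for the actual cutting.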

I wrote the full checklist here: /blog/speech-to-text-accuracy-tips

4) Subtitles were cool... until I realized I didn't need them

Timestamped subtitles looked impressive in demos, but for meeting notes, what I actually needed was:

  • clean paragraphs,
  • searchable text,
  • something I can copy into Notion quickly,
  • and enough structure to extract decisions/action items.

Export formats (TXT / SRT / VTT)

I started with TXT because it's the fastest way to copy into Notion/Docs. But subtitles are too useful to ignore, so I added SRT and VTT exports as well.

  • TXT: best for notes, docs, and editing
  • SRT/VTT: perfect for video subtitles, timestamped reviews, and searchable archives
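SRT is simple enough to generate by hand once you have timestamped segments. A minimal sketch, assuming the transcript arrives as (start, end, text) tuples; `srt_timestamp` and `to_srt` are hypothetical helper names:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments) -> str:
    """segments: iterable of (start_s, end_s, text) -> SRT document string."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(
            f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)
```

VTT is nearly identical: swap the comma in each timestamp for a period and prepend a `WEBVTT` header line.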

And if you only want text from video, this path still works: /video-to-text

5) My weekly meeting workflow (the one that actually sticks)

This is the workflow that finally became habit:

  1. Record the meeting (or export audio from Zoom/Meet)
  2. Upload it to the file-based tool — /audio-to-text
  3. Download as TXT
  4. Paste into Notion and do a quick "human edit" pass:
     • title + participants
     • decisions
     • action items (owner + deadline)
     • open questions

That's it.

I'm not trying to eliminate human judgment. I'm trying to eliminate the boring "replay the audio" work.

On average, I'm saving 2+ hours/week — and my notes are more complete because I'm not relying on memory.

6) Under the hood: how I built the upload, transcribe, delete pipeline

The pipeline

  1. Client uploads audio to the server
  2. Server stores it temporarily (unique job ID)
  3. Server calls Whisper API to transcribe
  4. Server returns transcript (TXT)
  5. Cleanup job deletes files on a timer
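Steps 1-4 can be sketched in a few lines. This is a toy in-memory version, not the production service — a real deployment would use a database or queue instead of the `JOBS` dict, and the Whisper API call would happen inside an async worker between `create_job` and `complete_job`:

```python
import os
import tempfile
import time
import uuid

# Hypothetical in-memory job table; a real service persists this.
JOBS = {}

def create_job(audio_bytes: bytes) -> str:
    """Steps 1-2: store the upload under a unique job ID."""
    job_id = uuid.uuid4().hex
    path = os.path.join(tempfile.gettempdir(), f"job-{job_id}.audio")
    with open(path, "wb") as f:
        f.write(audio_bytes)
    JOBS[job_id] = {"path": path, "status": "queued", "created": time.time()}
    return job_id

def complete_job(job_id: str, transcript: str) -> None:
    """Steps 3-4: record the transcript (the transcription call happens
    in a worker in the real pipeline) and mark the audio done, so the
    cleanup timer in step 5 can delete it."""
    JOBS[job_id]["transcript"] = transcript
    JOBS[job_id]["status"] = "done"
```

Polling then becomes trivial: the client asks for `JOBS[job_id]["status"]` until it reads "done".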

What mattered most (practical lessons)

  • Cost controls: enforce file size + duration limits, and rate limit abuse
  • Retries: uploads fail more than transcription does — make uploads resumable if possible
  • Queueing: long files should go async (job status + polling)
  • Deletion you can prove: set a retention window, log deletes, document it
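The "deletion you can prove" point deserves a sketch. A minimal retention sweep, assuming uploads live in one directory and using the 24-hour window from the privacy note; `sweep` and the directory layout are my own assumptions:

```python
import logging
import os
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("cleanup")

RETENTION_SECONDS = 24 * 60 * 60  # mirror the stated 24-hour window

def sweep(directory: str) -> list:
    """Delete files older than the retention window, logging each
    deletion so retention is provable rather than just promised."""
    now = time.time()
    deleted = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        age = now - os.path.getmtime(path)
        if os.path.isfile(path) and age > RETENTION_SECONDS:
            os.remove(path)
            log.info("deleted %s after retention window", path)
            deleted.append(path)
    return deleted
```

Run it on a timer (cron, or a scheduler inside the app) and keep the log lines: they're the audit trail when someone asks whether the file is really gone.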

Privacy note (server processing)

FastlyConvert processes files on the server to generate transcripts. Files are uploaded temporarily for processing and automatically deleted within 24 hours. Transfers use HTTPS encryption. Please upload only files you own or have permission to use.

Try it (no setup)

If you want to copy this workflow:

FastlyConvert supports 30+ languages, offers a free trial, and deletes files automatically within 24 hours.