AI Automation

Build an AI Meeting Summarizer with n8n and Whisper in 2026 (Step-by-Step)

Build a production AI meeting summarizer with n8n and Whisper for ₹27/meeting. Step-by-step tutorial with structured output, multi-destination delivery, and cost comparison vs Otter.ai and Fireflies.

Rishabh Sethia · Founder & CEO · 1 December 2025 · 13 min read · 1.5k words
#n8n#whisper#meeting summarizer#ai automation#openai

Every 45-minute meeting generates roughly 20 minutes of note-writing. Multiply that by 5 meetings a day per person across a 10-person team, and you are burning 16+ hours of collective time daily on documentation that nobody reads properly anyway.

We built this workflow for our own team first, then deployed variations for three clients. Total cost per meeting: approximately ₹27 (~$0.32). Compare that to Otter.ai at $16.99/month or Fireflies.ai at $19/month — and those tools do not push summaries into your CRM, create Google Docs, or trigger Slack notifications.

This tutorial walks through the complete build. You will have a working, importable n8n workflow by the end.

What You Will Learn

  • Complete n8n workflow: recording → transcription → summarization → multi-output distribution
  • OpenAI Whisper API configuration and the 25MB chunking gotcha
  • GPT-4o structured output prompting for meeting summaries
  • Cost comparison against commercial meeting note tools
  • Speaker diarization workarounds (Whisper's biggest limitation)

Prerequisites

  • n8n instance (cloud or self-hosted)
  • OpenAI API key with Whisper and GPT-4o access
  • Google Workspace account (for Docs output)
  • Slack workspace (for notification output)
  • Optional: HubSpot/CRM for note integration (see our HubSpot + n8n guide)

The Architecture

The workflow follows a linear pipeline with parallel outputs at the end:

Trigger → Audio Processing → Transcription (Whisper) → Summarization (GPT-4o) → Outputs (Google Docs + Slack + CRM)

Total execution time for a 45-minute recording: 3-5 minutes depending on file size and API latency.

Step 1: Set Up the Trigger

You have two options:

Option A — Google Drive monitoring (automated): Use n8n's Google Drive Trigger node. Configure it to watch a specific folder (e.g., "Meeting Recordings") and trigger on new file uploads. Set polling interval to 1 minute.

This is ideal when your recording tool (Zoom, Google Meet, or Loom) auto-uploads to Drive.

Option B — Manual upload via n8n form (on-demand): Create an n8n Form Trigger that accepts a file upload. Add fields for meeting title, attendees (comma-separated), and optional context.

We use Option B for ad-hoc recordings and Option A for regularly scheduled meetings.

Step 2: Audio Processing (The 25MB Gotcha)

This is where most implementations break. OpenAI's Whisper API has a 25MB file size limit. A 45-minute meeting recorded at standard quality is typically 30-80MB.

The fix: Add a Code node before the Whisper call that checks file size. If the file exceeds 25MB, use an Execute Command node to run FFmpeg:

ffmpeg -i input.webm -vn -acodec libmp3lame -ab 64k -ar 16000 output.mp3

This does three things:

  1. Strips video (-vn) — you do not need video for transcription
  2. Reduces bitrate to 64kbps — more than sufficient for speech
  3. Downsamples to 16kHz — Whisper's optimal sample rate

A 60MB webm file typically compresses to 5-8MB MP3 with this configuration. Transcription accuracy remains at 99%+ for clear speech.

Critical note: If your n8n instance is on a minimal VPS (1-2GB RAM), FFmpeg processing of large files will spike memory. Either allocate 4GB+ RAM or process in chunks using FFmpeg's segment feature.
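Outside n8n, the size check and compression step can be sketched in Python, shelling out to the FFmpeg command above. This is a minimal sketch; the function name and output path are illustrative, and in the actual workflow the check lives in a Code node followed by an Execute Command node:

```python
import os
import subprocess

WHISPER_LIMIT_BYTES = 25 * 1024 * 1024  # OpenAI Whisper API hard limit

def compress_if_needed(input_path: str, output_path: str = "output.mp3") -> str:
    """Return a path to an audio file that fits under the 25MB Whisper limit.

    Files already under the limit pass through untouched; larger files are
    stripped of video and re-encoded to 64kbps / 16kHz MP3 via FFmpeg.
    """
    if os.path.getsize(input_path) <= WHISPER_LIMIT_BYTES:
        return input_path
    subprocess.run(
        ["ffmpeg", "-y", "-i", input_path,
         "-vn",                    # strip the video stream
         "-acodec", "libmp3lame",  # encode audio as MP3
         "-ab", "64k",             # 64kbps is plenty for speech
         "-ar", "16000",           # Whisper's preferred sample rate
         output_path],
        check=True,
    )
    return output_path
```

The pass-through branch matters: re-encoding a file that is already small enough wastes CPU and, on lossy-to-lossy conversion, a little accuracy.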

Step 3: Transcription with OpenAI Whisper

Add an HTTP Request node configured for the Whisper API:

  • Method: POST
  • URL: https://api.openai.com/v1/audio/transcriptions
  • Authentication: Header Auth with your OpenAI API key
  • Body: Form-data with file (binary from previous node) and model set to whisper-1
  • Optional parameters:
    • language: Set explicitly for better accuracy (e.g., en for English)
    • response_format: verbose_json for timestamps, text for plain text
    • timestamp_granularities: segment for paragraph-level timestamps
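The node configuration above maps onto a plain HTTP payload. A minimal sketch, assuming the API key sits in an OPENAI_API_KEY environment variable (the binary file itself is attached as the multipart `file` field by the HTTP Request node):

```python
import os

def build_whisper_request(audio_path: str, language: str = "en") -> dict:
    """Assemble the Whisper transcription call as configured above."""
    return {
        "url": "https://api.openai.com/v1/audio/transcriptions",
        "headers": {"Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}"},
        "form_fields": {
            "model": "whisper-1",
            "language": language,               # explicit language improves accuracy
            "response_format": "verbose_json",  # include segment timestamps
            "timestamp_granularities[]": "segment",
        },
        "file_field": audio_path,               # sent as the multipart "file" part
    }
```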

Cost: $0.006 per minute of audio. A 45-minute meeting costs $0.27 for transcription.

Language support: Whisper handles 98 languages with high accuracy. For our clients in India, Dubai, and Singapore, this is a significant advantage — team meetings often switch between English, Hindi, and Arabic.

The speaker diarization gap: Whisper does not natively identify who is speaking. If you need "Speaker 1 said X, Speaker 2 responded Y," you have two options:

  1. pyannote.audio (open-source): Run it as a pre-processing step before Whisper. Requires a GPU-enabled server.
  2. AssemblyAI: Offers built-in diarization at $0.01/minute. 67% more expensive than Whisper but includes speaker labels.

For most business use cases, speaker identification is nice-to-have, not essential. Action items and decisions matter more than attribution.

Step 4: Summarization with GPT-4o

This is where the value multiplies. Feed the Whisper transcript to GPT-4o with a structured output prompt:

You are a meeting notes assistant. Given the following transcript, produce a structured summary in this exact JSON format:

{
  "meeting_title": "<inferred from context>",
  "date": "<ISO 8601>",
  "duration_minutes": <number>,
  "summary": "<3-5 sentence executive summary>",
  "decisions_made": ["<decision 1>", "<decision 2>"],
  "action_items": [
    {"task": "<description>", "owner": "<name or Unknown>", "deadline": "<if mentioned, else null>"}
  ],
  "key_discussion_points": ["<point 1>", "<point 2>"],
  "follow_up_date": "<if mentioned, else null>",
  "open_questions": ["<unresolved question 1>"]
}

Transcript:
{{$json.text}}
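One sketch of the corresponding chat completion payload, assuming the prompt above is stored as a system message. Setting response_format to json_object forces the model to emit valid JSON, so the downstream Docs/Slack/CRM nodes can parse it directly (the truncated SYSTEM_PROMPT here stands in for the full prompt above):

```python
import json
import os

SYSTEM_PROMPT = ("You are a meeting notes assistant. Given the following "
                 "transcript, produce a structured summary in the exact "
                 "JSON format described.")  # abbreviated stand-in for the prompt above

def build_summary_request(transcript: str) -> dict:
    """Build the GPT-4o chat completion payload for the summarization step."""
    return {
        "url": "https://api.openai.com/v1/chat/completions",
        "headers": {
            "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({
            "model": "gpt-4o",
            "response_format": {"type": "json_object"},  # guarantees parseable JSON
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": f"Transcript:\n{transcript}"},
            ],
            "temperature": 0.2,  # low temperature keeps summaries factual
        }),
    }
```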

Cost: GPT-4o processes a 45-minute transcript (~8,000 tokens input) for approximately $0.012 input + $0.04 output = $0.052.

Total cost per meeting: $0.27 (Whisper) + $0.052 (GPT-4o) = $0.32 (∼₹27)

Step 5: Multi-Output Distribution

The summarized JSON feeds into three parallel output nodes:

Output 1: Google Docs

Use n8n's Google Docs node to create a new document in a shared "Meeting Notes" folder. Format the JSON into readable markdown with headers for each section.
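One way to sketch that formatting step (function name and layout are illustrative; in the workflow this is a Code node feeding the Google Docs node):

```python
def summary_to_doc_text(s: dict) -> str:
    """Render the GPT-4o summary JSON into the text written to the Google Doc."""
    lines = [s["meeting_title"], f"Date: {s['date']}", "",
             "Summary", s["summary"], "",
             "Decisions"]
    lines += [f"- {d}" for d in s["decisions_made"]]
    lines += ["", "Action items"]
    for item in s["action_items"]:
        deadline = item["deadline"] or "no deadline"
        lines.append(f"- {item['task']} (owner: {item['owner']}, {deadline})")
    return "\n".join(lines)
```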

Output 2: Slack Notification

Post to a #meeting-notes channel with a condensed version: meeting title, executive summary, action items with owners, and a link to the full Google Doc.
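A minimal sketch of the condensed Slack message, using Slack's mrkdwn syntax (bold via asterisks, links as <url|label>); the doc_url parameter is assumed to come from the Google Docs node's output:

```python
def summary_to_slack_message(s: dict, doc_url: str) -> str:
    """Condense the summary JSON into the post for #meeting-notes."""
    lines = [f"*{s['meeting_title']}*", s["summary"], "", "*Action items*"]
    for item in s["action_items"]:
        lines.append(f"• {item['task']} ({item['owner']})")
    lines.append(f"\n<{doc_url}|Full notes in Google Docs>")
    return "\n".join(lines)
```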

Output 3: CRM Note (Optional)

If the meeting is a client call, use the attendee email to look up the HubSpot/CRM contact and create an engagement note. This integrates directly with the HubSpot automation workflow we covered.

Cost Comparison: DIY vs. Commercial Tools

|  | This Workflow | Otter.ai Pro | Fireflies.ai Pro | Grain |
| --- | --- | --- | --- | --- |
| Monthly cost (50 meetings) | ₹1,350 (~$16) | $16.99 | $19 | $19 |
| Monthly cost (200 meetings) | ₹5,400 (~$65) | $30 (Business) | $39 (Business) | $29 |
| Custom outputs (CRM, Slack, Docs) | Yes | No | Limited | Limited |
| Self-hosted / data privacy | Yes (n8n self-hosted) | No | No | No |
| Language support | 98 languages | ~30 | ~60 | English-focused |
| Speaker diarization | Requires add-on | Built-in | Built-in | Built-in |

The breakeven point: if your team has fewer than 50 meetings/month and does not need custom integrations, a commercial tool is simpler. If you need CRM integration, custom formatting, data sovereignty, or handle 100+ meetings/month, build your own.
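The DIY side of that comparison is simple arithmetic over the per-meeting figures derived earlier ($0.006/minute for Whisper plus ~$0.052 for GPT-4o); a small helper makes the breakeven easy to recompute for your own meeting volume and length:

```python
WHISPER_PER_MIN = 0.006      # Whisper API price per audio minute
GPT4O_PER_MEETING = 0.052    # GPT-4o cost per 45-minute transcript, from above

def diy_cost_usd(meetings: int, avg_minutes: int = 45) -> float:
    """Monthly API cost of this workflow in USD."""
    per_meeting = avg_minutes * WHISPER_PER_MIN + GPT4O_PER_MEETING
    return round(meetings * per_meeting, 2)
```

At 50 meetings/month this gives $16.10, right at Otter.ai Pro's $16.99; the DIY case strengthens as volume or integration needs grow.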

Our opinionated take: if you are paying $20/month for an AI meeting notes tool and your team has fewer than 50 meetings/month, you are getting reasonable value. But the moment you need those summaries flowing into your CRM, creating Jira tickets, or triggering follow-up workflows, you need n8n-based automation.

Common Issues and Fixes

Audio quality degrades accuracy: Noisy recordings drop Whisper accuracy from 99% to 80-85%. Solutions: use a dedicated microphone, enable noise suppression in your recording tool, and consider preprocessing with FFmpeg's anlmdn noise reduction filter.

Large meetings timeout: Recordings over 2 hours should be split into 30-minute chunks using FFmpeg before sending to Whisper. Process chunks sequentially and concatenate transcripts.
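The chunking step can be sketched as an FFmpeg segment command plus a trivial concatenation (function names are illustrative; transcription of each chunk happens in between):

```python
def segment_command(input_path: str, chunk_minutes: int = 30) -> list:
    """FFmpeg command that splits a long recording into fixed-length chunks."""
    return [
        "ffmpeg", "-i", input_path,
        "-f", "segment",
        "-segment_time", str(chunk_minutes * 60),  # chunk length in seconds
        "-c", "copy",                              # split without re-encoding
        "chunk_%03d.mp3",                          # chunk_000.mp3, chunk_001.mp3, ...
    ]

def join_transcripts(chunk_texts: list) -> str:
    """Concatenate per-chunk transcripts back into one document, in order."""
    return "\n".join(chunk_texts)
```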

GPT-4o hallucinates action items: This happens when the transcript is ambiguous. Add a validation prompt: "Only include action items that were explicitly stated as tasks, commitments, or next steps. Do not infer implied actions."

Google Drive trigger misses files: Set the polling interval to 1 minute, not the default 5 minutes. For critical workflows, use a Google Drive Push Notification (webhook) instead of polling.


Written by

Rishabh Sethia, Founder & CEO

Rishabh Sethia is the founder and CEO of Innovatrix Infotech, a Kolkata-based digital engineering agency. He leads a team that delivers web development, mobile apps, Shopify stores, and AI automation for startups and SMBs across India and beyond.
