v0.4.0: From a Quick Auth Update to a Two-Month Deep Dive
OpenTranscribe v0.4.0 started as a "quick authentication improvement" and snowballed into 281 commits, two months of engineering, and a complete transformation of the platform. What began as adding LDAP support turned into enterprise auth, neural search, a native transcription pipeline, GPU optimization patches submitted upstream to PyAnnote and WhisperX, a frontend security hardening sprint, and dozens of features born from processing 1,400 real-world podcasts. This is the story of how dirty data, user requests, and engineering curiosity turned a side project into a production-grade platform.
How We Got Here
The plan was simple: add LDAP authentication and ship it. That took about a week. Then came a request for Keycloak OIDC support. Then PKI certificates. Then MFA. Before we knew it, we had a full enterprise authentication system with four methods running simultaneously, audit logging, rate limiting, and FedRAMP-aligned security controls.
But authentication was just the beginning.
We started loading real data -- 1,400 podcast episodes, ranging from 30-minute interviews to 5-hour marathon recordings. The kind of messy, real-world audio that exposes every assumption in your code. Overlapping speakers, background music, mic pops, dead air, non-English segments mid-conversation, and transcription artifacts that looked like keyboard smashes.
That dataset became our proving ground. Every bug we fixed, every feature we added, came from something that broke or felt wrong while using the app with real content at scale. The search was too slow. Speaker matching was inconsistent across files. The GPU sat idle while the CPU did preprocessing. Summarization prompts produced garbage on noisy transcripts.
Each fix opened the door to the next improvement. Fixing search led us to OpenSearch neural vectors. Fixing speaker matching led us to PyAnnote v4 and alias-based index architecture. Fixing GPU idle time led us to a 3-stage Celery pipeline. And somewhere in the middle, we found ourselves submitting performance patches upstream to PyAnnote and contributing a 273x speedup to WhisperX.
Two months later, here we are.
What Changed: The Full Picture
Enterprise Authentication
Four authentication methods that can run simultaneously, configured through the admin UI without restarts:
- Local -- Username/password with bcrypt, MFA (TOTP), password policies, and account lockout
- LDAP/Active Directory -- Enterprise directory integration with group mapping
- Keycloak OIDC -- OpenID Connect with any standards-compliant identity provider
- PKI/X.509 -- Certificate-based mTLS authentication for high-security environments
All backed by audit logging, per-IP and per-user rate limiting, refresh token rotation, and configurable session management. External IdP users (PKI, Keycloak) bypass local MFA since their identity provider handles it.
Native Transcription Pipeline
We replaced the legacy WhisperX pipeline with a native engine built directly on faster-whisper's BatchedInferencePipeline and PyAnnote v4. The old pipeline had three problems we couldn't work around: WAV2VEC2 alignment took 55% of processing time, WhisperX hardcoded word_timestamps=False in batched mode, and PyAnnote v4 required a 465-line monkey-patching layer.
The native pipeline eliminates all three. Cross-attention DTW provides word timestamps during transcription (no separate alignment pass), and we call PyAnnote v4 directly.
Benchmark (3.3-hour podcast, RTX A6000):
| Pipeline | Total Time | Alignment | Speaker Accuracy |
|---|---|---|---|
| Old (WhisperX + WAV2VEC2) | 706s | 389s | ~96% |
| Native (v0.4.0) | 332s | N/A | 95.2% |
The native pipeline finishes in under half the time (332s vs. 706s). The accuracy difference of roughly one percentage point is irrelevant in practice -- diarization segments are 2-30 seconds long, so a word timestamp off by 200ms still maps to the correct speaker.
273x Faster Speaker Assignment (Upstream Contribution)
WhisperX's assign_word_speakers() used an O(n*m) linear scan -- for each of the n words, it iterated over all m diarization segments to find the match. We replaced it with an interval tree for O(log m) lookups per word, plus NumPy vectorized operations. Processing dropped from 10.2 seconds to 0.037 seconds for a 3-hour file.
This was contributed upstream as a pull request to the WhisperX repository.
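The lookup idea can be sketched in a few lines. This is not the code from the pull request: it assumes the diarization segments are sorted and non-overlapping, so a plain binary search (the standard library's bisect) can stand in for the interval tree, and the function name and tuple layouts are hypothetical.

```python
from bisect import bisect_right


def assign_speakers(words, segments):
    """Label each word with the speaker whose segment contains its midpoint.

    words:    list of (start, end, text) tuples, in seconds
    segments: list of (start, end, speaker) tuples, sorted by start and
              non-overlapping (an assumption of this sketch)
    """
    starts = [seg[0] for seg in segments]
    labeled = []
    for w_start, w_end, text in words:
        mid = (w_start + w_end) / 2
        # Binary search: index of the last segment starting at or before mid.
        i = bisect_right(starts, mid) - 1
        speaker = None
        if i >= 0 and mid <= segments[i][1]:
            speaker = segments[i][2]
        labeled.append((text, speaker))
    return labeled


segments = [(0.0, 4.0, "SPEAKER_00"), (4.5, 9.0, "SPEAKER_01")]
words = [(0.2, 0.6, "hello"), (4.1, 4.3, "um"), (5.0, 5.4, "hi")]
result = assign_speakers(words, segments)
print(result)
```

Each word costs one binary search instead of a pass over every segment. Real diarization output can contain overlapping segments, which is where an interval tree is needed and this simplified version falls short.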
GPU Optimization Patches for PyAnnote
We profiled PyAnnote's diarization pipeline and found that the embedding extraction loop issues ~850,000 individual CUDA kernel launches for a 4.7-hour file with 8 speakers, because embedding_batch_size defaults to 1. We're testing these patches and contributing them upstream:
- Patch 1: Increase embedding_batch_size from 1 to 32 (26x fewer kernel launches)
- Patch 2: Insert torch.cuda.empty_cache() between diarization sub-stages
The combined result: VRAM usage becomes predictable and consistent (~14.6 GB regardless of file length) instead of varying wildly between 3 GB and 25 GB. This is critical for scheduling concurrent GPU tasks.
Additional patches for pinned memory transfers, DataLoader-based prefetching, and CUDA stream overlap are documented and planned.
3-Stage Celery Pipeline
The old monolithic transcription task blocked the GPU while doing CPU work (audio extraction, search indexing). The new architecture splits processing into three Celery stages across separate queues:
- Preprocess (CPU) -- Download, extract audio, normalize
- Transcribe + Diarize (GPU) -- Whisper + PyAnnote
- Postprocess (CPU) -- Indexing, embeddings, downstream tasks
While the GPU processes file N, the CPU preprocesses file N+1. For batch imports, the GPU never sits idle.
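The overlap between stages can be illustrated without Celery. The sketch below uses standard-library threads and queues as stand-ins for the three Celery queues; the worker functions are hypothetical placeholders, and the real pipeline's task definitions are not shown here.

```python
import queue
import threading


def stage(worker, inbox, outbox):
    """Run worker on each item from inbox until a None sentinel arrives."""
    while True:
        item = inbox.get()
        if item is None:
            if outbox is not None:
                outbox.put(None)  # pass the shutdown signal downstream
            break
        result = worker(item)
        if outbox is not None:
            outbox.put(result)


# Hypothetical stand-ins for the real work done in each stage.
def preprocess(name):
    return f"{name}:audio"


def transcribe(audio):
    return f"{audio}:transcript"


done = []


def postprocess(transcript):
    done.append(f"{transcript}:indexed")


cpu_q, gpu_q, post_q = queue.Queue(), queue.Queue(), queue.Queue()
threads = [
    threading.Thread(target=stage, args=(preprocess, cpu_q, gpu_q)),
    threading.Thread(target=stage, args=(transcribe, gpu_q, post_q)),
    threading.Thread(target=stage, args=(postprocess, post_q, None)),
]
for t in threads:
    t.start()
for name in ["ep1", "ep2", "ep3"]:
    cpu_q.put(name)
cpu_q.put(None)  # no more files
for t in threads:
    t.join()
print(done)
```

Because each stage has its own queue and worker, the middle stage can start on the first file as soon as it is preprocessed while the first stage moves on to the next one.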
Hybrid Search with Neural Vectors
Full-text BM25 search combined with semantic vector search via OpenSearch ML Commons. Search for "budget discussion" and find segments about "financial planning" even if those exact words never appear.
- BM25 + neural vectors merged via Reciprocal Rank Fusion (RRF)
- Six model tiers from fast (384-dim MiniLM) to best (768-dim mpnet)
- Chunk-level indexing for precise segment retrieval
- Automatic model download and offline caching
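For readers unfamiliar with Reciprocal Rank Fusion, here is a minimal sketch of the general technique. It assumes the conventional constant k=60, which may differ from the value OpenTranscribe or OpenSearch uses; the function name and segment IDs are made up for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is the conventional default, assumed here.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["seg-a", "seg-b", "seg-c"]      # keyword ranking
neural_hits = ["seg-c", "seg-a", "seg-d"]    # semantic ranking
fused = reciprocal_rank_fusion([bm25_hits, neural_hits])
print(fused)
```

RRF uses only rank positions, so BM25 scores and vector similarities never have to be normalized onto a common scale. A segment that ranks well in both lists beats one that ranks first in only one.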
Speaker Intelligence
Speaker management evolved from basic diarization labels into a full intelligence system:
- Cross-video voice fingerprinting -- WeSpeaker 256-dim embeddings with alias-based OpenSearch indices
- Gender classification -- ML-based gender model for cluster validation
- Speaker profile pages -- Aggregated stats, appearance history, profile sharing via collections
- Cluster analysis -- Outlier detection, blacklisting, merge interface
- Embedding migration -- Atomic alias swap from v3 (512-dim) to v4 (256-dim) indices with zero downtime
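The zero-downtime migration in the last item relies on the aliases API, where a remove and an add submitted in one request are applied together. A minimal sketch of the request body follows; the alias and index names are hypothetical, not the ones OpenTranscribe uses.

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Build the body for a single _aliases request that moves an alias."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names for illustration only.
body = alias_swap_actions("speakers", "speakers_v3", "speakers_v4")
print(json.dumps(body, indent=2))
```

Application code queries the alias rather than a concrete index, so it picks up the new index as soon as the swap is applied and never sees a state where the alias points at nothing.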
Cloud ASR Providers
Not everyone has a GPU. v0.4.0 adds cloud ASR support for API-lite deployments:
- Deepgram, AssemblyAI, OpenAI Whisper API, Google, AWS, Azure, Speechmatics, Gladia
- 2 GB Docker image (vs 8.9 GB for full GPU image)
- Cloud-transcribed files still get local speaker embedding extraction for cross-file matching
Everything Else
There's too much to list exhaustively, but here are the highlights:
- Auto-labeling -- AI suggests tags and collections from transcript content, with fuzzy dedup
- Custom vocabulary -- Domain-specific hotwords for medical, legal, corporate, government terminology
- Per-collection AI prompts -- Different summarization styles for different content types
- Organization context -- Inject domain knowledge into all LLM prompts
- Disable diarization / AI summary -- Per-upload and per-user toggles to skip processing stages you don't need
- Selective reprocessing -- Stage picker to re-run only specific pipeline stages on existing files
- File retention -- Admin-configurable auto-deletion with audit logging for GDPR compliance
- User groups & collection sharing -- Granular viewer/editor permissions, cross-user prompt and config sharing
- Full PWA & mobile overhaul -- Installable app, bottom navigation, responsive layouts, 15+ mobile fixes
- Blackwell GPU support -- NVIDIA GB10x/GB20x (DGX Spark) via the --blackwell flag with dedicated Dockerfile
- Multilingual summaries -- Generate AI output in 12 languages
- URL import from 1,800+ platforms -- YouTube, TikTok, Vimeo, and more via yt-dlp with 2026 bot-bypass improvements
- Configurable URL download quality -- User-selectable resolution and audio quality for URL imports
- Export formats -- TXT (configurable columns), SRT, VTT, JSON, CSV, plus bulk ZIP export
- Transcript comments -- Timestamped annotations on transcript segments
- Settings redesign -- Tabbed navigation, per-user preferences, speaker behavior defaults
- Codebase modularization -- 9 new shared backend modules, 6 new UI components, dead code removal
- Security hardening -- CSP headers, private MinIO buckets, AES-256-GCM encryption, non-root containers, FIPS 140-3 readiness
- Russian UI language -- 8th supported interface language (added in v0.3.3)
- Protected media auth -- Plugin architecture for downloading from password-protected corporate video portals (v0.3.3)
The Final Polish: Frontend Hardening Sprint
After the backend and pipeline work stabilized, we took a hard look at the frontend and found that two months of feature sprints had accumulated technical debt and security gaps that needed attention before shipping. A dedicated audit sprint closed those gaps.
Session Security
The biggest finding: Svelte stores persisted across logout. Logging out as User A and logging in as User B on the same device leaked data from the previous session — toasts, search queries, gallery filters, upload queues, WebSocket notifications, transcript segments, speaker colors, and a dozen other pieces of state remained in memory. The auth cookie was invalidated on the backend, but the frontend kept showing User A's data until the next page reload.
We fixed this with a single source of truth for session teardown: frontend/src/lib/session/clearUserState.ts. It clears 17+ subsystems in parallel on every login/logout transition, plus the user-scoped localStorage keys (notification queue, upload queue, remembered upload values). Preferences like theme, locale, and view mode are explicitly preserved — those are user choices, not user data.
While we were in there, we also found and fixed three more session-level leaks:
- Flash of Authenticated Content (FOAC) on fresh page loads. The layout's {:else} branch rendered the protected page slot for ~1-2 frames while the async goto('/login') was in flight. Unauthenticated users briefly saw the gallery UI AND triggered /files API calls. Fixed by gating all rendering behind authReady && isAuthenticated && !isPublicPath, showing a loading screen in the route-mismatch state.
- In-flight request race. When a user clicks a file and immediately logs out, the GET /files/{uuid} response can still resolve after clearUserState() and repopulate the store. Fixed with a session-scoped AbortController in lib/axios.ts that cancels all pending requests on logout (except auth endpoints like /auth/logout, which must always complete).
- Back-button (bfcache) after logout. Modern browsers restore pages from in-memory snapshots when users navigate back, bypassing all our auth guards and store clearing because the DOM is served from the cached snapshot. Fixed with a pageshow event listener that detects event.persisted === true and forces window.location.reload() to discard the snapshot.
XSS Hardening
The audit turned up a bypassable regex sanitizer in the search result highlighting utility. The regex /<(?!\/?mark[\s>])[^>]*>/g only matched opening < characters, so a payload like </mark><script>alert(1)</script><mark> would pass through unscathed. We replaced it with DOMPurify, using a strict tag whitelist that allows only the specific markup produced by the highlight pipeline (mark, span, br, ul, li, em, strong, div, p with class and data-match-index attributes).
As defense-in-depth, we wrapped every other {@html} render site — 8 in total — with the same sanitizer: transcript segments, search highlights, LLM-generated topic summaries, AI summary sections. The underlying escape pipeline was already correct, but running everything through DOMPurify as a final pass means a future code change that forgets to escape can't introduce an XSS.
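One concrete weakness of a tag-filtering regex of this shape can be shown directly. The sketch ports the pattern quoted above to Python's re module rather than running the JavaScript original. The function name is hypothetical, and this illustrates the pattern as printed, not the project's actual utility.

```python
import re

# The pattern quoted above, ported to Python's re syntax.
TAG_FILTER = re.compile(r"<(?!/?mark[\s>])[^>]*>")


def strip_non_mark_tags(html):
    """Remove every tag the pattern matches, i.e. anything that is not <mark>."""
    return TAG_FILTER.sub("", html)


# Ordinary tags are removed as intended.
print(strip_non_mark_tags("<b>bold</b>"))

# But the pattern only checks the tag name, never the attributes,
# so an event handler on an allowed tag survives untouched.
payload = '<mark onmouseover="alert(1)">hit</mark>'
cleaned = strip_non_mark_tags(payload)
print(cleaned == payload)
```

An allow-list sanitizer that parses the markup and checks both tags and attributes closes this class of gap, which is the role DOMPurify plays here.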
And one config fix: production source maps were enabled. .js.map files exposed variable names, API endpoints, error messages, and the entire business logic to any visitor via DevTools. Disabled with a one-line sourcemap: mode !== 'production' change to vite.config.ts.
Upload Modal Redesign
The old upload modal was a 4,603-line monolith with accordion sections that didn't scroll. We redesigned it as a 6-step linear stepper:
Media → Tags → Collections → Speakers → Options → Submit
Plus a conditional Extract step for large video files. The new flow decomposes the monolith into a 1,294-line coordinator plus 9 focused components (each under ~470 lines). All three upload sources (file, URL, recording) share steps 2-6, so URL downloads and recordings now get full tag/collection/speaker configuration — previously file-only.
The stepper has a "Review with defaults" shortcut for power users who want to skip the middle, a "Remember previous values" feature that pre-fills from the last upload, and clickable step dots that let users jump back to any visited step. The backdrop no longer closes the modal (prevents data loss), and dropdowns for tags/collections were converted from auto-open search boxes to clean checkbox lists that don't visually explode on focus.
Skeleton Loaders Replace Spinners
Research shows skeleton screens feel ~20% faster than spinners for the same actual load time (Nielsen Norman). We replaced generic <Spinner size="large"/> loading states on four high-traffic pages with skeleton components that mirror the final layout:
- File detail page — full 2-column skeleton with header/video/transcript shapes
- Home gallery — card grid skeleton matching the media file layout
- Search results — horizontal search-card skeleton with thumbnail + snippet shape
- Speaker clusters/profiles/inbox — list row and profile card skeletons
The new FileDetailSkeleton, CardGridSkeleton, and ListRowSkeleton components live in frontend/src/components/ and frontend/src/components/ui/ and are reusable across any page with a predictable layout. Loading feels structured and anticipatory instead of "waiting."
Gallery Click Feedback
Clicking a file card used to have a visible delay before the detail page appeared, leading users to click twice (and briefly see the same file loaded twice). Fixed with a three-layer approach:
- Instant press state -- dims the clicked card (opacity 0.72, scale 0.985) and disables pointer-events on click
- Double-click guard -- the navigatingTo state variable prevents the second click from firing
- Prefetch on mousedown -- kicks off /files/{uuid} ~50-100ms before the click handler runs, giving the browser a head start
Users now get immediate visual feedback (~16ms) plus a shorter actual wait.
Collection & Share Modal Polish
The Create/Edit Collection modals were bare forms with no explanation. Added intro banners, field hints with maxlength indicators, and pre-selected the Universal Content Analyzer prompt (the system default) so new collections work out of the box.
The Share Collection modal got the biggest facelift: an introductory sentence explaining what sharing does, a collection name banner with a folder icon, and a permission-level reference card showing Viewer/Editor labels inline with descriptions (previously those descriptions only appeared in tooltips, invisible to most users). Empty state added for collections with no existing shares so the modal doesn't look broken on first open.
Also fixed a long-standing visual glitch where the Manage Collections modal showed a gray "card within a card" — the inner panel had its own surface background that duplicated the outer modal container.
Lessons from 1,400 Podcasts
Processing real data at scale taught us things that unit tests never would:
Dirty audio is the norm, not the exception. Background music, overlapping speakers, mic switches mid-recording, dead air followed by sudden loud segments. Our garbage cleanup system, segment deduplication, and confidence-based filtering all came from encountering these patterns hundreds of times.
Speaker diarization is non-deterministic. The same audio processed twice can produce slightly different speaker boundaries. Agglomerative clustering itself has no random initialization, so the variation most likely enters upstream: small numerical differences between runs (for example, from non-deterministic GPU kernels) shift the embeddings slightly, and borderline segments end up clustered differently. We stopped fighting this and instead built systems that handle it gracefully: confidence scoring, manual verification workflows, and merge interfaces.
Search needs to understand meaning, not just words. Keyword search failed for the most common use case: "find the part where they talked about X." People don't remember exact phrases. Neural search solved this, but only after we built chunk-level indexing so results point to specific transcript segments rather than entire files.
GPU time is precious. When you're processing 1,400 files, every second of GPU idle time adds up. The 3-stage pipeline, warm model caching (saving 10+ hours of model loading for batch imports), and sequential VRAM management all came from watching nvidia-smi and asking "why is it doing nothing right now?"
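The warm-model idea amounts to loading once per worker process and reusing the result across files. A minimal sketch, assuming a hypothetical loader function and model name, with functools.lru_cache standing in for whatever caching the real workers use:

```python
from functools import lru_cache

load_count = 0


def expensive_load(name):
    """Hypothetical stand-in for loading model weights onto the GPU."""
    global load_count
    load_count += 1
    return {"name": name}


@lru_cache(maxsize=2)
def get_model(name):
    """Return a cached model, loading it only on first use in this process."""
    return expensive_load(name)


# Simulate a batch import: every file asks for the same model.
for _ in range(1400):
    model = get_model("large-v3")

print(load_count)
```

The load cost is paid once per worker instead of once per file, which is where the savings on large batch imports come from.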
Upgrading
```bash
# Standard upgrade
docker compose pull && docker compose up -d

# Or with the management script
./opentr.sh stop && ./opentr.sh start prod
```
Alembic migrations run automatically on startup. No manual database changes required. All existing data is preserved.
If you want to reprocess existing files with the improved pipeline, use the selective reprocessing feature in the file detail view.
What's Next
The pipeline optimization work continues -- we're actively testing PyAnnote speed patches for upstream contribution. Cloud ASR provider integration is expanding. And the search infrastructure opens the door for RAG-based question answering over your transcript library.
We're also looking at live/streaming transcription, a mobile companion app, and deeper analytics on speaker patterns and conversation dynamics.
Get Involved
OpenTranscribe is open source (AGPL-3.0), and we welcome contributions.
This release wouldn't exist without the community feedback that pushed us from "add LDAP" to "build a platform."
Special thanks to everyone who contributed code and ideas to this release:
Code contributors:
- @vfilon (Vitali Filon) — Authored the entire LDAP/Active Directory authentication feature: initial auth engine, username attribute mapping, auth_type handling, password restrictions for non-local users, conditional settings UI, docs, and migration detection. Nine commits that became the foundation of the enterprise auth system.
- @imorrish (Ian Morrish) — Submitted the LDAP PR and contributed the Postgres password reset guide to the troubleshooting docs.
Feature requests and bug reports that shipped in this release:
- @imorrish — Scrollable speaker dropdown (#129), filename in AI summary template (#138), collection/tag selection at upload (#145), per-collection default AI prompt (#146)
- @it-service-gemag — Disable diarization per upload (#151), disable AI summary per upload (#152), per-transcription Whisper model selection (#153)
- @Politiezone-MIDOW — File retention and auto-deletion system (#134)
- @coltrall — Docker daemon detection fix in the installation script (#137)
- @SQLServerIO (Wes Brown) — Pagination for large transcripts, fixing file detail page hang with long recordings (#109)
Thank you to everyone who filed issues, tested pre-releases, and shared their use cases — your feedback directly drives what gets built.
Full changelog: GitHub Releases
