Pipeline Optimization

OpenTranscribe invests heavily in pipeline performance engineering. This page documents the optimizations that make the transcription pipeline fast and memory-efficient, including upstream contributions to WhisperX and ongoing work on PyAnnote diarization speed.

Native Pipeline Architecture

OpenTranscribe v0.4.0 replaced the legacy WhisperX pipeline with a native pipeline built directly on faster-whisper's BatchedInferencePipeline and PyAnnote v4's speaker diarization API. This removes the three architectural bottlenecks described in the next section.

Why We Replaced WhisperX

WhisperX was the original transcription backend but had three constraints that limited performance:

WAV2VEC2 alignment bottleneck: The forced alignment step took 389 seconds (55% of total processing time) for a 3-hour file, processing segments sequentially with 300+ separate CUDA kernel launches. Native word timestamps via cross-attention DTW eliminate this step entirely.
No batched word timestamps: WhisperX hardcodes word_timestamps=False in its batched pipeline (asr.py lines 370-372). You had to choose batched speed OR word timestamps -- not both.
Monkey-patching for PyAnnote v4: PyAnnote v4 API changes required a 465-line compatibility layer (pyannote_compat.py) that monkey-patched WhisperX's diarization internals. The native pipeline calls PyAnnote v4 directly.
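
As a rough illustration of getting batched speed and word timestamps together, the sketch below requests word timestamps from faster-whisper's BatchedInferencePipeline. The transcribe_with_words helper, the model name, the compute type, and the batch size are assumptions chosen for illustration, not OpenTranscribe's production settings.

```python
def transcribe_with_words(audio_path, model_name="large-v3", batch_size=16):
    """Return (word, start, end) tuples from a batched faster-whisper run.

    Hypothetical helper: model_name and batch_size are illustrative defaults.
    """
    # Imported lazily so this module still loads where faster-whisper is absent.
    from faster_whisper import BatchedInferencePipeline, WhisperModel

    model = WhisperModel(model_name, device="cuda", compute_type="float16")
    pipeline = BatchedInferencePipeline(model=model)

    # word_timestamps=True requests per-word timings in the batched path.
    segments, _info = pipeline.transcribe(
        audio_path, batch_size=batch_size, word_timestamps=True
    )

    words = []
    for segment in segments:  # segments is a lazy generator
        for word in segment.words or []:
            words.append((word.word, word.start, word.end))
    return words
```

Calling transcribe_with_words("meeting.wav") requires a CUDA device and downloads the model on first use.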

Benchmark Results

All benchmarks use a 3.3-hour podcast (11,893 seconds) on an NVIDIA RTX A6000 (48 GB VRAM):

Configuration	Total Time	Transcription	Alignment	Diarization	Speaker Assign
Old baseline (WhisperX + WAV2VEC2)	706s	75s	389s	194s	10.2s
WhisperX batched, alignment OFF	304s	76s	skipped	198s	0.04s
Native pipeline (production)	332s	105s	N/A	192s	0.5s

The native pipeline is 2.1x faster than the old baseline while maintaining 95.2% speaker assignment accuracy (vs ~96% with WAV2VEC2 alignment -- a negligible difference for multi-second diarization segments).

Sequential VRAM Management

The pipeline loads models sequentially to minimize peak VRAM:

Load transcription model (~3-4 GB)
Run transcription
Release transcription model VRAM
Load diarization model (~5-9 GB depending on audio length)
Run diarization
Release diarization model

This keeps peak VRAM under ~9 GB, enabling the pipeline to run on 8-12 GB GPUs. The ModelManager singleton caches loaded models between Celery tasks (since the GPU worker has concurrency=1), saving ~15 seconds of model loading overhead per file.
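
The effect of loading models one after another can be shown with a toy memory tracker. This is a minimal sketch that only does bookkeeping: VramTracker and the two run functions are hypothetical, and the 4 GB and 9 GB figures are the upper ends of the ranges quoted above, not measurements.

```python
from contextlib import contextmanager


class VramTracker:
    """Toy accounting of GPU memory in GB; stands in for real allocations."""

    def __init__(self):
        self.current = 0.0
        self.peak = 0.0

    @contextmanager
    def loaded(self, name, size_gb):
        # "Load" the model: usage rises for the duration of the with-block.
        self.current += size_gb
        self.peak = max(self.peak, self.current)
        try:
            yield name
        finally:
            # "Release" the model before the next stage loads.
            self.current -= size_gb


def run_sequential(tracker, transcribe_gb=4.0, diarize_gb=9.0):
    with tracker.loaded("transcription", transcribe_gb):
        pass  # transcription would run here
    with tracker.loaded("diarization", diarize_gb):
        pass  # diarization would run here
    return tracker.peak


def run_simultaneous(tracker, transcribe_gb=4.0, diarize_gb=9.0):
    with tracker.loaded("transcription", transcribe_gb):
        with tracker.loaded("diarization", diarize_gb):
            pass  # both models resident at once
    return tracker.peak


sequential_peak = run_sequential(VramTracker())
simultaneous_peak = run_simultaneous(VramTracker())
print(sequential_peak, simultaneous_peak)  # 9.0 13.0
```

With sequential loading the peak equals the largest single model rather than the sum of both.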

Scaling Projections

Configuration	Per File (3.3h)	2,500 Files	With 4x GPU Workers
Old baseline (WhisperX + alignment)	706s	490 hours	123 hours
Native pipeline	332s	231 hours	58 hours
+ warm model caching	~320s	222 hours	56 hours
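
The projections in this table follow from simple arithmetic on the per-file times. The sketch below reproduces them, assuming every file takes as long as the 3.3-hour reference file and that throughput scales linearly with the number of GPU workers. The function and variable names are illustrative.

```python
FILES = 2500
GPU_WORKERS = 4


def projected_hours(seconds_per_file, files=FILES, workers=1):
    """Total wall-clock hours, assuming perfect linear scaling across workers."""
    return seconds_per_file * files / 3600 / workers


per_file_seconds = {
    "old baseline": 706,
    "native pipeline": 332,
    "native + warm cache": 320,
}

projections = {
    name: (
        round(projected_hours(seconds)),
        round(projected_hours(seconds, workers=GPU_WORKERS)),
    )
    for name, seconds in per_file_seconds.items()
}

for name, (single, multi) in projections.items():
    print(f"{name}: {single} h on one worker, {multi} h on {GPU_WORKERS} workers")
```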

273x Faster Speaker Assignment (Upstream WhisperX PR)

The most dramatic single optimization was replacing WhisperX's assign_word_speakers() function. The original implementation used an O(m) linear scan over the diarization segments for every word, giving O(n * m) overall:

# WhisperX original: O(n*m) where n=words, m=diarization segments
for word in words:
    for segment in diarization_segments:
        if segment.start <= word.start <= segment.end:
            word.speaker = segment.speaker
            break

Our replacement uses an interval tree for O(log m) lookups per word, combined with NumPy vectorized operations for batch processing:

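# OpenTranscribe (simplified): O(n log m) via interval tree lookups
import numpy as np
from intervaltree import IntervalTree

# diarization_segments: iterable of (start, end, speaker) tuples
tree = IntervalTree.from_tuples(diarization_segments)
# tree[t] returns the set of intervals containing t; keep one speaker label (or None)
speakers = np.array(
    [next((iv.data for iv in tree[w.start]), None) for w in words], dtype=object
)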

Results

Metric	WhisperX Original	OpenTranscribe	Speedup
150 segments, 1,349 words	10.2 seconds	0.037 seconds	273x
Algorithm complexity	O(n * m) total	O(n log m) total	--

This optimization was contributed upstream as a pull request to the WhisperX repository. The implementation is in backend/app/transcription/speaker_assigner.py.
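
To make the complexity difference concrete without extra dependencies, the sketch below compares the linear scan with a binary search over segment start times. It is not the project's implementation: it swaps the interval tree for the standard-library bisect module, which only works under the assumption that segments are sorted by start time and do not overlap. The function names and sample data are illustrative.

```python
from bisect import bisect_right


def assign_linear(word_starts, segments):
    """Baseline: scan the segments for every word, O(n * m)."""
    speakers = []
    for t in word_starts:
        speaker = None
        for start, end, label in segments:
            if start <= t <= end:
                speaker = label
                break
        speakers.append(speaker)
    return speakers


def assign_bisect(word_starts, segments):
    """Binary search per word, O(n log m).

    Assumes segments are sorted by start time and do not overlap.
    """
    starts = [start for start, _, _ in segments]
    speakers = []
    for t in word_starts:
        i = bisect_right(starts, t) - 1  # last segment starting at or before t
        if i >= 0 and t <= segments[i][1]:
            speakers.append(segments[i][2])
        else:
            speakers.append(None)  # word falls in a gap between segments
    return speakers


segments = [
    (0.0, 2.5, "SPEAKER_00"),
    (2.6, 5.0, "SPEAKER_01"),
    (5.5, 9.0, "SPEAKER_00"),
]
word_starts = [0.1, 2.4, 2.55, 3.0, 5.2, 8.9, 9.5]

linear_result = assign_linear(word_starts, segments)
bisect_result = assign_bisect(word_starts, segments)
print(bisect_result)
```

Diarization output can contain overlapping speech, which is one reason an interval tree is the more general choice.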

Vectorized Segment Deduplication

The transcription pipeline produces both coarse VAD-chunked segments and fine-grained subsegments for the same time ranges. The deduplication module (backend/app/utils/segment_dedup.py) handles this using vectorized NumPy operations:

NLTK punkt sentence splitting (replicates what WAV2VEC2 alignment implicitly provided)
Containment detection -- removes coarse "parent" segments covered by finer children
Exact text duplicate removal
Time+text overlap merging

Performance: Under 0.2 seconds for 3,000+ segments. Quality: 99.5% match to the WAV2VEC2-aligned baseline.
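
Two of the steps listed above, exact duplicate removal and containment detection, can be sketched in plain Python. This is an illustrative O(n^2) version written for readability, not the vectorized NumPy code in segment_dedup.py. The function names, the dict layout, the choice to key duplicates on time plus text, and the min_children threshold are all assumptions.

```python
def drop_exact_duplicates(segments):
    """Keep the first occurrence of each (start, end, text) triple."""
    seen = set()
    unique = []
    for seg in segments:
        key = (seg["start"], seg["end"], seg["text"].strip())
        if key not in seen:
            seen.add(key)
            unique.append(seg)
    return unique


def drop_covered_parents(segments, min_children=2):
    """Drop a segment when enough strictly shorter segments lie inside it."""
    kept = []
    for parent in segments:
        parent_len = parent["end"] - parent["start"]
        children = [
            seg
            for seg in segments
            if seg is not parent
            and seg["start"] >= parent["start"]
            and seg["end"] <= parent["end"]
            and (seg["end"] - seg["start"]) < parent_len
        ]
        if len(children) < min_children:
            kept.append(parent)
    return kept


raw = [
    {"start": 0.0, "end": 10.0, "text": "Hello there. How are you?"},
    {"start": 0.0, "end": 4.0, "text": "Hello there."},
    {"start": 4.0, "end": 10.0, "text": "How are you?"},
    {"start": 4.0, "end": 10.0, "text": "How are you?"},
    {"start": 12.0, "end": 15.0, "text": "Fine."},
]

cleaned = drop_covered_parents(drop_exact_duplicates(raw))
cleaned_texts = [seg["text"] for seg in cleaned]
print(cleaned_texts)
```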

GPU Optimization Patches for PyAnnote

OpenTranscribe is actively contributing performance patches upstream to PyAnnote. These patches target the diarization pipeline's GPU utilization, which accounts for 50-60% of total processing time.

The Problem

PyAnnote's speaker diarization has three GPU-intensive sub-stages:

Segmentation -- SincNet + LSTM sliding window over audio (~10-15s for 1 hour)
Embedding extraction -- WeSpeaker ResNet34 forward pass per chunk-speaker pair (~40-50s for 1 hour)
Clustering -- Agglomerative clustering on CPU (~5-10s)

The embedding extraction stage dominates because it runs one GPU forward pass per chunk-speaker pair. For a 4.7-hour file with 8 speakers, that means approximately 850,000 individual CUDA kernel launches with the default embedding_batch_size=1.

Patch 1: Increase `embedding_batch_size` (Testing)

PyAnnote defaults to embedding_batch_size=1, meaning each chunk-speaker pair gets its own GPU forward pass. Batching 32 pairs per pass cuts the number of launches 32x:

`embedding_batch_size`	Kernel Launches (4.7h/8 speakers)	Reduction
1 (default)	~850,000	Baseline
32 (proposed)	~26,500	32x fewer

Impact: Processing speed is unchanged (within run-to-run variation), but VRAM behavior becomes predictable and consistent -- critical for concurrent task scheduling.
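
The launch counts in the table are just the number of chunk-speaker pairs divided by the batch size, rounded up. The sketch below shows that arithmetic together with a generic batching helper. The batched and count_forward_passes names are hypothetical and this is not PyAnnote code.

```python
from itertools import islice
from math import ceil


def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def count_forward_passes(num_pairs, batch_size):
    """One forward pass per batch, with a final partial batch if needed."""
    return ceil(num_pairs / batch_size)


PAIRS = 850_000  # approximate figure quoted above for the 4.7h / 8-speaker file

passes_default = count_forward_passes(PAIRS, 1)
passes_batched = count_forward_passes(PAIRS, 32)
batch_lengths = [len(batch) for batch in batched(range(70), 32)]

print(passes_default, passes_batched, batch_lengths)
```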

Patch 2: `torch.cuda.empty_cache()` Between Sub-stages (Testing)

Between segmentation and embedding extraction, PyTorch's caching allocator holds freed GPU memory. Inserting torch.cuda.empty_cache() calls between sub-stages releases this memory back to CUDA:

segmentations = self.get_segmentations(file, hook=hook)  # GPU stage 1
torch.cuda.empty_cache()  # Release cached memory
embeddings = self.get_embeddings(file, ...)               # GPU stage 2
torch.cuda.empty_cache()  # Release cached memory
hard_clusters, _, centroids = self.clustering(...)        # CPU stage 3

Impact: Negligible speed overhead (~1-5ms per call). Reduces peak VRAM for long multi-speaker files.
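
To check how much memory a call actually hands back, one option is to compare torch.cuda.memory_reserved() before and after. The helper below is a hedged sketch: release_cached_gpu_memory is a hypothetical name, and it returns None on machines without PyTorch or a CUDA device.

```python
def release_cached_gpu_memory():
    """Empty PyTorch's CUDA cache and return the number of bytes released.

    Returns None when PyTorch or a CUDA device is unavailable.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None

    reserved_before = torch.cuda.memory_reserved()
    torch.cuda.empty_cache()  # hand unused cached blocks back to the driver
    return reserved_before - torch.cuda.memory_reserved()


released = release_cached_gpu_memory()
print("released bytes:", released)
```

Memory still referenced by live tensors is not affected. Only blocks the caching allocator holds but no longer uses are returned.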

Patch Results: VRAM Predictability

Combined testing of Patches 1 and 2 on an RTX A6000 across 5 test files (0.5h to 4.7h):

VRAM (Device Peak):

Duration	Speakers	Stock Peak	Patched Peak	Change
0.5h	5	2,991 MB	14,634 MB	*
1.0h	5	19,517 MB	14,634 MB	-25%
2.2h	3	11,309 MB	14,634 MB	*
3.2h	3	2,993 MB	2,993 MB	0%
4.7h	8	25,770 MB	14,634 MB	-43%

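Stock VRAM readings vary widely (2,991-25,770 MB) due to PyTorch caching allocator timing. The patched version peaks at a consistent 14,634 MB in four of the five runs, independent of duration (the 3.2h file stayed at 2,993 MB in both configurations) -- the key improvement for concurrent scheduling. In the rows marked *, the stock reading was lower than the patched peak.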

Processing Speed: Unchanged (within +/-5% run-to-run variation).

Result Accuracy: Speaker counts identical in all cases. Small segment count differences (2 of 5 files) are due to PyAnnote's inherent non-determinism in VBx clustering, not the patches.

Patch 3: Pinned Memory for CPU-to-GPU Transfers (Documented)

PyAnnote sends data to GPU via pageable memory (chunks.to(self.device)). Using pin_memory() enables DMA (Direct Memory Access) for 2-3x faster CPU-to-GPU transfer per batch. This follows the same pattern used by CTranslate2 (faster-whisper's backend).
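
A minimal sketch of the pinned-memory pattern in PyTorch is shown below. The to_device_pinned helper is hypothetical and is not part of PyAnnote. It assumes the input is a CPU tensor and falls back to returning it unchanged when no CUDA device is present.

```python
def to_device_pinned(batch, device="cuda"):
    """Copy a CPU tensor to the GPU through page-locked (pinned) host memory.

    Hypothetical helper: returns the batch unchanged without a CUDA device.
    """
    # Imported lazily so this module still loads where PyTorch is absent.
    import torch

    if not torch.cuda.is_available():
        return batch

    # pin_memory() copies the tensor into page-locked host memory, and
    # non_blocking=True lets the host-to-device copy run asynchronously.
    return batch.pin_memory().to(device, non_blocking=True)
```

Pinning is itself a copy, so the benefit depends on batch size and on whether the transfer can overlap with other work.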

Patch 4: DataLoader-Based Prefetching (Documented)

Replace the serial embedding extraction loop with torch.utils.data.DataLoader using num_workers=2, pin_memory=True, and prefetch_factor=2. This overlaps CPU data preparation with GPU inference -- estimated 10-30% faster embedding extraction.
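
The overlap this patch aims for can be illustrated without PyTorch. The sketch below is not DataLoader: it uses a background thread and a bounded queue from the standard library to prepare the next batches while the consumer is busy. The prefetch and prepare_batches names are hypothetical, and depth plays a role similar to prefetch_factor.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def prefetch(iterable, depth=2):
    """Yield items in order while a background thread prepares the next ones."""
    buffer = queue.Queue(maxsize=depth)

    def producer():
        for item in iterable:
            buffer.put(item)  # blocks once `depth` items are waiting
        buffer.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buffer.get()
        if item is _DONE:
            return
        yield item


def prepare_batches(count):
    """Stand-in for CPU-side chunk preparation."""
    for i in range(count):
        yield [i * 2, i * 2 + 1]


# The sum stands in for GPU inference on each prepared batch.
results = [sum(batch) for batch in prefetch(prepare_batches(4))]
print(results)  # [1, 5, 9, 13]
```

This sketch does not propagate producer exceptions or handle a consumer that stops early. A real implementation would need both.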

Upstream PR Strategy

All patches target pyannote/pyannote-audio (MIT license). Each PR includes:

Benchmark data (before/after timing and VRAM on reference files)
Result equivalence proof (segment counts and speaker assignments unchanged)
Hardware tested (GPU model, VRAM, driver version)
Minimal diff with clear documentation

Warm Model Caching

The ModelManager singleton (backend/app/transcription/model_manager.py) keeps AI models loaded in GPU memory between Celery tasks. Since the GPU worker runs with concurrency=1, only one task uses the GPU at a time -- safe for singleton model state.

Scenario	Model Loading Overhead (2,500 files)
Without cache (load/unload per file)	2,500 x 15s = 10.4 hours
With warm cache (first file only)	15s + 2,500 x ~0s = 15 seconds

For batch imports, this saves approximately 10 hours of idle GPU time spent loading and unloading models.
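
The caching behaviour can be sketched as a singleton that loads each model at most once per process. This is a simplified illustration under the concurrency=1 assumption stated above, not the contents of model_manager.py. The get method, its loader argument, and fake_loader are hypothetical.

```python
import threading


class ModelManager:
    """Process-wide cache: every call to ModelManager() returns the same object."""

    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        with cls._lock:
            if cls._instance is None:
                cls._instance = super().__new__(cls)
                cls._instance._models = {}
        return cls._instance

    def get(self, name, loader):
        """Return the cached model, calling loader() only on first use."""
        if name not in self._models:
            self._models[name] = loader()
        return self._models[name]


load_calls = []


def fake_loader():
    load_calls.append("load")  # stands in for ~15 s of model loading
    return object()


for _ in range(3):  # three consecutive "tasks"
    model = ModelManager().get("transcription", fake_loader)

hours_without_cache = round(2500 * 15 / 3600, 1)
print(len(load_calls), hours_without_cache)  # 1 10.4
```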

Hardware Auto-Detection

OpenTranscribe automatically detects GPU capabilities and configures optimal settings via backend/app/utils/hardware_detection.py:

Parameter	How It Is Set
`batch_size`	Based on GPU VRAM (8-32 range)
`compute_type`	Based on CUDA compute capability
`beam_size`	Default 5 (configurable)
`segmentation_batch_size`	Based on VRAM (8-32 range)

No manual tuning required for most deployments. See the Performance Tuning guide for manual overrides.
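
The kind of mapping the table describes might look like the sketch below. The thresholds are invented for illustration and are not the values used in hardware_detection.py. The function names, the 2-per-GB slope, and the compute capability cut-offs are all assumptions.

```python
def pick_batch_size(vram_gb, low=8, high=32):
    """Scale batch size with VRAM in steps of 8, clamped to [low, high].

    The 2-per-GB slope is a made-up heuristic.
    """
    scaled = int(vram_gb * 2) // 8 * 8
    return max(low, min(high, scaled))


def pick_compute_type(compute_capability):
    """Choose a precision from a (major, minor) CUDA compute capability.

    The cut-offs here are illustrative assumptions.
    """
    if compute_capability >= (7, 0):
        return "float16"
    if compute_capability >= (6, 1):
        return "int8"
    return "float32"


for vram, capability in [(4, (5, 2)), (12, (6, 1)), (48, (8, 6))]:
    print(vram, capability, pick_batch_size(vram), pick_compute_type(capability))
```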

Related Documentation

Transcription Engine -- Pipeline architecture and model selection
Speaker Diarization -- Diarization algorithms and speaker matching
Performance Tuning -- Operational tuning for all subsystems
Multi-GPU Scaling -- Parallel GPU worker configuration
