Performance Tuning

This guide covers tuning OpenTranscribe for maximum throughput on your hardware. The transcription pipeline, GPU utilization, database, search, and task queue settings all contribute to overall performance.

Transcription Performance

Batch Size Tuning

Batch size controls how many audio chunks the Whisper encoder processes simultaneously. It is the single most impactful transcription speed parameter. OpenTranscribe auto-detects the optimal batch size based on GPU VRAM, but you can override it.

| GPU VRAM | Auto-Detected Batch Size | Notes |
|---|---|---|
| 48 GB+ (A6000, A100) | 32 | Optimal for large-v3-turbo |
| 24 GB (RTX 3090, A5000) | 24 | Optimal |
| 16 GB (RTX 4080) | 16 | Good headroom |
| 12 GB (RTX 3080) | 12 | Can push to 16 with turbo model |
| 8 GB (RTX 3070) | 8 | Test 12 with turbo model |
| 6 GB (entry-level) | 4 | Conservative |

Override via environment variable:

# In .env
BATCH_SIZE=24

Important: Transcription VRAM is constant per batch -- it does not scale with audio duration. Longer files simply process more batches sequentially at the same peak VRAM.
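
The VRAM tiers in the table can be read as a simple lookup. The following is a minimal sketch, not the project's actual detection code: the function name pick_batch_size is hypothetical, and treating each tier as a lower bound (for example, any GPU with at least 12 GB but less than 16 GB gets 12) is an assumption, since the table lists only representative sizes.

```python
# Hypothetical sketch of VRAM-based batch size selection.
# Tier values come from the table above; the ">= lower bound" cut-offs are assumed.
BATCH_TIERS = [  # (minimum VRAM in GB, batch size), largest tier first
    (48, 32),
    (24, 24),
    (16, 16),
    (12, 12),
    (8, 8),
]
FALLBACK_BATCH_SIZE = 4  # conservative value for ~6 GB entry-level GPUs


def pick_batch_size(vram_gb: float) -> int:
    """Return the batch size of the first tier whose minimum VRAM is met."""
    for min_vram_gb, batch_size in BATCH_TIERS:
        if vram_gb >= min_vram_gb:
            return batch_size
    return FALLBACK_BATCH_SIZE


for vram in (48, 24, 12, 6):
    print(f"{vram} GB -> batch size {pick_batch_size(vram)}")
```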

Compute Type Selection

The compute type (quantization) affects speed and quality. OpenTranscribe auto-selects based on GPU compute capability.

| Compute Type | GPU Requirement | Speed | Quality Impact |
|---|---|---|---|
| int8_float16 | Compute >= 7.5 (Turing+) | Fastest | -0.1 WER (negligible) |
| float16 | Compute >= 7.0 (Volta+) | Fast | Baseline |
| bfloat16 | Compute >= 8.0 (Ampere+) | Fast | Negligible loss |
| float32 | Any | Slowest | Baseline |
| int8 | Compute >= 6.1 | Fastest (CPU) | -0.1 WER |

Auto-detection logic (from hardware_detection.py):

  • Compute 7.5+ (RTX 20xx, 30xx, 40xx, A-series): int8_float16
  • Compute 7.0 (V100): float16
  • Below 7.0: float32
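
The three rules above amount to a comparison on the (major, minor) compute capability. This is a minimal sketch under that reading; select_compute_type is a hypothetical name and is not taken from hardware_detection.py. On a CUDA system the capability pair could be obtained from torch.cuda.get_device_capability(), which returns a (major, minor) tuple.

```python
def select_compute_type(major: int, minor: int) -> str:
    """Map a CUDA compute capability to a compute type, following the rules above."""
    capability = (major, minor)  # tuples compare element by element
    if capability >= (7, 5):     # Turing and newer
        return "int8_float16"
    if capability >= (7, 0):     # Volta (V100)
        return "float16"
    return "float32"             # anything older


print(select_compute_type(8, 6))  # an Ampere-class GPU
print(select_compute_type(7, 0))  # V100
```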

Override:

WHISPER_COMPUTE_TYPE=float16

Beam Size

| Beam Size | Speed | Quality Impact |
|---|---|---|
| 1 (greedy) | 2-3x faster | -1 to -2% WER |
| 2 | 1.5x faster | -0.5% WER |
| 5 (default) | Baseline | Baseline |

For batch processing where speed matters more than perfection, beam_size=1 gives a significant speedup with minimal quality loss.

WHISPER_BEAM_SIZE=1

Model Selection Impact

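| Model | Speed | VRAM | English | Multilingual | Translation |
|---|---|---|---|---|---|
| large-v3-turbo (default) | 6x faster | ~6 GB | Excellent | Good | No |
| large-v3 | Slow | ~10 GB | Excellent | Best | Yes |
| large-v2 | Slow | ~10 GB | Excellent | Good | Yes |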

Use large-v3-turbo unless you need translation to English or maximum non-English accuracy. The turbo model uses ~40% less VRAM and processes 6x faster.

WHISPER_MODEL=large-v3-turbo

Hybrid Mode

For systems where the GPU cannot fit the full transcription model, OpenTranscribe auto-activates hybrid mode: transcription runs on CPU while diarization stays on GPU/MPS. This requires only ~1.3 GB VRAM.

Auto-activation thresholds (minimum batch=2 VRAM peak vs. 80% of GPU VRAM):

| Model | Min Peak VRAM | Auto-hybrid if GPU < |
|---|---|---|
| large-v3-turbo / large-v3 | 3,893 MB | ~4.9 GB |
| medium | 3,829 MB | ~4.8 GB |
| small | 2,933 MB | ~3.7 GB |
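
The right-hand column follows from dividing the minimum peak by 0.80. The sketch below reproduces that arithmetic; the function names are hypothetical, and converting MB to GB with a factor of 1000 is an assumption that happens to match the rounded values in the table.

```python
# Minimum batch=2 peak VRAM per model, in MB (values from the table above).
MIN_PEAK_VRAM_MB = {
    "large-v3-turbo": 3893,
    "large-v3": 3893,
    "medium": 3829,
    "small": 2933,
}
USABLE_FRACTION = 0.80  # the peak is compared against 80% of total GPU VRAM


def hybrid_threshold_mb(model: str) -> float:
    """Smallest total VRAM (MB) at which the model still fits on the GPU."""
    return MIN_PEAK_VRAM_MB[model] / USABLE_FRACTION


def should_use_hybrid(total_vram_mb: float, model: str) -> bool:
    """True when the model's minimum peak exceeds 80% of the GPU's VRAM."""
    return MIN_PEAK_VRAM_MB[model] > total_vram_mb * USABLE_FRACTION


for name in ("large-v3-turbo", "medium", "small"):
    print(name, round(hybrid_threshold_mb(name) / 1000, 1), "GB")
```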

macOS (Apple Silicon) always uses hybrid mode — PyAnnote runs on MPS, transcription runs on CPU.

# Hybrid mode environment variables
WHISPER_HYBRID_MODE=auto # auto (default) | true | false
WHISPER_HYBRID_CPU_MODEL=small # CPU transcription model: small | medium | base

Accuracy: The small model (default for hybrid) gives good accuracy at 15–30× real-time on modern CPUs. Use WHISPER_HYBRID_CPU_MODEL=medium for better accuracy at ~half the speed.

Benchmark data: See docs/whisper-vram-profile/README.md for full VRAM sweep results across models and batch sizes.

GPU Optimization

VRAM Usage by Pipeline Stage

The transcription pipeline runs in sequential mode -- the transcriber is loaded, used, and released before the diarizer loads. This keeps peak VRAM manageable for consumer GPUs.

| Stage | VRAM Usage | Scales With |
|---|---|---|
| Model loading (both models) | ~5.5 GB | Fixed |
| Transcription inference | +300-400 MB | Batch size (not duration) |
| Diarization (PyAnnote) | +1-11 GB | Audio duration |
| Speaker embeddings | +4.5-6.6 GB | Post-pipeline, separate step |
| Peak (sequential mode) | ~9 GB typical | Longest audio file |

Diarization VRAM Scaling

Diarization is the primary VRAM consumer. Its usage depends on audio duration, although the measured values below do not grow monotonically:

| Audio Duration | Diarization VRAM Overhead | Peak Device VRAM |
|---|---|---|
| 0.5 hours | +1 GB | ~3 GB |
| 1.0 hours | +11.5 GB | ~19.5 GB |
| 2.2 hours | +9.2 GB | ~11.3 GB |
| 3.2 hours | +0.9 GB | ~3 GB |
| 4.7 hours | +11.1 GB | ~25.8 GB |

VRAM variability is caused by PyTorch's caching allocator timing -- the allocator holds freed memory as "reserved" until torch.cuda.empty_cache() is called. Actual peak during processing is often higher than post-stage snapshots.

Concurrent Processing

For multi-GPU or high-VRAM systems, concurrent processing can multiply throughput:

Shared model weights via thread pool mode (GPU_WORKER_POOL=threads):

  • 5 concurrent tasks share one copy of model weights (~5.5 GB)
  • vs. prefork mode: 5 separate copies = ~27.5 GB
  • Savings: ~22 GB VRAM for 5 concurrent tasks

# In .env
GPU_WORKER_POOL=threads
GPU_CONCURRENT_REQUESTS=5 # Number of concurrent transcription tasks
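
The savings figure is plain multiplication, sketched below. The function name is hypothetical, and the calculation covers model weights only; per-task inference and diarization memory still come on top in either mode.

```python
WEIGHTS_GB = 5.5  # one copy of the model weights, from the figures above


def weights_vram_gb(tasks: int, pool: str) -> float:
    """VRAM spent on model weights for a given worker pool mode."""
    if pool == "threads":   # all tasks share a single copy
        return WEIGHTS_GB
    if pool == "prefork":   # each worker process loads its own copy
        return WEIGHTS_GB * tasks
    raise ValueError(f"unknown pool mode: {pool}")


prefork = weights_vram_gb(5, "prefork")
threads = weights_vram_gb(5, "threads")
print(f"prefork: {prefork} GB, threads: {threads} GB, saved: {prefork - threads} GB")
```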

Multi-GPU scaling for systems with multiple GPUs:

GPU_SCALE_ENABLED=true
GPU_SCALE_DEVICE_ID=2 # Which GPU to use
GPU_SCALE_WORKERS=4 # Parallel workers

See the Multi-GPU Scaling documentation for details.

Auto-Detection Logic

The HardwareConfig class in backend/app/utils/hardware_detection.py handles all auto-detection:

  1. Device detection: CUDA > MPS (Apple Silicon) > CPU
  2. Compute type: Based on GPU compute capability (see table above)
  3. Batch size: Based on total VRAM (see table above)
  4. Environment overrides: TORCH_DEVICE, COMPUTE_TYPE, BATCH_SIZE override auto-detection
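
Step 4 can be pictured as a thin layer applied on top of the detected values. This is a sketch only: resolve_hardware_config and the dictionary keys are hypothetical, and the variable names are the three listed in step 4 (the compute type override shown earlier in this guide is spelled WHISPER_COMPUTE_TYPE, so check which name your version reads).

```python
import os


def resolve_hardware_config(detected: dict, env=None) -> dict:
    """Return detected settings with any environment overrides applied."""
    env = os.environ if env is None else env
    config = dict(detected)  # copy so the detected values stay untouched
    if env.get("TORCH_DEVICE"):
        config["device"] = env["TORCH_DEVICE"]
    if env.get("COMPUTE_TYPE"):
        config["compute_type"] = env["COMPUTE_TYPE"]
    if env.get("BATCH_SIZE"):
        config["batch_size"] = int(env["BATCH_SIZE"])
    return config


detected = {"device": "cuda", "compute_type": "int8_float16", "batch_size": 24}
print(resolve_hardware_config(detected, {"BATCH_SIZE": "16"}))
```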

Warm Model Caching

The ModelManager singleton keeps AI models loaded between Celery tasks. This eliminates the ~15 second model loading overhead per file.

| Scenario | Model Loading Overhead (2500 files) |
|---|---|
| Without cache (load/unload per file) | 2500 x 15s = 10.4 hours |
| With warm cache (first load only) | 15s + 2500 x 0s = 15 seconds |

This is enabled by default. The GPU worker uses --max-tasks-per-child=100000 to keep the process (and its cached models) alive across many tasks.
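
The idea behind a warm cache can be sketched as a process-wide singleton that runs each loader once and returns the same object afterwards. WarmModelCache below is a hypothetical illustration of that pattern, not the project's ModelManager, whose interface is not shown in this guide.

```python
import threading


class WarmModelCache:
    """Process-wide cache: each named model is loaded at most once."""

    _instance = None
    _lock = threading.RLock()

    def __init__(self):
        self._models = {}

    @classmethod
    def instance(cls):
        with cls._lock:
            if cls._instance is None:
                cls._instance = cls()
            return cls._instance

    def get(self, name, loader):
        """Return the cached model, calling loader() only on the first request."""
        with self._lock:
            if name not in self._models:
                self._models[name] = loader()  # the slow step, paid once
            return self._models[name]


def load_dummy_model():
    print("loading model (slow path)")
    return object()  # stand-in for a real model object


cache = WarmModelCache.instance()
first = cache.get("transcriber", load_dummy_model)
second = cache.get("transcriber", load_dummy_model)  # served from memory
print(first is second)
```

The cache only helps while the worker process stays alive, which is why a high --max-tasks-per-child value matters for the GPU worker.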

PostgreSQL Tuning

OpenTranscribe configures PostgreSQL with tuning parameters optimized for transcription workloads (2500+ files, 1M+ segments on SSD storage). All values are configurable via .env.

| Parameter | Default | Purpose | When to Increase |
|---|---|---|---|
| PG_SHARED_BUFFERS | 256 MB | Shared memory for caching | Set to 25% of available RAM (e.g., 2 GB for 8 GB RAM) |
| PG_EFFECTIVE_CACHE_SIZE | 1 GB | Planner's estimate of OS cache | Set to 50-75% of available RAM |
| PG_WORK_MEM | 16 MB | Per-operation sort/hash memory | Increase for complex queries; careful -- multiplied by connections |
| PG_MAINTENANCE_WORK_MEM | 128 MB | VACUUM, CREATE INDEX operations | Increase to 256-512 MB for faster index rebuilds |
| PG_RANDOM_PAGE_COST | 1.1 | Planner cost estimate for random I/O | Already tuned for SSD; use 4.0 for spinning disks |
| PG_EFFECTIVE_IO_CONCURRENCY | 200 | Concurrent I/O operations | Already tuned for SSD; use 2 for spinning disks |
| PG_MAX_CONNECTIONS | 200 | Maximum client connections | Increase if connection errors appear |

Additional hardcoded tuning (in docker-compose.yml):

| Parameter | Value | Purpose |
|---|---|---|
| wal_buffers | 16 MB | Write-ahead log buffer size |
| checkpoint_completion_target | 0.9 | Spread checkpoint I/O over time |
| default_statistics_target | 100 | Statistics sampling for query planner |

Recommended values by system RAM:

| System RAM | shared_buffers | effective_cache_size | work_mem | maintenance_work_mem |
|---|---|---|---|---|
| 4 GB | 256 MB (default) | 1 GB (default) | 16 MB | 128 MB |
| 8 GB | 2 GB | 6 GB | 32 MB | 256 MB |
| 16 GB | 4 GB | 12 GB | 64 MB | 512 MB |
| 32 GB+ | 8 GB | 24 GB | 128 MB | 1 GB |
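
For RAM sizes between the listed rows, the 8 GB to 32 GB entries follow a regular pattern: 25% of RAM for shared_buffers, 75% for effective_cache_size, 4 MB of work_mem and 32 MB of maintenance_work_mem per GB of RAM. The sketch below encodes that pattern. The function name, the linear interpolation, and the "2GB"-style value format are assumptions; confirm the format your .env expects before copying values.

```python
DEFAULTS = {
    "PG_SHARED_BUFFERS": "256MB",
    "PG_EFFECTIVE_CACHE_SIZE": "1GB",
    "PG_WORK_MEM": "16MB",
    "PG_MAINTENANCE_WORK_MEM": "128MB",
}


def suggest_pg_settings(ram_gb: int) -> dict:
    """Suggest PostgreSQL memory settings for a host with ram_gb of RAM."""
    if ram_gb < 8:
        return dict(DEFAULTS)      # small hosts keep the shipped defaults
    ram = min(ram_gb, 32)          # the table stops at the 32 GB+ row
    return {
        "PG_SHARED_BUFFERS": f"{ram // 4}GB",              # 25% of RAM
        "PG_EFFECTIVE_CACHE_SIZE": f"{ram * 3 // 4}GB",    # 75% of RAM
        "PG_WORK_MEM": f"{ram * 4}MB",                     # 4 MB per GB
        "PG_MAINTENANCE_WORK_MEM": f"{ram * 32}MB",        # 32 MB per GB
    }


for key, value in suggest_pg_settings(16).items():
    print(f"{key}={value}")
```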

Monitor cache hit ratio -- it should be above 95%:

docker exec opentranscribe-postgres psql -U postgres -d opentranscribe -c \
"SELECT round(100.0 * sum(blks_hit) / nullif(sum(blks_hit + blks_read), 0), 2) AS cache_hit_ratio
FROM pg_stat_database WHERE datname = 'opentranscribe';"

OpenSearch Tuning

JVM Heap Sizing

OpenSearch runs with 1 GB heap by default. This is sufficient for small deployments but should be increased for larger datasets.

# In docker-compose.yml or docker-compose.override.yml
environment:
  - "OPENSEARCH_JAVA_OPTS=-Xms2g -Xmx2g"

| Dataset Size | Recommended Heap | Notes |
|---|---|---|
| Under 1,000 transcripts | 1 GB (default) | Sufficient |
| 1,000-5,000 transcripts | 2 GB | Recommended |
| 5,000-20,000 transcripts | 4 GB | Required for neural search |
| 20,000+ transcripts | 8 GB | Maximum 50% of system RAM |

Rule of thumb: Never set heap above 50% of available RAM, and never above ~30 GB (JVM compressed OOPs threshold).
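
The sizing table and the rule of thumb combine into a small calculation, sketched here. The function names are hypothetical, and because the table's ranges share their end points, assigning exactly 5,000 or 20,000 transcripts to the larger tier is an assumption.

```python
def recommended_heap_gb(transcripts: int, system_ram_gb: float) -> float:
    """Heap size from the dataset tiers, capped at 50% of RAM and at 30 GB."""
    if transcripts < 1000:
        tier = 1
    elif transcripts < 5000:
        tier = 2
    elif transcripts < 20000:
        tier = 4
    else:
        tier = 8
    return min(tier, system_ram_gb / 2, 30)


def java_opts(heap_gb: float) -> str:
    """Format matching -Xms/-Xmx flags in megabytes so fractional GB values work."""
    heap_mb = int(heap_gb * 1024)
    return f"-Xms{heap_mb}m -Xmx{heap_mb}m"


heap = recommended_heap_gb(transcripts=3000, system_ram_gb=16)
print(f"OPENSEARCH_JAVA_OPTS={java_opts(heap)}")
```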

Refresh Interval

By default, OpenSearch refreshes indices every 1 second. During bulk operations (reindexing, batch imports), increase the refresh interval:

# Set refresh interval to 30 seconds during bulk operations
curl -X PUT http://localhost:5180/transcripts/_settings \
-H 'Content-Type: application/json' \
-d '{"index": {"refresh_interval": "30s"}}'

# Reset to default after bulk operation
curl -X PUT http://localhost:5180/transcripts/_settings \
-H 'Content-Type: application/json' \
-d '{"index": {"refresh_interval": "1s"}}'
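
When the bulk operation is scripted, it helps to guarantee that the interval is restored even if the import fails. The sketch below wraps the two curl calls above in a context manager using only the standard library. The names put_refresh_interval and relaxed_refresh are hypothetical; the URL is the one from the curl example and assumes an unauthenticated local endpoint.

```python
import json
import urllib.request
from contextlib import contextmanager

SETTINGS_URL = "http://localhost:5180/transcripts/_settings"  # as in the curl example


def put_refresh_interval(interval: str, url: str = SETTINGS_URL) -> int:
    """PUT a new refresh_interval to the index settings endpoint."""
    body = json.dumps({"index": {"refresh_interval": interval}}).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


@contextmanager
def relaxed_refresh(bulk_interval="30s", normal_interval="1s", send=put_refresh_interval):
    """Raise the refresh interval for the duration of a block, then restore it."""
    send(bulk_interval)
    try:
        yield
    finally:
        send(normal_interval)  # runs even if the bulk operation raises


# Usage (requires a reachable OpenSearch instance):
# with relaxed_refresh():
#     run_bulk_import()  # hypothetical bulk job
```

Passing a different send callable makes the wrapper easy to exercise without a running cluster.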

Neural Search Model

OpenSearch loads a sentence-transformer model for vector/semantic search. The default all-MiniLM-L6-v2 (384-dim, ~80 MB) runs on CPU within the OpenSearch JVM.

Key ML Commons settings (already configured in docker-compose.yml):

| Setting | Value | Purpose |
|---|---|---|
| plugins.ml_commons.only_run_on_ml_node | false | Run ML on data nodes (single-node) |
| plugins.ml_commons.native_memory_threshold | 99 | Allow high native memory for models |
| plugins.ml_commons.model_access_control_enabled | false | Simplified access for single-node |

Redis Optimization

Redis serves as the Celery broker and result backend.

Memory Configuration

# Check current memory usage
docker exec opentranscribe-redis redis-cli info memory

# Key metrics to watch:
# used_memory_human - current usage
# maxmemory_human - configured limit (0 = unlimited)
# evicted_keys - keys removed due to memory pressure

Recommendations

| Setting | Default | Recommendation |
|---|---|---|
| maxmemory | Unlimited | Set to 256 MB - 1 GB depending on workload |
| maxmemory-policy | noeviction | Use allkeys-lru if memory-constrained |
| Persistence | RDB snapshots | Default is fine; disable AOF for performance |
| Password | None | Set REDIS_PASSWORD in .env for production |

For most deployments, Redis memory usage stays under 100 MB. It primarily stores Celery task metadata and results, not large data.

Celery Worker Tuning

Worker Configuration

Each worker type has different concurrency and lifecycle settings:

| Worker | Queue(s) | Concurrency | max-tasks-per-child | Rationale |
|---|---|---|---|---|
| GPU | gpu | 1 | 100,000 | Sequential GPU processing; keep models warm |
| Download | download | 3 | 10 | I/O-bound; restart frequently to release file handles |
| CPU | cpu,utility | 8 | 20 | CPU-bound; restart to prevent memory leaks |
| NLP | nlp,celery | 4 | 50 | LLM API calls; moderate concurrency |
| Embedding | embedding | 1 | 500 | Single model loaded; keep warm |
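
Each row of the table corresponds to standard Celery worker options (-Q, --concurrency, --max-tasks-per-child). The sketch below builds the matching command lines from the table values. The application path app.worker is a placeholder assumption, and the real services may be started with additional flags not covered here.

```python
APP = "app.worker"  # placeholder: replace with the project's Celery app path

# Values copied from the worker table above.
WORKERS = {
    "gpu": {"queues": "gpu", "concurrency": 1, "max_tasks_per_child": 100000},
    "download": {"queues": "download", "concurrency": 3, "max_tasks_per_child": 10},
    "cpu": {"queues": "cpu,utility", "concurrency": 8, "max_tasks_per_child": 20},
    "nlp": {"queues": "nlp,celery", "concurrency": 4, "max_tasks_per_child": 50},
    "embedding": {"queues": "embedding", "concurrency": 1, "max_tasks_per_child": 500},
}


def worker_command(name: str) -> list:
    """Build the argv list for one worker type."""
    worker = WORKERS[name]
    return [
        "celery", "-A", APP, "worker",
        "-Q", worker["queues"],
        f"--concurrency={worker['concurrency']}",
        f"--max-tasks-per-child={worker['max_tasks_per_child']}",
    ]


for worker_name in WORKERS:
    print(" ".join(worker_command(worker_name)))
```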

Tuning Concurrency

# In .env
GPU_CONCURRENT_REQUESTS=1 # GPU worker concurrency (default: 1)
DOWNLOAD_CONCURRENCY=3 # Download worker concurrency
NLP_CONCURRENCY=4 # NLP/LLM worker concurrency

GPU worker concurrency should stay at 1 for most setups. The sequential pipeline (transcribe then diarize) is designed for single-task processing to manage VRAM. Only increase for high-VRAM GPUs (48 GB+) with thread pool mode.

NLP worker concurrency can be increased if your LLM provider handles concurrent requests well. With self-hosted vLLM, match this to --max-num-seqs on the LLM server.

Prefetch Multiplier

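By default, Celery prefetches four tasks per worker process (worker_prefetch_multiplier=4). For GPU tasks, which are long-running, prefetched tasks sit reserved behind the one in progress and cannot be picked up by an idle worker, which causes uneven distribution:

# Reserve only one task at a time per GPU worker process (recommended for long tasks)
CELERY_WORKER_PREFETCH_MULTIPLIER=1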

LLM/vLLM Optimization

If you use a self-hosted vLLM server for summarization and speaker identification, concurrent request handling is the key bottleneck.

Concurrent Request Tuning

OpenTranscribe can send up to 6 concurrent LLM requests per file (4 summary chunks + topics + speaker ID). If your vLLM server limits concurrent sequences, requests queue and wait.
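
The fan-out can be illustrated with asyncio: all six requests are started together, and a semaphore caps how many are in flight at once. This is a sketch of the pattern only. run_llm_requests and fake_llm_call are hypothetical stand-ins, not OpenTranscribe functions, and the sleep replaces a real HTTP call.

```python
import asyncio


async def run_llm_requests(prompts, call_llm, max_concurrent=6):
    """Run all prompts concurrently, with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def limited(prompt):
        async with semaphore:
            return await call_llm(prompt)

    # gather returns results in the same order as the prompts
    return await asyncio.gather(*(limited(p) for p in prompts))


async def fake_llm_call(prompt):
    await asyncio.sleep(0.01)  # stands in for network and generation time
    return f"response to {prompt}"


PROMPTS = ["chunk 1", "chunk 2", "chunk 3", "chunk 4", "topics", "speaker ID"]
results = asyncio.run(run_llm_requests(PROMPTS, fake_llm_call, max_concurrent=6))
print(len(results), "responses")
```

If the server accepts fewer concurrent sequences than the client sends, the extra requests wait on the server side instead, which is the queuing described above.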

| vLLM Setting | Before | After (Recommended) |
|---|---|---|
| --max-num-seqs | 2 | 6 |
| --enable-chunked-prefill | No | Yes |

Expected Performance Impact

| Metric | max-num-seqs=2 | max-num-seqs=6 |
|---|---|---|
| Concurrent requests | 2 | 6 |
| Summary (4 chunks) time | ~40s sequential | ~10s parallel |
| Total LLM time per file | ~60s | ~15-20s |
| Throughput | ~1 file/min | ~3-4 files/min |

VRAM Budget for vLLM

For a 20B parameter model at FP16:

| State | VRAM Usage |
|---|---|
| Model weights (idle) | ~40 GB |
| 2 concurrent requests | ~42-44 GB |
| 6 concurrent requests | ~44-47 GB |
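
The idle figure follows from parameter count times bytes per parameter: FP16 stores each weight in 2 bytes. A short sketch of that arithmetic, with a hypothetical function name; the per-request overhead is derived from the table's upper bound and is an estimate, not a measurement.

```python
def llm_weights_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight memory in GB (1 billion params x 1 byte = 1 GB)."""
    return params_billion * bytes_per_param


weights = llm_weights_gb(20)       # 20B parameters at FP16
overhead_6_seqs = 47 - weights     # upper bound from the table above
print(f"weights: {weights} GB, up to {overhead_6_seqs} GB for 6 concurrent requests")
```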

If OOM errors occur, reduce --max-num-seqs or lower --gpu-memory-utilization.

Tuning Checklist

| Observation | Action |
|---|---|
| Memory stays under 45 GB with 6 seqs | Try --max-num-seqs 8 |
| OOM errors with 6 seqs | Reduce to --max-num-seqs 4 |
| Slow first-token latency | Add --enable-prefix-caching |
| Requests time out | Increase --swap-space to 48-64 |

Alternative: SGLang

If vLLM bottlenecks under load, SGLang offers superior continuous batching with the same OpenAI-compatible API:

# Service fragment for docker-compose.yml
image: lmsysorg/sglang:latest
# Folded string so the arguments are split into separate words
command: >
  python -m sglang.launch_server
  --model-path openai/gpt-oss-20b
  --host 0.0.0.0
  --port 8000
  --mem-fraction-static 0.90
  --max-running-requests 8

Network & I/O

MinIO Performance

MinIO stores all uploaded media files. For large batch imports:

  • Use SSD storage for the MinIO data volume
  • Ensure the Docker volume is on a fast filesystem (ext4 or XFS on SSD)
  • For NAS/NFS storage: mount with noatime,async for better write performance
  • MinIO's default settings are sufficient for most deployments

Model Cache Location

AI models (~2.5 GB total) are cached on disk and loaded to GPU/RAM on first use. Place the cache on fast storage:

# In .env
MODEL_CACHE_DIR=/path/to/fast/ssd/models

For NFS/NAS model storage: the initial model load will be slower (~30s vs ~5s on local SSD), but subsequent loads use warm caching in memory. This only affects cold starts and worker restarts.

Benchmark Reference

All benchmarks use a 3.3-hour podcast (11,893s) on an NVIDIA RTX A6000 (48 GB).

Pipeline Performance

| Configuration | Total Time | Transcription | Diarization | Speaker Assignment |
|---|---|---|---|---|
| Legacy WhisperX + alignment | 706s | 75s | 194s | 10.2s |
| WhisperX batched, no alignment | 304s | 76s | 198s | 0.04s |
| Native pipeline (batch_size=32, beam=5) | 332s | 105s | 192s | 0.5s |

Quality Comparison

| Metric | WhisperX (no alignment) | Native Pipeline |
|---|---|---|
| Text word overlap | 92.5% | ~92% |
| Speaker consistency | 76.7% (segment-level) | 95.2% (word-level) |
| Timestamp accuracy | 0.00s MAE | 0.00s MAE |

The native pipeline achieves 95% speaker consistency because word-level timestamps enable precise speaker-to-word matching via an interval tree algorithm (273x faster than WhisperX's linear scan).
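
To illustrate word-level speaker matching, the sketch below assigns each word to the diarization turn that contains the word's midpoint. It deliberately swaps the interval tree for a simpler binary search over sorted turn start times, which is only valid under the stated assumption that turns are sorted and do not overlap. The function name and data layout are hypothetical.

```python
from bisect import bisect_right


def assign_speakers(words, turns):
    """Label each word with the speaker whose turn contains its midpoint.

    words: list of (text, start, end) in seconds
    turns: list of (start, end, speaker), sorted by start and non-overlapping
    """
    starts = [turn[0] for turn in turns]
    labeled = []
    for text, w_start, w_end in words:
        midpoint = (w_start + w_end) / 2.0
        index = bisect_right(starts, midpoint) - 1  # last turn starting at or before midpoint
        speaker = None
        if index >= 0 and midpoint < turns[index][1]:
            speaker = turns[index][2]
        labeled.append((text, speaker))
    return labeled


turns = [(0.0, 2.0, "SPEAKER_00"), (2.0, 5.0, "SPEAKER_01")]
words = [("hello", 0.5, 0.9), ("there", 2.1, 2.4), ("later", 6.0, 6.2)]
print(assign_speakers(words, turns))
```

Each lookup costs O(log n) in the number of turns, compared with O(n) for a linear scan. An interval tree gives the same logarithmic behavior while also handling overlapping speech.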

Scaling Projections (2500 Three-Hour Files)

| Configuration | Per File | 2500 Files | With 4x GPU Workers |
|---|---|---|---|
| Legacy WhisperX + alignment | 706s | 490 hours | 123 hours |
| Native pipeline | 332s | 231 hours | 58 hours |
| + warm model caching | ~320s | 222 hours | 56 hours |
| + 4x GPU workers on A6000 | ~80s effective | 56 hours | N/A |
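
The projections are per-file time multiplied by file count, converted to hours and divided by the number of parallel workers. The sketch below reproduces the table; the function name is hypothetical, and dividing evenly by the worker count assumes ideal linear scaling.

```python
def batch_hours(seconds_per_file: float, files: int = 2500, workers: int = 1) -> float:
    """Wall-clock hours to process a batch, assuming perfect parallel scaling."""
    return seconds_per_file * files / 3600 / workers


for label, seconds in [("legacy", 706), ("native", 332), ("native + warm cache", 320)]:
    single = round(batch_hours(seconds))
    four = round(batch_hours(seconds, workers=4))
    print(f"{label}: {single} h on one worker, {four} h on four")
```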

Solo Processing Times by Duration (A6000)

| Audio Duration | Transcription | Diarization | Total |
|---|---|---|---|
| 0.5 hours | 18.7s | 31.4s | 54.6s |
| 1.0 hours | 38.8s | 62.5s | 103.2s |
| 2.2 hours | 72.8s | 131.4s | 217.2s |
| 3.2 hours | 112.9s | 202.5s | 326.4s |
| 4.7 hours | 183.7s | 441.8s | 707.4s |

Scaling Decision Matrix

| Symptom | Root Cause | Solution | Cost |
|---|---|---|---|
| gpu queue backing up, GPU utilization 100% | GPU is the bottleneck | Add a second GPU worker on another GPU | Hardware cost |
| gpu queue backing up, GPU utilization under 50% | I/O or CPU bottleneck | Move model cache to SSD, increase CPU workers | Low |
| Long diarization times on large files | VRAM pressure from PyAnnote | Upgrade to higher-VRAM GPU | Hardware cost |
| LLM tasks slow | vLLM concurrent request limit | Increase --max-num-seqs | Free (config change) |
| LLM tasks slow, vLLM GPU at 100% | LLM GPU is the bottleneck | Add dedicated LLM GPU or use cloud API | Hardware/API cost |
| Database queries slow | PostgreSQL needs tuning | Increase shared_buffers, work_mem | Free (config change) |
| Search queries slow | OpenSearch heap too small | Increase OPENSEARCH_JAVA_OPTS heap | RAM |
| Many concurrent users, API slow | Backend needs scaling | Run multiple backend replicas behind load balancer | CPU/RAM |
| Download queue backing up | Network bandwidth | Increase DOWNLOAD_CONCURRENCY, check bandwidth | Free/network |

When to Scale Vertically vs. Horizontally

Scale vertically (upgrade hardware):

  • Moving from 8 GB to 24 GB GPU eliminates VRAM constraints for most files
  • More system RAM improves PostgreSQL and OpenSearch caching
  • NVMe SSD dramatically improves model loading and media I/O

Scale horizontally (add workers):

  • Multiple GPU workers on separate GPUs for parallel transcription
  • Additional CPU workers for utility tasks
  • Separate machines for LLM inference (vLLM/Ollama on dedicated GPU)

Use cloud APIs instead of self-hosting:

  • When GPU hardware cost exceeds API usage cost
  • For occasional/bursty workloads that do not justify dedicated hardware
  • Configure LLM_PROVIDER=openai or LLM_PROVIDER=anthropic in .env