Operational Runbooks
This page provides step-by-step procedures for diagnosing and resolving common production issues in OpenTranscribe.
Runbook: Stuck Transcription Tasks
Symptoms: Files remain in "Processing" or "Transcribing" status indefinitely. The progress bar stops updating. No new files are being processed.
Diagnosis:
- Open the Flower dashboard at http://localhost:5175/flower
- Check the Active tab for tasks that have been running longer than expected (typical transcription takes 1-5 minutes per 10 minutes of audio)
- Check the backend logs for errors:
./opentr.sh logs celery-worker
- Query the database for stuck files:
./opentr.sh shell backend
python -c "
from app.db.session import SessionLocal
from app.models.media import MediaFile, FileStatus
db = SessionLocal()
stuck = db.query(MediaFile).filter(
    MediaFile.status.in_([FileStatus.PROCESSING, FileStatus.TRANSCRIBING])
).all()
for f in stuck:
    print(f'{f.id}: {f.filename} - {f.status} (updated: {f.updated_at})')
"
Resolution:
Option A: Use the Admin Recovery Panel (recommended)
- Log in as an admin user
- Navigate to Admin Panel > System > Task Recovery
- The system automatically detects stuck tasks (files in processing state for longer than the configured timeout)
- Click Recover Stuck Tasks to reset them to a retryable state
- The built-in TaskRecoveryService will clean up partial transcript segments before retrying
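The timeout-based detection described above can be sketched as follows. This is a minimal illustration, not the actual TaskRecoveryService logic: the function name `find_stuck`, the status strings, and the `(file_id, status, updated_at)` record shape are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical status names standing in for the in-progress FileStatus values.
IN_PROGRESS = {"processing", "transcribing"}

def find_stuck(files, timeout_minutes, now=None):
    """Return ids of files that stayed in an in-progress status past the timeout.

    `files` is an iterable of (file_id, status, updated_at) tuples. This shape
    is an assumption for illustration, not the real MediaFile model.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(minutes=timeout_minutes)
    return [
        file_id
        for file_id, status, updated_at in files
        if status in IN_PROGRESS and updated_at < cutoff
    ]
```

A file counts as stuck only if it is both in an in-progress status and older than the cutoff, so recently started jobs are left alone.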
Option B: Manual recovery via API
- Call the admin recovery endpoint:
curl -X POST http://localhost:5174/api/admin/recover-tasks \
-H "Authorization: Bearer <admin-token>"
Option C: Cancel and retry individual files
- In the Flower dashboard, revoke the stuck Celery task
- Update the file status in the database:
./opentr.sh shell backend
python -c "
from app.db.session import SessionLocal
from app.models.media import MediaFile, FileStatus
db = SessionLocal()
file = db.query(MediaFile).get(<FILE_ID>)
file.status = FileStatus.UPLOADED
db.commit()
"
- Reprocess the file from the UI
Prevention:
- The TaskRecoveryService runs automatically on a schedule and handles most stuck task scenarios
- Configure TASK_RECOVERY_TIMEOUT_MINUTES in .env to match your expected maximum processing time
- Monitor the Flower dashboard regularly for long-running tasks
Runbook: Queue Backup / Slow Processing
Symptoms: New uploads queue up but are not processed. Flower shows a growing number of tasks in the Reserved or Scheduled state. Processing takes much longer than usual.
Diagnosis:
- Check queue depth in Flower at http://localhost:5175/flower
- Check if the Celery worker is running:
docker compose ps celery-worker
- Check worker logs for errors:
./opentr.sh logs celery-worker
- Check Redis connectivity (the message broker):
docker compose exec redis redis-cli ping
# Should respond: PONG
- Check GPU utilization (if applicable):
docker compose exec celery-worker nvidia-smi
Resolution:
Clear stuck tasks first:
- Follow the "Stuck Transcription Tasks" runbook above to clear any blocked tasks
Scale workers if processing is just slow:
- For multi-GPU systems, enable GPU scaling:
# In .env:
GPU_SCALE_ENABLED=true
GPU_SCALE_DEVICE_ID=2
GPU_SCALE_WORKERS=4
# Restart with scaling
./opentr.sh stop
./opentr.sh start dev --gpu-scale
Restart the worker if it is unhealthy:
- Restart only the Celery worker:
docker compose restart celery-worker
Flush the queue as a last resort (WARNING: loses queued tasks):
- Purge all pending Celery tasks:
docker compose exec celery-worker celery -A app.core.celery purge -f
Prevention:
- Monitor queue depth via Flower
- Scale GPU workers to match throughput needs
- Set up alerts on queue depth thresholds
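A queue-depth alert could be sketched as below. It assumes the Redis broker and Celery's default queue name `celery`, where pending messages sit in a Redis list whose length is the backlog. The thresholds and function names are illustrative, and OpenTranscribe may use differently named queues.

```python
def queue_depth(client, queue="celery"):
    """Return the number of pending messages in a Celery queue on a Redis broker.

    `client` is any object with an `llen` method, such as a redis-py client.
    The queue name "celery" is Celery's default and is an assumption here.
    """
    return client.llen(queue)

def alert_level(depth, warn=50, crit=200):
    """Map a queue depth to an alert level. Thresholds are illustrative."""
    if depth >= crit:
        return "critical"
    if depth >= warn:
        return "warning"
    return "ok"
```

Tasks already prefetched by a worker are not counted by the list length, so this measures the waiting backlog rather than total in-flight work.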
Runbook: GPU Out of Memory (OOM)
Symptoms: Celery worker crashes or restarts unexpectedly. Logs show CUDA out of memory, RuntimeError: CUDA error: out of memory, or the container is killed by the OOM killer.
Diagnosis:
- Check worker logs for OOM errors:
./opentr.sh logs celery-worker | grep -i "out of memory\|OOM\|CUDA error"
- Check current GPU memory usage:
docker compose exec celery-worker nvidia-smi
- Check if multiple workers are competing for GPU memory:
docker compose ps | grep celery
Resolution:
Step 1: Reduce concurrency
Lower the number of parallel workers to reduce simultaneous GPU memory usage:
# In .env:
GPU_SCALE_WORKERS=2 # Reduce from 4 to 2
Restart the worker after changing this value.
Step 2: Use a smaller Whisper model
Switch from large-v3 (10GB VRAM) to large-v3-turbo (6GB VRAM):
# In .env:
WHISPER_MODEL=large-v3-turbo
Restart the worker. Note: large-v3-turbo does not support translation tasks.
Step 3: Check for GPU memory leaks
If memory usage grows over time without releasing:
# Restart the worker to free all GPU memory
docker compose restart celery-worker
# Monitor memory usage over time
watch -n 5 'docker compose exec celery-worker nvidia-smi'
Step 4: Limit Docker GPU memory (advanced)
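In docker-compose.yml, pin the worker to a single GPU so it does not compete with other workloads for memory (Docker Compose has no setting that caps VRAM directly):
deploy:
  resources:
    reservations:
      devices:
        - capabilities: [gpu]
          device_ids: ['0']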
Prevention:
- Match GPU_SCALE_WORKERS to your GPU's VRAM capacity
- Use large-v3-turbo for GPUs with less than 10GB VRAM
- Monitor GPU memory with nvidia-smi after deployments
- Avoid running other GPU workloads on the same device
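As a rough guide for matching GPU_SCALE_WORKERS to VRAM, the arithmetic can be sketched like this. It assumes each worker loads its own copy of the model and reserves some headroom for CUDA overhead. The function name and the 1 GB headroom are assumptions; the 10 GB and 6 GB model figures come from Step 2 above.

```python
def max_gpu_workers(total_vram_gb, model_vram_gb, headroom_gb=1.0):
    """Estimate how many workers fit on one GPU if each loads its own model copy."""
    if model_vram_gb <= 0:
        raise ValueError("model_vram_gb must be positive")
    usable = total_vram_gb - headroom_gb
    return max(0, int(usable // model_vram_gb))

# Example: a 24 GB card with large-v3 (10 GB) versus large-v3-turbo (6 GB).
workers_large = max_gpu_workers(24, 10)
workers_turbo = max_gpu_workers(24, 6)
```

Treat the result as an upper bound. Actual memory use also depends on audio length and batch size, so confirm with nvidia-smi under load.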
Runbook: Database Connection Issues
Symptoms: Backend returns 500 errors. Logs show connection refused, too many connections, or connection pool exhausted. API requests hang or time out.
Diagnosis:
- Check if PostgreSQL is running:
docker compose ps postgres
- Test database connectivity:
docker compose exec postgres pg_isready -U opentranscribe
- Check current connection count:
docker compose exec postgres psql -U opentranscribe -c \
  "SELECT count(*) FROM pg_stat_activity;"
- Check backend logs for connection errors:
./opentr.sh logs backend | grep -i "connection\|pool\|database"
Resolution:
If PostgreSQL is down:
docker compose restart postgres
# Wait for it to become healthy
docker compose exec postgres pg_isready -U opentranscribe
# Then restart the backend to re-establish connections
docker compose restart backend
If connection pool is exhausted:
- Increase the pool size in .env:
DATABASE_POOL_SIZE=20 # Default is 5
DATABASE_MAX_OVERFLOW=30 # Default is 10
- Restart the backend:
docker compose restart backend
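Before raising the pool size, it helps to check that the combined pools still fit under PostgreSQL's connection limit. The sketch below assumes every process that opens its own pool uses the same settings. The process count is an assumption about your deployment; the defaults of 100 for max_connections and 3 reserved superuser slots are PostgreSQL's stock values.

```python
def peak_connections(pool_size, max_overflow, processes):
    """Worst-case connections if every process fills its pool and overflow."""
    return processes * (pool_size + max_overflow)

def fits_postgres(pool_size, max_overflow, processes,
                  max_connections=100, reserved=3):
    """True if the worst case stays within the slots available to normal users."""
    return peak_connections(pool_size, max_overflow, processes) <= max_connections - reserved
```

With the values shown above (20 and 30) and two pooled processes, the worst case is 100 connections, which exceeds the 97 available under stock settings. In that case either raise max_connections or choose smaller pool values.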
If connections are leaking (count keeps growing):
- Kill idle connections:
docker compose exec postgres psql -U opentranscribe -c \
  "SELECT pg_terminate_backend(pid)
   FROM pg_stat_activity
   WHERE state = 'idle'
   AND query_start < now() - interval '10 minutes';"
- Restart the backend to get fresh connections:
docker compose restart backend
Prevention:
- Monitor active connection count
- Set appropriate DATABASE_POOL_SIZE for your workload
- Ensure the backend properly closes sessions (SQLAlchemy context managers)
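The context-manager pattern mentioned above usually looks like the sketch below. `session_scope` is a hypothetical helper name, and `session_factory` stands for any callable returning a session object, such as the SessionLocal used elsewhere on this page. Whether the backend already has an equivalent helper is not confirmed here.

```python
from contextlib import contextmanager

@contextmanager
def session_scope(session_factory):
    """Yield a session; commit on success, roll back on error, always close."""
    session = session_factory()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        # Closing returns the connection to the pool even when an error occurred.
        session.close()
```

Used as `with session_scope(SessionLocal) as db: ...`, this guarantees the connection goes back to the pool on every code path, which is what prevents slow leaks.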
Runbook: OpenSearch Index Corruption
Symptoms: Search returns no results or incomplete results. Admin panel shows index health as "red" or "yellow". Logs show index_not_found_exception or shard allocation errors.
Diagnosis:
- Check cluster health:
curl -s http://localhost:5180/_cluster/health | python3 -m json.tool
- Check index status:
curl -s http://localhost:5180/_cat/indices?v
- Check for unassigned shards:
curl -s http://localhost:5180/_cat/shards?v | grep UNASSIGNED
- Check the speaker alias:
curl -s http://localhost:5180/_cat/aliases?v
Resolution:
For missing or corrupt transcription index:
- Trigger a full reindex from the Admin Panel:
  - Navigate to Admin Panel > System > Search
  - Click Reindex All Documents
  - This rebuilds the search index from the database (source of truth)
- Or via API:
curl -X POST http://localhost:5174/api/admin/reindex \
  -H "Authorization: Bearer <admin-token>"
For unassigned shards:
# Retry shard allocation
curl -X POST http://localhost:5180/_cluster/reroute?retry_failed=true
# If single-node, ensure replica count is 0
curl -X PUT http://localhost:5180/_settings -H 'Content-Type: application/json' -d '
{
"index.number_of_replicas": 0
}'
For corrupt speaker index:
The speaker index uses an alias-based architecture (speakers alias pointing to speakers_v3 or speakers_v4). To rebuild:
- Delete and recreate via the Admin Panel embedding migration tool
- Or manually:
# Check which index the alias points to
curl -s http://localhost:5180/_cat/aliases/speakers?v
# Delete the corrupt index
curl -X DELETE http://localhost:5180/speakers_v3
# Restart the backend - it will recreate indices on startup
docker compose restart backend
Prevention:
- Run OpenSearch on reliable storage (avoid network-mounted volumes for production)
- Monitor cluster health regularly
- Keep regular database backups (the database is the source of truth, not OpenSearch)
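Regular health monitoring could be scripted along these lines. The sketch reads the same `/_cluster/health` endpoint used in the diagnosis steps and reports anything other than a green status or any unassigned shards. The function names are illustrative, and the base URL assumes the local port shown above.

```python
import json
from urllib.request import urlopen

def fetch_cluster_health(base_url="http://localhost:5180", timeout=5):
    """Fetch the cluster health document as a dict."""
    with urlopen(f"{base_url}/_cluster/health", timeout=timeout) as resp:
        return json.load(resp)

def health_problems(health):
    """Return human-readable problems found in a cluster health document."""
    problems = []
    status = health.get("status")
    if status != "green":
        problems.append(f"cluster status is {status}")
    unassigned = health.get("unassigned_shards", 0)
    if unassigned:
        problems.append(f"{unassigned} unassigned shard(s)")
    return problems
```

On a single-node deployment, yellow usually just means replicas cannot be allocated, so you may prefer to alert only on red there.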
Runbook: Full Disk Space
Symptoms: Uploads fail. Container logs show No space left on device. Docker commands fail. Services crash and cannot restart.
Diagnosis:
- Check overall disk usage:
df -h /
- Check Docker disk usage:
docker system df
- Check large directories:
du -sh /var/lib/docker/*
du -sh ./models/*
du -sh ./backups/*
Resolution:
Step 1: Clean Docker resources
# Remove unused containers, networks, and dangling images
docker system prune -f
# Remove unused volumes (WARNING: deletes orphaned data volumes)
docker volume prune -f
# Remove old images
docker image prune -a -f --filter "until=168h"
Step 2: Clean temporary files
# Clean backend temp files
docker compose exec backend rm -rf /tmp/yt-dlp-* /tmp/whisperx-*
# Check MinIO for orphaned files
# (files in storage but not referenced in the database)
Step 3: Archive old backups
# List backups sorted by size
ls -lhS backups/
# Remove backups older than 30 days
find backups/ -name "*.sql" -mtime +30 -delete
Step 4: Move model cache to larger disk
# In .env, change MODEL_CACHE_DIR to a path on a larger disk
MODEL_CACHE_DIR=/mnt/large-disk/opentranscribe-models
# Fix permissions and restart
./scripts/fix-model-permissions.sh
./opentr.sh stop && ./opentr.sh start dev
Prevention:
- Monitor disk usage with alerts at 80% and 90% thresholds
- Set up automated backup rotation
- Use a separate volume for Docker storage and model cache
- Periodically run docker system prune
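The 80% and 90% thresholds above can be checked with a few lines of standard-library Python. This is a minimal sketch: the function names are illustrative, and wiring the result into an alerting system is left to your monitoring setup.

```python
import shutil

def disk_alert_level(used_fraction, warn=0.80, crit=0.90):
    """Map a used-space fraction (0.0 to 1.0) to an alert level."""
    if used_fraction >= crit:
        return "critical"
    if used_fraction >= warn:
        return "warning"
    return "ok"

def check_disk(path="/"):
    """Return the alert level for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    return disk_alert_level(usage.used / usage.total)
```

Run `check_disk` against each mount that matters, for example the Docker data directory and the model cache, since they may live on different volumes.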
Runbook: Service Won't Start
Symptoms: Running ./opentr.sh start dev fails. One or more containers exit immediately or enter a restart loop. The frontend or backend is unreachable.
Diagnosis:
- Check which services are failing:
./opentr.sh status
docker compose ps -a
- Check logs for the failing service:
./opentr.sh logs backend
./opentr.sh logs frontend
./opentr.sh logs postgres
- Check for port conflicts:
# Check if ports are already in use
ss -tlnp | grep -E '5173|5174|5175|5179|5180'
- Verify the .env file exists and is valid:
# Flag lines with spaces around = or leading whitespace
grep -nE ' =|= |^[[:space:]]' .env
- Check Docker daemon is running:
docker info
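The .env syntax check above can also be done with a small linter that names the problem on each line. This sketch follows the rules stated above (no spaces around `=`, no leading whitespace). The function name and messages are illustrative, and it does not validate quoting.

```python
import re

_KEY = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def lint_env(text):
    """Return (line_number, message) pairs for lines likely to break .env parsing."""
    problems = []
    for n, line in enumerate(text.splitlines(), start=1):
        # Blank lines and comments are always fine.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        if line[0] in " \t":
            problems.append((n, "leading whitespace"))
        key, sep, value = line.partition("=")
        if not sep:
            problems.append((n, "missing '='"))
            continue
        if key != key.rstrip() or value != value.lstrip():
            problems.append((n, "whitespace around '='"))
        if not _KEY.match(key.strip()):
            problems.append((n, "invalid variable name"))
    return problems
```

Calling `lint_env(open(".env").read())` returns an empty list for a clean file.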
Resolution:
If a port is in use:
# Find and kill the process using the port
sudo kill $(lsof -t -i:5174)
# Or change the port in .env
BACKEND_PORT=5184
If the backend fails on startup (migration error, missing env var):
# Check the specific error in logs
./opentr.sh logs backend | tail -50
# Common fix: ensure .env has all required variables
diff .env .env.example
If containers keep restarting:
# Check exit codes
docker compose ps -a
# Check for resource constraints
docker stats --no-stream
If Docker daemon issues:
sudo systemctl restart docker
./opentr.sh stop && ./opentr.sh start dev
Prevention:
- Keep .env in sync with .env.example when upgrading
- Test configuration changes in development before production
- Use ./opentr.sh status after starting to verify all services are healthy
Runbook: Failed Database Migration
Symptoms: Backend fails to start with Alembic errors. Logs show alembic.util.exc.CommandError, Can't locate revision, or Target database is not up to date.
Diagnosis:
- Check backend logs for the specific migration error:
./opentr.sh logs backend | grep -i "alembic\|migration\|revision"
- Check the current database revision:
./opentr.sh shell backend
alembic current
- Check migration history:
alembic history --verbose
Resolution:
If the database has no Alembic stamp (fresh or legacy database):
The backend's migrations.py auto-detection handles this on startup. If it fails:
./opentr.sh shell backend
# Stamp the database at the appropriate version
alembic stamp <current_version>
# Then upgrade
alembic upgrade head
If a migration partially applied (some columns exist, some don't):
- Check what was applied:
./opentr.sh shell postgres
psql -U opentranscribe -c "\d+ <table_name>"
- Manually complete the migration or stamp past it:
./opentr.sh shell backend
# If the schema is correct but stamp is wrong
alembic stamp <target_revision>
If you need to roll back a migration:
./opentr.sh shell backend
# Downgrade one revision
alembic downgrade -1
# Or downgrade to a specific revision
alembic downgrade <revision_id>
Nuclear option (development only -- destroys all data):
./opentr.sh reset dev
Prevention:
- Always use idempotent SQL in migrations (IF NOT EXISTS, IF EXISTS)
- Test migrations with ./opentr.sh reset dev before deploying
- Back up the database before applying new migrations in production:
./opentr.sh backup
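To show why idempotent SQL matters, the sketch below runs the same upgrade twice without error. It uses the standard-library sqlite3 module as a stand-in for PostgreSQL so it can run anywhere. The table and index names are hypothetical; in a real Alembic migration the statements would typically be passed to `op.execute`.

```python
import sqlite3

def upgrade(conn):
    """Idempotent schema change: safe to re-run after a partial or repeated apply."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS media_file (id INTEGER PRIMARY KEY, filename TEXT)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS ix_media_file_filename ON media_file (filename)"
    )

def downgrade(conn):
    """Idempotent rollback: dropping something already gone is not an error."""
    conn.execute("DROP INDEX IF EXISTS ix_media_file_filename")
    conn.execute("DROP TABLE IF EXISTS media_file")

conn = sqlite3.connect(":memory:")
upgrade(conn)
upgrade(conn)  # Second run is a no-op instead of raising "already exists".
```

Without the IF NOT EXISTS guards, the second call would fail, which is the situation a partially applied migration leaves you in.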
Runbook: High Memory Usage
Symptoms: System becomes sluggish. Docker containers are killed by the OOM killer. docker stats shows high memory consumption. Swap usage is high.
Diagnosis:
- Check per-container memory usage:
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"
- Identify the largest consumer:
docker stats --no-stream --format "{{.Name}}\t{{.MemUsage}}" | sort -k2 -h
- Check host system memory:
free -h
Resolution:
If OpenSearch is consuming too much memory:
OpenSearch defaults can be aggressive. Reduce the JVM heap:
# In docker-compose.yml or .env:
OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m # Default is often 1g
Restart OpenSearch after changing.
If the Celery worker is consuming too much:
This is usually due to model loading. Each model stays in memory:
# Restart the worker to free memory
docker compose restart celery-worker
If PostgreSQL is consuming too much:
Reduce shared buffers:
# In docker-compose.yml postgres environment:
POSTGRES_SHARED_BUFFERS=256MB # Reduce from default
Set container memory limits:
Add limits in docker-compose.yml:
services:
  backend:
    deploy:
      resources:
        limits:
          memory: 2G
  opensearch:
    deploy:
      resources:
        limits:
          memory: 2G
Prevention:
- Set memory limits on all containers
- Monitor memory usage trends over time
- Size your server appropriately (16GB minimum recommended, 32GB for production)
Runbook: Credential Rotation
Symptoms: Planned maintenance -- you need to rotate credentials for security compliance or after a suspected compromise.
Diagnosis: Not applicable (proactive maintenance).
Resolution:
Rotate Database Password
- Back up the database first:
./opentr.sh backup
- Update the password in PostgreSQL:
docker compose exec postgres psql -U opentranscribe -c \
  "ALTER USER opentranscribe WITH PASSWORD 'new_secure_password';"
- Update .env:
DATABASE_URL=postgresql://opentranscribe:new_secure_password@postgres:5432/opentranscribe
- Restart services that connect to the database:
docker compose restart backend celery-worker
Rotate MinIO Keys
- Update MinIO credentials via the MinIO console at http://localhost:5179, or:
docker compose exec minio mc admin user svcacct add myminio opentranscribe \
  --access-key NEW_ACCESS_KEY --secret-key NEW_SECRET_KEY
- Update .env:
MINIO_ACCESS_KEY=NEW_ACCESS_KEY
MINIO_SECRET_KEY=NEW_SECRET_KEY
- Restart services:
docker compose restart backend celery-worker
Rotate JWT Secret
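- Generate a new secret:
openssl rand -hex 32
- Paste the generated value into .env (command substitution such as $(...) is not evaluated inside .env files):
JWT_SECRET_KEY=<generated-value>
- Restart the backend:
docker compose restart backend
- Note: All existing user sessions will be invalidated. Users will need to log in again.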
Rotate LLM API Keys
- Generate a new key from your LLM provider's dashboard
- Update .env:
LLM_API_KEY=new_api_key_here
- Or update via the Admin Panel: Settings > LLM Configuration > API Key
- Restart the backend:
docker compose restart backend celery-worker
Rotate Redis Password
- Update .env:
REDIS_PASSWORD=new_redis_password
CELERY_BROKER_URL=redis://:new_redis_password@redis:6379/0
- Restart all services that use Redis:
docker compose restart redis backend celery-worker
Prevention:
- Rotate credentials on a regular schedule (e.g., every 90 days)
- Use strong, randomly generated passwords (openssl rand -hex 32)
- Document credential rotation dates
- Never commit credentials to version control
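For scripted rotation, the two ideas above (random secrets and tracked rotation dates) can be sketched with the standard library. `new_secret` produces the same kind of value as `openssl rand -hex 32`. The function names and the 90-day default are illustrative, taken from the example schedule above.

```python
import secrets
from datetime import date, timedelta

def new_secret(num_bytes=32):
    """Return a random hex string; 32 bytes gives 64 hex characters."""
    return secrets.token_hex(num_bytes)

def rotation_due(last_rotated, today=None, max_age_days=90):
    """True if a credential last rotated on `last_rotated` is due for rotation."""
    today = today or date.today()
    return today - last_rotated >= timedelta(days=max_age_days)
```

Storing the last rotation date per credential, for example in your operations log, is enough input for `rotation_due` to drive a reminder.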
Runbook: Model Re-Download
Symptoms: AI models are corrupted, producing garbled output, or you need to switch to a different model version. Or you want to force a fresh download after a failed partial download.
Diagnosis:
- Check current model cache:
ls -la ${MODEL_CACHE_DIR:-./models}/huggingface/hub/
ls -la ${MODEL_CACHE_DIR:-./models}/torch/pyannote/
- Check for incomplete downloads (very small files, lock files):
find ${MODEL_CACHE_DIR:-./models} -name "*.lock" -o -name "*.incomplete"
Resolution:
Re-download a specific model type:
# Clear Whisper models
rm -rf ${MODEL_CACHE_DIR:-./models}/huggingface/hub/models--Systran--faster-whisper-*
# Clear PyAnnote speaker models
rm -rf ${MODEL_CACHE_DIR:-./models}/torch/pyannote/
# Clear sentence transformer models
rm -rf ${MODEL_CACHE_DIR:-./models}/sentence-transformers/
# Clear OpenSearch neural models
rm -rf ${MODEL_CACHE_DIR:-./models}/opensearch-ml/
Re-download all models:
# Stop services
./opentr.sh stop
# Clear the entire model cache
rm -rf ${MODEL_CACHE_DIR:-./models}/*
# Fix permissions
./scripts/fix-model-permissions.sh
# Restart -- models will download on first use
./opentr.sh start dev
Pre-download models for offline deployment:
bash scripts/download-models.sh models
Prevention:
- Verify model downloads completed successfully after first startup
- Use stable network connections for initial model downloads
- For air-gapped environments, pre-download models on an internet-connected machine and transfer via rsync
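Verifying that downloads completed can be automated with a check for leftover lock and partial files, mirroring the `find` command in the diagnosis steps. This is a minimal sketch; the function name is illustrative, and an empty result does not prove the model files themselves are intact.

```python
from pathlib import Path

def find_incomplete(cache_dir):
    """Return sorted paths of leftover .lock and .incomplete files under cache_dir."""
    root = Path(cache_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix in {".lock", ".incomplete"}
    )
```

An empty list after first startup suggests no download was interrupted. Any hits point at the model directory worth clearing and re-downloading.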
Runbook: WebSocket Connection Issues
Symptoms: The UI does not update in real-time after uploading files. Progress bars do not animate. Transcription completes but the UI still shows "Processing". Browser console shows WebSocket errors.
Diagnosis:
- Check browser developer console (F12) for WebSocket errors:
  - Look for WebSocket connection to 'ws://...' failed
  - Look for ERR_CONNECTION_REFUSED or 403 Forbidden
- Verify the backend WebSocket endpoint is reachable:
# Test with wscat (install: npm install -g wscat)
wscat -c ws://localhost:5174/api/ws
- Check backend logs for WebSocket errors:
./opentr.sh logs backend | grep -i "websocket\|ws\|upgrade"
- If using NGINX (production), check NGINX logs:
./opentr.sh logs frontend | grep -i "websocket\|upgrade"
Resolution:
If running in development (no NGINX):
- Verify the backend is running and healthy:
curl http://localhost:5174/api/health
- Restart the backend:
docker compose restart backend
- Hard-refresh the browser (Ctrl+Shift+R)
If running in production (with NGINX):
- Verify NGINX is configured for WebSocket proxying. The configuration should include:
location /api/ws {
    proxy_pass http://backend:8080;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
    proxy_read_timeout 86400;
}
- Check that proxy_read_timeout is set high enough (WebSocket connections are long-lived)
- Restart NGINX:
docker compose restart frontend
If WebSocket connects but no updates arrive:
- Check that the Celery worker is sending notifications:
./opentr.sh logs celery-worker | grep -i "notification\|websocket\|send"
- Check Redis pub/sub (used for WebSocket message passing):
docker compose exec redis redis-cli ping
Prevention:
- In production, always use the provided NGINX configuration which includes WebSocket support
- Set proxy_read_timeout to at least 86400 (24 hours) for WebSocket connections
- Monitor WebSocket connection counts in production
- Test WebSocket functionality after any NGINX configuration changes