Monitoring & Logging

OpenTranscribe runs as a multi-service Docker Compose application with GPU-accelerated AI processing, background task queues, and several data stores. Effective monitoring ensures reliable operation and early detection of issues.

Monitoring Architecture

Service Health Checks

Every service in OpenTranscribe has a Docker health check. These are defined in docker-compose.yml and automatically monitored by Docker.

| Service | Container | Health Check | Interval | What It Verifies |
| --- | --- | --- | --- | --- |
| PostgreSQL | opentranscribe-postgres | pg_isready -U postgres | 5s | Database accepts connections |
| MinIO | opentranscribe-minio | curl -f http://localhost:9000/minio/health/live | 5s | Object storage API is responsive |
| Redis | opentranscribe-redis | redis-cli ping (with auth if configured) | 5s | Cache/broker responds to PING |
| OpenSearch | opentranscribe-opensearch | curl -sS http://localhost:9200 | 5s | Search cluster is reachable |
| Backend | opentranscribe-backend | curl -f http://localhost:8080/health | 10s | FastAPI app is serving requests |
| GPU Worker | opentranscribe-celery-worker | celery inspect ping -d gpu-transcription@$HOSTNAME | 30s | Worker is connected to broker and responsive |
| Download Worker | opentranscribe-celery-download-worker | celery inspect ping -d media-downloader@$HOSTNAME | 30s | Worker is connected and processing downloads |
| CPU Worker | opentranscribe-celery-cpu-worker | celery inspect ping -d cpu-processor@$HOSTNAME | 30s | Worker handles CPU-bound tasks |
| NLP Worker | opentranscribe-celery-nlp-worker | celery inspect ping -d ai-nlp@$HOSTNAME | 30s | Worker handles LLM/NLP tasks |
| Embedding Worker | opentranscribe-celery-embedding-worker | celery inspect ping -d search-indexer@$HOSTNAME | 30s | Worker handles search embedding tasks |
| Celery Beat | opentranscribe-celery-beat | Checks /app/celerybeat-schedule modification time < 300s | 30s | Scheduler is writing schedule file |
| Flower | opentranscribe-flower | Web UI on port 5555 | N/A | Monitoring dashboard is accessible |
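
The Celery Beat entry is a file-freshness check rather than a network probe. The sketch below illustrates that idea only; the helper name schedule_is_fresh is hypothetical, and the actual command in docker-compose.yml may be written differently.

```shell
# Hypothetical helper: succeed if FILE exists and was modified less than
# MINUTES minutes ago (default 5, matching the 300s window in the table).
schedule_is_fresh() {
  file=$1
  minutes=${2:-5}
  [ -f "$file" ] && [ -n "$(find "$file" -mmin "-$minutes" 2>/dev/null)" ]
}

# Usage inside the celery-beat container:
#   schedule_is_fresh /app/celerybeat-schedule || exit 1
```

A stale or missing schedule file makes the helper return non-zero, which is the condition a health check would report as unhealthy.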

Check all health statuses at once:

docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
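
To act on that listing from a script, one option is to keep only the containers whose status is not "(healthy)". This is a minimal sketch: the helper name not_healthy is hypothetical, and it assumes Docker's usual status strings such as "(healthy)", "(unhealthy)" and "(health: starting)". Containers without a health check are listed too, because their status carries no health marker.

```shell
# Hypothetical helper: read "name<TAB>status" lines (as produced by
# docker ps --format '{{.Names}}\t{{.Status}}') and print every line
# whose status does not contain "(healthy)".
not_healthy() {
  awk -F '\t' '$2 !~ /\(healthy\)/ { print $1 ": " $2 }'
}

# Usage (requires a running Docker daemon):
#   docker ps -a --format '{{.Names}}\t{{.Status}}' | not_healthy
```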

Flower Dashboard

Flower provides real-time monitoring of all Celery workers and tasks.

Access: http://localhost:5175/flower

Default credentials: admin / flower (configurable via FLOWER_USER and FLOWER_PASSWORD in .env)

What to Monitor in Flower

| Tab | Key Metrics | What to Look For |
| --- | --- | --- |
| Dashboard | Active/processed/failed task counts | Failed count increasing, active count stuck |
| Workers | Online workers, task counts per worker | Workers offline, uneven task distribution |
| Tasks | Task state, runtime, args | Tasks stuck in STARTED for too long, repeated failures |
| Queues | Queue depth per queue | Messages backing up in gpu queue |
| Broker | Redis connection status | Broker connectivity issues |

Queue Architecture

OpenTranscribe uses dedicated queues for different workload types:

| Queue | Worker | Concurrency | Purpose |
| --- | --- | --- | --- |
| gpu | celery-worker | 1 (default) | Transcription + diarization (GPU-bound) |
| download | celery-download-worker | 3 | Media URL downloads (I/O-bound) |
| cpu,utility | celery-cpu-worker | 8 | CPU-bound processing tasks |
| nlp,celery | celery-nlp-worker | 4 | LLM summarization, speaker ID |
| embedding | celery-embedding-worker | 1 | Search index embedding generation |
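
Queue depth can also be read outside Flower. With Redis as the broker, each queue is normally a Redis list named after the queue, so LLEN returns the number of waiting messages. The sketch below assumes that layout and a Redis instance without a password; queue_status is a hypothetical helper, and the threshold of 20 is only an example value. Tasks a worker has already prefetched are not counted by LLEN.

```shell
# Hypothetical helper: compare a queue depth with a warning threshold.
queue_status() {
  queue=$1
  depth=$2
  threshold=$3
  if [ "$depth" -gt "$threshold" ]; then
    echo "WARN $queue depth=$depth (threshold $threshold)"
  else
    echo "OK $queue depth=$depth"
  fi
}

# Usage (add redis-cli authentication if a password is configured):
#   for q in gpu download cpu utility nlp celery embedding; do
#     depth=$(docker exec opentranscribe-redis redis-cli LLEN "$q")
#     queue_status "$q" "$depth" 20
#   done
```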

Flower Configuration

Flower is configured with these operational settings in docker-compose.yml:

  • --max_tasks=10000: retains last 10,000 tasks in the dashboard
  • --persistent=True: persists task history to /app/flower.db
  • --purge_offline_workers=600: removes offline workers after 10 minutes
  • --natural_time=True: displays human-readable timestamps

Docker Container Monitoring

Resource Usage

# Live resource usage for all containers
docker stats

# One-shot snapshot
docker stats --no-stream

# Specific container
docker stats opentranscribe-celery-worker

Restart Counts

Frequent restarts indicate instability (often OOM kills or crash loops):

# Check restart counts
docker inspect --format='{{.Name}}: {{.RestartCount}}' $(docker ps -aq) 2>/dev/null | sort -t: -k2 -nr

# Check if a container was OOM killed
docker inspect --format='{{.Name}}: OOMKilled={{.State.OOMKilled}}' $(docker ps -aq) 2>/dev/null
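
To surface only the containers that restart often, the restart-count output can be filtered against a limit. This is a sketch under the assumption that the input has the "/name: count" shape produced by the first command above; restarts_over is a hypothetical helper. RestartCount is cumulative since the container was created, so a rate such as "restarts per hour" would need two samples taken some time apart.

```shell
# Hypothetical helper: read "/name: count" lines and print containers whose
# restart count exceeds the limit given as $1 (default 3).
restarts_over() {
  awk -F ': ' -v limit="${1:-3}" \
    '$2 + 0 > limit + 0 { sub(/^\//, "", $1); print $1, $2 }'
}

# Usage:
#   docker inspect --format='{{.Name}}: {{.RestartCount}}' $(docker ps -aq) | restarts_over 3
```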

Container Events

# Watch for container start/stop/die events
docker events --filter 'type=container' --format '{{.Time}} {{.Actor.Attributes.name}} {{.Action}}'

GPU Monitoring

nvidia-smi

# One-shot GPU status
nvidia-smi

# Continuous monitoring (updates every 1 second)
watch -n 1 nvidia-smi

# Compact output with utilization and memory
nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv,noheader,nounits
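
The compact CSV output lends itself to simple threshold checks. The sketch below assumes the exact column order of the query above and uses example limits of 90% VRAM and 85 C; gpu_alerts is a hypothetical helper. It evaluates a single sample, so detecting a sustained condition would mean running it repeatedly.

```shell
# Hypothetical helper: read CSV rows of
#   index, name, utilization.gpu, memory.used, memory.total, temperature.gpu
# and print a line for each GPU above 90% VRAM or above 85 C.
gpu_alerts() {
  awk -F ', *' '{
    if ($5 + 0 > 0) {
      vram = 100 * $4 / $5
      if (vram > 90) printf "GPU %s: VRAM %.0f%% used\n", $1, vram
    }
    if ($6 + 0 > 85) printf "GPU %s: temperature %d C\n", $1, $6
  }'
}

# Usage:
#   nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total,temperature.gpu \
#     --format=csv,noheader,nounits | gpu_alerts
```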

VRAM Profiling

OpenTranscribe includes built-in VRAM profiling that uses NVML (not PyTorch) for accurate device-level memory tracking. This captures memory used by CTranslate2, which is invisible to torch.cuda.memory_allocated().

Enable profiling:

# In .env
ENABLE_VRAM_PROFILING=true

View profiling results:

# Via Admin API
curl http://localhost:5174/api/admin/gpu-profiles

# Via profiling test script
./scripts/gpu-profile-test.sh --results

Key GPU Metrics

| Metric | Normal Range | Warning Threshold |
| --- | --- | --- |
| GPU Utilization | 80-100% during transcription | Sustained 0% with queued tasks |
| VRAM Usage (idle) | ~5.5 GB (models loaded) | N/A |
| VRAM Usage (transcription) | +300-400 MB above idle | N/A |
| VRAM Usage (diarization) | +1-11 GB (scales with audio length) | >90% of total VRAM |
| Temperature | 40-80 °C | >85 °C sustained |

Database Monitoring

PostgreSQL

# Connection count
docker exec opentranscribe-postgres psql -U postgres -d opentranscribe -c \
"SELECT count(*) as connections FROM pg_stat_activity;"

# Active queries
docker exec opentranscribe-postgres psql -U postgres -d opentranscribe -c \
"SELECT pid, state, query_start, query FROM pg_stat_activity WHERE state = 'active';"

# Table sizes
docker exec opentranscribe-postgres psql -U postgres -d opentranscribe -c \
"SELECT relname AS table, pg_size_pretty(pg_total_relation_size(relid)) AS size
FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC LIMIT 10;"

# Slow queries (if pg_stat_statements is enabled)
docker exec opentranscribe-postgres psql -U postgres -d opentranscribe -c \
"SELECT calls, mean_exec_time::numeric(10,2) AS avg_ms, query
FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 5;"

# Cache hit ratio (should be >95%)
docker exec opentranscribe-postgres psql -U postgres -d opentranscribe -c \
"SELECT round(100.0 * sum(blks_hit) / nullif(sum(blks_hit + blks_read), 0), 2) AS cache_hit_ratio
FROM pg_stat_database WHERE datname = 'opentranscribe';"

Connection Limits

The default max_connections is 200 (configurable via PG_MAX_CONNECTIONS in .env). Monitor connection usage to avoid exhaustion -- each backend instance, Celery worker, and Flower connection consumes a slot.
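
Connection usage can be expressed as a percentage of max_connections. The following is a rough sketch; pg_conn_pct is a hypothetical helper and the 80% limit is an example. Because pg_stat_activity also lists background processes, the result slightly overestimates client usage.

```shell
# Hypothetical helper: integer percentage of max_connections in use.
pg_conn_pct() {
  current=$1
  max=$2
  echo $(( current * 100 / max ))
}

# Usage:
#   current=$(docker exec opentranscribe-postgres psql -U postgres -d opentranscribe -tAc \
#     "SELECT count(*) FROM pg_stat_activity;")
#   max=$(docker exec opentranscribe-postgres psql -U postgres -d opentranscribe -tAc \
#     "SHOW max_connections;")
#   [ "$(pg_conn_pct "$current" "$max")" -gt 80 ] && echo "WARN: connection usage above 80%"
```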

OpenSearch Monitoring

# Cluster health (green/yellow/red)
curl -s http://localhost:5180/_cluster/health | python3 -m json.tool

# Index stats (document counts, sizes)
curl -s 'http://localhost:5180/_cat/indices?v'

# Node stats (JVM heap, disk, CPU)
curl -s http://localhost:5180/_nodes/stats/jvm,os,fs | python3 -m json.tool

# Pending tasks
curl -s http://localhost:5180/_cluster/pending_tasks | python3 -m json.tool

# ML model status (neural search)
curl -s http://localhost:5180/_plugins/_ml/models/_search -H 'Content-Type: application/json' \
-d '{"query":{"match_all":{}}}'

OpenSearch Health Status

| Status | Meaning | Action |
| --- | --- | --- |
| green | All shards assigned | Normal operation |
| yellow | Primary shards OK, replicas unassigned | Expected in single-node deployments |
| red | Some primary shards unassigned | Investigate immediately -- data may be unavailable |
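
For scripted checks, the status field can be pulled out of the _cluster/health response and mapped to an exit code. This sketch assumes the compact JSON that the endpoint returns by default; opensearch_status is a hypothetical helper, and the exit-code mapping is an arbitrary choice for illustration.

```shell
# Hypothetical helper: read _cluster/health JSON on stdin, print the status,
# and return 0 for green, 1 for yellow, 2 for red or anything unrecognized.
opensearch_status() {
  status=$(sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1)
  echo "${status:-unknown}"
  case $status in
    green) return 0 ;;
    yellow) return 1 ;;
    *) return 2 ;;
  esac
}

# Usage:
#   curl -s http://localhost:5180/_cluster/health | opensearch_status
```

On a single-node deployment, where yellow is expected, a caller might treat exit codes 0 and 1 as acceptable and alert only on 2.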

Log Management

Log Locations

All services log to Docker's logging driver (default: json-file). Access logs via docker compose logs or docker logs.

# All services
docker compose logs -f

# Specific service (with timestamps)
docker compose logs -f --timestamps backend

# Last 100 lines from GPU worker
docker logs --tail 100 opentranscribe-celery-worker

# Using opentr.sh
./opentr.sh logs backend
./opentr.sh logs celery-worker

Log Levels

| Service | Default Level | Environment Variable or Setting |
| --- | --- | --- |
| Backend (FastAPI) | info | LOG_LEVEL |
| Celery Workers | info | Set in command: (e.g., --loglevel=info) |
| Flower | info | Set in command: |
| PostgreSQL | notice | PostgreSQL config |
| OpenSearch | info | OpenSearch config |

What to Look For in Logs

| Service | Log Pattern | Indicates |
| --- | --- | --- |
| GPU Worker | torch.cuda.OutOfMemoryError | GPU VRAM exhausted -- reduce batch size or concurrency |
| GPU Worker | VRAM Usage [...] | Per-stage VRAM reporting (when profiling enabled) |
| Backend | Alembic migration | Database schema migration on startup |
| Backend | Model registered | OpenSearch neural model initialization |
| Download Worker | yt-dlp errors | Media download failures (auth, geo-restriction) |
| NLP Worker | LLM provider errors | LLM API failures (timeout, rate limit, auth) |
| OpenSearch | circuit_breaking_exception | JVM heap exhausted -- increase the heap size in OPENSEARCH_JAVA_OPTS |
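
The two literal error strings in the table can be counted directly in recent log output. A minimal sketch, assuming those strings appear verbatim in the logs; count_critical is a hypothetical helper.

```shell
# Hypothetical helper: count log lines on stdin that contain either
# torch.cuda.OutOfMemoryError or circuit_breaking_exception.
count_critical() {
  grep -c -E 'torch\.cuda\.OutOfMemoryError|circuit_breaking_exception' || true
}

# Usage:
#   docker logs --since 1h opentranscribe-celery-worker 2>&1 | count_critical
```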

Key Metrics to Watch

| Metric | How to Check | Warning Threshold | Action |
| --- | --- | --- | --- |
| Disk space | df -h | Under 10% free | Clean old transcriptions, expand storage |
| GPU VRAM | nvidia-smi | >90% sustained | Reduce BATCH_SIZE, lower concurrency |
| GPU temperature | nvidia-smi | >85 °C | Improve cooling, reduce workload |
| gpu queue depth | Flower dashboard | >20 pending | Add GPU workers or upgrade GPU |
| PostgreSQL connections | pg_stat_activity | >80% of max_connections | Increase PG_MAX_CONNECTIONS |
| OpenSearch heap | _nodes/stats/jvm | >85% of heap | Increase the heap size in OPENSEARCH_JAVA_OPTS |
| Redis memory | redis-cli info memory | >80% of maxmemory | Increase limit or tune eviction |
| Container restarts | docker inspect | >3 in 1 hour | Check OOM kills, review logs |
| Celery task failures | Flower tasks tab | >5% failure rate | Review failed task args and exceptions |
| MinIO disk usage | MinIO Console | Under 10% free | Archive old media, expand storage |
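
The disk-space threshold is straightforward to script. The sketch below reads the POSIX output format of df (df -P) and reports mount points with less than 10% free, that is, a capacity above 90%; low_disk is a hypothetical helper, and mount points containing spaces are not handled.

```shell
# Hypothetical helper: read `df -P` output and print mount points whose
# capacity is above 90% (less than 10% free).
low_disk() {
  awk 'NR > 1 { use = $5; sub(/%/, "", use); if (use + 0 > 90) print $6, $5 }'
}

# Usage:
#   df -P | low_disk
```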

Integration with External Monitoring

OpenTranscribe does not ship with built-in Prometheus/Grafana integration, but its services expose standard interfaces that work with external monitoring stacks.

Prometheus + Grafana

  • Node Exporter: Install on the host for CPU, memory, disk, and network metrics
  • NVIDIA GPU Exporter: Use dcgm-exporter for GPU metrics in Prometheus format
  • PostgreSQL Exporter: Use postgres_exporter pointed at the exposed PostgreSQL port
  • Redis Exporter: Use redis_exporter for Redis metrics
  • OpenSearch: With the prometheus-exporter plugin installed, OpenSearch exposes metrics at /_prometheus/metrics
  • Flower: Flower exposes a JSON API at /api/workers and /api/tasks that can be scraped by a custom exporter
  • Docker: Use cAdvisor for per-container resource metrics

Datadog / New Relic / Similar

  • Use the vendor's Docker integration for container metrics
  • Point database integrations at exposed ports (PostgreSQL 5176, Redis 5177, OpenSearch 5180)
  • Use the NVIDIA GPU integration for GPU metrics
  • Configure log collection from Docker's json-file driver

Syslog / ELK

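Configure Docker's logging driver to forward to syslog or a centralized log collector, for example in the daemon configuration file (typically /etc/docker/daemon.json on Linux). The setting takes effect after the Docker daemon is restarted and applies only to containers created afterwards: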

{
  "log-driver": "syslog",
  "log-opts": {
    "syslog-address": "tcp://logserver:514",
    "tag": "opentranscribe/{{.Name}}"
  }
}

Alerting Recommendations

Set up alerts for these critical conditions:

| Condition | Severity | Detection | Recommended Action |
| --- | --- | --- | --- |
| Service container down | Critical | Docker health check fails 3x | Auto-restart (Docker handles this), page if persists >5 min |
| GPU OOM | High | torch.cuda.OutOfMemoryError in GPU worker logs | Reduce BATCH_SIZE, check for concurrent diarization |
| Disk space under 10% | High | df -h or node exporter | Archive media, expand storage |
| gpu queue >50 tasks | Medium | Flower API or Redis LLEN gpu | Scale GPU workers, prioritize batches |
| OpenSearch red status | Critical | _cluster/health API | Check disk space, review shard allocation |
| PostgreSQL connections >80% | Medium | pg_stat_activity | Increase PG_MAX_CONNECTIONS, check connection leaks |
| Redis memory >80% | Medium | redis-cli info memory | Increase maxmemory, review eviction policy |
| Task failure rate >5% | Medium | Flower dashboard | Review failed task exceptions |
| GPU temperature >85 °C | High | nvidia-smi | Improve cooling, throttle workload |
| Celery worker offline >5 min | High | Flower workers tab | Check container logs, restart worker |
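
As one example of turning a row of this table into a check, the Redis memory condition can be computed from the used_memory and maxmemory fields of redis-cli info memory. This is a sketch only; redis_mem_pct is a hypothetical helper. When maxmemory is 0, Redis has no configured limit, so no percentage is reported.

```shell
# Hypothetical helper: read `redis-cli info memory` output on stdin and print
# the integer percentage of maxmemory in use, or "unlimited" if maxmemory is 0.
redis_mem_pct() {
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 == 0) print "unlimited"
      else printf "%d\n", 100 * used / max
    }'
}

# Usage (add redis-cli authentication if a password is configured):
#   docker exec opentranscribe-redis redis-cli info memory | redis_mem_pct
```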