Monitoring & Logging

OpenTranscribe runs as a multi-service Docker Compose application with GPU-accelerated AI processing, background task queues, and several data stores. Effective monitoring ensures reliable operation and early detection of issues.

OpenTranscribe ships with a built-in, fully self-hosted observability stack — Prometheus + Grafana scraping a native FastAPI /metrics endpoint, plus structured JSON access logs. There is no Google Analytics or third-party SaaS telemetry: all metrics stay on your infrastructure. The sections immediately below cover this built-in stack; later sections cover the standing operational tooling (Flower, health checks, nvidia-smi, log management) and how to plug into external monitoring (CloudWatch, Datadog, etc.).

Built-in Metrics & Dashboards

What the backend exposes

The backend instruments every HTTP request and database query and exposes them in Prometheus format. Two unauthenticated endpoints are mounted at the application root (next to /health), not under /api and never proxied by nginx — they are reachable only from inside the Docker network:

Endpoint | Purpose
GET /metrics | Prometheus exposition format. Request latency/RPS/errors by route template, DB queries per request (the duplicate-call / N+1 detector), DB query latency, in-flight requests, cache hit/miss counters, Celery queue depth, and product counters (signups, uploads).
GET /health/ready | Readiness probe for load balancers / Kubernetes. Checks Postgres + Redis (critical → 503 if down) and OpenSearch + MinIO (degraded-but-ready). Returns {"status": "ready", "checks": {...}}. The original GET /health (static 200) is unchanged and still drives the Docker healthcheck.
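
A deploy script or external probe could interpret the readiness response roughly as follows. This is a minimal sketch, not the project's own client: `classify_readiness` and `probe` are hypothetical helper names, and because the layout inside "checks" is not specified here, the code only interprets the HTTP status and the top-level "status" field and passes "checks" through untouched.

```python
import json
import urllib.error
import urllib.request


def classify_readiness(status_code: int, body: str) -> tuple[bool, dict]:
    """Return (ready, checks) for a /health/ready response.

    Only the HTTP status and the top-level "status" field are interpreted.
    The per-dependency "checks" payload is returned as-is because its exact
    shape is an unknown in this sketch.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False, {}
    if not isinstance(payload, dict):
        return False, {}
    checks = payload.get("checks")
    if not isinstance(checks, dict):
        checks = {}
    ready = status_code == 200 and payload.get("status") == "ready"
    return ready, checks


def probe(url: str = "http://backend:8080/health/ready", timeout: float = 3.0):
    """Fetch the readiness endpoint; must run inside the Docker network."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify_readiness(resp.status, resp.read().decode("utf-8"))
    except urllib.error.HTTPError as exc:
        # A 503 (Postgres or Redis down) is raised as HTTPError by urllib.
        return classify_readiness(exc.code, exc.read().decode("utf-8"))
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout: treat as not ready.
        return False, {}
```

Keeping the classification separate from the network call makes the decision logic easy to unit-test without a running stack.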

Key metric names (stable; dashboards are built against these):

Metric | Type | Labels
http_request_duration_seconds | Histogram | method, route, status
http_requests_total | Counter | method, route, status (5xx rate derived in Grafana)
http_requests_in_flight | Gauge | none
db_query_duration_seconds | Histogram | none (no statement/table labels, to bound cardinality)
db_queries_per_request | Histogram | method, route
cache_operations_total | Counter | cache (redis/settings), result (hit/miss)
celery_queue_depth | Gauge | queue
user_signups_total | Counter | method (local/ldap/keycloak/pki/external)
files_uploaded_total | Counter | source (upload/url/watch)

Route labels use the route template (e.g. /api/files/{file_id}), never the raw path or query string. This bounds cardinality and keeps tokens/PII out of metrics. user_id/org_id are written to the JSON access log only, never as Prometheus labels.

Worker-side product events (transcription outcomes, processed minutes, watch-source imports) happen in Celery workers, whose Prometheus registries are never scraped. Those are dashboarded from the database via Grafana's PostgreSQL datasource (the Product dashboard below), not from Prometheus counters.

Starting the stack

The Prometheus + Grafana overlay is optional and started with a single flag:

./opentr.sh start dev --with-monitoring

This loads docker-compose.monitoring.yml and brings up:

Service | URL | Notes
Prometheus | http://localhost:5186 | 15s scrape of backend:8080/metrics; 15-day retention
Grafana | http://localhost:5185 | Login admin / $GRAFANA_PASSWORD (default admin)

Both containers run no-new-privileges, restart: unless-stopped, with read-only config mounts and named data volumes. Grafana anonymous access and self-signup are disabled. Override the host ports with PROMETHEUS_PORT / GRAFANA_PORT and the password with GRAFANA_PASSWORD in .env.

Omit the flag and the stack runs completely unchanged — the overlay adds nothing to the base services.

Verify after start: Prometheus → Status → Targets shows opentranscribe-backend UP; Grafana → Dashboards → OpenTranscribe lists both dashboards and renders data after you click around the app.

Dashboard tour

Two dashboards are auto-provisioned into the OpenTranscribe folder:

OpenTranscribe — Backend Ops (opentranscribe.json, Prometheus datasource):

  • Request latency p50 / p95 / p99 by route: histogram_quantile(...) over http_request_duration_seconds_bucket. The histogram buckets run out to 600s so long uploads don't saturate p99 at +Inf.
  • Requests per second by route and 5xx error rate (fraction of all requests).
  • Requests in flight — a stat panel off http_requests_in_flight; watch this near the DB pool ceiling.
  • DB queries per request — p95 by route — the duplicate-call radar. A route whose p95 jumps to dozens of queries is doing N+1 or repeated identical lookups within one request.
  • DB query latency p99 / p95 and cache hit ratio by cache (split by the redis / settings cache label).
  • Celery queue depth by queue (summed across priority sub-keys).
  • Signups / uploads rate product counters (API-process events).
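
To see why the bucket range matters for the latency panels, here is a small Python sketch of the linear-interpolation estimate that histogram_quantile performs over cumulative buckets. It follows the documented Prometheus behaviour for classic histograms, simplified for illustration; the bucket boundaries and counts in the usage note are made-up numbers, not OpenTranscribe's real bucket layout.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    The last bucket must be +Inf. If the requested rank lands in the +Inf
    bucket, the upper bound of the highest finite bucket is returned, which
    is the saturation effect a wide bucket range is meant to avoid.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last being +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        return buckets[-2][0]
    # Durations are non-negative, so the first bucket starts at 0.
    lower, below = (0.0, 0.0) if i == 0 else buckets[i - 1]
    if count == below:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)
```

With buckets that stop at 10s, for example `[(1.0, 90), (10.0, 95), (float("inf"), 100)]`, the p99 rank falls into the +Inf bucket and the estimate is pinned at 10.0 regardless of how slow the slowest requests really were. Extending the finite buckets to 600s lets the interpolation produce a meaningful value for long uploads.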

OpenTranscribe — Product & Usage (product.json, mixed datasources):

  • Signups over time by method and uploads over time by source — from Prometheus (user_signups_total, files_uploaded_total).
  • DAU / WAU and daily active users — distinct user_id from the refresh_token table (a refresh token is minted per login), via the PostgreSQL datasource.
  • Files completed per day, transcription minutes processed per day (file_pipeline_timing.audio_duration_s), and files by status over time / current error+orphaned count — straight from media_file / file_pipeline_timing. This is how worker-side product events are tracked without Prometheus.

The PostgreSQL datasource (read-only role for production)

The Product dashboard reads the database directly through a provisioned Grafana PostgreSQL datasource (UID opentranscribe-pg). In dev it reuses the stack's Postgres credentials for convenience. In production, point it at a dedicated read-only role instead of the application superuser. Create one (idempotently) and grant it read access:

-- Run once against the OpenTranscribe database, as a superuser.
DO $$
BEGIN
IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname = 'grafana_ro') THEN
CREATE ROLE grafana_ro LOGIN PASSWORD 'CHANGE_ME_strong_password';
END IF;
END
$$;

GRANT CONNECT ON DATABASE opentranscribe TO grafana_ro;
GRANT USAGE ON SCHEMA public TO grafana_ro;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO grafana_ro;
-- Make future tables readable too:
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO grafana_ro;

Then set the datasource's user/password to grafana_ro (the datasource is provisioned editable: true, so you can repoint it from Connections → Data sources in the Grafana UI, or pass grafana_ro credentials via the POSTGRES_* env vars the overlay forwards). OpenTranscribe does not create this role for you — provision it deliberately as part of your production setup.

Per-user / per-tenant analysis via JSON access logs

Set LOG_FORMAT=json (in .env, then recreate the backend container — env changes need a recreate, not just a restart) to emit one structured JSON object per request on the access logger. Each line carries user_id, org_id (null in the self-hosted edition), request_id, route (template), method, status, duration_ms, db_query_count, and client_ip.

Because Prometheus deliberately omits user_id (cardinality), this is where per-user and per-tenant analysis lives — DAU/WAU by tenant, onboarding funnels (first login → first upload → first transcript view, by route template), and per-user request volumes. Pipe the logs into any log-analytics tool:

# Quick local analysis: top routes by request count today
./opentr.sh logs backend | grep '"message": "request"' | jq -r '.route' | sort | uniq -c | sort -rn | head

In text mode (the default) the same fields are folded into a human-readable one-liner, so logs stay readable without JSON tooling.
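
For analysis beyond a one-liner, the same JSON lines can be aggregated in Python. The sketch below is an assumption-laden illustration: it relies only on the user_id, route, and duration_ms fields listed above, `summarize_access_log` is a hypothetical helper name, and it tolerates a container-name prefix in front of the JSON object in case the log command adds one.

```python
import json
from collections import Counter, defaultdict
from typing import Iterable


def summarize_access_log(lines: Iterable[str]) -> dict:
    """Count requests per user and track the slowest request per route."""
    requests_per_user = Counter()
    durations_by_route = defaultdict(list)
    for line in lines:
        start = line.find("{")  # skip any "service | " style prefix
        if start == -1:
            continue
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # text-mode or otherwise non-JSON line
        if not isinstance(record, dict) or "route" not in record:
            continue
        requests_per_user[record.get("user_id")] += 1
        duration = record.get("duration_ms")
        if isinstance(duration, (int, float)):
            durations_by_route[record["route"]].append(duration)
    return {
        "requests_per_user": dict(requests_per_user),
        "max_duration_ms_by_route": {
            route: max(values) for route, values in durations_by_route.items()
        },
    }
```

A typical use would be to pass `sys.stdin` to the function and pipe `./opentr.sh logs backend` into the script. Anonymous requests end up under the `None` key, since user_id is absent or null for them.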

AWS / cloud notes

The built-in stack is portable to managed AWS services with no code change:

  • Amazon Managed Prometheus (AMP) scrapes the identical backend:8080/metrics endpoint — point an AMP scraper or an ADOT/Prometheus agent (or a Kubernetes podMonitor) at it. Keep /metrics off any public Ingress; it is internal-only by design.
  • Amazon Managed Grafana (AMG): import both dashboard JSONs as-is. The Ops dashboard is pure PromQL (fully portable); the Product dashboard's PostgreSQL panels just need an AMG PostgreSQL datasource pointed at your RDS instance (use the read-only role above on RDS).
  • CloudWatch Logs: set LOG_FORMAT=json and let Fluent Bit / the CloudWatch agent ship the structured access lines. CloudWatch Logs Insights then queries user_id / org_id / route / duration_ms directly for DAU/WAU and funnels.
  • Readiness: switch your load balancer / Kubernetes readinessProbe from /health to /health/ready so traffic is only routed once Postgres and Redis are actually reachable.

Future tweak: uvicorn emits its own access log line next to OpenTranscribe's structured one (duplicate request lines). This is left as-is because changing the container CMD is a behavior change for existing log consumers; in production you can add --no-access-log to the uvicorn command to drop the duplicate and rely solely on the structured access logger.

Monitoring Architecture

Service Health Checks

Every service in OpenTranscribe has a Docker health check. These are defined in docker-compose.yml and automatically monitored by Docker.

Service | Container | Health Check | Interval | What It Verifies
PostgreSQL | opentranscribe-postgres | pg_isready -U postgres | 5s | Database accepts connections
MinIO | opentranscribe-minio | curl -f http://localhost:9000/minio/health/live | 5s | Object storage API is responsive
Redis | opentranscribe-redis | redis-cli ping (with auth if configured) | 5s | Cache/broker responds to PING
OpenSearch | opentranscribe-opensearch | curl -sS http://localhost:9200 | 5s | Search cluster is reachable
Backend | opentranscribe-backend | curl -f http://localhost:8080/health | 10s | FastAPI app is serving requests
GPU Worker | opentranscribe-celery-worker | celery inspect ping -d gpu-transcription@$HOSTNAME | 30s | Worker is connected to broker and responsive
Download Worker | opentranscribe-celery-download-worker | celery inspect ping -d media-downloader@$HOSTNAME | 30s | Worker is connected and processing downloads
CPU Worker | opentranscribe-celery-cpu-worker | celery inspect ping -d cpu-processor@$HOSTNAME | 30s | Worker handles CPU-bound tasks
NLP Worker | opentranscribe-celery-nlp-worker | celery inspect ping -d ai-nlp@$HOSTNAME | 30s | Worker handles LLM/NLP tasks
Embedding Worker | opentranscribe-celery-embedding-worker | celery inspect ping -d search-indexer@$HOSTNAME | 30s | Worker handles search embedding tasks
Celery Beat | opentranscribe-celery-beat | Checks /app/celerybeat-schedule modification time < 300s | 30s | Scheduler is writing schedule file
Flower | opentranscribe-flower | Web UI on port 5555 | N/A | Monitoring dashboard is accessible
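
The Celery Beat check differs from the others in that it looks at file freshness rather than pinging a process. The actual command in docker-compose.yml is not reproduced here; the following is a Python sketch of the same idea, with `schedule_is_fresh` as a hypothetical name and the 300-second limit taken from the table above.

```python
import os
import time
from typing import Optional


def schedule_is_fresh(
    path: str = "/app/celerybeat-schedule",
    max_age_s: float = 300.0,
    now: Optional[float] = None,
) -> bool:
    """Return True if the Beat schedule file was modified recently enough.

    A missing or unreadable file counts as unhealthy, since a running
    scheduler is expected to keep rewriting it.
    """
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return False
    current = time.time() if now is None else now
    return (current - mtime) < max_age_s
```

The `now` parameter exists only so the age comparison can be tested deterministically; in normal use it is left unset.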

Check all health statuses at once:

docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

Flower Dashboard

Flower provides real-time monitoring of all Celery workers and tasks.

Access: http://localhost:5175/flower

Default credentials: admin / flower (configurable via FLOWER_USER and FLOWER_PASSWORD in .env)

What to Monitor in Flower

Tab | Key Metrics | What to Look For
Dashboard | Active/processed/failed task counts | Failed count increasing, active count stuck
Workers | Online workers, task counts per worker | Workers offline, uneven task distribution
Tasks | Task state, runtime, args | Tasks stuck in STARTED for too long, repeated failures
Queues | Queue depth per queue | Messages backing up in gpu queue
Broker | Redis connection status | Broker connectivity issues

Queue Architecture

OpenTranscribe uses dedicated queues for different workload types:

Queue | Worker | Concurrency | Purpose
gpu | celery-worker | 1 (default) | Transcription + diarization (GPU-bound)
download | celery-download-worker | 3 | Media URL downloads (I/O-bound)
cpu,utility | celery-cpu-worker | 8 | CPU-bound processing tasks
nlp,celery | celery-nlp-worker | 4 | LLM summarization, speaker ID
embedding | celery-embedding-worker | 1 | Search index embedding generation

Flower Configuration

Flower is configured with these operational settings in docker-compose.yml:

  • --max_tasks=10000 -- retains last 10,000 tasks in the dashboard
  • --persistent=True -- persists task history to /app/flower.db
  • --purge_offline_workers=600 -- removes offline workers after 10 minutes
  • --natural_time=True -- displays human-readable timestamps

Docker Container Monitoring

Resource Usage

# Live resource usage for all containers
docker stats

# One-shot snapshot
docker stats --no-stream

# Specific container
docker stats opentranscribe-celery-worker

Restart Counts

Frequent restarts indicate instability (often OOM kills or crash loops):

# Check restart counts
docker inspect --format='{{.Name}}: {{.RestartCount}}' $(docker ps -aq) 2>/dev/null | sort -t: -k2 -nr

# Check if a container was OOM killed
docker inspect --format='{{.Name}}: OOMKilled={{.State.OOMKilled}}' $(docker ps -aq) 2>/dev/null
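
The two checks above can be combined by parsing the JSON that `docker inspect` prints when no --format is given. This is a sketch under stated assumptions: `unstable_containers` and `inspect_all` are hypothetical helper names, and the default limit of 3 restarts is an illustrative cutoff. RestartCount is a total since the container was created, not a per-hour rate, so the result is a starting point for investigation rather than an alert condition.

```python
import json
import subprocess


def unstable_containers(inspect_json: str, max_restarts: int = 3) -> list[dict]:
    """Pick containers that restarted often or were OOM-killed.

    Expects the JSON array produced by `docker inspect <id> [<id> ...]`.
    """
    report = []
    for item in json.loads(inspect_json):
        restarts = int(item.get("RestartCount", 0))
        oom = bool(item.get("State", {}).get("OOMKilled", False))
        if restarts > max_restarts or oom:
            report.append(
                {
                    "name": item.get("Name", "").lstrip("/"),
                    "restarts": restarts,
                    "oom_killed": oom,
                }
            )
    return sorted(report, key=lambda r: r["restarts"], reverse=True)


def inspect_all() -> str:
    """Run `docker inspect` over every container, running or stopped."""
    ids = subprocess.run(
        ["docker", "ps", "-aq"], capture_output=True, text=True, check=True
    ).stdout.split()
    if not ids:
        return "[]"
    return subprocess.run(
        ["docker", "inspect", *ids], capture_output=True, text=True, check=True
    ).stdout
```

Calling `unstable_containers(inspect_all())` on the Docker host returns the flagged containers, most-restarted first.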

Container Events

# Watch for container start/stop/die events
docker events --filter 'type=container' --format '{{.Time}} {{.Actor.Attributes.name}} {{.Action}}'

GPU Monitoring

nvidia-smi

# One-shot GPU status
nvidia-smi

# Continuous monitoring (updates every 1 second)
watch -n 1 nvidia-smi

# Compact output with utilization and memory
nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv,noheader,nounits

VRAM Profiling

OpenTranscribe includes built-in VRAM profiling that uses NVML (not PyTorch) for accurate device-level memory tracking. This captures memory used by CTranslate2, which is invisible to torch.cuda.memory_allocated().

Enable profiling:

# In .env
ENABLE_VRAM_PROFILING=true

View profiling results:

# Via Admin API
curl http://localhost:5174/api/admin/gpu-profiles

# Via profiling test script
./scripts/gpu-profile-test.sh --results

Key GPU Metrics

Metric | Normal Range | Warning Threshold
GPU Utilization | 80-100% during transcription | Sustained 0% with queued tasks
VRAM Usage (idle) | ~5.5 GB (models loaded) | N/A
VRAM Usage (transcription) | +300-400 MB above idle | N/A
VRAM Usage (diarization) | +1-11 GB (scales with audio length) | >90% of total VRAM
Temperature | 40-80 C | >85 C sustained
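
The VRAM and temperature thresholds can be checked against the compact nvidia-smi output shown earlier (index, name, utilization, memory used, memory total, temperature, in CSV without header or units). The following is a sketch with a hypothetical function name, `gpu_warnings`. It evaluates a single sample, so detecting a sustained condition would require calling it repeatedly and comparing results over time.

```python
import csv
import io


def gpu_warnings(
    csv_text: str, vram_limit: float = 0.90, temp_limit_c: float = 85.0
) -> list[str]:
    """Flag GPUs above the VRAM or temperature threshold in one sample.

    Expects one row per GPU with six fields in the order:
    index, name, utilization.gpu, memory.used, memory.total, temperature.gpu
    """
    warnings = []
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        if len(row) != 6:
            continue
        index, name, _util, mem_used, mem_total, temp = row
        try:
            used, total, temp_c = float(mem_used), float(mem_total), float(temp)
        except ValueError:
            continue  # fields the driver cannot report are not numeric
        if total > 0 and used / total > vram_limit:
            warnings.append(f"GPU {index} ({name}): VRAM {used / total:.0%} used")
        if temp_c > temp_limit_c:
            warnings.append(f"GPU {index} ({name}): {temp_c:.0f} C")
    return warnings
```

The input is simply the captured stdout of the nvidia-smi query, for example via `subprocess.run(..., capture_output=True, text=True).stdout`.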

Database Monitoring

PostgreSQL

# Connection count
docker exec opentranscribe-postgres psql -U postgres -d opentranscribe -c \
"SELECT count(*) as connections FROM pg_stat_activity;"

# Active queries
docker exec opentranscribe-postgres psql -U postgres -d opentranscribe -c \
"SELECT pid, state, query_start, query FROM pg_stat_activity WHERE state = 'active';"

# Table sizes
docker exec opentranscribe-postgres psql -U postgres -d opentranscribe -c \
"SELECT relname AS table, pg_size_pretty(pg_total_relation_size(relid)) AS size
FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC LIMIT 10;"

# Slow queries (if pg_stat_statements is enabled)
docker exec opentranscribe-postgres psql -U postgres -d opentranscribe -c \
"SELECT calls, mean_exec_time::numeric(10,2) AS avg_ms, query
FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 5;"

# Cache hit ratio (should be >95%)
docker exec opentranscribe-postgres psql -U postgres -d opentranscribe -c \
"SELECT round(100.0 * sum(blks_hit) / nullif(sum(blks_hit + blks_read), 0), 2) AS cache_hit_ratio
FROM pg_stat_database WHERE datname = 'opentranscribe';"

Connection Limits

The default max_connections is 200 (configurable via PG_MAX_CONNECTIONS in .env). Monitor connection usage to avoid exhaustion -- each backend instance, Celery worker, and Flower connection consumes a slot.

OpenSearch Monitoring

# Cluster health (green/yellow/red)
curl -s http://localhost:5180/_cluster/health | python3 -m json.tool

# Index stats (document counts, sizes)
curl -s 'http://localhost:5180/_cat/indices?v'

# Node stats (JVM heap, disk, CPU)
curl -s http://localhost:5180/_nodes/stats/jvm,os,fs | python3 -m json.tool

# Pending tasks
curl -s http://localhost:5180/_cluster/pending_tasks | python3 -m json.tool

# ML model status (neural search)
curl -s http://localhost:5180/_plugins/_ml/models/_search -H 'Content-Type: application/json' \
-d '{"query":{"match_all":{}}}'

OpenSearch Health Status

Status | Meaning | Action
green | All shards assigned | Normal operation
yellow | Primary shards OK, replicas unassigned | Expected in single-node deployments
red | Some primary shards unassigned | Investigate immediately -- data may be unavailable
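
A script that consumes the _cluster/health response can encode the table above so that yellow on a single node does not raise false alarms. This is a sketch with a hypothetical function name, `assess_cluster_health`. Treating yellow as a warning on multi-node clusters is an assumption added for illustration; the table above only covers the single-node case.

```python
import json


def assess_cluster_health(health_json: str) -> tuple[str, str]:
    """Map a _cluster/health response to (severity, explanation)."""
    health = json.loads(health_json)
    status = health.get("status")
    nodes = health.get("number_of_nodes", 0)
    if status == "green":
        return "ok", "all shards assigned"
    if status == "yellow":
        if nodes == 1:
            # Replicas cannot be placed on the same node as their primary.
            return "ok", "replicas unassigned, expected on a single node"
        return "warn", "replicas unassigned on a multi-node cluster"
    if status == "red":
        return "critical", "primary shards unassigned, data may be unavailable"
    return "unknown", f"unexpected status: {status!r}"
```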

Log Management

Log Locations

All services log to Docker's logging driver (default: json-file). Access logs via docker compose logs or docker logs.

# All services
docker compose logs -f

# Specific service (with timestamps)
docker compose logs -f --timestamps backend

# Last 100 lines from GPU worker
docker logs --tail 100 opentranscribe-celery-worker

# Using opentr.sh
./opentr.sh logs backend
./opentr.sh logs celery-worker

Log Levels

Service | Default Level | Environment Variable
Backend (FastAPI) | info | LOG_LEVEL
Celery Workers | info | Set in command: (e.g., --loglevel=info)
Flower | info | Set in command:
PostgreSQL | notice | PostgreSQL config
OpenSearch | info | OpenSearch config

What to Look For in Logs

Service | Log Pattern | Indicates
GPU Worker | torch.cuda.OutOfMemoryError | GPU VRAM exhausted -- reduce batch size or concurrency
GPU Worker | VRAM Usage [...] | Per-stage VRAM reporting (when profiling enabled)
Backend | Alembic migration | Database schema migration on startup
Backend | Model registered | OpenSearch neural model initialization
Download Worker | yt-dlp errors | Media download failures (auth, geo-restriction)
NLP Worker | LLM provider errors | LLM API failures (timeout, rate limit, auth)
OpenSearch | circuit_breaking_exception | JVM heap exhausted -- increase OPENSEARCH_JAVA_OPTS

Key Metrics to Watch

Metric | How to Check | Warning Threshold | Action
Disk space | df -h | Under 10% free | Clean old transcriptions, expand storage
GPU VRAM | nvidia-smi | >90% sustained | Reduce BATCH_SIZE, lower concurrency
GPU temperature | nvidia-smi | >85 C | Improve cooling, reduce workload
gpu queue depth | Flower dashboard | >20 pending | Add GPU workers or upgrade GPU
PostgreSQL connections | pg_stat_activity | >80% of max_connections | Increase PG_MAX_CONNECTIONS
OpenSearch heap | _nodes/stats/jvm | >85% of heap | Increase OPENSEARCH_JAVA_OPTS
Redis memory | redis-cli info memory | >80% of maxmemory | Increase limit or tune eviction
Container restarts | docker inspect | >3 in 1 hour | Check OOM kills, review logs
Celery task failures | Flower tasks tab | >5% failure rate | Review failed task args and exceptions
MinIO disk usage | MinIO Console | Under 10% free | Archive old media, expand storage

Integration with External Monitoring

OpenTranscribe ships with a built-in Prometheus + Grafana stack for application-level metrics (see Built-in Metrics & Dashboards above). The exporters below complement it with host, GPU, and data-store metrics that the in-app /metrics endpoint does not cover.

Prometheus + Grafana (host / infra exporters)

  • Node Exporter: Install on the host for CPU, memory, disk, and network metrics
  • NVIDIA GPU Exporter: Use dcgm-exporter for GPU metrics in Prometheus format
  • PostgreSQL Exporter: Use postgres_exporter pointed at the exposed PostgreSQL port
  • Redis Exporter: Use redis_exporter for Redis metrics
  • OpenSearch: OpenSearch exposes /_prometheus/metrics via the prometheus-exporter plugin
  • Flower: Flower exposes a JSON API at /api/workers and /api/tasks that can be scraped by a custom exporter
  • Docker: Use cAdvisor for per-container resource metrics
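
For the Flower item, a custom exporter mostly amounts to fetching /api/tasks and re-emitting counts in Prometheus exposition format. The sketch below covers only that conversion step. It assumes the response is a JSON object keyed by task ID whose values carry a "state" field; the metric name flower_tasks and the function name are hypothetical choices, not something Flower or OpenTranscribe defines.

```python
import json
from collections import Counter


def tasks_to_exposition(tasks_json: str, metric: str = "flower_tasks") -> str:
    """Convert a Flower /api/tasks response into Prometheus text format.

    Emits one gauge sample per task state. Only tasks still retained by
    Flower are counted, so the values reflect Flower's task window rather
    than all-time totals.
    """
    tasks = json.loads(tasks_json)
    by_state = Counter(
        str(task.get("state", "UNKNOWN"))
        for task in tasks.values()
        if isinstance(task, dict)
    )
    lines = [
        f"# HELP {metric} Tasks retained by Flower, grouped by state.",
        f"# TYPE {metric} gauge",
    ]
    for state in sorted(by_state):
        # Celery state names are plain uppercase words, so no label
        # escaping is applied here.
        lines.append(f'{metric}{{state="{state}"}} {by_state[state]}')
    return "\n".join(lines) + "\n"
```

A real exporter would wrap this in a small HTTP server, fetch the JSON with the Flower credentials on each scrape, and be added as an extra Prometheus scrape target.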

Datadog / New Relic / Similar

  • Use the vendor's Docker integration for container metrics
  • Point database integrations at exposed ports (PostgreSQL 5176, Redis 5177, OpenSearch 5180)
  • Use the NVIDIA GPU integration for GPU metrics
  • Configure log collection from Docker's json-file driver

Syslog / ELK

Configure Docker's logging driver to forward to syslog or a centralized log collector:

{
  "log-driver": "syslog",
  "log-opts": {
    "syslog-address": "tcp://logserver:514",
    "tag": "opentranscribe/{{.Name}}"
  }
}

Alerting Recommendations

Set up alerts for these critical conditions:

Condition | Severity | Detection | Recommended Action
Service container down | Critical | Docker health check fails 3x | Auto-restart (Docker handles this), page if persists >5 min
GPU OOM | High | torch.cuda.OutOfMemoryError in GPU worker logs | Reduce BATCH_SIZE, check for concurrent diarization
Disk space under 10% | High | df -h or node exporter | Archive media, expand storage
gpu queue >50 tasks | Medium | Flower API or Redis LLEN gpu | Scale GPU workers, prioritize batches
OpenSearch red status | Critical | _cluster/health API | Check disk space, review shard allocation
PostgreSQL connections >80% | Medium | pg_stat_activity | Increase PG_MAX_CONNECTIONS, check connection leaks
Redis memory >80% | Medium | redis-cli info memory | Increase maxmemory, review eviction policy
Task failure rate >5% | Medium | Flower dashboard | Review failed task exceptions
GPU temperature >85 C | High | nvidia-smi | Improve cooling, throttle workload
Celery worker offline >5 min | High | Flower workers tab | Check container logs, restart worker
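
One caveat for the gpu queue alert: when Celery task priorities are used with the Redis broker, a plain LLEN gpu only counts the default-priority list, and higher priority levels live in separate keys. The sketch below sums all of them. It assumes kombu's default Redis transport naming (separator "\x06\x16" and priority steps 0, 3, 6, 9), which may differ if the deployment overrides priority_steps. The `client` argument is any object with an `llen(key)` method, such as a redis-py client, and the function names are hypothetical.

```python
PRIORITY_SEP = "\x06\x16"  # separator in kombu's Redis transport (assumed default)
PRIORITY_STEPS = (0, 3, 6, 9)  # kombu default steps; an assumption for this setup


def queue_keys(queue: str) -> list[str]:
    """Return the Redis list keys that together hold one Celery queue."""
    return [
        queue if step == 0 else f"{queue}{PRIORITY_SEP}{step}"
        for step in PRIORITY_STEPS
    ]


def queue_depth(client, queue: str) -> int:
    """Sum pending messages across all priority sub-keys of a queue."""
    return sum(int(client.llen(key)) for key in queue_keys(queue))


def queue_alert(depth: int, threshold: int = 50) -> bool:
    """True when the depth exceeds the alert threshold (default >50 tasks)."""
    return depth > threshold
```

Passing the client in as a parameter keeps the sketch free of third-party imports and makes it testable with a stub. Tasks already reserved by a worker are not in these lists, so the result reflects waiting work only.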