Monitoring Guide
Overview
Monitoring guide for Orchestrator, Sevastopol, and PostgreSQL. It covers key metrics, alert configuration, and recommended dashboards.
Key Metrics
Application Metrics (Orchestrator)
| Metric | Description | Alert Threshold |
|---|---|---|
| http_requests_total | Total HTTP requests | N/A (informational) |
| http_request_duration_seconds | Request latency | > 2s = warning, > 5s = critical |
| http_requests_errors_total | Requests that returned an error (4xx, 5xx) | > 10/min = warning |
| active_connections | Active connections in the DB pool | > 80% of max = warning |
| jwt_validations_failed | Failed auth attempts | > 10/min = critical |
| tenant_switches | Tenant switches per minute | N/A (informational) |
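Thresholds such as "> 10/min" apply to the rate of change of a cumulative counter, not to its raw value. The following is a minimal sketch of that conversion, assuming two scrapes of the same counter; the `CounterSample` type and `perMinuteRate` function are hypothetical names used only for illustration, and the reset handling is a simplification of what Prometheus does.

```typescript
// One scrape of a monotonically increasing counter.
interface CounterSample {
  timestampMs: number; // when the sample was taken (Unix epoch, milliseconds)
  value: number; // cumulative counter value at that time
}

// Average per-minute increase between two samples.
// If the counter went backwards, the process is assumed to have restarted,
// so the newer value is treated as the whole increase.
function perMinuteRate(older: CounterSample, newer: CounterSample): number {
  const elapsedMs = newer.timestampMs - older.timestampMs;
  if (elapsedMs <= 0) {
    throw new Error("samples must be in chronological order");
  }
  const increase =
    newer.value >= older.value ? newer.value - older.value : newer.value;
  return (increase * 60_000) / elapsedMs;
}

// 30 new errors over two minutes: 15 per minute, above the 10/min warning level.
const errorsPerMinute = perMinuteRate(
  { timestampMs: 0, value: 120 },
  { timestampMs: 120_000, value: 150 },
);
console.log(errorsPerMinute); // 15
```

In practice Prometheus performs this calculation with `rate()` or `increase()`; the sketch only shows what those functions measure.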
Database Metrics (PostgreSQL)
| Metric | Description | Alert Threshold |
|---|---|---|
| pg_stat_activity_count | Active connections | > 80 = warning |
| pg_stat_database_xact_commit | Transactions per second | < 10/s = investigate |
| pg_stat_database_deadlocks | Deadlocks detected | > 0 = warning |
| pg_stat_replication_lag | Replication lag | > 60s = critical |
| pg_database_size_bytes | Database size | > 50GB = warning |
| pg_stat_user_tables_n_dead_tup | Dead tuples (bloat) | > 100k = run vacuum |
System Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| cpu_usage_percent | CPU usage | > 80% = warning, > 95% = critical |
| memory_usage_percent | RAM usage | > 85% = warning |
| disk_usage_percent | Disk usage | > 80% = warning, > 90% = critical |
| disk_io_util | Disk I/O | > 80% = investigate |
| network_rx_bytes | Bytes received | N/A (baseline) |
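The two-level thresholds in these tables can be read as a simple classification rule. The sketch below illustrates it, assuming thresholds are exclusive (a value equal to the threshold does not alert); the `Severity` and `Thresholds` types and the `classify` function are hypothetical names, not part of Orchestrator.

```typescript
type Severity = "ok" | "warning" | "critical";

interface Thresholds {
  warning: number;
  critical?: number; // some metrics in the tables only define a warning level
}

// Map a metric value to a severity, checking the critical level first.
function classify(value: number, t: Thresholds): Severity {
  if (t.critical !== undefined && value > t.critical) return "critical";
  if (value > t.warning) return "warning";
  return "ok";
}

// Threshold values taken from the System Metrics table.
const cpu: Thresholds = { warning: 80, critical: 95 };
const memory: Thresholds = { warning: 85 };

console.log(classify(97, cpu)); // "critical"
console.log(classify(90, memory)); // "warning"
```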
Instrumentation
Orchestrator (Node.js)
Install prom-client:
```bash
npm install prom-client
```
Basic configuration:
```typescript
import {
  Registry,
  Counter,
  Histogram,
  Gauge,
  collectDefaultMetrics,
} from "prom-client";

export const register = new Registry();

// Default metrics (CPU, memory, etc.)
collectDefaultMetrics({ register });

// Request counter
export const httpRequestsTotal = new Counter({
  name: "http_requests_total",
  help: "Total HTTP requests",
  labelNames: ["method", "route", "status"],
  registers: [register],
});

// Request duration histogram
export const httpRequestDuration = new Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration in seconds",
  labelNames: ["method", "route"],
  buckets: [0.1, 0.5, 1, 2, 5, 10],
  registers: [register],
});

// Active DB connections
export const dbActiveConnections = new Gauge({
  name: "db_active_connections",
  help: "Active database connections",
  registers: [register],
});
```
Metrics middleware:
```typescript
import { Request, Response, NextFunction } from "express";
import { httpRequestsTotal, httpRequestDuration } from "../metrics/prometheus";

export function metricsMiddleware(
  req: Request,
  res: Response,
  next: NextFunction,
) {
  const start = Date.now();

  res.on("finish", () => {
    const duration = (Date.now() - start) / 1000;
    const route = req.route?.path || req.path;

    httpRequestsTotal.inc({
      method: req.method,
      route,
      status: res.statusCode,
    });

    httpRequestDuration.observe({ method: req.method, route }, duration);
  });

  next();
}
```
Endpoint /metrics:
```typescript
import { Router } from "express";
import { register } from "../metrics/prometheus";

const router = Router();

router.get("/metrics", async (req, res) => {
  res.set("Content-Type", register.contentType);
  res.send(await register.metrics());
});

export default router;
```
PostgreSQL
Enable pg_stat_statements:
```ini
# postgresql.conf (changing shared_preload_libraries requires a server restart)
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = all
```
```sql
-- Run in the database
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
```
Monitoring queries:
```sql
-- Active connections by state
SELECT state, count(*)
FROM pg_stat_activity
WHERE datname = 'nostromo'
GROUP BY state;

-- Top queries by mean execution time
SELECT query, calls, mean_exec_time::numeric(10,2) AS mean_ms
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Size per tenant schema
-- format('%I.%I', ...) quotes identifiers that need it; the backslash makes
-- the underscore literal, since a bare _ matches any single character in LIKE.
SELECT schemaname,
       pg_size_pretty(sum(pg_total_relation_size(format('%I.%I', schemaname, tablename)::regclass))) AS size
FROM pg_tables
WHERE schemaname LIKE 'tenant\_%'
GROUP BY schemaname
ORDER BY sum(pg_total_relation_size(format('%I.%I', schemaname, tablename)::regclass)) DESC;

-- Dead tuples (tables that need vacuum)
SELECT relname, n_dead_tup
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC;
```
Alerts
Alert Configuration (Prometheus/Alertmanager)
alert_rules.yml:
```yaml
groups:
  - name: nostromo_alerts
    rules:
      # Slow API
      # http_request_duration_seconds is a histogram, so P95 is derived from its buckets.
      - alert: HighApiLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API latency is high"
          description: "95th percentile latency > 2s for 5 minutes"

      # Too many errors
      # Ratio of 5xx responses to all responses, so 0.1 means 10%.
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "More than 10% of requests are failing"

      # Connection pool nearly full
      # Assumes a db_max_connections gauge is exported alongside db_active_connections.
      - alert: DbConnectionPoolExhausted
        expr: db_active_connections / db_max_connections > 0.8
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "DB connection pool is > 80% full"

      # PostgreSQL deadlocks
      - alert: PostgresDeadlocks
        expr: increase(pg_stat_database_deadlocks[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL deadlock detected"

      # Disk nearly full
      - alert: DiskSpaceLow
        expr: disk_usage_percent > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk space is running low (> 85%)"
```
Notifications
Alertmanager config (alertmanager.yml):
```yaml
global:
  smtp_smarthost: "smtp.example.com:587"

route:
  receiver: "default"
  group_by: ["alertname"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    - match:
        severity: critical
      receiver: "oncall"
    - match:
        severity: warning
      receiver: "slack"

receivers:
  - name: "default"
    email_configs:

  - name: "oncall"
    email_configs:
    # webhook_configs:
    #   - url: 'https://pagerduty.com/...'

  - name: "slack"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/..."
        channel: "#alerts"
```
Dashboards
Grafana - Main Dashboard
Recommended panels:
| Panel | Type | Query |
|---|---|---|
| Request Rate | Graph | rate(http_requests_total[5m]) |
| Error Rate | Graph | rate(http_requests_total{status=~"5.."}[5m]) |
| Latency P95 | Graph | histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) |
| Active Connections | Gauge | db_active_connections |
| CPU Usage | Graph | rate(process_cpu_seconds_total[5m]) |
| Memory Usage | Graph | process_resident_memory_bytes |
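The Latency P95 panel relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The following is a simplified sketch of that idea, not Prometheus's actual implementation. The `Bucket` type, the `estimateQuantile` function, and the sample counts are hypothetical; only the bucket bounds come from the histogram configured in this guide.

```typescript
// One cumulative histogram bucket: `count` observations were <= `le` seconds.
interface Bucket {
  le: number; // upper bound; Infinity stands for the +Inf bucket
  count: number;
}

// Estimate quantile q (0..1) from buckets sorted by `le`, ending with +Inf.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  if (q < 0 || q > 1) throw new Error("q must be between 0 and 1");
  if (buckets.length < 2) {
    throw new Error("need at least one finite bucket plus +Inf");
  }

  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN; // no observations yet

  // Position of the requested quantile among all observations.
  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);

  // Rank falls in the +Inf bucket: the best available answer is the
  // highest finite upper bound.
  if (i === buckets.length - 1) return buckets[i - 1].le;

  const lowerBound = i === 0 ? 0 : buckets[i - 1].le;
  const lowerCount = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - lowerCount;
  if (inBucket === 0) return lowerBound;

  // Assume observations are spread evenly inside the bucket.
  return (
    lowerBound +
    (buckets[i].le - lowerBound) * ((rank - lowerCount) / inBucket)
  );
}

// Hypothetical counts for the bucket bounds configured in this guide.
const sample: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 80 },
  { le: 1, count: 90 },
  { le: 2, count: 100 },
  { le: 5, count: 100 },
  { le: 10, count: 100 },
  { le: Infinity, count: 100 },
];

console.log(estimateQuantile(0.95, sample)); // about 1.5 seconds
```

Because the estimate can never exceed the largest finite bound, bucket bounds should bracket the alert thresholds. The configured buckets include both 2 and 5, which match the warning and critical latency levels.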
PostgreSQL Dashboard
Recommended panels:
| Panel | Type | Query/Source |
|---|---|---|
| Connections by State | Pie | pg_stat_activity grouped by state |
| Transactions/sec | Graph | rate(pg_stat_database_xact_commit[1m]) |
| Database Size | Gauge | pg_database_size_bytes |
| Deadlocks | Counter | pg_stat_database_deadlocks |
| Cache Hit Ratio | Gauge | blks_hit / (blks_hit + blks_read) |
| Top Slow Queries | Table | pg_stat_statements |
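The Cache Hit Ratio panel uses the formula `blks_hit / (blks_hit + blks_read)`, whose inputs are columns of `pg_stat_database`. The sketch below is a small worked version of that arithmetic, with a guard for a database that has not served any reads yet. The `BlockStats` type and `cacheHitRatio` function are hypothetical names for illustration.

```typescript
// Block counters for one database, as reported by pg_stat_database.
interface BlockStats {
  blksHit: number; // blocks found in the PostgreSQL buffer cache
  blksRead: number; // blocks that had to be read from outside the buffer cache
}

// Fraction of block requests served from cache, or null when there is
// no activity yet (avoids dividing by zero).
function cacheHitRatio(stats: BlockStats): number | null {
  const total = stats.blksHit + stats.blksRead;
  return total === 0 ? null : stats.blksHit / total;
}

console.log(cacheHitRatio({ blksHit: 990, blksRead: 10 })); // 0.99
console.log(cacheHitRatio({ blksHit: 0, blksRead: 0 })); // null
```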
Manual Monitoring Commands
Quick Health Check
```bash
#!/bin/bash
echo "=== Orchestrator ==="
curl -s http://localhost:8000/health | jq .

echo "=== PostgreSQL ==="
pg_isready && echo "OK" || echo "FAIL"

echo "=== PM2 ==="
pm2 status

echo "=== Disk ==="
df -h /

echo "=== Memory ==="
free -h

echo "=== CPU ==="
uptime
```
Real-Time Logs
```bash
# Orchestrator logs
pm2 logs orchestrator --lines 50

# PostgreSQL logs
sudo tail -f /var/log/postgresql/postgresql-16-main.log

# System logs
sudo journalctl -u orchestrator -f
```
PostgreSQL Connections
```bash
# Count connections
sudo -u postgres psql -c "SELECT count(*) FROM pg_stat_activity WHERE datname = 'nostromo';"

# Show active queries
sudo -u postgres psql -c "SELECT pid, now() - query_start as duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC LIMIT 5;"
```
Recommended Tools
| Tool | Purpose | Lightweight Alternative |
|---|---|---|
| Prometheus | Time-series metrics | PM2 metrics |
| Grafana | Visualization | Netdata |
| Alertmanager | Alerting | Cron + email |
| Loki | Logs | journalctl + grep |
| pgBadger | PostgreSQL analysis | pg_stat_statements |
Runbook: Alert Response
HighApiLatency
- Check for slow queries in PostgreSQL
- Review active connections in the pool
- Check server CPU and memory
- Review recent error logs
HighErrorRate
- Review error logs: `pm2 logs orchestrator --err`
- Verify connectivity to PostgreSQL
- Check external services (SII API, etc.)
- Roll back if there was a recent deploy
DiskSpaceLow
- Clean up old logs: `logrotate -f`
- Delete old local backups
- Vacuum PostgreSQL: `VACUUM FULL` (it rewrites each table, so it takes an exclusive lock and temporarily needs free space for the new copy)
- Expand the disk if necessary
Related Documentation
Changelog
| Date | Version | Changes |
|---|---|---|
| 2026-01-18 | 1.0 | Initial guide created |