Monitoring Guide

Monitoring guide for Orchestrator, Sevastopol, and PostgreSQL. Covers key metrics, alert configuration, and recommended dashboards.


Application metrics:

| Metric | Description | Alert threshold |
| --- | --- | --- |
| `http_requests_total` | Total HTTP requests | N/A (informational) |
| `http_request_duration_seconds` | Request latency | > 2s = warning, > 5s = critical |
| `http_requests_errors_total` | Requests with errors (4xx, 5xx) | > 10/min = warning |
| `active_connections` | Active DB pool connections | > 80% of max = warning |
| `jwt_validations_failed` | Failed auth attempts | > 10/min = critical |
| `tenant_switches` | Tenant switches per minute | N/A (informational) |

PostgreSQL metrics:

| Metric | Description | Alert threshold |
| --- | --- | --- |
| `pg_stat_activity_count` | Active connections | > 80 = warning |
| `pg_stat_database_xact_commit` | Transactions/second | < 10/s = investigate |
| `pg_stat_database_deadlocks` | Deadlocks detected | > 0 = warning |
| `pg_stat_replication_lag` | Replication lag | > 60s = critical |
| `pg_database_size_bytes` | Database size | > 50GB = warning |
| `pg_stat_user_tables_n_dead_tup` | Dead tuples (bloat) | > 100k = run vacuum |

System metrics:

| Metric | Description | Alert threshold |
| --- | --- | --- |
| `cpu_usage_percent` | CPU usage | > 80% = warning, > 95% = critical |
| `memory_usage_percent` | RAM usage | > 85% = warning |
| `disk_usage_percent` | Disk usage | > 80% = warning, > 90% = critical |
| `disk_io_util` | Disk I/O utilization | > 80% = investigate |
| `network_rx_bytes` | Bytes received | N/A (baseline) |

Install prom-client:

```shell
npm install prom-client
```

Basic configuration:

src/metrics/prometheus.ts

```typescript
import {
  Registry,
  Counter,
  Histogram,
  Gauge,
  collectDefaultMetrics,
} from "prom-client";

export const register = new Registry();

// Default metrics (CPU, memory, etc.)
collectDefaultMetrics({ register });

// Request counter
export const httpRequestsTotal = new Counter({
  name: "http_requests_total",
  help: "Total HTTP requests",
  labelNames: ["method", "route", "status"],
  registers: [register],
});

// Request duration histogram
export const httpRequestDuration = new Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration in seconds",
  labelNames: ["method", "route"],
  buckets: [0.1, 0.5, 1, 2, 5, 10],
  registers: [register],
});

// Active DB connections
export const dbActiveConnections = new Gauge({
  name: "db_active_connections",
  help: "Active database connections",
  registers: [register],
});
```
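Each metric registered above is rendered in the Prometheus plain-text exposition format when scraped. A rough sketch of how a single counter series is rendered (the `renderCounter` helper and its types are illustrative only, not part of prom-client):

```typescript
// Illustrative only: mimics the text exposition format that
// register.metrics() produces for a counter with labels.
interface Series {
  labels: Record<string, string>;
  value: number;
}

function renderCounter(name: string, help: string, series: Series[]): string {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const s of series) {
    const labelStr = Object.entries(s.labels)
      .map(([k, v]) => `${k}="${v}"`)
      .join(",");
    lines.push(`${name}{${labelStr}} ${s.value}`);
  }
  return lines.join("\n");
}
```

For example, one `http_requests_total` series with `method="GET"`, `route="/health"`, `status="200"` and value 42 renders as `http_requests_total{method="GET",route="/health",status="200"} 42`.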

Metrics middleware:

src/middleware/metrics.ts

```typescript
import { Request, Response, NextFunction } from "express";
import { httpRequestsTotal, httpRequestDuration } from "../metrics/prometheus";

export function metricsMiddleware(
  req: Request,
  res: Response,
  next: NextFunction,
) {
  const start = Date.now();

  res.on("finish", () => {
    const duration = (Date.now() - start) / 1000;
    const route = req.route?.path || req.path;

    httpRequestsTotal.inc({
      method: req.method,
      route,
      status: res.statusCode,
    });

    httpRequestDuration.observe({ method: req.method, route }, duration);
  });

  next();
}
```
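One caveat in the middleware: when no route matches (e.g. 404s), `req.route` is undefined and the raw `req.path` becomes the label value, so IDs embedded in URLs can explode label cardinality. A hedged sketch of a normalizer for that fallback (the function name and patterns are assumptions, not existing project code):

```typescript
// Collapse dynamic path segments (UUIDs, numeric IDs) into a placeholder
// so Prometheus label cardinality stays bounded. Illustrative sketch.
function normalizeRoute(path: string): string {
  return path
    .replace(
      /\/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}(?=\/|$)/gi,
      "/:id",
    )
    .replace(/\/\d+(?=\/|$)/g, "/:id");
}
```

It would plug into the fallback branch: `const route = req.route?.path || normalizeRoute(req.path);`.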

The /metrics endpoint:

src/routes/metrics.ts

```typescript
import { Router } from "express";
import { register } from "../metrics/prometheus";

const router = Router();

router.get("/metrics", async (req, res) => {
  res.set("Content-Type", register.contentType);
  res.send(await register.metrics());
});

export default router;
```
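With the endpoint mounted, Prometheus needs a scrape job pointing at it. A minimal fragment for prometheus.yml (the job name, the 15s interval, and port 8000 are assumptions based on this guide's examples):

```yaml
scrape_configs:
  - job_name: "orchestrator"
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:8000"]
```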

Enable pg_stat_statements:

```sql
-- postgresql.conf
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = all

-- Run in the database
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
```

Monitoring queries:

```sql
-- Active connections by state
SELECT state, count(*)
FROM pg_stat_activity
WHERE datname = 'nostromo'
GROUP BY state;

-- Top queries by mean execution time
SELECT query, calls, mean_exec_time::numeric(10,2) AS mean_ms
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Size per schema
SELECT schemaname,
       pg_size_pretty(sum(pg_total_relation_size(schemaname||'.'||tablename))) AS size
FROM pg_tables
WHERE schemaname LIKE 'tenant_%'
GROUP BY schemaname
ORDER BY sum(pg_total_relation_size(schemaname||'.'||tablename)) DESC;

-- Dead tuples (need vacuum)
SELECT relname, n_dead_tup
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC;
```

Alert Configuration (Prometheus/Alertmanager)

alert_rules.yml:

```yaml
groups:
  - name: nostromo_alerts
    rules:
      # Slow API
      - alert: HighApiLatency
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API latency is high"
          description: "95th percentile latency > 2s for 5 minutes"

      # High error rate
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "More than 10% of requests are failing"

      # Connection pool exhausted
      - alert: DbConnectionPoolExhausted
        expr: db_active_connections / db_max_connections > 0.8
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "DB connection pool is > 80% full"

      # PostgreSQL deadlocks
      - alert: PostgresDeadlocks
        expr: increase(pg_stat_database_deadlocks[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL deadlock detected"

      # Disk full
      - alert: DiskSpaceLow
        expr: disk_usage_percent > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk space is running low (> 85%)"
```

Note: the metric defined earlier is a Histogram, so the P95 latency must be computed with `histogram_quantile()` over the `_bucket` series, and the error-rate alert divides 5xx throughput by total throughput so the threshold really is a 10% ratio.
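The `rate()` expressions above can be demystified with a toy calculation: to a first approximation, PromQL's `rate()` is the increase of a counter over the window divided by the window length in seconds. A simplified sketch (real PromQL also handles counter resets and boundary extrapolation, which this ignores):

```typescript
// A counter sample inside the rate() window: [timestampSeconds, counterValue]
type Sample = [number, number];

// Simplified rate(): increase over the window divided by elapsed seconds.
// Ignores counter resets and the extrapolation that PromQL performs.
function counterRate(samples: Sample[]): number {
  const [t0, v0] = samples[0];
  const [t1, v1] = samples[samples.length - 1];
  return (v1 - v0) / (t1 - t0);
}
```

For example, a counter that grows from 100 to 130 over a 5-minute window yields 30/300 = 0.1 events per second.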

Alertmanager config (alertmanager.yml):

```yaml
global:
  smtp_smarthost: "smtp.example.com:587"
  smtp_from: "[email protected]"

route:
  receiver: "default"
  group_by: ["alertname"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: "oncall"
    - match:
        severity: warning
      receiver: "slack"

receivers:
  - name: "default"
    email_configs:
  - name: "oncall"
    email_configs:
    # webhook_configs:
    #   - url: 'https://pagerduty.com/...'
  - name: "slack"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/..."
        channel: "#alerts"
```

Recommended panels (API):

| Panel | Type | Query |
| --- | --- | --- |
| Request Rate | Graph | `rate(http_requests_total[5m])` |
| Error Rate | Graph | `rate(http_requests_total{status=~"5.."}[5m])` |
| Latency P95 | Graph | `histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))` |
| Active Connections | Gauge | `db_active_connections` |
| CPU Usage | Graph | `rate(process_cpu_seconds_total[1m])` |
| Memory Usage | Graph | `process_resident_memory_bytes` |
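The Latency P95 panel relies on `histogram_quantile()`, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket where the target rank lands. A simplified sketch of that idea (the function and types are illustrative, not Prometheus source):

```typescript
// Buckets are cumulative counts with ascending upper bounds (le),
// ending in an le = +Infinity bucket, as prom-client exports them.
interface Bucket {
  le: number;         // upper bound of the bucket
  cumulative: number; // number of observations <= le
}

function histogramQuantile(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].cumulative;
  const rank = q * total;
  for (let i = 0; i < buckets.length; i++) {
    if (buckets[i].cumulative >= rank) {
      const lower = i === 0 ? 0 : buckets[i - 1].le;
      const below = i === 0 ? 0 : buckets[i - 1].cumulative;
      if (!isFinite(buckets[i].le)) return lower; // rank fell in the +Inf bucket
      // Linear interpolation inside the bucket
      return (
        lower +
        ((rank - below) / (buckets[i].cumulative - below)) *
          (buckets[i].le - lower)
      );
    }
  }
  return NaN;
}
```

With 100 observations where 90 fall under 0.5s and all fall under 1s, the 95th percentile interpolates to 0.75s, which is why bucket boundaries matter for how precise the panel's estimate is.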

Recommended panels (PostgreSQL):

| Panel | Type | Query/Source |
| --- | --- | --- |
| Connections by State | Pie | `pg_stat_activity` grouped by state |
| Transactions/sec | Graph | `rate(pg_stat_database_xact_commit[1m])` |
| Database Size | Gauge | `pg_database_size_bytes` |
| Deadlocks | Counter | `pg_stat_database_deadlocks` |
| Cache Hit Ratio | Gauge | `blks_hit / (blks_hit + blks_read)` |
| Top Slow Queries | Table | `pg_stat_statements` |

health_check.sh

```shell
#!/bin/bash
echo "=== Orchestrator ==="
curl -s http://localhost:8000/health | jq .

echo "=== PostgreSQL ==="
pg_isready && echo "OK" || echo "FAIL"

echo "=== PM2 ==="
pm2 status

echo "=== Disk ==="
df -h /

echo "=== Memory ==="
free -h

echo "=== CPU ==="
uptime
```
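To run this check on a schedule, a crontab entry like the following works (the script path and log location are assumptions, adjust to your layout):

```
# crontab -e: run the health check every 5 minutes
*/5 * * * * /opt/nostromo/health_check.sh >> /var/log/nostromo-health.log 2>&1
```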
```shell
# Orchestrator logs
pm2 logs orchestrator --lines 50

# PostgreSQL logs
sudo tail -f /var/log/postgresql/postgresql-16-main.log

# System logs
sudo journalctl -u orchestrator -f
```
```shell
# Count connections
sudo -u postgres psql -c "SELECT count(*) FROM pg_stat_activity WHERE datname = 'nostromo';"

# Show active queries
sudo -u postgres psql -c "SELECT pid, now() - query_start as duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC LIMIT 5;"
```

| Tool | Purpose | Lightweight alternative |
| --- | --- | --- |
| Prometheus | Time-series metrics | PM2 metrics |
| Grafana | Visualization | Netdata |
| Alertmanager | Alerting | Cron + email |
| Loki | Logs | journalctl + grep |
| pgBadger | PostgreSQL analysis | pg_stat_statements |

API slow:

1. Check for slow queries in PostgreSQL
2. Review active pool connections
3. Check server CPU/memory
4. Review recent error logs

API returning errors:

1. Review error logs: `pm2 logs orchestrator --err`
2. Verify connectivity to PostgreSQL
3. Check external services (SII API, etc.)
4. Roll back if there was a recent deploy

Disk full:

1. Clean up old logs: `logrotate -f`
2. Delete old local backups
3. Vacuum PostgreSQL: `VACUUM FULL`
4. Expand the disk if needed


| Date | Version | Changes |
| --- | --- | --- |
| 2026-01-18 | 1.0 | Initial guide created |