[Mastering Metabase, Starting Now #15] Operations and Monitoring - Backup and Recovery
Learning Objectives
By the end of this post, you will be able to:
- Establish and run a regular backup strategy
- Draft a disaster recovery plan
- Monitor the system and track performance
- Resolve common problems
- Build a high-availability architecture
- Analyze logs and troubleshoot issues
Backup Strategy
What to Back Up
graph TD
    A[Metabase Backup] --> B[Application DB]
    A --> C[Uploaded Files]
    A --> D[Settings]
    B --> B1[Questions]
    B --> B2[Dashboards]
    B --> B3[Users/Permissions]
    B --> B4[Collections]
    C --> C1[Logos]
    C --> C2[Custom Assets]
    D --> D1[Environment Variables]
    D --> D2[Database Connections]
Application Database Backup
PostgreSQL:
#!/bin/bash
# backup-metabase.sh
# Variables
BACKUP_DIR="/backups/metabase"
DATE=$(date +%Y%m%d_%H%M%S)
DB_NAME="metabase"
DB_USER="metabase"
DB_HOST="localhost"
RETENTION_DAYS=30
# Create backup directory
mkdir -p $BACKUP_DIR
# Backup database
pg_dump -h $DB_HOST -U $DB_USER -F c -b -v -f "$BACKUP_DIR/metabase_$DATE.backup" $DB_NAME
# Compress
gzip "$BACKUP_DIR/metabase_$DATE.backup"
# Delete old backups
find $BACKUP_DIR -name "metabase_*.backup.gz" -mtime +$RETENTION_DAYS -delete
# Log
echo "[$(date)] Backup completed: metabase_$DATE.backup.gz" >> /var/log/metabase-backup.log
Cron setup (daily at 2 AM):
0 2 * * * /path/to/backup-metabase.sh
MySQL:
#!/bin/bash
# backup-metabase-mysql.sh
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/backups/metabase"
mysqldump \
  --host=localhost \
  --user=metabase \
  --password=your_password \
  --single-transaction \
  --quick \
  --lock-tables=false \
  metabase | gzip > "$BACKUP_DIR/metabase_$DATE.sql.gz"
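Passing --password on the command line exposes it in the process list; a client options file keeps it out of sight. A minimal sketch, assuming the file belongs to the backup user:
# Create a client options file once; mysqldump reads ~/.my.cnf automatically
cat > ~/.my.cnf <<'EOF'
[client]
user=metabase
password=your_password
EOF
chmod 600 ~/.my.cnf
# --password is no longer needed on the command line
mysqldump --single-transaction --quick --lock-tables=false metabase \
  | gzip > "$BACKUP_DIR/metabase_$DATE.sql.gz"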
File System Backup
#!/bin/bash
# backup-metabase-files.sh
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/backups/metabase"
METABASE_DATA="/var/lib/metabase"
# Backup uploaded files and assets
tar -czf "$BACKUP_DIR/metabase_files_$DATE.tar.gz" \
  $METABASE_DATA/plugins \
  $METABASE_DATA/uploads \
  /path/to/custom-assets
# Backup environment configuration
cp /path/to/.env "$BACKUP_DIR/env_$DATE.backup"
Cloud Storage Backup
AWS S3:
#!/bin/bash
# sync to S3
aws s3 sync /backups/metabase/ s3://company-backups/metabase/ \
  --storage-class GLACIER \
  --exclude "*" \
  --include "*.backup.gz" \
  --include "*.tar.gz"
Google Cloud Storage:
gsutil -m rsync -r /backups/metabase/ gs://company-backups/metabase/
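For restore drills you will also need to pull the most recent backup back down; a sketch, assuming the naming convention from the scripts above:
# Fetch the newest DB backup from S3 (names sort chronologically
# because of the %Y%m%d_%H%M%S suffix)
LATEST=$(aws s3 ls s3://company-backups/metabase/ \
  | awk '{print $4}' | grep '\.backup\.gz$' | sort | tail -1)
aws s3 cp "s3://company-backups/metabase/$LATEST" /tmp/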
Automated Backup Monitoring
#!/bin/bash
# backup-monitor.sh
BACKUP_DIR="/backups/metabase"
ALERT_EMAIL="ops@company.com"
MAX_AGE_HOURS=26  # Alert if backup older than 26 hours
# Find latest backup
LATEST_BACKUP=$(find $BACKUP_DIR -name "metabase_*.backup.gz" -type f -printf '%T@ %p\n' | sort -rn | head -1 | cut -d' ' -f2)
if [ -z "$LATEST_BACKUP" ]; then
  echo "No backup found!" | mail -s "Metabase Backup FAILED" $ALERT_EMAIL
  exit 1
fi
# Check age
BACKUP_AGE=$(( ($(date +%s) - $(stat -c %Y "$LATEST_BACKUP")) / 3600 ))
if [ $BACKUP_AGE -gt $MAX_AGE_HOURS ]; then
  echo "Latest backup is $BACKUP_AGE hours old (threshold: $MAX_AGE_HOURS)" \
    | mail -s "Metabase Backup WARNING" $ALERT_EMAIL
fi
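Freshness alone does not prove the backup is restorable. Two cheap integrity checks worth appending to the monitor (a sketch; note that pg_restore --list only validates the dump's table of contents, not a full restore):
# Verify the gzip archive is intact
gzip -t "$LATEST_BACKUP" || \
  echo "Backup archive is corrupt: $LATEST_BACKUP" \
    | mail -s "Metabase Backup CORRUPT" $ALERT_EMAIL
# For pg_dump custom format, listing the TOC catches truncated dumps
gunzip -c "$LATEST_BACKUP" > /tmp/check.backup
pg_restore --list /tmp/check.backup > /dev/null || \
  echo "Backup TOC unreadable: $LATEST_BACKUP" \
    | mail -s "Metabase Backup CORRUPT" $ALERT_EMAIL
rm -f /tmp/check.backup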
Recovery Procedures
Full Recovery
Step 1: Stop the service
docker-compose down
# or
systemctl stop metabase
Step 2: Restore the database
# PostgreSQL
gunzip metabase_20250101.backup.gz
pg_restore -h localhost -U metabase -d metabase -c metabase_20250101.backup
# MySQL
gunzip < metabase_20250101.sql.gz | mysql -u metabase -p metabase
Step 3: Restore files
tar -xzf metabase_files_20250101.tar.gz -C /
cp env_20250101.backup /path/to/.env
Step 4: Start the service
docker-compose up -d
# or
systemctl start metabase
Step 5: Verify
# Health check
curl http://localhost:3000/api/health
# Login test
# Dashboard test
# Query execution test
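The manual checks above can be scripted so the restore runbook has a pass/fail gate; a minimal sketch:
# Post-restore smoke test: /api/health returns {"status":"ok"} when up
if curl -sf http://localhost:3000/api/health | grep -q '"ok"'; then
  echo "Restore verified: Metabase is healthy"
else
  echo "Restore verification FAILED" >&2
  exit 1
fi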
Partial Recovery (Specific Objects)
Restoring a single Question:
-- From backup database
SELECT * FROM report_card WHERE id = 123;
-- Insert into production
INSERT INTO report_card (...)
VALUES (...);
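Hand-writing the INSERT column list is error-prone; a more practical route is to restore the backup into a scratch database and move the row with COPY. A sketch, assuming PostgreSQL and a scratch database named metabase_restore (this works when the row was deleted; an id that still exists will hit a primary-key conflict):
# Restore the dump into a scratch DB, then move one row across
createdb metabase_restore
pg_restore -d metabase_restore metabase_20250101.backup
psql -d metabase_restore -c \
  "\copy (SELECT * FROM report_card WHERE id = 123) TO '/tmp/card_123.csv' CSV"
psql -d metabase -c "\copy report_card FROM '/tmp/card_123.csv' CSV"
dropdb metabase_restore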
Monitoring
System Metrics
Prometheus + Grafana:
# prometheus.yml
# Note: /api/health returns JSON, not Prometheus metrics. Recent Metabase
# versions expose a real /metrics endpoint on a separate port when
# MB_PROMETHEUS_SERVER_PORT is set (9191 here is an arbitrary choice).
scrape_configs:
  - job_name: 'metabase'
    static_configs:
      - targets: ['localhost:9191']
    metrics_path: '/metrics'
    scrape_interval: 30s
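Enabling that endpoint on the container (a sketch; the port number is an assumption, pick any free port):
# Expose Metabase's built-in Prometheus endpoint alongside the app port
docker run -d -p 3000:3000 -p 9191:9191 \
  -e MB_PROMETHEUS_SERVER_PORT=9191 \
  metabase/metabase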
Key metrics:
System:
- CPU usage
- Memory usage
- Disk I/O
- Network traffic
Application:
- Active users
- Query execution time
- Dashboard load time
- Error rate
- Cache hit rate
Database:
- Connection pool usage
- Query latency
- Slow queries (> 5s)
- Database size
Application Logs
Setting the log level:
# Environment variable
MB_LOG_LEVEL=INFO # DEBUG, INFO, WARN, ERROR
# Docker
docker logs -f metabase
docker logs --since 1h metabase | grep ERROR
Log analysis:
# Slow queries
grep "Query took" /var/log/metabase/metabase.log | awk '$4 > 5000'
# Errors
grep ERROR /var/log/metabase/metabase.log | tail -100
# User activity
grep "GET /api/card" /var/log/metabase/metabase.log | wc -l
Alert Configuration
Grafana Alerts:
alerts:
  - name: Metabase High CPU
    condition: cpu_usage > 80
    duration: 5m
    notification: ops-team
  - name: Metabase High Memory
    condition: memory_usage > 90
    duration: 5m
    notification: ops-team
  - name: Metabase Slow Queries
    condition: avg_query_time > 10s
    duration: 10m
    notification: analytics-team
  - name: Metabase Error Rate
    condition: error_rate > 5%
    duration: 5m
    notification: ops-team
Performance Monitoring
Query Performance
Slow Query Log:
-- Metabase internal metrics
SELECT
  qe.id,
  qe.started_at,
  qe.running_time as duration_ms,
  qe.result_rows,
  c.name as card_name,
  u.email as user_email
FROM query_execution qe
JOIN report_card c ON qe.card_id = c.id
JOIN core_user u ON qe.executor_id = u.id
WHERE qe.running_time > 5000  -- 5 seconds
  AND qe.started_at >= NOW() - INTERVAL '24 hours'
ORDER BY qe.running_time DESC
LIMIT 50;
Performance trends:
-- Daily average query time
SELECT
  DATE(started_at) as date,
  COUNT(*) as query_count,
  AVG(running_time) as avg_duration_ms,
  MAX(running_time) as max_duration_ms,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY running_time) as p95_duration_ms
FROM query_execution
WHERE started_at >= NOW() - INTERVAL '30 days'
GROUP BY DATE(started_at)
ORDER BY date DESC;
Dashboard Usage
-- Most viewed dashboards
-- Note: table names vary by Metabase version; dashboards live in
-- report_dashboard, and views are recorded in view_log (model = 'dashboard').
-- The date filter sits in the JOIN so dashboards with no recent views
-- still appear (a WHERE clause would silently drop them).
SELECT
  d.id,
  d.name,
  COUNT(vl.id) as view_count,
  COUNT(DISTINCT vl.user_id) as unique_users
FROM report_dashboard d
LEFT JOIN view_log vl ON vl.model = 'dashboard' AND vl.model_id = d.id
  AND vl.timestamp >= NOW() - INTERVAL '30 days'
GROUP BY d.id, d.name
ORDER BY view_count DESC
LIMIT 20;
User Activity
-- Active users trend
-- Note: the activity table name varies across Metabase versions
-- (activity in older releases, audit_log in newer ones); adjust accordingly
SELECT
  DATE(timestamp) as date,
  COUNT(DISTINCT user_id) as active_users,
  COUNT(*) as total_actions
FROM activity_log
WHERE timestamp >= NOW() - INTERVAL '90 days'
GROUP BY DATE(timestamp)
ORDER BY date DESC;
Troubleshooting
Common Issues
1. Metabase won't start
# Check the logs
docker logs metabase
# Common issues:
# - Port already in use
sudo lsof -i :3000
# - Database connection failed
# Check MB_DB_* environment variables
# - Out of memory
docker stats metabase
# - Permissions
ls -la /metabase-data
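When the failure is a database connection, confirm what the container actually received and test the same credentials from the host (a sketch, assuming the application DB is PostgreSQL):
# Show the MB_DB_* settings the container sees
docker exec metabase env | grep '^MB_DB'
# Try the same credentials directly against the application DB
psql -h localhost -U metabase -d metabase -c 'SELECT 1;'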
2. Queries are slow
Diagnosis:
1. Check the query execution log
2. Run EXPLAIN ANALYZE (see the sketch below)
3. Check the database indexes
4. Check the connection pool status
Solutions:
- Add indexes
- Optimize the query
- Increase the cache TTL
- Increase the connection pool size
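For step 2, paste the slow query from the execution log into EXPLAIN to see where the time goes (PostgreSQL syntax shown; your_warehouse and the example query are placeholders):
# Profile the slow query against the data warehouse
psql -d your_warehouse <<'SQL'
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*) FROM orders
WHERE created_at >= NOW() - INTERVAL '30 days';
SQL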
3. Dashboards load slowly
Diagnosis:
- Browser DevTools Network tab
- Check per-Question load times
- Check the cache hit rate
Solutions:
- Reduce the number of Questions
- Enable caching
- Use Models (pre-aggregation)
- Lazy loading
4. Out of Memory
# Increase the JVM heap size
# Running the JAR directly:
java -Xmx4g -jar metabase.jar
# Docker (the official image reads JAVA_OPTS):
docker run -e "JAVA_OPTS=-Xmx4g" metabase/metabase
5. Database Connection Pool Exhausted
# Increase the pool size for connected warehouses (the application DB has
# its own MB_APPLICATION_DB_MAX_CONNECTION_POOL_SIZE; both default to 15,
# so raising the limit means setting a value above that)
MB_JDBC_DATA_WAREHOUSE_MAX_CONNECTION_POOL_SIZE=30
# Check current connections (run in psql)
SELECT COUNT(*) FROM pg_stat_activity
WHERE datname = 'your_db';
Debug Mode
# Enable debug logging
MB_LOG_LEVEL=DEBUG
# Specific namespace
MB_LOG_LEVEL_<namespace>=DEBUG
# Example:
MB_LOG_LEVEL_metabase.driver.postgres=DEBUG
Health Checks
# API health check
curl http://localhost:3000/api/health
# Database health
curl http://localhost:3000/api/health/database
# Full diagnostics
curl http://localhost:3000/api/util/diagnostic-info
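The health endpoint is easy to wire into cron as a liveness probe (a sketch; assumes outbound mail is configured on the host):
# */5 * * * * /usr/local/bin/metabase-probe.sh
if ! curl -sf --max-time 10 http://localhost:3000/api/health | grep -q '"ok"'; then
  echo "Metabase health check failed at $(date)" \
    | mail -s "Metabase DOWN" ops@company.com
fi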
High Availability (HA) Architecture
Load Balancer Configuration
# nginx.conf
upstream metabase_backend {
    least_conn;
    server metabase1:3000 max_fails=3 fail_timeout=30s;
    server metabase2:3000 max_fails=3 fail_timeout=30s;
    server metabase3:3000 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    server_name analytics.company.com;

    location / {
        proxy_pass http://metabase_backend;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Health check
        proxy_next_upstream error timeout http_500 http_502 http_503;
        proxy_connect_timeout 5s;
    }

    # Health check endpoint
    location /health {
        access_log off;
        proxy_pass http://metabase_backend/api/health;
    }
}
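After editing, validate the configuration and reload without dropping in-flight connections:
# Syntax-check the config, then hot-reload
nginx -t && nginx -s reload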
Docker Compose (Multi-instance)
# docker-compose.yml
version: '3.8'

services:
  metabase1:
    image: metabase/metabase:latest
    environment:
      MB_DB_TYPE: postgres
      MB_DB_HOST: postgres
      # ... other env vars
    depends_on:
      - postgres
    restart: always

  metabase2:
    image: metabase/metabase:latest
    environment:
      MB_DB_TYPE: postgres
      MB_DB_HOST: postgres
    depends_on:
      - postgres
    restart: always

  metabase3:
    image: metabase/metabase:latest
    environment:
      MB_DB_TYPE: postgres
      MB_DB_HOST: postgres
    depends_on:
      - postgres
    restart: always

  nginx:
    image: nginx:latest
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    ports:
      - "80:80"
      - "443:443"
    depends_on:
      - metabase1
      - metabase2
      - metabase3

  postgres:
    image: postgres:15
    environment:
      POSTGRES_DB: metabase
      POSTGRES_USER: metabase
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    volumes:
      - postgres-data:/var/lib/postgresql/data
    restart: always

volumes:
  postgres-data:
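One detail the compose file glosses over: all three instances must point at the same application database, and if connection details are encrypted, the key must be identical everywhere. A sketch of the environment block each metabase service needs (values are placeholders):
# Shared settings for every metabase instance
MB_DB_TYPE=postgres
MB_DB_HOST=postgres
MB_DB_DBNAME=metabase
MB_DB_USER=metabase
MB_DB_PASS=${DB_PASSWORD}
# Must be the same on every instance if set at all
MB_ENCRYPTION_SECRET_KEY=${MB_ENCRYPTION_SECRET_KEY}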
Hands-on Exercises
Exercise 1: Backup Automation (Beginner)
Task: build a daily automated backup system
Solution:
#!/bin/bash
# /usr/local/bin/metabase-backup.sh
set -e

# Configuration
BACKUP_DIR="/backups/metabase"
DATE=$(date +%Y%m%d_%H%M%S)
DB_NAME="metabase"
DB_USER="metabase"
DB_HOST="localhost"
RETENTION_DAYS=30
S3_BUCKET="s3://company-backups/metabase"
ALERT_EMAIL="ops@company.com"

# Create backup directory
mkdir -p "$BACKUP_DIR"

# Backup database. Note: under `set -e` a failed pg_dump aborts the script
# before a `$?` check could ever run, so test the command directly.
echo "[$(date)] Starting database backup..."
if pg_dump -h "$DB_HOST" -U "$DB_USER" -F c -b -v \
    -f "$BACKUP_DIR/metabase_db_$DATE.backup" "$DB_NAME"; then
  echo "[$(date)] Database backup successful"
else
  echo "[$(date)] Database backup FAILED" | mail -s "Backup Failed" "$ALERT_EMAIL"
  exit 1
fi

# Backup files
echo "[$(date)] Starting file backup..."
tar -czf "$BACKUP_DIR/metabase_files_$DATE.tar.gz" \
  /var/lib/metabase/plugins \
  /var/lib/metabase/uploads \
  /etc/metabase/.env

# Compress database backup
gzip "$BACKUP_DIR/metabase_db_$DATE.backup"

# Upload to S3
echo "[$(date)] Uploading to S3..."
aws s3 cp "$BACKUP_DIR/metabase_db_$DATE.backup.gz" "$S3_BUCKET/"
aws s3 cp "$BACKUP_DIR/metabase_files_$DATE.tar.gz" "$S3_BUCKET/"

# Delete local backups older than retention period
find "$BACKUP_DIR" -name "metabase_*.gz" -mtime +$RETENTION_DAYS -delete

# Delete S3 backups older than 90 days
# (an S3 lifecycle rule is more robust than parsing `aws s3 ls` output)
aws s3 ls "$S3_BUCKET/" | while read -r line; do
  createDate=$(echo "$line" | awk '{print $1" "$2}')
  createDate=$(date -d "$createDate" +%s)
  olderThan=$(date -d "90 days ago" +%s)
  if [[ $createDate -lt $olderThan ]]; then
    fileName=$(echo "$line" | awk '{print $4}')
    if [[ $fileName != "" ]]; then
      aws s3 rm "$S3_BUCKET/$fileName"
    fi
  fi
done

# Success notification
echo "[$(date)] Backup completed successfully" | \
  mail -s "Metabase Backup Success" "$ALERT_EMAIL"

# Log
echo "[$(date)] Backup completed: DB=$BACKUP_DIR/metabase_db_$DATE.backup.gz, Files=$BACKUP_DIR/metabase_files_$DATE.tar.gz" \
  >> /var/log/metabase-backup.log
Crontab:
# Daily backup at 2 AM
0 2 * * * /usr/local/bin/metabase-backup.sh >> /var/log/metabase-backup.log 2>&1
# Weekly verification (Sunday 3 AM)
0 3 * * 0 /usr/local/bin/metabase-backup-verify.sh
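The verification script referenced in the crontab is not shown in the exercise; a minimal sketch, assuming a scratch PostgreSQL database can be created on the same host (metabase_verify is a hypothetical name):
#!/bin/bash
# /usr/local/bin/metabase-backup-verify.sh -- hypothetical sketch
set -e
LATEST=$(ls -t /backups/metabase/metabase_db_*.backup.gz | head -1)
gunzip -c "$LATEST" > /tmp/verify.backup
# Restore into a throwaway database
dropdb --if-exists metabase_verify
createdb metabase_verify
pg_restore -d metabase_verify /tmp/verify.backup
# Sanity check: the restored app DB should contain saved questions
COUNT=$(psql -tA -d metabase_verify -c "SELECT COUNT(*) FROM report_card;")
[ "$COUNT" -gt 0 ] && echo "Verify OK ($COUNT cards)" || exit 1
rm -f /tmp/verify.backup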
Exercise 2: Monitoring Dashboard (Intermediate)
Task: build a Metabase performance monitoring dashboard
Solution:
-- Dashboard: "Metabase Operations"
-- Note: some table names (e.g. activity_log) vary across Metabase versions

-- Question 1: System Health
SELECT
  'Metabase' as service,
  CASE
    WHEN (SELECT COUNT(*) FROM query_execution
          WHERE started_at >= NOW() - INTERVAL '5 minutes'
            AND error IS NOT NULL) > 10
      THEN 'Critical'
    WHEN (SELECT AVG(running_time) FROM query_execution
          WHERE started_at >= NOW() - INTERVAL '15 minutes') > 5000
      THEN 'Warning'
    ELSE 'Healthy'
  END as status,
  (SELECT COUNT(DISTINCT user_id) FROM activity_log
   WHERE timestamp >= NOW() - INTERVAL '15 minutes') as active_users,
  (SELECT COUNT(*) FROM query_execution
   WHERE started_at >= NOW() - INTERVAL '15 minutes') as queries_15min

-- Question 2: Query Performance (Last 24h)
SELECT
  DATE_TRUNC('hour', started_at) as hour,
  COUNT(*) as query_count,
  AVG(running_time) as avg_duration_ms,
  MAX(running_time) as max_duration_ms,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY running_time) as p95_duration_ms,
  COUNT(CASE WHEN error IS NOT NULL THEN 1 END) as error_count
FROM query_execution
WHERE started_at >= NOW() - INTERVAL '24 hours'
GROUP BY DATE_TRUNC('hour', started_at)
ORDER BY hour DESC

-- Question 3: Slow Queries (Last 1h)
SELECT
  qe.id,
  qe.started_at,
  qe.running_time / 1000.0 as duration_seconds,
  c.name as question_name,
  u.email as user_email,
  qe.result_rows,
  CASE
    WHEN qe.error IS NOT NULL THEN 'Error'
    WHEN qe.running_time > 30000 THEN 'Very Slow'
    WHEN qe.running_time > 10000 THEN 'Slow'
    ELSE 'Normal'
  END as performance
FROM query_execution qe
LEFT JOIN report_card c ON qe.card_id = c.id
LEFT JOIN core_user u ON qe.executor_id = u.id
WHERE qe.started_at >= NOW() - INTERVAL '1 hour'
  AND qe.running_time > 5000
ORDER BY qe.running_time DESC
LIMIT 20

-- Question 4: Most Active Users
SELECT
  u.email,
  COUNT(DISTINCT DATE(al.timestamp)) as active_days,
  COUNT(CASE WHEN al.model = 'card' AND al.model_id IS NOT NULL THEN 1 END) as questions_run,
  COUNT(CASE WHEN al.model = 'dashboard' THEN 1 END) as dashboards_viewed
FROM core_user u
JOIN activity_log al ON u.id = al.user_id
WHERE al.timestamp >= NOW() - INTERVAL '30 days'
GROUP BY u.email
ORDER BY active_days DESC, questions_run DESC
LIMIT 20

-- Question 5: Database Connection Pool
SELECT
  datname as database,
  count(*) as connections,
  count(*) FILTER (WHERE state = 'active') as active,
  count(*) FILTER (WHERE state = 'idle') as idle,
  max(now() - state_change) as max_idle_time
FROM pg_stat_activity
WHERE datname IS NOT NULL
GROUP BY datname

-- Question 6: Cache Hit Rate
SELECT
  DATE(started_at) as date,
  COUNT(*) as total_queries,
  COUNT(CASE WHEN cache_hit = true THEN 1 END) as cache_hits,
  (COUNT(CASE WHEN cache_hit = true THEN 1 END) * 100.0 / COUNT(*)) as cache_hit_rate
FROM query_execution
WHERE started_at >= NOW() - INTERVAL '30 days'
GROUP BY DATE(started_at)
ORDER BY date DESC

-- Alert configuration
-- Alert 1: High Error Rate
--   Condition: error_count > 10 in last 15 minutes
--   Recipients: ops-team@company.com
-- Alert 2: Slow Average Performance
--   Condition: avg_duration_ms > 5000 in last 30 minutes
--   Recipients: analytics-team@company.com
Next Steps
You have now covered operations and monitoring. The final post in the series brings:
- A hands-on project: building an e-commerce analytics dashboard
- A complete roundup of best practices
- Resources and community
Summary
Operations Checklist
Backup:
- Automated daily backups
- Off-site backups (S3/GCS)
- Regular restore tests
- Backup monitoring
Monitoring:
- System metrics collection
- Application logs
- Performance trend tracking
- Alert configuration
High availability:
- Load balancer
- Multiple instances
- Database replication
- Health checks
Troubleshooting:
- Log analysis tools
- Debugging process
- Runbooks
- On-call process
In the next post, we put everything we have learned to work in a hands-on project!
📚 Full Series Table of Contents
🚀 Basics (Parts 1-5)
💪 Practical (Parts 6-10)
🎯 Advanced (Parts 11-16)
This article is licensed under CC BY 4.0.