
[Mastering Metabase, Better Late Than Never #15] Operations and Monitoring - Backup and Recovery

Learning Objectives

By the end of this post, you will be able to:

  • Establish and run a regular backup strategy
  • Put together a disaster recovery plan
  • Monitor the system and track performance
  • Troubleshoot common problems
  • Set up a high-availability architecture
  • Analyze logs and debug issues

Backup Strategy

What to Back Up

graph TD
    A[Metabase Backup] --> B[Application DB]
    A --> C[Uploaded Files]
    A --> D[Settings]

    B --> B1[Questions]
    B --> B2[Dashboards]
    B --> B3[Users/Permissions]
    B --> B4[Collections]

    C --> C1[Logos]
    C --> C2[Custom Assets]

    D --> D1[Environment Variables]
    D --> D2[Database Connections]

Backing Up the Application Database

PostgreSQL:

#!/bin/bash
# backup-metabase.sh

# Variables
BACKUP_DIR="/backups/metabase"
DATE=$(date +%Y%m%d_%H%M%S)
DB_NAME="metabase"
DB_USER="metabase"
DB_HOST="localhost"
RETENTION_DAYS=30

# Create backup directory
mkdir -p $BACKUP_DIR

# Backup database
pg_dump -h $DB_HOST -U $DB_USER -F c -b -v -f "$BACKUP_DIR/metabase_$DATE.backup" $DB_NAME

# Compress
gzip "$BACKUP_DIR/metabase_$DATE.backup"

# Delete old backups
find $BACKUP_DIR -name "metabase_*.backup.gz" -mtime +$RETENTION_DAYS -delete

# Log
echo "[$(date)] Backup completed: metabase_$DATE.backup.gz" >> /var/log/metabase-backup.log

Cron setup (daily at 2 AM):

0 2 * * * /path/to/backup-metabase.sh

MySQL:

#!/bin/bash
# backup-metabase-mysql.sh

DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/backups/metabase"

mysqldump \
  --host=localhost \
  --user=metabase \
  --password=your_password \
  --single-transaction \
  --quick \
  --lock-tables=false \
  metabase | gzip > "$BACKUP_DIR/metabase_$DATE.sql.gz"
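
Note: passing --password on the command line makes the credential visible to other users via ps; for unattended backups, prefer a MySQL option file (e.g. ~/.my.cnf or --defaults-extra-file).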

File System Backup

#!/bin/bash
# backup-metabase-files.sh

DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/backups/metabase"
METABASE_DATA="/var/lib/metabase"

# Backup uploaded files and assets
tar -czf "$BACKUP_DIR/metabase_files_$DATE.tar.gz" \
  $METABASE_DATA/plugins \
  $METABASE_DATA/uploads \
  /path/to/custom-assets

# Backup environment configuration
cp /path/to/.env "$BACKUP_DIR/env_$DATE.backup"

Cloud Storage Backup

AWS S3:

#!/bin/bash
# sync to S3

aws s3 sync /backups/metabase/ s3://company-backups/metabase/ \
  --storage-class GLACIER \
  --exclude "*" \
  --include "*.backup.gz" \
  --include "*.tar.gz"

Google Cloud Storage:

gsutil -m rsync -r /backups/metabase/ gs://company-backups/metabase/

Automated Backup Monitoring

#!/bin/bash
# backup-monitor.sh

BACKUP_DIR="/backups/metabase"
ALERT_EMAIL="ops@company.com"
MAX_AGE_HOURS=26  # Alert if backup older than 26 hours

# Find latest backup
LATEST_BACKUP=$(find $BACKUP_DIR -name "metabase_*.backup.gz" -type f -printf '%T@ %p\n' | sort -rn | head -1 | cut -d' ' -f2)

if [ -z "$LATEST_BACKUP" ]; then
  echo "No backup found!" | mail -s "Metabase Backup FAILED" $ALERT_EMAIL
  exit 1
fi

# Check age
BACKUP_AGE=$(( ($(date +%s) - $(stat -c %Y "$LATEST_BACKUP")) / 3600 ))

if [ $BACKUP_AGE -gt $MAX_AGE_HOURS ]; then
  echo "Latest backup is $BACKUP_AGE hours old (threshold: $MAX_AGE_HOURS)" \
    | mail -s "Metabase Backup WARNING" $ALERT_EMAIL
fi

Recovery Procedures

Full Recovery

Step 1: Stop the service

docker-compose down
# or
systemctl stop metabase

Step 2: Restore the database

# PostgreSQL
gunzip metabase_20250101.backup.gz
pg_restore -h localhost -U metabase -d metabase -c metabase_20250101.backup

# MySQL
gunzip < metabase_20250101.sql.gz | mysql -u metabase -p metabase

Step 3: Restore files

tar -xzf metabase_files_20250101.tar.gz -C /
cp env_20250101.backup /path/to/.env

Step 4: Start the service

docker-compose up -d
# or
systemctl start metabase

Step 5: Verify

# Health check
curl http://localhost:3000/api/health

# Login test
# Dashboard test
# Query execution test
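
The login and query checks can be scripted against Metabase's REST API (POST /api/session to log in, POST /api/card/:id/query to run a saved question). A minimal smoke-test sketch; the credentials and card id are placeholders:

#!/bin/bash
# post-restore-smoke-test.sh (sketch; credentials and card id are placeholders)
BASE="http://localhost:3000"

# Health check
curl -fsS "$BASE/api/health" || exit 1

# Login test: POST /api/session returns {"id":"<session-token>", ...}
# (crude JSON parsing via sed; use jq if it is available)
SESSION=$(curl -fsS -X POST "$BASE/api/session" \
  -H "Content-Type: application/json" \
  -d '{"username":"admin@company.com","password":"REDACTED"}' |
  sed -n 's/.*"id":"\([^"]*\)".*/\1/p')

if [ -z "$SESSION" ]; then
  echo "Login test FAILED"
  exit 1
fi
echo "Login test OK"

# Query execution test: run saved question #1 through the API
curl -fsS -X POST "$BASE/api/card/1/query" \
  -H "X-Metabase-Session: $SESSION" > /dev/null &&
  echo "Query test OK" || echo "Query test FAILED"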

Partial Recovery (Specific Objects)

Restoring a single Question:

-- From backup database
SELECT * FROM report_card WHERE id = 123;

-- Insert into production
INSERT INTO report_card (...)
VALUES (...);
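
The column list is elided above because report_card is wide. A practical pattern is to restore the dump into a scratch database and copy the row across with \copy; a minimal sketch, reusing the PostgreSQL dump from the backup section (card id and scratch database name are illustrative):

#!/bin/bash
# restore-single-card.sh - copy one Question (report_card row) from a backup
CARD_ID=123

# 1. Restore the dump into a scratch database
createdb -U metabase metabase_restore
pg_restore -U metabase -d metabase_restore metabase_20250101.backup

# 2. Export just the row we need
psql -U metabase -d metabase_restore -c \
  "\copy (SELECT * FROM report_card WHERE id = $CARD_ID) TO '/tmp/card.csv' CSV"

# 3. Load it into production; the insert fails if the id already exists,
#    which is the safe behavior for a restore
psql -U metabase -d metabase -c \
  "\copy report_card FROM '/tmp/card.csv' CSV"

# 4. Clean up
dropdb -U metabase metabase_restore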

Monitoring

System Metrics

Prometheus + Grafana:

# prometheus.yml
# Note: /api/health returns JSON, not Prometheus exposition format.
# Metabase serves Prometheus metrics on a separate port when
# MB_PROMETHEUS_SERVER_PORT is set (9191 below).
scrape_configs:
  - job_name: 'metabase'
    static_configs:
      - targets: ['localhost:9191']
    metrics_path: '/metrics'
    scrape_interval: 30s
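
To turn the metrics endpoint on, set the port when starting Metabase (a minimal sketch; 9191 follows the example in the Metabase monitoring docs):

docker run -d --name metabase \
  -p 3000:3000 -p 9191:9191 \
  -e MB_PROMETHEUS_SERVER_PORT=9191 \
  metabase/metabase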

Key metrics:

System:
  - CPU usage
  - Memory usage
  - Disk I/O
  - Network traffic

Application:
  - Active users
  - Query execution time
  - Dashboard load time
  - Error rate
  - Cache hit rate

Database:
  - Connection pool usage
  - Query latency
  - Slow queries (> 5s)
  - Database size

Application Logs

Setting the log level:

# Log levels (DEBUG, INFO, WARN, ERROR) are configured through log4j2
# rather than a single env var -- see the Debug Mode section below

# Docker
docker logs -f metabase
docker logs --since 1h metabase | grep ERROR

Log analysis:

# Slow queries
grep "Query took" /var/log/metabase/metabase.log | awk '$4 > 5000'

# Errors
grep ERROR /var/log/metabase/metabase.log | tail -100

# User activity
grep "GET /api/card" /var/log/metabase/metabase.log | wc -l

Alert Configuration

Grafana Alerts:

alerts:
  - name: Metabase High CPU
    condition: cpu_usage > 80
    duration: 5m
    notification: ops-team

  - name: Metabase High Memory
    condition: memory_usage > 90
    duration: 5m
    notification: ops-team

  - name: Metabase Slow Queries
    condition: avg_query_time > 10s
    duration: 10m
    notification: analytics-team

  - name: Metabase Error Rate
    condition: error_rate > 5%
    duration: 5m
    notification: ops-team

Performance Monitoring

Query Performance

Slow Query Log:

-- Metabase internal metrics
SELECT
  qe.id,
  qe.started_at,
  qe.running_time as duration_ms,
  qe.result_rows,
  c.name as card_name,
  u.email as user_email
FROM query_execution qe
JOIN report_card c ON qe.card_id = c.id
JOIN core_user u ON qe.executor_id = u.id
WHERE qe.running_time > 5000  -- 5 seconds
  AND qe.started_at >= NOW() - INTERVAL '24 hours'
ORDER BY qe.running_time DESC
LIMIT 50;

Performance trends:

-- Daily average query time
SELECT
  DATE(started_at) as date,
  COUNT(*) as query_count,
  AVG(running_time) as avg_duration_ms,
  MAX(running_time) as max_duration_ms,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY running_time) as p95_duration_ms
FROM query_execution
WHERE started_at >= NOW() - INTERVAL '30 days'
GROUP BY DATE(started_at)
ORDER BY date DESC;

Dashboard Usage

-- Most viewed dashboards
-- (in the Metabase app DB, dashboards live in report_dashboard and views
--  in view_log; names may vary slightly across versions)
SELECT
  d.id,
  d.name,
  COUNT(vl.id) as view_count,
  COUNT(DISTINCT vl.user_id) as unique_users
FROM report_dashboard d
LEFT JOIN view_log vl
  ON vl.model = 'dashboard' AND vl.model_id = d.id
WHERE vl.timestamp >= NOW() - INTERVAL '30 days'
GROUP BY d.id, d.name
ORDER BY view_count DESC
LIMIT 20;

User Activity

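-- NOTE: table names vary by Metabase version; user activity may live in
-- activity (older) or audit_log (newer) rather than activity_log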
-- Active users trend
SELECT
  DATE(timestamp) as date,
  COUNT(DISTINCT user_id) as active_users,
  COUNT(*) as total_actions
FROM activity_log
WHERE timestamp >= NOW() - INTERVAL '90 days'
GROUP BY DATE(timestamp)
ORDER BY date DESC;

Troubleshooting

Common Issues

1. Metabase won't start

# Check the logs
docker logs metabase

# Common issues:
# - Port already in use
sudo lsof -i :3000

# - Database connection failed
# Check MB_DB_* environment variables

# - Out of memory
docker stats metabase

# - Permissions
ls -la /metabase-data

2. Slow queries

Diagnosis:
  1. Check the query execution log
  2. Run EXPLAIN ANALYZE on the query
  3. Check database indexes
  4. Check connection pool status

Solutions:
  - Add indexes
  - Optimize the query
  - Increase the cache TTL
  - Increase the connection pool size

3. Slow dashboard loading

Diagnosis:
  - Browser DevTools Network tab
  - Check load time per Question
  - Check the cache hit rate

Solutions:
  - Reduce the number of Questions
  - Enable caching
  - Use Models (pre-aggregation)
  - Lazy loading

4. Out of Memory

# Increase JVM heap size (JAR deployment; pass the flag to the JVM directly)
java -Xmx4g -jar metabase.jar

# Docker
docker run -e "JAVA_OPTS=-Xmx4g" metabase/metabase

5. Database Connection Pool Exhausted

# Increase the pool sizes (documented Metabase env vars; defaults are 15)
MB_APPLICATION_DB_MAX_CONNECTION_POOL_SIZE=30
MB_JDBC_DATA_WAREHOUSE_MAX_CONNECTION_POOL_SIZE=30

# Check current connections
SELECT COUNT(*) FROM pg_stat_activity
WHERE datname = 'your_db';

Debug Mode

# Metabase logging goes through log4j2; to raise verbosity, point the
# JVM at a custom log4j2.xml configuration
java -Dlog4j.configurationFile=file:/path/to/log4j2.xml -jar metabase.jar

# Inside log4j2.xml, a logger entry for a specific namespace:
# <Logger name="metabase.driver.postgres" level="DEBUG"/>

Health Checks

# API health check (documented endpoint; returns {"status":"ok"} when up)
curl http://localhost:3000/api/health

# Database connectivity and full diagnostic info are available in-app
# under Admin > Troubleshooting

High Availability (HA) Architecture

Load Balancer Configuration

# nginx.conf
upstream metabase_backend {
    least_conn;
    server metabase1:3000 max_fails=3 fail_timeout=30s;
    server metabase2:3000 max_fails=3 fail_timeout=30s;
    server metabase3:3000 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    server_name analytics.company.com;

    location / {
        proxy_pass http://metabase_backend;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Health check
        proxy_next_upstream error timeout http_500 http_502 http_503;
        proxy_connect_timeout 5s;
    }

    # Health check endpoint
    location /health {
        access_log off;
        proxy_pass http://metabase_backend/api/health;
    }
}

Docker Compose (Multi-instance)

# docker-compose.yml
version: '3.8'

services:
  metabase1:
    image: metabase/metabase:latest
    environment:
      MB_DB_TYPE: postgres
      MB_DB_HOST: postgres
      # ... other env vars
    depends_on:
      - postgres
    restart: always

  metabase2:
    image: metabase/metabase:latest
    environment:
      MB_DB_TYPE: postgres
      MB_DB_HOST: postgres
    depends_on:
      - postgres
    restart: always

  metabase3:
    image: metabase/metabase:latest
    environment:
      MB_DB_TYPE: postgres
      MB_DB_HOST: postgres
    depends_on:
      - postgres
    restart: always

  nginx:
    image: nginx:latest
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    ports:
      - "80:80"
      - "443:443"
    depends_on:
      - metabase1
      - metabase2
      - metabase3

  postgres:
    image: postgres:15
    environment:
      POSTGRES_DB: metabase
      POSTGRES_USER: metabase
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    volumes:
      - postgres-data:/var/lib/postgresql/data
    restart: always

volumes:
  postgres-data:

Hands-On Exercises

Exercise 1: Backup Automation (Beginner)

Task: build an automated daily backup system

Solution:
#!/bin/bash
# /usr/local/bin/metabase-backup.sh

set -e

# Configuration
BACKUP_DIR="/backups/metabase"
DATE=$(date +%Y%m%d_%H%M%S)
DB_NAME="metabase"
DB_USER="metabase"
DB_HOST="localhost"
RETENTION_DAYS=30
S3_BUCKET="s3://company-backups/metabase"
ALERT_EMAIL="ops@company.com"

# Create backup directory
mkdir -p $BACKUP_DIR

# Backup database
echo "[$(date)] Starting database backup..."
# Test the command directly: with `set -e` the script would exit before
# a separate `$?` check could ever run
if ! pg_dump -h $DB_HOST -U $DB_USER -F c -b -v \
  -f "$BACKUP_DIR/metabase_db_$DATE.backup" $DB_NAME; then
  echo "[$(date)] Database backup FAILED" | mail -s "Backup Failed" $ALERT_EMAIL
  exit 1
fi
echo "[$(date)] Database backup successful"

# Backup files
echo "[$(date)] Starting file backup..."
tar -czf "$BACKUP_DIR/metabase_files_$DATE.tar.gz" \
  /var/lib/metabase/plugins \
  /var/lib/metabase/uploads \
  /etc/metabase/.env

# Compress database backup
gzip "$BACKUP_DIR/metabase_db_$DATE.backup"

# Upload to S3
echo "[$(date)] Uploading to S3..."
aws s3 cp "$BACKUP_DIR/metabase_db_$DATE.backup.gz" $S3_BUCKET/
aws s3 cp "$BACKUP_DIR/metabase_files_$DATE.tar.gz" $S3_BUCKET/

# Delete local backups older than retention period
find $BACKUP_DIR -name "metabase_*.gz" -mtime +$RETENTION_DAYS -delete

# Delete S3 backups older than 90 days
aws s3 ls $S3_BUCKET/ | while read -r line; do
  createDate=$(echo $line | awk '{print $1" "$2}')
  createDate=$(date -d "$createDate" +%s)
  olderThan=$(date -d "90 days ago" +%s)
  if [[ $createDate -lt $olderThan ]]; then
    fileName=$(echo $line | awk '{print $4}')
    if [[ $fileName != "" ]]; then
      aws s3 rm $S3_BUCKET/$fileName
    fi
  fi
done

# Success notification
echo "[$(date)] Backup completed successfully" | \
  mail -s "Metabase Backup Success" $ALERT_EMAIL

# Log
echo "[$(date)] Backup completed: DB=$BACKUP_DIR/metabase_db_$DATE.backup.gz, Files=$BACKUP_DIR/metabase_files_$DATE.tar.gz" \
  >> /var/log/metabase-backup.log
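
The manual ls-and-delete loop above works, but S3 can expire old objects natively; a one-time bucket lifecycle rule (bucket name and prefix as in the script) replaces it:

# Expire objects under metabase/ after 90 days
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "expire-metabase-backups",
      "Status": "Enabled",
      "Filter": { "Prefix": "metabase/" },
      "Expiration": { "Days": 90 }
    }
  ]
}
EOF

aws s3api put-bucket-lifecycle-configuration \
  --bucket company-backups \
  --lifecycle-configuration file://lifecycle.json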

Crontab:

# Daily backup at 2 AM
0 2 * * * /usr/local/bin/metabase-backup.sh >> /var/log/metabase-backup.log 2>&1

# Weekly verification (Sunday 3 AM)
0 3 * * 0 /usr/local/bin/metabase-backup-verify.sh
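
The verification script referenced in the crontab is not shown above; a minimal sketch of what it could look like, restoring the newest dump into a scratch database and sanity-checking it (script name, paths, and the report_card check are assumptions):

#!/bin/bash
# /usr/local/bin/metabase-backup-verify.sh (sketch)

BACKUP_DIR="/backups/metabase"
ALERT_EMAIL="ops@company.com"
SCRATCH_DB="metabase_verify"

# Newest database dump
LATEST=$(ls -t "$BACKUP_DIR"/metabase_db_*.backup.gz 2>/dev/null | head -1)
if [ -z "$LATEST" ]; then
  echo "No dump found" | mail -s "Metabase Restore Test FAILED" "$ALERT_EMAIL"
  exit 1
fi

# Restore into a throwaway database (-k keeps the .gz around)
gunzip -kf "$LATEST"
dropdb --if-exists -U metabase "$SCRATCH_DB"
createdb -U metabase "$SCRATCH_DB"

# The restore must succeed and saved questions must be queryable
if pg_restore -U metabase -d "$SCRATCH_DB" "${LATEST%.gz}" &&
   psql -U metabase -d "$SCRATCH_DB" -tAc "SELECT COUNT(*) FROM report_card;" | grep -q '[0-9]'; then
  echo "[$(date)] Restore test OK: $LATEST" >> /var/log/metabase-backup.log
else
  echo "Restore test FAILED for $LATEST" | mail -s "Metabase Restore Test FAILED" "$ALERT_EMAIL"
fi

# Clean up
dropdb --if-exists -U metabase "$SCRATCH_DB"
rm -f "${LATEST%.gz}"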

Exercise 2: Monitoring Dashboard (Intermediate)

Task: build a performance monitoring dashboard for Metabase itself

Solution:
-- Dashboard: "Metabase Operations"

-- Question 1: System Health
SELECT
  'Metabase' as service,
  CASE
    WHEN (SELECT COUNT(*) FROM query_execution WHERE started_at >= NOW() - INTERVAL '5 minutes' AND error IS NOT NULL) > 10
      THEN 'Critical'
    WHEN (SELECT AVG(running_time) FROM query_execution WHERE started_at >= NOW() - INTERVAL '15 minutes') > 5000
      THEN 'Warning'
    ELSE 'Healthy'
  END as status,
  (SELECT COUNT(DISTINCT user_id) FROM activity_log WHERE timestamp >= NOW() - INTERVAL '15 minutes') as active_users,
  (SELECT COUNT(*) FROM query_execution WHERE started_at >= NOW() - INTERVAL '15 minutes') as queries_15min

-- Question 2: Query Performance (Last 24h)
SELECT
  DATE_TRUNC('hour', started_at) as hour,
  COUNT(*) as query_count,
  AVG(running_time) as avg_duration_ms,
  MAX(running_time) as max_duration_ms,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY running_time) as p95_duration_ms,
  COUNT(CASE WHEN error IS NOT NULL THEN 1 END) as error_count
FROM query_execution
WHERE started_at >= NOW() - INTERVAL '24 hours'
GROUP BY DATE_TRUNC('hour', started_at)
ORDER BY hour DESC

-- Question 3: Slow Queries (Last 1h)
SELECT
  qe.id,
  qe.started_at,
  qe.running_time / 1000.0 as duration_seconds,
  c.name as question_name,
  u.email as user_email,
  qe.result_rows,
  CASE
    WHEN qe.error IS NOT NULL THEN 'Error'
    WHEN qe.running_time > 30000 THEN 'Very Slow'
    WHEN qe.running_time > 10000 THEN 'Slow'
    ELSE 'Normal'
  END as performance
FROM query_execution qe
LEFT JOIN report_card c ON qe.card_id = c.id
LEFT JOIN core_user u ON qe.executor_id = u.id
WHERE qe.started_at >= NOW() - INTERVAL '1 hour'
  AND qe.running_time > 5000
ORDER BY qe.running_time DESC
LIMIT 20

-- Question 4: Most Active Users
SELECT
  u.email,
  COUNT(DISTINCT DATE(al.timestamp)) as active_days,
  COUNT(CASE WHEN al.model = 'card' AND al.model_id IS NOT NULL THEN 1 END) as questions_run,
  COUNT(CASE WHEN al.model = 'dashboard' THEN 1 END) as dashboards_viewed
FROM core_user u
JOIN activity_log al ON u.id = al.user_id
WHERE al.timestamp >= NOW() - INTERVAL '30 days'
GROUP BY u.email
ORDER BY active_days DESC, questions_run DESC
LIMIT 20

-- Question 5: Database Connection Pool
SELECT
  datname as database,
  count(*) as connections,
  count(*) FILTER (WHERE state = 'active') as active,
  count(*) FILTER (WHERE state = 'idle') as idle,
  max(now() - state_change) as max_idle_time
FROM pg_stat_activity
WHERE datname IS NOT NULL
GROUP BY datname

-- Question 6: Cache Hit Rate
SELECT
  DATE(started_at) as date,
  COUNT(*) as total_queries,
  COUNT(CASE WHEN cache_hit = true THEN 1 END) as cache_hits,
  (COUNT(CASE WHEN cache_hit = true THEN 1 END) * 100.0 / COUNT(*)) as cache_hit_rate
FROM query_execution
WHERE started_at >= NOW() - INTERVAL '30 days'
GROUP BY DATE(started_at)
ORDER BY date DESC

-- Alert configuration
-- Alert 1: High Error Rate
-- Condition: error_count > 10 in last 15 minutes
-- Recipients: ops-team@company.com

-- Alert 2: Slow Average Performance
-- Condition: avg_duration_ms > 5000 in last 30 minutes
-- Recipients: analytics-team@company.com

Next Steps

You've now mastered operations and monitoring. In the final post, we'll cover:

  • Real-world project: building an e-commerce analytics dashboard
  • A full recap of best practices
  • Resources and community

Summary

Operations Checklist

Backups:

  • Automated daily backups
  • Offsite backups (S3/GCS)
  • Regular restore testing
  • Backup monitoring

Monitoring:

  • Collect system metrics
  • Application logs
  • Track performance trends
  • Configure alerts

High availability:

  • Load balancer
  • Multiple instances
  • Database replication
  • Health checks

Troubleshooting:

  • Log analysis tools
  • A debugging process
  • Written runbooks
  • An on-call process

In the next post, we'll put everything we've learned to work building a real-world project!

📚 Full Series Table of Contents

🚀 Basics (Parts 1-5)

  1. Introduction to Metabase and Core Concepts
  2. Installation and Initial Setup
  3. A Tour of the Sample Data
  4. Creating Your First Chart - A Complete Hands-On Guide
  5. Building Dashboards - Bringing It All Together

💪 Practical Use (Parts 6-10)

  6. Filters and Parameters
  7. SQL Native Queries
  8. Data Modeling
  9. Automation and Alerts
  10. Permissions and Security

🎯 Advanced (Parts 11-16)

  11. Embedding and Sharing
  12. Performance Optimization
  13. Multiple Data Sources
  14. Customization
  15. Operations and Monitoring (this post)
  16. Real-World Project

This article is licensed under the copyright holder's CC BY 4.0 license.