[Python 100-Day Challenge] Day 54 - Advanced Web Scraping

Posted 2025/04/23


By YonYonWare

24 min read


while next_link: soup = fetch... → find next_link → 1,000 pages collected automatically! 😊

Automatic pagination handling, dynamic-content scraping, robots.txt compliance… Handle even large-scale data collection ethically and efficiently!

(45-55 min to complete ⭐⭐⭐)

🎯 Today's Learning Objectives

Understand dynamic-content scraping
Handle pagination
Implement error handling and retry logic
Understand scraping ethics and robots rules

📚 Prerequisites

Day 53: BeautifulSoup Basics - HTML parsing and CSS selectors
Day 51: requests Library Basics - HTTP request fundamentals
File I/O and exception handling from Phase 5

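🎯 Learning Objective 1: Understand Dynamic-Content Scraping

1.1 What Is Dynamic Content?

Static content vs. dynamic content:

Aspect	Static content	Dynamic content
Rendering	HTML fully built on the server	Rendered in the browser by JavaScript
requests alone	✅ Works	❌ Usually not enough (the response is an HTML shell without the data)
Solution	BeautifulSoup is sufficient	Call the underlying API directly, or use a browser automation tool such as Selenium
Examples	News sites, blogs	SPAs, infinite scroll

The problem: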

  
import requests
from bs4 import BeautifulSoup

# Try to scrape a site whose content is rendered by JavaScript
response = requests.get('https://example.com/dynamic-content')
soup = BeautifulSoup(response.text, 'html.parser')

# No data comes back!
items = soup.select('.product')
print(len(items))  # 0 (the products are loaded later by JavaScript)

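1.2 Analyzing AJAX Requests

Solution: use the browser's developer tools to find the real API

Step-by-step guide:

Open the developer tools (F12)
Select the Network tab
Apply the XHR or Fetch filter
Reload the page or scroll
Find the API request (look for a JSON response)

Worked example: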

  
import requests

# The actual API endpoint found in the browser
api_url = 'https://example.com/api/products'

# Query parameters (page, sort order, etc.)
params = {
    'page': 1,
    'limit': 20,
    'sort': 'price_asc'
}

# Call the API directly
response = requests.get(api_url, params=params)

# Parse the JSON body
data = response.json()

# Extract the data
for item in data['products']:
    print(f"{item['name']}: {item['price']} won")
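The loop above indexes `data['products']` directly, which raises a `KeyError` or `TypeError` whenever the API returns an error body or a different shape. A minimal defensive sketch follows; the helper name `iter_products` and the sample payload are assumptions made for illustration, not part of any real API.

```python
from typing import Any, Iterator


def iter_products(data: Any, key: str = "products") -> Iterator[dict]:
    """Yield only well-formed product dicts from a decoded JSON payload."""
    # An error response may be a list, a string, or a dict without the key.
    if not isinstance(data, dict):
        return
    for item in data.get(key) or []:
        # Skip entries that lack the fields the caller is going to print.
        if isinstance(item, dict) and "name" in item and "price" in item:
            yield item


# Hypothetical payload: one complete product and one malformed entry.
sample = {"products": [{"name": "Laptop", "price": 1200000}, {"name": "Broken"}]}
names = [p["name"] for p in iter_products(sample)]
print(names)
```

In the example above, `for item in iter_products(response.json()):` could stand in for the direct dictionary lookup.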

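1.3 Inspecting and Replicating Headers

Many APIs reject requests that lack the headers a browser would send, such as authentication or CSRF tokens.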

  
import requests

# Endpoint and parameters carried over from section 1.2
api_url = 'https://example.com/api/products'
params = {'page': 1, 'limit': 20}

# Headers copied from the browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'application/json',
    'Accept-Language': 'ko-KR,ko;q=0.9',
    'Referer': 'https://example.com',
    # If the API requires a key
    'Authorization': 'Bearer YOUR_TOKEN',
    # If the site uses a CSRF token
    'X-CSRF-Token': 'token-value'
}

response = requests.get(api_url, headers=headers, params=params)
data = response.json()

1.4 Fetching Data with POST Requests

Some APIs use POST requests.

  
import requests

api_url = 'https://example.com/api/search'

# POST body
payload = {
    'query': 'laptop',
    'category': 'electronics',
    'min_price': 500000,
    'max_price': 2000000
}

headers = {
    'User-Agent': 'Mozilla/5.0...',
    'Content-Type': 'application/json'
}

# POST request
response = requests.post(api_url, json=payload, headers=headers)

# Handle the response
if response.status_code == 200:
    results = response.json()
    print(f"{results['total']} products in total")

    for product in results['items']:
        print(f"- {product['name']}: {product['price']:,} won")
else:
    print(f"Error: {response.status_code}")

1.5 Worked Example: Infinite-Scroll Pages

  
import requests
import time

def scrape_infinite_scroll(base_api_url, max_pages=10):
    """Scrape the paginated API behind an infinite-scroll page."""
    all_products = []
    page = 1

    while page <= max_pages:
        print(f"Collecting page {page}...")

        # API request
        params = {
            'page': page,
            'limit': 20
        }

        try:
            response = requests.get(base_api_url, params=params, timeout=10)
            response.raise_for_status()

            data = response.json()

            # Stop when there is no data
            if not data.get('items'):
                print("No more data.")
                break

            # Collect the data
            all_products.extend(data['items'])

            # Check whether a next page exists
            if not data.get('has_next', False):
                break

            page += 1
            time.sleep(1)  # avoid overloading the server

        except Exception as e:
            print(f"Error: {e}")
            break

    return all_products

# Usage
# products = scrape_infinite_scroll('https://example.com/api/products')
# print(f"Collected {len(products)} products in total")

🎯 Learning Objective 2: Handle Pagination

2.1 Types of Pagination

1. Next/previous buttons:

  
<a href="/page/2" class="next">Next</a>

2. Page numbers:

  
<a href="/products?page=1">1</a>
<a href="/products?page=2">2</a>

3. Infinite scroll: content is loaded automatically by JavaScript

2.2 Following the Next-Page Link

  
import time
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def scrape_all_pages(start_url, max_pages=None):
    """Scrape every page by following the next-page link."""
    current_url = start_url
    all_data = []
    page_count = 0

    while current_url:
        # Stop once the page limit is reached (checked before fetching)
        if max_pages and page_count >= max_pages:
            break

        page_count += 1
        print(f"[Page {page_count}] {current_url}")

        try:
            response = requests.get(current_url, timeout=10)
            response.raise_for_status()

            soup = BeautifulSoup(response.text, 'html.parser')

            # Extract the data
            items = soup.select('.item')
            print(f"  → found {len(items)} items")

            for item in items:
                title = item.select_one('.title')
                if title:
                    all_data.append(title.get_text(strip=True))

            # Find the next-page link
            next_link = soup.select_one('a.next-page')

            if next_link and next_link.get('href'):
                # Convert a relative URL into an absolute one
                current_url = urljoin(current_url, next_link['href'])
            else:
                print("  → This is the last page.")
                current_url = None

            # Pause between pages to avoid overloading the server
            if current_url:
                time.sleep(1)

        except Exception as e:
            print(f"  ❌ Error: {e}")
            break

    return all_data

# Usage
# data = scrape_all_pages('https://example.com/page/1', max_pages=5)
# print(f"\nCollected {len(data)} items in total")
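When `max_pages` is left unset, a "next" link that points back to a page already visited would keep the loop running forever. The sketch below shows one way to resolve the link with `urljoin` and refuse URLs that were already seen; `resolve_next` and the `visited` set are hypothetical helpers, not part of the original function.

```python
from urllib.parse import urldefrag, urljoin


def resolve_next(current_url, href, visited):
    """Return the absolute next-page URL, or None if missing or already seen."""
    if not href:
        return None
    # urljoin handles absolute paths, relative paths, and query-only links;
    # urldefrag drops "#fragment" so the same page is not counted twice.
    absolute, _fragment = urldefrag(urljoin(current_url, href))
    if absolute in visited:
        return None
    visited.add(absolute)
    return absolute


visited = {"https://example.com/page/1"}
step1 = resolve_next("https://example.com/page/1", "/page/2", visited)
step2 = resolve_next("https://example.com/page/2", "2#top", visited)  # loops back
print(step1, step2)
```

Inside the scraping loop, the result can be assigned straight to `current_url`, so a `None` ends the `while` loop in the same way as a missing link.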

2.3 Scraping by Page Number

  
import requests
from bs4 import BeautifulSoup
import time

def scrape_by_page_numbers(base_url, start_page=1, max_pages=10):
    """Scrape by page number."""
    all_data = []

    for page in range(start_page, start_page + max_pages):
        # Adjust to the site's URL pattern
        # e.g. /products?page=1
        # e.g. /products/page/1
        url = f"{base_url}?page={page}"

        print(f"[Page {page}] {url}")

        try:
            response = requests.get(url, timeout=10)

            # Stop on a 404 or an empty page
            if response.status_code == 404:
                print("  → Page not found (404)")
                break

            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'html.parser')

            # Extract the data
            items = soup.select('.product')

            # Stop when there is no data
            if not items:
                print("  → No data (last page)")
                break

            print(f"  → {len(items)} products")

            for item in items:
                name = item.select_one('.name')
                price = item.select_one('.price')

                if name:
                    all_data.append({
                        'name': name.get_text(strip=True),
                        'price': price.get_text(strip=True) if price else 'N/A'
                    })

            time.sleep(1)  # wait 1 second

        except Exception as e:
            print(f"  ❌ Error: {e}")
            break

    return all_data

# Usage
# products = scrape_by_page_numbers('https://example.com/products', max_pages=5)
# for p in products:
#     print(f"{p['name']}: {p['price']}")
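Building the URL as `f"{base_url}?page={page}"` breaks as soon as `base_url` already carries a query string such as `?sort=price_asc`. As a hedged alternative, the sketch below rewrites only the page parameter using the standard library; `build_page_url` and its `param` argument are names assumed for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_page_url(base_url, page, param="page"):
    """Return base_url with its page parameter set to the given page number."""
    parts = urlsplit(base_url)
    # Keep every existing query parameter except an old page value.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    query.append((param, str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


plain = build_page_url("https://example.com/products", 3)
sorted_list = build_page_url("https://example.com/products?sort=price_asc&page=1", 2)
print(plain)
print(sorted_list)
```

For sites that put the page number in the path (the `/products/page/1` pattern mentioned in the comments), a different builder would be needed.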

2.4 Auto-Detecting Pagination

  
import requests
from bs4 import BeautifulSoup

def auto_detect_pagination(url):
    """Auto-detect the pagination pattern."""
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')

    # 1. Look for a "next" button
    # (:-soup-contains replaces the deprecated :contains; needs soupsieve 2.1+)
    next_patterns = [
        'a.next',
        'a.next-page',
        'a[rel="next"]',
        'a:-soup-contains("다음")',  # Korean for "Next"
        'a:-soup-contains("Next")'
    ]

    for pattern in next_patterns:
        next_btn = soup.select_one(pattern)
        if next_btn:
            return 'next-button', next_btn.get('href')

    # 2. Look for page-number links
    page_links = soup.select('a[href*="page"]')
    if page_links:
        return 'page-numbers', page_links

    # 3. Infinite scroll (requires JavaScript)
    if soup.select('[data-scroll="infinite"]'):
        return 'infinite-scroll', None

    return 'unknown', None

# pagination_type, info = auto_detect_pagination('https://example.com')
# print(f"Detected pagination: {pagination_type}")

🎯 Learning Objective 3: Implement Error Handling and Retry Logic

3.1 Basic Exception Handling

  
import requests
from bs4 import BeautifulSoup

def safe_scrape(url):
    """Scrape safely, with exception handling."""
    try:
        # HTTP request
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # raises on 4xx and 5xx responses

        # Parse the HTML
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract the data
        title = soup.find('h1')
        if title:
            return title.get_text(strip=True)
        else:
            return None

    except requests.exceptions.Timeout:
        print(f"⏱️ Timeout: {url}")
        return None

    except requests.exceptions.HTTPError as e:
        print(f"❌ HTTP error: {e.response.status_code}")
        return None

    except requests.exceptions.ConnectionError:
        print(f"🌐 Connection error: {url}")
        return None

    except requests.exceptions.RequestException as e:
        print(f"❌ Request error: {e}")
        return None

    except Exception as e:
        print(f"❌ Unexpected error: {e}")
        return None

3.2 Retry Logic

  
import requests
import time

def scrape_with_retry(url, max_retries=3, delay=2):
    """Scrape with retry logic."""
    for attempt in range(1, max_retries + 1):
        try:
            print(f"Attempt {attempt}/{max_retries}: {url}")

            response = requests.get(url, timeout=10)
            response.raise_for_status()

            # Success
            print("✅ Success!")
            return response

        except requests.exceptions.RequestException as e:
            print(f"❌ Failed: {e}")

            if attempt < max_retries:
                print(f"⏳ Retrying in {delay} seconds...")
                time.sleep(delay)
                delay *= 2  # exponential backoff (2, 4, 8 seconds...)
            else:
                print("❌ Maximum number of retries exceeded")
                return None

# Usage
# response = scrape_with_retry('https://example.com')
# if response:
#     soup = BeautifulSoup(response.text, 'html.parser')
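To make the backoff arithmetic concrete, the sketch below computes the waits that a loop like `scrape_with_retry` performs, without touching the network. Note that with `max_retries=3` only two waits happen (2 and 4 seconds), because the last failed attempt returns instead of sleeping. The `jitter` and `rng` parameters are additions assumed for this example; random jitter is a common way to keep many clients from retrying in lockstep.

```python
import random


def backoff_delays(max_retries=3, delay=2, factor=2, jitter=0.0, rng=random.random):
    """Return the seconds waited before each retry (max_retries - 1 waits)."""
    delays = []
    for _ in range(max_retries - 1):
        # Optional jitter adds up to `jitter` extra seconds to each wait.
        delays.append(delay + jitter * rng())
        delay *= factor
    return delays


print(backoff_delays())               # waits for the default arguments
print(backoff_delays(max_retries=4))  # one more attempt, one more wait
```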

3.3 Implementing Retry Logic as a Decorator

  
import time
from functools import wraps

import requests

def retry(max_attempts=3, delay=1, backoff=2):
    """Retry decorator."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            current_delay = delay

            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    print(f"[Attempt {attempt}/{max_attempts}] Failed: {e}")

                    if attempt < max_attempts:
                        print(f"⏳ Waiting {current_delay} seconds...")
                        time.sleep(current_delay)
                        current_delay *= backoff
                    else:
                        print("❌ All retries failed")
                        raise

        return wrapper
    return decorator

# Usage
@retry(max_attempts=3, delay=2, backoff=2)
def fetch_data(url):
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.json()

# data = fetch_data('https://api.example.com/data')
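The decorator above retries on every `Exception`, including failures that will never succeed on a second try, such as a 404 or a programming error. One possible variant, sketched here under assumed names (`retry_on`, `TransientError`), retries only the exception types it is given and lets everything else propagate immediately. The `sleep` parameter exists only so the example runs without real waiting.

```python
import time
from functools import wraps


class TransientError(Exception):
    """Hypothetical stand-in for a timeout or a 5xx response."""


def retry_on(exceptions, max_attempts=3, delay=0.0, backoff=2, sleep=time.sleep):
    """Retry the wrapped function only when one of `exceptions` is raised."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            current_delay = delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    sleep(current_delay)
                    current_delay *= backoff
        return wrapper
    return decorator


calls = {"count": 0}


@retry_on((TransientError,), max_attempts=3)
def flaky():
    """Fail twice, then succeed, to simulate a briefly unavailable server."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("temporary failure")
    return "ok"


result = flaky()
print(result, calls["count"])
```

With requests, a reasonable choice for `exceptions` might be `(requests.exceptions.Timeout, requests.exceptions.ConnectionError)`, though which errors count as transient depends on the target site.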

3.4 Using a Session and Connection Pool

  
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_session_with_retries():
    """Create a Session with built-in retry logic."""
    session = requests.Session()

    # Retry strategy
    retry_strategy = Retry(
        total=3,                    # total number of retries
        backoff_factor=1,           # waits grow exponentially (exact schedule depends on the urllib3 version)
        status_forcelist=[429, 500, 502, 503, 504],  # status codes to retry on
        allowed_methods=["HEAD", "GET", "OPTIONS"]   # HTTP methods to retry (urllib3 1.26+)
    )

    # Apply the retry strategy to an HTTP adapter
    adapter = HTTPAdapter(max_retries=retry_strategy)

    # Apply to both http and https
    session.mount("http://", adapter)
    session.mount("https://", adapter)

    # Default headers
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    })

    return session

# Usage
session = create_session_with_retries()

# Retries happen automatically
# response = session.get('https://example.com')
# data = response.json()

3.5 Error Logging

  
import logging
import requests

# Logging configuration
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    filename='scraping.log'
)

logger = logging.getLogger(__name__)

def scrape_with_logging(url):
    """Scrape with logging."""
    logger.info(f"Scraping started: {url}")

    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()

        logger.info(f"Success: {url} (status code: {response.status_code})")
        return response

    except requests.exceptions.Timeout:
        logger.error(f"Timeout: {url}")
        return None

    except requests.exceptions.HTTPError as e:
        logger.error(f"HTTP error {e.response.status_code}: {url}")
        return None

    except Exception:
        # logger.exception also records the traceback
        logger.exception(f"Unexpected error: {url}")
        return None

# Usage
# response = scrape_with_logging('https://example.com')

🎯 Learning Objective 4: Understand Scraping Ethics and Robots Rules

4.1 Understanding robots.txt

robots.txt is a file in which a website tells crawlers which paths they may access.

Example (https://example.com/robots.txt):

User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Crawl-delay: 2

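How to read it:

For all crawlers (*)
Access to /admin/ and /private/ is disallowed
Access to /public/ is allowed
Wait at least 2 seconds between requests (Crawl-delay is a non-standard directive that not every crawler honors)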
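The reading above can be checked offline with the standard library's `urllib.robotparser`, by feeding it the rules as text instead of downloading them. This is a minimal sketch: the user agent name `MyBot` is an assumption, and because Python's parser treats a blank line as the end of a rule group, the rules here keep `Crawl-delay` directly under the `User-agent` group.

```python
from urllib.robotparser import RobotFileParser

# The example rules, written as one uninterrupted group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())  # parse text directly; no network request

admin_allowed = rp.can_fetch("MyBot", "https://example.com/admin/users")
public_allowed = rp.can_fetch("MyBot", "https://example.com/public/index.html")
other_allowed = rp.can_fetch("MyBot", "https://example.com/products")
delay = rp.crawl_delay("MyBot")  # seconds requested between requests, or None

print(admin_allowed, public_allowed, other_allowed, delay)
```

Paths that match no rule, such as `/products` here, are treated as allowed.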

4.2 Checking robots.txt

  
import requests

def check_robots_txt(base_url):
    """Print and return the contents of robots.txt."""
    # Strip a trailing slash so the URL does not end up with "//robots.txt"
    robots_url = f"{base_url.rstrip('/')}/robots.txt"

    try:
        response = requests.get(robots_url, timeout=5)

        if response.status_code == 200:
            print(f"=== {robots_url} ===\n")
            print(response.text)
            return response.text
        else:
            print("No robots.txt file found.")
            return None

    except Exception as e:
        print(f"Error: {e}")
        return None

# Usage
# check_robots_txt('https://www.naver.com')

4.3 Checking the Rules with robotparser

  
from urllib.robotparser import RobotFileParser

def can_fetch(url):
    """Check whether the URL may be fetched."""
    rp = RobotFileParser()

    # robots.txt URL (scheme + host of the target URL)
    robots_url = '/'.join(url.split('/')[:3]) + '/robots.txt'
    rp.set_url(robots_url)

    try:
        rp.read()

        # Check access for the generic '*' user agent
        if rp.can_fetch('*', url):
            print(f"✅ Allowed: {url}")
            return True
        else:
            print(f"❌ Disallowed: {url}")
            return False

    except Exception as e:
        print(f"Error: {e}")
        return False

# Usage
# can_fetch('https://example.com/products')
# can_fetch('https://example.com/admin')
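Deriving the robots.txt location by splitting the URL on `/` works for well-formed absolute URLs, but it silently produces nonsense for input such as `example.com/products`. A slightly more defensive sketch using `urllib.parse` is shown below; `robots_url_for` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(url):
    """Return the robots.txt URL for the site that serves `url`."""
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"absolute URL required: {url!r}")
    # Keep scheme and host (including any port); drop path, query, fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url_for("https://example.com/products?page=2"))
print(robots_url_for("https://example.com:8443/a/b"))
```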

4.4 Scraping Best Practices

1. Limit the request rate:

  
import time
import random

# Fixed delay
time.sleep(2)  # 2 seconds

# Random delay (looks more natural)
time.sleep(random.uniform(1, 3))  # random 1-3 seconds

2. Identify yourself with a User-Agent:

  
headers = {
    'User-Agent': 'MyBot/1.0 (contact@example.com)'  # include contact information
}
requests.get(url, headers=headers)

3. Keep a session:

  
session = requests.Session()
session.headers.update({'User-Agent': '...'})

# Cookies are managed automatically
response1 = session.get('https://example.com/login')
response2 = session.get('https://example.com/data')

4. Error handling and retries:

  
# Use the retry logic covered earlier
session = create_session_with_retries()
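A bare `time.sleep(2)` after every request, as in practice 1 above, always waits the full two seconds even when parsing the page already took that long. One alternative is a small limiter that sleeps only for the time still remaining since the previous request. The class below is an illustrative sketch; `RateLimiter` and its injectable `clock` and `sleep` parameters are assumptions made so the example can be exercised without real waiting.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Sleep just long enough to respect min_interval; return seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept


# Demonstration with a fake clock so nothing actually sleeps.
fake = {"now": 100.0}
limiter = RateLimiter(
    2.0,
    clock=lambda: fake["now"],
    sleep=lambda seconds: fake.update(now=fake["now"] + seconds),
)

first = limiter.wait()   # first call never waits
fake["now"] += 0.5       # pretend the request and parsing took 0.5 s
second = limiter.wait()  # waits only for the remaining 1.5 s
print(first, second)
```

In a scraper, `limiter.wait()` would be called right before each `session.get(...)`, with the default clock and sleep left in place.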

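4.5 Legal Considerations

✅ Generally acceptable:

Collecting publicly available information
Research or study for personal use
Using the API first when one is provided
Complying with the terms of service
Complying with robots.txt

❌ Not acceptable:

Collecting personal information without consent
Copyright infringement
Overloading the server
Violating the terms of service
Commercial misuse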

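4.6 Ethical Scraping Checklist

  
def ethical_scraping_checklist():
    """Ethical scraping guide."""
    checklist = [
        "✅ Did you check robots.txt?",
        "✅ Did you read the terms of service?",
        "✅ If an API exists, did you use it?",
        "✅ Did you leave at least 1-2 seconds between requests?",
        "✅ Did you include contact information in the User-Agent?",
        "✅ Are you avoiding personal information?",
        "✅ Are you avoiding undue load on the server?",
        "✅ Are you using the collected data lawfully?"
    ]

    print("=== Ethical Scraping Checklist ===\n")
    for item in checklist:
        print(item)

# ethical_scraping_checklist()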

💡 Practical Tips & Cautions

✅ DO

Prefer the API
- Use the official API whenever possible
- Scraping is the last resort

Limit the request rate

  
import time
time.sleep(1)  # at least 1 second

Set a User-Agent

  
headers = {'User-Agent': 'MyBot/1.0 (contact@me.com)'}

Implement retry logic
- Guards against an unstable network
- Use exponential backoff

Log errors
- Makes problems traceable when they occur

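❌ DON’T

Unbounded loops

  
# Bad
while True:
    scrape()

# Good
max_pages = 100
for page in range(max_pages):
    scrape()

Excessive requests
- Never send 10 or more requests per second
- Consider the load on the server

Ignoring robots.txt
- Can lead to legal problems

Collecting personal information
- Do not collect sensitive data such as email addresses or phone numbers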

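🧪 Exercises

Problem 1: Pagination Scraper

Write a function that scrapes a product list spread across multiple pages.

Requirements:

Scrape at most 5 pages
Extract the product name and price from each page
Wait 1 second between pages
Include error handling

💡 Hints

Use for page in range(1, 6)
Handle exceptions with try-except
Add time.sleep(1)
break on an empty page

✅ Solution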

  
import requests
from bs4 import BeautifulSoup
import time

def scrape_products(base_url, max_pages=5):
    """Paginated product scraper."""
    all_products = []

    for page in range(1, max_pages + 1):
        url = f"{base_url}?page={page}"
        print(f"[Page {page}] {url}")

        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()

            soup = BeautifulSoup(response.text, 'html.parser')

            # Extract the products
            products = soup.select('.product')

            if not products:
                print("  → Empty page, stopping")
                break

            for product in products:
                name = product.select_one('.name')
                price = product.select_one('.price')

                if name and price:
                    all_products.append({
                        'name': name.get_text(strip=True),
                        'price': price.get_text(strip=True)
                    })

            print(f"  → collected {len(products)} products")

            # Wait 1 second (skipped after the final page)
            if page < max_pages:
                time.sleep(1)

        except Exception as e:
            print(f"  ❌ Error: {e}")
            break

    return all_products

# Usage
# products = scrape_products('https://example.com/products')
# print(f"\n{len(products)} products in total")

Problem 2: Implementing Retry Logic

Write a function that retries up to 3 times on a network error.