[Mastering Python, Starting Now #18] Data Analysis: Working with Data Using Pandas and NumPy

Published 2025/08/22

Python Data Analysis: The Complete Guide

By YonYonWare

76 min read

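💼 Real-world example: analyzing e-commerce data in the style of Naver Shopping

Before diving into data analysis, let's look at how Korean companies put Python data analysis to work.

We'll use a Naver Shopping/Coupang-style e-commerce scenario to see how Pandas and NumPy are used in practice: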

  
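# E-commerce data analysis scenario
"""
What a Coupang/Naver Shopping data analysis team does:

1. Customer purchase pattern analysis
   - Order volume by time of day (optimizing dawn delivery)
   - Preferred products by region (deciding Rocket Delivery center locations)
   - Seasonal trends (inventory management)

2. Product recommendation algorithms
   - Collaborative filtering (products bought together)
   - Content-based filtering (similarity of product attributes)

3. Measuring marketing effectiveness
   - A/B test result analysis
   - Coupon/discount impact analysis
   - Ad ROI calculation

4. Inventory optimization
   - Demand forecasting models
   - Safety stock calculation
   - Supplier performance analysis

Companies using these techniques:
- Coupang: delivery optimization, demand forecasting
- Naver Shopping: product recommendation, price comparison
- TMON/WeMakePrice: time-deal impact analysis
- 11st: user behavior analysis
"""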

# Sample e-commerce data structure
import pandas as pd
import numpy as np

# Coupang-style order data structure (illustrative sample values)
sample_orders = {
    'order_id': ['ORD001', 'ORD002', 'ORD003'],
    'customer_id': ['CUST001', 'CUST002', 'CUST001'],
    'product_name': ['삼성 갤럭시 버즈', '나이키 에어맥스', '다이슨 청소기'],
    'category': ['전자제품', '패션', '가전'],
    'price': [189000, 129000, 459000],
    'order_date': ['2024-01-15', '2024-01-15', '2024-01-16'],
    'region': ['서울', '부산', '서울'],
    'delivery_type': ['로켓배송', '일반배송', '새벽배송']
}

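print("E-commerce data analysis scenario:")
print("- Optimize delivery centers by analyzing order patterns")
print("- Targeted marketing through customer segmentation")
print("- Demand forecast modeling from inventory data")

Now let's learn Pandas and NumPy, the tools for analyzing this kind of business data.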
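
The `sample_orders` dictionary above is defined but never analyzed, so here is a minimal sketch of a first step. It assumes a trimmed copy of the same made-up orders; `orders_df`, `per_customer`, and `per_region` are names chosen for this illustration, not part of the original post.

```python
import pandas as pd

# Trimmed copy of the illustrative sample_orders structure (made-up values)
sample_orders = {
    'order_id': ['ORD001', 'ORD002', 'ORD003'],
    'customer_id': ['CUST001', 'CUST002', 'CUST001'],
    'price': [189000, 129000, 459000],
    'order_date': ['2024-01-15', '2024-01-15', '2024-01-16'],
    'region': ['서울', '부산', '서울'],
}

orders_df = pd.DataFrame(sample_orders)
# Parse the date strings so that date-based grouping becomes possible later
orders_df['order_date'] = pd.to_datetime(orders_df['order_date'])

# Total spend and number of orders per customer
per_customer = orders_df.groupby('customer_id')['price'].agg(['sum', 'count'])
print(per_customer)

# Revenue per region
per_region = orders_df.groupby('region')['price'].sum()
print(per_region)
```

Even on three rows, the pattern is the same one used on real order tables: load into a DataFrame, fix the dtypes, then group and aggregate.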

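📊 First steps into the world of data analysis

Hello, and welcome to the 18th post in the Python Mastery series. Today we'll learn how to work with real data using Pandas and NumPy, the core libraries for data analysis. We'll go through the whole process step by step, from data collection to cleaning, analysis, and visualization.

Data analysis fundamentals

What is data analysis?

Data analysis is the process of collecting, cleaning, and analyzing raw data to derive meaningful insights.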

  
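# Data analysis process
"""
1. Collect data → 2. Clean data → 3. Exploratory analysis → 4. Modeling → 5. Derive insights
"""

# Import the required libraries (install first if needed: pip install numpy pandas matplotlib seaborn)
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')  # hides all warnings, including deprecation notices; remove while debugging

# Korean font setup (macOS); on Windows 'Malgun Gothic' is the usual choice
plt.rcParams['font.family'] = 'AppleGothic'
plt.rcParams['axes.unicode_minus'] = False

print("=== Data analysis environment ready ===")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"Matplotlib version: {matplotlib.__version__}")
print(f"Seaborn version: {sns.__version__}")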

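> [!TIP]
> **How do Pandas and NumPy differ?**
>
> - **NumPy**: a tool for fast computation on numeric arrays. It is optimized for matrix operations and mathematical calculations.
> - **Pandas**: a tool for tabular data. It makes it easy to manipulate data with rows and columns, much like Excel.
>
> Put simply, think of NumPy as the "calculator" and Pandas as the "spreadsheet".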
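
To make the contrast concrete, here is a small sketch that handles the same prices first as a NumPy array and then as a Pandas column. The values and the names `prices`, `discounted`, and `orders` are made up for this illustration.

```python
import numpy as np
import pandas as pd

# NumPy: a homogeneous numeric array, addressed by position
prices = np.array([189000, 129000, 459000])
discounted = prices * 0.9            # vectorized arithmetic, no explicit loop
print(discounted.mean())

# Pandas: labeled columns of mixed types, built on top of NumPy arrays
orders = pd.DataFrame({
    'category': ['전자제품', '패션', '전자제품'],
    'price': prices,
})
category_mean = orders.groupby('category')['price'].mean()
print(category_mean)

# A numeric column can be handed back to NumPy at any time
print(type(orders['price'].to_numpy()))
```

In practice the two are used together: Pandas for labels, grouping, and mixed types, NumPy for the raw numerical work underneath.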

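# Stages of data analysis
analysis_steps = {
    "1. Problem definition": "Clearly define the business problem to solve",
    "2. Data collection": "Gather the data needed for analysis from various sources",
    "3. Data exploration": "Understand the structure, quality, and distribution of the data",
    "4. Data cleaning": "Handle missing values, outliers, and duplicates",
    "5. Feature engineering": "Create new variables and transform existing ones",
    "6. Exploratory data analysis": "Discover patterns through visualization",
    "7. Modeling": "Apply statistical analysis and machine learning",
    "8. Interpretation": "Interpret the results from a business perspective",
    "9. Communication": "Present the results to stakeholders"
}

print("\n=== Stages of data analysis ===")
for step, description in analysis_steps.items():
    print(f"{step}: {description}")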

Data types and structures

  
# Characteristics of each data type
data_types = {
    "Quantitative data": {
        "Continuous": "height, weight, temperature (infinitely many possible values)",
        "Discrete": "population, units sold (countable values)"
    },
    "Qualitative data": {
        "Nominal": "gender, color, brand (no order)",
        "Ordinal": "letter grade, satisfaction, rank (ordered)"
    }
}

print("=== Data type classification ===")
for main_type, sub_types in data_types.items():
    print(f"\n{main_type}:")
    for sub_type, description in sub_types.items():
        print(f"  {sub_type}: {description}")

# Create a sample dataset
np.random.seed(42)
sample_data = {
    'customer_id': range(1, 1001),
    'age': np.random.normal(35, 10, 1000).astype(int),
    'income': np.random.exponential(50000, 1000),
    'purchase_amount': np.random.gamma(2, 50, 1000),
    'satisfaction': np.random.choice(['매우불만', '불만', '보통', '만족', '매우만족'], 1000),
    'city': np.random.choice(['서울', '부산', '인천', '대구', '대전'], 1000),
    'gender': np.random.choice(['남성', '여성'], 1000),
    'purchase_date': pd.date_range('2023-01-01', periods=1000, freq='D')
}

df = pd.DataFrame(sample_data)
print(f"\n=== Sample dataset created ===")
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print(f"\nPreview:")
print(df.head())

NumPy essentials

NumPy array basics

  
import numpy as np

print("=== NumPy 배열 기초 ===")

# 1. 배열 생성
print("1. 배열 생성 방법들")

# 리스트에서 배열 생성
arr_from_list = np.array([1, 2, 3, 4, 5])
print(f"리스트에서 생성: {arr_from_list}")

# 다차원 배열
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(f"2차원 배열:\n{matrix}")

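# Special array constructors
zeros = np.zeros((3, 4))
ones = np.ones((2, 3))
eye = np.eye(3)  # identity matrix
arange = np.arange(0, 10, 2)  # from 0 up to (but not including) 10, in steps of 2
linspace = np.linspace(0, 1, 5)  # 5 evenly spaced points from 0 to 1, endpoints included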

print(f"영 배열 (3x4):\n{zeros}")
print(f"일 배열 (2x3):\n{ones}")
print(f"단위행렬 (3x3):\n{eye}")
print(f"arange(0,10,2): {arange}")
print(f"linspace(0,1,5): {linspace}")

# 2. 배열 속성
print(f"\n2. 배열 속성")
print(f"배열 형태: {matrix.shape}")
print(f"차원 수: {matrix.ndim}")
print(f"크기: {matrix.size}")
print(f"데이터 타입: {matrix.dtype}")
print(f"메모리 사용량: {matrix.nbytes} bytes")

# 3. 랜덤 배열
print(f"\n3. 랜덤 배열")
np.random.seed(42)

random_uniform = np.random.random((3, 3))  # 0~1 균등분포
random_normal = np.random.normal(0, 1, (3, 3))  # 정규분포
random_int = np.random.randint(1, 10, (3, 3))  # 정수 랜덤

print(f"균등분포 (0~1):\n{random_uniform}")
print(f"정규분포 (μ=0, σ=1):\n{random_normal}")
print(f"정수 랜덤 (1~9):\n{random_int}")

NumPy operations and functions

  
# NumPy 연산
print("=== NumPy 연산과 함수 ===")

# 기본 배열
a = np.array([1, 2, 3, 4, 5])
b = np.array([2, 3, 4, 5, 6])

print("1. 기본 산술 연산")
print(f"a = {a}")
print(f"b = {b}")
print(f"a + b = {a + b}")
print(f"a - b = {a - b}")
print(f"a * b = {a * b}")  # 요소별 곱
print(f"a / b = {a / b}")
print(f"a ** 2 = {a ** 2}")

# 행렬 연산
matrix_a = np.array([[1, 2], [3, 4]])
matrix_b = np.array([[5, 6], [7, 8]])

print(f"\n2. 행렬 연산")
print(f"행렬 A:\n{matrix_a}")
print(f"행렬 B:\n{matrix_b}")
print(f"행렬 곱셈 (A @ B):\n{matrix_a @ matrix_b}")
print(f"행렬 곱셈 (dot):\n{np.dot(matrix_a, matrix_b)}")

# 통계 함수
data = np.random.normal(100, 15, 1000)
print(f"\n3. 통계 함수 (샘플 크기: {len(data)})")
print(f"평균: {np.mean(data):.2f}")
print(f"중앙값: {np.median(data):.2f}")
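print(f"Standard deviation: {np.std(data):.2f}")  # population std (ddof=0); pandas .std() defaults to ddof=1
print(f"Variance: {np.var(data):.2f}")  # population variance (ddof=0)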
print(f"최솟값: {np.min(data):.2f}")
print(f"최댓값: {np.max(data):.2f}")
print(f"25% 분위수: {np.percentile(data, 25):.2f}")
print(f"75% 분위수: {np.percentile(data, 75):.2f}")

# 집계 함수
matrix = np.random.randint(1, 10, (4, 5))
print(f"\n4. 집계 함수")
print(f"원본 행렬 (4x5):\n{matrix}")
print(f"전체 합: {np.sum(matrix)}")
print(f"행별 합: {np.sum(matrix, axis=1)}")
print(f"열별 합: {np.sum(matrix, axis=0)}")
print(f"행별 평균: {np.mean(matrix, axis=1)}")
print(f"열별 최댓값: {np.max(matrix, axis=0)}")

# 배열 조작
print(f"\n5. 배열 조작")
arr = np.arange(12)
print(f"원본 배열: {arr}")
print(f"reshape (3x4):\n{arr.reshape(3, 4)}")
print(f"reshape (2x6):\n{arr.reshape(2, 6)}")

# 배열 분할과 결합
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr3 = np.array([7, 8, 9])

concatenated = np.concatenate([arr1, arr2, arr3])
stacked_v = np.vstack([arr1, arr2, arr3])
stacked_h = np.hstack([arr1, arr2, arr3])

print(f"\n6. 배열 결합")
print(f"concatenate: {concatenated}")
print(f"vstack (세로):\n{stacked_v}")
print(f"hstack (가로): {stacked_h}")

Advanced NumPy indexing

  
# 고급 인덱싱과 슬라이싱
print("=== NumPy 고급 인덱싱 ===")

# 2차원 배열 생성
arr_2d = np.arange(20).reshape(4, 5)
print(f"2차원 배열 (4x5):\n{arr_2d}")

print(f"\n1. 기본 인덱싱")
print(f"arr_2d[0, 0] = {arr_2d[0, 0]}")  # 첫 번째 행, 첫 번째 열
print(f"arr_2d[1, 3] = {arr_2d[1, 3]}")  # 두 번째 행, 네 번째 열
print(f"arr_2d[-1, -1] = {arr_2d[-1, -1]}")  # 마지막 행, 마지막 열

print(f"\n2. 슬라이싱")
print(f"첫 두 행: \n{arr_2d[:2]}")
print(f"마지막 두 열: \n{arr_2d[:, -2:]}")
print(f"2-3행, 1-3열: \n{arr_2d[1:3, 0:3]}")

print(f"\n3. 불린 인덱싱")
# 조건에 맞는 요소 선택
condition = arr_2d > 10
print(f"10보다 큰 요소들: {arr_2d[condition]}")

# 복합 조건
complex_condition = (arr_2d > 5) & (arr_2d < 15)
print(f"5보다 크고 15보다 작은 요소들: {arr_2d[complex_condition]}")

print(f"\n4. 팬시 인덱싱")
# 인덱스 배열을 사용한 선택
rows = [0, 2, 3]
cols = [1, 3, 4]
print(f"선택된 요소들: {arr_2d[rows, cols]}")

# 특정 행들 선택
selected_rows = arr_2d[[0, 2]]
print(f"0번째, 2번째 행:\n{selected_rows}")

print(f"\n5. 배열 수정")
arr_copy = arr_2d.copy()
# 조건에 맞는 요소들 수정
arr_copy[arr_copy > 15] = -1
print(f"15보다 큰 값들을 -1로 변경:\n{arr_copy}")

# where 함수 사용
arr_where = np.where(arr_2d > 10, arr_2d, 0)
print(f"10보다 큰 값만 유지, 나머지는 0:\n{arr_where}")
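
Two details of the indexing examples above are easy to miss: passing two index lists selects paired elements rather than a rectangular block, and fancy indexing returns a copy while a basic slice returns a view. The sketch below rebuilds the same `arr_2d`; the names `paired`, `block`, `view`, and `copied` are illustrative.

```python
import numpy as np

arr_2d = np.arange(20).reshape(4, 5)
rows = [0, 2, 3]
cols = [1, 3, 4]

# Paired indexing: picks elements (0,1), (2,3), (3,4) and returns a 1-D array
paired = arr_2d[rows, cols]
print(paired)

# np.ix_ crosses every listed row with every listed column: a 3x3 block
block = arr_2d[np.ix_(rows, cols)]
print(block.shape)

# A basic slice is a view: writing through it changes the original array
view = arr_2d[:2]
view[0, 0] = 99

# Fancy indexing returns a copy: the original array is left untouched
copied = arr_2d[[0, 1]]
copied[0, 1] = -5

print(arr_2d[0, :2])
```

This is why the modification example above calls `arr_2d.copy()` first: without it, in-place edits through a slice would alter the source array.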

Mastering Pandas

DataFrame and Series basics

  
import pandas as pd
import numpy as np

print("=== Pandas DataFrame과 Series 기초 ===")

# 1. Series 생성
print("1. Series 생성")
series_from_list = pd.Series([1, 2, 3, 4, 5])
series_with_index = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
series_from_dict = pd.Series({'서울': 9776000, '부산': 3414000, '인천': 2947000})

print(f"리스트에서 생성:\n{series_from_list}")
print(f"인덱스 지정:\n{series_with_index}")
print(f"딕셔너리에서 생성:\n{series_from_dict}")

# 2. DataFrame 생성
print(f"\n2. DataFrame 생성")

# 딕셔너리에서 생성
data_dict = {
    'name': ['김철수', '이영희', '박민수', '최지영'],
    'age': [25, 30, 35, 28],
    'city': ['서울', '부산', '대구', '인천'],
    'salary': [3000, 3500, 4000, 3200]
}
df = pd.DataFrame(data_dict)
print(f"딕셔너리에서 생성:\n{df}")

# 리스트의 리스트에서 생성
data_list = [
    ['김철수', 25, '서울', 3000],
    ['이영희', 30, '부산', 3500],
    ['박민수', 35, '대구', 4000],
    ['최지영', 28, '인천', 3200]
]
df_from_list = pd.DataFrame(data_list, columns=['name', 'age', 'city', 'salary'])
print(f"\n리스트에서 생성:\n{df_from_list}")

# 3. DataFrame 기본 정보
print(f"\n3. DataFrame 기본 정보")
print(f"형태: {df.shape}")
print(f"컬럼: {list(df.columns)}")
print(f"인덱스: {list(df.index)}")
print(f"데이터 타입:\n{df.dtypes}")
print(f"기본 통계:\n{df.describe()}")

# 4. 데이터 선택과 접근
print(f"\n4. 데이터 선택과 접근")

# 컬럼 선택
print(f"name 컬럼:\n{df['name']}")
print(f"여러 컬럼 선택:\n{df[['name', 'age']]}")

# 행 선택
print(f"첫 번째 행:\n{df.iloc[0]}")
print(f"name이 '김철수'인 행:\n{df[df['name'] == '김철수']}")

# loc와 iloc
print(f"loc[0, 'name']: {df.loc[0, 'name']}")
print(f"iloc[0, 0]: {df.iloc[0, 0]}")

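> [!WARNING]
> **Don't mix up loc and iloc!**
>
> - **loc**: accesses by label (name). df.loc[0, 'name'] → the 'name' column of the row whose index label is 0
> - **iloc**: accesses by position (integer). df.iloc[0, 0] → the row at position 0, column at position 0
>
> Remember loc as "location by label" and iloc as "location by integer".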
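
With the default 0, 1, 2, ... index the two accessors appear to behave identically, which hides the difference. The sketch below uses a hypothetical DataFrame whose labels deliberately differ from the positions.

```python
import pandas as pd

df = pd.DataFrame(
    {'name': ['김철수', '이영희', '박민수'], 'age': [25, 30, 35]},
    index=[10, 20, 30],   # labels deliberately differ from positions
)

print(df.loc[10, 'name'])   # label 10 is the first row
print(df.iloc[0, 0])        # position 0 is the same row

# Slicing differs too: loc includes the end label, iloc excludes the end position
print(len(df.loc[10:20]))   # rows labeled 10 and 20
print(len(df.iloc[0:1]))    # only the row at position 0

# Label 0 does not exist here, so loc raises KeyError
try:
    df.loc[0]
except KeyError:
    print("label 0 not found")
```

After filtering or sorting, a DataFrame usually has a non-contiguous index like this one, so choosing the right accessor matters most on real data.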

Data manipulation and transformation

  
# 데이터 조작과 변환
print("=== 데이터 조작과 변환 ===")

# 샘플 데이터 생성
np.random.seed(42)
sales_data = {
    'product': ['A', 'B', 'C', 'A', 'B', 'C'] * 100,
    'region': ['북부', '남부', '동부', '서부'] * 150,
    'sales': np.random.normal(100, 20, 600),
    'quantity': np.random.poisson(10, 600),
    'date': pd.date_range('2023-01-01', periods=600, freq='D')
}
sales_df = pd.DataFrame(sales_data)

print(f"판매 데이터 샘플:\n{sales_df.head()}")
print(f"데이터 크기: {sales_df.shape}")

# 1. 새 컬럼 추가
print(f"\n1. 새 컬럼 추가")
sales_df['revenue'] = sales_df['sales'] * sales_df['quantity']
sales_df['month'] = sales_df['date'].dt.month
sales_df['quarter'] = sales_df['date'].dt.quarter

print(f"새 컬럼 추가 후:\n{sales_df.head()}")

# 2. 조건부 컬럼 생성
sales_df['performance'] = pd.cut(sales_df['sales'],
                                bins=[0, 80, 120, float('inf')],
                                labels=['낮음', '보통', '높음'])

sales_df['season'] = sales_df['month'].map({
    12: '겨울', 1: '겨울', 2: '겨울',
    3: '봄', 4: '봄', 5: '봄',
    6: '여름', 7: '여름', 8: '여름',
    9: '가을', 10: '가을', 11: '가을'
})

print(f"조건부 컬럼 추가 후:\n{sales_df[['sales', 'performance', 'month', 'season']].head()}")

# 3. Group-wise aggregation
print(f"\n3. Group-wise aggregation")

# 제품별 집계
product_summary = sales_df.groupby('product').agg({
    'sales': ['mean', 'sum', 'count'],
    'quantity': ['mean', 'sum'],
    'revenue': ['mean', 'sum']
}).round(2)

print(f"제품별 집계:\n{product_summary}")

# 지역별, 제품별 집계
region_product = sales_df.groupby(['region', 'product'])['revenue'].sum().unstack()
print(f"\n지역별, 제품별 매출:\n{region_product}")

# 월별 매출 추이
monthly_sales = sales_df.groupby('month')['revenue'].sum()
print(f"\n월별 총매출:\n{monthly_sales}")

# 4. Pivot table
print(f"\n4. Pivot table")
pivot_table = pd.pivot_table(sales_df,
                           values='revenue',
                           index='region',
                           columns='product',
                           aggfunc='sum',
                           fill_value=0)
print(f"지역-제품 피벗 테이블:\n{pivot_table}")

# 복합 피벗
complex_pivot = pd.pivot_table(sales_df,
                             values=['sales', 'quantity'],
                             index='region',
                             columns='season',
                             aggfunc='mean',
                             fill_value=0).round(2)
print(f"\n지역-계절별 평균 판매량/수량:\n{complex_pivot}")
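
The region-by-product table above is built twice, once with groupby plus unstack and once with pivot_table. The sketch below checks on a tiny made-up dataset that both routes produce the same numbers, and shows named aggregation as a way to get flat column names instead of the two-level columns that a dict passed to `agg` produces. The `sales` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical mini dataset with the same column names as sales_df
sales = pd.DataFrame({
    'region': ['북부', '북부', '남부', '남부', '남부'],
    'product': ['A', 'B', 'A', 'A', 'B'],
    'revenue': [100, 200, 150, 50, 300],
})

# Route 1: groupby + unstack
by_groupby = sales.groupby(['region', 'product'])['revenue'].sum().unstack(fill_value=0)

# Route 2: pivot_table
by_pivot = pd.pivot_table(sales, values='revenue', index='region',
                          columns='product', aggfunc='sum', fill_value=0)

# Both tables hold the same values
print((by_groupby.to_numpy() == by_pivot.to_numpy()).all())

# Named aggregation: output_column=(input_column, function)
summary = sales.groupby('product').agg(
    total_revenue=('revenue', 'sum'),
    avg_revenue=('revenue', 'mean'),
)
print(summary)
```

Which route to use is mostly a matter of readability: pivot_table states the row/column layout explicitly, while groupby chains more naturally with further transformations.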

Merging and combining data

  
# 데이터 병합과 결합
print("=== 데이터 병합과 결합 ===")

# 샘플 데이터 생성
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['김철수', '이영희', '박민수', '최지영', '정수현'],
    'city': ['서울', '부산', '대구', '인천', '광주']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105, 106],
    'customer_id': [1, 2, 3, 1, 4, 6],  # 6번 고객은 customers에 없음
    'product': ['A', 'B', 'C', 'A', 'B', 'C'],
    'amount': [1000, 1500, 2000, 1200, 1800, 2200]
})

products = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D'],
    'product_name': ['노트북', '마우스', '키보드', '모니터'],
    'category': ['전자제품', '액세서리', '액세서리', '전자제품']
})

print(f"고객 데이터:\n{customers}")
print(f"\n주문 데이터:\n{orders}")
print(f"\n제품 데이터:\n{products}")

# 1. Inner Join (교집합)
inner_join = pd.merge(customers, orders, on='customer_id', how='inner')
print(f"\n1. Inner Join (고객-주문):\n{inner_join}")

# 2. Left Join (왼쪽 기준)
left_join = pd.merge(customers, orders, on='customer_id', how='left')
print(f"\n2. Left Join (모든 고객 + 주문):\n{left_join}")

# 3. Right Join (오른쪽 기준)
right_join = pd.merge(customers, orders, on='customer_id', how='right')
print(f"\n3. Right Join (모든 주문 + 고객):\n{right_join}")

# 4. Outer Join (합집합)
outer_join = pd.merge(customers, orders, on='customer_id', how='outer')
print(f"\n4. Outer Join (모든 고객과 주문):\n{outer_join}")

# 5. 다중 테이블 조인
# 먼저 고객과 주문을 조인한 후 제품 정보 추가
customer_orders = pd.merge(customers, orders, on='customer_id', how='inner')
full_data = pd.merge(customer_orders, products, on='product', how='left')
print(f"\n5. 다중 테이블 조인 (고객-주문-제품):\n{full_data}")

# 6. 다른 컬럼명으로 조인
# customers 테이블의 customer_id를 cust_id로 변경했다고 가정
customers_renamed = customers.rename(columns={'customer_id': 'cust_id'})
different_names = pd.merge(customers_renamed, orders,
                          left_on='cust_id', right_on='customer_id', how='inner')
print(f"\n6. 다른 컬럼명으로 조인:\n{different_names}")

# 7. 인덱스로 조인
customers_indexed = customers.set_index('customer_id')
orders_indexed = orders.set_index('customer_id')
index_join = customers_indexed.join(orders_indexed, how='inner', rsuffix='_order')
print(f"\n7. 인덱스로 조인:\n{index_join}")

# 8. concat을 이용한 결합
# 세로로 결합 (행 추가)
more_customers = pd.DataFrame({
    'customer_id': [6, 7],
    'name': ['김영수', '박지은'],
    'city': ['대전', '울산']
})
combined_customers = pd.concat([customers, more_customers], ignore_index=True)
print(f"\n8. concat 세로 결합:\n{combined_customers}")

# 가로로 결합 (열 추가)
customer_details = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'age': [25, 30, 35, 28, 32],
    'gender': ['M', 'F', 'M', 'F', 'M']
})
combined_horizontal = pd.concat([customers, customer_details[['age', 'gender']]], axis=1)
print(f"\n가로 결합:\n{combined_horizontal}")
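
Two checks are worth adding to the joins above. First, `indicator=True` on `pd.merge` labels each row by where it came from, which makes unmatched keys (such as the order from customer 6) easy to find. Second, `pd.concat(..., axis=1)` aligns rows by index label, not by position, so the horizontal concat above only works because both frames share the default index. The data below is a reduced, hypothetical version of the tables used in this section.

```python
import pandas as pd

customers = pd.DataFrame({'customer_id': [1, 2, 3],
                          'name': ['김철수', '이영희', '박민수']})
orders = pd.DataFrame({'order_id': [101, 102, 103],
                       'customer_id': [1, 1, 6],
                       'amount': [1000, 1200, 2200]})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'
merged = pd.merge(customers, orders, on='customer_id', how='outer', indicator=True)
print(merged['_merge'].value_counts())

# concat(axis=1) matches rows by index label
left = pd.DataFrame({'name': ['a', 'b', 'c']}, index=[0, 1, 2])
right = pd.DataFrame({'age': [25, 30, 35]}, index=[2, 1, 0])
by_label = pd.concat([left, right], axis=1)
print(by_label.loc[0, 'age'])   # 35: the value whose label is 0, not the first listed

# Reset both indexes first when positional pairing is what you want
by_position = pd.concat([left.reset_index(drop=True),
                         right.reset_index(drop=True)], axis=1)
print(by_position.loc[0, 'age'])
```

When both tables carry a key column such as `customer_id`, merging on that key is usually the safer choice than a horizontal concat.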

Data cleaning and preprocessing

Handling missing values and outliers

  
# 결측값과 이상값 처리
print("=== 결측값과 이상값 처리 ===")

# 결측값이 있는 데이터 생성
np.random.seed(42)
dirty_data = pd.DataFrame({
    'name': ['김철수', '이영희', None, '최지영', '정수현', '박민수'],
    'age': [25, None, 35, 28, 32, None],
    'salary': [3000, 3500, None, 3200, 4500, 3800],
    'city': ['서울', '부산', '대구', None, '광주', '대전'],
    'experience': [2, 5, 8, 3, None, 7]
})

# 일부 이상값 추가
dirty_data.loc[len(dirty_data)] = ['이상값', 150, 50000, '서울', 50]

print(f"원본 데이터 (결측값과 이상값 포함):\n{dirty_data}")

# 1. 결측값 확인
print(f"\n1. 결측값 현황")
print(f"결측값 개수:\n{dirty_data.isnull().sum()}")
print(f"결측값 비율:\n{(dirty_data.isnull().sum() / len(dirty_data) * 100).round(2)}%")

# 결측값 시각화
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.heatmap(dirty_data.isnull(), cbar=True, yticklabels=False, cmap='viridis')
plt.title('결측값 분포')
plt.show()

# 2. 결측값 처리 방법들
print(f"\n2. 결측값 처리")

# 방법 1: 결측값이 있는 행 제거
no_missing = dirty_data.dropna()
print(f"결측값 행 제거 후: {no_missing.shape[0]}행")

# 방법 2: 특정 컬럼의 결측값만 제거
no_name_missing = dirty_data.dropna(subset=['name'])
print(f"name 결측값만 제거: {no_name_missing.shape[0]}행")

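# Method 3: fill missing values
filled_data = dirty_data.copy()

# Numeric columns: fill with the mean, median, or mode.
# Assign the result back instead of calling fillna(..., inplace=True) on a
# selected column, which does not reliably update the DataFrame in recent pandas.
# Note: the age mean here includes the outlier row added above (age 150).
filled_data['age'] = filled_data['age'].fillna(filled_data['age'].mean())
filled_data['salary'] = filled_data['salary'].fillna(filled_data['salary'].median())
filled_data['experience'] = filled_data['experience'].fillna(filled_data['experience'].mode()[0])

# Categorical columns: fill with the mode or a placeholder such as '미상' (unknown)
filled_data['name'] = filled_data['name'].fillna('미상')
filled_data['city'] = filled_data['city'].fillna(filled_data['city'].mode()[0])

print(f"After filling missing values:\n{filled_data}")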

# Method 4: fill from neighboring values (useful for time series)
time_series = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=10),
    'value': [10, None, 12, None, None, 15, 16, None, 18, 19]
})
# fillna(method=...) is deprecated since pandas 2.1; call ffill()/bfill() directly
time_series['forward_fill'] = time_series['value'].ffill()
time_series['backward_fill'] = time_series['value'].bfill()
time_series['interpolate'] = time_series['value'].interpolate()

print(f"\nTime-series missing-value handling:\n{time_series}")

# 3. 이상값 탐지
print(f"\n3. 이상값 탐지")

# IQR 방법으로 이상값 탐지
def detect_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    return outliers, lower_bound, upper_bound

# 나이 이상값 탐지
age_outliers, age_lower, age_upper = detect_outliers_iqr(filled_data, 'age')
print(f"나이 이상값: {len(age_outliers)}개")
print(f"정상 범위: {age_lower:.2f} ~ {age_upper:.2f}")
print(f"이상값:\n{age_outliers[['name', 'age']]}")

# 연봉 이상값 탐지
salary_outliers, sal_lower, sal_upper = detect_outliers_iqr(filled_data, 'salary')
print(f"\n연봉 이상값: {len(salary_outliers)}개")
print(f"정상 범위: {sal_lower:.2f} ~ {sal_upper:.2f}")
print(f"이상값:\n{salary_outliers[['name', 'salary']]}")

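# Outlier detection with the Z-score method
from scipy import stats

def detect_outliers_zscore(data, column, threshold=3):
    # Compute z-scores on non-null values only, then map back through the
    # original index so the selection stays aligned with `data`
    values = data[column].dropna()
    z_scores = np.abs(stats.zscore(values.to_numpy()))
    return data.loc[values.index[z_scores > threshold]]

# Note: with n rows, |z| (ddof=0) cannot exceed sqrt(n - 1). For the 7 rows here
# that is about 2.45, so threshold=3 flags nothing; lower the threshold or use IQR.
z_outliers = detect_outliers_zscore(filled_data, 'salary')
print(f"\nSalary outliers (Z-score method):\n{z_outliers[['name', 'salary']]}")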

# 4. 이상값 처리
print(f"\n4. 이상값 처리")

cleaned_data = filled_data.copy()

# 이상값을 상한/하한선으로 대체 (클리핑)
cleaned_data['age'] = cleaned_data['age'].clip(lower=age_lower, upper=age_upper)
cleaned_data['salary'] = cleaned_data['salary'].clip(lower=sal_lower, upper=sal_upper)

print(f"이상값 클리핑 후:\n{cleaned_data}")

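# Replace outliers with the mean/median of the non-outlier rows
capped_data = filled_data.copy()
age_is_outlier = (capped_data['age'] < age_lower) | (capped_data['age'] > age_upper)
salary_is_outlier = (capped_data['salary'] < sal_lower) | (capped_data['salary'] > sal_upper)
# Compute the replacement from inliers only, so the outlier itself does not skew it
capped_data.loc[age_is_outlier, 'age'] = capped_data.loc[~age_is_outlier, 'age'].mean()
capped_data.loc[salary_is_outlier, 'salary'] = capped_data.loc[~salary_is_outlier, 'salary'].median()

print(f"\nOutliers replaced with mean/median:\n{capped_data}")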

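> [!IMPORTANT]
> **Handle missing values with care!**
>
> Don't fill every missing value with the mean by default; doing so can distort what the data means.
>
> - **MCAR (missing completely at random)**: filling with the mean/median is generally acceptable
> - **MAR (missing at random)**: fill using the relationship with other variables
> - **MNAR (missing not at random)**: the missingness itself may carry meaning (e.g., high earners leaving salary blank)
>
> The first step is to figure out why the values are missing.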
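
As a rough illustration of the MAR and MNAR cases, the sketch below fills salary with the median of each row's own city rather than the global median, and keeps a flag column recording which values were originally missing. The data is hypothetical, and the assumption that salary depends on city is made only for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'city': ['서울', '서울', '서울', '부산', '부산', '부산'],
    'salary': [4000, np.nan, 5000, 3000, 3200, np.nan],
})

# Record which values were missing before filling anything (useful if MNAR is suspected)
df['salary_missing'] = df['salary'].isna()

# Fill each gap with the median of its own city instead of the global median
city_median = df.groupby('city')['salary'].transform('median')
df['salary_filled'] = df['salary'].fillna(city_median)

print(df)
print(f"Global median would have been: {df['salary'].median()}")
```

Here the two gaps receive different values (one per city), whereas a global median would assign the same number to both and blur the difference between the groups.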

Data type conversion and normalization

  
# 데이터 타입 변환과 정규화
print("=== 데이터 타입 변환과 정규화 ===")

# 샘플 데이터 생성
sample_df = pd.DataFrame({
    'id': ['001', '002', '003', '004', '005'],
    'score': ['85.5', '92.0', '78.5', '95.0', '88.5'],
    'grade': ['B', 'A', 'C', 'A', 'B'],
    'pass_fail': ['Pass', 'Pass', 'Fail', 'Pass', 'Pass'],
    'date_str': ['2023-01-15', '2023-02-20', '2023-03-10', '2023-04-05', '2023-05-12']
})

print(f"원본 데이터:\n{sample_df}")
print(f"데이터 타입:\n{sample_df.dtypes}")

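# 1. Numeric conversion
print(f"\n1. Numeric conversion")
# Caution: converting an identifier such as '001' drops the leading zeros (it becomes 1).
# It is converted here only to demonstrate to_numeric; keep real IDs as strings.
sample_df['id'] = pd.to_numeric(sample_df['id'])
sample_df['score'] = pd.to_numeric(sample_df['score'])

print(f"Data types after conversion:\n{sample_df.dtypes}")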

# 2. 날짜형 변환
sample_df['date'] = pd.to_datetime(sample_df['date_str'])
sample_df['year'] = sample_df['date'].dt.year
sample_df['month'] = sample_df['date'].dt.month
sample_df['day_of_week'] = sample_df['date'].dt.day_name()

print(f"\n날짜 변환 후:\n{sample_df[['date_str', 'date', 'year', 'month', 'day_of_week']]}")

# 3. 범주형 변환
# 순서가 있는 범주형 (Ordered Categorical)
grade_categories = ['C', 'B', 'A']
sample_df['grade_cat'] = pd.Categorical(sample_df['grade'],
                                       categories=grade_categories,
                                       ordered=True)

# 순서가 없는 범주형
sample_df['pass_fail_cat'] = sample_df['pass_fail'].astype('category')

print(f"\n범주형 변환:")
print(f"Grade categories: {sample_df['grade_cat'].cat.categories}")
print(f"Pass/Fail categories: {sample_df['pass_fail_cat'].cat.categories}")

# 4. One-hot encoding
print(f"\n4. One-hot encoding")
grade_dummies = pd.get_dummies(sample_df['grade'], prefix='grade')
pass_dummies = pd.get_dummies(sample_df['pass_fail'], prefix='result')

encoded_df = pd.concat([sample_df, grade_dummies, pass_dummies], axis=1)
print(f"원-핫 인코딩 후:\n{encoded_df[['grade'] + list(grade_dummies.columns)]}")

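# 5. Label encoding
from sklearn.preprocessing import LabelEncoder

# LabelEncoder sorts classes alphabetically (A=0, B=1, C=2), which reverses the
# grade order defined above; for ordinal features, grade_cat.cat.codes keeps it
le = LabelEncoder()
sample_df['grade_encoded'] = le.fit_transform(sample_df['grade'])
print(f"\nLabel encoding (grade): {dict(zip(le.classes_, le.transform(le.classes_)))}")

# 6. Normalization / standardization
print(f"\n6. Normalization / standardization")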

# 더 큰 샘플 데이터로 정규화 예제
np.random.seed(42)
norm_data = pd.DataFrame({
    'feature1': np.random.normal(100, 15, 100),
    'feature2': np.random.normal(50, 10, 100),
    'feature3': np.random.exponential(2, 100)
})

print(f"원본 데이터 통계:\n{norm_data.describe()}")

# Min-Max 정규화 (0-1 스케일)
from sklearn.preprocessing import MinMaxScaler

scaler_minmax = MinMaxScaler()
norm_data_minmax = pd.DataFrame(
    scaler_minmax.fit_transform(norm_data),
    columns=[f'{col}_minmax' for col in norm_data.columns]
)

# 표준화 (Z-score)
from sklearn.preprocessing import StandardScaler

scaler_standard = StandardScaler()
norm_data_standard = pd.DataFrame(
    scaler_standard.fit_transform(norm_data),
    columns=[f'{col}_standard' for col in norm_data.columns]
)

print(f"\nMin-Max 정규화 후:\n{norm_data_minmax.describe()}")
print(f"\n표준화 후:\n{norm_data_standard.describe()}")

# 정규화 효과 시각화
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# 원본 데이터
axes[0].hist(norm_data['feature1'], bins=20, alpha=0.7)
axes[0].set_title('원본 데이터')
axes[0].set_xlabel('Feature1')

# Min-Max 정규화
axes[1].hist(norm_data_minmax['feature1_minmax'], bins=20, alpha=0.7, color='orange')
axes[1].set_title('Min-Max 정규화')
axes[1].set_xlabel('Feature1 (0-1 스케일)')

# 표준화
axes[2].hist(norm_data_standard['feature1_standard'], bins=20, alpha=0.7, color='green')
axes[2].set_title('표준화 (Z-score)')
axes[2].set_xlabel('Feature1 (표준화)')

plt.tight_layout()
plt.show()

# 7. Log transform (for skewed data)
print(f"\n7. Log transform")

# 왜도가 있는 데이터 생성
skewed_data = pd.DataFrame({
    'original': np.random.exponential(2, 1000)
})

skewed_data['log_transformed'] = np.log1p(skewed_data['original'])  # log(1+x)
skewed_data['sqrt_transformed'] = np.sqrt(skewed_data['original'])

print(f"왜도 비교:")
print(f"원본: {skewed_data['original'].skew():.2f}")
print(f"로그 변환: {skewed_data['log_transformed'].skew():.2f}")
print(f"제곱근 변환: {skewed_data['sqrt_transformed'].skew():.2f}")

# 변환 효과 시각화
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

axes[0].hist(skewed_data['original'], bins=50, alpha=0.7)
axes[0].set_title('원본 데이터')

axes[1].hist(skewed_data['log_transformed'], bins=50, alpha=0.7, color='orange')
axes[1].set_title('로그 변환')

axes[2].hist(skewed_data['sqrt_transformed'], bins=50, alpha=0.7, color='green')
axes[2].set_title('제곱근 변환')

plt.tight_layout()
plt.show()
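
One practical point about the log transform above: values computed on the log scale have to be mapped back before they are reported in the original units. The sketch below shows that `np.expm1` is the exact inverse of `np.log1p`; the random data and the seed are arbitrary choices for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)            # arbitrary seed for the sketch
original = pd.Series(rng.exponential(2, 1000))

logged = np.log1p(original)               # log(1 + x), defined at x = 0
restored = np.expm1(logged)               # exp(y) - 1 undoes log1p

print(f"skew before: {original.skew():.2f}, after: {logged.skew():.2f}")
print(f"round trip matches: {np.allclose(original, restored)}")
```

`log1p` is used instead of a plain `log` because right-skewed business data such as purchase amounts often contains zeros, where `log` is undefined.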

Exploratory data analysis (EDA)

Descriptive statistics and distribution analysis

  
# 탐색적 데이터 분석 (EDA)
print("=== 탐색적 데이터 분석 (EDA) ===")

# 종합적인 샘플 데이터 생성
np.random.seed(42)
eda_data = pd.DataFrame({
    'customer_id': range(1, 1001),
    'age': np.random.normal(35, 12, 1000).astype(int),
    'income': np.random.lognormal(10.5, 0.5, 1000),
    'education': np.random.choice(['고등학교', '대학교', '대학원'], 1000, p=[0.3, 0.5, 0.2]),
    'region': np.random.choice(['서울', '경기', '부산', '기타'], 1000, p=[0.3, 0.25, 0.15, 0.3]),
    'purchase_frequency': np.random.poisson(8, 1000),
    'last_purchase_days': np.random.exponential(30, 1000).astype(int),
    'satisfaction_score': np.random.beta(2, 1, 1000) * 5,  # 0-5점 척도
    'is_premium': np.random.choice([True, False], 1000, p=[0.2, 0.8])
})

# 연령대별 소득 조정 (현실적으로)
eda_data.loc[eda_data['age'] < 25, 'income'] *= 0.6
eda_data.loc[eda_data['age'] > 55, 'income'] *= 1.3

print(f"EDA 데이터셋:\n{eda_data.head()}")
print(f"데이터 크기: {eda_data.shape}")

# 1. 기본 정보 파악
print(f"\n1. 기본 정보")
print(f"데이터 타입:\n{eda_data.dtypes}")
print(f"\n기술 통계:\n{eda_data.describe()}")

# 범주형 변수 요약
categorical_cols = ['education', 'region']
for col in categorical_cols:
    print(f"\n{col} 분포:")
    print(eda_data[col].value_counts())
    print(f"비율:\n{eda_data[col].value_counts(normalize=True).round(3)}")

# 2. 분포 분석
print(f"\n2. 분포 분석")

# 수치형 변수들의 분포 시각화
numeric_cols = ['age', 'income', 'purchase_frequency', 'satisfaction_score']

fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.ravel()

for i, col in enumerate(numeric_cols):
    axes[i].hist(eda_data[col], bins=30, alpha=0.7, edgecolor='black')
    axes[i].set_title(f'{col} 분포')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('빈도')

    # 통계 정보 추가
    mean_val = eda_data[col].mean()
    median_val = eda_data[col].median()
    axes[i].axvline(mean_val, color='red', linestyle='--', label=f'평균: {mean_val:.1f}')
    axes[i].axvline(median_val, color='green', linestyle='--', label=f'중앙값: {median_val:.1f}')
    axes[i].legend()

plt.tight_layout()
plt.show()

# 3. 왜도와 첨도 분석
print(f"\n3. 왜도와 첨도")
for col in numeric_cols:
    skewness = eda_data[col].skew()
    kurtosis = eda_data[col].kurtosis()
    print(f"{col}: 왜도={skewness:.2f}, 첨도={kurtosis:.2f}")

# 4. 박스플롯으로 이상값 탐지
print(f"\n4. 박스플롯 분석")

fig, axes = plt.subplots(1, 4, figsize=(20, 5))

for i, col in enumerate(numeric_cols):
    eda_data.boxplot(column=col, ax=axes[i])
    axes[i].set_title(f'{col} 박스플롯')

plt.tight_layout()
plt.show()

# 5. 범주형 변수 분석
print(f"\n5. 범주형 변수 분석")

fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# 교육 수준 분포
education_counts = eda_data['education'].value_counts()
axes[0].pie(education_counts.values, labels=education_counts.index, autopct='%1.1f%%')
axes[0].set_title('교육 수준 분포')

# 지역 분포
region_counts = eda_data['region'].value_counts()
axes[1].bar(region_counts.index, region_counts.values)
axes[1].set_title('지역별 고객 수')
axes[1].set_xlabel('지역')
axes[1].set_ylabel('고객 수')

plt.tight_layout()
plt.show()

Correlation and relationship analysis

  
# 상관관계와 관계 분석
print("=== 상관관계와 관계 분석 ===")

# 1. 상관계수 계산
print("1. 상관계수 분석")

# 수치형 변수들 간의 상관관계
numeric_data = eda_data[['age', 'income', 'purchase_frequency', 'last_purchase_days', 'satisfaction_score']]
correlation_matrix = numeric_data.corr()

print(f"상관계수 행렬:\n{correlation_matrix.round(3)}")

# 상관계수 히트맵
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
            square=True, linewidths=0.5)
plt.title('변수 간 상관관계 히트맵')
plt.show()

# 2. 산점도 행렬
print(f"\n2. 산점도 분석")

# 주요 변수들 간의 산점도
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# 나이 vs 소득
axes[0, 0].scatter(eda_data['age'], eda_data['income'], alpha=0.6)
axes[0, 0].set_xlabel('나이')
axes[0, 0].set_ylabel('소득')
axes[0, 0].set_title('나이 vs 소득')

# 소득 vs 구매빈도
axes[0, 1].scatter(eda_data['income'], eda_data['purchase_frequency'], alpha=0.6, color='orange')
axes[0, 1].set_xlabel('소득')
axes[0, 1].set_ylabel('구매빈도')
axes[0, 1].set_title('소득 vs 구매빈도')

# 구매빈도 vs 만족도
axes[1, 0].scatter(eda_data['purchase_frequency'], eda_data['satisfaction_score'], alpha=0.6, color='green')
axes[1, 0].set_xlabel('구매빈도')
axes[1, 0].set_ylabel('만족도')
axes[1, 0].set_title('구매빈도 vs 만족도')

# 마지막 구매일 vs 만족도
axes[1, 1].scatter(eda_data['last_purchase_days'], eda_data['satisfaction_score'], alpha=0.6, color='red')
axes[1, 1].set_xlabel('마지막 구매일 (일)')
axes[1, 1].set_ylabel('만족도')
axes[1, 1].set_title('마지막 구매일 vs 만족도')

plt.tight_layout()
plt.show()

# 3. 그룹별 분석
print(f"\n3. 그룹별 분석")

# 교육 수준별 소득 분포
education_income = eda_data.groupby('education')['income'].agg(['mean', 'median', 'std']).round(0)
print(f"교육 수준별 소득 통계:\n{education_income}")

# 지역별 구매 패턴
region_stats = eda_data.groupby('region').agg({
    'income': 'mean',
    'purchase_frequency': 'mean',
    'satisfaction_score': 'mean'
}).round(2)
print(f"\n지역별 통계:\n{region_stats}")

# 프리미엄 고객 vs 일반 고객
premium_comparison = eda_data.groupby('is_premium').agg({
    'age': 'mean',
    'income': 'mean',
    'purchase_frequency': 'mean',
    'satisfaction_score': 'mean'
}).round(2)
print(f"\n프리미엄 고객 vs 일반 고객:\n{premium_comparison}")

# 4. 박스플롯으로 그룹 비교
print(f"\n4. 그룹별 분포 비교")

fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# 교육 수준별 소득
sns.boxplot(data=eda_data, x='education', y='income', ax=axes[0, 0])
axes[0, 0].set_title('교육 수준별 소득 분포')
axes[0, 0].tick_params(axis='x', rotation=45)

# 지역별 구매빈도
sns.boxplot(data=eda_data, x='region', y='purchase_frequency', ax=axes[0, 1])
axes[0, 1].set_title('지역별 구매빈도 분포')

# 프리미엄 고객 여부별 만족도
sns.boxplot(data=eda_data, x='is_premium', y='satisfaction_score', ax=axes[1, 0])
axes[1, 0].set_title('프리미엄 고객별 만족도')

# 교육 수준별 나이
sns.boxplot(data=eda_data, x='education', y='age', ax=axes[1, 1])
axes[1, 1].set_title('교육 수준별 나이 분포')
axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

# 5. 교차표 분석
print(f"\n5. 교차표 분석")

# 교육 수준과 지역 간의 관계
cross_tab = pd.crosstab(eda_data['education'], eda_data['region'])
print(f"교육 수준 x 지역 교차표:\n{cross_tab}")

# 비율로 변환
cross_tab_pct = pd.crosstab(eda_data['education'], eda_data['region'], normalize='index')
print(f"\n교육 수준별 지역 비율:\n{cross_tab_pct.round(3)}")

# 교차표 히트맵
plt.figure(figsize=(10, 6))
sns.heatmap(cross_tab, annot=True, fmt='d', cmap='Blues')
plt.title('교육 수준 x 지역 교차표')
plt.show()

# 6. 통계적 검정
print(f"\n6. 통계적 검정")

from scipy import stats

# 교육 수준별 소득 차이 검정 (ANOVA)
education_groups = [group['income'].values for name, group in eda_data.groupby('education')]
f_stat, p_value = stats.f_oneway(*education_groups)
print(f"교육 수준별 소득 차이 ANOVA: F={f_stat:.2f}, p={p_value:.4f}")

# 프리미엄 고객과 일반 고객의 만족도 차이 (t-test)
premium_satisfaction = eda_data[eda_data['is_premium']]['satisfaction_score']
regular_satisfaction = eda_data[~eda_data['is_premium']]['satisfaction_score']
t_stat, p_value = stats.ttest_ind(premium_satisfaction, regular_satisfaction)
print(f"프리미엄 vs 일반 고객 만족도 t-test: t={t_stat:.2f}, p={p_value:.4f}")

# 카이제곱 검정 (교육 수준과 지역의 독립성)
chi2, p_value, dof, expected = stats.chi2_contingency(cross_tab)
print(f"교육 수준과 지역 독립성 카이제곱 검정: χ²={chi2:.2f}, p={p_value:.4f}")

Data visualization

Using Matplotlib and Seaborn

  
# 데이터 시각화
print("=== 데이터 시각화 ===")

# 1. Matplotlib 기초
print("1. Matplotlib 기본 시각화")

# 기본 라인 플롯
months = ['1월', '2월', '3월', '4월', '5월', '6월']
sales = [100, 120, 140, 110, 160, 180]
profit = [20, 25, 30, 22, 35, 40]

plt.figure(figsize=(12, 8))

# 서브플롯 생성
plt.subplot(2, 2, 1)
plt.plot(months, sales, marker='o', linewidth=2, label='매출')
plt.plot(months, profit, marker='s', linewidth=2, label='이익')
plt.title('월별 매출 및 이익')
plt.xlabel('월')
plt.ylabel('금액 (백만원)')
plt.legend()
plt.grid(True, alpha=0.3)

# 막대 그래프
plt.subplot(2, 2, 2)
x_pos = np.arange(len(months))
plt.bar(x_pos - 0.2, sales, 0.4, label='매출', alpha=0.7)
plt.bar(x_pos + 0.2, profit, 0.4, label='이익', alpha=0.7)
plt.title('월별 매출 및 이익 (막대그래프)')
plt.xlabel('월')
plt.ylabel('금액 (백만원)')
plt.xticks(x_pos, months)
plt.legend()

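# Pie chart
plt.subplot(2, 2, 3)
regions = ['서울', '경기', '부산', '기타']
customer_counts = [300, 250, 150, 300]  # renamed so it does not overwrite the earlier `customers` DataFrame
plt.pie(customer_counts, labels=regions, autopct='%1.1f%%', startangle=90)
plt.title('Customer distribution by region')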

# 산점도
plt.subplot(2, 2, 4)
x = np.random.randn(100)
y = 2 * x + np.random.randn(100)
colors = np.random.randn(100)
plt.scatter(x, y, c=colors, alpha=0.6, cmap='viridis')
plt.title('산점도 예제')
plt.xlabel('X 변수')
plt.ylabel('Y 변수')
plt.colorbar()

plt.tight_layout()
plt.show()

# 2. Seaborn 고급 시각화
print(f"\n2. Seaborn 고급 시각화")

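# Uses the sample data created earlier (eda_data)
plt.style.use('seaborn-v0_8')  # seaborn style sheet (named this way since Matplotlib 3.6)
# A style sheet can override font settings, so re-apply the Korean font afterwards
plt.rcParams['font.family'] = 'AppleGothic'
plt.rcParams['axes.unicode_minus'] = False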

# 분포 플롯
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# 히스토그램 + KDE
sns.histplot(data=eda_data, x='income', kde=True, ax=axes[0, 0])
axes[0, 0].set_title('소득 분포 (히스토그램 + KDE)')

# 박스플롯
sns.boxplot(data=eda_data, x='education', y='income', ax=axes[0, 1])
axes[0, 1].set_title('교육 수준별 소득')
axes[0, 1].tick_params(axis='x', rotation=45)

# 바이올린 플롯
sns.violinplot(data=eda_data, x='region', y='satisfaction_score', ax=axes[1, 0])
axes[1, 0].set_title('지역별 만족도 분포')

# 포인트 플롯
sns.pointplot(data=eda_data, x='education', y='purchase_frequency',
              hue='is_premium', ax=axes[1, 1])
axes[1, 1].set_title('교육 수준별 구매빈도 (프리미엄 여부)')
axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

# 3. 관계 시각화
print(f"\n3. 관계 시각화")

# 페어플롯
numeric_subset = eda_data[['age', 'income', 'purchase_frequency', 'satisfaction_score', 'is_premium']]
g = sns.pairplot(numeric_subset, hue='is_premium', diag_kind='kde')
g.fig.suptitle('변수 간 관계 (페어플롯)', y=1.02)
plt.show()

# 회귀선이 있는 산점도
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
sns.regplot(data=eda_data, x='age', y='income', scatter_kws={'alpha':0.6})
plt.title('나이 vs 소득 (회귀선 포함)')

plt.subplot(1, 3, 2)
sns.regplot(data=eda_data, x='income', y='purchase_frequency', scatter_kws={'alpha':0.6})
plt.title('소득 vs 구매빈도')

plt.subplot(1, 3, 3)
sns.regplot(data=eda_data, x='purchase_frequency', y='satisfaction_score', scatter_kws={'alpha':0.6})
plt.title('구매빈도 vs 만족도')

plt.tight_layout()
plt.show()

# 4. 고급 시각화 기법
print(f"\n4. 고급 시각화")

# 히트맵 (상관관계)
plt.figure(figsize=(10, 8))
correlation = numeric_subset.select_dtypes(include=[np.number]).corr()
mask = np.triu(np.ones_like(correlation, dtype=bool))  # 상삼각 마스킹
sns.heatmap(correlation, mask=mask, annot=True, cmap='coolwarm', center=0,
            square=True, linewidths=0.5, cbar_kws={"shrink": 0.8})
plt.title('상관관계 히트맵')
plt.show()

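# FacetGrid: one histogram per (region, education) cell
g = sns.FacetGrid(eda_data, col='education', row='region', margin_titles=True, height=4)
g.map(plt.hist, 'satisfaction_score', bins=15, alpha=0.7)
g.set_axis_labels('satisfaction_score', 'count')  # no hue is used, so there is no legend to add
plt.show()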

# 5. Dashboard-style layout (static Matplotlib subplots, not interactive)
print(f"\n5. 고급 plotting 기법")

# 서브플롯을 이용한 대시보드 스타일
fig = plt.figure(figsize=(20, 15))

# 1행: 전체 개요
ax1 = plt.subplot(3, 4, (1, 2))
monthly_trend = pd.DataFrame({
    'month': range(1, 13),
    'customers': np.random.poisson(100, 12),
    'revenue': np.random.normal(1000, 200, 12)
})
ax1_twin = ax1.twinx()
line1 = ax1.plot(monthly_trend['month'], monthly_trend['customers'], 'b-o', label='고객 수')
line2 = ax1_twin.plot(monthly_trend['month'], monthly_trend['revenue'], 'r-s', label='매출')
ax1.set_xlabel('월')
ax1.set_ylabel('고객 수', color='b')
ax1_twin.set_ylabel('매출 (만원)', color='r')
ax1.set_title('월별 고객 수 및 매출')
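ax1.grid(True, alpha=0.3)
# Twin axes keep separate legends, so merge both line handles into one
ax1.legend(line1 + line2, [ln.get_label() for ln in line1 + line2], loc='upper left')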

# 고객 분포
ax2 = plt.subplot(3, 4, (3, 4))
education_counts = eda_data['education'].value_counts()
wedges, texts, autotexts = ax2.pie(education_counts.values, labels=education_counts.index,
                                  autopct='%1.1f%%', startangle=90)
ax2.set_title('교육 수준별 고객 분포')

# 2행: 세부 분석
ax3 = plt.subplot(3, 4, 5)
sns.boxplot(data=eda_data, x='region', y='income', ax=ax3)
ax3.set_title('지역별 소득 분포')
ax3.tick_params(axis='x', rotation=45)

ax4 = plt.subplot(3, 4, 6)
sns.scatterplot(data=eda_data, x='age', y='income', hue='education', ax=ax4)
ax4.set_title('나이-소득 관계 (교육별)')

ax5 = plt.subplot(3, 4, 7)
satisfaction_by_region = eda_data.groupby('region')['satisfaction_score'].mean()
ax5.bar(satisfaction_by_region.index, satisfaction_by_region.values)
ax5.set_title('지역별 평균 만족도')
ax5.tick_params(axis='x', rotation=45)

ax6 = plt.subplot(3, 4, 8)
premium_stats = eda_data.groupby('is_premium')['purchase_frequency'].mean()
ax6.bar(['일반', '프리미엄'], premium_stats.values, color=['lightblue', 'orange'])
ax6.set_title('고객 유형별 구매빈도')

# 3행: 상세 분석
ax7 = plt.subplot(3, 4, (9, 10))
pivot_data = eda_data.pivot_table(values='satisfaction_score',
                                 index='education', columns='region', aggfunc='mean')
sns.heatmap(pivot_data, annot=True, fmt='.2f', cmap='YlOrRd', ax=ax7)
ax7.set_title('교육-지역별 평균 만족도')

ax8 = plt.subplot(3, 4, 11)
sns.violinplot(data=eda_data, x='education', y='purchase_frequency', ax=ax8)
ax8.set_title('교육별 구매빈도 분포')
ax8.tick_params(axis='x', rotation=45)

ax9 = plt.subplot(3, 4, 12)
correlation_subset = eda_data[['age', 'income', 'purchase_frequency']].corr()
sns.heatmap(correlation_subset, annot=True, cmap='coolwarm', center=0, ax=ax9)
ax9.set_title('주요 변수 상관관계')

plt.tight_layout()
plt.show()
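
The correlation heatmap above hides the upper triangle with `np.triu`. The following is a minimal sketch of what that mask does, using a small made-up DataFrame (`toy` is hypothetical and stands in for `eda_data`). The same mask can also be used to list each variable pair exactly once.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the article's eda_data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'age': rng.normal(40, 10, 200),
    'income': rng.normal(5000, 1000, 200),
})
toy['spend'] = toy['income'] * 0.1 + rng.normal(0, 50, 200)

corr = toy.corr()

# True marks the cells that sns.heatmap(mask=...) would hide:
# the diagonal and everything above it
mask = np.triu(np.ones_like(corr, dtype=bool))

# DataFrame.mask blanks the hidden cells, leaving each pair exactly once
lower = corr.mask(mask)
pairs = lower.stack().dropna()
print(pairs.sort_values(ascending=False))
```

For three variables the mask hides six cells (three on the diagonal, three above it), which leaves the three unique pairs.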

Hands-On Data Analysis Project

Capstone Project: Customer Segmentation Analysis

  
# 실전 데이터 분석 프로젝트: 고객 세분화
print("=== 실전 프로젝트: 고객 세분화 분석 ===")

class CustomerSegmentationAnalysis:
    def __init__(self):
        self.data = None
        self.processed_data = None
        self.segments = None

    def generate_realistic_data(self, n_customers=2000):
        """현실적인 고객 데이터 생성"""
        np.random.seed(42)

        # 기본 고객 정보
        data = {
            'customer_id': range(1, n_customers + 1),
            'age': np.random.normal(40, 15, n_customers).clip(18, 80).astype(int),  # keep ages in a plausible adult range
            'gender': np.random.choice(['M', 'F'], n_customers),
            'city': np.random.choice(['서울', '경기', '부산', '대구', '기타'],
                                   n_customers, p=[0.3, 0.25, 0.15, 0.1, 0.2]),
            'signup_date': pd.date_range('2020-01-01', '2023-12-31', periods=n_customers)
        }

        df = pd.DataFrame(data)

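        # Adjust income by age (unit: 만원). lognormal(8.5, 0.5) has a median near 4,900;
        # the original mean of 10.5 pushed most values past the 20,000 cap
        base_income = np.random.lognormal(8.5, 0.5, n_customers)
        age_factor = (df['age'] - 20) / 40  # 0 at age 20, 1 at age 60
        df['annual_income'] = base_income * (0.5 + age_factor)
        df['annual_income'] = df['annual_income'].clip(lower=2000, upper=20000)  # realistic range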

        # 구매 행동 데이터
        df['total_purchases'] = np.random.negative_binomial(20, 0.3, n_customers)
        df['avg_order_value'] = np.random.gamma(2, 50, n_customers)
        df['days_since_last_purchase'] = np.random.exponential(45, n_customers).astype(int)

        # 소득에 따른 구매액 조정
        income_factor = (df['annual_income'] - df['annual_income'].min()) / \
                       (df['annual_income'].max() - df['annual_income'].min())
        df['avg_order_value'] = df['avg_order_value'] * (0.5 + income_factor)

        # 총 구매액 계산
        df['total_spent'] = df['total_purchases'] * df['avg_order_value']

        # 고객 만족도 및 충성도
        df['satisfaction_score'] = np.random.beta(2, 1, n_customers) * 5
        df['loyalty_score'] = np.random.normal(3, 1, n_customers).clip(1, 5)

        # 디지털 활동
        df['website_visits'] = np.random.poisson(15, n_customers)
        df['email_opens'] = np.random.binomial(50, 0.3, n_customers)
        df['social_media_engagement'] = np.random.exponential(2, n_customers)

        # 일부 결측값 추가 (현실적)
        missing_indices = np.random.choice(df.index, int(0.05 * n_customers), replace=False)
        df.loc[missing_indices, 'satisfaction_score'] = np.nan

        self.data = df
        return df

    def preprocess_data(self):
        """데이터 전처리"""
        print("데이터 전처리 시작...")

        df = self.data.copy()

        # Handle missing values (assign the result back; chained inplace fillna warns in pandas 2.x)
        df['satisfaction_score'] = df['satisfaction_score'].fillna(df['satisfaction_score'].median())

        # 이상값 처리 (IQR 방법)
        def cap_outliers(series, factor=1.5):
            Q1 = series.quantile(0.25)
            Q3 = series.quantile(0.75)
            IQR = Q3 - Q1
            lower = Q1 - factor * IQR
            upper = Q3 + factor * IQR
            return series.clip(lower, upper)

        numerical_cols = ['annual_income', 'total_spent', 'avg_order_value',
                         'website_visits', 'social_media_engagement']

        for col in numerical_cols:
            df[col] = cap_outliers(df[col])

        # 파생 변수 생성
        df['customer_lifetime_value'] = df['total_spent'] / \
                                       ((pd.Timestamp.now() - df['signup_date']).dt.days / 365.25 + 1)

        df['purchase_frequency'] = df['total_purchases'] / \
                                  ((pd.Timestamp.now() - df['signup_date']).dt.days / 30.44 + 1)

        df['digital_engagement'] = (df['website_visits'] + df['email_opens'] +
                                   df['social_media_engagement'])

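        # include_lowest=True keeps customers with 0 days since last purchase in the first bin;
        # without it they become NaN and the astype(int) below fails
        df['recency_score'] = pd.cut(df['days_since_last_purchase'],
                                   bins=[0, 30, 90, 180, float('inf')],
                                   labels=[4, 3, 2, 1], include_lowest=True)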

        df['frequency_score'] = pd.qcut(df['purchase_frequency'],
                                      q=4, labels=[1, 2, 3, 4])

        df['monetary_score'] = pd.qcut(df['total_spent'],
                                     q=4, labels=[1, 2, 3, 4])

        # RFM 스코어 계산
        df['rfm_score'] = (df['recency_score'].astype(int) +
                          df['frequency_score'].astype(int) +
                          df['monetary_score'].astype(int)) / 3

        self.processed_data = df
        print("전처리 완료!")
        return df

    def exploratory_analysis(self):
        """탐색적 데이터 분석"""
        print("탐색적 데이터 분석 시작...")

        df = self.processed_data

        # 기본 통계
        print("=== 기본 통계 ===")
        print(f"총 고객 수: {len(df):,}")
        print(f"평균 연령: {df['age'].mean():.1f}세")
        print(f"평균 연소득: {df['annual_income'].mean():,.0f}만원")
        print(f"평균 총 구매액: {df['total_spent'].mean():,.0f}원")
        print(f"평균 구매빈도: {df['purchase_frequency'].mean():.2f}회/월")

        # 시각화
        fig, axes = plt.subplots(3, 3, figsize=(20, 18))

        # 1. 연령 분포
        axes[0, 0].hist(df['age'], bins=30, alpha=0.7, edgecolor='black')
        axes[0, 0].set_title('연령 분포')
        axes[0, 0].set_xlabel('나이')

        # 2. 소득 분포
        axes[0, 1].hist(df['annual_income'], bins=30, alpha=0.7, edgecolor='black')
        axes[0, 1].set_title('연소득 분포')
        axes[0, 1].set_xlabel('연소득 (만원)')

        # 3. 총 구매액 분포
        axes[0, 2].hist(df['total_spent'], bins=30, alpha=0.7, edgecolor='black')
        axes[0, 2].set_title('총 구매액 분포')
        axes[0, 2].set_xlabel('총 구매액 (원)')

        # 4. 성별 분포
        gender_counts = df['gender'].value_counts()
        axes[1, 0].pie(gender_counts.values, labels=gender_counts.index, autopct='%1.1f%%')
        axes[1, 0].set_title('성별 분포')

        # 5. 지역별 분포
        city_counts = df['city'].value_counts()
        axes[1, 1].bar(city_counts.index, city_counts.values)
        axes[1, 1].set_title('지역별 고객 수')
        axes[1, 1].tick_params(axis='x', rotation=45)

        # 6. 소득 vs 총 구매액
        axes[1, 2].scatter(df['annual_income'], df['total_spent'], alpha=0.6)
        axes[1, 2].set_xlabel('연소득 (만원)')
        axes[1, 2].set_ylabel('총 구매액 (원)')
        axes[1, 2].set_title('소득 vs 총 구매액')

        # 7. RFM 점수 분포
        axes[2, 0].hist(df['rfm_score'], bins=20, alpha=0.7, edgecolor='black')
        axes[2, 0].set_title('RFM 점수 분포')
        axes[2, 0].set_xlabel('RFM 점수')

        # 8. 구매빈도 vs 만족도
        axes[2, 1].scatter(df['purchase_frequency'], df['satisfaction_score'], alpha=0.6)
        axes[2, 1].set_xlabel('구매빈도 (회/월)')
        axes[2, 1].set_ylabel('만족도')
        axes[2, 1].set_title('구매빈도 vs 만족도')

        # 9. 디지털 참여도 분포
        axes[2, 2].hist(df['digital_engagement'], bins=30, alpha=0.7, edgecolor='black')
        axes[2, 2].set_title('디지털 참여도 분포')
        axes[2, 2].set_xlabel('디지털 참여도')

        plt.tight_layout()
        plt.show()

        # 상관관계 분석
        numeric_columns = ['age', 'annual_income', 'total_spent', 'purchase_frequency',
                          'satisfaction_score', 'loyalty_score', 'digital_engagement', 'rfm_score']

        correlation_matrix = df[numeric_columns].corr()

        plt.figure(figsize=(12, 10))
        sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
                   square=True, linewidths=0.5)
        plt.title('변수 간 상관관계')
        plt.show()

        return df

    def customer_segmentation(self):
        """고객 세분화 수행"""
        print("고객 세분화 분석 시작...")

        from sklearn.cluster import KMeans
        from sklearn.preprocessing import StandardScaler

        df = self.processed_data

        # 세분화를 위한 특성 선택
        segmentation_features = ['annual_income', 'total_spent', 'purchase_frequency',
                               'satisfaction_score', 'digital_engagement', 'rfm_score']

        X = df[segmentation_features].copy()

        # 특성 표준화
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)

        # 최적 클러스터 수 찾기 (엘보우 방법)
        inertias = []
        k_range = range(2, 11)

        for k in k_range:
            kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
            kmeans.fit(X_scaled)
            inertias.append(kmeans.inertia_)

        # 엘보우 그래프
        plt.figure(figsize=(10, 6))
        plt.plot(k_range, inertias, 'bo-')
        plt.xlabel('클러스터 수 (k)')
        plt.ylabel('이너셔 (Inertia)')
        plt.title('엘보우 방법 - 최적 클러스터 수 찾기')
        plt.grid(True, alpha=0.3)
        plt.show()

        # K-means 클러스터링 (k=5 선택)
        optimal_k = 5
        kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
        cluster_labels = kmeans.fit_predict(X_scaled)

        # 클러스터 라벨을 원본 데이터에 추가
        df['cluster'] = cluster_labels

        # 클러스터별 특성 분석
        cluster_summary = df.groupby('cluster')[segmentation_features].mean().round(2)

        print("=== 클러스터별 특성 ===")
        print(cluster_summary)

        # 클러스터 크기
        cluster_counts = df['cluster'].value_counts().sort_index()
        print(f"\n=== 클러스터별 고객 수 ===")
        for cluster, count in cluster_counts.items():
            percentage = count / len(df) * 100
            print(f"클러스터 {cluster}: {count:,}명 ({percentage:.1f}%)")

        # 클러스터 시각화
        fig, axes = plt.subplots(2, 3, figsize=(18, 12))

        # 소득 vs 총 구매액
        scatter = axes[0, 0].scatter(df['annual_income'], df['total_spent'],
                                   c=df['cluster'], cmap='tab10', alpha=0.6)
        axes[0, 0].set_xlabel('연소득 (만원)')
        axes[0, 0].set_ylabel('총 구매액 (원)')
        axes[0, 0].set_title('클러스터별 소득 vs 구매액')
        plt.colorbar(scatter, ax=axes[0, 0])

        # 구매빈도 vs 만족도
        scatter = axes[0, 1].scatter(df['purchase_frequency'], df['satisfaction_score'],
                                   c=df['cluster'], cmap='tab10', alpha=0.6)
        axes[0, 1].set_xlabel('구매빈도 (회/월)')
        axes[0, 1].set_ylabel('만족도')
        axes[0, 1].set_title('클러스터별 구매빈도 vs 만족도')
        plt.colorbar(scatter, ax=axes[0, 1])

        # RFM 점수 vs 디지털 참여도
        scatter = axes[0, 2].scatter(df['rfm_score'], df['digital_engagement'],
                                   c=df['cluster'], cmap='tab10', alpha=0.6)
        axes[0, 2].set_xlabel('RFM 점수')
        axes[0, 2].set_ylabel('디지털 참여도')
        axes[0, 2].set_title('클러스터별 RFM vs 디지털 참여도')
        plt.colorbar(scatter, ax=axes[0, 2])

        # 클러스터별 박스플롯
        sns.boxplot(data=df, x='cluster', y='annual_income', ax=axes[1, 0])
        axes[1, 0].set_title('클러스터별 연소득 분포')

        sns.boxplot(data=df, x='cluster', y='total_spent', ax=axes[1, 1])
        axes[1, 1].set_title('클러스터별 총 구매액 분포')

        sns.boxplot(data=df, x='cluster', y='satisfaction_score', ax=axes[1, 2])
        axes[1, 2].set_title('클러스터별 만족도 분포')

        plt.tight_layout()
        plt.show()

        self.segments = df
        return df, cluster_summary

    def interpret_segments(self):
        """세그먼트 해석 및 네이밍"""
        print("=== 고객 세그먼트 해석 ===")

        df = self.segments

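        # Illustrative names only: K-means cluster numbers are arbitrary, so check each
        # label against cluster_summary before relying on this mapping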
        segment_names = {
            0: "가치 지향 고객",      # 중간 소득, 높은 만족도
            1: "VIP 고객",          # 높은 소득, 높은 구매액
            2: "신규/저관여 고객",    # 낮은 구매빈도, 낮은 참여도
            3: "충성 고객",         # 높은 충성도, 꾸준한 구매
            4: "디지털 활성 고객"     # 높은 디지털 참여도
        }

        # 세그먼트별 상세 분석
        for cluster in sorted(df['cluster'].unique()):
            cluster_data = df[df['cluster'] == cluster]
            size = len(cluster_data)
            percentage = size / len(df) * 100

            print(f"\n🎯 {segment_names[cluster]} (클러스터 {cluster})")
            print(f"   크기: {size:,}명 ({percentage:.1f}%)")
            print(f"   평균 나이: {cluster_data['age'].mean():.1f}세")
            print(f"   평균 연소득: {cluster_data['annual_income'].mean():,.0f}만원")
            print(f"   평균 총 구매액: {cluster_data['total_spent'].mean():,.0f}원")
            print(f"   평균 구매빈도: {cluster_data['purchase_frequency'].mean():.2f}회/월")
            print(f"   평균 만족도: {cluster_data['satisfaction_score'].mean():.2f}/5")
            print(f"   평균 디지털 참여도: {cluster_data['digital_engagement'].mean():.1f}")
            print(f"   주요 지역: {cluster_data['city'].mode().iloc[0]}")

            # 마케팅 전략 제안
            if cluster == 0:  # 가치 지향 고객
                strategy = "품질 대비 가격 어필, 합리적 프로모션"
            elif cluster == 1:  # VIP 고객
                strategy = "프리미엄 서비스, 개인화된 상품 추천"
            elif cluster == 2:  # 신규/저관여 고객
                strategy = "온보딩 프로그램, 할인 쿠폰으로 재구매 유도"
            elif cluster == 3:  # 충성 고객
                strategy = "로열티 프로그램, 추천 보상 시스템"
            else:  # 디지털 활성 고객
                strategy = "디지털 마케팅, 소셜미디어 캠페인"

            print(f"   💡 추천 전략: {strategy}")

    def business_insights(self):
        """비즈니스 인사이트 도출"""
        print("\n=== 비즈니스 인사이트 및 권장사항 ===")

        df = self.segments

        # 1. 매출 기여도 분석
        cluster_revenue = df.groupby('cluster')['total_spent'].sum().sort_values(ascending=False)
        total_revenue = df['total_spent'].sum()

        print("1. 💰 클러스터별 매출 기여도")
        for cluster, revenue in cluster_revenue.items():
            percentage = revenue / total_revenue * 100
            print(f"   클러스터 {cluster}: {revenue:,.0f}원 ({percentage:.1f}%)")

        # 2. 고객 생애 가치 분석
        cluster_clv = df.groupby('cluster')['customer_lifetime_value'].mean().sort_values(ascending=False)
        print(f"\n2. 📈 클러스터별 평균 고객 생애 가치")
        for cluster, clv in cluster_clv.items():
            print(f"   클러스터 {cluster}: {clv:,.0f}원/년")

        # 3. 이탈 위험 고객 식별
        high_risk = df[df['days_since_last_purchase'] > 90]
        print(f"\n3. ⚠️ 이탈 위험 고객 (90일 이상 미구매)")
        print(f"   총 {len(high_risk):,}명 ({len(high_risk)/len(df)*100:.1f}%)")

        risk_by_cluster = high_risk['cluster'].value_counts().sort_index()
        for cluster, count in risk_by_cluster.items():
            cluster_size = len(df[df['cluster'] == cluster])
            risk_rate = count / cluster_size * 100
            print(f"   클러스터 {cluster}: {count}명 (해당 클러스터의 {risk_rate:.1f}%)")

        # 4. 성장 기회 분석
        print(f"\n4. 🚀 성장 기회")

        # 높은 만족도 but 낮은 구매빈도
        opportunity_customers = df[(df['satisfaction_score'] > 4) &
                                 (df['purchase_frequency'] < df['purchase_frequency'].median())]
        print(f"   만족도는 높지만 구매빈도가 낮은 고객: {len(opportunity_customers)}명")

        # 높은 디지털 참여도 but 낮은 구매액
        digital_opportunity = df[(df['digital_engagement'] > df['digital_engagement'].median()) &
                               (df['total_spent'] < df['total_spent'].median())]
        print(f"   디지털 참여도는 높지만 구매액이 낮은 고객: {len(digital_opportunity)}명")

        # 5. 권장 액션 아이템
        print(f"\n5. 📋 권장 액션 아이템")
        recommendations = [
            "VIP 고객(클러스터 1)에게 개인화된 프리미엄 서비스 제공",
            "신규 고객(클러스터 2) 온보딩 프로세스 개선",
            "이탈 위험 고객에게 타겟 리텐션 캠페인 실행",
            "디지털 활성 고객에게 온라인 전용 이벤트 제공",
            "만족도 높은 고객들의 추천 시스템 활용",
            "클러스터별 맞춤형 이메일 마케팅 캠페인 실행"
        ]

        for i, rec in enumerate(recommendations, 1):
            print(f"   {i}. {rec}")

        return df

    def run_full_analysis(self):
        """전체 분석 파이프라인 실행"""
        print("🔍 고객 세분화 분석 프로젝트 시작")
        print("=" * 50)

        # 1. 데이터 생성
        self.generate_realistic_data(2000)
        print(f"✅ 데이터 생성 완료: {len(self.data):,}명의 고객 데이터")

        # 2. 전처리
        self.preprocess_data()
        print("✅ 데이터 전처리 완료")

        # 3. 탐색적 분석
        self.exploratory_analysis()
        print("✅ 탐색적 데이터 분석 완료")

        # 4. 세분화
        self.customer_segmentation()
        print("✅ 고객 세분화 완료")

        # 5. 해석
        self.interpret_segments()
        print("✅ 세그먼트 해석 완료")

        # 6. 비즈니스 인사이트
        self.business_insights()
        print("✅ 비즈니스 인사이트 도출 완료")

        print("\n🎉 고객 세분화 분석 프로젝트 완료!")

        return self.segments

# 프로젝트 실행
analyzer = CustomerSegmentationAnalysis()
final_results = analyzer.run_full_analysis()
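
K-means assigns cluster numbers arbitrarily, so a fixed number-to-name dictionary may not match the clusters that actually come out of a given run. One possible alternative, sketched below on a made-up table, is to rank clusters by a business metric and assign names by rank. The `customers` data and the `tier_names` list are assumptions for illustration and are not part of the project code.

```python
import pandas as pd

# Hypothetical clustered customers; 'cluster' stands in for K-means output
customers = pd.DataFrame({
    'cluster':     [0, 0, 1, 1, 2, 2],
    'total_spent': [120, 80, 900, 1100, 400, 600],
})

# Rank clusters by average spend, highest first
order = (customers.groupby('cluster')['total_spent']
         .mean()
         .sort_values(ascending=False)
         .index)

# Assumed tier names, one per cluster, listed from highest to lowest spend
tier_names = ['VIP', 'Mid-value', 'Low-engagement']
name_by_cluster = dict(zip(order, tier_names))

customers['segment'] = customers['cluster'].map(name_by_cluster)
print(name_by_cluster)
```

Ranking by a single metric is a simplification. In practice several columns of `cluster_summary` would probably feed into the naming.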

graph TD
    A[데이터 수집] --> B[데이터 탐색]
    B --> C[데이터 정제]
    C --> D[특성 엔지니어링]
    D --> E[탐색적 데이터 분석]
    E --> F[고객 세분화]
    F --> G[세그먼트 해석]
    G --> H[비즈니스 인사이트]

    subgraph "데이터 정제"
        C1[결측값 처리]
        C2[이상값 처리]
        C3[데이터 타입 변환]
    end

    subgraph "EDA"
        E1[기술 통계]
        E2[분포 분석]
        E3[상관관계 분석]
        E4[시각화]
    end

    subgraph "고객 세분화"
        F1[특성 선택]
        F2[클러스터링]
        F3[세그먼트 검증]
    end

    C --> C1
    C --> C2
    C --> C3

    E --> E1
    E --> E2
    E --> E3
    E --> E4

    F --> F1
    F --> F2
    F --> F3

    style A fill:#e1f5fe
    style H fill:#f3e5f5
    style F fill:#fff3e0
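
The cleaning and feature-engineering stages of the flow above can be sketched as small functions chained with `DataFrame.pipe`. This is one possible layout rather than the structure of the project class. The function names and the `raw` data are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_missing(df, col):
    """Data cleaning: replace missing values with the column median."""
    out = df.copy()
    out[col] = out[col].fillna(out[col].median())
    return out

def cap_outliers(df, col, factor=1.5):
    """Data cleaning: clip values that fall outside the IQR fences."""
    out = df.copy()
    q1, q3 = out[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    out[col] = out[col].clip(q1 - factor * iqr, q3 + factor * iqr)
    return out

def add_features(df):
    """Feature engineering: derive total spend per customer."""
    out = df.copy()
    out['total_spent'] = out['purchases'] * out['avg_order_value']
    return out

# Hypothetical raw data with one missing value and one extreme value
raw = pd.DataFrame({
    'purchases': [3, 5, 4, 6, 5],
    'avg_order_value': [50.0, np.nan, 55.0, 60.0, 5000.0],
})

clean = (raw
         .pipe(fill_missing, 'avg_order_value')
         .pipe(cap_outliers, 'avg_order_value')
         .pipe(add_features))
print(clean)
```

Each step returns a copy, so `raw` stays untouched and the stages can be tested one at a time.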

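🎯 Key Takeaways

NumPy essentials

Array operations: fast numerical computation through vectorized operations
Broadcasting: operations between arrays of different shapes
Indexing: selecting data with boolean and fancy indexing
Statistical functions: basic statistics such as mean, standard deviation, and quantiles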
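
The four NumPy features listed above can be seen side by side in a short sketch. The prices reuse the sample order values from the start of the post. The quantities and discount rates are made up for illustration.

```python
import numpy as np

prices = np.array([189000, 129000, 459000])   # sample order prices (won)
quantities = np.array([2, 1, 1])              # hypothetical quantities

# Vectorized arithmetic: element-wise, no Python loop
revenue = prices * quantities

# Broadcasting: a (3, 1) column against a (2,) row gives a (3, 2) table
discount_rates = np.array([0.0, 0.1])
discounted = prices[:, np.newaxis] * (1 - discount_rates)

# Boolean indexing: keep only orders above 150,000 won
expensive = prices[prices > 150000]

# Fancy indexing: pick elements by position, in any order
picked = prices[[2, 0]]

# Statistical functions
print(revenue, discounted.shape, expensive, picked)
print(prices.mean(), prices.std(), np.percentile(prices, 50))
```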

Pandas essentials

DataFrame/Series: tabular data structures
Data manipulation: filtering, grouping, pivoting, merging
Missing values: dropna, fillna, interpolate
Time series: handling date and time data
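
As a small illustration of the missing-value and time-series items above, the sketch below applies `dropna`, `fillna`, and `interpolate` to the same series and then resamples it. The `sales` values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical daily sales with two gaps
sales = pd.Series(
    [100.0, np.nan, 120.0, np.nan, 160.0, 180.0],
    index=pd.date_range('2024-01-01', periods=6, freq='D'),
)

dropped = sales.dropna()                # remove the missing days
filled = sales.fillna(sales.median())   # replace gaps with one constant
interpolated = sales.interpolate()      # linear fill between neighbours

# Time series: resample daily values into 3-day totals
three_day_total = interpolated.resample('3D').sum()
print(interpolated)
print(three_day_total)
```

Interpolation suits ordered data like this because it respects the trend between neighbouring days. A constant fill ignores the order entirely.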

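Preprocessing essentials

Missing values: deletion, replacement, interpolation
Outliers: the IQR and Z-score methods
Transformations: normalization, standardization, log transform
Feature engineering: derived variables, binning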
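
The two outlier rules named above can disagree on small samples, so it may help to see them on the same data. This is a minimal sketch with made-up order values. The 2.5 threshold for the Z-score rule is an assumption chosen for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical order values with one obvious outlier
values = pd.Series([48, 50, 52, 49, 51, 50, 47, 53, 50, 500], dtype=float)

# IQR rule: flag points beyond 1.5 * IQR from the quartiles
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_outliers = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Z-score rule: flag points far from the mean in standard-deviation units
z_scores = (values - values.mean()) / values.std()
z_outliers = z_scores.abs() > 2.5

# Log transform compresses the long right tail
logged = np.log1p(values)

print(values[iqr_outliers].tolist(), values[z_outliers].tolist())
```

With only ten points the outlier inflates the standard deviation so much that its own Z-score stays below 3, so the common threshold of 3 would miss it here. The IQR rule is based on quartiles and is less affected.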

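EDA essentials

Descriptive statistics: central tendency, variance, distribution shape
Visualization: histograms, box plots, scatter plots, heatmaps
Relationship analysis: correlation, cross-tabulation, group comparisons
Pattern discovery: trends, seasonality, anomalies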

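Project workflow

Problem definition: set the business goal
Data collection: gather data from multiple sources
Data cleaning: improve quality and consistency
Analysis: statistical analysis and visualization
Insights: turn findings into business value
Communication: share results and action items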

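🔗 Series Navigation

Basics (1-7)

Intermediate (8-12)

Advanced (13-16)

Practical (17-20)

Python Mastery #17 - Web Development: Getting Started with Flask/Django
Python Mastery #18 - Data Analysis: Working with Data Using Pandas and NumPy (this post)
Python Mastery #19 - Machine Learning Basics: Getting Started with AI Using scikit-learn
Python Mastery #20 - Real-World Projects and Best Practices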

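⚠️ Common Beginner Mistakes

These are the mistakes that come up most often when people first learn data analysis. Knowing them in advance makes them easier to avoid.

1. Starting the analysis without exploring the data

  
# ❌ Wrong: analyzing without knowing the data's structure
df = pd.read_csv('sales_data.csv')
result = df.groupby('category').sum()  # column types and contents are unknown

# ✅ Right: explore first, then analyze
df = pd.read_csv('sales_data.csv')
df.info()  # column dtypes and non-null counts (prints directly)
print(df.describe())  # descriptive statistics
print(df.head())  # sample rows
print(df.isnull().sum())  # missing values per column
# Now start the analysis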

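2. Choosing a missing-value strategy blindly

  
# ❌ Wrong: always dropping rows, or always filling with the mean
df.dropna(inplace=True)  # can discard a lot of data
df.fillna(df.mean(numeric_only=True), inplace=True)  # the mean is often a poor substitute

# ✅ Right: inspect the missing-value pattern, then choose a method
missing_pattern = df.isnull().sum() / len(df)
print("Missing ratio per column:", missing_pattern)

if missing_pattern['price'] < 0.05:  # under 5%: drop the rows
    df = df.dropna(subset=['price'])
else:  # otherwise fill with the median of the same category
    # transform returns one value per row, so it aligns with df's index
    category_median = df.groupby('category')['price'].transform('median')
    df['price'] = df['price'].fillna(category_median)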
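
Filling with a per-category statistic only works if the fill values line up with the rows. `fillna` aligns on index labels, so passing the raw `groupby(...).median()` result (indexed by category) fills nothing. The sketch below shows the difference on a made-up table.

```python
import numpy as np
import pandas as pd

# Hypothetical product table with two missing prices
df = pd.DataFrame({
    'category': ['A', 'A', 'A', 'B', 'B', 'B'],
    'price': [100.0, np.nan, 300.0, 10.0, 20.0, np.nan],
})

# groupby().median() is indexed by category ('A', 'B'), not by row
per_category = df.groupby('category')['price'].median()

# Row labels 0..5 never match 'A' or 'B', so nothing gets filled
misaligned = df['price'].fillna(per_category)

# transform('median') returns one value per original row, so it aligns
row_aligned = df.groupby('category')['price'].transform('median')
filled = df['price'].fillna(row_aligned)
print(filled.tolist())
```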

3. Treating dates as strings

  
# ❌ Wrong: keeping dates as text
df['order_date'] = df['order_date'].astype(str)  # e.g. '2024-01-15' stored as a string
# No date arithmetic and no .dt accessor

# ✅ Right: convert to datetime
df['order_date'] = pd.to_datetime(df['order_date'])
df['year'] = df['order_date'].dt.year
df['month'] = df['order_date'].dt.month
df['weekday'] = df['order_date'].dt.day_name()

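4. Ignoring memory efficiency when choosing data types

  
# ❌ Wrong: leaving everything at the default types
df['category'] = df['category']  # stays object dtype (memory-hungry)
df['user_id'] = df['user_id'].astype('int64')  # wider integer than needed

# ✅ Right: use compact types and measure the effect
before_mb = df.memory_usage(deep=True).sum() / 1024**2
df['category'] = df['category'].astype('category')  # categorical
df['user_id'] = df['user_id'].astype('int32')  # fine while ids stay below about 2.1 billion
after_mb = df.memory_usage(deep=True).sum() / 1024**2
print(f"Memory usage: {before_mb:.2f}MB -> {after_mb:.2f}MB")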
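
To see how much the type change actually saves, the comparison can be run on generated data. This is a sketch under assumed sizes (100,000 rows, three categories). The real savings depend on how many distinct values a column has.

```python
import numpy as np
import pandas as pd

# Hypothetical order log: few distinct categories, many rows
n = 100_000
rng = np.random.default_rng(0)
orders = pd.DataFrame({
    'category': rng.choice(['전자제품', '패션', '가전'], n),
    'user_id': rng.integers(1, 50_000, n).astype('int64'),
})

# deep=True counts the string contents, not just the pointers
before = orders.memory_usage(deep=True).sum()

optimized = orders.copy()
optimized['category'] = optimized['category'].astype('category')
# Safe only because every id here is far below the int32 maximum
optimized['user_id'] = optimized['user_id'].astype('int32')

after = optimized.memory_usage(deep=True).sum()
print(f"{before / 1024**2:.2f} MB -> {after / 1024**2:.2f} MB")
```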

5. Looping over a DataFrame row by row

  
# ❌ Wrong: a slow Python loop
total = 0
for index, row in df.iterrows():  # builds a Series per row, very slow
    total += row['price'] * row['quantity']

# ✅ Right: vectorized operations
total = (df['price'] * df['quantity']).sum()  # fast
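
The speed difference is easy to measure. The sketch below times both approaches on generated data and checks that they give the same total. The row count and value ranges are arbitrary assumptions, and the timings will vary by machine.

```python
import time
import numpy as np
import pandas as pd

# Hypothetical order lines
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'price': rng.integers(1_000, 100_000, 20_000),
    'quantity': rng.integers(1, 5, 20_000),
})

# Row-by-row loop
start = time.perf_counter()
loop_total = 0
for _, row in df.iterrows():
    loop_total += row['price'] * row['quantity']
loop_seconds = time.perf_counter() - start

# Vectorized column arithmetic
start = time.perf_counter()
vector_total = (df['price'] * df['quantity']).sum()
vector_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, vectorized: {vector_seconds:.4f}s")
```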

6. Failing to join group statistics back to the original data

  
# ❌ Wrong: the group result has no obvious way back into the original rows
group_stats = df.groupby('category')['price'].mean()
# Indexed by category, so it cannot be assigned to df directly

# ✅ Right: use transform or merge
df['category_avg_price'] = df.groupby('category')['price'].transform('mean')
# or
avg_prices = df.groupby('category')['price'].mean().reset_index()
df = df.merge(avg_prices, on='category', suffixes=('', '_avg'))  # adds a price_avg column
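
Both routes produce the same per-row values, which the sketch below checks on a small made-up table. It also renames the aggregated column explicitly before merging, which is one way to avoid depending on `suffixes`.

```python
import pandas as pd

# Hypothetical price table
df = pd.DataFrame({
    'category': ['A', 'B', 'A', 'B'],
    'price': [100, 10, 300, 30],
})

# Option 1: transform keeps the original row order and index
with_transform = df.assign(
    category_avg_price=df.groupby('category')['price'].transform('mean')
)

# Option 2: aggregate, rename explicitly, then merge on the key
avg_prices = (df.groupby('category', as_index=False)['price']
              .mean()
              .rename(columns={'price': 'category_avg_price'}))
with_merge = df.merge(avg_prices, on='category', how='left')

# A typical use: how far each price is from its category average
with_transform['diff_from_avg'] = (with_transform['price']
                                   - with_transform['category_avg_price'])
print(with_transform)
```

`transform` is shorter when only one statistic is needed. `merge` may be the better fit when several aggregated columns are joined at once.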

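7. Forgetting to set a Korean font for plots

  
# ❌ Wrong: Korean text renders as broken glyphs
plt.title('월별 매출 분석')  # appears as empty boxes

# ✅ Right: set a font that contains Korean glyphs
import matplotlib.pyplot as plt
plt.rcParams['font.family'] = 'AppleGothic'  # macOS
# plt.rcParams['font.family'] = 'Malgun Gothic'  # Windows
plt.rcParams['axes.unicode_minus'] = False  # keep minus signs from breaking
plt.title('월별 매출 분석')  # renders correctly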

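8. Loading large data into memory all at once

  
# ❌ Wrong: loading a huge file in one call
df = pd.read_csv('huge_file.csv')  # may raise MemoryError

# ✅ Right: process the file in chunks
chunk_size = 10000
partial_sums = []
for chunk in pd.read_csv('huge_file.csv', chunksize=chunk_size):
    # Aggregate each chunk and keep only the small result
    partial_sums.append(chunk.groupby('category').sum(numeric_only=True))
# Combine the per-chunk sums into the final totals
result = pd.concat(partial_sums).groupby(level=0).sum()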
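
The chunked pattern can be tried without a large file by reading from an in-memory buffer. In this sketch `csv_text` is a hypothetical stand-in for the file on disk, and the chunk size of 4 is chosen only so that the data splits into several chunks.

```python
import io
import pandas as pd

# Stand-in for a large file on disk (hypothetical contents)
csv_text = "category,amount\n" + "\n".join(
    f"{'A' if i % 2 == 0 else 'B'},{i}" for i in range(10)
)

partials = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # Reduce each chunk to a small per-category result
    partials.append(chunk.groupby('category')['amount'].sum())

# Sums of per-chunk sums give the overall sum
totals = pd.concat(partials).groupby(level=0).sum()
print(totals)
```

This works directly for sums and counts. A mean cannot be combined by averaging chunk means, so it is safer to accumulate the sum and the count separately and divide at the end.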

Avoiding these mistakes makes your data analysis both faster and more accurate.

The next post covers machine learning basics: building AI models with scikit-learn.

Happy Coding! 🐍✨
