Hypothesis: Running Machine Learning

<aside> 📌 Task: Run machine learning

</aside>

<aside>

📍 Popularity = target (label) column

→ Use a classification model to check, column by column, which columns are associated with (meaningful for) popularity

</aside>

<aside> 📌 Summary of execution and progress

</aside>

feature_cols = [
    'price',
    'minimum_nights',
    'availability_365',
    'reviews_per_month',
    'estimated_revenue_per_month',
    'daily_guests',
    'has_landmark'  # whether a tourist landmark name is included (0 or 1)
]

target_col = 'is_popular'  # whether the listing is in the top 25% (0 or 1)

# Reassign is_popular using quantiles of popularity_score within each neighbourhood_group
def get_popularity_cat(row):
    group = df_filtered[df_filtered['neighbourhood_group'] == row['neighbourhood_group']]['popularity_score']
    if len(group) < 4:
        return 'Mid'
    q25 = group.quantile(0.25)
    q75 = group.quantile(0.75)
    if row['popularity_score'] >= q75:
        return 'High'
    elif row['popularity_score'] <= q25:
        return 'Low'
    else:
        return 'Mid'

# Re-apply the labels
df_filtered['popularity_category'] = df_filtered.apply(get_popularity_cat, axis=1)
df_filtered['is_popular'] = (df_filtered['popularity_category'] == 'High').astype(int)
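
The row-wise `apply` above re-filters the whole frame for every row, which gets slow on large data. As a minimal sketch (not the method used in this run), the same labeling can be done with `groupby` and `transform`. The function name `add_popularity_labels` and the demo frame are hypothetical; only the column names come from the code above.

```python
import numpy as np
import pandas as pd

def add_popularity_labels(df: pd.DataFrame, min_group_size: int = 4) -> pd.DataFrame:
    """Label listings Low/Mid/High by popularity_score quartile within neighbourhood_group."""
    out = df.copy()
    score = out["popularity_score"]
    grouped = out.groupby("neighbourhood_group")["popularity_score"]

    # Per-group quartiles, broadcast back to every row of the group
    q25 = grouped.transform(lambda s: s.quantile(0.25))
    q75 = grouped.transform(lambda s: s.quantile(0.75))
    # Number of rows in each row's group
    group_size = out["neighbourhood_group"].map(out["neighbourhood_group"].value_counts())

    # Conditions are checked in order, mirroring the if/elif chain of the row-wise version
    out["popularity_category"] = np.select(
        [group_size < min_group_size, score >= q75, score <= q25],
        ["Mid", "High", "Low"],
        default="Mid",
    )
    out["is_popular"] = (out["popularity_category"] == "High").astype(int)
    return out

# Hypothetical demo data: group A has 5 listings, group B has only 2 (falls back to 'Mid')
demo = pd.DataFrame({
    "neighbourhood_group": ["A"] * 5 + ["B"] * 2,
    "popularity_score": [1, 2, 3, 4, 5, 10, 20],
})
labeled = add_popularity_labels(demo)
print(labeled)
```

On the real data this would be called as `df_filtered = add_popularity_labels(df_filtered)`.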

from sklearn.model_selection import train_test_split

X = df_filtered[feature_cols]
y = df_filtered[target_col]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
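
`stratify=y` keeps the share of popular listings roughly the same in the train and test sets. A small sketch on made-up data (the `*_demo` arrays are hypothetical, with 25% positives to mimic the top-25% label):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 25 positives and 75 negatives
y_demo = np.array([1] * 25 + [0] * 75)
X_demo = np.arange(100).reshape(-1, 1)

X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of positives in each split; with stratification both stay close to 0.25
print(y_tr_demo.mean(), y_te_demo.mean())
```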

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

from sklearn.metrics import classification_report, confusion_matrix

y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

import matplotlib.pyplot as plt
import pandas as pd  # needed for pd.Series below
import seaborn as sns

importances = model.feature_importances_
feat_importance = pd.Series(importances, index=feature_cols).sort_values(ascending=True)

plt.figure(figsize=(8, 5))
sns.barplot(x=feat_importance, y=feat_importance.index)
plt.title('Feature Importance')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.grid(True)
plt.show()
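
`feature_importances_` is impurity-based and computed on the training data, so it can overstate high-cardinality numeric columns. One common cross-check is permutation importance on the held-out set. The sketch below runs on synthetic stand-in data (`make_classification`, the `demo_*` names, and the `f1` scoring choice are assumptions for illustration), not on the listings data from this run.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix and target
X_arr, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=2, n_redundant=0, random_state=42
)
demo_cols = [f"feature_{i}" for i in range(5)]
X_demo = pd.DataFrame(X_arr, columns=demo_cols)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
demo_model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle one column at a time on the test set and measure the drop in F1
result = permutation_importance(
    demo_model, X_te, y_te, n_repeats=10, random_state=42, scoring="f1"
)
perm_importance = pd.Series(result.importances_mean, index=demo_cols).sort_values(ascending=False)
print(perm_importance)
```

With the real run, `model`, `X_test`, `y_test`, and `feature_cols` would take the place of the demo objects.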

<aside> 📌 Results

</aside>

스크린샷(1).png

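📊 Confusion Matrix

[[5747 75] → Actual 0 (not popular): 5,747 of 5,822 listings predicted correctly, only 75 misclassified as popular

[ 48 1899]] → Actual 1 (popular): 1,899 of 1,947 listings predicted correctly, only 48 misclassified as not popular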

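🔍 Analysis Summary

| Metric | Meaning | Value |
| --- | --- | --- |
| Precision | Of the listings predicted as popular, what share actually were? | 0.96 (for class 1) |
| Recall | Of the truly popular listings, what share did the model find? | 0.98 |
| F1-score | Harmonic mean of precision and recall | 0.97 |
| Accuracy | Share of all listings classified correctly | 0.98 |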
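
As a quick check, the class-1 metrics can be recomputed from the four confusion-matrix counts above. The helper name `metrics_from_confusion` is hypothetical; the counts are the ones reported in this run.

```python
def metrics_from_confusion(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Compute positive-class precision, recall, F1, and overall accuracy."""
    precision = tp / (tp + fp)   # predicted popular that really are popular
    recall = tp / (tp + fn)      # truly popular that the model found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Counts from the confusion matrix: [[TN, FP], [FN, TP]] = [[5747, 75], [48, 1899]]
metrics = metrics_from_confusion(tn=5747, fp=75, fn=48, tp=1899)
print({name: round(value, 2) for name, value in metrics.items()})
# → {'precision': 0.96, 'recall': 0.98, 'f1': 0.97, 'accuracy': 0.98}
```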