Data preprocessing and missing values

Sephora data preprocessing and missing-value handling

<aside>

[Data to merge]

Product data

product_name

size

variation_type

ingredients

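sale_price_usd (adjusted with a function)

limited_edition (type change needed: int64 → True/False)

new (type change needed: int64 → True/False)

online_only (type change needed: int64 → True/False)

out_of_stock (type change needed: int64 → True/False)

sephora_exclusive (type change needed: int64 → True/False)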

highlights

primary_category: 8494 non-null, object

secondary_category: 8486 non-null, object

tertiary_category: 7504 non-null, object

child_count: 8494 non-null, int64

child_max_price: 2754 non-null, float64

child_min_price: 2754 non-null, float64

Remove from the review data

skin_tone

eye_color

hair_color

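Combined preprocessing

is_recommended → convert to bool

helpfulness: fill missing values with 0, round, … change to 0

review_text: drop rows with missing values

review_title: replace missing values with unknown

skin_type: replace missing values with unknown

size: drop rows with missing values

variation_type: drop rows with missing values

ingredients: drop rows with missing values

highlights: replace missing values with unknown

tertiary_category: replace missing values with unknown

child_max_price: replace missing values with 0

child_min_price: replace missing values with 0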

</aside>
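The combined preprocessing rules listed above can be sketched roughly as follows. This is a minimal illustration on toy data, not the project's actual code: the helper name `apply_missing_value_rules` and the toy values are made up, and the nullable `"boolean"` dtype for `is_recommended` is an assumption (a plain `astype(bool)` would turn NaN into True).

```python
import numpy as np
import pandas as pd

# Rules transcribed from the notes; the constant and helper names are hypothetical.
DROP_IF_MISSING = ["review_text", "size", "variation_type", "ingredients"]
FILL_UNKNOWN = ["review_title", "skin_type", "highlights", "tertiary_category"]
FILL_ZERO = ["child_max_price", "child_min_price"]


def apply_missing_value_rules(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Drop rows where any of the required text columns is missing.
    out = out.dropna(subset=[c for c in DROP_IF_MISSING if c in out.columns])
    # Replace missing labels with the literal string "unknown".
    for col in FILL_UNKNOWN:
        if col in out.columns:
            out[col] = out[col].fillna("unknown")
    # Replace missing child prices with 0.
    for col in FILL_ZERO:
        if col in out.columns:
            out[col] = out[col].fillna(0)
    # Assumption: keep missing recommendations as <NA> instead of coercing them to True.
    if "is_recommended" in out.columns:
        out["is_recommended"] = out["is_recommended"].astype("boolean")
    return out.reset_index(drop=True)


# Toy data, for illustration only.
toy = pd.DataFrame({
    "review_text": ["great", None, "ok"],
    "review_title": [None, "t", "fine"],
    "skin_type": ["dry", "oily", None],
    "size": ["1 oz", "2 oz", "1 oz"],
    "variation_type": ["Size", "Size", "Size"],
    "ingredients": ["['Water']", "['Water']", "['Water']"],
    "highlights": [None, "['Vegan']", "['Vegan']"],
    "tertiary_category": ["Moisturizers", None, None],
    "child_max_price": [np.nan, 10.0, 12.0],
    "child_min_price": [np.nan, 5.0, 6.0],
    "is_recommended": [1.0, 0.0, np.nan],
})
cleaned = apply_missing_value_rules(toy)
print(cleaned)
```

Keeping the rules in three lists makes it easy to move a column from "drop" to "fill" later without touching the function body.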

Preprocessing

<aside> 💡

Left join with the reviews as the base table

Look at skincare products only

</aside>
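As a rough sketch of the two points above: a left join keeps every review row even when no product matches, and the skincare filter is then a simple mask on `primary_category`. The toy frames are made up, and the category label `"Skincare"` is an assumption about how the value is spelled in the data.

```python
import pandas as pd

# Toy data, for illustration only.
reviews = pd.DataFrame({
    "product_id": ["P1", "P2", "P1", "P9"],
    "rating": [5, 3, 4, 2],
})
products = pd.DataFrame({
    "product_id": ["P1", "P2", "P3"],
    "primary_category": ["Skincare", "Makeup", "Skincare"],
})

# Left join: every review is kept; P9 has no product row, so its category is NaN.
merged = reviews.merge(products, on="product_id", how="left")

# Assumption: skincare products are labelled exactly "Skincare".
skincare = merged[merged["primary_category"] == "Skincare"].reset_index(drop=True)
print(skincare)
```

Reviews without a matching product have a NaN category, so they fall out of the skincare subset automatically.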

helpfulness is the proportion of helpful votes

<aside> 💡

Multiply by 100
Set NaN values to 0 </aside>
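A minimal sketch of the helpfulness conversion described above, on toy values. The helper name `helpfulness_to_percent` is hypothetical, and rounding to whole percentages is an assumption, since the notes do not say how many decimals to keep.

```python
import numpy as np
import pandas as pd


def helpfulness_to_percent(s: pd.Series, decimals: int = 0) -> pd.Series:
    # Scale the 0-1 proportion to a percentage, treat missing values as 0, then round.
    return (s * 100).fillna(0).round(decimals)


# Toy values, for illustration only.
toy = pd.Series([0.5, np.nan, 1.0, 0.3333])
result = helpfulness_to_percent(toy)
print(result.tolist())
```

Filling with 0 makes a review with no helpfulness value indistinguishable from one rated 0% helpful, which is worth keeping in mind when interpreting the column.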
Drop rows where the review text is NULL
review_title = UNKNOWN
skin_type = NONE
Drop rows with missing size.

<aside> 💡

size missing count: 43363; size missing ratio: 3.96%

</aside>
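Figures like the count and ratio above can be computed along these lines. The helper name `missing_summary` and the toy frame are illustrative only.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame, column: str) -> tuple[int, float]:
    # Number of missing values and their share of all rows, in percent.
    n_missing = int(df[column].isna().sum())
    pct = round(n_missing / len(df) * 100, 2)
    return n_missing, pct


# Toy data, for illustration only.
toy = pd.DataFrame({"size": ["1 oz", None, "2 oz", "1 oz"]})
n, pct = missing_summary(toy, "size")
print(f"size missing count: {n}; size missing ratio: {pct:.2f}%")
```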

Drop rows with missing ingredients (about 20,000)
highlights missing values = unknown
Set all null child price values to 0

[Combined code]

from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from scipy.stats import chi2_contingency
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
import statsmodels.api as sm
from imblearn.over_sampling import SMOTE

raw_df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/cjong/product_info.csv")
product_df = raw_df.copy()

review1 = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/cjong/reviews_0-250.csv")
review2 = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/cjong/reviews_250-500.csv")
review3 = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/cjong/reviews_500-750.csv")
review4 = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/cjong/reviews_750-1250.csv")
review5 = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/cjong/reviews_1250-end.csv")
reviews_df = pd.concat([review1, review2, review3, review4, review5], ignore_index=True)
# Drop leftover index columns such as "Unnamed: 0"
reviews_df = reviews_df.loc[:, ~reviews_df.columns.str.contains('^Unnamed')]

# 1. Drop unneeded columns from the review data
reviews_df = reviews_df.drop(columns=["skin_tone", "eye_color", "hair_color"], errors="ignore")

# 2. Select only the required columns from product_df
product_cols = [
    "product_id", "product_name", "size", "variation_type", "ingredients",
    "sale_price_usd", "limited_edition", "new", "online_only", 
    "out_of_stock", "sephora_exclusive",
    "highlights", "primary_category", "secondary_category", "tertiary_category",
    "child_count", "child_max_price", "child_min_price"
]
product_sub = product_df[product_cols].copy()

# 3. Adjust sale_price_usd (fall back to price_usd when it is missing)
product_sub["sale_price_usd"] = product_df["sale_price_usd"].fillna(product_df["price_usd"])

# 4. Convert integer flags to boolean
bool_cols = ["limited_edition", "new", "online_only", "out_of_stock", "sephora_exclusive"]
for col in bool_cols:
    product_sub[col] = product_sub[col].astype(bool)

# 5. Merge with the review data (on product_id)
merged_df = pd.merge(reviews_df, product_sub, on="product_id", how="left")
# Both tables contain product_name, so the merge adds _x/_y suffixes;
# use product_name_y (from the product table) as the final product_name
merged_df["product_name"] = merged_df["product_name_y"]

# Drop the redundant suffixed columns
merged_df = merged_df.drop(columns=["product_name_x", "product_name_y"], errors="ignore")

print("✅ Merge complete! shape:", merged_df.shape)
display(merged_df.head())
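
A left merge like the one above silently multiplies review rows if `product_id` is duplicated on the product side, and it leaves NaN product columns for reviews that have no match. One way to guard against both, sketched here on toy data with a hypothetical helper name `check_left_merge`, is to use the `validate` and `indicator` options of `DataFrame.merge`.

```python
import pandas as pd


def check_left_merge(left: pd.DataFrame, right: pd.DataFrame, key: str):
    # validate="many_to_one" raises pandas.errors.MergeError if `key` is not unique in `right`.
    merged = left.merge(right, on=key, how="left",
                        validate="many_to_one", indicator=True)
    # Rows marked "left_only" are reviews with no matching product.
    unmatched = int((merged["_merge"] == "left_only").sum())
    return merged.drop(columns="_merge"), unmatched


# Toy data, for illustration only.
toy_reviews = pd.DataFrame({"product_id": ["P1", "P1", "P9"], "rating": [5, 4, 2]})
toy_products = pd.DataFrame({"product_id": ["P1", "P2"], "product_name": ["Cream", "Serum"]})

checked, n_unmatched = check_left_merge(toy_reviews, toy_products, "product_id")
print(len(checked) == len(toy_reviews), n_unmatched)
```

If the row count after the merge equals the number of reviews and the unmatched count is small, the join behaved as a plain lookup.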