Preprocessing Categorical Data (One-Hot Encoding)

Learning Objectives

  1. Preprocess categorical data (one-hot encoding)
import pandas as pd
# Data source: https://www.kaggle.com/hesh97/titanicdataset-traincsv/data
train_data = pd.read_csv('./train.csv')
train_data.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S

One-hot encoding

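  • Categorical data cannot be used directly in most numerical computations, so it has to be converted to numeric form before analysis
  • Each category of a categorical variable becomes its own column
  • Each new column holds 1 if the row belongs to that category and 0 otherwise
  • Use the pandas.get_dummies function
    • drop_first : when True, the first category of each encoded variable is dropped, leaving k-1 columns for a variable with k categories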
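As a small illustration of the bullets above, the sketch below encodes a toy DataFrame with and without drop_first. The data is hypothetical and not taken from train.csv. The dtype=int argument is an addition for this sketch: pandas 2.0 and later return True/False columns by default, whereas the outputs in this notebook show 0/1.

```python
import pandas as pd

# Hypothetical toy data, not rows from train.csv
toy = pd.DataFrame({
    "Fare": [7.25, 71.28, 8.05],
    "Embarked": ["S", "C", "Q"],
})

# dtype=int requests 0/1 integers instead of the boolean default of pandas 2.0+
full = pd.get_dummies(toy, columns=["Embarked"], dtype=int)

# drop_first=True removes the first category (here "C", the first in sorted order)
reduced = pd.get_dummies(toy, columns=["Embarked"], drop_first=True, dtype=int)

print(list(full.columns))     # ['Fare', 'Embarked_C', 'Embarked_Q', 'Embarked_S']
print(list(reduced.columns))  # ['Fare', 'Embarked_Q', 'Embarked_S']
```

New column names are built as the original column name, an underscore, and the category value.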
# No columns argument: every non-numeric column (Name, Ticket, Cabin, Sex, Embarked) is encoded
pd.get_dummies(train_data)
PassengerId Survived Pclass Age SibSp Parch Fare Name_Abbing, Mr. Anthony Name_Abbott, Mr. Rossmore Edward Name_Abbott, Mrs. Stanton (Rosa Hunt) ... Cabin_F G73 Cabin_F2 Cabin_F33 Cabin_F38 Cabin_F4 Cabin_G6 Cabin_T Embarked_C Embarked_Q Embarked_S
0 1 0 3 22.0 1 0 7.2500 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
1 2 1 1 38.0 1 0 71.2833 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
2 3 1 3 26.0 0 0 7.9250 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
3 4 1 1 35.0 1 0 53.1000 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
4 5 0 3 35.0 0 0 8.0500 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 27.0 0 0 13.0000 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
887 888 1 1 19.0 0 0 30.0000 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
888 889 0 3 NaN 1 2 23.4500 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
889 890 1 1 26.0 0 0 30.0000 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
890 891 0 3 32.0 0 0 7.7500 0 0 0 ... 0 0 0 0 0 0 0 0 1 0

891 rows × 1731 columns
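The column count grows this much because, without a columns argument, get_dummies expands every non-numeric column, including near-unique ones such as Name. The sketch below uses hypothetical toy data (not train.csv) to show that the resulting width is the number of numeric columns plus the number of distinct values in each non-numeric column. The variable names are illustrative only.

```python
import pandas as pd

# Hypothetical toy data; Name is unique per row, much like in the Titanic table
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Name": ["A", "B", "C", "D"],
    "Sex": ["male", "female", "female", "female"],
})

# Non-numeric columns are the ones get_dummies expands by default
text_cols = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]
numeric_count = toy.shape[1] - len(text_cols)

# Expected width: untouched numeric columns + one new column per distinct value
expected_width = numeric_count + sum(toy[c].nunique() for c in text_cols)
actual_width = pd.get_dummies(toy).shape[1]

print(toy[text_cols].nunique())      # distinct values per non-numeric column
print(expected_width, actual_width)  # 7 7
```

Checking nunique() before encoding is a cheap way to spot identifier-like columns that should be dropped or listed out of the columns argument rather than encoded.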

# Encode only the listed columns and keep every category of each
pd.get_dummies(train_data, columns=['Pclass', 'Sex', 'Embarked'], drop_first=False)
     PassengerId  Survived  Name                                               Age   SibSp  Parch  Ticket            Fare     Cabin  Pclass_1  Pclass_2  Pclass_3  Sex_female  Sex_male  Embarked_C  Embarked_Q  Embarked_S
0    1            0         Braund, Mr. Owen Harris                            22.0  1      0      A/5 21171         7.2500   NaN    0         0         1         0           1         0           0           1
1    2            1         Cumings, Mrs. John Bradley (Florence Briggs Th...  38.0  1      0      PC 17599          71.2833  C85    1         0         0         1           0         1           0           0
2    3            1         Heikkinen, Miss. Laina                             26.0  0      0      STON/O2. 3101282  7.9250   NaN    0         0         1         1           0         0           0           1
3    4            1         Futrelle, Mrs. Jacques Heath (Lily May Peel)       35.0  1      0      113803            53.1000  C123   1         0         0         1           0         0           0           1
4    5            0         Allen, Mr. William Henry                           35.0  0      0      373450            8.0500   NaN    0         0         1         0           1         0           0           1
...  ...          ...       ...                                                ...   ...    ...    ...               ...      ...    ...       ...       ...       ...         ...       ...         ...         ...
886  887          0         Montvila, Rev. Juozas                              27.0  0      0      211536            13.0000  NaN    0         1         0         0           1         0           0           1
887  888          1         Graham, Miss. Margaret Edith                       19.0  0      0      112053            30.0000  B42    1         0         0         1           0         0           0           1
888  889          0         Johnston, Miss. Catherine Helen "Carrie"           NaN   1      2      W./C. 6607        23.4500  NaN    0         0         1         1           0         0           0           1
889  890          1         Behr, Mr. Karl Howell                              26.0  0      0      111369            30.0000  C148   1         0         0         0           1         1           0           0
890  891          0         Dooley, Mr. Patrick                                32.0  0      0      370376            7.7500   NaN    0         0         1         0           1         0           1           0

891 rows × 17 columns
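With drop_first=False, each encoded variable gets one column per category, so exactly one of its columns is 1 in any row that has a value. Rows with a missing value are a special case: by default they get 0 in every column. The sketch below, on hypothetical toy data, shows this and the dummy_na=True option of get_dummies, which adds an extra indicator column for missing values. Whether the missing-value column is needed depends on the data, so treat this as one possible approach.

```python
import pandas as pd

# Hypothetical toy column with one missing value
toy = pd.DataFrame({"Embarked": ["S", "C", None, "Q"]})

# Default behavior: the missing row gets 0 in every dummy column
plain = pd.get_dummies(toy, columns=["Embarked"], dtype=int)
row_sums = plain.sum(axis=1)
print(row_sums.tolist())  # [1, 1, 0, 1]

# dummy_na=True adds one more column that flags missing values
with_na = pd.get_dummies(toy, columns=["Embarked"], dummy_na=True, dtype=int)
print(with_na.sum(axis=1).tolist())  # [1, 1, 1, 1]
```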

# Same columns, but the first category of each is dropped (Pclass_1, Sex_female, Embarked_C)
pd.get_dummies(train_data, columns=['Pclass', 'Sex', 'Embarked'], drop_first=True)
     PassengerId  Survived  Name                                               Age   SibSp  Parch  Ticket            Fare     Cabin  Pclass_2  Pclass_3  Sex_male  Embarked_Q  Embarked_S
0    1            0         Braund, Mr. Owen Harris                            22.0  1      0      A/5 21171         7.2500   NaN    0         1         1         0           1
1    2            1         Cumings, Mrs. John Bradley (Florence Briggs Th...  38.0  1      0      PC 17599          71.2833  C85    0         0         0         0           0
2    3            1         Heikkinen, Miss. Laina                             26.0  0      0      STON/O2. 3101282  7.9250   NaN    0         1         0         0           1
3    4            1         Futrelle, Mrs. Jacques Heath (Lily May Peel)       35.0  1      0      113803            53.1000  C123   0         0         0         0           1
4    5            0         Allen, Mr. William Henry                           35.0  0      0      373450            8.0500   NaN    0         1         1         0           1
...  ...          ...       ...                                                ...   ...    ...    ...               ...      ...    ...       ...       ...       ...         ...
886  887          0         Montvila, Rev. Juozas                              27.0  0      0      211536            13.0000  NaN    1         0         1         0           1
887  888          1         Graham, Miss. Margaret Edith                       19.0  0      0      112053            30.0000  B42    0         0         0         0           1
888  889          0         Johnston, Miss. Catherine Helen "Carrie"           NaN   1      2      W./C. 6607        23.4500  NaN    0         1         0         0           1
889  890          1         Behr, Mr. Karl Howell                              26.0  0      0      111369            30.0000  C148   0         0         1         0           0
890  891          0         Dooley, Mr. Patrick                                32.0  0      0      370376            7.7500   NaN    0         1         1         1           0

891 rows × 14 columns
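Dropping the first category loses no information when the column has no missing values: a row with 0 in every remaining dummy column must belong to the dropped category. This is also why drop_first=True is commonly used with linear models, where keeping all k columns makes them perfectly collinear. The sketch below, on hypothetical toy data, rebuilds the dropped Pclass_1 indicator from the two remaining columns.

```python
import pandas as pd

# Hypothetical toy data with the three passenger classes
toy = pd.DataFrame({"Pclass": [3, 1, 3, 2]})

# drop_first=True keeps Pclass_2 and Pclass_3 and drops Pclass_1
reduced = pd.get_dummies(toy, columns=["Pclass"], drop_first=True, dtype=int)

# A row with zeros in every remaining column belongs to the dropped category
pclass_1 = 1 - reduced[["Pclass_2", "Pclass_3"]].sum(axis=1)
print(pclass_1.tolist())  # [0, 1, 0, 0]
```

If the original column does contain missing values, those rows are also all zeros after encoding and cannot be told apart from the dropped category, so it is worth handling missing values before using drop_first=True.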