Creating a DataFrame from sample CSV data

Learning objectives

  1. Understand and use numpy, a numerical computing library
  2. Understand and use pandas, a data analysis library

Creating a DataFrame from CSV data

  • The most common way to create a DataFrame for data analysis
  • Built from a CSV (comma-separated values) file extracted from a data source
  • Uses the pandas.read_csv function
import pandas as pd

# Data source: https://www.kaggle.com/hesh97/titanicdataset-traincsv/data
train_data = pd.read_csv('./train.csv')
train_data.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
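
If train.csv is not at hand, the same call can be tried on an in-memory string. The following is a minimal sketch, assuming a hypothetical three-row sample (`csv_text`) that copies a few columns and values from the output above; `io.StringIO` stands in for the file path.

```python
import io

import pandas as pd

# Hypothetical sample: three rows and six columns borrowed from train.csv.
# Names are quoted because they contain commas; an empty last field means
# the Cabin value is missing.
csv_text = """PassengerId,Survived,Pclass,Name,Age,Cabin
1,0,3,"Braund, Mr. Owen Harris",22.0,
3,1,3,"Heikkinen, Miss. Laina",26.0,
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,C123
"""

# read_csv accepts a file-like object as well as a path.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head())
print(sample.shape)  # (3, 6)
```

The quoted names stay in one column despite their commas, and the empty Cabin fields are read as NaN, which matches what the real file shows.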

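read_csv function parameters

  • sep - the delimiter (separator) between values; the default is ','
  • header - the row to use as column names; set to None when the file has no header row
  • index_col - the column to use as the index
  • usecols - the subset of columns to actually load into the DataFrame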
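
Of the parameters above, header=None is not demonstrated on train.csv, because that file does have a header row. The sketch below assumes a hypothetical headerless sample (`csv_text`) to show the effect; `names` is another read_csv parameter, used here only to label the columns.

```python
import io

import pandas as pd

# Hypothetical file content with no header row: the first line is already data.
csv_text = """1,0,3
2,1,1
3,1,3
"""

# Default behaviour: the first line is consumed as column names, so one row is lost.
default = pd.read_csv(io.StringIO(csv_text))
print(default.shape)  # (2, 3)

# header=None keeps every line as data and assigns integer column labels.
no_header = pd.read_csv(io.StringIO(csv_text), header=None)
print(no_header.columns.tolist())  # [0, 1, 2]

# names= supplies readable labels for a headerless file.
named = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=['PassengerId', 'Survived', 'Pclass'],
)
print(named)
```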
train_data = pd.read_csv('./train.csv', sep=',')
train_data
     PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0    1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1    2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2    3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3    4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4    5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
...  ...          ...       ...     ...                                                ...     ...   ...    ...    ...               ...      ...    ...
886  887          0         2       Montvila, Rev. Juozas                              male    27.0  0      0      211536            13.0000  NaN    S
887  888          1         1       Graham, Miss. Margaret Edith                       female  19.0  0      0      112053            30.0000  B42    S
888  889          0         3       Johnston, Miss. Catherine Helen "Carrie"           female  NaN   1      2      W./C. 6607        23.4500  NaN    S
889  890          1         1       Behr, Mr. Karl Howell                              male    26.0  0      0      111369            30.0000  C148   C
890  891          0         3       Dooley, Mr. Patrick                                male    32.0  0      0      370376            7.7500   NaN    Q

891 rows × 12 columns
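
Since ',' is already the default, passing sep=',' changes nothing for train.csv. The effect of sep is easier to see on data that uses another delimiter. The sketch below assumes a hypothetical semicolon-delimited sample (`csv_text`), which is not part of the original data set.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited content.
csv_text = """PassengerId;Survived;Pclass
1;0;3
2;1;1
"""

# With the matching separator, each value lands in its own column.
parsed = pd.read_csv(io.StringIO(csv_text), sep=';')
print(parsed.shape)  # (2, 3)

# With the default sep=',', each whole line is read as a single value.
unsplit = pd.read_csv(io.StringIO(csv_text))
print(unsplit.shape)  # (2, 1)
print(unsplit.columns.tolist())
```

A DataFrame with a single, oddly named column is usually the first sign that sep does not match the file.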

train_data = pd.read_csv('./train.csv', index_col='PassengerId', usecols=['PassengerId', 'Survived', 'Pclass', 'Name'])
train_data
             Survived  Pclass  Name
PassengerId
1            0         3       Braund, Mr. Owen Harris
2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...
3            1         3       Heikkinen, Miss. Laina
4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)
5            0         3       Allen, Mr. William Henry
...          ...       ...     ...
887          0         2       Montvila, Rev. Juozas
888          1         1       Graham, Miss. Margaret Edith
889          0         3       Johnston, Miss. Catherine Helen "Carrie"
890          1         1       Behr, Mr. Karl Howell
891          0         3       Dooley, Mr. Patrick

891 rows × 3 columns

train_data.columns
Index(['Survived', 'Pclass', 'Name'], dtype='object')
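
Note that PassengerId no longer appears in columns: it has become the index. One practical consequence is that rows can be looked up by passenger ID with .loc. The sketch below assumes a hypothetical three-row sample (`csv_text`) with values copied from the output above, so it runs without train.csv.

```python
import io

import pandas as pd

# Hypothetical sample with the same column names as train.csv.
csv_text = """PassengerId,Survived,Pclass,Name,Sex
1,0,3,"Braund, Mr. Owen Harris",male
3,1,3,"Heikkinen, Miss. Laina",female
5,0,3,"Allen, Mr. William Henry",male
"""

indexed = pd.read_csv(
    io.StringIO(csv_text),
    index_col='PassengerId',
    usecols=['PassengerId', 'Survived', 'Pclass', 'Name'],
)

# The index column must be listed in usecols, but it is not a regular column afterwards.
print(indexed.index.name)        # PassengerId
print(indexed.columns.tolist())  # ['Survived', 'Pclass', 'Name']

# .loc selects by index label (the passenger ID), not by row position.
print(indexed.loc[3, 'Name'])    # Heikkinen, Miss. Laina
```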
train_data = pd.read_csv('./train.csv', index_col='PassengerId', usecols=['PassengerId','Name','Sex','Age'])
train_data
             Name                                               Sex     Age
PassengerId
1            Braund, Mr. Owen Harris                            male    22.0
2            Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0
3            Heikkinen, Miss. Laina                             female  26.0
4            Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0
5            Allen, Mr. William Henry                           male    35.0
...          ...                                                ...     ...
887          Montvila, Rev. Juozas                              male    27.0
888          Graham, Miss. Margaret Edith                       female  19.0
889          Johnston, Miss. Catherine Helen "Carrie"           female  NaN
890          Behr, Mr. Karl Howell                              male    26.0
891          Dooley, Mr. Patrick                                male    32.0

891 rows × 3 columns

train_data.columns
Index(['Name', 'Sex', 'Age'], dtype='object')
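
The resulting columns follow the order in the file, not the order of the list passed to usecols. The sketch below illustrates this on a hypothetical two-row sample (`csv_text`) and shows one way to get a specific order, by reselecting the columns after loading.

```python
import io

import pandas as pd

# Hypothetical sample; the file order is PassengerId, Name, Sex, Age.
csv_text = """PassengerId,Name,Sex,Age
1,"Braund, Mr. Owen Harris",male,22.0
3,"Heikkinen, Miss. Laina",female,26.0
"""

# 'Age' is listed first, but the file order wins.
subset = pd.read_csv(io.StringIO(csv_text), usecols=['Age', 'Name'])
print(subset.columns.tolist())  # ['Name', 'Age']

# Reselect with a list of labels to impose the desired order.
ordered = subset[['Age', 'Name']]
print(ordered.columns.tolist())  # ['Age', 'Name']
```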