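Creating a DataFrame from sample CSV data
Learning objectives
- Understand and use numpy, a numerical computing library
- Understand and use pandas, a data analysis library
Creating a DataFrame from CSV data
- The most common way to create a DataFrame for data analysis
- The DataFrame is built from a CSV (comma-separated values) file extracted from a data source
- Use the pandas.read_csv function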
import pandas as pd
# Data source: https://www.kaggle.com/hesh97/titanicdataset-traincsv/data
train_data = pd.read_csv('./train.csv')
train_data.head()
(index) | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
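read_csv parameters
- sep - the separator (delimiter) that divides the values in each row
- header - the row to use for column names; set to None when the file has no header row, so the first line is read as data
- index_col - the column to use as the index
- usecols - the subset of columns to actually load into the DataFrame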
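The Titanic file is comma-separated and has a header row, so it does not exercise sep or header much. The minimal sketch below shows both on small in-memory strings instead of a file; the sample data, the variable names, and the use of io.StringIO are assumptions for illustration and are not part of the original notebook. The names parameter, which supplies column labels when the file has none, is also shown.

```python
import io
import pandas as pd

# Hypothetical in-memory data standing in for files on disk.
semicolon_csv = "id;name;score\n1;Kim;90\n2;Lee;85\n"
no_header_csv = "1,Kim,90\n2,Lee,85\n"

# sep: the values are separated by ';' instead of the default ','.
df_sep = pd.read_csv(io.StringIO(semicolon_csv), sep=';')

# header=None: the first line is data, so pandas labels the columns 0, 1, 2.
df_no_header = pd.read_csv(io.StringIO(no_header_csv), header=None)

# names: provide column labels for a file that has no header row.
df_named = pd.read_csv(
    io.StringIO(no_header_csv),
    header=None,
    names=['id', 'name', 'score'],
)

print(df_sep.columns.tolist())
print(df_no_header.columns.tolist())
print(df_named)
```

Reading from a string keeps the example runnable without train.csv; with a real file, the path goes where io.StringIO(...) appears.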
train_data = pd.read_csv('./train.csv', sep=',')
train_data
(index) | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
train_data = pd.read_csv('./train.csv', index_col='PassengerId', usecols=['PassengerId', 'Survived', 'Pclass', 'Name'])
train_data
PassengerId | Survived | Pclass | Name |
---|---|---|---|
1 | 0 | 3 | Braund, Mr. Owen Harris |
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... |
3 | 1 | 3 | Heikkinen, Miss. Laina |
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) |
5 | 0 | 3 | Allen, Mr. William Henry |
... | ... | ... | ... |
887 | 0 | 2 | Montvila, Rev. Juozas |
888 | 1 | 1 | Graham, Miss. Margaret Edith |
889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" |
890 | 1 | 1 | Behr, Mr. Karl Howell |
891 | 0 | 3 | Dooley, Mr. Patrick |
891 rows × 3 columns
train_data.columns
Index(['Survived', 'Pclass', 'Name'], dtype='object')
train_data = pd.read_csv('./train.csv', index_col='PassengerId', usecols=['PassengerId','Name','Sex','Age'])
train_data
PassengerId | Name | Sex | Age |
---|---|---|---|
1 | Braund, Mr. Owen Harris | male | 22.0 |
2 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 |
3 | Heikkinen, Miss. Laina | female | 26.0 |
4 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 |
5 | Allen, Mr. William Henry | male | 35.0 |
... | ... | ... | ... |
887 | Montvila, Rev. Juozas | male | 27.0 |
888 | Graham, Miss. Margaret Edith | female | 19.0 |
889 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN |
890 | Behr, Mr. Karl Howell | male | 26.0 |
891 | Dooley, Mr. Patrick | male | 32.0 |
891 rows × 3 columns
train_data.columns
Index(['Name', 'Sex', 'Age'], dtype='object')
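To see what index_col and usecols do without needing train.csv, the sketch below applies the same two parameters to three rows copied from the output above. The sample_csv string and the io.StringIO wrapper are assumptions for illustration. Note that PassengerId appears in usecols as well, as in the calls above, and that once it becomes the index it no longer shows up in columns.

```python
import io
import pandas as pd

# Three rows taken from the output above; names contain commas, so they are quoted.
sample_csv = (
    "PassengerId,Survived,Pclass,Name,Sex,Age\n"
    '1,0,3,"Braund, Mr. Owen Harris",male,22.0\n'
    '3,1,3,"Heikkinen, Miss. Laina",female,26.0\n'
    '5,0,3,"Allen, Mr. William Henry",male,35.0\n'
)

# Load only three columns and use PassengerId as the row index.
sample = pd.read_csv(
    io.StringIO(sample_csv),
    index_col='PassengerId',
    usecols=['PassengerId', 'Name', 'Age'],
)

print(sample.index.name)        # name of the index column
print(sample.columns.tolist())  # remaining data columns

# With PassengerId as the index, a passenger can be looked up by id.
print(sample.loc[3, 'Name'])
```

Setting the index at load time is a convenience; calling set_index on an already loaded DataFrame would give a comparable result.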