Selecting only the rows (data) you want from a DataFrame

Learning objectives

  1. Select rows from a DataFrame
import numpy as np
import pandas as pd
# Data source: https://www.kaggle.com/hesh97/titanicdataset-traincsv/data
train_data = pd.read_csv('./train.csv')
train_data.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S

DataFrame slicing

  • On a DataFrame, the [] operator selects columns by default
  • A slice inside [], however, is applied to rows
train_data[7:10]
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket  Fare     Cabin  Embarked
7  8            0         3       Palsson, Master. Gosta Leonard                     male    2.0   3      1      349909  21.0750  NaN    S
8  9            1         3       Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.0  0      2      347742  11.1333  NaN    S
9  10           1         2       Nasser, Mrs. Nicholas (Adele Achem)                female  14.0  1      0      237736  30.0708  NaN    C
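
The two behaviors of [] described in this section can be sketched on a small frame. This is only an illustration: the frame `df` and its values are made up here and stand in for train_data.

```python
import pandas as pd

# Hypothetical four-row frame standing in for train_data
df = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cid", "Dee"], "Age": [22, 38, 26, 35]}
)

ages = df["Age"]  # a column label inside [] selects that column (a Series)
rows = df[1:3]    # a slice inside [] selects rows, here positions 1 and 2

# A bare integer inside [] is treated as a column label, not a row position,
# so it raises KeyError when no column has that name.
try:
    df[1]
    int_key_is_column_lookup = False
except KeyError:
    int_key_is_column_lookup = True
```

Because a bare integer is looked up among the column labels, selecting a single row by its index needs the accessors covered in the next section.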

Selecting rows

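  • On a Series, [] selects elements (rows), but on a DataFrame [] is designed to select columns by default
  • Use .loc or .iloc to select rows
    • loc - looks rows up by the index labels themselves
    • iloc - looks rows up by 0-based integer position
    • Both accessors can also select columns when a column selector is added after a comma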
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
train_data.index = np.arange(100, 991)
train_data.head()
     PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
100  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
101  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
102  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
103  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
104  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
train_data.tail()
     PassengerId  Survived  Pclass  Name                                      Sex     Age   SibSp  Parch  Ticket      Fare   Cabin  Embarked
986  887          0         2       Montvila, Rev. Juozas                     male    27.0  0      0      211536      13.00  NaN    S
987  888          1         1       Graham, Miss. Margaret Edith              female  19.0  0      0      112053      30.00  B42    S
988  889          0         3       Johnston, Miss. Catherine Helen "Carrie"  female  NaN   1      2      W./C. 6607  23.45  NaN    S
989  890          1         1       Behr, Mr. Karl Howell                     male    26.0  0      0      111369      30.00  C148   C
990  891          0         3       Dooley, Mr. Patrick                       male    32.0  0      0      370376      7.75   NaN    Q
train_data.loc[986]  # select the row whose index label is 986
PassengerId                      887
Survived                           0
Pclass                             2
Name           Montvila, Rev. Juozas
Sex                             male
Age                             27.0
SibSp                              0
Parch                              0
Ticket                        211536
Fare                            13.0
Cabin                            NaN
Embarked                           S
Name: 986, dtype: object
train_data.loc[[986, 100, 110, 990]]  # pass a list of labels to select several rows with loc
     PassengerId  Survived  Pclass  Name                             Sex     Age   SibSp  Parch  Ticket     Fare   Cabin  Embarked
986  887          0         2       Montvila, Rev. Juozas            male    27.0  0      0      211536     13.00  NaN    S
100  1            0         3       Braund, Mr. Owen Harris          male    22.0  1      0      A/5 21171  7.25   NaN    S
110  11           1         3       Sandstrom, Miss. Marguerite Rut  female  4.0   1      1      PP 9549    16.70  G6     S
990  891          0         3       Dooley, Mr. Patrick              male    32.0  0      0      370376     7.75   NaN    Q
train_data.head()
     PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
100  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
101  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
102  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
103  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
104  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
train_data.iloc[0]  # position 0 is the row whose index label is 100
PassengerId                          1
Survived                             0
Pclass                               3
Name           Braund, Mr. Owen Harris
Sex                               male
Age                               22.0
SibSp                                1
Parch                                0
Ticket                       A/5 21171
Fare                              7.25
Cabin                              NaN
Embarked                             S
Name: 100, dtype: object
train_data.iloc[[0, 100, 200, 2]]  # loc looks rows up by their exact index label; iloc uses 0-based positions
     PassengerId  Survived  Pclass  Name                            Sex     Age   SibSp  Parch  Ticket            Fare    Cabin  Embarked
100  1            0         3       Braund, Mr. Owen Harris         male    22.0  1      0      A/5 21171         7.2500  NaN    S
200  101          0         3       Petranec, Miss. Matilda         female  28.0  0      0      349245            7.8958  NaN    S
300  201          0         3       Vande Walle, Mr. Nestor Cyriel  male    28.0  0      0      345770            9.5000  NaN    S
102  3            1         3       Heikkinen, Miss. Laina          female  26.0  0      0      STON/O2. 3101282  7.9250  NaN    S
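
The difference between label lookup and position lookup is easiest to see on a tiny frame whose index does not start at 0. The sketch below is illustrative only: `df` and its values are hypothetical, and the index is shifted to 100..103 in the same spirit as the np.arange(100, 991) assignment above.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame; the index is relabeled to 100..103
df = pd.DataFrame({"Name": ["Ann", "Bob", "Cid", "Dee"], "Age": [22, 38, 26, 35]})
df.index = np.arange(100, 104)

by_label = df.loc[102]     # the row whose index label is 102
by_position = df.iloc[2]   # the third row (0-based position 2), the same row

multi = df.iloc[[0, 3, 1]]  # positions 0, 3, 1 are the rows labeled 100, 103, 101

# No row carries the label 0 after relabeling, so loc[0] raises KeyError
try:
    df.loc[0]
    label_zero_exists = True
except KeyError:
    label_zero_exists = False
```

The result keeps the original index labels, which is why the iloc output above shows 100, 200, 300 and 102 rather than the positions that were requested.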

Selecting rows and columns together

  • With the loc and iloc accessors, rows and columns can both be specified, separated by a comma
train_data.loc[[986, 100, 110, 990], ['Survived', 'Name', 'Sex', 'Age']]
     Survived  Name                             Sex     Age
986  0         Montvila, Rev. Juozas            male    27.0
100  0         Braund, Mr. Owen Harris          male    22.0
110  1         Sandstrom, Miss. Marguerite Rut  female  4.0
990  0         Dooley, Mr. Patrick              male    32.0
train_data.iloc[[101, 100, 200, 102], [1, 4, 5]]  # column positions are 0-based too; position 0 is PassengerId
     Survived  Sex     Age
201  0         male    NaN
200  0         female  28.0
300  0         male    28.0
202  0         male    21.0
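
As a small check of the comma form, the sketch below selects the same rows and columns once by label and once by position. The frame is a hypothetical four-row stand-in that reuses a few of the Titanic column names; it is not the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in frame with index labels 100..103
df = pd.DataFrame(
    {
        "PassengerId": [1, 2, 3, 4],
        "Survived": [0, 1, 1, 1],
        "Sex": ["male", "female", "female", "female"],
        "Age": [22.0, 38.0, 26.0, 35.0],
    },
    index=np.arange(100, 104),
)

# loc: row labels first, then column labels
by_label = df.loc[[103, 100], ["Survived", "Age"]]

# iloc: row positions first, then column positions (both 0-based)
by_position = df.iloc[[3, 0], [1, 3]]

# A single cell can be addressed the same way
age_by_label = df.loc[101, "Age"]
age_by_position = df.iloc[1, 3]
```

Both selections return the same two rows and two columns, in the order in which they were requested.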