Natural Language Processing (NLP) in Python with 8 Projects - Restaurant Reviews Classification with NLTK: Hands-On Practice

Restaurant_Reviews_Classification_with_NLTK: Hands-On Practice

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

dataset = pd.read_csv("data/Restaurant_Reviews.tsv", delimiter='\t')

dataset

	Review	Liked
0	Wow... Loved this place.	1
1	Crust is not good.	0
2	Not tasty and the texture was just nasty.	0
3	Stopped by during the late May bank holiday of...	1
4	The selection on the menu was great and so wer...	1
...	...	...
995	I think food should have flavor and texture an...	0
996	Appetite instantly gone.	0
997	Overall I was not impressed and would not go b...	0
998	The whole experience was underwhelming, and I ...	0
999	Then, as if I hadn't wasted enough of my life ...	0

1000 rows × 2 columns

dataset.head()

	Review	Liked
0	Wow... Loved this place.	1
1	Crust is not good.	0
2	Not tasty and the texture was just nasty.	0
3	Stopped by during the late May bank holiday of...	1
4	The selection on the menu was great and so wer...	1

dataset.describe()

	Liked
count	1000.00000
mean	0.50000
std	0.50025
min	0.00000
25%	0.00000
50%	0.50000
75%	1.00000
max	1.00000

dataset.info()  # inspect the DataFrame's columns, dtypes, and non-null counts

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Review  1000 non-null   object
 1   Liked   1000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 15.8+ KB

Checking for null values

dataset.isnull().sum()

Review    0
Liked     0
dtype: int64

sns.countplot(x='Liked', data=dataset)

[Figure: count plot of the Liked column (x-axis: Liked, y-axis: count)]

dataset[dataset['Liked']==1]['Liked'].count()

dataset[dataset['Liked']==0]['Liked'].count()
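The two filtered counts above can also be read off in a single call. The sketch below uses a small made-up DataFrame (the name `demo` and its values are illustrative, not the restaurant data) to show `value_counts()` returning the size of each class at once.

```python
import pandas as pd

# Illustrative stand-in for the 'Liked' column; values are made up
demo = pd.DataFrame({"Liked": [1, 0, 0, 1, 1, 0]})

# One call returns the number of rows per class label
counts = demo["Liked"].value_counts()
print(counts)
```

On the real data, `dataset['Liked'].value_counts()` would be the equivalent call.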

from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

import re

Data Preprocessing

stemmer = SnowballStemmer('english')

corpus = []

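# Build the stopword set once instead of rebuilding it for every word
stop_words = set(stopwords.words('english'))

for i in range(len(dataset)):
    # Keep letters only, then lowercase and tokenize on whitespace
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    review = review.lower().split()
    # Drop stopwords and stem the remaining words
    review = [stemmer.stem(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))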

corpus[1]  # the stopwords 'is' and 'not' were removed

'crust good'
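Dropping "not" turns "Crust is not good." into 'crust good', which reads as a positive review even though its label is 0. One possible adjustment is to keep negation words out of the stopword set. The sketch below is only an illustration: `BASE_STOPWORDS` is a tiny hypothetical list standing in for NLTK's English stopwords, `clean_review` is a made-up helper name, and stemming is omitted to keep the example dependency-free.

```python
import re

# Hypothetical mini stopword list (assumption: stands in for NLTK's English list)
BASE_STOPWORDS = {"is", "the", "a", "this", "was", "and", "not", "no"}
# Negation words we choose to keep because they flip the sentiment
NEGATIONS = {"not", "no"}
STOP_WORDS = BASE_STOPWORDS - NEGATIONS


def clean_review(text):
    """Strip non-letters, lowercase, and drop stopwords except negations."""
    words = re.sub("[^a-zA-Z]", " ", text).lower().split()
    return " ".join(w for w in words if w not in STOP_WORDS)


print(clean_review("Crust is not good."))
```

With the notebook's own code, the same idea would amount to subtracting the negation words from `set(stopwords.words('english'))` before the loop. Whether this improves accuracy on this dataset is not something the post measures.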

len(corpus)

corpus[999]

'wast enough life pour salt wound draw time took bring check'

Creating Bag of Words Model

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=1500)

x = cv.fit_transform(corpus).toarray()

x.shape

(1000, 1500)

y = dataset['Liked'].values

from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=17)
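Note that the vocabulary above is fitted on all 1000 reviews before the split, so words that only occur in test reviews still shape the feature space. A possible alternative, sketched here under assumptions, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline so the vocabulary is learned from the training portion only. The toy reviews and labels below are made up for illustration, and `stratify` is an optional extra not used in the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up cleaned reviews and labels (1 = liked, 0 = not liked)
texts = [
    "love place", "great food", "tasti fresh", "friend staff",
    "crust bad", "slow servic", "food cold", "never go back",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first; stratify keeps both classes in each part
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=17, stratify=labels
)

# The vectorizer inside the pipeline is fitted on the training texts only
model = make_pipeline(CountVectorizer(max_features=1500), MultinomialNB())
model.fit(X_tr, y_tr)
preds = model.predict(X_te)
print(preds)
```

On the real data, `corpus` and `y` would take the place of `texts` and `labels`.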

Naive Bayes Classifier (MultinomialNB)

from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()

Training the classifier

classifier.fit(x_train,y_train)

MultinomialNB()

Making Predictions

y_pred = classifier.predict(x_test)

y_train_pred = classifier.predict(x_train)

Evaluating the classifier

from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.74      0.77      0.75        95
           1       0.78      0.75      0.77       105

    accuracy                           0.76       200
   macro avg       0.76      0.76      0.76       200
weighted avg       0.76      0.76      0.76       200
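`confusion_matrix` and `accuracy_score` are imported above but never called. As a minimal sketch of how they could complement the classification report, the example below runs them on made-up label arrays (`y_true_demo` and `y_pred_demo` are illustrative stand-ins for `y_test` and `y_pred`, not the notebook's results).

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Illustrative labels only; not the predictions from the notebook
y_true_demo = [0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
acc = accuracy_score(y_true_demo, y_pred_demo)

print(cm)
print(acc)
```

Running the same two calls on `(y_train, y_train_pred)` and `(y_test, y_pred)` would also show how far training accuracy sits above test accuracy, which is one rough check for overfitting.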

Source: https://www.kaggle.com/satheeshrsm/restaurant-review-classification/notebook
