Simple Linear Regression Practice - Fitting and Interpreting a Simple Linear Regression
import os
import pandas as pd
import numpy as np
import statsmodels.api as sm
# Check the current working directory
os.getcwd()
'C:\\Users\\MyCom\\jupyter-tutorial\\머신러닝과 데이터분석 A-Z 올인원 패키지 Online\\Machine learning의 개념과 종류'
# Load the data
boston = pd.read_csv("Boston_house.csv")
boston.head()
| | AGE | B | RM | CRIM | DIS | INDUS | LSTAT | NOX | PTRATIO | RAD | ZN | TAX | CHAS | Target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 65.2 | 396.90 | 6.575 | 0.00632 | 4.0900 | 2.31 | 4.98 | 0.538 | 15.3 | 1 | 18.0 | 296 | 0 | 24.0 |
1 | 78.9 | 396.90 | 6.421 | 0.02731 | 4.9671 | 7.07 | 9.14 | 0.469 | 17.8 | 2 | 0.0 | 242 | 0 | 21.6 |
2 | 61.1 | 392.83 | 7.185 | 0.02729 | 4.9671 | 7.07 | 4.03 | 0.469 | 17.8 | 2 | 0.0 | 242 | 0 | 34.7 |
3 | 45.8 | 394.63 | 6.998 | 0.03237 | 6.0622 | 2.18 | 2.94 | 0.458 | 18.7 | 3 | 0.0 | 222 | 0 | 33.4 |
4 | 54.2 | 396.90 | 7.147 | 0.06905 | 6.0622 | 2.18 | 5.33 | 0.458 | 18.7 | 3 | 0.0 | 222 | 0 | 36.2 |
# Keep only the feature columns (drop Target)
boston_data = boston.drop(['Target'],axis=1)
# boston_data
# Summary statistics of the feature columns
boston_data.describe()
| | AGE | B | RM | CRIM | DIS | INDUS | LSTAT | NOX | PTRATIO | RAD | ZN | TAX | CHAS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 |
mean | 68.574901 | 356.674032 | 6.284634 | 3.613524 | 3.795043 | 11.136779 | 12.653063 | 0.554695 | 18.455534 | 9.549407 | 11.363636 | 408.237154 | 0.069170 |
std | 28.148861 | 91.294864 | 0.702617 | 8.601545 | 2.105710 | 6.860353 | 7.141062 | 0.115878 | 2.164946 | 8.707259 | 23.322453 | 168.537116 | 0.253994 |
min | 2.900000 | 0.320000 | 3.561000 | 0.006320 | 1.129600 | 0.460000 | 1.730000 | 0.385000 | 12.600000 | 1.000000 | 0.000000 | 187.000000 | 0.000000 |
25% | 45.025000 | 375.377500 | 5.885500 | 0.082045 | 2.100175 | 5.190000 | 6.950000 | 0.449000 | 17.400000 | 4.000000 | 0.000000 | 279.000000 | 0.000000 |
50% | 77.500000 | 391.440000 | 6.208500 | 0.256510 | 3.207450 | 9.690000 | 11.360000 | 0.538000 | 19.050000 | 5.000000 | 0.000000 | 330.000000 | 0.000000 |
75% | 94.075000 | 396.225000 | 6.623500 | 3.677083 | 5.188425 | 18.100000 | 16.955000 | 0.624000 | 20.200000 | 24.000000 | 12.500000 | 666.000000 | 0.000000 |
max | 100.000000 | 396.900000 | 8.780000 | 88.976200 | 12.126500 | 27.740000 | 37.970000 | 0.871000 | 22.000000 | 24.000000 | 100.000000 | 711.000000 | 1.000000 |
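'''
Target data
Boston housing prices (1978 dataset)
Median home value in each of 506 towns (in units of $1,000)
Feature data
CRIM: per-capita crime rate
INDUS: proportion of non-retail business acres
NOX: nitric oxides concentration
RM: average number of rooms per dwelling
LSTAT: percentage of lower-status population
B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents
PTRATIO: pupil-teacher ratio
ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
CHAS: 1 if the tract bounds the Charles River, 0 otherwise
AGE: proportion of owner-occupied units built before 1940
RAD: index of accessibility to radial highways
DIS: weighted distance to Boston employment centers
TAX: property-tax rate per $10,000'''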
Fit a separate simple linear regression for each of three variables: crim, rm, and lstat
target = boston[['Target']]
# boston_target
crim=boston[['CRIM']]
rm=boston[['RM']]
lstat=boston[['LSTAT']]
Linear regression: target ~ crim
crim1 = sm.add_constant(crim, has_constant='add')
model1 = sm.OLS(target,crim1)
fitted_model1=model1.fit()
fitted_model1.summary()
Dep. Variable: | Target | R-squared: | 0.151 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.149 |
Method: | Least Squares | F-statistic: | 89.49 |
Date: | Wed, 12 May 2021 | Prob (F-statistic): | 1.17e-19 |
Time: | 15:52:25 | Log-Likelihood: | -1798.9 |
No. Observations: | 506 | AIC: | 3602. |
Df Residuals: | 504 | BIC: | 3610. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
const | 24.0331 | 0.409 | 58.740 | 0.000 | 23.229 | 24.837 |
CRIM | -0.4152 | 0.044 | -9.460 | 0.000 | -0.501 | -0.329 |
Omnibus: | 139.832 | Durbin-Watson: | 0.713 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 295.404 |
Skew: | 1.490 | Prob(JB): | 7.14e-65 |
Kurtosis: | 5.264 | Cond. No. | 10.1 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
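A few numbers in the summary above can be cross-checked against each other, which helps when reading the table. The sketch below uses only values copied from the table (already rounded), so the results match the reported figures only up to rounding. It relies on two standard facts: with a single predictor the F statistic equals the square of the slope's t statistic, and adjusted R-squared is R-squared penalized for the number of predictors.

```python
# Values copied from the summary table above (already rounded by statsmodels).
n_obs = 506          # No. Observations
n_predictors = 1     # Df Model
r_squared = 0.151    # R-squared
t_crim = -9.460      # t statistic of the CRIM slope

# With a single predictor, the overall F statistic is the square of the slope's t.
f_from_t = t_crim ** 2

# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1).
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

print(round(f_from_t, 2))        # compare with "F-statistic" in the table
print(round(adj_r_squared, 3))   # compare with "Adj. R-squared" in the table
```

Reading the table this way also gives the interpretation: the CRIM slope is negative and clearly different from zero, but the model explains only about 15% of the variance in Target.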
Compute y_hat = beta0 + beta1 * X by hand
(np.dot(crim1,fitted_model1.params))
array([ 24.03048217, 24.02176733, 24.02177563, 24.01966646,
24.00443729, 24.02071274, 23.99644902, 23.97309042,
23.94540138, 23.96250722, 23.93973403, 23.98433377,
23.99416963, 23.77163594, 23.76823138, 23.77261995,
23.59552468, 23.70751396, 23.69982879, 23.73176107,
23.51337514, 23.67934745, 23.52139661, 23.62271965,
23.72160552, 23.68412214, 23.75413567, 23.63627976,
23.71216824, 23.61689868, 23.56360486, 23.4706396 ,
23.45682622, 23.55492323, 23.36347899, 24.00646341,
23.99265003, 23.99983283, 23.96042712, 24.02163447,
24.01915993, 23.98019433, 23.97435675, 23.96694145,
23.98216648, 23.96193426, 23.95490093, 23.9379155 ,
23.92770182, 23.94185981, 23.99626634, 24.01509937,
24.01085198, 24.01242555, 24.02745959, 24.02766303,
24.02457401, 24.02716065, 23.96898004, 23.99022532,
23.97110996, 23.96181385, 23.98732314, 23.9805846 ,
24.02500581, 24.01822575, 24.01492499, 24.00907081,
23.97683128, 23.97989539, 23.99646148, 23.96719057,
23.99505814, 23.95198215, 24.00032275, 23.99361327,
23.99095191, 23.99695556, 24.00966453, 23.99828417,
24.0160294 , 24.01458038, 24.01791436, 24.01836277,
24.0121017 , 24.00929501, 24.0115661 , 24.00341592,
24.0096064 , 24.01109279, 24.01365866, 24.01678089,
24.01565573, 24.02116945, 24.0152779 , 23.98243635,
23.98534268, 23.98293873, 23.99911455, 24.00462412,
23.97138399, 23.98564162, 23.93812725, 23.94524776,
23.97514561, 23.97804364, 23.9620256 , 23.97864567,
23.97995351, 23.92364956, 23.98829469, 23.99123839,
23.98191736, 23.94088411, 23.97402045, 23.96196747,
23.97847544, 23.97042075, 23.97889063, 23.97300323,
24.0044622 , 24.00335779, 23.99449763, 23.97066986,
23.99221408, 23.96293071, 23.87228222, 23.92550961,
23.8979908 , 23.66721974, 23.89191657, 23.53780908,
23.78812315, 23.89616812, 23.62780988, 23.80152134,
23.89914918, 23.88682218, 23.92939164, 23.80702676,
23.91232732, 23.35691068, 22.6542385 , 22.33190553,
22.87898515, 23.04522734, 23.13835037, 23.04967818,
23.06530179, 22.89798841, 23.34530196, 23.41184866,
23.56536111, 23.14078753, 23.4460894 , 22.56540439,
23.01726842, 23.52508765, 23.47557206, 23.44145172,
23.50437796, 23.42553333, 23.2717427 , 23.40242384,
23.1021001 , 22.8190898 , 23.19849483, 23.28564742,
23.07800246, 23.01608513, 23.53179713, 23.07239739,
23.9753366 , 23.99500001, 23.99803505, 24.00543789,
24.00395151, 24.0105821 , 24.00552924, 24.00910818,
24.00575344, 24.00450787, 23.9953114 , 23.99155393,
23.99861217, 24.00799962, 24.00984721, 24.00040994,
23.98087939, 23.99835475, 23.99545672, 24.00441237,
23.99713409, 24.02402596, 24.02713159, 24.0273724 ,
24.01645289, 24.0137334 , 24.0174618 , 24.02002768,
24.02572409, 24.01880287, 24.02406748, 24.018533 ,
24.024765 , 23.97646592, 23.93774112, 23.92848238,
23.97669427, 23.85220362, 23.96067208, 23.87708597,
23.942931 , 23.97476364, 23.91288783, 23.9508902 ,
24.0141735 , 24.00398888, 23.98714876, 23.98567068,
23.88443069, 23.86382895, 23.77421012, 23.77788871,
23.90218422, 23.81432996, 23.87444536, 23.86189001,
23.90930059, 23.84968341, 23.81014899, 23.84088968,
23.79425136, 23.89548305, 23.8471383 , 23.89590655,
23.81696642, 23.82059933, 23.99887789, 23.99469277,
23.98606927, 23.98904618, 23.99038309, 23.98014035,
23.94754376, 23.95366782, 23.89201206, 23.95149222,
23.96485304, 23.95391693, 23.97485498, 23.94421809,
23.99897338, 23.87992587, 24.01309815, 24.01837522,
24.02672055, 23.77920071, 23.75762327, 23.76047148,
23.80885775, 23.81134474, 23.8171491 , 23.69046625,
23.80472246, 23.71688895, 23.70689117, 23.79298503,
23.80869583, 23.99546918, 23.90889785, 23.96579968,
23.98552537, 23.94098376, 24.00967283, 23.9932313 ,
23.9896399 , 24.00766747, 23.99998229, 23.94575844,
24.01825067, 24.01772337, 24.00765916, 24.02687417,
24.02934455, 24.02855569, 24.02494769, 24.01703416,
24.01404894, 24.01526545, 24.01856621, 24.00036427,
24.01809705, 23.9987907 , 23.99906472, 23.97941377,
24.01080215, 23.97455189, 24.00625997, 24.01001744,
24.01476722, 24.01842089, 23.99463464, 23.99158715,
24.01020843, 24.0103579 , 24.00195445, 24.01262899,
23.82842567, 23.88803869, 22.9388805 , 23.70493563,
23.92445503, 23.92126222, 23.87981792, 23.92783053,
23.90096356, 23.93129321, 23.86619138, 23.83569565,
23.96352028, 23.95771177, 23.88731626, 23.91522535,
23.89148892, 23.95344777, 23.90710838, 23.93303286,
24.00563303, 24.00518878, 24.01423993, 24.01225117,
24.01871568, 24.01200205, 24.01758636, 24.01666049,
24.0188776 , 24.02048024, 24.01937998, 24.01028316,
24.00756782, 24.02770455, 24.02273472, 24.02254789,
24.02044702, 24.0201813 , 24.00752215, 24.02534212,
24.02687417, 24.02106981, 24.00731871, 24.00009855,
24.00302979, 24.02601057, 24.01524884, 23.98885104,
20.30346852, 22.43474816, 21.87338184, 22.26385169,
22.14734515, 22.44008751, 22.50594499, 22.2800109 ,
22.5906189 , 22.14155324, 22.49816848, 18.4188202 ,
21.99941285, 21.6789856 , 21.31827659, 20.19994497,
20.60062435, 19.42113105, 16.35283338, 15.8915985 ,
17.68567721, 19.95448863, 14.21460344, 16.61502604,
-12.90894703, 17.44220963, 20.21874479, 20.71470618,
15.69405096, 17.05301026, 13.90503757, 14.65100995,
18.08189329, 20.64858298, 21.14248918, 21.83548327,
19.22607466, 20.44388587, 18.4862471 , 20.41399632,
21.5950881 , 20.84775806, 8.10981167, 19.91585102,
13.63420895, 18.12237434, 20.04906067, 13.73568146,
6.79058608, -4.16694965, 15.43194134, 19.07112564,
20.95908303, 18.03846438, 2.80201916, 18.19939214,
16.22296186, 12.13549661, 5.0397702 , 16.52455607,
19.53485167, 13.26282125, -6.49753724, 19.12875405,
19.42972549, 21.11739508, 19.03081067, 21.10584033,
20.38270343, 17.44806381, 18.9481878 , 8.39625145,
20.97435373, 20.15568984, 20.50725636, 19.85533704,
21.35759926, 21.71590017, 18.25639776, 19.3994166 ,
18.04573021, 17.73168029, 18.35409203, 20.13420789,
14.87770384, 19.99572118, 21.68048444, 19.89509566,
18.71771568, 19.60227857, 21.42236064, 19.91240494,
20.1597587 , 20.90837999, 21.24397414, 21.77399775,
21.91971708, 20.60857939, 20.08313949, 22.05996835,
22.09465335, 20.62830508, 20.81445565, 21.20932651,
22.03515658, 22.49976281, 21.27004809, 21.61622129,
20.77829672, 22.71961021, 22.46577118, 22.19701851,
17.56622696, 18.60445177, 22.22753085, 22.3563976 ,
22.55142493, 22.10376262, 20.68842049, 21.3787449 ,
22.0105441 , 17.79553655, 19.78446406, 18.08189329,
21.61503384, 21.66312533, 21.65358426, 22.8629422 ,
23.04554703, 22.50783411, 21.66994691, 22.025383 ,
23.97047057, 23.95697273, 23.9469708 , 23.98920395,
23.98688719, 23.96114955, 23.91703143, 23.95879127,
23.91286707, 23.92167741, 23.93382587, 23.95927289,
23.93994578, 24.00710281, 24.01431051, 24.00787921,
23.98760547, 24.013422 ])
len(np.dot(crim1,fitted_model1.params))
506
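The same calculation can be reproduced without the fitted model object, as a minimal sketch. It assumes the rounded coefficients from the summary table and the CRIM values of the five rows shown by `boston.head()`; the names `params`, `crim_head`, and `X` are illustrative. Because the coefficients are rounded to four decimals, the results agree with the first five entries of the array above only to about five decimal places.

```python
import numpy as np

# Rounded coefficients from the summary table: [const, CRIM].
params = np.array([24.0331, -0.4152])

# CRIM values of the first five rows shown by boston.head().
crim_head = np.array([0.00632, 0.02731, 0.02729, 0.03237, 0.06905])

# Design matrix with a leading column of ones, mirroring sm.add_constant.
X = np.column_stack([np.ones_like(crim_head), crim_head])

# y_hat = beta0 + beta1 * x, written as a matrix-vector product.
y_hat = X @ params
print(y_hat)
```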
pred1=fitted_model1.predict(crim1)
pred1-np.dot(crim1,fitted_model1.params)
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
5 0.0
6 0.0
7 0.0
8 0.0
9 0.0
10 0.0
11 0.0
12 0.0
13 0.0
14 0.0
15 0.0
16 0.0
17 0.0
18 0.0
19 0.0
20 0.0
21 0.0
22 0.0
23 0.0
24 0.0
25 0.0
26 0.0
27 0.0
28 0.0
29 0.0
...
476 0.0
477 0.0
478 0.0
479 0.0
480 0.0
481 0.0
482 0.0
483 0.0
484 0.0
485 0.0
486 0.0
487 0.0
488 0.0
489 0.0
490 0.0
491 0.0
492 0.0
493 0.0
494 0.0
495 0.0
496 0.0
497 0.0
498 0.0
499 0.0
500 0.0
501 0.0
502 0.0
503 0.0
504 0.0
505 0.0
Length: 506, dtype: float64
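Instead of scanning 506 printed zeros, the comparison can be collapsed into a single boolean. The sketch below uses two small stand-in arrays (`manual` and `predicted` are hypothetical names, and the tiny offset imitates floating-point noise); in the notebook the same check would be applied to `pred1` and `np.dot(crim1, fitted_model1.params)`.

```python
import numpy as np

# Stand-in arrays: the first three fitted values, and a copy with tiny float noise.
manual = np.array([24.03048217, 24.02176733, 24.02177563])
predicted = manual + 1e-13

# np.allclose compares element-wise within a tolerance and returns one boolean.
same = np.allclose(manual, predicted)

# The largest absolute difference is a useful number to report alongside it.
max_abs_diff = float(np.max(np.abs(manual - predicted)))

print(same, max_abs_diff)
```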
Visualize the fitted line
import matplotlib.pyplot as plt
plt.yticks(fontname="Arial")  # set the y-axis tick font
plt.scatter(crim,target,label="data")
plt.plot(crim,pred1,label="result")
plt.legend()
plt.show()
plt.scatter(target,pred1)
plt.xlabel("real_value")
plt.ylabel("pred_value")
plt.show()
fitted_model1.resid.plot()
plt.xlabel("residual_number")
plt.show()
## Compute the sum of the residuals
sum(fitted_model1.resid)
-2.717825964282383e-13
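The sum is zero up to floating-point error because the model contains an intercept: least squares forces the residuals to be orthogonal to every column of the design matrix, including the column of ones. The sketch below illustrates this on synthetic data (not the Boston data) using `np.linalg.lstsq`; the data-generating values are arbitrary assumptions chosen for illustration.

```python
import numpy as np

# Synthetic data with a non-zero intercept (values chosen arbitrarily).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 - 0.5 * x + rng.normal(scale=1.0, size=200)

# Fit WITH an intercept: design matrix has a column of ones.
X_with = np.column_stack([np.ones_like(x), x])
beta_with, *_ = np.linalg.lstsq(X_with, y, rcond=None)
resid_with = y - X_with @ beta_with

# Fit WITHOUT an intercept: regression forced through the origin.
X_without = x.reshape(-1, 1)
beta_without, *_ = np.linalg.lstsq(X_without, y, rcond=None)
resid_without = y - X_without @ beta_without

print(resid_with.sum())      # numerically zero
print(resid_without.sum())   # generally not zero
```

This is also why `sm.add_constant` matters: without the constant column, the residuals are no longer guaranteed to sum to zero.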
Fit simple linear regressions on rm and lstat in the same way
rm1 = sm.add_constant(rm, has_constant='add')
lstat1 = sm.add_constant(lstat, has_constant='add')
model2 = sm.OLS(target,rm1)
fitted_model2=model2.fit()
model3 = sm.OLS(target,lstat1)
fitted_model3=model3.fit()
fitted_model2.summary()
Dep. Variable: | Target | R-squared: | 0.484 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.483 |
Method: | Least Squares | F-statistic: | 471.8 |
Date: | Mon, 12 Aug 2019 | Prob (F-statistic): | 2.49e-74 |
Time: | 16:00:59 | Log-Likelihood: | -1673.1 |
No. Observations: | 506 | AIC: | 3350. |
Df Residuals: | 504 | BIC: | 3359. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
const | -34.6706 | 2.650 | -13.084 | 0.000 | -39.877 | -29.465 |
RM | 9.1021 | 0.419 | 21.722 | 0.000 | 8.279 | 9.925 |
Omnibus: | 102.585 | Durbin-Watson: | 0.684 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 612.449 |
Skew: | 0.726 | Prob(JB): | 1.02e-133 |
Kurtosis: | 8.190 | Cond. No. | 58.4 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
fitted_model3.summary()
Dep. Variable: | Target | R-squared: | 0.544 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.543 |
Method: | Least Squares | F-statistic: | 601.6 |
Date: | Mon, 12 Aug 2019 | Prob (F-statistic): | 5.08e-88 |
Time: | 16:04:22 | Log-Likelihood: | -1641.5 |
No. Observations: | 506 | AIC: | 3287. |
Df Residuals: | 504 | BIC: | 3295. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
const | 34.5538 | 0.563 | 61.415 | 0.000 | 33.448 | 35.659 |
LSTAT | -0.9500 | 0.039 | -24.528 | 0.000 | -1.026 | -0.874 |
Omnibus: | 137.043 | Durbin-Watson: | 0.892 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 291.373 |
Skew: | 1.453 | Prob(JB): | 5.36e-64 |
Kurtosis: | 5.319 | Cond. No. | 29.7 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
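With all three summaries available, the models can be compared side by side. The sketch below copies R-squared, AIC, and the coefficients from the tables above into a plain dictionary; the input values in `example_inputs` are hypothetical and serve only to show how a slope is read as "change in Target per one-unit change in the predictor".

```python
# R-squared, AIC, intercept, and slope copied from the three summary tables.
models = {
    "CRIM":  {"r2": 0.151, "aic": 3602.0, "const": 24.0331,  "slope": -0.4152},
    "RM":    {"r2": 0.484, "aic": 3350.0, "const": -34.6706, "slope": 9.1021},
    "LSTAT": {"r2": 0.544, "aic": 3287.0, "const": 34.5538,  "slope": -0.9500},
}

# Higher R-squared and lower AIC both indicate a better fit.
best_by_r2 = max(models, key=lambda name: models[name]["r2"])
best_by_aic = min(models, key=lambda name: models[name]["aic"])

# Hypothetical inputs, used only to illustrate y_hat = const + slope * x.
example_inputs = {"CRIM": 1.0, "RM": 6.0, "LSTAT": 10.0}
predictions = {
    name: m["const"] + m["slope"] * example_inputs[name]
    for name, m in models.items()
}

print(best_by_r2, best_by_aic)
for name, value in predictions.items():
    print(name, round(value, 3))
```

Among the three single-predictor models, LSTAT fits best on both criteria. The RM slope says that one additional room is associated with a Target about 9.1 (thousand dollars) higher, while each percentage point of LSTAT is associated with a Target about 0.95 lower.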
pred2=fitted_model2.predict(rm1)
pred3=fitted_model3.predict(lstat1)
import matplotlib.pyplot as plt
plt.scatter(rm,target,label="data")
plt.plot(rm,pred2,label="result")
plt.legend()
plt.show()
import matplotlib.pyplot as plt
plt.scatter(lstat,target,label="data")
plt.plot(lstat,pred3,label="result")
plt.legend()
plt.show()
fitted_model2.resid.plot()
plt.xlabel("residual_number")
plt.show()
fitted_model3.resid.plot()
plt.xlabel("residual_number")
plt.show()
fitted_model1.resid.plot(label="crim")
fitted_model2.resid.plot(label="rm")
fitted_model3.resid.plot(label="lstat")
plt.legend()
plt.show()