A model of the linear dependence of \(Y\) on \(X\):
\[Y \sim \beta_0 + \beta_1X\]
In reality we model \(Y\) as:
\[Y \sim \beta_0 + \beta_1X + \varepsilon\]
Once the form of the model is chosen, we will apply various methods (which ones?) to obtain the model coefficients from the data.
\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\]
where \(\hat{y}\) is the prediction of \(Y\) for the value \(x\).
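One standard answer to the question of which method to use is ordinary least squares. A minimal sketch on toy data (the values below are hypothetical), computing the simple-regression coefficients by hand and comparing them with lm():

```r
# Toy data (hypothetical values, for illustration only)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# Closed-form least-squares estimates for simple regression
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# The hand-computed values agree with lm()
fit <- lm(y ~ x)
c(b0, b1)
coef(fit)
```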
When we have many variables (features):
\[Y \sim \beta_0 + \beta_1X_1+ \beta_2X_2 + \dots + \beta_pX_p + \varepsilon\]
Let's load the salary data:
df <- read.csv('https://sebastianzajac.pl/data/Salary_Data.csv')
or load it directly from a local file.
Split the dataset into training and test data.
library(caTools)
## Warning: package 'caTools' was built under R version 3.5.2
set.seed(123)
split <- sample.split(df$Salary, SplitRatio = 2/3)
training_set <- subset(df, split == TRUE)
test_set <- subset(df, split == FALSE)
Sometimes it is useful to rescale the variables:
tr_set_scaled = data.frame(scale(training_set))
test_set_scaled = data.frame(scale(test_set))
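scale() standardizes each column to mean 0 and standard deviation 1. A small check on toy data (hypothetical values):

```r
# scale() centres each column and divides by its standard deviation
m <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))
s <- data.frame(scale(m))
round(colMeans(s), 10)   # each column now has mean 0
apply(s, 2, sd)          # and standard deviation 1
```

Note that scaling the test set separately, as above, uses the test set's own means and standard deviations; a stricter workflow would reuse the training set's statistics, which scale() stores in the "scaled:center" and "scaled:scale" attributes of its result.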
What do we mean by a model?
regressor = lm(formula = Salary ~ YearsExperience,
data = training_set)
# check the model summary
summary(regressor)
##
## Call:
## lm(formula = Salary ~ YearsExperience, data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7325.1 -3814.4 427.7 3559.7 8884.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25592 2646 9.672 1.49e-08 ***
## YearsExperience 9365 421 22.245 1.52e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5391 on 18 degrees of freedom
## Multiple R-squared: 0.9649, Adjusted R-squared: 0.963
## F-statistic: 494.8 on 1 and 18 DF, p-value: 1.524e-14
Create a variable
model_scaled
that fits the same model on the scaled variables.
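One possible solution (a sketch, assuming the tr_set_scaled data frame created above):

```r
# The same simple regression, fitted on the standardized variables
model_scaled <- lm(formula = Salary ~ YearsExperience, data = tr_set_scaled)
summary(model_scaled)
```

On standardized variables the slope of a simple regression equals the sample correlation between the two variables.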
Let's see what our model looks like on the training data:
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.2
ggplot() +
geom_point(aes(x = training_set$YearsExperience, y = training_set$Salary),
colour = 'red') +
geom_line(aes(x = training_set$YearsExperience, y = predict(regressor, newdata = training_set)),
colour = 'blue') +
ggtitle('Salary vs Experience (Training set)') +
xlab('Years of experience') +
ylab('Salary')
Remember: a model's performance should be verified on test data that the training process has not yet seen.
library(ggplot2)
ggplot() +
geom_point(aes(x = test_set$YearsExperience, y = test_set$Salary),
colour = 'red') +
geom_line(aes(x = training_set$YearsExperience, y = predict(regressor, newdata = training_set)),
colour = 'blue') +
ggtitle('Salary vs Experience (Test set)') +
xlab('Years of experience') +
ylab('Salary')
How would you verify the model on the test data?
We need predictions of the modelled values for the test data.
y_pred <- predict(regressor, newdata = test_set)
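Two common ways to quantify test-set performance are the root mean squared error and R²; a sketch, assuming y_pred and test_set from above:

```r
# RMSE: the typical size of a prediction error, in the units of Salary
rmse <- sqrt(mean((test_set$Salary - y_pred)^2))
# R^2 on the test set: fraction of variance explained by the predictions
r2 <- 1 - sum((test_set$Salary - y_pred)^2) /
          sum((test_set$Salary - mean(test_set$Salary))^2)
rmse
r2
```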
Let's analyse a dataset with a larger number of variables.
df2 <- read.csv('https://sebastianzajac.pl/data/50_Startups.csv')
Are these data suitable for modelling?
df2$State = factor(df2$State,
levels = c('New York', 'California', 'Florida'),
labels = c(1, 2, 3))
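Recoding State as a factor matters because lm() expands a factor into dummy variables: the first level becomes the reference, and each remaining level gets a 0/1 indicator column (hence the State2 and State3 rows in the model summary further down). A small illustration:

```r
# A factor with three levels, as for State above
st <- factor(c(1, 2, 3, 2), levels = c(1, 2, 3))
# model.matrix() shows the design matrix lm() would use:
# an intercept plus indicator columns for levels 2 and 3
model.matrix(~ st)
```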
library(caTools)
set.seed(123)
split_2 = sample.split(df2$Profit, SplitRatio = 0.8)
training_set_2 = subset(df2, split_2 == TRUE)
test_set_2 = subset(df2, split_2 == FALSE)
A model with all the variables:
regressor_2 <- lm(formula = Profit ~ .,
data = training_set_2)
summary(regressor_2)
##
## Call:
## lm(formula = Profit ~ ., data = training_set_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33128 -4865 5 6098 18065
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.965e+04 7.637e+03 6.501 1.94e-07 ***
## R.D.Spend 7.986e-01 5.604e-02 14.251 6.70e-16 ***
## Administration -2.942e-02 5.828e-02 -0.505 0.617
## Marketing.Spend 3.268e-02 2.127e-02 1.537 0.134
## State2 1.213e+02 3.751e+03 0.032 0.974
## State3 2.376e+02 4.127e+03 0.058 0.954
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9908 on 34 degrees of freedom
## Multiple R-squared: 0.9499, Adjusted R-squared: 0.9425
## F-statistic: 129 on 5 and 34 DF, p-value: < 2.2e-16
And finally, the predicted values:
y_pred_2 = predict(regressor_2, newdata = test_set_2)
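The summary above suggests that only R.D.Spend is clearly significant. One simple refinement is manual backward elimination: drop the predictor with the largest p-value and refit. A sketch, assuming training_set_2 from above:

```r
# Start from the full model (equivalent to Profit ~ .)
reg_full <- lm(Profit ~ R.D.Spend + Administration + Marketing.Spend + State,
               data = training_set_2)
# State has the largest p-values in the summary, so remove it and refit
reg_step <- update(reg_full, . ~ . - State)
summary(reg_step)
```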
Boston
#install.packages('ISLR')
#install.packages("MASS")
library(MASS)
## Warning: package 'MASS' was built under R version 3.5.2
library(ISLR)
# interactively inspect the data (opens an editor window)
#fix(Boston)
# the Boston variable names
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
Models for the Boston data.
lm.fit <- lm(medv~lstat, data=Boston)
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
# model coefficients
lm.fit$coefficients
## (Intercept) lstat
## 34.5538409 -0.9500494
# the coefficients once more
coef(lm.fit)
## (Intercept) lstat
## 34.5538409 -0.9500494
confint(lm.fit)
## 2.5 % 97.5 %
## (Intercept) 33.448457 35.6592247
## lstat -1.026148 -0.8739505
# predictions for chosen values of lstat
predict(lm.fit,data.frame(lstat=(c(5,10,15))), interval ="confidence")
## fit lwr upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461
predict(lm.fit,data.frame(lstat=(c(5,10,15))), interval ="prediction")
## fit lwr upr
## 1 29.80359 17.565675 42.04151
## 2 25.05335 12.827626 37.27907
## 3 20.30310 8.077742 32.52846
# plot
plot(Boston$lstat, Boston$medv)
abline(lm.fit)
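Base R also provides standard diagnostic plots for an lm object (residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage):

```r
library(MASS)                 # for the Boston data
lm.fit <- lm(medv ~ lstat, data = Boston)
par(mfrow = c(2, 2))          # 2x2 grid of panels
plot(lm.fit)                  # the four diagnostic plots
par(mfrow = c(1, 1))          # restore the default layout
```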
With several variables:
lm.fit2=lm(medv~lstat+age,data=Boston)
summary(lm.fit2)
##
## Call:
## lm(formula = medv ~ lstat + age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.981 -3.978 -1.283 1.968 23.158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.22276 0.73085 45.458 < 2e-16 ***
## lstat -1.03207 0.04819 -21.416 < 2e-16 ***
## age 0.03454 0.01223 2.826 0.00491 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared: 0.5513, Adjusted R-squared: 0.5495
## F-statistic: 309 on 2 and 503 DF, p-value: < 2.2e-16
With all the variables:
lm.fit3=lm(medv~.,data=Boston)
summary(lm.fit3)
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## black 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
With an interaction term:
lm.fit4=lm(medv~lstat*age,data=Boston)
summary(lm.fit4)
##
## Call:
## lm(formula = medv ~ lstat * age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.806 -4.045 -1.333 2.085 27.552
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.0885359 1.4698355 24.553 < 2e-16 ***
## lstat -1.3921168 0.1674555 -8.313 8.78e-16 ***
## age -0.0007209 0.0198792 -0.036 0.9711
## lstat:age 0.0041560 0.0018518 2.244 0.0252 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.149 on 502 degrees of freedom
## Multiple R-squared: 0.5557, Adjusted R-squared: 0.5531
## F-statistic: 209.3 on 3 and 502 DF, p-value: < 2.2e-16
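The formula lstat * age is shorthand for lstat + age + lstat:age, i.e. both main effects plus the interaction; the two spellings fit the identical model:

```r
library(MASS)  # for the Boston data
fit_a <- lm(medv ~ lstat * age, data = Boston)
fit_b <- lm(medv ~ lstat + age + lstat:age, data = Boston)
all.equal(coef(fit_a), coef(fit_b))  # TRUE
```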
Training and test data
train <- read.csv('https://sebastianzajac.pl/data/train.csv')
test <- read.csv('https://sebastianzajac.pl/data/test.csv')
Check the structure of the data frame with str(train).
Let's look at the number of people who survived:
table(train$Survived)
##
## 0 1
## 549 342
prop.table(table(train$Survived))
##
## 0 1
## 0.6161616 0.3838384
Create a new variable in which everyone is recorded as having died (they get a 0):
test$Survived <- rep(0, 418)
If you have a Kaggle account, you can submit the obtained result:
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "sz1.csv", row.names = FALSE)
Let's examine the sex variable:
summary(train$Sex)
## female male
## 314 577
prop.table(table(train$Sex, train$Survived))
##
## 0 1
## female 0.09090909 0.26150393
## male 0.52525253 0.12233446
prop.table(table(train$Sex, train$Survived), 1)
##
## 0 1
## female 0.2579618 0.7420382
## male 0.8110919 0.1889081
Let's fill in the answer in two steps:
# Create new column in test set with our prediction that everyone dies
test$Survived <- 0
# Update the prediction to say that all females will survive
test$Survived[test$Sex == 'female'] <- 1
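We can check how well this rule does on the training data; from the proportion table above, females who survived plus males who died give roughly 0.26 + 0.53 ≈ 0.79 accuracy. Assuming the train data frame loaded earlier:

```r
# Accuracy of the "all females survive" rule on the training set
pred <- ifelse(train$Sex == 'female', 1, 0)
mean(pred == train$Survived)
```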
# Look at age patterns
summary(train$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.42 20.12 28.00 29.70 38.00 80.00 177
train$Child <- 0
train$Child[train$Age < 18] <- 1
aggregate(Survived ~ Child + Sex, data=train, FUN=sum)
## Child Sex Survived
## 1 0 female 195
## 2 1 female 38
## 3 0 male 86
## 4 1 male 23
aggregate(Survived ~ Child + Sex, data=train, FUN=length)
## Child Sex Survived
## 1 0 female 259
## 2 1 female 55
## 3 0 male 519
## 4 1 male 58
aggregate(Survived ~ Child + Sex, data=train, FUN=function(x) {sum(x)/length(x)})
## Child Sex Survived
## 1 0 female 0.7528958
## 2 1 female 0.6909091
## 3 0 male 0.1657033
## 4 1 male 0.3965517
# Look at class and fare patterns
train$Fare2 <- '30+'
train$Fare2[train$Fare < 30 & train$Fare >= 20] <- '20-30'
train$Fare2[train$Fare < 20 & train$Fare >= 10] <- '10-20'
train$Fare2[train$Fare < 10] <- '<10'
aggregate(Survived ~ Fare2 + Pclass + Sex, data=train, FUN=function(x) {sum(x)/length(x)})
## Fare2 Pclass Sex Survived
## 1 20-30 1 female 0.8333333
## 2 30+ 1 female 0.9772727
## 3 10-20 2 female 0.9142857
## 4 20-30 2 female 0.9000000
## 5 30+ 2 female 1.0000000
## 6 <10 3 female 0.5937500
## 7 10-20 3 female 0.5813953
## 8 20-30 3 female 0.3333333
## 9 30+ 3 female 0.1250000
## 10 <10 1 male 0.0000000
## 11 20-30 1 male 0.4000000
## 12 30+ 1 male 0.3837209
## 13 <10 2 male 0.0000000
## 14 10-20 2 male 0.1587302
## 15 20-30 2 male 0.1600000
## 16 30+ 2 male 0.2142857
## 17 <10 3 male 0.1115385
## 18 10-20 3 male 0.2368421
## 19 20-30 3 male 0.1250000
## 20 30+ 3 male 0.2400000
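The manual binning of Fare above can also be done in a single step with cut(); the breaks below reproduce the same right-open intervals:

```r
# [-Inf, 10), [10, 20), [20, 30), [30, Inf) -- right = FALSE makes
# each interval closed on the left, matching the manual thresholds
fare2 <- cut(train$Fare,
             breaks = c(-Inf, 10, 20, 30, Inf),
             labels = c('<10', '10-20', '20-30', '30+'),
             right = FALSE)
table(fare2)
```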
# Create new column in test set with our prediction that everyone dies
test$Survived <- 0
# Update the prediction to say that all females will survive
test$Survived[test$Sex == 'female'] <- 1
# Update once more to say that females who pay more for a third class fare also perish
test$Survived[test$Sex == 'female' & test$Pclass == 3 & test$Fare >= 20] <- 0
#### New variables: random forests