Modeling

Linear regression

A model of the linear dependence of the quantity \(Y\) on \(X\):

\[Y \sim \beta_0 + \beta_1X\]

In reality we model \(Y\) by means of:

\[Y \sim \beta_0 + \beta_1X + \varepsilon\]

Once the form of the model has been chosen, we will apply various methods (which ones?) to obtain the model coefficients from the data.

\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\]

where \(\hat{y}\) is the prediction of \(Y\) for the value \(x\).
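
For simple regression, the least-squares estimates have a closed form. A minimal sketch computing them by hand; the vectors x and y below are made-up example data, not the salary set:

# least-squares estimates computed directly from the formulas
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)  # should match coef(lm(y ~ x))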

When we have many variables (features), the model becomes:

\[Y \sim \beta_0 + \beta_1X_1+ \beta_2X_2 + \dots + \beta_pX_p + \varepsilon\]

Let's load the salary data.

df <- read.csv('https://sebastianzajac.pl/data/Salary_Data.csv')

or directly from a local file.

Split the data set into training and test data.

library(caTools)
set.seed(123)
split <- sample.split(df$Salary, SplitRatio = 2/3)
training_set <- subset(df, split == TRUE)
test_set <- subset(df, split == FALSE)

Sometimes it is useful to rescale the variables:

tr_set_scaled <- data.frame(scale(training_set))
test_set_scaled <- data.frame(scale(test_set))
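
Note that scaling the test set with its own statistics lets test information leak into the features. A safer sketch reuses the training set's means and standard deviations (this assumes all columns are numeric, as they are here):

# scale the test set using the training set's parameters
means <- sapply(training_set, mean)
sds <- sapply(training_set, sd)
test_set_scaled <- data.frame(scale(test_set, center = means, scale = sds))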

What do we mean by a model?

regressor <- lm(formula = Salary ~ YearsExperience,
                data = training_set)

# check information about the model
summary(regressor)
## 
## Call:
## lm(formula = Salary ~ YearsExperience, data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7325.1 -3814.4   427.7  3559.7  8884.6 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        25592       2646   9.672 1.49e-08 ***
## YearsExperience     9365        421  22.245 1.52e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5391 on 18 degrees of freedom
## Multiple R-squared:  0.9649, Adjusted R-squared:  0.963 
## F-statistic: 494.8 on 1 and 18 DF,  p-value: 1.524e-14

Create a variable model_scaled implementing the same model for the scaled variables.
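
One possible answer, as a sketch:

model_scaled <- lm(formula = Salary ~ YearsExperience,
                   data = tr_set_scaled)
summary(model_scaled)  # on standardized data the slope equals the correlation; t-values and R-squared are unchanged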

Let's see what our model looks like on the training data:

library(ggplot2)
ggplot() +
  geom_point(aes(x = training_set$YearsExperience, y = training_set$Salary),
             colour = 'red') +
  geom_line(aes(x = training_set$YearsExperience, y = predict(regressor, newdata = training_set)),
            colour = 'blue') +
  ggtitle('Salary vs Experience (Training set)') +
  xlab('Years of experience') +
  ylab('Salary')

Remember: model performance should be verified on test data that the learning process has not yet seen.

library(ggplot2)
ggplot() +
  geom_point(aes(x = test_set$YearsExperience, y = test_set$Salary),
             colour = 'red') +
  geom_line(aes(x = training_set$YearsExperience, y = predict(regressor, newdata = training_set)),
            colour = 'blue') +
  ggtitle('Salary vs Experience (Test set)') +
  xlab('Years of experience') +
  ylab('Salary')

How would you verify the model on the test data?

We need predictions of the modeled values for the test data.

y_pred <- predict(regressor, newdata = test_set)
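
A minimal sketch of two common error measures on the test set (the names rmse and r2 are ours):

# root mean squared error and R-squared on the test data
rmse <- sqrt(mean((test_set$Salary - y_pred)^2))
sse <- sum((test_set$Salary - y_pred)^2)
sst <- sum((test_set$Salary - mean(test_set$Salary))^2)
r2 <- 1 - sse / sst
c(RMSE = rmse, R2 = r2)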

Let's now analyze a data set with a larger number of variables.

df2 <- read.csv('https://sebastianzajac.pl/data/50_Startups.csv')

Are these data suitable for modeling?
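
Before answering, inspect the structure: State is a text column, so it is encoded as a factor below.

str(df2)      # column types
summary(df2)  # ranges and potential missing values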

df2$State <- factor(df2$State,
                    levels = c('New York', 'California', 'Florida'),
                    labels = c(1, 2, 3))
library(caTools)
set.seed(123)
split_2 <- sample.split(df2$Profit, SplitRatio = 0.8)
training_set_2 <- subset(df2, split_2 == TRUE)
test_set_2 <- subset(df2, split_2 == FALSE)

A model with all the variables:

regressor_2 <- lm(formula = Profit ~ .,
                  data = training_set_2)

summary(regressor_2)
## 
## Call:
## lm(formula = Profit ~ ., data = training_set_2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33128  -4865      5   6098  18065 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.965e+04  7.637e+03   6.501 1.94e-07 ***
## R.D.Spend        7.986e-01  5.604e-02  14.251 6.70e-16 ***
## Administration  -2.942e-02  5.828e-02  -0.505    0.617    
## Marketing.Spend  3.268e-02  2.127e-02   1.537    0.134    
## State2           1.213e+02  3.751e+03   0.032    0.974    
## State3           2.376e+02  4.127e+03   0.058    0.954    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9908 on 34 degrees of freedom
## Multiple R-squared:  0.9499, Adjusted R-squared:  0.9425 
## F-statistic:   129 on 5 and 34 DF,  p-value: < 2.2e-16
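
Only R.D.Spend is clearly significant here. A sketch of one step of manual backward elimination, keeping the two spending variables (regressor_opt is an arbitrary name):

# refit after dropping the weakest predictors
regressor_opt <- lm(formula = Profit ~ R.D.Spend + Marketing.Spend,
                    data = training_set_2)
summary(regressor_opt)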

And finally, the prediction of values with the full model:

y_pred_2 <- predict(regressor_2, newdata = test_set_2)
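
A quick sketch comparing the predictions with the actual profits:

data.frame(actual = test_set_2$Profit,
           predicted = y_pred_2,
           error = test_set_2$Profit - y_pred_2)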

Boston

#install.packages('ISLR')
#install.packages("MASS")
library(MASS)
library(ISLR)
# preview the data (fix() opens a spreadsheet-style editor)
#fix(Boston)
# the Boston variable names
names(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

Models for the Boston data.

lm.fit <- lm(medv ~ lstat, data = Boston)
summary(lm.fit)
## 
## Call:
## lm(formula = medv ~ lstat, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
# model coefficients
lm.fit$coefficients
## (Intercept)       lstat 
##  34.5538409  -0.9500494
# the coefficients again, via the accessor
coef(lm.fit)
## (Intercept)       lstat 
##  34.5538409  -0.9500494
# confidence intervals for the coefficients
confint(lm.fit)
##                 2.5 %     97.5 %
## (Intercept) 33.448457 35.6592247
## lstat       -1.026148 -0.8739505
# predictions for chosen values of lstat, with confidence intervals
predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "confidence")
##        fit      lwr      upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461
predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "prediction")
##        fit       lwr      upr
## 1 29.80359 17.565675 42.04151
## 2 25.05335 12.827626 37.27907
## 3 20.30310  8.077742 32.52846
# plot of medv against lstat with the fitted line
plot(Boston$lstat, Boston$medv)
abline(lm.fit)
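
Base R also provides standard diagnostic plots for an lm object; a sketch:

# residuals vs fitted, normal QQ plot, scale-location, leverage
par(mfrow = c(2, 2))
plot(lm.fit)
par(mfrow = c(1, 1))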

With several variables:

lm.fit2 <- lm(medv ~ lstat + age, data = Boston)
summary(lm.fit2)
## 
## Call:
## lm(formula = medv ~ lstat + age, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.981  -3.978  -1.283   1.968  23.158 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.22276    0.73085  45.458  < 2e-16 ***
## lstat       -1.03207    0.04819 -21.416  < 2e-16 ***
## age          0.03454    0.01223   2.826  0.00491 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared:  0.5513, Adjusted R-squared:  0.5495 
## F-statistic:   309 on 2 and 503 DF,  p-value: < 2.2e-16

With all variables:

lm.fit3 <- lm(medv ~ ., data = Boston)
summary(lm.fit3)
## 
## Call:
## lm(formula = medv ~ ., data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.595  -2.730  -0.518   1.777  26.199 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
## crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
## zn           4.642e-02  1.373e-02   3.382 0.000778 ***
## indus        2.056e-02  6.150e-02   0.334 0.738288    
## chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
## nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
## rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
## age          6.922e-04  1.321e-02   0.052 0.958229    
## dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
## rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
## tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
## ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
## black        9.312e-03  2.686e-03   3.467 0.000573 ***
## lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338 
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16
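
Since age has a very high p-value, we can refit without it; a sketch (lm.fit3a is an arbitrary name):

# all predictors except age; update(lm.fit3, ~ . - age) works too
lm.fit3a <- lm(medv ~ . - age, data = Boston)
summary(lm.fit3a)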

With an interaction between variables:

lm.fit4 <- lm(medv ~ lstat * age, data = Boston)
summary(lm.fit4)
## 
## Call:
## lm(formula = medv ~ lstat * age, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.806  -4.045  -1.333   2.085  27.552 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36.0885359  1.4698355  24.553  < 2e-16 ***
## lstat       -1.3921168  0.1674555  -8.313 8.78e-16 ***
## age         -0.0007209  0.0198792  -0.036   0.9711    
## lstat:age    0.0041560  0.0018518   2.244   0.0252 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.149 on 502 degrees of freedom
## Multiple R-squared:  0.5557, Adjusted R-squared:  0.5531 
## F-statistic: 209.3 on 3 and 502 DF,  p-value: < 2.2e-16
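
Interactions are one extension; polynomial terms are another. A sketch adding a quadratic term in lstat (I() protects the arithmetic inside the formula) and comparing the nested models with an F test:

lm.fit5 <- lm(medv ~ lstat + I(lstat^2), data = Boston)
summary(lm.fit5)
anova(lm.fit, lm.fit5)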

Logistic regression

Titanic

Training and test data

train <- read.csv('https://sebastianzajac.pl/data/train.csv')
test <- read.csv('https://sebastianzajac.pl/data/test.csv')

Kill all

Check the structure of the data frame with str(train).

Let's look at the number of people who survived:

table(train$Survived)
## 
##   0   1 
## 549 342
prop.table(table(train$Survived))
## 
##         0         1 
## 0.6161616 0.3838384

Create a new variable recording that everyone died (everyone gets a 0):

test$Survived <- rep(0, nrow(test))  # the test set has 418 rows

If you have a Kaggle account, you can submit the results you obtained:

submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "sz1.csv", row.names = FALSE)

Save all women

Let's check the sex variable:

summary(train$Sex)
## female   male 
##    314    577
prop.table(table(train$Sex, train$Survived))
##         
##                   0          1
##   female 0.09090909 0.26150393
##   male   0.52525253 0.12233446
prop.table(table(train$Sex, train$Survived), 1)
##         
##                  0         1
##   female 0.2579618 0.7420382
##   male   0.8110919 0.1889081

Let's enter the answer in two steps:

# Create new column in test set with our prediction that everyone dies
test$Survived <- 0
# Update the prediction to say that all females will survive
test$Survived[test$Sex == 'female'] <- 1

# Look at age patterns
summary(train$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.42   20.12   28.00   29.70   38.00   80.00     177
train$Child <- 0
train$Child[train$Age < 18] <- 1
aggregate(Survived ~ Child + Sex, data=train, FUN=sum)
##   Child    Sex Survived
## 1     0 female      195
## 2     1 female       38
## 3     0   male       86
## 4     1   male       23
aggregate(Survived ~ Child + Sex, data=train, FUN=length)
##   Child    Sex Survived
## 1     0 female      259
## 2     1 female       55
## 3     0   male      519
## 4     1   male       58
aggregate(Survived ~ Child + Sex, data=train, FUN=function(x) {sum(x)/length(x)})
##   Child    Sex  Survived
## 1     0 female 0.7528958
## 2     1 female 0.6909091
## 3     0   male 0.1657033
## 4     1   male 0.3965517
# Look at class and fare patterns
train$Fare2 <- '30+'
train$Fare2[train$Fare < 30 & train$Fare >= 20] <- '20-30'
train$Fare2[train$Fare < 20 & train$Fare >= 10] <- '10-20'
train$Fare2[train$Fare < 10] <- '<10'
aggregate(Survived ~ Fare2 + Pclass + Sex, data=train, FUN=function(x) {sum(x)/length(x)})
##    Fare2 Pclass    Sex  Survived
## 1  20-30      1 female 0.8333333
## 2    30+      1 female 0.9772727
## 3  10-20      2 female 0.9142857
## 4  20-30      2 female 0.9000000
## 5    30+      2 female 1.0000000
## 6    <10      3 female 0.5937500
## 7  10-20      3 female 0.5813953
## 8  20-30      3 female 0.3333333
## 9    30+      3 female 0.1250000
## 10   <10      1   male 0.0000000
## 11 20-30      1   male 0.4000000
## 12   30+      1   male 0.3837209
## 13   <10      2   male 0.0000000
## 14 10-20      2   male 0.1587302
## 15 20-30      2   male 0.1600000
## 16   30+      2   male 0.2142857
## 17   <10      3   male 0.1115385
## 18 10-20      3   male 0.2368421
## 19 20-30      3   male 0.1250000
## 20   30+      3   male 0.2400000
# Create new column in test set with our prediction that everyone dies
test$Survived <- 0
# Update the prediction to say that all females will survive
test$Survived[test$Sex == 'female'] <- 1
# Update once more to say that females who pay more for a third class fare also perish
test$Survived[test$Sex == 'female' & test$Pclass == 3 & test$Fare >= 20] <- 0
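
The rules above are hand-built classifiers. An actual logistic regression estimates these effects in a single model; a sketch using only columns with no missing values (adding Age would require handling its NAs first):

# logistic regression: survival as a function of sex and class
logit_fit <- glm(Survived ~ Sex + Pclass, family = binomial, data = train)
summary(logit_fit)
# predicted survival probabilities for the test set, thresholded at 0.5
p <- predict(logit_fit, newdata = test, type = "response")
test$Survived <- ifelse(p > 0.5, 1, 0)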

Decision tree
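
A sketch of a decision tree on the same Titanic data, assuming the rpart package is available; rpart can handle the missing Age values through surrogate splits:

library(rpart)
tree_fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare,
                  data = train, method = "class")
pred_tree <- predict(tree_fit, newdata = test, type = "class")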

New variables

Random forests
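
A sketch of a random forest on similar features, assuming the randomForest package; unlike rpart, randomForest does not accept missing values, so Age is left out here:

library(randomForest)
set.seed(123)
rf_fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + SibSp + Parch + Fare,
                       data = train, ntree = 500, importance = TRUE)
varImpPlot(rf_fit)  # which variables matter most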
