Chapter 2 (Exercises)

Statistical Learning

Author: Aditya Dahiya

Published: January 31, 2021

2.4 Exercises (Chapter 2: ISLR)

Conceptual

Question 1

(a)

When the sample size \(n\) is extremely large and the number of predictors \(p\) is small, a flexible statistical learning method will perform better than an inflexible one, because

  • A small \(p\) allows us to avoid the curse of dimensionality.
  • A large \(n\) allows a flexible method to estimate \(f\) accurately, lowering the chance of overfitting.

(b)

When the sample size \(n\) is very small and the number of predictors \(p\) is very large, a flexible statistical learning method will perform worse than an inflexible one, because a large number of predictors worsens the curse of dimensionality, making it difficult to identify nearest neighbors. A small number of observations also means that a highly flexible model could have high variance, i.e. overfit.

(c)

When the relationship between predictors and response is highly non-linear, a flexible model will perform much better because it can better allow for non-linear \(f(x)\), and thus will better approximate a highly non-linear relationship.

(d)

When the variance of the error terms, i.e. \(\sigma^2 = Var(\epsilon)\), is extremely high, both flexible and inflexible methods will give inaccurate predictions. Comparatively, however, an inflexible method will perform better, as it is less likely to overfit, i.e. to mistake noise for the actual relationship.
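
A small simulation sketch (my own, with arbitrary parameters, not part of the exercise) makes point (d) concrete: when the noise is large, a very flexible fit typically chases the noise and predicts new data worse than a rigid linear fit:

set.seed(42)
n     <- 100
x     <- runif(n, 0, 10)
y     <- 2 + 0.5 * x + rnorm(n, sd = 5)       # linear signal, very high noise
x_new <- runif(n, 0, 10)                      # fresh test data
y_new <- 2 + 0.5 * x_new + rnorm(n, sd = 5)

fit_rigid <- lm(y ~ x)                        # inflexible method
fit_flex  <- smooth.spline(x, y, df = 30)     # highly flexible method

mean((y_new - predict(fit_rigid, data.frame(x = x_new)))^2)  # test MSE, rigid
mean((y_new - predict(fit_flex, x_new)$y)^2)                 # test MSE, flexible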

Question 2

(a)

  • This scenario is a regression problem, because the response / outcome is a continuous variable, i.e. CEO Salary.
  • Here, we are primarily interested in inference, rather than prediction, because we want to understand which factors affect CEO salary, rather than predicting the salary for any given CEO.
  • In this scenario, \(n = 500\) and \(p = 3\) .

(b)

  • This scenario is a classification problem, as the primary response is a qualitative variable (success vs. failure).
  • Here, we are primarily interested in prediction, because we wish to know the response for our test data, i.e. the new product being launched, rather than understanding the factors behind the response.
  • In this scenario, \(n = 20\) and \(p = 13\) .

(c)

  • This scenario is a regression problem, because the response / outcome is a continuous variable, i.e. percentage change in US dollar.
  • Here, we are primarily interested in prediction instead of inference, since our intended purpose is to predict the % change in the US dollar. We are not concerned with the exact factors that cause such a change.
  • In this scenario, \(n = 52\) (i.e. number of weeks in 2012) and \(p = 3\) .

Question 3

(a)

The sketch is displayed below:

[Figure: sketch of five curves against model flexibility - squared bias (E), variance (B), training error (D), test error (A), and irreducible error (C)]

(b)

In this figure, curve (C) is the Irreducible Error, which is a property of the data set and thus stays constant. Curve (D) represents the Training Error, which decreases monotonically with increasing flexibility of the fitted model, because almost all techniques directly or indirectly aim at reducing the MSE on the training data. As the flexibility of the method increases, its variance increases [curve (B)] and its squared bias decreases [curve (E)], because a flexible model is highly variable (a change in a single observation can change the fit) and less biased (each observation is closely followed). The sum of curves (B), (C) and (E) gives the Test Error, i.e. curve (A).
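
For reference, this figure reflects the bias-variance decomposition of the expected test MSE at a point \(x_0\) (equation 2.7 in the book):

\[ E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = Var\left(\hat{f}(x_0)\right) + \left[Bias\left(\hat{f}(x_0)\right)\right]^2 + Var(\epsilon) \]

which is exactly why curve (A) is the sum of curves (B), (E) and (C).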

Question 4

(a)

Three real-life applications in which classification is useful are:
Email Spam Detection
1. Response: A qualitative binary indicator whether the email is spam or not spam.
2. Predictors: Presence of common words in subject of the email (such as offer, lottery etc.), Presence of name of email account holder in text of email, Unverified attachments, Relative frequency of commonly used words.
3. Goal of the application: Prediction.


Handwriting recognition: ZIP codes on postal envelopes
1. Response: Each of the 5 digits of the ZIP code - a categorical outcome with 10 possible classes (0-9).
2. Predictors: A matrix corresponding to an image, where each pixel is an entry in the matrix, with intensity ranging from 0 (black) to 255 (white).
3. Goal of the application: Prediction


Factors affecting consumer purchase amongst competing goods: which factors most affect a consumer's purchase of a particular product amongst competing brands.
1. Response: Which one of the competing products does a consumer buy?
2. Predictors: Area, demographics of buyer, advertising revenues etc.
3. Goal of the application: Inference.

(b)

Three real-life applications in which regression is useful are:

1. Predicting response to dosage of BP medication
i) Response: Blood Pressure (in mmHg)
ii) Predictors: Dose of a given medication, age, vital indicators etc.
iii) Goal of the application: Prediction


2. Factors affecting Crop Yields
i) Response: Crop Yield in units per area
ii) Predictors: Amount of fertilizer applied, Soil pH, Temperature, Machines used, etc.
iii) Goal of the application: Both inference (which inputs yield maximum benefits) and prediction (expected crop yield in a particular season)


3. Increasing operational efficiency in a production line by identifying low-hanging fruit
i) Response: Factory output of a good
ii) Predictors: Various raw materials, number of workers, capital investment, working hours, etc.
iii) Goal of the application: Inference (identifying spending on which input / predictor will cause the maximum increase in output).

(c)

Three real-life applications in which cluster analysis is useful are:
1. Social Network Clusters: Identifying like-minded Twitter users based on their activity.
2. Finding Similar books for a book recommendation software based on past purchase data from users.
3. Gene mapping in evolutionary biology to find clusters of similar genes within a genome / evolutionary lineage.

Question 5

  1. A very flexible approach has the following advantages and disadvantages:
  • Advantages: Allows fitting of highly non-linear relationships. Better performance when \(n\) is large and \(p\) is small.
  • Disadvantages: Requires estimation of a large number of parameters. Low interpretability. Can lead to overfitting. Does not perform well when \(n\) is very small or the data are high-dimensional, i.e. \(p\) is large.
  2. A more flexible approach may be preferred when we have a lot of data (high \(n\)), low dimensionality (low \(p\)), or when we are mainly interested in prediction of a non-linear relationship.

  3. A less flexible approach is preferred when our objective is inference and interpretability of the results, or when the data are high-dimensional or \(n\) is low.

Question 6

A parametric statistical learning approach assumes a functional form for \(f(x)\), i.e. it pre-supposes the form of the relationship between the predictors and the response. A non-parametric approach, on the other hand, assumes no functional form for \(f\) at all; it simply tries to fit the available data as closely as possible, and therefore requires a very large number of observations.


Parametric approach advantages:

1) Easier to compute and evaluate

2) Interpretation is simpler

3) Better for inference tasks.

4) Can be used with a low number of observations.

Parametric approach disadvantages:

1) If the true relationship is far from the assumed functional form, the method will have a high error rate.

2) If too flexible a model is fitted, it will lead to overfitting.

3) Real-life relations are rarely simple functional forms.
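
As an illustration (my own sketch, not part of the exercise), the code below contrasts a parametric linear fit with a non-parametric loess fit on simulated data whose true \(f\) is non-linear; because the assumed linear form is far from the truth, the parametric fit typically shows the larger error:

set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)   # true f is non-linear

fit_param    <- lm(y ~ x)            # parametric: assumes a linear form
fit_nonparam <- loess(y ~ x)         # non-parametric: no assumed form

mean(residuals(fit_param)^2)         # training MSE, linear fit
mean(residuals(fit_nonparam)^2)      # training MSE, loess fit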

Question 7

(a)

library(dplyr)        # for mutate()
library(kableExtra)   # for kbl() and kable_classic_2()

data <- read.csv("docs/ISLRCh2Ex2-4Q7.csv") |>
  mutate(y = as.factor(y)) |>
  mutate(dist_000 = sqrt(x1^2 + x2^2 + x3^2))   # Euclidean distance from (0, 0, 0)
data |> kbl() |> kable_classic_2(full_width = F)
x1 x2 x3 y dist_000
0 3 0 red 3.000000
2 0 0 red 2.000000
0 1 3 red 3.162278
0 1 2 green 2.236068
-1 0 1 green 1.414214
1 1 1 red 1.732051

The Euclidean distance between each observation and the test point \((0, 0, 0)\) is displayed above; for the six points it is

3.00 2.00 3.16 2.24 1.41 1.73

(b)

With \(K=1\), our prediction for \(y\) is “green”, as the nearest neighbor is “green”.

(c)

With \(K=3\), our prediction for \(y\) is “red”, as the nearest neighbors are 1 “green” and 2 “red”.
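
These answers can be verified programmatically; below is a minimal sketch using knn() from the class package (the package choice is my own; any KNN implementation would do), with the data frame built in part (a):

library(class)
train <- data[, c("x1", "x2", "x3")]
knn(train, test = c(0, 0, 0), cl = data$y, k = 1)   # expected: green
knn(train, test = c(0, 0, 0), cl = data$y, k = 3)   # expected: red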

(d)

If the Bayes Decision Boundary is highly non-linear, then we would expect the best value of \(K\) to be small, because a smaller \(K\) allows us more flexibility in the estimated decision boundary.

Applied

Question 8

(a)

library(ISLR)   # provides the College data set
data(College)
college <- College
rm(College)

(b)

# Note: when College is loaded from the ISLR package, the institution
# names are already the row names. The step below is needed only if the
# data are read from College.csv, where the names form the first column:
# rownames(college) <- college[, 1]
# college <- college[, -1]
fix(college)

(c)

summary(college)
 Private        Apps           Accept          Enroll       Top10perc    
 No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
 Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
           Median : 1558   Median : 1110   Median : 434   Median :23.00  
           Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
           3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
           Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
   Top25perc      F.Undergrad     P.Undergrad         Outstate    
 Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
 1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
 Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
 Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
 3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
 Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
   Room.Board       Books           Personal         PhD        
 Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
 1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
 Median :4200   Median : 500.0   Median :1200   Median : 75.00  
 Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
 3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
 Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
    Terminal       S.F.Ratio      perc.alumni        Expend     
 Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
 1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
 Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
 Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
 3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
 Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
   Grad.Rate     
 Min.   : 10.00  
 1st Qu.: 53.00  
 Median : 65.00  
 Mean   : 65.46  
 3rd Qu.: 78.00  
 Max.   :118.00  

8 (c) (iv)

As per the summary() function, there are 78 elite universities. The box-plot of Outstate vs. Elite is shown below:
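
The code creating the Elite variable is not shown above; following the book's approach, it would be along these lines:

Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)
summary(college$Elite)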

with(college, plot(x=Elite, y=Outstate, ylab = "Out-of-State Tuition", xlab = "Whether Elite College?", main = "Question 8(c)(iv)"))

8 (c) (v)

par(mfrow=c(2,2))
hist(college$Room.Board, main = "", sub = "Room Boarding Charges")
hist(college$Room.Board, breaks = 100, main="", sub = "Room Boarding Charges(100 bins)")
hist(college$Books, main = "", sub = "Expenses on Books")
hist(college$Books, breaks = 100, main="", sub = "Expenses on Books (100 bins)")

Question 9

(a)

data(Auto)
Auto <- na.omit(Auto)

The predictors origin and name in the Auto data set are qualitative (origin is coded 1, 2 and 3 for American, European and Japanese cars), while all the other variables are quantitative: mpg, cylinders, displacement, horsepower, weight, acceleration and year.

(b) (c)

Name  <- colnames(Auto)[-c(8, 9)]   # the seven quantitative variables
Min   <- rep(0, 7)
Max   <- rep(0, 7)
Mean  <- rep(0, 7)
StDev <- rep(0, 7)
for (i in 1:7) {
  Min[i]   <- min(Auto[, i])
  Max[i]   <- max(Auto[, i])
  Mean[i]  <- round(mean(Auto[, i]), digits = 2)
  StDev[i] <- round(sd(Auto[, i]), digits = 2)
}
rangetab <- as.data.frame(cbind(Name, Min, Max))
msdtab <- as.data.frame(cbind(Name, Mean, StDev))
rangetab |>
  kbl() |>
  kable_classic_2(full_width = FALSE)
Name Min Max
mpg 9 46.6
cylinders 3 8
displacement 68 455
horsepower 46 230
weight 1613 5140
acceleration 8 24.8
year 70 82
msdtab |>
  kbl() |>
  kable_classic(full_width = FALSE)
Name Mean StDev
mpg 23.45 7.81
cylinders 5.47 1.71
displacement 194.41 104.64
horsepower 104.47 38.49
weight 2977.58 849.4
acceleration 15.54 2.76
year 75.98 3.68
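
As an aside, the same summaries can be computed more compactly with sapply, without the explicit loop (an equivalent sketch):

sapply(Auto[, 1:7], range)            # min and max of each quantitative column
round(sapply(Auto[, 1:7], mean), 2)   # means
round(sapply(Auto[, 1:7], sd), 2)     # standard deviations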

(d)

Auto1 <- Auto[-c(10:85), ]   # drop observations 10 through 85
for (i in 1:7) {
  Min[i]   <- min(Auto1[, i])
  Max[i]   <- max(Auto1[, i])
  Mean[i]  <- round(mean(Auto1[, i]), digits = 2)
  StDev[i] <- round(sd(Auto1[, i]), digits = 2)
}
rangetab <- as.data.frame(cbind(Name, Min, Max))
msdtab <- as.data.frame(cbind(Name, Mean, StDev))
rangetab |>
  kbl() |>
  kable_classic_2(full_width = FALSE)
Name Min Max
mpg 11 46.6
cylinders 3 8
displacement 68 455
horsepower 46 230
weight 1649 4997
acceleration 8.5 24.8
year 70 82
msdtab |>
  kbl() |>
  kable_classic(full_width = FALSE)
Name Mean StDev
mpg 24.4 7.87
cylinders 5.37 1.65
displacement 187.24 99.68
horsepower 100.72 35.71
weight 2935.97 811.3
acceleration 15.73 2.69
year 77.15 3.11

(e) & (f)

pairs(Auto[, c(1, 3, 4, 5, 6)])

model <- lm(mpg ~ . - name, data = Auto)
coefs <- as.data.frame(model$coefficients[-1])   # drop the intercept
colnames(coefs) <- "Regression Coefficient"      # avoid masking stats::cor
coefs |>
  kbl() |>
  kable_classic_2()
Regression Coefficient
cylinders -0.4933763
displacement 0.0198956
horsepower -0.0169511
weight -0.0064740
acceleration 0.0805758
year 0.7507727
origin 1.4261405

The table above shows the coefficients from the multiple regression of mpg on the other variables (note these are partial regression coefficients, not correlations).
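
The actual correlation of each quantitative variable with mpg could be checked with a one-liner such as:

round(cor(Auto[, -9])[, "mpg"], 2)   # drop the qualitative name column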

Question 10

(a)

library(MASS)
data(Boston)
# VarDescrip <- read.csv("BostonVarDescrip.csv")
nrow(Boston)
[1] 506
ncol(Boston)
[1] 14

There are 506 rows and 14 columns in the data set. Each row represents one of 506 Boston suburbs, and each column represents one of 14 recorded variables, described in the help page ?Boston of the MASS package.

# VarDescrip |> kbl() |> kable_classic_2()

(b)

Pairwise plots of some important variables are shown below:

Boston1 <- Boston |> dplyr::select(crim, nox, dis, tax, lstat)
pairs(Boston1)

(c)

The linear regression of crim on all the other variables gives the following result:

mdl <- lm(crim ~ ., data = Boston)
summary(mdl)

Call:
lm(formula = crim ~ ., data = Boston)

Residuals:
   Min     1Q Median     3Q    Max 
-9.924 -2.120 -0.353  1.019 75.051 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  17.033228   7.234903   2.354 0.018949 *  
zn            0.044855   0.018734   2.394 0.017025 *  
indus        -0.063855   0.083407  -0.766 0.444294    
chas         -0.749134   1.180147  -0.635 0.525867    
nox         -10.313535   5.275536  -1.955 0.051152 .  
rm            0.430131   0.612830   0.702 0.483089    
age           0.001452   0.017925   0.081 0.935488    
dis          -0.987176   0.281817  -3.503 0.000502 ***
rad           0.588209   0.088049   6.680 6.46e-11 ***
tax          -0.003780   0.005156  -0.733 0.463793    
ptratio      -0.271081   0.186450  -1.454 0.146611    
black        -0.007538   0.003673  -2.052 0.040702 *  
lstat         0.126211   0.075725   1.667 0.096208 .  
medv         -0.198887   0.060516  -3.287 0.001087 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.439 on 492 degrees of freedom
Multiple R-squared:  0.454, Adjusted R-squared:  0.4396 
F-statistic: 31.47 on 13 and 492 DF,  p-value: < 2.2e-16

Thus, the per capita crime rate is significantly associated (at the 5% level) with the variables zn, dis, rad, black and medv.

(d)

Yes, there are some suburbs with abnormally high per-capita crime rates, but no comparable extremes are visible in the tax rates or pupil-teacher ratios. Plots of these three variables are shown below:

attach(Boston)
par(mfrow=c(2,3))
plot(crim)
plot(tax)
plot(ptratio)
boxplot(crim)
boxplot(tax)
boxplot(ptratio)
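
To quantify the extremes (the threshold of 25 below is an arbitrary choice of mine for illustration):

range(Boston$crim)      # per-capita crime rate spans a very wide range
sum(Boston$crim > 25)   # suburbs with extremely high crime rates
range(Boston$tax)       # tax rates
range(Boston$ptratio)   # pupil-teacher ratios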

(e)

35 suburbs in this data set bound the Charles river, found using the code below:

nrow(subset(Boston, chas == 1))
[1] 35

(f)

The median pupil-teacher ratio is 19.05, found using the code below:

median(Boston$ptratio)

(g)

The Boston suburb with the lowest median value of owner-occupied homes is observation 399. Its values for the other variables are:

as.data.frame(Boston[which.min(Boston$medv),]) |> kbl() |> kable_classic_2()
        crim  zn  indus  chas  nox    rm     age  dis     rad  tax  ptratio  black  lstat  medv
399  38.3518   0   18.1     0  0.693  5.453  100  1.4896   24  666     20.2  396.9  30.59     5
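
To compare suburb 399 with the overall ranges, as the book asks, one could compute its percentile within each variable's distribution (a sketch):

# Fraction of suburbs with values at or below suburb 399's value:
sapply(names(Boston), function(v) mean(Boston[[v]] <= Boston[399, v]))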

(h)

In this data set, 13 suburbs average more than 8 rooms per dwelling and 64 suburbs average more than 7 rooms per dwelling. The code to find these values is:

nrow(Boston |> dplyr::filter(rm>8))
[1] 13
nrow(Boston |> dplyr::filter(rm>7))
[1] 64
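
The book also asks for comment on the suburbs averaging more than eight rooms per dwelling; one quick way (a sketch) is to compare their medians against those of all suburbs:

rbind(
  rm_gt_8 = sapply(Boston |> dplyr::filter(rm > 8), median),
  all     = sapply(Boston, median)
)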