IODS-project/chapter4.Rmd at master · goliatl/IODS-project · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
---
title: "Chapter 4"
author: "Paul Lassmann Klee"
date: "`r Sys.Date()`"
output: html_document
---

```{r, echo=FALSE, message=FALSE,results="hide"}
library(MASS)
library(ggplot2)
library(GGally)
library(dplyr)
data("Boston")
str(Boston)
head(Boston)
dim(Boston)
```
# Chapter 4
## Use of Boston data from MASS library
Boston is dataset with 506 observations and 14 variables of housing values in the suburbs of Boston
Correlation plot matrix and summaries of the variables can be found below
```{r, echo=FALSE, message=FALSE,}
# create a  plot matrix with ggpairs()
  p <- ggpairs(Boston, upper = list(continuous = wrap("cor", size = 2)))+ theme_bw()


```
```{r, echo=FALSE, message=FALSE}
# draw the plot

p

```

We observe middle-strong correlations (ca. 0.6) between the variables rad and crim, tax and crim, age and zn, dis and zn, nox and indus, age and indus, dis and indus, rad and indus, tax and indus, lstat and indus, age and nox, dis and nox, rad and nox, tax and nox, lstat and nox, lstat and rm, medv and rm, dis and age, lstat and age, tax and rad, lstat and medv.
None of the variables is normally distributed, apart from rm that appears to follow a normal distribution.

###Descriptive statistics (summary) of variables in Boston in following table.
```{r, echo=FALSE, message=FALSE,}
table.descriptive<-psych::describe(Boston, IQR=TRUE)#%>%as_data_frame()
table.descriptive<-subset (table.descriptive,select= -c(1:2,6:12))
table.descriptive<-round(table.descriptive,digits = 1)
knitr::kable(table.descriptive)
```

Therefore, we standardised the data.
```{r, echo=FALSE, message=FALSE,results="hide"}
boston_scaled <- scale(Boston)
summary(boston_scaled)
boston_scaled<-as.data.frame(boston_scaled)
bins <- quantile(boston_scaled$crim)
bins
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE)
```

###Descriptive statistics (summary) of standardized variables in Boston in following table. Observe the mean O after standardization and standard deviation of 1
```{r, echo=FALSE, message=FALSE,}
table.descriptive2<-psych::describe(boston_scaled, IQR=TRUE)#%>%as_data_frame()
table.descriptive2<-subset (table.descriptive2,select= -c(1:2,6:12))
table.descriptive2<-round(table.descriptive2,digits = 1)
knitr::kable(table.descriptive2)
```

```{r, echo=FALSE, message=FALSE}
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

# number of rows in the Boston dataset
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set
test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

```
We fitted a linear discriminant analysis to the target variable crime and its classes.
We divided the standardised Boston data set into a training and a test set, with 80% of the data assigned to the training dataset.

We plotted this lda model in the following biplot
```{r, echo=FALSE, message=FALSE,results="hide"}


# linear discriminant analysis
lda.fit <- lda(crime~., data = train)

# print the lda.fit object
lda.fit

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "black", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[,choices[1]],
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads),
       cex = tex, col=color, pos=2)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
ldaplot<-plot(lda.fit, dimen = 2,col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata =test)

# cross tabulate the results
crosstable<-table(correct = correct_classes, predicted = lda.pred$class)


```


```{r, echo=FALSE, message=FALSE}
ldaplot
knitr::kable(crosstable)
```

As noted in the biplot the predictor variable rad ( radial highway )predicts a high crime per capita ( blue). On the other hand a high proportion of residential land zoned for lots over 25,000 sq.ft. (variable zn) predicts a low crime per capita.
The middle crime per capita depicted in red in green has overlaps and is predicted by multiple predictors.

We predicted classes with the lda model and observed that the lda model predicts very efficently the crime rates above the mean (i.e. higher crime rates), but fails to distinguish the lower crime classes in an effective manner.

```{r, echo=FALSE, message=FALSE,results="hide"}
data("Boston")
scaled.Boston<-scale(Boston)
dist_eu <-dist(scaled.Boston)
km <-kmeans(scaled.Boston, centers = 3)
```

### We calculated the distances between the observations in the standardised Boston dataset and performed an k means clustering analysis. We obtained two clusters and see that for example rad predicts well the two crime clusters.

```{r, echo=FALSE, message=FALSE,results="hide"}
library(plotly)

model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
dim(lda.fit$scaling)

# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)


set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(scaled.Boston, k)$tot.withinss})

# visualize the results
#qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering
km <-kmeans(scaled.Boston, centers = 2)


Tdplot1<-plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, color = train$crime, colors = "PuBuGn", type= 'scatter3d', mode='markers')
```

```{r, echo=FALSE, message=FALSE}
pairs(scaled.Boston, col = km$cluster)
Tdplot1
```