uuadvstatcomp/advstatcomp2.html at master · cnettel/uuadvstatcomp · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
<!DOCTYPE html>

<html lang="en">
<head>
    <style>
        code, samp, kbd {
	font-family: "Courier New", Courier, monospace, sans-serif;
	text-align: left;
	color: #555;
            font-size: 13px;
	}
pre code {
	line-height: 1.6em;
    font-size: 11px;
	}
pre {
	padding: 0.1em 0.5em 0.3em 0.7em;
	border-left: 11px solid #ccc;
	margin: 1.7em 0 1.7em 0.3em;
	overflow: auto;
	width: 93%;
	}
/* target IE7 and IE6 */
*:first-child+html pre {
	padding-bottom: 2em;
	overflow-y: hidden;
	overflow: visible;
	overflow-x: auto;
	}
* html pre {
	padding-bottom: 2em;
	overflow: visible;
	overflow-x: auto;
	}
    </style>
    <meta charset="utf-8" />
    <title></title>
</head>
<body>
<h1>Lab 2</h1>
    In lab 2, we are trying out functional programming, or at least using the functional style of R.

    <section>
        <h2>
            Optimization
        </h2>
        <p>
            Fitting a statistical model is frequently an optimization problem. Optimization in this sense means finding a
            set of parameters that gives an optimal score or fitness value of some kind.
        </p>
        <p>
            Numerical optimization is frequently a hard problem beyond trivial cases. One wants a global optimum, i.e. that
            there is no other set of parameter values that constitutes a better fit. Those optimization methods that give
            strong guarantees tend to only promise that a <em>local</em> optimum is reached. This means that it is not possible
            to make any small change to the function value.
        </p>
        <p>
            Create a univariate function that computes the sum of (<em>x</em> - 3) <sup>2</sup> + 2 * (<em>x</em> - 3) <sup>2</sup> + 3 * (<em>x</em> - 15) <sup>2</sup> + sin(100<em>x</em>).
        </p>
        <p>
            Try to use <code>optimise</code> with this function. You also need to specify a range, as a list of lower and upper bounds, e.g. <code>optimise(f, c(0,100))</code>. Try at least the ranges (0,15), (9,12), and (10,11). Do you get the same optimum each time? Is is the global optimum?
         </p>
        <h2>
            Adding a file in a repository
        </h2>
        <p>
            Create a new R file <code>lab2a.R</code> in your <code>uuadvstatcomp</code> directory if you haven't already done so. Include the code for your
            <code>optimise</code> experiments there, and keep updating this file throughout the lab. Do remember to commit to the file frequently!
        </p>
        <p>
            You can add a file by just doing another commit in the graphical client. If you're using the command-line based git client, the most clear way is
            to first add and then commit:
        </p>
        <pre>
            git add lab2a.R
            git commit -m "Added lab 2"
        </pre>
        <h2>
            Integrating a function
        </h2>
        <p>
            Now, implement the function <em>x</em>sin(<em>x</em>). We want to integrate this function over a large finite range, from -7e5 to 7e5.
        </p>
        <p>
            The <code>integrate</code> function in R is actually adaptive. This means that it tries to find out what regions to explore in more detail to give a good
            approximation of the integral. To get a good approximation of the integral, you will need to increase the named parameter <code>subdivisions</code> to a high value, say 1e7.
        </p>
        <p>
            How long does it take to compute the value? Try using <code>system.time</code>. Add whatever experiments you think illustrate this to your <code>lab2a.R</code> file and commit it.
        </p>
        <p>
            We can now try using the <code>parallel</code> library, like we did in the lecture. A hint could be that mathematically you can
            divide a definite integral over a closed interval into the sum of the corresponding definite intervals over several subintervals that together
            form the total interval.
        </p>
        <p>
            Time your resulting code. What cluster size are you using? What kind of speedup do you get? Is it in any way relevant that the inner method is adaptive?
        </p>
        <p>
            Add some examples of your parallel integration to the script file and commit it.
        </p>

        <h2>Functional operators</h2>
        <p>
            The most complex form of functional programming could be seen to be when a function takes a function as input and returns another function.
            One purpose of doing this is to achieve what is sometimes called aspect-oriented programming. You can add a specific form of feature to any
            existing function. Let's have a look at some examples.
        </p>

        <h3>Memoisation</h3>
        <p>
            A very common technique to achieve higher performance, at the cost of more memory, is to pre-calculate tables of values that are going to be used repeatedly. With closures,
            you could make the outer function calculate the table and put it in the environment, while returning a function which then only makes look-ups in the table.
        </p>
        <p>
            Sometimes, it is hard to predict what values are going to be used and you can't compute them all. Still, you may know that the same values will come back over and over.
            In this case, a technique called <i>memoisation</i> is convenient. Simply store those input values that hav already been seen, and their corresponding output values. The
            table is built up continuously, call by call.
        </p>
        <p>
            By using functional operators in R, you can <i>add</i> memoisation to any existing function. Install the package <code>memoise</code>.
            You can try memoisation with the Fibonacci example found on the <a href="http://adv-r.had.co.nz/Function-operators.html">textbook website</a> (slightly modified here).
        </p>
        <pre>
fib <- function(n) {
    if (n < 2) return(1)
    fib(n - 2) + fib(n - 1)
}

fib2 <- memoise(function(n) {
  if (n < 2) return(1)
  fib2(n - 2) + fib2(n - 1)
})

fib3 <- memoise(fib)
        </pre>
        <p>
            Now time how long it takes to compute Fibonacci number 28, with <code>fib</code>, <code>fib2</code>,
            and <code>fib3</code>. Which one is fastest the first time round? What if you do it right away again?
        </p>
        <p>
            The <code>memoise</code> package has a nice <code>forget(<emph>memoisedfunction</emph>)</code> call.
            This can be useful if you want to rerun the tests with the original case where the memoisation table is empty.
        </p>
    </section>
    <section>
        <h2>Domain-specific languages</h2>
        <p>
            A <b>domain-specific language</b> (DSL) is a computer language specialized to a particular application domain.
            This is in contrast to a general-purpose language, which is broadly applicable across domains, and lacks
            specialized features for a particular domain [from Wikipedia]. R is a general-purpose programing language,
            but we can develop DSLs on top of R. Some features of the language also makes this particularly attractive to do.
        </p>
        <p>
            As an example of DSL for visualizations is ggplot2 which defines a <b>grammar of graphics</b> in R. The idea in ggplot is how the data maps to the parts of a graph, rather than describing the final graph itself.
        </p>
        <cite>
“In brief, the grammar tells us that a statistical graphic is <b>mapping</b> from data to <b>aesthetic</b> attributes (color, shape, size)
            of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn
            on a specific coordinate system” [from ggplot2 book]
            </cite>

        <p>
            In base R plot we start with a blank canvas and build from there using the <code>plot</code> function.
            We can’t go back and change our plot or translate our plot to a new plot. There is no graphical “language”.
            The plot is just a series of R commands. <code>ggplot</code> has a special grammar for graphics. It automatically deals with spacing,
            text, titles and allows you to annotate by “adding”.
        </p>
        <h3>qplot()</h3>
        <p>
            <code>qplot()</code> is the quick way of using ggplot2. It looks for data in a data frame. Be careful that
            your input data must be in the form of a data frame. If not, then you should reshape it. qplots are made up
            of aesthetic (size, shape, color) and geoms (points, lines).
        </p>
        <p>
            Here is an example how to use qplot.
           First install ggplot2 package and load the library.</p>
<pre>install.packages("ggplot2")
library(ggplot2)</pre>
        <p>
We work with the data set mpg in ggplot2. Here is the description of this data set: <a href="http://docs.ggplot2.org/0.9.2.1/mpg.html">http://docs.ggplot2.org/0.9.2.1/mpg.html</a>.
            </p>
<p>You can look at the structure of mpg data using <code>str()</code> to get a better understanding of its structure.</p>
<pre>str(mpg)</pre>
<p>We want to look at the relation between engine displacement (displ) and highway miles per gallon (hwy). We use:</p>
<pre>qplot(displ, hwy , data=mpg)</pre>

<p>The data set has three different factors:  <code>drv. f</code> = front-wheel drive, <code>r</code> = rear wheel drive, <code>4</code> = 4wd.
    We can easily plot the relation between <code>displ</code> and <code>hwy</code> for different factors using <code>color=drv</code>.  </p>
<pre>qplot(displ, hwy , data=mpg, color =drv)</pre>
    <p>
        <code>qplot()</code> automatically chooses different colors for rear wheel drive, front wheel drive and 4 wheel drive, using non-standard evaluation.
        </p>
        <p>
You can add statistics to the plot using <code>geom</code>. Let’s add a smoother to a plot</p>
        <pre>
qplot(displ, hwy, data =mpg, geom=c("point","smooth")) </pre>
        <p>
You can also check the linear relationship for the data: </p>
<pre>qplot(displ, hwy, data =mpg, geom=c("point","smooth"),method="lm") </pre>
<p>
You can check the linear relationship for different groups:</p>
<pre>qplot(displ, hwy, data =mpg, geom=c("point","smooth"),method="lm",color=drv)</pre>
        <p>
As we mentioned, one benefit of these plots is that you can modify them after they have been created.
            </p>
        <p>For example, you can use <code>theme()</code> to modify theme setting.  We assign the graph to variable:</p>
<pre>gr <- qplot(displ, hwy, data =mpg, geom=c("point","smooth"),method="lm",color=drv)</pre>
        <p>
and then we can modify it. Let’s change the color of the rectangular elements to pink:</p>
<pre>gr + theme(panel.background = element_rect(fill = "pink"))</pre>

        <h3>Exercise</h3>
<p>1. Load the data set <a href="http://docs.ggplot2.org/0.9.3.1/diamonds.html">diamonds</a> from ggplot2.</p>
<p>2. Look at the variables <code>carat</code> and <codde>price</codde> in the dataframe and plot them against each other.</p>
<p>3. Look at the relation between carat and price for different diamond colors. </p>
<p>4. Add a smoother to the plot. </p>
<p>5. Change the title of legend of your plot to “New legends”. Tips: look at the theme() help to find out how to change the title of legend. </p>
<p>6. Find out how to draw a boxplot to check the distribution of price/carat for different colors. </p>
<p><b>Add and commit a representative part of your plot work to your github project.</b></p>
<p></p><p>For reference, here are very useful videos to learn more about ggplot2:</p>
<p>Plotting with ggplot2: Part 1: <a href="https://www.youtube.com/watch?v=HeqHMM4ziXA">https://www.youtube.com/watch?v=HeqHMM4ziXA</a></p>
<p>Plotting with ggplot2: Part 2: <a href="https://www.youtube.com/watch?v=n8kYa9vu1l8">https://www.youtube.com/watch?v=n8kYa9vu1l8</a></p>
    </section>
    <section>

    </section>
    <section>
        <h2>Getting a SUPR account</h2>
        <p>In blocks 2 and 3, and possibly for your project, you will benefit from having access to some high-performance computing resources.</p>
        <p>The first step for this is to get an account at the SUPR service, which is common to all Swedish academic high-performance computing infrastructures.</p>
        <p>If you do <emph>not</emph> already have such an account, go to <a href="https://supr.snic.se/person/register/">this page</a>. Click "Register via SWAMID" and login with your
            ordinary university user account there. Doing this now might save some time and hassle later.
        </p>
    </section>

</body>
</html>