Commit 85ddba5 (merge of parents 99ea85a and 89b5aa7)

5 files changed: 762 additions & 2 deletions

File: 18.337 2026 hw4.md (188 additions, 0 deletions)
# 18.337 2026
## Homework 4
### Due April 24, 2026

Please submit a PDF of your answers on Canvas along with your code. Use of AI is
allowed, perhaps even encouraged (yes, even for generating code!), but please write
your solutions to the questions in your own words, comment on what any LLM-generated
code does and how you verified its correctness, and give credit. If you use an AI,
please also tell us which prompts worked well and which did not.
(... and never, ever trust an AI without checking.)


**IF YOU ARE USING A SHARED GPU NODE, LAUNCH A LOW-THREAD NOTEBOOK
KERNEL WITH AT MOST 32 THREADS.**

### 1. Playing with Tasks
20+
21+
Matrix multiplication is the backbone of HPC. We will explore the performance of tasks
22+
using a hand-written matrix multiplication. Use the following code for this section:
23+
24+
```julia
25+
using OhMyThreads
26+
27+
function matmul!(C, A, B, nt = Threads.nthreads())
28+
m, k = size(A)
29+
k_check, n = size(B)
30+
k == k_check ||
31+
throw(DimensionMismatch("Inner dimensions must match"))
32+
size(C) == (m, n) ||
33+
throw(DimensionMismatch("Output matrix has incorrect dimensions"))
34+
fill!(C, 0.0)
35+
OhMyThreads.@tasks for i in 1:m
36+
@set begin
37+
ntasks = nt
38+
end
39+
for j in 1:n
40+
for l in 1:k
41+
C[i, j] += A[i, l] * B[l, j]
42+
end
43+
end
44+
end
45+
return C
46+
end
47+
```

**(a)** Benchmark the code above with `nt = {1, 2, 8, 32, 64, 128}`. Report your results
in a table for small matrices (100×100) and large matrices (4000×4000). Comment
on what you observe — does increasing the number of tasks always help?
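
A minimal benchmarking sketch for part (a), assuming `BenchmarkTools.jl` is installed
(the matrix size and task counts below are only illustrative):

```julia
using BenchmarkTools, OhMyThreads

n = 100                              # try both 100 and 4000
A = randn(n, n); B = randn(n, n); C = zeros(n, n)
for nt in (1, 2, 8, 32, 64, 128)
    t = @belapsed matmul!($C, $A, $B, $nt)
    println("nt = $nt: ", round(t * 1e3; digits = 3), " ms")
end
```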

**(b)** There are 3 loop indices: `for i in 1:m`, `for j in 1:n`, and `for l in 1:k`.
Experiment with different permutations of the order of these loops — specifically,
what happens if you swap `for i in 1:m` with `for j in 1:n`? Try several
permutations and report your results for 4000×4000 matrices. Can you identify any
reason for performance differences you may observe? *(Hint: look up column-major
order.)* You may ask an AI for help but use your own words.
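
For concreteness, here is a minimal sketch of one such permutation (the function name
`matmul_jil!` is just a label for this sketch; only the loop nest differs from
`matmul!` above):

```julia
using OhMyThreads

# Same checks as matmul!, but with the j loop outermost, then i, then l.
function matmul_jil!(C, A, B, nt = Threads.nthreads())
    m, k = size(A)
    k_check, n = size(B)
    k == k_check ||
        throw(DimensionMismatch("Inner dimensions must match"))
    size(C) == (m, n) ||
        throw(DimensionMismatch("Output matrix has incorrect dimensions"))
    fill!(C, 0.0)
    OhMyThreads.@tasks for j in 1:n   # tasks now split the columns of C
        @set begin
            ntasks = nt
        end
        for i in 1:m
            for l in 1:k
                C[i, j] += A[i, l] * B[l, j]
            end
        end
    end
    return C
end
```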

---

### 2. Introduction to GPUs

Run `nvidia-smi` to see the specifications of the GPUs you have available.
Julia automatically dispatches operations on GPU-allocated arrays to vendor-optimized
libraries (cuBLAS, cuSOLVER, CUTLASS), so you do not need to write GPU kernels for every
function yourself. We use the `CUDA.jl` package to interact with the hardware.

**(a)** Allocate matrices on the GPU and benchmark matrix addition and multiplication,
analogous to what you did for CPUs. Use:

```julia
using CUDA
using LinearAlgebra   # provides mul!
# define n =
elty = Float32        # element type (avoids shadowing the Base function eltype)
A = CUDA.randn(elty, n, n)
B = CUDA.randn(elty, n, n)
C = CUDA.zeros(elty, n, n)
mul!(C, A, B) # 1. benchmark matmul for different n
C = A * B     # 2. compare with above — what is the difference in speed and why?
C .= A .+ B   # 3. benchmark matadd for different n
C = A + B     # 4. compare with above — what is the difference in speed and why?
```

To benchmark correctly, load `BenchmarkTools` and use `@belapsed CUDA.@sync mul!(C, A, B)`
(and equivalently for the other operations). Submit your code and a table of the absolute
execution times for matmul and matadd, with and without allocations, as a function of
matrix size.
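
As a sketch, the four measurements for a single size might be collected like this
(assuming `BenchmarkTools.jl`; the size `n` is illustrative):

```julia
using CUDA, LinearAlgebra, BenchmarkTools

n = 1024                                  # sweep this over several sizes
A = CUDA.randn(Float32, n, n)
B = CUDA.randn(Float32, n, n)
C = CUDA.zeros(Float32, n, n)

# CUDA.@sync makes the host wait for the GPU, so the whole operation is timed.
t_mul_noalloc = @belapsed CUDA.@sync mul!($C, $A, $B)   # matmul, no allocation
t_mul_alloc   = @belapsed CUDA.@sync($A * $B)           # matmul, allocates the output
t_add_noalloc = @belapsed CUDA.@sync($C .= $A .+ $B)    # matadd, no allocation
t_add_alloc   = @belapsed CUDA.@sync($A + $B)           # matadd, allocates the output
```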

**(b)** Generate the following plots. Use HW1 results for CPU timing and part (a) for
GPU timing. Use both very small and very large `n` (keep in mind the memory
limits of your GPU).

- **Matmul and matadd without allocation:** Plot CPU and GPU execution times for
  both operations on the same figure. What is faster at which matrix size, and why?
  Is there a difference between when matmul and matadd become faster on the GPU?
  *(Tip: use one color per operation and dashed/solid lines for CPU/GPU.)*

- **Matadd with and without allocations:** Plot CPU and GPU execution times for
  matadd with and without allocations. Do you see a difference in how much impact
  allocations have on CPU vs GPU? *(Same plotting tip as above.)*

Use logarithmic axes where helpful.
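
A minimal plotting sketch (assuming `Plots.jl`, and that you have already collected
vectors `ns`, `t_cpu`, and `t_gpu` of sizes and timings; all of these names are
placeholders):

```julia
using Plots

# log-log axes; dashed = CPU, solid = GPU, one color per operation
plt = plot(ns, t_cpu;
           xscale = :log10, yscale = :log10,
           xlabel = "matrix size n", ylabel = "time (s)",
           label = "matmul, CPU", color = :blue, linestyle = :dash)
plot!(plt, ns, t_gpu; label = "matmul, GPU", color = :blue, linestyle = :solid)
```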

**(c)** Test `@belapsed` *without* `CUDA.@sync` for different matrix sizes. What do you
observe? Look up the purpose of `CUDA.@sync` and explain in your own words why it
is necessary for correct benchmarking.

**(d)** Allocate a GPU vector and use element-wise operations to double every element
(one line of code). Submit your code, benchmark it for different sizes, and generate a
plot comparing these GPU results to the CPU results from HW1 Q2a. At which sizes
does each implementation dominate, and why?

**(e)** Now we will write a `KernelAbstractions.jl` kernel that performs the same
element-doubling operation. We assign one thread per element. Fill in the blanks:

```julia
using KernelAbstractions, CUDA
using GPUArrays: AbstractGPUArray   # the AbstractGPUArray type used in the signatures below
backend = CUDABackend()
elty = Float32
const NUMTHREADSINBLOCK = 64 # threads per CUDA block

@kernel function mykernel!(size_in::Int, input::AbstractGPUArray)
    idx = # calculate idx using @index(Group, Linear) and @index(Local, Linear)
    if idx <= size_in
        value = # load value from input into a register
        # double the value
        input[idx] = value # write result back to global memory
    end
    return
end

doubling!(size_in::Int, A::AbstractGPUArray) =
    mykernel!(backend, (NUMTHREADSINBLOCK,))(size_in, A; ndrange = size(A))
```

In the last line, `ndrange` sets the *total* number of threads across all blocks; the
number of blocks to launch is then derived from `ndrange` and the block size. Submit
your completed code and verify it produces correct results.
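
One way to check correctness once the blanks are filled in (a sketch, continuing from
the code above; doubling a `Float32` is exact, so an elementwise equality check is
enough):

```julia
n = 10_000
x_cpu = rand(Float32, n)
x_gpu = CuArray(x_cpu)

doubling!(n, x_gpu)
KernelAbstractions.synchronize(backend)   # wait for the kernel to finish

@assert Array(x_gpu) == 2 .* x_cpu        # compare against the CPU reference
```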

---

### 3. 2D Access Patterns

Data is stored linearly on hardware, and threads are also indexed linearly; 2D
representation is a programmer convenience. Julia is **column-major**: the elements of
each column are stored contiguously in memory. Consider a naive GPU matrix multiply
kernel acting on square matrices of size $2^{13}$.

The standard starting point uses a square block size of $(16, 16)$.
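
Since the naive kernel is not spelled out above, here is a minimal sketch of what one
might look like with `KernelAbstractions.jl` (the names `naive_matmul!` and
`matmul_gpu!` and the use of `@index(Global, NTuple)` are choices of this sketch, not
a prescribed implementation; the `blockdims` argument is where the block sizes from
part (a) would go):

```julia
using KernelAbstractions, CUDA

# Naive mapping: one thread computes one output element C[i, j].
@kernel function naive_matmul!(C, A, B)
    i, j = @index(Global, NTuple)
    m, k = size(A)
    n = size(B, 2)
    if i <= m && j <= n
        acc = zero(eltype(C))
        for l in 1:k
            acc += A[i, l] * B[l, j]
        end
        C[i, j] = acc
    end
    return
end

# Launch with a configurable 2D block (workgroup) size, e.g. (16, 16).
function matmul_gpu!(C, A, B; blockdims = (16, 16))
    backend = KernelAbstractions.get_backend(C)
    naive_matmul!(backend, blockdims)(C, A, B; ndrange = size(C))
    KernelAbstractions.synchronize(backend)
    return C
end
```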

**(a)** Test at least six other block sizes: $(1,\, 16\times16)$, $(16\times16,\, 1)$,
$(8, 8)$, $(32, 32)$, and two more of your own choosing. Report the performance for
each and explain why you think the results differ.

**(b)** Have every thread handle more than one output element. Test at least the
configurations where each thread handles $(2,2)$, $(1,4)$, and $(4,1)$ consecutive
elements. Report performance and discuss what you observe.

**(c)** Use your experience from parts (a) and (b) to propose at least one additional
strategy for dividing work across threads. Implement it and report on its performance
relative to the configurations above.

---

### 4. A Parallel Prefix Kernel

On a GPU you can synchronize threads *within* a workgroup/block using `@synchronize`,
but to synchronize *across* blocks you must launch a new kernel.
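
To illustrate intra-block synchronization (without giving away the prefix-sum
structure), here is a minimal sketch of a kernel in which each workgroup stages its
chunk of a vector in shared memory, synchronizes, and then writes the chunk back
reversed; the kernel name, the block size, and the reversal operation are purely
illustrative:

```julia
using KernelAbstractions, CUDA

const BLOCK = 64

# Each workgroup reverses its own chunk. @synchronize guarantees that every
# thread's write to the shared tile is visible before any thread reads from it.
@kernel function reverse_blocks!(x)
    g = @index(Global, Linear)
    l = @index(Local, Linear)
    tile = @localmem eltype(x) (BLOCK,)
    tile[l] = x[g]            # stage into workgroup-shared memory
    @synchronize              # all threads in the block wait here
    x[g] = tile[BLOCK - l + 1]
    return
end

x = CuArray(collect(Float32, 1:256))   # length assumed to be a multiple of BLOCK
reverse_blocks!(CUDABackend(), (BLOCK,))(x; ndrange = length(x))
KernelAbstractions.synchronize(CUDABackend())
```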

**(a)** Write a naive parallel prefix sum kernel using only global memory (no shared
memory). Test your kernel at different vector sizes (powers of two, $\geq 2^8$) and
verify correctness. Submit your code.

**(b)** Based on your experience optimizing global memory access, propose
and implement at least one alternative structure for this algorithm that you expect to
be more performant. Compare both implementations at a vector size large enough to
occupy the GPU (ideally $2^{28}$; use a smaller size if this takes too long).

**(c)** What makes data-access optimization more complex for parallel prefix than for
element doubling? Write 2–4 sentences.

**(d)** Rewrite the kernel making use of **workgroup-shared memory**
(`@localmem eltype(input) (blocksize,)`) and **thread-private register memory**
(`@private eltype(input) (n,)`). Aim to avoid bank conflicts in your access pattern.
Benchmark this version against your implementations from parts (a) and (b) and
comment on what you observe.

File: README.md (3 additions, 2 deletions)

```diff
@@ -64,8 +64,8 @@ Take a look at [MIT Engaging](https://engaging-ood.mit.edu:8443/auth/realms/enga
 | | 3/23/2026 | Monday | Spring Break | | |
 | | 3/25/2026 | Wednesday | Spring Break | | |
 | 15 | 3/30/2026 | Monday | | Alan | Parallel Prefix [[prefix spring 2026.pptx]](https://github.com/mitmath/18337/blob/master/prefix%20%20spring%202026.pptx) [[prefixspring2026.jl]](https://github.com/mitmath/18337/blob/master/prefixspring2026.jl) |
-| 16 | 4/1/2026 | Wednesday | | Alan | |
-| 17 | 4/6/2026 | Monday | | Alan | |
+| 16 | 4/1/2026 | Wednesday | | Alan | Intro to GPUs [[gpus2026.pptx]](https://github.com/mitmath/18337/blob/master/gpus2026.pptx) |
+| 17 | 4/6/2026 | Monday | | Alan | [Optimizing Serial Code](https://book.sciml.ai/notes/02-Optimizing_Serial_Code/) Types in Julia [[7_ptypes.jl]](https://github.com/mitmath/18337/blob/master/lecture%2017/7_ptypes.jl) [[html]](https://mitmath.github.io/18337/lecture17/7_ptypes.html) [Threading](https://mitmath.github.io/Parallel-Computing-Spoke/notebooks/ThreadingNotebook.html) [[handwritten notes]](https://github.com/mitmath/18337/blob/master/lecture4/lecture_4_handwritten_2023.pdf) [[Serial Performance .jl]](https://github.com/mitmath/18337/blob/master/lecture4/serial%20performance.jl) [[Loop Fusion Blog]](https://julialang.org/blog/2017/01/moredots/) |
 | 18 | 4/8/2026 | Wednesday | | Alan | |
 | 19 | 4/13/2026 | Monday | | Alan | |
 | 20 | 4/15/2026 | Wednesday | | Alan | |
@@ -98,6 +98,7 @@ Final Project reports due: May 11
 |1| [PINNs](https://github.com/mitmath/18337/blob/master/18.337%20%202026%20homework%201.pdf)| 2/11/2026 |
 |2| [Performance etc](https://github.com/mitmath/18337/blob/master/18.337%202026%20homework%202.md)| Thurs 3/5/2026 ||
 |3| [Project Proposal & Adjoint Eqn ](https://github.com/mitmath/18337/blob/master/18.337%202026%20hw3.pdf) | Friday 3/20/2026 ||
+|4| [Threads & GPU](https://github.com/mitmath/18337/blob/master/18.337%202026%20hw4.md) | 4/24/2026 ||
```