# 18.337 2026
## Homework 4
### Due April 24, 2026

Please submit a PDF of your answers on Canvas along with your code. Use of
AI is allowed, perhaps even encouraged (yes, even for generating code!),
but please write your solutions in your own words, comment on what any
LLM-generated code does and how you verified its correctness, and give credit.
If you use an AI, please tell us which prompts worked and which ones did not
work so well. (... and never, never trust an AI without checking.)

**IF YOU ARE USING A SHARED GPU NODE, LAUNCH A LOW-THREAD NOTEBOOK
KERNEL WITH AT MOST 32 THREADS.**

### 1. Playing with Tasks

Matrix multiplication is the backbone of HPC. We will explore the performance of tasks
using a hand-written matrix multiplication. Use the following code for this section:

```julia
using OhMyThreads

function matmul!(C, A, B, nt = Threads.nthreads())
    m, k = size(A)
    k_check, n = size(B)
    k == k_check ||
        throw(DimensionMismatch("Inner dimensions must match"))
    size(C) == (m, n) ||
        throw(DimensionMismatch("Output matrix has incorrect dimensions"))
    fill!(C, 0.0)
    # Parallelize the outer (row) loop across nt tasks.
    OhMyThreads.@tasks for i in 1:m
        @set begin
            ntasks = nt
        end
        for j in 1:n
            for l in 1:k
                C[i, j] += A[i, l] * B[l, j]
            end
        end
    end
    return C
end
```

**(a)** Benchmark the code above with `nt = {1, 2, 8, 32, 64, 128}`. Report your results
in a table for small matrices (100×100) and large matrices (4000×4000). Comment
on what you observe: does increasing the number of tasks always help?
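
If it helps, here is a minimal benchmarking sketch using `BenchmarkTools` (the sizes
and task counts simply mirror the question; adapt as needed):

```julia
using BenchmarkTools

for n in (100, 4000)
    A, B, C = randn(n, n), randn(n, n), zeros(n, n)
    for nt in (1, 2, 8, 32, 64, 128)
        t = @belapsed matmul!($C, $A, $B, $nt)  # seconds per call
        println("n = $n, nt = $nt: $t s")
    end
end
```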

**(b)** There are 3 loop indices: `for i in 1:m`, `for j in 1:n`, and `for l in 1:k`.
Experiment with different permutations of the order of these loops. Specifically,
what happens if you swap `for i in 1:m` with `for j in 1:n`? (A sketch of one such
variant follows below.) Try several permutations and report your results for
4000×4000 matrices. Can you identify any reason for the performance differences you
may observe? *(Hint: look up column-major order.)* You may ask an AI for help but
use your own words.
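
For example, the swapped variant could look like the following sketch (the name
`matmul_jil!` is ours, not part of the assignment; dimension checks are omitted for
brevity):

```julia
# Same algorithm, but with the j (column) loop outermost; tasks now
# partition columns of C instead of rows.
function matmul_jil!(C, A, B, nt = Threads.nthreads())
    m, k = size(A)
    n = size(B, 2)
    fill!(C, 0.0)
    OhMyThreads.@tasks for j in 1:n
        @set ntasks = nt
        for i in 1:m
            for l in 1:k
                C[i, j] += A[i, l] * B[l, j]
            end
        end
    end
    return C
end
```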

---

### 2. Introduction to GPUs

Run `nvidia-smi` to see the specifications of the GPUs you have available.
Julia automatically dispatches operations on GPU-allocated arrays to vendor-optimized
libraries (e.g., cuBLAS and cuSOLVER), so you do not need to write GPU kernels for every
function yourself. We use the `CUDA.jl` package to interact with the hardware.

**(a)** Allocate matrices on the GPU and benchmark matrix addition and multiplication,
analogous to what you did for CPUs. Use:

```julia
using CUDA
using LinearAlgebra  # provides mul!

# define n =
elty = Float32  # (avoid shadowing Base.eltype)
A = CUDA.randn(elty, n, n)
B = CUDA.randn(elty, n, n)
C = CUDA.zeros(elty, n, n)
mul!(C, A, B)  # 1. benchmark matmul for different n
C = A * B      # 2. compare with the above: what is the difference in speed, and why?
C .= A .+ B    # 3. benchmark matadd for different n
C = A + B      # 4. compare with the above: what is the difference in speed, and why?
```

To benchmark correctly, load `BenchmarkTools` and use
`@belapsed CUDA.@sync mul!($C, $A, $B)` (and likewise for the other operations);
interpolating the arrays with `$` avoids measuring global-variable overhead.
Submit your code and a table of the absolute execution times
for matmul and matadd, with and without allocations, as a function of matrix size.
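
One possible shape for the measurement loop (the sizes are placeholders; adjust them
to your GPU's memory):

```julia
using CUDA, LinearAlgebra, BenchmarkTools

for n in (256, 1024, 4096)
    A = CUDA.randn(Float32, n, n)
    B = CUDA.randn(Float32, n, n)
    C = CUDA.zeros(Float32, n, n)
    t_mul  = @belapsed CUDA.@sync mul!($C, $A, $B)  # matmul, no allocation
    t_mula = @belapsed CUDA.@sync $A * $B           # matmul, allocating
    t_add  = @belapsed CUDA.@sync $C .= $A .+ $B    # matadd, no allocation
    t_adda = @belapsed CUDA.@sync $A + $B           # matadd, allocating
    println("n = $n: mul! $t_mul, * $t_mula, .+ $t_add, + $t_adda")
end
```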

**(b)** Generate the following plots. Use your HW1 results for CPU timing and part (a) for
GPU timing. Use both very small and very large `n` (keep in mind the memory
limits of your GPU).

- **Matmul and matadd without allocation:** Plot CPU and GPU execution times for
  both operations on the same figure. What is faster at which matrix size, and why?
  Is there a difference between when matmul and matadd become faster on the GPU?
  *(Tip: use one color per operation and dashed/solid lines for CPU/GPU.)*

- **Matadd with and without allocations:** Plot CPU and GPU execution times for
  matadd with and without allocations. Do you see a difference in how much impact
  allocations have on the CPU vs. the GPU? *(Same plotting tip as above.)*

Use logarithmic axes where helpful.
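
A minimal plotting sketch with `Plots.jl` (the vectors `ns`, `t_cpu`, and `t_gpu` are
hypothetical containers for your own measurements):

```julia
using Plots

# ns = matrix sizes, t_cpu/t_gpu = execution times in seconds.
plot(ns, t_cpu; xscale = :log10, yscale = :log10, color = :blue,
     linestyle = :dash, label = "matmul (CPU)", xlabel = "n", ylabel = "time (s)")
plot!(ns, t_gpu; color = :blue, linestyle = :solid, label = "matmul (GPU)")
```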

**(c)** Test `@belapsed` *without* `CUDA.@sync` for different matrix sizes. What do you
observe? Look up the purpose of `CUDA.@sync` and explain in your own words why it
is necessary for correct benchmarking.
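
Concretely, the comparison looks like this (reusing `C`, `A`, `B` from part (a); the
explanation of the gap is the deliverable):

```julia
t_launch = @belapsed mul!($C, $A, $B)             # times only the kernel *launch*
t_full   = @belapsed CUDA.@sync mul!($C, $A, $B)  # waits until the GPU finishes
```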

**(d)** Allocate a GPU vector and use element-wise operations to double every element
(one line of code). Submit your code, benchmark it for different sizes, and generate a
plot comparing these GPU results to the CPU results from HW1 Q2a. At which sizes
does each implementation dominate, and why?
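
One possible one-liner (the vector size here is just a placeholder):

```julia
x = CUDA.rand(Float32, 2^20)
x .*= 2  # the broadcast fuses into a single GPU kernel
```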

**(e)** Now we will write a `KernelAbstractions.jl` kernel that performs the same
element-doubling operation. We assign one thread per element. Fill in the blanks:

```julia
using KernelAbstractions, CUDA
backend = CUDABackend()
elty = Float32
const NUMTHREADSINBLOCK = 64 # threads per CUDA block

# Inside a kernel the arrays are device-side, so keep the argument
# annotation generic (AbstractArray) rather than a host-array type.
@kernel function mykernel!(size_in::Int, input::AbstractArray)
    idx = # calculate idx using @index(Group, Linear) and @index(Local, Linear)
    if idx <= size_in
        value = # load value from input into a register
        # double the value
        input[idx] = value # write result back to global memory
    end
    return
end

doubling!(size_in::Int, A::AbstractArray) =
    mykernel!(backend, (NUMTHREADSINBLOCK,))(size_in, A; ndrange = size(A))
```

In the last line, `ndrange` sets the *total* number of work items across all blocks;
the number of blocks is then derived from `ndrange` and the workgroup size. Submit
your completed code and verify it produces correct results.
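
A quick correctness check might look like this (a sketch, assuming the blanks above
are filled in):

```julia
x = CUDA.rand(Float32, 1000)
expected = 2 .* x
doubling!(length(x), x)
KernelAbstractions.synchronize(backend)  # wait for the kernel to finish
@assert Array(x) ≈ Array(expected)
```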

---

### 3. 2D Access Patterns

Data is stored linearly on hardware, and threads are also indexed linearly; the 2D
representation is a programmer convenience. Julia is **column-major**: columns are
stored consecutively in memory. Consider a naive GPU matrix multiply kernel acting on
square matrices of size $2^{13} \times 2^{13}$.

The standard starting point uses a square block size of $(16, 16)$; a sketch of such
a naive kernel follows below.
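
If you need a starting point for the kernel itself, here is a minimal naive version in
`KernelAbstractions.jl` (one thread per output element; the name `naive_matmul!` and
the launch code are our assumptions, not a prescribed implementation):

```julia
using KernelAbstractions, CUDA

# Naive matmul: one thread per output element of C.
@kernel function naive_matmul!(C, @Const(A), @Const(B))
    i, j = @index(Global, NTuple)  # 2D global thread index
    if i <= size(C, 1) && j <= size(C, 2)
        acc = zero(eltype(C))
        for l in 1:size(A, 2)
            acc += A[i, l] * B[l, j]
        end
        C[i, j] = acc
    end
end

n = 2^13
A, B = CUDA.randn(Float32, n, n), CUDA.randn(Float32, n, n)
C = CUDA.zeros(Float32, n, n)
# Launch with a (16, 16) workgroup; ndrange covers the whole output matrix.
naive_matmul!(CUDABackend(), (16, 16))(C, A, B; ndrange = size(C))
```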

**(a)** Test at least six other block sizes: $(1,\, 16\times16)$, $(16\times16,\, 1)$,
$(8, 8)$, $(32, 32)$, and two more of your own choosing. Report the performance for
each and explain why you think the results differ.

**(b)** Have every thread handle more than one output element. Test at least the
configurations where each thread handles $(2,2)$, $(1,4)$, and $(4,1)$ consecutive
elements. Report performance and discuss what you observe. (One way to set up the
$(2,2)$ indexing is sketched below.)
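
A possible shape for the $(2,2)$ case (a hedged sketch; the tiling arithmetic is the
point, not the exact code):

```julia
# Each thread computes a 2×2 tile of C: thread (ti, tj) covers
# rows 2ti-1:2ti and columns 2tj-1:2tj.
@kernel function matmul_2x2!(C, @Const(A), @Const(B))
    ti, tj = @index(Global, NTuple)
    for dj in 0:1, di in 0:1
        i, j = 2ti - 1 + di, 2tj - 1 + dj
        if i <= size(C, 1) && j <= size(C, 2)
            acc = zero(eltype(C))
            for l in 1:size(A, 2)
                acc += A[i, l] * B[l, j]
            end
            C[i, j] = acc
        end
    end
end

# Each thread covers a 2×2 tile, so the ndrange shrinks accordingly:
# matmul_2x2!(CUDABackend(), (16, 16))(C, A, B; ndrange = cld.(size(C), 2))
```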

**(c)** Use your experience from parts (a) and (b) to propose at least one additional
strategy for dividing work across threads. Implement it and report on its performance
relative to the configurations above.

---

### 4. A Parallel Prefix Kernel

On a GPU you can synchronize threads
*within* a workgroup/block using `@synchronize`, but to synchronize *across* blocks you
must launch a new kernel.
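
To make the relaunch pattern concrete, here is one possible (unoptimized) shape of a
global-memory scan: a Hillis–Steele sweep in which every pass is a separate kernel
launch. All names are ours; treat this as a sketch under those assumptions, not the
required solution:

```julia
using KernelAbstractions, CUDA

# One Hillis–Steele pass: out[i] = inp[i] + inp[i - stride] (double-buffered,
# so reads and writes never race within a pass).
@kernel function scan_step!(out, @Const(inp), stride::Int)
    i = @index(Global, Linear)
    if i <= length(out)
        out[i] = i > stride ? inp[i] + inp[i - stride] : inp[i]
    end
end

# Host-side driver: each kernel launch acts as a global synchronization point.
function naive_scan(x::CuArray)
    a, b = copy(x), similar(x)
    stride = 1
    while stride < length(a)
        scan_step!(CUDABackend(), 256)(b, a, stride; ndrange = length(a))
        a, b = b, a  # swap buffers between passes
        stride *= 2
    end
    return a  # inclusive prefix sum (use the returned array)
end
```

Note that each pass streams through all of global memory; that inefficiency is exactly
what parts (b) and (d) ask you to attack.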

**(a)** Write a naive parallel prefix sum kernel using only global memory (no shared
memory). Test your kernel at different vector sizes (powers of two, $\geq 2^8$) and
verify correctness. Submit your code.

**(b)** Based on your experience optimizing global memory access, propose
and implement at least one alternative structure for this algorithm that you expect to
be more performant. Compare both implementations at a vector size large enough to
occupy the GPU (ideally $2^{28}$; use a smaller size if this takes too long).

**(c)** What makes data-access optimization more complex for parallel prefix than for
element doubling? Write 2–4 sentences.

**(d)** Rewrite the kernel making use of **workgroup-shared memory**
(`@localmem eltype(input) (blocksize,)`) and **thread-private register memory**
(`@private eltype(input) (n,)`). Aim to avoid bank conflicts in your access pattern.
Benchmark this version against your implementations from parts (a) and (b) and
comment on what you observe.
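
For orientation, the declarations look like this inside a kernel (an illustrative
skeleton only, not a working scan; `NUMTHREADSINBLOCK` is the workgroup size from
question 2(e), and the names and shapes are assumptions):

```julia
@kernel function shared_skeleton!(output, @Const(input))
    li = @index(Local, Linear)
    gi = @index(Global, Linear)

    tile = @localmem eltype(input) (NUMTHREADSINBLOCK,)  # shared within one workgroup
    regs = @private eltype(input) (4,)                   # registers, private per thread

    tile[li] = input[gi]  # stage global data into shared memory
    @synchronize          # make the staged tile visible to the whole workgroup
    regs[1] = tile[li]    # ... your block-level scan goes here ...
    output[gi] = regs[1]
end
```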