# 18.337 2026
## Homework 4
### Due April 24, 2026

Please submit a PDF of your answers on Canvas along with your code. Use of
AI is allowed, perhaps even encouraged (yes, even for generating code!),
but please write your solutions in your own words, comment on what any
LLM-generated code does and how you verified its correctness, and give credit.
If you use an AI, please tell us which prompts worked and which ones did not
work so well. (... and never, never trust an AI without checking.)

**IF YOU ARE USING A SHARED GPU NODE, LAUNCH A LOW-THREAD NOTEBOOK
KERNEL WITH AT MOST 32 THREADS.**

### 1. Playing with Tasks

Matrix multiplication is the backbone of HPC. We will explore the performance of tasks
using a hand-written matrix multiplication. Use the following code for this section:

```julia
using OhMyThreads

function matmul!(C, A, B, nt = Threads.nthreads())
    m, k = size(A)
    k_check, n = size(B)
    k == k_check ||
        throw(DimensionMismatch("Inner dimensions must match"))
    size(C) == (m, n) ||
        throw(DimensionMismatch("Output matrix has incorrect dimensions"))
    fill!(C, 0.0)
    # Parallelize the outer (row) loop across nt tasks.
    OhMyThreads.@tasks for i in 1:m
        @set begin
            ntasks = nt
        end
        for j in 1:n
            for l in 1:k
                C[i, j] += A[i, l] * B[l, j]
            end
        end
    end
    return C
end
```

**(a)** Benchmark the code above with `nt = {1, 2, 8, 32, 64, 128}`. Report your results
in a table for small matrices (100×100) and large matrices (4000×4000). Comment
on what you observe: does increasing the number of tasks always help?
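
If it helps, here is a minimal benchmarking sketch using `BenchmarkTools` (the sizes
and task counts simply mirror the question; adapt as needed):

```julia
using BenchmarkTools

for n in (100, 4000)
    A, B, C = randn(n, n), randn(n, n), zeros(n, n)
    for nt in (1, 2, 8, 32, 64, 128)
        t = @belapsed matmul!($C, $A, $B, $nt)  # seconds per call
        println("n = $n, nt = $nt: $t s")
    end
end
```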

**(b)** There are 3 loop indices: `for i in 1:m`, `for j in 1:n`, and `for l in 1:k`.
Experiment with different permutations of the order of these loops. Specifically,
what happens if you swap `for i in 1:m` with `for j in 1:n`? (A sketch of one such
variant follows below.) Try several permutations and report your results for
4000×4000 matrices. Can you identify any reason for the performance differences you
may observe? *(Hint: look up column-major order.)* You may ask an AI for help but
use your own words.
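
For example, the swapped variant could look like the following sketch (the name
`matmul_jil!` is ours, not part of the assignment; dimension checks are omitted for
brevity):

```julia
# Same algorithm, but with the j (column) loop outermost; tasks now
# partition columns of C instead of rows.
function matmul_jil!(C, A, B, nt = Threads.nthreads())
    m, k = size(A)
    n = size(B, 2)
    fill!(C, 0.0)
    OhMyThreads.@tasks for j in 1:n
        @set ntasks = nt
        for i in 1:m
            for l in 1:k
                C[i, j] += A[i, l] * B[l, j]
            end
        end
    end
    return C
end
```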

---

### 2. Introduction to GPUs

Run `nvidia-smi` to see the specifications of the GPUs you have available.
Julia automatically dispatches operations on GPU-allocated arrays to vendor-optimized
libraries (e.g., cuBLAS and cuSOLVER), so you do not need to write GPU kernels for every
function yourself. We use the `CUDA.jl` package to interact with the hardware.

**(a)** Allocate matrices on the GPU and benchmark matrix addition and multiplication,
analogous to what you did for CPUs. Use:

```julia
using CUDA
using LinearAlgebra  # provides mul!

# define n =
elty = Float32  # (avoid shadowing Base.eltype)
A = CUDA.randn(elty, n, n)
B = CUDA.randn(elty, n, n)
C = CUDA.zeros(elty, n, n)
mul!(C, A, B)  # 1. benchmark matmul for different n
C = A * B      # 2. compare with the above: what is the difference in speed, and why?
C .= A .+ B    # 3. benchmark matadd for different n
C = A + B      # 4. compare with the above: what is the difference in speed, and why?
```

To benchmark correctly, load `BenchmarkTools` and use
`@belapsed CUDA.@sync mul!($C, $A, $B)` (and likewise for the other operations);
interpolating the arrays with `$` avoids measuring global-variable overhead.
Submit your code and a table of the absolute execution times
for matmul and matadd, with and without allocations, as a function of matrix size.
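
One possible shape for the measurement loop (the sizes are placeholders; adjust them
to your GPU's memory):

```julia
using CUDA, LinearAlgebra, BenchmarkTools

for n in (256, 1024, 4096)
    A = CUDA.randn(Float32, n, n)
    B = CUDA.randn(Float32, n, n)
    C = CUDA.zeros(Float32, n, n)
    t_mul  = @belapsed CUDA.@sync mul!($C, $A, $B)  # matmul, no allocation
    t_mula = @belapsed CUDA.@sync $A * $B           # matmul, allocating
    t_add  = @belapsed CUDA.@sync $C .= $A .+ $B    # matadd, no allocation
    t_adda = @belapsed CUDA.@sync $A + $B           # matadd, allocating
    println("n = $n: mul! $t_mul, * $t_mula, .+ $t_add, + $t_adda")
end
```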

**(b)** Generate the following plots. Use your HW1 results for CPU timing and part (a) for
GPU timing. Use both very small and very large `n` (keep in mind the memory
limits of your GPU).

- **Matmul and matadd without allocation:** Plot CPU and GPU execution times for
  both operations on the same figure. What is faster at which matrix size, and why?
  Is there a difference between when matmul and matadd become faster on the GPU?
  *(Tip: use one color per operation and dashed/solid lines for CPU/GPU.)*

- **Matadd with and without allocations:** Plot CPU and GPU execution times for
  matadd with and without allocations. Do you see a difference in how much impact
  allocations have on the CPU vs. the GPU? *(Same plotting tip as above.)*

Use logarithmic axes where helpful.
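
A minimal plotting sketch with `Plots.jl` (the vectors `ns`, `t_cpu`, and `t_gpu` are
hypothetical containers for your own measurements):

```julia
using Plots

# ns = matrix sizes, t_cpu/t_gpu = execution times in seconds.
plot(ns, t_cpu; xscale = :log10, yscale = :log10, color = :blue,
     linestyle = :dash, label = "matmul (CPU)", xlabel = "n", ylabel = "time (s)")
plot!(ns, t_gpu; color = :blue, linestyle = :solid, label = "matmul (GPU)")
```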

**(c)** Test `@belapsed` *without* `CUDA.@sync` for different matrix sizes. What do you
observe? Look up the purpose of `CUDA.@sync` and explain in your own words why it
is necessary for correct benchmarking.
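
Concretely, the comparison looks like this (reusing `C`, `A`, `B` from part (a); the
explanation of the gap is the deliverable):

```julia
t_launch = @belapsed mul!($C, $A, $B)             # times only the kernel *launch*
t_full   = @belapsed CUDA.@sync mul!($C, $A, $B)  # waits until the GPU finishes
```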

**(d)** Allocate a GPU vector and use element-wise operations to double every element
(one line of code). Submit your code, benchmark it for different sizes, and generate a
plot comparing these GPU results to the CPU results from HW1 Q2a. At which sizes
does each implementation dominate, and why?
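
One possible one-liner (the vector size here is just a placeholder):

```julia
x = CUDA.rand(Float32, 2^20)
x .*= 2  # the broadcast fuses into a single GPU kernel
```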

**(e)** Now we will write a `KernelAbstractions.jl` kernel that performs the same
element-doubling operation. We assign one thread per element. Fill in the blanks:

```julia
using KernelAbstractions, CUDA
backend = CUDABackend()
elty = Float32
const NUMTHREADSINBLOCK = 64 # threads per CUDA block

# Inside a kernel the arrays are device-side, so keep the argument
# annotation generic (AbstractArray) rather than a host-array type.
@kernel function mykernel!(size_in::Int, input::AbstractArray)
    idx = # calculate idx using @index(Group, Linear) and @index(Local, Linear)
    if idx <= size_in
        value = # load value from input into a register
        # double the value
        input[idx] = value # write result back to global memory
    end
    return
end

doubling!(size_in::Int, A::AbstractArray) =
    mykernel!(backend, (NUMTHREADSINBLOCK,))(size_in, A; ndrange = size(A))
```

In the last line, `ndrange` sets the *total* number of work items across all blocks;
the number of blocks is then derived from `ndrange` and the workgroup size. Submit
your completed code and verify it produces correct results.
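
A quick correctness check might look like this (a sketch, assuming the blanks above
are filled in):

```julia
x = CUDA.rand(Float32, 1000)
expected = 2 .* x
doubling!(length(x), x)
KernelAbstractions.synchronize(backend)  # wait for the kernel to finish
@assert Array(x) ≈ Array(expected)
```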

---

### 3. 2D Access Patterns

Data is stored linearly on hardware, and threads are also indexed linearly; the 2D
representation is a programmer convenience. Julia is **column-major**: columns are
stored consecutively in memory. Consider a naive GPU matrix multiply kernel acting on
square matrices of size $2^{13} \times 2^{13}$.

The standard starting point uses a square block size of $(16, 16)$; a sketch of such
a naive kernel follows below.
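
If you need a starting point for the kernel itself, here is a minimal naive version in
`KernelAbstractions.jl` (one thread per output element; the name `naive_matmul!` and
the launch code are our assumptions, not a prescribed implementation):

```julia
using KernelAbstractions, CUDA

# Naive matmul: one thread per output element of C.
@kernel function naive_matmul!(C, @Const(A), @Const(B))
    i, j = @index(Global, NTuple)  # 2D global thread index
    if i <= size(C, 1) && j <= size(C, 2)
        acc = zero(eltype(C))
        for l in 1:size(A, 2)
            acc += A[i, l] * B[l, j]
        end
        C[i, j] = acc
    end
end

n = 2^13
A, B = CUDA.randn(Float32, n, n), CUDA.randn(Float32, n, n)
C = CUDA.zeros(Float32, n, n)
# Launch with a (16, 16) workgroup; ndrange covers the whole output matrix.
naive_matmul!(CUDABackend(), (16, 16))(C, A, B; ndrange = size(C))
```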

**(a)** Test at least six other block sizes: $(1,\, 16\times16)$, $(16\times16,\, 1)$,
$(8, 8)$, $(32, 32)$, and two more of your own choosing. Report the performance for
each and explain why you think the results differ.

**(b)** Have every thread handle more than one output element. Test at least the
configurations where each thread handles $(2,2)$, $(1,4)$, and $(4,1)$ consecutive
elements. Report performance and discuss what you observe. (One way to set up the
$(2,2)$ indexing is sketched below.)
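
A possible shape for the $(2,2)$ case (a hedged sketch; the tiling arithmetic is the
point, not the exact code):

```julia
# Each thread computes a 2×2 tile of C: thread (ti, tj) covers
# rows 2ti-1:2ti and columns 2tj-1:2tj.
@kernel function matmul_2x2!(C, @Const(A), @Const(B))
    ti, tj = @index(Global, NTuple)
    for dj in 0:1, di in 0:1
        i, j = 2ti - 1 + di, 2tj - 1 + dj
        if i <= size(C, 1) && j <= size(C, 2)
            acc = zero(eltype(C))
            for l in 1:size(A, 2)
                acc += A[i, l] * B[l, j]
            end
            C[i, j] = acc
        end
    end
end

# Each thread covers a 2×2 tile, so the ndrange shrinks accordingly:
# matmul_2x2!(CUDABackend(), (16, 16))(C, A, B; ndrange = cld.(size(C), 2))
```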

**(c)** Use your experience from parts (a) and (b) to propose at least one additional
strategy for dividing work across threads. Implement it and report on its performance
relative to the configurations above.

---

### 4. A Parallel Prefix Kernel

On a GPU you can synchronize threads
*within* a workgroup/block using `@synchronize`, but to synchronize *across* blocks you
must launch a new kernel.
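
To make the relaunch pattern concrete, here is one possible (unoptimized) shape of a
global-memory scan: a Hillis–Steele sweep in which every pass is a separate kernel
launch. All names are ours; treat this as a sketch under those assumptions, not the
required solution:

```julia
using KernelAbstractions, CUDA

# One Hillis–Steele pass: out[i] = inp[i] + inp[i - stride] (double-buffered,
# so reads and writes never race within a pass).
@kernel function scan_step!(out, @Const(inp), stride::Int)
    i = @index(Global, Linear)
    if i <= length(out)
        out[i] = i > stride ? inp[i] + inp[i - stride] : inp[i]
    end
end

# Host-side driver: each kernel launch acts as a global synchronization point.
function naive_scan(x::CuArray)
    a, b = copy(x), similar(x)
    stride = 1
    while stride < length(a)
        scan_step!(CUDABackend(), 256)(b, a, stride; ndrange = length(a))
        a, b = b, a  # swap buffers between passes
        stride *= 2
    end
    return a  # inclusive prefix sum (use the returned array)
end
```

Note that each pass streams through all of global memory; that inefficiency is exactly
what parts (b) and (d) ask you to attack.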

**(a)** Write a naive parallel prefix sum kernel using only global memory (no shared
memory). Test your kernel at different vector sizes (powers of two, $\geq 2^8$) and
verify correctness. Submit your code.

**(b)** Based on your experience optimizing global memory access, propose
and implement at least one alternative structure for this algorithm that you expect to
be more performant. Compare both implementations at a vector size large enough to
occupy the GPU (ideally $2^{28}$; use a smaller size if this takes too long).

**(c)** What makes data-access optimization more complex for parallel prefix than for
element doubling? Write 2–4 sentences.

**(d)** Rewrite the kernel making use of **workgroup-shared memory**
(`@localmem eltype(input) (blocksize,)`) and **thread-private register memory**
(`@private eltype(input) (n,)`). Aim to avoid bank conflicts in your access pattern.
Benchmark this version against your implementations from parts (a) and (b) and
comment on what you observe.
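
For orientation, the declarations look like this inside a kernel (an illustrative
skeleton only, not a working scan; `NUMTHREADSINBLOCK` is the workgroup size from
question 2(e), and the names and shapes are assumptions):

```julia
@kernel function shared_skeleton!(output, @Const(input))
    li = @index(Local, Linear)
    gi = @index(Global, Linear)

    tile = @localmem eltype(input) (NUMTHREADSINBLOCK,)  # shared within one workgroup
    regs = @private eltype(input) (4,)                   # registers, private per thread

    tile[li] = input[gi]  # stage global data into shared memory
    @synchronize          # make the staged tile visible to the whole workgroup
    regs[1] = tile[li]    # ... your block-level scan goes here ...
    output[gi] = regs[1]
end
```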