Merged
This looks fantastic. Great optimizing work. I don't have time right now, but I will take a closer look at this before the end of the week (probably Friday). If others look/review this and approve the PR, feel free to merge without me.
I happened to notice that extrapolation is quite a lot slower than interpolation:
This slowness doesn't seem inevitable, because the amount of "work" (computation) performed by extrapolation is quite minor compared to the work done by interpolation:
Interpolation also involves a whole second round of `clamp`s and throws in a bunch of `floor`/`round` and arithmetic operations to boot (16 multiplies and 16 adds for linear interpolation in 3 dimensions). So one might suspect this could be fixed through a careful look at the generated code.

The first commit here is almost trivial: it adds forced-inlining to circumvent the splatting penalty. This leads to an approximately 20ns reduction in this test case. The second commit is more complicated; `size` and `indices`, when called with a dimension argument, involve a branch that looks something like this:

Our generated code used these dimension-specific constructs and triggered two branches per dimension (one for `lbound` and one for `ubound`), resulting in 6 branches total for this example. Note that such branches are not present if you start from `sz = size(A)`, so I changed our generated code to use this style (with `indices` rather than `size`). This gave another approximate 20ns boost, so that the result with this PR is

which is a much more reasonable overhead compared to interpolation.
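To make the two commits concrete, here is a minimal sketch of the idea, not the package's actual generated code. It assumes Julia 1.x, where `indices` has been renamed `axes`. The helpers `size_dim`, `clamp_per_dim`, and `clamp_hoisted` are hypothetical names introduced only for illustration, and `size_dim` only mimics the general shape of a per-dimension lookup.

```julia
# Hypothetical stand-in for a per-dimension query such as `size(A, d)`:
# it has to check whether `d` exceeds the number of dimensions.
size_dim(A::AbstractArray{T,N}, d::Integer) where {T,N} = d <= N ? size(A)[d] : 1

# Style before the change (sketch): one per-dimension lookup for each bound,
# so each dimension pays for the branch inside `size_dim`.
function clamp_per_dim(A::AbstractArray{T,3}, i, j, k) where {T}
    return (clamp(i, 1, size_dim(A, 1)),
            clamp(j, 1, size_dim(A, 2)),
            clamp(k, 1, size_dim(A, 3)))
end

# Style after the change (sketch): fetch the whole tuple of axes once and
# read the bounds from it, so no dimension check is needed per lookup.
# `@inline` asks the compiler to inline the call, which is the usual way to
# avoid paying for the varargs (`I...`) call at each call site.
@inline function clamp_hoisted(A::AbstractArray, I...)
    ax = axes(A)
    return map((i, r) -> clamp(i, first(r), last(r)), I, ax)
end

A = zeros(3, 4, 5)
clamp_per_dim(A, 0, 2, 9)   # (1, 2, 5)
clamp_hoisted(A, 0, 2, 9)   # (1, 2, 5)
```

Both functions return the same clamped indices. Whether the per-dimension branch survives compilation depends on the Julia version and on how much the compiler can constant-fold, so any timing difference should be measured rather than assumed.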