Add new multithreaded TwoQubitPeepholeOptimization pass (#13419)
mtreinish wants to merge 86 commits into Qiskit:main
Conversation
Pull Request Test Coverage Report for Build 15654439601 (Coveralls)
This commit adds a new transpiler pass for physical optimization, TwoQubitPeepholeOptimization. This replaces the use of Collect2qBlocks, ConsolidateBlocks, and UnitarySynthesis in the optimization stage for a default pass manager setup. The pass logically works the same way: it analyzes the dag to get a list of 2q runs, calculates the matrix of each run, and then synthesizes the matrix and substitutes it in place. The distinction this pass makes is that it does all of this in a single pass, and also parallelizes the matrix calculation and synthesis steps because there is no data dependency there.

This new pass is not meant to fully replace the Collect2qBlocks, ConsolidateBlocks, or UnitarySynthesis passes, as those also run in contexts where we don't have a physical circuit. It is meant instead to replace their usage in the optimization stage only. Accordingly, this new pass also changes the logic on how we select the synthesis to use and when to make a substitution. Previously this logic was primarily done via the ConsolidateBlocks pass, by only consolidating to a UnitaryGate if the number of basis gates needed based on the Weyl chamber coordinates was less than the number of 2q gates in the block (see Qiskit#11659 for discussion on this). Since this new pass skips the explicit consolidation stage, we go ahead and try all the available synthesizers.

Right now this commit has a number of limitations, the largest being:

- It only supports the target
- It doesn't support any synthesizers besides the TwoQubitBasisDecomposer, because it's the only one in rust currently.

For plugin handling I left the logic as running the three pass series, but I'm not sure this is the behavior we want. We could, say, keep the synthesis plugins for `UnitarySynthesis` only and then rely on our built-in methods for physical optimization. But this also seems less than ideal, because the plugin mechanism is how we support synthesizing to custom basis gates and also more advanced approximate synthesis methods. Both of those are things we need to do as part of the synthesis here.

Additionally, this is currently missing tests and documentation, and while running it manually "works", in that it returns a circuit that looks valid, I've not done any validation yet. This also will likely need several rounds of performance optimization and tuning. At this point this is just a rough proof of concept and will need a lot of refinement, along with larger changes to Qiskit's rust code, before it is ready to merge.

Fixes Qiskit#12007
Fixes Qiskit#11659
Since Qiskit#13139 merged we have another two qubit decomposer available to run in rust, the TwoQubitControlledUDecomposer. This commit updates the new TwoQubitPeepholeOptimization to call this decomposer if the target supports appropriate 2q gates.
Clippy is correctly warning that the size difference between the two decomposer types in the TwoQubitDecomposer enum is large. TwoQubitBasisDecomposer is 1640 bytes while TwoQubitControlledUDecomposer is only 24 bytes, meaning each ControlledU element wastes > 1600 bytes. However, in this case that is acceptable in order to avoid a layer of pointer indirection, as these are stored temporarily in a vec inside a thread to decompose a unitary. A trait would be a more natural way to define a common interface between all the two qubit decomposers, but since we keep them instantiated for each edge in a Vec they need to be sized, and working around that with something like `Box<dyn TwoQubitDecomposer>` (assuming a trait `TwoQubitDecomposer` instead of an enum) would have additional runtime overhead. This is also considering that TwoQubitControlledUDecomposer is far less likely to be used in practice, as it only works with targets that have RZZ, RXX, RYY, or RZX gates on an edge, which is less common.
Also don't run scoring more than needed.
Copying here the comment of @t-imamichi from #13568 (comment):
We should make sure that after PR #13568 and this PR are merged, we can efficiently transpile circuits into a basis with fractional RZZ gates.
I added support for using the |
…re locking is needed
We don't want to spend time reconstructing an exact copy of the dag if there are no substitutions needed. Prior to using a vec for tracking the run indices that nodes are part of, we would check whether that map was empty. The vec is always populated, and determining whether there are no entries would require a worst-case O(n) scan to check whether any entries are set. To avoid this overhead while keeping the check, this adds an atomic bool that tracks whether we've substituted any blocks. If it is not set to true we can just exit early, since there are no substitutions to make.
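The early-exit check can be modeled in Python (a simplified sketch; the real implementation uses a Rust `AtomicBool` shared between worker threads, and the function names here are hypothetical):

```python
# run_indices[i] holds the run index that node i belongs to, or None.
# Checking "were any substitutions made?" by scanning this vec is a
# worst-case O(n) operation; reading a flag set by any worker is O(1).

def rebuild_needed_scan(run_indices):
    # Worst-case O(n): scan the whole vec for any populated entry.
    return any(idx is not None for idx in run_indices)

def rebuild_needed_flag(substituted_flag):
    # O(1): just read the flag the substitution step set.
    return substituted_flag

def finalize(dag, substituted_flag, rebuild_dag):
    # If nothing was substituted, return the input dag untouched
    # instead of reconstructing an identical copy.
    if not substituted_flag:
        return dag
    return rebuild_dag()
```

The scan and the flag answer the same question; the flag just moves the bookkeeping to the point where a substitution actually happens.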
I think this is almost ready; the only missing piece is that I want to figure out a good block count to use for the parallel threshold. There is overhead associated with launching the thread pool, and for circuits with few blocks to check/optimize that overhead can be higher than the runtime of the pass when run serially. We'll need to do some benchmarking to figure out where the crossover point is and then add that to the pass.
Here are a few examples of some circuits when using the following optimization code. (The example circuits and their transpiled outputs for Examples 1 and 2, including the inverse circuits, were rendered inline in the original comment.)
My concern is that without performing one-qubit optimizations, the two-qubit decomposers may produce too many unnecessary one-qubit gates, which may eventually negatively affect the two-qubit gate optimizations. (The example circuit and the resulting gate counts were rendered inline in the original comment.) As we have seen above, by optimizing the one-qubit gates one can reduce the gate counts.
I agree you almost always want to follow this pass with a 1q optimization pass. But the way the algorithm works it doesn't have the full context of the circuit when it is evaluating peephole optimizations to make. Each synthesis decision is made in the isolated context of a single block. To get a benefit from the
This should already be factored into the unitary synthesis code and will depend on your target. This pass doesn't really add new logic around this, we give the matrix and the qubits it operates on to the existing 2q synthesis code from
There is always going to be a compromise with these kinds of optimization passes. Those don't change with this new pass, in particular the combination of
In this particular case this is actually partially covered: the estimated fidelity of the synthesis is factored into the algorithm already. However, as you point out, if the number of 2q gates decreases that takes priority over the fidelity estimate. When I originally wrote this pass I had the scoring function compare

This is actually a case where I think we need to talk about benchpress (and maybe other benchmark suites too), because right now benchpress is not factoring in this kind of tradeoff. It only looks at 2q gate count and 2q depth, so the optimization the pass is doing right now, regardless of the gates' error rates, makes us look better in benchpress. Personally I'm fine doing what we think results in a better quality compilation (which should be higher estimated fidelity) if we can determine that with some certainty, even if it potentially makes us look worse in benchpress, since it's the real-world results that matter and not what a synthetic benchmark says. But this is the kind of discussion we should feed back into benchpress to improve the quality of benchmarking there.
This commit switches the topological sort function we use in transpiler passes when reconstructing a dag from scratch. In several passes where we typically replace or remove a large number of gates, we iterate over the input dag in topological order and construct a copy of it, making the alterations as we go. Right now when we do this we rely on `DAGCircuit::topological_op_nodes`, which makes sense because it's our built-in method for iterating over a dag's op nodes in topological order. Internally this uses rustworkx's lexicographical topological sort function, with our custom sort function that maintains our desired tie breaker using the bits of a node. However, since Qiskit#14762, where we're asserting structural equality in passes, we don't need that sort anymore for these reconstruction cases; we just need a consistent topological sort. In optimizing Qiskit#13419, one thing that showed up in profiles was the overhead of the lexicographical topological sort for this iteration on very large circuits. The toposort function in petgraph is lower overhead because it doesn't need to worry about the lexicographical tie breaker. By switching to it we can reduce the overhead of the final sort in all these passes. In asv benchmarking with this commit, transpiler benchmarks are 2-5% faster (although without asv flagging it as significant).
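The difference between the two sorts can be modeled in Python (a sketch: `graphlib.TopologicalSorter` plays the role of petgraph's plain toposort, while the heap-based variant models rustworkx-style lexicographical tie-breaking):

```python
from graphlib import TopologicalSorter
import heapq

# Map each node to its predecessors.
edges = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}

# Plain topological sort: any valid order is acceptable, so there is
# no per-node tie-breaking cost.
plain = list(TopologicalSorter(edges).static_order())

def lexicographical_toposort(preds):
    # Lexicographical variant: among all currently-ready nodes, always
    # emit the smallest key, which requires maintaining a heap.
    indeg = {n: len(p) for n, p in preds.items()}
    succs = {n: [] for n in preds}
    for n, ps in preds.items():
        for p in ps:
            succs[p].append(n)
    ready = [n for n, d in indeg.items() if d == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        n = heapq.heappop(ready)
        order.append(n)
        for s in succs[n]:
            indeg[s] -= 1
            if indeg[s] == 0:
                heapq.heappush(ready, s)
    return order
```

When only structural equality matters, the plain sort gives a valid order without paying for the heap maintenance the lexicographical variant needs.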
ShellyGarion
left a comment
There was a problem hiding this comment.
I think this is an important addition to the unitary synthesis.
This is a preliminary review with some minor comments (there are still a few comments from my previous review).
I still need to look at the score calculation and the tests.
ShellyGarion
left a comment
There was a problem hiding this comment.
Another question: how does this new transpiler pass appear in the default optimization levels?
In some places, you mention that it should replace UnitarySynthesis and ConsolidateBlocks, but should it be added in the default optimization levels?
and :class:`.TwoQubitControlledUDecomposer` for synthesizing the two qubit unitaries.
You should not use this pass if you need to use the pluggable interface and the ability
to use different synthesis algorithms; instead you should use a combination of
:class:`.ConsolidateBlocks` and :class:`.UnitarySynthesis` to use the plugin mechanism
There was a problem hiding this comment.
Maybe add some of this as a note? (There are too many details here.)
Also, it may be good to add an example (like you did in the release notes).
This pass only makes sense as a physical optimization pass, where we've already run layout and translation on the circuit, since we need to compare the estimated fidelity of the block against the available synthesis outcomes. So this pass is designed to replace the use of

qiskit/transpiler/preset_passmanagers/builtin_plugins.py, lines 506 to 519 in 859c425

and

qiskit/transpiler/preset_passmanagers/builtin_plugins.py, lines 537 to 549 in 859c425

We'd replace the usage in those two places with this new pass. This is for levels 2 and 3 respectively, which matches where we are currently doing this peephole optimization. But we still would use
This should be done in a follow-up PR?
Yes, that's the plan. This PR is big enough on its own, and we should handle it as adding a new pass without worrying about the impact on preset pass managers. Then in a follow-up PR we can experiment with adding it to our default pipeline and see what the impacts are from doing that in isolation.
Previously there was a mismatch between the scoring of synthesis results and the peephole pass's comparison with the original block. The pass is documented as using the tuple (num_2q_gates, error, num_gates) and picking the min of all the choices. But when we called the unitary synthesis function that selects the best synthesis outcome, it was maximizing the estimated fidelity and not considering the gate counts like the pass is documented as doing. This commit corrects the mismatch by updating the function doing the synthesis to be generic over the score type and take a scorer callback. This lets the peephole pass control the heuristic used for selecting the best score.
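The documented heuristic amounts to taking the minimum over candidate score tuples. A minimal Python model (the dict keys and the `pick_best` helper are illustrative; the real code is generic over the score type via a scorer callback):

```python
def score(candidate):
    # Documented ordering: fewest 2q gates first, then lowest
    # estimated error, then fewest total gates.
    return (candidate["num_2q_gates"], candidate["error"], candidate["num_gates"])

def pick_best(original_block, candidates):
    # The original block competes on the same footing as every
    # synthesis outcome; it is kept when nothing scores strictly lower.
    return min([original_block, *candidates], key=score)
```

Because Python compares tuples lexicographically, `min` with this key implements exactly the documented tie-breaking order.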
ShellyGarion
left a comment
There was a problem hiding this comment.
I'm happy to see that this PR is finally ready.
I have some minor suggestions on the tests.
from ..legacy_cmaps import YORKTOWN_CMAP

class FakeBackend2QV2(GenericBackendV2):
There was a problem hiding this comment.
Why are this class and the following class here and not in test.python.providers?
I noticed that we have several tests in which we define these classes (like test_unitary_synthesis), and wondered whether we should put them in a shared place.
There was a problem hiding this comment.
I think in this case that's fine and almost expected. In general, test code is a bit different from normal library code. You want test code to be explicit about what it's testing, to make it easier to debug and understand what's going on, and you want the code structure to be flatter, because test code needs to be easy to understand and debug when it fails. That sometimes comes at the cost of some code duplication or decentralization. There's a tradeoff either way, but I think for this case it's fine.
ShellyGarion
left a comment
There was a problem hiding this comment.
LGTM. I think it's an important contribution to Qiskit. I'm not merging it yet, since perhaps others would like to review it.
Summary
This commit adds a new transpiler pass for physical optimization,
TwoQubitPeepholeOptimization. This replaces the use of Collect2qBlocks,
ConsolidateBlocks, and UnitarySynthesis in the optimization stage for
a default pass manager setup. The pass logically works the same way
where it analyzes the dag to get a list of 2q runs, calculates the matrix
of each run, and then synthesizes the matrix and substitutes it inplace.
The distinction this pass makes though is it does this all in a single
pass and also parallelizes the matrix calculation and synthesis steps
because there is no data dependency there.
This new pass is not meant to fully replace the Collect2qBlocks,
ConsolidateBlocks, or UnitarySynthesis passes as those also run in
contexts where we don't have a physical circuit. This is meant instead
to replace their usage in the optimization stage only. Accordingly this
new pass also changes the logic on how we select the synthesis to use
and when to make a substitution. Previously this logic was primarily done
via the ConsolidateBlocks pass by only consolidating to a UnitaryGate if
the number of basis gates needed based on the Weyl chamber coordinates
was less than the number of 2q gates in the block (see #11659 for
discussion on this). Since this new pass skips the explicit
consolidation stage we go ahead and try all the available synthesizers.
Right now this commit has a number of limitations, the largest being:
- Only the built-in synthesizers are supported (`TwoQubitBasisDecomposer` and `TwoQubitControlledUDecomposer` are used). This pass doesn't support using the unitary synthesis plugin interface, since it's optimized to use Qiskit's built-in two qubit synthesis routines written in Rust. The existing combination of `ConsolidateBlocks` and `UnitarySynthesis` should be used instead if the plugin interface is necessary.
Details and comments
Fixes #12007
Fixes #11659
TODO: