CUDAExecutionProvider optimized model adds incompatible node resulting in Failed to find kernel for MemcpyToHost #11348

**Describe the bug**

I get this error:

```text
Unhandled exception: System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation.
 ---> Microsoft.ML.OnnxRuntime.OnnxRuntimeException: [ErrorCode:NotImplemented] Failed to find kernel for MemcpyToHost(1) (node Memcpy). Kernel not found
   at Microsoft.ML.OnnxRuntime.NativeApiStatus.VerifySuccess(IntPtr nativeStatus)
   at Microsoft.ML.OnnxRuntime.InferenceSession.Init(String modelPath, SessionOptions options, PrePackedWeightsContainer prepackedWeightsContainer)
   at Microsoft.ML.OnnxRuntime.InferenceSession..ctor(String modelPath, SessionOptions options)
```

Here is a minimal test case (also mentioned [here](https://github.com/pytorch/pytorch/issues/76344)):

```python
import torch.nn as nn


class MinTest(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        # max over dim 1 returns both the values and their indices
        top_probs, top = x.max(1, keepdim=True)
        return top_probs, top
```

This is the exported model:

![image](https://user-images.githubusercontent.com/873905/165182186-9cd3b5f8-aea8-4ae0-80b2-42aff27674f5.png)

I can use this function to do basic cleanup, and the graph stays the same:

```python
def optimize_graph(onnxfile, onnxfile_optimized=None):
    import onnxruntime as rt

    if not onnxfile_optimized:
        onnxfile_optimized = onnxfile[:-5] + "_optimized.onnx"  # ONNX optimizer is broken, using ORT to optimize
    sess_options = rt.SessionOptions()
    sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_BASIC
    sess_options.optimized_model_filepath = onnxfile_optimized
    _ = rt.InferenceSession(onnxfile, sess_options, providers=['CPUExecutionProvider'])
    return onnxfile_optimized
```

But if I use the CUDA EP for the cleanup:

```python
def optimize_graph(onnxfile, onnxfile_optimized=None):
    import onnxruntime as rt

    if not onnxfile_optimized:
        onnxfile_optimized = onnxfile[:-5] + "_optimized.onnx"  # ONNX optimizer is broken, using ORT to optimize
    sess_options = rt.SessionOptions()
    sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_BASIC
    sess_options.optimized_model_filepath = onnxfile_optimized
    _ = rt.InferenceSession(onnxfile, sess_options, providers=['CUDAExecutionProvider'])
    return onnxfile_optimized
```
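The two variants above differ only in the provider list, so a single parameterized helper may make the comparison easier to reproduce. This is a minimal sketch, not part of the original report: the names `optimized_path` and `optimize_graph_with` are hypothetical, and the output path is derived with `pathlib` instead of slicing off the last five characters.

```python
from pathlib import Path


def optimized_path(onnxfile):
    """Return '<stem>_optimized.onnx' next to the input file (hypothetical helper)."""
    p = Path(onnxfile)
    return str(p.with_name(p.stem + "_optimized.onnx"))


def optimize_graph_with(onnxfile, providers, onnxfile_optimized=None):
    # Imported lazily so the path helper works without onnxruntime installed.
    import onnxruntime as rt

    if not onnxfile_optimized:
        onnxfile_optimized = optimized_path(onnxfile)
    sess_options = rt.SessionOptions()
    sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_BASIC
    sess_options.optimized_model_filepath = onnxfile_optimized
    # Creating the session is what writes the optimized model to disk.
    _ = rt.InferenceSession(onnxfile, sess_options, providers=list(providers))
    return onnxfile_optimized
```

Calling `optimize_graph_with("model.onnx", ["CPUExecutionProvider"])` and then `optimize_graph_with("model.onnx", ["CUDAExecutionProvider"], "model_cuda_optimized.onnx")` would produce the two files compared in this report (the file names here are placeholders).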

The optimized graph then contains this problematic node:

![image](https://user-images.githubusercontent.com/873905/165183881-68eb5bd6-b5a8-4bb3-8ee9-f43deab5b0e9.png)
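The inserted node can also be detected without a graph viewer. The sketch below is an illustration under assumptions: it relies on the `onnx` package being installed, the helper names are hypothetical, and it only looks for the `MemcpyToHost` and `MemcpyFromHost` op types by name.

```python
MEMCPY_OPS = {"MemcpyToHost", "MemcpyFromHost"}


def find_memcpy_nodes(nodes):
    """Return (name, op_type) for every node whose op type is a Memcpy op."""
    return [(n.name, n.op_type) for n in nodes if n.op_type in MEMCPY_OPS]


def memcpy_nodes_in_file(path):
    # Imported lazily; assumes the `onnx` package is available.
    import onnx

    model = onnx.load(path)
    return find_memcpy_nodes(model.graph.node)
```

Running `memcpy_nodes_in_file` on both optimized files should show whether the CUDA-optimized model is the only one carrying a Memcpy node.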


**System information**
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows
- ONNX Runtime installed from (source or binary): 1.10
- ONNX Runtime version: 1.10
- Python version:
- Visual Studio version (if applicable):
- GCC/Compiler version (if compiling from source):
- CUDA/cuDNN version:
- GPU model and memory:

**Expected behavior**
Optimization should not add that extra incompatible node.


