I traced the BERT model from PyTorchTransformers library and getting the following results for 10 iterations.
a) Using Python runtime for running the forward: 979,292 µs
import time
model = torch.jit.load('models_backup/2_2.pt')
x = torch.randint(2000, (1, 14), dtype=torch.long, device='cpu')
start = time.time()
for i in range(10):
model(x)
end = time.time()
print((end - start)*1000000, "µs")
b) Using C++ runtime for running the forward: 3,333,758 µs which is almost 3x of what Python
torch::Tensor x = torch::randint(index_max, {1, inputsize}, torch::dtype(torch::kInt64).device(torch::kCPU));
input.push_back(x);
#endif
// Execute the model and turn its output into a tensor.
auto outputs = module->forward(input).toTuple();
auto start = chrono::steady_clock::now();
for (int16_t i = 0; i<10; ++i)
{
outputs = module->forward(input).toTuple();
}
auto end = chrono::steady_clock::now();
cout << "Elapsed time in microseconds : "
<< chrono::duration_cast<chrono::microseconds>(end - start).count()
<< " µs" << endl;
@thomwolf any suggestions on what am I missing ?
I traced the BERT model from PyTorchTransformers library and getting the following results for 10 iterations.
a) Using Python runtime for running the forward: 979,292 µs
b) Using C++ runtime for running the forward: 3,333,758 µs which is almost 3x of what Python
@thomwolf any suggestions on what am I missing ?