I wrote a C++ extension for PyTorch that implements a custom convolution function.
First, I compiled the function directly with g++ for testing; the latency was 5 milliseconds.
Then I integrated the function into PyTorch and installed it as an extension with setuptools, following the steps in the PyTorch C++ extension tutorial. However, the latency is now 16 milliseconds.
The function invocation itself should only account for about 1-2 ms, so why does the performance differ so much?
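For context, the standalone measurement was along these lines (a minimal sketch: custom_conv, the buffer sizes, and the fill values are placeholders for my actual kernel and data):

// Minimal timing sketch for the standalone g++ build.
// custom_conv is a placeholder for the real convolution kernel.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

void custom_conv(const std::vector<float>& in, std::vector<float>& out) {
    for (std::size_t i = 0; i < in.size(); ++i)  // placeholder body
        out[i] = in[i] * 0.5f;
}

int main() {
    std::vector<float> in(1 << 20, 1.0f), out(1 << 20, 0.0f);
    custom_conv(in, out);  // warm-up run

    auto t0 = std::chrono::steady_clock::now();
    custom_conv(in, out);
    auto t1 = std::chrono::steady_clock::now();
    std::printf("latency: %.3f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count());
    return 0;
}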
The direct compilation with g++ was done with
g++ -pthread -mavx2 -mfma ...
and the directives in the source file include
#pragma GCC diagnostic ignored "-Wformat"
#pragma STDC FP_CONTRACT ON
#pragma GCC optimize("O3","unroll-loops","omit-frame-pointer","inline") //Optimization flags
// #pragma GCC option("arch=native","tune=native","no-zero-upper") //Enable AVX
#pragma GCC target("avx")
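To make sure the pragmas and the command-line -m flags agree, I added a compile-time sanity check right after them (a sketch; on the GCC versions I looked at, the predefined ISA macros track both the -m options and #pragma GCC target, but that is an assumption worth verifying for your toolchain):

// Compile-time sanity check (sketch): warn if the expected ISA/optimization
// settings are not in effect in this translation unit. Assumes a recent GCC,
// where these predefined macros follow the -m flags and the target pragma.
#ifndef __AVX2__
#warning "__AVX2__ is not defined here: -mavx2 did not take effect"
#endif
#ifndef __FMA__
#warning "__FMA__ is not defined here: -mfma did not take effect"
#endif
#ifndef __OPTIMIZE__
#warning "__OPTIMIZE__ is not defined: compiling without -O optimization"
#endif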
These directives were also included in the source file built by setuptools. The setup.py file is
from setuptools import setup
from torch.utils.cpp_extension import CppExtension, BuildExtension

setup(
    name='cusconv_cpp',
    ext_modules=[
        CppExtension(name='cusconv_cpp',
                     sources=['src/cusconv.cpp'],
                     extra_compile_args={'cxx': ['-O3', '-pthread', '-mavx2', '-mfma']})
    ],
    cmdclass={
        'build_ext': BuildExtension
    })
The build log output by setuptools is
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/home/max/.local/lib/python3.6/site-packages/torch/lib/include -I/home/max/.local/lib/python3.6/site-packages/torch/lib/include/torch/csrc/api/include -I/home/max/.local/lib/python3.6/site-packages/torch/lib/include/TH -I/home/max/.local/lib/python3.6/site-packages/torch/lib/include/THC -I/usr/include/python3.6m -c src/indconv.cpp -o build/temp.linux-x86_64-3.6/src/indconv.o -O3 -pthread -mavx2 -mfma -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=indconv_cpp -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
which does include my flags (appended after setuptools' defaults, so the trailing -O3 should override the earlier -O2), but many other flags were also used. Does anyone have any ideas?
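To narrow down where the extra ~11 ms goes, my next step is to time the kernel from inside the extension entry point, so that binding/dispatch overhead is separated from the convolution itself. A rough sketch (cusconv_forward and run_conv_kernel are hypothetical stand-ins for my actual functions):

// Sketch: time only the inner kernel inside the extension entry point.
// run_conv_kernel stands in for the real AVX2 convolution.
#include <torch/extension.h>
#include <chrono>
#include <cstdio>

torch::Tensor run_conv_kernel(const torch::Tensor& input,
                              const torch::Tensor& weight) {
    return torch::matmul(input, weight);  // placeholder for the real kernel
}

torch::Tensor cusconv_forward(torch::Tensor input, torch::Tensor weight) {
    auto t0 = std::chrono::steady_clock::now();
    torch::Tensor out = run_conv_kernel(input, weight);
    auto t1 = std::chrono::steady_clock::now();
    std::printf("kernel time: %.3f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count());
    return out;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("forward", &cusconv_forward, "timed custom conv forward");
}

If the kernel time printed here matches the 5 ms from the standalone build, the difference would lie in the binding layer or tensor preparation rather than in code generation.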