Monday, October 22, 2018

pybind11 + Eigen + MKL slower than NumPy at QR decomposition

I am trying to wrap some C++ code that calls Eigen using pybind11. The code compiles successfully with EIGEN_USE_MKL_ALL defined. My setup.py script is:

import os, sys
import numpy as np
from distutils.core import setup, Extension
from distutils import sysconfig

args  = []
args += ['-std=c++14','-lstdc++']
args += ['-O3', '-march=native','-fopenmp']
args += ['-DMKL_ILP64', '-m64', '-I${MKLROOT}/include']
args += ['-L${MKLROOT}/lib/intel64', '-Wl,--no-as-needed', '-lmkl_intel_ilp64', '-lmkl_intel_thread', '-lmkl_core', '-liomp5', '-lpthread', '-lm', '-ldl']

ext_modules = [
    Extension(
        'linear_algebra_utilities',
        ['linear_algebra_utilities.cpp'],
        extra_link_args=args,
        extra_compile_args = args,
        include_dirs=['pybind11/include','eigen3'],
        language='c++14',
    ),
]

setup(
    name='cpputilities',
    version='0.0.1',
    author='Benjamin Cohen-Stead',
    author_email='bwcohenstead@ucdavis.edu',
    description='Linear Algebra Utilities.',
    ext_modules=ext_modules,
)

This generates the following compilation calls:

running build_ext
building 'linear_algebra_utilities' extension
gcc -pthread -B /home/benwcs/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -Ipybind11/include -Ieigen3 -I/home/benwcs/anaconda3/include/python3.7m -c linear_algebra_utilities.cpp -o build/temp.linux-x86_64-3.7/linear_algebra_utilities.o -std=c++14 -lstdc++ -O3 -march=native -fopenmp -DMKL_ILP64 -m64 -I${MKLROOT}/include -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
gcc -pthread -shared -B /home/benwcs/anaconda3/compiler_compat -L/home/benwcs/anaconda3/lib -Wl,-rpath=/home/benwcs/anaconda3/lib -Wl,--no-as-needed -Wl,--sysroot=/ build/temp.linux-x86_64-3.7/linear_algebra_utilities.o -o /home/benwcs/Documents/matrix_stabilization/linear_algebra_utilities.cpython-37m-x86_64-linux-gnu.so -std=c++14 -lstdc++ -O3 -march=native -fopenmp -DMKL_ILP64 -m64 -I${MKLROOT}/include -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl
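
Two things stand out in the log above: the same flag list is passed to both the compile step and the link step, and `${MKLROOT}` appears unexpanded. As far as I can tell, distutils invokes the compiler without a shell, so the variable would reach gcc as literal text. A minimal sketch of one way to split the flags and resolve the path in Python follows. The helper name `mkl_flags` and the fallback path `/opt/intel/mkl` are hypothetical, and the flags themselves are only regrouped from the script above, not verified as the right set for MKL.

```python
import os


def mkl_flags(mklroot):
    """Return (compile_args, link_args) for an MKL install rooted at mklroot.

    The flags are the ones from the original setup.py, regrouped by build
    step; -fopenmp is kept in both steps, as in the original.
    """
    compile_args = [
        '-std=c++14', '-O3', '-march=native', '-fopenmp',
        '-DMKL_ILP64', '-m64',
        '-I' + os.path.join(mklroot, 'include'),
    ]
    link_args = [
        '-fopenmp',
        '-L' + os.path.join(mklroot, 'lib', 'intel64'),
        '-Wl,--no-as-needed',
        '-lmkl_intel_ilp64', '-lmkl_intel_thread', '-lmkl_core',
        '-liomp5', '-lpthread', '-lm', '-ldl',
    ]
    return compile_args, link_args


# Resolve MKLROOT here rather than relying on a shell to expand it.
# '/opt/intel/mkl' is a placeholder default, not a known install location.
compile_args, link_args = mkl_flags(os.environ.get('MKLROOT', '/opt/intel/mkl'))

# These lists would then be passed to Extension(...) as
#   extra_compile_args=compile_args, extra_link_args=link_args
```

Keeping the two lists separate also makes it easier to see in the build log which options each gcc invocation actually receives.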

The module exposes the following function, which can be called from Python and performs a QR decomposition:

void QR_to_UdV(const Eigen::Ref<const Eigen::MatrixXd> A,
               Eigen::Ref<Eigen::MatrixXd> U,
               Eigen::Ref<Eigen::VectorXd> d,
               Eigen::Ref<Eigen::MatrixXd> V){

    U = A.householderQr().householderQ();
    V = A.householderQr().matrixQR().triangularView<Eigen::Upper>();

    d = V.diagonal();
    V.array().colwise() /= d.array();
}
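
Read literally, and assuming A is a square n×n matrix whose R factor has a nonzero diagonal, the function above computes the factorization below. The symbol D is introduced here only for illustration; in the code it exists as the vector d.

```latex
A = QR, \qquad d_i = R_{ii}, \qquad D = \operatorname{diag}(d_1, \dots, d_n),
\qquad U = Q, \qquad V = D^{-1} R
\quad\Longrightarrow\quad A = U D V, \qquad V_{ii} = 1 .
```

The expression `V.array().colwise() /= d.array()` divides every column of V element-wise by d, so row i of R is scaled by 1/d_i, which is the D^{-1} R term. Note also that `A.householderQr()` appears in two separate statements, so as written each call builds the factorization twice.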

However, when I time this function against numpy's QR decomposition, it is many times slower:

  • Eigen+MKL QR: 238ms
  • numpy QR: 41ms

I am fairly certain this difference in runtime is because the numpy function is using multithreading but mine is not. I am including the -fopenmp compilation flag, so I do not understand why multithreading is not being used in my code, given that I have linked it to MKL.
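
One way to test the multithreading hypothesis is to time both routines with the thread count pinned and then unpinned. The sketch below uses only the standard library. The helper names `best_time_ms` and `thread_settings` are hypothetical, and the workload in the demo is a stand-in to be replaced by the real calls.

```python
import os
import timeit


def best_time_ms(func, repeat=5):
    """Best wall-clock time of func() over `repeat` single runs, in ms."""
    return 1000.0 * min(timeit.repeat(func, repeat=repeat, number=1))


def thread_settings(env=None):
    """Report the thread-count variables read by OpenMP and MKL.

    A value of None means the variable is unset in the given environment.
    """
    env = os.environ if env is None else env
    names = ("OMP_NUM_THREADS", "MKL_NUM_THREADS")
    return {name: env.get(name) for name in names}


if __name__ == "__main__":
    print(thread_settings())
    # Stand-in workload; swap in the functions being compared, e.g.
    #   lambda: np.linalg.qr(A)   and   lambda: QR_to_UdV(A, U, d, V)
    print(best_time_ms(lambda: sum(range(100000))))
```

Running the script once with `OMP_NUM_THREADS=1 MKL_NUM_THREADS=1` set in the shell and once without would show whether either routine actually speeds up with more threads. The variables need to be set before the libraries are loaded.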

One final piece of information: I am running this code in Ubuntu on an XPS 15 9570 with a sixth generation i7 processor.

Can anyone show me how to fix my setup.py script so that I get performance comparable to numpy?
