# Description

For our BLAS variant of the similarity join, we use the matrix multiplication provided by BLAS and compute the Euclidean distance via the scalar product (see paper).

![matrix_multiplication](description.jpg "BLAS-similarity join")

The best block size depends on the hardware and must be determined experimentally for each machine. Our self-join variant iterates only over the lower triangle of the similarity matrix.

# Requirements

- GNU compiler version >= 5.1
- CMake version >= 3.7.0
- git version >= 1.8.3.1
- Linux package: *build-essential*, including *GNU make* version >= 4.1

### Random number generators

- We use both the random number generator and the matrix multiplication provided by Intel® MKL. Therefore, a working [Intel® MKL](https://software.intel.com/en-us/mkl) environment must be installed. Ensure that the environment variable `$MKLROOT` [is set correctly](https://software.intel.com/en-us/mkl-linux-developer-guide-scripts-to-set-environment-variables).

# Before compilation

To make sure that CMake uses the GNU compiler, set:

```{bash, engine='sh'}
export CXX=g++
export CC=gcc
```

Look up the [compiler flag](https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html) for your hardware and change the `-march` flag in your `CMakeLists.txt` accordingly. Example configuration for Skylake processors:

```{bash, engine='sh'}
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -march=skylake -ffast-math -fassociative-math -O3 -fopenmp -lmkl_core -lmkl_intel_lp64 -lmkl_intel_thread -liomp5")
```

# Build with CMake

To build this project, run the following commands in your shell:

```{bash, engine='sh'}
git clone https://gitlab.cs.univie.ac.at/martinp16cs/BLAS-join.git
cd BLAS-join/cmake
mkdir build
cd build
cmake ..
make -j
```

# Example calls

### Self-join

For a self-join on randomly generated uniform data in [0.0, 1.0):

`./blasSelfJoinCardinality -n 200000 -s 5120 -e 0.2 -d 64 -t 64`

- `-n` number of objects in set A
- `-s` block size
- `-e` epsilon
- `-d` number of features (or dimensions)
- `-t` number of threads

For a self-join on a dataset read from a file:

`./blasSelfJoinCardinality -n 200000 -s 5120 -e 0.2 -d 64 -t 64 -f uniform_200000x64.csv`

- `-f` filename

The file has _n_ lines and no header. Each line holds one object as _d_ values separated by commas (','). You can also use a binary format (".bin").

### Join

For a join between two sets `A` and `B` on randomly generated uniform data in [0.0, 1.0):

`./blasJoinCardinality -n 200000 -m 200000 -s 5120 -e 0.2 -d 20 -t 64`

where

- `-n` number of objects in set A
- `-m` number of objects in set B

Input files can be specified with

- `-f` file for set A
- `-g` file for set B

# Issues

Feel free to report [issues](https://gitlab.cs.univie.ac.at/martinp16cs/BLAS-join/issues) about the code.
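
# Sketch: distance via scalar product

The Description section says that the Euclidean distance is obtained from the scalar product and that the self-join only visits the lower triangle of the similarity matrix. The following is a minimal plain-Python sketch of that idea, not the project's C++/MKL code. It relies on the identity ||p - q||² = ||p||² + ||q||² - 2·⟨p, q⟩, where the scalar products are what a BLAS matrix multiplication would deliver block by block. The function names, the use of the strict lower triangle (excluding the diagonal), and the `<=` comparison against epsilon are assumptions made for illustration.

```python
# Illustrative sketch only: names and comparison rule are assumptions,
# not taken from the BLAS-join sources.

def squared_norms(points):
    """Return the squared Euclidean norm of every point."""
    return [sum(x * x for x in p) for p in points]


def dot(p, q):
    """Scalar product of two points with the same dimensionality."""
    return sum(x * y for x, y in zip(p, q))


def self_join_cardinality(points, eps):
    """Count pairs (i, j) with j < i whose Euclidean distance is <= eps.

    The squared distance is assembled from precomputed norms and one
    scalar product per pair, so no explicit difference vector is built.
    """
    norms = squared_norms(points)
    eps2 = eps * eps
    count = 0
    for i in range(len(points)):
        for j in range(i):  # strict lower triangle: each pair once
            d2 = norms[i] + norms[j] - 2.0 * dot(points[i], points[j])
            if d2 <= eps2:
                count += 1
    return count
```

Calling `self_join_cardinality(points, 0.2)` on a list of equal-length tuples mirrors the role of `-e 0.2` in the example calls. A blocked implementation would replace the inner `dot` calls with one matrix multiplication per block of the lower triangle.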