
# Description

For our BLAS variant of the similarity join, we use the matrix multiplication provided by BLAS and compute the Euclidean distance via scalar products (see paper).

![matrix multiplication](https://gitlab.cs.univie.ac.at/Google-TPU/BLAS-join/blob/master/description.jpg)
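
The scalar-product formulation rests on the identity `||a - b||^2 = ||a||^2 + ||b||^2 - 2 * <a, b>`, so all pairwise scalar products can come from a single matrix multiplication. The following is a minimal pure-Python sketch of that idea, not the code in this repository: the names `gram` and `squared_distances` are hypothetical, and the nested loops stand in for the BLAS call.

```python
def gram(A, B):
    """Scalar products of every row of A with every row of B.

    Stand-in for the BLAS matrix multiplication A * B^T (assumption:
    points are stored as rows).
    """
    return [[sum(x * y for x, y in zip(a, b)) for b in B] for a in A]


def squared_distances(A, B):
    """Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 * <a, b>."""
    norms_a = [sum(x * x for x in a) for a in A]
    norms_b = [sum(y * y for y in b) for b in B]
    G = gram(A, B)
    return [
        [norms_a[i] + norms_b[j] - 2.0 * G[i][j] for j in range(len(B))]
        for i in range(len(A))
    ]


A = [[0.0, 0.0], [1.0, 2.0]]
B = [[3.0, 4.0], [1.0, 2.0]]
D = squared_distances(A, B)  # D[i][j] is the squared distance between A[i] and B[j]
```

Comparing the squared distance against epsilon squared would avoid taking a square root for every pair.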

The block size needs to be determined experimentally for each hardware platform. Our self-join variant iterates only over the lower triangle of the similarity matrix.
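
As an illustration of the lower-triangle iteration, here is a small pure-Python sketch that walks the similarity matrix block by block and visits only pairs with `i > j`. It is a sketch under assumptions, not the repository's implementation: the function name `self_join_count`, the `block_size` parameter, the `<=` comparison and the choice to count unordered pairs without self-pairs are all hypothetical.

```python
def self_join_count(points, eps, block_size=2):
    """Count unordered pairs (i, j) with i > j and Euclidean distance <= eps."""
    n = len(points)
    eps_sq = eps * eps
    count = 0
    for bi in range(0, n, block_size):           # row blocks
        for bj in range(0, bi + 1, block_size):  # column blocks on or below the diagonal
            for i in range(bi, min(bi + block_size, n)):
                for j in range(bj, min(bj + block_size, n)):
                    if j >= i:                   # skip the diagonal and the upper triangle
                        continue
                    dist_sq = sum((x - y) ** 2 for x, y in zip(points[i], points[j]))
                    if dist_sq <= eps_sq:
                        count += 1
    return count


points = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.0, 1.1], [5.0, 5.0]]
pairs = self_join_count(points, eps=0.2)
```

The result does not depend on `block_size`; only the traversal order changes, which is why the block size can be tuned per machine.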

# Requirements

- GNU compiler version >= 5.1
- cmake version >= 3.7.0
- git version >= 1.8.3.1
- Linux package: *build-essential*, including *GNU make* version >= 4.1

### Random number generators
- We use the random number generator as well as the matrix multiplication provided by Intel© MKL. Therefore, a working [Intel© MKL](https://software.intel.com/en-us/mkl) environment should be installed. Ensure that the environment variable `$MKLROOT` [is set correctly](https://software.intel.com/en-us/mkl-linux-developer-guide-scripts-to-set-environment-variables).

# Before compilation

To ensure that CMake uses the GNU compiler, set:

```{bash, engine='sh'}
export CXX=g++
export CC=gcc
```

Look up the [compiler flag](https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html) for your hardware and change the `-march` flag in your `CMakeLists.txt` accordingly.

Example configuration for Skylake processors:
```{bash, engine='sh'}
set(CMAKE_CXX_FLAGS  "${CMAKE_CXX_FLAGS} -std=c++11 -march=skylake -ffast-math -fassociative-math -O3 -fopenmp -lmkl_core -lmkl_intel_lp64 -lmkl_intel_thread -liomp5")
```

# Build with CMake

To build this project, run the following commands in your shell:

```{bash, engine='sh'}
git clone https://gitlab.cs.univie.ac.at/martinp16cs/BLAS-join.git
cd cmake
mkdir build
cd build
cmake ..
make -j
```

# Example calls

### Self-join

For a self-join on randomly generated uniform data in [0.0, 1.0):
`./blasSelfJoinCardinality -n 200000 -e 0.2 -d 64 -t 64`

- `-n` number of objects in set A
- `-e` epsilon (the distance threshold of the join)
- `-d` number of features (or dimensions)
- `-t` number of threads

For a self-join on a dataset read from a file:
`./blasSelfJoinCardinality -n 200000 -e 0.2 -d 64 -t 64 -f uniform_200000x64.csv`

- `-f` filename
    Values are separated by commas (','), with _d_ values per line. The file has _n_ lines and no header.
    You can also use a binary format (".bin").
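
To make the CSV layout concrete, the following sketch writes a small file with _n_ lines of _d_ comma-separated values and no header, then reads it back. The helper names (`write_uniform_csv`, `read_csv`) and the file name `uniform_5x3.csv` are hypothetical and only mirror the naming used above; the sketch does not cover the binary format.

```python
import os
import random
import tempfile


def write_uniform_csv(path, n, d, seed=0):
    """Write n lines with d comma-separated uniform values in [0.0, 1.0), no header."""
    rng = random.Random(seed)
    with open(path, "w") as f:
        for _ in range(n):
            f.write(",".join(repr(rng.random()) for _ in range(d)) + "\n")


def read_csv(path):
    """Read the file back as a list of n rows with d floats each."""
    with open(path) as f:
        return [[float(v) for v in line.split(",")] for line in f if line.strip()]


path = os.path.join(tempfile.gettempdir(), "uniform_5x3.csv")
write_uniform_csv(path, n=5, d=3)
data = read_csv(path)
```

A file produced this way would then be passed with `-f`, using matching `-n` and `-d` values.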

### Join

Join between two sets `A` and `B` on randomly generated uniform data in [0.0, 1.0):
`./blasJoinCardinality -n 200000 -m 200000 -e 0.2 -d 20 -t 64`

where
- `-n` is the number of objects in set A
- `-m` is the number of objects in set B

and input files can be specified with

- `-f` file for set A
- `-g` file for set B
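
For very small inputs, a brute-force count can serve as a sanity check on what a join between `A` and `B` computes. This is a sketch under assumptions: the function name `join_count` is hypothetical, and whether the tool uses `<=` or `<` for the epsilon comparison is not stated here.

```python
def join_count(A, B, eps):
    """Count pairs (a, b), a in A and b in B, with Euclidean distance <= eps."""
    eps_sq = eps * eps
    return sum(
        1
        for a in A
        for b in B
        if sum((x - y) ** 2 for x, y in zip(a, b)) <= eps_sq
    )


A = [[0.0, 0.0], [1.0, 1.0]]              # set A, n = 2 objects
B = [[0.0, 0.1], [1.0, 1.0], [3.0, 3.0]]  # set B, m = 3 objects
matches = join_count(A, B, eps=0.2)
```

This check costs n * m distance computations, so it is only practical for tiny test sets.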

# Datasets used in our publication

Note: use `.csv` files without a header!

# Issues

Feel free to report [issues](https://gitlab.cs.univie.ac.at/martinp16cs/BLAS-join/issues) about the code.