# Description

For our BLAS variant of the similarity join, we use the matrix multiplication provided by BLAS and compute the Euclidean distance via scalar products (see paper).
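
The scalar-product formulation presumably rests on the standard identity below; the paper has the exact derivation. The symbols $`a_i`$ and $`b_j`$ (row vectors of the sets `A` and `B`) are notation introduced here for illustration only.

```math
\lVert a_i - b_j \rVert^2 = \lVert a_i \rVert^2 + \lVert b_j \rVert^2 - 2\, a_i^\top b_j
```

All scalar products $`a_i^\top b_j`$ are the entries of the matrix product $`A B^\top`$, which is the part a BLAS matrix multiplication can compute. Assuming the usual join condition, a pair qualifies when the right-hand side is at most $`\varepsilon^2`$.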

![matrix_multiplication](description.jpg "BLAS-similarity join")

The blocksize needs to be evaluated experimentally for each hardware platform. Our self-join variant iterates only over the lower triangle of the similarity matrix.
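
One way to evaluate the blocksize is to time the same self-join for a few candidate values. The following is a minimal sketch, not part of the project: the `BIN` variable, the function name `sweep_blocksizes`, and the candidate list are assumptions, and the remaining parameters are copied from the self-join example call in this README.

```bash
# Sketch: time the self-join for several candidate blocksizes.
# BIN and the candidate values are assumptions; adjust them to your build and hardware.
BIN="${BIN:-./blasSelfJoinCardinality}"

sweep_blocksizes() {
    for s in "$@"; do
        start=$(date +%s)
        # Same parameters as the self-join example call; only -s varies.
        "$BIN" -n 200000 -s "$s" -e 0.2 -d 64 -t 64 > /dev/null
        end=$(date +%s)
        echo "blocksize=$s seconds=$((end - start))"
    done
}

if [ -x "$BIN" ]; then
    sweep_blocksizes 1280 2560 5120 10240
else
    echo "binary $BIN not found; build the project first" >&2
fi
```

The timing has a resolution of one second, which is coarse but usually enough to rank candidates on large inputs.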

# Requirements

- GNU compiler version >= 5.1
- cmake version >= 3.7.0
- git version >= 1.8.3.1
- Linux package: *build-essential*, including *GNU make* version >= 4.1

### Random number generators
- We use the random number generator and the matrix multiplication provided by Intel© MKL, so a working [Intel© MKL](https://software.intel.com/en-us/mkl) environment must be installed. Ensure that the environment variable `$MKLROOT` [is set correctly](https://software.intel.com/en-us/mkl-linux-developer-guide-scripts-to-set-environment-variables).
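
A quick sanity check before running CMake could look like the sketch below. The function name `check_mklroot` is hypothetical; the check only tests that `$MKLROOT` is non-empty and points to an existing directory, not that the MKL installation itself is complete.

```bash
# Sketch: verify that MKLROOT is set and points to an existing directory.
# check_mklroot is a hypothetical helper, not part of this project.
check_mklroot() {
    if [ -z "${MKLROOT:-}" ]; then
        echo "MKLROOT is not set" >&2
        return 1
    fi
    if [ ! -d "$MKLROOT" ]; then
        echo "MKLROOT points to a missing directory: $MKLROOT" >&2
        return 1
    fi
    echo "MKLROOT=$MKLROOT"
}

check_mklroot || echo "set up the MKL environment before running cmake" >&2
```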

# Before compilation

To ensure explicitly that CMake uses the Intel compiler, set:

```{bash, engine='sh'}
export CXX=icpc
export CC=icc
```

Look up the [compiler flags](https://software.intel.com/en-us/articles/performance-tools-for-software-developers-intel-compiler-options-for-sse-generation-and-processor-specific-optimizations) for your hardware.

Example configuration for Skylake processors:
```{bash, engine='sh'}
-xmic-avx512 -fpic -qopenmp -axCOMMON-AVX512 -lmemkind -lmkl_core -lmkl_intel_lp64 -lmkl_intel_thread -liomp5 -lpthread -g -debug all -save-temps -Wl, -O0 -fstack-security-check -lboost_system
```

# Build with CMake

To build this project, run the following commands in your shell:

```{bash, engine='sh'}
git clone https://gitlab.cs.univie.ac.at/martinp16cs/BLAS-join.git
cd BLAS-join/cmake
mkdir build
cd build
cmake ..
make -j
```

# Example calls

### Self-join

For a self-join on randomly generated uniform data in [0.0, 1.0):
`./blasSelfJoinCardinality -n 200000 -s 5120 -e 0.2 -d 64 -t 64`

- `-n` number of objects in set A
- `-s` blocksize
- `-e` epsilon
- `-d` number of features (or dimensions)
- `-t` number of threads

For a self-join on a dataset read from a file:
`./blasSelfJoinCardinality -n 200000 -s 5120 -e 0.2 -d 64 -t 64 -f uniform_200000x64.csv`

- `-f` filename
    Values are separated by commas (`,`), with _d_ values per line. The file has _n_ lines and no header.
    You can also use a binary format (`.bin`).
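
To try the file input without a real dataset, a small CSV in the layout described above can be generated with `awk`. This is a sketch under assumptions: the values of `n` and `d`, the seed, and the output file name are illustrative and not taken from the project.

```bash
# Sketch: write a small CSV with n lines of d comma-separated values, no header.
# n, d, the seed, and the output file name are illustrative assumptions.
n=5
d=3
out="sample_${n}x${d}.csv"

awk -v n="$n" -v d="$d" 'BEGIN {
    srand(42)                        # fixed seed: repeatable with the same awk
    for (i = 0; i < n; i++) {
        line = ""
        for (j = 0; j < d; j++) {
            # rand() is uniform in [0, 1); values are joined with commas
            line = line (j ? "," : "") sprintf("%.6f", rand())
        }
        print line                   # one object per line
    }
}' > "$out"

echo "wrote $out"
```

The resulting file would then be passed with `-f`, with `-n` and `-d` set to the same values used to generate it.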

### Join

For a join between two sets `A` and `B` on randomly generated uniform data in [0.0, 1.0):
`./blasJoinCardinality -n 200000 -m 200000 -s 5120 -e 0.2 -d 20 -t 64`

where

- `-n` is the number of objects in set A
- `-m` is the number of objects in set B

and files can be specified with

- `-f` file for set A
- `-g` file for set B
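
Combining the options above, a join on two files could be started as in the following sketch. The file names `setA.csv` and `setB.csv` and the helper `count_rows` are hypothetical; `-n` and `-m` are derived from the line counts on the assumption that each file holds one object per line and ends with a newline, and `-d 20` assumes 20 values per line. The remaining values are copied from the example call above.

```bash
# Sketch: derive -n and -m from the input files, then run the join.
# File names and count_rows are hypothetical; -d must match the columns per line.
A_FILE="${A_FILE:-setA.csv}"
B_FILE="${B_FILE:-setB.csv}"

# Number of lines in a file (assumes a trailing newline on the last line).
count_rows() {
    wc -l < "$1" | tr -d ' '
}

if [ -f "$A_FILE" ] && [ -f "$B_FILE" ] && [ -x ./blasJoinCardinality ]; then
    n=$(count_rows "$A_FILE")
    m=$(count_rows "$B_FILE")
    ./blasJoinCardinality -n "$n" -m "$m" -s 5120 -e 0.2 -d 20 -t 64 \
        -f "$A_FILE" -g "$B_FILE"
else
    echo "input files or ./blasJoinCardinality not found" >&2
fi
```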


# Issues

Feel free to report [issues](https://gitlab.cs.univie.ac.at/martinp16cs/BLAS-join/issues) about the code.