README.md 6.19 KB
Newer Older
Martin Perdacher's avatar
Martin Perdacher committed
1 2
Paper: ["Cache-oblivious High-performance Similarity Join" - Martin Perdacher (University of Vienna); Claudia Plant (University of Vienna); Christian Böhm (Ludwig-Maximilians-Universität)](http://dx.doi.org/10.1145/3299869.3319859)
at [SIGMOD-2019](http://sigmod2019.org/sigmodcfp) in Amsterdam from 30th of June to 5th of July 2019.
Martin Perdacher's avatar
Martin Perdacher committed
3

Martin Perdacher's avatar
Martin Perdacher committed
4
# Reproducibility
Martin Perdacher's avatar
Martin Perdacher committed
5

Martin Perdacher's avatar
Martin Perdacher committed
6
We want to make our experiments transparent and comprehensible. Code and experimental data (packed with [ReproZip](https://www.reprozip.org/)) for almost all of our figures is available at our [cloud-repository](https://ucloud.univie.ac.at/index.php/s/09fYaJlKclk0Iq8).
Martin Perdacher's avatar
Martin Perdacher committed
7

Martin Perdacher's avatar
Martin Perdacher committed
8 9 10 11 12 13 14 15 16 17 18 19 20 21
#### Download reprozip image:

To download the ReproZip package in the background with the following command:
```{bash, engine='sh'}
wget -bqc -O hilbertJoin.rpz "https://ucloud.univie.ac.at/index.php/s/CiZpFJ7ahrf7D0C/download?path=%2F&files=hilbertJoin.rpz"
```

#### Download reprozip image:
Alternatively you could download the .zip file using the command:
```{bash, engine='sh'}
wget -bqc -O hilbertJoin.zip "https://ucloud.univie.ac.at/index.php/s/CiZpFJ7ahrf7D0C/download?path=%2F&files=hilbertJoin.zip"

```

Martin Perdacher's avatar
Martin Perdacher committed
22 23 24 25
# Requirements

- To run our Hilbert-Join your hardware needs to support AVX-512 instructions.
- GNU compiler version >= 5.1
Martin Perdacher's avatar
Martin Perdacher committed
26
- cmake version >= 3.7.0
Martin Perdacher's avatar
Martin Perdacher committed
27
- git version >= 1.8.3.1
Martin Perdacher's avatar
Martin Perdacher committed
28
- Linux package: *build-essential*, including *GNU make* version >= 4.1 
Martin Perdacher's avatar
Martin Perdacher committed
29

Martin Perdacher's avatar
Martin Perdacher committed
30
### Random number generators
Martin Perdacher's avatar
Martin Perdacher committed
31
- We use the random number generator provided by Intel© MKL. Therefore, a working [Intel© MKL](https://software.intel.com/en-us/mkl) environment should be installed. Ensure, that the environment variable `$MKLROOT` [is set correctly](https://software.intel.com/en-us/mkl-linux-developer-guide-scripts-to-set-environment-variables).
Martin Perdacher's avatar
Martin Perdacher committed
32 33


Martin Perdacher's avatar
Martin Perdacher committed
34 35
# Before compilation

Martin Perdacher's avatar
Martin Perdacher committed
36 37
To explicitly ensure, that CMake will use the GNU compiler use:

Martin Perdacher's avatar
Martin Perdacher committed
38
```{bash, engine='sh'}
Martin Perdacher's avatar
Martin Perdacher committed
39 40 41 42
export CXX=g++
export CC=gcc
```

Martin Perdacher's avatar
Martin Perdacher committed
43 44 45 46 47 48 49
Lookup the [compiler-flag](https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html) for your hardware. Change the `-march` flag in your `CMakeLists.txt` depending on the hardware.

Example configuration for Skylake processors:
```{bash, engine='sh'}
set(CMAKE_CXX_FLAGS  "${CMAKE_CXX_FLAGS} -std=c++11 -march=skylake -ffast-math -fassociative-math -O3 -fopenmp -lmkl_core -lmkl_intel_lp64 -lmkl_intel_thread -liomp5")
```

Martin Perdacher's avatar
Martin Perdacher committed
50
If your hardware does not support AVX-512, you could use [Intel© Software Development Emulator (SDE)](https://software.intel.com/en-us/articles/intel-software-development-emulator) to emulate AVX-512 registers.
Martin Perdacher's avatar
Martin Perdacher committed
51

Martin Perdacher's avatar
Martin Perdacher committed
52
# Build with CMake
Martin Perdacher's avatar
Martin Perdacher committed
53

Martin Perdacher's avatar
Martin Perdacher committed
54
to build this project you need to type the following commands into your shell:
Martin Perdacher's avatar
Martin Perdacher committed
55

Martin Perdacher's avatar
Martin Perdacher committed
56
```{bash, engine='sh'}
Martin Perdacher's avatar
Martin Perdacher committed
57 58 59 60
git clone https://gitlab.cs.univie.ac.at/martinp16cs/hilbertJoin.git
cd hilbertJoin
mkdir build
cd build
Martin Perdacher's avatar
Martin Perdacher committed
61
export KBLOCK=4
Martin Perdacher's avatar
Martin Perdacher committed
62
export NUM_THREADS=_MAX_NUMBER_OF_CORES_
Martin Perdacher's avatar
Martin Perdacher committed
63 64
cmake ..
make -j
Martin Perdacher's avatar
Martin Perdacher committed
65 66
```

Martin Perdacher's avatar
Martin Perdacher committed
67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82
# Example calls

### Self-join

For a selfjoin with random generated uniform data [0.0, 1.0):
`./hilbertSelfJoinCardinality -n 200000 -e 0.2 -d 64 -t 64`

- `-n` are the number of objects in set A
- `-e` epsilon
- `-d` number of features (or dimensions)
- `-t` number of threads
 
For a selfjoin with a dataset from a file:
`./hilbertSelfJoinCardinality -n 200000 -e 0.2 -d 64 -t 64 -f uniform_200000x64.csv`

- `-f` filename 
Martin Perdacher's avatar
Martin Perdacher committed
83
    Each value is separated by a comma ',' and has _d_ objects in each line. The file has _n_ lines without a header.
Martin Perdacher's avatar
Martin Perdacher committed
84 85 86 87 88 89 90 91 92 93
    You could also use a binary format ".bin". 

### Join

Join between two sets `A` and `B` with random generated uniform data [0.0, 1.0):
`./hilbertJoinCardinality -n 200000 -m 200000 -e 0.2 -d 20 -t 64`

where 
- `-n` are the number of objects in set A
- `-m` are the number of objects in set B
Martin Perdacher's avatar
Martin Perdacher committed
94 95 96 97 98
 
and files could be specified with

- `-f` file for set A
- `-g` file for set B
Martin Perdacher's avatar
Martin Perdacher committed
99

Martin Perdacher's avatar
Martin Perdacher committed
100 101
# Datasets used in our publication

Martin Perdacher's avatar
Martin Perdacher committed
102 103
Note: use `.csv` files without header!

Martin Perdacher's avatar
Martin Perdacher committed
104 105 106 107 108 109
#### Synthetic data (as comma seperated files)

- [Uniform_200K](https://ucloud.univie.ac.at/index.php/s/LaPLUmXQKsldvcO)
- [Uniform_select](https://ucloud.univie.ac.at/index.php/s/pUPFeZDXGtGNEpa)

#### Real data
Martin Perdacher's avatar
Martin Perdacher committed
110
- [BigCross](https://ucloud.univie.ac.at/index.php/s/ITlAQkZfIGFTvTD) see [[2]](https://doi.org/10.1145/2133803.2184450)
Martin Perdacher's avatar
Martin Perdacher committed
111 112 113
- [Activity recognition](http://archive.ics.uci.edu/ml/datasets/heterogeneity+activity+recognition) see [[1]](https://doi.org/10.1145/2809695.2809718) from UCI Data Repository
- [Higgs](https://archive.ics.uci.edu/ml/datasets/HIGGS) see [[3]](https://www.nature.com/articles/ncomms5308) from UCI Data Repository 
- [IoT botnet](https://archive.ics.uci.edu/ml/datasets/detection_of_IoT_botnet_attacks_N_BaIoT) see [[4]](http://wp.internetsociety.org/ndss/wp-content/uploads/sites/25/2018/02/ndss2018_03A-3_Mirsky_paper.pdf) from UCI Data Repository 
Martin Perdacher's avatar
Martin Perdacher committed
114

Martin Perdacher's avatar
Martin Perdacher committed
115 116
# Issues

Martin Perdacher's avatar
Martin Perdacher committed
117
Feel free to report [issues](https://gitlab.cs.univie.ac.at/martinp16cs/hilbertJoin/issues) about the code.
Martin Perdacher's avatar
Martin Perdacher committed
118

Martin Perdacher's avatar
Martin Perdacher committed
119 120
# Freqeuntly Asked Quesionts

Martin Perdacher's avatar
Martin Perdacher committed
121 122
[see FAQ](FAQ.md)

Martin Perdacher's avatar
Martin Perdacher committed
123 124 125
# Comparison partners

- [BLAS-join](https://gitlab.cs.univie.ac.at/Google-TPU/BLAS-join/)
Martin Perdacher's avatar
Martin Perdacher committed
126
- [Super-EGO](https://www.ics.uci.edu/~dvk/code/SuperEGO.html). 
Martin Perdacher's avatar
Martin Perdacher committed
127 128 129
  <br/> List of changes:
  - Changed from float to double precision.
  - Adapted to a self-join problem.
Martin Perdacher's avatar
Martin Perdacher committed
130
  - We take care of the dimensions, within our compilation process (D must be known at compile time).
Martin Perdacher's avatar
Martin Perdacher committed
131
  - Removed storing of results, and added thread independent counting.
Martin Perdacher's avatar
Martin Perdacher committed
132

Martin Perdacher's avatar
Martin Perdacher committed
133 134
# References

Martin Perdacher's avatar
Martin Perdacher committed
135
- [1] Allan Stisen, Henrik Blunck, Sourav Bhattacharya, Thor Siiger Prentow, Mikkel Baun Kjærgaard, Anind K. Dey, Tobias Sonne, Mads Møller Jensen:
Martin Perdacher's avatar
Martin Perdacher committed
136
Smart Devices are Different: Assessing and MitigatingMobile Sensing Heterogeneities for Activity Recognition. SenSys 2015: 127-140
Martin Perdacher's avatar
Martin Perdacher committed
137
- [2] Marcel R. Ackermann, Marcus Märtens, Christoph Raupach, Kamil Swierkot, Christiane Lammersen, Christian Sohler:
Martin Perdacher's avatar
Martin Perdacher committed
138
StreamKM++: A clustering algorithm for data streams. ACM Journal of Experimental Algorithmics 17(1) (2012)
Martin Perdacher's avatar
Martin Perdacher committed
139 140
- [3] P. Baldi, P. Sadowski, and D. Whiteson. 2014. Searching for exotic particles in high-energy physics with deep learning. Nature Commu- nications 5 (02 Jul 2014), 4308 EP –. Article.
- [4] Yisroel Mirsky, Tomer Doitshman, Yuval Elovici, and Asaf Shabtai. 2018. Kitsune: An Ensemble of Autoencoders for Online Network Intrusion Detection. In 25th Annual Network and Distributed System Security Symposium, NDSS 2018, San Diego, California, USA, February 18-21, 2018.
Martin Perdacher's avatar
Martin Perdacher committed
141