Parallelization of Transition Counting for Process Mining on Multi-core CPUs and GPUs

Source code to accompany the paper:

Parallelization of Transition Counting for Process Mining on Multi-core CPUs and GPUs
Diogo R. Ferreira, Rui M. Santos
12th International Workshop on Business Process Intelligence (BPI 2016)
in Business Process Management Workshops, LNBIP 281, pp. 36-48, Springer, 2017
[PDF] [BibTeX]

This page contains the source code and data used to generate the results reported in the paper presented at the BPI 2016 workshop.

The source code comprises the following files:

preproc.py – A Python script to do the preprocessing of the event log. This script receives a CSV file (.csv) as input and produces a binary file (.pre) as output.

common.h – Some common routines and variables used by all programs, namely to read the event log in the binary format (.pre) created by the previous script.

flow1.h – The single-threaded CPU version of the flow algorithm.
flow2.h – The multi-threaded CPU version of the flow algorithm.
flow3.h – The GPU version of the flow algorithm.

flow_cpu.cc – The C/C++ program that runs the multi-threaded CPU version of the flow algorithm.
flow_gpu.cu – The CUDA C program that runs the GPU version of the flow algorithm.

handover1.h – The single-threaded CPU version of the handover algorithm.
handover2.h – The multi-threaded CPU version of the handover algorithm.
handover3.h – The GPU version of the handover algorithm.

handover_cpu.cc – The C/C++ program that runs the multi-threaded CPU version of the handover algorithm.
handover_gpu.cu – The CUDA C program that runs the GPU version of the handover algorithm.

together1.h – The single-threaded CPU version of the together algorithm.
together2.h – The multi-threaded CPU version of the together algorithm.
together3.h – The GPU version of the together algorithm.

together_cpu.cc – The C/C++ program that runs the multi-threaded CPU version of the together algorithm.
together_gpu.cu – The CUDA C program that runs the GPU version of the together algorithm.

runtests.py – A Python script that runs a series of tests and generates results in a similar form to Table 2 in the paper.

The data comprises the following files:

eventlog_10.csv – An event log with 10¹ cases.
eventlog_100.csv – An event log with 10² cases.
eventlog_1000.csv – An event log with 10³ cases.
eventlog_10000.csv – An event log with 10⁴ cases.
eventlog_100000.csv – An event log with 10⁵ cases.
eventlog_1000000.csv – An event log with 10⁶ cases.
eventlog_10000000.csv – An event log with 10⁷ cases.

The files above are zipped. Before proceeding, you will have to unzip them.

How to compile the code

Operating system: Ubuntu (or another Linux distro).

Check that you have all the necessary pre-requisites, namely CUDA. Also, make sure that CUDA is properly configured, as described here. A good place to configure the environment variables (PATH and LD_LIBRARY_PATH) is in ~/.bashrc.

Use this Makefile to compile the code. However, before running make, change the value of the ARCH variable to the compute capability of your GPU. For a list of GPUs and their compute capabilities, check this page.

Preprocessing the event logs

Start by preprocessing the event logs with the following instructions:

python preproc.py 0 1 2 3 eventlog_10.csv
python preproc.py 0 1 2 3 eventlog_100.csv
python preproc.py 0 1 2 3 eventlog_1000.csv
python preproc.py 0 1 2 3 eventlog_10000.csv
python preproc.py 0 1 2 3 eventlog_100000.csv
python preproc.py 0 1 2 3 eventlog_1000000.csv
python preproc.py 0 1 2 3 eventlog_10000000.csv

The arguments 0 1 2 3 specify that the case id is in column 0 (first column), task is in column 1, user is in column 2, and timestamp is in column 3, respectively.

You may be surprised that preprocessing the largest event log can take about 5min. Most of this time is spent on reading the data (i.e. parsing the CSV file). This is precisely the reason why we decided to separate the preprocessing from the main programs. This way we can preprocess the event log only once, and then run several different programs on it, without having to reprocess it again.

Running the programs

To run the multi-threaded CPU version of each algorithm, you can use instructions similar to the following:

./flow_cpu 100 4 eventlog_1000000.pre
./handover_cpu 100 4 eventlog_1000000.pre
./together_cpu 100 4 eventlog_1000000.pre

where the first argument is the number of runs, and the second argument is the number of threads. As a rule of thumb, set the number of threads to be equal to the number of physical cores available in the CPU.

To run the GPU version of each algorithm, you can use instructions similar to the following:

./flow_gpu 100 128 eventlog_1000000.pre
./handover_gpu 100 128 eventlog_1000000.pre
./together_gpu 100 128 eventlog_1000000.pre

where the first argument is the number of runs, and the second argument is the number of threads per block to be used on the GPU. Please check the hardware architecture of your GPU to determine the number of threads per block that should be used. As a rule of thumb, use a number of threads per block that is equal to the number of cores per SM (streaming multiprocessor). For example, all Maxwell generation cards (e.g. GTX 750 Ti, GTX 980, GTX Titan X) have 128 cores per SM.

Running a series of tests

For convenience, we provide a Python script that runs all algorithms on all event logs. Note that the event logs must have been preprocessed first by preproc.py (see above).

To run the tests, execute:

python runtests.py

This will present the results in a similar form to Table 2 in the paper.

Testing on the BPI Challenge 2016 event logs

To run the experiment that we describe in the paper, do the following:

Grab the largest event log for BPI Challenge 2016 (BPI2016_Clicks_NOT_Logged_In.csv) from here.

We use the columns SessionID as case id, and PAGE_NAME both as task and user. Therefore, preprocess the event log with the following command:
python preproc.py 0 5 5 2 BPI2016_Clicks_NOT_Logged_In.csv

Run the flow algorithm on the GPU (adjust the number of threads per block according to the number of cores per SM):
./flow_gpu 1 128 BPI2016_Clicks_NOT_Logged_In.pre

Run the together algorithm on the CPU (adjust the number of threads according to the number of physical cores):
./together_cpu 1 4 BPI2016_Clicks_NOT_Logged_In.pre

How to cite this work

Diogo R. Ferreira, Rui M. Santos, Parallelization of Transition Counting for Process Mining on Multi-core CPUs and GPUs, in Business Process Management Workshops, LNBIP 281, pp. 36-48, Springer, 2017

@InProceedings{ferreira17parallelization,
  author    = {Diogo R. Ferreira and Rui M. Santos},
  title     = {Parallelization of Transition Counting for Process Mining on Multi-core {CPUs} and {GPUs}},
  booktitle = {Business Process Management Workshops},
  year      = {2017},
  volume    = {281},
  series    = {Lecture Notes in Business Information Processing},
  pages     = {36--48},
  publisher = {Springer}
}

Last modified: 2018-02-12