

#+title: Matrix multiplication with SYCL, yay
This project is a small demonstration of SYCL syntax, built around a
straightforward example program.
Its primary objective is to benchmark matrix multiplication on a single CPU
core against SYCL-based OpenMP and GPU parallelization, and to record and
analyze the resulting execution times.
At this stage, the project showcases how to transfer and manipulate data on the
GPU using +the Unified Shared Memory (USM) model with explicit data movement+ an
abstract view of host and device memory based on buffers and accessors. I do not
intend to implement these functions using Unified Shared Memory.
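The buffer-and-accessor style mentioned above can be sketched roughly as
follows. This is a minimal illustration of SYCL 2020 syntax, not code taken
from this project (the =matmul= helper is hypothetical):
#+BEGIN_SRC cpp
#include <sycl/sycl.hpp>
#include <cstddef>
#include <vector>

// Multiply two n x n matrices with buffers and accessors.
// The SYCL runtime moves the data to the device before the kernel
// runs and copies the result back when the buffers are destroyed.
std::vector<float> matmul(const std::vector<float>& a,
                          const std::vector<float>& b, std::size_t n) {
  std::vector<float> c(n * n, 0.0f);
  sycl::queue q;  // default device selection
  {
    sycl::buffer<float, 2> bufA(a.data(), sycl::range<2>(n, n));
    sycl::buffer<float, 2> bufB(b.data(), sycl::range<2>(n, n));
    sycl::buffer<float, 2> bufC(c.data(), sycl::range<2>(n, n));
    q.submit([&](sycl::handler& h) {
      sycl::accessor A(bufA, h, sycl::read_only);
      sycl::accessor B(bufB, h, sycl::read_only);
      sycl::accessor C(bufC, h, sycl::write_only, sycl::no_init);
      // One work-item per output element C[i][j].
      h.parallel_for(sycl::range<2>(n, n), [=](sycl::id<2> idx) {
        float sum = 0.0f;
        for (std::size_t k = 0; k < n; ++k)
          sum += A[idx[0]][k] * B[k][idx[1]];
        C[idx] = sum;
      });
    });
  }  // bufC goes out of scope here and writes the result back into c
  return c;
}
#+END_SRC
The explicit scope around the buffers is what triggers the copy-back to the
host vector; with USM one would instead allocate device memory and move the
data explicitly.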
For detailed information about the implementation, how specific functions are
used, and the reasoning behind certain design choices, refer to the source code
itself: it contains comments that explain the code's functionality and
rationale.
* Prerequisites
To use the project, you'll need the following prerequisites:
** Mandatory Prerequisites
- A functional SYCL compiler. You can choose from options like Intel's oneAPI or
AdaptiveCpp.
- The =xxhash= library.
** Optional Prerequisite
- CMake (for generating build files)
* Compilation
** Using Intel oneAPI
I have now made the code run with Intel's oneAPI and adapted the CMake
generation process accordingly.
#+BEGIN_SRC bash
# Make sure to source Intel's environment variables together with the bundled LLVM!
. /opt/intel/oneapi/setvars.sh --include-intel-llvm
# Create a build directory and navigate to it
mkdir build && cd build
# Configure the build for Intel's oneAPI SYCL implementation
CXX=$(which clang++) cmake .. -DUSE_INTELSYCL=ON \
-DCMAKE_BUILD_TYPE="Release"
# Compile the executable
make
#+END_SRC
** Using AdaptiveCpp
Regrettably, integrating Intel's oneAPI with the AMD GPU plugin proves to be
quite challenging on Arch Linux, primarily due to the plugin's dependency on an
older version of ROCm than what's available in the official repositories. While
I could have chosen to compile my own ROCm/hip version, I opted for a more
convenient solution and turned to the [[https://github.com/AdaptiveCpp/AdaptiveCpp/tree/develop][AdaptiveCpp]] compiler, which offers both
CPU and GPU acceleration through CUDA and ROCm support. You can find a version
of AdaptiveCpp compatible with AMD GPUs on the AUR (Arch User Repository).
If your goal is to run benchmarks on an AMD GPU alongside AdaptiveCpp, I
recommend using [[https://github.com/sobc/pkgbuilds/tree/master/hipsycl-rocm-git][this]] specific PKGBUILD. Other versions that rely on ROCm might
not build correctly at the moment. I have already raised an issue with the
maintainer of the PKGBUILDs to address this compatibility problem.
To generate Makefiles for AdaptiveCpp, you can follow these steps:
#+BEGIN_SRC bash
# Create a build directory and navigate to it
mkdir build && cd build
# Adjust the path to AdaptiveCpp and your target devices according to your system
cmake .. -DUSE_ACPP=ON \
-DAdaptiveCpp_DIR=/opt/AdaptiveCpp/ROCm/lib/cmake/AdaptiveCpp \
-DACPP_TARGETS="omp.accelerated;hip.integrated-multipass;gfx90c" \
-DCMAKE_BUILD_TYPE="Release"
#+END_SRC
You can find more information about =ACPP_TARGETS= and the compilation process in
the documentation [[https://github.com/AdaptiveCpp/AdaptiveCpp/blob/develop/doc/compilation.md][here]].
Once your Makefiles are generated, you can build the project using the following
command:
#+BEGIN_SRC bash
make -j$(nproc)
#+END_SRC
The compiled executable can be found in the =build/src= directory.
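Because the benchmark can target several backends, it can be useful to verify
which device a default-constructed queue actually selects at run time. The
following standalone query program is a sketch for that purpose and is not
part of this project:
#+BEGIN_SRC cpp
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  // The default selector picks the "best" device the runtime can see,
  // e.g. a GPU if the ROCm/CUDA backend is available, otherwise the CPU.
  sycl::queue q;
  std::cout << "Running on: "
            << q.get_device().get_info<sycl::info::device::name>() << '\n';
  return 0;
}
#+END_SRC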
* Data Information
I have provided a set of six matrices, two for each of three sizes:
- =sma*.txt=: These matrices are of size 16x16
- =med*.txt=: These matrices are of size 2048x2048
- =big*.txt=: These matrices are of size 8192x8192
All of these matrices are available in text file format, and you can locate them
within the =data/= directory.
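If you want to read the data files from your own tooling, a loader along the
following lines should work, assuming the files contain plain
whitespace-separated values (check =data/= and the project's I/O code for the
actual layout; =load_matrix= is a hypothetical helper):
#+BEGIN_SRC cpp
#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Read an n x n matrix stored as whitespace-separated values.
std::vector<double> load_matrix(const std::string& path, std::size_t n) {
  std::ifstream in(path);
  std::vector<double> m;
  m.reserve(n * n);
  double v;
  while (in >> v) m.push_back(v);
  if (m.size() != n * n)
    std::cerr << "warning: expected " << n * n << " values, got "
              << m.size() << '\n';
  return m;
}
#+END_SRC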
*Important note*:
A word of caution when working with the large matrices (=big*.txt=): to avoid
excessively long execution times, it is advisable to disable the benchmark for
a single CPU core. You can do this by invoking CMake with the option
=-DSYCL_EX_COMPILE_SEQUENTIAL_BENCH=OFF= and then recompiling the executable.
Additionally, below, you will find the results of multiplying all combinations
of these matrices along with their corresponding checksums. Please feel free to
reach out if you come across any other checksums or encounter further questions.
| Matrix A | Matrix B | Checksum |
|------------+------------+--------------|
| =sma1.txt= | =sma1.txt= | =0xe6134d8e= |
| =sma2.txt= | =sma2.txt= | =0xf1ba0ac6= |
| =sma1.txt= | =sma2.txt= | =0xe71fdf1e= |
| =sma2.txt= | =sma1.txt= | =0x36b44d2c= |
|------------+------------+--------------|
| =med1.txt= | =med1.txt= | =0xd92eb6d6= |
| =med2.txt= | =med2.txt= | =0x9f0e1206= |
| =med1.txt= | =med2.txt= | =0x4cf45b91= |
| =med2.txt= | =med1.txt= | =0xfdeb52bf= |
|------------+------------+--------------|
| =big1.txt= | =big1.txt= | =0xde9b4c0d= |
| =big2.txt= | =big2.txt= | =0x05365fc1= |
| =big1.txt= | =big2.txt= | =0xb185e6c1= |
| =big2.txt= | =big1.txt= | =0x59f5ffef= |