#+title: Matrix multiplication with SYCL, yay

This project demonstrates SYCL syntax by means of a small, straightforward
example program.

Its primary purpose is to benchmark matrix multiplication on a single CPU core
against SYCL-parallelized versions that run through OpenMP and on the GPU. The
execution times are then recorded and analyzed.

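As a rough illustration of what a single-core baseline with timing could look
like, here is a minimal sketch in plain C++. It is not taken from the project
sources: the names =multiply_sequential= and =time_seconds=, the =float= element
type, and the row-major =std::vector= layout are all assumptions made for this
example.

#+BEGIN_SRC cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: multiply two n x n row-major matrices on one CPU core.
// The float element type is an assumption; the project may use another type.
std::vector<float> multiply_sequential(const std::vector<float>& a,
                                       const std::vector<float>& b,
                                       std::size_t n) {
  assert(a.size() == n * n && b.size() == n * n);
  std::vector<float> c(n * n, 0.0f);
  // i-k-j loop order walks b and c row by row, which is cache friendly.
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t k = 0; k < n; ++k) {
      const float a_ik = a[i * n + k];
      for (std::size_t j = 0; j < n; ++j) {
        c[i * n + j] += a_ik * b[k * n + j];
      }
    }
  }
  return c;
}

// Hypothetical helper: run a callable once and return the elapsed
// wall-clock time in seconds.
template <typename F>
double time_seconds(F&& f) {
  const auto start = std::chrono::steady_clock::now();
  f();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(stop - start).count();
}
#+END_SRC

A call such as =time_seconds([&] { c = multiply_sequential(a, b, n); })= returns
the elapsed seconds for one multiplication, which can then be compared with the
timings of the parallel versions.
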
At this stage, the project shows how to transfer and manipulate data on the GPU using
+the Unified Shared Memory (USM) model with explicit data movement+ an abstract view
of host and device memory through buffers and accessors. I do not intend to
implement these functions using Unified Shared Memory.

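The buffer and accessor model can be sketched as follows. This is a minimal
example under stated assumptions rather than the project's actual kernel:
=multiply_buffers=, the =EXAMPLE_HAS_SYCL= macro, the =float= element type, and
the row-major layout are hypothetical, and the =__has_include= guard is only
there so that the snippet also compiles without a SYCL toolchain by falling back
to a plain loop.

#+BEGIN_SRC cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption for this sketch: use SYCL only if its header is available.
#if __has_include(<sycl/sycl.hpp>)
#include <sycl/sycl.hpp>
#define EXAMPLE_HAS_SYCL 1
#else
#define EXAMPLE_HAS_SYCL 0
#endif

// Hypothetical helper: multiply two n x n row-major matrices.
std::vector<float> multiply_buffers(const std::vector<float>& a,
                                    const std::vector<float>& b,
                                    std::size_t n) {
  assert(a.size() == n * n && b.size() == n * n);
  std::vector<float> c(n * n, 0.0f);
#if EXAMPLE_HAS_SYCL
  sycl::queue q;  // default device selection
  {
    // Buffers wrap the host memory; the runtime decides when to move data.
    sycl::buffer<float, 2> buf_a(a.data(), sycl::range<2>(n, n));
    sycl::buffer<float, 2> buf_b(b.data(), sycl::range<2>(n, n));
    sycl::buffer<float, 2> buf_c(c.data(), sycl::range<2>(n, n));
    q.submit([&](sycl::handler& h) {
      // Accessors declare how the kernel uses each buffer.
      sycl::accessor acc_a(buf_a, h, sycl::read_only);
      sycl::accessor acc_b(buf_b, h, sycl::read_only);
      sycl::accessor acc_c(buf_c, h, sycl::write_only, sycl::no_init);
      // One work-item per element of the result matrix.
      h.parallel_for(sycl::range<2>(n, n), [=](sycl::id<2> idx) {
        float sum = 0.0f;
        for (std::size_t k = 0; k < n; ++k) {
          sum += acc_a[idx[0]][k] * acc_b[k][idx[1]];
        }
        acc_c[idx] = sum;
      });
    });
  }  // buffer destructors wait for the kernel and write results back to c
#else
  // Fallback without SYCL: plain sequential loop with the same result.
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      float sum = 0.0f;
      for (std::size_t k = 0; k < n; ++k) {
        sum += a[i * n + k] * b[k * n + j];
      }
      c[i * n + j] = sum;
    }
  }
#endif
  return c;
}
#+END_SRC

In the SYCL branch no explicit copy appears anywhere: the buffers go out of
scope at the end of the inner block, and their destructors wait for the kernel
and write the result back into =c=.
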
For details on the implementation, the use of specific functions, and the
reasoning behind certain design choices, please refer to the source code
itself. Its comments explain both what the code does and why.

* Prerequisites

To use the project, you'll need the following prerequisites:

** Mandatory Prerequisites

- A functional SYCL compiler. You can choose from options like Intel's oneAPI or
  AdaptiveCpp.
- The "xxhash" library.

** Optional Prerequisite

- CMake (for generating build files)

* Compilation

Regrettably, integrating Intel's oneAPI with the AMD GPU plugin is quite
challenging on Arch Linux, primarily because the plugin depends on an older
version of ROCm than the one available in the official repositories. While I
could have compiled my own ROCm/HIP version, I opted for a more convenient
solution and turned to the [[https://github.com/AdaptiveCpp/AdaptiveCpp/tree/develop][AdaptiveCpp]] compiler, which supports CPU
acceleration as well as GPU acceleration through CUDA and ROCm. A version of
AdaptiveCpp compatible with AMD GPUs is available on the AUR (Arch User
Repository).

If you want to run the benchmarks on an AMD GPU with AdaptiveCpp, I recommend
using [[https://github.com/sobc/pkgbuilds/tree/master/hipsycl-rocm-git][this]] specific PKGBUILD. Other versions that rely on ROCm might not build
correctly at the moment. I have already raised an issue with the maintainer of
the PKGBUILDs to address this compatibility issue.

Currently, CMake can only generate Makefiles for AdaptiveCpp. I intend to add
CMake support for Intel's oneAPI as soon as I have a working version of the
compiler.

To generate Makefiles for AdaptiveCpp, follow these steps:

#+BEGIN_SRC bash
# Create a build directory and navigate to it
mkdir build && cd build

# Adjust the path to AdaptiveCpp and your target devices according to your system
cmake .. \
  -DAdaptiveCpp_DIR=/opt/AdaptiveCpp/ROCm/lib/cmake/AdaptiveCpp \
  -DACPP_TARGETS="omp.accelerated;hip.integrated-multipass:gfx90c"
#+END_SRC

You can find more information about =ACPP_TARGETS= and the compilation process in
the [[https://github.com/AdaptiveCpp/AdaptiveCpp/blob/develop/doc/compilation.md][AdaptiveCpp compilation documentation]].

Once your Makefiles are generated, you can build the project using the following
command:

#+BEGIN_SRC bash
make -j$(nproc)
#+END_SRC

The compiled executable can be found in the =build/src= directory.

* Data

I provide six matrices in three different sizes:

- =sma*.txt= are 16x16 matrices
- =med*.txt= are 2048x2048 matrices
- =big*.txt= are 8192x8192 matrices

All matrices are stored as text files under =data=.

*Warning*: Before you run the benchmark with the big matrices, please disable
the single-core CPU benchmark unless you want to sit and wait forever. A naive
multiplication of two 8192x8192 matrices takes roughly 550 billion multiply-add
operations. To disable it, call cmake with =-DSEQ_BENCH=OFF= and recompile the
executable.

The table below lists the checksum of every product of two matrices of the same
size. Let me know if you get different checksums.

| Matrix A   | Matrix B   | Checksum     |
|------------+------------+--------------|
| =sma1.txt= | =sma1.txt= | =0xe6134d8e= |
| =sma2.txt= | =sma2.txt= | =0xf1ba0ac6= |
| =sma1.txt= | =sma2.txt= | =0xe71fdf1e= |
| =sma2.txt= | =sma1.txt= | =0x36b44d2c= |
|------------+------------+--------------|
| =med1.txt= | =med1.txt= | =0xd92eb6d6= |
| =med2.txt= | =med2.txt= | =0x9f0e1206= |
| =med1.txt= | =med2.txt= | =0x4cf45b91= |
| =med2.txt= | =med1.txt= | =0xfdeb52bf= |
|------------+------------+--------------|
| =big1.txt= | =big1.txt= | =0xde9b4c0d= |
| =big2.txt= | =big2.txt= | =0x5365fc1=  |
| =big1.txt= | =big2.txt= | =0xb185e6c1= |
| =big2.txt= | =big1.txt= | =0x59f5ffef= |