Compare commits

...

10 Commits

Author SHA1 Message Date
Marco De Lucia
77c6b84a6f Commit CUDA code in src/cuda 2023-10-13 10:09:35 +02:00
Max Lübke
1091a55b81 adapt tiled version 2023-10-12 14:57:26 +02:00
Max Lübke
493be3576c use correct compiler 2023-10-06 15:38:39 +02:00
Max Lübke
8145d533dd adapt CMake and Readme 2023-10-06 15:31:40 +02:00
Max Lübke
9633302226 refactor code to compile with Intel oneAPI 2023-10-06 15:16:32 +02:00
Max Lübke
3fcd37fb1f fix headers 2023-10-06 11:08:11 +02:00
Max Lübke
ca136ac2e6 update cmake instructions 2023-10-05 11:40:51 +02:00
Max Lübke
47331f9b00 rework Data section 2023-10-05 11:38:22 +02:00
Max Lübke
3598bb7642 add prerequisites to readme 2023-10-05 11:25:12 +02:00
Max Lübke
69350034eb Extend Readme 2023-10-05 11:21:17 +02:00
7 changed files with 514 additions and 73 deletions

View File

@ -4,8 +4,33 @@ set(CMAKE_CXX_STANDARD 17)
project(sycl_example)
find_package(AdaptiveCpp REQUIRED)
option(USE_INTELSYCL "Use Intel oneAPI compiler" OFF)
option(USE_ACPP "Use AdaptiveCpp compiler" OFF)
if(USE_INTELSYCL AND USE_ACPP)
message(FATAL_ERROR "Only one SYCL compiler can be selected.")
endif()
if(USE_INTELSYCL)
find_package(IntelSYCL REQUIRED)
elseif(USE_ACPP)
find_package(AdaptiveCpp REQUIRED)
else()
message(FATAL_ERROR "Set either -DUSE_INTELSYCL or -DUSE_ACPP.")
endif()
find_library(LIB_XXHASH xxhash)
message(STATUS "${IntelSYCL_FOUND}")
if ((NOT(AdaptiveCpp_FOUND)) AND (NOT(IntelSYCL_FOUND)))
message(FATAL_ERROR
"Could not find either AdaptiveCpp or IntelDPCPP.
Install one of them and provide -D<Compiler>_DIR=/path/to/cmake/config.
Configuration not possible.")
endif()
add_subdirectory(src)

View File

@ -8,7 +8,134 @@ multiplication on a single CPU core while using SYCL for both OpenMP and GPU
parallelization. Subsequently, we will record and analyze the execution times.
At this stage, the project showcases how to transfer and manipulate data on the
GPU using the Unified Shared Memory (USM) model with explicit data movement.
Unfortunately, I've encountered a hurdle as my current implementation with =hip=
lacks a valid USM provider for my graphics card, the AMD Radeon RX 6700 XT,
preventing me from achieving implicit data movement for demonstration 😔
GPU using +the Unified Shared Memory (USM) model with explicit data movement+ an
abstract view of host and device memory using buffers and accessors. I will
not attempt to implement those functions using Unified Shared Memory.
For more detailed information about the implementation and how specific
functions are used, as well as explanations for the reasoning behind certain
design choices, I recommend referring to the source code itself. The source code
typically contains comments that provide insights into the code's functionality
and rationale.
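To give a rough idea of what this buffer/accessor style looks like before you
dive into the sources, here is a minimal, self-contained sketch (my own
illustration, not code from this repository; the real kernels live in
=src/sycl_comp.cpp=):
#+BEGIN_SRC cpp
// Minimal SYCL sketch: double every element of a vector.
// Buffers give the runtime an abstract view of host memory; accessors declare
// how a kernel uses that memory, so data movement happens implicitly.
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  std::vector<int> data(1024, 1);
  sycl::queue q;
  {
    // The buffer wraps the host vector for the duration of this scope.
    sycl::buffer<int> buf(data.data(), sycl::range<1>(data.size()));
    q.submit([&](sycl::handler &h) {
      // read_write accessor: the kernel both reads and updates the elements.
      sycl::accessor acc(buf, h, sycl::read_write);
      h.parallel_for(sycl::range<1>(data.size()),
                     [=](sycl::id<1> i) { acc[i] *= 2; });
    });
  } // destroying the buffer waits for the kernel and copies results back to 'data'
  return 0;
}
#+END_SRC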
* Prerequisites
To use the project, you'll need the following prerequisites:
** Mandatory Prerequisites
- A functional SYCL compiler. You can choose from options like Intel's oneAPI or
AdaptiveCpp.
- The "xxhash" library.
** Optional Prerequisite
- CMake (for generating build files)
* Compilation
** Using Intel oneAPI
Finally, I've made the code run with Intel's oneAPI and adapted the CMake
generation process.
#+BEGIN_SRC bash
# Make sure to source Intel's vars together with the built-in llvm!
. /opt/intel/oneapi/setvars.sh --include-intel-llvm
# Create a build directory and navigate to it
mkdir build && cd build
# Configure the build for the Intel oneAPI SYCL compiler
CXX=$(which icpx) cmake .. -DUSE_INTELSYCL=ON \
-DCMAKE_BUILD_TYPE="Release"
# Compile the executable
make
#+END_SRC
** Using AdaptiveCpp
Regrettably, integrating Intel's oneAPI with the AMD GPU plugin proves to be
quite challenging on Arch Linux, primarily due to the plugin's dependency on an
older version of ROCm than what's available in the official repositories. While
I could have chosen to compile my own ROCm/hip version, I opted for a more
convenient solution and turned to the [[https://github.com/AdaptiveCpp/AdaptiveCpp/tree/develop][AdaptiveCpp]] compiler, which offers both
CPU and GPU acceleration through CUDA and ROCm support. You can find a version
of AdaptiveCpp compatible with AMD GPUs on the AUR (Arch User Repository).
If your goal is to run benchmarks on an AMD GPU alongside AdaptiveCpp, I
recommend using [[https://github.com/sobc/pkgbuilds/tree/master/hipsycl-rocm-git][this]] specific PKGBUILD. Other versions that rely on ROCm might
not build correctly at the moment. I've already raised an issue with the
responsible maintainer of the PKGBUILDs to address this compatibility issue.
With the CMake changes above, build files can now be generated for both Intel's
oneAPI and AdaptiveCpp.
To generate Makefiles for AdaptiveCpp, you can follow these steps:
#+BEGIN_SRC bash
# Create a build directory and navigate to it
mkdir build && cd build
# Adjust the path to AdaptiveCpp and your target devices according to your system
cmake .. -DUSE_ACPP=ON \
-DAdaptiveCpp_DIR=/opt/AdaptiveCpp/ROCm/lib/cmake/AdaptiveCpp \
-DACPP_TARGETS="omp.accelerated;hip.integrated-multipass;gfx90c" \
-DCMAKE_BUILD_TYPE="Release"
#+END_SRC
You can find more information about =ACPP_TARGETS= and the compilation process in
the documentation [[https://github.com/AdaptiveCpp/AdaptiveCpp/blob/develop/doc/compilation.md][here]].
Once your Makefiles are generated, you can build the project using the following
command:
#+BEGIN_SRC bash
make -j$(nproc)
#+END_SRC
The compiled executable can be found in the =build/src= directory.
* Data Information
I have provided a set of 6 matrices, two for each of 3 different sizes:
- =sma*.txt=: These matrices are of size 16x16
- =med*.txt=: These matrices are of size 2048x2048
- =big*.txt=: These matrices are of size 8192x8192
All of these matrices are available in text file format, and you can locate them
within the =data/= directory.
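Judging from the loader in =src/cuda/matrix.hpp=, each file is plain text: the
first two whitespace-separated integers give the row and column counts,
followed by the =rows x cols= entries in row-major order. A hypothetical 2x3
matrix would therefore look like this:
#+BEGIN_EXAMPLE
2 3
1 2 3
4 5 6
#+END_EXAMPLE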
*Important note*:
A word of caution when working with the large matrices (=big*.txt=): To avoid
exceedingly long execution times, it is advisable to disable the benchmark for a
single CPU core. You can achieve this by invoking CMake with the option
=-DSYCL_EX_COMPILE_SEQUENTIAL_BENCH=OFF= and then recompiling the executable
accordingly.
Additionally, below, you will find the results of multiplying all combinations
of these matrices along with their corresponding checksums. Please feel free to
reach out if you come across any other checksums or encounter further questions.
| Matrix A | Matrix B | Checksum |
|------------+------------+--------------|
| =sma1.txt= | =sma1.txt= | =0xe6134d8e= |
| =sma2.txt= | =sma2.txt= | =0xf1ba0ac6= |
| =sma1.txt= | =sma2.txt= | =0xe71fdf1e= |
| =sma2.txt= | =sma1.txt= | =0x36b44d2c= |
|------------+------------+--------------|
| =med1.txt= | =med1.txt= | =0xd92eb6d6= |
| =med2.txt= | =med2.txt= | =0x9f0e1206= |
| =med1.txt= | =med2.txt= | =0x4cf45b91= |
| =med2.txt= | =med1.txt= | =0xfdeb52bf= |
|------------+------------+--------------|
| =big1.txt= | =big1.txt= | =0xde9b4c0d= |
| =big2.txt= | =big2.txt= | =0x05365fc1= |
| =big1.txt= | =big2.txt= | =0xb185e6c1= |
| =big2.txt= | =big1.txt= | =0x59f5ffef= |
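If you want to cross-check one of these checksums without any SYCL or CUDA
setup, a plain host-side multiplication using the =Matrix= class from
=src/cuda/matrix.hpp= is sufficient. The sketch below is only an illustration
(a hypothetical helper, not part of the repository; compile with any C++17
compiler and link against =xxhash=). It is fine for the =sma*= matrices, but
far too slow for the larger ones:
#+BEGIN_SRC cpp
// verify_chksum.cpp (hypothetical helper, not part of the repository)
#include <cstdint>
#include <iostream>
#include "matrix.hpp"

int main(int argc, char *argv[]) {
  if (argc != 3) {
    std::cerr << "Usage: <prog> <matA>.txt <matB>.txt\n";
    return 1;
  }
  Matrix<int> matA(argv[1]);
  Matrix<int> matB(argv[2]);
  Matrix<int> matRes(matA.rows, matB.cols);
  // Naive triple loop on the host, just to reproduce the reference checksum.
  for (std::uint32_t i = 0; i < matA.rows; i++) {
    for (std::uint32_t j = 0; j < matB.cols; j++) {
      int sum = 0;
      for (std::uint32_t k = 0; k < matA.cols; k++) {
        sum += matA(i, k) * matB(k, j);
      }
      matRes(i, j) = sum;
    }
  }
  std::cout << "Check: 0x" << std::hex << matRes.chksum() << "\n";
  return 0;
}
#+END_SRC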

View File

@ -1,5 +1,7 @@
add_executable(sycl_comp sycl_comp.cpp)
add_sycl_to_target(TARGET sycl_comp)
add_sycl_to_target(TARGET sycl_comp SOURCES sycl_comp.cpp)
target_link_libraries(sycl_comp PRIVATE ${LIB_XXHASH})
option(SYCL_EX_COMPILE_SEQUENTIAL_BENCH

87
src/cuda/matrix.hpp Normal file
View File

@ -0,0 +1,87 @@
#ifndef MATRIX_H_
#define MATRIX_H_
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <numeric>
#include <vector>
#include <xxhash.h>
template <class T> struct Matrix {
std::uint32_t rows;
std::uint32_t cols;
std::vector<T> mem;
Matrix<T>(std::uint32_t _rows, std::uint32_t _cols)
: rows(_rows), cols(_cols) {
mem.resize(rows * cols);
}
Matrix<T>(const char *filepath) {
std::ifstream matfs(filepath);
if (matfs.fail() || !matfs.is_open()) {
throw std::runtime_error("Error opening matrix file");
}
matfs >> this->rows >> this->cols;
this->mem.resize(this->rows * this->cols);
for (std::uint32_t i = 0; i < rows; i++) {
for (std::uint32_t j = 0; j < cols; j++) {
matfs >> (*this)(i, j);
}
}
matfs.close();
}
T &operator()(std::uint32_t row_i, std::uint32_t col_j) {
return mem[row_i * cols + col_j];
}
const T &operator()(std::uint32_t row_i, std::uint32_t col_j) const {
return mem[row_i * cols + col_j];
}
Matrix<T> &operator=(const Matrix<T> &mat) {
this->rows = mat.rows;
this->cols = mat.cols;
this->mem = mat.mem;
return *this;
}
Matrix<T> t() const {
Matrix<T> transposed(this->cols, this->rows);
for (std::uint32_t i = 0; i < this->rows; i++) {
for (std::uint32_t j = 0; j < this->cols; j++) {
transposed(j, i) = (*this)(i, j);
}
}
return transposed;
}
XXH32_hash_t chksum() const {
constexpr XXH32_hash_t HASH_SEED = 42;
return XXH32(this->mem.data(), mem.size(), HASH_SEED);
}
std::size_t bytes() const { return this->mem.size() * sizeof(T); }
};
template <class T> std::ostream &operator<<(std::ostream &os, const Matrix<T> &mat) {
for (std::uint32_t i = 0; i < mat.rows; i++) {
for (std::uint32_t j = 0; j < mat.cols; j++) {
os << mat(i, j) << "\t";
}
os << "\n";
}
return os;
}
#endif // MATRIX_H_

193
src/cuda/matrixMult.cu Normal file
View File

@ -0,0 +1,193 @@
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cuda_runtime.h>
#include <iostream>
#include "matrix.hpp"
#include "timer.hpp"
#define stream_hex(_val) std::hex << _val << std::dec
#define print_pair(_desc, _chksm, _time) \
std::cout << _desc << "\n\t" \
<< "\t-> Check: 0x" << stream_hex(_chksm) \
<< "\tRuntime: " << _time << " us\n\n"
#define checkCudaErrors(call) \
do { \
cudaError_t err = call; \
if (err != cudaSuccess) { \
printf("CUDA error at %s %d: %s\n", __FILE__, __LINE__, \
cudaGetErrorString(err)); \
exit(EXIT_FAILURE); \
} \
} while (0)
enum CUDA_FUNC { CUDA_NAIVE, CUDA_TILED };
using data_type = int;
// Naive kernel: one thread computes one element of the result. Note that
// colsA is also used as the row stride of matB and matRes, so this kernel
// assumes colsA == colsB (which holds for the provided square matrices).
template <class T>
__global__ void matrixMultCuda(T *matA, T *matB, std::uint32_t colsA,
T *matRes) {
const auto tidx = blockIdx.x * blockDim.x + threadIdx.x;
const auto tidy = blockIdx.y * blockDim.y + threadIdx.y;
T sum = 0;
for (std::uint32_t k = 0; k < colsA; k++) {
sum += matA[tidy * colsA + k] * matB[k * colsA + tidx];
}
matRes[tidy * colsA + tidx] = sum;
}
template <class T>
__global__ void matrixMultCudaTiled(T *matA, T *matB, std::uint32_t colsA,
std::uint32_t colsB, T *matRes) {
const auto tidx = blockIdx.x * blockDim.x + threadIdx.x;
const auto tidy = blockIdx.y * blockDim.y + threadIdx.y;
const auto tile_size = blockDim.x;
std::uint32_t a = tidy * colsA + threadIdx.x;
std::uint32_t b = tile_size * blockIdx.x + threadIdx.x + threadIdx.y * colsB;
const auto a_step = tile_size;
const auto b_step = tile_size * colsB;
const auto a_end = a + colsA;
extern __shared__ T localMem[];
T *localA = &localMem[0];
T *localB = &localMem[tile_size * tile_size];
T sum = 0;
for (; a < a_end; a += a_step, b += b_step) {
localA[threadIdx.y * tile_size + threadIdx.x] = matA[a];
localB[threadIdx.y * tile_size + threadIdx.x] = matB[b];
__syncthreads();
#pragma unroll
for (std::uint32_t k = 0; k < tile_size; k++) {
sum += localA[threadIdx.y * tile_size + k] *
localB[k * tile_size + threadIdx.x];
}
__syncthreads();
}
matRes[tidy * colsB + tidx] = sum;
}
inline int prevPowerOfTwo(int n) {
if (n <= 0) {
return 0;
}
// Calculate the most significant bit position (MSB)
int msbPos = 0;
while (n > 1) {
n >>= 1;
msbPos++;
}
// 1 << msbPos is the largest power of two not exceeding n
int result = 1 << msbPos;
return result;
}
template <class T, CUDA_FUNC F>
std::uint32_t matrixMult(const Matrix<T> &matA, const Matrix<T> &matB) {
Matrix<T> matRes(matA.rows, matB.cols);
T *d_matA;
checkCudaErrors(cudaMalloc(reinterpret_cast<void **>(&d_matA), matA.bytes()));
checkCudaErrors(cudaMemcpy(d_matA, matA.mem.data(), matA.bytes(),
cudaMemcpyHostToDevice));
T *d_matB;
checkCudaErrors(cudaMalloc(reinterpret_cast<void **>(&d_matB), matB.bytes()));
checkCudaErrors(cudaMemcpy(d_matB, matB.mem.data(), matB.bytes(),
cudaMemcpyHostToDevice));
T *d_matRes;
checkCudaErrors(
cudaMalloc(reinterpret_cast<void **>(&d_matRes), matRes.bytes()));
// obtain the maximum work group size for the given device
int device_id;
checkCudaErrors(cudaGetDevice(&device_id));
cudaDeviceProp dev_props;
checkCudaErrors(cudaGetDeviceProperties(&dev_props, device_id));
std::size_t max_block_size = dev_props.maxThreadsPerBlock;
// each tile must not exceed max_block_size -> tile_size x tile_size <= max_block_size
const std::uint32_t max_tile_size =
static_cast<std::uint32_t>(prevPowerOfTwo(std::sqrt(max_block_size)));
// maybe the problem size of the multiplication is smaller than the maximum
// tile size
const std::uint32_t block_size = std::min(max_tile_size, matA.cols);
dim3 threads(block_size, block_size);
dim3 grid(matB.cols / threads.x, matA.rows / threads.y);
cudaDeviceSynchronize();
if constexpr (F == CUDA_NAIVE) {
matrixMultCuda<T><<<grid, threads>>>(d_matA, d_matB, matA.cols, d_matRes);
} else if constexpr (F == CUDA_TILED) {
const auto shared_mem_size = sizeof(T) * block_size * block_size * 2;
matrixMultCudaTiled<T><<<grid, threads, shared_mem_size>>>(
d_matA, d_matB, matA.cols, matB.cols, d_matRes);
}
cudaDeviceSynchronize();
checkCudaErrors(cudaMemcpy(matRes.mem.data(), d_matRes, matRes.bytes(),
cudaMemcpyDeviceToHost));
cudaDeviceSynchronize();
checkCudaErrors(cudaFree(d_matA));
checkCudaErrors(cudaFree(d_matB));
checkCudaErrors(cudaFree(d_matRes));
return matRes.chksum();
}
auto main(int argc, char *argv[]) -> int {
if (argc != 3) {
std::cerr << "Provide 2 arguments to the program!\n"
<< "Usage: <prog> <matA>.txt <matB>.txt\n";
return EXIT_FAILURE;
}
Matrix<data_type> matA(argv[1]);
Matrix<data_type> matB(argv[2]);
assert(matA.cols == matB.rows);
const auto naive_chk =
measure<>::duration(matrixMult<data_type, CUDA_NAIVE>, matA, matB);
print_pair("CUDA naive", naive_chk.first, naive_chk.second.count());
const auto tiled_chk =
measure<>::duration(matrixMult<data_type, CUDA_TILED>, matA, matB);
print_pair("CUDA tiled", tiled_chk.first, tiled_chk.second.count());
return 0;
}

22
src/cuda/timer.hpp Normal file
View File

@ -0,0 +1,22 @@
#ifndef TIMER_H_
#define TIMER_H_
#include <chrono>
#include <functional>
#include <utility>
template <class TimeUnit = std::chrono::microseconds,
class ClockType = std::chrono::steady_clock>
struct measure {
template <class F, class... Args>
static auto duration(F &&func, Args &&...args) {
auto start = ClockType::now();
auto retVal =
std::invoke(std::forward<F>(func), std::forward<Args>(args)...);
auto end = ClockType::now();
return std::make_pair(retVal,
std::chrono::duration_cast<TimeUnit>(end - start));
}
};
#endif // TIMER_H_

View File

@ -1,20 +1,18 @@
#include <sycl/sycl.hpp>
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <hipSYCL/sycl/info/device.hpp>
#include <iostream>
#include <CL/sycl.hpp>
#include <utility>
#include "matrix.hpp"
#include "timer.hpp"
namespace sycl = cl::sycl;
#define stream_hex(_val) std::hex << _val << std::dec
#define print_pair(_desc, _chksm, _time) \
@ -100,19 +98,19 @@ auto matrixMultSYCL(sycl::queue &q, const Matrix<T> &matA,
// cell of the product. This leads to a problem size of N x M
sycl::range<2> global_range(matRes.rows, matRes.cols);
const auto &inner_loop = matA.cols;
{
// defining 2 dimensional buffers which can then be exposed to the device.
// It is also possible to use 1D buffers here, but then we have to manually
// calculate the index to access the matrices for each thread in the kernel
// code. Solving it this way will ask the compiler to do the work.
sycl::buffer<T, 2> b_matA(matA.mem.data(),
sycl::range<2>(matA.rows, matA.cols));
sycl::buffer b_matA(matA.mem.data(), sycl::range(matA.rows, matA.cols));
sycl::buffer<T, 2> b_matB(matB.mem.data(),
sycl::range<2>(matB.rows, matB.cols));
sycl::buffer b_matB(matB.mem.data(), sycl::range(matB.rows, matB.cols));
sycl::buffer<T, 2> b_matRes(matRes.mem.data(),
sycl::range<2>(matRes.rows, matRes.cols));
sycl::buffer b_matRes(matRes.mem.data(),
sycl::range(matRes.rows, matRes.cols));
// submit work to the device. this is done by using a lambda function which
// references all values known to the scope i.e. the previously defined
@ -124,10 +122,9 @@ auto matrixMultSYCL(sycl::queue &q, const Matrix<T> &matA,
//
// Here, we only need matrices A and B to be read from and matrix C to be
// written to.
auto acc_matA = b_matA.template get_access<sycl::access::mode::read>(h);
auto acc_matB = b_matB.template get_access<sycl::access::mode::read>(h);
auto acc_matRes =
b_matRes.template get_access<sycl::access::mode::write>(h);
sycl::accessor acc_matA(b_matA, h, sycl::read_only);
sycl::accessor acc_matB(b_matB, h, sycl::read_only);
sycl::accessor acc_matRes(b_matRes, h, sycl::write_only);
// For the parallelized loop another lambda function is used, but all
// known values are passed by value, as host and device don't share the
@ -137,12 +134,12 @@ auto matrixMultSYCL(sycl::queue &q, const Matrix<T> &matA,
// the global range we defined earlier to provide the size of the problem
// and launch the count of tasks accordingly. The identifier of the task
// is then passed to the lambda function as a parameter.
h.parallel_for(global_range, [=](sycl::id<2> ID) {
h.parallel_for(global_range, [=](auto ID) {
const auto i = ID[0];
const auto j = ID[1];
T sum = 0;
for (auto k = 0; k < matA.cols; k++) {
for (auto k = 0; k < inner_loop; k++) {
sum += acc_matA[i][k] * acc_matB[k][j];
}
acc_matRes[i][j] = sum;
@ -179,30 +176,30 @@ auto matrixMultTransposeSYCL(sycl::queue &q, const Matrix<T> &matA,
Matrix<T> matB_t = matB.t();
Matrix<T> matRes(matA.rows, matB.cols);
sycl::range<2> global_range(matRes.rows, matRes.cols);
sycl::range global_range(matRes.rows, matRes.cols);
const auto &inner_loop = matA.cols;
{
sycl::buffer<T, 2> b_matA(matA.mem.data(),
sycl::range<2>(matA.rows, matA.cols));
sycl::buffer b_matA(matA.mem.data(), sycl::range(matA.rows, matA.cols));
sycl::buffer<T, 2> b_matB(matB_t.mem.data(),
sycl::range<2>(matB_t.rows, matB_t.cols));
sycl::buffer b_matB(matB_t.mem.data(),
sycl::range(matB_t.rows, matB_t.cols));
sycl::buffer<T, 2> b_matRes(matRes.mem.data(),
sycl::range<2>(matRes.rows, matRes.cols));
sycl::buffer b_matRes(matRes.mem.data(),
sycl::range(matRes.rows, matRes.cols));
q.submit([&](sycl::handler &h) {
auto acc_matA = b_matA.template get_access<sycl::access::mode::read>(h);
auto acc_matB = b_matB.template get_access<sycl::access::mode::read>(h);
auto acc_matRes =
b_matRes.template get_access<sycl::access::mode::write>(h);
sycl::accessor acc_matA(b_matA, h, sycl::read_only);
sycl::accessor acc_matB(b_matB, h, sycl::read_only);
sycl::accessor acc_matRes(b_matRes, h, sycl::write_only);
h.parallel_for(global_range, [=](sycl::id<2> ID) {
h.parallel_for(global_range, [=](auto ID) {
auto i = ID[0];
auto j = ID[1];
T sum = 0;
for (auto k = 0; k < matA.cols; k++) {
for (auto k = 0; k < inner_loop; k++) {
sum += acc_matA[i][k] * acc_matB[j][k];
}
acc_matRes[i][j] = sum;
@ -215,25 +212,19 @@ auto matrixMultTransposeSYCL(sycl::queue &q, const Matrix<T> &matA,
return matRes.chksum();
}
/*
* Obtained from
* https://github.com/codeplaysoftware/computecpp-sdk/blob/master/samples/matrix-multiply.cpp
*
* Obtains the previous power of two from the given integer.
* It works by masking out all ones after the first one bit,
* then leaves the first one bit intact, effectively
* yielding the first power of two < x. */
inline int prevPowerOfTwo(int x) {
if (x < 0) {
inline int prevPowerOfTwo(int n) {
if (n <= 0) {
return 0;
}
--x;
x |= x >> 1;
x |= x >> 2;
x |= x >> 4;
x |= x >> 8;
x |= x >> 16;
return x - (x >> 1);
// Calculate the most significant bit position (MSB)
int msbPos = 0;
while (n > 1) {
n >>= 1;
msbPos++;
}
return 1 << msbPos;
}
/**
@ -290,32 +281,25 @@ auto matrixMultTiledSYCL(sycl::queue &q, const Matrix<T> &matA,
{
// allocate the buffers
sycl::buffer<T, 2> b_matA(matA.mem.data(),
sycl::range<2>(matA.rows, matA.cols));
sycl::buffer b_matA(matA.mem.data(), sycl::range(matA.rows, matA.cols));
sycl::buffer<T, 2> b_matB(matB.mem.data(),
sycl::range<2>(matB.rows, matB.cols));
sycl::buffer b_matB(matB.mem.data(), sycl::range(matB.rows, matB.cols));
sycl::buffer<T, 2> b_matRes(matRes.mem.data(),
sycl::range<2>(matRes.rows, matRes.cols));
sycl::buffer b_matRes(matRes.mem.data(),
sycl::range(matRes.rows, matRes.cols));
q.submit([&](sycl::handler &h) {
// provide access to the buffers and ...
auto acc_matA = b_matA.template get_access<sycl::access::mode::read>(h);
auto acc_matB = b_matB.template get_access<sycl::access::mode::read>(h);
auto acc_matRes =
b_matRes.template get_access<sycl::access::mode::write>(h);
sycl::accessor acc_matA(b_matA, h, sycl::read_only);
sycl::accessor acc_matB(b_matB, h, sycl::read_only);
sycl::accessor acc_matRes(b_matRes, h, sycl::write_only);
// ... allocate memory in the local device memory which should be
// accessible to each thread, one tile for matrix A ...
sycl::accessor<int, 2, sycl::access::mode::read_write,
sycl::access::target::local>
tileA(tile_range, h);
sycl::local_accessor<T, 2> tileA(tile_range, h);
// ... and matrix B
sycl::accessor<int, 2, sycl::access::mode::read_write,
sycl::access::target::local>
tileB(tile_range, h);
sycl::local_accessor<T, 2> tileB(tile_range, h);
// We define a kernel function by passing the global_range and the
// tile_range to the parallel_for function of the handler. Secondly,
@ -323,7 +307,7 @@ auto matrixMultTiledSYCL(sycl::queue &q, const Matrix<T> &matA,
// passed-by-value lambda captures. The parameter is an nd_item, which
// can be used to extract all relevant data linked to the running task.
h.parallel_for<class tiled_matmul>(
sycl::nd_range{global_range, tile_range}, [=](sycl::nd_item<2> &ID) {
sycl::nd_range{global_range, tile_range}, [=](auto ID) {
// extract all relevant information
const int i = ID.get_global_id(0);
const int j = ID.get_global_id(1);
@ -342,7 +326,7 @@ auto matrixMultTiledSYCL(sycl::queue &q, const Matrix<T> &matA,
tileA[local_i][local_j] =
acc_matA[i][tile_i * block_size + local_j];
// here we will also transpose the B matrix
tileB[local_j][local_i] =
tileB[local_i][local_j] =
acc_matB[block_size * tile_i + local_i][j];
// we need an explicit barrier to ensure all threads of the
@ -350,10 +334,11 @@ auto matrixMultTiledSYCL(sycl::queue &q, const Matrix<T> &matA,
// memory
ID.barrier(sycl::access::fence_space::local_space);
// build the 'local' sum over the part of matrix A and B stored in
// the local memory
#pragma unroll
for (auto k = 0; k < block_size; k++) {
sum += tileA[local_i][k] * tileB[local_j][k];
// build the 'local' sum over the part of matrix A and B stored
// in the local memory
sum += tileA[local_i][k] * tileB[k][local_j];
}
// ensure all threads finished the multiplication, allowing to