I love optimizing small handy-dandy C/C++ programs in general and previously spent quite a bit of time optimizing a simple C hash table, but that’s about as far as I’ve gotten in this field. This will be my first time trying my hand at optimizing a fairly large math library. Preliminary Googling tells me there are quite a few libraries that implement BLAS operations with varying levels of success (Blaze, Intel MKL, BLIS, Armadillo, etc.), so it’ll be fun to see just how far I can get against some of them. If time permits, I’ll hopefully be able to benchmark my code against them.
Before we can get our hands dirty writing code and profiling programs, I’ll need to get the library set up. I also want to get the damned Intel C/C++ compiler installed. The only issue is that I’m currently running Manjaro Linux, which Intel does not officially support; this makes installing the Intel oneAPI tools and VTune much harder than it needs to be. Being in college, there are going to be some days when I’ve got nothing to do and can afford to spend a lot of time working on this project, but today isn’t one of those days.
March 25, 2022

The current goal is to get the BLIS library compiled on my system and get the Intel C/C++ compiler working by tonight. I want to be able to run a `saxpy` program, which just computes $S = \alpha X + Y$ (where $S, X \text{ and } Y$ are vectors and $\alpha$ is a scalar), without any optimizations. Just a simple `for` loop program, compiled and running, so I can make sure my setup works.
Okay, admittedly I’ve gone through this pain before while installing it on my VM, but that doesn’t make it any better. The AUR package `intel-compiler-base` does not work and asks for some license. I’ll have to install it from Intel’s page, but Intel’s installer does not recognize half the required packages on my system. Hopefully it’ll still work.
Lol. To save anyone else in the same situation, here’s what you’ll want to do. Use the offline installer instead of the online one, and install `libxcrypt-compat` on your system before launching the installer. Then install the oneAPI HPC Toolkit next. Once everything is installed, `cd` to the installation folder; the default should be `/opt/intel/oneapi/`. Here you will find `setvars.sh`. Sourcing `setvars` will let you use the `icc` command for that shell session. You can add a line to `.zshrc` to source this file every time you enter a session: `source /opt/intel/oneapi/setvars.sh > /dev/null`. There’s a noticeable slowdown when I launch a shell session, though, so I’ll probably find a workaround for this soon.
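For reference, here’s a slightly guarded version of that `.zshrc` addition (the path assumes the default oneAPI install location mentioned above):

```shell
# In ~/.zshrc: make icc and the rest of oneAPI available in every session.
# Guard against the file not existing (e.g. on machines without oneAPI).
if [ -f /opt/intel/oneapi/setvars.sh ]; then
    source /opt/intel/oneapi/setvars.sh > /dev/null
fi
```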
It looks like I’ve got `icc` working. I compiled a few files with different flags and everything seems to work as expected.
Installing and setting up BLIS was relatively easy. It was interesting reading through their build system doc, and I’d highly recommend reading their doc on multi-threading as well. I’ve installed BLIS using the `zen3` config for my system, with multi-threading enabled using OpenMP. There’s a section on why to use OpenMP vs `pthreads` (tl;dr: BLIS does not provide support for setting thread affinity via `pthreads`). The entire section on thread affinity is pretty interesting to read, though.
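For anyone following along, the build went roughly like this. The configure flags are my best recollection of what BLIS accepts; the build system doc is the authoritative reference:

```shell
# Clone BLIS, configure for Zen 3 with OpenMP threading, build, and install.
git clone https://github.com/flame/blis.git
cd blis
./configure --enable-threading=openmp zen3
make -j"$(nproc)"
sudo make install
```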
A little more effort was required to set up the dev environment I wanted. I recompiled BLIS using `--enable-cblas` to get CBLAS support, and `/usr/lib/` has to be added to `LD_LIBRARY_PATH`, so add that to `~/.zshrc` as well. Then I just set up a simple Makefile, and now I can `#include <cblas.h>` and things will work as expected. Remember to link with `-lpthread`, as BLIS requires this. And that’s about it: my simple SAXPY program works, and I’ve got the `cblas` and `blis` libraries set up to benchmark against as well. That’ll be it for tonight. The plan is to get some roofline analysis done tomorrow and get to know my hardware better, so I know what I have at my disposal and how much performance I can reasonably expect to squeeze out of this machine.
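The compile-and-link step boils down to something like the following (the exact library name and flags are from my setup; check what your BLIS install actually produced):

```shell
# Build the SAXPY test program against BLIS's CBLAS interface.
# -lblis assumes the recompiled --enable-cblas build is installed,
# and -lpthread is the threading dependency mentioned above.
icc -O2 -Wall saxpy.c -o saxpy -lblis -lpthread -fopenmp
```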
Also installed `perf`, `kcachegrind`, and `gptrace`; `gprof` and `valgrind` are already installed. However, as mentioned in the doc about profilers, I’m more interested in trying to get stack samples during program execution. I believe `gdb` and `gptrace` should help me out here.
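The stack-sampling idea I have in mind is the classic “poor man’s profiler”: attach `gdb` to the running benchmark and dump every thread’s backtrace, repeating at intervals and eyeballing which frames show up most often. A single sample looks something like this (`$PID` is the benchmark’s process ID):

```shell
# Grab one stack sample from a running process without stopping it for long.
gdb -p "$PID" -batch -ex "thread apply all bt" 2>/dev/null
```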
March 26, 2022

Today I more or less just plan on getting to know my hardware well, so I can try to exploit as many of the features I have at my disposal as possible, and on setting up a benchmarking environment. If I make any change to a program in the name of optimization, I want to be able to see the effect it has. Further, I should probably get a roofline analysis done so I know what theoretical peak I can hope to achieve.
I’m not dealing with any specialized hardware here, so consequently there won’t be (much) inspection to do either. Running `htop` and `cat /proc/cpuinfo` should provide plenty of information, and the official AMD website plus WikiChip should cover all the spec information. Official AMD Website. Wikichip Website.

Note: I got the memory bandwidth figure from a third-party source (https://nanoreview.net/en/cpu/amd-ryzen-7-5800h).
| CPU Details | |
|---|---|
| CPU | Ryzen 7 5800H |
| Cores | 8 |
| Threads | 16 |
| Base Clock | 3.2 GHz |
| Max. Boost Clock | 4.4 GHz |
| Memory Bandwidth | 69.27 GB/s |
| Cache Details (64-bit alignment) | Size | Associativity | Write Policy | Scope |
|---|---|---|---|---|
| L1 (total) | 512 KB | 8-way set associative | - | Per-core |
| L1I | 256 KB (8 × 32 KB) | 8-way set associative | - | Per-core |
| L1D | 256 KB (8 × 32 KB) | 8-way set associative | Write-back | Per-core |
| L2 | 4 MB (8 × 512 KB) | 8-way set associative | Write-back | Per-core |
| L3 | 16 MB (1 × 16 MB) | 16-way set associative | Write-back | Shared |
| TLB Size | 2560 4K pages | - | - | - |
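As a first back-of-the-envelope stab at the compute ceiling for tomorrow’s roofline (this assumes Zen 3 can retire two 256-bit FMAs per core per cycle, i.e. 32 single-precision FLOPs/cycle/core — something I still need to verify against AMD’s docs):

$$\text{Peak}_{SP} = 8~\text{cores} \times 3.2~\text{GHz} \times 32~\tfrac{\text{FLOPs}}{\text{cycle}} \approx 819.2~\text{GFLOP/s}$$

For SAXPY specifically, each element does 2 FLOPs while moving 12 bytes (read $x_i$, read $y_i$, write $s_i$), an arithmetic intensity of about 0.17 FLOP/byte. At the 69.27 GB/s from the table above, the memory-bandwidth roof sits around $69.27 \times 0.17 \approx 11.5$ GFLOP/s, so I should expect SAXPY to be firmly memory-bound.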