To use AMX in Linux, see the relevant documentation.

We can note that:

In its initial form, it implements a set of up to eight "tiles", which are arrays of 16 64-byte rows.

So one tile is 16 x 64 = 1024 bytes (1 KB).

Programmers can store matrices in these tiles of any dimension that will fit therein; a matrix of 16x16 32-bit floating-point values would work, but other geometries are supported too.

A 16x16 matrix of 4-byte values fits exactly in one tile (16 x 16 x 4 = 1024 bytes = 16 x 64).

The one supported operation currently will multiply the matrices stored in two tiles, then add the result to a third tile. By chaining these operations, multiplication of matrices of any size can be implemented. Evidently other operations are meant to come in the future.
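As a rough illustration of that chaining (plain scalar C, not AMX code), the sketch below decomposes a larger multiplication into 16x16 blocks. The innermost block update, C_blk += A_blk * B_blk, is what a single tile multiply-add instruction performs; looping over the K blocks chains those accumulations. It assumes M, N, and K are multiples of 16 and that C starts zeroed.

```c
/* Scalar reference sketch: block (tile) decomposition of C = A * B.
 * Each 16x16 block update corresponds to one tile multiply-add;
 * the kb loop chains those updates to handle larger K. */
void blocked_matmul(const float *A, const float *B, float *C,
                    int M, int N, int K)
{
    const int T = 16;                              /* block (tile) dimension */
    for (int mb = 0; mb < M; mb += T)
        for (int nb = 0; nb < N; nb += T)
            for (int kb = 0; kb < K; kb += T)      /* chained accumulation   */
                for (int i = mb; i < mb + T; i++)
                    for (int j = nb; j < nb + T; j++)
                        for (int k = kb; k < kb + T; k++)
                            C[i * N + j] += A[i * K + k] * B[k * N + j];
}
```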

The design of AMX gives the kernel control over whether these features can be used by any given process. There are a couple of reasons for this, one being that AMX instructions, as one might imagine, use a lot of processor resources. A process doing heavy AMX work on a shared computer may adversely affect other processes. But AMX also cannot be supported properly unless both the kernel and the user-space process are ready for it.
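On Linux this means a process has to ask the kernel for permission before it touches tile state. A minimal sketch of that request on a kernel with AMX support, using the arch_prctl() dynamic-feature interface (ARCH_REQ_XCOMP_PERM with the XTILEDATA feature bit):

```c
#include <stdbool.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ARCH_REQ_XCOMP_PERM  0x1023   /* arch_prctl: request a dynamic xstate feature */
#define XFEATURE_XTILEDATA   18       /* bit number of the AMX tile data state        */

/* Ask the kernel for permission to use AMX tile data in this process. */
static bool request_amx_permission(void)
{
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA)) {
        perror("arch_prctl(ARCH_REQ_XCOMP_PERM)");
        return false;
    }
    return true;
}
```

If the call fails, the process must not execute tile instructions; doing so would raise a fault.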

Intro From Intel

Intel® Advanced Matrix Extensions (Intel® AMX) is a new 64-bit programming paradigm consisting of two components: a set of 2-dimensional registers (tiles) representing sub-arrays from a larger 2-dimensional memory image, and an accelerator able to operate on tiles; the first implementation is called TMUL (tile matrix multiply unit).

An Intel AMX implementation enumerates to the programmer how the tiles can be programmed by providing a palette of options. Two palettes are supported: palette 0 represents the initialized state, and palette 1 consists of 8 KB of storage spread across 8 tile registers named TMM0..TMM7. Each tile has a maximum size of 16 rows x 64 bytes (1 KB); however, the programmer can configure each tile to smaller dimensions appropriate to their algorithm.

The tile dimensions supplied by the programmer (rows and bytes_per_row, i.e., colsb) are metadata that drives the execution of tile and accelerator instructions. In this way, a single instruction can launch autonomous multi-cycle execution in the tile and accelerator hardware. The palette value (palette_id) and metadata are held internally in a tile related control register (TILECFG). The TILECFG contents will be commensurate with that reported in the palette_table.
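A C view of that 64-byte memory image (the layout LDTILECFG reads and STTILECFG writes back) could look like the struct below; the struct and field names here are just local choices that mirror the fields described above:

```c
#include <stdint.h>

/* 64-byte memory layout consumed by LDTILECFG (and produced by STTILECFG).
 * palette_id selects the palette (1 = the 8-tile palette); rows[i] and
 * colsb[i] give the rows and bytes-per-row metadata for tile register TMMi. */
typedef struct __tile_config {
    uint8_t  palette_id;
    uint8_t  start_row;
    uint8_t  reserved_0[14];
    uint16_t colsb[16];
    uint8_t  rows[16];
} __tilecfg;
```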

Walk Through a Basic Example

Tile Matrix Multiply (TMUL): TMUL is an accelerator engine connected to the tile registers that performs matrix multiplication calculations for AI.

Code Sample: Intel® Advanced Matrix Extensions (Intel® AMX) -...

Before using TMUL instructions, the tile architecture must be configured by specifying the tile configuration, including the number of tiles and the tile sizes (the palette). This configuration step is performed once, and the configuration remains in effect until it is either changed by the code or released. Once the tiles are configured, TMUL instructions can be used to perform matrix multiplications (currently, INT8 and BF16 types are supported). When executed, the TMUL instructions dynamically check the maximum tile sizes and the matrix dimensions to ensure a mathematically correct matrix multiplication.

In the code sample walkthrough that follows, an INT8 matrix multiplication demonstrates the above procedure step by step. Specifically, the code sample multiplies matrices A and B of size 16 x 64 containing INT8 values and accumulates the result into a 16 x 16 matrix C containing INT32 values.
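A rough sketch of that compute step with the AMX intrinsics from immintrin.h, assuming the tiles have already been configured as in the next section, that B is already stored in the packed 4-bytes-per-group layout TDPBSSD expects, and a compiler with AMX support (e.g. GCC/Clang with -mamx-tile -mamx-int8):

```c
#include <immintrin.h>
#include <stdint.h>

#define STRIDE 64   /* bytes per row: 64 int8 values, or 16 int32 values */

/* One tile multiply-accumulate: C(16x16 int32) += A(16x64 int8) * B.
 * Assumes LDTILECFG has already configured tiles 1..3 as 16 rows x 64 bytes
 * and that B is in the packed layout TDPBSSD expects. */
static void tile_matmul_int8(const int8_t *A, const int8_t *B, int32_t *C)
{
    _tile_loadd(2, A, STRIDE);   /* TMM2 <- A, 64-byte row stride          */
    _tile_loadd(3, B, STRIDE);   /* TMM3 <- B (packed layout)              */
    _tile_loadd(1, C, STRIDE);   /* TMM1 <- C accumulator                  */
    _tile_dpbssd(1, 2, 3);       /* TMM1 += TMM2 * TMM3 (signed int8 dots) */
    _tile_stored(1, C, STRIDE);  /* write the int32 result back to memory  */
}
```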

Config

Define the tile configuration data and load it into TILECFG.
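A rough sketch, reusing the __tilecfg layout from above: tiles 1-3 are configured as 16 rows x 64 bytes each, which covers the 16x64 INT8 A and B tiles as well as the 16x16 INT32 C tile, and the configuration is loaded with _tile_loadconfig():

```c
#include <immintrin.h>
#include <string.h>

#define MAX_ROWS  16
#define MAX_COLSB 64

/* Fill a __tilecfg (defined earlier), select palette 1, and make tiles 1..3
 * full-size: 16 rows x 64 bytes each. Tile 1 will hold the int32 accumulator
 * C (16 x 16 x 4 bytes); tiles 2 and 3 hold the int8 A and B operands. */
static void init_tile_config(__tilecfg *cfg)
{
    memset(cfg, 0, sizeof(*cfg));
    cfg->palette_id = 1;
    cfg->start_row  = 0;

    for (int i = 1; i <= 3; i++) {
        cfg->rows[i]  = MAX_ROWS;
        cfg->colsb[i] = MAX_COLSB;
    }

    _tile_loadconfig(cfg);   /* LDTILECFG: program TILECFG from memory */
}
```

When the AMX work is done, _tile_release() returns the tiles to their initialized state so the configuration no longer applies.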