C++ Templates for HLS

Introduction

HLS is a powerful tool for building complex hardware designs using traditional software tooling while retaining control over hardware-specific optimizations. Loop unrolling, pipelining, and array partitioning are three powerful optimizations that are commonly applied. But for these optimizations to be applied to loops, arrays, and kernels, those components need to be statically defined in whatever HLS language you are using. For example, when unrolling a loop, the loop's bounds must be fixed and known ahead of time for the HLS compiler to perform the optimization. Similarly, to partition an array, the array size must be fixed and statically defined, which also means any function (aka kernel) operating on the array must know the fixed array size when defining its arguments. Writing HLS code, or picking a coding style, around this constraint of needing static definitions can be challenging and messy.

This article will focus on using the templates feature of the C++ language to overcome these challenges: taking advantage of optimizations that require static definitions while still producing clean and concise code.

I will mainly focus on writing HLS code using C++ and the Vitis HLS compiler, but the ideas presented apply to any HLS tooling that supports C++ and templates.

There is one important clarification I want to point out before I dive in. There are two "compilers" being used in HLS tooling. First is the language compiler, in this case Clang (or any front end that compiles C++ down to LLVM IR). This compiler translates the C++ code into an intermediate representation for the tool to analyze and eventually turn into hardware. The second is the HLS compiler. This compiler takes the LLVM IR, along with any user-defined optimizations, and creates the final hardware design.

Key Challenge and Examples of Suboptimal Coding Approaches

To highlight the key challenges of writing code to satisfy static constraints, I will present some basic examples.

Example: ReLU Activation Layer

Below is a simple example of a C++ implementation of an activation layer being applied to a vector.

typedef ap_fixed<32, 16> F_TYPE;
typedef ap_fixed<32, 16> W_TYPE;

// Define ReLU operation
F_TYPE relu(F_TYPE in){
	if(in > 0){
		return in;
	}
	else{
		return 0;
	}
}


void apply_relu(F_TYPE *out, int length){
#pragma HLS INLINE off
	for(int out_idx = 0; out_idx < length; out_idx++){
		out[out_idx] = relu(out[out_idx]);
	}
}

#define DATA_VECTOR_1_SIZE 32
#define DATA_VECTOR_2_SIZE 64

void top_example(F_TYPE data_vector_1[DATA_VECTOR_1_SIZE],
                 F_TYPE data_vector_2[DATA_VECTOR_2_SIZE]){
    apply_relu(data_vector_1, DATA_VECTOR_1_SIZE);
    apply_relu(data_vector_2, DATA_VECTOR_2_SIZE);
}

This implementation allows you to define the apply_relu() function once and use it to apply the ReLU operation to data arrays of different sizes.

However, a couple of flaws prevent HLS optimizations from being applied to the apply_relu() function. In apply_relu(), the array length of your data vector is passed in as part of the function call and becomes the loop bound. Because this value is unknown at compile time and can be anything at runtime, the loop bound is not fixed, so pipelining and loop unrolling optimizations cannot be applied to this function. Additionally, since apply_relu() references the input array as a pointer, the array size within the function is unknown, and array partition optimizations cannot be applied.

Let's make some simple changes to make the input array size and loop bound fixed.

typedef ap_fixed<32, 16> F_TYPE;
typedef ap_fixed<32, 16> W_TYPE;

// Define ReLU operation
F_TYPE relu(F_TYPE in){
	if(in > 0){
		return in;
	}
	else{
		return 0;
	}
}

#define DATA_VECTOR_1_SIZE 32
#define DATA_VECTOR_2_SIZE 64

void apply_relu_1(F_TYPE out[DATA_VECTOR_1_SIZE]){
#pragma HLS INLINE off
#pragma HLS array_partition variable=out complete
#pragma HLS PIPELINE II=1
	for(int out_idx = 0; out_idx < DATA_VECTOR_1_SIZE; out_idx++){
		out[out_idx] = relu(out[out_idx]);
	}
}

void apply_relu_2(F_TYPE out[DATA_VECTOR_2_SIZE]){
#pragma HLS INLINE off
#pragma HLS array_partition variable=out complete
#pragma HLS PIPELINE II=1
	for(int out_idx = 0; out_idx < DATA_VECTOR_2_SIZE; out_idx++){
		out[out_idx] = relu(out[out_idx]);
	}
}


void top_example(F_TYPE data_vector_1[DATA_VECTOR_1_SIZE],
                 F_TYPE data_vector_2[DATA_VECTOR_2_SIZE]){
    apply_relu_1(data_vector_1);
    apply_relu_2(data_vector_2);
}

In this example, the issues of dynamic loop bounds and unknown array sizes for apply_relu() have been solved. However, to do this, we as the programmer needed to define apply_relu() twice, as apply_relu_1() and apply_relu_2(), in order to statically define the loop bounds and input array sizes for each of the data vectors, data_vector_1 and data_vector_2, that we want to apply the function to. Writing the C++ code this way allows valid pipelining, loop unrolling, and array partition optimizations that the HLS compiler will implement when compiling the final hardware design.

The major downside is that this coding style does not scale to a larger design. For example, if I had a design that applied the apply_relu function to 20 different-sized data vectors, I as a programmer would need to redefine the apply_relu function 20 times, each with different values for the size of the array argument and loop bound. This is a practical issue in larger or more complex designs, like a neural network accelerator, where many of the same operations (linear, relu, conv2d) will be called in different places but with different fixed parameters for loop bounds, input sizes, and output sizes, plus fixed domain-specific parameters like stride size and dilation size, that make this approach messy and unproductive. There are some hacky ways to write slightly better code, like using large switch statements within a single function definition, but these workarounds sidestep the coding style issue rather than solve it.
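For illustration, a sketch of the switch-statement workaround might look like the following (a hypothetical example, with float standing in for ap_fixed<32, 16> so it compiles standalone). Each case gives the loop a fixed bound, but every supported size must be hard-coded, and the pointer argument still blocks array partitioning:

```cpp
typedef float F_TYPE; // stand-in for ap_fixed<32, 16>

F_TYPE relu(F_TYPE in) {
    return in > 0 ? in : F_TYPE(0);
}

// Hacky workaround: one function definition, but one fixed-bound
// loop per supported size; sizes not listed here fall through
// silently, and the pointer argument still cannot be partitioned.
void apply_relu_switch(F_TYPE *out, int length) {
    switch (length) {
    case 32:
        for (int i = 0; i < 32; i++) { // fixed bound: 32
            out[i] = relu(out[i]);
        }
        break;
    case 64:
        for (int i = 0; i < 64; i++) { // fixed bound: 64
            out[i] = relu(out[i]);
        }
        break;
    // ...one case per supported size
    }
}
```

The loop bounds are now compile-time constants within each case, but the list of sizes lives inside the function body, so any new vector size still means editing the kernel, which is exactly the maintenance problem described above.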

Example: Multilayer Perceptron (MLP)

Below is a simple implementation of a two-layer MLP that satisfies the static constraints for writing HLS code that can be optimized, but has the practical coding style flaws previously mentioned.

typedef ap_fixed<32, 16> F_TYPE;
typedef ap_fixed<32, 16> W_TYPE;

// Define ReLU operation
F_TYPE relu(F_TYPE in){
	if(in > 0){
		return in;
	}
	else{
		return 0;
	}
}


#define l_0_in_dim 128
#define l_0_out_dim 64

#define l_1_in_dim  64
#define l_1_out_dim  10

void linear_1(F_TYPE in[l_0_in_dim], F_TYPE out[l_0_out_dim], W_TYPE weights[l_0_out_dim][l_0_in_dim], W_TYPE bias[l_0_out_dim]){
#pragma HLS array_partition variable = weights dim = 1 complete
#pragma HLS array_partition variable = bias complete
#pragma HLS array_partition variable = in complete
#pragma HLS array_partition variable = out complete

	for (int out_idx = 0; out_idx < l_0_out_dim; out_idx++){
#pragma HLS PIPELINE II=1
		F_TYPE sum = bias[out_idx];
		for (int in_idx = 0; in_idx < l_0_in_dim; in_idx++)
		{
			sum += in[in_idx] * weights[out_idx][in_idx];
		}
		out[out_idx] = sum;
	}
}

void linear_2(F_TYPE in[l_1_in_dim], F_TYPE out[l_1_out_dim], W_TYPE weights[l_1_out_dim][l_1_in_dim], W_TYPE bias[l_1_out_dim]){
#pragma HLS array_partition variable = weights dim = 1 complete
#pragma HLS array_partition variable = bias complete
#pragma HLS array_partition variable = in complete
#pragma HLS array_partition variable = out complete

	for (int out_idx = 0; out_idx < l_1_out_dim; out_idx++){
#pragma HLS PIPELINE II=1
		F_TYPE sum = bias[out_idx];
		for (int in_idx = 0; in_idx < l_1_in_dim; in_idx++)
		{
			sum += in[in_idx] * weights[out_idx][in_idx];
		}
		out[out_idx] = sum;
	}
}

void apply_relu_1(F_TYPE out[l_0_out_dim]){
#pragma HLS INLINE off
#pragma HLS array_partition variable=out complete
#pragma HLS PIPELINE II=1
	for(int out_idx = 0; out_idx < l_0_out_dim; out_idx++){
		out[out_idx] = relu(out[out_idx]);
	}
}

void apply_relu_2(F_TYPE out[l_1_out_dim]){
#pragma HLS INLINE off
#pragma HLS array_partition variable=out complete
#pragma HLS PIPELINE II=1
	for(int out_idx = 0; out_idx < l_1_out_dim; out_idx++){
		out[out_idx] = relu(out[out_idx]);
	}
}

void top_example(F_TYPE in[l_0_in_dim], F_TYPE out[l_1_out_dim]){

	W_TYPE w_0[l_0_out_dim][l_0_in_dim];
	W_TYPE b_0[l_0_out_dim];

	F_TYPE layer_0_buffer[l_0_out_dim];

	W_TYPE w_1[l_1_out_dim][l_1_in_dim];
	W_TYPE b_1[l_1_out_dim];


	linear_1(in, layer_0_buffer, w_0, b_0);
	apply_relu_1(layer_0_buffer);

	linear_2(layer_0_buffer, out, w_1, b_1);
	apply_relu_2(out);
}

This example makes clear that as the complexity of your design increases to include more kernels with different fixed parameters, the redundancy and complexity of your code also increase. We have to redefine both the linear and apply_relu kernels for each dense layer in the model, since the size of each linear layer is different, which also changes the input and output sizes as data passes through the model. Using #define statements to statically define the layer sizes is a good idea; however, tracking which #define value goes with which function definition can get messy. Making any architectural change to your design, like changing layer parameters or adding/removing layers, may require significant refactoring and keeping track of which #define values to change in the function definitions unique to each layer. This makes high-level architecture exploration and optimization exploration more tedious and significantly less productive.

Quick Introduction to C++ Templates

Templates are a feature of the C++ language that give you a wide range of flexibility when writing C++ code and creating reusable function and class definitions. The core idea is that templates let you make a function or class parameterizable, with the parameters supplied and resolved at compile time.
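As a first sketch of where this is headed, the apply_relu() function from earlier can be written once as a template, with the array size as a compile-time parameter (shown here with float standing in for ap_fixed<32, 16> so the sketch compiles standalone):

```cpp
typedef float F_TYPE; // stand-in for ap_fixed<32, 16>

F_TYPE relu(F_TYPE in) {
    return in > 0 ? in : F_TYPE(0);
}

// Single definition: SIZE is a compile-time template parameter, so for
// every instantiation the loop bound and array size are fixed, and the
// HLS pragmas below can legally apply.
template <int SIZE>
void apply_relu(F_TYPE out[SIZE]) {
#pragma HLS INLINE off
#pragma HLS array_partition variable=out complete
#pragma HLS PIPELINE II=1
    for (int out_idx = 0; out_idx < SIZE; out_idx++) {
        out[out_idx] = relu(out[out_idx]);
    }
}
```

Calling apply_relu<32>(data_vector_1) and apply_relu<64>(data_vector_2) makes the language compiler stamp out a separate fixed-size function for each distinct SIZE, which is exactly what we wrote by hand as apply_relu_1() and apply_relu_2() above.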