I have designed a matrix-vector multiplier with a systolic array architecture, and I finally got the simulation to work. Now that I want to synthesize the design, it seems that the data_flow control block (the always block) is not synthesizable. I think this is because it uses for loops with a variable number of iterations. Could you please give me some tips to make it synthesizable?
My design has 64 (8 by 8 fixed) processing elements (PEs) that do the multiply-and-accumulate (MAC) operations for the matrix multiplication, and I use a subset or all of them depending on the input dimensions (up to 64 * 64). It gets the dimensions of the inputs (the matrix and the vector) from an external CPU. For example, if the (M * N) matrix has M = 5 and N = 5 and the vector is (N * 1), the result will be an (M * 1) vector and we will use M_val PEs, that is, 5. I use an activation signal to activate M_val PEs.
EDIT: With the following code I get the messages below from Design Compiler:
Warning: /local/home/synth/systolic.sv:46: Out of bounds bit select W_reg[64], valid bounds are [0:63]. (ELAB-312)
Error: /local/home/synth/systolic.sv:45: Loop exceeded maximum iteration limit. (ELAB-900)
*** Presto compilation terminated with 1 errors. ***
When I change the M_val upper bounds of the for loops to M and change m = cycle-N_val+1 to m = 1, it synthesizes, but it does not simulate correctly (it does not produce the right result for the multiplication).
Here is my code:
module systolic #(parameter DW = 8,
// fixed! not meant to be changed from outside
parameter M = 8,
parameter N = 8)
(
input clk,
input reset,
input reg[15:0] M_val, // number of rows
input reg [15:0] N_val, // number of columns
input start_mult,
input [DW-1:0] W_i [0:M*N-1],
input [DW-1:0] X_i [0:N-1],
output [2*DW:0] Y_o [0:M-1],
output reg mult_done
);
reg [7:0] cycle; // counts the cycles of multiplication process
reg [DW-1:0] W_reg[0:M-1]; // regfile to hold W matrix elements
reg [DW-1:0] X_reg; // register to hold X vector elements at each cycle
integer m;
// ROW-MAJOR w_i
always @(posedge clk) begin
if (!reset) begin
cycle <= 8'd0;
mult_done <= 1'b0;
X_reg <= 8'd0;
for(m = 0; m < M; m = m + 1) begin
W_reg[m] <= 8'd0;
end
end
else if (start_mult) begin
if (cycle == (M_val + N_val)) begin // the number of cycles needed for the multiplication
mult_done <= 1'b1;
cycle <= 8'd0;
for(m = 0; m < M_val; m = m + 1) begin // if change to M_val -> M
W_reg[m] <= 8'd0;
end
end else if (cycle < N_val) begin // N_val is the number of times we have to shift X values
X_reg <= X_i[cycle];
for(m = 0; m < M_val; m = m + 1) begin // if change to M_val -> M
W_reg[m] <= W_i[(cycle-m) + m*N_val];
end
end else begin // if (cycle >= N), X gets zeros; its elements have been shifted to the last PE
X_reg <= 8'd0;
// after N cycles the first PE is done processing, so the m index starts from 1,
// or we are feeding W elements to the PEs other than the first one.
for (m=cycle-N_val+1; m < M_val; m = m + 1) begin // if M_val -> M and m=cycle-N_val+1 -> m=1, it will synthesize
W_reg[m] <= W_i[(cycle-m) + m*N_val];
end
end
cycle <= cycle + 8'd1;
end
end
wire [DW-1:0] Ws[1:0][0:M-1];
wire [DW-1:0] Xs [0:M];
wire [2*DW:0] Ys [0:M-1];
reg [M-1:0] activate_reg;
wire activate_pe [M-1:0];
// at first all the PEs are activated.
always@(posedge reset or posedge start_mult) begin
if(!reset)
activate_reg = 64'hFFFF_FFFF_FFFF_FFFF;
else if (M_val != 64)
activate_reg = (activate_reg >> M - M_val);
else
activate_reg = 64'hFFFF_FFFF_FFFF_FFFF;
end
genvar i,j;
generate
for (i = 0; i < M; i = i + 1) begin: PE_activators
assign activate_pe[i] = activate_reg[i];
end
for (i = 0; i < M; i = i + 1) begin: Weights
assign Ws[0][i] = W_reg[i];
end
assign Xs[0] = X_reg;
for (i = 0; i < 8; i = i + 1) begin: ROWs
for (j = 0; j < 8; j = j + 1) begin: COLs
PE #(DW)
pe (
.clk(clk),
.reset(reset),
.activate(activate_pe[i*8+j]),
.w_i(Ws[0][i*8+j]),
.x_i(Xs[i*8+j]),
.w_o(), // This can be float
.x_o(Xs[i*8+j+1]),
.mac(Ys[i*8+j])
);
end
end
endgenerate
assign Y_o = Ys;
endmodule
module PE #(parameter DW = 8)
(
input clk,
input reset,
input activate,
input [DW-1:0]w_i,
input [DW-1:0]x_i,
output reg [DW-1:0] w_o,
output reg [DW-1:0] x_o,
output reg [2*DW:0] mac
);
wire [2*DW:0] multiply = w_i * x_i;
always @(posedge clk) begin
if(!reset) begin
w_o <= {DW{1'b0}};
x_o <= {DW{1'b0}};
mac <= {(2*DW +1){1'b0}};
end
else begin
if(activate == 1) begin
w_o <= w_i;
x_o <= x_i;
mac <= mac + multiply;
end
end
end
endmodule
So for a (4*4) matrix W and a (4*1) vector X,
W = {w44,w43,w42,w41, w34,w33,w32,w31, w24,w23,w22,w21, w14,w13,w12,w11};
X = {x4, x3, x2, x1};
the dataflow/Timing for 4 PEs would be like this (please let me know if you need more info):

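Change your loops to have a constant number of iterations and use an if statement inside the loop body to control what happens. For example, replace a loop bound of M_val with the constant M and wrap the assignment in if (m < M_val).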
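As a rough sketch of that constant-bound pattern, here is how the three loops from the question might look. The signal and parameter names (M, N, M_val, N_val, cycle, W_i, W_reg) come from the question; the module name w_feed_sketch and the choice to pull the loops out into their own module are assumptions made only to keep the example self-contained. The index expression is left exactly as in the question.

```systemverilog
// Sketch only: w_feed_sketch is a hypothetical wrapper, not part of the
// original design. Every loop runs a constant M iterations, so the tool
// can unroll it; the if statements select which copies take effect.
module w_feed_sketch #(
  parameter DW = 8,
  parameter M  = 8,   // fixed number of PEs, known at elaboration time
  parameter N  = 8
)(
  input               clk,
  input      [15:0]   M_val,            // run-time number of rows in use
  input      [15:0]   N_val,            // run-time number of columns in use
  input      [7:0]    cycle,
  input      [DW-1:0] W_i   [0:M*N-1],
  output reg [DW-1:0] W_reg [0:M-1]
);
  integer m;

  always @(posedge clk) begin
    if (cycle == (M_val + N_val)) begin
      // Was: for (m = 0; m < M_val; m = m + 1)
      for (m = 0; m < M; m = m + 1) begin
        if (m < M_val)
          W_reg[m] <= {DW{1'b0}};
      end
    end else if (cycle < N_val) begin
      // Was: for (m = 0; m < M_val; m = m + 1)
      for (m = 0; m < M; m = m + 1) begin
        if (m < M_val)
          W_reg[m] <= W_i[(cycle - m) + m * N_val];
      end
    end else begin
      // Was: for (m = cycle - N_val + 1; m < M_val; m = m + 1)
      // The variable start value becomes a lower-bound test in the guard.
      // This branch is only reached when cycle >= N_val, so the
      // subtraction does not wrap around.
      for (m = 0; m < M; m = m + 1) begin
        if (m >= cycle - N_val + 1 && m < M_val)
          W_reg[m] <= W_i[(cycle - m) + m * N_val];
      end
    end
  end
endmodule
```

Simply changing the bound to M and the start to m = 1 without the guards, as tried in the question, makes every unrolled copy of the assignment fire on every cycle, which would explain why that version synthesizes but gives the wrong result.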
Or, if you don't need the W_reg values to remain unchanged, just take the if out completely.

Also note that while this code looks simple, it creates lots of multiplexers, multipliers and other logic that must complete in a single clock cycle. You might need to break this down into a longer pipeline to get good performance out of it.
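One possible way to lengthen the pipeline, sketched here for the PE from the question, is to register the product in one cycle and accumulate it in the next. The module name PE_pipelined and the internal registers product_q and valid_q are hypothetical; whether the multiplier is really the critical path depends on your library and clock target, so treat this as an illustration rather than a recommendation for this exact design.

```systemverilog
// Sketch only: a two-stage variant of the question's PE (w_o omitted,
// since it is left unconnected there). Reset is active low, as in the
// original PE.
module PE_pipelined #(parameter DW = 8)
(
  input                clk,
  input                reset,
  input                activate,
  input      [DW-1:0]  w_i,
  input      [DW-1:0]  x_i,
  output reg [DW-1:0]  x_o,
  output reg [2*DW:0]  mac
);
  reg [2*DW-1:0] product_q;  // stage-1 register holding w_i * x_i
  reg            valid_q;    // marks product_q as holding a fresh product

  always @(posedge clk) begin
    if (!reset) begin
      x_o       <= {DW{1'b0}};
      product_q <= {(2*DW){1'b0}};
      valid_q   <= 1'b0;
      mac       <= {(2*DW+1){1'b0}};
    end else begin
      // Stage 1: multiply and forward x to the next PE.
      valid_q <= activate;
      if (activate) begin
        x_o       <= x_i;
        product_q <= w_i * x_i;
      end
      // Stage 2: accumulate the product captured one cycle earlier.
      if (valid_q)
        mac <= mac + product_q;
    end
  end
endmodule
```

The accumulated result is valid one cycle later than in the single-stage PE, so the cycle count that raises mult_done would need to grow by one to match.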