How to make the data_flow control of a matrix multiplier synthesizable?


I have designed a matrix-vector multiplier with a systolic array architecture, and I finally got the simulation to work. Now that I want to synthesize the design, it seems that the data_flow control block (the always block) is not synthesizable. I think this is because it uses for loops with a variable number of iterations. Could you please give me some tips to make it synthesizable?

My design has 64 (8 by 8 fixed) processing elements (PEs) that perform the multiply-accumulate (MAC) operations for the matrix multiplication, and I use some or all of them depending on the input dimensions (up to 64 * 64). It gets the dimensions of the inputs (the matrix and the vector) from an external CPU. For example, if the (M * N) matrix has M=5 and N=5 and the vector is (N * 1), the result is an (M * 1) vector and M_val PEs are used, that is, 5. I use an activation signal to activate M_val PEs.

EDIT: With the following code I get the messages below from Design Compiler:

Warning: /local/home/synth/systolic.sv:46: Out of bounds bit select W_reg[64], valid bounds are [0:63]. (ELAB-312)
Error: /local/home/synth/systolic.sv:45: Loop exceeded maximum iteration limit. (ELAB-900)
*** Presto compilation terminated with 1 errors. ***

When I change the M_val upper bounds of the for loops to M and change m=cycle-N_val+1 to m = 1, the design synthesizes, but it does not simulate correctly (it does not produce the right result for the multiplication).

Here is my code:

module systolic #(parameter DW = 8,

         // fixed! not meant to be changed from outside
             parameter M = 8,
             parameter N = 8)
             (
                input clk,
                input reset,
                input reg[15:0] M_val,     // number of rows 
                input reg [15:0] N_val,    // number of columns
                input start_mult,

                input  [DW-1:0] W_i [0:M*N-1],
                input  [DW-1:0] X_i [0:N-1],
                
                output [2*DW:0] Y_o [0:M-1],
                output reg mult_done 
             );
              


reg [7:0] cycle;    // counts the cycles of multiplication process

reg  [DW-1:0] W_reg[0:M-1]; // regfile to hold W matrix elements
reg  [DW-1:0] X_reg;        // register to hold X vector elements at each cycle
integer m;

// ROW-MAJOR w_i
always @(posedge clk) begin
  if (!reset) begin
    cycle <= 8'd0;
    mult_done <= 1'b0;
    X_reg <= 8'd0;
    for(m = 0; m < M; m = m + 1) begin
      W_reg[m] <= 8'd0;
    end
  end
  else if (start_mult) begin
    if (cycle == (M_val + N_val)) begin   // the number of cycles needed for the multiplication
      mult_done <= 1'b1;
      cycle <= 8'd0;
      for(m = 0; m < M_val; m = m + 1) begin // if change to M_val -> M
        W_reg[m] <= 8'd0;
      end
    end else if (cycle < N_val) begin        // N_val is the number of times we have to shift X values
      X_reg <= X_i[cycle];
      for(m = 0; m < M_val; m = m + 1) begin // if change to M_val -> M
        W_reg[m] <= W_i[(cycle-m) + m*N_val];

      end
    end else begin // if (cycle >= N_val), X gets zeros: its elements have been shifted to the last PE
      X_reg <= 8'd0;
      
      // after N cycles the first PE is done processing, so the m index starts from 1, 
      // or we are feeding W elements to the PEs other than the first one.
      
       for (m=cycle-N_val+1; m < M_val; m = m + 1) begin // if M_val -> M and m=cycle-N_val+1 -> m=1, it will synthesize
        W_reg[m] <= W_i[(cycle-m) + m*N_val];
      end
    end
    cycle <= cycle + 8'd1;
  end
end
  

wire  [DW-1:0] Ws[1:0][0:M-1];
wire  [DW-1:0] Xs [0:M];


wire [2*DW:0] Ys [0:M-1];

reg [M-1:0] activate_reg;
wire activate_pe [M-1:0];

// at first all the PEs are activated.
always@(posedge reset or posedge start_mult) begin
    if(!reset)
      activate_reg = 64'hFFFF_FFFF_FFFF_FFFF;
    else if (M_val != 64)
      activate_reg  = (activate_reg >> M - M_val);
    else 
      activate_reg = 64'hFFFF_FFFF_FFFF_FFFF;
end




genvar i,j;

generate
  
  for (i = 0; i < M; i = i + 1) begin: PE_activators
    assign activate_pe[i] = activate_reg[i];
  end 
  
  for (i = 0; i < M; i = i + 1) begin: Weights
    assign Ws[0][i] = W_reg[i];
  end 

  assign Xs[0] = X_reg;

  
  for (i = 0; i < 8; i = i + 1) begin: ROWs
    for (j = 0; j < 8; j = j + 1) begin: COLs
     PE #(DW)
         pe (
            .clk(clk),
            .reset(reset),
            .activate(activate_pe[i*8+j]),
            .w_i(Ws[0][i*8+j]),
            .x_i(Xs[i*8+j]),
            .w_o(),            // This can be float
            .x_o(Xs[i*8+j+1]),
            .mac(Ys[i*8+j])
         );
    end
  end
  
endgenerate

assign Y_o = Ys;

endmodule

module PE #(parameter DW = 8)
       (
        input clk,
        input reset,
        input activate,
        
        input  [DW-1:0]w_i,
        input  [DW-1:0]x_i,
        
        output reg  [DW-1:0] w_o,
        output reg  [DW-1:0] x_o,

        output reg [2*DW:0] mac
       );

wire  [2*DW:0] multiply = w_i * x_i;
always @(posedge clk) begin
  if(!reset) begin
    w_o <= {DW{1'b0}};
    x_o <= {DW{1'b0}};
    mac <= {(2*DW +1){1'b0}};
  end
  else begin
    if(activate == 1) begin
      w_o <= w_i;
      x_o <= x_i;
      mac <= mac + multiply;

    end
  end

end

endmodule

So for a (4*4) matrix W and a (4*1) vector X,

W = {w44,w43,w42,w41, w34,w33,w32,w31, w24,w23,w22,w21, w14,w13,w12,w11};

X = {x4, x3, x2, x1};

the dataflow/Timing for 4 PEs would be like this (please let me know if you need more info):

[Image: dataflow/timing diagram for 4 PEs]


Justin N On

Change your loops to be a constant number of iterations and use an if statement to control what happens. E.g. change this:

        for (m=cycle-N_val+1; m < M_val; m = m + 1)
            W_reg[m] <= W_i[(cycle-m) + m*N_val];

to this:

        for (m = 0; m < M; m = m + 1)
            if ((m >= cycle-N_val+1) && (m < M_val))
                W_reg[m] <= W_i[(cycle-m) + m*N_val];

Or, if the W_reg entries outside that range do not need to keep their old values, just take the if out completely.
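Applying the same pattern to all three loops in the question's always block might look like the sketch below. The module name dataflow_ctrl and the choice to expose W_reg and X_reg as outputs are assumptions made only to keep the example self-contained. The statement order follows the question's code, so it is intended to simulate the same way, but it has not been run through Design Compiler.

```systemverilog
// Sketch only: the question's data-flow control with constant loop bounds.
// dataflow_ctrl is a hypothetical name; W_reg/X_reg are outputs here (assumption).
module dataflow_ctrl #(parameter DW = 8,
                       parameter M  = 8,
                       parameter N  = 8)
(
  input                 clk,
  input                 reset,        // active low, as in the question
  input      [15:0]     M_val,        // number of rows actually used
  input      [15:0]     N_val,        // number of columns actually used
  input                 start_mult,
  input      [DW-1:0]   W_i [0:M*N-1],
  input      [DW-1:0]   X_i [0:N-1],
  output reg [DW-1:0]   W_reg [0:M-1],
  output reg [DW-1:0]   X_reg,
  output reg            mult_done
);

  reg [7:0] cycle;
  integer   m;

  always @(posedge clk) begin
    if (!reset) begin
      cycle     <= 8'd0;
      mult_done <= 1'b0;
      X_reg     <= {DW{1'b0}};
      for (m = 0; m < M; m = m + 1)
        W_reg[m] <= {DW{1'b0}};
    end
    else if (start_mult) begin
      if (cycle == (M_val + N_val)) begin
        mult_done <= 1'b1;
        cycle     <= 8'd0;
        // Constant bound M; the run-time limit M_val moves into the condition.
        for (m = 0; m < M; m = m + 1)
          if (m < M_val)
            W_reg[m] <= {DW{1'b0}};
      end
      else if (cycle < N_val) begin
        X_reg <= X_i[cycle];
        for (m = 0; m < M; m = m + 1)
          if (m < M_val)
            W_reg[m] <= W_i[(cycle - m) + m*N_val];
      end
      else begin
        X_reg <= {DW{1'b0}};
        // The variable start index also becomes part of the condition.
        for (m = 0; m < M; m = m + 1)
          if ((m >= cycle - N_val + 1) && (m < M_val))
            W_reg[m] <= W_i[(cycle - m) + m*N_val];
      end
      // Kept last, as in the question: being the final nonblocking assignment
      // to cycle, it overrides the "cycle <= 8'd0" in the first branch.
      cycle <= cycle + 8'd1;
    end
  end

endmodule
```

Every loop now unrolls to exactly M iterations at elaboration time, so the index m can never reach M, which is what the out-of-bounds W_reg[64] warning was complaining about.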

Also note that while this code looks simple, it's creating lots of multiplexers, multipliers and other things that need to complete in a single clock cycle. You might need to break this down into a longer pipeline to get good performance out of it.
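As one possible first step toward a longer pipeline, the product in each PE could be registered before it is accumulated. The sketch below is a two-stage variant of the question's PE; the module name PE_pipelined and the registers product_q and activate_q are hypothetical. It assumes the surrounding control can tolerate the final mac value appearing one clock later than in the original PE.

```systemverilog
// Sketch only: a two-stage version of the PE from the question.
// PE_pipelined, product_q and activate_q are hypothetical names.
module PE_pipelined #(parameter DW = 8)
(
  input                 clk,
  input                 reset,      // active low, as in the question
  input                 activate,
  input      [DW-1:0]   w_i,
  input      [DW-1:0]   x_i,
  output reg [DW-1:0]   w_o,
  output reg [DW-1:0]   x_o,
  output reg [2*DW:0]   mac
);

  reg [2*DW-1:0] product_q;   // stage 1: registered product of w_i and x_i
  reg            activate_q;  // activate delayed by one clock to match product_q

  always @(posedge clk) begin
    if (!reset) begin
      w_o        <= {DW{1'b0}};
      x_o        <= {DW{1'b0}};
      product_q  <= {(2*DW){1'b0}};
      activate_q <= 1'b0;
      mac        <= {(2*DW+1){1'b0}};
    end
    else begin
      // Stage 1: forward the operands and register the product.
      activate_q <= activate;
      if (activate) begin
        w_o       <= w_i;
        x_o       <= x_i;
        product_q <= w_i * x_i;
      end
      // Stage 2: accumulate the product captured on the previous clock.
      if (activate_q)
        mac <= mac + product_q;
    end
  end

endmodule
```

Splitting the multiply and the add across two registers shortens the longest combinational path inside the PE. Because mac now lags by one clock, the cycle count that raises mult_done would likely need to grow by one as well.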