Matrix-Vector Multiplication (MVM) Engine

A high-performance SystemVerilog hardware accelerator for matrix-vector multiplication, inspired by Microsoft's BrainWave deep learning architecture and built around a pipelined datapath, accumulators, and FSM control.

πŸš€ Project Overview

This project implements a complete digital hardware system for matrix-vector multiplication, featuring memory management, pipelined datapath components, and FSM-based control logic. The design targets timing closure at 150+ MHz on FPGA platforms.

Key Features

  • Fully Pipelined Architecture: Optimized for high throughput
  • Parameterizable Design: Configurable bit widths and memory depths
  • Scalable Compute Lanes: Variable number of output lanes (OLANES)
  • Memory-Mapped Interface: Efficient data loading and computation orchestration
  • Hardware Acceleration: Architecture modeled on commercial deep learning accelerators

πŸ“ Project Structure

mvm-engine/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ dot8.sv          # 8-lane dot product unit (pipelined)
β”‚   β”œβ”€β”€ accum.sv         # Accumulator with control logic
β”‚   β”œβ”€β”€ ctrl.sv          # FSM-based controller
β”‚   β”œβ”€β”€ mvm.sv           # Top-level MVM engine
β”‚   └── mem.sv           # Dual-port memory blocks (provided)
β”œβ”€β”€ testbench/
β”‚   └── mvm_tb.sv        # Comprehensive testbench
β”œβ”€β”€ constraints/
β”‚   └── constraints.xdc  # Timing constraints for synthesis
└── README.md

πŸ—οΈ Architecture

System Components

  1. Dot Product Unit (dot8.sv), sketched after this list

    • 8-element vector dot product computation
    • Fully pipelined with binary reduction tree
    • Configurable input/output bit widths
  2. Accumulator (accum.sv)

    • Signed integer accumulation with overflow protection
    • First/last signal control for accumulation sequences
    • Configurable accumulation register width
  3. Controller (ctrl.sv)

    • Two-state FSM (IDLE/COMPUTE)
    • Memory address generation and sequencing
    • Control signal orchestration for datapath
  4. Memory System (mem.sv)

    • Dual-port memory blocks (1 read + 1 write port)
    • 2-cycle read latency
    • Parameterizable depth and data width
  5. Top-Level Integration (mvm.sv)

    • Vector memory + NUM_OLANES compute lanes
    • Each lane: Matrix memory + Dot product + Accumulator
    • Round-robin matrix row distribution
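
The exact contents of dot8.sv are implementation-specific, but a minimal sketch of an 8-element dot product with a registered multiply stage followed by a three-level binary reduction tree (port names and register placement here are illustrative assumptions, not the actual dot8.sv interface) could look like:

// Illustrative sketch only: one multiply stage plus a logβ‚‚(8) = 3 level
// adder tree, all registered. Not the actual dot8.sv.
module dot8_sketch #(
    parameter IWIDTH = 8,
    parameter OWIDTH = 32
) (
    input  logic clk,
    input  logic signed [IWIDTH-1:0] i_vec [0:7],
    input  logic signed [IWIDTH-1:0] i_mat [0:7],
    output logic signed [OWIDTH-1:0] o_result
);
    logic signed [OWIDTH-1:0] prod [0:7]; // stage 0: elementwise products
    logic signed [OWIDTH-1:0] sum4 [0:3]; // stage 1: 8 -> 4
    logic signed [OWIDTH-1:0] sum2 [0:1]; // stage 2: 4 -> 2

    always_ff @(posedge clk) begin
        for (int i = 0; i < 8; i++) prod[i] <= i_vec[i] * i_mat[i];
        for (int i = 0; i < 4; i++) sum4[i] <= prod[2*i] + prod[2*i+1];
        for (int i = 0; i < 2; i++) sum2[i] <= sum4[2*i] + sum4[2*i+1];
        o_result <= sum2[0] + sum2[1];    // stage 3: 2 -> 1
    end
endmodule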

Data Layout

The engine uses an optimized memory layout where:

  • Vector data: Stored as consecutive 8-element words
  • Matrix data: Rows distributed across compute lanes in round-robin fashion
  • Output: Parallel computation of result vector elements
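
As a concrete example of the round-robin mapping (expressed here as hypothetical helper functions, not actual project code): matrix row r lives in the matrix memory of lane r % NUM_OLANES, at offset r / NUM_OLANES within that memory.

// Hypothetical helpers illustrating the round-robin row mapping.
function automatic int lane_of_row(int row, int num_olanes);
    return row % num_olanes;   // which compute lane stores this row
endfunction

function automatic int row_offset_in_lane(int row, int num_olanes);
    return row / num_olanes;   // the row's index within that lane's memory
endfunction

With NUM_OLANES = 4, rows 0, 4, 8, ... land in lane 0; rows 1, 5, 9, ... in lane 1; and so on, so all lanes compute their output elements in parallel.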

βš™οΈ Parameters

Parameter        Description                       Default
IWIDTH           Input element bit width           8
OWIDTH           Output element bit width          32
NUM_OLANES       Number of output compute lanes    4
MEM_DATAW        Memory data width                 64
VEC_MEM_DEPTH    Vector memory depth               1024
MAT_MEM_DEPTH    Matrix memory depth               1024
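
The address and size widths used in the interface specification later in this README (VEC_ADDRW, MAT_ADDRW, VEC_SIZEW, MAT_SIZEW) are presumably derived from these depth parameters; a typical derivation, given here as an assumption rather than the literal mvm.sv code, is:

// Assumed derivations; the actual mvm.sv may differ.
localparam VEC_ADDRW = $clog2(VEC_MEM_DEPTH);     // 10 bits for depth 1024
localparam MAT_ADDRW = $clog2(MAT_MEM_DEPTH);     // 10 bits for depth 1024
localparam VEC_SIZEW = $clog2(VEC_MEM_DEPTH) + 1; // can count the full depth
localparam MAT_SIZEW = $clog2(MAT_MEM_DEPTH) + 1;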

πŸ”§ Getting Started

Prerequisites

  • Xilinx Vivado 2023.1 or later
  • SystemVerilog simulation tools
  • PYNQ board (for hardware deployment)

Setup Instructions

  1. Create Vivado Project

    # Create new project in Vivado
    # Add all .sv files to project sources
    # Add constraints.xdc to constraints
  2. Simulation

    # Set mvm_tb.sv as top-level testbench
    # Run behavioral simulation
    # Verify functionality with provided test vectors
  3. Synthesis & Implementation

    # Set timing goal to 150+ MHz
    # Use "-mode out_of_context" for maximum performance
    # Monitor timing closure and resource utilization

πŸ§ͺ Testing

The project includes a comprehensive testbench (mvm_tb.sv) that:

  • Generates random test vectors and matrices
  • Configures memory layout automatically
  • Verifies results against golden reference
  • Measures performance metrics

Running Tests

// The testbench automatically:
// 1. Writes random data to vector/matrix memories
// 2. Configures start addresses and sizes
// 3. Initiates computation
// 4. Compares results with expected values
// 5. Reports pass/fail status
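
A condensed stimulus sequence along these lines (signal names follow the port list below; random_word(), NUM_WORDS, and check_results() are hypothetical placeholders, not the literal contents of mvm_tb.sv) might be:

// Illustrative stimulus flow; mvm_tb.sv's actual tasks may differ.
initial begin
    rst = 1'b1;
    repeat (4) @(posedge clk);
    rst = 1'b0;

    // 1. Write one 8-element word of vector data per cycle.
    //    (Matrix loading, via the i_mat_* ports, follows the same pattern.)
    for (int w = 0; w < NUM_WORDS; w++) begin
        i_vec_wdata = random_word();   // hypothetical helper
        i_vec_waddr = w;
        i_vec_wen   = 1'b1;
        @(posedge clk);
    end
    i_vec_wen = 1'b0;

    // 2. Configure start addresses/sizes and pulse start.
    i_vec_start_addr = '0;
    i_vec_num_words  = NUM_WORDS;
    i_start          = 1'b1;
    @(posedge clk);
    i_start          = 1'b0;

    // 3. Wait for results and compare against the golden model.
    wait (o_valid);
    check_results();                   // hypothetical checker task
end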

πŸ“Š Performance Optimization

Timing Goals

  • Target Frequency: 150+ MHz
  • Throughput: Maximized for 512x512 matrices
  • Resource Utilization: Optimized for PYNQ FPGA

Optimization Strategies

  1. Pipeline Depth: Balanced for frequency vs. latency
  2. Parallelism: Configurable compute lanes
  3. Memory Banking: Distributed matrix storage
  4. Control Logic: Minimal FSM overhead

πŸ† Bonus Challenge

Achieve maximum throughput by:

  • Optimizing NUM_OLANES for target FPGA
  • Maximizing operating frequency
  • Minimizing computation cycles for 512x512 MVM
  • Using out-of-context synthesis for best results
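
For a sense of scale, assuming the default parameters above: with IWIDTH = 8 and MEM_DATAW = 64, each memory word packs 8 elements, so a 512-element row takes 512 / 8 = 64 words. With NUM_OLANES = 4, each lane handles 512 / 4 = 128 rows, giving roughly 128 Γ— 64 = 8192 compute cycles per 512x512 MVM (plus pipeline fill and drain). Doubling NUM_OLANES roughly halves that count, which is why lane count and clock frequency are the two main throughput levers.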

πŸ“‹ Interface Specifications

Top-Level Ports

module mvm #(
    parameter IWIDTH = 8,
    parameter OWIDTH = 32,
    parameter NUM_OLANES = 4,
    // ... other parameters
) (
    input  logic clk,
    input  logic rst,
    
    // Vector memory interface
    input  logic [MEM_DATAW-1:0] i_vec_wdata,
    input  logic [VEC_ADDRW-1:0] i_vec_waddr,
    input  logic i_vec_wen,
    
    // Matrix memory interface  
    input  logic [MEM_DATAW-1:0] i_mat_wdata,
    input  logic [MAT_ADDRW-1:0] i_mat_waddr,
    input  logic [NUM_OLANES-1:0] i_mat_wen,
    
    // Control interface
    input  logic i_start,
    input  logic [VEC_ADDRW-1:0] i_vec_start_addr,
    input  logic [VEC_SIZEW-1:0] i_vec_num_words,
    input  logic [MAT_ADDRW-1:0] i_mat_start_addr,
    input  logic [MAT_SIZEW-1:0] i_mat_num_rows_per_olane,
    
    // Output interface
    output logic o_busy,
    output logic [OWIDTH-1:0] o_result [0:NUM_OLANES-1],
    output logic o_valid
);
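
A minimal instantiation with overridden parameters (connection names are illustrative; the elided ports follow the header above) might look like:

// Example instantiation; parameter names follow the module header.
mvm #(
    .IWIDTH     (8),
    .OWIDTH     (32),
    .NUM_OLANES (8)    // widen to 8 compute lanes
) mvm_inst (
    .clk      (clk),
    .rst      (rst),
    // ... vector/matrix write ports and control inputs as listed above
    .o_busy   (busy),
    .o_result (result),   // [OWIDTH-1:0] result [0:NUM_OLANES-1]
    .o_valid  (valid)
);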

πŸ” Implementation Details

Controller FSM

IDLE ──start──→ COMPUTE
 ↑                 β”‚
 └─────doneβ”€β”€β”€β”€β”€β”€β”€β”€β”˜

IDLE State: Register input parameters, clear outputs
COMPUTE State: Generate addresses, sequence operations
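
A skeletal version of this two-state controller (the state encoding and the done condition are assumptions; the real ctrl.sv also generates memory addresses and datapath control signals while in COMPUTE):

// Two-state FSM skeleton; illustrative only.
typedef enum logic { IDLE, COMPUTE } state_t;
state_t state;
logic   done;  // asserted once every address has been issued

always_ff @(posedge clk) begin
    if (rst) state <= IDLE;
    else begin
        unique case (state)
            IDLE:    if (i_start) state <= COMPUTE; // latch addresses/sizes here
            COMPUTE: if (done)    state <= IDLE;
        endcase
    end
end

assign o_busy = (state == COMPUTE);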

Pipeline Stages

  1. Memory Read (2 cycles latency)
  2. Dot Product (logβ‚‚(8) = 3 pipeline stages)
  3. Accumulation (1 cycle)
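
End to end, a word therefore spends roughly 2 + 3 + 1 = 6 cycles in flight (the exact figure depends on where dot8.sv places its registers), which the controller must account for before declaring a computation done.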

πŸ› Debugging Tips

  1. Simulation Waveforms: Primary debugging approach
  2. Unit Testing: Create individual testbenches for each module
  3. Timing Analysis: Check critical paths in synthesis reports
  4. Memory Latency: Account for 2-cycle read delay in controller

πŸ“š References

  • Microsoft BrainWave Architecture
  • ECE 327 Digital Hardware Systems Course
  • Xilinx UltraScale+ FPGA Documentation

πŸ‘₯ Contributors

Developed by Talha Amir