Micro-Arch Notes

Custom CSRs Available in C-Class

The C-Class includes the following custom CSRs, implemented in the non-standard CSR address space, for additional control and special features.

Custom Control CSR (0x800)

This CSR is used to enable or disable the caches, the branch predictor, and arithmetic exceptions at run time.

| Bit Position | Reset Value | Description |
|--------------|-------------|-------------|
| 0 | from config | Enable or disable the data cache. |
| 1 | from config | Enable or disable the instruction cache. |
| 2 | from config | Enable or disable the branch predictor. |
| 3 | Disabled | Enable or disable arithmetic exceptions. |
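
For illustration, a minimal sketch of toggling these bits at run time, assuming a set bit enables the corresponding feature (the table does not state the polarity explicitly):

```asm
csrr  t0, 0x800      # read the custom control CSR
andi  t0, t0, -5     # clear bit 2 (branch predictor); -5 == ~0x4
csrw  0x800, t0      # write back: branch predictor disabled
csrsi 0x800, 0x8     # set bit 3: enable arithmetic exceptions
```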

dtvec CSR (0x7c0)

An XLEN-bit register that holds the address of the debug loop to which the core jumps when the debugger halts it.

denable CSR (0x7c1)

A 1-bit CSR indicating whether the debugger is allowed to halt the core.

mhpminterrupten CSR (0x7c2)

An XLEN-bit register that follows the same encoding as mcounteren/mcountinhibit. A bit set to 1 indicates that an interrupt will be generated when the corresponding counter reaches 0. More details on using this register are available [here](../docs/performance_counters.md#interrupts-from-counters).

dtim base address CSR (0x7c3)

An XLEN-bit register holding the base address of the data tightly-integrated scratch memory. This address corresponds to the physical address space, not the virtual one.

dtim bound address CSR (0x7c4)

An XLEN-bit register holding the bound address of the data tightly-integrated scratch memory. This address corresponds to the physical address space, not the virtual one.

itim base address CSR (0x7c5)

An XLEN-bit register holding the base address of the instruction tightly-integrated scratch memory. This address corresponds to the physical address space, not the virtual one.

itim bound address CSR (0x7c6)

An XLEN-bit register holding the bound address of the instruction tightly-integrated scratch memory. This address corresponds to the physical address space, not the virtual one.
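
For illustration, a hedged sketch of programming a DTIM region through these CSRs (the addresses and size are hypothetical; only the CSR numbers come from this document, and whether the bound is inclusive is not specified here):

```asm
li    t0, 0x90000000     # hypothetical DTIM base (physical address)
csrw  0x7c3, t0          # dtim base address CSR
li    t1, 0x90010000     # hypothetical bound: base + 64 KiB
csrw  0x7c4, t1          # dtim bound address CSR
# the itim base/bound CSRs (0x7c5/0x7c6) are programmed analogously
```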

Cause Values for Arithmetic Traps

When the core is configured with fpu_trap: True, six arithmetic exceptions are added as an extension to the standard exceptions defined in the RISC-V specification. Five of these are the floating-point exceptions specified by the IEEE 754 standard.

| Description | Cause Value |
|-------------|-------------|
| Integer divide by zero | 17 |
| Floating-point invalid operation | 18 |
| Floating-point divide by zero | 19 |
| Floating-point overflow | 20 |
| Floating-point underflow | 21 |
| Floating-point inexact | 22 |
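
As a hedged sketch, a machine-mode trap handler can distinguish these causes by comparing mcause against the range above (assumes mtvec already points at handle_trap; the label names are illustrative):

```asm
handle_trap:
    csrr  t0, mcause         # read the trap cause
    li    t1, 17
    blt   t0, t1, not_arith  # causes below 17 (and interrupts, whose set MSB
                             # makes mcause negative) are not arithmetic traps
    li    t1, 22
    bgt   t0, t1, not_arith  # causes above 22 are not arithmetic traps
    # ... handle the arithmetic exception, e.g. cause 17 = integer divide by zero
not_arith:
    # ... standard trap handling path
```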

Performance Monitors

Introduction

The RISC-V privileged specification (v1.12) currently describes a basic hardware performance-monitoring facility at the hart (core) level.

Three counters with dedicated functions have been defined:

| Address | Name | Description |
|---------|------|-------------|
| 0xB00 | mcycle | Counts the number of cycles executed by the hart, starting from an arbitrary point in time. |
| 0xB02 | minstret | Counts the number of instructions retired by the hart, starting from an arbitrary point in time. |
| 0xC01 | time | A read-only CSR that returns the value of the platform's memory-mapped real-time counter (mtime). |

Each of the above is a 64-bit counter. Read-only shadow CSRs of these counters (cycle, time, instret) also exist in the user-level CSR space.
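
For example, the shadows can be read from user mode with the standard read-only pseudo-instructions:

```asm
rdcycle   t0    # user-level shadow of mcycle
rdtime    t1    # user-level shadow of the real-time counter
rdinstret t2    # user-level shadow of minstret
```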

Apart from the above, RISC-V also provides for 29 additional 64-bit event counters, mhpmcounter3 through mhpmcounter31, along with their event selectors, mhpmevent3 through mhpmevent31. The meaning of these events is platform-defined and can be customized for each platform.

In addition, RISC-V defines a single 32-bit counter-enable register, mcounteren. Each bit in this register corresponds to one of the 32 counters described above. This register controls only the accessibility of the counter registers and has no effect on the underlying counters, which continue to increment irrespective of the mcounteren settings.

Clearing a bit in mcounteren only means that the corresponding counter cannot be accessed from lower privilege modes. Similar functionality is provided by the scounteren register when S-mode is supported.
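
As a minimal sketch, programming one of these counters from M-mode and exposing it to lower privilege modes could look as follows (event code 1, mispredictions, is taken from the event list later in this section):

```asm
csrwi mhpmevent3, 1        # track event 1 (mispredictions)
csrw  mhpmcounter3, x0     # clear the counter
# ... run the region of interest ...
csrr  t0, mhpmcounter3     # read the accumulated event count
li    t1, 0x8              # bit 3 corresponds to mhpmcounter3
csrs  mcounteren, t1       # allow lower privilege modes to read it
```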

Overhead Analysis

  1. Each event-counter is mapped to a CSR address, and all counters are read-write CSRs. Thus each 64-bit counter needs an additional 12-bit decoder to select it during a CSR read/write op.
  2. Since all CSRs are accessed in the write-back stage of the C-Class core, the 12-bit address from this stage fans out to all CSRs. Since the event-counters are implemented as 64-bit adders, the fan-out load is further increased as they become part of the CSR read/write op.
  3. Furthermore, suppose there are 30 events defined by the core/platform and each event-counter is configurable to track any of the 30 events. This leads to an additional 30:1 mux on each event-counter.

All three factors listed above can cause the event-counters to become critical in terms of area and frequency closure.

Possible solutions

  1. To address issues 1 and 2 listed above, it is possible to implement the CSRs as a daisy-chain, as shown below:

![daisy chain](./figs/daisy-chain-csrs.png)

Here the CSRs are grouped based on their functionality, and accesses to CSRs can thus take a variable number of cycles. For example, less frequently accessed CSRs such as fcsr, the *scratch registers, or the debug registers can be placed in GRP-2 or GRP-3, while performance counters and status registers can be placed in GRP-1 to enable fast access. Such daisy chaining reduces the comparator fan-out when performing CSR read/write ops.

  2. To address the 3rd issue from the above list, it is proposed to split the events into groups and have each counter track only the events within a specific group. This strategy is elaborated in the next section.

List of Events for C-class

The C-Class core will support capturing the following 30 events:

| Event Number | Description |
|--------------|-------------|
| 1 | Number of mispredictions |
| 2 | Number of exceptions |
| 3 | Number of interrupts |
| 4 | Number of CSR ops |
| 5 | Number of jumps |
| 6 | Number of branches |
| 7 | Number of floating-point ops |
| 8 | Number of mul/div ops |
| 9 | Number of RAW stalls |
| 10 | Number of execution stalls |
| 11 | Number of icache_access |
| 12 | Number of icache_miss |
| 13 | Number of icache_fbhit |
| 14 | Number of icache_ncaccess |
| 15 | Number of icache_fbrelease |
| 16 | Number of dcache_read_access |
| 17 | Number of dcache_write_access |
| 18 | Number of dcache_atomic_access |
| 19 | Number of dcache_nc_read_access |
| 20 | Number of dcache_nc_write_access |
| 21 | Number of dcache_read_miss |
| 22 | Number of dcache_write_miss |
| 23 | Number of dcache_atomic_miss |
| 24 | Number of dcache_read_fb_hits |
| 25 | Number of dcache_write_fb_hits |
| 26 | Number of dcache_atomic_fb_hits |
| 27 | Number of dcache_fb_releases |
| 28 | Number of dcache_line_evictions |
| 29 | Number of itlb_misses |
| 30 | Number of dtlb_misses |

Interrupts from Counters

There is a need to raise an interrupt when a particular counter has observed a delta number of counts. This feature, however, is not part of the current RISC-V ISA, which mandates neither how the counters are interpreted nor the direction in which they move (up or down).

Thus, to achieve this functionality, we propose a new custom CSR:

mhpminterrupten: The encoding for this CSR is the same as that of mcounteren/mcountinhibit. When a particular bit is set, the corresponding counter will generate an interrupt when its value reaches 0 and the counter is enabled (mhpmevent != 0). The interrupt can be disabled by writing a 0 to the corresponding mhpmevent register (equivalent to disabling the counter).

Following is an example of how such a framework can be used:

```asm
csrwi mhpminterrupten, 0x4    # enable interrupt for mhpmcounter3
addi  x31, x0, -delta         # note the negative delta
csrw  mhpmcounter3, x31
csrwi mhpmevent3, 0x9         # set mhpmcounter3 to track event-code 9
# ... execution continues until the counter reaches 0 ...
# interrupt is generated, jump to the ISR
```

ISR routine:

```asm
csrw  mhpmevent3, x0          # disabling mhpmcounter3 also disables the interrupt
```

RAMs Used in the C-Class

This section describes in detail how the various RAM-based structures are used within the SHAKTI designs (specifically the C-Class processor). It also highlights the differences involved in porting these structures to ASICs or FPGAs.

Overview

The caches used in the C-Class core (both instruction and data) use a single-ported RAM instance (1RW), i.e., one port that performs either a read or a write.

The branch predictor, however, may or may not use RAMs, depending on the compile-time choice. For the configurations that do use them, the RAMs are dual-ported (1R + 1W), i.e., a dedicated port for reads and another dedicated port for writes.

Functionality

Single-Ported RAMs (1RW)

  • Module Name: bram_1rw

  • Verilog source: bram_1rw.v

  • Port Descriptions:

    | Port Name | Direction | Description |
    |-----------|-----------|-------------|
    | clka  | Input  | Clock signal. The positive edge of the clock is used. |
    | ena   | Input  | When high, indicates that the port is being used. |
    | wea   | Input  | When high, indicates that a write operation is being performed. |
    | addra | Input  | Indicates the address for a read/write. |
    | dina  | Input  | Indicates the data for write operations. |
    | douta | Output | Holds the data from a read operation. |
  • Instantiation Parameters:

    | Parameter Name | Description |
    |----------------|-------------|
    | DATA_WIDTH | Width of the dina and douta ports. |
    | ADDR_WIDTH | Width of the addra port. |
    | MEMSIZE    | Depth of the RAM. |

    The size of the instantiated RAM will be MEMSIZE x DATA_WIDTH bits where the number of indices is equal to MEMSIZE and the number of bits at each index is equal to DATA_WIDTH.
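
    For instance, with hypothetical values MEMSIZE = 512 and DATA_WIDTH = 64, the instantiated RAM holds 512 entries of 64 bits each, i.e. 4 KiB in total.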

  • Read Operation: The address is driven on the addra port and the ena signal is driven high. At the next positive edge, the douta port will hold the data; read operations therefore have a one-cycle latency. A new address can be supplied every cycle (whose output is obtained in the subsequent cycle).

  • Write Operation: The address is driven on the addra port, the data to be written is driven on the dina port, and the ena and wea signals are asserted. At the next positive edge of the clock, the value on dina is written to address addra. A new write operation can be initiated at every clock edge.

Note

  1. The single-ported RAMs follow a no-change model: the output douta remains unchanged during write operations and always holds the data of the previous read operation.
  2. The single-ported RAMs assume that read outputs are registered.

Dual-Ported RAMs (1R + 1W)

  • Module Name: bram_1r1w

  • Verilog source: bram_1r1w.v

  • Ports:

    | Port Name | Direction | Description |
    |-----------|-----------|-------------|
    | clka  | Input  | Clock signal for port A. Operations are performed at the positive edge of the clock. |
    | ena   | Input  | Enable signal for port A. When high, indicates that the port is being used for a write. |
    | wea   | Input  | Write enable for port A. When high, indicates that a write operation is being performed. |
    | addra | Input  | Index address for port A, indicating the address for a write. |
    | dina  | Input  | Indicates the data for write operations. |
    | clkb  | Input  | Clock signal for port B. Operations are performed at the positive edge of the clock. |
    | enb   | Input  | Enable signal for port B. When high, indicates that the port is being used for a read. |
    | addrb | Input  | Index address for port B, indicating the address for a read. |
    | doutb | Output | Holds the data from a read operation. |
  • Instantiation Parameters:

    | Parameter Name | Description |
    |----------------|-------------|
    | DATA_WIDTH | Width of the dina and doutb ports. |
    | ADDR_WIDTH | Width of the addra and addrb ports. |
    | MEMSIZE    | Depth of the RAM. |

    The size of the instantiated BRAM will be MEMSIZE x DATA_WIDTH bits, where the number of indices is equal to MEMSIZE and the number of bits at each index is equal to DATA_WIDTH.

  • Read Operation: Port B is used for reads. The address is driven on the addrb port and the enb signal is driven high. At the next positive edge, the doutb port will hold the data; read operations therefore have a one-cycle latency. A new address can be supplied every cycle (whose output is obtained in the subsequent cycle).

  • Write Operation: Port A is used for writes. The address is driven on the addra port, the data to be written is driven on the dina port, and the ena and wea signals are asserted. At the next positive edge of the clock, the value on dina is written to address addra. A new write operation can be initiated at every clock edge.

  • Read-Write Conflicts: If a read and a write occur to the same address in the same cycle, the write is guaranteed to take effect, while the read data is not guaranteed.

Note

  1. Port A is used for write operations and port B for read operations. The various enable and write-enable signals are active-high.
  2. The dual-ported RAMs assume that read outputs are registered.

Synthesis

Mapping to FPGAs

The single-ported RAMs (1RW) used in the caches are directly mapped to the single-port BRAMs provided by Xilinx.

The dual-ported RAMs (1R + 1W) used in the branch predictors are directly mapped to the true dual-port RAMs provided by Xilinx. Since the true dual-port RAMs from Xilinx provide a (1RW + 1RW) configuration, our dual-ported instances ensure that port A is used only for writes and port B only for reads (by holding the write enable of port B low at all times).

The (* RAM_STYLE = "BLOCK" *) pragma in the Verilog source makes it easy for Vivado to infer these as BRAMs, so no edits are required in the source files.

Mapping to ASICs

For mapping to ASICs, the user has to replace the files bram_1rw and bram_1r1w with corresponding SRAM module instances that provide the same functionality described above.

In cases where SRAM cells of the same size as the instantiations are not available, the onus is on the user to bank/combine the available SRAM cells into a top module that provides the same functionality as bram_1r1w or bram_1rw.

If an SRAM cell has more ports than the ones required in this document, the user must ensure they are driven appropriately to maintain the functionality described here.

Additionally, if a parameterized instance of the SRAMs is developed by the user, it is the user's responsibility to manually replace each instance of the RAMs in the design. For the C-Class, the instances are defined below.

C-Class Specific Instances of RAMs

The size and configuration of the RAMs instantiated in the design can be controlled at the BSV level at compile time using the YAML configuration files. For a quick reference of all 1RW/1R1W instances, run the following in the Verilog release:

$ grep "bram_1rw " mk*cache.v -A2
$ grep "bram_1r1w " mkbpu.v -A2

Instruction Cache

The variables below refer to the fields within the icache_configuration node in the YAML spec. VADDR refers to XLEN, and PADDR refers to physical_addr_size in the YAML spec.

  • For Data Array

    • instance path: mkicache/data_arr_*
    • Total number of 1RW instances : dbanks x ways
    • DATA_WIDTH per instance: (word_size x 8 x block_size)/ dbanks
    • MEM_SIZE per instance: sets
    • ADDR_WIDTH per instance: Log(sets)
  • For Tag Array

    • instance path: mkicache/tag_arr_*
    • Total number of 1RW instances : tbanks x ways
    • DATA_WIDTH per instance: (PADDR - (Log(word_size) + Log(block_size) + Log(sets)))/tbanks
    • MEM_SIZE per instance: sets
    • ADDR_WIDTH per instance: Log(sets)
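
As a worked example with hypothetical configuration values (word_size = 4, block_size = 8, sets = 512, ways = 4, dbanks = tbanks = 1, PADDR = 32): each of the four data-array instances has DATA_WIDTH = (4 x 8 x 8)/1 = 256 bits, MEM_SIZE = 512 and ADDR_WIDTH = 9, while each of the four tag-array instances has DATA_WIDTH = (32 - (2 + 3 + 9))/1 = 18 bits. The data cache below is sized analogously.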

Data Cache

The variables below refer to the fields within the dcache_configuration node in the YAML spec. VADDR refers to XLEN, and PADDR refers to physical_addr_size in the YAML spec.

  • For Data Array

    • instance path: mkdcache/data_arr_*
    • Total number of 1RW instances : dbanks x ways
    • DATA_WIDTH per instance: (word_size x 8 x block_size)/ dbanks
    • MEM_SIZE per instance: sets
    • ADDR_WIDTH per instance: Log(sets)
  • For Tag Array

    • instance path: mkdcache/tag_arr_*
    • Total number of 1RW instances : tbanks x ways
    • DATA_WIDTH per instance: (PADDR - (Log(word_size) + Log(block_size) + Log(sets)))/tbanks
    • MEM_SIZE per instance: sets
    • ADDR_WIDTH per instance: Log(sets)

Branch Predictors

RAMs will not be instantiated if the predictor option in the YAML config is set to gshare_fa. RAM instances for other values are described below. The variables below refer to the fields within the branch_predictor node in the YAML spec. VADDR refers to XLEN, and PADDR refers to physical_addr_size in the YAML spec.

  • With compressed support:

    • Total number of 1R+1W instances : 2
    • DATA_WIDTH per instance: (VADDR - Log(btb_depth)) + VADDR + 4
    • MEM_SIZE per instance: btb_depth/2
    • ADDR_WIDTH per instance: Log(btb_depth/2)
    • NOTE: One instance will have DATA_WIDTH + 1 bits.
  • Without compressed support:

    • Total number of 1R+1W instances : 1
    • DATA_WIDTH per instance: (VADDR - Log(btb_depth)) + VADDR + 3
    • MEM_SIZE per instance: btb_depth
    • ADDR_WIDTH per instance: Log(btb_depth)
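
As a worked example with hypothetical values btb_depth = 512 and VADDR = 64: without compressed support, the single instance has DATA_WIDTH = (64 - 9) + 64 + 3 = 122 bits, MEM_SIZE = 512 and ADDR_WIDTH = 9; with compressed support, each of the two instances has DATA_WIDTH = (64 - 9) + 64 + 4 = 123 bits (one of them 124, per the note above), MEM_SIZE = 256 and ADDR_WIDTH = 8.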

Physical Memory Protection (PMP)

The physical memory protection (PMP) unit is integrated with both the data and instruction caches. The PMP module implements region-wise permission checks as described in the RISC-V privileged spec. The configuration parameters controlling PMP support are available in the core's configuration.

When PMPEnable is zero, the PMP module is not instantiated and all PMP CSRs read as zero, regardless of the value of PMPNumRegions.

PMP Granularity

The PMP granularity parameter is used to reduce the size of the address-matching comparators by increasing the minimum region size. For a 32-bit core the minimum granularity is 4 bytes; for a 64-bit core it is 8 bytes. This choice has been made to reduce the overhead of checking the homogeneity of an access. Consequently, for a 64-bit core, the NA4 mode is no longer available.
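
As a hedged illustration of the 8-byte minimum granularity on a 64-bit core (the region address and permissions are hypothetical; the NAPOT encoding follows the RISC-V privileged spec):

```asm
li    t0, 0x20000400    # 0x80001000 >> 2: an 8-byte NAPOT region
csrw  pmpaddr0, t0
li    t1, 0x1b          # pmpcfg0 entry 0: A=NAPOT (0x18) | W (0x2) | R (0x1)
csrw  pmpcfg0, t1
```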