C-Class Core Generator
This repository contains the open-source C-Class core generator. C-class belongs to the SHAKTI family of processors.
Introduction¶

What is C-Class¶
C-Class is a member of the SHAKTI family of processors. It is an extremely configurable, commercial-grade, 5-stage in-order core supporting the standard RV64GCSUN ISA extensions. The core generator in this repository can configure the core to generate a wide variety of design instances from the same high-level source code. The design instances can serve domains ranging from embedded systems, motor control, IoT, storage and industrial applications all the way to low-cost, high-performance Linux-based applications such as networking and gateways.
There have been multiple successful silicon prototypes of different instances of the C-Class, demonstrating its versatility. The extreme parameterization of the design, in conjunction with the use of an HLS such as Bluespec, makes it easy to add new features and design points on a continual basis.
Why Bluespec¶
The entire core is implemented in Bluespec System Verilog (BSV), an open-source high-level hardware description language. Apart from guaranteeing synthesizable circuits, BSV also gives you a high-level abstraction, like going from assembly [level programming] to C. You don’t do the dirty work, the compiler does all the work for you. It enables users to work at a much higher level thereby increasing throughput.
The language is now supported by an open-source Bluespec compiler, which can generate synthesizable Verilog compatible with FPGA and ASIC targets.
License¶
All of the source code available in this repository is under the BSD license. Please refer to LICENSE.* files for more details.
Commercial Adoption¶
The following industrial partners have adopted SHAKTI for commercialization purposes and provide continuous support in maintaining this repository.
- InCore Semiconductors Pvt. Ltd.
- Silint Consulting Pvt. Ltd.
Quick Start¶
For this quick-start you will need the following tools. The user is requested to install these from the respective repositories/sources:
- Bluespec Compiler : Make sure you are using the version post April 26 2020
- Verilator
- RISC-V GNU ToolChain
- Modified RISC-V ISA Sim
- RISC-V OpenOCD
- DTC 1.4.7: see dtc
- Python 3.7.0: see python
Warning
The following few sections are a quick copy-paste of the steps to install the above tools. However, it is possible that these steps are outdated, as either the repositories have moved or their master branches have moved forward with new dependencies or installation procedures. We therefore suggest referring to the original repositories of the above tools to install them.
If you already have the above tools installed you can directly jump to building your core: build
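Before jumping ahead, it can help to confirm that the tools listed above are actually reachable from your shell. The following is a minimal sketch, not part of the repository; the executable names in `REQUIRED_TOOLS` are assumptions and may differ on your installation (for example, the toolchain prefix).

```python
import shutil

# Executable names are assumptions for illustration; adjust to your setup.
REQUIRED_TOOLS = {
    "Bluespec Compiler": "bsc",
    "Verilator": "verilator",
    "RISC-V GNU ToolChain": "riscv64-unknown-elf-gcc",
    "Modified RISC-V ISA Sim": "spike",
    "RISC-V OpenOCD": "openocd",
    "DTC": "dtc",
}


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of tools whose executable is not found on PATH."""
    return [name for name, exe in tools.items() if which(exe) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("All tools found on PATH")
```

The `which` parameter exists only so the lookup can be replaced in a test; in normal use the default `shutil.which` is sufficient.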
Install Python Dependencies¶
The core generator requires pip and python (>=3.7) to be available on your system. If you have issues installing either of these directly on your system, we suggest using a virtual environment like pyenv to make things easy.
First, install the required libraries/dependencies:
$ sudo apt-get install -y make build-essential libssl-dev zlib1g-dev libbz2-dev \
libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev libncursesw5-dev \
xz-utils tk-dev libffi-dev liblzma-dev python-openssl git
Next, install pyenv
$ curl -L https://raw.githubusercontent.com/yyuu/pyenv-installer/master/bin/pyenv-installer | bash
Add the following to your .bashrc with appropriate changes to username:
export PATH="/home/<username>/.pyenv/bin:$PATH"
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"
Open a new terminal and create a new python virtual environment:
$ pyenv install 3.7.0
$ pyenv virtualenv 3.7.0 myenv
Now you can activate this environment in any other terminal:
pyenv activate myenv
python --version
Install DTC (device tree compiler)¶
We use DTC 1.4.7 to generate the device tree string in the boot-files. To install DTC, run the following commands:
sudo wget https://git.kernel.org/pub/scm/utils/dtc/dtc.git/snapshot/dtc-1.4.7.tar.gz
sudo tar -xvzf dtc-1.4.7.tar.gz
cd dtc-1.4.7/
sudo make NO_PYTHON=1 PREFIX=/usr/
sudo make install NO_PYTHON=1 PREFIX=/usr/
Building the Core¶
The code is hosted on Gitlab and can be checked out using the following command:
$ git clone https://gitlab.com/shaktiproject/cores/c-class.git
If you are cloning the c-class repo for the first time it would be best to install the dependencies first:
$ cd c-class/
$ pyenv activate myenv # ignore this if you are not using pyenv
$ pip install -U -r requirements.txt
The C-class core generator takes a specific YAML format as input. It performs checks to validate that the user has entered valid data and that none of the parameters conflict with each other. For example, mentioning the ‘D’ extension without the ‘F’ will be flagged by the generator as an invalid spec. More information on the exact parameters and constraints on each field are discussed here.
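To make the kind of conflict check described above concrete, here is a small illustrative sketch. It is not the generator's actual validation code: the function name `check_isa` and the error messages are hypothetical, and only the 'D' without 'F' rule mentioned above is modelled.

```python
import re


def check_isa(isa: str) -> list:
    """Return a list of conflicts found in an ISA string such as 'RV64IMAFDC'.

    Only single-letter extensions directly after the RV32/RV64 prefix are
    inspected; multi-letter 'Z*' extensions are ignored in this sketch.
    """
    if not isa.startswith(("RV32", "RV64")):
        return ["ISA string must start with RV32 or RV64"]
    # Leading run of upper-case letters other than 'Z' (e.g. 'IMAFDC').
    single_letter = re.match(r"[A-Y]*", isa[4:]).group(0)
    errors = []
    if "D" in single_letter and "F" not in single_letter:
        errors.append("'D' extension requires 'F'")
    return errors


print(check_isa("RV64IMADC"))
print(check_isa("RV64IMAFDC"))
```

The first call reports the missing 'F' dependency, while the second returns an empty list.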
Once the input YAML has been validated, the generator then clones all the dependent repositories which enable building a test-soc, simulating it and performing verification of the core. This is an alternative to maintaining the repositories as submodules, which typically pollutes the commit history with bump commits.
At the end, the generator outputs a single makefile.inc in the same folder in which it was run. This file contains definitions of paths where relevant bluespec files are present, the bsc command with macro definitions, verilator simulation commands, etc.
A sample input YAML (default.yaml) is available in the sample_config directory of the repository.
To build the core with a sample test-soc using the default config do the following:
$ python -m configure.main -ispec sample_config/default.yaml
The above step generates a makefile.inc file in the same folder and also clones other dependent repositories to build a test-soc and carry out verification. This should generate a log similar to:
[INFO] : ************ C-Class Core Generator ************
[INFO] : Available under BSD License
[INFO] : [update] Cloning caches_mmu ...
...
...
...
[INFO] : Loading input file: ..../sample_config/default.yaml
[INFO] : Load Schema configure/schema.yaml
[INFO] : Initiating Validation
[INFO] : No Syntax errors in Input Yaml.
[INFO] : Performing Specific Checks
[INFO] : Generating BSC compile options
[INFO] : makefile.inc generated
To compile the bluespec source and generate verilog:
$ make
This should generate the following folders:
- verilog: contains the verilog files generated by bsc
- bsv_build: contains all the intermediate and information files generated by bsc
- bin: contains the final verilated executable, out, which is used for simulation along with some boot and application hex files.
Note
To leverage parallel builds you can do the following:
make -j<jobs> generate_verilog; make generate_boot_files link_verilator
Run Smoke Tests¶
You can run the individual riscv-tests on the generated verilog of the test-soc using the following:
$ make test opts='--test=add --suite=rv64ui ' CONFIG_ISA=RV64IMAFDC
You can run the entire riscv-tests suite in a regression using the following:
$ make regress opts='--filter=rv64 --parallel=20 --sub' CONFIG_ISA=RV64IMAFDC
$ make regress opts='--filter=rv64 --final'
The last command, after some delay, should present the following output:
recoding rv64uf v PASSED
slt rv64ui p PASSED
fadd rv64uf v PASSED
and rv64ui p PASSED
fcvt_w rv64uf v PASSED
amoadd_d rv64ua p PASSED
fmadd rv64ud p PASSED
ldst rv64uf v PASSED
amoand_d rv64ua p PASSED
fmin rv64ud p PASSED
lh rv64ui v PASSED
amomaxu_w rv64ua v PASSED
amoand_w rv64ua p PASSED
amoxor_d rv64ua v PASSED
fence_i rv64ui v PASSED
bne rv64ui p PASSED
amomin_d rv64ua v PASSED
fcvt_w rv64uf p PASSED
srli rv64ui p PASSED
sw rv64ui v PASSED
amomaxu_d rv64ua v PASSED
lrsc rv64ua v PASSED
fmadd rv64ud v PASSED
blt rv64ui v PASSED
fadd rv64ud p PASSED
recoding rv64uf p PASSED
sh rv64ui v PASSED
ori rv64ui p PASSED
fdiv rv64uf v PASSED
ma_addr rv64mi p PASSED
recoding rv64ud p PASSED
add rv64ui p PASSED
blt rv64ui p PASSED
fcvt_w rv64ud p PASSED
bltu rv64ui v PASSED
sll rv64ui v PASSED
ma_fetch rv64mi p PASSED
jal rv64ui p PASSED
lwu rv64ui p PASSED
sd rv64ui v PASSED
ori rv64ui v PASSED
access rv64mi p PASSED
sw rv64ui p PASSED
srl rv64ui p PASSED
fcvt rv64ud v PASSED
fmadd rv64uf v PASSED
amoxor_w rv64ua v PASSED
sb rv64ui v PASSED
slliw rv64ui p PASSED
amoadd_d rv64ua v PASSED
fdiv rv64ud p PASSED
lw rv64ui v PASSED
slti rv64ui p PASSED
add rv64ui v PASSED
amomax_d rv64ua v PASSED
move rv64ud v PASSED
lhu rv64ui v PASSED
andi rv64ui p PASSED
addiw rv64ui v PASSED
amoswap_d rv64ua v PASSED
fdiv rv64ud v PASSED
lui rv64ui p PASSED
ldst rv64uf p PASSED
fmin rv64uf v PASSED
amoxor_w rv64ua p PASSED
srai rv64ui p PASSED
addi rv64ui p PASSED
subw rv64ui p PASSED
sd rv64ui p PASSED
amoand_d rv64ua v PASSED
sra rv64ui p PASSED
rvc rv64uc v PASSED
scall rv64mi p PASSED
beq rv64ui p PASSED
rvc rv64uc p PASSED
fmin rv64ud v PASSED
amoadd_w rv64ua p PASSED
scall rv64si p PASSED
fcmp rv64uf p PASSED
srliw rv64ui p PASSED
addiw rv64ui p PASSED
amomax_w rv64ua p PASSED
andi rv64ui v PASSED
addi rv64ui v PASSED
lhu rv64ui p PASSED
xor rv64ui p PASSED
amoor_w rv64ua p PASSED
and rv64ui v PASSED
lbu rv64ui v PASSED
dirty rv64si p PASSED
ldst rv64ud v PASSED
bge rv64ui p PASSED
amoor_w rv64ua v PASSED
sh rv64ui p PASSED
amoswap_w rv64ua p PASSED
amoxor_d rv64ua p PASSED
fadd rv64uf p PASSED
sll rv64ui p PASSED
amoand_w rv64ua v PASSED
ma_fetch rv64si p PASSED
sraiw rv64ui p PASSED
csr rv64si p PASSED
ldst rv64ud p PASSED
amoswap_w rv64ua v PASSED
bltu rv64ui p PASSED
ld rv64ui v PASSED
fmin rv64uf p PASSED
slli rv64ui v PASSED
fadd rv64ud v PASSED
addw rv64ui v PASSED
lb rv64ui p PASSED
amominu_d rv64ua p PASSED
fcvt_w rv64ud v PASSED
move rv64uf p PASSED
bge rv64ui v PASSED
or rv64ui p PASSED
srlw rv64ui p PASSED
xori rv64ui p PASSED
structural rv64ud v PASSED
sllw rv64ui p PASSED
amomax_d rv64ua p PASSED
fcvt rv64uf p PASSED
amoor_d rv64ua p PASSED
amomaxu_d rv64ua p PASSED
fdiv rv64uf p PASSED
sb rv64ui p PASSED
jal rv64ui v PASSED
addw rv64ui p PASSED
amomaxu_w rv64ua p PASSED
auipc rv64ui p PASSED
bne rv64ui v PASSED
amoswap_d rv64ua p PASSED
lw rv64ui p PASSED
bgeu rv64ui v PASSED
recoding rv64ud v PASSED
simple rv64ui p PASSED
or rv64ui v PASSED
lbu rv64ui p PASSED
amomax_w rv64ua v PASSED
move rv64ud p PASSED
fclass rv64uf p PASSED
jalr rv64ui p PASSED
fclass rv64ud v PASSED
sltiu rv64ui p PASSED
fcmp rv64ud p PASSED
sltu rv64ui p PASSED
structural rv64ud p PASSED
lb rv64ui v PASSED
fcvt rv64uf v PASSED
amomin_d rv64ua p PASSED
sub rv64ui p PASSED
wfi rv64si p PASSED
ld rv64ui p PASSED
amoor_d rv64ua v PASSED
fcvt rv64ud p PASSED
lrsc rv64ua p PASSED
fclass rv64uf v PASSED
fclass rv64ud p PASSED
sraw rv64ui p PASSED
amomin_w rv64ua v PASSED
bgeu rv64ui p PASSED
move rv64uf v PASSED
amoadd_w rv64ua v PASSED
fence_i rv64ui p PASSED
lh rv64ui p PASSED
csr rv64mi p PASSED
simple rv64ui v PASSED
lui rv64ui v PASSED
lwu rv64ui v PASSED
fcmp rv64ud v PASSED
beq rv64ui v PASSED
auipc rv64ui v PASSED
amominu_w rv64ua p PASSED
fmadd rv64uf p PASSED
amominu_w rv64ua v PASSED
amomin_w rv64ua p PASSED
fcmp rv64uf v PASSED
jalr rv64ui v PASSED
slli rv64ui p PASSED
amominu_d rv64ua v PASSED
div rv64um p PASSED
mul rv64um p PASSED
remuw rv64um p PASSED
divw rv64um p PASSED
remw rv64um p PASSED
mulhu rv64um p PASSED
mulw rv64um p PASSED
rem rv64um p PASSED
remu rv64um p PASSED
mulh rv64um p PASSED
divuw rv64um p PASSED
mulhsu rv64um p PASSED
divu rv64um p PASSED
divu rv64um v PASSED
sltiu rv64ui v PASSED
xor rv64ui v PASSED
subw rv64ui v PASSED
mulw rv64um v PASSED
srli rv64ui v PASSED
slliw rv64ui v PASSED
div rv64um v PASSED
sub rv64ui v PASSED
srlw rv64ui v PASSED
sltu rv64ui v PASSED
xori rv64ui v PASSED
remw rv64um v PASSED
mul rv64um v PASSED
slt rv64ui v PASSED
sra rv64ui v PASSED
divw rv64um v PASSED
srai rv64ui v PASSED
mulhu rv64um v PASSED
remuw rv64um v PASSED
srl rv64ui v PASSED
rem rv64um v PASSED
mulhsu rv64um v PASSED
slti rv64ui v PASSED
srliw rv64ui v PASSED
remu rv64um v PASSED
divuw rv64um v PASSED
sllw rv64ui v PASSED
sraw rv64ui v PASSED
mulh rv64um v PASSED
sraiw rv64ui v PASSED
Congratulations - You have built your very first C-Class core !! :)
Configure the Core¶
The C-class core is highly parameterized and configurable. By changing a single configuration the user can generate a core instance ranging in size from embedded micro-controllers to Linux-capable high-performance cores.
ISA Level Configurations¶
In RISC-V, both the Unprivileged and the Privileged specs offer a great number of choices to configure an implementation with. The Unprivileged spec offers various extensions and sub-extensions like Multiply-Divide, Floating Point, Atomic, Compressed, etc., which a user can choose to implement or not.
The Privileged spec, on the other hand, provides a much larger space of configurability to the user. Apart from choosing which privilege modes to implement (Machine, Hypervisor, Supervisor or User), the spec also provides a huge number of Control and Status Registers (CSRs) which impact various aspects of the RISC-V system. For example, the MISA csr can be used to dynamically enable or disable execution of certain sub-extensions. Similarly, the valid and legal values of the satp.mode field indicate what paging schemes are supported by the underlying implementation.
To capture all such possible choices of the RISC-V ISA in a single standard format, InCore has proposed the RISCV-CONFIG YAML format, which has also been adopted by the riscv-community, primarily for the ISA compatibility framework. The core generator uses the same YAML inputs to control various ISA level features of the core.
Generating CSRs¶
For implementing the CSR module, C-Class uses the CSR-BOX utility to automatically create a bsv module which implements all the necessary CSRs as per the input YAML specification provided in riscv-config format. An example of the isa YAML is provided in the sample_config directory. CSR-BOX ensures the warl functions specified in the YAML are faithfully replicated in bsv. Along with CSRs, CSR-BOX also provides methods and logic to handle traps and xRET instructions based on the privileged modes (U, S, H) defined in the ISA node of the input yaml.
Note that the CSR-BOX allows one to split the CSRs into a daisy-chain like fashion to reduce the impact on timing when instantiating large number of CSRs. Thus, apart from the isa yaml, CSR-BOX also requires a grouping yaml file which indicates which daisy-chain unit should contain which set of CSRs.
CSR-BOX also takes in an optional debug spec yaml (as defined by riscv-config) to capture basic debug related information, such as where the parking loop code of the debug is placed in the memory map. Providing the debug spec also instructs CSR-BOX to implement the necessary logic for handling custom debug interrupts like halt, resume and step. The Debug csrs must be defined in the debug spec. TODO provide example LINK
CSR-BOX also allows the user to define custom CSRs that may be required by the implementation. C-Class uses a custom csr to control the enabling/disabling of caches and branch predictors. The details of this CSR are provided here. An example YAML containing the definition of these CSRs which can be fed into CSR-BOX is available in the sample_config directory.
Other Derived Configuration Settings¶
Other than the CSRs, C-Class derives the following parameters from the input isa yaml:
- The ISA string indicates which extensions are to be enabled in hardware and its associated collaterals.
- The max value in the supported_xlen node sets the xlen variable in C-Class. This is used to define the width of the integer register file, alu operations, bypass width, virtual address size, etc.
- The flen variable in C-Class is set based on the presence of ‘F’ or ‘D’ characters in the ISA string.
- If the ‘S’ extension is present in the ISA string, then C-Class determines the supervisor page translation mode to be implemented from the max legal value of the satp.mode csr field present in the input yaml.
- The asid length to be used in the implementation is also derived by checking legal values of the satp.asid csr field.
- The size of the physical address to be implemented is derived from the physical_addr_sz node of the isa yaml.
- The number of mhpmcounters (and therefore mhpmevents) and their behavior is also captured from the csrs defined in the input isa yaml.
- The number of pmp entries and their granularity is also captured from the input isa yaml.
- Custom interrupts/exceptions and their cause values are also captured from the input isa yaml. The implementation creates an entry in the defines file for the name and cause value. The usage of these custom causes needs to be implemented separately in the bsv code.
- The max size of the cause field in the mcause csr is also derived by checking for the max cause value being used after accounting for the custom interrupts and exceptions.
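A few of the derivations listed above can be sketched in Python. This is only an illustration under stated assumptions, not the generator's code: the function name `derive_params` is hypothetical, `flen` is assumed to be 64 when 'D' is present and 32 when only 'F' is present (the standard widths of those extensions), and the cause field width is assumed to be the number of bits needed to hold the largest cause value.

```python
def derive_params(isa: str, supported_xlen: list, max_cause: int) -> dict:
    """Derive a few core parameters from fields of the isa YAML (sketch)."""
    exts = isa[4:]  # drop the 'RV32'/'RV64' prefix
    if "D" in exts:
        flen = 64
    elif "F" in exts:
        flen = 32
    else:
        flen = 0  # no floating-point register file
    return {
        "xlen": max(supported_xlen),
        "flen": flen,
        # Bits needed to represent the largest cause value in use.
        "cause_bits": max(1, max_cause.bit_length()),
    }


print(derive_params("RV64IMAFDC", [64], 15))
```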
Micro-Architectural Configuration hooks¶
The C-Class core has also defined a custom schema to control various micro-architectural features of the core. A sample configuration file is available in the sample_config directory.
The following provides a list and description of the configuration hooks available at the micro-architectural level. Note, there are also hooks in this configuration which control the bluespec compilation commands and the verilator commands as well.
num_harts¶
Description: Total number of harts to be instantiated in the dummy test-soc. Note that these will be non-coherent cores simply acting as masters on the fast-bus.
Examples:
num_harts: 2
isb_sizes¶
Description: A dictionary controlling the size of the inter-stage buffers of the pipeline. The variable isb_s0s1 controls the size of the isb between stage0 and stage1. Similarly isb_s1s2 dictates the size of the isb between stage1 and stage2 and so on. By increasing isb_s0s1 and isb_s1s2 one can shadow the stalls or latencies in the backend stages of the pipeline by fetching more instructions into the front-end stages of the pipeline.
There is a restriction, however, that isb_s2s3 should always be 1. This is because the outputs of the register file accessed in stage2 are not buffered, and neither is the bypass scheme implemented to handle this scenario.
One can however increase the number of in-flight instructions by increasing the sizes of isb_s3s4 and isb_s4s5 (increasing isb_s3s4 has a larger impact).
Also note that if write-after-write stalls are disabled, the size of the wawid is defined by the sum of isb_s3s4 and isb_s4s5. Therefore, increasing the number of in-flight instructions causes a logarithmic increase in the wawid used for maintaining bypass of operands.
Examples:
isb_sizes:
  isb_s0s1: 2
  isb_s1s2: 2
  isb_s2s3: 1
  isb_s3s4: 2
  isb_s4s5: 2
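The constraints on isb_sizes described above can be sketched as a small check. This is an illustration only: the function name `wawid_bits` is hypothetical, and the width formula (ceiling of log2 of isb_s3s4 + isb_s4s5) is an assumption based on the "logarithmic increase" mentioned in the description, not taken from the source code.

```python
def wawid_bits(isb: dict, waw_stalls: bool) -> int:
    """Validate isb_sizes and estimate the wawid width in bits (sketch)."""
    if isb["isb_s2s3"] != 1:
        raise ValueError("isb_s2s3 must always be 1")
    if waw_stalls:
        return 0  # ids are only needed when WAW stalls are disabled
    in_flight = isb["isb_s3s4"] + isb["isb_s4s5"]
    # ceil(log2(in_flight)), with a floor of one bit
    return max(1, (in_flight - 1).bit_length())


isb = {"isb_s0s1": 2, "isb_s1s2": 2, "isb_s2s3": 1, "isb_s3s4": 2, "isb_s4s5": 2}
print(wawid_bits(isb, waw_stalls=False))
```

Under these assumptions, doubling the number of in-flight instructions adds one bit to the id.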
merged_rf¶
Description: Boolean field to indicate if the architectural register files for floating-point and integer should be implemented as a single extended regfile in hw or as separate ones. This field only makes sense if ‘F’ support is enabled in the ISA string of the input isa yaml. Under certain targets like FPGA or certain technologies, maintaining a single register file might lead to better area and timing savings.
Examples:
merged_rf: True
total_events¶
Description: This field indicates the total number of events that can be used to program the mhpm counters. This field is used to capture the size of the events signal that drives the counters.
Examples:
total_events: 28
waw_stalls¶
Description: Indicates if stalls must occur on a WAW hazard. If you are looking for higher performance set this to False. Setting this to True would lead to instructions stalling in stage3 due to a WAW hazard.
Setting this to False also means the scoreboard will now allocate a unique id to the destination register of every instruction that is offloaded for execution. The size of this id depends on the number of in-flight instructions after the execution stage, which in turn depends on the size of the isb_s3s4 and isb_s4s5 as defined above.
Examples:
waw_stalls: False
iepoch_size¶
Description: integer value indicating the size of the epochs for the instruction memory subsystem. Allowed value is 2 only
Examples:
iepoch_size: 2
depoch_size¶
Description: integer value indicating the size of the epochs for the data memory subsystem. Allowed value is 1 only
Examples:
depoch_size: 1
s_extension¶
Description: Describes various supervisor and MMU related parameters. These parameters only take effect when “S” is present in the ISA field.
- itlb_size: integer indicating the number of entries in the fully-associative Instruction TLB
- dtlb_size: integer indicating the number of entries in the fully-associative Data TLB
Examples:
s_extension:
  itlb_size: 4
  dtlb_size: 4
a_extension¶
Description: Describes various A-extension related parameters. These params take effect only when the “A” extension is enabled in the riscv_config ISA
- reservation_size: integer indicating the size of the reservation in bytes. Minimum value is 4 and must be a power of 2. For an RV64 system the minimum should be 8 bytes.
Examples:
a_extension:
  reservation_size: 8
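The rules stated for reservation_size can be expressed as a short check. This is a sketch; the function name `check_reservation_size` and the error messages are hypothetical and not part of the generator.

```python
def check_reservation_size(size: int, xlen: int) -> list:
    """Return a list of rule violations for a_extension.reservation_size."""
    errors = []
    minimum = 8 if xlen == 64 else 4  # RV64 systems need at least 8 bytes
    if size < minimum:
        errors.append(f"reservation_size must be at least {minimum} bytes")
    if size <= 0 or size & (size - 1):
        errors.append("reservation_size must be a power of 2")
    return errors


print(check_reservation_size(8, xlen=64))
print(check_reservation_size(4, xlen=64))
```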
m_extension¶
Description: Describes various M-extension related parameters. These parameters take effect only if “M” is present in the ISA field. The multiplier used in the core is a retimed one. The parameters below indicate the number of input and output registers around the combo block to enable retiming.
- mul_stages_out: Number of stages to be inserted after the multiplier combinational block. Minimum value is 1.
- mul_stages_in: Number of stages to be inserted before the multiplier combinational block. Minimum value is 0.
- div_stages: an integer indicating the number of cycles for a single division operation. Max value is limited to the XLEN defined in the ISA.
Examples:
m_extension:
  mul_stages_in: 2
  mul_stages_out: 2
  div_stages: 32
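As an illustration of the bounds listed for the M-extension parameters, the sketch below checks a configuration dictionary against them. The function name `check_m_extension` is hypothetical; only the limits stated in the description are modelled.

```python
def check_m_extension(cfg: dict, xlen: int) -> list:
    """Return a list of bound violations for the m_extension node."""
    errors = []
    if cfg["mul_stages_out"] < 1:
        errors.append("mul_stages_out must be at least 1")
    if cfg["mul_stages_in"] < 0:
        errors.append("mul_stages_in must be at least 0")
    if cfg["div_stages"] > xlen:
        errors.append(f"div_stages must not exceed XLEN ({xlen})")
    return errors


cfg = {"mul_stages_in": 2, "mul_stages_out": 2, "div_stages": 32}
print(check_m_extension(cfg, xlen=64))
```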
branch_predictor¶
Description: Describes various branch predictor related parameters.
- instantiate: boolean value indicating if the predictor needs to be instantiated or not
- predictor: string indicating the type of predictor to be implemented. Valid values are: ‘gshare’
- btb_depth: integer indicating the size of the branch target buffer
- bht_depth: integer indicating the size of the branch history buffer
- history_len: integer indicating the size of the global history register
- history_bits: integer indicating the number of bits used for indexing bht/btb.
- ras_depth: integer indicating the size of the return address stack.
Examples:
branch_predictor:
  instantiate: True
  predictor: gshare
  btb_depth: 32
  bht_depth: 512
  history_len: 8
  history_bits: 5
  ras_depth: 8
icache_configuration¶
Description: Describes the various instruction cache related features.
- instantiate: boolean value indicating if the cache needs to be instantiated or not
- sets: integer indicating the number of sets in the cache
- word_size: integer indicating the number of bytes in a word. Fixed to 4.
- block_size: integer indicating the number of words in a cache-block.
- ways: integer indicating the number of ways in the cache
- fb_size: integer indicating the number of fill-buffer entries in the cache
- replacement: string indicating the replacement policy. Valid values are: [“PLRU”, “RR”, “Random”]
- ecc_enable: boolean field indicating if ECC should be enabled on the cache.
- one_hot_select: boolean value indicating if the bsv one-hot selection function should be used instead of conventional for-loops to choose amongst lines/fb-lines. This choice has no effect on the functionality.
If supervisor is enabled then the max size of a single way should not exceed 4 Kilobytes.
Examples:
icache_configuration:
  instantiate: True
  sets: 4
  word_size: 4
  block_size: 16
  ways: 4
  fb_size: 4
  replacement: "PLRU"
  ecc_enable: false
  one_hot_select: false
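The 4 Kilobyte limit on a single way can be checked with simple arithmetic: a way holds sets × block_size words of word_size bytes each. The sketch below is illustrative only, and the function names are hypothetical.

```python
def way_size_bytes(sets: int, word_size: int, block_size: int) -> int:
    """Size of one cache way: sets * words-per-block * bytes-per-word."""
    return sets * block_size * word_size


def check_way_size(cfg: dict, supervisor: bool) -> bool:
    """Return True if the configuration respects the 4 KB per-way limit."""
    size = way_size_bytes(cfg["sets"], cfg["word_size"], cfg["block_size"])
    return not supervisor or size <= 4096


cfg = {"sets": 4, "word_size": 4, "block_size": 16, "ways": 4}
print(way_size_bytes(cfg["sets"], cfg["word_size"], cfg["block_size"]))  # 256
print(check_way_size(cfg, supervisor=True))
```

With the example configuration each way is 256 bytes, so the four ways give a 1 KB cache, well within the limit.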
dcache_configuration¶
Description: Describes the various data cache related features.
- instantiate: boolean value indicating if the cache needs to be instantiated or not
- sets: integer indicating the number of sets in the cache
- word_size: integer indicating the number of bytes in a word. Fixed to 4.
- block_size: integer indicating the number of words in a cache-block.
- ways: integer indicating the number of ways in the cache
- fb_size: integer indicating the number of fill-buffer entries in the cache
- sb_size: integer indicating the number of store-buffer entries in the cache. Fixed to 2
- lb_size: integer indicating the number of lines to be stored in the store buffer. Applicable only when rwports == 1r1w
- ib_size: integer indicating the number of io-buffer entries in the cache. Defaults to 2
- replacement: string indicating the replacement policy. Valid values are: [“PLRU”, “RR”, “Random”]
- ecc_enable: boolean field indicating if ECC should be enabled on the cache.
- one_hot_select: boolean value indicating if the bsv one-hot selection function should be used instead of conventional for-loops to choose amongst lines/fb-lines. This choice has no effect on the functionality.
- rwports: number of read-write ports available on the brams. Allowed values are 1rw, 1r1w and 2rw
If supervisor is enabled then the max size of a single way should not exceed 4 Kilobytes.
Examples:
dcache_configuration:
  instantiate: True
  sets: 4
  word_size: 4
  block_size: 16
  ways: 4
  fb_size: 4
  sb_size: 2
  lb_size: 2
  ib_size: 2
  replacement: "PLRU"
  ecc_enable: false
  one_hot_select: false
  rwports: 1r1w
reset_pc¶
Description: Integer value indicating the reset value of program counter
Example:
bus_protocol¶
Description: bus protocol for the master interfaces of the core. Fixed to “AXI4”
Examples:
bus_protocol: AXI4
fpu_trap¶
Description: Boolean value indicating if the core should trap on floating point exception and integer divide-by-zero conditions.
Examples:
fpu_trap: False
verilator_configuration¶
Description: describes the various configurations for verilator compilation.
- coverage: indicates the type of coverage that the user would like to track. Valid values are: [“none”, “line”, “toggle”, “all”]
- trace: boolean value indicating if vcd dumping should be enabled.
- threads: an integer field indicating the number of threads to be used during simulation
- verbosity: a boolean field indicating if the verbose/display statements in the generated verilog should be compiled or not.
- out_dir: name of the directory where the final executable will be dumped.
- sim_speed: indicates if the user would prefer a fast simulation or slow simulation. Valid values are: [“fast”, ”slow”]. Please note that selecting “fast” will speed up simulation but slow down compilation, while selecting “slow” does the opposite.
Examples:
verilator_configuration:
  coverage: "none"
  trace: False
  threads: 1
  verbosity: True
  open_ocd: False
  sim_speed: fast
bsc_compile_options¶
Description: Describes the various bluespec compile options
- test_memory_size: size of the BRAM memory in the test-SoC in bytes. Default is 32MB
- assertions: boolean value indicating if assertions used in the design should be compiled or not
- trace_dump: boolean value indicating if the logic to generate a simple trace should be implemented or not. Note this is only for simulation and not a real trace
- compile_target: a string indicating if the bsv files are being compiled for simulation or for asic/fpga synthesis. The valid values are: [ ‘sim’, ‘asic’, ‘fpga’ ]
- suppress_warnings: List of warnings which can be suppressed during bluespec compilation. Valid values are: [“none”, “all”, “G0010”, “T0054”, “G0020”, “G0024”, “G0023”, “G0096”, “G0036”, “G0117”, “G0015”]
- ovl_assertions: boolean value indicating if OVL based assertions must be turned on/off
- ovl_path: string indicating the path where the OVL library is installed.
- sva_assertions: boolean value indicating if SVA based assertions must be turned on/off
- verilog_dir: the directory name where the generated verilog will be dumped
- open_ocd: a boolean field indicating if the test-bench should have an open-ocd vpi enabled.
- build_dir: the directory name where the bsv build files will be dumped
- top_module: name of the top-level bluespec module to be compiled.
- top_file: file containing the top-level module.
- top_dir: directory containing the top_file.
- cocotb_sim: boolean variable. When set, the terminating conditions in the test-bench environments are disabled, as the cocotb environment is meant to handle that. When set to false, the bluespec test-bench holds the terminating conditions.
Examples:
bsc_compile_options:
  assertions: True
  trace_dump: True
  suppress_warnings: "none"
  top_module: mkTbSoc
  top_file: TbSoc
  top_dir: base_sim
  out_dir: bin
noinline_modules¶
Description: This node contains multiple module names which take a boolean value. Setting a module to True would generate a separate verilog file for that module during bluespec compilation. If set to False, then that particular module will be inlined into the module above it in the hierarchy in the generated verilog.
Examples:
noinline_modules:
  stage0: False
  stage1: True
  stage2: False
  stage3: False
Test SoC¶
The C-class repository also contains a simple test-soc for the purpose of simulating applications and verifying the core. More enhanced and open-source SoCs can be found here.
Structure of SoC¶
The Test-SoC has the following structure (defined to a max of 4 levels of depth):
Description of the above modules:
| Module-Name | Description |
|---|---|
| mkriscv | Contains the 5-stages of the core pipeline including the execution and only the interface to the memory subsystem |
| mkdmem | The Data memory subsystem. Includes the data-cache and data-tlbs |
| mkimem | The instruction memory subsystem. Includes the instruction-cache and the instruction-tlbs |
| mkccore_axi4 | Contains the above modules and the integrations across them. Also provides 3 AXI-4 interfaces to be connected to the Cross-bar fabric |
| mkuart | UART module |
| mkclint | Core Level Interrupt |
| mksignature_dump | Signature dump module (for simulation only) |
| mkSoc | Contains all the above modules and instantiates the AXI-4 crossbar fabric as well. The fabric has 2 additional slaves, which are brought out through the interface to connect to the boot-rom and bram-main-memory present in the Test-bench |
| mkbram | BRAM based memory acting as main-memory |
| mkbootrom | Bootrom slave |
| mkTbSoC | Testbench that instantiates the Soc, and integrates it with the bootrom and a bram memory |
The details of the devices can be found in devices
Address Map of SoC¶
| Module | Address Range |
|---|---|
| BRAM-Memory | 0x80000000 - 0x8FFFFFFF |
| BootROM | 0x00001000 - 0x00010FFF |
| UART | 0x00011300 - 0x00011340 |
| CLINT | 0x02000000 - 0x020BFFFF |
| Debug-Halt Loop | 0x00000000 - 0x0000000F |
| Signature Dump | 0x00002000 - 0x0000200c |
Please note that the bram-based memory in the test-bench can only hold up to 256MB of code. Thus the elf2hex arguments will need to be applied accordingly.
BootRom Content¶
By default, on system-reset the core will always jump to 0x1000, which is mapped to the bootrom. The bootrom is initialized using the files boot.MSB and boot.LSB. The bootrom immediately causes a re-direction to address 0x80000000, where the main program is expected to lie. It is thus required that all programs are linked with the text-section beginning at 0x80000000. The rest of the boot-rom holds a dummy device-tree-string information.
Synthesis of Core¶
When synthesizing for an FPGA/ASIC, mkccore_axi4 (mkccore_axi4.v) should be used as the top module.
The mkimem and mkdmem modules include SRAM instances which implement the respective data and tag arrays. These are implemented as BRAMs and thus require no changes for FPGAs. However, for an ASIC flow, it is strongly advised to replace the BRAMs with the respective SRAMs. The user should refer to RAM Structures for correctly performing the replacement.
Simulating the Core¶
Generate Verilated Executable¶
$ cd c-class
$ python -m configure.main -ispec sample_config/default.yaml
$ make
The above should result in the following files in the bin folder:
- out
- boot.LSB
- boot.MSB
Executing Programs¶
Let’s assume the software program that you would like to simulate on the core is called prog.elf (compiled using standard riscv-gcc). This elf needs to be converted to a hex file which can be provided to the verilated executable: out. This hex can be generated using the following command:
For 64-bit:
$ elf2hex 8 33554432 bbl 2147483648 > code.mem
For 32-bit:
$ elf2hex 4 67108864 add.elf 2147483648 > code.mem
Place the code.mem file in the bin folder and execute the out binary to initiate simulation.
Please note that, since the boot code in the bootrom implicitly jumps to 0x80000000, the programs should also be linked to start at 0x80000000. Also note that the bram main memory is 256MB in size.
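The numbers in the elf2hex commands above follow from the memory size. The Python sketch below assumes, based only on the values used in this section, that the arguments are the line width in bytes, the depth in lines, the elf file and the base address (8 x 33554432 and 4 x 67108864 both equal 256MB, and 2147483648 is 0x80000000). The helper name elf2hex_args is hypothetical.

```python
MEM_BYTES = 256 * 1024 * 1024   # size of the bram main memory
BASE_ADDR = 0x80000000          # address the bootrom redirects to


def elf2hex_args(mem_bytes, width_bytes, elf, base):
    """Build an elf2hex command line (assumed order: width, depth, elf, base)."""
    if mem_bytes % width_bytes != 0:
        raise ValueError("memory size must be a multiple of the line width")
    depth = mem_bytes // width_bytes
    return ["elf2hex", str(width_bytes), str(depth), elf, str(base)]


print(" ".join(elf2hex_args(MEM_BYTES, 8, "bbl", BASE_ADDR)))       # 64-bit
print(" ".join(elf2hex_args(MEM_BYTES, 4, "add.elf", BASE_ADDR)))   # 32-bit
```

The two printed command lines reproduce the 64-bit and 32-bit invocations shown above, without the output redirection.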
Support for PutChar¶
The test-soc for simulation contains a simple uart. The putchar function for the same is available HERE.
This has to be used in the printf functions. The output of putchar is captured in a separate file app_log during simulation.
Simulation Arguments (Logger Utility)¶
- ./out +rtldump: if the core has been configured with trace_dump: true, then a rtl.dump file is created which shows the trace of instruction execution. Each line in the file has the following format:
<privilege-mode> <program-counter> <instruction> <register-updated><register value>
To enable printing of debug statements from the bluespec code, one can pass custom logger arguments to the simulation binary as follows:
- ./out +fullverbose: prints all the logger statements across all modules and all levels of verbosity.
- ./out +mstage1 +l0: prints all the logger statements within module stage1 which are at verbosity level 0.
- ./out +mstage2 +mstage4 +l0 +l3: prints all the logger statements within modules stage2 and stage4 which are at verbosity levels 0 and 3 only.
An app_log file is also created which captures the output of the uart, typically used in the putchar function in C/C++ codes as mentioned above.
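When scripting simulations it can help to assemble these plus-arguments programmatically. The following Python sketch only composes the argument strings described above (+rtldump, +fullverbose, +m<module>, +l<level>); the helper name logger_args is hypothetical and nothing here is part of the repository.

```python
def logger_args(modules=(), levels=(), full=False, rtldump=False):
    """Compose simulation plus-arguments from module names and verbosity levels."""
    args = []
    if rtldump:
        args.append("+rtldump")
    if full:
        # +fullverbose already covers all modules and all levels.
        args.append("+fullverbose")
        return args
    args += ["+m" + name for name in modules]
    args += ["+l" + str(level) for level in levels]
    return args


# Reproduces: ./out +mstage2 +mstage4 +l0 +l3
print(" ".join(["./out"] + logger_args(["stage2", "stage4"], [0, 3])))
```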
Connect to GDB in Simulation¶
A debugger implementation following the riscv-debug-draft-014 has been integrated with the core.
This can be instantiated in the design by configuring with: debugger_support: true
Perform the following steps to connect to the core executable with a gdb terminal. This assumes you have installed openocd and that it is available as part of your $PATH variable.
Modify the sample_config/default.yaml to enable: debugger_support and open_ocd.
Generate a new executable with this config to support jtag remote-bitbang in the test-bench
$ python -m configure.main -ispec sample_config/default.yaml
$ make gdb # generate executable with open-ocd vpi enabled in the test-bench
Simulate the RTL. In a new terminal do the following:
$ cd c-class/bin/
$ ./out > /dev/null
Connect to OpenOCD. Open a new terminal and type the following:
$ cd c-class/test_soc/gdb_setup/
$ openocd -f shakti_ocd.cfg
Connect to GDB. Open yet another terminal and type the following:
$ cd c-class/test_soc/gdb_setup
$ riscv64-unknown-elf-gdb -x gdb.script
In this window you can now perform gdb commands like: set $pc, i r, etc.
To reset the SoC via the debugger you can execute the following within the gdb shell:
$ monitor reset halt
$ monitor gdb_sync
$ stepi
$ i r
Note
The above will not reset memories like caches, brams, etc
Dhrystone¶
The max DMIPS of the core is 1.72DMIPs/MHz.
$ git clone https://gitlab.com/shaktiproject/cores/benchmarks.git
$ cd benchmarks
$ make dhrystone ITERATIONS=100000
The output directory will contain a code.mem file, which needs to be copied to the bin folder before executing the C-class verilated binary:
$ cp benchmarks/output/code.mem c-class/bin # change paths accordingly
$ cd c-class/bin
$ ./out
$ cat app_log
Microseconds for one run through Dhrystone: 10.0
Dhrystones per Second: 95746.0
Linux on C-Class¶
Generate RTL using the default.yaml config as provided in the repo
$ python -m configure.main -ispec sample_config/default.yaml
$ make # generate executable
Download the shakti-linux repository and generate the kernel image:
$ git clone https://gitlab.com/shaktiproject/software/shakti-linux
$ cd shakti-linux
$ export SHAKTI_LINUX=$(pwd)
$ git submodule update --init --recursive
$ cd $SHAKTI_LINUX
$ make -j16 ISA=rv64imafd
Come back to the folder c-class/ to simulate the kernel on the C-class executable:
$ cd c-class/
$ cp $SHAKTI_LINUX/work/riscv-pk/bbl ./bin/
$ cd bin
$ elf2hex 8 33554432 bbl 2147483648 > code.mem
$ ./out
Track the app_log file to see the kernel messages being printed
FreeRTOS on C-class¶
Generate a 32-bit RTL with the following command:
$ python -m configure.main -ispec sample_config/freertos.yaml
$ make # generate executable
Download the FreeRTOS repository for C-class
$ git clone https://gitlab.com/shaktiproject/software/FreeRTOS
$ cd FreeRTOS/FreeRTOS-RISCV/Demo/shakti/
$ make
Come back to the c-class folder and do the following:
$ cd c-class/
$ cp FreeRTOS/FreeRTOS-RISCV/Demo/shakti/frtos-shakti.elf ./bin
$ cd bin
$ elf2hex 8 4194304 frtos-shakti.elf 2147483648 > code.mem
$ ./out
Track the app_log file to see the kernel messages being printed
Benchmarking the Core¶
The max DMIPS of the C-class core is 1.72DMIPs/MHz.
The max CoreMarks of the C-class core is 2.9CoreMarks/MHz
The C-class core is highly configurable and thus requires specific tuning to achieve the maximum performance. This document highlights some of the settings and their respective benchmark numbers. For the following benchmarks the C-class core has been configured using the default.yaml available in the sample_config/ folder.
Note
Make sure you are using gcc 9.2.0 or above to replicate the following results.
Benchmarking Dhrystone¶
The following numbers have been obtained via simulation, where the number of ITERATIONS was fixed at 5000
Flags used for compilation:
-mcmodel=medany -static -std=gnu99 -O2 -ffast-math \
-fno-common -fno-builtin-printf -march=rv64$(march) -mabi=lp64d \
-w -static -nostartfiles -lgcc
When $march is rv64imac the DMIPs/MHz is 1.68:
Microseconds for one run through Dhrystone: 10.0
Dhrystones per Second: 94652.0
When $march is rv64ima the DMIPs/MHz is 1.72:
Microseconds for one run through Dhrystone: 10.0
Dhrystones per Second: 96216.0
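For reference, DMIPS is conventionally obtained by dividing the Dhrystones-per-second figure by 1757 (the score of the VAX 11/780 baseline), and DMIPS/MHz further divides by the clock frequency. The Python sketch below shows only that arithmetic. The simulated clock frequency is not stated in this section, so clock_mhz is a value the caller must supply, and the helper name is hypothetical.

```python
VAX_DHRYSTONES_PER_SECOND = 1757.0   # 1 DMIPS baseline (VAX 11/780)


def dmips_per_mhz(dhrystones_per_second, clock_mhz):
    """Convert a Dhrystones-per-second score into DMIPS/MHz."""
    dmips = dhrystones_per_second / VAX_DHRYSTONES_PER_SECOND
    return dmips / clock_mhz


# Sanity check of the definition: 1757 Dhrystones/s at 1 MHz is 1 DMIPS/MHz.
print(dmips_per_mhz(1757.0, 1.0))
```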
Benchmarking CoreMarks¶
The following numbers have been obtained via simulation, where the number of ITERATIONS was fixed at 100
Flags used for compilation are available in the logs below:
When $march is rv64imac the CoreMarks/MHz is 2.84:
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 35205197
Total time (secs): 35
Iterations/Sec : 2
Iterations : 100
Compiler version : riscv64-unknown-elf-9.2.0
Compiler flags : -mcmodel=medany -DCUSTOM -DPERFORMANCE_RUN=1 -DMAIN_HAS_NOARGC=1 \
-DHAS_STDIO -DHAS_PRINTF -DHAS_TIME_H -DUSE_CLOCK -DHAS_FLOAT=0 \
-DITERATIONS=10 -O3 -fno-common -funroll-loops -finline-functions \
-fselective-scheduling -falign-functions=16 -falign-jumps=4 \
-falign-loops=4 -finline-limit=1000 -nostartfiles -nostdlib -ffast-math \
-fno-builtin-printf -march=rv64imac -mexplicit-relocs
Memory location : STACK
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0x988c
Correct operation validated. See README.md for run and reporting rules.
When $march is rv64ima the CoreMarks/MHz is 2.897:
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 34516277
Total time (secs): 34
Iterations/Sec : 2
Iterations : 100
Compiler version : riscv64-unknown-elf-9.2.0
Compiler flags : -mcmodel=medany -DCUSTOM -DPERFORMANCE_RUN=1 -DMAIN_HAS_NOARGC=1 \
-DHAS_STDIO -DHAS_PRINTF -DHAS_TIME_H -DUSE_CLOCK -DHAS_FLOAT=0 \
-DITERATIONS=100 -O3 -fno-common -funroll-loops -finline-functions \
-fselective-scheduling -falign-functions=16 -falign-jumps=4 \
-falign-loops=4 -finline-limit=1000 -nostartfiles -nostdlib -ffast-math \
-fno-builtin-printf -march=rv64ima -mexplicit-relocs
Memory location : STACK
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0x988c
Correct operation validated. See README.md for run and reporting rules.
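The CoreMarks/MHz figures quoted above can be recovered from the logs. The Python sketch below assumes that one tick in the log corresponds to one core clock cycle, in which case CoreMarks/MHz is the iteration count divided by the number of millions of ticks. Under that assumption the two logs give 2.84 and 2.897 respectively. The helper name is hypothetical.

```python
def coremark_per_mhz(iterations, total_ticks):
    """CoreMarks/MHz, assuming one tick equals one core clock cycle."""
    return iterations * 1_000_000 / total_ticks


# Values taken from the two logs above.
print(round(coremark_per_mhz(100, 35205197), 2))    # rv64imac
print(round(coremark_per_mhz(100, 34516277), 3))    # rv64ima
```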
Why Compressed Binaries perform bad on C-class?¶
As the numbers above show, for the same configuration of the branch-predictor, enabling compressed support slightly reduces the DMIPs of the core. This is because of how the fetch-stage (stage1) has been designed.
The fetch stage always expects the I$ to respond with a 32-bit word which is 4-byte aligned. Since the 32-bit word can hold up to two 16-bit compressed instructions, the predictor also always presents 2 predictions: one for pc and one for pc+2. While analysing the 32-bit word from the I$ the following scenarios can occur:
- Case-1: the entire word is a 32-bit instruction. In this case the entire word and the prediction for pc are sent to the decode stage.
- Case-2: the word contains two 16-bit instructions. In this case, in the first cycle the lower 16-bits of the word and the prediction for pc are sent to the decode stage. In the next cycle the upper 16-bits and the prediction for pc+2 are sent to the decode stage.
- Case-3: the lower 16-bits need to be concatenated with the upper 16-bits of the previous I$ response. In this case a new 32-bit instruction is formed and the prediction of the previous response is sent to the decode stage.
- Case-4: only the upper 16-bits of the I$ response need to be analysed. If the upper 16-bits are a compressed instruction, then they are sent to the decode stage along with the prediction for pc+2. If, however, the upper 16-bits are the lower part of a 32-bit instruction, then we need to wait for the next I$ response and use the Case-3 scheme. One can land in this case when there is a jump to a 32-bit instruction placed at a 2-byte boundary.
Now that we understand how the fetch-stage works, assume that all the dhrystone code fits within the I$ (i.e. no misses) and that the predictor is well trained and provides only correct predictions. Consider the following sequence from dhrystone:
...
8000106e: 0x00001797 auipc a5,0x1
...
...
...
800010d8: 0xf97ff0ef jal ra,8000106e
...
Each time the jal instruction is executed, the fetch-stage enters Case-4: the upper 16-bits of the 32-bit word at 0x8000106c are the lower part of a 32-bit instruction starting at 0x8000106e. This leads to a single-cycle stall in sending the auipc instruction to the decode stage.
Since this kind of sequence occurs in 3 places in each dhrystone iteration, each iteration incurs a single-cycle delay for each of them. Hence the reduced performance with compressed support.
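The stall condition described above can be expressed compactly. The Python sketch below is a simplified model and not the stage1 implementation: it only checks whether a jump target starts in the upper half of a 4-byte aligned word and whether the instruction there is 32 bits wide (in RISC-V, an instruction whose two lowest bits are both 1 is not compressed). The function names are hypothetical.

```python
def is_compressed(halfword):
    """A 16-bit parcel is a compressed instruction unless its low 2 bits are 0b11."""
    return (halfword & 0b11) != 0b11


def jump_target_stall_cycles(target_pc, instruction):
    """Extra fetch cycles when jumping to target_pc (simplified Case-4 model).

    A 32-bit instruction that starts at a 2-byte (not 4-byte) boundary spans
    two I$ responses, so the fetch stage must wait for the next response.
    """
    starts_in_upper_half = (target_pc & 0b10) != 0
    first_parcel = instruction & 0xFFFF
    if starts_in_upper_half and not is_compressed(first_parcel):
        return 1
    return 0


# The auipc at 0x8000106e from the dhrystone sequence above.
print(jump_target_stall_cycles(0x8000106E, 0x00001797))
```

For the same auipc placed at a 4-byte aligned address, the model returns 0, which corresponds to Case-1.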
Micro-Arch Notes¶
Custom CSRs Available in C-Class¶
The C-class includes the following custom csrs implemented in the non-standard space for extra control and special features.
custom control csr (0x800)¶
This csr is used to enable or disable the caches, branch predictor and arithmetic exceptions at run-time.
Bit Position | Reset Value | Description |
---|---|---|
0 | from config | Enable or disable the data-cache. |
1 | from config | Enable or disable the instruction-cache. |
2 | from config | Enable or disable the branch_predictor. |
3 | Disabled | Enable or disable arithmetic exceptions. |
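The bit layout in the table above can be turned into a value to write to the csr. The Python sketch below assumes that a bit value of 1 means "enabled", which the table does not state explicitly. The constant and function names are hypothetical.

```python
# Bit positions from the custom control csr (0x800) table.
DCACHE_EN = 1 << 0
ICACHE_EN = 1 << 1
BPU_EN = 1 << 2
ARITH_TRAP_EN = 1 << 3


def control_value(dcache, icache, bpu, arith_trap):
    """Pack the four enables into a csr value (assumes 1 means enabled)."""
    value = 0
    if dcache:
        value |= DCACHE_EN
    if icache:
        value |= ICACHE_EN
    if bpu:
        value |= BPU_EN
    if arith_trap:
        value |= ARITH_TRAP_EN
    return value


def describe(value):
    """Unpack a csr value into named booleans."""
    return {
        "dcache": bool(value & DCACHE_EN),
        "icache": bool(value & ICACHE_EN),
        "bpu": bool(value & BPU_EN),
        "arith_trap": bool(value & ARITH_TRAP_EN),
    }


print(hex(control_value(True, True, True, False)))
```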
dtvec csr (0x7c0)¶
XLEN bit register which indicates the address of the debug loop when the debugger halts the core.
denable csr (0x7c1)¶
1-bit csr indicating if the debugger can halt the core
mhpminterrupten csr (0x7c2)¶
XLEN bit register following the same encoding as mcounteren/mcountinhibit. A bit set to 1 indicates that an interrupt will be generated when the corresponding counter reaches the value 0.
More details on using this register are available [here](../docs/performance_counters.md#interrupts-from-counters)
dtim base address csr (0x7c3)¶
An XLEN bit register holding the base address of the data tightly integrated scratch memory. This should correspond to the physical address space and not the virtual address space.
dtim bound address csr (0x7c4)¶
An XLEN bit register holding the bound address of the data tightly integrated scratch memory. This should correspond to the physical address space and not the virtual address space.
itim base address csr (0x7c5)¶
An XLEN bit register holding the base address of the instruction tightly integrated scratch memory. This should correspond to the physical address space and not the virtual address space.
itim bound address csr (0x7c6)¶
An XLEN bit register holding the bound address of the instruction tightly integrated scratch memory. This should correspond to the physical address space and not the virtual address space.
cause values for arith traps¶
When configured with fpu_trap: True, six arithmetic exceptions are added as an extension to the 15 exceptions mentioned in the RISC-V spec.
Of these, five are the floating point exceptions specified by the IEEE 754 floating point standard.
Description | Cause Value |
---|---|
Integer divide by zero | 17 |
Floating point Invalid operation | 18 |
Floating point Zero divide | 19 |
Floating point Overflow | 20 |
Floating point Underflow | 21 |
Floating point Inexact | 22 |
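A trap handler or a log post-processing script can use the table above as a lookup. The Python sketch below simply mirrors the table; the names ARITH_TRAP_CAUSES, describe_cause and is_fp_trap are hypothetical.

```python
# Cause values for the arithmetic traps, copied from the table above.
ARITH_TRAP_CAUSES = {
    17: "Integer divide by zero",
    18: "Floating point Invalid operation",
    19: "Floating point Zero divide",
    20: "Floating point Overflow",
    21: "Floating point Underflow",
    22: "Floating point Inexact",
}


def describe_cause(cause):
    """Return the description of an arithmetic trap cause, or None if unknown."""
    return ARITH_TRAP_CAUSES.get(cause)


def is_fp_trap(cause):
    """True for the five floating point exceptions (cause values 18 to 22)."""
    return 18 <= cause <= 22


print(describe_cause(17), is_fp_trap(17))
```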
Performance Monitors¶
Introduction¶
Currently the RISC-V privilege spec (v1.12) describes a basic hardware performance facility at the hart (core) level.
3 counters for dedicated functions have been defined:
Address | Name | Description |
---|---|---|
0xB00 | mcycle | counts the number of cycles executed by the hart starting from an arbitrary point of time. |
0xB02 | minstret | counts the number of instructions executed by the hart starting from an arbitrary point of time. |
0xC01 | time | this is a read-only csr which reads the memory mapped value of the platform's real-time counter (mtime). |
Each of the above are 64-bit counters. Shadow csrs of the above also exist in the user-space.
Apart from the above, RISC-V also makes provision for 29 additional 64-bit event counters: mhpmcounter3 - mhpmcounter31. The event selectors for these counters are also defined: mhpmevent3 - mhpmevent31. The meaning of these events is defined by the platform and can be customized for each platform.
In addition, RISC-V also defines a single 32-bit counter-enable register: mcounteren. Each bit in this register corresponds to one of the 32 counters described above. This register controls only the accessibility of the counter registers and has no effect on the underlying counters, which can continue to increment irrespective of the settings of the mcounteren fields.
Clearing a bit in mcounteren only indicates that the corresponding counter cannot be accessed by lower privilege modes. Similar functionality is implemented by the scounteren register when S-mode is supported.
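The accessibility rule can be illustrated with a few lines of Python. In the privilege spec, bit 0 of mcounteren corresponds to cycle, bit 1 to time, bit 2 to instret and bit N (3 to 31) to hpmcounterN. The sketch below only models the check made when a lower privilege mode reads a counter; the function names are hypothetical.

```python
def counter_bit(index):
    """Bit mask in mcounteren for counter index (0=cycle, 1=time, 2=instret, 3..31=hpmcounterN)."""
    if not 0 <= index <= 31:
        raise ValueError("counter index must be in 0..31")
    return 1 << index


def accessible_from_lower_mode(mcounteren, index):
    """True if the next lower privilege mode may read the counter.

    The counter itself keeps counting regardless of this bit.
    """
    return bool(mcounteren & counter_bit(index))


mcounteren = counter_bit(0) | counter_bit(2)    # expose cycle and instret only
print(accessible_from_lower_mode(mcounteren, 0))
print(accessible_from_lower_mode(mcounteren, 3))
```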
Overhead Analysis¶
- Each event-counter is mapped to a CSR address and additionally all counters are read-write CSRs. Thus each 64-bit counter will have an additional 12-bit decoder to select that counter in case of a read/write CSR op.
- Since all CSRs are accessed in the write-back stage of the C-Class core, the 12-bit address from this stage fans out to all CSRs. Since the event-counters are implemented as 64-bit adders, the fan-out load is further increased as they become part of the CSR read/write op.
- Furthermore, suppose there are 30 events defined by the core/platform and each event-counter is configurable to choose any of the 30 events to track. This leads to an additional 30:1 mux on each event-counter.
All three factors listed above can cause the event-counters to become critical in terms of area and frequency closure.
Possible solutions¶
- To address issues 1 and 2 listed above, it is possible to implement the CSRs as a daisy-chain as shown below:

Here the CSRs are grouped based on their functionality, and accesses to CSRs can thus take a variable number of cycles. For example, less frequently accessed CSRs like fcsr, *scratch or the debug registers can be placed in GRP-2 or GRP-3. Performance counters and status registers can be placed in GRP-1 to enable fast access.
Such daisy chaining will reduce the comparator fan-out while performing CSR read/write ops.
- To address the 3rd issue from the above list, it is proposed to split the events into groups and have each counter track only the events within a specific group. This strategy is further elaborated in the next section.
List of Events for C-class¶
The C-Class core will support capturing the following 30 events:
Event number | Description |
---|---|
1 | Number of mispredictions |
2 | Number of exceptions |
3 | Number of interrupts |
4 | Number of csrops |
5 | Number of jumps |
6 | Number of branches |
7 | Number of floats |
8 | Number of muldiv |
9 | Number of rawstalls |
10 | Number of exestalls |
11 | Number of icache_access |
12 | Number of icache_miss |
13 | Number of icache_fbhit |
14 | Number of icache_ncaccess |
15 | Number of icache_fbrelease |
16 | Number of dcache_read_access |
17 | Number of dcache_write_access |
18 | Number of dcache_atomic_access |
19 | Number of dcache_nc_read_access |
20 | Number of dcache_nc_write_access |
21 | Number of dcache_read_miss |
22 | Number of dcache_write_miss |
23 | Number of dcache_atomic_miss |
24 | Number of dcache_read_fb_hits |
25 | Number of dcache_write_fb_hits |
26 | Number of dcache_atomic_fb_hits |
27 | Number of dcache_fb_releases |
28 | Number of dcache_line_evictions |
29 | Number of itlb_misses |
30 | Number of dtlb_misses |
Interrupts from Counters¶
There is a need to raise an interrupt when a particular counter has observed delta number of counts.
This feature is, however, not part of the current RISC-V ISA, since the ISA mandates neither how the counters are interpreted nor in which direction they should move (up or down).
Thus, to achieve this functionality, we propose a new custom CSR:
mhpminterrupten: The encoding for this csr is the same as that of mcounteren/mcountinhibit. When a particular bit is set, it indicates that the corresponding counter will generate an interrupt when its value reaches 0 and the counter is enabled (mhpmevent != 0). The interrupt can be disabled by writing a 0 to the corresponding mhpmevent register (equivalent to disabling the counter).
Following is an example of how such a framework can be used:
> csrw mhpminterrupten, 0x4 # enable interrupt for mhpmcounter3
> addi x31, x0, -delta # note the negative delta
> csrw mhpmcounter3, x31
> csrw mhpmevent3, 0x9 # enable mhpmcounter3 to track event-code-9
> ...
> interrupt is generated jump to isr!
> ...
>
ISR Routine
> csrw mhpmevent3, x0 # disabling mhpmcounter3 will also disable the interrupt.
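The reason for loading a negative delta can be checked with modular arithmetic. The Python sketch below assumes a 64-bit counter that counts upwards and wraps around at 2^64, which is what the negative preload in the example implies. The helper names are hypothetical.

```python
COUNTER_BITS = 64
MASK = (1 << COUNTER_BITS) - 1


def preload_value(delta):
    """Two's complement encoding of -delta as stored in a 64-bit counter."""
    return (-delta) & MASK


def events_until_zero(counter_value):
    """Number of increments before an up-counting 64-bit counter reads 0."""
    return (-counter_value) & MASK


start = preload_value(1000)
print(hex(start))
print(events_until_zero(start))
```

After exactly delta tracked events the counter wraps to 0, which is the point at which the interrupt described above would be raised.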
RAMS used in the C-Class¶
This document describes in detail how various RAM based structures are used within the shakti-designs (specifically the C-class processor). The doc also highlights the differences for porting the same structures to ASIC or FPGAs.
Overview¶
The caches used in the C-class core (instruction and data both), use a single-ported RAM instance (1RW), i.e. one port to perform either a read or a write.
The branch predictors, however, may or may not use RAMs depending on the choice made at compile time. For specific instances, the RAMs used are dual-ported (1R + 1W), i.e. a dedicated port to read and another dedicated port to write.
Functionality¶
Single-Ported RAMs (1RW)¶
Module Name: bram_1rw
Verilog source: bram_1rw.v
Port Descriptions:
Port Name | Direction | Description |
---|---|---|
clka | input | Clock signal. Positive edge of clock is used. |
ena | input | When high indicates the port is being used |
wea | input | When high indicates a write operation is being performed. |
addr | input | Indicates the address for read/write |
dina | input | Indicates the data for write operations |
douta | output | Holds the data for a read operation |
Instantiation Parameters:
Parameter Name | Description |
---|---|
DATA_WIDTH | Width of dina and douta ports. |
ADDR_WIDTH | Width of addra port. |
MEMSIZE | Depth of the RAM. The size of the instantiated RAM will be MEMSIZE x DATA_WIDTH bits where the number of indices is equal to MEMSIZE and the number of bits at each index is equal to DATA_WIDTH. |
Read Operation: The address is written onto the addr port, and the ena signal is driven high. In the next positive edge, the douta port will hold the data. Therefore, the read operations have a one cycle latency. Also, a new address can be given at every cycle (whose output will be obtained in the subsequent cycle).
Write Operation: The address is written onto the addr port, data to be written is driven on the dina port, and the ena and wea signals are asserted. At the next positive edge of clock the value at dina is written onto the address addr. Also, a new write operation can be initiated at every clock edge.
Note
- The single-ported rams follow a no-change model, where the output douta remains unchanged on write-operations and will always hold the data of the previous read operation.
- The single-ported rams assume the outputs are registered for reads.
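The read and write behaviour described above, including the no-change model, can be captured in a small behavioural model. The Python sketch below is meant as a reference for checking a replacement SRAM wrapper and is not derived from bram_1rw.v. The class name and the zero-initialised contents are assumptions.

```python
class Bram1RW:
    """Cycle-level behavioural model of a 1RW RAM with registered read output."""

    def __init__(self, memsize, data_width):
        self.mem = [0] * memsize            # initial contents assumed to be 0
        self.mask = (1 << data_width) - 1
        self.douta = 0                      # registered read output

    def clock(self, ena, wea, addr, dina=0):
        """Apply one positive clock edge and return douta after the edge."""
        if ena:
            if wea:
                # Write: douta keeps the previous read data (no-change model).
                self.mem[addr] = dina & self.mask
            else:
                # Read: data is visible after this edge (one cycle latency).
                self.douta = self.mem[addr]
        return self.douta


ram = Bram1RW(memsize=16, data_width=8)
ram.clock(ena=1, wea=1, addr=3, dina=0xAB)      # write 0xAB to index 3
print(hex(ram.clock(ena=1, wea=0, addr=3)))     # read it back
print(hex(ram.clock(ena=1, wea=1, addr=4, dina=0xCD)))  # douta unchanged by the write
```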
Dual-Ported RAMs (1R + 1W)¶
Module Name: bram_1r1w
Verilog source: bram_1r1w.v
Ports:
Port Name | Direction | Description |
---|---|---|
clka | Input | Clock signal for port A. Operations are performed at the positive edge of the clock. |
ena | Input | Enable signal for port A. When high, indicates that the port is being used for write. |
wea | Input | Write enable for port A. When high, indicates that a write operation is being performed. |
addra | Input | Index address for port A that indicates the address for write |
dina | Input | Indicates the data for write operations |
clkb | Input | Clock signal for port B. Operations are performed at the positive edge of the clock. |
enb | Input | Enable signal for port B. When high, indicates that the port is being used for read. |
addrb | Input | Index address for port B that indicates the address for read |
doutb | Output | Holds the data for a read operation |
Instantiation Parameters:
Parameter Name | Description |
---|---|
DATA_WIDTH | Width of dina and doutb ports. |
ADDR_WIDTH | Width of addra and addrb ports. |
MEMSIZE | Depth of the RAM. The size of the instantiated BRAM will be MEMSIZE x DATA_WIDTH bits where the number of indices is equal to MEMSIZE and the number of bits at each index is equal to DATA_WIDTH. |
Read Operation: Port-B is used for performing reads. The address is written onto the addrb port, and the enb signal is driven high. In the next cycle, the doutb port will hold the data. Therefore, the read operations have a one cycle latency. Also, a new address can be given at every cycle (whose output will be obtained in the subsequent cycle).
Write Operation: Port-A is used for writes. The address is written onto the addra port, data to be written is driven on the dina port, and the ena and wea signals are asserted. At the next positive edge of clock the value at dina is written onto the address addra. Also, a new write operation can be initiated at every clock edge.
Read Write Conflicts: In case of a read and a write occurring to the same address at the same time, the write is guaranteed while the read data need not be valid.
Note
- Here port A is used for write, and port B is used for read operations. Also, the various enable and write enable signals are active high signals.
- The dual-ported rams assume the outputs are registered for reads.
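The conflict rule is the part of this interface that is easiest to get wrong when swapping in an SRAM macro. The Python sketch below models the 1R + 1W behaviour described above and marks the read data as undefined (None) when a read and a write hit the same address in the same cycle. It is a behavioural illustration only, with a hypothetical class name, a single shared clock for both ports and zero-initialised contents as assumptions.

```python
class Bram1R1W:
    """Cycle-level behavioural model of a 1R + 1W RAM (port A writes, port B reads)."""

    def __init__(self, memsize, data_width):
        self.mem = [0] * memsize            # initial contents assumed to be 0
        self.mask = (1 << data_width) - 1
        self.doutb = 0                      # registered read output of port B

    def clock(self, ena=0, wea=0, addra=0, dina=0, enb=0, addrb=0):
        """Apply one positive clock edge to both ports and return doutb."""
        writing = bool(ena and wea)
        if enb:
            conflict = writing and addra == addrb
            # On a same-address conflict only the write is guaranteed.
            self.doutb = None if conflict else self.mem[addrb]
        if writing:
            self.mem[addra] = dina & self.mask
        return self.doutb


ram = Bram1R1W(memsize=8, data_width=16)
ram.clock(ena=1, wea=1, addra=2, dina=0x1234)
print(hex(ram.clock(enb=1, addrb=2)))
print(ram.clock(ena=1, wea=1, addra=2, dina=0x55, enb=1, addrb=2))   # conflict
```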
Synthesis¶
Mapping to FPGAs¶
The single-ported RAMs (1RW) used in the caches are directly mapped to the true single-ported BRAMs provided by Xilinx.
The dual-ported RAMs (1R + 1W) used in branch predictors are directly mapped to the true dual-ported RAMs provided by Xilinx. Since the true dual-ported RAMs from Xilinx provide a (1RW + 1RW) configuration, our dual-ported instances ensure that portA is used for writes and portB is used only for reads (by ensuring the write-enable of portB is always held low).
The (* RAM_STYLE = "BLOCK" *) pragma in the verilog source makes it easy for Vivado to infer these as BRAMs and thus no edits are required in the source file.
Mapping to ASICs¶
For mapping to ASICs, the user has to replace the files bram_1rw and bram_1r1w with respective instances of SRAM modules which provide the same functionality as described above.
In case SRAM cells of the same size as that of the instantiations are not available, it is the onus of the user to bank/combine available SRAM cells into a top-module which has the same functionality as bram_1r1w or bram_1rw.
If an SRAM cell has more ports than the ones required in this document, the user is required to ensure they are driven accordingly to maintain the same functionality as described in this document.
Additionally, if a parameterized instance of the SRAMs cannot be developed by the user, it is the user’s responsibility to manually replace each instance of the RAMs in the design. For the c-class the instances are defined below:
C-Class Specific instances of RAMs.¶
The size and configuration of the RAMs instantiated in the design can be controlled at the BSV level at compile time using the YAML configuration files. For a quick reference of all 1RW/1R1W instances do the following in the verilog release:
$ grep "bram_1rw " mk*cache.v -A2
$ grep "bram_1r1w " mkbpu.v -A2
Instruction Cache¶
The variables below refer to the fields within the icache_configuration node in the YAML spec. VADDR refers to the XLEN and PADDR refers to the physical_addr_size in the YAML spec.
For Data Array
- instance path: mkicache/data_arr_*
- Total number of 1RW instances: dbanks x ways
- DATA_WIDTH per instance: (word_size x 8 x block_size)/dbanks
- MEMSIZE per instance: sets
- ADDR_WIDTH per instance: Log(sets)
For Tag Array
- instance path: mkicache/tag_arr_*
- Total number of 1RW instances: tbanks x ways
- DATA_WIDTH per instance: (PADDR - (Log(word_size) + Log(block_size) + Log(sets)))/tbanks
- MEMSIZE per instance: sets
- ADDR_WIDTH per instance: Log(sets)
Data Cache¶
The variables below refer to the fields within the dcache_configuration node in the YAML spec. VADDR refers to the XLEN and PADDR refers to the physical_addr_size in the YAML spec.
For Data Array
- instance path: mkdcache/data_arr_*
- Total number of 1RW instances: dbanks x ways
- DATA_WIDTH per instance: (word_size x 8 x block_size)/dbanks
- MEMSIZE per instance: sets
- ADDR_WIDTH per instance: Log(sets)
For Tag Array
- instance path: mkdcache/tag_arr_*
- Total number of 1RW instances: tbanks x ways
- DATA_WIDTH per instance: (PADDR - (Log(word_size) + Log(block_size) + Log(sets)))/tbanks
- MEMSIZE per instance: sets
- ADDR_WIDTH per instance: Log(sets)
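The sizing rules for the instruction and data caches are identical, so they can be evaluated with one helper when planning SRAM macros. The Python sketch below assumes that Log denotes the base-2 logarithm and that all sizes are powers of two. The configuration values in the example are hypothetical and are not taken from any YAML file in the repository.

```python
def ilog2(n):
    """Base-2 logarithm of a power of two."""
    if n <= 0 or n & (n - 1):
        raise ValueError("expected a power of two, got %r" % (n,))
    return n.bit_length() - 1


def cache_ram_params(word_size, block_size, sets, ways, dbanks, tbanks, paddr):
    """RAM instance parameters for the data and tag arrays of one cache."""
    tag_bits = paddr - (ilog2(word_size) + ilog2(block_size) + ilog2(sets))
    return {
        "data": {
            "instances": dbanks * ways,
            "data_width": (word_size * 8 * block_size) // dbanks,
            "depth": sets,
            "addr_width": ilog2(sets),
        },
        "tag": {
            "instances": tbanks * ways,
            "data_width": tag_bits // tbanks,
            "depth": sets,
            "addr_width": ilog2(sets),
        },
    }


# Hypothetical configuration: 4-byte words, 16 words per block, 64 sets, 4 ways.
params = cache_ram_params(word_size=4, block_size=16, sets=64, ways=4,
                          dbanks=1, tbanks=1, paddr=32)
print(params["data"])
print(params["tag"])
```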
Branch Predictors¶
RAMs will not be instantiated if the predictor option in the YAML config is set to gshare_fa. RAM instances for other values are described below.
The variables below refer to the fields within the branch_predictor node in the YAML spec. VADDR refers to the XLEN and PADDR refers to the physical_addr_size in the YAML spec.
With compressed support:
- Total number of 1R+1W instances: 2
- DATA_WIDTH per instance: (VADDR - Log(btb_depth)) + VADDR + 4
- MEMSIZE per instance: btb_depth/2
- ADDR_WIDTH per instance: Log(btb_depth/2)
- NOTE: One instance will have DATA_WIDTH + 1 bits.
Without compressed support:
- Total number of 1R+1W instances: 1
- DATA_WIDTH per instance: (VADDR - Log(btb_depth)) + VADDR + 3
- MEMSIZE per instance: btb_depth
- ADDR_WIDTH per instance: Log(btb_depth)
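The two branch-predictor cases above differ only in a few constants, which the following Python sketch makes explicit. It assumes that Log is the base-2 logarithm and that btb_depth is a power of two. The function name and the example values (VADDR of 64, btb_depth of 32) are hypothetical.

```python
def ilog2(n):
    """Base-2 logarithm of a power of two."""
    if n <= 0 or n & (n - 1):
        raise ValueError("expected a power of two, got %r" % (n,))
    return n.bit_length() - 1


def btb_ram_params(vaddr, btb_depth, compressed):
    """1R+1W RAM instance parameters for the branch predictor."""
    base_width = (vaddr - ilog2(btb_depth)) + vaddr
    if compressed:
        width = base_width + 4
        return {
            "instances": 2,
            "data_width": width,
            "wide_instance_data_width": width + 1,   # one instance is 1 bit wider
            "depth": btb_depth // 2,
            "addr_width": ilog2(btb_depth // 2),
        }
    return {
        "instances": 1,
        "data_width": base_width + 3,
        "depth": btb_depth,
        "addr_width": ilog2(btb_depth),
    }


print(btb_ram_params(vaddr=64, btb_depth=32, compressed=True))
print(btb_ram_params(vaddr=64, btb_depth=32, compressed=False))
```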
Physical Memory Protection (PMP)¶
The physical memory protection unit is integrated with the caches (data and instruction). The pmp-module implements permission checks region-wise as described in the riscv-privilege spec. See PMP for the configuration parameters available for the pmp support.
When pmp is disabled, then all pmp csrs are read as zeros.
When PMPEnable is zero, the PMP module is not instantiated and all PMP registers read as zero (regardless of the value of PMPNumRegions)
PMP Granularity¶
The PMP granularity parameter is used to reduce the size of the address matching comparators by increasing the minimum region size. For a 32-bit core the minimum granularity is 4 bytes and for a 64-bit core the minimum granularity is 8 bytes. This choice has been made to reduce the overheads of checking homogeneity of the access. Thus, for a 64-bit core NA4 is no longer available.
For Developers¶
This section describes the directory structure and other details for folks interested in hacking/modifying the core/generator scripts.
Directory Structure¶
c-class
├── bsvpath # file listing all the directories containing relevant bsv files
├── CHANGELOG.rst # contains the CHANGELOG of versions
├── configure # contains the python configuration scripts
├── CONTRIBUTING.md # guideline for making contributions
├── docs # all the documentation sources
├── LICENSE.* # License files
├── Makefile # makefile for compiling bsv files and linking using verilator
├── micro-arch-tests # contains a variety of directed tests
├── README.md # main doc readme
├── rename_translate.sh # bash script for manipulating verilog files
├── requirements.txt # list of all python packages required for configuring the core
├── sample_config # sample yaml configuration files
├── src # contains bsv source code of the C-class core
└── test_soc # contains a sample test-bench for simulation purposes
Upgrading dependencies¶
The core and test-soc use modules which are available in different repositories. This list of repositories is maintained in configure/constants.py under the variable: dependency_yaml.
The configurator uses the repo-manager package to clone and patch all relevant dependencies.
Changing Compile arguments¶
The bsc and verilator commands along with their arguments are stored in the configure/constants.py file under the variables: bsc_cmd and verilator_cmd respectively. These are directly used by the configurator to generate the makefile.inc file.
Adding Checks on YAML¶
The configurator also performs specific checks on the legality of the input yaml. Not all configurations are legal, and the legality check is performed by the function specific_checks in the configure/configure.py file. More checks should be added only to this function.
CHANGELOG¶
This project adheres to Semantic Versioning.
[2.0.0] - 2022-12-08¶
- Pipeline upgraded
- Rtldump changed to match with newer spike
- Updates made to use newer caches_mmu
- FPU support added
- Updates made to use newer devices
- Configure scripts changes made to build csrbox and riscv-config and use them
- Changes made to Debugger 1.00
- Verification updates and ci fixes
[1.10.0] - 2022-10-19¶
- Added UARTv2 changes
- Modified requirements.txt to use recent aapg
- Updated decoder to check for non-zero fs bits in mstatus for floating point instruction
- Updated decoder to check for valid rounding mode
- Fixed BPU to not give prediction at the start of fence operation.
- Upstreamed verification and updated timeout in ci
[1.9.9] - 2020-11-03¶
- Added c64, c32 design config yamls
- Removed obsolete csrs for MTIME and MTIMEH
[1.9.8] - 2020-09-23¶
- removed bram-2-bram paths from caches
- fixed rg_fs implementation for mstatus csr.
[1.9.7] - 2020-07-03¶
- license clean-ups
[1.9.6] - 2020-06-05¶
- put pmp related logic under ifdef pmp in ccore.bsv
- make the Addr_space configurable through YAML
- update schema_file comments for better readability
- reset value of mstatus.mie is 0 even if openocd is enabled.
- minimal comments updated in stage0
[1.9.5] - 2020-05-13¶
- removed the concept of extra history bits from gshare_fa
- added historybits as a new parameter to indicate the size of bits used from the ghr for indexing.
- reduced tick resolution in test_soc
- updated the 2 bit counter increment scheme to account for hysteresis bit separately
- updated the gshare hash function for improved collisions
- updated repomanager to 1.2.0
[1.9.4] - 2020-04-30¶
- parallel build using bluetcl is enabled
- remove re-alignment of bytes in ccore for I$ and D$ reads. This now is handled within the caches
- bumped version of the caches
- gitignore updated
- fixed and cleaned up the interrupt and delegation logic
- adding pre-requisite checks in configure
- default.yaml is picked up as default if no argument given to -ispec
- split interface of seip and meip. Both can now be driven by plic independently. Also led to removal of unwanted attributes.
[1.9.3] - 2020-04-30¶
- fixed reset logic handling in ccore.bsv to support reset by debugger.
- updated SoC to decouple debug related logic into a separate module. This now allows for easy reset control.
- the debug module in the test-soc is now always enabled irrespective of the debug being enabled or not
- Fixed minor bug in Makefile when compiling for GDB sim.
- moved debug loop and dtvec_base to 0x100
[1.9.2] - 2020-04-26¶
Fixed¶
- [docs] move pip install requirements to building core section
- [docs] fixed typos in simulation section and added dhrystone benchmarking method
- updating verification repo version to avoid dirname error
Changed¶
- renamed cclass to ccore at all instances
[1.9.1] - 2020-04-07¶
Fixed¶
- when pmps are not implemented then return 0 instead
- bug fixed in csr trap handler logic when only usertraps enabled without supervisor
- enable openocd macros in configure and clean up performance counter macro generation
- link verilator target for gdb compile fixed
- exit ci for patch updates
- adding missing supervisor and user macros in decoder to enable correct debug functionality
- 32-bit default config updated to new schema
Changed¶
- updated method and rule attributes related to csrs for cleaner compile
- using SizedFIFO instead of LFIFO to avoid unwanted scheduling
Removed¶
- removing old msb lsb files and replacing with a single file
- adding sections in ci file
[1.9.0] - 2020-04-03¶
Added¶
- pmp support fixed
- pmp support enabled in config
- adding iitm copyright in configure log
- adding pmp support documentation
- adding pipeline image in introduction
Changed¶
- changed schema of warnings to be a list
- defaulting to suppress all warnings
- removing old storebuffer module
- moving micro arch related chapters under a single micro-arch-notes chapter
Fixed¶
- adding dummy arprot field to remove warning
- rg_stall available only under multicycle macro
- corrected conditions under which pmpcfg and pmpaddr can be written
- fixed logic for pmp access permissions in decoder
[1.8.0] - 2020-04-01¶
Added¶
- integration with optimized 1rw dcache and icache
- support for ecc on both caches
- support for dual-ported RAMs in dcache
[1.7.3] - 2020-03-24¶
Added¶
- note to install and follow steps available on the original repositories for all external tools
[1.7.1] - 2020-03-10¶
Fixed¶
- Doc updates
- Use v7.0.1 of the caches with new bram interfaces
- Store being dropped in the commit stage should wait for the cache to be ready.
[1.7.0] - 2020-03-02¶
Changed¶
- config file is now yaml based
- docs moved to read-the-docs
- restructured directories. base-sim is no longer present. All tests have been moved to micro-arch-tests.
- LICENSE files have been upgraded
- common_types.bsv renamed to cclass_types.bsv
- common_params.bsv renamed to cclass_params.defines
- removed unwanted ifdef simulate macros
- Makefile has been updated to use the new configuration setup and the open-bsc tool henceforth.
- moved CHANGELOG to rst syntax
- modifications to use the new 1rw dcache with better freq closure.
- more comment updates in some modules
Added¶
- Added a new python based configuration setup
[1.6.1] - 2019-11-21¶
Fixed¶
- The indication of whether an instruction-page-fault was due to the lower-16 bits or the upper-16 bits has been fixed.
[1.5.0] - 2019-11-21¶
Added¶
- added support for ITIM and DTIM
- new csrs to define the address map of the ITIM and DTIM
- directed tests for performance counters and Tightly-integrated memories
- doc update for custom csrs of c-class done.
Fixed¶
- interrupt mask when debugger is enabled has been fixed.
[1.4.2] - 2019-11-08¶
Added¶
- macro for reset value of dtvec csr
- updated doc and template with the macro
[1.4.0] - 2019-10-28¶
Added¶
- support for WFI
- support for illegal trapping when tvm, tw and tsr registers are set in supervisor mode
- verilog artifacts now have rtldump support and logger support.
- 256MBytes of BRAM for verilog artifact simulation
Fixed¶
- made ADDR_SPACE as a variable in config file
- fixed parameters for linux template
- bumped verification version to 3.2.4
- access to csr 0x321 and 0x322 now generates trap
- bumping devices to 5.0.0 with new uart features.
- fixed verilator setup for gdb as well
- added suppresswarnings as part of the gitlab ci/cd
[1.3.3] - 2019-10-08¶
Fixed¶
- Illegal encodings were being treated as FCVT.D.S and FCVT.S.D. This has been fixed. Close #149
[1.3.1] - 2019-10-04¶
Fixed¶
- Floating point ops no longer generate traps when ARITH_TRAP is enabled but trapping is disabled through csr. Close #147
[1.3.0] - 2019-10-03¶
Added¶
- bumped to caches with ECC support. Added corresponding hooks and details in readme as well.
Fixed¶
- typos in readme fixed #138
- improved verilator build speed.
[1.2.3] - 2019-09-27¶
Fixed¶
- mie and mip widths fixed when compiling with debug mode enabled. refer to issue #144.
[1.2.2] - 2019-09-26¶
Changed¶
- tracking cache misses instead of hits. refer to issue #143 for more info.
- updated performance tests with encodings.
[1.2.0] - 2019-09-26¶
Fixed¶
- performance counter increment conditions and interrupt generation scheme. A counter will not increment if the respective interrupt has been set.
- the last daisy-module instantiated should respond with true and data=0
- fixed op-fwding bug mentioned in issue #140
- decoding performance counters is fixed now. refer issue #141
Added¶
- added tests and benchmarks for performance counters.
Removed¶
- removed redundant epoch register and method from stage4
[1.1.0] - 2019-09-16¶
Added¶
- CSRs are now daisy chained.
- Performance counters and their event encodings added.
- Interrupts for counters have also been added.
- Increased default bram size in TB to be 32MB. This has increased regression time but now the same executable can be used for linux sim as well
Fixed¶
- BRAM now uses only a single file, code.mem, for read-only. MSB and LSB files are no longer required.
- Updated docs to reflect new additions and fixes made above.
- renamed a few methods based on the coding guidelines.
[1.0.3] - 2019-09-10¶
Added¶
- makefile now uses bsvpath to identify directories for bsv source. This makes using vim-bsv easier.
[1.0.2] - 2019-09-10¶
Fixed¶
- rg_delayed_redirect register in stage0 should only be used when bpu and compressed both enabled.
[1.0.0] - 2019-09-09¶
Fixed¶
- data types of ISBs have been split to keep logic minimal and optimize frequency closure
- Logger is used in all submodules.
- macros and configurable options have been fixed to be more precise and granular
- stage0 or pc-fetch stage with fully-associative gshare has been fixed and tuned for higher frequency closure
- ALU has been further optimized for better frequency closure
- ISB types and operand forwarding tuned for better frequency closure.
- overall changes to remove trailing white-spaces from all files.
- version extraction based on CHANGELOG will be followed henceforth.
- fpu convert from dp to sp roundup conditions fixed.
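The entry on version extraction from the CHANGELOG can be sketched as follows. This is a minimal illustration, assuming headings of the form `[X.Y.Z] - YYYY-MM-DD` as used throughout this file and that the newest release is listed first; the function name and the regular expression are hypothetical and not taken from the repository's scripts.

```python
import re

# Matches release headings such as "[1.10.0] - 2022-10-19"
HEADING = re.compile(r"^\[(\d+)\.(\d+)\.(\d+)\]\s*-\s*(\d{4}-\d{2}-\d{2})")


def latest_version(changelog_text):
    """Return the topmost version heading as 'X.Y.Z', or None if absent."""
    for line in changelog_text.splitlines():
        match = HEADING.match(line.strip())
        if match:
            return ".".join(match.group(1, 2, 3))
    return None


sample = """\
[1.0.3] - 2019-09-10
Added
- makefile now uses bsvpath to identify directories for bsv source.
[1.0.2] - 2019-09-10
"""
print(latest_version(sample))  # prints 1.0.3
```

Reading the first matching heading keeps the CHANGELOG as the single source of truth for the version number, so a release only needs a new heading at the top of the file.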
Added¶
- decompressor function added in stage1
- reset-pc can now be controlled by the SoC as an input without having to compromise on synthesis boundaries
- retimed multiplier with configurable stages is used always.
- different multiplier modules for evaluation have also been added.
- fully-associative TLB support has also been added.
- configuration support to suppress all warnings during bsv compile
- CHANGELOG will be maintained from this release onwards.
Removed¶
- bimodal bpu support has been removed for now since it needs to be re-structured based on new interfaces and also requires new verilog-bram models
- gshare index model has also been removed along the same arguments as above.
- support for variable cycle multiplier has also been removed as part of this release.