#####################
Benchmarking the Core
#####################

The max DMIPS of the C-class core is **1.72 DMIPS/MHz**.

The max CoreMarks of the C-class core is **2.9 CoreMarks/MHz**.

The C-class core is highly configurable and therefore needs specific tuning to
reach its maximum performance. This document highlights some of the settings
and their respective benchmark numbers.

For the following benchmarks the C-class core has been configured using the
default.yaml available in the ``samples/`` folder.

.. note:: Make sure you are using gcc 9.2.0 or above to replicate the following results.

Benchmarking Dhrystone
======================

The following numbers have been obtained via simulation, with the number of
ITERATIONS fixed to 5000.

Flags used for compilation::

  -mcmodel=medany -static -std=gnu99 -O2 -ffast-math \
  -fno-common -fno-builtin-printf -march=rv64$(march) -mabi=lp64d \
  -w -static -nostartfiles -lgcc

When ``$march`` is ``rv64imac`` the DMIPS/MHz is **1.68**::

  Microseconds for one run through Dhrystone: 10.0
  Dhrystones per Second: 94652.0

When ``$march`` is ``rv64ima`` the DMIPS/MHz is **1.72**::

  Microseconds for one run through Dhrystone: 10.0
  Dhrystones per Second: 96216.0

Benchmarking CoreMarks
======================

The following numbers have been obtained via simulation, with the number of
ITERATIONS fixed at 100.

Flags used for compilation are available in the logs below.

When ``$march`` is ``rv64imac`` the CoreMarks/MHz is **2.84**::

  2K performance run parameters for coremark.
  CoreMark Size    : 666
  Total ticks      : 35205197
  Total time (secs): 35
  Iterations/Sec   : 2
  Iterations       : 100
  Compiler version : riscv64-unknown-elf-9.2.0
  Compiler flags   : -mcmodel=medany -DCUSTOM -DPERFORMANCE_RUN=1 -DMAIN_HAS_NOARGC=1 \
                     -DHAS_STDIO -DHAS_PRINTF -DHAS_TIME_H -DUSE_CLOCK -DHAS_FLOAT=0 \
                     -DITERATIONS=10 -O3 -fno-common -funroll-loops -finline-functions \
                     -fselective-scheduling -falign-functions=16 -falign-jumps=4 \
                     -falign-loops=4 -finline-limit=1000 -nostartfiles -nostdlib -ffast-math \
                     -fno-builtin-printf -march=rv64imac -mexplicit-relocs
  Memory location  : STACK
  seedcrc          : 0xe9f5
  [0]crclist       : 0xe714
  [0]crcmatrix     : 0x1fd7
  [0]crcstate      : 0x8e3a
  [0]crcfinal      : 0x988c
  Correct operation validated. See README.md for run and reporting rules.

When ``$march`` is ``rv64ima`` the CoreMarks/MHz is **2.897**::

  2K performance run parameters for coremark.
  CoreMark Size    : 666
  Total ticks      : 34516277
  Total time (secs): 34
  Iterations/Sec   : 2
  Iterations       : 100
  Compiler version : riscv64-unknown-elf-9.2.0
  Compiler flags   : -mcmodel=medany -DCUSTOM -DPERFORMANCE_RUN=1 -DMAIN_HAS_NOARGC=1 \
                     -DHAS_STDIO -DHAS_PRINTF -DHAS_TIME_H -DUSE_CLOCK -DHAS_FLOAT=0 \
                     -DITERATIONS=100 -O3 -fno-common -funroll-loops -finline-functions \
                     -fselective-scheduling -falign-functions=16 -falign-jumps=4 \
                     -falign-loops=4 -finline-limit=1000 -nostartfiles -nostdlib -ffast-math \
                     -fno-builtin-printf -march=rv64ima -mexplicit-relocs
  Memory location  : STACK
  seedcrc          : 0xe9f5
  [0]crclist       : 0xe714
  [0]crcmatrix     : 0x1fd7
  [0]crcstate      : 0x8e3a
  [0]crcfinal      : 0x988c
  Correct operation validated. See README.md for run and reporting rules.

Why Do Compressed Binaries Perform Worse on C-class?
====================================================

The numbers above show that, for the same branch-predictor configuration,
enabling compressed instructions slightly reduces performance (1.68 versus
1.72 DMIPS/MHz). This is because of how the fetch stage (stage1) has been
designed.

The fetch stage always expects the I$ to respond with a 32-bit word which is
4-byte aligned. Since the 32-bit word can hold up to two 16-bit compressed
instructions, the predictor also always presents two predictions: one for
``pc`` and one for ``pc+2``. While analysing the 32-bit word from the I$ the
following scenarios can occur:

* **Case-1**: The entire word is a 32-bit instruction. In this case the entire
  word and the prediction for ``pc`` are sent to the decode stage.
* **Case-2**: The word contains two 16-bit instructions. In this case, in the
  first cycle the lower 16 bits of the word and the prediction for ``pc`` are
  sent to the decode stage. In the next cycle the upper 16 bits and the
  prediction for ``pc+2`` are sent to the decode stage.
* **Case-3**: The lower 16 bits need to be concatenated with the upper 16 bits
  of the previous I$ response. In this case a new 32-bit instruction is formed
  and the prediction of the previous response is sent to the decode stage.
* **Case-4**: Only the upper 16 bits of the I$ response need to be analysed. If
  the upper 16 bits are a compressed instruction, then they and the prediction
  for ``pc+2`` are sent to the decode stage. If, however, the upper 16 bits are
  the lower part of a 32-bit instruction, then the fetch stage needs to wait
  for the next I$ response and apply the Case-3 scheme. One lands in this case
  when there is a jump to a 32-bit instruction placed at a 2-byte boundary.

Now that we understand how the fetch stage works, assume that all the
Dhrystone code fits within the I$ (i.e. no misses) and that the predictor is
well trained and provides only correct predictions. Consider the following
sequence from Dhrystone:

.. code-block:: bash

   ...
   8000106e: 0x00001797 auipc a5,0x1
   ...
   ...
   ...
   800010d8: 0xf97ff0ef jal ra,8000106e
   ...

Each time the ``jal`` instruction is executed, the fetch stage enters Case-4:
the upper 16 bits of the 32-bit word at ``0x8000106c`` are the lower part of
the 32-bit instruction starting at ``0x8000106e``. The fetch stage must wait
for the next I$ response before it can form the ``auipc`` instruction, which
leads to a single-cycle stall in sending it to the decode stage. In Dhrystone
this kind of sequence occurs in 3 places in each iteration, and each occurrence
incurs the single-cycle delay, hence the reduced performance with compressed
support.
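
The four fetch-stage cases can be sketched as a small behavioural model. This
is only an illustrative Python sketch, not the core's actual RTL: the function
names (``is_compressed``, ``fetch``), the tuple layout, and the ``c.nop``
filler parcels (``0x0001``) surrounding the ``auipc`` are assumptions made for
the example. It also assumes that the upper half left over after a Case-3
response is handled the same way as Case-4. The only ISA fact it relies on is
that a 16-bit parcel whose two lowest bits are not ``0b11`` is a compressed
instruction.

.. code-block:: python

   def is_compressed(parcel):
       """A 16-bit parcel is compressed when its two low bits are not 0b11."""
       return (parcel & 0b11) != 0b11


   def fetch(words, entry_pc):
       """Walk 4-byte-aligned I$ responses, starting at entry_pc.

       words[0] is the aligned word that contains entry_pc.
       Returns (response_index, case, pc, instruction) tuples in the
       order they would be sent to the decode stage.
       """
       base = entry_pc & ~0x3
       skip_lower = (entry_pc & 0x2) != 0  # jump landed on the upper half
       pending = None                      # lower half of a split instruction
       sent = []
       for i, word in enumerate(words):
           addr = base + 4 * i
           lo, hi = word & 0xFFFF, (word >> 16) & 0xFFFF
           if pending is not None:
               # Case-3: join with the upper half of the previous response.
               sent.append((i, "case-3", addr - 2, (lo << 16) | pending))
               pending = None
               upper_case = "case-4"       # assumption, see lead-in
           elif skip_lower:
               # Case-4: only the upper half of this response is relevant.
               skip_lower = False
               upper_case = "case-4"
           elif not is_compressed(lo):
               # Case-1: the whole word is one 32-bit instruction.
               sent.append((i, "case-1", addr, word))
               continue
           else:
               # Case-2: the lower half is a compressed instruction.
               sent.append((i, "case-2", addr, lo))
               upper_case = "case-2"
           if is_compressed(hi):
               sent.append((i, upper_case, addr + 2, hi))
           else:
               pending = hi                # wait for the next I$ response
       return sent


   # auipc (0x00001797) at the 2-byte boundary 0x8000106e, as in the listing.
   # The 0x0001 parcels around it are hypothetical filler.
   misaligned = fetch([0x17970001, 0x00010000], 0x8000106E)
   # The same instruction at a hypothetical 4-byte-aligned address.
   aligned = fetch([0x00001797], 0x80001070)

   print(misaligned[0][0])  # -> 1: auipc is only formed from the second response
   print(aligned[0][0])     # -> 0: auipc is sent from the first response

In this model the misaligned target needs two I$ responses before the
``auipc`` can be sent to decode, while the aligned target needs one. That extra
response corresponds to the single-cycle stall described above.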