September 2, 2023
| Version History | Changes | 
|---|---|
| Version 0.9 | Initial Version | 
| Version 0.91 | Fix typos | 
| Version 0.92 | Add info on how to use the PCM tool to get RAPL power data for CPU/DRAM | 
On August 28, 2023, we released a report analyzing the performance and power efficiency of BladeBit CUDA leveraging the BladeBit CUDA simulator. The report provided our data points for the test systems in our lab and serves as some baseline measurements for comparison.
This short guide explains how to use the simulator to estimate the compression level your CPU/GPU will support for BladeBit CUDA, and how to gather some helpful performance diagnostics along the way.
We use Ubuntu for this overview.
Prerequisites:
BladeBit CUDA binary
BladeBit CUDA Compressed farm plots available for the compression levels to be tested
For GPU plotting: Nvidia GPU with 8 GB of VRAM and support for CUDA Compute Capability 5.2.
For CPU plotting: Sufficient memory; consult Table 1, which is based on our testing.
The number of BladeBit software threads and the compression level determine the total memory required. For example, testing C7 farms with 32 threads needs 10.4 GiB of RAM.
Table 1: System Memory Used (GiB), CPU Farming
| Threads | C1 | C2 | C3 | C4 | C5 | C6 | C7 | 
|---|---|---|---|---|---|---|---|
| 16 | 0.1 | 0.2 | 0.3 | 0.7 | 1.3 | 2.6 | 5.2 | 
| 24 | 0.1 | 0.2 | 0.5 | 1.0 | 2.0 | 3.9 | 7.8 | 
| 32 | 0.2 | 0.3 | 0.7 | 1.3 | 2.6 | 5.2 | 10.4 | 
| 48 | 0.3 | 0.5 | 1.0 | 2.0 | 3.9 | 7.8 | 15.6 | 
| 96 | 0.5 | 1.0 | 2.0 | 3.9 | 7.8 | 15.6 | 31.3 | 
| 144 | 0.8 | 1.5 | 3.0 | 5.9 | 11.7 | 23.5 | 46.9 | 
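As a convenience, the values in Table 1 can be wrapped in a small lookup helper so a test script can check memory needs before starting a CPU run. The sketch below is only an illustration: the function name `table1_mem_gib` is hypothetical, the numbers are copied from Table 1, and only the thread counts listed in the table are covered (no interpolation is attempted).

```shell
#!/usr/bin/env bash
# table1_mem_gib THREADS LEVEL
# Prints the measured system memory use (GiB) from Table 1 for CPU farming.
# Hypothetical helper: returns non-zero if the thread count or compression
# level is not one of the combinations listed in Table 1.
table1_mem_gib() {
    local threads=$1 level=$2
    awk -v t="$threads" -v c="$level" '
        # Column 1 is the thread count; column c+1 holds compression level c.
        $1 == t && c >= 1 && c <= 7 { print $(c + 1); found = 1 }
        END { exit !found }
    ' <<'EOF'
16 0.1 0.2 0.3 0.7 1.3 2.6 5.2
24 0.1 0.2 0.5 1.0 2.0 3.9 7.8
32 0.2 0.3 0.7 1.3 2.6 5.2 10.4
48 0.3 0.5 1.0 2.0 3.9 7.8 15.6
96 0.5 1.0 2.0 3.9 7.8 15.6 31.3
144 0.8 1.5 3.0 5.9 11.7 23.5 46.9
EOF
}
```

For example, `table1_mem_gib 32 7` prints `10.4`, which can then be compared by hand with the free memory reported by `free -g` on the system under test.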
We ran our tests with several Bash scripts that automate data collection and parsing across a wide variety of runs. There are many ways to automate data extraction and parsing; this write-up provides a simplified example as an introductory guide. Additional automation and command parameters can be added as needed.
First, it's necessary to make a plot for your desired compression level. The command below creates a C7 plot:

    ./bladebit_cuda -f $FARMER -c $POOL --compress 7 cudaplot /path/to/plot_storage

(Use your farmer public key for $FARMER and your pool contract address for $POOL, and replace the default /path/to/plot_storage above with the location for the plot storage.)
At a minimum, all that is required to run the simulator is the ./bladebit_cuda command with your chosen options and a compressed plot file.
If you would also like to gather CPU, GPU, memory, disk, and other utilization data, the following simple script collects it alongside the run.
Sample simulator run script with basic data collection for sar and nvidia-smi:
runSimulation.sh

    # DURATION is set to 600 seconds or 10 minutes
    # THREADS is the number of CPU threads you want to use. (For GPU plotting, the script below
    # is hard-coded to 1 thread.)
    # SIZE is the farm size, e.g. 250TB, 500TB, 750TB, 1PB, 2PB, etc.
    # Change the plotfile to a C1,C2,C3,C4,C5,C6,C7 plot to simulate that type
    DURATION=600
    THREADS=8
    SIZE=500TB
    PLOTFILE=plot-k32-c07-2023-08-13-09-01-eaa809914afde2e5c2d0[etc].plot

    # For CUDA simulations
    # Run sar (prereq: sudo apt-get install sysstat)
    sar -A 1 $DURATION -o sar-data.bin > /dev/null 2>&1 &

    # Optional: Nvidia GPU stats, run in the background while the benchmark goes (See below for script)
    ./getGpu.sh &

    # Start the CUDA simulation, use 1 thread
    ./bladebit_cuda simulate --power $DURATION -p 1 --size $SIZE $PLOTFILE

    # Workload done, stop the GPU monitoring. Raw data is in nvidia-log.txt
    pkill getGpu.sh

    # For CPU-based simulations, set --no-cuda in bladebit_cuda syntax
    # Run sar (prereq: sudo apt-get install sysstat)
    sar -A 1 $DURATION -o sar-data-nocuda.bin > /dev/null 2>&1 &
    ./bladebit_cuda simulate --no-cuda --power $DURATION -p $THREADS --size $SIZE $PLOTFILE

To easily log the run output to a file, one option is to use the script command, for example:

    script
    ./runSimulation.sh
    [Hit Control D to end the script session; the resulting BladeBit CUDA run log is in a file called "typescript".]

To clean up the typescript to remove any extraneous ^M characters, run: "dos2unix -f typescript" (Prereq: sudo apt-get install dos2unix)
If you ssh to the system under test, you may also consider using screen (sudo apt install screen) to avoid issues with ssh sessions getting closed due to network inactivity / connection loss. 
Thus, a full run could look like:

    screen
    script
    ./runSimulation.sh
    [Hit Control D when the run is over to capture the output from script]

Full guides on how to use screen are available online; however, a few basic useful commands to get started are:

    (Inside screen)
    Control a c     (Open a new screen window)
    Control a "     (Choose which screen window to switch to)
    Control a d     (Detach from your screen window)
    exit            (Close the current screen window)

    (Outside of screen)
    screen -r       (Resume a previous session)
    screen -dr      (Disconnect an active screen session and resume it elsewhere)

getGpu.sh (Script to get Nvidia GPU data)

    # getGpu.sh
    # This script runs until it's killed. It takes measurements from nvidia-smi and
    # appends the data lines to an nvidia-log.txt file. Data is collected once a second.
    # Recommend to uncomment the line below to clear out any previous run data
    # rm nvidia-log.txt
    # (The nvidia-smi command below should be on one line or properly split between lines)
    while true
    do
        nvidia-smi --query-gpu=timestamp,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv >> nvidia-log.txt
        sleep 1
    done

Sample nvidia-log.txt

    timestamp, temperature.gpu, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
    2023/09/01 18:51:52.953, 55, 0 %, 0 %, 8192 MiB, 1349 MiB, 6615 MiB
    ...

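A log in this format can be reduced to a few headline numbers with awk. The following is a minimal sketch, not part of our test scripts: the function name `summarize_gpu_log` and the output format are hypothetical, and it assumes the column order shown in the sample above (GPU utilization in column 3, memory used in column 7) with fields separated by a comma and a space.

```shell
#!/usr/bin/env bash
# summarize_gpu_log [FILE]
# Hypothetical helper: prints the number of samples, the average GPU
# utilization, and the peak GPU memory used from a CSV log in the
# format shown above. FILE defaults to nvidia-log.txt.
summarize_gpu_log() {
    awk -F', ' '
        /^timestamp/ { next }                    # skip header lines (may repeat)
        NF >= 7 {
            n++
            util += $3 + 0                       # "37 %" is read as 37
            if ($7 + 0 > peak) peak = $7 + 0     # "6615 MiB" is read as 6615
        }
        END {
            if (n == 0) { print "no samples"; exit 1 }
            printf "samples=%d avg_gpu_util=%.1f%% peak_mem_used=%dMiB\n", n, util / n, peak
        }
    ' "${1:-nvidia-log.txt}"
}
```

Running `summarize_gpu_log nvidia-log.txt` after a simulation gives a quick view of how busy the GPU was before importing the full log into a spreadsheet.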
RAPL
To get more insights into CPU and DRAM power consumption, the Intel Running Average Power Limit ("RAPL") CPU counters can be queried over time to monitor the power consumption of the CPU and memory. 
One easy way to get this information is to install the PCM tools from Intel, available at: https://github.com/intel/pcm

    # If cmake is not installed, first do: sudo apt-get install cmake
    git clone --recursive https://github.com/opcm/pcm.git
    cd pcm
    mkdir build
    cd build
    cmake ..
    cmake --build . --parallel

    # To get the RAPL data
    cd bin
    sudo ./pcm-power | grep Watts

    # Sample data after initial diagnostic printouts:
    S0; Consumed energy units: 354349; Consumed Joules: 21.63; Watts: 21.67; Thermal headroom below TjMax: 68
    S0; Consumed DRAM energy units: 103613; Consumed DRAM Joules: 6.32; DRAM Watts: 6.34

3. Extracting data from sar

The sar utility can provide very detailed information regarding the performance of the system. This section gives a short overview of how to interact with the sar data that's been collected.

At a high level, the method is to log all sar data to one .bin file (e.g. "sar-data.bin" in this example) and then selectively extract portions of interest into separate text files for easier processing.

CPU utilization data:

    sar -u -f sar-data.bin > cpu-utilization.txt

    03:21:47 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
    03:21:48 PM     all     55.34      0.00      0.56     19.55      0.00     24.55
    03:21:49 PM     all     23.88      0.00      0.19     30.24      0.00     45.70
    03:21:50 PM     all     96.63      0.00      0.25      0.87      0.00      2.25

For easier import into tools like Excel, it can be useful to make these fields comma separated (CSV). This can be readily done with the line below. The tr command portion replaces each run of space characters in the sar output with a single comma.

    sar -u -f sar-data.bin | tr -s ' ' ',' > cpu-utilization-csv.txt

Many other performance metrics can be extracted from the sar file such as:

Paging data (sar -B -f sar-data.bin)

Block device data for the hard-drives/NVME devices (sar -d -f sar-data.bin)

Memory data (sar -r -f sar-data.bin)

Network device data (sar -n DEV -f sar-data.bin)

Or extract all possible data by doing (sar -A -f sar-data.bin)

The sar data can be readily imported into other tools such as Excel for graphing and additional analysis.

4. Interpreting BladeBit CUDA Results

After the BladeBit CUDA run, the simulator prints a summary of the run such as the sample below.

Several key fields of interest are "Worst plot lookup lookup time" and "Average full proof lookup time".

The current Chia guide suggests keeping the maximum lookup time at around 5 seconds and, if the times exceed 10 seconds, getting a more powerful CPU/GPU or using a lower compression level.

The sample results below are representative of attempting to use a compression level and farm size that are too high for the hardware.

The estimated largest farm sizes reported by the tool seem to be larger than the response time data would suggest is reasonable, so more investigation may be needed before relying upon that portion of the results.

Sample BladeBit CUDA simulator results:

    [Simulator for harvester farm capacity for K32 C7 plots]
     Random seed: 0x1c563ba8c9755e0269e0751810e21065a894269a5e0c964d1157575e0ad00cb2
     Simulating...

     Context count                 : 24
     Thread per context instance   : 0
     Memory used                   : 8007.2MiB ( 7.8GiB )
     Proofs / Challenges           : 1242 / 1536 ( 80.86% )
     Fetches / Challenges          : 785 / 1536
     Filter bits                   : 512
     Effective partials            : 23 ( 2.93% )
     Total fetch time elapsed      : 227.820 seconds
     Average plot lookup time      : 0.290 seconds
     Worst plot lookup lookup time : 128.588 seconds
     Average full proof lookup time: 54.967 seconds
     Fastest full proof lookup time: 14.629 seconds

    *** Warning *** : Your worst plot lookup time of 128.588 was over the maximum set of 8.000.

     compression | plot count | size TB    | size PB
    ------------------------------------------------
     C7          | 14113      | 1182       | 1.18

5. Conclusions

This write-up provided an example of how to run the BladeBit CUDA simulator tool to estimate how well a Chia farmer's CPU or GPU could handle a given farm size and compression level. While the simulator tool itself provides useful output data, it is also often helpful to dig deeper into the data, e.g. from sar or nvidia-smi, to get additional insights. The sample scripts provided here offer a basic starting place for those who would like to evaluate BladeBit CUDA performance on their CPU/GPU hardware.

Contact info: tech@ [this website address]
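
For readers who want to automate the lookup-time check described under Interpreting BladeBit CUDA Results, the sketch below reads a saved simulator log (for example the "typescript" file) and compares the worst plot lookup time with the 5 second and 10 second guidance quoted there. It is an illustration only: `check_lookup_times` is a hypothetical helper, the verdict labels are our own shorthand, and it assumes the log contains a "Worst plot lookup" line in the format shown in the sample results.

```shell
#!/usr/bin/env bash
# check_lookup_times [FILE]
# Hypothetical helper, not part of BladeBit. Reads a simulator log
# (default: typescript) and classifies the worst plot lookup time:
#   ok        at or below 5 seconds
#   marginal  above 5 and up to 10 seconds
#   too slow  above 10 seconds
check_lookup_times() {
    awk -F':' '
        # Matches the result line only, not the "*** Warning ***" line.
        /^ *Worst plot lookup/ { worst = $2 + 0; seen = 1 }
        END {
            if (!seen) { print "no lookup time found"; exit 2 }
            if (worst > 10)     verdict = "too slow"
            else if (worst > 5) verdict = "marginal"
            else                verdict = "ok"
            printf "worst=%.3fs verdict=%s\n", worst, verdict
        }
    ' "${1:-typescript}"
}
```

A "too slow" verdict corresponds to the case where the guidance is to use more powerful hardware or a lower compression level.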