Hello, World!
Bitcoin Mining: The process that legitimizes Bitcoin transactions by performing work, thereby linking Bitcoin to human productive effort. More concretely, Bitcoin mining is the process of assembling (guessing) a blockchain entry whose hash is less than some target value; the target is adjusted to the network hash rate (guesses per unit time) so that one guess succeeds about every 10 minutes.
Even more concretely, mining is a process of taking 640 bits from external sources and running them through 340 computation (scrambling) stages–where information is lost at each–to produce a 256 bit value that is simultaneously unique to the starting value and cannot be used to compute the starting value, and repeating this until the output meets an externally determined criterion.
The algorithm used is a back-to-back SHA256. The result is a “random” bit sequence. The binary sequence looks random, but was still deterministically generated and is unique to the input sequence.
My goal is to implement this algorithm in an FPGA.
Of the 640 bits that make up the Bitcoin header, 32 of them are a “nonce”: bits that can be arbitrarily set. The other 608 bits actually have meaning in terms of transaction details, so the easiest approach is to take those 608 bits as a constant and increment through the nonce “space” (set of all possible values). So that’s what I’m starting with.
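A minimal Python reference model makes the loop concrete (the actual design is HDL; this is just to pin down the computation). The function names, and the assumption that the 608 fixed bits arrive as 76 bytes with the nonce serialized little-endian at the end, are mine:

```python
import hashlib
import struct

def double_sha256(data: bytes) -> bytes:
    """Bitcoin's hash function: SHA256 applied twice in a row."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def sweep_nonce(header_608: bytes, target: int, start=0, end=2**32):
    """Append each 32-bit nonce to the fixed 76-byte (608-bit) portion of
    the header; return the first nonce whose hash beats the target."""
    for nonce in range(start, end):
        header = header_608 + struct.pack("<I", nonce)  # nonce is little-endian
        digest = double_sha256(header)
        # compare the digest, read as a little-endian integer, to the target
        if int.from_bytes(digest, "little") < target:
            return nonce
    return None
```

An FPGA replaces this sequential loop with a pipeline that retires one guess per clock cycle, which is the whole point of the exercise.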
This digital design should: (1) mine Bitcoin, and (2) be auditable*.
The Design
You might think the SHA256 algorithm is the smallest “chunk” of computation, and consequently the best place to start. I did.
The algorithm is a natural starting point, but won’t end up being the top level module in my hierarchy. It’s not the lowest level, either. Additional blocks are necessary to receive data and actually run the algorithm.
This is complicated by two facts: Bitcoin calls the function twice in a row to compute the hash, and the SHA algorithm itself only operates on 512 bit chunks. For longer data messages, the algorithm is iteratively executed: the resulting hash and working variables of the first chunk are mixed into the input of the second chunk, and so on. Creating an HDL module that can process multiple chunks and then invoking it twice will result in duplicate code. Which will haunt my dreams.
If I make the working variables an input into my “SHA” module, I can re-use it for large messages. This is particularly practicable since the parameters of the message are constrained by the Bitcoin protocol. Every message will be 640 bits long.
Note: for those programmers unfamiliar with the Realm of Digital Design, HDL is NOT software. HDL describes a logical circuit as hardware. (See Note 2)
Processing the SHA “Chunk”
The SHA Algorithm has four steps, but breaks down into two distinct computational stages:
- [SETUP] Scrambling the inputs (the message schedule and working variables).
- [MIXING] Scrambling the Working Variables (repeatedly)
The module takes as input 16 Words (16W×32b/W = 512 Bits = 64 Bytes) of data and 8 Words (256 Bits) of hash values. The input data is the “chunk” of the original message; the hash values are either taken from the output of the previous chunk (if available), or are constants taken from the specification (for the first chunk of the message).
Each computation stage makes use of binary rotations and XOR functions to mix the data from the original message together, like kneading bread. First, the 16 input words are used to generate 48 more “scrambled words” through this process of mixing and combining, resulting in a “Message Schedule” of 64 Words. At each stage the logic gates cause a loss of information; this is what makes the function “one-way”.
This continues during the second step: for each word in the message schedule–from the first word to the last–that word is mixed into the working variables, which themselves are rotated, ANDed, and XORed together. Pieces of the message are taken from the message schedule and mixed in each time. The output will be unique to the input message.
Once this process is complete, the output hash is generated by taking the final values of the working variables and adding them to the initial values (the input hashes). These sums are concatenated into the output hash. They may also be used as inputs into another instance of the SHA block.
I compartmentalized the SHA Module into two submodules: the Scheduler and the Iterator.
The Scheduler iteratively generates the message schedule, and the Iterator performs the 64 “mixing” stages. You’ll notice each has staging registers in parallel to the computation chain that do nothing but pass data down the line; I deliberately added these in an attempt to utilize one of the unique advantages of Digital Logic: pipelining.
Software requires data be shuffled into and out of memory whenever it is used. This is resource efficient, since memory elements can be re-used; but time inefficient–since the memory elements have to be pipelined in time (instead of in space). Creating a custom digital design allows the engineer to choose which (and to what degree) he wants to prioritize: resource utilization, latency, or frequency.
My SHA module requires 112 stages of computation: 48 for the iterative generation of the Message Schedule and another 64 for scrambling the working variables. Since this is going into gateware (similar to implementing the physical circuit in logic gates) ingest and egest don’t require a computation. Though they may require an extra clock cycle (or two) to shove the input/output data into a register (See Note 3).
I’m maxing out the pipelining (frequency) of my signal chain at the expense of everything else. I intend to pipe in new data every single clock cycle, which means my hashrate per Hasher Module will be equal to my clock frequency (the Xilinx FPGAs I’m using have a theoretical upper bound of 1 GHz). If I naively shove the un-manipulated data into a register, it will be immediately overwritten the next clock cycle when the new data is ingested. I need enough memory to hold as many values as there are stages. My first header reaches the end of the Scheduler after 48 clock cycles. The data associated with it needs to be held for those 48 cycles and spat out alongside the processed data. Hence the Delay FIFOs.
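Behaviorally, a Delay FIFO is just a fixed-depth shift register. A Python sketch of that behavior (class and method names are mine; the real thing is HDL registers or block RAM):

```python
from collections import deque

class DelayFifo:
    """Fixed-depth shift register: whatever is clocked in on cycle t
    emerges on cycle t + depth, so side data (a nonce, say) stays
    aligned with the hash working its way down the compute pipeline."""
    def __init__(self, depth, fill=None):
        self.regs = deque([fill] * depth, maxlen=depth)

    def clock(self, value):
        out = self.regs[0]        # the value clocked in `depth` cycles ago
        self.regs.append(value)   # maxlen discards the oldest entry
        return out
```

A 48-deep instance of this, clocked in lockstep with the Scheduler, is exactly the “staging registers in parallel to the computation chain” mentioned above.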
Bitcoin SHA & the HeaderHash Module
The Bitcoin system requires three SHA blocks. Bitcoin uses a double SHA256, calling the function twice successively.
The first invocation requires two chunks. Since the algorithm operates on 512 bit chunks, and 640 is not a multiple of 512, the Bitcoin header requires some fields to be added to the message. The original message is appended with a single ‘1’ bit, and a 64 bit length is placed at the end. Zeros are inserted between the ‘1’ and the length until the total message length is a multiple of 512. The message can then be split into 512 bit chunks and fed piecemeal into the algorithm.
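For the 640 bit header the arithmetic works out to 640 + 1 + 319 zero bits + 64 = 1024 bits, i.e. exactly two chunks. A quick Python sketch of the padding step (function name is mine, logic per FIPS 180-4):

```python
def pad_message(message: bytes) -> list[bytes]:
    """SHA256 padding: append a '1' bit (the 0x80 byte), zero-fill until
    8 bytes short of a 512-bit boundary, then append the 64-bit big-endian
    bit length. Returns the padded message split into 64-byte chunks."""
    bit_len = len(message) * 8
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded) % 64) % 64)
    padded += bit_len.to_bytes(8, "big")
    return [padded[i:i+64] for i in range(0, len(padded), 64)]
```

Since every Bitcoin header is exactly 80 bytes, the Preprocessor can hard-wire this: the pad bits and the length field are constants in gateware.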
The second invocation operates on the output of the first. It only has to ingest a 256 bit message–and only needs one SHA Chunk to finish.
The HeaderHash module…ingests a header and egests a hash.
More explicitly, it accepts the 640 bit raw header data, performs the necessary additions and padding in the Preprocessor Module (Figure 10), then routes the data through three successive SHA blocks and the Delay FIFO.
It’s a fairly trivial task to split up the input fields and concatenate the output fields in HDL. Computing all three SHA chunks requires 1+113+113+113 = 340 clock cycles. If I’m able to run at the theoretical upper limit switching speed of 1GHz, that’s a latency of 340 ns.
Hasher Unit
Moving from the HeaderHash Module up to the Hasher Unit marks the departure from the “computational” portion of the design and into a more “logistical” train of thought. In order to use the computational chain we’ve built, we have to feed real data into it. That data has to come from the blockchain, the mempool, a timeserver; we also have to append the nonce and increment through all possible values–which implies some “control” logic to start, stop, and reset everything depending on various flags.
The Hasher Unit surrounds the HeaderHash block, and its purpose is to compute SHAs of the Bitcoin header. I decided to offload construction of the header to a software component. Eventually some (potentially all) of this functionality will be moved into the FPGA. For now I’m focusing on the minimum viable proof-of-concept. I’m currently developing this module, so the details are still being ironed out.
The Hasher Unit takes a mostly pre-constructed header, appends an incrementing Nonce Value, and checks for success (the output hash is less than the target threshold). This requires reconstruction of the Nonce Value on success, because it’s not passed through the module in a Delay FIFO like some of the other values. That seemed resource intensive and unnecessary.
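One way to do that reconstruction (this is my reading of the scheme, not necessarily the exact logic in the repo): the nonce counter increments once per clock, and the pipeline latency is fixed, so the nonce that produced an emerging hash is the current counter minus the latency, modulo 2^32. As a sketch, with the latency figure taken from this design:

```python
PIPELINE_LATENCY = 340  # cycles from header in to hash out, per this design

def check_and_reconstruct(hash_le_int, target, counter):
    """If the hash emerging this cycle beats the target, recover the nonce
    that produced it: the counter has advanced one per cycle since that
    header entered the pipeline, so back it off by the latency (mod 2^32)."""
    if hash_le_int < target:
        return (counter - PIPELINE_LATENCY) & 0xFFFFFFFF
    return None
```

This trades a 32-bit subtractor for a 340-deep, 32-bit-wide Delay FIFO, which is the resource saving alluded to above.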
I’d also like to note that this module is designed to run continuously; it doesn’t need to be restarted when parameters change (new header, new start/end nonce). The nonce range can be set dynamically; whenever something changes the data clocked into the HeaderHash pipeline will change, but the existing data will continue propagating through. It may not be useful, but clearing it won’t reduce the latency of the new data.
I haven’t finished this module yet. My first design will only have one Hasher Unit instance.
The Miner, The Top Level
The top level of the design is pretty basic, and by “basic” I mean not fleshed out. Though it should be relatively simple, since the functionality is organized into a hierarchy of modules. And integration is always the easiest part, right?†
In order to function, the design needs an external interface to the software (or Bitcoin network). There needs to be a control module that directs and parallelizes the data into multiple Hasher Units; resultant data has to be fed back to the network via the external interface.
Future Directions
I’m currently working on the Hasher Unit–and so far I’ve only tested things in simulation. My next tasks are:
- Implement and “Verify” the Hasher Unit (Designer Simulation)
- Implement and Verify the top level (Designer Simulation) with a basic interface and a single hasher unit
- Assess resource utilization and how many Hasher Units will fit on a chip
- Program the design into my FPGA and test on hardware
- Modify/Write the necessary software and get it to work with the FPGA
More generally: there are some limitations with my approach that could be addressed. At 1 GHz a single Hasher Unit will iterate through the entire Nonce space (2^32 values) in about four seconds. These days the nonce alone isn’t enough to guess a winning header; my design introduces a software bottleneck in terms of feeding viable headers into the design. Maybe.
If I can get this working, it may be interesting to move more of the header construction into the FPGA. At least the shuffling and computation of the Merkle Tree. I need to do some math on data throughput and Software/Interface/Circuitware speeds. It may not be worth it.
And in a domain dominated by ASIC miners this project may be more of an academic pursuit in the first place, and consequently not “worth” the optimization.
Thanks for reading.
Notes
If you’d like to check out (audit) the HDL: Hasher FPGA Repo
*Id Est: Have people (rather publicly) point out my poor engineering.
Note 2: This may look like code, but it is NOT code. This HDL describes a logical circuit as hardware. Everything in it becomes either a clump of logic gates that manipulate bits (represented as voltages), a register for storing bits, or a connection (net) in between those things. Every computation described in these files happens simultaneously (except when it isn’t) and registers are placed after each computation, instead of in a memory “block” (except when it is).
Confused? Good. That fear is something every budding digital designer feels, and should (but probably won’t) keep you from treating HDL like code and generating a mess. I’m convinced it’s part of the learning process. If you imagine everything here as gates on a PCB, you’ll be fine. You can also play Turing Complete on Steam, it’s a surprisingly good introduction.
Note 3: Generally speaking, I registered the inputs but not the outputs of a block (since I’m constantly changing the input values as things get piped in). There is an additional register external to the chain as well. So my pipeline has 113 clock cycles of latency per SHA Module; +1 for registering the output. So in simulation you’ll see the data take 114 clock cycles for 1 SHA block, 227 cycles for 2 SHA blocks (113+113+1), and 340 cycles for 3 SHA blocks.
†It is not.
Everyone thinks “this time I’ll properly define the interfaces, so this time things will just plug together–integration will be a breeze!” and it’s never true. It’s a lie engineers tell themselves. Every. Single. Time. It’s an illusion. The Illusion of Rigor.
Note 5: This “blog post” is a poorly concealed micro-architecture document.