Future of PQC on OpenTitan
This is part 3 of 3 in an experience report about implementing SPHINCS+ (aka SLH-DSA) for secure boot in OpenTitan root of trust (RoT) chips (1,2). SPHINCS+ is a post-quantum secure signature algorithm and one of the four winners of NIST’s post-quantum cryptography competition; the final standard was recently released as FIPS 205.
Read part 1 here and part 2 here.
This post will focus on the future of post-quantum cryptography on OpenTitan, specifically:
new SPHINCS+ parameter sets that dramatically improve secure boot tradeoffs, and
potential hardware modifications to support lattice-based cryptography on OpenTitan.
I said the last post would be the longest, but it looks like this one is. Turns out there’s a lot to say about the future!
New SPHINCS+ Parameter Sets
For the OpenTitan Earl Grey chip design, we set up the SPHINCS+ verification so that it’s a configuration option during manufacturing; you can decide to run secure boot using only classical (non-PQC) verification or using both classical and SPHINCS+ verification. We continued to support the classical-only option because SPHINCS+, although fast enough to be tolerable, was still a few times slower than RSA or ECDSA. Specifically, SPHINCS+ with the sha2-128s parameter set takes about 9.3ms on Earl Grey when clocked at 100MHz, compared to about 2.4ms for RSA-3072 and 4.2ms for ECDSA-P256, which provide about the same level of security.
This performance picture is about to change. The paper A note on SPHINCS+ parameter sets (2022), authored by Stefan Kölbl (one of the SPHINCS+ authors) and myself, explores new parameter sets that are better suited to firmware signing. As I described in the first post of this series, SPHINCS+ is a signature framework; it has several settings that you can tweak to get different characteristics. The NIST PQC competition required that all submissions support up to 2^64 signatures per key. This is so many signatures that, practically speaking, one can simply never worry about counting. For many applications, this is a pragmatic choice to reduce complexity and risk, especially when the same key may be used by multiple entities. For firmware signing specifically, the context is different; the signing environment is tightly controlled, and signing generally won’t happen more frequently than once per month in practice. In this context, 2^20 signatures are more than enough; that’s enough to sign once per month for 87,000 years, or once per day for 2,872 years. Even 2^10 signatures is enough to sign once per month for 85 years.
And if you exceed the number of signatures needed for the target security level, the characteristics of SPHINCS+ are such that the security level drops off very gradually; you would retain practical security, over 100 bits, even after signing over 1000x more than you should. This is a strong contrast with the LMS/XMSS signature schemes, where practical security is immediately lost if the state is mishandled even once.
So the question was: if the maximum number of signatures was relaxed, what new possibilities would that open up in the tradeoff space for SPHINCS+ parameters? Stefan built a way to automatically search through the parameter space (on GitHub: kste/spx-few/), and was able to map the landscape with detailed graphs like this one:
This was a promising result; targeting the same security level but a lower maximum number of signatures, it was possible to significantly reduce signature size without sacrificing performance. Optimizing for verification performance (since these parameter sets by definition target contexts where signing is infrequent) and signature size, the paper proposes effectively six new parameter sets, one for each of the six “s” variants in FIPS 205. The new parameters are called “Q20” in reference to the 2^20 signature bound, so the analogue of shake-128s from FIPS 205 is shake-128s-q20. The other signature framework parameters don’t change when the hash function changes, so sha2-128s-q20 is exactly the same as shake-128s-q20 except for the hash function.
OpenTitan was the case study for firmware signing in the paper, due to its combination of production quality implementation and open-source availability. I ran benchmarks for several of the new parameter sets using our secure boot implementation. For shake-128s-q20, which is the security level we’d most likely target, we saw a whopping 58% decrease in signature size and a 79% reduction in verification time.
The branch with the benchmarking scripts and reproduction instructions is available at jadephilipoom/opentitan:spx-benchmark
That speedup is enough to make SPHINCS+ as fast as or faster than classical, non-post-quantum cryptography. On OpenTitan, that’s nearly as fast as RSA, and significantly faster than ECDSA at the same security level (note that the ECDSA number is slightly outdated; it’s now more like 420K cycles, as we’ve made some speed improvements since the paper benchmarks were measured).
Although the signature size is still larger, it’s now only 4x larger than RSA’s combined public key and signature size (as discussed in the first post, it’s the sum of the two that really matters). The existing FIPS 205 parameter set is nearly 10x larger. This is a huge improvement to the tradeoff space of working with SPHINCS+ for firmware signing.
Now that we have an implementation for SHA-2 parameters rather than SHAKE, I can add some new benchmarks, shown here for the first time:
As discussed in the previous post, the SHA-2 parameters are faster on OpenTitan because the SHA-2 accelerator hardware implementation has less power side-channel hardening than the SHAKE accelerator. For secure boot, where we only do verification and therefore never handle secret data, we don’t need the hardening, so the speed is a free advantage. With SHA-2 and the Q20 parameters, SPHINCS+ is in fact significantly faster than RSA, and more than twice as fast as ECDSA, making it a very practical choice for the boot process despite the large signatures.
We’re very enthusiastic about these new parameter sets, which were presented at the 5th NIST PQC Standardization Conference in April 2024. At the same conference, NIST announced that they indeed plan to standardize some parameter sets for smaller numbers of signatures, in a separate publication from FIPS 205. We strongly support the standardization of reduced-maximum-signature parameter sets. Standardizing them will help hardware projects like ours roll out PQC quickly and effectively, a necessary precondition for the PQC migration of any system that relies on secure boot.
Hardware Acceleration for Lattice Cryptography
Hardware security means more than just secure boot. In some cases, we might want to be able to run alternative post-quantum signature algorithms on OpenTitan, especially for cases where we need to compute a signature (rather than only verify one). Signing speed, for example, is not a strength of SPHINCS+. Also, some of the data we handle in signing and key generation is secret, so side-channel attacks (e.g. power and timing) are in scope. Defending against these side channels is probably well within reason if we use the SHAKE parameter set, since that has a masked hardware implementation.
Another concern for signing is fault injection, and this one is trickier for SPHINCS+. I touched on fault injection in the last post; it means that the attacker uses a laser or other means to deliberately insert a glitch during computation. The 2018 paper Grafting Trees: a Fault Attack against the SPHINCS framework described an attack that essentially causes SPHINCS+ (and several other related schemes) to reuse an internal one-time signature. The resulting faulty signatures pass verification, but reveal information to the attacker that, with enough signatures, allows them to create forgeries. The attack was experimentally verified shortly after being published, and a recent analysis confirms that the only real defense is redundancy. In other words, we would have to perform each signature twice or more to protect against this scenario. Given that signing is already pretty slow, this isn’t ideal for something we might have to do relatively frequently. It’s still viable to do SPHINCS+ signing and key generation on OpenTitan, and I believe we should support it. However, it would be good to support alternative post-quantum signatures as well.
Dilithium or Falcon?
So, what are the other options? Besides SPHINCS+, the other two signature algorithms that won the NIST competition are Dilithium (aka ML-DSA) and Falcon (aka FN-DSA). Here are some relevant benchmarks for ARM Cortex-M4, courtesy once again of the excellent pqm4 project. I’ve highlighted some particular measurements that are likely to be either challenging (yellow) or near-impossible (red) to accommodate during boot for an embedded system like OpenTitan with limited memory:
* Can also be stored as a 32-byte seed.
** Falcon signing is many times slower than this without floating-point instructions (this excellent and informative blog post from Bas Westerbaan at Cloudflare estimates about 20x slower).
As discussed in the first post of this series, I am using parameter sets here that aim for a higher security level than for SPHINCS+. This is a hedge against future cryptanalytic attacks potentially weakening lattice signature schemes, since they are newer and less well-understood than hash-based cryptography.
A few observations we can make based on the pqm4 measurements:
Dilithium signatures are almost 3x larger than Falcon ones.
Falcon key generation and signing are many times slower than Dilithium, especially taking into account that the current OpenTitan designs do not have floating point instructions.
The stack size required by the “clean” implementations is probably not feasible for an embedded system; we will need to use stack-optimized versions, and pay the price in code size and Dilithium signing speed.
Falcon code size is much larger than Dilithium.
We can likely accelerate Dilithium more than Falcon, because a much higher percentage of its computation is SHAKE hash computations that can use our hardware SHAKE accelerator.
Taking all of this information together, it’s clear that Dilithium is a more appealing option than Falcon for OpenTitan, at least with the current design. The code size alone for stack-optimized Falcon is probably disqualifying; that’s more SRAM than the current Earl Grey OpenTitan chip has. Besides that, the key generation and signing times are quite slow, even if we took the big step of adding floating-point instructions, and accelerating hashing won’t get us very far in speeding it up.
Another advantage of Dilithium is that the secret keys are derived from short seeds. This opens up the possibility to generate the secret keys from OpenTitan’s hardware key manager block. The key manager block maintains a secret internal state and performs key-derivation operations to generate key material that it loads directly into hardware blocks, for example the OTBN coprocessor that we currently use to accelerate ECDSA and RSA operations.
What about Kyber?
The fourth algorithm selected in the NIST PQC competition is Kyber, aka ML-KEM. Kyber is not a signature algorithm; instead, it is for “key encapsulation” and allows two parties to exchange information such that they end up with a shared symmetric key. This is an extremely useful operation; for example, TLS uses an exchange like this to set up an encrypted connection between a website’s server and a user’s browser. Once you do the key exchange operation, you can use fast “symmetric” cryptography like AES to encrypt data. Symmetric cryptography is post-quantum secure already, so there’s no need to change much here or accept performance penalties.
Although we don’t need Kyber at the moment for any of OpenTitan’s core device functions, it’s an algorithm that we expect high demand for, and it would be good to have a hardware architecture that supports it. Luckily, many of the underlying structures of Kyber are very similar to Dilithium, so we can consider them together.
The Design Space
So the next question is, how easily can OpenTitan (or, more precisely, a future hardware instantiation of OpenTitan) support efficient Dilithium and Kyber operations, and what challenges will we face there?
For the answers we can look to a pair of recent papers: Enabling Lattice-Based Post-Quantum Cryptography on the OpenTitan Platform (2023) and Towards ML-KEM & ML-DSA on OpenTitan (2024, pre-print – for full disclosure, I’m one of the authors). The first paper evaluates a more hardware-focused approach to supporting lattice cryptography, extending the OTBN coprocessor with a new post-quantum ALU and specialized instructions (3). It focuses on verification operations. The second paper evaluates four different hardware and software implementations of Dilithium and Kyber with OpenTitan’s OTBN coprocessor:
(1) Unmodified OTBN ISA and hardware design, but without OTBN’s current memory constraints
(2) Same as (1), but with a direct connection between OTBN and KMAC hardware blocks
(3) Same as (2), but with the OTBN ISA extended with five new vector instructions
(4) Same as (3), but with the hardware implementation of the new instructions optimized for speed instead of area.
Side-channel defenses are not in scope for either paper, and will probably incur a significant cost in terms of code size, stack usage, and runtime. Luckily, the KMAC hardware block already includes these defenses, so the cost would only apply to the non-hashing computations. Although it’s difficult to estimate the exact cost of side-channel countermeasures, it’s important to keep this in mind and leave a little bit of slack in the stats to accommodate the cost of side-channel defense.
Memory optimizations would be necessary for our embedded context, and are not completely in scope for the above papers. However, from Dilithium for Memory Constrained Devices (2022) and the pqm4 measurements we can get a decent idea of the amount of slowdown we would have from stack optimizations, and how much we could bring stack usage down. This paper also helpfully optimizes for code size, supporting all 3 Dilithium variants in around 10kB of code.
Together, this research helps us get a sense of the tradeoff space available, primarily between hardware area, stack usage, and speed. Much thanks to all of the researchers involved; it’s amazing to be able to reference all of this information instead of guessing! This is also a great example of how an open source hardware project can benefit from external researchers having the ability to experiment.
So let’s dive in and see what these papers tell us, paying special attention to the three key tradeoffs of area, stack usage, and speed. Starting chronologically with Dilithium for Memory Constrained Devices, we can see that Dilithium-3 (ML-DSA-65) can be brought down from 50-80kB of memory usage for all operations to about 6.5kB for signing and key generation and 2.7kB for verification:
It’s important to remember here that the keys and signatures themselves often also need to exist in memory during the computation, and for lattice cryptography these are measured in kilobytes. On Earl Grey, OTBN has only 4kB of data memory, so the data memory would clearly need to get a little bigger.
The cost for using less memory is performance; compared to a similarly generic and C-based implementation, PQClean, the slowdown is 1.5x for key generation, 2.8x for signing, and 2.0x for verification. Compared to an implementation with assembly optimized for Cortex M4, the slowdowns become more significant: 1.8x for key generation, 5.4x for signing, and 2.7x for verification (4). It’s likely that for OpenTitan, we’d end up somewhere in between. Effectively, we’d need to apply the same transformation that gets from PQClean to “This work” in the table above, but we’d be applying it to an implementation that’s closer in nature to the specialized assembly of pqm4. In terms of code size, it doesn’t seem to incur a high cost; the code size is about the same as pqm4. It’s somewhat larger than PQClean, but this is partly because PQClean separates the different Dilithium parameter sets and the memory-optimized version includes all of them.
So, from here we can conclude:
It’s possible to lower the memory consumption of Dilithium-3 to something that an OTBN with slightly expanded DMEM (e.g. 12kB, or possibly even 8kB) could accommodate.
We should expect small-multiple slowdowns from these memory optimizations: 1-2x for key generation, 3-5x for signing, and 2-3x for verification.
The cost in terms of code size should not be catastrophic.
Next, what do we learn from the two papers with implementations on modified versions of OpenTitan? This table from Towards ML-KEM & ML-DSA on OpenTitan summarizes the performance numbers for the different variants (including Enabling Lattice-Based Post-Quantum Cryptography on the OpenTitan Platform, which is cited as SOSK23) as well as other platforms:
The stack size requirements for this paper’s implementations are similar to PQClean’s Dilithium implementation; we must assume that we need to apply some of the transformations in the memory-optimization paper. For SOSK23, some computation has been moved out of OTBN, likely to save memory, and the implementation works with 32kB of OTBN DMEM. Because some memory optimizations have likely already been applied there, it would be reasonable to expect a slightly smaller slowdown multiplier from further memory optimization.
When we compare the performance table with the hardware area evaluations, we can see a clear tradeoff emerge between area and performance:
While the “Ext++” variant is the fastest, it also uses significantly more hardware area than “Ext”. However, for just a 12% area penalty on OTBN, “Ext” is nearly as fast. In fact, 697K cycles is close to the performance we see for ECDSA-P256 signing, currently 704K cycles – although it’s important to remember here that the Dilithium implementation still needs memory optimizations and side-channel protection to be truly comparable. Still, the performance looks very promising, and the costs in terms of area and code size would be manageable. I am very optimistic about this direction of work!
Such a detailed view of the tradeoff space is especially valuable because OpenTitan is not just one chip; it’s a library of hardware designs that can be assembled into multiple different “top-levels”, like Earl Grey and Darjeeling (we have a tea theme). So it’s not hard to imagine that, in the future, some OpenTitan designs might prioritize having the fastest post-quantum cryptography they possibly can, even though it costs area, while others might be less specialized.
In any case, it’s clear from this research that:
Connecting OTBN and KMAC has a dramatic positive effect on Dilithium performance.
Modest vector instruction-set extensions also provide a lot of speedup for a relatively small cost.
We would need to increase OTBN’s data memory and likely also instruction memory to accommodate Dilithium-3; 12kB each should be enough.
Conclusion
That’s all for this long-winded series of posts! I hope that explaining a bit about our experience and thought process has been helpful or at least entertaining to people who are also thinking about implementing PQC. This is a huge, complex, and fascinating area. Many thanks to all of the brilliant, talented researchers who have helped me understand this space and contributed to the works I cited above!
Interested in learning more? Sign up for our early-access program or contact us at info@zerorisc.com.
(1) The NIST standard, FIPS 205, renames the algorithm to SLH-DSA, but we’ll refer to it as SPHINCS+ for this post. Our implementation is compatible with round-3 SPHINCS+ for last year’s chips and will be compatible with FIPS 205 for future versions.
(2) OpenTitan is a large open-source project stewarded by lowRISC on which many organizations, including zeroRISC, collaborate.
(3) It would be technically possible to run the Dilithium and Kyber reference implementations on Ibex/KMAC instead of OTBN. This would work with no hardware modifications but would be slow. As a very rough back-of-the-napkin estimate, it would take something like 89ms per Dilithium3 signature, much slower than even OTBN with more memory and no ISA changes. (The pqm4 measurement is 127ms, 31% of that is SHAKE, and I'm using the estimate from post 1 that hardware SHAKE takes 3% of the time software SHAKE does: 127ms × (0.69 + 0.31 × 0.03) ≈ 89ms.)
(4) Specifically, the implementation from Compact Dilithium Implementations on Cortex-M3 and Cortex-M4 (2021).
Landing SPHINCS+ on OpenTitan
This is part 2 of 3 in an experience report about implementing SPHINCS+ (aka SLH-DSA) for secure boot in OpenTitan root of trust (RoT) chips (1, 2). SPHINCS+ is a post-quantum secure signature algorithm and one of the four winners of NIST’s post-quantum cryptography competition; the final standard was recently released as FIPS 205. On this exciting occasion, we hope that sharing our experience with SPHINCS+ will help others who are considering migrating to PQC, especially for firmware signing. Read part 1 here.
This post will focus on the implementation and organizational aspects of adopting SPHINCS+ for OpenTitan’s current-generation “Earl Grey” chips. We’ll cover:
how the idea to use SPHINCS+ for secure boot on Earl Grey came about,
the process of preparing an RFC and getting it approved within OpenTitan, and
how we adapted and optimized the reference implementation to suit Earl Grey.
This will probably be the most detailed and longest blog post in the series; I think a really cool part of working on an open-source project is that we leave a public trail of discussions we’ve had and code we’ve merged, and I want readers to be able to follow that trail. So I’ll link liberally to commits and GitHub issues throughout this post. I also want to draw attention to the non-technical infrastructure I discuss here, especially the RFC process (part of the general “Silicon Commons” approach to collaborative open-source hardware projects) and how OpenTitan development works across multiple organizations.
Idea and Prototyping
The original suggestion to try running SPHINCS+ on OpenTitan came from Peter Schwabe, one of the original authors of SPHINCS+, following a serendipitous meeting at CHES 2022. He suggested that we could probably run the reference implementation efficiently on our platform without any hardware changes. With his help, we were able to strip out the non-verification code, replace the software hash function with calls to our hardware implementation, set up a test case, and successfully verify a signature in hardware simulation the next day. (You can see the excited commit message where I wrote “WORKING!” – I remember clearly the moment when I saw “verification passed” print on my screen for the first time.) The availability of an optimized, public-domain reference implementation was crucial here; it made getting the first prototype off the ground a breeze, and allowed us to continually check our implementation against reference test vectors throughout the whole development cycle to make sure we didn’t introduce bugs.
After that first signature, we spent a few weeks optimizing the implementation and experimenting with different parameter sets. You can see the whole trail of those experiments recorded on my prototyping branch: jadephilipoom/opentitan:sphincsplus. We were able to find a 3x speedup in verification time overall with platform-specific optimizations, mostly adjusting the KMAC block driver and assuming word-aligned buffers everywhere. The reference implementation has to assume that some internal buffers might be byte-aligned, but in OpenTitan’s ROM code there was an early design decision that absolutely everything should be word-aligned, and it’s OK to assume so. In my opinion, this decision has paid off extremely well, with benefits not just for performance but also for defense against power side-channels when handling secret data. (Byte-writes are generally more vulnerable than word-writes in that context, because there are fewer possible values for the attacker to distinguish between.)
The benchmarks are all documented on the prototyping branch, along with instructions to reproduce them if you’re curious. We compared all the parameter sets at first but quickly realized that we wanted to target the shake-128s-simple parameter set, so the benchmarks focused on that. Below is the summary of what we tried during the initial prototyping and the effect on performance for shake-128s-simple:
You can find (or create!) benchmarks for other parameter sets by checking out the branch and running more tests.
It’s worth noting that the final version of SPHINCS+ in ROM today is even faster; we were able to bring shake-128s-simple down to just under 13ms, mostly with more word-alignment. After recently switching to SHA2 parameters (see issue #23144 and pull request #23732) for the upcoming Earl Grey chips, the verification takes about 9.3ms.
These experiments showed that SPHINCS+ was fast enough to run as an option for secure boot. It was still about 6x slower than RSA-3072 and 3x slower than ECDSA-P256, so we wouldn’t want to force all users of the chip to run it. However, we could potentially add a field to the chip’s one-time programmable (OTP) memory configuration to let us choose to enable post-quantum secure boot for certain chips at manufacturing time, an option that was suddenly looking feasible much earlier than expected.
Around the time we concluded the optimization experiments, we started evaluating what it would take to land this code in the very first Earl Grey tapeout. The timing was ambitious. This was late November 2022, and we had already scheduled a tapeout for the first chips. Based on that schedule, we needed to lock in the ROM implementation, with no further changes, by June 2023. Seven months to put a working prototype into production use might sound like plenty of time, but not for silicon where you can’t compile your way out of a problem. There is a lot of work in between passing an initial test and having code that is ready to go into ROM. For example, we needed to decide how the PQC option would interact with the existing classical signature verification, change the boot manifests to accommodate multiple signature types, add SPHINCS+ signing capability to the Rust-based opentitantool utility, adjust all the code to match OpenTitan’s specific style guide, and of course run extensive tests – not just on the signature verification core code but also on its integration with the rest of the ROM.
Before we did any of that, because this was a major change we needed to go through the RFC process and seek approval from the OpenTitan Technical Committee. Big decisions in OpenTitan can’t be made unilaterally by design. Lots of organizations collaborate to make the project possible. It’s vitally important to the health of the project that decisions are made fairly and transparently, giving everyone a chance to provide feedback on, object to, or adjust new proposals.
Creating an RFC
In December 2022, I presented an RFC to the OpenTitan Technical Committee, explaining the results from the initial prototype and optimization experiments and mapping out the estimated cost in terms of implementation effort, boot time, and code size.
One important decision for the RFC was whether enabling the SPHINCS+ secure boot flow should disable the classical flow. We decided that if SPHINCS+ was enabled, both the classical and PQC flows would run, and the code would need two valid signatures. Although SPHINCS+ is based on the security of hash functions and is cryptanalytically low-risk, the flow was also completely new, and we didn’t want to risk introducing a new bug and undermining security. Plus, given that SPHINCS+ took quite a bit longer than classical verification, adding the classical verification time on top of it wasn’t much of a relative cost.
Safe Comparison
Another important detail was how to defend the sensitive final comparison against fault injection attacks. Some general principles behind defending against fault injection attacks in software, at a very high level, are:
It’s easier to glitch one bit than several at once, so avoid situations where a single bit-flip can bypass a security check (e.g. a security-critical if/else statement).
For enums, use values with a high Hamming distance from each other, making it hard for an attacker to glitch one value into another one that sends the code down a different path.
Sometimes, attackers can trick the code into reading unexpected values that exist in the logic (for example, by preventing a register from updating). Therefore, ensure any values that mean “passed security check” are not just hanging around in a register; they should be constructed slowly as the code goes through its intended path.
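To make these principles concrete, here is a minimal sketch in C. All of the constant values are invented for this example (they are not OpenTitan’s real constants); the point is that the “passed” value is many bit-flips away from “failed”, and is never stored anywhere in full:

```c
#include <stdint.h>
#include <stdio.h>

// Principles 1 and 2: multi-bit status constants with a high Hamming
// distance (0x0f5a ^ 0x50a5 = 0x5fff, 14 bits apart), so no single
// bit-flip can turn "failed" into "passed".
enum {
  kCheckPassed = 0x0f5au,
  kCheckFailed = 0x50a5u,
};

// Principle 3: kCheckPassed is never precomputed in a register; it is
// accumulated from per-step shares, so a glitched or skipped step
// leaves `status` at a garbage value instead of the magic constant.
enum {
  kShare0 = 0x1234u,
  kShare1 = 0x0f5au ^ 0x1234u,  // shares XOR together to kCheckPassed
};

int main(void) {
  uint32_t status = 0;
  status ^= kShare0;  // contributed by security check step 1
  status ^= kShare1;  // contributed by security check step 2
  puts(status == kCheckPassed ? "passed" : "failed");
  return 0;
}
```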
As with RSA and ECDSA, the SPHINCS+ signature verification procedure is structured so that it does a big computation on some combination of the public key, message, and signature, then checks the resulting value against part of the original input values (in SPHINCS+, the public key root). If the signature is valid, the two should be equal. If you’re an attacker and you want to bypass secure boot, this is the comparison you should target; by causing that single comparison to say “yes” when it should say “no”, you can cause an invalid signature to be accepted. Luckily, we had already implemented a safe comparison for the classical flow (see pull request #10024, and fellow formal methods nerds might also appreciate the small proof that shows this algorithm produces the right final result). Now, we just needed to adapt the design to work with both signatures.
The original design worked by taking advantage of the fact that we don’t want to produce a boolean true or false from the comparison. Ultimately, we need a “magic” 32-bit constant, kFlashExec, if the comparison succeeds. That high-Hamming-weight constant (and no other value) would allow us to unlock flash for the next boot stage. This makes it easier to defend the implementation from fault attacks, because we’re never relying on a single bit or branch. We generated pre-computed 32-bit shares of the magic value, meaning values x_0 through x_(n-1) such that:

x_0 ⊕ x_1 ⊕ … ⊕ x_(n-1) = kFlashExec
We chose the number of shares (n in the equation above) so that the total length of the shares, when concatenated, was equal to the length of the signature values we needed to compare. Then we would start with an all-zero result value. For every word of the two values we needed to compare, we would XOR (⊕) the next two words with the result value and also with the next share from the precomputed list. There was also an extra value called “diff” to make sure it was impossible to accidentally arrive at the correct value. It directly checked if the XOR of the two words was nonzero, and if so (or if the previous diff was nonzero), it set both the result value and the diff to all-ones.
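Here’s a rough sketch in C of that comparison loop. The names, the word count, and the share values are invented for illustration (the real implementation is in pull request #10024); the bit-twiddling shows one branchless way to make “diff” sticky:

```c
#include <stdint.h>
#include <stdio.h>

enum { kNumWords = 4 };  // illustrative; really sized to the compared values

#define FLASH_EXEC 0xa26a38f7u  // invented stand-in for kFlashExec

// Invented shares: the first three are arbitrary, and the last one is
// chosen so that all four XOR together to FLASH_EXEC.
static const uint32_t kShares[kNumWords] = {
    0xdeadbeefu, 0x12345678u, 0x0f0f0f0fu, 0x61fcdf6fu};

static uint32_t hardened_compare(const uint32_t *actual,
                                 const uint32_t *expected) {
  uint32_t result = 0;
  uint32_t diff = 0;
  for (uint32_t i = 0; i < kNumWords; ++i) {
    uint32_t val = actual[i] ^ expected[i];  // zero iff the words match
    // Make `diff` sticky all-ones on any mismatch, without branching:
    diff |= val;
    diff |= ~diff + 1;          // sets the MSB if diff is nonzero
    diff |= ~(diff >> 31) + 1;  // spreads the MSB to all 32 bits
    // On the honest path every `val` is zero, so `result` accumulates
    // the XOR of all the shares, i.e. exactly FLASH_EXEC.
    result ^= val ^ kShares[i];
    result |= diff;  // any mismatch forces the result to all-ones
  }
  return result;  // caller proceeds only if this equals FLASH_EXEC
}

int main(void) {
  uint32_t a[kNumWords] = {1, 2, 3, 4};
  uint32_t b[kNumWords] = {1, 2, 3, 4};
  printf("match:    %08x\n", hardened_compare(a, b));  // a26a38f7
  b[2] ^= 1;  // single-bit mismatch
  printf("mismatch: %08x\n", hardened_compare(a, b));  // ffffffff
  return 0;
}
```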
We needed to adapt this design so it could generate kFlashExec only if either:
the classical verification passed and SPHINCS+ was disabled, or
both verifications passed and SPHINCS+ was enabled.
The approach we chose was to pick special values A and B, as well as special values for the “enable” and “disable” OTP settings, so that:

A ⊕ disable = kFlashExec
A ⊕ B ⊕ enable = kFlashExec
Then, instead of using shares for kFlashExec, we would have the classical signature verification routine use shares for A, and the SPHINCS+ comparison use the same strategy with shares for B. We’d XOR the results of these comparisons with the SPHINCS+ enablement field, so that either of the expected cases (but no unacceptable ones) would construct the correct value.
With this crucial detail planned out, I was ready to present the RFC to the Technical Committee. In one of the committee’s regular meetings, I presented the document and answered questions. Committee members gave feedback and requested more details in certain sections, for example on the changes to the manifest format. Per normal procedures, they didn’t vote to approve the RFC in that first meeting; rather, the RFC contributors made the changes and TC members deliberated offline. At the next meeting, they held a (successful!) vote approving the proposal.
Initial Implementation
By the time the RFC was done and approved, it was early January 2023; we had less than 6 months before the ROM freeze deadline. We quickly got to work. Starting with adjusting the reference implementation to match the code style conventions in the OpenTitan repository, we slowly merged the prototype implementation with about a dozen pull requests (see for example #17093, #17221, #17295, and #17367), each with a small, digestible chunk of code to undergo review. For now, nothing actually called the code.
As of pull request #17326 in late February, we had a complete implementation and were ready to integrate the code into the boot flow and surrounding infrastructure. We reworked how the ROM code represented keys to handle both classical and SPHINCS+ keys (see pull request #18512), and implemented the safe final comparison from the RFC (see pull request #17995). We added SPHINCS+ support for opentitantool based on the pqcrypto Rust crate (see pull requests #18184 and #18041). Finally, we added a new manifest extensions capability to the format for boot manifests, and incorporated the changes into opentitantool so we could sign images with SPHINCS+ signatures from the command line (see pull requests #18584 and #18667). The opentitantool support also allowed us to run integration tests that checked the signature verification code operated as expected with the whole boot sequence and correctly accepted or rejected the signatures from pqcrypto.
Maintenance and Updates
Since the initial implementation for the first Earl Grey tapeout, there have been a few additional updates and changes to the SPHINCS+ code for OpenTitan that will come into effect for the next tapeout. The old version was compatible with the round 3 SPHINCS+ submission to the NIST competition and used SHAKE as the hash function. The new version will be compatible with the FIPS 205 standard and use SHA-256 as the hash function.
First, NIST made a few changes for the FIPS 205 standard that are not backwards compatible. For example, there was a small endianness change in an internal routine that changed signature values completely. We implemented it in May 2024 (see pull request #22953) in preparation for the next tapeout. This was a somewhat tricky change to make, especially since NIST hadn’t released test vectors for it and the pqcrypto crate we used for opentitantool didn’t yet include it. The reference implementation had added a branch with the endianness change, so we were able to re-generate tests for part of our testing infrastructure that way. But another part of our test infrastructure directly pulled the NIST tests from round 3 of the PQC competition, so we needed to change it. Instead, as discussed in the comments on pull request #22953, we set up self-hosted test vectors generated from the right branch of the reference implementation to replace the round 3 tests.
We also needed a new way to generate signatures with opentitantool before the changes would be compatible with our integration tests and signing utilities. We discussed the issue at one of the regular Software Working Group meetings. This is a good forum for smaller-scale decisions, where OpenTitan maintainers from different organizations can informally coordinate and seek feedback on engineering decisions. There, we decided to directly link the reference implementation into opentitantool with bindgen rather than, for instance, look for a different Rust crate. This option would give us more future flexibility, including to experiment with alternative parameter sets like we’ll discuss in the next post. We integrated the reference implementation into our infrastructure and generated the bindings so that I could switch opentitantool to use them (see pull requests #23049 and #23104).
There were also two smaller code changes for the new version; domain separator support to match the FIPS 205 standard (see pull requests #23762 and #23765), and a small bugfix from the upstream reference implementation (see pull request #22894). The bug didn’t affect any of the parameter sets that had been submitted as part of the PQC competition, but was problematic when experimenting with different, alternative parameters.
Finally, we changed the parameter set from SHAKE to SHA-2, as I discussed a bit already in the last post. In May 2024, just before a code freeze, we received a time-sensitive request to switch to SHA2 parameters to maintain compatibility with project partners’ infrastructure. Luckily, we were able to take advantage of the existing organizational and technical infrastructure to evaluate the effort and risk of the change, agree on a plan, and implement and test it in time. First, we wrote an RFC for the change. To help inform the decision, we did some quick performance estimates to check if there would be an impact on verification speed. In general, SHAKE is faster than SHA-2 in hardware. However, our SHAKE implementation is masked for side-channel protection and the SHA-2 implementation isn’t, so SHA-2 runs a little bit faster. Also, we had implemented a new save/restore feature for the SHA-2 accelerator hardware since the first tapeout (see pull request #21307), which we could use to accelerate a performance-critical part of SPHINCS+. After the Technical Committee approved the RFC, we updated multiple parts of the test infrastructure to include SHA-2 parameter sets (see pull requests #21681 and #23598), and then implemented the extra bits of code (e.g. MGF1) that we needed and swapped over the implementation (see pull requests #23710 and #23732). Then we went through and, using similar techniques as we had for SHAKE as well as the save/restore feature, optimized the code so that it would run a few milliseconds faster (see pull request #23761). For code size and schedule reasons, we fully swapped over the implementation and don’t support the SHAKE parameter sets as an option in this version. For future OpenTitan chips, we’re considering supporting both.
Closing Words and Thanks
And of course, like most technical projects, we build on the work of many others; in this case, we benefited greatly from the SPHINCS+ authors making a high-quality reference implementation and test vector generation script available under a permissive open-source license. We strongly believe that accessible, quality implementations are indispensable – in hardware and software alike! So thank you to the SPHINCS+ authors, especially Peter for suggesting that we try running SPHINCS+ and helping us set up the first experiments.
I think of this post as sort of a case study in how a major feature on OpenTitan chips was introduced, accepted and maintained. It takes a village to tape out a chip, and landing this feature required tons of expertise and hard work from people with different specialties (and frequently different employers!) Heartfelt thanks to all of the contributors on the OpenTitan project who wrote design docs, pushed code, reviewed PRs, and adjusted infrastructure to make this possible: Alphan, Jon, Ryan, Chris, and many more.
Stay tuned for the third and final post in this series, where we’ll focus on exciting future possibilities for PQC on OpenTitan: alternative SPHINCS+ parameter sets and lattice cryptography.
Interested in learning more? Sign up for our early access program or contact us at info@zerorisc.com.
(1) The NIST standard, FIPS 205, renames the algorithm to SLH-DSA, but we’ll refer to it as SPHINCS+ for this post. Our implementation is compatible with round-3 SPHINCS+ for last year’s chips and will be compatible with FIPS 205 for future versions.
(2) OpenTitan is a large open-source project stewarded by lowRISC on which many organizations, including zeroRISC, collaborate.
Post-Quantum Secure Boot on OpenTitan
This is part 1 of 3 in an experience report about implementing SPHINCS+ (aka SLH-DSA) for secure boot in OpenTitan root of trust (RoT) chips (1, 2). SPHINCS+ is a post-quantum secure signature algorithm and one of the four winners of NIST’s post-quantum cryptography competition; the final standard was released yesterday as FIPS 205. On this exciting occasion, we hope that sharing our experience with SPHINCS+ will help others who are considering migrating to PQC, especially for firmware signing.
The OpenTitan project has a complete SPHINCS+ implementation in mask ROM, which means post-quantum secure boot has been supported since the very first OpenTitan “Earl Grey” chip samples came back earlier this year. In this post, we’ll cover:
why post-quantum cryptography is important,
a quick, high-level primer on SPHINCS+, and
how SPHINCS+ compares to other post-quantum algorithms in this context.
Stay tuned for future posts with more focus on the implementation and our future plans!
Why now?
First, let’s take a step back and answer the basic question of motivation: why is it important to implement post-quantum cryptography now? No quantum computer currently exists that could break “classical” signature algorithms like RSA or ECDSA. In fact, currently known quantum computers are quite far off from that goal. In the Global Risk Institute’s 2023 survey, quantum computing experts were asked to estimate the chance, by various deadlines, of a quantum computer being able to crack RSA-2048 in a day. The experts gave it a 4-11% probability of happening by 2028, and a 17-31% probability of happening by 2033. The estimates don’t break 50% until 2038. So why start defending against quantum computers now?
For secure silicon, the most crucial factors to consider are that (a) hardware development timelines are long, and (b) the first signature verification during boot is critical to the trust model. A chip we design today might well be in the field 5, 10, or even 20 years from now, and the ROM code that does the signature verification can never be changed after manufacturing. If the ROM can only verify ECDSA signatures, then a quantum attacker who can forge an ECDSA signature could run their own code in early boot. This would break fundamental features of these devices, like OpenTitan’s ownership transfer; the attacker could insert malicious code without the device owner’s knowledge. So, even if the risk is low today, we still need to be ready for an uncertain future.
One notable factor we don’t need to worry about in this context is “harvest now, decrypt later” attacks. In those attacks, a patient attacker collects encrypted communications now and then decrypts them with a quantum computer many years in the future to find out what was said. This is an important concern for many contexts in modern cryptography. However, since secure boot doesn’t involve encryption at all, it’s not a concern here.
Finally, from a big-picture perspective, it will take the world a lot of time to agree on and migrate to post-quantum cryptography. The NIST competition was first announced in 2016, and (some of) the standards are now finally out 8 years later. It took a massive amount of time and effort from cryptographers to invent and analyze dozens of algorithms, but that work was just the start. Post-quantum cryptography generally involves much larger signatures and keys and/or much slower running times than classical cryptography. Migrating real-world live systems to use these algorithms will be a many-year process requiring global coordination; we can’t afford to put it off until it becomes an emergency.
About SPHINCS+
As post-quantum algorithms go, SPHINCS+ is refreshingly familiar. It’s a hash-based algorithm, meaning that it doesn’t rely on any new cryptographic constructs like lattices – instead, the internal mathematical structures are trees of hashed values. That makes SPHINCS+ more conservative security-wise than lattice-based schemes; since we’ve been studying hash functions for a long time, it’s unlikely that a new cryptanalytic result is going to suddenly weaken the security bounds. Hashing speed is the most important performance factor in SPHINCS+ by far. In particular, there’s a chain operation that repeatedly runs the hash function and can take up to 95% of the runtime, depending on the parameter set, platform, and hashing speed. Even with our hardware-accelerated version, where hashing is much faster compared to non-hash operations, the chain operation is about 79% of the runtime.
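To illustrate why, here’s a tiny sketch of that chain operation in C. The hash here is a toy stand-in (the real scheme uses SHAKE or SHA-2 with a public seed and an address “tweak”); what matters is the shape of the loop: each step consumes the previous step’s output, so the chain is inherently serial:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define N 16  // hash output size in bytes at the 128-bit security level

// Toy stand-in for the tweakable hash F -- NOT a real hash function,
// just enough byte-mixing to make the sketch run end to end.
static void toy_hash(uint8_t out[N], const uint8_t in[N]) {
  uint8_t tmp[N];
  for (int i = 0; i < N; i++) {
    tmp[i] = (uint8_t)(in[i] * 31u + in[(i + 1) % N] + 0x5c);
  }
  memcpy(out, tmp, N);
}

// The WOTS+ chain at the heart of SPHINCS+: apply the hash `steps`
// times in sequence. There is no parallelism to exploit here, which
// is why hashing speed dominates SPHINCS+ runtime.
static void chain(uint8_t buf[N], unsigned steps) {
  for (unsigned i = 0; i < steps; i++) {
    toy_hash(buf, buf);
  }
}

int main(void) {
  uint8_t buf[N] = {0};
  chain(buf, 15);  // up to w - 1 = 15 steps for the usual Winternitz w = 16
  printf("%02x\n", buf[0]);
  return 0;
}
```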
Another interesting aspect of SPHINCS+ is that it’s a signature framework; you can adjust 6 different parameters and freely swap out the hash function to make signature algorithms with different performance and size characteristics. The authors selected 36 specific parameter sets in their submission to the NIST competition. For each of the 3 security levels they targeted, they picked two settings for the framework parameters, one that targeted small signatures, the “s” parameters, and one that targeted fast signature generation, the “f” parameters. Those 6 options could each be deployed with one of 3 different hash functions (SHA2, SHAKE, and Haraka), and one of two different algorithmic variants “simple” and “robust”. NIST dropped the Haraka option and the “robust” variants, reducing the original 36 parameter sets to 12 for FIPS 205.
So, when we say OpenTitan has SPHINCS+, we need to be a little more specific; the first round of chips supported the shake-128s parameter set, meaning that the hash function is SHAKE, the security level is equivalent to AES-128, and the remaining parameters are tuned for small signatures at the expense of signing speed (3). We chose the AES-128 security level to match our existing classical signature verification. For a lattice-based scheme, it might make sense to go a level up, but since SPHINCS+ is hash-based, our risk assessment concluded it wasn’t necessary in this case. For firmware signing, the “s” small-signature parameter sets are clearly better suited than the fast-signing “f” parameter sets; “s” has faster signature verification time as well as smaller signatures, and since signing happens infrequently, there’s no problem with waiting a bit longer to generate signatures.
The next tapeout of Earl Grey chips will support the sha2-128s parameter set; the same settings, except with SHAKE swapped out for SHA-2. OpenTitan has hardware accelerators for both operations (the KMAC block for SHAKE and other Keccak-family functions, and the HMAC block for SHA-2), so either option works well. For signing or key generation, when secret values are involved, it would definitely make more sense to use SHAKE, because the KMAC block has masking measures to protect against physical side-channel attacks and the HMAC block does not. However, since verification doesn’t handle secret values, the lack of masking measures actually becomes an advantage. SHA-2 operations run slightly faster than SHAKE on OpenTitan because they don’t include overhead from masking. We also considered that it might be easier to interoperate with code-signing infrastructure using SHA-2 than SHAKE, since SHAKE is newer and not everything yet supports it. In the future, both hash functions may be supported.
Why SPHINCS+?
In addition to SPHINCS+, NIST is standardizing two other new signature algorithms: Falcon and Dilithium. (Dilithium is already released as ML-DSA in FIPS 204; the Falcon standard is not yet published.) There’s also LMS and XMSS, stateful hash-based signatures that have been standardized for some time. So, out of all of these options, why does SPHINCS+ make sense for OpenTitan?
First, let’s address the stateful hash-based signature schemes, LMS and XMSS. On the surface, they seem to have generally better stats than SPHINCS+ 128s. The LMS parameter set (n=24, h=20, w=4), for example, has signatures about ¼ the size of 128s, and probably would verify signatures about twice as fast on OpenTitan. Furthermore, these schemes are hash-based, so they are about as safe from new cryptanalytic breaks as SPHINCS+ is. However, there’s one big catch: the “stateful” part. LMS and XMSS maintain a set of one-time-use keys, and must remember which ones they’ve already used. If you ever sign twice without changing the state, the security guarantees immediately break down. This poses a fair amount of operational risk. For example, backing up and restoring a stateful private key must be done very carefully; a signature between the backup and the restore could mean game over. In this case, the OpenTitan project decided we’d rather deal with large signatures than accept additional complexity for the signing infrastructure, but this is ultimately a judgment call. You can find more discussion of the risks and mitigation techniques for stateful signatures in IETF RFC 8554 and the public comments for the LMS/XMSS NIST standard.
So what about Falcon and Dilithium? The bottom line is that, given the current Earl Grey hardware and OpenTitan’s security requirements, these algorithms would be somewhat slower and riskier than SPHINCS+, and the reduction in public key + signature size they would offer is not game-changing enough to justify those tradeoffs. (For the future, we are optimistic about hardware modifications that would accelerate lattice cryptography on OpenTitan, which we’ll discuss in more detail in a later post.)
Falcon and Dilithium are based on newer, lattice-based cryptography rather than hash functions. This means that there’s a higher risk of new attacks that would weaken their security bounds or potentially break them completely. It’s not a hypothetical concern; this is exactly what happened to SIKE, an isogeny-based scheme that had advanced to the final stages of the NIST competition and withstood years of analysis. Given the long timescales of hardware and the fact that the signature scheme can’t ever be updated, this is problematic. We could, however, minimize the risk by using security levels one step higher than we strictly need, meaning Falcon-1024 or Dilithium3 (aka ML-DSA-65) instead of Falcon-512 or Dilithium2. This is in line with the recommendation of the Dilithium authors themselves.
Even with these beefier parameter sets, we’d get a substantial reduction in public key + signature size compared to SPHINCS+, which is at about 8kB, versus 5kB for Dilithium3 and 3kB for Falcon-1024. Dilithium and Falcon public keys are too large to store in the chip’s OTP like we do for ECDSA and SPHINCS+, but we could get around this issue by hashing the public key, storing only the hash, and passing the full public key along with the signature. Therefore, it makes sense to look at the combined public key + signature size here to understand the amount of data we’d need to include with the signature in practice. Any of those numbers take a significant chunk of space away from the space we have for the code we’re signing, but SPHINCS+ is significantly larger than the lattice-based schemes. So signature size is definitely a point in favor of Dilithium and Falcon.
Performance is a bit tricky to estimate, but we can get a rough idea from the pqm4 project’s benchmarks for ARM Cortex M4. Thanks to the pqm4 authors for making these incredibly easy to access and interpret! OpenTitan’s Ibex core is vaguely similar to the Cortex M4 in that it’s a memory-constrained 32-bit processor. However, there are some major differences: for example, Ibex doesn’t have floating-point instructions. This is more important for Falcon than Dilithium, so our estimates are more certain for the latter. With that disclaimer, the most relevant benchmarks are reproduced here, alongside OpenTitan’s SPHINCS+ measurements:
Let’s break this down a bit. The “cycles” column records the CPU time needed for signature verification. The “hash %” column is the amount of time spent on hashing; in the case of both Dilithium and Falcon, the hashing is SHAKE. This column gives us some insight into how much we can speed up the implementation on OpenTitan by using the SHAKE accelerator, compared to a platform without hash acceleration. So, even though the Dilithium runtimes are slower, we have a bit more leeway to speed them up than we do for Falcon. With SPHINCS+, since the vast majority of runtime is hashing, we can get really dramatic speedups.
The last two columns are important factors for Earl Grey’s memory-constrained environment. The “memory” column records how much stack space the implementation needs, and the “code size” column records the amount of space needed to store the code itself. Unfortunately, pqm4’s benchmarks don’t include a code size metric for the verify routine on its own. Still, we can get a rough idea of where the dragons lie. For example, since our ROM is only 32kB, we can make an educated guess that the Falcon implementations might be difficult for us to fit. These memory metrics are definitely a point in favor of SPHINCS+.
We can – again, very roughly – estimate the speedup with some back-of-the-envelope linear equations. If we make the approximation that the non-hash components of the implementation will perform similarly, we can use the known values for SPHINCS+ to solve for the difference in hash performance:

t_ot = n + h × s_ot
t_m4 = n + h × s_m4

In the above equations, t_ot and t_m4 are the total number of cycles for OpenTitan’s Earl Grey and the Cortex M4 respectively, n is the time spent on non-hashing operations, h is the number of hashing operations, and s_ot and s_m4 are the average times taken for a hashing operation on each platform. If we know the total cycles on both platforms, and we know the ratio of time that’s spent on hashing for Cortex M4 (which lets us solve for n), we can derive the value of (s_ot / s_m4), a measurement of how long the average Earl Grey hashing operation takes compared to the Cortex M4 version. Applying these estimates to the known SPHINCS+ shake-128s numbers tells us that the hardware-accelerated SHAKE takes on average 3% of the time that the software SHAKE does, and gives us a ballpark estimate of around 1.5 million cycles for “clean” Dilithium3 and 3.4 million cycles for the “m4stack” variant. Falcon-1024 comes out to 1.2 million cycles for “clean” and 600K cycles for “m4-ct”.
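As a sanity check on the arithmetic, here’s a tiny sketch of that estimate in C. The inputs in main are illustrative placeholders, not the actual pqm4 measurements; plug in the real verify cycle counts and hash percentages from the pqm4 table to reproduce the estimates above:

```c
#include <stdio.h>

// Port estimate: given a Cortex-M4 cycle count and the fraction of it
// spent hashing, predict Earl Grey cycles under the assumptions that
// non-hash work is unchanged and hardware SHAKE takes `hash_speedup`
// (about 0.03, per the SPHINCS+ calibration) of the software hash time.
static double estimate_ot_cycles(double t_m4, double hash_frac,
                                 double hash_speedup) {
  double n = t_m4 * (1.0 - hash_frac);  // non-hashing cycles, assumed equal
  double hs = t_m4 * hash_frac;         // software hashing cycles on the M4
  return n + hs * hash_speedup;         // hashing replaced by hardware SHAKE
}

int main(void) {
  // Placeholder inputs (invented for illustration -- substitute the
  // real pqm4 numbers for the scheme of interest).
  double t_m4 = 2.7e6;
  double hash_frac = 0.60;
  printf("estimated Earl Grey cycles: %.0f\n",
         estimate_ot_cycles(t_m4, hash_frac, 0.03));
  return 0;
}
```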
Because the estimate makes some sweeping assumptions about the similarity of “Earl Grey” and Cortex M4, and the hashing operations for the different schemes, we shouldn’t interpret this as anything more precise than a ballpark estimate. Still, the rough numbers don’t give us reason to believe that Falcon or Dilithium verification would run much faster than SPHINCS+ on our current hardware, and they might even run slower, especially when we consider that code size might disqualify “m4-ct” Falcon. This makes sense, simply because Earl Grey is currently better at accelerating hash-based computations than lattice-based ones. For example, there are currently no vector instructions on Earl Grey, which are very handy for lattice cryptography.
So, in summary, we expect that on current Earl Grey hardware:
speed comes out slightly in favor of SPHINCS+
signature size is better with Dilithium and Falcon
code size + stack usage is lower with SPHINCS+
cryptanalytic risk is lower with SPHINCS+
Since the SPHINCS+ signature size of 8kB is in the high-but-manageable range for today’s Earl Grey chips, SPHINCS+ is more suited to this use-case. This doesn’t mean we’re not excited about lattice cryptography – quite the opposite! But for now, for this purpose, SPHINCS+ just makes the most sense.
That’s all for part 1! In part 2, we’ll focus more on implementation and organizational details: how the code actually landed in time for tapeout and how OpenTitan’s RFC process works for big changes like this. Then in part 3, we’ll focus on the future: how the tradeoff space discussed in this post may change with better lattice cryptography support and new SPHINCS+ parameter sets.
(1) The NIST standard, FIPS 205, renames the algorithm to SLH-DSA, but we’ll refer to it as SPHINCS+ for this post. Our implementation is compatible with round-3 SPHINCS+ for last year’s chips and will be compatible with FIPS 205 for future versions.
(2) OpenTitan is a large open-source project stewarded by lowRISC on which many organizations, including zeroRISC, collaborate.
(3) This is also called “L1” in the context of the NIST competition. The reason to say roundabout things like “equivalent to AES-128” or “L1” instead of “128 bits of security” is Grover’s algorithm, which theoretically halves the security bound for quantum computers performing brute-force search. However, it’s questionable whether Grover weakens the security bound significantly in practice, so saying “equivalent to AES-128” lets us all put that question aside by allowing a specific exception for quantum brute-force.
Introducing the zeroRISC Technical Blog
Stay tuned for future posts about the exciting contributions we’ve been making to OpenTitan, other open-source projects, and the widespread security community.