Future of PQC on OpenTitan

This is part 3 of 3 in an experience report about implementing SPHINCS+ (aka SLH-DSA) for secure boot in OpenTitan root of trust (RoT) chips (1,2). SPHINCS+ is a post-quantum secure signature algorithm and one of the four winners of NIST’s post-quantum cryptography competition; the final standard was recently released as FIPS 205.

Read part 1 here and part 2 here.

This post will focus on the future of post-quantum cryptography on OpenTitan, specifically:

  • new SPHINCS+ parameter sets that dramatically improve secure boot tradeoffs, and

  • potential hardware modifications to support lattice-based cryptography on OpenTitan.

I said the last post would be the longest, but it looks like this one is. Turns out there’s a lot to say about the future!

New SPHINCS+ Parameter Sets

For the OpenTitan Earl Grey chip design, we set up SPHINCS+ verification so that it’s a configuration option during manufacturing; you can decide to run secure boot using only classical (non-PQC) verification or using both classical and SPHINCS+ verification. We continued to support the classical-only option because SPHINCS+, although fast enough to be tolerable, was still a few times slower than RSA or ECDSA. Specifically, SPHINCS+ with the shake-128s parameter set takes about 9.3ms on Earl Grey when clocked at 100MHz, compared to about 2.4ms for RSA-3072 and 4.2ms for ECDSA-P256, which provide about the same level of security.
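Since cycle counts come up later in this post, here is the same comparison converted to cycles at the 100MHz clock; this is just arithmetic on the numbers above, not a new benchmark:

```python
CLOCK_HZ = 100e6  # Earl Grey benchmark clock from the text

for name, ms in [("SPHINCS+ shake-128s", 9.3),
                 ("RSA-3072", 2.4),
                 ("ECDSA-P256", 4.2)]:
    cycles = ms / 1000 * CLOCK_HZ
    print(f"{name}: {ms} ms = {cycles / 1e3:.0f}K cycles")
# SPHINCS+ shake-128s: 9.3 ms = 930K cycles
# RSA-3072: 2.4 ms = 240K cycles
# ECDSA-P256: 4.2 ms = 420K cycles
```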

This performance picture is about to change. The paper A note on SPHINCS+ parameter sets (2022), authored by Stefan Kölbl (one of the SPHINCS+ authors) and me, explores new parameter sets that are better suited to firmware signing. As I described in the first post of this series, SPHINCS+ is a signature framework; it has several settings that you can tweak to get different characteristics. The NIST PQC competition required that all submissions support up to 2^64 signatures per key. This is so many signatures that, practically speaking, one never has to worry about counting them. For many applications, this is a pragmatic choice to reduce complexity and risk, especially when the same key may be used by multiple entities. For firmware signing specifically, the context is different: the signing environment is tightly controlled, and signing generally won’t happen more than once per month in practice. In this context, 2^20 signatures are more than enough; that’s enough to sign once per month for over 87,000 years, or once per day for 2,872 years. Even 2^10 signatures is enough to sign once per month for 85 years.
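The arithmetic behind those lifetimes is easy to check:

```python
# Signing-budget lifetimes from the text (2^20 and 2^10 signature bounds).
print(2**20 / 12)   # once per month: ~87,381 years
print(2**20 / 365)  # once per day:   ~2,872 years
print(2**10 / 12)   # once per month: ~85 years
```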

And if you exceed the signature bound for your target security level, the characteristics of SPHINCS+ are such that the security level drops off very gradually; you would retain practical security (over 100 bits) even after signing over 1000x more than you should. This is a stark contrast to the LMS/XMSS signature schemes, where practical security is immediately lost if the state is mishandled even once.

So the question was: if the maximum number of signatures were relaxed, what new possibilities would that open up in the tradeoff space for SPHINCS+ parameters? Stefan built a tool to automatically search through the parameter space (on GitHub: kste/spx-few/), and was able to map the landscape with detailed graphs like this one:

This was a promising result: targeting the same security level but a lower maximum number of signatures, it was possible to significantly reduce signature size without sacrificing performance. Optimizing for verification performance and signature size (since these parameter sets by definition target contexts where signing is infrequent), the paper proposes effectively six new parameter sets, one for each of the six “s” variants in FIPS 205. The new parameters are called “Q20” in reference to the 2^20 signature bound, so the analogue of shake-128s from FIPS 205 is shake-128s-q20. The other signature framework parameters don’t change when the hash function changes, so sha2-128s-q20 is exactly the same as shake-128s-q20 except for the hash function.
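To give a feel for how these framework parameters drive signature size: a SPHINCS+ signature consists of an n-byte randomizer, k FORS trees contributing a+1 hashes each, and a hypertree of d WOTS signatures plus h authentication-path hashes. Here is a sketch of that size computation, checked against the FIPS 205 128s parameters (the Q20 parameter values themselves are in the paper):

```python
import math

def spx_sig_bytes(n, h, d, lg_w, k, a):
    """SPHINCS+ signature size: randomizer + FORS + hypertree (WOTS + auth paths)."""
    len1 = math.ceil(8 * n / lg_w)                                 # WOTS message chains
    len2 = math.floor(math.log2(len1 * (2**lg_w - 1)) / lg_w) + 1  # WOTS checksum chains
    return n * (1 + k * (1 + a) + h + d * (len1 + len2))

# FIPS 205 shake-128s / sha2-128s parameters:
print(spx_sig_bytes(n=16, h=63, d=7, lg_w=4, k=14, a=12))  # 7856 bytes
```

Plugging the Q20 parameters from the paper into the same formula yields the roughly 58%-smaller signatures mentioned below.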

OpenTitan was the case study for firmware signing in the paper, due to its combination of a production-quality implementation and open-source availability. I ran benchmarks for several of the new parameter sets using our secure boot implementation. For shake-128s-q20, which matches the security level we’d most likely target, we saw a whopping 58% decrease in signature size and a 79% reduction in verification time.

The branch with the benchmarking scripts and reproduction instructions is available at jadephilipoom/opentitan:spx-benchmark.

That speedup is enough to make SPHINCS+ as fast as or faster than classical, non-post-quantum cryptography. On OpenTitan, that’s nearly as fast as RSA, and significantly faster than ECDSA at the same security level (note that the ECDSA number is slightly outdated; it’s now more like 420K cycles, since we’ve made some speed improvements since the paper benchmarks were measured).

Although the signature size is still larger, it’s now only about 4x larger than RSA’s combined public key and signature size (as discussed in the first post, it’s the sum of the two that really matters). With the existing FIPS 205 parameter set, that ratio is nearly 10x. This is a huge improvement to the tradeoff space of working with SPHINCS+ for firmware signing.
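The rough arithmetic behind those ratios, assuming 384-byte RSA-3072 keys and signatures (the 3072-bit modulus size) and deriving the Q20 signature size from the 58% reduction quoted above rather than the paper’s exact figure:

```python
rsa_total = 384 + 384                    # RSA-3072 public key + signature
spx_fips205 = 32 + 7856                  # 128s: 32-byte public key + signature
spx_q20 = 32 + round(7856 * (1 - 0.58))  # assumed ~58% smaller Q20 signature

print(spx_fips205 / rsa_total)  # ~10.3x
print(spx_q20 / rsa_total)      # ~4.3x
```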

Now that we have an implementation for the SHA-2 parameters in addition to SHAKE, I can add some new benchmarks, shown here for the first time:

As discussed in the previous post, the SHA-2 parameters are faster on OpenTitan because the SHA-2 accelerator hardware implementation has less power side-channel hardening than the SHAKE accelerator. For secure boot, where we only do verification and therefore never handle secret data, we don’t need the hardening, so the speed is a free advantage. With SHA-2 and the Q20 parameters, SPHINCS+ is in fact significantly faster than RSA, and more than twice as fast as ECDSA, making it a very practical choice for the boot process despite the large signatures.

We’re very enthusiastic about these new parameter sets, which were presented at the 5th NIST PQC Standardization Conference in April 2024. At the same conference, NIST announced that they indeed plan to standardize some parameter sets for smaller numbers of signatures, in a separate publication from FIPS 205. We strongly support the standardization of reduced-maximum-signature parameter sets. Standardizing them will help hardware projects like ours roll out PQC quickly and effectively, a necessary precondition for the PQC migration of any system that relies on secure boot.

Hardware Acceleration for Lattice Cryptography

Hardware security means more than just secure boot. In some cases, we might want to be able to run alternative post-quantum signature algorithms on OpenTitan, especially for cases where we need to compute a signature (rather than only verify one). Signing speed, for example, is not a strength of SPHINCS+. Also, some of the data we handle in signing and key generation is secret, so side-channel attacks (e.g. power and timing) are in scope. Defending against these side channels is probably well within reason if we use the SHAKE parameter set, since that has a masked hardware implementation.

Another concern for signing is fault injection, which is trickier for SPHINCS+. I touched on fault attacks in the last post; in a fault attack, the attacker uses a laser or other means to deliberately insert a glitch during computation. The 2018 paper Grafting Trees: a Fault Attack against the SPHINCS framework described an attack that essentially causes SPHINCS+ (and several other related schemes) to reuse an internal one-time signature. The resulting faulty signatures pass verification, but reveal information that, with enough signatures, allows the attacker to create forgeries. The attack was experimentally verified shortly after being published, and a recent analysis confirms that the only real defense is redundancy. In other words, we would have to perform each signature twice or more to protect against this scenario (see the sketch below). Given that signing is already pretty slow, this isn’t ideal for something we might have to do relatively frequently. It’s still viable to do SPHINCS+ signing and key generation on OpenTitan, and I believe we should support it. However, it would be good to support alternative post-quantum signatures as well.
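As a minimal sketch of that redundancy countermeasure (assuming deterministic signing; `spx_sign` is a hypothetical stand-in, not a specific library’s API): compute the signature twice from scratch and release it only if the two runs agree, since a fault in either run makes them diverge with high probability.

```python
import hmac

def sign_with_redundancy(spx_sign, secret_key, message):
    """Sign twice and compare; a glitch during either run is caught here."""
    sig1 = spx_sign(secret_key, message)
    sig2 = spx_sign(secret_key, message)  # fully recomputed, not cached
    if not hmac.compare_digest(sig1, sig2):
        raise RuntimeError("signature mismatch: possible fault injection")
    return sig1
```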

Dilithium or Falcon?

So, what are the other options? Besides SPHINCS+, the other two signature algorithms that won the NIST competition are Dilithium (aka ML-DSA) and Falcon (aka FN-DSA). Here are some relevant benchmarks for ARM Cortex-M4, courtesy once again of the excellent pqm4 project. I’ve highlighted some particular measurements that are likely to be either challenging (yellow) or near-impossible (red) to accommodate during boot for an embedded system like OpenTitan with limited memory:

* Can also be stored as a 32-byte seed.
** Falcon signing is many times slower than this without floating-point instructions (this excellent and informative blog post from Bas Westerbaan at Cloudflare estimates about 20x slower).

As discussed in the first post of this series, I am using parameter sets here that aim for a higher security level than the one we target for SPHINCS+. This is a hedge against future cryptanalytic attacks potentially weakening lattice-based signature schemes, since they are newer and less well-understood than hash-based cryptography.

A few observations we can make based on the pqm4 measurements:

  • Dilithium signatures are almost 3x larger than Falcon ones.

  • Falcon key generation and signing are many times slower than Dilithium, especially taking into account that the current OpenTitan designs do not have floating point instructions.

  • The stack size required by the “clean” implementations is probably not feasible for an embedded system; we will need to use stack-optimized versions, and pay the price in code size and Dilithium signing speed.

  • Falcon’s code size is much larger than Dilithium’s.

  • We can likely accelerate Dilithium more than Falcon, because a much higher percentage of its computation is SHAKE hash computations that can use our hardware SHAKE accelerator.

Taking all of this information together, it’s clear that Dilithium is a more appealing option than Falcon for OpenTitan, at least with the current design. The code size alone for stack-optimized Falcon is probably disqualifying; that’s more SRAM than the current Earl Grey OpenTitan chip has. Besides that, key generation and signing would remain quite slow even if we took the big step of adding floating-point instructions, and accelerating hashing wouldn’t get us very far in speeding them up.

Another advantage of Dilithium is that its secret keys are derived from short seeds. This opens up the possibility of generating the secret keys from OpenTitan’s hardware key manager block. The key manager maintains a secret internal state and performs key-derivation operations to generate key material that it loads directly into hardware blocks, for example the OTBN coprocessor that we currently use to accelerate ECDSA and RSA operations.
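To illustrate the idea (purely a sketch; `keymgr_derive` and `ml_dsa_keygen` are hypothetical stand-ins, not OpenTitan or library APIs), the flow would look something like this:

```python
def load_dilithium_key(keymgr_derive, ml_dsa_keygen, label: bytes):
    """Derive a short seed in the key manager, then expand it into a key pair.

    The device never stores the full secret key; the seed is re-derived
    from the key manager's internal state whenever it's needed.
    """
    seed = keymgr_derive(label)                   # e.g. a 32-byte derived seed
    public_key, secret_key = ml_dsa_keygen(seed)  # deterministic expansion
    return public_key, secret_key
```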

What about Kyber?

The fourth algorithm selected in the NIST PQC competition is Kyber, aka ML-KEM. Kyber is not a signature algorithm; instead, it is for “key encapsulation” and allows two parties to exchange information such that they end up with a shared symmetric key. This is an extremely useful operation; for example, TLS uses an exchange like this to set up an encrypted connection between a website’s server and a user’s browser. Once you do the key exchange operation, you can use fast “symmetric” cryptography like AES to encrypt data. Symmetric cryptography is post-quantum secure already, so there’s no need to change much here or accept performance penalties.
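In code terms, a KEM boils down to a three-operation interface. Here’s a sketch of the flow, with `kem` as a hypothetical stand-in for a real ML-KEM implementation:

```python
def establish_shared_key(kem):
    """Generic KEM flow: both parties end up with the same symmetric key."""
    # Receiver generates a key pair and publishes the public key.
    public_key, secret_key = kem.keygen()
    # Sender encapsulates against it, getting a ciphertext and a shared secret.
    ciphertext, sender_secret = kem.encaps(public_key)
    # Receiver decapsulates the ciphertext to recover the same shared secret.
    receiver_secret = kem.decaps(secret_key, ciphertext)
    assert sender_secret == receiver_secret  # ready to key AES, etc.
    return sender_secret
```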

Although we don’t need Kyber at the moment for any of OpenTitan’s core device functions, it’s an algorithm that we expect high demand for, and it would be good to have a hardware architecture that supports it. Luckily, many of the underlying structures of Kyber are very similar to Dilithium, so we can consider them together.

The Design Space

So the next question is, how easily can OpenTitan (or, more precisely, a future hardware instantiation of OpenTitan) support efficient Dilithium and Kyber operations, and what challenges will we face there?

For the answers we can look to a pair of recent papers: Enabling Lattice-Based Post-Quantum Cryptography on the OpenTitan Platform (2023) and Towards ML-KEM & ML-DSA on OpenTitan (2024, pre-print – for full disclosure, I’m one of the authors). The first paper evaluates a more hardware-focused approach to supporting lattice cryptography, extending the OTBN coprocessor with a new post-quantum ALU and specialized instructions (3). It focuses on verification operations. The second paper evaluates four different hardware and software implementations of Dilithium and Kyber with OpenTitan’s OTBN coprocessor:

  1. Unmodified OTBN ISA and hardware design, but without OTBN’s current memory constraints

  2. Same as (1), but with a direct connection between OTBN and KMAC hardware blocks

  3. Same as (2), but with the OTBN ISA extended with five new vector instructions

  4. Same as (3), but with the hardware implementation of the new instructions optimized for speed instead of area.

Side-channel defenses are also out of scope for both papers, and would probably incur a significant cost in terms of code size, stack usage, and runtime. Luckily, the KMAC hardware block already includes these defenses, so the cost would only apply to the non-hashing computations. Although it’s difficult to estimate the exact cost of side-channel countermeasures, it’s important to keep them in mind and leave a little bit of slack in the stats to accommodate them.

Memory optimizations would be necessary for our embedded context, and are not completely in scope for the above papers. However, from Dilithium for Memory Constrained Devices (2022) and the pqm4 measurements we can get a decent idea of the amount of slowdown we would have from stack optimizations, and how much we could bring stack usage down. This paper also helpfully optimizes for code size, supporting all 3 Dilithium variants in around 10kB of code.

Together, this research helps us get a sense of the tradeoff space available, primarily between hardware area, stack usage, and speed. Much thanks to all of the researchers involved; it’s amazing to be able to reference all of this information instead of guessing! This is also a great example of how an open source hardware project can benefit from external researchers having the ability to experiment.

So let’s dive in and see what these papers tell us, paying special attention to the three key tradeoffs of area, stack usage, and speed. Starting chronologically with Dilithium for Memory Constrained Devices, we can see that Dilithium-3 (ML-DSA-65) memory usage can be brought down from 50-80kB across all operations to about 6.5kB for signing and key generation and 2.7kB for verification:

It’s important to remember here that the keys and signatures themselves often also need to exist in memory during the computation, and for lattice cryptography these are measured in kilobytes. On Earl Grey, OTBN has only 4kB of data memory, so that memory would clearly need to get a little bigger.

The cost of using less memory is performance; compared to a similarly generic, C-based implementation, PQClean, the slowdown is 1.5x for key generation, 2.8x for signing, and 2.0x for verification. Compared to an implementation with assembly optimized for the Cortex-M4, the slowdowns become more significant: 1.8x for key generation, 5.4x for signing, and 2.7x for verification (4). For OpenTitan, we’d likely end up somewhere in between: effectively, we’d need to apply the same transformation that gets from PQClean to “This work” in the table above, but to an implementation that’s closer in nature to the specialized assembly of pqm4. In terms of code size, the memory optimizations don’t seem to incur a high cost; the code size is about the same as pqm4’s. It’s somewhat larger than PQClean’s, but that’s partly because PQClean separates the different Dilithium parameter sets while the memory-optimized version includes all of them.

So, from here we can conclude:

  • It’s possible to lower the memory consumption of Dilithium-3 to something that an OTBN with slightly expanded DMEM (e.g. 12kB, or possibly even 8kB) could accommodate.

  • We should expect small-multiple slowdowns from these memory optimizations: 1-2x for key generation, 3-5x for signing, and 2-3x for verification.

  • The cost in terms of code size should not be catastrophic.

Next, what do we learn from the two papers with implementations on modified versions of OpenTitan? This table from Towards ML-KEM & ML-DSA on OpenTitan summarizes the performance numbers for the different variants (including Enabling Lattice-Based Post-Quantum Cryptography on the OpenTitan Platform, which is cited as SOSK23) as well as other platforms:

The stack size requirements for this paper’s implementations are similar to those of PQClean’s Dilithium implementation, so we must assume that we’d need to apply some of the transformations from the memory-optimization paper. For SOSK23, some computation has been moved out of OTBN, likely to save memory, and the implementation works with 32kB of OTBN DMEM. Because some memory optimizations have in effect already been applied there, it would be reasonable to expect a slightly smaller slowdown multiplier from further memory optimization.

When we compare the performance table with the hardware area evaluations, we can see a clear tradeoff emerge between area and performance:

While the “Ext++” variant is the fastest, it also uses significantly more hardware area than “Ext”. However, for just a 12% area penalty on OTBN, “Ext” is nearly as fast. In fact, 697K cycles is close to the performance we see for ECDSA-P256 signing, currently 704K cycles – although it’s important to remember here that the Dilithium implementation still needs memory optimizations and side-channel protection to be truly comparable. Still, the performance looks very promising, and the costs in terms of area and code size would be manageable. I am very optimistic about this direction of work!

Such a detailed view of the tradeoff space is especially valuable because OpenTitan is not just one chip; it’s a library of hardware designs that can be assembled into multiple different “top-levels”, like Earl Grey and Darjeeling (we have a tea theme). So it’s not hard to imagine that, in the future, some OpenTitan designs might prioritize having the fastest post-quantum cryptography they possibly can, even though it costs area, while others might be less specialized.

In any case, it’s clear from this research that:

  • Connecting OTBN and KMAC has a dramatic positive effect on Dilithium performance.

  • Modest vector instruction-set extensions also provide a lot of speedup for a relatively small cost.

  • We would need to increase OTBN’s data memory and likely also instruction memory to accommodate Dilithium-3; 12kB each should be enough.

Conclusion

That’s all for this long-winded series of posts! I hope that explaining a bit about our experience and thought process has been helpful or at least entertaining to people who are also thinking about implementing PQC. This is a huge, complex, and fascinating area. Many thanks to all of the brilliant, talented researchers who have helped me understand this space and contributed to the works I cited above!


Interested in learning more? Sign up for our early-access program or contact us at info@zerorisc.com.


(1) The NIST standard, FIPS 205, renames the algorithm to SLH-DSA, but we’ll refer to it as SPHINCS+ for this post. Our implementation is compatible with round-3 SPHINCS+ for last year’s chips and will be compatible with FIPS 205 for future versions.

(2) OpenTitan is a large open-source project stewarded by lowRISC on which many organizations, including zeroRISC, collaborate.

(3) It would be technically possible to run the Dilithium and Kyber reference implementations on Ibex/KMAC instead of OTBN. This would work with no hardware modifications but would be slow. As a very rough back-of-the-napkin estimate, it would take something like 89ms per Dilithium3 signature, much slower than even OTBN with more memory and no ISA changes. (The pqm4 measurement is 127ms, 31% of that is SHAKE, and I'm using the estimate from post 1 that hardware SHAKE takes 3% of the time software SHAKE does).
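For the curious, that estimate is just the footnote’s numbers combined:

```python
pqm4_ms = 127.0    # pqm4 Dilithium3 signing time on Cortex-M4
shake_frac = 0.31  # fraction of that time spent in SHAKE
hw_ratio = 0.03    # hardware SHAKE time as a fraction of software SHAKE (post 1)

print(pqm4_ms * ((1 - shake_frac) + shake_frac * hw_ratio))  # ~88.8 ms
```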

(4) Specifically, the implementation from Compact Dilithium Implementations on Cortex-M3 and Cortex-M4 (2021).
