Add library for compile time instantiation of elliptic curves #3979
base: master
Conversation
Awesome! I'll certainly have a look. Not before tomorrow, though.
No rush! This is certainly not going into 3.4, so there is plenty of time.

Most interesting thing about the sea of red in CI: GCC 11.4 on x86 fails with a constexpr timeout, while the same version of GCC on, say, Aarch64 or S390 accepts. So constexpr limits are architecture dependent [*] :( and thus intrinsically flaky, since you can be near some limit without knowing it.

[*] And also version dependent, since GCC 13 on my machine is fine with the code without increasing the constexpr limit.
> Clang seems to have a nasty bug where […]

It works fine for me in Clang 17 on my machines, so I'm assuming this was a Clang bug that was fixed subsequently. I'd vote for just bumping the Clang minimum version - that's an absolutely insane bug and nearly impossible to work around - but unfortunately the version in the Android NDK also has the bug, and we are stuck with that version at least until the next NDK release.
Downside of this approach is that (due to name mangling being somewhat horribly thought out for this case) object sizes are really big. Even with just 3 curves, on my machine the object file is one of the largest in the whole library, and that will get a lot worse with 27. And compile times may prove untenable. I'm sure 90% of this can be salvaged, but I suspect in the end the string-based instantiation won't work in practice 😭 which is a shame since it's quite pretty imo.
Clang bug might be llvm/llvm-project#55638
This error […]

Looks like llvm/llvm-project#51182

I can repro the […]
Force-pushed from 3496464 to cd40575
Interesting! 😄
I merely skimmed through the changes and left a few suggestions and comments here and there. No thorough review whatsoever.
```cpp
#ifndef BOTAN_PCURVES_UTIL_H_
#define BOTAN_PCURVES_UTIL_H_
```
General comment for this file (and perhaps other places):

I'd suggest using `std::span<const W, N>` instead of `std::array<>&` for the parameters of those utilities. No need to restrict those functions to arrays, IMO. And a static-length span should provide the same guarantees and optimization opportunities as an array, no?
Main problem is, for some technical reason I'm not bothering to learn the details of, C++ can't deduce the conversion from a statically sized array to a statically sized span. https://stackoverflow.com/questions/70983595

This can be worked around with some additional noise, but the additional flexibility doesn't buy us anything in this context, so I'd rather keep things as simple as possible.
```cpp
auto v = bigint_monty_redc(m_val, Self::P, Self::P_dash);
std::reverse(v.begin(), v.end());
auto bytes = store_be(v);

if constexpr(Self::BYTES == Self::N * WordInfo<W>::bytes) {
   return bytes;
} else {
   // Remove leading zero bytes
   const size_t extra = Self::N * WordInfo<W>::bytes - Self::BYTES;
   std::array<uint8_t, Self::BYTES> out;
   copy_mem(out.data(), &bytes[extra], Self::BYTES);
   return out;
}
```
Perhaps we could get away without the `if constexpr` and the potential memory copy entirely?
Suggested change:

```cpp
auto v = bigint_monty_redc(m_val, Self::P, Self::P_dash);
std::reverse(v.begin(), v.end());
std::array<uint8_t, Self::BYTES> out = {0};
static_assert(Self::N * WordInfo<W>::bytes >= Self::BYTES);  // is this guaranteed somehow anyway?
const size_t zero_padding_offset = Self::N * WordInfo<W>::bytes - Self::BYTES;
store_be(std::span{out}.template subspan<zero_padding_offset>(), v);
return out;
```
... this is untested, just a sketch of the idea. `store_be` with a statically-sized out-param will `static_assert` that the byte buffer matches the input range exactly.
```cpp
void conditional_assign(bool cond, const Self& pt) {
   m_x.conditional_assign(cond, pt.x());
   m_y.conditional_assign(cond, pt.y());
   m_z.conditional_assign(cond, pt.z());
}
```
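For background, the word-level building block underneath such a method is conventionally a mask-based conditional assignment rather than a branch, so that control flow never depends on secret data. A minimal standalone sketch (illustrative only, not the PR's implementation):

```cpp
#include <cstdint>

// All-ones mask if cond is true, all-zeros otherwise. Computing the mask
// arithmetically avoids a branch on secret-dependent data.
inline uint32_t ct_expand_mask(bool cond) {
   return 0u - static_cast<uint32_t>(cond);
}

// dst becomes src when cond is true, and stays unchanged otherwise,
// using only bitwise operations (no secret-dependent branches).
inline void ct_conditional_assign(uint32_t& dst, bool cond, uint32_t src) {
   const uint32_t mask = ct_expand_mask(cond);
   dst = (dst & ~mask) | (src & mask);
}
```

In the PR's code each field-element limb would be handled this way; whether the compiler preserves the branchless form is a separate concern (see the blinding discussion below).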
I'm assuming this is needed for a constant-time implementation? So far, I saw it as an unspoken rule that all methods that are "supposed to be constant-time" are prefixed with `ct_`. I found this a helpful convention that might be applicable here as well.
In this code I'm going for the opposite convention, namely that everything is constant time by default and anything that is intentionally variable time has a `_vartime` suffix.

(There are some current exceptions, e.g. in the point multiplication, that will be fixed before merging.)
GCC 11 miscompilation, excellent.
Force-pushed from 2feb181 to 3bde391
Force-pushed from f694ccc to 8e5d544
@reneme I think this is ready to go.
A few comments/questions/nits. I'm still figuring out the structure of this. So far: really nice!
Currently, my biggest concerns are:

- Why are the instances of the `PrimeOrderCurveImpl` template singletons? My concern is that the singletons' states will become an issue in testing (for instance). And can those singletons be avoided? I didn't get deep enough into the implementation to be able to have a decent opinion on that.
- Should there be a way to select specific curves at compile time? At the moment, it's "all-or-nothing" and, I feel, especially for embedded use cases it might be helpful for binary size to be able to fine-tune which curves should be compiled.
And one general wish:
Some Doxygen comments describing the rough use case of each class and also their relationship would really help to get started with this. 😅
src/lib/math/pcurves/pcurves.h (outdated)
```cpp
/// Creates a generic non-optimized version
//static std::shared_ptr<const PrimeOrderCurve> from_params(...);
```
TODO or left over?
I think at this point it's a leftover - removed.

It was (somewhat still is) unclear how to handle curves that are not baked in (either application-specific curves, but also the various "misc" curves like secp192k1, x962_p239v1, etc). One approach would be to somehow implement `PrimeOrderCurve` for these in a generic way, and that's what this commented-out decl was leading to.

In practice though, examining the list of curves remaining, some have rather unappealing properties that would be difficult to fit into this model. Particularly curves where p % 4 != 3, or where the order is one bit larger than the field.

So I think what's going to happen, at least in the short term (i.e. prior to Botan 4, where we can maybe remove some of the most unappealing builtin curves, plus remove explicit encoding of EC curves in keys/ASN.1, plus put some real restrictions on what applications use for custom curves), is that for anything that's not a builtin, we default back to the old generic BigInt approach.
```cpp
typedef std::array<word, StorageWords> StorageUnit;
typedef std::shared_ptr<const PrimeOrderCurve> CurvePtr;
```
Move those definitions up to use `CurvePtr` for `from_name()` and `from_id()`, perhaps?
For the record: I'm working on some high-level overview to structure my reading. Perhaps it's helpful for others as well. Updated: 30.05.2024
So far I mostly looked into the structure of the implementation. We'll probably go deeper into the math next week (when @FAlbertDev is back from vacation 😰). Some more nits and suggestions I found on the way are below.
```cpp
class Params final : public EllipticCurveParameters<
   "FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEFFFFFFFF0000000000000000FFFFFFFF",
   "FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEFFFFFFFF0000000000000000FFFFFFFC",
   "B3312FA7E23EE7E4988E056BE3F82D19181D9C6EFE8141120314088F5013875AC656398D8A2ED19D2A85C8EDD3EC2AEF",
   "FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFC7634D81F4372DDF581A0DB248B0A77AECEC196ACCC52973",
   "AA87CA22BE8B05378EB1C71EF320AD746E1D3B628BA79B9859F741E082542A385502F25DBF55296C3A545E3872760AB7",
   "3617DE4A96262C6F5D9E98BF9292DC29F8F41DBD289A147CE9DA3113B5F0B8C00A60B1CE1D7E819D7A431D7C90EA0E5F",
   -12> {
};
```
I really like the compactness of this! Though, you could replace the generic `EllipticCurveParameters` struct with a concept and re-define the parameter struct for each curve. That would "label" each parameter value with its intended parameter name and also remove the need for the `StringLiteral` helper. Like shown in this GodBolt.

Admittedly, after typing it out, I'm not super convinced about it myself anymore. However, the concept might still be worthwhile to restrict the `typename Params` in the `EllipticCurve` class and elsewhere (e.g. `MontgomeryRep`). Also, this concept could be extended to statically check further requirements of the parameter relationships as needed.
```cpp
std::array<W, 2 * N> z;
comba_mul<N>(z.data(), a.data(), b.data());
return Self(Rep::redc(z));
```
Nit: Perhaps implement this in terms of `operator*=`, like:
Suggested change:

```cpp
auto result = a;
result *= b;
return result;
```
... same for `operator+` and the others, of course. That would reduce code duplication a tiny bit.
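The suggested pattern, with the binary operator delegating to the compound assignment, looks roughly like this on a toy type (illustrative only; `ModInt` is a stand-in, not the PR's field-element class):

```cpp
#include <cstdint>

struct ModInt {
   uint64_t v;
   static constexpr uint64_t P = 2147483647;  // toy modulus (2^31 - 1); ignores overflow for large v

   // The "real" arithmetic lives in the compound assignment...
   ModInt& operator*=(const ModInt& o) {
      v = (v * o.v) % P;
      return *this;
   }

   // ...and the binary operator is a copy plus delegation, so the
   // reduction logic exists in exactly one place.
   friend ModInt operator*(const ModInt& a, const ModInt& b) {
      auto result = a;
      result *= b;
      return result;
   }
};
```

The same two-line delegation works for `operator+`/`operator+=` and the rest of the arithmetic operators.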
src/lib/math/pcurves/pcurves_impl.h (outdated)
```cpp
/*
It's impossible for this addition to ever be a doubling.

Consider the sequence of points that are operated on, and specifically
their discrete logarithms. We start out at the point at infinity
(dlog 0) and then add the initial window, which is precisely P*w_0.

We then perform WindowBits doublings, so accum's dlog at the point
of the addition in the first iteration of the loop (when i == 1) is
at least 2^W * w_0.

Since we know w_0 > 0, in every iteration of the loop accum's
dlog will always be greater than the dlog of the table element we
just looked up (something between 0 and 2^W-1), and thus the
addition into accum cannot be a doubling.
*/
accum += AffinePoint::ct_select(m_table, w_i);
```
Is this true for all cases? Since we use blinding (with scalars bigger than the group order), we could end up with an accumulator of, for example, […]. Of course, this would be extremely unlikely.
Oh good point, this analysis did not consider "overflow" of the dlog of the accumulator due to blinding. Alas.

Possibly there is some other scalar multiplication algorithm we can use here; I'll have to investigate. IMO it doesn't block merging -- the situation wrt side channels here is still strictly better than with the `BigInt`-based code, which even uses the same algorithm for this -- but this should be addressed.
```cpp
constexpr size_t n_words = C::NW.size();

uint8_t maskb[mask_bytes] = {0};
```
src/lib/math/pcurves/pcurves_impl.h (outdated)
```cpp
size_t get_window(size_t offset) const { return read_window_bits<WindowBits>(std::span{m_bytes}, offset); }

private:
   secure_vector<uint8_t> m_bytes;
```
How come we need a dynamic container here? Edit: I'm guessing you don't want to allocate the entire lookup table on the stack. Perhaps leave a comment for future-us?
I think a stack allocation here is pretty cheap, the main motivation was to avoid putting the scalar onto the stack. But maybe we can just admit that zeroization is a lost cause and use the stack for everything.
Or just do it in the destructor of this helper class?
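The destructor-based approach could look roughly like this. This is an illustrative sketch with made-up names; a production version should use a scrub routine the compiler cannot optimize away (e.g. Botan's `secure_scrub_memory`) rather than the volatile-write loop shown here:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper holding the (blinded) scalar bytes; wipes its
// buffer on destruction so the secret doesn't linger in memory.
class BlindedScalarBytes {
   public:
      explicit BlindedScalarBytes(std::vector<uint8_t> bytes) : m_bytes(std::move(bytes)) {}

      ~BlindedScalarBytes() {
         // Volatile-qualified writes as a portable-ish best effort; a
         // dedicated scrub function is preferable in real code.
         volatile uint8_t* p = m_bytes.data();
         for(std::size_t i = 0; i != m_bytes.size(); ++i) {
            p[i] = 0;
         }
      }

      const std::vector<uint8_t>& bytes() const { return m_bytes; }

   private:
      std::vector<uint8_t> m_bytes;
};
```

This keeps the zeroization policy in one place, so switching the storage to a stack buffer later wouldn't change the callers.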
```cpp
if(i <= 3) {
   accum.randomize_rep(rng);
}
```
Any rationale why randomizations aren't necessary after i == 3?
TBH no. This is not well motivated. The idea is that in the first few iterations only a few secret bits have been introduced into `accum`, and so rerandomization might be helpful. But this is extremely handwavey.

I'd be open to removing this entirely, or just doing rerandomization once during the first iteration.

If you have contacts at BSI -- they may have better ideas on how to approach this.
Related: given that the entire multiplication is implemented as constant-time (in particular the table lookup and the point operations), do we even need blinding here? Or is that merely an additional precaution?
> Do we even need blinding here?

- Programmers make mistakes
- Compilers are smart (sometimes)
- Might help with EM / power analysis (questionable)
Defense in depth. Thanks for the background.
src/lib/math/pcurves/pcurves_impl.h (outdated)
```cpp
constexpr size_t mask_words = BlindingBits / WordInfo<W>::bits;
constexpr size_t mask_bytes = mask_words * WordInfo<W>::bytes;

constexpr size_t n_words = C::NW.size();

uint8_t maskb[mask_bytes] = {0};
rng.randomize(maskb, mask_bytes);

W mask[n_words] = {0};
load_le(mask, maskb, mask_words);
mask[mask_words - 1] |= WordInfo<W>::top_bit;

W mask_n[2 * n_words] = {0};

const auto sw = scalar.to_words();

// Compute masked scalar s + k*n
comba_mul<n_words>(mask_n, mask, C::NW.data());
bigint_add2_nc(mask_n, 2 * n_words, sw.data(), sw.size());

std::reverse(mask_n, mask_n + 2 * n_words);
m_bytes = store_be<secure_vector<uint8_t>>(mask_n);
```
This would become more compact if `comba_mul` and `bigint_add2_nc` were able to handle spans.
Suggested change:

```cpp
constexpr size_t mask_words = BlindingBits / WordInfo<W>::bits;
constexpr size_t n_words = C::NW.size();

std::array<W, n_words> mask = {};
std::array<W, 2 * n_words> masked_scalar = {};

auto effective_mask = std::span{mask}.template first<mask_words>();
load_le(effective_mask, rng.random_array<effective_mask.size_bytes()>());
effective_mask.back() |= WordInfo<W>::top_bit;

const auto sw = scalar.to_words();

// Compute masked scalar s + k*n
comba_mul<n_words>(masked_scalar.data(), mask.data(), C::NW.data());
bigint_add2_nc(masked_scalar.data(), masked_scalar.size(), sw.data(), sw.size());

std::reverse(masked_scalar.begin(), masked_scalar.end());
m_bytes = store_be<secure_vector<uint8_t>>(masked_scalar);
```
... also, this assumes that we add the following overload to `RandomNumberGenerator`:
```cpp
/**
 * Create an array of @p N random bytes.
 *
 * @tparam N number of desired random bytes
 * @return an array of @p N random bytes
 * @throws Exception if RNG fails
 */
template <size_t N>
std::array<uint8_t, N> random_array() {
   std::array<uint8_t, N> result;
   random_vec(std::span{result});
   return result;
}
```
I like this a lot! Thanks for bearing with the load of comments we've had about that. There are certainly some more nits to be found, but I feel it's more efficient if we address them in follow-ups.
👍 Thanks for the reviews! Some of your review comments still seem worth fixing here; also I need to clean up this dumpster fire of a commit history before merge. I'd also like #4096 to merge prior to this landing, so we have some assurance wrt constant time, just in case more substantial reworking is required.
Force-pushed from e98e9d0 to 0ef936b
Thanks for the added documentation and the co-authorship! 🤩
```cpp
 * are set then add G or H resp. If both bits are set, add GH.
 */
template <typename C, size_t W>
class WindowedMul2Table final {
```
Fabian and I had done a refactoring on this, here: 030c919. I don't remember receiving feedback about this commit from you; just making sure it doesn't fall through the cracks.

At this point it might not apply neatly anymore, though. I'll rebase later.
Rebased: b6e847e
Thanks for the rebase. I did see the commit but still need to think that one over.
```cpp
 * The W = 1 case is simple; we precompute an extra point GH = G + H,
 * and then examine 1 bit in each of x and y. If one or the other bits
 * are set then add G or H resp. If both bits are set, add GH.
```
In #4095 we added a bit of ASCII art for the W = 2 case. Perhaps that's worth adding:
```cpp
/**
 * The example below is a precomputed table for W=2. The flattened table
 * begins at (x_i,y_i) = (1,0), i.e. the identity element is omitted.
 * The indices in each cell refer to the cell's location in m_table.
 *
 *   x->      0          1          2           3
 *    0  |/ (ident) |0  x      |1  2x      |2  3x     |
 *    1  |3  y      |4  x+y    |5  2x+y    |6  3x+y   |
 * y= 2  |7  2y     |8  x+2y   |9  2(x+y)  |10 3x+2y  |
 *    3  |11 3y     |12 x+3y   |13 2x+3y   |14 3x+3y  |
 */
```
(this is also part of #3979 (comment))
This is designed from the start to avoid side channels and for good performance. On most benchmarks it is approximately 2x faster.

Co-Authored-By: René Meusel <rene.meusel@rohde-schwarz.com>
Co-Authored-By: Fabian Albert <fabian.albert@rohde-schwarz.com>
This new approach has a lot of benefits. It avoids the side channels that plague anything based on `BigInt` (or any general-purpose arbitrary-length integer type); it also avoids almost all heap allocations and inlines nicely.

Performance advantage depends on curve and compiler. For smaller curves (e.g. secp256r1), ECDSA seems to be almost twice as fast. For larger curves such as secp521r1, the advantage is more like 40%. Clang seems to be able to optimize everything much better than GCC here, for reasons I haven't been able to nail down.
Performance can be improved significantly from here; besides P-521 we're not taking advantage of specialized reductions possible with many primes, nor is there a system to express the precomputed addition chains used for inversions. The BigInt based code makes use of both of these already.
One disadvantage is that the code size is quite significant. The worst overhead is actually the symbol names. I've applied various tricks to reduce these, but still over half the size of `math_pcurves.o` is (per Google's bloaty tool) symbols. I'd like to reduce these in the future, but I don't think it's a blocker to merge.