This is largely a save-your-work checkin.
Created p521/arch_ref64 code to make sure E-521 basically works.
Fixed some of the testing code around E-521. E-521 doesn't quite pass
every test yet.
Created p521/arch_x86_64 code with optimized multiply. In this
checkin, the multiply is fast and works, but all the other code in
that directory is the completely unoptimized ref64 build, which
reduces after every add and sub. So E-521 as a whole isn't fast yet.
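
For reference, a minimal sketch of what "reduces after every add" costs
compared with a lazy approach. This is not the code in p521/arch_ref64.
The type and function names, and the layout of 9 limbs at 58 bits with a
57-bit top limb, are assumptions chosen for illustration. The idea is that
a raw add leaves carries in the spare high bits of each 64-bit limb, and
one weak reduction can be done after a few raw adds instead of after each.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: 8 limbs of 58 bits plus a 57-bit top limb = 521 bits. */
#define SKETCH_NLIMBS 9
#define SKETCH_LIMB_BITS 58
#define SKETCH_TOP_BITS 57

typedef struct { uint64_t limb[SKETCH_NLIMBS]; } sketch_p521_t;

/* Limb-wise add with no carry handling; carries pile up in the spare bits. */
static void sketch_add_raw(sketch_p521_t *out, const sketch_p521_t *a,
                           const sketch_p521_t *b) {
    for (int i = 0; i < SKETCH_NLIMBS; i++) {
        out->limb[i] = a->limb[i] + b->limb[i];
    }
}

/* Weak reduction mod 2^521 - 1: propagate carries once, wrapping the top
 * limb's overflow into limb 0 because 2^521 is congruent to 1.
 * The result is not canonical; the top limb may end slightly above 57 bits. */
static void sketch_weak_reduce(sketch_p521_t *a) {
    const uint64_t mask = ((uint64_t)1 << SKETCH_LIMB_BITS) - 1;
    const uint64_t top_mask = ((uint64_t)1 << SKETCH_TOP_BITS) - 1;
    uint64_t carry = a->limb[SKETCH_NLIMBS - 1] >> SKETCH_TOP_BITS;
    a->limb[SKETCH_NLIMBS - 1] &= top_mask;
    for (int i = 0; i < SKETCH_NLIMBS - 1; i++) {
        a->limb[i] += carry;
        carry = a->limb[i] >> SKETCH_LIMB_BITS;
        a->limb[i] &= mask;
    }
    a->limb[SKETCH_NLIMBS - 1] += carry;
}
```

With this layout each limb has 6 spare bits, so a handful of raw adds of
reduced values can be chained before a single sketch_weak_reduce call.
Subtraction needs a bias to stay non-negative and is left out here.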
Continuing demagication and factoring of field code.
Removing high-level ops from p448.h and putting them in field.h. That way they
won't need rewriting for new fields and architectures.
Created constant_time.h, which contains constant-time lookups, condswaps, etc.
That way the code is the same on all architectures, instead of varying depending
on whether the field size is a multiple of the vector register size. I should
still add a constant_time_select to factor out field_cond_negate.
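
A rough sketch of the kind of word-oriented helpers meant here, assuming
operands are plain arrays of 64-bit words. The sketch_ct_* names and
signatures are hypothetical and are not the actual constant_time.h API.
Everything is done with masks, so there are no secret-dependent branches or
memory addresses in the source, though a compiler could still in principle
reintroduce them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All-ones if flag != 0, zero otherwise, computed without branching. */
static uint64_t sketch_ct_mask(uint64_t flag) {
    /* (flag | -flag) has its top bit set exactly when flag is nonzero. */
    return (uint64_t)0 - ((flag | ((uint64_t)0 - flag)) >> 63);
}

/* Swap a and b when flag != 0; touch both the same way regardless. */
static void sketch_ct_cond_swap(uint64_t *a, uint64_t *b, size_t nwords,
                                uint64_t flag) {
    uint64_t mask = sketch_ct_mask(flag);
    for (size_t i = 0; i < nwords; i++) {
        uint64_t t = (a[i] ^ b[i]) & mask;
        a[i] ^= t;
        b[i] ^= t;
    }
}

/* out = flag ? a : b, word by word. */
static void sketch_ct_select(uint64_t *out, const uint64_t *a,
                             const uint64_t *b, size_t nwords, uint64_t flag) {
    uint64_t mask = sketch_ct_mask(flag);
    for (size_t i = 0; i < nwords; i++) {
        out[i] = (a[i] & mask) | (b[i] & ~mask);
    }
}

/* Copy table entry idx into out by scanning every entry, so the access
 * pattern does not depend on idx. The table size is assumed public. */
static void sketch_ct_lookup(uint64_t *out, const uint64_t *table,
                             size_t nentries, size_t nwords, size_t idx) {
    assert(nentries > 0);
    for (size_t i = 0; i < nwords; i++) out[i] = 0;
    for (size_t j = 0; j < nentries; j++) {
        uint64_t mask = ~sketch_ct_mask((uint64_t)(j ^ idx));
        for (size_t i = 0; i < nwords; i++) {
            out[i] |= table[j * nwords + i] & mask;
        }
    }
}
```

A conditional negate could then be written as a negate into a temporary
followed by a select between the temporary and the original value.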
TODO: I need to test this for correctness and performance on various platforms.
It works on my Mac, but since Yosemite the timing is totally unpredictable
(background tasks? variable boost?).