How will you know the mask of your FPGA isn't changed?
I think that the idea behind all of this is sound, but at some point we will have to accept that there will always be some remnant of insecurity, unless you are willing to create your own fab or build your CPU up out of discrete transistors, which (1) can be exhaustively tested and (2) are too simple to contain anything nefarious, given that there is no way to know where in a circuit a given transistor will end up.
1. FPGAs are easier to verify because they are regular structures.
2. How do you insert a backdoor into an FPGA in the supply chain if you don't know exactly what logic is going to be uploaded?
> How do you insert a backdoor into an FPGA in the supply chain if you don't know exactly what logic is going to be uploaded?
Popularity of certain open core designs might be one way to gain advance knowledge of how an FPGA might be used.
That suggests an interesting option: scramble the input to an FPGA in such a way that the device still works, but it becomes even less predictable how its internal connections will be used (otherwise you could take a number of open core designs and arrange for your attack to work with those configurations, which might be detectable in hardware or in the toolchain).
Better yet, scramble the bitstream on every boot (but what would do the scrambling?).
TFA says "We rely on logic placement randomization to mitigate the threat of fixed silicon backdoors" and goes on to talk about the tools they are working on, but there is still some way to go.
> The placement of logic with an FPGA can be trivially randomized by incorporating a random seed in the source code. This means it is not practically useful for an adversary to backdoor a few logic cells within an FPGA. A broadly effective silicon-level attack on an FPGA would lead to gross size changes in the silicon die that can be readily quantified non-destructively through X-rays. The efficacy of this mitigation is analogous to ASLR: it’s not bulletproof, but it’s cheap to execute with a significant payout in complicating potential attacks.
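For illustration only (this is not the Betrusted toolchain, and every name below is made up), one way to picture "incorporating a random seed in the source code" is to bake a fresh per-build constant into the design, so each resynthesis yields a slightly different netlist that the placer then lays out differently. A minimal SpinalHDL sketch, assuming you regenerate the Verilog before every build:

    import spinal.core._
    import scala.util.Random

    // Hypothetical example: a read-only "build id" register whose reset value
    // changes on every synthesis run, perturbing the netlist (and therefore
    // the placement) from build to build.
    class SeededScratchReg(seed: BigInt) extends Component {
      val io = new Bundle {
        val readback = out UInt (32 bits)
      }
      val buildId = RegInit(U(seed & 0xFFFFFFFFL, 32 bits))
      io.readback := buildId
    }

    object BuildWithSeed extends App {
      val seed = BigInt(32, new Random())        // fresh seed for this build
      SpinalVerilog(new SeededScratchReg(seed))  // emit Verilog for the FPGA flow
    }

In practice the seed more often goes straight into the place-and-route tool (nextpnr, for instance, takes a placement seed on the command line), but the effect is the same: no two builds put the same logic in the same cells.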
There is active research around the topics of logic locking and logic obfuscation. You can get a feel for the state-of-the-art by following HOST: http://www.hostsymposium.org/
I think if you can enforce that the user resynthesizes the design on their own, and you make sure the synthesis/placement comes out different each time, you are way ahead of any other ASIC alternative in terms of trust.
But that would automatically limit the deployment to a very, very small portion of the public, those tech-savvy enough to do that: a few thousand to tens of thousands of people worldwide. Unless they were to become ambassadors of sorts - and assuming anybody else would even care - that would still leave the rest wide open.
Cracking that is a difficult problem: you would have non-tech-savvy consumers who need to gain access to secure devices - or at least, that's what we think; your average consumer doesn't care at all. It would require a very large, visible and super embarrassing event to change the typical 'privacy is dead, get over it' mindset into 'give me that secure hardware'.
Right now the only people who would be interested are those that rightfully have something to fear from nation state level actors (spies, dissidents, politicians, would-be whistleblowers). And using a device like this would make them stand out like sore thumbs.
> that would automatically limit the deployment to a very, very small portion of the public, those tech savvy enough to [resynthesize their CPU from source]
Every time someone opens the Facebook web page, their browser recompiles the Facebook application from JS source, typically using ASLR. Facebook's user interface is vastly more complex than a CPU. Yet Facebook is not limited to a very, very small tech savvy portion of the public.
Simple measures of complexity lead me to think otherwise. For example: the size of the 6502 design team (8 people, four of them circuit designers, several of whom were doing things like design rule checks, working from August 1974 to June 1975 - 10 months) versus the size of Facebook's frontend team (within a factor of 3 of 1000 people, with 15 years of work behind them); the number of transistors in the 6502 (3,510) versus the number of bits in Facebook's minified frontend code (on the order of 100 million); and the line count of the Verilog designs for James Bowman's J1A CPU and Chuck Thacker's Tiny Computer for STEPS (both under 100 lines of code) versus the line count of just React (a small part of Facebook's code).
You can try to convince people that software that has had 250 times as many people working on it for 18 times as long is "a few orders of magnitude less complex than" the 6502, but I think you're going to have an uphill battle. Three orders of magnitude less complex than the 6502 would be four transistors, a single CMOS NAND or NOR gate.
That's when you could do layout by hand. I thought you meant a modern CPU, which is what that UI runs on. Good luck using Facebook on a 6502, which really brings up the rest of the subsystems that collaborate with the CPU.
Yes, that's when you had to do layout and design rule checks by hand, and you didn't have SPICE, so you couldn't run a simulation --- you had to breadboard your circuits if you weren't sure of them. Before that, you had to do the gate-level design because you didn't even have Verilog, much less modern high-level HDLs like Chisel, Migen, and Spinal. Before that, you had to desk-check your RTL design, because mainframe time was too expensive to waste finding obvious bugs in designs that hadn't been desk-checked. That's why the 6502 took eight talented people ten months. Nowadays you don't need to do any of that stuff, so it's much easier now to design a CPU than it was in 1974.
It's true that you need a faster CPU than a 6502 to run Facebook, but that's a matter of transistor count and clock speed much more than logical complexity. To a great extent, in fact, since both transistor count and clever design can improve performance, you can trade off transistor count against logical complexity if you're holding performance constant. (As a simple example, a 64-bit processor can be the same logical complexity as a 16-bit processor --- you can even make the bit width a parameter in your Chisel source code. An 8-bit processor needs to be more complex because an 8-bit address space is not practical.) Such a tradeoff is not an option for Intel, who need the best possible price-performance tradeoff in the market, which involves pushing hard on both transistor count and logical design.
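To make the parameterization point concrete, here is a hedged Chisel sketch (ParamAdder is a made-up name, not taken from any real design): the same source elaborates to a 16-bit or a 64-bit datapath with no change in logical complexity.

    import chisel3._

    // A width-parameterized adder: the logical description is identical whether
    // you elaborate it at 16 or 64 bits; only the transistor count changes.
    class ParamAdder(width: Int) extends Module {
      val io = IO(new Bundle {
        val a   = Input(UInt(width.W))
        val b   = Input(UInt(width.W))
        val sum = Output(UInt(width.W))
      })
      io.sum := io.a + io.b
    }

    // e.g. emit Verilog for a 64-bit instance:
    //   (new chisel3.stage.ChiselStage).emitVerilog(new ParamAdder(64))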
Even if we take Intel's current designs as a reference, it's absurd to suggest that they're even as complex as Facebook's user interface, let alone multiple orders of magnitude more complex. Do you literally think that Intel has hundreds of thousands of employees working on CPU design? They don't even have multiple hundreds of thousands of employees total. Do you literally think that the "source code" for a 64-core, 64-bit, 30-billion-transistor CPU like the Amazon Graviton2 --- thus less than half a billion transistors per core --- is multiple gigabytes? Like, several bytes per transistor?
Let's look at a real CPU design that it's plausible to run Facebook's UI on. https://github.com/SpinalHDL/VexRiscv is an LGPL-licensed RISC-V core written in SpinalHDL, an embedded DSL in Scala for hardware design. The CPU implementation is a bit less than 10,000 lines of Scala, but only about 2500 of that is the core part, the rest being in the "demo" and "plugin" subdirectories. There's another few thousand lines of C++ for tests. (There's also 40,000 lines of objdump output for tests, but presumably that's mostly disassembled compiler output.) You can run Linux on it, and you can run it on a variety of FPGAs; one Linux-capable Artix 7 configuration runs 1.21 DMIPS/MHz at 170 MHz.
This is not terribly atypical; the Shakti C-class processor from IIT-Madras at https://gitlab.com/shaktiproject/cores/c-class (1.72 DMIPS/MHz) is 33,000 lines of Bluespec.
Shakti and VexRiscv are about two orders of magnitude more complex than a simple CPU design like the J1A or Wirth's RISC, but they are full-featured RISC-V CPUs with reasonable performance, MMUs, cache hierarchies, and multicore cache-coherency protocols, and they can run operating systems like Linux.
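For readers who haven't seen one of these Scala-embedded HDLs, a toy SpinalHDL component gives a feel for how densely a few thousand lines can describe a core (this is a made-up example, not code from VexRiscv or Shakti):

    import spinal.core._

    // A tiny enabled counter: one register, an enable input, an output port.
    class Timer(width: Int) extends Component {
      val io = new Bundle {
        val tick  = in Bool()
        val count = out UInt (width bits)
      }
      val counter = Reg(UInt(width bits)) init (0)
      when(io.tick) {
        counter := counter + 1
      }
      io.count := counter
    }

    object TimerVerilog extends App {
      SpinalVerilog(new Timer(width = 8))  // generates Timer.v for the FPGA flow
    }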
In summary, a simple CPU is about a hundred lines of code and is reasonable for one person to write in a day or a few days. A modern RISC-V CPU with all the bells and whistles is about ten thousand lines of code and is reasonable for half a dozen people to write in a year. Facebook's UI is presumably a few million lines of code and has taken around a thousand talented people over a decade to build. Intel's and AMD's CPUs presumably represent around the same order of magnitude of effort, but much of that is the verification needed to avoid a repeat of the Pentium FDIV bug, which both doesn't add to the complexity of the CPU, and also isn't necessary either for Facebook's UI or for a core you're running on an FPGA.
Ergo, a full-featured modern CPU is about two or three orders of magnitude less complex than Facebook's UI, and a CPU optimized for simplicity is about two or three orders of magnitude less complex than that.
Aren't you ignoring a whole host of physical design complexities? Power, clock speed, signal integrity, packaging, manufacturability and yield? Yes, implementing the design in an FPGA solves some of those, but not all.
I guess your overall point is that it could be possible to provide people with source code, have them push one button, and get a working bitstream out (just the same as we simply browse to facebook.com and get a working app). That assumes that the designers know the target FPGA and work extra hard to make sure that their design meets timing/power/etc. budgets with any randomized placement and routing for that FPGA. Hmm, yeah, I guess that probably still is easier than creating Facebook's UI, as long as we can assume some constraints.
> it could be possible to provide people with source code, have them push one button, and get a working bitstream out (just the same as we simply browse to facebook.com and get a working app).
Right.
> packaging, manufacturability and yield
Using an FPGA solves those problems.
> signal integrity,
When we're talking about digital computing device design, rather than test instrument design or VHF+ RF design, there's a tradeoff curve between how much performance you get and how much risk you're taking on things like signal integrity, and, consequently, how much effort you need to devote to them.
> know the target FPGA
> timing/power/etc. budgets
> Power, clock speed
Similarly, those are optimizations. Facebook actually has a lot of trouble with power and speed, I think because they don't give a flip --- they aren't the ones who have to buy the new phones. They have trouble delivering messaging functionality on my 1.6GHz Atom that worked fine on a 12MHz 286 with 640K of RAM, so they have something like three orders of magnitude of deoptimization. (The 286 took a couple of seconds to decode a 640x350 GIF, as I recall, and Facebook is quite a bit faster than that at decoding and compositing JPEGs --- because that code is written in C and comes from libjpeg.)
In solving this problem, I think there is no perfect solution right now, just steps in the right direction: making attacks harder, rather than impossible.
The article is long, so it is understandable that many people did not read it to the end, which is a shame, because I think the conclusion is really important:
"I personally regard Betrusted as more of an evolution toward — rather than an end to — the quest for verifiable, trustworthy hardware. I’ve struggled for years to distill the reasons why openness is insufficient to solve trust problems in hardware into a succinct set of principles. I’m also sure these principles will continue to evolve as we develop a better and more sophisticated understanding of the use cases, their threat models, and the tools available to address them."
It is a quest. It will be made up of a lot of partial solutions. FPGAs are just easier to inspect, and their functions are harder to backdoor if you don't know what they will run. Harder, but by no means impossible. But at this stage, if we can make things 50% harder in 50% of the cases, that's progress.
I wonder if you can monitor energy usage (with an external chip) and compare it to what is expected, to catch major changes.
So for the FPGA, you could load it with a RISC-V core and then run that core through some performance load. If the energy usage has changed a lot, it may well be doing something nefarious. Bonus points if you can have a (set of) reference FPGAs in the cloud on which you can compare arbitrary workloads, so that it is harder to predict the checks and be stealthy about nefarious activities.
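A hedged sketch of just the comparison step, in Scala (the current-sense chip, the workload runner, and the cloud reference service are all assumed to exist elsewhere; every name here is invented):

    object PowerProfileCheck {
      // Mean of a power trace, in watts.
      def mean(trace: Seq[Double]): Double = trace.sum / trace.length

      // Flag the device if its draw under a known workload deviates from the
      // reference FPGA's draw by more than `tolerance` (0.10 = 10%).
      def looksSuspicious(measured: Seq[Double],
                          reference: Seq[Double],
                          tolerance: Double = 0.10): Boolean = {
        require(measured.nonEmpty && reference.nonEmpty)
        val ratio = mean(measured) / mean(reference)
        math.abs(ratio - 1.0) > tolerance
      }
    }

    // e.g. PowerProfileCheck.looksSuspicious(samplesFromCurrentSensor, cloudReferenceSamples)

A real check would presumably compare whole traces per workload rather than a single mean, since extra logic that only wakes up occasionally would disappear into an average.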
Use side-channel sources of information, where possible, to drive down the scale of changes possible.
I think that at some point in the future, 'zero trust' will extend all the way down to the hardware level, with individual components exchanging keys, or otherwise nothing will happen. There simply won't be a safe perimeter within which you can trust another piece of hardware. And that's probably as it should be, because a modern computer is better thought of as a network of - hopefully - collaborating processors than as a single CPU with some RAM and peripherals.
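As a rough sketch of the "components exchanging keys or otherwise nothing will happen" idea at its very simplest, here is a minimal HMAC challenge-response between a host and a peripheral that share a secret (all names are invented; real hardware mutual authentication would involve much more, such as per-device certificates and anti-replay state):

    import java.security.{MessageDigest, SecureRandom}
    import javax.crypto.Mac
    import javax.crypto.spec.SecretKeySpec

    object PeerAuth {
      private def hmac(key: Array[Byte], msg: Array[Byte]): Array[Byte] = {
        val mac = Mac.getInstance("HmacSHA256")
        mac.init(new SecretKeySpec(key, "HmacSHA256"))
        mac.doFinal(msg)
      }

      // Host side: generate a fresh nonce to send to the peripheral.
      def challenge(): Array[Byte] = {
        val nonce = new Array[Byte](32)
        new SecureRandom().nextBytes(nonce)
        nonce
      }

      // Peripheral side: prove possession of the shared key.
      def respond(sharedKey: Array[Byte], nonce: Array[Byte]): Array[Byte] =
        hmac(sharedKey, nonce)

      // Host side: refuse to talk to the peripheral unless the response verifies.
      def verify(sharedKey: Array[Byte], nonce: Array[Byte], answer: Array[Byte]): Boolean =
        MessageDigest.isEqual(hmac(sharedKey, nonce), answer)
    }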
This design does actually have a second, external FPGA, which is in the "Untrusted" domain. It's an iCE40UP5K, and it acts more as the power management IC that turns the secure domain on and off.
This actually sounds reasonable with an open source model. The masks are open, so a third party could X-ray chips coming out of various fabs. Since that's a nondestructive process, identical chips could be tested by multiple parties, and the community could compare notes.
Some of the mask changes suggested in this thread would have pretty serious security implications and would be very hard to detect, so I'm not sure that holds.
Okay, yeah. I've read some papers about different attacks that avoid detection through such means. OTOH, those attacks do seem to rely on the simplistic nature of consistency checks. If this were used to make an open FPGA, it seems like we could run a very rigorous set of test structures that would exhaustively test the operations of various devices on-chip.
> unless you are willing to create your own fab or build your CPU up out of discrete transistors [...]
It'll happen someday. I think there are enough hobbyists interested in home manufacturing (of all kinds) that we'll eventually have low-barrier-to-entry home semiconductor fabs. They'll probably sacrifice performance for simplicity - I can't imagine a home fab ever being cutting-edge - but for most applications that's fine.