The standard approach is to be liberal in what you accept and specific in what you emit.
You could
1) Filter the broken message
2) Drop the broken message
3) Ignore the broken attributes but pass them on
4) Break with the broken attributes
To me, only 4 (Arista) is really unacceptable behaviour. 3 (Juniper) isn't desirable, but it's not devastating behaviour.
EDIT: Actually rereading it, Arista did 2 rather than 4. I think it just closed the connection as being invalid rather than completely crash. That's arguably acceptable, but not great for the users.
There is already RFC 7606 (Revised Error Handling for BGP UPDATE Messages), which specifies in detail how broken BGP messages should be handled.
The most common approach is 'treat-as-withdraw', i.e. handle the update (announcement of a route) as if it were a withdraw (removal of a previously announced route). You should not just drop the broken message, as that would lead to keeping old, no longer valid state.
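Roughly, in toy Python (the names and the RIB structure here are made up, not any real implementation's API):

    # Toy sketch of RFC 7606 "treat-as-withdraw"; hypothetical names, not a real BGP stack.
    import logging

    log = logging.getLogger("bgp")

    class MalformedAttribute(Exception):
        pass

    def handle_update(peer, prefixes, parse_attrs, rib):
        """rib: dict mapping prefix -> (advertising peer, path attributes)."""
        try:
            attrs = parse_attrs()                     # may raise MalformedAttribute
        except MalformedAttribute as err:
            log.warning("malformed UPDATE from %s: %s", peer, err)
            # Keep the session up, but remove this peer's routes for the affected
            # prefixes rather than keeping old, no-longer-valid state around.
            for prefix in prefixes:
                if rib.get(prefix, (None, None))[0] == peer:
                    del rib[prefix]                   # treat-as-withdraw
            return
        for prefix in prefixes:                       # normal case: install/replace routes
            rib[prefix] = (peer, attrs)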
> The standard approach is to be liberal in what you accept and specific in what you emit.
What you're paraphrasing here is the so-called "robustness principle", also known as "Postel's law". It is an idea from the ancient history of the 1980s and '90s Internet. Today, it's widely understood that it is a misguided idea that has led to protocol ossification and countless security issues.
Postel's Law certainly has led to a lot of problems, but is it really responsible for protocol ossification? Isn't the problem the opposite, e.g. that middleboxes are too strict in what they accept (say only the HTTP application protocol or only the TCP and UDP transport protocols)?
Overly strict and overly liberal both lead to ossification. That's merely the observation that buggy behavior in either direction can potentially come to be relied on (or to be unpredictably forced on you, in the case of middleboxes filtering your traffic).
I'd only expect security issues to result from being overly liberal but 1. I wouldn't expect it to be very common and 2. I'm not at all convinced that's a compelling argument to reduce the robustness of an implementation.
"Overly" here refers to restrictions that exceed the relevant standard. An extensibility mechanism is useless if a nonzero fraction of the network filters out messages that make use of it in certain ways.
Ossification comes from os, ossis: bones in Latin. Turning into bones. Stops being flexible. Common behavior becomes de facto specification. There's stuff that's allowed by the specification but not expected by implementations because things have always worked like this.
It's not related to open source software. The seemingly matching prefix is coincidence :-)
Pretty much when something in the spec in theory could change, but in practice never does. So software and hardware gets built around the assumption that it never changes.
For example, in networking you can have packets sent using TCP or UDP, but in principle any number of protocols could be used. But for decades it was literally only ever those two. Then when QUIC came about, they couldn't implement it at the layer it was meant to be at, because all the routers and software were not built to accept anything other than TCP or UDP.
There's been a bunch of thought put into how to stop this, like making sure anything that can change regularly does. Or using encryption to hide everything from routers and software that might want to inspect and tamper with it.
Literally, it means that something is slowly turning into stone, like dinosaur bones. Protocols and standard libraries suffer from this in a figurative sense.
The trouble is it fails to specify what you're supposed to be liberal with.
Suppose you get a message that violates the standard. It has a length field for a subsection that would extend beyond the length of the entire message. Should you accept this message? No, burn it with fire. It explicitly violates the standard and is presumably malicious or a result of data corruption.
Now suppose you get a message you don't fully understand. It's a DNS request for a SRV record, but your DNS cache was written before SRV records existed. Should you accept this message? Yes. The protocol specifies how to handle arbitrary record types. The length field is standard regardless of the record type, and you treat the record contents as opaque binary data. You can forward it upstream and even cache the result that comes back without any knowledge of the record format. If you reject this request because the record type is unknown, you're the baddies.
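To make the distinction concrete, a simplified sketch (deliberately not real DNS wire-format parsing; the 2-byte type / 2-byte length layout is just for illustration): reject the structural violation, but pass the unknown record type through as opaque data.

    import struct

    KNOWN_TYPES = {1: "A", 28: "AAAA"}   # pretend this cache predates SRV (type 33)

    def parse_record(buf, offset):
        rtype, rdlen = struct.unpack_from("!HH", buf, offset)
        start, end = offset + 4, offset + 4 + rdlen
        if end > len(buf):
            # Length field runs past the end of the message: explicit violation,
            # presumably malicious or corrupted. Burn it with fire.
            raise ValueError("record length exceeds message size")
        rdata = buf[start:end]
        if rtype not in KNOWN_TYPES:
            # Unknown but well-formed: treat the contents as opaque bytes and
            # forward/cache them unchanged instead of rejecting the request.
            return {"type": rtype, "opaque": rdata}, end
        return {"type": KNOWN_TYPES[rtype], "rdata": rdata}, end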
I would say the proper way to apply Postel's law is to reasonable interpretations of standards. Internet standards are just text documents written by humans, and often they are underspecified or have multiple plausible interpretations. There is no IETF court that would give a canonical interpretation (well, the appropriate working group could make a revision of the standard, but that is usually a multi-year effort). So unless we want to break up into multiple non-interoperable implementations, each strictly adhering to its own interpretation, we should be liberal about accepting plausible interpretations.
There are many cases where the RFC is not at all ambiguous about what you're supposed to do, and then some implementation doesn't do it. What should you do in response to this?
If you accept their garbage bytes, things might seem less broken in the short term, but then every implementation is stuck working around some fool's inability to follow directions forever, and the protocol now contains an artificial ambiguity because the bytes they put there now mean both what they're supposed to mean, and also what that implementation erroneously uses them to mean, and it might not always be detectable which case it is. Which breaks things later.
Whereas if you hard reject explicit violations of the standard then things break now and the people doing the breaking are subject to complaints and required to be the ones who stop doing that, rather than having their horkage silently and permanently lower the signal to noise ratio by another increment for everyone else.
One of the main problems here is that people want to be on the side of the debate that allows them to be lazy. If the standard requires you to send X and someone doesn't want to do the work to be able to send X then they say the other side should be liberal in what they accept. If the standard requires someone to receive X and they don't want to do the work to be able to process X then they say implementations should be strict in what they accept and tack on some security rationalization to justify not implementing something mandatory and thereby break the internet for people who aren't them.
But you're correct that there is no IETF court, which is why we need something in the way of an enforcement mechanism. And what that looks like is to willingly cause trouble for the people who violate standards, instead of the other side covering for their bad code.
> If you accept their garbage bytes, things might seem less broken in the short term, but then every implementation is stuck working around some fool's inability to follow directions forever, and the protocol now contains an artificial ambiguity because the bytes they put there now mean both what they're supposed to mean, and also what that implementation erroneously uses them to mean, and it might not always be detectable which case it is. Which breaks things later.
And, if your project is on GitHub, gets your Issues page absolutely clowned on because you're choosing to do the right thing technically and the leeching whiners shitting up the Issues don't want to contribute a goddamn thing other than complaints, and they definitely don't want to go to the authors of the thing that doesn't work with your stuff and try and get that fixed either.
It's a description of how natural language is used, so what you'd expect is constant innovation, with protocols naturally developing extensions that can only be understood within local communities, even though they aren't supposed to.
Something like "this page is best viewed in Internet Explorer" as applied to HTML.
That's only one distinct component. HTML vs XHTML was also a distinct aspect (syntax ambiguity was a lesser problem than the larger ambiguity). The WHATWG fiasco is IMO more important to the point: low-quality, half-baked new features are not an accident but a goal.
XHTML reveals, though, that HTML won on ambiguity over pedantic error identification. The adopters it needed rallied against anything that would have required them, from day one, to say unambiguously what they meant. Starting with a fundamentally flawed demo, blog, or shop that ropes in some commitment, then gradually fixing things on the in-for-a-dime-in-for-a-dollar investor, is basically the whole business model of most fields, if you exclude exchanges between the top 1-10% of buyers and sellers, which have an entirely different structure.
Even things like Facebook are an example of the manure-first model. I wouldn't be stupid enough to let Zuckerberg plan lunch, and as an investor I'm about as savvy as someone who bet against HTML. A billion flies can't be wrong, as the saying goes.
Postel's law is absolutely great if you want to make new things and get them going in a hurry, and I think it was one of the major reasons the TCP/IP stack beat the ISO model. But as you say, it's a disaster if you want to build large robust systems for the long term.
The 1970s were also just a different time: documentation was harder to get, it was harder to do quality implementations of protocols, people had less of an idea what may or may not work because everyone was new at this (both in terms of protocols and implementations), shipping bugfixes took a lot longer, few people were writing tests (and there wasn't a standard test suite), few people had long experience with these protocols, and the general quality of software was a lot lower.
The problem is that folks took advantage of the behavior of BGP where it would forward unknown attributes that the local device didn't understand, as a means to do all sorts of things throughout the network. People now rely on that behavior.
Now, we're experiencing the downside of this "feature".
BGP has classes of attributes that it forwards. While it is true that it forwards route attributes it doesn't know about, this was an attribute that it DID know about and knew it shouldn't forward.
In fact it's a bit strange just how lenient Juniper's software was here. If a session is configured as IBGP on one end and EBGP on the other end, it should never get past the initial message. Juniper not only let it get past the connection establishment but forwarded obviously wrong routes.
Yes but you are seeing a symptom of what I believe is a fundamental design decision to be liberal in passing on data and then _later_ go through and build logic that stops certain things from being forwarded, and the result is that things slip through the cracks that shouldn't.
Rather than the inverse where you only forward things explicitly and by default do not forward.
As far as I'm aware "a session is configured as IBGP on one end and EBGP on the other end" isn't possible.
You can't configure it like that; most of the BGP implementations I'm familiar with automatically treat a same-AS neighbor as iBGP and a different-AS neighbor as eBGP.
Juniper explicitly has 'internal' and 'external' neighbors, but you can't configure a different peer AS than your own on an internal neighbor or the same peer AS on an external neighbor.
BGP sessions also have the AS of the neighbor specified in the local config, and will not bring up the session if it's not what's configured.
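Conceptually, the check is something like this (made-up names; per RFC 4271, a mismatch gets a NOTIFICATION with the Bad Peer AS subcode and the session never reaches Established):

    # Sketch only; the real FSM has many more states and checks.
    def check_open(local_asn, configured_peer_asn, open_my_as):
        if open_my_as != configured_peer_asn:
            return ("NOTIFICATION", "OPEN Message Error / Bad Peer AS")
        # The session type is derived from the configured AS, not set independently:
        session_type = "iBGP" if configured_peer_asn == local_asn else "eBGP"
        return ("CONTINUE", session_type)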
I understand that, but it's a double edged sword. We enjoyed that flexibility for a long time, but lately we are now experiencing the downsides of this flexibility.
At a glance this “feature” seems like an incredibly bad idea, as it allows possibly unknown information to propagate blindly through systems that do not understand the impact of what they are forwarding. However this feature has also allowed widespread deployment of things like Large Communities to happen faster, and has arguably made deployment of new BGP features possible at all.
Being that prescriptive is fundamentally unworkable in practice. Propagating unknown attributes is what made the deployment of 32-bit AS numbers possible (originally RFC 4893; unaware routers pass the `AS4_PATH` attribute without needing to comprehend it), as well as large communities (RFC 8092), the Only To Customer attribute (RFC 9234), and others.
A BGP Update message is mostly just a container of Type-Length-Value attributes. As long as the TLV structure is intact, you should be able to just pass on those TLVs without problems to any peers that the route is destined for.
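Walking those TLVs looks roughly like this (simplified; real parsing per RFC 4271 handles more cases, this is just to show the shape):

    def iter_path_attributes(buf):
        """Yield (flags, type_code, value) for each path attribute in buf (bytes)."""
        i = 0
        while i < len(buf):
            flags, type_code = buf[i], buf[i + 1]
            if flags & 0x10:                                   # Extended Length bit
                length = int.from_bytes(buf[i + 2:i + 4], "big")
                value_start = i + 4
            else:
                length = buf[i + 2]
                value_start = i + 3
            value_end = value_start + length
            if value_end > len(buf):
                raise ValueError("attribute length overruns the UPDATE")
            yield flags, type_code, buf[value_start:value_end]
            i = value_end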
The problem fundamentally is three things:
1. The original BGP RFC suggests tearing down the connection upon receiving an erroneous message. This is a terrible idea, especially for transitive attributes: you'll just reconnect and your peer will resend you the same message, flapping over and over, and the attribute is likely to not even be your peer's fault. The modern recommendation is Treat As Withdraw, i.e. remove any matching routes from the same peer from your routing table.
2. A lack of fuzz testing and similar by BGP implementers (Arista in this case)
3. Even for vendors which have done such testing, a number of them have decided (IMO stupidly) to require you to turn on these robustness features explicitly.
PNG solved this problem when BGP was still young: each section of an image document is marked as to whether understanding it is necessary to process the payload or not. So image transform and palette data is intrinsic, but metadata is not. Adding EXIF for instance is thus made trivial. No browser needs to understand it so it can be added without breaking the distribution mechanism.
This is also how BGP (mostly) solved it. Each attribute has a 'transitive' bit. Unknown attributes with the 'transitive' bit set are passed on; ones without are discarded.
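In toy code, the decision table from RFC 4271 section 5 looks roughly like this (flag bits: 0x80 Optional, 0x40 Transitive, 0x20 Partial; not a real implementation):

    def classify_unknown_attribute(flags, type_code, recognized_types):
        if type_code in recognized_types:
            return "process"                  # we know this one, handle it normally
        if not flags & 0x80:
            return "error"                    # unrecognized *well-known* attribute
        if flags & 0x40:
            # Unknown optional transitive: keep it, pass it on, and set the Partial bit.
            return "propagate-with-partial-bit"
        # Unknown optional non-transitive: quietly drop just this attribute.
        return "discard"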
You're suggesting that being liberal in what you accept is necessary for forward evolution of the protocol, but I think you're presenting a false dichotomy.
In practice there are many ways to allow a protocol to evolve, and being liberal in what you accept is just about the worst way to achieve that. The most obvious alternative is to version the protocol, and have each node support multiple versions.
Old nodes will simply not receive messages for a version of the protocol they do not speak. The subset of nodes supporting a new version can translate messages into older versions of the protocol where it makes sense, and they can do this because they speak the new protocol, so can make an intelligent decision. This allows the network to function as a single entity even when only a subset is able to communicate on the newer protocol.
With strict versioning and compliance to specification, reference validators can be built and fitted as barriers between subnetworks so that problems in one are less likely to spread to others. It becomes trivial for anyone to quickly detect problems in the network.
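A toy sketch of that model (nothing protocol-specific here, names made up): each pair of nodes speaks the highest version both support, and newer nodes translate down where it makes sense.

    def negotiate(local_versions, peer_versions):
        common = set(local_versions) & set(peer_versions)
        return max(common) if common else None    # no guessing if there's no overlap

    def send(msg, msg_version, peer_versions, downgrade):
        """downgrade(msg, v) returns msg translated to version v, or None."""
        if msg_version in peer_versions:
            return msg                            # peer speaks this version natively
        for v in sorted(peer_versions, reverse=True):
            translated = downgrade(msg, v)
            if translated is not None:            # intelligent translation, done by a
                return translated                 # node that speaks the newer version
        return None                               # old node simply never sees the message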
That's in conflict with the philosophy behind the internet. If you just drop anything because you don't understand some part of it, you lose a lot of flexibility. You have to keep in mind that some parts of the internet are running on 20-year-old hardware, but some other parts might work so much better if some protocol is modified a little. Just like with web browsers: if everything is a little bit flexible in what it accepts, you both improve the smoothness of the experience and create room for growth and innovation.
Postel's Law is important, but it creates brittle systems. You can force them further from the ideal operating state before failure, but when they fail they tend to fail suddenly and catastrophically. I like to call it the "Hardness Principle" as opposed to the "Robustness Principle" in analogy to metallurgy.
That's what Postel thought. He was wrong. Allowing everything creates a brittle system because the system has to accept all the undocumented behaviour that other broken systems emit. If broken files were rejected quickly, nobody would generate them.
There's a difference between unknown extensions following a known format, and data that's simply broken (e.g. offset pointer past end of data).
You're not accounting for incorrectly rejected files and protocols, or for incomplete protocol specifications.
And generally I think critics of Postel lack the context in which those decisions were made. You, and probably others, would actually make similar decisions to Postel's on many particular issues.
I disagree that I'd make similar decisions. Postel's Law is a big part of the reason Bleichenbacher attacks (adaptive chosen-ciphertext attacks)[1] stayed so common for so long. As an engineer responsible for security, I absolutely reject malformed inputs.
Postel's law doesn't mean "accept everything", but that you should accept de-facto rules people have created. If everyone says, "this is how we do it", you should ignore the RFC and just copy what others do.
One, if everyone is doing something different from the spec, it is hard to figure out what they are really doing and what they mean. If everyone follows the spec, you have long-term confidence things will continue to work, even when someone else writes their own version which might otherwise also deviate from the spec.
Two, it is easier to modify the spec as more features are dreamed up if you have confidence that the spec is boss, meaning someone else hasn't already used that field for something different (which you may not have heard about yet).
Three, if you agree to a spec you can audit it (think security); if nobody even knows what the spec is, that is much harder.
Following the spec is harder in the early days. You have to put more effort into the spec because you can't discover a problem and just patch it in code. However, the internet is far past those days. We need a spec that is the rule that everyone follows exactly.
The internet is ossified because middleboxes stick their noses where they shouldn't.
If they just route IP packets, we could have had nice things like SCTP...
Browsers are permissive not because it's technically superior but as a concession for the end user who still wants to be able to use a poorly built website, and they're competing with browsers who will bend over backwards to render that crappy website so that they look good and your browser looks bad.
It's not a concession you want to make unless you really have to.
Well, my point is that there's unique pressure for browsers to be permissive for practical reasons beyond Postel's law even if you were building a browser in 2025 and the whole internet reset to xhtml.
And that's because the end-user is at the mercy of, but not party to, an over the air interface between the producer and consumer that you can't verify ahead of time.
So if you're consuming a stream of supposed xhtml `<p>foo<p>bar</p>`, you have to decide if you want to screw the user for the producer's mistake for a single fuck up in the website's footer.
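For instance, with nothing but the Python standard library, a strict XML parser refuses that fragment outright while the lenient HTML parser just keeps going:

    import xml.etree.ElementTree as ET
    from html.parser import HTMLParser

    fragment = "<p>foo<p>bar</p>"

    try:
        ET.fromstring(fragment)              # XHTML-style strictness
    except ET.ParseError as err:
        print("strict parser refuses:", err)

    class Collector(HTMLParser):
        def handle_starttag(self, tag, attrs):
            print("lenient parser saw start of", tag)

    Collector().feed(fragment)               # happily reports two <p> start tags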
HTML is a nightmare that had to be reverse engineered, as in rebuilt with proper engineering standards in mind, several times. HTML and CSS are both quite horrible.
I would perhaps argue that juniper's behavior is the preferable one.
Remember, the definition of this is "drop the message I think is broken," not inherently "drop the broken message"; it's entirely plausible that the message is fine but you have a bug which makes you THINK it's a broken message.
There is also a huge difference between considering it a broken message and a broken session, which is what Arista did.
Arista did 2, but it also dropped the whole connection which was probably bad.
IMHO, just drop the broken attributes in the message and log them, and pass on the valid data if there's any left. If not, pretend you did not receive an UPDATE message from that particular peer.
Monitoring will catch the offending originator and people can deal with this without having to deal with any network instability.
In case you want to calibrate your sense of armchair-ness: you have completely missed the point that discarding an individual attribute can quite badly change the meaning of a route, and since we're talking about the DFZ here, such breakage can spread around the planet to literally every DFZ participant. The only safe thing you can do is to drop the entire route. Maybe there was a point to this being discussed at quite some length by very knowledgeable people, before 7606 became RFC ;)
(I haven't downvoted your comment, but I can see why others would — you're making very simple and definite statements about very complicated problems, and you don't seem to be aware of the complications involved. Hence: your calibration is a bit off.)
Funny enough, I actually have a few routers with a DFZ, so I have an idea or two about how BGP works.
My point is that:
- if you drop a connection, especially one through which you announce the full routing table, it is going to create a lot of churn for your downstreams. Depending on the kind of routers they use, it can create some network instability for quite a while. And if you drop it again when you receive that malformed route, the instability continues.
- removing only the malformed attribute maybe changes the way you treat traffic but you still route it. OK, you send it to maybe another interface, but no biggie
- if you’re using a DFZ setup, dropping that single route could blackhole traffic to that destination if you’re the only upstream to another router
> Funny enough, I actually have a few routers with a DFZ, so I have an idea or two about how BGP works.
And I'm TSC emeritus and >10 year maintainer on FRRouting, and active at IETF. Yet I hugely respect the other people there, all of whom have areas of expertise where they far outrank my own.
I have very strong opinions about some subjects, one of them being BGP.
I believe sessions should not be torn down just because you receive malformed data. You should be able to remove just the corrupt data, or treat it as a withdraw message like one of the RFCs recommends.
I for one would like knobs to match on any attribute and value and remove/rewrite them at will. Imagine something akin to a very smart HTTP proxy.
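Purely as a sketch of the kind of knob I mean (hypothetical, not any vendor's actual configuration or API):

    def apply_attribute_policy(attributes, rules):
        """attributes: list of (type_code, value) pairs from an UPDATE.
        rules: dict mapping type_code -> ("drop", None) or ("set", new_value)."""
        out = []
        for type_code, value in attributes:
            action, new_value = rules.get(type_code, ("keep", None))
            if action == "drop":
                continue                     # strip just this attribute
            if action == "set":
                value = new_value            # rewrite it in place
            out.append((type_code, value))
        return out

    # e.g. strip a hypothetical misbehaving attribute type 0xFE on ingress:
    # cleaned = apply_attribute_policy(attrs, {0xFE: ("drop", None)})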
If the attribute says "encapsulate this", dropping just the attribute will create a blackhole, as you will attract traffic that should be encapsulated, and packets following this route will be dropped if it is not.
Yes, but then again since you have logs of why it was dropped (like I suggested in my first post, to log everything dropped), you can easily troubleshoot the problem. A much better outcome than flapping a BGP session for no good reason and creating route churn and network instability.