- cross-posted to:
- [email protected]
The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind. Instead, it was triggered by a change to one of our database systems’ permissions which caused the database to output multiple entries into a “feature file” used by our Bot Management system. That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.
The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.
Meaning an internal error, like the other two prior ones.
Almost like one big provider with 99.9999% availability is worse than 10 with maybe 99.9%
So I work in the IT department of a pretty large company. One of the things that we do on a regular basis is staged updates, so we’ll get a small number of computers and we’ll update the software on them to the latest version or whatever. Then we leave it for about a week, and if the world doesn’t end we update the software onto the next group and then the next and then the next until everything is upgraded. We don’t just slap it onto production infrastructure and then go to the pub.
But apparently our standards are slightly higher than those of an international organisation whose whole purpose is cyber security.
Their motivation is that the file has to change rapidly to respond to threats. If a new botnet pops up and starts generating a lot of malicious traffic, they can’t just let it run for a week.
There are technical solutions to this. You update half your servers, and if they die you just disconnect them from the network while you fix them, and have your unaffected servers take up the load. Now yes, this doesn’t get a fix out quickly, but if your update kills your entire system, you’re not going to get the fix out quickly anyway.
How about an hour? 10 minutes? Would have prevented this. I very much doubt that their service is so unstable and flimsy that they need to respond to stuff on such short notice. It would be worthless to their customers if that were true.
Restarting and running some automated tests on a server should not take more than 5 minutes.
5 minutes of uninterrupted DDoS traffic from a bot farm would be pretty bad.
My assumption is that the pattern you describe is possible/doable at certain scales and with certain combinations of technologies. But doing this across a distributed system with as many nodes, and as many different kinds of nodes, as Cloudflare has, while still keeping a system that can be updated quickly (to respond to DDoS attacks, for example), is a lot harder.
If you really feel like you have a better solution please contact them and consult for them, the internet would thank you for it.
They know this, it’s not like any of this is a revelation. But the company has been lazy and would rather just test in production because that’s cheaper and most of the time perfectly fine.

Wasn’t it crowdstrike? Close enough though
The crowd was in the cloud.
What are the chances they started using AI to automate some of this and that’s the real reason. It sounds like no human was involved in breaking this.
When are people going to realise that routing a huge chunk of the internet through one private company is a bad idea? The entire point of the internet is that it’s a decentralized network of networks.
I hate it but there really isn’t much in the way of an alternative. Which is why they’re dominant, they’re the only game in town
How come?
You can route traffic without Cloudflare.
You can use CDNs other than Cloudflare’s.
You can use tunneling from other providers.
There are providers of DDOS protection and CAPTCHA other than Cloudflare.
Sure, Cloudflare is probably the closest to a single, integrated solution for the full web delivery stack. It’s also not prohibitively expensive, depending on who needs what.
So the true explanation, as always, is laziness.
there really isn’t much in the way of an alternative
Bunny.net covers some of the use cases, like DNS and CDN. I think they just rolled out a WAF too.
There’s also the “traditional” providers like AWS, Akamai, etc.
I guess one of the appeals of Cloudflare is that it’s one provider for everything, rather than having to use a few different providers?
A permissions change in one database can bring down half the Internet now.
tbf IAM is the bastard child of many cloud providers.
It exists to provide CISOs and BROs a level of security that no one person has access to their infrastructure. So if a company decides that system A should no longer have access to system B, they can do that quickly.
IAM is so complex now that it’s a field all in itself.
certainly brought my audiobookshelf to its knees when i decided that that lxc was gonna go ahead and be the jellyfin server also
Somewhere, that dev who was told that having clustered databases in nonprod was too expensive and not needed is now updating the deploy scripts.
Before today, ClickHouse users would only see the tables in the default database when querying table metadata from ClickHouse system tables such as system.tables or system.columns.
Since users already have implicit access to underlying tables in r0, we made a change at 11:05 to make this access explicit, so that users can see the metadata of these tables as well.
I’m no expert, but this feels like something you’d need to ponder very carefully before deploying. You’re basically changing the result of all queries to your db. I don’t work there, but I’m sure in plenty of places in the codebase there’s a bunch of “query this and pick column 5 from the result”.
really reminds me of the self owned crowdstrike bullshit
Mortem
Wishful thinking :)
a “feature file” used by our Bot Management system. That feature file, in turn, doubled in size.
Isn’t cloudflare also offering bot prevention as a service?
Imagine if the number of bots suddenly increases by 2…
And already they are on their knees?
Muuuhahahahaaa…
Did you read the article? It wasn’t taken down by the number of bots, but by the number of columns:
In this specific instance, the Bot Management system has a limit on the number of machine learning features that can be used at runtime. Currently that limit is set to 200, well above our current use of ~60 features. Again, the limit exists because for performance reasons we preallocate memory for the features.
When the bad file with more than 200 features was propagated to our servers, this limit was hit — resulting in the system panicking.
They had some code to get a list of the database columns in the schema, but it accidentally wasn’t filtering by database name. This worked fine initially because the database user only had access to one DB. When the user was granted access to another DB, it started seeing way more columns than it expected.
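As a toy illustration of that failure mode (a sketch, not their actual code: the feature names are invented, while “default” and “r0” are the database names mentioned in the quoted post-mortem):

```rust
fn main() {
    // (database, column) metadata rows as seen by a user who can now read
    // both databases; the same tables exist in each.
    let rows = [
        ("default", "feature_a"), ("default", "feature_b"),
        ("r0",      "feature_a"), ("r0",      "feature_b"),
    ];

    // Buggy version: no database filter, so every column shows up twice.
    let unfiltered: Vec<_> = rows.iter().map(|(_, col)| col).collect();

    // Fixed version: scope the lookup to the database you actually meant.
    let filtered: Vec<_> = rows
        .iter()
        .filter(|(db, _)| *db == "default")
        .map(|(_, col)| col)
        .collect();

    // Prints "unfiltered: 4 columns, filtered: 2 columns".
    println!("unfiltered: {} columns, filtered: {} columns",
             unfiltered.len(), filtered.len());
}
```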
Classic example of how dangerous rust is.
If they had just used Python and ran the whole thing in a try block with bare except this would have never been an issue.
As a next step they should have wrapped everything in a while(true) loop so it automatically restarts and the program never dies.
I hope you’re joking. If anything, Rust makes error handling easier by returning errors as values using the Result monad. As someone else pointed out, they literally used unwrap in their code, which basically means “panic if this ever returns an error”. You don’t do this unless it’s impossible to handle the error inside the program, or if panicking is the behavior you want due to e.g. security reasons.

Even as an absolute amateur, whenever I post any Rust to the public, the first thing I do is get rid of unwrap as much as possible, unless I intentionally want the application to crash. Even then, I use expect instead of unwrap to have some logging. This is definitely the work of some underpaid intern.

Also, Python is sloooowwww.
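To make that concrete for non-Rust readers, here’s a minimal sketch of the three options (the file name is made up, and this is obviously not Cloudflare’s code):

```rust
use std::fs;

// Panics with a generic message if the file is missing or unreadable.
fn load_unwrap() -> String {
    fs::read_to_string("features.conf").unwrap()
}

// Still panics, but at least says what failed and why it was treated as fatal.
fn load_expect() -> String {
    fs::read_to_string("features.conf")
        .expect("features.conf missing or unreadable; refusing to start")
}

// Handles the error as a value: log it and fall back to a known-good default
// instead of taking the whole process down.
fn load_handled() -> String {
    match fs::read_to_string("features.conf") {
        Ok(contents) => contents,
        Err(err) => {
            eprintln!("failed to read features.conf: {err}; using built-in default");
            String::new()
        }
    }
}

fn main() {
    let _ = load_handled();
    // The other two would panic here if features.conf doesn't exist.
    let _ = (load_unwrap, load_expect);
}
```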
So you think there is no error handling possible in Rust?
Wait until you find out that Python doesn’t write the error handling by itself either…
Yeah, the Python equivalent would be something like this.
import sys

try:
    config = get_config()
except Exception:
    sys.exit(1)

It’s possible to handle these things, but if you explicitly don’t then you’ll discover them at runtime.
This can happen regardless of language.
The actual issue is that they should be canarying changes. Push them to a small percentage of servers, and ensure nothing bad happens before pushing them more broadly. At my workplace, config changes are automatically tested on one server, then an entire rack, then an entire cluster, before fully rolling out. The rollout process watches the core logs for things like elevated HTTP 5xx errors.
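Something like this toy loop (the stage names, the threshold, and the error-rate check are all placeholders, not any real rollout tooling):

```rust
// Stand-in for "watch the logs for elevated HTTP 5xx errors" after deploying
// the change to one stage. Always returns a healthy rate in this sketch.
fn error_rate_after_deploy(stage: &str) -> f64 {
    println!("deploying to {stage}, watching 5xx rate...");
    0.001
}

fn main() {
    let stages = ["one server", "one rack", "one cluster", "everything"];
    let max_5xx_rate = 0.01;

    for stage in stages {
        if error_rate_after_deploy(stage) > max_5xx_rate {
            eprintln!("elevated errors at '{stage}', rolling back and stopping here");
            return; // the bad change never reaches the rest of the fleet
        }
    }
    println!("rollout complete");
}
```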
honestly this was a coding cock-up. there’s a code snippet in the article that unwraps on a Result, which you don’t do unless you’re fine with that part of the code crashing.

i think they are turning linters back to max and rooting through all their rust code as we speak
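roughly the alternative people mean, sketched out: enforce the preallocated limit by returning an error and keeping the last good file instead of unwrapping. the 200 limit is from the quote above; everything else here is made up:

```rust
const MAX_FEATURES: usize = 200;

// Refuse feature files that exceed the preallocated limit instead of panicking.
fn parse_feature_file(lines: &[String]) -> Result<Vec<String>, String> {
    if lines.len() > MAX_FEATURES {
        return Err(format!(
            "feature file has {} entries, limit is {MAX_FEATURES}",
            lines.len()
        ));
    }
    Ok(lines.to_vec())
}

fn main() {
    // A bad file with more entries than the limit, like the doubled one.
    let bad_file: Vec<String> = (0..260).map(|i| format!("feature_{i}")).collect();

    match parse_feature_file(&bad_file) {
        Ok(features) => println!("loaded {} features", features.len()),
        // This is where .unwrap() would panic; log instead and keep serving
        // with the previous known-good feature file.
        Err(reason) => eprintln!("{reason}; keeping the previous feature file"),
    }
}
```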










