The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind. Instead, it was triggered by a change to one of our database systems’ permissions which caused the database to output multiple entries into a “feature file” used by our Bot Management system. That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.

The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.
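To make the failure mode concrete, here is a minimal sketch of a loader with a hard cap and preallocated storage. It is written in Rust purely for illustration. The 200 figure is the limit quoted further down in this thread; every name in it (`FeatureSet`, `LoadError`, `reload`, `make_names`) is hypothetical and not taken from the actual code. It shows one way a caller could reject an oversized file and keep the last known-good set instead of crashing.

```rust
// Illustrative sketch only: all names here are hypothetical.
// 200 is the limit quoted elsewhere in this thread.
const MAX_FEATURES: usize = 200;

#[derive(Debug, PartialEq)]
pub enum LoadError {
    TooManyFeatures { found: usize, limit: usize },
}

#[derive(Debug, Clone, PartialEq)]
pub struct FeatureSet {
    names: Vec<String>,
}

impl FeatureSet {
    /// Build a feature set, refusing anything above the preallocated capacity.
    pub fn load(names: &[String]) -> Result<FeatureSet, LoadError> {
        if names.len() > MAX_FEATURES {
            return Err(LoadError::TooManyFeatures {
                found: names.len(),
                limit: MAX_FEATURES,
            });
        }
        // Reserve the full capacity up front, mirroring the "preallocate" idea.
        let mut buf = Vec::with_capacity(MAX_FEATURES);
        buf.extend_from_slice(names);
        Ok(FeatureSet { names: buf })
    }

    pub fn len(&self) -> usize {
        self.names.len()
    }
}

/// Try the new file; on failure, log and keep the last known-good set.
pub fn reload(current: FeatureSet, new_names: &[String]) -> FeatureSet {
    match FeatureSet::load(new_names) {
        Ok(fresh) => fresh,
        Err(err) => {
            eprintln!("rejecting feature file: {:?}", err);
            current
        }
    }
}

/// Helper that generates n placeholder feature names.
pub fn make_names(n: usize) -> Vec<String> {
    (0..n).map(|i| format!("feature_{i}")).collect()
}
```

With this shape, `reload(good, &make_names(201))` returns the old set unchanged. Calling `FeatureSet::load(&make_names(201)).unwrap()` would panic instead, which is the kind of hard failure the write-up describes.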

  • dan@upvote.au
    6 hours ago

    Did you read the article? It wasn’t taken down by the number of bots, but by the number of columns:

    In this specific instance, the Bot Management system has a limit on the number of machine learning features that can be used at runtime. Currently that limit is set to 200, well above our current use of ~60 features. Again, the limit exists because for performance reasons we preallocate memory for the features.

    When the bad file with more than 200 features was propagated to our servers, this limit was hit — resulting in the system panicking.

    They had some code to get a list of the database columns in the schema, but it accidentally wasn’t filtering by database name. This worked fine initially because the database user only had access to one DB. When the user was granted access to another DB, the query started returning way more columns than the code expected.
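The missing filter can be shown with a small in-memory stand-in for a column catalog. This is a sketch under assumptions, not the real query: the database names (`db_a`, `db_b`), the table name (`features`), and the column names are all made up. It only illustrates why filtering by table name alone returns extra rows once a second database becomes visible.

```rust
// Hypothetical catalog row; database, table, and column names are invented.
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnRow {
    pub database: &'static str,
    pub table: &'static str,
    pub column: &'static str,
}

/// Rows visible to the database user once it can see two databases
/// that both contain a table with the same name and columns.
pub fn visible_catalog() -> Vec<ColumnRow> {
    let mut rows = Vec::new();
    for database in ["db_a", "db_b"] {
        for column in ["feature_x", "feature_y", "feature_z"] {
            rows.push(ColumnRow { database, table: "features", column });
        }
    }
    rows
}

/// Lookup that filters by table name only: every visible database contributes rows.
pub fn columns_by_table(rows: &[ColumnRow], table: &str) -> Vec<&'static str> {
    rows.iter()
        .filter(|r| r.table == table)
        .map(|r| r.column)
        .collect()
}

/// Lookup that also pins the database, so new grants do not change the result.
pub fn columns_by_database_and_table(
    rows: &[ColumnRow],
    database: &str,
    table: &str,
) -> Vec<&'static str> {
    rows.iter()
        .filter(|r| r.database == database && r.table == table)
        .map(|r| r.column)
        .collect()
}
```

Here `columns_by_table(&visible_catalog(), "features")` yields six entries (each column twice), while `columns_by_database_and_table(&visible_catalog(), "db_a", "features")` yields the expected three.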