  1. #1
    Thailand Expat harrybarracuda's Avatar
    Join Date
    Sep 2009
    Posts
    108,146

    Cloudflare outage takes down X and ChatGPT


  2. #2
    RIP
    Topper's Avatar
    Join Date
    Dec 2007
    Location
    Bangkok
    Posts
    14,079
    And this site and EiA as well as others....

  3. #3
    Arahant
    Edmond's Avatar
    Join Date
    Apr 2020
    Location
    Nibbana
    Posts
    21,066
    The poor fookers that were just hitting the vinegar stroke on mendyandmaya.com

  4. #4
    Thailand Expat
    thailazer's Avatar
    Join Date
    Jul 2010
    Last Online
    Today @ 03:30 AM
    Posts
    3,514
    Took down TD and EIA here!

    [Attachment: screenshot, 2025-11-18]

  5. #5
    Thailand Expat harrybarracuda's Avatar
    Quote (originally posted by Topper):
    And this site and EiA as well as others....
    That's why I posted it.

    Wasn't just a TD problem.

    Imagine all those trumpanzees trying to post their Epstein Files fairy tales on Twatter.

  6. #6
    Member Bettyboo's Avatar
    Join Date
    Nov 2009
    Location
    Bangkok
    Posts
    37,961
    Quote (originally posted by harrybarracuda):
    Two of my least favourite things on the planet, I hope they die forever. Actually, add in Trump and Blair, and it'd be a brilliant day all round...

  7. #7
    Thailand Expat harrybarracuda's Avatar
    Cloudflare CEO Matthew Prince has admitted that the cause of its massive Tuesday outage was a change to database permissions, and that the company initially thought the symptoms of that adjustment indicated it was the target of a “hyper-scale DDoS attack,” before figuring out the real problem.

    Prince has penned a late Tuesday post that explains the incident was “triggered by a change to one of our database systems' permissions which caused the database to output multiple entries into a ‘feature file’ used by our Bot Management system.”
    The file describes malicious bot activity and Cloudflare distributes it so the software that runs its routing infrastructure is aware of emerging threats.

    Changing database permissions caused the size of the feature file to double and grow beyond the file size limit Cloudflare imposes on its software. When that code saw the illegally large feature file, it failed.
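
    For anyone wondering what "grow beyond the limit and fail" looks like in practice, here is a minimal Python sketch of the alternative: reject the oversized file and keep running on the previous one. This is not Cloudflare's code. The FeatureStore class, the MAX_FEATURES value and the feature names are all made up for illustration, and the limit is counted in entries rather than bytes to keep it short.

```python
from dataclasses import dataclass, field

# Hypothetical limit; the article only says that a size limit exists.
MAX_FEATURES = 200


@dataclass
class FeatureStore:
    """Holds the active feature list and refuses oversized updates."""

    active: list[str] = field(default_factory=list)

    def load(self, candidate: list[str]) -> bool:
        """Swap in the candidate if it fits the limit, else keep the old list."""
        if len(candidate) > MAX_FEATURES:
            return False  # reject and keep serving with the last accepted list
        self.active = list(candidate)
        return True


store = FeatureStore()
good = [f"feature_{i}" for i in range(120)]
bad = good + good  # duplicated entries double the size

accepted_good = store.load(good)
accepted_bad = store.load(bad)
print(accepted_good, accepted_bad, len(store.active))
```

    The point of the design is that a bad update costs you freshness, not availability: the store carries on with the last list it accepted.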


    And then it recovered – for a while – because when the incident started Cloudflare was updating permissions management on a ClickHouse database cluster it uses to generate a new version of the feature file. The permission change aimed to give users access to underlying data and metadata, but Cloudflare made mistakes in the query it used to retrieve data, so it returned extra info that more than doubled the size of the feature file.
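
    One way a permissions change can inflate a query result is sketched below in Python. It assumes the query reads column metadata without saying which database it means, so that a second, newly visible copy of the same table adds a duplicate row per column. The article does not give the actual query; the database, table and column names here are invented.

```python
from typing import Optional

Row = tuple[str, str, str]  # (database, table, column), all hypothetical

COLUMNS = ("score_a", "score_b", "score_c")
visible_before: list[Row] = [("main", "features", c) for c in COLUMNS]
newly_visible: list[Row] = [("underlying", "features", c) for c in COLUMNS]


def column_names(rows: list[Row], table: str,
                 database: Optional[str] = None) -> list[str]:
    """Return column names for a table, optionally restricted to one database."""
    return [
        col
        for db, tbl, col in rows
        if tbl == table and (database is None or db == database)
    ]


# Before the change only one copy of the metadata is visible.
before = column_names(visible_before, "features")
# After the change the same unfiltered query sees both copies.
after_unfiltered = column_names(visible_before + newly_visible, "features")
# Naming the database explicitly restores the original result.
after_filtered = column_names(
    visible_before + newly_visible, "features", database="main"
)
print(len(before), len(after_unfiltered), len(after_filtered))
```

    In this toy version the row count exactly doubles; the article says the real file more than doubled.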

    At the time of the incident, the cluster generated a new version of the file every five minutes.


    “Bad data was only generated if the query ran on a part of the cluster which had been updated. As a result, every five minutes there was a chance of either a good or a bad set of configuration files being generated and rapidly propagated across the network,” Prince wrote.


    For a couple of hours starting at around 11:20 UTC on Tuesday, Cloudflare’s services therefore experienced intermittent outages.

    “This fluctuation made it unclear what was happening as the entire system would recover and then fail again as sometimes good, sometimes bad configuration files were distributed to our network,” Prince wrote. “Initially, this led us to believe this might be caused by an attack. Eventually, every ClickHouse node was generating the bad configuration file and the fluctuation stabilized in the failing state.”
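
    The on-again, off-again behaviour is easier to see in a toy simulation. The Python sketch below assumes each five-minute run lands on a random node and that nodes pick up the permission change one at a time. The node count, cycles per step and seed are arbitrary, and the function names are hypothetical.

```python
import random


def generate_file(updated_nodes: int, total_nodes: int,
                  rng: random.Random) -> str:
    """Pick a random node to run the query; updated nodes yield a bad file."""
    node = rng.randrange(total_nodes)
    return "bad" if node < updated_nodes else "good"


def simulate(total_nodes: int = 10, cycles_per_step: int = 3,
             seed: int = 0) -> list[str]:
    """Roll the change out node by node, generating one file per cycle."""
    rng = random.Random(seed)
    history = []
    for updated in range(total_nodes + 1):
        for _ in range(cycles_per_step):
            history.append(generate_file(updated, total_nodes, rng))
    return history


history = simulate()
print(" ".join(history))
```

    Early cycles are all good, the middle is a mix, and once every node has the change the output settles on bad, which matches the "stabilized in the failing state" description.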

    That “stabilized failing state” happened a few minutes before 13:00 UTC, which was when the fun really started and Cloudflare customers started to experience persistent outages.


    Cloudflare eventually figured out the source of the problem and stopped generation and propagation of bad feature files, then manually inserted a known good file into the feature file distribution queue. The company then forced a restart of its core proxy so its systems would read only good files.


    That all took time, and caused downstream problems for other systems that depend on the proxy.

    Prince has apologized for the incident.

    “An outage like today is unacceptable,” he said. “We've architected our systems to be highly resilient to failure to ensure traffic will always continue to flow. When we've had outages in the past it's always led to us building new, more resilient systems.”


    This time around the company plans to do four things:


    • Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input
    • Enabling more global kill switches for features
    • Eliminating the ability for core dumps or other error reports to overwhelm system resources
    • Reviewing failure modes for error conditions across all core proxy modules
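
    As a rough idea of what a "global kill switch" from that list could mean, here is a small Python sketch. It assumes a single flag that lets the proxy skip one module entirely while that module is broken. The KILL_SWITCHES dict, the handle function and the config_broken field are hypothetical names for illustration, not anything from Cloudflare's post.

```python
# Hypothetical global flags, one per feature that can be switched off.
KILL_SWITCHES: dict[str, bool] = {"bot_management": False}


def bot_score(request: dict) -> int:
    """Stand-in for a module that fails when its configuration is bad."""
    if request.get("config_broken"):
        raise RuntimeError("feature file could not be loaded")
    return 30


def handle(request: dict) -> str:
    """Serve the request, skipping the bot module when its switch is on."""
    if KILL_SWITCHES["bot_management"]:
        return "served (bot scoring skipped)"
    try:
        score = bot_score(request)
    except RuntimeError:
        return "failed"
    return f"served (bot score {score})"


broken = {"config_broken": True}
before = handle(broken)
KILL_SWITCHES["bot_management"] = True  # operator flips the switch
after = handle(broken)
print(before, "->", after)
```

    With the switch off the broken module takes the request down with it; with the switch on, traffic flows without bot scoring until the module is fixed.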


    Prince ended his post with an apology “for the pain we caused the Internet today.”

