Daily discussion thread 2025-02-27 • timdaub.eth
Some of you may have noticed the page didn't work over the last 10 hours. I am embarrassed to admit this, but that is what happened. In retrospect, it must have been that Cloudflare pushed an update that made our stale-while-revalidate Worker incompatible with how they implement their HTTP/2 protocol, or, in any case, the problem was somehow related to that setup. I have now downgraded to a very simple Cloudflare Worker script and the page seems to be operational again.

This is one of those very painful bugs. Rainer already reported it yesterday at 7, but when I checked the page I didn't see anything wrong with it, so when I went to sleep yesterday I still had the impression that the bug was isolated to Rainer's client. This morning I saw the issue on my devices too, so even before coffee I had to look into it, which is usually quite painful, because a bug on production is really the worst case. If we cannot ship the site to production, everything else doesn't matter.

So I started rolling back an entire day of git commits, confident that this would fix the issue, but the issue persisted. I reset all of the caches because I couldn't believe that rolling back an entire day of git history wouldn't fix it. It then dawned on me that it must have been something related to the production server. I checked whether the origin server was rendering the page normally, and it was. I also wasn't able to reproduce the issue locally; in any case, I probably would have caught it yesterday if it had happened locally. So it became this really painful process of disabling a bunch of stuff on Cloudflare, basically doing a binary search over all its configuration options. I ended up being able to bring the page partially back by enabling development mode, but I still saw some really weird bugs in the console that I had never seen before.
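For context, stale-while-revalidate is the caching pattern the Worker implemented: serve the cached response immediately, and once it is older than its freshness window, refresh it in the background so the next visitor gets a newer copy. A minimal in-memory sketch of the idea — names and TTLs are illustrative, not the actual Worker code:

```javascript
// Sketch of stale-while-revalidate logic (illustrative only, not the
// production Worker). Cached values are served immediately; entries older
// than maxAgeMs trigger a background refresh while the stale copy is served.
function makeSwrCache(fetcher, maxAgeMs) {
  const entries = new Map(); // key -> { value, fetchedAt }

  return async function get(key) {
    const hit = entries.get(key);

    if (hit) {
      if (Date.now() - hit.fetchedAt > maxAgeMs) {
        // Stale: serve the old value now, revalidate in the background.
        fetcher(key)
          .then((value) => entries.set(key, { value, fetchedAt: Date.now() }))
          .catch(() => {}); // on refresh failure, keep serving stale data
      }
      return hit.value;
    }

    // Cold cache: block on the origin exactly once.
    const value = await fetcher(key);
    entries.set(key, { value, fetchedAt: Date.now() });
    return value;
  };
}
```

In an actual Cloudflare Worker the same idea is usually expressed with `caches.default` and `ctx.waitUntil()` rather than an in-memory Map, since Worker isolates are short-lived.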
Anyways, at the end of it I worked quite closely with Claude and other AIs, and this helped tremendously to debug and to keep calm and rational in a situation like this. I absolutely fucking hate when production is down, but I hate it even more when the issue is on Cloudflare. Cloudflare is not a bad piece of software from what I can tell, but you cannot reach anyone at Cloudflare for support. The Business plan is 200 euros a month, and I'm not willing to pay that because it is essentially half the rent I pay here in Germany. So usually when something happens on Cloudflare, you are just on your own: you have no introspection into the system, and the system behaves kind of sluggishly, so when you enable or disable an option you usually don't know whether that translates into an immediate effect on the production website. It's really a black box, and that makes it extremely painful, because you are debugging it in the open, with everybody seeing how you are struggling and not serving the site. It's just a really miserable experience.

That said, Claude had the right intuition to replace the Cloudflare Worker with a simple script, which was a genius insight. The irony, or the absurdity, of the situation is that with these kinds of issues it's very hard to establish a definitive root cause. It seemed to have been the Worker, but do we really know that it was something in the Worker implementation that failed? Was this a temporary lapse in Cloudflare's backend? Did they permanently change something about their protocol implementations that triggered a bug in the Worker? Was it because we had added PostHog, and that was interfering with other scripts being inserted into the page? I have no idea. What I know is that removing the Worker fixed it for now, so we will have to carefully add it back and see if we can re-establish stale-while-revalidate caching.

Today signaling the growing European consciousness (on TikTok 🇨🇳 of course lol): https://www.tiktok.com/@choose.europe/photo/7473158325794819350

Update on the server situation: It took the entire day to figure out what it was. We had another downtime just a few hours ago. Anyways, I'm happy to report that I think we have resolved it. It was related to the disk being full on the origin. Yeah, I know. Why is express sending truncated HTML responses when the disk is full? I don't know either. Anyways, I was actually able to build everything else back. My guess for why removing the cf worker fixed it temporarily: maybe the timeout is stricter for the cf worker than for users' browsers (which are more tolerant). Anyways, I'll write some checks related to the disk being full so that this shit won't happen anymore.