中文版：2023/1/21博客受攻击宕机事件分析与复盘 – Frank’s Weblog
On 1/21/2023, my blog was attacked and went down for 4 hours. This article will cover what the incident was like, the root cause analysis and improvements.
On that day, I woke up in the noon and saw the alert email from UptimeRobot. Sometimes a network or server glitch can trigger an alert as well, but it have been 2 hours since alert triggered, so apparently that’s not the case. I found I was not able to connect to the website, while sometimes I could connect but got 504.
I ssh-ed to the server and restarted all the Docker containers, but the problem persists.
top showed that all the load average were 6.xx and most of the CPU usage were from php-fpm. I checked the graphs in nginx amplify and found that nginx have received large amount of requests during past few hours.
I planned to go grocery shopping for the lunar new year dinner with my girlfriend, so I didn’t want to spend too much time on this. I simply turned on the Cloudflare reverse proxy(orange cloud icon) and “Under attack” mode and left home.
After a while I received the alert clear email from UptimeRobot and website was back online.
Over last few years I’ve implemented a set of monitoring and security measures for my site and automated scripts to mitigate common issues.
- UptimeRobot for monitoring downtime. I’ll receive alerts if the website cannot be reached or returned HTTP status that indicates a malfunctioning(5xx).
- nginx amplify for monitoring nginx and OS metrics. I’ll receive alerts if some metrics(eg. disk usage, requests per second) goes over the threshold.
- If requests per second goes over the threshold, it will automatically turn on Cloudflare proxy and increase security level.
- WordPress security plugin automatically blocks malicious requests.
Benefit from these measures, my site have maintained a uptime of nearly 100%. Being a blog that only have 2 digits of visitors everyday, 4 hour downtime is nothing to worry about. But my professtional habit have been wondering what happened behind the incident, especially why these measures failed to prevent the incident from happening.