
GitHub Availability Report: April 2026 — 10 Incidents, Including a 30% Scraping Attack

by xingwangzhe
Tags: GitHub, Availability, Incident Report, SRE, Outage

On May 14, GitHub published its April 2026 availability report. It was a rough month: 10 incidents across code search, Copilot, Pages, Codespaces, Actions, and more. Here’s what happened and what GitHub is doing about it.


Incident Summary

| Date | Service | Duration | Impact |
|---|---|---|---|
| Apr 1 | Code Search | 8h 43m | 100% query failure, full re-index needed |
| Apr 1 | Audit Log | 4m | 4,297 API actors affected |
| Apr 9 | Copilot Agent | 4h 16m | ~84% of new sessions delayed, 54 min queues |
| Apr 13 | Pages | 39m | ~17.5M HTTP 500 errors (12.8% peak) |
| Apr 16 | Codespaces | 3h 22m | ~40% of VS Code starts failed |
| Apr 20 | Code Scanning / Projects | 15h 36m | New PRs not scanned; new issues missing from boards |
| Apr 22 | Copilot Chat | 3h 43m | Full unavailability, then regional recovery |
| Apr 23 | Multi-service | 1h 18m | Copilot, Webhooks, Git, Actions, Deployments; 5-7% of traffic |
| Apr 27 | Search (scraping attack) | 6h 15m | Up to 65% of searches timed out across Issues, PRs, more |
| Apr 27+ | Search continued | | See above, same incident |

The Big Ones

April 27: Distributed Scraping Attack Saturates Search

The most interesting incident was on April 27. Between 16:15 and 22:46 UTC, GitHub’s search services experienced severe degradation. The cause? A massive anonymous distributed scraping attack.

The attacker used more than 600,000 unique IP addresses, and all requests carried matching actor information. Standard per-IP rate limiting was ineffective because each IP stayed below the threshold. This traffic made up 30% of the day’s total search traffic, concentrated within a 4-hour window. The load balancer tier saturated, causing up to 65% of searches to time out across Issues, Pull Requests, Projects, Repositories, Actions, Package Registry, and Dependabot Alerts.
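To get a feel for why a 30% share of daily traffic can saturate a tier, here is a rough back-of-the-envelope calculation. It assumes legitimate search traffic is spread evenly over 24 hours and that all attack traffic lands inside the 4-hour window. Both are simplifications, not figures from the report, and the function name is made up for this sketch.

```python
def peak_load_multiplier(attack_share: float, window_hours: float,
                         day_hours: float = 24.0) -> float:
    """Total load during the attack window as a multiple of normal load.

    Assumes legitimate traffic is spread evenly across the day and that
    all attack traffic arrives inside the window (both simplifications).
    """
    if not 0 <= attack_share < 1:
        raise ValueError("attack_share must be in [0, 1)")
    if not 0 < window_hours <= day_hours:
        raise ValueError("window_hours must be in (0, day_hours]")
    # Attack volume expressed relative to the legitimate daily volume.
    attack_vs_legit = attack_share / (1.0 - attack_share)
    # Fraction of legitimate daily volume that falls inside the window.
    legit_in_window = window_hours / day_hours
    # Normal load (1.0) plus the attack load relative to normal load.
    return 1.0 + attack_vs_legit / legit_in_window


multiplier = peak_load_multiplier(attack_share=0.30, window_hours=4)
print(f"{multiplier:.2f}x normal load")  # prints "3.57x normal load"
```

Under these assumptions the search tier would see roughly three and a half times its usual load for four hours. Real traffic is not flat, so the true multiplier depends on when in the day the window fell.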

GitHub’s response: scale the load balancer tier, block the traffic, improve connection handling, and add new controls that can restrict anonymous traffic to protect registered users.
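The gap between per-IP limits and a control on anonymous traffic as a whole can be illustrated with a small simulation. This is only a sketch: the `admit` function, the limits, and the scaled-down traffic volume are all hypothetical and do not describe GitHub's actual controls.

```python
from collections import Counter


def admit(requests, per_ip_limit, anon_budget=None):
    """Count requests admitted in a single rate-limit window.

    requests: iterable of (ip, is_authenticated) tuples.
    anon_budget: optional cap on anonymous requests summed over all IPs.
    """
    per_ip = Counter()
    anon_admitted = 0
    admitted = 0
    for ip, is_authenticated in requests:
        if per_ip[ip] >= per_ip_limit:
            continue  # classic per-IP limit
        if not is_authenticated and anon_budget is not None:
            if anon_admitted >= anon_budget:
                continue  # aggregate anonymous budget is exhausted
            anon_admitted += 1
        per_ip[ip] += 1
        admitted += 1
    return admitted


# Scaled-down traffic: 60,000 anonymous IPs sending 5 requests each,
# followed by one signed-in user sending 3 requests.
attack = [(f"ip-{i}", False) for i in range(60_000) for _ in range(5)]
signed_in = [("user-ip", True)] * 3
traffic = attack + signed_in

per_ip_only = admit(traffic, per_ip_limit=10)
with_budget = admit(traffic, per_ip_limit=10, anon_budget=50_000)
print(per_ip_only)  # 300003: every IP stays under its own limit
print(with_budget)  # 50003: anonymous traffic is capped, the signed-in user still gets through
```

The per-IP limit alone never fires because no single address sends enough requests. A budget on anonymous traffic as a group caps the flood while leaving authenticated requests untouched.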

April 23: DNS Degradation Cascades Across Copilot, Webhooks, Git, Actions

A single-datacenter DNS infrastructure degradation cascaded into a multi-service incident affecting 5-7% of overall traffic. A recently introduced traffic-balancing mechanism caused DNS resolvers to begin failing under a specific load pattern. The impact spread across Copilot (~7% model request failures), Webhooks (elevated latency >3s), Git Operations (1.25% errors), Actions (workflow status delays ~8s), and Deployments (temporarily blocked).

The fix: restart DNS infrastructure. The takeaway: better DNS resilience, safer rollout procedures, and self-healing mechanisms for resolution failures.
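One common client-side way to soften resolver failures is to keep serving the last known-good answer when a fresh lookup fails. The sketch below illustrates that general idea only. The report does not say GitHub uses this approach, and the class, its parameters, and the hostnames are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class StaleTolerantResolver:
    """Cache DNS answers and fall back to a stale answer if resolution fails."""

    def __init__(self, resolve: Callable[[str], str],
                 ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._resolve = resolve
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[str, float]] = {}

    def lookup(self, host: str) -> str:
        now = self._clock()
        cached = self._cache.get(host)
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]  # fresh cache hit
        try:
            address = self._resolve(host)
        except OSError:
            if cached is not None:
                return cached[0]  # resolver failed: serve the stale answer
            raise  # nothing cached, so the failure must surface
        self._cache[host] = (address, now)
        return address


# Demo with a fake resolver that works once and then fails.
calls = {"n": 0}

def flaky_resolve(host: str) -> str:
    calls["n"] += 1
    if calls["n"] > 1:
        raise OSError("resolver unavailable")
    return "192.0.2.10"

now = [0.0]
resolver = StaleTolerantResolver(flaky_resolve, ttl_seconds=30.0, clock=lambda: now[0])
first = resolver.lookup("api.example.test")   # fresh answer from the resolver
now[0] = 60.0                                 # cache entry is now past its TTL
second = resolver.lookup("api.example.test")  # resolver fails, stale answer is served
print(first, second)
```

The trade-off is that a stale address may point at a host that has since moved, so a real deployment would usually bound how long stale answers are allowed to live.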

April 9: Rate Limit Bug Cripples Copilot Agent

A bug in Copilot’s rate limiting logic applied a single global rate limit instead of a separate limit per installation. A coincidental 3-4x traffic surge from a client update exhausted that shared limit faster. About 84% of new agent sessions were delayed, with queue times hitting 54 minutes (normal: 15-40 seconds). A second wave on the same day was caused by a caching bug that persisted the rate-limited state.

The fix: switch to per-installation credentials, disable the faulty caching, and improve monitoring.
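The difference between a global and a per-installation limit comes down to which key the counter uses. The following is a minimal sketch with a hypothetical `FixedWindowLimiter` class and made-up limits. It is not Copilot's actual implementation.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Count requests per key within one window and reject above the limit."""

    def __init__(self, limit, key_fn):
        self.limit = limit
        self.key_fn = key_fn  # maps an installation ID to a counter key
        self.counts = defaultdict(int)

    def allow(self, installation_id):
        key = self.key_fn(installation_id)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True


# Buggy scoping: every installation shares one counter.
global_limiter = FixedWindowLimiter(limit=100, key_fn=lambda _inst: "global")
# Intended scoping: each installation gets its own counter.
scoped_limiter = FixedWindowLimiter(limit=100, key_fn=lambda inst: inst)

# One noisy installation uses up 100 requests on each limiter.
for limiter in (global_limiter, scoped_limiter):
    for _ in range(100):
        limiter.allow("noisy-installation")

quiet_under_global = global_limiter.allow("quiet-installation")
quiet_under_scoped = scoped_limiter.allow("quiet-installation")
print(quiet_under_global)  # False: the shared counter is already exhausted
print(quiet_under_scoped)  # True: the quiet installation has its own budget
```

With the shared key, a surge from any subset of clients starves everyone else, which matches the failure mode described above.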


Notable Recurring Themes

  1. Cascading failures from shared infrastructure — The April 23 DNS incident is the textbook example. One degraded datacenter component → multiple services affected. GitHub is working on better isolation.

  2. Rate limiting scoping — Both the Copilot (Apr 9) and scraping (Apr 27) incidents involve rate limits being either too global or too easy to bypass.

  3. Automation causing harm — The April 1 code search outage was triggered by an automated infrastructure change applied too aggressively. The April 13 Pages outage was caused by an automated DNS tool deleting a necessary record.

  4. Detection gaps — Multiple incidents had detection delays of 40-53 minutes because monitoring didn’t classify the failure pattern as a risk (e.g., the scraping attack was only discovered while working on mitigation).


What GitHub Is Doing

The report lists specific follow-up actions for each incident. Common threads:

  • Stronger DNS resilience and multi-datacenter failover
  • More gradual rollouts with better health checks
  • Faster detection through improved monitoring and alerting
  • Better traffic isolation to prevent cascading impact
  • Rate limit hardening — both per-installation scoping and anonymous traffic controls
  • Fallback mechanisms for upstream service dependencies (Codespaces VS Code Server, Pages storage)
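The "gradual rollouts with better health checks" item can be sketched as a loop that widens a change in stages and stops at the first failed check. The function name, the stage fractions, and the callbacks are hypothetical placeholders and not GitHub's tooling.

```python
def staged_rollout(stages, apply_to, is_healthy, roll_back):
    """Apply a change stage by stage; roll back on the first failed health check.

    Returns (applied_stages, succeeded).
    """
    applied = []
    for fraction in stages:
        apply_to(fraction)
        applied.append(fraction)
        if not is_healthy():
            roll_back()
            return applied, False
    return applied, True


# Simulated fleet: the change starts causing errors once it reaches 25% of hosts.
state = {"fraction": 0.0}

def apply_to(fraction):
    state["fraction"] = fraction

def is_healthy():
    return state["fraction"] < 0.25

def roll_back():
    state["fraction"] = 0.0

applied, succeeded = staged_rollout([0.01, 0.05, 0.25, 1.0],
                                    apply_to, is_healthy, roll_back)
print(applied, succeeded)  # [0.01, 0.05, 0.25] False
print(state["fraction"])   # 0.0 after rollback
```

In this simulation the rollout never reaches the full fleet. In practice the health check would also need a soak period per stage, since some failures only appear under sustained load.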
