Thundering herds, noisy neighbours, retry storms.
I love the names that people have come up with over the years. Some of them describe observed patterns; as Lorin Hochstein so eloquently put it, “Operators give names to recurring patterns of system behavior that they observe” (tweet). Others describe techniques used to mitigate these observed patterns.
I don’t know what you’d call these names, and I haven’t been able to find a dictionary or list of them anywhere, so I’ve wanted to create a list for a while now.
Update (17th May 2021): Lex Neva suggested calling these “operational patterns” and I love it.
So here we go. I’ll add some now, based on the few I can remember and notes I have on my computer, and I can always come back and add more as I come across them.
I’d love your help growing this list. If you know of a name that is missing from the list, please send me a tweet with the name and a short description of it, and I’ll include it in the list with a link to your tweet.
Names
Thundering herd
I first came across this term from a colleague at Glitch, who used it to describe the situation where we had just recovered from an incident, only to have everything break once all our users tried to start their projects again. The surge of projects overwhelmed the system and everything broke. It was truly a thundering herd.
Other related names are: Dogpile, Cache-stampede [wikipedia]
Noisy neighbour
I don’t remember when I first came across this, but it has come up quite a lot both at Glitch and Gitpod.
Nosy neighbour
Okay, this isn’t a thing, but I really think it should be (tweet).
Retry storm
It’s a bit like the thundering herd, but specific to retries. Still a great name.
[Microsoft - Retry Storm antipattern]
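One common way to keep retries from synchronising into a storm is exponential backoff with random jitter. The sketch below is only an illustration of that idea, not something taken from the linked article; `call_with_retries` and its parameters are hypothetical names, and the default delays are arbitrary.

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base=0.1, cap=10.0,
                      sleep=time.sleep, rng=None):
    """Call operation(), retrying on failure with "full jitter" backoff.

    All names and defaults here are illustrative assumptions.
    """
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # Give up after the last allowed attempt and surface the error.
            if attempt == max_attempts - 1:
                raise
            # The ceiling doubles on every attempt, up to `cap` seconds.
            ceiling = min(cap, base * 2 ** attempt)
            # Pick a random delay below the ceiling so that many clients
            # failing at the same moment do not all retry at the same moment.
            sleep(rng.uniform(0, ceiling))
```

The randomness is the important part: without it, every client that failed together would come back together, which is the herd all over again.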
Banding
[tweet]
Dimensions of doom
This is specific to time series databases that don’t support high cardinality labels, but I still love it.
The number of time series quickly becomes overwhelming, and impossible to store for tools that aren’t designed to handle it, much less read it back quickly enough to help you figure out where issues lie.
I don’t know the source of the description above. I found it in a note on my computer, but I doubt I wrote it. If you know where it might be from, send me a tweet and I’ll link to it here.
This is from How Does Honeycomb Compare to Metrics, Log Management, and APM Tools?, thanks to Kevin Collas-Arundell for spotting the reference (tweet)!
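To get a feel for how quickly the number of time series grows, here is a rough back-of-the-envelope sketch. It assumes the worst case, where every combination of label values actually occurs; the label names and counts are made up for illustration.

```python
from math import prod


def series_count(label_cardinalities):
    """Upper bound on time series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(label_cardinalities.values())


# Hypothetical labels on a single request-count metric.
labels = {"endpoint": 50, "status_code": 10, "region": 5}
low_cardinality = series_count(labels)  # 50 * 10 * 5 = 2500

# Adding one high-cardinality label multiplies everything.
labels["user_id"] = 100_000
high_cardinality = series_count(labels)  # 2500 * 100000 = 250000000
```

One extra label takes the same metric from a few thousand series to a few hundred million, which is where the doom comes in.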
Load shedding
Purposefully rejecting some of the requests to your systems to keep them from falling over under load. Netflix wrote a great post about it here: Keeping Netflix Reliable Using Prioritized Load Shedding
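As a minimal sketch of the idea, assuming requests carry a priority and the service knows its current utilisation: less important traffic is rejected first as load climbs. The priority names and thresholds below are hypothetical and are not taken from the Netflix post.

```python
# Hypothetical thresholds: shed a priority class once utilisation
# (0.0 to 1.0) rises above its value.
SHED_ABOVE = {"critical": 0.95, "normal": 0.80, "batch": 0.60}


def should_shed(priority, utilization, shed_above=SHED_ABOVE):
    """Return True if a request of this priority should be rejected."""
    return utilization > shed_above[priority]


def admit(requests, utilization):
    """Keep only the requests that survive shedding at this utilisation."""
    return [r for r in requests if not should_shed(r["priority"], utilization)]
```

For example, at 85% utilisation this sketch drops `batch` and `normal` requests but still serves `critical` ones.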
Circuit Breaker
Haunted Graveyard
Submitted by Lorin Hochstein in this tweet
I like “haunted graveyards” (learned this one from @john_p_looney), about systems that people are afraid to change.
Another related name for this is Haunted Forest (see tweet from Jacob)
Flapping
Submitted by James Cheng in this tweet
“flapping”. When something repeatedly switches back and forth between “good” and “bad”. Imagine a health check for something that is healthy, then gets overwhelmed, then recovers, then again gets overwhelmed.
With follow-up additions by rat rancher (tweet)
I’ve heard “flapping” mainly with regard to flapping links on network devices:
And Lorin Hochstein in this tweet
I’ve heard of flapping alerts.
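A common way to damp flapping in a health check or alert is to require several consecutive results before changing state. Here is a minimal sketch of that, with hypothetical names and thresholds; none of the tweets above prescribe this.

```python
class DebouncedHealth:
    """Only flip state after a run of consecutive disagreeing checks."""

    def __init__(self, fail_after=3, recover_after=2):
        self.fail_after = fail_after        # failures needed to go unhealthy
        self.recover_after = recover_after  # successes needed to recover
        self.healthy = True
        self._streak = 0  # consecutive results that disagree with the state

    def observe(self, check_passed):
        """Record one raw check result and return the debounced state."""
        if check_passed == self.healthy:
            # The result agrees with the current state: reset the streak.
            self._streak = 0
        else:
            self._streak += 1
            needed = self.fail_after if self.healthy else self.recover_after
            if self._streak >= needed:
                self.healthy = not self.healthy
                self._streak = 0
        return self.healthy
```

A raw signal that alternates pass, fail, pass, fail never builds up a streak, so the reported state stays put instead of flapping. The trade-off is that real failures are reported a few checks later.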
Flaky
Jan Keromnes calls out the similarity to Flaky in this tweet; I’ve heard them used interchangeably and think of them as synonyms:
The definition of “flapping” makes me think of “flaky” (as in “flaky tests”; personally I’ve never heard “flapping” used that way)
Death spiral
Submitted by Lex Neva:
I’m talking about the pattern where the system reacts to a failure or degradation in certain ways that act to amplify the problem. An example is the db struggling under load, and the auto-scaler notices the degradation and starts adding front-ends, but the front-ends have to boot up by running a few intensive queries, exacerbating the load, so the auto-scaler adds more instances…
Verbalmatic
This was submitted by Paul de Lange, who writes: “I would like to contribute one that we use at Expedia. This was coined by Mike Peterson, and first appeared in an internal conversation on 18 September, 2020.”
When the answer should be automatic but it isn’t, and so we rely on talking to a bunch of people to come up with the answer. This goes against the tenets of SRE because it is manual toil to answer the question each time and the reliability of the answer depends on the specific person you ask.
Numbers from Lost
Submitted by Mark Ellens
Another is “numbers from Lost” wherein a human being has to perform a certain action (entering a series of numbers, applying a specific piece of config) on a regular basis, at a specific interval, otherwise dire consequence (island explosion, catastrophic system failure) will ensue. Like in Lost, you see.
turn it off and on again…and again
Submitted by Mark Ellens
The scenario is that you have a message queue and one of the consumers stops processing messages and so the queue backs up. A naive fix for your immediate issue is to, predictably, turn it off and on again. With a certain type of issue it will then start to look like it is processing normally, perhaps even long enough to make the SRE / person on call close the ticket, go back to bed, set off for the office or whatever. At which point the actual cause (perhaps a badly formed message) would come in, cause monitoring to back up, and trigger another alert. An optimistic / inexperienced SRE can waste a good few hours in this way.
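If the actual cause is a badly formed message, one widely used mitigation is to set it aside after a few failed attempts rather than let it block the consumer again. The sketch below assumes an in-memory queue of hashable messages and a hypothetical `handler`; real message brokers offer their own dead-letter mechanisms, which this does not try to reproduce.

```python
from collections import deque


def drain(queue, handler, max_attempts=3):
    """Process every message, parking ones that keep failing.

    Returns (processed, dead_letters). All names are illustrative.
    """
    attempts = {}       # message -> number of failed attempts so far
    processed = []
    dead_letters = []
    while queue:
        msg = queue.popleft()
        try:
            handler(msg)
            processed.append(msg)
        except Exception:
            attempts[msg] = attempts.get(msg, 0) + 1
            if attempts[msg] >= max_attempts:
                # Stop retrying: keep the message for a human to inspect.
                dead_letters.append(msg)
            else:
                # Requeue at the back so other messages keep flowing.
                queue.append(msg)
    return processed, dead_letters
```

With something like this in place, the poison message ends up in `dead_letters` with the evidence attached, instead of waking someone up a second time.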
Chesterton Fences
Submitted by motxilo in this tweet which links to this blog post:
The sun is shining and you’re walking through a wood when you happen upon a quiet, desolate road. As you walk along the road you encounter a rather peculiarly placed fence running from one side of the road to the other. “This fence has no use” you mutter under your breath as you remove it.
Removing the fence could very well be the correct thing to do, nonetheless you have fallen victim to the principle of Chesterton’s Fence. Before removing something we should first seek to understand the reasoning and rationale for why it was there in the first place.
I’ve definitely fallen victim to this. I can just see the PR “Removed superfluous config” and down went production.
Changelog
2021 May 14
Initial version of the list
2021 May 17
Added
2021 May 23
Added
- Numbers from Lost
- turn it off and on again…and again
- Chesterton Fences
- Flaky, which I believe is a synonym for Flapping