Tales from the Machine Room




Fake Test, Real Disaster

Ah, computers and computerized systems. Everything uses computers nowadays, in case you haven't noticed yet. Are you driving to work? The traffic light that has kept you stuck for 15 minutes is controlled by a computer that is probably chatting with the other computers along your route, and all of them are scheming about how to fuck up your commute. Are you taking public transport? Good. Points and signals are also controlled by computers, and they are part of the conspiracy too.

Taxi? One word: Uber.

And when you finally get to the office and go to the coffee machine for a cuppa, that one too only works because there is a computer in it.

And since we want more of them, we are now putting computers into a lot more stuff: from the doorbell to the refrigerator, by way of the toaster and the cuckoo clock in the hall.

And what happens when they don't work? Chaos. Obviously.

Because computers, like it or not, are just machines that can break down. And when it ain't the hardware, it's the software that fucks up. Because everything is made by humans, who make mistakes, lots of them, and don't always have the time, the will or the possibility to fix them.

However, for some weird reason, nobody ever seems to consider the possibility that things can go wrong in their beautiful system. Everyone makes plans and projections assuming that everything will always work 100%, things will be completed on time and on budget, and there will be no problems or incidents. And we all know how that turns out.

So it is always a surprise when somebody starts talking about "Disaster Recovery" before the "disaster" part shows its ugly face.

The "Disaster Recovery" is, in short, to think what could go wrong (ok, worse than usual) and how to fix the things and bring everything back together in the shortest time and with the least damage. That is a fantastic idea, but the problem is that in the majority of the cases, all the possible way require money and time. Two things that the manglment doesn't even want to hear about.

It seems that starting from the 2000s, when IT became "important", a lot of people decided that computers can do everything in zero time and at zero cost.

I blame that Star Trek crap. And I still can't figure out why, after 1980, they keep making that shit and people keep watching it; what the heck is so interesting in that crap... oh wait...

...yeah, that.

Anyhow, enter $wethinktoeverything, a marketing company (or some such) that, probably during an alcohol-based after-meeting, decided they had to do a "disaster recovery test".

Now, these people had a system that, I don't really know how, they had decided was "bomb proof", with everything redundant. Except, obviously, the main database used by all their systems. And the "management server" used to handle the whole thing. And the load balancer. And... well, you know how it goes. No matter what, you always end up with some Single Point of Failure eventually.

When they informed us that they wanted to do a DR test, we pointed out that before the 'Test' you should have... you know... A PLAN! That is: what to do and in which order. They were stupefied that WE didn't have such a plan. We retorted that we were doing the hosting, but we had no idea what was going on inside their system.

After a number of meetings, it turned out that there were two main problems: 1. the database and 2. their fucking 'management' server.

The database was SQL Server, so there were two options: a cluster of two machines with similar characteristics, or frequent backups plus being ready to spin up a clone of the db and do a restore. The latter, however, meant accepting some data loss.
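(Side note for the curious: option two is basically "take backups often, restore the freshest one onto a fresh box, and eat whatever changed since". Here's a minimal sketch of the idea, in Python shelling out to sqlcmd; the hostnames, backup path and database name are invented for illustration, not theirs.)

```python
# Hedged sketch: periodic full backup on the primary, restore onto a standby clone.
# Assumes sqlcmd is installed; hosts, database and backup path are placeholders.
import subprocess

PRIMARY = "db-primary"                  # hypothetical production SQL Server
STANDBY = "db-clone"                    # hypothetical clone to restore onto
DB      = "MainDB"                      # hypothetical database name
BACKUP  = r"\\backupshare\MainDB.bak"   # hypothetical shared backup location

def run_tsql(server: str, tsql: str) -> None:
    """Run one T-SQL statement via sqlcmd, failing loudly on errors (-b)."""
    subprocess.run(["sqlcmd", "-S", server, "-b", "-Q", tsql], check=True)

def backup() -> None:
    # Run this regularly (cron / Task Scheduler): the data you can lose is
    # exactly "everything committed since this file was written".
    run_tsql(PRIMARY, f"BACKUP DATABASE [{DB}] TO DISK = N'{BACKUP}' WITH INIT")

def restore_on_clone() -> None:
    # The "accept some data loss" part of the plan, made explicit.
    run_tsql(STANDBY, f"RESTORE DATABASE [{DB}] FROM DISK = N'{BACKUP}' WITH REPLACE, RECOVERY")

if __name__ == "__main__":
    backup()
```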

The "management" server was a lot more complicated... For some reason they used a Linux server to perform "deployment" on the various servers that were Windows. No I don't know wtf. In any event, the whole thing was using Jenkins to download stuff from Git and the contact all the servers and perform operations using ssh and powershell scripts.

Their reasoning was: "we are talking about a DISASTER, which shouldn't be NORMAL, and the system is based on virtual machines, so it shouldn't take too long to spin up a new machine by cloning the old one and performing a restore of the db. And the management server is mostly standard software and everything is in Git."

All excited by the "brainstorming", they showed off their "plan", which was a gigantic list of complex operations that had to be performed in concert between us and them. At that point I pointed out that there was a little problem: they had no "test" environment, so we couldn't actually test the plan. If we wanted to test it, it had to be on production. And that meant turning the whole system off for several hours.

After gasping like fish in the desert for a while, they decided that the whole thing had to be done at a "time selected to minimize the disruption to our customers", that is, in the middle of the night. Because nobody ever gives a fuck about disrupting the sleep cycle of the sysadmins, of course.

And so the Great Day arrived. Or rather the Great Night. I wasn't supposed to be involved, but it so happened that I was woken up by my colleague (CL) at 5 AM (for some value of 'woken up', since I was already awake because my cats wanted breakfast at 4 AM). Anyhow, everybody was panicking because, after all the thinking and planning, we had never actually tested a database restore, and now it turned out it wasn't working very well.

Obviously, in all that thinking and planning for the disaster, everybody had forgotten that... there was no DISASTER yet, it was just a TEST. The db wasn't dead, it was just turned off. And it was a simple matter of turning it on again to have it back.

Once I pointed out that insignificant detail, everybody stood silent for a bit, probably thinking "why didn't we think of this before?", and then they started yelling all together.

UL1 - So we can simply turn it on again?
Me  - Well, it would be better to turn off the "new" one first; if they are using the same IP, I don't think having both of them up together would be a good thing.
UL2 - And what do we do with the servers?
Me  - I don't know, what do you want to do with the servers?
UL2 - Based on our plan they should have been "failed-back" to the other datacenter.
Me  - Did they?
UL2 - I don't know... how can we know that?
Me  - ...whose plan is this?
UL2 - Ours but...
CL  - So... can I turn the old database back on now? 'Cause the restore has been going for 3 hours and it's still running...
Me  - If you thought about the plan, you should've also thought about how to check whether the servers had 'failed back' or not.
UL1 - Yes but...
Me  - So you should have a way to check.
CL  - Hemmm.. may I say something?
UL2 - The failback should be checked by the management server. But that is unreachable until it gets rebuilt.
Me  - Then that should be rebuilt first. Right?
UL1 - Actually no, the first step should be the database.
Me  - Does the management server use the database in any shape or form?
CL  - ...I don't want to say anything weird, but...
UL2 - No, the management server doesn't use the database, but it needs to run a number of queries before the release is complete.
Me  - So management server first, database second.
UL1 - But in our plan that's the last step...
Me  - People! You came up with this plan. You discussed it and checked it. Not me. Now, if you haven't considered how to test each step and whether it worked, I can only suggest aborting the whole thing and going back to the drawing board. Before a real disaster comes along.
CL  - Right, that's what I wanted to say.
Me  - .. what?
CL  - I think something is wrong because I can't access the system anymore...
Me  - Which system? The one that is supposed to be on or the one that is supposed to be down?
CL  - Both of them.

And it was at this point that we realized that, with all the thinking about the 'test', nobody had checked, and one of the "clones" had been assigned the same IP address as the datacenter's gateway. And that is not a good thing to do.
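(The moral, in code form: before you give a clone an IP address, check whether something on the subnet is already answering on it. A trivial sketch, assuming you run it from a Linux box on the same network; the address below is a documentation one, not anyone's real gateway.)

```python
# Hedged sketch of the missing pre-flight check: refuse to assign an IP address
# that already answers on the network. 203.0.113.1 is a documentation address.
import subprocess
import sys

def ip_in_use(ip: str) -> bool:
    """True if something already answers a single ping at that address (Linux ping flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

if __name__ == "__main__":
    candidate = sys.argv[1] if len(sys.argv) > 1 else "203.0.113.1"
    if ip_in_use(candidate):
        sys.exit(f"{candidate} already answers on the network (maybe it's your gateway). Pick another one.")
    print(f"{candidate} looks free. Proceed, carefully.")
```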

Davide
23/07/2018 14:05


Comments are added when and, more importantly, if I have the time to review them, and after removing Spam, Crap, Phishing and the like. So don't hold your breath. And if your comment doesn't appear, it's probably because it wasn't worth it.

4 messages - this document does not accept new posts


By Messer Franz posted 13/08/2018 09:07

When you build an IT system, whether web-based or local, you should always follow 4 rules:

1) provide measures that guarantee power continuity and protection against surges, for example with a proper UPS setup

2) redundancy, because anything can break or stop working, from the connectivity down to a simple cable someone trips over without noticing they've unplugged it

3) decent software, which takes into account the possibility of malfunctions, full HDDs, full RAM, intermittent or absent connectivity, etc.

but the most important one is rule 4:

NEVER LET THE THING BE DONE BY MANAGERS, BY PEOPLE WHO USE C# EVEN TO CHECK WHETHER THERE'S TOILET PAPER IN THE BATHROOM, OR BY ANYONE WHO WOULD SELL YOU THE HARDWARE, in which case expect a defective 486 sold to you as a NASA server, at the corresponding price.

 

If you lock the managers in the bathroom with the coffee machine during the meeting (or every day of the year; at most we let them out on New Year's Eve, hoping they catch a firework in the face), it is scientifically proven that the system all by itself gains more stability, lower cost and exponentially greater efficiency...

-- Messer Franz


By Davide Busato posted 14/08/2018 10:09

Jeri Ryan, Seven of Nine to her friends. Filed under "hotness always pulls in an audience" :P

-- Davide Busato


By Anonymous coward posted 14/08/2018 10:53

Hai presente i "jump scare", quando nei film horror arriva il mostro all'improvviso e tutti urlano? Ecco, nemmeno con Sadako ho urlato così tanto come quando ho letto la frase "...era stato assegnato lo stesso ip del gateway del datacenter..."!!!

Questa diventa la mia storia preferita da raccontare intorno al fuoco, con la torcia puntata sotto al mento. Altro che "who was phone?"

Il gateway. Sto ancora tremando.

 

-- Anonymous coward


By Coso de Cosis posted 16/08/2018 15:02

Procedure for planning a "disaster recovery":

1) Slap in front of every mangler who likes women a picture of Jeri Ryan (or some other megagalactic hottie), possibly naked - and likewise slap in front of every mangler who likes men a picture of some nice hunk - so as to distract them and keep them from planning canis domesticus-member style (i.e. half-assed).

2) Send all the CLs to get a coffee at a bar 10 km away, telling them it's on the house.

3) Proceed with the actual planning.

-- Coso de Cosis


