Tales from the Machine Room
It's a quiet end-of-July Thursday and I'm thinking about my "compensation" days starting on Monday when, for some reason, probably Karma, a mail from our fantastic hosting provider shows up in my mailbox.
Subject: OS Upgrade
This is to inform you that, as announced, we will proceed with the OS update for the following machines:
We will begin immediately and you will be informed when it is completed.
After checking the list of names I got suspicious, so I did a quick scan of my inbox and found no trace of any mail from the hosting provider about OS updates or anything else. And those idiots keep using a semi-random naming scheme that is borderline hysterical, so I start digging through their "dynamic" documentation to try to figure out which machines they are going to hit.
And I discover, with some surprise, that the list includes 3 of our PRODUCTION servers. I can't even say "fuck" before my system monitoring... stops working, since it was hosted on one of the mentioned machines.
The problem is that another of those servers is the VPN gateway that allows us to "talk" to the production environment, and lo and behold, 30 seconds later half the office is calling me to say the connection is down, while the other half is coming over on foot to tell me the same.
After spending some time calming everyone down, explaining that the problem is only ours and the production system is still working so the customers can keep giving us money, I try to contact the hosting provider to figure out a) who the fuck authorized maintenance on a production system, b) how long it will take, and c) no, seriously, who the fuck gave you the OK to perform maintenance on a production system?
The answers I got were unsatisfactory, to say the least. In short, these people sent a notice e-mail to our "mother company" about a week ago, and with every "manager" on vacation, they proceeded with their updates without any confirmation, even though part of that stuff should have been approved by us.
Of course, again because of "Karma", all the developers decided that they, right fucking NOW, need to run tests and stuff. And since there is no connection to the test system, they can't do squat. Therefore, they begin to complain loudly that their productivity is zero (as if it's normally any better).
After about 3 hours, all spent repeating the very same things over and over to the very same people who came around every 10 minutes asking the very same questions, I get another mail from the hosting provider saying "maintenance completed, everything is OK". A quick check tells me that no, "everything" is not OK. Not by a long shot. The first mail goes out to the hosting provider:
No, "everything" IS NOT working. Specifically, our monitoring server and our VPN gateway don't seem to work at all.
There follow lots of details, like the IP addresses and the hostnames based on their stupid "documentation", and other things.
Of course, I get no answer. After 30 minutes I start ringing every possible number, and after a brief ping-pong between the first-line guy playing "dumb blonde" and some guy from the "technical team" who keeps asking the same questions I've already answered, they say they are checking it.
Another half hour later, my "ping" to the server begins to come back with an answer, and a login attempt tells me that the machine is now kind-of-alive. Another quick check tells me that things seem to be working, and it looks like the hosting penguins decided the machine needed a software upgrade. Even the stuff that was not supposed to be updated, because we need those specific versions. While I'm writing the next mail I get another one saying "the problem has been fixed", so I just add a "Fixed my ass" at the beginning and send it out as a reply to the last one.
5 minutes later, one of my users comes to tell me that he can't log in to our ERP software. That is worrying, because basically everything depends on that thing working.
I check and notice that the ERP returns a nice "FATAL: no pg_hba.conf entry for host 'ip.of.our.server', user 'openerp', database 'openerp'". That means trouble.
Yep, our production Postgres DB is catatonic. I immediately send yet another mail with the instructions: add ip.of.our.server to pg_hba.conf and run 'SELECT pg_reload_conf()'. I quickly get an answer:
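For anyone who hasn't had the pleasure: the fix I asked for amounts to one line in pg_hba.conf plus a reload. A sketch, keeping the story's placeholder address and assuming md5 auth (it could just as well be scram-sha-256 on a newer install):

```text
# pg_hba.conf -- hypothetical entry; "ip.of.our.server" is a placeholder
# TYPE  DATABASE  USER     ADDRESS               METHOD
host    openerp   openerp  ip.of.our.server/32   md5
```

After saving the file, running `SELECT pg_reload_conf();` from a superuser session picks up the change without restarting Postgres, which is the whole point of asking for a reload instead of a restart.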
We at $hostingprovider always strive to do our utmost for our customers; your message has been received and will be checked first thing in the morning by the first available support person.
...yes, because it's 5pm, obviously. Again I start with the phone calls and the cursing. Lots of both. When I run out of "emergency numbers" I start on all the managers' mobile numbers I can find. The problem is that it's "the end of July" and those assholes are on holiday, and of course they never give out the numbers of their replacements. If there even are replacements, that is.
At 19.35 I manage to grab one of the penguins and remote-drive him through running a few checks. It looks like our Postgres DB, which is in fact a cluster, has performed a spontaneous failover to the secondary node. Which is actually the first one, otherwise it would be too easy. Executive decision: change the configuration to allow connections, reload, and we'll keep it like that while you try to figure out what the fuck happened to the other node.
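The kind of check I talked him through can be approximated with two queries run from psql (the superuser session is an assumption; node names aren't the real ones). `pg_is_in_recovery()` answers false on the primary and true on a standby, and `pg_stat_replication` on the primary shows which standbys are still streaming:

```sql
-- Run on each node: 'f' means this node is the primary, 't' a standby.
SELECT pg_is_in_recovery();

-- On whichever node answers 'f': list the standbys still attached.
-- An empty result here is the "cluster that isn't a cluster" symptom.
SELECT client_addr, state, sync_state FROM pg_stat_replication;
```

These only tell you who thinks it's the boss; they don't tell you why the old primary died, which is exactly the part the hosting provider was supposed to figure out.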
At 21.30 I receive yet another mail saying the DB is OK. I check, and the ERP application still can't talk to the DB, so again it isn't fixed, and another round of phone calls commences.
At 22.00 I manage to get somebody who knows how to type into a console, and I pilot him through restarting the DB after changing the configuration.
At 22.30, the ERP comes back to life and it's now time to start manually running all the queues to process the orders.
At 23.15 I run the last script to send mails to the customers and I'm ready to go home. At this point I notice that our Kibana is still stuck at 14.20, because it's reading directly from the DB that is now probably comatose. Which means that our "cluster" ain't a cluster anymore.
And in all this, I have to note that during the whole day nobody, I mean NOFUCKINGBODY, ever picked up the phone or sent us a mail to notify us of the problems on our systems, problems that they are supposed to monitor and be informed about. Not our mother company's IT, not the hosting's technical support. Basically I had to tell them that the shit wasn't running, and how to fix it.
Comments are added when and, more importantly, if I have the time to review them, and after removing Spam, Crap, Phishing and the like. So don't hold your breath. And if your comment doesn't appear, it's probably because it wasn't worth it.
At our place too, they once decided to migrate all the virtual machines without telling us, and when it was all done (5 hours later) they forgot to reactivate the DB listener...
Have a good week, everyone!
By Ranzon - posted 03/09/2018 09:26
Law of holidays: in the two or three days before them, enough happens to make you yearn for them. Personal experience: every single time in my life that I start telling myself "Oh how nice, from Monday [day taken as an example] I'm on holiday", trouble arrives at full charge.
By Boso - posted 03/09/2018 10:26
Tell me that some asses rolled, that somebody got sodomized, that even those whose sex was known will never be the same again, that at least, I mean AT LEAST, you switched to another hosting provider... keep my hope for justice alive.
On a positive note: if nothing else, after $NetworkGestapo you eat clusters-that-aren't-clusters for breakfast.
By MrPan - posted 03/09/2018 11:23
Ah, Murphy!! ...what a charming scoundrel...
By Nik - posted 12/09/2018 12:29
This one belongs at least in the all-time Top 10.
If it slithers it zaps you, if it flutters it kills you.
This site is made by me with blood, sweat and gunpowder; if you want to republish or redistribute any part of it, please drop me (or the author of the article, if it's not me) a mail.
This site was composed with VIM; now it is composed with VIM and the (in)famous CMS FdT.
This site isn't optimized for viewing with any specific browser, nor
does it require special fonts or resolutions.
You're free to view it as you wish.