Tales from the Machine Room
Being "ready" is a big part of most activities. Ready for what, you're probably asking? Well, it mostly depends on what you're doing, and what you're supposed to be doing. Are you ready for your day's work? Maybe... Are you ready for a 42 km marathon? Me... no. But hopefully nobody will ask me to run one on such short notice.
Obviously, if you know beforehand that you have to do, or deliver, something, things are a lot simpler. Given the necessary time, almost everything and everybody should be able to "get ready". Unless you're a moron, or you ask somebody else who happens to be one.
However, sometimes you have no choice. You have a task and that task involves a moron... and with this intro out of the way, let's begin today's tale.
We had a customer whose resource requirements were growing faster than the disk and RAM in his system could accommodate. It was a physical system, not a virtual one, so when a disk is full or the RAM is always at 100% use, the only thing to do is buy a new disk or more RAM, go to the datacenter, turn everything off, pop open the case, add the new part, then turn it on again and hope that everything works. All this in the hope that you CAN fit the new part in with the existing ones and don't run out of disk bays or RAM slots, or run into some weird incompatibility. Or you can just leave it as it is and try to rearrange things to squeeze some more lifetime out of the existing hardware. But sooner or later you run out of options; it is only a matter of time.
So, one end-of-the-summer day, after seeing that 'disk partition 98% full' for a bit too long in the system monitor, we decided that it was time to get Customer and drag him into the Virtual era. It was time to get our Marketing man and unleash him onto the prey, with a nice enticing offer for a fully virtual environment. This took several days (or weeks) of negotiation, because the customer was also sort of a "careful" one (read: didn't want to pay anything more) and was more concerned about how we planned to "move" from the old system to the new one. Of course our "marketing man" had no fucking clue how to do so and simply said that "we'll take care of that with ZERO downtime"... Now I'd really, really have liked it if, before busting out with such bullshit, he had taken 30 seconds to ask somebody with a grain of sense in his brain... Anyhoo, he got the paper signed, so we had an order for a brand-new VPS for the customer, including migration from the old system to the new one. We got informed of the fact, as usual, during one of the weekly meetings by our boss, in the following way (kinda):
DB - ...and $Customer is going to get a new VPS so we shouldn't have problems with disk space anymore, any takers for the installation? (looks around)
Me - Installation of what? Where is the installation sheet?
MarketingMan - I have to put it into the wiki...
Me - Good, when it is in the system we'll install it.
DB - So it's on you then?
Me - No, when it is in the system it will be installed by whoever is supposed to be installing stuff that week, which is not supposed to be me by default. That's why you had the "installation" in the planning, right?
DB - Yes but this is kinda urgent...
Me - If it is urgent, why isn't the installation manual already in the system? We had that piece of junk blinking red for weeks on the monitor.
MarketingMan - The customer gave the OK only this morning.
Me - Then it is definitely not going to be installed this week.
The "discussion" went on a bit longer with nothing substantial decided, of course. During the course of the week we (at least, ME) had other things to think about, and apparently nobody did anything until the next week, when $Colleague (CL) ended up pulling the short straw and had the pleasure of installing Customer's new server. I was busy with other things and really had no interest in being involved in that stuff, so I'm not really sure what was in the install page or what kind of 'special' instructions he got, if he got any. A couple more weeks passed, then one day I picked up a ringing phone (second mistake; the first one was getting out of bed that morning):
Me - ShittyHostingProvider, what can I do for you?
Customer - Hi, it's Customer, we'd like to know when you plan to go into production with our new server, since we're paying for it and it's doing nothing. Last time we spoke with your MarketingDude a "quick in production" move was agreed, and now it's been about a month.
Me - Oh... Hummm... I have to see what the status of that is, I'll call you back...
Obviously MarketingDude ain't in the office that day, so I try to figure out what was done, by whom, when and how (not the why, 'cause I already know that). As far as I understand, the server has been installed but I can't find any information about a "move to production". I grab $Colleague (CL) and ask him directly.
Me - How did you plan to go into production?
CL - Me?
Me - Yes, you. You installed this pile of crap, right? So you should also have made a plan to move the old pile of crap onto the new one and go into production with it. How did you do it?
CL - Well... but... Hemmm... It's just standard stuff, it's basically done...
Me - "Basically" means?
CL - Basically.
I consider, briefly, if punching him repeatedly in the face could improve his brain functions, but then turn that down because it would probably hurt me more than him.
Me - So what has to be done to put this crap into production?
CL - Hu..... it should be easy....
Me - "should be" ?
It seems the only way to do this is the hard way, so I grab the "installation page" and start checking things out. Standard server with MySQL and Tomcat. First problem: there is no Java installed on this thing... and Tomcat without Java doesn't work very well... Grab Java and install it. Tomcat has been copied onto the system but never configured or even started, of course, so grab the default configuration, then get the configuration from the old server and compare them to see what's different... This proves to be quite a problem. Apparently we don't have the root password for the old system and my account can't sudo. It's time to escalate the problem.
DB - What do you mean we don't have the root password?
Me - It's not in the standard password list.
DB - It should be.
Me - Yeah, well, it ain't.
DB - Can't you ask customer?
Me - Sure I can. Are we going to fess up that we lost the password, so we can't do what we are supposed to do? That is, maintain his server?
DB - We should be able to sudo...
Me - Again with the 'should'...
Contacting the customer with a request for the password returns nothing (password? What password?), so I pull MarketingDude into the fray and tell him that if we want to move forward we need to recover that password. That involves a reboot into single-user mode with a rescue disk, which means a trip to the datacenter. Of course Customer wants zero downtime, so it means doing it either during the 'normal' upgrade (that we can't perform anyway 'cause we lack root access, and anyway that system is EOL) or in the wee hours of the night. That means triple cost.
After a painful back and forth, we decide to do it early in the morning. I ride my motorcycle to the datacenter, put my mittens on the server and a few reboots later I've got the root password changed. In the meantime, I also realize that this thing is old... I mean OLD, really, really OLD... not only is the software EOL, even the hardware is out of support... this thing is a disaster waiting to happen.
With the root password I can finally proceed with copying the configuration from one system to the other and fix some more stuff... then I realize another problem: there should (again with the 'should') be a MySQL server installed on the new system... there is only the client installed.
Of course at this point CL is on holiday and can't be punched anymore. I report to DB that the "install page" is a pile of stinking crap, that half of the stuff that was supposed to be installed ain't, and that when there is a checkbox for "tested and approved" the guy who checks it should TEST before he APPROVEs. But this is matter for another tale, I suppose. It's basically time to contact the customer to organize "the move". My plan: copy the database from the old system to the new one and reconfigure the application to use the database on the new system; configure the old database to be a 'slave' of the new one, so it is always up to date and we can fall back on the old system if needed; then, the day we decide to do the switcheroo, add an RDR in the firewall and change the DNS configuration, and that's it, the new system is in production. If something goes bad, RDR back to the old system and we are back where we started.
On paper, it should work... Until I realize that the two systems are in two completely different subnets...
After a number of suitably bad words, I begin the process of migrating the new server from its subnet to the old one's. Luckily this doesn't involve more than a few changes in the firewalls and a copy of files from one datastore to the other.
At this point our MarketingDude informs me that the Customer has decided to go into production the day after... at 4 AM!
I point out that we haven't even tested the application yet since, thanks to the "stellar" work of CL, I've been busy re-doing 99% of the installation, so I do not take any responsibility if the whole thing explodes when I click on the "launch" button. Of course there is a long outcry about "customer satisfaction", to which I answer by asking what is better: a carefully planned and executed transition from the old system to the new one, or a quick-assed move that potentially fucks up everything? Sadly, I got no answer to that question.
4 AM rolls around, I get out of bed (helped by the cats that are asking for food), grab a cuppa, power on the PC and start messing around.
And then comes the practical realization, the thing that, in the mad rush to get everything finished on top of doing the 'rest' of the work, I completely forgot to check: we do not maintain the DNS for that domain... So I can add the RDR, but I'm sure something will go wrong somewhere down the line.
It is around 3 PM when the "problem" shows up: Customer reports that something is wrong with the application. Since we don't see anything in the logs, I try to get some more information from the customer himself, but beyond "I get an error telling me to contact the sysadmin" there is no useful information. In a burst of inspiration, I check the firewall rules for the old machine, which were... scarce. What I notice is that the old system had a blanket "pass all" rule for outgoing connections, while the new one doesn't. A check of the firewall logs reveals thousands of blocked outgoing connections directed to some server in the UK.
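The "are these two boxes even in the same subnet?" question is the kind of thing worth checking before making the plan rather than after. A minimal sketch with Python's standard `ipaddress` module follows; the addresses and the /24 prefix are placeholders from the documentation ranges, not the real ones.

```python
import ipaddress


def same_subnet(ip_a: str, ip_b: str, prefix_len: int) -> bool:
    """Return True if both addresses fall in the same network of the given prefix length."""
    # strict=False accepts a host address and returns the network that contains it
    network = ipaddress.ip_network(f"{ip_a}/{prefix_len}", strict=False)
    return ipaddress.ip_address(ip_b) in network


# Placeholder addresses (assumptions, not the customer's real systems)
OLD_SERVER = "192.0.2.10"
NEW_SERVER = "198.51.100.20"

if __name__ == "__main__":
    if same_subnet(OLD_SERVER, NEW_SERVER, 24):
        print("same subnet, carry on")
    else:
        print("different subnets, rethink the plan")
```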
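Digging through thousands of blocked connections by eye is no fun, so here is a small sketch of how such a log could be tallied per destination. The log line format and the `tally_blocked` function are assumptions made up for illustration; a real firewall writes its own format and the pattern has to be adapted to it.

```python
import re
from collections import Counter

# Hypothetical log format: "block out proto=tcp src=... dst=... dport=..."
BLOCKED_OUT = re.compile(r"\bblock out\b.*\bdst=(?P<dst>\S+)\s+dport=(?P<port>\d+)")


def tally_blocked(lines):
    """Count blocked outgoing connections per (destination address, port)."""
    counts = Counter()
    for line in lines:
        match = BLOCKED_OUT.search(line)
        if match:
            counts[(match.group("dst"), int(match.group("port")))] += 1
    return counts


# Made-up sample lines using documentation addresses
SAMPLE = [
    "block out proto=tcp src=10.0.0.5 dst=203.0.113.7 dport=8080",
    "pass in proto=tcp src=198.51.100.9 dst=10.0.0.5 dport=443",
    "block out proto=tcp src=10.0.0.5 dst=203.0.113.7 dport=8080",
    "block out proto=udp src=10.0.0.5 dst=203.0.113.53 dport=53",
]

if __name__ == "__main__":
    for (dst, port), hits in tally_blocked(SAMPLE).most_common():
        print(f"{hits:5d}  {dst}:{port}")
```

One destination that towers over all the others is usually the thing the application cannot live without.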
Me - (on the phone with Customer) ...and I see that your system is trying to connect to that server on port 8080, is this some sort of webservice your application uses or... ?
Customer - I'm not sure, you guys are the ones that should know these things!
Me - Sorry, but we only do the basic maintenance on your server, which means that we keep it up to date. We don't develop the application and we don't know the details of how it works, so we can't know these things.
Customer - Well, I don't know either...
Me - Can't you ask the developers of that application?
Customer - I dunno, the developers are long gone...
Me - ...nobody does maintenance on the application?
Customer - No.
...of course not. Judging by how old the old server was, it was probably running the one and only copy of that stuff in existence.
So I add a 'pass' rule for that thing and the customer confirms that the error is gone. We have no way of knowing what kind of problems it caused during the whole day, nor what other problems could be lurking in the dark, unexplored corners of that unknown application. On a side note, the backup was also not configured, the system was not in the monitoring and the logs were not in the analyzer... in practice, when CL said that the system was "basically" ready, what he meant was that it wasn't.
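If "basically ready" has to mean something, it helps to spell it out as a list that is either all green or not. A minimal sketch: the item names come from this tale, while the `CHECKLIST` values and the two helper functions are hypothetical, filled in to mirror the state the server was found in.

```python
from typing import Dict, List

# Items from the tale; False marks what was found missing on the new server
CHECKLIST: Dict[str, bool] = {
    "java installed": False,
    "tomcat configured and started": False,
    "mysql server installed": False,
    "outgoing firewall rules match old system": False,
    "backup configured": False,
    "system added to monitoring": False,
    "logs sent to analyzer": False,
}


def missing_items(checklist: Dict[str, bool]) -> List[str]:
    """Return the items that are not done yet, in checklist order."""
    return [item for item, done in checklist.items() if not done]


def is_ready(checklist: Dict[str, bool]) -> bool:
    """Ready means every single item is done, not 'basically' done."""
    return not missing_items(checklist)


if __name__ == "__main__":
    if is_ready(CHECKLIST):
        print("ready")
    else:
        print("NOT ready, missing:")
        for item in missing_items(CHECKLIST):
            print("  -", item)
```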
Comments are added when, and more importantly if, I have the time to review them, and after removing Spam, Crap, Phishing and the like. So don't hold your breath. And if your comment doesn't appear, it's probably because it wasn't worth it.
Help me understand one thing: how come in the end you're always the one who gets the short end of the stick?
who uses Debian learns Debian but who uses Slackware learns Linux
By Davide Bianchi - posted 16/01/2017 10:04
Help me understand one thing: how come in the end you're always the one who gets the short end of the stick?
Because I'm an idiot, evidently...
@ Davide Bianchi
By Guido - posted 16/01/2017 11:35
Because I'm an idiot, evidently...
...I think that's common to everyone who prefers solving problems to shrugging and not giving a damn...
Want to be positive? Lose an electron!
By trekfan1 - posted 17/01/2017 08:25
Shouldn't the most recent stories be at the top of the feed?
By pif - posted 02/02/2017 13:55
By Anonymous coward - posted 01/04/2017 01:12
By antonio - posted 26/04/2017 00:43
Welcome back! My hopes have come true.
By Simo - posted 23/05/2017 11:55
By mima85 - posted 14/07/2017 17:10
Yesssss, a new series of stories
By Arthur Dent - posted 08/02/2018 10:43
Welcome back! I had lost hope of reading new stories, so I hadn't visited the site in quite a while.