Tales from the Machine Room



True Lies

It was a dull, cold morning when our MarketingMan dropped a ticket on top of the pile that was already growing bigger by the minute, and the fact that the three (not one, not two, but THREE) guys who were supposed to be taking care of 'Customer Requests' weren't doing it wasn't helping.

Since, unfortunately, I sat in close proximity to MarketingMan, it was normal for him to ask ME about whatever problem he could possibly think of, so I grabbed the ticket and had a look.

Well, it was a pretty normal request from one of our customers.

$UnluckyChum#218 had a bit of a fuckup during the weekend, when his database server ran out of disk space and committed suicide. Our "standby" engineer was lost in the middle of the flatland on a bicycle and was unable to intervene quickly, so they decided to "escalate" the problem to "national catastrophe #1" by calling every single cellphone number they could possibly find. The result was that basically the whole "management" began phoning each other and calling left and right with requests for quick intervention, follow-ups and status reports every 2 seconds.

This went on until they managed to reach my phone number. Since I was basically at home, I logged in, looked at the poor db server and quickly zapped the last dozen or so backups that were still on the local disk, freeing up enough space for the db to restart, and things went, more or less, back to normal.

Of course $UnluckyChum#218 wasn't happy with the "performance" and asked for details about what we monitor and how. And that was the point of the ticket that our MarketingMan wanted information about.

In about 5 minutes I located the stuff and updated the ticket, then removed myself from the "list of people we can bother if we want more info about this" and quickly forgot about the whole thing.

But every bad thing comes back, repeatedly if it is bad enough. So about a week later, MarketingMan came around to ask a few more questions. Specifically, he wanted to know how it was that we hadn't noticed the disk space becoming scarce.

MM - ...because we're responsible for the good operation of the system so we should notice those things and act before it is too late and prevent the problem...
Me - Look, first of all, the whole thing went from 89% to 100% in about half an hour when the daily backup began dumping the db on the local disk. Why were there 12 backups left on disk? I don't know, I guess they wanted 12 backups locally and it probably wasn't a problem when they ordered the machine a year ago; now the db has grown but the disk hasn't. Second, there is a guy each day who should keep an eye out for this kind of thing, but evidently he didn't, or didn't consider the problem that big on Friday. Last: we do not monitor disk space 24/7, only during office time. And if a disk becomes full, most of the time we can't do much, except deleting stuff or asking the customer what he wants to remove from it.
MM - Yes but...
Me - If I remember correctly, we already have a pending request to increase the disk space of that and several other systems for $UnluckyChum. Maybe they should get a move on about that?
MM - ...I'll talk to them.

A couple of days later, I got notified that a new "update" had been added to $UnluckyChum's ticket.

Now, that was odd, since I had removed myself from the 'update' list. A quick check informed me that my name was back on the list and had been added back by MarketingMan.

The update was a question from $UnluckyChum about what we monitored of their systems and how.

Since it seemed I was the only person on the "notify" list who had some idea of the technical details involved in their setup, I had a look at the configuration of the monitoring and prepared a quick overview of the checks that covered their servers.

The answer was more or less something like this:

"We monitor your applications (check on connections on port 80/443) for servers X1, X2, X3, X4, X5, X6, X7 and X8, we check whether the response of the application conforms to what is in our 'expected list' (that you provided), and we check whether the database server is running (processes running in background), the number of connections, the cpu and disk activity and the disk space."

Maybe I was a bit more technical but that was the core of it.
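
For the curious, the disk space check from that list, combined with the "office time only" rule mentioned earlier, might be sketched roughly like this in Python. This is only an illustration, not the monitoring configuration we actually ran: the 90% threshold, the 09:00-17:00 weekday window and every name in the code are assumptions made up for the example.

```python
import shutil
from datetime import datetime, time

# All values below are hypothetical, chosen only for illustration.
WARN_PERCENT = 90.0          # assumed alert threshold
OFFICE_START = time(9, 0)    # assumed start of office time
OFFICE_END = time(17, 0)     # assumed end of office time


def disk_used_percent(path="/"):
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def in_office_hours(now):
    """True on Monday to Friday between OFFICE_START and OFFICE_END."""
    return now.weekday() < 5 and OFFICE_START <= now.time() < OFFICE_END


def should_alert(used_percent, now, always_on=False):
    """Decide whether a disk-usage reading should raise an alert.

    With always_on=False the check only fires during office time,
    so a disk that fills up on a weekend night goes unnoticed.
    """
    if used_percent < WARN_PERCENT:
        return False
    return always_on or in_office_hours(now)


if __name__ == "__main__":
    used = disk_used_percent("/")
    print(f"disk used: {used:.1f}%  alert: {should_alert(used, datetime.now())}")
```

With always_on=True the same reading would raise an alert at 3 in the morning on a Saturday, which is what a contract promising 24/7 monitoring would need.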

A few days later, I got a reaction again, this time from MarketingMan, who asked me if I could answer the next question from $UnluckyChum about the service level of the checks.

Specifically, they wanted to know whether we monitored the disk partitions on a 24/7 basis or not.

A couple of hours later, MarketingMan showed up and came directly to me.

MM - I saw this morning that $UnluckyChum has sent a reaction to his ticket...
Me - Yes, I saw it too, and replied to it already.
MM - Oh, good, they asked if we monitor the stuff 24/7 right? And...
Me - No, we don't. We don't do that for several reasons. First of all, we have way too many customers with nightly automated processes that can and do fill up the disk to almost full and then clean it up again. So unless we want to maintain a shitload of "special" checks and/or exclude lists, we decided not to monitor disks, ram and cpu 24/7. Those are only checked during normal office time. Or better, they SHOULD be, if somebody takes a look at them every now and then.
MM - No, that's not right... Can you answer $UnluckyChum that we monitor everything 24/7?
Me - ...that's not what I just said... And it's also not what I already told $UnluckyChum.
MM - Yes I know, but this is the right answer to give to them, because I checked and when we made the contract we specified that we were going to monitor everything 24x7.
Me - ...So your question is not a question, you are asking me to give back the answer that you want to hear?
MM - Yes, sort of.
Me - That is also a lie.
MM - Well, we are going to move to a new version of the monitoring software anyway later this year, so we will put all that stuff in 24/7 by default, so it's not completely wrong... it's just wrong right now...
Me - ...right... wrong...

Post-Mortem

More than a year has passed since that 'incident'. We never moved to the "new monitoring software", and the discussion about the 24/7 monitoring of CPU, disk and RAM was still ongoing the day before I left.

Davide
28/12/2016 12:58



By Guido - posted 13/03/2017 07:42

If there's one thing I hate, it's people who ask you a question, get an answer they didn't want to hear, and then ask the question again... (angry)

--
who uses Debian learns Debian but who uses Slackware learns Linux


By Manuel - posted 13/03/2017 09:57

Two notes.

First, the oncall engineer who happens to be lost in the plains on a bike and unable to handle the ticket. Am I the only stupid fscker who actually cares about reaction time and SLAs?

From my point of view this should be enough for firing the engineer. Or, let's be generous. First time I catch you on the bike instead of being within 20-30-60-whatever minutes to your PC you get a reprimand and pay back to the company the extra allowance you got for being on call. Second time, formal letter from HR, pay back the oncall allowance and have your frigging bike crammed where the sun does not shine. Third time, you are out. Now. Yes, as in "now".

Second note.

Ah, the salesmen asking you to blatantly lie...

--
Manuel


@ Manuel By Davide Bianchi - posted 13/03/2017 10:32

First, the oncall engineer who happens to be lost in the plains on a bike and unable to handle the ticket. Am I the only stupid fscker who actually cares about reaction time and SLAs?

Let's be frank: when you are "en route" it is quite difficult to be able to answer. On the other hand, we (me and the boss) were sitting in the office, so him calling up and asking wtf was perfectly valid. Especially because the "problem" was self-inflicted by the customer himself.

Ah, the salesmen asking you to blatantly lie...

Yeah, that was "the" problem.

--
Davide Bianchi


By Boso - posted 13/03/2017 13:06

My Boss, too, often sells the customer anything and everything, and promises that every innermost desire will be automatically recognized by the system, regardless of whether the thing is feasible. If I had a euro for every time I've heard the words "find a way to do it"...

Same old story, remixed: the marketing guy sells things he knows (almost) nothing about to people who seem to know even less. A serious customer who asks for a 24/7 service does so because he needs it, and gets more than a little pissed off when things like this happen.

But how many people with half an idea of how to do a decent job are still around?

--
Boso


By mk66 - posted 13/03/2017 15:49

Is it too late for a "welcome back"? :-)

Unfortunately, I notice that at the end of the day the way work goes is always the same, regardless of how high or low the Countries are.

Marketing guys are marketing guys everywhere. Bosses are bosses everywhere, and everywhere there always has to be someone who fixes everything in the end and keeps the show running... whether we're talking about the Netherlands, Italy, Serbia, Brazil, etc... :-(

A note that is completely beside the point: in the list of "cast and characters" on the left, poor Bob doesn't seem to be entitled to a category of his own, but is an appendix of the new entry Dumboss (New Entry referring exclusively to my faulty memory, of course)

--
mk66


By chiaretta - posted 16/08/2017 06:18

Where I work the exact opposite happens: monitoring alerts are handled 24/7 even for customers who don't have a 24/7 contract (sad)

--
chiaretta




This site is made by me with blood, sweat and gunpowder. If you want to republish or redistribute any part of it, please drop me (or the author of the article, if it is not me) a mail.

