Tales from the Machine Room
If you can't do anything about a problem, should you worry about it?
Opinions on this matter vary a bit, but in most cases, if you really can't do anything about it, you shouldn't worry about it. As long as it doesn't affect you. Otherwise, either move out or get far enough away that it is no longer your problem.
You live right near a major airport and right under the main landing or take-off path? Well, no amount of whining will change that situation, so you either learn to tell the planes apart by their noise or start looking for a new house. Possibly not one right next to a massive waste recycling plant.
However... When your job is to worry about that problem, if you can't fix it you should at least report it. Because... Well... IT'S YOUR FUCKING JOB! That's why.
So we're back to talking about disk problems... and whose problem it should be.
Since our customers, for the most part, paid us to handle their systems, in most cases the problem should be OUR problem. Because the disk is part of the system and as such should be covered by the contract. But again... if a disk is getting full, you can't do much. You can have a look and remove things that are obviously useless junk (why are there copies upon copies of old applications or backups?), but if that is not a solution, the next step is to contact the customer and report the problem to them. If they have miscalculated the amount of disk space they needed (or, more blatantly, just picked the smallest size because it is also the cheapest), they'll need to review their calculation and possibly authorize an increase in the size of the disk (with a corresponding increase in the cost of the system and backup). It's either that, or they decide what they can zap from the system without killing it.
Does it sound very difficult to understand? Does it sound illogical?
However, in order for the customer to make a decision on the problem, they first need to be informed of the problem, and for that to happen somebody needs to notice the problem and decide that they have to do something about it. If this crucial first step isn't taken... Well... let's go on with the story.
Once upon a time there was $WeSellCarpet, a company that... you know it.
They had a website, a webshop and other stuff hosted by $NiceAndFriendlyHostingProvider, which was snuffed out by $Shitty as part of its plans for Total World Domination.
One day, $WeSellCarpet discovered that their site was now hosted by us. And we discovered that we were now hosting them.
The site of $WSC wasn't anything out of the ordinary: it was a simple web site with a few custom-made bits and an attached webshop.
One of the custom-made bits, the only one that had something 'weird' about it, was the one related to Picture Management. Apparently the editor (or whatever you want to call the guy or girl that writes that stuff) liked to upload several different pictures of the same thing and then pick the one that best suited the mood of the day. The result was that several dozen copies of more or less the same picture existed at any given time.
Well, nothing wrong with that... Unless you happen to be bumped from a cheap-as-dirt hosting to a fancy fully-managed-hosting-and-we-are-also-certified-by-this-and-that.
$WSC looked at the invoice for the "managed" hosting and immediately jumped on the phone. The result was that the contract was amended to a "more manageable" price. The meaning of that was that we got a "cap" on the number of hours we should spend on requests from $WSC and they got a slightly less powerful VM.
Now, everything could have worked out well. And I could be stinking rich and famous. Probably an alternate universe where both of these are true exists. But it ain't this one.
Fast-forward a few weeks, when the site of $WSC stopped working.
And now a little digression.
In an attempt to get people working, DB had introduced some sort of 'shifts'. In practice, every day somebody should be on "log checking", somebody else should be handling tickets (or "Customer Response"), somebody else should be installing or decommissioning machines and everybody else should work on projects. All nice and dandy; the problem is what you consider a 'project' and what to do when somebody can't do tickets because they're sick or have "other things to do". As usual, a good plan should account for backups.
Then the problem is the 24x7 stand-by. That is, the poor asshole with the pager that could ring at any moment.
Another "executive decision" was to declare that during normal office hours, "pager duty" fell on the guy checking the logs. And he was also responsible for "monitoring" stuff that wasn't strictly 24x7. Like... disk space.
So the guy that should focus on log-checking and on following the trail of breadcrumbs that could potentially lead to a hacked machine or some other wrongdoing was now loaded with other stuff that demanded quick and decisive intervention. And I talk about hacked machines for a reason: it happened multiple times that, following something strange in the logs, I found out that one or more machines were being used to send spam or as control hosts for botnets. The sad part is that they had been like that for weeks and nobody else had done a thing about them.
One not-so-nice day the $WSC site stopped working. And the guy that was supposed to keep an eye on the monitor wasn't. And since he was supposed to keep an eye on the monitor, nobody else was. Unless that somebody else is me. I noticed the red light and asked if somebody was busy with the site, and got no answer, as usual. So I logged in and checked. Or better, I tried to log in and check, but couldn't.
I quickly tracked down the problem: /var was 100% full.
A check in the partition graphs revealed that... there were no graphs because the machine was never added to the graphing system.
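For what it's worth, the kind of check that was missing here is not rocket science. What follows is only a minimal sketch in Python, not what our monitoring actually ran: the thresholds (80% for a warning, 95% for critical) and the function names are made up for illustration. The point it tries to make is that a "WARNING" exists precisely so that a human looks at it before anything has a reason to beep.

```python
import shutil
from typing import Tuple

# Hypothetical thresholds, chosen for illustration only.
WARN_PERCENT = 80.0
CRIT_PERCENT = 95.0


def used_percent(total: int, used: int) -> float:
    """Return used space as a percentage of total (0.0 if total is not positive)."""
    if total <= 0:
        return 0.0
    return 100.0 * used / total


def classify(percent: float) -> str:
    """Map a usage percentage to OK, WARNING or CRITICAL."""
    if percent >= CRIT_PERCENT:
        return "CRITICAL"
    if percent >= WARN_PERCENT:
        return "WARNING"
    return "OK"


def check_path(path: str) -> Tuple[str, float]:
    """Check the filesystem holding `path` and return (status, used percent)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    percent = used_percent(usage.total, usage.used)
    return classify(percent), percent
```

Calling `check_path("/var")` from a scheduled job and mailing anything that is not "OK" to whoever is on log-checking duty would be one way to use it; how the result gets reported is a separate decision and is not shown here.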
It was time to escalate the problem.
The site was down and we couldn't log in: time for a reboot in single-user mode and a clean-up of the /var partition, and then to start (again) the discussion about who the fuck installed the server and why several things weren't done as they were supposed to be done. Because you can have all the "certifications" you want, but if you DON'T DO THE STUFF it ain't gonna work, you know?
DB - Why didn't you inform the customers before?
Me - Before what?
DB - You are checking the monitor, right?
Me - I do check the monitor every now and then. I'm not actively checking the monitor because that's somebody else's job.
DB - Who is it?
Me - YOU do the planning, YOU should know.
He started rummaging in that crap of a "plan" application he wanted to use. I had pointed out multiple times that we have a few beautiful WHITEBOARDS in the office that could be used as a "planning" tool and that are perfectly functional, cheap and immediately readable, but evidently they didn't seem "cool" enough.
Anyhow, the penguin assigned to log-checking was identified and DB moved to inquire.
Note that at this point nobody had done anything to fix the site of $WSC yet.
DB (talking to CL, who was supposed to do log-checking) - Have you seen the website of $WSC?
CL - Who me? Why?
DB - It's not working.
CL (looks at the pager) - Doesn't beep.
DB - Have you seen the monitor?
CL (points to the pager) - It doesn't beep.
DB - Can you open the monitor?
CL - Sure
(starts rummaging on the computer; of course there is everything running except anything remotely related to monitoring, log-checking or reporting)
CL - here.
(there are several red lights blinking)
Me - Have you noticed those red lights?
CL (points to the pager) - It's not beeping, so it's ok.
Me - No, it's not ok. There are several things that don't make the pager beep. That doesn't mean that you can ignore them.
CL - I'm not ignoring them
Me - So did you check them?
CL (points to the pager) - It's not beeping.
Me - I'll assume it's a "no" then. Have you reported the problem to the customers or somebody else?
CL - Who me?
Me - So, if you did not report the problems and didn't check them, you ignored them.
CL - No, ignored not.
Me - ...then what did you do?
CL - (looks at the pager) ...
Me (to DB) - I'm going to reboot the server of $WSC, you talk to him.
After a reboot and a clean-up, the server of $WSC was back on-line, and I suggested that our MM contact the customer and tell them to either clean up all the dozens of unused images or toss some money at a bigger disk. Then I asked the useless question of WHY THE FUCK there is one single partition for /var and /var/www ain't on its bloody own, who the fuck decided the partition scheme, and why, if one of the "install steps" is "add the server to monitoring and graphing", that bit is never carried out. Guess who it was that installed the server?
The interesting part of this was that I reported the thing in the "weekly meeting", and there was a non-discussion about it. My point was that if you are supposed to look at the monitoring you should look at the monitoring, and the fact that the pager doesn't ring doesn't necessarily mean that everything is fine. Of course everybody, including CL, was perfectly ok with this.
Fast forward a few weeks.
Once again, CL was supposed to check the logs. And once again I saw a big, red, blinking light on the monitor and apparently nobody else was doing anything about it.
This time it wasn't somebody's server, this time it was the queue of the smtp relay used by all the servers in one of our datacenters. Normally the queue should be around 100 mails; that time it was 29,000. And it was growing. I checked the graphs and it seemed the queue had grown quickly starting around 20:15 the previous day. What happened around 20:15? It seemed one of our customers had a new version of some CMS installed around that time. Coincidence? Probably not.
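The reasoning above ("normally ~100, now 29,000 and still growing") is easy to put into code. The following is only a sketch under assumptions: the baseline of 100 comes from the story, while the multipliers, the function names and the idea of feeding it periodic queue-length samples are invented for illustration. How you actually count the queued mails depends on your mail server and is deliberately left out.

```python
from typing import Sequence

# Baseline from the story; the multipliers below are hypothetical.
BASELINE = 100
WARN_FACTOR = 5    # 5x the baseline: somebody should have a look
CRIT_FACTOR = 50   # 50x the baseline: somebody should have looked hours ago


def queue_status(queue_len: int, baseline: int = BASELINE) -> str:
    """Classify a queue length relative to its normal baseline."""
    if queue_len >= baseline * CRIT_FACTOR:
        return "CRITICAL"
    if queue_len >= baseline * WARN_FACTOR:
        return "WARNING"
    return "OK"


def is_growing(samples: Sequence[int]) -> bool:
    """True if every sample is larger than the previous one (needs at least 2)."""
    if len(samples) < 2:
        return False
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))
```

With these made-up numbers, a queue of 29,000 that keeps growing is "CRITICAL" and `is_growing` returns True, whether or not anything beeps.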
I logged into the mail server and checked who had logged into it recently. Nobody. Ok, before doing anything else I went directly to DB.
Me - Check the monitor.
DB - (looks at the monitor) 29,000!
Me - Yep.
DB - Have you checked it?
Me - No. I'm not the one supposed to check these things. That has been blinking since last night, so it's not even a new thing. Somebody else should check it. And that somebody is obviously not doing it.
Once again, DB repeated the ballet of "checking who it is", and I noted that he was the guy that organized the weekly schedule and he could also do it on a monthly basis but nooooo...
Once again we showed up at the desk of CL, who was very busy watching a video about... nothing work-related, unless his job was keeping track of the latest football exploits.
Me (pointing at the big screen behind us) - Have you seen that thing?
CL (picks up the pager) - It's not...
Me - Beeping. Yes. I noticed. Have you seen that thing?
CL - No.
Me - And don't you think you should check it?
CL - But the pager is not beeping.
So we began again with the same idiotic discussion we'd had last time.
And in the middle of it I decided that I wasn't really going to bother anymore, so I went back to my desk, went back to doing what I was doing and decided that if it works for CL it can work for me too. After about an hour DB showed up at my desk.
DB - Have you looked at the mail server?
Me - No. I didn't and I'm not gonna do it. CL is the person you should talk to.
DB - He can't find anything.
Me - He can find the door.
DB - Can we talk about this later?
Me - We can talk about this never, because there isn't really anything to talk about. Since apparently there isn't anything to DO about it.
Well, the lesson here is that if you ignore the problem, somebody else will have to do something about it. And then it won't be your problem anymore.
Oh, what were the 29,000 mails in the queue? Apparently the new CMS had a nice "send a mail to anybody with whatever content you want" function that was supposed to be disabled but wasn't. So our relay ended up on every blacklist on the planet for about a month.
No, CL wasn't fazed by it. The pager still wasn't beeping.
Comments are added when, and more importantly if, I have the time to review them, and after removing Spam, Crap, Phishing and the like. So don't hold your breath. And if your comment doesn't appear, it's probably because it wasn't worth it.
Davide, am I wrong, or do these people at $brancodipaguri make others look like deeply competent technicians?
By L'ennesimo codardo anonimo - posted 29/08/2017 09:39
The pager wasn't ringing - CL's bell had already been rung from the start.
L'ennesimo codardo anonimo
By Anonymous Stupid - posted 29/08/2017 11:40
Interesting... Was he stupid, or had they told him that the only thing that mattered was that pager?
Because DB, to me, seems CL enough to have given suicidal instructions.
By Bopp - posted 31/08/2017 09:30
I thought you had hit rock bottom when you were working on the other side. With this wild bunch, it seems somebody handed you a shovel and told you "Now dig!".
N.B.: "pager" = "paguro" (hermit crab) ????
By Eladamri - posted 01/09/2017 12:12
Ignore, ignore or ignore: that is how the brain of the various CLs responds to any input.
When they can't ignore something, they dump it on another CL until it is forgotten.
By Guido - posted 06/09/2017 09:13
The only good thing about coming back from holiday is that I have two stories to read instead of one
who uses Debian learns Debian but who uses Slackware learns Linux
By noob - posted 21/09/2017 23:24
Dear Davide, unfortunately you have hit the heart of the problem.
The CL in question has no excuse: even if 'the pager didn't ring' were a valid excuse, he should have spent his working time combing through the logs, while you and DB caught him watching football videos. Unfortunately you, like me, were born a sucker (also known as honest), so if they tell you to do something, you have the insane audacity to do it! A normal CL, instead, knows that as long as DB is not in the same room staring at the front of his monitor (and not the back of it) he can do whatever he likes, since there will be some other sucker who takes care of the work. And anyway, at the end of the month both you and he will get the same $$.
You remind me of my previous job in Brussels, when in a team of five people I was often the only one doing anything, but at the same time also the one who got scolded by the boss if everything wasn't perfect: she knew the others wouldn't do anything anyway, so she preferred to take it out on the one who was enough of a sucker to take his job seriously.
By Anonymous coward - posted 23/09/2017 20:46
"pager" doesn't mean "paguro" (hermit crab), but this:
@ Anonymous coward
By Bopp - posted 30/10/2017 15:28
"pager" doesn't mean "paguro" (hermit crab), but this:
Well done. You don't even get the jokes.
This site is made by me with blood, sweat and gunpowder. If you want to republish or redistribute any part of it, please drop me (or the author of the article, if it is not me) a mail.