Tales from the Machine Room



Loggings

...BEEEP BEEEP BEEEEP BEEEP... The proximity alarm keeps screaming while the fragmentation missile whizzes by 300 km away, its trajectory bent by the gravity of the gas giant "planet" that the "HaveYouTriedTurningOffAndOn?" is orbiting.

The missile will probably blow up before reaching the atmosphere, in about 25 minutes. But this is only the first one. There are 2 more, and something makes me think that whoever is sitting in the Control chair on the enemy cruiser is right now sending updates to the AI of the other 2 missiles.

The computer informs me that both missiles are now making trajectory corrections. The thrusters are invisible in normal light, but very visible in the infrared against the 2 Kelvin of the surrounding space. Both missiles are now travelling at a cruise speed of about 30,000 m/s, with an acceleration of about 8 g to correct... The computer spits out the data: the first one will zip by at less than 20 km. Waay too close...

My orbit has been changing for about 10 minutes and should bring me close enough to the second moon to mess up the radar reading. The enemy cruiser is about 600,000 km away, so the next course change will reach the missile's computer in about 2 seconds...

BEEEP BEEEP BEEEP... What the ... ? Another signature is on the scanner now, and this one is close. Way too close. What the heck is he doing 20 km away?

BEEEP BEEEEP BEEEEP Oh Fuck... Ok, Ok, I'm awake now! Fuck...

I rummage around for the pager, which is doing its best to wake up the whole block (and is very good at it), and stop the alarm while with the other hand I try to turn the light on. I manage to find the cat, which doesn't light up but only meows. After turning on the light, I figure that if I grab my glasses I'll have a better chance of reading the thing.

Alarm: disk space $machinename

Oh, the joy of being "on call".

I turn on the PC and log in to the machine. Or rather, I try to. But I only get "access denied". Something makes me think that the jackass who configured this thing DID NOT make /var its own partition, as specified in the "best practice" and as I have been telling and repeating all the time for at least a year and a half. The result is that some fucking process filled up /var/log, hence filled up /, and now you can't log in anymore.

It's time to go to the console. After messing around for about a quarter of an hour to figure out which one of the thousands of "default" passwords was used, I manage to log in and my suspicions are confirmed: /var/log is 16 GB on a 20 GB disk.
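For the curious, the "is /var its own partition?" question can be checked without reading the partition table. What follows is only a rough sketch in Python, assuming a Unix-like system; the helper names `same_filesystem` and `usage_percent` are made up for illustration and are not part of any tool used that night.

```python
import os
import shutil


def same_filesystem(path_a: str, path_b: str) -> bool:
    """Return True when both paths live on the same mounted filesystem."""
    # st_dev identifies the device holding each path; equal values mean
    # the two paths share one filesystem and one pool of free space.
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def usage_percent(path: str) -> float:
    """Return the used share of the filesystem holding `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    # "/" and "/var" are the paths from the story; adjust for your own layout.
    for candidate in ("/var", "/var/log"):
        if os.path.exists(candidate):
            shared = same_filesystem("/", candidate)
            print(f"{candidate}: shares a filesystem with /: {shared}, "
                  f"{usage_percent(candidate):.1f}% used")
```

If the check says that /var/log shares a filesystem with /, a runaway log can take the whole machine down with it, which is exactly what happened here.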

I quickly find the culprit, a nice "data_processing_debug.log" file, and zap it, freeing about 7 GB at once. Right away the file starts growing again, like some sort of alien monster.

Hmmmm... Maybe it's me, but the content of this file looks like a huge pile of useless crap. Anyhow, I have 2 choices here: either I stop whatever is writing to this thing and make it write somewhere else (like /dev/null), or the whole thing will be full again in about an hour. Ok, /dev/null it is.
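The "full again in about an hour" guess is just free space divided by growth rate. A minimal sketch of that arithmetic in Python follows; the function names are hypothetical and the figures in the demo are invented for illustration, since the real growth rate of the file was never written down.

```python
def growth_per_hour(size_before: int, size_after: int, seconds_elapsed: float) -> float:
    """Turn two size samples of a file into a growth rate in bytes per hour."""
    return (size_after - size_before) * 3600.0 / seconds_elapsed


def hours_until_full(free_bytes: float, growth_bytes_per_hour: float) -> float:
    """Estimate how many hours remain before the filesystem is full."""
    if growth_bytes_per_hour <= 0:
        # Nothing is growing, so at this rate the disk never fills up.
        return float("inf")
    return free_bytes / growth_bytes_per_hour


GIB = 1024 ** 3

if __name__ == "__main__":
    # Hypothetical samples: the log grew by 1 GiB in six minutes.
    rate = growth_per_hour(size_before=0, size_after=1 * GIB, seconds_elapsed=360)
    # Hypothetical amount of free space left on the filesystem.
    free_now = 11 * GIB
    print(f"about {hours_until_full(free_now, rate):.1f} hours until full")
```

An estimate like this is only as good as the assumption that the growth stays linear, but it is enough to decide whether the fix can wait until morning.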

After a while it looks like everything is working and nothing gets logged. Nice. I write down the report with the number of hours spent rummaging around and go back to sleep. Of course the next day the developer (CL) came along to complain about the mistreatment of his log file.

CL - Why has the log file been deleted? We need that file for debugging!
Me - First of all, debugging should be done in the test environment, not in production. Second, we had 2 choices: remove a log file that was growing waay too fast, or leave the whole system stuck.
CL - But the log file doesn't lock the whole system.
Me - It does if it fills up the whole / partition, which is where everything else is. No space there means no space to write anything, including the database. If the database can't be written, everything stops working.

After repeating the whole thing about 96 times, I manage to lose CL, but at this point it's DB's turn.

DB - What happened with CL's server?
Me - What happened is that some idiot installed the machine without reading the documentation, so with everything in /, and some other idiots who performed the "quality control" did it with the same care, so none whatsoever. With the expected result. Or rather, expected by anyone who can use his brain.
DB - And why wasn't it noticed by the check?
Me - Why don't you ask whoever was supposed to check? Maybe the same one who should have been checking for the last... I don't know... 3 days?
DB - And why isn't it in the graphs?
Me - Because the same idiot didn't add the server to the graphs. Now, have you finished asking stupid questions or shall we continue?
DB - I'm just trying to understand why this incident happened...
Me - This ain't an incident, this is the result of incompetent administration. And it is incompetent because the procedures that exist are not followed. And they are not followed because whoever should check doesn't. And guess who that "whoever" is?
DB - I have a meeting with the megaboss now, we'll talk later.

Yep... It's always "later".

Anyhow, I didn't see him for the rest of the day.

Fast forward about 12 hours, that is, to about 2 AM the next morning, when the pager starts again. And yes, it is again the same fucking log file. I repeat the very same process and log the very same time spent.

The next day I go directly to DB.

DB - Don't tell me... CL's server again.
Me - Of course.
DB - It should've been fixed...
Me - And obviously it wasn't. At this point I'm asking what we are supposed to do with this thing.
DB - Well... Nothing...
Me - "Nothing" is not an acceptable answer. If there is a problem, and in this case there obviously is one, that problem has to be confronted and fixed.
DB - What do you suggest?
Me - Phase one: disable logging from that thing.
DB - We can't remove the logging, they are using it for debugging.
Me - Phase two: install a test environment for debugging.
DB - And who's going to pay for it?
Me - Them. Or we suspend the monitoring and if the application crashes, and it will crash, it stays crashed.
DB - That can't be.
Me - Good. (I drop the pager on the table.) This is yours then, because I'm not getting up tonight.

What followed was a very long and boring discussion, waaaay too long to be reported here. In any case, my point of view was, and still is, that the monitoring should cover INCIDENTS, which are unpredictable and unavoidable events, while a PROBLEM is predictable and avoidable and should be treated in a completely different way. Always acting as if in "emergency response" mode doesn't demonstrate any special capability or efficiency. Quite the opposite, in fact.

And if the problem is caused by a larger Structural or Organizational problem, then it has to be solved at that level as well.

DB - ...what do you mean?
Me - That we have procedures, which should be followed because they are part of all the fucking "certifications" that everybody seems to adore, and if the procedures are not followed, then we shouldn't be too surprised when things don't work as expected.
DB - Hmmm...
Me - And to be sure that the procedures are followed, there should be corresponding checks and administrative tasks that should also be followed. For example, one of these specifies that when a system is installed, before it gets a stamp of "approved", somebody should verify that the installation has been completed and all the boxes ticked. Now, this check was not done on CL's server, yet the system was marked as "production". Why?
DB - Well, it was an urgent request...
Me - It is ALWAYS an urgent request. How is it that we are always the ones who have to scramble to fix somebody else's shit? Well, now you fix it.

And from that moment, CL's log file has been symlinked to /dev/null.
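For those who have never done it, the trick looks roughly like this. It is a sketch in Python, assuming a POSIX system; `redirect_log_to_null` is a made-up helper name, and the demo runs in a throwaway directory rather than on a real /var/log.

```python
import os
import tempfile


def redirect_log_to_null(log_path: str) -> None:
    """Replace `log_path` with a symlink pointing at the null device."""
    # Remove the existing file (or a stale link) first, because
    # os.symlink refuses to overwrite an existing path.
    if os.path.lexists(log_path):
        os.remove(log_path)
    os.symlink(os.devnull, log_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        log_path = os.path.join(workdir, "data_processing_debug.log")
        with open(log_path, "w") as handle:
            handle.write("useless debug output\n")

        redirect_log_to_null(log_path)

        # A writer that opens the path from now on writes to the null device.
        with open(log_path, "a") as handle:
            handle.write("more debug output\n")

        print(os.readlink(log_path))
```

One caveat: a process that already holds the old file open keeps writing to the deleted file, and keeps eating disk space, until it reopens the path or is restarted.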

Davide
24/05/2018 13:53


Comments are added when, and more importantly if, I have the time to review them and after removing Spam, Crap, Phishing and the like. So don't hold your breath. And if your comment doesn't appear, it is probably because it wasn't worth it.

13 messages this document does not accept new posts
Ben C8 By Ben C8 - posted 18/06/2018 10:04

So, did you symlink the log to CL's brain or to DB's? :D

And besides, who on earth still creates 20 GB partitions? Fine, if they had given it a terabyte that would only have postponed the problem without solving it, but 20 GB partitions were what we used in the early 2000s...

--
Ben C8


Ivo Gandolfo@ Ben C8 By Ivo Gandolfo - posted 19/06/2018 09:20

[quote]So, did you symlink the log to CL's brain or to DB's? :D

And besides, who on earth still creates 20 GB partitions? Fine, if they had given it a terabyte that would only have postponed the problem without solving it, but 20 GB partitions were what we used in the early 2000s...[/quote]

Well, if you look around, plenty of people (hosting providers) sell virtual machines with ridiculously small hard disks for 5€/month. So such a tight space is understandable. Tight-fisted = ridiculous stuff.

--
Ivo Gandolfo


Ben C8@ Ivo Gandolfo By Ben C8 - posted 20/06/2018 10:36

[quote]Well, if you look around, plenty of people (hosting providers) sell virtual machines with ridiculously small hard disks for 5€/month. So such a tight space is understandable. Tight-fisted = ridiculous stuff.[/quote]

At the end of the day, the tight-fisted policy backfires on everybody, customers and suppliers alike. And then of course neither side can find anything better to do than hire the first CL they swept up off the sidewalk!

--
Ben C8


Palin@ Ben C8 By Palin - posted 19/06/2018 10:05

 

[quote]So, did you symlink the log to CL's brain or to DB's? :D

And besides, who on earth still creates 20 GB partitions? Fine, if they had given it a terabyte that would only have postponed the problem without solving it, but 20 GB partitions were what we used in the early 2000s...[/quote]

Yeah, sure, because you get your SSDs for free :)

--
Palin


Anonymous coward@ Palin By Anonymous coward - posted 20/06/2018 10:38

[quote]Yeah, sure, because you get your SSDs for free :)[/quote]

That's a bullshit argument, given that nowadays you can find storage (not necessarily on SSD) in the hundreds of terabytes for the price that hundreds of gigabytes sold for in the early 2000s.

--
Anonymous coward


Davide Bianchi@ Ben C8 By Davide Bianchi - posted 19/06/2018 13:49

[quote]And besides, who on earth still creates 20 GB partitions? Fine, if they had given it a terabyte that would only have postponed the problem without solving it, but 20 GB partitions were what we used in the early 2000s...[/quote]

My server still has a 20 GB "/" partition today. There should be almost nothing in '/' and it shouldn't grow. I want the space in /var, /usr, /home, /data (where the data lives), but in '/'? No.

--
Davide Bianchi


Anonymous coward@ Davide Bianchi By Anonymous coward - posted 20/06/2018 10:41

[quote]My server still has a 20 GB "/" partition today. There should be almost nothing in '/' and it shouldn't grow. I want the space in /var, /usr, /home, /data (where the data lives), but in '/'? No.[/quote]

At least you do give space to the other partitions! Some people grab 20 GB and try, unsuccessfully, to fit EVERYTHING in there.

--
Anonymous coward


Messer Franz By Messer Franz - posted 19/06/2018 08:21

This reminds me of a site built (not by me) back in 2000 (or 2001, I'm not sure) that wrote its logs to its own db (so far so good) in a SLIGHTLY verbose way... did you need to save a record?

- it logged that it had received the request and was connecting to the db
- that it had connected
- that it was writing (preparing the SQL string, I think)
- that it had sent the SQL to the DB
- that the record had been saved
- that it had sent the message (prepared the HTML for the web page) saying the save had been done

for. every. single. thing.

Let me remind you of the size of an average hard disk in those years... Obviously the customer ran into a few teeeeny problems, and the explanation given was that the site "wasn't optimized for that DB"... If I understood correctly, the customer had chosen a "normal" db and the company wanted to sell him an Oracle, which as we know is rather expensive (deservedly so, it is a really well-made db, even though I'm a fan of interbase/firebird), in order to pad the bill... But NO, in case you're wondering, the programmer had not been ordered to make a mess in order to "push" Oracle. They were just morons...

--
Messer Franz


magaolimpia By magaolimpia - posted 19/06/2018 14:05

"my point of view was, and still is, that the monitoring should cover INCIDENTS, which are unpredictable and unavoidable events, while a PROBLEM, which is predictable and AVOIDABLE, should be treated in a different way."

Bwuahahahahaha. In an ideal world.

I have been repeating the same thing for years where I work: monitoring checks get added in order to manually patch over software bugs. Which means that whoever is on the night shift has to fix db records by hand to sort out the specific cases that cause trouble. Or there are monitors that throw loads of false positives (nobody ever remembers the boy who cried wolf).

--
magaolimpia


Guido By Guido - posted 22/06/2018 08:03

These people who dump anything and everything into the log should have it sent to them by mail; maybe they would learn the difference between "essential" and "useless".

--
who uses Debian learns Debian but who uses Slackware learns Linux


Massimo m. By Massimo m. - posted 03/07/2018 12:12

How I would love to have that list of best practices!

BigD, what do you say, could you publish them?

--
Massimo m.


Gianluca Rigon By Gianluca Rigon - posted 19/12/2018 14:37

I guess this is a newbie question, but don't filesystems like ext2 and its siblings normally start out with 5% of the disk space reserved for root, so that the daemons vital to the system can stay up even when the filesystem fills up? Or have I misunderstood?

--
Gianluca Rigon


Davide Bianchi@ Gianluca Rigon By Davide Bianchi - posted 20/12/2018 10:34

[quote]start out with 5% of the disk space reserved for root[/quote]

Yes, but usually you don't log in remotely as root, because you don't allow remote logins as root.
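If you want to see how much of that reserve a given filesystem actually has, here is a rough sketch in Python, assuming a Unix-like system where `os.statvfs` is available; `reserved_bytes` is a made-up helper name. On filesystems without a root reserve the result is simply zero.

```python
import os


def reserved_bytes(path: str) -> int:
    """Return the free space on `path`'s filesystem that only root may use."""
    stats = os.statvfs(path)
    # f_bfree counts every free block; f_bavail counts only the blocks that
    # unprivileged processes may allocate. The difference is the reserve.
    return (stats.f_bfree - stats.f_bavail) * stats.f_frsize


if __name__ == "__main__":
    stats = os.statvfs("/")
    total = stats.f_blocks * stats.f_frsize
    reserve = reserved_bytes("/")
    print(f"reserved for root: {reserve} bytes "
          f"({100.0 * reserve / total:.1f}% of the filesystem)")
```

The reserve keeps root-owned processes going a little longer, but an ordinary user logging in over the network still sees a full disk.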

 

--
Davide Bianchi

