Tales from the Machine Room
It shouldn't be a big surprise that I am a bit "script happy". Every time there is something to do that is repetitive and/or boring and/or more complex than a single command line, I grab my keyboard and begin writing a script to do most of the task. Or all of it, if possible.
The pro of a script is that it is, or should be, easy to understand and to use. And scripts should be "for everybody".
I try to write my scripts so that they are readable (no convoluted code structure, no zillion-level nested 'if') and easy to maintain (with a modicum of knowledge, of course). When they get a bit complicated, I write documentation for them and publish it somewhere everyone should be able to read it (a wiki or such).
Usually, when a script needs to be used by other people, every time somebody asks "how does this thing work?", I sit down with him, explain it and check that he can actually use it. If necessary, I amend or improve the documentation.
I don't think that is something weird... or is it?
Unfortunately, it seems that this attitude of mine is almost "unique" in the IT world. At least the bit of the world I am in.
Most of the people I know go from "1-line-of-code" to "Iwrotea43Klineapps" without a middle ground.
And as for documentation, the best they can do is write the name of the thing on a post-it and then lose it.
All this to talk about $megaapp of $megacorp, which does, basically, everything but make the coffee.
It manages sales, purchases, invoices, taxes, orders, customer and supplier relationships and stock, and it publishes stuff on the internet using some sort of CMS. And it also has some sort of "statistical" bit that is used by $megacorp to... Ok, I'm not really sure what they do with that, I only know that every fucking day, around 8, somebody starts phoning in a panic because they "haven't got the mail with the daily stats yet".
And obviously I am the one who should fix the thing.
After the umpteenth phone call, I tried to figure out what the heck they were talking about. It turns out that the "stats" are actually produced by an external company from a bunch of data, some of which this application produces in the form of a bunch of .csv files.
The app produces those files and places them somewhere, then some other process grabs them and uploads them to an SFTP server, where the external company picks them up and uses them to produce snazzy graphics that all the SLs can then look at without understanding exactly what they mean.
What's the problem? The problem is that the export procedure sometimes fails to produce something, so not all the files (or none whatsoever) are generated. Or it is the SFTP that fails (most commonly because the SFTP server is full, or because the files there have the same names and we can't overwrite them). In any case, something goes pear-shaped, the whole thing crashes and somebody starts complaining.
After lots of lamentation I tried to figure out what the problem is, where the 'critical' points are, why they fail and what we can do to prevent that.
First order of the day is to figure out how the files are generated, so I start by asking CL, who is one of the authors of that thing.
CL - ...and this produces file_number_3234.csv, then we read the data from table this_and_that and we cross them with the data from table something_or_other and this produces file_number_3235.csv and then...
Me - Yeah, ok, you run a bunch of queries and you make a bunch of .csv, but at the end what the heck are you doing with those files? Because the files that are transferred are not that many, right?
CL - Ah... No, I don't know that.
Me - ...What do you mean, "I don't know"?
CL - This is what is in the Scheduler, what happens next I don't know.
Me - Let me see that scheduler...
So I dive into the scheduler. After several hours spent cursing whoever writes crappy software with crappy interfaces and no documentation, I find out that after all the queries, the last step is to run a script on the server.
I check on the server and look at the script. It starts, first thing, with an "rm -f *csv", then runs 4 (that's FOUR) queries and produces a dozen .csv files, which are the ones that are actually used.
Let's get back to CL.
CL - Ah... Maybe it was CL2 that made the script, I don't know...
Me - So you can confirm that all this junk in the scheduler is completely useless and we can remove it, right?
CL - I don't know, I'm not the one who manages the scheduler and frankly I don't even want to touch that.
Me - Why not?
CL - Because I don't want to have problems if it doesn't work.
Me - Can I point out that it doesn't work 3 times out of 5 already? So any change should be in the right direction at this point.
But nope. CL is fixed in his position: to avoid any possible problem, the best strategy is to do nothing.
This means we keep the giant fuckup in the scheduler and we try to fix it downstream.
After checking the script, I tried to "improve" it by making it more readable and giving it queries that were less crappy (fewer full table scans). The result is a tad faster, and I've also added some logging to see where things go wrong. Then I wait for the next run.
And obviously everything works fine... when you are waiting for an error...
I leave everything as it is and wait. After a couple of days, my trap springs! It seems the export failed.
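For the curious, the kind of logging meant here can be sketched in a few lines of shell. This is only a minimal sketch, not the real script: the names LOG_FILE, log and run_step are made up for illustration, and the real export steps are not shown.

```shell
#!/bin/sh
# Minimal logging sketch. LOG_FILE, log and run_step are hypothetical
# names for illustration; they are not taken from the real script.
LOG_FILE="${LOG_FILE:-export.log}"

# Append a timestamped message to the log file.
log() {
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$LOG_FILE"
}

# run_step NAME COMMAND [ARGS...]
# Run one step and record whether it succeeded or failed.
run_step() {
    step_name="$1"
    shift
    log "START $step_name"
    if "$@"; then
        log "OK    $step_name"
    else
        status=$?
        log "FAIL  $step_name (exit status $status)"
        return "$status"
    fi
}
```

Each query or copy would then be wrapped as "run_step some-name some-command", so that the log shows which step was the last one to start and which one failed.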
I check on the server and everything looks normal. "My" script seems to have generated all the files around 2AM. After scratching my head for a bit I check whether the problem was the SFTP, but even that tells me that it did everything without problems. At this point I decide to contact the other company to see if they know something.
The available CL (CL2) informs me that on their side the process started and immediately stopped.
Me - Oh nice, and why?
CL2 - Because the procedure checks if we have all the files before proceeding, but it stopped because the files weren't there.
Me - Which files?
CL2 - I don't know, I only see that the error reported was "missing files".
Me - ...mmm.. based on the SFTP log we transferred 12 .csv files.
CL2 - Yes, that should be it.
Me - ...so we have transferred all of them.. what went wrong?
CL2 - I don't know, it is not my procedure.
Me - Ok, is it possible to ask whoever made it what the problem could be?
CL2 - I'll see what I can do.
Grumblesmurfidiots that never document their stupid scripts...
Anyhow, shout-out to CL2, who actually manages to find the author. It turns out that the script doesn't check which files have been transferred; it simply checks for "flag" files, plain "ready.xxx" files with a different extension for each process that generates those files.
I check on the server and the .xxx files are not in the directory, so either they weren't made or there is something else going on. A look at the script tells me that it doesn't generate those files, so they came from somewhere else. Let's ask CL.
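I never saw the other company's procedure, so what follows is only a guess at what such a flag check might look like, not their code. The function name flags_present is made up, and the flag names passed to it are the placeholder "ready.xxx"-style names used in this story.

```shell
# Hypothetical sketch of a flag-file check; not the external company's code.
# flags_present DIR FLAG...
# Succeeds only if every named flag file exists in DIR.
flags_present() {
    dir="$1"
    shift
    missing=0
    for flag in "$@"; do
        if [ ! -e "$dir/$flag" ]; then
            # Report every missing flag instead of stopping at the first one.
            echo "missing flag file: $flag" >&2
            missing=1
        fi
    done
    [ "$missing" -eq 0 ]
}
```

Note that a check like this never looks at the .csv files themselves, which is why a complete transfer can still end in a "missing files" error.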
CL - Oh... those files...
Me - Yes, "those files"... Where did they come from? Because I couldn't find anything that makes them; they are not made by the script.
CL - No, I don't think they are made.
Me - ...what do you mean?
CL - Those files have always been there, we always transfer the same files, we never made them.
Me - Well, they are not on the server right now, so something should make them.
CL - Eh... no...
Me - ...what did you do?
CL - I deleted them.
Me - ...
CL - They looked old so I deleted them.
Me - Let me understand... you don't want to look or touch the procedure that actually produced those files, and then you go and delete files that are NOT produced by anything?
CL - Eh...
After adding a new line with "touch ready.xxx ready.yyy ready.zzz" to the script, I observed that this time everything worked fine. Now I'll have to check the SFTP part. And then transplant CL's brain.
And then document the whole thing of course. The script I mean, I can't document CL's brain.
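The fix itself was just the one "touch" line. A slightly more careful variation, shown here only as a sketch, would create the flags only when the expected number of .csv files is actually there (12, going by the SFTP log). The function name mark_ready and its arguments are made up for illustration.

```shell
# Hypothetical variation on the "touch" fix; mark_ready is a made-up name.
# mark_ready DIR EXPECTED FLAG...
# Create the flag files in DIR only if at least EXPECTED .csv files exist.
mark_ready() {
    dir="$1"
    expected="$2"
    shift 2
    count=0
    for f in "$dir"/*.csv; do
        # The -e test skips the literal pattern when nothing matches.
        if [ -e "$f" ]; then
            count=$((count + 1))
        fi
    done
    if [ "$count" -lt "$expected" ]; then
        echo "only $count of $expected .csv files present, not flagging" >&2
        return 1
    fi
    for flag in "$@"; do
        touch "$dir/$flag" || return 1
    done
}
```

Called as "mark_ready . 12 ready.xxx ready.yyy ready.zzz" at the end of the script, it leaves the flags alone when the export came up short, so the other side stops with "missing files" for a reason that is actually true.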
Davide
18/04/2018 12:26
Comments are added when, and more importantly if, I have the time to review them, and after removing spam, crap, phishing and the like. So don't hold your breath. And if your comment doesn't appear, it is probably because it wasn't worth it.
By Fabio posted 30/04/2018 11:36
To sum up:
- you took charge of one of $megacorp's problems
- you found the source of the problem, which already doesn't happen every time.
- you solved the problem
Now all that's left is to get $megacorp to pay you for the consulting...
By Diavolo_Rosso posted 30/04/2018 12:12
I would document CL's transplant too. Something tells me it will come in handy as patient 0.
-- Diavolo_Rosso
By trekfan1 posted 01/05/2018 08:48
Document CL's brain? Very simple:
nano cervellodicl > /dev/null
-- trekfan1
@ trekfan1 By Anonymous coward posted 01/05/2018 15:19
Better: touch cervellodicl
-- Anonymous coward
By Il solito anonimo codardo posted 02/05/2018 09:25
Can we do a nice rm -f CL? It would solve at least one of the umpteen-thousand problems at the root!
-- Il solito anonimo codardo
@ Il solito anonimo codardo By Davide Bianchi posted 02/05/2018 09:44
Then I would have to provision another one...
-- Davide Bianchi
By Anonymous coward posted 02/05/2018 10:29
Don't worry, you are not alone... here somebody set up a maximum security system, which means that the only active service is SFTP and there is no interactive SSH access... too bad that the same somebody installed a version of the SFTP daemon specially modified to run any shell command (as root!) when it is prefixed with a certain "secret" word.
So the script starts on a Windows system and writes to a CIFS share on a NAS volume, which is read through an NFS export by an HPUX system, which uses SFTP to copy the files to the Solaris server and then runs commands through the SFTP "extension" to copy the files via FTP (not secure, but through a VPN it "becomes secure") to a remote server. Where they are probably readable by everyone, in full control.
Obviously, the password of the user who can run any command is the same as the username. For security.
That somebody hasn't worked here for about 20 years.
-- Anonymous coward
@ Anonymous coward By Davide Bianchi posted 02/05/2018 16:38
...a maximum security system, ... SFTP
..."maximum security" and "SFTP" in the same sentence?
-- Davide Bianchi
This site is made by me with blood, sweat and gunpowder. If you want to republish or redistribute any part of it, please drop me (or the author of the article, if it is not me) a mail.