Tales from the Machine Room



Pimp My Drive

Hardware versus software: it's the age-old story whenever you have a "performance" problem, or any kind of problem actually, with a complex system.

The point is, when a system goes beyond a certain level of complexity, nobody really understands anymore what does what and which part is responsible for this or that function. However, everybody knows that if there is a problem, it's in somebody else's part of the system. It must be. My part is absolutely spotless, not a single bug to see here, no siree!

And the major problem is that, most of the time, they are correct.

The problem is not exactly in "their" part but, more precisely, in the way that part "talks" (or refuses to talk) with the rest of the system. And the more complex the whole system is, the more the various parts have to "talk" with each other to function as one, and the more acute the problem becomes when they don't "play nice" together.

So what happens when such a system begins to display "problems"? Everybody panics, of course.

And in general, everybody points the finger at the hardware, because that is, at least, easy to spot and to deal with.

So now we're going to talk about $masterofdisaster, a nice little company that was busy building some sort of "data analysis tool".

The idea was to collect data from various applications and systems using "feeders" and then process all that stuff using some sort of fancy interface. And sell the service for mucho dinero, of course. Basically they were trying to re-make NewRelic or something like that.

To do so, they went wild with "stacking": basically they threw together all kinds of different stuff, tying it all together with lots of Mule and NodeJS code.

And then... disaster.

After they had been operating for about a month, everything began to get very, very slow... and also a bit on the "unpredictable" side. Things would work for a while and then stop working, apparently for no reason.

Of course the first scream was about "performance". Yes, because it's better to provide wrong results quickly than accurate results slowly. Better to take a wrong decision very quickly than no decision at all, or a good decision slowly, right?

So a very long meeting was held between us (the hosting) and somebody from the company. Of course they were looking for an easy and quick way to "speed up" the performance without "impacting the general structure". Aka: we don't want to change our software, we want to change your hardware. That is OK for us, I mean, you pay us more and in the end it's your fucking problem so... who cares. Still, if the hardware is not the problem, and in my opinion it wasn't, you can change it as much as you want and it won't do much.

We, OK, in fact "I", tried to explain that when you have multiple bits and pieces that talk to each other, it is important to measure where the delay is in order to pinpoint where the problems are. So the developers should be very careful to

1. add debug points to measure how long every step of the processing takes
2. be sure to optimize each step as much as possible
3. be prepared to revise their processing if it turns out that it is not that good performance-wise

and, last but not least,

4. do not stuff in shit that you do not understand just because it's easy
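To give an idea of what point 1 can look like in practice, here is a minimal sketch in Python. It is not what $masterofdisaster had or should have had; the `StepTimer` helper and the three stand-in stages (`collect`, `transform`, `store`) are made up for illustration, and the sleeps only simulate work. The point is simply to record how long each step takes so that you can see where the time actually goes before touching anything else.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


# Hypothetical helper: accumulates wall-clock time per named step.
class StepTimer:
    def __init__(self):
        self.totals = defaultdict(float)  # step name -> seconds spent
        self.calls = defaultdict(int)     # step name -> number of runs

    @contextmanager
    def step(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the step raises.
            self.totals[name] += time.perf_counter() - start
            self.calls[name] += 1

    def slowest(self):
        """Return (name, total_seconds) pairs, slowest step first."""
        return sorted(self.totals.items(), key=lambda kv: kv[1], reverse=True)


# Stand-ins for the real stages (feeder -> processing -> storage).
def collect():
    time.sleep(0.01)  # simulated wait on a feeder
    return list(range(1000))


def transform(rows):
    return [r * 2 for r in rows]


def store(rows):
    time.sleep(0.05)  # simulated wait on a slow backend
    return len(rows)


timer = StepTimer()
with timer.step("collect"):
    rows = collect()
with timer.step("transform"):
    rows = transform(rows)
with timer.step("store"):
    stored = store(rows)

# Print the steps from slowest to fastest.
for name, seconds in timer.slowest():
    print(f"{name:10s} {seconds * 1000:8.1f} ms over {timer.calls[name]} call(s)")
```

If the slowest step turns out to be one that spends its time waiting on something else, no amount of extra servers or RAM is going to make it faster.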

Of course you can imagine how these simple and logical considerations were received.

In order to "speed up" the processing, a decision was quickly taken to double the hardware. So twice the servers, twice the RAM, twice everything. This... solved nothing. The system went from slow and busy to slow and doing nothing. Once that was done, they started yelling at each other. And at us of course, because it's easier to yell at somebody else than to yell at yourself.

How does this end? Well, it doesn't... It 'changes', in fact, because the company quickly decided that the 'next iteration' of the software had to be tested in a different way. And now we have to talk about how the software was developed.

Not in a test environment. At least, not in a "regular" test environment. You know, one where you have a sort-of-copy of the production environment to do your things in... no. That would have been too easy.

They decided to go "agile". That meant that every developer had a bunch of virtual machines on his own machine, on which to run a "stack" of the software for testing.

At this point the problem was how to "build" those virtual machines. Yes, because since the production environment was built on CentOS Linux, nothing is better than building the test environment out of... something completely different, of course. Something that doesn't have the same packages or the same versions of the base software, and where each and every configuration file is different. And then yell that they can't compare the environments because they are different.

And of course, when we asked why they were using whatever they were using to build their VMs instead of what we used in production, the answer was that what they were using was "optimized" for low memory and disk consumption. Note that we are talking about a few gigabytes of memory and disk on machines with about 16 GB of RAM and several hundred GB of disk.

I don't have to tell you that they never really fixed their problems, do I?

Davide
29/08/2018 11:32

Guido By Guido - posted 10/09/2018 09:41

The picture:

a) is gorgeous

b) gets the idea across all too well

;)

--
who uses Debian learns Debian but who uses Slackware learns Linux


Messer Franz@ Guido By Messer Franz - posted 11/09/2018 10:01

 

The picture:

a) is gorgeous

b) gets the idea across all too well

;)

You're absolutely right, DB is a genius at picking the best pictures; the Terminator one for tech support is without a doubt the best. When I first saw it I burst out laughing and only stopped minutes later, and even today (having seen it dozens of times) it makes me smile.

 

--
Messer Franz


Guido@ Messer Franz By Guido - posted 14/09/2018 06:56

the Terminator one for tech support is without a doubt the best;

Unfortunately that one reminds me of our own "pampers" hosting provider: every time I talk to them I feel like sending them a couple (we've had a ticket open for more than a year; apparently they can't manage to configure a JBoss 10 with Apache* in front of it...)

*short story long: JBoss + Apache + JSF over HTTPS doesn't work very well, because the connection reaches JBoss as HTTP and JSF handles it the way it gets it from JBoss. According to them it was the fault of our application, which redirected from HTTPS to HTTP, except that if nobody tells me there's an Apache in front I can't even realize what the problem is. Just saying... A year to reach the conclusion that THEY have to open a ticket with THEIR first-level support

'nuff said

 

 

--
who uses Debian learns Debian but who uses Slackware learns Linux


Akart72 By Akart72 - posted 10/09/2018 09:45

Lately this "agile" method has really caught on.

Once upon a time the cycle was "requirements gathering" -> development -> test -> release (with the idea that if the test wasn't OK, you didn't release)

Now it's "minimal requirements gathering" (you'll change them as you go anyway) -> development (done sloppily) -> release -> test -> cursing from the customer -> go back to piling garbage into the code

If a web page has a bug, oh well, but for those like me who use software for industrial purposes it's not so nice.

--
Akart72


Guido By Guido - posted 10/09/2018 09:47

You reminded me of something: at $ENTE they keep files in the DB (horresco referens) - LOTS of them.

I pointed this out, and said that it would probably be better to put them on disk. Doing so would be a huge pain in the ass, but I still think the gain would be considerable and worth the time.

Answer: Don't you worry, when the time comes we'll increase the DB space

Need I say more?

--
who uses Debian learns Debian but who uses Slackware learns Linux


Zimpazum By Zimpazum - posted 10/09/2018 09:49

And so it was that they bought all of Google's datacenters and managed to bring them to their knees... :D

--
Zimpazum


emi_ska By emi_ska - posted 10/09/2018 14:03

Hi,

We too develop with a program that runs on Oracle, which sucks up all the available resources and then crashes miserably, spitting out a truckload of exceptions that would fill the truck in the picture!!!

Have a good week everyone, and thank you Davide for the story that brightens my Monday!!

--
emi_ska



