Tales from the Machine Room



State of (System) Alteration

A long time ago, almost 20 years ago to be precise, there were SysAdmins who managed servers. These were huge, noisy boxes and they were mostly more trouble than they were worth. The servers I mean, not the admins. Then somebody thought that you could make more money if you could "rent" the admins with the servers. And that was the beginning of the "managed hosting" providers. Then somebody else decided they could sell or rent hardware that didn't exist, and the Cloud was a thing.

on the left: the sysadmin, on the right: the appadmin

In all this, the SysAdmins found themselves in different situations. There are several "states" a system can be in, depending on which type of "management" the owner decides to use.

1. I am the SysAdmin, now fuck off

The company pays for the hosting of one or more machines (virtual or not) and that's it. Installation, monitoring, management, troubleshooting, maintenance and the like are done by one or more SysAdmins who know the whole environment inside and out. The hosting company only needs to bring power and network to the machines.

This was the "standard" situation about 10 years ago. The usual problem was that the admin(s) had to be good at their job, able not just to react but to actively prevent problems and foresee situations down the road. The risk was ending up with a system composed of badly patched-together bits and pieces. And that was more or less normal. The main disadvantage was that the admins were absolutely central to keeping the system alive.

2. I am the sysadmin, but you manage the OS

The company pays for the hosting and OS-level maintenance of one or more machines (virtual or not). Everything that is application software is installed, managed, monitored and maintained by one or more sysadmins who know the applications and have root rights on the system to do their job. The hosting company rolls out OS patches and updates at (more or less) regular intervals.

This is akin to the previous situation, only the OS maintenance is outsourced. The major problem normally is that the hosting company decides when to roll out the updates, and that is usually done at times that are not really convenient for anybody. Also, plans for urgent roll-outs are normally lacking.

3. I am the Application Admin

The company pays for the hosting, monitoring and maintenance at OS and system level of one or more machines (virtual or not). The hosting provider keeps the root rights on the system, and a limited-rights user is created for the "other admins" to install and maintain the application. How many rights remain in the hands of the admins is to be decided. Normally, not many.

This situation is normal (unfortunately) when the company that pays for the hosting doesn't want to spend the money required to hire "real" admins. The result (usually) is that when there is a problem that requires some competence and skill, the people who have those skills don't have the rights to use them. Expect complications because of the lack of rights to perform maintenance, and a high count of unsolved problems.

4. THEY are the sysadmins

This, from my point of view, is the worst situation. It usually happens when the company searches for the cheapest solution and decides to completely outsource the whole thing. Call it "fully managed" if you want. The problem is that all the activities have to be performed by the hosting company, so they should also provide personnel with the required skills and competence, but it is unlikely that they can cover all the possible skills required by a large enough pool of customers. And even so, the amount of time they can spend on a single customer is limited.

The result is that the "customer" will never get the level of service he would expect, the admin will have a huge number of services to manage, all different and all with different quirks and "specials", and whoever develops the application will have to go through a complex procedure just to have a small change applied.

In the last few years situation (4) has become more the norm. Call it "platform as a service" if you like, the problem remains.

Sure, you can be lucky and get the "perfect" situation, where the admins of the hosting company are competent and capable and the admin of the company is able and can communicate. At that point the two "merge" into a single supersysadmin (like in a Japanese manga) and every problem is blasted away in a Kamehameha attack...

But most of the time one of the two is barely capable of rebooting a machine and the communication is lacking. The result is that any problem gets dragged along for days or weeks.

Enough with the bullshit, let's go with the story.

$ShittyHostingProvider was a provider of the fourth type. We had a few customers that were of the first kind, but with time and the rise of cheaper services (AWS, Azure) they migrated away. The "customers" were mostly companies where the admin capacity was almost zero. We're talking about part-time "user support" personnel who were supposed to send tickets to us when things were broken.

Now, having a background as a developer and other things, I was able to take an application and figure out (mostly) what it was trying to do and how, and then apply some "debugging" principles to identify and solve problems. However, not every sysadmin has evolved from a developer, and not every sysadmin has the patience or the disposition to spend hours understanding why an application suddenly stopped working.

In most cases, it is because a system that is designed to work under the direct supervision (aka: babysitting) of one or more admins doesn't work well when such supervision is lacking. It can be because of a broken design and/or implementation, or maybe because the system is "fragile" and doesn't react well to unpredicted changes. And no change is predicted. Ever.

Let's talk about the system of $slackers, who had begun long ago with "a small server" and after years of activity had got to the level of 18 servers: 2 load balancers, 8 application servers, a couple of database servers (one MS SQL Server), a LAMP box to run a "forum" and a bunch of machines that ran different Java applications, whose purpose was not really clear and totally undocumented.

This environment was notorious for being extremely skittish and susceptible to anything. Rain? The system starts spitting out errors. A pigeon flies outside the window? One (or more) of the application servers starts acting up and dropping connections. The pigeon shits while flying? The load balancer gets stuck on the server that drops connections. And so on.

Strange to say, but the only machine that never gave any trouble was the SQL Server one.

And so we get to a nice Saturday morning, a day on which I was on "stand-by" and as such was relaxing on the couch, when the fucking pager started beeping with a (not so) nice error on $slackers' system.

I connect through VPN and start investigating the problem. All the application servers are slow as pigs, and some pages of the site return errors. Only SOME pages, however. A quick check tells me that both database servers are quiet, so they look innocent enough.

The application servers' logs are full of errors and exceptions, but they are ALWAYS full of that shit, to the point that they are useless.

After half an hour spent trying to figure out what the fuck is wrong with this junk, DB appears in the company's chat and wants to know WTF. It seems somebody at $slackers has noticed the problem and is getting impatient. I report what I found and DB begins to ask if I've restarted the services. Yes, I did, but it's not helping. At this point DB feels obliged to begin providing a series of suggestions that aren't helping either.

After a bit I told him to shut the F up.

In the meantime, I managed to get one of the developers on line. He was, unfortunately, not very expert in that part of the application and completely ignorant of any system-related bits. It seemed that he was competent only in a very small part of the whole and completely uninterested in knowing anything beyond that. It seemed to me that even the development was distributed between different groups, each of which was only responsible for a small part, and nobody really had "the whole picture". That probably explains why the whole thing was so unstable.

The guy insisted that since we were the "admins" of the system, we were also responsible for the maintenance of the whole application. That was a good plan; the problem was that we never got any documentation or information about the application itself. And since the structure of that thing was in permanent evolution, getting a clear idea of what did what was a bit difficult.

Anyhow, after several HOURS spent rummaging in the guts of that thing, I noticed a pattern. For every 2 application errors I also got a "connection failed" error about one of those "strange" servers whose functions were a bit of a mystery. The developer didn't have any information about these things either.

Since I had nothing to lose at that point, I decided to restart this specific application... and suddenly the entire thing began to behave. In 10 minutes or so things went back to "normal".
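For the curious, the kind of pattern-hunting described above can be sketched in a few lines of Python. This is only an illustration and not what was actually run that day: the log format, the "connection failed to <host>" wording and the host name `mystery-01` are all invented for the example.

```python
import re
from collections import Counter

# Assumed (invented) log format: "<time> <LEVEL> <message>".
# The "connection failed to <host>" wording is also an assumption.
CONN_FAILED = re.compile(r"connection failed to (\S+)", re.IGNORECASE)


def summarize_errors(lines):
    """Return (number of ERROR lines, Counter of hosts named in 'connection failed' errors)."""
    total_errors = 0
    failed_hosts = Counter()
    for line in lines:
        if " ERROR " not in line:
            continue  # skip anything that is not an error line
        total_errors += 1
        match = CONN_FAILED.search(line)
        if match:
            failed_hosts[match.group(1)] += 1
    return total_errors, failed_hosts


# Made-up sample lines, only to show the shape of the result.
sample_log = [
    "10:00:01 ERROR NullPointerException in PageRenderer",
    "10:00:02 ERROR connection failed to mystery-01",
    "10:00:03 INFO request served",
    "10:00:04 ERROR Timeout in SessionFilter",
    "10:00:05 ERROR connection failed to mystery-01",
]

total_errors, failed_hosts = summarize_errors(sample_log)
print(total_errors, failed_hosts.most_common(1))
```

When the logs are permanently full of noise, counting which host keeps turning up next to the failures is often quicker than reading the exceptions one by one.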

After having documented the thing as best I could, I decided to officially request up-to-date documentation about the internal relations between the various applications and functions, and a map of what we could monitor inside the application itself. So that next time we would also get some error messages that made sense.

Oh and the SQL Server system? We discovered that that machine wasn't in use anymore.

Davide
06/10/2017 13:59


Comments are added when and, more importantly, if I have the time to review them, and after removing Spam, Crap, Phishing and the like. So don't hold your breath. And if your comment doesn't appear, it's probably because it wasn't worth it.

7 messages. This document does not accept new posts.
Il solito anonimo codardo By Il solito anonimo codardo - posted 09/10/2017 09:44

Ah, so that's why the SQL Server never jammed: they didn't even know they had it!

crying and more crying over how the web world is self-destructing one piece at a time...

--
Il solito anonimo codardo


Jepessen By Jepessen - posted 09/10/2017 10:04

It's fair enough for a programmer not to know the whole contraption, as long as he isn't the only one. The people above him tell him which parts to build, he builds them, tests them and is happy. On the phone there should have been a project manager, who ought to know the whole thing, locate the problem and, if he couldn't solve it, go pester the programmer who wrote that part. That didn't happen because the project manager probably didn't even know he was one, so they dumped the problem on whichever poor sod was on duty...

--
Jepessen


stecolna By stecolna - posted 09/10/2017 13:22

and let's stop with this story that SQL Server jams every other minute!

Honestly, you've all been a pain in the a..e!

--
stecolna


Guido By Guido - posted 09/10/2017 13:23

The company I work for has type IV hosting, but we're the ones who have to tell them what the problem is and how to fix it (e.g. when they moved the application server from one server to another without changing the IP binding of the DB server, which, rightly, refused connections from the new one)

--
who uses Debian learns Debian but who uses Slackware learns Linux


Manuel By Manuel - posted 09/10/2017 18:19

The 4th type is becoming somewhat the norm, but for now I don't see only downsides.

That said, I'm looking at something on Google Cloud Platform, and it doesn't seem bad to me, not at all.

But it must also be said that so far I haven't (personally) managed any big projects on GCP, so dunno... I'll reserve judgment.

It seems obvious to me that "old school" SysAdmins need to move towards the cloud, not least because, given the complexity of things, experience is needed...

--
::: meksONE :::


Babouk By Babouk - posted 07/11/2017 15:40

N.B. off topic: I noticed you still have links to Splinder, a platform that has been dead and buried for several years now.

--
Babouk


Anonymous Gohan By Anonymous Gohan - posted 10/11/2017 01:24

Started watching Dragon Ball?

--
Anonymous Gohan




This site is made by me with blood, sweat and gunpowder. If you want to republish or redistribute any part of it, please drop me (or the author of the article, if it's not me) a mail.

