Tales from the Machine Room


Home Page | Comments | Articles | Faq | Documents | Search | Archive | Tales from the Machine Room | Contribute | Login/Register

Normalized or DeNormalized

Databases made Human History.

No, really. It is an opinion of many historian that writing was invented not to chronicle the action of this or that "hero" of ancient time, but simply to keep track about who had which and how many seeds in the village, to be sure they were "covered" for the next season.

The ancient egyptian, beyond seeds, began to account about everything, in fact it seems they invented taxes (thanks a lot!).

Starting from 6000Bc, so basically when we have something written down, what we can found more often are tables relate to who has what. Database basically.

And today, we have computers that can process a bazillion data in a nanosecond, network that can transfer a bazillion data and storages that can contain the same and we still manage to fuck up the whole thing.

In many cases, is not a problem of too many data, but how the data are (ab)used.

So long ago, when I fledged myself as a "developer" and databases were, more or less, my bread-and-butter, one of the books that I found more usefull was "Database Designs for mere mortals"  besides the tongue-in-cheeck title, that book was very clear in specify that one thing is the LOGICAL structure and another, very different, thing is what you want to do with it.

The Theory, that is nice and everything but is still a theory, says that a database should be NORMALIZED, that is, its structure should obey to specific rules, but in practice, that is ugly as hell but is what pays the bill most of the time, explaing that it doesn't matter if the database is perfect, it matters that it WORKS and it does what you need it to do in the best possible way. And that difference is where the world crashes miserably.

Enters $weeatdatabaseforbreakfast, a company specialized in distribution of various things.

These peoples, given the need to manage peoples, things and locations, used lots and lots of databases. For reasons that still eludes me, they had given the whole thing in the hand of a third company that was supposed to to data-analysis and deliver nicely coloured reports full of arrows to these people to do... well, I'm not sure what, I guess they were hanging them to a wall and then throw darts at them...

Anyhow, in order to do what they needed to do, this company had to install a new database, that was going to be hosted by us in the environment for $weeatdatabases. Obviously, since $weeat was talking a lot but then had to put the money where the mouth was, and they had no idea about the size of data they had to manage, they decided to start with a "medium sized" solution.

And obviously, about a week later, the system was ready to explode.

Not just because of the datasize, but because every 10 minutes in normal day and every 5 in special days, anyone of the $weeat's CLs was coming out with a "beautiful" query to run on the database that was going to melt the whole system.

After several meetings and several round of Dutch's preferred game: pointing the finger to somebody else, the fact that the system was sized for a specific volume of data that was WAAAAY too small became evident. And the solution was also evident: resize the whole thing. But resize the whole thing to adapt it to the volume of data meant in first to pay an higher price, then remake the whole "import" data gig (and pay again for that too).

The "solution" cooked up by $weeat was "use the modern technologies to bypass the problem (caused by our own stupidity)". When we asked what were they talking about with "modern technologies" the started throwing buzzwords around (blockchain was mentioned multiple time), showing that they had no fucking clue.

This went on for a bit, until they decided that no, they had to pay. So after a robust injection of money and hardware, the whole thing began to work a little better. I say "a little" because at this point the "other" problems showed up. That means: the whole thing was slow as a pig.

Obviouly, both $weeatdatabases and their "dataprocessor" were in search for a guilty party and decided that, since we were the hosting provider, we were the best party to dump the problem onto.

And at this point I show up, since apparently I was the only one with some vague idea about what a database is and how it should work. After having got a copy of all the dozens of queries that were causing problems, I immediately noticed the problem. The database is normalized. That means it is not optimized to run the queries that $weeat want to run.

Now we need to open a little parhentesis and explain what the heck is a "normalized" database.

Database Theory is a long and complex argument, but it can be shortened down to a few "guidelines" that specify how a "good" database should be designed to conform to some techincal minimum. In specific, these guidelines specify that a) each table should only contains the data that belongs to the table, b) there shouldn't be any data that could be "inferred" from other tables and c) there should be no data duplicated or inferred anywhere.

This is all good and well, but everyone that have worked a few days in the real world, knows that Theory and Practice rarely cohexists, and that a "normalized" database is nice in theory but in practice a de-normalized one works a lot better. And every book thet is more oriented to real-world problems and not so much about theoretical exams clearly specify that you first normalize the db to get rid of all the junk and then you have a look at how the database has to be used and then de-normalize to improve functionality.

Let's apply this to real life and let's have a look at one query for $weeatdatabases. A query that took about 1.5 hours to run and was something like "select count(distinct id), sum(val1), sum(val2) from table where (some_date_group) group by id" (note: this is as much I remember the thing, the details are not important).

At first look it doesn't seems that bad but, if you stop and check, you notice that the table in question was about 320 MILLIONS records and that the "distinct id" would produce a temporary table every time, creating millions of records that are then discarded.

When I asked what the heck was a "count (distinct id)" the answer was that the table was a "sub-table" so a "mother table" was referencing multiple records from the "sub-table".

This is a typical example when de-normalization is a necessity. Adding the 'sum' fields to the 'mother-table' you only need to read ONE record from that one instead of running a long query on the 'sub-table'.

The problem with this type of things is that the solution is simple: redesign the data structure to optimize for use and not for theory and rebuild the procedure that fill in the data. Now this is SIMPLE but not EASY. Actually is a lot difficult to do, since the first thing you have to do is to admit that you made a fucking mess during the design of the database.

And "we made a fucking mess in the design" also means "then you fix it for free" and obviously the geniuses of the dataprocessor weren't very happy about it. Luckily (for them) they were helped by an unexpected ally: Oracle.

Yes, since the db was Oracle. And Oracle apply a license PER PHYSICAL PROCESSOR. This means that if your database require an hardware upgrade, you need to pay higher licenses. And obviously, redesigning the db required increasing the database size again!

In the end, it was decided that the "best" thing to do was to schedule a script that performed a copy with data-reduction from the main database to another database (SQL Server) in $weeatdatabases's local network, so they could run the queries directly there.

So from a "simple" solution to one that was a lot more complicated.

Davide
07/08/2018 16:24

Previous Next

Comments are added when and more important if I have the time to review them and after removing Spam, Crap, Phishing and the like. So don't hold your breath. And if your comment doesn't appear, is probably becuase it wasn't worth it.

6 messages this document does not accept new posts
Arroz conPollo By Arroz conPollo - posted 27/08/2018 09:48

"del gioco preferito degli olandesi (puntare il dito contro qualcun altro)": tutto il mondo è paese, eh? Gli olandesi l'hanno copiato dagli italiani, ecché, proprio come hanno anche importato l'Ufficio Complicazione Affari Semplici.

--
Arroz conPollo


Davide Busato By Davide Busato - posted 28/08/2018 20:25

@theBoss (Davide per gli amici), grazie per il suggerimento sul libro, sto cercando di imparare un po'come funzionano i database ma non essendo il mio mestiere cerco qualcosa di "umano" e non troppo legato a un software specifico.

--
Davide Busato


emi_ska By emi_ska - posted 29/08/2018 09:07

Quello che mi sembra allucinante e' che l'id non fosse univoco... Che senso ha???

Ciao BigD, e' sempre un piacere leggerti!!

Emiliano

--
emi_ska


Anonymous coward@ emi_ska By Anonymous coward - posted 30/08/2018 11:34

 

Quello che mi sembra allucinante e' che l'id non fosse univoco... Che senso ha???

Ciao BigD, e' sempre un piacere leggerti!!

Emiliano

 

Probabilmente in questo caso l'id era una foreign key in una relazione 1 a molti.

--
Anonymous coward


Messer Franz By Messer Franz - posted 30/08/2018 18:17

Vogliamo parlare di quando  due ditte si fondono e vogliono unire i db interni che hanno tabelle con nomi praticamente uguali ma con campi che nessuno si ricorda a che cavolo servono e con query scritte in lingua di mordor (e tu non hai più il ring per decodificarle)?

Sono stato in una ditta dove avevano un sito con - cica - 15/20 "argomenti", cose da metter nel db (tipo: utenti, articoli, ditte...) e il db era di (non scherzo) 542 tabelle (o 546, cambia poco)  e ci lavoravano cira in 7-10 da 5 anni, mentre io l'avrei fatto da solo in 6 mesi...ma va detto che erano ingegneri e che lavoravano in c#...e ho detto tutto...

--
Messer Franz


Anonymous coward By Anonymous coward - posted 10/09/2018 15:51

> Questo e' un classico caso in cui e' bene de-normalizzare

 

mah a volte basta una procedura con un for e un accumulatore per tirarsi fuori dai guai

--
Anonymous coward


6 messages this document does not accept new posts

Previous Next


This site is made by me with blood, sweat and gunpowder, if you want to republish or redistribute any part of it, please drop me (or the author of the article if is not me) a mail.


This site was composed with VIM, now is composed with VIM and the (in)famous CMS FdT.

This site isn't optimized for vision with any specific browser, nor it requires special fonts or resolution.
You're free to see it as you wish.

Web Interoperability Pleadge Support This Project
Powered By Gojira