11 July 2008

Performance of JOIN using columns of different numeric types

I know it's been a very long time since I posted anything, but I felt the itch today. A question came up earlier this week at work: How much is JOIN performance affected when using columns that are not the exact same data type?

This is important for me because entity-attribute-value tables require a lot of self-joins. Let me start by saying that we mitigate one of the common drawbacks to EAVs - mashing diverse data types into a single column - by separating numeric, date, and character data into different tables. However, we mashed a lot of integer data into a DECIMAL(13,4) column right alongside our financial data. I recently noticed that most of the data in this EAV table has no fractional part, and to determine whether it would be worth moving it all into another table - as well as determine what column type to use - I spent an afternoon running SHOW PROFILE. Actually, I wrote a Perl script to do it for me :)

The table I used has about 700k rows, and the self-join I used matched 5% of these. The test system is pretty fast and had plenty of available buffer pool, so everything should have been in memory. The query used in all tests, modeled after ones commonly used in our production systems, was:
SELECT y.value
FROM test AS x JOIN test AS y ON x.value = y.entity_id
WHERE x.atr_id = 123 AND y.atr_id = 456;
I found it very interesting that EXPLAIN gives slightly different results when the column types do not match. When both columns are the same (uint) type ...

+----+-------------+-------+------+----------------+--------+---------+------------------------+------+-------------+
| id | select_type | table | type | possible_keys  | key    | key_len | ref                    | rows | Extra       |
+----+-------------+-------+------+----------------+--------+---------+------------------------+------+-------------+
|  1 | SIMPLE      | y     | ref  | PRIMARY,atr_id | atr_id | 4       | const                  |  747 | Using index |
|  1 | SIMPLE      | x     | ref  | atr_id         | atr_id | 8       | const,temp.y.entity_id |    4 | Using index |
+----+-------------+-------+------+----------------+--------+---------+------------------------+------+-------------+

... and when one column is uint and the other is anything else ...
+----+-------------+-------+------+----------------+--------+---------+------------------------+------+--------------------------+
| id | select_type | table | type | possible_keys  | key    | key_len | ref                    | rows | Extra                    |
+----+-------------+-------+------+----------------+--------+---------+------------------------+------+--------------------------+
|  1 | SIMPLE      | y     | ref  | PRIMARY,atr_id | atr_id | 4       | const                  |  747 | Using index              |
|  1 | SIMPLE      | x     | ref  | atr_id         | atr_id | 8       | const,temp.y.entity_id |  358 | Using where; Using index |
+----+-------------+-------+------+----------------+--------+---------+------------------------+------+--------------------------+

This looks bad. My guess is that the optimizer doesn't handle comparisons between values of different types as well, but I'm not sure. On to the tests...

The INFORMATION_SCHEMA.PROFILING table stores data about the last 15 queries, so I wrote a Perl script that executes the JOIN 15 times, runs "SELECT SUM(duration) / 15 FROM INFORMATION_SCHEMA.PROFILING" to get the average of those 15, then repeats all of that 50 times and records the average of averages. I ran that script at least 5 times per column type and recorded its output. This gave very consistent results - the standard deviation between executions of the same test was less than 2% of mean query time.
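The Perl script itself isn't shown here, but the averaging scheme it implements can be sketched in a few lines of Python (this is a rough analogue for illustration, not the actual script):

```python
# Rough Python analogue of the Perl harness described above; the real
# script isn't shown, so this sketches the averaging scheme only.
import statistics

def mean_of_batch_means(durations, batch_size=15):
    """Average raw query durations in batches of `batch_size` (mirroring
    SELECT SUM(duration) / 15 FROM INFORMATION_SCHEMA.PROFILING), then
    return the average of those batch averages."""
    batches = [durations[i:i + batch_size]
               for i in range(0, len(durations), batch_size)]
    return statistics.mean(statistics.mean(b) for b in batches)

def is_consistent(run_means, tolerance=0.02):
    """True when repeated runs of the same test agree: the standard
    deviation is under `tolerance` (2%) of the mean run time."""
    return statistics.stdev(run_means) < tolerance * statistics.mean(run_means)
```

The consistency check here is just the 2%-of-mean threshold mentioned above, applied to the per-run averages.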

Each query returned a few thousand rows, so to get an idea of how much the data type affected the time to read and transfer the data (even though the data's logical values never changed - everything in the test table was a positive integer less than 2^31), I reran all tests selecting "AVG(y.value)" instead of "y.value".

I compared joining unsigned int to: unsigned int, signed int, signed bigint, decimal(11,0), and decimal(13,4). I also compared joining bigint to bigint and dec(13,4) to dec(13,4). Here are the results:


join types            y.value    % slower   avg(y.value)   % slower
uint:uint             0.027293       -      0.017480           -
uint:int              0.028337       4%     0.018221           4%
uint:bigint           0.029849       9%     0.019477          11%
bigint:bigint         0.028625       5%     0.018471           6%
uint:dec(11,0)        0.042863      57%     0.029739          70%
uint:dec(13,4)        0.044909      65%     0.030188          73%
dec(13,4):dec(13,4)   0.040816      50%     0.024387          40%
uint:char             0.038499      41%     0.027920          63%
char:char             0.029010       6%     0.019214          12%

(times in seconds; "% slower" is relative to uint:uint)
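The "% slower" rows are just the raw seconds expressed relative to the uint:uint baseline; a small Python sketch of that arithmetic (baseline values copied from the table above):

```python
# Baseline seconds for the uint:uint join, taken from the table above.
BASELINE = {"y.value": 0.027293, "avg(y.value)": 0.017480}

def pct_slower(seconds, column="y.value"):
    """Percent slower than the uint:uint baseline, rounded to a whole percent."""
    return round((seconds / BASELINE[column] - 1) * 100)

print(pct_slower(0.042863))                  # uint:dec(11,0), y.value -> 57
print(pct_slower(0.029739, "avg(y.value)"))  # uint:dec(11,0), avg     -> 70
```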


I didn't expect AVG(y.value) to be so much faster than simply selecting the data. If I had to guess, I would say MySQL is probably calculating the average as it reads the index, or at some other "lower" layer that is evaluated before the result set is read out to the client. That's just a guess though, and hopefully someone who knows better will correct me if I'm wrong :)

I was not surprised to see that joining INT to DECIMAL is a great deal slower than joins between any of the INT types, and it's good to know just how much slower it is. Even though decimal(13,4) is only 6 bytes, it is significantly slower to join two columns of this type than two bigint columns. I wonder why....
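As a loose analogy only - this uses Python's decimal module, not MySQL's internal DECIMAL representation - fixed-point comparisons simply do more work than a single machine-word integer compare, which at least makes the direction of the slowdown unsurprising:

```python
import timeit
from decimal import Decimal

# Time a million equality checks on native ints vs fixed-point Decimals.
# This only illustrates the general cost difference between machine-word
# and fixed-point comparison; MySQL's DECIMAL internals are their own thing.
a, b = 12345678, 12345679
da, db = Decimal("1234.5678"), Decimal("1234.5679")

int_time = timeit.timeit(lambda: a == b, number=1_000_000)
dec_time = timeit.timeit(lambda: da == db, number=1_000_000)
print(f"int compare: {int_time:.3f}s   decimal compare: {dec_time:.3f}s")
```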

In any case, the answer for us is pretty clear: anything that doesn't need to store fractional data has got to be moved out of DECIMAL storage!


P.S.
Stay tuned, I have a new release of MyCAT coming out very soon which will include a binary log rotation program that works alongside rep_mon!


Updated 7/12/2008 10:30am to include uint:char and char:char test results.

4 comments:

Roland Bouman said...

Hi Devananda,

good to see you posting again :)

"Even though decimal(13,4) is only 6 bytes, it is significantly slower to join two columns of this type than two bigint columns. I wonder why...."

I think it is because internally, DECIMAL uses a kind of string representation. My guess is that this makes comparisons by numerical value slow.

I think your results clearly show that even joining on identical DECIMAL types is a good deal slower than joining over integers.

Devananda vdv said...

Roland,

Thanks for the encouragement :)

"I think it is because internally, DECIMAL uses a kind of string representation."

I thought that since MySQL 5.0.3, decimal has been stored in a binary format? Is it actually just a compressed string, and not in fact numeric? Guess I should dive into the source and find out for myself :)

Out of curiosity, I repeated the same tests joining uint:char(6) and char(6):char(6), to see how that stacks up to dec. I updated the original post with those results, but in short:
- char:char was 6% and 12% slower than uint:uint for the "SELECT value" and "SELECT avg(value)" tests, respectively,
- uint:char was 41% and 63% slower than uint:uint, respectively.

Considering that the char:char JOIN was almost as fast as uint:uint, whereas uint:char shows a very similar performance hit to all the tests involving decimal, I wonder if mysql is performing some sort of typecast even for dec:dec JOINs?

Anonymous said...
This comment has been removed by a blog administrator.
Roland Bouman said...

Hi Devananda,

to tell you the truth, I'd have to take a close look at the source myself to get to the bottom of it. My remark about DECIMAL being internally represented as a string is from recollection of a remark by Jeremy Cole some time ago (I think it was the Heidelberg developer conference, September 2007).