21 July 2008

SAVEPOINTs just got more problematic

A few days ago I reported on a problem with savepoints, but at the time it appeared to be nothing more than a spurious warning message. Well, I just found out that InnoDB can actually lose the savepoints! The overall transaction can still be rolled back successfully, but the savepoints themselves are completely gone. This only occurs after the "Got error 153 from storage engine" warning happens twice in one transaction (if it only happens once, the savepoints still work), which requires setting alternating savepoints six times. See the full report here: http://bugs.mysql.com/bug.php?id=38187.
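For anyone who wants to reproduce it, here is a rough sketch of the sequence (it simply extends the test case from the 16 July post below, using the same foo table; exactly which SAVEPOINT statements raise the warning may vary):

begin;
insert into foo values();
savepoint a1;
insert into foo values();
savepoint a2;
insert into foo values();
savepoint a1;
insert into foo values();
savepoint a2; -- the first "Got error 153 from storage engine" warning
insert into foo values();
savepoint a1;
insert into foo values();
savepoint a2; -- the second warning; after this the savepoints are lost
rollback to savepoint a1; -- now fails, reporting that SAVEPOINT a1 does not exist
rollback; -- the transaction as a whole still rolls back successfully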

16 July 2008

Error 153 when creating savepoints

One of our developers ran into the strangest error a few days ago, one for which neither the MySQL manual nor a Google search turned up any information. When he called certain stored procedures in a certain order, he got "Error 153 returned from storage engine." It has been reduced to a very simple test case, and a bug report has been filed. Here's the test case:
DROP TABLE IF EXISTS foo;
CREATE TABLE `foo` (
`a` int(11) NOT NULL auto_increment,
PRIMARY KEY (`a`)
) ENGINE=InnoDB;

begin;
insert into foo values();
savepoint a1;
insert into foo values();
savepoint a2;
insert into foo values();
savepoint a1;
insert into foo values();
savepoint a2;
show warnings;
rollback;
The very last "savepoint" statement generates the warning "Got error 153 from storage engine". Do any other MySQL'ers use savepoints, and have you been affected by this?

15 July 2008

MyCAT beta 0.3.0 released

After a long time spent distracted by other projects, I have finally added the third component to the MyCAT project: binlog_mon, a binary log manager for MySQL.

The main feature of this tool is that it has two disk usage thresholds which determine when it purges your binary logs:
  • a lower, "nominal", threshold above which a binary log will be purged if-and-only-if none of the replication slaves are still reading it,
  • and a higher, "critical", threshold at which the behavior is configurable.
It can simply send you an alert if disk usage is above the critical threshold and the oldest file is still needed - or it can purge one file, all files until usage drops below the critical level, or all files until usage drops below the nominal level. (Other options could be added fairly easily.) The "critical" behavior is so configurable because purging a binary log while a slave is still reading it will most likely corrupt that slave, but in some (probably most) cases that is better than filling up the log disk and locking up the master database!
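For context, the purge itself comes down to a single statement on the master; a rough sketch (the log file name is a placeholder - binlog_mon picks the cutoff after checking how far each slave has read):

show master logs; -- list the binary logs currently on the master
purge master logs to 'mysql-bin.000042'; -- remove every log older than this file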

For those not familiar with the MyCAT project, it is an open-source Perl toolset for managing MySQL/Linux servers that I wrote initially for my own use, but after a few people asked for similar tools I decided to publish it. Currently it is composed of three programs: one that monitors replication (rep_mon), one that monitors and rotates binary logs (binlog_mon), and one that allows remote shell and scp access to predefined groups of servers (rcall). This third tool is great for simplifying tasks such as syncing a new httpd.conf file and running a rolling restart across a large web farm. All three tools read the same XML configuration file, which defines servers and their properties (such as which servers have MySQL, which volume the binlogs are on and what the nominal/critical levels are, what an acceptable replication lag is, whether or not exceeding that lag is important enough to send you an email, etc.). It really does a lot of things useful to any Linux or MySQL admin working in a LAMP environment, and I encourage anyone to send me feedback or feature requests.

MyCAT wiki
MyCAT on Sourceforge


Cheers,
Devananda

13 July 2008

How to replicate LOAD DATA from 4.1 to 5.1

It turns out that there is a bug that causes replication on a 5.1 slave to fail when reading LOAD DATA statements from a 4.1-or-earlier master.

About a month ago, I set up a data warehouse for an old 4.1.21 system that can't be upgraded. I decided to use the latest 5.1 beta to take advantage of partitioning, but within the first day replication stopped rather suddenly with a nondescript error:
[ERROR] Error running query, slave SQL thread aborted.
Fix the problem, and restart the slave SQL thread with
"SLAVE START". We stopped at log 'xxx' position xxx
Using mysqlbinlog, I compared the relay log and the master's binary log around that position, and noticed the relay log appeared corrupt! I don't recall the exact steps I took, but I isolated the problem to a LOAD DATA INFILE statement, and filed my first issue with MySQL Enterprise support.

The great folks at MySQL Support had a response for me within 30 minutes: it is a known bug, documented at http://bugs.mysql.com/bug.php?id=31240. Since skipping a major version is not a recommended practice, fixing it wasn't a priority for them. Thankfully there is a fairly easy workaround - create a "blackhole relay" to translate the logs into a format 5.1 can read.

In essence, since 5.0 can read logs from 4.1, and 5.1 can read from 5.0, stick a 5.0 server in the middle and save on resources by converting all its tables to the blackhole engine. The steps to do this are pretty simple, and the hardest [optional] part is running two versions of mysql side by side on the same server. mysqld_multi does a good job of managing multiple instances, and since that is fairly well documented - and it would work just as well to have the relay on its own box - I'll skip the details of my setup and get straight to the replication configuration.
  1. Use your preferred method to initialize the data on the new 5.1 slave, but do not START SLAVE yet. You will need the initial replication position later, however.
  2. Use mysqldump --no-data to get just the DDL from the 4.1 master. Load this into the 5.0 server.
  3. Convert the tables on the 5.0 relay that you want to replicate to the blackhole engine. Of course, don't do this to the mysql.* tables. I used a cheesy bash script:

    # $SCHEMA holds the name of the database whose tables are being converted
    for t in `mysql $SCHEMA -NBre 'show tables'`; do
        echo "alter table $SCHEMA.$t engine=blackhole;" >> runme.sql
    done
    cat runme.sql | mysql

  4. Tell the blackhole relay to read from your 4.1 master at the point in time that you initialized the 5.1 slave. If you created the 5.1 slave using "mysqldump --master-data --single-transaction", then the positions you want can be read from the first few lines of mysqldump's output (see the sketch just after this list).
  5. Set the 5.1 slave to read from the blackhole relay, making sure that it does not read all those ALTER statements. I suggest a quick RESET MASTER on the relay.
  6. Start replication on both the 5.0 and 5.1 servers!
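For what it's worth, the plumbing behind steps 4 through 6 is just a pair of CHANGE MASTER TO statements; here is a rough sketch (host names, credentials, and log coordinates are placeholders - take the real values from your mysqldump output and from the relay):

-- on the 5.0 blackhole relay: point it at the 4.1 master
change master to
master_host='master41.example.com',
master_user='repl',
master_password='secret',
master_log_file='master-bin.000123', -- coordinates recorded when the 5.1 slave was initialized
master_log_pos=4567;
reset master; -- start the relay's own binary log fresh, without all those ALTER statements

-- on the 5.1 slave: point it at the relay's first fresh binary log
change master to
master_host='relay50.example.com',
master_user='repl',
master_password='secret',
master_log_file='relay50-bin.000001',
master_log_pos=4;

-- then start replication on both the relay and the slave
start slave;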
The blackhole relay won't use any (significant) disk space for data files (since it only stores the DDL locally) and it won't use much CPU since it isn't maintaining any indexes. However, you do need to rotate its binary logs, and the addition of a relay between the real master and slave does create a little delay before data propagates to the warehouse.

I've been running this configuration for over a month now with no problems :)

11 July 2008

Performance of JOIN using columns of different numeric types

I know it's been a very long time since I posted anything, but I felt the itch today. A question came up earlier this week at work: How much is JOIN performance affected when using columns that are not the exact same data type?

This is important for me because entity-attribute-value tables require a lot of self-joins. Let me start by saying that we mitigate one of the common drawbacks to EAVs - mashing diverse data types into a single column - by separating numeric, date, and character data into different tables. However, we mashed a lot of integer data into a DECIMAL(13,4) column right alongside our financial data. I recently noticed that most of the data in this EAV table has no fractional part, and to determine whether it would be worth moving it all into another table - as well as determine what column type to use - I spent an afternoon running SHOW PROFILE. Actually, I wrote a Perl script to do it for me :)

The table I used has about 700k rows, and the self join I used matched 5% of these. The test system is pretty fast and had plenty of available buffer pool, so everything should have been in memory. The query used in all tests, modeled after ones commonly used in our production systems, was:
SELECT y.value
FROM test AS x JOIN test AS y ON x.value = y.entity_id
WHERE x.atr_id = 123 AND y.atr_id = 456;
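For reference, here is a rough sketch of the test table itself (not shown above); the column and index definitions are approximations inferred from the query and the EXPLAIN output below, and the value column was re-declared with each type under test:

CREATE TABLE `test` (
`entity_id` int(10) unsigned NOT NULL,
`atr_id` int(10) unsigned NOT NULL,
`value` int(10) unsigned NOT NULL,
PRIMARY KEY (`entity_id`,`atr_id`),
KEY `atr_id` (`atr_id`,`value`)
) ENGINE=InnoDB;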
I found it very interesting that EXPLAIN gives slightly different results when the column types do not match. When both columns are the same (uint) type ...

+----+-------------+-------+------+----------------+--------+---------+------------------------+------+-------------+
| id | select_type | table | type | possible_keys  | key    | key_len | ref                    | rows | Extra       |
+----+-------------+-------+------+----------------+--------+---------+------------------------+------+-------------+
|  1 | SIMPLE      | y     | ref  | PRIMARY,atr_id | atr_id | 4       | const                  |  747 | Using index |
|  1 | SIMPLE      | x     | ref  | atr_id         | atr_id | 8       | const,temp.y.entity_id |    4 | Using index |
+----+-------------+-------+------+----------------+--------+---------+------------------------+------+-------------+

... and when one column is uint and the other is anything else ...
+----+-------------+-------+------+----------------+--------+---------+------------------------+------+--------------------------+
| id | select_type | table | type | possible_keys  | key    | key_len | ref                    | rows | Extra                    |
+----+-------------+-------+------+----------------+--------+---------+------------------------+------+--------------------------+
|  1 | SIMPLE      | y     | ref  | PRIMARY,atr_id | atr_id | 4       | const                  |  747 | Using index              |
|  1 | SIMPLE      | x     | ref  | atr_id         | atr_id | 8       | const,temp.y.entity_id |  358 | Using where; Using index |
+----+-------------+-------+------+----------------+--------+---------+------------------------+------+--------------------------+

This looks bad. I think it's the optimizer not handling the mix of data types well, but I'm not sure. On to the tests...

The INFORMATION_SCHEMA.PROFILING table stores data about the last 15 queries, so I wrote a Perl script to execute the JOIN 15 times, then run "SELECT SUM(duration) / 15 FROM INFORMATION_SCHEMA.PROFILING" to get the average of those 15, then repeat all of that 50 times and record the average of averages. I ran that script at least 5 times per column type and recorded its output. This gave very consistent results - the standard deviation between executions of the same test was less than 2% of the mean query time.
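In plain SQL, one measurement cycle amounts to roughly the following (a sketch only - the Perl script just automates the repetition and the bookkeeping):

set profiling = 1;
set profiling_history_size = 15; -- 15 is the default, but be explicit

-- run the test query 15 times in a row (only one execution shown here)
SELECT y.value
FROM test AS x JOIN test AS y ON x.value = y.entity_id
WHERE x.atr_id = 123 AND y.atr_id = 456;

-- then average the durations of those 15 executions
SELECT SUM(duration) / 15 FROM INFORMATION_SCHEMA.PROFILING;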

Each query returned a few thousand rows, so to get an idea of how much the data type affected the time to read and transfer the result set (the data's logical values never changed - everything in the test table was a positive integer less than 2^31), I reran all tests selecting "AVG(y.value)" instead of "y.value".

I compared joining unsigned int to: unsigned int, signed int, signed bigint, decimal(11,0), and decimal(13,4). I also compared joining bigint to bigint and dec(13,4) to dec(13,4). Here are the results:


(Times are in seconds; "% slower" is relative to the uint:uint baseline.)

                     |       y.value        |     avg(y.value)
join types           |  seconds   % slower  |  seconds   % slower
---------------------+----------------------+----------------------
uint:uint            |  0.027293     -      |  0.017480     -
uint:int             |  0.028337     4%     |  0.018221     4%
uint:bigint          |  0.029849     9%     |  0.019477    11%
bigint:bigint        |  0.028625     5%     |  0.018471     6%
uint:dec(11,0)       |  0.042863    57%     |  0.029739    70%
uint:dec(13,4)       |  0.044909    65%     |  0.030188    73%
dec(13,4):dec(13,4)  |  0.040816    50%     |  0.024387    40%
uint:char            |  0.038499    41%     |  0.027920    63%
char:char            |  0.029010     6%     |  0.019214    12%

I didn't expect AVG(y.value) to be so much faster than simply selecting the data. If I had to guess, I would say MySQL is probably calculating the average as it reads the index, or at some other "lower" layer that is evaluated before reading the result set out to the client. That's just a guess though, and hopefully someone that knows better will correct me if I'm wrong :)

I was not surprised to see that joining INT to DECIMAL is a great deal slower than joins between any of the INT types, and it's good to know just how much slower it is. Even though decimal(13,4) is only 6 bytes, it is significantly slower to join two columns of this type than two bigint columns. I wonder why....

In any case, the answer for us is pretty clear: anything that doesn't need to store fractional data has got to be moved out of DECIMAL storage!


P.S.
Stay tuned, I have a new release of MyCAT coming out very soon which will include a binary log rotation program that works alongside rep_mon!


Updated 7/12/2008 10:30am to include uint:char and char:char test results.