28 April 2009

check your my.cnf with "mysqld --help --verbose"

On the Drizzle-discuss mailing list, Baron just pointed out that MySQL has a means to validate its config file (/etc/my.cnf), similar to "apachectl configtest".

mysqld --help --verbose

Somehow, I never knew that...

Oh, and yes, this means I am following Drizzle. In fact, I'm trying to devote at least a few hours each day to reading the source and finding things to fix. In a weird way, it feels good to be writing C++ again!

19 February 2009

Bug in savepoints getting fixed

It looks like this bug finally got some attention, and a patch is queued.

In brief, the bug happens when you alternate and reuse names of savepoints. Quoting from the bug page:
InnoDB treats the savepoints like a stack, e.g.,

SAVEPOINT a;
SAVEPOINT b;
SAVEPOINT c;
SAVEPOINT b; -- <- will delete c too
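That stack behavior is easy to model. Here is a toy Python sketch of the semantics described in the bug report (an illustration only, not InnoDB code): reusing a savepoint name discards every savepoint set after the original one.

```python
class SavepointStack:
    """Toy model of InnoDB's stack-like savepoint handling."""

    def __init__(self):
        self.stack = []  # savepoint names, oldest first

    def savepoint(self, name):
        if name in self.stack:
            # Reusing a name truncates the stack back to that name,
            # silently discarding every savepoint set after it.
            self.stack = self.stack[: self.stack.index(name)]
        self.stack.append(name)

tx = SavepointStack()
for name in ["a", "b", "c", "b"]:
    tx.savepoint(name)

print(tx.stack)  # ['a', 'b'] -- savepoint c is gone
```

This is also why the alternating a1/a2 pattern from the July 2008 posts loses savepoints: each reuse silently pops everything set after the original name.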

19 January 2009

Tokutek challenge vs. 128GB RAM

Like a few other folks, I decided to play with iiBench and see how fast I could insert 1B rows into an indexed InnoDB table. Mark Callaghan published an excellent writeup of the theoretical and practical limits when index size >> buffer pool... but I bent the rules a bit ;)

The platform I ran iiBench on has four dual-core 3.2GHz Xeons, an internal 8-disk 10k RPM RAID 10, and 128GB of RAM. That's a lot, I know, but it was built precisely to avoid InnoDB's performance drop when an index does not fit in memory. It took just 22 hours to run a single iiBench process, and 10 hours to run four processes in parallel (each doing 250M rows)! This was achieved without modifications to the iiBench code, using standard MySQL 5.1.30 binaries with a tuned InnoDB configuration and binary logging enabled (I had to purge master logs very frequently!).

Here's a graph of inserts/sec and CPU usage during the single-process test ...

... and during the multi-process test ...

I'll leave any analysis of these results to those more knowledgeable than myself, but close with a question: why does insert performance decrease linearly even when all of the data should fit in memory?

09 January 2009


With the new year and all the changes right now - namely, moving to the Olympic Peninsula (WA) - I've decided to try Twitter. I seem to have lots of ideas to blog about, scratch them down, then never finish the post. I've seen posts by a few other bloggers that I follow(ed) who said the same thing prior to migrating (and they haven't come back...) so I think I'll try the same.

21 July 2008

SAVEPOINTs just got more problematic

A few days ago I reported on a problem with savepoints, but at the time it appeared to be only a spurious warning message. Well, I just found out that InnoDB can actually lose the savepoints! The overall transaction can still be rolled back successfully, but the savepoints themselves are completely gone. This only occurs after the warning "Got error 153 from storage engine" happens twice in one transaction (if it happens only once, the savepoints still work), which requires that alternating savepoints be set six times. See the full report here: http://bugs.mysql.com/bug.php?id=38187.

16 July 2008

Error 153 when creating savepoints

One of our developers ran into the strangest error a few days ago, which neither the MySQL manual nor Google searching could find any information on. When he called certain stored procedures in a certain order, he got "Error 153 returned from storage engine." It's been reduced to a very simple test case, and a bug report has been filed. Here's the test case:
CREATE TABLE `foo` (
`a` int(11) NOT NULL auto_increment,
PRIMARY KEY (`a`)
) ENGINE=InnoDB;

begin;
insert into foo values();
savepoint a1;
insert into foo values();
savepoint a2;
insert into foo values();
savepoint a1;
insert into foo values();
savepoint a2;
show warnings;
The very last "savepoint" generates the warning "Got error 153 from storage engine". Are any other MySQLers using savepoints and affected by this?

15 July 2008

Mycat beta 0.3.0 released

After a very long time distracted with other projects, I finally added the third component of the MyCAT project: binlog_mon, a binary log manager for MySQL.

The main feature of this tool is that it has two disk-usage thresholds that determine when it purges your binary logs:
  • a lower, "nominal", threshold above which binary logs will be purged if-and-only-if no replication slave is still reading them,
  • and a higher, "critical", threshold at which the behavior is configurable.
It can simply send you an alert if disk usage is above critical and the oldest file is still needed, or it can purge one file, all files until usage is below critical, or all files until usage is below nominal. (Other options could be added fairly easily.) The "critical" behavior is so configurable because purging a binary log while a slave is still reading it will most likely corrupt that slave, but in some (probably most) cases that is better than filling up the log disk and locking up the master database!
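The decision logic can be sketched like this (a minimal Python sketch with hypothetical function and parameter names - binlog_mon's actual Perl code and configuration keys may differ):

```python
def purge_action(disk_pct, nominal, critical, oldest_log_needed,
                 critical_behavior="alert"):
    """Decide what to do with the oldest binary log.

    disk_pct: current disk usage of the log volume, in percent.
    nominal/critical: the two thresholds described above.
    oldest_log_needed: True if any slave is still reading the oldest log.
    critical_behavior: what to do above critical when the oldest
        file is still needed ('alert' or 'purge').
    """
    if disk_pct < nominal:
        return "keep"        # plenty of room, nothing to do
    if not oldest_log_needed:
        return "purge"       # above nominal and safe: purge it
    if disk_pct < critical:
        return "keep"        # still needed, and not yet critical
    # Above critical and still needed: configurable trade-off between
    # corrupting a slave and filling up the master's log disk.
    return critical_behavior

print(purge_action(80, 70, 90, oldest_log_needed=True))   # keep
print(purge_action(95, 70, 90, oldest_log_needed=True))   # alert
```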

For those not familiar with the MyCAT project, it is an open-source Perl toolset for managing MySQL/Linux servers that I wrote initially for my own use, but after a few people asked for similar tools I decided to publish it. Currently, it is composed of three programs that: monitor replication (rep_mon), monitor and rotate binary logs (binlog_mon), and allow remote shell and scp access to predefined groups of servers (rcall). This third tool is great for simplifying tasks such as syncing a new httpd.conf file and running a rolling-restart across a large web farm. All three tools read the same XML configuration file which defines servers and their properties (such as which servers have mysql, what volume the binlogs are on and what the nominal/critical levels are, what an acceptable replication "lag" is, and whether or not exceeding that lag is important enough to send you an email, etc). It really does a lot of things useful to any linux or mysql admin working in a LAMP environment, and I encourage anyone to send me feedback or feature requests.

MyCAT wiki
MyCAT on Sourceforge


13 July 2008

How to replicate LOAD DATA from 4.1 to 5.1

It turns out that there is a bug that causes replication on a 5.1 slave to fail when reading LOAD DATA statements from a 4.1-or-earlier master.

About a month ago, I set up a data warehouse for an old 4.1.21 system that can't be upgraded. I decided to use the latest 5.1 beta to take advantage of partitioning, but within the first day replication stopped rather suddenly and with a nondescript error:
[ERROR] Error running query, slave SQL thread aborted.
Fix the problem, and restart the slave SQL thread with
"SLAVE START". We stopped at log 'xxx' position xxx
Using mysqlbinlog, I compared the relay log and the master's binary log around that position, and noticed the relay log appeared corrupt! I don't recall the exact steps I took, but I isolated the problem to a LOAD DATA INFILE statement and filed my first issue with MySQL Enterprise support.

The great folks at MySQL Support had a response for me within 30 minutes: it is a known bug, documented at http://bugs.mysql.com/bug.php?id=31240. Since skipping a major version is not a recommended practice, fixing it wasn't a priority for them. Thankfully there is a fairly easy workaround - create a "blackhole relay" to translate the logs into a format 5.1 can read.

In essence, since 5.0 can read logs from 4.1, and 5.1 can read logs from 5.0, stick a 5.0 server in the middle and save on resources by converting all of its tables to the blackhole engine. The steps to do this are pretty simple, and the hardest [optional] part is running two versions of mysql side by side on the same server. mysqld_multi does a good job of managing multiple instances, and since that is fairly well documented - and it would work just as well to have the relay on its own box - I'll skip the details of my setup and get straight to the replication configuration.
  1. Use your preferred method to initialize the data on the new 5.1 slave, but do not START SLAVE yet. You will need the initial replication position later, however.
  2. Use mysqldump --no-data to get just the DDL from the 4.1 master. Load this into the 5.0 server.
  3. Convert the tables on the 5.0 relay that you want to replicate to the blackhole engine. Of course, don't do this to the mysql.* tables. I used a cheesy bash script:

    for t in `mysql $SCHEMA -NBre 'show tables'`; do
      echo "alter table $SCHEMA.$t engine=blackhole;" >> runme.sql;
    done
    cat runme.sql | mysql

  4. Tell the blackhole relay to read from your 4.1 master at the point-in-time that you initialized the 5.1 slave. If you created the 5.1 slave by using "mysqldump --master-data --single-transaction", then the positions you want can be read from the first few lines of the output of mysqldump.
  5. Set the 5.1 slave to read from the blackhole relay, making sure that it does not read all those ALTER statements. I suggest a quick RESET MASTER on the relay.
  6. Start replication on both the 5.0 and 5.1 servers!
The blackhole relay won't use any (significant) disk space for data files (since it only stores the DDL locally), and it won't use much CPU since it isn't maintaining any indexes. However, you do need to rotate its binary logs, and the addition of a relay between the real master and slave does create a little delay before data propagates to the warehouse.
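For reference, the relevant part of the relay's my.cnf might look something like this (a sketch with made-up ids and paths, not my actual config; the essential line is log-slave-updates, without which the relay would not write the statements it replicates into its own binary log for the 5.1 slave to read):

```ini
[mysqld]
server-id = 2                          # unique among master, relay, and slave
log-bin   = /var/log/mysql/relay-bin   # the 5.1 slave reads from this log
log-slave-updates                      # re-log statements replicated from 4.1
default-storage-engine = BLACKHOLE     # any newly created tables discard data
```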

I've been running this configuration for over a month now with no problems :)

11 July 2008

Performance of JOIN using columns of different numeric types

I know it's been a very long time since I posted anything, but I felt the itch today. A question came up earlier this week at work: How much is JOIN performance affected when using columns that are not the exact same data type?

This is important for me because entity-attribute-value tables require a lot of self-joins. Let me start by saying that we mitigate one of the common drawbacks to EAVs - mashing diverse data types into a single column - by separating numeric, date, and character data into different tables. However, we mashed a lot of integer data into a DECIMAL(13,4) column right alongside our financial data. I recently noticed that most of the data in this EAV table has no fractional part, and to determine whether it would be worth moving it all into another table - as well as determine what column type to use - I spent an afternoon running SHOW PROFILE. Actually, I wrote a Perl script to do it for me :)

The table I used has about 700k rows, and the self-join I used matched 5% of these. The test system is pretty fast and had plenty of available buffer pool, so everything should have been in memory. The query used in all tests, modeled after ones commonly used in our production systems, was:
SELECT y.value
FROM test AS x JOIN test AS y ON x.value = y.entity_id
WHERE x.atr_id = 123 AND y.atr_id = 456;
I found it very interesting that EXPLAIN gives slightly different results when the column types do not match. When both columns are the same (uint) type ...

| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | y | ref | PRIMARY,atr_id | atr_id | 4 | const | 747 | Using index |
| 1 | SIMPLE | x | ref | atr_id | atr_id | 8 | const,temp.y.entity_id | 4 | Using index |

... and when one column is uint and the other is anything else ...
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | y | ref | PRIMARY,atr_id | atr_id | 4 | const | 747 | Using index |
| 1 | SIMPLE | x | ref | atr_id | atr_id | 8 | const,temp.y.entity_id | 358 | Using where; Using index |

This looks bad. I think it's the optimizer not handling data of different types well, but I'm not sure. On to the tests...

The INFORMATION_SCHEMA.PROFILING table stores data about the last 15 queries, so I wrote a Perl script to execute the JOIN 15 times, run "SELECT SUM(duration) / 15 FROM INFORMATION_SCHEMA.PROFILING" to get the average of those 15, repeat all of that 50 times, and record the average of averages. I ran the script at least 5 times per column type and recorded its output. This gave very consistent results - the standard deviation between executions of the same test was less than 2% of the mean query time.
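In outline, that measurement loop looks like this - a Python sketch rather than the actual Perl script, with a hypothetical execute() stubbed to return canned durations so the aggregation logic can run stand-alone:

```python
import random

random.seed(1)

def execute(sql):
    """Stand-in for a real DB call. The JOIN result is discarded; the
    PROFILING query returns one duration per profiled query. In the real
    script this would go over a DBI/DB-API connection with profiling on."""
    if "PROFILING" in sql:
        return [random.gauss(0.028, 0.0004) for _ in range(15)]
    return None

def profile_query(join_sql, batches=50, buffer_size=15):
    """Average of per-batch averages, as described above."""
    per_batch = []
    for _ in range(batches):
        for _ in range(buffer_size):  # fill the 15-query PROFILING buffer
            execute(join_sql)
        durations = execute("SELECT duration FROM INFORMATION_SCHEMA.PROFILING")
        per_batch.append(sum(durations) / len(durations))
    return sum(per_batch) / len(per_batch)

print(f"{profile_query('SELECT ... JOIN ...'):.6f}")
```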

Each query returned a few thousand rows, so to get an idea how much the data type affected the time to read and transfer the data (even though the data's logical values never changed, everything in the test table was a positive integer less than 2^31), I reran all tests selecting "AVG(y.value)" instead of "y.value".

I compared joining unsigned int to: unsigned int, signed int, signed bigint, decimal(11,0), and decimal(13,4). I also compared joining bigint to bigint and dec(13,4) to dec(13,4). Here are the results:

y.value avg(y.value)
seconds 0.027293 0.017480
seconds 0.028337 0.018221
% slower 4% 4%
seconds 0.029849 0.019477
% slower 9% 11%
seconds 0.028625 0.018471
% slower 5% 6%
seconds 0.042863 0.029739
% slower 57% 70%
seconds 0.044909
% slower 65%
seconds 0.040816 0.024387
% slower 50% 40%

seconds 0.038499 0.027920
% slower 41% 63%

seconds 0.029010 0.019214
% slower 6% 12%

I didn't expect AVG(y.value) to be so much faster than simply selecting the data. If I had to guess, I would say MySQL is probably calculating the average as it reads the index, or at some other "lower" layer that is evaluated before reading the result set out to the client. That's just a guess though, and hopefully someone that knows better will correct me if I'm wrong :)

I was not surprised to see that joining INT to DECIMAL is a great deal slower than joins between any of the INT types, and it's good to know just how much slower it is. Even though decimal(13,4) is only 6 bytes, it is significantly slower to join two columns of this type than two bigint columns. I wonder why....

In any case, the answer for us is pretty clear: anything that doesn't need to store fractional data has got to be moved out of DECIMAL storage!

Stay tuned, I have a new release of MyCAT coming out very soon which will include a binary log rotation program that works alongside rep_mon!

Updated 7/12/2008 10:30am to include uint:char and char:char test results.

20 January 2006

Think I found a bug ....

A couple of days ago, a mistake caused a key-value table to generate new keys for duplicate values; there should have been a unique constraint on the value column. This affected the stat-gathering table (stats for one value are now associated with many keys), and I have to clean up the data -- while preserving the stats that have already been generated. In trying to do this, I think I found a bug in replication: an UPDATE statement that aggregates the data produces different results on my replication slave. I've submitted a formal bug report along with SQL to duplicate the error here. I'm curious whether anyone else has encountered this bug, or knows of a way to achieve the result I want without breaking replication.

Here's a description of the test case:

Two tables store key-value pairs (in the example, `color` and `animal`), and `a_to_c` stores a relationship between each key along with a third column, `count`. However, in the two key-value tables, the values are not unique. The overall purpose is to sum the `count` column in the third table for each unique animal+color combination and remove all duplicate keys from the `color` and `animal` tables. In other words -- because there is no unique constraint on `color` or `animal`, the color "grey" is given more than one key, as is the animal "dog". The goal is to rebuild the `color` and `animal` tables, preserving the lowest key for each distinct value and -- most importantly -- to rebuild the `a_to_c` table, preserving the total sum for each unique combination of animal+color. In the example, I stop short of finishing this process when the first error appears.
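The goal can be sketched in plain code using the example data from the demonstration SQL (an illustration of the intended end state only - the actual cleanup is done in MySQL):

```python
from collections import defaultdict

animals = {1: "dog", 2: "cat", 3: "dog"}
colors = {1: "grey", 2: "beige", 3: "grey", 4: "grey"}
a_to_c = [(1, 2, 1), (3, 2, 1), (1, 1, 1), (1, 3, 1), (2, 3, 2), (2, 4, 2)]

def canonical(kv):
    """Map every key to the lowest key holding the same value."""
    lowest = {}
    for k, v in sorted(kv.items()):
        lowest.setdefault(v, k)
    return {k: lowest[v] for k, v in kv.items()}

a_min, c_min = canonical(animals), canonical(colors)

sums = defaultdict(int)
for a, c, count in a_to_c:
    sums[(a_min[a], c_min[c])] += count  # collapse duplicate keys, sum counts

print(dict(sums))  # {(1, 2): 2, (1, 1): 2, (2, 1): 4}
```

The three totals match the `sums` table below: dog+beige = 2, dog+grey = 2, cat+grey = 4, keyed by the lowest surviving key for each value.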

Here is the demonstration SQL:

-- creating the tables and initializing data

CREATE TABLE `animals` (
`k` int(11) NOT NULL auto_increment,
`v` varchar(10) default NULL,
PRIMARY KEY (`k`)
);

INSERT INTO `animals` VALUES (1,'dog'),(2,'cat'),(3,'dog');

CREATE TABLE `colors` (
`k` int(11) NOT NULL auto_increment,
`v` varchar(10) default NULL,
PRIMARY KEY (`k`)
);

INSERT INTO `colors` VALUES (1,'grey'),(2,'beige'),(3,'grey'),(4,'grey');

CREATE TABLE `a_to_c` (
`animal` int(11) default NULL,
`color` int(11) default NULL,
`count` int(11) default NULL
);

INSERT INTO `a_to_c` VALUES (1,2,1),(3,2,1),(1,1,1),(1,3,1),(2,3,2),(2,4,2);

CREATE TABLE sums SELECT animals.v AS animal, colors.v AS color, SUM(count)
FROM animals JOIN a_to_c ON animals.k=a_to_c.animal
JOIN colors ON colors.k=a_to_c.color
GROUP BY animal, color;

-- Everything up to here works fine.

mysql> select * from sums;
| animal | color | SUM(count) |
| cat | grey | 4 |
| dog | beige | 2 |
| dog | grey | 2 |
3 rows in set (0.00 sec)

-- It is this next query that produces unexpected results on the slave!
-- It does work as expected on the master.

UPDATE sums JOIN (
SELECT v AS animal, MIN(k) AS k FROM animals GROUP BY animal
) AS a USING(animal)
SET sums.animal=a.k;

-- Here are the results of this UPDATE, on master and slave
-- master
mysql> select * from sums;
| animal | color | SUM(count) |
| 2 | grey | 4 |
| 1 | beige | 2 |
| 1 | grey | 2 |
3 rows in set (0.00 sec)

-- slave
mysql> select * from sums;
| animal | color | SUM(count) |
| NULL | grey | 4 |
| 1 | beige | 2 |
| 1 | grey | 2 |
3 rows in set (0.00 sec)

13 November 2005

Added 'rcall' to mycat project

I just finished a couple of days' work documenting and adapting the 'rcall' (Remote Call) tool, and added it to the Mycat project on SourceForge. The readme is available here.

'rcall' is designed to run in non-homogeneous *nix clusters, to ease use of these clusters by creating a single place from which to securely run commands across logical groups of servers.

I'll give some examples... The past two companies I worked for both had separate web (apache) servers, mailing (MTA) servers, and database (MySQL) servers. Using rcall (its previous incarnation, that is), I was able to easily track down an error that was causing seemingly random entries in the databases to have wrong timestamps: `rcall -on web -C 'date'` prints the current date on each web server, making it easy to see which one's clock was wrong. After making a change to apache's configuration, it was easy to restart all the web servers without any visible interruption in service: `rcall -on web -delay 15 -C '/usr/local/apache/bin/apachectl restart'` - the delay helps our load balancer handle the sudden loss of service from each individual server, but collectively there is no downtime. Checking disk utilization across all database servers is as easy as `rcall -on data -C 'df -h | grep -P "data|backup"'`.

`rcall` does not directly relate to MySQL, but it has helped me in my work with MySQL Cluster, and with clusters of MySQL servers (using the term loosely there). Of course, the other tool already in the mycat project is `rep_mon`, which IS directly relevant to MySQL.

Over the coming weeks, I will be adding another tool to this project, namely, `conftool` which creates a single location from which to sync configuration files (and back them up, and restore from backups, and so on) throughout the same groups/clusters of servers that `rcall` and `rep_mon` operate on. After that, I hope to add my database backup system, but I have yet to find a way to "generalize" it so that it might work on someone else's systems without hassle.

I've found these tools indispensable in my own work; I hope that others can find at least some use in them!

01 November 2005

Replication Monitor finished!

Finally! Last night, I put the finishing touches on the documentation, and uploaded to SourceForge. Well, finishing touches for a first alpha, hehe - still, it's a good step! Anyone with Perl installed should be able to download Mycat-0.1.2.tgz, untar, edit the config file (mycat.cnf), and run rep_mon.pl to see the status of replication on their servers.

Mycat-0.1.2 release notes

I will add the "rcall" script next, but I can't decide whether its name comes from "Run Command on all" or "Remote Call"... decisions, decisions... anyhow, rcall relies on ssh key-based remote execution of commands and is not specific to any MySQL installation. It is nonetheless an indispensable tool for me when working with large heterogeneous groups of servers.

26 October 2005

Replication Monitor preview posted

I posted the source to my replication monitor (rep_mon.pl) over at sourceforge, along with a screen shot of it in action. It doesn't look all that fancy, but when you've got a dozen MySQL servers, it sure helps to have one script that checks them all at once!

24 October 2005

Open Source

Just this morning, I received permission to make the tools I've developed available via SourceForge, under a GPL license! Wooo! (heh, sorry, couldn't help myself) I've already set up the project


It will take me a little time to prepare the code, so check back soon!