19 January 2009

Tokutek challenge vs. 128GB RAM

Like a few other folks, I decided to play with iiBench and see how fast I could insert 1B rows into an indexed InnoDB table. Mark Callaghan published an excellent writeup of the theoretical and practical limits when index size >> buffer pool... but I bent the rules a bit ;)

The platform I ran iiBench on has four dual-core 3.2GHz Xeons, an internal 8-disk 10k RPM RAID 10 array, and 128GB of RAM. That's a lot, I know, but it was built precisely to avoid InnoDB's performance drop when an index does not fit in memory. It took just 22 hours to run a single iiBench process, and only 10 hours to run four processes in parallel (each inserting 250M rows)! This was achieved without modifications to the iiBench code, using standard MySQL 5.1.30 binaries, a tuned set of InnoDB configuration parameters, and binary logging enabled (I had to purge master logs very frequently!).
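For scale, the headline numbers above work out to the following sustained insert rates (back-of-the-envelope arithmetic only, not taken from the benchmark output):

```python
# Rough sustained insert rates implied by the run times described above.

def rows_per_sec(rows, hours):
    return rows / (hours * 3600.0)

single = rows_per_sec(1_000_000_000, 22)    # one process: 1B rows in 22 hours
multi = rows_per_sec(4 * 250_000_000, 10)   # four processes: 250M rows each, 10 hours

print(f"single process: ~{single:,.0f} rows/sec")
print(f"four processes: ~{multi:,.0f} rows/sec")
```

That comes out to roughly 12,600 rows/sec for the single process and about 27,800 rows/sec aggregate for the four parallel processes.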

Here's a graph of inserts/sec and CPU usage during the single-process test ...

... and during the multi-process test ...

I'll leave any analysis of these results to those more knowledgeable than myself, but close with a question: why does the insert rate decrease linearly even when all the data should be in memory?


Unknown said...

Updated to add a few more innodb config params, and make the list more readable.

Bradley C. Kuszmaul said...

I'm no InnoDB expert, but I do know about data structures and systems. Here are a few possible reasons for the insertion rate to drop as the benchmark progresses:

1) More nodes are traversed per insertion. The tree is getting bigger, which means it is getting deeper, so it takes more CPU work to traverse the tree to perform a single insertion.

2) More cache misses. As the tree gets larger it fits less well into the caches and TLB, so you get more cache and TLB misses, and the cost of traversing a single node goes up.

3) More instructions per node. For example, perhaps the undo information gets a little more complex as the tree gets bigger. (I'm just making this up, since I don't fully understand InnoDB's undo logic.) Consider what happens when you insert 20 items into an empty tree: all 20 items end up in the same node. But if you insert 20 items into a big tree, they are likely to end up in 20 different nodes. So the undo information becomes "sparse" in the data structure, maybe requiring more work. There may be other effects that add to the instruction count as the tree gets bigger. Thus, in addition to suffering more cache misses per node, you may be executing more instructions per node.
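The first point above can be sketched numerically. Assuming a hypothetical fan-out of a few hundred keys per node (the real InnoDB page fan-out depends on key and row size), the depth of the tree grows with the logarithm of the row count, so each insertion traverses more nodes as the table grows:

```python
import math

def btree_depth(rows, fanout):
    """Approximate number of levels in a B-tree holding `rows` keys
    when every node holds up to `fanout` keys (illustrative only)."""
    if rows <= 1:
        return 1
    return math.ceil(math.log(rows, fanout))

for rows in (10**6, 10**8, 10**9):
    print(f"{rows:>13,} rows -> ~{btree_depth(rows, 300)} levels (fanout=300)")
```

With these assumed numbers the tree goes from roughly 3 levels at 1M rows to 4 levels by 1B rows, so the extra depth alone is modest; the cache-miss effect in point 2 likely matters more.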

Mark Callaghan said...

That is nice hardware. Thanks for the writeup. In addition to the slowdown from the memory system, I wonder if there was a slowdown from flushing dirty pages. You used O_DIRECT rather than buffered IO, so writes are likely to see disk latency rather than being done quickly to the OS buffer cache -- ignoring the impact of a HW RAID write cache. And you used unmodified MySQL, so there is 1 thread that uses sync IO to flush dirty pages. And InnoDB will try to flush ~100 pages per second, even though your hardware can do ~800 writes per second.

When there are too many dirty pages, user transactions will be delayed. There are two metrics by which there can be too many dirty pages. The first is when the max dirty pages percentage is exceeded, and that is not likely here. The second is when some dirty pages have an oldest-modification LSN that is about to fall off the tail of the transaction log, forcing flushing before the log can wrap -- and this is likely to be the case because the max combined size of all transaction log files is 4G.

So, how many log files do you use and how big are they?

You might be able to make this run faster by using the largest possible log files (2 X 2G), a Percona build with support for more background IO threads and setting innodb_io_capacity=1000. But a lot more remains to be done to make InnoDB use all IO capacity.
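Mark's suggestions could be written as a my.cnf fragment like the following. This is a sketch, not a tested configuration: the 2 x 2G log sizing comes straight from his comment, and innodb_io_capacity is not available in stock MySQL 5.1.30 -- it requires a build that supports it, such as the Percona patches he mentions.

```ini
# Largest log configuration InnoDB allows (combined size capped at 4G)
innodb_log_file_size = 2G
innodb_log_files_in_group = 2

# Only with a patched build (e.g. Percona); lets background
# flushing use more of the ~800 writes/sec the array can do
innodb_io_capacity = 1000
```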

Unknown said...

Thanks Mark and Bradley for the insights. Regarding log configuration:

innodb_log_file_size = 1G
innodb_log_files_in_group = 2
innodb_log_buffer_size = 16M

I am using O_DIRECT and disabling the HW write cache because this particular model of RAID controller (MegaRAID SAS 1078) is known to have failures in the battery-backed write cache. It's a nightmare when the RAID controller fails and all the data in the write cache is lost -- and difficult to piece back together if you have slaves that have already replicated some of the lost data.

I would be very interested to see the difference Percona's patches make, but this server is now in production so I can't run another test on it at this time. If I'm able to in the future, I'll post those for comparison.