MySQL-HA


mysqlprestore for parallel restores

Posted in tools by mtaylor on September 13, 2007

Yesterday, I used mysqlpdump to dump 300G of data. Today, as step two of the process, I need to restore that data into the new server. Every good dump tool needs a restore tool, so I wrote
mysqlprestore, which is really just a modification of mysqlpdump that will spawn off threads and run your restore in parallel, processing an output directory from mysqlpdump.

I really need to stick it in version control or something, and I’m sure there are bugs, but it’s working for me so far. Maybe we can merge the two into a single great tool?

mysqlpdump for parallel dumps

Posted in tools by mtaylor on September 12, 2007

I’m working on a project at the moment where we’re doing an upgrade and need to do the dump-and-restore method. It’s 300G, so that’s never going to be fun, but I found a tool today that helped significantly.

Multi threaded mysqldump is not an utopia any more. mysqlpdump can dump all your tables and databases in parallel so it can be much faster in systems with multiple cpu’s.

I ran mysqlpdump (with one patch I’ll send in soon to put quotes around table names) today with 16 threads on a 4 core system and did all 300G in ~3.5 hours. Additionally, since it wraps mysqldump but iterates over the tables, I got a sql file for each table, which is going to make writing a script to restore a piece of cake. It understands that I wanted to do –master-data and it had an option to gzip each sql file as it went.

All in all, I’m thrilled. kudos! And thanks for the tool.

mysqlpdump

Multiple bond interfaces in CentOS/RHEL

Posted in ha by mtaylor on September 11, 2007

Kris writes:

I had a machine with 4 nics that I wanted to bond 2 by to. I had no problem getting the bond0 device up witn any of the interfaces, however getting a bond1 up always resulted in the above error.

The friendly guys from #centos on freenode pointed me to the missing config.

options bonding mode=4 max_bonds=4

An important thing to keep in mind here is that in the RHEL/CentOS initscripts package, these options are global. There is no way to set a different set of options for each bond. So, if for instance, you had 4 NICs and wanted to have 2 of them bonded in mode 1 and 2 of them in mode 4, you’re SOL. (Unless, of course, you go for insmodding everything by hand. But that’s ugly)

further malloc() scaling

Posted in performance by mtaylor on September 10, 2007

As a quick follow up to the last post on malloc() speed comparisons, I decided to extend the size out a bit to cover malloc()ing up to 500M.

Time for        128k:   0.035259
Time for        256k:   0.009718
Time for        1M:     0.478129
Time for        5M:     0.968945
Time for        10M:    0.965172
Time for        50M:    0.674316
Time for        500M:   1.018901

As you can see, once you make the jump up to mmap() (>256k), the cost is fairly well constant (give or take fluctuations). So it’s not that huge memory buffers are terrible, just that there is a cost difference between the smaller and larger buffer sizes that may or may not matter in your case.

more on malloc() speed

Posted in performance by mtaylor on September 10, 2007

The recent read_buffer issue got me curious, so I hacked up a quick (ugly) test program to see if I could show the different speeds of malloc()ing different buffer sizes. Here’s the test code:

[C]
#include
#include
#include

#define LOOP 10000

int main(void) {

int x[4] = { 128,256,1024,5*1024 } ;

int f = 0;
for(f=0;f<4;f++) {
int val = x[f];

timeval before;
timeval after;

gettimeofday(&before,0);

int loop=0;
for(loop=0;loop after.tv_usec) {
udiff = (float)(after.tv_usec + 10000000 – before.tv_usec)/10000000.0;
diff = after.tv_sec – 1 – before.tv_sec;
} else {
udiff = (float)(after.tv_usec – before.tv_usec)/1000000.0;
diff = after.tv_sec – before.tv_sec;
}
printf(“Time for %dk:\t%f\n”,val,((float)diff+udiff));

}

return 0;
}
[/C]

And the results.

Time for 128k:  0.002641
Time for 256k:  0.002628
Time for 1024k: 0.097950
Time for 5120k: 0.060240

Which is rather huge and in line with the read buffer performance we were talking about. malloc()ing 1M isn’t 4x as slow as doing 256k… it’s 37x as slow. Obviously, there is some variance here, and I should probably make this run a little longer. So setting LOOP to 100000, I get this:

Time for 128k:  0.035105
Time for 256k:  0.028631
Time for 1024k: 0.942634
Time for 5120k: 0.985856

Luckily this shows about 10x the performance of before, so we’re at least on the right track here.

The long and short of this is that bigger memory buffers aren’t always better. Much depends on how and when they are used. Much also depends on where your bottleneck is. Many database loads are disk bound, so the cost of 37x isn’t much, since we’re talking microseconds here anyway. But when you’re talking high concurrency and everything is in memory, shaving microseconds from your queries can actually be crucial.

Read Buffer performance hit

Posted in performance by mtaylor on September 6, 2007

I had some fun yesterday with some odd performance problems. So I did a run with oprofile and got this:

561612   25.0417  /lib64/tls/libc-2.3.4.so memset
429457   19.1491  /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux clear_page
214268    9.5540  /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux do_page_fault
144293    6.4339  /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux do_no_page
94410     4.2097  /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux buffered_rmqueue
64998     2.8982  /lib64/tls/libc-2.3.4.so memcpy
59565     2.6559  /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux __down_read_trylock
59369     2.6472  /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux handle_mm_fault
47312     2.1096  /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux free_hot_cold_page
39161     1.7462  /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux release_pages
39140     1.7452  /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux unmap_vmas
27200     1.2128  /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux __alloc_pages
21520     0.9596  /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux __pagevec_lru_add_active
20772     0.9262  /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux page_add_anon_rmap
19852     0.8852  /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux page_remove_rmap
16424     0.7323  /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux bad_range
11993     0.5348  /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux __might_sleep

Which seemed to me like an awful lot of page faults, especially on a system that wasn’t
swapping at all.

Alexander and I got to talking about it, and I learned a few new things. The first thing is that mmap() appears as page_faults, because the machinery for doing mmap() and for doing swap is actually the same. Fun.

Back to the problem at hand… the following was in the my.cnf:

read_buffer=5M

As I understand it, if read_buffer set to over 256K, it uses mmap() instead of malloc() for memory
allocation. Actually – this is a libc malloc thing and is tunable, but defaults to 256k. From the manual:

Each thread that does a sequential scan allocates a buffer of this size (in bytes) for each table it scans.

In this case, the server is pretty much handling click logging, so all of the queries are pure inserts. Why then would the read buffer get used at all?

Well, it just so happens that this is a PHP app and is not using any sort of persistent connections, so the app is connecting and disconnecting 60 – 100 times a second. Each of those connections is going to have to authenticate against (at least) the mysql.user table. On the other hand, that _should_ be cached, so it shouldn’t be a problem. However, lowering the read buffer is quite easy … and it’ll be interesting.

So, theory to the fire – I reduced read_buffer to 128k (the default) and then got this output from oprofile:

2220     15.7760  /lib64/tls/libc-2.3.4.so memset
785       5.5785  /lib64/tls/libc-2.3.4.so memcpy
608       4.3206  /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux thread_return
398       2.8283  /usr/libexec/mysqld      yyparse(void*)
337       2.3948  /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux system_call
234       1.6629  /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux do_gettimeoffset_hpet
197       1.3999  /lib64/tls/libc-2.3.4.so _int_malloc
183       1.3005  /lib64/tls/libpthread-2.3.4.so pthread_mutex_lock
181       1.2862  /lib64/tls/libpthread-2.3.4.so pthread_mutex_unlock
181       1.2862  /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux clear_page
176       1.2507  /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux wake_futex
143       1.0162  /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux sys_tgkill
138       0.9807  /usr/libexec/mysqld      yylex(void*, void*)
130       0.9238  /usr/libexec/mysqld      update_sys_var_str(sys_var_str*, _pthread_rwlock_t*, set_var*)
126       0.8954  /jbd                     (no symbols)
123       0.8741  /usr/libexec/mysqld      my_strntoul_ucs2
106       0.7533  /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux __might_sleep
106       0.7533  /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux do_futex
104       0.7391  /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux tcp_recvmsg

Which is more what we would expect.

The results on the system as a whole were that with read buffer set to 5M, extra 3-5% usr and 6-12% sys cpu time taken on an 8 cpu box (as reported by top) or, to look at it another way, about 100% of a single CPU taken by mysqld.

With read_buffer reduced to 128k, the CPU usage dropped to just about nothing.