mysqlprestore for parallel restores
Yesterday, I used mysqlpdump to dump 300G of data. Today, as step two of the process, I need to restore that data into the new server. Every good dump tool needs a restore tool, so I wrote
mysqlprestore, which is really just a modification of mysqlpdump that will spawn off threads and run your restore in parallel, processing an output directory from mysqlpdump.
I really need to stick it in version control or something, and I’m sure there are bugs, but it’s working for me so far. Maybe we can merge the two into a single great tool?
mysqlpdump for parallel dumps
I’m working on a project at the moment where we’re doing an upgrade and need to do the dump-and-restore method. It’s 300G, so that’s never going to be fun, but I found a tool today that helped significantly.
Multi threaded mysqldump is not an utopia any more. mysqlpdump can dump all your tables and databases in parallel so it can be much faster in systems with multiple cpu’s.
I ran mysqlpdump (with one patch I’ll send in soon to put quotes around table names) today with 16 threads on a 4 core system and did all 300G in ~3.5 hours. Additionally, since it wraps mysqldump but iterates over the tables, I got a sql file for each table, which is going to make writing a script to restore a piece of cake. It understands that I wanted to do –master-data and it had an option to gzip each sql file as it went.
All in all, I’m thrilled. kudos! And thanks for the tool.
Multiple bond interfaces in CentOS/RHEL
I had a machine with 4 nics that I wanted to bond 2 by to. I had no problem getting the bond0 device up witn any of the interfaces, however getting a bond1 up always resulted in the above error.
The friendly guys from #centos on freenode pointed me to the missing config.
options bonding mode=4 max_bonds=4
An important thing to keep in mind here is that in the RHEL/CentOS initscripts package, these options are global. There is no way to set a different set of options for each bond. So, if for instance, you had 4 NICs and wanted to have 2 of them bonded in mode 1 and 2 of them in mode 4, you’re SOL. (Unless, of course, you go for insmodding everything by hand. But that’s ugly)
further malloc() scaling
As a quick follow up to the last post on malloc() speed comparisons, I decided to extend the size out a bit to cover malloc()ing up to 500M.
Time for 128k: 0.035259 Time for 256k: 0.009718 Time for 1M: 0.478129 Time for 5M: 0.968945 Time for 10M: 0.965172 Time for 50M: 0.674316 Time for 500M: 1.018901
As you can see, once you make the jump up to mmap() (>256k), the cost is fairly well constant (give or take fluctuations). So it’s not that huge memory buffers are terrible, just that there is a cost difference between the smaller and larger buffer sizes that may or may not matter in your case.
more on malloc() speed
The recent read_buffer issue got me curious, so I hacked up a quick (ugly) test program to see if I could show the different speeds of malloc()ing different buffer sizes. Here’s the test code:
[C]
#include
#include
#include
#define LOOP 10000
int main(void) {
int x[4] = { 128,256,1024,5*1024 } ;
int f = 0;
for(f=0;f<4;f++) {
int val = x[f];
timeval before;
timeval after;
gettimeofday(&before,0);
int loop=0;
for(loop=0;loop after.tv_usec) {
udiff = (float)(after.tv_usec + 10000000 – before.tv_usec)/10000000.0;
diff = after.tv_sec – 1 – before.tv_sec;
} else {
udiff = (float)(after.tv_usec – before.tv_usec)/1000000.0;
diff = after.tv_sec – before.tv_sec;
}
printf(“Time for %dk:\t%f\n”,val,((float)diff+udiff));
}
return 0;
}
[/C]
And the results.
Time for 128k: 0.002641 Time for 256k: 0.002628 Time for 1024k: 0.097950 Time for 5120k: 0.060240
Which is rather huge and in line with the read buffer performance we were talking about. malloc()ing 1M isn’t 4x as slow as doing 256k… it’s 37x as slow. Obviously, there is some variance here, and I should probably make this run a little longer. So setting LOOP to 100000, I get this:
Time for 128k: 0.035105 Time for 256k: 0.028631 Time for 1024k: 0.942634 Time for 5120k: 0.985856
Luckily this shows about 10x the performance of before, so we’re at least on the right track here.
The long and short of this is that bigger memory buffers aren’t always better. Much depends on how and when they are used. Much also depends on where your bottleneck is. Many database loads are disk bound, so the cost of 37x isn’t much, since we’re talking microseconds here anyway. But when you’re talking high concurrency and everything is in memory, shaving microseconds from your queries can actually be crucial.
Read Buffer performance hit
I had some fun yesterday with some odd performance problems. So I did a run with oprofile and got this:
561612 25.0417 /lib64/tls/libc-2.3.4.so memset 429457 19.1491 /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux clear_page 214268 9.5540 /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux do_page_fault 144293 6.4339 /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux do_no_page 94410 4.2097 /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux buffered_rmqueue 64998 2.8982 /lib64/tls/libc-2.3.4.so memcpy 59565 2.6559 /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux __down_read_trylock 59369 2.6472 /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux handle_mm_fault 47312 2.1096 /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux free_hot_cold_page 39161 1.7462 /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux release_pages 39140 1.7452 /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux unmap_vmas 27200 1.2128 /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux __alloc_pages 21520 0.9596 /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux __pagevec_lru_add_active 20772 0.9262 /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux page_add_anon_rmap 19852 0.8852 /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux page_remove_rmap 16424 0.7323 /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux bad_range 11993 0.5348 /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux __might_sleep
Which seemed to me like an awful lot of page faults, especially on a system that wasn’t
swapping at all.
Alexander and I got to talking about it, and I learned a few new things. The first thing is that mmap() appears as page_faults, because the machinery for doing mmap() and for doing swap is actually the same. Fun.
Back to the problem at hand… the following was in the my.cnf:
read_buffer=5M
As I understand it, if read_buffer set to over 256K, it uses mmap() instead of malloc() for memory
allocation. Actually – this is a libc malloc thing and is tunable, but defaults to 256k. From the manual:
Each thread that does a sequential scan allocates a buffer of this size (in bytes) for each table it scans.
In this case, the server is pretty much handling click logging, so all of the queries are pure inserts. Why then would the read buffer get used at all?
Well, it just so happens that this is a PHP app and is not using any sort of persistent connections, so the app is connecting and disconnecting 60 – 100 times a second. Each of those connections is going to have to authenticate against (at least) the mysql.user table. On the other hand, that _should_ be cached, so it shouldn’t be a problem. However, lowering the read buffer is quite easy … and it’ll be interesting.
So, theory to the fire – I reduced read_buffer to 128k (the default) and then got this output from oprofile:
2220 15.7760 /lib64/tls/libc-2.3.4.so memset 785 5.5785 /lib64/tls/libc-2.3.4.so memcpy 608 4.3206 /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux thread_return 398 2.8283 /usr/libexec/mysqld yyparse(void*) 337 2.3948 /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux system_call 234 1.6629 /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux do_gettimeoffset_hpet 197 1.3999 /lib64/tls/libc-2.3.4.so _int_malloc 183 1.3005 /lib64/tls/libpthread-2.3.4.so pthread_mutex_lock 181 1.2862 /lib64/tls/libpthread-2.3.4.so pthread_mutex_unlock 181 1.2862 /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux clear_page 176 1.2507 /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux wake_futex 143 1.0162 /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux sys_tgkill 138 0.9807 /usr/libexec/mysqld yylex(void*, void*) 130 0.9238 /usr/libexec/mysqld update_sys_var_str(sys_var_str*, _pthread_rwlock_t*, set_var*) 126 0.8954 /jbd (no symbols) 123 0.8741 /usr/libexec/mysqld my_strntoul_ucs2 106 0.7533 /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux __might_sleep 106 0.7533 /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux do_futex 104 0.7391 /usr/lib/debug/lib/modules/2.6.9-34.ELsmp/vmlinux tcp_recvmsg
Which is more what we would expect.
The results on the system as a whole were that with read buffer set to 5M, extra 3-5% usr and 6-12% sys cpu time taken on an 8 cpu box (as reported by top) or, to look at it another way, about 100% of a single CPU taken by mysqld.
With read_buffer reduced to 128k, the CPU usage dropped to just about nothing.