MySQL-HA


My fun with MySQL on EC2

Posted in ha by mtaylor on October 28, 2007

Morgan is looking in to EC2 on MySQL, so I thought I’d pipe up about stuff I’ve been playing with.

The ephemeral nature of the data is troubling, because at best you’re going to have some lag before you can back stuff up to S3 or some other place. (Unless that was happening continuously… but we’ll come back to that) On the other hand, if you’re doing app sharding or something similar, this essentially just makes you plan that your machines can all die at any time. If you used Google’s semi-sync replication patch, you could easily spin up little replication clusters as needed.

Hm. Clusters. Well, I’m also a fan of MySQL Cluster. What if you ran MySQL Cluster on a single ec2 node (both data and sql nodes)? What if, further, you wrote (and by you, I mean me… code coming soon, I promise) an AsyncFile implementation for Cluster that read and wrote to S3 instead of local disk. Cluster itself is already decoupled from disk write latency. Sounds like a good UC talk…

Then you could do the same thing with a multi-node cluster, but Amazon doesn’t let you control the network between EC2 nodes, so the latency there could kill you.

I have put together a few scripts for spinning up an EC2 node with MySQL (and Cluster) ready to go, but I’m sitting in an airport right now, so I’ll have to post the code later.

I think the possibilities for scale-out here are fantastic, but like all application partitioning approaches, they do require some engineering of the application to take advantage of it.

Running more than one ndbd on a machine

Posted in ha by mtaylor on October 21, 2007

Personally, I’m not a fan of more than one ndbd per machine…

Diamond Notes » Fun with Running a Cluster on Two Servers

Others might argue with this, but I would never put the SQL nodes on the same servers as the ndbd nodes for production. Some say you can run multiple ndbd nodes on the same server and I am more comfortable with that since I can lock the ndbd daemon into memory and know its not going to change (my ndbd nodes on those two servers have been at exactly 71.3% since I started them up. If I had servers for the ndbd nodes that had 16+ gigs of RAM I might start allocating 4 gigs of RAM to a ndbd daemon with 3+ daemons per node. My understanding is that this helps keep the transactional logs for the nodes under control. When you do a ndbd node restart it takes less time for a node to get up and running because of the smaller files to read. I might be mistaken and its too late for me to look it up :) Anyone got other reasons or maybe (if I am right) someone can elaborate.

First of all, I’m very excited to see that Cluster is being used for MogileFS here. I’ve actually been thinking it might be fun to write a MogileFS::Store::NDB class to use NDB/Perl … but that’s another story.

One of the things that’s nice about NDB is that it spread across mutliple machines. Now, granted, in development we can’t always do this. I run NDB on my laptop all the time, so I certainly feel the pain there. But with a multi-node design, I say take advantage of it. People on other architecture are always asking how they can spread the load more easily across multiple machines, and here you can. If you need 8 data nodes, get 8 machines.

Practically, there is another reason. If you put 3 data nodes on a single physical machine, then if that machine crashes, you have not a single node failure, but 3 nodes failing at the same time. Although there is no specific reason that this won’t work, it’s also essentially an edge case and not really tested all that well. Other people may disagree with me here, especially as currently running multiple ndbd’s on a single box is the only way to take advantage of more that 2 CPU cores, but I’m just not a fan. For something like this, go get a bunch of 2 CPU boxes.

To address the larger question, though, you can certainly spread the write load and the redo log burden by having more data nodes, and having the data distributed across more data nodes will certainly make the log recovery shorter on any one given node. However, there is a price to pay here in query latency. The more data nodes you have, the more nodes your data might be on. If you are not using the NDB API (or any of the NDB/Connectors) then there is no optimized node selection going on. (I’ve heard someone is working on an Optimized TC Selection patch for mysqld, but it’s not in the mainline yet) So if you go from 2 nodes to 4 nodes, you went from a 100% chance that the TC will have your data and not have to ask someone else, to a 50% chance. At 8 nodes you are at a 25% chance that the TC selection will select a node with your data. That’s for primary key operations. If you’re doing a scan, then your data is going to (most likely) be on all of the nodes, which can be good or bad. But if you’re doing MogileFS queries, perhaps the extra milliseconds of latency isn’t a concern in this case and you’d rather have faster node recovery. I’d test that hypothesis out and see how much better the recovery is.

Another potential reason to have more data nodes even with the higher latency costs (and extra network traffic as that many more nodes have to talk to that many more nodes) is scalability. Of course, more nodes mean more scaling. But as of now ADDING a data node is not an online operation. Adding more memory can be done in a rolling fashion. So if you think you might need 8 data nodes worth of CPU processing, go ahead and get 8 data nodes with 4G of RAM a piece. Then as you need to store more data, stick in more RAM. Before long, you’ll have a nice 256G system… :)

SO… as with everything cluster, there is certainly room on both side of the argument as to what’s best here. And as always, testing is the best bet to see how it maps to your environment.

(Also, it’s 4AM at the moment, so please forgive me if this rambles a little.)

Join Syntax changes in 5.0 (not a bug)

Posted in ha by mtaylor on October 5, 2007

Our friend Dathan recently suggested that The Quality of mySQL lately sucks.

I’ve just ran into yet another obvious bug that has made it’s way into production.

While I’m not going to argue that more unit tests would be a good thing, I’d like to point out that what he’s referring to is not so much a bug as it is a change that happened (and was documented) in 5.0.12:

Beginning with MySQL 5.0.12, natural joins and joins with USING, including outer join variants, are processed according to the SQL:2003 standard. The changes include elimination of redundant output columns for NATURAL joins and joins specified with a USING clause and proper ordering of output columns. The precedence of the comma operator also now is lower compared to JOIN, LEFT JOIN, and so forth.

These changes make MySQL more compliant with standard SQL. However, they can result in different output columns for some joins. Also, some queries that appeared to work correctly prior to 5.0.12 must be rewritten to comply with the standard. For details about the scope of the changes and examples that show what query rewrites are necessary, see Section 12.2.7.1, “JOIN Syntax”.

What this means is that the query in the bug report:

SELECT p.id, gt.object_id FROM Photos p, PhotosExtra px LEFT JOIN GeoTagged gt ON
gt.object_id=p.id WHERE px.photo_id=p.id AND p.id = 2173;

should be rewritten to either:

SELECT p.id, gt.object_id FROM PhotosExtra px, Photos p LEFT JOIN GeoTagged gt ON
gt.object_id=p.id WHERE px.photo_id=p.id AND p.id = 2173;

or even:

SELECT p.id, gt.object_id FROM , Photos p LEFT JOIN GeoTagged gt ON
gt.object_id=p.id JOIN PhotosExtra px ON px.photo_id=p.id WHERE p.id = 2173;

This is a common thing people have to deal with when upgrading to 5.0.

Multiple bond interfaces in CentOS/RHEL

Posted in ha by mtaylor on September 11, 2007

Kris writes:

I had a machine with 4 nics that I wanted to bond 2 by to. I had no problem getting the bond0 device up witn any of the interfaces, however getting a bond1 up always resulted in the above error.

The friendly guys from #centos on freenode pointed me to the missing config.

options bonding mode=4 max_bonds=4

An important thing to keep in mind here is that in the RHEL/CentOS initscripts package, these options are global. There is no way to set a different set of options for each bond. So, if for instance, you had 4 NICs and wanted to have 2 of them bonded in mode 1 and 2 of them in mode 4, you’re SOL. (Unless, of course, you go for insmodding everything by hand. But that’s ugly)

DRBD and Replication in the real world

Posted in ha by mtaylor on April 2, 2007

It’s often quite tempting to make sweeping statements about the superiority of one approach to a problem over another. While various approaches often have advantages, in the real world often there are many competing criteria which make a black and white assessment of choices seem rather simplistic. Politicians and marketing folks oversimplify complex problems every day, and it often makes a real discussion of the issue at hand harder – although it does engender a vitriolic us vs. them approach of yelling. Luckily for us, this never happens in the technical world.

Eric Bergen posted an interesting entry on his blog this morning about DRBD. He makes some very good points, but I believe leaves out some context or assumptions for some of his conclusions, primarily by assuming that there is a single HA setup with a single set of criteria for success and then comparing both DRBD and MySQL Replication to that invented scenario. Since the scenario seems to be one in which MySQL Replication is the obvious choice, it is no surprise that a solution involving DRBD comes up wanting. Eric is 100% correct – DRBD is terrible at performing the tasks that MySQL Replication is well suited for. I would argue, however, that MySQL Replication is just as terrible at performing the tasks that DRBD does well, and neither of them can touch MySQL Cluster for the tasks that MySQL Cluster was deisgned for. One of the underlying ideas in the MySQL world is that there is no one perfect task for every occasion – witness the existence of multiple storage engines. The real Zen comes in matching the correct tool (or tools) to the job.

The key difference between MySQL Replication and DRBD is that MySQL Replication is asynchronous whereas DRBD is synchronous. Using MySQL Replication, writes to the primary master are not affected by the health of the secondary master in any way. Using DRBD writes are dependent on the health of both boxes. This doesn’t mean that in a DRBD world that a failure of either box is a failure on both, but if the performance of either box becomes degraded the performance of the pair becomes degraded. In this, as in many other ways, a DRBD system behaves much like a system involving two machines and shared storage. If your SAN performance becomes degraded, it will affect the overall performance of the system. Thinking of DRBD in similar terms to shared storage is often helpful, as the metaphor actually holds up quite nicely.

Even with both boxes disk subsystems perfectly healthy, as a synchronous replication technology, DRBD adds overhead to each write, and a performance hit is unavoidable. MySQL Replication also does not experience this problem. Should we not then, as Eric suggests, avoid DRBD and use MySQL Replication for everything? The answer is an emphatic “No” as it really depends on the problem you are trying to solve.

The very thing that makes MySQL Replication wonderful at some tasks, the asynchronous operation, is the very thing that makes it impossible to use in other situations. MySQL Replication undermines the durability of your transactions. It can’t help it. Imagine your database is collecting the data for processing credit card payments, as many of them do. Imagine then that your primary master is handling a transaction and returns success on a commit. The expectation now is that this data is saved and durable. Ah, but then at that moment, your primary master dies. Who knows why, maybe one of the new guys working at the colo facility thought it would be neat to unplug the server from power. It happens. You’ve got heartbeat or something like it set up and your secondary master comes up and running and the system as a whole keeps on running, and it takes next to no time since the secondary master was running a warm copy of the database. Life is good – right? What about that transaction? You’ve told the rest of the world that it is committed, but you have absolutely no way of knowing if it made it to the secondary master. If you are running one transaction at a time, you could check and correct it manually, but you probably don’t have redundant masters if your query load is that low. You probably have a steady stream of transactions happening. Which means that now your secondary master is in an unknown state.

Does this mean that MySQL Replication should never be used in a dual-master setup? Of course not. There are plenty of applications (click tracking comes to mind as an excellent example) where the loss of a few records actually doesn’t matter in the slightest, and where what you really want is the lowest possible transactional latency and the lowest possible downtime during a failover.

As a general rule (and remember, general rules are made to be broken) MySQL Replication is wonderful where short failover time a consistent primary performance are the key. “In a good fail over scenario a problem with the backup master should never cause an issue on the primary master.” In a scenario where this is the case, MySQL Replication is a fantastic choice. DRBD is wonderful if you need to be certain about the state of your data and can afford to lose a little bit of performance to provide extra durability, and in this case degraded performance of the secondary affecting the performance of the primary is something that is perfectly acceptable.

Something else to keep in mind is that not only can failovers be automated with DRBD, but so can failbacks (reattaching the original primary to the pair) In fact, you can happily bounce back and forth between two hosts running DRBD all day long with nary a problem. I dare you to do that unattended with a MySQL Replication setup and not run some sort of external consistency checking on your databases.

A few quick points to address directly:

“When DRBD, the operating system, or hardware crashes it crashes hard. Any corruption on the primary master during a nasty failure gets happily propagated over DRBD.” There are two pieces of truth here combined in an interesting way. In a normal DRBD setup, I configure DRBD to throw a kernel panic in case of any problems with the underlying IO subsystem. If the error happens on the primary, the secondary happily takes over. If it happens on the secondary, the primary just keeps right on chugging. If, however, there is a bug internal to MySQL that causes corrupt data to be written to disk, this data will be happily written to both disks. Most of the time this doesn’t result in an immediate crash of the primary server, so although the secondary server may not have the corruption, it is unknown how long the primary may run with the corruption, and again, the state of the data consistency between servers is unknown and unknowable.

Since DRBD replicates blocks, you also don’t run in to the very common problem MySQL Replication has of the slave getting behind the master due to a single execution thread. The secondary machine doesn’t have to process a thing – all it has to do is write blocks to the disk. Eric is 100% correct in pointing out that this does not allow you to use the alter tables on the slave trick. Of course, the real culprit here is the inability to add columns or indexes live. You really don’t want to use this trick to remove columns, as all matter of holy hell will break loose if you suddenly have less columns on your replication target.

Reports that DRBD loading pairs of masters for query time outs are very interesting, or would be if they were backed up with any real details. The most overhead I’ve seen DRBD put on a system is a 30% slower disk subsystem response time. Maybe the client is trying to do DRBD over a 10M network link, or has the whole thing horribly, horribly misconfigured. I have not seen anything even remotely like this problem in the field.

DRBD is operationally much more stable and simple to deal with than failover using MySQL Replication. This still doesn’t mean there aren’t times when Replication is the answer, but alluding to a “less stable less operationally friendly system” is just plain misdirection. I wholehearted agree that we should be working on making MySQL Replication better (right there with you on check sums – how about global transaction id’s of some sort too?) but there is no reason we can’t continue to make both tools better and have more ways of dealing with more problems.

If you want to get rid of your DRBD fail over setup, by all means give Eric a shout and he’ll be happy to help out. But if you wouldn’t mind, give me a shout too. I’d like to hear about situations where DRBD isn’t actually working out (haven’t personally run across one yet) Maybe your DRBD isn’t setup well, or maybe it’s just trying to be the wrong tool for the wrong job. And, of course, I’d be remiss if I didn’t want to give Eric some friendly competition for your business in doing the migration if it’s actually warranted.

Technorati Tags: ,

Recursive idea

Posted in ha, ideas by mtaylor on November 10, 2006

What if we made a storage engine that used MogileFS to store blobs/images?