Planet MySQL

  1. NDB is mainly an in-memory database. We do, however, also have the possibility to store non-indexed columns on disk. This data uses a page cache, as in any other normal disk-based DBMS.

Interestingly, with the increases in memory sizes one could think that disk data becomes less important for MySQL Cluster. The answer is actually the opposite. The reason is again the HW development. NDB is designed with predictable latency as a very basic requirement. In the past, disks meant hard drives. Access time to a hard disk was several milliseconds at best. Given that our requirement was to handle complex transactions within 10 milliseconds, disk data storage was out of the question.

Modern HW is completely different. It uses SSD devices, first attached through the SATA interface, which enabled up to around 500 MBytes per second and a few thousand IO operations per second (IOPS). The second step was the introduction of SSD devices on the PCI bus. This lifted performance to more than 1 GByte per second. These devices are extremely small and still very powerful. I have an Intel NUC at home that has two of those devices. Thus the performance difference between disk storage and RAM has decreased.

The next step on the way was to change the storage protocol and introduce NVMe devices. These still use the same HW, but use a new standard that is designed for the new type of storage devices. Given those devices we now have the ability to execute millions of IOPS on a standard server box, with access times of a few tens of microseconds.

For NDB this means that this HW fits very well into the NDB architecture. The work we did on developing the Partial LCP algorithm also included a lot of work on improving our disk data implementation. We see more and more people using disk data columns in NDB.

The next step is even more interesting: it will bring storage onto the memory bus, with access times of around one microsecond. For NDB this disk storage can be treated as memory to start with, thus making it possible to soon have multiple TBytes of memory in standard boxes. Thus HW development is making the NDB engine more and more interesting to use.

One notable example that uses disk data columns in NDB is HopsFS. They use the disk data columns to store small files in the metadata server of the HopsFS implementation of the Hadoop HDFS Name Server. This means much faster access to small files. The tests they did showed that they could handle hundreds of thousands of file reads and writes per second, even using fairly standard SSD disks on the servers.

The implementation of disk data in NDB is done such that each row can have three parts. The fixed memory part is accessed quickly using a row id. The variable-sized part is accessed through a pointer from the fixed-size part. The disk columns are also accessed through a reference in the fixed-size part. This reference is an 8-byte value that refers to the page id and page index of the disk columns.

Before we can access those pages we go through a page cache. The page cache is implemented using caching techniques that were state of the art a few years ago. The idea is quite simple. The page cache uses a normal hot page queue. Pages are brought up in this queue when they are accessed. A single access will bring a page up, but to be more permanent in the page cache a page has to be accessed several times. Each page is represented in those queues by a page state record.
The basis of the page cache algorithm is that a page can be represented in a page state record even if the page is not in the page cache. NDB has a configuration variable called DiskPageBufferEntries; by default this is set to 10. It is the multiplication factor for how many more pages we keep page state records for, compared to the number of pages that fit in the page cache. So, for example, if we have set DiskPageBufferMemory to 10 GBytes and DiskPageBufferEntries to 10, we will have page state records covering 100 GBytes of pages in the queues. Thus even when a page is paged out we keep it in the list, and thus we can see patterns of reuse that are longer than the page cache we have access to. With the factor of 10, the page state records take up about 3% of the size of the page cache itself. Thus the benefit of the extra knowledge about page usage patterns comes at a fairly low cost. The factor of 10 is configurable (see the sketch below).

Many cloud servers come equipped with hundreds of GBytes of memory (some even TBytes) and can also store a number of TBytes on NVMe devices. NDB is well suited for those modern machines, and MySQL Cluster 7.6 has been designed to be suitable for this new generation of HW.
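To illustrate the idea, here is a small Python sketch (my own illustration, not the NDB implementation) of a page cache that keeps page state records for many more pages than fit in the cache. The constants mirror the roles of DiskPageBufferMemory and DiskPageBufferEntries, but the replacement policy is simplified to a single LRU queue:

    # A toy page cache: state records survive eviction of the page data,
    # so access history is tracked for ENTRIES times more pages than fit.
    from collections import OrderedDict

    CACHE_PAGES = 4   # pages that fit in the cache (tiny, for illustration)
    ENTRIES = 10      # plays the role of DiskPageBufferEntries

    class PageCache:
        def __init__(self):
            self.states = OrderedDict()  # page_id -> access count, LRU order
            self.cached = set()          # pages whose data is in memory

        def access(self, page_id):
            # Bump the page in the state queue; the record is kept even
            # after the page itself is evicted from the cache.
            self.states[page_id] = self.states.pop(page_id, 0) + 1
            if len(self.states) > CACHE_PAGES * ENTRIES:
                evicted, _ = self.states.popitem(last=False)
                self.cached.discard(evicted)
            hit = page_id in self.cached
            if not hit:
                self.cached.add(page_id)         # "read from disk"
                if len(self.cached) > CACHE_PAGES:
                    # Evict the cached page that is coldest in the state
                    # queue; its access history is retained.
                    for pid in self.states:
                        if pid in self.cached and pid != page_id:
                            self.cached.remove(pid)
                            break
            return hit

    cache = PageCache()
    for pid in [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]:
        print(pid, "hit" if cache.access(pid) else "miss")

With a cyclic pattern of five pages and a four-page cache, every access misses, yet the state records still accumulate reuse counts for all five pages; this longer history is the kind of information the real multi-queue algorithm uses to decide which pages deserve a permanent place in the cache.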
  2. One important problem that requires a solution is to decide whether a row has been updated since the last checkpoint or not. Most implementations use some kind of mechanism that requires extra memory and/or CPU resources to handle this. NDB uses the fact that each row is already stamped with a timestamp. The timestamp is what we call a global checkpoint id (GCI). A new global checkpoint is created about once every 2 seconds (it can be faster or slower by configuration). Thus we will overestimate the number of rows written since the last checkpoint by a little bit, but with checkpoints taking a few minutes, the extra overhead of this is only around 1%.

Thus when we scan rows we check the global checkpoint id of the row; if it is bigger than the global checkpoint that the last checkpoint had fully covered, we write the row as changed since the last checkpoint. Actually we also have the same information at the page level, thus we can check the page header and very quickly scan past an entire page if it hasn't been updated since the last checkpoint (see the sketch below).

The same type of scanning is also used to bring a restarting node up to sync with the live node. This algorithm has been present in NDB since MySQL 5.1.
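A small Python sketch (hypothetical data structures, not the NDB source) of the core of this scan: the page-header GCI is compared first so unchanged pages are skipped wholesale, then per-row GCIs are compared against the GCI fully covered by the last checkpoint:

    # Detect changed rows purely from the GCI stamps already on the data.
    from dataclasses import dataclass, field

    @dataclass
    class Row:
        row_id: int
        gci: int          # GCI stamped on the row at its last write

    @dataclass
    class Page:
        max_gci: int      # page header: highest GCI of any row on the page
        rows: list = field(default_factory=list)

    def changed_rows(pages, last_lcp_gci):
        """Yield rows written after the GCI covered by the last checkpoint."""
        for page in pages:
            if page.max_gci <= last_lcp_gci:
                continue              # whole page unchanged: skip quickly
            for row in page.rows:
                if row.gci > last_lcp_gci:
                    yield row         # this row goes into the checkpoint

    pages = [
        Page(max_gci=90, rows=[Row(0, 80), Row(1, 90)]),    # untouched page
        Page(max_gci=120, rows=[Row(2, 95), Row(3, 120)]),  # one changed row
    ]
    for row in changed_rows(pages, last_lcp_gci=100):
        print("checkpoint row", row.row_id)

Note that no extra bookkeeping structure is needed: the GCI stamps are already on every row, so change detection costs nothing beyond the scan itself.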
  3. In MySQL Cluster 7.5 we use Complete Checkpoints. In MySQL Cluster 7.6 we implement an approach where we only checkpoint a part of the database in each checkpoint.

A special case is a checkpoint of a table partition where no changes at all have happened since the last checkpoint. In this case we implemented a special optimisation such that it is not necessary to checkpoint anything at all for this table partition. It is only necessary to write a new LCP control file, which is 4 kBytes in size for each table partition (it can grow to 8 kBytes if the recovery will require more than 980 checkpoints to recover).

This means that if your database contains a large set of read-only tables, there will be no need to checkpoint those tables at all. This feature is also used when setting EnablePartialLcp to false.
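In Python-flavoured pseudo-code (hypothetical names, not the NDB source), the per-partition decision described above looks roughly like this:

    # An unchanged partition only gets a fresh LCP control file (~4 kBytes);
    # a changed partition is actually checkpointed as well.
    def write_control_file(part):
        print(f"partition {part['id']}: wrote LCP control file (~4 kB)")

    def write_checkpoint_data(part):
        print(f"partition {part['id']}: wrote checkpoint data")

    def checkpoint_partition(part, last_lcp_gci):
        if part["max_gci"] <= last_lcp_gci:
            write_control_file(part)     # no changes: control file only
        else:
            write_checkpoint_data(part)  # rows changed since last LCP
            write_control_file(part)

    for part in [{"id": 0, "max_gci": 90}, {"id": 1, "max_gci": 120}]:
        checkpoint_partition(part, last_lcp_gci=100)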
  4. One of the main objectives of the new Partial LCP algorithm in MySQL Cluster 7.6 is to keep up with the development of modern HW. I have already described in previous blogs how Partial LCP can nicely handle even database sizes of 10 TBytes of memory with a very modest load on the disk devices.

Now modern HW has shifted from using hard drives to using SSDs. The original approach in NDB assumes that the checkpoints and REDO logs are stored on hard drives. In MySQL Cluster 7.5 the disk space required for the REDO log is a bit larger than the DataMemory size. The reason is that we want to survive also when loading massive amounts of data. In MySQL Cluster 7.5 we also cannot remove any checkpoint files until a checkpoint is fully completed. This means that we require around 4x the memory size of disk space for REDO logs and checkpoints.

With hard drives this is not a problem at all. As an example, my development box has 32 GBytes of memory with 2 TBytes of disk space, thus 64x more disk space compared to the memory space. With modern servers this size difference between memory and disks is decreasing. For example, many cloud VMs only have a bit more than 2x the disk size compared to the memory size. So one goal of MySQL Cluster 7.6 is to fit in much less disk space.

The aim is to solve this with a three-pronged approach (a worked example of the arithmetic follows after this list):

1) Partial LCP means that we can execute the checkpoints much faster. Since REDO logs only need to be kept for around two checkpoints, this means a significant decrease of the size requirements for REDO logs. The aim is to need only around 10% of the memory size in disk space for the REDO logs. This work is not completed in 7.6.4. As usual there are no guarantees when this work will be completed.

2) Using Partial LCP we can throw away old LCP files as soon as we have created a new recoverable LCP for the table partition. Thus it is no longer necessary to store 2 LCPs on disk. At the same time there is some overhead related to Partial LCPs. With default settings this overhead is 50% plus a bit more. Thus we should always fit within about 1.6x the memory size. It is possible to set EnablePartialLcp to false; in this case all checkpoints will be Complete Checkpoints. This means more disk writes for checkpoints, but it will decrease the storage space to around the same as the memory size.

3) With CompressedLCP set to 1 we can decrease LCP storage by another factor of 2-3x (usually around 2.7x). This feature has existed for a long time in NDB.

Thus it should be possible to significantly decrease the requirements on storage space when running NDB using MySQL Cluster 7.6.
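Here is the back-of-the-envelope arithmetic in Python, using the rough figures from the text and a hypothetical 512 GByte DataMemory (the numbers are illustrations, not measurements):

    # Disk space needed for REDO logs plus checkpoints, per the text.
    data_memory_gb = 512  # hypothetical example size

    # 7.5: REDO a bit larger than DataMemory, plus complete checkpoints
    # that cannot be removed until a new one completes -> around 4x memory.
    space_75 = 4.0 * data_memory_gb

    # 7.6, EnablePartialLcp=true: the ~10% REDO aim plus ~1.6x memory
    # for the partial LCP files (50% overhead plus a bit more).
    space_76 = (0.1 + 1.6) * data_memory_gb

    # 7.6 with CompressedLCP=1 as well: LCP files shrink ~2.7x (REDO
    # unchanged) -- combining the two options is my assumption here.
    space_76_compressed = (0.1 + 1.6 / 2.7) * data_memory_gb

    print(f"7.5 complete LCPs:           ~{space_75:.0f} GBytes")
    print(f"7.6 partial LCPs:            ~{space_76:.0f} GBytes")
    print(f"7.6 partial + CompressedLCP: ~{space_76_compressed:.0f} GBytes")

Under these assumptions the requirement drops from about 2 TBytes to well under 1 TByte, which is what makes cloud VMs with only a bit more than 2x disk relative to memory workable.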
  5. I just read an article called Low-Overhead Asynchronous Checkpointing in Main-Memory Database Systems. It was mentioned in a course in Database Systems at Carnegie Mellon University, see here. In MySQL Cluster 7.6.4 we released a new variant of our checkpointing, designed for modern HW with TBytes of main memory. I think studying this implementation will be very worthwhile, both for users of NDB and for researchers in DBMS implementations. It implements a new class of checkpoint algorithms that is currently a research topic in the database research community.

It was interesting to compare our approach, which I called Partial LCP, with the approaches taken by other commercial in-memory databases and with the approach presented in the paper. LCP is Local CheckPoint, which is the name we use for our checkpoint protocol in NDB.

The course presents a number of ideal properties of a checkpoint implementation.

The first property is that it doesn't slow down regular transaction processing. In the case of NDB we execute checkpoints at a steady pace which consumes around 5-10% of the available CPU resources. This will decrease even more with the implementation in 7.6.

The second is that it doesn't introduce any latency spikes. NDB checkpointing, both new and old, executes in steps of at most 10-20 microseconds, so there will be extremely small impact on the latency of transactions due to checkpointing.

The third property is that it doesn't require excessive memory overhead. NDB checkpointing consumes a configurable buffer in each database thread. The ideal size of this is around 1 MByte. In addition we have a REDO log buffer that is usually a bit bigger than that. That is all there is to it. There is no extra memory space needed for checkpointing rows. The checkpointing performs a normal scan of the rows and copies the memory content to the buffer, and as soon as the buffer is full it writes it to disk using sequential disk writes.

It is fair to say that NDB does a good job in handling those ideal properties.

The course presents two variants called fuzzy checkpoints and consistent checkpoints. The course defines a fuzzy checkpoint as a checkpoint that can write uncommitted data. I would normally use the term fuzzy checkpoint to mean that the checkpoint is not consistent at a database level, but can still be consistent on a row basis. Actually NDB is a mix of the definitions provided in the course material. It is a consistent checkpoint for each row, but different rows can be consistent at very different points in time. So on a row basis NDB is consistent, but at the database level the checkpoint is fuzzy. Thus to perform recovery one needs to install the checkpoint and then apply the REDO log to get a consistent checkpoint restored.

Next the course presents two variants called Complete Checkpoints and Delta Checkpoints. Complete Checkpoint means that the entire database is written in each checkpoint. Delta Checkpoint means that only changes are written in a checkpoint. This is where MySQL Cluster 7.6 differs from 7.5: 7.5 uses a Complete Checkpoint scheme, 7.6 uses a Partial Checkpoint scheme.

In my view the NDB variant is a third variant that is neither a Complete nor a Delta Checkpoint. Partial means that it writes the Delta, that is, it writes all changes since the last checkpoint. But it also writes a Complete Checkpoint for a part of the database, thus the name Partial Checkpoint.
Thus it is similar to an incremental backup scheme.

NDB can divide the database into up to 2048 parts. Each checkpoint can write 0 parts (only if no changes occurred in the table partition since the last checkpoint). It can write 1 part if the number of writes is very small, it can write all 2048 parts if almost all rows have been updated, and it can write anywhere between 1 and 2048 based on how many rows were updated since the last checkpoint (a small sketch of this follows at the end of this item).

Almost all commercial in-memory DBMSs still use a complete checkpoint scheme. As we move towards TBytes of memory this is no longer a plausible approach. The NDB approach means that we can perform a checkpoint in a few minutes even in a system with 16 TBytes of memory, where we need to write about 8 GBytes plus the changes since the last checkpoint.

Thus NDB takes the step into a new world of massively large in-memory DBMSs with the introduction of MySQL Cluster 7.6 and its new Partial LCP implementation.

My new book "MySQL Cluster 7.5 inside and out" describes the LCP implementation in 7.5. The description of the Partial LCP can be found in my blogs and also in some very detailed descriptions in the source code itself, among other things a 10-page proof that the algorithm actually works :)

The nice thing with the Partial LCP approach in NDB is that it requires no more work after writing the checkpoint. There is no need for merging checkpoints; this happens automatically at recovery. There is some amount of overhead in that the checkpoints can have some rows in multiple checkpoints and thus there is some amount of overhead at recovery. We calculate the number of parts to use based on the amount of changes. We even implemented an LCP simulator that calculates the overhead while inserting and deleting large amounts of rows; it has been used to find the proper configuration parameters for the algorithm.
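As a rough illustration in Python (my own sketch; the real formula for choosing the number of parts lives in the NDB source code, and the linear rule below is a placeholder), here is the part-selection idea together with the 16 TByte arithmetic from above:

    # Write a complete checkpoint for only some of up to 2048 parts,
    # scaled by how much of the data changed since the last checkpoint.
    MAX_PARTS = 2048

    def parts_to_write(changed_fraction):
        """0 parts if nothing changed, all 2048 if (almost) everything did."""
        if changed_fraction == 0.0:
            return 0
        return max(1, min(MAX_PARTS, round(changed_fraction * MAX_PARTS)))

    # The 16 TByte example from the text: one part of a 16 TByte database
    # is 16 TBytes / 2048 = 8 GBytes, written in full, plus the delta.
    data_tb = 16
    part_gb = data_tb * 1024 / MAX_PARTS
    print(f"one complete part: {part_gb:.0f} GBytes")
    for frac in (0.0, 0.0004, 0.1, 1.0):
        print(f"{frac:7.2%} changed -> write {parts_to_write(frac)} parts")

The design choice worth noting is that nothing needs to be merged after the fact: older partial checkpoints are simply read together at recovery, which trades a bounded amount of recovery-time overhead for zero post-processing during normal operation.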