Planet MySQL

Planet MySQL - https://planet.mysql.com
  1. Knowing the bits and bytes of an SQL query's order of operations can be very valuable, as it can ease the process of writing new queries, while also being very beneficial when trying to optimize an SQL query. If you're looking for the short version, this is the logical order of operations, also known as the order of execution, for an SQL query:

FROM, including JOINs
WHERE
GROUP BY
HAVING
WINDOW functions
SELECT
DISTINCT
UNION
ORDER BY
LIMIT and OFFSET

But the reality isn't that easy or straightforward. As we said, the SQL standard defines the order of execution for the different SQL query clauses. That said, modern databases are already challenging that default order by applying optimization tricks which might change the actual order of execution, though they must end up returning the same result as if they were running the query in the default execution order. Why would they do that? Well, it would be wasteful if the database first fetched all the data mentioned in the FROM clause (including the JOINs) before looking into the WHERE clause and its indexes. Those tables can hold lots of data, so you can imagine what would happen if the database's optimizer stuck to the traditional order of operations of an SQL query. Let's look into each of the SQL query parts according to their execution order.

FROM and JOINs
The tables specified in the FROM clause (including JOINs) will be evaluated first, to determine the entire working set which is relevant for the query. The database will merge the data from all tables, according to the JOINs' ON clauses, while also fetching data from the subqueries, and might even create some temporary tables to hold the data returned from the subqueries in this clause. In many cases though, the database's optimizer will choose to evaluate the WHERE part first, to see which part of the working set can be left out (preferably using indexes), so it won't inflate the data set too much if it doesn't really have to.

WHERE clause
The WHERE clause will be the second to be evaluated, after the FROM clause. We have the working data set in place, and now we can filter the data according to the conditions in the WHERE clause. These conditions can include references to the data and tables from the FROM clause, but cannot include any references to aliases defined in the SELECT clause, as that data and those aliases may not yet 'exist' in that context, because that clause hasn't been evaluated yet by the database. A common pitfall for the WHERE clause is trying to filter aggregated values in it, for example with this clause: "WHERE sum(available_stock) > 0". This statement will fail the query execution, because aggregations are evaluated later in the process (see the GROUP BY section below). To apply a filtering condition on aggregated data, you should use the HAVING clause and not the WHERE clause.

GROUP BY clause
Now that we have filtered the data set using the WHERE clause, we can aggregate the data according to one or more columns appearing in the GROUP BY clause. Grouping the data is actually splitting it into different chunks or buckets, where each bucket has one key and a list of rows that match that key. Not having a GROUP BY clause is like putting all rows in one huge bucket. Once you aggregate the data, you can use aggregation functions to return a per-group value for each of the buckets. Such aggregation functions include COUNT, MIN, MAX, SUM and others.
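To make the WHERE-versus-aggregation pitfall concrete, here is a minimal sketch assuming a hypothetical products table with warehouse_id and available_stock columns (the table is not part of the original article):

-- Fails: WHERE is evaluated before GROUP BY, so it cannot see aggregated values
SELECT warehouse_id, SUM(available_stock)
FROM products
WHERE SUM(available_stock) > 0
GROUP BY warehouse_id;

-- Works: HAVING (covered below) filters the buckets after aggregation took place
SELECT warehouse_id, SUM(available_stock) AS total_stock
FROM products
GROUP BY warehouse_id
HAVING SUM(available_stock) > 0;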
HAVING clause
Now that we have grouped the data using the GROUP BY clause, we can use the HAVING clause to filter out some buckets. The conditions in the HAVING clause can refer to the aggregation functions, so the example which didn't work in the WHERE clause above will work just fine in the HAVING clause: "HAVING sum(available_stock) > 0". As we've already grouped the data, we can no longer access the original rows at this point, so we can only apply conditions to filter entire buckets, and not single rows in a bucket. Also, as we mentioned in previous sections, aliases defined in the SELECT clause cannot be accessed in this section either, as they haven't yet been evaluated by the database (this is true in most databases).

Window functions
If you are using Window functions, this is the point where they'll be executed. Just like the grouping mechanism, Window functions also perform a calculation on a set of rows. The main difference is that when using Window functions, each row keeps its own identity and isn't grouped into a bucket of other similar rows. Window functions can only be used in either the SELECT or the ORDER BY clause. You can use aggregation functions inside Window functions, for example: SUM(COUNT(*)) OVER ()

SELECT clause
Now that we are done with discarding rows from the data set and grouping the data, we can select the data we want to be fetched from the query to the client side. You can use column names, aggregations and subqueries inside the SELECT clause. Keep in mind that if you're using a reference to an aggregation function, such as COUNT(*), in the SELECT clause, it's merely a reference to an aggregation which already occurred when the grouping took place; the aggregation itself doesn't happen in the SELECT clause, which only references its result.

DISTINCT keyword
The syntax of the DISTINCT keyword is a bit confusing, because the keyword takes its place before the column names in the SELECT clause. But the actual DISTINCT operation takes place after the SELECT. When using the DISTINCT keyword, the database will discard rows with duplicate values from the remaining rows left after the filtering and aggregations took place.

UNION keyword
The UNION keyword combines the result sets of two queries into one result set. Most databases will allow you to choose between UNION DISTINCT (which discards duplicate rows from the combined result set) and UNION ALL (which just combines the result sets without applying any duplication check). You can apply sorting (ORDER BY) and limiting (LIMIT) on the UNION's result set, the same way you can apply them on a regular query.

ORDER BY clause
Sorting takes place once the database has the entire result set ready (after filtering, grouping and duplication removal). Once we have that, the database can sort the result set using columns, selected aliases, or aggregation functions, even if they aren't part of the selected data. The only exception is when using the DISTINCT keyword, which prevents sorting by a non-selected column, as in that case the result set's order would be undefined. You can choose to sort the data in descending (DESC) or ascending (ASC) order. The order can be unique for each of the order parts, so the following is valid: ORDER BY firstname ASC, age DESC
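As a small illustration of the HAVING, window function and ORDER BY steps described above, here is a sketch reusing the hypothetical products table from the earlier example; it assumes a database with window function support, such as MySQL 8.0:

SELECT warehouse_id,
       COUNT(*) AS products_in_warehouse,       -- per-bucket aggregate
       SUM(COUNT(*)) OVER () AS products_total  -- window function over the aggregated buckets
FROM products
GROUP BY warehouse_id
HAVING SUM(available_stock) > 0                 -- filters whole buckets, not individual rows
ORDER BY products_in_warehouse DESC;            -- sorting by a SELECT alias is allowed here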
LIMIT and OFFSET
In most use cases (excluding a few like reporting), we would want to discard all rows but the first X rows of the query's result. The LIMIT clause, which is executed after sorting, allows us to do just that. In addition, you can choose which row to start fetching the data from and how many rows to skip, using a combination of the LIMIT and OFFSET keywords. The following example skips the first 100 rows and fetches the next 50: LIMIT 50 OFFSET 100
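Putting it all together, a single query touching most of the clauses above would be evaluated in the logical order shown in the comments (the customers and orders tables are hypothetical, used only to illustrate the flow):

SELECT c.country, COUNT(*) AS shipped_orders  -- 5. SELECT (reuses the aggregates computed in step 3)
FROM customers c                              -- 1. FROM and JOINs build the working set
JOIN orders o ON o.customer_id = c.id
WHERE o.status = 'shipped'                    -- 2. WHERE filters individual rows
GROUP BY c.country                            -- 3. GROUP BY splits the rows into buckets
HAVING COUNT(*) > 10                          -- 4. HAVING filters whole buckets
ORDER BY shipped_orders DESC                  -- 6. ORDER BY sorts the final result set
LIMIT 50 OFFSET 100;                          -- 7. LIMIT/OFFSET keep only the requested slice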
  2. Prometheus 2 time series database (TSDB) is an amazing piece of engineering, offering a dramatic improvement compared to "v2" storage in Prometheus 1 in terms of ingest performance, query performance and resource use efficiency. As we've been adopting Prometheus 2 in Percona Monitoring and Management (PMM), I had a chance to look into the performance of Prometheus 2 TSDB. This blog post details my observations.

Understanding the typical Prometheus workload
For someone who has spent their career working with general purpose databases, the typical workload of Prometheus is quite interesting. The ingest rate tends to remain very stable: typically, the devices you monitor will send approximately the same amount of metrics all the time, and infrastructure tends to change relatively slowly. Queries to the data can come from multiple sources. Some of them, such as alerting, tend to be very stable and predictable too. Others, such as users exploring data, can be spiky, though it is not common for this to be the largest part of the load.

The Benchmark
In my assessment, I focused on handling an ingest workload. I deployed Prometheus 2.3.2 compiled with Go 1.10.1 (as part of PMM 1.14) on Linode using this StackScript. For a maximally realistic load generation, I spun up multiple MySQL nodes running some real workloads (Sysbench TPC-C Test), with each emulating 10 nodes running MySQL and Linux using this StackScript. The observations below are based on a Linode instance with eight virtual cores and 32GB of memory, running 20 load-driving nodes simulating the monitoring of 200 MySQL instances. Or, in Prometheus terms: some 800 targets, 440 scrapes/sec, 380K samples ingested per second and 1.7M active time series.

Design Observations
The conventional approach of traditional databases, and the approach that Prometheus 1.x used, is to limit the amount of memory. If this amount of memory is not enough to handle the load, you will have high latency and some queries (or scrapes) will fail. Prometheus 2 memory usage is instead configured by storage.tsdb.min-block-duration, which determines how long samples will be stored in memory before they are flushed (the default being 2h). How much memory it requires will depend on the number of time series, the number of labels you have, and your scrape frequency, in addition to the raw ingest rate. On disk, Prometheus tends to use about three bytes per sample. Memory requirements, though, will be significantly higher.

While the configuration knob exists to change the head block size, tuning this by users is discouraged. So you're limited to providing Prometheus 2 with as much memory as it needs for your workload. If there is not enough memory for Prometheus to handle your ingest rate, then it will crash with an out-of-memory error message or will be killed by the OOM killer. Adding more swap space as a "backup" in case Prometheus runs out of RAM does not seem to work, as using swap space causes a dramatic memory usage explosion. I suspect swapping does not play well with Go garbage collection.

Another interesting design choice is aligning block flushes to specific times, rather than to time since start: as you can see from this graph, flushes happen every two hours, on the clock. If you change min-block-duration to 1h, these flushes will happen every hour at 30 minutes past the hour. (If you want to see this and other graphs for your Prometheus installation, you can use this Dashboard.
It has been designed for PMM but can work for any Prometheus installation with little adjustment.)

While the active block—called the head block—is kept in memory, blocks containing older data are accessed through mmap(). This eliminates the need to configure a cache separately, but also means you need to allocate plenty of memory for the OS cache if you want to query data older than what fits in the head block. It also means the virtual memory you will see Prometheus 2 using will get very high: do not let it worry you.

Another interesting design choice is the WAL configuration. As you can see in the storage documentation, Prometheus protects from data loss during a crash by having a WAL log. The exact durability guarantees, though, are not clearly described. As of Prometheus 2.3.2, Prometheus flushes the WAL log every 10 seconds, and this value is not user configurable.

Compactions
Prometheus TSDB is designed somewhat similarly to LSM storage engines: the head block is flushed to disk periodically, while at the same time compactions merge a few blocks together to avoid the need to scan too many blocks for queries. Here is the number of data blocks I observed on my system after a 24h workload. If you want more details about storage, you can check out the meta.json file, which has additional information about the blocks you have and how they came about.

{
    "ulid": "01CPZDPD1D9R019JS87TPV5MPE",
    "minTime": 1536472800000,
    "maxTime": 1536494400000,
    "stats": {
        "numSamples": 8292128378,
        "numSeries": 1673622,
        "numChunks": 69528220
    },
    "compaction": {
        "level": 2,
        "sources": [
            "01CPYRY9MS465Y5ETM3SXFBV7X",
            "01CPYZT0WRJ1JB1P0DP80VY5KJ",
            "01CPZ6NR4Q3PDP3E57HEH760XS"
        ],
        "parents": [
            {
                "ulid": "01CPYRY9MS465Y5ETM3SXFBV7X",
                "minTime": 1536472800000,
                "maxTime": 1536480000000
            },
            {
                "ulid": "01CPYZT0WRJ1JB1P0DP80VY5KJ",
                "minTime": 1536480000000,
                "maxTime": 1536487200000
            },
            {
                "ulid": "01CPZ6NR4Q3PDP3E57HEH760XS",
                "minTime": 1536487200000,
                "maxTime": 1536494400000
            }
        ]
    },
    "version": 1
}

Compactions in Prometheus are triggered at the time the head block is flushed, and several compactions may be performed at these intervals. Compactions do not seem to be throttled in any way, causing huge spikes of disk IO usage and a spike in CPU usage when they run. This, of course, can have a negative impact on system performance. This is also one of the greatest questions in LSM engines: how to run compactions to maintain great query performance without causing too much overhead. Memory utilization as it relates to the compaction process is also interesting: we can see that after compaction a lot of memory changes from "Cached" to "Free", meaning potentially valuable data is washed out from memory.
I wonder if fadvise() or other techniques to minimize data washout from cache are in use, or if this is caused by the fact that the blocks which were cached are destroyed by the compaction process.

Crash Recovery
Crash recovery from the log file takes time, though it is reasonable. For an ingest rate of about 1 million samples/sec, I observed some 25 minutes of recovery time on SSD storage:

level=info ts=2018-09-13T13:38:14.09650965Z caller=main.go:222 msg="Starting Prometheus" version="(version=2.3.2, branch=v2.3.2, revision=71af5e29e815795e9dd14742ee7725682fa14b7b)"
level=info ts=2018-09-13T13:38:14.096599879Z caller=main.go:223 build_context="(go=go1.10.1, user=Jenkins, date=20180725-08:58:13OURCE)"
level=info ts=2018-09-13T13:38:14.096624109Z caller=main.go:224 host_details="(Linux 4.15.0-32-generic #35-Ubuntu SMP Fri Aug 10 17:58:07 UTC 2018 x86_64 1bee9e9b78cf (none))"
level=info ts=2018-09-13T13:38:14.096641396Z caller=main.go:225 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2018-09-13T13:38:14.097715256Z caller=web.go:415 component=web msg="Start listening for connections" address=:9090
level=info ts=2018-09-13T13:38:14.097400393Z caller=main.go:533 msg="Starting TSDB..."
level=info ts=2018-09-13T13:38:14.098718401Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536530400000 maxt=1536537600000 ulid=01CQ0FW3ME8Q5W2AN5F9CB7R0R
level=info ts=2018-09-13T13:38:14.100315658Z caller=web.go:467 component=web msg="router prefix" prefix=/prometheus
level=info ts=2018-09-13T13:38:14.101793727Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536732000000 maxt=1536753600000 ulid=01CQ78486TNX5QZTBF049PQHSM
level=info ts=2018-09-13T13:38:14.102267346Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536537600000 maxt=1536732000000 ulid=01CQ78DE7HSQK0C0F5AZ46YGF0
level=info ts=2018-09-13T13:38:14.102660295Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536775200000 maxt=1536782400000 ulid=01CQ7SAT4RM21Y0PT5GNSS146Q
level=info ts=2018-09-13T13:38:14.103075885Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536753600000 maxt=1536775200000 ulid=01CQ7SV8WJ3C2W5S3RTAHC2GHB
level=error ts=2018-09-13T14:05:18.208469169Z caller=wal.go:275 component=tsdb msg="WAL corruption detected; truncating" err="unexpected CRC32 checksum d0465484, want 0" file=/opt/prometheus/data/.prom2-data/wal/007357 pos=15504363
level=info ts=2018-09-13T14:05:19.471459777Z caller=main.go:543 msg="TSDB started"
level=info ts=2018-09-13T14:05:19.471604598Z caller=main.go:603 msg="Loading configuration file" filename=/etc/prometheus.yml
level=info ts=2018-09-13T14:05:19.499156711Z caller=main.go:629 msg="Completed loading of configuration file" filename=/etc/prometheus.yml
level=info ts=2018-09-13T14:05:19.499228186Z caller=main.go:502 msg="Server is ready to receive web requests."

The problem I observed with recovery is that it is very memory intensive. While the server may be capable of handling the normal load with memory to spare, if it crashes it may never be able to recover because it runs out of memory. The only solution I found for this is to disable scraping, let it perform crash recovery, and then restart the server with scraping enabled.

Warmup
Another behavior to keep in mind is the need for warmup: a lower performance/higher resource usage ratio immediately after start.
In some—but not all—starts I can observe significantly higher initial CPU and memory usage. The gaps in the memory utilization graph show that Prometheus is not initially able to perform all the scrapes configured, and as such some data is lost. I have not profiled what exactly causes this extensive CPU and memory consumption. I suspect it might be happening when new time series entries are created in the head block at a high rate.

CPU Usage Spikes
Besides compaction—which is quite heavy on disk IO—I can also observe significant CPU spikes about every 2 minutes. These are longer with a higher ingest ratio. They seem to be caused by Go garbage collection: during these spikes, at least some CPU cores are completely saturated. These spikes are not just cosmetic. It looks like when they happen, the Prometheus internal /metrics endpoint becomes unresponsive, producing data gaps during the exact time that the spikes occur. We can also see the Prometheus exporter hitting a one second timeout, and we can observe that this correlates with garbage collection.

Conclusion
Prometheus 2 TSDB offers impressive performance, being able to handle a cardinality of millions of time series, and also to handle hundreds of thousands of samples ingested per second on rather modest hardware. CPU and disk IO usage are both very impressive. I got up to 200K metrics/sec per used CPU core! For capacity planning purposes you need to ensure that you have plenty of memory available, and it needs to be real RAM. The actual amount of memory I observed was about 5GB per 100K samples/sec ingest rate, which with additional space for the OS cache makes it 8GB or so. There is work that remains to be done to avoid CPU and IO usage spikes, though this is not unexpected considering how young Prometheus 2 TSDB is: if we look at InnoDB, TokuDB, RocksDB and WiredTiger, all of them had similar problems in their initial releases.

The post Prometheus 2 Times Series Storage Performance Analyses appeared first on Percona Database Performance Blog.
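As a rough back-of-the-envelope check of these figures against the benchmark setup described above (380K samples/sec ingested on a 32GB instance, roughly 3 bytes per sample on disk, about 5GB of RAM per 100K samples/sec), the arithmetic works out to approximately:

\[
\text{RAM} \approx 380\text{K} \times \frac{5\ \text{GB}}{100\text{K}} \approx 19\ \text{GB},
\qquad
\text{disk} \approx 380{,}000 \times 3\ \text{B/sample} \times 86{,}400\ \text{s/day} \approx 98\ \text{GB/day}
\]

which is consistent with the 32GB instance used for the benchmark, leaving roughly 13GB for the OS cache. This estimate is my own arithmetic based on the numbers quoted in the post, not a figure from the post itself.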
  3. The MySQL privilege system is small; almost all administrative tasks can be completed using a handful of privileges. If we exclude generic ones such as ALL, SHOW DATABASES and USAGE, and the create and drop permissions such as CREATE USER, DROP ROLE or CREATE TABLESPACE, the number of privileges remaining is really limited: PROCESS, PROXY, RELOAD, REPLICATION CLIENT, REPLICATION SLAVE, SHUTDOWN, and SUPER. Having such a reduced list of privileges means that it is very difficult to control what a connected session can do. For example, if a user has privileges to stop replication, it also has privileges to start it, and also to configure it. Actually, it has rights to do almost everything, as the privilege required to stop replication is SUPER.

MySQL 8 improves this by introducing Dynamic Privileges. There are 18 Dynamic Privileges. But again, the granularity of these privileges is really limited. For example, REPLICATION_SLAVE_ADMIN allows the user to start and stop replication, change master, and change replication filters. Again, if we need to grant an account only the ability to start and stop replication, this is not possible without providing additional permissions. But how could we avoid granting too much power?

What happens in Procedures stays in Procedures
One interesting feature of procedures, functions, and views is the SECURITY CONTEXT. There are two security contexts: INVOKER and DEFINER. A procedure created with the invoker security context will be executed using the privileges of the invoker account. But the default security context is definer: a procedure created with the definer security context will be executed with the privileges of the definer at execution time. In fact, during the execution of a procedure created using the definer security context, the processlist table and the SHOW PROCESSLIST command will display the definer in the user column instead of the connected user. This means that using procedures is a great way to raise the permissions and execute privileged code, while the elevated privileges remain restricted to the code within the procedure.

Impossible is nothing
But what can procedures do? What are the limitations of code executed within a procedure? Well, it is possible to run almost any MySQL statement in a procedure. You can start and stop replication, change master, change both local and global variables, and more. The list of statements that are not permitted is: LOCK TABLES/UNLOCK TABLES, ALTER VIEW, LOAD DATA and LOAD TABLE. Let's see one example of a valid procedure:

DELIMITER //
CREATE DEFINER=`root`@`localhost` PROCEDURE show_processlist()
BEGIN
  show processlist;
END //
DELIMITER ;

The only small inconvenience is that procedures must belong to a database schema.
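Coming back to the earlier wish of granting an account only the ability to start and stop replication, here is a minimal sketch of how definer procedures could expose just that. The procedure names are illustrative and not from the original post, and they assume the pythian schema and test_user account created in the example that follows:

DELIMITER //
CREATE DEFINER=`root`@`localhost` PROCEDURE pythian.start_replication()
BEGIN
  START SLAVE;   -- runs with the definer's (root) privileges
END //
CREATE DEFINER=`root`@`localhost` PROCEDURE pythian.stop_replication()
BEGIN
  STOP SLAVE;
END //
DELIMITER ;
-- The account only needs EXECUTE on these two procedures, nothing more
GRANT EXECUTE ON PROCEDURE pythian.start_replication TO test_user;
GRANT EXECUTE ON PROCEDURE pythian.stop_replication TO test_user;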
Let's see the results of the show_processlist procedure:

$ mysql -u root -p
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 112
Server version: 5.7.17 MySQL Community Server (GPL)

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> CREATE DATABASE pythian;
mysql> USE pythian;
mysql> DELIMITER //
mysql> CREATE PROCEDURE show_processlist()
    -> BEGIN
    -> show processlist;
    -> END //
Query OK, 0 rows affected (0,00 sec)

mysql> DELIMITER ;
mysql> CREATE USER test_user@'%' identified by 'test';
Query OK, 0 rows affected (0,01 sec)

mysql> GRANT EXECUTE ON PROCEDURE pythian.show_processlist TO test_user;
Query OK, 0 rows affected (0,00 sec)

mysql> exit

And now let's call the procedure with our unprivileged user:

$ mysql -s -u test_user -ptest pythian
mysql: [Warning] Using a password on the command line interface can be insecure.
mysql> call show_processlist;
Id   User  Host       db       Command  Time  State                 Info
112  root  localhost  pythian  Sleep    3                           NULL
116  root  localhost  pythian  Query    0     checking permissions  show processlist
mysql> show grants for current_user();
Grants for test_user@%
GRANT USAGE ON *.* TO 'test_user'@'%'
GRANT EXECUTE ON PROCEDURE `pythian`.`show_processlist` TO 'test_user'@'%'
mysql>

Preparation is the key to success
We've seen that it is possible to execute simple administrative statements from a procedure, but what if we need to execute more complex statements? The answer is a quote from Alexander Graham Bell: "Before anything else, preparation is the key to success", or to be more precise, "Prepared statements are the key to success." By using prepared statements you can craft the command to execute using parameters or data stored in tables. Let's see one example. Execute as root:

DELIMITER //
CREATE PROCEDURE pythian.change_master(newmaster varchar(256))
BEGIN
  SET @change_master=CONCAT('CHANGE MASTER TO MASTER_HOST=\'',newmaster,'\'');
  PREPARE cmtm FROM @change_master;
  SET sql_log_bin=FALSE;
  EXECUTE cmtm;
  DEALLOCATE PREPARE cmtm;
  SET sql_log_bin=TRUE;
END //

CREATE PROCEDURE show_slave_status()
BEGIN
  show slave status;
END //
DELIMITER ;

GRANT EXECUTE ON PROCEDURE pythian.change_master TO test_user;
GRANT EXECUTE ON PROCEDURE pythian.show_slave_status TO test_user;

Then connect as test_user and check:

mysql> call pythian.show_slave_status;
mysql> call pythian.change_master('master2');
mysql> call show_slave_status\G
*************************** 1. row ***************************
Slave_IO_State:
Master_Host: master2
Master_User:
Master_Port: 3306
Connect_Retry: 60
Master_Log_File:
Read_Master_Log_Pos: 4
Relay_Log_File: iMac-de-Pep-relay-bin.000001
Relay_Log_Pos: 4
Relay_Master_Log_File:
Slave_IO_Running: No
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 0
Relay_Log_Space: 154
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 0
Master_UUID:
Master_Info_File: /opt/mysql/data/master.info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set:
Auto_Position: 0
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:

Security First
We have seen how we can use procedures to add more granularity to MySQL privileges, but you must be careful when developing your administrative procedures, as they will be executed with higher privileges. Always sanitize and check your inputs to avoid SQL injection. And remember that code is replicated to slaves, so calling a procedure in the replication chain can be replicated to all the slaves. My recommendation is that you explicitly disable binary logging for the execution of this type of procedure.
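As one hedged illustration of that advice, the change_master procedure from the example above could validate its input before building the prepared statement and keep the operation out of the binary log. The REGEXP check, SIGNAL error and procedure name below are my own additions, not from the original post:

DELIMITER //
CREATE DEFINER=`root`@`localhost` PROCEDURE pythian.change_master_safe(newmaster VARCHAR(256))
BEGIN
  -- Reject anything that does not look like a plain host name, to avoid SQL injection
  IF newmaster NOT REGEXP '^[A-Za-z0-9._-]+$' THEN
    SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'Invalid master host name';
  END IF;
  SET @change_master = CONCAT('CHANGE MASTER TO MASTER_HOST=\'', newmaster, '\'');
  PREPARE cmtm FROM @change_master;
  -- Keep this administrative change out of the binary log so it is not replicated to slaves
  SET sql_log_bin = FALSE;
  EXECUTE cmtm;
  DEALLOCATE PREPARE cmtm;
  SET sql_log_bin = TRUE;
END //
DELIMITER ;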
  4. As you may know, sometimes MySQL can be memory-hungry. Of course, having data in memory is always better than on disk: RAM is still much faster than any SSD. This is the reason why we recommend keeping the working set in memory as much as possible (I assume you are using InnoDB, of course). This is also why you don't want MySQL to use swap, but don't forget that a slow MySQL is always better than no MySQL at all, so do set up a swap partition but try to avoid using it. In fact, I have seen many people just remove the swap partition... and then the OOM killer did its job... and mysqld is often its first victim.

MySQL allocates buffers and caches to improve the performance of database operations. That process is explained in detail in the manual. In this article series, I will provide some information to check MySQL's memory consumption and what configuration settings or actions can be taken to understand and control the memory usage of MySQL. We will start the series with the operating system.

Operating System
At the OS level, there are some commands we can use to understand MySQL's memory usage.

Memory Usage
You can check mysqld's memory usage from the command line:

# ps -eo size,pid,user,command --sort -size | grep [m]ysqld \
  | awk '{ hr=$1/1024 ; printf("%13.2f Mb ",hr) } { for ( x=4 ; x<=NF ; x++ ) { printf("%s ",$x) } print "" }' \
  | cut -d "" -f2 | cut -d "-" -f1
      1841.10 Mb /usr/sbin/mysqld
         0.46 Mb /bin/sh /usr/bin/mysqld_safe

top can also be used to verify this. For top 3.2:

# top -ba -n1 -p $(pidof mysqld) | grep PID -A 1
  PID USER   PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+   COMMAND
 1752 mysql  20   0 1943m 664m  15m S  0.0 11.1  19:55.13 mysqld

# top -ba -n1 -m -p $(pidof mysqld) | grep PID -A 1
  PID USER   PR  NI  USED  RES  SHR S %CPU %MEM   TIME+   COMMAND
 1752 mysql  20   0  664m 664m  15m S  2.0 11.1  19:55.17 mysqld

For more recent versions of top, you can use:

# top -b -o %MEM -n1 -p $(pidof mysqld) | grep PID -A 1

VIRT represents the total amount of virtual memory used by mysqld. It includes all code, data and shared libraries, plus pages that may have been swapped out. USED reports the sum of the process RSS (resident set size, the portion of memory occupied by a process that is held in RAM) and the swap total count. We will see later what we can check from the MySQL client.

SWAP
So we see that this can eventually include the swapped pages too. Let's check if mysqld is using swap. The first thing to do is to check whether the machine already has anything in swap:

# free -m
             total       used       free     shared    buffers     cached
Mem:          5965       4433       1532        128        454       2359
-/+ buffers/cache:       1619       4346
Swap:         2045         30       2015

We can see that a small amount of swap is used (30MB). Is it MySQL? Let's verify:

# cat /proc/$(pidof mysqld)/status | grep Swap
VmSwap:        0 kB

Great, mysqld is not swapping.
In case you really want to know which processes have swapped, run the following command:

for i in $(ls -d /proc/[0-9]*)
do
  out=$(grep Swap $i/status 2>/dev/null)
  if [ "x$(echo $out | awk '{print $2}')" != "x0" ] && [ "x$(echo $out | awk '{print $2}')" != "x" ]
  then
    echo "$(ps -p $(echo $i | cut -d'/' -f3) | tail -n 1 | awk '{print $4}'): $(echo $out | awk '{print $2 $3}')"
  fi
done

Of course, the pages in swap could have been there for a long time already and never been used since... To be sure, I recommend using vmstat and checking the si and so columns (a trending system is highly recommended):

# vmstat 1 10
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd    free    buff   cache   si   so   bi    bo   in   cs us sy id wa st
 0  0  31252 1391672 505840 2523844    0    0    2    57    5    2  3  1 96  0  0
 1  0  31252 1392664 505840 2523844    0    0    0   328  358  437  6  1 92  1  0
 0  0  31252 1390820 505856 2523932    0    0    0  2024 1312 2818 28  3 67  2  0
 0  0  31252 1391440 505860 2523980    0    0    0   596  484  931  1  1 98  1  0
 0  0  31252 1391440 505860 2523980    0    0    0  1964  500  970  0  1 96  3  0
 0  0  31252 1391440 505860 2523980    0    0    0    72  255  392  0  0 98  2  0
 0  0  31252 1391440 505860 2523980    0    0    0     0  222  376  0  0 99  0  0
 0  0  31252 1391440 505908 2524096    0    0    0  3592 1468 2095 34  6 52  9  0
 0  0  31252 1391688 505928 2524092    0    0    0  1356  709 1179 12  1 85  2  0
 0  0  31252 1390696 505928 2524092    0    0    0   152  350  950  4  6 90  1  0

On this server, we can see that mysqld is not using swap. But if it were, and some free RAM was still available, what could be done? In that case, you should check two direct causes: swappiness and NUMA.

Swappiness
The swappiness parameter controls the tendency of the kernel to move processes out of physical memory and onto the swap disk partition. As I explained earlier, disks are much slower than RAM, therefore this leads to slower response times for the system and applications if processes are too aggressively moved out of memory. A high swappiness value means that the kernel will be more apt to unmap mapped pages; a low swappiness value means the opposite. In other words, the higher the swappiness value, the more the system will swap! The default value (60) is too high for a dedicated MySQL server and should be reduced. Note that with older Linux kernels (prior to 2.6.32), 0 meant that the kernel should avoid swapping processes out of physical memory for as long as possible; now the same value completely disables the use of swap. I recommend setting it to 1 or 5:

# sysctl -w vm.swappiness=1

NUMA
For servers having multiple NUMA nodes, the recommendation is to set the NUMA mode to interleaved, which balances memory allocation across all nodes. MySQL 8.0 supports NUMA for InnoDB.
You just need to enable it in your configuration:

innodb_numa_interleave = 1

To check if you have multiple NUMA nodes, you can use numactl -H. Here are two different outputs:

# numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 64379 MB
node 0 free: 2637 MB
node distances:
node   0
  0:  10

# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 2 4 6
node 0 size: 8182 MB
node 0 free: 221 MB
node 1 cpus: 8 10 12 14
node 1 size: 8192 MB
node 1 free: 49 MB
node 2 cpus: 9 11 13 15
node 2 size: 8192 MB
node 2 free: 4234 MB
node 3 cpus: 1 3 5 7
node 3 size: 8192 MB
node 3 free: 5454 MB
node distances:
node   0   1   2   3
  0:  10  16  16  16
  1:  16  10  16  16
  2:  16  16  10  16
  3:  16  16  16  10

We can see in the second output that when there are multiple NUMA nodes, by default the memory is not spread equally between all those nodes. This can lead to more swapping. Check these two nice articles from Jeremy Cole explaining this behavior:

http://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/
http://blog.jcole.us/2012/04/16/a-brief-update-on-numa-and-mysql/

Filesystem Cache
Another point we can check from the OS is the filesystem cache. By default, Linux uses the filesystem cache for all I/O accesses (this is one of the reasons why using MyISAM is not recommended: that storage engine relies on the FS cache and can lead to losing data, as Linux may sync those writes only every 10 seconds or so). Of course, as you are using InnoDB with O_DIRECT as innodb_flush_method, MySQL will bypass the filesystem cache (InnoDB already has enough optimized caches anyway, and one extra is not necessary). InnoDB will then not use any FS cache memory for the data files (*.ibd). But there are of course other files used by MySQL that will still use the FS cache. Let's check this example:

# ~fred/dbsake fincore binlog.000017
binlog.000017: total_pages=120841 cached=50556 percent=41.84

# ls -lh binlog.000017
-rw-r----- 1 mysql mysql 473M Sep 18 07:17 binlog.000017

# free -m
             total       used       free     shared    buffers     cached
Mem:          5965       4608       1356        128        435       2456
-/+ buffers/cache:       1716       4249
Swap:         2045         30       2015

# ~fred/dbsake uncache binlog.000017
Uncached binlog.000017

# free -m
             total       used       free     shared    buffers     cached
Mem:          5965       4413       1552        128        435       2259
-/+ buffers/cache:       1718       4247
Swap:         2045         30       2015

Some explanations: I started by checking how much of one binary log was present in the filesystem cache (using dbsake fincore), and we could see that 42% of the 473M file was sitting in RAM as FS cache. Then I forced an unmap of those pages from the cache (using dbsake uncache), and finally you can see that we freed roughly 195MB of RAM. You might be surprised to see which logs or datafiles are using the FS cache (after making a file copy, for example), so I really encourage you to verify this.

The next article will be about what can be seen from MySQL's side and what the best configuration practices are.
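As a small addendum, if you want to double-check both of the settings mentioned above from the MySQL client, a quick sketch (note that innodb_numa_interleave only exists on MySQL 5.7+ builds compiled with NUMA support):

SHOW GLOBAL VARIABLES LIKE 'innodb_flush_method';
SHOW GLOBAL VARIABLES LIKE 'innodb_numa_interleave';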
  5. When the NDB version is upgraded, the underlying MySQL server must be upgraded as well. The internal table storage format is different in MySQL 5.7 and for new tables created on MySQL 5.6: row_format and temporal data type (TIME, DATETIME, etc.) storage are handled differently in 5.7. You can find the details of how MySQL handles temporals in these versions at the links below. The new format stores time values with microsecond resolution, and also changes the way DATETIME is stored, which is more optimized.

https://dev.mysql.com/doc/refman/5.7/en/time.html
https://dev.mysql.com/doc/refman/5.7/en/storage-requirements.html#data-types-storage-reqs-date-time

Here is a quick glance at how temporals are handled in different versions:

5.5 – tables in the old format
5.6 – supports both old and new format tables. New tables will be created in the new format. Alter existing tables if you want to bring them into the new format.
5.7 – all tables must be in the new format.

The challenge!
MySQL 5.7 mandates that existing tables in the older format (created on 5.5 and before) be promoted to the new format, which requires an ALTER TABLE before the upgrade (see the sketch at the end of this item) or running the mysql_upgrade program right after the binary upgrade. This can be a very lengthy process for a cluster with huge tables, and sometimes the alter operation fails or gets stuck with no progress. Explained below is a very efficient method for upgrading large databases using the ndb backup and restore programs, which is much faster than any other approach such as a mysqldump restore, ALTER TABLE, or mysql_upgrade.

How we solved it!
1. Create a new MySQL Cluster 7.6 and start it with --initial, which creates the data nodes empty.
2. Create an ndb backup of the 7.4 cluster.
3. Create a schema backup of the MySQL Cluster 7.4 using mysqldump (no-data option).
4. Restore the metadata alone to the new MySQL Cluster 7.6 from the ndb backup created in step 2.
5. Restore the schema dump to the 7.6 cluster, which will just drop and re-create the tables, but in the new format. Note that this is just the structure, with no data.
6. Restore the MySQL Cluster 7.4 backup DATA ONLY onto the new MySQL Cluster 7.6, using the 7.6 version of ndb_restore with the --promote-attributes option. Restoring the 7.4 backup to 7.6 creates all MySQL objects under the newer version 5.7.
7. Do a rolling restart of all API and data nodes.

That's it! All set and you are good to go 😀 Hope this helps! Thank you. 💂
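As referenced under "The challenge!" above, the traditional per-table approach that this method avoids would look something like the sketch below. The table name is hypothetical, and on huge tables the rebuild can take a very long time:

-- Rebuilds the table, converting old-format temporal columns to the new format
ALTER TABLE mydb.big_table FORCE;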