MySQL Reference Manual by Kaj Arno, David Axmark, Michael Widenius

InnoDB Tables

InnoDB Tables Overview

InnoDB provides MySQL with a transaction-safe (ACID-compliant) table handler with commit, rollback, and crash recovery capabilities. InnoDB locks at the row level and also provides an Oracle-style consistent non-locking read in SELECTs. These features increase multi-user concurrency and performance. There is no need for lock escalation in InnoDB because row-level locks in InnoDB fit in very little space. InnoDB is the first MySQL table type to support FOREIGN KEY constraints.

InnoDB has been designed for maximum performance when processing large data volumes. Its CPU efficiency is probably not matched by any other disk-based relational database engine.

Technically, InnoDB is a complete database backend placed under MySQL. InnoDB has its own buffer pool for caching data and indexes in main memory. InnoDB stores its tables and indexes in a tablespace, which may consist of several files. This is different from, for example, MyISAM tables, where each table is stored as a separate file. InnoDB tables can be of any size, even on operating systems where file size is limited to 2G.

You can find the latest information about InnoDB at http://www.innodb.com/. The most up-to-date version of the InnoDB manual is always placed there, and you can also order commercial licenses and support for InnoDB.

InnoDB is currently (October 2001) used in production at several large database sites requiring high performance. The famous Internet news site Slashdot.org runs on InnoDB. Mytrix, Inc. stores over 1TB of data in InnoDB, and another site handles an average load of 800 inserts/updates per second in InnoDB.

InnoDB tables are included in the MySQL source distribution starting from 3.23.34a and are activated in the MySQL-Max binary. For Windows the -Max binaries are contained in the standard distribution.

If you have downloaded a binary version of MySQL that includes support for InnoDB, simply follow the instructions of the MySQL manual for installing a binary version of MySQL. If you already have MySQL-3.23 installed, then the simplest way to install MySQL-Max is to replace the server executable mysqld with the corresponding executable in the -Max distribution. MySQL and MySQL-Max differ only in the server executable. See Section 2.2.7. See Section 4.7.5.

To compile MySQL with InnoDB support, download MySQL-3.23.34a or a newer version from http://www.mysql.com/ and configure MySQL with the --with-innodb option. See the MySQL manual about installing a MySQL source distribution. See Section 2.3.

cd /path/to/source/of/mysql-3.23.37
./configure --with-innodb

To use InnoDB you have to specify InnoDB startup options in your my.cnf or my.ini file. The minimal change is to add an innodb_data_file_path line to the [mysqld] section, but to get good performance it is best that you specify options as recommended. See Section 7.5.2.

InnoDB is distributed under the GNU GPL License Version 2 (of June 1991). In the source distribution of MySQL, InnoDB appears as a subdirectory.

InnoDB Startup Options

To use InnoDB tables in MySQL-Max-3.23 you MUST specify configuration parameters in the [mysqld] section of the configuration file my.cnf, or on Windows optionally in my.ini.

At the minimum, in 3.23 you must specify innodb_data_file_path. In MySQL-4.0 you do not need to specify even innodb_data_file_path: the default is to create an auto-extending 16M file ibdata1 in the datadir of MySQL. (In MySQL-4.0.0 and 4.0.1 the data file is 64M and not auto-extending.)

But to get good performance you MUST explicitly set the InnoDB parameters listed in the following examples.

Starting from versions 3.23.50 and 4.0.2 InnoDB allows the last data file on the innodb_data_file_path line to be specified as auto-extending. The syntax for innodb_data_file_path is then the following:

pathtodatafile:sizespecification;pathtodatafile:sizespecification;...
...  ;pathtodatafile:sizespecification[:autoextend[:max:sizespecification]]

If you specify the last data file with the autoextend option, InnoDB will extend the last data file if it runs out of free space in the tablespace. The increment is 8M at a time. An example:

innodb_data_file_path = /ibdata/ibdata1:100M:autoextend

instructs InnoDB to create just a single data file whose initial size is 100M and which is extended in 8M blocks when space runs out. If the disk becomes full you may want to add another data file to another disk, for example. Then you have to look at the size of ibdata1, round the size downward to the closest multiple of 1024 * 1024 bytes (= 1M), and specify the rounded size of ibdata1 explicitly in innodb_data_file_path. After that you can add another data file:

innodb_data_file_path = /ibdata/ibdata1:988M;/disk2/ibdata2:50M:autoextend
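The rounding step can be sketched in shell. This is only an illustration: the function name and the byte count are assumptions, and in practice you would take the byte count from a directory listing of ibdata1.

```shell
#!/bin/sh
# Round a file size in bytes down to whole megabytes (1M = 1024 * 1024 bytes)
# and print it in the form innodb_data_file_path expects.
round_down_to_mb() {
    echo "$(( $1 / 1048576 ))M"
}

# Hypothetical size of ibdata1 in bytes, for example as shown by ls -l.
size_bytes=1036779520
round_down_to_mb "$size_bytes"
```

Integer division discards the partial megabyte, which is the downward rounding the text asks for.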

Be cautious on filesystems where the maximum file-size is 2G! InnoDB is not aware of the OS maximum file-size. On those filesystems you might want to specify the max size for the data file:

innodb_data_file_path = /ibdata/ibdata1:100M:autoextend:max:2000M

Suppose you have a Windows NT computer with 128M of RAM and a single 10G hard disk. The following is an example of possible configuration parameters in my.cnf or my.ini for InnoDB:

# You can write your other MySQL server options here
# ...
innodb_data_home_dir = c:\ibdata
#                                  Datafiles must be able to
#                                  hold your data and indexes
innodb_data_file_path = ibdata1:2000M;ibdata2:2000M
#                                  Set buffer pool size to 50 - 80 %
#                                  of your computer's memory
set-variable = innodb_buffer_pool_size=70M
set-variable = innodb_additional_mem_pool_size=10M
innodb_log_group_home_dir = c:\iblogs
#                                  .._log_arch_dir must be the same
#                                  as .._log_group_home_dir
innodb_log_arch_dir = c:\iblogs
set-variable = innodb_log_files_in_group=3
#                                  Set the log file-size to about
#                                  15 % of the buffer pool size
set-variable = innodb_log_file_size=10M
set-variable = innodb_log_buffer_size=8M
#                                  Set ..flush_log_at_trx_commit to
#                                  0 if you can afford losing
#                                  a few last transactions
innodb_flush_log_at_trx_commit=1
set-variable = innodb_file_io_threads=4
set-variable = innodb_lock_wait_timeout=50

Note that InnoDB does not create directories: you must create them yourself. Use the Unix or MS-DOS mkdir command to create the data and log group home directories. Check also that the MySQL server has the right to create files in the directories you specify.

Note that data files must be < 2G in some filesystems! The combined size of data files must be >= 10M. The combined size of the log files must be < 4G.
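As a rough sketch, the combined-size rule can be checked before starting the server. The function below is hypothetical, not part of MySQL. It assumes each entry has the form path:size with an M or G suffix, and that the path itself contains no colon (so no Windows drive letters).

```shell
#!/bin/sh
# Sum the initial sizes, in megabytes, of the files named in an
# innodb_data_file_path value. 1G is counted as 1024M.
total_data_mb() {
    total=0
    old_ifs=$IFS
    IFS=';'
    for spec in $1; do
        # The size is the second colon-separated field; any
        # :autoextend or :max suffix is ignored.
        size=$(echo "$spec" | cut -d: -f2)
        case $size in
            *G) mb=$(( ${size%G} * 1024 )) ;;
            *M) mb=${size%M} ;;
            *)  IFS=$old_ifs; echo "unrecognised size: $size" >&2; return 1 ;;
        esac
        total=$(( total + mb ))
    done
    IFS=$old_ifs
    echo "$total"
}

sum_mb=$(total_data_mb "ibdata1:2000M;ibdata2:2000M")
if [ "$sum_mb" -lt 10 ]; then
    echo "combined size ${sum_mb}M is below the 10M minimum"
else
    echo "combined data file size: ${sum_mb}M"
fi
```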

If you do not specify innodb_data_home_dir, the default is that InnoDB creates its data files in the datadir of MySQL. Then you cannot use absolute file paths in innodb_data_file_path.

When you create an InnoDB database for the first time, it is best that you start the MySQL server from the command prompt. Then InnoDB will print the information about the database creation to the screen, and you see what is happening. For example, in Windows you can start mysqld-max.exe with:

your-path-to-mysqld>mysqld-max --standalone --console

For information about what the printout should look like, see Section 7.5.3.

Where to put my.cnf or my.ini in Windows? The rules for Windows are the following:

  • Only one my.cnf or my.ini file should be created.

  • The my.cnf file should be placed in the root directory of the drive C:.

  • The my.ini file should be placed in the WINDIR directory, for example C:\WINDOWS or C:\WINNT. You can use the SET command of MS-DOS to print the value of WINDIR.

  • If your PC uses a boot loader where the C: drive is not the boot drive, your only option is to use the my.ini file.

Where to specify options in Unix? On Unix mysqld reads options from the following files, if they exist, in the following order:

  • /etc/my.cnf: global options.

  • COMPILATION_DATADIR/my.cnf: server-specific options.

  • The file specified with --defaults-extra-file=...

  • ~/.my.cnf: user-specific options.

COMPILATION_DATADIR is the MySQL data directory which was specified as a ./configure option when mysqld was compiled (typically /usr/local/mysql/data for a binary installation or /usr/local/var for a source installation).

If you are not sure from where mysqld reads its my.cnf or my.ini, you can give the path as the first command-line option to the server: mysqld --defaults-file=your_path_to_my_cnf.

Suppose you have a Linux computer with 512M of RAM and three 20G hard disks (at directory paths `/', `/dr2', and `/dr3'). Here is an example of possible configuration parameters in my.cnf for InnoDB:

# You can write your other MySQL server options here
# ...
innodb_data_home_dir = /
#                                  Datafiles must be able to
#                                  hold your data and indexes
innodb_data_file_path = ibdata/ibdata1:2000M;dr2/ibdata/ibdata2:2000M
#                                  Set buffer pool size to 50 - 80 %
#                                  of your computer's memory, but
#                                  make sure on Linux x86 total
#                                  memory usage is < 2 GB
set-variable = innodb_buffer_pool_size=350M
set-variable = innodb_additional_mem_pool_size=20M
innodb_log_group_home_dir = /dr3/iblogs
#                                  .._log_arch_dir must be the same
#                                  as .._log_group_home_dir
innodb_log_arch_dir = /dr3/iblogs
set-variable = innodb_log_files_in_group=3
#                                  Set the log file-size to about
#                                  15 % of the buffer pool size
set-variable = innodb_log_file_size=50M
set-variable = innodb_log_buffer_size=8M
#                                  Set ..flush_log_at_trx_commit to
#                                  0 if you can afford losing
#                                  a few last transactions
innodb_flush_log_at_trx_commit=1
set-variable = innodb_file_io_threads=4
set-variable = innodb_lock_wait_timeout=50
#set-variable = innodb_thread_concurrency=5

Note that we have placed the two data files on different disks. The reason for the name innodb_data_file_path is that you can also specify paths to your data files, and innodb_data_home_dir is just textually concatenated before your data file paths, adding a possible slash or backslash in between. InnoDB will fill the tablespace formed by the data files from the bottom up. In some cases it will improve the performance of the database if not all data is placed on the same physical disk. Putting log files on a different disk from data is very often beneficial for performance. You can also use raw disk partitions (raw devices) as data files. In some Unix systems they speed up I/O. See the manual section on InnoDB file space management about how to specify them in my.cnf.

Warning: On Linux x86 you must be careful not to set memory usage too high. glibc will allow the process heap to grow over thread stacks, which will crash your server. It is a risk if the value of:

innodb_buffer_pool_size + key_buffer +
max_connections * (sort_buffer + record_buffer) + max_connections * 2 M

is close to or exceeds 2G. Each thread will use a stack (often 2M, but in MySQL AB binaries only 256 kB) and, in the worst case, also sort_buffer + record_buffer additional memory.
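The estimate can be worked through with shell arithmetic. The figures below are illustrative: the buffer pool, connection count, and per-thread buffers follow the examples in this section, key_buffer=64 is an assumed value, and the thread stack is taken as the 2M used in the formula.

```shell
#!/bin/sh
# Worst-case memory estimate from the formula above, all sizes in megabytes.
innodb_buffer_pool_size=350
key_buffer=64          # assumed value, for illustration only
max_connections=200
sort_buffer=1
record_buffer=1
thread_stack=2

total=$(( innodb_buffer_pool_size + key_buffer + max_connections * (sort_buffer + record_buffer) + max_connections * thread_stack ))
echo "worst-case estimate: ${total}M"

# 2G = 2048M; a result near or above this figure is the risky case.
if [ "$total" -ge 2048 ]; then
    echo "estimate reaches 2G: lower the buffer sizes or max_connections"
else
    echo "headroom below 2G: $(( 2048 - total ))M"
fi
```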

How to tune other mysqld server parameters? Typical values which suit most users are:

set-variable = max_connections=200
set-variable = record_buffer=1M
set-variable = sort_buffer=1M
#                                  Set key_buffer to 5 - 50 %
#                                  of your RAM depending on how
#                                  much you use MyISAM tables, but
#                                  keep key_buffer + InnoDB
#                                  buffer pool size < 80 % of
#                                  your RAM
set-variable = key_buffer=...

Note that numeric parameters are given in the format set-variable = innodb_... = 123, while string and boolean parameters use the format innodb_... = ... .

The meanings of the configuration parameters are as follows:

innodb_data_home_dir

The common part of the directory path for all InnoDB data files. The default for this parameter is the datadir of MySQL.

innodb_data_file_path

Paths to individual data files and their sizes. The full directory path to each data file is acquired by concatenating innodb_data_home_dir to the paths specified here. The file sizes are specified in megabytes, hence the M after the size specification. InnoDB also understands the abbreviation G; 1G meaning 1024M. Starting from 3.23.44 you can set the file size bigger than 4G on those operating systems which support big files. On some operating systems files must be < 2G. The sum of the sizes of the files must be at least 10M.

innodb_mirrored_log_groups

Number of identical copies of log groups we keep for the database. Currently this should be set to 1.

innodb_log_group_home_dir

Directory path to InnoDB log files.

innodb_log_files_in_group

Number of log files in the log group. InnoDB writes to the files in a circular fashion. Value 3 is recommended here.

innodb_log_file_size

Size of each log file in a log group in megabytes. Sensible values range from 1M to 1/nth of the size of the buffer pool specified below, where n is the number of log files in the group. The bigger the value, the less checkpoint flush activity is needed in the buffer pool, saving disk I/O. But bigger log files also mean that recovery will be slower in case of a crash. The combined size of log files must be < 4G on 32-bit computers.

innodb_log_buffer_size

The size of the buffer which InnoDB uses to write logs to the log files on disk. Sensible values range from 1M to half the combined size of the log files. A big log buffer allows large transactions to run without a need to write the log to disk until the transactions commit. Thus, if you have big transactions, making the log buffer big will save disk I/O.

innodb_flush_log_at_trx_commit

Normally this is set to 1, meaning that at a transaction commit the log is flushed to disk, and the modifications made by the transaction become permanent and survive a database crash. If you are willing to compromise this safety, and you are running small transactions, you may set this to 0 to reduce disk I/O to the logs.

innodb_log_arch_dir

The directory where fully written log files would be archived if we used log archiving. The value of this parameter should currently be set the same as innodb_log_group_home_dir.

innodb_log_archive

This value should currently be set to 0. As recovery from a backup is done by MySQL using its own log files, there is currently no need to archive InnoDB log files.

innodb_buffer_pool_size

The size of the memory buffer InnoDB uses to cache data and indexes of its tables. The bigger you set this the less disk I/O is needed to access data in tables. On a dedicated database server you may set this parameter up to 80% of the machine's physical memory size. Do not set it too large, though, because competition for physical memory may cause paging in the operating system.

innodb_additional_mem_pool_size

Size of a memory pool InnoDB uses to store data dictionary information and other internal data structures. A sensible value for this might be 2M, but the more tables you have in your application the more you will need to allocate here. If InnoDB runs out of memory in this pool, it will start to allocate memory from the operating system and write warning messages to the MySQL error log.

innodb_file_io_threads

Number of file I/O threads in InnoDB. Normally, this should be 4, but on Windows disk I/O may benefit from a larger number.

innodb_lock_wait_timeout

Timeout in seconds an InnoDB transaction may wait for a lock before being rolled back. InnoDB automatically detects transaction deadlocks in its own lock table and rolls back the transaction. If you use the LOCK TABLES command, or transaction-safe table handlers other than InnoDB in the same transaction, a deadlock may arise which InnoDB cannot notice. In cases like this the timeout is useful to resolve the situation.

innodb_flush_method

(Available from 3.23.40 up.) The default value for this is fdatasync. Another option is O_DSYNC.

Creating InnoDB Tablespace

Suppose you have installed MySQL and have edited my.cnf so that it contains the necessary InnoDB configuration parameters. Before starting MySQL you should check that the directories you have specified for InnoDB data files and log files exist and that you have access rights to those directories. InnoDB cannot create directories, only files. Check also that you have enough disk space for the data and log files.
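A pre-flight check along these lines can be sketched in shell. The function name and the demonstration directories are hypothetical; substitute the directories named in your own my.cnf, and run the check as the user that mysqld runs as, since the writability test applies to the current user.

```shell
#!/bin/sh
# Report whether a directory exists and is writable by the current user.
check_innodb_dir() {
    if [ ! -d "$1" ]; then
        echo "missing: $1"
        return 1
    fi
    if [ ! -w "$1" ]; then
        echo "not writable: $1"
        return 1
    fi
    echo "ok: $1"
}

# Hypothetical scratch locations standing in for the data and log directories.
demo_root="${TMPDIR:-/tmp}/innodb-demo"
mkdir -p "$demo_root/ibdata" "$demo_root/iblogs"
check_innodb_dir "$demo_root/ibdata"
check_innodb_dir "$demo_root/iblogs"
```

The sketch does not check free disk space; that still has to be compared by hand against the data and log file sizes in my.cnf.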

When you now start MySQL, InnoDB will start creating your data files and log files. InnoDB will print something like the following:

~/mysqlm/sql > mysqld
InnoDB: The first specified data file /home/heikki/data/ibdata1
did not exist:
InnoDB: a new database to be created!
InnoDB: Setting file /home/heikki/data/ibdata1 size to 134217728
InnoDB: Database physically writes the file full: wait...
InnoDB: data file /home/heikki/data/ibdata2 did not exist:
new to be created
InnoDB: Setting file /home/heikki/data/ibdata2 size to 262144000
InnoDB: Database physically writes the file full: wait...
InnoDB: Log file /home/heikki/data/logs/ib_logfile0 did not exist:
new to be created
InnoDB: Setting log file /home/heikki/data/logs/ib_logfile0 size to 5242880
InnoDB: Log file /home/heikki/data/logs/ib_logfile1 did not exist:
new to be created
InnoDB: Setting log file /home/heikki/data/logs/ib_logfile1 size to 5242880
InnoDB: Log file /home/heikki/data/logs/ib_logfile2 did not exist:
new to be created
InnoDB: Setting log file /home/heikki/data/logs/ib_logfile2 size to 5242880
InnoDB: Started
mysqld: ready for connections

A new InnoDB database has now been created. You can connect to the MySQL server with the usual MySQL client programs like mysql. When you shut down the MySQL server with mysqladmin shutdown, InnoDB output will be like the following:

010321 18:33:34  mysqld: Normal shutdown
010321 18:33:34  mysqld: Shutdown Complete
InnoDB: Starting shutdown...
InnoDB: Shutdown completed

You can now look at the data files and log directories and you will see the files created. The log directory will also contain a small file named ib_arch_log_0000000000. That file resulted from the database creation, after which InnoDB switched off log archiving. When MySQL is again started, the output will be like the following:

~/mysqlm/sql > mysqld
InnoDB: Started
mysqld: ready for connections

If something goes wrong in database creation

If InnoDB prints an operating system error in a file operation, usually the problem is one of the following:

  • You did not create InnoDB data or log directories.

  • mysqld does not have the rights to create files in those directories.

  • mysqld does not read the right my.cnf or my.ini file, and consequently does not see the options you specified.

  • The disk is full or a disk quota is exceeded.

  • You have created a subdirectory whose name is equal to a data file you specified.

  • There is a syntax error in innodb_data_home_dir or innodb_data_file_path.

If something goes wrong in an InnoDB database creation, you should delete all files created by InnoDB. This means all data files, all log files, and the small archived log file. If you have already created some InnoDB tables, also delete the corresponding .frm files for these tables from the MySQL database directories. Then you can try the InnoDB database creation again.

Creating InnoDB Tables

Suppose you have started the MySQL client with the command mysql test. To create a table in the InnoDB format you must specify TYPE = InnoDB in the table creation SQL command:
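A statement of this kind might look like the following sketch, which writes it to a file that can then be fed to the mysql client. The table name CUSTOMER and the index on column A follow the description below; column B and the file name are assumptions made for illustration.

```shell
#!/bin/sh
# Write a CREATE TABLE statement that uses TYPE = InnoDB to a file.
# Column B and the file location are assumed for this sketch.
sql_file="${TMPDIR:-/tmp}/create_customer.sql"
cat > "$sql_file" <<'EOF'
CREATE TABLE CUSTOMER (A INT, B CHAR(20), INDEX (A)) TYPE = InnoDB;
EOF
cat "$sql_file"

# With a running server, the file could be executed against the test database:
#   mysql test < "$sql_file"
```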


This SQL command will create a table and an index on column A in the InnoDB tablespace consisting of the data files you specified in my.cnf. In addition MySQL will create a file CUSTOMER.frm in the MySQL database directory test. Internally, InnoDB will add to its own data dictionary an entry for table 'test/CUSTOMER'. Thus you can create a table of the same name CUSTOMER in another database of MySQL, and the table names will not collide inside InnoDB.

You can query the amount of free space in the InnoDB tablespace by issuing the table status command of MySQL for any table you have created with TYPE = InnoDB. The amount of free space in the tablespace then appears in the table comment section in the output of SHOW TABLE STATUS, for example SHOW TABLE STATUS FROM test LIKE 'CUSTOMER'.


Note that the statistics SHOW gives about InnoDB tables are only approximate: they are used in SQL optimisation. Table and index reserved sizes in bytes are accurate, though.

Converting MyISAM tables to InnoDB

InnoDB does not have a special optimisation for separate index creation. Therefore, it does not pay to export and import the table and create indexes afterward. The fastest way to alter a table to InnoDB is to do the inserts directly to an InnoDB table: that is, use ALTER TABLE ... TYPE=INNODB, or create an empty InnoDB table with identical definitions and insert the rows with INSERT INTO ... SELECT * FROM ....

To get better control over the insertion process, it may be good to insert big tables in pieces:

INSERT INTO newtable SELECT * FROM oldtable
   WHERE yourkey > something AND yourkey <= somethingelse;

After all data has been inserted you can rename the tables.
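Inserting in pieces can be scripted by generating one statement per key range. The sketch below only prints the statements. It assumes an integer key, and the function name, key range, and chunk size are illustrative. The table and column names are the placeholders used above.

```shell
#!/bin/sh
# Print one INSERT ... SELECT per key range (lo, hi], stepping through
# the key space in fixed-size chunks. Arguments: start, end, chunk size.
emit_chunked_inserts() {
    lo=$1
    max=$2
    step=$3
    while [ "$lo" -lt "$max" ]; do
        hi=$(( lo + step ))
        if [ "$hi" -gt "$max" ]; then
            hi=$max
        fi
        echo "INSERT INTO newtable SELECT * FROM oldtable WHERE yourkey > $lo AND yourkey <= $hi;"
        lo=$hi
    done
}

# Hypothetical key range 1..250000 in chunks of 100000 keys.
emit_chunked_inserts 0 250000 100000
```

The lower bound of each range is exclusive, so start one below the smallest key. The output can be reviewed and then piped to the mysql client.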

During the conversion of big tables you should make the InnoDB buffer pool size large to reduce disk I/O. Do not set it larger than 80% of the physical memory, though. You should make InnoDB log files and the log buffer large.

Make sure you do not run out of tablespace: InnoDB tables take a lot more space than MyISAM tables. If an ALTER TABLE runs out of space, it will start a rollback, and that can take hours if it is disk-bound. In inserts InnoDB uses the insert buffer to merge secondary index records to indexes in batches. That saves a lot of disk I/O. In rollback no such mechanism is used, and the rollback can take 30 times longer than the insertion.

In the case of a runaway rollback, if you do not have valuable data in your database, it is better that you kill the database process and delete all InnoDB data and log files and all InnoDB table .frm files, and start your job again, rather than wait for millions of disk I/Os to complete.

Foreign key constraints

Starting from Version 3.23.43b InnoDB features foreign key constraints. InnoDB is the first MySQL table type which allows you to define foreign key constraints to guard the integrity of your data.

The syntax of a foreign key constraint definition in InnoDB is as follows:

FOREIGN KEY (index_col_name, ...)
                  REFERENCES table_name (index_col_name, ...)

Both tables have to be of the InnoDB type and there must be an index where the foreign key and the referenced key are listed as the first columns. InnoDB does not auto-create indexes on foreign keys or referenced keys: you have to create them explicitly.

Corresponding columns in the foreign key and the referenced key must have similar internal data types inside InnoDB so that they can be compared without a type conversion. The size and the signedness of integer types have to be the same. The length of string types need not be the same.

Starting from Version 3.23.50 you can also associate the ON DELETE CASCADE or ON DELETE SET NULL clause with the foreign key constraint.

If ON DELETE CASCADE is specified, and a row in the parent table is deleted, then InnoDB automatically deletes also all those rows in the child table whose foreign key values are equal to the referenced key value in the parent row. If ON DELETE SET NULL is specified, the child rows are automatically updated so that the columns in the foreign key are set to the SQL NULL value.

Starting from version 3.23.50, InnoDB does not check foreign key constraints on those foreign key or referenced key values which contain a NULL column.

Starting from version 3.23.50 the InnoDB parser allows you to use backquotes (`) around table and column names in the above definition, but the InnoDB parser does not yet take into account the lower_case_table_names variable you may set in my.cnf.

An example:

CREATE TABLE parent(id INT NOT NULL, PRIMARY KEY (id)) TYPE=INNODB;
CREATE TABLE child(id INT, parent_id INT, INDEX par_ind (parent_id),
             FOREIGN KEY (parent_id) REFERENCES parent(id)
             ON DELETE SET NULL
) TYPE=INNODB;

If MySQL gives the error number 1005 from a CREATE TABLE statement, and the error message string refers to errno 150, then the table creation failed because a foreign key constraint was not correctly formed. Similarly, if an ALTER TABLE fails and it refers to errno 150, that means a foreign key definition would be incorrectly formed for the altered table.

Starting from Version 3.23.50 InnoDB allows you to add a new foreign key constraint to a table through

ALTER TABLE yourtablename
  ADD CONSTRAINT FOREIGN KEY (...) REFERENCES anothertablename(...)

Remember to create the required indexes first, though.

In InnoDB versions < 3.23.50 ALTER TABLE or CREATE INDEX should not be used in connection with tables that have foreign key constraints or are referenced in foreign key constraints: any ALTER TABLE removes all foreign key constraints defined for the table. You should not use ALTER TABLE to the referenced table either, but use DROP TABLE and CREATE TABLE to modify the schema. When MySQL does an ALTER TABLE it may internally use RENAME TABLE, and that will confuse the foreign key constraints that refer to the table. A CREATE INDEX statement in MySQL is processed as an ALTER TABLE, and these restrictions also apply to it.

When doing foreign key checks InnoDB sets shared row level locks on child or parent records it has to look at. InnoDB checks foreign key constraints immediately: the check is not deferred to transaction commit.

InnoDB allows you to drop any table even though that would break the foreign key constraints that reference the table. When you drop a table the constraints that were defined in its create statement are also dropped.

If you re-create a table that was dropped, it has to have a definition that conforms to the foreign key constraints referencing it. It must have the right column names and types, and it must have indexes on the referenced keys, as stated previously. If these are not satisfied, MySQL returns error number 1005 and refers to errno 150 in the error message string.

Starting from Version 3.23.50 InnoDB returns the foreign key definitions of a table when you call

SHOW CREATE TABLE yourtablename

Then also mysqldump produces correct definitions of tables to the dump file, and does not forget about the foreign keys.

You can also list the foreign key constraints for a table T with:

SHOW TABLE STATUS FROM yourdatabasename LIKE 'T'

The foreign key constraints are listed in the table comment of the output.

InnoDB versions before 3.23.50 do not support ON DELETE CASCADE, ON DELETE SET NULL, or other special options on the constraints.

Adding and Removing InnoDB Data and Log Files

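You cannot increase the size of an InnoDB data file yourself; only the last data file grows on its own, and only if it was specified as auto-extending (available from 3.23.50 and 4.0.2). To add more space to your tablespace you have to add a new data file. To do this you have to shut down your MySQL database, edit the my.cnf file, add a new file to innodb_data_file_path, and then start MySQL again.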

Currently you cannot remove a data file from InnoDB. To decrease the size of your database you have to use mysqldump to dump all your tables, create a new database, and import your tables to the new database.

If you want to change the number or the size of your InnoDB log files, you have to shut down MySQL and make sure that it shuts down without errors. Then copy the old log files into a safe place just in case something went wrong in the shutdown and you will need them to recover the database. Delete the old log files from the log file directory, edit my.cnf, and start MySQL again. InnoDB will tell you at the startup that it is creating new log files.
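The copy-then-delete step can be sketched as a small shell function. The function name and the scratch directories are hypothetical. The ib_logfile prefix matches the log file names InnoDB prints at startup. Run such a script only after mysqld has shut down without errors.

```shell
#!/bin/sh
# Copy the old InnoDB log files to a safe place, then delete the originals.
# Arguments: log file directory, directory to keep the copies in.
set_aside_logs() {
    log_dir=$1
    safe_dir=$2
    mkdir -p "$safe_dir" || return 1
    for f in "$log_dir"/ib_logfile*; do
        # Skip the unexpanded pattern when no log files are present.
        [ -f "$f" ] || continue
        # Delete the original only if the copy succeeded.
        cp -p "$f" "$safe_dir"/ && rm "$f" || return 1
    done
}

# Demonstration on empty stand-in files under a scratch directory.
scratch="${TMPDIR:-/tmp}/iblog-demo"
mkdir -p "$scratch/iblogs"
: > "$scratch/iblogs/ib_logfile0"
: > "$scratch/iblogs/ib_logfile1"
set_aside_logs "$scratch/iblogs" "$scratch/safe"
ls "$scratch/safe"
```

After this step you would edit my.cnf and start MySQL again, as described above.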

Backing up and Recovering an InnoDB Database

The key to safe database management is taking regular backups.

InnoDB Hot Backup is an online backup tool you can use to back up your InnoDB database while it is running. InnoDB Hot Backup does not require you to shut down your database and it does not set any locks or disturb your normal database processing. InnoDB Hot Backup is an additional tool that is not included in the standard MySQL distribution. See the InnoDB Hot Backup homepage at http://www.innodb.com/hotbackup.html for detailed information and screenshots.

If you are able to shut down your MySQL server, to take a ‘binary’ backup of your database you have to do the following:

  • Shut down your MySQL database and make sure it shuts down without errors.

  • Copy all your data files into a safe place.

  • Copy all your InnoDB log files to a safe place.

  • Copy your my.cnf configuration file(s) to a safe place.

  • Copy all the .frm files for your InnoDB tables into a safe place.

The standard MySQL distribution currently includes no online or incremental backup tool for InnoDB, though they are in the TODO list.

In addition to taking the binary backups just described, you should also regularly take dumps of your tables with mysqldump. This is because a binary file may be corrupted without you noticing it. Dumped tables are stored into text files that are human-readable and much simpler than database binary files. Seeing table corruption from dumped files is easier, and since their format is simpler, the chance for serious data corruption in them is smaller.

A good idea is to take the dumps at the same time you take a binary backup of your database. You have to shut out all clients from your database to get a consistent snapshot of all your tables into your dumps. Then you can take the binary backup, and you will then have a consistent snapshot of your database in two formats.

To be able to recover your InnoDB database to the present from the binary backup just described, you have to run your MySQL database with the general logging and log archiving of MySQL switched on. Here by the general logging we mean the logging mechanism of the MySQL server that is independent of InnoDB logs.

To recover from a crash of your MySQL server process, the only thing you have to do is restart it. InnoDB will automatically check the logs and perform a roll-forward of the database to the present. InnoDB will automatically roll back uncommitted transactions that were present at the time of the crash. During recovery, InnoDB will print out something like the following:

~/mysqlm/sql > mysqld
InnoDB: Database was not shut down normally.
InnoDB: Starting recovery from log files...
InnoDB: Starting log scan based on checkpoint at
InnoDB: log sequence number 0 13674004
InnoDB: Doing recovery: scanned up to log sequence number 0 13739520
InnoDB: Doing recovery: scanned up to log sequence number 0 13805056
InnoDB: Doing recovery: scanned up to log sequence number 0 13870592
InnoDB: Doing recovery: scanned up to log sequence number 0 13936128
InnoDB: Doing recovery: scanned up to log sequence number 0 20555264
InnoDB: Doing recovery: scanned up to log sequence number 0 20620800
InnoDB: Doing recovery: scanned up to log sequence number 0 20664692
InnoDB: 1 uncommitted transaction(s) which must be rolled back
InnoDB: Starting rollback of uncommitted transactions
InnoDB: Rolling back trx no 16745
InnoDB: Rolling back of trx no 16745 completed
InnoDB: Rollback of uncommitted transactions completed
InnoDB: Starting an apply batch of log records to the database...
InnoDB: Apply batch completed
InnoDB: Started
mysqld: ready for connections

If your database gets corrupted or your disk fails, you have to do the recovery from a backup. In the case of corruption, you should first find a backup that is not corrupted. Starting from that backup, apply the general log files of MySQL according to the instructions in the MySQL manual.


InnoDB implements a checkpoint mechanism called a fuzzy checkpoint. InnoDB will flush modified database pages from the buffer pool in small batches. There is no need to flush the buffer pool in one single batch, which would, in practice, stop processing user SQL statements for a while.

In crash recovery InnoDB looks for a checkpoint label written to the log files. It knows that all modifications to the database before the label are already present on the disk image of the database. Then InnoDB scans the log files forward from the place of the checkpoint, applying the logged modifications to the database.

InnoDB writes to the log files in a circular fashion. All committed modifications that make the database pages in the buffer pool different from the images on disk must be available in the log files in case InnoDB has to do a recovery. This means that when InnoDB starts to reuse a log file in the circular fashion, it has to make sure that the database page images on disk already contain the modifications logged in the log file InnoDB is going to reuse. In other words, InnoDB has to make a checkpoint, and often this involves flushing of modified database pages to disk.

This explains why making your log files very big may save disk I/O in checkpointing. It can make sense to set the total size of the log files as big as the buffer pool, or even bigger. The drawback of big log files is that crash recovery can last longer because there will be more log to apply to the database.
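The relationship between log size and forced checkpoints can be illustrated with a toy model. This is only a sketch and not InnoDB's implementation: the class CircularLog, its fields, and the helper forced_checkpoints are hypothetical names, and a checkpoint is simplified to "everything logged so far is now on disk".

```python
class CircularLog:
    """Toy model of a circular redo log; not InnoDB's actual implementation."""

    def __init__(self, capacity):
        self.capacity = capacity      # total size of the log files, in bytes
        self.lsn = 0                  # log sequence number of the next write
        self.checkpoint_lsn = 0       # all changes below this are already on disk
        self.forced_checkpoints = 0   # how often a write had to wait for a flush

    def write(self, nbytes):
        # Log between checkpoint_lsn and lsn is still needed for recovery,
        # so it must not be overwritten when the log wraps around.
        if self.lsn + nbytes - self.checkpoint_lsn > self.capacity:
            self.checkpoint()
            self.forced_checkpoints += 1
        self.lsn += nbytes

    def checkpoint(self):
        # Stand-in for flushing modified pages: afterwards no older log is needed.
        self.checkpoint_lsn = self.lsn


def forced_checkpoints(capacity, record_size, n_records):
    """Count the checkpoints forced while writing n_records log records."""
    log = CircularLog(capacity)
    for _ in range(n_records):
        log.write(record_size)
    return log.forced_checkpoints
```

In this model, writing 100 records of 100 bytes forces 9 checkpoints with a 1000-byte log and none with a 10000-byte log, which is the trade-off described above.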

Moving an InnoDB Database to Another Machine

InnoDB data and log files are binary-compatible on all platforms if the floating-point number format on the machines is the same. You can move an InnoDB database simply by copying all the relevant files, which we already listed in the previous section on backing up a database. If the floating-point formats on the machines are different but you have not used FLOAT or DOUBLE data types in your tables, the procedure is the same: just copy the relevant files. If the formats are different and your tables contain floating-point data, you have to use mysqldump and mysqlimport to move those tables.

A performance tip is to switch off auto-commit mode when you import data into your database, assuming your tablespace has enough space for the big rollback segment the big import transaction will generate. Do the commit only after importing a whole table or a segment of a table.

InnoDB Transaction Model

In the InnoDB transaction model the goal has been to combine the best properties of a multi-versioning database with traditional two-phase locking. InnoDB does locking on row level and runs queries by default as non-locking consistent reads, in the style of Oracle. The lock table in InnoDB is stored so space-efficiently that lock escalation is not needed: typically several users are allowed to lock every row in the database, or any random subset of the rows, without InnoDB running out of memory.

In InnoDB all user activity happens inside transactions. If the auto-commit mode is used in MySQL, each SQL statement will form a single transaction. If the auto-commit mode is switched off, we can consider that a user always has a transaction open. If the user issues the SQL COMMIT or ROLLBACK statement, that ends the current transaction, and a new one starts. Both statements will release all InnoDB locks that were set during the current transaction. A COMMIT means that the changes made in the current transaction are made permanent and become visible to other users. A ROLLBACK, on the other hand, cancels all modifications made by the current transaction.

Consistent read

A consistent read means that InnoDB uses its multi-versioning to present to a query a snapshot of the database at a point in time. The query will see the changes made by exactly those transactions that committed before that point of time, and no changes made by later or uncommitted transactions. The exception to this rule is that the query will see the changes made by the transaction that issues the query.

When a transaction issues its first consistent read, InnoDB assigns the snapshot, or the point of time, which all consistent reads in the same transaction will use. In the snapshot are all transactions that committed before assigning the snapshot. Thus, the consistent reads within the same transaction will also be consistent with respect to each other. You can get a fresher snapshot for your queries by committing the current transaction, and after that, issuing new queries.

Consistent read is the default mode in which InnoDB processes SELECT statements. A consistent read does not set any locks on the tables it accesses, and therefore other users are free to modify those tables at the same time a consistent read is being performed on the table.

Locking reads

A consistent read is not convenient in some circumstances. Suppose you want to add a new row into your table CHILD, and make sure that the child already has a parent in table PARENT.

Suppose you use a consistent read to read the table PARENT and indeed see the parent of the child in the table. Can you now safely add the child row to table CHILD? No, because it may happen that meanwhile some other user has deleted the parent row from the table PARENT, and you are not aware of that.

The solution is to perform the SELECT in a locking mode, LOCK IN SHARE MODE:

SELECT * FROM PARENT WHERE NAME = 'Jones' LOCK IN SHARE MODE;

Performing a read in share mode means that we read the latest available data, and set a shared mode lock on the rows we read. If the latest data belongs to a yet uncommitted transaction of another user, we will wait until that transaction commits. A shared mode lock prevents others from updating or deleting the row we have read. After we see that the above query returns the parent 'Jones', we can safely add his child to table CHILD, and commit our transaction. This example shows how to implement referential integrity in your application code.

Let us look at another example: we have an integer counter field in a table CHILD_CODES which we use to assign a unique identifier to each child we add to table CHILD. Obviously, using a consistent read or a shared mode read to read the present value of the counter is not a good idea, since then two users of the database may see the same value for the counter, and we will get a duplicate key error when we add the two children with the same identifier to the table.

In this case there are two good ways to implement the reading and incrementing of the counter: (1) update the counter first by incrementing it by 1 and only after that read it, or (2) read the counter first with a lock mode FOR UPDATE, and increment after that:

SELECT COUNTER_FIELD FROM CHILD_CODES FOR UPDATE;
UPDATE CHILD_CODES SET COUNTER_FIELD = COUNTER_FIELD + 1;

A SELECT ... FOR UPDATE will read the latest available data, setting exclusive locks on each row it reads. Thus, it sets the same locks a searched SQL UPDATE would set on the rows.

Next-key locking: Avoiding the phantom problem

In row level locking InnoDB uses an algorithm called next-key locking. InnoDB does the row level locking so that when it searches or scans an index of a table, it sets shared or exclusive locks on the index records it encounters. Thus, the row level locks are more precisely called index record locks.

The locks InnoDB sets on index records also affect the ‘gap’ before that index record. If a user has a shared or exclusive lock on record R in an index, another user cannot insert a new index record immediately before R in the index order. This locking of gaps is done to prevent the so-called phantom problem. Suppose I want to read and lock all children with an identifier bigger than 100 from table CHILD, and update some field in the selected rows:

SELECT * FROM CHILD WHERE ID > 100 FOR UPDATE;

Suppose there is an index on table CHILD on column ID. Our query will scan that index starting from the first record where ID is bigger than 100. Now, if the locks set on the index records did not lock out inserts made in the gaps, a new child might meanwhile be inserted to the table. If in my transaction I now execute:

SELECT * FROM CHILD WHERE ID > 100 FOR UPDATE;

again, I will see a new child in the result set the query returns. This is against the isolation principle of transactions: a transaction should be able to run so that the data it has read does not change during the transaction. If we regard a set of rows as a data item, the new ‘phantom’ child would break this isolation principle.

When InnoDB scans an index it can also lock the gap after the last record in the index. Just that happens in the previous example: the locks set by InnoDB will prevent any insert to the table where ID would be bigger than 100.

You can use next-key locking to implement a uniqueness check in your application: if you read your data in share mode and do not see a duplicate for a row you are going to insert, you can safely insert your row. The next-key lock set on the successor of your row during the read will prevent anyone from inserting a duplicate for your row in the meantime. Thus, the next-key locking allows you to ‘lock’ the non-existence of something in your table.
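The gap-locking rule can be sketched with a small model of a sorted index. This is an illustration only, assuming a single locking transaction and a single lock mode. The class NextKeyIndex, its methods, and the SUPREMUM marker are hypothetical names, not InnoDB identifiers.

```python
import bisect

SUPREMUM = float("inf")  # pseudo-record standing for "after the last record"


class NextKeyIndex:
    """Toy model of next-key locks on a sorted index; not InnoDB's real code."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        self.locked = set()   # records whose lock also covers the gap before them

    def lock_range_greater_than(self, bound):
        # Scan the index from the first record above `bound` and lock every
        # record met, plus the gap after the last record (the supremum).
        start = bisect.bisect_right(self.keys, bound)
        self.locked.update(self.keys[start:])
        self.locked.add(SUPREMUM)

    def can_insert(self, key):
        # An insert goes into the gap before its successor in index order,
        # so it must wait if that successor is locked.
        pos = bisect.bisect_right(self.keys, key)
        successor = self.keys[pos] if pos < len(self.keys) else SUPREMUM
        return successor not in self.locked
```

With keys 50, 90, 102, and 110 and a locked scan for values above 100, inserts of 101, 105, and 500 are all refused in this model, while an insert of 60 is allowed. An insert of 95 is refused as well, because it falls into the gap before the locked record 102.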

Locks set by different SQL statements in InnoDB

SELECT ... FROM ...

This is a consistent read, reading a snapshot of the database and setting no locks.

SELECT ... FROM ... LOCK IN SHARE MODE

Sets shared next-key locks on all index records the read encounters.

SELECT ... FROM ... FOR UPDATE

Sets exclusive next-key locks on all index records the read encounters.

INSERT INTO ... VALUES (...)

Sets an exclusive lock on the inserted row; note that this lock is not a next-key lock and does not prevent other users from inserting to the gap before the inserted row. If a duplicate key error occurs, sets a shared lock on the duplicate index record.

INSERT INTO T SELECT ... FROM S WHERE ...

Sets an exclusive (non-next-key) lock on each row inserted into T. Does the search on S as a consistent read, but sets shared next-key locks on S if the MySQL logging is on. InnoDB has to set locks in the latter case because in roll-forward recovery from a backup every SQL statement has to be executed in exactly the same way as it was done originally.

CREATE TABLE ... SELECT ...

Performs the SELECT as a consistent read or with shared locks, as in the previous item.

REPLACE

Is done like an insert if there is no collision on a unique key. Otherwise, an exclusive next-key lock is placed on the row which has to be updated.

UPDATE ... SET ... WHERE ...

Sets an exclusive next-key lock on every record the search encounters.

DELETE FROM ... WHERE ...

Sets an exclusive next-key lock on every record the search encounters.

FOREIGN KEY

If this constraint is defined on a table, any insert, update, or delete which requires checking of the constraint condition sets shared record level locks on the records it looks at to check the constraint. Also in the case where the constraint fails, InnoDB sets these locks.

LOCK TABLES ...

Sets table locks. In the implementation the MySQL layer of code sets these locks. The automatic deadlock detection of InnoDB cannot detect deadlocks where such table locks are involved: see the following section. Also, since MySQL does not know about row level locks, it is possible that you get a table lock on a table where another user currently has row level locks. But that does not put transaction integrity into danger. See Section 7.5.14.

Deadlock detection and rollback

InnoDB automatically detects a deadlock of transactions and rolls back the transaction whose lock request was the last one to build a deadlock—that is, a cycle in the waits-for graph of transactions. InnoDB cannot detect deadlocks where a lock set by a MySQL LOCK TABLES statement is involved, or if a lock set in a table handler other than InnoDB is involved. You have to resolve these situations using innodb_lock_wait_timeout set in my.cnf.

When InnoDB performs a complete rollback of a transaction, all the locks of the transaction are released. However, if just a single SQL statement is rolled back as a result of an error, some of the locks set by the SQL statement may be preserved. This is because InnoDB stores row locks in a format where it cannot afterward know which was set by which SQL statement.
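Deadlock detection on a waits-for graph can be sketched as follows. This is a minimal illustration of the idea, assuming the graph is a plain dictionary from a waiting transaction to the transactions it waits for. The function names find_cycle and request_lock are hypothetical and say nothing about how InnoDB structures its own code.

```python
def find_cycle(waits_for, start):
    """Return True if `start` can reach itself in the waits-for graph."""
    stack = list(waits_for.get(start, ()))
    seen = set()
    while stack:
        trx = stack.pop()
        if trx == start:
            return True
        if trx not in seen:
            seen.add(trx)
            stack.extend(waits_for.get(trx, ()))
    return False


def request_lock(waits_for, waiter, holder):
    """Record that `waiter` waits for `holder`; return the victim, if any."""
    waits_for.setdefault(waiter, set()).add(holder)
    if find_cycle(waits_for, waiter):
        # The request that closed the cycle belongs to `waiter`: roll it back,
        # which also removes its edges from the graph.
        del waits_for[waiter]
        return waiter
    return None
```

If A waits for B and B waits for C, a request that makes C wait for A closes the cycle, so C is chosen as the transaction to roll back, matching the rule that the last lock request to build the deadlock loses.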

An example of how the consistent read works in InnoDB

When you issue a consistent read (that is, an ordinary SELECT statement), InnoDB will give your transaction a timepoint according to which your query sees the database. Thus, if transaction B deletes a row and commits after your timepoint was assigned, you will still see the row as if it had not been deleted. The same is true with inserts and updates: you will not see them.

You can advance your timepoint by committing your transaction and then doing another SELECT.

This is called multi-versioned concurrency control:

                  User A                 User B

              SET AUTOCOMMIT=0;      SET AUTOCOMMIT=0;
time
|             SELECT * FROM t;
|             empty set
|                                    INSERT INTO t VALUES (1, 2);
|
v             SELECT * FROM t;
              empty set
                                     COMMIT;

              SELECT * FROM t;
              empty set

              COMMIT;

              SELECT * FROM t;
              |    1    |    2    |

Thus, user A sees the row inserted by B only when B has committed the insert, and A has committed his own transaction so that the timepoint is advanced past the commit of B.
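The timeline can be replayed with a toy multi-version store. This is a sketch under simplifying assumptions: only inserts are modeled, a commit counter stands in for InnoDB's timepoints, and the names Database, Transaction, snapshot, and pending are hypothetical.

```python
class Database:
    """Toy multi-version store; commit order stands in for InnoDB's timepoints."""

    def __init__(self):
        self.commit_clock = 0
        self.rows = []  # list of (row, commit_time) for committed inserts

    def begin(self):
        return Transaction(self)


class Transaction:
    def __init__(self, db):
        self.db = db
        self.snapshot = None   # assigned by the first consistent read
        self.pending = []      # own uncommitted inserts

    def insert(self, row):
        self.pending.append(row)

    def select(self):
        if self.snapshot is None:
            self.snapshot = self.db.commit_clock
        visible = [row for row, t in self.db.rows if t <= self.snapshot]
        return visible + self.pending   # a transaction sees its own changes

    def commit(self):
        self.db.commit_clock += 1
        self.db.rows.extend((row, self.db.commit_clock) for row in self.pending)
        self.pending = []
        self.snapshot = None   # the next read gets a fresh timepoint
```

Replaying the example, A's reads stay empty after B inserts and even after B commits, and the row appears to A only once A has committed and reads again.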

If you want to see the “freshest” state of the database, you should use a locking read:

SELECT * FROM t LOCK IN SHARE MODE;

Performance Tuning Tips

1. If the Unix top or the Windows Task Manager shows that the CPU usage percentage with your workload is less than 70%, your workload is probably disk-bound. Maybe you are making too many transaction commits, or the buffer pool is too small. Making the buffer pool bigger can help, but do not set it bigger than 80% of your physical memory.

2. Wrap several modifications into one transaction. InnoDB must flush the log to disk at each transaction commit, if that transaction made modifications to the database. Since the rotation speed of a disk is typically, at most, 167 revolutions/second, that constrains the number of commits to the same 167/second if the disk does not fool the operating system.

3. If you can afford the loss of some latest committed transactions, you can set the my.cnf parameter innodb_flush_log_at_trx_commit to zero. InnoDB tries to flush the log anyway once per second, though the flush is not guaranteed.

4. Make your log files big, even as big as the buffer pool. When InnoDB has written the log files full, it has to write the modified contents of the buffer pool to disk in a checkpoint. Small log files will cause many unnecessary disk writes. The drawback of big log files is that recovery time will be longer.

5. The log buffer should also be quite big, say 8 MB.

6. (Relevant from 3.23.39 up.) In some versions of Linux and Unix, flushing files to disk with the Unix fdatasync and other similar methods is surprisingly slow. The default method InnoDB uses is the fdatasync function. If you are not satisfied with the database write performance, you may try setting innodb_flush_method in my.cnf to O_DSYNC, though O_DSYNC seems to be slower on most systems.

7. When importing data into InnoDB, make sure that MySQL does not have autocommit=1 on; otherwise every insert requires a log flush to disk. Put the following line before your plain SQL import file:

SET AUTOCOMMIT=0;

and after it:

COMMIT;

If you use the mysqldump option --opt, you will get dump files that are fast to import into an InnoDB table as well, even without wrapping them in the preceding SET AUTOCOMMIT=0; ... COMMIT; statements.

8. Beware of big rollbacks of mass inserts: InnoDB uses the insert buffer to save disk I/O in inserts, but in a corresponding rollback no such mechanism is used. A disk-bound rollback can take 30 times the time of the corresponding insert. Killing the database process will not help because the rollback will start again at database startup. The only way to get rid of a runaway rollback is to increase the buffer pool so that the rollback becomes CPU-bound and runs fast, or delete the whole InnoDB database.

9. Beware also of other big disk-bound operations. Use DROP TABLE or TRUNCATE (from MySQL 4.0 up) to empty a table, not DELETE FROM yourtable.

10. Use the multi-line INSERT to reduce communication overhead between the client and the server if you need to insert many rows:

INSERT INTO yourtable VALUES (1, 2), (5, 5);

This tip is, of course, valid for inserts into any table type, not just InnoDB.
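The arithmetic behind tips 2 and 7 can be sketched in a few lines. The figure of 10,000 rpm is an assumption chosen because it gives roughly the 167 revolutions/second quoted above. The function names are hypothetical, and the result is only an upper bound from log flushing, ignoring every other cost.

```python
def max_commits_per_second(rpm):
    """Upper bound on commits/second if every commit waits one disk revolution."""
    return rpm / 60.0


def max_rows_per_second(rpm, rows_per_commit):
    """Rows/second when several inserts share a single commit (log flush)."""
    return max_commits_per_second(rpm) * rows_per_commit
```

Under these assumptions a 10,000 rpm disk allows about 167 commits per second, so committing after every row limits inserts to about 167 rows per second, while committing once per 100 rows raises the bound to roughly 16,667 rows per second.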

The InnoDB monitor

Starting from Version 3.23.41 InnoDB includes the InnoDB Monitor, which prints information on the InnoDB internal state. When switched on, InnoDB Monitor will make the MySQL server mysqld print data (note: the MySQL client will not print anything) to the standard output about once every 15 seconds. This data is useful in performance tuning. On Windows you must start mysqld-max from an MS-DOS prompt with the --standalone --console options to direct the output to the MS-DOS prompt window.

There is a separate innodb_lock_monitor which prints the same information as innodb_monitor plus information on locks set by each transaction.

The printed information includes data on:

  • Lock waits of transactions

  • Semaphore waits of threads

  • Pending file I/O requests

  • Buffer pool statistics

  • Purge and insert buffer merge activity of the main thread of InnoDB

You can start InnoDB Monitor through the following SQL command:

CREATE TABLE innodb_monitor(a int) type = innodb;

and stop it by:

DROP TABLE innodb_monitor;

The CREATE TABLE syntax is just a way to pass a command to the InnoDB engine through the MySQL SQL parser: the created table is not relevant at all for InnoDB Monitor. If you shut down the database when the monitor is running, and you want to start the monitor again, you have to drop the table before you can issue a new CREATE TABLE to start the monitor. This syntax may change in a future release.

A sample output of the InnoDB Monitor:

Number of locks in the record hash table 1294
TABLE LOCK table test/mytable trx id 0 582333343 lock_mode IX

RECORD LOCKS space id 0 page no 12758 n bits 104 table test/mytable index
PRIMARY trx id 0 582333343 lock_mode X
Record lock, heap no 2 PHYSICAL RECORD: n_fields 74; 1-byte offs FALSE;
info bits 0
 0: len 4; hex 0001a801; asc ;; 1: len 6; hex 000022b5b39f; asc ";;
 2: len 7; hex 000002001e03ec; asc ;; 3: len 4; hex 00000001;
Sorry, cannot give mutex list info in non-debug version!
Sorry, cannot give rw-lock list info in non-debug version!
SYNC ARRAY INFO: reservation count 6041054, signal count 2913432
4a239430 waited for by thread 49627477 op. S-LOCK file NOT KNOWN line 0
Mut ex 0 sp 5530989 r 62038708 sys 2155035;
rws 0 8257574 8025336; rwx 0 1121090 1848344
Pending normal aio reads:
Reserved slot, messages 40157658 4a4a40b8
Reserved slot, messages 40157658 4a477e28
Reserved slot, messages 40157658 4a4424a8
Reserved slot, messages 40157658 4a39ea38
Total of 36 reserved aio slots
Pending aio writes:
Total of 0 reserved aio slots
Pending insert buffer aio reads:
Total of 0 reserved aio slots
Pending log writes or reads:
Reserved slot, messages 40158c98 40157f98
Total of 1 reserved aio slots
Pending synchronous reads or writes:
Total of 0 reserved aio slots
LRU list length 8034
Free list length 0
Flush list length 999
Buffer pool size in pages 8192
Pending reads 39
Pending writes: LRU 0, flush list 0, single page 0
Pages read 31383918, created 51310, written 2985115
010809 18:45:22 InnoDB starts purge
010809 18:45:22 InnoDB purged 0 pages

Some notes on the output:

  • If the section LOCKS HELD BY TRANSACTIONS reports lock waits, your application may have lock contention. The output can also help to trace reasons for transaction deadlocks.

  • Section SYNC INFO will report reserved semaphores if you compile InnoDB with UNIV_SYNC_DEBUG defined in univ.i.

  • Section SYNC ARRAY INFO reports threads waiting for a semaphore and statistics on how many times threads have needed a spin or a wait on a mutex or an rw-lock semaphore. A big number of threads waiting for semaphores may be a result of disk I/O, or contention problems inside InnoDB. Contention can be due to heavy parallelism of queries, or problems in operating system thread scheduling.

  • Section CURRENT PENDING FILE I/O’S lists pending file I/O requests. A large number of these indicates that the workload is disk I/O-bound.

  • Section BUFFER POOL gives you statistics on pages read and written. You can calculate from these numbers how many data file I/Os your queries are currently doing.
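One way to turn the buffer pool counters into I/O rates is to take two Monitor samples and divide the difference by the interval. The sketch below assumes the "Pages read ..., created ..., written ..." line has the form shown in the sample output and that the counters are cumulative. The function names and the second sample used in the usage note are hypothetical.

```python
import re

PAGES_RE = re.compile(r"Pages read (\d+), created (\d+), written (\d+)")


def parse_page_counters(monitor_text):
    """Extract the page counters from a chunk of InnoDB Monitor output."""
    match = PAGES_RE.search(monitor_text)
    if match is None:
        raise ValueError("no 'Pages read' line found")
    read, created, written = (int(g) for g in match.groups())
    return {"read": read, "created": created, "written": written}


def io_rates(earlier, later, seconds):
    """Per-second page I/O between two samples taken `seconds` apart."""
    return {k: (later[k] - earlier[k]) / seconds for k in earlier}
```

For example, if a second sample taken 15 seconds later showed 1500 more pages read and 300 more written, io_rates would report 100 page reads and 20 page writes per second.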

Implementation of Multi-Versioning

Since InnoDB is a multi-versioned database, it must keep information about old versions of rows in the tablespace. This information is stored in a data structure we call a rollback segment after an analogous data structure in Oracle.

InnoDB internally adds two fields to each row stored in the database. A 6-byte field tells the transaction identifier for the last transaction that inserted or updated the row. Also a deletion is internally treated as an update where a special bit in the row is set to mark it as deleted. Each row also contains a 7-byte field called the roll pointer. The roll pointer points to an undo log record written to the rollback segment. If the row was updated, the undo log record contains the information necessary to rebuild the content of the row before it was updated.

InnoDB uses the information in the rollback segment to perform the undo operations needed in a transaction rollback. It also uses the information to build earlier versions of a row for a consistent read.
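How a roll pointer chain can be used to rebuild earlier versions of a row is sketched below. The classes Row and UndoRecord and the method version_visible_to are hypothetical. Real undo records store only what is needed to rebuild a row rather than full copies, and visibility is decided here by a simple set of transaction ids, which is an assumption made to keep the example short.

```python
class UndoRecord:
    def __init__(self, old_values, old_trx_id, prev):
        self.old_values = old_values   # column values before the update
        self.old_trx_id = old_trx_id   # transaction that wrote that older version
        self.prev = prev               # older undo record, or None


class Row:
    def __init__(self, values, trx_id):
        self.values = values
        self.trx_id = trx_id      # models the 6-byte transaction id field
        self.roll_ptr = None      # models the 7-byte roll pointer

    def update(self, new_values, trx_id):
        # Write an undo record describing the current version, then overwrite.
        self.roll_ptr = UndoRecord(self.values, self.trx_id, self.roll_ptr)
        self.values, self.trx_id = new_values, trx_id

    def version_visible_to(self, visible_trx_ids):
        """Walk the undo chain to the newest version written by a visible trx."""
        values, trx_id, undo = self.values, self.trx_id, self.roll_ptr
        while trx_id not in visible_trx_ids:
            if undo is None:
                return None   # the row did not exist for this reader
            values, trx_id, undo = undo.old_values, undo.old_trx_id, undo.prev
        return values
```

A reader whose snapshot includes only the first writer gets the original values back, even after two later updates have overwritten the row in place.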

Undo logs in the rollback segment are divided into insert and update undo logs. Insert undo logs are only needed in transaction rollback and can be discarded as soon as the transaction commits. Update undo logs are used also in consistent reads, and they can be discarded only after there is no transaction present for which InnoDB has assigned a snapshot that in a consistent read could need the information in the update undo log to build an earlier version of a database row.

You must remember to commit your transactions regularly, including those transactions that only issue consistent reads. Otherwise InnoDB cannot discard data from the update undo logs, and the rollback segment may grow too big, filling up your tablespace.

The physical size of an undo log record in the rollback segment is typically smaller than the corresponding inserted or updated row. You can use this information to calculate the space needed for your rollback segment.

In our multi-versioning scheme a row is not physically removed from the database immediately when you delete it with an SQL statement. Only when InnoDB can discard the update undo log record written for the deletion can it also physically remove the corresponding row and its index records from the database. This removal operation is called a purge, and it is quite fast, usually taking the same order of time as the SQL statement which did the deletion.

Table and Index Structures

MySQL stores its data dictionary information of tables in .frm files in database directories. But every InnoDB type table also has its own entry in InnoDB internal data dictionaries inside the tablespace. When MySQL drops a table or a database, it has to delete both a .frm file or files, and the corresponding entries inside the InnoDB data dictionary. This is the reason why you cannot move InnoDB tables between databases simply by moving the .frm files, and why DROP DATABASE did not work for InnoDB type tables in MySQL versions <= 3.23.43.

Every InnoDB table has a special index called the clustered index where the data of the rows is stored. If you define a PRIMARY KEY on your table, the index of the primary key will be the clustered index.

If you do not define a primary key for your table, InnoDB will internally generate a clustered index where the rows are ordered by the row id InnoDB assigns to the rows in such a table. The row id is a 6-byte field which monotonically increases as new rows are inserted. Thus, the rows ordered by the row id will be physically in the insertion order.

Accessing a row through the clustered index is fast because the row data will be on the same page where the index search leads us. In many databases the data is traditionally stored on a different page from the index record. If a table is large, the clustered index architecture often saves disk I/O when compared to the traditional solution.

The records in non-clustered indexes (we also call them secondary indexes) in InnoDB contain the primary key value for the row. InnoDB uses this primary key value to search for the row from the clustered index. Note that if the primary key is long, the secondary indexes will use more space.
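The two-step lookup through a secondary index can be sketched with dictionaries standing in for the B-trees. The class Table, the column names, and the by_name index are hypothetical and chosen only for illustration.

```python
class Table:
    """Toy table: dictionaries stand in for the clustered and secondary B-trees."""

    def __init__(self):
        self.clustered = {}   # primary key -> full row data
        self.by_name = {}     # secondary index: name -> set of primary keys

    def insert(self, pk, name, payload):
        self.clustered[pk] = {"id": pk, "name": name, "payload": payload}
        self.by_name.setdefault(name, set()).add(pk)

    def find_by_name(self, name):
        # Step 1: the secondary index yields primary key values only.
        # Step 2: each primary key is then looked up in the clustered index.
        return [self.clustered[pk] for pk in sorted(self.by_name.get(name, ()))]
```

Because every secondary index entry carries a primary key value, a long primary key is repeated in each secondary index, which is the space cost noted above.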

Physical structure of an index

All indexes in InnoDB are B-trees where the index records are stored in the leaf pages of the tree. The default size of an index page is 16 kB. When new records are inserted, InnoDB tries to leave 1/16 of the page free for future insertions and updates of the index records.

If index records are inserted in a sequential (ascending or descending) order, the resulting index pages will be about 15/16 full. If records are inserted in a random order, the pages will be between 1/2 and 15/16 full. If the fill factor of an index page drops below 1/2, InnoDB will try to contract the index tree to free the page.

Insert buffering

It is a common situation in a database application that the primary key is a unique identifier and new rows are inserted in the ascending order of the primary key. Thus, the insertions to the clustered index do not require random reads from a disk.

On the other hand, secondary indexes are usually non-unique and insertions happen in a relatively random order. This would cause a lot of random disk I/Os without a special mechanism used in InnoDB.

If an index record should be inserted to a non-unique secondary index, InnoDB checks if the secondary index page is already in the buffer pool. If that is the case, InnoDB will do the insertion directly to the index page. But, if the index page is not found from the buffer pool, InnoDB inserts the record to a special insert buffer structure. The insert buffer is kept so small that it entirely fits in the buffer pool, and insertions can be made to it very fast.

The insert buffer is periodically merged to the secondary index trees in the database. Often we can merge several insertions on the same page of the index tree, and hence save disk I/Os. It has been measured that the insert buffer can speed up insertions to a table up to 15 times.
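The saving from merging buffered insertions can be sketched with a small model. The class InsertBuffer and its fields are hypothetical. Pages are identified by plain numbers, and a page read is simply counted rather than performed.

```python
from collections import defaultdict


class InsertBuffer:
    """Toy model of insert buffering for a non-unique secondary index."""

    def __init__(self, pages_in_pool):
        self.pages_in_pool = set(pages_in_pool)
        self.buffered = defaultdict(list)   # page number -> pending records
        self.page_reads = 0                 # disk reads of index pages

    def insert(self, page_no, record):
        if page_no in self.pages_in_pool:
            return "direct"                 # page is cached: insert right away
        self.buffered[page_no].append(record)
        return "buffered"                   # no disk read needed yet

    def merge(self):
        # One read per distinct page, however many records it receives.
        merged = {page: list(recs) for page, recs in self.buffered.items()}
        self.page_reads += len(merged)
        self.pages_in_pool.update(merged)
        self.buffered.clear()
        return merged
```

In this model four buffered insertions that touch only two distinct pages cost two page reads at merge time, where inserting each record immediately would have cost four.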

Adaptive hash indexes

If a database fits almost entirely in main memory, the fastest way to perform queries on it is to use hash indexes. InnoDB has an automatic mechanism which monitors index searches made to the indexes defined for a table, and if InnoDB notices that queries could benefit from building a hash index, such an index is automatically built.

But note that the hash index is always built based on an existing B-tree index on the table. InnoDB can build a hash index on a prefix of any length of the key defined for the B-tree, depending on what search pattern InnoDB observes on the B-tree index. A hash index can be partial: it is not required that the whole B-tree index is cached in the buffer pool. InnoDB will build hash indexes on demand to often-accessed pages of the index.

In a sense, through the adaptive hash index mechanism InnoDB adapts itself to ample main memory, coming closer to the architecture of main memory databases.

Physical Record Structure

  • Each index record in InnoDB contains a header of 6 bytes. The header is used to link consecutive records together, and also in the row level locking.

  • Records in the clustered index contain fields for all user-defined columns. In addition, there is a 6-byte field for the transaction id and a 7-byte field for the roll pointer.

  • If the user has not defined a primary key for a table, then each clustered index record contains also a 6-byte row id field.

  • Each secondary index record contains also all the fields defined for the clustered index key.

  • A record contains also a pointer to each field of the record. If the total length of the fields in a record is < 128 bytes, then the pointer is 1 byte, else 2 bytes.
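The list above can be turned into a rough size estimate for a clustered index record. This is a sketch: the function name is hypothetical, and it assumes that the system fields (transaction id, roll pointer, row id) count toward the total field length and each have a field pointer of their own, which the text does not state explicitly.

```python
def clustered_record_size(field_lengths, has_primary_key=True):
    """Approximate bytes used by one clustered index record, per the list above."""
    HEADER = 6       # record header
    TRX_ID = 6       # transaction id field
    ROLL_PTR = 7     # roll pointer field
    ROW_ID = 6       # only present when no primary key is defined

    system = [TRX_ID, ROLL_PTR] + ([] if has_primary_key else [ROW_ID])
    data = sum(field_lengths) + sum(system)
    n_fields = len(field_lengths) + len(system)
    # One pointer per field: 1 byte if the fields total < 128 bytes, else 2.
    pointer_size = 1 if data < 128 else 2
    return HEADER + data + n_fields * pointer_size
```

Under these assumptions a row with a 4-byte and a 20-byte column takes 47 bytes with a primary key and 54 bytes without one.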

How an auto-increment column works in InnoDB

After a database startup, when a user first does an insert to a table T where an auto-increment column has been defined, and the user does not provide an explicit value for the column, then InnoDB executes SELECT MAX(auto-inc-column) FROM T, and assigns that value incremented by one to the column and the auto-increment counter of the table. We say that the auto-increment counter for table T has been initialized.

InnoDB follows the same procedure in initializing the auto-increment counter for a freshly created table.

Note that if the user specifies in an insert the value 0 for the auto-increment column, then InnoDB treats the row as if the value had not been specified.

After the auto-increment counter has been initialized, if a user inserts a row where the column value is explicitly specified, and the value is bigger than the current counter value, then the counter is set to the specified column value. If the user does not explicitly specify a value, then InnoDB increments the counter by one and assigns its new value to the column.

The auto-increment mechanism, when assigning values from the counter, bypasses locking and transaction handling. Therefore you may also get gaps in the number sequence if you roll back transactions that got numbers from the counter.

The behavior of auto-increment is not defined if a user gives a negative value to the column or if the value becomes bigger than the maximum integer that can be stored in the specified integer type.
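The counter rules described above can be collected into a small model. The class AutoIncTable and its methods are hypothetical. The sketch assumes that MAX over an empty table counts as 0, so that the first generated value is 1, and it leaves negative and overflowing values aside because their behavior is not defined.

```python
class AutoIncTable:
    """Toy model of the per-table auto-increment counter described above."""

    def __init__(self, values=()):
        self.values = list(values)   # values currently in the auto-inc column
        self.counter = None          # not initialized until the first insert

    def insert(self, explicit=None):
        if explicit:   # None and 0 are both treated as "no value given"
            if self.counter is not None and explicit > self.counter:
                self.counter = explicit
            value = explicit
        else:
            if self.counter is None:
                # Stand-in for SELECT MAX(auto-inc-column) FROM T, plus one.
                self.counter = max(self.values, default=0) + 1
            else:
                self.counter += 1
            value = self.counter
        self.values.append(value)
        return value

    def rollback_last(self):
        # The row disappears, but the counter is not wound back.
        self.values.pop()
```

Starting from a table holding 3 and 7, successive inserts in this model yield 8, then 9 for an explicit 0, then 20 for an explicit 20, then 21. If the insert that received 21 is rolled back, the next insert gets 22, leaving a gap.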

File Space Management and Disk I/O

Disk I/O

In disk I/O InnoDB uses asynchronous I/O. On Windows NT it uses the native asynchronous I/O provided by the operating system. On Unix, InnoDB uses simulated asynchronous I/O built into InnoDB: InnoDB creates a number of I/O threads to take care of I/O operations, such as read-ahead. In a future version we will add support for simulated I/O on Windows NT and native aio on those versions of Unix that have one.

On Windows NT InnoDB uses non-buffered I/O. That means that the disk pages InnoDB reads or writes are not buffered in the operating system file cache. This saves some memory bandwidth.

Starting from 3.23.41 InnoDB uses a novel file flush technique called doublewrite. It adds safety to crash recovery after an operating system crash or a power outage, and improves performance on most Unix flavors by reducing the need for fsync operations.

Doublewrite means that before writing pages to a data file InnoDB first writes them to a contiguous tablespace area called the doublewrite buffer. Only after the write and the flush to the doublewrite buffer have completed, does InnoDB write the pages to their proper positions in the data file. If the operating system crashes in the middle of a page write, InnoDB will, in recovery, find a good copy of the page from the doublewrite buffer.
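The recovery side of doublewrite can be sketched as follows. The text does not say how a damaged page is recognized, so this example assumes a CRC32 checksum stored in front of each page. The function names and the dictionaries that stand in for the data file and the doublewrite buffer are hypothetical.

```python
import zlib


def make_page(payload):
    """Prefix the payload with a CRC32 so a torn write can be detected."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def is_intact(page):
    """Check the stored CRC32 against the page contents."""
    return len(page) >= 4 and zlib.crc32(page[4:]).to_bytes(4, "big") == page[:4]


def recover(data_file, doublewrite_buffer):
    """Replace torn pages in data_file with good copies from the buffer."""
    for page_no, page in list(data_file.items()):
        copy = doublewrite_buffer.get(page_no)
        if not is_intact(page) and copy is not None and is_intact(copy):
            data_file[page_no] = copy
    return data_file
```

Because the copy in the doublewrite buffer was written and flushed before the page write began, a page torn by a crash in the data file always has an intact counterpart to restore from.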

Starting from 3.23.41 you can also use a raw disk partition as a data file, though this has not been tested yet. When you create a new data file you have to put the keyword newraw immediately after the data file size in innodb_data_file_path. The partition must be at least as large as the size you specify. Note that 1M in InnoDB is 1024 x 1024 bytes, while in disk specifications 1M usually means 1 million bytes.


When you start the database again you must change the keyword to raw. Otherwise, InnoDB will write over your partition!


By using a raw disk you can, on some Unix systems, perform unbuffered I/O.

There are two read-ahead heuristics in InnoDB: sequential read-ahead and random read-ahead. In sequential read-ahead InnoDB notices that the access pattern to a segment in the tablespace is sequential. Then InnoDB will post in advance a batch of reads of database pages to the I/O system. In random read-ahead InnoDB notices that some area in a tablespace seems to be in the process of being fully read into the buffer pool. Then InnoDB posts the remaining reads to the I/O system.

File space management

The data files you define in the configuration file form the tablespace of InnoDB. The files are simply catenated to form the tablespace. There is no striping in use. Currently you cannot directly instruct where the space is allocated for your tables, except by using the following fact: from a newly created tablespace InnoDB will allocate space starting from the low end.

The tablespace consists of database pages whose default size is 16 kB. The pages are grouped into extents of 64 consecutive pages. The ‘files’ inside a tablespace are called segments in InnoDB. The name of the rollback segment is somewhat misleading because it actually contains many segments in the tablespace.
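The page and extent sizes above imply some simple arithmetic, sketched here. The sketch assumes that 16 kB means 16 x 1024 bytes and that pages are numbered from 0 across the whole tablespace; the function name is hypothetical.

```python
PAGE_SIZE_BYTES = 16 * 1024   # default page size (assumed 16 x 1024 bytes)
PAGES_PER_EXTENT = 64
EXTENT_BYTES = PAGE_SIZE_BYTES * PAGES_PER_EXTENT


def page_location(page_no: int):
    """Return (extent number, page offset in extent, byte offset in tablespace).

    Assumes pages are numbered from 0; this is an illustration only.
    """
    extent_no = page_no // PAGES_PER_EXTENT
    offset_in_extent = page_no % PAGES_PER_EXTENT
    byte_offset = page_no * PAGE_SIZE_BYTES
    return extent_no, offset_in_extent, byte_offset


print(EXTENT_BYTES)        # 1048576, so one extent is 1M with the default page size
print(page_location(130))  # (2, 2, 2129920)
```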

For each index in InnoDB we allocate two segments: one is for non-leaf nodes of the B-tree, the other is for leaf nodes. The idea here is to achieve better sequentiality for the leaf nodes, which contain the data.

When a segment grows inside the tablespace, InnoDB allocates the first 32 pages to it individually. After that InnoDB starts to allocate whole extents to the segment. InnoDB can add to a large segment up to 4 extents at a time to ensure good sequentiality of data.
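As a rough illustration of this growth rule, the sketch below estimates how a segment of a given size would have been allocated. It is a simplification under stated assumptions: it ignores deletes, ignores the detail that up to 4 extents may be added at once, and uses a hypothetical function name.

```python
INDIVIDUAL_PAGES = 32   # pages allocated one by one before whole extents are used
PAGES_PER_EXTENT = 64


def segment_allocation(total_pages: int):
    """Return (individually allocated pages, whole extents) for a segment.

    Simplified model: the first 32 pages are allocated individually and
    everything beyond that is rounded up to whole 64-page extents.
    """
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    individual = min(total_pages, INDIVIDUAL_PAGES)
    remaining = total_pages - individual
    extents = (remaining + PAGES_PER_EXTENT - 1) // PAGES_PER_EXTENT
    return individual, extents


print(segment_allocation(20))   # (20, 0)
print(segment_allocation(33))   # (32, 1)
print(segment_allocation(161))  # (32, 3)
```

Note that a segment of 33 pages already occupies a whole extent under this model, which is one reason small tables can use more space than their row data suggests.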

Some pages in the tablespace contain bitmaps of other pages, and therefore a few extents in an InnoDB tablespace cannot be allocated to segments as a whole, but only as individual pages.

When you issue a query SHOW TABLE STATUS FROM ... LIKE ... to ask for available free space in the tablespace, InnoDB will report the extents which are definitely free in the tablespace. InnoDB always reserves some extents for clean-up and other internal purposes; these reserved extents are not included in the free space.

When you delete data from a table, InnoDB will contract the corresponding B-tree indexes. Depending on the pattern of deletes, individual pages or extents are freed to the tablespace, so that the freed space is available for other users. Dropping a table or deleting all rows from it is guaranteed to release the space to other users, but remember that deleted rows can be physically removed only in a purge operation after they are no longer needed in transaction rollback or consistent read.

Defragmenting a table

If there are random insertions or deletions in the indexes of a table, the indexes may become fragmented. By fragmentation we mean that the physical ordering of the index pages on the disk is not close to the alphabetical ordering of the records on the pages, or that there are many unused pages in the 64-page blocks that were allocated to the index.

You can speed up index scans if you periodically use mysqldump to dump the table to a text file, drop the table, and reload it from the dump. Another way to do the defragmenting is to ALTER the table type to MyISAM and back to InnoDB again. Note that a MyISAM table must fit in a single file on your operating system.

If the insertions to an index are always ascending and records are deleted only from the end, the file space management algorithm of InnoDB guarantees that fragmentation in the index will not occur.

Error Handling

The error handling in InnoDB is not always the same as specified in the ANSI SQL standards. According to the ANSI standard, any error during an SQL statement should cause the rollback of that statement. InnoDB sometimes rolls back only part of the statement, and other times the whole transaction. The following list specifies the error handling of InnoDB:

  • If you run out of file space in the tablespace, you will get the MySQL 'Table is full' error and InnoDB rolls back the SQL statement.

  • A transaction deadlock or a timeout in a lock wait makes InnoDB roll back the whole transaction.

  • A duplicate key error only rolls back the insert of that particular row, even in a statement like INSERT INTO ... SELECT .... This will probably change so that the SQL statement will be rolled back if you have not specified the IGNORE option in your statement.

  • A ‘row too long’ error rolls back the SQL statement.

  • Other errors are mostly detected by the MySQL layer of code, and they roll back the corresponding SQL statement.
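The list above can be summarized as a small lookup table, and the deadlock and lock wait timeout cases suggest that an application should be prepared to re-run the whole transaction. The following is a minimal sketch, not a real client API: the exception class, the function name, and the dictionary keys are all hypothetical, and a real program would map its driver's own error codes to these cases.

```python
# Rollback scope per error, as described in the list above.
ROLLBACK_SCOPE = {
    "table full": "statement",
    "deadlock": "transaction",
    "lock wait timeout": "transaction",
    "duplicate key": "row",
    "row too long": "statement",
}


class TransactionRolledBack(Exception):
    """Hypothetical error: the server rolled back the whole transaction."""


def run_transaction(work, max_attempts=3):
    """Call work() and re-run it from the start if the transaction was lost.

    work is a callable that performs the complete transaction, including
    its BEGIN and COMMIT, so that a retry repeats every statement.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return work()
        except TransactionRolledBack:
            if attempt == max_attempts:
                raise  # give up and let the caller see the error
```

The retry wraps the complete transaction rather than a single statement because, after a deadlock or lock wait timeout, none of the transaction's earlier statements remain in effect.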

Restrictions on InnoDB Tables

  • Warning: Do not convert MySQL system tables from MyISAM to InnoDB tables! This is not supported. If you do this MySQL will not restart until you restore the old system tables from a backup or regenerate them with the mysql_install_db script.

  • SHOW TABLE STATUS does not give accurate statistics on InnoDB tables, except for the physical size reserved by the table. The row count is only a rough estimate used in SQL optimisation.

  • If you try to create a unique index on a prefix of a column you will get an error:


    If you create a non-unique index on a prefix of a column, InnoDB will create an index over the whole column.

  • INSERT DELAYED is not supported for InnoDB tables.

  • The MySQL LOCK TABLES operation does not know of InnoDB row level locks set in already completed SQL statements: this means that you can get a table lock on a table even if there still exist transactions of other users that have row level locks on the same table. Thus, your operations on the table may have to wait if they collide with these locks of other users. Also, a deadlock is possible. However, this does not endanger transaction integrity because the row level locks set by InnoDB will always take care of the integrity. Also, a table lock prevents other transactions from acquiring more row level locks (in a conflicting lock mode) on the table.

  • You cannot have a key on a BLOB or TEXT column.

  • A table cannot contain more than 1000 columns.

  • DELETE FROM TABLE does not regenerate the table but instead deletes all rows, one by one, which is not that fast. In future versions of MySQL you can use TRUNCATE, which is fast.

  • The default database page size in InnoDB is 16 kB. By recompiling the code one can set it from 8 kB to 64 kB. The maximum row length is slightly less than half of a database page in versions up to 3.23.40 of InnoDB. Starting from source release 3.23.41 BLOB and TEXT columns are allowed to be less than 4G. The total row length must also be less than 4G. InnoDB does not store fields whose size is less than or equal to 128 bytes on separate pages. After InnoDB has modified the row by storing long fields on separate pages, the remaining length of the row must be less than half a database page. The maximum key length is 7000 bytes.

  • On some operating systems data files must be less than 2G. The combined size of log files must be less than 4G on 32-bit computers.

  • The maximum tablespace size is 4 billion database pages. This is also the maximum size for a table. The minimum tablespace size is 10M.
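To get a feeling for the size limits above, the arithmetic can be worked out as follows. This sketch assumes that "4 billion" means 2**32 pages and that kB means 1024 bytes; both are assumptions of the example, and the function names are hypothetical.

```python
MAX_PAGES = 2 ** 32  # assumption: "4 billion database pages" read as 2**32
TIB = 2 ** 40


def max_tablespace_bytes(page_size_kb: int = 16) -> int:
    """Upper bound on tablespace size for a given page size in kB."""
    return MAX_PAGES * page_size_kb * 1024


def half_page_bytes(page_size_kb: int = 16) -> int:
    """Half a database page; the in-page part of a row must stay below this."""
    return page_size_kb * 1024 // 2


print(max_tablespace_bytes(16) // TIB)  # 64 (terabytes, counted in units of 2**40 bytes)
print(max_tablespace_bytes(8) // TIB)   # 32
print(max_tablespace_bytes(64) // TIB)  # 256
print(half_page_bytes(16))              # 8192
```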

InnoDB Contact Information

To contact Innobase Oy, producer of the InnoDB engine, visit http://www.innodb.com/ or email .

phone: 358-9-6969 3250 (office) 358-40-5617367 (mobile)
Innobase Oy Inc.
World Trade Center Helsinki
Aleksanterinkatu 17
P.O.Box 800
00101 Helsinki
