Learn different ways to insert large numbers of records into the database efficiently in Python
It’s very convenient to use SQLAlchemy to interact with relational databases, whether with plain SQL queries or with object-relational mappers (ORM). However, when it comes to bulk inserts, namely inserting a large number of records into a table, we often run into performance issues. In this post, we will introduce different ways to perform bulk inserts and compare their performance through a hands-on tutorial. You will gain a better understanding of these methods and be able to choose the one that best suits your practical case.
In order to make it a hands-on tutorial that can be followed along, we need to install the libraries required on our computers. It’s recommended to install the packages in a virtual environment so it won’t mess up your system libraries. We will use conda to create a virtual environment because we can install a specific version of Python in the virtual environment:
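One way to do this is sketched below; the environment name and the Python version are illustrative choices, not requirements from this post:

```shell
# Create and activate a virtual environment with a specific Python version
# (the environment name and version here are placeholders).
conda create -n bulk-inserts python=3.10 -y
conda activate bulk-inserts

# Install the packages used in this tutorial.
pip install SQLAlchemy mysqlclient cryptography ipython
```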
Packages installed for this tutorial:
- SQLAlchemy — The main package that will be used to interact with a database.
- mysqlclient — A high-performance driver for the MySQL database. If you encounter some issues installing mysqlclient or using it, you can install PyMySQL instead which has no system dependency issues. Check this post if needed for using PyMySQL as the driver.
- cryptography — Used by SQLAlchemy for authentication.
- ipython — Used to execute interactive Python code more conveniently.
Set up a local MySQL server
In this tutorial, we will use a local MySQL server rather than SQLite to make it more similar to practical use cases. The server can be conveniently set up with Docker:
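A command along these lines works; the container name, volume name, password, and image tag are placeholders you can change:

```shell
# Start a local MySQL server in Docker:
# - a named volume (mysql-data) persists the data across container restarts
# - the root password is passed as an environment variable
# - host port 13306 maps to MySQL's default port 3306
docker run -d \
  --name mysql-bulk-inserts \
  -v mysql-data:/var/lib/mysql \
  -e MYSQL_ROOT_PASSWORD=root \
  -p 13306:3306 \
  mysql:8
```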
Note that a volume is attached to the MySQL Docker container so the data can persist even when the container is restarted. Besides, the root password is specified as an environment variable so it can be used for authentication later. Finally, a high port (13306) is assigned for the container so it won’t have potential conflicts with other existing MySQL servers.
Set up database connection
Now let’s set up the database connection metadata which will be used in the tests introduced soon. Two context managers are created that yield a Session and a Connection object, respectively. The Session object will be used to perform operations with ORM models, and the Connection object for working with SQLAlchemy Core APIs or executing plain SQL queries directly. Some cleanup work is done in the context managers as well so we can run multiple tests consecutively. For a more detailed introduction to the SQLAlchemy engine, connection, and session, please have a look at this post.
The code snippet for setting up database connection metadata is as follows:
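A sketch of what such context managers can look like is below. It assumes SQLAlchemy 1.4+, uses an in-memory SQLite URL in place of the post's MySQL URL (mysql+mysqldb://root:root@localhost:13306/data) so the snippet runs on its own, and includes a minimal stand-in for the ORM class defined in the next section:

```python
from contextlib import contextmanager

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

# The post targets a local MySQL server, e.g.
# "mysql+mysqldb://root:root@localhost:13306/data".
# An in-memory SQLite URL is used here so the sketch is self-contained.
DB_URL = "sqlite://"
engine = create_engine(DB_URL)

Base = declarative_base()


class Customer(Base):  # minimal stand-in for the ORM class shown later
    __tablename__ = "customers"
    id = Column(Integer, primary_key=True)
    name = Column(String(50))


@contextmanager
def get_session(cleanup=True):
    """Yield a Session for ORM operations; drop the table after if cleanup."""
    Base.metadata.create_all(engine)
    session = Session(engine)
    try:
        yield session
    finally:
        session.close()
        if cleanup:
            Base.metadata.drop_all(engine)


@contextmanager
def get_connection(cleanup=True):
    """Yield a Connection for Core APIs or plain SQL queries."""
    Base.metadata.create_all(engine)
    conn = engine.connect()
    try:
        yield conn
    finally:
        conn.close()
        if cleanup:
            Base.metadata.drop_all(engine)
```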
- It is localhost that should be used as the hostname in the DB URL above; otherwise, there can be connection issues.
Create an ORM class for testing
We will create a simple customers table that has two fields, where id is the primary key, auto-incremented by default. By the way, the table will be located in the data schema as specified in the DB URL above. The ORM class for this table is shown below. The table will be created when a session or connection is created by the context managers and will be dropped when the cleanup parameter is set to True.
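A sketch of such an ORM class follows; the post does not show the second column, so its name and type here are assumptions:

```python
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Customer(Base):
    """ORM class for the customers table."""

    __tablename__ = "customers"

    # The primary key is auto-incremented by default.
    id = Column(Integer, primary_key=True)
    # The second column is assumed here to be a short string field.
    name = Column(String(50))
```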
Add the ORM objects one by one
Now let’s add a large number of records to the table with different methods and compare their performance. The first one is Session.add(), which is very commonly used when you use ORM to interact with a database.
We will first add 20,000 records to the database without specifying the primary keys:
This test function takes about 5 seconds. The time taken can vary depending on the performance of your computer and can be slightly different each time it’s run. If it’s too fast or too slow, you can fine-tune the num parameter. And if you want to check the inserted data in the database, set cleanup to False.
With ORM, there is a shortcut method Session.add_all() that takes a list of ORM instances as the parameter:
The performance with Session.add_all() should be pretty similar because the data is not saved to the database until you run Session.commit(), which is the real time-limiting step.
Actually, taking 5 seconds to insert 20,000 records can be a major performance issue for an application. It can be even more serious if the database is located on a remote server. There are two main reasons for the low performance:
- An ORM instance needs to be created for each record.
- The primary keys and other default values need to be returned to the ORM instances due to the unit-of-work design of ORM.
The second one is more impactful, which can be demonstrated if we provide primary keys for the ORM instances created:
- If the primary key is auto-incremented and is explicitly specified as here, it must not be zero; otherwise, the data may not be inserted successfully. You can try to change it to id=idx and see if it happens to you as well.
It turns out that the performance can be dramatically improved if primary keys are provided. This is great! However, this is still not the most efficient way to perform bulk inserts with SQLAlchemy, and sometimes it may not be applicable to specify primary keys, so hang in there a bit.
SQLAlchemy has some methods specifically designed for bulk operations. For bulk inserts, there are Session.bulk_save_objects() and Session.bulk_insert_mappings(). Session.bulk_save_objects() takes a list of ORM instances as the parameter, similar to Session.add_all(), while Session.bulk_insert_mappings() takes a list of mappings/dictionaries. We will use Session.bulk_save_objects() here and Session.bulk_insert_mappings() in the next section.
Before starting to use it, we should be aware of two major caveats of Session.bulk_save_objects():
- Most ORM benefits like foreign key relationships and automatic updates of the attributes are not available for the ORM instances passed to this method. If we want these benefits, we should not use this method but Session.add_all() instead.
- We should not have the primary keys of the ORM instances being inserted returned; otherwise, the performance will be dramatically degraded. If we need the primary keys to be returned, we should also use Session.add_all() instead.
In the following code snippet, we will perform three tests and compare their performances:
- Session.bulk_save_objects() with primary keys returned.
- Session.bulk_save_objects() without returning primary keys.
- Session.bulk_save_objects() with the primary keys explicitly specified.
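The three tests can be sketched as follows, under the same assumptions as the earlier snippets; return_defaults=True is what asks for the generated primary keys back:

```python
import time

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Customer(Base):
    __tablename__ = "customers"
    id = Column(Integer, primary_key=True)
    name = Column(String(50))


engine = create_engine("sqlite://")  # stand-in for the MySQL server
NUM = 20_000


def reset():
    """Recreate the table so the three tests don't collide on primary keys."""
    Base.metadata.drop_all(engine)
    Base.metadata.create_all(engine)


def run(make_objects, **kwargs):
    reset()
    objects = make_objects()
    start = time.perf_counter()
    with Session(engine) as session:
        session.bulk_save_objects(objects, **kwargs)
        session.commit()
    return time.perf_counter() - start


def no_pk():
    return [Customer(name=f"Customer {i}") for i in range(NUM)]


def with_pk():
    return [Customer(id=i + 1, name=f"Customer {i}") for i in range(NUM)]


# 1. Return the generated primary keys (dramatically slower).
print(f"return_defaults=True: {run(no_pk, return_defaults=True):.2f}s")
# 2. Don't return the primary keys (the fast path).
print(f"no PKs returned:      {run(no_pk):.2f}s")
# 3. Specify the primary keys explicitly.
print(f"PKs specified:        {run(with_pk):.2f}s")
```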
When the three tests are run, it shows that when the primary keys are returned, the performance is indeed dramatically degraded. However, different from the case of Session.add_all(), it doesn’t matter much whether primary keys are specified for the ORM instances to be saved.
Another SQLAlchemy method for bulk inserts is Session.bulk_insert_mappings(). As the name indicates, a list of mappings (dictionaries in Python) is passed as the parameter for this method. The benefit of using mappings directly is to avoid the overhead of creating ORM instances, which is normally not an issue but can become significant when a large number of ORM instances need to be created and saved.
In the following code snippet, we will perform two tests and compare their performances:
- Session.bulk_insert_mappings() with no primary keys specified.
- Session.bulk_insert_mappings() with primary keys specified.
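A sketch of the two tests, under the same assumptions as the earlier snippets:

```python
import time

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Customer(Base):
    __tablename__ = "customers"
    id = Column(Integer, primary_key=True)
    name = Column(String(50))


engine = create_engine("sqlite://")  # stand-in for the MySQL server
NUM = 20_000


def run(mappings):
    """Recreate the table, then bulk-insert the given list of dictionaries."""
    Base.metadata.drop_all(engine)
    Base.metadata.create_all(engine)
    start = time.perf_counter()
    with Session(engine) as session:
        session.bulk_insert_mappings(Customer, mappings)
        session.commit()
    return time.perf_counter() - start


no_pk = [{"name": f"Customer {i}"} for i in range(NUM)]
with_pk = [{"id": i + 1, "name": f"Customer {i}"} for i in range(NUM)]

print(f"mappings, no PKs:   {run(no_pk):.2f}s")
print(f"mappings, with PKs: {run(with_pk):.2f}s")
```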
The above tests are extremely fast. If no primary keys are specified, it’s about two times faster than Session.bulk_save_objects(), which is in turn about five times faster than Session.add_all(). Besides, similar to Session.bulk_save_objects(), it doesn’t matter much whether primary keys are specified for the mappings to be saved.
Use SQLAlchemy Core API
SQLAlchemy ORM models are built on top of the Core APIs. If performance is the only goal, we can use the Core APIs to insert directly and avoid all the overhead of ORM models.
We can use the SQL Expression Language to access the Core APIs of SQLAlchemy. The benefit of the SQL Expression Language is the ability to access the Core APIs directly and thus achieve high performance, while at the same time providing a backend-neutral language applicable to all types of relational databases. The direct usage of plain MySQL queries will be introduced in the next section.
We can use the __table__ property of the ORM class to access the underlying Table object, which provides the Insert construct. Similar to Session.bulk_insert_mappings(), a list of mappings/dictionaries can be passed to the Insert construct. However, an SQLAlchemy Connection object is used to execute the insert expression, rather than a Session object.
In the following code snippet, we will perform two tests and compare their performances:
- Use Core API to insert a list of dictionaries with no primary keys.
- Use Core API to insert a list of dictionaries with primary keys specified.
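A sketch of the two tests with the Core API, under the same assumptions as the earlier snippets; engine.begin() yields a Connection inside a transaction that commits on exit:

```python
import time

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Customer(Base):
    __tablename__ = "customers"
    id = Column(Integer, primary_key=True)
    name = Column(String(50))


engine = create_engine("sqlite://")  # stand-in for the MySQL server
NUM = 20_000


def run(mappings):
    """Execute the Insert construct from Customer.__table__ on a Connection."""
    Base.metadata.drop_all(engine)
    Base.metadata.create_all(engine)
    start = time.perf_counter()
    with engine.begin() as conn:  # a Connection, not a Session
        conn.execute(Customer.__table__.insert(), mappings)
    return time.perf_counter() - start


no_pk = [{"name": f"Customer {i}"} for i in range(NUM)]
with_pk = [{"id": i + 1, "name": f"Customer {i}"} for i in range(NUM)]

print(f"Core insert, no PKs:   {run(no_pk):.2f}s")
print(f"Core insert, with PKs: {run(with_pk):.2f}s")
```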
The above tests are even faster than using Session.bulk_insert_mappings(), though not by much, since the overhead of using ORM models is avoided completely here. Besides, it doesn’t matter much whether primary keys are specified for the mappings to be saved.
Use plain SQL query
If you are old school and only want to work with plain SQL queries, not the Core API or ORM at all, you can use Connection.exec_driver_sql() to perform bulk inserts, which utilizes the underlying DBAPI directly and has the same performance as using the Core APIs shown above:
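A self-contained sketch follows; SQLite again stands in for the MySQL server, so the parameter placeholder is "?" here (with mysqlclient or PyMySQL it would be "%s"), and passing a list of parameter tuples triggers the DBAPI's executemany():

```python
import time

from sqlalchemy import create_engine

engine = create_engine("sqlite://")  # stand-in for the MySQL server
NUM = 20_000

with engine.begin() as conn:
    conn.exec_driver_sql(
        "CREATE TABLE customers "
        "(id INTEGER PRIMARY KEY AUTOINCREMENT, name VARCHAR(50))"
    )

start = time.perf_counter()
with engine.begin() as conn:
    # A list of parameter tuples makes the DBAPI run executemany().
    # SQLite's paramstyle is "?"; MySQL drivers use "%s" instead.
    conn.exec_driver_sql(
        "INSERT INTO customers (name) VALUES (?)",
        [(f"Customer {i}",) for i in range(NUM)],
    )
print(f"exec_driver_sql(): {time.perf_counter() - start:.2f}s")
```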
Check this post if you want to learn more about executing plain SQL queries in SQLAlchemy.
The code for all the examples can be found here. You can run it directly once you install the libraries and set up the MySQL server.
In this post, different SQLAlchemy methods for bulk inserts were introduced. The code for each was presented in an easy-to-follow manner and the performances were compared systematically.
In conclusion, if you work with plain SQL queries, you don’t need to worry about the performance of SQLAlchemy because it’s calling the underlying DBAPI directly. It’s the query itself that should be optimized.
If you sit in the middle and use neither plain SQL queries nor ORM models but the so-called Expression Language, you can use the Insert construct, which accesses the Core API directly to perform bulk inserts and is also very efficient.
Finally, if you use ORM models and want to access the updated state of the ORM instances after inserts, you should use Session.add_all(). Provide the primary keys if available, because the performance can be dramatically improved. On the other hand, if you use ORM models and don’t need to access the updated data, you can use Session.bulk_insert_mappings(), which has efficiency comparable to the Core APIs.