
How we made querying Pandas DataFrames with chDB 87x faster

Auxten Wang
Auxten Wang · Aug 29, 2024 · 12 minute read


It’s been almost two years since I started working on chDB—an embedded version of ClickHouse that runs in-process—and six months since chDB joined ClickHouse. In this blog post, I want to share some performance improvements I’ve worked on in the last few months.

When I started building chDB, one of the earliest challenges was building on ClickHouse Local’s support for querying data from various data sources with zero performance loss. One thing to keep in mind is that all input and output with ClickHouse Local is done through file descriptors, as shown in the diagram below.

[Diagram: ClickHouse Local performs all input and output through file descriptors]

This isn’t a problem for ClickHouse Local, but it is problematic for an in-process engine like chDB, where we want to work with data read or generated by libraries like Pandas, Numpy, or PyArrow.

To perform efficient SQL queries on this data, chDB needed to meet the following requirements:

  1. Zero-copy reads, i.e., using Python’s memoryview for direct memory mapping between ClickHouse and the Python runtime (see the sketch after this list).
  2. Vectorized reading that takes CPU and memory hardware characteristics into account while making full use of existing statistics.
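
To make the zero-copy idea concrete, here is a minimal sketch (not chDB’s internal code) showing how a DataFrame column’s underlying buffer can be exposed through Python’s memoryview without serializing or copying anything:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(1_000_000, dtype=np.int64)})

# For a single-dtype numeric column, to_numpy() typically returns a view
# of the column's underlying buffer rather than a copy.
col = df["a"].to_numpy()

# memoryview exposes the same memory to native code with zero copying.
view = memoryview(col)
assert view.nbytes == col.nbytes  # same bytes, no serialization step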

The initial version of chDB was designed with simplicity in mind. For in-memory data, taking a Pandas DataFrame as an example, the first version of DataFrame querying was implemented as follows:

  1. Serialize the in-memory DataFrame to Parquet, written to a temporary file or memfd. We initially looked at serializing the data to an Arrow Buffer, but our tests showed that Parquet was faster. (A rough sketch of this serialization round trip follows the list.)
  2. Modify the data source table in the SQL statement, replacing it with ClickHouse’s File table engine and passing in the temporary file’s file descriptor.
  3. Run the ClickHouse engine, setting the output format to Parquet.
  4. Read the Parquet buffer and convert it back into a DataFrame.
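
Assuming pyarrow is installed, the following rough sketch (an illustration only, not chDB’s code) shows where the overhead of this flow comes from by timing just steps 1 and 4, the Parquet serialization round trip, before any query work is done:

import io
import time

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": np.random.randint(0, 1_000, 10_000_000),
    "b": np.random.choice(["tom", "jerry", "auxten"], 10_000_000),
})

start = time.perf_counter()
buf = io.BytesIO()
df.to_parquet(buf)                                # step 1: serialize the input to Parquet
_ = pd.read_parquet(io.BytesIO(buf.getvalue()))   # step 4: deserialize the result
print(f"round trip: {time.perf_counter() - start:.2f}s spent before and after any query work")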

[Diagram: chDB v1 query flow, serializing the DataFrame to Parquet and back]

This implementation resulted in most of the time being spent on serialization, deserialization, and memory copying. Even with the use of memfd, performance was still unsatisfactory.

As shown in the chart below, almost every query in the ClickBench benchmark took chDB over 30 seconds.

[Chart: chDB v1 SQL-on-DataFrame ClickBench results]

Introducing the Python table engine

In June 2024, chDB v2 introduced SQL on DataFrame, allowing you to easily run SQL on a DataFrame variable as a table like this:

import chdb
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["one", "two", "three"]})
chdb.query("SELECT * FROM Python(df)").show()

Numpy Arrays, PyArrow Tables, and Python Dict variables can also be queried as tables in a similar manner:

import chdb
import pandas as pd
import pyarrow as pa

data = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": ["tom", "jerry", "auxten", "tom", "jerry", "auxten"],
}
chdb.query("""
SELECT b, sum(a) 
FROM Python(data) 
GROUP BY b
"").show()

arrow_table = pa.table(data)
chdb.query("""
SELECT b, sum(a) FROM 
Python(arrow_table) 
GROUP BY b
""").show()

In the next section, I’ll share how we solved the efficiency challenges described above and made chDB the fastest SQL on DataFrame engine in the world, one line at a time.

How we created the Python table function

I found it reasonably easy to add a new table function to ClickHouse. The process was divided into three steps, none of which involved any complex C++ logic:

  1. Declare and register the TableFunctionPython.
  2. Define the logic for StoragePython, focusing on how to read data and obtain the table schema.
  3. Define the logic for PythonSource, focusing on how to return data blocks for the concurrent pipeline.

First, we need to declare and register a TableFunctionPython. This way, after ClickHouse’s parser parses the SQL into an AST, it knows there is a table engine called Python:

[Code screenshot: declaring the TableFunctionPython class]

The registration process is also very straightforward, mainly involving providing necessary descriptions and examples, then declaring that "Python" is case-insensitive:

[Code screenshot: registering the Python table function with a description, examples, and a case-insensitive name]

The main role of the StoragePython class is to plug into ClickHouse’s data pipeline; most of the data processing happens in the PythonSource class below. In earlier versions of ClickHouse, the IStorage.read function was responsible for the actual data processing, but it is now part of the physical execution plan.

[Code screenshot: the StoragePython class]

PythonSource inherits from ISource, and ISource.generate is responsible for data generation once the pipeline starts running.

[Code screenshot: PythonSource and its generate function]

Challenges implementing the Python table function

Although the overall code framework of the chDB Python Table Engine is relatively simple, we encountered many unforeseen issues along the way. Most problems arose from performance issues when interacting between C++ and Python.

For example, when reading in-memory Pandas DataFrames in chDB, calls into parts of CPython (the C implementation of Python, which is also the mainstream implementation) are unavoidable. This leads to two significant challenges: the GIL and object reference counting.

Fly with GIL

Due to the existence of the Python GIL (Global Interpreter Lock), any CPython function call requires acquiring the GIL first. If the GIL’s granularity is too large, it will directly cause ClickHouse’s multi-threaded engine to degrade into serial execution under the GIL’s constraints; if the GIL granularity is too small, frequent lock contention between threads will occur, which might even make the execution slower than a single-threaded program.

In CPython, the global interpreter lock, or GIL, is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecodes at once. The GIL prevents race conditions and ensures thread safety. A nice explanation of how the Python GIL helps in these areas can be found here. In short, this mutex is necessary mainly because CPython's memory management is not thread-safe.

--- https://wiki.python.org/moin/GlobalInterpreterLock
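
To see why the granularity matters, here is a small standalone illustration (not chDB code): two threads doing pure-Python, CPU-bound work take roughly as long as doing the same work serially, because only the thread holding the GIL can execute bytecode at any moment:

import threading
import time

def cpu_bound(n: int) -> int:
    # A pure-Python loop: the executing thread holds the GIL the whole time.
    total = 0
    for i in range(n):
        total += i
    return total

N = 10_000_000

start = time.perf_counter()
cpu_bound(N)
cpu_bound(N)
print(f"serial:   {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
threads = [threading.Thread(target=cpu_bound, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Roughly the same as the serial run: the GIL serializes the two threads.
print(f"threaded: {time.perf_counter() - start:.2f}s")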

Avoid Ref Counting

Python has automatic garbage collection, which makes writing Python code very easy. However, whenever you reference an existing Python object in memory or create a new one from C++, you must manage its reference count correctly; otherwise, you may end up with double-free errors or memory leaks.
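
From the Python side, this bookkeeping can be observed with sys.getrefcount; in C++, the equivalent accounting is done with Py_INCREF/Py_DECREF calls (a small illustration, not chDB code):

import sys

obj = [1, 2, 3]
print(sys.getrefcount(obj))  # 2: the `obj` name plus the temporary argument reference

alias = obj                  # every new reference must be accounted for
print(sys.getrefcount(obj))  # 3

del alias                    # dropping a reference decrements the count again
print(sys.getrefcount(obj))  # back to 2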

Therefore, for chDB, the only viable approach is to avoid calling CPython API functions to read CPython objects as much as possible. Does this sound crazy?!

Let me briefly explain what we did to make chDB one of the fastest SQL engines on Pandas DataFrame despite Python dragging its feet.

Performance Optimizations

At the start of the blog post, we mentioned that chDB v1 required at least four extra serialization and deserialization steps, which resulted in every query in ClickBench taking over 30 seconds.

The first optimization was to reduce this overhead and read Python objects directly. This dramatically improved the time taken by most of the ClickBench queries. For the most time-consuming Q23 query, the time was reduced by nearly four times to 8.6 seconds.

[Chart: ClickBench query times after reading Python objects directly]

Still, we wanted to make it even faster to maintain ClickHouse’s performance despite the presence of Python’s GIL and GC. We managed to achieve this by doing the following things:

  1. Minimized the number of CPython API function calls. Where calls were unavoidable, we grouped them together so that no CPython functions are called once the pipeline has been constructed and starts running.
  2. Batched data copying as much as possible, utilizing ClickHouse’s SIMD-optimized memcpy.
  3. Rewrote the Python string encoding and decoding logic in C++.

The last point might be a bit hard to understand, so let me elaborate:

Python String Encoding

Due to historical reasons, the data structure of Python’s str is extremely complex. It can be stored in UTF-8, UTF-16, UTF-32, or even more obscure encodings. When users further manipulate the str, Python’s runtime may convert it to UTF-8 encoding and cache the result in the str structure.
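
You can glimpse this from Python itself: the same str type stores characters at different widths depending on content (a quick illustration; exact byte counts vary by Python version because they include object overhead):

import sys

# Per-character storage grows with the widest code point in the string.
print(sys.getsizeof("aaaa"))           # 1 byte per character (Latin-1 range)
print(sys.getsizeof("aaa\u0394"))      # 2 bytes per character once a BMP character appears
print(sys.getsizeof("aaa\U0001F600"))  # 4 bytes per character for characters beyond the BMP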

This results in many cases that need to be handled. As mentioned, any call to a CPython function requires acquiring the GIL first. If you use Python’s internal implementation to convert to UTF-8, only one thread can work at a time. So, out of necessity, we re-implemented the str encoding conversion logic in C++.

Our efforts directly led to another significant performance leap for Q23, reducing the time from 8.6 seconds to 0.56 seconds, a 15x improvement!

Now that we’ve addressed these issues, we thought we should see how well chDB performs against DuckDB, a popular in-process analytics database.

The chart below shows the two databases' performance when querying DataFrames that contain 10 million rows of ClickBench data:

[Chart: chDB v2 vs DuckDB SQL-on-DataFrame benchmark results]

Note: The benchmark above was run on an AMD EPYC 9654 with 128 GB RAM and an HP FX900 4 TB NVMe drive, using 30 million rows of ClickBench data. The related code is in pd_zerocopy.ipynb.

To summarize the chart:

[Chart: summary of the benchmark results]

Of course, performance isn’t everything in a database. chDB v2’s 87x performance improvement is pretty cool, but this approach alone does not cover the diverse and complex data querying needs of Python users.

Therefore, I began considering creating a mechanism that allows users to define their own table-returning logic. This way, chDB users can combine Python’s flexibility with ClickHouse’s high performance to query any dataset they like.

User-defined Python Table Function

After several weeks of development, we now have chdb.PyReader. You only need to inherit from this class and implement the read function to define, in Python, the data returned to ClickHouse as a table function. Like this:

import chdb

class MyReader(chdb.PyReader):
    def __init__(self, data):
        self.data = data
        self.cursor = 0
        super().__init__(data)

    def read(self, col_names, count):
        # Return the requested columns as lists of values; [] signals end of data.
        if self.cursor >= len(self.data[col_names[0]]):
            return []
        block = [self.data[col] for col in col_names]
        self.cursor += len(block[0])
        return block

And then, we can use our new reader like this:

# Initialize reader with sample data
reader = MyReader({
    "a": [1, 2, 3, 4, 5, 6],
    "b": ["tom", "jerry", "auxten", "tom", "jerry", "auxten"],
})

# Execute a query on the Python reader and display results
chdb.query("SELECT b, sum(a) FROM Python('reader') GROUP BY b").show()

SQL on API

With chdb.PyReader, you can define your own data return logic using Python. So I made a demo for querying Google Calendar using SQL. By running:

python google_cal.py \
  "SELECT summary, organizer_email, parseDateTimeBestEffortOrNull(start_dateTime) WHERE status = 'confirmed';"

You can retrieve all your accepted meeting invitations:

[Screenshot: query results listing confirmed Google Calendar meetings]

Through the interface provided by chDB, you can easily treat many APIs that return JSON arrays as ClickHouse tables and run SQL queries on them. You don’t need to store additional data or manually define table structures.
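
As a minimal sketch of the idea (the endpoint and field handling here are hypothetical, and this is not the Google Calendar demo’s code), pivoting a JSON array of objects into column-oriented data is enough to query it with the Python() table function:

import chdb
import requests  # assumption: any HTTP client would work; requests is used for brevity

# Hypothetical endpoint returning a JSON array of objects.
rows = requests.get("https://api.example.com/events").json()

# Pivot the row-oriented JSON into column-oriented data for Python().
columns = {key: [row.get(key) for row in rows] for key in rows[0]} if rows else {"empty": []}

chdb.query("SELECT * FROM Python(columns) LIMIT 10").show()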

All the above features are available in chDB v2.0.2 and later versions, which you can install by running the following:

pip install "chdb>=2.0.2"

If you're interested in building your own applications with chDB, we welcome you to join our Discord. Don't forget to give chDB a GitHub star. And check out the chDB documentation.
