Another month goes by, which means itβs time for another release!
ClickHouse version 24.3 contains 13 new features π 16 performance optimisations π· 65 bug fixes π
New Contributors
As always, we send a special welcome to all the new contributors in 24.4! ClickHouse's popularity is, in large part, due to the efforts of the community that contributes. Seeing that community grow is always humbling.
Below are the names of the new contributors:
Alexey Katsman, Anita Hammer, Arnaud Rocher, Chandre Van Der Westhuizen, Eduard Karacharov, Eliot Hautefeuille, Igor Markelov, Ilya Andreev, Jhonso7393, Joseph Redfern, Josh Rodriguez, Kirill, KrJin, Maciej Bak, Murat Khairulin, PaweΕ Kudzia, Tristan, dilet6298, loselarry
Hint: if youβre curious how we generate this listβ¦ click here.
You can also view the slides from the presentation.
Recursive CTEs
Contributed by Maksim Kita
SQL:1999 introduced recursive common table expressions (CTEs) for hierarchical queries, making SQL a Turing-complete programming language.
So far, ClickHouse has supported hierarchical queries by utilizing hierarchical dictionaries. With our new query analysis and optimization infrastructure, now enabled by default, we finally have everything in place to introduce long-awaited and powerful features like recursive CTEs.
ClickHouse recursive CTEs have the standard SQL:1999 syntax and pass all PostgreSQL tests for recursive CTEs. Furthermore, ClickHouse now has better support for recursive CTEs than PostgreSQL. Inside the CTEβs UNION ALL clauseβs bottom part, multiple (arbitrarily complex) queries can be specified, the CTE base table can be referenced multiple times, etc.
Recursive CTEs can solve hierarchical problems elegantly and simply. For example, they can easily answer reachability questions for hierarchical data models (e.g. trees and graphs) .
Specifically, recursive CTEs can calculate transitive closures of relations. With the London Underground's tube connections as a binary relation example, you can imagine the set of all directly connected tube stations:
Oxford Circus -> Bond Street
,Bond Street -> Marble Arch
,Marble Arch -> Lancaster Gate
, and so on. The transitive closure of those connections includes all possible connections between these stations, e.g.Oxford Circus -> Lancaster Gate
,Oxford Circus -> Marble Arch
, and so on.
To demonstrate this, we use an adapted version of a data set modeling all London Underground connections where each entry represents two directly connected stations. Then, we can use a recursive CTE to answer questions like this easily:
When starting at Oxford Circus station on the Central Line, which stations can we reach with a maximum of five stops?
We visualize this with a screenshot of the Central Line station map:
We create a ClickHouse table for storing the London Underground connections data set:
CREATE OR REPLACE TABLE Connections (
Station_1 String,
Station_2 String,
Line String,
PRIMARY KEY(Line, Station_1, Station_2)
);
An attentive reader will have recognized that we didnβt specify a table engine in the DDL statement above (this is possible since ClickHouse 24.3) and used PRIMARY KEY syntax in the column definition (possible since ClickHouse 23.7). With that, not just recursive CTEs but also our ClickHouse table DDL syntax is standard-compliant with SQL.
By utilizing the url table function and automatic schema inference, we load the data set directly into our table:
INSERT INTO Connections
SELECT * FROM url('https://datasets-documentation.s3.eu-west-3.amazonaws.com/london_underground/london_connections.csv')
This is how the loaded data looks like:
SELECT
*
FROM Connections
WHERE Line = 'Central Line'
ORDER BY Station_1, Station_2
LIMIT 10;
ββStation_1βββββββ¬βStation_2βββββββββ¬βLineββββββββββ
1. β Bank β Liverpool Street β Central Line β
2. β Bank β St. Paul's β Central Line β
3. β Barkingside β Fairlop β Central Line β
4. β Barkingside β Newbury Park β Central Line β
5. β Bethnal Green β Liverpool Street β Central Line β
6. β Bethnal Green β Mile End β Central Line β
7. β Bond Street β Marble Arch β Central Line β
8. β Bond Street β Oxford Circus β Central Line β
9. β Buckhurst Hill β Loughton β Central Line β
10. β Buckhurst Hill β Woodford β Central Line β
ββββββββββββββββββ΄βββββββββββββββββββ΄βββββββββββββββ
Now, we use a recursive CTE to answer the above question:
WITH RECURSIVE Reachable_Stations AS
(
SELECT Station_1, Station_2, Line, 1 AS stops
FROM Connections
WHERE Line = 'Central Line'
AND Station_1 = 'Oxford Circus'
UNION ALL
SELECT rs.Station_1, c.Station_2, c.Line, rs.stops + 1 AS stops
FROM Reachable_Stations AS rs, Connections AS c
WHERE rs.Line = c.Line
AND rs.Station_2 = c.Station_1
AND rs.stops < 5
)
SELECT DISTINCT (Station_1, Station_2, stops) AS connections
FROM Reachable_Stations
ORDER BY stops ASC;
This is the result:
ββconnectionsβββββββββββββββββββββββββββββββββ
1. β ('Oxford Circus','Bond Street',1) β
2. β ('Oxford Circus','Tottenham Court Road',1) β
3. β ('Oxford Circus','Marble Arch',2) β
4. β ('Oxford Circus','Oxford Circus',2) β
5. β ('Oxford Circus','Holborn',2) β
6. β ('Oxford Circus','Bond Street',3) β
7. β ('Oxford Circus','Lancaster Gate',3) β
8. β ('Oxford Circus','Tottenham Court Road',3) β
9. β ('Oxford Circus','Chancery Lane',3) β
10. β ('Oxford Circus','Marble Arch',4) β
11. β ('Oxford Circus','Oxford Circus',4) β
12. β ('Oxford Circus','Queensway',4) β
13. β ('Oxford Circus','Holborn',4) β
14. β ('Oxford Circus','St. Paul\'s',4) β
15. β ('Oxford Circus','Bond Street',5) β
16. β ('Oxford Circus','Lancaster Gate',5) β
17. β ('Oxford Circus','Tottenham Court Road',5) β
18. β ('Oxford Circus','Notting Hill Gate',5) β
19. β ('Oxford Circus','Chancery Lane',5) β
20. β ('Oxford Circus','Bank',5) β
ββββββββββββββββββββββββββββββββββββββββββββββ
A recursive CTE has a simple iterations-based execution logic that behaves like a recursive self-self-...-self-join that stops self-joining once no new join partners are found or an abort condition is met. For this, our CTE above starts with executing the UNION ALL
clauseβs top part, querying our Connections
table for all stations directly connected to the Oxford Circus
station on the Central Line
. This will return a table that is bound to the Reachable_Stations
identifier and looks like this:
Initial Reachable_Stations table content
ββStation_1ββββββ¬βStation_2βββββββββββββ
β Oxford Circus β Bond Street β
β Oxford Circus β Tottenham Court Road β
βββββββββββββββββ΄βββββββββββββββββββββββ
From now on, only the CTEβs UNION ALL
clauseβs bottom part will be executed (recursively):
Reachable_Stations
is joined with the Connections
table to find the following join partners within the Connections
table whose Station_1
value matches the Station_2
value of Reachable_Stations
:
Connections table join partners
ββStation_1βββββββββββββ¬βStation_2ββββββ
β Bond Street β Marble Arch β
β Bond Street β Oxford Circus β
β Tottenham Court Road β Holborn β
β Tottenham Court Road β Oxford Circus β
ββββββββββββββββββββββββ΄ββββββββββββββββ
Via the UNION ALL
clause, these join partners are added to the Reachable_Stations
table (with the Station_1
column replaced by Oxford Circus
), and the first iteration of our recursive CTE is complete. In the next iteration, by executing the CTEβs UNION ALL
clauseβs bottom part, Reachable_Stations
is again joined with the Connections
table to identify (and add to Reachable_Stations
) all NEW join partners within the Connections
table whose Station_1
value matches the Station_2
value of Reachable_Stations
. These iterations continue until no new join partners are found or a stop condition is met. In our query above, we use a stops
counter to abort the CTEβs iteration cycle when our specified number of allowed stops from the start station is reached.
Note that the result lists Oxford Circus
as a reachable station from Oxford Circus
with 2 and 4 stops. That is theoretically correct but not very practical, and caused by the fact that our query doesnβt mind any directions or cycles. We leave this as a fun exercise for the reader.
As a bonus question, we are interested in how many stops it takes from the Oxford Circus
station on the Central Line
to reach the Stratford
station. We visualize this again with the Central Line map:
For this, we just need to modify our recursive CTEβs abort condition (stopping the CTEβs join iterations once a join partner with Stratford
as the target station got added to the CTE table):
WITH RECURSIVE Reachable_Stations AS
(
SELECT Station_1, Station_2, Line, 1 AS stops
FROM Connections
WHERE Line = 'Central Line'
AND Station_1 = 'Oxford Circus'
UNION ALL
SELECT rs.Station_1 c.Station_2, c.Line, rs.stops + 1 AS stops
FROM Reachable_Stations AS rs, Connections AS c
WHERE rs.Line = c.Line
AND rs.Station_2 = c.Station_1
AND 'Stratford' NOT IN (SELECT Station_2 FROM Reachable_Stations)
)
SELECT max(stops) as stops
FROM Reachable_Stations;
The result shows that it would take 9 stops, which matches the Central Lineβs map plan above:
ββstopsββ
1. β 9 β
βββββββββ
A recursive CTE could easily answer more interesting questions about this data set. For example, the relative connection times in the original version of the data set could be used to discover the fastest (and across tube lines) connection from Oxford Circus
station to Heathrow Airport
station. Stay tuned for the solution to this in a separate follow-up post.
QUALIFY
Contributed by Maksim Kita
Another feature added in this release is the QUALIFY
clause, which lets us filter on the value of window functions.
We will see how to use it with help from the Window Functions - Rankings example. The dataset contains hypothetical football players and their salaries. We can import it into ClickHouse like this:
CREATE TABLE salaries ORDER BY team AS
FROM url('https://raw.githubusercontent.com/ClickHouse/examples/main/LearnClickHouseWithMark/WindowFunctions-Aggregation/data/salaries.csv')
SELECT * EXCEPT (weeklySalary), weeklySalary AS salary
SETTINGS schema_inference_make_columns_nullable=0;
Letβs have a quick look at the data in the salaries
table:
SELECT * FROM salaries LIMIT 5;
ββteamβββββββββββββββ¬βplayerββββββββββββ¬βpositionββ¬βsalaryββ
1. β Aaronbury Seekers β David Morton β D β 63014 β
2. β Aaronbury Seekers β Edwin Houston β D β 51751 β
3. β Aaronbury Seekers β Stephen Griffith β M β 181825 β
4. β Aaronbury Seekers β Douglas Clay β F β 73436 β
5. β Aaronbury Seekers β Joel Mendoza β D β 257848 β
βββββββββββββββββββββ΄βββββββββββββββββββ΄βββββββββββ΄βββββββββ
Next, letβs compute the salary rank by position for each player. I.e., how much are they paid relative to people who play in the same position?
SELECT player, team, position AS pos, salary,
rank() OVER (PARTITION BY position ORDER BY salary DESC) AS posRank
FROM salaries
ORDER BY salary DESC
LIMIT 5
ββplayerβββββββββββ¬βteamβββββββββββββββββββββ¬βposββ¬βsalaryββ¬βposRankββ
1. β Robert Griffin β North Pamela Trojans β GK β 399999 β 1 β
2. β Scott Chavez β Maryhaven Generals β M β 399998 β 1 β
3. β Dan Conner β Michaelborough Rogues β M β 399998 β 1 β
4. β Nathan Thompson β Jimmyville Legionnaires β D β 399998 β 1 β
5. β Benjamin Cline β Stephaniemouth Trojans β D β 399998 β 1 β
βββββββββββββββββββ΄ββββββββββββββββββββββββββ΄ββββββ΄βββββββββ΄ββββββββββ
Letβs say we want to filter posRank
to return only the top 3 paid players by position. We might try to add a WHERE
clause to do this:
SELECT player, team, position AS pos, salary,
rank() OVER (PARTITION BY position ORDER BY salary DESC) AS posRank
FROM salaries
WHERE posRank <= 3
ORDER BY salary DESC
LIMIT 5
Received exception:
Code: 184. DB::Exception: Window function rank() OVER (PARTITION BY position ORDER BY salary DESC) AS posRank is found in WHERE in query. (ILLEGAL_AGGREGATION)
We canβt do this as the WHERE
clause runs before the window function has been evaluated. Before the 24.4 release, we could work around this problem by introducing a CTE:
WITH salaryRankings AS
(
SELECT player,
if(
length(team) <=25,
team,
concat(substring(team, 5), 1, '...')
) AS team,
position AS pos, salary,
rank() OVER (
PARTITION BY position
ORDER BY salary DESC
) AS posRank
FROM salaries
ORDER BY salary DESC
)
SELECT *
FROM salaryRankings
WHERE posRank <= 3
ββplayerβββββββββββββ¬βteamβββββββββββββββββββββ¬βposββ¬βsalaryββ¬βposRankββ
1. β Robert Griffin β North Pamela Trojans β GK β 399999 β 1 β
2. β Scott Chavez β Maryhaven Generals β M β 399998 β 1 β
3. β Dan Conner β Michaelborough Rogue... β M β 399998 β 1 β
4. β Benjamin Cline β Stephaniemouth Troja... β D β 399998 β 1 β
5. β Nathan Thompson β Jimmyville Legionnai... β D β 399998 β 1 β
6. β William Rubio β Nobleview Sages β M β 399997 β 3 β
7. β Juan Bird β North Krystal Knight... β GK β 399986 β 2 β
8. β John Lewis β Andreaberg Necromanc... β D β 399985 β 3 β
9. β Michael Holloway β Alyssaborough Sages β GK β 399984 β 3 β
10. β Larry Buchanan β Eddieshire Discovere... β F β 399973 β 1 β
11. β Alexis Valenzuela β Aaronport Crusaders β F β 399972 β 2 β
12. β Mark Villegas β East Michaelborough ... β F β 399972 β 2 β
βββββββββββββββββββββ΄ββββββββββββββββββββββββββ΄ββββββ΄βββββββββ΄ββββββββββ
This query works, but itβs pretty clunky. Now that we have the QUALIFY
clause, we can filter the data without needing to introduce a CTE, as shown below:
SELECT player, team, position AS pos, salary,
rank() OVER (PARTITION BY position ORDER BY salary DESC) AS posRank
FROM salaries
QUALIFY posRank <= 3
ORDER BY salary DESC;
And weβll get the same results as before.
Join Performance Improvements
Contributed by Maksim Kita
There are also some performance improvements for very specific JOIN use cases.
The first is better predicate pushdown, when the analyzer works out when a filter condition can be applied to both sides of a JOIN.
Letβs look at an example using The OpenSky dataset, which contains air traffic data from 2019-2021. We want to get a list of ten flights that go through San Francisco, which we can do with the following query:
SELECT
l.origin,
r.destination AS dest,
firstseen,
lastseen
FROM opensky AS l
INNER JOIN opensky AS r ON l.destination = r.origin
WHERE notEmpty(l.origin) AND notEmpty(r.destination) AND (r.origin = 'KSFO')
LIMIT 10
SETTINGS optimize_move_to_prewhere = 0
Weβre disabling optimize_move_to_prewhere
so that ClickHouse doesnβt perform another optimization, which would stop us from seeing the benefit of the JOIN improvements. If we run this query on 24.3, weβll see the following output:
ββoriginββ¬βdestββ¬βββββββββββfirstseenββ¬ββββββββββββlastseenββ
1. β 00WA β 00CL β 2019-10-14 21:03:19 β 2019-10-14 22:42:01 β
2. β 00WA β ZGGG β 2019-10-14 21:03:19 β 2019-10-14 22:42:01 β
3. β 00WA β ZGGG β 2019-10-14 21:03:19 β 2019-10-14 22:42:01 β
4. β 00WA β ZGGG β 2019-10-14 21:03:19 β 2019-10-14 22:42:01 β
5. β 00WA β ZGGG β 2019-10-14 21:03:19 β 2019-10-14 22:42:01 β
6. β 00WA β ZGGG β 2019-10-14 21:03:19 β 2019-10-14 22:42:01 β
7. β 00WA β ZGGG β 2019-10-14 21:03:19 β 2019-10-14 22:42:01 β
8. β 00WA β YSSY β 2019-10-14 21:03:19 β 2019-10-14 22:42:01 β
9. β 00WA β YSSY β 2019-10-14 21:03:19 β 2019-10-14 22:42:01 β
10. β 00WA β YSSY β 2019-10-14 21:03:19 β 2019-10-14 22:42:01 β
ββββββββββ΄βββββββ΄ββββββββββββββββββββββ΄ββββββββββββββββββββββ
10 rows in set. Elapsed: 0.656 sec. Processed 15.59 million rows, 451.90 MB (23.75 million rows/s., 688.34 MB/s.)
Peak memory usage: 62.79 MiB.
Letβs have a look at 24.4:
ββoriginββ¬βdestββ¬βββββββββββfirstseenββ¬ββββββββββββlastseenββ
1. β 00WA β 00CL β 2019-10-14 21:03:19 β 2019-10-14 22:42:01 β
2. β 00WA β ZGGG β 2019-10-14 21:03:19 β 2019-10-14 22:42:01 β
3. β 00WA β ZGGG β 2019-10-14 21:03:19 β 2019-10-14 22:42:01 β
4. β 00WA β ZGGG β 2019-10-14 21:03:19 β 2019-10-14 22:42:01 β
5. β 00WA β ZGGG β 2019-10-14 21:03:19 β 2019-10-14 22:42:01 β
6. β 00WA β ZGGG β 2019-10-14 21:03:19 β 2019-10-14 22:42:01 β
7. β 00WA β ZGGG β 2019-10-14 21:03:19 β 2019-10-14 22:42:01 β
8. β 00WA β YSSY β 2019-10-14 21:03:19 β 2019-10-14 22:42:01 β
9. β 00WA β YSSY β 2019-10-14 21:03:19 β 2019-10-14 22:42:01 β
10. β 00WA β YSSY β 2019-10-14 21:03:19 β 2019-10-14 22:42:01 β
ββββββββββ΄βββββββ΄ββββββββββββββββββββββ΄ββββββββββββββββββββββ
10 rows in set. Elapsed: 0.079 sec.
So thatβs about 8 times quicker. If we return all the columns via SELECT *
, the time taken for this query in 24.3 goes up to over 4 seconds and in 24.4 itβs 0.4 seconds, which is a 10x improvement.
Letβs see if we can understand why itβs faster. The two lines of interest are these:
INNER JOIN opensky AS r ON l.destination = r.origin
WHERE notEmpty(l.origin) AND notEmpty(r.destination) AND (r.origin = 'KSFO')
The last predicate of our WHERE
clause is r.origin = 'KSFO'
. In the previous line we said that we only want to do the join if l.destination = r.origin
, which means that l.destination = 'KSFO'
as well. The analyzer in 24.4 knows this and can therefore filter out a bunch of rows much earlier.
In other words, our WHERE
clause effectively looks like this in 24.4
INNER JOIN opensky AS r ON l.destination = r.origin
WHERE notEmpty(l.origin) AND notEmpty(r.destination)
AND (r.origin = 'KSFO') AND (l.destination = 'KSFO')
The second improvement is that the analyzer will now automatically convert an OUTER JOIN
to an INNER JOIN
if the predicate after the JOIN filters out any non-joined rows.
For example, letβs say we initially wrote a query to find flights between San Francisco and New York, capturing both direct flights and ones that have a layover.
SELECT
l.origin,
l.destination,
r.destination,
registration,
l.callsign,
r.callsign
FROM opensky AS l
LEFT JOIN opensky AS r ON l.destination = r.origin
WHERE notEmpty(l.destination)
AND (l.origin = 'KSFO')
AND (r.destination = 'KJFK')
LIMIT 10
We later add an extra filter that only returns rows where r.callsign = 'AAL1424'
.
SELECT
l.origin,
l.destination AS leftDest,
r.destination AS rightDest,
registration AS reg,
l.callsign,
r.callsign
FROM opensky AS l
LEFT JOIN opensky AS r ON l.destination = r.origin
WHERE notEmpty(l.destination)
AND (l.origin = 'KSFO')
AND (r.destination = 'KJFK')
AND (r.callsign = 'AAL1424')
LIMIT 10
SETTINGS optimize_move_to_prewhere = 0
Since we now require the callsign
column on the right-hand side of the join to have a value, the LEFT JOIN
can be converted to an INNER JOIN
. Letβs examine the query performance in 24.3 and 24.4.
24.3
ββoriginββ¬βleftDestββ¬βrightDestββ¬βregβββββ¬βcallsignββ¬βr.callsignββ
1. β KSFO β 01FA β KJFK β N12221 β UAL423 β AAL1424 β
2. β KSFO β 01FA β KJFK β N12221 β UAL423 β AAL1424 β
3. β KSFO β 01FA β KJFK β N12221 β UAL423 β AAL1424 β
4. β KSFO β 01FA β KJFK β N12221 β UAL423 β AAL1424 β
5. β KSFO β 01FA β KJFK β N12221 β UAL423 β AAL1424 β
6. β KSFO β 01FA β KJFK β N87527 β UAL423 β AAL1424 β
7. β KSFO β 01FA β KJFK β N87527 β UAL423 β AAL1424 β
8. β KSFO β 01FA β KJFK β N87527 β UAL423 β AAL1424 β
9. β KSFO β 01FA β KJFK β N87527 β UAL423 β AAL1424 β
10. β KSFO β 01FA β KJFK β N87527 β UAL423 β AAL1424 β
ββββββββββ΄βββββββββββ΄ββββββββββββ΄βββββββββ΄βββββββββββ΄βββββββββββββ
10 rows in set. Elapsed: 1.937 sec. Processed 63.98 million rows, 2.52 GB (33.03 million rows/s., 1.30 GB/s.)
Peak memory usage: 2.84 GiB.
24.4
ββoriginββ¬βleftDestββ¬βrightDestββ¬βregβββββ¬βcallsignββ¬βr.callsignββ
1. β KSFO β 01FA β KJFK β N12221 β UAL423 β AAL1424 β
2. β KSFO β 01FA β KJFK β N12221 β UAL423 β AAL1424 β
3. β KSFO β 01FA β KJFK β N12221 β UAL423 β AAL1424 β
4. β KSFO β 01FA β KJFK β N12221 β UAL423 β AAL1424 β
5. β KSFO β 01FA β KJFK β N12221 β UAL423 β AAL1424 β
6. β KSFO β 01FA β KJFK β N87527 β UAL423 β AAL1424 β
7. β KSFO β 01FA β KJFK β N87527 β UAL423 β AAL1424 β
8. β KSFO β 01FA β KJFK β N87527 β UAL423 β AAL1424 β
9. β KSFO β 01FA β KJFK β N87527 β UAL423 β AAL1424 β
10. β KSFO β 01FA β KJFK β N87527 β UAL423 β AAL1424 β
ββββββββββ΄βββββββββββ΄ββββββββββββ΄βββββββββ΄βββββββββββ΄βββββββββββββ
10 rows in set. Elapsed: 0.762 sec. Processed 23.22 million rows, 939.75 MB (30.47 million rows/s., 1.23 GB/s.)
Peak memory usage: 9.00 MiB.
It's a little bit under three times quicker in 24.4.
If youβd like to learn more about how the JOIN performance improvements were implemented, read Maksim Kitaβs blog post explaining everything.
Thatβs all for the 24.4 release. Weβd love for you to join us for the 24.5 release call on 30 May. Make sure you register so that youβll get all the details for the Zoom webinar.