Backfilling Data
Whether new to ClickHouse or responsible for an existing deployment, users will invariably need to backfill tables with historical data. In some cases, this is relatively simple but can become more complex when materialized views need to be populated. This guide documents some processes for this task that users can apply to their use case.
This guide assumes users are already familiar with the concept of Incremental Materialized Views and data loading using table functions such as s3 and gcs. We also recommend users read our guide on optimizing insert performance from object storage, the advice of which can be applied to inserts throughout this guide.
Example dataset
Throughout this guide, we use a PyPI dataset. Each row in this dataset represents a Python package download using a tool such as pip
.
For example, the subset covers a single day - 2024-12-17
and is available publicly at https://datasets-documentation.s3.eu-west-3.amazonaws.com/pypi/2024-12-17/
. Users can query with:
SELECT count()
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/pypi/2024-12-17/*.parquet')
┌────count()─┐
│ 2039988137 │ -- 2.04 billion
└────────────┘
1 row in set. Elapsed: 32.726 sec. Processed 2.04 billion rows, 170.05 KB (62.34 million rows/s., 5.20 KB/s.)
Peak memory usage: 239.50 MiB.
The full dataset for this bucket contains over 320 GB of parquet files. In the examples below, we intentionally target subsets using glob patterns.
We assume the user is consuming a stream of this data e.g. from Kafka or object storage, for data after this date. The schema for this data is shown below:
DESCRIBE TABLE s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/pypi/2024-12-17/*.parquet')
FORMAT PrettyCompactNoEscapesMonoBlock
SETTINGS describe_compact_output = 1
┌─name───────────────┬─type──────────────────────────────────────────────────────────────────────────────────────────────── ────────────────────────────────────┐
│ timestamp │ Nullable(DateTime64(6)) │
│ country_code │ Nullable(String) │
│ url │ Nullable(String) │
│ project │ Nullable(String) │
│ file │ Tuple(filename Nullable(String), project Nullable(String), version Nullable(String), type Nullable(String)) │
│ installer │ Tuple(name Nullable(String), version Nullable(String)) │
│ python │ Nullable(String) │
│ implementation │ Tuple(name Nullable(String), version Nullable(String)) │
│ distro │ Tuple(name Nullable(String), version Nullable(String), id Nullable(String), libc Tuple(lib Nullable(String), version Nullable(String))) │
│ system │ Tuple(name Nullable(String), release Nullable(String)) │
│ cpu │ Nullable(String) │
│ openssl_version │ Nullable(String) │