Clickhouse In A Flash

Last updated: 2024-06-30

ClickHouse is an incredibly fast and scalable SQL database that is rapidly growing in popularity. As well as being used as the backend for observability platforms such as SigNoz, GroundCover and HyperDX, it is also the data store for hyper-scalers such as Uber, eBay, Disney, Cloudflare any many more. You can opt to send your data to the ClickHouse cloud or self-host the OSS version. In this article we are going to look at how easy it is to run one of the world's most powerful databases on your desktop.

Sorry - No Windows!

if you are running ClickHouse on Docker on Windows then you can't mount a volume. If you do, then when ClickHouse tries to write to the volume it will fail, and you will encounter an error such as this:


Received exception from server (version 24.1.5):
Code: 1001. DB::Exception: Received from localhost:9000. 
DB::Exception: std::__1::__fs::filesystem::filesystem_error: 
filesystem error: in rename: Permission denied 
["/var/lib/clickhouse/store/cd8/cd821f2c-d9a7-4b9e-a377-ee5638603510/tmp_insert_all_1_1_0/"]... 

This is not a bug and there is no fix. ClickHouse does not support writing to non-Unix systems. Full stop.

At the moment, the only way to run ClickHouse on Windows and map a volume is to use the WSL.

In a production environment you would also certainly run ClickHouse on a Kubernetes cluster. For this exercise, we are just going to run the ClickHouse server locally on a Docker container. So, first of all, we will just pull the image:

docker pull clickhouse/clickhouse-server

As you might expect, ClickHouse will need to write data and logs to a location on the host machine. We therefore need to map some volumes in our Docker run command. So, let's create some directories:

mkdir /var/lib/clickhouse 
mkdir /var/log/clickhouse-server 

And now we will execute the Docker run command

docker run -d --name clickhouse-server-01 -p 9000:9000 --ulimit nofile=262144:262144 -v /var/lib/clickhouse/:/var/lib/clickhouse/  -v /var/log/clickhouse-server/:/var/log/clickhouse-server/ clickhouse/clickhouse-server 
  

In this example we are using the ulimit parameter to limit the maximum number of open files for this container to 262144. we are exposing port 9000 so that we can connect to it later from our host. Next, we can try connecting to our db:

docker exec -it clickhouse-server-01 clickhouse-client

We now have a connection to the ClickHouse client:

Creating A Database

Now that we are connected to the client we can start running some queries. First let's have a look at our default databases:

SHOW DATABASES;

Obviously, if we want to use ClickHouse we will need a database and some data to populate it. We are going to use the New York Taxi data set - as recommended in the ClickHouse documentation. One cool aspect of the ClickHouse Insert statement is that it can pull data from a web Url. The whole dataset consists of billions of records and is broken down into a number of files.

Next, we are going to create a database to store the data:

CREATE DATABASE IF NOT EXISTS TaxiData

Now we will make TaxiData our working database

Use TaxiData

Next, we will create our Trips table. To do this we will run the Create table script in Appendix 1. This script is taken from the ClickHouse tutorial: https://clickhouse.com/docs/en/tutorial

Inserting Data

Our next task is to insert the data into our table.


    INSERT INTO trips SELECT 
    trip_id, 
    vendor_id,
    pickup_date,
    pickup_datetime,
    dropoff_date,
    dropoff_datetime,
    rate_code_id,
    pickup_longitude,
    pickup_latitude,
    dropoff_longitude,
    dropoff_latitude,
    passenger_count,
    trip_distance,
    fare_amount, 
    extra,
    mta_tax,
    tip_amount,
    tolls_amount,
    ehail_fee,
    improvement_surcharge,
    total_amount, 
    payment_type
    FROM gcs('https://storage.googleapis.com/clickhouse-public-datasets/nyc-taxi/trips_{0..2}.gz', 'TabSeparatedWithNames')
    

We can now watch the progress of our insert:

As you can see, we have downloaded and inserted three million records in around 168 seconds.

The image below also gives us an interesting insight into the internals of ClickHouse storage.

Now that we have our data inserted, we can start to play with it. This query shows the average fare based on the number of passengers:

SELECT  
   passenger_count, 
   avg(total_amount) 
FROM trips 
GROUP BY passenger_count; 

This is the result of the query:

This is pretty impressive performance. This was run on a Docker container on a home server. It has processed 3 million rows in 0.028 seconds - a rate of 106 million rows per second.

Breaking The Rules

ClickHouse is a phenomenally fast and powerful database system. Whilst ClickHouse is not exactly an immutable database, it does not support the SQL update statement for updating the values of fields in one or more table rows.

You can, however, achieve the same effect by using the ALTER TABLE...UPDATE statement:

alter table trips update trip_id=trip_id + 3200000 where trip_id >0 

The where clause in the above statement may seem a little redundant, however, this statement will not work without a where clause

alter table trips update pickup_datetime = dateadd(d,93,pickup_datetime) 
    where trip_id >0  
 
alter table trips update dropoff_datetime = dateadd(d,93,dropoff_datetime) 
where trip_id >0 
 

Whilst ClickHouse is extremely powerful, it naturally has some limitations. One point to remember is that it is designed as a large-sale analytics database. It is not designed for transaction processing, and it does not have support for ACID operations. It is designed for querying large datasets - it is not designed for querying or updating individual records. The guideline is that it is optimised for queries returning 8k+ rows. Equally, it is optimised for batch operations rather than a series of smaller queries.


Querying in Grafana

Querying our data directly via the ClickHouse Client is great fun, however, we can also visualise our data in Grafana, using the Grafana ClickHouse data source. For this exercise, we are also going to be running our Grafana instance on a Docker container. There are two different flavours of Grafana available for Docker - Grafana OSS and Grafana Enterprise. The Grafana Enterprise edition is actually free to use and is actually the recommended edition - so that is the one we will use.

Creating A User Account

Later on, we will want to query our data. To allow Grafana to access our database we will create a read-only user account: We will first of all create a role:

Create Role DBReader;

Then grant it select permissions:

GRANT SELECT ON default.* TO DBReader;

Now create the user

CREATE USER grafanaUser IDENTIFIED WITH sha256_password BY 'YourStrongPassword'

Now assign the role to our user:

GRANT DBReader TO grafanaUser;

Running a local Grafana Instance

The first thing we are going to do is create a Docker volume so that we can persist our data when the container stops

docker volume create grafana-storage 

Then we will just check that it has been created properly by running the inspect command:

docker volume inspect grafana-storage 

Which will produce an output something like this:

Now we will run our container:

docker run -d -p 3000:3000 --name=grafana \  
  --volume grafana-storage:/var/lib/grafana \ 
  grafana/grafana-enterprise 

Next, we can navigate to localhost:3000 in our browser and log in. The default username and password are 'admin'. Obviously, you should replace these with secure values immediately.

Creating The Data Source

When the main screen loads, we will click on this button to create our data source:

A Data Sources page will now load. It will list the most common data sources - but this will not include ClickHouse. So, scroll down the page and click on this button:

You should now see the ClickHouse Data Source

Then on the next screen click on the Install button and then on the Add New Data Source button. This will load a page where we will configure the data source. Setting up the data source is actually a breeze. We just need to specify:
Name - this defaults to 'grafana-clickhouse-datasource'
IP address - this will obviously be the IP address of our ClickHouse container
Port - ClickHouse will be listening on port 9000
Username
Password

You can obtain the IP address of your container by running this command:

docker inspect <containername> 

This will produce a json output that will include the IP address. Once we have entered the details, we can just click on the Save & Test button:

Great - we can start to build our dashboard!

Creating the Grafana Dashboard

We are just going to use the SQL Editor to build a nice simple bar chart where we group trips by the payment_type field. The SQL for this is:

SELECT payment_type, count(trip_id) as total from trips group by payment_type

This is an aggregation on a dataset of 3 million rows, but with the power of ClickHouse your chart will be generated almost instantly:

This has just been a whistlestop tour to show how quick and easy it is to get ClickHouse up and running on a local machine. As well as being immensely powerful, ClickHouse can also be run on a Docker container on a relatively low spec machine. Naturally, if you are interested in learning more there are many great resources on the ClickHouse web site. There are also excellent, free instructor-led courses.

Appendix 1 - Create Table

                               
                                   CREATE OR REPLACE TABLE trips
(
    trip_id UInt32,
    vendor_id Enum8('1' = 1, '2' = 2, '3' = 3, '4' = 4, 'CMT' = 5, 'VTS' = 6, 'DDS' = 7, 'B02512' = 10, 'B02598' = 11, 'B02617' = 12, 'B02682' = 13, 'B02764' = 14, '' = 15),
    pickup_date Date,
    pickup_datetime DateTime,
    dropoff_date Date,
    dropoff_datetime DateTime,
    rate_code_id UInt8,
    pickup_longitude Float64,
    pickup_latitude Float64,
    dropoff_longitude Float64,
    dropoff_latitude Float64,
    passenger_count UInt8,
    trip_distance Float64,
    fare_amount Float32,
    extra Float32,
    mta_tax Float32,
    tip_amount Float32,
    tolls_amount Float32,
    ehail_fee Float32,
    improvement_surcharge Float32,
    total_amount Float32,
    payment_type Enum8('UNK' = 0, 'CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(pickup_date)
ORDER BY pickup_datetime;