Getting started with chDB
In this guide, we're going to get up and running with the Python variant of chDB. We'll start by querying a JSON file on S3, before creating a table in chDB based on the JSON file, and doing some queries on the data. We'll also see how to have queries return data in different formats, including Apache Arrow and Panda, and finally we'll learn how to query Pandas DataFrames.
Setup
Let's first create a virtual environment:
And now we'll install chDB. Make sure you have version 2.0.3 or higher:
And now we're going to install ipython:
We're going to use ipython
to run the commands in the rest of the guide, which you can launch by running:
We'll also be using Pandas and Apache Arrow in this guide, so let's install those libraries too:
Querying a JSON file in S3
Let's now have a look at how to query a JSON file that's stored in an S3 bucket. The YouTube dislikes dataset contains more than 4 billion rows of dislikes on YouTube videos up to 2021. We're going to work with one of the JSON files from that dataset.
Import chdb:
We can write the following query to describe the structure of one of the JSON files:
We can also count the number of rows in that file:
This file contains just over 300,000 records.
chdb doesn't yet support passing in query parameters, but we can pull out the path and pass it in via an f-String.
This is fine to do with variables defined in your program, but don't do it with user-provided input, otherwise your query is open to SQL injection.
Configuring the output format
The default output format is CSV
, but we can change that via the output_format
parameter.
chDB supports the ClickHouse data formats, as well as some of its own, including DataFrame
, which returns a Pandas DataFrame:
Or if we want to get back an Apache Arrow table:
Creating a table from JSON file
Next, let's have a look at how to create a table in chDB. We need to use a different API to do that, so let's first import that:
Next, we'll initialize a session. If we want the session to be persisted to disk, we need to provide a directory name. If we leave it blank, the database will be in-memory and lost as soon as we kill the Python process.
Next, we'll create a database:
Now we can create a dislikes
table based on the schema from the JSON file, using the CREATE...EMPTY AS
technique.
We'll use the schema_inference_make_columns_nullable
setting so that column types aren't all made Nullable
.
We can then use the DESCRIBE
clause to inspect the schema:
Next, let's populate that table:
We could also do both these steps in one go, using the CREATE...AS
technique.
Let's create a different table using that technique:
Querying a table
Finally, let's query the table:
Let's say we then add an extra column to the DataFrame to compute the ratio of likes to dislikes. We could write the following code:
Querying a Pandas DataFrame
We can then query that DataFrame from chDB:
You can also read more about querying Pandas DataFrames in the Querying Pandas developer guide.
Next steps
Hopefully, this guide has given you a good overview of chDB. To learn more about how to use it, see the following developer guides: