NYPD Complaint Data
Tab separated value, or TSV, files are common and may include field headings as the first line of the file. ClickHouse can ingest TSVs, and can also query TSVs without ingesting the files. This guide covers both of these cases. If you need to query or ingest CSV files, the same techniques work; simply substitute `TSV` with `CSV` in your format arguments.
While working through this guide you will:
- Investigate: Query the structure and content of the TSV file.
- Determine the target ClickHouse schema: Choose proper data types and map the existing data to those types.
- Create a ClickHouse table.
- Preprocess and stream the data to ClickHouse.
- Run some queries against ClickHouse.
The dataset used in this guide comes from the NYC Open Data team, and contains data about "all valid felony, misdemeanor, and violation crimes reported to the New York City Police Department (NYPD)". At the time of writing, the data file is 166MB, but it is updated regularly.
Source: data.cityofnewyork.us
Terms of use: https://www1.nyc.gov/home/terms-of-use.page
Prerequisites
- Download the dataset by visiting the NYPD Complaint Data Current (Year To Date) page, clicking the Export button, and choosing TSV for Excel.
- Install ClickHouse server and client.
- Launch ClickHouse server, and connect with `clickhouse-client`
A note about the commands described in this guide
There are two types of commands in this guide:
- Some of the commands query the TSV files; these are run at the command prompt.
- The rest of the commands query ClickHouse, and these are run in `clickhouse-client` or the Play UI.
The examples in this guide assume that you have saved the TSV file to `${HOME}/NYPD_Complaint_Data_Current__Year_To_Date_.tsv`; please adjust the commands if needed.
Familiarize yourself with the TSV file
Before starting to work with the ClickHouse database, familiarize yourself with the data.
Look at the fields in the source TSV file
This is an example of a command to query a TSV file, but don't run it yet.
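A sketch of such a command, using `clickhouse-local` to `describe` the file through the `file()` table function (adjust the path if you saved the file elsewhere):

```bash
clickhouse-local --query \
"describe file('${HOME}/NYPD_Complaint_Data_Current__Year_To_Date_.tsv', 'TSVWithNames')"
```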
Sample response
Most of the time the above command will let you know which fields in the input data are numeric, which are strings, and which are tuples. This is not always the case. Because ClickHouse is routinely used with datasets containing billions of records, only a default number of rows (100) is examined to infer the schema, in order to avoid parsing billions of rows. The response below may not match what you see, as the dataset is updated several times each year. Looking at the Data Dictionary you can see that `CMPLNT_NUM` is specified as text, and not numeric. By overriding the default of 100 rows for inference with the setting `SETTINGS input_format_max_rows_to_read_for_schema_inference=2000` you can get a better idea of the content.
Note: as of version 22.5 the default is now 25,000 rows for inferring the schema, so only change the setting if you are on an older version or if you need more than 25,000 rows to be sampled.
Run this command at your command prompt. You will be using `clickhouse-local` to query the data in the TSV file you downloaded.
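For example, a sketch of the same `describe` run with the higher row limit for schema inference:

```bash
clickhouse-local --query \
"describe file('${HOME}/NYPD_Complaint_Data_Current__Year_To_Date_.tsv', 'TSVWithNames')
SETTINGS input_format_max_rows_to_read_for_schema_inference=2000"
```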
Result:
At this point you should check that the columns in the TSV file match the names and types specified in the Columns in this Dataset section of the dataset web page. The data types are not very specific: all numeric fields are set to `Nullable(Float64)`, and all other fields are `Nullable(String)`. When you create a ClickHouse table to store the data you can specify more appropriate and performant types.
Determine the proper schema
In order to figure out what types should be used for the fields it is necessary to know what the data looks like. For example, the field `JURISDICTION_CODE` is numeric: should it be a `UInt8`, or an `Enum`, or is `Float64` appropriate?
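One way to decide is to look at the range of values in the field. A sketch of such a query:

```bash
clickhouse-local --query \
"select min(JURISDICTION_CODE), max(JURISDICTION_CODE)
 from file('${HOME}/NYPD_Complaint_Data_Current__Year_To_Date_.tsv', 'TSVWithNames')
 SETTINGS input_format_max_rows_to_read_for_schema_inference=2000"
```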
Result:
The query response shows that `JURISDICTION_CODE` fits well in a `UInt8`.
Similarly, look at some of the `String` fields and see if they are well suited to being `DateTime` or `LowCardinality(String)` fields.

For example, the field `PARKS_NM` is described as "Name of NYC park, playground or greenspace of occurrence, if applicable (state parks are not included)". The names of parks in New York City may be a good candidate for a `LowCardinality(String)`:
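A sketch of a query that counts the distinct values in the field:

```bash
clickhouse-local --query \
"select uniqExact(PARKS_NM)
 from file('${HOME}/NYPD_Complaint_Data_Current__Year_To_Date_.tsv', 'TSVWithNames')
 SETTINGS input_format_max_rows_to_read_for_schema_inference=2000"
```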
Result:
Have a look at some of the park names:
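For example (a sketch; the limit of 10 is arbitrary):

```bash
clickhouse-local --query \
"select distinct PARKS_NM
 from file('${HOME}/NYPD_Complaint_Data_Current__Year_To_Date_.tsv', 'TSVWithNames')
 limit 10"
```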
Result:
The dataset in use at the time of writing has only a few hundred distinct parks and playgrounds in the `PARKS_NM` column. This is a small number based on the LowCardinality recommendation to stay below 10,000 distinct strings in a `LowCardinality(String)` field.
DateTime fields
Based on the Columns in this Dataset section of the dataset web page, there are date and time fields for the start and end of the reported event. Looking at the min and max of the `CMPLNT_FR_DT` and `CMPLNT_TO_DT` fields gives an idea of whether or not the fields are always populated:
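A sketch of such a check (the same pattern applies to the `CMPLNT_FR_TM` and `CMPLNT_TO_TM` time fields):

```bash
clickhouse-local --query \
"select min(CMPLNT_FR_DT), max(CMPLNT_FR_DT), min(CMPLNT_TO_DT), max(CMPLNT_TO_DT)
 from file('${HOME}/NYPD_Complaint_Data_Current__Year_To_Date_.tsv', 'TSVWithNames')
 SETTINGS input_format_max_rows_to_read_for_schema_inference=2000"
```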
Result:
Make a plan
Based on the above investigation:
- `JURISDICTION_CODE` should be cast as `UInt8`
- `PARKS_NM` should be cast to `LowCardinality(String)`
- `CMPLNT_FR_DT` and `CMPLNT_FR_TM` are always populated (possibly with a default time of `00:00:00`)
- `CMPLNT_TO_DT` and `CMPLNT_TO_TM` may be empty
- Dates and times are stored in separate fields in the source
- Dates are in `mm/dd/yyyy` format
- Times are in `hh:mm:ss` format
- Dates and times can be concatenated into DateTime types
- There are some dates before January 1st 1970, which means we need a 64-bit DateTime
There are many more changes to be made to the types; they can all be determined by following the same investigation steps. Look at the number of distinct strings in a field and the min and max of the numerics, and make your decisions. The table schema given later in the guide has many low-cardinality strings and unsigned integer fields, and very few floating point numerics.
Concatenate the date and time fields
To concatenate the date and time fields `CMPLNT_FR_DT` and `CMPLNT_FR_TM` into a single `String` that can be cast to a `DateTime`, select the two fields joined by the concatenation operator: `CMPLNT_FR_DT || ' ' || CMPLNT_FR_TM`. The `CMPLNT_TO_DT` and `CMPLNT_TO_TM` fields are handled similarly.
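A sketch of such a query:

```bash
clickhouse-local --query \
"select CMPLNT_FR_DT || ' ' || CMPLNT_FR_TM AS complaint_begin
 from file('${HOME}/NYPD_Complaint_Data_Current__Year_To_Date_.tsv', 'TSVWithNames')
 limit 10"
```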
Result:
Convert the date and time String to a DateTime64 type
Earlier in the guide we discovered that there are dates in the TSV file before January 1st 1970, which means that we need a 64-bit DateTime type for the dates. The dates also need to be converted from `MM/DD/YYYY` to `YYYY/MM/DD` format. Both of these can be done with `parseDateTime64BestEffort()`.
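A sketch of such a query, run with `clickhouse-local` against the downloaded file (the `order by` and `limit` are only there to surface the earliest dates):

```bash
clickhouse-local --query \
"WITH (CMPLNT_FR_DT || ' ' || CMPLNT_FR_TM) AS CMPLNT_START,
      (CMPLNT_TO_DT || ' ' || CMPLNT_TO_TM) AS CMPLNT_END
 select parseDateTime64BestEffort(CMPLNT_START) AS complaint_begin,
        parseDateTime64BestEffortOrNull(CMPLNT_END) AS complaint_end
 from file('${HOME}/NYPD_Complaint_Data_Current__Year_To_Date_.tsv', 'TSVWithNames')
 order by complaint_begin asc
 limit 25"
```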
Lines 2 and 3 above contain the concatenation from the previous step, and lines 4 and 5 above parse the strings into `DateTime64`. As the complaint end time is not guaranteed to exist, `parseDateTime64BestEffortOrNull` is used.
Result:
The dates shown as `1925` above are from errors in the data. There are several records in the original data with dates in the years `1019`-`1022` that should be `2019`-`2022`. They are being stored as Jan 1st 1925 as that is the earliest date supported by a 64-bit DateTime.
Create a table
The decisions made above on the data types used for the columns are reflected in the table schema below. We also need to decide on the `ORDER BY` and `PRIMARY KEY` used for the table. At least one of `ORDER BY` or `PRIMARY KEY` must be specified. Here are some guidelines on deciding on the columns to include in `ORDER BY`; more information is in the Next Steps section at the end of this document.
Order By and Primary Key clauses
- The `ORDER BY` tuple should include fields that are used in query filters
- To maximize compression on disk the `ORDER BY` tuple should be ordered by ascending cardinality
- If it exists, the `PRIMARY KEY` tuple must be a subset of the `ORDER BY` tuple
- If only `ORDER BY` is specified, then the same tuple will be used as `PRIMARY KEY`
- The primary key index is created using the `PRIMARY KEY` tuple if specified, otherwise the `ORDER BY` tuple
- The `PRIMARY KEY` index is kept in main memory
Looking at the dataset and the questions that might be answered by querying it, we might decide that we would look at the types of crimes reported over time in the five boroughs of New York City. These fields might then be included in the `ORDER BY`:
| Column | Description (from the data dictionary) |
|---|---|
| OFNS_DESC | Description of offense corresponding with key code |
| RPT_DT | Date event was reported to police |
| BORO_NM | The name of the borough in which the incident occurred |
Querying the TSV file for the cardinality of the three candidate columns:
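A sketch of such a query, using `uniqExact` to count the distinct values of each column:

```bash
clickhouse-local --query \
"select uniqExact(OFNS_DESC), uniqExact(RPT_DT), uniqExact(BORO_NM)
 from file('${HOME}/NYPD_Complaint_Data_Current__Year_To_Date_.tsv', 'TSVWithNames')
 SETTINGS input_format_max_rows_to_read_for_schema_inference=2000"
```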
Result:
Ordering by cardinality, the `ORDER BY` becomes:
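Assuming the borough has the lowest cardinality, followed by the offense description, and then the report date:

```sql
ORDER BY ( BORO_NM, OFNS_DESC, RPT_DT )
```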
The table below will use more easily read column names; the above names will be mapped to new ones.
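For example, with illustrative snake_case names (an assumption used in the rest of this guide's examples; choose whatever mapping you prefer):

```sql
ORDER BY ( borough, offense_description, date_reported )
```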
Putting together the changes to data types and the `ORDER BY` tuple gives this table structure:
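An abridged sketch of such a table. It keeps only the columns discussed above, uses the illustrative renamed columns, and assumes the `America/New_York` time zone; extend it with a column, and an appropriate type, for each TSV field you want to keep:

```sql
CREATE TABLE NYPD_Complaint (
    complaint_number     String,
    borough              LowCardinality(String),
    complaint_begin      DateTime64(0, 'America/New_York'),
    complaint_end        DateTime64(0, 'America/New_York'),
    jurisdiction_code    UInt8,
    offense_description  LowCardinality(String),
    park_name            LowCardinality(String),
    date_reported        Date
    -- ...one column per remaining TSV field, typed per the investigation above
) ENGINE = MergeTree
  ORDER BY ( borough, offense_description, date_reported );
```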
Finding the primary key of a table
The ClickHouse `system` database, specifically `system.tables`, has all of the information about the table you just created. This query shows the `ORDER BY` (sorting key) and the `PRIMARY KEY`:
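A sketch of such a query, run in `clickhouse-client`:

```sql
SELECT
    partition_key,
    sorting_key,
    primary_key,
    table
FROM system.tables
WHERE table = 'NYPD_Complaint'
FORMAT Vertical
```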
Response
Preprocess and Import Data
We will use the `clickhouse-local` tool for data preprocessing and `clickhouse-client` to upload it.
`clickhouse-local` arguments used
`table='input'` appears in the arguments to `clickhouse-local` below. `clickhouse-local` takes the provided input (`cat ${HOME}/NYPD_Complaint_Data_Current__Year_To_Date_.tsv`) and inserts the input into a table. By default the table is named `table`. In this guide the name of the table is set to `input` to make the data flow clearer. The final argument to `clickhouse-local` is a query that selects from the table (`FROM input`), which is then piped to `clickhouse-client` to populate the table `NYPD_Complaint`.
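A sketch of the pipeline, abridged to match the abridged table above (the snake_case aliases are the illustrative names from the `CREATE TABLE` sketch; the `ifNull` wrappers are one way to handle the `Nullable` columns produced by schema inference):

```bash
cat ${HOME}/NYPD_Complaint_Data_Current__Year_To_Date_.tsv \
  | clickhouse-local --table='input' --input-format='TSVWithNames' \
      --input_format_max_rows_to_read_for_schema_inference=2000 \
      --query "
        SELECT
          ifNull(toString(CMPLNT_NUM), '')                                         AS complaint_number,
          ifNull(BORO_NM, '')                                                      AS borough,
          parseDateTime64BestEffort(CMPLNT_FR_DT || ' ' || CMPLNT_FR_TM, 0)        AS complaint_begin,
          parseDateTime64BestEffortOrNull(CMPLNT_TO_DT || ' ' || CMPLNT_TO_TM, 0)  AS complaint_end,
          toUInt8(ifNull(JURISDICTION_CODE, 0))                                    AS jurisdiction_code,
          ifNull(OFNS_DESC, '')                                                    AS offense_description,
          ifNull(PARKS_NM, '')                                                     AS park_name,
          toDate(parseDateTimeBestEffort(RPT_DT))                                  AS date_reported
        FROM input" \
  | clickhouse-client --query='INSERT INTO NYPD_Complaint FORMAT TSV'
```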
Validate the Data
The dataset changes once or more per year; your counts may not match what is in this document.
Query:
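For example, count the rows in the new table (a sketch; the count will differ as the dataset is updated):

```sql
SELECT count()
FROM NYPD_Complaint
```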
Result:
The size of the dataset in ClickHouse is just 12% of the original TSV file. Compare the size of the original TSV file with the size of the table:
Query:
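A sketch using `system.tables`, where `total_bytes` reports the on-disk size of the table; compare it with the output of `ls -lh` on the TSV file:

```sql
SELECT formatReadableSize(total_bytes) AS table_size
FROM system.tables
WHERE name = 'NYPD_Complaint'
```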
Result:
Run Some Queries
Query 1. Compare the number of complaints by month
Query:
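A sketch, grouping on the illustrative `date_reported` column from the table sketch above:

```sql
SELECT
    count(),
    toYYYYMM(date_reported) AS month
FROM NYPD_Complaint
GROUP BY month
ORDER BY month
```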
Result:
Query 2. Compare total number of complaints by Borough
Query:
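A sketch, again using the illustrative `borough` column name:

```sql
SELECT
    count(),
    borough
FROM NYPD_Complaint
GROUP BY borough
ORDER BY count() DESC
```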
Result:
Next Steps
A Practical Introduction to Sparse Primary Indexes in ClickHouse discusses the differences in ClickHouse indexing compared to traditional relational databases, how ClickHouse builds and uses a sparse primary index, and indexing best practices.