Querying parquet files with duckdb – Thomas Sandmann’s blog

tl;dr

Today I learned how to access and query CSV and parquet files with duckdb, using either the duckdb command line interface or the eponymous R package

Motivation

duckdb is a relational (table-oriented) database management system (RDMS) contained in a single executable. It excels at processing tabular datasets, e.g. from CSV or Parquet files, from local file systems or remote sources.

Apache Parquet is > an open source, column-oriented data file format designed for efficient data storage and retrieval.

Here, I am highlighting how to use duckdb to query remote parquet files without the need for retrieving the full dataset first. And that’s just one of the many functionalities offered by duckdb, truly a swiss army knife in the data science toolkit!

D-M Commons, CC BY-SA 3.0, via Wikimedia Commons

Prequisites

I installed the duckdb executable on my Mac OS system with homebrew:

brew install duckdb
duckdb --version

Getting started

By default, duckdb will create database in memory. Like other RMDS, it supports a core set of SQL statements and expressions. In addition, extensions provide additional functionality, e.g. connecting to Postgres databases or supporting JSON data.

Commands can either be entered interactively, provided via the -c argument or in a text file. To access remote files, we first need to install the httpsfs` extension that allows reading remote/writing remote files ¹.

duckdb -c "INSTALL httpfs"

To get started, we read a small dataset from a CSV file hosted publicly on a webserver. For brevity, we store this URL in the environmental variable REMOTE_FILE:

REMOTE_FILE=https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv

duckdb -c "SELECT species, island, sex, bill_length_mm, bill_depth_mm \
           FROM '$REMOTE_FILE' LIMIT 5;"

┌─────────┬───────────┬─────────┬────────────────┬───────────────┐
│ species │  island   │   sex   │ bill_length_mm │ bill_depth_mm │
│ varchar │  varchar  │ varchar │     double     │    double     │
├─────────┼───────────┼─────────┼────────────────┼───────────────┤
│ Adelie  │ Torgersen │ MALE    │           39.1 │          18.7 │
│ Adelie  │ Torgersen │ FEMALE  │           39.5 │          17.4 │
│ Adelie  │ Torgersen │ FEMALE  │           40.3 │          18.0 │
│ Adelie  │ Torgersen │         │                │               │
│ Adelie  │ Torgersen │ FEMALE  │           36.7 │          19.3 │
└─────────┴───────────┴─────────┴────────────────┴───────────────┘

By default, duckdb will use a temporary, in-memory database. To open or create a persistent database, simply include a path as a command line argument, e.g. duckdb path/to/my_database.duckdb

For example, the following command will download the remote CSV file and import it into a duckdb database and store it in the penguins.duckdb file.

duckdb \
  -c "CREATE TABLE penguins AS SELECT * FROM '${REMOTE_FILE}';" \
  penguins.duckdb

Now, we can query the local file with duckdb or explore it interactive with the tad viewer ²

duckdb \
  -c "SELECT * from penguins WHERE sex = 'MALE' LIMIT 5;" \
  penguins.duckdb

┌─────────┬───────────┬────────────────┬───────────────┬───────────────────┬─────────────┬─────────┐
│ species │  island   │ bill_length_mm │ bill_depth_mm │ flipper_length_mm │ body_mass_g │   sex   │
│ varchar │  varchar  │     double     │    double     │       int64       │    int64    │ varchar │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼─────────┤
│ Adelie  │ Torgersen │           39.1 │          18.7 │               181 │        3750 │ MALE    │
│ Adelie  │ Torgersen │           39.3 │          20.6 │               190 │        3650 │ MALE    │
│ Adelie  │ Torgersen │           39.2 │          19.6 │               195 │        4675 │ MALE    │
│ Adelie  │ Torgersen │           38.6 │          21.2 │               191 │        3800 │ MALE    │
│ Adelie  │ Torgersen │           34.6 │          21.1 │               198 │        4400 │ MALE    │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴─────────┘

Querying remote parquet files

The NYC Taxi & Limousine Commission has collected data on public NYC taxi and for-hire vehicle (Uber, Lyft) trips, going all the way back to 2009. The data is shared in the form of parquet files, and one parquet file is created for each month of data.

Here, I will use the Yellow Taxi Trip records from January and February 2023 as examples. Let’s store the URLs pointing to the respective parquet files in environmental variables.

PARQUET_FILE1="https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet"
PARQUET_FILE2="https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet"

Each parquet file stores a single table of data. To get an overview of the available information, we ask duckdb to DESCRIBE it:

duckdb -c "DESCRIBE SELECT * FROM '$PARQUET_FILE1'";

┌───────────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│      column_name      │ column_type │  null   │   key   │ default │  extra  │
│        varchar        │   varchar   │ varchar │ varchar │ varchar │ varchar │
├───────────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ VendorID              │ BIGINT      │ YES     │         │         │         │
│ tpep_pickup_datetime  │ TIMESTAMP   │ YES     │         │         │         │
│ tpep_dropoff_datetime │ TIMESTAMP   │ YES     │         │         │         │
│ passenger_count       │ DOUBLE      │ YES     │         │         │         │
│ trip_distance         │ DOUBLE      │ YES     │         │         │         │
│ RatecodeID            │ DOUBLE      │ YES     │         │         │         │
│ store_and_fwd_flag    │ VARCHAR     │ YES     │         │         │         │
│ PULocationID          │ BIGINT      │ YES     │         │         │         │
│ DOLocationID          │ BIGINT      │ YES     │         │         │         │
│ payment_type          │ BIGINT      │ YES     │         │         │         │
│ fare_amount           │ DOUBLE      │ YES     │         │         │         │
│ extra                 │ DOUBLE      │ YES     │         │         │         │
│ mta_tax               │ DOUBLE      │ YES     │         │         │         │
│ tip_amount            │ DOUBLE      │ YES     │         │         │         │
│ tolls_amount          │ DOUBLE      │ YES     │         │         │         │
│ improvement_surcharge │ DOUBLE      │ YES     │         │         │         │
│ total_amount          │ DOUBLE      │ YES     │         │         │         │
│ congestion_surcharge  │ DOUBLE      │ YES     │         │         │         │
│ airport_fee           │ DOUBLE      │ YES     │         │         │         │
├───────────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤
│ 19 rows                                                           6 columns │
└─────────────────────────────────────────────────────────────────────────────┘

A detailed description of the columns and their values is available in the metadata dictionary. For example, the payment_type field contains “A numeric code signifying how the passenger paid for the trip.” with the following encoding:

1: Credit card
2: Cash
3: No charge
4: Dispute
5: Unknown
6: Voided trip

In January, more than three million trips were recorded, but a query to return the total number of records executes almost instantaneously - because we don’t need to download the (very large) file first:

duckdb -c "SELECT count(*) FROM '$PARQUET_FILE1'";

┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│      3066766 │
└──────────────┘

The vast majority of trips was paid for by credit card (payment type 1), and a small subset of trips was performed free of charge (payment type 3).

duckdb -c "SELECT payment_type, count(payment_type) \
           FROM '$PARQUET_FILE1' \
           GROUP BY payment_type LIMIT 5";

┌──────────────┬─────────────────────┐
│ payment_type │ count(payment_type) │
│    int64     │        int64        │
├──────────────┼─────────────────────┤
│            0 │               71743 │
│            1 │             2411462 │
│            2 │              532241 │
│            3 │               18023 │
│            4 │               33297 │
└──────────────┴─────────────────────┘

We can also query across multiple parquet files, e.g. retrieving the total number of trips for both January and February 2023:

duckdb -c "SELECT count(*) FROM \
           read_parquet(['$PARQUET_FILE1', '$PARQUET_FILE2'])";

┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│      5980721 │
└──────────────┘

We can also copy the output of a query into a new, local parquet file. For example, the following query will copy records for 100 trips that were performed free of charge into a new free_trips.parquet parquet file in the current working directory:

duckdb -c \
  "COPY (SELECT * FROM '$PARQUET_FILE1' \
         WHERE payment_type = 3 LIMIT 100) TO 'free_trips.parquet' \
  (FORMAT 'parquet');"

We can now query the local parquet file to drill deeper into this data slice:

duckdb -c "SELECT payment_type, count(payment_type) \
           FROM 'free_trips.parquet' \
           GROUP BY payment_type";

┌──────────────┬─────────────────────┐
│ payment_type │ count(payment_type) │
│    int64     │        int64        │
├──────────────┼─────────────────────┤
│            3 │                 100 │
└──────────────┴─────────────────────┘

APIs

In addition to using the duckdb command line interface (CLI), you can also use a library for your favorite programming language. For example, the duckdb R package provides a DBI interface that enables queries from within an R session. (The duckdb python module provides similar functionality.)

if (!requireNamespace("duckdb", quietly = TRUE)) {
  install.packages("duckdb")
}
suppressPackageStartupMessages(library("duckdb"))
suppressPackageStartupMessages(library("DBI"))

con <- dbConnect(duckdb::duckdb(), dbdir = ":memory:")
dbExecute(conn = con, "INSTALL httpfs")

[1] 0

For example, we can use an in-memory duckdb instance to query the one (or more) of the remote parquet files we examined above:

PARQUET_FILE1 = paste0("https://d37ci6vzurychx.cloudfront.net/",
                       "trip-data/yellow_tripdata_2023-01.parquet")

sql <- "SELECT payment_type, count(payment_type) \
        FROM read_parquet([?]) \
        GROUP BY payment_type LIMIT 5";
dbGetQuery(con, sql, list(PARQUET_FILE1))

  payment_type count(payment_type)
1            0               71743
2            1             2411462
3            2              532241
4            3               18023
5            4               33297

Alternatively, we can also access data (including CSV and parquet files) using dbplyr and dplyr

suppressPackageStartupMessages(library(dbplyr))
suppressPackageStartupMessages(library(dplyr))

tbl(con, PARQUET_FILE1) |>
  group_by(payment_type) |>
  count() |>
  collect()

# A tibble: 5 × 2
# Groups:   payment_type [5]
  payment_type       n
         <dbl>   <dbl>
1            0   71743
2            1 2411462
3            2  532241
4            3   18023
5            4   33297

Don’t forget to disconnect from your duckdb database at the end of your R session!

dbDisconnect(con, shutdown=TRUE)

This work is licensed under a Creative Commons Attribution 4.0 International License.

Footnotes

Additional options to parse / import CSV files is available in duckdb’s documentation ↩︎
The tad viewer is a free tool to view CSV, Parquet, and SQLite and DuckDb database files↩︎