library(DBI)
library(glue)
library(jsonlite) # to create example JSON files
library(paws.storage) # to copy our example JSON file on AWS S3
library(noctua) # to interface with AWS Athena
tl;dr
This week, I learned how to query JSON files stored on AWS S3 with the noctua R package, an API for the AWS Athena service.
Overview
Normally, duckdb is my tool of choice for parsing and querying large numbers of JSON files, and when the files are available on my local system, it makes easy work of this task.
But this week, I needed to process more than 20 thousand small JSON files stored on AWS S3. Instead of retrieving them first, I used the opportunity to learn about AWS Athena, a serverless query service that makes it easy to analyze data in Amazon S3 using standard SQL. (In other words, I am using a query engine that is located close to the data, instead of downloading the data to bring it closer to my local duckdb query engine.)
Athena supports CSV, JSON, or columnar data formats such as Apache Parquet and Apache ORC, and enables ad-hoc queries without the need to set up a database beforehand.
There are multiple ways to interact with AWS Athena from within R¹. Here, I am using the noctua R package, which leverages the paws R package under the hood.
Authentication with AWS
Please note that use of AWS services, including S3 and Athena, is not free and requires you to create an account first. Storing the small example data and running the Athena queries in this tutorial may be free if you haven’t exhausted the free tier of your AWS services yet. But please be aware of the potential cost of cloud computing with AWS.
The paws R package recognizes numerous ways of authenticating with AWS. For the following code to run, please ensure that you have provided one of them, e.g. by exporting your access key and secret key as environment variables, or by creating a credentials file.
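One of the simplest options for an interactive session is to export the credentials as environment variables before the paws or noctua clients are created. Here is a minimal sketch with placeholder values (any of the other mechanisms paws supports works just as well):
# provide AWS credentials for the current R session (placeholder values)
Sys.setenv(
  AWS_ACCESS_KEY_ID     = "YOUR_ACCESS_KEY_ID",
  AWS_SECRET_ACCESS_KEY = "YOUR_SECRET_ACCESS_KEY",
  AWS_REGION            = "us-west-2"
)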
In this example, I am using the same AWS S3 bucket to store the JSON files I want to query (under the example prefix), and to store files generated by Athena (under the athena prefix), but you can use any location on S3 you have permission to access.
kRegion <- "us-west-2"
kDataBucket <- "my-scratch-bucket"
kDataDir <- "s3://my-scratch-bucket/example/"
kStagingDir <- "s3://my-scratch-bucket/athena/"
Creating a set of example JSON files
First, let’s create a small set of JSON files, each containing a single record, by
- looping over the rows of the mtcars data.frame,
- writing each row to a JSON file in the temporary directory, and
- copying it to AWS S3.
svc <- paws.storage::s3(region = kRegion)           # <1>
data(mtcars)
mtcars$name <- row.names(mtcars)                     # <2>
for (n in seq.int(nrow(mtcars))) {
  # export the row to a JSON file
  temp_file <- tempfile(fileext = ".json")
  jsonlite::write_json(                              # <3>
    x = mtcars[n, , drop = TRUE],
    path = temp_file,
    pretty = FALSE,                                  # <4>
    auto_unbox = TRUE)
  # upload the JSON file to S3
  svc$put_object(                                    # <5>
    Body = temp_file,
    Bucket = kDataBucket,
    Key = paste0("example/", basename(temp_file))
  )
  unlink(temp_file)
}
1. Establish a connection to the AWS S3 service.
2. write_json does not export the row.names of the data.frame, so we store them in a regular column first.
3. Export each row of the mtcars data.frame into a separate JSON file.
4. Athena does not accept pretty JSON format. Instead, each JSON-encoded record must be represented on a separate line, as outlined in the Best practices for reading JSON data documentation page.
5. Copy the JSON file to AWS S3.
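As a quick sanity check, we can list the objects under the example prefix to confirm that all 32 files arrived on S3 (assuming nothing else is stored under that prefix):
# list the uploaded objects under the example/ prefix
uploaded <- svc$list_objects_v2(Bucket = kDataBucket, Prefix = "example/")
length(uploaded$Contents) # one JSON file per row of mtcars, i.e. 32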
Connecting to the AWS Athena service
Next, we establish a connection to the AWS Athena service, pointing it to our staging location on S3. The noctua package provides methods to connect to (dbConnect) and query (e.g. dbGetQuery) Athena, extending generic methods defined in the DBI R package.
con <- dbConnect(noctua::athena(), s3_staging_dir = kStagingDir)
dbGetQuery(con, "show databases")## INFO: (Data scanned: 0 Bytes)
## database_name
## <char>
## 1: default
If this is your first interaction with Athena, only the default database will be available.
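The examples below use the default database. If you would rather keep things separate, you could create a dedicated database first; a minimal sketch, with json_demo as a hypothetical name:
# optional: create a dedicated database instead of using 'default'
dbExecute(con, "CREATE DATABASE IF NOT EXISTS json_demo")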
Creating an external table with AWS Athena
Next, we point Athena to our JSON files by defining an external table with a schema that matches the column types of the original mtcars data.frame:
sql <- glue_sql(                                     # <1>
"CREATE EXTERNAL TABLE IF NOT EXISTS mtcars (
mpg float,
cyl tinyint,
disp float,
hp smallint,
drat float,
wt float,
qsec float,
vs tinyint,
am tinyint,
gear tinyint,
carb tinyint,
name string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION {kDataDir};", .con = con)
dbExecute(con, sql)
1. The glue_sql() command facilitates inserting user-defined variables into a SQL query.
## <AthenaResult>
## SQL CREATE EXTERNAL TABLE IF NOT EXISTS mtcars (
## mpg float,
## cyl tinyint,
## disp float,
## hp smallint,
## drat float,
## wt float,
## qsec float,
## vs tinyint,
## am tinyint,
## gear tinyint,
## carb tinyint,
## name string
## )
## ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
## LOCATION 's3://my-scratch-bucket/example/';
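As an aside, glue_sql() quotes interpolated values for the target connection: character values become SQL string literals (as seen in the LOCATION clause above), and backtick-wrapped names become quoted identifiers. A small illustration, using made-up values:
# character values are quoted as strings, {`...`} as identifiers
glue_sql("SELECT * FROM {`tbl`} WHERE gear = {n_gears}",
         tbl = "mtcars", n_gears = 5, .con = con)
# yields something like: <SQL> SELECT * FROM "mtcars" WHERE gear = 5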
The dbListTables command confirms that our default database now contains the mtcars table:
dbListTables(con)
## [1] "mtcars"
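We can also list the fields that were registered for the table, which should return the twelve column names defined in the schema above:
# list the columns Athena registered for the external table
DBI::dbListFields(con, "mtcars")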
Querying across all JSON files
Now we are ready to issue queries across our collection of JSON files, using standard SQL. For example, we can retrieve a subset of rows,
sql <- glue_sql('SELECT * FROM "mtcars" LIMIT 5', .con = con)
dbGetQuery(con, sql)
## INFO: (Data scanned: 2.15 KB)
## mpg cyl disp hp drat wt qsec vs am gear carb
## <num> <int> <num> <int> <num> <num> <num> <int> <int> <int> <int>
## 1: 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## 2: 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## 3: 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## 4: 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## 5: 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## name
## <char>
## 1: Pontiac Firebird
## 2: Valiant
## 3: Datsun 710
## 4: Merc 280
## 5: Merc 230
filter for specific values,
sql <- glue_sql('SELECT * FROM "mtcars" WHERE "gear" = 5;', .con = con)
dbGetQuery(con, sql)
## INFO: (Data scanned: 4.05 KB)
## mpg cyl disp hp drat wt qsec vs am gear carb
## <num> <int> <num> <int> <num> <num> <num> <int> <int> <int> <int>
## 1: 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
## 2: 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
## 3: 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
## 4: 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
## 5: 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
## name
## <char>
## 1: Ferrari Dino
## 2: Porsche 914-2
## 3: Ford Pantera L
## 4: Lotus Europa
## 5: Maserati Bora
or match strings:
sql <- glue_sql('SELECT * FROM "mtcars" WHERE "name" like \'Ferrari%\';',
.con = con)
dbGetQuery(con, sql)
## INFO: (Data scanned: 4.05 KB)
## mpg cyl disp hp drat wt qsec vs am gear carb
## <num> <int> <num> <int> <num> <num> <num> <int> <int> <int> <int>
## 1: 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
## name
## <char>
## 1: Ferrari Dino
We can also read the full table into our R session, reconstituting the contents of the original mtcars data.frame:
df <- DBI::dbReadTable(con, "mtcars")
head(df)
## INFO: (Data scanned: 4.05 KB)
## mpg cyl disp hp drat wt qsec vs am gear carb
## <num> <int> <num> <int> <num> <num> <num> <int> <int> <int> <int>
## 1: 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## 2: 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## 3: 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## 4: 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## 5: 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## 6: 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## name
## <char>
## 1: Merc 450SL
## 2: Merc 240D
## 3: Fiat X1-9
## 4: Volvo 142E
## 5: Mazda RX4 Wag
## 6: Lotus Europa
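Because the JSON files are read in arbitrary order, the rows come back shuffled. A quick sanity check that the table round-tripped completely could compare row counts and car names with the original data set:
# basic round-trip checks against the original data set
nrow(df) == nrow(mtcars)       # TRUE if all 32 records were read
setequal(df$name, mtcars$name) # TRUE if every car name is present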
Cleaning up
Once our analysis is complete, we disconnect from the service
dbDisconnect(con)
and, if we don’t want to query the same JSON files again in the future, we can also remove the table from the database:
dbRemoveTable(con, "mtcars", delete_data = FALSE)    # <1>
1. Setting the delete_data = FALSE argument removes the table from the Athena catalog, but leaves the JSON files on S3 in place.
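If the example files themselves are no longer needed, they can be deleted from S3 with the paws client as well. The sketch below assumes fewer than 1,000 objects under the prefix (list_objects_v2 paginates beyond that); the same approach works for the athena staging prefix:
# delete the example JSON files from S3
objs <- svc$list_objects_v2(Bucket = kDataBucket, Prefix = "example/")
keys <- lapply(objs$Contents, function(x) list(Key = x$Key))
if (length(keys) > 0) {
  svc$delete_objects(Bucket = kDataBucket, Delete = list(Objects = keys))
}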
Conclusions
- The noctua R package made it easy to interface with AWS Athena, because it allowed me to use the familiar DBI API implemented for many database back ends.
- Defining the schema for the example table was informed by examining the mtcars data set.
- Querying the collection of JSON files required Athena to read all of them. To reduce the amount of data that needs to be scanned, you might want to partition your data - e.g. split it by date, country, etc. - both speeding up queries and reducing cost (see the sketch after this list).
- The mtcars data set is highly structured and could easily be stored as a single table on AWS S3, e.g. in a CSV or Parquet file. The latter is highly optimized for columnar data storage and can be queried very efficiently - definitely something I will consider for large, structured data in the future.
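To make the partitioning idea concrete, here is a rough, hypothetical sketch (assuming an open Athena connection): it supposes the JSON files were uploaded under Hive-style prefixes such as s3://my-scratch-bucket/example_partitioned/country=US/, and it abbreviates the column list for brevity. Athena would then only scan the prefixes a query actually touches.
# hypothetical partitioned variant of the external table
dbExecute(con, "
  CREATE EXTERNAL TABLE IF NOT EXISTS mtcars_by_country (
    mpg float,
    cyl tinyint,
    name string
  )
  PARTITIONED BY (country string)
  ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
  LOCATION 's3://my-scratch-bucket/example_partitioned/';
")
# register the partitions that already exist on S3 before querying
dbExecute(con, "MSCK REPAIR TABLE mtcars_by_country")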
Reproducibility
Session Information
sessioninfo::session_info("attached")
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.3.2 (2023-10-31)
os Debian GNU/Linux 12 (bookworm)
system x86_64, linux-gnu
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/Los_Angeles
date 2024-04-17
pandoc 3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
! package * version date (UTC) lib source
P DBI * 1.1.3 2022-06-18 [?] CRAN (R 4.3.1)
P glue * 1.6.2 2022-02-24 [?] CRAN (R 4.3.1)
P jsonlite * 1.8.7 2023-06-29 [?] CRAN (R 4.3.1)
P noctua * 2.6.2 2023-08-08 [?] RSPM
P paws.storage * 0.5.0 2024-01-09 [?] RSPM
[1] /home/sandmann/repositories/blog/renv/library/R-4.3/x86_64-pc-linux-gnu
[2] /home/sandmann/.cache/R/renv/sandbox/R-4.3/x86_64-pc-linux-gnu/9a444a72
P ── Loaded and on-disk path mismatch.
──────────────────────────────────────────────────────────────────────────────

This work is licensed under a Creative Commons Attribution 4.0 International License.
Footnotes
¹ Dyfan Jones’s blog features great introductions on how to get started with the RAthena R package, which leverages the Python boto3 AWS API, or the noctua R package I am using in this post.