library(DBI)
library(RPostgres)
con <- DBI::dbConnect(
RPostgres::Postgres(),
dbname = "test",
host = "localhost")tl;dr
Today I learned how to
- Use Common Table Expressions (CTEs) to simultaneously insert data into two Postgres tables and
- Use the
RETURNINGSQL command to retrieve automatically created fields inside the same statement.
Gene expression data hosted at the NCBI’s Short Read Archive (SRA) or at the European Nucleotide Archive (ENA) are a great resource. Both repositories represent information for different entities that make up a project, e.g. study, sample, experiment, run and analysis information.

For example, ENA project PRJNA818657 is an RNA-seq study with data for 25 samples. For each sample, a single sequencing library (= experiment) was prepared and sequenced in two separate runs.
In other words, e.g. sample SAMN26870486 produced experiment SRX14564817, which was then analyzed in run SRR18430942 and run SRR18430943.
One way to capture this information in a relational database is to set up three tables - one for each entity - and then use ENA’s unique sample-, experiment- and run-identifiers as natural primary keys.
But what if I don’t have suitable natural keys, or simply prefer to use surrogate keys?
Today, I learned how to
INSERTa new record into a Postgres database,- automatically generate a primary key,
- return the key and
- include it in a subsequent
INSERTstatement
Prerequisites
I am using a Postgres database called test, running on the local host and connect to it with the DBI R package, via the RPostgres::Postgres() driver.
Then I pass the returned PqConnection object to the following SQL code cells in this Quarto document.
Creating experiment and run tables
Initially, the database is empty, so let’s create two tables:
experiment: sample-level informationrun: run-level information
Each table will include
- An auto-generated primary key (
experiment_idandrun_id, respectively) - A field to record the record’s ENA accession
- A time-stamp
and the run table will reference its parent experiment via the experiment_id foreign key.
CREATE TABLE IF NOT EXISTS experiment (
experiment_id SERIAL PRIMARY KEY,
accession text UNIQUE,
timestamp timestamp default current_timestamp not null
)CREATE TABLE IF NOT EXISTS run (
run_id SERIAL PRIMARY KEY,
accession text UNIQUE,
timestamp timestamp default current_timestamp not null,
experiment_id integer,
FOREIGN KEY(experiment_id) REFERENCES experiment(experiment_id)
)Simultaneously nserting records into both tables
Next, we use a single SQL statement to insert both the experiment and its related runs:
WITH
exp AS (
INSERT INTO experiment (accession)
VALUES ('SRX14564817')
RETURNING experiment_id
),
data(accession) AS (
VALUES
('SRR18430942'),
('SRR18430943')
)
INSERT INTO run (experiment_id, accession)
SELECT e.experiment_id, d.accession
FROM exp e, data dWe verify that the experiment has been accessioned into the experiment table, and the same identifier has then be inserted into the run table as well
SELECT e.experiment_id, e.accession AS experiment_accession,
r.run_id, r.accession AS run_accession
FROM experiment e
INNER JOIN run r ON e.experiment_id = r.experiment_id| experiment_id | experiment_accession | run_id | run_accession |
|---|---|---|---|
| 1 | SRX14564817 | 1 | SRR18430942 |
| 1 | SRX14564817 | 2 | SRR18430943 |
Let’s examine the individual parts of this query:
The WITH command creates a Common Table Expression (CTE), e.g.
a temporary named result set, derived from a simple query and defined within the execution scope of a SELECT, INSERT, UPDATE, or DELETE statement.
In this example, the
exptemporary result is generated by the firstINSERTstatement, which updates theexperimenttable. It returns the automatically generatedexperiment_idvia theRETURNINGcommand. Let’s add another accession to theexperimenttable and examine the returnedexptable:
WITH
exp AS (
INSERT INTO experiment (accession)
VALUES ('another accession')
RETURNING experiment_id
)
SELECT * FROM exp| experiment_id |
|---|
| 2 |
As expected, the experiment_id has been incremented for the next experiment.
- Next, we provide the two run accessions by passing them as
VALUESto thedatatable.
WITH
data(accession) AS (
VALUES
('SRR18430942'),
('SRR18430943')
)
SELECT * FROM data| accession |
|---|
| SRR18430942 |
| SRR18430943 |
- Finally, the second
INSERTstatement adds the two runs to theruntable, by retrieving the temporary values from both theexpanddataresult sets.
Because the CTE is a single SQL statement, it runs within a single transaction, e.g. it is committed only at the successful completion of the whole statement.

This work is licensed under a Creative Commons Attribution 4.0 International License.