Simultaneously inserting records into two tables with Postgres CTEs

SQL
postgres
Author

Thomas Sandmann

Published

February 25, 2023

tl;dr

Today I learned how to

  • Use Common Table Expressions (CTEs) to simultaneously insert data into two Postgres tables and
  • Use the RETURNING SQL command to retrieve automatically created fields inside the same statement.

Gene expression data hosted at the NCBI’s Short Read Archive (SRA) or at the European Nucleotide Archive (ENA) are a great resource. Both repositories represent information for different entities that make up a project, e.g. study, sample, experiment, run and analysis information.

Relationships between entities (source: ENA)

For example, ENA project PRJNA818657 is an RNA-seq study with data for 25 samples. For each sample, a single sequencing library (= experiment) was prepared and sequenced in two separate runs.

In other words, e.g.  sample SAMN26870486 produced experiment SRX14564817, which was then analyzed in run SRR18430942 and run SRR18430943.

One way to capture this information in a relational database is to set up three tables - one for each entity - and then use ENA’s unique sample-, experiment- and run-identifiers as natural primary keys.

But what if I don’t have suitable natural keys, or simply prefer to use surrogate keys?

Today, I learned how to

  1. INSERT a new record into a Postgres database,
  2. automatically generate a primary key,
  3. return the key and
  4. include it in a subsequent INSERT statement

Prerequisites

I am using a Postgres database called test, running on the local host and connect to it with the DBI R package, via the RPostgres::Postgres() driver.

Then I pass the returned PqConnection object to the following SQL code cells in this Quarto document.

library(DBI)
library(RPostgres)

con <- DBI::dbConnect(
  RPostgres::Postgres(), 
  dbname = "test", 
  host = "localhost")

Creating experiment and run tables

Initially, the database is empty, so let’s create two tables:

  • experiment: sample-level information
  • run: run-level information

Each table will include

  1. An auto-generated primary key (experiment_id and run_id, respectively)
  2. A field to record the record’s ENA accession
  3. A time-stamp

and the run table will reference its parent experiment via the experiment_id foreign key.

CREATE TABLE IF NOT EXISTS experiment (
  experiment_id SERIAL PRIMARY KEY,
  accession text UNIQUE,
  timestamp timestamp default current_timestamp not null
)
CREATE TABLE IF NOT EXISTS run (
    run_id SERIAL PRIMARY KEY,
    accession text UNIQUE,
    timestamp timestamp default current_timestamp not null,
    experiment_id integer,
    FOREIGN KEY(experiment_id) REFERENCES experiment(experiment_id)
)

Simultaneously nserting records into both tables

Next, we use a single SQL statement to insert both the experiment and its related runs:

WITH
exp AS (
  INSERT INTO experiment (accession) 
  VALUES ('SRX14564817') 
  RETURNING experiment_id
),
data(accession) AS (
  VALUES
  ('SRR18430942'),
  ('SRR18430943')
)
INSERT INTO run (experiment_id, accession)
SELECT e.experiment_id, d.accession
FROM exp e, data d

We verify that the experiment has been accessioned into the experiment table, and the same identifier has then be inserted into the run table as well

SELECT e.experiment_id, e.accession AS experiment_accession, 
       r.run_id, r.accession AS run_accession
FROM experiment e
INNER JOIN run r ON e.experiment_id = r.experiment_id
2 records
experiment_id experiment_accession run_id run_accession
1 SRX14564817 1 SRR18430942
1 SRX14564817 2 SRR18430943

Let’s examine the individual parts of this query:

  • The WITH command creates a Common Table Expression (CTE), e.g. 

    a temporary named result set, derived from a simple query and defined within the execution scope of a SELECT, INSERT, UPDATE, or DELETE statement.

    In this example, the exp temporary result is generated by the first INSERT statement, which updates the experiment table. It returns the automatically generated experiment_id via the RETURNING command. Let’s add another accession to the experiment table and examine the returned exp table:

  WITH
  exp AS (
    INSERT INTO experiment (accession) 
    VALUES ('another accession') 
    RETURNING experiment_id
  )
  SELECT * FROM exp
1 records
experiment_id
2

As expected, the experiment_id has been incremented for the next experiment.

  • Next, we provide the two run accessions by passing them as VALUES to the data table.
  WITH
  data(accession) AS (
    VALUES
    ('SRR18430942'),
    ('SRR18430943')
  )
  SELECT * FROM data
2 records
accession
SRR18430942
SRR18430943
  • Finally, the second INSERT statement adds the two runs to the run table, by retrieving the temporary values from both the exp and data result sets.

Because the CTE is a single SQL statement, it runs within a single transaction, e.g. it is committed only at the successful completion of the whole statement.

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.