library(DBI)
library(RPostgres)
<- DBI::dbConnect(
con ::Postgres(),
RPostgresdbname = "test",
host = "localhost")
tl;dr
Today I learned how to
- Use Common Table Expressions (CTEs) to simultaneously insert data into two Postgres tables and
- Use the
RETURNING
SQL command to retrieve automatically created fields inside the same statement.
Gene expression data hosted at the NCBI’s Short Read Archive (SRA) or at the European Nucleotide Archive (ENA) are a great resource. Both repositories represent information for different entities that make up a project, e.g. study
, sample
, experiment
, run
and analysis
information.
For example, ENA project PRJNA818657 is an RNA-seq study with data for 25 samples. For each sample, a single sequencing library (= experiment
) was prepared and sequenced in two separate runs.
In other words, e.g. sample SAMN26870486 produced experiment SRX14564817, which was then analyzed in run SRR18430942 and run SRR18430943.
One way to capture this information in a relational database is to set up three tables - one for each entity - and then use ENA’s unique sample-, experiment- and run-identifiers as natural primary keys.
But what if I don’t have suitable natural keys, or simply prefer to use surrogate keys?
Today, I learned how to
INSERT
a new record into a Postgres database,- automatically generate a primary key,
- return the key and
- include it in a subsequent
INSERT
statement
Prerequisites
I am using a Postgres database called test
, running on the local host and connect to it with the DBI R package, via the RPostgres::Postgres() driver.
Then I pass the returned PqConnection
object to the following SQL code cells in this Quarto document.
Creating experiment
and run
tables
Initially, the database is empty, so let’s create two tables:
experiment
: sample-level informationrun
: run-level information
Each table will include
- An auto-generated primary key (
experiment_id
andrun_id
, respectively) - A field to record the record’s ENA accession
- A time-stamp
and the run
table will reference its parent experiment
via the experiment_id
foreign key.
CREATE TABLE IF NOT EXISTS experiment (
PRIMARY KEY,
experiment_id SERIAL UNIQUE,
accession text timestamp timestamp default current_timestamp not null
)
CREATE TABLE IF NOT EXISTS run (
PRIMARY KEY,
run_id SERIAL UNIQUE,
accession text timestamp timestamp default current_timestamp not null,
integer,
experiment_id FOREIGN KEY(experiment_id) REFERENCES experiment(experiment_id)
)
Simultaneously nserting records into both tables
Next, we use a single SQL statement to insert both the experiment and its related runs:
WITH
exp AS (
INSERT INTO experiment (accession)
VALUES ('SRX14564817')
RETURNING experiment_id
),data(accession) AS (
VALUES
'SRR18430942'),
('SRR18430943')
(
)INSERT INTO run (experiment_id, accession)
SELECT e.experiment_id, d.accession
FROM exp e, data d
We verify that the experiment has been accessioned into the experiment
table, and the same identifier has then be inserted into the run
table as well
SELECT e.experiment_id, e.accession AS experiment_accession,
AS run_accession
r.run_id, r.accession FROM experiment e
INNER JOIN run r ON e.experiment_id = r.experiment_id
experiment_id | experiment_accession | run_id | run_accession |
---|---|---|---|
1 | SRX14564817 | 1 | SRR18430942 |
1 | SRX14564817 | 2 | SRR18430943 |
Let’s examine the individual parts of this query:
The WITH command creates a Common Table Expression (CTE), e.g.
a temporary named result set, derived from a simple query and defined within the execution scope of a SELECT, INSERT, UPDATE, or DELETE statement.
In this example, the
exp
temporary result is generated by the firstINSERT
statement, which updates theexperiment
table. It returns the automatically generatedexperiment_id
via theRETURNING
command. Let’s add another accession to theexperiment
table and examine the returnedexp
table:
WITH
exp AS (
INSERT INTO experiment (accession)
VALUES ('another accession')
RETURNING experiment_id
)SELECT * FROM exp
experiment_id |
---|
2 |
As expected, the experiment_id
has been incremented for the next experiment.
- Next, we provide the two run accessions by passing them as
VALUES
to thedata
table.
WITH
data(accession) AS (
VALUES
'SRR18430942'),
('SRR18430943')
(
)SELECT * FROM data
accession |
---|
SRR18430942 |
SRR18430943 |
- Finally, the second
INSERT
statement adds the two runs to therun
table, by retrieving the temporary values from both theexp
anddata
result sets.
Because the CTE is a single SQL statement, it runs within a single transaction, e.g. it is committed only at the successful completion of the whole statement.
This work is licensed under a Creative Commons Attribution 4.0 International License.