library(DBI)
library(glue)
library(RPostgres)
library(sessioninfo)

# Connect to a (pre-existing) postgres database called `full_text_search`
con <- DBI::dbConnect(
  dbname = "full_text_search",
  drv = RPostgres::Postgres(),
  host = "localhost",
  port = 5432L,
  user = "postgres"
)
I have been learning how to organize, search and modify data in a Postgres database by working through Anthony DeBarros’ excellent book Practical SQL.
Because I currently perform most of my data analyses in R, I am using the great RPostgres, DBI and glue packages to interface with Postgres - without ever leaving my R session.
Today I learned how to create a full text search index and how to search it with one or more search terms.
Connecting to Postgres
For this example, I created a toy database full_text_search in my local Postgres server. I connect to it with the DBI::dbConnect command, passing it the RPostgres::Postgres() driver.
Creating and populating a table
Because this is a toy example, I start with a fresh table datasets. (In case it already exists from previous experimentation, I drop the table if necessary.)
Let’s define five fields for the table:
- id: the unique identifier
- name: the short name of each entry
- title: a longer title
- description: a paragraph describing the entry
- created: the date and time the entry was added to the database
# drop the `datasets` table if it already exists
if (DBI::dbExistsTable(con, "datasets")) DBI::dbRemoveTable(con, "datasets")
# create the empty `datasets` table
sql <- glue_sql("
  CREATE TABLE IF NOT EXISTS datasets (
    id bigserial PRIMARY KEY,
    name text,
    title text,
    description text,
    created timestamp with time zone default current_timestamp not null
  );", .con = con)
res <- suppressMessages(DBI::dbSendStatement(con, sql))
DBI::dbClearResult(res)
DBI::dbReadTable(con, "datasets")
[1] id name title description created
<0 rows> (or 0-length row.names)
Initially, our new table is empty. Let’s populate it with three entries, each describing a popular dataset shipped with R’s built-in datasets package.
# some example entries
buildin_datasets <- list(
  mtcars = list(
    name = "mtcars",
    title = "The built-in mtcars dataset from the datasets R package.",
    description = gsub(
      "\r?\n|\r", " ",
      "The data was extracted from the 1974 Motor Trend US magazine, and
       comprises fuel consumption and 10 aspects of automobile design and
       performance for 32 automobiles (1973–74 models).")
  ),
  airmiles = list(
    name = "airmiles",
    title = "The built-in airmiles dataset from the datasets R package",
    description = gsub(
      "\r?\n|\r", " ",
      "The revenue passenger miles flown by commercial airlines in the United
       States for each year from 1937 to 1960.")
  ),
  attitude = list(
    name = "attitude",
    title = "The built-in attitude dataset from the datasets R package",
    description = gsub(
      "\r?\n|\r", " ",
      "From a survey of the clerical employees of a large financial
       organization, the data are aggregated from the questionnaires of the
       approximately 35 employees for each of 30 (randomly selected) departments.
       The numbers give the percent proportion of favourable responses to seven
       questions in each department.")
  )
)
Next, we loop over each element of the list and use the glue_sql() command to unpack both the names (names(dataset)) and the values of each field for this entry. Then we update the datasets table with this new information. Afterward, we retrieve the name and title fields to verify the correct import:
for (dataset in buildin_datasets) {
  sql <- glue_sql(
    "INSERT INTO datasets ({`names(dataset)`*})
     VALUES ({dataset*});",
    .con = con)
  res <- suppressMessages(DBI::dbSendStatement(con, sql))
  DBI::dbClearResult(res)
}
DBI::dbGetQuery(con, "SELECT name, title from datasets;")
name title
1 mtcars The built-in mtcars dataset from the datasets R package.
2 airmiles The built-in airmiles dataset from the datasets R package
3 attitude The built-in attitude dataset from the datasets R package
Searching!
Our goal is to enable full-text search for the description field. Let’s look up the term data. To perform full-text search, both the records to search and our query need to be tokenized first, with the to_tsvector and to_tsquery functions, respectively.
Here is an example of the tokens that are generated:
sql <- glue_sql(
  "SELECT to_tsvector('This is a my test phrase, and what
   a beautiful phrase it is.') to_tsquery",
  .con = con)
DBI::dbGetQuery(con, sql)
to_tsquery
1 'beauti':10 'phrase':6,11 'test':5
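The to_tsquery function applies the same normalization to the search terms themselves, which is why a query for a full word can match its stemmed tokens. A small sketch, not executed in the session above:

```sql
-- Sketch: to_tsquery stems each search term the same way to_tsvector
-- stems the document text, so 'beautiful' matches the token 'beauti'.
SELECT to_tsquery('english', 'beautiful & phrases');
-- expected result: 'beauti' & 'phrase'
```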
The following query correctly returns all records whose descriptions contain the word data:
# search the description field
term <- "data"
sql <- glue_sql(
  "SELECT id, name
   FROM datasets
   WHERE to_tsvector(description) @@ to_tsquery('english', {term})
   ORDER BY created;",
  .con = con)
DBI::dbGetQuery(con, sql)
id name
1 1 mtcars
2 3 attitude
We can enrich the output by including the output of the ts_headline function, which highlights the location / context of the matched term:
# search the description field and show the matching location
term <- "data"
sql <- glue_sql(
  "SELECT id, name,
   ts_headline(description, to_tsquery('english', {term}),
               'StartSel = <,
                StopSel = >,
                MinWords = 5,
                MaxWords = 7,
                MaxFragments = 1')
   FROM datasets
   WHERE to_tsvector(description) @@ to_tsquery('english', {term})
   ORDER BY created;",
  .con = con)
DBI::dbGetQuery(con, sql)
id name ts_headline
1 1 mtcars <data> was extracted from the 1974 Motor
2 3 attitude financial organization, the <data> are aggregated from
We can also combine search terms, e.g. searching for either the employee or the motor term:
# using multiple search terms
term <- "employee | motor" # OR
sql <- glue_sql(
  "SELECT id, name,
   ts_headline(description, to_tsquery('english', {term}),
               'StartSel = <,
                StopSel = >,
                MinWords = 5,
                MaxWords = 7,
                MaxFragments = 1')
   FROM datasets
   WHERE to_tsvector(description) @@ to_tsquery('english', {term})
   ORDER BY created;",
  .con = con)
DBI::dbGetQuery(con, sql)
id name ts_headline
1 1 mtcars from the 1974 <Motor> Trend US magazine
2 3 attitude clerical <employees> of a large financial organization
Similarly, we can narrow our search by requiring both the data and the employee terms to appear in the same description:
term <- "data & employee" # AND
sql <- glue_sql(
  "SELECT id, name,
   ts_headline(description, to_tsquery('english', {term}),
               'StartSel = <,
                StopSel = >,
                MinWords = 5,
                MaxWords = 7,
                MaxFragments = 1')
   FROM datasets
   WHERE to_tsvector(description) @@ to_tsquery('english', {term})
   ORDER BY created;",
  .con = con)
DBI::dbGetQuery(con, sql)
id name ts_headline
1 3 attitude clerical <employees> of a large financial organization
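Beyond the | (OR) and & (AND) operators shown above, tsquery also supports negation with ! and, since Postgres 9.6, the <-> phrase operator. Two hedged sketches against the same table (not part of the session above):

```sql
-- Negation: descriptions that mention 'data' but not 'employee'.
SELECT id, name
FROM datasets
WHERE to_tsvector(description) @@ to_tsquery('english', 'data & !employee');

-- Phrase operator (Postgres 9.6+): 'fuel' immediately followed by 'consumption'.
SELECT id, name
FROM datasets
WHERE to_tsvector(description) @@ to_tsquery('english', 'fuel <-> consumption');
```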
Creating indices
In the examples above, we tokenized the search term and the description field at run time, i.e. each time the query was executed. As our database grows, this repeated work will degrade performance.
Adding an index to our database will maintain full-text search speed even with large datasets. We have two different options:
- Create an index based on an expression.
- Create a new field to hold the output of the to_tsvector function, and then index this new field.
Creating an expression index
A simple way to create a full-text index is to include the to_tsvector() expression in the definition of the index itself. Here, we add a Generalized Inverted Index (GIN) for the description column:
sql <- glue_sql(
  "CREATE INDEX description_idx ON datasets
   USING gin(to_tsvector('english', description));",
  .con = con)
DBI::dbExecute(con, sql)
[1] 0
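Note that an expression index is only used when the query repeats the exact same expression, including the 'english' configuration argument. To check whether the planner actually picks up the new index, we could inspect the query plan. A sketch, not run in this session:

```sql
-- Sketch: ask the planner how it executes the search. On a tiny table
-- Postgres may still prefer a sequential scan; with more rows, the plan
-- should show a bitmap index scan on description_idx.
EXPLAIN ANALYZE
SELECT id, name
FROM datasets
WHERE to_tsvector('english', description) @@ to_tsquery('english', 'data');
```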
The same type of query we issued above will now take advantage of the description_idx:
# search the description field using its index
term <- "questioning"
sql <- glue_sql(
  "SELECT id, name,
   ts_headline(description, to_tsquery('english', {term}),
               'StartSel = <,
                StopSel = >,
                MinWords = 5,
                MaxWords = 7,
                MaxFragments = 1')
   FROM datasets
   WHERE to_tsvector('english', description) @@ to_tsquery('english', {term})
   ORDER BY created;",
  .con = con)
DBI::dbGetQuery(con, sql)
id name ts_headline
1 3 attitude responses to seven <questions> in each department
The description fields of new records, e.g. those that are added later, will automatically be added to the index. Let’s create a new record for the euro dataset, for example.
new_data <- list(
  name = "euro",
  title = "The built-in euro dataset from the datasets R package",
  description = gsub(
    "\r?\n|\r", " ",
    "The data set euro contains the value of 1 Euro in all currencies
     participating in the European monetary union (Austrian Schilling ATS,
     Belgian Franc BEF, German Mark DEM, Spanish Peseta ESP, Finnish Markka FIM,
     French Franc FRF, Irish Punt IEP, Italian Lira ITL, Luxembourg Franc LUF,
     Dutch Guilder NLG and Portuguese Escudo PTE). These conversion rates were
     fixed by the European Union on December 31, 1998. To convert old prices to
     Euro prices, divide by the respective rate and round to 2 digits.")
)
sql <- glue_sql(
  "INSERT INTO datasets ({`names(new_data)`*})
   VALUES ({new_data*});",
  .con = con)
DBI::dbExecute(con, sql)
[1] 1
This new record will now be included in the search results for the term data, for example:
# search the description field using its index
term <- "data"
sql <- glue_sql(
  "SELECT id, name,
   ts_headline(description, to_tsquery('english', {term}),
               'StartSel = <,
                StopSel = >,
                MinWords = 5,
                MaxWords = 7,
                MaxFragments = 1')
   FROM datasets
   WHERE to_tsvector('english', description) @@ to_tsquery('english', {term})
   ORDER BY created;",
  .con = con)
DBI::dbGetQuery(con, sql)
id name ts_headline
1 1 mtcars <data> was extracted from the 1974 Motor
2 3 attitude financial organization, the <data> are aggregated from
3 4 euro <data> set euro contains the value
Adding a tokenized field for full-text searches
Alternatively, we can create a new column to hold the output of the to_tsvector() function, and then index it for future use. Let’s create a new column search_description_text:
# create a column to hold tokens for full text search
sql <- glue_sql(
  "ALTER TABLE datasets
   ADD COLUMN search_description_text tsvector;",
  .con = con)
DBI::dbExecute(con, sql)
[1] 0
DBI::dbListFields(con, "datasets")
[1] "id" "name"
[3] "title" "description"
[5] "created" "search_description_text"
Next, we tokenize the description field, and store the output in our new search_description_text column:
sql <- glue_sql(
  "UPDATE datasets
   SET search_description_text = to_tsvector('english', description);",
  .con = con)
DBI::dbExecute(con, sql)
[1] 4
Here are the tokens generated from the description of the first record, for example:
DBI::dbGetQuery(con,
  "SELECT name, search_description_text from datasets LIMIT 1;")
name
1 mtcars
search_description_text
1 '10':17 '1973':27 '1974':7 '32':25 '74':28 'aspect':18 'automobil':20,26 'compris':13 'consumpt':15 'data':2 'design':21 'extract':4 'fuel':14 'magazin':11 'model':29 'motor':8 'perform':23 'trend':9 'us':10
As before, we can add an index - but this time, we index the pre-tokenized search_description_text column instead:
# create the search index
sql <- glue_sql(
  "CREATE INDEX search_description_idx
   ON datasets
   USING gin(search_description_text);",
  .con = con)
DBI::dbExecute(con, sql)
[1] 0
Time to run our search again. When we search the search_description_text field, we can omit the to_tsvector() call, because it has been tokenized already:
# search the description field and show the matching location
term <- "data"
sql <- glue_sql(
  "SELECT id, name,
   ts_headline(description, to_tsquery('english', {term}),
               'StartSel = <,
                StopSel = >,
                MinWords = 5,
                MaxWords = 7,
                MaxFragments = 1')
   FROM datasets
   WHERE search_description_text @@ to_tsquery('english', {term})
   ORDER BY created;",
  .con = con)
DBI::dbGetQuery(con, sql)
id name ts_headline
1 1 mtcars <data> was extracted from the 1974 Motor
2 3 attitude financial organization, the <data> are aggregated from
3 4 euro <data> set euro contains the value
🚨 But beware: because we have precalculated the tokens, any new records added to the database will not automatically be processed, nor will they be indexed!
Let’s add a final record, the morley dataset:
more_data <- list(
  name = "morley",
  title = "The built-in morley dataset from the datasets R package",
  description = gsub(
    "\r?\n|\r", " ",
    "A classical data of Michelson (but not this one with Morley) on
     measurements done in 1879 on the speed of light. The data consists of five
     experiments, each consisting of 20 consecutive ‘runs’. The response is the speed
     of light measurement, suitably coded (km/sec, with 299000 subtracted).")
)
To enter this record, we not only have to populate the name, title and description fields - but also the list of tokens derived from the description in the search_description_text column. In other words, we have to execute the to_tsvector function inside our INSERT statement:
sql <- glue_sql(
  "INSERT INTO datasets ({`names(more_data)`*}, search_description_text)
   VALUES ({more_data*}, to_tsvector({more_data[['description']]}));",
  .con = con)
DBI::dbExecute(con, sql)
[1] 1
Now, our query returns both the original matches and the new record:
# search the description field and show the matching location
term <- "data"
sql <- glue_sql(
  "SELECT id, name,
   ts_headline(description, to_tsquery('english', {term}),
               'StartSel = <,
                StopSel = >,
                MinWords = 5,
                MaxWords = 7,
                MaxFragments = 1')
   FROM datasets
   WHERE search_description_text @@ to_tsquery('english', {term})
   ORDER BY created;",
  .con = con)
DBI::dbGetQuery(con, sql)
id name ts_headline
1 1 mtcars <data> was extracted from the 1974 Motor
2 3 attitude financial organization, the <data> are aggregated from
3 4 euro <data> set euro contains the value
4 5 morley classical <data> of Michelson (but not this
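Calling to_tsvector by hand in every INSERT is easy to forget. Postgres can also maintain the tokenized column automatically, either with a stored generated column (Postgres 12+) or with the built-in tsvector_update_trigger function on older servers. A hedged sketch, not run in this session; the column name search_description_auto and the trigger name are hypothetical:

```sql
-- Option 1 (Postgres 12+): a stored generated column stays in sync
-- automatically; 'search_description_auto' is a hypothetical name.
ALTER TABLE datasets
  ADD COLUMN search_description_auto tsvector
  GENERATED ALWAYS AS (to_tsvector('english', description)) STORED;

-- Option 2 (older servers): the built-in trigger function recomputes
-- the existing search_description_text column on every INSERT or UPDATE.
CREATE TRIGGER datasets_tsvector_update
  BEFORE INSERT OR UPDATE ON datasets
  FOR EACH ROW EXECUTE PROCEDURE
    tsvector_update_trigger(search_description_text, 'pg_catalog.english', description);
```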
Choosing between indexing strategies
According to the Postgres documentation:
One advantage of the separate-column approach over an expression index is that it is not necessary to explicitly specify the text search configuration in queries in order to make use of the index. Another advantage is that searches will be faster, since it will not be necessary to redo the to_tsvector calls to verify index matches. The expression-index approach is simpler to set up, however, and it requires less disk space since the tsvector representation is not stored explicitly.
That’s it. Thanks again to Anthony DeBarros for his excellent book Practical SQL!
Reproducibility
sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.2.2 (2022-10-31)
os macOS Big Sur ... 10.16
system x86_64, darwin17.0
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/Los_Angeles
date 2023-01-16
pandoc 2.19.2 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
askpass 1.1 2019-01-13 [1] CRAN (R 4.2.0)
bit 4.0.5 2022-11-15 [1] CRAN (R 4.2.0)
bit64 4.0.5 2020-08-30 [1] CRAN (R 4.2.0)
blob 1.2.3 2022-04-10 [1] CRAN (R 4.2.0)
cli 3.5.0 2022-12-20 [1] CRAN (R 4.2.0)
credentials 1.3.2 2021-11-29 [1] CRAN (R 4.2.0)
DBI * 1.1.3 2022-06-18 [1] CRAN (R 4.2.0)
digest 0.6.31 2022-12-11 [1] CRAN (R 4.2.0)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.0)
evaluate 0.19 2022-12-13 [1] CRAN (R 4.2.0)
fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.0)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.0)
glue * 1.6.2 2022-02-24 [1] CRAN (R 4.2.0)
hms 1.1.2 2022-08-19 [1] CRAN (R 4.2.0)
htmltools 0.5.4 2022-12-07 [1] CRAN (R 4.2.0)
htmlwidgets 1.5.4 2021-09-08 [1] CRAN (R 4.2.2)
jsonlite 1.8.4 2022-12-06 [1] CRAN (R 4.2.0)
knitr 1.41 2022-11-18 [1] CRAN (R 4.2.0)
lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.0)
lubridate 1.9.0 2022-11-06 [1] CRAN (R 4.2.0)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0)
openssl 2.0.5 2022-12-06 [1] CRAN (R 4.2.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0)
Rcpp 1.0.9 2022-07-08 [1] CRAN (R 4.2.0)
rlang 1.0.6 2022-09-24 [1] CRAN (R 4.2.0)
rmarkdown 2.19 2022-12-15 [1] CRAN (R 4.2.0)
RPostgres * 1.4.4 2022-05-02 [1] CRAN (R 4.2.0)
rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.2.0)
sessioninfo * 1.2.2 2021-12-06 [1] CRAN (R 4.2.0)
stringi 1.7.8 2022-07-11 [1] CRAN (R 4.2.0)
stringr 1.5.0 2022-12-02 [1] CRAN (R 4.2.0)
sys 3.4.1 2022-10-18 [1] CRAN (R 4.2.0)
timechange 0.1.1 2022-11-04 [1] CRAN (R 4.2.0)
vctrs 0.5.1 2022-11-16 [1] CRAN (R 4.2.0)
xfun 0.35 2022-11-16 [1] CRAN (R 4.2.0)
yaml 2.3.6 2022-10-18 [1] CRAN (R 4.2.0)
[1] /Users/sandmann/Library/R/x86_64/4.2/library
[2] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
──────────────────────────────────────────────────────────────────────────────
This work is licensed under a Creative Commons Attribution 4.0 International License.