Full text search in Postgres - the R way

TIL
R
postgres
Author

Thomas Sandmann

Published

December 12, 2022

I have been learning how to organize, search and modify data in a Postgres database by working through Anthony DeBarros’ excellent book Practical SQL.

Because I currently perform most of my data analyses in R, I am using the great RPostgres, DBI and glue packages to interface with Postgres - without ever leaving my R session.

Today I learned how to create a full text search index and how to search it with one or more search terms.

Connecting to Postgres

For this example, I created a toy database full_text_search on my local Postgres server. I connect to it with the DBI::dbConnect() command, passing it the RPostgres::Postgres() driver.

library(DBI)
library(glue)
library(RPostgres)
library(sessioninfo)

# Connect to a (preexisting) postgres database called `full_text_search`
con <- DBI::dbConnect(
  dbname = "full_text_search",
  drv = RPostgres::Postgres(),
  host = "localhost",
  port = 5432L,
  user = "postgres"
  )
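As a small aside (not part of the original workflow): DBI connections can be checked and should eventually be closed again. A minimal sketch:

# check that the connection is alive
stopifnot(DBI::dbIsValid(con))

# when all work is done, release the connection
# (commented out here, because the examples below still need it)
# DBI::dbDisconnect(con)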

Creating and populating a table

Because this is a toy example, I start with a fresh table datasets. (In case it already exists from previous experimentation, I drop the table if necessary).

Let’s define five fields for the table:

  • id: the unique identifier
  • name: the short name of each entry
  • title: a longer title
  • description: a paragraph describing the entry
  • created: a date and time the entry was added to the database

# drop the `datasets` table if it already exists
if (DBI::dbExistsTable(con, "datasets")) DBI::dbRemoveTable(con, "datasets")

# create the empty `datasets` table
sql <- glue_sql("
      CREATE TABLE IF NOT EXISTS datasets (
      id bigserial PRIMARY KEY,
      name text,
      title text,
      description text,
      created timestamp with time zone default current_timestamp not null
    );", .con = con)
res <- suppressMessages(DBI::dbSendStatement(con, sql))
DBI::dbClearResult(res)
DBI::dbReadTable(con, "datasets")
[1] id          name        title       description created    
<0 rows> (or 0-length row.names)

Initially, our new table is empty. Let’s populate it with three entries, each describing a popular dataset shipped with R’s built-in datasets package.

# some example entries
buildin_datasets <- list(
  mtcars = list(
    "name" = "mtcars", 
    "title" = "The built-in mtcars dataset from the datasets R package.",
    "description" = gsub(
      "\r?\n|\r", " ", 
      "The data was extracted from the 1974 Motor Trend US magazine, and 
comprises fuel consumption and 10 aspects of automobile design and
performance for 32 automobiles (1973–74 models).")
  ), 
  airmiles = list(
    name = "airmiles",
    title = "The built-in airmiles dataset from the datasets R package",
    description = gsub(
      "\r?\n|\r", " ", 
      "The revenue passenger miles flown by commercial airlines in the United
States for each year from 1937 to 1960.")
  ),
  attitude = list(
    name = "attitude", 
    title = "The built-in attitude dataset from the datasets R package",
    description = gsub(
      "\r?\n|\r", " ", 
      "From a survey of the clerical employees of a large financial
organization, the data are aggregated from the questionnaires of the
approximately 35 employees for each of 30 (randomly selected) departments. 
The numbers give the percent proportion of favourable responses to seven
questions in each department.")
  )
)

Next, we loop over each element of the list and use the glue_sql() command to unpack both the names (names(dataset)) and the values of each field for this entry. Then we update the datasets table with this new information.

Afterward, we retrieve the name and title fields to verify the correct import:

for (dataset in buildin_datasets) {
  sql <- glue_sql(
    "INSERT INTO datasets ({`names(dataset)`*})
   VALUES ({dataset*});", 
    .con = con)
  res <- suppressMessages(DBI::dbSendStatement(con, sql))
  DBI::dbClearResult(res)
}
DBI::dbGetQuery(con, "SELECT name, title from datasets;")
      name                                                     title
1   mtcars  The built-in mtcars dataset from the datasets R package.
2 airmiles The built-in airmiles dataset from the datasets R package
3 attitude The built-in attitude dataset from the datasets R package

Searching!

Our goal is to enable full-text search of the description field. To perform a full-text search, both the records to be searched and the query itself first need to be tokenized, with the to_tsvector and to_tsquery functions, respectively. Let’s look up the term data.

Here is an example of the tokens that are generated:

sql <- glue_sql(
  "SELECT to_tsvector('This is a my test phrase, and what 
                       a beautiful phrase it is.')
   to_tsquery", .con = con)
DBI::dbGetQuery(con, sql)
                          to_tsquery
1 'beauti':10 'phrase':6,11 'test':5
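The query side is normalized in the same way: to_tsquery('english', ...) stems each term, which is why a search can match inflected word forms. A quick sketch:

# tokenize a query: 'phrases' and 'beautiful' are stemmed to
# 'phrase' and 'beauti', matching the tokens shown above
sql <- glue_sql(
  "SELECT to_tsquery('english', 'phrases & beautiful');",
  .con = con)
DBI::dbGetQuery(con, sql)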

The following query correctly returns all records whose descriptions contain the word data:

# search the description field
term <- "data"
sql <- glue_sql(
  "SELECT id, name
  FROM datasets
  WHERE to_tsvector(description) @@ to_tsquery('english', {term})
  ORDER BY created;",
  .con = con)
DBI::dbGetQuery(con, sql)
  id     name
1  1   mtcars
2  3 attitude

We can enrich the output by including the output of the ts_headline function, which highlights the location / context of the matched term:

# search the description field and show the matching location
term <- "data"
sql <- glue_sql(
  "SELECT id, name,
    ts_headline(description, to_tsquery('english', {term}),
     'StartSel = <,
      StopSel = >,
      MinWords = 5,
      MaxWords = 7,
      MaxFragments = 1')
  FROM datasets
  WHERE to_tsvector(description) @@ to_tsquery('english', {term})
  ORDER BY created;",
  .con = con)
DBI::dbGetQuery(con, sql)
  id     name                                            ts_headline
1  1   mtcars               <data> was extracted from the 1974 Motor
2  3 attitude financial organization, the <data> are aggregated from

We can also combine search terms, e.g. searching for records that contain either the employee or the motor term:

# using multiple search terms
term <- "employee | motor"  # OR
sql <- glue_sql(
  "SELECT id, name,
  ts_headline(description, to_tsquery('english', {term}),
     'StartSel = <,
      StopSel = >,
      MinWords = 5,
      MaxWords = 7,
      MaxFragments = 1')
  FROM datasets
  WHERE to_tsvector(description) @@ to_tsquery('english', {term})
  ORDER BY created;",
  .con = con)
DBI::dbGetQuery(con, sql)
  id     name                                            ts_headline
1  1   mtcars                from the 1974 <Motor> Trend US magazine
2  3 attitude clerical <employees> of a large financial organization

Similarly, we can narrow our search by requiring both data and employee terms to appear in the same description:

term <- "data & employee"  # AND
sql <- glue_sql(
  "SELECT id, name,
  ts_headline(description, to_tsquery('english', {term}),
     'StartSel = <,
      StopSel = >,
      MinWords = 5,
      MaxWords = 7,
      MaxFragments = 1')
  FROM datasets
  WHERE to_tsvector(description) @@ to_tsquery('english', {term})
  ORDER BY created;",
  .con = con)
DBI::dbGetQuery(con, sql)
  id     name                                            ts_headline
1  3 attitude clerical <employees> of a large financial organization
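Besides | and &, to_tsquery also supports negation with ! and, in Postgres 9.6 or newer, the followed-by operator <-> for phrase-like searches. The following sketch should only match descriptions where motor is immediately followed by trend:

# phrase-like search: 'motor' immediately followed by 'trend'
term <- "motor <-> trend"
sql <- glue_sql(
  "SELECT id, name
   FROM datasets
   WHERE to_tsvector(description) @@ to_tsquery('english', {term})
   ORDER BY created;",
  .con = con)
DBI::dbGetQuery(con, sql)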

Creating indices

In the examples above, we tokenized the search term and the description field at run time, i.e. when the query was executed. As our database grows, this repeated work becomes expensive and degrades performance.

Adding an index to our database will maintain full-text search speed even with large datasets. We have two different options:

  1. Create an index based on an expression.
  2. Create a new field to hold the output of the to_tsvector function, and then index this new field.

Creating an expression index

A simple way to create a full-text index is to include the to_tsvector() expression in the definition of the index itself. Here, we add a Generalized Inverted Index (GIN) index for the description column:

sql <- glue_sql(
  "CREATE INDEX description_idx ON datasets 
  USING gin(to_tsvector('english', description));",
  .con = con
)
DBI::dbExecute(con, sql)
[1] 0
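We can ask the query planner whether it would use the new index with EXPLAIN. (With only a handful of rows, Postgres may still prefer a sequential scan, so the output of this sketch depends on the table size.)

# inspect the query plan for an indexed full-text search
sql <- glue_sql(
  "EXPLAIN
   SELECT id, name
   FROM datasets
   WHERE to_tsvector('english', description) @@
         to_tsquery('english', 'data');",
  .con = con)
DBI::dbGetQuery(con, sql)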

The same type of query we issued above will now take advantage of the description_idx. Note that the query has to use the same expression as the index definition, including the 'english' configuration, for the index to be applicable:

# search the description field using its index
term <- "questioning"
sql <- glue_sql(
  "SELECT id, name,
    ts_headline(description, to_tsquery('english', {term}),
     'StartSel = <,
      StopSel = >,
      MinWords = 5,
      MaxWords = 7,
      MaxFragments = 1')
  FROM datasets
  WHERE to_tsvector('english', description) @@ to_tsquery('english', {term})
  ORDER BY created;",
  .con = con)
DBI::dbGetQuery(con, sql)
  id     name                                       ts_headline
1  3 attitude responses to seven <questions> in each department

The description fields of new records, e.g. those added later, will automatically be included in the index. Let’s create a new record for the euro dataset, for example.

new_data <- list(
  name = "euro", 
  title = "The built-in euro dataset from the datasets R package",
  description = gsub(
    "\r?\n|\r", " ", 
    "The data set euro contains the value of 1 Euro in all currencies
participating in the European monetary union (Austrian Schilling ATS, 
Belgian Franc BEF, German Mark DEM, Spanish Peseta ESP, Finnish Markka FIM, 
French Franc FRF, Irish Punt IEP, Italian Lira ITL, Luxembourg Franc LUF, 
Dutch Guilder NLG and Portuguese Escudo PTE). These conversion rates were 
fixed by the European Union on December 31, 1998. To convert old prices to 
Euro prices, divide by the respective rate and round to 2 digits.")
)
sql <- glue_sql(
  "INSERT INTO datasets ({`names(new_data)`*})
   VALUES ({new_data*});", 
  .con = con)
DBI::dbExecute(con, sql)
[1] 1

This new record will now be included in the search results for the term data, for example:

# search the description field using its index
term <- "data"
sql <- glue_sql(
  "SELECT id, name,
    ts_headline(description, to_tsquery('english', {term}),
     'StartSel = <,
      StopSel = >,
      MinWords = 5,
      MaxWords = 7,
      MaxFragments = 1')
  FROM datasets
  WHERE to_tsvector('english', description) @@ to_tsquery('english', {term})
  ORDER BY created;",
  .con = con)
DBI::dbGetQuery(con, sql)
  id     name                                            ts_headline
1  1   mtcars               <data> was extracted from the 1974 Motor
2  3 attitude financial organization, the <data> are aggregated from
3  4     euro                     <data> set euro contains the value

Adding a tokenized field for full-text searches

Alternatively, we can create a new column to hold the output of the to_tsvector() function, and then index it for future use. Let’s create a new column search_description_text:

# create a column to hold tokens for full text search
sql <- glue_sql(
  "ALTER TABLE datasets
   ADD COLUMN search_description_text tsvector;", 
  .con = con)
DBI::dbExecute(con, sql)
[1] 0
DBI::dbListFields(con, "datasets")
[1] "id"                      "name"                   
[3] "title"                   "description"            
[5] "created"                 "search_description_text"

Next, we tokenize the description field and store the output in our new search_description_text column:

sql <- glue_sql(
  "UPDATE datasets
   SET search_description_text = to_tsvector('english', description);", 
  .con = con)
DBI::dbExecute(con, sql)
[1] 4

Here are the tokens generated from the description of the first record, for example:

DBI::dbGetQuery(con, 
                "SELECT name, search_description_text from datasets LIMIT 1;")
    name
1 mtcars
                                                                                                                                                                                          search_description_text
1 '10':17 '1973':27 '1974':7 '32':25 '74':28 'aspect':18 'automobil':20,26 'compris':13 'consumpt':15 'data':2 'design':21 'extract':4 'fuel':14 'magazin':11 'model':29 'motor':8 'perform':23 'trend':9 'us':10

As before, we can add an index - but this time, we index the pre-tokenized search_description_text column instead:

# create the search index
sql <- glue_sql(
  "CREATE INDEX search_description_idx
   ON datasets
   USING gin(search_description_text);",
  .con = con)
DBI::dbExecute(con, sql)
[1] 0

Time to run our search again. When we search the search_description_text field, we can omit the to_tsvector() call, because the column has been tokenized already:

# search the description field and show the matching location
term <- "data"
sql <- glue_sql(
  "SELECT id, name,
  ts_headline(description, to_tsquery('english', {term}),
     'StartSel = <,
      StopSel = >,
      MinWords = 5,
      MaxWords = 7,
      MaxFragments = 1')
  FROM datasets
  WHERE search_description_text @@ to_tsquery('english', {term})
  ORDER BY created;",
  .con = con)
DBI::dbGetQuery(con, sql)
  id     name                                            ts_headline
1  1   mtcars               <data> was extracted from the 1974 Motor
2  3 attitude financial organization, the <data> are aggregated from
3  4     euro                     <data> set euro contains the value

🚨 But beware: because we have precalculated the tokens, any new records added to the database will not automatically be processed, nor will they be indexed!
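As an aside (not executed here, because the examples below keep the manual approach): on Postgres 12 or newer, the column could instead be declared as a stored generated column, which Postgres keeps in sync automatically; older versions can achieve the same with the built-in tsvector_update_trigger trigger function. A sketch:

# sketch: replace the manually maintained column with a generated one
# (Postgres >= 12; dropping the column also drops search_description_idx,
#  so the index would have to be recreated afterwards)
sql <- glue_sql(
  "ALTER TABLE datasets
   DROP COLUMN search_description_text,
   ADD COLUMN search_description_text tsvector
     GENERATED ALWAYS AS (to_tsvector('english', description)) STORED;",
  .con = con)
# DBI::dbExecute(con, sql)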

Let’s add a final record, the morley dataset:

more_data <- list(
  name = "morley", 
  title = "The built-in morley dataset from the datasets R package",
  description = gsub(
    "\r?\n|\r", " ", 
    "A classical data of Michelson (but not this one with Morley) on 
measurements done in 1879 on the speed of light. The data consists of five 
experiments, each consisting of 20 consecutive ‘runs’. The response is the speed
of light measurement, suitably coded (km/sec, with 299000 subtracted).")
)

To enter this record, we not only have to populate the name, title and description fields, but also the search_description_text column with the tokens derived from the description. In other words, we have to execute the to_tsvector function inside our INSERT statement:

sql <- glue_sql(
  "INSERT INTO datasets ({`names(more_data)`*}, search_description_text)
   VALUES ({more_data*}, to_tsvector('english', {more_data[['description']]}));", 
  .con = con)
DBI::dbExecute(con, sql)
[1] 1

Now, our query returns both the original matches and the new record:

# search the description field and show the matching location
term <- "data"
sql <- glue_sql(
  "SELECT id, name,
  ts_headline(description, to_tsquery('english', {term}),
     'StartSel = <,
      StopSel = >,
      MinWords = 5,
      MaxWords = 7,
      MaxFragments = 1')
  FROM datasets
  WHERE search_description_text @@ to_tsquery('english', {term})
  ORDER BY created;",
  .con = con)
DBI::dbGetQuery(con, sql)
  id     name                                            ts_headline
1  1   mtcars               <data> was extracted from the 1974 Motor
2  3 attitude financial organization, the <data> are aggregated from
3  4     euro                     <data> set euro contains the value
4  5   morley            classical <data> of Michelson (but not this

Choosing between indexing strategies

According to the Postgres documentation:

One advantage of the separate-column approach over an expression index is that it is not necessary to explicitly specify the text search configuration in queries in order to make use of the index. Another advantage is that searches will be faster, since it will not be necessary to redo the to_tsvector calls to verify index matches. The expression-index approach is simpler to set up, however, and it requires less disk space since the tsvector representation is not stored explicitly.
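Whichever strategy we pick, the stored tsvector can also be reused for ranking. As a sketch (building on the search_description_text column created above), ts_rank() orders matches by relevance instead of creation date:

# rank matching records by relevance
term <- "data"
sql <- glue_sql(
  "SELECT id, name,
     ts_rank(search_description_text, to_tsquery('english', {term})) AS rank
   FROM datasets
   WHERE search_description_text @@ to_tsquery('english', {term})
   ORDER BY rank DESC;",
  .con = con)
DBI::dbGetQuery(con, sql)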

That’s it. Thanks again to Anthony DeBarros for his excellent book, Practical SQL!

Reproducibility

sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.2.2 (2022-10-31)
 os       macOS Big Sur ... 10.16
 system   x86_64, darwin17.0
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/Los_Angeles
 date     2023-01-16
 pandoc   2.19.2 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 askpass       1.1     2019-01-13 [1] CRAN (R 4.2.0)
 bit           4.0.5   2022-11-15 [1] CRAN (R 4.2.0)
 bit64         4.0.5   2020-08-30 [1] CRAN (R 4.2.0)
 blob          1.2.3   2022-04-10 [1] CRAN (R 4.2.0)
 cli           3.5.0   2022-12-20 [1] CRAN (R 4.2.0)
 credentials   1.3.2   2021-11-29 [1] CRAN (R 4.2.0)
 DBI         * 1.1.3   2022-06-18 [1] CRAN (R 4.2.0)
 digest        0.6.31  2022-12-11 [1] CRAN (R 4.2.0)
 ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.2.0)
 evaluate      0.19    2022-12-13 [1] CRAN (R 4.2.0)
 fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.2.0)
 generics      0.1.3   2022-07-05 [1] CRAN (R 4.2.0)
 glue        * 1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
 hms           1.1.2   2022-08-19 [1] CRAN (R 4.2.0)
 htmltools     0.5.4   2022-12-07 [1] CRAN (R 4.2.0)
 htmlwidgets   1.5.4   2021-09-08 [1] CRAN (R 4.2.2)
 jsonlite      1.8.4   2022-12-06 [1] CRAN (R 4.2.0)
 knitr         1.41    2022-11-18 [1] CRAN (R 4.2.0)
 lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.2.0)
 lubridate     1.9.0   2022-11-06 [1] CRAN (R 4.2.0)
 magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
 openssl       2.0.5   2022-12-06 [1] CRAN (R 4.2.0)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
 Rcpp          1.0.9   2022-07-08 [1] CRAN (R 4.2.0)
 rlang         1.0.6   2022-09-24 [1] CRAN (R 4.2.0)
 rmarkdown     2.19    2022-12-15 [1] CRAN (R 4.2.0)
 RPostgres   * 1.4.4   2022-05-02 [1] CRAN (R 4.2.0)
 rstudioapi    0.14    2022-08-22 [1] CRAN (R 4.2.0)
 sessioninfo * 1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
 stringi       1.7.8   2022-07-11 [1] CRAN (R 4.2.0)
 stringr       1.5.0   2022-12-02 [1] CRAN (R 4.2.0)
 sys           3.4.1   2022-10-18 [1] CRAN (R 4.2.0)
 timechange    0.1.1   2022-11-04 [1] CRAN (R 4.2.0)
 vctrs         0.5.1   2022-11-16 [1] CRAN (R 4.2.0)
 xfun          0.35    2022-11-16 [1] CRAN (R 4.2.0)
 yaml          2.3.6   2022-10-18 [1] CRAN (R 4.2.0)

 [1] /Users/sandmann/Library/R/x86_64/4.2/library
 [2] /Library/Frameworks/R.framework/Versions/4.2/Resources/library

──────────────────────────────────────────────────────────────────────────────

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.