Distributing R packages with a drat repository hosted on AWS S3

R
AWS
TIL
Author

Thomas Sandmann

Published

January 21, 2023

tl;dr

Today I learned how to

  • Build an R package into source and binary bundles for distribution.
  • Create a local drat repository.
  • Add an R package to the repository and install it from there.
  • Host the repository remotely in an AWS S3 bucket.

Many thanks to Dirk Eddelbuettel for creating and documenting the drat R package! (As always, any mistakes are my own.)

Motivation

There are multiple ways for developers to share R packages publicly, e.g. through repositories such as CRAN or Bioconductor.

Users can then install these packages via the familiar install.packages() command.

Alternatively, authors can share their code through version control platforms like GitHub or GitLab, and users can install packages from there with third-party tools, e.g. the remotes R package.

But how can you make an R package available privately, e.g. for use within an organization?

In this tutorial, I demonstrate how to set up your own package repository with Dirk Eddelbuettel’s drat R package, add a package, make R aware of the new repo - and host it remotely on AWS S3.

Why drat?

Dirk Eddelbuettel highlights two main advantages:

  • A package installed from a drat repository will be supported by install.packages() and update.packages(), so the user has easy methods for keeping up-to-date.
  • The package author has better control over the package version users install, because they actively push specific releases into the repository.

Please see Dirk’s drat FAQ for additional points, e.g. ‘Why could install_github be wrong?’

Prerequisites

Hadley Wickham and Jenny Bryan have documented how to author, document and build R packages in their freely available R Packages book. In this walkthrough I am using macOS (v13.1), but you can find instructions to set up Windows or Linux build environments in their R build toolchain chapter.

Bundling an R package’s source code for distribution

First, we need an R package that’s ready for distribution. Here, I am using the toy R package that you can retrieve from github, either via git clone https://github.com/tomsing1/toy or by downloading its source code as a zip file. (Feel free to follow along with another R package instead - as long as you have the source package, the following steps apply.)

Next, we bundle the package into a single compressed file with the .tar.gz file extension. Let’s download the .zip file linked above into the ~/Downloads folder and use the R CMD build command to create a source bundle 1:

cd ~/Downloads
curl -s -L -O https://github.com/tomsing1/toy/archive/refs/heads/main.zip
unzip -o -q main.zip
rm main.zip
R CMD build --force toy-main
* checking for file ‘toy-main/DESCRIPTION’ ... OK
* preparing ‘toy’:
* checking DESCRIPTION meta-information ... OK
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
Omitted ‘LazyData’ from DESCRIPTION
* building ‘toy_0.1.0.tar.gz’

We now have the toy_0.1.0.tar.gz file, ready to be inserted into a new (or existing) drat repository.
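As footnote 1 mentions, the bundle can also be built from within R. The following is a minimal sketch, assuming the devtools package is installed and the unzipped sources live in ~/Downloads/toy-main. The build_bundle() helper is a hypothetical name I made up for illustration.

```r
# Hypothetical helper: build a source bundle with devtools::build().
# Assumes `pkg_dir` contains an R package (i.e. a DESCRIPTION file).
build_bundle <- function(pkg_dir, out_dir = dirname(pkg_dir)) {
  if (!requireNamespace("devtools", quietly = TRUE)) {
    stop("Please install the devtools package first.")
  }
  if (!file.exists(file.path(pkg_dir, "DESCRIPTION"))) {
    stop("No DESCRIPTION file found in ", pkg_dir)
  }
  # devtools::build() returns the path of the file it created
  devtools::build(pkg = pkg_dir, path = out_dir, binary = FALSE)
}

pkg_dir <- path.expand("~/Downloads/toy-main")
if (dir.exists(pkg_dir)) {
  bundle <- build_bundle(pkg_dir)
  print(bundle)
}
```

Either route should leave you with a .tar.gz source bundle that can be inserted into the repository.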

Creating a local drat repository

To create a new repository, we start by installing the drat R package itself (if it’s not available on your system already) with the following R commands:

if (!requireNamespace("drat", quietly = TRUE)) {
  install.packages("drat")
}
library(drat)

You can specify the path of your drat repository either by setting the dratRepo option 2:

options(dratRepo = "~/drat-tutorial")
getOption("dratRepo")
[1] "~/drat-tutorial"

or by providing it as an argument to the drat::insertPackage() function (see below).

Let’s create a new drat repository in our home directory 3, and populate it with a minimal index.html file (to avoid HTTP 404 Not Found errors later).

dir.create("~/drat-tutorial", showWarnings = FALSE)
writeLines(
  text = "<!doctype html><title>My awesome drat repository!</title>",
  con = "~/drat-tutorial/index.html"
)

Now we are ready to insert the toy package bundle into the repository with drat’s insertPackage() command 4:

drat::insertPackage(file = "~/Downloads/toy_0.1.0.tar.gz",
                    repodir = "~/drat-tutorial")
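To get a feeling for what a CRAN-like repository needs, here is a rough base R sketch of the core steps for a source bundle: copy the file into the src/contrib subdirectory and regenerate the PACKAGES index files. This is not drat's actual implementation (drat handles more cases), and insert_source_bundle() is a hypothetical helper name. The example writes to a temporary directory so it does not touch the tutorial repository.

```r
# Hypothetical helper that mimics the core steps for a source package.
# NOT drat's implementation, just an illustration using base R tools.
insert_source_bundle <- function(bundle, repodir) {
  stopifnot(file.exists(bundle))
  contrib_dir <- file.path(repodir, "src", "contrib")
  dir.create(contrib_dir, recursive = TRUE, showWarnings = FALSE)
  # Copy the bundle into the src/contrib subdirectory
  file.copy(bundle, contrib_dir, overwrite = TRUE)
  # (Re)generate the PACKAGES index files that install.packages() reads
  tools::write_PACKAGES(contrib_dir, type = "source")
  invisible(contrib_dir)
}

bundle <- path.expand("~/Downloads/toy_0.1.0.tar.gz")
if (file.exists(bundle)) {
  insert_source_bundle(bundle, repodir = file.path(tempdir(), "manual-repo"))
}
```

In practice, drat::insertPackage() is the more convenient choice, because it takes care of source and binary packages alike.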

Now, the ~/drat-tutorial folder contains the following files:

drat repository
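You can also inspect the repository from within R. A minimal sketch, assuming the repository lives at ~/drat-tutorial as above:

```r
repodir <- path.expand("~/drat-tutorial")
# List all files, with paths relative to the repository root
repo_files <- list.files(repodir, recursive = TRUE)
print(repo_files)
```

At this point I would expect to see the index.html file and a src/contrib subdirectory that holds the source bundle next to the PACKAGES index files.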

Accessing the local drat repository

When you prompt your R installation to install or update R packages, it searches repositories specified in the repos option. On my system, only the default repository is set in a fresh R session 5:

getOption("repos")
                         CRAN 
"https://cloud.r-project.org" 

If I try to install our example toy R package, I don’t succeed:

install.packages("toy", type = "source")
Installing package into '/Users/sandmann/Library/R/x86_64/4.2/library'
(as 'lib' is unspecified)
Warning: package 'toy' is not available for this version of R

A version of this package for your version of R might be available elsewhere,
see the ideas at
https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages

because R is not aware of our new repository, yet.

Installing from source

At this point, we must add the type="source" argument, because we have only added the source bundle to the repository. We will add a compiled version in a moment - read on!

To test our local repository, we add its path to the list of known repositories.

drat::addRepo("LocalRepo", "file://Users/sandmann/drat-tutorial")
getOption("repos")
                                 CRAN                             LocalRepo 
        "https://cloud.r-project.org" "file://Users/sandmann/drat-tutorial" 

Specifying file:// paths

By default, drat’s addRepo() command assumes that repositories are hosted on github-pages. Because we want to access a repo via the filesystem (either locally or on a network drive), we need to explicitly add the file:/ prefix - and use the absolute file path (e.g. returned by path.expand("~/drat-tutorial")) to specify its location.

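In this case, concatenating file:/ with /Users/sandmann/drat-tutorial produces the final file://Users/sandmann/drat-tutorial location (note the double forward slashes). R also accepts the standard three-slash form of a file URL, e.g. file:///Users/sandmann/drat-tutorial.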
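The concatenation described above can be wrapped into a small function. This is only a sketch: local_repo_url() is a hypothetical helper name, and it assumes a Unix-style absolute path (or a path starting with ~).

```r
# Hypothetical helper: prefix an absolute path with file:/
local_repo_url <- function(path) {
  paste0("file:/", path.expand(path))
}

local_repo_url("/Users/sandmann/drat-tutorial")
# [1] "file://Users/sandmann/drat-tutorial"

# Register the repository for the current session (requires drat)
if (requireNamespace("drat", quietly = TRUE)) {
  drat::addRepo("LocalRepo", local_repo_url("~/drat-tutorial"))
}
```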

Now, we can install it with the usual install.packages() command 6:

install.packages("toy", type = "source")
Installing package into '/Users/sandmann/Library/R/x86_64/4.2/library'
(as 'lib' is unspecified)

Great! We have successfully installed our toy R package from our brand new repository. Now it is time to make it available to other users as well.

Building binary packages

Windows and Mac users who install packages from CRAN or any user installing files from the Posit Public Package Manager (PPPM) will usually receive a binary package. CRAN accepts package bundles and creates the platform-specific binary file for distribution. To offer the same service to users of our drat repository, we need to compile the binary package ourselves.

Here, I create the macOS binary package from the bundle we obtained above by executing the following command on my Mac:

cd ~/Downloads
R CMD INSTALL --build toy_0.1.0.tar.gz
* installing to library ‘/Users/sandmann/Library/R/x86_64/4.2/library’
* installing *source* package ‘toy’ ...
** using staged installation
** R
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* creating tarball
packaged installation of ‘toy’ as ‘toy_0.1.0.tgz’
* DONE (toy)

This command will first install the package into my default R library, and then create the binary toy_0.1.0.tgz file.

Next, we add it to our local drat repository (note the .tgz file suffix).

drat::insertPackage(file = "~/Downloads/toy_0.1.0.tgz",
                    repodir = "~/drat-tutorial")

Now, the ~/drat-tutorial folder contains a new subdirectory (bin) with the binary files for macOS:

drat repository
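R derives the location of package files from the repository URL and the package type, which is why source bundles and binaries end up in different subdirectories. A small sketch using utils::contrib.url() with a placeholder URL (http://example.org/repo is an assumption for illustration, not a real repository):

```r
repo <- "http://example.org/repo"  # placeholder URL, for illustration only

# Where R looks for source bundles
src_url <- utils::contrib.url(repo, type = "source")
print(src_url)

# Where R looks for macOS binaries (the exact path depends on the R version)
mac_url <- utils::contrib.url(repo, type = "mac.binary")
print(mac_url)
```

Because the binary path includes the R version, binaries built with one R release are generally not found by a different one.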

At long last, we can omit the type="source" argument from calls to install.packages():

install.packages("toy")
Installing package into '/Users/sandmann/Library/R/x86_64/4.2/library'
(as 'lib' is unspecified)

The downloaded binary packages are in
    /var/folders/wc/9tswmr4s74s0x90wqh2007300000gp/T//Rtmpg6tHJA/downloaded_packages

Hosting your drat repository on AWS S3

drat repositories can be hosted in any location

  1. that you can write files to and
  2. that can serve files via HTTP

But unless you placed your drat repository into a network drive that is accessible by multiple users, it is currently only useful to yourself.

If you chose a network drive as the location of your drat repository, then other users can benefit from it right away, as long as they can read from the shared directory. As before, the absolute path must be prefixed with file:/. For example, a repository that is available on the users’ systems at /nfs/groups/groupABC/R/drat would be added to the list of R repositories via drat::addRepo("workgroup", "file://nfs/groups/groupABC/R/drat").

The drat documentation illustrates how you can use git and github pages to make your repository publicly available.

Here, we are interested in hosting a repository privately instead, e.g. in a location that is only accessible from within our own organization:

  • If you already have access to a private server that serves files to your users (e.g. via HTTP), then you can simply copy your repository there.
  • If your organization uses Amazon Web Services (AWS), you can also use an S3 bucket to host your repository and take advantage of the access controls set by your organization.

Although this use case focuses on hosting private repositories, you can of course also make repositories in S3 buckets publicly available. Alas, data storage in S3 buckets incurs cost, while other options (e.g. github-pages, CRAN, Bioconductor, etc) are free, so this might not be your preferred option.

We will assume that you have write access to an S3 bucket that is configured to serve static files via HTTP. (For a brief outline of the necessary steps, please see the appendix.) Here, I am using a bucket called drat-tutorial - but you should create / access your own bucket to follow along.

Warning

AWS S3 buckets can be configured to either be visible publicly, or access can be restricted to specific IP addresses, security groups or other AWS resources. Please make sure you have configured your bucket in a way that suits your needs.

S3 static website endpoints do not support the HTTPS protocol. If you require an encrypted file transfer, you might need a different solution.

To share our repository, we must first copy its folder to the S3 bucket, either via the AWS Console or (more conveniently) with the aws command line interface 7. (If you are adventurous, you can also mount an S3 bucket as a file system with goofys.)

Assuming you have set the necessary AWS credentials, the following aws s3 sync command copies our repository to the repo folder within the drat-tutorial bucket that I created in the us-west-1 AWS region.

aws s3 sync ~/drat-tutorial s3://drat-tutorial/repo

We can use the aws s3 ls command to confirm the upload:

aws s3 ls s3://drat-tutorial/repo/
                           PRE bin/
                           PRE src/
2023-01-21 20:28:30         58 index.html

Note

Whenever we make changes to our local repository, e.g. after adding new packages or package versions, we have to rerun the aws s3 sync command to copy the new files to the S3 bucket.
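To avoid forgetting the second step, the two commands could be combined in one R function. This is a sketch under several assumptions: drat is installed, the aws command line interface is on the PATH with valid credentials, and both s3_sync_args() and publish_package() are hypothetical helper names.

```r
# Hypothetical helper: assemble the arguments for `aws s3 sync`
s3_sync_args <- function(repodir, bucket_uri) {
  c("s3", "sync", shQuote(path.expand(repodir)), bucket_uri)
}

# Hypothetical helper: insert a package file, then sync the repository to S3.
# Assumes drat is installed and the aws CLI is configured.
publish_package <- function(file, repodir, bucket_uri) {
  drat::insertPackage(file = file, repodir = repodir)
  status <- system2("aws", s3_sync_args(repodir, bucket_uri))
  if (status != 0) stop("aws s3 sync failed with exit status ", status)
  invisible(status)
}

# Example call (not run):
# publish_package("~/Downloads/toy_0.1.0.tar.gz",
#                 repodir = "~/drat-tutorial",
#                 bucket_uri = "s3://drat-tutorial/repo")
```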

Now that the files are in place, we can add our remote repository to the list of R repositories in our R session. First, we remove the LocalRepo repository that we had added earlier, which points to the folder on our local filesystem.

options(repos = getOption("repos")[
  setdiff(names(getOption("repos")), "LocalRepo")
])

Then we add the remote repository instead, by pointing to the URL of the S3 bucket 8.

drat::addRepo("S3repo", "http://drat-tutorial.s3.us-west-1.amazonaws.com/repo/")
getOption("repos")
                                                   CRAN 
                          "https://cloud.r-project.org" 
                                                 S3repo 
"http://drat-tutorial.s3.us-west-1.amazonaws.com/repo/" 

Let’s try to install the toy package from our S3 drat repository:

install.packages("toy")
Installing package into '/Users/sandmann/Library/R/x86_64/4.2/library'
(as 'lib' is unspecified)

The downloaded binary packages are in
    /var/folders/wc/9tswmr4s74s0x90wqh2007300000gp/T//Rtmpg6tHJA/downloaded_packages

Success! R has successfully connected to the remote repository and installed the (binary) R package.
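The repos option is set per R session, so other users (or a fresh session) need to register the repository again. One possible approach, sketched here with base R only, is to add a snippet like the following to the .Rprofile file. The repository name and URL are the ones used in this tutorial; substitute your own.

```r
# Could be placed in ~/.Rprofile; uses base R only
local({
  repos <- getOption("repos")
  # Add (or overwrite) the entry for the S3-hosted repository
  repos["S3repo"] <- "http://drat-tutorial.s3.us-west-1.amazonaws.com/repo/"
  options(repos = repos)
})
getOption("repos")
```

With this in place, install.packages() and update.packages() consult the S3 repository alongside CRAN without any further setup.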

Conclusions

  • The drat R package makes it extremely simple to create a CRAN-like repository.
  • The static files can be served via HTTP, making it straightforward to host the repository e.g. in an AWS S3 bucket with a restrictive access policy.

Appendix

Creating and configuring an S3 bucket to host static files

The following steps briefly outline how to create and configure an S3 bucket to act as a static web server via the AWS web interface (i.e. the AWS Console). For more details, please read the AWS S3 documentation and / or consult your local AWS expert.

Warning

Storing files on AWS S3 is not free. In this tutorial, we only upload a limited number of small files, but please don’t forget to purge them from your AWS account afterward.

  1. Create a new bucket (skip if you already have one)

    • Make sure you create the bucket in the region that works best for your organization (e.g. us-west-1 if you want to host your files in California).

    • You do not need to enable public access, stick to the defaults for your organization.

    Create an S3 bucket

  2. Next, navigate to your bucket’s properties,

    Bucket properties

    scroll all the way to the bottom of the page and enable Static website hosting.

    Enable static hosting

    (Typically) specify index.html as the Index document.

    Define the index document

  3. Under the Permissions tab, add a bucket policy that makes your content available within your organization

    Warning

    These settings determine who can access your files. Proceed with caution to avoid inadvertently exposing your data to the world!

    For example, the following policy grants read access to all files in the s3://drat-tutorial/ bucket to requests originating (only) from the 192.0.2.0 IP address. (Your own configuration will be different, of course.)

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "PublicReadGetObject",
          "Effect": "Allow",
          "Principal": "*",
          "Action": [
            "s3:GetObject"
          ],
          "Resource": [
            "arn:aws:s3:::drat-tutorial/*"
          ],
          "Condition": {
            "IpAddress": {
              "aws:SourceIp": "192.0.2.0/32"
            }
          }
        }
      ]
    }

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

Footnotes

  1. Alternatively, you can also create the bundle from within R using the devtools::build() command.↩︎

  2. You might want to add this option to your .Rprofile file.↩︎

  3. Of course, you can place it anywhere you like, including e.g. network drives, as long as you can write to the directory. If you are using Windows, note that R accepts forward slashes in file paths; backslashes must be doubled (escaped) inside R strings.↩︎

  4. In this tutorial, I use the :: notation to highlight which package a function originates from. Because we attached the package with the library(drat) command before, the drat:: prefix could be omitted.↩︎

  5. In this tutorial, I use the :: notation to highlight which package a function originates from. Because we attached the package with the library(drat) command before, the drat:: prefix could be omitted.↩︎

  6. If you use Bioconductor, the BiocManager::repositories() specifies additional repositories that host its annotation and software packages.↩︎

  7. Installation instructions for the aws command line interface.↩︎

  8. You can look up the URL for your bucket in the AWS S3 console: S3 URL↩︎