Analyzing targeted locus amplification (TLA) data

tl;dr

This week, I had the chance to learn about a technique that was new to me: targeted locus amplification (TLA), initially described by de Vree et al, Nature Biotechnology in 2014¹.

Starting with a set of PCR primers, it captures sequences up to ~ 100 kb on either side of them. This targeted library preparation method can be used e.g. to identify the insertion site of transgenes, or to study structural variation of specific genes of interest.

Here, I am taking a closer look at published TLA data that reveals details about commonly used transgenic mouse models.

A basic overview of TLA

The goal of the TLA library preparation method is to capture genomic fragments that are in close proximity to a known sequence. The following steps are illustrated in the great figure by de Vree et al shown below²:

Protocol details

The protocol starts with a cell pellet, resuspended in PBS with 10% BSA.
The chromatin is cross-linked with formaldehyde, locking DNA fragments that are in close proximity to each other in place.
Next, the cells are lysed and the DNA is digested with NlaIII, a restriction enzyme that recognizes a 4 base pair sequence. Because these 4-mer motifs appear frequently in the genome this digestion step leads to short fragments.
Afterwards, the restriction enzyme is heat inactivated and the DNA ends are ligated, creating large DNA circles composed of multiple crosslinked restriction fragments. At this point, the cross-links are reversed and the DNA is purified using proteinase K digestions and phenol chloroform extraction.
The next step differentiates TLA from other protocols (e.g. Chromosome conformation capture (4C)): the DNA circles are digested once more with NspI a restriction enzyme that recognizes a 6-mer motif and creates overhangs that are compatible with those generated by NlaIII. This “limited trimming” step releases random fragments of the larger DNA circles, placing segments that were further apart within the same molecule closer to each other.
Afterwards, the circles are closed with another ligation step and purified once again to obtain the templates for an inverse PCR reaction.
Circules that contain the fragment of interest (also know as the anchor sequence) are selectively PCR-amplified with inverse PCR primers. In contrast to a typical PCR reaction, the primers face outwards, amplifying the entire circle.
The final PCR product contain the anchor sequence as well as assorted sequences in close proximity (e.g. within kilobase range) of the primers. The second digestion step (step 7, see above) ensures that both regions immediately adjacent to the anchor and those further removed are included.
These PCR products are now entered into a standard DNA sequencing library prepration, and can be analyzed using short or long read platforms.

Figure 1 by de Vree et al

Characterizing transgenic mouse models

Transgenic mouse models, e.g. engineered to carry human or synthetic genomic regions, are key to basic and translational research. For example, Alzforum has collated information about ~ 300 mouse strains used to study Alzheimer’s Disease (AD), Parkinson’s Disease (PD) or Amyotrophic lateral sclerosis (ALS).

Goodwin et al, Genome Research, 2017 used TLA to identify the insertion sites of 40 frequently used transgenic mouse lines. Many of these lines were created before high throughput sequencing became widely available, and information about the exact sequences that had been transferred and the insertion site(s) was missing.

The authors designed (often multiple) primer pairs for each mouse strain, performed TLA and sequenced the resulting (inverse) PCR products using an Illumina MiSeq instrument.

They describe the results of their analysis, e.g. the coordinates of the insertion into the mouse (GRCm38 / mm10) genome in Table 1, and also describe the resulting consequences (e.g. structural variations) they observed ³.

Yet, the authors do not report the coordinates of the donor sequence, e.g. the transgenic DNA that was inserted into the mouse genome. Luckily, the LTA data can identify both the insertion side and the likely sequence of origin.

Here, I am taking a closer look at the TLA results for one mouse strain, carrying the genomic region encompassing the human LRRK2 gene.

Note

In this post, I am interactively using both the linux shell (bash) and the R console. I have indicated the respective interface in each of the code blocks below.

Installing bwa, samtools and deeptool with pixi

To get ready for analyzing the raw sequencing data, I chose the pixi package manager to install different bioinformatics tools.

Installing pixi on my Linux system was straightforward, and I followed the instructions on the pixi homepage.

# excuted in the linux shell
> curl -fsSL https://pixi.sh/install.sh | bash

Next, I created a new directory for my analysis and initated a pixi project.

# excuted in the linux shell
> mkdir tla
> cd tla
> pixi init -c bioconda -c conda-forge
> pixi add bwa samtools deeptools
> ls -a 

.  ..  .gitattributes  .gitignore  index.qmd  .pixi  pixi.lock  pixi.toml

Tip

The pixi init command creates the pixi.toml manifest file in the current working directory, with the following content:

[project]
authors = ["Thomas Sandmann <tomsing1@gmail.com>"]
channels = ["bioconda", "conda-forge"]
name = "tla"
platforms = ["linux-64"]
version = "0.1.0"

[tasks]

[dependencies]

The pixi add bwa samtools deeptool command retrieves the specified tools - and their dependencies - from bioconda or conda-forge

It also updates the [dependencies] section of the pixi.toml file, so it now includes all three tools as dependencies:

[dependencies]
bwa = ">=0.7.18,<0.8"
samtools = ">=1.9,<2"
deeptools = ">=3.5.4,<4"

In addition, the pixi.lock file recorded the exact version, URL of origin, checksums, and other details about each software package that was installed.

The software tools are installed with in the (hidden) pixi sub-directory, which constitutes the environment for this project.

pixi caches each installed tools automatically, so previously downloaded copies are not duplicated.

The pixi shell command activates the pixi environment for interactive use:

# excuted in the linux shell
> pixi shell
> samtools --version

samtools 1.9
Using htslib 1.9
Copyright (C) 2018 Genome Research Ltd.

Retrieving raw sequencing reads from the ENA

The European Nucleotide Archive (ENA), which mirrors datasets submitted to the NCBI as well, makes the raw MiSeq data deposited by Goodwin et al available under accession SRP156273 / PRJNA483560.

FASTQ files from 69 paired-end sequencing runs are available, and I downloaded the file report table as TSV file PRJNA483560.tsv.

Most of the 40 examined lines are associated with a single sequencing run, but a subset has been analyzed multiple times (with different primer pairs).

For example, the Lrrk2*G2019S line (lower case) has 4 pairs of FASTQ files available, and the LRRK2*G2019S (upper case letters) has 6 pairs ⁴

Let’s read the TSV file I downloaded from ENA into R and explore it:

# executed in an R session in the project directory
files <- read.delim("PRJNA483560.tsv")
local({
  tab <- table(files$sample_title)
  tab[grep("Lrrk2", names(tab), ignore.case = TRUE)]
})


       LRRK2 Lrrk2*G2019S LRRK2*G2019S LRRK2*R1441G 
           2            4            6            6

The sample_alias column contains the mouse strain identifier (prefixed with TLA-), so we can look up additional details about each on the Jackson Laboratory’s (JAX) website, e.g. for the two lines mentioned above:

LRRK2*G2019S (strain 18785):
- A bacterial artificial chromosome (BAC) with the entire human LRRK2 (leucine-rich repeat kinase 2) gene was modified by targeted mutation of the LRRK2 locus to harbor the LRRK2 G2019S mutation.
- This modified 188 kb BAC was microinjected into fertilized C57BL/6J zygotes. Founder BAC LRRK2*G2019S mice were established and then subsequently maintained by breeding transgenic mice with C57BL/6J (Stock No. 000664) to establish the colony.
- The transgene integrated on chromosome 2. (Please note that Goodwin et al’s Table 1 mistakenly places this insertion on chromosome 1, but their Supplementary Table S1 shows the correct position on chromosome 2 instead.)
Lrrk2*G2019S (12467)
- A bacterial artificial chromosome (BAC) (#RP23-312I9) containing the entire mouse Lrrk2 (leucine-rich repeat kinase 2) gene was modified by targeted mutation of the Lrrk2 locus to harbor the Lrrk2-G2019S mutation.
- This modified 240 kb BAC, containing a FLAG-tag downstream from the start codon, was microinjected into fertilized B6C3 F1 oocytes.
- FLAG-Lrrk2-G2019S founder line 2 was established and then subsequently maintained by breeding transgenic mice with C57BL/6J inbred mice for many generations prior to arrival at The Jackson Laboratory.
- Upon arrival, mice were bred to C57BL/6J inbred mice (Stock No. 000664) for at least one generation to establish the colony.

# executed in an R session in the project directory
head(files)[, c("sample_title", "sample_alias", "fastq_ftp")]

    sample_title sample_alias
1        3XTg-AD     TLA-4807
2      CAG-FCGRT     TLA-4919
3        3XTg-AD     TLA-4807
4           MAPT     TLA-5491
5 Cspg4-cre/Esr1     TLA-8538
6 Cspg4-cre/Esr1     TLA-8538
                                                                                                                                          fastq_ftp
1 ftp.sra.ebi.ac.uk/vol1/fastq/SRR764/001/SRR7641541/SRR7641541_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR764/001/SRR7641541/SRR7641541_2.fastq.gz
2 ftp.sra.ebi.ac.uk/vol1/fastq/SRR764/003/SRR7641543/SRR7641543_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR764/003/SRR7641543/SRR7641543_2.fastq.gz
3 ftp.sra.ebi.ac.uk/vol1/fastq/SRR764/004/SRR7641544/SRR7641544_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR764/004/SRR7641544/SRR7641544_2.fastq.gz
4 ftp.sra.ebi.ac.uk/vol1/fastq/SRR764/006/SRR7641546/SRR7641546_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR764/006/SRR7641546/SRR7641546_2.fastq.gz
5 ftp.sra.ebi.ac.uk/vol1/fastq/SRR764/008/SRR7641548/SRR7641548_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR764/008/SRR7641548/SRR7641548_2.fastq.gz
6 ftp.sra.ebi.ac.uk/vol1/fastq/SRR764/009/SRR7641549/SRR7641549_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR764/009/SRR7641549/SRR7641549_2.fastq.gz

LRRK2*G2019S: human LRRK2 locus integrated into the mouse genome

Let’s take a closer look at line LRRK2*G2019S, which carries a 188 kb BAC with the human LRRK2 locus on (mouse) chromosome 2 ⁵.

There are two FTP URLs, one for the R1 and one for the R2 read of the FASTQ pair, for each of the 6 sequencing runs; let’s split them into separate strings.

# executed in an R session in the project directory
urls <- with(files[files$sample_title == "LRRK2*G2019S", ],
             split(fastq_ftp, run_accession)) |>
  lapply(\(x) strsplit(x, split = ";")[[1]])
urls

$SRR7641557
[1] "ftp.sra.ebi.ac.uk/vol1/fastq/SRR764/007/SRR7641557/SRR7641557_1.fastq.gz"
[2] "ftp.sra.ebi.ac.uk/vol1/fastq/SRR764/007/SRR7641557/SRR7641557_2.fastq.gz"

$SRR7641558
[1] "ftp.sra.ebi.ac.uk/vol1/fastq/SRR764/008/SRR7641558/SRR7641558_1.fastq.gz"
[2] "ftp.sra.ebi.ac.uk/vol1/fastq/SRR764/008/SRR7641558/SRR7641558_2.fastq.gz"

$SRR7641591
[1] "ftp.sra.ebi.ac.uk/vol1/fastq/SRR764/001/SRR7641591/SRR7641591_1.fastq.gz"
[2] "ftp.sra.ebi.ac.uk/vol1/fastq/SRR764/001/SRR7641591/SRR7641591_2.fastq.gz"

$SRR7641592
[1] "ftp.sra.ebi.ac.uk/vol1/fastq/SRR764/002/SRR7641592/SRR7641592_1.fastq.gz"
[2] "ftp.sra.ebi.ac.uk/vol1/fastq/SRR764/002/SRR7641592/SRR7641592_2.fastq.gz"

$SRR7641593
[1] "ftp.sra.ebi.ac.uk/vol1/fastq/SRR764/003/SRR7641593/SRR7641593_1.fastq.gz"
[2] "ftp.sra.ebi.ac.uk/vol1/fastq/SRR764/003/SRR7641593/SRR7641593_2.fastq.gz"

$SRR7641594
[1] "ftp.sra.ebi.ac.uk/vol1/fastq/SRR764/004/SRR7641594/SRR7641594_1.fastq.gz"
[2] "ftp.sra.ebi.ac.uk/vol1/fastq/SRR764/004/SRR7641594/SRR7641594_2.fastq.gz"

Downloading reference sequences from Ensembl

Here, we already know that the BAC inserted on chromosome 2 of the mouse genome, so we can download that specific chromosome and align the sequencing reads to it. (In the absence of prior knowledge, we would align to the entire mouse reference genome instead.)

Ensembl provides genome sequences for download on its ftp server.

The following command downloads mouse chromosome 2 (GRCm38), where Goodwin et al reported the insertion to be located, and human chromosome 12 (GRCh38), where the source LRRK2 locus resides

# executed in an R session in the project directory
ref_folder <- "ref_folder"
dir.create(ref_folder, showWarnings = FALSE)

local({
  # 38 Mb
  url <- paste0("https://ftp.ensembl.org/pub/release-113/fasta/",
                "homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.12.fa.gz")
  download.file(url, destfile = file.path(ref_folder, "human.fa.gz"))
  # 52 Mb
  url <- paste0("https://ftp.ensembl.org/pub/release-102/fasta/",
                "mus_musculus/dna/Mus_musculus.GRCm38.dna.chromosome.2.fa.gz")
  download.file(url, destfile = file.path(ref_folder, "mouse.fa.gz"))
})

Next, we create an index for the bwa aligner for each of the two references (using the pixi shell) we started above.

Tip

I chose a minimal bioinformatics workflow for this analysis, e.g. without first trimming the reads (e.g. using fastp or trim galore ) as I am merely interested in spotting regions with the largest coverage across the mouse genome. Other applications of TLA, e.g. resequencing of cancer-associated genes or structural variant calling, likely require greater care. This overview by Solvias provides a great introduction to TLA analysis, including multiple examples of structural variants and how they can be identified using a genome browser.

# excuted in the linux shell
pushd ref_folder
bwa index -p human human.fa.gz
bwa index -p mouse mouse.fa.gz
popd

Next, we are ready to retrieve the FASTQ files from ENA and align them to each of the references (mouse and human).

Download FASTQ files

Here, I am using a few lines of R code to download each of the FASTQ files:

# executed in an R session in the project directory
dir.create("fastq", showWarnings = FALSE)
for (url in unlist(urls)) {
  download.file(url, destfile = file.path("fastq", basename(url)), 
                method = "curl", quiet = TRUE)
}

Important

Usually, we map the R1 and R2 reads from a paired-end sequencing run together, e.g. instruct the aligner to map them as a physical pair. But because TLA uses inverse PCR to generate the fragments of interest, the R1 and R2 reads don’t face each other, but point away from each other instead. We therefore map them as two sets of single-end reads instead.

Because we don’t want to align the R1 and R2 reads as pairs, we can simply concatenate them (e.g. with the zcat command). To visualize the alignments in a genome browser (see below) we use samtools to sort them by position and create the .bai index file.

Because we are mainly interested in visualizing the coverage of different genomic regions, not individual alignments, we use the bamCoverage tools from the deeptools suite to generate smaller bigWig (.bw) files as well.

# excuted in the linux shell
mkdir -p alignments
for SAM in SRR7641557 SRR7641558 SRR7641591 SRR7641592 SRR7641593 SRR7641594
do
  # align to human chromosome
  bwa mem \
    -t 4 \
    ref_folder/human \
    <(zcat fastq/${SAM}*.fastq.gz) | \
  samtools sort -O BAM > alignments/${SAM}_human.bam
  samtools index alignments/${SAM}_human.bam
  bamCoverage \
    --normalizeUsing None \
    -b alignments/${SAM}_human.bam \
    -o alignments/${SAM}_human.bw

  # align to mouse chromosome
  bwa mem \
    -t 4 \
    ref_folder/mouse \
    <(zcat fastq/${SAM}*.fastq.gz) | \
  samtools sort -O BAM > alignments/${SAM}_mouse.bam
  samtools index alignments/${SAM}_mouse.bam
  bamCoverage \
    --normalizeUsing None \
    -b alignments/${SAM}_mouse.bam \
    -o alignments/${SAM}_mouse.bw 
done

Visual exploration of the alignments

Now it’s time to explore the results visually, using e.g. the Broad Institute’s IGV genome browser. As we have aligned the reads to both the human and the mouse genome (separately), we can visualize alignment coverage in both species. Let’s start with the mouse genome - and identify the integration site of the human BAC.

Alignments to the mouse genome

According to Goodwin et al’s Supplementary Table 1 the integration site of this BAC is at position chr2:80896405-80896406 of the GRCm38 / mm10 version of the mouse genome. Let’s zoom into that region.

Note

The alignment coverage at the insertion site, e.g. close to the anchor region, is very high. The y-axis of the IGV tracks has been set to include at least the range from 0 to 10,000 - or to autoscale if the maximum exceeds 10,000.

In three of the six sequencing runs (each representing a different primer pair / anchor region within the transgene), we observe three high coverage peaks close to (but not exactly at) the reported insertion site.

Zooming into the two peaks on the left reveals a very similar shape of the peak, with an dip (e.g. low coverage) at the center.

Note

Solvias great summary of TLA analysis includes examples of coverage peaks pointing towards different types of insertion events. For example, Figure 11 of their report (shown below) indicates that a gap within the coverage peak is consistent with a deletion in the target genome.

Interestingly, we another sequencing run (track 4) shows very high coverage in a different location on chromosome 2:

If this is a real signal, it might indicate either a second insertion - or the presence of a more complex structural variation caused by the (single) insertion event closeby.

Alignments to the human genome

Next, we switch to the human (GRCh38) genome, to examine the LRRK2 region from which the transgene originates. Goodwin et al do not report the coordinates of the source region.

When we zoom into the LRRK2 region on human chromosome 12, three of the six tracks show very highly concentrated coverage (tracks 1-3). The other three tracks show lower coverage over roughly the same range of coordinates, delineating the extend of the BAC sequence that was introduced into the mouse strain:

If we rescale the first three tracks, setting the maximum y-axis value to 1,000, we see that these sequencing runs also yielded alignments that span the same range (albeit with much more uneven coverage):

Variant calling

The strain includes the LRRK2 G2019S Parkinson’s Disease risk variant (e.g. SNP rs34637584). Let’s examine its position (chr12:40340400) and verify that the expected A allele is present (and not the G reference allele that is included in the human reference genome).

Satisfyingly, the targeted sequencing covers the variant position and confirms that the LRRK2 G2019S (G => T) variant is indeed present in this mouse strain: .

Conclusion

This brief exploration of published TLA data introduced the laboratory protocol and a (rough and tumble) bioinformatics analysis. Aligning the reads to the human genome revealed the start and end coordinates of the donor BAC sequence that was introduced to the LRRK2*G2019S mouse model, and it also validated the presence of the LRRK2 G2019S variant.

Interestingly, examination of the alignments to the mouse genome pinpointed a likely insertion site close to but not identical to the coordinates reported by Goodwin et al. The presence of a second coverage peak highlights that full understanding of the potential structural variations is complicated - and that the use of multiple anchor regions (e.g. inverse PCR pairs) is desirable.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Footnotes

The full text of the article is available via ResearchGate.↩︎
The full protocol developed by de Vree et al is included in the supplementary information alongside helpful illustrations and numerous examples of TLA applied to different use cases.↩︎
Supplementary Table S1 has even more information about each strain.)↩︎
Each sequencing run was generated with a different set of primer pairs, which the authors provide in Supplementary Table S4 of the paper ↩︎
Goodwin et al reported the position of the insertion using GRCm38 / mm10 coordinates. For simplicity, we will use the same genome release here.)↩︎

--- title: "Analyzing targeted locus amplification (TLA) data" subtitle: "Identifying transgene insertions in a commonly used mouse model" author: "Thomas Sandmann" date: "2025-02-25" freeze: true categories: [NGS, TIL] editor: markdown: wrap: 72 format: html: toc: true toc-depth: 4 code-tools: source: true toggle: false caption: none editor_options: chunk_output_type: console --- ## tl;dr This week, I had the chance to learn about a technique that was new to me: targeted locus amplification (TLA), initially described by [de Vree et al, Nature Biotechnology](https://pubmed.ncbi.nlm.nih.gov/25129690/) in 2014[^1]. Starting with a set of PCR primers, it captures sequences up to ~ 100 kb on either side of them. This targeted library preparation method can be used e.g. to identify the insertion site of transgenes, or to study structural variation of specific genes of interest. Here, I am taking a closer look at published TLA data that reveals details about commonly used transgenic mouse models. [^1]: The full text of the article is available via [ResearchGate](https://www.researchgate.net/publication/267624474_Targeted_sequencing_by_proximity_ligation_for_comprehensive_variant_detection_and_local_haplotyping). ## A basic overview of TLA The goal of the TLA library preparation method is to capture genomic fragments that are in close proximity to a known sequence. The following steps are illustrated in the great figure by de Vree et al shown below[^2]: [^2]: The full protocol developed by de Vree et al is included in the [supplementary information](https://static-content.springer.com/esm/art%3A10.1038%2Fnbt.2959/MediaObjects/41587_2014_BFnbt2959_MOESM1_ESM.pdf) alongside helpful illustrations and numerous examples of TLA applied to different use cases. ::: {.callout-note collapse="true"} ### Protocol details 1. The protocol starts with a cell pellet, resuspended in PBS with 10% BSA. 2. The chromatin is cross-linked with formaldehyde, locking DNA fragments that are in close proximity to each other in place. 3. Next, the cells are lysed and the DNA is digested with [NlaIII](https://www.neb.com/en-us/products/r0125-nlaiii), a restriction enzyme that recognizes a 4 base pair sequence. Because these 4-mer motifs appear frequently in the genome this digestion step leads to short fragments. 4. Afterwards, the restriction enzyme is heat inactivated and the DNA ends are ligated, creating large DNA circles composed of multiple crosslinked restriction fragments. At this point, the cross-links are reversed and the DNA is purified using proteinase K digestions and phenol chloroform extraction. 5. The next step differentiates TLA from other protocols (e.g. Chromosome conformation capture (4C)): the DNA circles are digested once more with [NspI](https://www.neb.com/en-us/products/r0602-nspi) a restriction enzyme that recognizes a 6-mer motif and creates overhangs that are compatible with those generated by NlaIII. This "limited trimming" step releases random fragments of the larger DNA circles, placing segments that were further apart within the same molecule closer to each other. 6. Afterwards, the circles are closed with another ligation step and purified once again to obtain the templates for an *inverse PCR* reaction. 7. Circules that contain the fragment of interest (also know as the _anchor sequence_) are selectively PCR-amplified with inverse PCR primers. In contrast to a typical PCR reaction, the primers face outwards, amplifying the entire circle. 8. The final PCR product contain the anchor sequence as well as assorted sequences in close proximity (e.g. within kilobase range) of the primers. The second digestion step (step 7, see above) ensures that both regions immediately adjacent to the anchor and those further removed are included. 9. These PCR products are now entered into a standard DNA sequencing library prepration, and can be analyzed using short or long read platforms. ::: ![Figure 1 by de Vree et al](https://www.researchgate.net/profile/Birgit-Sikkema-Raddatz/publication/267624474/figure/fig3/AS:267552181977096@1440800819110/Targeted-sequencing-using-TLA-a-Neighboring-sequences-that-form-a-gene-or-genetic.png) ## Characterizing transgenic mouse models Transgenic mouse models, e.g. engineered to carry human or synthetic genomic regions, are key to basic and translational research. For example, Alzforum has collated [information about ~ 300 mouse strains](https://www.alzforum.org/research-models) used to study Alzheimer's Disease (AD), Parkinson's Disease (PD) or Amyotrophic lateral sclerosis (ALS). [Goodwin et al, Genome Research, 2017](https://pubmed.ncbi.nlm.nih.gov/30659012/) used TLA to identify the insertion sites of 40 frequently used transgenic mouse lines. Many of these lines were created before high throughput sequencing became widely available, and information about the exact sequences that had been transferred and the insertion site(s) was missing. The authors designed (often multiple) primer pairs for each mouse strain, performed TLA and sequenced the resulting (inverse) PCR products using an Illumina MiSeq instrument. They describe the results of their analysis, e.g. the coordinates of the insertion into the mouse (GRCm38 / mm10) genome in [Table 1](https://genome.cshlp.org/content/29/3/494/T1.expansion.html), and also describe the resulting consequences (e.g. structural variations) they observed [^3]. [^3]: [Supplementary Table S1](https://genome.cshlp.org/content/suppl/2019/02/11/gr.233866.117.DC1/Supplemental_Table_S1.xlsx) has even more information about each strain.) Yet, the authors do not report the coordinates of the donor sequence, e.g. the transgenic DNA that was inserted into the mouse genome. Luckily, the LTA data can identify _both_ the insertion side and the likely sequence of origin. Here, I am taking a closer look at the TLA results for one mouse strain, carrying the genomic region encompassing the human _LRRK2_ gene. ::: {.callout-note collapse="false"} In this post, I am interactively using both the linux shell (bash) and the R console. I have indicated the respective interface in each of the code blocks below. ::: ## Installing bwa, samtools and deeptool with pixi To get ready for analyzing the raw sequencing data, I chose the `pixi` package manager to install different bioinformatics tools. Installing `pixi` on my Linux system was straightforward, and I followed [the instructions on the pixi homepage](https://pixi.sh/latest/#installation). ```bash # excuted in the linux shell > curl -fsSL https://pixi.sh/install.sh | bash ``` Next, I created a new directory for my analysis and initated a `pixi` project. ```bash # excuted in the linux shell > mkdir tla > cd tla > pixi init -c bioconda -c conda-forge > pixi add bwa samtools deeptools > ls -a . .. .gitattributes .gitignore index.qmd .pixi pixi.lock pixi.toml ``` ::: {.callout-tip collapse="true"} The `pixi init` command creates the `pixi.toml` [manifest file](https://pixi.sh/latest/reference/pixi_manifest/) in the current working directory, with the following content: ``` [project] authors = ["Thomas Sandmann <tomsing1@gmail.com>"] channels = ["bioconda", "conda-forge"] name = "tla" platforms = ["linux-64"] version = "0.1.0" [tasks] [dependencies] ``` The `pixi add bwa samtools deeptool` command retrieves the specified tools - and their dependencies - from [bioconda](https://bioconda.github.io/) or [conda-forge](https://conda-forge.org/) It also updates the `[dependencies]` section of the `pixi.toml` file, so it now includes all three tools as dependencies: ``` [dependencies] bwa = ">=0.7.18,<0.8" samtools = ">=1.9,<2" deeptools = ">=3.5.4,<4" ``` In addition, the [pixi.lock file](https://pixi.sh/latest/features/lockfile/#what-is-a-lock-file) recorded the exact version, URL of origin, checksums, and other details about each software package that was installed. The software tools are installed with in the (hidden) `pixi` sub-directory, which constitutes the [environment](https://pixi.sh/latest/features/environment/#environments) for this project. `pixi` [caches](https://pixi.sh/latest/features/environment/#caching-packages) each installed tools automatically, so previously downloaded copies are not duplicated. ::: The `pixi shell` command activates the `pixi` environment for interactive use: ```bash # excuted in the linux shell > pixi shell > samtools --version samtools 1.9 Using htslib 1.9 Copyright (C) 2018 Genome Research Ltd. ``` ## Retrieving raw sequencing reads from the ENA The [European Nucleotide Archive (ENA)](https://www.ebi.ac.uk/ena/browser/), which mirrors datasets submitted to the NCBI as well, makes the raw MiSeq data deposited by Goodwin et al available under accession [SRP156273 / PRJNA483560](https://www.ebi.ac.uk/ena/browser/view/PRJNA483560). FASTQ files from 69 paired-end sequencing runs are available, and I downloaded the `file report` table as TSV file `PRJNA483560.tsv`. Most of the 40 examined lines are associated with a single sequencing run, but a subset has been analyzed multiple times (with different primer pairs). For example, the `Lrrk2*G2019S` line (lower case) has 4 pairs of FASTQ files available, and the `LRRK2*G2019S` (upper case letters) has 6 pairs [^4] [^4]: Each sequencing run was generated with a different set of primer pairs, which the authors provide in [Supplementary Table S4 of the paper](https://genome.cshlp.org/content/suppl/2019/02/11/gr.233866.117.DC1/Supplemental_Table_S4.xlsx) Let's read the TSV file I downloaded from ENA into R and explore it: ```{r} # executed in an R session in the project directory files <- read.delim("PRJNA483560.tsv") local({ tab <- table(files$sample_title) tab[grep("Lrrk2", names(tab), ignore.case = TRUE)] }) ``` The `sample_alias` column contains the mouse strain identifier (prefixed with `TLA-`), so we can look up additional details about each on the Jackson Laboratory's (JAX) website, e.g. for the two lines mentioned above: - [LRRK2*G2019S (strain 18785)](https://www.jax.org/strain/018785): - A bacterial artificial chromosome (BAC) with the entire **human LRRK2** (leucine-rich repeat kinase 2) gene was modified by targeted mutation of the LRRK2 locus to harbor the LRRK2 G2019S mutation. - This modified **188 kb BAC** was microinjected into fertilized C57BL/6J zygotes. Founder BAC LRRK2*G2019S mice were established and then subsequently maintained by breeding transgenic mice with C57BL/6J (Stock No. 000664) to establish the colony. - The transgene integrated on chromosome 2. (Please note that Goodwin et al's Table 1 mistakenly places this insertion on chromosome 1, but their Supplementary Table S1 shows the correct position on chromosome 2 instead.) - [Lrrk2*G2019S (12467)](https://www.jax.org/strain/012467) - A bacterial artificial chromosome (BAC) (#RP23-312I9) containing the entire **mouse Lrrk2** (leucine-rich repeat kinase 2) gene was modified by targeted mutation of the Lrrk2 locus to harbor the Lrrk2-G2019S mutation. - This modified **240 kb BAC**, containing a FLAG-tag downstream from the start codon, was microinjected into fertilized B6C3 F1 oocytes. - FLAG-Lrrk2-G2019S founder line 2 was established and then subsequently maintained by breeding transgenic mice with C57BL/6J inbred mice for many generations prior to arrival at The Jackson Laboratory. - Upon arrival, mice were bred to C57BL/6J inbred mice (Stock No. 000664) for at least one generation to establish the colony. ```{r} # executed in an R session in the project directory head(files)[, c("sample_title", "sample_alias", "fastq_ftp")] ``` ## LRRK2*G2019S: human LRRK2 locus integrated into the mouse genome Let's take a closer look at line `LRRK2*G2019S`, which carries a 188 kb BAC with the _human_ LRRK2 locus on (mouse) chromosome 2 [^5]. [^5]: Goodwin et al reported the position of the insertion using GRCm38 / mm10 coordinates. For simplicity, we will use the same genome release here.) There are two FTP URLs, one for the R1 and one for the R2 read of the FASTQ pair, for each of the 6 sequencing runs; let's split them into separate strings. ```{r} # executed in an R session in the project directory urls <- with(files[files$sample_title == "LRRK2*G2019S", ], split(fastq_ftp, run_accession)) |> lapply(\(x) strsplit(x, split = ";")[[1]]) urls ``` ### Downloading reference sequences from Ensembl Here, we already know that the BAC inserted on chromosome 2 of the mouse genome, so we can download that specific chromosome and align the sequencing reads to it. (In the absence of prior knowledge, we would align to the entire mouse reference genome instead.) Ensembl provides genome sequences for download on its [ftp server](https://www.ensembl.org/info/data/ftp/index.html). The following command downloads mouse chromosome 2 (GRCm38), where Goodwin et al reported the insertion to be located, and human chromosome 12 (GRCh38), where the source [LRRK2 locus resides](https://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000188906;r=12:40196744-40369285) ```{r} #| eval: false # executed in an R session in the project directory ref_folder <- "ref_folder" dir.create(ref_folder, showWarnings = FALSE) local({ # 38 Mb url <- paste0("https://ftp.ensembl.org/pub/release-113/fasta/", "homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.12.fa.gz") download.file(url, destfile = file.path(ref_folder, "human.fa.gz")) # 52 Mb url <- paste0("https://ftp.ensembl.org/pub/release-102/fasta/", "mus_musculus/dna/Mus_musculus.GRCm38.dna.chromosome.2.fa.gz") download.file(url, destfile = file.path(ref_folder, "mouse.fa.gz")) }) ``` Next, we create an index for the `bwa` aligner for each of the two references (using the `pixi shell`) we started above. ::: {.callout-tip collapse="false"} I chose a minimal bioinformatics workflow for this analysis, e.g. without first trimming the reads (e.g. using [fastp](https://github.com/OpenGene/fastp) or [trim galore](https://github.com/FelixKrueger/TrimGalore) ) as I am merely interested in spotting regions with the largest coverage across the mouse genome. Other applications of TLA, e.g. resequencing of cancer-associated genes or structural variant calling, likely require greater care. [This overview by Solvias](https://www.solvias.com/wp-content/uploads/2024/03/Cergentis-Manual-Introduction-to-the-terminology-and-methods-used-in-TLA-analyses.pdf) provides a great introduction to TLA analysis, including multiple examples of structural variants and how they can be identified using a genome browser. ::: ```bash # excuted in the linux shell pushd ref_folder bwa index -p human human.fa.gz bwa index -p mouse mouse.fa.gz popd ``` Next, we are ready to retrieve the FASTQ files from ENA and align them to each of the references (mouse and human). ### Download FASTQ files Here, I am using a few lines of R code to download each of the FASTQ files: ```{r} #| eval: false # executed in an R session in the project directory dir.create("fastq", showWarnings = FALSE) for (url in unlist(urls)) { download.file(url, destfile = file.path("fastq", basename(url)), method = "curl", quiet = TRUE) } ``` ::: {.callout-important collapse="false"} Usually, we map the R1 and R2 reads from a paired-end sequencing run together, e.g. instruct the aligner to map them as a physical pair. But because TLA uses _inverse PCR_ to generate the fragments of interest, the R1 and R2 reads don't face each other, but point away from each other instead. We therefore map them as two sets of single-end reads instead. ::: Because we don't want to align the R1 and R2 reads as pairs, we can simply concatenate them (e.g. with the `zcat` command). To visualize the alignments in a genome browser (see below) we use `samtools` to sort them by position and create the `.bai` index file. Because we are mainly interested in visualizing the coverage of different genomic regions, not individual alignments, we use the `bamCoverage` tools from the [deeptools](https://test-argparse-readoc.readthedocs.io/en/latest/index.html) suite to generate smaller `bigWig` (`.bw`) files as well. ```bash # excuted in the linux shell mkdir -p alignments for SAM in SRR7641557 SRR7641558 SRR7641591 SRR7641592 SRR7641593 SRR7641594 do # align to human chromosome bwa mem \ -t 4 \ ref_folder/human \ <(zcat fastq/${SAM}*.fastq.gz) | \ samtools sort -O BAM > alignments/${SAM}_human.bam samtools index alignments/${SAM}_human.bam bamCoverage \ --normalizeUsing None \ -b alignments/${SAM}_human.bam \ -o alignments/${SAM}_human.bw # align to mouse chromosome bwa mem \ -t 4 \ ref_folder/mouse \ <(zcat fastq/${SAM}*.fastq.gz) | \ samtools sort -O BAM > alignments/${SAM}_mouse.bam samtools index alignments/${SAM}_mouse.bam bamCoverage \ --normalizeUsing None \ -b alignments/${SAM}_mouse.bam \ -o alignments/${SAM}_mouse.bw done ``` ## Visual exploration of the alignments Now it's time to explore the results visually, using e.g. the Broad Institute's [IGV genome browser](https://igv.org/). As we have aligned the reads to both the human and the mouse genome (separately), we can visualize alignment coverage in both species. Let's start with the _mouse_ genome - and identify the integration site of the human BAC. ### Alignments to the mouse genome According to Goodwin et al's [Supplementary Table 1](https://genome.cshlp.org/content/suppl/2019/02/11/gr.233866.117.DC1/Supplemental_Table_S1.xlsx) the integration site of this BAC is at position `chr2:80896405-80896406` of the GRCm38 / mm10 version of the mouse genome. Let's zoom into that region. ::: {.callout-note collapse="false"} The alignment coverage at the insertion site, e.g. close to the anchor region, is very high. The y-axis of the IGV tracks has been set to include at least the range from 0 to 10,000 - or to autoscale if the maximum exceeds 10,000. ::: In three of the six sequencing runs (each representing a different primer pair / anchor region within the transgene), we observe three high coverage peaks close to (but not exactly at) the reported insertion site. ![Mouse genome (Chr2 region)](igv/mouse_chr2.jpg) Zooming into the two peaks on the left reveals a very similar shape of the peak, with an dip (e.g. low coverage) at the center. ![Mouse genome (Chr2 region), zoomed](igv/mouse_chr2_left.jpg) ::: {.callout-note collapse="true"} Solvias [great summary of TLA analysis](https://www.solvias.com/wp-content/uploads/2024/03/Cergentis-Manual-Introduction-to-the-terminology-and-methods-used-in-TLA-analyses.pdf) includes examples of coverage peaks pointing towards different types of insertion events. For example, Figure 11 of their report (shown below) indicates that a gap within the coverage peak is consistent with a deletion in the target genome. ![Solvias, Figure 11](igv/solvias_fig11.jpg) ::: Interestingly, we another sequencing run (track 4) shows very high coverage in a different location on chromosome 2: ![Mouse genome (Chr2 region), right](igv/mouse_chr2_right.jpg) If this is a real signal, it might indicate either a second insertion - or the presence of a more complex structural variation caused by the (single) insertion event closeby. ### Alignments to the human genome Next, we switch to the human (GRCh38) genome, to examine the _LRRK2_ region from which the transgene originates. Goodwin et al do not report the coordinates of the source region. When we zoom into the _LRRK2_ region on human chromosome 12, three of the six tracks show very highly concentrated coverage (tracks 1-3). The other three tracks show lower coverage over roughly the same range of coordinates, delineating the extend of the BAC sequence that was introduced into the mouse strain: ![Human LRRK2 region](igv/human_LRRK2.jpg) If we rescale the first three tracks, setting the maximum y-axis value to 1,000, we see that these sequencing runs also yielded alignments that span the same range (albeit with much more uneven coverage): ![Human LRRK2 region](igv/human_LRRK2_rescaled.jpg) ### Variant calling The strain includes the LRRK2 G2019S Parkinson's Disease risk variant (e.g. SNP [rs34637584](https://www.ncbi.nlm.nih.gov/snp/rs34637584)). Let's examine its position (chr12:40340400) and verify that the expected `A` allele is present (and not the `G` reference allele that is included in the human reference genome). Satisfyingly, the targeted sequencing covers the variant position and confirms that the LRRK2 G2019S (G => T) variant is indeed present in this mouse strain: ![LRRK2 G2019S](igv/LRRK2_G2019S.png). ## Conclusion This brief exploration of published TLA data introduced the laboratory protocol and a (rough and tumble) bioinformatics analysis. Aligning the reads to the human genome revealed the start and end coordinates of the donor BAC sequence that was introduced to the `LRRK2*G2019S` mouse model, and it also validated the presence of the LRRK2 G2019S variant. Interestingly, examination of the alignments to the mouse genome pinpointed a likely insertion site close to but not identical to the coordinates reported by Goodwin et al. The presence of a second coverage peak highlights that full understanding of the potential structural variations is complicated - and that the use of multiple anchor regions (e.g. inverse PCR pairs) is desirable.