Identifying transgene insertions in a commonly used mouse model
NGS
TIL
Author
Thomas Sandmann
Published
February 25, 2025
tl;dr
This week, I had the chance to learn about a technique that was new to me: targeted locus amplification (TLA), initially described by de Vree et al, Nature Biotechnology in 20141.
Starting with a set of PCR primers, it captures sequences up to ~ 100 kb on either side of them. This targeted library preparation method can be used e.g. to identify the insertion site of transgenes, or to study structural variation of specific genes of interest.
Here, I am taking a closer look at published TLA data that reveals details about commonly used transgenic mouse models.
A basic overview of TLA
The goal of the TLA library preparation method is to capture genomic fragments that are in close proximity to a known sequence. The following steps are illustrated in the great figure by de Vree et al shown below2:
Protocol details
The protocol starts with a cell pellet, resuspended in PBS with 10% BSA.
The chromatin is cross-linked with formaldehyde, locking DNA fragments that are in close proximity to each other in place.
Next, the cells are lysed and the DNA is digested with NlaIII, a restriction enzyme that recognizes a 4 base pair sequence. Because these 4-mer motifs appear frequently in the genome this digestion step leads to short fragments.
Afterwards, the restriction enzyme is heat inactivated and the DNA ends are ligated, creating large DNA circles composed of multiple crosslinked restriction fragments. At this point, the cross-links are reversed and the DNA is purified using proteinase K digestions and phenol chloroform extraction.
The next step differentiates TLA from other protocols (e.g. Chromosome conformation capture (4C)): the DNA circles are digested once more with NspI a restriction enzyme that recognizes a 6-mer motif and creates overhangs that are compatible with those generated by NlaIII. This “limited trimming” step releases random fragments of the larger DNA circles, placing segments that were further apart within the same molecule closer to each other.
Afterwards, the circles are closed with another ligation step and purified once again to obtain the templates for an inverse PCR reaction.
Circules that contain the fragment of interest (also know as the anchor sequence) are selectively PCR-amplified with inverse PCR primers. In contrast to a typical PCR reaction, the primers face outwards, amplifying the entire circle.
The final PCR product contain the anchor sequence as well as assorted sequences in close proximity (e.g. within kilobase range) of the primers. The second digestion step (step 7, see above) ensures that both regions immediately adjacent to the anchor and those further removed are included.
These PCR products are now entered into a standard DNA sequencing library prepration, and can be analyzed using short or long read platforms.
Figure 1 by de Vree et al
Characterizing transgenic mouse models
Transgenic mouse models, e.g. engineered to carry human or synthetic genomic regions, are key to basic and translational research. For example, Alzforum has collated information about ~ 300 mouse strains used to study Alzheimer’s Disease (AD), Parkinson’s Disease (PD) or Amyotrophic lateral sclerosis (ALS).
Goodwin et al, Genome Research, 2017 used TLA to identify the insertion sites of 40 frequently used transgenic mouse lines. Many of these lines were created before high throughput sequencing became widely available, and information about the exact sequences that had been transferred and the insertion site(s) was missing.
The authors designed (often multiple) primer pairs for each mouse strain, performed TLA and sequenced the resulting (inverse) PCR products using an Illumina MiSeq instrument.
They describe the results of their analysis, e.g. the coordinates of the insertion into the mouse (GRCm38 / mm10) genome in Table 1, and also describe the resulting consequences (e.g. structural variations) they observed 3.
Yet, the authors do not report the coordinates of the donor sequence, e.g. the transgenic DNA that was inserted into the mouse genome. Luckily, the LTA data can identify both the insertion side and the likely sequence of origin.
Here, I am taking a closer look at the TLA results for one mouse strain, carrying the genomic region encompassing the human LRRK2 gene.
Note
In this post, I am interactively using both the linux shell (bash) and the R console. I have indicated the respective interface in each of the code blocks below.
Installing bwa, samtools and deeptool with pixi
To get ready for analyzing the raw sequencing data, I chose the pixi package manager to install different bioinformatics tools.
In addition, the pixi.lock file recorded the exact version, URL of origin, checksums, and other details about each software package that was installed.
The software tools are installed with in the (hidden) pixi sub-directory, which constitutes the environment for this project.
pixicaches each installed tools automatically, so previously downloaded copies are not duplicated.
The pixi shell command activates the pixi environment for interactive use:
# excuted in the linux shell> pixi shell> samtools --versionsamtools 1.9Using htslib 1.9Copyright(C)2018 Genome Research Ltd.
FASTQ files from 69 paired-end sequencing runs are available, and I downloaded the file report table as TSV file PRJNA483560.tsv.
Most of the 40 examined lines are associated with a single sequencing run, but a subset has been analyzed multiple times (with different primer pairs).
For example, the Lrrk2*G2019S line (lower case) has 4 pairs of FASTQ files available, and the LRRK2*G2019S (upper case letters) has 6 pairs 4
Let’s read the TSV file I downloaded from ENA into R and explore it:
# executed in an R session in the project directoryfiles <-read.delim("PRJNA483560.tsv")local({ tab <-table(files$sample_title) tab[grep("Lrrk2", names(tab), ignore.case =TRUE)]})
The sample_alias column contains the mouse strain identifier (prefixed with TLA-), so we can look up additional details about each on the Jackson Laboratory’s (JAX) website, e.g. for the two lines mentioned above:
A bacterial artificial chromosome (BAC) with the entire human LRRK2 (leucine-rich repeat kinase 2) gene was modified by targeted mutation of the LRRK2 locus to harbor the LRRK2 G2019S mutation.
This modified 188 kb BAC was microinjected into fertilized C57BL/6J zygotes. Founder BAC LRRK2*G2019S mice were established and then subsequently maintained by breeding transgenic mice with C57BL/6J (Stock No. 000664) to establish the colony.
The transgene integrated on chromosome 2. (Please note that Goodwin et al’s Table 1 mistakenly places this insertion on chromosome 1, but their Supplementary Table S1 shows the correct position on chromosome 2 instead.)
A bacterial artificial chromosome (BAC) (#RP23-312I9) containing the entire mouse Lrrk2 (leucine-rich repeat kinase 2) gene was modified by targeted mutation of the Lrrk2 locus to harbor the Lrrk2-G2019S mutation.
This modified 240 kb BAC, containing a FLAG-tag downstream from the start codon, was microinjected into fertilized B6C3 F1 oocytes.
FLAG-Lrrk2-G2019S founder line 2 was established and then subsequently maintained by breeding transgenic mice with C57BL/6J inbred mice for many generations prior to arrival at The Jackson Laboratory.
Upon arrival, mice were bred to C57BL/6J inbred mice (Stock No. 000664) for at least one generation to establish the colony.
# executed in an R session in the project directoryhead(files)[, c("sample_title", "sample_alias", "fastq_ftp")]
LRRK2*G2019S: human LRRK2 locus integrated into the mouse genome
Let’s take a closer look at line LRRK2*G2019S, which carries a 188 kb BAC with the human LRRK2 locus on (mouse) chromosome 2 5.
There are two FTP URLs, one for the R1 and one for the R2 read of the FASTQ pair, for each of the 6 sequencing runs; let’s split them into separate strings.
# executed in an R session in the project directoryurls <-with(files[files$sample_title =="LRRK2*G2019S", ],split(fastq_ftp, run_accession)) |>lapply(\(x) strsplit(x, split =";")[[1]])urls
Here, we already know that the BAC inserted on chromosome 2 of the mouse genome, so we can download that specific chromosome and align the sequencing reads to it. (In the absence of prior knowledge, we would align to the entire mouse reference genome instead.)
Ensembl provides genome sequences for download on its ftp server.
The following command downloads mouse chromosome 2 (GRCm38), where Goodwin et al reported the insertion to be located, and human chromosome 12 (GRCh38), where the source LRRK2 locus resides
# executed in an R session in the project directoryref_folder <-"ref_folder"dir.create(ref_folder, showWarnings =FALSE)local({# 38 Mb url <-paste0("https://ftp.ensembl.org/pub/release-113/fasta/","homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.12.fa.gz")download.file(url, destfile =file.path(ref_folder, "human.fa.gz"))# 52 Mb url <-paste0("https://ftp.ensembl.org/pub/release-102/fasta/","mus_musculus/dna/Mus_musculus.GRCm38.dna.chromosome.2.fa.gz")download.file(url, destfile =file.path(ref_folder, "mouse.fa.gz"))})
Next, we create an index for the bwa aligner for each of the two references (using the pixi shell) we started above.
Tip
I chose a minimal bioinformatics workflow for this analysis, e.g. without first trimming the reads (e.g. using fastp or trim galore ) as I am merely interested in spotting regions with the largest coverage across the mouse genome. Other applications of TLA, e.g. resequencing of cancer-associated genes or structural variant calling, likely require greater care. This overview by Solvias provides a great introduction to TLA analysis, including multiple examples of structural variants and how they can be identified using a genome browser.
# excuted in the linux shellpushd ref_folderbwa index -p human human.fa.gzbwa index -p mouse mouse.fa.gzpopd
Next, we are ready to retrieve the FASTQ files from ENA and align them to each of the references (mouse and human).
Download FASTQ files
Here, I am using a few lines of R code to download each of the FASTQ files:
# executed in an R session in the project directorydir.create("fastq", showWarnings =FALSE)for (url inunlist(urls)) {download.file(url, destfile =file.path("fastq", basename(url)), method ="curl", quiet =TRUE)}
Important
Usually, we map the R1 and R2 reads from a paired-end sequencing run together, e.g. instruct the aligner to map them as a physical pair. But because TLA uses inverse PCR to generate the fragments of interest, the R1 and R2 reads don’t face each other, but point away from each other instead. We therefore map them as two sets of single-end reads instead.
Because we don’t want to align the R1 and R2 reads as pairs, we can simply concatenate them (e.g. with the zcat command). To visualize the alignments in a genome browser (see below) we use samtools to sort them by position and create the .bai index file.
Because we are mainly interested in visualizing the coverage of different genomic regions, not individual alignments, we use the bamCoverage tools from the deeptools suite to generate smaller bigWig (.bw) files as well.
# excuted in the linux shellmkdir-p alignmentsfor SAM in SRR7641557 SRR7641558 SRR7641591 SRR7641592 SRR7641593 SRR7641594do# align to human chromosomebwa mem \-t 4 \ ref_folder/human \<(zcat fastq/${SAM}*.fastq.gz)|\samtools sort -O BAM > alignments/${SAM}_human.bamsamtools index alignments/${SAM}_human.bambamCoverage\--normalizeUsing None \-b alignments/${SAM}_human.bam \-o alignments/${SAM}_human.bw# align to mouse chromosomebwa mem \-t 4 \ ref_folder/mouse \<(zcat fastq/${SAM}*.fastq.gz)|\samtools sort -O BAM > alignments/${SAM}_mouse.bamsamtools index alignments/${SAM}_mouse.bambamCoverage\--normalizeUsing None \-b alignments/${SAM}_mouse.bam \-o alignments/${SAM}_mouse.bw done
Visual exploration of the alignments
Now it’s time to explore the results visually, using e.g. the Broad Institute’s IGV genome browser. As we have aligned the reads to both the human and the mouse genome (separately), we can visualize alignment coverage in both species. Let’s start with the mouse genome - and identify the integration site of the human BAC.
Alignments to the mouse genome
According to Goodwin et al’s Supplementary Table 1 the integration site of this BAC is at position chr2:80896405-80896406 of the GRCm38 / mm10 version of the mouse genome. Let’s zoom into that region.
Note
The alignment coverage at the insertion site, e.g. close to the anchor region, is very high. The y-axis of the IGV tracks has been set to include at least the range from 0 to 10,000 - or to autoscale if the maximum exceeds 10,000.
In three of the six sequencing runs (each representing a different primer pair / anchor region within the transgene), we observe three high coverage peaks close to (but not exactly at) the reported insertion site.
Zooming into the two peaks on the left reveals a very similar shape of the peak, with an dip (e.g. low coverage) at the center.
Mouse genome (Chr2 region), zoomed
Note
Solvias great summary of TLA analysis includes examples of coverage peaks pointing towards different types of insertion events. For example, Figure 11 of their report (shown below) indicates that a gap within the coverage peak is consistent with a deletion in the target genome.
Solvias, Figure 11
Interestingly, we another sequencing run (track 4) shows very high coverage in a different location on chromosome 2:
Mouse genome (Chr2 region), right
If this is a real signal, it might indicate either a second insertion - or the presence of a more complex structural variation caused by the (single) insertion event closeby.
Alignments to the human genome
Next, we switch to the human (GRCh38) genome, to examine the LRRK2 region from which the transgene originates. Goodwin et al do not report the coordinates of the source region.
When we zoom into the LRRK2 region on human chromosome 12, three of the six tracks show very highly concentrated coverage (tracks 1-3). The other three tracks show lower coverage over roughly the same range of coordinates, delineating the extend of the BAC sequence that was introduced into the mouse strain:
Human LRRK2 region
If we rescale the first three tracks, setting the maximum y-axis value to 1,000, we see that these sequencing runs also yielded alignments that span the same range (albeit with much more uneven coverage):
Human LRRK2 region
Variant calling
The strain includes the LRRK2 G2019S Parkinson’s Disease risk variant (e.g. SNP rs34637584). Let’s examine its position (chr12:40340400) and verify that the expected A allele is present (and not the G reference allele that is included in the human reference genome).
Satisfyingly, the targeted sequencing covers the variant position and confirms that the LRRK2 G2019S (G => T) variant is indeed present in this mouse strain: .
Conclusion
This brief exploration of published TLA data introduced the laboratory protocol and a (rough and tumble) bioinformatics analysis. Aligning the reads to the human genome revealed the start and end coordinates of the donor BAC sequence that was introduced to the LRRK2*G2019S mouse model, and it also validated the presence of the LRRK2 G2019S variant.
Interestingly, examination of the alignments to the mouse genome pinpointed a likely insertion site close to but not identical to the coordinates reported by Goodwin et al. The presence of a second coverage peak highlights that full understanding of the potential structural variations is complicated - and that the use of multiple anchor regions (e.g. inverse PCR pairs) is desirable.
The full text of the article is available via ResearchGate.↩︎
The full protocol developed by de Vree et al is included in the supplementary information alongside helpful illustrations and numerous examples of TLA applied to different use cases.↩︎
Goodwin et al reported the position of the insertion using GRCm38 / mm10 coordinates. For simplicity, we will use the same genome release here.)↩︎
Source Code
---title: "Analyzing targeted locus amplification (TLA) data"subtitle: "Identifying transgene insertions in a commonly used mouse model"author: "Thomas Sandmann"date: "2025-02-25"freeze: truecategories: [NGS, TIL]editor: markdown: wrap: 72format: html: toc: true toc-depth: 4 code-tools: source: true toggle: false caption: noneeditor_options: chunk_output_type: console---## tl;drThis week, I had the chance to learn about a technique that was new to me:targeted locus amplification (TLA), initially described by [de Vree et al, Nature Biotechnology](https://pubmed.ncbi.nlm.nih.gov/25129690/)in 2014[^1].Starting with a set of PCR primers, it captures sequences up to ~ 100 kb on either side of them. This targeted library preparation method can be used e.g.to identify the insertion site of transgenes, or to study structural variationof specific genes of interest.Here, I am taking a closer look at published TLA data that reveals details aboutcommonly used transgenic mouse models.[^1]: The full text of the article is available via [ResearchGate](https://www.researchgate.net/publication/267624474_Targeted_sequencing_by_proximity_ligation_for_comprehensive_variant_detection_and_local_haplotyping).## A basic overview of TLAThe goal of the TLA library preparation method is to capture genomic fragmentsthat are in close proximity to a known sequence. The following steps are illustrated in the great figure by de Vree et al shown below[^2]:[^2]: The full protocol developed by de Vree et al is included in the[supplementary information](https://static-content.springer.com/esm/art%3A10.1038%2Fnbt.2959/MediaObjects/41587_2014_BFnbt2959_MOESM1_ESM.pdf) alongside helpful illustrations andnumerous examples of TLA applied to different use cases.::: {.callout-note collapse="true"}### Protocol details1. The protocol starts with a cell pellet, resuspended in PBS with 10% BSA.2. The chromatin is cross-linked with formaldehyde, locking DNA fragments that are in close proximity to each other in place.3. Next, the cells are lysed and the DNA is digested with [NlaIII](https://www.neb.com/en-us/products/r0125-nlaiii), a restriction enzyme that recognizes a 4 base pair sequence. Because these 4-mer motifs appear frequently in the genome this digestion step leads to short fragments. 4. Afterwards, the restriction enzyme is heat inactivated and the DNA ends are ligated, creating large DNA circles composed of multiple crosslinked restriction fragments. At this point, the cross-links are reversed and the DNA is purified using proteinase K digestions and phenol chloroform extraction.5. The next step differentiates TLA from other protocols (e.g. Chromosome conformation capture (4C)): the DNA circles are digested once more with [NspI](https://www.neb.com/en-us/products/r0602-nspi) a restriction enzyme that recognizes a 6-mer motif and creates overhangs that are compatible with those generated by NlaIII. This "limited trimming" step releases random fragments of the larger DNA circles, placing segments that were further apart within the same molecule closer to each other.6. Afterwards, the circles are closed with another ligation step and purified once again to obtain the templates for an *inverse PCR* reaction.7. Circules that contain the fragment of interest (also know as the _anchor sequence_) are selectively PCR-amplified with inverse PCR primers. In contrast to a typical PCR reaction, the primers face outwards, amplifying the entire circle.8. The final PCR product contain the anchor sequence as well as assorted sequences in close proximity (e.g. within kilobase range) of the primers. The second digestion step (step 7, see above) ensures that both regions immediately adjacent to the anchor and those further removed are included.9. These PCR products are now entered into a standard DNA sequencing library prepration, and can be analyzed using short or long read platforms.:::## Characterizing transgenic mouse modelsTransgenic mouse models, e.g. engineered to carry human or synthetic genomicregions, are key to basic and translational research. For example,Alzforum has collated [information about ~ 300 mouse strains](https://www.alzforum.org/research-models)used to study Alzheimer's Disease (AD), Parkinson's Disease (PD) or Amyotrophic lateral sclerosis (ALS).[Goodwin et al, Genome Research, 2017](https://pubmed.ncbi.nlm.nih.gov/30659012/)used TLA to identify the insertion sites of 40 frequently used transgenic mouselines. Many of these lines were created before high throughput sequencing becamewidely available, and information about the exact sequences that had beentransferred and the insertion site(s) was missing.The authors designed (often multiple) primer pairs for each mouse strain,performed TLA and sequenced the resulting (inverse) PCR products using an Illumina MiSeq instrument.They describe the results of their analysis, e.g. the coordinates of theinsertion into the mouse (GRCm38 / mm10) genome in [Table 1](https://genome.cshlp.org/content/29/3/494/T1.expansion.html), andalso describe the resulting consequences (e.g. structural variations) theyobserved [^3].[^3]: [Supplementary Table S1](https://genome.cshlp.org/content/suppl/2019/02/11/gr.233866.117.DC1/Supplemental_Table_S1.xlsx)has even more information about each strain.)Yet, the authors do not report the coordinates of the donor sequence, e.g. thetransgenic DNA that was inserted into the mouse genome. Luckily, the LTA datacan identify _both_ the insertion side and the likely sequence of origin.Here, I am taking a closer look at the TLA results for one mouse strain,carrying the genomic region encompassing the human _LRRK2_ gene.::: {.callout-note collapse="false"}In this post, I am interactively using both the linux shell (bash) and the R console. I have indicated the respective interface in each of thecode blocks below.:::## Installing bwa, samtools and deeptool with pixiTo get ready for analyzing the raw sequencing data, I chose the `pixi` package manager to install different bioinformatics tools.Installing `pixi` on my Linux system was straightforward, and I followed [the instructions on the pixi homepage](https://pixi.sh/latest/#installation).```bash# excuted in the linux shell> curl -fsSL https://pixi.sh/install.sh |bash```Next, I created a new directory for my analysis and initated a `pixi` project.```bash# excuted in the linux shell> mkdir tla> cd tla> pixi init-c bioconda -c conda-forge> pixi add bwa samtools deeptools> ls -a. .. .gitattributes .gitignore index.qmd .pixi pixi.lock pixi.toml```::: {.callout-tip collapse="true"}The `pixi init` command creates the `pixi.toml`[manifest file](https://pixi.sh/latest/reference/pixi_manifest/)in the current working directory, with the following content:```[project]authors = ["Thomas Sandmann <tomsing1@gmail.com>"]channels = ["bioconda", "conda-forge"]name = "tla"platforms = ["linux-64"]version = "0.1.0"[tasks][dependencies]```The `pixi add bwa samtools deeptool` command retrieves the specified tools - andtheir dependencies - from[bioconda](https://bioconda.github.io/) or[conda-forge](https://conda-forge.org/)It also updates the `[dependencies]` section of the `pixi.toml` file, so itnow includes all three tools as dependencies:```[dependencies]bwa = ">=0.7.18,<0.8"samtools = ">=1.9,<2"deeptools = ">=3.5.4,<4"```In addition, the [pixi.lock file](https://pixi.sh/latest/features/lockfile/#what-is-a-lock-file)recorded the exact version, URL of origin, checksums, and other details abouteach software package that was installed.The software tools are installed with in the (hidden) `pixi` sub-directory,which constitutes the [environment](https://pixi.sh/latest/features/environment/#environments)for this project. `pixi`[caches](https://pixi.sh/latest/features/environment/#caching-packages)each installed tools automatically, so previously downloaded copies are not duplicated.:::The `pixi shell` command activates the `pixi` environment for interactive use:```bash# excuted in the linux shell> pixi shell> samtools --versionsamtools 1.9Using htslib 1.9Copyright(C)2018 Genome Research Ltd.```## Retrieving raw sequencing reads from the ENAThe [European Nucleotide Archive (ENA)](https://www.ebi.ac.uk/ena/browser/),which mirrors datasets submitted to the NCBI as well, makes the raw MiSeqdata deposited by Goodwin et al available under accession [SRP156273 / PRJNA483560](https://www.ebi.ac.uk/ena/browser/view/PRJNA483560).FASTQ files from 69 paired-end sequencing runs are available, and I downloadedthe `file report` table as TSV file `PRJNA483560.tsv`.Most of the 40 examined lines are associated with a single sequencing run,but a subset has been analyzed multiple times (with different primer pairs).For example, the `Lrrk2*G2019S` line (lower case) has 4 pairs of FASTQ files available, and the `LRRK2*G2019S` (upper case letters) has 6 pairs [^4][^4]: Each sequencing run was generated with a different set of primer pairs,which the authors provide in [Supplementary Table S4 of the paper](https://genome.cshlp.org/content/suppl/2019/02/11/gr.233866.117.DC1/Supplemental_Table_S4.xlsx)Let's read the TSV file I downloaded from ENA into R and explore it:```{r}# executed in an R session in the project directoryfiles <-read.delim("PRJNA483560.tsv")local({ tab <-table(files$sample_title) tab[grep("Lrrk2", names(tab), ignore.case =TRUE)]})```The `sample_alias` column contains the mouse strain identifier (prefixed with `TLA-`), so we can look up additional details about each on the Jackson Laboratory's (JAX) website, e.g. for the two lines mentioned above:- [LRRK2*G2019S (strain 18785)](https://www.jax.org/strain/018785): - A bacterial artificial chromosome (BAC) with the entire **human LRRK2** (leucine-rich repeat kinase 2) gene was modified by targeted mutation of the LRRK2 locus to harbor the LRRK2 G2019S mutation. - This modified **188 kb BAC** was microinjected into fertilized C57BL/6J zygotes. Founder BAC LRRK2*G2019S mice were established and then subsequently maintained by breeding transgenic mice with C57BL/6J (Stock No. 000664) to establish the colony. - The transgene integrated on chromosome 2. (Please note that Goodwin et al's Table 1 mistakenly places this insertion on chromosome 1, but their Supplementary Table S1 shows the correct position on chromosome 2 instead.)- [Lrrk2*G2019S (12467)](https://www.jax.org/strain/012467) - A bacterial artificial chromosome (BAC) (#RP23-312I9) containing the entire **mouse Lrrk2** (leucine-rich repeat kinase 2) gene was modified by targeted mutation of the Lrrk2 locus to harbor the Lrrk2-G2019S mutation. - This modified **240 kb BAC**, containing a FLAG-tag downstream from the start codon, was microinjected into fertilized B6C3 F1 oocytes. - FLAG-Lrrk2-G2019S founder line 2 was established and then subsequently maintained by breeding transgenic mice with C57BL/6J inbred mice for many generations prior to arrival at The Jackson Laboratory. - Upon arrival, mice were bred to C57BL/6J inbred mice (Stock No. 000664) for at least one generation to establish the colony.```{r}# executed in an R session in the project directoryhead(files)[, c("sample_title", "sample_alias", "fastq_ftp")]```## LRRK2*G2019S: human LRRK2 locus integrated into the mouse genomeLet's take a closer look at line `LRRK2*G2019S`, which carries a 188 kb BACwith the _human_ LRRK2 locus on (mouse) chromosome 2 [^5].[^5]: Goodwin et al reported the position of the insertion using GRCm38 / mm10coordinates. For simplicity, we will use the same genome release here.)There are two FTP URLs, one for the R1 and one for the R2 read of the FASTQ pair, for each of the 6 sequencing runs; let's split them into separatestrings.```{r}# executed in an R session in the project directoryurls <-with(files[files$sample_title =="LRRK2*G2019S", ],split(fastq_ftp, run_accession)) |>lapply(\(x) strsplit(x, split =";")[[1]])urls```### Downloading reference sequences from EnsemblHere, we already know that the BAC inserted on chromosome 2 of the mouse genome,so we can download that specific chromosome and align the sequencing reads toit. (In the absence of prior knowledge, we would align to the entire mousereference genome instead.)Ensembl provides genome sequences for download on its[ftp server](https://www.ensembl.org/info/data/ftp/index.html).The following command downloads mouse chromosome 2 (GRCm38), where Goodwin et alreported the insertion to be located, and human chromosome 12 (GRCh38), where the source[LRRK2 locus resides](https://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000188906;r=12:40196744-40369285)```{r}#| eval: false# executed in an R session in the project directoryref_folder <-"ref_folder"dir.create(ref_folder, showWarnings =FALSE)local({# 38 Mb url <-paste0("https://ftp.ensembl.org/pub/release-113/fasta/","homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.12.fa.gz")download.file(url, destfile =file.path(ref_folder, "human.fa.gz"))# 52 Mb url <-paste0("https://ftp.ensembl.org/pub/release-102/fasta/","mus_musculus/dna/Mus_musculus.GRCm38.dna.chromosome.2.fa.gz")download.file(url, destfile =file.path(ref_folder, "mouse.fa.gz"))})```Next, we create an index for the `bwa` aligner for each of the two references(using the `pixi shell`) we started above.::: {.callout-tip collapse="false"}I chose a minimal bioinformatics workflow for this analysis, e.g. withoutfirst trimming the reads (e.g. using [fastp](https://github.com/OpenGene/fastp) or [trim galore](https://github.com/FelixKrueger/TrimGalore)) as I am merely interested in spotting regions with the largest coverageacross the mouse genome. Other applications of TLA, e.g. resequencing ofcancer-associated genes or structural variant calling, likely require greatercare. [This overview by Solvias](https://www.solvias.com/wp-content/uploads/2024/03/Cergentis-Manual-Introduction-to-the-terminology-and-methods-used-in-TLA-analyses.pdf) provides agreat introduction to TLA analysis, including multiple examples of structuralvariants and how they can be identified using a genome browser.:::```bash# excuted in the linux shellpushd ref_folderbwa index -p human human.fa.gzbwa index -p mouse mouse.fa.gzpopd```Next, we are ready to retrieve the FASTQ files from ENA and align them to each of the references (mouse and human). ### Download FASTQ filesHere, I am using a few lines of R code to download each of the FASTQ files:```{r}#| eval: false# executed in an R session in the project directorydir.create("fastq", showWarnings =FALSE)for (url inunlist(urls)) {download.file(url, destfile =file.path("fastq", basename(url)), method ="curl", quiet =TRUE)}```::: {.callout-important collapse="false"}Usually, we map the R1 and R2 reads from a paired-end sequencing run together,e.g. instruct the aligner to map them as a physical pair. But because TLAuses _inverse PCR_ to generate the fragments of interest, the R1 and R2 readsdon't face each other, but point away from each other instead. We thereforemap them as two sets of single-end reads instead. :::Because we don't want to align the R1 and R2 reads as pairs, we can simplyconcatenate them (e.g. with the `zcat` command). To visualize the alignments ina genome browser (see below) we use `samtools` to sort them by position and create the `.bai` index file.Because we are mainly interested in visualizing the coverage of differentgenomic regions, not individual alignments, we use the `bamCoverage` tools from the [deeptools](https://test-argparse-readoc.readthedocs.io/en/latest/index.html)suite to generate smaller `bigWig` (`.bw`) files as well.```bash# excuted in the linux shellmkdir-p alignmentsfor SAM in SRR7641557 SRR7641558 SRR7641591 SRR7641592 SRR7641593 SRR7641594do# align to human chromosomebwa mem \-t 4 \ ref_folder/human \<(zcat fastq/${SAM}*.fastq.gz)|\samtools sort -O BAM > alignments/${SAM}_human.bamsamtools index alignments/${SAM}_human.bambamCoverage\--normalizeUsing None \-b alignments/${SAM}_human.bam \-o alignments/${SAM}_human.bw# align to mouse chromosomebwa mem \-t 4 \ ref_folder/mouse \<(zcat fastq/${SAM}*.fastq.gz)|\samtools sort -O BAM > alignments/${SAM}_mouse.bamsamtools index alignments/${SAM}_mouse.bambamCoverage\--normalizeUsing None \-b alignments/${SAM}_mouse.bam \-o alignments/${SAM}_mouse.bw done```## Visual exploration of the alignmentsNow it's time to explore the results visually, using e.g. the Broad Institute's[IGV genome browser](https://igv.org/). As we have aligned the reads to boththe human and the mouse genome (separately), we can visualize alignment coveragein both species. Let's start with the _mouse_ genome - and identify the integration site of the human BAC.### Alignments to the mouse genomeAccording to Goodwin et al's[Supplementary Table 1](https://genome.cshlp.org/content/suppl/2019/02/11/gr.233866.117.DC1/Supplemental_Table_S1.xlsx)the integration site of this BAC is at position `chr2:80896405-80896406` of theGRCm38 / mm10 version of the mouse genome. Let's zoom into that region.::: {.callout-note collapse="false"}The alignment coverage at the insertion site, e.g. close to the anchor region,is very high. The y-axis of the IGV tracks has been set to include at leastthe range from 0 to 10,000 - or to autoscale if the maximum exceeds 10,000.:::In three of the six sequencing runs (each representing a different primer pair /anchor region within the transgene), we observe three high coverage peaks closeto (but not exactly at) the reported insertion site.Zooming into the two peaks on the left reveals a very similar shape of the peak,with an dip (e.g. low coverage) at the center.::: {.callout-note collapse="true"}Solvias [great summary of TLA analysis](https://www.solvias.com/wp-content/uploads/2024/03/Cergentis-Manual-Introduction-to-the-terminology-and-methods-used-in-TLA-analyses.pdf)includes examples of coverage peaks pointing towards different types ofinsertion events. For example, Figure 11 of their report (shown below) indicatesthat a gap within the coverage peak is consistent with a deletion in thetarget genome.:::Interestingly, we another sequencing run (track 4) shows very high coveragein a different location on chromosome 2:If this is a real signal, it might indicate either a second insertion - or thepresence of a more complex structural variation caused by the (single) insertionevent closeby.### Alignments to the human genomeNext, we switch to the human (GRCh38) genome, to examine the _LRRK2_ regionfrom which the transgene originates. Goodwin et al do not report the coordinatesof the source region.When we zoom into the _LRRK2_ region on human chromosome 12, three of thesix tracks show very highly concentrated coverage (tracks 1-3). The other threetracks show lower coverage over roughly the same range of coordinates, delineating the extend of the BAC sequence that was introduced into the mousestrain:If we rescale the first three tracks, setting the maximum y-axis value to 1,000,we see that these sequencing runs also yielded alignments that span thesame range (albeit with much more uneven coverage):### Variant callingThe strain includes the LRRK2 G2019S Parkinson's Disease risk variant (e.g. SNP [rs34637584](https://www.ncbi.nlm.nih.gov/snp/rs34637584)). Let'sexamine its position (chr12:40340400) and verify that the expected `A` alleleis present (and not the `G` reference allele that is included in the human reference genome).Satisfyingly, the targeted sequencing covers the variant position and confirmsthat the LRRK2 G2019S (G => T) variant is indeed present in this mouse strain:.## ConclusionThis brief exploration of published TLA data introduced the laboratory protocoland a (rough and tumble) bioinformatics analysis. Aligning the reads to the human genome revealed the start and end coordinates of the donor BAC sequencethat was introduced to the `LRRK2*G2019S` mouse model, and it also validatedthe presence of the LRRK2 G2019S variant.Interestingly, examination of the alignments to the mouse genome pinpointed alikely insertion site close to but not identical to the coordinates reported byGoodwin et al. The presence of a second coverage peak highlights that fullunderstanding of the potential structural variations is complicated - and thatthe use of multiple anchor regions (e.g. inverse PCR pairs) is desirable.