Getting Started

FinaleDB stands for FragmentatIoN AnaLysis of cEll-free DNA DataBase. It is a comprehensive database of cell-free DNA (cfDNA) fragmentation patterns, hosting 2579 uniformly processed and quality-controlled paired-end cfDNA whole-genome sequencing (WGS) datasets from 2505 samples across 23 pathological conditions, all drawn from the existing public domain.

cfDNA fragmentation patterns have attracted considerable interest in the research community. Fragmentation is a non-random process guided by nucleosomes. Its patterns have been shown to be closely related to diseases such as cancer, and they have potential as blood-based biomarkers that provide diagnostic and prognostic insight into a number of pathological conditions.

However, a centralized database hosting publicly available cfDNA fragmentation data does not yet exist. In addition, to protect patients' genotype information, which fragmentation analysis does not actually need, databases such as dbGaP and EGA require a special application procedure for data access, and obtaining approval can be burdensome. Lastly, different cfDNA studies usually employ different data processing workflows, which can make the resulting cfDNA datasets inconsistent with one another.

FinaleDB curates cfDNA fragmentation data by collecting published WGS datasets and removing the original sequences to de-identify the sensitive genotype information. All datasets are uniformly processed by a workflow developed in-house (see Data collection and processing). FinaleDB users can conveniently browse, query, and visualize the datasets through the FinaleDB web portal.

Data collection and processing

We collected a total of 2579 paired-end cfDNA WGS datasets from 2505 samples in 8 published studies, covering 23 different pathological conditions, including breast cancer, prostate cancer, liver cancer, lung cancer, head and neck cancer, multiple sclerosis, and lupus, to name a few.

We collected the raw paired-end sequencing data from GEO, dbGaP, and EGA, and then processed them with the pipeline briefly described below:

  1. Trim all sequences to 50 bp to minimize possible batch effects. Bases are trimmed from the 3’ end of each read in a pair.

  2. Perform quality trimming and adapter clipping with Trimmomatic.

  3. Run FastQC for quality checks.

  4. Align reads against both hg19 and hg38 assemblies via bwa mem, and then use samblaster to mark duplicates.

  5. Filter the alignments, keeping only those that meet all of the following criteria: 1) properly mapped in a pair; 2) primary alignment; 3) mapping quality ≥ 30; 4) not marked as a duplicate in the previous step.

  6. From the filtered alignments, calculate the size and mapped location of each corresponding fragment. This step yields the fragmentation pattern data, stored as a sorted, BED-like tab-separated values (TSV) file.

  7. From the fragmentation pattern data, perform the following analyses: 1) fragment size distribution; 2) fragment size profile over the genome; 3) fragment coverage profile; 4) Windowed Protection Score (WPS), which reflects how well cfDNA is protected from digestion and therefore reveals nucleosome positioning.
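As a rough illustration of steps 5 to 7, the following Python sketch shows one way the alignment filter, the fragment size distribution, and a per-position WPS could be computed. It is not the FinaleDB workflow itself. The function names, the (chrom, start, end) half-open fragment representation, the 120 bp default window, the treatment of supplementary alignments as non-primary, and the handling of window boundaries are all assumptions made for this example. The flag bits are the standard ones from the SAM specification, and WPS is taken in its usual sense: fragments fully spanning a window centered on the position, minus fragments with an endpoint inside that window.

```python
from collections import Counter

# Standard SAM flag bits (SAM specification).
PROPER_PAIR = 0x2
SECONDARY = 0x100
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800


def passes_filter(flag, mapq, min_mapq=30):
    """Return True if an alignment meets the four criteria of step 5."""
    if not flag & PROPER_PAIR:
        return False
    # Assumption: "primary" excludes secondary and supplementary records.
    if flag & (SECONDARY | SUPPLEMENTARY):
        return False
    if flag & DUPLICATE:
        return False
    return mapq >= min_mapq


def size_distribution(fragments):
    """Count fragments by length; each fragment is (chrom, start, end)."""
    return Counter(end - start for _, start, end in fragments)


def wps(fragments, chrom, position, window=120):
    """Windowed Protection Score at a single position.

    A fragment fully spanning the window adds 1; a fragment with an
    endpoint inside the window subtracts 1. Coordinates are half-open.
    """
    half = window // 2
    win_start, win_end = position - half, position + half
    score = 0
    for frag_chrom, start, end in fragments:
        if frag_chrom != chrom:
            continue
        if start <= win_start and end >= win_end:
            score += 1
        elif win_start <= start < win_end or win_start < end <= win_end:
            score -= 1
    return score


# Hypothetical fragments, as they might appear in the BED-like TSV file.
fragments = [
    ("chr1", 100, 267),
    ("chr1", 150, 300),
    ("chr1", 210, 380),
    ("chr2", 100, 267),
]
print(passes_filter(99, 60))         # flag 99: paired, proper pair, first in pair
print(size_distribution(fragments))  # lengths 167, 150, 170, 167
print(wps(fragments, "chr1", 200))   # one spanning fragment, two endpoints
```

A production implementation would stream the sorted TSV and compute WPS across whole regions with a sweep rather than looping over every fragment for each position; the per-position loop here only keeps the definition easy to read.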

The following diagram briefly demonstrates the overall pipeline:

Database queries

The following screenshot demonstrates how the database query page works.

On the left is the filter panel, where users can query the database through various filters and search terms, including sample name, platform, and pathological condition.

To the right is the entry table. Users can select one or more entries and click the "Visualize" button to visualize the data (see the boxes with green dashed lines).

In addition, users can click the plus/minus sign to the right of each entry to add it to or remove it from the download list, and then go to the download center by clicking the "Download" button (see the boxes with purple dashed lines).

Data visualization

In the upper left of the visualization page is the assembly selector, where users can choose between hg19 and hg38 for the genome browser.

In the middle is the interactive plot that shows the fragment size distribution.

At the bottom is the genome browser, which shows the coverage, fragment size profile, and WPS tracks.
