🧬 Trusted by researchers worldwide

The Bioinformatics Platform You Need

Guided pipelines for every major genomics workflow. Every step explained. Every result interpreted. Runs on your computer — your data never leaves your machine.

15+
Organisms
40+
Gene families
10
Pipeline steps
100%
Free forever
Powered by
FastQC
HISAT2
fastp
featureCounts
DESeq2
samtools
ggplot2
MultiQC
Ensembl
🧬
15+
Built-in organisms with pre-loaded Ensembl genome URLs
🔬
40+
Gene families across 22 categories — TF, kinases, hormones and more
🖥️
7
Compute environments — local, SLURM, PBS, LSF, AWS, GCP, Azure
💸
$0
No account required, runs offline on your own machine
Featured tool

RNAflow — the complete
RNA-seq pipeline

From raw FASTQ files to publication-ready figures in one guided workflow. Every command explained. Every result interpreted. No bioinformatics PhD required.

One-click execution Click ▶ Run on any command block — it executes live in your terminal with real-time streaming output.
🌍
15+ organisms built-in Human, mouse, Arabidopsis, rice, wheat, maize, soybean, tomato, zebrafish and more — genome URLs pre-loaded from Ensembl.
🏢
HPC & cloud ready Generates SLURM, PBS, and LSF job scripts. Wraps commands with AWS S3, Google Cloud, and Azure sync operations for large-scale data.
📊
Results fully explained Every plot and every number is explained in plain English — volcano plots, heatmaps, PCA, DESeq2 tables and fold-change values.
RNAflow_App.html — analysis progress
1 Configure project
2 Download data from NCBI SRA
3 FastQC quality control
4 fastp adapter trimming
5 Genome download + HISAT2 index
6 HISAT2 alignment ⚡ running
7 featureCounts gene counting
8 Gene family analysis
9 DESeq2 differential expression
10 Volcano + heatmap + PCA
# Aligning sample SRR8490157
hisat2 -x rice_index -U trimmed.fastq \
  -S aligned/SRR8490157.sam -p 8
94.2% overall alignment rate ✓
How it works

10 steps from raw data to discovery

Each step generates ready-to-run commands with full explanations. Follow in order, or jump to any step — the app remembers where you left off.

1 ⚙️ Configure
2 ⬇️ Download data
3 🔍 Quality control
4 ✂️ Trimming
5 🧬 Genome & index
6 📍 Alignment
7 🔢 Gene counting
8 🧬 Gene family
9 📊 DESeq2
10 🌋 Plots & results
Why BioInfoCodex

Built for real researchers,
not just bioinformaticians

🔒
Your data stays yours
Every tool runs entirely on your own computer. No cloud upload, no account, no server. Your raw data, intermediate files, and results never leave your machine.
📖
Everything explained
Every command has a plain-English explanation. Every result — volcano plots, heatmaps, alignment rates, DESeq2 tables — is interpreted so you understand what it means biologically.
From zero to results
No bioinformatics PhD required. Open the HTML file, enter your SRA accessions, and follow the guided steps. The app generates every command and walks you through each one.
🖥️
HPC and cloud support
Working with large genomes on a supercomputer? Generates SLURM, PBS, and LSF job scripts automatically. Data on AWS, GCP, or Azure? Commands include the sync steps.
🌍
15+ organisms ready
Human, mouse, Arabidopsis, rice, wheat, maize, soybean, tomato, zebrafish, Drosophila, C. elegans, yeast and more — Ensembl genome and annotation URLs pre-loaded.
💸
Built for researchers
BioInfoCodex is MIT-licensed and will always be free. No subscription, no feature paywalls, no tracking. Download and use it indefinitely, even without internet access.

Start your first RNA-seq analysis today

Register to receive your RNAflow access code. Get 50 days of full access to the complete RNA-seq analysis pipeline.

Software

Bioinfo Tools

Free, guided pipelines for every major bioinformatics workflow. Each tool has its own complete manual — click any tool to read the documentation, download, and get started.

✦ Available now
🧬
RNAflow
v1.0 · Released 2025
Live

Complete RNA-seq analysis pipeline — from raw FASTQ files to differential expression, volcano plots, heatmaps, PCA and gene family analysis. Works on 15+ organisms including all major crop plants.

RNA-seq · DESeq2 · HISAT2 · FastQC · fastp · featureCounts · ggplot2 · 15 organisms
🍎 macOS 🐧 Linux 🪟 Windows WSL2 Full manual →
🧬
GeneNet Lab
v1.0.0 · Released 2026
Live

Full-featured lab management platform with team access control, seat-based licensing, OneDrive storage, and admin panel. Install on your computer and connect with a LAB-XXXXX code.

Lab Management · Team Access Control · Seat Licensing · OneDrive Storage · Admin Panel · Email Invites
🍎 macOS 🐧 Linux 🪟 Windows Full details →
In development
🔬
NGSuite
Expected Q3 2025
Soon

Whole-genome and exome sequencing pipeline. Variant calling with GATK, SNP/INDEL annotation, structural variant detection, population genetics.

WGS / WES · GATK4 · BWA-MEM2 · VEP
PlasmidMap
Expected Q4 2025
Soon

Plasmid design and annotation. Import GenBank files, identify restriction sites, annotate features, export circular maps for publication.

Plasmid design · GenBank · Restriction sites · Circular maps
🏔️
ChIPflow
Expected 2026
Soon

ChIP-seq and ATAC-seq pipeline. Peak calling with MACS2, motif analysis with HOMER, differential chromatin accessibility, BigWig tracks.

ChIP-seq · ATAC-seq · MACS2 · HOMER
🌿
MetaFlow
Expected 2026
Soon

Metagenomics and 16S rRNA microbiome pipeline. Taxonomic classification, alpha/beta diversity, differential abundance with DESeq2.

Metagenomics · 16S rRNA · QIIME2 · Kraken2
🔭
scRNAflow
Expected 2026
Soon

Single-cell RNA-seq pipeline using Seurat. Cell clustering, UMAP visualisation, marker gene identification, trajectory analysis.

scRNA-seq · Seurat · UMAP · Cell typing
⚗️
ProteomicsLab
Expected 2026
Soon

Mass spectrometry proteomics. Label-free quantification, TMT/iTRAQ, PTM analysis, pathway enrichment, volcano plots for protein data.

Proteomics · MaxQuant · LFQ · TMT
RNA-seq Analysis Software

RNAflow v1.0

Complete RNA-seq analysis pipeline — from raw sequencing files to publication-ready figures. Works on 15+ organisms with a full manual explaining every step and every result.

Quick download
50-day access · Runs locally
Contents
📋 Overview
💻 Requirements
⚙️ Installation
Pipeline Manual
1 · Configure project
2 · Download data
3 · Quality control
4 · Trimming
5 · Genome & index
6 · Alignment
7 · Gene counting
8 · Gene family analysis
9 · DESeq2 analysis
10 · Plots & results
Reference
📊 Interpreting results
🌍 Supported organisms
❓ Troubleshooting
What is RNAflow?

RNAflow is a guided RNA-seq analysis application. Open it in any browser, enter your experiment details, and it generates every command you need — with explanations of what each command does and what good results look like. You can run commands automatically with one click, or copy them to your terminal manually.

10
Analysis steps
15+
Organisms built-in
40+
Gene families
3
Operating systems

What does RNA-seq measure?

RNA sequencing measures gene expression — which genes are active in a cell and by how much. Every cell contains the same DNA, but different cells (or the same cell under different conditions) activate different genes. RNA-seq captures a snapshot of this activity by sequencing the RNA molecules present at a specific moment.

🔬
Think of it this way: DNA is the instruction manual for your entire body. RNA is the specific page that is currently being read. RNA-seq tells you which pages are being read, in which cells, and how intensively.

What biological questions can RNAflow answer?

Which genes change between conditions?

Compare wildtype vs mutant, treated vs untreated, healthy vs diseased. RNAflow identifies statistically significant changes.

How much does each gene change?

Fold change values show whether genes double, halve, or show more dramatic changes in expression.

Which biological pathways are affected?

Gene family analysis groups DE genes into pathways — transcription factors, kinases, hormone signalling and more.

How similar are my samples to each other?

PCA and heatmaps reveal whether replicates cluster correctly and how distinct your conditions are.

The complete RNAflow pipeline

Raw FASTQ
FastQC / QC
fastp Trim
HISAT2 Align
featureCounts
DESeq2
Figures + Results
System requirements

RNAflow runs on your own computer. You need a Conda environment with bioinformatics tools and R for the statistical analysis steps.

💻
Operating system
  • 🍎 macOS 10.15 or newer
  • 🐧 Linux — Ubuntu, Fedora, Debian, Arch
  • 🪟 Windows via WSL2 (Ubuntu recommended)
🔧
Required software
  • Python 3.6+ (for the local server)
  • Conda / Miniforge (to install tools)
  • R 4.0+ with DESeq2 and ggplot2
  • RStudio (recommended for R steps)
  • Any modern web browser (Chrome, Firefox, Safari, Edge)
Hardware (minimum)
  • RAM: 8 GB minimum, 16 GB recommended
  • Storage: 50 GB free space per experiment
  • CPU: any modern processor (more cores = faster)
🌐
Internet
  • Required only for: downloading data (SRA), downloading genome files
  • Analysis itself runs completely offline
  • Large genomes (human, wheat) need fast connection — files can be 10–30 GB
💡
Large genome note: For human, mouse, wheat, and maize, genome files are 3–30 GB. Use the app on a computer with a fast internet connection for the download steps.
Installation

RNAflow is an HTML file — no installation wizard needed. Download it, download the server file, and open it in your browser. The app guides you through installing all the bioinformatics tools.

1 · Download RNAflow

Download both files and save them to the same folder (e.g. your Desktop):

RNAflow v1.0

Two files — keep them in the same folder

Open RNAflow_App.html in your browser — you will see the BioInfoCodex RNAflow interface immediately.

2 · Install Conda (Miniforge) — if not already installed

Conda is a package manager that installs bioinformatics tools. If you already have Anaconda or Miniconda, skip this step. The app's System Check will detect it automatically.

🌐
Download Miniforge from: github.com/conda-forge/miniforge/releases — scroll to the latest release and download the installer for your operating system.
# macOS / Linux — run in Terminal after downloading:
bash Miniforge3-MacOSX-arm64.sh    # macOS Apple Silicon
bash Miniforge3-MacOSX-x86_64.sh   # macOS Intel
bash Miniforge3-Linux-x86_64.sh    # Linux

# Follow the prompts, accept the licence, allow it to initialise conda
# Close and reopen Terminal when finished
⚠️
On Windows: install WSL2 first (search "Turn Windows features on or off" → enable "Windows Subsystem for Linux"), then install Ubuntu from the Microsoft Store. Open Ubuntu, then install Miniforge inside Ubuntu.
3 · Create the rnaseq environment and install tools

RNAflow uses a dedicated Conda environment called rnaseq. This keeps all bioinformatics tools isolated from your other software.

# Create the environment (takes 5-10 minutes)
conda create -n rnaseq python=3.10 -y

# Activate it
conda activate rnaseq

# Install all required bioinformatics tools
conda install -c bioconda -c conda-forge fastqc fastp hisat2 samtools subread sra-tools multiqc -y

This installs: FastQC (quality control), fastp (trimming), HISAT2 (alignment), samtools (BAM processing), featureCounts/subread (gene counting), fasterq-dump/sra-tools (data download), MultiQC (report generation).

4 · Install R and required packages

R and RStudio are used for the DESeq2 statistical analysis and all plots. Download R from r-project.org and RStudio from posit.co/download/rstudio-desktop.

# Open RStudio and run this to install all required R packages:
if (!require("BiocManager")) install.packages("BiocManager")
BiocManager::install("DESeq2")
install.packages(c("ggplot2", "pheatmap", "ggrepel", "dplyr", "RColorBrewer"))
💡
Package installation takes 10–20 minutes the first time. You only need to do this once.
5 · Start the local server (for Run mode)

If you want to use Run mode (commands execute automatically with one click), start the server in a Terminal window before opening the app. Keep this window open while working.

# Navigate to your Desktop (or wherever you saved the files)
cd ~/Desktop

# Start the server
python3 rnaflow_server.py

You will see: RNAflow Local Server — ready · Listening on 127.0.0.1:7788

🔒
The server only listens on 127.0.0.1 (your own computer). It is completely invisible to any other computer on your network or the internet. Your data never leaves your machine.
Step 1 — Configure your project

Before running any commands, you set up your experiment details in the Configure page. RNAflow uses these settings to automatically generate every command throughout the pipeline.

1 · Select your organism

Choose from 15 pre-loaded organisms or enter custom genome URLs. When you select an organism, the genome FASTA and GTF annotation URLs are automatically filled in with the correct Ensembl links.

🐾 Animals

Human, Mouse, Rat, Zebrafish, Drosophila, C. elegans, Yeast

🌱 Plants

Arabidopsis, Rice, Maize, Wheat, Soybean, Tomato, Yarrowia

💡
For organisms not in the list, select Custom and paste your own Ensembl FTP URLs. The manual explains exactly how to find these on ensembl.org and plants.ensembl.org.
2 · Name your conditions and add SRA accessions

Give meaningful names to your two conditions (e.g. wildtype and mutant, or control and drought_treated). Then add SRA accession numbers for each replicate.

🔍
Finding SRA accessions: Go to ncbi.nlm.nih.gov/sra → search your paper title or dataset ID → click the experiment → note the SRR/ERR/DRR number for each sample. These start with SRR, ERR, or DRR.
📋 How many replicates do I need?
DESeq2 requires a minimum of 2 biological replicates per condition, but 3 is strongly recommended and is standard for publication. Biological replicates are separate biological samples (different plants, animals, or cultures) — not the same sample sequenced twice.
3 · Choose read type: single-end or paired-end

This describes how the sequencing library was prepared. If unsure, check the SRA run page — it shows the "Layout" as SINGLE or PAIRED.

Single-end (SE)
Each fragment is sequenced from one end only. Produces one FASTQ file per sample (e.g. SRR123.fastq). Older and cheaper but less information per read.
Paired-end (PE)
Each fragment is sequenced from both ends. Produces two FASTQ files per sample (e.g. SRR123_1.fastq and SRR123_2.fastq). More expensive but more accurate alignment.
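If in doubt, the downloaded file names also tell you the layout. A minimal shell sketch (the accession names and raw_data/ folder here are made-up placeholders, not real datasets):

```shell
# Paired-end samples arrive as a matching _1/_2 pair; single-end as one file.
# The accessions below are placeholders, not real datasets.
mkdir -p raw_data
touch raw_data/SRR0000001_1.fastq raw_data/SRR0000001_2.fastq   # paired-end
touch raw_data/SRR0000002.fastq                                 # single-end

layout_of() {
  if [ -f "raw_data/${1}_1.fastq" ] && [ -f "raw_data/${1}_2.fastq" ]; then
    echo PAIRED
  elif [ -f "raw_data/${1}.fastq" ]; then
    echo SINGLE
  else
    echo MISSING
  fi
}

layout_of SRR0000001   # prints PAIRED
layout_of SRR0000002   # prints SINGLE
```

If the files on disk disagree with the Layout shown on the SRA run page, re-check the accession before aligning.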
Step 2 — Download raw data

RNAflow downloads your raw sequencing files directly from NCBI SRA using fasterq-dump. The files are saved into your project folder automatically.

1 · What happens during download

The command fasterq-dump connects to NCBI's servers and downloads compressed sequencing data, then decompresses it into FASTQ format. Each sample is typically 1–10 GB depending on the sequencing depth.

✅ Good result
Files appear in your raw_data/ folder. Each sample has a .fastq file (single-end) or two .fastq files ending in _1 and _2 (paired-end). Use the check command to confirm file sizes are reasonable (>100 MB per sample is normal).
⚠️ Slow download?
NCBI downloads can be slow. Typical speeds are 5–20 MB/s. A dataset of 6 samples at 2 GB each takes 10–60 minutes. Do not interrupt the download — if interrupted, run the command again and fasterq-dump will resume.
❌ Error: accession not found
The SRR/ERR number was entered incorrectly, or the dataset is not public yet. Check the accession on ncbi.nlm.nih.gov/sra — it should show "Public" status.
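Beyond file size, it is worth confirming each download is a structurally valid FASTQ: four lines per read, every record header starting with @. A small awk-only sketch (the demo file is a two-read toy, not real data):

```shell
# Build a tiny two-read FASTQ so the check below has something to inspect.
mkdir -p raw_data
printf '@read1\nACGTACGT\n+\nFFFFFFFF\n@read2\nTTGGCCAA\n+\nFFFFFFFF\n' > raw_data/demo.fastq

fastq_reads() {
  # Valid FASTQ: total lines divisible by 4, each record header starts with '@'.
  awk 'NR % 4 == 1 && $0 !~ /^@/ { print "MALFORMED"; exit 1 }
       END { if (NR % 4 != 0) { print "TRUNCATED"; exit 1 } print NR / 4 }' "$1"
}

fastq_reads raw_data/demo.fastq   # prints 2 (the read count)
```

A truncated download usually shows up here as a line count that is not divisible by 4; rerunning the download command fixes it.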
Step 3 — Quality control with FastQC

FastQC examines your raw reads and produces a visual HTML report for each sample. Always run this before any analysis — it reveals problems that would invalidate your results if not addressed.

1 · What FastQC checks
✅ Per-base quality (PASS)
The quality score at each position across all reads. Good data has Q30+ scores (green zone) across most of the read length. A slight drop at the 3' end is normal and expected.
✅ Per-sequence quality (PASS)
Distribution of average quality per read. Should peak above Q30. If most reads are below Q20, the sequencing run had quality issues — contact your sequencing provider.
⚠️ Per-base sequence content (WARN/FAIL)
Shows nucleotide composition across positions. FAIL is normal for RNA-seq — the first 10–12 bases often show imbalance due to random hexamer priming bias during library preparation. This is expected and not a problem.
⚠️ Sequence duplication (WARN)
RNA-seq data always shows high duplication because highly expressed genes have thousands of copies. A WARN or FAIL here is completely normal for RNA-seq and does not indicate a problem.
❌ Adapter content (FAIL) — action needed
Adapter sequences are present in your reads. This happens when the cDNA fragment is shorter than the read length. Adapters must be removed in the trimming step (Step 4) before alignment.
❌ Overrepresented sequences (FAIL)
A single sequence makes up an unusually large share of all reads (FastQC warns above 0.1% and fails above 1%). Usually this is rRNA contamination or adapters. If the sequence is identified as a ribosomal RNA, it will simply not align to protein-coding genes and won't affect your results significantly.
2 · Reading the MultiQC summary report

MultiQC combines all individual FastQC reports into one interactive HTML page. Open it in your browser after FastQC completes.

📊 What to look for in MultiQC
The General Statistics table at the top shows: total reads (aim for 10–30 million per sample), % duplicates (60–90% is normal for RNA-seq), % GC content (should be consistent across samples). If one sample looks very different from the others, it may be a poor-quality sample.
✅ Ready to proceed if:
All samples have similar read counts. Quality scores are mostly green. Any FAILs are only for Per-base sequence content or Sequence duplication (both normal for RNA-seq).
❌ Investigate further if:
One sample has far fewer reads than the others (<5 million). Adapter content shows FAIL. Per-base quality drops below Q20 before 50% of the read length.
Step 4 — Adapter trimming with fastp

fastp removes adapter sequences, low-quality bases, and very short reads. It is fast, accurate, and works for both single-end and paired-end data automatically.

1 · What fastp does and why it matters

Adapter sequences are synthetic DNA added during library preparation — they are not part of your biological sample. If not removed, they interfere with alignment because the aligner cannot find a match for them in your genome. Low-quality bases at the ends of reads are also trimmed to improve alignment accuracy.

✅ Good fastp output
Reads passing filter: >95%. Adapter-trimmed reads: depends on your data (0–40% is normal). Q30 rate: >90%. Output files appear in your trimmed/ folder with the same names but _trimmed suffix.
2 · fastp settings used by RNAflow

RNAflow uses these fastp settings — all chosen to be conservative and appropriate for most RNA-seq experiments:

--trim_front1 5                # Remove first 5 bases (hexamer bias)
--qualified_quality_phred 20   # Keep bases with Q20+ quality
--length_required 36           # Discard reads shorter than 36 bp after trimming
--detect_adapter_for_pe        # Auto-detect adapters (paired-end)
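To make the Q20 cutoff concrete: each character in a FASTQ quality string encodes a Phred score, Q = ASCII code minus 33. A quick awk decoder (illustration only; the quality string is invented):

```shell
# Decode a FASTQ quality string into Phred scores (Q = ASCII code - 33).
phred() {
  printf '%s' "$1" | awk '
    BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
    { for (i = 1; i <= length($0); i++)
        printf "%s%d", (i > 1 ? " " : ""), ord[substr($0, i, 1)] - 33
      print "" }'
}

phred 'FFFF,,##'   # prints 37 37 37 37 11 11 2 2
```

The F characters (Q37) clear the Q20 threshold comfortably, while the , (Q11) and # (Q2) positions fall below it.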
Step 5 — Genome download and HISAT2 index

The reference genome and gene annotation are downloaded from Ensembl, then HISAT2 builds an index — a compressed searchable version of the genome that makes alignment fast.

1 · What files are downloaded and why
Genome FASTA (.fa.gz)
The complete DNA sequence of your organism's genome, one chromosome at a time. This is what the aligner maps your reads to. File size: 30 MB (yeast) to 15 GB (wheat).
Gene annotation GTF (.gtf.gz)
A table of exactly where every gene and exon is located on the genome — chromosome, start position, end position, strand. Used by HISAT2 for splice-aware alignment and by featureCounts to assign reads to genes.
⏱️
Building the HISAT2 index takes 5–30 minutes depending on genome size. Human takes ~20 minutes, yeast takes ~2 minutes. This only needs to be done once — the index can be reused for future experiments with the same organism.
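Two quick checks catch most download problems before you spend time on indexing: gzip integrity (detects a truncated file) and a count of sequence records. A sketch using a toy FASTA in place of a real Ensembl download:

```shell
# Stand-in for a downloaded genome: two tiny "chromosomes", gzip-compressed.
printf '>chr1\nACGTACGTAC\n>chr2\nGGGGCCCCAA\n' | gzip > genome.fa.gz

gzip -t genome.fa.gz && echo "gzip OK"   # fails loudly on a truncated download
gunzip -c genome.fa.gz | grep -c '^>'    # prints 2: number of sequence records
```

On a real file, the record count should roughly match the chromosome/scaffold count listed for your organism on Ensembl.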
Step 6 — Alignment with HISAT2

HISAT2 maps each sequencing read back to its location on the reference genome. It is splice-aware, meaning it correctly handles reads that span exon-exon junctions — essential for accurate RNA-seq alignment.

1 · Understanding alignment statistics

HISAT2 reports alignment statistics for each sample. These appear in the terminal output and are saved in your aligned/ folder as .log files.

>90%
Overall alignment — excellent
70–90%
Overall alignment — acceptable
<70%
Overall alignment — investigate
✅ Excellent: >90% overall alignment rate
Most reads found their location in the genome. Your data quality is high and the correct reference genome was used.
⚠️ Acceptable: 70–90% overall alignment rate
Some reads don't map — common reasons: reads from rRNA (which isn't in the genome annotation), very short reads after trimming, or reads from contaminating organisms. Analysis can proceed but investigate the unmapped reads if below 75%.
❌ Low: <70% overall alignment rate
Possible causes: wrong organism selected (check your accession on SRA), contaminated sample, genome version mismatch, or data quality issue. Check that your FASTA and GTF files are from the same Ensembl release number.
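The thresholds above are easy to script against the saved .log files. A sketch that pulls the rate out of a log and classifies it (the log text is a shortened, invented example of the HISAT2 summary format):

```shell
# A trimmed, invented stand-in for a HISAT2 summary log.
cat > SRR_demo.log <<'EOF'
20000000 reads; of these:
  20000000 (100.00%) were unpaired; of these:
    1160000 (5.80%) aligned 0 times
94.20% overall alignment rate
EOF

rate_check() {
  awk '/overall alignment rate/ {
         r = $1 + 0                       # "94.20%" + 0 strips the % sign
         if      (r > 90)  v = "excellent"
         else if (r >= 70) v = "acceptable"
         else              v = "investigate"
         printf "%.1f%% (%s)\n", r, v
       }' "$1"
}

rate_check SRR_demo.log   # prints 94.2% (excellent)
```

Looping this over aligned/*.log gives a one-screen summary of every sample's alignment quality.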
2 · BAM file sorting and indexing

After alignment, HISAT2 produces SAM files (text format). RNAflow then uses samtools to convert these to BAM (binary, 5× smaller), sort them by chromosome position, and index them. This is required by featureCounts and enables fast access to any region of the genome.

✅ BAM file check
Each sample should have a .bam file and a .bam.bai index file in your aligned/ folder. File sizes are typically 200 MB–3 GB depending on read depth.
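A short loop catches a missing index before featureCounts complains. Sketch only, with empty placeholder files standing in for real BAMs:

```shell
# Placeholder files: sampleB deliberately lacks its .bai index.
mkdir -p aligned
touch aligned/sampleA.bam aligned/sampleA.bam.bai aligned/sampleB.bam

for bam in aligned/*.bam; do
  if [ -f "$bam.bai" ]; then echo "indexed   $bam"; else echo "NO INDEX  $bam"; fi
done
# prints:
#   indexed   aligned/sampleA.bam
#   NO INDEX  aligned/sampleB.bam
```

A missing index is fixed with samtools index on that BAM file.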
Step 7 — Gene counting with featureCounts

featureCounts counts how many reads aligned to each gene in each sample. This produces a count matrix — the input to DESeq2 for statistical analysis.

1 · Understanding the count matrix

The output is a table where rows are genes and columns are samples. Each number is the count of reads that mapped to that gene in that sample.

# Example count matrix (first few rows):
GeneID      WT_rep1  WT_rep2  WT_rep3  MT_rep1  MT_rep2  MT_rep3
AT1G01010       245      231      258      512      498      487
AT1G01020        45       52       41       89       95       88
AT1G01030       102       98      115       44       39       51
📊 What do these numbers mean?
The raw counts are not directly comparable between samples because samples have different total sequencing depths. A gene with 500 counts in a sample with 10 million total reads is proportionally more expressed than 500 counts in a sample with 50 million total reads. DESeq2 handles this normalisation automatically.
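A simplified illustration of why depth matters: scale each column to counts-per-million by its own total. This is plain per-total scaling, not the more robust median-of-ratios method DESeq2 actually uses, and the gene names and counts are invented:

```shell
# Two samples at very different depths: gene1 has identical raw counts in both.
printf 'GeneID\tsampleA\tsampleB\ngene1\t500\t500\ngene2\t9500\t49500\n' > counts.tsv

# Pass 1 sums each column; pass 2 rescales every count to per-million units.
awk -F'\t' '
  NR == FNR { if (FNR > 1) { totA += $2; totB += $3 }; next }
  FNR == 1  { print $1, $2"_CPM", $3"_CPM"; next }
  { printf "%s %.0f %.0f\n", $1, $2 / totA * 1e6, $3 / totB * 1e6 }
' counts.tsv counts.tsv > cpm.txt
cat cpm.txt
# gene1 comes out 50000 vs 10000: a 5-fold difference hidden in identical raw counts
```

The same 500 raw reads mean very different expression levels once each sample's total depth is accounted for, which is exactly the correction DESeq2 performs (more robustly) for you.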
2 · featureCounts assignment statistics
✅ Good: 60–80% assigned reads
Most reads were successfully assigned to a gene. This is a normal and expected range for RNA-seq data.
⚠️ Unassigned: no features
Reads mapped to the genome but not to an annotated gene region (e.g. intergenic regions, introns, non-coding regions). Some unassigned reads are normal (10–30%). High rates (>50%) may mean the GTF annotation version doesn't match your genome FASTA.
❌ Very low assignment (<40%)
Check that your genome FASTA and GTF are from the same Ensembl release number and the same organism. A mismatch between these two files is the most common cause.
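The .summary file that featureCounts writes next to the count matrix makes this check scriptable. The toy file below mimics its two-column Status/count layout (row names and numbers invented; real files list more Unassigned categories):

```shell
# Invented stand-in for a featureCounts counts.txt.summary file.
printf 'Status\taligned/sampleA.bam\nAssigned\t14000000\nUnassigned_MultiMapping\t3000000\nUnassigned_NoFeatures\t2500000\nUnassigned_Ambiguity\t500000\n' > counts.txt.summary

awk -F'\t' '
  NR > 1 { tot += $2; if ($1 == "Assigned") ok = $2 }
  END    { printf "assigned: %.1f%% (%d of %d reads)\n", ok / tot * 100, ok, tot }
' counts.txt.summary
# prints assigned: 70.0% (14000000 of 20000000 reads)
```

A value inside the 60-80% band from the card above means you are clear to proceed to DESeq2.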
Step 9 — Differential expression with DESeq2

DESeq2 is the gold-standard R package for RNA-seq differential expression analysis. It uses a negative binomial statistical model to identify genes that genuinely change between your conditions, accounting for biological variation.

1 · What DESeq2 does step by step
1. Size factor normalisation
DESeq2 calculates a size factor for each sample to correct for differences in sequencing depth. After normalisation, samples are directly comparable.
2. Dispersion estimation
DESeq2 estimates how much each gene varies between replicates. Genes with high variability need stronger evidence to be called significant.
3. Statistical testing
A Wald test is applied to every gene, producing a p-value for the probability that the observed difference is due to chance.
4. Multiple testing correction
Because thousands of genes are tested simultaneously, raw p-values are adjusted using the Benjamini-Hochberg method to control the false discovery rate (FDR). This produces the padj column.
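Step 4 is the least intuitive, so here is the Benjamini-Hochberg procedure in miniature: sort the p-values, multiply each by n/rank, then take a running minimum from the largest p downwards so the adjusted values stay monotone. The four p-values are invented:

```shell
printf '0.005\n0.01\n0.03\n0.5\n' > pvals.txt

sort -gr pvals.txt | awk '
  { p[NR] = $1 }
  END {
    n = NR; m = 1
    for (i = 1; i <= n; i++) {          # i = 1 is the largest p-value (rank n)
      rank = n - i + 1
      adj = p[i] * n / rank             # BH: p * n / rank
      if (adj < m) m = adj              # running minimum keeps padj monotone
      padj[rank] = m; orig[rank] = p[i]
    }
    for (r = 1; r <= n; r++) printf "p=%g padj=%g\n", orig[r], padj[r]
  }' > padj.txt
cat padj.txt
# p=0.005 padj=0.02
# p=0.01 padj=0.02
# p=0.03 padj=0.04
# p=0.5 padj=0.5
```

Note how every padj is at least as large as its raw p-value: with thousands of genes tested, this inflation is what keeps the false discovery rate under control.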
2 · Understanding DESeq2 results columns
# DESeq2 results table — one row per gene:
gene        baseMean  log2FC  lfcSE    stat   pvalue   padj
AT1G01010      302.4    2.14    0.18   11.89  1.3e-32  8.2e-30
AT1G01020       47.1   -1.82    0.31   -5.87  4.4e-09  6.1e-08
baseMean
Average normalised count across all samples. Higher = more expressed overall. Very low baseMean genes (<10) are often unreliable due to stochastic variation.
log2FoldChange (log2FC)
How much the gene changes. +1 = 2× higher in treatment. -1 = 2× lower. +2 = 4× higher. 0 = no change. Positive values = upregulated in your treatment condition.
lfcSE (standard error)
Uncertainty in the fold change estimate. Small SE = reliable fold change. Large SE (especially in low-count genes) = less reliable.
pvalue
Raw probability. Never use this directly for significance — use padj instead. The raw p-value does not account for testing thousands of genes simultaneously.
padj (adjusted p-value) ← use this
The corrected p-value. Standard significance threshold: padj < 0.05. This means there is a <5% chance the result is a false discovery. Genes with padj < 0.05 AND |log2FC| > 1 are considered confidently differentially expressed.
NA values in padj
Some genes have NA in the padj column. This means DESeq2 filtered them out — either too low count (filtered by independent filtering) or an outlier in one replicate. NA genes are not significant.
3 · Interpreting the DESeq2 summary

After running DESeq2, the summary() function prints a table like this:

out of 5944 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)    : 791, 13%     ← upregulated genes
LFC < 0 (down)  : 1227, 21%    ← downregulated genes
outliers        : 0, 0%
low counts      : 578, 10%     ← filtered out (normal)
📊 Is my result normal?
The number of significant genes varies enormously depending on your experiment. A stress response might affect 10–40% of genes. A mild treatment might affect 1–5%. A comparison between very different cell types might affect 50%+. There is no "right" number — the biology of your experiment determines this.
Step 10 — Plots and visualisation

RNAflow generates three standard visualisations used in RNA-seq publications: a volcano plot, a heatmap, and a PCA plot. All are saved as 300 DPI PNG files suitable for direct use in papers and theses.

1 · Volcano plot — what it shows

A volcano plot visualises all genes simultaneously — fold change on the x-axis and statistical significance on the y-axis. It gets its name from its characteristic shape.

🌋 Reading the volcano plot
X-axis: log2 fold change. Left = downregulated, Right = upregulated. Genes near 0 don't change much.
Y-axis: -log10(padj). Higher = more significant. The dashed horizontal line marks padj=0.05.
Vertical dashed lines: mark log2FC = ±1 (= 2-fold change threshold).
Orange dots: significantly upregulated (right of vertical line, above horizontal line).
Blue dots: significantly downregulated.
Grey dots: not significant.
Gene labels: top 15 most significant genes are labelled automatically.
✅ Good volcano plot
A clear separation between conditions shows as a wide spread of dots. Most dots cluster near the centre (no change) with a "volcano" shape — a few highly significant genes at the top of each arm.
⚠️ Very few significant genes
If almost all dots are grey, your conditions may be very similar biologically, you may not have enough replicates, or sequencing depth was too low. Check your experimental design.
2 · Heatmap — what it shows

The heatmap shows the top 40 most significant genes as a colour grid — each row is a gene, each column is a sample. Colour represents expression level (scaled across the row).

🌡️ Reading the heatmap
Blue: lower expression than average (downregulated relative to mean).
Red/orange: higher expression than average (upregulated relative to mean).
White: average expression level.
Rows (genes) are clustered by similar expression pattern — genes that behave similarly group together.
Columns (samples) are not clustered — kept in your original order to preserve condition grouping.
✅ Good heatmap
You should see two clear colour blocks — one set of columns (your control samples) and one set (treatment) with opposite colours. Replicates within the same condition should look nearly identical.
3 · PCA plot — what it shows

Principal Component Analysis (PCA) reduces the expression of thousands of genes to just 2 dimensions. It shows how similar or different your samples are to each other overall.

🎯 Reading the PCA plot
Each dot = one sample. Samples close together are similar in overall gene expression. Samples far apart are very different.
PC1 (x-axis): captures the largest source of variation in your data — usually this is your experimental condition.
PC2 (y-axis): captures the second-largest source — often batch effects or biological variation between replicates.
% variance explained: shown on each axis. High PC1 % (e.g. 70%+) means your condition is the dominant source of variation.
✅ Excellent PCA
Replicates cluster tightly together. The two conditions are far apart on PC1, which explains a large proportion of variance (>50%). This means your experiment worked — the conditions are genuinely different and replicates are consistent.
⚠️ One outlier replicate
If one sample from a group is far from its replicates on the PCA, it may be a poor-quality or contaminated sample. Consider removing it from your analysis if the quality metrics (alignment rate, read count) are also poor for that sample.
❌ Conditions overlap completely
If your two conditions cannot be separated on any PC axis, the transcriptomes are very similar. You will find few or no significant genes. This may mean the treatment had no transcriptional effect, or the time point was wrong, or there is a sample labelling error.
Step 8 — Gene family analysis

For large genomes (wheat, maize, human, soybean), researchers often want to focus on one biological family rather than all 30,000+ genes. RNAflow includes 40+ gene families across 8 categories.

1 · When to use gene family analysis
📋 Use gene family analysis when:
• Your experiment targets a specific pathway (e.g. hormone response, stress signalling)
• You are working with a large genome and want focused results
• You want to produce a publication figure for a specific gene family
• Your full DESeq2 analysis found many significant genes and you want to understand the biology

The gene family analysis produces its own volcano plot and heatmap for just the selected family, saved separately from the full analysis results.

2 · Available gene families
Transcription factors

MYB, WRKY, bHLH, NAC, MADS, ERF, GRAS, bZIP, HOX, GATA, STAT, p53, NF-κB, ETS, SOX, FOX — 21 families

Protein kinases

MAPK, CDK, RLK, SnRK, CDPK, TOR, tyrosine kinases — 8 families

Plant hormone pathways

Auxin, GA, cytokinin, ABA, ethylene, JA, SA, BR, strigolactone, peptide hormones — 10 pathways

Transporters

ABC, aquaporin, sugar (SUT/SWEET), nitrate (NRT), phosphate, K+, metal (ZIP, YSL, NRAMP), SLC, ion channels

Animal signalling

Wnt/β-catenin, Notch, Hedgehog, TGF-β, PI3K/AKT, RAS/ERK, JAK/STAT, NF-κB, Hippo — 9 pathways

Immune / Cancer / Epigenetic

TLR, NLR, cytokines, BCL-2, caspases, oncogenes, tumour suppressors, DNA methylation, histone modifiers

Interpreting your results

A quick reference guide for the most common questions researchers have when looking at their RNAflow results for the first time.

1 · How do I decide which genes are significant?
✅ Standard significance criteria
A gene is considered differentially expressed (DE) if it meets BOTH of these criteria:

1. padj < 0.05 — statistically significant after multiple testing correction
2. |log2FoldChange| > 1 — biologically meaningful change (at least 2-fold)

Some researchers use padj < 0.01 (stricter) or |log2FC| > 0.58 (~1.5-fold, more lenient). State your thresholds in your methods section.
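Applying both criteria to a results table is a one-liner. The toy rows below copy the DESeq2 column layout shown earlier (gene IDs and values invented); the fold column converts |log2FC| back to a plain fold change:

```shell
# Invented results in a gene / baseMean / log2FC / padj layout.
printf 'gene\tbaseMean\tlog2FC\tpadj\nAT1G01010\t302.4\t2.14\t8.2e-30\nAT1G01020\t47.1\t-1.82\t6.1e-08\nAT1G01030\t88.0\t0.40\t0.0031\nAT1G01040\t12.5\t3.10\t0.41\n' > results.tsv

# Keep genes with padj < 0.05 AND |log2FC| > 1; skip NA rows.
awk -F'\t' 'NR > 1 && $4 != "NA" && $4 < 0.05 && ($3 > 1 || $3 < -1) {
  fc = 2 ^ ($3 < 0 ? -$3 : $3)          # back-convert log2FC to fold change
  printf "%s log2FC=%s fold=%.1fx padj=%s\n", $1, $3, fc, $4
}' results.tsv
# AT1G01010 log2FC=2.14 fold=4.4x padj=8.2e-30
# AT1G01020 log2FC=-1.82 fold=3.5x padj=6.1e-08
```

Note that AT1G01030 is statistically significant (padj 0.0031) but fails the fold-change criterion, which is exactly why both thresholds are applied together.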
2 · What do I do with my list of significant genes?

Once you have a list of significant DE genes, these are the typical next steps in a research project:

Gene Ontology (GO) enrichment
Upload your DE gene list to geneontology.org or STRING-db to find which biological processes are overrepresented. This answers: "what is this list of genes collectively doing?"
KEGG pathway analysis
Map DE genes to metabolic and signalling pathways at kegg.jp. Shows which pathways are activated or suppressed in your treatment condition.
Validation by RT-qPCR
Select 5–10 top DE genes and validate their expression changes using RT-qPCR. This is typically required before publication to confirm the RNA-seq findings.
Protein interaction networks
Upload your DE gene list to string-db.org to visualise which proteins interact with each other. Clusters of interacting proteins often represent co-regulated functional modules.
3. How to cite RNAflow and its tools in your paper

In your Methods section, cite each tool used in the pipeline. Example methods text:

RNA-seq analysis was performed using RNAflow (bioinfocodex.com). Raw reads were quality-assessed with FastQC (Andrews, 2010) and trimmed with fastp (Chen et al., 2018). Trimmed reads were aligned to the [organism] reference genome (Ensembl release [number]) using HISAT2 (Kim et al., 2019). Gene-level read counts were obtained with featureCounts (Liao et al., 2014). Differential expression analysis was performed with DESeq2 (Love et al., 2014) using a significance threshold of padj < 0.05 and |log2FC| > 1. Visualisations were generated using ggplot2 (Wickham, 2016) and pheatmap (Kolde, 2019).
Supported organisms

15 organisms are pre-loaded with their Ensembl genome URLs. Select any from the dropdown in the Configure page — genome and GTF URLs fill automatically.

🐾 Animals
  • Human — Homo sapiens (GRCh38)
  • Mouse — Mus musculus (GRCm39)
  • Rat — Rattus norvegicus (mRatBN7.2)
  • Zebrafish — Danio rerio (GRCz11)
  • Drosophila — Drosophila melanogaster (BDGP6)
  • C. elegans — Caenorhabditis elegans (WBcel235)
  • Yeast — Saccharomyces cerevisiae (R64)
  • Yarrowia — Yarrowia lipolytica (ASM101387v1)
🌱 Plants
  • Arabidopsis — Arabidopsis thaliana (TAIR10)
  • Rice — Oryza sativa (IRGSP-1.0)
  • Maize — Zea mays (B73-NAM-5.0)
  • Wheat — Triticum aestivum (IWGSC)
  • Soybean — Glycine max (Glycine_max_v2.1)
  • Tomato — Solanum lycopersicum (SL3.0)
💡
For organisms not listed, select Custom and paste your own Ensembl FTP URLs.
Troubleshooting

Solutions to the most common problems encountered during RNA-seq analysis with RNAflow.

The Run button is greyed out — what do I do?
The local server is not running. Open Terminal, navigate to the folder containing rnaflow_server.py, and run: python3 rnaflow_server.py. Keep that Terminal window open. The green dot in the top bar will appear when connected. If you prefer not to use the server, use Mode B (Copy) to copy commands manually.
HISAT2 alignment rate is below 70%
Most common cause: the wrong organism was selected. Check your SRA accession on ncbi.nlm.nih.gov/sra to confirm the organism. Also confirm that the genome FASTA and GTF annotation are from the same Ensembl release number. A mismatch between these two files causes very low alignment rates.
DESeq2 gives an error about zero counts or missing samples
This usually means the column names in your count matrix don't match the sample names in your metadata. Check that the order of samples in the count table matches the order in the DESeq2 script exactly. Also ensure you have at least 2 samples per condition.
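A quick stand-alone sanity check before re-running DESeq2 (plain Python; the sample names are illustrative, not tied to RNAflow's file layout):

```python
def check_samples(count_columns, metadata_samples):
    """Verify count-matrix columns and metadata sample names match, in order."""
    missing = [s for s in metadata_samples if s not in count_columns]
    extra = [c for c in count_columns if c not in metadata_samples]
    in_order = count_columns == metadata_samples
    return {"missing": missing, "extra": extra, "in_order": in_order}

# Illustrative: two conditions, three replicates each
counts_cols = ["ctrl_1", "ctrl_2", "ctrl_3", "trt_1", "trt_2", "trt_3"]
meta_rows   = ["ctrl_1", "ctrl_2", "ctrl_3", "trt_1", "trt_3", "trt_2"]  # swapped!

report = check_samples(counts_cols, meta_rows)
print(report["in_order"])  # False — trt_2 and trt_3 are out of order
```

If `missing` or `extra` is non-empty the names themselves disagree; if only `in_order` is False, the samples exist but the row order must be fixed before DESeq2 will pair them correctly.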
featureCounts assigned only 20–30% of reads
The GTF annotation version does not match the genome FASTA. Make sure both files are from the same Ensembl release number and the same species. For example, Ensembl release 109 genome FASTA must be used with the Ensembl release 109 GTF file.
I have no significant genes after DESeq2
Common causes: (1) The biological conditions are too similar — try a larger fold-change or different time point. (2) Insufficient replicates — fewer than 3 replicates reduces statistical power. (3) Too few reads — aim for 10–20 million reads per sample. (4) The PCA shows samples don't separate — may indicate a sample labelling or metadata error.
How do I add more replicates or samples?
Go to the Configure page → click the + Add replicate button under each condition to add more SRA accession rows. You can add as many replicates as needed. All downstream commands update automatically.
The download is very slow or fails halfway
NCBI SRA downloads can be slow, especially for large files. If a download fails, simply run the fasterq-dump command again — it will resume from where it stopped. For very large datasets, consider downloading overnight or using the NCBI SRA toolkit's prefetch command first.
Can I analyse more than two conditions?
RNAflow currently supports two-condition comparisons (one control vs one treatment). For multi-condition experiments (e.g. timepoints, multiple treatments), run the pipeline separately for each pairwise comparison you are interested in. Future versions of RNAflow will support multi-factor designs.
Lab Management Platform

GeneNet Lab v1.0.0

A full-featured lab management platform with team access control, seat-based licensing, OneDrive storage, and a desktop app for macOS and Linux.

Version 1.0.0

Download GeneNet

🍎
macOS
.dmg installer · macOS 12+
95 MB · v1.0.0
⬇ Download for macOS
🐧
Linux
.AppImage · Ubuntu, Fedora, Debian
100 MB · v1.0.0
⬇ Download for Linux
🪟
Windows
.exe installer · Windows 10/11
78 MB · v1.0.0
⬇ Download for Windows
🔗 Already have a connection code?
Connect directly via the browser — no download needed.
→ Connect to Lab

Features

👥
Team Management
Invite members by email with unique invite codes. Admins can block, remove, or restore access per user at any time.
🪑
Seat-Based Licensing
Starter (5 seats), Pro (20 seats), or Enterprise (unlimited). Visual seat usage tracker with upgrade flow built in.
🛡️
Admin Control Panel
View all team members, change roles, block or remove users, and manage connection codes from one page.
☁️
OneDrive / Network Storage
Store your database and uploads on OneDrive or a shared network path. Configured once during setup — syncs automatically.
📧
Email Invites
Admins send invite emails with a one-click registration link. New users land on a pre-filled form — no code typing needed.
🔒
Access States
Users can be Active, Blocked (temporary), or Removed. Blocked users see a clear message to contact the administrator.

Licensing Plans

Starter
Free
Up to 5 team members
Full feature access · No expiry
Pro
Up to 20 team members
Contact us to upgrade
Enterprise
Unlimited seats
Custom deployment support

How it works

1. Download & install the app for your operating system. On first run, the admin completes a one-time setup wizard.
2. Admin invites team members by email from the Team & Seats panel. Each invite generates a unique registration link.
3. Members connect using the LAB-XXXXX connection code (e.g. LAB-84921) and register with their invite code.
4. Admin manages access from the Admin Panel — block, remove, restore, or promote any user at any time.
5. Scale up by upgrading the seat plan directly from the app when you hit the member limit.
⚠️
macOS users: If you see "GeneNet is damaged and can't be opened", run this command in Terminal:
xattr -cr ~/Downloads/GeneNet*.dmg
This removes the macOS quarantine flag from unsigned apps. You only need to do this once.
Whole-genome sequencing pipeline

NGSuite Coming Q3 2025

Variant calling, SNP/INDEL annotation, structural variant detection, population genetics. Powered by GATK4, BWA-MEM2, and VEP.

🔔 Get notified when NGSuite launches
Email info@bioinfocodex.com with subject: "Notify me — NGSuite"
Plasmid design and annotation

PlasmidMap Coming Q4 2025

Design plasmids, find restriction sites, annotate features, and export publication-ready circular maps. Import from GenBank.

🔔 Get notified when PlasmidMap launches
Email info@bioinfocodex.com with subject: "Notify me — PlasmidMap"
ChIP-seq and ATAC-seq pipeline

ChIPflow Coming 2026

Peak calling with MACS2, motif analysis with HOMER, differential chromatin accessibility, BigWig tracks for genome browsers.

🔔 Get notified when ChIPflow launches
Email info@bioinfocodex.com with subject: "Notify me — ChIPflow"
Metagenomics and microbiome pipeline

MetaFlow Coming 2026

16S rRNA and whole-metagenome shotgun analysis. Taxonomic classification, alpha/beta diversity, differential abundance with QIIME2 and Kraken2.

🔔 Get notified when MetaFlow launches
Email info@bioinfocodex.com with subject: "Notify me — MetaFlow"
Single-cell RNA-seq pipeline

scRNAflow Coming 2026

Single-cell RNA-seq analysis using Seurat. Cell clustering, UMAP visualisation, marker gene identification, trajectory analysis.

🔔 Get notified when scRNAflow launches
Email info@bioinfocodex.com with subject: "Notify me — scRNAflow"
Mass spectrometry proteomics pipeline

ProteomicsLab Coming 2026

Label-free quantification, TMT/iTRAQ, post-translational modification analysis, pathway enrichment. Powered by MaxQuant.

🔔 Get notified when ProteomicsLab launches
Email info@bioinfocodex.com with subject: "Notify me — ProteomicsLab"
🎓 Bioinformatics Academy

A comprehensive university-level curriculum — from molecular biology fundamentals to advanced genomic data analysis. Written by researchers, for researchers.

14 Modules · 50+ Topics · Real Code Examples · Beginner → Advanced
📘 Module 1 · Foundations
💻 Module 2 · Computing
🗄️ Module 3 · Databases
🔍 Module 4 · Sequence Analysis
🧬 Module 5 · RNA-seq
🔬 Module 6 · Genomics
📊 Module 7 · Statistics
🚀 Module 8 · Advanced
🧪 Module 9 · Proteomics
🏗️ Module 10 · Structural Bio
🧩 Module 11 · Epigenomics
🔧 Module 12 · Genome Assembly
👥 Module 13 · Pop. Genetics
🕸️ Module 14 · Systems Biology
Module 1 · Foundations

Molecular Biology & Bioinformatics Fundamentals

Before writing a single command, you must understand what you are analysing. This module covers the biological foundations every bioinformatician needs — DNA structure, gene expression, the central dogma, and why bioinformatics exists.

1.1 What is Bioinformatics?

Bioinformatics is an interdisciplinary science that develops and applies computational methods to analyse biological data. It sits at the intersection of biology, computer science, mathematics, and statistics. The field emerged because modern biology generates data at a scale that is impossible to analyse by hand — a single whole-genome sequencing run produces hundreds of gigabytes of raw data, and large consortia such as the UK Biobank have sequenced over 500,000 individuals.

💡
Think of a sequencing machine as a very sophisticated camera. It produces a "photograph" of millions of DNA/RNA fragments. Bioinformatics is the darkroom — the place where raw data becomes biological insight.

A Brief History

The roots of bioinformatics trace back to the 1960s, when Margaret Dayhoff compiled the first protein sequence database — the Atlas of Protein Sequence and Structure — and developed the PAM substitution matrices still used in sequence alignment today. Needleman and Wunsch (1970) and, a decade later, Smith and Waterman (1981) published the foundational sequence alignment algorithms. The launch of the Human Genome Project in 1990 was a watershed moment: a $3 billion, 13-year international effort that produced the first draft of the human genome in 2001. Since then, sequencing costs have plummeted — from roughly $100 million per genome in 2001 to under $200 today — falling faster than Moore's Law. This exponential cost reduction has democratised genomics and made bioinformatics skills essential for modern biological research.

1965 — Dayhoff's Atlas of Protein Sequence and Structure
1977 — Sanger sequencing developed
1990 — Human Genome Project begins
2001 — Draft human genome published
2005 — Next-gen sequencing (454, Illumina) arrives
2012 — CRISPR-Cas9 gene editing described
2020 — AlphaFold2 achieves near-experimental accuracy in protein structure prediction
2022 — T2T consortium completes full human genome

Why Does Bioinformatics Matter?

Drug Discovery

Identify drug targets by finding genes upregulated in disease. AlphaFold predicted 200M+ protein structures, accelerating structure-based drug design.

Agriculture

Improve crop resistance by identifying beneficial genetic variants in plant genomes through GWAS and marker-assisted selection.

Epidemiology

Track viral evolution in real time. SARS-CoV-2 variant tracking relied entirely on bioinformatics pipelines and GISAID.

What a Typical Bioinformatics Project Looks Like

A typical project follows a pipeline from experimental question to biological insight:

Define Question → Design Experiment → Generate Data (sequencing) → Quality Control → Alignment / Assembly → Quantification → Statistical Analysis → Biological Interpretation → Visualisation & Publication

At each step, the bioinformatician must make choices: Which aligner to use? What quality thresholds? How to handle batch effects? The answers depend on the specific data type (RNA-seq, ChIP-seq, WGS, etc.), the organism, and the biological question. This course will walk you through each step.

Major Sub-fields

Genomics — studying entire genomes, variant calling, assembly, GWAS
Transcriptomics — measuring which genes are active (RNA-seq, scRNA-seq)
Proteomics — identifying and quantifying proteins (mass spectrometry)
Metabolomics — measuring small molecules (metabolites) in cells
Structural Biology — 3D shapes of macromolecules (X-ray, cryo-EM, AlphaFold)
Metagenomics — sequencing entire microbial communities (16S, shotgun)
Epigenomics — DNA methylation, histone modifications, chromatin accessibility
Pharmacogenomics — how genetic variation affects drug response

Career Paths and Demand

Bioinformatics professionals are in very high demand. Common job titles include Bioinformatics Scientist, Computational Biologist, Genomics Data Analyst, and Research Software Engineer (Bioinformatics). Positions exist in academia, pharmaceutical companies, biotech startups, clinical diagnostics labs, and agricultural genomics companies.

Academic Research — postdoc, research associate, or bioinformatics core facility staff at universities
Pharma / Biotech — drug target identification, clinical trial genomics, companion diagnostics
Clinical Genomics — variant interpretation, genetic counselling support, diagnostic pipeline development
Industry Tools — software engineering at companies like Illumina, 10x Genomics, or Oxford Nanopore

The Tools Ecosystem

Bioinformatics relies on an ecosystem of mostly open-source tools. You will use a combination of command-line tools on Linux, scripting languages (Python, R, Bash), and specialised software for specific analyses.

Linux/Bash · Python · R / Bioconductor · Conda / Mamba · Nextflow / Snakemake · Galaxy · Docker / Singularity · HPC / SLURM
📝
Key Takeaways:
• Bioinformatics combines biology + CS + statistics to extract meaning from biological data
• The field was born from protein databases in the 1960s and exploded with the Human Genome Project
• Sequencing costs have fallen over a million-fold, making data generation easy — analysis is now the bottleneck
• A typical project flows from experimental design through QC, alignment, quantification, statistics, and interpretation
• Career demand is extremely high across academia, pharma, clinical, and industry settings
1.2 DNA, RNA and Proteins — The Central Dogma

Francis Crick's central dogma (1958) describes the flow of genetic information: DNA → RNA → Protein. Every bioinformatics workflow targets one of these molecules. Understanding this flow at a molecular level is essential because the tools you use (aligners, quantifiers, variant callers) all make assumptions about the underlying biology.

DNA (double-stranded; A, T, G, C) → transcription → mRNA (single-stranded; A, U, G, C) → translation → Protein (20 amino acids)

Base Pairing Rules

DNA is a double-stranded helix held together by hydrogen bonds between complementary bases. The pairing rules are strict and fundamental to replication, transcription, and sequencing technology:

In DNA
Adenine (A) pairs with Thymine (T) — 2 hydrogen bonds
Guanine (G) pairs with Cytosine (C) — 3 hydrogen bonds
G-C pairs are stronger, so GC-rich regions are harder to denature and can cause sequencing bias.
In RNA
Adenine (A) pairs with Uracil (U) — 2 hydrogen bonds
Guanine (G) pairs with Cytosine (C) — 3 hydrogen bonds
RNA uses uracil instead of thymine. FASTQ files still write T by convention, but remember the molecule is actually U.
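The pairing rules above translate directly into code. A minimal Python sketch (toy helper functions for illustration):

```python
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(dna):
    """Return the reverse complement of a DNA strand (read 5' -> 3')."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(dna))

def transcribe(coding_strand):
    """mRNA has the same sequence as the coding strand, with T replaced by U."""
    return coding_strand.replace("T", "U")

print(reverse_complement("ATGC"))   # GCAT
print(transcribe("ATGGCA"))         # AUGGCA
```

Reverse-complementing twice returns the original strand — a handy property for checking strand-handling code.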

Transcription — Step by Step

Transcription is the process of copying a gene's DNA sequence into messenger RNA (mRNA). It occurs in the nucleus of eukaryotic cells and involves several steps:

1. Initiation: Transcription factors bind to the promoter region (e.g., TATA box at ~-25 bp). RNA Polymerase II is recruited to form the pre-initiation complex.

2. Elongation: RNA Pol II moves along the template strand (3' → 5'), synthesising mRNA in the 5' → 3' direction. It reads the template strand and adds complementary RNA bases (A→U, T→A, G→C, C→G).

3. Termination: RNA Pol II reaches a polyadenylation signal (AAUAAA). The pre-mRNA is cleaved and released.

4. Processing: Three modifications occur before the mRNA leaves the nucleus:
  • 5' cap — a modified guanine nucleotide added to the 5' end (protects from degradation)
  • 3' poly-A tail — 100-250 adenine nucleotides added (stability, export signal). This is why poly-A selection works for mRNA isolation in RNA-seq.
  • Splicing — introns removed, exons joined by the spliceosome

Translation — The Genetic Code

Translation occurs at ribosomes in the cytoplasm. The mRNA sequence is read in triplets called codons, each specifying one of 20 amino acids. The code is degenerate — most amino acids are encoded by multiple codons (e.g., Leucine has 6 codons). Key codons to know:

AUG
Start codon
Codes for Methionine
UAA, UAG, UGA
Stop codons
"Ochre, Amber, Opal"
64 total codons
61 code for amino acids
3 are stop codons

The reading frame matters enormously. A single insertion or deletion shifts the entire frame downstream (a frameshift mutation), usually producing a nonfunctional protein. This is why indels in coding regions are often more damaging than substitutions.
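A toy translator makes the frameshift effect concrete (the codon table below is deliberately partial — only the codons used in this example, not all 64):

```python
# Partial codon table for illustration: one-letter amino acid codes, * = stop
CODON_TABLE = {
    "AUG": "M", "AAA": "K", "GCU": "A", "UUU": "F",
    "GGC": "G", "UUA": "L",
    "UAA": "*", "UAG": "*", "UGA": "*",   # the three stop codons
}

def translate(mrna):
    """Read codons from position 0 until a stop codon or the end of sequence."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

mrna = "AUGAAAGCUUUUUAA"          # codons: AUG AAA GCU UUU UAA
print(translate(mrna))            # MKAF

# A single-base insertion shifts every downstream codon (frameshift)
shifted = "AUGAAAGGCUUUUUAA"      # extra G inserted after the second codon
print(translate(shifted))         # MKGFL — a different protein entirely
```

Note how one inserted base changes every amino acid after the insertion point, which is why coding-region indels are usually far more disruptive than single-base substitutions.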

Post-Translational Modifications (PTMs)

After translation, proteins are often chemically modified. These modifications can change protein activity, location, stability, and interactions. Common PTMs include phosphorylation (adding a phosphate group — central to cell signalling), glycosylation (adding sugar chains — important for secreted proteins), ubiquitination (tagging for degradation by the proteasome), and acetylation (especially on histones, regulating gene expression). Proteomics and mass spectrometry are the primary tools for studying PTMs.

Why RNA is Unstable Compared to DNA

RNA is chemically less stable than DNA for two reasons: (1) the 2'-OH group on the ribose sugar makes RNA susceptible to alkaline hydrolysis, and (2) RNA is single-stranded, making it vulnerable to RNases (ubiquitous enzymes that degrade RNA). This instability is why RNA extraction requires careful technique — RNase-free conditions, working on ice, and rapid processing. It also means RNA quality degrades over time, which is why we measure RIN (RNA Integrity Number) before sequencing.

Exceptions to the Central Dogma

⚠️
The central dogma has important exceptions that are biologically and clinically significant:
Reverse Transcription

RNA → DNA. Performed by reverse transcriptase in retroviruses (HIV) and retrotransposons. Also used in the lab to create cDNA from mRNA (the basis of RNA-seq library preparation).

RNA Replication

RNA → RNA. RNA viruses (influenza, SARS-CoV-2) use RNA-dependent RNA polymerase (RdRp) to replicate their genomes. This enzyme is a drug target (e.g., remdesivir).

RNA Editing

Bases in mRNA can be chemically modified after transcription (e.g., A-to-I editing by ADAR enzymes). This means the mRNA sequence can differ from the genomic DNA — a potential pitfall in variant calling.

Prions

Protein → Protein. Misfolded prion proteins can template the misfolding of normal proteins without any nucleic acid involvement (the cause of mad cow disease / CJD).

Non-coding RNA

Not all RNA becomes protein. In fact, the vast majority of transcribed RNA in human cells is non-coding. Key types include: rRNA (ribosomal RNA — ~80% of total RNA, forms the ribosome structure), tRNA (transfer RNA — carries amino acids during translation), miRNA (microRNA — ~22 nt, silences gene expression post-transcriptionally), lncRNA (long non-coding RNA — >200 nt, diverse regulatory roles), and circRNA (circular RNA — resistant to degradation, potential biomarkers). These are increasingly studied in bioinformatics.

📝
Key Takeaways:
• The central dogma: DNA → RNA → Protein, with important exceptions (reverse transcription, RNA viruses, prions)
• Base pairing: A-T (DNA) / A-U (RNA) with 2 H-bonds; G-C with 3 H-bonds (stronger)
• Transcription produces pre-mRNA that is capped, polyadenylated, and spliced before export
• The genetic code is degenerate (64 codons for 20 amino acids + 3 stops); reading frame is critical
• RNA instability (2'-OH, RNases) has direct practical implications for sample handling and sequencing quality
1.3 Genome Organisation & Gene Structure

Understanding how genes are organised in the genome is essential for interpreting any bioinformatics result. When you load a BAM file in a genome browser, annotate variants, or run differential expression analysis, you are navigating genome structure. This section covers the physical and functional organisation of genomes, from the chromosome level down to individual regulatory elements.

Chromosomes and Chromatin

In eukaryotes, genomic DNA is packaged with histone proteins into chromatin, which is further organised into chromosomes. Humans have 23 pairs of chromosomes (22 autosomes + 1 pair of sex chromosomes, XX or XY), totalling about 3.2 billion base pairs. During cell division, each chromosome is replicated, and the two copies (called sister chromatids) are joined at the centromere before being separated.

Euchromatin

Loosely packed, transcriptionally active chromatin. Most genes are expressed from euchromatic regions. Appears light in chromosome staining.

Heterochromatin

Tightly packed, largely silent chromatin. Found at centromeres and telomeres. Contains repetitive sequences. Constitutive (always condensed) or facultative (conditionally silenced, e.g., the inactive X chromosome).

Genome Size and the C-Value Paradox

• E. coli — 4.6 Mb, ~4,300 genes
• Yeast — 12 Mb, ~6,000 genes
• Fly — 180 Mb, ~14,000 genes
• Human — 3.2 Gb, ~20,000 genes
• Wheat — 17 Gb, ~107,000 genes
• Paris japonica (plant) — 149 Gb

The C-value paradox refers to the observation that genome size does not correlate with organism complexity. Humans have 3.2 Gb but only ~20,000 protein-coding genes, while the onion has 16 Gb. The difference is largely due to repetitive elements — transposons, retrotransposons, and satellite DNA account for about 45% of the human genome. Only about 1.5% of the human genome actually encodes proteins.

Regulatory Elements

Promoters

Located immediately upstream of genes (within ~1 kb of the TSS). Contain elements like the TATA box (~-25 bp) and CpG islands (GC-rich regions, often unmethylated in active promoters). RNA polymerase and transcription factors bind here to initiate transcription.

Enhancers

Can be located thousands or even millions of base pairs away from their target gene — upstream, downstream, or even within introns. They loop in 3D space to contact the promoter and boost transcription. Enhancer activity is tissue-specific, explaining why the same genome produces different cell types.

CpG Islands

CpG islands are regions of at least 200 bp with a GC content above 50% and an observed/expected CpG ratio above 0.6. About 70% of human gene promoters are associated with CpG islands. In normal cells, most CpG islands in promoters are unmethylated, allowing gene expression. In cancer, aberrant hypermethylation of tumour suppressor gene promoters is a hallmark event. This is studied using bisulfite sequencing or methylation arrays (e.g., Illumina EPIC array).
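These thresholds are straightforward to compute. A simplified sketch (real CpG-island callers scan the genome with sliding windows; this just applies the three classic criteria to a single sequence):

```python
def gc_content(seq):
    """Fraction of bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def obs_exp_cpg(seq):
    """Observed/expected CpG ratio: count(CG) * N / (count(C) * count(G))."""
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)

def looks_like_cpg_island(seq):
    """Classic thresholds: >=200 bp, GC > 50%, obs/exp CpG > 0.6."""
    return len(seq) >= 200 and gc_content(seq) > 0.5 and obs_exp_cpg(seq) > 0.6

island_like = "CG" * 150               # 300 bp of pure CpG dinucleotides
print(looks_like_cpg_island(island_like))  # True
```

The obs/exp ratio matters because a GC-rich sequence can still be CpG-poor: in most of the genome the CG dinucleotide is depleted (methylated cytosines deaminate to T over evolutionary time), so an unusually high ratio is the signature of an island.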

Gene Structure — A Real Example

Consider the human BRCA1 gene (associated with breast cancer risk). It spans about 81 kb on chromosome 17, contains 23 exons, and produces a protein of 1,863 amino acids. The coding sequence itself is only about 5,592 bp — just 7% of the gene's genomic span. The rest consists of introns, UTRs, and regulatory regions.

BRCA1 gene structure (simplified):

5'---[Promoter/CpG island]---[5'UTR]---[Exon 1]---intron 1 (34 kb)---[Exon 2]---intron 2---...---[Exon 23]---[3'UTR]---[polyA signal]---3'

Green = exons (coding) | Grey = introns (removed by splicing) | Total span: ~81 kb

Alternative Splicing

A single gene can produce multiple protein variants (isoforms) through alternative splicing of its pre-mRNA. The human genome has ~20,000 protein-coding genes but produces over 200,000 different mRNA transcripts. Common types of alternative splicing include:

Exon skipping

The most common type in humans. An entire exon is excluded from the mature mRNA, producing a shorter protein isoform.

Alternative 5' or 3' splice site

A different splice site is used within the same exon, making the exon longer or shorter. Changes the protein sequence at the boundary.

Intron retention

An intron is not removed and remains in the mature mRNA. More common in plants and fungi. Can introduce premature stop codons.

Mutually exclusive exons

One of two exons is included, but never both. The choice is regulated in a tissue-specific manner. Example: the Drosophila Dscam gene can produce over 38,000 isoforms, largely through mutually exclusive exon choice.

⚠️
When aligning RNA-seq reads to a genome, use a splice-aware aligner (STAR, HISAT2) — not a simple aligner like BWA. Otherwise reads spanning exon-exon junctions will fail to align. If you need isoform-level analysis, consider tools like Salmon or Kallisto for transcript-level quantification, or PacBio IsoSeq for full-length transcript sequencing.
📝
Key Takeaways:
• Eukaryotic genomes are organised into chromosomes with euchromatin (active) and heterochromatin (silent) regions
• Genome size does not correlate with complexity (C-value paradox) — most large genomes are inflated by repetitive elements
• Promoters, enhancers, and CpG islands control when and where genes are expressed
• Gene structure: 5'UTR - exons/introns - 3'UTR; introns are spliced out during mRNA processing
• Alternative splicing enables one gene to produce multiple protein isoforms, greatly expanding the proteome
1.4 Sequencing Technologies

Most bioinformatics data comes from Next-Generation Sequencing (NGS). Understanding what the machine actually does explains why your data looks the way it does — including error profiles, quality score distributions, read length limitations, and the types of biases you will encounter. The three major platforms each have distinct chemistries with different strengths and trade-offs.

How Illumina Sequencing-by-Synthesis Works

Illumina is the dominant platform, generating the majority of the world's sequencing data. Here is the step-by-step process:

Step 1 — Library Preparation: DNA/cDNA is fragmented (~300-500 bp), and adapters are ligated to both ends. These adapters contain sequences for binding to the flow cell and for PCR amplification. An index (barcode) sequence is included to allow multiplexing of multiple samples on one lane.

Step 2 — Cluster Generation: The library is loaded onto a flow cell — a glass slide coated with millions of short oligos complementary to the adapters. Each fragment binds to the surface, then undergoes bridge amplification: the fragment bends over and binds to a nearby oligo, forming a bridge, which is then extended by polymerase. This cycle repeats ~35 times, creating a cluster of ~1,000 identical copies of the original fragment. Each cluster will produce one "read."

Step 3 — Sequencing by Synthesis: Fluorescently-labelled, reversibly-terminated nucleotides are washed across the flow cell. In each cycle, exactly one nucleotide is incorporated into each growing strand. The flow cell is imaged to determine which base was added (each of A, C, G, T has a different fluorescent colour). The terminator and fluorescent tag are then chemically removed, and the next cycle begins. Typically 100-150 cycles are performed per read.

Step 4 — Base Calling: The raw images are processed by Illumina's Real-Time Analysis (RTA) software to determine the base at each position and assign a quality score. The output is a FASTQ file — one per read direction (R1 and R2 for paired-end). A standard run produces billions of reads.

PHRED Quality Scores

Every base in a FASTQ file has an associated PHRED quality score (Q) that represents the probability of an incorrect base call. The relationship is logarithmic:

Q = -10 × log10(P)

Where P = probability of error

Q10 = 1 in 10 error rate (10%) — terrible
Q20 = 1 in 100 error rate (1%) — minimum acceptable
Q30 = 1 in 1,000 error rate (0.1%) — good quality
Q40 = 1 in 10,000 error rate (0.01%) — excellent

In FASTQ files, quality scores are encoded as ASCII characters. For Illumina 1.8+ (Phred+33 encoding), the character ! = Q0, I = Q40. When you see a quality string like FFFFFFFF, each F represents Q37 — very high quality. Quality typically degrades toward the end of reads, which is why trimming the 3' ends is common practice.
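The encoding is simple arithmetic — subtract 33 from each character's ASCII code. A small Python sketch:

```python
def phred_to_prob(q):
    """Invert Q = -10 * log10(P): error probability for a given quality score."""
    return 10 ** (-q / 10)

def decode_quality(qual_string, offset=33):
    """Decode a Phred+33 FASTQ quality string into integer Q scores."""
    return [ord(ch) - offset for ch in qual_string]

print(decode_quality("!IF"))   # [0, 40, 37] — '!' = Q0, 'I' = Q40, 'F' = Q37
print(phred_to_prob(30))       # 0.001 — one expected error per 1,000 bases
```

Tools like FastQC and fastp perform exactly this decoding internally when computing per-base quality plots and trimming thresholds.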

Technology Comparison

Feature      | Illumina                                 | PacBio HiFi                                    | Oxford Nanopore
Read length  | 75-300 bp                                | 10-25 kb                                       | Up to 4 Mb
Accuracy     | >99.9% (Q30+)                            | ~99.9% (HiFi mode)                             | ~95-99% (improving)
Throughput   | Up to 6 Tb (NovaSeq 6000)                | ~30 Gb per cell                                | ~50-300 Gb (PromethION)
Run time     | 12-48 hours                              | ~24 hours                                      | Minutes to days
Error type   | Substitutions                            | Indels (rare in HiFi)                          | Indels (homopolymers)
Cost per Gb  | ~$2-5                                    | ~$20-30                                        | ~$10-20
Best for     | RNA-seq, WGS, ChIP-seq, variant calling  | Genome assembly, structural variants, isoforms | Field diagnostics, real-time, direct RNA, methylation

When to Choose Each Technology

Choose Illumina when:

• Differential gene expression (RNA-seq)
• Whole-genome variant calling
• ChIP-seq / ATAC-seq
• Large sample numbers (multiplexing)
• Budget is a concern

Choose PacBio when:

• De novo genome assembly
• Full-length transcript isoforms (IsoSeq)
• Structural variant detection
• Repetitive region resolution
• Phased diploid assembly

Choose Nanopore when:

• Rapid pathogen identification
• Field work (MinION is portable)
• Direct RNA sequencing (no RT step)
• Epigenetic modifications (m5C, m6A)
• Ultra-long reads needed

💡
Hybrid approaches: Many genome assembly projects now combine Illumina short reads (for accuracy/polishing) with long reads (for scaffolding and resolving repeats). The T2T (Telomere-to-Telomere) consortium used PacBio HiFi and Oxford Nanopore ultra-long reads to complete the first truly gapless human genome assembly in 2022.
📝
Key Takeaways:
• Illumina sequencing-by-synthesis uses fluorescent reversible terminators to read one base per cycle across billions of clusters
• PHRED quality scores are logarithmic: Q30 means a 1/1,000 chance of error
• Quality degrades at read ends; Illumina errors are substitutions, while long-read errors are typically indels
• Choose the platform based on your application: Illumina for most quantitative assays, long-read for assembly and structural analysis
• Hybrid approaches combining short and long reads often give the best results for genome projects
Module 2 · Computing Skills

Linux, Terminal & Programming for Bioinformatics

Almost all bioinformatics tools run on Linux/macOS in the terminal. This module teaches you the computing skills you need before running any analysis — from basic commands to writing scripts.

2.1 Essential Linux / Terminal Commands

You will spend most of your time in the terminal. Linux (and the macOS terminal) is the native environment for nearly all bioinformatics tools, so mastering the command line is not optional — it is the single most important skill for a bioinformatician, and the commands below are non-negotiable knowledge.

Navigation and File Operations

# Navigation
pwd                        # Print current directory
ls -lh                     # List files (human-readable sizes)
ls -lhS                    # List files sorted by size (largest first)
ls -lht                    # List files sorted by modification time
cd /path/to/folder         # Change directory
cd ..                      # Go up one level
cd ~                       # Go to home directory
cd -                       # Go to previous directory (very useful!)

# File operations
cp source.txt dest.txt     # Copy file
cp -r dir1/ dir2/          # Copy directory recursively
mv old.txt new.txt         # Move or rename
rm file.txt                # Delete file (PERMANENT — no Trash!)
mkdir -p project/data/raw  # Create nested directories
ln -s /path/to/file link   # Create symbolic link (saves disk space)

# Viewing files
cat file.txt               # Print entire file
head -20 file.txt          # First 20 lines
tail -20 file.txt          # Last 20 lines
less file.txt              # Page through (press q to quit)
wc -l file.txt             # Count lines

# Searching
grep "pattern" file.txt    # Find lines matching pattern
grep -c ">" file.fa        # Count sequences in FASTA
grep -v "^#" file.vcf      # Exclude comment/header lines
find . -name "*.fastq.gz"  # Find files by name pattern

# Compression
gzip file.fastq            # Compress (creates .fastq.gz)
gunzip file.fastq.gz       # Decompress
zcat file.fastq.gz | head  # View compressed file without decompressing
⚠️
rm -rf is permanent and irreversible. There is no Trash. Always double-check before running delete commands. Never run rm -rf / or rm -rf * without absolute certainty.

File Permissions

Linux file permissions control who can read, write, and execute files. Understanding permissions is essential because scripts need execute permission, and shared HPC directories require correct group permissions.

# View permissions (the -rwxr-xr-- part of ls -l output)
ls -l script.sh
# Output: -rwxr-xr-- 1 user group 1234 Mar 20 10:00 script.sh
# r=read(4), w=write(2), x=execute(1)
# Positions: owner | group | others

# Make a script executable
chmod +x script.sh         # Add execute permission for everyone
chmod 755 script.sh        # rwx for owner, rx for group and others
chmod 644 data.txt         # rw for owner, read-only for group and others

# Change ownership (requires sudo or being the owner)
chown user:group file.txt  # Change owner and group
chgrp labgroup project/    # Change group for shared projects

Pipes and Redirection

Pipes (|) and redirection (>, >>) are the heart of Unix philosophy — chain small tools together to accomplish complex tasks. This is how most bioinformatics one-liners work.

# Pipe: send output of one command as input to the next
cat file.fastq | grep "^@" | wc -l   # Count reads (approximate — quality lines can also start with @)

# Redirect output to a file
samtools flagstat file.bam > stats.txt   # Overwrite file
echo "run complete" >> log.txt           # Append to file

# Redirect stderr separately from stdout
hisat2 -x index -U reads.fq -S out.sam 2> alignment_log.txt

# Redirect both stdout and stderr
command &> all_output.txt

# Powerful bioinformatics one-liners using pipes
# Count reads per chromosome from BAM:
samtools idxstats sorted.bam | cut -f1,3 | sort -k2 -nr | head

# Extract unique gene names from a GTF file:
grep -v "^#" genes.gtf | awk '$3=="gene"' | grep -o 'gene_name "[^"]*"' | sort -u | wc -l

Process Management

Bioinformatics jobs often run for hours. Knowing how to manage long-running processes is critical, especially when working over SSH.

# Monitor system resources
top         # Live process viewer (press q to quit)
htop        # Better version of top (install with conda)
df -h       # Check disk space
du -sh */   # Size of each subdirectory
free -h     # Check available RAM

# Run jobs in background
nohup long_command &   # Run in background, survives logout
jobs                   # List background jobs
kill %1                # Kill background job #1
kill -9 12345          # Force kill process by PID

# screen / tmux — persistent terminal sessions (essential for SSH)
screen -S mysession    # Start named session
# Ctrl+A, D to detach
screen -r mysession    # Reattach to session
screen -ls             # List sessions

# tmux (modern alternative to screen)
tmux new -s analysis     # Start named session
# Ctrl+B, D to detach
tmux attach -t analysis  # Reattach

SSH for HPC / Server Access

Most bioinformatics work is done on remote servers or High-Performance Computing (HPC) clusters. SSH (Secure Shell) is how you connect:

# Basic SSH connection
ssh username@hpc.university.edu

# SSH with key authentication (more secure, no password prompt)
ssh-keygen -t ed25519    # Generate key pair (once)
ssh-copy-id user@server  # Copy public key to server

# Transfer files with scp
scp local_file.txt user@server:/path/to/dest/
scp user@server:/path/to/results.txt ./

# Transfer files with rsync (better for large transfers — resumable)
rsync -avP local_dir/ user@server:/remote/dir/

Bioinformatics-Specific Tricks

# Run a command on multiple files with xargs
ls *.fastq.gz | xargs -I {} fastqc {}

# GNU parallel — run commands in parallel (much faster)
ls *.fastq.gz | parallel -j 4 "fastqc {}"

# Process all FASTQ pairs in a directory
for f in *_R1.fastq.gz; do
  r2=${f/_R1/_R2}
  sample=$(basename $f _R1.fastq.gz)
  echo "Processing $sample: $f and $r2"
done

# Quick stats: column-based operations with awk
awk '{sum+=$5} END {print "Mean:", sum/NR}' data.txt

# Sort and remove duplicate lines
sort -u gene_list.txt > unique_genes.txt

# Join two files by a common column
join -t$'\t' -1 1 -2 1 <(sort file1.txt) <(sort file2.txt)
📝
Key Takeaways:
• Master navigation (cd, ls, pwd), file operations (cp, mv, rm), and file viewing (head, tail, less, grep)
• Understand permissions (chmod 755) — your scripts must be executable to run
• Pipes and redirection are the foundation of bioinformatics one-liners
• Use screen/tmux for persistent sessions when working over SSH
• GNU parallel and xargs enable efficient batch processing of many files
2.2 Setting Up Your Environment (Conda)

Conda is the standard package manager for bioinformatics. It installs tools and manages dependencies so they don't conflict with each other. Proper environment management is one of the most important skills for reproducible research — if you cannot recreate your exact software environment, your results cannot be reproduced.

Installing Conda / Mamba

# Option A: Install Miniconda (traditional)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc

# Option B: Install Miniforge (RECOMMENDED — includes mamba by default)
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh
source ~/.bashrc

# Set up bioinformatics channels
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict   # Important for reproducibility
💡
Why Mamba over Conda? Mamba is a drop-in replacement for conda that uses a C++ solver instead of Python. It resolves dependencies 10-50x faster than conda. With Miniforge, mamba is included by default. Just replace conda install with mamba install — everything else is the same.

Creating and Managing Environments

# Create a project environment (ALWAYS use environments!)
mamba create -n rnaseq python=3.10
conda activate rnaseq

# Install tools into the environment
mamba install -c bioconda fastqc trim-galore hisat2 samtools subread
mamba install -c conda-forge r-base r-biocmanager

# List installed tools and their versions
conda list

# Save environment for reproducibility
conda env export > environment.yml
# Recreate on another machine:
conda env create -f environment.yml

# Useful management commands
conda env list                # List all environments
conda deactivate              # Return to base
conda env remove -n old_env   # Delete an environment
⚠️
Never install tools into your base environment. One project = one environment. This prevents version conflicts between projects and ensures reproducibility.

Containers: Docker and Singularity

Containers go beyond conda by packaging the entire operating system, libraries, and tools into a single portable image. This guarantees that the software runs identically on any machine. Docker is the standard for local development; Singularity (Apptainer) is the standard on HPC clusters because it does not require root privileges.

# Docker — pull and run a bioinformatics image
docker pull biocontainers/samtools:v1.17
docker run -v $(pwd):/data biocontainers/samtools:v1.17 samtools flagstat /data/aligned.bam

# Singularity — the HPC alternative (no root needed)
singularity pull docker://biocontainers/samtools:v1.17
singularity exec samtools_v1.17.sif samtools flagstat aligned.bam

# Many Nextflow/Snakemake pipelines use containers automatically

Project Directory Structure

Organising your project directory consistently saves enormous time and prevents errors. Here is a recommended structure:

# Create a standard project structure
mkdir -p my_rnaseq_project/{data/{raw,trimmed,aligned},results/{counts,deseq2,figures},scripts,logs,ref}

# Resulting structure:
# my_rnaseq_project/
#   data/
#     raw/       <-- raw FASTQ files (never modify these!)
#     trimmed/   <-- trimmed FASTQ files
#     aligned/   <-- BAM files
#   results/
#     counts/    <-- gene count matrices
#     deseq2/    <-- differential expression results
#     figures/   <-- publication-ready plots
#   scripts/     <-- analysis scripts (version controlled)
#   logs/        <-- log files from each step
#   ref/         <-- reference genome, GTF, indices

Version Control with Git

Track your analysis scripts with git. This provides a history of every change you made and allows collaboration. Do not track large data files — only scripts and small config files.

# Initialise a git repo in your project
cd my_rnaseq_project
git init
echo -e "data/\nresults/\nlogs/\n*.bam\n*.fastq.gz" > .gitignore
git add scripts/ .gitignore environment.yml
git commit -m "Initial project setup"

# Push to GitHub for backup and collaboration
git remote add origin https://github.com/user/project.git
git push -u origin main

HPC Module System

Many HPC clusters use the module system instead of (or alongside) conda to manage software. The module system allows administrators to install centrally-maintained software that users can load as needed.

# Common module commands on HPC
module avail                # List all available modules
module avail samtools       # Search for a specific tool
module load samtools/1.17   # Load a specific version
module list                 # Show currently loaded modules
module unload samtools      # Unload a module
module purge                # Unload all modules
📝
Key Takeaways:
• Use Mamba (via Miniforge) instead of plain Conda for much faster dependency resolution
• Always create separate environments for each project; never install into base
• Containers (Docker/Singularity) provide the highest level of reproducibility
• Maintain a consistent project directory structure and track scripts with git
• On HPC clusters, use the module system or Singularity containers
2.3 Bash Scripting for Automation

When you have 20 samples, you don't run commands one by one. Bash scripts automate repetitive tasks — this is the single most time-saving skill in bioinformatics. A well-written script ensures every sample is processed identically, provides a record of exactly what was done, and can be re-run when parameters need to change.

Variables, Arrays, and String Operations

#!/bin/bash

# Variables — no spaces around the = sign!
SAMPLE="control_rep1"
THREADS=8
GENOME="/ref/hg38/genome"

# String operations
echo ${SAMPLE}             # control_rep1
echo ${SAMPLE^^}           # CONTROL_REP1 (uppercase)
echo ${SAMPLE/_rep1/}      # control (substitution)
echo ${SAMPLE%.fastq.gz}   # Remove suffix

# Arrays — store lists of items
SAMPLES=("ctrl_1" "ctrl_2" "treat_1" "treat_2")
echo ${SAMPLES[0]}         # ctrl_1 (first element)
echo ${SAMPLES[@]}         # All elements
echo ${#SAMPLES[@]}        # 4 (number of elements)

# Loop over array
for s in "${SAMPLES[@]}"; do
  echo "Processing $s"
done

Conditionals (if/else)

#!/bin/bash

# Check if a file exists before processing
if [ -f "${SAMPLE}_R1.fastq.gz" ]; then
  echo "File found, processing..."
elif [ -d "$INPUT_DIR" ]; then
  echo "Directory exists but no file found"
else
  echo "ERROR: Input not found!" >&2
  exit 1
fi

# Check if a tool is installed
if ! command -v samtools &> /dev/null; then
  echo "ERROR: samtools not found. Install with: mamba install -c bioconda samtools"
  exit 1
fi

# Numeric comparisons
READ_COUNT=$(zcat sample.fastq.gz | wc -l)
READ_COUNT=$((READ_COUNT / 4))
if [ "$READ_COUNT" -lt 1000000 ]; then
  echo "WARNING: Only $READ_COUNT reads — too few for reliable DE analysis"
fi

Error Handling — set -euo pipefail

By default, bash scripts continue running even when a command fails. This is dangerous — if trimming fails silently, you might align garbage data. Always start scripts with safety flags:

#!/bin/bash
set -euo pipefail
# -e: Exit immediately if any command fails
# -u: Treat unset variables as an error
# -o pipefail: If any command in a pipe fails, the whole pipe fails

# Without pipefail, this would silently fail if hisat2 crashes:
hisat2 -x index -U reads.fq | samtools sort -o out.bam
# With pipefail, the script stops and reports the error
⚠️
set -euo pipefail should be on line 2 of every bioinformatics script. Without it, you may not notice when a critical step fails, leading to incorrect results downstream.

Logging Output

#!/bin/bash
set -euo pipefail

# Create a log file with timestamp
LOG="logs/pipeline_$(date +%Y%m%d_%H%M%S).log"
exec > >(tee -a "$LOG") 2>&1   # Send stdout and stderr to both screen and log

echo "Pipeline started at $(date)"
echo "Working directory: $(pwd)"
echo "Conda environment: $CONDA_DEFAULT_ENV"
echo "---"

# Your pipeline commands here...

echo "Pipeline finished at $(date)"

Full RNA-seq Pipeline Script

#!/bin/bash
set -euo pipefail

# ─────────────────────────────────────────
# RNA-seq preprocessing pipeline
# Usage: bash pipeline.sh
# ─────────────────────────────────────────

# Define paths (edit these)
RAW_DIR="./data/raw"
TRIM_DIR="./data/trimmed"
ALIGN_DIR="./data/aligned"
GENOME_DIR="./ref"
THREADS=8

# Create output directories
mkdir -p "$TRIM_DIR" "$ALIGN_DIR" "logs"

# Loop over all samples
for R1 in ${RAW_DIR}/*_R1.fastq.gz; do
  SAMPLE=$(basename "$R1" _R1.fastq.gz)
  R2="${RAW_DIR}/${SAMPLE}_R2.fastq.gz"
  echo "=== Processing: $SAMPLE ==="

  # Check paired file exists
  if [ ! -f "$R2" ]; then
    echo "ERROR: Paired file not found: $R2" >&2
    exit 1
  fi

  # Trim adapters with Trim Galore
  trim_galore --paired --cores $THREADS \
    --fastqc -o "$TRIM_DIR" \
    "$R1" "$R2" 2>&1 | tee "logs/${SAMPLE}_trim.log"

  # Align to genome
  hisat2 -p $THREADS -x "${GENOME_DIR}/hisat2_index" \
    -1 "${TRIM_DIR}/${SAMPLE}_R1_val_1.fq.gz" \
    -2 "${TRIM_DIR}/${SAMPLE}_R2_val_2.fq.gz" \
    2> "logs/${SAMPLE}_align.log" \
    | samtools sort -@ $THREADS -o "${ALIGN_DIR}/${SAMPLE}.bam"

  samtools index "${ALIGN_DIR}/${SAMPLE}.bam"
  echo "=== Done: $SAMPLE ==="
done

echo "All samples processed!"

Running Scripts on HPC with SLURM

On most HPC clusters, you do not run jobs directly. Instead, you submit them to a job scheduler (most commonly SLURM). SLURM allocates resources and runs your script when resources become available.

#!/bin/bash
#SBATCH --job-name=rnaseq_align
#SBATCH --output=logs/slurm_%j.out   # %j = job ID
#SBATCH --error=logs/slurm_%j.err
#SBATCH --time=12:00:00              # Max wall time: 12 hours
#SBATCH --cpus-per-task=8            # Number of CPU cores
#SBATCH --mem=32G                    # RAM request
#SBATCH --partition=general          # Queue/partition name

set -euo pipefail

# Load conda environment
source ~/.bashrc
conda activate rnaseq

# Run the pipeline
bash scripts/pipeline.sh

# Submit with:  sbatch scripts/slurm_align.sh
# Monitor with: squeue -u $USER
# Cancel with:  scancel JOB_ID
💡
SLURM array jobs let you submit one script that runs many times with different inputs — perfect for processing many samples in parallel: #SBATCH --array=1-20 and use $SLURM_ARRAY_TASK_ID to select the sample.
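The array-job tip above can be sketched as a minimal script. The file names here (samples.txt, pipeline_one_sample.sh) are hypothetical placeholders for your own project files; a tiny demo sample list is written inline so the sketch runs standalone, and the array range should match your real number of samples.

```shell
#!/bin/bash
#SBATCH --job-name=rnaseq_array
#SBATCH --array=1-4            # One task per line of samples.txt
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=12:00:00

set -euo pipefail

# Demo sample list (in a real project this file already exists)
printf 'ctrl_1\nctrl_2\ntreat_1\ntreat_2\n' > samples.txt

# Outside SLURM (e.g. a local dry run) default the task ID to 1
TASK_ID="${SLURM_ARRAY_TASK_ID:-1}"

# Each array task picks the N-th sample name from the list
SAMPLE=$(sed -n "${TASK_ID}p" samples.txt)
echo "Array task ${TASK_ID} -> sample ${SAMPLE}"

# ...then run the per-sample commands, e.g.:
# bash scripts/pipeline_one_sample.sh "$SAMPLE"
```

Submitted with sbatch, SLURM launches four copies of this script, each with a different $SLURM_ARRAY_TASK_ID, so all samples process in parallel.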
📝
Key Takeaways:
• Always start scripts with set -euo pipefail to catch errors early
• Use variables for paths and parameters — makes scripts reusable
• Log everything: timestamps, tool versions, and parameters used
• Test on one sample before running on all; check outputs at each step
• On HPC, use SLURM (#SBATCH directives) to request resources and submit jobs
2.4 R for Bioinformatics — Getting Started

R is the language of choice for statistical analysis, visualisation, and Bioconductor packages (DESeq2, edgeR, limma). You don't need to be a software engineer — a working knowledge of R is enough to run the statistical analyses and produce publication-quality plots that make up the second half of most bioinformatics projects.

RStudio Setup

RStudio is the standard IDE for R. It provides a console, script editor, plot viewer, file browser, and help panel in one interface. Install R first (from CRAN or via conda), then install RStudio Desktop (free). For HPC, you can use RStudio Server through a web browser.

# Install via conda (recommended for reproducibility)
mamba install -c conda-forge r-base=4.3 r-tidyverse r-biocmanager rstudio-desktop

# Or install R packages within R:
install.packages(c("tidyverse", "ggplot2", "pheatmap", "RColorBrewer"))

# Install Bioconductor packages
if (!require("BiocManager")) install.packages("BiocManager")
BiocManager::install(c("DESeq2", "edgeR", "clusterProfiler",
                       "EnhancedVolcano", "ComplexHeatmap", "org.Hs.eg.db"))

Reading and Writing Data

# Read different file formats
counts <- read.csv("counts.csv", row.names = 1)     # CSV
counts <- read.delim("counts.txt", row.names = 1)   # Tab-delimited
counts <- readRDS("counts.rds")                     # R binary (fast, compact)

# Inspect the data
head(counts)          # First 6 rows
dim(counts)           # Dimensions: genes x samples
str(counts)           # Structure of the object
summary(counts[, 1])  # Summary statistics for first column

# Write outputs
write.csv(results, "deseq2_results.csv")
saveRDS(dds, "deseq2_object.rds")   # Save full R object (can reload later)

Tidyverse Data Wrangling

The tidyverse is a collection of R packages (dplyr, tidyr, ggplot2, readr, etc.) that share a common design philosophy. The pipe operator (|> or %>%) chains operations together, making code readable.

library(tidyverse)

# Read metadata and filter
metadata <- read_csv("metadata.csv")
metadata |>
  filter(condition == "treated") |>        # Keep only treated samples
  select(sample_id, condition, batch) |>   # Select specific columns
  arrange(batch) |>                        # Sort by batch
  mutate(label = paste(condition, batch, sep = "_"))   # Add new column

# Summarise expression across groups
results <- read_csv("deseq2_results.csv")
results |>
  filter(!is.na(padj)) |>                           # Remove NA values
  filter(padj < 0.05, abs(log2FoldChange) > 1) |>   # Significant DEGs
  arrange(padj) |>                                  # Sort by significance
  head(20)                                          # Top 20

ggplot2 for Genomics Visualisation

ggplot2 is the most powerful plotting system in R. It uses a "grammar of graphics" — you build plots by adding layers (geom_point, geom_boxplot, etc.) to a base aesthetic mapping.

library(ggplot2)

# Volcano plot — the classic DE visualisation
ggplot(results, aes(x = log2FoldChange, y = -log10(padj))) +
  geom_point(aes(color = padj < 0.05 & abs(log2FoldChange) > 1),
             size = 0.8, alpha = 0.6) +
  scale_color_manual(values = c("grey70", "red3"),
                     labels = c("NS", "Significant")) +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "grey40") +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed", color = "grey40") +
  labs(x = "Log2 Fold Change", y = "-Log10 Adjusted P-value",
       title = "Differential Expression: Treated vs Control") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom")

# Save plot as publication-quality PDF
ggsave("figures/volcano_plot.pdf", width = 7, height = 6, dpi = 300)

# Boxplot of gene expression across conditions
ggplot(expr_long, aes(x = condition, y = log2(count + 1), fill = condition)) +
  geom_boxplot(outlier.size = 0.5) +
  facet_wrap(~gene, scales = "free_y") +
  theme_bw()

R Markdown for Reproducible Reports

R Markdown lets you combine code, results, and narrative text in a single document. When you "knit" the document, R runs all the code and produces a formatted HTML or PDF report. This is the gold standard for reproducible analysis.

# In an .Rmd file:
---
title: "RNA-seq Analysis Report"
author: "Your Name"
date: "`r Sys.Date()`"
output: html_document
---

## Quality Control

```{r qc-summary}
library(DESeq2)
dds <- readRDS("deseq2_object.rds")
plotPCA(vst(dds), intgroup = "condition")
```

## Differential Expression

```{r de-results}
res <- results(dds, contrast = c("condition", "treated", "control"))
summary(res)
```
📌
Recommended R learning path: (1) R for Data Science by Hadley Wickham (free online at r4ds.had.co.nz) → (2) Bioconductor Workflows at bioconductor.org → (3) DESeq2 vignette → (4) ggplot2 book (ggplot2-book.org)
📝
Key Takeaways:
• Install R + RStudio for interactive analysis; use Bioconductor for genomics-specific packages
• The tidyverse (dplyr, ggplot2, tidyr) is essential for data wrangling and visualisation
• ggplot2 builds plots in layers — master aes(), geom_*, and theme() for publication figures
• Use R Markdown to combine code, results, and narrative in reproducible reports
• Save R objects with saveRDS() so you can reload them without re-running slow computations
Module 3 · Databases & Data

Biological Databases & File Formats

Public databases hold billions of sequences, structures, and experiments — and they are all free. Knowing which database to use and how to download data efficiently is an essential skill.

3.1 The Major Biological Databases

Public biological databases are one of the greatest resources in science. They contain billions of sequences, millions of structures, and thousands of curated experiments — all freely accessible. Knowing which database to search, how to interpret accession numbers, and how to programmatically retrieve data are essential skills for any bioinformatician.

The Core Databases

NCBI Multi-purpose

National Center for Biotechnology Information. Houses GenBank (nucleotide sequences), SRA (raw sequencing data), PubMed (literature), RefSeq (curated reference sequences), and ClinVar (clinical variants). Start here for most searches. URL: ncbi.nlm.nih.gov

Ensembl Genome

Best source for annotated reference genomes and GTF annotation files for 300+ species. Provides BioMart for bulk queries. Use this to download genome FASTA and GTF files. URL: ensembl.org

UniProt Protein

Comprehensive protein database. Swiss-Prot = manually curated (~570K entries, high quality). TrEMBL = computationally annotated (~250M entries). URL: uniprot.org

RCSB PDB Structure

Protein Data Bank. 200,000+ experimentally determined 3D structures (X-ray, cryo-EM, NMR). Now complemented by AlphaFold DB with 200M+ predicted structures. URL: rcsb.org

GEO / SRA Expression

Gene Expression Omnibus (processed data, metadata) and Sequence Read Archive (raw FASTQ files). Search by organism + condition + technology. Most published datasets are deposited here.

KEGG / Reactome Pathways

KEGG: metabolic pathways, drug targets, disease associations. Reactome: curated biological pathways and reactions, especially good for human data. Essential for pathway enrichment analysis.

STRING Interactions

Protein-protein interaction networks for 5000+ species. Input a gene list and get a functional interaction network. URL: string-db.org

Gene Ontology (GO) Annotation

Standardised vocabulary for gene function. Three domains: Biological Process (BP), Molecular Function (MF), Cellular Component (CC). Used in enrichment analysis. URL: geneontology.org

Accession Number Formats

Every entry in a database has a unique accession number. Recognising the format tells you which database it comes from:

Pattern            Example            Database             What It Is
NM_######          NM_007294          NCBI RefSeq          mRNA transcript
NP_######          NP_009225          NCBI RefSeq          Protein
ENSG###########    ENSG00000012048    Ensembl              Gene (human)
ENST###########    ENST00000357654    Ensembl              Transcript
P##### or Q#####   P38398             UniProt Swiss-Prot   Curated protein
SRR#######         SRR1234567         NCBI SRA             Sequencing run
GSE######          GSE53697           NCBI GEO             Dataset/Series
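As a quick illustration, the prefixes in the table can be matched with a small shell function. This is a sketch: the glob patterns are simplified prefix checks, not exhaustive validators, and classify_accession is a made-up helper name.

```shell
# Classify an accession number by its prefix (patterns follow the table above).
classify_accession() {
  case "$1" in
    NM_[0-9]*)   echo "NCBI RefSeq mRNA" ;;
    NP_[0-9]*)   echo "NCBI RefSeq protein" ;;
    ENSG[0-9]*)  echo "Ensembl gene" ;;
    ENST[0-9]*)  echo "Ensembl transcript" ;;
    SRR[0-9]*)   echo "NCBI SRA run" ;;
    GSE[0-9]*)   echo "NCBI GEO series" ;;
    [PQ][0-9]*)  echo "UniProt Swiss-Prot protein" ;;
    *)           echo "Unknown format" ;;
  esac
}

classify_accession ENSG00000012048   # Ensembl gene
classify_accession SRR1234567        # NCBI SRA run
```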

How Databases Link to Each Other

Databases are extensively cross-referenced. A gene page on NCBI Gene links to its RefSeq transcript (NM_), RefSeq protein (NP_), UniProt entry, Ensembl ID, associated GEO datasets, ClinVar variants, and PDB structures. Similarly, an Ensembl gene page links out to NCBI, UniProt, KEGG, and more. This interconnection means you can start from any database and navigate to the information you need.

NCBI E-utilities API

For programmatic access, NCBI provides the E-utilities API — a set of URL-based tools for searching and retrieving records. This is essential for automating data retrieval in scripts.

# Install E-utilities via conda
mamba install -c bioconda entrez-direct

# Search PubMed for articles about BRCA1 and RNA-seq
esearch -db pubmed -query "BRCA1 AND RNA-seq" | efetch -format abstract | head -50

# Fetch a nucleotide sequence in FASTA format
efetch -db nucleotide -id NM_007294 -format fasta > BRCA1_mRNA.fa

# Search SRA for human RNA-seq datasets
esearch -db sra -query '"Homo sapiens"[Organism] AND "RNA-Seq"[Strategy]' | efetch -format runinfo | head

How to Cite Databases

💡
Always cite the databases you use in your Methods section. Each database publishes a yearly "Database Issue" paper in Nucleic Acids Research. For example: "Reference genome and annotation were obtained from Ensembl release 111 (Cunningham et al., 2024). Raw sequencing data were downloaded from the NCBI Sequence Read Archive (accession SRP123456)."
📝
Key Takeaways:
• NCBI is the starting point for most searches; Ensembl is best for genome downloads; UniProt for protein information
• Learn to recognise accession number formats (NM_, ENSG, P#####, SRR) to identify the source database
• Databases are cross-linked — navigate from gene to transcript to protein to structure to pathway
• Use NCBI E-utilities for programmatic access in automated pipelines
• Always cite the databases and their version/release numbers in publications
3.2 Essential File Formats Explained

Each step of a bioinformatics analysis produces a different file format. Recognising these formats, understanding their structure, and knowing how to convert between them is essential. Misinterpreting a file format is one of the most common causes of analysis errors.

FASTA (.fa / .fasta)

The simplest sequence format. Each entry has a header line starting with > followed by one or more lines of sequence. Used for reference genomes, transcriptomes, and protein sequences.

# Example FASTA file
>ENST00000357654.9 BRCA1-201 cds chromosome:GRCh38:17:43044295:43125483:-1
ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGCTATGCAGAAA
ATCTTAGAGTGTCCCATCTGTCTGGAGTTGATCAAGGAACCTGTCTCCACAAAGTGTGAC
...
💡
Useful one-liners: grep -c "^>" file.fa counts the number of sequences. samtools faidx genome.fa creates an index (.fai) needed by many tools. The index file lists each sequence name, length, offset, line bases, and line bytes.

FASTQ (.fastq / .fastq.gz)

Raw sequencing reads. Every read is exactly 4 lines. FASTQ files are almost always gzip-compressed (.fastq.gz) to save space.

# Example FASTQ entry (4 lines per read)
@SRR1234567.1 1 length=150                      # Line 1: Header (starts with @)
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG    # Line 2: Sequence
+                                               # Line 3: Separator (always +)
IIIIIIIIIIIIIHHHHHGGGGFFFDDDBBBB@@@@>>>555222   # Line 4: Quality scores (ASCII)

# Quality scores: I=Q40(99.99%), F=Q37, 5=Q20(99%), !=Q0

# Count reads: divide total lines by 4
echo $(( $(zcat sample.fastq.gz | wc -l) / 4 ))
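The quality encoding can be decoded by hand: in Phred+33 (standard Illumina), a character's quality score is its ASCII value minus 33, and the error probability is 10^(-Q/10). A small sketch (phred_q is a made-up helper for illustration):

```shell
# Decode a Phred+33 quality character: Q = ASCII value - 33
phred_q() {
  echo $(( $(printf '%d' "'$1") - 33 ))
}

phred_q "I"   # 40 -> 1 in 10,000 error chance
phred_q "5"   # 20 -> 1 in 100 error chance
phred_q "!"   # 0  -> no confidence

# Convert a Q score to an error probability
awk -v q=30 'BEGIN {printf "Q%d error prob = %.4f\n", q, 10^(-q/10)}'
# Q30 error prob = 0.0010
```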

SAM / BAM / CRAM

Alignment files. SAM is human-readable text; BAM is the compressed binary equivalent (5-10x smaller); CRAM compresses even further by referencing the genome. Always store BAM, never SAM.

# SAM format (tab-separated, 11 mandatory columns)
# QNAME    FLAG  RNAME  POS       MAPQ  CIGAR  RNEXT  PNEXT     TLEN  SEQ          QUAL
SRR123.1   99    chr17  43044295  60    150M   =      43044450  305   ATCGATCG...  IIIIIIII...

# Key samtools commands
samtools view -bS aligned.sam > aligned.bam   # SAM to BAM
samtools sort aligned.bam -o sorted.bam       # Sort by coordinate
samtools index sorted.bam                     # Create .bai index
samtools flagstat sorted.bam                  # Quick alignment stats

GTF / GFF3

Gene annotation files define where genes, exons, transcripts, and other features are located in the genome. Essential for gene counting (featureCounts) and transcript quantification.

# GTF format: 9 tab-separated columns
# chr   source   feature     start     end       score  strand  frame  attributes
chr17   ensembl  gene        43044295  43125483  .      -       .      gene_id "ENSG00000012048"; gene_name "BRCA1";
chr17   ensembl  transcript  43044295  43125483  .      -       .      gene_id "ENSG00000012048"; transcript_id "ENST00000357654";
chr17   ensembl  exon        43124017  43125483  .      -       .      gene_id "ENSG00000012048"; exon_number "1";

VCF (.vcf / .vcf.gz)

Variant Call Format — the standard for storing genetic variants. Covered in detail in Module 6.

BED (.bed)

Simple genomic interval format. Minimum 3 columns: chrom, start (0-based), end. Used for ChIP-seq peaks, exon coordinates, target regions, blacklisted regions.

⚠️
Coordinate systems matter! BED uses 0-based, half-open coordinates (start is 0-indexed, end is exclusive). GTF/GFF and VCF use 1-based, inclusive coordinates. A region at chr1:100-200 in BED is the same as chr1:101-200 in GTF. Mixing these up is a very common source of off-by-one errors.
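The off-by-one conversion described above can be sketched with awk. The interval below is a made-up demo region, not real annotation:

```shell
# Demo BED line: the region chr1:100-200 in BED coordinates (0-based, half-open)
printf 'chr1\t100\t200\tdemo_region\n' > demo.bed

# BED -> 1-based inclusive (GTF/VCF style): add 1 to the start, keep the end
awk 'BEGIN{OFS="\t"} {print $1, $2 + 1, $3, $4}' demo.bed
# chr1    101     200     demo_region

# 1-based inclusive -> BED: subtract 1 from the start, keep the end
printf 'chr1\t101\t200\n' | awk 'BEGIN{OFS="\t"} {print $1, $2 - 1, $3}'
# chr1    100     200
```

Note that the end coordinate never changes: BED's exclusive end and GTF's inclusive end land on the same number, which is exactly why the off-by-one error is so easy to make.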

Converting Between Formats

# SAM <-> BAM <-> CRAM
samtools view -bS input.sam > output.bam          # SAM to BAM
samtools view -h input.bam > output.sam           # BAM to SAM
samtools view -C -T ref.fa input.bam > out.cram   # BAM to CRAM

# GTF <-> BED (extract gene coordinates)
awk '$3=="gene"' genes.gtf | awk '{print $1"\t"$4-1"\t"$5"\t"$10}' | tr -d '";' > genes.bed

# BED operations with bedtools
bedtools intersect -a peaks.bed -b genes.bed      # Overlapping regions
bedtools getfasta -fi genome.fa -bed regions.bed  # Extract sequences

# Validate file integrity
samtools quickcheck sorted.bam && echo "OK" || echo "CORRUPTED"
md5sum file.fastq.gz > file.md5   # Generate checksum
md5sum -c file.md5                # Verify checksum
📝
Key Takeaways:
• FASTA = sequences, FASTQ = sequences + quality, BAM = aligned reads, GTF = gene annotations, VCF = variants, BED = intervals
• Always store alignments as BAM (or CRAM), never SAM — the space savings are enormous
• Watch out for coordinate systems: BED is 0-based half-open; GTF and VCF are 1-based inclusive
• Use samtools and bedtools for format conversion and manipulation
• Always verify file integrity with checksums (md5sum) after transfers
3.3 Downloading Data from SRA & Ensembl

Most bioinformatics projects begin by downloading data — either raw sequencing data from public repositories (to reproduce or extend a study) or reference genomes and annotations for alignment. This section covers how to find, download, and verify data efficiently.

Finding SRA Accession Numbers from a Paper

Published papers deposit raw data in public repositories. To find the accession number:

1. Check the paper's Data Availability section — it will list a GEO accession (GSE######) or BioProject (PRJNA######).
2. Go to NCBI GEO and search for the GSE number.
3. Click "SRA Run Selector" at the bottom of the GEO page.
4. This gives you a table of all SRR accessions and their metadata. Download the Accession List text file.
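Once you have the Run Selector metadata table, the accession list is just the Run column of the CSV. A sketch (the two data rows below are made-up placeholders standing in for real metadata):

```shell
# Hypothetical Run Selector / runinfo export
cat > runinfo_demo.csv <<'EOF'
Run,LibraryStrategy,Organism
SRR0000001,RNA-Seq,Homo sapiens
SRR0000002,RNA-Seq,Homo sapiens
EOF

# Skip the header, keep column 1 (Run): one accession per line
tail -n +2 runinfo_demo.csv | cut -d',' -f1 > SRR_Acc_List.txt
cat SRR_Acc_List.txt
# SRR0000001
# SRR0000002
```

The resulting SRR_Acc_List.txt is exactly the input format expected by the batch fasterq-dump loop in the next code block.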

Downloading from NCBI SRA

# Install SRA toolkit
mamba install -c bioconda sra-tools

# Download single-end data
fasterq-dump SRR1234567 --outdir ./data/raw --progress

# Download paired-end data (creates _1.fastq and _2.fastq)
fasterq-dump SRR1234567 --outdir ./data/raw --split-files --progress

# Compress immediately (fasterq-dump outputs uncompressed!)
gzip ./data/raw/SRR1234567*.fastq

# Batch download from accession list
cat SRR_Acc_List.txt | parallel -j 4 "fasterq-dump {} --outdir ./data/raw --split-files && gzip ./data/raw/{}*.fastq"
⚠️
fasterq-dump outputs uncompressed FASTQ! A single sample can be 20-50 GB uncompressed. Always gzip immediately after download. Also, fasterq-dump uses a cache in ~/ncbi/ which can fill your home directory — set vdb-config --set /repository/user/cache-disabled=true or use --temp /scratch/tmp.

Faster Downloads with Aspera and ENA

Aspera is a high-speed transfer protocol that can be 10-50x faster than wget/ftp. NCBI and ENA both support it. The European Nucleotide Archive (ENA) is an alternative to SRA that is often faster, especially if you are in Europe, and provides pre-compressed FASTQ files directly.

# Download from ENA (often faster, provides .fastq.gz directly)
# Search: https://www.ebi.ac.uk/ena/browser/
# ENA URLs follow a predictable pattern:
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR123/007/SRR1234567/SRR1234567_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR123/007/SRR1234567/SRR1234567_2.fastq.gz

# Using Aspera for fast downloads
mamba install -c bioconda aspera-cli
ascp -QT -l 300m -P 33001 -i ~/.aspera/cli/etc/asperaweb_id_dsa.openssh \
  era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR123/007/SRR1234567/SRR1234567_1.fastq.gz \
  ./data/raw/

Downloading Reference Genomes from Ensembl

# Human genome (GRCh38) — use primary_assembly (no alt/patch contigs)
wget https://ftp.ensembl.org/pub/release-111/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

# Human GTF annotation (matching release!)
wget https://ftp.ensembl.org/pub/release-111/gtf/homo_sapiens/Homo_sapiens.GRCh38.111.gtf.gz

# Mouse (GRCm39)
wget https://ftp.ensembl.org/pub/release-111/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
wget https://ftp.ensembl.org/pub/release-111/gtf/mus_musculus/Mus_musculus.GRCm39.111.gtf.gz

# Decompress
gunzip *.fa.gz *.gtf.gz
💡
Always use primary_assembly (not top_level) to avoid alt contigs and patches that can cause ambiguous alignments. And always ensure your genome FASTA and GTF come from the same Ensembl release — mismatched versions cause chromosome naming mismatches and counting errors.

Verifying Downloaded Data

# Verify file integrity after download
wget https://ftp.ensembl.org/pub/release-111/fasta/homo_sapiens/dna/CHECKSUMS
# Note: Ensembl's CHECKSUMS files are produced with the Unix `sum` command
sum Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# Compare the output (checksum and block count) with the CHECKSUMS entry;
# for providers that publish md5 hashes, use md5sum instead

# Quick sanity checks for FASTQ files
zcat sample_R1.fastq.gz | head -8                   # Check first 2 reads look normal
echo $(( $(zcat sample_R1.fastq.gz | wc -l) / 4 ))  # Count total reads

# Check that R1 and R2 have the same number of reads
echo "R1: $(( $(zcat sample_R1.fastq.gz | wc -l) / 4 )) reads"
echo "R2: $(( $(zcat sample_R2.fastq.gz | wc -l) / 4 )) reads"

Organising Your Data

# Recommended project data structure
mkdir -p project/{data/raw,data/trimmed,data/aligned,ref,results,scripts,logs}

# Rename SRR files to meaningful sample names using a mapping file
# mapping.txt:  SRR1234567  control_rep1
while read srr name; do
  mv data/raw/${srr}_1.fastq.gz data/raw/${name}_R1.fastq.gz
  mv data/raw/${srr}_2.fastq.gz data/raw/${name}_R2.fastq.gz
done < mapping.txt
📝
Key Takeaways:
• Find SRR accessions via GEO Run Selector (from the paper's Data Availability section)
• fasterq-dump outputs uncompressed files — always gzip immediately to save disk space
• ENA (European) often provides faster downloads with pre-compressed FASTQ files
• Always verify downloads against the provider's checksum file and check that R1/R2 files have matching read counts
• Keep genome FASTA and GTF from the same Ensembl release to avoid naming mismatches
Module 4 · Sequence Analysis

BLAST, Alignment & Sequence Comparison

Sequence comparison is the backbone of bioinformatics. Whether finding a gene's function, aligning reads, or building a phylogenetic tree — everything starts with comparing sequences.

4.1 BLAST — Finding Similar Sequences

BLAST (Basic Local Alignment Search Tool) is the single most important tool in bioinformatics. Developed at NCBI in 1990, it compares a query sequence against a database and returns statistically significant matches. Understand it deeply — you will use it almost every day.

How BLAST Works (The Algorithm)

BLAST does NOT compare every position. Instead it uses a three-step heuristic: (1) Word finding — breaks your query into short words (default 11 bp for blastn, 3 aa for blastp) and finds exact matches in the database. (2) Extension — extends each word match left and right, tracking a running score. Stops when the score drops significantly. (3) Evaluation — calculates E-value for each high-scoring segment pair (HSP). This heuristic makes BLAST fast — but it can miss very divergent homologues.
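Step 1 is easy to make concrete with a toy word-extraction sketch (the 16 bp query is invented; real queries yield thousands of words):

```shell
# Enumerate a query's words: blastn's default word size is 11
query="ACGTACGTTAGCCGAT"   # toy 16 bp query
w=11
: > words.txt
for (( i=0; i <= ${#query} - w; i++ )); do
  echo "${query:i:w}" >> words.txt   # each overlapping 11-mer is a seed
done
cat words.txt   # 16 - 11 + 1 = 6 words
```

Each of these seeds is looked up in the database; only seed matches get the (expensive) extension step, which is why BLAST is fast and why lowering the word size increases sensitivity at a speed cost.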

The Five BLAST Programs

blastn

Nucleotide → nucleotide. Use for: verifying a cloned sequence, finding a gene in a new genome. Fast but only finds close matches. For very similar sequences use megablast (default on NCBI web).

blastp

Protein → protein. Much more sensitive than blastn for detecting distant homologues — amino acid conservation is stronger than nucleotide. Uses substitution matrices (BLOSUM62 default).

blastx

Translates nucleotide query in all 6 reading frames → searches protein database. Use when you have an unknown DNA sequence and want to find the protein it encodes.

tblastn

Protein query → translated nucleotide database. Use for finding a gene in an unannotated genome — you have a known protein and want to find its genomic location.

PSI-BLAST

Position-Specific Iterated BLAST. Iteratively builds a profile from initial hits and re-searches. Much more sensitive for distant homology. Use for: finding remote protein family members.

DIAMOND

Not BLAST, but a modern alternative. 500–20,000× faster for blastx/blastp searches. Essential for metagenomics where you BLAST millions of reads. mamba install -c bioconda diamond

Understanding BLAST Output

⚠️
E-value is NOT a p-value. E-value = the number of hits with a score this good that you would expect by random chance in a database of this size. An E-value of 0.001 means one such chance hit in every 1,000 searches. E-value < 1e-5 is generally significant. E-value depends on database size — the same alignment has a different E-value in a small vs a large database, so only compare E-values from the same search.

Key output fields in tabular format (-outfmt 6):

pident — % identity (how similar)
length — alignment length
evalue — statistical significance
bitscore — alignment quality (higher = better, database-independent)
qcovs — % of query covered by the alignment
sseqid — subject (database hit) ID
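Because -outfmt 6 is plain tab-separated text, hits can be filtered with standard Unix tools. A sketch using a mocked-up, trimmed-down results table (columns and thresholds here are illustrative):

```shell
# Mocked-up tabular BLAST output
# (columns here: qseqid sseqid pident length evalue bitscore qcovs)
printf 'gene1\thitA\t98.5\t500\t1e-180\t900\t95\n'  > results.tsv
printf 'gene1\thitB\t35.2\t120\t2e-03\t48\t20\n'   >> results.tsv
printf 'gene2\thitC\t88.0\t300\t1e-90\t550\t80\n'  >> results.tsv

# Keep hits with E-value < 1e-5 AND query coverage >= 50%
awk '$5 + 0 < 1e-5 && $7 + 0 >= 50' results.tsv > filtered.tsv
cat filtered.tsv
```

hitB is removed on both counts: its E-value (2e-03) is too weak and only 20% of the query is covered. Filtering on coverage as well as E-value is what catches short, meaningless local matches.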

Complete BLAST Workflow

# Install BLAST
mamba install -c bioconda blast

# 1. Build a local BLAST database
makeblastdb -in genome.fa -dbtype nucl -out blastdb/genome
makeblastdb -in proteins.fa -dbtype prot -out blastdb/proteome

# 2. Run blastn with useful output format
blastn -query unknown_gene.fa -db blastdb/genome \
  -out results.tsv \
  -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs" \
  -evalue 1e-10 -max_target_seqs 5 -num_threads 8

# 3. Run blastp for protein function annotation
blastp -query my_protein.fa -db nr -remote \
  -out results_nr.tsv -outfmt 6 -evalue 1e-5

# 4. Run blastx to find protein encoded by a DNA sequence
blastx -query unknown_cdna.fa -db blastdb/proteome \
  -out blastx_results.tsv -outfmt 6 -evalue 1e-5

# 5. Reciprocal Best Hit (RBH) — the gold standard for orthology
# Run BLAST in both directions, keep pairs that are best hits in BOTH
blastp -query speciesA.fa -db speciesB_db -outfmt 6 -evalue 1e-10 > A_vs_B.tsv
blastp -query speciesB.fa -db speciesA_db -outfmt 6 -evalue 1e-10 > B_vs_A.tsv
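The workflow above stops after running BLAST in both directions; the actual reciprocal filtering can be sketched with sort and awk. The three-column tables below are invented miniatures of real best-hit extracts:

```shell
# Invented miniature best-hit extracts (query, subject, bitscore)
cat > A_vs_B.best <<'EOF'
geneA1 geneB1 500
geneA1 geneB9 90
geneA2 geneB2 300
EOF
cat > B_vs_A.best <<'EOF'
geneB1 geneA1 480
geneB2 geneA7 120
EOF

# Best hit per query = the row with the highest bitscore
best_hits() { sort -k1,1 -k3,3nr "$1" | awk '!seen[$1]++ { print $1, $2 }'; }
best_hits A_vs_B.best > bestAB.txt
best_hits B_vs_A.best > bestBA.txt

# Keep only pairs that are each other's best hit
awk 'NR==FNR { best[$1]=$2; next } best[$2]==$1 { print $1, $2 }' \
    bestBA.txt bestAB.txt > rbh_pairs.txt
cat rbh_pairs.txt
```

Only geneA1/geneB1 survives: geneA2's best hit is geneB2, but geneB2's best hit points back to geneA7, so the pair is not reciprocal and is dropped.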
💡
When BLAST isn't enough: BLAST misses very divergent homologues. For sensitive remote homology detection, use HMMER (profile-HMM searches — much more sensitive than PSI-BLAST) or HHblits (hidden Markov model vs HMM database). These tools are standard for protein domain annotation.

Key Takeaways

• E-value < 1e-5 for significance
• blastp is more sensitive than blastn
• Always check query coverage, not just E-value
• Use DIAMOND for large-scale searches
• Reciprocal Best Hit for orthology
4.2 Multiple Sequence Alignment (MSA)

Multiple Sequence Alignment (MSA) arranges three or more biological sequences so that corresponding positions (homologous residues) line up in columns. This reveals evolutionary conservation: columns where all sequences have the same residue are under strong selective pressure — they are functionally important. MSA is the foundation for phylogenetics, protein domain identification, and functional annotation.

How MSA Algorithms Work

An exact optimal MSA is computationally infeasible for more than ~10 sequences (NP-hard problem). All practical tools use heuristics:

Progressive (ClustalW, MUSCLE v3)

Build a guide tree from pairwise distances, then align sequences in order of relatedness. Fast but early errors propagate.

Iterative (MUSCLE v5, MAFFT)

Align progressively, then refine by re-aligning subgroups. Better accuracy. MAFFT L-INS-i is the gold standard for accuracy.

Consistency-based (T-Coffee)

Uses all pairwise alignments to guide the MSA. Very accurate for small sets. Too slow for >200 sequences.

Gap Penalties — Why They Matter

Gaps represent insertions or deletions (indels) in evolution. The gap opening penalty controls how reluctant the aligner is to insert a gap. The gap extension penalty controls how costly it is to make an existing gap longer. High opening + low extension → fewer, longer gaps (appropriate for eukaryotic proteins). Low opening → many small gaps (may over-fragment the alignment).

Which Tool to Use?

MAFFT

Best overall choice. Use --auto to let MAFFT choose strategy. For <200 sequences: mafft --maxiterate 1000 --localpair (L-INS-i, most accurate). For >10,000 sequences: mafft --retree 1.

Recommended
MUSCLE v5

Excellent accuracy, competitive with MAFFT. Newer "super5" algorithm scales well. Use muscle -super5 input.fa -output aligned.fa for large datasets.

# ─── Complete MSA workflow ───
mamba install -c bioconda mafft muscle trimal

# MAFFT — automatic mode (recommended start)
mafft --auto sequences.fa > aligned.fa

# MAFFT — maximum accuracy (for <200 seqs, publication-quality)
mafft --maxiterate 1000 --localpair sequences.fa > aligned_accurate.fa

# MUSCLE v5 — large datasets
muscle -super5 sequences.fa -output aligned.fa

# ─── Trim poorly aligned regions ───
# ALWAYS trim before building phylogenetic trees
trimal -in aligned.fa -out trimmed.fa -automated1

# Check alignment stats
trimal -in aligned.fa -sgc     # Show column conservation scores
trimal -in aligned.fa -sident  # Show sequence identity matrix

# ─── Visualise alignment ───
# Web: Jalview (jalview.org) — colour by conservation
# Web: MView (ebi.ac.uk/Tools/msa/mview/) — quick coloured view
# Desktop: AliView — fast lightweight viewer
⚠️
Common MSA mistakes: (1) Including too-divergent sequences — they add noise, not signal. Remove sequences with <20% identity. (2) Not trimming before phylogenetics — gappy regions create artefactual branches. (3) Using ClustalW — it is outdated and less accurate than MAFFT/MUSCLE. Use modern tools.

Key Takeaways

• MAFFT --auto for most cases
• Always trim before tree building
• L-INS-i for best accuracy (<200 seqs)
• Visualise with Jalview or AliView
• Conserved columns = functional importance
4.3 Phylogenetic Trees — Evolutionary Analysis

A phylogenetic tree is a branching diagram that represents the evolutionary relationships among biological sequences or species. Trees are constructed from multiple sequence alignments and use mathematical models to infer how sequences have diverged over time. Understanding how to build, read, and interpret phylogenetic trees is a core skill in evolutionary biology, epidemiology, and comparative genomics.

Reading a Phylogenetic Tree

Every tree has key structural elements. The root represents the most recent common ancestor. Internal nodes represent inferred ancestral sequences (speciation or divergence events). Tips (also called leaves) are the extant sequences in your analysis. Branch length represents the amount of evolutionary change — typically measured in substitutions per site. A branch of length 0.05 means 5 nucleotide substitutions per 100 sites have occurred along that branch.
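Trees are stored as Newick strings, and the anatomy described above maps directly onto the notation. A toy example (the tree and branch lengths are invented):

```shell
# A toy Newick tree: four tips with branch lengths (substitutions/site);
# D plays the outgroup role
echo "((A:0.05,B:0.07):0.02,(C:0.10,D:0.30):0.01);" > toy.treefile

grep -oE '[A-Za-z_]+:' toy.treefile | tr -d ':' > tips.txt   # tip names
grep -oE ':[0-9.]+'    toy.treefile | tr -d ':' > lens.txt   # branch lengths
cat tips.txt
cat lens.txt
```

Four tips, six branches (four terminal plus two internal): the long branch to D (0.30) means more change along that lineage, not necessarily more time.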

Tree-Building Methods

Neighbour-Joining (NJ)

Fast, distance-based heuristic. Computes a pairwise distance matrix and groups the closest neighbours iteratively. Good for quick exploratory trees with many sequences. Available in MEGA and the Clustal family; FastTree offers a similarly fast approximate maximum-likelihood alternative.

Maximum Likelihood (ML)

Statistical optimisation approach that evaluates the probability of the observed alignment given a tree topology and substitution model. Best method for publication-quality trees. Use IQ-TREE2 or RAxML-NG.

Bayesian Inference

Probabilistic framework using MCMC sampling to explore tree space. Produces posterior probability support values. MrBayes for phylogenetics, BEAST2 for molecular clock and divergence time estimation. Computationally expensive but powerful.

When to use which?

NJ: quick-and-dirty exploration (>500 sequences). ML: standard research publication. Bayesian: when you need divergence time estimates or posterior probabilities. For most projects, ML with IQ-TREE2 is the right choice.

Rooting a Tree with an Outgroup

An unrooted tree shows relationships but not the direction of evolution. To root a tree, include an outgroup — a sequence known to be more distantly related than any ingroup member. For example, when building a tree of mammalian species, include a reptile as the outgroup. The root is placed on the branch leading to the outgroup, which polarises all other relationships.

Bootstrap Support Interpretation

Bootstrap values assess the reliability of each node. The alignment is resampled with replacement many times (typically 1000), and each resampled dataset produces a new tree. The bootstrap value is the percentage of times a given clade appears across all resampled trees.

>90% — Strongly supported

High confidence in this branching pattern. Reliable for biological conclusions.

70–90% — Moderately supported

Generally considered acceptable. Report with caution.

50–70% — Weakly supported

Unreliable. The relationship may change with more data or different methods.

<50% — No support

The branching pattern is essentially random at this node. Do not interpret.

Substitution Models

Substitution models describe how nucleotides or amino acids change over evolutionary time. Choosing the right model is critical for accurate tree inference.

DNA: JC69

Simplest model. Assumes all substitutions are equally likely and all bases have equal frequency. Rarely realistic but useful as a baseline.

DNA: GTR (General Time Reversible)

Most general model. Six substitution rate parameters and four base frequency parameters. Usually the best fit for real data. Often combined with +G (gamma rate variation) and +I (invariant sites).

Protein: WAG

Empirical amino acid substitution model based on globular protein alignments. Good general-purpose choice for protein trees.

Protein: LG

Newer and generally better-fitting than WAG. Estimated from a larger, more diverse set of protein alignments. Recommended as the default for protein phylogenetics.

💡
Model selection: Use IQ-TREE2's -m TEST flag to run automatic model selection, which tests candidate substitution models and picks the best one by BIC (-m MFP extends the search to the full ModelFinder model set). Never guess — always let the software choose.

Complete IQ-TREE2 Workflow

# Install IQ-TREE2
conda install -c bioconda iqtree

# Step 1: Automatic model selection + ML tree + bootstrap
iqtree2 -s trimmed_alignment.fa \
  -m TEST \
  -B 1000 \
  -T AUTO \
  --prefix my_tree

# -m TEST: ModelFinder — tests all substitution models
# -B 1000: 1000 ultrafast bootstrap replicates
# -T AUTO: use optimal number of CPU threads

# Output files:
#   my_tree.treefile — the best ML tree (Newick format)
#   my_tree.iqtree   — full report with model parameters
#   my_tree.log      — run log

# Step 2: For more rigorous support, use SH-aLRT + ultrafast bootstrap
iqtree2 -s trimmed_alignment.fa -m TEST -B 1000 -alrt 1000 -T AUTO

# Step 3: For protein alignments
iqtree2 -s protein_alignment.fa -m TEST -B 1000 -T AUTO -st AA

Tree Visualisation

# Option 1: FigTree (desktop application)
#   Download from: https://github.com/rambaut/figtree/releases
#   Open my_tree.treefile in FigTree
#   Display bootstrap values: Node Labels → Display → label
#   Colour branches by clade, adjust branch width, export as PDF/SVG

# Option 2: iTOL (Interactive Tree Of Life — web-based)
#   Upload my_tree.treefile to https://itol.embl.de
#   Supports annotation files for metadata (colours, bars, heatmaps)
#   Export publication-quality figures in SVG/PDF/PNG

# Option 3: ggtree in R (programmatic, highly customisable)
#   library(ggtree)
#   tree <- read.tree("my_tree.treefile")
#   ggtree(tree) + geom_tiplab() + geom_nodelab(aes(label=label))
⚠️
Common mistake: Interpreting branch length as time. Branch lengths represent substitutions per site unless you explicitly calibrate a molecular clock (e.g., with BEAST2). Two long branches do not mean two species diverged long ago — they could simply have evolved rapidly.
• Root = common ancestor
• Branch length = substitutions/site
• Bootstrap >70 = supported
• GTR+G = standard DNA model
• IQ-TREE2 -m TEST = auto model
• FigTree / iTOL for visualisation
• Always include an outgroup
Module 5 · Transcriptomics

RNA-seq Analysis — From Raw Reads to Gene Lists

RNA sequencing (RNA-seq) measures gene expression across the entire transcriptome. This module takes you through every step of the pipeline with detailed explanations of what each tool does and why.

5.1 Experimental Design — The Most Critical Step

No amount of computational analysis can rescue a poorly designed experiment. Experimental design decisions — replication, randomisation, library preparation — are made before any sequencing occurs and cannot be corrected downstream. This section covers the critical choices that determine whether your RNA-seq experiment will produce reliable, publishable results or generate data that no statistical method can salvage.

Statistical Power: How Many Replicates Do You Need?

Statistical power is the probability of detecting a true difference when one exists. Underpowered studies miss real biology; overpowered studies waste resources. Use the RNASeqPower R package to estimate sample size before your experiment.

# Power analysis with RNASeqPower
library(RNASeqPower)

# How many replicates to detect 2-fold change (log2FC=1)?
rnapower(depth=20, cv=0.4, effect=1, alpha=0.05, power=0.8)
# Result: ~3 replicates per group

# What about detecting smaller changes (1.5-fold, log2FC=0.58)?
rnapower(depth=20, cv=0.4, effect=0.58, alpha=0.05, power=0.8)
# Result: ~6 replicates per group

# Parameters:
#   depth  = sequencing depth in millions of reads
#   cv     = biological coefficient of variation (0.1–0.4 typical)
#   effect = log2 fold change you want to detect
#   alpha  = significance threshold (0.05)
#   power  = desired power (0.8 = 80%)
💡
Rule of thumb: 3 replicates can detect large changes (2-fold). To detect subtle but biologically important changes (1.5-fold), you need 6+ replicates. More replicates always outperform deeper sequencing for differential expression.

Biological vs Technical Replicates

Biological replicates

Independent biological samples (different mice, different cell culture flasks, different patients). These capture true biological variability and are essential for statistical inference. Minimum 3 per condition, ideally 5–6.

Technical replicates

The same sample sequenced multiple times, or the same library prepared twice. These only measure technical noise (which is very low in modern RNA-seq). Technical replicates do NOT substitute for biological replicates and are generally a waste of sequencing budget.

Randomisation and Blocking

Batch effects are systematic non-biological variations caused by processing samples at different times, on different machines, or by different technicians. They can completely confound your results if not properly handled.

⚠️
Critical mistake: Never process all treatment samples on Monday and all control samples on Tuesday. If you do, you cannot distinguish the treatment effect from the day-of-processing effect. Instead, balance conditions across batches: process 2 treatment + 2 control samples on each day.

Use a randomised block design: assign samples to batches randomly while ensuring each batch contains samples from every condition. Record all batch information (date, technician, lane, flowcell) in your metadata — you can correct for known batches computationally, but only if you recorded them.
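The balancing logic can be sketched in a few lines of shell: deal each condition's samples round-robin across the batches. The sample sheet below is invented; for a real design, shuffle within each condition first (e.g. with shuf):

```shell
# Invented sample sheet: sample, condition
cat > samples.txt <<'EOF'
ctrl_1 control
ctrl_2 control
ctrl_3 control
ctrl_4 control
trt_1 treatment
trt_2 treatment
trt_3 treatment
trt_4 treatment
EOF

# Deal each condition's samples round-robin across 2 batches, so every
# batch contains samples from every condition
sort -k2,2 samples.txt \
  | awk -v n=2 '{ b = seq[$2]++ % n + 1; print $1, $2, "batch" b }' \
  > batches.txt
cat batches.txt
```

Each batch ends up with 2 control + 2 treatment samples, so a batch effect can never masquerade as a treatment effect.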

RNA Quality

RNA integrity is measured by the RIN (RNA Integrity Number), scored 1–10 by a Bioanalyzer or TapeStation. Degraded RNA introduces systematic 3' bias because fragmented transcripts lose their 5' ends preferentially.

RIN > 7

Good quality. Suitable for standard RNA-seq with poly-A selection. This is the minimum for most experiments.

RIN 4–7

Partially degraded. Use rRNA depletion instead of poly-A selection. Consider FFPE-optimised protocols if working with clinical samples.

Library Preparation Choices

Poly-A selection vs rRNA depletion

Poly-A captures only mRNA (coding genes). Cheaper, cleaner signal. rRNA depletion captures mRNA + lncRNA + other non-coding RNA. Use rRNA depletion for degraded samples or when non-coding RNA is of interest.

Single-end (SE) vs Paired-end (PE)

SE reads one end of each fragment. PE reads both ends. PE is recommended for: novel isoform discovery, structural variant detection, and improved alignment to repetitive regions. For standard differential expression, SE 50bp is often sufficient and cheaper.

Paired vs unpaired experimental design

If you measure the same patient before and after treatment, use a paired design (~ patient + treatment in DESeq2). Paired designs are more powerful because they control for inter-individual variation.

Multiplexing & barcoding

Multiple samples pooled on one lane with unique index barcodes. Reduces cost and minimises lane-to-lane batch effects. Ensure all conditions are represented on each lane. Use dual-index barcodes to prevent index hopping on patterned flowcells.

📌
Sequencing depth guidelines: Differential expression: 20–30M reads/sample. Rare transcripts or lncRNA: 50–100M. Allele-specific expression: 100M+. For most bulk RNA-seq DE experiments, 25M PE 150bp reads is a good target.
• Biological reps > technical reps
• 3 reps minimum, 6 ideal
• Balance batches across conditions
• RIN > 7 for poly-A
• PE for isoform discovery
• Use RNASeqPower for sample size
• Record ALL batch metadata
5.2 Quality Control — FastQC & Trimming

Quality control is the first analytical step in any sequencing project. Before aligning reads, you must assess their quality to identify problems such as adapter contamination, low-quality bases, and library preparation artefacts. FastQC generates a per-sample report, MultiQC aggregates all reports for easy comparison, and trimming tools remove problematic sequences. Skipping QC risks propagating errors through the entire downstream analysis.

Running FastQC and MultiQC

# Run FastQC on all samples (parallelised)
mkdir -p qc_reports
fastqc raw_data/*.fastq.gz -o qc_reports/ -t 8

# Aggregate all FastQC reports with MultiQC
pip install multiqc
multiqc qc_reports/ -o qc_reports/multiqc_report

# Open the MultiQC report in your browser:
#   qc_reports/multiqc_report/multiqc_report.html
# Look for: samples with outlier GC content, low quality, high adapter %

The MultiQC report shows all samples side-by-side. Look for outlier samples that differ dramatically from the rest — these may indicate contamination, sample mix-ups, or failed library preparation. The General Statistics table shows total reads, % duplicates, and % GC per sample.

FastQC Modules — What Each Failure Actually Means

Per Base Sequence Quality

Should stay in the green zone (>Q28). Quality commonly drops at the 3' end — this is normal for Illumina chemistry. If quality drops below Q20 before position 50, investigate the sequencing run. Action: Trim low-quality 3' bases if they fall below Q20.

Per Sequence Quality Scores

Distribution of mean quality per read. Should be a single peak above Q30. A bimodal distribution or a shoulder below Q20 means a subset of reads are poor quality. Action: Filter reads with mean quality below Q20.

Per Base Sequence Content (FAIL at pos 1-10)

Non-uniform base composition at the first 10 positions is normal and expected for RNA-seq. It is caused by random hexamer priming bias during cDNA synthesis. Action: No action needed. Do NOT trim these bases.

Adapter Content (FAIL)

Adapter sequences detected in reads. This happens when insert size is shorter than read length. Action: You MUST trim adapters before alignment. Use Trim Galore or fastp.

Sequence Duplication Levels (WARN)

A WARN here is completely normal for RNA-seq. Highly expressed genes (ribosomal proteins, housekeeping genes) naturally produce many identical reads. Only worry if duplication exceeds 80% or if you see it in DNA-seq data. Action: None for RNA-seq.

Overrepresented Sequences

Shows sequences that make up >0.1% of all reads. Common culprits: adapter dimers, rRNA contamination, or highly expressed transcripts. Action: If rRNA is dominant, your rRNA depletion failed. If adapter dimers are present, trim aggressively.

Diagnostic Decision Tree

📌
If Adapter Content FAIL → Trim with Trim Galore or fastp (mandatory).
If Per Base Quality drops below Q20 → Trim low-quality bases from the 3' end.
If GC Content shows bimodal distribution → Suspect contamination from another organism. Run FastQ Screen.
If Overrepresented Sequences shows rRNA → Your rRNA depletion failed. Check remaining samples. Consider computational rRNA removal with SortMeRNA.
If Sequence Duplication WARN → Normal for RNA-seq. Ignore.
If Per Base Content FAIL at positions 1-10 → Normal for random hexamer priming. Ignore.

Trimming Tools Compared

Trim Galore

Wrapper around Cutadapt. Auto-detects adapter type. Simple to use. Good default choice. Slightly slower than fastp.

fastp

All-in-one tool: QC + trimming + filtering in a single pass. Very fast (written in C++). Generates its own HTML report. Increasingly popular as a replacement for FastQC + Trim Galore.

Trimmomatic

Java-based. Flexible with many options. Older tool. Requires specifying adapter sequences manually. Still widely used but fastp is generally preferred now.

When NOT to trim

If using pseudoalignment (Salmon or kallisto), trimming is usually unnecessary — these tools handle adapters and low-quality bases internally. Trimming can even reduce accuracy by creating artificially short fragments.

Trimming Commands

# Option 1: Trim Galore (recommended for beginners)
conda install -c bioconda trim-galore
trim_galore --paired --cores 8 \
  --quality 20 \
  --length 36 \
  sample_R1.fastq.gz sample_R2.fastq.gz \
  --output_dir trimmed/
# --quality 20: trim bases with Q < 20 from 3' end
# --length 36: discard reads shorter than 36bp after trimming

# Option 2: fastp (faster, all-in-one QC + trim)
conda install -c bioconda fastp
fastp -i sample_R1.fastq.gz -I sample_R2.fastq.gz \
  -o trimmed/sample_R1.fastq.gz -O trimmed/sample_R2.fastq.gz \
  --detect_adapter_for_pe \
  --qualified_quality_phred 20 \
  --length_required 36 \
  --html trimmed/sample_fastp.html \
  --thread 8

# Re-run FastQC after trimming to verify improvement
fastqc trimmed/*.fastq.gz -o qc_reports/post_trim/ -t 8
multiqc qc_reports/post_trim/ -o qc_reports/post_trim_multiqc/
⚠️
Always re-run FastQC after trimming to confirm that adapter contamination is gone and quality profiles have improved. If adapter content persists, your adapter sequence may not be in the default database — specify it manually with --adapter.
• FastQC per sample, MultiQC for overview
• Duplication WARN is normal for RNA-seq
• Adapter FAIL = must trim
• fastp = fastest all-in-one
• No trimming needed for Salmon/kallisto
• Re-run QC after trimming
5.3 Alignment with HISAT2 & STAR

Alignment (or mapping) is the process of determining where each sequencing read originated in the reference genome. For RNA-seq, this is complicated by splicing — a single read may span an exon-exon junction, so standard DNA aligners will fail. Splice-aware aligners like HISAT2 and STAR can split a read across introns to find the correct genomic position. The output is a BAM file containing the coordinates and quality of each alignment.

HISAT2 vs STAR

HISAT2

Memory-efficient (~8 GB RAM for human genome), fast, and accurate. Uses a graph-based FM index that incorporates known splice sites and SNPs. Excellent for most organisms. Recommended when RAM is limited or for non-model organisms.

Recommended
STAR

Slightly more accurate, especially for novel splice junctions, but requires ~30 GB RAM for the human genome. Required by some downstream tools (e.g., STAR-Fusion for fusion gene detection). Gold standard in ENCODE and GTEx projects.

Alignment Workflow

# Build HISAT2 genome index (do once per genome)
hisat2-build -p 8 genome.fa genome_index/hisat2_index

# Align paired-end reads
hisat2 -p 8 -x genome_index/hisat2_index \
  -1 sample_R1_trimmed.fastq.gz \
  -2 sample_R2_trimmed.fastq.gz \
  --dta \
  --rna-strandness RF \
  2> sample_hisat2.log \
  | samtools sort -@ 8 -o sample.bam

# Index the BAM file (required for most downstream tools)
samtools index sample.bam

# Check alignment statistics
samtools flagstat sample.bam

SAM/BAM Format Explained

SAM (Sequence Alignment/Map) is a tab-delimited text format. BAM is its compressed binary equivalent. Each alignment line has 11 mandatory fields:

# Example SAM line (one read alignment):
READ001  83  chr1  12345  60  75M200N25M  =  12800  555  ACGT...  FFFF...

# Field breakdown:
#  1 QNAME: READ001     — Read name
#  2 FLAG:  83          — Bitwise flag (see below)
#  3 RNAME: chr1        — Reference chromosome
#  4 POS:   12345       — 1-based leftmost position
#  5 MAPQ:  60          — Mapping quality (60 = uniquely mapped)
#  6 CIGAR: 75M200N25M  — 75bp match, 200bp intron, 25bp match
#  7 RNEXT: =           — Mate maps to same chromosome
#  8 PNEXT: 12800       — Mate position
#  9 TLEN:  555         — Template/fragment length
# 10 SEQ:   ACGT...     — Read sequence
# 11 QUAL:  FFFF...     — Base quality (Phred+33)

FLAG Values Decoded

The FLAG field is a bitwise combination of properties. Common flags:

FLAG 83 (0x53)

= 1 (paired) + 2 (proper pair) + 16 (reverse strand) + 64 (first in pair). A properly paired read on the reverse strand, first mate.

FLAG 163 (0xA3)

= 1 (paired) + 2 (proper pair) + 32 (mate reverse) + 128 (second in pair). A properly paired read on forward strand, second mate.

FLAG 4

= unmapped read. These reads did not align to the reference genome. Filter them with samtools view -F 4.

FLAG 256

= secondary alignment. The read mapped to multiple locations. This is the non-primary alignment. Filter with samtools view -F 256.

💡
Decode any FLAG value at broadinstitute.github.io/picard/explain-flags.html — enter a number and it shows exactly which bits are set.
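You can also decode a FLAG locally with shell arithmetic, since each property is a single bit. A small sketch covering the common bits:

```shell
# Decode a SAM FLAG into its set bits (subset of the SAM spec bit names)
decode_flag() {
  local flag=$1 out=""
  (( flag & 1 ))    && out="$out paired"
  (( flag & 2 ))    && out="$out proper_pair"
  (( flag & 4 ))    && out="$out unmapped"
  (( flag & 8 ))    && out="$out mate_unmapped"
  (( flag & 16 ))   && out="$out reverse_strand"
  (( flag & 32 ))   && out="$out mate_reverse"
  (( flag & 64 ))   && out="$out first_in_pair"
  (( flag & 128 ))  && out="$out second_in_pair"
  (( flag & 256 ))  && out="$out secondary"
  (( flag & 1024 )) && out="$out duplicate"
  echo "${out# }"
}

decode_flag 83    # paired proper_pair reverse_strand first_in_pair
decode_flag 163   # paired proper_pair mate_reverse second_in_pair
```

83 = 1 + 2 + 16 + 64 and 163 = 1 + 2 + 32 + 128, matching the two worked examples above.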

Essential samtools Commands

# Alignment statistics (mapped reads, paired, singletons)
samtools flagstat sample.bam

# Per-chromosome read counts
samtools idxstats sample.bam

# Extract only mapped reads (remove unmapped)
samtools view -F 4 -b sample.bam > sample_mapped.bam

# Extract only properly paired reads
samtools view -f 2 -b sample.bam > sample_proper_pairs.bam

# Mark PCR duplicates (important for DNA-seq, optional for RNA-seq)
# Note: markdup expects input run through `samtools fixmate -m` on a
# name-sorted BAM, then position-sorted
samtools markdup sample.bam sample_markdup.bam

# View reads in a specific region
samtools view sample.bam chr1:1000000-1001000 | head

Interpreting Alignment Rates

💡
>90% overall alignment rate: Excellent. Expected for well-prepared libraries aligned to the correct genome.
75–90%: Acceptable. Some unmapped reads are normal (rRNA, adapter dimers, low-quality reads).
50–75%: Investigate. Possible causes: wrong reference genome, cross-species contamination, significant adapter contamination.
<50%: Serious problem. Check: (1) Are you using the correct genome assembly? (2) Run FastQ Screen to check for contamination from other organisms. (3) Check if adapter trimming was performed.
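These thresholds are easy to check automatically by parsing the aligner's log; the snippet below is a hypothetical miniature of HISAT2's summary format:

```shell
# Hypothetical miniature of a HISAT2 summary log
cat > sample_hisat2.log <<'EOF'
25000000 reads; of these:
  25000000 (100.00%) were paired; of these:
    1450000 (5.80%) aligned concordantly 0 times
94.20% overall alignment rate
EOF

# Extract the rate and flag anything below the 90% "excellent" threshold
rate=$(grep 'overall alignment rate' sample_hisat2.log | grep -oE '[0-9.]+%' | tr -d '%')
echo "overall alignment rate: ${rate}%"
awk -v r="$rate" 'BEGIN { print (r >= 90) ? "OK" : "INVESTIGATE" }'
```

Run across all log files, this gives a one-line per-sample QC summary before you commit to downstream counting.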

Visual Inspection with IGV

Always visually inspect alignments for a few genes of interest using the Integrative Genomics Viewer (IGV). Load your BAM file and the reference genome, then navigate to specific genes. Look for: even coverage across exons, clean splice junctions, and absence of systematic artefacts.

Generating Coverage Tracks (bigWig)

# Generate normalised coverage tracks with deepTools
conda install -c bioconda deeptools

bamCoverage -b sample.bam \
  -o sample.bw \
  --normalizeUsing RPKM \
  --binSize 10 \
  -p 8

# Output: sample.bw (bigWig format)
# Load in IGV or UCSC Genome Browser for visualisation
HISAT2 = low RAM, STAR = higher accuracy · CIGAR string shows splicing (N = intron) · MAPQ 60 = uniquely mapped · samtools flagstat for QC · >90% alignment = good · IGV for visual inspection · bamCoverage for bigWig tracks
5.4 Gene Counting with featureCounts

Gene counting is the process of assigning each aligned read to a gene based on its overlap with annotated gene coordinates in a GTF/GFF file. The result is a count matrix — a table where rows are genes, columns are samples, and values are integer read counts. This matrix is the fundamental input for differential expression analysis with DESeq2 or edgeR. Getting the counting step right is critical because errors here propagate to every downstream result.

How Gene Counting Works

The counting tool examines each aligned read and checks whether it overlaps with an annotated gene feature (usually exons). If a read falls entirely within one gene's exons, it is assigned to that gene. If a read overlaps two genes (an ambiguous read), featureCounts discards it by default. HTSeq-count offers similar behaviour with its "union" mode. Reads that do not overlap any annotated feature are counted as "unassigned."
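The unique-overlap/discard logic can be sketched in a few lines of Python (a toy illustration with hypothetical gene coordinates, not the actual featureCounts implementation, which works on exon models from the GTF):

```python
# Toy read-to-gene assignment: unique overlap -> count; ambiguous -> discard
genes = {"geneA": (100, 500), "geneB": (450, 900)}  # hypothetical gene spans

def assign_read(start, end):
    """Return the single overlapping gene, or None if ambiguous/unassigned."""
    hits = [g for g, (s, e) in genes.items() if start < e and end > s]
    return hits[0] if len(hits) == 1 else None  # discard ambiguous reads

reads = [(120, 220), (470, 490), (950, 1000)]  # (start, end) per read
counts = {}
for start, end in reads:
    gene = assign_read(start, end)
    if gene:
        counts[gene] = counts.get(gene, 0) + 1
print(counts)  # {'geneA': 1} -- the (470, 490) read overlaps both genes and is dropped
```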

Gene-level vs Transcript-level Counting

Gene-level counting

Counts reads per gene (collapsing all isoforms). Used by featureCounts and HTSeq-count. Simpler, more robust. Standard for differential expression analysis.

Transcript-level quantification

Estimates abundance per transcript isoform. Used by Salmon and kallisto (pseudoalignment). More informative but more complex. Can be summarised to gene level with tximport.

featureCounts Workflow

# Install featureCounts (from Subread package)
conda install -c bioconda subread

# Count reads per gene for all samples
featureCounts -T 8 \
  -a annotation.gtf \
  -o counts.txt \
  -s 2 \
  -p --countReadPairs \
  -B \
  aligned/*.bam

# Key parameters:
# -a : GTF annotation file (must match genome used for alignment)
# -o : output file name
# -s 2: reverse-stranded library (most Illumina kits)
# -p --countReadPairs: count fragments, not individual reads (for PE)
# -B : require both reads in a pair to be aligned

# The output is a table: genes x samples with read counts
# Column 1: gene ID, columns 7+: counts per sample
# A summary file (counts.txt.summary) shows assignment statistics
⚠️
Strandedness matters! Wrong -s setting can cause ~50% read loss or incorrect counting. Use -s 0 (unstranded), -s 1 (stranded-forward), or -s 2 (stranded-reverse). Check your library preparation kit documentation. Most modern kits (Illumina TruSeq Stranded, NEBNext) are reverse-stranded (-s 2). If unsure, run featureCounts with all three settings on one sample and compare assignment rates — the correct setting will have the highest "Assigned" count.

featureCounts vs HTSeq-count

featureCounts

Extremely fast (processes millions of reads per minute). Multi-threaded. Can process multiple BAM files in one command. Recommended for most use cases.

HTSeq-count

Python-based, slower. Processes one BAM file at a time. Offers fine-grained control over ambiguous read handling (union, intersection-strict, intersection-nonempty modes). Used historically but featureCounts is preferred.

Pseudoalignment Alternative: Salmon

Salmon and kallisto use pseudoalignment — they do not perform traditional alignment to the genome. Instead, they map reads directly to the transcriptome (cDNA sequences), skipping the genome alignment step entirely. This is dramatically faster (minutes instead of hours) and directly estimates transcript-level abundances.

# Salmon quantification (no alignment needed)
conda install -c bioconda salmon

# Step 1: Build transcriptome index (do once)
salmon index -t transcriptome.fa -i salmon_index -p 8

# Step 2: Quantify each sample
salmon quant -i salmon_index \
  -l A \
  -1 sample_R1.fastq.gz \
  -2 sample_R2.fastq.gz \
  -p 8 \
  --validateMappings \
  -o salmon_output/sample/

# -l A: auto-detect library type
# Output: quant.sf file with TPM and estimated counts per transcript

Creating a Count Matrix from Salmon with tximport

# In R: import Salmon outputs into DESeq2 with tximport
library(tximport)
library(DESeq2)

# Create a named vector of file paths
samples <- c("control1","control2","control3","treated1","treated2","treated3")
files <- file.path("salmon_output", samples, "quant.sf")
names(files) <- samples

# Load transcript-to-gene mapping (from GTF or biomaRt)
tx2gene <- read.csv("tx2gene.csv")  # columns: TXNAME, GENEID

# Import and summarise to gene level
txi <- tximport(files, type="salmon", tx2gene=tx2gene)

# Create DESeq2 object directly from tximport
metadata <- data.frame(condition=c(rep("control",3), rep("treated",3)),
                       row.names=samples)
dds <- DESeqDataSetFromTximport(txi, metadata, ~condition)
💡
Salmon + tximport + DESeq2 is increasingly the recommended pipeline because: (1) it is much faster than alignment-based counting, (2) it corrects for GC bias and sequence-specific bias, and (3) tximport provides offset matrices that improve DE accuracy at the gene level.
featureCounts = fast gene counting · Strandedness: check -s 0/1/2 · Salmon = no alignment needed · tximport bridges Salmon → DESeq2 · Gene-level for DE analysis · Ambiguous reads are discarded
5.5 Differential Expression with DESeq2

DESeq2 is the gold-standard R package for identifying genes that are significantly more or less expressed between experimental conditions. It models RNA-seq count data using the negative binomial distribution, which accounts for both the mean expression level and the biological variability (dispersion) between replicates. Unlike simpler models (Poisson), the negative binomial captures the overdispersion that is characteristic of real biological data — where variance exceeds the mean.

The DESeq2 Statistical Model

DESeq2 performs three key steps internally: (1) Size factor estimation — normalises for differences in library size using the median-of-ratios method. (2) Dispersion estimation — estimates gene-wise dispersion with empirical Bayes shrinkage toward a fitted trend, borrowing strength across genes. (3) Wald test — tests whether the log2 fold change for each gene is significantly different from zero. The design formula (e.g., ~ condition) specifies the experimental factors to model.

Basic DESeq2 Workflow

library(DESeq2)

# 1. Load count data from featureCounts
counts <- read.table("counts.txt", header=TRUE, row.names=1, skip=1)
counts <- counts[, 6:ncol(counts)]  # Keep only count columns

# 2. Create metadata (sample information)
metadata <- data.frame(
  condition = c("control","control","control","treated","treated","treated"),
  row.names = colnames(counts)
)

# 3. Create DESeq2 object
dds <- DESeqDataSetFromMatrix(counts, metadata, ~condition)

# 4. Set the reference level (control as baseline)
dds$condition <- relevel(dds$condition, ref="control")

# 5. Filter low-count genes (improves power)
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ]

# 6. Run DESeq2 (size factors + dispersion + Wald test)
dds <- DESeq(dds)

# 7. Get results
res <- results(dds, contrast=c("condition","treated","control"))
res <- res[order(res$padj), ]
summary(res)

Multi-factor Designs

When your experiment has known batch effects or additional covariates, include them in the design formula. DESeq2 will model their effect and remove it before testing the variable of interest.

# Multi-factor design: correct for batch while testing condition
metadata <- data.frame(
  condition = c("ctrl","ctrl","ctrl","treat","treat","treat"),
  batch = c("A","B","A","B","A","B"),
  row.names = colnames(counts)
)
dds <- DESeqDataSetFromMatrix(counts, metadata, ~ batch + condition)

# The last variable in the formula is the one being tested
# Batch effect is modelled but not tested

# For paired designs (same patient before/after treatment):
# design = ~ patient + treatment

LFC Shrinkage with apeglm

Raw log2 fold changes from DESeq2 can be noisy for low-count genes. LFC shrinkage produces more accurate, regularised fold change estimates — especially important for ranking genes and for visualisation.

# Apply apeglm shrinkage (recommended method)
library(apeglm)
res_shrunk <- lfcShrink(dds, coef="condition_treated_vs_control", type="apeglm")

# Use res_shrunk for MA plots and volcano plots
# Use the non-shrunken 'res' for strict significance testing

Visualisation: MA Plot, Volcano Plot, and Heatmap

# MA plot — log2FC vs mean expression
plotMA(res_shrunk, ylim=c(-5, 5), main="MA Plot")
# Blue dots = significant (padj < 0.1). Genes with high expression
# and large fold change are most reliable.

# Volcano plot with EnhancedVolcano
library(EnhancedVolcano)
EnhancedVolcano(res,
  lab = rownames(res),
  x = "log2FoldChange",
  y = "padj",
  pCutoff = 0.05,
  FCcutoff = 1,
  title = "Treated vs Control",
  subtitle = "DESeq2 results")

# Heatmap of top 50 DEGs
library(pheatmap)
vsd <- vst(dds, blind=FALSE)
top50 <- head(order(res$padj), 50)
mat <- assay(vsd)[top50, ]
mat <- mat - rowMeans(mat)  # Center rows (z-score-like)
pheatmap(mat,
  annotation_col = as.data.frame(colData(dds)["condition"]),
  scale = "none",
  show_rownames = TRUE,
  fontsize_row = 7)

Annotating and Exporting Results

# Add gene symbols using biomaRt
library(biomaRt)
ensembl <- useEnsembl(biomart="genes", dataset="hsapiens_gene_ensembl")
annotations <- getBM(
  attributes = c("ensembl_gene_id", "hgnc_symbol", "description"),
  filters = "ensembl_gene_id",
  values = rownames(res),
  mart = ensembl
)

# Merge annotations with results
res_df <- as.data.frame(res)
res_df$ensembl_gene_id <- rownames(res_df)
res_annotated <- merge(res_df, annotations, by="ensembl_gene_id", all.x=TRUE)

# Export significant DEGs
sig <- subset(res_annotated, padj < 0.05 & abs(log2FoldChange) > 1)
write.csv(sig, "DEGs_significant_annotated.csv", row.names=FALSE)

# Export for pathway analysis (all genes with stats)
write.csv(res_annotated, "all_genes_DESeq2.csv", row.names=FALSE)
📌
Interpreting results: log2FoldChange = direction and magnitude (+1 = 2x higher in treated, -1 = 2x lower). padj = Benjamini-Hochberg adjusted p-value (FDR). Always use padj < 0.05, never raw pvalue. Combine with a fold-change threshold (|log2FC| > 1) for biological relevance.
⚠️
Common mistakes: (1) Using normalised counts (TPM, RPKM) as input — DESeq2 requires raw integer counts. (2) Forgetting to set the reference level — the default alphabetical order may not be what you want. (3) Not filtering low-count genes — this reduces statistical power because of excessive multiple testing correction.
Negative binomial model · Raw counts as input (never normalised) · padj < 0.05 + |log2FC| > 1 · apeglm for LFC shrinkage · ~ batch + condition for multi-factor · EnhancedVolcano for plots · biomaRt for gene annotation
Module 6 · Genomics

DNA Sequencing, Variant Calling & Genome Analysis

Whole-genome and whole-exome sequencing reveal the genetic blueprint of an organism. This module covers variant calling, genome assembly, and annotation.

6.1 WGS/WES Pipeline — Variant Calling

Variant calling is the process of identifying positions in a sequenced genome where the sample differs from the reference. These differences include SNPs (single nucleotide polymorphisms — one base changed to another), indels (insertions or deletions of one or more bases), and MNVs (multi-nucleotide variants — multiple adjacent substitutions). Variant calling is the foundation of clinical genomics, population genetics, and disease gene discovery.

Variant Types and Zygosity

SNP (Single Nucleotide Polymorphism)

A single base change: e.g., reference A → alternative G. The most common variant type. ~4–5 million SNPs per human genome.

Indel (Insertion / Deletion)

Insertion of extra bases or deletion of reference bases. Short indels (1–50bp) are called by GATK. Larger structural variants require specialised tools (Manta, DELLY).

Homozygous reference (0/0)

Both copies of the chromosome carry the reference allele. This position is not a variant in this individual.

Heterozygous (0/1)

One chromosome carries the reference allele, the other carries the alternative. Expect ~50% of reads supporting each allele.

Homozygous alternative (1/1)

Both chromosomes carry the alternative allele. Expect ~100% of reads supporting the ALT allele.

MNV (Multi-Nucleotide Variant)

Two or more adjacent bases changed simultaneously (e.g., AG → TC). Often caused by a single mutational event. Correctly phasing these is important for protein impact prediction.

Why We Mark Duplicates

During library preparation, PCR amplification creates identical copies of DNA fragments. These PCR duplicates are not independent observations — they all come from the same original molecule. If not removed, they artificially inflate the number of reads supporting a variant, leading to false-positive variant calls. Duplicate marking identifies reads with identical start and end coordinates and flags all but one as duplicates.

How HaplotypeCaller Works

GATK's HaplotypeCaller does not simply count bases at each position. Instead, it performs local de novo assembly in active regions (regions with evidence of variation). It: (1) identifies active regions where the data deviates from the reference, (2) assembles the reads in that region into candidate haplotypes, (3) realigns reads to these haplotypes using a pair-HMM, and (4) assigns genotype likelihoods. This approach is more accurate than simple pileup-based callers, especially for indels and complex variants.

Complete GATK4 Pipeline

# Step 1: Align with BWA-MEM2 (faster successor to BWA-MEM)
bwa-mem2 mem -t 8 -R "@RG\tID:sample\tSM:sample\tPL:ILLUMINA\tLB:lib1" \
  reference.fa sample_R1.fastq.gz sample_R2.fastq.gz \
  | samtools sort -@ 8 -o sample.bam
samtools index sample.bam

# Step 2: Mark PCR duplicates
gatk MarkDuplicates \
  -I sample.bam \
  -O sample_markdup.bam \
  -M metrics.txt \
  --CREATE_INDEX true

# Step 3: Base Quality Score Recalibration (BQSR)
# Corrects systematic errors in base quality scores
gatk BaseRecalibrator \
  -I sample_markdup.bam -R reference.fa \
  --known-sites dbsnp.vcf.gz \
  --known-sites mills_indels.vcf.gz \
  -O recal.table
gatk ApplyBQSR \
  -I sample_markdup.bam -R reference.fa \
  --bqsr-recal-file recal.table \
  -O sample_recal.bam

# Step 4: Call variants (per-sample GVCF mode)
gatk HaplotypeCaller \
  -I sample_recal.bam -R reference.fa \
  -O sample.g.vcf.gz \
  -ERC GVCF

# Step 5: Joint genotyping (for cohorts)
gatk GenotypeGVCFs -R reference.fa \
  -V sample.g.vcf.gz \
  -O sample.vcf.gz

# Step 6: Filter variants
gatk VariantFiltration -V sample.vcf.gz -O sample_filtered.vcf.gz \
  --filter-expression "QD < 2.0" --filter-name "LowQD" \
  --filter-expression "MQ < 40.0" --filter-name "LowMQ" \
  --filter-expression "FS > 60.0" --filter-name "HighFS" \
  --filter-expression "SOR > 3.0" --filter-name "HighSOR"

Hard Filtering vs VQSR

Hard filtering

Apply fixed thresholds on quality metrics (QD, MQ, FS, SOR). Simple and works for any project size. Use for small cohorts (<30 samples), non-model organisms, or targeted panels.

VQSR (Variant Quality Score Recalibration)

Machine learning approach that learns the quality profile from known true variants (e.g., HapMap, 1000 Genomes). More accurate but requires large cohorts (>30 whole genomes) and curated training resources.

Quality Filter Annotations Explained

QD (QualByDepth)

Variant quality normalised by depth. Low QD (<2) means the variant quality is low relative to coverage — likely a false positive.

MQ (MappingQuality)

RMS mapping quality of reads at the site. Low MQ (<40) indicates reads are mapping to multiple locations — the region is ambiguous.

FS (FisherStrand)

Strand bias estimated by Fisher's exact test. High FS (>60 for SNPs) means the variant is only supported by reads on one strand — likely an artefact.

SOR (StrandOddsRatio)

Alternative strand bias metric. More robust than FS for high-coverage data. SOR > 3 indicates significant strand bias.

Visualising Variants in IGV

Always visually inspect candidate variants in the Integrative Genomics Viewer (IGV). Load the BAM file and VCF together, navigate to a variant position, and check: (1) Is there sufficient read depth? (2) Are both strands represented? (3) Are the variant-supporting reads properly mapped (no soft-clipping artefacts)? (4) Is the region not in a repeat or low-complexity sequence?

⚠️
BQSR requires known variant databases (dbSNP, Mills indels). For non-model organisms without these resources, skip BQSR and use more stringent hard filtering instead.
SNP = single base change · 0/0 = hom ref, 0/1 = het, 1/1 = hom alt · Mark duplicates before calling · HaplotypeCaller = local assembly · Hard filter for small projects · VQSR for large cohorts · Always inspect in IGV
6.2 Understanding VCF Files & Variant Annotation

The Variant Call Format (VCF) is the standard file format for storing genetic variants discovered by variant calling pipelines. Understanding the VCF format in detail is essential for filtering, annotating, and interpreting variant data. A VCF file consists of header lines (metadata) followed by data lines (one variant per line), and can contain genotype information for one or many samples.

VCF File Structure

A VCF file has two sections: the header (lines starting with ##) and the data lines (tab-separated variant records). The header defines the meaning of INFO and FORMAT fields used in the data lines.

## VCF header lines (metadata)
##fileformat=VCFv4.2
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths (REF,ALT)">
#CHROM  POS     ID       REF  ALT  QUAL    FILTER  INFO           FORMAT       Sample1
chr1    925952  rs12345  G    A    5000.1  PASS    DP=45;AF=0.50  GT:GQ:DP:AD  0/1:99:45:22,23

Anatomy of a VCF Data Line

Let us dissect the example line above field by field:

CHROM: chr1

Chromosome where the variant is located.

POS: 925952

1-based position on the reference. For indels, this is the position of the base before the event.

ID: rs12345

dbSNP identifier if known, otherwise "." (missing). Links to public variant databases.

REF: G / ALT: A

Reference allele and alternative allele. For SNPs: single bases. For a deletion: REF=ACG, ALT=A (CG deleted). For an insertion: REF=A, ALT=ATCG (TCG inserted).

QUAL: 5000.1

Phred-scaled quality score. QUAL 30 = 1/1000 chance the variant is wrong. QUAL 5000 = extremely confident. Higher is better.

FILTER: PASS

PASS means the variant passed all quality filters. Other values (e.g., "LowQD") indicate which filter(s) it failed.

Genotype Field (FORMAT + Sample)

The FORMAT column defines the order of fields in the sample column(s). In our example, FORMAT is GT:GQ:DP:AD and the sample value is 0/1:99:45:22,23.

GT: 0/1

Genotype. 0 = reference allele, 1 = first ALT allele. So 0/1 = heterozygous. 0/0 = homozygous reference. 1/1 = homozygous alternative. 0|1 = phased heterozygous.

GQ: 99

Genotype quality. Phred-scaled confidence that the genotype call is correct. GQ 99 = very high confidence. Low GQ (<20) means the genotype is uncertain.

DP: 45

Read depth at this position for this sample. For WGS, typically 30x. For WES, typically 50–100x. Low depth (<10) = unreliable genotype.

AD: 22,23

Allelic depths. 22 reads support REF (G), 23 reads support ALT (A). For a heterozygous call, expect roughly 50:50. Severe imbalance may indicate an artefact.
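As a sanity check, the FORMAT and sample columns from the example line can be paired up in a few lines of Python (a standalone sketch, not a full VCF parser; for real files use a library such as pysam or cyvcf2):

```python
# Pair a VCF FORMAT column with its per-sample value column
def parse_sample_field(format_col, sample_col):
    """Return a dict of genotype fields plus the ALT allele fraction."""
    fields = dict(zip(format_col.split(":"), sample_col.split(":")))
    ref_depth, alt_depth = map(int, fields["AD"].split(","))
    fields["allele_balance"] = alt_depth / (ref_depth + alt_depth)
    return fields

info = parse_sample_field("GT:GQ:DP:AD", "0/1:99:45:22,23")
print(info["GT"], info["allele_balance"])  # heterozygous, ~0.51 ALT fraction
```

For a 0/1 call, an allele balance near 0.5 (as here) is reassuring; severe imbalance would warrant inspection in IGV.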

bcftools: Filtering, Extracting, and Merging VCFs

# Install bcftools
conda install -c bioconda bcftools

# View basic stats
bcftools stats sample.vcf.gz | head -40

# Filter: keep only PASS variants with depth >= 20
bcftools view -f PASS -i 'DP>=20' sample.vcf.gz -Oz -o filtered.vcf.gz

# Extract only SNPs (exclude indels)
bcftools view -v snps sample.vcf.gz -Oz -o snps_only.vcf.gz

# Extract only indels
bcftools view -v indels sample.vcf.gz -Oz -o indels_only.vcf.gz

# Extract specific regions
bcftools view -r chr1:1000000-2000000 sample.vcf.gz

# Extract specific samples from a multi-sample VCF
bcftools view -s Sample1,Sample2 cohort.vcf.gz -Oz -o subset.vcf.gz

# Merge multiple single-sample VCFs
bcftools merge sample1.vcf.gz sample2.vcf.gz sample3.vcf.gz \
  -Oz -o merged.vcf.gz

# Index a VCF (required for most operations)
bcftools index -t filtered.vcf.gz

Variant Annotation with SnpEff

SnpEff annotates each variant with its predicted functional impact on genes, transcripts, and proteins. It adds an ANN field to the INFO column.

# Install SnpEff
conda install -c bioconda snpeff

# List available databases
snpEff databases | grep -i "homo_sapiens"

# Download the human genome database
snpEff download GRCh38.105

# Annotate variants
snpEff ann GRCh38.105 filtered.vcf.gz > annotated.vcf

# Generate an HTML summary report
snpEff ann -stats snpEff_summary.html GRCh38.105 filtered.vcf.gz > annotated.vcf

# Extract high-impact variants only
cat annotated.vcf | snpSift filter "ANN[*].IMPACT = 'HIGH'" > high_impact.vcf

Variant Functional Categories

Synonymous (LOW impact)

DNA change that does NOT alter the amino acid (e.g., GCC→GCT both encode Alanine). Usually benign due to codon degeneracy.

Missense (MODERATE impact)

DNA change that alters one amino acid (e.g., GAG→GTG: Glutamate→Valine). May or may not affect protein function depending on the position and amino acid properties.

Nonsense / Stop-gained (HIGH impact)

Creates a premature stop codon (e.g., CAG→TAG: Glutamine→Stop). Truncates the protein. Often loss-of-function. Subject to nonsense-mediated decay (NMD).

Frameshift (HIGH impact)

Insertion or deletion not divisible by 3. Shifts the reading frame, altering all downstream amino acids and usually creating a premature stop. Almost always loss-of-function.

Splice site (HIGH impact)

Variant at an exon-intron boundary (donor GT or acceptor AG). Disrupts normal splicing, potentially causing exon skipping or intron retention. Often pathogenic.

In-frame indel (MODERATE impact)

Insertion or deletion divisible by 3. Adds or removes amino acids without shifting the reading frame. Effect depends on the protein domain affected.

Clinical Interpretation Resources

ClinVar

NCBI database of variants with clinical significance (pathogenic, likely pathogenic, benign, VUS). The primary resource for clinical variant interpretation. Query by rsID or genomic coordinates.

gnomAD

Population allele frequencies from >140,000 exomes and >76,000 genomes. A variant with AF > 1% in gnomAD is almost certainly benign. Essential for filtering common polymorphisms.

💡
Filtering strategy for clinical variants: Start with ~4 million variants → filter PASS only → remove AF > 0.01 in gnomAD → keep HIGH/MODERATE impact → check ClinVar → manual review in IGV. This typically reduces to a manageable list of 50–200 candidate variants.
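The funnel amounts to a chain of filters. A toy Python sketch of the logic (the field names such as gnomad_af are illustrative; in practice these values come from SnpEff/SnpSift and gnomAD annotations in the VCF):

```python
# Toy version of the clinical filtering funnel (field names are illustrative)
variants = [
    {"filter": "PASS",  "gnomad_af": 0.20,  "impact": "MODERATE"},  # common -> drop
    {"filter": "PASS",  "gnomad_af": 0.001, "impact": "HIGH"},      # keep
    {"filter": "LowQD", "gnomad_af": 0.0,   "impact": "HIGH"},      # failed QC -> drop
    {"filter": "PASS",  "gnomad_af": 0.002, "impact": "LOW"},       # synonymous -> drop
]

candidates = [
    v for v in variants
    if v["filter"] == "PASS"                  # quality filter: PASS only
    and v["gnomad_af"] <= 0.01                # remove common polymorphisms
    and v["impact"] in ("HIGH", "MODERATE")   # keep likely functional variants
]
print(len(candidates))  # 1 candidate left for ClinVar lookup and IGV review
```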
VCF = standard variant format · GT:GQ:DP:AD = genotype fields · 0/1 = heterozygous · bcftools for filtering · SnpEff for annotation · HIGH impact = likely functional · gnomAD for population frequency · ClinVar for clinical significance
Module 7 · Statistics

Statistical Foundations for Bioinformatics

Bad statistics ruins good biology. This module covers the statistical concepts every bioinformatician must understand — from p-values and FDR to normalisation and dimensionality reduction.

7.1 P-values, Multiple Testing & FDR

Understanding p-values, multiple testing correction, and the False Discovery Rate (FDR) is arguably the single most important statistical concept in genomics. Misinterpreting these values leads to publishing false discoveries or missing real biology. This section provides the intuition, the mathematics, and the practical application you need to correctly interpret differential expression results.

⚠️
What a p-value IS: P(data | null hypothesis is true). The probability of observing data as extreme as yours, assuming there is no real effect. It is NOT: the probability that your gene is truly differentially expressed. A p-value of 0.03 does NOT mean there is a 97% chance the gene is real.

The Multiple Testing Problem — A Worked Example

Consider a typical RNA-seq experiment where you test 20,000 genes for differential expression. Suppose 1,000 genes are truly differentially expressed and 19,000 are not.

Without correction (p < 0.05)

Of the 19,000 null genes, 5% will produce p < 0.05 by chance = 950 false positives. Of the 1,000 true DEGs (assuming 80% power), you detect ~800. Your "significant" list has 800 + 950 = 1,750 genes, but 54% are false. You cannot tell which are real.

With BH correction (padj < 0.05)

Benjamini-Hochberg correction ensures that among your "significant" genes, at most 5% are false positives. If you call 750 genes significant, you expect ~38 false discoveries. This is manageable and scientifically defensible.

Benjamini-Hochberg FDR: Step-by-Step

The BH procedure adjusts p-values by ranking them:

1. Rank all m p-values from smallest to largest: p(1) ≤ p(2) ≤ ... ≤ p(m)

2. For each gene at rank i, compare its p-value to the threshold: i/m × q (where q is your desired FDR, e.g., 0.05)

3. Find the largest rank k where p(k) ≤ k/m × q

4. All genes with rank ≤ k are declared significant

# Example with 5 genes (m=5), FDR threshold q=0.05
# Gene   Raw p-value  Rank(i)  BH threshold (i/m × q)  Significant?
# GeneA  0.001        1        1/5 × 0.05 = 0.010      YES (0.001 < 0.010)
# GeneB  0.008        2        2/5 × 0.05 = 0.020      YES (0.008 < 0.020)
# GeneC  0.030        3        3/5 × 0.05 = 0.030      YES (0.030 ≤ 0.030)
# GeneD  0.040        4        4/5 × 0.05 = 0.040      YES (0.040 ≤ 0.040)
# GeneE  0.060        5        5/5 × 0.05 = 0.050      NO  (0.060 > 0.050)

# In R, this is simply:
p_values <- c(0.001, 0.008, 0.030, 0.040, 0.060)
p.adjust(p_values, method = "BH")
# DESeq2 does this automatically in the 'padj' column

FDR vs FWER

FDR (False Discovery Rate)

Controls the proportion of false discoveries among all rejected hypotheses. "Of the genes I call significant, at most 5% are false." Used for: RNA-seq, ChIP-seq, proteomics — any discovery-oriented experiment where some false positives are tolerable.

FWER (Family-Wise Error Rate)

Controls the probability of even one false positive. Much more conservative. Bonferroni: divide alpha by number of tests (0.05/20,000 = 2.5×10⁻⁶). Used for: GWAS, where a single false positive could lead to expensive follow-up studies.

Why padj, Not pvalue, in DESeq2

The pvalue column in DESeq2 results is the raw, uncorrected p-value. The padj column is the BH-adjusted p-value. Always use padj for determining significance. If you use raw p-values, you are ignoring the multiple testing problem and your results will be full of false positives.

Independent Filtering in DESeq2

DESeq2 automatically performs independent filtering: it removes genes with very low mean expression before applying multiple testing correction. Why? Genes with almost zero counts across all samples have no chance of being significant, but they consume part of the multiple testing budget, reducing power for the remaining genes. By filtering them out, DESeq2 increases statistical power for the genes that matter.

📌
NA values in padj: If you see NA in the padj column, it means DESeq2 filtered that gene out (usually due to very low counts or extreme outlier detection). These genes should not be interpreted as significant or non-significant — they were excluded from testing.

Effect Size and Biological Significance

Statistical significance (small padj) does not equal biological significance. With enough replicates, even tiny fold changes become statistically significant. A gene with padj = 0.0001 but log2FC = 0.1 is a 7% expression change — real but unlikely to be biologically meaningful. Always apply both a significance threshold (padj < 0.05) and an effect size threshold (|log2FC| > 1, i.e., 2-fold change) when defining your DEG list.
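The percentage figures follow directly from the definition of log2 fold change; a quick Python check:

```python
# Convert a log2 fold change into a percent change in expression
def log2fc_to_percent(lfc):
    return (2 ** lfc - 1) * 100

print(round(log2fc_to_percent(0.1), 1))   # 7.2  -> the "7% change" above
print(round(log2fc_to_percent(1.0), 1))   # 100.0 -> 2-fold up
print(round(log2fc_to_percent(-1.0), 1))  # -50.0 -> halved
```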

💡
Volcano plot intuition: X-axis = log2FoldChange (effect size), Y-axis = -log10(padj) (significance). Top-right = significantly upregulated with large effect. Top-left = significantly downregulated. Bottom-centre = not significant. Right side but not high = large fold change but underpowered (add more replicates).
p-value = P(data | H0 true) · Always use padj, never pvalue · BH-FDR = standard in genomics · Bonferroni = too conservative for RNA-seq · padj < 0.05 AND |log2FC| > 1 · NA in padj = gene was filtered · Statistical ≠ biological significance
7.2 Normalisation Methods

Raw read counts cannot be compared directly across samples or genes because they are confounded by two major technical factors: library size (total number of reads sequenced) and gene length (longer genes capture more reads). Normalisation removes these technical biases so that observed differences reflect true biological variation. Choosing the wrong normalisation method for your analysis is one of the most common errors in bioinformatics.

Why Normalisation Is Needed — A Concrete Example

Suppose Gene X has 100 raw counts in both Sample A and Sample B. It looks like equal expression. But Sample A has 10 million total reads and Sample B has 20 million total reads. Gene X consumed 100/10,000,000 = 0.001% of Sample A's reads, but only 100/20,000,000 = 0.0005% of Sample B's. Gene X is actually expressed at half the level in Sample B. Without normalisation, you would completely miss this 2-fold difference.
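In code, the Gene X example looks like this (a counts-per-million calculation, just to make the arithmetic concrete):

```python
# The Gene X example in numbers: same raw count, different library sizes
def cpm(count, library_size):
    """Counts per million: correct a raw count for library size."""
    return count / library_size * 1_000_000

a = cpm(100, 10_000_000)  # Sample A: 10 million total reads
b = cpm(100, 20_000_000)  # Sample B: 20 million total reads
print(a, b, a / b)  # 10.0 vs 5.0 CPM -> the hidden 2-fold difference
```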

Normalisation Methods

CPM (Counts Per Million)

Corrects for library size only. Formula: CPM = (raw count / total counts) × 10⁶. Use for within-gene comparisons across samples. Does NOT correct for gene length, so cannot compare expression between different genes.

TPM (Transcripts Per Million)

Corrects for gene length first, then library size. Formula: (1) Divide counts by gene length in kb → RPK. (2) Sum all RPK values and divide by 10⁶ → the per-million scaling factor. (3) Divide each RPK by this scaling factor. TPM values sum to 1 million per sample, making samples directly comparable. Current standard for expression visualisation.

RPKM / FPKM

RPKM = Reads Per Kilobase per Million mapped reads. FPKM = the paired-end equivalent. Formula: RPKM = (raw count × 10⁹) / (total counts × gene length). Problem: RPKM values do NOT sum to the same value across samples, making cross-sample comparison unreliable. Superseded by TPM.

DESeq2 size factors (median-of-ratios)

Computes a per-sample scaling factor that accounts for library size AND composition bias (when a few highly expressed genes consume a disproportionate fraction of reads). Does NOT correct for gene length. Used internally by DESeq2 for differential expression.
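A short Python sketch of the TPM recipe (toy counts and lengths; it also demonstrates the key property that TPM totals are identical across samples):

```python
# TPM: length-normalise first (RPK), then library-normalise
def tpm(counts, lengths_kb):
    rpk = [c / l for c, l in zip(counts, lengths_kb)]  # reads per kilobase
    scale = sum(rpk) / 1_000_000                       # per-million scaling factor
    return [r / scale for r in rpk]

# Two hypothetical samples, three genes of length 2 kb, 1 kb, 4 kb
lengths = [2.0, 1.0, 4.0]
sample1 = tpm([200, 100, 400], lengths)
sample2 = tpm([300, 600, 900], lengths)
print(sum(sample1), sum(sample2))  # both sum to 1,000,000 by construction
```

In sample1 all three genes have equal RPK, so they receive equal TPM despite very different raw counts, which is exactly the gene-length correction at work.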

DESeq2 Size Factor Calculation

The median-of-ratios method works as follows: (1) For each gene, compute the geometric mean across all samples. (2) For each sample, divide each gene's count by this geometric mean. (3) The median of these ratios is the size factor for that sample. Using the median makes this robust to outlier genes (e.g., one massively upregulated gene does not skew the normalisation).

# DESeq2 size factors (computed automatically by DESeq())
dds <- estimateSizeFactors(dds)
sizeFactors(dds)
# Values near 1.0 = average library. >1 = larger library. <1 = smaller library.

# Normalised counts = raw count / size factor
normalized_counts <- counts(dds, normalized=TRUE)
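The three steps can also be traced by hand on a toy count matrix in Python (two samples only, so the geometric mean reduces to the square root of the product; this mirrors, but does not replace, DESeq2's implementation):

```python
import math

# Median-of-ratios on a toy 3-gene x 2-sample count matrix
counts = {"g1": [100, 200], "g2": [50, 100], "g3": [10, 80]}

# (1) Geometric mean of each gene across samples
geo = {g: math.sqrt(v[0] * v[1]) for g, v in counts.items()}

def size_factor(sample_idx):
    # (2) Ratio of each gene's count to its geometric mean, per sample
    ratios = sorted(counts[g][sample_idx] / geo[g] for g in counts)
    # (3) The median ratio is the sample's size factor (robust to outlier genes)
    return ratios[len(ratios) // 2]

print(size_factor(0), size_factor(1))  # ~0.71 and ~1.41
```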

TMM Normalisation in edgeR

TMM (Trimmed Mean of M-values) is the normalisation method used by edgeR. It works by selecting a reference sample, computing fold-changes and absolute expression levels between each sample and the reference, trimming the extremes (top/bottom 30% of fold-changes, top/bottom 5% of expression), and computing a weighted mean of the remaining values as the scaling factor.

# TMM normalisation in edgeR
library(edgeR)
dge <- DGEList(counts = count_matrix)
dge <- calcNormFactors(dge, method = "TMM")

# Effective library size = lib.size * norm.factors
dge$samples

Which Normalisation for Which Purpose?

Differential expression analysis

Use raw counts as input to DESeq2 or edgeR. These tools apply their own internal normalisation (size factors or TMM). Never pre-normalise counts before DE analysis.

Cross-sample visualisation (heatmaps, PCA)

Use TPM or variance-stabilised transform (VST/rlog from DESeq2). These make samples comparable for visualisation.

Cross-gene comparison within a sample

Use TPM (accounts for gene length). CPM is insufficient because longer genes accumulate more reads regardless of expression level.

Reporting expression levels

Use TPM in publications and supplementary tables. It is the current community standard and is comparable across samples and studies (with caveats).

⚠️
Common mistakes: (1) Using RPKM/FPKM for differential expression — these are not appropriate because the variance structure is lost. (2) Using CPM to compare expression between genes of different lengths. (3) Pre-normalising counts before passing to DESeq2 — the tool expects raw integer counts.
Raw counts for DE analysis · TPM for visualisation · RPKM/FPKM is outdated · DESeq2 = median-of-ratios · edgeR = TMM · Library size + composition bias · Gene length matters for cross-gene comparison
7.3 PCA & Clustering for Quality Control

Before running differential expression, you should always visualise your samples using dimensionality reduction (PCA) and clustering. These exploratory analyses reveal sample outliers, batch effects, sample mix-ups, and whether your biological conditions are the dominant source of variation. Skipping this step means you may not discover critical problems until after you have wasted time on downstream analysis.

What PCA Actually Does

Principal Component Analysis (PCA) takes your high-dimensional data (20,000 genes per sample) and finds the directions of maximum variance. The first principal component (PC1) captures the most variance in the data, PC2 captures the second most (orthogonal to PC1), and so on. By plotting samples on PC1 vs PC2, you project 20,000 dimensions down to 2, preserving as much information as possible.
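The projection described above can be demonstrated with plain NumPy on a simulated matrix (6 samples × 1,000 genes, with a condition effect injected into the first 50 genes — all values invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy expression matrix: 6 samples x 1000 genes.
# First 3 samples ("treatment") are shifted on the first 50 genes.
X = rng.normal(size=(6, 1000))
X[:3, :50] += 5

# PCA via SVD of the mean-centred matrix
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
coords = U * S                        # sample coordinates on each PC
pct_var = 100 * S**2 / (S**2).sum()   # % variance explained per PC

print(f"PC1: {pct_var[0]:.1f}% variance")
print(coords[:, 0])  # PC1 separates the two sample groups
```

Even though only 50 of 1,000 genes carry the condition effect, PC1 finds that direction because it is the single largest coordinated source of variance; the remaining PCs capture noise.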

Interpreting a PCA Plot

PC1 (X-axis)

Captures the largest source of variation. Ideally this is your biological condition (treatment vs control). The % variance explained is shown in the axis label (e.g., "PC1: 65% variance").

PC2 (Y-axis)

The second largest source of variation. Often captures batch effects, sex differences, or other covariates. If PC2 separates batches more than conditions, you have a batch effect problem.

Good PCA

Replicates cluster tightly together. Conditions separate clearly along PC1. Percentage of variance explained by PC1 is high (>30%).

Problematic PCA

One replicate clusters with the wrong condition (sample mix-up). Samples cluster by batch rather than condition (batch effect). One sample is far from all others (outlier — consider removal).

PCA and Sample Distance Heatmap Code

library(DESeq2); library(ggplot2); library(pheatmap)

# Variance stabilising transformation (for visualisation only)
vsd <- vst(dds, blind=TRUE)

# PCA plot coloured by condition
plotPCA(vsd, intgroup="condition") +
  theme_minimal(base_size=14) +
  ggtitle("PCA — Sample Overview") +
  theme(plot.title = element_text(face="bold"))

# Custom PCA with more components
pcaData <- plotPCA(vsd, intgroup=c("condition","batch"), returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color=condition, shape=batch)) +
  geom_point(size=4) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  theme_minimal(base_size=14)

# Sample distance heatmap
sampleDists <- dist(t(assay(vsd)))
sampleDistMatrix <- as.matrix(sampleDists)
colnames(sampleDistMatrix) <- NULL
annotation_df <- as.data.frame(colData(dds)[, c("condition"), drop=FALSE])
pheatmap(sampleDistMatrix,
         clustering_distance_rows = sampleDists,
         clustering_distance_cols = sampleDists,
         annotation_row = annotation_df,
         color = colorRampPalette(c("white","steelblue"))(50))

Detecting and Correcting Batch Effects

If your PCA shows samples clustering by batch (processing date, lane, technician) rather than by biological condition, you have a batch effect. Two common correction methods:

# Option 1: ComBat (from sva package) — for correcting normalised data
library(sva)
# Use on normalised expression matrix (for visualisation, clustering)
adjusted <- ComBat(
  dat = assay(vsd),
  batch = colData(dds)$batch,
  mod = model.matrix(~ condition, data=colData(dds))
)
# Now use 'adjusted' matrix for PCA, heatmaps, etc.

# Option 2: removeBatchEffect (from limma) — for visualisation only
library(limma)
adjusted <- removeBatchEffect(
  assay(vsd),
  batch = colData(dds)$batch,
  design = model.matrix(~ condition, data=colData(dds))
)
# Use adjusted matrix for PCA/heatmaps

# For DE analysis: do NOT use corrected data!
# Instead, include batch in the DESeq2 design formula:
#   design = ~ batch + condition
⚠️
Never use batch-corrected data for differential expression. ComBat and removeBatchEffect are for visualisation and clustering only. For DE analysis, include the batch variable in the DESeq2 design formula (~ batch + condition). DESeq2 will model and remove the batch effect statistically.

Hierarchical Clustering Methods

Ward's method (ward.D2)

Minimises within-cluster variance. Produces compact, evenly-sized clusters. Best for sample clustering in gene expression data. Most commonly used.

Complete linkage

Distance between clusters = maximum distance between any two members. Produces tight clusters. Sensitive to outliers.

Average linkage (UPGMA)

Distance between clusters = mean of all pairwise distances. Produces ultrametric trees. Used in phylogenetics but assumes constant rate of evolution.

Gene expression heatmap

Rows = genes (usually top DEGs), columns = samples. Colour = expression level (z-score). Both rows and columns are clustered. Reveals co-expression patterns.

# Gene expression heatmap of top 50 DEGs
library(pheatmap)
top50 <- head(order(res$padj), 50)
mat <- assay(vsd)[top50, ]
mat <- t(scale(t(mat)))  # Z-score per gene (row)
pheatmap(mat,
         annotation_col = as.data.frame(colData(dds)[, "condition", drop=FALSE]),
         clustering_method = "ward.D2",
         show_rownames = TRUE,
         fontsize_row = 7,
         color = colorRampPalette(c("navy","white","firebrick3"))(50),
         main = "Top 50 Differentially Expressed Genes")
💡
What % variance explained means: If PC1 explains 65% of the variance, it means 65% of all the variation across your 20,000 genes can be captured by a single direction. The remaining 35% is spread across all other PCs. High PC1 variance usually means a strong biological signal (good). Very low PC1 (<15%) may mean high noise or complex multi-factor design.
PCA before DE analysis · PC1 = largest variance source · Replicates should cluster together · ComBat / limma for batch correction · ~ batch + condition in DESeq2 · Ward.D2 for clustering · Z-score heatmaps for DEGs
Module 8 · Advanced Topics

Single-cell RNA-seq, Pathway Analysis & Machine Learning

The frontier of bioinformatics. These topics are increasingly required in modern research — from dissecting cell heterogeneity with scRNA-seq to interpreting gene lists with pathway analysis and applying machine learning to genomic data.

8.1 Single-cell RNA-seq (scRNA-seq) — Introduction

Single-cell RNA sequencing (scRNA-seq) measures gene expression in individual cells, revealing cellular heterogeneity that is invisible to bulk RNA-seq. Where bulk RNA-seq reports the average expression across thousands of cells, scRNA-seq can identify rare cell populations, map developmental trajectories, and characterise the cellular composition of complex tissues like tumours or the immune system. The most widely used platform is the 10x Genomics Chromium system.

How 10x Chromium Works

In the 10x Chromium system, individual cells are captured in nanoliter-sized oil droplets (GEMs — Gel Beads-in-emulsion) together with barcoded hydrogel beads. Each bead carries millions of oligonucleotides with three components: (1) a cell barcode (16bp, unique to each droplet/cell), (2) a UMI (Unique Molecular Identifier, 12bp, unique to each captured mRNA molecule), and (3) a poly-T capture sequence that binds mRNA. After reverse transcription inside the droplet, all cDNA from one cell shares the same barcode, allowing computational demultiplexing.
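The read layout this produces is simple to parse: in the v3 chemistry described above, Read 1 carries the 16 bp cell barcode followed by the 12 bp UMI, while the cDNA itself is sequenced on Read 2. A toy parser (the example read is invented):

```python
# 10x v3 chemistry: Read 1 = 16 bp cell barcode + 12 bp UMI
CB_LEN, UMI_LEN = 16, 12

def parse_r1(read1_seq):
    """Split a Read-1 sequence into (cell barcode, UMI)."""
    cell_barcode = read1_seq[:CB_LEN]
    umi = read1_seq[CB_LEN:CB_LEN + UMI_LEN]
    return cell_barcode, umi

# Toy read: 16 bp barcode followed by 12 bp UMI
r1 = "AAACCCAAGAAACACT" + "GTACGTACGTAC"
cb, umi = parse_r1(r1)
print(cb, umi)  # AAACCCAAGAAACACT GTACGTACGTAC
```

Demultiplexing then amounts to grouping reads by cell barcode, and counting expression amounts to counting distinct UMIs per (barcode, gene) pair — duplicate UMIs are PCR copies of the same original mRNA molecule.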

Why scRNA-seq Has High Dropout (Zero Inflation)

A major characteristic of scRNA-seq data is dropout: many genes that are truly expressed in a cell are detected as zero. This happens because each cell contributes only ~10–20% of its mRNA to the captured library (low capture efficiency). Lowly expressed genes are particularly affected — they may be expressed but not captured. This "zero inflation" complicates analysis and requires specialised statistical methods.

Quality Control Metrics

nFeature_RNA (genes per cell)

Number of unique genes detected. Too low (<200): empty droplet or dead cell. Too high (>5,000–8,000): possible doublet (two cells in one droplet). Typical range: 500–5,000 genes.

nCount_RNA (UMIs per cell)

Total UMI counts. Proportional to mRNA content. Very low counts indicate low-quality cells. Extremely high counts may indicate doublets.

percent.mt (mitochondrial %)

Percentage of reads from mitochondrial genes. Dying cells release cytoplasmic mRNA but retain mitochondrial mRNA, so high mt% (>20%) indicates cell stress or death. Common threshold: remove cells with >15–20% mt.

Typical QC thresholds

Remove cells with: nFeature_RNA < 200 or > 5000, percent.mt > 20%. These are starting points — adjust based on your cell type (some cells naturally have high mt% or low gene counts).

Advanced QC: Ambient RNA and Doublets

Ambient RNA removal (SoupX)

During cell lysis, free-floating mRNA ("soup") contaminates all droplets. SoupX estimates the contamination profile from empty droplets and subtracts it from each cell's counts. Run before standard Seurat analysis.

Doublet detection (scDblFinder / DoubletFinder)

Doublets (two cells captured in one droplet) appear as cells with hybrid gene expression profiles. scDblFinder simulates artificial doublets and identifies real cells that resemble them. Expected doublet rate: ~0.8% per 1,000 cells loaded.

Complete Seurat Workflow

# Standard Seurat v5 scRNA-seq workflow
library(Seurat)
library(dplyr)  # for %>%, group_by(), slice_max()

# Load 10x Cell Ranger output
data <- Read10X("filtered_feature_bc_matrix/")
seurat <- CreateSeuratObject(data, min.cells=3, min.features=200)

# Quality control
seurat[["pct_mt"]] <- PercentageFeatureSet(seurat, pattern="^MT-")
VlnPlot(seurat, features=c("nFeature_RNA","nCount_RNA","pct_mt"), pt.size=0.1)
seurat <- subset(seurat, nFeature_RNA > 200 & nFeature_RNA < 5000 & pct_mt < 20)

# Normalise (log-normalisation)
seurat <- NormalizeData(seurat)
# Find highly variable genes (for dimensionality reduction)
seurat <- FindVariableFeatures(seurat, selection.method="vst", nfeatures=2000)
# Scale data (z-score, optionally regress out confounders)
seurat <- ScaleData(seurat, vars.to.regress="pct_mt")

# PCA → Nearest neighbours → Clustering → UMAP
seurat <- RunPCA(seurat, npcs=30)
ElbowPlot(seurat)  # Choose number of PCs (elbow ~15-20)
seurat <- FindNeighbors(seurat, dims=1:20)
seurat <- FindClusters(seurat, resolution=0.5)
seurat <- RunUMAP(seurat, dims=1:20)

# Visualise clusters
DimPlot(seurat, label=TRUE, repel=TRUE) + NoLegend()

# Find marker genes for each cluster
markers <- FindAllMarkers(seurat, only.pos=TRUE, min.pct=0.25,
                          logfc.threshold=0.5)
top5 <- markers %>% group_by(cluster) %>% slice_max(avg_log2FC, n=5)
DoHeatmap(seurat, features=top5$gene)

Cell Cycle Scoring and Regression

# Score cells for cell cycle phase
s.genes <- cc.genes$s.genes
g2m.genes <- cc.genes$g2m.genes
seurat <- CellCycleScoring(seurat, s.features=s.genes, g2m.features=g2m.genes)

# If cell cycle dominates clustering, regress it out:
seurat <- ScaleData(seurat, vars.to.regress=c("S.Score","G2M.Score"))

Trajectory Analysis with Monocle3

Trajectory analysis (pseudotime) orders cells along a continuous path representing a biological process (e.g., differentiation, activation). Monocle3 constructs a principal graph through the UMAP embedding and assigns each cell a pseudotime value based on its position along this graph.

💡
When to use trajectories: Developmental datasets (stem cell differentiation, embryogenesis), immune activation time courses, or any scenario where cells exist along a continuum rather than in discrete types.

Dataset Integration

When analysing multiple scRNA-seq datasets (different patients, batches, or experiments), batch effects cause identical cell types to cluster separately by dataset rather than by biology. Integration methods align datasets so that shared cell types overlap.

Harmony

Fast, lightweight. Operates in PCA space. Iteratively adjusts PCs to remove batch effects. Works well for most integration tasks. RunHarmony(seurat, "batch")

Seurat RPCA integration

Uses reciprocal PCA to find shared anchors between datasets. More conservative — better at preserving unique cell populations that exist in only one dataset. More computationally expensive.

⚠️
Integration pitfall: Over-integration can merge truly different cell types across datasets. Always compare integrated and non-integrated results. If a cell type appears in only one condition, integration may incorrectly merge it with a different cell type.
10x Chromium = cell barcode + UMI · nFeature 200–5000, mt% < 20% · SoupX for ambient RNA · scDblFinder for doublets · FindAllMarkers for cluster identity · Harmony for fast integration · Monocle3 for trajectories
8.2 Gene Ontology & Pathway Enrichment Analysis

After identifying differentially expressed genes, the natural next question is: what biological processes, pathways, or functions are affected? Pathway enrichment analysis moves from individual genes to biological interpretation. Rather than examining 500 DEGs one by one, you ask whether genes belonging to a specific pathway (e.g., "apoptosis" or "cell cycle") appear more often in your results than expected by chance. Two main approaches exist: Over-Representation Analysis (ORA) and Gene Set Enrichment Analysis (GSEA).

What Is a Gene Set?

A gene set is a group of genes that share a common biological property. Gene sets are defined by databases:

Gene Ontology (GO)

Three domains: Biological Process (BP), Molecular Function (MF), Cellular Component (CC). Hierarchically organised (e.g., "immune response" contains "inflammatory response" contains "cytokine production"). Very comprehensive but often redundant.

KEGG Pathways

Curated metabolic and signalling pathways with detailed diagrams. ~340 human pathways. Good for metabolism, signalling, and disease pathways. Less comprehensive than GO but more interpretable.

MSigDB Hallmark Gene Sets

50 carefully curated gene sets representing well-defined biological states (e.g., "HALLMARK_INFLAMMATORY_RESPONSE", "HALLMARK_OXIDATIVE_PHOSPHORYLATION"). Excellent starting point — covers major biological themes without redundancy.

Reactome

Peer-reviewed pathway database with detailed molecular events. ~2,500 human pathways. More granular than KEGG. Open-source and actively curated.

How ORA Works

Over-Representation Analysis uses Fisher's exact test (or hypergeometric test) to ask: Is the overlap between my DEG list and a specific pathway larger than expected by chance?

Imagine you have 500 DEGs from a universe of 15,000 expressed genes. The "apoptosis" pathway contains 200 genes. If 40 of your 500 DEGs are in the apoptosis pathway, is that more than the ~7 you would expect by chance (200/15,000 × 500)? Fisher's exact test gives you a p-value for this overlap.
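The worked example above can be checked directly with the hypergeometric test (the one-sided equivalent of Fisher's exact test), using only the Python standard library and the numbers from the paragraph:

```python
from math import comb

# Universe of 15,000 expressed genes, 500 DEGs,
# a pathway of 200 genes, and 40 of those in the DEG list
N, K, n, k = 15000, 200, 500, 40  # universe, pathway, DEGs, overlap

# Expected overlap by chance: (200 / 15,000) x 500
expected = K / N * n

# Hypergeometric upper tail: P(overlap >= k)
p_value = sum(
    comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1)
) / comb(N, n)

print(f"expected ≈ {expected:.1f}, P(overlap >= 40) = {p_value:.2e}")
```

Observing 40 overlapping genes where ~7 are expected gives a vanishingly small p-value, which is exactly why this pathway would surface as significantly over-represented.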

⚠️
Background gene set matters: Always use all expressed genes (genes that passed your expression filter) as the background, not all genes in the genome (~20,000). Using the wrong background inflates false positives because unexpressed genes cannot possibly appear in your DEG list but are counted as expected in the pathway.

How GSEA Works — Step by Step

GSEA does not use a DEG list with a cutoff. Instead, it uses the full ranked gene list (all genes ranked by fold change):

1. Rank all genes by log2 fold change (or a signed statistic like -log10(p) × sign(FC)).

2. Walk down the ranked list. For each gene, if it is in the pathway, increase a running enrichment score (ES); if not, decrease it.

3. The enrichment score is the maximum deviation from zero during the walk. A positive ES means the pathway genes are concentrated at the top (upregulated). A negative ES means they are at the bottom (downregulated).

4. Significance is assessed by permutation: shuffle gene labels 1,000 times and compute ES each time. Compare your real ES to the permutation distribution.

5. The Normalised Enrichment Score (NES) accounts for gene set size, making scores comparable across pathways.
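The walk in steps 2–3 can be sketched with a toy ranked list. This is a deliberately simplified, unweighted version of the statistic (real GSEA weights each hit by the gene's ranking metric), and the gene names and pathway are invented:

```python
# Toy ranked gene list (most upregulated first) and one pathway
ranked_genes = ["G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8", "G9", "G10"]
pathway = {"G1", "G2", "G4"}  # pathway genes concentrated near the top

n_hits = len(pathway)
n_miss = len(ranked_genes) - n_hits
hit_step = 1 / n_hits    # increment when the gene is in the pathway
miss_step = 1 / n_miss   # decrement otherwise

running, es = 0.0, 0.0
for gene in ranked_genes:
    running += hit_step if gene in pathway else -miss_step
    es = max(es, running, key=abs)  # maximum deviation from zero

print(f"ES = {es:.3f}")  # positive: pathway enriched among upregulated genes
```

Because all three pathway genes sit in the top four positions, the running score climbs early and peaks at 6/7 ≈ 0.857; a pathway concentrated at the bottom of the list would instead drive the score negative.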

📌
NES interpretation: Positive NES = pathway genes are enriched among upregulated genes. Negative NES = enriched among downregulated genes. Larger |NES| = stronger enrichment. Use padj < 0.05 for significance.

clusterProfiler Code — ORA and GSEA

library(clusterProfiler); library(org.Hs.eg.db); library(enrichplot)

# Convert gene symbols to Entrez IDs
gene_list <- bitr(rownames(sig), fromType="SYMBOL",
                  toType="ENTREZID", OrgDb=org.Hs.eg.db)
# Define background (all expressed genes, not just DEGs)
background <- bitr(rownames(res), fromType="SYMBOL",
                   toType="ENTREZID", OrgDb=org.Hs.eg.db)

# === ORA: GO Biological Process ===
ego <- enrichGO(gene          = gene_list$ENTREZID,
                universe      = background$ENTREZID,
                OrgDb         = org.Hs.eg.db,
                ont           = "BP",
                pAdjustMethod = "BH",
                qvalueCutoff  = 0.05,
                readable      = TRUE)
# Remove redundant GO terms
ego_simplified <- simplify(ego, cutoff=0.7, by="p.adjust")
dotplot(ego_simplified, showCategory=20)

# === ORA: KEGG pathways ===
kk <- enrichKEGG(gene = gene_list$ENTREZID, organism = "hsa",
                 pvalueCutoff = 0.05)
barplot(kk, showCategory=15)

# === GSEA: full ranked gene list ===
# Create named, sorted vector of fold changes
gene_rank <- setNames(res$log2FoldChange, rownames(res))
gene_rank <- sort(gene_rank, decreasing=TRUE)  # Must be sorted!
gsea_res <- gseGO(geneList     = gene_rank,
                  OrgDb        = org.Hs.eg.db,
                  ont          = "BP",
                  minGSSize    = 15,
                  maxGSSize    = 500,
                  pvalueCutoff = 0.05)

# Visualisations
ridgeplot(gsea_res, showCategory=15)
gseaplot2(gsea_res, geneSetID=1:3)  # Running score plot
# Enrichment network (shows relationships between terms)
emapplot(pairwise_termsim(ego_simplified), showCategory=20)

Common Visualisations

Dot plot

Dot size = number of genes, colour = padj, X-axis = gene ratio. The default and most informative single plot.

Enrichment map / network

Nodes = pathways, edges connect pathways sharing genes. Reveals clusters of related biological themes. Use emapplot().

Ridgeplot

Shows the distribution of fold changes for genes in each pathway. Useful for GSEA results to see the shift direction.

GSEA running score plot

Shows the running enrichment score as you walk down the ranked list. The peak/trough is the ES. Vertical lines show where pathway genes fall.

Common Pitfalls

⚠️
Redundant GO terms: GO is hierarchical, so related terms often appear together (e.g., "immune response", "innate immune response", "regulation of immune response"). Use simplify() in clusterProfiler or REVIGO to remove redundancy and focus on the most specific or most significant terms.
💡
ORA vs GSEA: Use ORA when you have a clear DEG list (padj < 0.05, |log2FC| > 1). Use GSEA when you want to detect subtle, coordinated changes across an entire pathway — especially when few individual genes pass the significance threshold but many show small shifts in the same direction.
ORA = Fisher's test on DEG list · GSEA = full ranked list, no cutoff · Positive NES = upregulated pathway · Use expressed genes as background · simplify() removes GO redundancy · MSigDB Hallmark = best starting point · clusterProfiler for both ORA and GSEA
8.3 Machine Learning in Bioinformatics

Machine learning (ML) is now embedded across bioinformatics — from protein structure prediction (AlphaFold) to variant pathogenicity scoring, cell type classification, and biomarker discovery. This section covers the key ML concepts that bioinformaticians need, including the unique challenges posed by genomic data (many features, few samples), practical workflows in Python with scikit-learn, and when ML is appropriate versus when simpler statistical methods suffice.

Supervised vs Unsupervised Learning

Supervised learning

You have labels (e.g., cancer subtype, drug response). The model learns to predict labels from features (gene expression). Examples: classify tumour vs normal, predict patient survival, identify pathogenic variants. Algorithms: Random Forest, SVM, logistic regression, neural networks.

Unsupervised learning

No labels. The model discovers structure in the data. Examples: clustering scRNA-seq cells into types, identifying patient subtypes from multi-omics data, PCA for dimensionality reduction. Algorithms: k-means, hierarchical clustering, UMAP, t-SNE, NMF.

The Curse of Dimensionality

Genomic datasets typically have 20,000+ features (genes) but only 50–200 samples. This "p >> n" problem means: (1) models can easily memorise the training data (overfit), (2) many spurious correlations exist by chance, and (3) feature selection is critical. Before training, reduce features using: variance filtering (remove low-variance genes), differential expression pre-filtering, or embedded methods (Random Forest feature importance).
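A minimal variance-filter sketch on a simulated p >> n matrix (all values invented; in practice you would filter your real expression matrix before model training, and inside CV folds as discussed below):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy matrix: 100 samples x 5000 genes (real data: ~20,000 genes)
X = rng.normal(size=(100, 5000))
X[:, :500] *= 3  # make the first 500 genes genuinely more variable

# Keep the 1,000 most variable genes
gene_var = X.var(axis=0)
top = np.argsort(gene_var)[::-1][:1000]
X_filtered = X[:, top]

print(X_filtered.shape)  # (100, 1000)
```

Variance filtering is unsupervised (it never looks at the labels), which is why it is the one feature-reduction step that is safe to apply before the train/test split.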

Cross-Validation: The Cardinal Rule

Never evaluate a model on the same data used to train it. Cross-validation provides an honest estimate of model performance on unseen data.

k-fold cross-validation

Split data into k folds (typically k=5 or k=10). Train on k-1 folds, test on the held-out fold. Repeat k times. Average the scores. The standard approach for most genomic ML.

Leave-One-Out (LOO)

Each sample is held out once. Maximises training data but computationally expensive. Useful for very small datasets (<30 samples) where you cannot afford to hold out 20% of data.

⚠️
Feature selection MUST happen inside CV: If you select the top 100 genes using all samples and then split into train/test, information from the test set leaked into feature selection. Always perform feature selection within each CV fold to get unbiased performance estimates.

Practical Example: Cancer Classification with scikit-learn

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.feature_selection import SelectKBest, f_classif

# Load expression matrix (genes x samples) and labels
expr = pd.read_csv("expression_matrix.csv", index_col=0).T  # samples x genes
labels = pd.read_csv("sample_labels.csv")["subtype"]

# Build a pipeline: feature selection → scaling → classifier
pipeline = Pipeline([
    ("select", SelectKBest(f_classif, k=200)),  # Top 200 genes by ANOVA
    ("scale", StandardScaler()),                # Z-score normalise
    ("clf", RandomForestClassifier(
        n_estimators=500, max_depth=10, random_state=42))
])

# 5-fold stratified cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, expr, labels, cv=cv, scoring="roc_auc_ovr")
print(f"Mean AUC: {scores.mean():.3f} (+/- {scores.std():.3f})")

# Train final model and get feature importance
pipeline.fit(expr, labels)
selector = pipeline.named_steps["select"]
clf = pipeline.named_steps["clf"]

# Get selected gene names and their importance
selected_genes = expr.columns[selector.get_support()]
importance = pd.Series(clf.feature_importances_, index=selected_genes)
top_biomarkers = importance.sort_values(ascending=False).head(20)
print("Top 20 biomarker genes:")
print(top_biomarkers)

ROC Curve and AUC Interpretation

The ROC (Receiver Operating Characteristic) curve plots the True Positive Rate vs False Positive Rate at different classification thresholds. The AUC (Area Under the Curve) summarises overall performance:

AUC = 0.5

Random guessing. The model has no discriminative ability. Your features carry no signal for this task.

AUC = 0.7–0.8

Acceptable discrimination. Useful for biomarker discovery but may not be sufficient for clinical decisions.

AUC = 0.8–0.9

Good discrimination. Strong enough for most research applications and preliminary clinical use.

AUC > 0.9

Excellent discrimination. Be cautious — if AUC is >0.95 on a small dataset, double-check for data leakage or confounders (batch effect masquerading as biology).

Deep Learning for Sequence Data

Deep learning excels when the input is raw sequence data (DNA, RNA, protein). A common architecture is a 1D convolutional neural network (CNN) applied to one-hot encoded DNA sequences (4 channels: A, C, G, T). These models can learn motifs (like transcription factor binding sites) directly from sequence without hand-crafted features.

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# 1D CNN for TF binding site prediction
# Input: one-hot encoded DNA sequences (batch, length=200, channels=4)
model = models.Sequential([
    layers.Conv1D(64, kernel_size=12, activation="relu", input_shape=(200, 4)),
    layers.MaxPooling1D(pool_size=4),
    layers.Conv1D(128, kernel_size=8, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid")  # Binary: bound / not bound
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
model.summary()

# One-hot encode a DNA sequence (unknown bases such as N become all zeros)
def one_hot_encode(seq):
    mapping = {"A": [1,0,0,0], "C": [0,1,0,0],
               "G": [0,0,1,0], "T": [0,0,0,1]}
    return np.array([mapping.get(base, [0,0,0,0]) for base in seq])

When NOT to Use Machine Learning

⚠️
ML is not always the answer: (1) Small sample sizes (n < 30) — models cannot generalise reliably. Use traditional statistics instead. (2) Confounded data — if your treatment is perfectly correlated with batch, no ML model can separate them. Fix the experimental design. (3) Simple questions — "which genes are differentially expressed?" is better answered by DESeq2 than by a neural network. Use the simplest method that works.
Supervised = labelled data · p >> n = curse of dimensionality · Feature selection inside CV · AUC 0.5 = random, >0.9 = excellent · Random Forest for biomarker genes · 1D CNN for DNA sequence · Never test on training data · Simplest method that works
8.4 Metagenomics & Microbiome Analysis

Metagenomics is the study of genetic material recovered directly from environmental or clinical samples — soil, ocean water, the human gut — without the need to culture individual organisms. It reveals the composition and functional potential of entire microbial communities. Two main approaches exist: 16S rRNA amplicon sequencing (which bacteria are present?) and shotgun metagenomics (what are they doing?). This section covers both workflows in detail.

16S rRNA Amplicon Sequencing

The 16S ribosomal RNA gene is present in all bacteria and archaea. It contains conserved regions (where universal primers bind) flanking variable regions (V1–V9) that differ between species. By PCR-amplifying and sequencing the V3-V4 region (~460bp), you can identify which bacterial taxa are present in your sample. This approach is cost-effective, well-established, and sufficient for community profiling.

ASV vs OTU

OTU (Operational Taxonomic Unit)

Older approach. Clusters sequences at 97% similarity. Loses resolution (different species with >97% identity are merged). Computationally simple but biologically imprecise. Tools: QIIME1, mothur (legacy).

ASV (Amplicon Sequence Variant)

Modern standard. Resolves sequences to single-nucleotide precision. Each unique error-corrected sequence is an ASV. Reproducible across studies (unlike OTUs which depend on clustering parameters). Superior in every way. Tools: DADA2, Deblur.

💡
Always use ASVs. OTUs are a legacy concept from when sequencing error rates made exact sequences unreliable. Modern denoising algorithms (DADA2) correct errors statistically, making ASVs the standard since ~2017.

QIIME2 Pipeline Overview

# Install QIIME2 (in a conda environment)
conda env create -n qiime2 --file qiime2-amplicon-2024.5-py39-linux-conda.yml
conda activate qiime2

# Step 1: Import raw FASTQ files
qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path raw_reads/ \
  --input-format CasavaOneEightSingleLanePerSampleDirFmt \
  --output-path demux.qza

# Step 2: Visualise read quality (choose truncation lengths)
qiime demux summarize --i-data demux.qza --o-visualization demux.qzv

# Step 3: Denoise with DADA2 (produces ASV table)
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trunc-len-f 280 \
  --p-trunc-len-r 220 \
  --p-n-threads 8 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats stats.qza

# Step 4: Assign taxonomy (using SILVA or Greengenes2)
qiime feature-classifier classify-sklearn \
  --i-classifier silva-138-99-nb-classifier.qza \
  --i-reads rep-seqs.qza \
  --o-classification taxonomy.qza

# Step 5: Build phylogenetic tree (needed for UniFrac)
qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences rep-seqs.qza \
  --o-alignment aligned-rep-seqs.qza \
  --o-masked-alignment masked-aligned.qza \
  --o-tree unrooted-tree.qza \
  --o-rooted-tree rooted-tree.qza

# Step 6: Alpha and beta diversity
qiime diversity core-metrics-phylogenetic \
  --i-phylogeny rooted-tree.qza \
  --i-table table.qza \
  --p-sampling-depth 10000 \
  --m-metadata-file metadata.tsv \
  --output-dir diversity-results/

Taxonomy Databases

SILVA

Comprehensive database for bacteria, archaea, and eukaryotes (18S/28S). Version 138.1 is the current standard. Highest quality curated taxonomy. Recommended for most studies.

Greengenes2 (2022)

Rebuilt from scratch using whole-genome phylogenetics. Consistent taxonomy derived from the Genome Taxonomy Database (GTDB). Better phylogenetic accuracy than the original Greengenes. Compatible with UniFrac.

Alpha Diversity Metrics

Alpha diversity measures the diversity within a single sample. Different metrics capture different aspects:

Shannon index (H')

Accounts for both richness (number of species) and evenness (how evenly individuals are distributed). Higher H' = more diverse. A sample dominated by one species has low Shannon even if many species are present.

Simpson index (1-D)

Probability that two randomly drawn individuals belong to different species. Emphasises dominant species. Less sensitive to rare species than Shannon. Range: 0 (no diversity) to 1 (high diversity).

Observed ASVs (richness)

Simply the number of unique ASVs detected. Sensitive to sequencing depth — always rarefy (subsample to equal depth) before comparing.

Faith's Phylogenetic Diversity (PD)

Sum of branch lengths in the phylogenetic tree connecting all species in the sample. Captures phylogenetic breadth — a sample with many distantly related species has higher PD than one with many closely related species.
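The non-phylogenetic metrics above can be computed directly from an ASV count vector. The counts below are a made-up single sample, just to show the formulas:

```python
import numpy as np

# Toy ASV counts for one sample
counts = np.array([500, 300, 150, 40, 10], dtype=float)
p = counts / counts.sum()  # relative abundances

richness = int((counts > 0).sum())    # observed ASVs
shannon = -(p * np.log(p)).sum()      # Shannon index H' (richness + evenness)
simpson = 1 - (p ** 2).sum()          # Simpson index (1 - D)

print(richness, round(shannon, 3), round(simpson, 3))
```

To see the evenness effect, replace the counts with five equal values: richness stays at 5, but Shannon rises to its maximum for 5 taxa (ln 5 ≈ 1.609), whereas this dominated sample scores lower.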

Beta Diversity and Ordination

Beta diversity measures differences between samples. Each pair of samples gets a distance value, producing a distance matrix that is visualised by ordination (PCoA).

Bray-Curtis dissimilarity

Based on shared species abundances. Does not consider phylogeny. Range: 0 (identical) to 1 (completely different). The most common non-phylogenetic metric.

UniFrac (weighted & unweighted)

Incorporates phylogenetic relationships. Unweighted UniFrac considers only presence/absence (sensitive to rare taxa). Weighted UniFrac also considers abundance (sensitive to dominant taxa). Requires a phylogenetic tree.
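Of the two, Bray-Curtis is simple enough to write out from its definition. A sketch with invented abundance vectors for two hypothetical samples:

```python
import numpy as np

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: 1 - 2*(shared abundance)/(total of both)."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    return 1 - 2 * np.minimum(u, v).sum() / (u.sum() + v.sum())

a = [10, 20, 0, 30]   # ASV counts, sample A
b = [10,  0, 5, 45]   # ASV counts, sample B

print(bray_curtis(a, a))             # 0.0 -> identical samples
print(round(bray_curtis(a, b), 3))   # 0.333
```

Computing this for every pair of samples yields the distance matrix that PCoA ordinates and PERMANOVA tests.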

PERMANOVA for Statistical Testing

To test whether microbial communities differ significantly between groups (e.g., healthy vs diseased), use PERMANOVA (Permutational Multivariate Analysis of Variance). It tests whether the centroids of groups in beta diversity space are significantly different by comparing observed F-statistics to a permutation distribution.

# PERMANOVA in QIIME2
qiime diversity beta-group-significance \
  --i-distance-matrix diversity-results/bray_curtis_distance_matrix.qza \
  --m-metadata-file metadata.tsv \
  --m-metadata-column group \
  --p-method permanova \
  --o-visualization permanova-results.qzv

# In R with vegan:
# library(vegan)
# adonis2(dist_matrix ~ group, data=metadata, permutations=999)

Shotgun Metagenomics Workflow

Unlike 16S amplicon sequencing, shotgun metagenomics sequences all DNA in the sample. This reveals not just which organisms are present (taxonomy) but also their functional potential (metabolic pathways, antibiotic resistance genes).

# Shotgun metagenomics pipeline

# Step 1: Quality control (remove host reads if applicable)
fastp -i sample_R1.fastq.gz -I sample_R2.fastq.gz \
  -o clean_R1.fastq.gz -O clean_R2.fastq.gz \
  --detect_adapter_for_pe

# Remove human reads (for clinical samples)
bowtie2 -x human_genome -1 clean_R1.fastq.gz -2 clean_R2.fastq.gz \
  --un-conc-gz non_host_%.fastq.gz -S /dev/null

# Step 2: Taxonomic profiling with MetaPhlAn4
conda install -c bioconda metaphlan
metaphlan non_host_1.fastq.gz,non_host_2.fastq.gz \
  --input_type fastq \
  --nproc 8 \
  -o taxonomy_profile.txt

# Step 3: Functional profiling with HUMAnN3
conda install -c bioconda humann
humann --input non_host_1.fastq.gz \
  --output humann_output/ \
  --threads 8

# HUMAnN3 outputs:
#   genefamilies.tsv  — gene family abundances (UniRef90)
#   pathabundance.tsv — metabolic pathway abundances (MetaCyc)
#   pathcoverage.tsv  — pathway coverage (fraction of pathway detected)

# Step 4: Merge profiles across samples
humann_join_tables -i humann_output/ -o merged_genefamilies.tsv \
  --file_name genefamilies
humann_renorm_table -i merged_genefamilies.tsv -o merged_genefamilies_cpm.tsv \
  --units cpm
📌
16S vs Shotgun: Use 16S for cost-effective community profiling when you only need to know "who is there." Use shotgun metagenomics when you also need to know "what are they doing" (functional profiling) or when you need to identify viruses and eukaryotes (which lack the 16S gene).
⚠️
Rarefaction: Before comparing alpha diversity across samples, rarefy (subsample) all samples to the same sequencing depth. Samples with more reads will always show more taxa, creating a false diversity difference. Choose a rarefaction depth that retains most samples while providing sufficient coverage (e.g., the 10th percentile of library sizes).
16S V3-V4 for bacterial profiling · ASV > OTU (always) · DADA2 for denoising · SILVA for taxonomy · Shannon = richness + evenness · UniFrac = phylogenetic beta diversity · PERMANOVA for group testing · MetaPhlAn4 + HUMAnN3 for shotgun
Module 9 · Proteomics

Mass Spectrometry & Protein Analysis

Proteomics studies the entire complement of proteins in a cell, tissue, or organism. Unlike the genome (static), the proteome is dynamic — it changes with conditions, time, and cell type. Mass spectrometry is the workhorse technology for large-scale protein identification and quantification.

9.1 Mass Spectrometry Principles

Mass spectrometry (MS) measures the mass-to-charge ratio (m/z) of ionised molecules. In proteomics, proteins are digested into peptides (typically with trypsin), separated by liquid chromatography (LC), ionised, and then measured in the mass spectrometer. The combination — LC-MS/MS — is the standard workflow for identifying and quantifying thousands of proteins in a single experiment.

The LC-MS/MS Workflow

A typical bottom-up proteomics experiment follows these steps: (1) Protein extraction — lyse cells, solubilise proteins. (2) Digestion — trypsin cleaves after lysine (K) and arginine (R), producing peptides of 7–35 amino acids. (3) Separation — reverse-phase liquid chromatography separates peptides by hydrophobicity over a 60–120 minute gradient. (4) Ionisation — electrospray ionisation (ESI) converts peptides into gas-phase ions. (5) MS1 scan — measures intact peptide masses. (6) MS2 scan — isolates individual peptide ions and fragments them (HCD or CID). The fragmentation pattern reveals the amino acid sequence.
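What the MS1 scan actually records is m/z, not mass: a peptide of neutral mass M carrying z protons appears at (M + z·1.007276)/z. A one-line sketch (the 1500.6 Da mass is an arbitrary example):

```python
PROTON = 1.007276  # monoisotopic proton mass, Da

def mz(neutral_mass, charge):
    """m/z at which a peptide of given neutral monoisotopic mass appears."""
    return (neutral_mass + charge * PROTON) / charge

# The same peptide appears at a different m/z for each charge state:
mz(1500.6, 1)   # ~1501.61
mz(1500.6, 2)   # ~751.31
mz(1500.6, 3)   # ~501.21
```

This is why deconvolving charge states is one of the first steps in interpreting an MS1 spectrum.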

Key Instrument Types

Orbitrap (Thermo)

High resolution (120,000–500,000), high mass accuracy (<2 ppm). Gold standard for discovery proteomics. Exploris 480, Q Exactive series.

TOF (Bruker timsTOF)

Time-of-flight with trapped ion mobility (TIMS). Very fast scan rates. DIA-PASEF enables deep proteome coverage. Excellent for clinical applications.

Triple Quadrupole (QQQ)

Lower resolution but extremely sensitive and quantitative. Used for targeted proteomics (MRM/SRM) — measuring specific proteins with high precision.

MALDI-TOF

Matrix-assisted laser desorption. Used for intact protein profiling, tissue imaging (MALDI-MSI), and microbial identification (Bruker Biotyper).

DDA vs DIA Acquisition

DDA (Data-Dependent)

Selects the top N most intense ions from MS1 for fragmentation. Simple, well-established. Stochastic — may miss low-abundance peptides. Less reproducible between runs.

DIA (Data-Independent)

Fragments ALL ions in wide m/z windows systematically. No selection bias. More reproducible and comprehensive. Requires spectral libraries or library-free tools (DIA-NN). Now the preferred method.

Recommended
💡
Modern trend: DIA with tools like DIA-NN or Spectronaut is rapidly replacing DDA for most applications. DIA-NN can identify 8,000–10,000 proteins from a single human cell line in a 90-minute gradient, with excellent quantitative reproducibility.
LC-MS/MS is standard workflow · Trypsin cleaves at K and R · Orbitrap = high resolution · DIA > DDA for reproducibility · DIA-NN for library-free analysis
9.2 Protein Identification & Database Searching

After acquiring MS/MS spectra, the key bioinformatics task is identifying which peptides (and therefore proteins) generated them. This is done by matching experimental spectra against theoretical spectra predicted from a protein sequence database. The process is called database searching — and it is conceptually similar to BLAST, but for mass spectra instead of sequences.

How Database Search Works

(1) Take each experimental MS2 spectrum. (2) In silico digest all proteins from a reference database (e.g., UniProt human) with trypsin. (3) For each candidate peptide whose mass matches the precursor mass (within tolerance), predict the theoretical fragmentation spectrum. (4) Score the match between experimental and theoretical spectra. (5) Apply statistical validation (FDR control at 1% using target-decoy approach).
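Steps (2)–(3) can be made concrete: an in silico tryptic digest plus a precursor-mass filter. A toy Python sketch (the sequence and tolerance are chosen for illustration; real engines also handle missed cleavages and modifications):

```python
# Monoisotopic residue masses (Da); peptide mass = sum of residues + water
MASS = {'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
        'V': 99.06841, 'T': 101.04768, 'C': 103.00919, 'L': 113.08406,
        'I': 113.08406, 'N': 114.04293, 'D': 115.02694, 'Q': 128.05858,
        'K': 128.09496, 'E': 129.04259, 'M': 131.04049, 'H': 137.05891,
        'F': 147.06841, 'R': 156.10111, 'Y': 163.06333, 'W': 186.07931}
WATER = 18.01056

def tryptic_digest(protein):
    """Cleave after K/R, except when followed by P (the trypsin rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in 'KR' and (i + 1 == len(protein) or protein[i + 1] != 'P'):
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides

def peptide_mass(pep):
    return sum(MASS[aa] for aa in pep) + WATER

def candidates(protein, precursor_mass, tol_ppm=10):
    """Peptides whose mass matches the precursor within tolerance."""
    return [p for p in tryptic_digest(protein)
            if abs(peptide_mass(p) - precursor_mass) / precursor_mass * 1e6 <= tol_ppm]
```

Each candidate surviving this mass filter would then be scored against the experimental MS2 fragmentation pattern (step 4).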

Major Search Engines

MaxQuant

Free, widely used. Built-in Andromeda search engine. Excellent for label-free and SILAC. GUI-based. Standard in academic proteomics.

MSFragger / FragPipe

Ultrafast open search. 100× faster than traditional engines. Integrated in FragPipe with downstream analysis. Excellent for PTM discovery.

DIA-NN

Deep neural network for DIA data. Library-free mode generates in silico spectral libraries. State-of-the-art sensitivity and speed.

# ─── MaxQuant (GUI-based, download from maxquant.org) ───
# Key settings to configure:
#   Raw files: add your .raw files
#   FASTA: download from uniprot.org (reviewed Swiss-Prot + isoforms)
#   Digestion: Trypsin/P, max 2 missed cleavages
#   Fixed mod: Carbamidomethyl (C) — from iodoacetamide alkylation
#   Variable mods: Oxidation (M), Acetyl (Protein N-term)
#   FDR: 1% at PSM and protein level
#   Match between runs: ON (transfers IDs across runs)
#   LFQ: ON for label-free quantification

# ─── DIA-NN (command line) ───
# Download from github.com/vdemichev/DiaNN
diann --f sample1.raw --f sample2.raw --f sample3.raw \
  --lib "" \
  --fasta uniprot_human.fasta \
  --gen-spec-lib \
  --predictor \
  --qvalue 0.01 \
  --matrices \
  --out-lib report-lib.tsv \
  --out report.tsv

# Key output files:
#   report.tsv           — all identified precursors with quantities
#   report.pg_matrix.tsv — protein group quantities (use this!)
⚠️
Target-Decoy FDR: All search engines use a decoy database (reversed or shuffled sequences) to estimate false discovery rate. At 1% FDR: for every 100 identified proteins, ~1 is expected to be wrong. Always use 1% FDR at both PSM (peptide-spectrum match) and protein level. Never relax this to 5%.
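The target-decoy estimate itself is simple: sort PSMs by score and, at any cutoff, estimate FDR as decoys/targets above it. A minimal sketch (the scores are invented; real engines compute q-values and model score distributions more carefully):

```python
def score_cutoff(psms, fdr=0.01):
    """psms: list of (score, is_decoy) tuples. Returns the lowest score
    cutoff at which the decoy-based FDR estimate stays <= fdr."""
    targets = decoys = 0
    cutoff = None
    for score, is_decoy in sorted(psms, key=lambda p: -p[0]):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= fdr:
            cutoff = score   # keep extending the accepted set
    return cutoff

psms = [(10, False), (9, False), (8, True), (7, False)]
score_cutoff(psms, fdr=0.5)   # tolerates the decoy at 8 -> cutoff 7
score_cutoff(psms, fdr=0.1)   # stops before the decoy  -> cutoff 9
```

A stricter FDR pushes the cutoff up and shrinks the accepted set, which is why relaxing 1% to 5% buys identifications at the cost of reliability.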
1% FDR is mandatory · MaxQuant for DDA · DIA-NN for DIA · FragPipe = fastest · Use reviewed UniProt FASTA
9.3 Quantitative Proteomics & Statistical Analysis

Quantitative proteomics measures how much of each protein is present, enabling comparison between conditions. There are two main strategies: label-free quantification (LFQ), which compares ion intensities directly, and labelling methods (TMT, SILAC), which use chemical or metabolic tags to multiplex samples.

Quantification Strategies

Label-Free (LFQ)

Compare MS1 peptide intensities across runs. Simple, no extra chemistry. Requires careful normalisation. Best with DIA. MaxLFQ algorithm in MaxQuant.

Most common
TMT/iTRAQ

Chemical labels with same mass but different fragmentation reporters. Multiplex up to 18 samples in one run (TMTpro 18-plex). Reduces missing values. Higher throughput.

SILAC

Metabolic labelling — grow cells with heavy amino acids (13C-Lys, 13C-Arg). Most accurate quantification. Only works for cell culture. Not for tissues/clinical samples.

Statistical Analysis with R

# ─── Differential protein abundance analysis ───
# Using limma (same package used for microarrays — works great for proteomics)
library(limma)
library(tidyverse)

# Load MaxQuant proteinGroups.txt or DIA-NN pg_matrix
pg <- read.delim("proteinGroups.txt")

# Filter: remove contaminants, reverse hits, only identified by site
pg <- pg[pg$Reverse != "+" &
         pg$Potential.contaminant != "+" &
         pg$Only.identified.by.site != "+", ]

# Extract LFQ intensity columns and log2-transform
lfq <- pg[, grep("^LFQ.intensity.", names(pg))]
lfq[lfq == 0] <- NA   # Replace 0 with NA
mat <- log2(as.matrix(lfq))

# Filter: keep proteins with values in at least 70% of samples per group
# (adjust based on your design)
# Imputation: replace missing values (MinProb method — common)
# Or use DEqMS / msqrob2 for more sophisticated handling

# Differential analysis with limma
group <- factor(c("Control","Control","Control","Treated","Treated","Treated"))
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)
fit <- lmFit(mat, design)
contrast <- makeContrasts(Treated - Control, levels = design)
fit2 <- contrasts.fit(fit, contrast)
fit3 <- eBayes(fit2)
results <- topTable(fit3, number = Inf, sort.by = "P")
sig <- results[results$adj.P.Val < 0.05 & abs(results$logFC) > 1, ]
cat(nrow(sig), "significantly changed proteins (|logFC|>1, FDR<0.05)\n")
💡
Missing values in proteomics: Unlike RNA-seq, proteomics data has many missing values (10–30% typically). These can be MCAR (missing completely at random — stochastic), MNAR (missing not at random — protein is below detection limit), or a mixture. Imputation strategy matters: use MinProb for MNAR, kNN for MCAR. Better yet, use statistical methods that handle missingness directly: DEqMS, msqrob2, or proDA.
LFQ for most experiments · TMT for multiplexing · limma works well for proteomics · Handle missing values carefully · |logFC| > 1 + FDR < 0.05
Module 10 · Structural Bioinformatics

Protein Structure & AlphaFold

Protein function is determined by its three-dimensional structure. Structural bioinformatics uses computational methods to predict, analyse, and visualise protein structures — revolutionised by AlphaFold, which can predict structures with near-experimental accuracy.

10.1 Protein Structure Levels & Databases

Proteins fold into defined 3D structures that determine their function. Understanding the four levels of protein structure is fundamental to structural bioinformatics and drug design.

The Four Levels of Structure

Primary (1°)

The linear amino acid sequence. Determines all higher levels. Read from gene sequence. 20 standard amino acids linked by peptide bonds.

Secondary (2°)

Local folding patterns: α-helices (3.6 residues per turn, stabilised by i→i+4 hydrogen bonds) and β-sheets (parallel or antiparallel strands). Loops connect regular elements.

Tertiary (3°)

The complete 3D fold of a single polypeptide chain. Stabilised by hydrophobic core packing, disulfide bonds, salt bridges, and hydrogen bonds. This is what we see in PDB structures.

Quaternary (4°)

Assembly of multiple polypeptide chains (subunits) into a functional complex. Example: haemoglobin is a tetramer (α2β2). Many enzymes and receptors function as oligomers.

Key Structural Databases

PDB (rcsb.org)

~220,000 experimentally determined structures. Solved by X-ray crystallography (~87%), cryo-EM (~10%), or NMR (~3%). The primary source for known structures.

AlphaFold DB

~200 million predicted structures covering nearly all known proteins. Free access at alphafold.ebi.ac.uk. Quality indicated by pLDDT confidence score (blue = high, orange = low).

UniProt

Links sequence → structure → function. Cross-references PDB and AlphaFold entries. Shows domain architecture, active sites, binding sites, and post-translational modifications.

# Download a PDB structure
wget https://files.rcsb.org/download/1UBQ.pdb

# Download AlphaFold prediction by UniProt ID
wget https://alphafold.ebi.ac.uk/files/AF-P0DTD1-F1-model_v4.pdb

# Batch download AlphaFold structures for a proteome
wget https://ftp.ebi.ac.uk/pub/databases/alphafold/latest/UP000005640_9606_HUMAN_v4.tar

# Visualise (install PyMOL or ChimeraX)
mamba install -c conda-forge pymol-open-source
pymol 1UBQ.pdb
📌
Reading pLDDT scores: AlphaFold provides a per-residue confidence score (pLDDT, 0–100). >90 = very high confidence (core domains). 70–90 = confident. 50–70 = low confidence (may be flexible). <50 = disordered/unstructured regions. Always check pLDDT before interpreting predicted structures.
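Because AlphaFold stores pLDDT in the PDB B-factor column, extracting per-residue confidence takes only a few lines. A sketch (column positions follow the fixed-width PDB ATOM record format):

```python
def plddt_per_residue(pdb_path):
    """Per-residue pLDDT from an AlphaFold PDB file, where it is stored
    in the B-factor column (identical for every atom of a residue)."""
    scores = {}
    with open(pdb_path) as fh:
        for line in fh:
            if line.startswith("ATOM"):
                resnum = int(line[22:26])            # residue number, cols 23-26
                scores[resnum] = float(line[60:66])  # B-factor, cols 61-66
    return scores

def confidence(plddt):
    """AlphaFold's published confidence bands."""
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "confident"
    if plddt > 50:
        return "low"
    return "disordered"
```

The fraction of residues scoring above 70 is a quick sanity check before trusting a downloaded model.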
PDB for experimental structures · AlphaFold for predictions · pLDDT > 70 = trustworthy · PyMOL or ChimeraX for visualisation
10.2 Structure Prediction — AlphaFold & ColabFold

AlphaFold2, developed by DeepMind, solved the protein structure prediction problem in 2020. It predicts 3D structures from amino acid sequence alone with near-experimental accuracy. ColabFold makes it easy to run AlphaFold predictions using Google Colab — no local GPU required.

How AlphaFold Works (Simplified)

(1) MSA generation — searches sequence databases to find homologous proteins (evolutionary information). (2) Template search — finds known structures of related proteins in PDB. (3) Evoformer — a deep neural network processes the MSA and pairwise features through 48 attention-based blocks. (4) Structure module — converts learned representations into 3D atomic coordinates. (5) Recycling — repeats the prediction 3 times, feeding output back as input, improving accuracy each time.

Running ColabFold

💡
Easiest way to predict a structure: Go to colab.research.google.com and search for "ColabFold" notebook. Paste your amino acid sequence, click "Runtime → Run all". The prediction takes 5–30 minutes depending on protein size. Output: PDB file + pLDDT scores + PAE (predicted aligned error) plot.
# ─── Local AlphaFold (requires GPU + ~2 TB databases) ───
# Use LocalColabFold for easier local installation:
git clone https://github.com/YoshitakaMo/localcolabfold.git
cd localcolabfold
bash install_colabbatch_linux.sh

# Predict a single protein
colabfold_batch input.fasta output_dir/

# Predict a protein complex (separate chains with :)
# In FASTA: >complex\nSEQUENCE_CHAIN_A:SEQUENCE_CHAIN_B
colabfold_batch complex.fasta output_dir/ --model-type alphafold2_multimer_v3

# ─── AlphaFold3 (online, free for academics) ───
# Visit alphafoldserver.com — predicts proteins, DNA, RNA, ligands, ions
# Supports complexes of any type (protein-DNA, protein-drug, etc.)
⚠️
Limitations of AlphaFold: (1) Does not predict conformational changes — gives a single static structure. (2) Disordered regions (low pLDDT) should not be interpreted structurally. (3) Cannot predict effects of mutations reliably. (4) Ligand/cofactor binding sites may not be accurate without the ligand present. (5) AlphaFold2 struggles with protein-protein interfaces — use AlphaFold-Multimer or AlphaFold3 for complexes.
ColabFold = easiest way to predict · AlphaFold3 for complexes · Check pLDDT and PAE · Static structure only
10.3 Molecular Docking & Drug Discovery

Molecular docking predicts how a small molecule (ligand/drug) binds to a protein target. It is a cornerstone of computer-aided drug design (CADD). The goal is to find the binding pose (orientation + conformation) and estimate binding affinity. Virtual screening uses docking to test millions of compounds computationally before synthesising the most promising ones.

Docking Tools

AutoDock Vina

The most widely used docking tool. Fast, accurate, free. Flexible ligand + rigid receptor. Easy to use via command line.

Start here
Gnina

CNN-based scoring function built on Vina. Better pose prediction than classical scoring. GPU-accelerated.

DiffDock

Diffusion model for docking. Does not require predefined binding site. State-of-the-art blind docking accuracy.

# ─── AutoDock Vina workflow ───
mamba install -c conda-forge autodock-vina
mamba install -c conda-forge openbabel   # For format conversion

# 1. Prepare receptor (remove water, add hydrogens)
#    Use ADFR suite or Open Babel:
obabel receptor.pdb -O receptor.pdbqt -h

# 2. Prepare ligand
obabel ligand.sdf -O ligand.pdbqt --gen3d -h

# 3. Define search box (center on binding site)
# 4. Run docking
vina --receptor receptor.pdbqt --ligand ligand.pdbqt \
  --center_x 10.0 --center_y 20.0 --center_z 15.0 \
  --size_x 20 --size_y 20 --size_z 20 \
  --exhaustiveness 32 --num_modes 9 \
  --out docked_poses.pdbqt

# Output: ranked poses with binding affinity (kcal/mol)
#   More negative = stronger predicted binding
#   Typical drug: -7 to -12 kcal/mol
⚠️
Docking limitations: Docking scoring functions are approximate — they estimate binding affinity with ~2 kcal/mol error. Use docking for ranking (which compounds bind better), not absolute binding energy prediction. Always validate top hits experimentally. For more accurate binding free energies, use molecular dynamics with free energy perturbation (FEP) or MM-PBSA methods.
AutoDock Vina to start · DiffDock for blind docking · Use for ranking, not absolute affinity · Always validate experimentally
Module 11 · Epigenomics

ChIP-seq, ATAC-seq & DNA Methylation

Epigenetics studies heritable changes in gene activity that do not involve changes to the DNA sequence. These modifications — histone marks, chromatin accessibility, and DNA methylation — control which genes are on or off in each cell type. Epigenomic assays reveal the regulatory landscape of the genome.

11.1 ChIP-seq — Mapping Protein-DNA Interactions

Chromatin Immunoprecipitation followed by sequencing (ChIP-seq) identifies where a specific protein (transcription factor or modified histone) binds across the genome. The protein-DNA complexes are cross-linked with formaldehyde, chromatin is sheared into 200–500 bp fragments, an antibody pulls down fragments bound by the target protein, and the bound DNA is sequenced. Peaks in the sequencing signal reveal binding sites.

ChIP-seq Analysis Pipeline

# ─── Complete ChIP-seq pipeline ───
mamba install -c bioconda bowtie2 samtools macs2 deeptools bedtools homer

# 1. Align reads to genome (bowtie2, not HISAT2 — no splice awareness needed)
bowtie2 -x bt2_index/genome -U chip_sample.fastq.gz \
  --very-sensitive -p 8 | samtools sort -O bam -o chip.bam
samtools index chip.bam

# Same for input/IgG control
bowtie2 -x bt2_index/genome -U input_control.fastq.gz \
  --very-sensitive -p 8 | samtools sort -O bam -o input.bam
samtools index input.bam

# 2. Remove duplicates
#    (for paired-end data, run samtools fixmate -m on name-sorted BAM first)
samtools markdup -r chip.bam chip_dedup.bam
samtools markdup -r input.bam input_dedup.bam

# 3. Call peaks with MACS2
# For transcription factors (narrow peaks):
macs2 callpeak -t chip_dedup.bam -c input_dedup.bam \
  -f BAM -g hs --outdir peaks/ -n my_tf -q 0.01

# For histone marks like H3K27me3 (broad peaks):
macs2 callpeak -t chip_dedup.bam -c input_dedup.bam \
  -f BAM -g hs --outdir peaks/ -n h3k27me3 --broad --broad-cutoff 0.1

# 4. Visualise with deepTools
bamCoverage -b chip_dedup.bam -o chip.bw --normalizeUsing RPGC \
  --effectiveGenomeSize 2913022398 --binSize 10 -p 8

# 5. Motif analysis (what TF motifs are enriched in peaks?)
findMotifsGenome.pl peaks/my_tf_peaks.narrowPeak hg38 motif_output/ -size 200
⚠️
Always use an input control. ChIP-seq without a control (input DNA or IgG pulldown) will produce false peaks at open chromatin regions and repetitive elements. The control captures background signal, enabling the peak caller to distinguish real binding from noise. No control = no publication.
💡
CUT&RUN / CUT&Tag: Modern alternatives to ChIP-seq that require far fewer cells (100–1,000 vs millions), have lower background, and are cheaper. CUT&Tag is especially popular — it uses a protein A-Tn5 fusion to cut and tag DNA near the antibody target. Analysis is similar to ChIP-seq but use SEACR instead of MACS2 for peak calling.
Bowtie2 for alignment · MACS2 for peak calling · Always include input control · HOMER for motif analysis · CUT&Tag is replacing ChIP-seq
11.2 ATAC-seq — Chromatin Accessibility

ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) maps open chromatin regions genome-wide. It uses the Tn5 transposase, which preferentially inserts sequencing adapters into accessible (open) chromatin, avoiding tightly packed nucleosome-occupied regions. Open chromatin corresponds to active regulatory elements — promoters, enhancers, and transcription factor binding sites.

ATAC-seq Analysis Pipeline

# ─── ATAC-seq pipeline ───
mamba install -c bioconda bowtie2 samtools macs2 deeptools picard

# 1. Align with bowtie2 (paired-end, max fragment size 2000)
bowtie2 -x bt2_index/genome -1 atac_R1.fastq.gz -2 atac_R2.fastq.gz \
  --very-sensitive -X 2000 --no-mixed --no-discordant -p 8 | \
  samtools sort -O bam -o atac.bam

# 2. Remove mitochondrial reads (major contaminant in ATAC-seq!)
samtools view -h atac.bam | grep -v chrM | samtools sort -O bam -o atac_nomt.bam
samtools index atac_nomt.bam

# 3. Remove duplicates
picard MarkDuplicates I=atac_nomt.bam O=atac_dedup.bam \
  M=dup_metrics.txt REMOVE_DUPLICATES=true
samtools index atac_dedup.bam

# 4. Shift reads (Tn5 inserts with +4/-5 bp offset)
alignmentSieve -b atac_dedup.bam -o atac_shifted.bam \
  --ATACshift --minFragmentLength 0 --maxFragmentLength 2000

# 5. Call peaks (BAMPE mode uses actual fragment ends, so no shift model;
#    MACS2 rejects --shift/--extsize together with -f BAMPE)
macs2 callpeak -t atac_shifted.bam -f BAMPE -g hs \
  --nomodel --outdir peaks/ -n atac -q 0.01 --keep-dup all

# 6. Generate signal track
bamCoverage -b atac_shifted.bam -o atac.bw --normalizeUsing RPGC \
  --effectiveGenomeSize 2913022398 --binSize 10 -p 8
📌
Fragment size distribution: A good ATAC-seq library shows a characteristic nucleosomal ladder: a large peak below ~100 bp (nucleosome-free fragments), a peak at ~200 bp (mono-nucleosome), and a smaller peak at ~400 bp (di-nucleosome). A strong sub-100 bp peak with a visible ladder indicates efficient transposition. If the distribution is broad without clear peaks, the experiment may have had too many or too few Tn5 insertions.
Remove chrM reads first · Tn5 shift correction is essential · Nucleosomal ladder = good QC · No input control needed
11.3 DNA Methylation & Bisulfite Sequencing

DNA methylation — the addition of a methyl group to the 5th carbon of cytosine (5mC) — is the most studied epigenetic modification. In mammals, it occurs predominantly at CpG dinucleotides. Methylation of gene promoters generally silences transcription. It plays critical roles in development, X-chromosome inactivation, genomic imprinting, and cancer (where aberrant methylation is a hallmark).

Detection Methods

WGBS

Whole-genome bisulfite sequencing. Gold standard. Single-CpG resolution across entire genome. Expensive — needs 10–15× coverage. Bisulfite converts unmethylated C→U; methylated C is protected.

RRBS

Reduced Representation BS. MspI digestion enriches CpG-rich regions. 1–10% of genome, covers most CpG islands. Much cheaper than WGBS.

Methylation Arrays

Illumina EPIC array (850K CpGs). Cheapest per-sample. Good for clinical studies. Limited to pre-selected CpGs. EPIC v2 covers ~930K sites.
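The quantity every bisulfite caller reports per CpG is this conversion logic turned into a fraction. A sketch (read counts are hypothetical):

```python
def methylation_level(c_reads, t_reads):
    """Fraction methylated at one CpG after bisulfite conversion:
    C reads = protected (methylated), T reads = converted (unmethylated)."""
    covered = c_reads + t_reads
    return c_reads / covered if covered else float("nan")

methylation_level(18, 2)   # 0.9  -> heavily methylated CpG
methylation_level(1, 19)   # 0.05 -> essentially unmethylated
```

Coverage matters here: a CpG covered by only two or three reads gives an unreliable fraction, which is one reason WGBS needs 10-15x depth.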

# ─── WGBS analysis with Bismark ───
mamba install -c bioconda bismark bowtie2 samtools trim-galore

# 1. Prepare bisulfite genome (creates C→T and G→A converted versions)
bismark_genome_preparation --bowtie2 genome_folder/

# 2. Trim adapters + low quality (add --rrbs only for RRBS libraries)
trim_galore --paired read_R1.fastq.gz read_R2.fastq.gz

# 3. Align with Bismark
bismark --bowtie2 --genome genome_folder/ \
  -1 read_R1_val_1.fq.gz -2 read_R2_val_2.fq.gz -p 4

# 4. Deduplicate (WGBS only — skip this step for RRBS)
deduplicate_bismark --bam read_R1_val_1_bismark_bt2_pe.bam

# 5. Extract methylation (outputs CpG, CHG, CHH context)
bismark_methylation_extractor --paired-end --comprehensive \
  --CX --bedGraph --genome_folder genome_folder/ \
  read_R1_val_1_bismark_bt2_pe.deduplicated.bam

# 6. Generate summary report
bismark2report
💡
Interpreting methylation: CpG island methylation at promoters = gene silencing. Gene body methylation = correlated with active transcription (paradoxically). Enhancer methylation = generally low at active enhancers. In cancer, look for promoter hypermethylation of tumour suppressors and global hypomethylation (genomic instability). DMRs (Differentially Methylated Regions) are typically more biologically meaningful than individual CpG changes — use tools like dmrseq or DSS for DMR calling.
WGBS = gold standard · Bismark for alignment · CpG islands at promoters · DMRs > individual CpGs · Promoter methylation = silencing
Module 12 · Genome Assembly & Annotation

Building Genomes from Scratch

De novo genome assembly reconstructs complete chromosome sequences from millions of short or long sequencing reads — like assembling a jigsaw puzzle without the picture on the box. With the emergence of long-read technologies, telomere-to-telomere assemblies are now achievable for many organisms.

12.1 De Novo Assembly — Short & Long Reads

Genome assembly combines overlapping reads into longer contiguous sequences (contigs), then scaffolds contigs into chromosome-scale sequences. The approach depends fundamentally on read length: short reads (150 bp Illumina) cannot span repetitive elements, producing fragmented assemblies. Long reads (10–100 kb from PacBio HiFi or Oxford Nanopore) span most repeats, enabling near-complete assemblies.

Assembly Strategies

Short-read (de Bruijn graph)

Chops reads into k-mers, builds a graph where k-mers are nodes and overlaps are edges. Traverses the graph to produce contigs. Tools: SPAdes (small genomes, metagenomes), MEGAHIT (metagenomes). Good for bacteria, not for complex eukaryotes.

Long-read (overlap-layout-consensus)

Finds overlaps between all read pairs, constructs a layout graph, derives consensus sequence. Tools: hifiasm (PacBio HiFi — best), Flye (Nanopore/PacBio CLR). Produces chromosome-scale contigs for most genomes.

# ─── Bacterial genome (short reads, SPAdes) ───
mamba install -c bioconda spades quast
spades.py -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz \
  -o spades_output/ -t 16 --careful

# ─── Eukaryotic genome (PacBio HiFi, hifiasm) ───
mamba install -c bioconda hifiasm

# Haplotype-resolved assembly (diploid)
hifiasm -o asm -t 32 hifi_reads.fastq.gz

# Convert GFA to FASTA
awk '/^S/{print ">"$2"\n"$3}' asm.bp.p_ctg.gfa | fold > primary_assembly.fa

# ─── Nanopore (Flye) ───
mamba install -c bioconda flye
flye --nano-hq nanopore_reads.fastq.gz \
  --out-dir flye_output/ --threads 32 --genome-size 500m

# Polish Nanopore assembly with short reads (Illumina)
mamba install -c bioconda pilon
pilon --genome flye_output/assembly.fasta \
  --frags illumina_aligned.bam --output polished
📌
Assembly quality metrics: N50 = the contig length such that 50% of the assembly is in contigs of this size or larger. Higher N50 = more contiguous. L50 = how many contigs make up 50% of the assembly. Lower L50 = better. BUSCO = completeness assessment based on universal single-copy orthologs (aim for >95% complete). Always report all three.
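The two contiguity metrics fall out of one pass over sorted contig lengths. A sketch with an invented seven-contig assembly:

```python
def n50_l50(contig_lengths):
    """N50: smallest contig in the minimal set covering half the assembly;
    L50: how many contigs that set contains."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for i, length in enumerate(lengths, start=1):
        running += length
        if running >= half:
            return length, i

# Total 300 bp; the two largest contigs (80 + 70) reach half the assembly:
n50_l50([80, 70, 50, 40, 30, 20, 10])   # (70, 2)
```

Note that N50 says nothing about correctness: a misassembled genome can have a superb N50, which is why BUSCO (and Merqury) are reported alongside it.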
hifiasm for HiFi reads · SPAdes for bacteria · N50 measures contiguity · BUSCO measures completeness · Polish Nanopore with Illumina
12.2 Assembly QC — BUSCO & QUAST

An assembly is only useful if it is correct and complete. Quality assessment uses two complementary approaches: contiguity statistics (N50, total length, number of contigs) measured by QUAST, and gene completeness measured by BUSCO (Benchmarking Universal Single-Copy Orthologs), which checks whether expected genes are present and intact.

# ─── QUAST — contiguity statistics ───
mamba install -c bioconda quast
quast assembly.fa -o quast_output/ -t 8

# If you have a reference genome for comparison:
quast assembly.fa -r reference.fa -g genes.gff -o quast_output/ -t 8
# Key outputs: N50, L50, total length, # contigs, GC%, misassemblies

# ─── BUSCO — gene completeness ───
mamba install -c bioconda busco

# Choose the right lineage dataset for your organism:
busco --list-datasets   # Shows all available lineages

# Run BUSCO
busco -i assembly.fa -o busco_output -m genome \
  -l mammalia_odb10 -c 16

# Interpretation:
#   C: >95%   = excellent assembly
#   C: 90-95% = good
#   C: <80%   = significant genes missing (check lineage choice)
#   D: >5%    = possible contamination or collapsed haplotypes
💡
Merqury is a reference-free assembly evaluation tool that uses k-mer counts from raw reads to assess completeness, accuracy, and phasing of haplotype-resolved assemblies. It is now the gold standard for HiFi assemblies.
BUSCO >95% = excellent · QUAST for contiguity stats · Merqury for k-mer QC · Choose correct BUSCO lineage
12.3 Gene Prediction & Genome Annotation

After assembling a genome, the next step is annotation — finding where the genes are and what they encode. Gene prediction uses ab initio methods (statistical models trained on known gene structures), homology-based methods (aligning known proteins or transcripts), and evidence-based methods (using RNA-seq data from the same organism). Modern pipelines combine all three approaches.

Annotation Pipeline

# ─── Prokaryotic annotation (fast, automated) ───
mamba install -c bioconda prokka
prokka assembly.fa --outdir annotation/ --prefix my_organism \
  --kingdom Bacteria --genus Streptomyces --cpus 16
# Output: .gff (gene coordinates), .faa (protein sequences),
#         .ffn (gene nucleotide sequences), .gbk (GenBank format)

# ─── Eukaryotic annotation (more complex) ───
# Step 1: Mask repeats (RepeatMasker)
mamba install -c bioconda repeatmasker
RepeatMasker -species "homo sapiens" -pa 16 -xsmall assembly.fa

# Step 2: Align RNA-seq evidence (HISAT2 → StringTie)
hisat2 -x genome_index -1 rnaseq_R1.fq.gz -2 rnaseq_R2.fq.gz -p 16 | \
  samtools sort -O bam -o rnaseq_aligned.bam
stringtie rnaseq_aligned.bam -o transcripts.gtf

# Step 3: Run BRAKER2/3 (combines ab initio + RNA-seq + protein evidence)
# BRAKER3 (recommended — integrates GeneMark + AUGUSTUS + protein hints)
braker.pl --genome=assembly.fa.masked \
  --bam=rnaseq_aligned.bam \
  --prot_seq=proteins_from_related_species.fa \
  --softmasking --threads 16

# Step 4: Functional annotation (assign GO terms, domains, pathways)
mamba install -c bioconda interproscan
interproscan.sh -i predicted_proteins.faa -f tsv -dp -goterms -pa
⚠️
Repeat masking is essential: Eukaryotic genomes can be 40–80% repetitive. Without masking, gene predictors will predict thousands of spurious genes in transposable elements. Always run RepeatMasker before gene prediction. Use soft-masking (lowercase, not N-replacement) so gene predictors can still use the sequence information.
Prokka for bacteria · BRAKER3 for eukaryotes · RepeatMasker before gene prediction · RNA-seq evidence improves accuracy · InterProScan for functional annotation
Module 13 · Population Genetics & GWAS

Genetic Variation in Populations

Population genetics studies how allele frequencies change within and between populations over time. Genome-wide association studies (GWAS) connect genetic variants to phenotypes and diseases, forming the basis of precision medicine.

13.1 Population Structure & Hardy-Weinberg Equilibrium

Hardy-Weinberg Equilibrium (HWE) describes the expected genotype frequencies in a population with no evolutionary forces acting. For a variant with allele frequencies p (reference) and q (alternative), expected genotype frequencies are: p² (homozygous ref), 2pq (heterozygous), q² (homozygous alt). Deviations from HWE can indicate selection, population stratification, genotyping errors, or inbreeding.
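The HWE check above is just a goodness-of-fit test against the p², 2pq, q² expectations. A minimal Python sketch, using hypothetical genotype counts (the numbers are illustrative, not from any real cohort):

```python
# Hypothetical genotype counts for one biallelic variant (1000 people)
n_AA, n_Aa, n_aa = 298, 489, 213
n = n_AA + n_Aa + n_aa

# Allele frequencies from genotype counts
p = (2 * n_AA + n_Aa) / (2 * n)          # reference allele
q = 1 - p                                # alternative allele

# Expected counts under HWE: p^2*n, 2pq*n, q^2*n
expected = (p * p * n, 2 * p * q * n, q * q * n)
observed = (n_AA, n_Aa, n_aa)

# Chi-square goodness-of-fit statistic, 1 degree of freedom;
# chi_sq > 3.84 would reject HWE at alpha = 0.05
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(f"p = {p:.4f}, chi-square = {chi_sq:.3f}")
```

With these counts the statistic stays well below 3.84, so the variant is consistent with HWE — exactly the behaviour the `--hwe 1e-6` filters in the PLINK commands below screen for, just with a much stricter threshold.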

Key Population Genetics Concepts

Allele Frequency

The proportion of a specific allele in a population. If 80 of 200 chromosomes carry the T allele, frequency = 0.40. Minor allele frequency (MAF) < 0.01 = rare variant.

Genetic Drift

Random fluctuation of allele frequencies due to finite population size. Stronger in small populations. Founder effect and bottleneck are examples of extreme drift.

FST (Fixation Index)

Measures genetic differentiation between populations. FST = 0 means no differentiation; FST = 1 means populations are fixed for different alleles. Human populations: FST ≈ 0.12 globally.

Linkage Disequilibrium (LD)

Non-random association between alleles at different loci. Nearby variants are in LD (inherited together). Measured by r². LD decays with physical distance due to recombination.
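Each of the four concepts above reduces to simple arithmetic on counts and frequencies. A self-contained Python sketch — all counts and frequencies here are hypothetical, chosen only to illustrate the formulas:

```python
import random

# 1) Allele frequency: 80 of 200 chromosomes carry the T allele
freq_T = 80 / 200                         # 0.40

# 2) Genetic drift (Wright-Fisher sketch): each generation, the 2N
#    chromosomes resample alleles from the current frequency; smaller
#    N means larger random swings (N, generations, seed are arbitrary)
def drift(p, N, generations, seed=1):
    random.seed(seed)
    for _ in range(generations):
        p = sum(random.random() < p for _ in range(2 * N)) / (2 * N)
        if p in (0.0, 1.0):               # fixation or loss
            break
    return p

# 3) Wright's FST = (HT - HS) / HT from alt-allele frequencies
#    in two equal-sized populations
p1, p2 = 0.7, 0.3
p_bar = (p1 + p2) / 2
HT = 2 * p_bar * (1 - p_bar)              # total expected heterozygosity
HS = (2*p1*(1-p1) + 2*p2*(1-p2)) / 2      # mean within-population
fst = (HT - HS) / HT                      # 0.16 for these frequencies

# 4) LD (r^2) from hypothetical haplotype counts at loci A/a and B/b
hap = {"AB": 500, "Ab": 100, "aB": 100, "ab": 300}
n = sum(hap.values())
pA = (hap["AB"] + hap["Ab"]) / n
pB = (hap["AB"] + hap["aB"]) / n
D = hap["AB"] / n - pA * pB               # disequilibrium coefficient
r2 = D**2 / (pA * (1 - pA) * pB * (1 - pB))

print(f"freq_T={freq_T}, FST={fst:.2f}, r2={r2:.3f}")
```

The same quantities are what PLINK computes at scale in the commands below (`--freq`-style summaries, `--indep-pairwise` r² pruning).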

# ─── Population structure analysis with PLINK ───
mamba install -c bioconda plink2

# 1. Basic QC filtering
plink2 --vcf cohort.vcf.gz --maf 0.01 --geno 0.05 --hwe 1e-6 \
  --mind 0.1 --make-pgen --out filtered

# 2. LD pruning (remove correlated variants for PCA)
plink2 --pfile filtered --indep-pairwise 50 5 0.2 --out pruned
plink2 --pfile filtered --extract pruned.prune.in --make-pgen --out pruned_data

# 3. PCA for population structure
plink2 --pfile pruned_data --pca 20 --out pca_results

# 4. Plot PC1 vs PC2 in R:
#    pca <- read.table("pca_results.eigenvec", header=TRUE)
#    plot(pca$PC1, pca$PC2, col=as.factor(pca$Population))

# 5. ADMIXTURE analysis (ancestry proportions)
mamba install -c bioconda admixture
# Convert to BED format first, then:
for K in 2 3 4 5; do
  admixture pruned_data.bed $K --cv -j16
done
📌
Population stratification in GWAS: If cases and controls come from different ancestral backgrounds, allele frequency differences due to ancestry (not disease) will create false associations. This is why GWAS always includes PCA components as covariates — to correct for population structure. Typically the top 10–20 PCs are included.
HWE test for QC · PCA reveals population structure · PLINK2 for genetic analysis · FST measures differentiation · LD decays with distance
13.2 GWAS — Genome-Wide Association Studies

GWAS tests each genetic variant across the genome for association with a phenotype (disease, height, drug response, etc.). It is a hypothesis-free approach — no prior knowledge of which genes are involved is needed. Since millions of variants are tested simultaneously, very stringent significance thresholds are required (p < 5×10⁻⁸ for genome-wide significance). GWAS results are visualised as Manhattan plots.
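The 5×10⁻⁸ threshold is not arbitrary: it is a Bonferroni correction of the conventional α = 0.05 for roughly one million independent common-variant tests (a widely used estimate for European-ancestry genomes, not an exact count). A two-line sketch:

```python
# Bonferroni: divide the family-wise alpha by the (estimated) number
# of independent tests across the genome
alpha, n_tests = 0.05, 1_000_000
threshold = alpha / n_tests
print(threshold)    # ~5e-08, the genome-wide significance line
```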

GWAS Pipeline

# ─── GWAS with PLINK2 ───
# Assumes: genotypes in VCF/PGEN, phenotype file (FID IID PHENO)

# 1. Sample & variant QC
#    --maf 0.01 : remove rare variants (MAF < 1%)
#    --geno 0.02: remove variants missing in >2% of samples
#    --mind 0.05: remove samples missing >5% of variants
#    --hwe 1e-6 : remove variants violating HWE
plink2 --vcf genotypes.vcf.gz \
  --maf 0.01 --geno 0.02 --mind 0.05 --hwe 1e-6 \
  --make-pgen --out qc_data

# 2. Calculate PCs for population stratification
plink2 --pfile qc_data --indep-pairwise 50 5 0.2 --out prune
plink2 --pfile qc_data --extract prune.prune.in --pca 20 --out pcs

# 3. Run association (logistic for case/control, linear for quantitative)
plink2 --pfile qc_data \
  --pheno pheno.txt --pheno-name Disease \
  --covar pcs.eigenvec --covar-col-nums 3-12 \
  --glm --out gwas_results

# 4. Genome-wide significance threshold: p < 5e-8
#    Suggestive significance: p < 1e-5

# 5. Manhattan plot in R:
#    library(qqman)
#    manhattan(gwas, chr="CHROM", bp="POS", p="P", snp="ID")

Interpreting GWAS Results

⚠️
GWAS finds associations, not causation. A significant SNP is usually a tag SNP in LD with the true causal variant — not the causal variant itself. Fine-mapping (credible set analysis, SuSiE) is needed to narrow down the causal variant. Functional validation (CRISPR, reporter assays) is needed to prove causality. Most GWAS hits are in non-coding regions (enhancers, promoters), not in protein-coding exons.
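Fine-mapping output is often summarised as a 95% credible set: rank variants by posterior probability and take the smallest set whose cumulative probability reaches 0.95. A toy sketch with made-up per-variant Bayes factors (real analyses use SuSiE or FINEMAP, which model LD properly):

```python
# Hypothetical approximate Bayes factors for variants at one locus
bf = {"rs1": 120.0, "rs2": 45.0, "rs3": 8.0, "rs4": 1.2, "rs5": 0.9}

# Normalise to per-variant posterior inclusion probabilities
# (assumes exactly one causal variant at the locus)
total = sum(bf.values())
pip = {snp: b / total for snp, b in bf.items()}

# Greedily accumulate the 95% credible set
credible, cum = [], 0.0
for snp, prob in sorted(pip.items(), key=lambda kv: -kv[1]):
    credible.append(snp)
    cum += prob
    if cum >= 0.95:
        break

print(credible)     # the set of variants most likely to contain the causal one
```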
💡
Post-GWAS analysis: After finding significant loci, perform: (1) Fine-mapping — identify the most likely causal variant. (2) Colocalization (coloc, eCAVIAR) — test if a GWAS signal and an eQTL signal share the same causal variant. (3) Gene mapping — assign GWAS hits to likely target genes using eQTL data, chromatin interaction (Hi-C), and distance. (4) Pathway enrichment — test if GWAS hits are enriched in specific biological pathways. (5) Polygenic Risk Scores — combine effects of all variants to predict individual risk.
p < 5e-8 for genome-wide significance · Always correct for population structure · GWAS = association, not causation · Fine-mapping for causal variants · Manhattan plot for visualisation
Module 14 · Systems Biology & Network Analysis

Networks, Interactions & Integrative Analysis

Systems biology views biological processes as interconnected networks rather than isolated pathways. By integrating multi-omics data — transcriptomics, proteomics, metabolomics — with interaction networks, we can understand how molecular changes propagate through biological systems.

14.1 Protein-Protein Interaction Networks

Protein-Protein Interaction (PPI) networks represent physical or functional associations between proteins. Nodes are proteins, edges represent interactions. Highly connected proteins (hubs) tend to be essential. PPI networks help identify functional modules, predict protein function, and prioritise drug targets.

Key Resources & Tools

STRING (string-db.org)

Largest PPI database. Integrates experimental data, co-expression, text mining, and genomic context. Confidence scores range 0–1 (scaled to 0–1000 in downloaded files). Use score > 0.7 for high-confidence interactions.

Cytoscape

Desktop software for network visualisation and analysis. Supports plugins (clusterMaker, MCODE for module detection, CytoHubba for hub identification). Essential for publication figures.

BioGRID

Curated repository of experimentally validated interactions (protein, genetic, chemical). Over 2 million interactions. Higher confidence than computationally predicted interactions.

Network Analysis Concepts

Degree

Number of connections a node has. High-degree nodes (hubs) are often essential genes. Degree distribution follows a power law in biological networks (scale-free property).

Betweenness Centrality

How often a node lies on shortest paths between other nodes. Bottleneck proteins control information flow through the network — they are often good drug targets.

Clustering Coefficient

How densely connected a node's neighbours are. Nodes in protein complexes have high clustering. Used to detect functional modules (MCODE algorithm).

Modules/Communities

Densely connected subnetworks often correspond to protein complexes or functional pathways. Detect with MCODE, Louvain, or Markov clustering (MCL).

# ─── Network analysis in Python with NetworkX ───
# Install first: pip install networkx pandas matplotlib

# Build network from STRING export
import networkx as nx
import pandas as pd

edges = pd.read_csv("string_interactions.tsv", sep="\t")
edges = edges[edges["combined_score"] > 700]     # High confidence
G = nx.from_pandas_edgelist(edges, "protein1", "protein2",
                            edge_attr="combined_score")

# Key network metrics
print(f"Nodes: {G.number_of_nodes()}, Edges: {G.number_of_edges()}")
print(f"Density: {nx.density(G):.4f}")
print(f"Components: {nx.number_connected_components(G)}")

# Find hub genes (top 10 by degree)
degree = dict(G.degree())
hubs = sorted(degree, key=degree.get, reverse=True)[:10]
print("Hub genes:", hubs)

# Betweenness centrality (bottleneck nodes)
betw = nx.betweenness_centrality(G)
bottlenecks = sorted(betw, key=betw.get, reverse=True)[:10]

# Community detection (Louvain)
from networkx.algorithms.community import louvain_communities
communities = louvain_communities(G, resolution=1.0)
print(f"Found {len(communities)} communities")
💡
DEG-PPI integration: A powerful approach is to overlay your differentially expressed genes onto a PPI network. This reveals which functional modules are disrupted in your condition. Steps: (1) Get DEGs from RNA-seq. (2) Query STRING with DEG list. (3) Import network into Cytoscape. (4) Colour nodes by log2FC (red = up, blue = down). (5) Run MCODE to find activated/repressed modules. (6) Run pathway enrichment on each module separately.
STRING for PPI data · Cytoscape for visualisation · Hub genes are often essential · MCODE for module detection · Overlay DEGs on PPI networks
14.2 Gene Regulatory Networks & Multi-Omics Integration

Gene Regulatory Networks (GRNs) describe how transcription factors (TFs) control the expression of target genes. Unlike PPI networks which show physical contacts, GRNs show regulatory relationships — which TFs activate or repress which genes. Multi-omics integration combines transcriptomics, proteomics, metabolomics, and epigenomics data to build comprehensive models of cellular regulation.

GRN Inference Methods

SCENIC / pySCENIC

Infers GRNs from single-cell RNA-seq data. Uses GRNBoost2 (gradient boosting) for co-expression inference, then validates with TF-binding motif enrichment (RcisTarget). Outputs regulon activity scores per cell.

Best for scRNA-seq
WGCNA

Weighted Gene Co-expression Network Analysis. Groups genes into co-expression modules, correlates modules with clinical traits. Excellent for bulk RNA-seq. Widely used, well-cited. R package.

ARACNe / VIPER

Mutual information-based network inference. VIPER then infers TF activity from target gene expression. Widely used in cancer biology for identifying master regulators.

CellOracle

Combines scRNA-seq + scATAC-seq to build context-specific GRNs. Simulates TF perturbations in silico. Predicts effects of gene knockouts on cell fate.

Multi-Omics Integration Strategies

Integrating multiple data types reveals biology invisible to any single assay:

Correlation-based

Correlate mRNA ↔ protein levels, mRNA ↔ metabolite levels. Discordant pairs (high mRNA, low protein) suggest post-transcriptional regulation. Tools: mixOmics (R), MOFA+.

Pathway-based

Map all omics data onto KEGG/Reactome pathways. Identify pathways with consistent changes across data types. Tools: Paintomics, iPath, OmicsNet.

Factor analysis (MOFA+)

Multi-Omics Factor Analysis decomposes multiple data matrices into shared latent factors. Identifies which sources of variation are shared vs unique to each data type.

Network-based

Construct multilayer networks connecting genes, proteins, metabolites. Network propagation algorithms (Random Walk with Restart) identify key regulators driving multi-level changes.
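The correlation-based strategy above is straightforward to prototype. A minimal Python sketch of a Spearman (rank) correlation between matched mRNA and protein measurements — the five values are hypothetical, and real integrative analyses would use mixOmics or MOFA+ on full matrices:

```python
import math

def rank(xs):
    """Assign ranks 1..n (assumes no ties)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    for rnk, i in enumerate(order, 1):
        ranks[i] = rnk
    return ranks

def spearman(x, y):
    """Spearman rho = Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = math.sqrt(sum((a - mx) ** 2 for a in rx)
                    * sum((b - my) ** 2 for b in ry))
    return num / den

# Hypothetical matched measurements for five genes
mrna    = [8.1, 5.2, 9.7, 3.3, 6.8]   # log2 expression
protein = [7.4, 4.9, 9.9, 2.1, 8.0]   # log2 abundance

print(f"rho = {spearman(mrna, protein):.2f}")
```

A high rho indicates concordant mRNA and protein levels; genes that break the trend (high mRNA, low protein) are the candidates for post-transcriptional regulation mentioned above.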

# ─── WGCNA in R (bulk RNA-seq co-expression networks) ───
library(WGCNA)

# Prepare normalised expression matrix (samples × genes — WGCNA expects
# samples as rows). Use VST-transformed counts from DESeq2.

# Choose soft-thresholding power (aim for scale-free topology R² > 0.8)
powers <- c(1:20)
sft <- pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)
# Plot: sft$fitIndices[,1] vs -sign(sft$fitIndices[,3]) * sft$fitIndices[,2]

# Build network and identify modules
net <- blockwiseModules(datExpr, power = 6, TOMType = "unsigned",
                        minModuleSize = 30, reassignThreshold = 0,
                        mergeCutHeight = 0.25, numericLabels = TRUE,
                        saveTOMs = TRUE)

# Correlate modules with clinical traits
moduleTraitCor <- cor(net$MEs, traits, use = "p")
moduleTraitPvalue <- corPvalueStudent(moduleTraitCor, nSamples)

# Find hub genes in a module of interest
module_genes <- names(datExpr)[net$colors == module_number]
kME <- signedKME(datExpr, net$MEs)
hub_genes <- module_genes[order(kME[module_genes, paste0("kME", module_number)],
                                decreasing = TRUE)[1:20]]
📌
The future is multi-modal single-cell: Technologies like CITE-seq (RNA + surface protein), Multiome (RNA + ATAC), SHARE-seq, and spatial transcriptomics with paired proteomics are enabling true multi-omics integration at single-cell resolution. Tools like MOFA+, WNN (Seurat v4), and scVI/totalVI are designed for these integrated analyses.
WGCNA for bulk co-expression · SCENIC for scRNA-seq GRNs · MOFA+ for multi-omics · Always validate key findings · Multi-modal single-cell is the future
🔐

Register for RNAflow

Fill in your details to receive a 50-day access code via email. Your information helps us improve RNAflow and keep you updated with new features.

🔒 Your information is private. We will never share your data with third parties.

Contact

Get in touch

Questions about a tool, a tutorial idea, a bug report, or want to collaborate? We'd love to hear from you.

Contact information

✉️
🐦
Twitter / X
📍
Location
Global project — contributors worldwide

Follow BioInfoCodex

Frequently asked questions

Quick answers before you send a message.

Is BioInfoCodex really free forever?
BioInfoCodex tools and tutorials are currently available to everyone. We are committed to supporting the research community; there are no subscriptions or paywalled features at this time.
Does my data get uploaded anywhere?
No. All computation happens on your own computer. The HTML app and Python server communicate only via localhost (127.0.0.1) — your data never leaves your machine. There is no server, no cloud, no analytics on your files.
Can I contribute a tool or tutorial?
Absolutely. We welcome contributions from researchers and developers. Open a pull request on GitHub, or send us an email describing what you'd like to add. All contributors are credited.
I found a bug — how do I report it?
The best way is to open a GitHub issue with a description of what you did, what you expected, and what happened (include any error messages). You can also use the contact form and we'll log it.
Can I cite BioInfoCodex in my paper?
We are preparing a manuscript for submission. In the meantime, you can cite the GitHub repository URL and version number. We'll post the citation format as soon as the preprint is on bioRxiv.

Send a message

Bug reports, feature requests, collaboration proposals, or just to say hello — all welcome.

Message sent!
We'll get back to you within 48 hours.