🧬 Trusted by researchers worldwide

The Bioinformatics Platform You Need

Guided pipelines for every major genomics workflow. Every step explained. Every result interpreted. Runs on your computer — your data never leaves your machine.

15+
Organisms
40+
Gene families
10
Pipeline steps
100%
Free forever
Powered by
FastQC
HISAT2
fastp
featureCounts
DESeq2
samtools
ggplot2
MultiQC
Ensembl
🧬
15+
Built-in organisms with pre-loaded Ensembl genome URLs
🔬
40+
Gene families across 22 categories — TF, kinases, hormones and more
🖥️
7
Compute environments — local, SLURM, PBS, LSF, AWS, GCP, Azure
💸
$0
No account required, runs offline on your own machine
Featured tool

RNAflow — the complete
RNA-seq pipeline

From raw FASTQ files to publication-ready figures in one guided workflow. Every command explained. Every result interpreted. No bioinformatics PhD required.

One-click execution Click ▶ Run on any command block — it executes live in your terminal with real-time streaming output.
🌍
15+ organisms built-in Human, mouse, Arabidopsis, rice, wheat, maize, soybean, tomato, zebrafish and more — genome URLs pre-loaded from Ensembl.
🏢
HPC & cloud ready Generates SLURM, PBS, and LSF job scripts. Wraps commands with AWS S3, Google Cloud, and Azure sync operations for large-scale data.
📊
Results fully explained Every plot and every number is explained in plain English — volcano plots, heatmaps, PCA, DESeq2 tables and fold-change values.
RNAflow_App.html — analysis progress
1 Configure project
2 Download data from NCBI SRA
3 FastQC quality control
4 fastp adapter trimming
5 Genome download + HISAT2 index
6 HISAT2 alignment ⚡ running
7 featureCounts gene counting
8 Gene family analysis
9 DESeq2 differential expression
10 Volcano + heatmap + PCA
# Aligning sample SRR8490157
hisat2 -x rice_index -U trimmed.fastq \
  -S aligned/SRR8490157.sam -p 8
94.2% overall alignment rate ✓
How it works

10 steps from raw data to discovery

Each step generates ready-to-run commands with full explanations. Follow in order, or jump to any step — the app remembers where you left off.

1 ⚙️ Configure
2 ⬇️ Download data
3 🔍 Quality control
4 ✂️ Trimming
5 🧬 Genome & index
6 📍 Alignment
7 🔢 Gene counting
8 🧬 Gene family
9 📊 DESeq2
10 🌋 Plots & results
Why BioInfoCodex

Built for real researchers,
not just bioinformaticians

🔒
Your data stays yours
Every tool runs entirely on your own computer. No cloud upload, no account, no server. Your raw data, intermediate files, and results never leave your machine.
📖
Everything explained
Every command has a plain-English explanation. Every result — volcano plots, heatmaps, alignment rates, DESeq2 tables — is interpreted so you understand what it means biologically.
From zero to results
No bioinformatics PhD required. Open the HTML file, enter your SRA accessions, and follow the guided steps. The app generates every command and walks you through each one.
🖥️
HPC and cloud support
Working with large genomes on a supercomputer? Generates SLURM, PBS, and LSF job scripts automatically. Data on AWS, GCP, or Azure? Commands include the sync steps.
🌍
15+ organisms ready
Human, mouse, Arabidopsis, rice, wheat, maize, soybean, tomato, zebrafish, Drosophila, C. elegans, yeast and more — Ensembl genome and annotation URLs pre-loaded.
💸
Built for researchers
BioInfoCodex is MIT-licensed and will always be free. No subscription, no feature paywalls, no tracking. Download and use it indefinitely, even without internet access.

Start your first RNA-seq analysis today

Register to receive your RNAflow access code. Get 50 days of full access to the complete RNA-seq analysis pipeline.

Software

Bioinfo Tools

Free, guided pipelines for every major bioinformatics workflow. Each tool has its own complete manual — click any tool to read the documentation, download, and get started.

✦ Available now
🧬
RNAflow
v1.0 · Released 2025
Live

Complete RNA-seq analysis pipeline — from raw FASTQ files to differential expression, volcano plots, heatmaps, PCA and gene family analysis. Works on 15+ organisms including all major crop plants.

RNA-seq · DESeq2 · HISAT2 · FastQC · fastp · featureCounts · ggplot2 · 15 organisms
🍎 macOS 🐧 Linux 🪟 Windows WSL2 Full manual →
🧬
GeneNet Lab
v1.0.0 · Released 2026
Live

Full-featured lab management platform with team access control, seat-based licensing, OneDrive storage, and admin panel. Install on your computer and connect with a LAB-XXXXX code.

Lab Management · Team Access Control · Seat Licensing · OneDrive Storage · Admin Panel · Email Invites
🍎 macOS 🐧 Linux 🪟 Windows Full details →
In development
🔬
NGSuite
Expected Q3 2025
Soon

Whole-genome and exome sequencing pipeline. Variant calling with GATK, SNP/INDEL annotation, structural variant detection, population genetics.

WGS / WES · GATK4 · BWA-MEM2 · VEP
PlasmidMap
Expected Q4 2025
Soon

Plasmid design and annotation. Import GenBank files, identify restriction sites, annotate features, export circular maps for publication.

Plasmid design · GenBank · Restriction sites · Circular maps
🏔️
ChIPflow
Expected 2026
Soon

ChIP-seq and ATAC-seq pipeline. Peak calling with MACS2, motif analysis with HOMER, differential chromatin accessibility, BigWig tracks.

ChIP-seq · ATAC-seq · MACS2 · HOMER
🌿
MetaFlow
Expected 2026
Soon

Metagenomics and 16S rRNA microbiome pipeline. Taxonomic classification, alpha/beta diversity, differential abundance with DESeq2.

Metagenomics · 16S rRNA · QIIME2 · Kraken2
🔭
scRNAflow
Expected 2026
Soon

Single-cell RNA-seq pipeline using Seurat. Cell clustering, UMAP visualisation, marker gene identification, trajectory analysis.

scRNA-seq · Seurat · UMAP · Cell typing
⚗️
ProteomicsLab
Expected 2026
Soon

Mass spectrometry proteomics. Label-free quantification, TMT/iTRAQ, PTM analysis, pathway enrichment, volcano plots for protein data.

Proteomics · MaxQuant · LFQ · TMT
RNA-seq Analysis Software

RNAflow v1.0

Complete RNA-seq analysis pipeline — from raw sequencing files to publication-ready figures. Works on 15+ organisms with a full manual explaining every step and every result.

Quick download
50-day access · Runs locally
Contents
📋 Overview
💻 Requirements
⚙️ Installation
Pipeline Manual
1 · Configure project
2 · Download data
3 · Quality control
4 · Trimming
5 · Genome & index
6 · Alignment
7 · Gene counting
8 · Gene family analysis
9 · DESeq2 analysis
10 · Plots & results
Reference
📊 Interpreting results
🌍 Supported organisms
❓ Troubleshooting
What is RNAflow?

RNAflow is a guided RNA-seq analysis application. Open it in any browser, enter your experiment details, and it generates every command you need — with explanations of what each command does and what good results look like. You can run commands automatically with one click, or copy them to your terminal manually.

10
Analysis steps
15+
Organisms built-in
40+
Gene families
3
Operating systems

What does RNA-seq measure?

RNA sequencing measures gene expression — which genes are active in a cell and by how much. Every cell contains the same DNA, but different cells (or the same cell under different conditions) activate different genes. RNA-seq captures a snapshot of this activity by sequencing the RNA molecules present at a specific moment.

🔬
Think of it this way: DNA is the instruction manual for your entire body. RNA is the specific page that is currently being read. RNA-seq tells you which pages are being read, in which cells, and how intensively.

What biological questions can RNAflow answer?

Which genes change between conditions?

Compare wildtype vs mutant, treated vs untreated, healthy vs diseased. RNAflow identifies statistically significant changes.

How much does each gene change?

Fold change values show whether genes double, halve, or show more dramatic changes in expression.

Which biological pathways are affected?

Gene family analysis groups DE genes into pathways — transcription factors, kinases, hormone signalling and more.

How similar are my samples to each other?

PCA and heatmaps reveal whether replicates cluster correctly and how distinct your conditions are.

The complete RNAflow pipeline

Raw FASTQ
FastQC / QC
fastp Trim
HISAT2 Align
featureCounts
DESeq2
Figures + Results
System requirements

RNAflow runs on your own computer. You need a Conda environment with bioinformatics tools and R for the statistical analysis steps.

💻
Operating system
  • 🍎 macOS 10.15 or newer
  • 🐧 Linux — Ubuntu, Fedora, Debian, Arch
  • 🪟 Windows via WSL2 (Ubuntu recommended)
🔧
Required software
  • Python 3.6+ (for the local server)
  • Conda / Miniforge (to install tools)
  • R 4.0+ with DESeq2 and ggplot2
  • RStudio (recommended for R steps)
  • Any modern web browser (Chrome, Firefox, Safari, Edge)
Hardware (minimum)
  • RAM: 8 GB minimum, 16 GB recommended
  • Storage: 50 GB free space per experiment
  • CPU: any modern processor (more cores = faster)
🌐
Internet
  • Required only for: downloading data (SRA), downloading genome files
  • Analysis itself runs completely offline
  • Large genomes (human, wheat) need fast connection — files can be 10–30 GB
💡
Large genome note: For human, mouse, wheat, and maize, genome files are 3–30 GB. Use the app on a computer with a fast internet connection for the download steps.
Installation

RNAflow is an HTML file — no installation wizard needed. Download it, download the server file, and open it in your browser. The app guides you through installing all the bioinformatics tools.

1 · Download RNAflow

Download both files and save them to the same folder (e.g. your Desktop):

RNAflow v1.0

Two files — keep them in the same folder

Open RNAflow_App.html in your browser — you will see the BioInfoCodex RNAflow interface immediately.

2 · Install Conda (Miniforge) — if not already installed

Conda is a package manager that installs bioinformatics tools. If you already have Anaconda or Miniconda, skip this step. The app's System Check will detect it automatically.

🌐
Download Miniforge from: github.com/conda-forge/miniforge/releases — scroll to the latest release and download the installer for your operating system.
# macOS / Linux — run in Terminal after downloading:
bash Miniforge3-MacOSX-arm64.sh    # macOS Apple Silicon
bash Miniforge3-MacOSX-x86_64.sh   # macOS Intel
bash Miniforge3-Linux-x86_64.sh    # Linux

# Follow the prompts, accept the licence, allow it to initialise conda
# Close and reopen Terminal when finished
⚠️
On Windows: install WSL2 first (search "Turn Windows features on or off" → enable "Windows Subsystem for Linux"), then install Ubuntu from the Microsoft Store. Open Ubuntu, then install Miniforge inside Ubuntu.
3 · Create the rnaseq environment and install tools

RNAflow uses a dedicated Conda environment called rnaseq. This keeps all bioinformatics tools isolated from your other software.

# Create the environment (takes 5-10 minutes)
conda create -n rnaseq python=3.10 -y

# Activate it
conda activate rnaseq

# Install all required bioinformatics tools
conda install -c bioconda -c conda-forge fastqc fastp hisat2 samtools subread sra-tools multiqc -y

This installs: FastQC (quality control), fastp (trimming), HISAT2 (alignment), samtools (BAM processing), featureCounts/subread (gene counting), fasterq-dump/sra-tools (data download), MultiQC (report generation).

4 · Install R and required packages

R and RStudio are used for the DESeq2 statistical analysis and all plots. Download R from r-project.org and RStudio from posit.co/download/rstudio-desktop.

# Open RStudio and run this to install all required R packages:
if (!require("BiocManager")) install.packages("BiocManager")
BiocManager::install("DESeq2")
install.packages(c("ggplot2", "pheatmap", "ggrepel", "dplyr", "RColorBrewer"))
💡
Package installation takes 10–20 minutes the first time. You only need to do this once.
5 · Start the local server (for Run mode)

If you want to use Run mode (commands execute automatically with one click), start the server in a Terminal window before opening the app. Keep this window open while working.

# Navigate to your Desktop (or wherever you saved the files)
cd ~/Desktop

# Start the server
python3 rnaflow_server.py

You will see: RNAflow Local Server — ready · Listening on 127.0.0.1:7788

🔒
The server only listens on 127.0.0.1 (your own computer). It is completely invisible to any other computer on your network or the internet. Your data never leaves your machine.
Step 1 — Configure your project

Before running any commands, you set up your experiment details in the Configure page. RNAflow uses these settings to automatically generate every command throughout the pipeline.

1 · Select your organism

Choose from 15 pre-loaded organisms or enter custom genome URLs. When you select an organism, the genome FASTA and GTF annotation URLs are automatically filled in with the correct Ensembl links.

🐾 Animals

Human, Mouse, Rat, Zebrafish, Drosophila, C. elegans, Yeast

🌱 Plants

Arabidopsis, Rice, Maize, Wheat, Soybean, Tomato, Yarrowia

💡
For organisms not in the list, select Custom and paste your own Ensembl FTP URLs. The manual explains exactly how to find these on ensembl.org and plants.ensembl.org.
2 · Name your conditions and add SRA accessions

Give meaningful names to your two conditions (e.g. wildtype and mutant, or control and drought_treated). Then add SRA accession numbers for each replicate.

🔍
Finding SRA accessions: Go to ncbi.nlm.nih.gov/sra → search your paper title or dataset ID → click the experiment → note the SRR/ERR/DRR number for each sample. These start with SRR, ERR, or DRR.
📋 How many replicates do I need?
DESeq2 requires a minimum of 2 biological replicates per condition, but 3 is strongly recommended and is standard for publication. Biological replicates are separate biological samples (different plants, animals, or cultures) — not the same sample sequenced twice.
3 · Choose read type: single-end or paired-end

This describes how the sequencing library was prepared. If unsure, check the SRA run page — it shows the "Layout" as SINGLE or PAIRED.

Single-end (SE)
Each fragment is sequenced from one end only. Produces one FASTQ file per sample (e.g. SRR123.fastq). Older and cheaper but less information per read.
Paired-end (PE)
Each fragment is sequenced from both ends. Produces two FASTQ files per sample (e.g. SRR123_1.fastq and SRR123_2.fastq). More expensive but more accurate alignment.
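If in doubt, the downloaded file names also tell you the layout. A minimal shell sketch (the accession names and raw_data/ folder here are made-up placeholders, not real datasets):

```shell
# Paired-end samples arrive as a matching _1/_2 pair; single-end as one file.
# The accessions below are placeholders, not real datasets.
mkdir -p raw_data
touch raw_data/SRR0000001_1.fastq raw_data/SRR0000001_2.fastq   # paired-end
touch raw_data/SRR0000002.fastq                                 # single-end

layout_of() {
  if [ -f "raw_data/${1}_1.fastq" ] && [ -f "raw_data/${1}_2.fastq" ]; then
    echo PAIRED
  elif [ -f "raw_data/${1}.fastq" ]; then
    echo SINGLE
  else
    echo MISSING
  fi
}

layout_of SRR0000001   # prints PAIRED
layout_of SRR0000002   # prints SINGLE
```

If the files on disk disagree with the Layout shown on the SRA run page, re-check the accession before aligning.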
Step 2 — Download raw data

RNAflow downloads your raw sequencing files directly from NCBI SRA using fasterq-dump. The files are saved into your project folder automatically.

1 · What happens during download

The command fasterq-dump connects to NCBI's servers and downloads compressed sequencing data, then decompresses it into FASTQ format. Each sample is typically 1–10 GB depending on the sequencing depth.

✅ Good result
Files appear in your raw_data/ folder. Each sample has a .fastq file (single-end) or two .fastq files ending in _1 and _2 (paired-end). Use the check command to confirm file sizes are reasonable (>100 MB per sample is normal).
⚠️ Slow download?
NCBI downloads can be slow. Typical speeds are 5–20 MB/s. A dataset of 6 samples at 2 GB each takes 10–60 minutes. Do not interrupt the download — if interrupted, run the command again and fasterq-dump will resume.
❌ Error: accession not found
The SRR/ERR number was entered incorrectly, or the dataset is not public yet. Check the accession on ncbi.nlm.nih.gov/sra — it should show "Public" status.
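Beyond file size, it is worth confirming each download is a structurally valid FASTQ: four lines per read, every record header starting with @. A small awk-only sketch (the demo file is a two-read toy, not real data):

```shell
# Build a tiny two-read FASTQ so the check below has something to inspect.
mkdir -p raw_data
printf '@read1\nACGTACGT\n+\nFFFFFFFF\n@read2\nTTGGCCAA\n+\nFFFFFFFF\n' > raw_data/demo.fastq

fastq_reads() {
  # Valid FASTQ: total lines divisible by 4, each record header starts with '@'.
  awk 'NR % 4 == 1 && $0 !~ /^@/ { print "MALFORMED"; exit 1 }
       END { if (NR % 4 != 0) { print "TRUNCATED"; exit 1 } print NR / 4 }' "$1"
}

fastq_reads raw_data/demo.fastq   # prints 2 (the read count)
```

A truncated download usually shows up here as a line count that is not divisible by 4; rerunning the download command fixes it.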
Step 3 — Quality control with FastQC

FastQC examines your raw reads and produces a visual HTML report for each sample. Always run this before any analysis — it reveals problems that would invalidate your results if not addressed.

1 · What FastQC checks
✅ Per-base quality (PASS)
The quality score at each position across all reads. Good data has Q30+ scores (green zone) across most of the read length. A slight drop at the 3' end is normal and expected.
✅ Per-sequence quality (PASS)
Distribution of average quality per read. Should peak above Q30. If most reads are below Q20, the sequencing run had quality issues — contact your sequencing provider.
⚠️ Per-base sequence content (WARN/FAIL)
Shows nucleotide composition across positions. FAIL is normal for RNA-seq — the first 10–12 bases often show imbalance due to random hexamer priming bias during library preparation. This is expected and not a problem.
⚠️ Sequence duplication (WARN)
RNA-seq data always shows high duplication because highly expressed genes have thousands of copies. A WARN or FAIL here is completely normal for RNA-seq and does not indicate a problem.
❌ Adapter content (FAIL) — action needed
Adapter sequences are present in your reads. This happens when the cDNA fragment is shorter than the read length. Adapters must be removed in the trimming step (Step 4) before alignment.
❌ Overrepresented sequences (FAIL)
A single sequence makes up an unusually large share of all reads (FastQC warns above 0.1% and fails above 1%). Usually this is rRNA contamination or adapters. If the sequence is identified as a ribosomal RNA, it will simply not align to protein-coding genes and won't affect your results significantly.
2 · Reading the MultiQC summary report

MultiQC combines all individual FastQC reports into one interactive HTML page. Open it in your browser after FastQC completes.

📊 What to look for in MultiQC
The General Statistics table at the top shows: total reads (aim for 10–30 million per sample), % duplicates (60–90% is normal for RNA-seq), % GC content (should be consistent across samples). If one sample looks very different from the others, it may be a poor-quality sample.
✅ Ready to proceed if:
All samples have similar read counts. Quality scores are mostly green. Any FAILs are only for Per-base sequence content or Sequence duplication (both normal for RNA-seq).
❌ Investigate further if:
One sample has far fewer reads than the others (<5 million). Adapter content shows FAIL. Per-base quality drops below Q20 before 50% of the read length.
Step 4 — Adapter trimming with fastp

fastp removes adapter sequences, low-quality bases, and very short reads. It is fast, accurate, and works for both single-end and paired-end data automatically.

1 · What fastp does and why it matters

Adapter sequences are synthetic DNA added during library preparation — they are not part of your biological sample. If not removed, they interfere with alignment because the aligner cannot find a match for them in your genome. Low-quality bases at the ends of reads are also trimmed to improve alignment accuracy.

✅ Good fastp output
Reads passing filter: >95%. Adapter-trimmed reads: depends on your data (0–40% is normal). Q30 rate: >90%. Output files appear in your trimmed/ folder with the same names but _trimmed suffix.
2 · fastp settings used by RNAflow

RNAflow uses these fastp settings — all chosen to be conservative and appropriate for most RNA-seq experiments:

--trim_front1 5                # Remove first 5 bases (hexamer bias)
--qualified_quality_phred 20   # Keep bases with Q20+ quality
--length_required 36           # Discard reads shorter than 36 bp after trimming
--detect_adapter_for_pe        # Auto-detect adapters (paired-end)
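To make the Q20 cutoff concrete: each character in a FASTQ quality string encodes a Phred score, Q = ASCII code minus 33. A quick awk decoder (illustration only; the quality string is invented):

```shell
# Decode a FASTQ quality string into Phred scores (Q = ASCII code - 33).
phred() {
  printf '%s' "$1" | awk '
    BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
    { for (i = 1; i <= length($0); i++)
        printf "%s%d", (i > 1 ? " " : ""), ord[substr($0, i, 1)] - 33
      print "" }'
}

phred 'FFFF,,##'   # prints 37 37 37 37 11 11 2 2
```

The F characters (Q37) clear the Q20 threshold comfortably, while the , (Q11) and # (Q2) positions fall below it.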
Step 5 — Genome download and HISAT2 index

The reference genome and gene annotation are downloaded from Ensembl, then HISAT2 builds an index — a compressed searchable version of the genome that makes alignment fast.

1 · What files are downloaded and why
Genome FASTA (.fa.gz)
The complete DNA sequence of your organism's genome, one chromosome at a time. This is what the aligner maps your reads to. File size: 30 MB (yeast) to 15 GB (wheat).
Gene annotation GTF (.gtf.gz)
A table of exactly where every gene and exon is located on the genome — chromosome, start position, end position, strand. Used by HISAT2 for splice-aware alignment and by featureCounts to assign reads to genes.
⏱️
Building the HISAT2 index takes 5–30 minutes depending on genome size. Human takes ~20 minutes, yeast takes ~2 minutes. This only needs to be done once — the index can be reused for future experiments with the same organism.
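Two quick checks catch most download problems before you spend time on indexing: gzip integrity (detects a truncated file) and a count of sequence records. A sketch using a toy FASTA in place of a real Ensembl download:

```shell
# Stand-in for a downloaded genome: two tiny "chromosomes", gzip-compressed.
printf '>chr1\nACGTACGTAC\n>chr2\nGGGGCCCCAA\n' | gzip > genome.fa.gz

gzip -t genome.fa.gz && echo "gzip OK"   # fails loudly on a truncated download
gunzip -c genome.fa.gz | grep -c '^>'    # prints 2: number of sequence records
```

On a real file, the record count should roughly match the chromosome/scaffold count listed for your organism on Ensembl.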
Step 6 — Alignment with HISAT2

HISAT2 maps each sequencing read back to its location on the reference genome. It is splice-aware, meaning it correctly handles reads that span exon-exon junctions — essential for accurate RNA-seq alignment.

1 · Understanding alignment statistics

HISAT2 reports alignment statistics for each sample. These appear in the terminal output and are saved in your aligned/ folder as .log files.

>90%
Overall alignment — excellent
70–90%
Overall alignment — acceptable
<70%
Overall alignment — investigate
✅ Excellent: >90% overall alignment rate
Most reads found their location in the genome. Your data quality is high and the correct reference genome was used.
⚠️ Acceptable: 70–90% overall alignment rate
Some reads don't map — common reasons: reads from rRNA (which isn't in the genome annotation), very short reads after trimming, or reads from contaminating organisms. Analysis can proceed but investigate the unmapped reads if below 75%.
❌ Low: <70% overall alignment rate
Possible causes: wrong organism selected (check your accession on SRA), contaminated sample, genome version mismatch, or data quality issue. Check that your FASTA and GTF files are from the same Ensembl release number.
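The thresholds above are easy to script against the saved .log files. A sketch that pulls the rate out of a log and classifies it (the log text is a shortened, invented example of the HISAT2 summary format):

```shell
# A trimmed, invented stand-in for a HISAT2 summary log.
cat > SRR_demo.log <<'EOF'
20000000 reads; of these:
  20000000 (100.00%) were unpaired; of these:
    1160000 (5.80%) aligned 0 times
94.20% overall alignment rate
EOF

rate_check() {
  awk '/overall alignment rate/ {
         r = $1 + 0                       # "94.20%" + 0 strips the % sign
         if      (r > 90)  v = "excellent"
         else if (r >= 70) v = "acceptable"
         else              v = "investigate"
         printf "%.1f%% (%s)\n", r, v
       }' "$1"
}

rate_check SRR_demo.log   # prints 94.2% (excellent)
```

Looping this over aligned/*.log gives a one-screen summary of every sample's alignment quality.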
2 · BAM file sorting and indexing

After alignment, HISAT2 produces SAM files (text format). RNAflow then uses samtools to convert these to BAM (binary, 5× smaller), sort them by chromosome position, and index them. This is required by featureCounts and enables fast access to any region of the genome.

✅ BAM file check
Each sample should have a .bam file and a .bam.bai index file in your aligned/ folder. File sizes are typically 200 MB–3 GB depending on read depth.
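A short loop catches a missing index before featureCounts complains. Sketch only, with empty placeholder files standing in for real BAMs:

```shell
# Placeholder files: sampleB deliberately lacks its .bai index.
mkdir -p aligned
touch aligned/sampleA.bam aligned/sampleA.bam.bai aligned/sampleB.bam

for bam in aligned/*.bam; do
  if [ -f "$bam.bai" ]; then echo "indexed   $bam"; else echo "NO INDEX  $bam"; fi
done
# prints:
#   indexed   aligned/sampleA.bam
#   NO INDEX  aligned/sampleB.bam
```

A missing index is fixed with samtools index on that BAM file.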
Step 7 — Gene counting with featureCounts

featureCounts counts how many reads aligned to each gene in each sample. This produces a count matrix — the input to DESeq2 for statistical analysis.

1 · Understanding the count matrix

The output is a table where rows are genes and columns are samples. Each number is the count of reads that mapped to that gene in that sample.

# Example count matrix (first few rows):
GeneID      WT_rep1  WT_rep2  WT_rep3  MT_rep1  MT_rep2  MT_rep3
AT1G01010       245      231      258      512      498      487
AT1G01020        45       52       41       89       95       88
AT1G01030       102       98      115       44       39       51
📊 What do these numbers mean?
The raw counts are not directly comparable between samples because samples have different total sequencing depths. A gene with 500 counts in a sample with 10 million total reads is proportionally more expressed than 500 counts in a sample with 50 million total reads. DESeq2 handles this normalisation automatically.
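A simplified illustration of why depth matters: scale each column to counts-per-million by its own total. This is plain per-total scaling, not the more robust median-of-ratios method DESeq2 actually uses, and the gene names and counts are invented:

```shell
# Two samples at very different depths: gene1 has identical raw counts in both.
printf 'GeneID\tsampleA\tsampleB\ngene1\t500\t500\ngene2\t9500\t49500\n' > counts.tsv

# Pass 1 sums each column; pass 2 rescales every count to per-million units.
awk -F'\t' '
  NR == FNR { if (FNR > 1) { totA += $2; totB += $3 }; next }
  FNR == 1  { print $1, $2"_CPM", $3"_CPM"; next }
  { printf "%s %.0f %.0f\n", $1, $2 / totA * 1e6, $3 / totB * 1e6 }
' counts.tsv counts.tsv > cpm.txt
cat cpm.txt
# gene1 comes out 50000 vs 10000: a 5-fold difference hidden in identical raw counts
```

The same 500 raw reads mean very different expression levels once each sample's total depth is accounted for, which is exactly the correction DESeq2 performs (more robustly) for you.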
2 · featureCounts assignment statistics
✅ Good: 60–80% assigned reads
Most reads were successfully assigned to a gene. This is a normal and expected range for RNA-seq data.
⚠️ Unassigned: no features
Reads mapped to the genome but not to an annotated gene region (e.g. intergenic regions, introns, non-coding regions). Some unassigned reads are normal (10–30%). High rates (>50%) may mean the GTF annotation version doesn't match your genome FASTA.
❌ Very low assignment (<40%)
Check that your genome FASTA and GTF are from the same Ensembl release number and the same organism. A mismatch between these two files is the most common cause.
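The .summary file that featureCounts writes next to the count matrix makes this check scriptable. The toy file below mimics its two-column Status/count layout (row names and numbers invented; real files list more Unassigned categories):

```shell
# Invented stand-in for a featureCounts counts.txt.summary file.
printf 'Status\taligned/sampleA.bam\nAssigned\t14000000\nUnassigned_MultiMapping\t3000000\nUnassigned_NoFeatures\t2500000\nUnassigned_Ambiguity\t500000\n' > counts.txt.summary

awk -F'\t' '
  NR > 1 { tot += $2; if ($1 == "Assigned") ok = $2 }
  END    { printf "assigned: %.1f%% (%d of %d reads)\n", ok / tot * 100, ok, tot }
' counts.txt.summary
# prints assigned: 70.0% (14000000 of 20000000 reads)
```

A value inside the 60-80% band from the card above means you are clear to proceed to DESeq2.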
Step 9 — Differential expression with DESeq2

DESeq2 is the gold-standard R package for RNA-seq differential expression analysis. It uses a negative binomial statistical model to identify genes that genuinely change between your conditions, accounting for biological variation.

1 · What DESeq2 does step by step
1. Size factor normalisation
DESeq2 calculates a size factor for each sample to correct for differences in sequencing depth. After normalisation, samples are directly comparable.
2. Dispersion estimation
DESeq2 estimates how much each gene varies between replicates. Genes with high variability need stronger evidence to be called significant.
3. Statistical testing
A Wald test is applied to every gene, producing a p-value for the probability that the observed difference is due to chance.
4. Multiple testing correction
Because thousands of genes are tested simultaneously, raw p-values are adjusted using the Benjamini-Hochberg method to control the false discovery rate (FDR). This produces the padj column.
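Step 4 is the least intuitive, so here is the Benjamini-Hochberg procedure in miniature: sort the p-values, multiply each by n/rank, then take a running minimum from the largest p downwards so the adjusted values stay monotone. The four p-values are invented:

```shell
printf '0.005\n0.01\n0.03\n0.5\n' > pvals.txt

sort -gr pvals.txt | awk '
  { p[NR] = $1 }
  END {
    n = NR; m = 1
    for (i = 1; i <= n; i++) {          # i = 1 is the largest p-value (rank n)
      rank = n - i + 1
      adj = p[i] * n / rank             # BH: p * n / rank
      if (adj < m) m = adj              # running minimum keeps padj monotone
      padj[rank] = m; orig[rank] = p[i]
    }
    for (r = 1; r <= n; r++) printf "p=%g padj=%g\n", orig[r], padj[r]
  }' > padj.txt
cat padj.txt
# p=0.005 padj=0.02
# p=0.01 padj=0.02
# p=0.03 padj=0.04
# p=0.5 padj=0.5
```

Note how every padj is at least as large as its raw p-value: with thousands of genes tested, this inflation is what keeps the false discovery rate under control.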
2 · Understanding DESeq2 results columns
# DESeq2 results table — one row per gene:
gene        baseMean  log2FC  lfcSE    stat   pvalue   padj
AT1G01010      302.4    2.14    0.18   11.89  1.3e-32  8.2e-30
AT1G01020       47.1   -1.82    0.31   -5.87  4.4e-09  6.1e-08
baseMean
Average normalised count across all samples. Higher = more expressed overall. Very low baseMean genes (<10) are often unreliable due to stochastic variation.
log2FoldChange (log2FC)
How much the gene changes. +1 = 2× higher in treatment. -1 = 2× lower. +2 = 4× higher. 0 = no change. Positive values = upregulated in your treatment condition.
lfcSE (standard error)
Uncertainty in the fold change estimate. Small SE = reliable fold change. Large SE (especially in low-count genes) = less reliable.
pvalue
Raw probability. Never use this directly for significance — use padj instead. The raw p-value does not account for testing thousands of genes simultaneously.
padj (adjusted p-value) ← use this
The corrected p-value. Standard significance threshold: padj < 0.05. This means there is a <5% chance the result is a false discovery. Genes with padj < 0.05 AND |log2FC| > 1 are considered confidently differentially expressed.
NA values in padj
Some genes have NA in the padj column. This means DESeq2 filtered them out — either too low count (filtered by independent filtering) or an outlier in one replicate. NA genes are not significant.
3 · Interpreting the DESeq2 summary

After running DESeq2, the summary() function prints a table like this:

out of 5944 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)    : 791, 13%     ← upregulated genes
LFC < 0 (down)  : 1227, 21%    ← downregulated genes
outliers        : 0, 0%
low counts      : 578, 10%     ← filtered out (normal)
📊 Is my result normal?
The number of significant genes varies enormously depending on your experiment. A stress response might affect 10–40% of genes. A mild treatment might affect 1–5%. A comparison between very different cell types might affect 50%+. There is no "right" number — the biology of your experiment determines this.
Step 10 — Plots and visualisation

RNAflow generates three standard visualisations used in RNA-seq publications: a volcano plot, a heatmap, and a PCA plot. All are saved as 300 DPI PNG files suitable for direct use in papers and theses.

1 · Volcano plot — what it shows

A volcano plot visualises all genes simultaneously — fold change on the x-axis and statistical significance on the y-axis. It gets its name from its characteristic shape.

🌋 Reading the volcano plot
X-axis: log2 fold change. Left = downregulated, Right = upregulated. Genes near 0 don't change much.
Y-axis: -log10(padj). Higher = more significant. The dashed horizontal line marks padj=0.05.
Vertical dashed lines: mark log2FC = ±1 (= 2-fold change threshold).
Orange dots: significantly upregulated (right of vertical line, above horizontal line).
Blue dots: significantly downregulated.
Grey dots: not significant.
Gene labels: top 15 most significant genes are labelled automatically.
✅ Good volcano plot
A clear separation between conditions shows as a wide spread of dots. Most dots cluster near the centre (no change) with a "volcano" shape — a few highly significant genes at the top of each arm.
⚠️ Very few significant genes
If almost all dots are grey, your conditions may be very similar biologically, you may not have enough replicates, or sequencing depth was too low. Check your experimental design.
2 · Heatmap — what it shows

The heatmap shows the top 40 most significant genes as a colour grid — each row is a gene, each column is a sample. Colour represents expression level (scaled across the row).

🌡️ Reading the heatmap
Blue: lower expression than average (downregulated relative to mean).
Red/orange: higher expression than average (upregulated relative to mean).
White: average expression level.
Rows (genes) are clustered by similar expression pattern — genes that behave similarly group together.
Columns (samples) are not clustered — kept in your original order to preserve condition grouping.
✅ Good heatmap
You should see two clear colour blocks — one set of columns (your control samples) and one set (treatment) with opposite colours. Replicates within the same condition should look nearly identical.
3 · PCA plot — what it shows

Principal Component Analysis (PCA) reduces the expression of thousands of genes to just 2 dimensions. It shows how similar or different your samples are to each other overall.

🎯 Reading the PCA plot
Each dot = one sample. Samples close together are similar in overall gene expression. Samples far apart are very different.
PC1 (x-axis): captures the largest source of variation in your data — usually this is your experimental condition.
PC2 (y-axis): captures the second-largest source — often batch effects or biological variation between replicates.
% variance explained: shown on each axis. High PC1 % (e.g. 70%+) means your condition is the dominant source of variation.
✅ Excellent PCA
Replicates cluster tightly together. The two conditions are far apart on PC1, which explains a large proportion of variance (>50%). This means your experiment worked — the conditions are genuinely different and replicates are consistent.
⚠️ One outlier replicate
If one sample from a group is far from its replicates on the PCA, it may be a poor-quality or contaminated sample. Consider removing it from your analysis if the quality metrics (alignment rate, read count) are also poor for that sample.
❌ Conditions overlap completely
If your two conditions cannot be separated on any PC axis, the transcriptomes are very similar. You will find few or no significant genes. This may mean the treatment had no transcriptional effect, or the time point was wrong, or there is a sample labelling error.
Step 8 — Gene family analysis

For large genomes (wheat, maize, human, soybean), researchers often want to focus on one biological family rather than all 30,000+ genes. RNAflow includes 40+ gene families across 8 categories.

1 · When to use gene family analysis
📋 Use gene family analysis when:
• Your experiment targets a specific pathway (e.g. hormone response, stress signalling)
• You are working with a large genome and want focused results
• You want to produce a publication figure for a specific gene family
• Your full DESeq2 analysis found many significant genes and you want to understand the biology

The gene family analysis produces its own volcano plot and heatmap for just the selected family, saved separately from the full analysis results.

2 · Available gene families
Transcription factors

MYB, WRKY, bHLH, NAC, MADS, ERF, GRAS, bZIP, HOX, GATA, STAT, p53, NF-κB, ETS, SOX, FOX — 21 families

Protein kinases

MAPK, CDK, RLK, SnRK, CDPK, TOR, tyrosine kinases — 8 families

Plant hormone pathways

Auxin, GA, cytokinin, ABA, ethylene, JA, SA, BR, strigolactone, peptide hormones — 10 pathways

Transporters

ABC, aquaporin, sugar (SUT/SWEET), nitrate (NRT), phosphate, K+, metal (ZIP, YSL, NRAMP), SLC, ion channels

Animal signalling

Wnt/β-catenin, Notch, Hedgehog, TGF-β, PI3K/AKT, RAS/ERK, JAK/STAT, NF-κB, Hippo — 9 pathways

Immune / Cancer / Epigenetic

TLR, NLR, cytokines, BCL-2, caspases, oncogenes, tumour suppressors, DNA methylation, histone modifiers

Interpreting your results

A quick reference guide for the most common questions researchers have when looking at their RNAflow results for the first time.

1 · How do I decide which genes are significant?
✅ Standard significance criteria
A gene is considered differentially expressed (DE) if it meets BOTH of these criteria:

1. padj < 0.05 — statistically significant after multiple testing correction
2. |log2FoldChange| > 1 — biologically meaningful change (at least 2-fold)

Some researchers use padj < 0.01 (stricter) or |log2FC| > 0.58 (~1.5-fold, more lenient). State your thresholds in your methods section.
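Applying both criteria to a results table is a one-liner. The toy rows below copy the DESeq2 column layout shown earlier (gene IDs and values invented); the fold column converts |log2FC| back to a plain fold change:

```shell
# Invented results in a gene / baseMean / log2FC / padj layout.
printf 'gene\tbaseMean\tlog2FC\tpadj\nAT1G01010\t302.4\t2.14\t8.2e-30\nAT1G01020\t47.1\t-1.82\t6.1e-08\nAT1G01030\t88.0\t0.40\t0.0031\nAT1G01040\t12.5\t3.10\t0.41\n' > results.tsv

# Keep genes with padj < 0.05 AND |log2FC| > 1; skip NA rows.
awk -F'\t' 'NR > 1 && $4 != "NA" && $4 < 0.05 && ($3 > 1 || $3 < -1) {
  fc = 2 ^ ($3 < 0 ? -$3 : $3)          # back-convert log2FC to fold change
  printf "%s log2FC=%s fold=%.1fx padj=%s\n", $1, $3, fc, $4
}' results.tsv
# AT1G01010 log2FC=2.14 fold=4.4x padj=8.2e-30
# AT1G01020 log2FC=-1.82 fold=3.5x padj=6.1e-08
```

Note that AT1G01030 is statistically significant (padj 0.0031) but fails the fold-change criterion, which is exactly why both thresholds are applied together.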
2 · What do I do with my list of significant genes?

Once you have a list of significant DE genes, these are the typical next steps in a research project:

Gene Ontology (GO) enrichment
Upload your DE gene list to geneontology.org or STRING-db to find which biological processes are overrepresented. This answers: "what is this list of genes collectively doing?"
KEGG pathway analysis
Map DE genes to metabolic and signalling pathways at kegg.jp. Shows which pathways are activated or suppressed in your treatment condition.
Validation by RT-qPCR
Select 5–10 top DE genes and validate their expression changes using RT-qPCR. This is typically required before publication to confirm the RNA-seq findings.
Protein interaction networks
Upload your DE gene list to string-db.org to visualise which proteins interact with each other. Clusters of interacting proteins often represent co-regulated functional modules.
3. How to cite RNAflow and its tools in your paper

In your Methods section, cite each tool used in the pipeline. Example methods text:

RNA-seq analysis was performed using RNAflow (bioinfocodex.com). Raw reads were quality-assessed with FastQC (Andrews, 2010) and trimmed with fastp (Chen et al., 2018). Trimmed reads were aligned to the [organism] reference genome (Ensembl release [number]) using HISAT2 (Kim et al., 2019). Gene-level read counts were obtained with featureCounts (Liao et al., 2014). Differential expression analysis was performed with DESeq2 (Love et al., 2014) using a significance threshold of padj < 0.05 and |log2FC| > 1. Visualisations were generated using ggplot2 (Wickham, 2016) and pheatmap (Kolde, 2019).
Supported organisms

15 organisms are pre-loaded with their Ensembl genome URLs. Select any from the dropdown in the Configure page — genome and GTF URLs fill automatically.

🐾 Animals
  • Human — Homo sapiens (GRCh38)
  • Mouse — Mus musculus (GRCm39)
  • Rat — Rattus norvegicus (mRatBN7.2)
  • Zebrafish — Danio rerio (GRCz11)
  • Drosophila — Drosophila melanogaster (BDGP6)
  • C. elegans — Caenorhabditis elegans (WBcel235)
  • Yeast — Saccharomyces cerevisiae (R64)
  • Yarrowia — Yarrowia lipolytica (ASM101387v1)
🌱 Plants
  • Arabidopsis — Arabidopsis thaliana (TAIR10)
  • Rice — Oryza sativa (IRGSP-1.0)
  • Maize — Zea mays (B73-NAM-5.0)
  • Wheat — Triticum aestivum (IWGSC)
  • Soybean — Glycine max (Glycine_max_v2.1)
  • Tomato — Solanum lycopersicum (SL3.0)
💡
For organisms not listed, select Custom and paste your own Ensembl FTP URLs.
Troubleshooting

Solutions to the most common problems encountered during RNA-seq analysis with RNAflow.

The Run button is greyed out — what do I do?
The local server is not running. Open Terminal, navigate to the folder containing rnaflow_server.py, and run: python3 rnaflow_server.py. Keep that Terminal window open. The green dot in the top bar will appear when connected. If you prefer not to use the server, use Mode B (Copy) to copy commands manually.
HISAT2 alignment rate is below 70%
Most common cause: the wrong organism was selected. Check your SRA accession on ncbi.nlm.nih.gov/sra to confirm the organism. Also confirm that the genome FASTA and GTF annotation are from the same Ensembl release number. A mismatch between these two files causes very low alignment rates.
DESeq2 gives an error about zero counts or missing samples
This usually means the column names in your count matrix don't match the sample names in your metadata. Check that the order of samples in the count table matches the order in the DESeq2 script exactly. Also ensure you have at least 2 samples per condition.
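A quick stand-alone sanity check before re-running DESeq2 (plain Python; the sample names are illustrative, not tied to RNAflow's file layout):

```python
def check_samples(count_columns, metadata_samples):
    """Verify count-matrix columns and metadata sample names match, in order."""
    missing = [s for s in metadata_samples if s not in count_columns]
    extra = [c for c in count_columns if c not in metadata_samples]
    in_order = count_columns == metadata_samples
    return {"missing": missing, "extra": extra, "in_order": in_order}

# Illustrative: two conditions, three replicates each
counts_cols = ["ctrl_1", "ctrl_2", "ctrl_3", "trt_1", "trt_2", "trt_3"]
meta_rows   = ["ctrl_1", "ctrl_2", "ctrl_3", "trt_1", "trt_3", "trt_2"]  # swapped!

report = check_samples(counts_cols, meta_rows)
print(report["in_order"])  # False — trt_2 and trt_3 are out of order
```

If `missing` or `extra` is non-empty the names themselves disagree; if only `in_order` is False, the samples exist but the row order must be fixed before DESeq2 will pair them correctly.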
featureCounts assigned only 20–30% of reads
The GTF annotation version does not match the genome FASTA. Make sure both files are from the same Ensembl release number and the same species. For example, Ensembl release 109 genome FASTA must be used with the Ensembl release 109 GTF file.
I have no significant genes after DESeq2
Common causes: (1) The biological conditions are too similar — try a larger fold-change or different time point. (2) Insufficient replicates — fewer than 3 replicates reduces statistical power. (3) Too few reads — aim for 10–20 million reads per sample. (4) The PCA shows samples don't separate — may indicate a sample labelling or metadata error.
How do I add more replicates or samples?
Go to the Configure page → click the + Add replicate button under each condition to add more SRA accession rows. You can add as many replicates as needed. All downstream commands update automatically.
The download is very slow or fails halfway
NCBI SRA downloads can be slow, especially for large files. If a download fails, simply run the fasterq-dump command again — it will resume from where it stopped. For very large datasets, consider downloading overnight or using the NCBI SRA toolkit's prefetch command first.
Can I analyse more than two conditions?
RNAflow currently supports two-condition comparisons (one control vs one treatment). For multi-condition experiments (e.g. timepoints, multiple treatments), run the pipeline separately for each pairwise comparison you are interested in. Future versions of RNAflow will support multi-factor designs.
Lab Management Platform

GeneNet Lab v1.0.0

A full-featured lab management platform with team access control, seat-based licensing, OneDrive storage, and a desktop app for macOS and Linux.

Version 1.0.0

Download GeneNet

🍎
macOS
.dmg installer · macOS 12+
95 MB · v1.0.0
⬇ Download for macOS
🐧
Linux
.AppImage · Ubuntu, Fedora, Debian
100 MB · v1.0.0
⬇ Download for Linux
🪟
Windows
.exe installer · Windows 10/11
78 MB · v1.0.0
⬇ Download for Windows
🔗 Already have a connection code?
Connect directly via the browser — no download needed.
→ Connect to Lab

Features

👥
Team Management
Invite members by email with unique invite codes. Admins can block, remove, or restore access per user at any time.
🪑
Seat-Based Licensing
Starter (5 seats), Pro (20 seats), or Enterprise (unlimited). Visual seat usage tracker with upgrade flow built in.
🛡️
Admin Control Panel
View all team members, change roles, block or remove users, and manage connection codes from one page.
☁️
OneDrive / Network Storage
Store your database and uploads on OneDrive or a shared network path. Configured once during setup — syncs automatically.
📧
Email Invites
Admins send invite emails with a one-click registration link. New users land on a pre-filled form — no code typing needed.
🔒
Access States
Users can be Active, Blocked (temporary), or Removed. Blocked users see a clear message to contact the administrator.

Licensing Plans

Starter
Free
Up to 5 team members
Full feature access · No expiry
Pro
Up to 20 team members
Contact us to upgrade
Enterprise
Unlimited seats
Custom deployment support

How it works

1. Download & install the app for your operating system. On first run, the admin completes a one-time setup wizard.
2. Admin invites team members by email from the Team & Seats panel. Each invite generates a unique registration link.
3. Members connect using the LAB-XXXXX connection code (e.g. LAB-84921) and register with their invite code.
4. Admin manages access from the Admin Panel — block, remove, restore, or promote any user at any time.
5. Scale up by upgrading the seat plan directly from the app when you hit the member limit.
⚠️
macOS users: If you see "GeneNet is damaged and can't be opened", run this command in Terminal:
xattr -cr ~/Downloads/GeneNet*.dmg
This removes the macOS quarantine flag from unsigned apps. You only need to do this once.
Whole-genome sequencing pipeline

NGSuite Coming Q3 2025

Variant calling, SNP/INDEL annotation, structural variant detection, population genetics. Powered by GATK4, BWA-MEM2, and VEP.

🔔 Get notified when NGSuite launches
Email info@bioinfocodex.com with subject: "Notify me — NGSuite"
Plasmid design and annotation

PlasmidMap Coming Q4 2025

Design plasmids, find restriction sites, annotate features, and export publication-ready circular maps. Import from GenBank.

🔔 Get notified when PlasmidMap launches
Email info@bioinfocodex.com with subject: "Notify me — PlasmidMap"
ChIP-seq and ATAC-seq pipeline

ChIPflow Coming 2026

Peak calling with MACS2, motif analysis with HOMER, differential chromatin accessibility, BigWig tracks for genome browsers.

🔔 Get notified when ChIPflow launches
Email info@bioinfocodex.com with subject: "Notify me — ChIPflow"
Metagenomics and microbiome pipeline

MetaFlow Coming 2026

16S rRNA and whole-metagenome shotgun analysis. Taxonomic classification, alpha/beta diversity, differential abundance with QIIME2 and Kraken2.

🔔 Get notified when MetaFlow launches
Email info@bioinfocodex.com with subject: "Notify me — MetaFlow"
Single-cell RNA-seq pipeline

scRNAflow Coming 2026

Single-cell RNA-seq analysis using Seurat. Cell clustering, UMAP visualisation, marker gene identification, trajectory analysis.

🔔 Get notified when scRNAflow launches
Email info@bioinfocodex.com with subject: "Notify me — scRNAflow"
Mass spectrometry proteomics pipeline

ProteomicsLab Coming 2026

Label-free quantification, TMT/iTRAQ, post-translational modification analysis, pathway enrichment. Powered by MaxQuant.

🔔 Get notified when ProteomicsLab launches
Email info@bioinfocodex.com with subject: "Notify me — ProteomicsLab"
🎓 Bioinformatics Academy

A comprehensive university-level curriculum — from molecular biology fundamentals to advanced genomic data analysis. Written by researchers, for researchers.

14 Modules · 50+ Topics · Real Code Examples · Beginner → Advanced
📘 Module 1 · Foundations
💻 Module 2 · Computing
🗄️ Module 3 · Databases
🔍 Module 4 · Sequence Analysis
🧬 Module 5 · RNA-seq
🔬 Module 6 · Genomics
📊 Module 7 · Statistics
🚀 Module 8 · Advanced
🧪 Module 9 · Proteomics
🏗️ Module 10 · Structural Bio
🧩 Module 11 · Epigenomics
🔧 Module 12 · Genome Assembly
👥 Module 13 · Pop. Genetics
🕸️ Module 14 · Systems Biology
Module 1 · Foundations

Molecular Biology & Bioinformatics Fundamentals

Before writing a single command, you must understand what you are analysing. This module covers the biological foundations every bioinformatician needs — DNA structure, gene expression, the central dogma, and why bioinformatics exists.

1.1 What is Bioinformatics?

Bioinformatics is an interdisciplinary science that develops and applies computational methods to analyse biological data. It sits at the intersection of biology, computer science, mathematics, and statistics. The field emerged because modern biology generates data at a scale that is impossible to analyse by hand — a single whole-genome sequencing run produces hundreds of gigabytes of raw data, and large consortia such as the UK Biobank have sequenced over 500,000 individuals.

💡
Think of a sequencing machine as a very sophisticated camera. It produces a "photograph" of millions of DNA/RNA fragments. Bioinformatics is the darkroom — the place where raw data becomes biological insight.

A Brief History

The roots of bioinformatics trace back to the 1960s, when Margaret Dayhoff compiled the first protein sequence database — the Atlas of Protein Sequence and Structure — and developed the PAM substitution matrices still used in sequence alignment today. Needleman and Wunsch (1970) and, a decade later, Smith and Waterman (1981) published the foundational sequence alignment algorithms. The launch of the Human Genome Project in 1990 was a watershed moment: a $3 billion, 13-year international effort that produced the first draft of the human genome in 2001. Since then, sequencing costs have plummeted — from roughly $100 million per genome in 2001 to under $200 today — falling faster than Moore's Law. This exponential cost reduction has democratised genomics and made bioinformatics skills essential for modern biological research.

1965 — Dayhoff's Atlas of Protein Sequence and Structure
1977 — Sanger sequencing developed
1990 — Human Genome Project begins
2001 — Draft human genome published
2005 — Next-gen sequencing (454, Illumina) arrives
2012 — CRISPR-Cas9 gene editing described
2020 — AlphaFold2 achieves near-experimental accuracy in protein structure prediction
2022 — T2T consortium completes full human genome

Why Does Bioinformatics Matter?

Drug Discovery

Identify drug targets by finding genes upregulated in disease. AlphaFold predicted 200M+ protein structures, accelerating structure-based drug design.

Agriculture

Improve crop resistance by identifying beneficial genetic variants in plant genomes through GWAS and marker-assisted selection.

Epidemiology

Track viral evolution in real time. SARS-CoV-2 variant tracking relied entirely on bioinformatics pipelines and GISAID.

What a Typical Bioinformatics Project Looks Like

A typical project follows a pipeline from experimental question to biological insight:

Define Question → Design Experiment → Generate Data (sequencing) → Quality Control → Alignment / Assembly → Quantification → Statistical Analysis → Biological Interpretation → Visualisation & Publication

At each step, the bioinformatician must make choices: Which aligner to use? What quality thresholds? How to handle batch effects? The answers depend on the specific data type (RNA-seq, ChIP-seq, WGS, etc.), the organism, and the biological question. This course will walk you through each step.

Major Sub-fields

Genomics — studying entire genomes, variant calling, assembly, GWAS
Transcriptomics — measuring which genes are active (RNA-seq, scRNA-seq)
Proteomics — identifying and quantifying proteins (mass spectrometry)
Metabolomics — measuring small molecules (metabolites) in cells
Structural Biology — 3D shapes of macromolecules (X-ray, cryo-EM, AlphaFold)
Metagenomics — sequencing entire microbial communities (16S, shotgun)
Epigenomics — DNA methylation, histone modifications, chromatin accessibility
Pharmacogenomics — how genetic variation affects drug response

Career Paths and Demand

Bioinformatics professionals are in very high demand. Common job titles include Bioinformatics Scientist, Computational Biologist, Genomics Data Analyst, and Research Software Engineer (Bioinformatics). Positions exist in academia, pharmaceutical companies, biotech startups, clinical diagnostics labs, and agricultural genomics companies.

Academic Research — postdoc, research associate, or bioinformatics core facility staff at universities
Pharma / Biotech — drug target identification, clinical trial genomics, companion diagnostics
Clinical Genomics — variant interpretation, genetic counselling support, diagnostic pipeline development
Industry Tools — software engineering at companies like Illumina, 10x Genomics, or Oxford Nanopore

The Tools Ecosystem

Bioinformatics relies on an ecosystem of mostly open-source tools. You will use a combination of command-line tools on Linux, scripting languages (Python, R, Bash), and specialised software for specific analyses.

Linux/Bash · Python · R / Bioconductor · Conda / Mamba · Nextflow / Snakemake · Galaxy · Docker / Singularity · HPC / SLURM
📝
Key Takeaways:
• Bioinformatics combines biology + CS + statistics to extract meaning from biological data
• The field was born from protein databases in the 1960s and exploded with the Human Genome Project
• Sequencing costs have fallen over a million-fold, making data generation easy — analysis is now the bottleneck
• A typical project flows from experimental design through QC, alignment, quantification, statistics, and interpretation
• Career demand is extremely high across academia, pharma, clinical, and industry settings
1.2 DNA, RNA and Proteins — The Central Dogma

Francis Crick's central dogma (1958) describes the flow of genetic information: DNA → RNA → Protein. Every bioinformatics workflow targets one of these molecules. Understanding this flow at a molecular level is essential because the tools you use (aligners, quantifiers, variant callers) all make assumptions about the underlying biology.

DNA (double-stranded; A, T, G, C) → transcription → mRNA (single-stranded; A, U, G, C) → translation → Protein (20 amino acids)

Base Pairing Rules

DNA is a double-stranded helix held together by hydrogen bonds between complementary bases. The pairing rules are strict and fundamental to replication, transcription, and sequencing technology:

In DNA
Adenine (A) pairs with Thymine (T) — 2 hydrogen bonds
Guanine (G) pairs with Cytosine (C) — 3 hydrogen bonds
G-C pairs are stronger, so GC-rich regions are harder to denature and can cause sequencing bias.
In RNA
Adenine (A) pairs with Uracil (U) — 2 hydrogen bonds
Guanine (G) pairs with Cytosine (C) — 3 hydrogen bonds
RNA uses uracil instead of thymine. FASTQ files still write T by convention, but remember the molecule is actually U.
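The pairing rules above translate directly into code. A minimal Python sketch (toy helper functions for illustration):

```python
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(dna):
    """Return the reverse complement of a DNA strand (read 5' -> 3')."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(dna))

def transcribe(coding_strand):
    """mRNA has the same sequence as the coding strand, with T replaced by U."""
    return coding_strand.replace("T", "U")

print(reverse_complement("ATGC"))   # GCAT
print(transcribe("ATGGCA"))         # AUGGCA
```

Reverse-complementing twice returns the original strand — a handy property for checking strand-handling code.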

Transcription — Step by Step

Transcription is the process of copying a gene's DNA sequence into messenger RNA (mRNA). It occurs in the nucleus of eukaryotic cells and involves several steps:

1. Initiation: Transcription factors bind to the promoter region (e.g., TATA box at ~-25 bp). RNA Polymerase II is recruited to form the pre-initiation complex.

2. Elongation: RNA Pol II moves along the template strand (3' → 5'), synthesising mRNA in the 5' → 3' direction. It reads the template strand and adds complementary RNA bases (A→U, T→A, G→C, C→G).

3. Termination: RNA Pol II reaches a polyadenylation signal (AAUAAA). The pre-mRNA is cleaved and released.

4. Processing: Three modifications occur before the mRNA leaves the nucleus:
  • 5' cap — a modified guanine nucleotide added to the 5' end (protects from degradation)
  • 3' poly-A tail — 100-250 adenine nucleotides added (stability, export signal). This is why poly-A selection works for mRNA isolation in RNA-seq.
  • Splicing — introns removed, exons joined by the spliceosome

Translation — The Genetic Code

Translation occurs at ribosomes in the cytoplasm. The mRNA sequence is read in triplets called codons, each specifying one of 20 amino acids. The code is degenerate — most amino acids are encoded by multiple codons (e.g., Leucine has 6 codons). Key codons to know:

AUG
Start codon
Codes for Methionine
UAA, UAG, UGA
Stop codons
"Ochre, Amber, Opal"
64 total codons
61 code for amino acids
3 are stop codons

The reading frame matters enormously. A single insertion or deletion shifts the entire frame downstream (a frameshift mutation), usually producing a nonfunctional protein. This is why indels in coding regions are often more damaging than substitutions.
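A toy translator makes the frameshift effect concrete (the codon table below is deliberately partial — only the codons used in this example, not all 64):

```python
# Partial codon table for illustration: one-letter amino acid codes, * = stop
CODON_TABLE = {
    "AUG": "M", "AAA": "K", "GCU": "A", "UUU": "F",
    "GGC": "G", "UUA": "L",
    "UAA": "*", "UAG": "*", "UGA": "*",   # the three stop codons
}

def translate(mrna):
    """Read codons from position 0 until a stop codon or the end of sequence."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

mrna = "AUGAAAGCUUUUUAA"          # codons: AUG AAA GCU UUU UAA
print(translate(mrna))            # MKAF

# A single-base insertion shifts every downstream codon (frameshift)
shifted = "AUGAAAGGCUUUUUAA"      # extra G inserted after the second codon
print(translate(shifted))         # MKGFL — a different protein entirely
```

Note how one inserted base changes every amino acid after the insertion point, which is why coding-region indels are usually far more disruptive than single-base substitutions.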

Post-Translational Modifications (PTMs)

After translation, proteins are often chemically modified. These modifications can change protein activity, location, stability, and interactions. Common PTMs include phosphorylation (adding a phosphate group — central to cell signalling), glycosylation (adding sugar chains — important for secreted proteins), ubiquitination (tagging for degradation by the proteasome), and acetylation (especially on histones, regulating gene expression). Proteomics and mass spectrometry are the primary tools for studying PTMs.

Why RNA is Unstable Compared to DNA

RNA is chemically less stable than DNA for two reasons: (1) the 2'-OH group on the ribose sugar makes RNA susceptible to alkaline hydrolysis, and (2) RNA is single-stranded, making it vulnerable to RNases (ubiquitous enzymes that degrade RNA). This instability is why RNA extraction requires careful technique — RNase-free conditions, working on ice, and rapid processing. It also means RNA quality degrades over time, which is why we measure RIN (RNA Integrity Number) before sequencing.

Exceptions to the Central Dogma

⚠️
The central dogma has important exceptions that are biologically and clinically significant:
Reverse Transcription

RNA → DNA. Performed by reverse transcriptase in retroviruses (HIV) and retrotransposons. Also used in the lab to create cDNA from mRNA (the basis of RNA-seq library preparation).

RNA Replication

RNA → RNA. RNA viruses (influenza, SARS-CoV-2) use RNA-dependent RNA polymerase (RdRp) to replicate their genomes. This enzyme is a drug target (e.g., remdesivir).

RNA Editing

Bases in mRNA can be chemically modified after transcription (e.g., A-to-I editing by ADAR enzymes). This means the mRNA sequence can differ from the genomic DNA — a potential pitfall in variant calling.

Prions

Protein → Protein. Misfolded prion proteins can template the misfolding of normal proteins without any nucleic acid involvement (the cause of mad cow disease / CJD).

Non-coding RNA

Not all RNA becomes protein. In fact, the vast majority of transcribed RNA in human cells is non-coding. Key types include: rRNA (ribosomal RNA — ~80% of total RNA, forms the ribosome structure), tRNA (transfer RNA — carries amino acids during translation), miRNA (microRNA — ~22 nt, silences gene expression post-transcriptionally), lncRNA (long non-coding RNA — >200 nt, diverse regulatory roles), and circRNA (circular RNA — resistant to degradation, potential biomarkers). These are increasingly studied in bioinformatics.

📝
Key Takeaways:
• The central dogma: DNA → RNA → Protein, with important exceptions (reverse transcription, RNA viruses, prions)
• Base pairing: A-T (DNA) / A-U (RNA) with 2 H-bonds; G-C with 3 H-bonds (stronger)
• Transcription produces pre-mRNA that is capped, polyadenylated, and spliced before export
• The genetic code is degenerate (64 codons for 20 amino acids + 3 stops); reading frame is critical
• RNA instability (2'-OH, RNases) has direct practical implications for sample handling and sequencing quality
1.3 Genome Organisation & Gene Structure

Understanding how genes are organised in the genome is essential for interpreting any bioinformatics result. When you load a BAM file in a genome browser, annotate variants, or run differential expression analysis, you are navigating genome structure. This section covers the physical and functional organisation of genomes, from the chromosome level down to individual regulatory elements.

Chromosomes and Chromatin

In eukaryotes, genomic DNA is packaged with histone proteins into chromatin, which is further organised into chromosomes. Humans have 23 pairs of chromosomes (22 autosomes + 1 pair of sex chromosomes, XX or XY), totalling about 3.2 billion base pairs. During cell division, each chromosome is replicated, and the two copies (called sister chromatids) are joined at the centromere before being separated.

Euchromatin

Loosely packed, transcriptionally active chromatin. Most genes are expressed from euchromatic regions. Appears light in chromosome staining.

Heterochromatin

Tightly packed, largely silent chromatin. Found at centromeres and telomeres. Contains repetitive sequences. Constitutive (always condensed) or facultative (conditionally silenced, e.g., the inactive X chromosome).

Genome Size and the C-Value Paradox

• E. coli — 4.6 Mb, ~4,300 genes
• Yeast — 12 Mb, ~6,000 genes
• Fly — 180 Mb, ~14,000 genes
• Human — 3.2 Gb, ~20,000 genes
• Wheat — 17 Gb, ~107,000 genes
• Paris japonica (plant) — 149 Gb

The C-value paradox refers to the observation that genome size does not correlate with organism complexity. Humans have 3.2 Gb but only ~20,000 protein-coding genes, while the onion has 16 Gb. The difference is largely due to repetitive elements — transposons, retrotransposons, and satellite DNA account for about 45% of the human genome. Only about 1.5% of the human genome actually encodes proteins.

Regulatory Elements

Promoters

Located immediately upstream of genes (within ~1 kb of the TSS). Contain elements like the TATA box (~-25 bp) and CpG islands (GC-rich regions, often unmethylated in active promoters). RNA polymerase and transcription factors bind here to initiate transcription.

Enhancers

Can be located thousands or even millions of base pairs away from their target gene — upstream, downstream, or even within introns. They loop in 3D space to contact the promoter and boost transcription. Enhancer activity is tissue-specific, explaining why the same genome produces different cell types.

CpG Islands

CpG islands are regions of at least 200 bp with a GC content above 50% and an observed/expected CpG ratio above 0.6. About 70% of human gene promoters are associated with CpG islands. In normal cells, most CpG islands in promoters are unmethylated, allowing gene expression. In cancer, aberrant hypermethylation of tumour suppressor gene promoters is a hallmark event. This is studied using bisulfite sequencing or methylation arrays (e.g., Illumina EPIC array).
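These thresholds are straightforward to compute. A simplified sketch (real CpG-island callers scan the genome with sliding windows; this just applies the three classic criteria to a single sequence):

```python
def gc_content(seq):
    """Fraction of bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def obs_exp_cpg(seq):
    """Observed/expected CpG ratio: count(CG) * N / (count(C) * count(G))."""
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)

def looks_like_cpg_island(seq):
    """Classic thresholds: >=200 bp, GC > 50%, obs/exp CpG > 0.6."""
    return len(seq) >= 200 and gc_content(seq) > 0.5 and obs_exp_cpg(seq) > 0.6

island_like = "CG" * 150               # 300 bp of pure CpG dinucleotides
print(looks_like_cpg_island(island_like))  # True
```

The obs/exp ratio matters because a GC-rich sequence can still be CpG-poor: in most of the genome the CG dinucleotide is depleted (methylated cytosines deaminate to T over evolutionary time), so an unusually high ratio is the signature of an island.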

Gene Structure — A Real Example

Consider the human BRCA1 gene (associated with breast cancer risk). It spans about 81 kb on chromosome 17, contains 23 exons, and produces a protein of 1,863 amino acids. The coding sequence itself is only about 5,592 bp — just 7% of the gene's genomic span. The rest consists of introns, UTRs, and regulatory regions.

BRCA1 gene structure (simplified):

5'---[Promoter/CpG island]---[5'UTR]---[Exon 1]---intron 1 (34 kb)---[Exon 2]---intron 2---...---[Exon 23]---[3'UTR]---[polyA signal]---3'

Green = exons (coding) | Grey = introns (removed by splicing) | Total span: ~81 kb

Alternative Splicing

A single gene can produce multiple protein variants (isoforms) through alternative splicing of its pre-mRNA. The human genome has ~20,000 protein-coding genes but produces over 200,000 different mRNA transcripts. Common types of alternative splicing include:

Exon skipping

The most common type in humans. An entire exon is excluded from the mature mRNA, producing a shorter protein isoform.

Alternative 5' or 3' splice site

A different splice site is used within the same exon, making the exon longer or shorter. Changes the protein sequence at the boundary.

Intron retention

An intron is not removed and remains in the mature mRNA. More common in plants and fungi. Can introduce premature stop codons.

Mutually exclusive exons

One of two exons is included, but never both. The choice is regulated in a tissue-specific manner. Example: the Drosophila Dscam gene can produce over 38,000 isoforms, largely through mutually exclusive exon choice.

⚠️
When aligning RNA-seq reads to a genome, use a splice-aware aligner (STAR, HISAT2) — not a simple aligner like BWA. Otherwise reads spanning exon-exon junctions will fail to align. If you need isoform-level analysis, consider tools like Salmon or Kallisto for transcript-level quantification, or PacBio IsoSeq for full-length transcript sequencing.
📝
Key Takeaways:
• Eukaryotic genomes are organised into chromosomes with euchromatin (active) and heterochromatin (silent) regions
• Genome size does not correlate with complexity (C-value paradox) — most large genomes are inflated by repetitive elements
• Promoters, enhancers, and CpG islands control when and where genes are expressed
• Gene structure: 5'UTR - exons/introns - 3'UTR; introns are spliced out during mRNA processing
• Alternative splicing enables one gene to produce multiple protein isoforms, greatly expanding the proteome
1.4 Sequencing Technologies

Most bioinformatics data comes from Next-Generation Sequencing (NGS). Understanding what the machine actually does explains why your data looks the way it does — including error profiles, quality score distributions, read length limitations, and the types of biases you will encounter. The three major platforms each have distinct chemistries with different strengths and trade-offs.

How Illumina Sequencing-by-Synthesis Works

Illumina is the dominant platform, generating the majority of the world's sequencing data. Here is the step-by-step process:

Step 1 — Library Preparation: DNA/cDNA is fragmented (~300-500 bp), and adapters are ligated to both ends. These adapters contain sequences for binding to the flow cell and for PCR amplification. An index (barcode) sequence is included to allow multiplexing of multiple samples on one lane.

Step 2 — Cluster Generation: The library is loaded onto a flow cell — a glass slide coated with millions of short oligos complementary to the adapters. Each fragment binds to the surface, then undergoes bridge amplification: the fragment bends over and binds to a nearby oligo, forming a bridge, which is then extended by polymerase. This cycle repeats ~35 times, creating a cluster of ~1,000 identical copies of the original fragment. Each cluster will produce one "read."

Step 3 — Sequencing by Synthesis: Fluorescently-labelled, reversibly-terminated nucleotides are washed across the flow cell. In each cycle, exactly one nucleotide is incorporated into each growing strand. The flow cell is imaged to determine which base was added (each of A, C, G, T has a different fluorescent colour). The terminator and fluorescent tag are then chemically removed, and the next cycle begins. Typically 100-150 cycles are performed per read.

Step 4 — Base Calling: The raw images are processed by Illumina's Real-Time Analysis (RTA) software to determine the base at each position and assign a quality score. The output is a FASTQ file — one per read direction (R1 and R2 for paired-end). A standard run produces billions of reads.

PHRED Quality Scores

Every base in a FASTQ file has an associated PHRED quality score (Q) that represents the probability of an incorrect base call. The relationship is logarithmic:

Q = -10 × log10(P)

Where P = probability of error

Q10 = 1 in 10 error rate (10%) — terrible
Q20 = 1 in 100 error rate (1%) — minimum acceptable
Q30 = 1 in 1,000 error rate (0.1%) — good quality
Q40 = 1 in 10,000 error rate (0.01%) — excellent

In FASTQ files, quality scores are encoded as ASCII characters. For Illumina 1.8+ (Phred+33 encoding), the character ! = Q0, I = Q40. When you see a quality string like FFFFFFFF, each F represents Q37 — very high quality. Quality typically degrades toward the end of reads, which is why trimming the 3' ends is common practice.
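The encoding is simple arithmetic — subtract 33 from each character's ASCII code. A small Python sketch:

```python
def phred_to_prob(q):
    """Invert Q = -10 * log10(P): error probability for a given quality score."""
    return 10 ** (-q / 10)

def decode_quality(qual_string, offset=33):
    """Decode a Phred+33 FASTQ quality string into integer Q scores."""
    return [ord(ch) - offset for ch in qual_string]

print(decode_quality("!IF"))   # [0, 40, 37] — '!' = Q0, 'I' = Q40, 'F' = Q37
print(phred_to_prob(30))       # 0.001 — one expected error per 1,000 bases
```

Tools like FastQC and fastp perform exactly this decoding internally when computing per-base quality plots and trimming thresholds.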

Technology Comparison

Feature      | Illumina                                 | PacBio HiFi                                    | Oxford Nanopore
Read length  | 75-300 bp                                | 10-25 kb                                       | Up to 4 Mb
Accuracy     | >99.9% (Q30+)                            | ~99.9% (HiFi mode)                             | ~95-99% (improving)
Throughput   | Up to 6 Tb (NovaSeq 6000)                | ~30 Gb per cell                                | ~50-300 Gb (PromethION)
Run time     | 12-48 hours                              | ~24 hours                                      | Minutes to days
Error type   | Substitutions                            | Indels (rare in HiFi)                          | Indels (homopolymers)
Cost per Gb  | ~$2-5                                    | ~$20-30                                        | ~$10-20
Best for     | RNA-seq, WGS, ChIP-seq, variant calling  | Genome assembly, structural variants, isoforms | Field diagnostics, real-time, direct RNA, methylation

When to Choose Each Technology

Choose Illumina when:

• Differential gene expression (RNA-seq)
• Whole-genome variant calling
• ChIP-seq / ATAC-seq
• Large sample numbers (multiplexing)
• Budget is a concern

Choose PacBio when:

• De novo genome assembly
• Full-length transcript isoforms (IsoSeq)
• Structural variant detection
• Repetitive region resolution
• Phased diploid assembly

Choose Nanopore when:

• Rapid pathogen identification
• Field work (MinION is portable)
• Direct RNA sequencing (no RT step)
• Epigenetic modifications (m5C, m6A)
• Ultra-long reads needed

💡
Hybrid approaches: Many genome assembly projects now combine Illumina short reads (for accuracy/polishing) with long reads (for scaffolding and resolving repeats). The T2T (Telomere-to-Telomere) consortium used PacBio HiFi and Oxford Nanopore ultra-long reads to complete the first truly gapless human genome assembly in 2022.
📝
Key Takeaways:
• Illumina sequencing-by-synthesis uses fluorescent reversible terminators to read one base per cycle across billions of clusters
• PHRED quality scores are logarithmic: Q30 means a 1/1,000 chance of error
• Quality degrades at read ends; Illumina errors are substitutions, while long-read errors are typically indels
• Choose the platform based on your application: Illumina for most quantitative assays, long-read for assembly and structural analysis
• Hybrid approaches combining short and long reads often give the best results for genome projects
Module 2 · Computing Skills

Linux, Terminal & Programming for Bioinformatics

Almost all bioinformatics tools run on Linux/macOS in the terminal. This module teaches you the computing skills you need before running any analysis — from basic commands to writing scripts.

2.1 Essential Linux / Terminal Commands

You will spend most of your time in the terminal. Linux (and the macOS terminal) is the native environment for nearly all bioinformatics tools, so mastering the command line is not optional — it is the single most important skill for a bioinformatician, and the commands below are non-negotiable knowledge.

Navigation and File Operations

# Navigation
pwd                        # Print current directory
ls -lh                     # List files (human-readable sizes)
ls -lhS                    # List files sorted by size (largest first)
ls -lht                    # List files sorted by modification time
cd /path/to/folder         # Change directory
cd ..                      # Go up one level
cd ~                       # Go to home directory
cd -                       # Go to previous directory (very useful!)

# File operations
cp source.txt dest.txt     # Copy file
cp -r dir1/ dir2/          # Copy directory recursively
mv old.txt new.txt         # Move or rename
rm file.txt                # Delete file (PERMANENT — no Trash!)
mkdir -p project/data/raw  # Create nested directories
ln -s /path/to/file link   # Create symbolic link (saves disk space)

# Viewing files
cat file.txt               # Print entire file
head -20 file.txt          # First 20 lines
tail -20 file.txt          # Last 20 lines
less file.txt              # Page through (press q to quit)
wc -l file.txt             # Count lines

# Searching
grep "pattern" file.txt    # Find lines matching pattern
grep -c ">" file.fa        # Count sequences in FASTA
grep -v "^#" file.vcf      # Exclude comment/header lines
find . -name "*.fastq.gz"  # Find files by name pattern

# Compression
gzip file.fastq            # Compress (creates .fastq.gz)
gunzip file.fastq.gz       # Decompress
zcat file.fastq.gz | head  # View compressed file without decompressing
⚠️
rm -rf is permanent and irreversible. There is no Trash. Always double-check before running delete commands. Never run rm -rf / or rm -rf * without absolute certainty.

File Permissions

Linux file permissions control who can read, write, and execute files. Understanding permissions is essential because scripts need execute permission, and shared HPC directories require correct group permissions.

# View permissions (the -rwxr-xr-- part of ls -l output)
ls -l script.sh
# Output: -rwxr-xr-- 1 user group 1234 Mar 20 10:00 script.sh
# r=read(4), w=write(2), x=execute(1)
# Positions: owner | group | others

# Make a script executable
chmod +x script.sh         # Add execute permission for everyone
chmod 755 script.sh        # rwx for owner, rx for group and others
chmod 644 data.txt         # rw for owner, read-only for group and others

# Change ownership (requires sudo or being the owner)
chown user:group file.txt  # Change owner and group
chgrp labgroup project/    # Change group for shared projects

Pipes and Redirection

Pipes (|) and redirection (>, >>) are the heart of Unix philosophy — chain small tools together to accomplish complex tasks. This is how most bioinformatics one-liners work.

# Pipe: send output of one command as input to the next
cat file.fastq | grep "^@" | wc -l   # Count reads (approximate — quality lines can also start with @)

# Redirect output to a file
samtools flagstat file.bam > stats.txt   # Overwrite file
echo "run complete" >> log.txt           # Append to file

# Redirect stderr separately from stdout
hisat2 -x index -U reads.fq -S out.sam 2> alignment_log.txt

# Redirect both stdout and stderr
command &> all_output.txt

# Powerful bioinformatics one-liners using pipes
# Count reads per chromosome from BAM:
samtools idxstats sorted.bam | cut -f1,3 | sort -k2 -nr | head

# Extract unique gene names from a GTF file:
grep -v "^#" genes.gtf | awk '$3=="gene"' | grep -o 'gene_name "[^"]*"' | sort -u | wc -l

Process Management

Bioinformatics jobs often run for hours. Knowing how to manage long-running processes is critical, especially when working over SSH.

# Monitor system resources
top         # Live process viewer (press q to quit)
htop        # Better version of top (install with conda)
df -h       # Check disk space
du -sh */   # Size of each subdirectory
free -h     # Check available RAM

# Run jobs in background
nohup long_command &   # Run in background, survives logout
jobs                   # List background jobs
kill %1                # Kill background job #1
kill -9 12345          # Force kill process by PID

# screen / tmux — persistent terminal sessions (essential for SSH)
screen -S mysession    # Start named session
# Ctrl+A, D to detach
screen -r mysession    # Reattach to session
screen -ls             # List sessions

# tmux (modern alternative to screen)
tmux new -s analysis     # Start named session
# Ctrl+B, D to detach
tmux attach -t analysis  # Reattach

SSH for HPC / Server Access

Most bioinformatics work is done on remote servers or High-Performance Computing (HPC) clusters. SSH (Secure Shell) is how you connect:

# Basic SSH connection
ssh username@hpc.university.edu

# SSH with key authentication (more secure, no password prompt)
ssh-keygen -t ed25519    # Generate key pair (once)
ssh-copy-id user@server  # Copy public key to server

# Transfer files with scp
scp local_file.txt user@server:/path/to/dest/
scp user@server:/path/to/results.txt ./

# Transfer files with rsync (better for large transfers — resumable)
rsync -avP local_dir/ user@server:/remote/dir/

Bioinformatics-Specific Tricks

# Run a command on multiple files with xargs
ls *.fastq.gz | xargs -I {} fastqc {}

# GNU parallel — run commands in parallel (much faster)
ls *.fastq.gz | parallel -j 4 "fastqc {}"

# Process all FASTQ pairs in a directory
for f in *_R1.fastq.gz; do
  r2=${f/_R1/_R2}
  sample=$(basename $f _R1.fastq.gz)
  echo "Processing $sample: $f and $r2"
done

# Quick stats: column-based operations with awk
awk '{sum+=$5} END {print "Mean:", sum/NR}' data.txt

# Sort and remove duplicate lines
sort -u gene_list.txt > unique_genes.txt

# Join two files by a common column
join -t$'\t' -1 1 -2 1 <(sort file1.txt) <(sort file2.txt)
📝
Key Takeaways:
• Master navigation (cd, ls, pwd), file operations (cp, mv, rm), and file viewing (head, tail, less, grep)
• Understand permissions (chmod 755) — your scripts must be executable to run
• Pipes and redirection are the foundation of bioinformatics one-liners
• Use screen/tmux for persistent sessions when working over SSH
• GNU parallel and xargs enable efficient batch processing of many files
2.2 Setting Up Your Environment (Conda)

Conda is the standard package manager for bioinformatics. It installs tools and manages dependencies so they don't conflict with each other. Proper environment management is one of the most important skills for reproducible research — if you cannot recreate your exact software environment, your results cannot be reproduced.

Installing Conda / Mamba

# Option A: Install Miniconda (traditional)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc

# Option B: Install Miniforge (RECOMMENDED — includes mamba by default)
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh
source ~/.bashrc

# Set up bioinformatics channels
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict   # Important for reproducibility
💡
Why Mamba over Conda? Mamba is a drop-in replacement for conda that uses a C++ solver instead of Python. It resolves dependencies 10-50x faster than conda. With Miniforge, mamba is included by default. Just replace conda install with mamba install — everything else is the same.

Creating and Managing Environments

# Create a project environment (ALWAYS use environments!)
mamba create -n rnaseq python=3.10
conda activate rnaseq

# Install tools into the environment
mamba install -c bioconda fastqc trim-galore hisat2 samtools subread
mamba install -c conda-forge r-base r-biocmanager

# List installed tools and their versions
conda list

# Save environment for reproducibility
conda env export > environment.yml
# Recreate on another machine:
conda env create -f environment.yml

# Useful management commands
conda env list                # List all environments
conda deactivate              # Return to base
conda env remove -n old_env   # Delete an environment
⚠️
Never install tools into your base environment. One project = one environment. This prevents version conflicts between projects and ensures reproducibility.

Containers: Docker and Singularity

Containers go beyond conda by packaging the entire operating system, libraries, and tools into a single portable image. This guarantees that the software runs identically on any machine. Docker is the standard for local development; Singularity (Apptainer) is the standard on HPC clusters because it does not require root privileges.

# Docker — pull and run a bioinformatics image
docker pull biocontainers/samtools:v1.17
docker run -v $(pwd):/data biocontainers/samtools:v1.17 samtools flagstat /data/aligned.bam

# Singularity — the HPC alternative (no root needed)
singularity pull docker://biocontainers/samtools:v1.17
singularity exec samtools_v1.17.sif samtools flagstat aligned.bam

# Many Nextflow/Snakemake pipelines use containers automatically

Project Directory Structure

Organising your project directory consistently saves enormous time and prevents errors. Here is a recommended structure:

# Create a standard project structure
mkdir -p my_rnaseq_project/{data/{raw,trimmed,aligned},results/{counts,deseq2,figures},scripts,logs,ref}

# Resulting structure:
# my_rnaseq_project/
#   data/
#     raw/       <-- raw FASTQ files (never modify these!)
#     trimmed/   <-- trimmed FASTQ files
#     aligned/   <-- BAM files
#   results/
#     counts/    <-- gene count matrices
#     deseq2/    <-- differential expression results
#     figures/   <-- publication-ready plots
#   scripts/     <-- analysis scripts (version controlled)
#   logs/        <-- log files from each step
#   ref/         <-- reference genome, GTF, indices

Version Control with Git

Track your analysis scripts with git. This provides a history of every change you made and allows collaboration. Do not track large data files — only scripts and small config files.

# Initialise a git repo in your project
cd my_rnaseq_project
git init
echo -e "data/\nresults/\nlogs/\n*.bam\n*.fastq.gz" > .gitignore
git add scripts/ .gitignore environment.yml
git commit -m "Initial project setup"

# Push to GitHub for backup and collaboration
git remote add origin https://github.com/user/project.git
git push -u origin main

HPC Module System

Many HPC clusters use the module system instead of (or alongside) conda to manage software. The module system allows administrators to install centrally-maintained software that users can load as needed.

# Common module commands on HPC
module avail                # List all available modules
module avail samtools       # Search for a specific tool
module load samtools/1.17   # Load a specific version
module list                 # Show currently loaded modules
module unload samtools      # Unload a module
module purge                # Unload all modules
📝
Key Takeaways:
• Use Mamba (via Miniforge) instead of plain Conda for much faster dependency resolution
• Always create separate environments for each project; never install into base
• Containers (Docker/Singularity) provide the highest level of reproducibility
• Maintain a consistent project directory structure and track scripts with git
• On HPC clusters, use the module system or Singularity containers
2.3 Bash Scripting for Automation

When you have 20 samples, you don't run commands one by one. Bash scripts automate repetitive tasks — this is the single most time-saving skill in bioinformatics. A well-written script ensures every sample is processed identically, provides a record of exactly what was done, and can be re-run when parameters need to change.

Variables, Arrays, and String Operations

#!/bin/bash

# Variables — no spaces around the = sign!
SAMPLE="control_rep1"
THREADS=8
GENOME="/ref/hg38/genome"

# String operations
echo ${SAMPLE}             # control_rep1
echo ${SAMPLE^^}           # CONTROL_REP1 (uppercase)
echo ${SAMPLE/_rep1/}      # control (substitution)
echo ${SAMPLE%.fastq.gz}   # Remove suffix

# Arrays — store lists of items
SAMPLES=("ctrl_1" "ctrl_2" "treat_1" "treat_2")
echo ${SAMPLES[0]}         # ctrl_1 (first element)
echo ${SAMPLES[@]}         # All elements
echo ${#SAMPLES[@]}        # 4 (number of elements)

# Loop over array
for s in "${SAMPLES[@]}"; do
  echo "Processing $s"
done

Conditionals (if/else)

#!/bin/bash

# Check if a file exists before processing
if [ -f "${SAMPLE}_R1.fastq.gz" ]; then
  echo "File found, processing..."
elif [ -d "$INPUT_DIR" ]; then
  echo "Directory exists but no file found"
else
  echo "ERROR: Input not found!" >&2
  exit 1
fi

# Check if a tool is installed
if ! command -v samtools &> /dev/null; then
  echo "ERROR: samtools not found. Install with: mamba install -c bioconda samtools"
  exit 1
fi

# Numeric comparisons
READ_COUNT=$(zcat sample.fastq.gz | wc -l)
READ_COUNT=$((READ_COUNT / 4))
if [ "$READ_COUNT" -lt 1000000 ]; then
  echo "WARNING: Only $READ_COUNT reads — too few for reliable DE analysis"
fi

Error Handling — set -euo pipefail

By default, bash scripts continue running even when a command fails. This is dangerous — if trimming fails silently, you might align garbage data. Always start scripts with safety flags:

#!/bin/bash
set -euo pipefail
# -e: Exit immediately if any command fails
# -u: Treat unset variables as an error
# -o pipefail: If any command in a pipe fails, the whole pipe fails

# Without pipefail, this would silently fail if hisat2 crashes:
hisat2 -x index -U reads.fq | samtools sort -o out.bam
# With pipefail, the script stops and reports the error
⚠️
set -euo pipefail should be on line 2 of every bioinformatics script. Without it, you may not notice when a critical step fails, leading to incorrect results downstream.

Logging Output

#!/bin/bash
set -euo pipefail

# Create a log file with timestamp
LOG="logs/pipeline_$(date +%Y%m%d_%H%M%S).log"
exec > >(tee -a "$LOG") 2>&1   # Send stdout and stderr to both screen and log

echo "Pipeline started at $(date)"
echo "Working directory: $(pwd)"
echo "Conda environment: $CONDA_DEFAULT_ENV"
echo "---"

# Your pipeline commands here...

echo "Pipeline finished at $(date)"

Full RNA-seq Pipeline Script

#!/bin/bash
set -euo pipefail

# ─────────────────────────────────────────
# RNA-seq preprocessing pipeline
# Usage: bash pipeline.sh
# ─────────────────────────────────────────

# Define paths (edit these)
RAW_DIR="./data/raw"
TRIM_DIR="./data/trimmed"
ALIGN_DIR="./data/aligned"
GENOME_DIR="./ref"
THREADS=8

# Create output directories
mkdir -p "$TRIM_DIR" "$ALIGN_DIR" "logs"

# Loop over all samples
for R1 in ${RAW_DIR}/*_R1.fastq.gz; do
  SAMPLE=$(basename "$R1" _R1.fastq.gz)
  R2="${RAW_DIR}/${SAMPLE}_R2.fastq.gz"
  echo "=== Processing: $SAMPLE ==="

  # Check paired file exists
  if [ ! -f "$R2" ]; then
    echo "ERROR: Paired file not found: $R2" >&2
    exit 1
  fi

  # Trim adapters with Trim Galore
  trim_galore --paired --cores $THREADS \
    --fastqc -o "$TRIM_DIR" \
    "$R1" "$R2" 2>&1 | tee "logs/${SAMPLE}_trim.log"

  # Align to genome
  hisat2 -p $THREADS -x "${GENOME_DIR}/hisat2_index" \
    -1 "${TRIM_DIR}/${SAMPLE}_R1_val_1.fq.gz" \
    -2 "${TRIM_DIR}/${SAMPLE}_R2_val_2.fq.gz" \
    2> "logs/${SAMPLE}_align.log" \
    | samtools sort -@ $THREADS -o "${ALIGN_DIR}/${SAMPLE}.bam"

  samtools index "${ALIGN_DIR}/${SAMPLE}.bam"
  echo "=== Done: $SAMPLE ==="
done

echo "All samples processed!"

Running Scripts on HPC with SLURM

On most HPC clusters, you do not run jobs directly. Instead, you submit them to a job scheduler (most commonly SLURM). SLURM allocates resources and runs your script when resources become available.

#!/bin/bash
#SBATCH --job-name=rnaseq_align
#SBATCH --output=logs/slurm_%j.out   # %j = job ID
#SBATCH --error=logs/slurm_%j.err
#SBATCH --time=12:00:00              # Max wall time: 12 hours
#SBATCH --cpus-per-task=8            # Number of CPU cores
#SBATCH --mem=32G                    # RAM request
#SBATCH --partition=general          # Queue/partition name

set -euo pipefail

# Load conda environment
source ~/.bashrc
conda activate rnaseq

# Run the pipeline
bash scripts/pipeline.sh

# Submit with:  sbatch scripts/slurm_align.sh
# Monitor with: squeue -u $USER
# Cancel with:  scancel JOB_ID
💡
SLURM array jobs let you submit one script that runs many times with different inputs — perfect for processing many samples in parallel: #SBATCH --array=1-20 and use $SLURM_ARRAY_TASK_ID to select the sample.
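The array-job tip above can be sketched as a minimal script. The file names here (samples.txt, pipeline_one_sample.sh) are hypothetical placeholders for your own project files; a tiny demo sample list is written inline so the sketch runs standalone, and the array range should match your real number of samples.

```shell
#!/bin/bash
#SBATCH --job-name=rnaseq_array
#SBATCH --array=1-4            # One task per line of samples.txt
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=12:00:00

set -euo pipefail

# Demo sample list (in a real project this file already exists)
printf 'ctrl_1\nctrl_2\ntreat_1\ntreat_2\n' > samples.txt

# Outside SLURM (e.g. a local dry run) default the task ID to 1
TASK_ID="${SLURM_ARRAY_TASK_ID:-1}"

# Each array task picks the N-th sample name from the list
SAMPLE=$(sed -n "${TASK_ID}p" samples.txt)
echo "Array task ${TASK_ID} -> sample ${SAMPLE}"

# ...then run the per-sample commands, e.g.:
# bash scripts/pipeline_one_sample.sh "$SAMPLE"
```

Submitted with sbatch, SLURM launches four copies of this script, each with a different $SLURM_ARRAY_TASK_ID, so all samples process in parallel.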
📝
Key Takeaways:
• Always start scripts with set -euo pipefail to catch errors early
• Use variables for paths and parameters — makes scripts reusable
• Log everything: timestamps, tool versions, and parameters used
• Test on one sample before running on all; check outputs at each step
• On HPC, use SLURM (#SBATCH directives) to request resources and submit jobs
2.4 R for Bioinformatics — Getting Started

R is the language of choice for statistical analysis, visualisation, and Bioconductor packages (DESeq2, edgeR, limma). You don't need to be a software engineer — a working knowledge of R is enough to run the statistical analyses and produce publication-quality plots that make up the second half of most bioinformatics projects.

RStudio Setup

RStudio is the standard IDE for R. It provides a console, script editor, plot viewer, file browser, and help panel in one interface. Install R first (from CRAN or via conda), then install RStudio Desktop (free). For HPC, you can use RStudio Server through a web browser.

# Install via conda (recommended for reproducibility)
mamba install -c conda-forge r-base=4.3 r-tidyverse r-biocmanager rstudio-desktop

# Or install R packages within R:
install.packages(c("tidyverse", "ggplot2", "pheatmap", "RColorBrewer"))

# Install Bioconductor packages
if (!require("BiocManager")) install.packages("BiocManager")
BiocManager::install(c("DESeq2", "edgeR", "clusterProfiler",
                       "EnhancedVolcano", "ComplexHeatmap", "org.Hs.eg.db"))

Reading and Writing Data

# Read different file formats
counts <- read.csv("counts.csv", row.names = 1)     # CSV
counts <- read.delim("counts.txt", row.names = 1)   # Tab-delimited
counts <- readRDS("counts.rds")                     # R binary (fast, compact)

# Inspect the data
head(counts)          # First 6 rows
dim(counts)           # Dimensions: genes x samples
str(counts)           # Structure of the object
summary(counts[, 1])  # Summary statistics for first column

# Write outputs
write.csv(results, "deseq2_results.csv")
saveRDS(dds, "deseq2_object.rds")   # Save full R object (can reload later)

Tidyverse Data Wrangling

The tidyverse is a collection of R packages (dplyr, tidyr, ggplot2, readr, etc.) that share a common design philosophy. The pipe operator (|> or %>%) chains operations together, making code readable.

library(tidyverse)

# Read metadata and filter
metadata <- read_csv("metadata.csv")
metadata |>
  filter(condition == "treated") |>        # Keep only treated samples
  select(sample_id, condition, batch) |>   # Select specific columns
  arrange(batch) |>                        # Sort by batch
  mutate(label = paste(condition, batch, sep = "_"))   # Add new column

# Summarise expression across groups
results <- read_csv("deseq2_results.csv")
results |>
  filter(!is.na(padj)) |>                           # Remove NA values
  filter(padj < 0.05, abs(log2FoldChange) > 1) |>   # Significant DEGs
  arrange(padj) |>                                  # Sort by significance
  head(20)                                          # Top 20

ggplot2 for Genomics Visualisation

ggplot2 is the most powerful plotting system in R. It uses a "grammar of graphics" — you build plots by adding layers (geom_point, geom_boxplot, etc.) to a base aesthetic mapping.

library(ggplot2)

# Volcano plot — the classic DE visualisation
ggplot(results, aes(x = log2FoldChange, y = -log10(padj))) +
  geom_point(aes(color = padj < 0.05 & abs(log2FoldChange) > 1),
             size = 0.8, alpha = 0.6) +
  scale_color_manual(values = c("grey70", "red3"),
                     labels = c("NS", "Significant")) +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "grey40") +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed", color = "grey40") +
  labs(x = "Log2 Fold Change", y = "-Log10 Adjusted P-value",
       title = "Differential Expression: Treated vs Control") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom")

# Save plot as publication-quality PDF
ggsave("figures/volcano_plot.pdf", width = 7, height = 6, dpi = 300)

# Boxplot of gene expression across conditions
ggplot(expr_long, aes(x = condition, y = log2(count + 1), fill = condition)) +
  geom_boxplot(outlier.size = 0.5) +
  facet_wrap(~gene, scales = "free_y") +
  theme_bw()

R Markdown for Reproducible Reports

R Markdown lets you combine code, results, and narrative text in a single document. When you "knit" the document, R runs all the code and produces a formatted HTML or PDF report. This is the gold standard for reproducible analysis.

# In an .Rmd file:
---
title: "RNA-seq Analysis Report"
author: "Your Name"
date: "`r Sys.Date()`"
output: html_document
---

## Quality Control

```{r qc-summary}
library(DESeq2)
dds <- readRDS("deseq2_object.rds")
plotPCA(vst(dds), intgroup = "condition")
```

## Differential Expression

```{r de-results}
res <- results(dds, contrast = c("condition", "treated", "control"))
summary(res)
```
📌
Recommended R learning path: (1) R for Data Science by Hadley Wickham (free online at r4ds.had.co.nz) → (2) Bioconductor Workflows at bioconductor.org → (3) DESeq2 vignette → (4) ggplot2 book (ggplot2-book.org)
📝
Key Takeaways:
• Install R + RStudio for interactive analysis; use Bioconductor for genomics-specific packages
• The tidyverse (dplyr, ggplot2, tidyr) is essential for data wrangling and visualisation
• ggplot2 builds plots in layers — master aes(), geom_*, and theme() for publication figures
• Use R Markdown to combine code, results, and narrative in reproducible reports
• Save R objects with saveRDS() so you can reload them without re-running slow computations
Module 3 · Databases & Data

Biological Databases & File Formats

Public databases hold billions of sequences, structures, and experiments — and they are all free. Knowing which database to use and how to download data efficiently is an essential skill.

3.1 The Major Biological Databases

Public biological databases are one of the greatest resources in science. They contain billions of sequences, millions of structures, and thousands of curated experiments — all freely accessible. Knowing which database to search, how to interpret accession numbers, and how to programmatically retrieve data are essential skills for any bioinformatician.

The Core Databases

NCBI Multi-purpose

National Center for Biotechnology Information. Houses GenBank (nucleotide sequences), SRA (raw sequencing data), PubMed (literature), RefSeq (curated reference sequences), and ClinVar (clinical variants). Start here for most searches. URL: ncbi.nlm.nih.gov

Ensembl Genome

Best source for annotated reference genomes and GTF annotation files for 300+ species. Provides BioMart for bulk queries. Use this to download genome FASTA and GTF files. URL: ensembl.org

UniProt Protein

Comprehensive protein database. Swiss-Prot = manually curated (~570K entries, high quality). TrEMBL = computationally annotated (~250M entries). URL: uniprot.org

RCSB PDB Structure

Protein Data Bank. 200,000+ experimentally determined 3D structures (X-ray, cryo-EM, NMR). Now complemented by AlphaFold DB with 200M+ predicted structures. URL: rcsb.org

GEO / SRA Expression

Gene Expression Omnibus (processed data, metadata) and Sequence Read Archive (raw FASTQ files). Search by organism + condition + technology. Most published datasets are deposited here.

KEGG / Reactome Pathways

KEGG: metabolic pathways, drug targets, disease associations. Reactome: curated biological pathways and reactions, especially good for human data. Essential for pathway enrichment analysis.

STRING Interactions

Protein-protein interaction networks for 5000+ species. Input a gene list and get a functional interaction network. URL: string-db.org

Gene Ontology (GO) Annotation

Standardised vocabulary for gene function. Three domains: Biological Process (BP), Molecular Function (MF), Cellular Component (CC). Used in enrichment analysis. URL: geneontology.org

Accession Number Formats

Every entry in a database has a unique accession number. Recognising the format tells you which database it comes from:

Pattern            Example            Database             What It Is
NM_######          NM_007294          NCBI RefSeq          mRNA transcript
NP_######          NP_009225          NCBI RefSeq          Protein
ENSG###########    ENSG00000012048    Ensembl              Gene (human)
ENST###########    ENST00000357654    Ensembl              Transcript
P##### or Q#####   P38398             UniProt Swiss-Prot   Curated protein
SRR#######         SRR1234567         NCBI SRA             Sequencing run
GSE######          GSE53697           NCBI GEO             Dataset/Series
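As a quick illustration, the prefixes in the table can be matched with a small shell function. This is a sketch: the glob patterns are simplified prefix checks, not exhaustive validators, and classify_accession is a made-up helper name.

```shell
# Classify an accession number by its prefix (patterns follow the table above).
classify_accession() {
  case "$1" in
    NM_[0-9]*)   echo "NCBI RefSeq mRNA" ;;
    NP_[0-9]*)   echo "NCBI RefSeq protein" ;;
    ENSG[0-9]*)  echo "Ensembl gene" ;;
    ENST[0-9]*)  echo "Ensembl transcript" ;;
    SRR[0-9]*)   echo "NCBI SRA run" ;;
    GSE[0-9]*)   echo "NCBI GEO series" ;;
    [PQ][0-9]*)  echo "UniProt Swiss-Prot protein" ;;
    *)           echo "Unknown format" ;;
  esac
}

classify_accession ENSG00000012048   # Ensembl gene
classify_accession SRR1234567        # NCBI SRA run
```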

How Databases Link to Each Other

Databases are extensively cross-referenced. A gene page on NCBI Gene links to its RefSeq transcript (NM_), RefSeq protein (NP_), UniProt entry, Ensembl ID, associated GEO datasets, ClinVar variants, and PDB structures. Similarly, an Ensembl gene page links out to NCBI, UniProt, KEGG, and more. This interconnection means you can start from any database and navigate to the information you need.

NCBI E-utilities API

For programmatic access, NCBI provides the E-utilities API — a set of URL-based tools for searching and retrieving records. This is essential for automating data retrieval in scripts.

# Install E-utilities via conda
mamba install -c bioconda entrez-direct

# Search PubMed for articles about BRCA1 and RNA-seq
esearch -db pubmed -query "BRCA1 AND RNA-seq" | efetch -format abstract | head -50

# Fetch a nucleotide sequence in FASTA format
efetch -db nucleotide -id NM_007294 -format fasta > BRCA1_mRNA.fa

# Search SRA for human RNA-seq datasets
esearch -db sra -query '"Homo sapiens"[Organism] AND "RNA-Seq"[Strategy]' | efetch -format runinfo | head

How to Cite Databases

💡
Always cite the databases you use in your Methods section. Each database publishes a yearly "Database Issue" paper in Nucleic Acids Research. For example: "Reference genome and annotation were obtained from Ensembl release 111 (Cunningham et al., 2024). Raw sequencing data were downloaded from the NCBI Sequence Read Archive (accession SRP123456)."
📝
Key Takeaways:
• NCBI is the starting point for most searches; Ensembl is best for genome downloads; UniProt for protein information
• Learn to recognise accession number formats (NM_, ENSG, P#####, SRR) to identify the source database
• Databases are cross-linked — navigate from gene to transcript to protein to structure to pathway
• Use NCBI E-utilities for programmatic access in automated pipelines
• Always cite the databases and their version/release numbers in publications
3.2 Essential File Formats Explained

Each step of a bioinformatics analysis produces a different file format. Recognising these formats, understanding their structure, and knowing how to convert between them is essential. Misinterpreting a file format is one of the most common causes of analysis errors.

FASTA (.fa / .fasta)

The simplest sequence format. Each entry has a header line starting with > followed by one or more lines of sequence. Used for reference genomes, transcriptomes, and protein sequences.

# Example FASTA file
>ENST00000357654.9 BRCA1-201 cds chromosome:GRCh38:17:43044295:43125483:-1
ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGCTATGCAGAAA
ATCTTAGAGTGTCCCATCTGTCTGGAGTTGATCAAGGAACCTGTCTCCACAAAGTGTGAC
...
💡
Useful one-liners: grep -c "^>" file.fa counts the number of sequences. samtools faidx genome.fa creates an index (.fai) needed by many tools. The index file lists each sequence name, length, offset, line bases, and line bytes.

FASTQ (.fastq / .fastq.gz)

Raw sequencing reads. Every read is exactly 4 lines. FASTQ files are almost always gzip-compressed (.fastq.gz) to save space.

# Example FASTQ entry (4 lines per read)
@SRR1234567.1 1 length=150                      # Line 1: Header (starts with @)
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG    # Line 2: Sequence
+                                               # Line 3: Separator (always +)
IIIIIIIIIIIIIHHHHHGGGGFFFDDDBBBB@@@@>>>555222   # Line 4: Quality scores (ASCII)

# Quality scores: I=Q40(99.99%), F=Q37, 5=Q20(99%), !=Q0

# Count reads: divide total lines by 4
echo $(( $(zcat sample.fastq.gz | wc -l) / 4 ))
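The quality encoding can be decoded by hand: in Phred+33 (standard Illumina), a character's quality score is its ASCII value minus 33, and the error probability is 10^(-Q/10). A small sketch (phred_q is a made-up helper for illustration):

```shell
# Decode a Phred+33 quality character: Q = ASCII value - 33
phred_q() {
  echo $(( $(printf '%d' "'$1") - 33 ))
}

phred_q "I"   # 40 -> 1 in 10,000 error chance
phred_q "5"   # 20 -> 1 in 100 error chance
phred_q "!"   # 0  -> no confidence

# Convert a Q score to an error probability
awk -v q=30 'BEGIN {printf "Q%d error prob = %.4f\n", q, 10^(-q/10)}'
# Q30 error prob = 0.0010
```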

SAM / BAM / CRAM

Alignment files. SAM is human-readable text; BAM is the compressed binary equivalent (5-10x smaller); CRAM compresses even further by referencing the genome. Always store BAM, never SAM.

# SAM format (tab-separated, 11 mandatory columns)
# QNAME    FLAG  RNAME  POS       MAPQ  CIGAR  RNEXT  PNEXT     TLEN  SEQ          QUAL
SRR123.1   99    chr17  43044295  60    150M   =      43044450  305   ATCGATCG...  IIIIIIII...

# Key samtools commands
samtools view -bS aligned.sam > aligned.bam   # SAM to BAM
samtools sort aligned.bam -o sorted.bam       # Sort by coordinate
samtools index sorted.bam                     # Create .bai index
samtools flagstat sorted.bam                  # Quick alignment stats

GTF / GFF3

Gene annotation files define where genes, exons, transcripts, and other features are located in the genome. Essential for gene counting (featureCounts) and transcript quantification.

# GTF format: 9 tab-separated columns
# chr   source   feature     start     end       score  strand  frame  attributes
chr17   ensembl  gene        43044295  43125483  .      -       .      gene_id "ENSG00000012048"; gene_name "BRCA1";
chr17   ensembl  transcript  43044295  43125483  .      -       .      gene_id "ENSG00000012048"; transcript_id "ENST00000357654";
chr17   ensembl  exon        43124017  43125483  .      -       .      gene_id "ENSG00000012048"; exon_number "1";

VCF (.vcf / .vcf.gz)

Variant Call Format — the standard for storing genetic variants. Covered in detail in Module 6.

BED (.bed)

Simple genomic interval format. Minimum 3 columns: chrom, start (0-based), end. Used for ChIP-seq peaks, exon coordinates, target regions, blacklisted regions.

⚠️
Coordinate systems matter! BED uses 0-based, half-open coordinates (start is 0-indexed, end is exclusive). GTF/GFF and VCF use 1-based, inclusive coordinates. A region at chr1:100-200 in BED is the same as chr1:101-200 in GTF. Mixing these up is a very common source of off-by-one errors.
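The off-by-one conversion described above can be sketched with awk. The interval below is a made-up demo region, not real annotation:

```shell
# Demo BED line: the region chr1:100-200 in BED coordinates (0-based, half-open)
printf 'chr1\t100\t200\tdemo_region\n' > demo.bed

# BED -> 1-based inclusive (GTF/VCF style): add 1 to the start, keep the end
awk 'BEGIN{OFS="\t"} {print $1, $2 + 1, $3, $4}' demo.bed
# chr1    101     200     demo_region

# 1-based inclusive -> BED: subtract 1 from the start, keep the end
printf 'chr1\t101\t200\n' | awk 'BEGIN{OFS="\t"} {print $1, $2 - 1, $3}'
# chr1    100     200
```

Note that the end coordinate never changes: BED's exclusive end and GTF's inclusive end land on the same number, which is exactly why the off-by-one error is so easy to make.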

Converting Between Formats

# SAM <-> BAM <-> CRAM
samtools view -bS input.sam > output.bam          # SAM to BAM
samtools view -h input.bam > output.sam           # BAM to SAM
samtools view -C -T ref.fa input.bam > out.cram   # BAM to CRAM

# GTF <-> BED (extract gene coordinates)
awk '$3=="gene"' genes.gtf | awk '{print $1"\t"$4-1"\t"$5"\t"$10}' | tr -d '";' > genes.bed

# BED operations with bedtools
bedtools intersect -a peaks.bed -b genes.bed      # Overlapping regions
bedtools getfasta -fi genome.fa -bed regions.bed  # Extract sequences

# Validate file integrity
samtools quickcheck sorted.bam && echo "OK" || echo "CORRUPTED"
md5sum file.fastq.gz > file.md5   # Generate checksum
md5sum -c file.md5                # Verify checksum
📝
Key Takeaways:
• FASTA = sequences, FASTQ = sequences + quality, BAM = aligned reads, GTF = gene annotations, VCF = variants, BED = intervals
• Always store alignments as BAM (or CRAM), never SAM — the space savings are enormous
• Watch out for coordinate systems: BED is 0-based half-open; GTF and VCF are 1-based inclusive
• Use samtools and bedtools for format conversion and manipulation
• Always verify file integrity with checksums (md5sum) after transfers
3.3 Downloading Data from SRA & Ensembl

Most bioinformatics projects begin by downloading data — either raw sequencing data from public repositories (to reproduce or extend a study) or reference genomes and annotations for alignment. This section covers how to find, download, and verify data efficiently.

Finding SRA Accession Numbers from a Paper

Published papers deposit raw data in public repositories. To find the accession number:

1. Check the paper's Data Availability section — it will list a GEO accession (GSE######) or BioProject (PRJNA######).
2. Go to NCBI GEO and search for the GSE number.
3. Click "SRA Run Selector" at the bottom of the GEO page.
4. This gives you a table of all SRR accessions and their metadata. Download the Accession List text file.
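Once you have the Run Selector metadata table, the accession list is just the Run column of the CSV. A sketch (the two data rows below are made-up placeholders standing in for real metadata):

```shell
# Hypothetical Run Selector / runinfo export
cat > runinfo_demo.csv <<'EOF'
Run,LibraryStrategy,Organism
SRR0000001,RNA-Seq,Homo sapiens
SRR0000002,RNA-Seq,Homo sapiens
EOF

# Skip the header, keep column 1 (Run): one accession per line
tail -n +2 runinfo_demo.csv | cut -d',' -f1 > SRR_Acc_List.txt
cat SRR_Acc_List.txt
# SRR0000001
# SRR0000002
```

The resulting SRR_Acc_List.txt is exactly the input format expected by the batch fasterq-dump loop in the next code block.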

Downloading from NCBI SRA

# Install SRA toolkit
mamba install -c bioconda sra-tools

# Download single-end data
fasterq-dump SRR1234567 --outdir ./data/raw --progress

# Download paired-end data (creates _1.fastq and _2.fastq)
fasterq-dump SRR1234567 --outdir ./data/raw --split-files --progress

# Compress immediately (fasterq-dump outputs uncompressed!)
gzip ./data/raw/SRR1234567*.fastq

# Batch download from accession list
cat SRR_Acc_List.txt | parallel -j 4 "fasterq-dump {} --outdir ./data/raw --split-files && gzip ./data/raw/{}*.fastq"
⚠️
fasterq-dump outputs uncompressed FASTQ! A single sample can be 20-50 GB uncompressed. Always gzip immediately after download. Also, fasterq-dump uses a cache in ~/ncbi/ which can fill your home directory — set vdb-config --set /repository/user/cache-disabled=true or use --temp /scratch/tmp.

Faster Downloads with Aspera and ENA

Aspera is a high-speed transfer protocol that can be 10-50x faster than wget/ftp. NCBI and ENA both support it. The European Nucleotide Archive (ENA) is an alternative to SRA that is often faster, especially if you are in Europe, and provides pre-compressed FASTQ files directly.

# Download from ENA (often faster, provides .fastq.gz directly)
# Search: https://www.ebi.ac.uk/ena/browser/
# ENA URLs follow a predictable pattern:
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR123/007/SRR1234567/SRR1234567_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR123/007/SRR1234567/SRR1234567_2.fastq.gz

# Using Aspera for fast downloads
mamba install -c bioconda aspera-cli
ascp -QT -l 300m -P 33001 -i ~/.aspera/cli/etc/asperaweb_id_dsa.openssh \
  era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR123/007/SRR1234567/SRR1234567_1.fastq.gz \
  ./data/raw/

Downloading Reference Genomes from Ensembl

# Human genome (GRCh38) — use primary_assembly (no alt/patch contigs)
wget https://ftp.ensembl.org/pub/release-111/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

# Human GTF annotation (matching release!)
wget https://ftp.ensembl.org/pub/release-111/gtf/homo_sapiens/Homo_sapiens.GRCh38.111.gtf.gz

# Mouse (GRCm39)
wget https://ftp.ensembl.org/pub/release-111/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
wget https://ftp.ensembl.org/pub/release-111/gtf/mus_musculus/Mus_musculus.GRCm39.111.gtf.gz

# Decompress
gunzip *.fa.gz *.gtf.gz
💡
Always use primary_assembly (not top_level) to avoid alt contigs and patches that can cause ambiguous alignments. And always ensure your genome FASTA and GTF come from the same Ensembl release — mismatched versions cause chromosome naming mismatches and counting errors.

Verifying Downloaded Data

# Verify file integrity after download
wget https://ftp.ensembl.org/pub/release-111/fasta/homo_sapiens/dna/CHECKSUMS
# Note: Ensembl's CHECKSUMS files are produced with the Unix `sum` command
sum Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# Compare the output (checksum and block count) with the CHECKSUMS entry;
# for providers that publish md5 hashes, use md5sum instead

# Quick sanity checks for FASTQ files
zcat sample_R1.fastq.gz | head -8                   # Check first 2 reads look normal
echo $(( $(zcat sample_R1.fastq.gz | wc -l) / 4 ))  # Count total reads

# Check that R1 and R2 have the same number of reads
echo "R1: $(( $(zcat sample_R1.fastq.gz | wc -l) / 4 )) reads"
echo "R2: $(( $(zcat sample_R2.fastq.gz | wc -l) / 4 )) reads"

Organising Your Data

# Recommended project data structure
mkdir -p project/{data/raw,data/trimmed,data/aligned,ref,results,scripts,logs}

# Rename SRR files to meaningful sample names using a mapping file
# mapping.txt:  SRR1234567  control_rep1
while read srr name; do
  mv data/raw/${srr}_1.fastq.gz data/raw/${name}_R1.fastq.gz
  mv data/raw/${srr}_2.fastq.gz data/raw/${name}_R2.fastq.gz
done < mapping.txt
📝
Key Takeaways:
• Find SRR accessions via GEO Run Selector (from the paper's Data Availability section)
• fasterq-dump outputs uncompressed files — always gzip immediately to save disk space
• ENA (European) often provides faster downloads with pre-compressed FASTQ files
• Always verify downloads against the provider's checksum file and check that R1/R2 files have matching read counts
• Keep genome FASTA and GTF from the same Ensembl release to avoid naming mismatches
Module 4 · Sequence Analysis

BLAST, Alignment & Sequence Comparison

Sequence comparison is the backbone of bioinformatics. Whether finding a gene's function, aligning reads, or building a phylogenetic tree — everything starts with comparing sequences.

4.1 BLAST — Finding Similar Sequences

BLAST (Basic Local Alignment Search Tool) is the single most important tool in bioinformatics. Developed at NCBI in 1990, it compares a query sequence against a database and returns statistically significant matches. Understand it deeply — you will use it almost every day.

How BLAST Works (The Algorithm)

BLAST does NOT compare every position. Instead it uses a three-step heuristic: (1) Word finding — breaks your query into short words (default 11 bp for blastn, 3 aa for blastp) and finds exact matches in the database. (2) Extension — extends each word match left and right, tracking a running score. Stops when the score drops significantly. (3) Evaluation — calculates E-value for each high-scoring segment pair (HSP). This heuristic makes BLAST fast — but it can miss very divergent homologues.
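Step 1 is easy to make concrete with a toy word-extraction sketch (the 16 bp query is invented; real queries yield thousands of words):

```shell
# Enumerate a query's words: blastn's default word size is 11
query="ACGTACGTTAGCCGAT"   # toy 16 bp query
w=11
: > words.txt
for (( i=0; i <= ${#query} - w; i++ )); do
  echo "${query:i:w}" >> words.txt   # each overlapping 11-mer is a seed
done
cat words.txt   # 16 - 11 + 1 = 6 words
```

Each of these seeds is looked up in the database; only seed matches get the (expensive) extension step, which is why BLAST is fast and why lowering the word size increases sensitivity at a speed cost.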

The Five BLAST Programs

blastn

Nucleotide → nucleotide. Use for: verifying a cloned sequence, finding a gene in a new genome. Fast but only finds close matches. For very similar sequences use megablast (default on NCBI web).

blastp

Protein → protein. Much more sensitive than blastn for detecting distant homologues — amino acid conservation is stronger than nucleotide. Uses substitution matrices (BLOSUM62 default).

blastx

Translates nucleotide query in all 6 reading frames → searches protein database. Use when you have an unknown DNA sequence and want to find the protein it encodes.

tblastn

Protein query → translated nucleotide database. Use for finding a gene in an unannotated genome — you have a known protein and want to find its genomic location.

PSI-BLAST

Position-Specific Iterated BLAST. Iteratively builds a profile from initial hits and re-searches. Much more sensitive for distant homology. Use for: finding remote protein family members.

DIAMOND

Not BLAST, but a modern alternative. 500–20,000× faster for blastx/blastp searches. Essential for metagenomics where you BLAST millions of reads. mamba install -c bioconda diamond

Understanding BLAST Output

⚠️
E-value is NOT a p-value. E-value = the number of hits with a score this good that you would expect by random chance in a database of this size. An E-value of 0.001 means one such chance hit in every 1,000 searches. E-value < 1e-5 is generally significant. E-value depends on database size — the same alignment has a different E-value in a small vs a large database, so only compare E-values from the same search.

Key output fields in tabular format (-outfmt 6):

pident — % identity (how similar)
length — alignment length
evalue — statistical significance
bitscore — alignment quality (higher = better, database-independent)
qcovs — % of query covered by the alignment
sseqid — subject (database hit) ID
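Because -outfmt 6 is plain tab-separated text, hits can be filtered with standard Unix tools. A sketch using a mocked-up, trimmed-down results table (columns and thresholds here are illustrative):

```shell
# Mocked-up tabular BLAST output
# (columns here: qseqid sseqid pident length evalue bitscore qcovs)
printf 'gene1\thitA\t98.5\t500\t1e-180\t900\t95\n'  > results.tsv
printf 'gene1\thitB\t35.2\t120\t2e-03\t48\t20\n'   >> results.tsv
printf 'gene2\thitC\t88.0\t300\t1e-90\t550\t80\n'  >> results.tsv

# Keep hits with E-value < 1e-5 AND query coverage >= 50%
awk '$5 + 0 < 1e-5 && $7 + 0 >= 50' results.tsv > filtered.tsv
cat filtered.tsv
```

hitB is removed on both counts: its E-value (2e-03) is too weak and only 20% of the query is covered. Filtering on coverage as well as E-value is what catches short, meaningless local matches.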

Complete BLAST Workflow

# Install BLAST
mamba install -c bioconda blast

# 1. Build a local BLAST database
makeblastdb -in genome.fa -dbtype nucl -out blastdb/genome
makeblastdb -in proteins.fa -dbtype prot -out blastdb/proteome

# 2. Run blastn with useful output format
blastn -query unknown_gene.fa -db blastdb/genome \
  -out results.tsv \
  -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs" \
  -evalue 1e-10 -max_target_seqs 5 -num_threads 8

# 3. Run blastp for protein function annotation
blastp -query my_protein.fa -db nr -remote \
  -out results_nr.tsv -outfmt 6 -evalue 1e-5

# 4. Run blastx to find protein encoded by a DNA sequence
blastx -query unknown_cdna.fa -db blastdb/proteome \
  -out blastx_results.tsv -outfmt 6 -evalue 1e-5

# 5. Reciprocal Best Hit (RBH) — the gold standard for orthology
# Run BLAST in both directions, keep pairs that are best hits in BOTH
blastp -query speciesA.fa -db speciesB_db -outfmt 6 -evalue 1e-10 > A_vs_B.tsv
blastp -query speciesB.fa -db speciesA_db -outfmt 6 -evalue 1e-10 > B_vs_A.tsv
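The workflow above stops after running BLAST in both directions; the actual reciprocal filtering can be sketched with sort and awk. The three-column tables below are invented miniatures of real best-hit extracts:

```shell
# Invented miniature best-hit extracts (query, subject, bitscore)
cat > A_vs_B.best <<'EOF'
geneA1 geneB1 500
geneA1 geneB9 90
geneA2 geneB2 300
EOF
cat > B_vs_A.best <<'EOF'
geneB1 geneA1 480
geneB2 geneA7 120
EOF

# Best hit per query = the row with the highest bitscore
best_hits() { sort -k1,1 -k3,3nr "$1" | awk '!seen[$1]++ { print $1, $2 }'; }
best_hits A_vs_B.best > bestAB.txt
best_hits B_vs_A.best > bestBA.txt

# Keep only pairs that are each other's best hit
awk 'NR==FNR { best[$1]=$2; next } best[$2]==$1 { print $1, $2 }' \
    bestBA.txt bestAB.txt > rbh_pairs.txt
cat rbh_pairs.txt
```

Only geneA1/geneB1 survives: geneA2's best hit is geneB2, but geneB2's best hit points back to geneA7, so the pair is not reciprocal and is dropped.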
💡
When BLAST isn't enough: BLAST misses very divergent homologues. For sensitive remote homology detection, use HMMER (profile-HMM searches — much more sensitive than PSI-BLAST) or HHblits (hidden Markov model vs HMM database). These tools are standard for protein domain annotation.

Key Takeaways

• E-value < 1e-5 for significance
• blastp is more sensitive than blastn
• Always check query coverage, not just E-value
• Use DIAMOND for large-scale searches
• Reciprocal Best Hit for orthology
4.2 Multiple Sequence Alignment (MSA)

Multiple Sequence Alignment (MSA) arranges three or more biological sequences so that corresponding positions (homologous residues) line up in columns. This reveals evolutionary conservation: columns where all sequences have the same residue are under strong selective pressure — they are functionally important. MSA is the foundation for phylogenetics, protein domain identification, and functional annotation.

How MSA Algorithms Work

An exact optimal MSA is computationally infeasible for more than ~10 sequences (NP-hard problem). All practical tools use heuristics:

Progressive (ClustalW, MUSCLE v3)

Build a guide tree from pairwise distances, then align sequences in order of relatedness. Fast but early errors propagate.

Iterative (MUSCLE v5, MAFFT)

Align progressively, then refine by re-aligning subgroups. Better accuracy. MAFFT L-INS-i is the gold standard for accuracy.

Consistency-based (T-Coffee)

Uses all pairwise alignments to guide the MSA. Very accurate for small sets. Too slow for >200 sequences.

Gap Penalties — Why They Matter

Gaps represent insertions or deletions (indels) in evolution. The gap opening penalty controls how reluctant the aligner is to insert a gap. The gap extension penalty controls how costly it is to make an existing gap longer. High opening + low extension → fewer, longer gaps (appropriate for eukaryotic proteins). Low opening → many small gaps (may over-fragment the alignment).

Which Tool to Use?

MAFFT

Best overall choice. Use --auto to let MAFFT choose strategy. For <200 sequences: mafft --maxiterate 1000 --localpair (L-INS-i, most accurate). For >10,000 sequences: mafft --retree 1.

Recommended
MUSCLE v5

Excellent accuracy, competitive with MAFFT. Newer "super5" algorithm scales well. Use muscle -super5 input.fa -output aligned.fa for large datasets.

# ─── Complete MSA workflow ───
mamba install -c bioconda mafft muscle trimal

# MAFFT — automatic mode (recommended start)
mafft --auto sequences.fa > aligned.fa

# MAFFT — maximum accuracy (for <200 seqs, publication-quality)
mafft --maxiterate 1000 --localpair sequences.fa > aligned_accurate.fa

# MUSCLE v5 — large datasets
muscle -super5 sequences.fa -output aligned.fa

# ─── Trim poorly aligned regions ───
# ALWAYS trim before building phylogenetic trees
trimal -in aligned.fa -out trimmed.fa -automated1

# Check alignment stats
trimal -in aligned.fa -sgc     # Show column conservation scores
trimal -in aligned.fa -sident  # Show sequence identity matrix

# ─── Visualise alignment ───
# Web: Jalview (jalview.org) — colour by conservation
# Web: MView (ebi.ac.uk/Tools/msa/mview/) — quick coloured view
# Desktop: AliView — fast lightweight viewer
⚠️
Common MSA mistakes: (1) Including too-divergent sequences — they add noise, not signal. Remove sequences with <20% identity. (2) Not trimming before phylogenetics — gappy regions create artefactual branches. (3) Using ClustalW — it is outdated and less accurate than MAFFT/MUSCLE. Use modern tools.

Key Takeaways

• MAFFT --auto for most cases
• Always trim before tree building
• L-INS-i for best accuracy (<200 seqs)
• Visualise with Jalview or AliView
• Conserved columns = functional importance
4.3 Phylogenetic Trees — Evolutionary Analysis

A phylogenetic tree is a branching diagram that represents the evolutionary relationships among biological sequences or species. Trees are constructed from multiple sequence alignments and use mathematical models to infer how sequences have diverged over time. Understanding how to build, read, and interpret phylogenetic trees is a core skill in evolutionary biology, epidemiology, and comparative genomics.

Reading a Phylogenetic Tree

Every tree has key structural elements. The root represents the most recent common ancestor. Internal nodes represent inferred ancestral sequences (speciation or divergence events). Tips (also called leaves) are the extant sequences in your analysis. Branch length represents the amount of evolutionary change — typically measured in substitutions per site. A branch of length 0.05 means 5 nucleotide substitutions per 100 sites have occurred along that branch.
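Trees are stored as Newick strings, and the anatomy described above maps directly onto the notation. A toy example (the tree and branch lengths are invented):

```shell
# A toy Newick tree: four tips with branch lengths (substitutions/site);
# D plays the outgroup role
echo "((A:0.05,B:0.07):0.02,(C:0.10,D:0.30):0.01);" > toy.treefile

grep -oE '[A-Za-z_]+:' toy.treefile | tr -d ':' > tips.txt   # tip names
grep -oE ':[0-9.]+'    toy.treefile | tr -d ':' > lens.txt   # branch lengths
cat tips.txt
cat lens.txt
```

Four tips, six branches (four terminal plus two internal): the long branch to D (0.30) means more change along that lineage, not necessarily more time.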

Tree-Building Methods

Neighbour-Joining (NJ)

Fast, distance-based heuristic. Computes a pairwise distance matrix and groups the closest neighbours iteratively. Good for quick exploratory trees with many sequences. Available in MEGA and the Clustal family; FastTree offers a similarly fast approximate maximum-likelihood alternative.

Maximum Likelihood (ML)

Statistical optimisation approach that evaluates the probability of the observed alignment given a tree topology and substitution model. Best method for publication-quality trees. Use IQ-TREE2 or RAxML-NG.

Bayesian Inference

Probabilistic framework using MCMC sampling to explore tree space. Produces posterior probability support values. MrBayes for phylogenetics, BEAST2 for molecular clock and divergence time estimation. Computationally expensive but powerful.

When to use which?

NJ: quick-and-dirty exploration (>500 sequences). ML: standard research publication. Bayesian: when you need divergence time estimates or posterior probabilities. For most projects, ML with IQ-TREE2 is the right choice.

Rooting a Tree with an Outgroup

An unrooted tree shows relationships but not the direction of evolution. To root a tree, include an outgroup — a sequence known to be more distantly related than any ingroup member. For example, when building a tree of mammalian species, include a reptile as the outgroup. The root is placed on the branch leading to the outgroup, which polarises all other relationships.

Bootstrap Support Interpretation

Bootstrap values assess the reliability of each node. The alignment is resampled with replacement many times (typically 1000), and each resampled dataset produces a new tree. The bootstrap value is the percentage of times a given clade appears across all resampled trees.

>90% — Strongly supported

High confidence in this branching pattern. Reliable for biological conclusions.

70–90% — Moderately supported

Generally considered acceptable. Report with caution.

50–70% — Weakly supported

Unreliable. The relationship may change with more data or different methods.

<50% — No support

The branching pattern is essentially random at this node. Do not interpret.

Substitution Models

Substitution models describe how nucleotides or amino acids change over evolutionary time. Choosing the right model is critical for accurate tree inference.

DNA: JC69

Simplest model. Assumes all substitutions are equally likely and all bases have equal frequency. Rarely realistic but useful as a baseline.

DNA: GTR (General Time Reversible)

Most general model. Six substitution rate parameters and four base frequency parameters. Usually the best fit for real data. Often combined with +G (gamma rate variation) and +I (invariant sites).

Protein: WAG

Empirical amino acid substitution model based on globular protein alignments. Good general-purpose choice for protein trees.

Protein: LG

Newer and generally better-fitting than WAG. Estimated from a larger, more diverse set of protein alignments. Recommended as the default for protein phylogenetics.

💡
Model selection: Use IQ-TREE2's -m TEST flag to run automatic model selection, which tests candidate substitution models and picks the best one by BIC (-m MFP extends the search to the full ModelFinder model set). Never guess — always let the software choose.

Complete IQ-TREE2 Workflow

# Install IQ-TREE2
conda install -c bioconda iqtree

# Step 1: Automatic model selection + ML tree + bootstrap
iqtree2 -s trimmed_alignment.fa \
  -m TEST \
  -B 1000 \
  -T AUTO \
  --prefix my_tree

# -m TEST: ModelFinder — tests all substitution models
# -B 1000: 1000 ultrafast bootstrap replicates
# -T AUTO: use optimal number of CPU threads

# Output files:
#   my_tree.treefile — the best ML tree (Newick format)
#   my_tree.iqtree   — full report with model parameters
#   my_tree.log      — run log

# Step 2: For more rigorous support, use SH-aLRT + ultrafast bootstrap
iqtree2 -s trimmed_alignment.fa -m TEST -B 1000 -alrt 1000 -T AUTO

# Step 3: For protein alignments
iqtree2 -s protein_alignment.fa -m TEST -B 1000 -T AUTO -st AA

Tree Visualisation

# Option 1: FigTree (desktop application)
#   Download from: https://github.com/rambaut/figtree/releases
#   Open my_tree.treefile in FigTree
#   Display bootstrap values: Node Labels → Display → label
#   Colour branches by clade, adjust branch width, export as PDF/SVG

# Option 2: iTOL (Interactive Tree Of Life — web-based)
#   Upload my_tree.treefile to https://itol.embl.de
#   Supports annotation files for metadata (colours, bars, heatmaps)
#   Export publication-quality figures in SVG/PDF/PNG

# Option 3: ggtree in R (programmatic, highly customisable)
#   library(ggtree)
#   tree <- read.tree("my_tree.treefile")
#   ggtree(tree) + geom_tiplab() + geom_nodelab(aes(label=label))
⚠️
Common mistake: Interpreting branch length as time. Branch lengths represent substitutions per site unless you explicitly calibrate a molecular clock (e.g., with BEAST2). Two long branches do not mean two species diverged long ago — they could simply have evolved rapidly.
• Root = common ancestor
• Branch length = substitutions/site
• Bootstrap >70 = supported
• GTR+G = standard DNA model
• IQ-TREE2 -m TEST = auto model
• FigTree / iTOL for visualisation
• Always include an outgroup
Module 5 · Transcriptomics

RNA-seq Analysis — From Raw Reads to Gene Lists

RNA sequencing (RNA-seq) measures gene expression across the entire transcriptome. This module takes you through every step of the pipeline with detailed explanations of what each tool does and why.

5.1 Experimental Design — The Most Critical Step

No amount of computational analysis can rescue a poorly designed experiment. Experimental design decisions — replication, randomisation, library preparation — are made before any sequencing occurs and cannot be corrected downstream. This section covers the critical choices that determine whether your RNA-seq experiment will produce reliable, publishable results or generate data that no statistical method can salvage.

Statistical Power: How Many Replicates Do You Need?

Statistical power is the probability of detecting a true difference when one exists. Underpowered studies miss real biology; overpowered studies waste resources. Use the RNASeqPower R package to estimate sample size before your experiment.

# Power analysis with RNASeqPower
library(RNASeqPower)

# How many replicates to detect 2-fold change (log2FC=1)?
rnapower(depth=20, cv=0.4, effect=1, alpha=0.05, power=0.8)
# Result: ~3 replicates per group

# What about detecting smaller changes (1.5-fold, log2FC=0.58)?
rnapower(depth=20, cv=0.4, effect=0.58, alpha=0.05, power=0.8)
# Result: ~6 replicates per group

# Parameters:
#   depth  = sequencing depth in millions of reads
#   cv     = biological coefficient of variation (0.1–0.4 typical)
#   effect = log2 fold change you want to detect
#   alpha  = significance threshold (0.05)
#   power  = desired power (0.8 = 80%)
💡
Rule of thumb: 3 replicates can detect large changes (2-fold). To detect subtle but biologically important changes (1.5-fold), you need 6+ replicates. More replicates always outperform deeper sequencing for differential expression.

Biological vs Technical Replicates

Biological replicates

Independent biological samples (different mice, different cell culture flasks, different patients). These capture true biological variability and are essential for statistical inference. Minimum 3 per condition, ideally 5–6.

Technical replicates

The same sample sequenced multiple times, or the same library prepared twice. These only measure technical noise (which is very low in modern RNA-seq). Technical replicates do NOT substitute for biological replicates and are generally a waste of sequencing budget.

Randomisation and Blocking

Batch effects are systematic non-biological variations caused by processing samples at different times, on different machines, or by different technicians. They can completely confound your results if not properly handled.

⚠️
Critical mistake: Never process all treatment samples on Monday and all control samples on Tuesday. If you do, you cannot distinguish the treatment effect from the day-of-processing effect. Instead, balance conditions across batches: process 2 treatment + 2 control samples on each day.

Use a randomised block design: assign samples to batches randomly while ensuring each batch contains samples from every condition. Record all batch information (date, technician, lane, flowcell) in your metadata — you can correct for known batches computationally, but only if you recorded them.
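The balancing logic can be sketched in a few lines of shell: deal each condition's samples round-robin across the batches. The sample sheet below is invented; for a real design, shuffle within each condition first (e.g. with shuf):

```shell
# Invented sample sheet: sample, condition
cat > samples.txt <<'EOF'
ctrl_1 control
ctrl_2 control
ctrl_3 control
ctrl_4 control
trt_1 treatment
trt_2 treatment
trt_3 treatment
trt_4 treatment
EOF

# Deal each condition's samples round-robin across 2 batches, so every
# batch contains samples from every condition
sort -k2,2 samples.txt \
  | awk -v n=2 '{ b = seq[$2]++ % n + 1; print $1, $2, "batch" b }' \
  > batches.txt
cat batches.txt
```

Each batch ends up with 2 control + 2 treatment samples, so a batch effect can never masquerade as a treatment effect.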

RNA Quality

RNA integrity is measured by the RIN (RNA Integrity Number), scored 1–10 by a Bioanalyzer or TapeStation. Degraded RNA introduces systematic 3' bias because fragmented transcripts lose their 5' ends preferentially.

RIN > 7

Good quality. Suitable for standard RNA-seq with poly-A selection. This is the minimum for most experiments.

RIN 4–7

Partially degraded. Use rRNA depletion instead of poly-A selection. Consider FFPE-optimised protocols if working with clinical samples.

Library Preparation Choices

Poly-A selection vs rRNA depletion

Poly-A captures only mRNA (coding genes). Cheaper, cleaner signal. rRNA depletion captures mRNA + lncRNA + other non-coding RNA. Use rRNA depletion for degraded samples or when non-coding RNA is of interest.

Single-end (SE) vs Paired-end (PE)

SE reads one end of each fragment. PE reads both ends. PE is recommended for: novel isoform discovery, structural variant detection, and improved alignment to repetitive regions. For standard differential expression, SE 50bp is often sufficient and cheaper.

Paired vs unpaired experimental design

If you measure the same patient before and after treatment, use a paired design (~ patient + treatment in DESeq2). Paired designs are more powerful because they control for inter-individual variation.

Multiplexing & barcoding

Multiple samples pooled on one lane with unique index barcodes. Reduces cost and minimises lane-to-lane batch effects. Ensure all conditions are represented on each lane. Use dual-index barcodes to prevent index hopping on patterned flowcells.

📌
Sequencing depth guidelines: Differential expression: 20–30M reads/sample. Rare transcripts or lncRNA: 50–100M. Allele-specific expression: 100M+. For most bulk RNA-seq DE experiments, 25M PE 150bp reads is a good target.
• Biological reps > technical reps
• 3 reps minimum, 6 ideal
• Balance batches across conditions
• RIN > 7 for poly-A
• PE for isoform discovery
• Use RNASeqPower for sample size
• Record ALL batch metadata
5.2 Quality Control — FastQC & Trimming

Quality control is the first analytical step in any sequencing project. Before aligning reads, you must assess their quality to identify problems such as adapter contamination, low-quality bases, and library preparation artefacts. FastQC generates a per-sample report, MultiQC aggregates all reports for easy comparison, and trimming tools remove problematic sequences. Skipping QC risks propagating errors through the entire downstream analysis.

Running FastQC and MultiQC

# Run FastQC on all samples (parallelised)
mkdir -p qc_reports
fastqc raw_data/*.fastq.gz -o qc_reports/ -t 8

# Aggregate all FastQC reports with MultiQC
pip install multiqc
multiqc qc_reports/ -o qc_reports/multiqc_report

# Open the MultiQC report in your browser:
#   qc_reports/multiqc_report/multiqc_report.html
# Look for: samples with outlier GC content, low quality, high adapter %

The MultiQC report shows all samples side-by-side. Look for outlier samples that differ dramatically from the rest — these may indicate contamination, sample mix-ups, or failed library preparation. The General Statistics table shows total reads, % duplicates, and % GC per sample.

FastQC Modules — What Each Failure Actually Means

Per Base Sequence Quality

Should stay in the green zone (>Q28). Quality commonly drops at the 3' end — this is normal for Illumina chemistry. If quality drops below Q20 before position 50, investigate the sequencing run. Action: Trim low-quality 3' bases if they fall below Q20.

Per Sequence Quality Scores

Distribution of mean quality per read. Should be a single peak above Q30. A bimodal distribution or a shoulder below Q20 means a subset of reads are poor quality. Action: Filter reads with mean quality below Q20.

Per Base Sequence Content (FAIL at pos 1-10)

Non-uniform base composition at the first 10 positions is normal and expected for RNA-seq. It is caused by random hexamer priming bias during cDNA synthesis. Action: No action needed. Do NOT trim these bases.

Adapter Content (FAIL)

Adapter sequences detected in reads. This happens when insert size is shorter than read length. Action: You MUST trim adapters before alignment. Use Trim Galore or fastp.

Sequence Duplication Levels (WARN)

A WARN here is completely normal for RNA-seq. Highly expressed genes (ribosomal proteins, housekeeping genes) naturally produce many identical reads. Only worry if duplication exceeds 80% or if you see it in DNA-seq data. Action: None for RNA-seq.

Overrepresented Sequences

Shows sequences that make up >0.1% of all reads. Common culprits: adapter dimers, rRNA contamination, or highly expressed transcripts. Action: If rRNA is dominant, your rRNA depletion failed. If adapter dimers are present, trim aggressively.

Diagnostic Decision Tree

📌
If Adapter Content FAIL → Trim with Trim Galore or fastp (mandatory).
If Per Base Quality drops below Q20 → Trim low-quality bases from the 3' end.
If GC Content shows bimodal distribution → Suspect contamination from another organism. Run FastQ Screen.
If Overrepresented Sequences shows rRNA → Your rRNA depletion failed. Check remaining samples. Consider computational rRNA removal with SortMeRNA.
If Sequence Duplication WARN → Normal for RNA-seq. Ignore.
If Per Base Content FAIL at positions 1-10 → Normal for random hexamer priming. Ignore.

Trimming Tools Compared

Trim Galore

Wrapper around Cutadapt. Auto-detects adapter type. Simple to use. Good default choice. Slightly slower than fastp.

fastp

All-in-one tool: QC + trimming + filtering in a single pass. Very fast (written in C++). Generates its own HTML report. Increasingly popular as a replacement for FastQC + Trim Galore.

Trimmomatic

Java-based. Flexible with many options. Older tool. Requires specifying adapter sequences manually. Still widely used but fastp is generally preferred now.

When NOT to trim

If using pseudoalignment (Salmon or kallisto), trimming is usually unnecessary — these tools handle adapters and low-quality bases internally. Trimming can even reduce accuracy by creating artificially short fragments.

Trimming Commands

# Option 1: Trim Galore (recommended for beginners)
conda install -c bioconda trim-galore
trim_galore --paired --cores 8 \
  --quality 20 \
  --length 36 \
  sample_R1.fastq.gz sample_R2.fastq.gz \
  --output_dir trimmed/
# --quality 20: trim bases with Q < 20 from 3' end
# --length 36: discard reads shorter than 36bp after trimming

# Option 2: fastp (faster, all-in-one QC + trim)
conda install -c bioconda fastp
fastp -i sample_R1.fastq.gz -I sample_R2.fastq.gz \
  -o trimmed/sample_R1.fastq.gz -O trimmed/sample_R2.fastq.gz \
  --detect_adapter_for_pe \
  --qualified_quality_phred 20 \
  --length_required 36 \
  --html trimmed/sample_fastp.html \
  --thread 8

# Re-run FastQC after trimming to verify improvement
fastqc trimmed/*.fastq.gz -o qc_reports/post_trim/ -t 8
multiqc qc_reports/post_trim/ -o qc_reports/post_trim_multiqc/
⚠️
Always re-run FastQC after trimming to confirm that adapter contamination is gone and quality profiles have improved. If adapter content persists, your adapter sequence may not be in the default database — specify it manually with --adapter.
• FastQC per sample, MultiQC for overview
• Duplication WARN is normal for RNA-seq
• Adapter FAIL = must trim
• fastp = fastest all-in-one
• No trimming needed for Salmon/kallisto
• Re-run QC after trimming
5.3 Alignment with HISAT2 & STAR

Alignment (or mapping) is the process of determining where each sequencing read originated in the reference genome. For RNA-seq, this is complicated by splicing — a single read may span an exon-exon junction, so standard DNA aligners will fail. Splice-aware aligners like HISAT2 and STAR can split a read across introns to find the correct genomic position. The output is a BAM file containing the coordinates and quality of each alignment.

HISAT2 vs STAR

HISAT2

Memory-efficient (~8 GB RAM for human genome), fast, and accurate. Uses a graph-based FM index that incorporates known splice sites and SNPs. Excellent for most organisms. Recommended when RAM is limited or for non-model organisms.

Recommended
STAR

Slightly more accurate, especially for novel splice junctions, but requires ~30 GB RAM for the human genome. Required by some downstream tools (e.g., STAR-Fusion for fusion gene detection). Gold standard in ENCODE and GTEx projects.

Alignment Workflow

# Build HISAT2 genome index (do once per genome)
hisat2-build -p 8 genome.fa genome_index/hisat2_index

# Align paired-end reads
hisat2 -p 8 -x genome_index/hisat2_index \
  -1 sample_R1_trimmed.fastq.gz \
  -2 sample_R2_trimmed.fastq.gz \
  --dta \
  --rna-strandness RF \
  2> sample_hisat2.log \
  | samtools sort -@ 8 -o sample.bam

# Index the BAM file (required for most downstream tools)
samtools index sample.bam

# Check alignment statistics
samtools flagstat sample.bam

SAM/BAM Format Explained

SAM (Sequence Alignment/Map) is a tab-delimited text format. BAM is its compressed binary equivalent. Each alignment line has 11 mandatory fields:

# Example SAM line (one read alignment):
READ001  83  chr1  12345  60  75M200N25M  =  12800  555  ACGT...  FFFF...

# Field breakdown:
#  1 QNAME: READ001     — Read name
#  2 FLAG:  83          — Bitwise flag (see below)
#  3 RNAME: chr1        — Reference chromosome
#  4 POS:   12345       — 1-based leftmost position
#  5 MAPQ:  60          — Mapping quality (60 = uniquely mapped)
#  6 CIGAR: 75M200N25M  — 75bp match, 200bp intron, 25bp match
#  7 RNEXT: =           — Mate maps to same chromosome
#  8 PNEXT: 12800       — Mate position
#  9 TLEN:  555         — Template/fragment length
# 10 SEQ:   ACGT...     — Read sequence
# 11 QUAL:  FFFF...     — Base quality (Phred+33)

FLAG Values Decoded

The FLAG field is a bitwise combination of properties. Common flags:

FLAG 83 (0x53)

= 1 (paired) + 2 (proper pair) + 16 (reverse strand) + 64 (first in pair). A properly paired read on the reverse strand, first mate.

FLAG 163 (0xA3)

= 1 (paired) + 2 (proper pair) + 32 (mate reverse) + 128 (second in pair). A properly paired read on forward strand, second mate.

FLAG 4

= unmapped read. These reads did not align to the reference genome. Filter them with samtools view -F 4.

FLAG 256

= secondary alignment. The read mapped to multiple locations. This is the non-primary alignment. Filter with samtools view -F 256.

💡
Decode any FLAG value at broadinstitute.github.io/picard/explain-flags.html — enter a number and it shows exactly which bits are set.
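You can also decode a FLAG locally with shell arithmetic, since each property is a single bit. A small sketch covering the common bits:

```shell
# Decode a SAM FLAG into its set bits (subset of the SAM spec bit names)
decode_flag() {
  local flag=$1 out=""
  (( flag & 1 ))    && out="$out paired"
  (( flag & 2 ))    && out="$out proper_pair"
  (( flag & 4 ))    && out="$out unmapped"
  (( flag & 8 ))    && out="$out mate_unmapped"
  (( flag & 16 ))   && out="$out reverse_strand"
  (( flag & 32 ))   && out="$out mate_reverse"
  (( flag & 64 ))   && out="$out first_in_pair"
  (( flag & 128 ))  && out="$out second_in_pair"
  (( flag & 256 ))  && out="$out secondary"
  (( flag & 1024 )) && out="$out duplicate"
  echo "${out# }"
}

decode_flag 83    # paired proper_pair reverse_strand first_in_pair
decode_flag 163   # paired proper_pair mate_reverse second_in_pair
```

83 = 1 + 2 + 16 + 64 and 163 = 1 + 2 + 32 + 128, matching the two worked examples above.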

Essential samtools Commands

# Alignment statistics (mapped reads, paired, singletons)
samtools flagstat sample.bam

# Per-chromosome read counts
samtools idxstats sample.bam

# Extract only mapped reads (remove unmapped)
samtools view -F 4 -b sample.bam > sample_mapped.bam

# Extract only properly paired reads
samtools view -f 2 -b sample.bam > sample_proper_pairs.bam

# Mark PCR duplicates (important for DNA-seq, optional for RNA-seq)
# Note: markdup expects input run through `samtools fixmate -m` on a
# name-sorted BAM, then position-sorted
samtools markdup sample.bam sample_markdup.bam

# View reads in a specific region
samtools view sample.bam chr1:1000000-1001000 | head

Interpreting Alignment Rates

💡
>90% overall alignment rate: Excellent. Expected for well-prepared libraries aligned to the correct genome.
75–90%: Acceptable. Some unmapped reads are normal (rRNA, adapter dimers, low-quality reads).
50–75%: Investigate. Possible causes: wrong reference genome, cross-species contamination, significant adapter contamination.
<50%: Serious problem. Check: (1) Are you using the correct genome assembly? (2) Run FastQ Screen to check for contamination from other organisms. (3) Check if adapter trimming was performed.
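These thresholds are easy to check automatically by parsing the aligner's log; the snippet below is a hypothetical miniature of HISAT2's summary format:

```shell
# Hypothetical miniature of a HISAT2 summary log
cat > sample_hisat2.log <<'EOF'
25000000 reads; of these:
  25000000 (100.00%) were paired; of these:
    1450000 (5.80%) aligned concordantly 0 times
94.20% overall alignment rate
EOF

# Extract the rate and flag anything below the 90% "excellent" threshold
rate=$(grep 'overall alignment rate' sample_hisat2.log | grep -oE '[0-9.]+%' | tr -d '%')
echo "overall alignment rate: ${rate}%"
awk -v r="$rate" 'BEGIN { print (r >= 90) ? "OK" : "INVESTIGATE" }'
```

Run across all log files, this gives a one-line per-sample QC summary before you commit to downstream counting.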

Visual Inspection with IGV

Always visually inspect alignments for a few genes of interest using the Integrative Genomics Viewer (IGV). Load your BAM file and the reference genome, then navigate to specific genes. Look for: even coverage across exons, clean splice junctions, and absence of systematic artefacts.

Generating Coverage Tracks (bigWig)

# Generate normalised coverage tracks with deepTools
conda install -c bioconda deeptools

bamCoverage -b sample.bam \
  -o sample.bw \
  --normalizeUsing RPKM \
  --binSize 10 \
  -p 8

# Output: sample.bw (bigWig format)
# Load in IGV or UCSC Genome Browser for visualisation
HISAT2 = low RAM, STAR = higher accuracy · CIGAR string shows splicing (N = intron) · MAPQ 60 = uniquely mapped · samtools flagstat for QC · >90% alignment = good · IGV for visual inspection · bamCoverage for bigWig tracks
5.4 Gene Counting with featureCounts

Gene counting is the process of assigning each aligned read to a gene based on its overlap with annotated gene coordinates in a GTF/GFF file. The result is a count matrix — a table where rows are genes, columns are samples, and values are integer read counts. This matrix is the fundamental input for differential expression analysis with DESeq2 or edgeR. Getting the counting step right is critical because errors here propagate to every downstream result.

How Gene Counting Works

The counting tool examines each aligned read and checks whether it overlaps with an annotated gene feature (usually exons). If a read falls entirely within one gene's exons, it is assigned to that gene. If a read overlaps two genes (an ambiguous read), featureCounts discards it by default. HTSeq-count offers similar behaviour with its "union" mode. Reads that do not overlap any annotated feature are counted as "unassigned."
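The unique-overlap/discard logic can be sketched in a few lines of Python (a toy illustration with hypothetical gene coordinates, not the actual featureCounts implementation, which works on exon models from the GTF):

```python
# Toy read-to-gene assignment: unique overlap -> count; ambiguous -> discard
genes = {"geneA": (100, 500), "geneB": (450, 900)}  # hypothetical gene spans

def assign_read(start, end):
    """Return the single overlapping gene, or None if ambiguous/unassigned."""
    hits = [g for g, (s, e) in genes.items() if start < e and end > s]
    return hits[0] if len(hits) == 1 else None  # discard ambiguous reads

reads = [(120, 220), (470, 490), (950, 1000)]  # (start, end) per read
counts = {}
for start, end in reads:
    gene = assign_read(start, end)
    if gene:
        counts[gene] = counts.get(gene, 0) + 1
print(counts)  # {'geneA': 1} -- the (470, 490) read overlaps both genes and is dropped
```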

Gene-level vs Transcript-level Counting

Gene-level counting

Counts reads per gene (collapsing all isoforms). Used by featureCounts and HTSeq-count. Simpler, more robust. Standard for differential expression analysis.

Transcript-level quantification

Estimates abundance per transcript isoform. Used by Salmon and kallisto (pseudoalignment). More informative but more complex. Can be summarised to gene level with tximport.

featureCounts Workflow

# Install featureCounts (from Subread package)
conda install -c bioconda subread

# Count reads per gene for all samples
featureCounts -T 8 \
  -a annotation.gtf \
  -o counts.txt \
  -s 2 \
  -p --countReadPairs \
  -B \
  aligned/*.bam

# Key parameters:
# -a : GTF annotation file (must match genome used for alignment)
# -o : output file name
# -s 2: reverse-stranded library (most Illumina kits)
# -p --countReadPairs: count fragments, not individual reads (for PE)
# -B : require both reads in a pair to be aligned

# The output is a table: genes x samples with read counts
# Column 1: gene ID, columns 7+: counts per sample
# A summary file (counts.txt.summary) shows assignment statistics
⚠️
Strandedness matters! Wrong -s setting can cause ~50% read loss or incorrect counting. Use -s 0 (unstranded), -s 1 (stranded-forward), or -s 2 (stranded-reverse). Check your library preparation kit documentation. Most modern kits (Illumina TruSeq Stranded, NEBNext) are reverse-stranded (-s 2). If unsure, run featureCounts with all three settings on one sample and compare assignment rates — the correct setting will have the highest "Assigned" count.

featureCounts vs HTSeq-count

featureCounts

Extremely fast (processes millions of reads per minute). Multi-threaded. Can process multiple BAM files in one command. Recommended for most use cases.

HTSeq-count

Python-based, slower. Processes one BAM file at a time. Offers fine-grained control over ambiguous read handling (union, intersection-strict, intersection-nonempty modes). Used historically but featureCounts is preferred.

Pseudoalignment Alternative: Salmon

Salmon and kallisto use pseudoalignment — they do not perform traditional alignment to the genome. Instead, they map reads directly to the transcriptome (cDNA sequences), skipping the genome alignment step entirely. This is dramatically faster (minutes instead of hours) and directly estimates transcript-level abundances.

# Salmon quantification (no alignment needed)
conda install -c bioconda salmon

# Step 1: Build transcriptome index (do once)
salmon index -t transcriptome.fa -i salmon_index -p 8

# Step 2: Quantify each sample
salmon quant -i salmon_index \
  -l A \
  -1 sample_R1.fastq.gz \
  -2 sample_R2.fastq.gz \
  -p 8 \
  --validateMappings \
  -o salmon_output/sample/

# -l A: auto-detect library type
# Output: quant.sf file with TPM and estimated counts per transcript

Creating a Count Matrix from Salmon with tximport

# In R: import Salmon outputs into DESeq2 with tximport
library(tximport)
library(DESeq2)

# Create a named vector of file paths
samples <- c("control1","control2","control3","treated1","treated2","treated3")
files <- file.path("salmon_output", samples, "quant.sf")
names(files) <- samples

# Load transcript-to-gene mapping (from GTF or biomaRt)
tx2gene <- read.csv("tx2gene.csv")  # columns: TXNAME, GENEID

# Import and summarise to gene level
txi <- tximport(files, type="salmon", tx2gene=tx2gene)

# Create DESeq2 object directly from tximport
metadata <- data.frame(condition=c(rep("control",3), rep("treated",3)),
                       row.names=samples)
dds <- DESeqDataSetFromTximport(txi, metadata, ~condition)
💡
Salmon + tximport + DESeq2 is increasingly the recommended pipeline because: (1) it is much faster than alignment-based counting, (2) it corrects for GC bias and sequence-specific bias, and (3) tximport provides offset matrices that improve DE accuracy at the gene level.
featureCounts = fast gene counting · Strandedness: check -s 0/1/2 · Salmon = no alignment needed · tximport bridges Salmon → DESeq2 · Gene-level for DE analysis · Ambiguous reads are discarded
5.5 Differential Expression with DESeq2

DESeq2 is the gold-standard R package for identifying genes that are significantly more or less expressed between experimental conditions. It models RNA-seq count data using the negative binomial distribution, which accounts for both the mean expression level and the biological variability (dispersion) between replicates. Unlike simpler models (Poisson), the negative binomial captures the overdispersion that is characteristic of real biological data — where variance exceeds the mean.

The DESeq2 Statistical Model

DESeq2 performs three key steps internally: (1) Size factor estimation — normalises for differences in library size using the median-of-ratios method. (2) Dispersion estimation — estimates gene-wise dispersion with empirical Bayes shrinkage toward a fitted trend, borrowing strength across genes. (3) Wald test — tests whether the log2 fold change for each gene is significantly different from zero. The design formula (e.g., ~ condition) specifies the experimental factors to model.

Basic DESeq2 Workflow

library(DESeq2)

# 1. Load count data from featureCounts
counts <- read.table("counts.txt", header=TRUE, row.names=1, skip=1)
counts <- counts[, 6:ncol(counts)]  # Keep only count columns

# 2. Create metadata (sample information)
metadata <- data.frame(
  condition = c("control","control","control","treated","treated","treated"),
  row.names = colnames(counts)
)

# 3. Create DESeq2 object
dds <- DESeqDataSetFromMatrix(counts, metadata, ~condition)

# 4. Set the reference level (control as baseline)
dds$condition <- relevel(dds$condition, ref="control")

# 5. Filter low-count genes (improves power)
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ]

# 6. Run DESeq2 (size factors + dispersion + Wald test)
dds <- DESeq(dds)

# 7. Get results
res <- results(dds, contrast=c("condition","treated","control"))
res <- res[order(res$padj), ]
summary(res)

Multi-factor Designs

When your experiment has known batch effects or additional covariates, include them in the design formula. DESeq2 will model their effect and remove it before testing the variable of interest.

# Multi-factor design: correct for batch while testing condition
metadata <- data.frame(
  condition = c("ctrl","ctrl","ctrl","treat","treat","treat"),
  batch = c("A","B","A","B","A","B"),
  row.names = colnames(counts)
)
dds <- DESeqDataSetFromMatrix(counts, metadata, ~ batch + condition)

# The last variable in the formula is the one being tested
# Batch effect is modelled but not tested

# For paired designs (same patient before/after treatment):
# design = ~ patient + treatment

LFC Shrinkage with apeglm

Raw log2 fold changes from DESeq2 can be noisy for low-count genes. LFC shrinkage produces more accurate, regularised fold change estimates — especially important for ranking genes and for visualisation.

# Apply apeglm shrinkage (recommended method)
library(apeglm)
res_shrunk <- lfcShrink(dds, coef="condition_treated_vs_control", type="apeglm")

# Use res_shrunk for MA plots and volcano plots
# Use the non-shrunken 'res' for strict significance testing

Visualisation: MA Plot, Volcano Plot, and Heatmap

# MA plot — log2FC vs mean expression
plotMA(res_shrunk, ylim=c(-5, 5), main="MA Plot")
# Blue dots = significant (padj < 0.1). Genes with high expression
# and large fold change are most reliable.

# Volcano plot with EnhancedVolcano
library(EnhancedVolcano)
EnhancedVolcano(res,
  lab = rownames(res),
  x = "log2FoldChange",
  y = "padj",
  pCutoff = 0.05,
  FCcutoff = 1,
  title = "Treated vs Control",
  subtitle = "DESeq2 results")

# Heatmap of top 50 DEGs
library(pheatmap)
vsd <- vst(dds, blind=FALSE)
top50 <- head(order(res$padj), 50)
mat <- assay(vsd)[top50, ]
mat <- mat - rowMeans(mat)  # Center rows (z-score-like)
pheatmap(mat,
  annotation_col = as.data.frame(colData(dds)["condition"]),
  scale = "none",
  show_rownames = TRUE,
  fontsize_row = 7)

Annotating and Exporting Results

# Add gene symbols using biomaRt
library(biomaRt)
ensembl <- useEnsembl(biomart="genes", dataset="hsapiens_gene_ensembl")
annotations <- getBM(
  attributes = c("ensembl_gene_id", "hgnc_symbol", "description"),
  filters = "ensembl_gene_id",
  values = rownames(res),
  mart = ensembl
)

# Merge annotations with results
res_df <- as.data.frame(res)
res_df$ensembl_gene_id <- rownames(res_df)
res_annotated <- merge(res_df, annotations, by="ensembl_gene_id", all.x=TRUE)

# Export significant DEGs
sig <- subset(res_annotated, padj < 0.05 & abs(log2FoldChange) > 1)
write.csv(sig, "DEGs_significant_annotated.csv", row.names=FALSE)

# Export for pathway analysis (all genes with stats)
write.csv(res_annotated, "all_genes_DESeq2.csv", row.names=FALSE)
📌
Interpreting results: log2FoldChange = direction and magnitude (+1 = 2x higher in treated, -1 = 2x lower). padj = Benjamini-Hochberg adjusted p-value (FDR). Always use padj < 0.05, never raw pvalue. Combine with a fold-change threshold (|log2FC| > 1) for biological relevance.
⚠️
Common mistakes: (1) Using normalised counts (TPM, RPKM) as input — DESeq2 requires raw integer counts. (2) Forgetting to set the reference level — the default alphabetical order may not be what you want. (3) Not filtering low-count genes — this reduces statistical power because of excessive multiple testing correction.
Negative binomial model · Raw counts as input (never normalised) · padj < 0.05 + |log2FC| > 1 · apeglm for LFC shrinkage · ~ batch + condition for multi-factor · EnhancedVolcano for plots · biomaRt for gene annotation
Module 6 · Genomics

DNA Sequencing, Variant Calling & Genome Analysis

Whole-genome and whole-exome sequencing reveal the genetic blueprint of an organism. This module covers variant calling, genome assembly, and annotation.

6.1 WGS/WES Pipeline — Variant Calling

Variant calling is the process of identifying positions in a sequenced genome where the sample differs from the reference. These differences include SNPs (single nucleotide polymorphisms — one base changed to another), indels (insertions or deletions of one or more bases), and MNVs (multi-nucleotide variants — multiple adjacent substitutions). Variant calling is the foundation of clinical genomics, population genetics, and disease gene discovery.

Variant Types and Zygosity

SNP (Single Nucleotide Polymorphism)

A single base change: e.g., reference A → alternative G. The most common variant type. ~4–5 million SNPs per human genome.

Indel (Insertion / Deletion)

Insertion of extra bases or deletion of reference bases. Short indels (1–50bp) are called by GATK. Larger structural variants require specialised tools (Manta, DELLY).

Homozygous reference (0/0)

Both copies of the chromosome carry the reference allele. This position is not a variant in this individual.

Heterozygous (0/1)

One chromosome carries the reference allele, the other carries the alternative. Expect ~50% of reads supporting each allele.

Homozygous alternative (1/1)

Both chromosomes carry the alternative allele. Expect ~100% of reads supporting the ALT allele.

MNV (Multi-Nucleotide Variant)

Two or more adjacent bases changed simultaneously (e.g., AG → TC). Often caused by a single mutational event. Correctly phasing these is important for protein impact prediction.

Why We Mark Duplicates

During library preparation, PCR amplification creates identical copies of DNA fragments. These PCR duplicates are not independent observations — they all come from the same original molecule. If not removed, they artificially inflate the number of reads supporting a variant, leading to false-positive variant calls. Duplicate marking identifies reads with identical start and end coordinates and flags all but one as duplicates.

How HaplotypeCaller Works

GATK's HaplotypeCaller does not simply count bases at each position. Instead, it performs local de novo assembly in active regions (regions with evidence of variation). It: (1) identifies active regions where the data deviates from the reference, (2) assembles the reads in that region into candidate haplotypes, (3) realigns reads to these haplotypes using a pair-HMM, and (4) assigns genotype likelihoods. This approach is more accurate than simple pileup-based callers, especially for indels and complex variants.

Complete GATK4 Pipeline

# Step 1: Align with BWA-MEM2 (faster successor to BWA-MEM)
bwa-mem2 mem -t 8 -R "@RG\tID:sample\tSM:sample\tPL:ILLUMINA\tLB:lib1" \
  reference.fa sample_R1.fastq.gz sample_R2.fastq.gz \
  | samtools sort -@ 8 -o sample.bam
samtools index sample.bam

# Step 2: Mark PCR duplicates
gatk MarkDuplicates \
  -I sample.bam \
  -O sample_markdup.bam \
  -M metrics.txt \
  --CREATE_INDEX true

# Step 3: Base Quality Score Recalibration (BQSR)
# Corrects systematic errors in base quality scores
gatk BaseRecalibrator \
  -I sample_markdup.bam -R reference.fa \
  --known-sites dbsnp.vcf.gz \
  --known-sites mills_indels.vcf.gz \
  -O recal.table
gatk ApplyBQSR \
  -I sample_markdup.bam -R reference.fa \
  --bqsr-recal-file recal.table \
  -O sample_recal.bam

# Step 4: Call variants (per-sample GVCF mode)
gatk HaplotypeCaller \
  -I sample_recal.bam -R reference.fa \
  -O sample.g.vcf.gz \
  -ERC GVCF

# Step 5: Joint genotyping (for cohorts)
gatk GenotypeGVCFs -R reference.fa \
  -V sample.g.vcf.gz \
  -O sample.vcf.gz

# Step 6: Filter variants
gatk VariantFiltration -V sample.vcf.gz -O sample_filtered.vcf.gz \
  --filter-expression "QD < 2.0" --filter-name "LowQD" \
  --filter-expression "MQ < 40.0" --filter-name "LowMQ" \
  --filter-expression "FS > 60.0" --filter-name "HighFS" \
  --filter-expression "SOR > 3.0" --filter-name "HighSOR"

Hard Filtering vs VQSR

Hard filtering

Apply fixed thresholds on quality metrics (QD, MQ, FS, SOR). Simple and works for any project size. Use for small cohorts (<30 samples), non-model organisms, or targeted panels.

VQSR (Variant Quality Score Recalibration)

Machine learning approach that learns the quality profile from known true variants (e.g., HapMap, 1000 Genomes). More accurate but requires large cohorts (>30 whole genomes) and curated training resources.

Quality Filter Annotations Explained

QD (QualByDepth)

Variant quality normalised by depth. Low QD (<2) means the variant quality is low relative to coverage — likely a false positive.

MQ (MappingQuality)

RMS mapping quality of reads at the site. Low MQ (<40) indicates reads are mapping to multiple locations — the region is ambiguous.

FS (FisherStrand)

Strand bias estimated by Fisher's exact test. High FS (>60 for SNPs) means the variant is only supported by reads on one strand — likely an artefact.

SOR (StrandOddsRatio)

Alternative strand bias metric. More robust than FS for high-coverage data. SOR > 3 indicates significant strand bias.

Visualising Variants in IGV

Always visually inspect candidate variants in the Integrative Genomics Viewer (IGV). Load the BAM file and VCF together, navigate to a variant position, and check: (1) Is there sufficient read depth? (2) Are both strands represented? (3) Are the variant-supporting reads properly mapped (no soft-clipping artefacts)? (4) Is the region not in a repeat or low-complexity sequence?

⚠️
BQSR requires known variant databases (dbSNP, Mills indels). For non-model organisms without these resources, skip BQSR and use more stringent hard filtering instead.
SNP = single base change · 0/0 = hom ref, 0/1 = het, 1/1 = hom alt · Mark duplicates before calling · HaplotypeCaller = local assembly · Hard filter for small projects · VQSR for large cohorts · Always inspect in IGV
6.2 Understanding VCF Files & Variant Annotation

The Variant Call Format (VCF) is the standard file format for storing genetic variants discovered by variant calling pipelines. Understanding the VCF format in detail is essential for filtering, annotating, and interpreting variant data. A VCF file consists of header lines (metadata) followed by data lines (one variant per line), and can contain genotype information for one or many samples.

VCF File Structure

A VCF file has two sections: the header (lines starting with ##) and the data lines (tab-separated variant records). The header defines the meaning of INFO and FORMAT fields used in the data lines.

## VCF header lines (metadata)
##fileformat=VCFv4.2
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths (REF,ALT)">
#CHROM  POS     ID       REF  ALT  QUAL    FILTER  INFO           FORMAT       Sample1
chr1    925952  rs12345  G    A    5000.1  PASS    DP=45;AF=0.50  GT:GQ:DP:AD  0/1:99:45:22,23

Anatomy of a VCF Data Line

Let us dissect the example line above field by field:

CHROM: chr1

Chromosome where the variant is located.

POS: 925952

1-based position on the reference. For indels, this is the position of the base before the event.

ID: rs12345

dbSNP identifier if known, otherwise "." (missing). Links to public variant databases.

REF: G / ALT: A

Reference allele and alternative allele. For SNPs: single bases. For a deletion: REF=ACG, ALT=A (CG deleted). For an insertion: REF=A, ALT=ATCG (TCG inserted).

QUAL: 5000.1

Phred-scaled quality score. QUAL 30 = 1/1000 chance the variant is wrong. QUAL 5000 = extremely confident. Higher is better.

FILTER: PASS

PASS means the variant passed all quality filters. Other values (e.g., "LowQD") indicate which filter(s) it failed.

Genotype Field (FORMAT + Sample)

The FORMAT column defines the order of fields in the sample column(s). In our example, FORMAT is GT:GQ:DP:AD and the sample value is 0/1:99:45:22,23.

GT: 0/1

Genotype. 0 = reference allele, 1 = first ALT allele. So 0/1 = heterozygous. 0/0 = homozygous reference. 1/1 = homozygous alternative. 0|1 = phased heterozygous.

GQ: 99

Genotype quality. Phred-scaled confidence that the genotype call is correct. GQ 99 = very high confidence. Low GQ (<20) means the genotype is uncertain.

DP: 45

Read depth at this position for this sample. For WGS, typically 30x. For WES, typically 50–100x. Low depth (<10) = unreliable genotype.

AD: 22,23

Allelic depths. 22 reads support REF (G), 23 reads support ALT (A). For a heterozygous call, expect roughly 50:50. Severe imbalance may indicate an artefact.
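As a sanity check, the FORMAT and sample columns from the example line can be paired up in a few lines of Python (a standalone sketch, not a full VCF parser; for real files use a library such as pysam or cyvcf2):

```python
# Pair a VCF FORMAT column with its per-sample value column
def parse_sample_field(format_col, sample_col):
    """Return a dict of genotype fields plus the ALT allele fraction."""
    fields = dict(zip(format_col.split(":"), sample_col.split(":")))
    ref_depth, alt_depth = map(int, fields["AD"].split(","))
    fields["allele_balance"] = alt_depth / (ref_depth + alt_depth)
    return fields

info = parse_sample_field("GT:GQ:DP:AD", "0/1:99:45:22,23")
print(info["GT"], info["allele_balance"])  # heterozygous, ~0.51 ALT fraction
```

For a 0/1 call, an allele balance near 0.5 (as here) is reassuring; severe imbalance would warrant inspection in IGV.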

bcftools: Filtering, Extracting, and Merging VCFs

# Install bcftools
conda install -c bioconda bcftools

# View basic stats
bcftools stats sample.vcf.gz | head -40

# Filter: keep only PASS variants with depth >= 20
bcftools view -f PASS -i 'DP>=20' sample.vcf.gz -Oz -o filtered.vcf.gz

# Extract only SNPs (exclude indels)
bcftools view -v snps sample.vcf.gz -Oz -o snps_only.vcf.gz

# Extract only indels
bcftools view -v indels sample.vcf.gz -Oz -o indels_only.vcf.gz

# Extract specific regions
bcftools view -r chr1:1000000-2000000 sample.vcf.gz

# Extract specific samples from a multi-sample VCF
bcftools view -s Sample1,Sample2 cohort.vcf.gz -Oz -o subset.vcf.gz

# Merge multiple single-sample VCFs
bcftools merge sample1.vcf.gz sample2.vcf.gz sample3.vcf.gz \
  -Oz -o merged.vcf.gz

# Index a VCF (required for most operations)
bcftools index -t filtered.vcf.gz

Variant Annotation with SnpEff

SnpEff annotates each variant with its predicted functional impact on genes, transcripts, and proteins. It adds an ANN field to the INFO column.

# Install SnpEff
conda install -c bioconda snpeff

# List available databases
snpEff databases | grep -i "homo_sapiens"

# Download the human genome database
snpEff download GRCh38.105

# Annotate variants
snpEff ann GRCh38.105 filtered.vcf.gz > annotated.vcf

# Generate an HTML summary report
snpEff ann -stats snpEff_summary.html GRCh38.105 filtered.vcf.gz > annotated.vcf

# Extract high-impact variants only
cat annotated.vcf | snpSift filter "ANN[*].IMPACT = 'HIGH'" > high_impact.vcf

Variant Functional Categories

Synonymous (LOW impact)

DNA change that does NOT alter the amino acid (e.g., GCC→GCT both encode Alanine). Usually benign due to codon degeneracy.

Missense (MODERATE impact)

DNA change that alters one amino acid (e.g., GAG→GTG: Glutamate→Valine). May or may not affect protein function depending on the position and amino acid properties.

Nonsense / Stop-gained (HIGH impact)

Creates a premature stop codon (e.g., CAG→TAG: Glutamine→Stop). Truncates the protein. Often loss-of-function. Subject to nonsense-mediated decay (NMD).

Frameshift (HIGH impact)

Insertion or deletion not divisible by 3. Shifts the reading frame, altering all downstream amino acids and usually creating a premature stop. Almost always loss-of-function.

Splice site (HIGH impact)

Variant at an exon-intron boundary (donor GT or acceptor AG). Disrupts normal splicing, potentially causing exon skipping or intron retention. Often pathogenic.

In-frame indel (MODERATE impact)

Insertion or deletion divisible by 3. Adds or removes amino acids without shifting the reading frame. Effect depends on the protein domain affected.

Clinical Interpretation Resources

ClinVar

NCBI database of variants with clinical significance (pathogenic, likely pathogenic, benign, VUS). The primary resource for clinical variant interpretation. Query by rsID or genomic coordinates.

gnomAD

Population allele frequencies from >140,000 exomes and >76,000 genomes. A variant with AF > 1% in gnomAD is almost certainly benign. Essential for filtering common polymorphisms.

💡
Filtering strategy for clinical variants: Start with ~4 million variants → filter PASS only → remove AF > 0.01 in gnomAD → keep HIGH/MODERATE impact → check ClinVar → manual review in IGV. This typically reduces to a manageable list of 50–200 candidate variants.
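The funnel amounts to a chain of filters. A toy Python sketch of the logic (the field names such as gnomad_af are illustrative; in practice these values come from SnpEff/SnpSift and gnomAD annotations in the VCF):

```python
# Toy version of the clinical filtering funnel (field names are illustrative)
variants = [
    {"filter": "PASS",  "gnomad_af": 0.20,  "impact": "MODERATE"},  # common -> drop
    {"filter": "PASS",  "gnomad_af": 0.001, "impact": "HIGH"},      # keep
    {"filter": "LowQD", "gnomad_af": 0.0,   "impact": "HIGH"},      # failed QC -> drop
    {"filter": "PASS",  "gnomad_af": 0.002, "impact": "LOW"},       # synonymous -> drop
]

candidates = [
    v for v in variants
    if v["filter"] == "PASS"                  # quality filter: PASS only
    and v["gnomad_af"] <= 0.01                # remove common polymorphisms
    and v["impact"] in ("HIGH", "MODERATE")   # keep likely functional variants
]
print(len(candidates))  # 1 candidate left for ClinVar lookup and IGV review
```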
VCF = standard variant format · GT:GQ:DP:AD = genotype fields · 0/1 = heterozygous · bcftools for filtering · SnpEff for annotation · HIGH impact = likely functional · gnomAD for population frequency · ClinVar for clinical significance
Module 7 · Statistics

Statistical Foundations for Bioinformatics

Bad statistics ruins good biology. This module covers the statistical concepts every bioinformatician must understand — from p-values and FDR to normalisation and dimensionality reduction.

7.1 P-values, Multiple Testing & FDR

Understanding p-values, multiple testing correction, and the False Discovery Rate (FDR) is arguably the single most important statistical concept in genomics. Misinterpreting these values leads to publishing false discoveries or missing real biology. This section provides the intuition, the mathematics, and the practical application you need to correctly interpret differential expression results.

⚠️
What a p-value IS: P(data | null hypothesis is true). The probability of observing data as extreme as yours, assuming there is no real effect. It is NOT: the probability that your gene is truly differentially expressed. A p-value of 0.03 does NOT mean there is a 97% chance the gene is real.

The Multiple Testing Problem — A Worked Example

Consider a typical RNA-seq experiment where you test 20,000 genes for differential expression. Suppose 1,000 genes are truly differentially expressed and 19,000 are not.

Without correction (p < 0.05)

Of the 19,000 null genes, 5% will produce p < 0.05 by chance = 950 false positives. Of the 1,000 true DEGs (assuming 80% power), you detect ~800. Your "significant" list has 800 + 950 = 1,750 genes, but 54% are false. You cannot tell which are real.

With BH correction (padj < 0.05)

Benjamini-Hochberg correction ensures that among your "significant" genes, at most 5% are false positives. If you call 750 genes significant, you expect ~38 false discoveries. This is manageable and scientifically defensible.

Benjamini-Hochberg FDR: Step-by-Step

The BH procedure adjusts p-values by ranking them:

1. Rank all m p-values from smallest to largest: p(1) ≤ p(2) ≤ ... ≤ p(m)

2. For each gene at rank i, compare its p-value to the threshold: i/m × q (where q is your desired FDR, e.g., 0.05)

3. Find the largest rank k where p(k) ≤ k/m × q

4. All genes with rank ≤ k are declared significant

# Example with 5 genes (m=5), FDR threshold q=0.05
# Gene   Raw p-value  Rank(i)  BH threshold (i/m × q)  Significant?
# GeneA  0.001        1        1/5 × 0.05 = 0.010      YES (0.001 < 0.010)
# GeneB  0.008        2        2/5 × 0.05 = 0.020      YES (0.008 < 0.020)
# GeneC  0.030        3        3/5 × 0.05 = 0.030      YES (0.030 ≤ 0.030)
# GeneD  0.040        4        4/5 × 0.05 = 0.040      YES (0.040 ≤ 0.040)
# GeneE  0.060        5        5/5 × 0.05 = 0.050      NO  (0.060 > 0.050)

# In R, this is simply:
p_values <- c(0.001, 0.008, 0.030, 0.040, 0.060)
p.adjust(p_values, method = "BH")
# DESeq2 does this automatically in the 'padj' column

FDR vs FWER

FDR (False Discovery Rate)

Controls the proportion of false discoveries among all rejected hypotheses. "Of the genes I call significant, at most 5% are false." Used for: RNA-seq, ChIP-seq, proteomics — any discovery-oriented experiment where some false positives are tolerable.

FWER (Family-Wise Error Rate)

Controls the probability of even one false positive. Much more conservative. Bonferroni: divide alpha by number of tests (0.05/20,000 = 2.5×10⁻⁶). Used for: GWAS, where a single false positive could lead to expensive follow-up studies.

Why padj, Not pvalue, in DESeq2

The pvalue column in DESeq2 results is the raw, uncorrected p-value. The padj column is the BH-adjusted p-value. Always use padj for determining significance. If you use raw p-values, you are ignoring the multiple testing problem and your results will be full of false positives.

Independent Filtering in DESeq2

DESeq2 automatically performs independent filtering: it removes genes with very low mean expression before applying multiple testing correction. Why? Genes with almost zero counts across all samples have no chance of being significant, but they consume part of the multiple testing budget, reducing power for the remaining genes. By filtering them out, DESeq2 increases statistical power for the genes that matter.

📌
NA values in padj: If you see NA in the padj column, it means DESeq2 filtered that gene out (usually due to very low counts or extreme outlier detection). These genes should not be interpreted as significant or non-significant — they were excluded from testing.

Effect Size and Biological Significance

Statistical significance (small padj) does not equal biological significance. With enough replicates, even tiny fold changes become statistically significant. A gene with padj = 0.0001 but log2FC = 0.1 is a 7% expression change — real but unlikely to be biologically meaningful. Always apply both a significance threshold (padj < 0.05) and an effect size threshold (|log2FC| > 1, i.e., 2-fold change) when defining your DEG list.
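The percentage figures follow directly from the definition of log2 fold change; a quick Python check:

```python
# Convert a log2 fold change into a percent change in expression
def log2fc_to_percent(lfc):
    return (2 ** lfc - 1) * 100

print(round(log2fc_to_percent(0.1), 1))   # 7.2  -> the "7% change" above
print(round(log2fc_to_percent(1.0), 1))   # 100.0 -> 2-fold up
print(round(log2fc_to_percent(-1.0), 1))  # -50.0 -> halved
```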

💡
Volcano plot intuition: X-axis = log2FoldChange (effect size), Y-axis = -log10(padj) (significance). Top-right = significantly upregulated with large effect. Top-left = significantly downregulated. Bottom-centre = not significant. Right side but not high = large fold change but underpowered (add more replicates).
p-value = P(data | H0 true) · Always use padj, never pvalue · BH-FDR = standard in genomics · Bonferroni = too conservative for RNA-seq · padj < 0.05 AND |log2FC| > 1 · NA in padj = gene was filtered · Statistical ≠ biological significance
7.2 Normalisation Methods

Raw read counts cannot be compared directly across samples or genes because they are confounded by two major technical factors: library size (total number of reads sequenced) and gene length (longer genes capture more reads). Normalisation removes these technical biases so that observed differences reflect true biological variation. Choosing the wrong normalisation method for your analysis is one of the most common errors in bioinformatics.

Why Normalisation Is Needed — A Concrete Example

Suppose Gene X has 100 raw counts in both Sample A and Sample B. It looks like equal expression. But Sample A has 10 million total reads and Sample B has 20 million total reads. Gene X consumed 100/10,000,000 = 0.001% of Sample A's reads, but only 100/20,000,000 = 0.0005% of Sample B's. Gene X is actually expressed at half the level in Sample B. Without normalisation, you would completely miss this 2-fold difference.
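In code, the Gene X example looks like this (a counts-per-million calculation, just to make the arithmetic concrete):

```python
# The Gene X example in numbers: same raw count, different library sizes
def cpm(count, library_size):
    """Counts per million: correct a raw count for library size."""
    return count / library_size * 1_000_000

a = cpm(100, 10_000_000)  # Sample A: 10 million total reads
b = cpm(100, 20_000_000)  # Sample B: 20 million total reads
print(a, b, a / b)  # 10.0 vs 5.0 CPM -> the hidden 2-fold difference
```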

Normalisation Methods

CPM (Counts Per Million)

Corrects for library size only. Formula: CPM = (raw count / total counts) × 10⁶. Use for within-gene comparisons across samples. Does NOT correct for gene length, so cannot compare expression between different genes.

TPM (Transcripts Per Million)

Corrects for gene length first, then library size. Formula: (1) Divide counts by gene length in kb → RPK. (2) Sum all RPK values and divide by 10⁶ → the per-million scaling factor. (3) Divide each RPK by this scaling factor. TPM values sum to 1 million per sample, making samples directly comparable. Current standard for expression visualisation.

RPKM / FPKM

RPKM = Reads Per Kilobase per Million mapped reads. FPKM = the paired-end equivalent. Formula: RPKM = (raw count × 10⁹) / (total counts × gene length). Problem: RPKM values do NOT sum to the same value across samples, making cross-sample comparison unreliable. Superseded by TPM.

DESeq2 size factors (median-of-ratios)

Computes a per-sample scaling factor that accounts for library size AND composition bias (when a few highly expressed genes consume a disproportionate fraction of reads). Does NOT correct for gene length. Used internally by DESeq2 for differential expression.
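A short Python sketch of the TPM recipe (toy counts and lengths; it also demonstrates the key property that TPM totals are identical across samples):

```python
# TPM: length-normalise first (RPK), then library-normalise
def tpm(counts, lengths_kb):
    rpk = [c / l for c, l in zip(counts, lengths_kb)]  # reads per kilobase
    scale = sum(rpk) / 1_000_000                       # per-million scaling factor
    return [r / scale for r in rpk]

# Two hypothetical samples, three genes of length 2 kb, 1 kb, 4 kb
lengths = [2.0, 1.0, 4.0]
sample1 = tpm([200, 100, 400], lengths)
sample2 = tpm([300, 600, 900], lengths)
print(sum(sample1), sum(sample2))  # both sum to 1,000,000 by construction
```

In sample1 all three genes have equal RPK, so they receive equal TPM despite very different raw counts, which is exactly the gene-length correction at work.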

DESeq2 Size Factor Calculation

The median-of-ratios method works as follows: (1) For each gene, compute the geometric mean across all samples. (2) For each sample, divide each gene's count by this geometric mean. (3) The median of these ratios is the size factor for that sample. Using the median makes this robust to outlier genes (e.g., one massively upregulated gene does not skew the normalisation).

# DESeq2 size factors (computed automatically by DESeq())
dds <- estimateSizeFactors(dds)
sizeFactors(dds)
# Values near 1.0 = average library. >1 = larger library. <1 = smaller library.

# Normalised counts = raw count / size factor
normalized_counts <- counts(dds, normalized=TRUE)
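The three steps can also be traced by hand on a toy count matrix in Python (two samples only, so the geometric mean reduces to the square root of the product; this mirrors, but does not replace, DESeq2's implementation):

```python
import math

# Median-of-ratios on a toy 3-gene x 2-sample count matrix
counts = {"g1": [100, 200], "g2": [50, 100], "g3": [10, 80]}

# (1) Geometric mean of each gene across samples
geo = {g: math.sqrt(v[0] * v[1]) for g, v in counts.items()}

def size_factor(sample_idx):
    # (2) Ratio of each gene's count to its geometric mean, per sample
    ratios = sorted(counts[g][sample_idx] / geo[g] for g in counts)
    # (3) The median ratio is the sample's size factor (robust to outlier genes)
    return ratios[len(ratios) // 2]

print(size_factor(0), size_factor(1))  # ~0.71 and ~1.41
```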

TMM Normalisation in edgeR

TMM (Trimmed Mean of M-values) is the normalisation method used by edgeR. It works by selecting a reference sample, computing fold-changes and absolute expression levels between each sample and the reference, trimming the extremes (top/bottom 30% of fold-changes, top/bottom 5% of expression), and computing a weighted mean of the remaining values as the scaling factor.

# TMM normalisation in edgeR
library(edgeR)
dge <- DGEList(counts = count_matrix)
dge <- calcNormFactors(dge, method = "TMM")

# Effective library size = lib.size * norm.factors
dge$samples

Which Normalisation for Which Purpose?

Differential expression analysis

Use raw counts as input to DESeq2 or edgeR. These tools apply their own internal normalisation (size factors or TMM). Never pre-normalise counts before DE analysis.

Cross-sample visualisation (heatmaps, PCA)

Use TPM or variance-stabilised transform (VST/rlog from DESeq2). These make samples comparable for visualisation.

Cross-gene comparison within a sample

Use TPM (accounts for gene length). CPM is insufficient because longer genes accumulate more reads regardless of expression level.

Reporting expression levels

Use TPM in publications and supplementary tables. It is the current community standard and is comparable across samples and studies (with caveats).

⚠️
Common mistakes: (1) Using RPKM/FPKM for differential expression — these are not appropriate because the variance structure is lost. (2) Using CPM to compare expression between genes of different lengths. (3) Pre-normalising counts before passing to DESeq2 — the tool expects raw integer counts.
Raw counts for DE analysis · TPM for visualisation · RPKM/FPKM is outdated · DESeq2 = median-of-ratios · edgeR = TMM · Library size + composition bias · Gene length matters for cross-gene comparison
7.3 PCA & Clustering for Quality Control

Before running differential expression, you should always visualise your samples using dimensionality reduction (PCA) and clustering. These exploratory analyses reveal sample outliers, batch effects, sample mix-ups, and whether your biological conditions are the dominant source of variation. Skipping this step means you may not discover critical problems until after you have wasted time on downstream analysis.

What PCA Actually Does

Principal Component Analysis (PCA) takes your high-dimensional data (20,000 genes per sample) and finds the directions of maximum variance. The first principal component (PC1) captures the most variance in the data, PC2 captures the second most (orthogonal to PC1), and so on. By plotting samples on PC1 vs PC2, you project 20,000 dimensions down to 2, preserving as much information as possible.
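The projection described above can be demonstrated with plain NumPy on a simulated matrix (6 samples × 1,000 genes, with a condition effect injected into the first 50 genes — all values invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy expression matrix: 6 samples x 1000 genes.
# First 3 samples ("treatment") are shifted on the first 50 genes.
X = rng.normal(size=(6, 1000))
X[:3, :50] += 5

# PCA via SVD of the mean-centred matrix
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
coords = U * S                        # sample coordinates on each PC
pct_var = 100 * S**2 / (S**2).sum()   # % variance explained per PC

print(f"PC1: {pct_var[0]:.1f}% variance")
print(coords[:, 0])  # PC1 separates the two sample groups
```

Even though only 50 of 1,000 genes carry the condition effect, PC1 finds that direction because it is the single largest coordinated source of variance; the remaining PCs capture noise.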

Interpreting a PCA Plot

PC1 (X-axis)

Captures the largest source of variation. Ideally this is your biological condition (treatment vs control). The % variance explained is shown in the axis label (e.g., "PC1: 65% variance").

PC2 (Y-axis)

The second largest source of variation. Often captures batch effects, sex differences, or other covariates. If PC2 separates batches more than conditions, you have a batch effect problem.

Good PCA

Replicates cluster tightly together. Conditions separate clearly along PC1. Percentage of variance explained by PC1 is high (>30%).

Problematic PCA

One replicate clusters with the wrong condition (sample mix-up). Samples cluster by batch rather than condition (batch effect). One sample is far from all others (outlier — consider removal).

PCA and Sample Distance Heatmap Code

library(DESeq2); library(ggplot2); library(pheatmap)

# Variance stabilising transformation (for visualisation only)
vsd <- vst(dds, blind=TRUE)

# PCA plot coloured by condition
plotPCA(vsd, intgroup="condition") +
  theme_minimal(base_size=14) +
  ggtitle("PCA — Sample Overview") +
  theme(plot.title = element_text(face="bold"))

# Custom PCA with more components
pcaData <- plotPCA(vsd, intgroup=c("condition","batch"), returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color=condition, shape=batch)) +
  geom_point(size=4) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  theme_minimal(base_size=14)

# Sample distance heatmap
sampleDists <- dist(t(assay(vsd)))
sampleDistMatrix <- as.matrix(sampleDists)
colnames(sampleDistMatrix) <- NULL
annotation_df <- as.data.frame(colData(dds)[, c("condition"), drop=FALSE])
pheatmap(sampleDistMatrix,
         clustering_distance_rows = sampleDists,
         clustering_distance_cols = sampleDists,
         annotation_row = annotation_df,
         color = colorRampPalette(c("white","steelblue"))(50))

Detecting and Correcting Batch Effects

If your PCA shows samples clustering by batch (processing date, lane, technician) rather than by biological condition, you have a batch effect. Two common correction methods:

# Option 1: ComBat (from sva package) — for correcting normalised data
library(sva)
# Use on normalised expression matrix (for visualisation, clustering)
adjusted <- ComBat(
  dat = assay(vsd),
  batch = colData(dds)$batch,
  mod = model.matrix(~ condition, data=colData(dds))
)
# Now use 'adjusted' matrix for PCA, heatmaps, etc.

# Option 2: removeBatchEffect (from limma) — for visualisation only
library(limma)
adjusted <- removeBatchEffect(
  assay(vsd),
  batch = colData(dds)$batch,
  design = model.matrix(~ condition, data=colData(dds))
)
# Use adjusted matrix for PCA/heatmaps

# For DE analysis: do NOT use corrected data!
# Instead, include batch in the DESeq2 design formula:
#   design = ~ batch + condition
⚠️
Never use batch-corrected data for differential expression. ComBat and removeBatchEffect are for visualisation and clustering only. For DE analysis, include the batch variable in the DESeq2 design formula (~ batch + condition). DESeq2 will model and remove the batch effect statistically.

Hierarchical Clustering Methods

Ward's method (ward.D2)

Minimises within-cluster variance. Produces compact, evenly-sized clusters. Best for sample clustering in gene expression data. Most commonly used.

Complete linkage

Distance between clusters = maximum distance between any two members. Produces tight clusters. Sensitive to outliers.

Average linkage (UPGMA)

Distance between clusters = mean of all pairwise distances. Produces ultrametric trees. Used in phylogenetics but assumes constant rate of evolution.

Gene expression heatmap

Rows = genes (usually top DEGs), columns = samples. Colour = expression level (z-score). Both rows and columns are clustered. Reveals co-expression patterns.

# Gene expression heatmap of top 50 DEGs
library(pheatmap)
top50 <- head(order(res$padj), 50)
mat <- assay(vsd)[top50, ]
mat <- t(scale(t(mat)))  # Z-score per gene (row)
pheatmap(mat,
         annotation_col = as.data.frame(colData(dds)[, "condition", drop=FALSE]),
         clustering_method = "ward.D2",
         show_rownames = TRUE,
         fontsize_row = 7,
         color = colorRampPalette(c("navy","white","firebrick3"))(50),
         main = "Top 50 Differentially Expressed Genes")
💡
What % variance explained means: If PC1 explains 65% of the variance, it means 65% of all the variation across your 20,000 genes can be captured by a single direction. The remaining 35% is spread across all other PCs. High PC1 variance usually means a strong biological signal (good). Very low PC1 (<15%) may mean high noise or complex multi-factor design.
PCA before DE analysis · PC1 = largest variance source · Replicates should cluster together · ComBat / limma for batch correction · ~ batch + condition in DESeq2 · Ward.D2 for clustering · Z-score heatmaps for DEGs
Module 8 · Advanced Topics

Single-cell RNA-seq, Pathway Analysis & Machine Learning

The frontier of bioinformatics. These topics are increasingly required in modern research — from dissecting cell heterogeneity with scRNA-seq to interpreting gene lists with pathway analysis and applying machine learning to genomic data.

8.1 Single-cell RNA-seq (scRNA-seq) — Introduction

Single-cell RNA sequencing (scRNA-seq) measures gene expression in individual cells, revealing cellular heterogeneity that is invisible to bulk RNA-seq. Where bulk RNA-seq reports the average expression across thousands of cells, scRNA-seq can identify rare cell populations, map developmental trajectories, and characterise the cellular composition of complex tissues like tumours or the immune system. The most widely used platform is the 10x Genomics Chromium system.

How 10x Chromium Works

In the 10x Chromium system, individual cells are captured in nanoliter-sized oil droplets (GEMs — Gel Beads-in-emulsion) together with barcoded hydrogel beads. Each bead carries millions of oligonucleotides with three components: (1) a cell barcode (16bp, unique to each droplet/cell), (2) a UMI (Unique Molecular Identifier, 12bp, unique to each captured mRNA molecule), and (3) a poly-T capture sequence that binds mRNA. After reverse transcription inside the droplet, all cDNA from one cell shares the same barcode, allowing computational demultiplexing.
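The read layout this produces is simple to parse: in the v3 chemistry described above, Read 1 carries the 16 bp cell barcode followed by the 12 bp UMI, while the cDNA itself is sequenced on Read 2. A toy parser (the example read is invented):

```python
# 10x v3 chemistry: Read 1 = 16 bp cell barcode + 12 bp UMI
CB_LEN, UMI_LEN = 16, 12

def parse_r1(read1_seq):
    """Split a Read-1 sequence into (cell barcode, UMI)."""
    cell_barcode = read1_seq[:CB_LEN]
    umi = read1_seq[CB_LEN:CB_LEN + UMI_LEN]
    return cell_barcode, umi

# Toy read: 16 bp barcode followed by 12 bp UMI
r1 = "AAACCCAAGAAACACT" + "GTACGTACGTAC"
cb, umi = parse_r1(r1)
print(cb, umi)  # AAACCCAAGAAACACT GTACGTACGTAC
```

Demultiplexing then amounts to grouping reads by cell barcode, and counting expression amounts to counting distinct UMIs per (barcode, gene) pair — duplicate UMIs are PCR copies of the same original mRNA molecule.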

Why scRNA-seq Has High Dropout (Zero Inflation)

A major characteristic of scRNA-seq data is dropout: many genes that are truly expressed in a cell are detected as zero. This happens because each cell contributes only ~10–20% of its mRNA to the captured library (low capture efficiency). Lowly expressed genes are particularly affected — they may be expressed but not captured. This "zero inflation" complicates analysis and requires specialised statistical methods.

Quality Control Metrics

nFeature_RNA (genes per cell)

Number of unique genes detected. Too low (<200): empty droplet or dead cell. Too high (>5,000–8,000): possible doublet (two cells in one droplet). Typical range: 500–5,000 genes.

nCount_RNA (UMIs per cell)

Total UMI counts. Proportional to mRNA content. Very low counts indicate low-quality cells. Extremely high counts may indicate doublets.

percent.mt (mitochondrial %)

Percentage of reads from mitochondrial genes. Dying cells release cytoplasmic mRNA but retain mitochondrial mRNA, so high mt% (>20%) indicates cell stress or death. Common threshold: remove cells with >15–20% mt.

Typical QC thresholds

Remove cells with: nFeature_RNA < 200 or > 5000, percent.mt > 20%. These are starting points — adjust based on your cell type (some cells naturally have high mt% or low gene counts).

Advanced QC: Ambient RNA and Doublets

Ambient RNA removal (SoupX)

During cell lysis, free-floating mRNA ("soup") contaminates all droplets. SoupX estimates the contamination profile from empty droplets and subtracts it from each cell's counts. Run before standard Seurat analysis.

Doublet detection (scDblFinder / DoubletFinder)

Doublets (two cells captured in one droplet) appear as cells with hybrid gene expression profiles. scDblFinder simulates artificial doublets and identifies real cells that resemble them. Expected doublet rate: ~0.8% per 1,000 cells loaded.

Complete Seurat Workflow

# Standard Seurat v5 scRNA-seq workflow
library(Seurat)
library(dplyr)  # for %>%, group_by(), slice_max()

# Load 10x Cell Ranger output
data <- Read10X("filtered_feature_bc_matrix/")
seurat <- CreateSeuratObject(data, min.cells=3, min.features=200)

# Quality control
seurat[["pct_mt"]] <- PercentageFeatureSet(seurat, pattern="^MT-")
VlnPlot(seurat, features=c("nFeature_RNA","nCount_RNA","pct_mt"), pt.size=0.1)
seurat <- subset(seurat, nFeature_RNA > 200 & nFeature_RNA < 5000 & pct_mt < 20)

# Normalise (log-normalisation)
seurat <- NormalizeData(seurat)
# Find highly variable genes (for dimensionality reduction)
seurat <- FindVariableFeatures(seurat, selection.method="vst", nfeatures=2000)
# Scale data (z-score, optionally regress out confounders)
seurat <- ScaleData(seurat, vars.to.regress="pct_mt")

# PCA → Nearest neighbours → Clustering → UMAP
seurat <- RunPCA(seurat, npcs=30)
ElbowPlot(seurat)  # Choose number of PCs (elbow ~15-20)
seurat <- FindNeighbors(seurat, dims=1:20)
seurat <- FindClusters(seurat, resolution=0.5)
seurat <- RunUMAP(seurat, dims=1:20)

# Visualise clusters
DimPlot(seurat, label=TRUE, repel=TRUE) + NoLegend()

# Find marker genes for each cluster
markers <- FindAllMarkers(seurat, only.pos=TRUE, min.pct=0.25,
                          logfc.threshold=0.5)
top5 <- markers %>% group_by(cluster) %>% slice_max(avg_log2FC, n=5)
DoHeatmap(seurat, features=top5$gene)

Cell Cycle Scoring and Regression

# Score cells for cell cycle phase
s.genes <- cc.genes$s.genes
g2m.genes <- cc.genes$g2m.genes
seurat <- CellCycleScoring(seurat, s.features=s.genes, g2m.features=g2m.genes)

# If cell cycle dominates clustering, regress it out:
seurat <- ScaleData(seurat, vars.to.regress=c("S.Score","G2M.Score"))

Trajectory Analysis with Monocle3

Trajectory analysis (pseudotime) orders cells along a continuous path representing a biological process (e.g., differentiation, activation). Monocle3 constructs a principal graph through the UMAP embedding and assigns each cell a pseudotime value based on its position along this graph.

💡
When to use trajectories: Developmental datasets (stem cell differentiation, embryogenesis), immune activation time courses, or any scenario where cells exist along a continuum rather than in discrete types.

Dataset Integration

When analysing multiple scRNA-seq datasets (different patients, batches, or experiments), batch effects cause identical cell types to cluster separately by dataset rather than by biology. Integration methods align datasets so that shared cell types overlap.

Harmony

Fast, lightweight. Operates in PCA space. Iteratively adjusts PCs to remove batch effects. Works well for most integration tasks. RunHarmony(seurat, "batch")

Seurat RPCA integration

Uses reciprocal PCA to find shared anchors between datasets. More conservative — better at preserving unique cell populations that exist in only one dataset. More computationally expensive.

⚠️
Integration pitfall: Over-integration can merge truly different cell types across datasets. Always compare integrated and non-integrated results. If a cell type appears in only one condition, integration may incorrectly merge it with a different cell type.
10x Chromium = cell barcode + UMI · nFeature 200–5000, mt% < 20% · SoupX for ambient RNA · scDblFinder for doublets · FindAllMarkers for cluster identity · Harmony for fast integration · Monocle3 for trajectories
8.2 Gene Ontology & Pathway Enrichment Analysis

After identifying differentially expressed genes, the natural next question is: what biological processes, pathways, or functions are affected? Pathway enrichment analysis moves from individual genes to biological interpretation. Rather than examining 500 DEGs one by one, you ask whether genes belonging to a specific pathway (e.g., "apoptosis" or "cell cycle") appear more often in your results than expected by chance. Two main approaches exist: Over-Representation Analysis (ORA) and Gene Set Enrichment Analysis (GSEA).

What Is a Gene Set?

A gene set is a group of genes that share a common biological property. Gene sets are defined by databases:

Gene Ontology (GO)

Three domains: Biological Process (BP), Molecular Function (MF), Cellular Component (CC). Hierarchically organised (e.g., "immune response" contains "inflammatory response" contains "cytokine production"). Very comprehensive but often redundant.

KEGG Pathways

Curated metabolic and signalling pathways with detailed diagrams. ~340 human pathways. Good for metabolism, signalling, and disease pathways. Less comprehensive than GO but more interpretable.

MSigDB Hallmark Gene Sets

50 carefully curated gene sets representing well-defined biological states (e.g., "HALLMARK_INFLAMMATORY_RESPONSE", "HALLMARK_OXIDATIVE_PHOSPHORYLATION"). Excellent starting point — covers major biological themes without redundancy.

Reactome

Peer-reviewed pathway database with detailed molecular events. ~2,500 human pathways. More granular than KEGG. Open-source and actively curated.

How ORA Works

Over-Representation Analysis uses Fisher's exact test (or hypergeometric test) to ask: Is the overlap between my DEG list and a specific pathway larger than expected by chance?

Imagine you have 500 DEGs from a universe of 15,000 expressed genes. The "apoptosis" pathway contains 200 genes. If 40 of your 500 DEGs are in the apoptosis pathway, is that more than the ~7 you would expect by chance (200/15,000 × 500)? Fisher's exact test gives you a p-value for this overlap.
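The worked example above can be checked directly with the hypergeometric test (the one-sided equivalent of Fisher's exact test), using only the Python standard library and the numbers from the paragraph:

```python
from math import comb

# Universe of 15,000 expressed genes, 500 DEGs,
# a pathway of 200 genes, and 40 of those in the DEG list
N, K, n, k = 15000, 200, 500, 40  # universe, pathway, DEGs, overlap

# Expected overlap by chance: (200 / 15,000) x 500
expected = K / N * n

# Hypergeometric upper tail: P(overlap >= k)
p_value = sum(
    comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1)
) / comb(N, n)

print(f"expected ≈ {expected:.1f}, P(overlap >= 40) = {p_value:.2e}")
```

Observing 40 overlapping genes where ~7 are expected gives a vanishingly small p-value, which is exactly why this pathway would surface as significantly over-represented.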

⚠️
Background gene set matters: Always use all expressed genes (genes that passed your expression filter) as the background, not all genes in the genome (~20,000). Using the wrong background inflates false positives because unexpressed genes cannot possibly appear in your DEG list but are counted as expected in the pathway.

How GSEA Works — Step by Step

GSEA does not use a DEG list with a cutoff. Instead, it uses the full ranked gene list (all genes ranked by fold change):

1. Rank all genes by log2 fold change (or a signed statistic like -log10(p) × sign(FC)).

2. Walk down the ranked list. For each gene, if it is in the pathway, increase a running enrichment score (ES); if not, decrease it.

3. The enrichment score is the maximum deviation from zero during the walk. A positive ES means the pathway genes are concentrated at the top (upregulated). A negative ES means they are at the bottom (downregulated).

4. Significance is assessed by permutation: shuffle gene labels 1,000 times and compute ES each time. Compare your real ES to the permutation distribution.

5. The Normalised Enrichment Score (NES) accounts for gene set size, making scores comparable across pathways.
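The walk in steps 2–3 can be sketched with a toy ranked list. This is a deliberately simplified, unweighted version of the statistic (real GSEA weights each hit by the gene's ranking metric), and the gene names and pathway are invented:

```python
# Toy ranked gene list (most upregulated first) and one pathway
ranked_genes = ["G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8", "G9", "G10"]
pathway = {"G1", "G2", "G4"}  # pathway genes concentrated near the top

n_hits = len(pathway)
n_miss = len(ranked_genes) - n_hits
hit_step = 1 / n_hits    # increment when the gene is in the pathway
miss_step = 1 / n_miss   # decrement otherwise

running, es = 0.0, 0.0
for gene in ranked_genes:
    running += hit_step if gene in pathway else -miss_step
    es = max(es, running, key=abs)  # maximum deviation from zero

print(f"ES = {es:.3f}")  # positive: pathway enriched among upregulated genes
```

Because all three pathway genes sit in the top four positions, the running score climbs early and peaks at 6/7 ≈ 0.857; a pathway concentrated at the bottom of the list would instead drive the score negative.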

📌
NES interpretation: Positive NES = pathway genes are enriched among upregulated genes. Negative NES = enriched among downregulated genes. Larger |NES| = stronger enrichment. Use padj < 0.05 for significance.

clusterProfiler Code — ORA and GSEA

library(clusterProfiler); library(org.Hs.eg.db); library(enrichplot)

# Convert gene symbols to Entrez IDs
gene_list <- bitr(rownames(sig), fromType="SYMBOL",
                  toType="ENTREZID", OrgDb=org.Hs.eg.db)
# Define background (all expressed genes, not just DEGs)
background <- bitr(rownames(res), fromType="SYMBOL",
                   toType="ENTREZID", OrgDb=org.Hs.eg.db)

# === ORA: GO Biological Process ===
ego <- enrichGO(gene          = gene_list$ENTREZID,
                universe      = background$ENTREZID,
                OrgDb         = org.Hs.eg.db,
                ont           = "BP",
                pAdjustMethod = "BH",
                qvalueCutoff  = 0.05,
                readable      = TRUE)
# Remove redundant GO terms
ego_simplified <- simplify(ego, cutoff=0.7, by="p.adjust")
dotplot(ego_simplified, showCategory=20)

# === ORA: KEGG pathways ===
kk <- enrichKEGG(gene = gene_list$ENTREZID, organism = "hsa",
                 pvalueCutoff = 0.05)
barplot(kk, showCategory=15)

# === GSEA: full ranked gene list ===
# Create named, sorted vector of fold changes
gene_rank <- setNames(res$log2FoldChange, rownames(res))
gene_rank <- sort(gene_rank, decreasing=TRUE)  # Must be sorted!
gsea_res <- gseGO(geneList     = gene_rank,
                  OrgDb        = org.Hs.eg.db,
                  ont          = "BP",
                  minGSSize    = 15,
                  maxGSSize    = 500,
                  pvalueCutoff = 0.05)

# Visualisations
ridgeplot(gsea_res, showCategory=15)
gseaplot2(gsea_res, geneSetID=1:3)  # Running score plot
# Enrichment network (shows relationships between terms)
emapplot(pairwise_termsim(ego_simplified), showCategory=20)

Common Visualisations

Dot plot

Dot size = number of genes, colour = padj, X-axis = gene ratio. The default and most informative single plot.

Enrichment map / network

Nodes = pathways, edges connect pathways sharing genes. Reveals clusters of related biological themes. Use emapplot().

Ridgeplot

Shows the distribution of fold changes for genes in each pathway. Useful for GSEA results to see the shift direction.

GSEA running score plot

Shows the running enrichment score as you walk down the ranked list. The peak/trough is the ES. Vertical lines show where pathway genes fall.

Common Pitfalls

⚠️
Redundant GO terms: GO is hierarchical, so related terms often appear together (e.g., "immune response", "innate immune response", "regulation of immune response"). Use simplify() in clusterProfiler or REVIGO to remove redundancy and focus on the most specific or most significant terms.
💡
ORA vs GSEA: Use ORA when you have a clear DEG list (padj < 0.05, |log2FC| > 1). Use GSEA when you want to detect subtle, coordinated changes across an entire pathway — especially when few individual genes pass the significance threshold but many show small shifts in the same direction.
ORA = Fisher's test on DEG list · GSEA = full ranked list, no cutoff · Positive NES = upregulated pathway · Use expressed genes as background · simplify() removes GO redundancy · MSigDB Hallmark = best starting point · clusterProfiler for both ORA and GSEA
8.3 Machine Learning in Bioinformatics

Machine learning (ML) is now embedded across bioinformatics — from protein structure prediction (AlphaFold) to variant pathogenicity scoring, cell type classification, and biomarker discovery. This section covers the key ML concepts that bioinformaticians need, including the unique challenges posed by genomic data (many features, few samples), practical workflows in Python with scikit-learn, and when ML is appropriate versus when simpler statistical methods suffice.

Supervised vs Unsupervised Learning

Supervised learning

You have labels (e.g., cancer subtype, drug response). The model learns to predict labels from features (gene expression). Examples: classify tumour vs normal, predict patient survival, identify pathogenic variants. Algorithms: Random Forest, SVM, logistic regression, neural networks.

Unsupervised learning

No labels. The model discovers structure in the data. Examples: clustering scRNA-seq cells into types, identifying patient subtypes from multi-omics data, PCA for dimensionality reduction. Algorithms: k-means, hierarchical clustering, UMAP, t-SNE, NMF.

The Curse of Dimensionality

Genomic datasets typically have 20,000+ features (genes) but only 50–200 samples. This "p >> n" problem means: (1) models can easily memorise the training data (overfit), (2) many spurious correlations exist by chance, and (3) feature selection is critical. Before training, reduce features using: variance filtering (remove low-variance genes), differential expression pre-filtering, or embedded methods (Random Forest feature importance).
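A minimal variance-filter sketch on a simulated p >> n matrix (all values invented; in practice you would filter your real expression matrix before model training, and inside CV folds as discussed below):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy matrix: 100 samples x 5000 genes (real data: ~20,000 genes)
X = rng.normal(size=(100, 5000))
X[:, :500] *= 3  # make the first 500 genes genuinely more variable

# Keep the 1,000 most variable genes
gene_var = X.var(axis=0)
top = np.argsort(gene_var)[::-1][:1000]
X_filtered = X[:, top]

print(X_filtered.shape)  # (100, 1000)
```

Variance filtering is unsupervised (it never looks at the labels), which is why it is the one feature-reduction step that is safe to apply before the train/test split.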

Cross-Validation: The Cardinal Rule

Never evaluate a model on the same data used to train it. Cross-validation provides an honest estimate of model performance on unseen data.

k-fold cross-validation

Split data into k folds (typically k=5 or k=10). Train on k-1 folds, test on the held-out fold. Repeat k times. Average the scores. The standard approach for most genomic ML.

Leave-One-Out (LOO)

Each sample is held out once. Maximises training data but computationally expensive. Useful for very small datasets (<30 samples) where you cannot afford to hold out 20% of data.

⚠️
Feature selection MUST happen inside CV: If you select the top 100 genes using all samples and then split into train/test, information from the test set leaked into feature selection. Always perform feature selection within each CV fold to get unbiased performance estimates.

Practical Example: Cancer Classification with scikit-learn

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.feature_selection import SelectKBest, f_classif

# Load expression matrix (genes x samples) and labels
expr = pd.read_csv("expression_matrix.csv", index_col=0).T  # samples x genes
labels = pd.read_csv("sample_labels.csv")["subtype"]

# Build a pipeline: feature selection → scaling → classifier
pipeline = Pipeline([
    ("select", SelectKBest(f_classif, k=200)),  # Top 200 genes by ANOVA
    ("scale", StandardScaler()),                # Z-score normalise
    ("clf", RandomForestClassifier(
        n_estimators=500, max_depth=10, random_state=42))
])

# 5-fold stratified cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, expr, labels, cv=cv, scoring="roc_auc_ovr")
print(f"Mean AUC: {scores.mean():.3f} (+/- {scores.std():.3f})")

# Train final model and get feature importance
pipeline.fit(expr, labels)
selector = pipeline.named_steps["select"]
clf = pipeline.named_steps["clf"]

# Get selected gene names and their importance
selected_genes = expr.columns[selector.get_support()]
importance = pd.Series(clf.feature_importances_, index=selected_genes)
top_biomarkers = importance.sort_values(ascending=False).head(20)
print("Top 20 biomarker genes:")
print(top_biomarkers)

ROC Curve and AUC Interpretation

The ROC (Receiver Operating Characteristic) curve plots the True Positive Rate vs False Positive Rate at different classification thresholds. The AUC (Area Under the Curve) summarises overall performance:

AUC = 0.5

Random guessing. The model has no discriminative ability. Your features carry no signal for this task.

AUC = 0.7–0.8

Acceptable discrimination. Useful for biomarker discovery but may not be sufficient for clinical decisions.

AUC = 0.8–0.9

Good discrimination. Strong enough for most research applications and preliminary clinical use.

AUC > 0.9

Excellent discrimination. Be cautious — if AUC is >0.95 on a small dataset, double-check for data leakage or confounders (batch effect masquerading as biology).

Deep Learning for Sequence Data

Deep learning excels when the input is raw sequence data (DNA, RNA, protein). A common architecture is a 1D convolutional neural network (CNN) applied to one-hot encoded DNA sequences (4 channels: A, C, G, T). These models can learn motifs (like transcription factor binding sites) directly from sequence without hand-crafted features.

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# 1D CNN for TF binding site prediction
# Input: one-hot encoded DNA sequences (batch, length=200, channels=4)
model = models.Sequential([
    layers.Conv1D(64, kernel_size=12, activation="relu", input_shape=(200, 4)),
    layers.MaxPooling1D(pool_size=4),
    layers.Conv1D(128, kernel_size=8, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid")  # Binary: bound / not bound
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
model.summary()

# One-hot encode a DNA sequence (unknown bases such as N become all zeros)
def one_hot_encode(seq):
    mapping = {"A": [1,0,0,0], "C": [0,1,0,0],
               "G": [0,0,1,0], "T": [0,0,0,1]}
    return np.array([mapping.get(base, [0,0,0,0]) for base in seq])

When NOT to Use Machine Learning

⚠️
ML is not always the answer: (1) Small sample sizes (n < 30) — models cannot generalise reliably. Use traditional statistics instead. (2) Confounded data — if your treatment is perfectly correlated with batch, no ML model can separate them. Fix the experimental design. (3) Simple questions — "which genes are differentially expressed?" is better answered by DESeq2 than by a neural network. Use the simplest method that works.
Supervised = labelled data · p >> n = curse of dimensionality · Feature selection inside CV · AUC 0.5 = random, >0.9 = excellent · Random Forest for biomarker genes · 1D CNN for DNA sequence · Never test on training data · Simplest method that works
8.4 Metagenomics & Microbiome Analysis

Metagenomics is the study of genetic material recovered directly from environmental or clinical samples — soil, ocean water, the human gut — without the need to culture individual organisms. It reveals the composition and functional potential of entire microbial communities. Two main approaches exist: 16S rRNA amplicon sequencing (which bacteria are present?) and shotgun metagenomics (what are they doing?). This section covers both workflows in detail.

16S rRNA Amplicon Sequencing

The 16S ribosomal RNA gene is present in all bacteria and archaea. It contains conserved regions (where universal primers bind) flanking variable regions (V1–V9) that differ between species. By PCR-amplifying and sequencing the V3-V4 region (~460bp), you can identify which bacterial taxa are present in your sample. This approach is cost-effective, well-established, and sufficient for community profiling.

ASV vs OTU

OTU (Operational Taxonomic Unit)

Older approach. Clusters sequences at 97% similarity. Loses resolution (different species with >97% identity are merged). Computationally simple but biologically imprecise. Tools: QIIME1, mothur (legacy).

ASV (Amplicon Sequence Variant)

Modern standard. Resolves sequences to single-nucleotide precision. Each unique error-corrected sequence is an ASV. Reproducible across studies (unlike OTUs which depend on clustering parameters). Superior in every way. Tools: DADA2, Deblur.

💡
Always use ASVs. OTUs are a legacy concept from when sequencing error rates made exact sequences unreliable. Modern denoising algorithms (DADA2) correct errors statistically, making ASVs the standard since ~2017.

QIIME2 Pipeline Overview

# Install QIIME2 (in a conda environment)
conda env create -n qiime2 --file qiime2-amplicon-2024.5-py39-linux-conda.yml
conda activate qiime2

# Step 1: Import raw FASTQ files
qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path raw_reads/ \
  --input-format CasavaOneEightSingleLanePerSampleDirFmt \
  --output-path demux.qza

# Step 2: Visualise read quality (choose truncation lengths)
qiime demux summarize --i-data demux.qza --o-visualization demux.qzv

# Step 3: Denoise with DADA2 (produces ASV table)
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trunc-len-f 280 \
  --p-trunc-len-r 220 \
  --p-n-threads 8 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats stats.qza

# Step 4: Assign taxonomy (using SILVA or Greengenes2)
qiime feature-classifier classify-sklearn \
  --i-classifier silva-138-99-nb-classifier.qza \
  --i-reads rep-seqs.qza \
  --o-classification taxonomy.qza

# Step 5: Build phylogenetic tree (needed for UniFrac)
qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences rep-seqs.qza \
  --o-alignment aligned-rep-seqs.qza \
  --o-masked-alignment masked-aligned.qza \
  --o-tree unrooted-tree.qza \
  --o-rooted-tree rooted-tree.qza

# Step 6: Alpha and beta diversity
qiime diversity core-metrics-phylogenetic \
  --i-phylogeny rooted-tree.qza \
  --i-table table.qza \
  --p-sampling-depth 10000 \
  --m-metadata-file metadata.tsv \
  --output-dir diversity-results/

Taxonomy Databases

SILVA

Comprehensive database for bacteria, archaea, and eukaryotes (18S/28S). Version 138.1 is the current standard. Highest quality curated taxonomy. Recommended for most studies.

Greengenes2 (2022)

Rebuilt from scratch using whole-genome phylogenetics. Consistent taxonomy derived from the Genome Taxonomy Database (GTDB). Better phylogenetic accuracy than the original Greengenes. Compatible with UniFrac.

Alpha Diversity Metrics

Alpha diversity measures the diversity within a single sample. Different metrics capture different aspects:

Shannon index (H')

Accounts for both richness (number of species) and evenness (how evenly individuals are distributed). Higher H' = more diverse. A sample dominated by one species has low Shannon even if many species are present.

Simpson index (1-D)

Probability that two randomly drawn individuals belong to different species. Emphasises dominant species. Less sensitive to rare species than Shannon. Range: 0 (no diversity) to 1 (high diversity).

Observed ASVs (richness)

Simply the number of unique ASVs detected. Sensitive to sequencing depth — always rarefy (subsample to equal depth) before comparing.

Faith's Phylogenetic Diversity (PD)

Sum of branch lengths in the phylogenetic tree connecting all species in the sample. Captures phylogenetic breadth — a sample with many distantly related species has higher PD than one with many closely related species.
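The non-phylogenetic metrics above can be computed directly from an ASV count vector. The counts below are a made-up single sample, just to show the formulas:

```python
import numpy as np

# Toy ASV counts for one sample
counts = np.array([500, 300, 150, 40, 10], dtype=float)
p = counts / counts.sum()  # relative abundances

richness = int((counts > 0).sum())    # observed ASVs
shannon = -(p * np.log(p)).sum()      # Shannon index H' (richness + evenness)
simpson = 1 - (p ** 2).sum()          # Simpson index (1 - D)

print(richness, round(shannon, 3), round(simpson, 3))
```

To see the evenness effect, replace the counts with five equal values: richness stays at 5, but Shannon rises to its maximum for 5 taxa (ln 5 ≈ 1.609), whereas this dominated sample scores lower.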

Beta Diversity and Ordination

Beta diversity measures differences between samples. Each pair of samples gets a distance value, producing a distance matrix that is visualised by ordination (PCoA).

Bray-Curtis dissimilarity

Based on shared species abundances. Does not consider phylogeny. Range: 0 (identical) to 1 (completely different). The most common non-phylogenetic metric.

UniFrac (weighted & unweighted)

Incorporates phylogenetic relationships. Unweighted UniFrac considers only presence/absence (sensitive to rare taxa). Weighted UniFrac also considers abundance (sensitive to dominant taxa). Requires a phylogenetic tree.
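Of the two, Bray-Curtis is simple enough to write out from its definition. A sketch with invented abundance vectors for two hypothetical samples:

```python
import numpy as np

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: 1 - 2*(shared abundance)/(total of both)."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    return 1 - 2 * np.minimum(u, v).sum() / (u.sum() + v.sum())

a = [10, 20, 0, 30]   # ASV counts, sample A
b = [10,  0, 5, 45]   # ASV counts, sample B

print(bray_curtis(a, a))             # 0.0 -> identical samples
print(round(bray_curtis(a, b), 3))   # 0.333
```

Computing this for every pair of samples yields the distance matrix that PCoA ordinates and PERMANOVA tests.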

PERMANOVA for Statistical Testing

To test whether microbial communities differ significantly between groups (e.g., healthy vs diseased), use PERMANOVA (Permutational Multivariate Analysis of Variance). It tests whether the centroids of groups in beta diversity space are significantly different by comparing observed F-statistics to a permutation distribution.

# PERMANOVA in QIIME2
qiime diversity beta-group-significance \
  --i-distance-matrix diversity-results/bray_curtis_distance_matrix.qza \
  --m-metadata-file metadata.tsv \
  --m-metadata-column group \
  --p-method permanova \
  --o-visualization permanova-results.qzv

# In R with vegan:
# library(vegan)
# adonis2(dist_matrix ~ group, data=metadata, permutations=999)

Shotgun Metagenomics Workflow

Unlike 16S amplicon sequencing, shotgun metagenomics sequences all DNA in the sample. This reveals not just which organisms are present (taxonomy) but also their functional potential (metabolic pathways, antibiotic resistance genes).

# Shotgun metagenomics pipeline

# Step 1: Quality control (remove host reads if applicable)
fastp -i sample_R1.fastq.gz -I sample_R2.fastq.gz \
  -o clean_R1.fastq.gz -O clean_R2.fastq.gz \
  --detect_adapter_for_pe

# Remove human reads (for clinical samples)
bowtie2 -x human_genome -1 clean_R1.fastq.gz -2 clean_R2.fastq.gz \
  --un-conc-gz non_host_%.fastq.gz -S /dev/null

# Step 2: Taxonomic profiling with MetaPhlAn4
conda install -c bioconda metaphlan
metaphlan non_host_1.fastq.gz,non_host_2.fastq.gz \
  --input_type fastq \
  --nproc 8 \
  -o taxonomy_profile.txt

# Step 3: Functional profiling with HUMAnN3
conda install -c bioconda humann
humann --input non_host_1.fastq.gz \
  --output humann_output/ \
  --threads 8

# HUMAnN3 outputs:
#   genefamilies.tsv  — gene family abundances (UniRef90)
#   pathabundance.tsv — metabolic pathway abundances (MetaCyc)
#   pathcoverage.tsv  — pathway coverage (fraction of pathway detected)

# Step 4: Merge profiles across samples
humann_join_tables -i humann_output/ -o merged_genefamilies.tsv \
  --file_name genefamilies
humann_renorm_table -i merged_genefamilies.tsv -o merged_genefamilies_cpm.tsv \
  --units cpm
📌
16S vs Shotgun: Use 16S for cost-effective community profiling when you only need to know "who is there." Use shotgun metagenomics when you also need to know "what are they doing" (functional profiling) or when you need to identify viruses and eukaryotes (which lack the 16S gene).
⚠️
Rarefaction: Before comparing alpha diversity across samples, rarefy (subsample) all samples to the same sequencing depth. Samples with more reads will always show more taxa, creating a false diversity difference. Choose a rarefaction depth that retains most samples while providing sufficient coverage (e.g., the 10th percentile of library sizes).
16S V3-V4 for bacterial profiling · ASV > OTU (always) · DADA2 for denoising · SILVA for taxonomy · Shannon = richness + evenness · UniFrac = phylogenetic beta diversity · PERMANOVA for group testing · MetaPhlAn4 + HUMAnN3 for shotgun
Module 9 · Proteomics

Mass Spectrometry & Protein Analysis

Proteomics studies the entire complement of proteins in a cell, tissue, or organism. Unlike the genome (static), the proteome is dynamic — it changes with conditions, time, and cell type. Mass spectrometry is the workhorse technology for large-scale protein identification and quantification.

9.1 Mass Spectrometry Principles

Mass spectrometry (MS) measures the mass-to-charge ratio (m/z) of ionised molecules. In proteomics, proteins are digested into peptides (typically with trypsin), separated by liquid chromatography (LC), ionised, and then measured in the mass spectrometer. The combination — LC-MS/MS — is the standard workflow for identifying and quantifying thousands of proteins in a single experiment.

The LC-MS/MS Workflow

A typical bottom-up proteomics experiment follows these steps: (1) Protein extraction — lyse cells, solubilise proteins. (2) Digestion — trypsin cleaves after lysine (K) and arginine (R), producing peptides of 7–35 amino acids. (3) Separation — reverse-phase liquid chromatography separates peptides by hydrophobicity over a 60–120 minute gradient. (4) Ionisation — electrospray ionisation (ESI) converts peptides into gas-phase ions. (5) MS1 scan — measures intact peptide masses. (6) MS2 scan — isolates individual peptide ions and fragments them (HCD or CID). The fragmentation pattern reveals the amino acid sequence.
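What the MS1 scan actually records is m/z, not mass: a peptide of neutral mass M carrying z protons appears at (M + z·1.007276)/z. A one-line sketch (the 1500.6 Da mass is an arbitrary example):

```python
PROTON = 1.007276  # monoisotopic proton mass, Da

def mz(neutral_mass, charge):
    """m/z at which a peptide of given neutral monoisotopic mass appears."""
    return (neutral_mass + charge * PROTON) / charge

# The same peptide appears at a different m/z for each charge state:
mz(1500.6, 1)   # ~1501.61
mz(1500.6, 2)   # ~751.31
mz(1500.6, 3)   # ~501.21
```

This is why deconvolving charge states is one of the first steps in interpreting an MS1 spectrum.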

Key Instrument Types

Orbitrap (Thermo)

High resolution (120,000–500,000), high mass accuracy (<2 ppm). Gold standard for discovery proteomics. Exploris 480, Q Exactive series.

TOF (Bruker timsTOF)

Time-of-flight with trapped ion mobility (TIMS). Very fast scan rates. DIA-PASEF enables deep proteome coverage. Excellent for clinical applications.

Triple Quadrupole (QQQ)

Lower resolution but extremely sensitive and quantitative. Used for targeted proteomics (MRM/SRM) — measuring specific proteins with high precision.

MALDI-TOF

Matrix-assisted laser desorption. Used for intact protein profiling, tissue imaging (MALDI-MSI), and microbial identification (Bruker Biotyper).

DDA vs DIA Acquisition

DDA (Data-Dependent)

Selects the top N most intense ions from MS1 for fragmentation. Simple, well-established. Stochastic — may miss low-abundance peptides. Less reproducible between runs.

DIA (Data-Independent)

Fragments ALL ions in wide m/z windows systematically. No selection bias. More reproducible and comprehensive. Requires spectral libraries or library-free tools (DIA-NN). Now the preferred method.

Recommended
💡
Modern trend: DIA with tools like DIA-NN or Spectronaut is rapidly replacing DDA for most applications. DIA-NN can identify 8,000–10,000 proteins from a single human cell line in a 90-minute gradient, with excellent quantitative reproducibility.
LC-MS/MS is standard workflow · Trypsin cleaves at K and R · Orbitrap = high resolution · DIA > DDA for reproducibility · DIA-NN for library-free analysis
9.2 Protein Identification & Database Searching

After acquiring MS/MS spectra, the key bioinformatics task is identifying which peptides (and therefore proteins) generated them. This is done by matching experimental spectra against theoretical spectra predicted from a protein sequence database. The process is called database searching — and it is conceptually similar to BLAST, but for mass spectra instead of sequences.

How Database Search Works

(1) Take each experimental MS2 spectrum. (2) In silico digest all proteins from a reference database (e.g., UniProt human) with trypsin. (3) For each candidate peptide whose mass matches the precursor mass (within tolerance), predict the theoretical fragmentation spectrum. (4) Score the match between experimental and theoretical spectra. (5) Apply statistical validation (FDR control at 1% using target-decoy approach).
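Steps (2)–(3) can be made concrete: an in silico tryptic digest plus a precursor-mass filter. A toy Python sketch (the sequence and tolerance are chosen for illustration; real engines also handle missed cleavages and modifications):

```python
# Monoisotopic residue masses (Da); peptide mass = sum of residues + water
MASS = {'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
        'V': 99.06841, 'T': 101.04768, 'C': 103.00919, 'L': 113.08406,
        'I': 113.08406, 'N': 114.04293, 'D': 115.02694, 'Q': 128.05858,
        'K': 128.09496, 'E': 129.04259, 'M': 131.04049, 'H': 137.05891,
        'F': 147.06841, 'R': 156.10111, 'Y': 163.06333, 'W': 186.07931}
WATER = 18.01056

def tryptic_digest(protein):
    """Cleave after K/R, except when followed by P (the trypsin rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in 'KR' and (i + 1 == len(protein) or protein[i + 1] != 'P'):
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides

def peptide_mass(pep):
    return sum(MASS[aa] for aa in pep) + WATER

def candidates(protein, precursor_mass, tol_ppm=10):
    """Peptides whose mass matches the precursor within tolerance."""
    return [p for p in tryptic_digest(protein)
            if abs(peptide_mass(p) - precursor_mass) / precursor_mass * 1e6 <= tol_ppm]
```

Each candidate surviving this mass filter would then be scored against the experimental MS2 fragmentation pattern (step 4).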

Major Search Engines

MaxQuant

Free, widely used. Built-in Andromeda search engine. Excellent for label-free and SILAC. GUI-based. Standard in academic proteomics.

MSFragger / FragPipe

Ultrafast open search. 100× faster than traditional engines. Integrated in FragPipe with downstream analysis. Excellent for PTM discovery.

DIA-NN

Deep neural network for DIA data. Library-free mode generates in silico spectral libraries. State-of-the-art sensitivity and speed.

# ─── MaxQuant (GUI-based, download from maxquant.org) ───
# Key settings to configure:
#   Raw files: add your .raw files
#   FASTA: download from uniprot.org (reviewed Swiss-Prot + isoforms)
#   Digestion: Trypsin/P, max 2 missed cleavages
#   Fixed mod: Carbamidomethyl (C) — from iodoacetamide alkylation
#   Variable mods: Oxidation (M), Acetyl (Protein N-term)
#   FDR: 1% at PSM and protein level
#   Match between runs: ON (transfers IDs across runs)
#   LFQ: ON for label-free quantification

# ─── DIA-NN (command line) ───
# Download from github.com/vdemichev/DiaNN
diann --f sample1.raw --f sample2.raw --f sample3.raw \
  --lib "" \
  --fasta uniprot_human.fasta \
  --gen-spec-lib \
  --predictor \
  --qvalue 0.01 \
  --matrices \
  --out-lib report-lib.tsv \
  --out report.tsv

# Key output files:
#   report.tsv           — all identified precursors with quantities
#   report.pg_matrix.tsv — protein group quantities (use this!)
⚠️
Target-Decoy FDR: All search engines use a decoy database (reversed or shuffled sequences) to estimate false discovery rate. At 1% FDR: for every 100 identified proteins, ~1 is expected to be wrong. Always use 1% FDR at both PSM (peptide-spectrum match) and protein level. Never relax this to 5%.
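The target-decoy estimate itself is simple: sort PSMs by score and, at any cutoff, estimate FDR as decoys/targets above it. A minimal sketch (the scores are invented; real engines compute q-values and model score distributions more carefully):

```python
def score_cutoff(psms, fdr=0.01):
    """psms: list of (score, is_decoy) tuples. Returns the lowest score
    cutoff at which the decoy-based FDR estimate stays <= fdr."""
    targets = decoys = 0
    cutoff = None
    for score, is_decoy in sorted(psms, key=lambda p: -p[0]):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= fdr:
            cutoff = score   # keep extending the accepted set
    return cutoff

psms = [(10, False), (9, False), (8, True), (7, False)]
score_cutoff(psms, fdr=0.5)   # tolerates the decoy at 8 -> cutoff 7
score_cutoff(psms, fdr=0.1)   # stops before the decoy  -> cutoff 9
```

A stricter FDR pushes the cutoff up and shrinks the accepted set, which is why relaxing 1% to 5% buys identifications at the cost of reliability.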
1% FDR is mandatory · MaxQuant for DDA · DIA-NN for DIA · FragPipe = fastest · Use reviewed UniProt FASTA
9.3 Quantitative Proteomics & Statistical Analysis

Quantitative proteomics measures how much of each protein is present, enabling comparison between conditions. There are two main strategies: label-free quantification (LFQ), which compares ion intensities directly, and labelling methods (TMT, SILAC), which use chemical or metabolic tags to multiplex samples.

Quantification Strategies

Label-Free (LFQ)

Compare MS1 peptide intensities across runs. Simple, no extra chemistry. Requires careful normalisation. Best with DIA. MaxLFQ algorithm in MaxQuant.

Most common
TMT/iTRAQ

Chemical labels with same mass but different fragmentation reporters. Multiplex up to 18 samples in one run (TMTpro 18-plex). Reduces missing values. Higher throughput.

SILAC

Metabolic labelling — grow cells with heavy amino acids (13C-Lys, 13C-Arg). Most accurate quantification. Only works for cell culture. Not for tissues/clinical samples.

Statistical Analysis with R

# ─── Differential protein abundance analysis ───
# Using limma (same package used for microarrays — works great for proteomics)
library(limma)
library(tidyverse)

# Load MaxQuant proteinGroups.txt or DIA-NN pg_matrix
pg <- read.delim("proteinGroups.txt")

# Filter: remove contaminants, reverse hits, only identified by site
pg <- pg[pg$Reverse != "+" &
         pg$Potential.contaminant != "+" &
         pg$Only.identified.by.site != "+", ]

# Extract LFQ intensity columns and log2-transform
lfq <- pg[, grep("^LFQ.intensity.", names(pg))]
lfq[lfq == 0] <- NA   # Replace 0 with NA
mat <- log2(as.matrix(lfq))

# Filter: keep proteins with values in at least 70% of samples per group
# (adjust based on your design)
# Imputation: replace missing values (MinProb method — common)
# Or use DEqMS / msqrob2 for more sophisticated handling

# Differential analysis with limma
group <- factor(c("Control","Control","Control","Treated","Treated","Treated"))
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)
fit <- lmFit(mat, design)
contrast <- makeContrasts(Treated - Control, levels = design)
fit2 <- contrasts.fit(fit, contrast)
fit3 <- eBayes(fit2)
results <- topTable(fit3, number = Inf, sort.by = "P")
sig <- results[results$adj.P.Val < 0.05 & abs(results$logFC) > 1, ]
cat(nrow(sig), "significantly changed proteins (|logFC|>1, FDR<0.05)\n")
💡
Missing values in proteomics: Unlike RNA-seq, proteomics data has many missing values (10–30% typically). These can be MCAR (missing completely at random — stochastic), MNAR (missing not at random — protein is below detection limit), or a mixture. Imputation strategy matters: use MinProb for MNAR, kNN for MCAR. Better yet, use statistical methods that handle missingness directly: DEqMS, msqrob2, or proDA.
LFQ for most experiments · TMT for multiplexing · limma works well for proteomics · Handle missing values carefully · |logFC| > 1 + FDR < 0.05
Module 10 · Structural Bioinformatics

Protein Structure & AlphaFold

Protein function is determined by its three-dimensional structure. Structural bioinformatics uses computational methods to predict, analyse, and visualise protein structures — revolutionised by AlphaFold, which can predict structures with near-experimental accuracy.

10.1 Protein Structure Levels & Databases

Proteins fold into defined 3D structures that determine their function. Understanding the four levels of protein structure is fundamental to structural bioinformatics and drug design.

The Four Levels of Structure

Primary (1°)

The linear amino acid sequence. Determines all higher levels. Read from gene sequence. 20 standard amino acids linked by peptide bonds.

Secondary (2°)

Local folding patterns: α-helices (3.6 residues per turn, stabilised by i→i+4 hydrogen bonds) and β-sheets (parallel or antiparallel strands). Loops connect regular elements.

Tertiary (3°)

The complete 3D fold of a single polypeptide chain. Stabilised by hydrophobic core packing, disulfide bonds, salt bridges, and hydrogen bonds. This is what we see in PDB structures.

Quaternary (4°)

Assembly of multiple polypeptide chains (subunits) into a functional complex. Example: haemoglobin is a tetramer (α2β2). Many enzymes and receptors function as oligomers.

Key Structural Databases

PDB (rcsb.org)

~220,000 experimentally determined structures. Solved by X-ray crystallography (~87%), cryo-EM (~10%), or NMR (~3%). The primary source for known structures.

AlphaFold DB

~200 million predicted structures covering nearly all known proteins. Free access at alphafold.ebi.ac.uk. Quality indicated by pLDDT confidence score (blue = high, orange = low).

UniProt

Links sequence → structure → function. Cross-references PDB and AlphaFold entries. Shows domain architecture, active sites, binding sites, and post-translational modifications.

# Download a PDB structure
wget https://files.rcsb.org/download/1UBQ.pdb

# Download AlphaFold prediction by UniProt ID
wget https://alphafold.ebi.ac.uk/files/AF-P0DTD1-F1-model_v4.pdb

# Batch download AlphaFold structures for a proteome
wget https://ftp.ebi.ac.uk/pub/databases/alphafold/latest/UP000005640_9606_HUMAN_v4.tar

# Visualise (install PyMOL or ChimeraX)
mamba install -c conda-forge pymol-open-source
pymol 1UBQ.pdb
📌
Reading pLDDT scores: AlphaFold provides a per-residue confidence score (pLDDT, 0–100). >90 = very high confidence (core domains). 70–90 = confident. 50–70 = low confidence (may be flexible). <50 = disordered/unstructured regions. Always check pLDDT before interpreting predicted structures.
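Because AlphaFold stores pLDDT in the PDB B-factor column, extracting per-residue confidence takes only a few lines. A sketch (column positions follow the fixed-width PDB ATOM record format):

```python
def plddt_per_residue(pdb_path):
    """Per-residue pLDDT from an AlphaFold PDB file, where it is stored
    in the B-factor column (identical for every atom of a residue)."""
    scores = {}
    with open(pdb_path) as fh:
        for line in fh:
            if line.startswith("ATOM"):
                resnum = int(line[22:26])            # residue number, cols 23-26
                scores[resnum] = float(line[60:66])  # B-factor, cols 61-66
    return scores

def confidence(plddt):
    """AlphaFold's published confidence bands."""
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "confident"
    if plddt > 50:
        return "low"
    return "disordered"
```

The fraction of residues scoring above 70 is a quick sanity check before trusting a downloaded model.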
PDB for experimental structures · AlphaFold for predictions · pLDDT > 70 = trustworthy · PyMOL or ChimeraX for visualisation
10.2 Structure Prediction — AlphaFold & ColabFold

AlphaFold2, developed by DeepMind, solved the protein structure prediction problem in 2020. It predicts 3D structures from amino acid sequence alone with near-experimental accuracy. ColabFold makes it easy to run AlphaFold predictions using Google Colab — no local GPU required.

How AlphaFold Works (Simplified)

(1) MSA generation — searches sequence databases to find homologous proteins (evolutionary information). (2) Template search — finds known structures of related proteins in PDB. (3) Evoformer — a deep neural network processes the MSA and pairwise features through 48 attention-based blocks. (4) Structure module — converts learned representations into 3D atomic coordinates. (5) Recycling — repeats the prediction 3 times, feeding output back as input, improving accuracy each time.

Running ColabFold

💡
Easiest way to predict a structure: Go to colab.research.google.com and search for "ColabFold" notebook. Paste your amino acid sequence, click "Runtime → Run all". The prediction takes 5–30 minutes depending on protein size. Output: PDB file + pLDDT scores + PAE (predicted aligned error) plot.
# ─── Local AlphaFold (requires GPU + ~2 TB databases) ───
# Use LocalColabFold for easier local installation:
git clone https://github.com/YoshitakaMo/localcolabfold.git
cd localcolabfold
bash install_colabbatch_linux.sh

# Predict a single protein
colabfold_batch input.fasta output_dir/

# Predict a protein complex (separate chains with :)
# In FASTA: >complex\nSEQUENCE_CHAIN_A:SEQUENCE_CHAIN_B
colabfold_batch complex.fasta output_dir/ --model-type alphafold2_multimer_v3

# ─── AlphaFold3 (online, free for academics) ───
# Visit alphafoldserver.com — predicts proteins, DNA, RNA, ligands, ions
# Supports complexes of any type (protein-DNA, protein-drug, etc.)
⚠️
Limitations of AlphaFold: (1) Does not predict conformational changes — gives a single static structure. (2) Disordered regions (low pLDDT) should not be interpreted structurally. (3) Cannot predict effects of mutations reliably. (4) Ligand/cofactor binding sites may not be accurate without the ligand present. (5) AlphaFold2 struggles with protein-protein interfaces — use AlphaFold-Multimer or AlphaFold3 for complexes.
ColabFold = easiest way to predict · AlphaFold3 for complexes · Check pLDDT and PAE · Static structure only
10.3 Molecular Docking & Drug Discovery

Molecular docking predicts how a small molecule (ligand/drug) binds to a protein target. It is a cornerstone of computer-aided drug design (CADD). The goal is to find the binding pose (orientation + conformation) and estimate binding affinity. Virtual screening uses docking to test millions of compounds computationally before synthesising the most promising ones.

Docking Tools

AutoDock Vina

The most widely used docking tool. Fast, accurate, free. Flexible ligand + rigid receptor. Easy to use via command line.

Start here
Gnina

CNN-based scoring function built on Vina. Better pose prediction than classical scoring. GPU-accelerated.

DiffDock

Diffusion model for docking. Does not require predefined binding site. State-of-the-art blind docking accuracy.

# ─── AutoDock Vina workflow ───
mamba install -c conda-forge autodock-vina
mamba install -c conda-forge openbabel   # For format conversion

# 1. Prepare receptor (remove water, add hydrogens)
#    Use ADFR suite or Open Babel:
obabel receptor.pdb -O receptor.pdbqt -h

# 2. Prepare ligand
obabel ligand.sdf -O ligand.pdbqt --gen3d -h

# 3. Define search box (center on binding site)
# 4. Run docking
vina --receptor receptor.pdbqt --ligand ligand.pdbqt \
  --center_x 10.0 --center_y 20.0 --center_z 15.0 \
  --size_x 20 --size_y 20 --size_z 20 \
  --exhaustiveness 32 --num_modes 9 \
  --out docked_poses.pdbqt

# Output: ranked poses with binding affinity (kcal/mol)
#   More negative = stronger predicted binding
#   Typical drug: -7 to -12 kcal/mol
⚠️
Docking limitations: Docking scoring functions are approximate — they estimate binding affinity with ~2 kcal/mol error. Use docking for ranking (which compounds bind better), not absolute binding energy prediction. Always validate top hits experimentally. For more accurate binding free energies, use molecular dynamics with free energy perturbation (FEP) or MM-PBSA methods.
AutoDock Vina to start · DiffDock for blind docking · Use for ranking, not absolute affinity · Always validate experimentally
Module 11 · Epigenomics

ChIP-seq, ATAC-seq & DNA Methylation

Epigenetics studies heritable changes in gene activity that do not involve changes to the DNA sequence. These modifications — histone marks, chromatin accessibility, and DNA methylation — control which genes are on or off in each cell type. Epigenomic assays reveal the regulatory landscape of the genome.

11.1 ChIP-seq — Mapping Protein-DNA Interactions

Chromatin Immunoprecipitation followed by sequencing (ChIP-seq) identifies where a specific protein (transcription factor or modified histone) binds across the genome. The protein-DNA complexes are cross-linked with formaldehyde, chromatin is sheared into 200–500 bp fragments, an antibody pulls down fragments bound by the target protein, and the bound DNA is sequenced. Peaks in the sequencing signal reveal binding sites.

ChIP-seq Analysis Pipeline

# ─── Complete ChIP-seq pipeline ───
mamba install -c bioconda bowtie2 samtools macs2 deeptools bedtools homer

# 1. Align reads to genome (bowtie2, not HISAT2 — no splice awareness needed)
bowtie2 -x bt2_index/genome -U chip_sample.fastq.gz \
  --very-sensitive -p 8 | samtools sort -O bam -o chip.bam
samtools index chip.bam

# Same for input/IgG control
bowtie2 -x bt2_index/genome -U input_control.fastq.gz \
  --very-sensitive -p 8 | samtools sort -O bam -o input.bam
samtools index input.bam

# 2. Remove duplicates
#    (for paired-end data, run samtools fixmate -m on name-sorted BAM first)
samtools markdup -r chip.bam chip_dedup.bam
samtools markdup -r input.bam input_dedup.bam

# 3. Call peaks with MACS2
# For transcription factors (narrow peaks):
macs2 callpeak -t chip_dedup.bam -c input_dedup.bam \
  -f BAM -g hs --outdir peaks/ -n my_tf -q 0.01

# For histone marks like H3K27me3 (broad peaks):
macs2 callpeak -t chip_dedup.bam -c input_dedup.bam \
  -f BAM -g hs --outdir peaks/ -n h3k27me3 --broad --broad-cutoff 0.1

# 4. Visualise with deepTools
bamCoverage -b chip_dedup.bam -o chip.bw --normalizeUsing RPGC \
  --effectiveGenomeSize 2913022398 --binSize 10 -p 8

# 5. Motif analysis (what TF motifs are enriched in peaks?)
findMotifsGenome.pl peaks/my_tf_peaks.narrowPeak hg38 motif_output/ -size 200
⚠️
Always use an input control. ChIP-seq without a control (input DNA or IgG pulldown) will produce false peaks at open chromatin regions and repetitive elements. The control captures background signal, enabling the peak caller to distinguish real binding from noise. No control = no publication.
💡
CUT&RUN / CUT&Tag: Modern alternatives to ChIP-seq that require far fewer cells (100–1,000 vs millions), have lower background, and are cheaper. CUT&Tag is especially popular — it uses a protein A-Tn5 fusion to cut and tag DNA near the antibody target. Analysis is similar to ChIP-seq but use SEACR instead of MACS2 for peak calling.
Bowtie2 for alignment · MACS2 for peak calling · Always include input control · HOMER for motif analysis · CUT&Tag is replacing ChIP-seq
11.2 ATAC-seq — Chromatin Accessibility

ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) maps open chromatin regions genome-wide. It uses the Tn5 transposase, which preferentially inserts sequencing adapters into accessible (open) chromatin, avoiding tightly packed nucleosome-occupied regions. Open chromatin corresponds to active regulatory elements — promoters, enhancers, and transcription factor binding sites.

ATAC-seq Analysis Pipeline

# ─── ATAC-seq pipeline ───
mamba install -c bioconda bowtie2 samtools macs2 deeptools picard

# 1. Align with bowtie2 (paired-end, max fragment size 2000)
bowtie2 -x bt2_index/genome -1 atac_R1.fastq.gz -2 atac_R2.fastq.gz \
  --very-sensitive -X 2000 --no-mixed --no-discordant -p 8 | \
  samtools sort -O bam -o atac.bam

# 2. Remove mitochondrial reads (major contaminant in ATAC-seq!)
samtools view -h atac.bam | grep -v chrM | samtools sort -O bam -o atac_nomt.bam
samtools index atac_nomt.bam

# 3. Remove duplicates
picard MarkDuplicates I=atac_nomt.bam O=atac_dedup.bam \
  M=dup_metrics.txt REMOVE_DUPLICATES=true
samtools index atac_dedup.bam

# 4. Shift reads (Tn5 inserts with +4/-5 bp offset)
alignmentSieve -b atac_dedup.bam -o atac_shifted.bam \
  --ATACshift --minFragmentLength 0 --maxFragmentLength 2000

# 5. Call peaks (BAMPE mode uses actual fragment ends, so no shift model;
#    MACS2 rejects --shift/--extsize together with -f BAMPE)
macs2 callpeak -t atac_shifted.bam -f BAMPE -g hs \
  --nomodel --outdir peaks/ -n atac -q 0.01 --keep-dup all

# 6. Generate signal track
bamCoverage -b atac_shifted.bam -o atac.bw --normalizeUsing RPGC \
  --effectiveGenomeSize 2913022398 --binSize 10 -p 8
📌
Fragment size distribution: A good ATAC-seq library shows a characteristic nucleosomal ladder: a large peak below ~100 bp (nucleosome-free fragments), a peak at ~200 bp (mono-nucleosome), and a smaller peak at ~400 bp (di-nucleosome). A strong sub-100 bp peak with a visible ladder indicates efficient transposition. If the distribution is broad without clear peaks, the experiment may have had too many or too few Tn5 insertions.
Remove chrM reads first · Tn5 shift correction is essential · Nucleosomal ladder = good QC · No input control needed
11.3 DNA Methylation & Bisulfite Sequencing

DNA methylation — the addition of a methyl group to the 5th carbon of cytosine (5mC) — is the most studied epigenetic modification. In mammals, it occurs predominantly at CpG dinucleotides. Methylation of gene promoters generally silences transcription. It plays critical roles in development, X-chromosome inactivation, genomic imprinting, and cancer (where aberrant methylation is a hallmark).

Detection Methods

WGBS

Whole-genome bisulfite sequencing. Gold standard. Single-CpG resolution across entire genome. Expensive — needs 10–15× coverage. Bisulfite converts unmethylated C→U; methylated C is protected.

RRBS

Reduced Representation BS. MspI digestion enriches CpG-rich regions. 1–10% of genome, covers most CpG islands. Much cheaper than WGBS.

Methylation Arrays

Illumina EPIC array (850K CpGs). Cheapest per-sample. Good for clinical studies. Limited to pre-selected CpGs. EPIC v2 covers ~930K sites.
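The quantity every bisulfite caller reports per CpG is this conversion logic turned into a fraction. A sketch (read counts are hypothetical):

```python
def methylation_level(c_reads, t_reads):
    """Fraction methylated at one CpG after bisulfite conversion:
    C reads = protected (methylated), T reads = converted (unmethylated)."""
    covered = c_reads + t_reads
    return c_reads / covered if covered else float("nan")

methylation_level(18, 2)   # 0.9  -> heavily methylated CpG
methylation_level(1, 19)   # 0.05 -> essentially unmethylated
```

Coverage matters here: a CpG covered by only two or three reads gives an unreliable fraction, which is one reason WGBS needs 10-15x depth.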

# ─── WGBS analysis with Bismark ───
mamba install -c bioconda bismark bowtie2 samtools trim-galore

# 1. Prepare bisulfite genome (creates C→T and G→A converted versions)
bismark_genome_preparation --bowtie2 genome_folder/

# 2. Trim adapters + low quality (add --rrbs only for RRBS libraries)
trim_galore --paired read_R1.fastq.gz read_R2.fastq.gz

# 3. Align with Bismark
bismark --bowtie2 --genome genome_folder/ \
  -1 read_R1_val_1.fq.gz -2 read_R2_val_2.fq.gz -p 4

# 4. Deduplicate (WGBS only — skip this step for RRBS)
deduplicate_bismark --bam read_R1_val_1_bismark_bt2_pe.bam

# 5. Extract methylation (outputs CpG, CHG, CHH context)
bismark_methylation_extractor --paired-end --comprehensive \
  --CX --bedGraph --genome_folder genome_folder/ \
  read_R1_val_1_bismark_bt2_pe.deduplicated.bam

# 6. Generate summary report
bismark2report
💡
Interpreting methylation: CpG island methylation at promoters = gene silencing. Gene body methylation = correlated with active transcription (paradoxically). Enhancer methylation = generally low at active enhancers. In cancer, look for promoter hypermethylation of tumour suppressors and global hypomethylation (genomic instability). DMRs (Differentially Methylated Regions) are typically more biologically meaningful than individual CpG changes — use tools like dmrseq or DSS for DMR calling.
WGBS = gold standard · Bismark for alignment · CpG islands at promoters · DMRs > individual CpGs · Promoter methylation = silencing
Module 12 · Genome Assembly & Annotation

Building Genomes from Scratch

De novo genome assembly reconstructs complete chromosome sequences from millions of short or long sequencing reads — like assembling a jigsaw puzzle without the picture on the box. With the emergence of long-read technologies, telomere-to-telomere assemblies are now achievable for many organisms.

12.1 De Novo Assembly — Short & Long Reads

Genome assembly combines overlapping reads into longer contiguous sequences (contigs), then scaffolds contigs into chromosome-scale sequences. The approach depends fundamentally on read length: short reads (150 bp Illumina) cannot span repetitive elements, producing fragmented assemblies. Long reads (10–100 kb from PacBio HiFi or Oxford Nanopore) span most repeats, enabling near-complete assemblies.

Assembly Strategies

Short-read (de Bruijn graph)

Chops reads into k-mers, builds a graph where k-mers are nodes and overlaps are edges. Traverses the graph to produce contigs. Tools: SPAdes (small genomes, metagenomes), MEGAHIT (metagenomes). Good for bacteria, not for complex eukaryotes.

Long-read (overlap-layout-consensus)

Finds overlaps between all read pairs, constructs a layout graph, derives consensus sequence. Tools: hifiasm (PacBio HiFi — best), Flye (Nanopore/PacBio CLR). Produces chromosome-scale contigs for most genomes.

# ─── Bacterial genome (short reads, SPAdes) ───
mamba install -c bioconda spades quast
spades.py -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz \
  -o spades_output/ -t 16 --careful

# ─── Eukaryotic genome (PacBio HiFi, hifiasm) ───
mamba install -c bioconda hifiasm

# Haplotype-resolved assembly (diploid)
hifiasm -o asm -t 32 hifi_reads.fastq.gz

# Convert GFA to FASTA
awk '/^S/{print ">"$2"\n"$3}' asm.bp.p_ctg.gfa | fold > primary_assembly.fa

# ─── Nanopore (Flye) ───
mamba install -c bioconda flye
flye --nano-hq nanopore_reads.fastq.gz \
  --out-dir flye_output/ --threads 32 --genome-size 500m

# Polish Nanopore assembly with short reads (Illumina)
mamba install -c bioconda pilon
pilon --genome flye_output/assembly.fasta \
  --frags illumina_aligned.bam --output polished
📌
Assembly quality metrics: N50 = the contig length such that 50% of the assembly is in contigs of this size or larger. Higher N50 = more contiguous. L50 = how many contigs make up 50% of the assembly. Lower L50 = better. BUSCO = completeness assessment based on universal single-copy orthologs (aim for >95% complete). Always report all three.
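The two contiguity metrics fall out of one pass over sorted contig lengths. A sketch with an invented seven-contig assembly:

```python
def n50_l50(contig_lengths):
    """N50: smallest contig in the minimal set covering half the assembly;
    L50: how many contigs that set contains."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for i, length in enumerate(lengths, start=1):
        running += length
        if running >= half:
            return length, i

# Total 300 bp; the two largest contigs (80 + 70) reach half the assembly:
n50_l50([80, 70, 50, 40, 30, 20, 10])   # (70, 2)
```

Note that N50 says nothing about correctness: a misassembled genome can have a superb N50, which is why BUSCO (and Merqury) are reported alongside it.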
hifiasm for HiFi reads · SPAdes for bacteria · N50 measures contiguity · BUSCO measures completeness · Polish Nanopore with Illumina
12.2 Assembly QC — BUSCO & QUAST

An assembly is only useful if it is correct and complete. Quality assessment uses two complementary approaches: contiguity statistics (N50, total length, number of contigs) measured by QUAST, and gene completeness measured by BUSCO (Benchmarking Universal Single-Copy Orthologs), which checks whether expected genes are present and intact.

# ─── QUAST — contiguity statistics ───
mamba install -c bioconda quast
quast assembly.fa -o quast_output/ -t 8

# If you have a reference genome for comparison:
quast assembly.fa -r reference.fa -g genes.gff -o quast_output/ -t 8
# Key outputs: N50, L50, total length, # contigs, GC%, misassemblies

# ─── BUSCO — gene completeness ───
mamba install -c bioconda busco

# Choose the right lineage dataset for your organism:
busco --list-datasets   # Shows all available lineages

# Run BUSCO
busco -i assembly.fa -o busco_output -m genome \
  -l mammalia_odb10 -c 16

# Interpretation:
#   C: >95%   = excellent assembly
#   C: 90-95% = good
#   C: <80%   = significant genes missing (check lineage choice)
#   D: >5%    = possible contamination or collapsed haplotypes
💡
Merqury is a reference-free assembly evaluation tool that uses k-mer counts from raw reads to assess completeness, accuracy, and phasing of haplotype-resolved assemblies. It is now the gold standard for HiFi assemblies.
BUSCO >95% = excellent · QUAST for contiguity stats · Merqury for k-mer QC · Choose correct BUSCO lineage
12.3 Gene Prediction & Genome Annotation

After assembling a genome, the next step is annotation — finding where the genes are and what they encode. Gene prediction uses ab initio methods (statistical models trained on known gene structures), homology-based methods (aligning known proteins or transcripts), and evidence-based methods (using RNA-seq data from the same organism). Modern pipelines combine all three approaches.

Annotation Pipeline

# ─── Prokaryotic annotation (fast, automated) ───
mamba install -c bioconda prokka
prokka assembly.fa --outdir annotation/ --prefix my_organism \
  --kingdom Bacteria --genus Streptomyces --cpus 16
# Output: .gff (gene coordinates), .faa (protein sequences),
#         .ffn (gene nucleotide sequences), .gbk (GenBank format)

# ─── Eukaryotic annotation (more complex) ───
# Step 1: Mask repeats (RepeatMasker)
mamba install -c bioconda repeatmasker
RepeatMasker -species "homo sapiens" -pa 16 -xsmall assembly.fa

# Step 2: Align RNA-seq evidence (HISAT2 → StringTie)
hisat2 -x genome_index -1 rnaseq_R1.fq.gz -2 rnaseq_R2.fq.gz -p 16 | \
  samtools sort -O bam -o rnaseq_aligned.bam
stringtie rnaseq_aligned.bam -o transcripts.gtf

# Step 3: Run BRAKER2/3 (combines ab initio + RNA-seq + protein evidence)
# BRAKER3 (recommended — integrates GeneMark + AUGUSTUS + protein hints)
braker.pl --genome=assembly.fa.masked \
  --bam=rnaseq_aligned.bam \
  --prot_seq=proteins_from_related_species.fa \
  --softmasking --threads 16

# Step 4: Functional annotation (assign GO terms, domains, pathways)
mamba install -c bioconda interproscan
interproscan.sh -i predicted_proteins.faa -f tsv -dp -goterms -pa
⚠️
Repeat masking is essential: Eukaryotic genomes can be 40–80% repetitive. Without masking, gene predictors will predict thousands of spurious genes in transposable elements. Always run RepeatMasker before gene prediction. Use soft-masking (lowercase, not N-replacement) so gene predictors can still use the sequence information.
Prokka for bacteria · BRAKER3 for eukaryotes · RepeatMasker before gene prediction · RNA-seq evidence improves accuracy · InterProScan for functional annotation
Module 13 · Population Genetics & GWAS

Genetic Variation in Populations

Population genetics studies how allele frequencies change within and between populations over time. Genome-wide association studies (GWAS) connect genetic variants to phenotypes and diseases, forming the basis of precision medicine.

13.1 Population Structure & Hardy-Weinberg Equilibrium

Hardy-Weinberg Equilibrium (HWE) describes the expected genotype frequencies in a population with no evolutionary forces acting. For a variant with allele frequencies p (reference) and q (alternative), expected genotype frequencies are: p² (homozygous ref), 2pq (heterozygous), q² (homozygous alt). Deviations from HWE can indicate selection, population stratification, genotyping errors, or inbreeding.
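The HWE check above is just a goodness-of-fit test against the p², 2pq, q² expectations. A minimal Python sketch, using hypothetical genotype counts (the numbers are illustrative, not from any real cohort):

```python
# Hypothetical genotype counts for one biallelic variant (1000 people)
n_AA, n_Aa, n_aa = 298, 489, 213
n = n_AA + n_Aa + n_aa

# Allele frequencies from genotype counts
p = (2 * n_AA + n_Aa) / (2 * n)          # reference allele
q = 1 - p                                # alternative allele

# Expected counts under HWE: p^2*n, 2pq*n, q^2*n
expected = (p * p * n, 2 * p * q * n, q * q * n)
observed = (n_AA, n_Aa, n_aa)

# Chi-square goodness-of-fit statistic, 1 degree of freedom;
# chi_sq > 3.84 would reject HWE at alpha = 0.05
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(f"p = {p:.4f}, chi-square = {chi_sq:.3f}")
```

With these counts the statistic stays well below 3.84, so the variant is consistent with HWE — exactly the behaviour the `--hwe 1e-6` filters in the PLINK commands below screen for, just with a much stricter threshold.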

Key Population Genetics Concepts

Allele Frequency

The proportion of a specific allele in a population. If 80 of 200 chromosomes carry the T allele, frequency = 0.40. Minor allele frequency (MAF) < 0.01 = rare variant.

Genetic Drift

Random fluctuation of allele frequencies due to finite population size. Stronger in small populations. Founder effect and bottleneck are examples of extreme drift.

FST (Fixation Index)

Measures genetic differentiation between populations. FST = 0 means no differentiation; FST = 1 means populations are fixed for different alleles. Human populations: FST ≈ 0.12 globally.

Linkage Disequilibrium (LD)

Non-random association between alleles at different loci. Nearby variants are in LD (inherited together). Measured by r². LD decays with physical distance due to recombination.
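Each of the four concepts above reduces to simple arithmetic on counts and frequencies. A self-contained Python sketch — all counts and frequencies here are hypothetical, chosen only to illustrate the formulas:

```python
import random

# 1) Allele frequency: 80 of 200 chromosomes carry the T allele
freq_T = 80 / 200                         # 0.40

# 2) Genetic drift (Wright-Fisher sketch): each generation, the 2N
#    chromosomes resample alleles from the current frequency; smaller
#    N means larger random swings (N, generations, seed are arbitrary)
def drift(p, N, generations, seed=1):
    random.seed(seed)
    for _ in range(generations):
        p = sum(random.random() < p for _ in range(2 * N)) / (2 * N)
        if p in (0.0, 1.0):               # fixation or loss
            break
    return p

# 3) Wright's FST = (HT - HS) / HT from alt-allele frequencies
#    in two equal-sized populations
p1, p2 = 0.7, 0.3
p_bar = (p1 + p2) / 2
HT = 2 * p_bar * (1 - p_bar)              # total expected heterozygosity
HS = (2*p1*(1-p1) + 2*p2*(1-p2)) / 2      # mean within-population
fst = (HT - HS) / HT                      # 0.16 for these frequencies

# 4) LD (r^2) from hypothetical haplotype counts at loci A/a and B/b
hap = {"AB": 500, "Ab": 100, "aB": 100, "ab": 300}
n = sum(hap.values())
pA = (hap["AB"] + hap["Ab"]) / n
pB = (hap["AB"] + hap["aB"]) / n
D = hap["AB"] / n - pA * pB               # disequilibrium coefficient
r2 = D**2 / (pA * (1 - pA) * pB * (1 - pB))

print(f"freq_T={freq_T}, FST={fst:.2f}, r2={r2:.3f}")
```

The same quantities are what PLINK computes at scale in the commands below (`--freq`-style summaries, `--indep-pairwise` r² pruning).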

# ─── Population structure analysis with PLINK ───
mamba install -c bioconda plink2

# 1. Basic QC filtering
plink2 --vcf cohort.vcf.gz --maf 0.01 --geno 0.05 --hwe 1e-6 \
  --mind 0.1 --make-pgen --out filtered

# 2. LD pruning (remove correlated variants for PCA)
plink2 --pfile filtered --indep-pairwise 50 5 0.2 --out pruned
plink2 --pfile filtered --extract pruned.prune.in --make-pgen --out pruned_data

# 3. PCA for population structure
plink2 --pfile pruned_data --pca 20 --out pca_results

# 4. Plot PC1 vs PC2 in R:
#    pca <- read.table("pca_results.eigenvec", header=TRUE)
#    plot(pca$PC1, pca$PC2, col=as.factor(pca$Population))

# 5. ADMIXTURE analysis (ancestry proportions)
mamba install -c bioconda admixture
# Convert to BED format first, then:
for K in 2 3 4 5; do
  admixture pruned_data.bed $K --cv -j16
done
📌
Population stratification in GWAS: If cases and controls come from different ancestral backgrounds, allele frequency differences due to ancestry (not disease) will create false associations. This is why GWAS always includes PCA components as covariates — to correct for population structure. Typically the top 10–20 PCs are included.
HWE test for QC · PCA reveals population structure · PLINK2 for genetic analysis · FST measures differentiation · LD decays with distance
13.2 GWAS — Genome-Wide Association Studies

GWAS tests each genetic variant across the genome for association with a phenotype (disease, height, drug response, etc.). It is a hypothesis-free approach — no prior knowledge of which genes are involved is needed. Since millions of variants are tested simultaneously, very stringent significance thresholds are required (p < 5×10⁻⁸ for genome-wide significance). GWAS results are visualised as Manhattan plots.
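The 5×10⁻⁸ threshold is not arbitrary: it is a Bonferroni correction of the conventional α = 0.05 for roughly one million independent common-variant tests (a widely used estimate for European-ancestry genomes, not an exact count). A two-line sketch:

```python
# Bonferroni: divide the family-wise alpha by the (estimated) number
# of independent tests across the genome
alpha, n_tests = 0.05, 1_000_000
threshold = alpha / n_tests
print(threshold)    # ~5e-08, the genome-wide significance line
```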

GWAS Pipeline

# ─── GWAS with PLINK2 ───
# Assumes: genotypes in VCF/PGEN, phenotype file (FID IID PHENO)

# 1. Sample & variant QC
#    --maf 0.01 : remove rare variants (MAF < 1%)
#    --geno 0.02: remove variants missing in >2% of samples
#    --mind 0.05: remove samples missing >5% of variants
#    --hwe 1e-6 : remove variants violating HWE
plink2 --vcf genotypes.vcf.gz \
  --maf 0.01 --geno 0.02 --mind 0.05 --hwe 1e-6 \
  --make-pgen --out qc_data

# 2. Calculate PCs for population stratification
plink2 --pfile qc_data --indep-pairwise 50 5 0.2 --out prune
plink2 --pfile qc_data --extract prune.prune.in --pca 20 --out pcs

# 3. Run association (logistic for case/control, linear for quantitative)
plink2 --pfile qc_data \
  --pheno pheno.txt --pheno-name Disease \
  --covar pcs.eigenvec --covar-col-nums 3-12 \
  --glm --out gwas_results

# 4. Genome-wide significance threshold: p < 5e-8
#    Suggestive significance: p < 1e-5

# 5. Manhattan plot in R:
#    library(qqman)
#    manhattan(gwas, chr="CHROM", bp="POS", p="P", snp="ID")

Interpreting GWAS Results

⚠️
GWAS finds associations, not causation. A significant SNP is usually a tag SNP in LD with the true causal variant — not the causal variant itself. Fine-mapping (credible set analysis, SuSiE) is needed to narrow down the causal variant. Functional validation (CRISPR, reporter assays) is needed to prove causality. Most GWAS hits are in non-coding regions (enhancers, promoters), not in protein-coding exons.
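Fine-mapping output is often summarised as a 95% credible set: rank variants by posterior probability and take the smallest set whose cumulative probability reaches 0.95. A toy sketch with made-up per-variant Bayes factors (real analyses use SuSiE or FINEMAP, which model LD properly):

```python
# Hypothetical approximate Bayes factors for variants at one locus
bf = {"rs1": 120.0, "rs2": 45.0, "rs3": 8.0, "rs4": 1.2, "rs5": 0.9}

# Normalise to per-variant posterior inclusion probabilities
# (assumes exactly one causal variant at the locus)
total = sum(bf.values())
pip = {snp: b / total for snp, b in bf.items()}

# Greedily accumulate the 95% credible set
credible, cum = [], 0.0
for snp, prob in sorted(pip.items(), key=lambda kv: -kv[1]):
    credible.append(snp)
    cum += prob
    if cum >= 0.95:
        break

print(credible)     # the set of variants most likely to contain the causal one
```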
💡
Post-GWAS analysis: After finding significant loci, perform: (1) Fine-mapping — identify the most likely causal variant. (2) Colocalization (coloc, eCAVIAR) — test if a GWAS signal and an eQTL signal share the same causal variant. (3) Gene mapping — assign GWAS hits to likely target genes using eQTL data, chromatin interaction (Hi-C), and distance. (4) Pathway enrichment — test if GWAS hits are enriched in specific biological pathways. (5) Polygenic Risk Scores — combine effects of all variants to predict individual risk.
p < 5e-8 for genome-wide significance · Always correct for population structure · GWAS = association, not causation · Fine-mapping for causal variants · Manhattan plot for visualisation
Module 14 · Systems Biology & Network Analysis

Networks, Interactions & Integrative Analysis

Systems biology views biological processes as interconnected networks rather than isolated pathways. By integrating multi-omics data — transcriptomics, proteomics, metabolomics — with interaction networks, we can understand how molecular changes propagate through biological systems.

14.1 Protein-Protein Interaction Networks

Protein-Protein Interaction (PPI) networks represent physical or functional associations between proteins. Nodes are proteins, edges represent interactions. Highly connected proteins (hubs) tend to be essential. PPI networks help identify functional modules, predict protein function, and prioritise drug targets.

Key Resources & Tools

STRING (string-db.org)

Largest PPI database. Integrates experimental data, co-expression, text mining, and genomic context. Confidence scores range 0–1 (scaled to 0–1000 in downloaded files). Use score > 0.7 for high-confidence interactions.

Cytoscape

Desktop software for network visualisation and analysis. Supports plugins (clusterMaker, MCODE for module detection, CytoHubba for hub identification). Essential for publication figures.

BioGRID

Curated repository of experimentally validated interactions (protein, genetic, chemical). Over 2 million interactions. Higher confidence than computationally predicted interactions.

Network Analysis Concepts

Degree

Number of connections a node has. High-degree nodes (hubs) are often essential genes. Degree distribution follows a power law in biological networks (scale-free property).

Betweenness Centrality

How often a node lies on shortest paths between other nodes. Bottleneck proteins control information flow through the network — they are often good drug targets.

Clustering Coefficient

How densely connected a node's neighbours are. Nodes in protein complexes have high clustering. Used to detect functional modules (MCODE algorithm).

Modules/Communities

Densely connected subnetworks often correspond to protein complexes or functional pathways. Detect with MCODE, Louvain, or Markov clustering (MCL).

# ─── Network analysis in Python with NetworkX ───
# Install first: pip install networkx pandas matplotlib

# Build network from STRING export
import networkx as nx
import pandas as pd

edges = pd.read_csv("string_interactions.tsv", sep="\t")
edges = edges[edges["combined_score"] > 700]     # High confidence
G = nx.from_pandas_edgelist(edges, "protein1", "protein2",
                            edge_attr="combined_score")

# Key network metrics
print(f"Nodes: {G.number_of_nodes()}, Edges: {G.number_of_edges()}")
print(f"Density: {nx.density(G):.4f}")
print(f"Components: {nx.number_connected_components(G)}")

# Find hub genes (top 10 by degree)
degree = dict(G.degree())
hubs = sorted(degree, key=degree.get, reverse=True)[:10]
print("Hub genes:", hubs)

# Betweenness centrality (bottleneck nodes)
betw = nx.betweenness_centrality(G)
bottlenecks = sorted(betw, key=betw.get, reverse=True)[:10]

# Community detection (Louvain)
from networkx.algorithms.community import louvain_communities
communities = louvain_communities(G, resolution=1.0)
print(f"Found {len(communities)} communities")
💡
DEG-PPI integration: A powerful approach is to overlay your differentially expressed genes onto a PPI network. This reveals which functional modules are disrupted in your condition. Steps: (1) Get DEGs from RNA-seq. (2) Query STRING with DEG list. (3) Import network into Cytoscape. (4) Colour nodes by log2FC (red = up, blue = down). (5) Run MCODE to find activated/repressed modules. (6) Run pathway enrichment on each module separately.
STRING for PPI data · Cytoscape for visualisation · Hub genes are often essential · MCODE for module detection · Overlay DEGs on PPI networks
14.2 Gene Regulatory Networks & Multi-Omics Integration

Gene Regulatory Networks (GRNs) describe how transcription factors (TFs) control the expression of target genes. Unlike PPI networks which show physical contacts, GRNs show regulatory relationships — which TFs activate or repress which genes. Multi-omics integration combines transcriptomics, proteomics, metabolomics, and epigenomics data to build comprehensive models of cellular regulation.

GRN Inference Methods

SCENIC / pySCENIC

Infers GRNs from single-cell RNA-seq data. Uses GRNBoost2 (gradient boosting) for co-expression inference, then validates with TF-binding motif enrichment (RcisTarget). Outputs regulon activity scores per cell.

Best for scRNA-seq
WGCNA

Weighted Gene Co-expression Network Analysis. Groups genes into co-expression modules, correlates modules with clinical traits. Excellent for bulk RNA-seq. Widely used, well-cited. R package.

ARACNe / VIPER

Mutual information-based network inference. VIPER then infers TF activity from target gene expression. Widely used in cancer biology for identifying master regulators.

CellOracle

Combines scRNA-seq + scATAC-seq to build context-specific GRNs. Simulates TF perturbations in silico. Predicts effects of gene knockouts on cell fate.

Multi-Omics Integration Strategies

Integrating multiple data types reveals biology invisible to any single assay:

Correlation-based

Correlate mRNA ↔ protein levels, mRNA ↔ metabolite levels. Discordant pairs (high mRNA, low protein) suggest post-transcriptional regulation. Tools: mixOmics (R), MOFA+.

Pathway-based

Map all omics data onto KEGG/Reactome pathways. Identify pathways with consistent changes across data types. Tools: Paintomics, iPath, OmicsNet.

Factor analysis (MOFA+)

Multi-Omics Factor Analysis decomposes multiple data matrices into shared latent factors. Identifies which sources of variation are shared vs unique to each data type.

Network-based

Construct multilayer networks connecting genes, proteins, metabolites. Network propagation algorithms (Random Walk with Restart) identify key regulators driving multi-level changes.
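The correlation-based strategy above is straightforward to prototype. A minimal Python sketch of a Spearman (rank) correlation between matched mRNA and protein measurements — the five values are hypothetical, and real integrative analyses would use mixOmics or MOFA+ on full matrices:

```python
import math

def rank(xs):
    """Assign ranks 1..n (assumes no ties)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    for rnk, i in enumerate(order, 1):
        ranks[i] = rnk
    return ranks

def spearman(x, y):
    """Spearman rho = Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = math.sqrt(sum((a - mx) ** 2 for a in rx)
                    * sum((b - my) ** 2 for b in ry))
    return num / den

# Hypothetical matched measurements for five genes
mrna    = [8.1, 5.2, 9.7, 3.3, 6.8]   # log2 expression
protein = [7.4, 4.9, 9.9, 2.1, 8.0]   # log2 abundance

print(f"rho = {spearman(mrna, protein):.2f}")
```

A high rho indicates concordant mRNA and protein levels; genes that break the trend (high mRNA, low protein) are the candidates for post-transcriptional regulation mentioned above.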

# ─── WGCNA in R (bulk RNA-seq co-expression networks) ───
library(WGCNA)

# Prepare normalised expression matrix (samples × genes — WGCNA expects
# samples as rows). Use VST-transformed counts from DESeq2.

# Choose soft-thresholding power (aim for scale-free topology R² > 0.8)
powers <- c(1:20)
sft <- pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)
# Plot: sft$fitIndices[,1] vs -sign(sft$fitIndices[,3]) * sft$fitIndices[,2]

# Build network and identify modules
net <- blockwiseModules(datExpr, power = 6, TOMType = "unsigned",
                        minModuleSize = 30, reassignThreshold = 0,
                        mergeCutHeight = 0.25, numericLabels = TRUE,
                        saveTOMs = TRUE)

# Correlate modules with clinical traits
moduleTraitCor <- cor(net$MEs, traits, use = "p")
moduleTraitPvalue <- corPvalueStudent(moduleTraitCor, nSamples)

# Find hub genes in a module of interest
module_genes <- names(datExpr)[net$colors == module_number]
kME <- signedKME(datExpr, net$MEs)
hub_genes <- module_genes[order(kME[module_genes, paste0("kME", module_number)],
                                decreasing = TRUE)[1:20]]
📌
The future is multi-modal single-cell: Technologies like CITE-seq (RNA + surface protein), Multiome (RNA + ATAC), SHARE-seq, and spatial transcriptomics with paired proteomics are enabling true multi-omics integration at single-cell resolution. Tools like MOFA+, WNN (Seurat v4), and scVI/totalVI are designed for these integrated analyses.
WGCNA for bulk co-expression · SCENIC for scRNA-seq GRNs · MOFA+ for multi-omics · Always validate key findings · Multi-modal single-cell is the future
🔐

Register for RNAflow

Fill in your details to receive a 50-day access code via email. Your information helps us improve RNAflow and keep you updated with new features.

🔒 Your information is private. We will never share your data with third parties.

Contact

Get in touch

Questions about a tool, a tutorial idea, a bug report, or want to collaborate? We'd love to hear from you.

Contact information

✉️
🐦
Twitter / X
📍
Location
Global project — contributors worldwide

Follow BioInfoCodex

Frequently asked questions

Quick answers before you send a message.

Is BioInfoCodex really free forever?
BioInfoCodex tools and tutorials are currently available to everyone. We are committed to supporting the research community; there are no subscriptions or paywalled features at this time.
Does my data get uploaded anywhere?
No. All computation happens on your own computer. The HTML app and Python server communicate only via localhost (127.0.0.1) — your data never leaves your machine. There is no server, no cloud, no analytics on your files.
Can I contribute a tool or tutorial?
Absolutely. We welcome contributions from researchers and developers. Open a pull request on GitHub, or send us an email describing what you'd like to add. All contributors are credited.
I found a bug — how do I report it?
The best way is to open a GitHub issue with a description of what you did, what you expected, and what happened (include any error messages). You can also use the contact form and we'll log it.
Can I cite BioInfoCodex in my paper?
We are preparing a manuscript for submission. In the meantime, you can cite the GitHub repository URL and version number. We'll post the citation format as soon as the preprint is on bioRxiv.

Send a message

Bug reports, feature requests, collaboration proposals, or just to say hello — all welcome.

Message sent!
We'll get back to you within 48 hours.