From raw FASTQ files to publication-ready figures in one guided workflow. Every command explained. Every result interpreted. No bioinformatics PhD required.
Each step generates ready-to-run commands with full explanations. Follow in order, or jump to any step — the app remembers where you left off.
Free, guided pipelines for every major bioinformatics workflow. Each tool has its own complete manual — click any tool to read the documentation, download, and get started.
Complete RNA-seq analysis pipeline — from raw FASTQ files to differential expression, volcano plots, heatmaps, PCA and gene family analysis. Works on 15+ organisms including all major crop plants.
Full-featured lab management platform with team access control, seat-based licensing, OneDrive storage, and admin panel. Install on your computer and connect with a LAB-XXXXX code.
Whole-genome and exome sequencing pipeline. Variant calling with GATK, SNP/INDEL annotation, structural variant detection, population genetics.
Plasmid design and annotation. Import GenBank files, identify restriction sites, annotate features, export circular maps for publication.
ChIP-seq and ATAC-seq pipeline. Peak calling with MACS2, motif analysis with HOMER, differential chromatin accessibility, BigWig tracks.
Metagenomics and 16S rRNA microbiome pipeline. Taxonomic classification, alpha/beta diversity, differential abundance with DESeq2.
Single-cell RNA-seq pipeline using Seurat. Cell clustering, UMAP visualisation, marker gene identification, trajectory analysis.
Mass spectrometry proteomics. Label-free quantification, TMT/iTRAQ, PTM analysis, pathway enrichment, volcano plots for protein data.
Complete RNA-seq analysis pipeline — from raw sequencing files to publication-ready figures. Works on 15+ organisms with a full manual explaining every step and every result.
RNAflow is a guided RNA-seq analysis application. Open it in any browser, enter your experiment details, and it generates every command you need — with explanations of what each command does and what good results look like. You can run commands automatically with one click, or copy them to your terminal manually.
RNA sequencing measures gene expression — which genes are active in a cell and by how much. Every cell contains the same DNA, but different cells (or the same cell under different conditions) activate different genes. RNA-seq captures a snapshot of this activity by sequencing the RNA molecules present at a specific moment.
Compare wildtype vs mutant, treated vs untreated, healthy vs diseased. RNAflow identifies statistically significant changes.
Fold change values show whether genes double, halve, or change even more dramatically in expression.
Gene family analysis groups DE genes into pathways — transcription factors, kinases, hormone signalling and more.
PCA and heatmaps reveal whether replicates cluster correctly and how distinct your conditions are.
RNAflow runs on your own computer. You need a Conda environment with bioinformatics tools and R for the statistical analysis steps.
RNAflow is an HTML file — no installation wizard needed. Download it, download the server file, and open it in your browser. The app guides you through installing all the bioinformatics tools.
Download both files and save them to the same folder (e.g. your Desktop):
Open RNAflow_App.html in your browser — you will see the BioInfoCodex RNAflow interface immediately.
Conda is a package manager that installs bioinformatics tools. If you already have Anaconda or Miniconda, skip this step. The app's System Check will detect it automatically.
RNAflow uses a dedicated Conda environment called rnaseq. This keeps all bioinformatics tools isolated from your other software.
This installs: FastQC (quality control), fastp (trimming), HISAT2 (alignment), samtools (BAM processing), featureCounts/subread (gene counting), fasterq-dump/sra-tools (data download), MultiQC (report generation).
R and RStudio are used for the DESeq2 statistical analysis and all plots. Download R from r-project.org and RStudio from posit.co/download/rstudio-desktop.
If you want to use Run mode (commands execute automatically with one click), start the server in a Terminal window before opening the app. Keep this window open while working.
You will see: RNAflow Local Server — ready · Listening on 127.0.0.1:7788
The server listens only on 127.0.0.1 (your own computer). It is completely invisible to any other computer on your network or the internet. Your data never leaves your machine.

Before running any commands, you set up your experiment details on the Configure page. RNAflow uses these settings to automatically generate every command throughout the pipeline.
Choose from 15 pre-loaded organisms or enter custom genome URLs. When you select an organism, the genome FASTA and GTF annotation URLs are automatically filled in with the correct Ensembl links.
Human, Mouse, Rat, Zebrafish, Drosophila, C. elegans, Yeast
Arabidopsis, Rice, Maize, Wheat, Soybean, Tomato, Yarrowia
Give meaningful names to your two conditions (e.g. wildtype and mutant, or control and drought_treated). Then add SRA accession numbers for each replicate.
This describes how the sequencing library was prepared. If unsure, check the SRA run page — it shows the "Layout" as SINGLE or PAIRED.
RNAflow downloads your raw sequencing files directly from NCBI SRA using fasterq-dump. The files are saved into your project folder automatically.
The command fasterq-dump connects to NCBI's servers and downloads compressed sequencing data, then decompresses it into FASTQ format. Each sample is typically 1–10 GB depending on the sequencing depth.
Downloaded files are saved in the raw_data/ folder. Each sample has one .fastq file (single-end) or two .fastq files ending in _1 and _2 (paired-end). Use the check command to confirm file sizes are reasonable (>100 MB per sample is normal).

FastQC examines your raw reads and produces a visual HTML report for each sample. Always run this before any analysis — it reveals problems that would invalidate your results if not addressed.
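The file-size sanity check can be sketched in a few lines of Python. This is our own illustration, not part of RNAflow — the folder name and 100 MB threshold simply mirror the rule of thumb above:

```python
from pathlib import Path

# Flag FASTQ files in raw_data/ that look suspiciously small.
# 100 MB mirrors the rule of thumb above; adjust for your sequencing depth.
MIN_BYTES = 100 * 1024 * 1024

def check_fastq_sizes(raw_dir="raw_data"):
    """Return {filename: size_in_bytes} for files below the threshold."""
    suspicious = {}
    for fq in sorted(Path(raw_dir).glob("*.fastq")):
        size = fq.stat().st_size
        if size < MIN_BYTES:
            suspicious[fq.name] = size
    return suspicious
```

A truncated download usually shows up here before it shows up as a cryptic alignment error three steps later.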
MultiQC combines all individual FastQC reports into one interactive HTML page. Open it in your browser after FastQC completes.
fastp removes adapter sequences, low-quality bases, and very short reads. It is fast, accurate, and works for both single-end and paired-end data automatically.
Adapter sequences are synthetic DNA added during library preparation — they are not part of your biological sample. If not removed, they interfere with alignment because the aligner cannot find a match for them in your genome. Low-quality bases at the ends of reads are also trimmed to improve alignment accuracy.
RNAflow uses these fastp settings — all chosen to be conservative and appropriate for most RNA-seq experiments:
The reference genome and gene annotation are downloaded from Ensembl, then HISAT2 builds an index — a compressed searchable version of the genome that makes alignment fast.
HISAT2 maps each sequencing read back to its location on the reference genome. It is splice-aware, meaning it correctly handles reads that span exon-exon junctions — essential for accurate RNA-seq alignment.
HISAT2 reports alignment statistics for each sample. These appear in the terminal output and are saved in your aligned/ folder as .log files.
After alignment, HISAT2 produces SAM files (text format). RNAflow then uses samtools to convert these to BAM (binary, 5× smaller), sort them by chromosome position, and index them. This is required by featureCounts and enables fast access to any region of the genome.
featureCounts counts how many reads aligned to each gene in each sample. This produces a count matrix — the input to DESeq2 for statistical analysis.
The output is a table where rows are genes and columns are samples. Each number is the count of reads that mapped to that gene in that sample.
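To make the count-matrix layout concrete, here is a toy version with invented numbers and a deliberately naive fold-change calculation. DESeq2 replaces this crude ratio with a normalised, variance-aware estimate — this sketch only shows what the matrix looks like:

```python
import math

# Toy count matrix: rows are genes, columns are samples (two per condition).
# These numbers are invented; real matrices come from featureCounts.
counts = {
    "geneA": {"ctrl_1": 100, "ctrl_2": 120, "trt_1": 420, "trt_2": 380},
    "geneB": {"ctrl_1": 500, "ctrl_2": 480, "trt_1": 510, "trt_2": 495},
}

def naive_log2fc(gene, ctrl=("ctrl_1", "ctrl_2"), trt=("trt_1", "trt_2")):
    """Crude log2 fold change of mean raw counts.
    No normalisation, no statistics -- DESeq2 does this properly."""
    c = sum(counts[gene][s] for s in ctrl) / len(ctrl)
    t = sum(counts[gene][s] for s in trt) / len(trt)
    return math.log2(t / c)
```

Here geneA shows a roughly 3.6-fold increase (log2FC near 1.9) while geneB barely moves — the kind of contrast the volcano plot later visualises genome-wide.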
DESeq2 is the gold-standard R package for RNA-seq differential expression analysis. It uses a negative binomial statistical model to identify genes that genuinely change between your conditions, accounting for biological variation.
After running DESeq2, the summary() function prints a table like this:
RNAflow generates three standard visualisations used in RNA-seq publications: a volcano plot, a heatmap, and a PCA plot. All are saved as 300 DPI PNG files suitable for direct use in papers and theses.
A volcano plot visualises all genes simultaneously — fold change on the x-axis and statistical significance on the y-axis. It gets its name from its characteristic shape.
The heatmap shows the top 40 most significant genes as a colour grid — each row is a gene, each column is a sample. Colour represents expression level (scaled across the row).
Principal Component Analysis (PCA) reduces the expression of thousands of genes to just 2 dimensions. It shows how similar or different your samples are to each other overall.
For large genomes (wheat, maize, human, soybean), researchers often want to focus on one biological family rather than all 30,000+ genes. RNAflow includes 40+ gene families across 8 categories.
The gene family analysis produces its own volcano plot and heatmap for just the selected family, saved separately from the full analysis results.
MYB, WRKY, bHLH, NAC, MADS, ERF, GRAS, bZIP, HOX, GATA, STAT, p53, NF-κB, ETS, SOX, FOX — 21 families
MAPK, CDK, RLK, SnRK, CDPK, TOR, tyrosine kinases — 8 families
Auxin, GA, cytokinin, ABA, ethylene, JA, SA, BR, strigolactone, peptide hormones — 10 pathways
ABC, aquaporin, sugar (SUT/SWEET), nitrate (NRT), phosphate, K+, metal (ZIP, YSL, NRAMP), SLC, ion channels
Wnt/β-catenin, Notch, Hedgehog, TGF-β, PI3K/AKT, RAS/ERK, JAK/STAT, NF-κB, Hippo — 9 pathways
TLR, NLR, cytokines, BCL-2, caspases, oncogenes, tumour suppressors, DNA methylation, histone modifiers
A quick reference guide for the most common questions researchers have when looking at their RNAflow results for the first time.
Once you have a list of significant DE genes, these are the typical next steps in a research project:
In your Methods section, cite each tool used in the pipeline. Example methods text:
15 organisms are pre-loaded with their Ensembl genome URLs. Select any from the dropdown in the Configure page — genome and GTF URLs fill automatically.
Solutions to the most common problems encountered during RNA-seq analysis with RNAflow.
Start the server with python3 rnaflow_server.py. Keep that Terminal window open. The green dot in the top bar will appear when connected. If you prefer not to use the server, use Mode B (Copy) to copy commands manually.

A full-featured lab management platform with team access control, seat-based licensing, OneDrive storage, and a desktop app for macOS and Linux.
Team members join using the lab code (e.g. LAB-84921) and register with their invite code.

Variant calling, SNP/INDEL annotation, structural variant detection, population genetics. Powered by GATK4, BWA-MEM2, and VEP.
Design plasmids, find restriction sites, annotate features, and export publication-ready circular maps. Import from GenBank.
Peak calling with MACS2, motif analysis with HOMER, differential chromatin accessibility, BigWig tracks for genome browsers.
16S rRNA and whole-metagenome shotgun analysis. Taxonomic classification, alpha/beta diversity, differential abundance with QIIME2 and Kraken2.
Single-cell RNA-seq analysis using Seurat. Cell clustering, UMAP visualisation, marker gene identification, trajectory analysis.
Label-free quantification, TMT/iTRAQ, post-translational modification analysis, pathway enrichment. Powered by MaxQuant.
A comprehensive university-level curriculum — from molecular biology fundamentals to advanced genomic data analysis. Written by researchers, for researchers.
Before writing a single command, you must understand what you are analysing. This module covers the biological foundations every bioinformatician needs — DNA structure, gene expression, the central dogma, and why bioinformatics exists.
Bioinformatics is an interdisciplinary science that develops and applies computational methods to analyse biological data. It sits at the intersection of biology, computer science, mathematics, and statistics. The field emerged because modern biology generates data at a scale that is impossible to analyse by hand — a single whole-genome sequencing run produces hundreds of gigabytes of raw data, and large consortia such as the UK Biobank have sequenced over 500,000 individuals.
The roots of bioinformatics trace back to the 1960s, when Margaret Dayhoff compiled the first protein sequence database — the Atlas of Protein Sequence and Structure — and developed the PAM substitution matrices still used in sequence alignment today. Needleman and Wunsch (1970) and Smith and Waterman (1981) then published their foundational sequence alignment algorithms. The launch of the Human Genome Project in 1990 was a watershed moment: a $3 billion, 13-year international effort that produced the first draft of the human genome in 2001. Since then, sequencing costs have plummeted — from roughly $100 million per genome in 2001 to under $200 today — falling faster than Moore's Law. This exponential cost reduction has democratised genomics and made bioinformatics skills essential for modern biological research.
Identify drug targets by finding genes upregulated in disease. AlphaFold predicted 200M+ protein structures, accelerating structure-based drug design.
Improve crop resistance by identifying beneficial genetic variants in plant genomes through GWAS and marker-assisted selection.
Track viral evolution in real time. SARS-CoV-2 variant tracking relied entirely on bioinformatics pipelines and GISAID.
A typical project follows a pipeline from experimental question to biological insight:
At each step, the bioinformatician must make choices: Which aligner to use? What quality thresholds? How to handle batch effects? The answers depend on the specific data type (RNA-seq, ChIP-seq, WGS, etc.), the organism, and the biological question. This course will walk you through each step.
Bioinformatics professionals are in very high demand. Common job titles include Bioinformatics Scientist, Computational Biologist, Genomics Data Analyst, and Research Software Engineer (Bioinformatics). Positions exist in academia, pharmaceutical companies, biotech startups, clinical diagnostics labs, and agricultural genomics companies.
Bioinformatics relies on an ecosystem of mostly open-source tools. You will use a combination of command-line tools on Linux, scripting languages (Python, R, Bash), and specialised software for specific analyses.
Francis Crick's central dogma (1958) describes the flow of genetic information: DNA → RNA → Protein. Every bioinformatics workflow targets one of these molecules. Understanding this flow at a molecular level is essential because the tools you use (aligners, quantifiers, variant callers) all make assumptions about the underlying biology.
DNA is a double-stranded helix held together by hydrogen bonds between complementary bases. The pairing rules are strict and fundamental to replication, transcription, and sequencing technology:
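The strict pairing rules (A with T, G with C) make reverse-complementing a constant operation in bioinformatics, since sequencing reads can come from either strand. A minimal sketch:

```python
# Watson-Crick pairing: A<->T and G<->C (N = unknown base).
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", "N": "N"}

def reverse_complement(seq):
    """Complement each base, then reverse, giving 5'->3' on the opposite strand."""
    return "".join(COMPLEMENT[b] for b in reversed(seq.upper()))
```

Aligners such as HISAT2 do exactly this internally when a read matches the minus strand.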
Transcription is the process of copying a gene's DNA sequence into messenger RNA (mRNA). It occurs in the nucleus of eukaryotic cells and involves several steps:
Translation occurs at ribosomes in the cytoplasm. The mRNA sequence is read in triplets called codons, each specifying one of 20 amino acids. The code is degenerate — most amino acids are encoded by multiple codons (e.g., Leucine has 6 codons). Key codons to know:
The reading frame matters enormously. A single insertion or deletion shifts the entire frame downstream (a frameshift mutation), usually producing a nonfunctional protein. This is why indels in coding regions are often more damaging than substitutions.
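Both ideas — triplet reading and frameshifts — can be demonstrated with a few lines of Python. The codon table below is deliberately a small subset of the real 64-entry standard code, just enough for the example:

```python
# Subset of the standard genetic code ('*' marks a stop codon).
# GGC and GGA both encode Glycine -- an example of degeneracy.
CODONS = {
    "ATG": "M", "AAA": "K", "AAG": "K",
    "GGC": "G", "GGA": "G", "TTT": "F", "TAA": "*",
}

def translate(dna):
    """Read the sequence in triplets from position 0, stopping at a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODONS.get(dna[i:i + 3], "X")  # X = codon outside our subset table
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

Translating "ATGAAAGGCTTTTAA" gives MKGF; deleting a single base early in the sequence shifts every downstream codon, so the rest of the protein changes — the frameshift effect described above.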
After translation, proteins are often chemically modified. These modifications can change protein activity, location, stability, and interactions. Common PTMs include phosphorylation (adding a phosphate group — central to cell signalling), glycosylation (adding sugar chains — important for secreted proteins), ubiquitination (tagging for degradation by the proteasome), and acetylation (especially on histones, regulating gene expression). Proteomics and mass spectrometry are the primary tools for studying PTMs.
RNA is chemically less stable than DNA for two reasons: (1) the 2'-OH group on the ribose sugar makes RNA susceptible to alkaline hydrolysis, and (2) RNA is single-stranded, making it vulnerable to RNases (ubiquitous enzymes that degrade RNA). This instability is why RNA extraction requires careful technique — RNase-free conditions, working on ice, and rapid processing. It also means RNA quality degrades over time, which is why we measure RIN (RNA Integrity Number) before sequencing.
RNA → DNA. Performed by reverse transcriptase in retroviruses (HIV) and retrotransposons. Also used in the lab to create cDNA from mRNA (the basis of RNA-seq library preparation).
RNA → RNA. RNA viruses (influenza, SARS-CoV-2) use RNA-dependent RNA polymerase (RdRp) to replicate their genomes. This enzyme is a drug target (e.g., remdesivir).
Bases in mRNA can be chemically modified after transcription (e.g., A-to-I editing by ADAR enzymes). This means the mRNA sequence can differ from the genomic DNA — a potential pitfall in variant calling.
Protein → Protein. Misfolded prion proteins can template the misfolding of normal proteins without any nucleic acid involvement (the cause of mad cow disease / CJD).
Not all RNA becomes protein. In fact, the vast majority of transcribed RNA in human cells is non-coding. Key types include: rRNA (ribosomal RNA — ~80% of total RNA, forms the ribosome structure), tRNA (transfer RNA — carries amino acids during translation), miRNA (microRNA — ~22 nt, silences gene expression post-transcriptionally), lncRNA (long non-coding RNA — >200 nt, diverse regulatory roles), and circRNA (circular RNA — resistant to degradation, potential biomarkers). These are increasingly studied in bioinformatics.
Understanding how genes are organised in the genome is essential for interpreting any bioinformatics result. When you load a BAM file in a genome browser, annotate variants, or run differential expression analysis, you are navigating genome structure. This section covers the physical and functional organisation of genomes, from the chromosome level down to individual regulatory elements.
In eukaryotes, genomic DNA is packaged with histone proteins into chromatin, which is further organised into chromosomes. Humans have 23 pairs of chromosomes (22 autosomes + 1 pair of sex chromosomes, XX or XY), totalling about 3.2 billion base pairs. During cell division, each chromosome is replicated, and the two copies (called sister chromatids) are joined at the centromere before being separated.
Loosely packed, transcriptionally active chromatin. Most genes are expressed from euchromatic regions. Appears light in chromosome staining.
Tightly packed, largely silent chromatin. Found at centromeres and telomeres. Contains repetitive sequences. Constitutive (always condensed) or facultative (conditionally silenced, e.g., the inactive X chromosome).
The C-value paradox refers to the observation that genome size does not correlate with organism complexity. Humans have 3.2 Gb but only ~20,000 protein-coding genes, while the onion has 16 Gb. The difference is largely due to repetitive elements — transposons, retrotransposons, and satellite DNA account for about 45% of the human genome. Only about 1.5% of the human genome actually encodes proteins.
Located immediately upstream of genes (within ~1 kb of the TSS). Contain elements like the TATA box (~25 bp upstream of the TSS) and CpG islands (GC-rich regions, often unmethylated in active promoters). RNA polymerase and transcription factors bind here to initiate transcription.
Can be located thousands or even millions of base pairs away from their target gene — upstream, downstream, or even within introns. They loop in 3D space to contact the promoter and boost transcription. Enhancer activity is tissue-specific, explaining why the same genome produces different cell types.
CpG islands are regions of at least 200 bp with a GC content above 50% and an observed/expected CpG ratio above 0.6. About 70% of human gene promoters are associated with CpG islands. In normal cells, most CpG islands in promoters are unmethylated, allowing gene expression. In cancer, aberrant hypermethylation of tumour suppressor gene promoters is a hallmark event. This is studied using bisulfite sequencing or methylation arrays (e.g., Illumina EPIC array).
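The CpG-island criteria above translate directly into code. This sketch applies the Gardiner-Garden and Frommer definition (length ≥ 200 bp, GC > 50%, observed/expected CpG > 0.6); the function names are ours:

```python
def cpg_stats(seq):
    """GC fraction and observed/expected CpG ratio.
    obs/exp = (count of CpG dinucleotides * length) / (count C * count G)."""
    seq = seq.upper()
    n, c, g = len(seq), seq.count("C"), seq.count("G")
    cpg = seq.count("CG")
    gc_frac = (c + g) / n
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_frac, obs_exp

def looks_like_cpg_island(seq):
    gc, ratio = cpg_stats(seq)
    return len(seq) >= 200 and gc > 0.5 and ratio > 0.6
```

Real CpG-island callers scan the genome with sliding windows, but the per-window test is exactly this arithmetic.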
Consider the human BRCA1 gene (associated with breast cancer risk). It spans about 81 kb on chromosome 17, contains 23 exons, and produces a protein of 1,863 amino acids. The coding sequence itself is only about 5,592 bp — just 7% of the gene's genomic span. The rest consists of introns, UTRs, and regulatory regions.
A single gene can produce multiple protein variants (isoforms) through alternative splicing of its pre-mRNA. The human genome has ~20,000 protein-coding genes but produces over 200,000 different mRNA transcripts. Common types of alternative splicing include:
The most common type in humans. An entire exon is excluded from the mature mRNA. For an extreme example of splicing-driven diversity, the Drosophila Dscam gene can produce over 38,000 isoforms — though largely through mutually exclusive exons rather than simple skipping.
A different splice site is used within the same exon, making the exon longer or shorter. Changes the protein sequence at the boundary.
An intron is not removed and remains in the mature mRNA. More common in plants and fungi. Can introduce premature stop codons.
One of two exons is included, but never both. The choice is regulated in a tissue-specific manner.
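A transcript can be modelled as an ordered list of exons, which makes the splicing types above easy to reason about in code. The exon sequences here are hypothetical, with lengths chosen to illustrate frame effects:

```python
# Hypothetical exons; E2 is a multiple of 3 nt long, E4 is not.
exons = {"E1": "ATGGCC", "E2": "AAAGGG", "E3": "TTTCCC", "E4": "AAAA"}

def splice(order):
    """Join exons in the given order into a mature mRNA."""
    return "".join(exons[e] for e in order)

def skipping_preserves_frame(exon):
    """Skipping an exon keeps the downstream reading frame
    iff the exon's length is divisible by 3."""
    return len(exons[exon]) % 3 == 0
```

This is why some exon-skipping events yield a shorter but functional protein (frame preserved), while others behave like frameshift mutations.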
Most bioinformatics data comes from Next-Generation Sequencing (NGS). Understanding what the machine actually does explains why your data looks the way it does — including error profiles, quality score distributions, read length limitations, and the types of biases you will encounter. The three major platforms each have distinct chemistries with different strengths and trade-offs.
Illumina is the dominant platform, generating the majority of the world's sequencing data. Here is the step-by-step process:
Every base in a FASTQ file has an associated PHRED quality score (Q) that represents the probability of an incorrect base call. The relationship is logarithmic:
In FASTQ files, quality scores are encoded as ASCII characters. For Illumina 1.8+ (Phred+33 encoding), the character ! = Q0, I = Q40. When you see a quality string like FFFFFFFF, each F represents Q37 — very high quality. Quality typically degrades toward the end of reads, which is why trimming the 3' ends is common practice.
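Both the logarithmic relationship and the Phred+33 encoding can be checked in a couple of lines:

```python
def phred_to_error(q):
    """P(error) = 10 ** (-Q / 10): Q20 = 1 in 100, Q30 = 1 in 1000."""
    return 10 ** (-q / 10)

def decode_quality(qual_string, offset=33):
    """Decode a Phred+33 quality string (Illumina 1.8+) to integer Q scores."""
    return [ord(ch) - offset for ch in qual_string]
```

Decoding "F" indeed gives Q37 (ASCII 70 minus 33), "!" gives Q0, and "I" gives Q40 — matching the encoding table above.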
| Feature | Illumina | PacBio HiFi | Oxford Nanopore |
| --- | --- | --- | --- |
| Read length | 75-300 bp | 10-25 kb | Up to 4 Mb |
| Accuracy | >99.9% (Q30+) | ~99.9% (HiFi mode) | ~95-99% (improving) |
| Throughput | Up to 6 Tb (NovaSeq X) | ~30 Gb per cell | ~50-300 Gb (PromethION) |
| Run time | 12-48 hours | ~24 hours | Minutes to days |
| Error type | Substitutions | Indels (rare in HiFi) | Indels (homopolymers) |
| Cost per Gb | ~$2-5 | ~$20-30 | ~$10-20 |
| Best for | RNA-seq, WGS, ChIP-seq, variant calling | Genome assembly, structural variants, isoforms | Field diagnostics, real-time, direct RNA, methylation |
• Differential gene expression (RNA-seq)
• Whole-genome variant calling
• ChIP-seq / ATAC-seq
• Large sample numbers (multiplexing)
• Budget is a concern
• De novo genome assembly
• Full-length transcript isoforms (IsoSeq)
• Structural variant detection
• Repetitive region resolution
• Phased diploid assembly
• Rapid pathogen identification
• Field work (MinION is portable)
• Direct RNA sequencing (no RT step)
• Epigenetic modifications (m5C, m6A)
• Ultra-long reads needed
Almost all bioinformatics tools run on Linux/macOS in the terminal. This module teaches you the computing skills you need before running any analysis — from basic commands to writing scripts.
You will spend 80% of your time in the terminal. These commands are non-negotiable knowledge. Linux (and macOS terminal) is the native environment for nearly all bioinformatics tools. Mastering the command line is not optional — it is the single most important skill for a bioinformatician.
rm -rf is permanent and irreversible. There is no Trash. Always double-check before running delete commands. Never run rm -rf / or rm -rf * without absolute certainty.

Linux file permissions control who can read, write, and execute files. Understanding permissions is essential because scripts need execute permission, and shared HPC directories require correct group permissions.
Pipes (|) and redirection (>, >>) are the heart of Unix philosophy — chain small tools together to accomplish complex tasks. This is how most bioinformatics one-liners work.
Bioinformatics jobs often run for hours. Knowing how to manage long-running processes is critical, especially when working over SSH.
Most bioinformatics work is done on remote servers or High-Performance Computing (HPC) clusters. SSH (Secure Shell) is how you connect:
Conda is the standard package manager for bioinformatics. It installs tools and manages dependencies so they don't conflict with each other. Proper environment management is one of the most important skills for reproducible research — if you cannot recreate your exact software environment, your results cannot be reproduced.
mamba is a faster drop-in replacement: just replace conda install with mamba install — everything else is the same.

Containers go beyond conda by packaging the entire operating system, libraries, and tools into a single portable image. This guarantees that the software runs identically on any machine. Docker is the standard for local development; Singularity (Apptainer) is the standard on HPC clusters because it does not require root privileges.
Organising your project directory consistently saves enormous time and prevents errors. Here is a recommended structure:
Track your analysis scripts with git. This provides a history of every change you made and allows collaboration. Do not track large data files — only scripts and small config files.
Many HPC clusters use the module system instead of (or alongside) conda to manage software. The module system allows administrators to install centrally-maintained software that users can load as needed.
When you have 20 samples, you don't run commands one by one. Bash scripts automate repetitive tasks — this is the single most time-saving skill in bioinformatics. A well-written script ensures every sample is processed identically, provides a record of exactly what was done, and can be re-run when parameters need to change.
By default, bash scripts continue running even when a command fails. This is dangerous — if trimming fails silently, you might align garbage data. Always start scripts with safety flags:
set -euo pipefail should be on line 2 of every bioinformatics script. Without it, you may not notice when a critical step fails, leading to incorrect results downstream.

On most HPC clusters, you do not run jobs directly. Instead, you submit them to a job scheduler (most commonly SLURM). SLURM allocates resources and runs your script when resources become available.
To process many samples with one submission, use a SLURM job array: add #SBATCH --array=1-20 and use $SLURM_ARRAY_TASK_ID to select the sample. As with any script, include set -euo pipefail to catch errors early.

R is the language of choice for statistical analysis, visualisation, and Bioconductor packages (DESeq2, edgeR, limma). You don't need to be a software engineer — a working knowledge of R is enough to run the statistical analyses and produce publication-quality plots that make up the second half of most bioinformatics projects.
RStudio is the standard IDE for R. It provides a console, script editor, plot viewer, file browser, and help panel in one interface. Install R first (from CRAN or via conda), then install RStudio Desktop (free). For HPC, you can use RStudio Server through a web browser.
The tidyverse is a collection of R packages (dplyr, tidyr, ggplot2, readr, etc.) that share a common design philosophy. The pipe operator (|> or %>%) chains operations together, making code readable.
ggplot2 is the most powerful plotting system in R. It uses a "grammar of graphics" — you build plots by adding layers (geom_point, geom_boxplot, etc.) to a base aesthetic mapping.
R Markdown lets you combine code, results, and narrative text in a single document. When you "knit" the document, R runs all the code and produces a formatted HTML or PDF report. This is the gold standard for reproducible analysis.
Public databases hold billions of sequences, structures, and experiments — and they are all free. Knowing which database to use and how to download data efficiently is an essential skill.
Public biological databases are one of the greatest resources in science. They contain billions of sequences, millions of structures, and thousands of curated experiments — all freely accessible. Knowing which database to search, how to interpret accession numbers, and how to programmatically retrieve data are essential skills for any bioinformatician.
National Center for Biotechnology Information. Houses GenBank (nucleotide sequences), SRA (raw sequencing data), PubMed (literature), RefSeq (curated reference sequences), and ClinVar (clinical variants). Start here for most searches. URL: ncbi.nlm.nih.gov
Best source for annotated reference genomes and GTF annotation files for 300+ species. Provides BioMart for bulk queries. Use this to download genome FASTA and GTF files. URL: ensembl.org
Comprehensive protein database. Swiss-Prot = manually curated (~570K entries, high quality). TrEMBL = computationally annotated (~250M entries). URL: uniprot.org
Protein Data Bank. 200,000+ experimentally determined 3D structures (X-ray, cryo-EM, NMR). Now complemented by AlphaFold DB with 200M+ predicted structures. URL: rcsb.org
Gene Expression Omnibus (processed data, metadata) and Sequence Read Archive (raw FASTQ files). Search by organism + condition + technology. Most published datasets are deposited here.
KEGG: metabolic pathways, drug targets, disease associations. Reactome: curated biological pathways and reactions, especially good for human data. Essential for pathway enrichment analysis.
Protein-protein interaction networks for 5000+ species. Input a gene list and get a functional interaction network. URL: string-db.org
Standardised vocabulary for gene function. Three domains: Biological Process (BP), Molecular Function (MF), Cellular Component (CC). Used in enrichment analysis. URL: geneontology.org
Every entry in a database has a unique accession number. Recognising the format tells you which database it comes from:
| Pattern | Example | Database | What It Is |
| --- | --- | --- | --- |
| NM_###### | NM_007294 | NCBI RefSeq | mRNA transcript |
| NP_###### | NP_009225 | NCBI RefSeq | Protein |
| ENSG########### | ENSG00000012048 | Ensembl | Gene (human) |
| ENST########### | ENST00000357654 | Ensembl | Transcript |
| P##### or Q##### | P38398 | UniProt Swiss-Prot | Curated protein |
| SRR####### | SRR1234567 | NCBI SRA | Sequencing run |
| GSE###### | GSE53697 | NCBI GEO | Dataset/Series |
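The patterns in the table above lend themselves to a small classifier. The regexes here are simplified sketches of our own (real formats allow variation, e.g. version suffixes like NM_007294.4, and UniProt's full accession rule is more elaborate):

```python
import re

# Simplified regex sketches of the accession formats tabulated above.
PATTERNS = [
    (r"^NM_\d+", "NCBI RefSeq mRNA"),
    (r"^NP_\d+", "NCBI RefSeq protein"),
    (r"^ENSG\d{11}", "Ensembl gene"),
    (r"^ENST\d{11}", "Ensembl transcript"),
    (r"^[PQ]\d[A-Z0-9]{4}$", "UniProt Swiss-Prot"),  # simplified rule
    (r"^SRR\d+", "NCBI SRA run"),
    (r"^GSE\d+", "NCBI GEO series"),
]

def classify_accession(acc):
    """Guess the source database from the accession's format."""
    for pattern, source in PATTERNS:
        if re.match(pattern, acc):
            return source
    return "unknown"
```

Recognising these formats at a glance saves time; automating the recognition saves scripts from silently mixing up identifier types.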
Databases are extensively cross-referenced. A gene page on NCBI Gene links to its RefSeq transcript (NM_), RefSeq protein (NP_), UniProt entry, Ensembl ID, associated GEO datasets, ClinVar variants, and PDB structures. Similarly, an Ensembl gene page links out to NCBI, UniProt, KEGG, and more. This interconnection means you can start from any database and navigate to the information you need.
For programmatic access, NCBI provides the E-utilities API — a set of URL-based tools for searching and retrieving records. This is essential for automating data retrieval in scripts.
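Because E-utilities queries are just URLs, building one is straightforward. A minimal sketch (the helper function is ours; esearch.fcgi and its db/term/retmax parameters are NCBI's):

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(db, term, retmax=20):
    """Build an esearch query URL: search a database, get matching IDs.
    Pair with efetch.fcgi to retrieve the records themselves."""
    return f"{EUTILS}/esearch.fcgi?" + urlencode(
        {"db": db, "term": term, "retmax": retmax}
    )
```

For example, esearch_url("gene", "BRCA1[gene] AND human[orgn]") yields a URL you can fetch with curl or urllib; NCBI asks heavy users to register for an API key and respect rate limits.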
Each step of a bioinformatics analysis produces a different file format. Recognising these formats, understanding their structure, and knowing how to convert between them is essential. Misinterpreting a file format is one of the most common causes of analysis errors.
The simplest sequence format. Each entry has a header line starting with > followed by one or more lines of sequence. Used for reference genomes, transcriptomes, and protein sequences.
grep -c "^>" file.fa counts the number of sequences. samtools faidx genome.fa creates an index (.fai) needed by many tools. The index file lists each sequence name, length, offset, line bases, and line bytes.

Raw sequencing reads. Every read is exactly 4 lines. FASTQ files are almost always gzip-compressed (.fastq.gz) to save space.
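The four-line record structure makes FASTQ easy to parse. A minimal sketch (quality characters decode as ASCII code minus 33 for Phred+33 data, the standard on modern Illumina machines):

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality_scores) from FASTQ text.
    Every record is exactly 4 lines: @id, sequence, '+', quality string."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)                          # the '+' separator line
        qual = next(it).strip()
        # Phred+33 encoding: quality = ASCII code - 33
        scores = [ord(c) - 33 for c in qual]
        yield header.strip()[1:], seq, scores

record = ["@read1", "ACGT", "+", "IIII"]  # 'I' encodes Phred quality 40
rid, seq, q = next(parse_fastq(record))
print(rid, seq, q)                        # read1 ACGT [40, 40, 40, 40]
```

For real files, open with gzip.open(path, "rt") and pass the handle in place of the list.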
Alignment files. SAM is human-readable text; BAM is the compressed binary equivalent (5-10x smaller); CRAM compresses even further by referencing the genome. Always store BAM, never SAM.
Gene annotation files define where genes, exons, transcripts, and other features are located in the genome. Essential for gene counting (featureCounts) and transcript quantification.
Variant Call Format — the standard for storing genetic variants. Covered in detail in Module 6.
Simple genomic interval format. Minimum 3 columns: chrom, start (0-based), end. Used for ChIP-seq peaks, exon coordinates, target regions, blacklisted regions.
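Because BED intervals are 0-based and half-open, length and overlap tests are one-liners. A quick sketch:

```python
def bed_length(start, end):
    """BED uses 0-based, half-open intervals: length is simply end - start."""
    return end - start

def overlaps(a, b):
    """Two half-open (start, end) intervals overlap iff each starts
    before the other ends."""
    return a[0] < b[1] and b[0] < a[1]

# chr1:0-100 in BED covers the first 100 bases of chr1
print(bed_length(0, 100))              # 100
print(overlaps((0, 100), (100, 200)))  # False: adjacent, not overlapping
print(overlaps((0, 100), (99, 200)))   # True: they share base 99
```

The half-open convention is why adjacent intervals never double-count a base, a frequent source of off-by-one errors when mixing BED with 1-based formats like GTF and VCF.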
Most bioinformatics projects begin by downloading data — either raw sequencing data from public repositories (to reproduce or extend a study) or reference genomes and annotations for alignment. This section covers how to find, download, and verify data efficiently.
Published papers deposit raw data in public repositories. To find the accession number: (1) Check the paper's Data Availability section — it will list a GEO accession (GSE######) or BioProject (PRJNA######). (2) Go to NCBI GEO and search for the GSE number. (3) Click "SRA Run Selector" at the bottom of the GEO page. (4) This gives you a table of all SRR accessions and their metadata. Download the Accession List text file.
The SRA Toolkit caches downloads under ~/ncbi/, which can fill your home directory — set vdb-config --set /repository/user/cache-disabled=true or use --temp /scratch/tmp.

Aspera is a high-speed transfer protocol that can be 10-50x faster than wget/ftp. NCBI and ENA both support it. The European Nucleotide Archive (ENA) is an alternative to SRA that is often faster, especially if you are in Europe, and provides pre-compressed FASTQ files directly.
Sequence comparison is the backbone of bioinformatics. Whether finding a gene's function, aligning reads, or building a phylogenetic tree — everything starts with comparing sequences.
BLAST (Basic Local Alignment Search Tool) is the single most important tool in bioinformatics. Developed at NCBI in 1990, it compares a query sequence against a database and returns statistically significant matches. Understand it deeply — you will use it almost every day.
BLAST does NOT compare every position. Instead it uses a three-step heuristic: (1) Word finding — breaks your query into short words (default 11 bp for blastn, 3 aa for blastp) and finds exact matches in the database. (2) Extension — extends each word match left and right, tracking a running score. Stops when the score drops significantly. (3) Evaluation — calculates E-value for each high-scoring segment pair (HSP). This heuristic makes BLAST fast — but it can miss very divergent homologues.
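The word-finding step can be sketched in a few lines of Python. This is a toy illustration of step 1 only; real BLAST adds neighbourhood words, drop-off scoring during extension, and E-value statistics:

```python
def seed_hits(query, subject, w=11):
    """Step 1 of the BLAST heuristic: find exact w-mer matches ('words')
    shared between query and subject. BLAST then extends each hit."""
    index = {}
    for i in range(len(subject) - w + 1):
        index.setdefault(subject[i:i + w], []).append(i)
    hits = []
    for j in range(len(query) - w + 1):
        for i in index.get(query[j:j + w], []):
            hits.append((j, i))   # (query offset, subject offset)
    return hits

hits = seed_hits("AAACCCGGGTTT", "TTAAACCCGGGTTT")
print(hits)   # [(0, 2), (1, 3)]: both 11-mers of the query found in the subject
```

Indexing the database words first is what makes the search fast; a region with no shared word is never examined, which is also why very divergent homologues can be missed.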
Nucleotide → nucleotide. Use for: verifying a cloned sequence, finding a gene in a new genome. Fast but only finds close matches. For very similar sequences use megablast (default on NCBI web).
Protein → protein. Much more sensitive than blastn for detecting distant homologues — amino acid conservation is stronger than nucleotide. Uses substitution matrices (BLOSUM62 default).
Translates nucleotide query in all 6 reading frames → searches protein database. Use when you have an unknown DNA sequence and want to find the protein it encodes.
Protein query → translated nucleotide database. Use for finding a gene in an unannotated genome — you have a known protein and want to find its genomic location.
Position-Specific Iterated BLAST. Iteratively builds a profile from initial hits and re-searches. Much more sensitive for distant homology. Use for: finding remote protein family members.
Not BLAST, but a modern alternative. 500–20,000× faster for blastx/blastp searches. Essential for metagenomics where you BLAST millions of reads. mamba install -c bioconda diamond
Key output fields in tabular format (-outfmt 6):
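The default tabular output has 12 tab-separated columns. A minimal parser, with field names following the standard -outfmt 6 defaults (the example hit is invented for illustration):

```python
# The 12 default columns of BLAST tabular output (-outfmt 6)
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def parse_outfmt6(line):
    """Parse one tab-separated BLAST hit into a dict with typed key metrics."""
    row = dict(zip(FIELDS, line.rstrip("\n").split("\t")))
    row["pident"] = float(row["pident"])      # percent identity
    row["evalue"] = float(row["evalue"])      # lower = more significant
    row["bitscore"] = float(row["bitscore"])  # higher = better alignment
    return row

hit = parse_outfmt6("q1\ts1\t98.5\t200\t3\t0\t1\t200\t50\t249\t1e-100\t370")
print(hit["pident"], hit["evalue"])   # 98.5 1e-100
```

A typical filter keeps hits with low evalue and high pident, then ranks by bitscore.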
Multiple Sequence Alignment (MSA) arranges three or more biological sequences so that corresponding positions (homologous residues) line up in columns. This reveals evolutionary conservation: columns where all sequences have the same residue are under strong selective pressure — they are functionally important. MSA is the foundation for phylogenetics, protein domain identification, and functional annotation.
An exact optimal MSA is computationally infeasible for more than ~10 sequences (NP-hard problem). All practical tools use heuristics:
Build a guide tree from pairwise distances, then align sequences in order of relatedness. Fast but early errors propagate.
Align progressively, then refine by re-aligning subgroups. Better accuracy. MAFFT L-INS-i is the gold standard for accuracy.
Uses all pairwise alignments to guide the MSA. Very accurate for small sets. Too slow for >200 sequences.
Gaps represent insertions or deletions (indels) in evolution. The gap opening penalty controls how reluctant the aligner is to insert a gap. The gap extension penalty controls how costly it is to make an existing gap longer. High opening + low extension → fewer, longer gaps (appropriate for eukaryotic proteins). Low opening → many small gaps (may over-fragment the alignment).
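Under an affine gap model the cost of a gap of length L is open + (L - 1) × extend, which is what biases aligners toward fewer, longer gaps. A quick sketch with illustrative penalty values (not any specific tool's defaults):

```python
def gap_cost(length, gap_open=10.0, gap_extend=0.5):
    """Affine gap penalty: opening is expensive, extending is cheap,
    so one long gap costs far less than many short ones of the same
    total length. Penalty values here are illustrative only."""
    return gap_open + (length - 1) * gap_extend

# One 10-residue gap vs. ten separate 1-residue gaps:
print(gap_cost(10))       # 14.5
print(10 * gap_cost(1))   # 100.0
```

With these values a single 10-residue indel is nearly 7x cheaper than ten scattered single-residue gaps, matching the biological expectation that one insertion event is more likely than ten independent ones.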
Best overall choice. Use --auto to let MAFFT choose strategy. For <200 sequences: mafft --maxiterate 1000 --localpair (L-INS-i, most accurate). For >10,000 sequences: mafft --retree 1.
Excellent accuracy, competitive with MAFFT. Newer "super5" algorithm scales well. Use muscle -super5 input.fa -output aligned.fa for large datasets.
A phylogenetic tree is a branching diagram that represents the evolutionary relationships among biological sequences or species. Trees are constructed from multiple sequence alignments and use mathematical models to infer how sequences have diverged over time. Understanding how to build, read, and interpret phylogenetic trees is a core skill in evolutionary biology, epidemiology, and comparative genomics.
Every tree has key structural elements. The root represents the most recent common ancestor. Internal nodes represent inferred ancestral sequences (speciation or divergence events). Tips (also called leaves) are the extant sequences in your analysis. Branch length represents the amount of evolutionary change — typically measured in substitutions per site. A branch of length 0.05 means 5 nucleotide substitutions per 100 sites have occurred along that branch.
Fast, distance-based heuristic. Computes a pairwise distance matrix and groups closest neighbours iteratively. Good for quick exploratory trees with many sequences. Available in MEGA, FastTree, and Clustal Omega.
Statistical optimisation approach that evaluates the probability of the observed alignment given a tree topology and substitution model. Best method for publication-quality trees. Use IQ-TREE2 or RAxML-NG.
Probabilistic framework using MCMC sampling to explore tree space. Produces posterior probability support values. MrBayes for phylogenetics, BEAST2 for molecular clock and divergence time estimation. Computationally expensive but powerful.
NJ: quick-and-dirty exploration (>500 sequences). ML: standard research publication. Bayesian: when you need divergence time estimates or posterior probabilities. For most projects, ML with IQ-TREE2 is the right choice.
An unrooted tree shows relationships but not the direction of evolution. To root a tree, include an outgroup — a sequence known to be more distantly related than any ingroup member. For example, when building a tree of mammalian species, include a reptile as the outgroup. The root is placed on the branch leading to the outgroup, which polarises all other relationships.
Bootstrap values assess the reliability of each node. The alignment is resampled with replacement many times (typically 1000), and each resampled dataset produces a new tree. The bootstrap value is the percentage of times a given clade appears across all resampled trees.
High confidence in this branching pattern. Reliable for biological conclusions.
Generally considered acceptable. Report with caution.
Unreliable. The relationship may change with more data or different methods.
The branching pattern is essentially random at this node. Do not interpret.
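The resampling step itself is simple to sketch: columns of the alignment are drawn with replacement to build each pseudo-replicate, and the phylogenetics software then infers one tree per replicate. A minimal Python illustration:

```python
import random

def bootstrap_columns(alignment, seed=0):
    """One bootstrap pseudo-replicate: resample alignment columns with
    replacement. The replicate has the same dimensions as the original,
    but some columns appear multiple times and others not at all."""
    rng = random.Random(seed)
    ncol = len(alignment[0])
    cols = [rng.randrange(ncol) for _ in range(ncol)]
    return ["".join(seq[c] for c in cols) for seq in alignment]

aln = ["ACGTAC", "ACGTTC", "ACGAAC"]   # toy alignment, one string per sequence
rep = bootstrap_columns(aln)
print(len(rep), len(rep[0]))           # 3 6, same shape as the input
```

Repeating this ~1000 times and counting how often each clade recurs across the replicate trees gives the bootstrap support values.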
Substitution models describe how nucleotides or amino acids change over evolutionary time. Choosing the right model is critical for accurate tree inference.
Simplest model. Assumes all substitutions are equally likely and all bases have equal frequency. Rarely realistic but useful as a baseline.
Most general model. Six substitution rate parameters and four base frequency parameters. Usually the best fit for real data. Often combined with +G (gamma rate variation) and +I (invariant sites).
Empirical amino acid substitution model based on globular protein alignments. Good general-purpose choice for protein trees.
Newer and generally better-fitting than WAG. Estimated from a larger, more diverse set of protein alignments. Recommended as the default for protein phylogenetics.
In IQ-TREE, use the -m TEST flag (ModelFinder) to automatically test all models and select the best one by BIC. Never guess — always let the software choose.

RNA sequencing (RNA-seq) measures gene expression across the entire transcriptome. This module takes you through every step of the pipeline with detailed explanations of what each tool does and why.
No amount of computational analysis can rescue a poorly designed experiment. Experimental design decisions — replication, randomisation, library preparation — are made before any sequencing occurs and cannot be corrected downstream. This section covers the critical choices that determine whether your RNA-seq experiment will produce reliable, publishable results or generate data that no statistical method can salvage.
Statistical power is the probability of detecting a true difference when one exists. Underpowered studies miss real biology; overpowered studies waste resources. Use the RNASeqPower R package to estimate sample size before your experiment.
Independent biological samples (different mice, different cell culture flasks, different patients). These capture true biological variability and are essential for statistical inference. Minimum 3 per condition, ideally 5–6.
The same sample sequenced multiple times, or the same library prepared twice. These only measure technical noise (which is very low in modern RNA-seq). Technical replicates do NOT substitute for biological replicates and are generally a waste of sequencing budget.
Batch effects are systematic non-biological variations caused by processing samples at different times, on different machines, or by different technicians. They can completely confound your results if not properly handled.
Use a randomised block design: assign samples to batches randomly while ensuring each batch contains samples from every condition. Record all batch information (date, technician, lane, flowcell) in your metadata — you can correct for known batches computationally, but only if you recorded them.
RNA integrity is measured by the RIN (RNA Integrity Number), scored 1–10 by a Bioanalyzer or TapeStation. Degraded RNA introduces systematic 3' bias because fragmented transcripts lose their 5' ends preferentially.
Good quality. Suitable for standard RNA-seq with poly-A selection. This is the minimum for most experiments.
Partially degraded. Use rRNA depletion instead of poly-A selection. Consider FFPE-optimised protocols if working with clinical samples.
Poly-A captures only mRNA (coding genes). Cheaper, cleaner signal. rRNA depletion captures mRNA + lncRNA + other non-coding RNA. Use rRNA depletion for degraded samples or when non-coding RNA is of interest.
SE reads one end of each fragment. PE reads both ends. PE is recommended for: novel isoform discovery, structural variant detection, and improved alignment to repetitive regions. For standard differential expression, SE 50bp is often sufficient and cheaper.
If you measure the same patient before and after treatment, use a paired design (~ patient + treatment in DESeq2). Paired designs are more powerful because they control for inter-individual variation.
Multiple samples pooled on one lane with unique index barcodes. Reduces cost and minimises lane-to-lane batch effects. Ensure all conditions are represented on each lane. Use dual-index barcodes to prevent index hopping on patterned flowcells.
Quality control is the first analytical step in any sequencing project. Before aligning reads, you must assess their quality to identify problems such as adapter contamination, low-quality bases, and library preparation artefacts. FastQC generates a per-sample report, MultiQC aggregates all reports for easy comparison, and trimming tools remove problematic sequences. Skipping QC risks propagating errors through the entire downstream analysis.
The MultiQC report shows all samples side-by-side. Look for outlier samples that differ dramatically from the rest — these may indicate contamination, sample mix-ups, or failed library preparation. The General Statistics table shows total reads, % duplicates, and % GC per sample.
Should stay in the green zone (>Q28). Quality commonly drops at the 3' end — this is normal for Illumina chemistry. If quality drops below Q20 before position 50, investigate the sequencing run. Action: Trim low-quality 3' bases if they fall below Q20.
Distribution of mean quality per read. Should be a single peak above Q30. A bimodal distribution or a shoulder below Q20 means a subset of reads are poor quality. Action: Filter reads with mean quality below Q20.
Non-uniform base composition at the first 10 positions is normal and expected for RNA-seq. It is caused by random hexamer priming bias during cDNA synthesis. Action: No action needed. Do NOT trim these bases.
Adapter sequences detected in reads. This happens when insert size is shorter than read length. Action: You MUST trim adapters before alignment. Use Trim Galore or fastp.
A WARN here is completely normal for RNA-seq. Highly expressed genes (ribosomal proteins, housekeeping genes) naturally produce many identical reads. Only worry if duplication exceeds 80% or if you see it in DNA-seq data. Action: None for RNA-seq.
Shows sequences that make up >0.1% of all reads. Common culprits: adapter dimers, rRNA contamination, or highly expressed transcripts. Action: If rRNA is dominant, your rRNA depletion failed. If adapter dimers are present, trim aggressively.
Wrapper around Cutadapt. Auto-detects adapter type. Simple to use. Good default choice. Slightly slower than fastp.
All-in-one tool: QC + trimming + filtering in a single pass. Very fast (written in C++). Generates its own HTML report. Increasingly popular as a replacement for FastQC + Trim Galore.
Java-based. Flexible with many options. Older tool. Requires specifying adapter sequences manually. Still widely used but fastp is generally preferred now.
If using pseudoalignment (Salmon or kallisto), trimming is usually unnecessary — these tools handle adapters and low-quality bases internally. Trimming can even reduce accuracy by creating artificially short fragments.
Alignment (or mapping) is the process of determining where each sequencing read originated in the reference genome. For RNA-seq, this is complicated by splicing — a single read may span an exon-exon junction, so standard DNA aligners will fail. Splice-aware aligners like HISAT2 and STAR can split a read across introns to find the correct genomic position. The output is a BAM file containing the coordinates and quality of each alignment.
Memory-efficient (~8 GB RAM for human genome), fast, and accurate. Uses a graph-based FM index that incorporates known splice sites and SNPs. Excellent for most organisms. Recommended when RAM is limited or for non-model organisms.
Recommended. Slightly more accurate, especially for novel splice junctions, but requires ~30 GB RAM for the human genome. Required by some downstream tools (e.g., STAR-Fusion for fusion gene detection). Gold standard in ENCODE and GTEx projects.
SAM (Sequence Alignment/Map) is a tab-delimited text format. BAM is its compressed binary equivalent. Each alignment line has 11 mandatory fields:
The FLAG field is a bitwise combination of properties. Common flags:
= 1 (paired) + 2 (proper pair) + 16 (reverse strand) + 64 (first in pair). A properly paired read on the reverse strand, first mate.
= 1 (paired) + 2 (proper pair) + 32 (mate reverse) + 128 (second in pair). A properly paired read on forward strand, second mate.
= unmapped read. These reads did not align to the reference genome. Filter them with samtools view -F 4.
= secondary alignment. The read mapped to multiple locations. This is the non-primary alignment. Filter with samtools view -F 256.
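Because the FLAG is a bitwise OR of the properties above, it can be decoded by testing each bit. A small sketch using the bit meanings from the SAM specification:

```python
SAM_FLAGS = {
    1: "paired", 2: "proper pair", 4: "unmapped", 8: "mate unmapped",
    16: "reverse strand", 32: "mate reverse strand",
    64: "first in pair", 128: "second in pair",
    256: "secondary alignment", 512: "fails QC",
    1024: "PCR duplicate", 2048: "supplementary alignment",
}

def decode_flag(flag):
    """List the properties set in a SAM FLAG by testing each bit."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]

print(decode_flag(83))    # ['paired', 'proper pair', 'reverse strand', 'first in pair']
print(decode_flag(163))   # ['paired', 'proper pair', 'mate reverse strand', 'second in pair']
```

This is the same decoding that samtools flags and the web tool "Explain SAM flags" perform; samtools view -f selects reads with a bit set, -F excludes them.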
Always visually inspect alignments for a few genes of interest using the Integrative Genomics Viewer (IGV). Load your BAM file and the reference genome, then navigate to specific genes. Look for: even coverage across exons, clean splice junctions, and absence of systematic artefacts.
Gene counting is the process of assigning each aligned read to a gene based on its overlap with annotated gene coordinates in a GTF/GFF file. The result is a count matrix — a table where rows are genes, columns are samples, and values are integer read counts. This matrix is the fundamental input for differential expression analysis with DESeq2 or edgeR. Getting the counting step right is critical because errors here propagate to every downstream result.
The counting tool examines each aligned read and checks whether it overlaps with an annotated gene feature (usually exons). If a read falls entirely within one gene's exons, it is assigned to that gene. If a read overlaps two genes (an ambiguous read), featureCounts discards it by default. HTSeq-count offers similar behaviour with its "union" mode. Reads that do not overlap any annotated feature are counted as "unassigned."
Counts reads per gene (collapsing all isoforms). Used by featureCounts and HTSeq-count. Simpler, more robust. Standard for differential expression analysis.
Estimates abundance per transcript isoform. Used by Salmon and kallisto (pseudoalignment). More information but more complex. Can be summarised to gene level with tximport.
Using the wrong -s (strandedness) setting can cause ~50% read loss or incorrect counting. Use -s 0 (unstranded), -s 1 (stranded-forward), or -s 2 (stranded-reverse). Check your library preparation kit documentation. Most modern kits (Illumina TruSeq Stranded, NEBNext) are reverse-stranded (-s 2). If unsure, run featureCounts with all three settings on one sample and compare assignment rates — the correct setting will have the highest "Assigned" count.

Extremely fast (processes millions of reads per minute). Multi-threaded. Can process multiple BAM files in one command. Recommended for most use cases.
Python-based, slower. Processes one BAM file at a time. Offers fine-grained control over ambiguous read handling (union, intersection-strict, intersection-nonempty modes). Used historically but featureCounts is preferred.
Salmon and kallisto use pseudoalignment — they do not perform traditional alignment to the genome. Instead, they map reads directly to the transcriptome (cDNA sequences), skipping the genome alignment step entirely. This is dramatically faster (minutes instead of hours) and directly estimates transcript-level abundances.
DESeq2 is the gold-standard R package for identifying genes that are significantly more or less expressed between experimental conditions. It models RNA-seq count data using the negative binomial distribution, which accounts for both the mean expression level and the biological variability (dispersion) between replicates. Unlike simpler models (Poisson), the negative binomial captures the overdispersion that is characteristic of real biological data — where variance exceeds the mean.
DESeq2 performs three key steps internally: (1) Size factor estimation — normalises for differences in library size using the median-of-ratios method. (2) Dispersion estimation — estimates gene-wise dispersion with empirical Bayes shrinkage toward a fitted trend, borrowing strength across genes. (3) Wald test — tests whether the log2 fold change for each gene is significantly different from zero. The design formula (e.g., ~ condition) specifies the experimental factors to model.
When your experiment has known batch effects or additional covariates, include them in the design formula. DESeq2 will model their effect and remove it before testing the variable of interest.
Raw log2 fold changes from DESeq2 can be noisy for low-count genes. LFC shrinkage produces more accurate, regularised fold change estimates — especially important for ranking genes and for visualisation.
log2FoldChange = direction and magnitude (+1 = 2x higher in treated, -1 = 2x lower). padj = Benjamini-Hochberg adjusted p-value (FDR). Always use padj < 0.05, never the raw pvalue. Combine with a fold-change threshold (|log2FC| > 1) for biological relevance.

Whole-genome and whole-exome sequencing reveal the genetic blueprint of an organism. This module covers variant calling, genome assembly, and annotation.
Variant calling is the process of identifying positions in a sequenced genome where the sample differs from the reference. These differences include SNPs (single nucleotide polymorphisms — one base changed to another), indels (insertions or deletions of one or more bases), and MNVs (multi-nucleotide variants — multiple adjacent substitutions). Variant calling is the foundation of clinical genomics, population genetics, and disease gene discovery.
A single base change: e.g., reference A → alternative G. The most common variant type. ~4–5 million SNPs per human genome.
Insertion of extra bases or deletion of reference bases. Short indels (1–50bp) are called by GATK. Larger structural variants require specialised tools (Manta, DELLY).
Both copies of the chromosome carry the reference allele. This position is not a variant in this individual.
One chromosome carries the reference allele, the other carries the alternative. Expect ~50% of reads supporting each allele.
Both chromosomes carry the alternative allele. Expect ~100% of reads supporting the ALT allele.
Two or more adjacent bases changed simultaneously (e.g., AG → TC). Often caused by a single mutational event. Correctly phasing these is important for protein impact prediction.
During library preparation, PCR amplification creates identical copies of DNA fragments. These PCR duplicates are not independent observations — they all come from the same original molecule. If not removed, they artificially inflate the number of reads supporting a variant, leading to false-positive variant calls. Duplicate marking identifies reads with identical start and end coordinates and flags all but one as duplicates.
GATK's HaplotypeCaller does not simply count bases at each position. Instead, it performs local de novo assembly in active regions (regions with evidence of variation). It: (1) identifies active regions where the data deviates from the reference, (2) assembles the reads in that region into candidate haplotypes, (3) realigns reads to these haplotypes using a pair-HMM, and (4) assigns genotype likelihoods. This approach is more accurate than simple pileup-based callers, especially for indels and complex variants.
Apply fixed thresholds on quality metrics (QD, MQ, FS, SOR). Simple and works for any project size. Use for small cohorts (<30 samples), non-model organisms, or targeted panels.
Machine learning approach that learns the quality profile from known true variants (e.g., HapMap, 1000 Genomes). More accurate but requires large cohorts (>30 whole genomes) and curated training resources.
Variant quality normalised by depth. Low QD (<2) means the variant quality is low relative to coverage — likely a false positive.
RMS mapping quality of reads at the site. Low MQ (<40) indicates reads are mapping to multiple locations — the region is ambiguous.
Strand bias estimated by Fisher's exact test. High FS (>60 for SNPs) means the variant is only supported by reads on one strand — likely an artefact.
Alternative strand bias metric. More robust than FS for high-coverage data. SOR > 3 indicates significant strand bias.
Always visually inspect candidate variants in the Integrative Genomics Viewer (IGV). Load the BAM file and VCF together, navigate to a variant position, and check: (1) Is there sufficient read depth? (2) Are both strands represented? (3) Are the variant-supporting reads properly mapped (no soft-clipping artefacts)? (4) Is the region not in a repeat or low-complexity sequence?
The Variant Call Format (VCF) is the standard file format for storing genetic variants discovered by variant calling pipelines. Understanding the VCF format in detail is essential for filtering, annotating, and interpreting variant data. A VCF file consists of header lines (metadata) followed by data lines (one variant per line), and can contain genotype information for one or many samples.
A VCF file has two sections: the header (lines starting with ##) and the data lines (tab-separated variant records). The header defines the meaning of INFO and FORMAT fields used in the data lines.
Let us dissect the example line above field by field:
Chromosome where the variant is located.
1-based position on the reference. For indels, this is the position of the base before the event.
dbSNP identifier if known, otherwise "." (missing). Links to public variant databases.
Reference allele and alternative allele. For SNPs: single bases. For a deletion: REF=ACG, ALT=A (CG deleted). For an insertion: REF=A, ALT=ATCG (TCG inserted).
Phred-scaled quality score. QUAL 30 = 1/1000 chance the variant is wrong. QUAL 5000 = extremely confident. Higher is better.
PASS means the variant passed all quality filters. Other values (e.g., "LowQD") indicate which filter(s) it failed.
The FORMAT column defines the order of fields in the sample column(s). In our example: GT:GQ:DP:AD → 0/1:99:45:22,23
Genotype. 0 = reference allele, 1 = first ALT allele. So 0/1 = heterozygous. 0/0 = homozygous reference. 1/1 = homozygous alternative. 0|1 = phased heterozygous.
Genotype quality. Phred-scaled confidence that the genotype call is correct. GQ 99 = very high confidence. Low GQ (<20) means the genotype is uncertain.
Read depth at this position for this sample. For WGS, typically 30x. For WES, typically 50–100x. Low depth (<10) = unreliable genotype.
Allelic depths. 22 reads support REF (G), 23 reads support ALT (A). For a heterozygous call, expect roughly 50:50. Severe imbalance may indicate an artefact.
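Putting the FORMAT fields together, a per-sample column can be parsed by zipping the keys with the values. The allele_balance helper below is illustrative, not a standard VCF field:

```python
def parse_sample(format_col, sample_col):
    """Zip FORMAT keys with per-sample values: 'GT:GQ:DP:AD' -> dict.
    Adds an illustrative allele-balance check from the AD field."""
    fields = dict(zip(format_col.split(":"), sample_col.split(":")))
    ref_depth, alt_depth = (int(x) for x in fields["AD"].split(","))
    # For a true heterozygote, expect roughly 0.5
    fields["allele_balance"] = alt_depth / (ref_depth + alt_depth)
    return fields

s = parse_sample("GT:GQ:DP:AD", "0/1:99:45:22,23")
print(s["GT"], round(s["allele_balance"], 2))   # 0/1 0.51
```

Real pipelines should use a dedicated parser such as bcftools or pysam/cyvcf2 rather than hand-rolled splitting, but the field logic is exactly this.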
SnpEff annotates each variant with its predicted functional impact on genes, transcripts, and proteins. It adds an ANN field to the INFO column.
DNA change that does NOT alter the amino acid (e.g., GCC→GCT both encode Alanine). Usually benign due to codon degeneracy.
DNA change that alters one amino acid (e.g., GAG→GTG: Glutamate→Valine). May or may not affect protein function depending on the position and amino acid properties.
Creates a premature stop codon (e.g., CAG→TAG: Glutamine→Stop). Truncates the protein. Often loss-of-function. Subject to nonsense-mediated decay (NMD).
Insertion or deletion not divisible by 3. Shifts the reading frame, altering all downstream amino acids and usually creating a premature stop. Almost always loss-of-function.
Variant at an exon-intron boundary (donor GT or acceptor AG). Disrupts normal splicing, potentially causing exon skipping or intron retention. Often pathogenic.
Insertion or deletion divisible by 3. Adds or removes amino acids without shifting the reading frame. Effect depends on the protein domain affected.
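The frameshift rule reduces to arithmetic on allele lengths: a REF/ALT length difference not divisible by 3 shifts the reading frame. A sketch using VCF-style alleles (the helper name is illustrative):

```python
def indel_effect(ref, alt):
    """Classify an indel from VCF REF/ALT alleles: a length difference
    that is not a multiple of 3 shifts the reading frame."""
    diff = abs(len(ref) - len(alt))
    if diff == 0:
        return "not an indel"
    return "frameshift" if diff % 3 else "in-frame indel"

print(indel_effect("ACG", "A"))    # frameshift     (2 bp deleted)
print(indel_effect("A", "ATCG"))   # in-frame indel (3 bp inserted)
print(indel_effect("G", "A"))      # not an indel   (a SNP)
```

This only classifies coding-region indels by length; tools like SnpEff additionally check transcript coordinates, splice sites, and the resulting protein sequence.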
NCBI database of variants with clinical significance (pathogenic, likely pathogenic, benign, VUS). The primary resource for clinical variant interpretation. Query by rsID or genomic coordinates.
Population allele frequencies from >140,000 exomes and >76,000 genomes. A variant with AF > 1% in gnomAD is almost certainly benign. Essential for filtering common polymorphisms.
Bad statistics ruin good biology. This module covers the statistical concepts every bioinformatician must understand — from p-values and FDR to normalisation and dimensionality reduction.
Understanding p-values, multiple testing correction, and the False Discovery Rate (FDR) is arguably the single most important statistical concept in genomics. Misinterpreting these values leads to publishing false discoveries or missing real biology. This section provides the intuition, the mathematics, and the practical application you need to correctly interpret differential expression results.
Consider a typical RNA-seq experiment where you test 20,000 genes for differential expression. Suppose 1,000 genes are truly differentially expressed and 19,000 are not.
Of the 19,000 null genes, 5% will produce p < 0.05 by chance = 950 false positives. Of the 1,000 true DEGs (assuming 80% power), you detect ~800. Your "significant" list has 800 + 950 = 1,750 genes, but 54% are false. You cannot tell which are real.
Benjamini-Hochberg correction ensures that among your "significant" genes, at most 5% are false positives. If you call 750 genes significant, you expect ~38 false discoveries. This is manageable and scientifically defensible.
The BH procedure adjusts p-values by ranking them:
1. Rank all m p-values from smallest to largest: p(1) ≤ p(2) ≤ ... ≤ p(m)
2. For each gene at rank i, compare its p-value to the threshold: i/m × q (where q is your desired FDR, e.g., 0.05)
3. Find the largest rank k where p(k) ≤ k/m × q
4. All genes with rank ≤ k are declared significant
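The four steps above translate directly into code. A plain-Python sketch (DESeq2 and R's p.adjust implement the same logic, reporting adjusted p-values rather than a boolean cut):

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return a per-gene 'significant' flag implementing the BH steps:
    rank p-values, find the largest rank k with p(k) <= k/m * q,
    and declare everything at rank <= k significant."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * q:
            k = rank
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k:
            significant[idx] = True
    return significant

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(pvals))
# Only the two smallest p-values survive at q = 0.05: the thresholds for
# ranks 1..10 are 0.005, 0.010, 0.015, ..., and p(3) = 0.039 > 0.015.
```

Note that BH compares each p-value against a rank-dependent threshold, so a gene can be significant even though its raw p-value exceeds q/m (the Bonferroni cutoff).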
Controls the proportion of false discoveries among all rejected hypotheses. "Of the genes I call significant, at most 5% are false." Used for: RNA-seq, ChIP-seq, proteomics — any discovery-oriented experiment where some false positives are tolerable.
Controls the probability of even one false positive. Much more conservative. Bonferroni: divide alpha by the number of tests (0.05/20,000 = 2.5×10⁻⁶). Used for: GWAS, where a single false positive could lead to expensive follow-up studies.
The pvalue column in DESeq2 results is the raw, uncorrected p-value. The padj column is the BH-adjusted p-value. Always use padj for determining significance. If you use raw p-values, you are ignoring the multiple testing problem and your results will be full of false positives.
DESeq2 automatically performs independent filtering: it removes genes with very low mean expression before applying multiple testing correction. Why? Genes with almost zero counts across all samples have no chance of being significant, but they consume part of the multiple testing budget, reducing power for the remaining genes. By filtering them out, DESeq2 increases statistical power for the genes that matter.
Statistical significance (small padj) does not equal biological significance. With enough replicates, even tiny fold changes become statistically significant. A gene with padj = 0.0001 but log2FC = 0.1 is a 7% expression change — real but unlikely to be biologically meaningful. Always apply both a significance threshold (padj < 0.05) and an effect size threshold (|log2FC| > 1, i.e., 2-fold change) when defining your DEG list.
Raw read counts cannot be compared directly across samples or genes because they are confounded by two major technical factors: library size (total number of reads sequenced) and gene length (longer genes capture more reads). Normalisation removes these technical biases so that observed differences reflect true biological variation. Choosing the wrong normalisation method for your analysis is one of the most common errors in bioinformatics.
Suppose Gene X has 100 raw counts in both Sample A and Sample B. It looks like equal expression. But Sample A has 10 million total reads and Sample B has 20 million total reads. Gene X consumed 100/10,000,000 = 0.001% of Sample A's reads, but only 100/20,000,000 = 0.0005% of Sample B's. Gene X is actually expressed at half the level in Sample B. Without normalisation, you would completely miss this 2-fold difference.
Corrects for library size only. Formula: CPM = (raw count / total counts) × 10⁶. Use for within-gene comparisons across samples. Does NOT correct for gene length, so cannot compare expression between different genes.
Corrects for gene length first, then library size. Formula: (1) Divide counts by gene length in kb → RPK. (2) Sum all RPK values and divide by 10⁶ → per-million scaling factor. (3) Divide each RPK by this scaling factor. TPM values sum to 1 million per sample, making samples directly comparable. Current standard for expression visualisation.
RPKM = Reads Per Kilobase per Million mapped reads. FPKM = the paired-end equivalent (Fragments instead of Reads). Formula: RPKM = (raw count × 10⁹) / (total counts × gene length in bp). Problem: RPKM values do NOT sum to the same value across samples, making cross-sample comparison unreliable. Superseded by TPM.
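The three formulas can be compared side by side on a toy sample (counts and gene lengths invented). TPM sums to exactly one million per sample; RPKM does not:

```python
# Toy example: 3 genes, one sample. Counts and lengths are invented.
counts = [100, 300, 600]   # raw read counts
len_kb = [1.0, 3.0, 2.0]   # gene lengths in kilobases

total = sum(counts)

# CPM: library-size correction only
cpm = [c / total * 1e6 for c in counts]

# TPM: gene length first, then library size
rpk = [c / l for c, l in zip(counts, len_kb)]
tpm = [r / sum(rpk) * 1e6 for r in rpk]

# RPKM: library size first, then gene length (length in bp = kb * 1000)
rpkm = [c * 1e9 / (total * l * 1000) for c, l in zip(counts, len_kb)]
```

Here `sum(tpm)` is 1,000,000 while `sum(rpkm)` is 500,000 and varies from sample to sample, which is exactly why RPKM is unreliable for cross-sample comparison.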
Computes a per-sample scaling factor that accounts for library size AND composition bias (when a few highly expressed genes consume a disproportionate fraction of reads). Does NOT correct for gene length. Used internally by DESeq2 for differential expression.
The median-of-ratios method works as follows: (1) For each gene, compute the geometric mean across all samples. (2) For each sample, divide each gene's count by this geometric mean. (3) The median of these ratios is the size factor for that sample. Using the median makes this robust to outlier genes (e.g., one massively upregulated gene does not skew the normalisation).
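The three steps above translate directly into code. A pure-Python sketch on a two-sample toy matrix (in practice DESeq2's estimateSizeFactors does this for you):

```python
from math import log, exp
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors, DESeq2-style (toy sketch).

    `counts` is a list of samples, each a list of per-gene raw counts
    (genes in the same order in every sample).
    """
    n_genes = len(counts[0])
    # (1) geometric mean of each gene across samples (skip genes with a zero)
    geo = []
    for g in range(n_genes):
        vals = [s[g] for s in counts]
        geo.append(exp(sum(log(v) for v in vals) / len(vals))
                   if all(v > 0 for v in vals) else None)
    # (2)-(3) per-sample median of the count/geometric-mean ratios
    factors = []
    for s in counts:
        ratios = [s[g] / geo[g] for g in range(n_genes) if geo[g] is not None]
        factors.append(median(ratios))
    return factors

# Sample B is the same library sequenced twice as deep,
# so its size factor should be exactly 2x Sample A's.
sf = size_factors([[10, 20, 30, 40], [20, 40, 60, 80]])
```

Dividing each sample's counts by its size factor puts all samples on a common scale.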
TMM (Trimmed Mean of M-values) is the normalisation method used by edgeR. It works by selecting a reference sample, computing fold-changes and absolute expression levels between each sample and the reference, trimming the extremes (top/bottom 30% of fold-changes, top/bottom 5% of expression), and computing a weighted mean of the remaining values as the scaling factor.
Use raw counts as input to DESeq2 or edgeR. These tools apply their own internal normalisation (size factors or TMM). Never pre-normalise counts before DE analysis.
Use TPM or variance-stabilised transform (VST/rlog from DESeq2). These make samples comparable for visualisation.
Use TPM (accounts for gene length). CPM is insufficient because longer genes accumulate more reads regardless of expression level.
Use TPM in publications and supplementary tables. It is the current community standard and is comparable across samples and studies (with caveats).
Before running differential expression, you should always visualise your samples using dimensionality reduction (PCA) and clustering. These exploratory analyses reveal sample outliers, batch effects, sample mix-ups, and whether your biological conditions are the dominant source of variation. Skipping this step means you may not discover critical problems until after you have wasted time on downstream analysis.
Principal Component Analysis (PCA) takes your high-dimensional data (20,000 genes per sample) and finds the directions of maximum variance. The first principal component (PC1) captures the most variance in the data, PC2 captures the second most (orthogonal to PC1), and so on. By plotting samples on PC1 vs PC2, you project 20,000 dimensions down to 2, preserving as much information as possible.
Captures the largest source of variation. Ideally this is your biological condition (treatment vs control). The % variance explained is shown in the axis label (e.g., "PC1: 65% variance").
The second largest source of variation. Often captures batch effects, sex differences, or other covariates. If PC2 separates batches more than conditions, you have a batch effect problem.
Replicates cluster tightly together. Conditions separate clearly along PC1. Percentage of variance explained by PC1 is high (>30%).
One replicate clusters with the wrong condition (sample mix-up). Samples cluster by batch rather than condition (batch effect). One sample is far from all others (outlier — consider removal).
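To make "% variance explained" concrete, here is a minimal two-gene sketch with invented values (real PCA runs on thousands of genes at once): the eigenvalues of the covariance matrix are exactly the variances captured by PC1 and PC2.

```python
from math import sqrt
from statistics import mean

# Toy data: 6 samples measured on just 2 genes (values invented).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # gene 1 per sample
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]   # gene 2, strongly correlated with gene 1

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

# 2x2 covariance matrix [[sxx, sxy], [sxy, syy]]
sxx, syy, sxy = cov(x, x), cov(y, y), cov(x, y)

# Eigenvalues of a symmetric 2x2 matrix = variance on PC1 and PC2
tr, det = sxx + syy, sxx * syy - sxy * sxy
disc = sqrt(tr * tr / 4 - det)
pc1_var, pc2_var = tr / 2 + disc, tr / 2 - disc

pct_pc1 = 100 * pc1_var / (pc1_var + pc2_var)  # the "PC1: xx% variance" axis label
```

Because the two genes move together, nearly all the variance lands on PC1, which is what tight, well-separated samples look like in a real PCA plot.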
If your PCA shows samples clustering by batch (processing date, lane, technician) rather than by biological condition, you have a batch effect. The standard correction is to include batch in your DESeq2 design formula (design = ~ batch + condition); DESeq2 will then model and remove the batch effect statistically.
Ward linkage: Minimises within-cluster variance. Produces compact, evenly-sized clusters. Best for sample clustering in gene expression data. Most commonly used.
Complete linkage: Distance between clusters = maximum distance between any two members. Produces tight clusters. Sensitive to outliers.
Average linkage (UPGMA): Distance between clusters = mean of all pairwise distances. Produces ultrametric trees. Used in phylogenetics but assumes a constant rate of evolution.
Rows = genes (usually top DEGs), columns = samples. Colour = expression level (z-score). Both rows and columns are clustered. Reveals co-expression patterns.
The frontier of bioinformatics. These topics are increasingly required in modern research — from dissecting cell heterogeneity with scRNA-seq to interpreting gene lists with pathway analysis and applying machine learning to genomic data.
Single-cell RNA sequencing (scRNA-seq) measures gene expression in individual cells, revealing cellular heterogeneity that is invisible to bulk RNA-seq. Where bulk RNA-seq reports the average expression across thousands of cells, scRNA-seq can identify rare cell populations, map developmental trajectories, and characterise the cellular composition of complex tissues like tumours or the immune system. The most widely used platform is the 10x Genomics Chromium system.
In the 10x Chromium system, individual cells are captured in nanoliter-sized oil droplets (GEMs — Gel Beads-in-emulsion) together with barcoded hydrogel beads. Each bead carries millions of oligonucleotides with three components: (1) a cell barcode (16bp, unique to each droplet/cell), (2) a UMI (Unique Molecular Identifier, 12bp, unique to each captured mRNA molecule), and (3) a poly-T capture sequence that binds mRNA. After reverse transcription inside the droplet, all cDNA from one cell shares the same barcode, allowing computational demultiplexing.
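The demultiplexing logic amounts to slicing fixed-width fields out of Read 1. A sketch assuming the 10x v3 read structure described above (16 bp barcode followed by 12 bp UMI); the sequence and the split_read1 helper are invented for illustration:

```python
# 10x v3 Read 1 layout: 16 bp cell barcode + 12 bp UMI (mRNA is in Read 2).
BARCODE_LEN, UMI_LEN = 16, 12

def split_read1(seq):
    """Split a Read 1 sequence into (cell_barcode, umi). Hypothetical helper."""
    return seq[:BARCODE_LEN], seq[BARCODE_LEN:BARCODE_LEN + UMI_LEN]

# Made-up 28 bp Read 1 sequence
r1 = "AAACCCAAGAAACACT" + "TTTGGGCCCAAA"
barcode, umi = split_read1(r1)
```

Grouping reads by barcode assigns them to cells; deduplicating by (barcode, UMI) collapses PCR copies of the same original mRNA molecule.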
A major characteristic of scRNA-seq data is dropout: many genes that are truly expressed in a cell are detected as zero. This happens because each cell contributes only ~10–20% of its mRNA to the captured library (low capture efficiency). Lowly expressed genes are particularly affected — they may be expressed but not captured. This "zero inflation" complicates analysis and requires specialised statistical methods.
Number of unique genes detected. Too low (<200): empty droplet or dead cell. Too high (>5,000–8,000): possible doublet (two cells in one droplet). Typical range: 500–5,000 genes.
Total UMI counts. Proportional to mRNA content. Very low counts indicate low-quality cells. Extremely high counts may indicate doublets.
Percentage of reads from mitochondrial genes. Dying cells release cytoplasmic mRNA but retain mitochondrial mRNA, so high mt% (>20%) indicates cell stress or death. Common threshold: remove cells with >15–20% mt.
Remove cells with: nFeature_RNA < 200 or > 5000, percent.mt > 20%. These are starting points — adjust based on your cell type (some cells naturally have high mt% or low gene counts).
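In Seurat you would apply these thresholds with subset(); the logic itself is just two cutoffs, sketched here in Python with invented per-cell metrics:

```python
# Made-up per-cell QC metrics mirroring Seurat's nFeature_RNA / percent.mt.
cells = [
    {"id": "cell1", "n_genes": 1500, "pct_mt": 4.0},   # healthy cell
    {"id": "cell2", "n_genes": 150,  "pct_mt": 3.0},   # likely empty droplet
    {"id": "cell3", "n_genes": 7000, "pct_mt": 5.0},   # possible doublet
    {"id": "cell4", "n_genes": 2000, "pct_mt": 35.0},  # dying cell (high mt%)
]

def passes_qc(c, min_genes=200, max_genes=5000, max_mt=20.0):
    """Keep cells inside the gene-count window with acceptable mt%."""
    return min_genes <= c["n_genes"] <= max_genes and c["pct_mt"] <= max_mt

kept = [c["id"] for c in cells if passes_qc(c)]
```

Only cell1 survives; each of the other three fails exactly one of the filters described above.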
During cell lysis, free-floating mRNA ("soup") contaminates all droplets. SoupX estimates the contamination profile from empty droplets and subtracts it from each cell's counts. Run before standard Seurat analysis.
Doublets (two cells captured in one droplet) appear as cells with hybrid gene expression profiles. scDblFinder simulates artificial doublets and identifies real cells that resemble them. Expected doublet rate: ~0.8% per 1,000 cells loaded.
Trajectory analysis (pseudotime) orders cells along a continuous path representing a biological process (e.g., differentiation, activation). Monocle3 constructs a principal graph through the UMAP embedding and assigns each cell a pseudotime value based on its position along this graph.
When analysing multiple scRNA-seq datasets (different patients, batches, or experiments), batch effects cause identical cell types to cluster separately by dataset rather than by biology. Integration methods align datasets so that shared cell types overlap.
Fast, lightweight. Operates in PCA space. Iteratively adjusts PCs to remove batch effects. Works well for most integration tasks. Usage: RunHarmony(seurat, "batch").
Uses reciprocal PCA to find shared anchors between datasets. More conservative — better at preserving unique cell populations that exist in only one dataset. More computationally expensive.
After identifying differentially expressed genes, the natural next question is: what biological processes, pathways, or functions are affected? Pathway enrichment analysis moves from individual genes to biological interpretation. Rather than examining 500 DEGs one by one, you ask whether genes belonging to a specific pathway (e.g., "apoptosis" or "cell cycle") appear more often in your results than expected by chance. Two main approaches exist: Over-Representation Analysis (ORA) and Gene Set Enrichment Analysis (GSEA).
A gene set is a group of genes that share a common biological property. Gene sets are defined by databases:
Three domains: Biological Process (BP), Molecular Function (MF), Cellular Component (CC). Hierarchically organised (e.g., "immune response" contains "inflammatory response" contains "cytokine production"). Very comprehensive but often redundant.
Curated metabolic and signalling pathways with detailed diagrams. ~340 human pathways. Good for metabolism, signalling, and disease pathways. Less comprehensive than GO but more interpretable.
50 carefully curated gene sets representing well-defined biological states (e.g., "HALLMARK_INFLAMMATORY_RESPONSE", "HALLMARK_OXIDATIVE_PHOSPHORYLATION"). Excellent starting point — covers major biological themes without redundancy.
Peer-reviewed pathway database with detailed molecular events. ~2,500 human pathways. More granular than KEGG. Open-source and actively curated.
Over-Representation Analysis uses Fisher's exact test (or hypergeometric test) to ask: Is the overlap between my DEG list and a specific pathway larger than expected by chance?
Imagine you have 500 DEGs from a universe of 15,000 expressed genes. The "apoptosis" pathway contains 200 genes. If 40 of your 500 DEGs are in the apoptosis pathway, is that more than the ~7 you would expect by chance (200/15,000 × 500)? Fisher's exact test gives you a p-value for this overlap.
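The worked example above can be checked directly with the hypergeometric distribution (equivalent to a one-sided Fisher's exact test); math.comb keeps this dependency-free:

```python
from math import comb

def hypergeom_pval(k, K, n, N):
    """P(X >= k) for X ~ Hypergeometric: N genes in the universe,
    K of them in the pathway, n DEGs drawn, k DEGs observed in the
    pathway. This is the one-sided ORA enrichment p-value."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / total

# Numbers from the example: 40 of 500 DEGs fall in a 200-gene pathway,
# from a universe of 15,000 expressed genes (~6.7 expected by chance).
expected = 200 / 15000 * 500
p = hypergeom_pval(40, 200, 500, 15000)
```

Observing 40 pathway genes where ~7 are expected gives a vanishingly small p-value, so the overlap is far larger than chance.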
GSEA does not use a DEG list with a cutoff. Instead, it uses the full ranked gene list (all genes ranked by fold change):
1. Rank all genes by log2 fold change (or a signed statistic like -log10(p) × sign(FC)).
2. Walk down the ranked list. For each gene, if it is in the pathway, increase a running enrichment score (ES); if not, decrease it.
3. The enrichment score is the maximum deviation from zero during the walk. A positive ES means the pathway genes are concentrated at the top (upregulated). A negative ES means they are at the bottom (downregulated).
4. Significance is assessed by permutation: shuffle gene labels 1,000 times and compute ES each time. Compare your real ES to the permutation distribution.
5. The Normalised Enrichment Score (NES) accounts for gene set size, making scores comparable across pathways.
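Steps 2–3 (the running score and its maximum deviation) can be sketched in a few lines. This is the simplified, unweighted form; real GSEA weights hits by the gene statistic and adds the permutation test:

```python
def enrichment_score(ranked_genes, pathway):
    """Unweighted running enrichment score (classic KS-style sketch).

    ranked_genes: all genes, most upregulated first.
    pathway: set of pathway member genes.
    """
    n_hit = sum(1 for g in ranked_genes if g in pathway)
    n_miss = len(ranked_genes) - n_hit
    up, down = 1 / n_hit, 1 / n_miss       # step sizes for hits and misses
    running, es = 0.0, 0.0
    for g in ranked_genes:
        running += up if g in pathway else -down
        if abs(running) > abs(es):
            es = running                   # maximum deviation from zero
    return es

# Hypothetical ranking: pathway genes clustered at the top -> positive ES
ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
es = enrichment_score(ranked, {"g1", "g2", "g3"})
```

With the three pathway genes at the top of the list the ES reaches +1.0; putting them at the bottom instead would drive the ES negative.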
Dot size = number of genes, colour = padj, X-axis = gene ratio. The default and most informative single plot.
Nodes = pathways, edges connect pathways sharing genes. Reveals clusters of related biological themes. Use emapplot().
Shows the distribution of fold changes for genes in each pathway. Useful for GSEA results to see the shift direction.
Shows the running enrichment score as you walk down the ranked list. The peak/trough is the ES. Vertical lines show where pathway genes fall.
Use simplify() in clusterProfiler or REVIGO to remove redundancy and focus on the most specific or most significant terms.
Machine learning (ML) is now embedded across bioinformatics — from protein structure prediction (AlphaFold) to variant pathogenicity scoring, cell type classification, and biomarker discovery. This section covers the key ML concepts that bioinformaticians need, including the unique challenges posed by genomic data (many features, few samples), practical workflows in Python with scikit-learn, and when ML is appropriate versus when simpler statistical methods suffice.
Supervised learning: You have labels (e.g., cancer subtype, drug response). The model learns to predict labels from features (gene expression). Examples: classify tumour vs normal, predict patient survival, identify pathogenic variants. Algorithms: Random Forest, SVM, logistic regression, neural networks.
Unsupervised learning: No labels. The model discovers structure in the data. Examples: clustering scRNA-seq cells into types, identifying patient subtypes from multi-omics data, PCA for dimensionality reduction. Algorithms: k-means, hierarchical clustering, UMAP, t-SNE, NMF.
Genomic datasets typically have 20,000+ features (genes) but only 50–200 samples. This "p >> n" problem means: (1) models can easily memorise the training data (overfit), (2) many spurious correlations exist by chance, and (3) feature selection is critical. Before training, reduce features using: variance filtering (remove low-variance genes), differential expression pre-filtering, or embedded methods (Random Forest feature importance).
Never evaluate a model on the same data used to train it. Cross-validation provides an honest estimate of model performance on unseen data.
k-fold cross-validation: Split data into k folds (typically k=5 or k=10). Train on k-1 folds, test on the held-out fold. Repeat k times. Average the scores. The standard approach for most genomic ML.
Leave-one-out cross-validation (LOOCV): Each sample is held out once. Maximises training data but is computationally expensive. Useful for very small datasets (<30 samples) where you cannot afford to hold out 20% of the data.
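Writing the fold split out by hand makes the procedure concrete (in practice you would use sklearn.model_selection.KFold):

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    idx = list(range(n_samples))
    # distribute any remainder across the first folds
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = idx[start:start + size]
        train = idx[:start] + idx[start + size:]
        yield train, test
        start += size

folds = list(kfold_indices(10, 5))  # 10 samples, 5 folds of 2
```

Every sample appears in exactly one test fold, so averaging the five test scores gives an honest performance estimate.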
The ROC (Receiver Operating Characteristic) curve plots the True Positive Rate vs False Positive Rate at different classification thresholds. The AUC (Area Under the Curve) summarises overall performance:
AUC = 0.5: Random guessing. The model has no discriminative ability. Your features carry no signal for this task.
AUC 0.7–0.8: Acceptable discrimination. Useful for biomarker discovery but may not be sufficient for clinical decisions.
AUC 0.8–0.9: Good discrimination. Strong enough for most research applications and preliminary clinical use.
AUC > 0.9: Excellent discrimination. Be cautious — if AUC is >0.95 on a small dataset, double-check for data leakage or confounders (batch effect masquerading as biology).
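AUC has an equivalent, intuitive definition: the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one (the Mann-Whitney formulation). A sketch with invented classifier scores:

```python
def roc_auc(scores_pos, scores_neg):
    """AUC as P(random positive scores higher than random negative);
    ties count half. Brute-force over all pairs (fine for small data)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical scores for tumour (positive) vs normal (negative) samples
auc = roc_auc([0.9, 0.8, 0.6], [0.7, 0.3, 0.2])  # 8 of 9 pairs ordered correctly
```

For real work, sklearn.metrics.roc_auc_score computes the same quantity efficiently.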
Deep learning excels when the input is raw sequence data (DNA, RNA, protein). A common architecture is a 1D convolutional neural network (CNN) applied to one-hot encoded DNA sequences (4 channels: A, C, G, T). These models can learn motifs (like transcription factor binding sites) directly from sequence without hand-crafted features.
Metagenomics is the study of genetic material recovered directly from environmental or clinical samples — soil, ocean water, the human gut — without the need to culture individual organisms. It reveals the composition and functional potential of entire microbial communities. Two main approaches exist: 16S rRNA amplicon sequencing (which bacteria are present?) and shotgun metagenomics (what are they doing?). This section covers both workflows in detail.
The 16S ribosomal RNA gene is present in all bacteria and archaea. It contains conserved regions (where universal primers bind) flanking variable regions (V1–V9) that differ between species. By PCR-amplifying and sequencing the V3-V4 region (~460bp), you can identify which bacterial taxa are present in your sample. This approach is cost-effective, well-established, and sufficient for community profiling.
Older approach. Clusters sequences at 97% similarity. Loses resolution (different species with >97% identity are merged). Computationally simple but biologically imprecise. Tools: QIIME1, mothur (legacy).
Modern standard. Resolves sequences to single-nucleotide precision. Each unique error-corrected sequence is an ASV. Reproducible across studies (unlike OTUs which depend on clustering parameters). Superior in every way. Tools: DADA2, Deblur.
SILVA: Comprehensive database for bacteria, archaea, and eukaryotes (18S/28S). Version 138.1 is the current standard. Highest-quality curated taxonomy. Recommended for most studies.
Greengenes2: Rebuilt from scratch using whole-genome phylogenetics. Consistent taxonomy derived from the Genome Taxonomy Database (GTDB). Better phylogenetic accuracy than the original Greengenes. Compatible with UniFrac.
Alpha diversity measures the diversity within a single sample. Different metrics capture different aspects:
Accounts for both richness (number of species) and evenness (how evenly individuals are distributed). Higher H' = more diverse. A sample dominated by one species has low Shannon even if many species are present.
Simpson index: Probability that two randomly drawn individuals belong to different species. Emphasises dominant species. Less sensitive to rare species than Shannon. Range: 0 (no diversity) to 1 (high diversity).
Simply the number of unique ASVs detected. Sensitive to sequencing depth — always rarefy (subsample to equal depth) before comparing.
Sum of branch lengths in the phylogenetic tree connecting all species in the sample. Captures phylogenetic breadth — a sample with many distantly related species has higher PD than one with many closely related species.
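Shannon and Simpson are short formulas; the invented counts below show how dominance lowers both (in practice the vegan or phyloseq R packages compute these for you):

```python
from math import log

def shannon(counts):
    """Shannon index H' = -sum(p_i * ln p_i)."""
    total = sum(counts)
    return -sum((c / total) * log(c / total) for c in counts if c > 0)

def simpson(counts):
    """Gini-Simpson diversity 1 - sum(p_i^2): the chance two random
    individuals belong to different species."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

even   = [25, 25, 25, 25]   # 4 species, perfectly even
skewed = [97, 1, 1, 1]      # 4 species, one dominates

richness = len(even)        # observed richness: number of taxa detected
```

Both communities have the same richness (4), but the skewed one scores far lower on Shannon and Simpson because one species dominates.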
Beta diversity measures differences between samples. Each pair of samples gets a distance value, producing a distance matrix that is visualised by ordination (PCoA).
Bray-Curtis dissimilarity: Based on shared species abundances. Does not consider phylogeny. Range: 0 (identical) to 1 (completely different). The most common non-phylogenetic metric.
Incorporates phylogenetic relationships. Unweighted UniFrac considers only presence/absence (sensitive to rare taxa). Weighted UniFrac also considers abundance (sensitive to dominant taxa). Requires a phylogenetic tree.
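The non-phylogenetic metric (Bray-Curtis) reduces to one line; a sketch with invented abundance vectors showing the two extremes of the 0–1 range:

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors:
    1 - 2 * (shared abundance) / (total abundance)."""
    shared = sum(min(x, y) for x, y in zip(a, b))
    return 1 - 2 * shared / (sum(a) + sum(b))

identical = bray_curtis([10, 20, 30], [10, 20, 30])  # no difference
disjoint  = bray_curtis([10, 0, 0], [0, 5, 5])       # no shared species
```

Computing this for every pair of samples produces the distance matrix that PCoA then ordinates.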
To test whether microbial communities differ significantly between groups (e.g., healthy vs diseased), use PERMANOVA (Permutational Multivariate Analysis of Variance). It tests whether the centroids of groups in beta diversity space are significantly different by comparing observed F-statistics to a permutation distribution.
Unlike 16S amplicon sequencing, shotgun metagenomics sequences all DNA in the sample. This reveals not just which organisms are present (taxonomy) but also their functional potential (metabolic pathways, antibiotic resistance genes).
Proteomics studies the entire complement of proteins in a cell, tissue, or organism. Unlike the genome (static), the proteome is dynamic — it changes with conditions, time, and cell type. Mass spectrometry is the workhorse technology for large-scale protein identification and quantification.
Mass spectrometry (MS) measures the mass-to-charge ratio (m/z) of ionised molecules. In proteomics, proteins are digested into peptides (typically with trypsin), separated by liquid chromatography (LC), ionised, and then measured in the mass spectrometer. The combination — LC-MS/MS — is the standard workflow for identifying and quantifying thousands of proteins in a single experiment.
A typical bottom-up proteomics experiment follows these steps: (1) Protein extraction — lyse cells, solubilise proteins. (2) Digestion — trypsin cleaves after lysine (K) and arginine (R), producing peptides of 7–35 amino acids. (3) Separation — reverse-phase liquid chromatography separates peptides by hydrophobicity over a 60–120 minute gradient. (4) Ionisation — electrospray ionisation (ESI) converts peptides into gas-phase ions. (5) MS1 scan — measures intact peptide masses. (6) MS2 scan — isolates individual peptide ions and fragments them (HCD or CID). The fragmentation pattern reveals the amino acid sequence.
Orbitrap: High resolution (120,000–500,000), high mass accuracy (<2 ppm). Gold standard for discovery proteomics. Exploris 480, Q Exactive series.
timsTOF: Time-of-flight with trapped ion mobility (TIMS). Very fast scan rates. DIA-PASEF enables deep proteome coverage. Excellent for clinical applications.
Triple quadrupole: Lower resolution but extremely sensitive and quantitative. Used for targeted proteomics (MRM/SRM) — measuring specific proteins with high precision.
MALDI-TOF: Matrix-assisted laser desorption/ionisation. Used for intact protein profiling, tissue imaging (MALDI-MSI), and microbial identification (Bruker Biotyper).
DDA (Data-Dependent Acquisition): Selects the top N most intense ions from MS1 for fragmentation. Simple, well-established. Stochastic — may miss low-abundance peptides. Less reproducible between runs.
DIA (Data-Independent Acquisition): Fragments ALL ions in wide m/z windows systematically. No selection bias. More reproducible and comprehensive. Requires spectral libraries or library-free tools (DIA-NN). Now the preferred method.
After acquiring MS/MS spectra, the key bioinformatics task is identifying which peptides (and therefore proteins) generated them. This is done by matching experimental spectra against theoretical spectra predicted from a protein sequence database. The process is called database searching — and it is conceptually similar to BLAST, but for mass spectra instead of sequences.
(1) Take each experimental MS2 spectrum. (2) In silico digest all proteins from a reference database (e.g., UniProt human) with trypsin. (3) For each candidate peptide whose mass matches the precursor mass (within tolerance), predict the theoretical fragmentation spectrum. (4) Score the match between experimental and theoretical spectra. (5) Apply statistical validation (FDR control at 1% using target-decoy approach).
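Step (2), the in silico digest, is a simple pattern split. A sketch using the standard trypsin rule (cleave after K or R, but not before P); the protein sequence is invented, not a real UniProt entry:

```python
import re

def trypsin_digest(protein, min_len=7, max_len=35):
    """In silico tryptic digest: cleave after K or R unless the next
    residue is P. Keeps peptides in the length range typically
    searched by database engines."""
    peptides = re.split(r"(?<=[KR])(?!P)", protein)
    return [p for p in peptides if min_len <= len(p) <= max_len]

# Invented protein sequence for illustration
peps = trypsin_digest("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSR")
```

Short fragments like "MK" or "QR" are discarded because such peptides are rarely informative; note that the internal "AP" in the last peptide is untouched since the residue before the P is not K or R.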
MaxQuant: Free, widely used. Built-in Andromeda search engine. Excellent for label-free and SILAC. GUI-based. Standard in academic proteomics.
MSFragger: Ultrafast open search. 100× faster than traditional engines. Integrated in FragPipe with downstream analysis. Excellent for PTM discovery.
DIA-NN: Deep neural network for DIA data. Library-free mode generates in silico spectral libraries. State-of-the-art sensitivity and speed.
Quantitative proteomics measures how much of each protein is present, enabling comparison between conditions. There are two main strategies: label-free quantification (LFQ), which compares ion intensities directly, and labelling methods (TMT, SILAC), which use chemical or metabolic tags to multiplex samples.
Compare MS1 peptide intensities across runs. Simple, no extra chemistry. Requires careful normalisation. Best with DIA. MaxLFQ algorithm in MaxQuant.
TMT (most common): Chemical labels with the same mass but different fragmentation reporter ions. Multiplex up to 18 samples in one run (TMTpro 18-plex). Reduces missing values. Higher throughput.
Metabolic labelling — grow cells with heavy amino acids (13C-Lys, 13C-Arg). Most accurate quantification. Only works for cell culture. Not for tissues/clinical samples.
Protein function is determined by its three-dimensional structure. Structural bioinformatics uses computational methods to predict, analyse, and visualise protein structures — revolutionised by AlphaFold, which can predict structures with near-experimental accuracy.
Proteins fold into defined 3D structures that determine their function. Understanding the four levels of protein structure is fundamental to structural bioinformatics and drug design.
The linear amino acid sequence. Determines all higher levels. Read from gene sequence. 20 standard amino acids linked by peptide bonds.
Local folding patterns: α-helices (3.6 residues per turn, stabilised by i→i+4 hydrogen bonds) and β-sheets (parallel or antiparallel strands). Loops connect regular elements.
The complete 3D fold of a single polypeptide chain. Stabilised by hydrophobic core packing, disulfide bonds, salt bridges, and hydrogen bonds. This is what we see in PDB structures.
Assembly of multiple polypeptide chains (subunits) into a functional complex. Example: haemoglobin is a tetramer (α2β2). Many enzymes and receptors function as oligomers.
~220,000 experimentally determined structures. Solved by X-ray crystallography (~87%), cryo-EM (~10%), or NMR (~3%). The primary source for known structures.
~200 million predicted structures covering nearly all known proteins. Free access at alphafold.ebi.ac.uk. Quality indicated by pLDDT confidence score (blue = high, orange = low).
Links sequence → structure → function. Cross-references PDB and AlphaFold entries. Shows domain architecture, active sites, binding sites, and post-translational modifications.
AlphaFold2, developed by DeepMind, solved the protein structure prediction problem in 2020. It predicts 3D structures from amino acid sequence alone with near-experimental accuracy. ColabFold makes it easy to run AlphaFold predictions using Google Colab — no local GPU required.
(1) MSA generation — searches sequence databases to find homologous proteins (evolutionary information). (2) Template search — finds known structures of related proteins in PDB. (3) Evoformer — a deep neural network processes the MSA and pairwise features through 48 attention-based blocks. (4) Structure module — converts learned representations into 3D atomic coordinates. (5) Recycling — repeats the prediction 3 times, feeding output back as input, improving accuracy each time.
Molecular docking predicts how a small molecule (ligand/drug) binds to a protein target. It is a cornerstone of computer-aided drug design (CADD). The goal is to find the binding pose (orientation + conformation) and estimate binding affinity. Virtual screening uses docking to test millions of compounds computationally before synthesising the most promising ones.
AutoDock Vina (start here): The most widely used docking tool. Fast, accurate, free. Flexible ligand + rigid receptor. Easy to use via the command line.
Gnina: CNN-based scoring function built on Vina. Better pose prediction than classical scoring. GPU-accelerated.
DiffDock: Diffusion model for docking. Does not require a predefined binding site. State-of-the-art blind docking accuracy.
Epigenetics studies heritable changes in gene activity that do not involve changes to the DNA sequence. These modifications — histone marks, chromatin accessibility, and DNA methylation — control which genes are on or off in each cell type. Epigenomic assays reveal the regulatory landscape of the genome.
Chromatin Immunoprecipitation followed by sequencing (ChIP-seq) identifies where a specific protein (transcription factor or modified histone) binds across the genome. The protein-DNA complexes are cross-linked with formaldehyde, chromatin is sheared into 200–500 bp fragments, an antibody pulls down fragments bound by the target protein, and the bound DNA is sequenced. Peaks in the sequencing signal reveal binding sites.
ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) maps open chromatin regions genome-wide. It uses the Tn5 transposase, which preferentially inserts sequencing adapters into accessible (open) chromatin, avoiding tightly packed nucleosome-occupied regions. Open chromatin corresponds to active regulatory elements — promoters, enhancers, and transcription factor binding sites.
DNA methylation — the addition of a methyl group to the 5th carbon of cytosine (5mC) — is the most studied epigenetic modification. In mammals, it occurs predominantly at CpG dinucleotides. Methylation of gene promoters generally silences transcription. It plays critical roles in development, X-chromosome inactivation, genomic imprinting, and cancer (where aberrant methylation is a hallmark).
Whole-genome bisulfite sequencing. Gold standard. Single-CpG resolution across entire genome. Expensive — needs 10–15× coverage. Bisulfite converts unmethylated C→U; methylated C is protected.
Reduced Representation BS. MspI digestion enriches CpG-rich regions. 1–10% of genome, covers most CpG islands. Much cheaper than WGBS.
Illumina EPIC array (850K CpGs). Cheapest per-sample. Good for clinical studies. Limited to pre-selected CpGs. EPIC v2 covers ~930K sites.
De novo genome assembly reconstructs complete chromosome sequences from millions of short or long sequencing reads — like assembling a jigsaw puzzle without the picture on the box. With the emergence of long-read technologies, telomere-to-telomere assemblies are now achievable for many organisms.
Genome assembly combines overlapping reads into longer contiguous sequences (contigs), then scaffolds contigs into chromosome-scale sequences. The approach depends fundamentally on read length: short reads (150 bp Illumina) cannot span repetitive elements, producing fragmented assemblies. Long reads (10–100 kb from PacBio HiFi or Oxford Nanopore) span most repeats, enabling near-complete assemblies.
Chops reads into k-mers, builds a graph where k-mers are nodes and overlaps are edges. Traverses the graph to produce contigs. Tools: SPAdes (small genomes, metagenomes), MEGAHIT (metagenomes). Good for bacteria, not for complex eukaryotes.
Finds overlaps between all read pairs, constructs a layout graph, derives consensus sequence. Tools: hifiasm (PacBio HiFi — best), Flye (Nanopore/PacBio CLR). Produces chromosome-scale contigs for most genomes.
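The de Bruijn idea can be demonstrated on error-free toy reads: chop reads into k-mers, link each k-mer's prefix to its suffix, and walk the resulting path. Real assemblers must additionally handle branches, sequencing errors, repeats, and coverage:

```python
def debruijn_contig(reads, k):
    """Toy de Bruijn assembly: build a (k-1)-mer graph from k-mers and
    walk a single unbranched path. Only works on error-free,
    non-repetitive toy input."""
    edges = {}  # prefix (k-1)-mer -> suffix (k-1)-mer
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1]] = kmer[1:]
    # start node: a prefix that never appears as a suffix
    start = (set(edges) - set(edges.values())).pop()
    contig, node = start, start
    while node in edges:
        node = edges[node]
        contig += node[-1]   # each step adds one base
    return contig

# Overlapping, error-free toy reads covering the sequence "ATGGCGTGCA"
contig = debruijn_contig(["ATGGCG", "GGCGTG", "CGTGCA"], k=4)
```

The three 6 bp reads are correctly merged back into the original 10 bp sequence; with short reads and a long repeat, the walk would hit a branch and the contig would stop there, which is exactly why short-read assemblies fragment at repeats.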
An assembly is only useful if it is correct and complete. Quality assessment uses two complementary approaches: contiguity statistics (N50, total length, number of contigs) measured by QUAST, and gene completeness measured by BUSCO (Benchmarking Universal Single-Copy Orthologs), which checks whether expected genes are present and intact.
After assembling a genome, the next step is annotation — finding where the genes are and what they encode. Gene prediction uses ab initio methods (statistical models trained on known gene structures), homology-based methods (aligning known proteins or transcripts), and evidence-based methods (using RNA-seq data from the same organism). Modern pipelines combine all three approaches.
Population genetics studies how allele frequencies change within and between populations over time. Genome-wide association studies (GWAS) connect genetic variants to phenotypes and diseases, forming the basis of precision medicine.
Hardy-Weinberg Equilibrium (HWE) describes the expected genotype frequencies in a population with no evolutionary forces acting. For a variant with allele frequencies p (reference) and q (alternative), expected genotype frequencies are: p² (homozygous ref), 2pq (heterozygous), q² (homozygous alt). Deviations from HWE can indicate selection, population stratification, genotyping errors, or inbreeding.
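The expected genotype proportions follow directly from the allele counts. A sketch with invented counts chosen to give q = 0.40:

```python
def hwe_expected(n_ref_alleles, n_alt_alleles):
    """Expected genotype proportions p^2, 2pq, q^2 under Hardy-Weinberg."""
    total = n_ref_alleles + n_alt_alleles
    p = n_ref_alleles / total   # reference allele frequency
    q = n_alt_alleles / total   # alternative allele frequency
    return p * p, 2 * p * q, q * q

# Toy example: 120 reference and 80 alternative alleles (p = 0.6, q = 0.4)
pp, pq, qq = hwe_expected(120, 80)
```

Comparing these expectations (0.36, 0.48, 0.16) to the observed genotype counts, e.g. with a chi-square test, is how deviations from HWE are detected in QC pipelines.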
The proportion of a specific allele in a population. If 80 of 200 chromosomes carry the T allele, frequency = 0.40. Minor allele frequency (MAF) < 0.01 = rare variant.
Random fluctuation of allele frequencies due to finite population size. Stronger in small populations. Founder effect and bottleneck are examples of extreme drift.
Measures genetic differentiation between populations. FST = 0 means identical; FST = 1 means completely different. Human populations: FST ≈ 0.12 globally.
Non-random association between alleles at different loci. Nearby variants are in LD (inherited together). Measured by r². LD decays with physical distance due to recombination.
GWAS tests each genetic variant across the genome for association with a phenotype (disease, height, drug response, etc.). It is a hypothesis-free approach — no prior knowledge of which genes are involved is needed. Since millions of variants are tested simultaneously, very stringent significance thresholds are required (p < 5×10⁻⁸ for genome-wide significance). GWAS results are visualised as Manhattan plots.
Systems biology views biological processes as interconnected networks rather than isolated pathways. By integrating multi-omics data — transcriptomics, proteomics, metabolomics — with interaction networks, we can understand how molecular changes propagate through biological systems.
Protein-Protein Interaction (PPI) networks represent physical or functional associations between proteins. Nodes are proteins, edges represent interactions. Highly connected proteins (hubs) tend to be essential. PPI networks help identify functional modules, predict protein function, and prioritise drug targets.
STRING: Largest PPI database. Integrates experimental data, co-expression, text mining, and genomic context. Confidence scores 0–1; use score > 0.7 for high-confidence interactions.
Cytoscape: Desktop software for network visualisation and analysis. Supports plugins (clusterMaker, MCODE for module detection, CytoHubba for hub identification). Essential for publication figures.
BioGRID: Curated repository of experimentally validated interactions (protein, genetic, chemical). Over 2 million interactions. Higher confidence than computationally predicted interactions.
Number of connections a node has. High-degree nodes (hubs) are often essential genes. Degree distribution follows a power law in biological networks (scale-free property).
How often a node lies on shortest paths between other nodes. Bottleneck proteins control information flow through the network — they are often good drug targets.
How densely connected a node's neighbours are. Nodes in protein complexes have high clustering. Used to detect functional modules (MCODE algorithm).
Densely connected subnetworks often correspond to protein complexes or functional pathways. Detect with MCODE, Louvain, or Markov clustering (MCL).
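The three centrality measures above can be seen on a toy two-module network (networkx is an assumption here; node names are hypothetical proteins):

```python
import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("A", "B"), ("A", "C"), ("B", "C"),  # module 1: a triangle
    ("C", "D"),                          # D bridges the two modules
    ("D", "E"), ("D", "F"), ("E", "F"),  # module 2: a second triangle
])

print(dict(g.degree())["C"])              # degree: C has 3 interactions
print(nx.clustering(g, "A"))              # A's neighbours all interact -> 1.0
print(nx.betweenness_centrality(g)["D"])  # D sits on every cross-module path
```

C and D are the bottlenecks: every shortest path between the two triangles passes through the C–D edge, which is why both score highest on betweenness despite modest degree.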
Gene Regulatory Networks (GRNs) describe how transcription factors (TFs) control the expression of target genes. Unlike PPI networks which show physical contacts, GRNs show regulatory relationships — which TFs activate or repress which genes. Multi-omics integration combines transcriptomics, proteomics, metabolomics, and epigenomics data to build comprehensive models of cellular regulation.
SCENIC: Infers GRNs from single-cell RNA-seq data. Uses GRNBoost2 (random forest) for co-expression, then validates with TF binding motif enrichment (RcisTarget). Outputs regulon activity scores per cell.
WGCNA: Weighted Gene Co-expression Network Analysis. Groups genes into co-expression modules and correlates modules with clinical traits. Excellent for bulk RNA-seq. Widely used, well cited. R package.
ARACNe: Mutual information-based network inference. VIPER then infers TF activity from target gene expression. Widely used in cancer biology for identifying master regulators.
CellOracle: Combines scRNA-seq + scATAC-seq to build context-specific GRNs. Simulates TF perturbations in silico, predicting the effects of gene knockouts on cell fate.
Integrating multiple data types reveals biology invisible to any single assay:
Correlate mRNA ↔ protein levels, mRNA ↔ metabolite levels. Discordant pairs (high mRNA, low protein) suggest post-transcriptional regulation. Tools: mixOmics (R), MOFA+.
Map all omics data onto KEGG/Reactome pathways. Identify pathways with consistent changes across data types. Tools: Paintomics, iPath, OmicsNet.
Multi-Omics Factor Analysis decomposes multiple data matrices into shared latent factors. Identifies which sources of variation are shared vs unique to each data type.
Construct multilayer networks connecting genes, proteins, metabolites. Network propagation algorithms (Random Walk with Restart) identify key regulators driving multi-level changes.
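Random Walk with Restart can be sketched in a few lines of numpy: iterate p ← (1 − r)·W·p + r·p₀, where W is the column-normalised adjacency matrix, p₀ marks the seed gene, and r is the restart probability (the 4-node network and values below are made up):

```python
import numpy as np

adj = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
w = adj / adj.sum(axis=0)        # column-normalised transition matrix
p0 = np.array([1.0, 0, 0, 0])    # restart distribution: seed at node 0
r = 0.3                          # restart probability

p = p0.copy()
for _ in range(200):             # fixed-point iteration to convergence
    p_next = (1 - r) * w @ p + r * p0
    if np.abs(p_next - p).sum() < 1e-10:
        break
    p = p_next

print(np.round(p, 3))            # network proximity of every node to the seed
```

The converged vector ranks every node by proximity to the seed; in a multilayer setting W simply spans the gene, protein, and metabolite layers.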
Fill in your details to receive a 50-day access code via email. Your information helps us improve RNAflow and keep you updated with new features.
Questions about a tool, a tutorial idea, a bug report, or want to collaborate? We'd love to hear from you.
Quick answers before you send a message.
Bug reports, feature requests, collaboration proposals, or just to say hello — all welcome.