A single nucleotide polymorphism (SNP) calling pipeline is a series of bioinformatics tools and processing steps used to identify and characterize single-nucleotide differences within an individual genome or across a population of organisms. These pipelines are commonly employed in genomics research and applications such as studying genetic diversity, identifying disease-associated variants, and understanding evolutionary relationships. Here's an overview of a typical SNP calling pipeline:

  1. Sequence Alignment:
  • Raw sequencing data (e.g., from next-generation sequencing platforms) is first processed to remove low-quality reads and sequencing artifacts such as adapter sequences.
  • The clean reads are aligned to a reference genome or a closely related reference sequence using alignment software such as BWA or Bowtie 2 (HISAT2 is typically used for RNA-seq reads).
  2. Variant Calling:
  • Aligned reads are analyzed to identify positions where the sample differs from the reference sequence by a single nucleotide (candidate SNPs).
  • Variant calling tools such as GATK (Genome Analysis Toolkit), BCFtools (the variant-calling companion to SAMtools), and FreeBayes are used to identify potential SNP sites.
  3. Variant Filtering:
  • The initial list of variant calls often contains false positives and low-confidence calls. Filtering steps are applied to improve the accuracy of the identified SNPs.
  • Filters can be based on sequencing depth, base quality, mapping quality, strand bias, and other criteria.
  4. Variant Annotation:
  • Information about the genomic context and predicted functional consequence of each SNP (e.g., coding, intronic, or intergenic; synonymous or missense) can be added to the SNP calls using tools like ANNOVAR or SnpEff.
  • Annotations provide insights into the potential impact of the identified SNPs on genes and their functions.
  5. Population-Level Analysis:
  • If studying multiple individuals or populations, the pipeline can include steps to compare SNP profiles and identify population-specific variants.
  • Principal Component Analysis (PCA) or other clustering methods can help visualize population structure.
  6. Statistical Analysis:
  • Further statistical tests may be applied to assess the significance of observed genetic variations.
  • For association studies, the chi-square test, Fisher's exact test, or logistic regression can be used to identify SNPs associated with phenotypic traits or diseases.
  7. Visualization and Reporting:
  • Results can be visualized using various tools, such as genome browsers (IGV, UCSC Genome Browser), variant visualization software, or custom scripts.
  • A final report summarizing the identified SNPs, their annotations, and any associated statistical findings is generated.
  8. Validation:
  • Identified SNPs are often validated using additional experimental techniques, such as Sanger sequencing or genotyping arrays, to confirm that the calls are real and correctly genotyped.
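As a small illustration of the association tests mentioned under Statistical Analysis, the sketch below computes a two-sided Fisher's exact test on a 2x2 table of allele counts using only the Python standard library. The function name `fisher_exact_two_sided` and the case/control counts are hypothetical and exist only for this example. For real analyses an established routine such as `scipy.stats.fisher_exact` would normally be preferred.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    The p-value is the total probability of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def table_probability(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = table_probability(a)

    p_value = 0.0
    for x in range(lowest, highest + 1):
        p = table_probability(x)
        # Small relative tolerance guards against floating-point ties.
        if p <= observed * (1 + 1e-9):
            p_value += p
    return min(1.0, p_value)


# Hypothetical allele counts at one SNP (alternate allele, reference allele).
cases = (30, 70)
controls = (15, 85)
p = fisher_exact_two_sided(cases[0], cases[1], controls[0], controls[1])
print(f"two-sided p-value: {p:.4g}")
```

In a genome-wide setting this test would be repeated for every SNP, so the resulting p-values would also need a multiple-testing correction before any SNP is reported as associated.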
The specifics of an SNP calling pipeline vary with the type of sequencing data (whole genome, exome, targeted), the organism being studied, and the research goals. Developing and optimizing a pipeline requires careful choice of parameters and quality control steps to ensure accurate and reliable results.
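To make the role of such parameters concrete, here is a minimal sketch of the kind of hard filter described in the Variant Filtering step, applied to tab-separated VCF records. It assumes that total read depth is stored under the `DP` key of the INFO column. The thresholds `MIN_QUAL` and `MIN_DEPTH` and all function names are hypothetical values chosen for illustration, not recommendations.

```python
# Hypothetical thresholds; suitable values depend on coverage and data type.
MIN_QUAL = 30.0
MIN_DEPTH = 10


def parse_info(info_field):
    """Turn a VCF INFO string such as 'DP=35;MQ=60' into a dict."""
    info = {}
    for item in info_field.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            info[key] = value
        elif item and item != ".":
            info[item] = True  # flag-style entry without a value
    return info


def is_snp(ref, alt):
    """True if REF and every ALT allele are single A/C/G/T bases."""
    bases = set("ACGT")
    alleles = [ref] + alt.split(",")
    return all(len(allele) == 1 and allele in bases for allele in alleles)


def passes_filters(vcf_line, min_qual=MIN_QUAL, min_depth=MIN_DEPTH):
    """Apply SNP-only, QUAL and depth filters to one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alt, qual, info = fields[3], fields[4], fields[5], fields[7]
    if not is_snp(ref, alt):
        return False
    if qual == "." or float(qual) < min_qual:
        return False
    depth = parse_info(info).get("DP")
    if depth is None or int(depth) < min_depth:
        return False
    return True


def filter_vcf_lines(lines, min_qual=MIN_QUAL, min_depth=MIN_DEPTH):
    """Keep header lines unchanged and drop data lines that fail a filter."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)
        elif line.strip() and passes_filters(line, min_qual, min_depth):
            kept.append(line)
    return kept


# Made-up records used only to exercise the filter.
example_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\t.\tDP=35;MQ=60",    # passes
    "chr1\t200\t.\tC\tT\t12\t.\tDP=40;MQ=60",    # fails: low QUAL
    "chr1\t300\t.\tG\tA\t80\t.\tDP=4;MQ=60",     # fails: low depth
    "chr1\t400\t.\tAT\tA\t90\t.\tDP=50;MQ=60",   # fails: indel, not a SNP
]

for line in filter_vcf_lines(example_lines):
    print(line)
```

Only the record at position 100 survives with these settings. Changing either threshold changes which calls are kept, which is why filter parameters are usually tuned per dataset. Production pipelines would typically rely on the filtering commands of the variant caller itself rather than a hand-written parser like this one.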

Dr. Md. Monirul Islam
Senior Scientist

Fig: SNP calling workflow diagram. Horizontal boxes represent steps in the workflow, and arrows to the left indicate the steps that are supplied with reference genomic DNA and sequence data.