Bulk RNA-seq (5): Streamlining Mapping with a Custom Linux Script

Note

Mapping each RNA-seq sample to the reference genome by hand is tedious and easy to get wrong, so I wrote a Bash script that automates the step. It runs STAR on every sample in one go, with the same parameters for each.

Here’s the script, followed by an explanation of what each part does:

#!/bin/bash

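# Set the directory containing the STAR genome index
GENOME_DIR=~/reference/genome/grcm39/index/

# Set the output directory for mapped files (keep the trailing slash)
OUT_DIR=~/RNA-seq/mapping/hPRMT1/2307_3organs/hearts/Tmapping_0711/
# Create the output directory if it does not exist yet
mkdir -p "$OUT_DIR"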

# Map each sample ID (FASTQ file prefix) to its output file prefix
# (associative arrays require Bash 4 or newer)
declare -A arr
arr=( ["LAB_410_13"]="WT1_2" 
      ["LAB_410_14"]="WT2_2" 
      ["LAB_410_15"]="WT3_2"
      ["LAB_410_16"]="WT4_2"
      ["LAB_410_17"]="MUT1_2" 
      ["LAB_410_18"]="MUT2_2"
      ["LAB_410_19"]="MUT3_2"
      ["LAB_410_20"]="MUT4_2"
      )

# Loop through the associative array
for key in "${!arr[@]}"
do
  # Construct file names for R1 and R2 reads
  # (relative paths: run the script from the directory holding the FASTQ files)
  R1="${key}_R1_val_1.fq.gz"
  R2="${key}_R2_val_2.fq.gz"

  # Get the current output file prefix
  OUT_PREFIX="${OUT_DIR}${arr[$key]}"

  # Execute STAR for mapping (variables quoted so paths with spaces survive)
  STAR --runThreadN 10 \
  --runMode alignReads \
  --readFilesCommand zcat \
  --twopassMode Basic \
  --outSAMtype BAM SortedByCoordinate \
  --genomeDir "$GENOME_DIR" \
  --readFilesIn "$R1" "$R2" \
  --outFileNamePrefix "$OUT_PREFIX"
done

Script Features:

  • Reference Genome Directory: Points STAR at the directory holding the pre-built genome index.
  • Output Directory: Designates where the script saves the BAM files and logs produced by mapping.
  • Associative Array for Sample Prefixes: Maps each sample ID (the prefix of its FASTQ files) to a readable output prefix such as WT1_2 or MUT1_2, so the output files are named by condition rather than by sequencing ID.
  • Automated Mapping Loop: Iterates over every sample ID, builds the R1 and R2 file names for the paired-end reads, and runs STAR with the same parameters each time.
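
To see how the array and the loop fit together before committing to a long STAR run, a dry run can help. The sketch below only prints the file names and output prefix the loop would build for each sample; it does not call STAR. The output directory and the two-sample map are shortened, illustrative values, and plan_sample is a hypothetical helper that is not part of the original script.

```shell
#!/bin/bash
# Dry run: show what the loop would pass to STAR, without running STAR.
# OUT_DIR and the two-sample map are illustrative placeholders.
OUT_DIR="/tmp/mapping_demo/"

declare -A arr=( ["LAB_410_13"]="WT1_2" ["LAB_410_17"]="MUT1_2" )

# Print "R1 R2 output-prefix" for one sample ID (hypothetical helper).
plan_sample() {
  local key="$1"
  printf '%s %s %s\n' \
    "${key}_R1_val_1.fq.gz" \
    "${key}_R2_val_2.fq.gz" \
    "${OUT_DIR}${arr[$key]}"
}

# Bash does not guarantee the iteration order of associative arrays,
# so the keys are sorted to get a stable listing.
for key in $(printf '%s\n' "${!arr[@]}" | sort); do
  plan_sample "$key"
done
```

Because the prefix is the output directory joined directly to the sample name, STAR's output files start with that name, for example WT1_2Aligned.sortedByCoord.out.bam.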

Running the Script:

  1. Copy the script into a file, for example, mapping_script.sh.
  2. Make the script executable: chmod +x mapping_script.sh.
  3. Execute the script from the directory that contains the FASTQ files, since the read file names are relative paths: ./mapping_script.sh.
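
Before launching the run, it can be worth checking two things that would otherwise make it fail partway through: the Bash version (associative arrays need Bash 4 or newer) and the presence of every expected FASTQ file. The following is a minimal sketch; both function names are hypothetical helpers, not part of the original script, and the file-name pattern is taken from the script above.

```shell
#!/bin/bash
# Pre-flight helpers; nothing here runs STAR.

# Succeeds only on Bash 4 or newer, which `declare -A` requires.
bash_supports_assoc_arrays() {
  [ "${BASH_VERSINFO[0]:-0}" -ge 4 ]
}

# Print every expected FASTQ file that is missing from directory $1.
# The remaining arguments are sample IDs (the keys of the script's array).
list_missing_fastq() {
  local dir="$1" key f
  shift
  for key in "$@"; do
    for f in "${key}_R1_val_1.fq.gz" "${key}_R2_val_2.fq.gz"; do
      [ -f "${dir}/${f}" ] || printf '%s\n' "$f"
    done
  done
}

# Example calls from the directory holding the reads:
#   bash_supports_assoc_arrays || echo "Bash 4+ required" >&2
#   list_missing_fastq . LAB_410_13 LAB_410_14
```

An empty result from list_missing_fastq means every listed sample has both read files in place.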

This approach saves time by mapping all samples in a single run, and it keeps the analysis consistent because every sample goes through STAR with identical parameters. That leaves more attention for the downstream analysis.
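
Once the run finishes, one quick way to check that the samples behaved similarly is to compare their mapping rates. STAR writes a Log.final.out file next to each BAM, named with the same output prefix, and it contains a "Uniquely mapped reads %" line. The sketch below collects that value per sample; summarize_mapping is a hypothetical helper, and the directory in the usage comment is a placeholder.

```shell
#!/bin/bash
# Print "sample<TAB>uniquely mapped %" for every STAR Log.final.out
# file found in the directory given as $1.
summarize_mapping() {
  local dir="$1" log sample rate
  for log in "$dir"/*Log.final.out; do
    [ -e "$log" ] || continue   # no logs found: the glob stayed literal
    # Strip the fixed suffix to recover the sample prefix (e.g. WT1_2).
    sample="$(basename "$log" Log.final.out)"
    # The value sits after the '|' separator; remove padding whitespace.
    rate="$(awk -F'|' '/Uniquely mapped reads %/ { gsub(/[ \t]/, "", $2); print $2 }' "$log")"
    printf '%s\t%s\n' "$sample" "$rate"
  done
}

# Usage (placeholder path): summarize_mapping /path/to/mapping_output
```

A sample whose rate is far below the others is worth inspecting before moving on to counting.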
