Pattern searching in the human genome
In an article I wrote a couple of months ago, I talked about different file types in which high throughput data is stored in. Today, we will analyse two types of files: Fasta and GTF. If you want to learn more about these formats, refer to this article.
- How to download data using the curl command.
- How to unzip your files in order to make them readable.
- How to use Regex with grep.
- And types of questions biologists try to answer to using bioinformatics.
Getting the data
First, let me point that I am taking an Applied Computational Genomics course by Aaron Quinlan - I highly recommend this course if you want to learn both how to analyse and interpret genomic data, plus the lectures are really fun - and this is one of the homeworks.
I thought about writing an in silico type of article rather than a tutorial. You will learn:
Get the GTF File from Ensemble
# Download the file
curl ftp://ftp.ensembl.org/pub/release-87/gtf/homo_sapiens/Homo_sapiens.GRCh38.87.gtf.gz > human.genes.gtf.gz
# Decompress the file
gzip -d human.genes.gtf.gzGet the fasta file
# Download the file
curl http://hgdownload.cse.ucsc.e du/goldenPath/hg38/chromosomes/chr22.fa.gz > chr22.fa.gz
# Decompress the file
gzip -d chr22.fa.gzQ1 and Q2 : How many GTF exons from protein coding genes are on the forward (+) strand? How many GTF exons from protein coding genes are on the reverse (-) strand?
cat human.genes.gtf | grep -w "transcript_biotype \"protein_coding"\" | cut -f3,7 | grep "exon" | sort | uniq -c360028 exon +
351037 exon -Q3: How many lines in the chr22 file have exactly 15 cytosines?
grep -v ">" chr22.fa | grep -ion "c" | uniq -c | grep
"15 " | wc -l40675Q4: How many lines in the chr22 file have >=15 cytosines?
grep -v ">" chr22.fa | grep -ionE "(.\*c){15,}" | wc -l202793Q5: Which gene from the file used in GTF file has the most transcripts/isoforms?
grep -v "#" human.genes.gtf | grep -w "transcript" | grep -Eo 'gene_name \"\S+\"' | uniq -c | sort -r | head -n1170 gene_name "PCBP1-AS1"Q6: Which isoform from the file used in GTF file has the most exons?
grep -v "#" human.genes.gtf | grep -w "exon" | grep -oE 'transcript_name \"\S+\"' | uniq -c | sort | tail -1363 transcript_name "TTN-018"Q7: Restriction enzymes cleave DNA molecules at or near a specific sequence of bases. For example, the HindIII (Hin D 3) enzyme cuts at the “/” in either this motif: 5’A/AGCTT3’ or its reverse complement, 3’TTCGA/A5’. How many perfectly matching HindIII restriction enzyme cut sites are there on chr22? Don’t worry about sites that span two lines - just care about sites that are fully contained on a single line.
grep -ioE "AAGCTT|TTCGAA" chr22.fa | wc -l9064Q8: How many HindIII cut sites are there on chr22, assuming that a mutant form of HindIII will tolerate a mismatch in the second position? Think about ways in which you could best test for all the possible DNA combinations. There are a few valid approaches. Don’t worry about sites that span two lines - just care about sites that are fully contained on a single line.
grep -ioE "A[ACGT]GCTT|TTCG[AGCT]A" chr22.fa | wc -l34052And that’s pretty much it for today’s in silico practice!