In an article I wrote a couple of months ago, I talked about different file types in which high throughput data is stored in. Today, we will analyse two types of files: Fasta and GTF. If you want to learn more about these formats, refer to this article.

  • How to download data using the curl command.
  • How to unzip your files in order to make them readable.
  • How to use Regex with grep.
  • And types of questions biologists try to answer to using bioinformatics.

Getting the data

First, let me point that I am taking an Applied Computational Genomics course by Aaron Quinlan - I highly recommend this course if you want to learn both how to analyse and interpret genomic data, plus the lectures are really fun - and this is one of the homeworks.

I thought about writing an in silico type of article rather than a tutorial. You will learn:

Get the GTF File from Ensemble

# Download the file

curl ftp://ftp.ensembl.org/pub/release-87/gtf/homo_sapiens/Homo_sapiens.GRCh38.87.gtf.gz > human.genes.gtf.gz

# Decompress the file

gzip -d human.genes.gtf.gz

Get the fasta file

# Download the file

curl http://hgdownload.cse.ucsc.e du/goldenPath/hg38/chromosomes/chr22.fa.gz > chr22.fa.gz

# Decompress the file

gzip -d chr22.fa.gz

Q1 and Q2 : How many GTF exons from protein coding genes are on the forward (+) strand? How many GTF exons from protein coding genes are on the reverse (-) strand?

cat human.genes.gtf | grep -w "transcript_biotype \"protein_coding"\" | cut -f3,7 | grep "exon" | sort | uniq -c
360028 exon +
351037 exon -

Q3: How many lines in the chr22 file have exactly 15 cytosines?

grep -v ">" chr22.fa | grep -ion "c" | uniq -c | grep
"15 " | wc -l
40675

Q4: How many lines in the chr22 file have >=15 cytosines?

grep -v ">" chr22.fa | grep -ionE "(.\*c){15,}" | wc -l
202793

Q5: Which gene from the file used in GTF file has the most transcripts/isoforms?

grep -v "#" human.genes.gtf | grep -w "transcript" | grep -Eo 'gene_name \"\S+\"' | uniq -c | sort -r | head -n1
170 gene_name "PCBP1-AS1"

Q6: Which isoform from the file used in GTF file has the most exons?

grep -v "#" human.genes.gtf | grep -w "exon" | grep -oE 'transcript_name \"\S+\"' | uniq -c | sort | tail -1
363 transcript_name "TTN-018"

Q7: Restriction enzymes cleave DNA molecules at or near a specific sequence of bases. For example, the HindIII (Hin D 3) enzyme cuts at the “/” in either this motif: 5’A/AGCTT3’ or its reverse complement, 3’TTCGA/A5’. How many perfectly matching HindIII restriction enzyme cut sites are there on chr22? Don’t worry about sites that span two lines - just care about sites that are fully contained on a single line.

grep -ioE "AAGCTT|TTCGAA" chr22.fa | wc -l
9064

Q8: How many HindIII cut sites are there on chr22, assuming that a mutant form of HindIII will tolerate a mismatch in the second position? Think about ways in which you could best test for all the possible DNA combinations. There are a few valid approaches. Don’t worry about sites that span two lines - just care about sites that are fully contained on a single line.

grep -ioE "A[ACGT]GCTT|TTCG[AGCT]A" chr22.fa | wc -l
34052

And that’s pretty much it for today’s in silico practice!