Pattern searching in the human genome

In an article I wrote a couple of months ago, I talked about different file types in which high throughput data is stored in. Today, we will analyse two types of files: Fasta and GTF. If you want to learn more about these formats, refer to this article.

How to download data using the curl command.
How to unzip your files in order to make them readable.
How to use Regex with grep.
And types of questions biologists try to answer to using bioinformatics.

Getting the data

First, let me point that I am taking an Applied Computational Genomics course by Aaron Quinlan - I highly recommend this course if you want to learn both how to analyse and interpret genomic data, plus the lectures are really fun - and this is one of the homeworks.

I thought about writing an in silico type of article rather than a tutorial. You will learn:

Get the GTF File from Ensemble

# Download the file

curl ftp://ftp.ensembl.org/pub/release-87/gtf/homo_sapiens/Homo_sapiens.GRCh38.87.gtf.gz > human.genes.gtf.gz

# Decompress the file

gzip -d human.genes.gtf.gz

Get the fasta file

# Download the file

curl http://hgdownload.cse.ucsc.e du/goldenPath/hg38/chromosomes/chr22.fa.gz > chr22.fa.gz

# Decompress the file

gzip -d chr22.fa.gz

Q1 and Q2 : How many GTF exons from protein coding genes are on the forward (+) strand? How many GTF exons from protein coding genes are on the reverse (-) strand?

cat human.genes.gtf | grep -w "transcript_biotype \"protein_coding"\" | cut -f3,7 | grep "exon" | sort | uniq -c

360028 exon +
351037 exon -

Q3: How many lines in the chr22 file have exactly 15 cytosines?

grep -v ">" chr22.fa | grep -ion "c" | uniq -c | grep
"15 " | wc -l

Q4: How many lines in the chr22 file have >=15 cytosines?

grep -v ">" chr22.fa | grep -ionE "(.\*c){15,}" | wc -l

Q5: Which gene from the file used in GTF file has the most transcripts/isoforms?

grep -v "#" human.genes.gtf | grep -w "transcript" | grep -Eo 'gene_name \"\S+\"' | uniq -c | sort -r | head -n1

170 gene_name "PCBP1-AS1"

Q6: Which isoform from the file used in GTF file has the most exons?

grep -v "#" human.genes.gtf | grep -w "exon" | grep -oE 'transcript_name \"\S+\"' | uniq -c | sort | tail -1

363 transcript_name "TTN-018"

Q7: Restriction enzymes cleave DNA molecules at or near a specific sequence of bases. For example, the HindIII (Hin D 3) enzyme cuts at the “/” in either this motif: 5’A/AGCTT3’ or its reverse complement, 3’TTCGA/A5’. How many perfectly matching HindIII restriction enzyme cut sites are there on chr22? Don’t worry about sites that span two lines - just care about sites that are fully contained on a single line.

grep -ioE "AAGCTT|TTCGAA" chr22.fa | wc -l

Q8: How many HindIII cut sites are there on chr22, assuming that a mutant form of HindIII will tolerate a mismatch in the second position? Think about ways in which you could best test for all the possible DNA combinations. There are a few valid approaches. Don’t worry about sites that span two lines - just care about sites that are fully contained on a single line.

grep -ioE "A[ACGT]GCTT|TTCG[AGCT]A" chr22.fa | wc -l

And that’s pretty much it for today’s in silico practice!

Pattern searching in the human genome

Getting the data

Get the GTF File from Ensemble

Get the fasta file

Q1 and Q2 : How many GTF exons from protein coding genes are on the forward (+) strand? How many GTF exons from protein coding genes are on the reverse (-) strand?

Q3: How many lines in the chr22 file have exactly 15 cytosines?

Q4: How many lines in the chr22 file have >=15 cytosines?

Q5: Which gene from the file used in GTF file has the most transcripts/isoforms?

Q6: Which isoform from the file used in GTF file has the most exons?

You might also like:

How to use p5.js in VS Code?

UNIX Tips and tricks - 01

A beginner-friendly guide to SEO

Getting the data

Get the GTF File from Ensemble

Get the fasta file

Q1 and Q2 : How many GTF exons from protein coding genes are on the forward (+) strand? How many GTF exons from protein coding genes are on the reverse (-) strand?

Q3: How many lines in the chr22 file have exactly 15 cytosines?

Q4: How many lines in the chr22 file have >=15 cytosines?

Q5: Which gene from the file used in GTF file has the most transcripts/isoforms?

Q6: Which isoform from the file used in GTF file has the most exons?

Share this:

How to use p5.js in VS Code?

UNIX Tips and tricks - 01

A beginner-friendly guide to SEO