Friday, January 30, 2015

Sequence exercises in R: occurrence of DNA words

R code exercise on occurrence of DNA words.

Learning outcomes:
Longer words should occur less often in DNA.
Restriction enzymes with longer recognition sites should find those sites less frequently in DNA.
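
The reasoning behind both outcomes can be sketched with a little arithmetic. There are 4^k possible DNA words of length k, and a sequence of length n contains n - k + 1 overlapping windows, so the mean count per possible word is (n - k + 1) / 4^k, which shrinks quickly as k grows. The sketch below checks this on a simulated random sequence using base R only. The sequence length of 1500 (roughly that of a 16S rDNA gene), the seed, and the helper name word_counts are illustrative assumptions and are not part of the exercise script.

```r
set.seed(1)  # assumed seed, only for reproducibility
n <- 1500    # assumed length, roughly that of a 16S rDNA gene
dna <- sample(c("a", "c", "g", "t"), n, replace = TRUE)

# Hypothetical helper: count overlapping k-letter words
# in a vector of single nucleotides.
word_counts <- function(dna, k) {
  starts <- seq_len(length(dna) - k + 1)
  words <- vapply(starts,
                  function(i) paste(dna[i:(i + k - 1)], collapse = ""),
                  character(1))
  table(words)
}

for (k in c(1, 2, 3, 4, 6, 8)) {
  observed <- word_counts(dna, k)
  # Mean over all 4^k possible words, including words that never occur.
  mean_count <- sum(observed) / 4^k
  cat(sprintf("k = %d: %5d possible words, mean count = %.4f\n",
              k, 4^k, mean_count))
}
```

For 8-letter words there are 65536 possibilities but fewer than 1500 windows, so most words cannot appear at all in a sequence of this length.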

Reference
http://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter1.html
http://www.bioconductor.org/packages/release/bioc/html/REDseq.html



# Exercise to study how the occurrence of DNA words is influenced by their length.
# How often do 1-letter, 2-letter, 3-letter, ... 8-letter DNA words occur?
# Learning outcome: longer words should occur less often in DNA
# by Hong Qin, Jan 30, 2015, for Bio125 @ Spelman College

library("seqinr");

# read in some bacterial 16S rDNA sequences
seqs = read.fasta( "http://www.bioinformatics.org/ctls/download/data/16srDNA.fasta",seqtype="DNA");

# look at the first sequence
seq1 = seqs[[1]]
count(seq1, 1) # nucleotide composition
mean( count(seq1, 1) )

count(seq1, 2) # occurrence of 2-letter DNA words
mean( count(seq1, 2) )

count(seq1, 3) # occurrence of 3-letter DNA words
mean( count(seq1, 3) )
results = count(seq1, 3)
results['agc']

# ?  # occurrence of 4-letter DNA words
# ?  # occurrence of 5-letter DNA words
# ?  # occurrence of 6-letter DNA words

count(seq1, 8) # occurrence of 8-letter DNA words
mean( count(seq1, 8) )
median( count(seq1, 8) )
max( count(seq1, 8) )
hist(count(seq1, 8), breaks=30) # distribution of 8-letter word counts

results = count(seq1, 8)
results['agccgacc']
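
The second learning outcome, about restriction enzymes, can be sketched the same way. The example below compares a 4-base site (ggcc, recognized by HaeIII) with a 6-base site (gaattc, recognized by EcoRI) in a simulated random sequence. The seed, the sequence length, and the helper name count_site are assumptions made for illustration. To try it on real data, the simulated string could be replaced with paste(seq1, collapse = "").

```r
set.seed(42)   # assumed seed, only for reproducibility
n <- 100000    # assumed sequence length
dna <- paste(sample(c("a", "c", "g", "t"), n, replace = TRUE), collapse = "")

# Hypothetical helper: count occurrences of a fixed site in a DNA string.
# Neither site used below can overlap itself, so non-overlapping
# matching with gregexpr() is sufficient here.
count_site <- function(dna, site) {
  hits <- gregexpr(site, dna, fixed = TRUE)[[1]]
  if (hits[1] == -1) 0L else length(hits)
}

sites <- c(HaeIII = "ggcc", EcoRI = "gaattc")
for (enzyme in names(sites)) {
  site <- sites[[enzyme]]
  # Expected count if all four bases are equally likely and independent.
  expected <- (n - nchar(site) + 1) / 4^nchar(site)
  cat(sprintf("%s (%s): observed %d, expected about %.1f\n",
              enzyme, site, count_site(dna, site), expected))
}
```

Each extra base in the site divides the expected number of sites by four, so under this simple model a 6-base cutter should find about one sixteenth as many sites as a 4-base cutter. Real genomes have uneven base composition, so observed counts can differ from this estimate.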

