Working with Biological Sequences

Opening a FASTA File

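# open() replaces the file() builtin, which Python 3 removed;
# the with block closes the file automatically
with open('a.fasta') as fp:
    a = fp.readlines()
print(a)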

output

['>gi|88853329|emb|AJ628425.1| Fasciola gigantica ITS1, isolate FgGZB2\n',
 'ACCTGAAAATCTACTCTTACACAAGCGATACACGTGTGACCGTCATGTCATGCGATAAAAATTTGCGGAC\n',
 'GGCTATGCCTGGCTCATTGAGGTCACAGCATATCCGATCACTGATGGGGTGCCTACCTGTATGATACTCC\n',
 'GATGGTATGCTTGCGTCTCTCGGGGCGCTTGTCCAAGCCAGGAGAACGGGTTGTACTGCCATGATTGGTA\n',
 'GTGCTAGGCTTAAAGAGGAGATTTGGGCTACGGCCCTGCTCCCGCCCTATGAACTGTTTCATTACTACAA\n',
 'TTACACTGTTAAAGTGGTATTGAATGGCTTGCCATTCTTTGCCATTGCCCTCGCATGCACCCGGTCCTTG\n',
 'TGGCTGGACTGCACGTACGTCGCCCGGCGGTGCCTATCCCGGGTTGGACTGATAACCTGGTCTTTGACCA\n', 'TA']

Extracting the Sequence from a FASTA File

# open the FASTA file - one-line alternative to the previous example
a = open('a.fasta').readlines()
# join all lines except the first (the header), then remove the \n characters
seq = ''.join(a[1:])
seq = seq.replace('\n', '')
print(seq)

output

ACCTGAAAATCTACTCTTACACAAGCGATACACGTGTGACCGTCATGTCAT...TA
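
The join-and-strip steps can also be tried without a file on disk. The following is a minimal sketch, assuming a single-record FASTA held as a list of lines. The function name sequence_from_lines and the sample record are hypothetical, made up for illustration. Each line is stripped individually, so Windows line endings (\r\n) are removed as well.

```python
def sequence_from_lines(lines):
    """Return the sequence of a single-record FASTA given as a list of lines."""
    # skip the first line (the '>' header) and strip whitespace from the rest
    return ''.join(line.strip() for line in lines[1:])

# hypothetical record for illustration, not the contents of a.fasta
example = ['>demo sequence\n', 'ACCTGAAA\n', 'ATCT\r\n', 'TA']
print(sequence_from_lines(example))
```

Passing the result of readlines() to the function would give the same sequence as the example above.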

Extracting Sequence from a GenBank File

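# read the whole file into one string
a = open('NC_001284.gbk').read()
# the DNA starts on the line after ORIGIN and ends on the line before //
orgn = a.find('ORIGIN')
# skip to the end of the ORIGIN line rather than searching for the first '1'
start = a.find('\n', orgn) + 1
end = a.find('//', orgn)
b = a[start:end].split('\n')
seq = ''
for i in b:
    # each line is a position number followed by blocks of bases
    subseq = i.split()
    seq += ''.join(subseq[1:])
print(seq)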

run as:

python code.py > output.txt
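
To see what the loop operates on, the same idea can be run on an in-memory string. This is a sketch only: the record below is a made-up fragment that follows the GenBank flat-file layout (an ORIGIN line, numbered lines holding blocks of bases, and a // terminator), and the function name origin_sequence is hypothetical. It takes the sequence to start on the line after ORIGIN.

```python
def origin_sequence(text):
    """Return the bases between the ORIGIN line and the // terminator."""
    orgn = text.find('ORIGIN')
    start = text.find('\n', orgn) + 1   # first character after the ORIGIN line
    end = text.find('//', orgn)         # record terminator
    parts = []
    for line in text[start:end].split('\n'):
        # drop the leading position number, keep the blocks of bases
        parts.extend(line.split()[1:])
    return ''.join(parts)

# made-up fragment for illustration only
record = (
    "LOCUS       DEMO\n"
    "ORIGIN\n"
    "        1 gatcctccat atacaacggt\n"
    "       21 atctccacct\n"
    "//\n"
)
print(origin_sequence(record))
```

Collecting the pieces in a list and joining once avoids repeated string concatenation, which matters for long genomes.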

Exercises

  1. Extract the header of a FASTA file
  2. Extract the sequences from a file containing 5 FASTA records
  3. Convert a GenBank sequence to a FASTA file