Workshop 1. Introduction to Python: Workshop
Today’s workshop will involve solving a series of programming tasks with Python.
Task 1: Counting Nucleotides
Create a function that takes a str as an argument and returns a dict containing a count of each nucleotide.
Example Usage:
>>> count_nucleotides(gene)
{'A': 1735, 'G': 1276, 'C': 1276, 'T': 1729}
Here’s a gene sequence to test with:
AGACACGTGGTTCAGAGAGAACTTATAAATCTCCCCTCCCCGGCAAGATCGTGATGTTATCTGCTGGCAGCAGAAGGTTCGCTCCGAGCGGAGCTCCAGAAGCTCCTGACAAGAGAAAGACAGATTGAGATAGAGATAGAAAGAGAAAGAGAGAAAGAGACAGCAGAGCGAGAGCGCAAGTGAAAGAGGCAGGGGAGGGGGATGGAGAATATTAGCCTGACGGTCTAGGGAGTCATCCAGGAACAAACTGAGGGGCTGCCCGGCTGCAGACAGGAGGAGACAGAGAGGATCTATTTTAGGGTGGCAAGTGCCTACCTACCCTAAGCGAGCAATTCCACGTTGGGGAGAAGCCAGCAGAGGTTGGGAAAGGGTGGGAGTCCAAGGGAGCCCCTGCGCAACCCCCTCAGGAATAAAACTCCCCAGCCAGGGTGTCGCAAGGGCTGCCGTTGTGATCCGCAGGGGGTGAACGCAACCGCGACGGCTGATCGTCTGTGGCTGGGTTGGCGTTTGGAGCAAGAGAAGGAGGAGCAGGAGAAGGAGGGAGCTGGAGGCTGGAAGCGTTTGCAAGCGGCGGCGGCAGCAACGTGGAGTAACCAAGCGGGTCAGCGCGCGCCCGCCAGGGTGTAGGCCACGGCGCGCAGCTCCCAGAGCAGGATCCGCGCCGCCTCAGCAGCCTCTGCGGCCCCTGCGGCACCCGACCGAGTACCGAGCGCCCTGCGAAGCGCACCCTCCTCCCCGCGGTGCGCTGGGCTCGCCCCCAGCGCGCGCACACGCACACACACACACACACACACACACGCACGCACACACGTGTGCGCTTCTCTGCTCCGGAGCTGCTGCTGCTCCTGCTCTCAGCGCCGCAGTGGAAGGCAGGACCGAACCGCTCCTTCTTTAAATATATAAATTTCAGCCCAGGTCAGCCTCGGCGGCCCCCCTCACCGCGCTCCCGGCGCCCCTCCCGTCAGTTCGCCAGCTGCCAGCCCCGGGACCTTTTCATCTCTTCCCTTTTGGCCGGAGGAGCCGAGTTCAGATCCGCCACTCCGCACCCGAGACTGACACACTGAACTCCACTTCCTCCTCTTAAATTTATTTCTACTTAATAGCCACTCGTCTCTTTTTTTCCCCATCTCATTGCTCCAAGAATTTTTTTCTTCTTACTCGCCAAAGTCAGGGTTCCCTCTGCCCGTCCCGTATTAATATTTCCACTTTTGGAACTACTGGCCTTTTCTTTTTAAAGGAATTCAAGCAGGATACGTTTTTCTGTTGGGCATTGACTAGATTGTTTGCAAAAGTTTCGCATCAAAAACAACAACAACAAAAAACCAAACAACTCTCCTTGATCTATACTTTGAGAATTGTTGATTTCTTTTTTTTATTCTGACTTTTAAAAACAACTTTTTTTTCCACTTTTTTAAAAAATGCACTACTGTGTGCTGAGCGCTTTTCTGATCCTGCATCTGGTCACGGTCGCGCTCAGCCTGTCTACCTGCAGCACACTCGATATGGACCAGTTCATGCGCAAGAGGATCGAGGCGATCCGCGGGCAGATCCTGAGCAAGCTGAAGCTCACCAGTCCCCCAGAAGACTATCCTGAGCCCGAGGAAGTCCCCCCGGAGGTGATTTCCATCTACAACAGCACCAGGGACTTGCTCCAGGAGAAGGCGAGCCGGAGGGCGGCCGCCTGCGAGCGCGAGAGGAGCGACGAAGAGTACTACGCCAAGGAGGTTTACAAAATAGACATGCCGCCCTTCTTCCCCTCCGAAACTGTCTGCCCAGTTGTTACAACACCCTCTGGCTCAGTGGGCAGCTTGTGCTCCAGACAGTCCCAGGTGCTCTGTGGGTACCTTGATGCCATCCCGCCCACTTTCTACAGACCCTACTTCAGAATTGTTCGATTTGACGTCTCAGCAATGGAGAAGAATGCTTCCAATTTGGTGAAAGCAGAGTTCAGAGTCTTTCGTTTGCAGAACCCAAAAGCCAGAGTGCCTGAACAACGGATT
GAGCTATATCAGATTCTCAAGTCCAAAGATTTAACATCTCCAACCCAGCGCTACATCGACAGCAAAGTTGTGAAAACAAGAGCAGAAGGCGAATGGCTCTCCTTCGATGTAACTGATGCTGTTCATGAATGGCTTCACCATAAAGACAGGAACCTGGGATTTAAAATAAGCTTACACTGTCCCTGCTGCACTTTTGTACCATCTAATAATTACATCATCCCAAATAAAAGTGAAGAACTAGAAGCAAGATTTGCAGGTATTGATGGCACCTCCACATATACCAGTGGTGATCAGAAAACTATAAAGTCCACTAGGAAAAAAAACAGTGGGAAGACCCCACATCTCCTGCTAATGTTATTGCCCTCCTACAGACTTGAGTCACAACAGACCAACCGGCGGAAGAAGCGTGCTTTGGATGCGGCCTATTGCTTTAGAAATGTGCAGGATAATTGCTGCCTACGTCCACTTTACATTGATTTCAAGAGGGATCTAGGGTGGAAATGGATACACGAACCCAAAGGGTACAATGCCAACTTCTGTGCTGGAGCATGCCCGTATTTATGGAGTTCAGACACTCAGCACAGCAGGGTCCTGAGCTTATATAATACCATAAATCCAGAAGCATCTGCTTCTCCTTGCTGCGTGTCCCAAGATTTAGAACCTCTAACCATTCTCTACTACATTGGCAAAACACCCAAGATTGAACAGCTTTCTAATATGATTGTAAAGTCTTGCAAATGCAGCTAAAATTCTTGGAAAAGTGGCAAGACCAAAATGACAATGATGATGATAATGATGATGACGACGACAACGATGATGCTTGTAACAAGAAAACATAAGAGAGCCTTGGTTCATCAGTGTTAAAAAATTTTTGAAAAGGCGGTACTAGTTCAGACACTTTGGAAGTTTGTGTTCTGTTTGTTAAAACTGGCATCTGACACAAAAAAAGTTGAAGGCCTTATTCTACATTTCACCTACTTTGTAAGTGAGAGAGACAAGAAGCAAATTTTTTTTAAAGAAAAAAATAAACACTGGAAGAATTTATTAGTGTTAATTATGTGAACAACGACAACAACAACAACAACAACAAACAGGAAAATCCCATTAAGTGGAGTTGCTGTACGTACCGTTCCTATCCCGCGCCTCACTTGATTTTTCTGTATTGCTATGCAATAGGCACCCTTCCCATTCTTACTCTTAGAGTTAACAGTGAGTTATTTATTGTGTGTTACTATATAATGAACGTTTCATTGCCCTTGGAAAATAAAACAGGTGTATAAAGTGGAGACCAAATACTTTGCCAGAAACTCATGGATGGCTTAAGGAACTTGAACTCAAACGAGCCAGAAAAAAAGAGGTCATATTAATGGGATGAAAACCCAAGTGAGTTATTATATGACCGAGAAAGTCTGCATTAAGATAAAGACCCTGAAAACACATGTTATGTATCAGCTGCCTAAGGAAGCTTCTTGTAAGGTCCAAAAACTAAAAAGACTGTTAATAAAAGAAACTTTCAGTCAGAATAAGTCTGTAAGTTTTTTTTTTTCTTTTTAATTGTAAATGGTTCTTTGTCAGTTTAGTAAACCAGTGAAATGTTGAAATGTTTTGACATGTACTGGTCAAACTTCAGACCTTAAAATATTGCTGTATAGCTATGCTATAGGTTTTTTCCTTTGTTTTGGTATATGTAACCATACCTATATTATTAAAATAGATGGATATAGAAGCCAGCATAATTGAAAACACATCTGCAGATCTCTTTTGCAAACTATTAAATCAAAACATTAACTACTTTATGTGTAATGTGTAAATTTTTACCATATTTTTTATATTCTGTAATAATGTCAACTATGATTTAGATTGACTTAAATTTGGGCTCTTTTTAATGATCACTCACAAATGTATGTTTCTTTTAGCTGGCCAGTACTTTTGAGTAAAGCCCCTATAGTTTGACTTGCACTACAAATGCATTTTTTTTTTAATAACATTTGCCCTACTTG
TGCTTTGTGTTTCTTTCATTATTATGACATAAGCTACCTGGGTCCACTTGTCTTTTCTTTTTTTTGTTTCACAGAAAAGATGGGTTCGAGTTCAGTGGTCTTCATCTTCCAAGCATCATTACTAACCAAGTCAGACGTTAACAAATTTTTATGTTAGGAAAAGGAGGAATGTTATAGATACATAGAAAATTGAAGTAAAATGTTTTCATTTTAGCAAGGATTTAGGGTTCTAACTAAAACTCAGAATCTTTATTGAGTTAAGAAAAGTTTCTCTACCTTGGTTTAATCAATATTTTTGTAAAATCCTATTGTTATTACAAAGAGGACACTTCATAGGAAACATCTTTTTCTTTAGTCAGGTTTTTAATATTCAGGGGGAAATTGAAAGATATATATTTTAGTCGATTTTTCAAAAGGGGAAAAAAGTCCAGGTCAGCATAAGTCATTTTGTGTATTTCACTGAAGTTATAAGGTTTTTATAAATGTTCTTTGAAGGGGAAAAGGCACAAGCCAATTTTTCCTATGATCAAAAAATTCTTTCTTTCCTCTGAGTGAGAGTTATCTATATCTGAGGCTAAAGTTTACCTTGCTTTAATAAATAATTTGCCACATCATTGCAGAAGAGGTATCCTCATGCTGGGGTTAATAGAATATGTCAGTTTATCACTTGTCGCTTATTTAGCTTTAAAATAAAAATTAATAGGCAAAGCAATGGAATATTTGCAGTTTCACCTAAAGAGCAGCATAAGGAGGCGGGAATCCAAAGTGAAGTTGTTTGATATGGTCTACTTCTTTTTTGGAATTTCCTGACCATTAATTAAAGAATTGGATTTGCAAGTTTGAAAACTGGAAAAGCAAGAGATGGGATGCCATAATAGTAAACAGCCCTTGTGTTGGATGTAACCCAATCCCAGATTTGAGTGTGTGTTGATTATTTTTTTGTCTTCCACTTTTCTATTATGTGTAAATCACTTTTATTTCTGCAGACATTTTCCTCTCAGATAGGATGACATTTTGTTTTGTATTATTTTGTCTTTCCTCATGAATGCACTGATAATATTTTAAATGCTCTATTTTAAGATCTCTTGAATCTGTTTTTTTTTTTTTTAATTTGGGGGTTCTGTAAGGTCTTTATTTCCCATAAGTAAATATTGCCATGGGAGGGGGGTGGAGGTGGCAAGGAAGGGGTGAAGTGCTAGTATGCAAGTGGGCAGCAATTATTTTTGTGTTAATCAGCAGTACAATTTGATCGTTGGCATGGTTAAAAAATGGAATATAAGATTAGCTGTTTTGTATTTTGATGACCAATTACGCTGTATTTTAACACGATGTATGTCTGTTTTTGTGGTGCTCTAGTGGTAAATAAATTATTTCGATGATATGTGGATGTCTTTTTCCTATCAGTACCATCATCGAGTCTAGAAAACACCTGTGATGCAATAAGACTATCTCAAGCTGGAAAAGTCATACCACCTTTCCGATTGCCCTCTGTGCTTTCTCCCTTAAGGACAGTCACTTCAGAAGTCATGCTTTAAAGCACAAGAGTCAGGCCATATCCATCAAGGATAGAAGAAATCCCTGTGCCGTCTTTTTATTCCCTTATTTATTGCTATTTGGTAATTGTTTGAGATTTAGTTTCCATCCAGCTTGACTGCCGACCAGAAAAAATGCAGAGAGATGTTTGCACCATGCTTTGGCTTTCTGGTTCTATGTTCTGCCAACGCCAGGGCCAAAAGAACTGGTCTAGACAGTATCCCCTGTAGCCCCATAACTTGGATAGTTGCTGAGCCAGCCAGATATAACAAGAGCCACGTGCTTTCTGGGGTTGGTTGTTTGGGATCAGCTACTTGCCTGTCAGTTTCACTGGTACCACTGCACCACAAACAAAAAAACCCACCCTATTTCCTCCAATTTTTTTGGCTGCTACCTACAAGACCAGACTCCTCAAACGAGTTGCCAATCTCTTAATAAATAGGATTAATAAAAAAAGTAATTGTGAC
TCAAAAAAAAAAAAAA
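One possible approach to Task 1 is sketched below. It assumes the sequence contains only the four DNA bases (any other character, such as a line break left over from pasting the sequence, is simply not reported) and that lower-case input should be counted as upper case. It is a minimal sketch, not the only valid solution.

```python
from collections import Counter


def count_nucleotides(sequence):
    """Return a dict mapping each of A, G, C and T to its count in `sequence`."""
    counts = Counter(sequence.upper())
    # Build the result explicitly so that absent bases are reported as 0
    # and stray characters (whitespace, newlines) are left out.
    return {base: counts[base] for base in "AGCT"}


print(count_nucleotides("AGACACGTGG"))  # {'A': 3, 'G': 4, 'C': 2, 'T': 1}
```

A plain loop that increments a dict entry per character works just as well; Counter only saves the bookkeeping.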
Task 2: Analyzing Sequences in a File
Data File: sequences.txt
Create a function that takes a file as an argument and iterates over its lines. Each line will be a single gene sequence. For each line, the function should print to the console the length of the sequence and the CG/AT ratio of the sequence (the number of C and G nucleotides divided by the number of A and T nucleotides).
Example Usage:
>>> with open("sequences.txt") as f:
...     analyze_sequence_file(f)
6016 0.7367205542725174
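A minimal sketch of Task 2 follows. It assumes that blank lines should be skipped and that a sequence with no A or T should report "nan" rather than raise an error; neither case is specified in the task, so treat both as assumptions.

```python
def analyze_sequence_file(lines):
    """Print the length and CG/AT ratio of each sequence in `lines`.

    `lines` can be an open text file or any other iterable of strings.
    """
    for raw in lines:
        sequence = raw.strip().upper()
        if not sequence:
            continue  # assumption: ignore blank lines
        cg = sequence.count("C") + sequence.count("G")
        at = sequence.count("A") + sequence.count("T")
        # Assumption: report nan instead of dividing by zero.
        ratio = cg / at if at else float("nan")
        print(len(sequence), ratio)
```

Because the function only iterates over its argument, it can be tested with a list of strings as well as with a real file.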
Task 3: Parsing Metadata
Data File: sequences_with_annotations.txt
Create a function that takes a file as an argument. Each line may be either a gene sequence or its annotations. The annotations line always precedes the gene’s nucleotide sequence. For each pair of annotation line and gene sequence, create a dict which contains the key "sequence" with the value being the gene sequence, and the key "annotations" with a dict of the parsed annotation line values. The accession number, gene symbol, and common name should be stored under the keys "accession", "gene_symbol", and "name". All keyword pairs whose key is "KW" should be stored in a list under the key "keywords", and all other keyword pairs should be stored in this annotation dict with their keyword name in lower case.
Each annotation line contains data in the following grammar:
<annotation_line>: sp|<accession_number>|<gene_symbol> <common_name> <keywords_list>
<accession_number>: {uppercase letters or numbers}
<gene_symbol>: {uppercase letters or numbers}_HUMAN
<common_name>: {free text}
<keywords_list>: {keyword_pair}[;{keyword_pair}]
<keyword_pair>: {letters}={free text}
Example Annotation Line:
sp|P61812|TGFB2_HUMAN Transforming growth factor beta-2 KW=3D-structure;KW=Alternative splicing;KW=Aortic aneurysm;KW=Chromosomal rearrangement;KW=Cleavage on pair of basic residues;KW=Complete proteome;KW=Direct protein sequencing;KW=Disease mutation;KW=Disulfide bond;KW=Glycoprotein;KW=Growth factor;KW=Mitogen;KW=Polymorphism;KW=Reference proteome;KW=Secreted;KW=Signal;RefSeq=NM_001135599.3
This should be parsed into the following annotation dictionary:
{'accession': 'P61812',
 'gene_symbol': 'TGFB2_HUMAN',
 'keywords': ['3D-structure',
              'Alternative splicing',
              'Aortic aneurysm',
              'Chromosomal rearrangement',
              'Cleavage on pair of basic residues',
              'Complete proteome',
              'Direct protein sequencing',
              'Disease mutation',
              'Disulfide bond',
              'Glycoprotein',
              'Growth factor',
              'Mitogen',
              'Polymorphism',
              'Reference proteome',
              'Secreted',
              'Signal'],
 'name': 'Transforming growth factor beta-2',
 'refseq': 'NM_001135599.3'}
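The grammar above can be turned into a parser in several ways; the sketch below uses a regular expression. The names ANNOTATION_RE and parse_annotation_line are hypothetical, and the sketch assumes that the common name contains no "=" and that keyword values contain no ";", which holds for the example line but is not guaranteed by the grammar.

```python
import re

# Hypothetical pattern derived from the grammar. The common name is matched
# lazily so that it stops just before the first "<letters>=" keyword pair.
ANNOTATION_RE = re.compile(
    r"^sp\|(?P<accession>[A-Z0-9]+)\|(?P<gene_symbol>[A-Z0-9]+_HUMAN)\s+"
    r"(?P<name>.*?)\s+(?P<pairs>[A-Za-z]+=.*)$"
)


def parse_annotation_line(line):
    """Parse one annotation line into the annotation dict described above."""
    match = ANNOTATION_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not an annotation line: {line!r}")
    annotations = {
        "accession": match["accession"],
        "gene_symbol": match["gene_symbol"],
        "name": match["name"],
        "keywords": [],
    }
    for pair in match["pairs"].split(";"):
        key, _, value = pair.partition("=")
        if key == "KW":
            annotations["keywords"].append(value)
        else:
            # Every other pair is stored under its lower-cased key.
            annotations[key.lower()] = value
    return annotations
```

Splitting on "|" and then on the first keyword pair by hand is an equally reasonable design; the regular expression mainly keeps the grammar visible in one place.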
The final result should be a list of dictionaries. Example usage:
>>> with open("sequences_with_annotations.txt") as f:
...     sequence_data = parse_with_metadata(f)
>>> len(sequence_data)
138
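The pairing of annotation lines with sequences could be sketched as follows. To keep the sketch self-contained, the annotation parser is passed in as a second argument, parse_annotations, which is an assumption of this sketch; in the task itself the function takes only the file and calls your own parser. Annotation lines are recognised by the "sp|" prefix from the grammar, and each sequence is assumed to occupy exactly one line.

```python
def parse_with_metadata(lines, parse_annotations):
    """Return a list of {"sequence": ..., "annotations": ...} dicts.

    `parse_annotations` is any callable that turns an annotation line
    into a dict (a hypothetical parameter, added for this sketch only).
    """
    records = []
    pending = None  # most recent annotation line not yet matched to a sequence
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: ignore blank lines
        if line.startswith("sp|"):
            pending = line
        elif pending is not None:
            records.append(
                {"sequence": line, "annotations": parse_annotations(pending)}
            )
            pending = None
        # A sequence with no preceding annotation line is skipped here.
    return records
```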
Task 4: Partitioning on Keywords
Continuing to use the data you parsed from Task 3, get the complete set of all keywords in the dataset. For each keyword, count the number of times it occurs, and how often it co-occurs with each other keyword. Use this to construct a matrix of co-occurrences.
Using this matrix, find the most frequent keyword, and its most common co-occurring keywords.
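One way to build the counts and the matrix is sketched below, assuming the list-of-dicts layout from Task 3. The function names keyword_cooccurrence and as_matrix are hypothetical, and putting each keyword's own occurrence count on the diagonal is a choice made for this sketch, not a requirement of the task.

```python
from collections import Counter, defaultdict
from itertools import combinations


def keyword_cooccurrence(records):
    """Count keyword occurrences and pairwise co-occurrences across records."""
    counts = Counter()
    cooccurrence = defaultdict(Counter)
    for record in records:
        # Use a set so a keyword repeated within one record counts once.
        keywords = set(record["annotations"]["keywords"])
        counts.update(keywords)
        for first, second in combinations(sorted(keywords), 2):
            cooccurrence[first][second] += 1
            cooccurrence[second][first] += 1
    return counts, cooccurrence


def as_matrix(counts, cooccurrence):
    """Return (labels, matrix) with rows and columns in sorted keyword order."""
    labels = sorted(counts)
    matrix = [
        [counts[row] if row == col else cooccurrence[row][col] for col in labels]
        for row in labels
    ]
    return labels, matrix
```

With these in place, counts.most_common(1) gives the most frequent keyword, and cooccurrence[keyword].most_common(n) gives its n most common partners.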
Task 5: Writing Structured Data Out
Convert your dictionary representation into a text format you can write to a file and read back in to construct the same dictionary. If you choose to use an existing format, explain why you chose the format and what its strengths and weaknesses are.
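As one example of an existing format, the sketch below round-trips the data through JSON using the standard library json module. The function names dump_records and load_records are hypothetical. JSON is used here only as an illustration; you should still weigh it against alternatives for your own answer.

```python
import json


def dump_records(records, stream):
    """Write the list of record dicts to an open text stream as JSON."""
    json.dump(records, stream, indent=2, sort_keys=True)


def load_records(stream):
    """Read back the list of record dicts written by dump_records."""
    return json.load(stream)
```

Typical usage would open a file in "w" mode for dump_records and in "r" mode for load_records. JSON is human-readable and maps directly onto nested dicts, lists, and strings, which is all this dataset uses. On the other hand, it has no set or tuple type, requires string keys, and json.load reads the whole file into memory at once.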