Workshop 1. Introduction to Python: Workshop

Today’s workshop will involve solving a series of programming tasks with Python.

solutions.py

Task 1: Counting Nucleotides

Create a function that takes a str as an argument and returns a dict containing a count of each nucleotide.

Example Usage:

>>> count_nucleotides(gene)
{'A': 1735, 'G': 1276, 'C': 1276, 'T': 1729}

Here’s a gene sequence to test with:

AGACACGTGGTTCAGAGAGAACTTATAAATCTCCCCTCCCCGGCAAGATCGTGATGTTATCTGCTGGCAGCAGAAGGTTCGCTCCGAGCGGAGCTCCAGAAGCTCCTGACAAGAGAAAGACAGATTGAGATAGAGATAGAAAGAGAAAGAGAGAAAGAGACAGCAGAGCGAGAGCGCAAGTGAAAGAGGCAGGGGAGGGGGATGGAGAATATTAGCCTGACGGTCTAGGGAGTCATCCAGGAACAAACTGAGGGGCTGCCCGGCTGCAGACAGGAGGAGACAGAGAGGATCTATTTTAGGGTGGCAAGTGCCTACCTACCCTAAGCGAGCAATTCCACGTTGGGGAGAAGCCAGCAGAGGTTGGGAAAGGGTGGGAGTCCAAGGGAGCCCCTGCGCAACCCCCTCAGGAATAAAACTCCCCAGCCAGGGTGTCGCAAGGGCTGCCGTTGTGATCCGCAGGGGGTGAACGCAACCGCGACGGCTGATCGTCTGTGGCTGGGTTGGCGTTTGGAGCAAGAGAAGGAGGAGCAGGAGAAGGAGGGAGCTGGAGGCTGGAAGCGTTTGCAAGCGGCGGCGGCAGCAACGTGGAGTAACCAAGCGGGTCAGCGCGCGCCCGCCAGGGTGTAGGCCACGGCGCGCAGCTCCCAGAGCAGGATCCGCGCCGCCTCAGCAGCCTCTGCGGCCCCTGCGGCACCCGACCGAGTACCGAGCGCCCTGCGAAGCGCACCCTCCTCCCCGCGGTGCGCTGGGCTCGCCCCCAGCGCGCGCACACGCACACACACACACACACACACACACGCACGCACACACGTGTGCGCTTCTCTGCTCCGGAGCTGCTGCTGCTCCTGCTCTCAGCGCCGCAGTGGAAGGCAGGACCGAACCGCTCCTTCTTTAAATATATAAATTTCAGCCCAGGTCAGCCTCGGCGGCCCCCCTCACCGCGCTCCCGGCGCCCCTCCCGTCAGTTCGCCAGCTGCCAGCCCCGGGACCTTTTCATCTCTTCCCTTTTGGCCGGAGGAGCCGAGTTCAGATCCGCCACTCCGCACCCGAGACTGACACACTGAACTCCACTTCCTCCTCTTAAATTTATTTCTACTTAATAGCCACTCGTCTCTTTTTTTCCCCATCTCATTGCTCCAAGAATTTTTTTCTTCTTACTCGCCAAAGTCAGGGTTCCCTCTGCCCGTCCCGTATTAATATTTCCACTTTTGGAACTACTGGCCTTTTCTTTTTAAAGGAATTCAAGCAGGATACGTTTTTCTGTTGGGCATTGACTAGATTGTTTGCAAAAGTTTCGCATCAAAAACAACAACAACAAAAAACCAAACAACTCTCCTTGATCTATACTTTGAGAATTGTTGATTTCTTTTTTTTATTCTGACTTTTAAAAACAACTTTTTTTTCCACTTTTTTAAAAAATGCACTACTGTGTGCTGAGCGCTTTTCTGATCCTGCATCTGGTCACGGTCGCGCTCAGCCTGTCTACCTGCAGCACACTCGATATGGACCAGTTCATGCGCAAGAGGATCGAGGCGATCCGCGGGCAGATCCTGAGCAAGCTGAAGCTCACCAGTCCCCCAGAAGACTATCCTGAGCCCGAGGAAGTCCCCCCGGAGGTGATTTCCATCTACAACAGCACCAGGGACTTGCTCCAGGAGAAGGCGAGCCGGAGGGCGGCCGCCTGCGAGCGCGAGAGGAGCGACGAAGAGTACTACGCCAAGGAGGTTTACAAAATAGACATGCCGCCCTTCTTCCCCTCCGAAACTGTCTGCCCAGTTGTTACAACACCCTCTGGCTCAGTGGGCAGCTTGTGCTCCAGACAGTCCCAGGTGCTCTGTGGGTACCTTGATGCCATCCCGCCCACTTTCTACAGACCCTACTTCAGAATTGTTCGATTTGACGTCTCAGCAATGGAGAAGAATGCTTCCAATTTGGTGAAAGCAGAGTTCAGAGTCTTTCGTTTGCAGAACCCAAAAGCCAGAGTGCCTGAACAACGGATT
GAGCTATATCAGATTCTCAAGTCCAAAGATTTAACATCTCCAACCCAGCGCTACATCGACAGCAAAGTTGTGAAAACAAGAGCAGAAGGCGAATGGCTCTCCTTCGATGTAACTGATGCTGTTCATGAATGGCTTCACCATAAAGACAGGAACCTGGGATTTAAAATAAGCTTACACTGTCCCTGCTGCACTTTTGTACCATCTAATAATTACATCATCCCAAATAAAAGTGAAGAACTAGAAGCAAGATTTGCAGGTATTGATGGCACCTCCACATATACCAGTGGTGATCAGAAAACTATAAAGTCCACTAGGAAAAAAAACAGTGGGAAGACCCCACATCTCCTGCTAATGTTATTGCCCTCCTACAGACTTGAGTCACAACAGACCAACCGGCGGAAGAAGCGTGCTTTGGATGCGGCCTATTGCTTTAGAAATGTGCAGGATAATTGCTGCCTACGTCCACTTTACATTGATTTCAAGAGGGATCTAGGGTGGAAATGGATACACGAACCCAAAGGGTACAATGCCAACTTCTGTGCTGGAGCATGCCCGTATTTATGGAGTTCAGACACTCAGCACAGCAGGGTCCTGAGCTTATATAATACCATAAATCCAGAAGCATCTGCTTCTCCTTGCTGCGTGTCCCAAGATTTAGAACCTCTAACCATTCTCTACTACATTGGCAAAACACCCAAGATTGAACAGCTTTCTAATATGATTGTAAAGTCTTGCAAATGCAGCTAAAATTCTTGGAAAAGTGGCAAGACCAAAATGACAATGATGATGATAATGATGATGACGACGACAACGATGATGCTTGTAACAAGAAAACATAAGAGAGCCTTGGTTCATCAGTGTTAAAAAATTTTTGAAAAGGCGGTACTAGTTCAGACACTTTGGAAGTTTGTGTTCTGTTTGTTAAAACTGGCATCTGACACAAAAAAAGTTGAAGGCCTTATTCTACATTTCACCTACTTTGTAAGTGAGAGAGACAAGAAGCAAATTTTTTTTAAAGAAAAAAATAAACACTGGAAGAATTTATTAGTGTTAATTATGTGAACAACGACAACAACAACAACAACAACAAACAGGAAAATCCCATTAAGTGGAGTTGCTGTACGTACCGTTCCTATCCCGCGCCTCACTTGATTTTTCTGTATTGCTATGCAATAGGCACCCTTCCCATTCTTACTCTTAGAGTTAACAGTGAGTTATTTATTGTGTGTTACTATATAATGAACGTTTCATTGCCCTTGGAAAATAAAACAGGTGTATAAAGTGGAGACCAAATACTTTGCCAGAAACTCATGGATGGCTTAAGGAACTTGAACTCAAACGAGCCAGAAAAAAAGAGGTCATATTAATGGGATGAAAACCCAAGTGAGTTATTATATGACCGAGAAAGTCTGCATTAAGATAAAGACCCTGAAAACACATGTTATGTATCAGCTGCCTAAGGAAGCTTCTTGTAAGGTCCAAAAACTAAAAAGACTGTTAATAAAAGAAACTTTCAGTCAGAATAAGTCTGTAAGTTTTTTTTTTTCTTTTTAATTGTAAATGGTTCTTTGTCAGTTTAGTAAACCAGTGAAATGTTGAAATGTTTTGACATGTACTGGTCAAACTTCAGACCTTAAAATATTGCTGTATAGCTATGCTATAGGTTTTTTCCTTTGTTTTGGTATATGTAACCATACCTATATTATTAAAATAGATGGATATAGAAGCCAGCATAATTGAAAACACATCTGCAGATCTCTTTTGCAAACTATTAAATCAAAACATTAACTACTTTATGTGTAATGTGTAAATTTTTACCATATTTTTTATATTCTGTAATAATGTCAACTATGATTTAGATTGACTTAAATTTGGGCTCTTTTTAATGATCACTCACAAATGTATGTTTCTTTTAGCTGGCCAGTACTTTTGAGTAAAGCCCCTATAGTTTGACTTGCACTACAAATGCATTTTTTTTTTAATAACATTTGCCCTACTTG
TGCTTTGTGTTTCTTTCATTATTATGACATAAGCTACCTGGGTCCACTTGTCTTTTCTTTTTTTTGTTTCACAGAAAAGATGGGTTCGAGTTCAGTGGTCTTCATCTTCCAAGCATCATTACTAACCAAGTCAGACGTTAACAAATTTTTATGTTAGGAAAAGGAGGAATGTTATAGATACATAGAAAATTGAAGTAAAATGTTTTCATTTTAGCAAGGATTTAGGGTTCTAACTAAAACTCAGAATCTTTATTGAGTTAAGAAAAGTTTCTCTACCTTGGTTTAATCAATATTTTTGTAAAATCCTATTGTTATTACAAAGAGGACACTTCATAGGAAACATCTTTTTCTTTAGTCAGGTTTTTAATATTCAGGGGGAAATTGAAAGATATATATTTTAGTCGATTTTTCAAAAGGGGAAAAAAGTCCAGGTCAGCATAAGTCATTTTGTGTATTTCACTGAAGTTATAAGGTTTTTATAAATGTTCTTTGAAGGGGAAAAGGCACAAGCCAATTTTTCCTATGATCAAAAAATTCTTTCTTTCCTCTGAGTGAGAGTTATCTATATCTGAGGCTAAAGTTTACCTTGCTTTAATAAATAATTTGCCACATCATTGCAGAAGAGGTATCCTCATGCTGGGGTTAATAGAATATGTCAGTTTATCACTTGTCGCTTATTTAGCTTTAAAATAAAAATTAATAGGCAAAGCAATGGAATATTTGCAGTTTCACCTAAAGAGCAGCATAAGGAGGCGGGAATCCAAAGTGAAGTTGTTTGATATGGTCTACTTCTTTTTTGGAATTTCCTGACCATTAATTAAAGAATTGGATTTGCAAGTTTGAAAACTGGAAAAGCAAGAGATGGGATGCCATAATAGTAAACAGCCCTTGTGTTGGATGTAACCCAATCCCAGATTTGAGTGTGTGTTGATTATTTTTTTGTCTTCCACTTTTCTATTATGTGTAAATCACTTTTATTTCTGCAGACATTTTCCTCTCAGATAGGATGACATTTTGTTTTGTATTATTTTGTCTTTCCTCATGAATGCACTGATAATATTTTAAATGCTCTATTTTAAGATCTCTTGAATCTGTTTTTTTTTTTTTTAATTTGGGGGTTCTGTAAGGTCTTTATTTCCCATAAGTAAATATTGCCATGGGAGGGGGGTGGAGGTGGCAAGGAAGGGGTGAAGTGCTAGTATGCAAGTGGGCAGCAATTATTTTTGTGTTAATCAGCAGTACAATTTGATCGTTGGCATGGTTAAAAAATGGAATATAAGATTAGCTGTTTTGTATTTTGATGACCAATTACGCTGTATTTTAACACGATGTATGTCTGTTTTTGTGGTGCTCTAGTGGTAAATAAATTATTTCGATGATATGTGGATGTCTTTTTCCTATCAGTACCATCATCGAGTCTAGAAAACACCTGTGATGCAATAAGACTATCTCAAGCTGGAAAAGTCATACCACCTTTCCGATTGCCCTCTGTGCTTTCTCCCTTAAGGACAGTCACTTCAGAAGTCATGCTTTAAAGCACAAGAGTCAGGCCATATCCATCAAGGATAGAAGAAATCCCTGTGCCGTCTTTTTATTCCCTTATTTATTGCTATTTGGTAATTGTTTGAGATTTAGTTTCCATCCAGCTTGACTGCCGACCAGAAAAAATGCAGAGAGATGTTTGCACCATGCTTTGGCTTTCTGGTTCTATGTTCTGCCAACGCCAGGGCCAAAAGAACTGGTCTAGACAGTATCCCCTGTAGCCCCATAACTTGGATAGTTGCTGAGCCAGCCAGATATAACAAGAGCCACGTGCTTTCTGGGGTTGGTTGTTTGGGATCAGCTACTTGCCTGTCAGTTTCACTGGTACCACTGCACCACAAACAAAAAAACCCACCCTATTTCCTCCAATTTTTTTGGCTGCTACCTACAAGACCAGACTCCTCAAACGAGTTGCCAATCTCTTAATAAATAGGATTAATAAAAAAAGTAATTGTGAC
TCAAAAAAAAAAAAAA
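
One way to approach Task 1, offered as a sketch rather than the reference solution, is to let collections.Counter tally the characters and then pick out the four bases. Upper-casing the input and ignoring any character other than A, G, C and T (such as N or stray whitespace) are assumptions; the task statement does not say how to treat them.

```python
from collections import Counter


def count_nucleotides(sequence):
    """Return a dict mapping each of A, G, C, T to its count in `sequence`."""
    # Assumption: lower-case input should be counted too.
    counts = Counter(sequence.upper())
    # Assumption: always report all four bases, even when a count is zero,
    # and ignore any other characters.
    return {base: counts[base] for base in "AGCT"}
```

Building the result from a fixed "AGCT" string keeps the keys in the same order as the example output and guarantees that a missing base shows up as 0 instead of being absent.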

Task 2: Analyzing Sequences in a File

Data File: sequences.txt

Create a function that takes an open file as an argument and iterates over its lines. Each line is a single gene sequence. For each line, the function should print to the console the length of the sequence and its CG/AT ratio, that is, the number of C and G nucleotides divided by the number of A and T nucleotides.

Example Usage:

>>> with open("sequences.txt") as f:
...     analyze_sequence_file(f)
6016    0.7367205542725174
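
A possible shape for Task 2 is sketched below. The helper name cg_at_ratio is hypothetical, and three details are assumptions: blank lines are skipped, the two values are separated by a tab, and a sequence with no A or T yields NaN instead of raising an error.

```python
def cg_at_ratio(sequence):
    """Return (count of C and G) / (count of A and T) for `sequence`."""
    cg = sequence.count("C") + sequence.count("G")
    at = sequence.count("A") + sequence.count("T")
    # Assumption: report NaN when the ratio is undefined.
    return cg / at if at else float("nan")


def analyze_sequence_file(lines):
    """Print the length and CG/AT ratio of every sequence in `lines`."""
    for line in lines:
        sequence = line.strip().upper()
        if not sequence:
            continue  # Assumption: blank lines carry no sequence.
        # Assumption: tab-separated output.
        print(len(sequence), cg_at_ratio(sequence), sep="\t")
```

Because the function only iterates over its argument, it works with an open file, a list of strings, or an io.StringIO object, which makes it easy to try on a few short sequences before running it on sequences.txt.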

Task 3: Parsing Metadata

Data File: sequences_with_annotations.txt

Create a function that takes a file as an argument. Each line is either a gene sequence or an annotation line, and a gene’s annotation line always precedes its nucleotide sequence. For each pair of annotation line and gene sequence, create a dict with the key "sequence", whose value is the gene sequence, and the key "annotations", whose value is a dict of the parsed annotation line values. Store the values of all keyword pairs whose key is "KW" in a list under the key "keywords", and store every other keyword pair in the annotation dict under its key in lower case.

Each annotation line contains data in the following grammar:

<annotation_line>:  sp|<accession_number>|<gene_symbol> <common_name> <keywords_list>
<accession_number>: {uppercase letters or numbers}
<gene_symbol>:      {uppercase letters or numbers}_HUMAN
<common_name>:      {free text}
<keywords_list>:    <keyword_pair>[;<keyword_pair>]...
<keyword_pair>:     {letters}={free text}

Example Annotation Line:

sp|P61812|TGFB2_HUMAN Transforming growth factor beta-2 KW=3D-structure;KW=Alternative splicing;KW=Aortic aneurysm;KW=Chromosomal rearrangement;KW=Cleavage on pair of basic residues;KW=Complete proteome;KW=Direct protein sequencing;KW=Disease mutation;KW=Disulfide bond;KW=Glycoprotein;KW=Growth factor;KW=Mitogen;KW=Polymorphism;KW=Reference proteome;KW=Secreted;KW=Signal;RefSeq=NM_001135599.3

This should be parsed into the following annotation dictionary:

{'accession': 'P61812',
 'gene_symbol': 'TGFB2_HUMAN',
 'keywords': ['3D-structure',
              'Alternative splicing',
              'Aortic aneurysm',
              'Chromosomal rearrangement',
              'Cleavage on pair of basic residues',
              'Complete proteome',
              'Direct protein sequencing',
              'Disease mutation',
              'Disulfide bond',
              'Glycoprotein',
              'Growth factor',
              'Mitogen',
              'Polymorphism',
              'Reference proteome',
              'Secreted',
              'Signal'],
 'name': 'Transforming growth factor beta-2',
 'refseq': 'NM_001135599.3'}
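
The grammar above maps fairly directly onto a regular expression. The sketch below is one possible reading, not the reference solution: the function name parse_annotation_line is hypothetical, and it assumes the common name never contains a word immediately followed by "=" and that keyword values never contain ";".

```python
import re

# sp|<accession_number>|<gene_symbol> <common_name> <keywords_list>
ANNOTATION_RE = re.compile(
    r"^sp\|(?P<accession>[A-Z0-9]+)\|(?P<gene_symbol>[A-Z0-9]+_HUMAN)\s+"
    r"(?P<name>.*?)\s+(?P<pairs>[A-Za-z]+=.*)$"
)


def parse_annotation_line(line):
    """Parse one annotation line into the annotation dict shown above."""
    match = ANNOTATION_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not an annotation line: {line[:40]!r}")
    annotations = {
        "accession": match["accession"],
        "gene_symbol": match["gene_symbol"],
        "name": match["name"],
        "keywords": [],
    }
    for pair in match["pairs"].split(";"):
        # Split on the first "=" only, so values may contain "=".
        key, _, value = pair.partition("=")
        if key == "KW":
            annotations["keywords"].append(value)
        else:
            annotations[key.lower()] = value
    return annotations
```

The lazy `.*?` in the name group stops at the first point where the rest of the line looks like `letters=...`, which is how the free-text name is separated from the keyword list.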

The final result should be a list of these dictionaries. Example usage:

>>> with open("sequences_with_annotations.txt") as f:
...     sequence_data = parse_with_metadata(f)
>>> len(sequence_data)
138
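
Pairing each annotation line with the sequence that follows it can be sketched separately from the annotation parsing itself. In the sketch below, pair_records is a hypothetical helper name and parse_annotations stands for whatever function you write to turn an annotation line into a dict. Treating every line that starts with "sp|" as an annotation line follows the grammar above; skipping blank lines is an assumption.

```python
def pair_records(lines, parse_annotations):
    """Return one dict per (annotation line, sequence line) pair in `lines`."""
    records = []
    annotations = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Assumption: blank lines can be ignored.
        if line.startswith("sp|"):
            # Remember the annotations until their sequence arrives.
            annotations = parse_annotations(line)
        elif annotations is not None:
            records.append({"sequence": line, "annotations": annotations})
            annotations = None
    return records
```

A parse_with_metadata function could then be a one-line wrapper that calls this helper with your own annotation parser.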

Task 4: Partitioning on Keywords

Continuing to use the data you parsed from Task 3, get the complete set of all keywords in the dataset. For each keyword, count the number of times it occurs, and how often it co-occurs with each other keyword. Use this to construct a matrix of co-occurrences.

Using this matrix, find the most frequent keyword, and its most common co-occurring keywords.
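
One way to build the counts is sketched below, assuming the record layout from Task 3 (each record has an "annotations" dict with a "keywords" list). The function names are hypothetical. The co-occurrences are first stored sparsely as nested Counters and then expanded into a square list-of-lists matrix. Putting each keyword's own count on the diagonal is a design choice, not something the task requires.

```python
from collections import Counter, defaultdict
from itertools import combinations


def keyword_cooccurrences(records):
    """Return (per-keyword counts, nested co-occurrence counts)."""
    counts = Counter()
    cooccur = defaultdict(Counter)
    for record in records:
        # A set ensures a keyword repeated within one record counts once.
        keywords = set(record["annotations"].get("keywords", []))
        counts.update(keywords)
        for a, b in combinations(sorted(keywords), 2):
            cooccur[a][b] += 1
            cooccur[b][a] += 1
    return counts, cooccur


def to_matrix(counts, cooccur):
    """Return (labels, matrix), where matrix[i][j] pairs labels[i] and labels[j]."""
    labels = sorted(counts)
    matrix = [
        [counts[row] if row == col else cooccur[row][col] for col in labels]
        for row in labels
    ]
    return labels, matrix
```

With these structures, counts.most_common(1) gives the most frequent keyword, and cooccur[keyword].most_common(5) gives the five keywords it most often appears with.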

Task 5: Writing Structured Data Out

Convert your dictionary representation into a text format you can write to a file and read back in to construct the same dictionary. If you choose to use an existing format, explain why you chose the format and what its strengths and weaknesses are.
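
As one example of an existing format, the sketch below uses JSON via the standard library json module. The function names write_records and read_records are hypothetical, and the sketch assumes the records contain only dicts, lists and strings, as in Task 3.

```python
import json


def write_records(records, f):
    """Write the list of record dicts to the open text file `f` as JSON."""
    json.dump(records, f, indent=2, sort_keys=True)


def read_records(f):
    """Read back the list of record dicts written by write_records."""
    return json.load(f)
```

JSON is human-readable, needs no third-party packages, and maps directly onto dicts, lists and strings, so the data round-trips unchanged. Its weaknesses here are that it is verbose for long sequences, that all dict keys must be strings, and that tuples and sets do not survive the round trip (tuples come back as lists, and sets cannot be serialized without conversion).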