Workshop 1. Introduction to Python¶

Introduction¶

Hello, my name is Josh.

These are the materials for the second workshop, workshop one.

This workshop is intended to introduce you to problem-solving with Python. If you’re already familiar with some parts of the language, feel free to skip over those sections.

Running Python Code¶

Python can be used to write executable programs, as well as interactively. To start an interactive Python interpreter from the shell:

$ python
Python 3.6.1 |Continuum Analytics, Inc.| (default, May 11 2017, 13:09:58)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

We can execute statements interactively with this program, which will read the code you write, evaluate it, print the results, and loop back to the beginning. This is why this type of program is called a REPL, short for read-eval-print loop.

The REPL will keep the results of previous statements from that session. We can define variables and use them in later statements:

>>> a = 3
>>> b = 4
>>> a + b
7

Using this interactive prompt, we can treat Python like a simple desk calculator or as a way to write short one-off programs. To quit the interactive session, you can write quit() and press <Enter>.

We can also execute files containing Python code:

example1.py¶

#!/usr/bin/env python
import math

a = 3
b = 4

# pythagorean theorem
#   |\
# b | \ c
#   |__\
#     a

c_squared = a ** 2 + b ** 2

print(c_squared, math.sqrt(c_squared))

$ python example1.py
25 5.0

There are other ways to run Python code, but these two methods are the ones that will be used for now.

Builtin Types¶

Python has many handy built-in types, and lets you easily define more yourself. Here, by “type”, I mean “kind of thing”, not pressing keys on a keyboard. For example, we can all agree that 12 is a number, that “cat” is a series of characters called a string, and that a cat is an animal. We know we can add numbers together, so we can add 12 to another number like 13 and get the number 25, but attempting to add two cats together is liable to be a messy, unpredictable process which computers don’t have a definition for. Similarly, we can agree that the uppercase version of “cat” is “CAT”, but only typographers will assert that there is an upper or lower case for the number 12.
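The contrast in the paragraph above can be tried directly. This is only a small sketch; the variable names (total, shout, has_case) are made up for illustration.

```python
# Numbers support arithmetic.
total = 12 + 13
print(total)  # 25

# Strings support text operations such as upper-casing.
shout = "cat".upper()
print(shout)  # CAT

# An int has no upper() method, because case is not defined for numbers.
has_case = hasattr(12, "upper")
print(has_case)  # False

# type() reports what kind of thing a value is.
print(type(12).__name__, type("cat").__name__)  # int str
```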

After all of this, I hope I’ve convinced you that types are useful enough for you to go read An Informal Introduction to Python which is a part of the official Python documentation. We will be referencing parts of it throughout this workshop, and it’s often the best way to learn how the language works.

Note

Expected Reading Time: 15 minutes

A Simple Program¶

Okay, now that you’ve learned more about how to play with numbers, strings, and lists, we’ll use these to build on the while-loop you saw at the end. Let’s say you have a list of numbers:

>>> numbers = [1, 0, 2, 1, 3, 1, 5, 1, 2, 3, 4, 4, 4, 5, 1]

We want to count the number of times each number appears in numbers. First, we find out how many items it holds:

>>> n = len(numbers)

You learned about the len() function in that informal introduction. It returns the number of items in a container type, which a list is. So now n contains the number of items in numbers. We can iterate over numbers using a while loop like this:

>>> i = 0
>>> while i < n:
...     j = numbers[i]
...     print(j)
...     i += 1
...
1
0
2
1
3
1
5
1
2
3
4
4
4
5
1

We can then use a list whose length is the largest number in numbers plus one to count the number of times each number was seen. We have to add one to the largest number because we start counting from zero instead of one.

>>> i = 0
>>> counts = [0, 0, 0, 0, 0, 0]
>>> while i < n:
...     j = numbers[i]
...     counts[j] += 1
...     i += 1
...
>>> print(counts)
[1, 5, 2, 2, 3, 2]

This works because j’s value is a number which is both the thing we want to count and is used to find the place where we hold the count inside the counts list. This was pretty convenient and obviously only works because we’re counting numbers from 0 and up without many gaps. We also had to pre-fill counts with the right size, and set all of its values to 0. We’ll try to improve this example to be more idiomatic and less contrived.

In Python, while loops are uncommon for a number of reasons. The primary reason is that if you forget to include the line with i += 1, your loop will run forever and never stop. Like many other languages, Python has for loops too, but they’re a bit different. Instead of iterating over a user-defined start and stop condition like in C-like languages, Python’s for loop iterates over an object in a type-specific fashion, which means that types control how they’re traversed. In the case of sequences like list or str, this means they’re iterated over in order, from start to end.

For more information, see 4.2 for Statements.

If we were to rewrite that counting step with a for-loop, here’s what it would look like:

>>> counts = [0, 0, 0, 0, 0, 0]
>>> for j in numbers:
...     counts[j] += 1
...
>>> print(counts)
[1, 5, 2, 2, 3, 2]

This simplified the code and removed the potential for you to accidentally enter an infinite loop. We still need to pre-initialize counts though. We’ll fix this using another loop and an if statement.

if Statements¶

An if statement works like the conditional expression of a while loop.

>>> a = 3
>>> b = 4
>>> c = 5

>>> if b > a:
...     print("b is greater than a")
b is greater than a

We can make those conditional expressions as complicated as we want, using boolean operators like and and or to combine them. We can also specify an else block, which runs when the condition is false.

>>> if b == (a ** 2 - 1) / 2 and c == b + 1:
...     print("a, b, and c are Pythagorean triples!")
... else:
...     print("crude failures of numbers")
a, b, and c are Pythagorean triples!
>>> b = 6
>>> if b == (a ** 2 - 1) / 2 and c == b + 1:
...     print("a, b, and c are Pythagorean triples!")
... else:
...     print("crude failures of numbers")
crude failures of numbers

You can read more about if statements at 4.1 if Statements and the boolean operators at 6.11 Boolean operations.

Simplifying Solution¶

Now, to solve our problem of pre-initializing counts, we can grow the list whenever we meet a number that doesn’t fit yet:

>>> counts = []
>>> for j in numbers:
...     difference = j - len(counts) + 1
...     if difference > 0:
...         for i in range(difference):
...             counts.append(0)

Here, the range function returns an object which, when iterated over, produces numbers from 0 up to the first argument (non-inclusive). Alternatively, if given two arguments, it will produce numbers starting from the first argument up to the second argument (also non-inclusive).
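As a quick illustration of both forms of range, the sketch below wraps each call in list() so the produced numbers can be seen at once. The names first and second are only for demonstration.

```python
# range(stop) counts from 0 up to, but not including, stop.
first = list(range(5))
print(first)   # [0, 1, 2, 3, 4]

# range(start, stop) counts from start up to, but not including, stop.
second = list(range(2, 6))
print(second)  # [2, 3, 4, 5]
```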

We can solve this even more simply by using another builtin function max. max will return the largest value in an iterable object.

>>> counts = []
>>> for i in range(max(numbers) + 1):
...     counts.append(0)

There is another builtin function called min which does the opposite, returning the smallest value in an iterable object.
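To see max and min on the list from earlier, here is a short sketch. The last step uses list repetition with *, which this workshop does not otherwise rely on; treat it as an optional shortcut for building the zero-filled list rather than the approach used here.

```python
numbers = [1, 0, 2, 1, 3, 1, 5, 1, 2, 3, 4, 4, 4, 5, 1]

largest = max(numbers)
smallest = min(numbers)
print(largest, smallest)  # 5 0

# Repeating a one-item list with * builds the zero-filled list in one step.
counts = [0] * (largest + 1)
print(counts)  # [0, 0, 0, 0, 0, 0]
```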

Now, the simplified program looks like

counts = []
for i in range(max(numbers) + 1):
    counts.append(0)
for j in numbers:
    counts[j] += 1
print(counts)

Defining Functions¶

With our simplified solution, we can now count numbers with very few lines of code, but we still need to repeat those lines every time we want to count a list of numbers. This isn’t ideal if we have a problem where we need to do this a lot.

We can create a function which contains all of that logic, giving it a name, and just call that function whenever we want to do that task. For an explanation of how this is done, please read Defining Functions.

Note

Expected reading time: 5 minutes

Now, to take what you just read and apply it here, we define the input to the function, a list of numbers, and the output of the function, a list of counts.

def count_numbers(number_series):
    counts = []
    for i in range(max(number_series) + 1):
        counts.append(0)
    for j in number_series:
        counts[j] += 1
    return counts

Note that the variables inside the function don’t refer to the variables from previous examples, even if they share the same name. This is because they have different scopes. It is possible for names within an inner scope to reference variables from an outer scope, but this should be used sparingly, as this introduces a logical dependence between the two pieces of code which is not easy to see.
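The scoping rules described above can be sketched as follows. The outer names counts and offset, and the helper add_offset, are hypothetical and exist only to show the two behaviours.

```python
counts = "outer value"
offset = 10

def count_numbers(number_series):
    # this `counts` is a new local name; the outer `counts` is untouched
    counts = []
    for i in range(max(number_series) + 1):
        counts.append(0)
    for j in number_series:
        counts[j] += 1
    return counts

def add_offset(x):
    # `offset` is not defined here, so Python looks it up in the outer scope
    return x + offset

print(count_numbers([0, 1, 1, 2]))  # [1, 2, 1]
print(counts)                       # outer value
print(add_offset(5))                # 15
```

The second function works, but its result silently depends on a name defined elsewhere, which is the hidden dependence the paragraph warns about.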

We can show our function works by calling it on our list of numbers and seeing it gives the same answer:

>>> count_numbers(numbers)
[1, 5, 2, 2, 3, 2]

For more on defining functions, including fancier ways of passing arguments, see More on Defining Functions.

More On Types and Iteration¶

I’ve been saying this word “iterate” a lot. It comes from the Latin “iterum” meaning “again”, and in mathematics and computer science it is used when we want to repeat a process again and again. A for-loop is one form of iteration.

In Python, types can define how they are iterated over. As we saw before, list objects are iterated over in order, and the same goes for str and tuple, even though these types are meant to represent different things. This is because these types are all examples of Sequence types or more specifically, these types all implement the Sequence interface. A Sequence supports the following operations:

# Item Getting
>>> sequence[i]
somevalue

# Testing for Membership
>>> x in sequence
True/False

# Size-able
>>> len(sequence)
integer

Because of the first and third properties, iteration over a Sequence is implicitly equivalent to:

def iterate(sequence):
    for i in range(len(sequence)):
        yield sequence[i]

You can read more about this interface at Sequence Types.
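As a concrete check of the three Sequence operations, the sketch below applies each of them to a list, a tuple, and a str. The results list is only there to collect what was observed.

```python
results = []
for sequence in (["a", "b", "c"], ("a", "b", "c"), "abc"):
    second_item = sequence[1]      # item getting
    has_b = "b" in sequence        # membership testing
    size = len(sequence)           # size
    results.append((type(sequence).__name__, second_item, has_b, size))
    print(results[-1])
```

All three types answer the same questions in the same way, even though they represent different things.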

Because Python is not statically typed, we can use those Sequence types interchangeably.

>>> for c in ["a", "b", "c"]:
...    print(c)
a
b
c
>>> # The sequence of values inside parentheses defines a tuple
>>> for c in ("a", "b", "c"):
...    print(c)
a
b
c
>>> for c in "abc":
...    print(c)
a
b
c

There are many other types in Python which support iteration, such as set, dict, and file, but iteration over objects of these types is not always the same.

Warning

No matter the type of the object you’re iterating over, you should not modify the object while iterating. Some builtin types, such as dict and set, will raise an error if their size changes during iteration, while others, such as list, will not complain but may silently skip or repeat items. If you need to modify the object you’re iterating over, first make a copy of the object, and then iterate over the copy and modify the original as needed.
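A minimal sketch of the copy-then-modify advice, using a dict. The dictionaries scratch and lookup and their contents are made up for this example.

```python
scratch = {"a": 1, "b": 2}
raised = False
try:
    for key in scratch:
        # adding a key changes the dict's size mid-iteration
        scratch[key.upper()] = 0
except RuntimeError as err:
    raised = True
    print("RuntimeError:", err)

lookup = {"a": 1, "b": 2, "c": 3}
# list(lookup) makes a copy of the keys, so the original can be modified
for key in list(lookup):
    if lookup[key] % 2 == 1:
        del lookup[key]
print(lookup)  # {'b': 2}
```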

Whenever you use a for loop, Python implicitly calls the iter() function on the object being iterated over. iter() returns an Iterator for the object being iterated. Iterator objects can be used to retrieve successive items from the thing they iterate over using the next() function. If next() is called and no new data are available, a StopIteration exception will be raised. The for loop automatically handles the exception, but if you’re calling next() directly, you’ll need to be prepared to handle the exception yourself.
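Here is roughly what a for loop does behind the scenes, written out by hand. The list letters is just sample data, and the final call shows the optional default argument of next(), which avoids the exception.

```python
letters = ["a", "b"]
iterator = iter(letters)

print(next(iterator))  # a
print(next(iterator))  # b

exhausted = False
try:
    next(iterator)
except StopIteration:
    # nothing is left, which is the signal a for loop uses to stop
    exhausted = True
    print("no more items")

# next() accepts a default to return instead of raising StopIteration
fallback = next(iterator, None)
print(fallback)  # None
```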

Dictionaries¶

Note

Dictionaries are very important. Make sure you try out some of this code if you’re unfamiliar with them!

Dictionaries, or dict as they’re written in Python, are incredibly powerful data structures. They allow you to associate “key” objects with “value” objects, forming key-value pairs. This lets you create names for values, flexibly relate arbitrary data together and make it easy to locate information without complicated indexing schemes.

Looking up a key usually involves a hash function, and requires that the “key” objects be Hashable and comparable. Usually, this requires that the “key” be immutable, a property that str, int, float, and tuple possess.

>>> lookup = dict()
# alternative syntax for a dictionary literal
>>> lookup = {}
# set the value of the key "string key" to the value "green eggs and spam"
>>> lookup["string key"] = "green eggs and spam"
>>> print(lookup)
{'string key': 'green eggs and spam'}
# set the value of the key 55 to the value [1, 2, 3]
>>> lookup[55] = [1, 2, 3]
# Get the value associated with the key 55
>>> lookup[55]
[1, 2, 3]
# mutable objects like lists cannot be keys
>>> lookup[[4, 5]] = 5
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
>>>
# tuples are immutable, and can be keys
>>> lookup[(4, 5)] = 5
# getting a key with square brackets that doesn't exist raises
# an error
>>> lookup["not a key"]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'not a key'
>>>
# using the `get` method of a dict will return None if the key
# is missing
>>> lookup.get("not a key") == None
True
# the second argument to `get` is an optional default value to
# return instead
# of `None` when the key is missing
>>> lookup.get("not a key", 42) == 42
True
# membership testing checks whether a value is a key in
# this dictionary
>>> (4, 5) in lookup
True
# iterating over a dictionary yields its keys
>>> g = iter(lookup)
>>> list(g)
['string key', 55, (4, 5)]
# the `keys` method returns a "view" of the keys that can be
# iterated over
>>> lookup.keys()
dict_keys(['string key', 55, (4, 5)])
# the `values` method returns a "view" of the values that can
# be iterated over
>>> lookup.values()
dict_values(['green eggs and spam', [1, 2, 3], 5])
>>>
# the `items` method returns a view over the key-value pairs.
# This is a very common way to iterate over a dictionary!
>>> lookup.items()
dict_items([('string key', 'green eggs and spam'), (55, [1, 2, 3]), ((4, 5), 5)])

Dictionaries go by many names in other languages, like “associative array”, “hash”, “hash table” or “map”.

More Data Structures¶

Python’s builtin data structures are one of its strengths. They are the building blocks of any program, and understanding what they can do lets you choose the right tool for each job you encounter.

More on Lists
Tuples and Sequences
Strings and Their Methods. There are many string methods with niche uses. The important ones are:
1. endswith
2. format
3. replace
4. split
5. startswith
6. strip
7. encode. This method touches on the sticky subject of str vs bytes, something you don’t need to know much about just yet.
Sets
Dictionaries

There are also many more examples of how they’re used in Looping Techniques.

Note

Expected Reading Time: 15 minutes

File I/O¶

Data File: “data.txt”

There comes a day in every programmer’s life where they simply have to ask for information from the outside world, and that usually takes the form of reading files in, and then writing a file back out to tell the world what they’ve done.

In Python, you open a file using open(path, mode), where path refers to the path to the file you wish to open, and mode refers to how you want the file to be opened, as specified by a common string pattern:

“r”: open for reading in text mode. Error if the file does not exist. This is the default mode.
“w”: open for writing in text mode. Destroys any existing content
“a”: open for appending in text mode. All new content is written to the end of the existing content

If a “b” is appended to the mode string, it means that the file is opened in binary mode, which causes all of the methods we’ll cover to return bytes objects instead of str objects. For now, we’ll ignore this and only deal with text mode.
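The modes above can be tried on a scratch file. This sketch assumes a temporary directory is acceptable for the demonstration; the file name modes.txt is hypothetical.

```python
import os
import tempfile

# a scratch file in a fresh temporary directory (hypothetical name)
path = os.path.join(tempfile.mkdtemp(), "modes.txt")

f = open(path, "w")
f.write("draft\n")
f.close()

# "w" again: the earlier content is destroyed
f = open(path, "w")
f.write("first\n")
f.close()

# "a": new content goes after the existing content
f = open(path, "a")
f.write("second\n")
f.close()

f = open(path, "r")
text = f.read()
f.close()
print(repr(text))  # 'first\nsecond\n'

# adding "b" switches to binary mode, so read() returns bytes
f = open(path, "rb")
raw = f.read()
f.close()
print(type(text).__name__, type(raw).__name__)  # str bytes
```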

>>> f = open("data.txt", 'r')
>>> contents = f.read()
>>> print(contents)
Alice Low bandwidth caused a bottleneck in networking
John  Excess fruit consumption suspected to cause cancer
Bob   Failure to communicate induced by person standing inbetween self and recipient
>>> f.close()

Whenever you open a file, it is important to remember to close it when you are done. It’s more important when writing to a file because some or all of the content you wrote to the file may not actually be written out until the file is closed. If your program structure lets you, it’s better to open a file using a with block (shown below).

Files can also be iterated over, yielding lines sequentially.

>>> with open("data.txt", "r") as f:
...     for line in f:
...         # There will be an extra blank line between lines as both the
...         # newline added by print() and the one at the end of the line
...         # are shown
...         print(line)
Alice Low bandwidth caused a bottleneck in networking

John  Excess fruit consumption suspected to cause cancer

Bob   Failure to communicate induced by person standing inbetween self and recipient

>>> # f.close() is called as soon as the with block is over

File objects are their own Iterators, so you can call next() directly on them to retrieve successive lines, or more explicitly, you can call the readline() method.

>>> with open("data.txt", "r") as f:
...     print(next(f))
...     print(f.readline())
...     print(next(f))
Alice Low bandwidth caused a bottleneck in networking

John  Excess fruit consumption suspected to cause cancer

Bob   Failure to communicate induced by person standing inbetween self and recipient

These methods can be used even while looping over the file using a for loop to work with more than one line at a time, though care must be used when calling next() repeatedly.
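As a sketch of working with two lines at a time, the example below uses io.StringIO as a stand-in for an open text file. The two-line record layout is invented for illustration and is not the format of data.txt. Passing a default to next() is one way to take the care mentioned above, since it avoids a StopIteration if the file ends mid-record.

```python
import io

# io.StringIO stands in for an open text file here (made-up contents)
f = io.StringIO("name: Alice\nscore: 3\nname: Bob\nscore: 5\n")

pairs = []
for line in f:
    # the for loop supplied the first line of the record;
    # pull the second line ourselves, or "" if the file has ended
    following = next(f, "")
    pairs.append((line.strip(), following.strip()))

print(pairs)
```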

We can combine some of the things we’ve learned to do things with the contents of this file. The first thing we can do is to just recapitulate the element counting task we did earlier on a list of numbers using the characters for each line of this file. Instead of using a list to store the counts, we need to use something that can connect elements of a str to a number, so we’ll use dict (there’s a better subclass of dict that we could use in the collections module, but that’s a story for another time).

Let’s also associate the count with the person’s name at the start of the line, omitting it from the count.

>>> counters_per_line = {}
>>> with open('data.txt') as f:
...     for line in f:
...         # split the line on spaces to separate the name
...         tokens = line.split(" ")
...         name = tokens[0]
...         counter = {}
...         # slice from the 1st element forward to skip the name
...         for token in tokens[1:]:
...             # iterate over the individual letters
...             for c in token:
...                 # retrieve the current count or 0 if missing
...                 current = counter.get(c, 0)
...                 # store the updated value
...                 counter[c] = current + 1
...         # store the counts for this line under the associated name
...         counters_per_line[name] = counter
...
>>> print(counters_per_line)
{'Bob': {'a': 4, 'c': 4, 'b': 2, 'e': 10, 'd': 4, 'g': 1, 'F': 1, 'i': 7, 'f': 1, 'm': 2, 'l': 2, 'o': 3, 'n': 9, 'p': 2, 's': 3, 'r': 3, 'u': 3, 't': 5, 'w': 1, 'y': 1}, 'John': {'a': 2, 'c': 6, 'E': 1, 'd': 1, 'f': 1, 'i': 2, '\n': 1, 'm': 1, 'o': 3, 'n': 3, 'p': 2, 's': 6, 'r': 2, 'u': 4, 't': 4, 'x': 1, 'e': 5}, 'Alice': {'a': 3, 'c': 2, 'b': 2, 'e': 4, 'd': 3, 'g': 1, 'i': 3, 'h': 1, 'k': 2, '\n': 1, 'L': 1, 'o': 3, 'n': 5, 's': 1, 'r': 1, 'u': 1, 't': 4, 'w': 3, 'l': 1}}
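The text above mentions a dict subclass in the collections module without naming it. One candidate is collections.Counter, and that identification is an assumption here. A minimal sketch of the same per-line counting for the first line of data.txt might look like this:

```python
from collections import Counter

line = "Alice Low bandwidth caused a bottleneck in networking\n"

tokens = line.split(" ")
name = tokens[0]

counter = Counter()
for token in tokens[1:]:
    # update() with a str counts each of its characters
    counter.update(token)

print(name, counter["n"], counter["\n"])  # Alice 5 1
# keys that were never counted read as 0 instead of raising KeyError
print(counter["z"])  # 0
```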

Now that we have these letter counts for each line, we can write them out to a new file organized in a meaningful way. Let’s define the format to be:

<name>\n
\t<letter>:<count>\n
...

To write content to a file opened for writing text, we use the file.write() method, which takes a str as an argument. write doesn’t assume you’re passing it a complete line, so you’ll need to include the newline character \n yourself when you’re done with a line.

>>> with open("output.txt", 'w') as f:
...     for name, counts in counters_per_line.items():
...         f.write(name + "\n")
...         for letter, count in counts.items():
...             f.write("\t" + letter + ":")
...             # convert count from an int into a str so it can be written
...             f.write(str(count))
...             f.write("\n")
...             # or alternatively, replace the three writes above with a
...             # single format string that does everything in one go:
...             # f.write("\t{letter}:{count}\n".format(letter=letter, count=count))
...         f.write("\n")

The contents of “output.txt” will be:

Bob
    a:4
    c:4
    b:2
    e:10
    d:4
    g:1
    F:1
    i:7
    f:1
    m:2
    l:2
    o:3
    n:9
    p:2
    s:3
    r:3
    u:3
    t:5
    w:1
    y:1

John
    a:2
    c:6
    E:1
    d:1
    f:1
    i:2

:1
    m:1
    o:3
    n:3
    p:2
    s:6
    r:2
    u:4
    t:4
    x:1
    e:5

Alice
    a:3
    c:2
    b:2
    e:4
    d:3
    g:1
    i:3
    h:1
    k:2

:1
    L:1
    o:3
    n:5
    s:1
    r:1
    u:1
    t:4
    w:3
    l:1

This is exactly what we said to do, but after seeing the results, we can see a few things that may not make sense. The first is that there are blank lines followed by a line starting in the wrong place without a letter and just a “:1”. This is because the newline character was also counted. We could omit the newlines by checking for them in the inner-most loop. The second thing is that the uppercase and lowercase letters are counted separately. Supposing we don’t want this, we would need to redo the counting process to fix it.

The revised counting code would look like this, after we wrap it up in a function:

example3.py¶

def count_letters_per_line(line_file):
    counters_per_line = {}
    for line in line_file:

        # remove the trailing newline and
        # split the line on spaces to separate the name
        tokens = line.strip().split(" ")

        name = tokens[0]
        counter = {}
        # slice from the 1st element forward to skip the name
        for token in tokens[1:]:
            for c in token:

                # force the character to be lowercase
                c = c.lower()

                # retrieve the current count or 0 if missing
                current = counter.get(c, 0)
                counter[c] = current + 1
        counters_per_line[name] = counter
    return counters_per_line

def write_output(counters, result_file):
    for name, counts in counters.items():
        result_file.write(name + "\n")
        for letter, count in counts.items():
            result_file.write("\t{letter}:{count}\n".format(letter=letter, count=count))
        result_file.write("\n")

If you want to know more about reading and writing files, please see Reading and Writing Files.

Modules¶

Since we’ve organized these into functions, where input and output are defined by whoever calls them, we don’t need them to share the same scope as the input we want to call them with. We can move them into a “module”, which is the word for a file which contains Python code that will be “imported” and used elsewhere. To do this, we just create a file, let’s call it “line_parser.py” and put the code for these functions in it and save the file.

Now, back in our interactive session, we can just “import” the module by name. This creates a new module object which just provides all the names defined inside it as attributes. We can call those functions defined within using the same attribute access notation:

>>> import line_parser
>>> with open("data.txt") as f:
...     counts = line_parser.count_letters_per_line(f)
...
>>> counts.keys()
dict_keys(['Alice', 'John', 'Bob'])
>>> with open("output.txt", 'w') as f:
...     line_parser.write_output(counts, f)
...
>>> print(open("output.txt").read())
Alice
        l:2
        o:3
        w:3
        b:2
        a:3
        n:5
        d:3
        i:3
        t:4
        h:1
        c:2
        u:1
        s:1
        e:4
        k:2
        r:1
        g:1

John
        e:6
        x:1
        c:6
        s:6
        f:1
        r:2
        u:4
        i:2
        t:4
        o:3
        n:3
        m:1
        p:2
        d:1
        a:2

Bob
        f:2
        a:4
        i:7
        l:2
        u:3
        r:3
        e:10
        t:5
        o:3
        c:4
        m:2
        n:9
        d:4
        b:2
        y:1
        p:2
        s:3
        g:1
        w:1

Executing Scripts with Arguments from the Command Line¶

Once we make a program that can take arguments, we might want to run the program using user-provided data without modifying the program from run to run. To do this, we can read the arguments from the command line using the same patterns you saw in workshop 0.

In order to access the command line arguments in Python, we need to import another module from the Python standard library, called sys. The sys module contains lots of functions related to the Python runtime, but the feature we’re interested in is sys.argv, which is a list of the command line arguments.

If we create an example file called cli.py with the contents:

cli.py¶

import sys

print(sys.argv)

and then run

$ python cli.py arg1 arg2 arg3

We’ll see the output:

['cli.py', 'arg1', 'arg2', 'arg3']

sys.argv[0] is always the name of the script, and successive arguments are those parsed by the shell.
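Because sys.argv is an ordinary list, a script can check its length before indexing into it. The helper parse_args below is hypothetical and not part of the workshop code; it is a sketch of one way to guard against missing arguments, taking the list as a parameter so it can be tried without the command line.

```python
def parse_args(argv):
    """Return (inputfile, outputfile), or None if the argument count is wrong."""
    # argv[0] is the script name, so two user arguments means a length of 3
    if len(argv) != 3:
        return None
    return argv[1], argv[2]

print(parse_args(["script.py", "data.txt", "output.txt"]))  # ('data.txt', 'output.txt')
print(parse_args(["script.py"]))                            # None
```

In a real script, the call would be parse_args(sys.argv) after import sys.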

To wire together our line parsing code with command line arguments, we’ll create a program of the form

$ <program> <inputfile> <outputfile>

parse_lines.py¶

import sys
import line_parser

inputfile = sys.argv[1]
outputfile = sys.argv[2]

with open(inputfile) as f:
    counts = line_parser.count_letters_per_line(f)

with open(outputfile, 'w') as f:
    line_parser.write_output(counts, f)

Now, we can run

$ python parse_lines.py data.txt output.txt
$ cat output.txt

and we’ll see the same letter counts that were printed in the Modules section.

Other File objects¶

If we wanted to always dump the output to STDOUT, we wouldn’t need to rewrite line_parser; we’d just pass a different file. sys.stdout is a file-like object, in that it supports all of the methods a file does, and it is opened in text mode. All text sent to the terminal from the program is written to it to reach the screen. If we pass it to line_parser.write_output(), the output will be displayed directly without any extra effort.

parse_lines2.py¶

import sys
import line_parser

inputfile = sys.argv[1]

with open(inputfile) as f:
    counts = line_parser.count_letters_per_line(f)

line_parser.write_output(counts, sys.stdout)

and run

$ python parse_lines2.py data.txt

and we’ll receive the same output without needing to run cat or use print.

This is useful because lots of other sources can provide us with streams of text or binary data other than just plain old files on the hard drive.
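One such source from the standard library is io.StringIO, an in-memory text stream. The sketch below uses a hypothetical function, write_greeting, in place of line_parser.write_output to show that code written against the write() method does not care what kind of file-like object it receives.

```python
import io
import sys

def write_greeting(result_file):
    # only relies on the object having a write() method
    result_file.write("hello\n")

# an in-memory "file": nothing touches the disk
buffer = io.StringIO()
write_greeting(buffer)
captured = buffer.getvalue()
print(repr(captured))  # 'hello\n'

# the same function can write straight to the terminal
write_greeting(sys.stdout)
```

Capturing output in memory like this can be handy for checking what a function wrote without creating a file.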

More Parsing¶

Below is some code to parse the output format we created in the previous section. It exercises some more methods of builtin types you may have read about, and shows you how to build a stateful file parser.

reparse.py¶

def reparse(input_file):
    # dict to hold all sections going from name -> section
    sections = {}
    # dict to hold all the letter counts from the current section
    current_section = {}
    # the name of the current section
    current_name = None
    for line in input_file:
        # strip only right side white space as the left side matters
        line = line.rstrip()
        # a line starting with tab must mean we're on a line
        # with a count. We assume no sane name starts
        # with a tab character, of course.
        if line.startswith("\t"):
            # split on the : character to separate the number from
            # the letter
            tab_letter, number = line.split(":")
            # convert number str to int
            number = int(number)
            # remove the leading tab character from the letter
            letter = tab_letter.replace("\t", "")
            # store this letter-count pair in the current section
            # dict
            current_section[letter] = number
        # a blank line means we've finished a section
        elif line == "":
            # idiom: never compare to None with == or !=, use identity
            # testing with is and is not
            if current_name is not None:
                # save the current section by its name
                # and prepare for a new section
                sections[current_name] = current_section
                current_section = {}
                current_name = None
        else:
            # we must have arrived at a new name
            current_name = line
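    # save the final section if the file did not end with a blank line
    if current_name is not None:
        sections[current_name] = current_section
    return sections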

This code might be used as a template for work you’ll do during the workshop activities.