Workshop 1. Introduction to Python¶
Running Python Code¶
Python can be used both to write executable programs and interactively. To start an interactive Python interpreter from the shell, run:
$ python
Python 3.6.1 |Continuum Analytics, Inc.| (default, May 11 2017, 13:09:58)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
We can execute statements interactively with this program, which will read the code you write, evaluate it, print the results, and loop back to the beginning. This is why this type of program is called a REPL (read-eval-print loop).
The REPL will keep the results of previous statements from that session. We can define variables and use them in later statements:
>>> a = 3
>>> b = 4
>>> a + b
7
Using this interactive prompt, we can treat Python like a simple desk calculator or as a way to write short one-off programs. To quit the interactive session, type quit() and press <Enter>.
We can also execute files containing Python code:
#!/usr/bin/env python
import math
a = 3
b = 4
# Pythagorean theorem
#     |\
#   b | \ c
#     |__\
#      a
c_squared = a ** 2 + b ** 2
print(c_squared, math.sqrt(c_squared))
$ python example1.py
25 5.0
There are other ways to run Python code, but these two methods are the ones that will be used for now.
Builtin Types¶
Python has many handy built-in types, and lets you easily define more yourself. Here, by “type”, I mean “kind of thing”, not pressing keys on a keyboard. For example, we can all agree that 12 is a number, that “cat” is a series of characters called a string, and that a cat is an animal. We know we can add numbers together, so we can add 12 to another number like 13 and get the number 25, but attempting to add two cats together is liable to be a messy, unpredictable process which computers have no definition for. Similarly, we can agree that the uppercase version of “cat” is “CAT”, but only typographers will assert that there is an upper or lower case for the number 12.
After all of this, I hope I’ve convinced you that types are useful enough for you to go read An Informal Introduction to Python, which is part of the official Python documentation. We will be referencing parts of it throughout this workshop, and it’s often the best way to learn how the language works.
Note
Expected Reading Time: 15 minutes
A Simple Program¶
Okay, now that you’ve learned more about how to play with numbers, strings, and lists, we’ll use these to build on the while loop you saw at the end. Let’s say you have a list of numbers:
>>> numbers = [1, 0, 2, 1, 3, 1, 5, 1, 2, 3, 4, 4, 4, 5, 1]
We want to count the number of times each number appears in numbers. First, we find out how many items the list holds:
>>> n = len(numbers)
You learned about the len() function in that informal introduction. It returns the number of items in a container type, which a list is. So now n contains the number of items in numbers. We can iterate over numbers using a while loop like this:
>>> i = 0
>>> while i < n:
...     j = numbers[i]
...     print(j)
...     i += 1
...
1
0
2
1
3
1
5
1
2
3
4
4
4
5
1
We can then use a list whose length is the largest number in numbers plus one to count the number of times each number was seen. We have to add one to the largest number because list indices start from zero instead of one.
>>> i = 0
>>> counts = [0, 0, 0, 0, 0, 0]
>>> while i < n:
...     j = numbers[i]
...     counts[j] += 1
...     i += 1
...
>>> print(counts)
[1, 5, 2, 2, 3, 2]
This works because j’s value is a number which is both the thing we want to count and the index of the place where we hold its count inside the counts list. This was pretty convenient, and it only works because we’re counting numbers from 0 and up without many gaps. We also had to pre-fill counts with the right size, and set all of its values to 0. We’ll try to improve this example to be more idiomatic and less contrived.
In Python, while loops are uncommon for a number of reasons. The primary reason is that if you forget to include the line with i += 1, your loop will run forever and never stop. Like many other languages, Python has for loops too, but they’re a bit different. Instead of iterating over a user-defined start and stop condition like in C-like languages, Python’s for loop iterates over an object in a type-specific fashion, which means that types control how they’re traversed. In the case of sequences like list or str, this means they’re iterated over in order, from start to end.
For more information, see 4.2 for Statements.
If we were to rewrite that counting step with a for loop, here’s what it would look like:
>>> counts = [0, 0, 0, 0, 0, 0]
>>> for j in numbers:
...     counts[j] += 1
...
>>> print(counts)
[1, 5, 2, 2, 3, 2]
This simplified the code and removed the potential for you to accidentally enter an infinite loop. We still need to pre-initialize counts, though. We’ll fix this using another loop and an if statement.
if Statements¶
An if statement works like the conditional expression of a while loop, except that its body runs at most once.
>>> a = 3
>>> b = 4
>>> c = 5
>>> if b > a:
...     print("b is greater than a")
...
b is greater than a
We can make those conditional expressions as complicated as we want, using boolean operators like and and or to combine them. We can also specify an else block, which runs when the condition is false:
>>> if b == (a ** 2 - 1) / 2 and c == b + 1:
...     print("a, b, and c are Pythagorean triples!")
... else:
...     print("crude failures of numbers")
...
a, b, and c are Pythagorean triples!
>>> b = 6
>>> if b == (a ** 2 - 1) / 2 and c == b + 1:
...     print("a, b, and c are Pythagorean triples!")
... else:
...     print("crude failures of numbers")
...
crude failures of numbers
You can read more about if statements at 4.1 if Statements and the boolean operators at 6.11 Boolean operations.
Simplifying Solution¶
Now, to solve our problem of pre-initializing counts, we can grow the list with zeros whenever we meet a number that doesn’t have a slot yet:
>>> counts = []
>>> for j in numbers:
...     difference = j - len(counts) + 1
...     if difference > 0:
...         for i in range(difference):
...             counts.append(0)
...
Here, the range function returns an object which, when iterated over, produces numbers from 0 up to the first argument (non-inclusive). Alternatively, if given two arguments, it will produce numbers starting from the first argument up to, but not including, the second argument.
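A quick way to see what range produces is to wrap it in list(), which consumes the iterable and collects its values. The snippet below is a small illustrative sketch; the variable names are invented for this example.

```python
# range() is lazy, so wrap it in list() to see every value it produces
one_argument = list(range(4))      # starts at 0, stops before 4
two_arguments = list(range(2, 6))  # starts at 2, stops before 6

print(one_argument)   # [0, 1, 2, 3]
print(two_arguments)  # [2, 3, 4, 5]
```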
We can solve this even more simply by using another builtin function, max. max will return the largest value in an iterable object.
>>> counts = []
>>> for i in range(max(numbers) + 1):
...     counts.append(0)
...
There is another builtin function called min which does the opposite, returning the smallest value in an iterable object.
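As a small sketch of both functions side by side, the snippet below applies them to the same list used in this section and to a string, since any iterable is accepted. The variable names are chosen only for this example.

```python
numbers = [1, 0, 2, 1, 3, 1, 5, 1, 2, 3, 4, 4, 4, 5, 1]

largest = max(numbers)   # 5
smallest = min(numbers)  # 0

# both functions accept any iterable, including strings, where
# characters are compared by their code point
first_letter = min("workshop")  # 'h'

print(largest, smallest, first_letter)
```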
Now, the simplified program looks like:
counts = []
for i in range(max(numbers) + 1):
    counts.append(0)
for j in numbers:
    counts[j] += 1
print(counts)
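As an aside, the first loop can be replaced by list repetition: multiplying a list by an integer produces a new list with its contents repeated that many times. This is an alternative to the append loop, not a correction of it, and the workshop itself keeps using the loop.

```python
numbers = [1, 0, 2, 1, 3, 1, 5, 1, 2, 3, 4, 4, 4, 5, 1]

# build a list of zeros with one slot per value from 0 to max(numbers)
counts = [0] * (max(numbers) + 1)
for j in numbers:
    counts[j] += 1

print(counts)  # [1, 5, 2, 2, 3, 2]
```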
Defining Functions¶
With our simplified solution, we can now count numbers with very few lines of code, but we still need to repeat those lines every time we want to count a list of numbers. This isn’t ideal if we have a problem where we need to do this a lot.
We can create a function which contains all of that logic, giving it a name, and just call that function whenever we want to do that task. For an explanation of how this is done, please read Defining Functions.
Note
Expected reading time: 5 minutes
Now, to take what you just read and apply it here, we define the input to the function, a list of numbers, and the output of the function, a list of counts.
def count_numbers(number_series):
    counts = []
    for i in range(max(number_series) + 1):
        counts.append(0)
    for j in number_series:
        counts[j] += 1
    return counts
Note that the variables inside the function don’t refer to the variables from previous examples, even if they share the same name. This is because they have different scopes. It is possible for names within an inner scope to reference variables from an outer scope, but this should be used sparingly, as this introduces a logical dependence between the two pieces of code which is not easy to see.
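The scoping behaviour can be seen in a short sketch. The describe function and the top-level string are invented for this illustration; count_numbers is the function defined above, repeated so the snippet runs on its own.

```python
counts = "defined at the top level"


def count_numbers(number_series):
    # this `counts` is local to the function and unrelated to the outer one
    counts = []
    for i in range(max(number_series) + 1):
        counts.append(0)
    for j in number_series:
        counts[j] += 1
    return counts


def describe():
    # no local `counts` exists here, so the name is looked up in the
    # enclosing (module) scope; this is the hidden dependence to be wary of
    return counts


result = count_numbers([0, 1, 1])
print(result)      # [1, 2]
print(counts)      # the outer variable is unchanged
print(describe())  # reads the outer variable
```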
We can show our function works by calling it on our list of numbers and seeing it gives the same answer:
>>> count_numbers(numbers)
[1, 5, 2, 2, 3, 2]
For more on defining functions, including fancier ways of passing arguments, see More on Defining Functions.
More On Types and Iteration¶
I’ve been saying this word “iterate” a lot. It comes from the Latin “iterum” meaning “again”, and in mathematics and computer science it is used when we want to repeat a process again and again. A for-loop is one form of iteration.
In Python, types can define how they are iterated over. As we saw before, list objects are iterated over in order, and the same goes for str and tuple, even though these types are meant to represent different things. This is because these types are all examples of Sequence types or, more specifically, these types all implement the Sequence interface. A Sequence supports the following operations:
# Item Getting
>>> sequence[i]
somevalue
# Testing for Membership
>>> x in sequence
True/False
# Size-able
>>> len(sequence)
integer
Because of the first and third properties, iterating over a sequence is implicitly just
for i in range(len(sequence)):
    yield sequence[i]
You can read more about this interface at Sequence Types.
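To make this concrete, here is a minimal sketch of a user-defined type that supports these operations. The class name Countdown is invented for this example, and it does not inherit from any standard base class. It relies on Python’s fallback behaviour: when a class defines __getitem__ but not __iter__, a for loop calls __getitem__ with 0, 1, 2, and so on until IndexError is raised, and membership testing with in works by the same route.

```python
class Countdown:
    """A hypothetical sequence-like type holding n, n - 1, ..., 1."""

    def __init__(self, start):
        self.start = start

    def __len__(self):
        # size-able: len(countdown)
        return self.start

    def __getitem__(self, i):
        # item getting: countdown[i]
        if not 0 <= i < self.start:
            raise IndexError(i)
        return self.start - i


countdown = Countdown(3)
print(len(countdown))    # 3
print(countdown[0])      # 3
print(2 in countdown)    # True
for value in countdown:  # iterates in order, starting from index 0
    print(value)
```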
Because Python is not statically typed, we can use those Sequence types interchangeably.
>>> for c in ["a", "b", "c"]:
...     print(c)
...
a
b
c
>>> # The sequence of values inside parentheses defines a tuple
>>> for c in ("a", "b", "c"):
...     print(c)
...
a
b
c
>>> for c in "abc":
...     print(c)
...
a
b
c
There are many other types in Python which support iteration, such as set, dict, and file, but iteration over objects of these types is not always the same.
Warning
No matter the type of the object you’re iterating over, you should not modify the object while iterating. A dict or set will raise a RuntimeError if its size changes during iteration, while a list raises no error but may silently skip or repeat items. If you need to modify the object you’re iterating over, first make a copy of the object, and then iterate over the copy and modify the original as needed.
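The copy-then-modify approach described in the warning can be sketched as follows. The dictionary and list contents are made up for illustration.

```python
scores = {"alice": 3, "bob": 0, "carol": 7}

# list(scores) takes a snapshot of the keys, so deleting from the
# original dict inside the loop is safe
for name in list(scores):
    if scores[name] == 0:
        del scores[name]

print(scores)  # {'alice': 3, 'carol': 7}

# the same idea for a list: iterate over a copy, modify the original
values = [1, 2, 2, 3]
for v in values.copy():
    if v == 2:
        values.remove(v)

print(values)  # [1, 3]
```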
Whenever you use a for loop, Python implicitly calls the iter() function on the object being iterated over. iter() returns an Iterator for the object being iterated. Iterator objects can be used to retrieve successive items from the thing they iterate over using the next() function. If next() is called and no new data are available, a StopIteration exception will be raised. The for loop automatically handles the exception, but if you’re calling next() directly, you’ll need to be prepared to handle the exception yourself.
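Here is a short sketch of driving an iterator by hand, which is roughly what a for loop does behind the scenes. The list contents and variable names are illustrative only.

```python
letters = ["a", "b", "c"]

iterator = iter(letters)
print(next(iterator))  # a
print(next(iterator))  # b

# this is roughly what a for loop does with the remaining items
remaining = []
while True:
    try:
        item = next(iterator)
    except StopIteration:
        # no more items: a for loop would exit quietly here
        break
    remaining.append(item)

print(remaining)  # ['c']

# next() also accepts a default to return instead of raising
print(next(iterator, "exhausted"))  # exhausted
```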
Dictionaries¶
Note
Dictionaries are very important. Make sure you try out some of this code if you’re unfamiliar with them!
Dictionaries, or dict as they’re written in Python, are incredibly powerful data structures. They allow you to associate “key” objects with “value” objects, forming key-value pairs. This lets you create names for values, flexibly relate arbitrary data together and make it easy to locate information without complicated indexing schemes. This process usually involves a hash function, and requires that the “key” objects be Hashable and comparable. Usually, this requires that the “key” be immutable, a property that str, int, float, and tuple possess.
>>> lookup = dict()
# alternative syntax for a dictionary literal
>>> lookup = {}
# set the value of the key "string key" to the value "green eggs and spam"
>>> lookup["string key"] = "green eggs and spam"
>>> print(lookup)
{'string key': 'green eggs and spam'}
# set the value of the key 55 to the value [1, 2, 3]
>>> lookup[55] = [1, 2, 3]
# get the value associated with the key 55
>>> lookup[55]
[1, 2, 3]
# mutable objects like lists cannot be keys
>>> lookup[[4, 5]] = 5
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
# tuples are immutable, and can be keys
>>> lookup[(4, 5)] = 5
# getting a key with square brackets that doesn't exist raises
# an error
>>> lookup["not a key"]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'not a key'
# using the `get` method of a dict will return None if the key
# is missing
>>> lookup.get("not a key") is None
True
# the second argument to `get` is an optional default value to
# return instead of `None` when the key is missing
>>> lookup.get("not a key", 42) == 42
True
# membership testing checks whether a value is a key in
# this dictionary
>>> (4, 5) in lookup
True
# iterating over a dictionary yields its keys
>>> g = iter(lookup)
>>> list(g)
['string key', 55, (4, 5)]
# the `keys` method returns a "view" of the keys that can be
# iterated over
>>> lookup.keys()
dict_keys(['string key', 55, (4, 5)])
# the `values` method returns a "view" of the values that can
# be iterated over
>>> lookup.values()
dict_values(['green eggs and spam', [1, 2, 3], 5])
# the `items` method returns a view over the key-value pairs.
# This is a very common way to iterate over a dictionary!
>>> lookup.items()
dict_items([('string key', 'green eggs and spam'), (55, [1, 2, 3]), ((4, 5), 5)])
Dictionaries go by many names in other languages, like “associative array”, “hash”, “hash table” or “map”.
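Since iterating over items() is so common, here is a short sketch of that pattern inside a for loop, which the session above does not show. The dictionary below is illustrative data, not part of the workshop files.

```python
ages = {"Alice": 31, "John": 45, "Bob": 27}

# each item is a (key, value) tuple, unpacked into two names
for name, age in ages.items():
    print(name, "is", age)

# build a reversed lookup from value back to key
names_by_age = {}
for name, age in ages.items():
    names_by_age[age] = name

print(names_by_age[45])  # John
```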
More Data Structures¶
Python’s builtin data structures are one of its strengths. They are the building blocks of any program, and understanding what they can do lets you choose the right tool for each job you encounter.
- More on Lists
- Tuples and Sequences
- Strings and Their Methods. There are many string methods with niche uses. The important ones are:
- Sets
- Dictionaries
There are also many more examples of how they’re used in Looping Techniques.
Note
Expected Reading Time: 15 minutes
File I/O¶
Data File: “data.txt”
There comes a day in every programmer’s life where they simply have to ask for information from the outside world, and that usually takes the form of reading files in, and then writing a file back out to tell the world what they’ve done.
In Python, you open a file using open(path, mode), where path refers to the path to the file you wish to open, and mode refers to how you want the file to be opened, as specified by a short string:
- “r”: open for reading in text mode. Error if the file does not exist. This is the default mode.
- “w”: open for writing in text mode. Destroys any existing content.
- “a”: open for appending in text mode. All new content is written to the end of the existing content.
If a “b” is appended to the mode string, it means that the file is opened in binary mode, which causes all of the methods we’ll cover to return bytes objects instead of str objects. For now, we’ll ignore this and only deal with text mode.
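The difference between the modes can be seen in a small sketch. It writes to a throwaway file inside a temporary directory (the file name modes.txt is invented for this example) and uses with blocks, which are explained a little further on, to close each file automatically.

```python
import os
import tempfile

# work in a temporary directory so no real files are touched
with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "modes.txt")

    with open(path, "w") as f:   # creates the file
        f.write("first\n")
    with open(path, "w") as f:   # "w" again destroys the old content
        f.write("second\n")
    with open(path, "a") as f:   # "a" keeps it and adds to the end
        f.write("third\n")

    with open(path, "r") as f:
        text = f.read()
    with open(path, "rb") as f:  # binary mode returns bytes, not str
        raw = f.read()

print(repr(text))  # 'second\nthird\n'
print(type(raw))   # <class 'bytes'>
```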
>>> f = open("data.txt", 'r')
>>> contents = f.read()
>>> print(contents)
Alice Low bandwidth caused a bottleneck in networking
John Excess fruit consumption suspected to cause cancer
Bob Failure to communicate induced by person standing inbetween self and recipient
>>> f.close()
Whenever you open a file, it is important to remember to close it when you are done. This matters even more when writing to a file, because some or all of the content you wrote may not actually be written out until the file is closed. If your program structure lets you, it’s better to open a file using a with block (shown below).
Files can also be iterated over, yielding lines sequentially.
>>> with open("data.txt", "r") as f:
...     for line in f:
...         # There will be an extra blank line between lines as both the
...         # newline added by print() and the one at the end of the line
...         # are shown
...         print(line)
...
Alice Low bandwidth caused a bottleneck in networking
John Excess fruit consumption suspected to cause cancer
Bob Failure to communicate induced by person standing inbetween self and recipient
>>> # f.close() is called as soon as the with block is over
file objects are their own Iterators, so you can call next() directly on them to retrieve successive lines or, more explicitly, you can call the readline() method.
>>> with open("data.txt", "r") as f:
...     print(next(f))
...     print(f.readline())
...     print(next(f))
...
Alice Low bandwidth caused a bottleneck in networking
John Excess fruit consumption suspected to cause cancer
Bob Failure to communicate induced by person standing inbetween self and recipient
These methods can be used even while looping over the file using a for loop to work with more than one line at a time, though care must be used when calling next() repeatedly, since it raises StopIteration once the file runs out of lines.
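As a sketch of working with more than one line at a time, the loop below treats lines as header/body pairs. It uses io.StringIO, a standard-library in-memory file object, in place of a real file, and the record layout is invented for this example. Passing a default to next() avoids an unhandled StopIteration if the input ends in the middle of a pair.

```python
import io

# an in-memory stand-in for a file opened in text mode
f = io.StringIO("name\nAlice\nname\nBob\nname\n")

pairs = []
for header in f:
    # pull the following line out of the same iterator; the default ""
    # is returned if the file has no more lines
    body = next(f, "")
    pairs.append((header.strip(), body.strip()))

print(pairs)
```

The last pair has an empty body because the input ends right after its header.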
We can combine some of the things we’ve learned to do things with the contents of this file. The first thing we can do is to repeat the element counting task we did earlier on a list of numbers, this time using the characters of each line of this file. Instead of using a list to store the counts, we need to use something that can connect elements of a str to a number, so we’ll use dict (there’s a better subclass of dict that we could use in the collections module, but that’s a story for another time).
Let’s also associate the count with the person’s name at the start of the line, omitting it from the count.
>>> counters_per_line = {}
>>> with open('data.txt') as f:
...     for line in f:
...         # split the line on spaces to separate the name
...         tokens = line.split(" ")
...         name = tokens[0]
...         counter = {}
...         # slice from the 1st element forward to skip the name
...         for token in tokens[1:]:
...             # iterate over the individual letters
...             for c in token:
...                 # retrieve the current count or 0 if missing
...                 current = counter.get(c, 0)
...                 # store the updated value
...                 counter[c] = current + 1
...         # store the counts for this line under the associated name
...         counters_per_line[name] = counter
...
>>> print(counters_per_line)
{'Bob': {'a': 4, 'c': 4, 'b': 2, 'e': 10, 'd': 4, 'g': 1, 'F': 1, 'i': 7, 'f': 1, 'm': 2, 'l': 2, 'o': 3, 'n': 9, 'p': 2, 's': 3, 'r': 3, 'u': 3, 't': 5, 'w': 1, 'y': 1}, 'John': {'a': 2, 'c': 6, 'E': 1, 'd': 1, 'f': 1, 'i': 2, '\n': 1, 'm': 1, 'o': 3, 'n': 3, 'p': 2, 's': 6, 'r': 2, 'u': 4, 't': 4, 'x': 1, 'e': 5}, 'Alice': {'a': 3, 'c': 2, 'b': 2, 'e': 4, 'd': 3, 'g': 1, 'i': 3, 'h': 1, 'k': 2, '\n': 1, 'L': 1, 'o': 3, 'n': 5, 's': 1, 'r': 1, 'u': 1, 't': 4, 'w': 3, 'l': 1}}
Now that we have these letter counts for each line, we can write them out to a new file organized in a meaningful way. Let’s define the format to be:
<name>\n
\t<letter>:<count>\n
...
To write content to a file opened for writing text, we use the file.write() method, which takes a str as an argument. write doesn’t assume you’re passing it a complete line, so you’ll need to include the newline character \n yourself when you’re done with a line.
>>> with open("output.txt", 'w') as f:
...     for name, counts in counters_per_line.items():
...         f.write(name + "\n")
...         for letter, count in counts.items():
...             f.write("\t" + letter + ":")
...             # convert count from an int into a str so it can be written
...             f.write(str(count))
...             f.write("\n")
...             # or alternatively, replace the three writes above with a
...             # single format string to do everything in one go:
...             # f.write("\t{letter}:{count}\n".format(letter=letter, count=count))
...         f.write("\n")
...
The contents of “output.txt” will be:
Bob
a:4
c:4
b:2
e:10
d:4
g:1
F:1
i:7
f:1
m:2
l:2
o:3
n:9
p:2
s:3
r:3
u:3
t:5
w:1
y:1
John
a:2
c:6
E:1
d:1
f:1
i:2
:1
m:1
o:3
n:3
p:2
s:6
r:2
u:4
t:4
x:1
e:5
Alice
a:3
c:2
b:2
e:4
d:3
g:1
i:3
h:1
k:2
:1
L:1
o:3
n:5
s:1
r:1
u:1
t:4
w:3
l:1
This is exactly what we said to do, but after seeing the results, we can see a few things that may not make sense. The first is that there are blank lines followed by a line starting in the wrong place without a letter and just a “:1”. This is because the newline character was also counted. We could omit the newlines by checking for them in the inner-most loop. The second thing is that the uppercase and lowercase letters are counted separately. Supposing we don’t want this, we would need to redo the counting process to fix it.
The revised counting code would look like this, after we wrap it up in functions:
def count_letters_per_line(line_file):
    counters_per_line = {}
    for line in line_file:
        # remove the trailing newline and
        # split the line on spaces to separate the name
        tokens = line.strip().split(" ")
        name = tokens[0]
        counter = {}
        # slice from the 1st element forward to skip the name
        for token in tokens[1:]:
            for c in token:
                # force the character to be lowercase
                c = c.lower()
                # retrieve the current count or 0 if missing
                current = counter.get(c, 0)
                counter[c] = current + 1
        counters_per_line[name] = counter
    return counters_per_line

def write_output(counters, result_file):
    for name, counts in counters.items():
        result_file.write(name + "\n")
        for letter, count in counts.items():
            result_file.write("\t{letter}:{count}\n".format(letter=letter, count=count))
        result_file.write("\n")
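The earlier remark about a better dict subclass in the collections module most likely refers to something like collections.Counter; that is an assumption, since the text does not name it. A Counter does the get-or-zero bookkeeping itself. The sketch below counts the letters of one line taken from data.txt, mirroring the lowercase handling used above.

```python
from collections import Counter

line = "Alice Low bandwidth caused a bottleneck in networking\n"
tokens = line.strip().split(" ")
name = tokens[0]

# Counter accepts any iterable and counts how often each item occurs
counter = Counter("".join(tokens[1:]).lower())

print(name, counter["n"])  # Alice 5
print(counter["z"])        # 0, missing keys count as zero
```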
If you want to know more about reading and writing files, please see Reading and Writing Files.
Modules¶
Since we’ve organized these into functions, where input and output are defined by whoever calls them, we don’t need them to share the same scope as the input we want to call them with. We can move them into a “module”, which is the word for a file which contains Python code that will be “imported” and used elsewhere. To do this, we just create a file, let’s call it “line_parser.py” and put the code for these functions in it and save the file.
Now, back in our interactive session, we can just “import” the module by name. This creates a new module object which just provides all the names defined inside it as attributes. We can call those functions defined within using the same attribute access notation:
>>> import line_parser
>>> with open("data.txt") as f:
...     counts = line_parser.count_letters_per_line(f)
...
>>> counts.keys()
dict_keys(['Alice', 'John', 'Bob'])
>>> with open("output.txt", 'w') as f:
...     line_parser.write_output(counts, f)
...
>>> print(open("output.txt").read())
Alice
l:2
o:3
w:3
b:2
a:3
n:5
d:3
i:3
t:4
h:1
c:2
u:1
s:1
e:4
k:2
r:1
g:1
John
e:6
x:1
c:6
s:6
f:1
r:2
u:4
i:2
t:4
o:3
n:3
m:1
p:2
d:1
a:2
Bob
f:2
a:4
i:7
l:2
u:3
r:3
e:10
t:5
o:3
c:4
m:2
n:9
d:4
b:2
y:1
p:2
s:3
g:1
w:1
Executing Scripts with Arguments from the Command Line¶
Once we make a program that can take arguments, we might want to run the program using user-provided data without modifying the program from run to run. To do this, we can read the arguments from the command line using the same patterns you saw in workshop 0.
In order to access the command line arguments in Python, we need to import another module from the Python standard library, called sys. The sys module contains lots of functions related to the Python runtime, but the feature we’re interested in is sys.argv, which is a list of the command line arguments.
If we create an example file called cli.py with the contents:
import sys
print(sys.argv)
and then run
$ python cli.py arg1 arg2 arg3
We’ll see the output:
['cli.py', 'arg1', 'arg2', 'arg3']
sys.argv[0] is always the name of the script, and successive arguments are those parsed by the shell.
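Because sys.argv is an ordinary list, a script can check its length before indexing into it. The sketch below is one hedged way to do that; the main function and the usage message are invented for this example and are not part of the workshop code. Taking the argument list as a parameter makes the check easy to try without a shell.

```python
import sys


def main(argv):
    # argv[0] is the script name, so two user arguments means len(argv) == 3
    if len(argv) != 3:
        print("usage: {} <inputfile> <outputfile>".format(argv[0]), file=sys.stderr)
        return 1
    inputfile, outputfile = argv[1], argv[2]
    print("reading", inputfile, "and writing", outputfile)
    return 0


# simulate two invocations by passing argument lists directly
too_few = main(["cli.py", "data.txt"])
just_right = main(["cli.py", "data.txt", "output.txt"])
print(too_few, just_right)  # 1 0
```

In a real script, the last lines would instead be sys.exit(main(sys.argv)), so that the return value becomes the exit status of the program.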
To wire together our line parsing code with command line arguments, we’ll create a program of the form
$ <program> <inputfile> <outputfile>
and save it as parse_lines.py:
import sys
import line_parser
inputfile = sys.argv[1]
outputfile = sys.argv[2]
with open(inputfile) as f:
    counts = line_parser.count_letters_per_line(f)
with open(outputfile, 'w') as f:
    line_parser.write_output(counts, f)
Now, we can run
$ python parse_lines.py data.txt output.txt
$ cat output.txt
and we’ll get the result:
Alice
l:2
o:3
w:3
b:2
a:3
n:5
d:3
i:3
t:4
h:1
c:2
u:1
s:1
e:4
k:2
r:1
g:1
John
e:6
x:1
c:6
s:6
f:1
r:2
u:4
i:2
t:4
o:3
n:3
m:1
p:2
d:1
a:2
Bob
f:2
a:4
i:7
l:2
u:3
r:3
e:10
t:5
o:3
c:4
m:2
n:9
d:4
b:2
y:1
p:2
s:3
g:1
w:1
Other File objects¶
If we wanted to always dump the output to STDOUT, we don’t need to rewrite line_parser, just pass a different file. sys.stdout is a file-like object, in that it supports all of the methods that a file does, and is opened in text mode. All text sent to the terminal from the program is written to it to reach the screen. We can pass it to line_parser.write_output() and it will be displayed directly without any extra effort.
import sys
import line_parser
inputfile = sys.argv[1]
with open(inputfile) as f:
    counts = line_parser.count_letters_per_line(f)
line_parser.write_output(counts, sys.stdout)
and run
$ python parse_lines2.py data.txt
and we’ll receive the same output without needing to run cat or use print.
This is useful because lots of other sources can provide us with streams of text or binary data other than just plain old files on the hard drive.
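One such source is io.StringIO from the standard library, an in-memory text buffer that behaves like a file opened in text mode. The sketch below repeats write_output from line_parser so that it runs on its own; the counts dictionary is made-up sample data.

```python
import io
import sys


def write_output(counters, result_file):
    # same logic as line_parser.write_output, repeated so this runs alone
    for name, counts in counters.items():
        result_file.write(name + "\n")
        for letter, count in counts.items():
            result_file.write("\t{letter}:{count}\n".format(letter=letter, count=count))
        result_file.write("\n")


counts = {"Alice": {"a": 3, "b": 2}}

# an in-memory text buffer that supports write() like a real file
buffer = io.StringIO()
write_output(counts, buffer)
text = buffer.getvalue()

# the very same function can write to the terminal instead
write_output(counts, sys.stdout)
```

Capturing output in a buffer like this is also a convenient way to check what a function writes without creating files on disk.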
More Parsing¶
Below is some code to parse the output format we created in the previous section. It exercises some more methods of builtin types you may have read about, and shows you how to build a stateful file parser.
def reparse(input_file):
    # dict to hold all sections going from name -> section
    sections = {}
    # dict to hold all the letter counts from the current section
    current_section = {}
    # the name of the current section
    current_name = None
    for line in input_file:
        # strip only right side white space as the left side matters
        line = line.rstrip()
        # a line starting with tab must mean we're on a line
        # with a count. We assume no sane name starts
        # with a tab character, of course.
        if line.startswith("\t"):
            # split on the : character to separate the number from
            # the letter
            tab_letter, number = line.split(":")
            # convert number str to int
            number = int(number)
            # remove the leading tab character from the letter
            letter = tab_letter.replace("\t", "")
            # store this letter-count pair in the current section
            # dict
            current_section[letter] = number
        # a blank line means we've finished a section
        elif line == "":
            # idiom: never compare to None with == or !=, use identity
            # testing with is and is not
            if current_name is not None:
                # save the current section by its name
                # and prepare for a new section
                sections[current_name] = current_section
                current_section = {}
                current_name = None
        else:
            # we must have arrived at a new name
            current_name = line
    return sections
This code might be used as a template for work you’ll do during the workshop activities.