Python clinic day 1: Text processing¶

Na-Rae Han (naraehan@pitt.edu), 2017-07-12, Pittsburgh NEH Institute “Make Your Edition”

Preparation¶

Data¶

This tutorial is found on https://pittsburgh-neh-institute.github.io/Institute-Materials-2017/schedule/week_1/week_1_day_3_plan.html
Download and unzip the “C-Span Inaugural Address Corpus”, available on NLTK’s corpora page: http://www.nltk.org/nltk_data/
Place the unzipped inaugural folder on your desktop

Jupyter tips¶

Click + to create a new cell, ► to run
Alt+ENTER to run cell, create a new cell below
Shift+ENTER to run cell, go to next cell

More on https://www.cheatography.com/weidadeyue/cheat-sheets/jupyter-notebook/

The very basics¶

First code¶

Printing a string, using print().

print("hello, world!")

Strings¶

String type objects are enclosed in quotation marks (" or ').
+ is a concatenation operator.
Below, greet is a variable name assigned to a string value.
Here we are not calling print(); instead, the notebook displays the value of the last expression in the cell.

greet = "Hello, world!"
greet = greet + " I come in peace."
greet

String methods such as .upper() and .lower() transform a string.
Rather than changing the original string, these methods return a new string value.

greet.upper()
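
A short sketch of the point above: the method call returns a new string and leaves the original untouched. The variable name shouted is only for illustration.

```python
greet = "Hello, world!"
shouted = greet.upper()   # .upper() returns a new string

print(shouted)   # HELLO, WORLD!
print(greet)     # Hello, world!  (the original is unchanged)

# To keep the transformed value, assign it back to a variable
greet = greet.lower()
print(greet)     # hello, world!
```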

Some string methods return a boolean value (True/False)

# try .isupper(), .isalnum(), .startswith('he')
'hello'.islower()
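
The methods suggested in the comment can be tried side by side. This is a small sketch; the variable name word is only for illustration.

```python
word = 'hello'

print(word.islower())          # True: all letters are lowercase
print(word.isupper())          # False
print(word.isalnum())          # True: only letters and digits
print('hello!'.isalnum())      # False: '!' is not alphanumeric
print(word.startswith('he'))   # True
```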

len() returns the length of a string, that is, its number of characters.

len(greet)

The in operator tests whether one string is a substring of another.

'he' in 'hello'
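
One thing worth trying here: the substring test is case-sensitive, and spaces count as characters. A minimal sketch:

```python
print('he' in 'hello')            # True
print('He' in 'hello')            # False: 'H' and 'h' are different characters
print('he' in 'Hello'.lower())    # True once the string is lowercased
print('lo w' in 'hello world')    # True: the space is part of the substring
```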

Numbers¶

Integers and floats are written without quotes.
You can use arithmetic operators such as +, -, * and / with numbers.

num1 = 5678
num2 = 3.141592
result = num1 / num2
print(num1, "divided by", num2, "is", result)  # can print multiple things!
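
A few more operators may be handy alongside the four above. The sketch below assumes Python 3, where / always returns a float; the variable names are only for illustration.

```python
quotient = 7 / 2      # true division, always a float in Python 3
floored = 7 // 2      # floor division, drops the fractional part
remainder = 7 % 2     # remainder after division
power = 2 ** 10       # exponentiation

print(quotient)    # 3.5
print(floored)     # 3
print(remainder)   # 1
print(power)       # 1024

# round() trims a float to the given number of decimal places
print(round(5678 / 3.141592, 2))
```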

Lists¶

Lists are enclosed in [ ], with elements separated by commas. Lists can contain strings, numbers, and more.
As with strings, you can use len() to get the size of a list.
As with strings, you can use in to see whether an element is in a list.

li = ['red', 'blue', 'green', 'black', 'white', 'pink']
len(li)

# Try logical operators not, and, or
'mauve' in li
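
The logical operators mentioned in the comment combine or negate membership tests. A brief sketch using the same list:

```python
li = ['red', 'blue', 'green', 'black', 'white', 'pink']

print('mauve' in li)                     # False
print('mauve' not in li)                 # True
print('red' in li and 'mauve' in li)     # False: both must be True
print('red' in li or 'mauve' in li)      # True: one is enough
```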

A list can be indexed through li[i]. Python indexes start at 0.
A list can be sliced: li[3:5] returns a sub-list beginning with index 3 up to and not including index 5.

# Try [0], [2], [-1], [3:5], [3:], [:5]
li[4]
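
The indexes and slices suggested in the comment work out as follows. Negative indexes count from the end, and an omitted slice boundary means "from the start" or "to the end".

```python
li = ['red', 'blue', 'green', 'black', 'white', 'pink']

print(li[0])     # red    (first element)
print(li[-1])    # pink   (last element)
print(li[3:5])   # ['black', 'white']
print(li[3:])    # ['black', 'white', 'pink']
print(li[:5])    # ['red', 'blue', 'green', 'black', 'white']
```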

`for` loop¶

Using a for loop, you can loop through a list of items, applying the same set of operations to each element.
The embedded code block is marked with indentation.

for x in li:
    print(x, "is", len(x), "characters long.")
print("Done!")
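
To see how indentation decides what belongs to the loop, here is a sketch with two lines in the loop body and a running total. The variable name total is only for illustration.

```python
li = ['red', 'blue', 'green', 'black', 'white', 'pink']

total = 0
for x in li:
    # Both indented lines run once for every element
    total = total + len(x)
    print(x, "brings the total to", total)

# Back at the left margin: this runs once, after the loop is finished
print("Total characters:", total)   # Total characters: 26
```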

List comprehension¶

List comprehension builds a new list from an existing list.
You can filter to include only certain elements, and you can apply transformations in the process.
Try: .upper(), len(), +'ish'

# filter
[x for x in li if x.endswith('e')]

# transform
[x+'ish' for x in li]

# filter and transform
[x.upper() for x in li if len(x)>=5]
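
It may help to see that a list comprehension is a compact form of a for loop that appends to a new list. The sketch below spells out the "filter and transform" example the long way; result is a name chosen for illustration.

```python
li = ['red', 'blue', 'green', 'black', 'white', 'pink']

result = []
for x in li:
    if len(x) >= 5:                 # filter
        result.append(x.upper())    # transform, then collect

print(result)   # ['GREEN', 'BLACK', 'WHITE']

# The comprehension builds the same list in one line
print(result == [x.upper() for x in li if len(x) >= 5])   # True
```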

Dictionaries¶

Dictionaries hold key:value mappings.
len() on a dictionary returns the number of keys.
Looping over a dictionary means looping over its keys.

di = {'Homer':35, 'Marge':35, 'Bart':10, 'Lisa':8}
di['Lisa']

# 20 years-old or younger. x is bound to keys. 
[x for x in di if di[x] <= 20]

len(di)
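
A few more dictionary operations, sketched with the same data: looping over keys, adding a new entry, and sorting keys by their values. The 'Maggie' entry is added only for illustration, and the sorted order shown assumes Python 3.7 or later, where dictionaries keep insertion order.

```python
di = {'Homer':35, 'Marge':35, 'Bart':10, 'Lisa':8}

# Looping over a dictionary visits its keys
for name in di:
    print(name, "is", di[name])

# Assigning to a new key adds an entry
di['Maggie'] = 1
print(len(di))          # 5
print('Maggie' in di)   # True: in tests the keys

# Keys sorted by their values, youngest first
by_age = sorted(di, key=di.get)
print(by_age)   # ['Maggie', 'Lisa', 'Bart', 'Homer', 'Marge']
```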

Using NLTK¶

NLTK (Natural Language Toolkit) is an external library; you must import it first.

import nltk

Let's first download some data files. In the downloader window, select Models > punkt > Download.
If the server is overloaded, download this punkt.zip file and unzip it as ~/nltk_data/tokenizers/punkt

nltk.download()

# Tokenizing function: turns a text (a single string) into a list of word & symbol tokens
nltk.word_tokenize(greet)

help(nltk.word_tokenize)

sent = "You haven't seen Star Wars...?"
nltk.word_tokenize(sent)

nltk.FreqDist() is another useful NLTK function.
It builds a frequency-count dictionary from a list.

# First "Rose" is capitalized. How to lowercase? 
sent = 'Rose is a rose is a rose is a rose.'
toks = nltk.word_tokenize(sent)
print(toks)

freq = nltk.FreqDist(toks)
freq

freq.most_common(3)

freq['rose']

len(freq)
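
One possible answer to the lowercasing question above is a list comprehension applied before counting. The sketch below is not the NLTK code itself: so that it runs without NLTK installed, it uses the standard library's collections.Counter (the class that nltk.FreqDist is built on) and a hand-written token list standing in for the tokenizer's output. The same lowercasing step should carry over to nltk.FreqDist.

```python
from collections import Counter

# Hand-written stand-in for the output of nltk.word_tokenize(sent)
toks = ['Rose', 'is', 'a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose', '.']

# Lowercase every token before counting
lower_toks = [t.lower() for t in toks]

freq = Counter(lower_toks)
print(freq['rose'])    # 4: 'Rose' and 'rose' are now counted together
print(len(freq))       # 4 distinct tokens: 'rose', 'is', 'a', '.'
```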

Processing a single text file¶

Reading in a text file¶

open(filename).read() opens a text file and reads in the content as a single continuous string.

myfile = 'C:/Users/narae/Desktop/inaugural/1789-Washington.txt'  # Use your own userid; Mac users should omit C:
wtxt = open(myfile).read()
print(wtxt)
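
A slightly more careful way to read a file is a with block, which closes the file automatically. In the sketch below, the temporary file, its contents, and the utf-8 encoding are all assumptions made so the example can run anywhere; substitute your own path to the inaugural folder.

```python
import os
import tempfile

# Hypothetical stand-in file, created only so this sketch is runnable
tmpdir = tempfile.mkdtemp()
myfile = os.path.join(tmpdir, 'sample.txt')
with open(myfile, 'w', encoding='utf-8') as f:
    f.write("This is a small sample file.\nIt has two lines.\n")

# Read the whole file into one string; the file is closed when the block ends
with open(myfile, encoding='utf-8') as f:
    txt = f.read()

print(txt)
print(len(txt))   # number of characters, including line breaks
```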

len(wtxt)     # Number of characters in text

'fellow citizens' in wtxt  # phrase as a substring. try "Americans"

'th' in wtxt
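
Note that a substring test also matches inside words, which is why 'th' is found. The sketch below contrasts it with membership in a list of words. The sample sentence is made up, and .split() is used as a rough whitespace-based stand-in for a tokenizer.

```python
txt = "I greet the other fellow citizens."

print('th' in txt)        # True: 'th' occurs inside 'the' and 'other'
print(txt.count('th'))    # 2

words = txt.split()       # split on whitespace into a list of strings
print('th' in words)      # False: no element is exactly 'th'
print('the' in words)     # True
```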

Tokenize text, compile frequency count¶

# Toggle pretty printing off/on (when on, a long list prints one item per line)
%pprint

# Tokenize text
nltk.word_tokenize(wtxt)

wtokens = nltk.word_tokenize(wtxt)
len(wtokens)     # Number of tokens (words and punctuation marks) in text

# Build a dictionary of frequency count
wfreq = nltk.FreqDist(wtokens)
wfreq['the']

wfreq['we']

len(wfreq)      # Number of unique tokens (word types) in text

wfreq.most_common(40)     # 40 most common words
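
The most common tokens in a text usually include punctuation marks. One way to leave them out, sketched below, is to filter with .isalpha() and lowercase before counting. As an assumption for runnability, the example uses collections.Counter and a short made-up token list in place of nltk.FreqDist and wtokens.

```python
from collections import Counter

# Made-up token list standing in for wtokens
tokens = ['the', ',', 'people', 'of', 'the', 'United', 'States', ',',
          'and', 'the', 'Senate', '.']

# Keep alphabetic tokens only, lowercased
words = [t.lower() for t in tokens if t.isalpha()]

freq = Counter(words)
print(len(tokens))           # 12 tokens, punctuation included
print(len(words))            # 9 word tokens
print(len(freq))             # 7 distinct words
print(freq.most_common(1))   # [('the', 3)]
```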

More tomorrow¶

How long are Washington’s sentences on average?
Which long words did he use, and how frequent were they?
Processing the entire Inaugural Address corpus
- Which inaugural speech was the longest? The shortest?
- Which presidents favored long sentences?

All answered in Python Clinic Day 2: Corpus Processing
