Python clinic day 2: Corpus processing

Na-Rae Han (naraehan@pitt.edu), 2017-07-13, Pittsburgh NEH Institute “Make Your Edition”

Preparation

Data

Jupyter tips

  • Click + to create a new cell, ► to run
  • Alt+Enter to run the cell and create a new cell below
  • Shift+Enter to run the cell and go to the next cell

More tips at https://www.cheatography.com/weidadeyue/cheat-sheets/jupyter-notebook/

Review

Processing a single text file, continued

Reading in a text file

  • Start by opening up the 1789 Washington address, using open(filename).read().
In [ ]:
myfile = 'C:/Users/narae/Desktop/inaugural/1789-Washington.txt'  # Use your own userid; Mac users should omit C:
wtxt = open(myfile).read()
print(wtxt[:500])

Tokenize text, compile frequency count

In [ ]:
import nltk    # Don't forget to import nltk
%pprint    # Toggle pretty printing off (pretty-printed output takes up too many lines)
In [ ]:
# Build a token list
wtokens = nltk.word_tokenize(wtxt)
len(wtokens)     # Number of words in text
In [ ]:
# Build a dictionary of word frequency count
wfreq = nltk.FreqDist(wtokens)
wfreq['the']
In [ ]:
len(wfreq)      # Number of unique words in text
In [ ]:
wfreq.most_common(40)     # 40 most common words
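The top of the list is dominated by punctuation and function words. As an optional aside beyond the original handout, NLTK's English stopword list can be used to surface content words; this assumes the stopword data has been downloaded with nltk.download('stopwords').
In [ ]:
# Optional: top content words, excluding punctuation and English stopwords
# (requires the stopword data: nltk.download('stopwords'))
from nltk.corpus import stopwords
stops = set(stopwords.words('english'))
content_freq = nltk.FreqDist(w.lower() for w in wtokens if w.isalpha() and w.lower() not in stops)
content_freq.most_common(20)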

Average sentence length, frequency of long words

In [ ]:
sentcount = wfreq['.'] + wfreq['?'] + wfreq['!']  # Assuming every sentence ends with ., ! or ?
print(sentcount)
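As a side sketch (not part of the original exercise), NLTK's sentence tokenizer gives a comparable count without relying on sentence-final punctuation; the two numbers may differ slightly.
In [ ]:
# Alternative: count sentences with NLTK's sentence tokenizer
wsents = nltk.sent_tokenize(wtxt)
len(wsents)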
In [ ]:
# Tokens include symbols and punctuation. First 50 tokens:
wtokens[:50]
In [ ]:
wtokens_nosym = [t for t in wtokens if t.isalnum()]    # alpha-numeric tokens only
len(wtokens_nosym)
In [ ]:
# Try "n't", "20th", "."
"n't".isalnum()
In [ ]:
# First 50 tokens, alpha-numeric tokens only: 
wtokens_nosym[:50]
In [ ]:
len(wtokens_nosym)/sentcount     # Average sentence length in number of words
In [ ]:
[w for w in wfreq if len(w) >= 13]       # all 13+ character words
In [ ]:
long = [w for w in wfreq if len(w) >= 13] 
# sort long alphabetically using sorted()
for w in sorted(long):
    print(w, len(w), wfreq[w])               # long words tend to be less frequent

Processing a corpus

  • NLTK can read in an entire corpus from a directory (the “root” directory).
  • As it reads in a corpus, it applies word tokenization (.words()) and sentence tokenization (.sents()).
In [ ]:
from nltk.corpus import PlaintextCorpusReader
corpus_root = 'C:/Users/Jane Eyre/Desktop/inaugural'  # Use your own userid; Mac users should omit C:
inaug = PlaintextCorpusReader(corpus_root, '.*txt')  # all files ending in 'txt' 
In [ ]:
# .txt file names as file IDs
inaug.fileids()
In [ ]:
# NLTK automatically tokenizes the corpus. First 50 words: 
print(inaug.words()[:50])
In [ ]:
# You can also specify an individual file ID. First 50 words from Obama 2009:
print(inaug.words('2009-Obama.txt')[:50])
In [ ]:
# NLTK automatically segments sentences too, which are accessed through .sents()
print(inaug.sents('2009-Obama.txt')[0])   # first sentence
print(inaug.sents('2009-Obama.txt')[1])   # 2nd sentence
In [ ]:
# How long are these speeches in terms of word and sentence count?
print('Washington 1789:', len(inaug.words('1789-Washington.txt')), len(inaug.sents('1789-Washington.txt')))
print('Obama 2009:', len(inaug.words('2009-Obama.txt')), len(inaug.sents('2009-Obama.txt')))
In [ ]:
# Loop through the file IDs and print word count, sentence count, and average sentence length.
# While looping, populate fid_avsent, which holds (average sentence length, file ID) pairs.
# Break the long line with \, and specify tab as the separator.
fid_avsent = []
for f in inaug.fileids():
    print(len(inaug.words(f)), len(inaug.sents(f)), \
          len(inaug.words(f)) / len(inaug.sents(f)), f, sep='\t')
    fid_avsent.append((len(inaug.words(f)) / len(inaug.sents(f)), f))
In [ ]:
# Turn pretty print back on 
%pprint
sorted(fid_avsent)
In [ ]:
# same thing, with list comprehension! 
fid_avsent2 = [(len(inaug.words(f)) / len(inaug.sents(f)), f) for f in inaug.fileids()]
sorted(fid_avsent2)

Troubleshooting

  • Unfortunately, the 2005 Bush file produces a Unicode encoding error.
  • Let's make a new text file from http://www.presidency.ucsb.edu/inaugurals.php
  • Copy the text and paste it into Notepad/Notepad++ (Windows) or BBEdit (Mac).
    • Windows users: make sure to choose UTF-8 encoding, not ANSI, when saving.
  • The text files are locked; we will need to save and halt the Python notebook, then re-start it. (A scripted alternative is sketched below.)
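If you prefer to fix the file in Python rather than in a text editor, here is a minimal sketch; it assumes the problem file uses the Windows-1252 encoding and that the path matches your own setup.
In [ ]:
# A scripted alternative to re-saving by hand. Assumes the file uses the
# Windows-1252 (cp1252) encoding; adjust the path and encoding to your copy.
bad_file = 'C:/Users/narae/Desktop/inaugural/2005-Bush.txt'
text = open(bad_file, encoding='cp1252').read()     # read with the legacy encoding
open(bad_file, 'w', encoding='utf-8').write(text)   # write it back out as UTF-8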
In [ ]:
# Corpus size in number of words
print(len(inaug.words()))
In [ ]:
# Building word frequency distribution for the entire corpus
inaug_freq = nltk.FreqDist(inaug.words())
inaug_freq.most_common(100)
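A further optional aside, beyond the original handout: the raw counts mix punctuation and capitalization variants, so a version restricted to lower-cased alphabetic tokens may be more informative.
In [ ]:
# Optional: corpus-wide frequency over lower-cased alphabetic tokens only,
# so 'The' and 'the' are counted together and punctuation is dropped
inaug_freq_alpha = nltk.FreqDist(w.lower() for w in inaug.words() if w.isalpha())
inaug_freq_alpha.most_common(40)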

Extra: using regular expressions for tokenization

  • re is Python's regular expression module. Start by importing it.
  • re.findall finds all substrings that match a pattern.
  • For regular expression strings, use the r'...' (raw string) prefix.
In [ ]:
import re
In [ ]:
sent = "You haven't seen Star Wars...?"
re.findall(r'\w+', sent)
In [ ]:
%pprint
re.findall(r'\w+', wtxt)
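As a final aside (not in the original handout), NLTK wraps the same idea in its own regular-expression tokenizer:
In [ ]:
# Equivalent call through NLTK's regex tokenizer wrapper
nltk.regexp_tokenize(sent, r'\w+')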

What next?

Take a Python course! There are many online courses available on Coursera, edX, Udemy, DataCamp, and more.