Na-Rae Han (naraehan@pitt.edu), 2017-07-12, Pittsburgh NEH Institute “Make Your Edition”
inaugural folder on your desktop
+ to create a new cell, ► to run
Alt+ENTER to run cell, create a new cell below
Shift+ENTER to run cell, go to next cell
More on https://www.cheatography.com/weidadeyue/cheat-sheets/jupyter-notebook/
print("hello, world!")
greet is a variable name assigned to a string value.
greet = "Hello, world!"
greet = greet + " I come in peace."
greet
.upper(), .lower() transform a string.
greet.upper()
# try .isupper(), .isalnum(), .startswith('he')
'hello'.islower()
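A minimal sketch of the string methods suggested above, using a hypothetical variable named greeting so it does not overwrite greet. Note that .startswith() is case-sensitive.

```python
# Hypothetical sample string; any text works here.
greeting = "Hello, world!"

print(greeting.upper())                   # HELLO, WORLD!
print(greeting.lower())                   # hello, world!
print(greeting.isupper())                 # False: not every cased letter is uppercase
print(greeting.isalnum())                 # False: comma, space and "!" are not alphanumeric
print(greeting.startswith('he'))          # False: the match is case-sensitive
print(greeting.lower().startswith('he'))  # True after lowercasing
```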
len() returns the length of a string in number of characters.
len(greet)
in tests substring-hood between two strings.
'he' in 'hello'
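A few more substring tests, as a small sketch. The variable name word is just for illustration. The test is case-sensitive, and the characters must appear next to each other.

```python
word = 'hello'

print(len(word))     # 5
print('he' in word)  # True
print('He' in word)  # False: case matters
print('hl' in word)  # False: 'h' and 'l' are not adjacent in 'hello'
```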
+, -, * and / work with numbers.
num1 = 5678
num2 = 3.141592
result = num1 / num2
print(num1, "divided by", num2, "is", result) # can print multiple things!
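A short sketch of a few more arithmetic operations, assuming Python 3 (where / always returns a float). The operators // and % are not used in this notebook; they are shown here only for comparison.

```python
num1 = 5678
num2 = 3.141592

print(num1 + num2)            # int + float gives a float
print(num1 / 2)               # 2839.0: / returns a float in Python 3
print(num1 // 100)            # 56: floor division drops the remainder
print(num1 % 100)             # 78: remainder after dividing by 100
print(round(num1 / num2, 2))  # 1807.36: rounded to two decimal places
```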
Lists are written with [ ], with elements separated by commas. Lists can contain strings, numbers, and more.
Use len() to get the size of a list.
Use in to see whether an element is in a list.
li = ['red', 'blue', 'green', 'black', 'white', 'pink']
len(li)
# Try logical operators not, and, or
'mauve' in li
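The logical operators not, and, or can be combined with in. A small sketch on the same list:

```python
li = ['red', 'blue', 'green', 'black', 'white', 'pink']

print('mauve' in li)                 # False
print('mauve' not in li)             # True
print('red' in li and 'blue' in li)  # True: both are in the list
print('red' in li or 'mauve' in li)  # True: at least one is in the list
print(not ('red' in li))             # False
```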
li[i] accesses a list element by index. Python indexing starts at 0.
li[3:5] returns a sub-list beginning with index 3 up to and not including index 5.
# Try [0], [2], [-1], [3:5], [3:], [:5]
li[4]
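The indexing and slicing suggestions above, worked out on the same list as a quick reference:

```python
li = ['red', 'blue', 'green', 'black', 'white', 'pink']

print(li[0])    # red: the first element
print(li[2])    # green
print(li[-1])   # pink: negative indexes count from the end
print(li[3:5])  # ['black', 'white']: index 5 is not included
print(li[3:])   # ['black', 'white', 'pink']: from index 3 to the end
print(li[:5])   # ['red', 'blue', 'green', 'black', 'white']: the first five
```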
for loop
Using a for loop, you can loop through a list of items, applying the same set of operations to each element.
for x in li:
    print(x, "is", len(x), "characters long.")
print("Done!")
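A for loop can also build up a result as it goes. The sketch below uses two hypothetical names, total and long_words, to sum the word lengths and collect the longer words. Indented lines run once per element; unindented lines run after the loop finishes.

```python
li = ['red', 'blue', 'green', 'black', 'white', 'pink']

total = 0        # running sum of word lengths
long_words = []  # words with 5 or more characters

for x in li:
    total = total + len(x)
    if len(x) >= 5:
        long_words.append(x)

print(total)       # 26
print(long_words)  # ['green', 'black', 'white']
```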
.upper(), len(), +'ish'
# filter
[x for x in li if x.endswith('e')]
# transform
[x+'ish' for x in li]
# filter and transform
[x.upper() for x in li if len(x)>=5]
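To see what a list comprehension does, it can help to write the same filter-and-transform as an ordinary for loop. This sketch assumes the same li as above; result and same are illustrative names.

```python
li = ['red', 'blue', 'green', 'black', 'white', 'pink']

# Long form: loop, filter with if, transform, and append.
result = []
for x in li:
    if len(x) >= 5:
        result.append(x.upper())

# Short form: the same steps in a single list comprehension.
same = [x.upper() for x in li if len(x) >= 5]

print(result)          # ['GREEN', 'BLACK', 'WHITE']
print(result == same)  # True
```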
len() on a dictionary returns the number of keys.
di = {'Homer':35, 'Marge':35, 'Bart':10, 'Lisa':8}
di['Lisa']
# 20 years-old or younger. x is bound to keys.
[x for x in di if di[x] <= 20]
len(di)
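A few more dictionary operations, as a sketch. Here in tests keys (not values), and .items() gives key-value pairs for looping. The key 'Maggie' is a hypothetical lookup chosen because it is not in di.

```python
di = {'Homer': 35, 'Marge': 35, 'Bart': 10, 'Lisa': 8}

print('Bart' in di)      # True: 'Bart' is a key
print(35 in di)          # False: in looks at keys, not values
print('Maggie' in di)    # False: no such key
print(di.get('Maggie'))  # None: .get() avoids an error for a missing key

# Loop over key-value pairs together.
for name, age in di.items():
    print(name, "is", age, "years old.")

young = [name for name, age in di.items() if age <= 20]
print(young)             # ['Bart', 'Lisa']
```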
import nltk
Tokenizer data (punkt) is stored under ~/nltk_data/tokenizers/punkt. If it is missing, download it:
nltk.download()
# Tokenizing function: turns a text (a single string) into a list of word & symbol tokens
nltk.word_tokenize(greet)
help(nltk.word_tokenize)
sent = "You haven't seen Star Wars...?"
nltk.word_tokenize(sent)
nltk.FreqDist() is another useful NLTK function.
# First "Rose" is capitalized. How to lowercase?
sent = 'Rose is a rose is a rose is a rose.'
toks = nltk.word_tokenize(sent)
print(toks)
freq = nltk.FreqDist(toks)
freq
freq.most_common(3)
freq['rose']
len(freq)
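One possible way to handle the capitalized "Rose" is to lowercase every token before counting. The sketch below does not use NLTK, so that it runs without extra data: it uses collections.Counter (the standard-library class that nltk.FreqDist builds on) and a crude hand-made tokenization as a stand-in for nltk.word_tokenize(). The name low_toks is illustrative.

```python
from collections import Counter

sent = 'Rose is a rose is a rose is a rose.'

# Crude stand-in for nltk.word_tokenize(): detach the period, then split on spaces.
toks = sent.replace('.', ' .').split()

# Lowercase every token with a list comprehension before counting.
low_toks = [t.lower() for t in toks]

freq = Counter(low_toks)
print(Counter(toks)['rose'])  # 3: the capitalized "Rose" is counted separately
print(freq['rose'])           # 4: all four occurrences after lowercasing
print(len(freq))              # 4 distinct tokens: rose, is, a, .
```

With NLTK installed, the same list comprehension can be applied to the output of nltk.word_tokenize(sent) before passing it to nltk.FreqDist().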
myfile = 'C:/Users/narae/Desktop/inaugural/1789-Washington.txt' # Use your own userid; Mac users should omit C:
wtxt = open(myfile).read()
print(wtxt)
len(wtxt) # Number of characters in text
'fellow citizens' in wtxt # phrase as a substring. try "Americans"
'th' in wtxt
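A self-contained sketch of reading a text file, using a temporary file with made-up content so it runs anywhere. The sample text and file name are hypothetical; substitute the path to your own file. The with block closes the file automatically, and encoding="utf-8" is an assumption about how the file was saved.

```python
import os
import tempfile

# Hypothetical stand-in text; not taken from the inaugural corpus.
sample = "Fellow citizens, this is a short sample text.\nIt has two lines."

# Write the sample to a temporary file so the example is self-contained.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(sample)

# Read the whole file into one string.
with open(path, encoding="utf-8") as f:
    txt = f.read()

print(len(txt))                          # number of characters, newline included
print('fellow citizens' in txt)          # False: the sample has a capital F
print('fellow citizens' in txt.lower())  # True once the text is lowercased
```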
# Turn off/on pretty printing (prints too many lines)
%pprint
# Tokenize text
nltk.word_tokenize(wtxt)
wtokens = nltk.word_tokenize(wtxt)
len(wtokens) # Number of tokens (words and symbols) in text
# Build a dictionary of frequency count
wfreq = nltk.FreqDist(wtokens)
wfreq['the']
wfreq['we']
len(wfreq) # Number of unique tokens (word types and symbols) in text
wfreq.most_common(40) # 40 most common words
All answered in Python Clinic Day 2: Corpus Processing