NEH Institute materials

July 2017

Week 2, Day 3: Wednesday, July 19

Synopsis

Week 2, Day 3 expands on the idea of digital editions as text-processing pipelines. After a short recap of Day 2, which introduced tokenization, we continue with the next pipeline step, normalization. We will show how these two stages prepare texts for automated collation. Automated collation is also discussed from a modeling perspective, using the Gothenburg Model. Participants learn how their research goals and questions shape the computational pipelines they build.

Outcome goals

Understanding the principles of basic text transformations like normalization and how they serve different objectives
Bringing together tokenization and normalization as individual pipeline steps and seeing how they can be implemented in the act of collation
Normalize, tokenize, and collate text
Fundamentals of TAG: hypergraph
Modeling discontinuity

Legend

Presentation: by instructors
Discussion: instructors and participants
Talk lab: participants discuss or plan in small groups
Code lab: participants code alone or in small groups

9:00–10:30: Normalization

Time	Topic	Type
15 min	Review of week 2, day 2: computational pipelines, modeling, processing, and tokenization	Discussion
15 min	Basic normalization	Presentation
15 min	Using NLTK for normalization	Code lab
15 min	Regular expressions	Code lab
15 min	Basic XML normalization: transforming XML to a stream of normalized (word) tokens	Code lab
15 min	Hands-on exercise with NLTK and regular expressions	Code lab
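
As a rough illustration of the "XML to a stream of normalized (word) tokens" step listed above, here is a minimal sketch that uses only the Python standard library rather than NLTK. The function names (`normalize`, `xml_to_tokens`), the sample markup, and the normalization rules (lowercasing, folding diacritics, stripping punctuation) are assumptions chosen for the example, not the rules taught in the session; which rules are appropriate depends on the research question.

```python
import re
import unicodedata
import xml.etree.ElementTree as ET


def normalize(token):
    """Apply illustrative normalization rules to a single token."""
    # Lowercase, then decompose accented characters into base + combining mark.
    decomposed = unicodedata.normalize("NFD", token.lower())
    # Drop the combining marks (for example, the accent on "é").
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Remove anything that is not a word character (punctuation, quotes).
    return re.sub(r"[^\w]", "", no_marks)


def xml_to_tokens(xml_string):
    """Flatten an XML fragment to text and return normalized word tokens."""
    root = ET.fromstring(xml_string)
    # itertext() yields all text content in document order, ignoring tags.
    text = "".join(root.itertext())
    # Tokenize on whitespace; a real project may need finer-grained rules.
    raw_tokens = re.findall(r"\S+", text)
    normalized = [normalize(tok) for tok in raw_tokens]
    # Discard tokens that became empty (for example, a lone punctuation mark).
    return [tok for tok in normalized if tok]


# Hypothetical sample input, invented for this example.
sample = "<l>The <hi>Gréy</hi> koala, sleeping.</l>"
print(xml_to_tokens(sample))
```

Note that this sketch throws the markup away entirely. Whether markup should be ignored, kept, or turned into token properties is itself a modeling decision.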

10:30–11:00: Coffee break

11:00–12:30: Collation

Time	Topic	Type
15 min	Modeling and collation	Presentation
15 min	Collation within editorial theory	Talk lab
30 min	Collation practice	Code lab
30 min	Tokenization and normalization for collation purposes	Code lab
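
To make the connection between tokenization, normalization, and alignment concrete, the following sketch keeps both the original reading and a normalized form of each token, aligns two witnesses on the normalized forms, and reports the original readings. It is only a sketch: the standard-library `difflib.SequenceMatcher` stands in for a dedicated collation tool, and the function names, the `"t"`/`"n"` dictionary keys, and the sample witnesses are assumptions made for the example.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Split a witness into tokens on whitespace (an illustrative rule)."""
    return text.split()


def normalize(token):
    """Lowercase a token and strip non-word characters (an illustrative rule)."""
    return re.sub(r"\W", "", token.lower())


def prepare(text):
    """Pair each original token ("t") with its normalized form ("n")."""
    return [{"t": tok, "n": normalize(tok)} for tok in tokenize(text)]


def align(witness_a, witness_b):
    """Align two witnesses on normalized forms; report the original readings."""
    a = prepare(witness_a)
    b = prepare(witness_b)
    matcher = SequenceMatcher(
        a=[tok["n"] for tok in a],
        b=[tok["n"] for tok in b],
        autojunk=False,
    )
    rows = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        reading_a = " ".join(tok["t"] for tok in a[i1:i2])
        reading_b = " ".join(tok["t"] for tok in b[j1:j2])
        rows.append((tag, reading_a, reading_b))
    return rows


# Hypothetical witnesses, invented for this example.
for row in align("The gray koala sleeps.", "the grey Koala sleeps"):
    print(row)
```

With these rules, differences in capitalization and punctuation are treated as agreement, while "gray" versus "grey" is reported as a variant. Adding a spelling rule to `normalize` would change that outcome, which is one way research goals shape the pipeline.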

12:30–2:00: Lunch

2:00–3:30: Challenging textual phenomena: Introducing Text as Graph (TAG)

Time	Topic	Type
90 min	Challenging textual phenomena: Introducing Text as Graph (TAG)	Presentation
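
The hypergraph idea behind this session can be pictured with a toy data structure in which each piece of markup is an edge connecting an arbitrary set of text tokens, so that it may skip tokens (discontinuity) or partially overlap other markup. The sketch below is not the TAG data model or any TAG implementation; the class, its method names, and the sample sentence are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TextHypergraph:
    """Toy model: tokens are nodes, each markup element is a set of node indices."""
    tokens: list
    markup: dict = field(default_factory=dict)  # markup name -> set of token indices

    def annotate(self, name, indices):
        """Attach markup to any set of tokens, contiguous or not."""
        self.markup[name] = set(indices)

    def text_of(self, name):
        """Return the tokens covered by a markup element, in document order."""
        return " ".join(self.tokens[i] for i in sorted(self.markup[name]))

    def is_discontinuous(self, name):
        """True if the markup skips over at least one token."""
        idx = sorted(self.markup[name])
        return any(later - earlier > 1 for earlier, later in zip(idx, idx[1:]))

    def overlaps(self, first, second):
        """True if two markup elements share tokens but neither contains the other."""
        a, b = self.markup[first], self.markup[second]
        return bool(a & b) and not (a <= b or b <= a)


# Hypothetical example: a quotation interrupted by "she said".
graph = TextHypergraph("Well she said we shall see".split())
graph.annotate("quote", [0, 3, 4, 5])   # "Well ... we shall see"
graph.annotate("line", [0, 1, 2, 3])    # a line break falls inside the quotation

print(graph.text_of("quote"))
print(graph.is_discontinuous("quote"))
print(graph.overlaps("quote", "line"))
```

A strictly tree-shaped XML hierarchy cannot express the `quote` element directly, because it is both interrupted and crossed by the `line` element. That limitation is the kind of challenging textual phenomenon the session title refers to.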

3:30–4:00: Coffee break

4:00–5:30: Review

Time	Topic	Type
90 min	Review	Talk lab

We’ll end each day with a request for feedback, based on a general version of the day’s outcome goals, and we’ll try to adapt on the fly to your responses. Please complete Week 2, Day 3 feedback (just copy and paste it into a plain-text document) and email your response to Kaylen at kaylensanders@pitt.edu with the subject heading “Week 2, Day 3 feedback”.