State of the Union Data: Methodology

We pulled each president’s first State of the Union (SOTU) address from the American Presidency Project (http://www.presidency.ucsb.edu/sou.php), beginning with Woodrow Wilson because, from Wilson forward, presidents (aside from Truman and Hoover) have delivered their addresses orally. We then analyzed the transcripts in two ways: reading level and word frequency.

Reading Level
We used readable.io/text/ to determine each speech’s Flesch-Kincaid grade level. Flesch-Kincaid is a widely used metric for estimating the reading grade level of a text, including speech transcripts.
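For readers who want to see what the score measures, the sketch below applies the standard Flesch-Kincaid grade level formula, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. It is an illustration only, not the readable.io implementation: the sentence splitter, the word pattern, and the vowel-group syllable counter are simplifying assumptions, and the function names are hypothetical, so results will differ somewhat from the scores we report.

```python
import re


def count_syllables(word: str) -> int:
    """Estimate syllables as the number of vowel groups (a rough heuristic,
    assumed here for illustration; real tools use more careful rules)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """Return the Flesch-Kincaid grade level of `text`."""
    # Treat runs of ., ! or ? as sentence boundaries (simplifying assumption).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters, optionally with one internal apostrophe.
    words = re.findall(r"[A-Za-z]+(?:['’][A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

Short, simple sentences can score below zero, which is expected behavior for the formula; long sentences and polysyllabic words push the grade level up.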

Frequency
We used a standard word frequency counter to count how often each word appears in each address, then grouped the words into the following categories: education, economy, nations, policy, military, superlatives, and individual vs. group. Within each category, we combined variations of the same word. For example, in the military category, “naval” and “navy” are both counted under “navy.”
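The counting and merging step can be sketched roughly as follows. This is an assumed reconstruction, not the tool we used: the tokenizing pattern and the function name are hypothetical, and the variant map contains only the “naval”/“navy” pairing mentioned above.

```python
import re
from collections import Counter

# Hypothetical variant map: each key is counted under its value.
# Only the naval -> navy pairing comes from the text above.
VARIANTS = {"naval": "navy"}


def count_words(text: str, variants: dict = VARIANTS) -> Counter:
    """Count lowercase word occurrences, merging listed variants."""
    # Words are runs of letters, optionally with one internal apostrophe.
    tokens = re.findall(r"[a-z]+(?:['’][a-z]+)?", text.lower())
    # Replace each token with its canonical form before counting.
    return Counter(variants.get(token, token) for token in tokens)
```

With this map, a transcript containing “Navy,” “naval,” and “navy” would report three occurrences under “navy” and none under “naval.”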

To adjust for differences in speech length, we standardized each word count as a rate per 100 words of the address.
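As a minimal sketch of that adjustment, assuming the rate is simply the raw count divided by the total word count of the address and multiplied by 100 (the function name is hypothetical):

```python
def rate_per_100(count: int, total_words: int) -> float:
    """Express a raw word count as occurrences per 100 words of the speech."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 100 * count / total_words
```

For example, a word that appears 12 times in a 4,000-word address has a rate of 0.3 per 100 words, which can be compared directly with the rate in a shorter or longer address.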


UC Berkeley School of Information

© UC Berkeley School of Information.