Slides from Workshop on Automated Content Analysis
Philip A. Schrodt
Department of Political Science
Penn State University
Presented at the 20th
Political Methodology Summer Conference
University of Minnesota
18 July 2003
Outline of workshop
- Overview of content analysis
- Human vs. automated coding
- Accessing material on the web
- Text as a statistical object
- Dos and Don'ts in contemporary content analysis
- Further information
Key points to be made:
- Contemporary content analysis is very different from methods used in the 1960s
- Automated coding is superior to human coding in large projects; it is a well-developed technology
- The Web has made a tremendous amount of data available in machine readable form, at your desktop, for free
- Learn and use the Perl language for text processing
- Text has regular statistical characteristics but should be treated inductively
Contemporary Content Analysis
Levels of content analysis
Analytical Term | Linguistic Term | Methodology |
Thematic | Lexical | Analysis of words and phrases |
Syntactic | Syntactic | Use of grammatical rules to determine role of words |
Network | Semantic | Use relationships between words to disambiguate meanings |
Research in other fields
Library science | Automated indexing |
Computational Linguistics | Automated translation and natural language processing generally |
Psychology | Personality tests |
Communications Studies | Content of popular culture: books, movie and television scripts |
Education | Automated grading |
Business | Automated evaluation of resumés, aptitude tests |
Resources: Books
- Alexa, Melina and Cornelia Zuell. 1999. A Review of Software for Text Analysis. Mannheim: Zentrum für Umfragen, Methoden und Analysen.
- Neuendorf, Kimberly A. 2002. The Content Analysis Guidebook. Thousand Oaks CA: Sage.
- Popping, Roel. 2000. Computer-Assisted Text Analysis. Thousand Oaks CA: Sage.
- Roberts, Carl. 1997. Text Analysis for the Social Sciences. Mahwah NJ: Lawrence Erlbaum Associates.
- Salton, G. 1989. Automatic Text Processing. Reading, Mass: Addison-Wesley.
- Weber, Robert Philip. 1990. Basic Content Analysis, 2d ed. Newbury Park, CA: Sage Publications.
Potential text sources relevant to political behavior
- News reports
- Legislation
- Campaign platforms
- Editorials
- Open ended survey questions
Advantages of text as a data source
- Text is one of the primary methods of communicating political information
- Text is unaffected by the act of measurement
- The source material is intentional: it was created for some political purpose
- Web-based text can be collected in near-real-time at very little cost
- Using automated acquisition and coding methods, a single individual can create an original, customized data set with little or no funding
Human versus Automated Coding
Reliability in content analysis
- stability: the ability of a coder to consistently assign the same code to a given text;
- reproducibility: intercoder reliability;
- accuracy: the ability of a group of coders to conform to a standard.
Source: Weber (1990:17)
Advantages of automated coding
- Fast and inexpensive
- Transparent: coding rules are explicit in the dictionaries
- Reproducible: a coding system can be consistently maintained over a period of time without the "coding drift" caused by changing teams of coders.
- Coding dictionaries can also be shared between institutions
- Unaffected by the biases of individual coders.
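As a minimal sketch of what "coding rules are explicit in the dictionaries" means in practice: the dictionary below is simply a Perl hash from words to category codes, so anyone can read, share, or rerun it. The words, the category names, and the function `code_text` are hypothetical illustrations, not the contents of any actual coding dictionary.

```perl
use strict;
use warnings;

# Hypothetical coding dictionary: each lower-case word maps to a category code.
# The entries are illustrative only.
my %dictionary = (
    met       => 'CONSULT',
    discussed => 'CONSULT',
    accused   => 'ACCUSE',
    denounced => 'ACCUSE',
    attacked  => 'FORCE',
);

# Count how often each category occurs in a text.
sub code_text {
    my ($text) = @_;
    my %counts;
    for my $word (lc($text) =~ /[a-z']+/g) {
        my $category = $dictionary{$word};
        $counts{$category}++ if defined $category;
    }
    return \%counts;
}

my $counts = code_text("Syria denounced the raid and accused Israel; envoys met later.");
print "$_: $counts->{$_}\n" for sort keys %$counts;
```

Because the rules live entirely in `%dictionary`, running the same dictionary over the same text always yields the same codes.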
Disadvantages of automated coding
- Automated thematic coding has problems with disambiguation; automated syntactic coding makes errors on complex sentences.
- Requires a properly formatted, machine-readable source of text; older paper and microfilm sources are therefore difficult to code.
- Development of new coding dictionaries is time-consuming: initial KEDS/PANDA dictionary development required 2 labor-years. (Modification of existing dictionaries, however, requires far less effort.)
Tradeoffs between human and machine coding
- Machine coding uses only information that is explicit in the text; human coders are likely to use implicit knowledge of the situation.
- Machine coding is not affected by boredom and fatigue
- Human coders can more effectively interpret idiomatic and metaphorical text
- Human coders can more effectively deal with complex subordinate phrases
Summary: Comparative advantages of human versus machine coding
Advantage to human coding
- Small data sets
- Data coded only one time at a single site
- Existing dictionaries cannot be modified
- Complex sentence structure
- Metaphorical, idiomatic, or time-dependent text
- Money available to fund coders and supervisors
Advantage to machine coding
- Large data sets
- Data coded over a period of time or across projects
- Existing dictionaries can be modified
- Simple sentence structures
- Literal, present-tense text
- Money is limited
Don't commit yourself to human coding until you have first spent several hours, not just a few minutes, doing the coding. It is a tedious, mind-numbing task.
"Doing content analysis by hand will reduce even
the most fanatical post-modernist to pleading for a computer."
Philip Stone (author of General Inquirer)
Do you have the funds to hire a group of reliable, enthusiastic, and committed graduate students or undergraduate honors students with excellent substantive knowledge who will code accurately and consistently for months or years at a time?
[No, you don't...]
Design your coding protocol with automated coding in mind. Coding categories that cannot be easily differentiated by automated methods usually cannot be easily differentiated by human coders either.
Do not mix data from manual and automated coding! Optimize your coding dictionaries first, then use automated coding for the entire data set.
Disambiguation and Lemmatization
Disambiguation refers to the problem of dealing with homonyms: words that sound (and are written) the same but have different meanings. These are very common in English.
Lemmatization refers to the problem of associating various forms of a word with the same root. This can usually be done with simple stemming in English; it is more complicated in most other languages.
Disambiguation: "Bat"
- wooden (or aluminum) cylinder used in the game of baseball
- small flying mammal
- act of batting ("at bat")
- blinking ("bat an eye")
Idiomatic uses
- "go to bat for": defending or interceding;
- "right off the bat": immediately;
- "bats in the belfry": commentary on an individual's cognitive ability
Foreign phrases
- "bat mitzvah": a girl's coming-of-age ceremony (Hebrew).
Disambiguation, cont.
Any of these uses might be encountered in an English-language text, and multiple uses might be found in a single sentence:
"The umpire didn't bat an eye as Sarah lowered her bat to watch the bat flying around the pitcher."
Words can also change from verbs to nouns without modification. Consider:
- I plan to drive to the store, then wash the car
- When John returned from the car wash, he parked his car in the drive.
In summary: "Verbing weirds language."
Bill Watterson, Calvin and Hobbes
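One common partial remedy, sketched here under stated assumptions, is to match multi-word idioms before counting the bare word. The phrase list, the placeholder token, and the function `count_bat` are hypothetical. The sketch separates idiomatic from literal uses only; it cannot tell the baseball bat from the flying mammal, which would need syntactic or semantic information.

```perl
use strict;
use warnings;

# Illustrative list of idioms containing "bat" that do not refer to
# the object or the animal.
my @idioms = (
    'bats in the belfry',
    'right off the bat',
    'go to bat for',
    'bat an eye',
    'bat mitzvah',
);

# Return (number of idiomatic uses, number of remaining literal uses).
sub count_bat {
    my ($text) = @_;
    my $work = lc $text;
    my $idiomatic = 0;
    for my $phrase (@idioms) {
        # Replace each idiom with a placeholder so its "bat" is not counted again.
        $idiomatic += ($work =~ s/\b\Q$phrase\E\b/_IDIOM_/g) || 0;
    }
    my $literal = () = $work =~ /\bbats?\b/g;
    return ($idiomatic, $literal);
}

my ($idiomatic, $literal) = count_bat(
    "The umpire didn't bat an eye as Sarah lowered her bat to watch the bat flying around the pitcher."
);
print "idiomatic: $idiomatic, literal: $literal\n";
```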
Nouns: "Syria"
- Possessive: "Syria's"
- Adjectival: "Syrian"
- Plural: "Syrians"
Verbs: "discuss"
- 3rd person singular: "discusses"
- Past tense: "discussed"
In general, English word forms are exceptionally simple: nouns have only two number forms (singular and plural), regular verbs have only two endings (-s/-es and -ed), and nouns do not change form to indicate whether they are the subject or the object (case). Most other languages are more complex, but that complexity also carries additional information.
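Because the regular endings are so few, a first-pass stemmer can be a handful of substitutions. The function `stem` below is a hypothetical sketch, not a published stemming algorithm: it handles only the regular endings, will mangle some irregular words, and leaves adjectival forms such as "Syrian" for a dictionary to resolve.

```perl
use strict;
use warnings;

# Naive suffix stripping for regular English endings.
sub stem {
    my ($word) = @_;
    my $w = lc $word;
    $w =~ s/'s$//;                         # possessive: syria's -> syria
    $w =~ s/(?<=(?:ss|sh|ch))es$//         # discusses -> discuss
        or $w =~ s/(?<=[a-z]{3})ed$//      # discussed -> discuss
        or $w =~ s/(?<=[a-rt-z])s$//;      # syrians -> syrian
    return $w;
}

print join(' ', map { stem($_) } qw(Syria's Syrians discusses discussed)), "\n";
```

Only one of the three ending rules is applied per word, so a stem is never stripped twice.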
Text Processing using Perl
Why should a political methodologist learn programming?
- It is at the guts of all of the programs you will be using anyway, so it helps you figure them out.
- It gives you vastly more flexibility than you would otherwise have, particularly dealing with text. Things can be done very easily with a program that are difficult with a search-and-replace or statistical transformations
10-year-olds program, and 16-year-olds can cover the basics in about 10 weeks (albeit in BASIC or Pascal)
20-year-old hackers in developing countries can write and deploy viruses for the Windows OS that cause billions of dollars of damage across the planet in a few hours!
- It is easy to learn, though to get it down well, you need to practice, practice, practice.
Why learn programming?, continued
- Moore's Law: computer capacity doubles every 18 months. You don't want to use this??
Economist's law: every discussion of computing must start by mentioning Moore's Law
- Otherwise you are at the mercy of computer programmers
See also: plumbing, automobile repair, landscaping, remodeling
The wrong reasons to learn programming (despite what you have heard)
Instant access to fantastic jobs earning zillion-dollar salaries
- See Micro-smurfs
- See NASDAQ technology index, 1998-present
- If you don't enjoy it, you don't want to do it for a living
- Academic salaries are quite competitive
Only opportunity to meet, and possibly mate with, other individuals with severe personality disorders and zero social skills
Advantages of Perl
Note: these advantages assume one already knows C/C++ or Java...
- Most of the control structures and syntax of Perl are the same as in C++ and Java.
- Perl does not require any of the headers and variable declarations used in C and Java.
- Perl contains a large number of additional string-oriented functions and data structures not available in C.
- The pattern matching and substitution options are incredibly rich.
- Perl transparently interfaces with the operating system: in other words, a Perl program can easily move, delete or rename files, fetch web pages, and the like.
Advantages of Perl, continued
- Perl is open-source and freely available for Unix, Linux, Windows, and Macintosh. It runs as part of the operating system on many Unix machines, in Linux, and in the Macintosh OS X operating system.
- There is extensive documentation and source code available on the Web.
- "Perl is the glue that holds the web together": much of what you download from the web will have been generated from Perl and is therefore easily processed with Perl
Perl comes out of the Unix community and a lot of the most powerful features of the language are based on Unix models, which will seem obscure until you become familiar with them. But once you've learned the "regular expression" syntax for Perl, you can also use it in Unix.
Disadvantages of Perl
- Perl is an interpreted language, rather than a compiled language, so it is probably too slow for writing large programs. The speed seems fine on both Unix and the Mac, however: a simple program for counting event types in a WEIS file runs through a 30,000-line data file in less than a second on a Mac G3.
- This is a text-processing language, not a general purpose language.
For further information on Perl
- Larry Wall, Tom Christiansen, and Jon Orwant. 2000. Programming Perl (3rd edition). Cambridge: O'Reilly Associates. (This is known as the "camel book" and is the definitive guide to Perl. 1067 pages. Possibly more than you want to know.)
- Randal Schwartz and Tom Christiansen. 1997. Learning Perl (2nd edition). Cambridge: O'Reilly Associates. (Covers the 30% of the language that is used most of the time.)
- (home page for the Perl enterprise)
- (this links into full Perl documentation, complete with a search facility)
- "Instantaneous Introduction to Perl" [by Michael Grobe, University of Kansas]
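A Perl program for downloading a known set of URLs
use LWP::Simple;                  # assumed source of get(); FIN and FOUT are filehandles opened earlier
while ($theURL = <FIN>) {
    chomp $theURL;                # remove the trailing newline from the URL
    $theHTML = get($theURL);      # get() returns undef if the download fails
    print FOUT "\n\n$theHTML" if defined $theHTML;
}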
Alternative: script a browser
This is likely to be the easier method if the site requires authentication or other security measures
- Step 1: Log into the site manually and manually navigate to a point where you can access the material you want
- Step 2: Run a separate script (for example, AppleScript on the Macintosh) that drives the browser to do the downloads.
Don't assume that you will be able to download from a site: it may use internal scripts or other methods that get in the way. Experiment first.
However, most sites can be downloaded. In particular, any site that can be indexed by Google can be downloaded using automated methods (since that is how Google works). This provides an incentive for sites that want traffic to be Perl-friendly.
Text Filtering
This is an essential step in any original automated analysis. The text that you download will not be in a format that you can immediately analyze!
Filters are used to regularize the text for later processing. Perl is ideal for this task.
What a Text Filter Needs to Do
- Remove the HTML tags and other web-specific coding
- Locate the beginning and end of the document text
- Segment the article into sentences
  - Problems: periods in abbreviations; abbreviations at the end of a sentence
- Identify quotations for separate treatment
  - Problems: short quoted phrases in mid-sentence (Bill "Mad Dog" Jones); use of double apostrophes rather than quotation marks
- Eliminate duplicate stories: comparison of character counts seems to work for this
- Ignore everything in the file not required for the above tasks
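The first and third filter steps might be sketched as follows. This is an illustration under stated assumptions, not a production filter: the abbreviation list, the `<DOT>` placeholder, and the function names `strip_html` and `split_sentences` are hypothetical, and regular expressions handle only reasonably well-formed HTML. An abbreviation that really does end a sentence will still be missed, which is exactly the problem noted above.

```perl
use strict;
use warnings;

# Illustrative abbreviation list; a real project would need a much longer one.
my @abbrev = qw(Mr Mrs Dr Gen Sen U.S U.N);

# Remove tags and a few common entities, then collapse white space.
sub strip_html {
    my ($html) = @_;
    $html =~ s/<(script|style)\b.*?<\/\1>//gis;   # drop scripts and style sheets
    $html =~ s/<[^>]*>/ /g;                       # drop the remaining tags
    $html =~ s/&nbsp;/ /g;
    $html =~ s/&quot;/"/g;
    $html =~ s/&amp;/&/g;
    $html =~ s/\s+/ /g;
    $html =~ s/^ | $//g;                          # trim leading and trailing space
    return $html;
}

# Split after ., ! or ? when followed by white space and a capital letter or
# quotation mark, unless the period belongs to a listed abbreviation.
sub split_sentences {
    my ($text) = @_;
    my $protect = join '|', map { quotemeta } @abbrev;
    $text =~ s/\b($protect)\./$1<DOT>/g;          # hide abbreviation periods
    my @sentences = split /(?<=[.!?])\s+(?=["A-Z])/, $text;
    s/<DOT>/./g for @sentences;                   # restore them
    return @sentences;
}

my $page = '<p>Dr. Smith met Gen. Jones.</p> <p>They discussed Syria &amp; Iraq.</p>';
print "$_\n" for split_sentences(strip_html($page));
```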
Text File Formats
- ASCII ("text"): this is what you want.
- MS-Word (or other word processing): nearly impossible to process; convert to "text"
- HTML: downloaded from the web; this is ASCII plus tags
- RTF: "rich text format"; also ASCII with tags
- PDF: portable document format (Adobe); see "MS-Word"
- JPEG and other graphics formats: these are scanned images of the document and cannot be coded directly. OCR might work on some of these, but it is tedious
Operating System Differences
How is a line ended?
- Macintosh (classic Mac OS): ASCII 13 (\r)
- Unix: ASCII 10 (\n)
- Windows: ASCII 13 + ASCII 10 (\r\n)
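Rather than tracking which platform produced each file, a filter can convert every line ending to a single form before any other processing. A minimal sketch, with the hypothetical function name `normalize_newlines`; the octal escapes \015 and \012 are ASCII 13 and ASCII 10:

```perl
use strict;
use warnings;

# Convert ASCII 13 + ASCII 10 pairs and lone ASCII 13 characters to ASCII 10.
sub normalize_newlines {
    my ($text) = @_;
    $text =~ s/\015\012?/\012/g;
    return $text;
}

my $mixed = "line one\015\012line two\015line three\012";
print normalize_newlines($mixed);
```

The optional \012 in the pattern keeps a two-character ending from turning into two line breaks.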
Special characters (e.g. diacriticals å, ü): there are a wide variety of "standards"
"Unicode": successor to ASCII; incorporates character sets of all widely-used languages (e.g. Russian, Arabic, Hebrew, Hindi, Chinese, Korean, Japanese)
Treating Text as a Statistical Object
Statistical Characteristics of Text
Zipf's Law (a.k.a. rank-size law)
"The frequency of the occurrence of a word in a natural language is inversely proportional to its rank in frequency"
In mathematics: $f_i \propto 1/r_i$
In English: A small number of words account for most of word usage
Word frequency in English
% of usage | # of words |
40% | 50 |
60% | 2,300 |
85% | 8,000 |
99% | 16,000 |
Total words in American English: about 600,000
Total words in technical English (all fields): about 3 million
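A rank-frequency table like the one above can be produced for any corpus in a few lines. The sketch below uses hypothetical function names `rank_words` and `coverage` and a toy sentence in place of a real corpus; under Zipf's Law, rank times frequency should stay roughly constant in a large sample.

```perl
use strict;
use warnings;

# Count words and return [word, count] pairs sorted by descending frequency.
sub rank_words {
    my ($text) = @_;
    my %count;
    $count{$_}++ for lc($text) =~ /[a-z']+/g;
    return map { [ $_, $count{$_} ] }
           sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
}

# Share of all word tokens accounted for by the $k most frequent words.
sub coverage {
    my ($k, @ranked) = @_;
    my ($top, $total) = (0, 0);
    for my $i (0 .. $#ranked) {
        $total += $ranked[$i][1];
        $top   += $ranked[$i][1] if $i < $k;
    }
    return $total ? $top / $total : 0;
}

my @ranked = rank_words("the cat and the dog and the bird");
printf("%d\t%s\t%d\n", $_ + 1, @{ $ranked[$_] }) for 0 .. $#ranked;   # rank, word, count
printf("top 2 words cover %.1f%% of usage\n", 100 * coverage(2, @ranked));
```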
Functional Words
Very short words such as
Articles | a an the |
Interrogatives | who what when where why how |
Prepositions | to from at in above below |
Auxiliary verbs | have has was were been |
Markers | by in at to (French de, German du, Arabic fi) |
Pronouns | I you he she him her his hers |
In English, the specificity of a word is generally proportional to its length
Marker words have multiple uses: Random House College Dictionary lists 29 meanings for "by," 31 for "in," 25 for "to," and 15 for "for."
Zipf's Law collides with statistical analysis
Information theory:
the information contained in an item of data is proportional to $-\log(f_i)$, so rarer items carry more information
Statistical efficiency:
the standard error of a parameter estimate is inversely proportional to the square root of the sample size
The upshot: Any content analysis must balance the high level of information contained in low-frequency words with the requirements of getting a sample of those words sufficiently large for reasonable parameter estimation.
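The tension can be written out as a sketch. The notation is an assumption introduced here: $f_i$ is the relative frequency of word $i$, $N$ is the number of word tokens in the corpus, and $n_i = f_i N$ is the number of times word $i$ is observed.

```latex
I(w_i) \propto -\log f_i, \qquad
\mathrm{SE}(\hat{\theta}_i) \propto \frac{1}{\sqrt{n_i}} = \frac{1}{\sqrt{f_i N}}
```

For example, in a corpus of $N = 10^6$ tokens, a word with $f_i = 10^{-2}$ is seen about 10,000 times ($1/\sqrt{n_i} = 0.01$), while a word with $f_i = 10^{-5}$ is seen about 10 times ($1/\sqrt{n_i} \approx 0.32$). The rarer word is more informative per occurrence, but estimates based on it are roughly 32 times noisier.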
What does a document look like as a statistical object?
Mathematically: it is a high-dimensional, sparse feature vector where the elements of the vector are the frequencies of specific words and phrases in the document
Geometrically: it is a point in a high-dimensional space.
The upshot: Anything you can do with points, you can do with documents
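One thing you can do with points is measure the angle between them. The sketch below stores each document as a sparse vector (a Perl hash holding only the words that occur) and computes the cosine of the angle between two documents. The function names `word_vector` and `cosine` are hypothetical, and raw word counts are used with no weighting.

```perl
use strict;
use warnings;

# Sparse feature vector: word => count, for the words that occur.
sub word_vector {
    my ($text) = @_;
    my %vec;
    $vec{$_}++ for lc($text) =~ /[a-z']+/g;
    return \%vec;
}

# Cosine of the angle between two sparse vectors:
# 1 for identical word proportions, 0 for no shared words.
sub cosine {
    my ($u, $v) = @_;
    my ($dot, $nu, $nv) = (0, 0, 0);
    for my $w (keys %$u) {
        $nu  += $u->{$w} ** 2;
        $dot += $u->{$w} * $v->{$w} if exists $v->{$w};
    }
    $nv += $_ ** 2 for values %$v;
    return 0 unless $nu && $nv;
    return $dot / sqrt($nu * $nv);
}

my $doc1 = word_vector("Syria discussed peace talks");
my $doc2 = word_vector("Peace talks resumed in Geneva");
printf("similarity: %.3f\n", cosine($doc1, $doc2));
```

The same vectors could feed clustering or dimension-reduction methods; cosine similarity is simply the shortest example.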
Dos and Don'ts in Contemporary Content Analysis
Content Analysis "Best Practices"
Wherein we introduce two individuals whom we will follow through the hazardous process of implementing a content analysis project in 2003
who does everything wrong...
who does everything right...
Coding Methods 1
Relies on manual coding because human coders can apply contextual interpretations to the text
Uses automated coding because it is fast, transparent, reliable, and stable.
Coding Methods 2
Establishes a coding protocol that requires extensive training and supervision of coders
Establishes a coding protocol that takes into consideration the limitations and likely errors of automated text processing
Coding Methods 3
Hires a team of coders, trains them to 85% inter-coder reliability levels, then plays Quidditch while they complete the coding
Avoids the use of multiple coders whenever possible, and does continual cross-checks if they are used. Coders only work on dictionaries, never with the final data
Coding Methods 4
Uses data produced by a combination of automated methods and manual correction
Uses data produced by fully-automated methods to ensure transparency and reproducibility
Choice of Medium
Codes from paper
Codes from sources on the web or CD-ROM
Obtaining information from the web
Downloads using cut-and-paste from web pages
Downloads HTML source using a spider or script, and post-processes the information using a Perl program
Formatting the Input Data 1
Assumes that data will be in a form that can be processed immediately by the coding program
Assumes that the data will require extensive reformatting before it can be processed by the coding program
Formatting the Input Data 2
Reformats the text using MS-Word, SAS macros, or Visual Basic, or hires a computer science graduate student to write a reformatting program in LISP
Reformats the text herself using Perl
Review of state-of-the-art methods
Reads U.S. political science studies from the 1960s
Studies current research in sociology, psychology and communications studies, with a focus on European research
Choosing an automated coding program
Uses the program his graduate advisor used in 1978 or obsessively checks out every program referenced on Bill Evans's web site.
First determines the requirements of the project, then checks several reviews of available software, and then chooses between two or three of the most promising programs.
Deciding what to code from the text
Codes as much information as possible from the text given the limitations of the automated coding program.
Codes only the information required for the project, and focuses on maximizing the validity of that coding. Incrementally adds complexity to the coding scheme as required.
Intellectual property issues
Openly flouts copyrights because hey, we're the Napster generation and information wants to be free; shares copyrighted primary source material
Quietly asserts right to download copyrighted material for research purposes under the legal doctrine of fair use; shares only secondary data
Determining Coding Categories
Establishes a small number of coding categories based on deductive understanding of the knowledge domain and examination of a few source texts
Determines coding categories by automated coding of all of the data and then applies a statistical method for reduction of dimensionality or clustering
Assumptions about the accuracy of the coding
Assumes coded data contain very little error
Assumes data, whether coded manually or by automated methods, will contain at least 15% erroneously coded records, and probably 25% to 35%. Some of this error will be systematic.
Additional Resources
Additional Resources: Books
Additional Resources: Web Sites
- William Evans' content analysis web page
- Harald Klein's text analysis software page
- Kansas Event Data System site (automated event data analysis)