PETRARCH: The successor to TABARI

In 2014, the Computational Event Data project transitioned to a new coder named PETRARCH, and work on TABARI effectively stopped at that point. The codebase for PETRARCH is entirely new, though at the present time [August 2014] the system still uses modified versions of several TABARI dictionaries; we expect this will gradually change. In addition, the TABARI-formatted dictionaries are not being updated, whereas the PETRARCH dictionaries are being substantially updated to incorporate all of the political actors commonly found in current news reports. TABARI is still in active use (sometimes with, sometimes without, attribution), but in general if you are starting a new project, you should probably focus on PETRARCH, not TABARI: as of June 2014 PETRARCH has all of the functionality of TABARI and has been very thoroughly tested.

The core, and motivating, innovation for the new program—a change which cascades through the entire system—is that PETRARCH uses fully-parsed Penn TreeBank input: its coding is parser-based, whereas the coding of TABARI was largely pattern-based. This has a number of very substantial implications.

First, because TABARI was a pattern-based shallow parser, it could get the right answer for the wrong reason, and at least some of the dictionary entries—in particular those treating nouns as if they were verbs—depended on this. PETRARCH, in contrast, only matches true verbs: (VP (VBx in the parse tree. While this means that a small number of existing noun-based patterns no longer work, the parser virtually eliminates the problem of noun-verb disambiguation (or, rather, relegates it to whatever parser is producing the Treebank output), which is a vastly more important issue. The problem of word sense disambiguation remains, particularly the non-trivial issue of distinguishing verbal from physical sense in words such as 'attack,' and dictionaries still need to handle this.
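
The idea of matching only true verbs can be illustrated with a minimal sketch. This is not PETRARCH's actual matching code: the function names `parse_treebank` and `true_verbs` and the example sentence are hypothetical, and the rule used here (any VB* node that is a direct child of a VP) is an assumed approximation of the (VP (VBx pattern described above.

```python
def parse_treebank(text):
    """Parse a bracketed Penn Treebank string into nested lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    stack = [[]]
    for tok in tokens:
        if tok == "(":
            stack.append([])          # open a new constituent
        elif tok == ")":
            node = stack.pop()        # close it and attach to its parent
            stack[-1].append(node)
        else:
            stack[-1].append(tok)     # a label or a word
    return stack[0][0]


def true_verbs(node):
    """Yield (tag, word) for each VB* node that is a direct child of a VP."""
    if not isinstance(node, list) or not node:
        return
    label = node[0] if isinstance(node[0], str) else None
    children = node[1:] if label is not None else node
    for child in children:
        if not isinstance(child, list):
            continue
        is_verb = (
            label == "VP"
            and len(child) == 2
            and isinstance(child[0], str)
            and child[0].startswith("VB")
            and isinstance(child[1], str)
        )
        if is_verb:
            yield (child[0], child[1])
        yield from true_verbs(child)


# Hypothetical parse: "attack" is tagged NN here, so it is never treated as a verb.
SENTENCE = (
    "(ROOT (S (NP (DT The) (NN attack)) "
    "(VP (VBD killed) (NP (CD three) (NNS soldiers))) (. .)))"
)

verbs = list(true_verbs(parse_treebank(SENTENCE)))
print(verbs)
```

In this sketch the noun "attack" is skipped because the parser tagged it NN, which is the sense in which noun-verb disambiguation is handed off to whatever parser produced the tree.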

Second, parsed input is typically less robust than pattern-based input, since the addition or deletion of words that seem trivial to a native speaker will sometimes change the parse (which is, of course, itself produced by a very complex program). This has two implications. One is that PETRARCH will be more conservative than TABARI, which was one of the motivations for the change, particularly as event data have gained increasing attention in the policy community where false positives are a major concern. The other is that while the TABARI dictionaries provide a starting point, they will eventually need to be adapted. That said, some features that had to be dealt with as special cases in TABARI are taken care of automatically in PETRARCH, and the full parts-of-speech markup should ultimately simplify the dictionaries by eliminating verb phrases that existed solely to handle noun-verb disambiguation.

Third, switching to one or more open-source parsers (in common with many contemporary projects, we are currently using the Stanford CoreNLP parser) means that we are relegating the parsing to the computational linguists, and more generally to a large community that develops parsers that can produce Treebank output. This has somewhat simplified the required code, though not dramatically, as the quirks of a full parse are, if anything, more complex than those of a pattern-based shallow parse. And the parse doesn't take care of everything: for example, comma-delimited clause deletion and passive voice detection are essentially done the same way as in TABARI.

Nonetheless, the shift to Treebank input may allow PETRARCH to be easily adapted to other languages, since the Treebank format is standard across many languages. It will still be necessary to adjust for some of the phrase and word-ordering rules, and of course the dictionaries would need to be translated, but except for passive-voice detection, PETRARCH works only with the Treebank tags, not the content.

Finally, Treebank identifies any noun phrase that could potentially be a political actor, whereas TABARI was restricted to identifying actors that were in the dictionaries. The new_actor_length parameter in the PETR_config.ini file allows arbitrary noun phrases to be recorded in the source and target slots of the event data whenever these occur in the subject and object positions of the verb phrase. These phrases can then be processed to extract the high-frequency named entities which are not in the dictionaries.
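
The post-processing step mentioned above, pulling out high-frequency phrases that are not yet in the dictionaries, might look something like the following sketch. It is not part of PETRARCH: the function `frequent_unknown_phrases`, the `min_count` threshold, and the sample phrases and actor list are all hypothetical, and it assumes the recorded noun phrases have already been collected from the source and target slots into a plain list of strings.

```python
from collections import Counter


def frequent_unknown_phrases(phrases, known_actors, min_count=2):
    """Return (phrase, count) pairs for recurring phrases not in known_actors.

    Phrases are compared case-insensitively; results are ordered by
    descending frequency.
    """
    known = {actor.strip().upper() for actor in known_actors}
    counts = Counter(p.strip().upper() for p in phrases if p.strip())
    return [
        (phrase, n)
        for phrase, n in counts.most_common()
        if n >= min_count and phrase not in known
    ]


# Hypothetical noun phrases recorded in source/target slots.
recorded = [
    "Acme Liberation Front",
    "ACME LIBERATION FRONT",
    "United Nations",
    "United Nations",
    "Acme Liberation Front",
    "Some Village Council",
]
# Hypothetical stand-in for the existing actor dictionary.
dictionary_actors = ["United Nations"]

candidates = frequent_unknown_phrases(recorded, dictionary_actors)
print(candidates)
```

The output of a step like this is only a list of candidates; deciding which of them are genuine political actors, and what codes they should receive, would still be a dictionary-development task.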

So what's not to like? Speed and the addition of another step, parsing, into the coding pipeline. TABARI could code very rapidly, typically around 1,000 to 2,000 sentences per second depending on the dictionaries. PETRARCH currently codes at only about 150 sentences per second, and the CoreNLP system parses at about 2 to 5 sentences per second. Consequently the computational demands are much higher, and high-volume coding requires substantial cluster computer resources. Presumably most of the performance hit in PETRARCH is due to the use of Python rather than C++. Python, however, is far better suited for writing code for processing text than C++, so the program is substantially shorter and easier to debug and modify. Python is also much more robust across platforms than C++—C++ proved to be a significant barrier to adoption for some users not familiar with Unix environments—and Python has a much larger, and younger, community of programmers.
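
To make the speed difference concrete, the rates quoted above can be turned into rough wall-clock estimates. The sketch below is a back-of-the-envelope calculation only: the corpus size of one million sentences is a hypothetical figure, the function names are illustrative, and it assumes a single process that parses every sentence and then codes it, with no parallelism.

```python
def tabari_hours(n_sentences, code_rate):
    """Hours for a single coding pass at code_rate sentences per second."""
    return n_sentences / code_rate / 3600.0


def pipeline_hours(n_sentences, parse_rate, code_rate):
    """Hours to parse and then code n_sentences (rates in sentences/second)."""
    seconds = n_sentences / parse_rate + n_sentences / code_rate
    return seconds / 3600.0


N = 1_000_000  # hypothetical corpus size

# TABARI: 1,000 to 2,000 sentences per second, no separate parsing step.
print("TABARI:   %.2f to %.2f hours" % (tabari_hours(N, 2000), tabari_hours(N, 1000)))

# PETRARCH: CoreNLP at 2 to 5 sentences per second, then coding at about 150.
print("PETRARCH: %.1f to %.1f hours" % (pipeline_hours(N, 5, 150), pipeline_hours(N, 2, 150)))
```

Under these assumptions a million sentences take well under an hour with TABARI but roughly 57 to 141 hours with CoreNLP plus PETRARCH, and almost all of that time is spent in parsing rather than coding, which is why high-volume work calls for a cluster.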

A few other major changes:

Due to a series of unfortunate events (unrelated to anything involving Penn State), the http://event web site was discontinued, and the repository for the TABARI program, utilities and dictionaries has moved. PETRARCH, and more generally the EL:DIABLO real-time coding suite of the Open Event Data Alliance, are maintained on GitHub. In keeping with the now-common practices of the open source community (and gaining the advantage of sophisticated versioning, which we decidedly did not have earlier), the canonical repository for PETRARCH and its utilities will be on GitHub rather than this site. The new project is very much a team effort, in particular with John Beieler doing much of the development of the near-real-time coding pipeline and Andy Halterman putting substantial effort into the dictionary. And there is more to come. Brave new world.

All that said, the new environment is more complex, if arguably more robust. We have introduced another processing step, CoreNLP, which is quite complex in itself and, alas, written in yet another language, Java, so you need to make sure that both your Java and Python environments are reasonably current. Even GitHub has a bit of a learning curve, though a simple download of the code should be straightforward. At present, the Lexis-Nexis and Factiva filter programs only produce TABARI-formatted input, and as of August 2014 we don't have programs to automatically go from the TABARI to PETRARCH formats, though we expect to have these in the near future once those formats are completely stable (or, if you've written one, please make it available!). Consequently, for “toy” problems and simply experimenting with automated coding to get some idea of how it works, or as a classroom exercise, TABARI might still have some utility. For “industrial” (and policy) applications, however, PETRARCH is definitely the way forward.