TABARI: Text Analysis By Augmented Replacement Instructions

April 2000

Introduction

TABARI is the successor program to KEDS. The project operates under open source software principles -- the source code is available in the public domain and will be documented internally and externally on the assumption that it might later be modified. The target operating system is Linux and the programming language is C++. Version 0.2 -- which implements most of the core coding features of KEDS -- is now available for the Linux and Macintosh operating systems. A tentative compiled version is also available for Windows -- this runs in a "DOS console window".

All versions of the program will be backwards-compatible with KEDS dictionaries, though this might require a few changes in the .options file. However, the program will eventually incorporate a series of incremental modifications to deal with some of the common problems in the KEDS parsing model. These will eventually include:

Better disambiguation of nouns and verbs
Synonym sets
Separation of the parsing and coding functions -- it will be possible to run the coder on parsed input text, including the output of full parsers
Use of XML as the standard for dictionary and options input
Hard-coded routines to handle passive voice and attribution
Proximity limitations in patterns
Production of dictionaries for the IDEA event coding scheme, a successor to WEIS/PANDA

Almost none of these features are incorporated in Version 0.2; expect most of them by mid-summer 2000.

Why?

The KEDS parser-coder has been a successful prototype for an event data coder that has been used in at least five NSF-sponsored projects and produced data used in a number of refereed articles in political science. However, KEDS' design is almost ten years old and suffers several disadvantages: a number of poor decisions were made while constructing the sparse parser (e.g. ALL CAPS, throwing away articles); the parsed representation is inaccessible to the user; it is written in Pascal rather than C/C++; it was written to run in very little memory (with correspondingly brittle data structures) and it runs only on the Macintosh operating system. KEDS does not provide a good foundation for a more complex coder. More succinctly: "Plan to throw one away; you will, anyhow." (Fred Brooks, The Mythical Man-Month, Chapter 11).

The new version of the program is considerably faster than the current one (a key issue for those of you tired of coding only 85 events per second, we're sure!). Due to compiler constraints, KEDS is actually MC680x0 code, which is emulated on a PowerPC processor. Simply compiling to native code should substantially increase the speed, and in our experience, C/C++ compilers produce much faster code than the Pascal compiler.

Producing code that will run in Linux also opens the project to a much wider variety of machines. While the Macintosh did survive its near death experience and is once again widely available, Macs are rare in many research environments, and many users would find it useful to have an alternative. In contrast to KEDS, the interface of TABARI is kept distinct from the processing, and the program has been designed from the beginning to support multiple operating systems.

Comments on version 0.2 [April 2000]

This release of the TABARI source code is fully functional for both the Linux and Macintosh operating systems. The MS-Word file TABARI.0.2.changes.doc discusses the various differences between TABARI and KEDS, as well as providing validation information. Dale Thomas has done an initial conversion to Windows and has provided an executable file for this.

TABARI is completely compatible with KEDS .actors and .verbs dictionaries, though there are a few pathological "features" in KEDS pattern-matching that have not been duplicated in TABARI. Consult the TABARI.0.2.changes.doc file concerning the differences between the two programs (these are summarized below). TABARI project files are text, rather than customized Macintosh files -- see the documentation or the TABARI.demo.project files for an example.

The TABARI.Demo.project will run through the basic features of the program and is a good place to start. While the program has been extensively tested on a suite of test sentences, as well as run through a corpus of 26,000 real-world newswire leads, it has seen only limited use in actual coding. If you find bugs -- "When" you find bugs, as there will be bugs -- please report these and we will try to deal with them more or less promptly.

Comments on version 0.4 [March 2002]

New features and bug fixes in version 0.4

Discard codes now work on any phrase in the sentence; previously this worked only if the phrase was a source, target or event. New behavior is similar to that in KEDS.
"Time-shifting" -- phrases in the text such as "yesterday" or "next week" -- can be used to change the recorded date of the event.
"Issues" -- assignment of codes to specific strings in the text, as in conventional content analysis -- have been implemented. This facility can also be used to pull out numerical totals, and it is similar to that in KEDS.
"Attribution" -- the program can pull out a separate code indicating who reported the event.
Ability to divert records and information on how they were parsed to a "problems" file has been implemented; this is similar to KEDS.
Detection of passive voice is now automatic.
The .options file will accept a list of labels for the event codes; this is similar to the CODE: facility in KEDS.
Nouns can be specified explicitly, rather than only by null-coding verbs or actors.
Assorted bugs in the interface (e.g. inability to change the first pattern) have been fixed.
Assorted pattern matching bugs were fixed.
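
As a rough illustration of the time-shifting idea in the list above, the Python sketch below adjusts a story date when a time phrase appears in the text. This is not TABARI's code (TABARI is written in C++), and the phrase table, the day offsets and the function name are all assumptions made up for the example; the actual phrase list and dictionary syntax are not described here.

```python
from datetime import date, timedelta

# Hypothetical phrase table: phrase -> offset in days from the story date.
# These entries are illustrative assumptions, not TABARI's real list.
TIME_SHIFTS = {
    "yesterday": -1,
    "tomorrow": 1,
    "last week": -7,
    "next week": 7,
}


def shift_event_date(story_date: date, text: str) -> date:
    """Return the story date shifted by the first table phrase found in text.

    Phrases are checked in table order using a simple case-insensitive
    substring match; if none occurs, the story date is returned unchanged.
    """
    lowered = text.lower()
    for phrase, offset_days in TIME_SHIFTS.items():
        if phrase in lowered:
            return story_date + timedelta(days=offset_days)
    return story_date
```

With a story date of 15 March 2002, a lead containing "yesterday" would be recorded as 14 March 2002, and a lead with no time phrase would keep the story date.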

Comments on version 0.5 [May 2005]

New features and bug fixes in version 0.5

Code is now compiled using the open-source gcc compiler suite, so the same code base should work for both the Macintosh OS-X and Linux (and any other systems for which gcc is available).
The interface now uses the Unix "ncurses" library. As a consequence, TABARI can be run remotely from a server.
The core pattern matching routine "checkPattern" was completely re-written and should now be more reliable. The validation suite now contains more than 300 test cases.
A color-coded display of the parsed text can be accessed using a browser.

TABARI versus KEDS

Generally, TABARI works the same way that KEDS works, so you can use KEDS text, .actors and .verbs files without any changes. This compatibility will be maintained indefinitely, though it will eventually require a command in the .options file. The .project file is quite different -- see comments below -- and in this release, only a small number of the commands in the .options file have been implemented.

In order to maximize compatibility between operating systems, TABARI currently uses a very simple Unix-style keyboard-driven "dumb terminal" interface rather than the GUI interface of KEDS. This is currently designed to work on a screen that is 48 lines high by 80 characters wide. The input and display functions have generally been much more carefully isolated in TABARI than in KEDS, so adding a GUI interface in the future should be relatively easy, though it is not entirely clear that this is really necessary.

TABARI is not going to duplicate the CLASSES and RULES facility of KEDS. Similar capabilities will be provided, but the syntax will be different. More generally, TABARI will eventually move to completely eliminate KEDS's "stemming" and replace it with formally-defined word endings, and will make much more extensive use of categories of words. The University of Kansas General Research Fund has provided a grant for conversion of KEDS dictionaries to the new TABARI format during the 2000-2001 academic year, so we will eventually move completely to the new system.

The event data community has generally reached a consensus that machine-assisted coding is not a good idea, since it is neither transparent nor reproducible, so TABARI will not implement the machine-assisted coding facilities of KEDS. If you need these facilities, use KEDS. (Or, since this is open-source code, write the appropriate routines.)

[Actually, some such facilities will be implemented, specifically those that are useful in dealing with dictionary development such as the .problems files. But at the moment they aren't in the system.]

How Fast Is It?

Very fast. I timed TABARI on Levant texts for 1987-1990, about 26,000 sentences. On a 350 MHz Mac G3 and using the default ("None") autocoding mode that provides no screen feedback, TABARI codes 2000 events per second. It will initialize the dictionaries, code, and write events for these four files in about 15 seconds. This is roughly 70 times faster than KEDS on the same machine. On a 650 MHz Dell Pentium III, the speed is around 3000 events per second.

If we use the usual benchmark that human coders can reliably produce 40 events per day, TABARI running on a G3 does in a second what a human coder does in about three months. That is a wall-clock speedup of around a factor of 7.8 million.
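
The arithmetic behind that figure can be checked with a short sketch. The Python below is only an illustration: the 2000 events per second and 40 events per day come from the text, while reading "about three months" as 90 calendar days is an assumption that happens to reproduce the 7.8 million figure. The function names are made up for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def human_coding_days(events_per_second: float, human_events_per_day: float) -> float:
    """Days a human coder needs to match one second of machine output."""
    return events_per_second / human_events_per_day


def wall_clock_speedup(calendar_days: float) -> float:
    """Ratio of elapsed human time (in seconds) to one second of machine time."""
    return calendar_days * SECONDS_PER_DAY


# Figures from the text: 2000 events/second on the G3, 40 events/day by hand.
coding_days = human_coding_days(2000, 40)  # 50 days of actual coding

# Assumption: those 50 coding days are spread over about 90 calendar days
# (weekends and other interruptions), i.e. "about three months".
speedup = wall_clock_speedup(90)  # 7,776,000, or roughly 7.8 million
```

Counting only the 50 days of actual coding rather than elapsed calendar time would give 50 x 86,400 = 4,320,000, so the larger figure depends on treating the three months as wall-clock time.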

So, what accounts for the speed-up? Brilliant programming? Not much -- the programming in TABARI is a lot cleaner than KEDS, but generally the algorithms haven't changed much. Also, I've observed a similar factor-of-100 speedup in translating some other programs from Pascal to C. The increase is probably a combination of several factors: KEDS was compiled using a ten-year-old compiler that could not take advantage of speed-enhancing features of contemporary machines (e.g. caches); KEDS is running in 680x0 emulation rather than in native code; and even KEDS's minimal screen feedback may have slowed the program more than I thought. But C generally is written closer to the machine than Pascal, and both Metrowerks CodeWarrior and GNU g++ are known for producing very fast compiled code.

This speed means that we are close to being able to actively experiment with the implications of changing dictionaries, i.e. change a dictionary entry and then plot the results for a ten-year series within a minute. Even a small parallel cluster would reduce this to a few seconds.

Last update: 17 October 2001.

FAQ

TABARI??? "Augmented Replacement Instructions"???

Have you tried coming up with an original program name lately? EGRET ("Extendable Generator for the Reduction of Text"), MARMOT ("Management, Analysis, Reduction, Manipulation of Text"), KESTREL (don't ask...), SITTA ("System for Integrated Technical Text Analysis") were all taken. Besides, SITTA is excessively obscure.

If you think that Y2K was a crisis for the computer industry, wait for the upcoming "computer program moniker crisis".

"Augmented Replacement Instructions" is how the program works...

"SITTA" is obscure? What's a "TABARI"?

Abu Jafar Muhammad ibn Jarir al-Tabari (839-c.923). A distinguished Arab historian and political analyst. He also taught law, but we'll forgive that. SUNY Press carries his books in translation.

Open source?!? Why, you must be one of them peacenik, socialist, tofu-eating, crypto-vegetarian, tree-hugging yoga nuts!

I am not a tree-hugger.

There are trees in Kansas. At least this part of Kansas. Any self-respecting Kansas tree will be protected by one or more of the following:

three-inch thorns
poison ivy
greenbriar
ticks
chiggers
a cordon of copperheads and timber rattlesnakes

Kansas trees do not wish to be hugged.

The most sophisticated recent analysis I've seen of the economics of open source is from the inimitable Joel Spolsky, who has noted that open source is a rational strategy when viewed as a complementary technology that increases demand for another product. IBM, Oracle, and Hewlett-Packard aren't spending billions on Linux (and broadcasting this in full-page ads in the Economist) out of charity; they are doing it to make [lots of] money. Joel explains how.

Now, how does this affect us poor academics living in genteel poverty in the wilds of fly-over country? Look at our options. As international relations scholars we could

Write the 1,534th article proving statistically that the democratic peace works, or the 1,487th article proving statistically that it doesn't;
Write the 4,056th article creating a new paradigm for the study of international politics -- preferably with a "neo-" prefix and/or capItAl letters and pa(rent)heses somewhere in the title -- and hope that some poor graduate student will have to read it in order to pass prelims.

Or we could try doing something that hasn't been studied to death (yet): statistical crisis forecasting with event data. But the human-coded event data sets ended in 1978, and we want to analyze contemporary crises. So we create a complementary technology to bring down the cost of event data, increasing the demand for our skills, and get cool free trips to Washington, Europe, and College Station, Texas.

In short, we are self-centered capitalist maruppie scum.

So, feel better?

Are you going to produce a version for Windows?

You've got to be kidding...

However, because the interface and processing sections of TABARI will be much more clearly separated than those in KEDS -- and the program will be more carefully documented -- it should be considerably easier for someone acquainted with the ways of the Evil Empire to modify the code to work on Windows.

[Update: 10 April 2000. Okay, okay, okay. Since it has now been legally established that people who use Windows are in fact the unwitting victims of a ruthless and predatory monopoly, a certain level of compassion is in order.

A version of TABARI running in Windows -- well, a "DOS console" in Windows -- is almost ready to go. It's ugly but it works. If someone would like to produce a better-looking interface, I would welcome this.

Meanwhile Microsoft has hired Ralph Reed as their lobbyist (NYT 11 April 00), which suggests Microsoft figures they may need to appeal for help from really high places...]

[Update: 17 April 2000. Dale Thomas, who did his dissertation on conflict in Northern Ireland using KEDS, has a smoothly-working version of TABARI running in Windows, and is apparently intending to work further on the program. The current version runs the demonstration file without any problems. See the Read.Me. file in the .zip file.]

[Update: 30 March 2003. So, with the collapse of the dot.com boom, it is once again possible to hire good undergraduate programmers. And we've got a couple of great ones working for us at the moment, and their first task was converting TABARI.0.4 to Windows -- both the source code and compiled versions of the program are at the KEDS software page. We expect to continue maintaining a more or less current version of this in the future. "It's ugly but it works" has become the motto of this sub-project.]

PalmPilot?

You can't imagine how tempting this is... TABARI, like KEDS, uses only about 2Mb of memory, but that is a lot of memory for the Palm OS. So not at the moment, but possibly in the future.

"maruppie"?

Middle-aged rural professionals.

"SITTA"

Okay, good point -- on the web, nothing is obscure...

Original announcement: 18 November 1999