Text Filters

This collection of filters aids in the retrieval and formatting of internet-based news leads, and helps compile the data into an input file to be read by various KEDS utilities and TABARI. The processes involved in this task include downloading the lead sentences from a web-based source, ordering the information chronologically, and formatting the specific source codes and identifiers for interpretation by TABARI. The tasks performed by each individual filter are described below.

The filters are listed in reverse chronological order, with the programs that we have used in our most recent research listed first. Due to changes in data service (NEXIS, Factiva) formats over time, programs that are more than a couple of years old will probably not work without modification, but we are leaving the older code available since it might provide templates for writing other filters. That said, Python and Perl are so vastly superior to C, C++, and Pascal for text processing that the more recent filters are generally the only versions worth bothering with unless you are dealing with archived downloads.

Advisory: The Perl and Python programs should work on Macintosh, Unix, Linux, and Windows operating systems. However, make sure that you have converted the source code and any input files to the appropriate operating system file format before running them: if a program appears to be behaving erratically, it is quite likely due to a file incompatibility (e.g. a Windows program trying to read a Unix file).
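
The file-format conversion mentioned in the advisory usually comes down to line endings. The following is a minimal sketch of one way to normalize them in Python; it is not taken from any of the filters, and the function names to_unix_newlines and convert_file are hypothetical.

```python
def to_unix_newlines(data: bytes) -> bytes:
    """Convert Windows (CRLF) and old Macintosh (CR) line endings to Unix (LF)."""
    # Replace CRLF first so that a CRLF pair does not become two newlines.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(src_path: str, dst_path: str) -> None:
    """Copy src_path to dst_path with Unix line endings (hypothetical helper)."""
    # Binary mode keeps Python from doing any newline translation of its own.
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(to_unix_newlines(data))
```

Working on bytes rather than decoded text avoids having to guess the character encoding of a download before fixing its line endings.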

LexisNexis.filter.Feb14.py (Python)

This is a Python filter for formatting all of the sentences (with some length restrictions) from Lexis-Nexis downloads into the KEDS/TABARI format with 2-digit years. This code looks for a specific set of sources but is easily generalized. The program also contains a fairly compact pattern-based sentence segmenter as an alternative to the NLTK 'punkt' machine-learning segmenter, but punkt could easily be substituted.
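
To illustrate what a pattern-based sentence segmenter with length restrictions can look like, here is a rough sketch. It is not the segmenter in LexisNexis.filter.Feb14.py: the abbreviation list, the boundary pattern, and the min_len and max_len parameters are all assumptions made for the example.

```python
import re

# Hypothetical abbreviation list; a working filter would need a much longer one.
ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "gen", "col", "sen", "rep", "u.s", "u.n"}

# Candidate boundary: '.', '!' or '?' followed by whitespace and then
# an uppercase letter or an opening quotation mark.
BOUNDARY = re.compile(r'([.!?])\s+(?=["\'A-Z])')


def split_sentences(text, min_len=0, max_len=10_000):
    """Split text into sentences, dropping those outside the length limits."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        end = match.end(1)  # position just after the punctuation mark
        words = text[start:end - 1].split()
        last_word = words[-1].lower() if words else ""
        # A period after a known abbreviation is not treated as a boundary.
        if match.group(1) == "." and last_word in ABBREVIATIONS:
            continue
        sentences.append(text[start:end].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return [s for s in sentences if min_len <= len(s) <= max_len]
```

For example, split_sentences("Gen. Smith met U.S. officials on Monday. Talks will resume Tuesday.") yields two sentences, because the periods after "Gen" and "U.S" are not accepted as boundaries.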

Last update: 3 August 2012

Download LexisNexis.filter.Feb14.py source code (.zip)

Factiva.filter.Aug2012.py (Python)

This is a completely new filter -- formatting lead sentences into the KEDS/TABARI format with 4-digit years -- that handles Reuters from Factiva using the save-to-disk option. Detailed instructions for the formatting and command line options (yes, we now have command line options...) are in the header. It uses the "Full Article/Report plus indexing" option, which is particularly easy to parse (thank you, Factiva...now, about those CAPTCHAs...).

This is likely to be the first of a series of programs in Python -- Python seems to work as well as Perl for the purpose, and yes, it is much easier to read, and Python programmers are generally easier to find than Perl programmers.

Last update: 3 August 2012

Download Factiva.filter.Aug2012.py source code (.zip)

Factiva.RAN.v2.pl (Perl)

This is a version of Factiva.Reutlead.filter.pl that handles Reuters, Agence France Presse and New York Times stories downloaded from Factiva using the email option, and formats the lead sentences of those stories into the KEDS/TABARI format. It works with the Factiva format as of November 2011. Note that this produces 4-digit rather than 2-digit dates. Modifications by Natalie M. C. Odilo (odilo.natalie@gmail.com) and Marsha Sowell.

Last update: 15 November 2011

Download Factiva.RAN.v2.pl source code (.zip)

NewNexisFormat.pl (Perl)

This Perl program reformats stories emailed from the LexisNexis Academic Universe system into the TABARI format. It replaces the older "nexispider.pl", which did the downloading automatically but no longer works due to changes in the NEXIS web site. The program is currently set to process only Agence France Presse records but should be easy to modify for other sources. Because LexisNexis downloads are sent in Windows file format, the program automatically converts these to Unix format.

Update December 2011: This still works for current stories in both the emailed and downloaded LN format with the 'text' option here at Penn State; we have seen minor variations on that format at some other institutions but at the moment the program works with those as well.

It works for most but not all stories from the 1990s: we have found some files from the late (but not the early...go figure...it's LN....) 1990s where there are format incompatibilities; see the internal documentation. We haven't done the systematic research to see exactly where these occur. If they are regular, it will be fairly simple to work around them (again, see the internal documentation), and if you work this out, please send us a copy.

Last update: 20 December 2011

"Read.Me" file that explains how to do the LexisNexis downloading and formating.

Zipped file containing NewNexisFormat.pl, nexisreverse.pl, LNAFP.seqsort.pl and NewNexisFilter.readme.txt.

Factiva.Reutlead.filter.pl (Perl)

This Perl program processes a set of Reuters stories downloaded from Factiva using the email option, and formats the lead sentences of those stories into the KEDS/TABARI format. The program takes as input the names of the output files for the formatted leads and for the dates, followed by a list of the files containing the stories.

Update November 2011: This no longer works with the current Factiva format; see the program posted above.

Last update: 16 July 2008

Download Factiva.Reutlead.filter.1b1.pl source code (.zip)

nexisreverse.pl (Perl)

This Perl program reverses the order of stories that were downloaded from NEXIS using the nxdnldformat.pl or nexispider.pl programs (more generally, it will reverse the order of any "KEDS-formatted" files). The program solves the problem of NEXIS downloading stories in reverse chronological order, while event data coding usually needs records in chronological order. The program also combines multiple downloads into a single file, and eliminates stories that have identical first lines. The current version gets only lead sentences, but this is easily changed.
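
The reverse, merge, and de-duplicate steps described above can be sketched in Python as follows. This is an illustration under stated assumptions, not a port of nexisreverse.pl: it assumes records are blocks of lines separated by blank lines, that each record starts with a header line followed by the story text, that "first line" means the first line of the story text, and that downloads are passed newest first. The function names and the header layout in the example are hypothetical.

```python
def read_records(text):
    """Split text into records: blocks of non-blank lines separated by blank lines."""
    records = []
    current = []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            records.append(current)
            current = []
    if current:
        records.append(current)
    return records


def reverse_and_merge(downloads):
    """Combine downloads (newest first, each in reverse chronological order)
    into one chronologically ordered text without duplicate stories."""
    records = []
    for text in downloads:
        records.extend(read_records(text))
    records.reverse()  # oldest record now comes first

    seen = set()
    unique = []
    for record in records:
        # Assumption: line 0 is the header and line 1 is the first line of text.
        key = record[1] if len(record) > 1 else record[0]
        if key in seen:
            continue  # keep only the earliest copy of a story
        seen.add(key)
        unique.append(record)
    return "\n\n".join("\n".join(r) for r in unique) + "\n"
```

A short usage example with made-up records:

```python
newer = "030103 REUT-0001\nThird story lead.\n\n030102 REUT-0002\nSecond story lead.\n"
older = "030102 REUT-0003\nSecond story lead.\n\n030101 REUT-0004\nFirst story lead.\n"
print(reverse_and_merge([newer, older]))
```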

Last update: 26 January 2003

Download nexisreverse.pl source code (.zip)

ActorFilter

This program locates potential new actor names in a file of KEDS input records by looking for strings of consecutive capitalized words and comparing these against an existing set of actor names and a list of stop words. It produces a keyword-in-context index of the new actors sorted by frequency. Documentation in .pdf and MS-Word format is included. The beta version of the program was available only for the Macintosh. The Java version was created in March 2001; both are available here.
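
The general approach can be illustrated with a short Python sketch. This is not the ActorFilter code: the regular expression, the way stop words are trimmed from the ends of a phrase, and the uppercase, space-separated form of the known actor names are all assumptions, and the keyword-in-context output is left out for brevity.

```python
import re
from collections import Counter

# A run of one or more consecutive capitalized words, e.g. "Yasser Arafat".
CAP_RUN = re.compile(r"\b[A-Z][A-Za-z]+(?:\s+[A-Z][A-Za-z]+)*")


def candidate_actors(text, known_actors, stop_words):
    """Return (phrase, count) pairs for capitalized phrases that are not
    already known actors, most frequent first.

    known_actors and stop_words are sets of uppercase strings (an assumption
    of this sketch, not a documented ActorFilter format).
    """
    counts = Counter()
    for match in CAP_RUN.finditer(text):
        words = match.group(0).split()
        # Trim stop words from both ends, e.g. a sentence-initial "The".
        while words and words[0].upper() in stop_words:
            words.pop(0)
        while words and words[-1].upper() in stop_words:
            words.pop()
        if not words:
            continue
        phrase = " ".join(words).upper()
        if phrase in known_actors:
            continue
        counts[phrase] += 1
    return counts.most_common()
```

With known_actors={"CAIRO"} and stop_words={"THE", "PRESIDENT"}, the text "The President met Yasser Arafat in Cairo. Yasser Arafat later met George Habash." produces YASSER ARAFAT with a count of 2, followed by GEORGE HABASH with a count of 1.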

Beta version uploaded: 12 October 1997

ActorFilter program and manual (.sit)

Java version 2.03 updated: 13 June 2001

ActorFilter program (Java version)