INSTRUCTIONS FOR RUNNING NewNexisFormat.pl

DOWNLOADING FROM NEXIS

Run the appropriate search in "LexisNexis Academic -- News Sources", and then use either

-- "Email Documents" (the quaint little icon with a stamped letter, from the old days when
    we used dead trees to communicate) option or
 
-- "Download Documents" (the even more quaint little icon of a removable disk...most of
    you won't remember those...we used to keep them next to the counter-sorter, and they
    make nice drink coasters...I digress...). The file will be downloaded with Windows line
    endings (CRLF), even on a Unix system. This program works with those, but keep this in
    mind if you are using other programs.
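
If some other program does need Unix line endings, a converted copy can be made first.
The following is a minimal sketch, assuming a standard tr is available; the function
name crlf_to_lf and the output file name are illustrative and not part of this
distribution.

```shell
# Hypothetical helper: write an LF-only copy of a CRLF download.
# Usage: crlf_to_lf input.TXT output.txt
crlf_to_lf() {
    # tr -d '\r' deletes every carriage return, leaving plain LF line endings
    tr -d '\r' < "$1" > "$2"
}
```

For example

	crlf_to_lf Agence_France_Presse_-_English2007-09-14_16-31.TXT afp_unix.txt

The original download is left untouched, so it can still be given to the filter.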

At the moment, at Penn State, these two options yield the same format. Emailing feels 
slightly faster, but I think the two modes take about the same time.

Use "Select Items" in the "Document Range" box to download a maximum of 500 stories at 
a time. In addition, the program assumes the following options have been selected for 
downloading:

  Format: text 
  Document View: Full document
  Font and highlighting should be irrelevant in the 'text' format

You can speed up processing by excluding sports stories and news summaries from the
download: the program skips these anyway.

Note that the program is currently set up to filter only Agence France Presse stories.

RUNNING THE FILTER

1. Put all of the Nexis files you intend to filter and the following two programs in a folder:

	NewNexisFormat.pl
	nexisreverse.pl

2. In the Terminal (command line), move to that folder.

3. Assuming your Nexis downloads have a file name of the form

	Agence_France_Presse_-_English2007-09-14_16-31.TXT

enter the command

	ls Agence_Fr* > format.files

Alternatively, just use the command

	ls > format.files

and manually delete the entries for everything that is not a downloaded file (the list
will include the two programs and format.files itself).

4. Enter the command

	perl NewNexisFormat.pl <prefix>

where <prefix> is the prefix for the formatted file. For example

	perl NewNexisFormat.pl AFPLVT

5. The program should run, with the dates and headlines of the stories scrolling past
as they are processed. If the program crashes or stops responding, the last story
displayed (or the one following it) is probably the cause, so delete that story from
the download file and run the program again.

6. When the program has finished, enter the command

	ls <prefix>* > filelist

where <prefix> is the prefix you entered earlier. For example

	ls AFPLVT* > filelist

7. Enter the command

	perl nexisreverse.pl

8. The resulting TABARI input is in the file

	reverse.output

which can be renamed at this point, and a summary of the number of entries in this file 
can be found in

	filelist.summary

Note that as currently configured, nexisreverse.pl only gets the first sentence of the story.

9. The command

	perl LNAFP.seqsort.pl reverse.output

will do a date sort on the records if they are out of order (LNAFP.seqsort.pl must
also be in the folder). Sorted output is in seqsort.reverse.output.
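
Steps 3 through 9 can be chained into a single shell function. This is only a sketch
under some assumptions: the downloads match Agence_Fr*, all three Perl programs are in
the current folder, and the date sort in step 9 is always wanted. The function name
run_nexis_filter and the PERL override variable are illustrative and not part of this
distribution.

```shell
# Sketch of steps 3-9 as one function. Run it from the folder that holds the
# Nexis downloads and the Perl programs.
# Usage: run_nexis_filter <prefix>
# Setting PERL=echo previews the perl commands instead of running them.
run_nexis_filter() {
    prefix=$1
    perl_cmd=${PERL:-perl}

    ls Agence_Fr* > format.files || return 1              # step 3
    "$perl_cmd" NewNexisFormat.pl "$prefix" || return 1   # steps 4-5
    ls "$prefix"* > filelist || return 1                  # step 6
    "$perl_cmd" nexisreverse.pl || return 1               # steps 7-8
    "$perl_cmd" LNAFP.seqsort.pl reverse.output           # step 9 (date sort)
}
```

For example

	run_nexis_filter AFPLVT

Each step stops the function if it fails, so a crash in NewNexisFormat.pl (see step 5)
does not leave a misleading reverse.output behind from the later steps.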

NOTES:

1. This is still working at Penn State. "Still" in the sense that the last major change
   we had to make to the program was around 2006. However, there are at least some minor
   differences in the downloading format across universities. The program appears to
   be robust against these at the moment, but if you run into a situation where it
   doesn't seem to work, please let me know.
   
2. The program assumes that the story begins two lines after a line containing the
   DATELINE: field. This field is present in most but not all downloads. We haven't
   done an exhaustive search to determine when it is missing, but at least some
   periods in the late 1990s (but not the early 1990s...) do not consistently include
   it. If the program hits the end of the file while searching for DATELINE:,
   the remainder of the file is not processed and a WARNING is given when the program
   has finished, listing the skipped files.
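
   A rough pre-check for note 2 is to look for downloads that contain no DATELINE:
   line at all before running the filter. The sketch below assumes a standard grep;
   the function name missing_dateline is illustrative and not part of this
   distribution.

```shell
# Hypothetical pre-check: print the names of files with no DATELINE: line.
# Usage: missing_dateline file1 [file2 ...]
missing_dateline() {
    for f in "$@"; do
        # grep -q exits non-zero when the pattern never appears in the file
        grep -q 'DATELINE:' "$f" || printf '%s\n' "$f"
    done
}
```

   For example

	missing_dateline Agence_Fr*

   This only catches files where the field never appears. A file that passes can still
   contain individual stories without DATELINE:, so the program's own WARNING remains
   the final word.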


Programmer: Philip A. Schrodt (schrodt.parusanalytics.com)
Last Update: 19 December 2011