INSTRUCTIONS FOR RUNNING NewNexisFilter.pl

DOWNLOADING FROM NEXIS

Run the appropriate search in "LexisNexis Academic -- News Sources", and then use either

-- the "Email Documents" option (the quaint little icon with a stamped letter, from the old days when we used dead trees to communicate), or

-- the "Download Documents" option (the even more quaint little icon of a removable disk...most of you won't remember those...we used to keep them next to the counter-sorter, and they make nice drink coasters...I digress...).

The file will be downloaded with Windows line endings (CRLF), even on a Unix system. This program works with those, but keep this in mind if you are using other programs.

At the moment, at Penn State, these two options yield the same format. Emailing feels slightly faster, but I think the two modes take about the same time.

Use "Select Items" in the "Document Range" box to download a maximum of 500 stories at a time. In addition, the program assumes the following options have been selected for downloading:

    Format: text
    Document View: Full document

Font and highlighting should be irrelevant in the 'text' format.

You can increase the efficiency by eliminating sports stories and news summaries: these will be skipped anyway.

Note that the program is currently set up only to filter for Agence France Presse stories.

RUNNING THE FILTER

1. Put all of the Nexis files you intend to filter and the following two programs in a folder:

       NewNexisFormat.pl
       nexisreverse.pl

2. In the Terminal (command line), move to that folder.

3. Assuming your Nexis downloads have a file name of the form

       Agence_France_Presse_-_English2007-09-14_16-31.TXT

   enter the command

       ls Agence_Fr* > format.files

   Alternatively, just use the command

       ls > format.files

   and manually edit out all of the files that are not downloaded files.

4. Enter the command

       perl NewNexisFormat.pl <prefix>

   where <prefix> is the prefix for the formatted file. For example

       perl NewNexisFormat.pl AFPLVT

5. The program should run, with the dates and headlines of the various stories scrolling past as they are processed. If the program stops working -- crashes or stops responding -- the last story displayed (or the one following it) is probably the cause, so just eliminate that story and try running the program again.

6. When the program has finished, enter the command

       ls <prefix>* > filelist

   where <prefix> is the prefix you entered earlier. For example

       ls AFPLVT* > filelist

7. Enter the command

       perl nexisreverse.pl

8. The resulting TABARI input is in the file

       reverse.output

   which can be renamed at this point, and a summary of the number of entries in this file can be found in

       filelist.summary

   Note that as currently configured, nexisreverse.pl only gets the first sentence of the story.

9. The command

       perl LNAFP.seqsort.pl reverse.output

   will do a date sort on the records if they are out of order. Sorted output is in

       seqsort.reverse.output

NOTES:

1. This is still working at Penn State. "Still" in the sense that the last major change we had to make to the program was around 2006. However, there are at least some minor differences in the downloading format across universities. The program appears to be robust against these at the moment, but if you run into a situation where it doesn't seem to work, please let me know.

2. The program assumes that the story begins two lines following a line containing

       DATELINE:

   This is present in most but not all downloads. At the moment we haven't done an exhaustive search to determine when it is missing, but at least some periods in the late 1990s (but not the early 1990s...) do not consistently include that field. If the program hits the end of the file while searching for DATELINE:, the remainder of the file is not processed and a WARNING is given when the program has finished, listing the skipped files.

Programmer: Philip A. Schrodt (schrodt.parusanalytics.com)
Last Update: 19 December 2011
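
CHECKING FOR DATELINE: BEFORE RUNNING (SKETCH)

The DATELINE: assumption described in the notes can be checked by hand before running the filter. What follows is a minimal shell sketch and is not part of the distributed programs; the function name list_missing_dateline is hypothetical. It prints the name of each file that contains no line with DATELINE: at all. It will not catch a file in which only some of the stories lack the field, so treat it as a rough first pass rather than a guarantee.

```shell
#!/bin/sh
# Hypothetical helper, not part of the NewNexisFormat.pl distribution.
# Prints the name of each given file that has no line containing "DATELINE:".
list_missing_dateline() {
    for f in "$@"; do
        # Skip anything that is not a regular file (e.g. an unmatched glob).
        [ -f "$f" ] || continue
        # grep -q prints nothing and exits non-zero when no line matches.
        if ! grep -q 'DATELINE:' "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

Assuming the download names used in the steps above, it could be called from the folder holding the Nexis files as

    list_missing_dateline Agence_Fr*

The CRLF line endings in the downloads should not matter here, since the search only looks for the text DATELINE: anywhere in a line.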