Dictionaries
CountryInfo
CountryInfo.txt is a general purpose file intended to facilitate natural language processing of news reports and political texts. It was originally developed to identify states for the text filtering system used in the development of MID4, then extended to incorporate CIA World Factbook and WordNet information for the development of TABARI dictionaries.
The file contains about 32,000 lines, covering about 240 countries and administrative units (e.g. American Samoa, Christmas Island, Hong Kong, Greenland). It is internally documented and almost, but not quite, XML: the major fields are delimited with tags of the form <tag>...</tag>, but the elements inside them are delimited with line feeds. Converting this to strict XML would be a relatively simple programming exercise for anyone who should be working with the file in the first place. The file is UTF-8 with Unix line feeds and will need to be converted if used on a Windows system.
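For the Windows case, the line-feed conversion can be done in a few lines of Python. This is only a minimal sketch: the function name and file paths are hypothetical, and it assumes the file fits comfortably in memory (at about 32,000 lines it should).

```python
def convert_to_crlf(src: str, dst: str) -> None:
    """Copy a UTF-8 text file, rewriting its line endings as CRLF."""
    # newline="" disables newline translation on read, so the
    # original endings are seen exactly as stored in the file.
    with open(src, "r", encoding="utf-8", newline="") as fin:
        text = fin.read()
    # Normalize any existing CRLF to LF first, so that a file that was
    # already converted does not end up with doubled carriage returns.
    text = text.replace("\r\n", "\n")
    # newline="\r\n" makes Python write every "\n" as "\r\n".
    with open(dst, "w", encoding="utf-8", newline="\r\n") as fout:
        fout.write(text)
```

A call such as convert_to_crlf("CountryInfo.txt", "CountryInfo.win.txt") would leave the original untouched and write the converted copy alongside it; both file names here are placeholders.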
Fields include
- Country name in English
- Adjectival forms and synonyms of the country name, including some non-English versions of the name
- ISO-3166 numeric, alpha2 and alpha3 codes, FIPS-10 code, IMF code, COW alpha and numeric codes
- Capital city
- Cities with populations over 1 million
- Regions and geographical features (WordNet meronyms)
- Leaders, 1960-2008 (rulers.org)
- Members of government, 2003-2010 (CIA World Leaders)
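As an illustration of the "almost XML" layout, here is a rough Python sketch of a reader for it. The tag names (Country, CountryName, Capital, MajorCities) and the sample record are invented for the example and are not taken from the file; check the internal documentation in CountryInfo.txt for the real tags. The sketch assumes each tag sits on its own line, that short fields may appear as <Tag>value</Tag> on a single line, and that multi-valued fields list one element per line between the opening and closing tags.

```python
import re
from typing import Dict, List, Optional

OPEN_CLOSE = re.compile(r"^<(\w+)>(.*)</\1>$")  # <Tag>value</Tag> on one line
OPEN = re.compile(r"^<(\w+)>$")                 # <Tag> alone on a line
CLOSE = re.compile(r"^</(\w+)>$")               # </Tag> alone on a line

def parse_records(text: str, record_tag: str = "Country") -> List[Dict[str, List[str]]]:
    """Collect each record as a dict mapping field tag -> list of elements."""
    records: List[Dict[str, List[str]]] = []
    record: Optional[Dict[str, List[str]]] = None
    field: Optional[str] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        m = OPEN_CLOSE.match(line)
        if m and record is not None and field is None:
            # Single-line field inside a record
            record.setdefault(m.group(1), []).append(m.group(2).strip())
            continue
        m = OPEN.match(line)
        if m:
            if m.group(1) == record_tag:
                record = {}                # start a new record
            elif record is not None:
                field = m.group(1)         # start a multi-line field
                record.setdefault(field, [])
            continue
        m = CLOSE.match(line)
        if m:
            if m.group(1) == record_tag and record is not None:
                records.append(record)     # record finished
                record = None
            elif m.group(1) == field:
                field = None               # multi-line field finished
            continue
        if record is not None and field is not None:
            record[field].append(line)     # one element per line
    return records

# Invented sample in the assumed layout (not real CountryInfo data)
SAMPLE = """\
<Country>
<CountryName>EXAMPLELAND</CountryName>
<Capital>
EXAMPLE_CITY
</Capital>
<MajorCities>
NORTH_TOWN
SOUTH_TOWN
</MajorCities>
</Country>
"""

records = parse_records(SAMPLE)
```

Once the records are in dictionaries like this, writing them back out as strict XML (for instance with xml.etree.ElementTree) is mostly a matter of wrapping each element line in its own tag.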
NOTE: I'm gradually transitioning to using GitHub as the primary repository, so you should check https://github.com/philip-schrodt/CountryInfo-1 for possibly more recent versions.
CountryInfo.140728.txt has been archived at dataverse.harvard.edu with the persistent URL http://dx.doi.org/10.7910/DVN/NBPRDW
Download CountryInfo.120116.txt (.zip) [Updated 16-Jan-2012]
Download CountryInfo.140728.txt (.zip) [Updated 28-Jul-2014]
Revision with TABARI-style date restrictions on countries that became independent in the post-1989 period (mostly former Soviet Union and former Yugoslavia), plus some additional small code corrections.
Download CountryInfo.120116.actors (.zip), a TABARI-formatted .actors file extracted from CountryInfo.120116.txt [Updated 6-Jan-2012]
Download CountryInfo Perl utilities (.zip).
translate.countryinfo.pl is the Perl program used to extract the .actors file; it would also be a good starting point for additional work with the CountryInfo.txt format, though it does not accommodate the date-restricted country codes. format.rulers.org.pl attempts to convert rulers.org entries into CountryInfo format; it takes input from either a file or text pasted in from the keyboard, and works with most standard entries. [Updated 6-Jan-2012]
Links to additional resources on names
The EU Joint Research Centre (those great folks who give us European Media Monitor) maintains a very large, multilingual list of political names at
http://langtech.jrc.ec.europa.eu/JRC-Names.html.
This web site also links to an excellent paper on the technical challenges involved with name detection and resolution.
From Vincent Arel-Bundock:
The countrycode package for R includes a set of regular expressions which can be used to match country names in character strings to country codes.
http://cran.r-project.org/web/packages/countrycode/index.html
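The general idea of regex-based matching can be sketched in Python as follows. This is not the countrycode package's code or its pattern set: the four patterns below are simplified illustrations written for this example, paired with ISO-3166 alpha3 codes, and the function name is hypothetical.

```python
import re
from typing import Optional

# Illustrative patterns only. Order matters: the more specific name
# (PRK) is tested before the one it contains ("Republic of Korea").
PATTERNS = [
    ("PRK", re.compile(
        r"\bnorth\s+korea\b|\bdemocratic\s+people'?s\s+republic\s+of\s+korea\b",
        re.IGNORECASE)),
    ("KOR", re.compile(r"\bsouth\s+korea\b|\brepublic\s+of\s+korea\b", re.IGNORECASE)),
    ("GBR", re.compile(r"\bunited\s+kingdom\b|\bgreat\s+britain\b", re.IGNORECASE)),
    ("USA", re.compile(r"\bunited\s+states\b", re.IGNORECASE)),
]

def match_country(text: str) -> Optional[str]:
    """Return the code of the first pattern found in text, or None."""
    for code, pattern in PATTERNS:
        if pattern.search(text):
            return code
    return None
```

For example, match_country("the Democratic People's Republic of Korea") gives "PRK" rather than "KOR" because of the ordering, which is the kind of ambiguity any pattern list of this sort has to handle.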
Also see the kountry Stata module by Rafal Raciborski: