GAZETTEER DOCUMENTATION
19 OCT 1992, for gazetteer version 4.0

1. INTRODUCTION:

The TIPSTER Gazetteer is a compilation and reformulation of gazetteer information from a number of sources, primarily CIA's RWDB2, the US Geological Survey, the CIA World Factbook (version 11), and the Board of Geographic Names namelists. Coverage varies drastically, based on the degree of completion for the region or attribute. Version 4.0 has over 240,000 place names. Depending on the source, multiple entries may exist for the same geographic entity, under various spellings.

The Gazetteer is not intended to be exhaustive. Texts in the TIPSTER task will inevitably continue to contain place names not included in the Gazetteer, either because of alternate spelling or because they were not present in the sources.

2. CONTENTS:

The Gazetteer contains entries for the following named geographic entities:

     o Continents
     o Countries
     o Provinces, including states and republics (of the former USSR); in some cases, of multiple order (corresponding to states and counties, for example)
     o Islands
     o Island groups
     o Cities (including capitals)
     o Ports
     o Airports

Additional types of named geographic entities may be included in subsequent versions of the Gazetteer (possibly including lakes, rivers, mountains).

3. FORMAT:

A Gazetteer entry consists of one or more place names, each followed by an indicator in parentheses. Indicators are listed below:

     o CONTINENT
     o COUNTRY
     o PROVINCE
     o ISLAND
     o ISLAND-GROUP
     o CITY
     o PORT
     o AIRPORT

If the Gazetteer is expanded to include additional types of entities, additional indicators may be necessary.

In a Gazetteer entry, there may be up to three place-name-indicator pairs, ordered from the most specific entity to the most general. For example,

     Fuga (AIRPORT) Cagayan (PROVINCE) Philippines (COUNTRY)

indicates that the airport is in the province indicated, which is in the Philippines.

An indicator may be followed by a ranking.
The ranking is a positive integer indicating the relative rank (order) of the geographic entity among entities with the same indicator. For example, an (ISLAND 1) is bigger or more important than an (ISLAND 8). The lists in APPENDIX A: RANKINGS give an indication of the entity class represented by each rank. The criteria for ranking some types of entities are not known, whereas for others they are obvious. For example, it isn't clear what constitutes a major vs. a minor island; in the case of provinces, however, rank 2 provinces are (typically) contained within rank 1 provinces (as is the case with counties and states in the US).

4. NAME REPRESENTATION IN THE GAZETTEER:

Names in the Gazetteer are in BGN (the Board of Geographic Names of the US Government) format, in conventional format, or in any of a number of alternate spellings.

A number of names in the Gazetteer have an indication of a diacritic (an @ followed by a letter or number); APPENDIX B: DIACRITICS lists the meanings of these codes. In general use in TIPSTER and other tasks, these diacritics are typically not used. For this reason, for each place name with a diacritic code, there is a duplicate entry with the @ and the following character removed. It is not clear whether the diacritics account for minimal pairs in the Gazetteer.

No effort was made to replace diacritics with alphabetic (near-)equivalents; for example, the capital of Japan is listed as To@Vkyo and as Tokyo, but not as Tookyo or any other alphabetic variant.

Where the second or third name in an entry contains a diacritic, the diacritic is stripped out, and no duplicate record is available with the full (diacritic included) form of the second or third name.
For example, the Cambodian province Takev has the following entries:

     Take@Jv (PROVINCE) Cambodia (COUNTRY)
     Takev (PROVINCE) Cambodia (COUNTRY)

and the Cambodian city Angk Tasaom has the following two entries:

     A@Jngk Tasao@Jm (CITY 4) Takev (PROVINCE) Cambodia (COUNTRY)
     Angk Tasaom (CITY 4) Takev (PROVINCE) Cambodia (COUNTRY)

but not

     A@Jngk Tasao@Jm (CITY 4) Take@Jv (PROVINCE) Cambodia (COUNTRY)

5. GEOPOLITICAL CHANGES:

The issue arose of reporting LOCATIONs in countries which have undergone political changes, such as the former USSR and the Germanies. Gazetteer version 3.0 was inconsistent in its treatment (i.e., GERMANY (1991)). For the former USSR and the Germanies, separate PRE- and POST-1991 forms are therefore given in many cases. Each entry in USSR or Germany is identified as PRE or POST.

Use 1 JAN 1991 as the cut-off date for PRE vs. POST: for any article dated before 1 JAN 1991, use the PRE 1991 form; for any article dated on or after 1 JAN 1991, use the POST 1991 form. If a PRE form is called for but does not exist, use the POST form if listed (or vice versa); this situation arose because some of the sources were PRE and some POST, and some information from one was not listed in the other.

For example, Moscow in an article dated 20 OCT 1989 would be reported as

     Moscow (CITY 1) R.S.F.S.R. (PROVINCE) USSR (COUNTRY) (PRE 1991)

but Moscow in an article from 1992 would be reported as

     Moscow (CITY 1) Russian Federation (COUNTRY) (POST 1991)

Since the United States Government never recognized the illegal annexation of Estonia, Latvia, and Lithuania to the USSR, they are not listed as part of the USSR prior to 1991.

6. RETRIEVAL OF ENTRIES FOR TIPSTER:

The TIPSTER task requires the identification of LOCATIONs for a number of slots. Below are guidelines for indicating LOCATIONs:

o If the Gazetteer doesn't have an entry for Upper Slobovia, the slot fill should be "Upper Slobovia" (UNKNOWN). Note the quotes around the name, because it is a string, not a set fill item. Note that capitalization is irrelevant for comparison.

o Below is a list of "aliases". These are exceptions to (or macros for) the "exact match" requirement. For example, if an article discusses a LOCATION as "in the US", you may use US as an allowed alias for United States in the look-up or matching process, and so report United States (COUNTRY). Always use the actual Gazetteer listing form in reporting, but in performing the look-up or match, treat the forms in the right column as equivalent to the left-hand forms.

     Table 1: Table of Acceptable Aliases

     ENTRY FORM                         ACCEPTABLE MATCH
     X, Republic of                     X
                                        Republic of X
     United States                      US
                                        USA
     Germany, Federal Republic of       West Germany (be sure to deal with PRE/POST)
     Germany Democratic Republic        East Germany (be sure to deal with PRE/POST)
     United States                      America
     USSR                               Soviet Union (be sure to deal with PRE/POST)
     Korea, Republic of                 South Korea
     Korea, Democratic People's Rep.    North Korea
     X the Y                            X Y (e.g., Gambia allowed for Gambia, The)
     Yemen (Aden)                       South Yemen
     Yemen (Sanaa)                      North Yemen
     X                                  (X with imbedded punctuation marks, such as U.S.S.R. for USSR)
     X                                  (X with imbedded punctuation marks left out, as in RSFSR for R.S.F.S.R.)
     United Kingdom                     Great Britain

o If a document reports the location as "a suburb of Tokyo", then just use Tokyo (CITY 1) Japan (COUNTRY), even if the suburb is named.

o If multiple entries are retrieved when searching for a name, and the type of geographic entity is known (e.g., city, airport, or country), then use that information to select an entry. For example, the following entries (among others) are retrieved when searching for Alexandria:

     Alexandria (CITY 4) United States (COUNTRY)
     Alexandria (PORT 3) United States (COUNTRY)
     Alexandria (PROVINCE) South Africa (COUNTRY)

If it is known that the Alexandria in question is a port (and all other criteria, such as those identifying the country, agree), then select the port entry.

o If multiple entries are retrieved when searching for a name, and more specific information is known, make use of that more specific information to identify the correct entry. For example, if a LOCATION is known to be Alexandria, a search on the Gazetteer may return the following entries (and others):

     Alexandria (CITY 4) Egypt (COUNTRY)
     Alexandria (CITY 4) Greece (COUNTRY)
     Alexandria (CITY 4) Northern Territory (PROVINCE) Australia (COUNTRY)

If the text indicates that the city is in Greece, for example, then select that entry.

o If multiple entries are retrieved for a name (and country or other discriminating criteria have been applied), and there are no indications of the appropriate geographic entity type, then use the following order of precedence (highest to lowest): CONTINENT, COUNTRY, CITY, ISLAND-GROUP, ISLAND, PROVINCE, PORT, AIRPORT.

o If multiple entries are retrieved for a name, and multiple entries are still available after identifying the appropriate geographic entity type, country, etc., you may use the rank to identify the more significant of the entries. For example, in a discussion of Anji in China, the following entries are returned:

     Anji (CITY 4) China (COUNTRY)
     Anji (CITY 3) China (COUNTRY)

Use the entity with the numerically lower rank (i.e., the more important or bigger one); here, that is the (CITY 3) entry. If one of the entries has no rank, then assume the least important rank for that entry.

o Use linguistic cues to discriminate, where possible, among geographic entity types. For example, "on" would precede an island name, whereas "in" would precede a city name; "Ajer Island" should result in selecting Ajer (ISLAND) Indonesia (COUNTRY) from the Gazetteer, whereas "Ajer Islands" should result in selecting Ajer (ISLAND-GROUP) Indonesia (COUNTRY).

o Match names exactly. Do not attempt to coerce a name into a spelling found in the Gazetteer. If it does not match exactly, then consider the entry as not in the Gazetteer.
The only inexactnesses allowed in comparison are:

     1) capitalization: do the comparison disregarding case;
     2) English-language geographic entity type designators, such as Island or Province, do not need to be matched; and
     3) forms in the Table of Acceptable Aliases match the corresponding forms from the table.

o If there is no indication of what country a city (for example) is in, and the Gazetteer returns multiple entries, then attempt to identify the appropriate country from references to countries in the rest of the document being processed. This is a (minor) inference.

o If a number of candidates are still possible after applying the above guidelines, then represent the place as UNKNOWN.

APPENDIX A: RANKINGS

Rankings for CITY:

     1. National capital
     2. Territorial capital
     3. Administration center
     4. Populated place

Rankings for AIRPORT:

     The ranking is unknown: the source documentation lists 4 ranks (International; Major; Minor, but with one runway of at least 4000 feet; and Minor), but ranks between 1 and 8 are found in the source.

Rankings for ISLAND:

     1. Major islands that should appear on all maps
     2. Additional major islands
     3. Moderately important islands
     4. Additional islands
     5. Minor islands
     6. Very small islands
     8. Reefs

Rankings for PROVINCE:

     1. First order administrative subdivision
     2. Second order administrative subdivision

Rankings for PORT:

     1. Major
     2. Medium
     3. Small
     4. Very small

APPENDIX B: DIACRITICS

The @ symbol followed by a number indicates a special character to be placed in the character stream.

     @1 = AE ligature
     @2 = ae ligature
     @3 = OE ligature
     @4 = oe ligature
     @5 = barred D
     @6 = barred d
     @7 = undotted i

The @ symbol followed by a letter indicates a diacritic to be placed on the previous letter.

     @A = acute
     @B = acute over breve
     @C = acute over circumflex
     @D = barred O
     @E = breve
     @F = chandrabindu
     @G = cedilla
     @H = circle above
     @I = circle below
     @J = circumflex
     @K = dieresis
     @L = dieresis below
     @M = dot above
     @N = dot below
     @O = double acute
     @P = grave
     @Q = grave over breve
     @R = grave over circumflex
     @S = high comma off center
     @T = ligature left
     @U = ligature right
     @V = macron
     @W = pseudo question mark
     @X = pseudo question mark over breve
     @Y = pseudo question mark over circumflex
     @Z = right hook
     @a = slashed L
     @b = tilde
     @c = tilde over breve
     @d = tilde over circumflex
     @e = wedge
     @f = macron below
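
The entry format of Section 3, the @ codes listed above, and the precedence and rank guidelines of Section 6 can be sketched in Python as follows. This is an illustrative sketch only and not part of the Gazetteer distribution. The function names (strip_codes, parse_entry, select_entry) and the regular expressions are assumptions made for this example. The sketch also assumes that each entry sits on one line, that indicators are written exactly as in Section 3, and that country and other discriminating criteria have already been applied before select_entry is called.

```python
import re

# Assumption: every code is "@" plus exactly one letter or digit (Appendix B).
DIACRITIC_CODE = re.compile(r"@[A-Za-z0-9]")

# Assumption: one pair is "name (INDICATOR)" or "name (INDICATOR rank)".
PAIR = re.compile(r"\s*(.+?)\s*\(([A-Z-]+)(?:\s+(\d+))?\)")

# Order of precedence from Section 6, highest first.
PRECEDENCE = ["CONTINENT", "COUNTRY", "CITY", "ISLAND-GROUP",
              "ISLAND", "PROVINCE", "PORT", "AIRPORT"]


def strip_codes(name):
    """Remove each @ code (the @ and the character after it) from a name."""
    return DIACRITIC_CODE.sub("", name)


def parse_entry(line):
    """Split an entry into (name, indicator, rank) tuples; rank may be None.

    Parenthesized tags that are not indicators, such as (PRE 1991),
    are ignored by this sketch.
    """
    pairs = []
    for name, indicator, rank in PAIR.findall(line):
        if indicator in PRECEDENCE:
            pairs.append((name, indicator, int(rank) if rank else None))
    return pairs


def select_entry(candidates):
    """Choose among parsed entries by type precedence, then by rank.

    A missing rank counts as the least important rank. Returns None
    (that is, UNKNOWN) if no single best candidate remains.
    """
    def key(entry):
        _, indicator, rank = entry[0]
        missing = float("inf")
        return (PRECEDENCE.index(indicator), missing if rank is None else rank)

    if not candidates:
        return None
    best = min(key(entry) for entry in candidates)
    winners = [entry for entry in candidates if key(entry) == best]
    return winners[0] if len(winners) == 1 else None


if __name__ == "__main__":
    print(strip_codes("To@Vkyo"))
    print(parse_entry("Fuga (AIRPORT) Cagayan (PROVINCE) Philippines (COUNTRY)"))
    anji = [parse_entry("Anji (CITY 4) China (COUNTRY)"),
            parse_entry("Anji (CITY 3) China (COUNTRY)")]
    print(select_entry(anji))
```

Returning None when two candidates tie mirrors the final guideline of Section 6, under which a place that is still ambiguous is reported as UNKNOWN. Case-insensitive comparison and the alias table are left out of the sketch.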