Location Module: Disambiguation
Last week we made significant progress on disambiguating locations, but still have a few problems to resolve before we will consider disambiguation complete.
Algorithm for Disambiguation
1. We run a reimplemented version of Stanford's Named Entity Recognition program, which returns a set of locations.
2. Classification
- We go through each location and determine if it is a country.
- Similarly, we go through each location and determine if it is a state.
3. Search the database based on the following cases
a. No countries or states found:
- Return no locations
- The final version will use DNS to find the country
b. No countries were found but states were found:
- Return no locations
- The final version will use DNS to find the country
c. No states were found but countries were found:
- Search the database for results using Name and Country
d. Both states and countries exist:
- Search the database for results using Name, State, and Country
4. Searching the database (applies for 3c and 3d)
a. Loop through each State/Country combination for each node and return the best result
b. If no results: Drop state from the search.
c. If no results: Split the Name if it is more than one word.
d. If no results: Return no locations for this node.
5. Evaluating best result
- We only need to evaluate the best result when more than one location was found for the node.
- We use position in the document to find the best result:
a. "I am working in Kansas City, Kansas. Although I like Kansas, I would really like to be in Branson, Missouri."
b. "Kansas City", "Kansas", "Branson", and "Missouri" are the nodes in this example.
c. For the node Kansas City, we get two results: Kansas City, Kansas and Kansas City, Missouri.
d. Kansas City will be matched with Kansas because, in the document, the mention of Kansas is closer to Kansas City than the mention of Missouri is.
e. Note: for this example we assume United States is mentioned somewhere in the document.
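The position-based choice in step 5 can be sketched as follows. This is a minimal illustration, not the project's actual script: the function names `mention_offsets` and `closest_state` are hypothetical, and distance is assumed to be measured in characters between the start of the node and the start of each state mention.

```python
import re

def mention_offsets(text, name):
    """Return the start offset of every whole-word mention of name in text."""
    pattern = r"\b" + re.escape(name) + r"\b"
    return [m.start() for m in re.finditer(pattern, text)]

def closest_state(text, node, candidate_states):
    """Pick the candidate state mentioned nearest to the node (assumed rule)."""
    node_start = text.index(node)  # first mention of the node
    node_end = node_start + len(node)
    best_state, best_distance = None, None
    for state in candidate_states:
        for offset in mention_offsets(text, state):
            # Skip "Kansas" inside "Kansas City": that is the node itself.
            if node_start <= offset < node_end:
                continue
            distance = abs(offset - node_start)
            if best_distance is None or distance < best_distance:
                best_state, best_distance = state, distance
    return best_state

text = ("I am working in Kansas City, Kansas. Although I like Kansas, "
        "I would really like to be in Branson, Missouri.")

# The database returned two candidates for the node "Kansas City".
choice = closest_state(text, "Kansas City", ["Kansas", "Missouri"])
print(choice)
```

With the example sentence, the standalone "Kansas" directly after "Kansas City" is nearer than "Missouri", so the Kansas result is kept. Applying the same function to the node "Branson" selects Missouri.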
Problems Resolved This Week
1. Database problem
- Our database does not match every city with its correct state. For a number of cities the state is NULL, so they are not found by a query that includes the state. These cities are a minority, but when one occurs it leads to wrong results.
* This problem has already been addressed in the algorithm: we drop the state when no results are found using it.
2. NER Problem
- NER will recognize a location such as "X Province", whereas our database will only have a result for "X". This can be worked around by changing "Province" to lowercase in the document, but obviously we will not be doing this manually for each document.
* This problem has already been addressed in the algorithm: we split the location name if no results are found.
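The two fallbacks above (drop the state, then split the name) can be sketched against an in-memory stand-in for the database. Everything here is an assumption made for illustration: the row layout, the sample rows, the names `query` and `search_with_fallbacks`, and the choice to try each word of a split name in order (the algorithm only says the name is split).

```python
# Hypothetical rows standing in for the real location table.
ROWS = [
    {"name": "Yushu", "state": None, "country": "China"},  # state is NULL
    {"name": "Springfield", "state": "Illinois", "country": "United States"},
]

def query(rows, name, country, state=None):
    """Match on name and country, and on state only when one is given."""
    return [
        row for row in rows
        if row["name"] == name
        and row["country"] == country
        and (state is None or row["state"] == state)
    ]

def search_with_fallbacks(rows, name, country, state=None):
    """Try the strictest search first, then relax it step by step."""
    # First attempt: Name, State, and Country.
    if state is not None:
        results = query(rows, name, country, state)
        if results:
            return results
    # Fallback 1: drop the state (handles rows whose state is NULL).
    results = query(rows, name, country)
    if results:
        return results
    # Fallback 2: split a multi-word name and try each word in turn.
    words = name.split()
    if len(words) > 1:
        for word in words:
            results = query(rows, word, country)
            if results:
                return results
    # Nothing matched: return no locations for this node.
    return []

found = search_with_fallbacks(ROWS, "Yushu County", "China", "Qinghai")
print([row["name"] for row in found])
```

In this sketch "Yushu County" with state "Qinghai" fails the first two searches and is only found after the name is split, which mirrors the "X Province" case.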
Examples
Yesterday when I presented this to Wesam, I had to modify the colored words to lowercase.
Local authorities have culled 58 cows after tests confirmed on Oct. 30 the presence of Asia Type One foot-and-mouth disease at a village in Yushu County, the ministry said on its Web site (www.agri.gov.cn).
Foot-and-mouth disease does not affect humans and outbreaks are relatively easy to control, but the disease can have a serious impact on the livestock industry by reducing meat and milk production.
Outbreaks that resulted in the slaughter of more than 1,000 animals were reported around China last year.
That has been fixed. Here is our output:
(Qinghai,25.808333,106.075,0.0)
(Yushu,41.637778,124.822778,0.0)
Current Problems and Remaining Work
1. Our database has no data for states or provinces.
2. In the above example, Beijing is returned as a location. However, it is only the place of publication.
3. Our script needs to find the country if it is not a location in the document. This should be done using DNS.
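One possible reading of the DNS-based fallback in item 3 is to infer the country from the country-code top-level domain of a URL associated with the document, such as www.agri.gov.cn in the example. The sketch below is only that interpretation, not a decided design: the function name `country_from_url` and the small lookup table are hypothetical, and a real table would need to cover all country codes.

```python
from urllib.parse import urlparse

# Hypothetical, deliberately incomplete mapping of country-code TLDs.
CCTLD_TO_COUNTRY = {
    "cn": "China",
    "us": "United States",
    "uk": "United Kingdom",
}

def country_from_url(url):
    """Guess a country from the last label of the URL's host name."""
    # urlparse only fills in the host when the URL has a "//" prefix.
    if "//" not in url:
        url = "//" + url
    hostname = urlparse(url).hostname
    if not hostname:
        return None
    tld = hostname.rsplit(".", 1)[-1]
    # Generic domains such as .com carry no country information.
    return CCTLD_TO_COUNTRY.get(tld)

print(country_from_url("www.agri.gov.cn"))
```

A generic domain returns None in this sketch, so the script would still need another way to find the country for such documents.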
In conclusion, our disambiguation script has been successful as of 9/22/09 on nearly all of the documents in our time map.
Created by andrew08. Last Modification: Tuesday 22 of September, 2009 16:31:26 CDT by andrew08.
