Geocoding v3 beta
1. GeoCoding v3 beta Engine
1.1 Overview Methodology
Geocoding is the process of finding geographic coordinates (expressed as geographic longitude and latitude) from other geographic data, such as addresses (city, street, house number, ...) or points of interest. Common input formats are free text fields (single line search), address-based forms (structured search), and a mixture of the two. Because the other methods are specialized (derived) use cases that allow more precise address resolution with higher performance, this paper explains the implementation of the Geocoding v3 beta engine only in terms of single line search. By default, the geocoder returns its output in XML format.
A postal address includes one or more of the following pieces of information:
- Country (e.g. "Germany")
- State (e.g. "Baden-Württemberg")
- County (e.g. "Uckermark")
- Settlement (e.g. "Stuttgart")
- Settlement district (e.g. "Botnang")
- Post Code (e.g. "70176")
- Street (e.g. "Leuschner Strasse")
- House Number (e.g. "45a")
- Point Of Interest (e.g. "Pizzeria Bella Italia")
2. Implementation details of Geocoding v3 beta engine
The Geocoding v3 beta engine processes a single line search in the following steps:
- preprocessing the search input query
- restricting the search region to some parameters
- calculating search results
- preparing and prioritizing the geocoding results
2.1 Preprocessing the search input query
The first thing the geocoder does is modify the input query as soon as it is typed in. This modification consists of three sub-processes: (1) precompiling the search string, (2) itemizing it, and (3) normalizing the resulting items.
2.1.1 Precompilation of search string
For efficiency and performance reasons, any search string supplied by a user can be precompiled in two passes:
- Pass 1: locale- and region-independent preprocessing
- Pass 2: locale- and region-dependent preprocessing
If necessary, the locale/region-independent pass adds extra separators to the original search string, e.g. a space character before upper-case letters and numbers. For example, the query
"Stuttgart Leuschner Str. 45". The language/region-specific pass also handles transcriptions and abbreviations. For example, the query
"Stuttgart Boeheimstr." becomes
2.1.2 Itemized Search String
The search string may include soft separators such as spaces and hard separators such as semicolons or commas. A soft separator means that the two tokens it divides may belong together, but need not. A hard separator implies that the two tokens belong to different address parts. The "Itemizer" calculates a list of possible address items from the search string. For example, for the query
"Stuttgart Germany; Leuschner Strasse 45" the itemizer calculates the following items:
- Stuttgart Germany
- Leuschner Strasse 45
- Leuschner Strasse
- Strasse 45
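The item list above corresponds to all contiguous multi-token runs within each hard-separated segment. A minimal Python sketch of that idea follows; the function name `itemize` and the `min_tokens` parameter are assumptions made for illustration, and the real Itemizer may well produce further items (for instance single tokens).

```python
import re


def itemize(query, min_tokens=1):
    """Return candidate address items for a single line query.

    Hard separators (';' and ',') split the query into segments.
    Inside a segment, soft separators (whitespace) split tokens, and
    every contiguous run of tokens is a candidate item, longest first.
    """
    min_tokens = max(min_tokens, 1)
    items = []
    for segment in re.split(r"[;,]", query):
        tokens = segment.split()
        count = len(tokens)
        for length in range(count, min_tokens - 1, -1):
            for start in range(count - length + 1):
                items.append(" ".join(tokens[start:start + length]))
    return items
```

Called with `min_tokens=2` on the query above, this sketch yields the four items listed; with the default `min_tokens=1` it additionally yields each single token.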
2.1.3 Normalize Search Items
Every search item produced by the itemizer is normalized by the following rules:
- remove all spaces and special characters (like brackets)
- fold all upper-case letters to lower case
- remove all graves, acutes, circumflexes, tildes, diaereses, rings etc. from all letters
- normalize Unicode characters with NFKD (Normalization Form Compatibility Decomposition - see http://unicode.org/reports/tr15/#Norm_Forms)
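The four rules can be sketched with Python's standard `unicodedata` module. This is an illustration under assumptions, not the engine's code: the function name `normalize_item` is hypothetical, and "special characters" is interpreted here as anything that is not a letter or digit.

```python
import unicodedata


def normalize_item(item):
    """Normalize one search item following the four rules above."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", item)
    # Dropping the combining marks removes graves, acutes, diaereses etc.
    base_letters = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Fold to lower case, then drop spaces and special characters.
    return "".join(ch for ch in base_letters.lower() if ch.isalnum())
```

Doing the decomposition first keeps the accent removal simple, because each diacritic becomes a separate combining character that can be filtered out.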
For example, an item named
"Varieté Düsseldorf" is converted to "varietedusseldorf" by these rules.
2.2 Restricting the search region to some parameters
A direct search for streets or POIs in the whole MicroMap database can be very expensive because of the huge number of features. However, a search query usually contains at least one of two things: a postcode or a settlement. The MMGCE therefore first tries to reduce the search area by extracting this information from the previously calculated search items. This is done in three steps: (1) determining post code names and calculating post code regions, (2) determining settlement names and calculating settlement regions, and (3) combining both into the allover constriction region.
2.2.1 Determining and calculating post code names and regions
Determine Post Code Names
The MicroMap database contains an ordered string container with all postcode names available in a given map. A post code name can be an attribute of any geographic feature, such as a postcode region, street, or POI. The geocoding engine tries to match each search item against this post code name container. Normally there should be no more than one result. With the postcode name(s) found, the search engine has direct access (by ID) to every geographic feature that holds this attribute.
Calculate Post Code region
There are two ways to get the postcode regions. In some MicroMap databases (e.g. those compiled from Navteq or Teleatlas data), proper region polygons are already stored for each postcode. In this case, the result region for the post code constriction is simply the union of all polygons that carry the matching post code name as an attribute. The union may contain more than one region because the same post code can be valid in more than one country. In OpenStreetMap, no polygons are associated with a post code. However, some feature types such as streets or POIs have this attribute set. In this case the geocoding engine calculates approximate upper-bound regions by grouping point clouds.
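One simple way to picture the point-cloud approach is to group feature positions by their post code attribute and take a bounding box per group as the upper-bound region. The sketch below assumes exactly that; the function name, the `(postcode, lon, lat)` input shape, and the use of axis-aligned boxes are all assumptions, and the engine's actual grouping may differ.

```python
from collections import defaultdict


def postcode_boxes(features):
    """Approximate one region per post code from tagged feature points.

    features: iterable of (postcode, lon, lat) tuples.
    Returns {postcode: (lon_min, lat_min, lon_max, lat_max)}.
    """
    clouds = defaultdict(list)
    for postcode, lon, lat in features:
        clouds[postcode].append((lon, lat))

    boxes = {}
    for postcode, points in clouds.items():
        lons = [lon for lon, _ in points]
        lats = [lat for _, lat in points]
        boxes[postcode] = (min(lons), min(lats), max(lons), max(lats))
    return boxes
```

A single box per post code would be far too large when the same code exists in several countries; a fuller version would first split each cloud into spatial clusters and build one region per cluster.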
2.2.2 Determining and calculating settlement names and regions
Determine Settlement Names
In general this works similarly to determining post code names, with a few modifications:
- If the engine finds a post code region, it searches only for settlements within this region.
- Because of the large (worldwide) number of settlements, the engine expects at least the first 3 letters (characters) to be typed correctly, so that the advantages of a binary search can be applied.
- If the engine doesn't find an exact match for a given search item, the number of allowed typos/errors is increased, depending on the search string length.
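The prefix rule and the length-dependent error tolerance can be illustrated as follows. This is a sketch under assumptions: the helper names are hypothetical, Levenshtein distance stands in for whatever error measure the engine uses, and the "one typo per five characters" policy is invented for the example.

```python
import bisect


def edit_distance(a, b):
    """Levenshtein distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b)  # substitution
            ))
        previous = current
    return previous[-1]


def find_settlements(sorted_names, item, prefix_len=3):
    """Look up a normalized item in a sorted list of settlement names."""
    prefix = item[:prefix_len]
    # Binary search for the slice of names sharing the trusted prefix.
    low = bisect.bisect_left(sorted_names, prefix)
    high = bisect.bisect_left(sorted_names, prefix + "\uffff")
    candidates = sorted_names[low:high]
    if item in candidates:
        return [item]
    # Hypothetical policy: allow one typo per five characters typed.
    allowed = len(item) // 5
    return [name for name in candidates if edit_distance(name, item) <= allowed]
```

Trusting the first three characters is what makes the binary search possible: only the names inside the prefix slice need the more expensive distance comparison.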
Calculate Settlement Region
This is analogous to the postcode region calculation, with the difference that the union can contain several separate regions, because e.g. the settlement name "Frankfurt" leads to many results worldwide. If no exact polygon region is found for a settlement, the approximate region is a circle around the city center with a radius that depends on the population and importance of the city or district.
2.2.3 Allover Constriction Region
If both the postcode region and the settlement region are valid, the final constriction region is the intersection of the two.
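As a rough illustration, the intersection can be shown with axis-aligned bounding boxes standing in for the real regions (which may be polygons, unions of polygons, or circles). The function names are hypothetical, and falling back to whichever single region is valid is an assumption not stated in the text.

```python
def intersect_boxes(a, b):
    """Intersect two boxes given as (lon_min, lat_min, lon_max, lat_max)."""
    lon_min, lat_min = max(a[0], b[0]), max(a[1], b[1])
    lon_max, lat_max = min(a[2], b[2]), min(a[3], b[3])
    if lon_min > lon_max or lat_min > lat_max:
        return None  # the regions do not overlap
    return (lon_min, lat_min, lon_max, lat_max)


def constriction_region(postcode_box, settlement_box):
    """Combine the two optional regions into the final search region."""
    if postcode_box is not None and settlement_box is not None:
        return intersect_boxes(postcode_box, settlement_box)
    # Assumption: with only one valid region, that region is used as is.
    return postcode_box if postcode_box is not None else settlement_box
```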
2.3 Calculating search results
2.3.1 Calculation of Street Results
- Determine the street names within the postcode/settlement region that match at least one of the search items. The error tolerance is handled as in the search for settlement names.
- Collect all street features from the MicroMap database that have one of these street names as an attribute.
- Group these features into street results. Background: there can be several street features with the same name which in reality belong together. For example, there could be a small path behind a house with the same name as the big avenue in front. They are not even connected but belong together.
- Calculate the position that represents a group: the MMGCE tries to detect a house number within the search items and checks whether one member of a street group contains this house number. If so, that position represents the group; if not, the most central point is taken.
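The last step, choosing the position that represents a street group, might look like the sketch below. The data layout (a list of members, each with a `position` and a `house_numbers` mapping) is hypothetical, and "most central point" is interpreted here as the member position closest to the group's centroid.

```python
def representative_position(group, house_number=None):
    """Pick the (lon, lat) position that represents a street group.

    group: non-empty list of dicts with
        "position":      (lon, lat) of the street feature
        "house_numbers": {house number: (lon, lat)}
    """
    # Prefer the position of a matching house number, if any member has it.
    if house_number is not None:
        for member in group:
            if house_number in member["house_numbers"]:
                return member["house_numbers"][house_number]

    # Otherwise take the member position closest to the group centroid.
    lon_c = sum(m["position"][0] for m in group) / len(group)
    lat_c = sum(m["position"][1] for m in group) / len(group)
    return min(
        (m["position"] for m in group),
        key=lambda p: (p[0] - lon_c) ** 2 + (p[1] - lat_c) ** 2,
    )
```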
2.3.2 Calculation of POI Results
This works analogously to the first two steps of the street result calculation.
2.4 Preparing and prioritizing the Geocoding v3 beta results
2.4.1 Preparing Geocoding v3 beta Results
After the steps above, we have a number of street / POI / settlement results, each with a representative position. By applying a kind of reverse geocoding process, the engine adds all available area information that is still missing (such as country, state, county, postcode, settlement, district) to each result entry. If no street / POI / settlement result was found but a postcode region was, all settlement centres within the postcode region are taken as results and enriched with the area information.
2.4.2 Weighting the Geocoding v3 beta Results (Prioritization)
Each result represents an address and contains the following address parts: country, state, county, settlement, district, postcode, street, street number, sight, house number. Each of these address parts has a static weight that describes its priority. Usually not every part is known for a single address: for example, Liechtenstein has no states, and a settlement result has no street or house number part. For every address part an individual priority is calculated (a value between 0.0 and 1.0). In general this priority contains the orthographic match priority: the MMGCE identifies the best match between a search item and the address part string, and the better the conformity, the higher the priority. For some address parts, additional parameters influence the priority. For a settlement, importance and population are taken into account, so that e.g. "Paris" in France is weighted more heavily than "Paris" in Canada. If no exact settlement regions are given, the distance between the city center and the address position is considered: the greater the distance, the lower the priority. Once a general priority has been determined for an address, the engine collects the priorities of the individual address parts (taking their weights into account) and joins them with the overall priority, which mainly reflects how many search items were used for the result. Finally, all results are ordered by their priority, and those with a very poor evaluation are discarded.
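The text does not give the actual weights or the combination formula, so the following is only one plausible reading: a weighted mean of the known part priorities, scaled by the share of search items that were used. All weight values, the threshold, and the function names are invented for illustration.

```python
# Hypothetical static weights; the real values are not documented here.
STATIC_WEIGHTS = {
    "country": 1.0,
    "settlement": 3.0,
    "postcode": 3.0,
    "street": 2.0,
    "house number": 1.0,
}


def address_priority(part_priorities, items_used, items_total,
                     weights=STATIC_WEIGHTS):
    """Combine per-part priorities (0.0 to 1.0) into one address priority.

    part_priorities holds only the parts known for this address.
    """
    total_weight = sum(weights[part] for part in part_priorities)
    if total_weight == 0 or items_total == 0:
        return 0.0
    weighted_mean = sum(
        weights[part] * priority for part, priority in part_priorities.items()
    ) / total_weight
    # Scale by how many of the search items contributed to this result.
    return weighted_mean * (items_used / items_total)


def rank_results(results, threshold=0.3):
    """Drop poorly rated results and order the rest, best first."""
    kept = [r for r in results if r["priority"] >= threshold]
    return sorted(kept, key=lambda r: r["priority"], reverse=True)
```

Averaging only over the known parts keeps an address from being penalized for parts that cannot exist, such as a state in Liechtenstein.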
3. GeoCoding v3 beta Server API
The Geocoding v3 beta engine features a powerful address and POI search using MicroMap vector maps.
3.1 Request Structure
In general, the geocoding HTTP API uses the URL format shown below:
In this example:
http://beta.geocoding.cloudmade.com/v3/: link to MapServer
APIKEY: your application apikey for accessing CloudMade services
geo.location.search.2?: request type of search
format=xml: format of the result
source=OSM: data source
enc=UTF-8: character encoding of the input
limit=10: maximum number of results
locale=de: locale of current search
q=Leinfelden-Echterdingen: input query string
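A request URL can be assembled from the components listed above with Python's standard `urllib.parse` module. The sketch below only builds the string and sends nothing. How the API key and the request type are joined to the base URL is an assumption here, as is the function name `build_search_url`.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://beta.geocoding.cloudmade.com/v3/"
REQUEST_TYPE = "geo.location.search.2"


def build_search_url(api_key, query, **overrides):
    """Build a geocoding request URL from the documented parameters."""
    params = {
        "format": "xml",
        "source": "OSM",
        "enc": "UTF-8",
        "limit": 10,
        "locale": "de",
    }
    params.update(overrides)
    params["q"] = query
    # quote_via=quote encodes spaces as %20 rather than '+'.
    query_string = urlencode(params, quote_via=quote)
    # Assumption: key and request type follow the base URL as path segments.
    return f"{BASE_URL}{api_key}/{REQUEST_TYPE}?{query_string}"
```

Keyword arguments such as `limit=5` or `locale="en"` replace the defaults, so one helper covers the parameter variations described in the next section.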
3.2 Request Parameters
The input query must follow a few rules. A valid request consists of a small number of valid input parameters. The most important input parameters are:
bbox=lonMin,latMin;lonMax,latMax: used to restrict the search area to the bounding box defined by these coordinates (lonMin, latMin, lonMax, latMax); only results within this bounding box are returned. The value is given as a list of coordinates.
limit=number: used to specify the maximum number of results. For example,
limit=10 returns at most 10 search results.
format=xml: used to specify the output format. By default the output is returned in XML format. The following values can be used:
sdf (Structured Data Format)
source=Name of the data source: used to specify the data source to run this query on. For example, source=OSM.
enc=Encoding format: used to specify the encoding of the input. For example, enc=UTF-8.
locale=locale: used to specify the locale of the current search, so that country- or language-specific name abbreviations are taken into account. For example,
locale=de means that the locale is German. With this, German abbreviations such as "St." for "Sankt" or "str." for "Straße" can be recognized.
q: used to pass the query string, i.e. the address to geocode. For example,
q="Leinfelden-Echterdingen Germany Karlstr. 12".
3.3 Types of search query string
To resolve a postal address, one or more of the following address fields are passed as query strings:
- country name
- state name
- city name
- post/zip code
- street name
- house number
- point of interest (POI) name
Text-based input is normally given in one of three forms: free style, address form, or a mixture of the two. In all of these searches, the parameter
"q" stands for the query, and the address is assigned to it using the syntax of the respective search type.
3.3.1 Free style search
In free style search (also called single line search or fuzzy search), the query string is provided without specifying what its parts are, in the way familiar from search engines like Google, e.g.
q = "Leinfelden-Echterdingen Germany Ludwigstr. 4", as in the following example:
With this kind of query string, it is difficult for the engine to decide what
"Leinfelden-Echterdingen" is and what "Germany" is, because
"Leinfelden-Echterdingen" is not specified as a city,
"Germany" is not specified as a country, and so on. The results of this search may therefore be less than ideal.
3.3.2 Address form search
Address form search is also called structured search. The input query of this search uses the syntax
[key=value]. Each key-value pair must be enclosed in square brackets. Currently supported keys are:
"country", "state", "county", "zip", "city", "district", "street", "housenumber", "sight". Values are the specific names for these keys, such as [country=Germany] or [street=Karlstr.], as in the following example:
With this kind of query string, the engine can directly recognize
"Germany" as a country,
"Leinfelden-Echterdingen" as a city, and so on. Because the query is this specific, the search is much faster and more precise than free style search. Some important tags for forming the query string for this kind of search are:
[country=Country name]: used to specify the name of the country. For example: [country=Germany].
[state=State name]: used to specify the name of the state.
[county=County name]: used to specify the name of the county.
[zip=Zip code]: used to specify the zip code.
[city=City name]: used to specify the name of the city. For example: [city=Leinfelden-Echterdingen].
[district=District name]: used to specify the name of the district.
[street=Street name]: used to specify the name of the street. For example: [street=Karlstr.].
[housenumber=House number]: used to specify the house number. For example: [housenumber=12].
[sight=Point of interest]: used to specify the point of interest. For example:
[sight=Mezzogiorno]. Without this attribute, the engine might return cities or countries with the given name instead of points of interest.
[bbox=lonMin,latMin;lonMax,latMax]: used to search for items in an area bounded by a box.
[circle=lon,lat,radius]: used to search for items in an area bounded by a circle.
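A structured query string in the [key=value] syntax can be put together with a small helper. The sketch below is illustrative only: the function name is hypothetical, it covers just the nine keys listed as currently supported, and joining the bracketed pairs with a semicolon is a choice made here.

```python
SUPPORTED_KEYS = (
    "country", "state", "county", "zip", "city",
    "district", "street", "housenumber", "sight",
)


def build_structured_query(separator=";", **fields):
    """Build a structured query string from keyword arguments."""
    unknown = sorted(set(fields) - set(SUPPORTED_KEYS))
    if unknown:
        raise ValueError(f"unsupported keys: {', '.join(unknown)}")
    # Emit the pairs in a fixed order so the output is predictable.
    return separator.join(
        f"[{key}={fields[key]}]" for key in SUPPORTED_KEYS if key in fields
    )
```

For example, `build_structured_query(country="Germany", street="Karlstr.", housenumber="12")` produces `[country=Germany];[street=Karlstr.];[housenumber=12]`, which would then be passed as the value of `q`.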
3.3.3 Hybrid search
The geocoding engine also accepts query strings that mix these two styles, such as
[country=Germany] Karlstr. Leinfelden-Echterdingen 70771 [housenumber=12], as in the following example:
It is recommended to put delimiters, i.e. soft separators (e.g. spaces) and hard separators (e.g. semicolons or commas), between the segments of the query string. A soft separator implies that the two tokens it divides may belong together, but need not, while a hard separator implies that the two tokens belong to different address parts. The main reason for placing separators is to make the search faster and more precise. For example, a search with the query string
q="70176; Stuttgart; Leuschnerstr; 44; Stuttgart; Germany" returns faster and gives better results than one with
q="70176 Stuttgart Leuschnerstr 44 Stuttgart Germany". Note that the order of the parameters does not matter as long as you use the delimiters mentioned above.
3.5 %-Encoding (URL encoding)
As part of a URL, query strings can also be written in URL-encoded form, i.e. using %-encoding (percent-encoding) when the query string is prepared for the search. In particular, reserved characters such as square brackets and spaces, as well as non-ASCII characters such as German umlauts, can be percent-encoded. A percent-encoded query string returns exactly the same result as the same query string in its unencoded form. A few essential characters used in URLs and their encodings are listed here. Reserved characters:
"[" is encoded as "%5B"
"]" is encoded as "%5D"
"=" is encoded as "%3D"
" " (space) is encoded as "%20" (spaces before and after equal signs do not matter at all)
Encoding of umlaut characters and the character "ß" (as UTF-8, matching enc=UTF-8):
"ü" is encoded as "%C3%BC"
"ö" is encoded as "%C3%B6"
"ä" is encoded as "%C3%A4"
"ß" is encoded as "%C3%9F"
See table below for example:
Note that the search times were measured by us, so the times you observe will probably differ.
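The percent-encoding described in this section can be reproduced with `quote` from Python's standard `urllib.parse` module. The wrapper name `encode_query` is hypothetical; passing `safe=""` makes sure that no character is left unencoded by default.

```python
from urllib.parse import quote, unquote


def encode_query(query):
    """Percent-encode a query string using UTF-8."""
    return quote(query, safe="", encoding="utf-8")


def decode_query(encoded):
    """Reverse encode_query."""
    return unquote(encoded, encoding="utf-8")
```

With this sketch, `encode_query("[city=Düsseldorf]")` gives `%5Bcity%3DD%C3%BCsseldorf%5D`, and `decode_query` restores the original string.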
3.6 Query string and result
The explicitness of the input query directly influences the correctness of the search result. A few common input queries and their results are summarized here:
- If the house number is not given or cannot be found in the data source, the middle of the street is returned as the result point.
- If only a postcode is given as the query, the reference point of that area is returned.
- If only the city name is given as the query, the center point(s) of the matching city or cities are returned.
- If only the street name is given and there are not too many streets with that name, the streets are found and ordered by the population of the city they belong to.
4. Geocoding v3 beta results
4.1 Output structure
The result of a geocoding search can be returned in XML format. For example:
4.2 Output Parameters
In the table above, a few tags are used to display the search result returned in response to the request.
The most common output tags are:
<city>: used to display the city name
<country>: used to display the country name
<district>: used to display the district name
<position>: used to display the coordinates of the result
<state>: used to display the state name
<zip>: used to display the ZIP code
<status>: used to display the status of the result, such as whether the search succeeded or not
<duration>: used to display the time spent searching, in milliseconds
5. Reverse Geocoding
The GeoCoding v3 beta Engine now also supports reverse geocoding, which used to be a separate module. Reverse geocoding returns an address in response to input coordinates. For example:
This input returns the following result:
The order of longitude and latitude does not matter, because in most cases only one of the two interpretations makes sense. If this is not the case, the geocoder returns both possible results. As with geocoding, the reverse geocoding query string can be formed in three different ways, and delimiters can be used to separate its segments. Unlike geocoding, however, coordinates (latitude and longitude) are assigned to it as values, such as
q=[latitude=48.77615073];[longitude=9.16416465], as in the following example:
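A reverse geocoding query string in the bracketed form shown above could be assembled as follows. This is a minimal sketch: the function name is hypothetical, and the range checks are an addition made for illustration, not documented behaviour of the engine.

```python
def build_reverse_query(latitude, longitude, separator=";"):
    """Build the value of q for a reverse geocoding request."""
    # Illustrative sanity checks on the coordinate ranges.
    if not -90.0 <= latitude <= 90.0:
        raise ValueError("latitude must be between -90 and 90")
    if not -180.0 <= longitude <= 180.0:
        raise ValueError("longitude must be between -180 and 180")
    return f"[latitude={latitude}]{separator}[longitude={longitude}]"
```

Calling `build_reverse_query(48.77615073, 9.16416465)` returns the query value shown above, which still needs percent-encoding before it is placed in a URL.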