A recurring task I have as the developer of the GoodFoodTalks.com site is importing restaurant chains into the database via CSV file. In trying to map their address formats into ours, I tried out several online address parsers. What I discovered was that you can't expect any automated process to handle improperly formatted addresses, because there is no way for its developers to anticipate what any particular invalid input might look like. That's why something like "International Convention Centre, level 2, 8 Quayside, London, UK" is not likely to be parsable. I wrote about the challenge of parsing addresses in the "Parsing Building and Street Fields from an Address using Regular Expressions" article. The difficulty of parsing addresses does not rule out automated parsing services; it just means that you have to be working with properly formatted data if you're going to obtain usable results. Bearing that in mind, the goal of this guide is to determine how some of the more popular online address parsers fare against a list of varied but well-formatted addresses. In this first instalment, we'll be looking at address-parser.net and Texas A&M GeoServices.
Finding Test Data
I'm no expert on international address formats, so I wanted test addresses from a fairly reliable source, although some wonky data might be good for seeing how a parser handles bad input. I eventually landed on Wikipedia's excellent Address page, which features addresses for many countries. Here is a list of ten sample addresses from around the world:
- Target Insurance Brokers, Level 2, Principal Towers, 11 Jalan Sultan Ismail, 50250 KUALA LUMPUR, MALAYSIA
- Piedras 623, Piso 2, depto 4, C1070AAM, Capital Federal
- Australia Post, 219-241 Cleveland St, STRAWBERRY HILLS NSW 1427
- 101-3485 RUE DE LA MONTAGNE, MONTRÉAL (QUÉBEC) H3G 2A6
- P.R. China 528400, Beijing City, East District, Mingdu Road, Hengda Garden, 7th Building, Room 702
- Hauptstr. 5, 01234 Musterstadt
- Budapest, Fiktív utca 82., IV. em./28. - or - Pf. 184. 2806
- 5, Mahatma Gandhi Road, BUDHAGAON, District Sangli, 471594, Maharashtra
- The Shelbourne Hotel, 27 St Stephen's Green, Dublin, D02 H529
- ul. Pobedy, d. 20, kv. 29, pos. Oktyabrskiy, Borskiy r-n, Nizhegorodskaya obl., Russia, 606480
address-parser.net
The name of our first site says it all! This international address parser takes structured and free-form addresses alike. It splits them into separate component parts, such as house number, street type (bd, street, ...), street name, unit (apt, batiment, ...), zipcode, state, country, city, and so on. It's useful for comparing, validating, weeding out duplicates, standardizing, and geocoding your addresses. Behind the scenes, the parser employs its own proprietary technology, based on computational linguistics, natural language processing, parsing technology, semantic technology, and text mining.
The actual product is traditional software, but a free online version provides some of the same functionality for individual addresses, which may be parsed and/or standardized. Here is what entering the address "Budapest, Fiktív utca 82., IV. em./28. - or - Pf. 184. 2806" produced:
It did an admirable job, although it did drop some of the information, notably the "IV. em./28. - or - Pf. 184" portion.
Texas A&M GeoServices
A product of the research conducted at the Texas A&M University Department of Geography, TAMU GeoServices offers a number of online geographic information processing services, including address processing, geocoding, reverse geocoding, and drag & drop mapping, to name but a few.
On the Address Processing page, you'll find the section for Batch Address Parsing & Standardization. Clicking "Start Processing Data" takes you to the Batch Database Address Normalization page. Working with the services is a four-step process:
- Upload your data files and validate that we can open and read them
- Choose which data in your files you want to process
- Identify the fields of your data so we know which column is what
- Choose your processing options and start the process
Services are free to use, but you must create an account before using them. Therefore, step one redirects to the login/signup page.
Once you've created your account, you'll receive 2500 credits the first time you log in. That is more than enough for all but the most high-volume processing. If you do need more, you can become a partner to receive additional free credits, or you can purchase additional credits.
At the bottom of the navigation links on the left-hand side of the Address Processing page, there's a link to download some sample data files. These provide a good idea of what the address parser does, and it turns out that it does not handle cities, postal codes, provinces/states, or countries. In other words, it only parses the house/building number and street information, not unlike what I did in my "Parsing Building and Street Fields from an Address using Regular Expressions" article. So, suppose that you've got some addresses like the following, which I lifted from the United States Mailing Address Formats and Other International Mailing Information page:
Addresses
"JOHN DOE 421 E DRACHMAN TUCSON AZ 85705-7598 USA"
"MARY ROE SUITE 5A-1204 799 E DRAGRAM TUCSON AZ 85705 USA"
"BITBOOST POB 65502 TUCSON AZ 85728 USA"
"JANE DOE 799 E DRAGRAM SUITE 5A TUCSON AZ 85705 USA"
"JOHN SMITH 100 MAIN ST PO BOX 1022 SEATTLE WA 98104 USA"
You'd have to trim them down to the street level:
Addresses
"421 E DRACHMAN"
"SUITE 5A-1204 799 E DRAGRAM"
"POB 65502"
"799 E DRAGRAM SUITE 5A"
"100 MAIN ST PO BOX 1022"
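
For a handful of records you can do this trimming by hand, but it can also be scripted. What follows is only a rough Python sketch, not a general-purpose solution. It assumes that every address ends with a one-word city, a two-letter state, a ZIP or ZIP+4 code, and "USA", and that the street portion begins at the first token that starts with a digit or is one of a few unit/PO box keywords. The function name trim_to_street and the keyword list are my own inventions for illustration; anything that doesn't fit these assumptions (a two-word city, for instance) would need extra handling or a manual check.

```python
import re

# Assumption: the address ends with "<one-word city> <2-letter state> <ZIP or ZIP+4> USA".
TAIL = re.compile(r"\s+\S+\s+[A-Z]{2}\s+\d{5}(?:-\d{4})?\s+USA$")

# Assumption: the street portion starts at the first token that begins with a
# digit or is one of these keywords (illustrative list, far from complete).
STREET_KEYWORDS = {"SUITE", "POB", "PO"}


def trim_to_street(address: str) -> str:
    """Strip the recipient name and the city/state/ZIP/country tail."""
    without_tail = TAIL.sub("", address.strip())
    tokens = without_tail.split()
    for i, token in enumerate(tokens):
        if token[0].isdigit() or token in STREET_KEYWORDS:
            return " ".join(tokens[i:])
    # No recognizable street start: return what is left for manual review.
    return without_tail


if __name__ == "__main__":
    samples = [
        "JOHN DOE 421 E DRACHMAN TUCSON AZ 85705-7598 USA",
        "MARY ROE SUITE 5A-1204 799 E DRAGRAM TUCSON AZ 85705 USA",
        "BITBOOST POB 65502 TUCSON AZ 85728 USA",
        "JANE DOE 799 E DRAGRAM SUITE 5A TUCSON AZ 85705 USA",
        "JOHN SMITH 100 MAIN ST PO BOX 1022 SEATTLE WA 98104 USA",
    ]
    for sample in samples:
        print(trim_to_street(sample))
```

Run against the five sample addresses, this yields the street-level strings listed above. That it works here says more about how regular the samples are than about the heuristic, which is exactly the point made earlier about needing properly formatted input.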
In the next instalment, we'll go over the steps to upload your data to the Texas A&M GeoServices' address parser and how to run the service against it. We'll also be evaluating a product called Smartylist.
Rob Gravelle resides in Ottawa, Canada, and is the founder of GravelleWebDesign.com. Rob has built systems for Intelligence-related organizations such as Canada Border Services and CSIS, as well as for numerous commercial businesses.
In his spare time, Rob has become an accomplished guitar player, and has released several CDs. His band, Ivory Knight, was rated as one of Canada's top hard rock and metal groups by Brave Words magazine (issue #92) and reached the #1 spot in the National Heavy Metal charts on Reverb Nation.