Address Parser
TSyiemszx
|
Nov 11 2015, 02:20 PM, updated 9y ago
|
Getting Started
|
Does anyone have an idea how to parse Malaysian addresses?
Especially how to separate the house number from the street, since house numbers don't follow any standard.
Example: No 23 Jalan ABC, Taman XYZ, 54321, Kuala Lumpur, Kuala Lumpur
parse into
House number: No 23
Street name: Jalan ABC
Section: Taman XYZ
Postcode: 54321
City: Kuala Lumpur
State: Kuala Lumpur
|
|
|
|
zeb kew
|
Nov 11 2015, 03:32 PM
|
|
Manually. There is simply too much variation to do it completely automatically.
A number could be a house number, apartment number, street number for the building, road number, mile/km marker for the road, etc. Some addresses have more than one street name line in them.
If you have up to 100,000 addresses, the simplest way is to load them into a spreadsheet and use formulas to partially parse them. A lot of manual processing would be required. But if you're smart about it, it could be done in an hour or two.
For databases larger than that, you could use regex to match "known patterns", and process those, leaving the rest for further processing/tweaking. You will have to do it for each "pattern", each way the address is written. This would be a lot of work. In the end, there still will be lots of addresses left for you to process manually.
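The "match known patterns, leave the rest" approach could be sketched roughly as below in Python. This is only an illustration under assumptions: the pattern `COMMA_LAYOUT`, the field names and the function `parse_known_patterns` are made up for this example, and the pattern covers just one comma-separated layout like the one in the first post. Real data would need many more patterns added to the list.

```python
import re

# Hypothetical pattern for ONE layout only:
# "<house no> <street>, <section>, <postcode>, <city>, <state>"
COMMA_LAYOUT = re.compile(
    r"""^\s*
    (?P<house_number>(?:No\.?\s*)?\d+[A-Za-z]?(?:[-/]\d+[A-Za-z]?)*)\s*,?\s+
    (?P<street>(?:Jalan|Jln|Lorong|Lrg)\b[^,]*?)\s*,\s*
    (?P<section>[^,]+?)\s*,\s*
    (?P<postcode>\d{5})\s*,?\s*
    (?P<city>[^,]+?)\s*,\s*
    (?P<state>[^,]+?)\s*$""",
    re.IGNORECASE | re.VERBOSE,
)

# Each further way of writing an address would get its own entry here.
KNOWN_PATTERNS = [COMMA_LAYOUT]


def parse_known_patterns(addresses):
    """Split addresses into (parsed, leftover).

    parsed   -- list of dicts for addresses that matched a known pattern
    leftover -- list of raw strings that still need manual processing
    """
    parsed, leftover = [], []
    for address in addresses:
        for pattern in KNOWN_PATTERNS:
            match = pattern.match(address)
            if match:
                parsed.append(match.groupdict())
                break
        else:
            # No pattern matched: keep the address for later tweaking.
            leftover.append(address)
    return parsed, leftover


if __name__ == "__main__":
    sample = [
        "No 23 Jalan ABC, Taman XYZ, 54321, Kuala Lumpur, Kuala Lumpur",
        "Lot 5 Batu 3 somewhere",
    ]
    done, todo = parse_known_patterns(sample)
    print(done)
    print(todo)
```

The first sample address is split into the six fields from the first post, while the second one matches nothing and ends up in the leftover list for manual handling.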
|
|
|
|
TSyiemszx
|
Nov 12 2015, 12:00 PM
|
Getting Started
|
QUOTE(zeb kew @ Nov 11 2015, 03:32 PM)
Thank you so much, sir. Do you have any idea how to work with regex, especially for the house/apartment number? Malaysian house numbers vary too much and don't follow a specific standard. Also, an address is considered easy to parse if there is a comma "," as the delimiter or separator. So how do I deal with addresses that have few or no delimiters?
|
|
|
|
zeb kew
|
Nov 12 2015, 01:54 PM
|
|
QUOTE(yiemszx @ Nov 12 2015, 12:00 PM)
You look for numbers vs. letters vs. punctuation such as /, and for strings like Jalan/Jln/Taman/Tmn/Lorong/Lrg/etc. A comma alone is too unreliable. It isn't easy. It is a lot of work: hours of matching tens if not hundreds of patterns. That is why I suggest simply doing it semi-manually in a spreadsheet unless you're looking at hundreds of thousands of addresses.
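As a rough illustration of keying on strings like Jalan/Jln/Taman/Tmn instead of commas, here is a minimal Python sketch. The function name `split_without_commas`, the field names and the short keyword lists are assumptions for this example only. It also assumes the last 5-digit number is the postcode, and it does not try to separate city from state, since that would need a list of known place names.

```python
import re

# Assumed keyword lists; real data would need many more variants.
STREET_RE = re.compile(r"\b(?:Jalan|Jln|Lorong|Lrg)\b", re.IGNORECASE)
SECTION_RE = re.compile(r"\b(?:Taman|Tmn)\b", re.IGNORECASE)
POSTCODE_RE = re.compile(r"\b\d{5}\b")


def split_without_commas(address):
    """Split an address on keywords rather than commas.

    Returns a dict of fields, or None when no street keyword is found
    (such addresses are left for manual review).
    """
    # Treat commas as plain whitespace and collapse repeated spaces.
    text = " ".join(address.replace(",", " ").split())
    fields = {"house_number": "", "street": "", "section": "",
              "postcode": "", "remainder": ""}

    # Assumption: the LAST standalone 5-digit number is the postcode.
    postcodes = list(POSTCODE_RE.finditer(text))
    if postcodes:
        last = postcodes[-1]
        fields["postcode"] = last.group()
        fields["remainder"] = text[last.end():].strip()  # city/state, unsplit
        text = text[:last.start()].strip()

    street = STREET_RE.search(text)
    if street is None:
        return None

    # Everything before the street keyword is taken as the house number.
    fields["house_number"] = text[:street.start()].strip()
    tail = text[street.start():]

    section = SECTION_RE.search(tail)
    if section:
        fields["street"] = tail[:section.start()].strip()
        fields["section"] = tail[section.start():].strip()
    else:
        fields["street"] = tail.strip()
    return fields


if __name__ == "__main__":
    print(split_without_commas("No 23 Jalan ABC Taman XYZ 54321 Kuala Lumpur"))
    print(split_without_commas("Lot 5 Batu 3 somewhere"))
```

This handles only the simplest keyword layout. An address whose street name itself contains the word Taman, or one with two street lines, would be split wrongly, which is why each additional pattern still has to be checked against real data.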
|
|
|
|
Upsilon
|
Nov 12 2015, 04:47 PM
|
|
Might be relevant: http://stackoverflow.com/questions/1116019...into-components
It turns out address parsing is not easy to do. You will find a lot of interesting exceptions along the way that would break whatever magic code you throw at it. If you are not building something at a large scale, you may want to try a service provided by Google for this (not sure if one exists). Otherwise, maybe start collecting some data and apply some machine learning to it to help parse part of the address.
|
|
|
|
teknokrasi
|
Dec 1 2015, 05:22 PM
|
Getting Started
|
QUOTE(yiemszx @ Nov 11 2015, 03:20 PM)
Use a regex expression to "tokenize" the whole address string. The expression will return the tokens as an array.
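The post does not give the actual expression, so the following is only one possible reading of "tokenizing" an address with a regex, sketched in Python. The pattern `TOKEN_RE`, the token kinds WORD/NUMBER/PUNCT and the function `tokenize` are assumptions for illustration, not the poster's expression.

```python
import re

# One alternative per token kind; the group name becomes the token label.
TOKEN_RE = re.compile(
    r"(?P<WORD>[A-Za-z]+)"             # runs of ASCII letters, e.g. Jalan, No
    r"|(?P<NUMBER>\d+)"                # runs of digits, e.g. 23, 54321
    r"|(?P<PUNCT>[^\sA-Za-z\d])"       # any other single non-space character
)


def tokenize(address):
    """Return the address as a list of (kind, text) tuples, skipping spaces."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(address)]


if __name__ == "__main__":
    for kind, text in tokenize("No 23, Jln 4/5"):
        print(kind, text)
```

For "No 23, Jln 4/5" this yields WORD No, NUMBER 23, PUNCT ",", WORD Jln, NUMBER 4, PUNCT "/", NUMBER 5. Deciding which tokens form the house number, street and so on would still be a separate step on top of this list.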
|
|
|
|