 Address Parser

TSyiemszx
post Nov 11 2015, 02:20 PM, updated 9y ago

Getting Started
**
Junior Member
69 posts

Joined: Apr 2013


Does anyone have an idea how to parse Malaysian addresses?

Especially how to separate the house number from the street, since house numbers don't follow any standard.

Example: No 23 Jalan ABC, Taman XYZ, 54321, Kuala Lumpur, Kuala Lumpur


should be parsed into:

House number: No 23
Street name: Jalan ABC
Section: Taman XYZ
Postcode: 54321
City: Kuala Lumpur
State: Kuala Lumpur


zeb kew
post Nov 11 2015, 03:32 PM

Look at all my stars!!
*******
Senior Member
2,325 posts

Joined: Sep 2015
Manually. There is simply too much variation to do it completely automatically.

A number could be a house number, apartment number, street number for the building, road number, mile/km marker for the road, etc. Some addresses contain more than one street-name line.

If you have up to 100,000 addresses, the simplest way is to load them into a spreadsheet and use formulas to partially parse them. A lot of manual processing would be required. But if you're smart about it, it could be done in an hour or two.

For databases larger than that, you could use regex to match "known patterns" and process those, leaving the rest for further processing/tweaking. You will have to do it for each "pattern", that is, each way the address is written. This would be a lot of work. In the end, there will still be lots of addresses left for you to process manually.
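The "known patterns" idea above could be sketched roughly as follows. This is only a minimal illustration, not a complete parser: the pattern `COMMA_LAYOUT`, the list `KNOWN_PATTERNS`, and the helper names `parse_known` and `split_batch` are all hypothetical, and the single pattern shown covers just the comma-separated layout from the first post. Anything that does not match is set aside for manual work.

```python
import re

# Hypothetical pattern for ONE known layout (an assumption for illustration):
#   <house no> <street>, <section>, <postcode>, <city>, <state>
# Real data would need many more patterns like this one.
COMMA_LAYOUT = re.compile(
    r"""^\s*
    (?P<house>(?:No\.?\s*)?\d+[A-Za-z]?(?:[-/]\d+[A-Za-z]?)*)\s*,?\s+
    (?P<street>(?:Jalan|Jln|Lorong|Lrg)\b[^,]*?)\s*,\s*
    (?P<section>[^,]+?)\s*,\s*
    (?P<postcode>\d{5})\s*,\s*
    (?P<city>[^,]+?)\s*,\s*
    (?P<state>[^,]+?)\s*$""",
    re.IGNORECASE | re.VERBOSE,
)

KNOWN_PATTERNS = [COMMA_LAYOUT]


def parse_known(address):
    """Return a dict of fields if a known pattern matches, else None."""
    for pattern in KNOWN_PATTERNS:
        match = pattern.match(address)
        if match:
            return match.groupdict()
    return None


def split_batch(addresses):
    """Separate addresses into parsed results and leftovers for manual review."""
    parsed, leftovers = [], []
    for address in addresses:
        fields = parse_known(address)
        if fields is None:
            leftovers.append(address)
        else:
            parsed.append(fields)
    return parsed, leftovers
```

Each new way of writing an address would get its own entry in `KNOWN_PATTERNS`; the leftovers list is what remains for the spreadsheet or manual pass.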

TSyiemszx
post Nov 12 2015, 12:00 PM



Thank you so much, sir.

Do you have any idea how to work with regex, especially for the house/apartment number? Malaysian house numbers vary too much and don't follow a specific standard.

Also, an address is considered easy to parse if a comma "," is used as the delimiter or separator. So how do I deal with addresses that have no delimiter?

zeb kew
post Nov 12 2015, 01:54 PM

Look for numbers versus letters versus punctuation such as "/", and for strings like Jalan/Jln/Taman/Tmn/Lorong/Lrg, etc. Commas alone are too unreliable.
It isn't easy. It is a lot of work: hours of matching tens, if not hundreds, of patterns.
That is why I suggest simply doing it semi-manually in a spreadsheet, unless you're looking at hundreds of thousands of addresses.
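
As a rough illustration of splitting on keywords rather than commas, here is a minimal sketch. The keyword sets `STREET_WORDS` and `SECTION_WORDS`, the function name `split_by_keywords`, and the rule "the first 5-digit token is the postcode" are all assumptions made for this example; real addresses will break them often (a 5-digit house number, a second Jalan line, and so on).

```python
import re

# Keyword lists are illustrative assumptions, not a complete vocabulary.
STREET_WORDS = {"jalan", "jln", "lorong", "lrg"}
SECTION_WORDS = {"taman", "tmn"}
POSTCODE = re.compile(r"^\d{5}$")


def split_by_keywords(address):
    """Split an address into rough fields using keywords instead of commas."""
    # Commas are treated as plain whitespace, so they are optional.
    tokens = address.replace(",", " ").split()
    fields = {"house": [], "street": [], "section": [], "postcode": [], "rest": []}
    current = "house"  # everything before the first keyword counts as house number
    for token in tokens:
        word = token.lower().rstrip(".")
        if word in STREET_WORDS and current == "house":
            current = "street"
        elif word in SECTION_WORDS and current in ("house", "street"):
            current = "section"
        elif POSTCODE.match(token):
            fields["postcode"].append(token)
            current = "rest"  # city and state are assumed to follow the postcode
            continue
        fields[current].append(token)
    return {name: " ".join(parts) for name, parts in fields.items()}
```

For the comma-free string `No 23 Jalan ABC Taman XYZ 54321 Kuala Lumpur`, this yields house `No 23`, street `Jalan ABC`, section `Taman XYZ`, postcode `54321`, and `Kuala Lumpur` in the `rest` field. City and state are not separated here; that would need a lookup list of known city or state names.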
Upsilon
post Nov 12 2015, 04:47 PM

On my way
****
Senior Member
518 posts

Joined: Jan 2003
From: Subang Jaya



This might be relevant:
http://stackoverflow.com/questions/1116019...into-components

It turns out address parsing is not easy to do; you will find a lot of interesting exceptions along the way that would break whatever magic code you throw at it. If you are not building something at a large scale, you may want to try a service provided by Google for this (not sure if one exists). Otherwise, maybe start collecting some data and apply some machine learning to it to help parse part of the address.

teknokrasi
post Dec 1 2015, 05:22 PM

Getting Started
**
Junior Member
213 posts

Joined: Jul 2010
Use a regex to tokenize the whole address string. The expression will return the tokens as an array.
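
A minimal sketch of that tokenizing step in Python, assuming three token types (runs of digits, runs of letters, and single punctuation characters). The names `TOKEN` and `tokenize` are made up for this example, and the token classes are one possible choice rather than the only one.

```python
import re

# Three token types: digit runs, letter runs, and single punctuation characters.
# Whitespace matches none of the alternatives, so it is skipped.
TOKEN = re.compile(r"\d+|[A-Za-z]+|[^\sA-Za-z\d]")


def tokenize(address):
    """Return the address as a list of tokens."""
    return TOKEN.findall(address)


print(tokenize("No 23 Jalan ABC, Taman XYZ, 54321"))
# ['No', '23', 'Jalan', 'ABC', ',', 'Taman', 'XYZ', ',', '54321']
```

Splitting digits from letters means a house number such as `12A-3/4` comes back as `['12', 'A', '-', '3', '/', '4']`, which then has to be reassembled by whatever rules are applied to the token list.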
