
Anyone understand how web crawlers work?, I understand but won't say I've mastered it

TSCastleFire
post Mar 23 2017, 03:12 PM, updated 8y ago

Getting Started
**
Junior Member
62 posts

Joined: Nov 2016


CODE
def web_crawler(seed):
    tocrawl = [seed]   # frontier: pages still to be fetched
    crawled = []       # pages already fetched
    index = []         # the index we build: [keyword, [urls]] entries

    while tocrawl:
        page = tocrawl.pop()
        if page not in tocrawl:
            content = getpage(page)
            add_page_to_index(index, page, content)
            union(tocrawl, get_all_links(content))
            crawled.append(page)
    return index




CODE
def add_to_index(index, keyword, url):
    # append the url to an existing keyword entry, or create a new one
    for entry in index:
        if entry[0] == keyword:
            entry[1].append(url)
            return
    index.append([keyword, [url]])

def lookup(index, keyword):
    # return the list of urls recorded for a keyword (empty list if none)
    for entry in index:
        if entry[0] == keyword:
            return entry[1]
    return []

def add_page_to_index(index, url, content):
    # index every whitespace-separated word on the page
    words = content.split()
    for word in words:
        add_to_index(index, word, url)


Just putting this up here for discussion. It's in Python.
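The getpage, get_all_links and union helpers aren't shown above; here's a rough sketch of what they could look like, using urllib and a naive regex (purely illustrative, not necessarily how the originals are written):

CODE
import re
from urllib.request import urlopen

def getpage(url):
    # fetch a page; return an empty string on any failure
    try:
        return urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
    except Exception:
        return ""

def get_all_links(content):
    # very naive: absolute href="..." values only
    return re.findall(r'href="(https?://[^"]+)"', content)

def union(a, b):
    # append items from b that are not already in a
    for item in b:
        if item not in a:
            a.append(item)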

This post has been edited by CastleFire: Mar 23 2017, 03:13 PM
workaholic
post Mar 24 2017, 11:09 AM

Getting Started
**
Junior Member
126 posts

Joined: Jun 2007
Crawling a page and parsing the content is easy.
But dealing with anti-crawler measures is tough.
One site with really good anti-crawler protection is anidb.net.

angch
post Mar 24 2017, 12:47 PM

On my way
****
Junior Member
635 posts

Joined: Jul 2006
QUOTE(workaholic @ Mar 24 2017, 11:09 AM)
Crawling a page and parsing the content is easy.
But dealing with anti-crawler measures is tough.
One site with really good anti-crawler protection is anidb.net.
*
What info would you get from crawling anidb.net that you can't get via their API (ugh, too little documentation on it) or their data dumps? https://wiki.anidb.net/w/API

This post has been edited by angch: Mar 24 2017, 12:48 PM
workaholic
post Mar 24 2017, 01:31 PM

Getting Started
**
Junior Member
126 posts

Joined: Jun 2007
QUOTE(angch @ Mar 24 2017, 12:47 PM)
What info would you get from crawling anidb.net that you can't get via their API (ugh, too little documentation on it) or their data dumps? https://wiki.anidb.net/w/API
*
Didn't know they provide an API.
I used to run an anime streaming site & app (before it got shut down within a month) a couple of years back.
I kept hitting their anime page by feeding the AID in the URL parameter; the idea is to pick up the latest anime entries in their database. I did notice they sometimes skip a few numbers; my guess is they add an entry and later remove it. Once I get a valid anime entry for an AID, I store both the AID and an ID I generate for that particular anime in my database, along with the title, episode numbers, cover, description and alternative titles in other common languages such as Japanese (this is very important from an SEO point of view).

The other cronjob crawls the anime list for the latest episode releases and captures the episode title and duration.

I copy the anime stream video links from other popular anime streaming sites.
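A minimal sketch of that AID-sweep idea (the URL pattern, delay and "404" check are assumptions for illustration, not the actual script described above):

CODE
import time
import requests

ANIME_URL = "https://anidb.net/anime/{aid}"   # assumed URL pattern

def sweep_new_entries(start_aid, count=50, delay=5):
    found = []
    for aid in range(start_aid, start_aid + count):
        resp = requests.get(ANIME_URL.format(aid=aid), timeout=10)
        # some AIDs are skipped or removed; treat non-200 responses or a
        # "404"-style page as invalid entries and move on
        if resp.status_code != 200 or "404" in resp.text[:500]:
            continue
        found.append(aid)   # the real job would parse title, episodes, cover, description...
        time.sleep(delay)   # keep the request rate low
    return found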
narf03
post Mar 24 2017, 02:37 PM

Look at all my stars!!
*******
Senior Member
4,545 posts

Joined: Dec 2004
From: Metro Prima, Kuala Lumpur, Malaysia, Earth, Sol


QUOTE(workaholic @ Mar 24 2017, 11:09 AM)
Crawling a page and parsing the content is easy.
But dealing with anti-crawler measures is tough.
One site with really good anti-crawler protection is anidb.net.
*
I thought you can put something on your page to ask Google's crawler to stay out?

Google robots.txt
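A robots.txt only asks well-behaved crawlers to stay away; on the crawler side, Python can honour it with urllib.robotparser, roughly like this (example.com and the paths are placeholders):

CODE
from urllib import robotparser

# a robots.txt like this asks compliant crawlers to keep out of /private/:
#   User-agent: *
#   Disallow: /private/
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()
print(rp.can_fetch("MyCrawler", "https://example.com/private/page.html"))  # False if disallowed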

This post has been edited by narf03: Mar 24 2017, 02:43 PM
TSCastleFire
post Mar 24 2017, 03:25 PM

Getting Started
**
Junior Member
62 posts

Joined: Nov 2016


QUOTE(workaholic @ Mar 24 2017, 11:09 AM)
Crawling a page and parsing the content is easy.
But dealing with anti-crawler measures is tough.
One site with really good anti-crawler protection is anidb.net.
*
Mind sharing your thought process for building a crawler?
cubiclecarbonate
post Mar 24 2017, 04:00 PM

On my way
****
Junior Member
557 posts

Joined: Jul 2011


QUOTE(workaholic @ Mar 24 2017, 01:31 PM)
Didn't know they provide an API.
I used to run an anime streaming site & app (before it got shut down within a month) a couple of years back.
I kept hitting their anime page by feeding the AID in the URL parameter; the idea is to pick up the latest anime entries in their database. I did notice they sometimes skip a few numbers; my guess is they add an entry and later remove it. Once I get a valid anime entry for an AID, I store both the AID and an ID I generate for that particular anime in my database, along with the title, episode numbers, cover, description and alternative titles in other common languages such as Japanese (this is very important from an SEO point of view).

The other cronjob crawls the anime list for the latest episode releases and captures the episode title and duration.

I copy the anime stream video links from other popular anime streaming sites.
*
An active crawler? Mind sharing how you managed the crawling without the target sites noticing?
workaholic
post Mar 24 2017, 07:38 PM

Getting Started
**
Junior Member
126 posts

Joined: Jun 2007
QUOTE(narf03 @ Mar 24 2017, 02:37 PM)
I thought you can put something on your page to ask Google's crawler to stay out?

Google robots.txt
*
Ah yes, you can use that to tell crawlers to keep out of certain pages; blocking a suspicious IP itself has to be done at the server level though.

QUOTE(CastleFire @ Mar 24 2017, 03:25 PM)
Mind sharing your thought process for building a crawler?
*
1) Identify the objective.
2) Write down the logic in steps (for instance: grab the anime page; if the title doesn't contain "404", it's a valid anime entry; proceed to extract the data).
3) Turn that into code (see the sketch below).
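A rough sketch of step 2 turned into code (hypothetical URL and a very crude title check, just to show the shape of it):

CODE
import requests

def fetch_entry(aid):
    html = requests.get(f"https://example.com/anime?aid={aid}", timeout=10).text
    # crude title grab; a real version would use a proper HTML parser
    title = html.split("<title>", 1)[-1].split("</title>", 1)[0] if "<title>" in html else ""
    if "404" in title:
        return None                       # not a valid anime entry
    return {"aid": aid, "title": title}   # valid entry: go on to extract the rest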

QUOTE(cubiclecarbonate @ Mar 24 2017, 04:00 PM)
An active crawler? Mind sharing how you managed the crawling without the target sites noticing?
*
I didn't implement any on-error-alert-me type of logic. For that you could probably do something like: when the extracted content is empty, terminate the process and email the webmaster the URL you were crawling. That usually happens when the site you're crawling changes its structure.
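Something along those lines, as a sketch (the SMTP host and email addresses are placeholders):

CODE
import smtplib
import sys
from email.message import EmailMessage

def alert_and_stop(url):
    # email the webmaster, then terminate the crawl
    msg = EmailMessage()
    msg["Subject"] = "Crawler extracted nothing - site structure may have changed"
    msg["From"] = "crawler@example.com"
    msg["To"] = "webmaster@example.com"
    msg.set_content(f"Extraction came back empty for: {url}")
    with smtplib.SMTP("localhost") as server:
        server.send_message(msg)
    sys.exit(1)

def handle(url, extracted):
    if not extracted:
        alert_and_stop(url)
    # otherwise carry on processing the extracted content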
tjying95
post Mar 25 2017, 11:12 AM

Getting Started
**
Junior Member
54 posts

Joined: Oct 2011


QUOTE(workaholic @ Mar 24 2017, 07:38 PM)
Ah yes, you can use that to tell crawlers to keep out of certain pages; blocking a suspicious IP itself has to be done at the server level though.
1) Identify the objective.
2) Write down the logic in steps (for instance: grab the anime page; if the title doesn't contain "404", it's a valid anime entry; proceed to extract the data).
3) Turn that into code.
I didn't implement any on-error-alert-me type of logic. For that you could probably do something like: when the extracted content is empty, terminate the process and email the webmaster the URL you were crawling. That usually happens when the site you're crawling changes its structure.
*
Make an anime named "404, page not found - the animation" haha…

Dunno if what I made can be considered a crawler; I used it to get specific info from a website for their latest listings. Since they have the data embedded in the webpage, I just grab that and parse it.
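That pattern, roughly (the URL and the JSON variable name are made up; many sites embed their listing data as a JSON blob in a script tag):

CODE
import json
import re
import requests

def latest_listings(url="https://example.com/latest"):
    html = requests.get(url, timeout=10).text
    # look for an embedded blob such as: window.__DATA__ = {...};
    match = re.search(r"window\.__DATA__\s*=\s*(\{.*?\})\s*;", html, re.S)
    if not match:
        return []
    return json.loads(match.group(1)).get("listings", [])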
wKkaY
post Mar 25 2017, 08:03 PM

misutā supākoru
Group Icon
VIP
6,008 posts

Joined: Jan 2003
QUOTE(CastleFire @ Mar 23 2017, 05:12 PM)
Just putting this up here for discussion. It's in Python.
*
If you understand what you wrote, then you understand the basics of crawlers. What you should do next depends on what your goals are. Do you want to learn the math and computer science behind information retrieval, or do you just want to build something using available software?

PS: it should be "if page not in crawled".

 
