Web Dev Matters and Me

Web Development Matters - HTML, XML, C#, .NET, AJAX/JavaScript (jQuery), CSS, XML-XSLT

ME - LIFE, Philippines, Tokyo, ECE, PhilNITS/JITSE, Information Processing, Japanese

Things about the Philippines, gaming, C# development and web development, and how to make money in stock trading


Making A Web Crawler:

I have thought about this many times and wondered: how do Google, Yahoo! and other search engines do this, sniffing out all the information on the net? Does every page have to be submitted to their directory?

Well, constant thinking gave me an idea, and somehow I think I can also make one using C#.


Their algorithm probably runs like this (I could be wrong; this is just an assumption based on what I perceive as possible).


First, the program will check addresses from 1.0.0.0 until it reaches 255.255.255.255.
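
Just to picture that step, here is a minimal C# sketch of walking the whole IPv4 range. The bounds come from the idea above; that real search engines actually probe addresses this way is only my assumption.

using System;

class IpRangeSketch
{
    static void Main()
    {
        // Walk the IPv4 space by treating each address as a 32-bit number.
        for (ulong n = 0x01000000UL; n <= 0xFFFFFFFFUL; n++)   // 1.0.0.0 .. 255.255.255.255
        {
            string ip = string.Format("{0}.{1}.{2}.{3}",
                (n >> 24) & 0xFF, (n >> 16) & 0xFF, (n >> 8) & 0xFF, n & 0xFF);

            // A crawler would then try to request http://<ip>/ for each address.
            if (n % 100000000UL == 0) Console.WriteLine(ip);    // just to show progress
        }
    }
}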
Then, for every page that answers on each IP the bot ticks through, it will check the contents and try to read all the HTML elements, or just the text. After that it will follow any links to other documents, using the root IP. If it detects an error (404 - Not Found), it will try the other parts, and will also try to check the files in every directory. Then it will save the contents to the database, and also check where each link points so it can start a new scan there.
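
To make that fetch-and-follow loop concrete, here is a rough C# sketch of how I imagine it. It is only my own guess, using WebClient and a regular expression instead of whatever the real bots use, and example.com just stands in for any starting page.

using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

class CrawlerSketch
{
    // Remember which URLs were already fetched so the bot does not loop forever.
    static readonly HashSet<string> Visited = new HashSet<string>();

    static void Crawl(string url, int depth)
    {
        if (depth <= 0 || !Visited.Add(url))
            return;

        string html;
        try
        {
            using (var client = new WebClient())
                html = client.DownloadString(url);          // plain GET request for the page
        }
        catch (WebException)
        {
            return;                                         // e.g. 404 Not Found: skip and move on
        }

        // A real crawler would save the page contents to a database here.
        Console.WriteLine("Fetched {0} ({1} chars)", url, html.Length);

        // Read the href targets out of the HTML and follow each link with a new scan.
        foreach (Match m in Regex.Matches(html, "href\\s*=\\s*\"(http[^\"]+)\"", RegexOptions.IgnoreCase))
            Crawl(m.Groups[1].Value, depth - 1);
    }

    static void Main()
    {
        Crawl("http://example.com/", 2);                    // hypothetical starting page
    }
}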


So maybe we sometimes have to submit our sites to these search engines because it could take time for the bots to crawl every page. Another thing these web bots work with is the plain "GET" request, which gets cached.
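
For illustration, a plain GET request in C# looks something like this. The URL is just a placeholder, and reading the Cache-Control header is only my way of showing where the caching part comes in.

using System;
using System.IO;
using System.Net;

class GetRequestSketch
{
    static void Main()
    {
        // The same kind of plain GET request a crawler would send for a page.
        var request = (HttpWebRequest)WebRequest.Create("http://example.com/");   // hypothetical URL
        request.Method = "GET";

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            // The Cache-Control header says whether this response may be cached and for how long.
            Console.WriteLine("Cache-Control: " + response.Headers["Cache-Control"]);
            Console.WriteLine(reader.ReadToEnd().Length + " characters downloaded");
        }
    }
}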



So far that is the fundamental idea, I think. Then come the other things, like meta analysis, checking based on contents, and the black-hat SEO practices that play against each search engine's policies.
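
Just as a rough idea of what reading the meta information could look like, here is a simple regular expression over a made-up page source; a real engine surely does far more than this.

using System;
using System.Text.RegularExpressions;

class MetaTagSketch
{
    static void Main()
    {
        // Hypothetical page source that a crawler has already downloaded.
        string html = "<html><head>" +
                      "<meta name=\"keywords\" content=\"web crawler, C#, search\">" +
                      "<meta name=\"description\" content=\"Notes on building a crawler.\">" +
                      "</head><body>...</body></html>";

        // Pull out the name/content pair of every <meta> tag.
        foreach (Match m in Regex.Matches(html,
                 "<meta\\s+name=\"([^\"]+)\"\\s+content=\"([^\"]+)\"",
                 RegexOptions.IgnoreCase))
        {
            Console.WriteLine("{0}: {1}", m.Groups[1].Value, m.Groups[2].Value);
        }
    }
}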

