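This may sound a little abstract, so take a very simple example. Suppose 29 new articles have been added to a site's paginated directory, so that the newest article from the spider's previous crawl now sits in thirtieth position, and the spider grabs 10 article links at a time. The spider then has to fetch three pages of links before it reaches one it has already seen.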
The principle of the crawl mechanism
For this type of paginated page, the spider mainly works by recording the article links it finds on each crawl and comparing the links found this time with the links found in the past. If the two sets intersect, the crawl has found all of the new articles and the spider can stop fetching the following pages. Otherwise, the crawl has not yet found all of the new articles, and the spider needs to continue to the next page, or the next several pages, until it has.
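The stop condition described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the search engine's actual implementation: `crawl_new_links`, `fetch_page`, `known_links`, and `max_pages` are hypothetical names, and the simulated site (29 new articles listed newest first, 10 links per page) simply mirrors the numbers used in this article's example.

```python
from typing import Callable, List, Set


def crawl_new_links(
    fetch_page: Callable[[int], List[str]],
    known_links: Set[str],
    max_pages: int = 100,
) -> List[str]:
    """Collect unseen article links, stopping at the first index page
    that shares at least one link with the crawl history."""
    new_links: List[str] = []
    for page_no in range(1, max_pages + 1):
        links = fetch_page(page_no)
        if not links:
            break  # no more index pages to read
        new_links.extend(link for link in links if link not in known_links)
        if any(link in known_links for link in links):
            break  # intersection with history: all newer links were seen
    return new_links


# Hypothetical site: 29 new articles listed above 20 previously crawled ones.
PAGE_SIZE = 10
articles = [f"/new/{i}" for i in range(1, 30)] + [f"/old/{i}" for i in range(1, 21)]
history = {f"/old/{i}" for i in range(1, 21)}
fetched_pages: List[int] = []


def fetch_page(page_no: int) -> List[str]:
    """Return the article links shown on one index page (assumed helper)."""
    fetched_pages.append(page_no)
    start = (page_no - 1) * PAGE_SIZE
    return articles[start:start + PAGE_SIZE]


found = crawl_new_links(fetch_page, history)
print(len(found), fetched_pages)  # 29 [1, 2, 3]
```

In this sketch the spider reads only three index pages instead of walking the whole series, because the third page contains a link that is already in its history.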
Why do we need this crawl mechanism?
Determining whether the articles are arranged by publication time is a prerequisite for handling this type of page, as described below. So how does the spider determine whether resources are arranged by publication time? On some pages, each article link is followed by its publication time. The spider collects the times that correspond to the links and checks whether the set is sorted, either from newest to oldest or from oldest to newest. If it is, the page's resources are arranged in order of publication time; if not, they are not. If the page does not show publication times, the spider can judge from the actual publication time of the articles themselves.
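The ordering check described above amounts to testing whether a list of timestamps is monotonic. A minimal sketch, assuming the publication times have already been extracted from the page and parsed into `datetime` values (the function name `is_time_ordered` and the sample dates are hypothetical):

```python
from datetime import datetime
from typing import Sequence


def is_time_ordered(times: Sequence[datetime]) -> bool:
    """Return True if the timestamps run entirely newest-to-oldest
    or entirely oldest-to-newest (ties are allowed)."""
    pairs = list(zip(times, times[1:]))
    descending = all(a >= b for a, b in pairs)
    ascending = all(a <= b for a, b in pairs)
    return descending or ascending


# Made-up publication times in the order the links appear on an index page.
newest_first = [datetime(2014, 5, 3), datetime(2014, 5, 2), datetime(2014, 4, 30)]
shuffled = [datetime(2014, 5, 2), datetime(2014, 5, 3), datetime(2014, 4, 30)]

print(is_time_ordered(newest_first))  # True
print(is_time_ordered(shuffled))      # False
```

A page that fails this check would not qualify for the stop-on-intersection shortcut, since an already-seen link would no longer imply that everything after it is old.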
How to determine whether a page is an ordered paginated page
Most websites use pagination.
The goal of a spider system is to discover and crawl every valuable page on the Internet. Love Shanghai has also officially stated that the spider can only crawl as many valuable resources as possible, keeping the pages in its system consistent with the actual environment while not putting pressure on the site's user experience. In other words, a spider will not crawl every page of a website. Spiders use many crawl strategies to discover resource links as quickly and completely as possible and to improve crawl efficiency. Only in this way can they serve most websites, which is also why we need to take care of a site's link structure. Here the author offers a few observations on just one of these strategies: the spider's crawl mechanism for paginated pages. (This article does not discuss the other crawl mechanisms; it analyzes this single point only.)
Pagination distributes a site's resources in order: when new articles are added, older resources are pushed further back in the page series. For spiders, this kind of index page is an effective crawl channel, but the spider's crawl frequency and the site's update frequency are not the same, so article links are likely to be pushed onto later pages between visits. The spider cannot crawl from the first page to the eightieth every day and then compare every article against its database one by one. That would waste the spider's time and also the time it can spend indexing your site. This special type of paginated page therefore needs an additional crawl mechanism to ensure that all resources are collected.