The magic of search engines starts with crawling. While at first glance Web crawling may appear to be merely an application of breadth-first search, the truth is that there are many challenges, ranging from systems concerns such as managing very large data structures to theoretical questions such as how often to revisit evolving content sources. Web Crawling outlines the key scientific and practical challenges, describes the state-of-the-art models and solutions, and highlights avenues for future work. Web Crawling is intended for anyone who wishes to understand or develop crawler software, or conduct research related to crawling.

So, implementations performing random file accesses perform poorly, but those that perform streaming sequential reads or writes can achieve reasonable throughput. The Mercator crawler leveraged this observation by aggregating many set lookup and insertion operations into a ... the files that together hold the batch are read back into memory one by one and merge-sorted into the main URL hash file ...
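The batching idea in the excerpt above can be sketched in miniature. The class below is a hypothetical illustration, not Mercator's actual implementation: URL hashes are buffered in memory, and each full batch is merged into a sorted on-disk hash file in a single sequential pass (via a streaming merge) rather than through many random file accesses.

```python
import hashlib
import heapq
import os


class BatchedUrlSeenTest:
    """Toy sketch of a batched, disk-friendly URL-seen test.

    All names here are illustrative assumptions. URL hashes accumulate in an
    in-memory batch; when the batch fills, it is merge-sorted into the main
    on-disk hash file using only sequential reads and writes.
    """

    def __init__(self, path, batch_size=1024):
        self.path = path
        self.batch_size = batch_size
        self.batch = set()
        open(path, "a").close()  # ensure the main hash file exists

    def _hash(self, url):
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def add(self, url):
        self.batch.add(self._hash(url))
        if len(self.batch) >= self.batch_size:
            self.flush()

    def flush(self):
        # Stream-merge the sorted batch with the already-sorted main file,
        # deduplicating as we go; both sides are read/written sequentially.
        tmp = self.path + ".tmp"
        with open(self.path) as src, open(tmp, "w") as dst:
            disk_hashes = (line.strip() for line in src)
            last = None
            for h in heapq.merge(disk_hashes, sorted(self.batch)):
                if h != last:
                    dst.write(h + "\n")
                    last = h
        os.replace(tmp, self.path)
        self.batch.clear()

    def seen(self, url):
        h = self._hash(url)
        if h in self.batch:
            return True
        # Sequential scan for simplicity; a real crawler would keep an
        # in-memory index or cache over the sorted file.
        with open(self.path) as f:
            return any(line.strip() == h for line in f)
```

A real system would additionally cache popular URLs in memory and index the sorted file, but the sketch shows the core trade: many random lookups are exchanged for one sequential merge per batch.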
|Author||:||Christopher Olston, Marc Najork|
|Publisher||:||Now Publishers Inc - 2010-01-01|