The 2008 Domain Harvest
During October 2008 the Library will perform a large-scale harvest of the New Zealand internet. We will do this using a ‘web crawler’ to find and download web pages.
The domain harvest will attempt to acquire every publicly accessible website that falls under the .nz country-code top-level domain, as well as certain other websites that are owned by New Zealanders or are legally considered New Zealand publications.
The internet is always changing and uses a myriad of technologies, so it is impossible to make a perfect copy. Despite this, we hope to harvest 100 million URLs during October 2008, giving us a snapshot of the internet at that time.
The harvested web pages will be stored at the Library, and will eventually be made publicly accessible.
OK, normally having a site indexed is not an issue; it's how we find things. But when you blatantly ignore robots.txt, I have an issue. I don't know about others, but I purposely add files and directories to the robots file precisely so they are _not_ indexed or archived.
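For anyone unfamiliar, this is roughly what such an exclusion looks like. The paths below are made-up examples, not from my actual site:

```
User-agent: *
Disallow: /private/
Disallow: /drafts/
```

Any crawler that respects the Robots Exclusion Protocol will skip those paths. The catch, of course, is that robots.txt is purely advisory: it only works if the crawler chooses to honour it.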
Needless to say, did we get any warning at all that our web servers were going to be put under huge load by their bloody robot? I don't recall seeing anything anywhere. I only found the page above via a log entry on my server, while trying to figure out why the load was so high.
Be warned, NZ: even if you have disclaimers stating that your content may not be archived, or entries in robots.txt for files and directories you don't want indexed or archived on Google etc., this crawler will blatantly ignore them and both index and archive your site.
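For contrast, here is a minimal sketch of what a *well-behaved* crawler is supposed to do before fetching anything, using Python's standard `urllib.robotparser`. The rules and URLs are hypothetical examples, not the Library's actual crawler code:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, parsed from in-memory lines so the
# example needs no network access.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A compliant bot checks each URL against the rules before fetching it.
print(rp.can_fetch("AnyBot", "http://example.co.nz/private/report.html"))  # False
print(rp.can_fetch("AnyBot", "http://example.co.nz/index.html"))           # True
```

The complaint here is exactly that this check is optional: nothing stops a harvester from downloading the disallowed paths anyway.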