Saving the web for posterity

I posted here about how knowledge on the web, and on digital media generally, disappears – risking the impoverishment of future historical research.

Just before I could post this follow-up, Jessica anticipated me and commented that I should try Archive.org. Well, guess what – this is all about that.

A recent interview with British Library chief Lynne Brindley in The Guardian discussed some positive efforts to archive the web, notably the San Francisco-based Internet Archive.

In San Francisco, the non-profit Internet Archive automatically scrapes parts of the web and its Wayback Machine allows people to surf back in time to see what their favourite sites looked like as far back as 1996. It already contains three petabytes of data, which equates to more than three million gigabytes.

All well and good. But what the interview doesn’t mention is that the Internet Archive is itself losing digital information.

Way back, I used to know someone called Tim Worman, who became better known as Tim Polecat – lead singer of the UK rockabilly band The Polecats. We lost touch, obviously (I don’t really move in pop star circles), but about 10 years ago, I thought I’d see if he was on the web. 

He was! He had a fun site with all the usual stuff about his interests and current news – which benefited from the fact that he was also a good artist and designer, so it looked pretty cool.

I checked back every so often, but then a few years later was disappointed to find it no longer seemed to exist. Aha – but no: there it was. Archived by the Internet Archive and accessible through its Wayback Machine (though sadly without some of the graphics and MP3 downloads).

I visited occasionally and then – guess what? Yes – his site had vanished from the Internet Archive too. 

The obvious question, then, is what use is an internet archive that just archives for a few years? If Tim Polecat’s site was valuable at all, surely it should be kept in perpetuity. If it’s not actually valuable, then why keep it at all – for any length of time? 

Maybe the Internet Archive scrapes the web automatically and then real people wade through the content it stores to decide what’s valuable and what isn’t – a process that would obviously take a while. So perhaps his site was only archived until someone got a chance to have a look at it and decided it was of no use.

But that undermines the very principle of archiving ephemera that the British Library is so concerned about. After all, it is from some of the most trivial material that we gain some of our most important insights into the lives of ancient peoples. What they considered important at the time is not necessarily what concerns historians today – and we have no idea what future historians will want to know about us.

Filed under Journalism

3 responses to “Saving the web for posterity”

  1. You pointed out some important things. But I also think that the present is way more … present to people than the future. Some might think of the future, but a lot of people only care about their immediate needs. I don’t say that’s good; I only describe my observations.

    History is always one way of showing things. It’s a problem with virtual data: they *are* ephemeral.

  2. Just another thought: couldn’t it have been deleted by accident? There seems to be no rating system… http://www.archive.org/about/about.php

    Or could this be an explanation:
    “How can I remove my site’s pages from the Wayback Machine?

    The Internet Archive is not interested in preserving or offering access to Web sites or other Internet documents of persons who do not want their materials in the collection. By placing a simple robots.txt file on your Web server, you can exclude your site from being crawled as well as exclude any historical pages from the Wayback Machine.”
    (quoted from http://www.archive.org/about/faqs.php#The_Wayback_Machine, 21 April 2009, 16:38)

    • freelanceunbound

      Interesting – I think the owner of the site concerned in this case would not have noticed it was archived. I bet he wouldn’t have made an effort to remove it. But I could be wrong…
