I attended the 2010 General Assembly meeting of the IIPC. These are my notes and observations from the meeting.
What the IIPC is, and Why It's Important that UNT is a Member
The mission of the International Internet Preservation Consortium (IIPC) is to acquire, preserve and make accessible knowledge and information from the Internet for future generations everywhere, promoting global exchange and international relations.
Most of the institutions that make up the IIPC are national libraries. UNT is very privileged to be a part of this important international group that is leading change and innovation in cultural memory organizations.
The 2010 IIPC Venue
The 2010 IIPC General Assembly meeting was held at the National Library Board Singapore. The two-leg flight to Singapore was pretty exhausting. Crossing the international date line going west puts you a full day ahead on the clock; it's the middle of the night in Texas when it's noon in Singapore.
IIPC Preservation Working Group Session
This Session was held on 4 May 2010. For the context of this working group, see the notes on background below.
Presentation on Formats Survey
Clément Oury began the session with a presentation on the formats survey that the WG conducted over the last 12 months. This was WP1 (Work Package 1), a part of the environmental scan work packages that the WG set out to accomplish as part of their work plan for the last year.
Over the last year, the working group surveyed the formats found in web crawls conducted by members (France, Australia, Sweden) in 2005, 2007, and 2009, tracking statistics for the fifty most common format types. They excluded the stats from the Internet Archive because it dwarfed the other collections and obscured the trends they were trying to see. They are planning a paper on these results for the iPres conference, and will also tabulate figures for LC and some other institutions. One finding is that national web archives (for example LC) have a higher proportion of PDF files than other types of web archives. In most cases, html, jpeg, and gif are the top three most common MIME types. They hope to put this information in a format that can be easily and comprehensively shared.
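As a rough illustration of how such tallies are produced, here is a minimal sketch; the records are invented for illustration, and real input would come from something like Heritrix crawl logs or WARC record headers rather than an in-memory list:

```python
# Tally the most common MIME types reported in a crawl log.
# The records below are invented; real input would come from
# Heritrix crawl.log files or WARC record headers.
from collections import Counter

def top_formats(mime_types, n=3):
    """Return the n most common MIME types with their counts."""
    return Counter(mime_types).most_common(n)

# Toy data echoing the survey's finding that html, jpeg, and gif
# dominate most web archives.
records = (["text/html"] * 50 + ["image/jpeg"] * 30 +
           ["image/gif"] * 15 + ["application/pdf"] * 5)
print(top_formats(records))
```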
A member of the WG observes that some of the rows in the spreadsheet we're looking at on the screen appear to be duplicates with slight misspellings of the MIME types (for example, pjpeg vs. jpeg). Clément agrees that we need to investigate whether these are true duplicates or somehow different format representations; he is cautious and prefers to note the issue rather than make merge decisions yet. Harvard suggests that we may wish to compare these stats to format percentages in institutional repositories, where file type characteristics are now changing. The percentage of png increased in 2007 and 2009, but gif is still five times more prevalent than png.
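If such rows do turn out to be true duplicates, collapsing them is a simple normalization step. The alias table below is purely illustrative, not a vetted mapping; as Clément cautions, apparent duplicates need investigation before being merged for real:

```python
# Collapse near-duplicate MIME type spellings before tallying.
# The alias table is illustrative only -- real merges would need the
# investigation the working group calls for.
ALIASES = {
    "image/pjpeg": "image/jpeg",   # progressive-JPEG label some browsers sent
    "image/jpg": "image/jpeg",
    "text/htm": "text/html",
}

def normalize(mime_type):
    """Map a MIME type string to a canonical spelling, if one is known."""
    cleaned = mime_type.strip().lower()
    return ALIASES.get(cleaned, cleaned)

print(normalize("image/pjpeg"))
```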
There was no attempt to analyze format obsolescence, but perhaps notes about the preponderance of file format types can inform discussions of long-term viability. Only shockwave-flash, and not other types of flash, is tracked here (argument ensues as to whether flash is tracked under video MIME types rather than application MIME types, and whether these distinctions are tracked in focused versus general web crawls). Note that JHOVE and DROID report things differently. In terms of designing preservation interventions, we are still at sea and don't know what to base decisions on. Some WG members state flatly that they don't trust the tools at all.

Harvard suggests using their new tool FITS (http://code.google.com/p/fits/), which wraps all the other tools as a meta-identifier. Tobias Steinke of the DN criticizes JHOVE for not scaling to millions of files: each JHOVE run takes several seconds, it is primarily for file type analysis rather than identification, and it can generate 2 GB of XML in analyzing a single file (!). He suggests that we need a quick tool for identification only, rather than complex file type analysis. KB offers a strong critique that relying on JHOVE or DROID is inadequate and questionable because the tools understand a limited number of format types and report different results. There are also questions about the fact that JHOVE is very conservative and often reports that a file is unreadable, or gives ambiguous errors about readability, when the file is perfectly readable. Different institutions should not each have to analyze what these reports are saying in terms of viability and validity. Tobias strongly claims that we need different tools for identification versus validity checking, and that Heritrix should be able to efficiently and quickly identify file formats on ingest and include this information in the WARC file. Clément agrees that JHOVE is absurd for large crawl analysis.
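The "identification only" idea amounts to checking a file's leading magic bytes, which is the kind of fast, shallow check that could run at crawl time. A minimal sketch with a deliberately tiny signature list (this is not a substitute for JHOVE, DROID, or FITS, just an illustration of why identification can be cheap while validation is not):

```python
# Identification by file signature (magic bytes) only -- fast and
# shallow, as opposed to the deep validation JHOVE performs.
# The signature list is deliberately tiny and illustrative.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff",      "image/jpeg"),
    (b"GIF87a",            "image/gif"),
    (b"GIF89a",            "image/gif"),
    (b"%PDF-",             "application/pdf"),
]

def sniff(payload):
    """Identify a payload by its leading magic bytes; None if unknown."""
    for magic, mime in SIGNATURES:
        if payload.startswith(magic):
            return mime
    return None

print(sniff(b"%PDF-1.4 ..."))
```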
This discussion concludes with the agreement to make the survey results on file format types available to all members through the website.
Presentation on Survey of Web Archive Access Software – Interim results
Colin Webb from the National Library of Australia (NLA) reports on this survey, which is still in progress. This was WP3 (Work Package 3), a part of the environmental scan work packages agreed to in last year's work plan; it complements WP1 (the formats survey) and other work packages. They received six (!) completed responses, from BNF, BL, DN, Harvard, KB, and NLA. LC's response is still coming, and they would still like additional responses (reminder: ask Mark and Cathy to respond). The Germans (DN) and Dutch (KB) don't provide access of any kind yet (the Germans because they haven't harvested anything yet!). BNF and some others only provide access in the reading room. Harvard, BL, and NLA provide remote access.
Identified obsolete formats or formats not accessible with software include Vivoactive video files (1997-2001), the X bitmap format (xbm), and e-book formats.
Gina comments on sharing this data in a systematic way, perhaps by making it into a database; she has tested a Survey Monkey implementation. But the question is really more about the survey itself, and whether this is the right information to gather.
Denmark comments on what their librarians saw as inadequate in the survey and asks what its ultimate purpose was. They thought some elements (like the operating system of the computers in their reading room) were not useful, and that we should instead gather more information on browsers and plug-ins. Colin relates that the intent of the survey was to capture information about the environment provided to browse the material, for later reconstruction. Denmark asks: isn't that already done more effectively in the larger web community? Tobias comments that this is more of a reality check on what was going on at a particular time. Colin relates that the information is indeed available in other sources (Wikipedia for one) but requires digging, and this seemed like a simple way to gather it. Raju comments that our expectation is that crawlers act as browsers, so the dependencies of browsers are useful to collect. A long, rambling discussion ensues concerning the needs in this area, and several people acknowledge the uncertainty of long-term rendering and emulation. The discussion highlights the need for a registry of browser requirements as they evolve over time (one was started at one point on a wiki, but then abandoned).
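To make the registry idea concrete, one could imagine entries like the following. The field names, versions, and dependency lists here are invented for illustration; a real registry would need vetted data and a richer model:

```python
# Sketch of the kind of registry entry a browser/dependency registry
# might hold. All field names and values are illustrative, not vetted.
import json

registry = [
    {
        "browser": "Netscape Navigator",
        "version": "4.0",
        "released": "1997",
        "os": ["Windows 95", "Mac OS 8", "Unix"],
        "plugins": ["Shockwave Flash 2", "RealPlayer 4"],
    },
    {
        "browser": "Internet Explorer",
        "version": "6.0",
        "released": "2001",
        "os": ["Windows XP"],
        "plugins": ["Shockwave Flash 5", "Java Plug-in 1.3"],
    },
]

def browsers_for_year(entries, year):
    """Return browsers already released by the given year -- the sort of
    query needed when reconstructing a rendering environment."""
    return [e["browser"] for e in entries if int(e["released"]) <= year]

print(json.dumps(browsers_for_year(registry, 1999)))
```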
BL conveys the idea that we should focus on endangered file types, and that we could collaboratively divide responsibility, with each institution taking on particular formats. By dividing this work and focusing on particular formats as we encounter them individually, perhaps we can get traction on format migration, or at least preserve the formats through one-time rendering. Gina suggests that we should coordinate with the IT History Museum, Computer History Museum, or GDFR on these issues.
Report on Virus Issues
Work is just now beginning. Background research on this topic shows an assumption that materials that go into a digital library are “clean”, free of viruses, because they were created or acquired under controlled conditions. But we know that the Web is like Carnival in Rio for viruses! Accidental and deliberate distribution of malware is common. It is safe to say that there are definitely viruses in web archives and that it is hard to scan for viruses at harvest time. There are at least two perspectives: a) we should clean sites up when we harvest them, or b) the “authentic web” concept: viruses are part of the web and we should harvest it as we find it. Tobias argues that we should only be concerned about the security of our current archival systems, and not worry about the (diminishing) risk of viruses in the long term. Argument ensues about whether this is an issue for the Preservation WG or some other WG. Discussion continues concerning risks; a scenario is given of a ransom virus (?) that locks up data until you pay a ransom; what if this thing gets loose on a preserved dataset? The idea eventually emerges that we don't want to be virus labs; we want virus labs (commercial and research) to do that work, and we should collaboratively use their products.
The group acknowledged that none of us know what to do about viruses, in fact, none of us do anything about viruses in our web archives. Long discussion of the proposed work plan, centering on whether or not to create a taxonomy of risks to web archives from viruses.
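One lightweight approach consistent with the "use the virus labs' products" conclusion would be to flag archived payloads whose hashes match a vendor-supplied malware blocklist, rather than running a full scanner inside the archive. This is purely an illustrative sketch, not something anyone in the session proposed; the blocklist here is fabricated:

```python
# Flag archived payloads whose digest matches a known-malware
# blocklist. The blocklist is fabricated for illustration; in practice
# it would come from an AV vendor or research feed.
import hashlib

def sha256(payload):
    """Hex SHA-256 digest of a payload's bytes."""
    return hashlib.sha256(payload).hexdigest()

FAKE_MALWARE = b"fake malicious payload for illustration only"
BLOCKLIST = {sha256(FAKE_MALWARE)}

def flag_records(payloads):
    """Return indexes of payloads whose digest is on the blocklist."""
    return [i for i, p in enumerate(payloads) if sha256(p) in BLOCKLIST]

archive = [b"<html>benign page</html>", FAKE_MALWARE, b"plain text"]
print(flag_records(archive))
```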
Discussion of Preservation Strategies
Colin leads this discussion. This is WP2 (Preservation Strategies). If we assume that bit preservation is under control (maybe not), then this is the center of our WG focus. There was a study at NLA last year that he will discuss.
He thinks we should seek to describe the criteria of what a successful strategy includes. Hopes to develop a large agenda for discussion over the next few years.
He would like to first discuss the purposes of the IIPC and then the Preservation WG more specifically. We are not trying to create a single solution for all of IIPC; instead, we want a diversity of solutions appropriate for our various members, so members can be informed and choose among options. If there is content that is no longer accessible, are there ways to recover it? We should try to refine our predictions of what will be useful in the future.
Colin recounts some issues identified in the NLA study by Andrew Long. There are currently no good tools for emulation. He is quite convinced that you must know what digital preservation objectives you are seeking to achieve, and measure the success of any activity against those objectives. Migration appears easier than emulation, but again raises the question of what you are trying to achieve, and it is riddled with mines: properties of objects may subtly change in the process of migration. One could summarize the study as offering some insights, but we can't really say whether we've succeeded until we identify what we're trying to achieve, i.e. what constitutes a successful strategy. That would perhaps include an account of how we successfully recover access that has been lost.
We may need a taxonomy of criteria for what constitutes successful digital preservation. Clément mentions that tools have already been created for planning and testing digital preservation. The Planets Testbed is an available infrastructure for testing migration. Questions arise about how broad the remediation solutions are: will they enable fixes not just in the videos themselves but in the structure of the links to the videos? The Plato tool (http://www.ifs.tuwien.ac.at/dp/plato/intro.html) is an example; it is a decision support tool (http://publik.tuwien.ac.at/files/PubDat_170832.pdf) for digital preservation (beyond web archiving). The KEEP project (http://www.keep-project.eu) is mentioned as well.
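The core of this style of decision support is weighted-criteria scoring: rank candidate strategies by how well they meet objectives the institution has declared in advance. The criteria, weights, and scores below are invented for illustration, and this sketch only gestures at what a tool like Plato formalizes:

```python
# Weighted-criteria ("utility analysis") ranking of preservation
# strategies. Criteria, weights, and scores are invented; real
# planning tools like Plato formalize this with far more rigor.
def rank_strategies(scores, weights):
    """Rank strategies by their weighted sum of criterion scores."""
    totals = {
        name: sum(weights[c] * s for c, s in criteria.items())
        for name, criteria in scores.items()
    }
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

weights = {"fidelity": 0.5, "cost": 0.2, "scalability": 0.3}
scores = {
    "migrate-to-png": {"fidelity": 4, "cost": 3, "scalability": 5},
    "emulate-viewer": {"fidelity": 5, "cost": 1, "scalability": 2},
}
print(rank_strategies(scores, weights))
```

The key design point, echoing Colin's argument above, is that the objectives (the weights) must be stated before the strategies can be judged.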
Summary of Session
Clément sums up the session outcomes in terms of the work packages. The discussion was curtailed so we could report back to the rest of the General Assembly.
Background on the Preservation Working Group
The Preservation Working Group (PWG) focus is on policy, practices and resources in support of preserving the content and accessibility of web archives. The PWG aims to understand and report on how approaches used for other kinds of digital resources might be used with web archives, as well as the special characteristics of web archives that might require new approaches. It will provide recommendations for additions or enhancements to tools, standards, practice guidelines, and possible further studies/research.
The Preservation Working Group Mandate
- Characterize large scale web archives in order to:
  - Identify relevant approaches, standards and practices already used for preservation of other digital assets.
  - Report on how they might be used with archived web resources, and/or
  - Identify the gaps and promote new approaches.
- Make recommendations for enhancements or additions to tools, standards, practices, guidelines, testing, and possible further studies/research. These recommendations may be intended for IIPC members, other working groups, institutions and members of the digital preservation community, or tools developers / vendors.
- Propose projects related to web archive preservation to the Steering Committee for IIPC funding.
- Promote recognition of the unique requirements of preserving archived web resources that are not met by other digital preservation programs.
The PWG will continue in its work until standards and best practices for the preservation of archived web resources are developed and implemented across institutions.
2009-2010 Work Packages
- WP1: Tools gap analysis for formats/Study Scalability. Goal:
- List the main formats available in web archives.
- Test the ability of the identification / characterization tools to handle them.
- Make tools enhancements recommendations for most important formats.
- WP2: Preservation Strategies. Goal:
- Analyze and compare different preservation strategies for web archives.
- Provide metrics and costs (time, machines, workforce…) and analyze results.
- WP3: Browsers/Dependencies: Goal:
- List and describe the main browsers and plug-ins over the course of time.
- Analyze their dependencies (OS, hardware).
- WP4: Software Documentation Harvesting. Goal:
- Harvest main software vendors’ websites to preserve information on how software should be installed and used (e.g. user manuals…).
- Test if software (if freely available) may be archived as well.
- Analyze if it is possible to do this in a collaborative way.
- WP5: Crawler Documentation. Goal:
- List and describe the crawlers that were and are used to build web archives.
- Identify possible idiosyncratic features of the files they produce. (e.g. Heritrix website…).
- WP6: Viruses. Goal:
- Assess the risk of keeping viruses in web archives.
- Provide scenarios to identify and discard viruses.
- If recognized necessary, set up a project to encourage one or several AV tools to manage WARC files.
- WP7: Information Packages. Goal:
- Build scenarios to design information packages (in the sense of the OAIS model) according to institutions’ content policies and preservation goals. It will notably encompass what kind of metadata to use, their level of granularity, and their location.
- The goal is to collect different perspectives and create a generic model, similar to the preservation workflows work package.
- WP8: Risk Assessment Review. Goal:
- Update the PWG “table of threats”. Identify and evaluate specific risks to web archive preservation.
Notes for Estonia:
Smaller countries that aren’t involved otherwise.
Briefings, for example to Open Planets on PLNs, things they don’t know about.