
🕷 Web Archiving 🕷

2 / 28

Web Archiving

What? Why? How?

3 / 28

What

Web Archiving is collecting portions of the World Wide Web to ensure the information is preserved in an archive for the future.

4 / 28

Why

5 / 28

How

harvesting

crawling sites and saving them

database preservation

retrieving and storing the database of database-driven websites

transactional archiving

storing during the act of transferring data

6 / 28

Web curation

What is it?

Web curation is a discipline in its own right within web archiving. It covers the decisions about what to save, how to save it, and how to define collections and content, along with what can and cannot be seen and how to scope collection projects adequately.

7 / 28

WARC

  • WARC (Web ARChive) files bundle the harvested digital resources together with related information about them.

The specification is standardized by ISO (as ISO 28500), so the official text sits behind a paywall, which really sucks, but here it is. A very close equivalent can be found here.

8 / 28

WARC goals

  • Ability to store both the payload content and control information from mainstream Internet application layer protocols, such as HTTP, DNS, and FTP.
  • Ability to store arbitrary metadata linked to other stored data (e.g., subject classifier, discovered language, encoding).
  • Support for data compression and maintenance of data record integrity.
  • Ability to store all control information from the harvesting protocol (e.g., request headers), not just response information.
  • Ability to store the results of data transformations linked to other stored data.
  • Ability to store a duplicate detection event linked to other stored data (to reduce storage in the presence of identical or substantially similar resources).
  • Ability to be extended without disruption to existing functionality.
  • Support handling of overly long records by truncation or segmentation where desired.
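
The compression goal is commonly met by gzipping each record on its own and concatenating the results, so a single record can be read back from a known byte offset. The sketch below illustrates that idea with placeholder bytes standing in for real records; the variable names are made up for illustration.

```python
import gzip

# Placeholder bytes standing in for serialized WARC records (assumption).
records = [b"first record", b"second record"]

offsets = []      # byte offset of each compressed record in the output
archive = b""
for record in records:
    offsets.append(len(archive))
    archive += gzip.compress(record)  # one gzip member per record

# Reading from a stored offset recovers just that record.
second = gzip.decompress(archive[offsets[1]:])

# Decompressing the whole file yields every record in order.
everything = gzip.decompress(archive)
```

Storing the offsets is what lets an index point straight at one record without decompressing the whole file.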
9 / 28

How does WARC work?

  • container file format (different from but not unlike the more-ubiquitous .zip)
  • content can be in any format

10 / 28

What does a WARC contain?

A WARC file (.warc) contains one or more WARC records.
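
To make the idea of a record concrete, here is a minimal sketch of how one record could be serialized: a version line, named fields, a blank line, the content block, and two trailing line breaks. The function name `build_record` and the example.com URI are assumptions for illustration, and real tools handle many more details.

```python
import uuid
from datetime import datetime, timezone
from typing import Dict, Optional


def build_record(warc_type: str, block: bytes,
                 extra_fields: Optional[Dict[str, str]] = None) -> bytes:
    """Serialize one WARC record (illustrative sketch, not a full writer)."""
    fields = {
        "WARC-Type": warc_type,
        "WARC-Record-ID": f"<urn:uuid:{uuid.uuid4()}>",
        "WARC-Date": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "Content-Length": str(len(block)),
    }
    fields.update(extra_fields or {})
    header = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in fields.items())
    # A blank line separates header from block; two CRLFs end the record.
    return header.encode("utf-8") + b"\r\n" + block + b"\r\n\r\n"


record = build_record(
    "resource",
    b"hello archive",
    {"WARC-Target-URI": "http://example.com/", "Content-Type": "text/plain"},
)
```

A .warc file is then just such records written one after another.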

11 / 28

WARC records

There are eight types of WARC record:

  • warcinfo
  • response
  • resource
  • request
  • metadata
  • revisit
  • conversion
  • continuation

and each record in the WARC file must have one of these types.

12 / 28

WARC records: warcinfo

  • warcinfo: should be the first record, should occur only once, and describes the other records in the file

fields:

  • operator
  • software
  • robots
  • hostname
  • ip
  • http-header-user-agent
  • http-header-from
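
As a rough sketch, the body of a warcinfo record can be thought of as simple `name: value` lines built from the fields above. Every value below is a made-up placeholder, and `warcinfo_block` is a hypothetical helper name.

```python
from typing import Dict


def warcinfo_block(fields: Dict[str, str]) -> bytes:
    """Render warcinfo fields as 'name: value' lines (illustrative sketch)."""
    return "".join(f"{name}: {value}\r\n" for name, value in fields.items()).encode("utf-8")


# All values are placeholders, not taken from any real crawl.
block = warcinfo_block({
    "operator": "Example Operator",
    "software": "example-crawler/0.1",
    "robots": "obey",
    "hostname": "crawler.example.org",
    "ip": "192.0.2.10",
    "http-header-user-agent": "example-crawler/0.1 (+http://example.org/bot)",
    "http-header-from": "archivist@example.org",
})
```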
13 / 28

WARC records: response

This is where the HTTP or HTTPS response goes, in its complete form, as defined by the record type and URI scheme of the target-URI.

14 / 28

WARC records: resource

This is where the content goes, which is the same as the stuff in the response record but without the transfer information.

15 / 28

WARC records: request

The request record contains the request and network protocol information.

16 / 28

WARC records: response/resource/request

That's right, the response contains the full response, the resource contains only the content information, and the request contains only the request that was sent, with its network information. This response/resource pattern is similar to HTTP-based query payloads: a response carries both the transfer information and the content in one payload, while a resource carries the content alone.
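
The difference can be sketched with a hand-written HTTP response: the whole byte string is what a response record would hold, while only the part after the headers is what a resource record would hold. The response bytes and the helper name `split_http_response` are invented for illustration.

```python
from typing import Tuple

# A made-up HTTP response, as it might come off the wire.
raw_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"<p>hello</p>\n"
)


def split_http_response(raw: bytes) -> Tuple[bytes, bytes]:
    """Split at the first blank line into (protocol headers, payload)."""
    head, _, payload = raw.partition(b"\r\n\r\n")
    return head, payload


headers, payload = split_http_response(raw_response)

response_block = raw_response  # response record: headers plus content
resource_block = payload       # resource record: content only
```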

17 / 28

WARC records: metadata

Information describing the resource beyond what can be found in the other records.

18 / 28

WARC records: revisit

This record is used when a site is visited again. Instead of storing the content a second time, it records the revisit and how the site now compares with the earlier capture, for example that nothing has changed. This is mostly for saving space.

Some revisit profiles are "Identical Payload Digest" or "Server Not Modified".
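
The "Identical Payload Digest" idea can be sketched as follows: hash each payload, and if the same digest has been seen before, write a revisit that points at the earlier record instead of storing the bytes again. The function names, the in-memory `seen` dictionary, and the record IDs are assumptions for illustration; the `sha1:` plus Base32 form is a common convention for digest values.

```python
import base64
import hashlib
from typing import Dict, Optional, Tuple


def payload_digest(payload: bytes) -> str:
    """Return a labelled SHA-1 digest in Base32, e.g. 'sha1:...'."""
    return "sha1:" + base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")


def choose_record_type(payload: bytes, record_id: str,
                       seen: Dict[str, str]) -> Tuple[str, Optional[str]]:
    """Return ('revisit', earlier_id) for a known payload, else ('response', None)."""
    digest = payload_digest(payload)
    if digest in seen:
        return "revisit", seen[digest]
    seen[digest] = record_id  # remember the first capture of this payload
    return "response", None


seen: Dict[str, str] = {}
first = choose_record_type(b"same page", "<urn:uuid:capture-1>", seen)
second = choose_record_type(b"same page", "<urn:uuid:capture-2>", seen)
```

In a real revisit record the earlier record's ID would go in the WARC-Refers-To field.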

19 / 28

WARC records: conversion

If the content has been transformed or changed in any way, that will be tracked here.

20 / 28

WARC records: continuation

If a record is too big to fit in one WARC file and has to be broken into segments spread across multiple files, the later segments are stored here.
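
A rough sketch of segmentation, assuming a made-up size limit: the first segment keeps the original record type, later segments become continuation records that point back to the first, and the last one notes the total length. `segment_block` and the example ID are hypothetical names, and only the segment-related fields are shown.

```python
from typing import Dict, List, Tuple


def segment_block(block: bytes, max_size: int, warc_type: str,
                  origin_id: str) -> List[Tuple[Dict[str, str], bytes]]:
    """Plan the segments of an oversized block as (fields, chunk) pairs."""
    chunks = [block[i:i + max_size] for i in range(0, len(block), max_size)]
    plans = []
    for number, chunk in enumerate(chunks, start=1):
        fields = {
            "WARC-Type": warc_type if number == 1 else "continuation",
            "WARC-Segment-Number": str(number),
        }
        if number > 1:
            # Later segments refer back to the record that started the series.
            fields["WARC-Segment-Origin-ID"] = origin_id
        if number > 1 and number == len(chunks):
            fields["WARC-Segment-Total-Length"] = str(len(block))
        plans.append((fields, chunk))
    return plans


plans = segment_block(b"abcdefghij", 4, "response", "<urn:uuid:origin-1>")
```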

21 / 28

WARC record fields

WARC records have fields! Those can be:

WARC-Record-ID (required)
Content-Length (required)
WARC-Date (required)
WARC-Type (required)
Content-Type
WARC-Concurrent-To
WARC-Block-Digest
WARC-Payload-Digest
WARC-IP-Address
WARC-Refers-To
WARC-Refers-To-Target-URI
WARC-Refers-To-Date
WARC-Target-URI
WARC-Truncated
WARC-Warcinfo-ID
WARC-Filename
WARC-Profile
WARC-Identified-Payload-Type
WARC-Segment-Number
WARC-Segment-Origin-ID
WARC-Segment-Total-Length
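
To see the fields in action, here is a small sketch that reads a record header and reports which of the four required fields are missing. The sample header is invented, and the parser is simplified: it treats field names as case-sensitive and ignores folded lines.

```python
from typing import Dict, List, Tuple

REQUIRED = ("WARC-Record-ID", "Content-Length", "WARC-Date", "WARC-Type")


def parse_header(header_text: str) -> Tuple[str, Dict[str, str]]:
    """Split a record header into its version line and a dict of fields."""
    lines = header_text.split("\r\n")
    fields: Dict[str, str] = {}
    for line in lines[1:]:
        if not line:
            break  # a blank line ends the header
        name, _, value = line.partition(":")  # split on the first colon only
        fields[name.strip()] = value.strip()
    return lines[0], fields


def missing_required(fields: Dict[str, str]) -> List[str]:
    return [name for name in REQUIRED if name not in fields]


# Invented header that deliberately leaves out WARC-Date.
sample = (
    "WARC/1.0\r\n"
    "WARC-Type: resource\r\n"
    "WARC-Record-ID: <urn:uuid:example-1>\r\n"
    "Content-Length: 13\r\n"
    "\r\n"
)
version, fields = parse_header(sample)
missing = missing_required(fields)
```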
22 / 28

WARC records and record fields

What's the point of knowing about all of these fields?

You normally won't have to interact with them or inspect them. Still, knowing which fields are associated with the records inside a WARC file gives insight into the kinds of information WARC files can hold, what will be passed on to those accessing the files in the future, and what kinds of software can be built to render WARC files.

23 / 28

Make your site archive-friendly

  • Follow web standards and accessibility guidelines.
  • Be careful with robots.txt exclusions.
  • Use a site map, transparent links, and contiguous navigation.
  • Maintain stable URIs and redirect when necessary.
  • Consider using a Creative Commons license.
  • Use sustainable data formats.
  • Embed metadata, especially the character encoding.
  • Use archiving-friendly platform providers and content management systems.

From the Library of Congress Guide to Creating Preservable Websites
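
One way to check what a robots.txt exclusion would hide from an archiving crawler is Python's standard `urllib.robotparser`. The rules, user agent name, and URLs below are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt that blocks one directory for every crawler.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# "example-archiver" is a hypothetical crawler name.
blocked = parser.can_fetch("example-archiver", "http://example.org/private/report.html")
allowed = parser.can_fetch("example-archiver", "http://example.org/about.html")
```

Here `blocked` comes out false and `allowed` true, so anything under /private/ would be missing from a capture made by a crawler that obeys robots.txt.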

24 / 28

Is your site archive-friendly?

Use Archive Ready to see,

or use an existing archiving tool.

Like what? Glad you asked...

25 / 28

Tools

Archiving lots of stuff

Archiving individual sites

26 / 28

Learning more

Home

28 / 28
