A Persistent Web IDentifier (PWID) URN Namespace
draft-pwid-urn-specification-00

The information below is for an old version of the document.
Document Type
This is an older version of an Internet-Draft whose latest revision state is "Expired".
Author Eld Maj-Britt Olmuetz Zierau
Last updated 2017-12-08
Internet Engineering Task Force                           E. Zierau, Ed.
Internet-Draft                                  The Royal Danish Library
Intended status: Informational                          December 7, 2017
Expires: June 10, 2018

            A Persistent Web IDentifier (PWID) URN Namespace
                    draft-pwid-urn-specification-00

Abstract

   This document specifies a Uniform Resource Name (URN) for Persistent
   Web IDentifiers to web material in web archives using the 'pwid'
   namespace identifier.  The purpose of the standard is to support
   general, global, sustainable, human-readable, technology-agnostic,
   persistent and precise web references for web materials.

   The PWID URN can assist in two ways.  First, it provides a
   potentially resolvable, precise and persistent reference scheme for
   documents, a need that is not sufficiently covered by existing web
   reference practices.  Second, it provides a standardized way to
   specify the web elements in a web collection, also known as a web
   corpus.  Definitions of web collections are often needed for
   extraction of the data used in the production of research results,
   e.g. for evaluations in the future.  Current practices are not
   persistent, as they often rely on some version of the CDX index
   format, which varies between implementations.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on June 10, 2018.

Zierau                    Expires June 10, 2018                 [Page 1]
Internet-Draft   A Persistent Web IDentifier (PWID) URN Namespace   December 2017

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Namespace Registration Template
   3.  Acknowledgements
   4.  References
     4.1.  Normative References
     4.2.  Informative References
   Author's Address

1.  Introduction

   The purpose of the PWID URN is to represent general, global,
   sustainable, human-readable, technology-agnostic, persistent and
   precise web archive resource references - in a way that can be used
   for technical solutions.

   The motivation for defining a PWID namespace is the growing challenge
   of referencing web resources, both when referencing web resources
   from papers and when defining a web collection/corpus.  This
   challenge calls for action now, and the PWID can provide it as a URN,
   although the prefixing with "urn:" is not ideal.  In detail, the
   challenges are:

   o  Citation guidelines generally do not cover general and persistent
      referencing techniques for web resources that are not registered
      by Persistent Identifier systems (like DOI [DOI]).  However, an
      increasing number of references point to resources that only exist
      on the web, e.g. blogs that turned out to have a historical
      impact.  In order to obtain persistency for a reference, the
      target needs to be stable.  As the live web is 'alive' and in
      constant change, persistency can only be obtained by referring to
      archived snapshots of the web.  The PWID URN is therefore focused
      on referencing archived web material in a technology agnostic way
      (research documented in [IPRES] and [ResawRef]).

   o  There are many different requirements for construction of
      collection definitions for web material besides precision and
      persistency.  Recent research has found that various legal and
      sustainability issues lead to a need for a collection to be
      defined by references to the web parts in the collection.  The
      PWID URN is needed in such definitions in order to fulfil these
      requirements and to enable a collection to cover web materials
      from several archives (research documented in [ResawColl]).

   For the sake of usability and sustainability, the definition of the
   PWID URN is focused on having only the minimum information required
   to make a precise identification of a resource in an arbitrary web
   archive.  Recent research has found that this is obtained with the
   following information [ResawRef]:

   o  Identification of web archive

   o  Identification of source:

      *  Archived URI or identifier

      *  Archival timestamp

   o  Intended coverage (page, part, subsite etc.)

   The PWID URN represents this information in an unambiguous way, thus
   enabling technical solutions to be built on this URN.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

2.  Namespace Registration Template

   Namespace Identifier:

      PWID

   Version:

      1

   Date:

      2017-12-08

   Registrant:

      Eld Maj-Britt Olmuetz Zierau
      The Royal Danish Library
      Soeren Kierkegaards Plads 1
      1219 Copenhagen
      Denmark
      ph: +45 9132 4690
      email: elzi@kb.dk

   Purpose:

      The purpose of the PWID URN is to represent general, global,
      sustainable, human-readable, technology-agnostic, persistent and
      precise web archive resource references - in a way that can be
      used for technical solutions.  The standard is needed to reference
      web materials with a precision and persistency on par with
      traditional references for analogue material.

      The scope and applicability of the PWID URN cover both references
      to archived web resources from literature and the definition of
      the parts of an archived web collection.  A strict, unambiguous
      syntax is needed in order to ensure that it can be used for
      computational purposes.  This is relevant for web collection
      definitions, which need a strict syntax in order to be a basis for
      automatic extraction.  Furthermore, readers of research papers
      today expect to be able to access a referenced resource by
      clicking an actionable URI.  A similar facility will therefore be
      expected for references to available archived web material.

      Interest in the PWID has already been shown.  A paper about the
      invention of the PWID, "Persistent Web References - Best Practices
      and New Suggestions" [IPRES], was accepted for the iPres 2016
      conference and nominated as best paper.  At the RESAW 2017
      conference there were two related papers: one on referencing
      practices [ResawRef] and one on research data management practices
      [ResawColl].  The interest in the PWID so far indicates that this
      is a recognized issue, and that the PWID can fill a gap.

      The PWID URN is to represent this information in a way that can be
      used for technical solutions, for example for resolving
      references and for automatic extraction of a web collection
      defined by PWID URNs [ResawRef] [ResawColl].  Different
      implementations for resolving may emerge, and they may rely on
      different protocols and applications.  As a start, there is work
      on a prototype that can work for the Danish Netarkivet data and
      for open web archives with standard patterns for the current
      technologies.  Further work will be done to make this more
      internationally integrated.

      The ambition is to make the PWID URN namespace definition a
      constituent part of a standard being developed in the IETF or some
      other recognized standards body.  The textual version of the PWID
      (see Additional Information) is already included in the draft
      revision of the ISO 690 reference standard.

   Syntax:

      The syntax of the PWID URN is specified below in Augmented Backus-
      Naur Form (ABNF) RFC 5234 [RFC5234] and it conforms to URN syntax
      defined in RFC 8141 [RFC8141].  The syntax definition of the PWID
      URN is:

         pwid-urn        = "urn" ":" pwid-NID ":" pwid-NSS

         pwid-NID        = "pwid"
         pwid-NSS        = archive-id ":" archival-time ":"
                           coverage-spec ":" archived-item

         archive-id      = 1*( unreserved )

         archival-time   = full-date datetime-delim full-pwid-time
         datetime-delim  = "T"
         full-pwid-time  = time-hour [":"] time-minute [":"]
                           time-second "Z"

         coverage-spec   = "part" / "page" / "subsite" / "site"
                           / "collection" / "recording" / "snapshot"
                           / "other"

         archived-item     = URI / archived-item-id
         archived-item-id  = 1*( unreserved )

      where

      *  'unreserved' is defined as in RFC 3986 [RFC3986]

      *  'coverage-spec' values are not case sensitive (i.e. "PAGE",
         "PART", "PaGe", ... are valid values as well).

      *  'archival-time' is a UTC timestamp conforming to the W3C
         profile of ISO 8601 [ISO8601] (also defined in RFC 3339
         [RFC3339]), with a few exceptions.  It has to be a UTC
         timestamp in order to conform with web archiving practices,
         which always use UTC in order to avoid confusion.  The
         'full-date' is defined as in RFC 3339 [RFC3339].  The
         'archival-time' must represent the time specified in the
         archive, and can therefore be specified at any of the levels of
         granularity described in [W3CDTF] and in accordance with the
         WARC standard ISO 28500 [ISO28500].

         In line with RFC 3339 [RFC3339] the "T" may alternatively be
         lower case "t".

         'time-hour', 'time-minute' and 'time-second' are defined as in
         RFC 3339 [RFC3339].

         In line with RFC 3339 [RFC3339] the "Z" may alternatively be
         lower case "z".

      *  'URI' is defined as in RFC 3986 [RFC3986]
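
      As an illustration of the 'archival-time' rule, the following is
      a minimal Python sketch, not part of the specification.  It
      follows the ABNF as written (full date and full time, optional
      ":" separators, "T"/"t" and "Z"/"z") and does not attempt the
      coarser levels of granularity mentioned above.  The names
      ARCHIVAL_TIME_RE and parse_archival_time are assumptions made for
      this example.

```python
import re
from datetime import datetime, timezone

# Mirrors the 'archival-time' rule: full-date, "T" or "t", then
# hour/minute/second with optional ":" separators, then "Z" or "z".
ARCHIVAL_TIME_RE = re.compile(
    r"(\d{4})-(\d{2})-(\d{2})[Tt](\d{2}):?(\d{2}):?(\d{2})[Zz]"
)


def parse_archival_time(text: str) -> datetime:
    """Return the timestamp as an aware UTC datetime, or raise ValueError."""
    match = ARCHIVAL_TIME_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid archival-time: {text!r}")
    year, month, day, hour, minute, second = (int(g) for g in match.groups())
    # datetime() rejects impossible values such as month 13.  It also
    # rejects second 60, which RFC 3339 permits for leap seconds; this
    # sketch accepts that limitation.
    return datetime(year, month, day, hour, minute, second,
                    tzinfo=timezone.utc)
```

      With this sketch, "2016-01-22T11:20:29Z" and "2016-01-22t112029z"
      parse to the same point in time.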

      The 'coverage-spec' defines the type of archived item, making
      precise what is referred to:

      *  part
         the single archived element, e.g. a PDF, an HTML text, an image

      *  page
         the full context as a page, e.g. an HTML page with referred
         images

      *  subsite
         the full context as a subsite within its domain, e.g. a
         document represented in a web structure

      *  site
         the full context as a site within its domain

      *  collection
         a collection/corpus definition, e.g. defined as described in
         [ResawColl]

      *  snapshot
         a snapshot (image) representation of web material, e.g. a web
         page

      *  recording
         a recording of a web browsing session

      *  other
         anything else
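
      The eight 'coverage-spec' values and their case-insensitive
      matching can be sketched in Python as follows.  This is an
      illustration only; the names Coverage and parse_coverage are
      assumptions made for this example.

```python
from enum import Enum


class Coverage(Enum):
    """The 'coverage-spec' values, stored in lower case."""
    PART = "part"
    PAGE = "page"
    SUBSITE = "subsite"
    SITE = "site"
    COLLECTION = "collection"
    RECORDING = "recording"
    SNAPSHOT = "snapshot"
    OTHER = "other"


def parse_coverage(text: str) -> Coverage:
    """Map a 'coverage-spec' value to a Coverage member, ignoring case.

    Raises ValueError for anything outside the eight defined values.
    """
    return Coverage(text.lower())
```

      Lower-casing before the lookup is one simple way to treat "PAGE",
      "PaGe" and "page" as the same value.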

      Note that the 'coverage-spec' is a parameter that could have been
      specified as a query.  However, since the 'pwid-urn' can include
      a URI as 'archived-item', it would introduce ambiguities if the
      'coverage-spec' were specified as a query, since it would not be
      clear whether the query belonged to the 'pwid-urn' or the
      'archived-item'.
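
      Because both 'archival-time' and a URI used as 'archived-item'
      may contain ":", a PWID URN cannot be taken apart by simply
      splitting on ":".  The following Python sketch, given for
      illustration only, uses a regular expression that follows the
      syntax above.  The names Pwid, PWID_RE and parse_pwid are
      assumptions made for this example, and the 'archived-item' is
      accepted as is, without validation against RFC 3986.

```python
import re
from typing import NamedTuple


class Pwid(NamedTuple):
    archive_id: str
    archival_time: str
    coverage: str
    archived_item: str


# 'unreserved' per RFC 3986: ALPHA / DIGIT / "-" / "." / "_" / "~".
PWID_RE = re.compile(
    r"urn:pwid:"
    r"(?P<archive_id>[A-Za-z0-9._~-]+):"
    r"(?P<archival_time>\d{4}-\d{2}-\d{2}[Tt]\d{2}:?\d{2}:?\d{2}[Zz]):"
    r"(?P<coverage>part|page|subsite|site|collection|recording"
    r"|snapshot|other):"
    r"(?P<archived_item>.+)",
    re.IGNORECASE,  # "urn", the NID and 'coverage-spec' ignore case
)


def parse_pwid(urn: str) -> Pwid:
    """Split a PWID URN into its four parts, or raise ValueError."""
    match = PWID_RE.fullmatch(urn)
    if match is None:
        raise ValueError(f"not a PWID URN: {urn!r}")
    return Pwid(
        archive_id=match["archive_id"],
        archival_time=match["archival_time"],
        coverage=match["coverage"].lower(),
        # Everything after the coverage, including any query, belongs
        # to the archived item.
        archived_item=match["archived_item"],
    )
```

      In this sketch any query component stays attached to the
      'archived-item', which reflects the reasoning in the note above.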

   Assignment:

      There are no authorities for assigning PWID URNs to resources.
      Instead, the name is given by the syntax and is assigned according
      to the:

      *  Identification of web archive

      *  Identification of source:

         +  Archived URI or identifier

         +  Archival timestamp

      *  Intended coverage (page, part, subsite etc.)

      Therefore, the PWID URNs are created independently, but following
      an algorithm that itself guarantees uniqueness.

      The name will always be unique, as the only way a clash could
      arise would be that a web archive ceases to exist and, over time,
      another web archive gets the same name space and has resources
      with the same name and the same archiving timestamp, but with
      different contents.  This is a highly unlikely scenario.
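
      Since no authority is involved, anyone who knows the four parts
      can compose the name.  The following Python sketch illustrates
      this under stated assumptions: make_pwid is a hypothetical helper,
      it does not validate its inputs, and it only shows that the same
      parts always yield the same name.

```python
from datetime import datetime, timedelta, timezone


def make_pwid(archive_id: str, archived_at: datetime,
              coverage: str, archived_item: str) -> str:
    """Compose a PWID URN from its four parts (hypothetical helper)."""
    if archived_at.tzinfo is None:
        raise ValueError("archived_at must carry a time zone")
    # The syntax only allows UTC, so convert before formatting.
    stamp = archived_at.astimezone(timezone.utc).strftime(
        "%Y-%m-%dT%H:%M:%SZ")
    return f"urn:pwid:{archive_id}:{stamp}:{coverage.lower()}:{archived_item}"


# The same capture time expressed in UTC and in UTC+1 gives the same name.
utc_time = datetime(2016, 1, 22, 11, 20, 29, tzinfo=timezone.utc)
cet_time = datetime(2016, 1, 22, 12, 20, 29,
                    tzinfo=timezone(timedelta(hours=1)))
assert (make_pwid("archive.org", utc_time, "page", "http://www.dr.dk")
        == make_pwid("archive.org", cet_time, "PAGE", "http://www.dr.dk"))
```

      Normalising the timestamp to UTC and the coverage to lower case
      is a design choice of this sketch, made so that equivalent inputs
      produce identical strings.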

   Security and Privacy:

      Security and privacy considerations are restricted to accessible
      web resources in web archives.  If resolvers for PWID URNs are
      created, an analysis should be made of whether they can be
      restricted to a registry of web archives.  Security and privacy
      will then be a question of the security and privacy
      considerations related to the web archive resources.

   Interoperability:

      This is covered by comments in the Syntax description, where the
      'archival-time' conforms to the W3C profile of ISO 8601 [ISO8601]
      (also defined in RFC 3339 [RFC3339]) and the use of dates conforms
      to the WARC standard ISO 28500 [ISO28500], using UTC dates only.
      Furthermore, when the 'archived-item' is a URI, it conforms to the
      URI standard defined in RFC 3986 [RFC3986].

   Resolution:

      The information in a PWID URN can be used for locating a web
      archive resource, for any kind of web archive.  It includes the
      minimum information for web archive materials, which enables
      resolvability, manually or by a resolver.  The plan is to develop
      a resolving service, but this is only in a prototype form at the
      moment.

      Resolution of a PWID URN is the primary motivation for making a
      formal URN definition, instead of just a textual representation of
      the four needed parts of a PWID:

      *  Web archive identification
         to find the archive holding the material

      *  Archived URI or identifier of item
         as part of identifying the material

      *  Date and time associated with the archived URI/item
         as part of precise identification of the material

      *  Coverage of what is referred
         as part of clarification of what the referred material covers
         (page, part etc.)

      In the following, the different resolution techniques are
      explained (manual as well as via a service).  As an example, the
      PWID URN

         urn:pwid:archive.org:2016-01-22T11:20:29Z:page:http://www.dr.dk

      has the information:

      *  archive.org
         currently known identifier in the form of the Internet Archive
         domain name for their open access web archive

      *  2016-01-22T11:20:29Z
         UTC date and time associated with the archived URI

      *  page
         clarification that the reference covers the full web page with
         all its inherited parts selected by the web archive

      *  http://www.dr.dk
         archived URI of item

      With knowledge of the current (2017) Internet Archive open access
      web interface, which has the form:

         https://web.archive.org/web/<time>/<uri>

      we can manually (or technically) deduce an actual (current 2017)
      https access address:

         https://web.archive.org/web/20160122112029/http://www.dr.dk

      and regard the web page found there as the referenced resource.
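
      The manual deduction above can be sketched in Python as follows.
      This is an illustration only: the function name wayback_url is an
      assumption, and the address pattern is the 2017 Internet Archive
      pattern quoted above, which may change over time.

```python
import re

WAYBACK_PREFIX = "https://web.archive.org/web/"


def wayback_url(archival_time: str, archived_item: str) -> str:
    """Build a Wayback-style address from two parts of a PWID URN."""
    # Keep only the digits: 2016-01-22T11:20:29Z -> 20160122112029
    digits = re.sub(r"\D", "", archival_time)
    if len(digits) != 14:
        raise ValueError(f"expected a full timestamp: {archival_time!r}")
    return f"{WAYBACK_PREFIX}{digits}/{archived_item}"
```

      For the example PWID URN, the sketch yields the same address as
      the manual deduction.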

      The same recipe can be used for other Wayback platforms - and
      possibly also for other web archive access tools, as the crucial
      information is the date and URI, which are to be looked up in a
      specified archive.

      Note that this also includes access to archives that are only
      accessible via a local proxy to a restricted environment.  Here
      the difference is that the archive information is used to identify
      the local environment used (possibly on-site) and then to
      construct a local http/https address based on knowledge of the
      local access installation.  In November 2017 a prototype was
      created for PWIDs to Netarkivet, and there are plans to extend
      it.

      Automatic access to a referenced web resource may work on the open
      net for open web archives or in restricted environments for
      closed web archives.  There may be a need for varied operation
      depending on the available technology and applications, e.g.:

      *  Via locally installed browser plug-ins or applications forming
         http/https URIs:

         +  http/https URIs for standard web archive interfaces
            At this stage there are initiatives on streamlined and
            standardized APIs for web archive interfaces.  In case such
            APIs are implemented generally, they may be used for
            resolving PWID URNs.  This could be of the form (denoting
            pwid parts in <> using syntax names):

               https://<archive-id>/pwid?time=<archival-time>
               &coverage=<coverage-spec>&item=<archived-item>

            The example from the previous section would then resolve to

               https://archive.org/pwid?time=2016-01-22T11:20:29Z
               &coverage=page&item=http://www.dr.dk

         +  http/https URIs for archive material for individual web
            archives
            Using the current open access http/https address pattern for
            the individual web archives, which for the example is

               https://web.archive.org/web/20160122112029/
               http://www.dr.dk

            This would require a registry of the different patterns for
            the individual web archives

      *  Via web research infrastructures
         This is a future solution scenario, as a web archive research
         infrastructure does not yet exist.  However, it is a likely
         future scenario, as it is currently being proposed in the RESAW
         community [RESAW].  The PWID URN resolving could in such cases
         be a question of starting a special application, as for the
         'mailto' scheme RFC 6068 [RFC6068].
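
      The two http/https options above can be combined in a small
      resolver: look up a known address pattern for the archive and
      fall back to the generic form otherwise.  The following Python
      sketch is given for illustration only.  ACCESS_PATTERNS,
      GENERIC_PATTERN and resolve are assumptions made for this
      example, the registry holds just the one pattern quoted in this
      document, and the generic "/pwid" interface is the possible
      future form described above, not an existing service.

```python
import re

# Hypothetical registry: archive-id -> access address template.
ACCESS_PATTERNS = {
    "archive.org": "https://web.archive.org/web/{compact_time}/{item}",
}

# The generic form described above, used when no pattern is registered.
GENERIC_PATTERN = (
    "https://{archive_id}/pwid"
    "?time={time}&coverage={coverage}&item={item}"
)


def resolve(archive_id: str, archival_time: str,
            coverage: str, archived_item: str) -> str:
    """Turn the four parts of a PWID URN into an http/https address."""
    template = ACCESS_PATTERNS.get(archive_id, GENERIC_PATTERN)
    # Templates pick the fields they need; unused ones are ignored.
    return template.format(
        archive_id=archive_id,
        time=archival_time,
        compact_time=re.sub(r"\D", "", archival_time),
        coverage=coverage,
        item=archived_item,
    )
```

      Like the examples above, the sketch inserts the 'archived-item'
      without percent-encoding; a real service would presumably need to
      decide how to encode it inside a query.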

   Documentation:

      None relevant

   Additional Information:

      The PWID was originally suggested as a URI, based on research
      carried out by a computer science researcher with knowledge of web
      archiving and researchers from the humanities (History and
      Literature).  This resulted in the paper "Persistent Web
      References - Best Practices and New Suggestions" [IPRES] from the
      iPres 2016 conference.  In this paper the PWID is referred to as
      WPID.  However, part of the feedback has been a concern that WPID
      could be interpreted as a PID related to a PID system, e.g. the
      DOI.  Although PID does not have a precise definition that makes
      it wrong to call it a "WPID", the danger is that it is confused
      with PID systems, which is not the intention.  Consequently, this
      suggestion uses the name PWID instead.

      The comments on the drafted PWID URI [DraftPwidUri] have been
      that it seems to be a URN rather than a URI.  This is the reason
      why it is now suggested as a URN, although there is a danger that
      users of the reference style can be confused by the additional
      "urn:" prefix.

      At the RESAW 2017 conference there were two related papers: one on
      referencing practices [ResawRef] and one on research data
      management practices [ResawColl].  This practice is also planned
      to be used for Danish web collections.

   Revision Information:

      This is the initial version of PWID as a URN

3.  Acknowledgements

   Special thanks to Caroline Nyvang and Thomas Kromann, who have
   contributed to the research identifying the minimum information
   required in a persistent web reference, and to Bolette Jurik, who
   contributed supplementary research concerning requirements for web
   collection/corpus definitions.  Thanks also to all who have
   contributed to this work through research and through reviewing this
   document.

4.  References

4.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC3339]  Klyne, G. and C. Newman, "Date and Time on the Internet:
              Timestamps", RFC 3339, DOI 10.17487/RFC3339, July 2002,
              <https://www.rfc-editor.org/info/rfc3339>.

   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifier (URI): Generic Syntax", STD 66,
              RFC 3986, DOI 10.17487/RFC3986, January 2005,
              <https://www.rfc-editor.org/info/rfc3986>.

   [RFC5234]  Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
              Specifications: ABNF", STD 68, RFC 5234,
              DOI 10.17487/RFC5234, January 2008,
              <https://www.rfc-editor.org/info/rfc5234>.

   [RFC8141]  Saint-Andre, P. and J. Klensin, "Uniform Resource Names
              (URNs)", RFC 8141, DOI 10.17487/RFC8141, April 2017,
              <https://www.rfc-editor.org/info/rfc8141>.

4.2.  Informative References

   [DOI]      International DOI Foundation, "The DOI System", 2016,
              <https://web.archive.org/web/20161020222635/
              https:/www.doi.org/>.

              urn:pwid:archive.org:2016-10-20T22:26:35Z:site:https://www.doi.org/

   [DraftPwidUri]
              Zierau, E., "DRAFT: Scheme Specification for the pwid URI,
              version 3", December 2017,
              <https://datatracker.ietf.org/doc/
              draft-pwid-uri-specification/>.

   [IPRES]    Zierau, E., Nyvang, C., and T. Kromann, "Persistent Web
              References - Best Practices and New Suggestions", October
              2016, <http://www.ipres2016.ch/frontend/organizers/media/
              iPRES2016/_PDF/
              IPR16.Proceedings_4_Web_Broschuere_Link.pdf>.

              In: proceedings of the 13th International Conference on
              Preservation of Digital Objects (iPres) 2016, pp. 237-246

   [ISO28500]
              International Organization for Standardization,
              "Information and documentation -- WARC file format", 2017,
              <https://www.iso.org/standard/68004.html>.

   [ISO8601]  International Organization for Standardization, "Data
              elements and interchange formats -- Information
              interchange -- Representation of dates and times", 2004,
              <https://www.iso.org/standard/40874.html>.

   [RESAW]    The Resaw Community, "A Research infrastructure for the
              Study of Archived Web materials", 2017,
              <https://web.archive.org/web/20170529113150/
              http://resaw.eu/>.

              urn:pwid:archive.org:2017-05-29T11:31:50Z:site:http://resaw.eu/

   [ResawColl]
              Jurik, B. and E. Zierau, "Data Management of Web archive
              Research Data", 2017,
              <https://archivedweb.blogs.sas.ac.uk/files/2017/06/
              RESAW2017-JurikZierau-
              Data_management_of_web_archive_research_data.pdf>.

              In: proceedings of the RESAW 2017 Conference, DOI:
              10.14296/resaw.0002

   [ResawRef]
              Nyvang, C., Kromann, T., and E. Zierau, "Capturing the Web
              at Large - a Critique of Current Web Referencing
              Practices", 2017,
              <https://archivedweb.blogs.sas.ac.uk/files/2017/06/
              RESAW2017-NyvangKromannZierau-
              Capturing_the_web_at_large.pdf>.

              In: proceedings of the RESAW 2017 Conference, DOI:
              10.14296/resaw.0004

   [RFC2141]  Moats, R., "URN Syntax", RFC 2141, DOI 10.17487/RFC2141,
              May 1997, <https://www.rfc-editor.org/info/rfc2141>.

   [RFC6068]  Duerst, M., Masinter, L., and J. Zawinski, "The 'mailto'
              URI Scheme", RFC 6068, DOI 10.17487/RFC6068, October 2010,
              <https://www.rfc-editor.org/info/rfc6068>.

   [W3CDTF]   W3C, "Date and Time Formats: note submitted to the W3C. 15
              September 1997", 1997,
              <http://www.w3.org/TR/NOTE-datetime>.

              W3C profile of ISO 8601.
              urn:pwid:archive.org:2017-04-03T03:37:42Z:page:http://www.w3.org/TR/NOTE-datetime

Author's Address

   Eld Maj-Britt Olmuetz Zierau (editor)
   The Royal Danish Library
   Soeren Kierkegaards Plads 1
   Copenhagen  1219
   Denmark

   Phone: +45 9132 4690
   Email: elzi@kb.dk
