Scheme Specification for the pwid URI
draft-pwid-uri-specification-02

The information below is for an old version of the document.
Document Type
This is an older version of an Internet-Draft whose latest revision state is "Expired".
Author Eld Maj-Britt Olmuetz Zierau
Last updated 2017-06-09
RFC stream Internet Engineering Task Force (IETF)
Stream WG state (None)
Document shepherd (None)
IESG IESG state AD is watching
Consensus boilerplate Unknown
Telechat date (None)
Responsible AD Alexey Melnikov
Send notices to (None)
Internet Engineering Task Force                           E. Zierau, Ed.
Internet-Draft                                  The Royal Danish Library
Intended status: Informational                              June 9, 2017
Expires: December 11, 2017

                 Scheme Specification for the pwid URI
                    draft-pwid-uri-specification-02

Abstract

   This document specifies a Uniform Resource Identifier (URI) for
   Persistent Web IDentifiers to web material in web archives using the
   'pwid' scheme name.  The purpose of the standard is to support
   general, global, sustainable, humanly readable, technology agnostic,
   persistent and precise web references for such web materials.

   The PWID URI can assist in two ways.  First, by providing a
   potentially resolvable, precise and persistent reference scheme for
   documents, which is not sufficiently covered by existing web
   reference practices.  Second, by providing a standardized way to
   specify web elements in a web collection, also known as a web corpus.
   Definitions of web collections are often needed for extraction of
   data used in production of research results, e.g. for evaluations in
   the future.  Current practices are not persistent, as they often rely
   on some version of CDX, which varies between implementations.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 11, 2017.

Zierau                  Expires December 11, 2017               [Page 1]
Internet-Draft    Scheme Specification for the pwid URI        June 2017

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Demonstrable, New, Long-Lived Utility
   3.  Syntactic Compatibility
   4.  Well Defined
   5.  Definition of Operations
   6.  Context of Use
   7.  Internationalization and Character Encoding
   8.  Scheme Name Considerations
   9.  Interoperability Considerations
   10. Acknowledgements
   11. IANA Considerations
   12. Clear Security and Privacy Considerations
   13. References
     13.1.  Normative References
     13.2.  Informative References
   Author's Address

1.  Introduction

   The purpose of the PWID URI is to represent general, global,
   sustainable, humanly readable and technology agnostic web archive
   resource references in a scheme that can be used for technical
   solutions.  The motivation for defining a PWID URI scheme is the
   growing challenge of referencing web resources, both when citing web
   resources from papers and when defining a web collection/corpus.

   o  Citation guidelines generally do not cover general and persistent
      referencing techniques for web resources that are not registered
      by Persistent Identifier systems (like DOI [DOI]).  However, an
      increasing number of references point to resources that only exist
      on the web, e.g. blogs that turned out to have a historical
      impact.  In order to obtain persistency for a reference, the
      target needs to be stable.  As the live web is 'alive' and in
      constant change, persistency can only be obtained by referring to
      archived snapshots of the web.  The PWID URI is therefore focused
      on referencing archived web material in a technology agnostic way
      (research documented in [IPRES] and [ResawRef]).

   o  There are many different requirements for construction of
      collection definitions for web material besides precision and
      persistency.  Recent research has found that various legal and
      sustainability issues lead to a need for a collection to be
      defined by references to the web parts in the collection.  The
      PWID URI is needed in such definitions in order to fulfil these
      requirements and to enable a collection to cover web materials
      from more than one archive (research documented in [ResawColl]).

   For the sake of usability and sustainability, the definition of the
   PWID URI scheme is focused on only having the minimum required
   information to make a precise identification of a resource in an
   arbitrary web archive.  Recent research has found that this is
   obtained by the following information [ResawRef]:

   o  Identification of web archive

   o  Identification of source:

      *  Archived URI

      *  Archival timestamp

   o  Intended coverage (page, part, subsite etc.)

   The PWID URI scheme represents this information in an unambiguous
   way, thus enabling technical solutions to be defined based on this
   scheme.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

2.  Demonstrable, New, Long-Lived Utility

   The purpose of the PWID URI is to represent needed referencing
   information (as listed in the introduction) in a scheme that can be
   used for technical solutions.  As described in [ResawColl] such
   references can be represented in a textual way.  However, strict
   unambiguous syntax is needed in order to ensure that it can be used
   for computational purposes.  This is relevant for web collection
   definitions, which will need a strict scheme in order to be a basis
   for automatic extraction.  Furthermore, readers of research papers
   today expect to be able to access a referenced resource by clicking
   an actionable URI, so a similar facility will be expected for
   references to available archived web material.

   Interest in this new PWID URI scheme has already been shown: a paper
   about the invention of the PWID URI, "Persistent Web References -
   Best Practices and New Suggestions" [IPRES], was accepted for the
   iPres 2016 conference and nominated as best paper.  At the RESAW 2017
   conference there are two related papers: one on referencing practices
   [ResawRef] and one on research data management practices [ResawColl].
   The interest in the PWID URI so far indicates that this is a
   recognized issue, and that the PWID URI can fill a gap.

   The PWID URI could function as a URN RFC 2141 [RFC2141], but is not
   defined as such, as the ambition is to make an easily understandable
   and technology independent persistent identifier, where prefixing
   with "urn:" would be disturbing.  At the same time the PWID
   definition can enjoy the same common syntactic, semantic, and shared
   language benefits that the URI presentation confers.

   It should be noted that for closed web archives, the PWID URI can be
   used to resolve within a closed environment.  Likewise, the PWID can
   be resolved within a coming web archive research infrastructure,
   which is currently being proposed in the RESAW community [RESAW].

3.  Syntactic Compatibility

   The syntax of the PWID URI Scheme is specified below in Augmented
   Backus-Naur Form (ABNF) RFC 5234 [RFC5234] and it conforms to URI
   syntax defined in RFC 3986 [RFC3986].  The syntax definition of the
   PWID URI is:

     pwid-uri  = pwid-scheme ":" pwid-spec

     pwid-scheme = "pwid"
     pwid-spec = archive-id ":" archival-time ":" coverage-spec
                 ":" archived-item

     archive-id  = 1*unreserved

     archival-time   = full-date datetime-delim full-pwid-time
     datetime-delim  = "_" / "T"
     full-pwid-time  = time-hour ["."] time-minute ["."] time-second "Z"

     coverage-spec    = "part" / "page" / "subsite" / "site"
                       / "collection" / "recording" / "snapshot"
                       / "other"

     archived-item = URI / archived-item-id
     archived-item-id  = 1*unreserved

   where

   o  'unreserved' is defined as in RFC 3986 [RFC3986]

   o  'coverage-spec' values are not case sensitive (i.e.  "PAGE" /
      "PART" / "PaGe" / ... are valid values as well.)

   o  'archival-time' is a UTC timestamp conforming to the W3C profile
      of ISO 8601 [ISO8601] (also defined in RFC 3339 [RFC3339]), with
      a few exceptions for the 'datetime-delim' and 'full-pwid-time',
      and with "." used instead of ":" in order not to collide with the
      ":" used for delimitation of URI parts.  The 'full-date' is
      defined as in RFC 3339 [RFC3339].  The 'archival-time' must
      represent the time specified in the archive, and can therefore be
      specified at any of the levels of granularity described in
      [W3CDTF] and in accordance with the WARC standard ISO 28500
      [ISO28500].

      The 'datetime-delim' "_" is accepted in order to make the
      timestamp more readable, in the same way as the W3C profile
      accepts " ", but with "_" used here in order to use only
      characters allowed in a URI.  In line with RFC 3339 [RFC3339] the
      "T" may alternatively be lower case "t".

      'time-hour', 'time-minute' and 'time-second' are defined as in RFC
      3339 [RFC3339].

      In line with RFC 3339 [RFC3339] the "Z" may alternatively be lower
      case "z".

   o  'URI' is defined as in RFC 3986 [RFC3986]

   The 'coverage-spec' defines the type of archived item, making precise
   what is referred to:

   o  part
      the single archived element, e.g. a PDF, an HTML text, an image

   o  page
      the full context as a page, e.g. an HTML page with referred images

   o  subsite
      the full context as a subsite within its domain, e.g. a document
      represented in a web structure

   o  site
      the full context as a site within its domain

   o  collection
      a collection/corpus definition, e.g. defined as described in
      [ResawColl]

   o  snapshot
      a snapshot (image) representation of web material, e.g. a web page

   o  recording
      a recording of a web browsing

   o  other
      if something else

   Note that the 'coverage-spec' is a parameter that could have been
   specified as a query.  However, since the 'pwid-uri' can include a
   URI as 'archived-item', it would introduce ambiguities if the
   'coverage-spec' was specified as a query, since it would not be clear
   whether the query belonged to the 'pwid-uri' or the 'archived-item'.
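
   As an illustration only, the syntax above can be sketched as a small
   Python parser.  The names 'parse_pwid', 'Pwid', 'PWID_RE' and
   'COVERAGE_SPECS' are assumptions made for this sketch and are not
   defined by this specification.  The sketch splits a PWID URI into
   its four parts and rewrites the timestamp in RFC 3339 notation; it
   does not check date and time ranges, nor that the 'archived-item' is
   a well-formed URI.

```python
import re
from typing import NamedTuple

# The eight 'coverage-spec' values from the syntax definition.
COVERAGE_SPECS = {"part", "page", "subsite", "site",
                  "collection", "recording", "snapshot", "other"}

# archive-id: one or more RFC 3986 'unreserved' characters.
# archival-time: full-date, "_" or "T", then hh[.]mm[.]ss and "Z".
# re.IGNORECASE lets "pwid", "T" and "Z" match in either case.
PWID_RE = re.compile(
    r"pwid:"
    r"(?P<archive_id>[A-Za-z0-9\-._~]+):"
    r"(?P<date>\d{4}-\d{2}-\d{2})[_T]"
    r"(?P<hour>\d{2})\.?(?P<minute>\d{2})\.?(?P<second>\d{2})Z:"
    r"(?P<coverage>[A-Za-z]+):"
    r"(?P<item>.+)",
    re.IGNORECASE,
)


class Pwid(NamedTuple):
    archive_id: str
    archival_time: str   # rewritten as YYYY-MM-DDThh:mm:ssZ
    coverage: str        # lower-cased 'coverage-spec'
    archived_item: str   # kept exactly as written


def parse_pwid(text: str) -> Pwid:
    """Split a PWID URI into its parts, or raise ValueError."""
    m = PWID_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"not a valid pwid URI: {text!r}")
    coverage = m["coverage"].lower()
    if coverage not in COVERAGE_SPECS:
        raise ValueError(f"unknown coverage-spec: {coverage!r}")
    stamp = f"{m['date']}T{m['hour']}:{m['minute']}:{m['second']}Z"
    return Pwid(m["archive_id"], stamp, coverage, m["item"])
```

   Because 'archive-id', 'archival-time' and 'coverage-spec' cannot
   contain ":", the first four ":" characters are unambiguous
   delimiters even when the 'archived-item' itself contains ":".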

4.  Well Defined

   The information in a PWID URI can be used for locating a web archive
   resource, for any kind of web archive.  It includes the minimum
   information for web archive materials, which enables resolvability,
   manually or by a resolver.  One of the reasons for defining PWID as a
   URI is to enable a general, technology agnostic, persistent
   representation to be resolvable at any time.

   The information needed is:

   o  Web archive identification
      to find the archive holding the material

   o  Archived URI or identifier of item
      as part of identifying the material

   o  Date and time associated with the archived URI/item
      as part of precise identification of the material

   o  Coverage of what is referred
      as part of clarification of what the referred material covers
      (page, part etc.)

   For example the PWID URI:

      pwid:archive.org:2016-01-22_11.20.29Z:page:http://www.dr.dk

   has the information:

   o  archive.org
      currently known identifier in the form of the Internet Archive
      domain name for their open access web archive

   o  2016-01-22_11.20.29Z
      date and time associated with the archived URI

   o  page
      clarification that the reference covers the full web page with all
      its inherited parts selected by the web archive

   o  http://www.dr.dk
      archived URI of item

   With knowledge of the current (2017) Internet Archive open access web
   interface having the form:

      https://web.archive.org/web/<time>/<uri>

   we can manually (or technically) deduce an actual (current 2017)
   access https address:

      https://web.archive.org/web/20160122112029/http://www.dr.dk

   and regard the referred web page as the reference.
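
   The deduction above can also be carried out in code.  The following
   Python fragment is a minimal sketch, assuming the 2017 Wayback
   address form shown above.  The function name 'wayback_url' and the
   mapping 'WAYBACK_HOSTS' are hypothetical, and the mapping only covers
   the archive.org example from this document.

```python
# Assumption: maps an 'archive-id' to the host serving its Wayback
# interface. Only the example from this document is listed.
WAYBACK_HOSTS = {"archive.org": "web.archive.org"}


def wayback_url(pwid: str) -> str:
    """Build a 2017-style Wayback address from a PWID URI string."""
    # The first four ":" delimit the parts; the archived item may
    # contain further ":" characters, so split at most four times.
    parts = pwid.split(":", 4)
    if len(parts) != 5 or parts[0].lower() != "pwid":
        raise ValueError(f"not a pwid URI: {pwid!r}")
    _scheme, archive_id, archival_time, _coverage, item = parts

    host = WAYBACK_HOSTS.get(archive_id.lower())
    if host is None:
        raise LookupError(f"no Wayback host known for {archive_id!r}")

    # Keep only the digits: 2016-01-22_11.20.29Z -> 20160122112029
    stamp = "".join(ch for ch in archival_time if ch in "0123456789")
    return f"https://{host}/web/{stamp}/{item}"
```

   Note that the 'coverage-spec' is not used in this sketch, since the
   address form above has no place for it.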

   The same recipe can be used for other Wayback platforms, and possibly
   also for other web archive access tools, as the crucial information
   is the date and the URI, which are to be looked up in a specified
   archive.

   Note that this also includes access to archives that are only
   accessible via a local proxy to a restricted environment.  Here the
   difference is that the archive information is used to identify the
   local environment used (possibly on-site) and then to construct a
   local http/https address based on knowledge from the local access
   installation.

5.  Definition of Operations

   The PWID URI Scheme is another step in facilitating, supporting, and
   standardizing persistent web references to resources in web
   archives.  No specific computational operation is defined yet.  It
   is expected that different implementations will emerge in pace with
   needs and with available technology and infrastructures.

   Automatic access to a referenced web resource may work on the open
   net for open web archives or in restricted environments for closed
   web archives.  Operation may need to vary depending on the available
   technology and applications, e.g.:

   o  Via locally installed browser plug-ins or applications forming
      http/https URIs:

      *  http/https URIs for standard web archive interfaces
         At this stage there are initiatives on streamlined and
         standardized APIs to web archive interfaces.  If such APIs
         are implemented generally, they may be used for resolving
         PWID URIs.  This could be of the form (denoting pwid parts in
         <> using syntax names):

            https://<archive-id>/pwid?time=<archival-
            time>&coverage=<coverage-spec>&item=<archived-item>

         The example from previous section would then resolve by

            https://archive.org/pwid?time=2016-01-22_11.20.29Z&coverage=
            page&item=http://www.dr.dk

      *  http/https URIs for archive material for individual web
         archives
         Using the current open access http/https address pattern for
         the individual web archives, which for the example is

            https://web.archive.org/web/20160122112029/http://www.dr.dk

         This would require a registry of the different patterns for the
         individual web archives

   o  Via web research infrastructures
      This is a future solution scenario, as a web archive research
      infrastructure does not yet exist.  However, it is a likely
      future scenario, as it is currently being proposed in the RESAW
      community [RESAW].  The PWID URI resolving could in such cases be
      a question of starting a special application, as for the 'mailto'
      scheme RFC 6068 [RFC6068].

   Use of URIs for standard web archive interfaces is preferred as
   dependency on registries and infrastructures may pose too many
   limits.
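
   The two ways of forming http/https URIs described above can be
   sketched as follows.  This is an illustration under stated
   assumptions, not a defined operation: the "/pwid" interface does not
   exist yet, and the names 'ARCHIVE_PATTERNS', 'split_pwid',
   'standard_api_url' and 'registry_url' are hypothetical.  Unlike the
   example address above, the sketch percent-encodes the query values,
   so that a "?" or "&" inside the archived URI cannot be mistaken for
   part of the resolver query.

```python
from urllib.parse import urlencode

# Hypothetical registry of per-archive address patterns (assumption:
# only the 2017 Internet Archive pattern from this document is listed).
ARCHIVE_PATTERNS = {
    "archive.org": "https://web.archive.org/web/{stamp}/{item}",
}


def split_pwid(pwid: str):
    """Return (archive-id, archival-time, coverage-spec, archived-item)."""
    parts = pwid.split(":", 4)
    if len(parts) != 5 or parts[0].lower() != "pwid":
        raise ValueError(f"not a pwid URI: {pwid!r}")
    return parts[1], parts[2], parts[3].lower(), parts[4]


def standard_api_url(pwid: str) -> str:
    """Form an address for the suggested, not yet existing, pwid API."""
    archive_id, archival_time, coverage, item = split_pwid(pwid)
    query = urlencode(
        {"time": archival_time, "coverage": coverage, "item": item}
    )
    return f"https://{archive_id}/pwid?{query}"


def registry_url(pwid: str) -> str:
    """Form an address from a per-archive pattern in the registry."""
    archive_id, archival_time, _coverage, item = split_pwid(pwid)
    pattern = ARCHIVE_PATTERNS.get(archive_id.lower())
    if pattern is None:
        raise LookupError(f"no pattern registered for {archive_id!r}")
    # Keep only the digits: 2016-01-22_11.20.29Z -> 20160122112029
    stamp = "".join(ch for ch in archival_time if ch in "0123456789")
    return pattern.format(stamp=stamp, item=item)
```

   The first function needs no registry, which matches the stated
   preference for standard web archive interfaces; the second fails
   for any archive whose pattern has not been registered.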

6.  Context of Use

   The PWID URI scheme facilitates, supports and standardizes the
   specification of identifiers for web archive resources in a general,
   global, sustainable, humanly readable and technology agnostic way.
   The standard is needed to address web materials with precision and
   persistency on par with traditional references to analogue material.

   The purpose of the PWID URI is to represent this information in a
   scheme that can be used for technical solutions, for example for
   resolving references and for automatic extraction of web collections
   defined by PWID URIs [ResawRef] [ResawColl].  As described above,
   different implementations for resolving may emerge, which may rely on
   different protocols and applications.

7.  Internationalization and Character Encoding

   Internationalization and character encoding for PWID URIs are
   relevant for the 'archive-id' and 'archived-item' syntactical units
   of the scheme-specific-part of the PWID URI.  The rest of the main
   syntactical units ('archival-time' and 'coverage-spec') are
   constructed from a very limited set of characters, and therefore do
   not need internationalization and character encoding.

   The 'archive-id' will not be case sensitive, but can allow for
   percent encodings, although for reasons of simplicity, it may turn
   out that the coming establishment of an archiving registry will
   recommend using letters that do not need encoding.

   The 'archived-item' follows the rules of URIs in general (currently
   for http and https URIs archived in web archives).  The 'archived-
   item' is only case sensitive to the extent that the web archive can
   handle archived case sensitive URIs.
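
   The case rules can be illustrated by a sketch that brings two PWID
   URIs into a comparable form.  The function name 'normalize_pwid' and
   the chosen spelling of the result (lower case scheme, archive
   identifier and coverage, "_" as delimiter, "." between time parts,
   upper case "Z") are assumptions made for this illustration; this
   document does not define a canonical form.  The 'archived-item' is
   left untouched, since its case sensitivity depends on the web
   archive.

```python
import re

# 'archival-time' as allowed by the syntax: "_" or "T"/"t" as
# delimiter, optional "." between time parts, "Z" or "z" at the end.
_TIME_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2})[_Tt](\d{2})\.?(\d{2})\.?(\d{2})[Zz]"
)


def normalize_pwid(pwid: str) -> str:
    """Return a PWID URI in one fixed spelling, for comparison only."""
    parts = pwid.split(":", 4)
    if len(parts) != 5 or parts[0].lower() != "pwid":
        raise ValueError(f"not a pwid URI: {pwid!r}")
    _scheme, archive_id, archival_time, coverage, item = parts

    m = _TIME_RE.fullmatch(archival_time)
    if m is None:
        raise ValueError(f"not a valid archival-time: {archival_time!r}")
    date, hour, minute, second = m.groups()

    return ":".join([
        "pwid",
        archive_id.lower(),                  # not case sensitive
        f"{date}_{hour}.{minute}.{second}Z",
        coverage.lower(),                    # not case sensitive
        item,                                # left as written
    ])
```

   Two PWID URIs that differ only in the spelling of these parts then
   compare as equal strings after normalization.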

8.  Scheme Name Considerations

   The scheme name is "pwid", short for Persistent Web Identifier.
   Initially, the scheme name "wpid" was reserved.  However, feedback
   raised a concern that "wpid" could be interpreted as a PID related to
   a PID system, e.g. the DOI.  Although PID does not have a precise
   definition that makes it wrong to call it a "wpid", the danger is
   that it is confused with PID systems, which is not the intention.
   Consequently, this document names the scheme "pwid" instead.

9.  Interoperability Considerations

   This is covered by comments on the date in the section on Syntactic
   Compatibility, where the 'archival-time' conforms to the W3C profile
   of ISO 8601, except for minor modifications in order to make it fit
   into a URI.  Furthermore, the 'archived-item' conforms to the URI
   standard.

10.  Acknowledgements

   Special thanks to Caroline Nyvang and Thomas Kromann, who have
   contributed to the research identifying the minimum information
   required in a persistent web reference, and to Bolette Jurik, who
   contributed supplementary research concerning requirements for web
   collection/corpora definitions.  Thanks also to all who have
   contributed to this work through research and by reviewing this
   document.

11.  IANA Considerations

   The URI scheme name 'pwid' is reserved as a provisional URI scheme as
   a result of IANA request #938449.

12.  Clear Security and Privacy Considerations

   Security and privacy considerations are restricted to accessible web
   resources in web archives.  If resolvers for PWID URIs are created,
   an analysis should be made of whether they can be restricted to the
   aforementioned registry of web archives.  Security and privacy will
   then be a question of the security and privacy considerations related
   to the web archive resources.

13.  References

13.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <http://www.rfc-editor.org/info/rfc2119>.

   [RFC3339]  Klyne, G. and C. Newman, "Date and Time on the Internet:
              Timestamps", RFC 3339, DOI 10.17487/RFC3339, July 2002,
              <http://www.rfc-editor.org/info/rfc3339>.

   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifier (URI): Generic Syntax", STD 66,
              RFC 3986, DOI 10.17487/RFC3986, January 2005,
              <http://www.rfc-editor.org/info/rfc3986>.

   [RFC5234]  Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
              Specifications: ABNF", STD 68, RFC 5234,
              DOI 10.17487/RFC5234, January 2008,
              <http://www.rfc-editor.org/info/rfc5234>.

13.2.  Informative References

   [DOI]      International DOI Foundation, "The DOI System", 2016,
              <https://web.archive.org/web/20161020222635/
              https:/www.doi.org/>.

              pwid:archive.org:2016-10-20_22.26.35Z:site:https://www.doi.org/

   [IPRES]    Zierau, E., Nyvang, C., and T. Kromann, "Persistent Web
              References - Best Practices and New Suggestions", October
              2016, <http://www.ipres2016.ch/frontend/organizers/media/
              iPRES2016/_PDF/
              IPR16.Proceedings_4_Web_Broschuere_Link.pdf>.

              In: proceedings of the 13th International Conference on
              Preservation of Digital Objects (iPres) 2016, pp. 237-246

   [ISO28500]
              International Organization for Standardization,
              "Information and documentation -- WARC file format", 2017,
              <https://www.iso.org/standard/68004.html>.

   [ISO8601]  International Organization for Standardization, "Data
              elements and interchange formats -- Information
              interchange -- Representation of dates and times", 2004,
              <https://www.iso.org/standard/40874.html>.

   [RESAW]    The Resaw Community, "A Research infrastructure for the
              Study of Archived Web materials", 2017,
              <https://web.archive.org/web/20170529113150/
              http://resaw.eu/>.

              pwid:archive.org:2017-05-29_11.31.50Z:site:http://resaw.eu/

   [ResawColl]
              Jurik, B. and E. Zierau, "Data Management of Web archive
              Research Data", 2017,
              <https://archivedweb.blogs.sas.ac.uk/files/2017/06/
              RESAW2017-JurikZierau-
              Data_management_of_web_archive_research_data.pdf>.

              In: proceedings of the RESAW 2017 Conference, DOI:
              10.14296/resaw.0002

   [ResawRef]
              Nyvang, C., Kromann, T., and E. Zierau, "Capturing the Web
              at Large - a Critique of Current Web Referencing
              Practices", 2017,
              <https://archivedweb.blogs.sas.ac.uk/files/2017/06/
              RESAW2017-NyvangKromannZierau-
              Capturing_the_web_at_large.pdf>.

              In: proceedings of the RESAW 2017 Conference, DOI:
              10.14296/resaw.0004

   [RFC2141]  Moats, R., "URN Syntax", RFC 2141, DOI 10.17487/RFC2141,
              May 1997, <http://www.rfc-editor.org/info/rfc2141>.

   [RFC6068]  Duerst, M., Masinter, L., and J. Zawinski, "The 'mailto'
              URI Scheme", RFC 6068, DOI 10.17487/RFC6068, October 2010,
              <http://www.rfc-editor.org/info/rfc6068>.

   [W3CDTF]   W3C, "Date and Time Formats: note submitted to the W3C. 15
              September 1997", 1997,
              <http://www.w3.org/TR/NOTE-datetime>.

              W3C profile of ISO 8601

              pwid:archive.org:2017-04-03_03.37.42Z:page:http://www.w3.org/TR/NOTE-datetime

Author's Address

   Eld Maj-Britt Olmuetz Zierau (editor)
   The Royal Danish Library
   Soeren Kierkegaards Plads 1
   Copenhagen  1219
   Denmark

   Phone: +45 9132 4690
   Email: elzi@kb.dk
