Network Working Group                                           J. Kunze
Internet-Draft                                California Digital Library
Expires: May 22, 2009                                  November 18, 2008


                         Oxum: Octet Stream Sum
      http://www.ietf.org/internet-drafts/draft-kunze-oxum-00.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on May 22, 2009.

Copyright Notice

   Copyright (C) The IETF Trust (2008).















Kunze                     Expires May 22, 2009                  [Page 1]


Internet-Draft                    Oxum                     November 2008


Abstract

   This document specifies "oxum", a two-part number, OCTETS.STREAMS,
   that is a kind of simple size summary for complex digital objects.
   In the mainstream case of a complex object that is a set of files,
   the STREAMS part is the total number of files and the OCTETS part is
   the total number of 8-bit bytes across all those files; for example,
   an oxum of 876543.21 could mean a total of 876,543 bytes across 21
   files.  Which set of streams comprises a complex object for an oxum
   computation depends in general on the object's type.  One important
   type is the stream set defined by the set of files contained in a
   file hierarchy.  An oxum is not a checksum in that, while a changed
   oxum means a changed object, an unchanged oxum does not mean an
   unchanged object.

1.  The size of a digital object

   It can be hard to characterize the size of an arbitrary digital
   object.  Word count, page count, image dimensions, video running
   time, or number of database records might all be useful metrics,
   depending on the type of the object.  For a single file, one crude
   but easily obtained metric is the number of octets (8-bit bytes) in
   the file.  This document introduces an analogous metric for a
   _complex digital object_, by which we mean an object that is not
   equivalent to a single file.  A complex object may consist of a group
   of files or parts of one or more files (e.g., a database).

2.  The octet stream sum (oxum)

   A complex digital object that has a well-defined set of octet
   streams, such as a document represented by a group of 14 text and
   image files, has a well-defined "oxum" (octet stream sum).  The oxum
   is a two-part number such as

     567898.14

   which corresponds to 567,898 octets spread over 14 files.  The
   general form of an oxum is

     OCTETS.STREAMS

   where STREAMS is the total number of streams (e.g., files) and OCTETS
   is the total number of octets across all those streams.  In general,
   these two numbers will be positive integers, although there may be
   situations (not described here) in which it makes sense for either
   one of them to be left unspecified with a hyphen ('-').  The period
   ('.') separator is required.  Other examples:

   1998.10              # 1998 octets spread over 10 streams
   105.3                # 105 octets, 3 streams (not 105 and 3 tenths)
   21436794142.831      # almost 20 Gigabytes spread over 831 streams
   709895249489.8756    # about 661 GB, or 710 GB if a GB is 10^9 octets
   -.1                  # one stream, but number of octets not known yet
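
   As an illustration of the syntax above, the following Python sketch
   splits an oxum on its required period and treats a lone hyphen as
   "unspecified".  The names Oxum, parse_oxum, and format_oxum are
   hypothetical and not part of this specification.  The sketch also
   assumes that each part is a run of ASCII digits, so it accepts zero
   even though the text says the parts are, in general, positive.

```python
from typing import NamedTuple, Optional


class Oxum(NamedTuple):
    octets: Optional[int]   # None when the part is written as '-'
    streams: Optional[int]  # None when the part is written as '-'


def _part(token: str) -> Optional[int]:
    """Convert one side of the period to an int, or None for '-'."""
    if token == "-":
        return None
    if not (token.isascii() and token.isdigit()):
        raise ValueError(f"bad oxum component: {token!r}")
    return int(token)


def parse_oxum(text: str) -> Oxum:
    """Split 'OCTETS.STREAMS' on the period; never parse it as a float."""
    left, sep, right = text.partition(".")
    if sep != ".":
        raise ValueError(f"missing '.' separator in oxum: {text!r}")
    return Oxum(_part(left), _part(right))


def format_oxum(oxum: Oxum) -> str:
    """Render an Oxum back to text, using '-' for an unspecified part."""
    octets = "-" if oxum.octets is None else str(oxum.octets)
    streams = "-" if oxum.streams is None else str(oxum.streams)
    return f"{octets}.{streams}"
```

   Keeping the two parts as separate integers avoids the trap noted in
   the examples, where "105.3" could be misread as a decimal fraction.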

   The oxum is designed to be machine readable and to fit into a variety
   of syntactic contexts, such as command lines, file paths, URL
   [RFC3986] query strings, and XML [XML] tags.

   Note that the oxum is _not_ designed as a secure digest or checksum.
   While an oxum cannot change without a change to the object, an
   unchanged oxum absolutely does not imply an unchanged object.  Do not
   use oxum in place of a cryptographic digest algorithm (cf. SHA1
   [RFC3174]).
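
   The point can be seen in a small Python sketch.  Two stream sets of
   identical sizes but different content share one oxum, while a
   cryptographic digest tells them apart.  The helper names and sample
   data are hypothetical, and SHA-256 from the Python standard library
   is used here only as a stand-in for whatever digest an application
   would choose.

```python
import hashlib


def oxum_of(streams):
    """Return 'OCTETS.STREAMS' for a list of in-memory byte strings."""
    return f"{sum(len(s) for s in streams)}.{len(streams)}"


def digest_of(streams):
    """Hash each stream, then hash the sequence of per-stream digests."""
    outer = hashlib.sha256()
    for s in streams:
        outer.update(hashlib.sha256(s).digest())
    return outer.hexdigest()


original = [b"pay Alice 100", b"pay Bob 250"]
tampered = [b"pay Alice 900", b"pay Bob 250"]  # same lengths, new content

print(oxum_of(original), oxum_of(tampered))        # both are 24.2
print(digest_of(original) == digest_of(tampered))  # False
```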

3.  Oxum complex object types

   An _oxum object type_ is used to describe how to derive an object's
   stream set.  For oxum to be meaningful for an object type, the type
   must have a well-defined, canonical stream set.  Once the stream set
   is known, the oxum computation is straightforward and the streams can
   be processed in any order.  One especially natural way to derive a
   stream set is to define a way to reduce an object type to a file
   group.

   Files are primal streams.  In this document, a "regular" file is a
   contiguous sequence of octets with a well-defined start and end,
   whether the sequence is named in static storage (e.g., "memo.pdf") or
   is unnamed and recently retrieved from a network socket (e.g., a web
   page).  There are many filesystem entities that are not regular
   files, including directory nodes, block special files, and symbolic
   links.  In this document, the word "file" usually refers to a regular
   file.

   A (regular) file is an oxum-ready stream.  As a base case, a complex
   object consisting of exactly one file has an oxum of the form
   "OCTETS.1", as in

     12345.1

   Things get more interesting when dealing with more than one file.
   Any private or public agreement can be made about what constitutes a
   file group, hence a stream set, for the purposes of an oxum
   computation.  A stream set might be declared to comprise all the
   attachments of an email message, or all the files resulting from a
   normalized dump procedure run against the tables of a database.  An
   easily delineated group is all the files contained in a directory.

   Any recognized group of regular files can form an oxum stream set,
   including a simple manifest or list of filenames.  For example, a
   transfer protocol might use oxum to help set the receiver's
   expectations in terms of total bytes and files contained in a
   transferred package [GRABIT].
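
   A minimal Python sketch of the manifest case follows.  It assumes
   the manifest is simply an iterable of paths to regular files.  The
   function names are hypothetical, and nothing here is taken from the
   GrabIt protocol itself.

```python
import os


def manifest_oxum(paths):
    """Sum sizes over a manifest (an iterable of regular-file paths)."""
    octets = 0
    streams = 0
    for path in paths:
        # os.path.isfile follows symbolic links; a stricter policy could
        # reject links outright.
        if not os.path.isfile(path):
            raise ValueError(f"not a regular file: {path!r}")
        octets += os.path.getsize(path)
        streams += 1
    return f"{octets}.{streams}"


def matches_expected(expected_oxum, paths):
    """True when the received files add up to the announced oxum."""
    return manifest_oxum(paths) == expected_oxum
```

   A receiver might call matches_expected() once a transfer completes.
   A mismatch shows that something was lost or altered, but a match is
   not proof of integrity.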

3.1.  File hierarchy oxum

   The "file hierarchy" oxum type has the stream set defined by the
   group of all regular files rooted in a given directory (or folder),
   and including all regular files nested in all its subdirectories.
   The following csh script, which assumes GNU find, xargs, and stat as
   found on a typical Linux system, computes an oxum for each of its
   arguments.

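   #!/bin/csh -f
   # Print "OCTETS.STREAMS NAME" for each argument (needs GNU find,
   # xargs, and stat).  NUL-delimited names survive spaces and quotes.
   foreach f ($argv:q)
       find "$f" -type f -print0 | xargs -0 -r stat -c %s | \
           awk -v f="$f" '{s += $1} END {printf "%s.%s %s\n", s+0, NR, f}'
   end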
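
   For platforms without these utilities, the same traversal can be
   sketched in Python.  This is an illustration under the assumptions
   that symbolic links are not followed and that only regular files
   count as streams, mirroring "find -type f".  The function name
   hierarchy_oxum is hypothetical.

```python
import os
import stat


def hierarchy_oxum(root):
    """Return 'OCTETS.STREAMS' for all regular files at or below root."""
    info = os.lstat(root)
    if stat.S_ISREG(info.st_mode):
        # Base case: a single regular file is a one-stream object.
        return f"{info.st_size}.1"

    octets = 0
    streams = 0
    # os.walk does not descend into symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            entry = os.lstat(os.path.join(dirpath, name))
            if stat.S_ISREG(entry.st_mode):  # skip links, FIFOs, devices
                octets += entry.st_size
                streams += 1
    return f"{octets}.{streams}"
```

   The order in which files are visited does not matter, since both
   parts of the oxum are plain sums.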

3.2.  Database oxum

   The "database" oxum type has the stream set defined by the group of
   files resulting from a canonical dump of all the database tables.
   This type is not yet fully worked out.  It is related to the
   canonicalization phase in computing a Universal Numeric Fingerprint
   (UNF) [UNF].  Without UNF's cryptographic phase, this produces an
   oxum that is stable even if the raw data is moved to another platform
   or software system.
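
   Because the canonical dump is not defined here, the following Python
   sketch is purely illustrative.  It invents its own dump rule for an
   SQLite database: one stream per table, tables taken in name order,
   rows rendered as tab-separated repr() values and sorted, and text
   encoded as UTF-8.  That rule and the function name database_oxum are
   assumptions, not the UNF canonicalization and not part of this
   specification.

```python
import sqlite3


def database_oxum(conn):
    """Return 'OCTETS.STREAMS' using one dumped stream per table."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )]
    octets = 0
    for table in tables:
        quoted = '"' + table.replace('"', '""') + '"'
        rows = conn.execute(f"SELECT * FROM {quoted}").fetchall()
        # Sorting the rendered rows makes the dump independent of the
        # order in which rows happen to be stored or returned.
        lines = sorted("\t".join(repr(v) for v in row) for row in rows)
        dump = "".join(line + "\n" for line in lines).encode("utf-8")
        octets += len(dump)
    return f"{octets}.{len(tables)}"
```

   Under this invented rule, two databases holding the same rows yield
   the same oxum regardless of insertion order, which is the kind of
   stability the paragraph above has in mind.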

4.  Security considerations

   Neither the oxum metric nor its computation poses any direct risk to
   computers and networks.  Documentation that describes any use of oxum
   should caution implementors not to use an oxum in place of a
   cryptographically secure digest.  Any such use would be trivial to
   spoof.

5.  References

   [GRABIT]   NDIIPP/CDL, "The GrabIt File Exchange Protocol", 2008,
              <http://dot.ucop.edu/home/jak/grabitspec.html>.

   [RFC3174]  Eastlake, D. and P. Jones, "US Secure Hash Algorithm 1
              (SHA1)", RFC 3174, September 2001.

   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifier (URI): Generic Syntax", STD 66,
              RFC 3986, January 2005.

   [UNF]      Altman and King, "A Proposed Standard for the Scholarly
              Citation of Data", March 2007,
              <http://www.dlib.org/dlib/march07/altman/03altman.html>.

   [XML]      Bray, "Extensible Markup Language (XML) 1.0 (Fourth
              Edition)", August 2006,
              <http://www.w3.org/TR/2006/REC-xml-20060816/>.

Author's Address

   John A. Kunze
   California Digital Library
   415 20th St, 4th Floor
   Oakland, CA  94612
   US

   Fax:   +1 510-893-5212
   Email: jak@ucop.edu

Full Copyright Statement

   Copyright (C) The IETF Trust (2008).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.


Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.


Acknowledgment

   Funding for the RFC Editor function is provided by the IETF
   Administrative Support Activity (IASA).
