Skip to main content

The VCDIFF Generic Differencing and Compression Data Format
RFC 3284

Document Type RFC - Proposed Standard (July 2002)
Was draft-korn-vcdiff (individual)
Authors David Korn , Joshua MacDonald, Jeffrey Mogul , Kiem-Phong Vo
Last updated 2013-03-02
RFC stream Internet Engineering Task Force (IETF)
Formats
IESG Responsible AD (None)
Send notices to (None)
RFC 3284
quot;inst" array includes an index to
   some entry in the instruction code table that determines:

   a. Whether one or two instructions were encoded and their types.

   b. If the instructions have their sizes encoded separately, these
      sizes will follow, in order, in the tuple.

Korn, et. al.               Standards Track                    [Page 20]
RFC 3284                         VCDIFF                        June 2002

   c. If the instructions have accompanying data, i.e., ADDs or RUNs,
      their data will be in the array "data".

   d. Similarly, if the instructions are COPYs, the coded addresses are
      found in the array "addr".

   The decoding procedure simply processes the arrays by reading one
   code index at a time, looking up the corresponding instruction code
   entry, then consuming the respective sizes, data and addresses
   following the directions in this entry.  In other words, the decoder
   maintains an implicit next-element pointer for each array;
   "consuming" an instruction tuple, data, or address value implies
   incrementing the associated pointer.

   For example, if during the processing of the target window, the next
   unconsumed tuple in the inst array has an index value of 19, then the
   first instruction is a COPY, whose size is found as the immediately
   following integer in the inst array.  Since the mode of this COPY
   instruction is VCD_SELF, the corresponding address is found by
   consuming the next integer in the addr array.  The data array is left
   intact.  As the second instruction for code index 19 is a NOOP, this
   tuple is finished.

7. APPLICATION-DEFINED CODE TABLES

   Although the default code table used in Vcdiff is good for general
   purpose encoders, there are times when other code tables may perform
   better.  For example, to code a file with many identical segments of
   data, it may be advantageous to have a COPY instruction with the
   specific size of these data segments, so that the instruction can be
   encoded in a single byte.  Such a special code table MUST then be
   encoded in the delta file so that the decoder can reconstruct it
   before decoding the data.

   Vcdiff allows an application-defined code table to be specified in a
   delta file with the following data:

         Size of near cache            - byte
         Size of same cache            - byte
         Compressed code table data

   The "compressed code table data" encodes the delta between the
   default code table (source) and the new code table (target) in the
   same manner as described in Section 4.3 for encoding a target window
   in terms of a source window.  This delta is computed using the
   following steps:

Korn, et. al.               Standards Track                    [Page 21]
RFC 3284                         VCDIFF                        June 2002

   a. Convert the new instruction code table into a string, "code", of
      1536 bytes using the below steps in order:

       i. Add in order the 256 bytes representing the types of the first
          instructions in the instruction pairs.
      ii. Add in order the 256 bytes representing the types of the
          second instructions in the instruction pairs.
     iii. Add in order the 256 bytes representing the sizes of the first
          instructions in the instruction pairs.
      iv. Add in order the 256 bytes representing the sizes of the
          second instructions in the instruction pairs.
       v. Add in order the 256 bytes representing the modes of the first
          instructions in the instruction pairs.
      vi. Add in order the 256 bytes representing the modes of the
          second instructions in the instruction pairs.

   b. Similarly, convert the default code table into a string "dflt".

   c. Treat the string "code" as a target window and "dflt" as the
      corresponding source data and apply an encoding algorithm to
      compute the delta encoding of "code" in terms of "dflt".  This
      computation MUST use the default code table for encoding the delta
      instructions.

   The decoder can then reverse the above steps to decode the compressed
   table data using the method of Section 6, employing the default code
   table, to generate the new code table.  Note that the decoder does
   not need to know about the details of the encoding algorithm used in
   step (c).  It is able to decode the new code table because the Vcdiff
   format is independent from the choice of encoding algorithm, and
   because the encoder in step (c) uses the known, default code table.

8. Performance

   The encoding format is compact.  For compression only, using the LZ-
   77 string parsing strategy and without any secondary compressors, the
   typical compression rate is better than Unix compress and close to
   gzip.  For differencing, the data format is better than all known
   methods in terms of its stated goal, which is primarily decoding
   speed and encoding efficiency.

   We compare the performance of compress, gzip and Vcdiff using the
   archives of three versions of the Gnu C compiler, gcc-2.95.1.tar,
   gcc-2.95.2.tar and gcc-2.95.3.tar.  Gzip was used at its default
   compression level.  The Vcdiff data were obtained using the
   Vcodex/Vcdiff software (Section 13).

Korn, et. al.               Standards Track                    [Page 22]
RFC 3284                         VCDIFF                        June 2002

   Below are the different Vcdiff runs:

      Vcdiff: vcdiff is used as a compressor only.

      Vcdiff-d: vcdiff is used as a differencer only.  That is, it only
         compares target data against source data.  Since the files
         involved are large, they are broken into windows.  In this
         case, each target window, starting at some file offset in the
         target file, is compared against a source window with the same
         file offset (in the source file).  The source window is also
         slightly larger than the target window to increase matching
         opportunities.

      Vcdiff-dc: This is similar to Vcdiff-d, but vcdiff can also
         compare target data against target data as applicable.  Thus,
         vcdiff both computes differences and compresses data.  The
         windowing algorithm is the same as above.  However, the above
         hint is recinded in this case.

      Vcdiff-dcw: This is similar to Vcdiff-dc but the windowing
         algorithm uses a content-based heuristic to select a source
         window that is more likely to match with a given target window.
         Thus, the source data segment selected for a target window
         often will not be aligned with the file offsets of this target
         window.

                       gcc-2.95.1     gcc-2.95.2     gcc-2.95.3
      ---------------------------------------------------------
      1. raw size      55,746,560     55,797,760     55,787,520
      2. compress         -           19,939,390     19,939,453
      3. gzip             -           12,973,443     12,998,097
      4. Vcdiff           -           15,358,786     15,371,737
      5. Vcdiff-d         -              100,971     26,383,849
      6. Vcdiff-dc        -               97,246     14,461,203
      7. Vcdiff-dcw       -              256,445      1,248,543

   The above table shows the raw sizes of the tar files and the sizes of
   the compressed results.  The differencing results in the gcc-2.95.2
   column were obtained by compressing gcc-2.95.2, given gcc-2.95.1.
   The same results for the column gcc-2.95.3 were obtained by
   compressing gcc-2.95.3, given gcc-2.95.2.

   Rows 2, 3 and 4 show that, for compression only, the compression rate
   from Vcdiff is worse than gzip and better than compress.

Korn, et. al.               Standards Track                    [Page 23]
RFC 3284                         VCDIFF                        June 2002

   The last three rows in the column gcc-2.95.2 show that when two file
   versions are very similar, differencing can give dramatically good
   compression rates.  Vcdiff-d and Vcdiff-dc use the same simple window
   selection method of aligning by file offsets, but Vcdiff-dc also does
   compression so its output is slightly smaller.  Vcdiff-dcw uses a
   content-based algorithm to search for source data that likely will
   match a given target window.  Although it does a good job, the
   algorithm does not always find the best matches, which in this case,
   are given by the simple algorithm of Vcdiff-d.  As a result, the
   output size for Vcdiff-dcw is slightly larger.

   The situation is reversed in the gcc-2.95.3 column.  Here, the files
   and their contents were sufficiently rearranged or changed between
   the making of the gcc-2.95.3.tar archive and the gcc-2.95.2 archive
   so that the simple method of aligning windows by file offsets no
   longer works.  As a result, Vcdiff-d and Vcdiff-dc do not perform
   well.  By allowing compression, along with differencing, Vcdiff-dc
   manages to beat Vcdiff-c, which does compression only.  The content-
   based window matching algorithm in Vcdiff-dcw is effective in
   matching the right source and target windows so that Vcdiff-dcw is
   the overall winner.

9. Further Issues

   This document does not address a few issues:

   Secondary compressors:
      As discussed in Section 4.3, certain sections in the delta
      encoding of a window may be further compressed by a secondary
      compressor.  In our experience, the basic Vcdiff format is
      adequate for most purposes so that secondary compressors are
      seldom needed.  In particular, for normal use of data
      differencing, where the files to be compared have long stretches
      of matches, much of the gain in compression rate is already
      achieved by normal string matching.  Thus, the use of secondary
      compressors is seldom needed in this case.  However, for
      applications beyond differencing of such nearly identical files,
      secondary compressors may be needed to achieve maximal compressed
      results.

      Therefore, we recommend leaving the Vcdiff data format defined as
      in this document so that the use of secondary compressors can be
      implemented when they become needed in the future.  The formats of
      the compressed data via such compressors or any compressors that
      may be defined in the future are left open to their
      implementations.  These could include Huffman encoding, arithmetic
      encoding, and splay tree encoding [8,9].

Korn, et. al.               Standards Track                    [Page 24]
RFC 3284                         VCDIFF                        June 2002

   Large file system vs. small file system:
      As discussed in Section 4, a target window in a large file may be
      compared against some source window in another file or in the same
      file (from some earlier part).  In that case, the file offset of
      the source window is specified as a variable-sized integer in the
      delta encoding.  There is a possibility that the encoding was
      computed on a system supporting much larger files than in a system
      where the data may be decoded (e.g., 64-bit file systems vs. 32-
      bit file systems).  In that case, some target data may not be
      recoverable.  This problem could afflict any compression format,
      and ought to be resolved with a generic negotiation mechanism in
      the appropriate protocol(s).

10.  Summary

   We have described Vcdiff, a general and portable encoding format for
   compression and differencing.  The format is good in that it allows
   implementing a decoder without knowledge of the encoders.  Further,
   ignoring the use of secondary compressors not defined within the
   format, the decoding algorithms run in linear time and requires
   working space proportional to window size.

11. Acknowledgements

   Thanks are due to Balachander Krishnamurthy, Jeff Mogul and Arthur
   Van Hoff who provided much encouragement to publicize Vcdiff.  In
   particular, Jeff helped in clarifying the description of the data
   format presented here.

12. Security Considerations

   Vcdiff only provides a format to encode compressed and differenced
   data.  It does not address any issues concerning how such data are,
   in fact, stored in a given file system or the run-time memory of a
   computer system.  Therefore, we do not anticipate any security issues
   with respect to Vcdiff.

13. Source Code Availability

   Vcdiff is implemented as a data transforming method in Phong Vo's
   Vcodex library.  AT&T Corp. has made the source code for Vcodex
   available for anyone to use to transmit data via HTTP/1.1 Delta
   Encoding [10,11].  The source code and according license is
   accessible at the below URL:

      http://www.research.att.com/sw/tools

Korn, et. al.               Standards Track                    [Page 25]
RFC 3284                         VCDIFF                        June 2002

14. Intellectual Property Rights

   The IETF has been notified of intellectual property rights claimed in
   regard to some or all of the specification contained in this
   document.  For more information consult the online list of claimed
   rights, at <http://www.ietf.org/ipr.html>.

   The IETF takes no position regarding the validity or scope of any
   intellectual property or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; neither does it represent that it
   has made any effort to identify any such rights.  Information on the
   IETF's procedures with respect to rights in standards-track and
   standards-related documentation can be found in BCP 11.  Copies of
   claims of rights made available for publication and any assurances of
   licenses to be made available, or the result of an attempt made to
   obtain a general license or permission for the use of such
   proprietary rights by implementors or users of this specification can
   be obtained from the IETF Secretariat.

15. IANA Considerations

   The Internet Assigned Numbers Authority (IANA) administers the number
   space for Secondary Compressor ID values.  Values and their meaning
   must be documented in an RFC or other peer-reviewed, permanent, and
   readily available reference, in sufficient detail so that
   interoperability between independent implementations is possible.
   Subject to these constraints, name assignments are First Come, First
   Served - see RFC 2434 [13].  Legal ID values are in the range 1..255.

   This document does not define any values in this number space.

16. References

   [1]  D.G. Korn and K.P. Vo, Vdelta: Differencing and Compression,
        Practical Reusable Unix Software, Editor B. Krishnamurthy, John
        Wiley & Sons, Inc., 1995.

   [2]  J. Ziv and A. Lempel, A Universal Algorithm for Sequential Data
        Compression, IEEE Trans. on Information Theory, 23(3):337-343,
        1977.

   [3]  W. Tichy, The String-to-String Correction Problem with Block
        Moves, ACM Transactions on Computer Systems, 2(4):309-321,
        November 1984.

Korn, et. al.               Standards Track                    [Page 26]
RFC 3284                         VCDIFF                        June 2002

   [4]  E.M. McCreight, A Space-Economical Suffix Tree Construction
        Algorithm, Journal of the ACM, 23:262-272, 1976.

   [5]  J.J. Hunt, K.P. Vo, W. Tichy, An Empirical Study of Delta
        Algorithms, IEEE Software Configuration and Maintenance
        Workshop, 1996.

   [6]  J.J. Hunt, K.P. Vo, W. Tichy, Delta Algorithms: An Empirical
        Analysis, ACM Trans. on Software Engineering and Methodology,
        7:192-214, 1998.

   [7]  D.G. Korn, K.P. Vo, Sfio: A buffered I/O Library, Proc. of the
        Summer '91 Usenix Conference, 1991.

   [8]  D. W. Jones, Application of Splay Trees to Data Compression,
        CACM, 31(8):996:1007.

   [9]  M. Nelson, J. Gailly, The Data Compression Book, ISBN 1-55851-
        434-1, M&T Books, New York, NY, 1995.

   [10] J.C. Mogul, F. Douglis, A. Feldmann, and B. Krishnamurthy,
        Potential benefits of delta encoding and data compression for
        HTTP, SIGCOMM '97, Cannes, France, 1997.

   [11] Mogul, J., Krishnamurthy, B., Douglis, F., Feldmann, A., Goland,
        Y. and A. Van Hoff, "Delta Encoding in HTTP", RFC 3229, January
        2002.

   [12] Bradner, S., "Key words for use in RFCs to Indicate Requirement
        Levels", BCP 14, RFC 2119, March 1997.

   [13] Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA
        Considerations Section in RFCs", BCP 26, RFC 2434, October 1998.

   [14] D.G. Korn and K.P. Vo, Engineering a Differencing and
        Compression Data Format, Submitted to Usenix'2002, 2001.

Korn, et. al.               Standards Track                    [Page 27]
RFC 3284                         VCDIFF                        June 2002

17. Authors' Addresses

   Kiem-Phong Vo (main contact)
   AT&T Labs, Room D223
   180 Park Avenue
   Florham Park, NJ 07932

   Phone: 1 973 360 8630
   EMail: kpv@research.att.com

   David G. Korn
   AT&T Labs, Room D237
   180 Park Avenue
   Florham Park, NJ 07932

   Phone: 1 973 360 8602
   EMail: dgk@research.att.com

   Jeffrey C. Mogul
   Western Research Laboratory
   Hewlett-Packard Company
   1501 Page Mill Road, MS 1251
   Palo Alto, California, 94304, U.S.A.

   Phone: 1 650 857 2206 (email preferred)
   EMail: JeffMogul@acm.org

   Joshua P. MacDonald
   Computer Science Division
   University of California, Berkeley
   345 Soda Hall
   Berkeley, CA 94720

   EMail: jmacd@cs.berkeley.edu

Korn, et. al.               Standards Track                    [Page 28]
RFC 3284                         VCDIFF                        June 2002

18.  Full Copyright Statement

   Copyright (C) The Internet Society (2002).  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works.  However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

   Funding for the RFC Editor function is currently provided by the
   Internet Society.

Korn, et. al.               Standards Track                    [Page 29]