The VCDIFF Generic Differencing and Compression Data Format
RFC 3284
Document | Type |
RFC
- Proposed Standard
(July 2002)
Was
draft-korn-vcdiff
(individual)
|
|
---|---|---|---|
Authors | David Korn , Joshua MacDonald, Jeffrey Mogul , Kiem-Phong Vo | ||
Last updated | 2013-03-02 | ||
RFC stream | Internet Engineering Task Force (IETF) | ||
Formats | |||
IESG | Responsible AD | (None) | |
Send notices to | (None) |
RFC 3284
quot;inst" array includes an index to some entry in the instruction code table that determines: a. Whether one or two instructions were encoded and their types. b. If the instructions have their sizes encoded separately, these sizes will follow, in order, in the tuple. Korn, et. al. Standards Track [Page 20] RFC 3284 VCDIFF June 2002 c. If the instructions have accompanying data, i.e., ADDs or RUNs, their data will be in the array "data". d. Similarly, if the instructions are COPYs, the coded addresses are found in the array "addr". The decoding procedure simply processes the arrays by reading one code index at a time, looking up the corresponding instruction code entry, then consuming the respective sizes, data and addresses following the directions in this entry. In other words, the decoder maintains an implicit next-element pointer for each array; "consuming" an instruction tuple, data, or address value implies incrementing the associated pointer. For example, if during the processing of the target window, the next unconsumed tuple in the inst array has an index value of 19, then the first instruction is a COPY, whose size is found as the immediately following integer in the inst array. Since the mode of this COPY instruction is VCD_SELF, the corresponding address is found by consuming the next integer in the addr array. The data array is left intact. As the second instruction for code index 19 is a NOOP, this tuple is finished. 7. APPLICATION-DEFINED CODE TABLES Although the default code table used in Vcdiff is good for general purpose encoders, there are times when other code tables may perform better. For example, to code a file with many identical segments of data, it may be advantageous to have a COPY instruction with the specific size of these data segments, so that the instruction can be encoded in a single byte. Such a special code table MUST then be encoded in the delta file so that the decoder can reconstruct it before decoding the data. Vcdiff allows an application-defined code table to be specified in a delta file with the following data: Size of near cache - byte Size of same cache - byte Compressed code table data The "compressed code table data" encodes the delta between the default code table (source) and the new code table (target) in the same manner as described in Section 4.3 for encoding a target window in terms of a source window. This delta is computed using the following steps: Korn, et. al. Standards Track [Page 21] RFC 3284 VCDIFF June 2002 a. Convert the new instruction code table into a string, "code", of 1536 bytes using the below steps in order: i. Add in order the 256 bytes representing the types of the first instructions in the instruction pairs. ii. Add in order the 256 bytes representing the types of the second instructions in the instruction pairs. iii. Add in order the 256 bytes representing the sizes of the first instructions in the instruction pairs. iv. Add in order the 256 bytes representing the sizes of the second instructions in the instruction pairs. v. Add in order the 256 bytes representing the modes of the first instructions in the instruction pairs. vi. Add in order the 256 bytes representing the modes of the second instructions in the instruction pairs. b. Similarly, convert the default code table into a string "dflt". c. Treat the string "code" as a target window and "dflt" as the corresponding source data and apply an encoding algorithm to compute the delta encoding of "code" in terms of "dflt". This computation MUST use the default code table for encoding the delta instructions. The decoder can then reverse the above steps to decode the compressed table data using the method of Section 6, employing the default code table, to generate the new code table. Note that the decoder does not need to know about the details of the encoding algorithm used in step (c). It is able to decode the new code table because the Vcdiff format is independent from the choice of encoding algorithm, and because the encoder in step (c) uses the known, default code table. 8. Performance The encoding format is compact. For compression only, using the LZ- 77 string parsing strategy and without any secondary compressors, the typical compression rate is better than Unix compress and close to gzip. For differencing, the data format is better than all known methods in terms of its stated goal, which is primarily decoding speed and encoding efficiency. We compare the performance of compress, gzip and Vcdiff using the archives of three versions of the Gnu C compiler, gcc-2.95.1.tar, gcc-2.95.2.tar and gcc-2.95.3.tar. Gzip was used at its default compression level. The Vcdiff data were obtained using the Vcodex/Vcdiff software (Section 13). Korn, et. al. Standards Track [Page 22] RFC 3284 VCDIFF June 2002 Below are the different Vcdiff runs: Vcdiff: vcdiff is used as a compressor only. Vcdiff-d: vcdiff is used as a differencer only. That is, it only compares target data against source data. Since the files involved are large, they are broken into windows. In this case, each target window, starting at some file offset in the target file, is compared against a source window with the same file offset (in the source file). The source window is also slightly larger than the target window to increase matching opportunities. Vcdiff-dc: This is similar to Vcdiff-d, but vcdiff can also compare target data against target data as applicable. Thus, vcdiff both computes differences and compresses data. The windowing algorithm is the same as above. However, the above hint is recinded in this case. Vcdiff-dcw: This is similar to Vcdiff-dc but the windowing algorithm uses a content-based heuristic to select a source window that is more likely to match with a given target window. Thus, the source data segment selected for a target window often will not be aligned with the file offsets of this target window. gcc-2.95.1 gcc-2.95.2 gcc-2.95.3 --------------------------------------------------------- 1. raw size 55,746,560 55,797,760 55,787,520 2. compress - 19,939,390 19,939,453 3. gzip - 12,973,443 12,998,097 4. Vcdiff - 15,358,786 15,371,737 5. Vcdiff-d - 100,971 26,383,849 6. Vcdiff-dc - 97,246 14,461,203 7. Vcdiff-dcw - 256,445 1,248,543 The above table shows the raw sizes of the tar files and the sizes of the compressed results. The differencing results in the gcc-2.95.2 column were obtained by compressing gcc-2.95.2, given gcc-2.95.1. The same results for the column gcc-2.95.3 were obtained by compressing gcc-2.95.3, given gcc-2.95.2. Rows 2, 3 and 4 show that, for compression only, the compression rate from Vcdiff is worse than gzip and better than compress. Korn, et. al. Standards Track [Page 23] RFC 3284 VCDIFF June 2002 The last three rows in the column gcc-2.95.2 show that when two file versions are very similar, differencing can give dramatically good compression rates. Vcdiff-d and Vcdiff-dc use the same simple window selection method of aligning by file offsets, but Vcdiff-dc also does compression so its output is slightly smaller. Vcdiff-dcw uses a content-based algorithm to search for source data that likely will match a given target window. Although it does a good job, the algorithm does not always find the best matches, which in this case, are given by the simple algorithm of Vcdiff-d. As a result, the output size for Vcdiff-dcw is slightly larger. The situation is reversed in the gcc-2.95.3 column. Here, the files and their contents were sufficiently rearranged or changed between the making of the gcc-2.95.3.tar archive and the gcc-2.95.2 archive so that the simple method of aligning windows by file offsets no longer works. As a result, Vcdiff-d and Vcdiff-dc do not perform well. By allowing compression, along with differencing, Vcdiff-dc manages to beat Vcdiff-c, which does compression only. The content- based window matching algorithm in Vcdiff-dcw is effective in matching the right source and target windows so that Vcdiff-dcw is the overall winner. 9. Further Issues This document does not address a few issues: Secondary compressors: As discussed in Section 4.3, certain sections in the delta encoding of a window may be further compressed by a secondary compressor. In our experience, the basic Vcdiff format is adequate for most purposes so that secondary compressors are seldom needed. In particular, for normal use of data differencing, where the files to be compared have long stretches of matches, much of the gain in compression rate is already achieved by normal string matching. Thus, the use of secondary compressors is seldom needed in this case. However, for applications beyond differencing of such nearly identical files, secondary compressors may be needed to achieve maximal compressed results. Therefore, we recommend leaving the Vcdiff data format defined as in this document so that the use of secondary compressors can be implemented when they become needed in the future. The formats of the compressed data via such compressors or any compressors that may be defined in the future are left open to their implementations. These could include Huffman encoding, arithmetic encoding, and splay tree encoding [8,9]. Korn, et. al. Standards Track [Page 24] RFC 3284 VCDIFF June 2002 Large file system vs. small file system: As discussed in Section 4, a target window in a large file may be compared against some source window in another file or in the same file (from some earlier part). In that case, the file offset of the source window is specified as a variable-sized integer in the delta encoding. There is a possibility that the encoding was computed on a system supporting much larger files than in a system where the data may be decoded (e.g., 64-bit file systems vs. 32- bit file systems). In that case, some target data may not be recoverable. This problem could afflict any compression format, and ought to be resolved with a generic negotiation mechanism in the appropriate protocol(s). 10. Summary We have described Vcdiff, a general and portable encoding format for compression and differencing. The format is good in that it allows implementing a decoder without knowledge of the encoders. Further, ignoring the use of secondary compressors not defined within the format, the decoding algorithms run in linear time and requires working space proportional to window size. 11. Acknowledgements Thanks are due to Balachander Krishnamurthy, Jeff Mogul and Arthur Van Hoff who provided much encouragement to publicize Vcdiff. In particular, Jeff helped in clarifying the description of the data format presented here. 12. Security Considerations Vcdiff only provides a format to encode compressed and differenced data. It does not address any issues concerning how such data are, in fact, stored in a given file system or the run-time memory of a computer system. Therefore, we do not anticipate any security issues with respect to Vcdiff. 13. Source Code Availability Vcdiff is implemented as a data transforming method in Phong Vo's Vcodex library. AT&T Corp. has made the source code for Vcodex available for anyone to use to transmit data via HTTP/1.1 Delta Encoding [10,11]. The source code and according license is accessible at the below URL: http://www.research.att.com/sw/tools Korn, et. al. Standards Track [Page 25] RFC 3284 VCDIFF June 2002 14. Intellectual Property Rights The IETF has been notified of intellectual property rights claimed in regard to some or all of the specification contained in this document. For more information consult the online list of claimed rights, at <http://www.ietf.org/ipr.html>. The IETF takes no position regarding the validity or scope of any intellectual property or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; neither does it represent that it has made any effort to identify any such rights. Information on the IETF's procedures with respect to rights in standards-track and standards-related documentation can be found in BCP 11. Copies of claims of rights made available for publication and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementors or users of this specification can be obtained from the IETF Secretariat. 15. IANA Considerations The Internet Assigned Numbers Authority (IANA) administers the number space for Secondary Compressor ID values. Values and their meaning must be documented in an RFC or other peer-reviewed, permanent, and readily available reference, in sufficient detail so that interoperability between independent implementations is possible. Subject to these constraints, name assignments are First Come, First Served - see RFC 2434 [13]. Legal ID values are in the range 1..255. This document does not define any values in this number space. 16. References [1] D.G. Korn and K.P. Vo, Vdelta: Differencing and Compression, Practical Reusable Unix Software, Editor B. Krishnamurthy, John Wiley & Sons, Inc., 1995. [2] J. Ziv and A. Lempel, A Universal Algorithm for Sequential Data Compression, IEEE Trans. on Information Theory, 23(3):337-343, 1977. [3] W. Tichy, The String-to-String Correction Problem with Block Moves, ACM Transactions on Computer Systems, 2(4):309-321, November 1984. Korn, et. al. Standards Track [Page 26] RFC 3284 VCDIFF June 2002 [4] E.M. McCreight, A Space-Economical Suffix Tree Construction Algorithm, Journal of the ACM, 23:262-272, 1976. [5] J.J. Hunt, K.P. Vo, W. Tichy, An Empirical Study of Delta Algorithms, IEEE Software Configuration and Maintenance Workshop, 1996. [6] J.J. Hunt, K.P. Vo, W. Tichy, Delta Algorithms: An Empirical Analysis, ACM Trans. on Software Engineering and Methodology, 7:192-214, 1998. [7] D.G. Korn, K.P. Vo, Sfio: A buffered I/O Library, Proc. of the Summer '91 Usenix Conference, 1991. [8] D. W. Jones, Application of Splay Trees to Data Compression, CACM, 31(8):996:1007. [9] M. Nelson, J. Gailly, The Data Compression Book, ISBN 1-55851- 434-1, M&T Books, New York, NY, 1995. [10] J.C. Mogul, F. Douglis, A. Feldmann, and B. Krishnamurthy, Potential benefits of delta encoding and data compression for HTTP, SIGCOMM '97, Cannes, France, 1997. [11] Mogul, J., Krishnamurthy, B., Douglis, F., Feldmann, A., Goland, Y. and A. Van Hoff, "Delta Encoding in HTTP", RFC 3229, January 2002. [12] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. [13] Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA Considerations Section in RFCs", BCP 26, RFC 2434, October 1998. [14] D.G. Korn and K.P. Vo, Engineering a Differencing and Compression Data Format, Submitted to Usenix'2002, 2001. Korn, et. al. Standards Track [Page 27] RFC 3284 VCDIFF June 2002 17. Authors' Addresses Kiem-Phong Vo (main contact) AT&T Labs, Room D223 180 Park Avenue Florham Park, NJ 07932 Phone: 1 973 360 8630 EMail: kpv@research.att.com David G. Korn AT&T Labs, Room D237 180 Park Avenue Florham Park, NJ 07932 Phone: 1 973 360 8602 EMail: dgk@research.att.com Jeffrey C. Mogul Western Research Laboratory Hewlett-Packard Company 1501 Page Mill Road, MS 1251 Palo Alto, California, 94304, U.S.A. Phone: 1 650 857 2206 (email preferred) EMail: JeffMogul@acm.org Joshua P. MacDonald Computer Science Division University of California, Berkeley 345 Soda Hall Berkeley, CA 94720 EMail: jmacd@cs.berkeley.edu Korn, et. al. Standards Track [Page 28] RFC 3284 VCDIFF June 2002 18. Full Copyright Statement Copyright (C) The Internet Society (2002). All Rights Reserved. This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English. The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns. This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Acknowledgement Funding for the RFC Editor function is currently provided by the Internet Society. Korn, et. al. Standards Track [Page 29]