gzip_1_2_4程序分析

勤奮不止 2012-04-06

展開全文

gzip-1.2.4程序分析- -

《algorithm.doc文件的部分翻譯》和《gzip-1.2.4程序分析》

algorithm.doc文件的部分翻譯

2 bytes GZIP標(biāo)志字節(jié)：0x1f, 0x8b (\037 \213)
1 byte   壓縮方法： (0..7 reserved, 8 = deflate)
1 byte   標(biāo)志位：
            bit 0 set: 文件可能是ASCII文本文件
            bit 1 set: 附加多個gzip文件部分
            bit 2 set: 存在有可選的附加內(nèi)容
            bit 3 set: 提供了原始的文件名稱
            bit 4 set: 則提供有一個O－終結(jié)的文件內(nèi)容
            bit 5 set: 文件被加密
            bit 6,7:   保留
4 bytes 文件更改時間(Unix時間)
1 byte   額外的標(biāo)志，決定了壓縮方法。 2:使用最大的壓縮，最慢的算法
                              4:采用最快的算法
1 byte   這個標(biāo)志指明了進行壓縮時系統(tǒng)的類型。
                 0 - FAT filesystem (MS-DOS, OS/2, NT/Win32)
                 1 - Amiga
                 2 - VMS (or OpenVMS)
                 3 - Unix
                 4 - VM/CMS
                 5 - Atari TOS
                 6 - HPFS filesystem (OS/2, NT)
                 7 - Macintosh
                 8 - Z-System
                 9 - CP/M
                10 - TOPS-20
                11 - NTFS filesystem (NT)
                12 - QDOS
                13 - Acorn RISCOS
               255 - unknown
2 bytes optional part number (second part=1) 可選的序號
2 bytes optional extra field length        可選的附加內(nèi)容的長度
bytes optional extra field           可選的附加內(nèi)容
bytes optional original file name, zero terminated
可選的原始文件名稱，以'\0'結(jié)束
bytes optional file comment, zero terminated
可選文件內(nèi)容(這部分不被解釋，而是可讀的供人使用的，以'\0'結(jié)束
12 bytes optional encryption header
bytes compressed data
4 bytes crc32                    這個是未壓縮數(shù)據(jù)的循環(huán)冗余校驗值。
4 bytes uncompressed input size modulo 2^32    這是原始數(shù)據(jù)的長度以2的32次方為模的值。

設(shè)計了一種可以單向編碼的格式，而不用反向查找，也不用預(yù)知未壓縮數(shù)據(jù)及輸出的
已壓縮數(shù)據(jù)的大小。如果輸入的數(shù)據(jù)不是一個文件，那么修改時間被設(shè)置為壓縮的開
始時間。
The format was designed to allow single pass compression without any
backwards seek, and without a priori knowledge of the uncompressed
input size or the available size on the output media. If input does
not come from a regular disk file, the file modification time is set
to the time at which compression started.

時間戳主要是用在在網(wǎng)絡(luò)上傳輸gzip文件的情況下。在這種情況下，它不需要保存所有
者的屬性。在本地傳輸?shù)臅r候，所有者的屬性在壓縮/解壓縮時由gzip所保存。忽略值為
0的時間戳。
The time stamp is useful mainly when one gzip file is transferred over
a network. In this case it would not help to keep ownership
attributes. In the local case, the ownership attributes are preserved
by gzip when compressing/decompressing the file. A time stamp of zero
is ignored.

標(biāo)志位中，值為0的位是可選的，它可以使我們對輸入的數(shù)據(jù)做一個預(yù)先的了解。在不
確定的時候，要將標(biāo)志位清除。對有不同文件格式(文本文件和二進制文件)的系統(tǒng)來說，
解碼時，可以使用標(biāo)志位來選擇不同的格式。
Bit 0 in the flags is only an optional indication, which can be set by
a small lookahead in the input data. In case of doubt, the flag is
cleared indicating binary data. For systems which have different
file formats for ascii text and binary data, the decompressor can
use the flag to choose the appropriate format.

如果有附加內(nèi)容，則它必須包含一個或多個子字段，每個子字段有如下格式：
The extra field, if present, must consist of one or more subfields,
each with the following format:

subfield id   : 2 bytes    子字段ID
subfield size : 2 bytes (little-endian format)子字段長度(小端字節(jié)序)
subfield data        子字段內(nèi)容

    子字段ID可以包含兩個可記住的字母。請發(fā)送一些這樣的ID給jloup@chorus.fr.
第二個字節(jié)為0的ID是被保留的。定義了如下的ID
The subfield id can consist of two letters with some mnemonic value.
Please send any such id to jloup@chorus.fr. Ids with a zero second
byte are reserved for future use. The following ids are defined:

Ap (0x41, 0x70) : Apollo file type information

    子字段長度是子字段內(nèi)容的長度，不包含ID及子字段長度這四字節(jié)。但是
前面所說的"可選的附加內(nèi)容的長度"則包含了ID及子字段長度的四字節(jié)。
The subfield size is the size of the subfield data and does not
include the id and the size itself. The field 'extra field length' is
the total size of the extra field, including subfield ids and sizes.

必須可以在壓縮數(shù)據(jù)中找到數(shù)據(jù)結(jié)束的位置，而不論數(shù)據(jù)的實際長度是多少。如果壓縮
數(shù)據(jù)不能夠放到一個文件中(如磁盤的情況)，每一部分都要由一個頭字段開始，但是只
有最后一部分中有CRC32和原始數(shù)據(jù)的長度。解壓程序應(yīng)該可以提示輸入另外的，存在
于多個壓縮文件中的數(shù)據(jù)。這是必要，但不是絕對的，因為當(dāng)一部分數(shù)據(jù)毀壞時，還需
要得到其它部分的內(nèi)容。
It must be possible to detect the end of the compressed data with any
compression format, regardless of the actual size of the compressed
data. If the compressed data cannot fit in one file (in particular for
diskettes), each part starts with a header as described above, but
only the last part has the crc32 and uncompressed size. A decompressor
may prompt for additional data for multipart compressed files. It is
desirable but not mandatory that multiple parts be extractable
independently so that partial data can be recovered if one of the
parts is damaged. This is possible only if no compression state is
kept from one part to the other. The compression-type dependent flags
can indicate this.

如果壓縮文件的系統(tǒng)對文件名的大小寫不敏感，則原始文件名會會強制轉(zhuǎn)換成小寫。
如果是從標(biāo)準(zhǔn)輸入讀入的數(shù)據(jù)，則沒有原始文件名。
If the file being compressed is on a file system with case insensitive
names, the original name field must be forced to lower case. There is
no original file name if the data was compressed from standard input.

即使壓縮后的文件會比原來的文件大，壓縮還是會完成的。
Compression is always performed, even if the compressed file is
slightly larger than the original. The worst case expansion is
a few bytes for the gzip file header, plus 5 bytes every 32K block,
or an expansion ratio of 0.015% for large files. Note that the actual
number of used disk blocks almost never increases.

The encryption is that of zip 1.9. For the encryption check, the
last byte of the decoded encryption header must be zero. The time
stamp of an encrypted file might be set to zero to avoid giving a clue
about the construction of the random header.

gzip-1.2.4程序分析

一點說明：
    在gzip.c中：
DECLARE(uch, inbuf, INBUFSIZ +INBUF_EXTRA);
DECLARE(uch, outbuf, OUTBUFSIZ+OUTBUF_EXTRA);
DECLARE(ush, d_buf, DIST_BUFSIZE);
DECLARE(uch, window, 2L*WSIZE);
#ifndef MAXSEG_64K
       DECLARE(ush, tab_prefix, 1L<<BITS);
#else
    DECLARE(ush, tab_prefix0, 1L<<(BITS-1));
    DECLARE(ush, tab_prefix1, 1L<<(BITS-1));
#endif
    實際上定義了一些數(shù)組：inbuf,outbuf,d_buf,window,tab_prefix,tab_prefix0,tabfix1.
1/
==================================================================================
入口程序：gzip-1.2.4/gzip.c
函數(shù)： int main (argc, argv)
         int argc;
         char **argv;
功能： 1)通過命令內(nèi)容(gzip,gunzip,unzip等)，設(shè)置操作類型(壓縮或是解壓縮)。
    2)通過參數(shù)，設(shè)置一些全局變量的值，對我們而言，有用的是：ascii(表示
為文本文件，可以根據(jù)本地的換行符來代替解壓后的文件中的換行符)、decompress(表示進行解壓操作)和level(轉(zhuǎn)換操作的級別—進行更快
的轉(zhuǎn)換還是進行更大壓縮比的轉(zhuǎn)換，當(dāng)然，這只對壓縮而言)。
    3)為輸入、輸出及窗口的緩沖分配內(nèi)存。
    4)調(diào)用treat_file(argv[optind++]);對文件進行操作。

2/
==================================================================================
函數(shù)： local void treat_file(iname)
         char *iname;
參數(shù)：為文件名稱；
功能： 1)得到輸入的文件的狀態(tài)：name,size,time,mode等。
    2)創(chuàng)建輸出文件的名稱。
    3)當(dāng)進行解壓操作時，調(diào)用    local int get_method(in) 來得到gz文件的壓縮方法。
    4)如果命令行中的參數(shù)-l，則調(diào)用do_list()顯示文件信息。
    5)調(diào)用local int create_outfile()創(chuàng)建輸出文件。
    6) 調(diào)用(*work)(ifd, ofd)進行壓縮、解壓縮的操作。這時的work指針被get_method()
函數(shù)置為unzip()函數(shù)(解壓時)，或是為默認的zip()函數(shù)。在解壓縮時，
這個過程是在循環(huán)中的，因為可能會包含多個文件。

3/
==================================================================================
函數(shù)： local int get_method(in)
        int in;        /* input file descriptor */
參數(shù)：文件名稱
功能： 1)驗證第一第二字節(jié)是否為0x1F,0x8B。
    2)驗證第三字節(jié)是否為0x08(deflate)。
    3)設(shè)置函數(shù)指針work = unzip。(work的默認值是zip)
    4)得到做為flags的第四字節(jié)。
    5)如果設(shè)置了第1、5、6、7位，則給出錯誤提示。(編號0到7是從最低位開始)
    6)將第5到8字節(jié)中的時間值保存在全局變量time_stamp中。
    7)跳過第9字節(jié)(壓縮時采用的算法—更快或是比例更高)和
第10字節(jié)(壓縮時的操作系統(tǒng))。
    8)如果設(shè)置了flags的第1位，則得到當(dāng)前文件的編號
    9)如果設(shè)置了flags的第2位(存在有附加的內(nèi)容)，則得到附加內(nèi)容的長度，
并跳過這部分內(nèi)容。
    10)如果設(shè)置了flags的第3位(存在有原始文件的名稱)，則得到原始文件的名稱。
    11)如果設(shè)置了flags的第4位(存在一段不用解析的內(nèi)容，是給人提供可讀信息的)，
跳過這部分可讀信息。
    12) 設(shè)置頭部信息的長度：header_bytes，包括了最后的CRC及文件長度部分。
返回：函數(shù)壓縮方法(一般為"deflate"，程序中的返回值為8)

4/
==================================================================================
在文件gzip-1.2.4/unzip.c中：
函數(shù)： int unzip(in, out)
       int in, out;   /* input and output file descriptors */
參數(shù)：為輸入、輸出文件。
功能： 1)初始化全局變量crc。
    2)調(diào)用函數(shù)inflate()進行解碼操作。
    3)得到原來文件中保存的CRC及長度值。如果與當(dāng)前計算出的值不同，則產(chǎn)生提示。

5/
==================================================================================
在文件gzip-1.2.4/inflate.c中：
函數(shù)： int inflate()
說明： ulg bb;                         /* 是 bit buffer */
unsigned bk;                    /* 是bit buffer中還有多少位，即剩余的位數(shù) */
功能： 1) 循環(huán)調(diào)用inflate_block(&e)，一塊一塊的解壓數(shù)據(jù)。
   2)若bk>-8,即bb中有完整的字節(jié)，則將此字節(jié)放回輸入中。
    3)輸出解壓得到的內(nèi)容。

6/
==================================================================================
在文件gzip-1.2.4/inflate.c中：
函數(shù)： int inflate_block(e)
int *e;                 /* last block flag */
參數(shù)：如果是1，是說明當(dāng)前塊是最后一塊。
功能： 1)得到第一位，這一位說明當(dāng)前塊是否為最后一塊(0，不是；1，是)并相應(yīng)的設(shè)置參數(shù)。
    2)得到下兩位的值：
0，本塊沒有壓縮，
1，用固定的Huffman編碼壓縮，見RFC1951的3.2.6節(jié)。
2，用動態(tài)的Huffman編碼壓縮，見RFC1951的3.2.7節(jié)。
    3)根據(jù)前面得到的值，調(diào)用不同的函數(shù)解壓：
        inflate_stored();    對于未壓縮的數(shù)據(jù)，調(diào)用這個函數(shù)。
        inflate_fixed(); 對于用固定的Huffman編碼壓縮的數(shù)據(jù)，調(diào)用這個函數(shù)。
        inflate_dynamic();    對于用動態(tài)的Huffman編碼壓縮的數(shù)據(jù)，調(diào)用這個函數(shù)。

7/
==================================================================================
在文件gzip-1.2.4/inflate.c中：
函數(shù)： int inflate_stored()
功能：處理非壓縮的數(shù)據(jù)內(nèi)容
1)丟棄不足一字節(jié)的位。由于非壓縮的數(shù)據(jù)中，內(nèi)容都是以字節(jié)為單位的，所以原來按
        位讀取的時候，會剩余不足一字節(jié)位內(nèi)容，現(xiàn)在要去掉這些位。
    2)讀入兩字節(jié)的內(nèi)容，其值是未壓縮的數(shù)據(jù)長度。再讀入兩字節(jié)的內(nèi)容，其值應(yīng)該是前
        兩字節(jié)所表示的長度的補碼，若不是，則錯誤。
    3)逐字節(jié)的讀入內(nèi)容，并輸出到輸出文件中。

8/
==================================================================================
在文件gzip-1.2.4/inflate.c中：
函數(shù)： int inflate_fixed()
功能：用固定的Huffman編碼壓縮的數(shù)據(jù)
1)     為0至287的文字/length值設(shè)定編碼長度:
                   Lit Value    Bits        Codes
                   ---------    ----        -----
                     0 - 143     8          00110000 through
                                            10111111
                   144 - 255     9          110010000 through
                                            111111111
                   256 - 279     7          0000000 through
                                            0010111
                   280 - 287     8          11000000 through
                                            11000111
2)     調(diào)用huft_build()建造文字/length值的Huffman樹
3)     設(shè)置所有distance值(從0至29)的編碼長度為5。
4)     調(diào)用huft_build()建造distance值的Huffman樹
5)     調(diào)用函數(shù)inflate_codes()進行解碼。

9/
==================================================================================
在文件gzip-1.2.4/inflate.c中：
函數(shù)： int inflate_dynamic()
功能：用動態(tài)的Huffman編碼壓縮的數(shù)據(jù)
1)讀入5位的值HLIT，算出nl = 257+HLIT。這是需要編碼的最大值。
2)讀入5位的值HDIST，算出nd = 1+HDIST。這是distance的最大值。
3)讀入4位的值HCLEN，算出nb = 4+HCLEN。說明有多少種編碼長度。
4)再讀入3*nb位，每三位的值表示用多少位來表示所對應(yīng)的編碼長度。
5)調(diào)用huft_build()建造編碼長度的Huffman樹。
6)利用這個Huffman樹，對接下來的若干位解碼出nl+nd個值，這些值依次是0~nl-1
的編碼長度(對于文字/length平說)，及0~nd-1的編碼長度(對于distance來說)。
7)利用上面解碼出的兩組長度值，兩次調(diào)用huft_build()函數(shù)，建造兩個Huffman樹
(一個是為文字/length，另一個是為distance)。
8)調(diào)用函數(shù)inflate_codes()進行解碼。

10/
==================================================================================
在文件gzip-1.2.4/inflate.c中：
函數(shù)： int inflate_codes(tl, td, bl, bd)
struct huft *tl, *td;　/* literal/length and distance decoder tables */
int bl, bd;            /* number of bits decoded by tl[] and td[] */
參數(shù)： tl,td是進行Huffman編碼解碼時用到的結(jié)構(gòu)體，由于length和distance用不同的編碼
方式，所以要有兩個指針進行解碼。
    在兩種編碼中，用struct huft結(jié)構(gòu)編碼時，分別以bl,bd位進行編碼。
功能：用兩個以經(jīng)做好的鏈表來進行解碼。
1）解碼一個值X，如果0<=X<=255，則X是一個字符，輸出，循環(huán)1）。
2）如果X==255，則說明塊結(jié)束，函數(shù)返回。
3） X>255，則說明讀到的是一個length值，根據(jù)這個值，及其后的附加位，得到真實的
length值。
4）繼續(xù)讀入一個值，這個值是distance的標(biāo)志值，根據(jù)這個值及其后的附加位得到真實
的distance。
5）在已經(jīng)輸出的串中，向前查找distance個字節(jié)，拷貝length個字節(jié)到輸出串的末尾。
6）循環(huán)1）

11/
==================================================================================
在文件gzip-1.2.4/inflate.c中：
函數(shù)： int huft_build()　和函數(shù)int huft_free()比較獨立，可以直接引用，不再分析。
功能： int huft_build()    ：建立Huffman解碼鏈表。
int huft_free() ：清除鏈表。
12/
==================================================================================
在文件gzip-1.2.4/zip.c中：
函數(shù)： int zip(in, out)
       int in, out;   /* input and output file descriptors */
參數(shù)：為輸入、輸出文件。
功能：
1）向輸出寫入三字節(jié)：0x1F 0x8B 0x08。
2）向輸出寫入一個含有8個標(biāo)志位的字節(jié)。
3）向輸出寫入4字節(jié)的系統(tǒng)時間。
4）初始化CRC的值。
5）調(diào)用bi_init(out)初始化讀入位串的程序。
6）調(diào)用ct_init()進行分配內(nèi)存，初始化變量表，保存原始文件信息的
操作。
7）調(diào)用lm_init()為新文件初始化"最長匹配"的程序。
8）再向輸出寫入2字節(jié)，一個為額外的標(biāo)志，一個為系統(tǒng)類型。
9）如果需要，則保存原始文件名稱。
10) 保存頭部信息的長度。
11) 調(diào)用函數(shù)deflate()壓縮。
12) 寫入4字節(jié)的CRC值。
13) 寫入4字節(jié)的原始內(nèi)容長度值。
14）修改前面保存的頭部信息長度的值。

13/
==================================================================================
在文件gzip-1.2.4/deflate.c中：
函數(shù)： ulg deflate()
功能：壓縮數(shù)據(jù)。此函數(shù)通過一些復(fù)雜的算法來進行壓縮操作，可以直接引用。
1)     如果需要快速壓縮，則調(diào)用函數(shù)deflate_fast()，然后返回。
2)     將當(dāng)前內(nèi)容插入到哈希表中，并查找最長匹配。
3)     若找到匹配內(nèi)容，則輸出<length,distence>對的編碼，否則輸出字符編碼。

14/
==================================================================================
在文件gzip-1.2.4/deflate.c中：
函數(shù)： ulg deflate()
功能：壓縮數(shù)據(jù)。此函數(shù)通過一些稍簡單一些的算法來進行壓縮操作，可以直接引用。
1）將當(dāng)前內(nèi)容插入到哈希表中，并查找最長匹配。
2）若找到匹配內(nèi)容，則輸出<length,distence>對的編碼，否則輸出字符編碼。