ECE291 Computer Engineering II Lockwood, Spring 1998

Machine Problem 2: Huffman Compression

Due Date: Friday 2/20/98
Purpose: Subroutines, User I/O, Compression algorithms
Points: 50

Introduction

Compression is extremely useful for data such as text, images, sound, and video. It allows audio files (such as MP3s) to be reduced to a fraction of their original size, permits hundreds of channels of video to be transmitted through a single digital satellite system (DSS), enables a modem to achieve higher throughput, and can increase the storage capacity of a disk or tape drive.

Non-lossy (lossless) compression algorithms, such as the one studied in this MP, completely preserve the data they compress. By representing the most frequently recurring data patterns with the shortest bit sequences, the total size of the data can be greatly reduced.

In this MP, we will write an interactive assembly program that encodes and decodes text messages using Huffman compression. The text message will consist of symbols that include: the 26 letters ('A'..'Z'), the space (' '), and the asterisk ('*'). Each symbol will be represented using a variable-bit-length encoding.

Huffman Encoding

In general, the generation of an optimal Huffman encoding table is data dependent. The first step requires scanning through a document to determine each symbol's frequency of occurrence. Next, each symbol is represented as a node in a graph, with a weight equal to its probability of occurrence.
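The scanning step can be sketched in a few lines. Python is used here only for illustration (the MP itself is written in assembly), and the function name `letter_frequencies` and the sample string are hypothetical, not part of the handout:

```python
from collections import Counter

def letter_frequencies(text):
    """Return {letter: relative frequency} for the letters A..Z found in text."""
    # Fold to uppercase and ignore everything that is not a letter.
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    counts = Counter(letters)
    total = len(letters)
    return {ch: n / total for ch, n in counts.items()}

sample = "the quick brown fox"   # hypothetical sample text
freqs = letter_frequencies(sample)
```

Each resulting frequency would become the weight of one leaf node in the next step.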

The main body of the algorithm involves repeatedly combining the two least frequently occurring nodes into a tree-structured supernode. The parent of a supernode is given a weight equal to the sum of the weights of its children. The algorithm continues combining nodes and supernodes until all nodes have been combined into a single tree. Once finished, all of the symbols appear as leaves in the tree.

The encoding of the symbols is determined by assigning '0' and '1' values to the branches of the tree. The encoded value of a symbol is found by following the branches from the root down to that symbol's leaf and recording the value of each branch taken. The number of bits required to represent a symbol is therefore equal to the depth of its leaf. Frequent letters are represented with fewer bits, while infrequent letters are represented with more.
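As a rough illustration of the combining and labeling steps described above, the following Python sketch builds a tree with a priority queue and then walks it to assign bit strings. It is a minimal sketch, not the script used to produce the table for this MP: the function name `huffman_codes` and the four-symbol weights are made up, and because ties between equal weights can be broken in different ways, it will not necessarily reproduce the exact patterns given in this handout.

```python
import heapq
from itertools import count

def huffman_codes(weights):
    """Map each symbol to a '0'/'1' string, given a non-empty {symbol: weight}."""
    tie = count()  # tie-breaker so the heap never has to compare tree nodes
    # Heap entries are (weight, tie, node); a node is a symbol or a (left, right) pair.
    heap = [(w, next(tie), sym) for sym, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)   # the two lightest nodes
        w2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tie), (a, b)))  # their supernode
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")  # left branch
            walk(node[1], prefix + "1")  # right branch
        else:
            codes[node] = prefix or "0"  # a lone symbol still needs one bit

    walk(heap[0][2], "")
    return codes

# Hypothetical weights, chosen only to keep the example small.
example = huffman_codes({"E": 0.5, "T": 0.25, "Z": 0.125, "J": 0.125})
```

In this small example the heaviest symbol ends up one level below the root and the two lightest symbols end up three levels down, which matches the rule that the code length equals the depth of the leaf.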

For this MP, you are given the Huffman encoding table. This table was generated using Prof. Lockwood's PhD dissertation as a representative sample of standard writing. (Whether or not Prof. Lockwood's writing actually reflects standard writing is an entirely different question...) A Perl script scanned the entire file and counted the frequency of each letter in the document. The results are shown below:

e 0.121 t 0.105 i 0.081 o 0.072 a 0.068 n 0.065 s 0.062 r 0.061 c 0.046
h 0.044 l 0.043 u 0.033 d 0.033 p 0.030 m 0.028 f 0.024 g 0.019 b 0.017
w 0.014 v 0.012 y 0.009 k 0.005 q 0.004 x 0.003 z 0.001 j 0.001

As expected, vowels and common letters such as 'E', 'T', 'I', and 'O' appeared most often. The letters 'Z' and 'J' appeared least frequently. Using the algorithm described above, the following encoding tree was generated: (The space and asterisk symbols were added later by splitting the nodes for S and V.)

In this tree, the letter 'E' can be found by following the right-left-right branches. By assigning '0' and '1' values to the left and right branches, respectively, this corresponds to an encoding pattern of 101. Likewise, the letter 'Z' has an encoding of 110000110. Note that decoding Huffman codes is somewhat tricky because symbols are represented with differing numbers of bits.
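One way to see why variable-length decoding still works is that no pattern in the table is a prefix of another, so a decoder can read one bit at a time and emit a symbol as soon as the bits read so far match a pattern. The Python sketch below illustrates this with a four-letter subset of the MSB-first patterns from HUFFCODE.INC; the function name `decode` is hypothetical, and this is only an illustration of the idea, not the required structure of your DecodeHuff procedure.

```python
# Subset of the MSB-first patterns from HUFFCODE.INC (just the letters in 'HELLO').
CODES = {"H": "00101", "E": "101", "L": "00100", "O": "0001"}

def decode(bits, codes=CODES):
    """Decode a string of '0'/'1' characters into text, one bit at a time."""
    by_pattern = {pattern: sym for sym, pattern in codes.items()}
    text, current = [], ""
    for bit in bits:
        current += bit
        if current in by_pattern:        # a complete symbol has been read
            text.append(by_pattern[current])
            current = ""                 # start collecting the next symbol
    if current:
        raise ValueError("trailing bits do not form a complete symbol")
    return "".join(text)

message = decode("0010110100100001000001")
```

Because the decoder never needs to look ahead, the same approach can be followed by walking the encoding tree from the root and restarting at the root after every leaf.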

Messages and documents can be formed by appending symbols together. The message 'HELLO', for example, can be compactly represented with only 22 bits as: 00101 101 00100 00100 0001. To clarify the encoding, spaces are shown between the symbols. In memory, however, adjacent symbols occupy adjacent bit positions. In general, we will be treating memory as an array of bits. Because the symbols do not fall on even byte boundaries, it is up to the decoding algorithm to decide where one symbol stops and the next begins.
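The idea of treating memory as an array of bits can be sketched as follows. This Python illustration assumes that bit i of the encoded stream is stored in byte i / 8 at bit position i mod 8 (least significant bit first within each byte); that layout is an assumption made for the example, so check the data structure description for the ordering your Buffer must use. The helper names `encode`, `pack_bits`, and `unpack_bits` are hypothetical.

```python
# Subset of the MSB-first patterns from HUFFCODE.INC (just the letters in 'HELLO').
CODES = {"H": "00101", "E": "101", "L": "00100", "O": "0001"}

def encode(text, codes=CODES):
    """Concatenate the bit patterns of each symbol into one '0'/'1' string."""
    return "".join(codes[ch] for ch in text)

def pack_bits(bits):
    """Pack a bit string into bytes (assumed layout: stream bit i -> byte i//8, bit i%8)."""
    buf = bytearray((len(bits) + 7) // 8)   # round up to whole bytes
    for i, bit in enumerate(bits):
        if bit == "1":
            buf[i // 8] |= 1 << (i % 8)
    return bytes(buf)

def unpack_bits(buf, nbits):
    """Recover the first nbits of the stream from a packed buffer."""
    return "".join(str((buf[i // 8] >> (i % 8)) & 1) for i in range(nbits))

bits = encode("HELLO")        # 22 bits
packed = pack_bits(bits)      # 3 bytes; the last byte has 2 unused padding bits
```

Since the last byte is padded, the number of valid bits has to be kept separately from the data, which is the role the BufferLength variable plays in the program framework.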

User Interface

Sample Input & Output

Data Structures

Procedures

Preliminary Procedure

Final Steps

  1. Print a copy of the MP2 grading sheet.
  2. Demonstrate MP2.EXE to a TA or to the instructor.
  3. Hand in your program by running:
    A:\Handin YourWindowsLogin
  4. Print MP2.ASM
  5. Staple the MP2 grading sheet to the front of your MP2.ASM file and give both to the same TA that approved your demonstration.

HUFFCODE.INC (Huffman Encoding Table)

; Using Lockwood's PhD thesis as a model for typical English usage
; The frequencies of each letter were determined as:
;
; e 0.121 t 0.105 i 0.081 o 0.072 a 0.068 n 0.065 s 0.062 r 0.061 c 0.046
; h 0.044 l 0.043 u 0.033 d 0.033 p 0.030 m 0.028 f 0.024 g 0.019 b 0.017
; w 0.014 v 0.012 y 0.009 k 0.005 q 0.004 x 0.003 z 0.001 j 0.001
;
; Huffman's algorithm was then used to generate the variable length encoding
; for these symbols.

; Format: (ASCII Letter, Pattern (MSB first), Pattern (LSB first), # bits)

HuffCodes HuffCode<' ', 01011b,11010b ,5>
          HuffCode<'A', 0000b,0000b ,4>
          HuffCode<'B', 100100b,001001b ,6>
          HuffCode<'C', 1101b,1011b ,4>
          HuffCode<'D', 01100b,00110b ,5>
          HuffCode<'E', 101b,101b ,3>
          HuffCode<'F', 11001b,10011b ,5>
          HuffCode<'G', 100101b,101001b ,6>
          HuffCode<'H', 00101b,10100b ,5>
          HuffCode<'I', 0011b,1100b ,4>
          HuffCode<'J', 110000111b,111000011b ,9>
          HuffCode<'K', 1001110b,0111001b ,7>
          HuffCode<'L', 00100b,00100b ,5>
          HuffCode<'M', 10000b,00001b ,5>
          HuffCode<'N', 0111b,1110b ,4>
          HuffCode<'O', 0001b,1000b ,4>
          HuffCode<'P', 10001b,10001b ,5>
          HuffCode<'Q', 1100000b,0000011b ,7>
          HuffCode<'R', 0100b,0010b ,4>
          HuffCode<'S', 01010b,01010b ,5>
          HuffCode<'T', 111b,111b ,3>
          HuffCode<'U', 01101b,10110b ,5>
          HuffCode<'V', 1100010b,0100011b ,7>
          HuffCode<'W', 100110b,011001b ,6>
          HuffCode<'X', 11000010b,01000011b ,8>
          HuffCode<'Y', 1001111b,1111001b ,7>
          HuffCode<'Z', 110000110b,011000011b ,9>
          HuffCode<'*', 1100011b,1100011b ,7>

; Note: The symbols for space and * were added after calculation
;       of the Huffman encoding by splitting the nodes for S and V.
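A few properties of the table above can be checked mechanically: each LSB-first pattern should be the bit reversal of its MSB-first pattern, each stated length should match the number of digits written, and no MSB-first pattern should be a prefix of another. The Python sketch below transcribes the table as strings so that leading zeros are kept (as binary numbers they would be lost, which is why the table carries an explicit bit count). The names `TABLE`, `check_table`, and `kraft_sum` are hypothetical helpers, not part of the MP files.

```python
from fractions import Fraction

# (letter, MSB-first pattern, LSB-first pattern, bit count), transcribed from HUFFCODE.INC
TABLE = [
    (" ", "01011", "11010", 5), ("A", "0000", "0000", 4),
    ("B", "100100", "001001", 6), ("C", "1101", "1011", 4),
    ("D", "01100", "00110", 5), ("E", "101", "101", 3),
    ("F", "11001", "10011", 5), ("G", "100101", "101001", 6),
    ("H", "00101", "10100", 5), ("I", "0011", "1100", 4),
    ("J", "110000111", "111000011", 9), ("K", "1001110", "0111001", 7),
    ("L", "00100", "00100", 5), ("M", "10000", "00001", 5),
    ("N", "0111", "1110", 4), ("O", "0001", "1000", 4),
    ("P", "10001", "10001", 5), ("Q", "1100000", "0000011", 7),
    ("R", "0100", "0010", 4), ("S", "01010", "01010", 5),
    ("T", "111", "111", 3), ("U", "01101", "10110", 5),
    ("V", "1100010", "0100011", 7), ("W", "100110", "011001", 6),
    ("X", "11000010", "01000011", 8), ("Y", "1001111", "1111001", 7),
    ("Z", "110000110", "011000011", 9), ("*", "1100011", "1100011", 7),
]

def check_table(table):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for letter, down, up, nbits in table:
        if len(down) != nbits or len(up) != nbits:
            problems.append(f"{letter!r}: pattern length differs from bit count")
        if up != down[::-1]:
            problems.append(f"{letter!r}: LSB-first is not the reverse of MSB-first")
    # After sorting, any prefix relationship shows up between neighbours.
    patterns = sorted(down for _, down, _, _ in table)
    for a, b in zip(patterns, patterns[1:]):
        if b.startswith(a):
            problems.append(f"{a} is a prefix of {b}")
    return problems

def kraft_sum(table):
    """Sum of 2^-length over all symbols; exactly 1 when the code tree has no unused leaves."""
    return sum(Fraction(1, 2 ** nbits) for _, _, _, nbits in table)

problems = check_table(TABLE)
```

An empty `problems` list together with a Kraft sum of exactly 1 would indicate that every sequence of bits decodes along exactly one path through the tree.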

MP2.ASM (Program framework)

PAGE 75, 132
TITLE ECE291:MP2:MP2-Compress - Your Name - Date

COMMENT *
        Data Compression.
        For this MP, you will write an interactive program which uses
        Huffman compression to compress and decompress textual data.
        By representing the most frequently used letters with the smallest
        number of bits, Huffman encoding can achieve significant data
        compression.

        ECE291: Machine Problem 2
        Prof. John W. Lockwood
        University of Illinois
        Dept. of Electrical & Computer Engineering
        Spring 1998

        Ver 2.0
        *

;====== Constants =========================================================

BEEP    EQU 7
BS      EQU 8
CR      EQU 13
LF      EQU 10
ESCKEY  EQU 27
SPACE   EQU 32

HuffCode STRUCT
    letter       BYTE ?     ; Letter to encode
    DownEncoding WORD ?     ; Encoding: MSBit first
    UpEncoding   WORD ?     ; Encoding: LSBit first
    blength      BYTE ?     ; Bit Encoding Length
HuffCode ENDS

TextMsgMaxLength    EQU 70 ; Bytes ; Limit text messages to one line
BufferMaxLengthBits EQU TextMsgMaxLength * 9 ; Worst case: all 9-bit encodes
BufferMaxLength     EQU 1 + (BufferMaxLengthBits)/8 ; Bytes

;====== Externals =========================================================

; -- LIB291 Routines (Free) ---

        extrn kbdine:near, kbdin:near, dspout:near   ; LIB291 Routines
        extrn dspmsg:near, binasc:near, ascbin:near  ; (Always Free)
        extrn PerformanceTest:near  ; Measures performance of your code
        extrn mp2xit:near           ; Exit program with a call to this procedure

; -- LIBMP2 Routines (Replace these with your own code) ---

        extrn PrintBuffer:near      ; Print contents of Buffer
        extrn ReadBuffer:near       ; Read Buffer from keyboard
        extrn ReadTextMsg:near      ; Read TextMsg from keyboard
        extrn PrintTextMsg:near     ; Print contents of TxtMsg
        extrn Encode:near           ; Encode ASCII -> n-bit
        extrn AppendBufferN:near    ; Append N bits to Buffer
        extrn EncodeHuff:near       ; Huffman Encode TextMsg -> Buffer
        extrn DecodeHuff:near       ; Huffman Decode Buffer -> TextMsg

;====== SECTION 3: Define stack segment ===================================

stkseg  segment stack               ; *** STACK SEGMENT ***
        db 64 dup ('STACK   ')      ; 64*8 = 512 Bytes of Stack
stkseg  ends

;====== SECTION 4: Define code segment ====================================

cseg    segment public 'CODE'       ; *** CODE SEGMENT ***
        assume cs:cseg, ds:cseg, ss:stkseg, es:nothing

;====== SECTION 5: Variables ==============================================

Buffer       db BufferMaxLength dup(0)         ; Data Buffer for encoded Message
TextMsg      db TextMsgMaxLength dup('$'), '$' ; Text Message
BufferLength dw 0                              ; Number of bits in buffer

crlf         db CR,LF,'$'   ; DOS uses carriage return + Linefeed for new line
PBuf         db 7 dup(?)

PUBLIC Buffer, TextMsg, BufferLength, HuffCodes

Include huffcode.inc        ; Huffman Encoding Table

;====== Procedures ========================================================

; Your Subroutines go here !
; ---- ----------- -- ----

;====== Main procedure ====================================================

MenuMessage db CR,LF, \
  '----------- MP2 Menu -----------',CR,LF,\
  ' Enter (T)ext (B)inary',CR,LF, \
  ' Print (M)essage (R)buffeR',CR,LF, \
  ' Huffman (E)ncode (D)ecode',CR,LF, \
  '---- [ESC] or (Q)uit = exit ----',CR,LF,'$'

main    proc far
        mov  ax, cseg
        mov  ds, ax

        MOV  DX, Offset MenuMessage
        CALL DSPMSG                 ; Display Menu

MainLoop:
        MOV  DX, Offset CRLF
        CALL DSPMSG

MainRead:
        CALL KBDIN                  ; Read Input
        CMP  AL,'a'
        JB   MainOpt
        CMP  AL,'z'                 ; Convert Lowercase to Uppercase
        JA   MainOpt
        SUB  AL,'a'-'A'

MainOpt:
        CMP  AL,'T'
        JNE  MainNotT
        Call ReadTextMsg            ; Read in a text message
        JMP  MainLoop

MainNotT:
        CMP  AL,'B'
        JNE  MainNotB
        Call ReadBuffer             ; Read in a binary message
        JMP  MainLoop

MainNotB:
        CMP  AL,'M'
        JNE  MainNotM
        Call PrintTextMsg           ; Print TextMsg
        JMP  MainLoop

MainNotM:
        CMP  AL,'R'
        JNE  MainNotR               ; Print Buffer
        Call PrintBuffer            ; (show least significant bit first)
        JMP  MainLoop

MainNotR:
        CMP  AL,'E'
        JNE  MainNotE
        Call EncodeHuff             ; Huffman Encode Message
        Call PrintBuffer            ; and print result
        JMP  MainLoop

MainNotE:
        CMP  AL,'D'
        JNE  MainNotD
        Call DecodeHuff             ; Huffman Decode Message
        Call PrintTextMsg           ; and show result
        JMP  MainLoop

MainNotD:
        CMP  AL,'P'
        JNE  MainNotP               ; Performance Test
        MOV  SI, offset EncodeHuff
        MOV  DI, offset DecodeHuff
        Call PerformanceTest
        JMP  MainLoop

MainNotP:
        CMP  AL,ESCKEY
        JE   MainDone               ; Quit program
        CMP  AL,'Q'
        JE   MainDone
        JMP  MainRead               ; Ignore any other character

MainDone:
        call MP2xit                 ; Exit to DOS

main    endp

cseg    ends
        end main