Slowest LZSS Compressor in C - CodeProject


Introduction

I wish to share my latest LZSS decompressor, Nakamichi 'Loonette', in the hope that someone will improve on it in the future.

No sane person needs slowness, yet in order to explore the potential of deep LZSS (on English texts), I dove into the deep.

Background

One of the main goals is to boost I/O bandwidth (linear reads) 2x or 3x by using superfast decompression; NTFS uses some kind of LZSS to achieve just that. Recently, I saw M.2 SSD drives breaking the 1GB/s linear-read barrier; next to super implementations such as LzTurbo and LZ4, those 1024MB/s look like an automobile from the 70's. Now we reach for ten times as much.

A Glimpse at Compression Approach

Right into the ZIP face-off:

D:\_KAZE\EBOOKs_for_benching\Nakamichi_Loonette_Intel15>dir

02/20/2015  06:00 PM        33,258,496 Agatha_Christie_89-ebooks_TXT.tar
02/13/2015  03:46 AM        13,365,137 Agatha_Christie_89-ebooks_TXT.tar.lz4
02/20/2015  05:37 PM        11,745,883 Agatha_Christie_89-ebooks_TXT.tar.Nakamichi
02/20/2015  09:04 PM        11,173,343 Agatha_Christie_89-ebooks_TXT.tar.zip

D:\_KAZE\EBOOKs_for_benching\Nakamichi_Loonette_Intel15>timer32.exe 
            7za.exe t Agatha_Christie_89-ebooks_TXT.tar.zip

7-Zip (A) 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
Processing archive: Agatha_Christie_89-ebooks_TXT.tar.zip
Testing     Agatha_Christie_89-ebooks_TXT.tar
Everything is Ok
Size:       33258496
Compressed: 11173343

Kernel  Time =     0.015 =    3%
User    Time =     0.421 =   92%
Process Time =     0.436 =   95%    Virtual  Memory =      2 MB
Global  Time =     0.457 =  100%    Physical Memory =      4 MB

The battlehorse 7-Zip decompresses at roughly 33,258,496/0.436/1024/1024 = 72MB/s, while Nakamichi 'Loonette' runs at 256MB/s (shown further below). Of course, ZIP sees a 64KB window while 'Loonette' sees 512MB, which means that on big and redundant texts the compression ratio goes up from 3:1 to 4:1.

As stated, it is the slowest because only brute-force memmem() match finding is used. Choosing match lengths of either 12 or 8 bytes simplifies the compression if hashing is to be implemented.

The chosen encodings, ordered by ratio (match length : offset bytes), are:

// 12:2 = 6
// 8:2 =  4
// 12:3 = 4
// 12:4 = 3
// 8:3 =  2.6
// 8:4 =  2

The second number stands for the bytes used to encode match offsets, i.e., 2, 3, or 4.
The compression is based on classical LZSS (Lempel–Ziv–Storer–Szymanski); I used blueprints by a Japanese coder (many thanks, Nobuo Ito) and simplified both the encoder and the decoder.
The attached archive (in the Download section) contains the C source and its Assembly counterpart generated by the latest Intel C optimizer.
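The ratio list above is just match length divided by the size of the offset token; a tiny sketch (helper names are mine, not from the source) reproduces the table:

```c
#include <stdio.h>

/* How many plain bytes each compressed byte replaces, for one
   (match length, offset-token size) pair used by 'Loonette'. */
static double Ratio(int matchLen, int tokenBytes) {
    return (double)matchLen / tokenBytes;
}

/* Print the full table, e.g. the "12:2 = 6" line from the article. */
static void PrintRatioTable(void) {
    static const int lengths[] = {12, 8};
    static const int tokens[]  = {2, 3, 4};
    for (int l = 0; l < 2; l++)
        for (int t = 0; t < 3; t++)
            printf("%2d:%d = %.1f\n", lengths[l], tokens[t],
                   Ratio(lengths[l], tokens[t]));
}
```

Sorting these six pairs by ratio gives exactly the order listed above.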

A Glimpse at Decompression Approach

My obsession with superfast textual processing resulted in the appearance of the third-fastest textual decompressor known to me (LzTurbo and LZ4 being the first and second). It is called Nakamichi 'Tengu' and decompresses 1GB of English texts on an i5 4690K @3.5GHz (no turbo) as follows:

>Nakamichi_Tengu_XMM_PREFETCH_4096.exe DDETT-Definitive_Decompression_English_Texts_Torture.tar.Nakamichi

Nakamichi 'Tengu', written by Kaze, based on Nobuo Ito's LZSS source, 
    babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 359115942 bytes ...
RAM-to-RAM performance: 1728 MB/s.
Memory pool starting address: 0000000016394080 ... 64 byte aligned, OK
Copying a 512MB block 1024 times i.e. 512GB READ + 512GB WRITTEN ...
memcpy(): (512MB block); 524288MB copied in 50442 clocks or 10.394MB per clock
RAM-to-RAM performance vs memcpy() ratio (bigger-the-better): 16%

>lz4 -9 -Sx -b -T1 DDETT-Definitive_Decompression_English_Texts_Torture.tar

Nb of threads = 1 ; Compression Level = 9
Not enough memory for 'DDETT-Definitive_Decompression_English_Texts_Torture.tar' full size;
testing 864 MB only...
Loading DDETT-Definitive_Decompression_English_Texts_Torture.tar...
DDETT-Definitiv : 905969664 -> 315340335 ( 34.81%), 22.6 MB/s , 2116.8 MB/s

>lz4 -9 -Sx -b DDETT-Definitive_Decompression_English_Texts_Torture.tar

Nb of threads = 4 ; Compression Level = 9
Not enough memory for 'DDETT-Definitive_Decompression_English_Texts_Torture.tar' full size;
testing 896 MB only...
Loading DDETT-Definitive_Decompression_English_Texts_Torture.tar...
DDETT-Definitiv : 939524096 -> 329287352 ( 35.05%), 85.9 MB/s , 8265.6 MB/s

Yann, the author of LZ4, wrote an almost perfect decompressor, AFAIU utilizing 1/5 of the absolute maximum both single-threaded (2116/10394) and multi-threaded (8265/4????), AMAZING!

However, my focus here is Nakamichi 'Loonette', featuring 8KB/2MB/512MB, i.e., (16-3)-bit/(24-3)-bit/(32-3)-bit, sliding windows with 2/3/4-byte-long offsets. Only match lengths of 12 (up to 16) and 8 are used.

So, this etude is all about writing a heavily optimized LZSS decompress function capable of extracting compressed English texts at INSANE speeds, disregarding today's RAM and CPU cache latencies (and bandwidth, for that matter). Putting aside the boring (both speedwise and sizewise) memory limitations, behold the simplicity of Loonette's decompression:

unsigned int Decompress (char* ret, char* src, unsigned int srcSize) {
    char* retLOCAL = ret;
    char* srcLOCAL = src;
    char* srcEndLOCAL = src+srcSize;
    unsigned int DWORDtrio;
    while (srcLOCAL < srcEndLOCAL) {
        DWORDtrio = *(unsigned int*)srcLOCAL;
#ifndef _N_GP
#ifdef _N_prefetch_4096
        _mm_prefetch((char*)(srcLOCAL + 64*64), _MM_HINT_T0);
#endif
#endif
// |1stLSB    |2ndLSB  |3rdLSB   |
// -------------------------------
// |OO|L|xxxxx|xxxxxxxx|xxxxxx|xx|
// -------------------------------
// [1bit          16bit]    24bit]
// L = 0b means Long MatchLength, (12-L) or 12
// L = 1b means Short MatchLength, (12-L) or 8
// OO = 00b means Literal                                                                        
// OO = 01b MatchOffset, 0xFFFFFFFF>>(3-OO), 
// 2 bytes long i.e. Sliding Window is 2*8-L-OO=(1+OO)*8-3=13 or   8KB    
// OO = 10b MatchOffset, 0xFFFFFFFF>>(3-OO), 
// 3 bytes long i.e. Sliding Window is 3*8-L-OO=(1+OO)*8-3=21 or   2MB    
// OO = 11b MatchOffset, 0xFFFFFFFF>>(3-OO), 
// 4 bytes long i.e. Sliding Window is 4*8-L-OO=(1+OO)*8-3=29 or 512MB     
        DWORDtrio = DWORDtrio&( 0xFFFFFFFF >> ((3-(DWORDtrio & 0x03))<<3) );
        if ( (DWORDtrio & 0x03) == 0x00 ) {                                                       
                #ifdef _N_GP
        memcpy(retLOCAL, (const char *)( (uint64_t)(srcLOCAL+1) ), 16);
                #endif
                #ifdef _N_XMM
        SlowCopy128bit( (const char *)( (uint64_t)(srcLOCAL+1) ), retLOCAL );
                #endif
        retLOCAL+= (DWORDtrio>>3);
        srcLOCAL+= (DWORDtrio>>3)+1;
        } else {
                #ifdef _N_GP
            memcpy(retLOCAL, (const char *)( (uint64_t)(retLOCAL-(DWORDtrio>>3)) ), 16);
                #endif
                #ifdef _N_XMM
            SlowCopy128bit( (const char *)( (uint64_t)(retLOCAL-(DWORDtrio>>3)) ), retLOCAL );
                #endif
        srcLOCAL+= 1+(DWORDtrio&0x03); // 4|3|2
        retLOCAL+= 12-(DWORDtrio&0x04);
        }
    }        
    return (unsigned int)(retLOCAL - ret);
}
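The tag layout described in the comments above can be illustrated with a standalone decoder; the struct and helper name below are hypothetical, added only to show the bit arithmetic the while-loop performs:

```c
#include <stdint.h>

/* Hypothetical helper (not in the source) unpacking the 'Loonette' tag:
   bits [1:0] = OO (0 = literal, 1/2/3 = 2/3/4-byte offset token),
   bit  [2]   = L  (0 = 12-byte match, 1 = 8-byte match). */
typedef struct {
    int      is_literal;   /* OO == 00b                          */
    int      token_bytes;  /* 1 + OO: 1 for a literal header     */
    int      match_len;    /* 12 - (tag & 4), i.e. 12 or 8       */
    uint32_t offset;       /* match offset, or literal run size  */
    uint32_t window;       /* sliding window: 8KB / 2MB / 512MB  */
} Tag;

static Tag DecodeTag(uint32_t dwordtrio) {
    Tag t;
    uint32_t oo = dwordtrio & 0x03;
    /* Keep only the (1+OO) low bytes, exactly as the decompressor does: */
    dwordtrio &= 0xFFFFFFFFu >> ((3 - oo) << 3);
    t.is_literal  = (oo == 0x00);
    t.token_bytes = 1 + oo;
    t.match_len   = 12 - (dwordtrio & 0x04);          /* same as the loop */
    t.offset      = dwordtrio >> 3;
    t.window      = oo ? (1u << ((1 + oo) * 8 - 3)) : 0; /* 2^13/2^21/2^29 */
    return t;
}
```

For example, the 2-byte token 0x0009 (OO = 01b, L = 0) decodes as a 12-byte match at offset 1 within an 8KB window.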

/*
; 'Loonette' decompression loop, 76-14+2=100 bytes long:
; mark_description "Intel(R) C++ Intel(R) 64 Compiler XE for applications 
; running on Intel(R) 64, Version 15.0.0.108 Build 20140";
; mark_description "-O3 -QxSSE2 -D_N_XMM -D_N_prefetch_4096 -FAcs";

.B7.3::                         
  00014 41 0f 18 8a 00 
        10 00 00         prefetcht0 BYTE PTR [4096+r10]         
  0001c b8 ff ff ff ff   mov eax, -1                            
  00021 41 8b 12         mov edx, DWORD PTR [r10]               
  00024 89 d1            mov ecx, edx                           
  00026 83 f1 03         xor ecx, 3                             
  00029 c1 e1 03         shl ecx, 3                             
  0002c d3 e8            shr eax, cl                            
  0002e 23 d0            and edx, eax                           
  00030 89 d0            mov eax, edx                           
  00032 83 e0 03         and eax, 3                             
  00035 75 18            jne .B7.5 
.B7.4::                         
  00037 f3 41 0f 6f 42 
        01               movdqu xmm0, XMMWORD PTR [1+r10]       
  0003d c1 ea 03         shr edx, 3                             
  00040 f3 41 0f 7f 01   movdqu XMMWORD PTR [r9], xmm0          
  00045 4c 03 ca         add r9, rdx                            
  00048 ff c2            inc edx                                
  0004a 4c 03 d2         add r10, rdx                           
  0004d eb 24            jmp .B7.6 
.B7.5::                         
  0004f 89 d1            mov ecx, edx                           
  00051 83 e2 04         and edx, 4                             
  00054 c1 e9 03         shr ecx, 3                             
  00057 f7 da            neg edx                                
  00059 ff c0            inc eax                                
  0005b 48 f7 d9         neg rcx                                
  0005e 83 c2 0c         add edx, 12                            
  00061 49 03 c9         add rcx, r9                            
  00064 4c 03 d0         add r10, rax                           
  00067 f3 0f 6f 01      movdqu xmm0, XMMWORD PTR [rcx]         
  0006b f3 41 0f 7f 01   movdqu XMMWORD PTR [r9], xmm0          
  00070 4c 03 ca         add r9, rdx                            
.B7.6::                         
  00073 4d 3b d0         cmp r10, r8                            
  00076 72 9c            jb .B7.3 
*/

A hundred bytes of beauty!

And a quick showdown versus LZ4 (on my laptop's Core 2 Q9550s @2833MHz) to see how jammed the existing CPU-RAM subsystem is:

D:\_KAZE\EBOOKs_for_benching\Nakamichi_Loonette_Intel15
>Nakamichi_Loonette_XMM_PREFETCH_4096.exe
Nakamichi 'Loonette', written by Kaze, based on Nobuo Ito's LZSS source, 
babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Usage: Nakamichi filename

D:\_KAZE\EBOOKs_for_benching\Nakamichi_Loonette_Intel15>Nakamichi_Loonette_XMM_PREFETCH_4096.exe 
Agatha_Christie_89-ebooks_TXT.tar
Nakamichi 'Loonette', written by Kaze, based on Nobuo Ito's LZSS source, 
babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Compressing 33258496 bytes ...
\; Each rotation means 64KB are encoded; Done 100%
NumberOfFullLiterals (lower-the-better): 4220
NumberOf(Short)Matches[Short]Window  (8): 412312
NumberOf(Short)Matches[Medium]Window (8): 848071
NumberOf(Short)Matches[Big]Window    (8): 353660
NumberOf(Long)Matches[Short]Window   (12): 191049
NumberOf(Long)Matches[Medium]Window  (12): 794178
NumberOf(Long)Matches[Big]Window     (12): 595339
RAM-to-RAM performance: 1839 bytes/s.

D:\_KAZE\EBOOKs_for_benching\Nakamichi_Loonette_Intel15>dir

02/20/2015  06:09 AM        33,258,496 Agatha_Christie_89-ebooks_TXT.tar
02/13/2015  03:46 AM        13,365,137 Agatha_Christie_89-ebooks_TXT.tar.lz4
02/20/2015  05:37 PM        11,745,883 Agatha_Christie_89-ebooks_TXT.tar.Nakamichi
02/20/2015  03:06 AM            86,636 Nakamichi_Loonette.c
02/20/2015  03:07 AM           445,206 Nakamichi_Loonette_XMM_PREFETCH_4096.cod
02/20/2015  12:21 PM           117,248 Nakamichi_Loonette_XMM_PREFETCH_4096.exe

D:\_KAZE\EBOOKs_for_benching\Nakamichi_Loonette_Intel15>
Nakamichi_Loonette_XMM_PREFETCH_4096.exe Agatha_Christie_89-ebooks_TXT.tar.Nakamichi /test
Nakamichi 'Loonette', written by Kaze, based on Nobuo Ito's LZSS source, 
babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 11745883 bytes ...
RAM-to-RAM performance: 256 MB/s.
Memory pool starting address: 0000000001040080 ... 64 byte aligned, OK
Copying a 512MB block 1024 times i.e. 512GB READ + 512GB WRITTEN ...
memcpy(): (512MB block); 524288MB copied in 192239 clocks or 2.727MB per clock
RAM-to-RAM performance vs memcpy() ratio (bigger-the-better): 9%

D:\_KAZE\EBOOKs_for_benching\Nakamichi_Loonette_Intel15>lz4 -9 -Sx -b -T1  
Agatha_Christie_89-ebooks_TXT.tar
Nb of threads = 1 ; Compression Level = 9
Agatha_Christie :  33258496 ->  13365090 ( 40.19%),   12.7 MB/s ,  909.2 MB/s

Nakamichi 'Loonette' has its own homepage: http://www.sanmayce.com/Hayabusa/.

Points of Interest

I hope someone is as passionate as I am to improve on it; that's the name of the game - 'Reaching for the stars.'

The French artist Rachelle Bartel, a.k.a. Lillycat created this soulful piece of art.

Add-on from 2015-Feb-23:

I am happy to share an even more refined 'Loonette', called 'Loonette-Hoshimi'.

Since Agatha Christie is the queen of English prose, quite naturally her complete works are the best test dataset for English text compression.

The Guinness Book of World Records lists Christie as the best-selling novelist of all time. Her novels have sold roughly 2 billion copies ... /Wikipedia/

D:\Nakamichi_Loonette_Intel15_Hoshimi>Nakamichi_Loonette_Hoshimi_XMM_PREFETCH_4096.exe Agatha_Christie_89-ebooks_TXT.tar
Nakamichi 'Loonette-Hoshimi', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Compressing 33258496 bytes ...
\; Each rotation means 64KB are encoded; Done 100%
NumberOfFullLiterals (lower-the-better): 171
Legend: MatchLengths: 4=Tiny, 8=Short, 12=Medium, 16=Long; WindowSizes: 2/3/4=Short/Medium/Long
NumberOf(Tiny)Matches[Short]Window (4)[2]: 325551
NumberOf(Short)Matches[Short]Window (8)[2]: 240654
NumberOf(Medium)Matches[Short]Window (12)[2]: 80607
NumberOf(Long)Matches[Short]Window (16)[2]: 52966
NumberOf(Tiny)Matches[Medium]Window (4)[3]: 382909
NumberOf(Short)Matches[Medium]Window (8)[3]: 677654
NumberOf(Medium)Matches[Medium]Window (12)[3]: 440648
NumberOf(Long)Matches[Medium]Window (16)[3]: 239806
NumberOf(Short)Matches[Long]Window (8)[4]: 272624
NumberOf(Medium)Matches[Long]Window (12)[4]: 461732
NumberOf(Long)Matches[Long]Window (16)[4]: 266965
RAM-to-RAM performance: 1249 bytes/s.

D:\Nakamichi_Loonette_Intel15_Hoshimi>Nakamichi_Loonette_Hoshimi_XMM_PREFETCH_4096.exe Agatha_Christie_89-ebooks_TXT.tar.Nakamichi /test
Nakamichi 'Loonette-Hoshimi', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 10854012 bytes ...
RAM-to-RAM performance: 320 MB/s.
Memory pool starting address: 0000000000E60080 ... 64 byte aligned, OK
Copying a 512MB block 1024 times i.e. 512GB READ + 512GB WRITTEN ...
memcpy(): (512MB block); 524288MB copied in 194229 clocks or 2.699MB per clock
RAM-to-RAM performance vs memcpy() ratio (bigger-the-better): 11%

D:\Nakamichi_Loonette_Intel15_Hoshimi>7za.exe a -tzip -mx9 Agatha_Christie_89-ebooks_TXT.tar.zip Agatha_Christie_89-ebooks_TXT.tar
D:\Nakamichi_Loonette_Intel15_Hoshimi>tangelo_32bit.exe c Agatha_Christie_89-ebooks_TXT.tar Agatha_Christie_89-ebooks_TXT.tar.tangelo
D:\Nakamichi_Loonette_Intel15_Hoshimi>zpaq64.exe add Agatha_Christie_89-ebooks_TXT.tar.method0 Agatha_Christie_89-ebooks_TXT.tar -method 0
D:\Nakamichi_Loonette_Intel15_Hoshimi>zpaq64.exe add Agatha_Christie_89-ebooks_TXT.tar.method1 Agatha_Christie_89-ebooks_TXT.tar -method 1
D:\Nakamichi_Loonette_Intel15_Hoshimi>zpaq64.exe add Agatha_Christie_89-ebooks_TXT.tar.method2 Agatha_Christie_89-ebooks_TXT.tar -method 2
D:\Nakamichi_Loonette_Intel15_Hoshimi>zpaq64.exe add Agatha_Christie_89-ebooks_TXT.tar.method3 Agatha_Christie_89-ebooks_TXT.tar -method 3
D:\Nakamichi_Loonette_Intel15_Hoshimi>zpaq64.exe add Agatha_Christie_89-ebooks_TXT.tar.method4 Agatha_Christie_89-ebooks_TXT.tar -method 4
D:\Nakamichi_Loonette_Intel15_Hoshimi>zpaq64.exe add Agatha_Christie_89-ebooks_TXT.tar.method5 Agatha_Christie_89-ebooks_TXT.tar -method 5

D:\Nakamichi_Loonette_Intel15_Hoshimi>dir

02/23/2015  03:33 PM        33,258,496 Agatha_Christie_89-ebooks_TXT.tar
02/23/2015  03:52 PM        33,277,230 Agatha_Christie_89-ebooks_TXT.tar.method0.zpaq
02/23/2015  03:52 PM        11,709,902 Agatha_Christie_89-ebooks_TXT.tar.method1.zpaq
02/23/2015  03:52 PM         9,557,649 Agatha_Christie_89-ebooks_TXT.tar.method2.zpaq
02/23/2015  03:53 PM         6,462,103 Agatha_Christie_89-ebooks_TXT.tar.method3.zpaq
02/23/2015  03:54 PM         6,387,670 Agatha_Christie_89-ebooks_TXT.tar.method4.zpaq
02/23/2015  03:56 PM         6,036,509 Agatha_Christie_89-ebooks_TXT.tar.method5.zpaq
02/23/2015  03:03 PM        10,854,012 Agatha_Christie_89-ebooks_TXT.tar.Nakamichi
02/23/2015  03:10 PM         6,475,761 Agatha_Christie_89-ebooks_TXT.tar.tangelo
02/23/2015  03:12 PM        11,173,343 Agatha_Christie_89-ebooks_TXT.tar.zip
02/23/2015  05:51 AM            99,469 Nakamichi_Loonette_Hoshimi.c
02/23/2015  05:51 AM           771,399 Nakamichi_Loonette_Hoshimi_XMM_PREFETCH_4096.cod
02/23/2015  05:51 AM           118,784 Nakamichi_Loonette_Hoshimi_XMM_PREFETCH_4096.exe

D:\Nakamichi_Loonette_Intel15_Hoshimi>

Along with strengthening the compression ratio, I was lucky enough to boost decompression speed and to reduce the code by 2 bytes:

; Nakamichi 'Loonette-Hoshimi' decompression loop, 74-14+2=98 bytes long:
; mark_description "Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.0.108 Build 20140";
; mark_description "-O3 -QxSSE2 -D_N_XMM -D_N_prefetch_4096 -FAcs";

.B7.3::                         
  00014 41 0f 18 8a 00 
        10 00 00         prefetcht0 BYTE PTR [4096+r10]         
  0001c b8 ff ff ff ff   mov eax, -1                            
  00021 41 8b 12         mov edx, DWORD PTR [r10]               
  00024 89 d1            mov ecx, edx                           
  00026 83 f1 03         xor ecx, 3                             
  00029 c1 e1 03         shl ecx, 3                             
  0002c d3 e8            shr eax, cl                            
  0002e 23 d0            and edx, eax                           
  00030 89 d0            mov eax, edx                           
  00032 83 e0 03         and eax, 3                             
  00035 75 18            jne .B7.5 
.B7.4::                         
  00037 f3 41 0f 6f 42 
        01               movdqu xmm0, XMMWORD PTR [1+r10]       
  0003d c1 ea 04         shr edx, 4                             
  00040 f3 41 0f 7f 01   movdqu XMMWORD PTR [r9], xmm0          
  00045 4c 03 ca         add r9, rdx                            
  00048 ff c2            inc edx                                
  0004a 4c 03 d2         add r10, rdx                           
  0004d eb 22            jmp .B7.6 
.B7.5::                         
  0004f 89 d1            mov ecx, edx                           
  00051 83 e2 0c         and edx, 12                            
  00054 c1 e9 04         shr ecx, 4                             
  00057 ff c0            inc eax                                
  00059 83 c2 04         add edx, 4                             
  0005c 48 f7 d9         neg rcx                                
  0005f 49 03 c9         add rcx, r9                            
  00062 4c 03 d0         add r10, rax                           
  00065 f3 0f 6f 01      movdqu xmm0, XMMWORD PTR [rcx]         
  00069 f3 41 0f 7f 01   movdqu XMMWORD PTR [r9], xmm0          
  0006e 4c 03 ca         add r9, rdx                            
.B7.6::                         
  00071 4d 3b d0         cmp r10, r8                            
  00074 72 9e            jb .B7.3 

XMM transfers are slow on Core 2; they are unaligned and the penalties are high, especially when RAM is accessed. So on Intel 3rd/4th/5th generation CPUs, Nakamichi 'Loonette-Hoshimi' will easily surpass 1GB/s. Below is my attempt to put Nakamichi 'Loonette-Hoshimi' on the map, or rather to juxtapose it with one of the beasts of compression.

Notes about the methods of the superb ZPAQ (written by Dr. Matt Mahoney and released to the public domain):

Method 0 is not just like tar, where the files are concatenated together with some headers in between. It is still the zpaq archive format, where the files are fragmented, deduplicated, and packed into blocks. The blocks are stored in a non-context-modeled format, which means they are split into smaller blocks (64K) with 4-byte headers to indicate their size. There are separate blocks storing fragment SHA-1 hashes and file names. So there is more overhead than with tar, but sometimes some savings if you have duplicate files or fragments.

Method 1 uses LZ77, compressing by replacing duplicate strings with pointers to previous occurrences.

Method 2 is the same but spends more time looking for better matches (using a suffix array instead of a hash table).

Method 3 uses either BWT (context sorting) or LZ77 for long matches and an order 1 context model and arithmetic coding for literals depending on the file type.

Method 4 either uses LZ77, BWT or a high order context model.

Method 5 uses a complex, high order context mixing model with over 20 bit prediction components.

Also, Dr. Mahoney wrote one very cute tool called 'fv', which draws a very informative image of the structure of a given file. It lets one see the hotspots of string repetitions. Legend (how to interpret the image):

The following diagrams show the distribution of string matches of length 1 (black), 2 (red), 4 (green), and 8 (blue). The horizontal axis represents the position in the file. The vertical axis shows the distance backwards to the previous match on a logarithmic scale. The major tick marks reading upwards are 1, 10, 100, 1000, etc.

I will try to interpret three different files: the original, the Hoshimi archive, and the strongest ZPAQ archive.

First 'fv-Agatha_Christie_89-ebooks_TXT.tar.bmp':

The blue band at the top shows that matches of length 8 are most often separated by at least 10^4 bytes (4 major tick marks), up to the entire length of the file; more precisely, 3x10^4 = 30,000.
The red band shows that matches of length 2 are separated by about 4x10 to 4x100 bytes.

Second 'fv-Agatha_Christie_89-ebooks_TXT.tar.Nakamichi.bmp':

A good compressor should erase all the blues and greens from the picture; here, 'Hoshimi' failed to compress some vague remnants of green (4-byte-long matches) in the 1x10^6 to 1x10^7 area, that is, there are some uncompressed 4-byte chunks 1MB..10MB apart from each other.
The red band shows that matches of length 2 are separated by about 2x10^3 to 5x10^5 bytes.
Vertical stripes with no color are 'problems': in those parts of the file, the compressor was blinded by something and couldn't see any correlations.

Third 'fv-Agatha_Christie_89-ebooks_TXT.tar.method5.zpaq.bmp':

The red band shows that matches of length 2 are separated by about 3x10^3 to 3x10^5 bytes.

Hoshimi (星見) literally translates as 'looking at stars', or more poetically, 'stargazing'.

Hanami (花見, lit. "flower viewing") is the Japanese traditional custom of enjoying the transient beauty of flowers, ...

Also, in Buddhism, moongazing (月見 - tsukimi) is of great importance, and there is a sungazing (? - ?) branch of Yoga called Surya Yoga.
If a muse comes to me, the next variant will be called 'Tsukimi'.

Add-on from 2015-Mar-23:

I didn't like the 'if-else' branching, so 'Loonette-Hoshimi' became branchless; sadly, the code size jumped from 98 to 125 bytes. As for decompression speed, the Q9550s disappoints (see the 'enwik8' test further below), but I don't care about the penalties; the etude is just fine.

unsigned int Decompress (char* ret, char* src, unsigned int srcSize) {
	char* retLOCAL = ret;
	char* srcLOCAL = src;
	char* srcEndLOCAL = src+srcSize;
	unsigned int DWORDtrio;
	unsigned int Flag;
	uint64_t FlagMASK; //=       0xFFFFFFFFFFFFFFFF;
	uint64_t FlagMASKnegated; //=0x0000000000000000;

	while (srcLOCAL < srcEndLOCAL) {
		DWORDtrio = *(unsigned int*)srcLOCAL;
//#ifndef _N_GP
//#ifdef _N_prefetch_4096
//		_mm_prefetch((char*)(srcLOCAL + 64*64), _MM_HINT_T0);
//#endif
//#endif
// |1stLSB    |2ndLSB  |3rdLSB   |
// -------------------------------
// |OO|LL|xxxx|xxxxxxxx|xxxxxx|xx|
// -------------------------------
// [1bit          16bit]    24bit]
// L = 00b means 04 MatchLength, (1+LL)<<2)
// L = 01b means 08 MatchLength, (1+LL)<<2)
// L = 10b means 12 MatchLength, (1+LL)<<2)
// L = 11b means 16 MatchLength, (1+LL)<<2)
// OO = 00b means Literal                                                                        
// OO = 01b MatchOffset, 0xFFFFFFFF>>(3-OO), 2 bytes long i.e. Sliding Window is 2*8-LL-OO=(1+OO)*8-4=12 or   4KB    
// OO = 10b MatchOffset, 0xFFFFFFFF>>(3-OO), 3 bytes long i.e. Sliding Window is 3*8-LL-OO=(1+OO)*8-4=20 or   1MB    
// OO = 11b MatchOffset, 0xFFFFFFFF>>(3-OO), 4 bytes long i.e. Sliding Window is 4*8-LL-OO=(1+OO)*8-4=28 or 256MB     

/*
// Branchfull [
		DWORDtrio = DWORDtrio&( 0xFFFFFFFF >> ((3-(DWORDtrio & 0x03))<<3) );
		if ( (DWORDtrio & 0x03) == 0x00 ) {                                                       
				#ifdef _N_GP
		memcpy(retLOCAL, (const char *)( (uint64_t)(srcLOCAL+1) ), 16);
				#endif
				#ifdef _N_XMM
		SlowCopy128bit( (const char *)( (uint64_t)(srcLOCAL+1) ), retLOCAL );
				#endif
		retLOCAL+= (DWORDtrio>>4);
		srcLOCAL+= (DWORDtrio>>4)+1;
		} else {
				#ifdef _N_GP
			memcpy(retLOCAL, (const char *)( (uint64_t)(retLOCAL-(DWORDtrio>>4)) ), 16);
				#endif
				#ifdef _N_XMM
			SlowCopy128bit( (const char *)( (uint64_t)(retLOCAL-(DWORDtrio>>4)) ), retLOCAL );
				#endif
		srcLOCAL+= 1+(DWORDtrio&0x03); // 4|3|2
		retLOCAL+= (1+((DWORDtrio>>2)&0x03))<<2; // 4/8/12/16
		}
// Branchfull ]
*/
// Branchless [
		DWORDtrio = DWORDtrio&( 0xFFFFFFFF >> ((3-(DWORDtrio & 0x03))<<3) );
		Flag=!(DWORDtrio & 0x03);
		// In here Flag=0|1
		FlagMASKnegated= Flag - 1; // -1|0
		FlagMASK= ~FlagMASKnegated;
				#ifdef _N_XMM
//		SlowCopy128bit( (const char *)( ((uint64_t)(srcLOCAL+1)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4))&FlagMASKnegated) ), retLOCAL);
// Another (incompatible with Branchfull variant, though) way to avoid 'LEA' is to put the '+1' outside the FlagMASK but then the encoder has to count literals from zero in order to compensate '-((DWORDtrio>>4)-1) = -(DWORDtrio>>4)+1' within FlagMASKnegated:
		SlowCopy128bit( (const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ), retLOCAL);
				#endif
				#ifdef _N_GP
//		memcpy(retLOCAL, (const char *)( ((uint64_t)(srcLOCAL+1)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4))&FlagMASKnegated) ), 16);
// Another (incompatible with Branchfull variant, though) way to avoid 'LEA' is to put the '+1' outside the FlagMASK but then the encoder has to count literals from zero in order to compensate '-((DWORDtrio>>4)-1) = -(DWORDtrio>>4)+1' within FlagMASKnegated:
		memcpy(retLOCAL, (const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ), 16);
				#endif
		retLOCAL+= ((uint64_t)((DWORDtrio>>4))&FlagMASK) +   ((uint64_t)((1+((DWORDtrio>>2)&0x03))<<2)&FlagMASKnegated) ; 
		srcLOCAL+= ((uint64_t)((DWORDtrio>>4)+1)&FlagMASK) + ((uint64_t)(1+(DWORDtrio&0x03))&FlagMASKnegated) ;
// Branchless ]
	}        
	return (unsigned int)(retLOCAL - ret);
}
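The core of the branchless variant above is the Flag/FlagMASK pair: a 0-or-1 flag is expanded into an all-ones or all-zeros 64-bit mask, which then picks one of two operands without a jump. In isolation (hypothetical helper name, mine):

```c
#include <stdint.h>

/* flag must be 0 or 1. flag=1 selects if_true, flag=0 selects if_false,
   mirroring FlagMASK/FlagMASKnegated in the decompression loop. */
static uint64_t SelectBranchless(uint64_t if_true, uint64_t if_false,
                                 unsigned flag) {
    uint64_t mask_negated = (uint64_t)flag - 1; /* 1 -> 0x00..0, 0 -> 0xFF..F */
    uint64_t mask         = ~mask_negated;      /* 1 -> 0xFF..F, 0 -> 0x00..0 */
    return (if_true & mask) + (if_false & mask_negated);
}
```

The loop applies this twice per iteration: once to choose the copy source (literal bytes after the tag vs. a back-reference into the output) and once to choose the pointer advances, which is why both retLOCAL and srcLOCAL updates are written as masked sums.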

/*
; 'Nakamichi_Loonette-Hoshimi_branchless' decompression loop, bb-40+2=125 bytes long:
; mark_description "Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.0.108 Build 20140";
; mark_description "-O3 -QxSSE2 -D_N_XMM -D_N_prefetch_4096 -FAcs";

.B7.3::                         
  00040 44 8b 32         mov r14d, DWORD PTR [rdx]              
  00043 44 89 f1         mov ecx, r14d                          
  00046 83 f1 03         xor ecx, 3                             
  00049 41 bf ff ff ff 
        ff               mov r15d, -1                           
  0004f c1 e1 03         shl ecx, 3                             
  00052 33 ff            xor edi, edi                           
  00054 41 d3 ef         shr r15d, cl                           
  00057 45 23 f7         and r14d, r15d                         
  0005a 4c 89 c9         mov rcx, r9                            
  0005d 45 89 f5         mov r13d, r14d                         
  00060 44 89 f5         mov ebp, r14d                          
  00063 41 83 e5 03      and r13d, 3                            
  00067 49 89 d7         mov r15, rdx                           
  0006a 0f 44 f8         cmove edi, eax                         
  0006d 41 83 e6 0c      and r14d, 12                           
  00071 c1 ed 04         shr ebp, 4                             
  00074 ff cf            dec edi                                
  00076 41 ff c5         inc r13d                               
  00079 41 89 ec         mov r12d, ebp                          
  0007c 49 89 fb         mov r11, rdi                           
  0007f 49 2b cc         sub rcx, r12                           
  00082 49 f7 d3         not r11                                
  00085 48 ff c9         dec rcx                                
  00088 4d 23 fb         and r15, r11                           
  0008b 48 23 cf         and rcx, rdi                           
  0008e ff c5            inc ebp                                
  00090 41 83 c6 04      add r14d, 4                            
  00094 4d 23 e3         and r12, r11                           
  00097 49 23 eb         and rbp, r11                           
  0009a 4c 23 ef         and r13, rdi                           
  0009d f3 42 0f 6f 44 
        39 01            movdqu xmm0, XMMWORD PTR [1+rcx+r15]   
  000a4 4c 23 f7         and r14, rdi                           
  000a7 49 03 ed         add rbp, r13                           
  000aa 4d 03 e6         add r12, r14                           
  000ad 48 03 d5         add rdx, rbp                           
  000b0 f3 41 0f 7f 01   movdqu XMMWORD PTR [r9], xmm0          
  000b5 4d 03 cc         add r9, r12                            
  000b8 49 3b d0         cmp rdx, r8                            
  000bb 72 83            jb .B7.3 
*/

The 'enwik8' test, done on a Core 2 Q9550s:

D:\_KAZE\Nakamichi_Loonette_Hoshimi_vs_enwik8>Nakamichi_Loonette-Hoshimi_branchless.exe enwik8.Nakamichi /bench
Nakamichi 'Loonette-Hoshimi', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 34968896 bytes ...
RAM-to-RAM performance: 192 MB/s.
Memory pool starting address: 0000000002620080 ... 64 byte aligned, OK
Copying a 512MB block 1024 times i.e. 512GB READ + 512GB WRITTEN ...
memcpy(): (512MB block); 524288MB copied in 193784 clocks or 2.706MB per clock
RAM-to-RAM performance vs memcpy() ratio (bigger-the-better): 7%

D:\_KAZE\Nakamichi_Loonette_Hoshimi_vs_enwik8>\sha1sum.exe enwik8
57b8363b814821dc9d47aa4d41f58733519076b2  enwik8

D:\_KAZE\Nakamichi_Loonette_Hoshimi_vs_enwik8>Nakamichi_Loonette_Hoshimi_XMM_PREFETCH_4096.exe enwik8.Nakamichi /bench
Nakamichi 'Loonette-Hoshimi', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 34968896 bytes ...
RAM-to-RAM performance: 256 MB/s.
Memory pool starting address: 0000000002710080 ... 64 byte aligned, OK
Copying a 512MB block 1024 times i.e. 512GB READ + 512GB WRITTEN ...
memcpy(): (512MB block); 524288MB copied in 193453 clocks or 2.710MB per clock
RAM-to-RAM performance vs memcpy() ratio (bigger-the-better): 9%

D:\_KAZE\Nakamichi_Loonette_Hoshimi_vs_enwik8>\sha1sum.exe enwik8
57b8363b814821dc9d47aa4d41f58733519076b2  enwik8

D:\_KAZE\Nakamichi_Loonette_Hoshimi_vs_enwik8>

I don't believe I can refine it further.

History

The first Nakamichi was written a year ago; as of now, 2015-Feb-20, 'Loonette' is the latest and most refined.