Sunday, February 3, 2013

Optimal Compression for client side (cookie) storage

The challenge:


Localstorage e.g. in IE8 is limited to just 10mb, In this article about about web-storage the limit is 5mb by default:
"" According to the HTML5 spec, this limit can be increased by the user when needed; however, only a few browsers support this.""
and:
"" Local Storage is String Storage ""


In  Javascript (EcmaScript) a character is referred as a 16-bit unsigned value, while latin based languages typically only use the first 8 bits (0-254, #00 #FF) for character codes.

Further more, things do get more inefficient when working with Base64 encoded string for e.g.  DataURI's combined with local storage. While each bases64 character take the place of a possible 2 byte value. A base64 character code is only 6 bits (0-63) and is stored in a 16 bits position, For each base64 character 10 bits are wasted.

So in a limited storage environment it should not be that hard to virtually double the storage capacity when most of your data is stored as latin based strings.  Theoretically, when you need to store images in the web-storage as base64 encodes strings, the storage capacity can almost triple when fully using the double byte per char position.

The Goal:


Use the 16 bit per character code (UTF-16) optimal for storing 6 bit values in case of base64 encoding and optimize storage for typical latin based strings, while still being able to work with 2 bytes string (For my chinese friends, although they don't benefit the solution much).

The solution for Base64 encoded images:

  1. We need a look-up-table to be able to retrieve de base64 char code
  2. We need to create a sequence (string) of 16 bits characters and fill these characters with 6 bits values to the max (do not spill one single bit). One 16 bit character can contain 2 4/6 (=16) base64 encoded chars
  3.  We need to be able to translate a compressed string back to it's original base64 presentation.


var comp = {
  map64 : [
    'A', 'B', 'C', 'D', 'E' ,'F' ,'G' ,'H', 'I' ,'J', 'K', 'L', 'M' ,'N', 'O' ,'P', 
    'Q' ,'R' ,'S', 'T', 'U' ,'V', 'W', 'X', 'Y' ,'Z', 'a', 'b', 'c', 'd', 'e' ,'f',
    'g' ,'h', 'i', 'j', 'k', 'l', 'm' ,'n', 'o' ,'p', 'q' ,'r' ,'s', 't', 'u' ,'v',
    'w', 'x', 'y', 'z', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '+', '/'
  ]
}

The index of each character responds to it's value.

2: Compression


_compB64 : function(s){
    var re = /=/g;
    s = s.replace(re, "");
    var len = s.length;
    var bits = len * 6;
    var remb = (bits % 16) + 4;
    if(remb > 16){
      remb -= 16;
    }
    var unused = 16 - remb;
    var i = 0;
    var out = ''; // UTF-16 string for output
    var bits = 16; // available storage block
    var n = 0; // base64 map value
    var rem_value = 0; // remaining VALUE
    var storage = 0; // a full 16 bits value
    var rem = 0;
    var tmp = 0;
    
    bits -= 4;
    storage = unused << bits;

    for(;i<len;i++){
      n = this.map64.indexOf(s[i]); 
      if(bits >= 6){
        bits -= 6;
        storage += n << bits; 
        if(bits == 0){
          out += String.fromCharCode(storage);
          bits = 16;
          storage = 0;
        }
      }else{
        rem = 6 - bits;
        tmp = n >> rem;
        out += String.fromCharCode(storage+tmp);
        rem_value = n - (tmp << rem);

        bits =  16 - rem;
        storage = rem_value << bits; 
      }

      if(i == (len-1) && bits > 0 && bits < 16){
        out += String.fromCharCode(storage);
      }
   }
   return out;
  }

(I will post the full link later ). What happens here:
This method takes a base64 encoded string, it replaces the occasional '=' signs (They are not needed, and are not part of the actual data).

In the first 4 bits we store a value (0-15) (unnused) . While we have 6 bits values, it will only fit in 16 bits when we have 8 base64 chars. 3* 16 bits = 48 bits, 8 * 6 bits = 48 bits.  Byte alignment will take place when we have 8 base64 encoded chars. To overcome this problem the first 4 bits are used to store the unused remaining bits. The remaing bits do not contain any data and we should skip those later on when decompressed.

Further down the code the actual loop over the chars take place where 'storage' is a fully 16 bits value. This 'storage' need to be filled to the full 16 bits. First we get the base64 char, look up its value and store it (when it will fit in the remaining bits) or break it down with bit shifting into separate bits to top up the remaing bits.

When the 'storage' value is full, we create the string sequence based on a 16 bits charcode and concatenated it to 'out'. On the last loop, we output the remains so we have a complete sequence of 16 bit values.

3: Decompression:


_deCompB64 : function(s){
    var len = s.length;
    var i = len;
    var n, out = '', bits;
    var skipb = 0;
    var rem = 0;
    var rem_val = null;

    for(;i;i--){
      n = s.charCodeAt(len - i);
      bits = 16;
      if(i === len ){
        // Get the skipb
        bits -= 4;
        skipb = n >> bits;
        n -= skipb << bits;
      }
      for(;bits;){

        if(rem > 0){
          bits -= rem;
          ret_val = n >> bits;
          n -= ret_val << bits;
          out += this.map64[ret_val+rem_val];
          rem = 0;
        }
        if(i === 1 && (bits <= skipb)){
          // ready
          bits = 0;
        }else if(bits >= 6){
          bits -= 6;
          ret_val = n >> bits;
          out += this.map64[ret_val];
          n -= ret_val << bits;
        }else if(bits > 0){
            rem = 6 - bits;
            rem_val = n << rem;
            bits = 0;
        }
        
      }
    }
    return out;

Obvious, the other way around: Loop over a string containing 16 bit values and construct 6 bits values to restore the original Base64 string. On the last loop we stop retrieving values when there are unused bits, we concatenat a string based on the Base64 loop-up-table.

With this compression method and dealing with base64 encoded strings you may get compression rates of over  65%.

So why not use this method for all strings (a browser is capable in creating base64 encoded string window.btoa )? 

Well, it should works but my second solution is far more simple and slightly faster 

The compression solution for Latin based strings

  1. Compress
  2. de-Compress
The above example is rather complicated compared with the following solution:

1: Compression:

_deflate : function(s){
    var out = '', i, len, val;
    len = s.length;
    if(len < 1 || typeof s !== 'string'){
      return s;
    }
    // Ensure 1 byte chars (0 / 254)
    s = unescape(encodeURIComponent(s));
    if((len % 2) === 1){
      // Ad an extra byte for byte allignment
      // Odd bytes won't fill a 16 bits slot
      s +=String.fromCharCode(0);
    }
    i = 0;
    len = s.length;
    for(; i< len; i+=2){
      val = (s.charCodeAt(i+1) << 8) + (s.charCodeAt(i));
      out += String.fromCharCode(val);
    }
    return out;
  }


Take a string, encode it to be sure we have 8 bit character values , look if we have byte allignment (the length must be even) add an extra control char if it isn't. Loop over the string and store 2 single byte chars into one 2 byte char. Guaranteed 50% more storage capacity on latin based strings

2: De-Compress:

_inflate: function(s){
    if(s.length < 1 || typeof s !== 'string'){
      return s;
    }
    var n, out = '', high, low, i, len;
    i = 0;
    len = s.length;
    for(; i< len; i++){
      n = s.charCodeAt(i);
      high = n >> 8;
      low = n - (high << 8);
      out += String.fromCharCode(low);
      if(i == len-1 && high == 0){
        // skip byte
      }else{
        out += String.fromCharCode(high);
      }

    }
    return decodeURIComponent(escape(out));  
  }

Pass the encoded string, loop over it and create 2, 8 bit chars from every 16 bit value. In the last loop, when the 'high' value is equal to the odd/even control char , we skip the output. At the end we restore the string in it's original state bij reversing the escaping and that's it.

Conclusion

It's worth to compress strings for local-storage to gain more storage capacity, when working with base64 encoded images the advantage is even bigger. But not only that: imagine compressing JSON encoded strings en POST it to the server. 

The full script can be found here and is part of my multi-platform javascript framework

This solution was inspired by this article



No comments:

Post a Comment