Make it possible to read and write xerial snappy

Fixes mumrah/kafka-python#126 TL;DR ===== This makes it possible to read and write snappy compressed streams that are compatible with the java and scala kafka clients (the xerial blocking format)) Xerial Details ============== Kafka supports transparent compression of data (both in transit and at rest) of messages, one of the allowable compression algorithms is Google's snappy, an algorithm which has excellent performance at the cost of efficiency. The specific implementation of snappy used in kafka is the xerial-snappy implementation, this is a readily available java library for snappy. As part of this implementation, there is a specialised blocking format that is somewhat none standard in the snappy world. Xerial Format ------------- The blocking mode of the xerial snappy library is fairly simple, using a magic header to identify itself and then a size + block scheme, unless otherwise noted all items in xerials blocking format are assumed to be big-endian. A block size (```xerial_blocksize``` in implementation) controls how frequent the blocking occurs 32k is the default in the xerial library, this blocking controls the size of the uncompressed chunks that will be fed to snappy to be compressed. The format winds up being | Header | Block1 len | Block1 data | Blockn len | Blockn data | | ----------- | ---------- | ------------ | ---------- | ------------ | | 16 bytes | BE int32 | snappy bytes | BE int32 | snappy bytes | It is important to not that the blocksize is the amount of uncompressed data presented to snappy at each block, whereas the blocklen is the number of bytes that will be present in the stream, that is the length will always be <= blocksize. Xerial blocking header ---------------------- Marker | Magic String | Null / Pad | Version | Compat ------ | ------------ | ---------- | -------- | -------- byte | c-string | byte | int32 | int32 ------ | ------------ | ---------- | -------- | -------- -126 | 'SNAPPY' | \0 | variable | variable The pad appears to be to ensure that SNAPPY is a valid cstring, and to align the header on a word boundary. The version is the version of this format as written by xerial, in the wild this is currently 1 as such we only support v1. Compat is there to claim the minimum supported version that can read a xerial block stream, presently in the wild this is 1. Implementation specific details =============================== The implementation presented here follows the Xerial implementation as of its v1 blocking format, no attempts are made to check for future versions. Since none-xerial aware clients might have persisted snappy compressed messages to kafka brokers we allow clients to turn on xerial compatibility for message sending, and perform header sniffing to detect xerial vs plain snappy payloads.
author: Greg Bowyer <gbowyer@fastmail.co.uk> 2014-02-19 14:42:40 -0800
committer: Greg Bowyer <gbowyer@fastmail.co.uk> 2014-02-19 14:42:40 -0800
commit: 96c9ce0c4a2ab515c581062117d5a6221f1f2036 (patch)
tree: a679fc840ae486807207158fb85a483dbd1faef7
parent: 4abf7ee1fbbdc47c8cb7b35f2600e58f1f95e6bb (diff)
download: kafka-python-96c9ce0c4a2ab515c581062117d5a6221f1f2036.tar.gz
2 files changed, 138 insertions, 3 deletions
diff --git a/kafka/codec.py b/kafka/codec.py
index eb5d03c..206ddb4 100644
--- a/kafka/codec.py
+++ b/kafka/codec.py
@@ -1,5 +1,9 @@
 from cStringIO import StringIO
 import gzip
+import struct
+
+_XERIAL_V1_HEADER = (-126, 'S', 'N', 'A', 'P', 'P', 'Y', 0, 1, 1)
+_XERIAL_V1_FORMAT = 'bccccccBii'
 
 try:
     import snappy
@@ -36,13 +40,101 @@ def gzip_decode(payload):
     return result
 
 
-def snappy_encode(payload):
+def snappy_encode(payload, xerial_compatible=False, xerial_blocksize=32 * 1024):
+    """Encodes the given data with snappy if xerial_compatible is set then the
+       stream is encoded in a fashion compatible with the xerial snappy library
+
+       The block size (xerial_blocksize) controls how frequent the blocking occurs
+       32k is the default in the xerial library.
+
+       The format winds up being
+        +-------------+------------+--------------+------------+--------------+
+        |   Header    | Block1 len | Block1 data  | Blockn len | Blockn data  |
+        |-------------+------------+--------------+------------+--------------|
+        |  16 bytes   |  BE int32  | snappy bytes |  BE int32  | snappy bytes |
+        +-------------+------------+--------------+------------+--------------+
+
+        It is important to not that the blocksize is the amount of uncompressed
+        data presented to snappy at each block, whereas the blocklen is the
+        number of bytes that will be present in the stream, that is the
+        length will always be <= blocksize.
+    """
+
     if not _has_snappy:
         raise NotImplementedError("Snappy codec is not available")
-    return snappy.compress(payload)
+
+    if xerial_compatible:
+        def _chunker():
+            for i in xrange(0, len(payload), xerial_blocksize):
+                yield payload[i:i+xerial_blocksize]
+
+        out = StringIO()
+
+        header = ''.join([struct.pack('!' + fmt, dat) for fmt, dat
+            in zip(_XERIAL_V1_FORMAT, _XERIAL_V1_HEADER)])
+
+        out.write(header)
+        for chunk in _chunker():
+            block = snappy.compress(chunk)
+            block_size = len(block)
+            out.write(struct.pack('!i', block_size))
+            out.write(block)
+
+        out.seek(0)
+        return out.read()
+
+    else:
+        return snappy.compress(payload)
+
+
+def _detect_xerial_stream(payload):
+    """Detects if the data given might have been encoded with the blocking mode
+        of the xerial snappy library.
+
+        This mode writes a magic header of the format:
+            +--------+--------------+------------+---------+--------+
+            | Marker | Magic String | Null / Pad | Version | Compat |
+            |--------+--------------+------------+---------+--------|
+            |  byte  |   c-string   |    byte    |  int32  | int32  |
+            |--------+--------------+------------+---------+--------|
+            |  -126  |   'SNAPPY'   |     \0     |         |        |
+            +--------+--------------+------------+---------+--------+
+
+        The pad appears to be to ensure that SNAPPY is a valid cstring
+        The version is the version of this format as written by xerial,
+        in the wild this is currently 1 as such we only support v1.
+
+        Compat is there to claim the miniumum supported version that
+        can read a xerial block stream, presently in the wild this is
+        1.
+    """
+
+    if len(payload) > 16:
+        header = header = struct.unpack('!' + _XERIAL_V1_FORMAT, bytes(payload)[:16])
+        return header == _XERIAL_V1_HEADER
+    return False
 
 
 def snappy_decode(payload):
     if not _has_snappy:
         raise NotImplementedError("Snappy codec is not available")
-    return snappy.decompress(payload)
+
+    if _detect_xerial_stream(payload):
+        # TODO ? Should become a fileobj ?
+        out = StringIO()
+        byt = buffer(payload[16:])
+        length = len(byt)
+        cursor = 0
+
+        while cursor < length:
+            block_size = struct.unpack_from('!i', byt[cursor:])[0]
+            # Skip the block size
+            cursor += 4
+            end = cursor + block_size
+            out.write(snappy.decompress(byt[cursor:end]))
+            cursor = end
+
+        out.seek(0)
+        return out.read()
+    else:
+        return snappy.decompress(payload)
diff --git a/test/test_unit.py b/test/test_unit.py
index e3fd4bb..b5f0118 100644
--- a/test/test_unit.py
+++ b/test/test_unit.py
@@ -70,6 +70,49 @@ class TestCodec(unittest.TestCase):
             s2 = snappy_decode(snappy_encode(s1))
             self.assertEquals(s1, s2)
 
+    @unittest.skipUnless(has_snappy(), "Snappy not available")
+    def test_snappy_detect_xerial(self):
+        import kafka as kafka1
+        _detect_xerial_stream = kafka1.codec._detect_xerial_stream
+
+        header = b'\x82SNAPPY\x00\x00\x00\x00\x01\x00\x00\x00\x01Some extra bytes'
+        false_header = b'\x01SNAPPY\x00\x00\x00\x01\x00\x00\x00\x01'
+        random_snappy = snappy_encode('SNAPPY' * 50)
+        short_data = b'\x01\x02\x03\x04'
+
+        self.assertTrue(_detect_xerial_stream(header))
+        self.assertFalse(_detect_xerial_stream(b''))
+        self.assertFalse(_detect_xerial_stream(b'\x00'))
+        self.assertFalse(_detect_xerial_stream(false_header))
+        self.assertFalse(_detect_xerial_stream(random_snappy))
+        self.assertFalse(_detect_xerial_stream(short_data))
+
+    @unittest.skipUnless(has_snappy(), "Snappy not available")
+    def test_snappy_decode_xerial(self):
+        header = b'\x82SNAPPY\x00\x00\x00\x00\x01\x00\x00\x00\x01'
+        random_snappy = snappy_encode('SNAPPY' * 50)
+        block_len = len(random_snappy)
+        random_snappy2 = snappy_encode('XERIAL' * 50)
+        block_len2 = len(random_snappy2)
+
+        to_test = header \
+            + struct.pack('!i', block_len) + random_snappy \
+            + struct.pack('!i', block_len2) + random_snappy2 \
+
+        self.assertEquals(snappy_decode(to_test), ('SNAPPY' * 50) + ('XERIAL' * 50))
+
+    @unittest.skipUnless(has_snappy(), "Snappy not available")
+    def test_snappy_encode_xerial(self):
+        to_ensure = b'\x82SNAPPY\x00\x00\x00\x00\x01\x00\x00\x00\x01' + \
+            '\x00\x00\x00\x18' + \
+            '\xac\x02\x14SNAPPY\xfe\x06\x00\xfe\x06\x00\xfe\x06\x00\xfe\x06\x00\x96\x06\x00' + \
+            '\x00\x00\x00\x18' + \
+            '\xac\x02\x14XERIAL\xfe\x06\x00\xfe\x06\x00\xfe\x06\x00\xfe\x06\x00\x96\x06\x00'
+
+        to_test = ('SNAPPY' * 50) + ('XERIAL' * 50)
+
+        compressed = snappy_encode(to_test, xerial_compatible=True, xerial_blocksize=300)
+        self.assertEquals(compressed, to_ensure)
 
 class TestProtocol(unittest.TestCase):
author	Greg Bowyer <gbowyer@fastmail.co.uk>	2014-02-19 14:42:40 -0800
committer	Greg Bowyer <gbowyer@fastmail.co.uk>	2014-02-19 14:42:40 -0800
commit	96c9ce0c4a2ab515c581062117d5a6221f1f2036 (patch)
tree	a679fc840ae486807207158fb85a483dbd1faef7
parent	4abf7ee1fbbdc47c8cb7b35f2600e58f1f95e6bb (diff)
download	kafka-python-96c9ce0c4a2ab515c581062117d5a6221f1f2036.tar.gz