
SNAPPY Compression for Parquet, ORC, Avro

SNAPPY is a high-speed compression/decompression library that allows Parquet-based tables in particular to reduce storage space while increasing performance to a certain degree.

This article describes the steps to gauge query performance and storage output when SNAPPY compression is applied to database tables in the Avro, ORC, and Parquet file formats, using both the Hive LLAP and Impala query engines in CDW on the CDP Private Cloud platform.


Prerequisites

  • The tests are carried out using CDW on the CDP Private Cloud (OpenShift) platform with the following hardware specification.
      CPU       Intel(R) Xeon(R) Gold 5220R CPU @ 2.20GHz
      Memory    DIMM DDR4 Synchronous Registered (Buffered) 2933 MHz (0.3 ns)
      Disk      SSD P4610 1.6TB SFF
  • A random sample of 300 million CSV rows is generated using a Python script with the Faker data generator. Each row follows the sequence FirstName, LastName, MSISDN, Date of Birth, Age, Postcode, City, as illustrated below.

      Maria,Harmon,32378521,1998-11-14,17,30766,Durhammouth
      Anne,Adams,29481072,1982-10-28,36,70830,Deannabury
      Deborah,Sanders,21125797,2002-04-07,56,63993,New Ronaldland
    
  • Copy the file to the HDFS storage cluster.

      # hdfs dfs -put 300mil.csv /tmp/sampledata/
        
      # hdfs dfs -du -h /tmp/sampledata/
      16.0 G  47.9 G  /tmp/sampledata/300mil.csv    
    
  • In CDW, create a Hive and an Impala Virtual Warehouse with only one executor each.

Testing Procedure

  1. Access the Hue tool of the Hive Virtual Warehouse and create the database db1.
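
     The database can also be created from the SQL Editor with a simple statement:

     CREATE DATABASE db1;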

  2. Use the SQL Editor to create an external table in the database db1 on top of the uploaded CSV data.
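
     A minimal sketch of such an external table, assuming the CSV file uploaded to /tmp/sampledata/ in the prerequisites and the column layout described there (the table name tmp matches the INSERT statements in the later steps):

     CREATE EXTERNAL TABLE db1.tmp(
     FirstName string, LastName string,
     MSISDN bigint, DOB date, age int,
     Postcode int, City string)
     ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
     STORED AS TEXTFILE
     LOCATION '/tmp/sampledata/';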

  3. Execute a query against the external table and take note of the speed result; a sketch is shown below. Run the same query a second time and jot down the result again.
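
     A minimal example of such a baseline query, assuming it mirrors the AVG query that is run against the compressed tables in the later steps:

     SELECT AVG(age) FROM db1.tmp WHERE lastname = 'Davis' AND age > 30 AND age < 40;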

  4. Create a Hive managed table using the ORC file format with SNAPPY compression as illustrated below.

     SET hive.exec.compress.output=true;
     CREATE TABLE db1.orc_snappy(
     FirstName string, LastName string,    
     MSISDN bigint, DOB date, age int,
     Postcode int, City string)
     STORED AS orc
     TBLPROPERTIES ("orc.compress"="SNAPPY",'orc.schema.literal'='{
     "name": "sample1",
     "type": "record",
     "fields": [
     {"name":"one", "type":"binary"},
     {"name":"two", "type":"binary"},
     {"name":"three", "type":"bigint"},
     {"name":"four", "type":"binary"},
     {"name":"five", "type":"int"},
     {"name":"six", "type":"int"},
     {"name":"seven", "type":"binary"}
     ]}');
    
  5. Insert the data from the external tmp table into this newly created ORC-based table. Take note of the time taken to complete this task.

     INSERT INTO TABLE db1.orc_snappy SELECT * FROM tmp;
    
  6. Check the result of the loaded data.

     SELECT * FROM db1.orc_snappy;
    
  7. Run the following SQL queries twice and take note of the speed to complete the task.

     SELECT COUNT (*) FROM db1.orc_snappy;   
    
     SELECT AVG(age) FROM db1.orc_snappy where lastname = 'Davis' and age > 30 and age < 40;
    
  8. Repeat step 4 for the Parquet file format by creating a table with the following schema.

     CREATE TABLE db1.parquet_snappy(
     FirstName string, LastName string,    
     MSISDN bigint, DOB date, age int,
     Postcode int, City string)
     STORED AS parquet
     TBLPROPERTIES ("parquet.compression"="SNAPPY",'parquet.schema.literal'='{
     "name": "sample1",
     "type": "record",
     "fields": [
     {"name":"one", "type":"binary"},
     {"name":"two", "type":"binary"},
     {"name":"three", "type":"INT64"},
     {"name":"four", "type":"date"},
     {"name":"five", "type":"INT32"},
     {"name":"six", "type":"INT32"},
     {"name":"seven", "type":"binary"}
     ]}')
    
  9. Insert the data from the external tmp table into this newly created Parquet-based table. Take note of the time taken to complete this task.

     INSERT INTO TABLE db1.parquet_snappy SELECT * FROM tmp;
    
  10. Run the following SQL queries twice and take note of the speed result.

    SELECT COUNT (*) FROM db1.parquet_snappy;   
    
    SELECT AVG(age) FROM db1.parquet_snappy where lastname = 'Davis' and age > 30 and age < 40;
    
  11. Repeat step 4 for the Avro file format by creating a table with the following schema, then load the data from the external tmp table as in step 5 (see the sketch after the CREATE statement).

    set hive.exec.compress.output=true;
    set avro.output.codec=snappy;
    CREATE TABLE db1.avro_snappy(
    FirstName string, LastName string,    
    MSISDN bigint, DOB date, age int,
    Postcode int, City string)
    STORED AS avro
    TBLPROPERTIES ('avro.schema.literal'='{
    "name": "sample1",
    "type": "record",
    "fields": [
    {"name":"FirstName", "type":"string"},
    {"name":"LastName", "type":"string"},
    {"name":"MSISDN", "type":"long"},
    {"name":"DOB", "type":"string"},
    {"name":"age", "type":"int"},
    {"name":"Postcode", "type":"int"},
    {"name":"City", "type":"string"}
    ]}');
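
    A sketch of that load, mirroring step 5:

    INSERT INTO TABLE db1.avro_snappy SELECT * FROM tmp;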
    
  12. Run the following SQL queries twice and take note of the speed to complete the task.

    SELECT COUNT (*) FROM db1.avro_snappy;
    
    SELECT AVG(age) FROM db1.avro_snappy where lastname = 'Davis' and age > 30 and age < 40;
    
  13. Access the Hue tool of the Impala Virtual Warehouse and create the database db2.
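
    As with db1, the database can be created from the Impala SQL Editor:

    CREATE DATABASE db2;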

  14. Use the SQL Editor to create an external table in the database db2.
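
    A minimal sketch of this external table, assuming it mirrors the db1 external table over the same CSV data (the table name tmp2 is purely illustrative):

    CREATE EXTERNAL TABLE db2.tmp2(
    FirstName string, LastName string,
    MSISDN bigint, DOB date, age int,
    Postcode int, City string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/tmp/sampledata/';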

  15. Create a managed table using the Parquet file format based on the schema as shown below.

    CREATE TABLE db2.parquet2_snappy(
    FirstName string, LastName string,    
    MSISDN bigint, DOB date, age int,
    Postcode int, City string)
    STORED AS parquet
    TBLPROPERTIES (
    "parquet.compression"="SNAPPY",
    'parquet.schema.literal'='{
    "name": "sample1",
    "type": "record",
    "fields": [
    {"name":"one", "type":"string"},
    {"name":"two", "type":"string"},
    {"name":"three", "type":"long"},
    {"name":"four", "type":"date"},
    {"name":"five", "type":"long"},
    {"name":"six", "type":"int"},
    {"name":"seven", "type":"string"}
    ]}')
    
  16. Insert the data from the external tmp table into this newly created Parquet-based table. Take note of the time taken to complete this task.

    INSERT INTO TABLE db2.parquet2_snappy SELECT * FROM db1.tmp;
    
  17. Run the following SQL queries twice and take note of the speed result.

    SELECT COUNT (*) FROM db2.parquet2_snappy;   
    
    SELECT AVG(age) FROM db2.parquet2_snappy where lastname = 'Davis' and age > 30 and age < 40;
    

Compression Verification

  • By default, ZLIB compression is applied to any managed ORC-based table in the Hive engine in CDW unless otherwise specified. In this case, SNAPPY compression is enabled for the orc_snappy table as highlighted in the testing procedure above.
  • Verify the compression codec in the dataset file as shown below.

      # hive --orcfiledump /warehouse/tablespace/managed/hive/db1.db/orc/delta_0000001_0000001_0000/bucket_00000_0 | grep Compression
      Compression: ZLIB
      Compression size: 32768
    
      # hive --orcfiledump /warehouse/tablespace/managed/hive/db1.db/orc_snappy/delta_0000001_0000001_0000/bucket_00000_0 | grep Compression
      Compression: SNAPPY
      Compression size: 32768
    
  • By default, no compression is applied to a managed Avro-based table in the Hive engine in CDW unless otherwise specified. In this case, SNAPPY compression is enabled for the avro_snappy table as specified in the testing procedure above.
  • Verify the compression codec in the dataset file as shown below.

      # hadoop fs -copyToLocal /warehouse/tablespace/managed/hive/db1.db/avro_snappy/delta_0000001_0000001_0000/000000_0 tmp.avro
        
      # avro-tools getmeta tmp.avro | grep codec
      avro.codec	snappy
    
      # hadoop fs -copyToLocal /warehouse/tablespace/managed/hive/db1.db/avro/delta_0000001_0000001_0000/000000_0 tmp2.avro
    
      # avro-tools getmeta tmp2.avro | grep codec
      #
    
  • By default, the dataset is UNCOMPRESSED for any managed Parquet-based table in the Hive engine in CDW unless otherwise specified. In this case, SNAPPY compression is enabled for the parquet_snappy table as specified in the testing procedure above.
  • Verify the compression method in the dataset file as shown below.

      # hadoop fs -copyToLocal /warehouse/tablespace/managed/hive/db1.db/parquet/delta_0000001_0000001_0000/000000_0 tmp2.parquet
    
      # parquet-tools meta tmp2.parquet
      file:        file:/root/tmp2.parquet 
      creator:     parquet-mr version 1.10.99.2021.0.6-b96 (build 83328831d0a79be26dd977c497fd38b301335b4a) 
      extra:       writer.date.proleptic = false 
      extra:       writer.time.zone = UTC 
      extra:       writer.model.name = 3.1.3000.2021.0.6-b96 
      extra:       writer.zone.conversion.legacy = false 
    
      file schema: hive_schema 
      --------------------------------------------------------------------------------------------------------------------------------------------------------------
      firstname:   OPTIONAL BINARY L:STRING R:0 D:1
      lastname:    OPTIONAL BINARY L:STRING R:0 D:1
      msisdn:      OPTIONAL INT64 R:0 D:1
      dob:         OPTIONAL INT32 L:DATE R:0 D:1
      age:         OPTIONAL INT32 R:0 D:1
      postcode:    OPTIONAL INT32 R:0 D:1
      city:        OPTIONAL BINARY L:STRING R:0 D:1
    
      row group 1: RC:3890100 TS:135252863 OFFSET:4 
      --------------------------------------------------------------------------------------------------------------------------------------------------------------
      firstname:    BINARY UNCOMPRESSED DO:0 FPO:4 SZ:4883456/4883456/1.00 VC:3890100 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: Aaron, max: Zoe, num_nulls: 0]
      lastname:     BINARY UNCOMPRESSED DO:0 FPO:4883460 SZ:4886753/4886753/1.00 VC:3890100 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: Abbott, max: Zuniga, num_nulls: 0]
      msisdn:       INT64 UNCOMPRESSED DO:0 FPO:9770213 SZ:31126845/31126845/1.00 VC:3890100 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 26, max: 99999984, num_nulls: 0]
      dob:          INT32 UNCOMPRESSED DO:0 FPO:40897058 SZ:15566445/15566445/1.00 VC:3890100 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 1970-01-01, max: 2022-07-15, num_nulls: 0]
      age:          INT32 UNCOMPRESSED DO:0 FPO:56463503 SZ:3418278/3418278/1.00 VC:3890100 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: 0, max: 99, num_nulls: 0]
      postcode:     INT32 UNCOMPRESSED DO:0 FPO:59881781 SZ:15566445/15566445/1.00 VC:3890100 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 501, max: 99950, num_nulls: 0]
      city:         BINARY UNCOMPRESSED DO:0 FPO:75448226 SZ:59804641/59804641/1.00 VC:3890100 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY,PLAIN ST:[min: Aaronberg, max: Zunigaville, num_nulls: 0]
    
      # hadoop fs -copyToLocal /warehouse/tablespace/managed/hive/db1.db/parquet_snappy/delta_0000001_0000001_0000/000000_0 tmp.parquet
    
      # parquet-tools meta tmp.parquet
      file:        file:/root/tmp.parquet 
      creator:     parquet-mr version 1.10.99.2021.0.6-b96 (build 83328831d0a79be26dd977c497fd38b301335b4a) 
      extra:       writer.date.proleptic = false 
      extra:       writer.time.zone = UTC 
      extra:       writer.model.name = 3.1.3000.2021.0.6-b96 
      extra:       writer.zone.conversion.legacy = false 
    
      file schema: hive_schema 
      --------------------------------------------------------------------------------------------------------------------------------------------------------------
      firstname:   OPTIONAL BINARY L:STRING R:0 D:1
      lastname:    OPTIONAL BINARY L:STRING R:0 D:1
      msisdn:      OPTIONAL INT64 R:0 D:1
      dob:         OPTIONAL INT32 L:DATE R:0 D:1
      age:         OPTIONAL INT32 R:0 D:1
      postcode:    OPTIONAL INT32 R:0 D:1
      city:        OPTIONAL BINARY L:STRING R:0 D:1
    
      row group 1: RC:5390100 TS:188413650 OFFSET:4 
      --------------------------------------------------------------------------------------------------------------------------------------------------------------
      firstname:    BINARY SNAPPY DO:0 FPO:4 SZ:6763410/6763856/1.00 VC:5390100 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: Aaron, max: Zoe, num_nulls: 0]
      lastname:     BINARY SNAPPY DO:0 FPO:6763414 SZ:6765842/6767152/1.00 VC:5390100 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: Abbott, max: Zuniga, num_nulls: 0]
      msisdn:       INT64 SNAPPY DO:0 FPO:13529256 SZ:31528932/43129170/1.37 VC:5390100 ENC:BIT_PACKED,PLAIN,RLE ST:[min: 26, max: 99999997, num_nulls: 0]
      dob:          INT32 SNAPPY DO:0 FPO:45058188 SZ:21555328/21568770/1.00 VC:5390100 ENC:BIT_PACKED,PLAIN,RLE ST:[min: 1970-01-01, max: 2022-07-15, num_nulls: 0]
      age:          INT32 SNAPPY DO:0 FPO:66613516 SZ:4737802/4736178/1.00 VC:5390100 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 0, max: 99, num_nulls: 0]
      postcode:     INT32 SNAPPY DO:0 FPO:71351318 SZ:21571232/21568770/1.00 VC:5390100 ENC:BIT_PACKED,PLAIN,RLE ST:[min: 501, max: 99950, num_nulls: 0]
      city:         BINARY SNAPPY DO:0 FPO:92922550 SZ:41644490/83879754/2.01 VC:5390100 ENC:BIT_PACKED,PLAIN,PLAIN_DICTIONARY,RLE ST:[min: Aaronberg, max: Zunigaville, num_nulls: 0]
    
  • By default, SNAPPY compression is enabled for any managed Parquet-based table in the Impala engine in CDW unless otherwise specified.
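  • This default is governed by Impala's COMPRESSION_CODEC query option, which can also be set explicitly for the session before loading the table, for example:

      SET COMPRESSION_CODEC=snappy;
      INSERT INTO TABLE db2.parquet2_snappy SELECT * FROM db1.tmp;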

Performance Result

  • The following table shows the time taken (in seconds) to run each SQL query against each file format table without SNAPPY compression. These figures are adapted from the previous test.

      File Format   Engine   INSERT   SELECT COUNT (1st)   SELECT COUNT (2nd)   SELECT AVG (1st)   SELECT AVG (2nd)
      ORC (ZLIB)    Hive     507      0.40                 0.39                 8.13               0.39
      Avro          Hive     513      0.40                 0.38                 287                0.40
      Parquet       Hive     332      0.38                 0.38                 11.78              0.37
  • The following table shows the time taken (in seconds) to run each SQL query against each file format table with SNAPPY compression.

      File Format   Engine   INSERT   SELECT COUNT (1st)   SELECT COUNT (2nd)   SELECT AVG (1st)   SELECT AVG (2nd)
      ORC           Hive     450      0.38                 0.38                 5.93               0.39
      Avro          Hive     323      0.38                 0.38                 208                0.40
      Parquet       Hive     352      0.38                 0.38                 0.37               0.37
      Parquet       Impala   32       0.36                 0.35                 1.76               1.62

Storage Output

  • The following output shows the HDFS storage size of the managed tables generated in the Hive engine (with and without SNAPPY compression) for each file format.

      # hdfs dfs -du -h /warehouse/tablespace/managed/hive/db1.db
      13.0 G  38.9 G  /warehouse/tablespace/managed/hive/db1.db/avro
      9.5 G   28.4 G  /warehouse/tablespace/managed/hive/db1.db/avro_snappy
      4.1 G   12.4 G  /warehouse/tablespace/managed/hive/db1.db/orc
      4.8 G   14.5 G  /warehouse/tablespace/managed/hive/db1.db/orc_snappy
      9.7 G   29.0 G  /warehouse/tablespace/managed/hive/db1.db/parquet
      7.0 G   20.9 G  /warehouse/tablespace/managed/hive/db1.db/parquet_snappy
    
  • By default, the Impala engine applies SNAPPY compression to Parquet-based tables, which is why both tables below occupy the same space.

      # hdfs dfs -du -h /warehouse/tablespace/managed/hive/db2.db
      6.4 G  19.3 G  /warehouse/tablespace/managed/hive/db2.db/parquet2
      6.4 G  19.3 G  /warehouse/tablespace/managed/hive/db2.db/parquet2_snappy
    

Conclusion

  • SNAPPY compression helps save HDFS storage space and also improves the speed of the INSERT and SELECT queries.
  • ZLIB compression saves slightly more storage than SNAPPY for the ORC-based table in the Hive engine, but lags behind in query speed.
  • Avro is the worst performer, with or without a compression codec in place.
  • Parquet stands out with the highest speed for interactive SQL queries. As Parquet is a pioneer file format for Impala, running SQL queries against the Parquet-based table in Impala (where SNAPPY compression is enabled by default) produces faster results than running the same queries in the Hive engine.

