IO Parquet with Java on Colab

Read/Write Parquet with Tablesaw in Java on Google Colab (Jupyter)

Introduction

We’re going to take a quick tour of Apache Parquet using Java on Google Colab. This is possible with the help of the IJava Kernel for Jupyter written by Spencer Park.

What is Parquet? “… an open source, column-oriented data file format designed for efficient data storage and retrieval, [providing] efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. … [and] designed to be a common interchange format for both batch and interactive workloads.” It is similar to other columnar-storage file formats available in Hadoop, namely RCFile and ORC.
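To make "column-oriented" a little more concrete, here is a minimal sketch in plain Java. It is only an illustration of the idea, not how Parquet actually lays out or encodes a file: the class name `ColumnarSketch`, the helper methods, and the sample data are all made up for this example. It pivots a few rows into one array per column, then shows two things that become easy once a column is stored contiguously: collapsing repeated values with a simple run-length encoding, and aggregating one column without reading the others.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not Parquet's on-disk format.
public class ColumnarSketch {

    /** Run-length encode a column: consecutive equal values collapse into "value x count". */
    static List<String> runLengthEncode(String[] column) {
        List<String> runs = new ArrayList<>();
        int i = 0;
        while (i < column.length) {
            int j = i;
            // Advance j to the end of the current run of equal values.
            while (j < column.length && column[j].equals(column[i])) {
                j++;
            }
            runs.add(column[i] + " x " + (j - i));
            i = j;
        }
        return runs;
    }

    /** Sum a single column without touching any other column. */
    static int sum(int[] column) {
        int total = 0;
        for (int value : column) {
            total += value;
        }
        return total;
    }

    public static void main(String[] args) {
        // Row-oriented layout: each row keeps all of its fields together.
        Object[][] rows = {
            {"AAPL", 10}, {"AAPL", 20}, {"AAPL", 5}, {"MSFT", 7}
        };

        // Column-oriented layout: each column is stored contiguously.
        String[] symbol = new String[rows.length];
        int[] quantity = new int[rows.length];
        for (int r = 0; r < rows.length; r++) {
            symbol[r] = (String) rows[r][0];
            quantity[r] = (Integer) rows[r][1];
        }

        System.out.println(runLengthEncode(symbol)); // [AAPL x 3, MSFT x 1]
        System.out.println(sum(quantity));           // 42
    }
}
```

Similar values sitting next to each other is what makes compression and encoding schemes effective, and reading only the columns a query needs is what makes retrieval efficient. Real Parquet files add much more on top of this (row groups, pages, typed encodings, metadata), so treat the sketch as intuition rather than a specification.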

If you are not already familiar with Apache Parquet, I highly recommend either of the two videos below.

Both videos are a modest time commitment (45+ minutes each) and more technical than a 5-minute summary you might find elsewhere, but they are well worth the investment.

I’ve listened to each multiple times while driving around town on my morning commute, from my home and back to my home office. Despite the office being about 20 feet from my bed, I appreciate the freedom of taking a leisurely drive in the country while drinking my morning coffee.
