Want to store data in Hive tables? Wondering which file format to use, ORC or Parquet?
Well, this is a question that many have tried to answer in various ways.
Let us understand how the Optimized Row Columnar (ORC) file format is different from our usual flat file.
ORC is a columnar file format. You can visualize an ORC file’s structure as an area divided into a header, a body, and a footer.
Header Section
The header contains the text ‘ORC’ so that tools can determine the file type while processing.
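As a rough illustration, a tool could sniff the file type by comparing the first three bytes of a file with the ASCII text ORC. The helper name `looks_like_orc` is hypothetical, and this is only a sketch: it checks the leading bytes and validates nothing else about the file.

```python
import io

ORC_MAGIC = b"ORC"  # the header text described above


def looks_like_orc(stream):
    """Return True if the binary stream starts with the bytes 'ORC'."""
    stream.seek(0)
    return stream.read(len(ORC_MAGIC)) == ORC_MAGIC


# In-memory stand-ins for files (made-up bytes, not real ORC data).
print(looks_like_orc(io.BytesIO(b"ORC" + b"\x00" * 16)))   # True
print(looks_like_orc(io.BytesIO(b"PAR1" + b"\x00" * 16)))  # False
```

With a real file you would pass `open(path, "rb")` instead of the `io.BytesIO` stand-in.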
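Body Section
The body contains the actual data as well as the indexes. The data is stored in groups of rows called stripes. The default stripe size has traditionally been quoted as 250 MB; recent Hive releases default to 64 MB, and it can be set per table with the orc.stripe.size property.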
Each stripe is further divided into three sections: the index data, the row data, and a stripe footer. One interesting thing to note here is that the index and data sections are both stored by column, so only the columns a query needs are read. The index data holds the minimum and maximum values for each column, as well as the row positions within each column. These indexes let a reader pick out the stripes and row groups that can contain the requested data and skip the rest. The stripe footer contains the encoding of each column and a directory of the streams with their locations.
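To make the min/max idea concrete, here is a minimal sketch of how a reader might use such an index to skip row groups. The `IndexEntry` class, the `row_groups_to_read` function, and the sample values are all made up for illustration; a real ORC reader decodes these statistics from the stripe's index streams rather than from Python objects.

```python
from dataclasses import dataclass


@dataclass
class IndexEntry:
    """Min/max of one column for one row group (hypothetical, simplified)."""
    first_row: int
    min_value: int
    max_value: int


def row_groups_to_read(index, wanted):
    """Keep only row groups whose [min, max] range could contain `wanted`."""
    return [e.first_row for e in index if e.min_value <= wanted <= e.max_value]


# Made-up index for a single integer column, one entry per 10,000 rows.
index = [
    IndexEntry(first_row=0, min_value=1, max_value=500),
    IndexEntry(first_row=10_000, min_value=501, max_value=900),
    IndexEntry(first_row=20_000, min_value=850, max_value=1_200),
]

print(row_groups_to_read(index, 875))  # [10000, 20000]
```

The first row group is skipped without reading any of its data, because 875 cannot fall inside its 1 to 500 range. The other two are kept only as candidates: the index says the value could be there, not that it is.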
Footer Section
The footer section consists of file metadata, file footer, and postscript.
The File Metadata section contains column statistics at the stripe level. These statistics enable input split elimination through predicate pushdown, which is evaluated for each stripe. The file footer contains the list of stripes in the file, the number of rows per stripe, and the data type of each column. It also includes column-level aggregates such as count, min, max, and sum. The Postscript section contains:
- file information such as the lengths of the file’s Footer and Metadata sections,
- the version of the file, and
- the compression parameters, such as the compression codec used (for example none, zlib, or snappy) and the size of the compressed footer
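Because the Postscript sits at the very end of the file, a reader starts from the tail. The sketch below relies on a detail from the ORC specification rather than from this post: the final byte of the file holds the length of the Postscript. The function name `split_tail` and the sample bytes are hypothetical, and decoding the Postscript itself (it is a Protocol Buffers message) is deliberately left out.

```python
def split_tail(data):
    """Split the end of an ORC-like byte string into (postscript, length).

    Assumes the last byte stores the Postscript length, per the ORC spec.
    """
    if not data:
        raise ValueError("empty file")
    ps_len = data[-1]
    if ps_len + 1 > len(data):
        raise ValueError("postscript length exceeds file size")
    # The Postscript occupies the ps_len bytes just before the final byte.
    postscript = data[-1 - ps_len:-1]
    return postscript, ps_len


# Placeholder bytes only; this is not a real, decodable Postscript.
fake_postscript = b"\x08\x2a\x10\x01"
fake_file = (
    b"ORC" + b"stripes-and-footer" + fake_postscript + bytes([len(fake_postscript)])
)

postscript, ps_len = split_tail(fake_file)
print(ps_len)  # 4
```

Once the Postscript is decoded, the footer and metadata lengths it carries tell the reader how far back to seek for those sections.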
You should now have a much better understanding of the ORC file format’s structure, which will help you make a better decision when selecting a file format. Of course, your next step is to compare formats such as ORC and Parquet and conclude which is better suited to your project.
Not sure where to begin this comparison? In my next blog post, I will take you through how to compare various file formats. Stay tuned.