Want to store data in Hive tables and wondering which file format to use, ORC or Parquet?

Well, this is a question that many have tried to answer in various ways.

Let us understand how the Optimized Row Columnar (ORC) file format is different from our usual flat file.

ORC is a columnar file format. You can visualize an ORC file’s structure as an area divided into a header, a body, and a footer.

Header Section

The header contains the text ‘ORC’ so that tools can determine the file type while processing.
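
To make the header check concrete, here is a minimal Python sketch of how a tool might test for the ‘ORC’ marker. The function name `looks_like_orc` is my own, and the byte strings are in-memory stand-ins rather than real files.

```python
import io

# The 3-byte marker described above.
ORC_MAGIC = b"ORC"

def looks_like_orc(stream):
    """Return True if the binary stream starts with the ORC marker.

    `looks_like_orc` is a hypothetical helper for illustration only.
    """
    stream.seek(0)
    return stream.read(len(ORC_MAGIC)) == ORC_MAGIC

# In-memory stand-ins, not real files.
print(looks_like_orc(io.BytesIO(b"ORC\x00\x01")))
print(looks_like_orc(io.BytesIO(b"PAR1\x00\x01")))
```

With a real file you would pass `open(path, "rb")` instead of a `BytesIO` object. A matching header alone does not prove the file is valid ORC; it is only a quick first check.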

Body Section

The body contains the actual data as well as the indexes. The data is stored in groups of rows called stripes. The default stripe size is 250 MB, although the configured default can differ between Hive versions.

Each stripe is further divided into three sections: the index data, the row data, and a stripe footer. One interesting thing to note here is that the index and data sections are stored column by column, so that only the columns a query needs are read. Index data consists of the min and max values for each column as well as the row positions within each column. A reader uses these indexes to decide which stripes, and which row groups within them, it has to read for the data required. The stripe footer contains the encoding of each column and a directory of the streams, including their locations.
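
The way min and max values let a reader skip data can be sketched in a few lines of Python. This is a simplified illustration, not how the ORC reader is actually implemented: `ColumnStats`, `may_contain`, and `stripes_to_read` are hypothetical names, and the statistics are made up.

```python
from dataclasses import dataclass

@dataclass
class ColumnStats:
    """Min and max of one column within one stripe (illustrative only)."""
    minimum: int
    maximum: int

def may_contain(stats, lo, hi):
    """True if the range [lo, hi] overlaps the stripe's [min, max] range.

    False means no row in the stripe can match, so the stripe can be skipped.
    """
    return stats.maximum >= lo and stats.minimum <= hi

def stripes_to_read(stripe_stats, lo, hi):
    """Return the indexes of the stripes that might hold matching rows."""
    return [i for i, s in enumerate(stripe_stats) if may_contain(s, lo, hi)]

# Made-up statistics for one column across three stripes.
stats = [ColumnStats(1, 100), ColumnStats(101, 200), ColumnStats(201, 300)]

# A filter like "value BETWEEN 150 AND 180" only needs the second stripe.
print(stripes_to_read(stats, 150, 180))  # [1]
```

Note that the check can only rule stripes out. A stripe whose range overlaps the filter still has to be read to find out whether it really holds matching rows.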

Footer Section

The footer section consists of the file metadata, the file footer, and the postscript.

The file metadata section contains statistical information about the columns at the stripe level. These statistics enable input split elimination based on predicate pushdown, which is evaluated for each stripe. The file footer contains the list of stripes in the file, the number of rows per stripe, and the data type of each column. It also includes column-level aggregates such as count, min, max, and sum. The postscript section contains:

  • file information, such as the lengths of the file footer and metadata sections,
  • the version of the file, and
  • the compression parameters, such as the compression codec used (for example none, zlib, or snappy) and the compression block size
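
A reader has to find the postscript before it can use any of this. In the ORC specification, the last byte of the file stores the length of the postscript, so the reader starts at the end of the file and works backwards. Here is a minimal Python sketch of that first step. The helper name `read_postscript_bytes` is my own, and the sample bytes are a stand-in, not a valid ORC file.

```python
import io

def read_postscript_bytes(stream):
    """Return the raw postscript bytes from the tail of an ORC-style stream.

    Hypothetical helper for illustration. The returned bytes are still
    Protocol Buffers encoded; decoding them is not shown here.
    """
    stream.seek(-1, io.SEEK_END)             # the last byte holds the postscript length
    ps_len = stream.read(1)[0]
    stream.seek(-(1 + ps_len), io.SEEK_END)  # step back over the postscript itself
    return stream.read(ps_len)

# Stand-in bytes, NOT a valid ORC file: marker, dummy body, dummy postscript, length byte.
fake_postscript = b"dummy-postscript"
fake_file = io.BytesIO(
    b"ORC" + b"...body..." + fake_postscript + bytes([len(fake_postscript)])
)

print(read_postscript_bytes(fake_file) == fake_postscript)
```

Because the length fits in a single byte, the postscript is always shorter than 256 bytes. Once it is decoded, the footer and metadata lengths it carries tell the reader how far back to seek for those sections.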

You should now have a much better understanding of the ORC file format structure, which will help you make a better decision when selecting a file format. Of course, your next step is to compare file formats like ORC and Parquet and decide which is better suited to your project.

Not sure where to begin this comparison? In my next blog, I will take you through how to compare various file formats. Stay tuned.

Emergys Blog
