“My life is very monotonous,” the fox said. “I hunt chickens; men hunt me. All the chickens are just alike, and all the men are just alike. And, in consequence, I am a little bored.”
– Antoine de Saint-Exupéry

My recent visit to a KPO (knowledge process outsourcing) firm reminded me of the famous French quote above. The educated, well-paid workforce seemed tremendously bored by monotonous work. I could not help feeling sorry for them as I imagined the human effort spent manually classifying documents every day. As a software geek who had solved several such problems before, I knew there was a better way for them to get the job done: automating the process with Machine Learning.

Document Classification Automation aims to ease the life of a domain expert by removing painstaking, repetitive, and time-consuming manual work.

What Classifiers Do

Classifiers make ‘predictions’ based on experience. When a classifier is fed a new document, it predicts that the document belongs to a particular class or category and assigns a “label.”

Source Data for Building a Classification Process

The Source dataset is a collection of documents that have been classified in the past. It must be split into two parts: a Training dataset and a Testing dataset.

  1. Training dataset – Building a classification model requires a training dataset.
    It needs to be large enough to contain an adequate number of documents for each class. It also needs to be of good quality, with clear differences between the documents belonging to different categories.
  2. Testing dataset – Evaluating the effectiveness of the classification model requires a testing dataset: classified documents that were held back and not used for training.
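As a rough illustration of the split described above, here is a minimal Python sketch. The function name `stratified_split`, the 25% test fraction, and the toy corpus are assumptions made for this example, not part of any particular project. It splits class by class so that every category appears in both parts.

```python
import random
from collections import defaultdict


def stratified_split(documents, labels, test_fraction=0.25, seed=42):
    """Split labeled documents into (train, test) lists of (document, label) pairs."""
    by_class = defaultdict(list)
    for doc, label in zip(documents, labels):
        by_class[label].append(doc)

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for label, docs in by_class.items():
        docs = docs[:]  # copy so the caller's list is not shuffled
        rng.shuffle(docs)
        # Hold out at least one document per class, but never the whole class.
        n_test = min(len(docs) - 1, max(1, round(len(docs) * test_fraction)))
        test += [(doc, label) for doc in docs[:n_test]]
        train += [(doc, label) for doc in docs[n_test:]]
    return train, test


# Hypothetical toy corpus: 8 invoices and 4 contracts.
documents = [f"invoice {i}" for i in range(8)] + [f"contract {i}" for i in range(4)]
labels = ["invoice"] * 8 + ["contract"] * 4

train_set, test_set = stratified_split(documents, labels)
```

Splitting per class is a design choice: a purely random split over a small or unbalanced corpus can leave a category with no training or no testing documents.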

How the Classifier Is Built

  1. Pre-processing of dataset
    Pre-processing is necessary because source data may contain noise and unreliable data. The objective is to structure the data to facilitate the Classification Process. Data pre-processing includes Data Cleansing, Normalization, Feature Extraction, and Feature Selection. Data Preparation is a complex subject that involves many iterations of exploration and analysis, and it plays a vital role in improving the accuracy of a classifier.
  2. Classification Algorithm
    Each document is represented as a vector of terms. Documents are classified by comparing the matching terms in these document vectors to see which class a document most closely resembles. The Classifier then assigns the document to one of the categories and labels it accordingly. In my experience, classification algorithms such as Support Vector Machines, Naive Bayes, and Rocchio are best suited for Document Classification.
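The two steps above can be sketched in plain Python. This is a simplified illustration, not a production implementation: the stop-word list, the `preprocess` function, the `RocchioClassifier` class, and the sample texts are all assumptions made for the example. Cleansing and normalization are reduced to lowercasing and stop-word removal, feature extraction uses TF-IDF weights, feature selection is left out, and the Rocchio algorithm is shown in its basic centroid form, where a document gets the label of the class centroid it is most similar to.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list; a real project would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "for", "is", "in", "on",
              "by", "this", "that", "with"}


def preprocess(text):
    """Cleanse and normalize: lowercase, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def unit_length(vector):
    """Scale a sparse vector (dict of term -> weight) to length 1."""
    norm = math.sqrt(sum(weight * weight for weight in vector.values()))
    return {term: weight / norm for term, weight in vector.items()} if norm else {}


class RocchioClassifier:
    """Centroid-based classifier over TF-IDF document vectors."""

    def fit(self, texts, labels):
        token_lists = [preprocess(text) for text in texts]
        n_docs = len(token_lists)
        doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
        # Smoothed inverse document frequency: rarer terms get higher weight.
        self.idf = {term: math.log((1 + n_docs) / (1 + df)) + 1
                    for term, df in doc_freq.items()}
        # Sum the document vectors of each class, then rescale to a unit centroid.
        sums = defaultdict(Counter)
        for tokens, label in zip(token_lists, labels):
            for term, weight in self._vectorize(tokens).items():
                sums[label][term] += weight
        self.centroids = {label: unit_length(vec) for label, vec in sums.items()}
        return self

    def _vectorize(self, tokens):
        counts = Counter(tokens)
        # Terms never seen in training carry no weight and are skipped.
        return unit_length({term: count * self.idf[term]
                            for term, count in counts.items() if term in self.idf})

    def predict(self, text):
        vector = self._vectorize(preprocess(text))

        def similarity(label):
            centroid = self.centroids[label]
            return sum(weight * centroid.get(term, 0.0)
                       for term, weight in vector.items())

        # Cosine similarity, since both vectors have unit length.
        return max(self.centroids, key=similarity)


# Hypothetical training documents and labels.
train_texts = [
    "Invoice number 1001: total amount due within 30 days",
    "Payment of the invoice amount is due on receipt",
    "This agreement is made between the parties to the contract",
    "The parties agree to the terms of this contract",
]
train_labels = ["invoice", "invoice", "contract", "contract"]

classifier = RocchioClassifier().fit(train_texts, train_labels)
first_label = classifier.predict("Please settle the amount due on this invoice")
second_label = classifier.predict("The contract terms bind both parties")
```

Here `first_label` comes out as "invoice" and `second_label` as "contract", because each new document shares weighted terms only with the centroid of that class.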

The Accuracy Measure

Once the Classification model is built, it needs to be evaluated by feeding it the testing dataset and comparing its predictions with the known labels. If the accuracy of the model is not as expected, you must take steps to improve it. I took the following measures to improve accuracy:

  1. Revisit pre-processing of the dataset and filter out unwanted data
  2. Improve the quality of the Training corpus
  3. Try other Classification Algorithms or try the Ensemble approach
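To make the evaluation step in this section concrete, here is a minimal sketch of how accuracy could be computed from the testing dataset. The `evaluate` function and the sample labels are assumptions for illustration. Besides overall accuracy, it reports the fraction of documents found per class (recall), which can help show which category needs a better training corpus.

```python
from collections import Counter


def evaluate(true_labels, predicted_labels):
    """Return (overall accuracy, per-class recall) for two parallel label lists."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")

    total = Counter()    # documents per true class
    correct = Counter()  # correctly predicted documents per true class
    for truth, guess in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1

    accuracy = sum(correct.values()) / len(true_labels)
    recall = {label: correct[label] / total[label] for label in total}
    return accuracy, recall


# Hypothetical known labels of the testing dataset and a classifier's predictions.
true_labels = ["invoice", "invoice", "invoice", "contract", "contract"]
predicted_labels = ["invoice", "invoice", "contract", "contract", "contract"]

accuracy, recall_by_class = evaluate(true_labels, predicted_labels)
```

In this toy run, four of five predictions are correct, so `accuracy` is 0.8; every contract is found, but only two of the three invoices are.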

Summary

Document Classification is a supervised learning method: a model is built from a pre-processed training dataset, and the trained Classifier then predicts the category of any given document.

The quality of the training dataset affects the quality of prediction, so it is essential that the documents in each category remain clearly distinguishable from those in other categories.
