“My life is very monotonous,” the fox said. “I hunt chickens; men hunt me. All the chickens are just alike, and all the men are just alike. And, in consequence, I am a little bored.”
– Antoine de Saint-Exupéry
My recent visit to a KPO (Knowledge Process Outsourcing) firm reminded me of the famous French quote above. There seemed to be tremendous boredom and monotony amongst the educated and well-paid workforce. I could not help but feel sorry for them, as I could imagine the amount of human effort spent manually classifying documents every day. Being a software geek who had solved several such problems before, I knew there was a better way for those folks to get the job done: automating the process using Machine Learning.
Document Classification Automation aims to ease the life of a domain expert by avoiding painstaking, repetitive, and time-consuming processes.
What Do Classifiers Do
Classifiers make ‘predictions’ based on experience. When a classifier is fed a new document, it predicts that the document belongs to a particular class or category and assigns a “label.”
Source Data for Building a Classification Process
The Source dataset is a collection of documents that have been classified in the past. It must be split into two parts: a Training dataset and a Testing dataset.
- Training dataset – Building a classification model requires a training dataset. It needs to be large enough to have adequate documents in each class, and of good quality, with clear differences between the documents belonging to different categories.
- Testing dataset – Evaluating the effectiveness of the classification model requires a testing dataset.
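As a rough illustration of the split described above, here is a minimal Python sketch. The function name `split_dataset`, the 20% test fraction, and the fixed seed are my own assumptions for the example, not a prescribed recipe. It splits class by class so that every category with more than one document appears in both parts.

```python
import random
from collections import defaultdict

def split_dataset(labeled_docs, test_fraction=0.2, seed=42):
    """Split (text, label) pairs into training and testing lists, class by class."""
    # Group the previously classified documents by their label.
    by_label = defaultdict(list)
    for text, label in labeled_docs:
        by_label[label].append((text, label))

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for docs in by_label.values():
        rng.shuffle(docs)
        if len(docs) > 1:
            # Hold out at least one document, but never the whole class.
            n_test = min(len(docs) - 1, max(1, round(len(docs) * test_fraction)))
        else:
            # A class with a single document stays entirely in training.
            n_test = 0
        test.extend(docs[:n_test])
        train.extend(docs[n_test:])
    return train, test
```

Splitting per class rather than over the whole collection keeps a small category from ending up only in the testing part, which would leave the model with nothing to learn from for that category.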
How the Classifier Is Built
- Pre-processing of dataset
Pre-processing is necessary because source data may contain noise and unreliable information. The objective is to structure the data to facilitate the Classification Process. Data pre-processing includes Data Cleansing, Normalization, Feature Extraction, and Feature Selection. We need to remember that Data Preparation is a complex subject involving many iterations, exploration, and analysis. Getting the data ready in these steps plays a vital role in improving the accuracy of the Classifier.
- Classification Algorithm
Documents are classified by comparing the number of matching terms between a document's vector and each class to see which class it most closely resembles. The Classifier then places the document in one of the categories and assigns it that category's label. In my experience, classification algorithms such as Support Vector Machines, Naive Bayes, and Rocchio are best suited for Document Classification.
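To make the two steps above concrete, here is a small Python sketch using only the standard library. It is a simplified, Rocchio-style nearest-centroid classifier, not a full implementation of any of the algorithms named above. The stop-word list, the function names, and the sample documents are all assumptions made up for the illustration, and Feature Selection is reduced to dropping stop words.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption); real projects use fuller ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on", "by"}

def preprocess(text):
    """Cleanse and normalize: lowercase, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def to_vector(text):
    """Feature extraction: a sparse term-frequency vector."""
    return Counter(preprocess(text))

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def train_centroids(training_docs):
    """Average the vectors of each class into one centroid per label."""
    sums, counts = defaultdict(Counter), Counter()
    for text, label in training_docs:
        sums[label].update(to_vector(text))
        counts[label] += 1
    return {label: Counter({t: w / counts[label] for t, w in vec.items()})
            for label, vec in sums.items()}

def classify(text, centroids):
    """Assign the label whose centroid the document most closely resembles."""
    vector = to_vector(text)
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))

# Hypothetical sample data, invented for this example.
training = [
    ("Invoice for the payment of goods", "invoice"),
    ("Payment due on invoice total", "invoice"),
    ("Agreement between the parties to the contract", "contract"),
    ("The contract terms bind both parties", "contract"),
]
centroids = train_centroids(training)
label = classify("Please settle the invoice payment", centroids)
```

With this sample data the new document shares terms only with the invoice documents, so `label` ends up as `"invoice"`. A production system would more likely rely on an established machine learning library than on hand-written vectors.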
The Accuracy Measure
Once the Classification model is built, it needs to be evaluated by feeding it the testing dataset. If the accuracy of the current Classification model is not as expected, you must take steps to improve it. I took the following measures to improve accuracy:
- Revisit pre-processing of the dataset and filter out unwanted data
- Improve the quality of the Training corpus
- Try other Classification Algorithms or try the Ensemble approach
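The evaluation step and the Ensemble idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names are hypothetical, accuracy is taken as the plain fraction of correctly labeled test documents, and the ensemble is a simple majority vote over the labels that several classifiers predict for one document.

```python
from collections import Counter

def accuracy(predicted, actual):
    """Fraction of test documents whose predicted label matches the known label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

def per_class_accuracy(predicted, actual):
    """Accuracy broken down by true label, to show which classes are weak."""
    total, correct = Counter(actual), Counter()
    for p, a in zip(predicted, actual):
        if p == a:
            correct[a] += 1
    return {label: correct[label] / total[label] for label in total}

def majority_vote(predictions):
    """Ensemble step: the label most classifiers agree on for one document.

    On a tie, the label that appears first in the list wins.
    """
    return Counter(predictions).most_common(1)[0][0]
```

The per-class breakdown is useful when deciding where to improve the Training corpus, since a good overall figure can hide one category that the model gets wrong most of the time.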