Calculating Inverse Document Frequency

Well, I’ve been gone for a minute but I’m back!

For this tutorial I am calculating the Inverse Document Frequency for a random sample of abstracts from NSF Research Awards Abstracts 1990-2003 Part 1.

Linux Pre-Processing

 Step 1: First I kept only the text that comes after the word Abstract in each file, deleting everything up to and including the Abstract line.

 find . -name '*.txt' -type f -exec sed -i '0,/^Abstract/d' {} \;

 Step 2: Some documents did not have any abstracts, so I just deleted the empty files in order to save some time with the text preprocessing.

 find . -size 0 -delete

Step 3: I know that this is a little extra, but I moved all of the files out of their subdirectories into one directory. I did not want to waste any time looping through all of those directories.

 find /home/Desktop/NSF_Part1/ -iname '*.txt' -exec mv '{}' /home/Desktop/clean \;
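If you would rather do the same clean-up from Python, the three steps above can be sketched roughly like this. It is only a sketch under my own assumptions: the function name flatten_abstracts is made up, files are read as Latin-1 so odd bytes do not raise errors, and files with an empty abstract are skipped up front instead of being deleted afterwards.

```python
from pathlib import Path


def flatten_abstracts(src_dir, dest_dir):
    """Copy the text after the first 'Abstract' line of every .txt file
    under src_dir into dest_dir, skipping files with nothing left."""
    src, dest = Path(src_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    kept = []
    # sorted() materializes the file list before anything is written
    for path in sorted(src.rglob("*.txt")):
        lines = path.read_text(encoding="latin-1").splitlines(keepends=True)
        # Step 1: find the first line that starts with "Abstract"
        start = next((i for i, line in enumerate(lines)
                      if line.startswith("Abstract")), None)
        if start is None:
            continue  # no Abstract line: treated as having no abstract
        body = "".join(lines[start + 1:])
        # Step 2: skip documents whose abstract is empty
        if not body.strip():
            continue
        # Step 3: write everything into one flat directory
        target = dest / path.name
        target.write_text(body, encoding="latin-1")
        kept.append(target.name)
    return kept
```

Something like flatten_abstracts('/home/Desktop/NSF_Part1', '/home/Desktop/clean') would then leave only the non-empty abstracts in one directory. Unlike the mv command, this copies the text and leaves the original files alone.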

 

Python Pre-Processing

 

There are many ways to write a program that preprocesses text and calculates idf scores. I did this in Python and provided many comments in my code so it will be easy to walk through. I used a small random sample of the NSF Part 1 abstracts because my code took forever to run on the full set. So if you guys can find more efficient code for calculating the idf, just message me. The CSV file that I produced is called output.csv; it is a list of words and their idf values.

 


import csv
import glob
import math
import string
from collections import Counter

from nltk.corpus import stopwords

stop_words = set(stopwords.words("english")) #puts the English stop words into a set for fast lookups

flist = glob.glob('/home/corpus3/*txt') #gets all of the paths for each text document

strip_table = str.maketrans('', '', string.digits + string.punctuation) #table that deletes digits and punctuation

doccounts = Counter() #number of documents that each word appears in

for fname in flist:
    with open(fname, 'r', errors='ignore') as tfile: #opens the file, skipping bytes that do not decode
        text = tfile.read() #reads the whole file
    text = text.lower() #turns all the letters into lower case
    words = set() #each word is kept once per document
    for token in text.split(): #split() also takes care of the new line characters
        word = token.translate(strip_table) #strips digits and punctuation from the token
        if word and token not in stop_words and word not in stop_words: #takes out empty tokens and stop words
            words.add(word)
    doccounts.update(words) #each document adds at most 1 to a word's count

newvalues = []

for word, df in doccounts.items():
    result = []
    result.append(word) #appends the word
    result.append(math.log(float(len(flist)) / df)) #idf = log(total documents / documents containing the word)
    newvalues.append(result)

with open('output.csv', 'w', newline='') as f: #writes each word and its idf in its own row in a csv file
    writer = csv.writer(f)
    writer.writerows(newvalues)

IDF- Inverse Document Frequency

IDF is an important part of a natural language processing weighting scheme, term frequency-inverse document frequency (tf-idf), which is intended to reflect how important a word is to a document in a collection. With term frequency alone, all terms are considered equally important, yet certain terms that occur very frequently across the collection have little power in determining relevance. IDF scales down the weight of those common terms so that the more distinctive terms in the corpus stand out. Terms that occur in fewer documents can be more informative. The skewed distribution behind this, where a handful of words account for most of the occurrences, is described by Zipf's law.
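To make the weighting concrete, here is a small sketch on a made-up three-document corpus. The function name tf_idf and the toy documents are my own assumptions, and it uses raw counts for tf and the plain log(N/df) form of idf; real libraries often use smoothed variants.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: tf * idf} dict per tokenized document."""
    n_docs = len(docs)
    # document frequency: number of documents each word appears in
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw count of each word in this document
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights


# Toy corpus (hypothetical): "research" shows up in every document
docs = [
    ["research", "research", "xenon"],
    ["research", "project"],
    ["research", "project", "theory"],
]
weights = tf_idf(docs)
# "research" has the highest tf in the first document, but its idf is
# log(3/3) = 0, so its weight ends up below that of "xenon"
print(weights[0])
```

Even though "research" is the most frequent word in the first document, its weight is wiped out because it appears everywhere, while "xenon" keeps a positive weight.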

 

Inspect terms that have high and low IDF scores. Discuss how well (or poorly) terms with high and low scores characterize the collection.

 

The "inverse document frequency" measures how common a word is among all documents. The more common a word is, the lower its idf. The less common a word is in the corpus, the higher its idf value. We take the ratio of the total number of documents to the number of documents containing the word, then take the log of that. Some implementations add 1 to the divisor to prevent division by zero for words that appear in no documents.
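Written out, the description above looks like this. The symbols are my own notation, not anything official: N is the total number of documents and df(t) is the number of documents that contain the term t. The second form is the add-1 variant.

```latex
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)},
\qquad
\mathrm{idf}_{+1}(t) = \ln\frac{N}{1 + \mathrm{df}(t)}

% Worked example with a hypothetical N = 144:
%   df(t) = 1   gives  idf(t) = \ln 144 \approx 4.9698
%   df(t) = 72  gives  idf(t) = \ln 2   \approx 0.6931
%   df(t) = 144 gives  idf(t) = \ln 1   = 0
```

The natural log is an assumption on my part as well; any base works as long as you use the same one everywhere, since changing the base only rescales every score by a constant.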

Based on the terms that I collected from the corpus3 directory, here are 10 of the terms with the highest idf scores. They are all tied at the same value because each one turns up only once in the sample:

Word Score
subtropical 4.969813
education 4.969813
domesticated 4.969813
ingredients 4.969813
isoprenoid 4.969813
envisaged 4.969813
canada 4.969813
desulfovibrio 4.969813
xenon 4.969813
validation 4.969813

The top ten words that have the lowest idf values are:

Word Score
research 0.027157
study 0.060607
project 0.08283
theory 0.084234
program 0.085686
systems 0.092034
work 0.09377
problems 0.09377
new 0.101425
dr 0.101425
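If you want to pull lists like the ones above out of the CSV yourself, a sketch along these lines should do it. I am assuming the file has exactly two columns (word, score) and no header row, which is what the script writes; the function name rank_idf is made up for this example.

```python
import csv


def rank_idf(csv_path, n=10):
    """Return (highest, lowest) lists of (word, score) from a word,idf CSV."""
    with open(csv_path, newline="") as f:
        # skip any row that does not have exactly a word and a score
        rows = [(row[0], float(row[1])) for row in csv.reader(f) if len(row) == 2]
    rows.sort(key=lambda pair: pair[1])  # ascending by idf score
    lowest = rows[:n]
    highest = rows[::-1][:n]  # same list walked from the high end
    return highest, lowest
```

Calling rank_idf('output.csv') returns the two lists. Keep in mind that when many words are tied at the top score, which ten you see is arbitrary.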

Would removing administrative data change your results?

Unless there is an omnipotent administrative scientist who can get his name written on the majority of the NSF abstract documents, I do not believe removing administrative data will change the idf values much. Administrative data makes up only a small portion of the text in each abstract file, and much of its vocabulary overlaps with the abstracts themselves.

 

Will IDF ever be less than 0 or undefined? 

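The answer is no, at least for the plain formula. Since idf is the log of the ratio of total documents to documents containing the word, the only way to get a negative value is for that ratio to drop below 1. That would mean a word appears in more documents than actually exist, which is impossible. The only way for the logarithm to become undefined is for a word to not appear in any documents, meaning that the document count for that word is 0. But if a word does not exist in any document, it never makes it into the vocabulary and never gets an idf value in the first place. The one exception is the variant that adds 1 to the divisor: there, a word that appears in every document gets log(N/(N+1)), which is slightly below 0.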
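Here is a quick way to check that reasoning numerically. This is just a sketch: the function names idf_plain and idf_plus_one and the corpus size of 100 are hypothetical, picked only for illustration.

```python
import math


def idf_plain(n_docs, df):
    """idf = ln(N / df); only defined when df >= 1."""
    if df < 1:
        raise ValueError("a word must appear in at least one document")
    return math.log(n_docs / df)


def idf_plus_one(n_docs, df):
    """Variant that adds 1 to the divisor so df = 0 is allowed."""
    return math.log(n_docs / (1 + df))


n_docs = 100  # hypothetical corpus size
# every possible document frequency, from 1 document up to all of them
plain = [idf_plain(n_docs, df) for df in range(1, n_docs + 1)]
print(min(plain))                        # 0.0, reached when df == n_docs
print(idf_plus_one(n_docs, n_docs) < 0)  # True
```

So the plain version bottoms out at exactly 0 for a word that is in every document, and only the add-1 version can dip below zero.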

And there you have it!


Resources:

To find out more about term frequency and inverse document frequency, check out these links.

http://aimotion.blogspot.com/2011/12/machine-learning-with-python-meeting-tf.html

http://stevenloria.com/finding-important-words-in-a-document-using-tf-idf/
