voice of humanity: The Academic Document Retrieval Scene    
9 Nov 2003 @ 12:36, by Roger Eaton

Two or three principal finds plus the sheer fun of words have made the search through academia worthwhile.

WordNet is a fascinating site for word bugs. It is an English dictionary/thesaurus developed at Princeton over the last 20 years or so to help automate semantic searches. WordNet lists English nouns, verbs, adjectives and adverbs in all of their senses, placing each sense in a "synset" of synonyms. In addition, antonyms, hypernyms (more general terms) and hyponyms (more specific terms) are tracked, so each word is enmeshed in its relations with other words. WordNet is available both online and for download under a liberal license. For more details, see the Wikipedia article.

See Jaap Kamps' and Maarten Marx's Words with Attitude article for an illustration of the fun that can be had with WordNet. These two professors have explored the synonyms of three pairs of adjectives in WordNet: good/bad, active/passive and strong/weak. These three pairs are the result of earlier work by Osgood, the discoverer of semantic space. There are 21,365 adjectives in WordNet 1.6. Of these, there is a subset of 5,410 adjectives which can be reached by following synonym links from synset to synset, beginning with the three pairs. The surprise is that the exact same 5,410 adjectives are reached this way from each of the three pairs of starter adjectives. Words with attitude! A lovely result.
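The "reached by linking through synonyms" idea is an ordinary graph search, and a small sketch may make it concrete. The synonym links below are invented purely for illustration (they are not taken from WordNet), and the names `EDGES`, `build_graph` and `reachable` are hypothetical. The point is only that any starting pair inside the same connected set arrives at the same words.

```python
from collections import deque

# Hypothetical toy synonym links, invented for illustration only.
# Real WordNet synsets would take the place of this list.
EDGES = [
    ("good", "sound"), ("sound", "strong"), ("strong", "active"),
    ("sound", "heavy"), ("heavy", "big"), ("big", "bad"),
    ("bad", "weak"), ("weak", "passive"),
    ("purple", "violet"),  # a separate little island of words
]

def build_graph(edges):
    """Turn a list of synonym pairs into an undirected adjacency map."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def reachable(graph, seeds):
    """Breadth-first search: every word reachable from the seed words."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        word = queue.popleft()
        for neighbour in graph.get(word, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

graph = build_graph(EDGES)
from_good_bad = reachable(graph, ["good", "bad"])
from_active_passive = reachable(graph, ["active", "passive"])
from_strong_weak = reachable(graph, ["strong", "weak"])

# All three starting pairs land on the same set of words.
print(from_good_bad == from_active_passive == from_strong_weak)
```

In this toy graph "purple" and "violet" are never reached, which is what it means for them to sit outside the set.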

Kamps continues in a fascinating paper called The Structure of Meaning. Those 5,410 adjectives that make up words with attitude are only one "component", where a component is a set of words that all link to each other through the WordNet synsets. It turns out that there is one giant adjectival component, one giant noun component, and one giant verb component. These three giants are much bigger than any other component. Zeroing in on the Words with Attitude component, these words can be divided very evenly into words with positive and negative connotations, allowing an article to be automatically rated as giving a positive or negative spin to its subject matter. We could do this for our English voh articles. Someday anyway. And don't miss Kamps' wonderful jiggling display of "good" and "bad".
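As a rough sketch of how an article might be rated once the words are sorted into positive and negative, one could simply count hits on each side. This is not the measure from Kamps' papers, only a minimal word-counting stand-in. The two tiny word lists and the function name `spin` are assumptions made up for the example.

```python
import re

# Hypothetical, hand-picked word lists; a real system would derive
# these from the WordNet "words with attitude" component.
POSITIVE = {"good", "lovely", "strong", "wonderful"}
NEGATIVE = {"bad", "weak", "poor", "awful"}

def spin(text):
    """Return a score in [-1, 1]: +1 all positive, -1 all negative,
    0.0 when the text is balanced or contains no listed words."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(1 for w in words if w in POSITIVE)
    neg = sum(1 for w in words if w in NEGATIVE)
    if pos + neg == 0:
        return 0.0
    return (pos - neg) / (pos + neg)

print(spin("A lovely result and a good paper."))
print(spin("A weak argument, and a bad one at that."))
```

Plain counting ignores negation ("not good") and context, so it is at best a first pass.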

At some point, we may want to use WordNet, and the wordnets for other languages that are springing up, to add some sophistication to our voice of humanity search methods. Even now, we may want to use it as an English "stemmer". WordNet does have the capability to stem all its vocabulary, so you can feed it "steamrolled" and it will return "steamroll". Most likely, we will not use stemming in the initial voh implementation. Somewhere I read that it does not really help that much, plus we would need a different stemmer for each language.
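A dictionary-backed stemmer of this kind can be sketched as "strip a suffix, then check whether what is left is a known word". The sketch below is in that spirit only. The suffix rules are a small hand-picked subset, the vocabulary is a stand-in for the real dictionary, and `VOCABULARY`, `SUFFIX_RULES` and `base_form` are hypothetical names, not WordNet's own interface.

```python
# Hypothetical stand-in for the dictionary's word list.
VOCABULARY = {"steamroll", "love", "search", "study"}

# (suffix to remove, text to put in its place), tried in order.
# An illustrative subset, not a complete set of English rules.
SUFFIX_RULES = [
    ("ies", "y"),
    ("ed", "e"), ("ed", ""),
    ("ing", "e"), ("ing", ""),
    ("es", ""), ("s", ""),
]

def base_form(word, vocabulary=VOCABULARY):
    """Return the dictionary form of `word`, or None if none is found."""
    word = word.lower()
    if word in vocabulary:
        return word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            # Only accept a candidate the dictionary actually knows.
            if candidate in vocabulary:
                return candidate
    return None

print(base_form("steamrolled"))
```

Checking every candidate against the dictionary is what keeps such a stemmer from inventing non-words, and it is also why each language would need its own rules and word list.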

Here's a good overview of the statistical keyword applications from Prof Belew of UC San Diego. The big revelation here is the Inverse Document Frequency article. Idf is an easily applied keyword weight that much improves document retrieval via keyword. The theory is that keywords that occur with the highest and lowest frequency are not useful, so we throw them out. Among those in the middle, the keywords that work best are those that bunch up into fewer articles for the same number of occurrences overall. Idf measures the bunching-up-ness. For instance, a paper by Kenneth Church and William Gale of Bell Labs, Inverse Document Frequency (IDF): A Measure of Deviations from Poisson, compares the words "boycott" and "somewhat". Both words occur about the same number of times in a set of Associated Press articles (the "corpus" as they say), but "boycott" occurs in 676 of the corpus articles, while "somewhat" occurs in 979. Our intuition tells us that "boycott" is a better keyword than "somewhat", and here we have a way to capture that intuition for automated use.

Another article, this time from Kishore Papineni of IBM's Watson Research Center, gives a theoretical argument that idf is the best such keyword weight. Kevin Prey, James C. French and Allison L. Powell, a team out of the University of Virginia, along with Charles Viles from UNC Chapel Hill, have shown that idf applies well even to very large corpora, such as subsets of the www. This one find makes the academic overview worthwhile.
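To see the bunching-up-ness in numbers, here is a minimal sketch using one common form of the weight, idf = log2(N / df), where N is the number of documents and df is the number of documents that contain the word. The five-document corpus is made up for the example (it is not the Associated Press data), and the function names are hypothetical.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration. "boycott" and "somewhat"
# each occur three times overall, but "boycott" is bunched into one
# document while "somewhat" is spread over three.
DOCS = [
    "boycott boycott boycott vote",
    "somewhat late vote",
    "somewhat early vote",
    "somewhat quiet vote",
    "rain again",
]

def total_count(term, docs):
    """Number of occurrences of `term` across the whole corpus."""
    return sum(doc.split().count(term) for doc in docs)

def document_frequency(docs):
    """Map each word to the number of documents it appears in."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))  # count each word once per document
    return df

def idf(term, docs):
    """Inverse document frequency: log2(N / df)."""
    df = document_frequency(docs)[term]
    if df == 0:
        raise ValueError(f"{term!r} does not occur in the corpus")
    return math.log2(len(docs) / df)

for word in ("boycott", "somewhat", "vote"):
    print(word, total_count(word, DOCS), round(idf(word, DOCS), 3))
```

The word that bunches up ("boycott") gets the higher weight even though both words are equally frequent overall, and a word that turns up in nearly every document ("vote") scores lowest.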

Something else worthwhile that turned up was a free fast tool for distinguishing the language of a document on the fly.

If the reader knows of other such gems, please add a comment at the bottom of this article or email rogereaton@earthlink.net.






