voice of humanity: Competing with Google    
5 Oct 2003 @ 16:06, by Roger Eaton

This article explores how we might add an intelligent search capability to the global voice of humanity (voh) network. As I was writing the article I kept noticing I was writing "we". "The point is that eventually we can out-google google." And so forth. I hope that ming will apply his usual good sense to the article and then, if I can get past that test, the next articles will A) begin to probe the internet world for possible collaborators, to make the "we" real, and B) begin to establish the APIs for the voh software.

The aim is to facilitate rapid delivery of global database items we most want to see. We say "next" and the system gives us that one item from anywhere in the world that suits us best. Or we say "next 'cluster analysis'" and we get just that item in the "cluster analysis" category that points to PYCluster, because the system knows we are generally interested in Python. "20 next 'cluster analysis'" provides us with an ordered list, google-like.

The importance of this search capability for the larger voh project is twofold. First, it will give users a reason to rate items. The ratings make collective communication possible, and it is the search capability that will be the primary motivator for using ratings. The more items people rate, the better the voh network will know how to feed back information of interest. Second, the search capability will give the network the economic reason it needs to power its growth. Intelligent searches will make the network attractive to both users and targeted ad purchases à la google.

The bet is that we can eventually out-google google, and at this point it does look like a long shot. What we have going for us, though, besides the flexibility and robustness of networking versus centralization, is the valuable information that will be contained in the bottom up hierarchies that structure the voice of humanity. The voh network intends to be the bottom up contender against the top down search bureaus.

There are three kinds of information that the voh will be using in addition to the structure of the bottom up hierarchies and the item ratings. 1) For text items, we will have word counts. 2) We can get the count of participants that rate each item for each piece of participant information, both standard demographic, such as nationality, sex or education level, and user-self-defined, such as "nonviolent" or "pythonic". For instance we can know that 5 men and 3 women rated a particular item. 3) We can get the count of times meta-keywords have been applied by users to each item, such as "accurate" or "technical". For instance, we can know that three participants independently applied the keyword "brilliant" to a particular item. These keywords will be input by persons who are viewing an item.

The magnitude and complexity of the data is a challenge. It is easy to imagine having to work with vectors that are millions of elements long. To bring the scope down to size, we will process only items from a single language together at one time, and will bypass syntactical elements such as particles, conjunctions, prepositions, pronouns and the forms of "to be". Also, to begin with we will ignore the many obvious problems, such as the semantic difference between "cool" and "cool" and the semantic identity of "UN" and "United Nations". Likewise we will ignore all the good ideas, such as using average word length to characterize items or building in Semantic Web elements from the beginning. In other words, we keep it as simple as possible for now with a definite intention of adding refinements once we have a working model that proves the concept (which will be well before we have a full scale voice of humanity at a global level).
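As a rough illustration of the word counting described above, here is a minimal Python sketch. The stop list is a tiny hypothetical stand-in for a real per-language list of particles, conjunctions, prepositions, pronouns and forms of "to be", and the function name is made up for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list; a real one would be per language.
STOP_WORDS = {
    "a", "an", "the", "and", "or", "but", "of", "in", "on", "to", "for",
    "with", "it", "he", "she", "they", "we", "i", "you",
    "be", "am", "is", "are", "was", "were", "been", "being",
}

def word_counts(text):
    """Count the words of one item, skipping the syntactic stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

counts = word_counts("The UN is cool, and the weather is cool.")
print(counts)
```

Note that this keeps the known problems in place on purpose: both senses of "cool" land in the same count, and "UN" is not merged with "United Nations".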

Here is how it will work. Every voice of humanity category will maintain three vectors for each language represented in the category, one for item words, one for item keywords, and one for user keywords. The item word vector will contain aggregate counts of individual words, and phrases of up to 3 or 4 words in length for all the items in the selected language in the category. The item keyword vector will contain aggregate counts of how many times participants browsing the category have applied each keyword to any of the category items. The user keyword vector will contain aggregate counts of how many times participants that have applied each user keyword to themselves have rated items in the category. Item and user keywords will be of mixed languages.
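One possible shape for these per-category vectors, sketched in Python. The class and method names are hypothetical, phrases are capped at 3 words here, and the language key is assumed to be the language of the item being counted, tagged or rated.

```python
from collections import Counter, defaultdict

def phrases(words, max_len=3):
    """Return every single word and every run of up to max_len consecutive words."""
    return [" ".join(words[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(words) - n + 1)]

def empty_vectors():
    """The three count vectors kept for one language in one category."""
    return {"item_words": Counter(),
            "item_keywords": Counter(),
            "user_keywords": Counter()}

class Category:
    def __init__(self):
        # language -> the three count vectors for that language
        self.vectors = defaultdict(empty_vectors)

    def add_item(self, language, words):
        """Add the words and short phrases of one item to the item word vector."""
        self.vectors[language]["item_words"].update(phrases(words))

    def apply_keyword(self, language, keyword):
        """A participant browsing the category applied a keyword to an item."""
        self.vectors[language]["item_keywords"][keyword] += 1

    def record_rating(self, language, rater_keywords):
        """A participant rated an item; count each keyword the rater applies to themselves."""
        self.vectors[language]["user_keywords"].update(rater_keywords)

category = Category()
category.add_item("en", ["cluster", "analysis", "python"])
category.apply_keyword("en", "technical")
category.record_rating("en", ["pythonic", "nonviolent"])
print(category.vectors["en"]["item_words"])
```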

In the voh hierarchies, each category will send the full vectors up the chain weekly or monthly and count changes daily. Since a category may feed ratings up to more than one super-category, the vectors likewise go up to more than one super-category. When hierarchies remerge, as will often happen, so that two paths come back together, the count vectors might be double counted. To prevent this double counting and for general efficiency, the vectors will be preserved for two or three generations up the hierarchy. I.e., if Santa Monica feeds up to Westside LA then to Greater LA, then to California, then to the U.S.A., the Santa Monica vectors will be kept separately at the Santa Monica level, at the Westside level, and also at the Greater LA level, and possibly at the California level. In the case where Santa Monica also feeds up to the "Los Angeles Basin" level and then to "Greater LA", the Greater LA level will be able to undo the double feed of Santa Monica counts. Double feeds will still get through when the paths merge more than three levels up, but the overall process should be very robust and well able to tolerate the resulting distortion.
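The Santa Monica example can be sketched as follows. This is only one way the bookkeeping might work, under stated assumptions: each category passes up its contributions keyed by the name of the originating category, a contribution stays separate for MAX_GENERATIONS levels (a made-up constant), and older counts are folded into an untraceable total. All names here are hypothetical.

```python
from collections import Counter

MAX_GENERATIONS = 3  # assumption: how many levels up an origin stays separate

def feed_up(children):
    """Merge what child categories send up to one super-category.

    Each child sends (separate, folded). `separate` maps an origin name to
    (generations_travelled, Counter). `folded` is a Counter of older counts
    that can no longer be traced back to an origin.
    """
    separate, folded = {}, Counter()
    for child_separate, child_folded in children:
        folded.update(child_folded)
        for origin, (age, counts) in child_separate.items():
            if age + 1 > MAX_GENERATIONS:
                folded.update(counts)              # too old: double feeds get through here
            elif origin not in separate:
                separate[origin] = (age + 1, counts)
            # else: the same origin arrived by a second path and is skipped
    return separate, folded

def total(separate, folded):
    """Aggregate counts at this level, each traceable origin counted once."""
    result = Counter(folded)
    for _, counts in separate.values():
        result.update(counts)
    return result

santa_monica = ({"Santa Monica": (0, Counter({"surf": 4}))}, Counter())
westside = feed_up([santa_monica])
basin = feed_up([santa_monica])
greater_la = feed_up([westside, basin])
print(total(*greater_la))
```

Here Greater LA receives the Santa Monica counts twice, once through Westside LA and once through the Los Angeles Basin, but keeps only one copy because both arrive under the same origin name.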

As the count vectors go up the ladder, they are aggregated at each level. The vectors therefore become longer and longer and have heftier counts the higher the level.

At the highest levels, aka "the Top", which is normally the level of humanity, the vectors will be fed back down the chain and replicated in every hub for quicker reference. (There will also be local maxima in the voh hierarchy, which is why we normally say "voh hierarchies", plural.) A technical point, but worth mentioning, is that the highest level vectors will provide a unique stable reference ID for each element being counted. As the vectors go up the chain, each count must at first be attached to the actual word or phrase so that aggregation can be exact, but once the local hub has the reference vector from the top, it can replace the full phrase with the shorter ID in the future.
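A small Python sketch of the ID idea, with hypothetical function names. It assumes the Top hands out integer IDs in order of first appearance and never reassigns one, which is one simple way to keep them stable.

```python
def assign_ids(reference_ids, new_phrases):
    """Give each unseen phrase the next free ID; existing IDs never change."""
    next_id = len(reference_ids)
    for phrase in new_phrases:
        if phrase not in reference_ids:
            reference_ids[phrase] = next_id
            next_id += 1
    return reference_ids

def compress(counts, reference_ids):
    """Replace known phrases by their ID; phrases the Top has not seen stay as text."""
    return {reference_ids.get(phrase, phrase): n for phrase, n in counts.items()}

# At the Top: IDs are assigned as phrases first arrive.
ids = assign_ids({}, ["cluster analysis", "python"])
ids = assign_ids(ids, ["python", "ugarit"])

# At a local hub: later count changes can go up with IDs instead of full phrases.
update = compress({"python": 5, "brand new phrase": 1}, ids)
print(update)
```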

With the reference vectors available, one for each category, the next step is to build the request vector. There are a number of different kinds of requests that make sense. First, the category moderator might want to locate new items in the world database that the category participants would appreciate. Second, a particular category participant might want to find more items in the category but tailored more to the participant's own sense of what is important. Third, someone might want to request items à la google, by a particular set of words and phrases. Fourth, someone might want a result according to a specified demographic for a particular set of words and phrases. (This last possibility answers one of ming's ideas in his "Overlapping Categories" comment on the Handling Collective Messages article from September 4, 2003.)

For the first two types of request, the idea is to build the request vector from the word and keyword counts of highly rated items and then to drill down the reference vector levels to the one most applicable category from the world database, and from that category to pull the highest rated new items. Clearly we will want to explore multiple applicable categories, but this is one of those enhancements that will be left for later.
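Building the request vector from highly rated items might look like the sketch below. The item layout, the rating scale and the `min_rating` threshold are all assumptions made for the example.

```python
from collections import Counter

def request_vector(items, min_rating=4.0):
    """Sum the count vectors of the items rated at least min_rating."""
    request = Counter()
    for item in items:
        if item["rating"] >= min_rating:
            request.update(item["counts"])
    return request

# Hypothetical items, each with an average rating and its word/keyword counts.
items = [
    {"rating": 4.5, "counts": {"iraq": 2, "family": 1}},
    {"rating": 2.0, "counts": {"spam": 9}},
    {"rating": 5.0, "counts": {"family": 3}},
]
request = request_vector(items)
print(request)
```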

Take the case of the "American Military Wives with Men in Iraq" category moderator who wants to find more items for her users. There may be a few hundred participants, mostly military wives. The request then will build a vector from just those items in the category which were highly rated by the participants, and the moderator will be able to add weight to the user-keyword component so as to stress the female, American and military factors. The local hub then compares this vector against all the reference vectors supplied from the "Top" to select the Top sub- or sub-sub-category with the vector that is closest to the request vector by a simple metric, such as the "city block" metric. Depending on how big the voh network has grown, it may be necessary then to have recourse to the selected sub-category for a comparison of the request vector with its sub-categories. This secondary request slows the process down, but once made, the results can be held for a particular request until an expiry date -- say several weeks down the line -- so the same request will go faster the next time. Finally, once the target categories have been located, the request vector is sent to each of them with instructions to return the highest rated new items that best match the request vector. Standing requests with expiry dates will make sense in this context.
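The "city block" comparison is simple enough to sketch in a few lines of Python. One assumption is added that the article does not make: counts are scaled to shares before comparing, because the higher-level reference vectors have much heftier counts than a request vector built from a few hundred ratings. The category names and counts are invented.

```python
def city_block(a, b):
    """City block (L1) distance between two sparse count vectors."""
    return sum(abs(a.get(k, 0) - b.get(k, 0)) for k in set(a) | set(b))

def normalise(vector):
    """Scale counts to shares so that big and small vectors are comparable (an assumption)."""
    total = sum(vector.values())
    return {k: v / total for k, v in vector.items()} if total else {}

def closest_category(request, reference_vectors):
    """Name of the reference vector nearest to the request vector."""
    req = normalise(request)
    return min(reference_vectors,
               key=lambda name: city_block(req, normalise(reference_vectors[name])))

reference_vectors = {
    "military families": {"deployment": 30, "family": 50, "iraq": 20},
    "python": {"cluster": 40, "python": 60},
}
request = {"iraq": 3, "family": 5, "deployment": 2}
print(closest_category(request, reference_vectors))
```

The same function could be reused for the secondary request, by passing the sub-categories of the selected category as `reference_vectors`.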

Similarly, standing requests by individual participants for overnight listings will be relatively easy to service and should bring in the latest items, customized to each user's individual likes and dislikes.

The third and fourth types of request, where the request is for particular information à la google, are more difficult for the bottom up voh network to fulfill. References to "Ugarit", for example, could come up under theology, history, language or sight-seeing amongst other possibilities. The stored reference vectors may have 30 high level categories with counts for "Ugarit", and altogether the references to "Ugarit" may be scattered over several hundred leaf-node categories. To collect these references will take some time and be something of a processing burden on the voh network.

As a basis for implementing the google-like type of request, an intelligent spidering service needs to be built into the voh software. A category moderator will want to have a spider search nearby categories (as defined by vector distances) in the voh network for web links and then follow those links, returning items that are within a moderator controlled vector distance of the highly rated category items. Once web items are in the voh database, they will be rated, some of them anyway, thus rejiggering the vectors that self-define the categories. And so forth around and around.

Until the voh network has expanded its categories to cover the entire web, the specific search request will not be competitive with google. And if it is not competitive, then it won't be used. People go to the search engine that works -- of course. So it is fortunate that the category moderators will want to use a spidering service for their own ends, not even thinking of an overall search capability, but only of keeping their own areas up to date.

Google came on so strong and fast because its "page rank" formula produced better hits than the other search engines were providing. The idea of page rank is that pages that are linked to by a lot of pages are more valuable in general than those that are linked to by few. The voh implementation of specific requests should likewise use page rank, and in addition, should use the ratings to order the list. The page rank algorithm requires multiple calculations because the formula is self-referential. Being pointed to by a high rank page counts more than being pointed to by a low-rank page, so each run over the entire database of links readjusts the page rank until after some dozen or so runs, the readjustments are too small to be worth further refinement. Likewise, ratings need to be weighted by the average rank of the rater, which itself is determined at least in part by ratings received by that rater's contributions. ("Rater rank" is an idea that will need some refinement. Just as google's system is hacked by link farms, so voh will be vulnerable to rater-farms. Best may be to assign rater rank on the three factors that cause ideas to propagate: mavens, connectors and salesmen, which for voh translates to content provided, links provided and ratings provided.) As a bottom up system, the voh will do the calculations at a level near the bottom rather than at the top. This will make the computing burden affordable and should still work well, because each category will contain only related material.
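To make the self-referential calculation concrete, here is a small Python sketch of the commonly published form of the page rank iteration, run over the links of a single category. The damping factor of 0.85, the cap on runs and the stopping tolerance are assumptions, not anything decided for voh, and `weighted_rating` is just one simple reading of "ratings weighted by the rank of the rater".

```python
def page_rank(links, damping=0.85, runs=100, tolerance=1e-6):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(runs):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            # A page with no outgoing links spreads its rank over all pages.
            targets = links.get(page) or pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tolerance:   # readjustments too small to be worth another run
            break
    return rank

def weighted_rating(ratings):
    """Average of (rating, rater_rank) pairs, weighted by rater rank."""
    total_weight = sum(weight for _, weight in ratings)
    if not total_weight:
        return 0.0
    return sum(rating * weight for rating, weight in ratings) / total_weight

ranks = page_rank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(sorted(ranks, key=ranks.get, reverse=True))
print(weighted_rating([(5, 3.0), (1, 1.0)]))
```

Because each category holds only related material, the `links` dictionary stays small, which is what keeps the repeated runs affordable near the bottom of the hierarchy.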

We are still left with the problem of speed for the specific request, and even if it were fast, it would not cover the field at first. The best approach is to implement specific search capability at the local category level first and then gradually build it up as we gain experience. Locally it should be fast and work even better than google because we have the ratings as well as the links to establish page rank, except that google's new tool bar already has happy-face/sad-face rating buttons on it.

Clearly this is a rough draft of an idea. Still, it really does look doable, and as more people come in with advice and help, the design will only improve. Do we need a bottom up alternative to google? You bet!




5 comments

5 Oct 2003 @ 19:19 by vaxen : Hmmm...
http://www.searchlores.org/


7 Oct 2003 @ 06:07 by ming : meta-data machine
Wow, Roger, you take my breath away. I'm still not sure whether you're crazy, or you really have got something that is likely to happen. Certainly some well thought-out smart stuff here.

Let me get clear, you're talking here essentially about rating and categorizing stuff that already is on the web, aren't you? I'm not sure. If it is stuff that needs to be entered in this new system, that probably kills the idea right there, as it would be a while before results would be apparent, so there's no good selling point for doing so. But if it can be super-imposed as meta-data on top of the existing web, that's a different matter. Then one of the critical paths is merely that it needs to pretty quickly be useful to at least the people doing the rating and categorizing, and it can grow from there.

So, then we're talking about a rating/categorization aggregation and distribution system. A little like how usenet aggregates channels and data items from many different input points, presented as a whole, except that any server can decide to carry only a certain set of channels, and a user can decide to only subscribe to certain channels. And the data might be cached more or less along the way, depending on how popular it is.

One thought is that the entry point for making some of this happen is not so much to figure out the right algorithm for aggregating and distributing it, as it is to figure out a way of representing the more atomic meta-data, the ratings and categorizations, and a way of either making them available for a pull by some spider, or of pushing them somewhere else.

That piece would be the catalyst for other people solving other parts of the puzzle. Maybe you. And maybe in the way you outline here. Or maybe somebody else figures out other ways of aggregating it in more useful or efficient ways. But I suggest not tying it all together up front, as it would be hard for anybody to help. I suggest this order:

1. Work out a structure for the meta-data we're talking about. What fields? Is it XML? Controlled by a schema? By RDF? Tinker around with it until there's both a precise structure there, and an open framework for how it can be expanded in the future.

2. Work out a distribution mechanism. Just a way of transporting it. Over XML-RPC, SOAP, regular HTTP, NNTP, SMTP or what?

3. Work out how it can be summarized by participating aggregation servers and how that scales, so that not all the raw data has to be passed up and down.

4. Work out the algorithms for dynamically making it all available as a search engine of everything.

Not that one can't think of all of it at the same time, but I suggest doing simple things first that people can plug into, even if the bigger problems of aggregation haven't been figured out.

A thing that might make it take off relatively early would be to create modules for the major weblog programs that allow visitors to create this meta-data for the entries in the weblog. Weblog owners are of course motivated to have their articles rated and categorized and made easier to find. And the people who will be most excited about that, the techie webloggers, will also be the most likely people who'll tinker with other things to do with it.



7 Oct 2003 @ 09:32 by mre : Vaxen's link
Thanks to Vaxen for the link to Fravia's Search Lore pages (http://www.searchlores.org/). I poked about some there and found a new search engine that uses clustering technology that is really super -- Vivisimo (http://vivisimo.com). Looking for the algorithm that vivisimo uses, I found this:
Fast and Intuitive Clustering of Web Documents - Zamir, Etzioni, Madani, Karp (1997) (http://citeseer.nj.nec.com/cachedpage/112012/1), or for the pdf, click the pdf link at http://citeseer.nj.nec.com/zamir97fast.html. Anyway, I have bagged the pdf in case it disappears. The voice of humanity network will have the advantage of being already categorized by humans, but providing results in named folders is too brilliant to ignore. Have to remember this angle as we get closer to implementation.



8 Oct 2003 @ 17:09 by mre : re: meta-data machine

ming> not sure whether you're crazy, or you really have got something

This being California, I can absolutely claim to be as sane as the next guy!

Anyway, let's keep pushing this forward, and see where we come out. I like your idea of using the blogosphere as a test case, but I am not sure I understand exactly your vision of connecting with the major weblog programs. I'll take some time to investigate the weblog world and see if I can come up with a sensible plan. If you can fill me in on your thinking along these lines that would be helpful.

As I envision the InterMix voh software, it is middleware connecting on the one hand with preexisting data that is either "out there" on the web in various formats or on disk in various other formats, and on the other hand, plugging into various user interfaces through the web, through email, and I like the idea of using Chandler (http://www.osafoundation.org/our_product_desc.htm) as a user interface. Each different kind of data will be accessed by a specialized module, and each user interface will also be serviced by a specialized module. The voh middleware will have its own data repository, as well, so new items can be created for storage there. In the main, though, items will come from the web, from email lists and newsgroups -- and from rss feeds.

We do have these rss feeds going for us. Is "managing editor" a required value for a valid feed?  



9 Oct 2003 @ 05:48 by ming : Weblogs
Several of the common weblog programs, like Movable Type and Radio, have a plug-in architecture where third parties can add modules to them, to do specialized things. Like, for example, to add categorization, where it didn't exist before. And, even where they don't, people get creative at finding out how to plug something new into them.

A key point is that techie webloggers love to tinker with that kind of stuff. They love looking for a way of adding something new and useful to their blog. New widgets tend to spread like wildfire. When somebody invents the blogroll, or a google interface, and they post the code for using it, the next week thousands of people will be using it.

Weblog owners and readers are an attentive and motivated audience. The owners want the coolest and most useful features in their weblog. The readers want to be able to find the good stuff more easily.

A: The managing editor is not a required field.


