5 Oct 2003 @ 16:06, by Roger Eaton
This article explores how we might add an intelligent search capability to the global voice of humanity (voh) network. As I was writing the article I kept noticing I was writing "we". "The point is that eventually we can out-google google." And so forth. I hope that ming will apply his usual good sense to the article and then, if I can get past that test, the next articles will A) begin to probe the internet world for possible collaborators, to make the "we" real, and B) begin to establish the APIs for the voh software.
The aim is to facilitate rapid delivery of global database items we most want to see. We say "next" and the system gives us that one item from anywhere in the world that suits us best. Or we say "next 'cluster analysis'" and we get just that item in the "cluster analysis" category that points to PYCluster, because the system knows we are generally interested in Python. "20 next 'cluster analysis'" provides us with an ordered list, google-like.
The importance of this search capability for the larger voh project is twofold. First, it will give users a reason to rate items. The ratings make collective communication possible, and it is the search capability that will be the primary motivator for using ratings. The more items people rate, the better the voh network will know how to feed back information of interest. Second, the search capability will give the network the economic reason it needs to power its growth. Intelligent searches will make the network attractive to both users and purchasers of targeted ads, à la google.
The bet is that we can eventually out-google google, so at this point it does look like a long shot. What we have going for us, though, besides the flexibility and robustness of networking versus centralization, is the valuable information that will be contained in the bottom up hierarchies that structure the voice of humanity. The voh network intends to be the bottom up contender against the top down search bureaus.
There are three kinds of information that the voh will be using in addition to the structure of the bottom up hierarchies and the item ratings. 1) For text items, we will have word counts. 2) We can get the count of participants that rate each item for each piece of participant information, both standard demographic, such as nationality, sex or education level, and user-self-defined, such as "nonviolent" or "pythonic". For instance we can know that 5 men and 3 women rated a particular item. 3) We can get the count of times meta-keywords have been applied by users to each item, such as "accurate" or "technical". For instance, we can know that three participants independently applied the keyword "brilliant" to a particular item. These keywords will be input by persons who are viewing an item.
The magnitude and complexity of the data are a challenge. It is easy to imagine having to work with vectors that are millions of elements long. To bring the scope down to size, we will process only items from a single language together at one time, and will bypass syntactical elements such as particles, conjunctions, prepositions, pronouns and the forms of "to be". Also, to begin with we will ignore the many obvious problems, such as the semantic difference between "cool" and "cool" and the semantic identity of "UN" and "United Nations". Likewise we will ignore all the good ideas, such as using average word length to characterize items or building in Semantic Web elements from the beginning. In other words, we keep it as simple as possible for now with a definite intention of adding refinements once we have a working model that proves the concept (which will be well before we have a full scale voice of humanity at a global level).
Here is how it will work. Every voice of humanity category will maintain three vectors for each language represented in the category: one for item words, one for item keywords, and one for user keywords. The item word vector will contain aggregate counts of individual words, and of phrases up to 3 or 4 words long, for all the items in the selected language in the category. The item keyword vector will contain aggregate counts of how many times participants browsing the category have applied each keyword to any of the category items. The user keyword vector will contain aggregate counts of how many times participants who have applied each user keyword to themselves have rated items in the category. Item and user keywords will be of mixed languages.
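To make the item word vector concrete, here is a minimal sketch in Python. The stop list, the function names `item_word_counts` and `category_word_vector`, and the choice to form phrases after the bypassed words have been dropped are all assumptions made for illustration, not settled voh design.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list standing in for the particles,
# conjunctions, prepositions, pronouns and forms of "to be" that get bypassed.
STOP_WORDS = {
    "a", "an", "the", "and", "or", "but", "of", "in", "on", "to", "for",
    "with", "it", "we", "they", "is", "are", "was", "were", "be", "been", "am",
}

def item_word_counts(text, max_phrase_len=3):
    """Count single words and phrases of up to max_phrase_len words in one item."""
    tokens = [t for t in re.findall(r"[a-z0-9']+", text.lower())
              if t not in STOP_WORDS]
    counts = Counter()
    for n in range(1, max_phrase_len + 1):
        for i in range(len(tokens) - n + 1):
            # Assumption: phrases are built from the tokens left after stop words go.
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

def category_word_vector(items, max_phrase_len=3):
    """Aggregate the per-item counts into one item word vector for a category."""
    total = Counter()
    for text in items:
        total.update(item_word_counts(text, max_phrase_len))
    return total

items = ["Cluster analysis in Python", "Cluster analysis is cool"]
vector = category_word_vector(items)
```

The item keyword and user keyword vectors could be kept the same way, as counters keyed by keyword rather than by word or phrase.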
In the voh hierarchies, each category will send the full vectors up the chain weekly or monthly and count changes daily. Since a category may feed ratings up to more than one super-category, the vectors likewise go up to more than one super-category. When hierarchies remerge, as will often happen, so that two paths come back together, the count vectors might be double counted. To prevent this double counting and for general efficiency, the vectors will be preserved for two or three generations up the hierarchy. I.e., if Santa Monica feeds up to Westside LA then to Greater LA, then to California, then to the U.S.A., the Santa Monica vectors will be kept separately at the Santa Monica level, at the Westside level, and also at the Greater LA level, and possibly at the California level. In the case where Santa Monica also feeds up to the "Los Angeles Basin" level and then to "Greater LA", the Greater LA level will be able to undo the double feed of Santa Monica counts. Double feeds will still get through when the paths merge more than three levels up, but the overall process should be very robust and well able to tolerate the resulting distortion.
As the count vectors go up the ladder, they are aggregated at each level. The vectors therefore become longer and longer and have heftier counts the higher the level.
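The double-feed safeguard can be sketched as follows. This is only one possible reading of it: each feed is assumed to be a dictionary from originating category name to its counts plus an "age" in generations, and the names `feed_up`, `total` and `max_generations` are hypothetical. A source that arrives by two paths is kept once, and a source older than the limit is folded into the current level and loses its separate identity.

```python
from collections import Counter

def feed_up(own_id, own_counts, child_feeds, max_generations=3):
    """Build the feed a category sends to its super-categories.

    A feed maps source id -> (counts, age in generations).
    """
    merged = {own_id: (Counter(own_counts), 0)}
    for feed in child_feeds:
        for source, (counts, age) in feed.items():
            if source in merged:
                continue  # same source already arrived by another path
            merged[source] = (counts, age + 1)
    # Sources that have travelled too far are folded into this level's own
    # counts; from here on they can no longer be recognised as duplicates.
    too_old = [s for s, (_, age) in merged.items() if age > max_generations]
    for source in too_old:
        counts, _ = merged.pop(source)
        merged[own_id][0].update(counts)
    return merged

def total(feed):
    """Aggregate count vector for a level: the sum over all sources in its feed."""
    result = Counter()
    for counts, _ in feed.values():
        result.update(counts)
    return result

santa_monica = feed_up("Santa Monica", {"surf": 2}, [])
westside = feed_up("Westside LA", {"traffic": 1}, [santa_monica])
basin = feed_up("Los Angeles Basin", {"smog": 1}, [santa_monica])
greater_la = feed_up("Greater LA", {}, [westside, basin])
greater_la_total = total(greater_la)
```

In this toy run Greater LA receives the Santa Monica counts by two paths but counts "surf" twice, not four times.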
At the highest levels, aka "the Top", which is normally the level of humanity, the vectors will be fed back down the chain and replicated in every hub for quicker reference. (There will also be local maxima in the voh hierarchy, which is why we normally say "voh hierarchies", plural.) A technical point, but worth mentioning, is that the highest level vectors will provide a unique stable reference ID for each element being counted. As the vectors go up the chain, each count must at first be attached to the actual word or phrase so that aggregation can be exact, but once the local hub has the reference vector from the top, it can replace the full phrase with the shorter ID in the future.
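As an illustration of the stable reference IDs, here is a small sketch. It assumes that the Top hands out integer IDs in order of first appearance and never reuses them; the function names `assign_ids`, `encode` and `decode` are hypothetical.

```python
def assign_ids(reference_ids, phrases):
    """Return a copy of reference_ids extended with IDs for unseen phrases.

    Existing IDs are never changed, so they stay stable over time.
    """
    ids = dict(reference_ids)
    for phrase in phrases:
        if phrase not in ids:
            ids[phrase] = len(ids)
    return ids

def encode(counts, reference_ids):
    """Replace each known phrase by its ID; unknown phrases travel verbatim."""
    return {reference_ids.get(phrase, phrase): n for phrase, n in counts.items()}

def decode(encoded, reference_ids):
    """Inverse of encode: map IDs back to the phrases they stand for."""
    phrases = {i: p for p, i in reference_ids.items()}
    return {phrases.get(key, key): n for key, n in encoded.items()}

top_ids = assign_ids({}, ["united nations", "ugarit"])
later_ids = assign_ids(top_ids, ["ugarit", "python"])
local_counts = {"ugarit": 5, "cuneiform tablets": 1}
encoded = encode(local_counts, later_ids)
```

A phrase the Top has not yet seen, such as "cuneiform tablets" here, still goes up the chain spelled out, which matches the rule that counts must at first be attached to the actual word or phrase.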
With the reference vectors available, one for each category, the next step is to build the request vector. There are a number of different kinds of requests that make sense. First, the category moderator might want to locate new items in the world database that the category participants would appreciate. Second, a particular category participant might want to find more items in the category but tailored more to the participant's own sense of what is important. Third, someone might want to request items à la google, by a particular set of words and phrases. Fourth, someone might want a result according to a specified demographic for a particular set of words and phrases. (This last possibility answers one of ming's ideas in his "Overlapping Categories" comment on the Handling Collective Messages article from September 4, 2003.)
For the first two types of request, the idea is to build the request vector from the word and keyword counts of highly rated items and then to drill down the reference vector levels to the one most applicable category from the world database, and from that category to pull the highest rated new items. Clearly we will want to explore multiple applicable categories, but this is one of those enhancements that will be left for later.
Take the case of the "American Military Wives with Men in Iraq" category moderator who wants to find more items for her users. There may be a few hundred participants, mostly military wives. The request then will build a vector from just those items in the category which were highly rated by the participants, and the moderator will be able to add weight to the user-keyword component so as to stress the female, American and military factors. The local hub then compares this vector against all the reference vectors supplied from the "Top" to select the Top sub- or sub-sub-category with the vector that is closest to the request vector by a simple metric, such as the "city block" metric. Depending on how big the voh network has grown, it may be necessary then to have recourse to the selected sub-category for a comparison of the request vector with its sub-categories. This secondary request slows the process down, but once made, the results can be held for a particular request until an expiry date -- say several weeks down the line -- so the same request will go faster the next time. Finally, once the target categories have been located, the request vector is sent to each of them with instructions to return the highest rated new items that best match the request vector. Standing requests with expiry dates will make sense in this context.
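The comparison step can be sketched with the "city block" metric, which is just the sum of the absolute differences between corresponding counts. One assumption is added here that the text does not spell out: since higher-level vectors carry much heftier counts, each vector is first scaled to proportions so that a small request vector and a large reference vector are comparable. The function names and the sample counts are made up for illustration.

```python
def city_block(a, b):
    """Sum of absolute differences over every element present in either vector."""
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0) - b.get(k, 0)) for k in keys)

def normalize(vector):
    """Scale counts to proportions (an assumption, see the note above)."""
    total = sum(vector.values())
    return {k: v / total for k, v in vector.items()} if total else {}

def closest_category(request, reference_vectors):
    """Name of the reference category whose vector is nearest to the request."""
    req = normalize(request)
    return min(
        reference_vectors,
        key=lambda name: city_block(req, normalize(reference_vectors[name])),
    )

request = {"military": 3, "iraq": 2, "family": 1}
reference_vectors = {
    "Military Families": {"military": 300, "iraq": 150, "family": 150},
    "Cooking": {"recipe": 500, "family": 100},
}
best = closest_category(request, reference_vectors)
```

The moderator's extra weight on the user-keyword component could be handled by multiplying those elements of the request vector before the comparison.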
Similarly, standing requests by individual participants for overnight listings will be relatively easy to service and should bring in the latest items, customized to each user's individual likes and dislikes.
The third and fourth types of request, where the request is for particular information à la google, are more difficult for the bottom up voh network to fulfill. References to "Ugarit", for example, could come up under theology, history, language or sight-seeing amongst other possibilities. The stored reference vectors may have 30 high level categories with counts for "Ugarit", and altogether the references to "Ugarit" may be scattered over several hundred leaf-node categories. To collect these references will take some time and be something of a processing burden on the voh network.
As a basis for implementing the google-like type of request, an intelligent spidering service needs to be built into the voh software. A category moderator will want to have a spider search nearby categories (as defined by vector distances) in the voh network for web links and then follow those links, returning items that are within a moderator controlled vector distance of the highly rated category items. Once web items are in the voh database, they will be rated, some of them anyway, thus rejiggering the vectors that self-define the categories. And so forth around and around.
Until the voh network has expanded its categories to cover the entire web, the specific search request will not be competitive with google. And if it is not competitive, then it won't be used. People go to the search engine that works -- of course. So it is fortunate that the category moderators will want to use a spidering service for their own ends, not even thinking of an overall search capability, but only of keeping their own areas up to date.
Google came on so strong and fast because its "page rank" formula produced better hits than the other search engines were providing. The idea of page rank is that pages that are linked to by a lot of pages are more valuable in general than those that are linked to by few. The voh implementation of specific requests should likewise use page rank, and in addition, should use the ratings to order the list. The page rank algorithm requires multiple calculations because the formula is self-referential. Being pointed to by a high rank page counts more than being pointed to by a low-rank page, so each run over the entire database of links readjusts the page rank until after some dozen or so runs, the readjustments are too small to be worth further refinement. Likewise, ratings need to be weighted by the average rank of the rater, which itself is determined at least in part by ratings received by that rater's contributions. ("Rater rank" is an idea that will need some refinement. Just as google's system is hacked by link farms, so voh will be vulnerable to rater-farms. Best may be to assign rater rank on the three factors that cause ideas to propagate: mavens, connectors and salesmen, which for voh translates to content provided, links provided and ratings provided.) As a bottom up system, the voh will do the calculations at a level near the bottom rather than at the top. This will make the computing burden affordable and should still work well, because each category will contain only related material.
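The repeated-run character of page rank can be sketched in a few lines. This follows the commonly published form of the calculation, with a damping factor of 0.85, and spreads the rank of pages that link nowhere evenly over all pages; both choices are assumptions here, as are the names `page_rank`, `tol` and `max_runs`. It is not a claim about how google or voh actually computes it.

```python
def page_rank(links, damping=0.85, tol=1e-6, max_runs=100):
    """Iterate page rank over a dict of page -> list of pages it links to.

    Returns the ranks and the number of runs needed before the total
    readjustment dropped below tol.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    if n == 0:
        return {}, 0
    rank = {p: 1.0 / n for p in pages}
    for run in range(1, max_runs + 1):
        # Rank held by pages without outgoing links is shared out evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            for target in targets:
                # A link from a high-rank page passes on more than one
                # from a low-rank page.
                new[target] += damping * rank[page] / len(targets)
        change = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if change < tol:
            break
    return rank, run

links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks, runs = page_rank(links)
```

A rater rank could be iterated in the same way, with ratings received playing the part of incoming links, though the rater-farm problem would still need its own answer.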
We are still left with the problem of speed for the specific request, and even if it were fast, it would not cover the field at first. The best approach is to implement specific search capability at the local category level first and then gradually build it up as we gain experience. Locally it should be fast and work even better than google because we have the ratings as well as the links to establish page rank, except that google's new tool bar already has happy-face/sad-face rating buttons on it.
Clearly this is a rough draft of an idea. Still, it really does look doable, and as more people come in with advice and help, the design will only improve. Do we need a bottom up alternative to google? You bet!