levels of specificity which is often based on a blogger’s expertise
or requirements; a post may be tagged as “car” (basic) but an
enthusiast might find “Volkswagen MKIV 2001 Golf” more
appropriate.
Another factor that complicates current blog tagging systems is
the lack of clear functional pressure to make tagging consistent,
stable, and complete for use in applications dealing with
collaboration/community, clustering, and search. Some authors
tag their posts to make them visible to a larger community, using
general categorical descriptions such as “politics” and
“shopping”. Some bloggers use tags to organize their posts for
their own consumption and interpretation, using non-content
descriptive tags such as “random” or “toRead”. Others may
choose to use very specific or niche tags that are highly
descriptive of the content of the post (e.g., “why I hate best buy”,
“instructions for cooking grandma’s apple pie”).
Figure 1: The distribution of tags and their frequencies
The lack of a shared or controlled vocabulary has resulted in the
explosion of unique tags in the blogosphere. At last glance,
Technorati [16], one of the leading blog aggregators, was tracking
over 60 million blogs and nearly 11.5 million tags. A sample of
English blog data provided by Technorati from a 16-day period in
late 2006 shows nearly 403,000 unique tags with a mean
frequency of 343.1, median of 8, and mode of 1. The most
frequently occurring tag is “Weblog” with 6,695,762 occurrences.
Nearly 22% (88,212) of tags in the system occurred only once, and
only 5.7% of the tags occurred more frequently than the average
frequency (343.1). The distribution of tags and their
frequencies in the sample is illustrated in Figure 1. Because of the
size of the dataset, we ordered the tags by frequency and sampled
every 15th tag (presenting the data on a logarithmic scale). The
data show that a small percentage of all existing tags are actually
reused by bloggers. The data also show that there is a very large
portion of existing tags that are used rarely, making up a
significant “long tail” [17] distribution. In practice, extremely
rare tags (those assigned to very few posts), often
referred to as “meta-noise”, are unlikely to be used for retrieval
due to their idiosyncratic nature. For example, consider the
likelihood of a user utilizing the tags shown in Table 2, a small
random sample of tags that only occur once in our data sample, in
order to browse or search for content.
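
The summary statistics quoted above (mean, median, mode, share of tags used once, share above the mean) and the every-15th-tag sampling can be reproduced from any tag-to-count table. The following is a minimal sketch on toy data, not the Technorati sample; the function names `tag_frequency_summary` and `sample_ranked` and the toy counts are illustrative assumptions.

```python
from collections import Counter
from statistics import mean, median, multimode


def tag_frequency_summary(tag_counts):
    """Summarize a mapping of tag -> number of occurrences."""
    freqs = list(tag_counts.values())
    avg = mean(freqs)
    singletons = sum(1 for f in freqs if f == 1)   # tags used exactly once
    above_avg = sum(1 for f in freqs if f > avg)   # tags used more than the mean
    return {
        "unique_tags": len(freqs),
        "mean": avg,
        "median": median(freqs),
        "mode": multimode(freqs)[0],
        "singleton_share": singletons / len(freqs),
        "above_mean_share": above_avg / len(freqs),
    }


def sample_ranked(tag_counts, step=15):
    """Order tags by descending frequency and keep every `step`-th one."""
    ranked = sorted(tag_counts.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[::step]


# Toy counts (hypothetical), shaped like a long tail: one dominant tag,
# a few moderately used tags, and several tags that occur once.
toy = Counter({
    "weblog": 50, "politics": 9, "shopping": 4,
    "toRead": 1, "eyelash perming": 1, "marriage age": 1,
})
summary = tag_frequency_summary(toy)
sampled = sample_ranked(toy, step=2)
```

Even in this toy table the mean is pulled far above the median by a single dominant tag, which is the same skew the Technorati figures show; plotting the sampled frequencies on a logarithmic axis is what makes such a distribution readable.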
Content creator-owned tagging systems (those without a
collaborative component) especially suffer from inconsistent and
idiosyncratic tagging. It is not that people are uninterested in
tagging, as they often do tag their posts, but when given no insight
into how other bloggers tag, the task of tagging becomes difficult
and the results are less than useful for retrieval and browsing.
For this reason, systems need to be built that can suggest
appropriate tags for content. The goal of such a system is that
users can see how other users are tagging content and choose to
leverage the shared vocabulary or create new tags when
necessary. The overall result would be much more useful
tagging systems without undercutting the prospect or the power of
folksonomies.
Given the state of this tag space, we aimed to build a system that
would assist users in tagging their own blog posts by providing
them with a set of relevant tags. The approach we take is that of a
mediated suggestion system. That is, the tagging system does not
apply the suggested tags automatically; rather, it suggests a set of
appropriate tags and allows the user to select from that set the tags
they find appropriate. The selected tags are then applied to the
post and incorporated into a larger corpus of post-to-tag
associations. The system also provides a text box where users can
add additional tags not suggested by the system, allowing new
and emerging tags to be introduced and utilized in the system.
This approach is appealing because it leverages large-scale data
processing while having the suggestions checked manually with
minimal human intervention. This type of approach also
fundamentally changes the tagging process from generation to
recognition -- requiring less cognitive effort and time [18][19].
In addition, by providing consistent suggestions to users, we
provide the opportunity for the tag space to organically converge
on a consensus for tag selection. Such a convergence would help
alleviate the issues of synonymy and level variation as users
would have an indication as to the types of tags that other
bloggers are using to describe similar content. Convergence
would also help increase recall by reducing the number of
idiosyncratic tags, reducing the meta-noise in the tag distribution.
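
The mediated workflow described above (suggest, let the user select, accept free-text additions, then fold the result back into the corpus) can be sketched as follows. This is a minimal illustration and not the system's implementation: the class name `MediatedTagger`, its methods, and the word-overlap ranking used to produce suggestions are all assumptions made for the example.

```python
from collections import Counter, defaultdict


class MediatedTagger:
    """Suggests tags but applies only what the user confirms (hypothetical)."""

    def __init__(self):
        self.post_tags = {}                          # post_id -> set of tags
        self.word_tag_counts = defaultdict(Counter)  # word -> tag counts

    def suggest(self, text, k=5):
        """Rank tags previously applied to posts sharing words with `text`.

        The ranking is a stand-in assumption; any suggestion method fits here.
        """
        scores = Counter()
        for word in set(text.lower().split()):
            scores.update(self.word_tag_counts.get(word, {}))
        return [tag for tag, _ in scores.most_common(k)]

    def apply(self, post_id, text, suggested, selected, new_tags=()):
        """Apply only user-selected suggestions plus any user-typed tags."""
        chosen = {t for t in selected if t in suggested}
        chosen.update(t.strip() for t in new_tags if t.strip())
        self.post_tags[post_id] = chosen
        # Fold the confirmed post-to-tag associations back into the corpus.
        for word in set(text.lower().split()):
            for tag in chosen:
                self.word_tag_counts[word][tag] += 1
        return chosen


tagger = MediatedTagger()
# Seed the corpus with two posts tagged entirely by hand.
tagger.apply("p1", "my volkswagen golf repair", [], [], new_tags=["car"])
tagger.apply("p2", "golf tournament results", [], [], new_tags=["sports"])

text = "new golf tires for the volkswagen"
suggestions = tagger.suggest(text)
# The user accepts one suggestion, rejects the other, and types a new tag.
applied = tagger.apply("p3", text, suggestions, selected=["car"], new_tags=["tires"])
```

The point of the design is that a rejected suggestion ("sports" here) never enters the corpus, while a user-typed tag ("tires") becomes available as a suggestion for later posts.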
offshoreman 'the people', no way!!.., black colored lilies, manila
pride, circuit asia, console customization, eyelash perming,
shadow watcher, miss yah, all female horror, scripture snippets,
insomnia due to quail wailing, streetball china, marriage age,
Wresler, could this possibly be a poem..., Coritsol, goodbye
highbury, 1.2 glossary of terms
Table 2: A sample of tags that occur only once in the blog post
corpus
2. Related Work
Several systems have been developed that automatically
generate appropriate tags for a given blog post.
The first of these systems, named TagIt [1], uses Naïve Bayes text
classification methods to determine the appropriate tag to apply to
a new post. While the results of the system were promising, it
was not proven to scale beyond the 330 tags in the training set or
evaluated against blog posts with multiple tags.
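
For readers unfamiliar with the technique, single-tag Naïve Bayes text classification can be sketched as below. This is a generic multinomial Naïve Bayes with add-one smoothing on toy data, written for illustration; it is not TagIt's implementation, and the function names `train_nb` and `predict_tag` and the training posts are assumptions.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_posts):
    """Collect per-tag document and word counts from (text, tag) pairs."""
    tag_docs = Counter()              # tag -> number of training posts
    tag_words = defaultdict(Counter)  # tag -> word counts
    vocab = set()
    for text, tag in labeled_posts:
        words = text.lower().split()
        tag_docs[tag] += 1
        tag_words[tag].update(words)
        vocab.update(words)
    return tag_docs, tag_words, vocab


def predict_tag(model, text):
    """Return the single tag with the highest log posterior score."""
    tag_docs, tag_words, vocab = model
    total_docs = sum(tag_docs.values())
    best_tag, best_score = None, -math.inf
    for tag, n_docs in tag_docs.items():
        total_words = sum(tag_words[tag].values())
        score = math.log(n_docs / total_docs)  # log prior
        for word in text.lower().split():
            if word in vocab:  # words unseen in training are ignored
                # Add-one (Laplace) smoothed word likelihood.
                score += math.log(
                    (tag_words[tag][word] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag


# Hypothetical training posts, one tag each.
model = train_nb([
    ("election vote senate", "politics"),
    ("vote campaign senate debate", "politics"),
    ("sale discount store", "shopping"),
    ("store coupon sale price", "shopping"),
])
```

Because the classifier returns exactly one tag per post and needs training examples for every tag it can output, both limitations noted above (a fixed tag set and no multi-tag evaluation) follow naturally from this formulation.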
Brooks et al. [4] developed another system that automatically
tagged blog posts based on the top three terms extracted from the