Visualizing data using tag clouds

2010-04-18

A tag cloud is a layered plot of words in which font size is proportional to each word’s frequency. The challenge is to arrange the elements in a coherent and elegant layout. Tag clouds are now available in many SDKs, including the Google Visualization API (see WordCloud). However, few match the quality of online generators such as Many Eyes or Wordle.

Here are some examples I built myself, after trying many layouts and fonts.




Recently, I came across the website of Yihui Xie (who created the R animation package) and his wonderful tag cloud in Flash: j.mp/9WqvUd.

Here are the solutions I tested within R. The first uses the cloud() function from the snippets package (available on R-Forge).

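library(snippets)
# Read the text as a vector of lower-cased words
txt <- tolower(scan("all_cut.txt", what = "character"))
# Strip periods, parentheses, slashes, and digits
txt <- gsub("([.]|[()/]|[0-9]*)", "", txt)
# Keep words longer than three characters
txt <- txt[nchar(txt) > 3]
wt <- table(txt)
# Keep words occurring more than 8 times, on a log scale
wt <- log(wt[wt > 8])
cloud(wt, col = col.br(wt, fit = TRUE))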
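
To make the idea of frequency-driven font sizes concrete without the input file, here is a minimal base-R sketch. The inline word vector and the maximum size of 3 are assumptions for illustration only; this is not how cloud() computes its sizes internally.

```r
# Hypothetical input: a few inline words instead of all_cut.txt
words <- c("data", "cloud", "data", "word", "cloud", "data", "layout")

# Count occurrences, most frequent first
freq <- sort(table(words), decreasing = TRUE)

# Character expansion proportional to the count; the most frequent
# word gets cex = 3 (an arbitrary choice for this sketch)
cex <- 3 * as.numeric(freq) / max(freq)
names(cex) <- names(freq)
```

The resulting vector could be passed as the cex argument of text() to draw each word at its own size. Taking log() of the counts first, as in the code above, compresses the range when a few words dominate.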

Then I tried to arrange the (x,y) layout, by randomly assigning words to distinct spatial locations.

library(ggplot2)
# Random (x, y) position for each word, plus its log-frequency
xy <- as.data.frame(cbind(replicate(2, runif(length(wt))), as.numeric(wt)))
dimnames(xy) <- list(names(wt), c("x", "y", "freq"))
theme_set(theme_bw())
p <- ggplot(xy, aes(x = x, y = y, label = rownames(xy), col = exp(freq)))
# opts(), theme_blank(), theme_segment() and theme_text() belong to the
# ggplot2 API of the time; releases from 0.9.2 on use theme() and element_*()
p + geom_text(family = "Fontin") + xlim(c(.2, 1.2)) + ylim(c(.2, 1.2)) +
  labs(x = NULL, y = NULL) + opts(panel.border = theme_blank(),
    panel.grid.major = theme_blank(), axis.ticks = theme_segment(size = 0),
    axis.text.x = theme_text(size = 0), axis.text.y = theme_text(size = 0),
    legend.position = "none")

I also tried a 3D layout, where words lie on a sphere.

Graphics and animation rely on the rgl package. The color palette reflects each word’s frequency, and the (x,y,z) coordinates are computed very crudely using this function:

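# Random (x, y, z) coordinates on a sphere, one point per element of char.
# `radius` is not an argument: it must be defined in the calling environment.
set.coord <- function(char) {
    n <- length(char)
    x <- y <- z <- numeric(n)
    for (i in seq_len(n)) {
        alpha <- runif(1, 0, 2*pi)
        beta  <- runif(1, 0, 2*pi)
        x[i] <- radius*(-cos(alpha))*sin(beta)
        y[i] <- radius*cos(beta)
        z[i] <- radius*sin(alpha)*sin(beta)
    }
    return(list(x=x,y=y,z=z))
}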
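
Drawing both angles uniformly, as above, tends to crowd points near the two poles of the y axis. As a possible refinement, here is a vectorized sketch that spreads points uniformly over the sphere by drawing the cosine of the polar angle uniformly instead. The function name sphere_coord and its radius argument are hypothetical and not part of the original code.

```r
# Sketch: n points uniformly distributed on a sphere of the given radius.
# `sphere_coord` and its arguments are hypothetical names.
sphere_coord <- function(n, radius = 1) {
  theta <- runif(n, 0, 2 * pi)   # longitude, uniform on [0, 2*pi)
  u     <- runif(n, -1, 1)       # cosine of the polar angle, uniform on [-1, 1]
  s     <- sqrt(1 - u^2)         # sine of the polar angle
  list(x = radius * s * cos(theta),
       y = radius * s * sin(theta),
       z = radius * u)
}

set.seed(1)
xyz <- sphere_coord(5, radius = 2)
```

Every returned point lies at distance radius from the origin, so the result can be used in place of the output of set.coord() when placing labels with rgl.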

The algorithm originally used at www.wordle.net is described by its author in a reply to a post on Stack Overflow. Basically, it is implemented with the Java2D API as follows:

  1. Count the words, throw away boring words, and sort by the count, descending.
  2. Keep the top N words for some N. Assign each word a font size proportional to its count.
  3. Generate a Java2D Shape for each word, using the Java2D API.
  4. In decreasing order of frequency, do this for each word:
    place the word where it wants to be
    while it intersects any of the previously placed words
        move it one step along an ever-increasing spiral
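
The placement loop of step 4 can be sketched in R as follows. This is a simplified illustration, not Wordle's implementation: it tests overlap on axis-aligned bounding boxes rather than Java2D glyph shapes, starts every word at the origin, and uses an arbitrary spiral step. All function names and constants are hypothetical.

```r
# Sketch of greedy spiral placement using axis-aligned bounding boxes.
# All names and constants here are hypothetical.

# TRUE if two rectangles, given as centre (x, y) and size (w, h), overlap
overlaps <- function(a, b) {
  abs(a$x - b$x) < (a$w + b$w) / 2 &&
    abs(a$y - b$y) < (a$h + b$h) / 2
}

# w, h: box widths and heights, already sorted by decreasing frequency
place_words <- function(w, h) {
  placed <- vector("list", length(w))
  for (i in seq_along(w)) {
    ang <- 0
    repeat {
      # Archimedean spiral: the radius grows linearly with the angle
      box <- list(x = 0.05 * ang * cos(ang), y = 0.05 * ang * sin(ang),
                  w = w[i], h = h[i])
      hit <- FALSE
      for (j in seq_len(i - 1)) {
        if (overlaps(box, placed[[j]])) { hit <- TRUE; break }
      }
      if (!hit) break          # free spot found
      ang <- ang + 0.1         # otherwise move one step along the spiral
    }
    placed[[i]] <- box
  }
  do.call(rbind, lapply(placed, as.data.frame))
}

layout <- place_words(w = c(4, 3, 2, 2), h = c(2, 1.5, 1, 1))
```

The loop always terminates because the spiral radius grows without bound while the boxes already placed occupy a finite area. The returned data frame holds one row per word, with its centre and size.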
    

Here is a shorter, lighter implementation using NodeBox.

Finally, I realized that there is a huge amount of discussion on how best to represent tags or, more generally, how tagging information can be used to display useful information about web traffic or text content; see this post on www.smashingmagazine.com. In the same vein, such an approach may be used to reproduce Ishihara’s plates; see Ishihara color test.
