Scatterplot legend and fill not working in Octave

The following works for me:

n = 100;
x = randn(n, 1);
y = randn(n, 1);
S = rand(n, 1)*20;
hold on
scatter(x(1:50), y(1:50), S(1:50), "red", "filled")
scatter(x(51:100), y(51:100), S(51:100), "green", "filled")
hold off
print('-depsc', 'bubbleplot.eps');

However, I’m not able to add a legend, and I didn’t find any bug report or indication of a missing functionality for this. So, as an alternative, I would suggest adding marker and text to your plot.

How to get the size of tar.gz in (MB) file in python

According to Python on-line docs, you should look at the size attributes of a TarInfo object.

Getting dataset for building association rules with Weka

Apart from the example dataset used in the following class, Association Rule Mining with WEKA, you might want to try the market-basket dataset.

Also, please note that several datasets are listed on Weka website, in the Datasets section, some of them coming from the UCI repository (e.g., the Plants Data Set).

R Using a for() loop to fill one dataframe with another

Not sure I fully understood your question, but you can try this:

df1 <- data.frame(col1=letters[1:26], col2=sample(1:100, 26))
df2 <- with(df1, expand.grid(col1=col1, col2=col1))
df2$col3 <- df1$col2

The last command use recycling (it could be writtent as rep(df1$col2, 26) as well).

The results are shown below:

> head(df1, n=3)
  col1 col2
1    a   68
2    b   73
3    c   45
> tail(df1, n=3)
   col1 col2
24    x   22
25    y    4
26    z   17
> head(df2, n=3)
  col1 col2 col3
1    a    a   68
2    b    a   73
3    c    a   45
> tail(df2, n=3)
    col1 col2 col3
674    x    z   22
675    y    z    4
676    z    z   17

Regarding the approach of displaying a classification/decision tree from the command line?

I’m not sure if you are looking for something very elaborated, but basically decision trees are printed when you use weka.classifiers.trees. There is an illustration with Random Tree in a response of mine , but here is what you would get using the J48 algorithm on the iris dataset:

~/weka/data$ weka weka.classifiers.trees.J48 -t iris.arff  -i

J48 pruned tree

petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
|   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)

Number of Leaves  :     5

Size of the tree :  9

Time taken to build model: 0.08 seconds
Time taken to test model on training data: 0.01 seconds

Trying to Install RVM 1.8.7

If you don’t have rvm on your system, then follow the instructions here: Installing RVM. Basically, you have to download a shell script that will does the job for you (fetch the latest release and install it). Then, just add

[[ -s "$HOME/.rvm/scripts/rvm" ]] && . "$HOME/.rvm/scripts/rvm"

in your .profile, .bash_profile, or whatever you use.

SQL window function in R Language

You can use aggregate and merge if you are familiar with SQL syntax. Taking one of the example from the PostgreSQL manual, we would use

empsalary <- data.frame(depname=rep(c("develop", "personnel", "sales"), c(5, 2, 3)),
                        empno=c(11, 7, 9, 8, 10, 5, 2, 3, 1, 4),
                        salary=c(5200, 4200, 4500, 6000, 5200, 3500, 3900, 4800, 5000, 4800))
merge(empsalary, aggregate(salary ~ depname, empsalary, mean), by="depname")

to reproduce the first example (compute average salary by depname).

     depname empno salary.x salary.y
1    develop    11     5200 5020.000
2    develop     7     4200 5020.000
3    develop     9     4500 5020.000
4    develop     8     6000 5020.000
5    develop    10     5200 5020.000
6  personnel     5     3500 3700.000
7  personnel     2     3900 3700.000
8      sales     3     4800 4866.667
9      sales     1     5000 4866.667
10     sales     4     4800 4866.667

You may probably want to look at what plyr has to offer for more elaborated construction.

p-value matrix of x and y variables from anova output

Here is one solution, which consists in generating all combinations of Y- and X-variables to test (we cannot use combn) and run a linear model in each case:

dfrm <- data.frame(y=gl(ncol(yvars), ncol(xvars), labels=names(yvars)),
                   x=gl(ncol(xvars), 1, labels=names(xvars)), pval=NA)
## little helper function to create formula on the fly
fm <- function(x) as.formula(paste(unlist(x), collapse="~"))
## merge both datasets
full.df <-, xvars)
## apply our LM row-wise
dfrm$pval <- apply(dfrm[,1:2], 1,
                   function(x) anova(lm(fm(x), full.df))$`Pr(>F)`[1])
## arrange everything in a rectangular matrix of p-values
res <- matrix(dfrm$pval, nc=3, dimnames=list(levels(dfrm$x), levels(dfrm$y)))

Sidenote: With high-dimensional datasets, relying on the QR decomposition to compute the p-value of a linear regression is time-consuming. It is easier to compute the matrix of Pearson linear correlation for each pairwise comparisons, and transform the r statistic into a Fisher-Snedecor F using the relation F = νa r2/(1-r2), where degrees of freedom are defined as νa=(n-2)-#{(xi=NA),(yi=NA)} (that is, (n-2) minus the number of pairwise missing values–if there’re no missing values, this formula is the usual coefficient R2 in regression).

R multiple file “split and plot”

Assuming you already have your data frame set up correctly, how about using aggregate (or ddply from the plyr package)? Here is a toy example with one such data frame (you will need to embed this in your loop or write a custom function).

L01_001 <- data.frame(Cases=gl(5, 2, 5*2*2, labels=c("AAA","BBB","CCC","DDD","EEE")),
                      replicate(3, rnorm(5*2*2))) <- with(L01_001, aggregate(L01_001[,-1], list(Cases=Cases), mean))
## opar <- par(mfrow=c(nlevels(L01_001$Cases), 1))
## apply([,-1], 1, function(x) barplot(x))
## par(opar)
barchart(~ X1 + X2 + X3 | Cases,

I would not recommend using bar charts for visualizing your data: they are incredibly bad at showing subtle variation in your data and have a poor data-ink ratio. Cleveland’s dot plot or level plot would do the job, in my opinion. In the later case, you can even represent everything on a single page, which looks like a pretty sound alternative to “100 plot with 26 bar in it.”

applying the pvclust R function to a precomputed dist object

It’s not clear to me whether you only have a distance matrix, or you computed it beforehand. In the former case, as already suggested by @Vincent, it would not be too difficult to tweak the R code of pvclust itself (using fix() or whatever; I provided some hints on [another question on CrossValidated][1]). In the latter case, the authors of [pvclust][2] provide an [example][3] on how to use a custom distance function, although that means you will have to install their “unofficial version”.

Default value in input octave

You can try something like this:

function val = get_response(default="Y")
  val = input("Your choice? [Y]/N ", "s");
  if isempty(val)
    val = default;

I wrapped this into a function but you can use the enclosed code directly, of course.

Gnuplot, CSV, only distinct labels?

Some further suggestions:

set pointsize 2.5
set xtics ("a" 1, "b" 5, "c" 9, "d" 10.5)
set xrange [0:12]
plot "< cat -n aa.csv | sed -e 's/,/  /g'" using 1:3 pt 7 notitle

yields unequally spaced labels in place of x-ticks (as suggested by @choroba, which also provided a way to automate the determination of mid-interval for larger datasets). (Here aa.csv contains data as shown in your question.)

If you want labels to appear in the plotting region rather than under the x-axis, then you can start with

set format x ""

and then place your labels above the highest value found in any of the above categories, a, …, d.

3D parametric graph in Gnuplot

You don’t really need to rely on parametric if you just want to plot raw data from a file.

With sample data xyz.txt that looks like noisy sinusoidal time-series:

1 0 9.43483356296457
1 0.0204081632653061 10.2281885806631
1 0.0408163265306122 10.9377108185805
3 0.959183673469388 10.2733398482972
3 0.979591836734694 10.1662011241681
3 1 10.4628112585751

(First column is an integer values coding x locations, 2nd and 3rd columns are for y and z. I appended the R script I used to generate those data at the end.)

I would simply use

splot 'xyz.txt' using 1:2:3 with impulses

where impulses draws vertical lines from the minimum of z. You can change this bevahior; for example, if you want to start at z=0, you could use

set zrange [-1:12]
set ticslevel 0
zmin = 0
zr(x) = (x==0 ? zmin : x)
splot 'xyz.txt' using 1:2:(zr($3)) with impulses

Note that I used regularly spaced values for y, but that doesn’t really matter.

Other variations are possible, of course. Maybe Fence plots with a some-liner or Wall charts might be of further assistance.

R script used to generate xyz.dat:

n <- 50  # number of observation per sequence
k <- 3   # number of sequences
t <- seq(0, 1, length=n)  # sampling rate
xyz <- data.frame(x=rep(1:k, each=n), y=t, z=sin(2*pi*t*2.1)+.25*rnorm(n*k)+10)
write.table(xyz, file="xyz.txt", row.names=FALSE, col.names=FALSE)

How to get this matrix

If I understand your question correctly, you want to fill a matrix with repeated entries (x,0) and (x,1), where x=1…4, where repetition is determined by values found in column A and B. Given the values you supplied that’s going to be a huge matrix (67,896,086 rows). So, you could try something like this (replace m below, which has less elements for illustrative purpose):

m = [1, 2, 1;
     2, 3, 2;
     3, 2, 1;
     4, 2, 2];
res = [];
for k = 1:4
  res = [res ; [k*ones(m(k, 2), 1), zeros(m(k, 2), 1);
                k*ones(m(k, 3), 1), ones(m(k, 3), 1)]];

which yields

res =

   1   0
   1   0
   1   1
   2   0
   2   0
   2   0
   2   1
   2   1
   3   0
   3   0
   3   1
   4   0
   4   0
   4   1
   4   1

Out of curiosity, is there any reason not to consider a matrix like

1  0  n
1  1  m
2  0  p
2  1  q

where n, m, p, q, are values found in columns A and B. This would probably be easier to handle , no?

Plotting results of several Bland-Altman analysis

I can’t provide you with working R code as you didn’t supply raw data (which are needed for boxplots), and it is not clear what you want to display as nothing indicates where your gold standard comes into play in the given aggregated data (are these repeated measurements with different instruments?), unless the reported means stand for difference between the ith method and the reference method (in which case I don’t see how you could use a boxplot). A basic plot of your data might look like

dfrm <- data.frame(method=LETTERS[1:3], lcl=c(-5,-9,-8),
                   mean=c(4,2,4), ucl=c(15,13,16), var=c(27,33,36))
# I use stripchart to avoid axis relabeling and casting of factor to numeric
# with default plot function
stripchart(mean ~ seq(1,3), data=dfrm, vertical=TRUE, ylim=c(-10,20),
           group.names=levels(dfrm$method), pch=19)
with(dfrm, arrows(1:3, mean-lcl, 1:3, mean+lcl, angle=90, code=3, length=.1))
abline(h=0, lty=2)

However, I can recommend you to take a look at the MethComp package which will specifically help you in comparing several methods to a gold standard, with or without replicates, as well as in displaying results. The companion textbook is

Carstensen, B. Comparing Clinical Measurement Methods. John Wiley & Sons Ltd 2010

Coloring points instead of lines in interaction.plot (R)I

The problem is that interaction.plot actually relies on tapply() to compute aggregated summary measures (mean or median). As interaction.plot just calls matplot, you can use the latter directly if you like, or summarize your data with plyr and plot the results with ggplot2 (more flexible).

# consider the 64x4 dataset OrchardSprays and create a fake
# two-levels factor, say grp, which is in correspondence to rowpos odd/even values
grp <- gl(2, 1, 8, labels=letters[1:2])
# the following won't work (due to color recycling)
     interaction.plot(treatment, rowpos, decrease,
                      type="b", pch=19, lty=1, col=as.numeric(colpos), legend=F))
# what is used to draw the 8 lines is computed from
     tapply(decrease, list(treatment=treatment, rowpos=rowpos), mean))
# the following will work, though
     interaction.plot(treatment, rowpos, decrease,
                      type="b", pch=19, lty=1, col=as.numeric(grp), legend=F))

In short, assuming you can find an adequate mapping between gender and your trace factor (age_bucket), you need to construct a vector of colors of size nlevels(age_bucket) .

creating z-scores for subset of rows

No need for a loop in this case. Use vectorization, instead. Let’s consider the following simulated data: (not sure it will reproduce exactly your dataset, but hopefully you’ll get the general idea)

dfrm <- data.frame(cond=gl(2, 1, 100, labels=LETTERS[1:2]),
                   user=gl(50, 2, labels=paste("id", 1:20, sep="")),
                   sensitivity=runif(100, 1, 5))

Computing z-scores is as simple as

dfrm$z.sensitivity <- scale(dfrm$sensitivity)

If you want z-scores conditional on cond, then you can do either

with(dfrm, tapply(sensitivity, cond, scale))

or, using plyr,

ddply(dfrm, c("cond"), transform, sensitivity.z = scale(sensitivity))

How to subset out some alphabetically arranged columns from a dataframe?

You can try something like this

dfrm <- data.frame(replicate(26, rnorm(10)))
colnames(dfrm) <- paste("COL", LETTERS, sep="_")
which(substr(colnames(dfrm), 5, 6) %in% LETTERS[3:6])

The last expression returns column number that match the letters C to F. See also match, and this related thread: Get column index from label in a data frame.

Which set of python libraries i should learn for AI and Data mining stuff

In addition to the scikit cited by @aix, you might want to take a look at the following libraries:

I’ll really second investigating orange capabilities which is a full-featured application for data mining, but you can call it from external scripts, see e.g. the Beginning with Orange tutorial to get an idea.

Is there any publications about how to use association rules as SVM feature?

Here are some articles that might be (or not) relevant to your question:

  1. Keivan Kianmehr and Reda Alhajj. Effective Classification by Integrating Support Vector Machine and Association Rule Mining. Lecture Notes in Computer Science, 2006, Volume 4224/2006, 920-927
  2. He et al.  Extracting rule from SVM based on association rules, in Rule extraction from support vector machines (Ed. J Diederich), Springer, 2008

You may also want to google articles related to what is known as associative classification.

data.frame with a column containing a matrix in R

The result you got (2 rows x 3 columns) is what is to be expected from R, as it amounts to cbind a vector (id, with recycling) and a matrix (m).

IMO, it would be better to use list or array (when dimensions agree, no mix of numeric and factors values allowed), if you really want to bind different data structures. Otherwise, just cbind your matrix to an existing data.frame if both have the same number of rows will do the job. For example

x1 <- replicate(2, rnorm(10))
x2 <- replicate(2, rnorm(10))
x12l <- list(x1=x1, x2=x2)
x12a <- array(rbind(x1, x2), dim=c(10,2,2))

and the results reads

> str(x12l)
List of 2
 $ x1: num [1:10, 1:2] -0.326 0.552 -0.675 0.214 0.311 ...
 $ x2: num [1:10, 1:2] -0.164 0.709 -0.268 -1.464 0.744 ...
> str(x12a)
 num [1:10, 1:2, 1:2] -0.326 0.552 -0.675 0.214 0.311 ...

Lists are easier to use if you plan to use matrix of varying dimensions, and providing they are organized in the same way (for rows) as an external data.frame you can subset them as easily. Here is an example:

df1 <- data.frame(grp=gl(2, 5, labels=LETTERS[1:2]),
                  age=sample(seq(25,35), 10, rep=T))
with(df1, tapply(x12l$x1[,1], list(grp, age), mean))

You can also use lapply (for list) and apply (for array) functions.

save align.plot()

Can’t you just use grid.arrange() (instead of align.plot()), and use standard png device instead of ggsave() which expect ggplot objects?

dsamp <- diamonds[sample(nrow(diamonds), 1000), ]
p1 <- qplot(carat, price, data=dsamp, colour=clarity)
p2 <- qplot(carat, price, data=dsamp, colour=clarity, geom="path")
grid.arrange(p1, p2) # add ncol=2 to arrange as two-column

What javascript libraries can make a radial convergence graph?

Two top-notch libraries for data visualization are

Which packages make good use of S4 objects?

I would go for kernlab, which additionally includes a lot of C code.

It comes with an handy vignette, detailing some of S4 concepts. (It doesn’t seem to use roxygen for the documentation, though, but this is not the question here.)

What is the best bioinformatics book for a computer scientist?

I would suggest Bioinformatics, from Polanski and Kimmel (Springer, 2007). Very good introductory material on statistics, computer science as applied in bioinformatics. It also covers the necessary grounds in computational biology.

How to plot a violin scatter boxplot (in R)?

Try the vioplot package:


(with awful default color ;-)

There is also wvioplot() in the wvioplot package, for weighted violin plot, and beanplot, which combines violin and rug plots. They are also available through the lattice package, see ?panel.violin.

How to capture R text+image output into one file (html, doc, pdf etc)?

Well, I just remind that I was using Asciidoc for short reporting or editing webpage. Now there’s an R plugin ( ascii on CRAN), which allows to embed R code into an asciidoc document. The syntax is quite similar to Markdown or Textile, so you’ll learn it very fast.

Output are (X)HTML, Docbook, LaTeX, and of course PDF through one of the last two backends.

Unfortunately, I don’t think you can wrap all your code into a single statement. However, it supports a large number of R objects, see below.

> methods(ascii)
 [1] ascii.anova*              ascii.aov*                ascii.aovlist*            ascii.cast_df*
 [5] ascii.character*          ascii.coxph*              ascii.CrossTable**
 [9] ascii.default*            ascii.density*            ascii.describe*           ascii.describe.single*
[13] ascii.factor*             ascii.freqtable*          ascii.ftable*             ascii.glm*
[17] ascii.htest*              ascii.integer*            ascii.list*               ascii.lm*
[21] ascii.matrix*             ascii.meanscomp*          ascii.numeric*            ascii.packageDescription*
[25] ascii.prcomp*             ascii.sessionInfo*        ascii.simple.list*        ascii.smooth.spline*
[29] ascii.summary.aov*        ascii.summary.aovlist*    ascii.summary.glm*        ascii.summary.lm*
[33] ascii.summary.prcomp*     ascii.summary.survfit*    ascii.summary.table*      ascii.survdiff*
[37] ascii.survfit*            ascii.table*              ascii.ts*                 ascii.zoo*

   Non-visible functions are asterisked

Is there a way to access Stata from eclipse?

None that I know. If you’re a CLI junky or willing to use Emacs, you might find limited support through the ESS package and the ado-mode. This is what I used on Mac OS X when I want to run short snippet of code, or use Stata in batch mode, but there’s no interactive graphical output (you can just save graphics as PDF as usual). The ado-mode provides basic syntax highlighting and can send region or buffer to a running instance of Stata GUI program (not the executable file, stata-*, that is being used by ESS ).

Here are two screenshots of (top) edition of code in Emacs with the ado-mode, and (bottom) an interactive Stata session (no plot produced).

Some notes on text editors for Stata users provides a list of text editors that can be used with Stata (without interactive facilities, though).

What data do I need to implement k nearest neighbor?

What you described sounds like a recommender system engine, not a clustering algorithm like k-means which in essence is an unsupervised approach. I cannot make myself a clear idea of what reddit uses actually, but I found some interesting post by googling around “recommender + reddit”, e.g.  Reddit, Stumbleupon, and Hacker News Algorithms Exposed! Anyway, the k-NN algorithm (described in the top ten data mining algorithm, with pseudo-code on Wikipedia) might be used, or other techniques like Collaborative filtering (used by Amazon, for example), described in this good tutorial.

Creating undirected graphs in Python

About Python library for directed and undirected graphs, you can take a look at igraph or NetworkX.

As for the TSP, a little googling indicates that some Python code and discussion is available here, and some background is given in these slides, A Short History of the Traveling Salesman Problem, and on this page, Traveling Salesman Problem.

How do i send ( export ) barplots to a directory or folder instead of displying them on screen

I assume that you mean saving your graphics as PNG or PDF. Here is a snippet of R code that shows how to redirect plotting action to such graphics devices:

WD <- "~/out"  # set your output directory here
k <- 10        # 10 loops for simulated data

for (i in 1:k) {
  png(sprintf(paste(WD, "Rplot%03d.png", sep="/"), i))
  barplot(table(sample(LETTERS[1:6], 100, rep=TRUE)))

text position for multiple regression equation on the same graph

Here is one possibility, using [@Vincent's code] 1. It works with the latest release of ggplot2 (v. 0.9) and the R-forge version of directlabels (v. 2.5). I also have tested the code with ggplot2 0.8.9 and directlabels 2.4. (The version of directlabels released on CRAN won’t work with ggplot2 0.9 , though.)

The idea is basically to replace your labels A, B , C, G with the regression equations. Of course, you could store the latter in a different manner but I think this would sensibly complicate the plotting expression, so let’s keep that as simple as possible. Assuming we already have @Vincent’s melted variable d,

> head(d)
    Xax variable value
1  0.22        A 0.451
2  0.34        A 0.491
3  0.54        A 0.389
4  0.34        A 0.425
5  0.53        A 0.457
6  0.12        A 0.436

let’s replace variable labels with the equations you computed:

lm.stats <- ddply(d, "variable",
                  function(u) {
                    r <- lm(value ~ Xax, data=u)
                    c(coef(r), r.squared=summary(r)$r.squared)
my.formatter <- function(x, digits=2) {
  x <- round(x, digits=digits)
  out <- paste(x[1], ifelse(x[2]>0, "+", ""), x[2], "x", sep="")
  out <- paste(out, " (R2=", x[3], ")", sep="")
d$variablef <- d$variable
levels(d$variablef) <- apply(lm.stats[2:4], 1, my.formatter)

The little helper function, my.formatter, is in charge of assembling the different statistics you computed with ddply . Note that I made a copy of variable in case we need this latter. And here is the plotting stuff:

p <- ggplot(d, aes(Xax,value, col=variablef)) +
       geom_point() +

I should note that you can also have annotated curves with the labcurve() function from the Hmisc package. I can also imagine simpler solutions using ggplot or lattice, namely just write the regression equations along the regression lines, with proper orientation and a slight shift on the x-axis to avoid overlapping, but that might not necessarily be very portable if your dataset happens to change.

Using patterns in addition/instead of background colors in lattice plots

I found a way to manually draw into the levelplot panel and to draw a diagonal fill pattern over all cells with values greater than 0.5

However, I couldn’t manage to draw the same pattern in the color key legend. After hours of reading forums and trying to understand the lattice source code, I couldn’t get a clue. Maybe someone else could fix that. Here is what I got:

cols <- colorRampPalette(brewer.pal(8, "RdBu"))

data <- Harman23.cor$cov

fx <- fy <- c()
for (r in seq(nrow(data)))
  for (c in seq(ncol(data)))
    if (data[r, c] > 0.5)
      fx <- c(fx, r);
      fy <- c(fy, c);

diag_pattern <- function(...)
  for (i in seq(length(fx)))
    panel.linejoin(x = c(fx[i],fx[i]+.5), y= c(fy[i]+.5,fy[i]), col="black")
    panel.linejoin(x = c(fx[i]-.5,fx[i]+.5), y= c(fy[i]+.5,fy[i]-.5), col="black")
    panel.linejoin(x = c(fx[i]-.5,fx[i]), y= c(fy[i],fy[i]-.5), col="black")

p <- levelplot(data, scales=list(x=list(rot=45)),
               xlab="", ylab="", col.regions=cols, panel=diag_pattern)

How to use a separate table to filter data

Quick and dirty solution (because I believe someone will certainly propose a more elegant solution avoiding loop):

tab1 <- list(Target=1:5, Size=c("L","M","L","S","L"), Color=c("R","B","G","B","R"))
tab2 <- data.frame(rep(1:2, each=2), c("A","D","A","B"),
                   c(5,2,1,5), c(2,4,4,8), c(8,6,6,3))
names(tab2) <- c("User", "Condition", 1:3)

tab2.melt <- melt(tab2, measure.vars=3:5)

for (i in 1:nrow(tab2.melt)) {
  tab2.melt$Size[i] <- tab1$Size[tab1$Target==as.numeric(tab2.melt$variable[i])]
  tab2.melt$Color[i] <- tab1$Color[tab1$Target==as.numeric(tab2.melt$variable[i])]

I am assuming you are able to import your data into R, but you may want to adapt the above code if the data structure isn’t the one you show in your excerpt. Basically, the idea is to consider your Target code as a way to index Size and Color levels, which we need in the final data.frame for each repeated measurement (on the ith subject).

The updated data.frame looks like:

> head(tab2.melt)
  User Condition variable value Size Color
1    1         A        1     5    L     R
2    1         D        1     2    L     R
3    2         A        1     1    L     R
4    2         B        1     5    L     R
5    1         A        2     2    M     B
6    1         D        2     4    M     B

From there, you can perform a 3-way ANOVA or study specific contrasts.

R: Lattice messes up legend in pdf

It should work when using the cairo PDF backend, e.g.

cairo_pdf(file="bchart.pdf", width=10, height=10, pointsize=10)

Although I haven’t checked, this might well have to do with PDF encoding, see Including fancy glyphs in R Graphics PDF output, by Paul Murrell.

GSL library in R

I suspect this has to do with your shared library that is not properly linked to GSL libs, as discussed on R-devel, or the manual on Writing R Extensions, where it is suggested to use a Makevars file (with something like PKG_LIBS=-L/usr/lib -lgsl). Otherwise, following the example in help(SHLIB), you may want to try:

$ R CMD SHLIB file.c -lgsl -lgslcblas

There is a simple tutorial, R Call GSL, which shows basic setup for calling GSL functions.

I am able to reproduce the toy example, that I renamed nperms.{c,r} as follows (on a Mac, whence the use of a -dynamiclib switch in place of -shared):

~/scratch $ gcc -c nperms.c
~/scratch $ file nperms.o
nperms.o: Mach-O 64-bit object x86_64
~/scratch $ gcc -dynamiclib -lgsl -lgslcblas -o libnperms.dylib -dylib nperms.o
~/scratch $ ls *nperm*
libnperms.dylib  nperms.c  nperms.o
~/scratch $ file libnperms.dylib
libnperms.dylib: Mach-O 64-bit dynamically linked shared library x86_64

Everything works fine when dyn.load’ing libnperms.dylib in R. However, using shared library generated from R CMD SHLIB without further argument

~/scratch $ R CMD SHLIB nperms.c
gcc -arch x86_64 -std=gnu99 -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/usr/local/lib -o nperms.o -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation
~/scratch $ ls *nperm*
libnperms.dylib  nperms.c  nperms.o  nperms.r
~/scratch $ file Mach-O 64-bit dynamically linked shared library x86_64

raises the following error (sorry for the French locale)

> dyn.load("")
Erreur dans dyn.load("") :
  impossible de charger l'objet partag'e '/Users/chl/scratch/':
  dlopen(/Users/chl/scratch/, 6): Symbol not found: _gsl_permutation_alloc
  Referenced from: /Users/chl/scratch/
  Expected in: flat namespace
 in /Users/chl/scratch/

emacs syntax highlighting for jags / bugs

Syntax highlighting

I’m using ESS 5.14 (from ELPA) and syntax highlighting or smart underscore works fine for me with GNU Emacs 24.1.1. If you want to highlight a given file, you can try M-x ess-jags-mode or add a hook to highlight JAGS file each time, e.g.

(add-to-list 'auto-mode-alist '("\\.jag\\'" . jags-mode))

However, that is not really needed since you can simply

(require 'ess-jags-d)

in your .emacs. There’s a corresponding mode for BUGS file. This file was already included in earlier release (at least 5.13), and it comes with the corresponding auto-mode-alist (for "\\.[jJ][aA][gG]\\'" extension). (Please note that there seems to exist subtle issue with using both JAGS and BUGS, but I can’t tell more because I only use JAGS.)

Running command file

If you want to stick with Emacs for running JAGS (i.e., instead of rjags or other R interfaces to JAGS/BUGS), there’s only one command to know: As described in the ESS manual, when working on a command file, C-c C-c should create a .jmd file, and then C-c C-c ‘ing again should submit this command file to Emacs *shell* (in a new buffer), and call jags in batch mode. Internally, this command is binded to a ’Next Action’ instruction ( ess-*-next-action). For example, using the mice data that comes with JAGS sample files, you should get a mice.jmd that looks like that:

model in "mice.jag"
data in "mice.jdt"
compile, nchains(1)
parameters in "mice.in1", chain(1)
update 10000
update 10000
parameters to "mice.to1", chain(1)
coda \*, stem("mice")
system rm -f mice.ind
system ln -s miceindex.txt mice.ind
system rm -f mice1.out
system ln -s micechain1.txt mice1.out
Local Variables:

Be careful with default filenames! Here, data are assumed to be in file mice.jdt and initial values for parameters in mice.in1. You can change this in the Emacs buffer if you want, as well as modify the number of chains to use.

How to get detailed information of ANOVA procedure in R?

To answer your first question, no: Your aov object contains information on model fit, as requested, not post-hoc comparisons. It won’t even assess distributional assumptions (of the residuals), test of homoskedasticity and the like, and this is not what we expect to see in an ANOVA table anyway. However, you are free (and it is highly recommend, of course) to complement your analysis by assessing model fit, checking assumptions, etc.

About your second question. Multiple comparisons are handled separately, using e.g. pairwise.t.test() (with or without correction for multiple tests), TukeyHSD() (usually works best with well-balanced data), the multcomp (see glht()), as pointed out by @MYaseen208, or multtest package. Some of those tests will assume that the ANOVA F-test was significant, other procedures are more flexible, but it all depends on what you want to do and if it sounds like a reasonable approach to the problem at hand (cf.  @DWin’s comment). So why R would provide them automatically?

As an illustration, consider the following simulated dataset (a balanced one-way ANOVA):

dfrm <- data.frame(x=rnorm(100, mean=10, sd=2),
                   grp=gl(5, 20, labels=letters[1:5]))

where group means and SDs are as follows:

|       | | N | Mean    | SD     |
|grp    |a| 20|10.172613|2.138497|
|       |b| 20|10.860964|1.783375|
|       |c| 20| 9.910586|2.019536|
|       |d| 20| 9.458459|2.228867|
|       |e| 20| 9.804294|1.547052|
|Overall| |100|10.041383|1.976413|

With JMP, we have a non-significant F(4,95)=1.43 and the following results (I asked for pairwise t-tests):

(P-values are shown in the last column.)

Note that those t-tests are not protected against Type I error inflation.

With R, we would do:

aov.res <- aov(x ~ grp, data=dfrm)
with(dfrm, pairwise.t.test(x, grp, p.adjust.method="none"))

You can check what is stored in aov.res by issuing str(aov.res) at the R prompt. Tukey HSD tests can be carried out using either

TukeyHSD(aov.res)  # there's a plot method as well


glht(aov.res, linfct=mcp(grp="Tukey"))  # also with a plot method

Include small table in legend in R graphics

Here is my take wiith grid (and the Iris dataset):

res <- matrix(nc=3, nr=4)
for (i in 1:4) res[i,] <- tapply(iris[,i], iris[,5], mean)
colnames(res) <- levels(iris[,5])
rownames(res) <- colnames(iris)[1:4]
dp <- dotplot(res, auto.key=list(position="top", column=3), xlab="Mean")

pdf("1.pdf", width=10, height=5)
pushViewport(viewport(layout=grid.layout(1, 2, widths=unit(c(5,4), "inches"))))

pushViewport(viewport(layout.pos.col=1, layout.pos.row=1))
print(dp, newpage=FALSE)

pushViewport(viewport(layout.pos.col=2, layout.pos.row=1, clip="on"))
grid.draw(tableGrob(head(iris), gp=gpar(fontsize=6, lwd=.5)))

Another solution with ggplot2 only is available on Hadley Wickham’s github page, Mixing ggplot2 graphs with other graphical output. Finally, the on-line help page for gridExtra::grid.arrange() includes additional example.

To show the Table inside the plot, we can modify the code as follows:

pushViewport(viewport(layout=grid.layout(1, 1, widths=unit(c(5,4), "inches"))))

pushViewport(viewport(layout.pos.col=1, layout.pos.row=1))
print(dp, newpage=FALSE)

pushViewport(viewport(x=0.5, y=0.3, clip="off"))
grid.draw(tableGrob(head(iris), padding.v=unit(1, "mm"), padding.h=unit(1, "mm"),
          gp=gpar(fontsize=6, lwd=.5)))

which yields

(The background color of the cells can be changed using theme= when calling tableGrob().)

Drawing path diagrams with R package ‘sem’ using Graphviz

You just need to specify a filename (without extension!), see the file= argument. As stated in the documentation, it will generated both a .dot and PDF file (but set output.type="dot" if you only want the graphviz output).

I would use a simple \includegraphics command in the Sweave file, after having called the above command. (You may need to adapt the path to find the figure if you don’t generate the SEM diagram in the same directory as your master .Rnw file.)


Given your comment, yes it seems there’s a problem running external program from within the function call (pathDiagram). So here is a not very elegant solution to generate the path diagram and include it in your Sweave->TeX document.

Here is the Sweave file (sw.rnw):

R.DHP <- readMoments("sem.cov", diag=FALSE,
                     names=c('ROccAsp', 'REdAsp', 'FOccAsp',
                       'FEdAsp', 'RParAsp', 'RIQ', 'RSES',
                       'FSES', 'FIQ', 'FParAsp'))
model.dhp <- specifyModel(file="sem.mod")
sem.dhp <- sem(model.dhp, R.DHP, 329,
               fixed.x=c('RParAsp', 'RIQ', 'RSES', 'FSES', 'FIQ', 'FParAsp'))
capture.output(pathDiagram(sem.dhp, min.rank='RIQ, RSES, RParAsp, FParAsp, FSES, FIQ',
            max.rank='ROccAsp, REdAsp, FEdAsp, FOccAsp'), file="")
system("dot -Tpdf -o fig1.pdf")

And here is the path diagram.



The files sem.cov and sem.mod contain the covariance matrix and structural model that were both entered manually in the above example (simple copy/paste in plain text file). I’m not very happy to have to use capture.output() because I cannot find a way to mask its call from the chunk. Maybe you’ll find a better way to do that (the idea is to use system(), and that can easily be masked with echo=FALSE in the chunk parameters).

I happened to compile the above document as follows:

$ R CMD Sweave sw.rnw
$ R CMD texi2pdf sw.tex

Plotting lines with gnuplot python

Does the following work for you?

>>> import Gnuplot
>>> x = [1,2,3,4]
>>> gp = Gnuplot.Gnuplot()
>>> gp.title('My title')
>>> gp('set style data linespoints')
>>> gp.plot(x)

You can pass whatever options you want in the 5th command.

Graphic of binary variable in R

Here are two possible solutions, based on fake data generated with this helper function: <- function(rate=.3, dim=c(25,25)) {
  tmp <- rep(".", prod(dim))
  tmp[sample(1:prod(dim), ceiling(prod(dim)*rate))] <- "S"
  m <- matrix(tmp, nr=dim[1], nc=dim[2])
  1. Text-based output

    x <-
    rownames(x) <- colnames(x) <- 1:25
    capture.output(as.table(x), file="res.txt")

The file res.txt include a pretty-printed version of the console output; you can convert it to pdf using any txt to pdf converter (I use the one from PDFlib). Here is a screenshot of the text file:

  1. Image-based output

First, here is the plotting function I used:

    make.table <- function(x, labels=NULL) {
      # x = matrix
      # labels = list of labels for x and y
      coord.xy <- expand.grid(x=1:nrow(x), y=1:ncol(x))
      opar <- par(mar=rep(1,4), las=1)
      plot.window(xlim=c(0, ncol(x)), ylim=c(0, nrow(x)))
      text(coord.xy$x, coord.xy$y, c(x), adj=c(0,1))
      if (!is.null(labels)) {
        mtext(labels[[1]], side=3, line=-1, at=seq(1, ncol(x)), cex=.8)
        mtext(labels[[2]], side=2, line=-1, at=seq(1, nrow(x)), cex=.8, padj=1)

Then I call it as

    make.table(x, list(1:25, 1:25))

and here is the result (save it as png, pdf, jpg, or whatever).

Regarding RandomTree in Weka

This has to do with the minimum number of instances on a leaf node (which is often 2 by default in decision trees, like J48). The higher you set this parameter, the more general the tree will be since having many leaves with a low number of instances yields a too granular tree structure.

Here are two examples on the iris dataset, which shows how the -M option might affect size of the resulting tree:

$ weka weka.classifiers.trees.RandomTree -t iris.arff -i

petallength < 2.45 : Iris-setosa (50/0)
petallength >= 2.45
|   petalwidth < 1.75
|   |   petallength < 4.95
|   |   |   petalwidth < 1.65 : Iris-versicolor (47/0)
|   |   |   petalwidth >= 1.65 : Iris-virginica (1/0)
|   |   petallength >= 4.95
|   |   |   petalwidth < 1.55 : Iris-virginica (3/0)
|   |   |   petalwidth >= 1.55
|   |   |   |   sepallength < 6.95 : Iris-versicolor (2/0)
|   |   |   |   sepallength >= 6.95 : Iris-virginica (1/0)
|   petalwidth >= 1.75
|   |   petallength < 4.85
|   |   |   sepallength < 5.95 : Iris-versicolor (1/0)
|   |   |   sepallength >= 5.95 : Iris-virginica (2/0)
|   |   petallength >= 4.85 : Iris-virginica (43/0)

Size of the tree : 17

$ weka weka.classifiers.trees.RandomTree -M 6 -t iris.arff -i

petallength < 2.45 : Iris-setosa (50/0)
petallength >= 2.45
|   petalwidth < 1.75
|   |   petallength < 4.95
|   |   |   petalwidth < 1.65 : Iris-versicolor (47/0)
|   |   |   petalwidth >= 1.65 : Iris-virginica (1/0)
|   |   petallength >= 4.95 : Iris-virginica (6/2)
|   petalwidth >= 1.75
|   |   petallength < 4.85 : Iris-virginica (3/1)
|   |   petallength >= 4.85 : Iris-virginica (43/0)

Size of the tree : 11

As a sidenote, Random trees rely on bagging, which means there’s a subsampling of attributes (K randomly chosen to split at each node); contrary to REPTree, however, there’s no pruning (like in RandomForest), so you may end up with very noisy trees.

What is the command for Zipf (frequency against rank) plot in R

Check out the zipfR package, and its dedicated website including the following tutorial: The zipfR package for lexical statistics: A tutorial introduction.

Evaluate different layout algorithms / interaction with graphs

You might want to check the extensive documentation of the igraph software, which has some description of its internal layout generators. There are also nice illustrations on aiSee website.

For more academic reference, I would suggest browsing the following tutorials: Graph Drawing Tutorial (106 pages) or Graph and Network Visualization (69 pages).

Another useful resource: Handbook of Graph Drawing and Visualization (26 chapters, available as PDF).

Change summary variables returned by Hmisc summary()

Arf… I just look at the code of summary.formula() in the Hmisc package and I can confirm that Mean and SD are indeed computed but not shown when printing on the command line. So, we have to ask for it explicitely when calling the print() function, e.g.

df <- data.frame(g=sample(LETTERS[1:3], 100, rep=TRUE), replicate(3, rnorm(100)))
s <- summary(g ~ ., method="reverse", data=df)
latex(s, prmsd=TRUE, digits=2)  # replace latex by print to output inline

which yields the following Table:

alt text

Plotting directed multigraphs in R

The igraph package seems to fulfill your requirements, with the tkplot() function helping adjusting the final layout if needed.

Here is an example of use:

s <- cbind(A=sample(letters[1:4], 100, replace=TRUE),
           B=sample(letters[1:2], 100, replace=TRUE)) <- table(s[,1], s[,2])
s.g <- graph.incidence(, weighted=T)
     edge.width=c(, vertex.size=20,
     vertex.label.cex=3, vertex.label.color="white")

With the interactive display (there’s a possibility of using rgl for 3D display), it looks like (I have slightly moved one vertex afterwards):

tkplot(s.g,, vertex.color=c(rep("red",4),rep("blue",2)))

Finally, you can even export you graph into most common format, like dot for graphviz.

generate a vector in R and insert it in a stacked frame

To transform a table into a data.frame, the base function should work.

Here is, however, how I would do:

myframe <- as.table(array(c(35, 34, 132, 38, 7, 31, 23, 109, 36, 5,
                            10, 7, 35, 14, 2, 49, 24, 136, 37, 15,
                            22, 13, 52, 31, 10, 16, 8, 33, 32, 4),
                            dim=c(5, 3, 2),
                            dimnames=list(cars=1:5, q8=c("N","U","Y"),

for getting a data.frame with all variables. Should you only want to keep q8 and sex as factors in your data.frame, use melt(myframe)[,-1] instead.

See help(melt.array) for more information.

Normalizing histogram bins in gnuplot

Here is how I would do, with n=500 random gaussian variates generated from R with the following command:

Rscript -e 'cat(rnorm(500), sep="\\n")' > rnd.dat

I use quite the same idea as yours for defining a normalized histogram, where y is defined as 1/(binwidth _ n), except that I use int instead of floor and I didn’t recenter at the bin value. In short, this is a quick adaptation from the smooth.dem demo script, and a similar approach is described in Janert’s textbook, _Gnuplot in Action* ( Chapter 13, p. 257, freely available). You can replace my sample data file with random-points which is available in the demo folder coming with Gnuplot. Note that we need to specify the number of points as Gnuplot as no counting facilities for records in a file.

set xrange [-3:3]
set yrange [0:1]
tstr(n)=sprintf("Binwidth = %1.1f\n", n)
set multiplot layout 1,2
set boxwidth bw1
plot 'rnd.dat' using (bin($1,bw1)):(1./(bw1*n)) smooth frequency with boxes t tstr(bw1)
set boxwidth bw2
plot 'rnd.dat' using (bin($1,bw2)):(1./(bw2*n)) smooth frequency with boxes t tstr(bw2)

Here is the result, with two bin width

Besides, this really is a rough approach to histogram and more elaborated solutions are readily available in R. Indeed, the problem is how to define a good bin width, and this issue has already been discussed on using Freedman-Diaconis binning rule should not be too difficult to implement, although you’ll need to compute the inter-quartile range.

Here is how R would proceed with the same data set, with default option (Sturges rule, because in this particular case, this won’t make a difference) and equally spaced bin like the ones used above.

The R code that was used is given below:

par(mfrow=c(1,2), las=1)
hist(rnd, main="Sturges", xlab="", ylab="", prob=TRUE)
hist(rnd, breaks=seq(-3.5,3.5,by=.1), main="Binwidth = 0.1",
     xlab="", ylab="", prob=TRUE)

You can even look at how R does its job, by inspecting the values returned when calling hist():

> str(hist(rnd, plot=FALSE))
List of 7
 $ breaks     : num [1:14] -3.5 -3 -2.5 -2 -1.5 -1 -0.5 0 0.5 1 ...
 $ counts     : int [1:13] 1 1 12 20 49 79 108 87 71 43 ...
 $ intensities: num [1:13] 0.004 0.004 0.048 0.08 0.196 0.316 0.432 0.348 0.284 0.172 ...
 $ density    : num [1:13] 0.004 0.004 0.048 0.08 0.196 0.316 0.432 0.348 0.284 0.172 ...
 $ mids       : num [1:13] -3.25 -2.75 -2.25 -1.75 -1.25 -0.75 -0.25 0.25 0.75 1.25 ...
 $ xname      : chr "rnd"
 $ equidist   : logi TRUE
 - attr(*, "class")= chr "histogram"

All that to say that you can use R results to process your data with Gnuplot if you like (although I would recommend to use R directly :-).

TukeyHSD after within factors ANOVA

As an alternative to traditional repeated-measures ANOVA for within-subject design, you might consider using linear mixed-effects approach. It is being increasingly used in the scientific community and avoids some of the ANOVA pitfalls while allowing for more complex error structures. With continuous response variable, the nlme package is enough, but you can also use lme4 which further allows to cope with categorical response variables. For multiple comparisons (including Tukey’s post-hoc tests), then, the multcomp package (see the glht() function) can be used with mixed-effects models fitted with nlme::lme, as described here: Repeated Measures ANOVA using R.

One short remark about your design: If your response (dependent) variable, Ratio, is a proportion or a bounded value, you may think of using a different link function.

Interface between Octave and R

I do not know of any active R/octave project, but if you’re just after finding roots for a given polynomial you can use one of the polynom or PolynomF package:

Here is an example with P(x)= 6 + 5x + 4x^2 + 3x^3 + 2 x^4 + x^5.

In octave,

octave[2] > p = 1:6;
octave[3] > roots(p)
ans =

   0.55169 + 1.25335i
   0.55169 - 1.25335i
  -1.49180 + 0.00000i
  -0.80579 + 1.22290i
  -0.80579 - 1.22290i

In R,

> library(polynom)
> p <- polynomial(6:1)
> pz <- solve(p)
> pz
[1] -1.491798+0.000000i -0.805786-1.222905i -0.805786+1.222905i
[4]  0.551685-1.253349i  0.551685+1.253349i

R fisher.test print only p-value

If you take a look at the values returned by a call to the fisher.test() function (see help(fisher.test) in R), you will notice that there is p.value among others. So you just need to use $p.value on your R object. Here is an example, in R:

tab <- replicate(2, sample(LETTERS[1:3], 10, rep=TRUE))
fisher.test(table(tab[,1], tab[,2]))$p.value

can’t set border linewidth for bars of rowstacked histogram

I would not use set style fill solid border -1 (or better, noborder), but rather define a specific line type that can be used to customize boxes, e.g.

bin(x,width) = width*floor(x/width) + bw/2.0
set boxwidth bw
set style line 2 lc rgb 'gray30' lt 1 lw 2
set style fill pattern 5
plot 'rnd.dat' using (bin($1,bw)):(1./(bw*n)) smooth frequency with boxes ls 2

Here, boxes are drawn using dark gray and line width of 2.

Is there a Java alternative to Prefuse that is under active development?

Protovis is the successor of Prefuse (and now, D3 is under active development). Protovis-Java is a partial implementation of the Protovis toolkit in Java. There’s a nice example gallery, but I have no experience with the Java side.

As an alternative, you might consider Processing, with some example of use in Java here, or its Javascript counterpart, Processing.js. There were even a port for Scala + Processing: Spde.

How to implement R model in C++ code

You’re asking for a linear regression model (here, R glm() stands for generalized linear model, but as you’re using an identity link, you end up with a linear regression). There are several implementation available in C, for example the apophenia library which features a nice set of statistical functions with bindings for MySQL and Python. The GSL and ALGLIB libraries also have dedicated algorithms.

However, for lightweight and almost standalone C code, I would suggest taking a look at glm_test.c available in the source of the snpMatrix BioC package.

Following the updated question, it seems you rather want to predict the outcome based on a set of regression parameters. Then, given that the general form of the hypothesized model is y=b0 + b1 _ x1 + b2 _ x2 + … + bp * xp, where b0 is the intercept and b1, …, bp are the regression coefficients (estimated from the data), the computation is rather simple as it amounts to a weighted sum: take each observed values for your p predictors and multiply by the b’s (don’t forget the intercept term!).

You can double check your results with R predict() function; here is an example with two predictors, named V1 and V2, 100 observations, and a regular grid of new values for predicting the outcome (you can use your own data as well):

> df <- transform(X <-, rnorm(100))),
                                     y = V1+V2+rnorm(100))
> res.lm <- lm(y ~ ., df)
> <- data.frame(V1=seq(-3, 3, by=.5), V2=seq(-3, 3, by=.5))
> coef(res.lm)
(Intercept)          V1          V2
0.006712008 0.980712578 1.127586352
     V1   V2
1  -3.0 -3.0
2  -2.5 -2.5
> 0.0067 + 0.9807*-3 + 1.1276*-3  # with approximation
[1] -6.3182
> predict(res.lm,[1]

Plotting multiple variables via ggplot2

I would suggest the following: use one of your factor for stacking, the other one for faceting. You can remove position="fill" to geom_bar() to use counts instead of standardized values.

my.df <- data.frame(replicate(10, sample(1:5, 100, rep=TRUE)),
                    F1=gl(4, 5, 100, labels=letters[1:4]),
                    F2=gl(2, 50, labels=c("+","-")))
my.df[,1:10] <- apply(my.df[,1:10], 2, function(x) ifelse(x>4, 1, 0))
my.df.melt <- melt(my.df)
res <- ddply(my.df.melt, c("F1","F2","variable"), summarize, sum=sum(value))
ggplot(res, aes(y=sum, x=variable, fill=F1)) +
   geom_bar(stat="identity", position="fill") +
   coord_flip() +
   facet_grid(. ~ F2) +
   ylab("Percent") + xlab("Item")

In the above picture, I displayed observed frequencies of ‘1’ (value above 4 on the Likert scale) for each combination of F1 (four levels) and F2 (two levels), where there are either 10 or 15 observations:

> xtabs(~ F1 + F2, data=my.df)
F1   +  -
  a 15 10
  b 15 10
  c 10 15
  d 10 15

I then computed conditional item sum scores with ddply, using a ‘melted’ version of the original data.frame. I believe the remaining graphical commands are highly configurable, depending on what kind of information you want to display.

In this simplified case, the ddply instruction is equivalent to with(my.df.melt, aggregate(value, list(F1=F1, F2=F2, variable=variable), sum)).

How to extract the p-value for the slope from an ols object in R

You can use the corresponding extractor function, but you need to call summary.lm:

> coef(summary.lm(m2))
          Estimate Std. Error   t value     Pr(>|t|)
Intercept 37.88458  2.0738436 18.267808 8.369155e-18
cyl       -2.87579  0.3224089 -8.919699 6.112687e-10

how to draw guide lines on a gnuplot generated cdf?

Here is my take (using percentile ranks), which only assumes a univariate series of measurement is available (your column headed X). You may want to tweak it a little to work with your pre-computed cumulative frequencies, but that’s not really difficult.

# generate some artificial data
set sample 200
set table 'rnd.dat'
plot invnorm(rand(0))
unset table

# display the CDF
unset key
set yrange [0:1]
perc80=system("cat rnd.dat | sed '1,4d' | awk '{print $2}' | sort -n | \
          awk 'BEGIN{i=0} {s[i]=$1; i++;} END{print s[int(NR*0.8-0.5)]}'")
set arrow from perc80,0 to perc80,0.8 nohead lt 2 lw 2
set arrow from graph(0,0),0.8 to perc80,0.8 nohead lt 2 lw 2
plot 'rnd.dat' using 2:(1./200.) smooth cumulative

This yields the following output:

You can add as many percentile values as you want, of course; you just have to define a new variable, e.g. perc90, as well as ask for two other arrow commands, and replace every occurrence of 0.8 (ah… the joy of magic numbers!) by the desired one (in this case, 0.9).

Some explanations about the above code:

  1. I generated an artificial dataset which was saved on disk.
  2. The 80th percentile is compute using awk, but before that we need to
  3. remove the header generated by table (first four lines); (we could ask awk to start at the 5th lines, but let’s go with that.)
  4. keep only the second column;
  5. sort the entries.
  6. The awk command to compute the 80th percentile requires truncation, which is done as suggested here. (In R, I would simply use a function like trunc(rank(x))/length(x) to get the percentile ranks.)

If you want to give R a shot, you can safely replace that long series of sed/awk commands with a call to R like

Rscript -e 'x=read.table("~/rnd.dat")[,2]; sort(x)[trunc(length(x)*.8)]'

assuming rnd.dat is in your home directory.

Sidenote: And if you can live without gnuplot, here are some R commands to do that kind of graphics (even not using the quantile function):

x <- rnorm(200)
xs <- sort(x)
xf <- (1:length(xs))/length(xs)
plot(xs, xf, xlab="X", ylab="Cumulative frequency")
## quick outline of the 80th percentile rank
perc80 <- xs[trunc(length(x)*.8)]
abline(h=.8, v=perc80)
## alternative solution
segments(par("usr")[1], .8, perc80, .8)
segments(perc80, par("usr")[3], perc80, .8)

r / sciplot: overlapping whiskers in lineplot.CI

You can use ggplot2 for that. Here is an example with a built-in dataset (as I don’t have your standard errors or CIs). The key is to use position_dodge() .

ToothGrowth$ <- factor(ToothGrowth$dose, labels=paste("d", 1:3, sep=""))
df <- with(ToothGrowth , aggregate(len, list(supp=supp,, mean))
df$se <- with(ToothGrowth , aggregate(len, list(supp=supp,,
              function(x) sd(x)/sqrt(10)))[,3]

opar <- theme_update(panel.grid.major = theme_blank(),
                     panel.grid.minor = theme_blank(),
                     panel.background = theme_rect(colour = "black"))

xgap <- position_dodge(0.2)
gp <- ggplot(df, aes(x=dose, y=x, colour=supp, group=supp))
gp + geom_line(aes(linetype=supp), size=.6, position=xgap) +
     geom_point(aes(shape=supp), size=3, position=xgap) +
     geom_errorbar(aes(ymax=x+se, ymin=x-se), width=.1, position=xgap)

Converting data frame (factors) with data.matrix, weird results

To see what happened, look at the following example:

> (f <- gl(2, 1, 10, labels=3:4))
 [1] 3 4 3 4 3 4 3 4 3 4
Levels: 3 4
> as.numeric(f)
 [1] 1 2 1 2 1 2 1 2 1 2
> as.numeric(as.character(f))
 [1] 3 4 3 4 3 4 3 4 3 4

To convert a factor (which is what you have in your data.frame) into a numeric vector while preserving its labels (otherwise you’ll just get its levels), you need something like as.numeric(as.character()).

So, either ensure that you read your input data correctly (if numbers are quoted, with options("stringsAsFactors") set to TRUE, then it’s likely they will be converted to a factor), or convert your data.frame afterwards. This can be done columnwise, e.g.

dfrm <- data.frame(x=factor(c(3,2,1,8,4)), y=factor(c(5,6,1,2,3)))
m <- sapply(dfrm, function(x) as.numeric(as.character(x)))

I misread your question and I thought you were using as.matrix , not data.matrix. That doesn’t change anything, since both functions will convert factors into their internal representation, as stated in the on-line help:

Factors and ordered factors are replaced by their internal codes.

How do I add an asterix to a boxplot to represent significance?

You can use text() and write at the corresponding location, if you know it beforehand; e.g.,

dfrm <- data.frame(y=rnorm(100, mean=10), x=gl(4, 25))
dfrm$y[dfrm$x==2] <- dfrm$y[dfrm$x==2]+2
boxplot(y ~ x, data=dfrm, ylim=c(min(dfrm$y)-.5, max(dfrm$y)+.5))
text(x=2, y=max(dfrm$y[dfrm$x==2]), "*", pos=3, cex=1.2)

Adapt x=2 to suit your needs.

Or you can use mtext to put the star outside the plotting region, like in

mtext("*", side=3, line=0, at=2, cex=1.2)

How to convert NA’s in a large dataset to either 0 or 1?

Assuming your dataframe is named dfrm, do you mean something like that?

dfrm$Surgery <- ifelse(dfrm$Specialty=="Surgery", 1, 0)
dfrm$Internal <- ifelse(dfrm$Specialty=="Internal", 1, 0)

where can i find the sources for the sample() function in base

Look up do_sample in /trunk/src/main/random.c from, e.g. the svn repository.

How can you visualize data frames in a good way?

RStudio does a pretty good job with its built-in (read-only) data viewer. Other solutions ( have been suggested on Cross Validated: Is there a good browser/viewer to see an R dataset (.rda file).

How can we combine predictors from two different linear models into one?

If this is just a matter of extracting the components of each model and combine them into a new design matrix, then the following should work, irrespective of the fact you used stepAIC:

dfrm <- data.frame(y=rnorm(100), replicate(7, rnorm(100)))
lm1 <- lm(y ~ X1+X2+X3, dfrm)
lm2 <- lm(y ~ X5+X7, dfrm) <- attr(terms(lm1), "term.labels") <- attr(terms(lm2), "term.labels") <- as.formula(paste("y ~ ", paste(c(,, collapse= "+")))
lm3 <- lm(, dfrm)

To fix the ideas, here we have

> names(dfrm)
[1] "y"  "X1" "X2" "X3" "X4" "X5" "X6" "X7"
y ~ X1 + X2 + X3 + X5 + X7

See help(terms.object) to get more information on what it returns. With your example, you’ll need to replace lm1 with best.ngc_fit and lm2 with best.fcst_fit.

Weka data load error

It does not look like the C45 names file is correct. Try replacing breast-cancer-wisconsin.names with this one:

2, 4.
clump: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
size: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
shape: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
adhesion: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
epithelial: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
nuclei: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
chromatin: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
nucleoli: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
mitoses: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.

Note that class comes first (only labels).

Here I have removed the first column of subjects’ id in the original dataset using

$ cut -d, -f2-11 >

but it is not difficult to adapt the above code.

Alternative solutions:

  1. Generate a csv file: you just need to add a header to the *.data file and rename it as *.csv. E.g., replace with breast-cancer-wisconsin.csv which should look like

  2. Construct directly an *.arff file by hand; that’s not really complicated as there are few variables. An example file can be found here.

How to add labels to the terminal nodes of a ctree (package party)?

Here is one hacky solution. It involves very little modification in the original source code of the plotting functions from the party package. By reading the source code, I noticed that there is a terminal_panel which is calling node_barplot in case the outcome is a factor. (Everything is located in the R/plot.R function, if you have source package installed.) We can modify the later to display custom labels in the default bar chart.

Just issue the following command at R prompt:

fixInNamespace("node_barplot", pos="package:party")

and then, start adding what we want:

  1. Add labels = NULL, gp = NULL to the existing list of arguments for that function.

  2. Near the end of the function body, after grid.rect(gp = gpar(fill = "transparent")), add the following lines:

    if (!is.null(labels)) {
      labs <- as.character(labels[ctreeobj@where==node$nodeID])
      len <- length(labs)
      x <- unit(rep(0.5, len), "npc")
      y <- unit(0.5:len/len, "npc")
      for (i in 1:len)
        grid.text(labs[i], x=x[i], y=y[i], just="center", gp=gp)

Here, the key idea is to select labels corresponding to the selected node (node$nodeID), and we can grab this information from the slot where of the ctree object (this is a vector indicating in which node each case ended up). The if test is just to ensure that we can use the function as originally written. The gp argument can be used to change font size or color.

A typical call to the function would now be:

plot(cfit, tp_pars=list(labels=dfrm$names))

where dfrm$names is a column of labels from a data frame named dfrm. Here is an illustration with your data:

cfit <- ctree(young ~ age, data=a,
              controls=ctree_control(minsplit=2, minbucket=2))
plot(cfit, tp_args=list(labels=a$names, gp=gpar(fontsize=8, col="darkgrey")))

(I have also tested this with the on-line example with the iris dataset.)

how do i get a 45 angle for the x axis labels in the following code

AFAIK, with base graphics, you can only ask for 0/90° orientation of labels on x- or y-axis (see the las parameter in par()). However, with lattice or ggplot2 you can do it.

Here is an example with lattice::barchart():

tt <- table(sample(LETTERS[1:6], 100, rep=T))
barchart(tt, horiz=F,
         scales=list(x=list(rot=45, labels=paste("Fancy label", 1:6))))
bar chart

Replace labels with your own labels or if you already have a named table, leave it as is.

I have a big table in R, now I want to select the odd rows and paste a label before the first element of this row

If I understand your question correctly, you want to display some text and the first element of all odd rows. You can try this:

cat(paste("This is odd", A[c(2,4),1], "\n"))

No need for a loop there. Should you want to work with a larger matrix, and take all odd rows, you can use seq(2, nrow(A), by=2) instead of c(2,4).

Programming a QQ plot

Here is a simple solution using base graphics:

scores <- rnorm(200, mean=12, sd=2)
gender <- gl(2, 50, labels=c("M","F"))
opar <- par(mfrow=c(1,2))
for (g in levels(gender))
  qqnorm(scores[gender==g], main=paste("Gender =", g))

A more elegant lattice solution then:

qqmath(~ scores | gender, data=data.frame(scores, gender), type=c("p", "g"))

See the on-line help for qqmath for more discussion and example of possible customization.

How to combine 2 matrices into a graph

If you just want to plot correlations as a function of distances, without imposing a particular structure on your plot, you can just extract the lower part of your respective matrices, e.g.

x <- matrix(rnorm(1000), nrow=20)
d.mat <- as.matrix(dist(x))
c.mat <- cor(t(x))
plot(d.mat[lower.tri(d.mat)], c.mat[lower.tri(c.mat)])

Interactive plotting in Python?

My first thought would be Mayavi, which is great for data visualization, especially in 3D. It relies on VTK. It is included in the Enthought flavoured version of Python, together with Chaco for 2D plotting. To get an idea, look at Travis Vaught’s nice screencast in Multidimensional Data Visualization in Python - Mixing Chaco and Mayavi .

It also possible to embed basic interactive functionalities (like slider) to Matplotlib, see matplotlib.widgets and the widget examples.

Finally, you can use rpy (or better, rpy2) and benefit from the R interface.

Performing an if statement on each row in R

Unfortunately, I came late and with a solution similar to @Andrie’s one, like this:

dat <- matrix(c(3,3,3,2,3,3,3,3,3,3,3,3,2,3,1,2,2,2,3,3),
              nr=10, byrow=TRUE)
# here is our lookup table for genotypes
pat <- matrix(1:9, nr=3, byrow=T, dimnames=list(1:3,1:3))


> pat[dat]
 [1] 9 8 9 9 9 9 6 2 5 9

gives you what you want.

However, I would like to say that you might find easier to use dedicated package for genetic studies, like the one found on CRAN (like genetics, gap or SNPassoc , to name a few) or Bioconductor , because they include facilities for transforming/recoding genotype data and working with haplotype.

Here is an example of what I have in mind with the above remark:

> library(genetics)
> geno1 <- as.genotype.allele.count(dat[,1]-1)
> geno2 <- as.genotype.allele.count(dat[,2]-1)
> table(geno1, geno2)
geno1 A/A A/B
  A/A   6   1
  A/B   1   1
  B/B   0   1

Canonical Correlation Analysis in R

Your X variance-covariance matrix is not positive-definite, hence the error when internally calling fda::geigen.

There’s a similar function for regularized CCA in the mixOmics package, but I guess it will lead to the same error message because it basically uses the same approach (except that they plugged the geigen function directly into the rcc function). I can’t actually remember how I get it to work with my data, for a related problem (but I’ll look into my old code once I find it again :-)

One solution would be to use a generalized Cholesky decomposition. There is one in the kinship (gchol; be careful, it returns a lower triangular matrix) or accuracy (sechol) package. Of course, this means modifying the code inside the function, but it is not really a problem, IMO. Or you can try to make Var(X) PD with make.positive.definite from the corpcor package.

As an alternative, you might consider using the RGCCA package, which offers an unified interface for PLS (path modeling) and CCA methods with k blocks.

Controlling z labels in contourplot

Here is my take:

x <- rep(seq(-1.5,1.5,length=50),50)
y <- rep(seq(-1.5,1.5,length=50),rep(50,50))
z <- exp(-(x^2+y^2+x*y))

# here is default plot
lp1 <- contourplot(z~x*y)

# here is an enhanced one
my.panel <- function(at, labels, ...) {
    # draw odd and even contour lines with or without labels
    panel.contourplot(..., at=at[seq(1, length(at), 2)], col="blue", lty=2)
    panel.contourplot(..., at=at[seq(2, length(at), 2)], col="red",
                      labels=as.character(at[seq(2, length(at), 2)]))

lp2 <- contourplot(z~x*y, panel=my.panel, at=seq(0.2, 0.8, by=0.2))
lp3 <- update(lp2, at=seq(0.2,0.8,by=0.1))
lp4 <- update(lp3, lwd=2,"align")

grid.arrange(lp1, lp2, lp3, lp4)

You can adapt the custom panel function to best suit your needs (e.g. other scale for leveling the z-axis, color, etc.).

Gnuplot: Variable colors (and linewidths) for 2D-Vector plot

I would put only one color specification, e.g.

set xrange [0:10]
set yrange [0:10]
plot "test.dat" using 1:2:3:4:5 with vectors lw 3 lc rgb variable

where test.dat contains

1 1 2 0 0x000000
1 2 2 0 0xff0000
1 3 2 0 0xffff00
1 4 2 0 0x382288

The same can be done using the following inline rgb function

rgb(r,g,b) = int(r)*65536 + int(g)*256 + int(b)
plot "test2.dat" using 1:2:3:4:(rgb($5,$6,$7)) with vectors lw 3 lc rgb variable

where test2.dat now reads

1 1 2 0 0 0 0
1 2 2 0 255 0 0
1 3 2 0 255 255 0
1 4 2 0 56 34 136

slick one-lineRs

I often need fake data to illustrate, say, a regression problem. Instead of

X <- replicate(2, rnorm(100))
y <- X[,1] + X[,2] + rnorm(100)
df <- data.frame(y=y, X=X)

we can use

df <- transform(X <-, rnorm(100))),
                y = V1+V2+rnorm(100))

to generate two uncorrelated predictors associated to the outcome y.

Is there a dynamic word/tag cloud Java API somewhere?

I’m not aware of a good R package for that. There are some functions, like cloud in the [snippets][1] package, and maybe other functions, but nothing compared to,, or [Many Eyes][2]. [Drew Conway][3] has done some nice stuff with tm + ggplot2; I also played with it a while ago, but this was more of to play with 3D tag cloud (with rgl) than wordle.



Ordering the bars of a stacked bar graph in ggplot from least to greatest

I’m not sure about the way your data were generated (i.e., whether you use a combination of cast/melt from the reshape package, which is what I suspect given the default name of your variables), but here is a toy example where sorting is done outside the call to ggplot. There might be far better way to do that, browse on SO as suggested by @Andy.

v1 <- sample(c("I","S","D","C"), 200, rep=T)
v2 <- sample(LETTERS[1:24], 200, rep=T)
my.df <- data.frame(v1, v2)
idx <- order(apply(table(v1, v2), 2, sum))

ggplot(my.df, aes(x=factor(v2, levels=LETTERS[1:24][idx], ordered=TRUE),
       fill=v1)) + geom_bar() + opts(axis.text.x=theme_text(angle=90)) +

To sort in the reverse direction, add decr=TRUE with the order command. Also, as suggested by @Andy, you might overcome the problem with x-labels overlap by adding + coord_flip() instead of the opts() option.

How to debug Aquamacs

Does aquamacs --help work? I suspect it does not because if you are using the command that was installed by Aquamacs itself (which is /usr/local/bin/aquamacs) it is simply a Perl script that call OS X open command, which may have difficulty when parsing options.

It is better to just create an alias or a symbolic link to /Applications/

% /Applications/ --help
Usage: /Applications/ [OPTION-OR-FILENAME]...

Run Emacs, the extensible, customizable, self-documenting real-time
display editor.  The recommended way to start Emacs for normal editing
is with no options at all.

Run M-x info RET m emacs RET m emacs invocation RET inside Emacs to
read the main documentation for these command-line arguments.

Initialization options:

--batch                     do not do interactive display; implies -q
--chdir DIR                 change to directory DIR
--daemon                    start a server in the background
--debug-init                enable Emacs Lisp debugger for init file

You can also safely run

% /Applications/ -nw --debug-init

to get Aquamacs running in your preferred Terminal emulator. More information can be found on EmacsWiki, Customize Aquamacs.

Errors due to vowpal wabbit’s dependencies on boost library

I happened to get everything working on OS X 10.7 as follows:

  1. Make sure you have a working Boost installation. As indicated on the Getting started page, usually we only need header files, but some Boost libraries must be built separately, including the program_options library which is used to process options from command line or config file. Go into your boost folder, and then at your shell prompt:

    $ ./
    $ ./bjam

This will compile and build everything. You should now have a bin.v2/ directory in your boost directory, with all built libraries for your system (static and threaded libs).

    $ ls bin.v2/libs/
    date_time       iostreams       python          serialization   test
    filesystem      math            random          signals         thread
    graph           program_options regex           system          wave

More importantly, extra Boost libraries are made available in the stage/lib/ directory. For me, these are Mach-O 64-bit dynamically linked shared library x86_64.

The include path should be your_install_dir/boost_x_xx_x, where boost_x_xx_x is the basename of your working Boost. (I personally have boost_1_46_1 in /usr/local/share/ and I symlinked it to /usr/local/share/boost to avoid having to remember version number.) The library path (for linking) should read your_install_dir/boost_x_xx_x/stage/lib. However, it might be best to symlink or copy (which is what I did) everything in usual place, i.e. /usr/local/include/boost for header files, and /usr/local/lib for libraries.

  1. Edit the Makefile from the vowpal_wabbit directory, and change the include/library paths to reflect your current installation. The Makefile should look like this (first 12 lines):

    COMPILER = g++
    UNAME := $(shell uname)
    ifeq ($(UNAME), FreeBSD)
    LIBS = -l boost_program_options -l pthread -l z -l compat
    BOOST_INCLUDE = /usr/local/include
    BOOST_LIBRARY = /usr/local/lib
    LIBS = -l boost_program_options -l pthread -l z
    BOOST_INCLUDE = /usr/local/share/boost            # change path to reflect yours
    BOOST_LIBRARY = /usr/local/share/boost/stage/lib  # idem

Then, you are ready to compile vowpal_wabbit (make clean in case you already compiled it):

    $ make
    $ ./vw --version
    $ make test

One way to do that is to build the corresponding adjacency matrix. For example,

vertices <- c("SHLRK03", unique(c(SHLRK03, SHLRK04, WNBGORV01)))
adj.mat <- matrix(0, nrow=length(vertices), ncol=length(vertices),
                  dimnames=list(vertices, vertices))
adj.mat["SHLRK03", colnames(adj.mat) %in% SHLRK03] <- 1
adj.mat["SHLRK04", colnames(adj.mat) %in% SHLRK04] <- 1
adj.mat["WNBGORV01", colnames(adj.mat) %in% WNBGORV01] <- 1
g <- graph.adjacency(adj.mat)
V(g)$label <- V(g)$name

There are several options for graph layout, vertices labeling, etc. that you will find in the on-line documentation. Here is the default rendering with the code above.

If you have several vectors like these, you can certainly automate the filling of the adjacency matrix with a little helper function.

Splitting axis labels with expressions

Here is another solution, relying on atop as did @AndresT in his edit. Note that we cannot use control character like \n in an expression, which explains why using something like expression(paste("...\n", alpha[i], "....")) would not produce the desired output.

xlabel <- expression(atop("A very long label with text and expression",
                          paste((alpha+beta)[ij], "   A very long label ...")))
curve(x^3 - 3*x, -2, 2, sub=xlabel, xlab="")

Note that I used sub instead of xlab to avoid collision with x tick marks.

How to summarise data by group with weighted mean?

As suggested on an old R thread, you can use by instead:

wt <- c(5,  5,  4,  1)/15
x <- c(3.7,3.3,3.5,2.8)
xx <- data.frame(avg=x, value=gl(2,2), weight=wt)
by(xx, xx$value, function(x) weighted.mean(x$avg, x$weight))

Python matplotlib : plot3D with a color for 4D

If you want to display a simple 3D scatterplot, can’t you just use scatter? E.g.,

x, y, z = randn(100), randn(100), randn(100)
fig = plt.figure()
from mpl_toolkits.mplot3d import Axes3D
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x, y, z, c=randn(100))

(I’m running the above code under python -pylab.)

It seems, on the contrary, that with plot3D you must convert your fourth dimension to RGB tuples.

How to generate bivariate data of different shapes (e.g., square, circle, rectangle) with outliers?

You should probably look into the mlbench package, especially synthetic dataset generating from mlbench.* functions, see some examples below.

Other datasets or utility functions are probably best found on the Cluster Task View on CRAN. As @Roman said, adding outliers is not really difficult, especially when you work in only two dimensions.

Boot package in R simple assistance

Just to emphasize the general idea on bootstrapping in R, although @caracal already answered your question through his comment. When using boot, you need to have a data structure (usually, a matrix) that can be sampled by row. The computation of your statistic is usually done in a function that receives this data matrix and returns the statistic of interest computed after resampling. Then, you call the boot() that takes care of applying this function to R replicates and collecting results in a structured format. Those results can be assessed using in turn.

Here are two working examples with the low birth baby study in the MASS package.

# compute CIs for correlation between mother's weight and birth weight
cor.boot <- function(data, k) cor(data[k,])[1,2]
cor.res <- boot(data=with(birthwt, cbind(lwt, bwt)),
                statistic=cor.boot, R=500)
cor.res, type="bca")
# compute CI for a particular regression coefficient, e.g. bwt ~ smoke + ht
fm <- bwt ~ smoke + ht
reg.boot <- function(formula, data, k) coef(lm(formula, data[k,]))
reg.res <- boot(data=birthwt, statistic=reg.boot,
                R=500, formula=fm), type="bca", index=2) # smoke

Discretization in R

You can still use the dprep package, but you have to install it from source (I just tested and it works well). However, you may well have a look at the discretization or infotheo packages which provide similar functionalities, e.g. equal interval width, equal frequency intervals, ChiMerge, etc.

Using octave headless


octave --silent --eval 5+4 > result.txt

you’ll get

ans =  9

in result.txt. See octave --help for details about command-line arguments.

Yet, there is this infamous ans = that might be remove using sed, e.g.

octave --silent --eval 'x=5+4; y=x+1; disp(y)' | sed -e 's/ans = //' >> result.txt

which add the appropriate result (10) in result.txt.

It should not be too hard to wrap this into a bash script.

Missing values in ARFF (Weka)

Apart from treating missing value as an attribute value on its own, in the case of the J48 classifier any split on an attribute with missing value will be done with weights proportional to frequencies of the observed non-missing values. This is documented in Witten and Frank’s textbook, Data Mining Practical Machine Learning Tools and Techniques (2005, 2nd. ed., p. 63 and p. 191), who then reported that

eventually, the various parts of the instance will each reach a leaf node, and the decisions at these leaf nodes must be recombined using the weights that have percolated to the leaves.

More information about handling missing values in decision trees, like surrogate splits in CART (and contrary to C4.5 or its successor J48), can be found on the wiki section for Classification Trees; the use of imputation is also discussed in several articles, e.g.  Handling missing data in trees: surrogate splits or statistical imputation.

Save .dta files in python

The scikits.statsmodels package includes a reader for Stata data files, which relies in part on PyDTA as pointed out by @Sven . In particular, genfromdta() will return an ndarray , e.g. from Python 2.7/statsmodels 0.3.1:

>>> import scikits.statsmodels.api as sm
>>> arr = sm.iolib.genfromdta('/Applications/Stata12/auto.dta')
>>> type(arr)
<type 'numpy.ndarray'>

The savetxt() function can be used in turn to save an array as a text file, which can be imported in Stata. For example, we can export the above as

>>> sm.iolib.savetxt('auto.txt', arr, fmt='%2s', delimiter=",")

and read it in Stata without a dictionary file as follows:

. insheet using auto.txt, clear

I believe a *.dta reader should be added in the near future.

How can I rank observations within groups in Stata?

The following works for me:

bysort group_id: egen desired_rank=rank(var_to_rank)

U-matrix and self organizing maps

The U-matrix stands for unified distance and contains in each cell the euclidean distance (in the input space) between neighboring cells. Small values in this matrix mean that SOM nodes are close together in the input space, whereas larger values mean that SOM nodes are far apart, even if they are close in the output space. As such, the U-matrix can be seen as summary of the probability density function of the input matrix in a 2D space. Usually, those distance values are discretized, color-coded based on intensity and displayed as a kind of heatmap.

Quoting the Matlab SOM toolbox,

 Compute and return the unified distance matrix of a SOM.
 For example a case of 5x1 -sized map:
            m(1) m(2) m(3) m(4) m(5)
 where m(i) denotes one map unit. The u-matrix is a 9x1 vector:
    u(1) u(1,2) u(2) u(2,3) u(3) u(3,4) u(4) u(4,5) u(5)
 where u(i,j) is the distance between map units m(i) and m(j)
 and u(k) is the mean (or minimum, maximum or median) of the
 surrounding values, e.g. u(3) = (u(2,3) + u(3,4))/2.

Apart from the SOM toolbox, you may have a look at the kohonen R package (see help(plot.kohonen) and use type="dist.neighbours").

Sort xtable() output by p-value from glm model summary

Not necessarily the most elegant solution, but that should do the job:

data(birthwt, package="MASS")
glm.res <- glm(low ~ ., data=birthwt[,-10])
idx <- order(coef(summary(glm.res))[,4])  # sort out the p-values
out <- coef(summary(glm.res))[idx,]       # reorder coef, SE, etc. by increasing p

how to add a general grid to a lattice xy.plot

Try this:

xyplot(yval ~ xval | p*cr, data=B, group=gval, type=c("l","g"), lwd=5,
       main="Scatterplots by Cylinders and Gears",
       ylab="Miles per Gallon", xlab="Car Weight")

I have simplified a little bit your syntax because you can use variable names from you data.frame if you the data= argument. The key is to use type=c("l","g"), which means lines + grid, and is equivalent to a panel function that would looks like

panel=function(...) {

In your case, this is because you forgot to add a panel.xyplot() that no points or line were drawn. If you want a different grid, you can use the above code and customize the call to panel.grid().

Fine tuning a dotplot in R’s lattice package

Some ideas:

  1. Use a smaller font size for y-axis labels, e.g.  scale=list(y=list(cex=.6)). An alternative would be to preserve uniform font size, but separate your output on several pages (this can be controlled with layout=), or, probably better, to show all data from the same dataset (A to F, hence 4 points for each algorithm) or by sample size (10 to 100, hence 6 points for each algorithm) with a group= option. I would personally create two factors, sample.size and dataset.type for that.

  2. Relevel your factor Dataset, so that the dataset you are interested appear where layout will put them, or (better!) use index.cond to specify a particular arrangement for your 24 panels. E.g.,

    dfrm <- data.frame(algo=gl(11, 1, 11*24, labels=paste("algo", 1:11, sep="")),
                       type=gl(24, 11, 11*24, labels=paste("type", 1:24, sep="")),
    p <- dotplot(algo ~ roc | type, dfrm, layout=c(4,6), scale=list(y=list(cex=.4)))

will arrange panels in sequential order, from bottom left to top right (type1 in bottom left panel, type24 in top right panel), while

    update(p, index.cond=list(24:1))

will arrange panels in the reverse order. Just specify a list with expected panel locations.

Here is an example of what I have in mind with Point 1 and the use of two factors instead of one. Let us generate another artificial dataset:

dfrm <- data.frame(algo=gl(11, 1, 11*24, labels=paste("algo", 1:11, sep="")),
                   dataset=gl(6, 11, 11*24, labels=LETTERS[1:6]),
                   ssize=gl(4, 11*6, 11*24, labels=c(10,25,50,100)),
xtabs(~ dataset + ssize, dfrm)  # to check allocation of factor levels
dotplot(algo ~ roc | dataset, data=dfrm, group=ssize, type="l",
        auto.key=list(space="top", column=4, cex=.8, title="Sample size",
                      cex.title=1, lines=TRUE, points=FALSE))

Difference between regression tree and model tree

A linear model tree is a decision tree with a linear functional model in each leaf, whereas in classical regression tree (e.g., CART) it is the sample mean of the response variable for statistical units in each leaf (hence, a constant) that is being considered. Linear model trees can be seen as a a form of locally weighted regression, while regression tree are piecewise-constant regression.

For more information on linear model trees, you can consult

Torgo, L. Functional models for regression tree leaves. In Proceedings of the 14th International Conference on Machine Learning, pp. 385–393. Morgan Kaufmann, 1997.

Naive bayes in R

You seem to be using the e1071::naiveBayes algorithm, which expects a newdata argument for prediction, hence the two errors raised when running your code. (You can check the source code of the predict.naiveBayes function on CRAN; the second line in the code is expecting a newdata, as newdata <- Also as pointed out by @Vincent, you’re better off converting your variables to factor before calling the NB algorithm, although this has certainly nothing to do with the above errors.

Using NaiveBayes from the klar package, no such problem would happen. E.g.,

data(spam, package="ElemStatLearn")

# set up a training sample
train.ind <- sample(1:nrow(spam), ceiling(nrow(spam)*2/3), replace=FALSE)

# apply NB classifier
nb.res <- NaiveBayes(spam ~ ., data=spam[train.ind,])

# predict on holdout units
nb.pred <- predict(nb.res, spam[-train.ind,])

# but this also works on the training sample, i.e. without using a `newdata`

plotting bar graphs in R using ggplot2

Do you mean something like this? (I’m not sure whether you want to add value on top of the graph, so I added them, but you can safely remove the last line if you don’t need them.)

tmp <- c(150.3,84.61,31.45,11.08,-0.19,-57.83,-88.63,-98.39)
dd <- data.frame(y=tmp, x=LETTERS[1:8])
ggplot(dd, aes(x=x, y=y)) + geom_bar(fill="darkgrey") +
  labs(x="Premium change", y="Total Expected Profit") +
  geom_text(aes(x=x, y=ifelse(y>0, y+5, y-5),
            label=y), size=4, colour="white")

It would be even nicer to add + coord_flip(), IMO.

What’s wrong in your code?

  1. The ggplot() function is expecting a data.frame, from which it can extract named variable, e.g. for the aesthetics parameters x= and y=. So, first of all, you need to convert your object into a proper data.frame and name it, for you can its value through aes():

    toplot_noind <-
    names(toplot_noind) <- y

which is better that using the same name as your data.frame. (Note, however, that it will inherit its name with the cast operation.)

  1. Then, the x- and y-labels must be outside the aes() function. I don’t use qplot() but I think using xlab= and ylab= works fine there. With ggplot , I prefer the labs() or xlab()/ylab() functions. E.g.

  2. You need to have x represented as a factor.

  3. The dodge aspect does not seem necessary here, because you don’t have a second cross-classifying factor (see an example of use in help(position_dodge)).

In sum, your corrected code would look like:

toplot_noind <-
ggplot(toplot_noind, aes(x = as.factor(1:8), y = toplot_noind)) +
  geom_bar() +
  xlab("premium change") +
  ylab("Total Expected Profit")

With R, loop over data frames, and assign appropriate names to objects created in the loop

Use a list to store the results of your regression models as well, e.g.

foo <- function(n) return(transform(X <-, rnorm(n))),
                                                       y = V1+V2+rnorm(n)))
write.csv(foo(10), file="dat1.csv")
write.csv(foo(10), file="dat2.csv")
csvdat <- list.files(pattern="dat.*csv")
lm.res <- list()
for (i in seq(along=csvdat))
  lm.res[[i]] <- lm(y ~ ., data=read.csv(csvdat[i]))
names(lm.res) <- csvdat

creating a more continuous color palette in r, ggplot2, lattice, or latticeExtra

As far as lattice is concerned, you can set up your colors palette with RColorBrewer (or even colorspace). Using the example provided by @Chase, but with positive value for z:

dat <- data.frame(x = rnorm(1000), y = rnorm(1000), z = sample(0:40, 1000, TRUE))
# see, e.g.
# display.brewer.all(9, type="seq")
# display.brewer.pal(11, "RdBu")
my.col <- colorRampPalette(brewer.pal(11, "RdBu"))(diff(range(dat$z)))
xyplot(y ~ x, data=dat, col=my.col[dat$z], pch=19, alpha=.5)

Note that it is also necessary here to increase the range of available colors by interpolation. Also, with levelplot(), you might want to play with cut= and pretty=.

Is there a sensible way to do something like docstrings in R?

You can add any attributes you like to R objects, including function. So something like

describe <- function(obj) attr(obj, "help")
foo <- function(t=NULL) ifelse(!is.null(t), t^2, "Or maybe not")
attr(foo, "help") <- "Here is the help message"

produces more or less the desired output

> foo(2)
[1] 4
> foo()
[1] "Or maybe not"
> describe(foo)
[1] "Here is the help message"

How to add LibSVM class to WEKA classpath on a Mac

You can put libsvm.jar in a folder of your choice, e.g.  ~/Library/Java, and then run weka from the command-line as follows:

$ java -Xmx512m -classpath /Users/chl/weka/weka.jar:/Library/Java/libsvm.jar weka.gui.GUIChooser

You can also update your CLASSPATH to reflect locations where *.jar files can be found. I have the following in my .profile:

export CLASSPATH="/Users/chl/weka/weka.jar:~/Library/Java/*"

(You will need to replace /Users/chl/weka to reflect the correct location of your weka.jar; usually, it is located at the top of the directory if you downloaded the source files, or under if you use the bundled app.)

This way, the first command to start weka GUI simplifies to

$ java -classpath $CLASSPATH:weka.jar:libsvm.jar weka.gui.GUIChooser

Don’t use java -jar since it will override the CLASSPATH, as discussed here.

I also have an alias in my .bash_aliases for wrapping all that stuff:

alias weka='java -Xmx512m -classpath $CLASSPATH:weka.jar'

in order to use weka from the command-line as, e.g.

$ weka weka.classifiers.trees.RandomTree -t iris.arff -i

Is there a Python module to open SPSS files?

Depending on what you want to do–process data using R-related commands from rpy2, or switch to Python–the [solution provided by @Spacedman] ( on a related thread might easily be adapted to suit your needs.

Otherwise, Pandas includes a convenient wrapper for rpy2. Here is an example of use with Peat and Barton’s weights.sav data set:

>>> import pandas.rpy.common as com
>>> filename = "weights.sav"
>>> w = com.robj.r('foreign::read.spss("%s",' % filename)
>>> w = com.convert_robj(w)
>>> w.head()
1  L001    3.95    55.5   37.5  Female  tertiary  3 or more siblings
2  L003    4.63    57.0   38.5  Female  tertiary           Singleton
3  L004    4.75    56.0   38.5    Male    year12          2 siblings
4  L005    3.92    56.0   39.0    Male  tertiary         One sibling
5  L006    4.56    55.0   39.5    Male    year10          2 siblings

Adding error bars on a bar graph in gnuplot

Assuming you want to draw side-by-side bar charts with associated error bars, I would use the following:

set xrange [-0.5:12.75]
set xtic rotate by -45
set boxwidth 0.25
plot 'file.dat' using ($0-.05):4:5:xtic(1) with boxerrorbars title col, \\
     '' using ($0+0.25):2:3 with boxerrorbars title col

The idea is just to offset one of the two measures on the x-axis.

Variables Overview with xtable in R

Another package to look at is reporttools. Here is an short piece of code to illustrate its usage on the tips dataset from reshape package. Both the summary statements produce latex code which can be copy pasted into a document, or used for weaving.

data(tips, package = 'reshape')

# summarize numeric variables
tableContinuous(tips[,sapply(tips, is.numeric)])

# summarize non-numeric variables
tableNominal(tips[,!sapply(tips, is.numeric)])

EDIT. If you really MUST use str, then here is one way to go about it

str_cars = capture.output(str(cars))


Centering title in R viewport with multiple graphs ggplot2

I think you might be interested in the gridExtra package, which provides the grid.arrange() function that fulfills everything you wonder about.

With @Kevin’s example, the command would be

grid.arrange(plots[[1]], plots[[2]], ncol=2,
             main="test main", sub="subtitle test")

How do I compute the number of occurrences of a particular value in a row in R

There’re two ways to use the apply family functions. Either you do

apply(mat, 1, sum, na.rm=TRUE)

if you want to apply the function sum()to each row, passing additional parameters like na.rm=TRUE. Or you can do

apply(mat, 1, foo)

where foo() is a function of your own, defined externally, e.g.

foo <- function(x) sum(x==0, na.rm=TRUE)

Note that NA handling might also be dealt with a parameter of the function itself, with default value set to TRUE, in the above definition, as in

foo2 <- function(x, sum(x==0,

and you can call it as apply(mat, 1, foo2, although it doesn’t really make sense with the sum() function (unless you want to check if there’re NA values, but in this case it’s better to use :-).

Finally, you can define your function directly inline as

apply(mat, 1, function(x) sum(x==0, na.rm=TRUE))

In your case, it gives me

> apply(mat, 1, function(x) sum(x==0, na.rm=TRUE))
 1  2  3  4  5  6
22  4  9  8  7  2

which is equivalent to apply(ex, 1, foo).

R: Print two tables with xtable ()

I would recommend saving the results as two separate tables in different files (see the file= option to print.xtable()), and then input them into your LaTeX document with any command you find appropriate for your layout ( tabular, subfloat, minipage, etc.). This is what I do in general, although I generally rely on LaTeX facilities in the Hmisc package. Should you want only to print them as a standalone PDF, use the standalone class for your document.

So, here is an example:

fm1 <- aov(tlimth ~ sex + ethnicty + grade + disadvg, data=tli)
print(xtable(fm1), file="ta.tex", floating=FALSE)
print(xtable(head(tli, n=5)), file="tb.tex", floating=FALSE)

then, a quick tex wrapper (compile with pdflatex):



\subfloat[Table x(a)]{\label{tab:tab1a}\scalebox{.5}{\input{./ta}}}\quad
\subfloat[Table x(b)]{\label{tab:tab1b}\scalebox{.5}{\input{./tb}}}
\caption{Caption about here}


Here is the result:

Remove the \scalebox command for default (stacked) layout, unless they are narrow enough to fit at their default size, as noted by @David.

Multiple histograms in ggplot2

You can try grid.arrange() from the gridExtra package; i.e., store your plots in a list (say qplt ), and use, qplt)

Other ideas: use facetting within ggplot2 (sex*variable ), by considering a data.frame (use melt).

As a sidenote, it would be better to use stacked barchart or Cleveland’s dotplot for displaying items response frequencies, IMO. (I gave some ideas on CrossValidated.)

For the sake of completeness, here are some implementation ideas:

# simple barchart
ggplot(melt(dat), aes(x=as.factor(value), fill=as.factor(value))) +
  geom_bar() + facet_grid (variable ~ sex) + xlab("") + coord_flip() +
my.df <- ddply(melt(dat), c("sex","variable"), summarize,
my.df$resp <- gl(3, 1, length=nrow(my.df), labels=0:2)

# stacked barchart
ggplot(my.df, aes(x=variable, y=count, fill=resp)) +
  geom_bar() + facet_wrap(~sex) + coord_flip()
# dotplot
ggplot(my.df, aes(x=count, y=resp, colour=sex)) + geom_point() +
  facet_wrap(~ variable)

Visualizing Weka classification tree

Look here, for example.

First you have to fit your decision tree (I used the J48 classifier on the iris dataset), in the usual way. In the results list panel (bottom left on Weka explorer), right click on the corresponding output and select “Visualize tree” as shown below.

If you have installed the Prefuse plugin, you can even visualize your tree on a more pretty layout.

How to display a stacked barchart in Gnuplot?

Assuming your data are stored in the file 1.dat, stacked barcharts might be generated as follows:

set style data histograms
set style histogram rowstacked
set boxwidth 1 relative
set style fill solid 1.0 border -1
set yrange [0:1.5]
set datafile separator " "
plot '1.dat' using 2 t "Var 1", '' using 3:xticlabels(1) t "Var 2"

As you can see, barcharts are no different from histograms (at least, from within Gnuplot). More information can be found on gnuplot demo page.

python (bool) ? then : else syntax?

You can use (x if cond else y), e.g.

>>> x = 0
>>> y = 1
>>> print("a" if x < y else "b")

That will work will lambda function too.

Get last record of a table in Postgres

you can use

SELECT timestamp, value, card
FROM my_table
ORDER BY timestamp DESC

assuming you want also to sort by timestamp?

How to use subscripts in ggplot2 legends [R]

The following should work (remove your line with names(temp) <- …):

ggplot(temp.m, aes(x = value, linetype = variable)) +
  geom_density() + facet_wrap(~ a) +
                          labels=c(expression(b[1]), expression(c[1])))

See help(scale_linetype_discrete) for available customization (e.g. legend title via name=).

Sending an email from R using the sendmailR package

Are you able to send email via the command-line?

So, first of all, fire up a Terminal and then

$ echo “Test 123” | mail -s “Test”

Look into /var/log/mail.log, or better use

$ tail -f /var/log/mail.log

in a different window while you send your email. If you see something like

... setting up TLS connection to[]:587
... Trusted TLS connection established to[]:587:\
    TLSv1 with cipher RC4-MD5 (128/128 bits)

then you succeeded. Otherwise, it means you have to configure you mailing system. I use postfix with Gmail for two years now, and I never had have problem with it. Basically, you need to grab the Equifax certificates, Equifax_Secure_CA.pem from here: (They were using Thawtee certificates before but they changed last year.) Then, assuming you used Gmail,

  1. Create relay_password in /etc/postfix and put a single line like this (with your correct login and password):

    then in a Terminal,

     $ sudo postmap /etc/postfix/relay_password

    to update Postfix lookup table.

  2. Add the certificates in /etc/postfix/certs, or any folder you like, then

     $ sudo c_rehash /etc/postfix/certs/

    (i.e., rehash the certificates with Openssl).

  3. Edit /etc/postfix/ so that it includes the following lines (adjust the paths if needed):

    relayhost =
    smtp_sasl_auth_enable = yes
    smtp_sasl_password_maps = hash:/etc/postfix/relay_password
    smtp_sasl_security_options = noanonymous
    smtp_tls_security_level = may
    smtp_tls_CApath = /etc/postfix/certs
    smtp_tls_session_cache_database = btree:/etc/postfix/smtp_scache
    smtp_tls_session_cache_timeout = 3600s
    smtp_tls_loglevel = 1
    tls_random_source = dev:/dev/urandom
  4. Finally, just reload the Postfix process, with e.g.

    $ sudo postfix reload

(a combination of start/stop works too).

You can choose a different port for the SMTP, e.g. 465. It’s still possible to use SASL without TLS (the above steps are basically the same), but in both case the main problem is that your login informations are available in a plan text file… Also, should you want to use your MobileMe account, just replace the Gmail SMTP server with

How can I generate ascii “graphical output” from R?

You should look at the fairly recent txtplot package. Currently, it includes scatterplot, line plot, density plot, acf, and bar chart.

From the online help,

> txtplot(cars[,1], cars[,2])
120 +                                                     *    +
    |                                                          |
100 +                                                          +
    |                                                     * *  |
 80 +                           *         *                    +
    |                                        *       *    *    |
 60 +                           *              *               +
    |                              *    * *    *       *       |
 40 +                         *      *  * *  *                 +
    |                 *       * *    *  *    * *               |
 20 +         *       *  * *  * *  *                           +
    |            *    *  * *                                   |
    |  *      *    *                                           |
  0 +----+------------+------------+-----------+------------+--+
         5           10           15          20           25

Outlier detection in data mining

I will limit myself to what I think is essential to give some clues about all of your questions, because this is the topic of a lot of textbooks and they might probably be better addressed in separate questions.

  1. I wouldn’t use k-means for spotting outliers in a multivariate dataset, for the simple reason that the k-means algorithm is not built for that purpose: You will always end up with a solution that minimizes the total within-cluster sum of squares (and hence maximizes the between-cluster SS because the total variance is fixed), and the outlier(s) will not necessarily define their own cluster. Consider the following example in R:

    set.seed(123) sim.xy <- function(n, mean, sd) cbind(rnorm(n, mean 1, sd1), rnorm(n, mean2,sd 2))

    generate three clouds of points, well separated in the 2D plane

    xy <- rbind(sim.xy(100, c(0,0), c(.2,.2)), sim.xy(100, c(2.5,0), c(.4,.2)), sim.xy(100, c(1.25,.5), c(.3,.2))) xy[1,] <- c(0,2) # convert 1st obs. to an outlying value km3 <- kmeans(xy, 3) # ask for three clusters km4 <- kmeans(xy, 4) # ask for four clusters

As can be seen in the next figure, the outlying value is never recovered as such: It will always belong to one of the other clusters.

One possibility, however, would be to use a two-stage approach where one’s removing extremal points (here defined as vector far away from their cluster centroids) in an iterative manner, as described in the following paper: Improving K-Means by Outlier Removal (Hautamäki, et al.).

This bears some resemblance with what is done in genetic studies to detect and remove individuals which exhibit genotyping error, or individuals that are siblings/twins (or when we want to identify population substructure), while we only want to keep unrelated individuals; in this case, we use multidimensional scaling (which is equivalent to PCA, up to a constant for the first two axes) and remove observations above or below 6 SD on any one of say the top 10 or 20 axes (see for example, Population Structure and Eigenanalysis, Patterson et al., PLoS Genetics 2006 2(12)).

A common alternative is to use ordered robust mahalanobis distances that can be plotted (in a QQ plot) against the expected quantiles of a Chi-squared distribution, as discussed in the following paper:

R.G. Garrett (1989). The chi-square plot: a tools for multivariate outlier recognition. Journal of Geochemical Exploration 32(1/3): 319-341.

(It is available in the mvoutlier R package.)

  1. It depends on what you call user input. I interpret your question as whether some algorithm can process automatically a distance matrix or raw data and stop on an optimal number of clusters. If this is the case, and for any distance-based partitioning algorithm, then you can use any of the available validity indices for cluster analysis; a good overview is given in

Handl, J., Knowles, J., and Kell, D.B. (2005). Computational cluster validation in post-genomic data analysis. Bioinformatics 21(15): 3201-3212.

that I discussed on Cross Validated. You can for instance run several instances of the algorithm on different random samples (using bootstrap) of the data, for a range of cluster numbers (say, k=1 to 20) and select k according to the optimized criteria taht was considered (average silhouette width, cophenetic correlation, etc.); it can be fully automated, no need for user input.

There exist other forms of clustering, based on density (clusters are seen as regions where objects are unusually common) or distribution (clusters are sets of objects that follow a given probability distribution). Model-based clustering, as it is implemented in Mclust, for example, allows to identify clusters in a multivariate dataset by spanning a range of shape for the variance-covariance matrix for a varying number of clusters and to choose the best model according to the BIC criterion.

  1. This is a hot topic in classification, and some studies focused on SVM to detect outliers especially when they are misclassified. A simple Google query will return a lot of hits, e.g.  Support Vector Machine for Outlier Detection in Breast Cancer Survivability Prediction by Thongkam et al. (Lecture Notes in Computer Science 2008 4977/2008 99-109; this article includes comparison to ensemble methods). The very basic idea is to use a one-class SVM to capture the main structure of the data by fitting a multivariate (e.g., gaussian) distribution to it; objects that on or just outside the boundary might be regarded as potential outliers. (In a certain sense, density-based clustering would perform equally well as defining what an outlier really is is more straightforward given an expected distribution.)

Other approaches for unsupervised, semi-supervised, or supervised learning are readily found on Google, e.g.

A related topic is anomaly detection, about which you will find a lot of papers.

  1. That really deserves a new (and probably more focused) question :-)

Variable width bars in ggplot2 barplot in R

How about using width= after rescaling your d vector, say by a constant amount?

ggplot(dat, aes(x=a, y=b, width=d/100)) +
  geom_bar(aes(fill=a), stat="identity", position="identity")

Zip or enumerate in R?

There have been some discussions around list comprehension for R, e.g.  here or there. The hash package even offers dictionary-like structure. However, as others said, it is difficult to try to map one language facilities onto another (even if this is what Comparison of programming languages actually offers) without a clear understanding of what it is supposed to be used to. For example, I can mimic Python zip() in R as follows:


In [1]: x = [1,2,3]
In [2]: y = [4,5,6]
In [3]: zip(x, y)
Out[3]: [(1, 4), (2, 5), (3, 6)]


> x <- 1:3
> y <- 4:6
> list(x, y)                     # gives a simple list
> as.list(paste(x, y))           # three tuples, as a list of characters
> mapply(list, x, y, SIMPLIFY=F) # gives a list of 3 tuples
> rbind(x, y)                    # gives a 2x3 matrix

As can be seen, this really depends on what you want to do with the result afterwards.

gnuplot conditional plotting: plot col A:col B if col C == x

Consider the following dataset (1.dat),

1 0.8 0
2 0.6 0
3 0.9 1
4 1.1 0
5 0.7 0
6 0.6 1

where we want to plot the first two columns only when the third one equals zero. Then you can try this:

plot '1.dat' using 1:($3==0?$2:1/0)

(Credit to markjoe on Gnuplot mailing-list.)

Subset dataframe by multiple logical conditions of rows to remove

Try this

subset(data, !(v1 %in% c("b","d","e")))

How to compute error rate from a decision tree?

Assuming you mean computing error rate on the sample used to fit the model, you can use printcp(). For example, using the on-line example,

> library(rpart)
> fit <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis)
> printcp(fit)

Classification tree:
rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis)

Variables actually used in tree construction:
[1] Age   Start

Root node error: 17/81 = 0.20988

n= 81

        CP nsplit rel error  xerror    xstd
1 0.176471      0   1.00000 1.00000 0.21559
2 0.019608      1   0.82353 0.82353 0.20018
3 0.010000      4   0.76471 0.82353 0.20018

The Root node error is used to compute two measures of predictive performance, when considering values displayed in the rel error and xerror column, and depending on the complexity parameter (first column):

Note that it is more or less in agreement with classification accuracy from tree:

> library(tree)
> summary(tree(Kyphosis ~ Age + Number + Start, data=kyphosis))

Classification tree:
tree(formula = Kyphosis ~ Age + Number + Start, data = kyphosis)
Number of terminal nodes:  10
Residual mean deviance:  0.5809 = 41.24 / 71
Misclassification error rate: 0.1235 = 10 / 81

where Misclassification error rate is computed from the training sample.

How to increase the space between the bars in a bar plot in ggplot2?

You can always play with the width parameter, as shown below:

df <- data.frame(x=factor(LETTERS[1:4]), y=sample(1:100, 4))
ggplot(data=df, aes(x=x, y=y, width=.5)) +
  geom_bar(stat="identity", position="identity") +
  opts(title="width = .5") + labs(x="", y="") +

Compare with the following other settings for width:

alt text

So far, so good. Now, suppose we have two factors. In case you would like to play with evenly spaced juxtaposed bars (like when using space together with beside=TRUE in barplot() ), it’s not so easy using geom_bar(position="dodge"): you can change bar width, but not add space in between adjacent bars (and I didn’t find a convenient solution on Google). I ended up with something like that:

df <- data.frame(g=gl(2, 1, labels=letters[1:2]), y=sample(1:100, 4))
x.seq <- c(1,2,4,5)
ggplot(data=transform(df, x=x.seq), aes(x=x, y=y, width=.85)) +
  geom_bar(stat="identity", aes(fill=g)) + labs(x="", y="") +
  scale_x_discrete(breaks = NA) +
  geom_text(aes(x=c(sum(x.seq[1:2])/2, sum(x.seq[3:4])/2), y=0,
                label=c("X","Y")), vjust=1.2, size=8)

The vector used for the x -axis is “injected” in the data.frame, so that so you change the outer spacing if you want, while width allows to control for inner spacing. Labels for the x -axis might be enhanced by using scale_x_discrete().

alt text

Read all files in a folder and apply a function to each data frame

On the contrary, I do think working with list makes it easy to automate such things.

Here is one solution (I stored your four dataframes in folder temp/).

filenames <- list.files("temp", pattern="*.csv", full.names=TRUE)
ldf <- lapply(filenames, read.csv)
res <- lapply(ldf, summary)
names(res) <- substr(filenames, 6, 30)

It is important to store the full path for your files (as I did with full.names), otherwise you have to paste the working directory, e.g.

filenames <- list.files("temp", pattern="*.csv")
paste("temp", filenames, sep="/")

will work too. Note that I used substr to extract file names while discarding full path.

You can access your summary tables as follows:

> res$`df4.csv`
       A              B
 Min.   :0.00   Min.   : 1.00
 1st Qu.:1.25   1st Qu.: 2.25
 Median :3.00   Median : 6.00
 Mean   :3.50   Mean   : 7.00
 3rd Qu.:5.50   3rd Qu.:10.50
 Max.   :8.00   Max.   :16.00

If you really want to get individual summary tables, you can extract them afterwards. E.g.,

for (i in 1:length(res))
  assign(paste(paste("df", i, sep=""), "summary", sep="."), res[[i]])

Comments in Markdown

I use standard HTML tags, like

your comment goes here
and here

Note the triple dash. The advantage is that it works with pandoc when generating TeX or HTML output. More information is available on the pandoc-discuss group.