## Scatterplot legend and fill not working in Octave

The following works for me:

n = 100;
x = randn(n, 1);
y = randn(n, 1);
S = rand(n, 1)*20;
hold on
scatter(x(1:50), y(1:50), S(1:50), "red", "filled")
scatter(x(51:100), y(51:100), S(51:100), "green", "filled")
hold off
print('-depsc', 'bubbleplot.eps');
enter image description here

However, I’m not able to add a legend, and I didn’t find any bug report or indication of a missing functionality for this. So, as an alternative, I would suggest adding marker and text to your plot.

## How to get the size of tar.gz in (MB) file in python

According to Python on-line docs, you should look at the size attributes of a TarInfo object.

## Getting dataset for building association rules with Weka

Apart from the example dataset used in the following class, Association Rule Mining with WEKA, you might want to try the market-basket dataset.

Also, please note that several datasets are listed on Weka website, in the Datasets section, some of them coming from the UCI repository (e.g., the Plants Data Set).

## R Using a for() loop to fill one dataframe with another

Not sure I fully understood your question, but you can try this:

df1 <- data.frame(col1=letters[1:26], col2=sample(1:100, 26))
df2 <- with(df1, expand.grid(col1=col1, col2=col1))
df2$col3 <- df1$col2

The last command use recycling (it could be writtent as  rep(df1$col2, 26) as well). The results are shown below: > head(df1, n=3) col1 col2 1 a 68 2 b 73 3 c 45 > tail(df1, n=3) col1 col2 24 x 22 25 y 4 26 z 17 > head(df2, n=3) col1 col2 col3 1 a a 68 2 b a 73 3 c a 45 > tail(df2, n=3) col1 col2 col3 674 x z 22 675 y z 4 676 z z 17 ## Regarding the approach of displaying a classification/decision tree from the command line? I’m not sure if you are looking for something very elaborated, but basically decision trees are printed when you use weka.classifiers.trees. There is an illustration with Random Tree in a response of mine , but here is what you would get using the J48 algorithm on the iris dataset: ~/weka/data$ weka weka.classifiers.trees.J48 -t iris.arff  -i

J48 pruned tree
------------------

petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
|   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)

Number of Leaves  :     5

Size of the tree :  9

Time taken to build model: 0.08 seconds
Time taken to test model on training data: 0.01 seconds

## Trying to Install RVM 1.8.7

If you don’t have rvm on your system, then follow the instructions here: Installing RVM. Basically, you have to download a shell script that will does the job for you (fetch the latest release and install it). Then, just add

[[ -s "$HOME/.rvm/scripts/rvm" ]] && . "$HOME/.rvm/scripts/rvm"

in your .profile, .bash_profile, or whatever you use.

## SQL window function in R Language

You can use aggregate and merge if you are familiar with SQL syntax. Taking one of the example from the PostgreSQL manual, we would use

empsalary <- data.frame(depname=rep(c("develop", "personnel", "sales"), c(5, 2, 3)),
empno=c(11, 7, 9, 8, 10, 5, 2, 3, 1, 4),
salary=c(5200, 4200, 4500, 6000, 5200, 3500, 3900, 4800, 5000, 4800))
merge(empsalary, aggregate(salary ~ depname, empsalary, mean), by="depname")

to reproduce the first example (compute average salary by  depname).

     depname empno salary.x salary.y
1    develop    11     5200 5020.000
2    develop     7     4200 5020.000
3    develop     9     4500 5020.000
4    develop     8     6000 5020.000
5    develop    10     5200 5020.000
6  personnel     5     3500 3700.000
7  personnel     2     3900 3700.000
8      sales     3     4800 4866.667
9      sales     1     5000 4866.667
10     sales     4     4800 4866.667

You may probably want to look at what plyr has to offer for more elaborated construction.

## p-value matrix of x and y variables from anova output

Here is one solution, which consists in generating all combinations of Y- and X-variables to test (we cannot use combn) and run a linear model in each case:

dfrm <- data.frame(y=gl(ncol(yvars), ncol(xvars), labels=names(yvars)),
x=gl(ncol(xvars), 1, labels=names(xvars)), pval=NA)
## little helper function to create formula on the fly
fm <- function(x) as.formula(paste(unlist(x), collapse="~"))
## merge both datasets
full.df <- cbind.data.frame(yvars, xvars)
## apply our LM row-wise
dfrm$pval <- apply(dfrm[,1:2], 1, function(x) anova(lm(fm(x), full.df))$Pr(>F)[1])
## arrange everything in a rectangular matrix of p-values
res <- matrix(dfrm$pval, nc=3, dimnames=list(levels(dfrm$x), levels(dfrm$y))) Sidenote: With high-dimensional datasets, relying on the QR decomposition to compute the p-value of a linear regression is time-consuming. It is easier to compute the matrix of Pearson linear correlation for each pairwise comparisons, and transform the r statistic into a Fisher-Snedecor F using the relation F = νa r2/(1-r2), where degrees of freedom are defined as νa=(n-2)-#{(xi=NA),(yi=NA)} (that is, (n-2) minus the number of pairwise missing values–if there’re no missing values, this formula is the usual coefficient R2 in regression). ## R multiple file “split and plot” Assuming you already have your data frame set up correctly, how about using aggregate (or ddply from the plyr package)? Here is a toy example with one such data frame (you will need to embed this in your loop or write a custom function). L01_001 <- data.frame(Cases=gl(5, 2, 5*2*2, labels=c("AAA","BBB","CCC","DDD","EEE")), replicate(3, rnorm(5*2*2))) mean.by.case <- with(L01_001, aggregate(L01_001[,-1], list(Cases=Cases), mean)) ## opar <- par(mfrow=c(nlevels(L01_001$Cases), 1))
## apply(mean.by.case[,-1], 1, function(x) barplot(x))
## par(opar)
library(lattice)
barchart(~ X1 + X2 + X3 | Cases, mean.by.case)

I would not recommend using bar charts for visualizing your data: they are incredibly bad at showing subtle variation in your data and have a poor data-ink ratio. Cleveland’s dot plot or level plot would do the job, in my opinion. In the later case, you can even represent everything on a single page, which looks like a pretty sound alternative to “100 plot with 26 bar in it.”

## applying the pvclust R function to a precomputed dist object

It’s not clear to me whether you only have a distance matrix, or you computed it beforehand. In the former case, as already suggested by @Vincent, it would not be too difficult to tweak the R code of pvclust itself (using fix() or whatever; I provided some hints on [another question on CrossValidated][1]). In the latter case, the authors of [pvclust][2] provide an [example][3] on how to use a custom distance function, although that means you will have to install their “unofficial version”.

## Default value in input octave

You can try something like this:

function val = get_response(default="Y")
val = input("Your choice? [Y]/N ", "s");
if isempty(val)
val = default;
endif
endfunction

I wrapped this into a function but you can use the enclosed code directly, of course.

## Gnuplot, CSV, only distinct labels?

Some further suggestions:

set pointsize 2.5
set xtics ("a" 1, "b" 5, "c" 9, "d" 10.5)
set xrange [0:12]
plot "< cat -n aa.csv | sed -e 's/,/  /g'" using 1:3 pt 7 notitle

yields unequally spaced labels in place of x-ticks (as suggested by @choroba, which also provided a way to automate the determination of mid-interval for larger datasets). (Here aa.csv contains data as shown in your question.)

enter image description here

If you want labels to appear in the plotting region rather than under the x-axis, then you can start with

set format x ""

and then place your labels above the highest value found in any of the above categories, a, …, d.

## 3D parametric graph in Gnuplot

You don’t really need to rely on parametric if you just want to plot raw data from a file.

With sample data xyz.txt that looks like noisy sinusoidal time-series:

1 0 9.43483356296457
1 0.0204081632653061 10.2281885806631
1 0.0408163265306122 10.9377108185805
...
3 0.959183673469388 10.2733398482972
3 0.979591836734694 10.1662011241681
3 1 10.4628112585751

(First column is an integer values coding x locations, 2nd and 3rd columns are for y and z. I appended the R script I used to generate those data at the end.)

I would simply use

splot 'xyz.txt' using 1:2:3 with impulses

where impulses draws vertical lines from the minimum of z. You can change this bevahior; for example, if you want to start at z=0, you could use

set zrange [-1:12]
set ticslevel 0
zmin = 0
zr(x) = (x==0 ? zmin : x)
splot 'xyz.txt' using 1:2:(zr($3)) with impulses Note that I used regularly spaced values for y, but that doesn’t really matter. enter image description here Other variations are possible, of course. Maybe Fence plots with a some-liner or Wall charts might be of further assistance. R script used to generate xyz.dat: n <- 50 # number of observation per sequence k <- 3 # number of sequences t <- seq(0, 1, length=n) # sampling rate xyz <- data.frame(x=rep(1:k, each=n), y=t, z=sin(2*pi*t*2.1)+.25*rnorm(n*k)+10) write.table(xyz, file="xyz.txt", row.names=FALSE, col.names=FALSE) ## How to get this matrix If I understand your question correctly, you want to fill a matrix with repeated entries (x,0) and (x,1), where x=1…4, where repetition is determined by values found in column A and B. Given the values you supplied that’s going to be a huge matrix (67,896,086 rows). So, you could try something like this (replace m below, which has less elements for illustrative purpose): m = [1, 2, 1; 2, 3, 2; 3, 2, 1; 4, 2, 2]; res = []; for k = 1:4 res = [res ; [k*ones(m(k, 2), 1), zeros(m(k, 2), 1); k*ones(m(k, 3), 1), ones(m(k, 3), 1)]]; endfor which yields res = 1 0 1 0 1 1 2 0 2 0 2 0 2 1 2 1 3 0 3 0 3 1 4 0 4 0 4 1 4 1 Out of curiosity, is there any reason not to consider a matrix like 1 0 n 1 1 m 2 0 p 2 1 q ... where n, m, p, q, are values found in columns A and B. This would probably be easier to handle , no? ## Plotting results of several Bland-Altman analysis I can’t provide you with working R code as you didn’t supply raw data (which are needed for boxplots), and it is not clear what you want to display as nothing indicates where your gold standard comes into play in the given aggregated data (are these repeated measurements with different instruments?), unless the reported means stand for difference between the ith method and the reference method (in which case I don’t see how you could use a boxplot). A basic plot of your data might look like dfrm <- data.frame(method=LETTERS[1:3], lcl=c(-5,-9,-8), mean=c(4,2,4), ucl=c(15,13,16), var=c(27,33,36)) # I use stripchart to avoid axis relabeling and casting of factor to numeric # with default plot function stripchart(mean ~ seq(1,3), data=dfrm, vertical=TRUE, ylim=c(-10,20), group.names=levels(dfrm$method), pch=19)
with(dfrm, arrows(1:3, mean-lcl, 1:3, mean+lcl, angle=90, code=3, length=.1))
abline(h=0, lty=2)

However, I can recommend you to take a look at the MethComp package which will specifically help you in comparing several methods to a gold standard, with or without replicates, as well as in displaying results. The companion textbook is

Carstensen, B. Comparing Clinical Measurement Methods. John Wiley & Sons Ltd 2010

enter image description here

## Coloring points instead of lines in interaction.plot (R)I

The problem is that interaction.plot actually relies on  tapply() to compute aggregated summary measures (mean or median). As interaction.plot just calls matplot, you can use the latter directly if you like, or summarize your data with plyr and plot the results with ggplot2 (more flexible).

# consider the 64x4 dataset OrchardSprays and create a fake
# two-levels factor, say grp, which is in correspondence to rowpos odd/even values
grp <- gl(2, 1, 8, labels=letters[1:2])
# the following won't work (due to color recycling)
with(OrchardSprays,
interaction.plot(treatment, rowpos, decrease,
type="b", pch=19, lty=1, col=as.numeric(colpos), legend=F))
# what is used to draw the 8 lines is computed from
with(OrchardSprays,
tapply(decrease, list(treatment=treatment, rowpos=rowpos), mean))
# the following will work, though
with(OrchardSprays,
interaction.plot(treatment, rowpos, decrease,
type="b", pch=19, lty=1, col=as.numeric(grp), legend=F))

In short, assuming you can find an adequate mapping between  gender and your trace factor (age_bucket), you need to construct a vector of colors of size nlevels(age_bucket) .

## creating z-scores for subset of rows

No need for a loop in this case. Use vectorization, instead. Let’s consider the following simulated data: (not sure it will reproduce exactly your dataset, but hopefully you’ll get the general idea)

dfrm <- data.frame(cond=gl(2, 1, 100, labels=LETTERS[1:2]),
user=gl(50, 2, labels=paste("id", 1:20, sep="")),
sensitivity=runif(100, 1, 5))

Computing z-scores is as simple as

dfrm$z.sensitivity <- scale(dfrm$sensitivity)

If you want z-scores conditional on cond, then you can do either

with(dfrm, tapply(sensitivity, cond, scale))

or, using plyr,

ddply(dfrm, c("cond"), transform, sensitivity.z = scale(sensitivity))

## How to subset out some alphabetically arranged columns from a dataframe?

You can try something like this

dfrm <- data.frame(replicate(26, rnorm(10)))
colnames(dfrm) <- paste("COL", LETTERS, sep="_")
which(substr(colnames(dfrm), 5, 6) %in% LETTERS[3:6])

The last expression returns column number that match the letters C to F. See also match, and this related thread: Get column index from label in a data frame.

## Which set of python libraries i should learn for AI and Data mining stuff

In addition to the scikit cited by @aix, you might want to take a look at the following libraries:

• mlpy, an ML library for preprocessing, clustering, predictive classification, regression and feature selection.
• PyML, yet another ML library.
• pandas (Pythonic cross-section, time series, and statistical analysis), is a set of fast NumPy-based data structures optimized for panel, time series, and cross-sectional data analysis, with an emphasis on econometric applications.
• scikits.statsmodels , implements common statistical model (OLS/GLS, GLM, M-estimators, etc.). I really like this package, the syntax is clean and it feels like we didn’t leave R.

I’ll really second investigating orange capabilities which is a full-featured application for data mining, but you can call it from external scripts, see e.g. the Beginning with Orange tutorial to get an idea.

## Is there any publications about how to use association rules as SVM feature?

Here are some articles that might be (or not) relevant to your question:

1. Keivan Kianmehr and Reda Alhajj. Effective Classification by Integrating Support Vector Machine and Association Rule Mining. Lecture Notes in Computer Science, 2006, Volume 4224/2006, 920-927
2. He et al.  Extracting rule from SVM based on association rules, in Rule extraction from support vector machines (Ed. J Diederich), Springer, 2008

You may also want to google articles related to what is known as associative classification.

## data.frame with a column containing a matrix in R

The result you got (2 rows x 3 columns) is what is to be expected from R, as it amounts to cbind a vector (id, with recycling) and a matrix (m).

IMO, it would be better to use list or array (when dimensions agree, no mix of numeric and factors values allowed), if you really want to bind different data structures. Otherwise, just  cbind your matrix to an existing data.frame if both have the same number of rows will do the job. For example

x1 <- replicate(2, rnorm(10))
x2 <- replicate(2, rnorm(10))
x12l <- list(x1=x1, x2=x2)
x12a <- array(rbind(x1, x2), dim=c(10,2,2))

> str(x12l)
List of 2
$x1: num [1:10, 1:2] -0.326 0.552 -0.675 0.214 0.311 ...$ x2: num [1:10, 1:2] -0.164 0.709 -0.268 -1.464 0.744 ...
> str(x12a)
num [1:10, 1:2, 1:2] -0.326 0.552 -0.675 0.214 0.311 ...

Lists are easier to use if you plan to use matrix of varying dimensions, and providing they are organized in the same way (for rows) as an external data.frame you can subset them as easily. Here is an example:

df1 <- data.frame(grp=gl(2, 5, labels=LETTERS[1:2]),
age=sample(seq(25,35), 10, rep=T))
with(df1, tapply(x12lx1[,1], list(grp, age), mean)) You can also use lapply (for list) and apply (for array) functions. ## save align.plot() Can’t you just use grid.arrange() (instead of  align.plot()), and use standard png device instead of ggsave() which expect ggplot objects? dsamp <- diamonds[sample(nrow(diamonds), 1000), ] p1 <- qplot(carat, price, data=dsamp, colour=clarity) p2 <- qplot(carat, price, data=dsamp, colour=clarity, geom="path") library(gridExtra) png("a.png") grid.arrange(p1, p2) # add ncol=2 to arrange as two-column dev.off() enter image description here ## What javascript libraries can make a radial convergence graph? Two top-notch libraries for data visualization are ## Which packages make good use of S4 objects? I would go for kernlab, which additionally includes a lot of C code. It comes with an handy vignette, detailing some of S4 concepts. (It doesn’t seem to use roxygen for the documentation, though, but this is not the question here.) ## What is the best bioinformatics book for a computer scientist? I would suggest Bioinformatics, from Polanski and Kimmel (Springer, 2007). Very good introductory material on statistics, computer science as applied in bioinformatics. It also covers the necessary grounds in computational biology. ## How to plot a violin scatter boxplot (in R)? Try the vioplot package: library(vioplot) vioplot(rnorm(100)) (with awful default color ;-) There is also wvioplot() in the wvioplot package, for weighted violin plot, and beanplot, which combines violin and rug plots. They are also available through the lattice package, see ?panel.violin. ## How to capture R text+image output into one file (html, doc, pdf etc)? Well, I just remind that I was using Asciidoc for short reporting or editing webpage. Now there’s an R plugin ( ascii on CRAN), which allows to embed R code into an asciidoc document. The syntax is quite similar to Markdown or Textile, so you’ll learn it very fast. Output are (X)HTML, Docbook, LaTeX, and of course PDF through one of the last two backends. Unfortunately, I don’t think you can wrap all your code into a single statement. However, it supports a large number of R objects, see below. > methods(ascii) [1] ascii.anova* ascii.aov* ascii.aovlist* ascii.cast_df* [5] ascii.character* ascii.coxph* ascii.CrossTable* ascii.data.frame* [9] ascii.default* ascii.density* ascii.describe* ascii.describe.single* [13] ascii.factor* ascii.freqtable* ascii.ftable* ascii.glm* [17] ascii.htest* ascii.integer* ascii.list* ascii.lm* [21] ascii.matrix* ascii.meanscomp* ascii.numeric* ascii.packageDescription* [25] ascii.prcomp* ascii.sessionInfo* ascii.simple.list* ascii.smooth.spline* [29] ascii.summary.aov* ascii.summary.aovlist* ascii.summary.glm* ascii.summary.lm* [33] ascii.summary.prcomp* ascii.summary.survfit* ascii.summary.table* ascii.survdiff* [37] ascii.survfit* ascii.table* ascii.ts* ascii.zoo* Non-visible functions are asterisked ## Is there a way to access Stata from eclipse? None that I know. If you’re a CLI junky or willing to use Emacs, you might find limited support through the ESS package and the ado-mode. This is what I used on Mac OS X when I want to run short snippet of code, or use Stata in batch mode, but there’s no interactive graphical output (you can just save graphics as PDF as usual). The  ado-mode provides basic syntax highlighting and can send region or buffer to a running instance of Stata GUI program (not the executable file, stata-*, that is being used by ESS ). Here are two screenshots of (top) edition of code in Emacs with the ado-mode, and (bottom) an interactive Stata session (no plot produced). enter image description here enter image description here Some notes on text editors for Stata users provides a list of text editors that can be used with Stata (without interactive facilities, though). ## What data do I need to implement k nearest neighbor? What you described sounds like a recommender system engine, not a clustering algorithm like k-means which in essence is an unsupervised approach. I cannot make myself a clear idea of what reddit uses actually, but I found some interesting post by googling around “recommender + reddit”, e.g. Reddit, Stumbleupon, Del.icio.us and Hacker News Algorithms Exposed! Anyway, the k-NN algorithm (described in the top ten data mining algorithm, with pseudo-code on Wikipedia) might be used, or other techniques like Collaborative filtering (used by Amazon, for example), described in this good tutorial. ## Creating undirected graphs in Python About Python library for directed and undirected graphs, you can take a look at igraph or NetworkX. As for the TSP, a little googling indicates that some Python code and discussion is available here, and some background is given in these slides, A Short History of the Traveling Salesman Problem, and on this page, Traveling Salesman Problem. ## How do i send ( export ) barplots to a directory or folder instead of displying them on screen I assume that you mean saving your graphics as PNG or PDF. Here is a snippet of R code that shows how to redirect plotting action to such graphics devices: WD <- "~/out" # set your output directory here k <- 10 # 10 loops for simulated data for (i in 1:k) { png(sprintf(paste(WD, "Rplot%03d.png", sep="/"), i)) barplot(table(sample(LETTERS[1:6], 100, rep=TRUE))) dev.off() } ## text position for multiple regression equation on the same graph Here is one possibility, using [@Vincent's code] 1. It works with the latest release of ggplot2 (v. 0.9) and the R-forge version of directlabels (v. 2.5). I also have tested the code with  ggplot2 0.8.9 and directlabels 2.4. (The version of  directlabels released on CRAN won’t work with ggplot2 0.9 , though.) The idea is basically to replace your labels A, B , C, G with the regression equations. Of course, you could store the latter in a different manner but I think this would sensibly complicate the plotting expression, so let’s keep that as simple as possible. Assuming we already have @Vincent’s melted variable d, > head(d) Xax variable value 1 0.22 A 0.451 2 0.34 A 0.491 3 0.54 A 0.389 4 0.34 A 0.425 5 0.53 A 0.457 6 0.12 A 0.436 let’s replace variable labels with the equations you computed: library(plyr) lm.stats <- ddply(d, "variable", function(u) { r <- lm(value ~ Xax, data=u) c(coef(r), r.squared=summary(r)r.squared)
})
my.formatter <- function(x, digits=2) {
x <- round(x, digits=digits)
out <- paste(x[1], ifelse(x[2]>0, "+", ""), x[2], "x", sep="")
out <- paste(out, " (R2=", x[3], ")", sep="")
return(out)
}
d$variablef <- d$variable
levels(d$variablef) <- apply(lm.stats[2:4], 1, my.formatter) The little helper function, my.formatter, is in charge of assembling the different statistics you computed with ddply . Note that I made a copy of variable in case we need this latter. And here is the plotting stuff: p <- ggplot(d, aes(Xax,value, col=variablef)) + geom_point() + stat_smooth(method=lm) library(directlabels) direct.label(p) I should note that you can also have annotated curves with the  labcurve() function from the Hmisc package. I can also imagine simpler solutions using ggplot or lattice, namely just write the regression equations along the regression lines, with proper orientation and a slight shift on the x-axis to avoid overlapping, but that might not necessarily be very portable if your dataset happens to change. enter image description here ## Using patterns in addition/instead of background colors in lattice plots I found a way to manually draw into the levelplot panel and to draw a diagonal fill pattern over all cells with values greater than 0.5 However, I couldn’t manage to draw the same pattern in the color key legend. After hours of reading forums and trying to understand the lattice source code, I couldn’t get a clue. Maybe someone else could fix that. Here is what I got: library(lattice) library(RColorBrewer) cols <- colorRampPalette(brewer.pal(8, "RdBu")) data <- Harman23.cor$cov

fx <- fy <- c()
for (r in seq(nrow(data)))
for (c in seq(ncol(data)))
{
if (data[r, c] > 0.5)
{
fx <- c(fx, r);
fy <- c(fy, c);
}
}

diag_pattern <- function(...)
{
panel.levelplot(...)
for (i in seq(length(fx)))
{
panel.linejoin(x = c(fx[i],fx[i]+.5), y= c(fy[i]+.5,fy[i]), col="black")
panel.linejoin(x = c(fx[i]-.5,fx[i]+.5), y= c(fy[i]+.5,fy[i]-.5), col="black")
panel.linejoin(x = c(fx[i]-.5,fx[i]), y= c(fy[i],fy[i]-.5), col="black")
}
}

p <- levelplot(data, scales=list(x=list(rot=45)),
xlab="", ylab="", col.regions=cols, panel=diag_pattern)
print(p)
enter image description here

## How to use a separate table to filter data

Quick and dirty solution (because I believe someone will certainly propose a more elegant solution avoiding loop):

tab1 <- list(Target=1:5, Size=c("L","M","L","S","L"), Color=c("R","B","G","B","R"))
tab2 <- data.frame(rep(1:2, each=2), c("A","D","A","B"),
c(5,2,1,5), c(2,4,4,8), c(8,6,6,3))
names(tab2) <- c("User", "Condition", 1:3)

library(reshape)
tab2.melt <- melt(tab2, measure.vars=3:5)

for (i in 1:nrow(tab2.melt)) {
tab2.melt$Size[i] <- tab1$Size[tab1$Target==as.numeric(tab2.melt$variable[i])]
tab2.melt$Color[i] <- tab1$Color[tab1$Target==as.numeric(tab2.melt$variable[i])]
}

I am assuming you are able to import your data into R, but you may want to adapt the above code if the data structure isn’t the one you show in your excerpt. Basically, the idea is to consider your  Target code as a way to index Size and Color levels, which we need in the final data.frame for each repeated measurement (on the ith subject).

The updated data.frame looks like:

> head(tab2.melt)
User Condition variable value Size Color
1    1         A        1     5    L     R
2    1         D        1     2    L     R
3    2         A        1     1    L     R
4    2         B        1     5    L     R
5    1         A        2     2    M     B
6    1         D        2     4    M     B

From there, you can perform a 3-way ANOVA or study specific contrasts.

## R: Lattice messes up legend in pdf

It should work when using the cairo PDF backend, e.g.

cairo_pdf(file="bchart.pdf", width=10, height=10, pointsize=10)

Although I haven’t checked, this might well have to do with PDF encoding, see Including fancy glyphs in R Graphics PDF output, by Paul Murrell.

## GSL library in R

I suspect this has to do with your shared library that is not properly linked to GSL libs, as discussed on R-devel, or the manual on Writing R Extensions, where it is suggested to use a Makevars file (with something like PKG_LIBS=-L/usr/lib -lgsl). Otherwise, following the example in help(SHLIB), you may want to try:

$R CMD SHLIB file.c -lgsl -lgslcblas There is a simple tutorial, R Call GSL, which shows basic setup for calling GSL functions. I am able to reproduce the toy example, that I renamed  nperms.{c,r} as follows (on a Mac, whence the use of a  -dynamiclib switch in place of -shared): ~/scratch$ gcc -c nperms.c
~/scratch $file nperms.o nperms.o: Mach-O 64-bit object x86_64 ~/scratch$ gcc -dynamiclib -lgsl -lgslcblas -o libnperms.dylib -dylib nperms.o
~/scratch $ls *nperm* libnperms.dylib nperms.c nperms.o ~/scratch$ file libnperms.dylib
libnperms.dylib: Mach-O 64-bit dynamically linked shared library x86_64

Everything works fine when dyn.load’ing  libnperms.dylib in R. However, using shared library generated from R CMD SHLIB without further argument

~/scratch $R CMD SHLIB nperms.c gcc -arch x86_64 -std=gnu99 -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/usr/local/lib -o nperms.so nperms.o -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation ~/scratch$ ls *nperm*
libnperms.dylib  nperms.c  nperms.o  nperms.r  nperms.so
$R CMD texi2pdf sw.tex enter image description here ## Plotting lines with gnuplot python Does the following work for you? >>> import Gnuplot >>> x = [1,2,3,4] >>> gp = Gnuplot.Gnuplot() >>> gp.title('My title') >>> gp('set style data linespoints') >>> gp.plot(x) You can pass whatever options you want in the 5th command. ## Graphic of binary variable in R Here are two possible solutions, based on fake data generated with this helper function: generate.data <- function(rate=.3, dim=c(25,25)) { tmp <- rep(".", prod(dim)) tmp[sample(1:prod(dim), ceiling(prod(dim)*rate))] <- "S" m <- matrix(tmp, nr=dim[1], nc=dim[2]) return(m) } 1. Text-based output x <- generate.data() rownames(x) <- colnames(x) <- 1:25 capture.output(as.table(x), file="res.txt") The file res.txt include a pretty-printed version of the console output; you can convert it to pdf using any txt to pdf converter (I use the one from PDFlib). Here is a screenshot of the text file: enter image description here 1. Image-based output First, here is the plotting function I used:   make.table <- function(x, labels=NULL) { # x = matrix # labels = list of labels for x and y coord.xy <- expand.grid(x=1:nrow(x), y=1:ncol(x)) opar <- par(mar=rep(1,4), las=1) plot.new() plot.window(xlim=c(0, ncol(x)), ylim=c(0, nrow(x))) text(coord.xy$x, coord.xy$y, c(x), adj=c(0,1)) if (!is.null(labels)) { mtext(labels[[1]], side=3, line=-1, at=seq(1, ncol(x)), cex=.8) mtext(labels[[2]], side=2, line=-1, at=seq(1, nrow(x)), cex=.8, padj=1) } par(opar) } Then I call it as   make.table(x, list(1:25, 1:25)) and here is the result (save it as png, pdf, jpg, or whatever). enter image description here ## Regarding RandomTree in Weka This has to do with the minimum number of instances on a leaf node (which is often 2 by default in decision trees, like J48). The higher you set this parameter, the more general the tree will be since having many leaves with a low number of instances yields a too granular tree structure. Here are two examples on the iris dataset, which shows how the -M option might affect size of the resulting tree: $ weka weka.classifiers.trees.RandomTree -t iris.arff -i

petallength < 2.45 : Iris-setosa (50/0)
petallength >= 2.45
|   petalwidth < 1.75
|   |   petallength < 4.95
|   |   |   petalwidth < 1.65 : Iris-versicolor (47/0)
|   |   |   petalwidth >= 1.65 : Iris-virginica (1/0)
|   |   petallength >= 4.95
|   |   |   petalwidth < 1.55 : Iris-virginica (3/0)
|   |   |   petalwidth >= 1.55
|   |   |   |   sepallength < 6.95 : Iris-versicolor (2/0)
|   |   |   |   sepallength >= 6.95 : Iris-virginica (1/0)
|   petalwidth >= 1.75
|   |   petallength < 4.85
|   |   |   sepallength < 5.95 : Iris-versicolor (1/0)
|   |   |   sepallength >= 5.95 : Iris-virginica (2/0)
|   |   petallength >= 4.85 : Iris-virginica (43/0)

Size of the tree : 17

$weka weka.classifiers.trees.RandomTree -M 6 -t iris.arff -i petallength < 2.45 : Iris-setosa (50/0) petallength >= 2.45 | petalwidth < 1.75 | | petallength < 4.95 | | | petalwidth < 1.65 : Iris-versicolor (47/0) | | | petalwidth >= 1.65 : Iris-virginica (1/0) | | petallength >= 4.95 : Iris-virginica (6/2) | petalwidth >= 1.75 | | petallength < 4.85 : Iris-virginica (3/1) | | petallength >= 4.85 : Iris-virginica (43/0) Size of the tree : 11 As a sidenote, Random trees rely on bagging, which means there’s a subsampling of attributes (K randomly chosen to split at each node); contrary to REPTree, however, there’s no pruning (like in RandomForest), so you may end up with very noisy trees. ## What is the command for Zipf (frequency against rank) plot in R Check out the zipfR package, and its dedicated website including the following tutorial: The zipfR package for lexical statistics: A tutorial introduction. ## Evaluate different layout algorithms / interaction with graphs You might want to check the extensive documentation of the igraph software, which has some description of its internal layout generators. There are also nice illustrations on aiSee website. For more academic reference, I would suggest browsing the following tutorials: Graph Drawing Tutorial (106 pages) or Graph and Network Visualization (69 pages). Another useful resource: Handbook of Graph Drawing and Visualization (26 chapters, available as PDF). ## Change summary variables returned by Hmisc summary() Arf… I just look at the code of summary.formula() in the Hmisc package and I can confirm that Mean and SD are indeed computed but not shown when printing on the command line. So, we have to ask for it explicitely when calling the print() function, e.g. library(Hmisc) df <- data.frame(g=sample(LETTERS[1:3], 100, rep=TRUE), replicate(3, rnorm(100))) s <- summary(g ~ ., method="reverse", data=df) latex(s, prmsd=TRUE, digits=2) # replace latex by print to output inline which yields the following Table: alt text ## Plotting directed multigraphs in R The igraph package seems to fulfill your requirements, with the tkplot() function helping adjusting the final layout if needed. Here is an example of use: s <- cbind(A=sample(letters[1:4], 100, replace=TRUE), B=sample(letters[1:2], 100, replace=TRUE)) s.tab <- table(s[,1], s[,2]) library(igraph) s.g <- graph.incidence(s.tab, weighted=T) plot(s.g, layout=layout.circle, vertex.label=c(letters[1:4],letters[2:1]), vertex.color=c(rep("red",4),rep("blue",2)), edge.width=c(s.tab)/3, vertex.size=20, vertex.label.cex=3, vertex.label.color="white") enter image description here With the interactive display (there’s a possibility of using  rgl for 3D display), it looks like (I have slightly moved one vertex afterwards): tkplot(s.g, layout=layout.circle, vertex.color=c(rep("red",4),rep("blue",2))) enter image description here Finally, you can even export you graph into most common format, like  dot for graphviz. ## generate a vector in R and insert it in a stacked frame To transform a table into a data.frame, the base function  as.data.frame.table should work. Here is, however, how I would do: myframe <- as.table(array(c(35, 34, 132, 38, 7, 31, 23, 109, 36, 5, 10, 7, 35, 14, 2, 49, 24, 136, 37, 15, 22, 13, 52, 31, 10, 16, 8, 33, 32, 4), dim=c(5, 3, 2), dimnames=list(cars=1:5, q8=c("N","U","Y"), sex=c("F","M")))) library(reshape) melt(myframe) for getting a data.frame with all variables. Should you only want to keep q8 and sex as factors in your data.frame, use melt(myframe)[,-1] instead. See help(melt.array) for more information. ## Normalizing histogram bins in gnuplot Here is how I would do, with n=500 random gaussian variates generated from R with the following command: Rscript -e 'cat(rnorm(500), sep="\\n")' > rnd.dat I use quite the same idea as yours for defining a normalized histogram, where y is defined as 1/(binwidth _ n), except that I use  int instead of floor and I didn’t recenter at the bin value. In short, this is a quick adaptation from the smooth.dem demo script, and a similar approach is described in Janert’s textbook, _Gnuplot in Action* ( Chapter 13, p. 257, freely available). You can replace my sample data file with random-points which is available in the  demo folder coming with Gnuplot. Note that we need to specify the number of points as Gnuplot as no counting facilities for records in a file. bw1=0.1 bw2=0.3 n=500 bin(x,width)=width*int(x/width) set xrange [-3:3] set yrange [0:1] tstr(n)=sprintf("Binwidth = %1.1f\n", n) set multiplot layout 1,2 set boxwidth bw1 plot 'rnd.dat' using (bin($1,bw1)):(1./(bw1*n)) smooth frequency with boxes t tstr(bw1)
set boxwidth bw2
plot 'rnd.dat' using (bin($1,bw2)):(1./(bw2*n)) smooth frequency with boxes t tstr(bw2) Here is the result, with two bin width enter image description here Besides, this really is a rough approach to histogram and more elaborated solutions are readily available in R. Indeed, the problem is how to define a good bin width, and this issue has already been discussed on stats.stackexchange.com: using Freedman-Diaconis binning rule should not be too difficult to implement, although you’ll need to compute the inter-quartile range. Here is how R would proceed with the same data set, with default option (Sturges rule, because in this particular case, this won’t make a difference) and equally spaced bin like the ones used above. enter image description here The R code that was used is given below: par(mfrow=c(1,2), las=1) hist(rnd, main="Sturges", xlab="", ylab="", prob=TRUE) hist(rnd, breaks=seq(-3.5,3.5,by=.1), main="Binwidth = 0.1", xlab="", ylab="", prob=TRUE) You can even look at how R does its job, by inspecting the values returned when calling hist(): > str(hist(rnd, plot=FALSE)) List of 7$ breaks     : num [1:14] -3.5 -3 -2.5 -2 -1.5 -1 -0.5 0 0.5 1 ...
$counts : int [1:13] 1 1 12 20 49 79 108 87 71 43 ...$ intensities: num [1:13] 0.004 0.004 0.048 0.08 0.196 0.316 0.432 0.348 0.284 0.172 ...
$density : num [1:13] 0.004 0.004 0.048 0.08 0.196 0.316 0.432 0.348 0.284 0.172 ...$ mids       : num [1:13] -3.25 -2.75 -2.25 -1.75 -1.25 -0.75 -0.25 0.25 0.75 1.25 ...
$xname : chr "rnd"$ equidist   : logi TRUE
- attr(*, "class")= chr "histogram"

All that to say that you can use R results to process your data with Gnuplot if you like (although I would recommend to use R directly :-).

## TukeyHSD after within factors ANOVA

As an alternative to traditional repeated-measures ANOVA for within-subject design, you might consider using linear mixed-effects approach. It is being increasingly used in the scientific community and avoids some of the ANOVA pitfalls while allowing for more complex error structures. With continuous response variable, the nlme package is enough, but you can also use lme4 which further allows to cope with categorical response variables. For multiple comparisons (including Tukey’s post-hoc tests), then, the multcomp package (see the glht() function) can be used with mixed-effects models fitted with nlme::lme, as described here: Repeated Measures ANOVA using R.

One short remark about your design: If your response (dependent) variable, Ratio, is a proportion or a bounded value, you may think of using a different link function.

## Interface between Octave and R

I do not know of any active R/octave project, but if you’re just after finding roots for a given polynomial you can use one of the polynom or PolynomF package:

Here is an example with P(x)= 6 + 5x + 4x^2 + 3x^3 + 2 x^4 + x^5.

In octave,

octave[2] > p = 1:6;
octave[3] > roots(p)
ans =

0.55169 + 1.25335i
0.55169 - 1.25335i
-1.49180 + 0.00000i
-0.80579 + 1.22290i
-0.80579 - 1.22290i

In R,

> library(polynom)
> p <- polynomial(6:1)
> pz <- solve(p)
> pz
[1] -1.491798+0.000000i -0.805786-1.222905i -0.805786+1.222905i
[4]  0.551685-1.253349i  0.551685+1.253349i

## R fisher.test print only p-value

If you take a look at the values returned by a call to the  fisher.test() function (see help(fisher.test) in R), you will notice that there is p.value among others. So you just need to use $p.value on your R object. Here is an example, in R: tab <- replicate(2, sample(LETTERS[1:3], 10, rep=TRUE)) fisher.test(table(tab[,1], tab[,2]))$p.value

## can’t set border linewidth for bars of rowstacked histogram

I would not use set style fill solid border -1 (or better, noborder), but rather define a specific line type that can be used to customize boxes, e.g.

bw=0.1
n=500
bin(x,width) = width*floor(x/width) + bw/2.0
set boxwidth bw
set style line 2 lc rgb 'gray30' lt 1 lw 2
set style fill pattern 5
awk 'BEGIN{i=0} {s[i]=$1; i++;} END{print s[int(NR*0.8-0.5)]}'") set arrow from perc80,0 to perc80,0.8 nohead lt 2 lw 2 set arrow from graph(0,0),0.8 to perc80,0.8 nohead lt 2 lw 2 plot 'rnd.dat' using 2:(1./200.) smooth cumulative This yields the following output: enter image description here You can add as many percentile values as you want, of course; you just have to define a new variable, e.g. perc90, as well as ask for two other arrow commands, and replace every occurrence of 0.8 (ah… the joy of magic numbers!) by the desired one (in this case, 0.9). Some explanations about the above code: 1. I generated an artificial dataset which was saved on disk. 2. The 80th percentile is compute using awk, but before that we need to 3. remove the header generated by table (first four lines); (we could ask awk to start at the 5th lines, but let’s go with that.) 4. keep only the second column; 5. sort the entries. 6. The awk command to compute the 80th percentile requires truncation, which is done as suggested here. (In R, I would simply use a function like  trunc(rank(x))/length(x) to get the percentile ranks.) If you want to give R a shot, you can safely replace that long series of sed/awk commands with a call to R like Rscript -e 'x=read.table("~/rnd.dat")[,2]; sort(x)[trunc(length(x)*.8)]' assuming rnd.dat is in your home directory. Sidenote: And if you can live without gnuplot, here are some R commands to do that kind of graphics (even not using the  quantile function): x <- rnorm(200) xs <- sort(x) xf <- (1:length(xs))/length(xs) plot(xs, xf, xlab="X", ylab="Cumulative frequency") ## quick outline of the 80th percentile rank perc80 <- xs[trunc(length(x)*.8)] abline(h=.8, v=perc80) ## alternative solution plot(ecdf(x)) segments(par("usr")[1], .8, perc80, .8) segments(perc80, par("usr")[3], perc80, .8) enter image description here ## r / sciplot: overlapping whiskers in lineplot.CI You can use ggplot2 for that. Here is an example with a built-in dataset (as I don’t have your standard errors or CIs). The key is to use position_dodge() . ToothGrowth$dose.cat <- factor(ToothGrowth$dose, labels=paste("d", 1:3, sep="")) df <- with(ToothGrowth , aggregate(len, list(supp=supp, dose=dose.cat), mean)) df$se <- with(ToothGrowth , aggregate(len, list(supp=supp, dose=dose.cat),
function(x) sd(x)/sqrt(10)))[,3]

opar <- theme_update(panel.grid.major = theme_blank(),
panel.grid.minor = theme_blank(),
panel.background = theme_rect(colour = "black"))

xgap <- position_dodge(0.2)
gp <- ggplot(df, aes(x=dose, y=x, colour=supp, group=supp))
gp + geom_line(aes(linetype=supp), size=.6, position=xgap) +
geom_point(aes(shape=supp), size=3, position=xgap) +
geom_errorbar(aes(ymax=x+se, ymin=x-se), width=.1, position=xgap)
theme_set(opar)
enter image description here

## Converting data frame (factors) with data.matrix, weird results

To see what happened, look at the following example:

> (f <- gl(2, 1, 10, labels=3:4))
[1] 3 4 3 4 3 4 3 4 3 4
Levels: 3 4
> as.numeric(f)
[1] 1 2 1 2 1 2 1 2 1 2
> as.numeric(as.character(f))
[1] 3 4 3 4 3 4 3 4 3 4

To convert a factor (which is what you have in your  data.frame) into a numeric vector while preserving its labels (otherwise you’ll just get its levels), you need something like as.numeric(as.character()).

So, either ensure that you read your input data correctly (if numbers are quoted, with options("stringsAsFactors") set to TRUE, then it’s likely they will be converted to a factor), or convert your data.frame afterwards. This can be done columnwise, e.g.

dfrm <- data.frame(x=factor(c(3,2,1,8,4)), y=factor(c(5,6,1,2,3)))
m <- sapply(dfrm, function(x) as.numeric(as.character(x)))
plot(m)

I misread your question and I thought you were using as.matrix , not data.matrix. That doesn’t change anything, since both functions will convert factors into their internal representation, as stated in the on-line help:

Factors and ordered factors are replaced by their internal codes.

## How do I add an asterix to a boxplot to represent significance?

You can use text() and write at the corresponding location, if you know it beforehand; e.g.,

dfrm <- data.frame(y=rnorm(100, mean=10), x=gl(4, 25))
dfrm$y[dfrm$x==2] <- dfrm$y[dfrm$x==2]+2
boxplot(y ~ x, data=dfrm, ylim=c(min(dfrm$y)-.5, max(dfrm$y)+.5))
text(x=2, y=max(dfrm$y[dfrm$x==2]), "*", pos=3, cex=1.2)

Adapt x=2 to suit your needs.

Or you can use mtext to put the star outside the plotting region, like in

mtext("*", side=3, line=0, at=2, cex=1.2)

## How to convert NA’s in a large dataset to either 0 or 1?

Assuming your dataframe is named dfrm, do you mean something like that?

dfrm$Surgery <- ifelse(dfrm$Specialty=="Surgery", 1, 0)
dfrm$Internal <- ifelse(dfrm$Specialty=="Internal", 1, 0)

## where can i find the sources for the sample() function in base

Look up do_sample in /trunk/src/main/random.c from, e.g. the svn repository.

## How can you visualize data frames in a good way?

RStudio does a pretty good job with its built-in (read-only) data viewer. Other solutions ( have been suggested on Cross Validated: Is there a good browser/viewer to see an R dataset (.rda file).

enter image description here

## How can we combine predictors from two different linear models into one?

If this is just a matter of extracting the components of each model and combine them into a new design matrix, then the following should work, irrespective of the fact you used stepAIC:

dfrm <- data.frame(y=rnorm(100), replicate(7, rnorm(100)))
lm1 <- lm(y ~ X1+X2+X3, dfrm)
lm2 <- lm(y ~ X5+X7, dfrm)
lm1.fm <- attr(terms(lm1), "term.labels")
lm2.fm <- attr(terms(lm2), "term.labels")
lm3.fm <- as.formula(paste("y ~ ", paste(c(lm1.fm, lm2.fm), collapse= "+")))
lm3 <- lm(lm3.fm, dfrm)

To fix the ideas, here we have

> names(dfrm)
[1] "y"  "X1" "X2" "X3" "X4" "X5" "X6" "X7"
> lm3.fm
y ~ X1 + X2 + X3 + X5 + X7

See help(terms.object) to get more information on what it returns. With your example, you’ll need to replace lm1 with best.ngc_fit and lm2 with  best.fcst_fit.

It does not look like the C45 names file is correct. Try replacing  breast-cancer-wisconsin.names with this one:

2, 4.
clump: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
size: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
shape: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
adhesion: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
epithelial: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
nuclei: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
chromatin: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
nucleoli: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
mitoses: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.

Note that class comes first (only labels).

Here I have removed the first column of subjects’ id in the original dataset using

$cut -d, -f2-11 breast-cancer-wisconsin.data > breast-cancer-wisconsin.data but it is not difficult to adapt the above code. Alternative solutions: 1. Generate a csv file: you just need to add a header to the  *.data file and rename it as *.csv. E.g., replace  breast-cancer-wisconsin.data with  breast-cancer-wisconsin.csv which should look like clump,size,shape,adhesion,epithelial,nuclei,chromatin,nucleoli,mitoses,class 5,1,1,1,2,1,3,1,1,2 5,4,4,5,7,10,3,2,1,2 3,1,1,1,2,2,3,1,1,2 6,8,8,1,3,4,3,7,1,2 ... 2. Construct directly an *.arff file by hand; that’s not really complicated as there are few variables. An example file can be found here. ## How to add labels to the terminal nodes of a ctree (package party)? Here is one hacky solution. It involves very little modification in the original source code of the plotting functions from the party package. By reading the source code, I noticed that there is a  terminal_panel which is calling node_barplot in case the outcome is a factor. (Everything is located in the R/plot.R function, if you have source package installed.) We can modify the later to display custom labels in the default bar chart. Just issue the following command at R prompt: fixInNamespace("node_barplot", pos="package:party") and then, start adding what we want: 1. Add labels = NULL, gp = NULL to the existing list of arguments for that function. 2. Near the end of the function body, after grid.rect(gp = gpar(fill = "transparent")), add the following lines: if (!is.null(labels)) { labs <- as.character(labels[ctreeobj@where==node$nodeID])
len <- length(labs)
x <- unit(rep(0.5, len), "npc")
y <- unit(0.5:len/len, "npc")
for (i in 1:len)
grid.text(labs[i], x=x[i], y=y[i], just="center", gp=gp)
}

Here, the key idea is to select labels corresponding to the selected node (node$nodeID), and we can grab this information from the slot where of the ctree object (this is a vector indicating in which node each case ended up). The if test is just to ensure that we can use the function as originally written. The gp argument can be used to change font size or color. A typical call to the function would now be: plot(cfit, tp_pars=list(labels=dfrm$names))

where dfrm$names is a column of labels from a data frame named dfrm. Here is an illustration with your data: cfit <- ctree(young ~ age, data=a, controls=ctree_control(minsplit=2, minbucket=2)) plot(cfit, tp_args=list(labels=a$names, gp=gpar(fontsize=8, col="darkgrey")))
enter image description here

(I have also tested this with the on-line example with the  iris dataset.)

## how do i get a 45 angle for the x axis labels in the following code

AFAIK, with base graphics, you can only ask for 0/90° orientation of labels on x- or y-axis (see the las parameter in  par()). However, with lattice or ggplot2 you can do it.

Here is an example with lattice::barchart():

tt <- table(sample(LETTERS[1:6], 100, rep=T))
library(lattice)
barchart(tt, horiz=F,
scales=list(x=list(rot=45, labels=paste("Fancy label", 1:6))))
bar chart

Replace labels with your own labels or if you already have a named table, leave it as is.

## I have a big table in R, now I want to select the odd rows and paste a label before the first element of this row

If I understand your question correctly, you want to display some text and the first element of all odd rows. You can try this:

cat(paste("This is odd", A[c(2,4),1], "\n"))

No need for a loop there. Should you want to work with a larger matrix, and take all odd rows, you can use seq(2, nrow(A), by=2) instead of c(2,4).

## Programming a QQ plot

Here is a simple solution using base graphics:

scores <- rnorm(200, mean=12, sd=2)
gender <- gl(2, 50, labels=c("M","F"))
opar <- par(mfrow=c(1,2))
for (g in levels(gender))
qqnorm(scores[gender==g], main=paste("Gender =", g))
par(opar)

A more elegant lattice solution then:

qqmath(~ scores | gender, data=data.frame(scores, gender), type=c("p", "g"))

See the on-line help for qqmath for more discussion and example of possible customization.

## How to combine 2 matrices into a graph

If you just want to plot correlations as a function of distances, without imposing a particular structure on your plot, you can just extract the lower part of your respective matrices, e.g.

x <- matrix(rnorm(1000), nrow=20)
d.mat <- as.matrix(dist(x))
c.mat <- cor(t(x))
plot(d.mat[lower.tri(d.mat)], c.mat[lower.tri(c.mat)])

## Interactive plotting in Python?

My first thought would be Mayavi, which is great for data visualization, especially in 3D. It relies on VTK. It is included in the Enthought flavoured version of Python, together with Chaco for 2D plotting. To get an idea, look at Travis Vaught’s nice screencast in Multidimensional Data Visualization in Python - Mixing Chaco and Mayavi .

It also possible to embed basic interactive functionalities (like slider) to Matplotlib, see matplotlib.widgets and the widget examples.

Finally, you can use rpy (or better, rpy2) and benefit from the R interface.

## Performing an if statement on each row in R

Unfortunately, I came late and with a solution similar to @Andrie’s one, like this:

dat <- matrix(c(3,3,3,2,3,3,3,3,3,3,3,3,2,3,1,2,2,2,3,3),
nr=10, byrow=TRUE)
# here is our lookup table for genotypes
pat <- matrix(1:9, nr=3, byrow=T, dimnames=list(1:3,1:3))

Then

> pat[dat]
[1] 9 8 9 9 9 9 6 2 5 9

gives you what you want.

However, I would like to say that you might find easier to use dedicated package for genetic studies, like the one found on CRAN (like genetics, gap or SNPassoc , to name a few) or Bioconductor , because they include facilities for transforming/recoding genotype data and working with haplotype.

Here is an example of what I have in mind with the above remark:

> library(genetics)
> geno1 <- as.genotype.allele.count(dat[,1]-1)
> geno2 <- as.genotype.allele.count(dat[,2]-1)
> table(geno1, geno2)
geno2
geno1 A/A A/B
A/A   6   1
A/B   1   1
B/B   0   1

## Canonical Correlation Analysis in R

Your X variance-covariance matrix is not positive-definite, hence the error when internally calling fda::geigen.

There’s a similar function for regularized CCA in the mixOmics package, but I guess it will lead to the same error message because it basically uses the same approach (except that they plugged the geigen function directly into the rcc function). I can’t actually remember how I get it to work with my data, for a related problem (but I’ll look into my old code once I find it again :-)

One solution would be to use a generalized Cholesky decomposition. There is one in the kinship (gchol; be careful, it returns a lower triangular matrix) or accuracy (sechol) package. Of course, this means modifying the code inside the function, but it is not really a problem, IMO. Or you can try to make Var(X) PD with make.positive.definite from the corpcor package.

As an alternative, you might consider using the RGCCA package, which offers an unified interface for PLS (path modeling) and CCA methods with k blocks.

## Controlling z labels in contourplot

Here is my take:

library(lattice)
x <- rep(seq(-1.5,1.5,length=50),50)
y <- rep(seq(-1.5,1.5,length=50),rep(50,50))
z <- exp(-(x^2+y^2+x*y))

# here is default plot
lp1 <- contourplot(z~x*y)

# here is an enhanced one
my.panel <- function(at, labels, ...) {
# draw odd and even contour lines with or without labels
panel.contourplot(..., at=at[seq(1, length(at), 2)], col="blue", lty=2)
panel.contourplot(..., at=at[seq(2, length(at), 2)], col="red",
labels=as.character(at[seq(2, length(at), 2)]))
}

lp2 <- contourplot(z~x*y, panel=my.panel, at=seq(0.2, 0.8, by=0.2))
lp3 <- update(lp2, at=seq(0.2,0.8,by=0.1))
lp4 <- update(lp3, lwd=2, label.style="align")

library(gridExtra)
grid.arrange(lp1, lp2, lp3, lp4)
enter image description here

You can adapt the custom panel function to best suit your needs (e.g. other scale for leveling the z-axis, color, etc.).

## Gnuplot: Variable colors (and linewidths) for 2D-Vector plot

I would put only one color specification, e.g.

set xrange [0:10]
set yrange [0:10]
plot "test.dat" using 1:2:3:4:5 with vectors lw 3 lc rgb variable

where test.dat contains

1 1 2 0 0x000000
1 2 2 0 0xff0000
1 3 2 0 0xffff00
1 4 2 0 0x382288
enter image description here

The same can be done using the following inline rgb function

rgb(r,g,b) = int(r)*65536 + int(g)*256 + int(b)
plot "test2.dat" using 1:2:3:4:(rgb($5,$6,$7)) with vectors lw 3 lc rgb variable where test2.dat now reads 1 1 2 0 0 0 0 1 2 2 0 255 0 0 1 3 2 0 255 255 0 1 4 2 0 56 34 136 ## slick one-lineRs I often need fake data to illustrate, say, a regression problem. Instead of X <- replicate(2, rnorm(100)) y <- X[,1] + X[,2] + rnorm(100) df <- data.frame(y=y, X=X) we can use df <- transform(X <- as.data.frame(replicate(2, rnorm(100))), y = V1+V2+rnorm(100)) to generate two uncorrelated predictors associated to the outcome y. ## Is there a dynamic word/tag cloud Java API somewhere? I’m not aware of a good R package for that. There are some functions, like cloud in the [snippets][1] package, and maybe other functions, but nothing compared to http://wordle.net, http://tagcrowd.com/, or [Many Eyes][2]. [Drew Conway][3] has done some nice stuff with tm + ggplot2; I also played with it a while ago, but this was more of to play with 3D tag cloud (with rgl) than wordle. [enter image description here][4] I ## Ordering the bars of a stacked bar graph in ggplot from least to greatest I’m not sure about the way your data were generated (i.e., whether you use a combination of cast/melt from the reshape package, which is what I suspect given the default name of your variables), but here is a toy example where sorting is done outside the call to ggplot. There might be far better way to do that, browse on SO as suggested by @Andy. v1 <- sample(c("I","S","D","C"), 200, rep=T) v2 <- sample(LETTERS[1:24], 200, rep=T) my.df <- data.frame(v1, v2) idx <- order(apply(table(v1, v2), 2, sum)) library(ggplot2) ggplot(my.df, aes(x=factor(v2, levels=LETTERS[1:24][idx], ordered=TRUE), fill=v1)) + geom_bar() + opts(axis.text.x=theme_text(angle=90)) + labs(x="fullname") To sort in the reverse direction, add decr=TRUE with the order command. Also, as suggested by @Andy, you might overcome the problem with x-labels overlap by adding + coord_flip() instead of the  opts() option. enter image description here ## How to debug Aquamacs Does aquamacs --help work? I suspect it does not because if you are using the command that was installed by Aquamacs itself (which is /usr/local/bin/aquamacs) it is simply a Perl script that call OS X open command, which may have difficulty when parsing options. It is better to just create an alias or a symbolic link to  /Applications/Aquamacs.app/Contents/MacOS/Aquamacs. % /Applications/Aquamacs.app/Contents/MacOS/Aquamacs --help Usage: /Applications/Aquamacs.app/Contents/MacOS/Aquamacs [OPTION-OR-FILENAME]... Run Emacs, the extensible, customizable, self-documenting real-time display editor. The recommended way to start Emacs for normal editing is with no options at all. Run M-x info RET m emacs RET m emacs invocation RET inside Emacs to read the main documentation for these command-line arguments. Initialization options: --batch do not do interactive display; implies -q --chdir DIR change to directory DIR --daemon start a server in the background --debug-init enable Emacs Lisp debugger for init file --%<-------- You can also safely run % /Applications/Aquamacs.app/Contents/MacOS/Aquamacs -nw --debug-init to get Aquamacs running in your preferred Terminal emulator. More information can be found on EmacsWiki, Customize Aquamacs. ## Errors due to vowpal wabbit’s dependencies on boost library I happened to get everything working on OS X 10.7 as follows: 1. Make sure you have a working Boost installation. As indicated on the Getting started page, usually we only need header files, but some Boost libraries must be built separately, including the program_options library which is used to process options from command line or config file. Go into your boost folder, and then at your shell prompt: $ ./bootstrap.sh
$./bjam This will compile and build everything. You should now have a  bin.v2/ directory in your boost directory, with all built libraries for your system (static and threaded libs).  $ ls bin.v2/libs/
date_time       iostreams       python          serialization   test
graph           program_options regex           system          wave

More importantly, extra Boost libraries are made available in the  stage/lib/ directory. For me, these are Mach-O 64-bit dynamically linked shared library x86_64.

The include path should be  your_install_dir/boost_x_xx_x, where boost_x_xx_x is the basename of your working Boost. (I personally have  boost_1_46_1 in /usr/local/share/ and I symlinked it to /usr/local/share/boost to avoid having to remember version number.) The library path (for linking) should read  your_install_dir/boost_x_xx_x/stage/lib. However, it might be best to symlink or copy (which is what I did) everything in usual place, i.e. /usr/local/include/boost for header files, and /usr/local/lib for libraries.

1. Edit the Makefile from the vowpal_wabbit directory, and change the include/library paths to reflect your current installation. The Makefile should look like this (first 12 lines):

COMPILER = g++
UNAME := $(shell uname) ifeq ($(UNAME), FreeBSD)
LIBS = -l boost_program_options -l pthread -l z -l compat
BOOST_INCLUDE = /usr/local/include
BOOST_LIBRARY = /usr/local/lib
else
LIBS = -l boost_program_options -l pthread -l z
BOOST_INCLUDE = /usr/local/share/boost            # change path to reflect yours
BOOST_LIBRARY = /usr/local/share/boost/stage/lib  # idem
endif

Then, you are ready to compile vowpal_wabbit (make clean in case you already compiled it):

    $make$ ./vw --version
6.1

## Python matplotlib : plot3D with a color for 4D

If you want to display a simple 3D scatterplot, can’t you just use  scatter? E.g.,

x, y, z = randn(100), randn(100), randn(100)
fig = plt.figure()
from mpl_toolkits.mplot3d import Axes3D
ax.scatter(x, y, z, c=randn(100))
plt.show()

(I’m running the above code under python -pylab.)

enter image description here

It seems, on the contrary, that with plot3D you must convert your fourth dimension to RGB tuples.

## How to generate bivariate data of different shapes (e.g., square, circle, rectangle) with outliers?

You should probably look into the mlbench package, especially synthetic dataset generating from  mlbench.* functions, see some examples below.

enter image description here

Other datasets or utility functions are probably best found on the Cluster Task View on CRAN. As @Roman said, adding outliers is not really difficult, especially when you work in only two dimensions.

## Boot package in R simple assistance

Just to emphasize the general idea on bootstrapping in R, although @caracal already answered your question through his comment. When using boot, you need to have a data structure (usually, a matrix) that can be sampled by row. The computation of your statistic is usually done in a function that receives this data matrix and returns the statistic of interest computed after resampling. Then, you call the boot() that takes care of applying this function to R replicates and collecting results in a structured format. Those results can be assessed using boot.ci() in turn.

Here are two working examples with the low birth baby study in the MASS package.

require(MASS)
data(birthwt)
# compute CIs for correlation between mother's weight and birth weight
cor.boot <- function(data, k) cor(data[k,])[1,2]
cor.res <- boot(data=with(birthwt, cbind(lwt, bwt)),
statistic=cor.boot, R=500)
cor.res
boot.ci(cor.res, type="bca")
# compute CI for a particular regression coefficient, e.g. bwt ~ smoke + ht
fm <- bwt ~ smoke + ht
reg.boot <- function(formula, data, k) coef(lm(formula, data[k,]))
reg.res <- boot(data=birthwt, statistic=reg.boot,
R=500, formula=fm)
boot.ci(reg.res, type="bca", index=2) # smoke

## Discretization in R

You can still use the dprep package, but you have to install it from source (I just tested and it works well). However, you may well have a look at the discretization or infotheo packages which provide similar functionalities, e.g. equal interval width, equal frequency intervals, ChiMerge, etc.

Using

octave --silent --eval 5+4 > result.txt

you’ll get

ans =  9

in result.txt. See octave --help for details about command-line arguments.

Yet, there is this infamous ans = that might be remove using sed, e.g.

octave --silent --eval 'x=5+4; y=x+1; disp(y)' | sed -e 's/ans = //' >> result.txt

which add the appropriate result (10) in  result.txt.

It should not be too hard to wrap this into a bash script.

## Missing values in ARFF (Weka)

Apart from treating missing value as an attribute value on its own, in the case of the J48 classifier any split on an attribute with missing value will be done with weights proportional to frequencies of the observed non-missing values. This is documented in Witten and Frank’s textbook, Data Mining Practical Machine Learning Tools and Techniques (2005, 2nd. ed., p. 63 and p. 191), who then reported that

eventually, the various parts of the instance will each reach a leaf node, and the decisions at these leaf nodes must be recombined using the weights that have percolated to the leaves.

More information about handling missing values in decision trees, like surrogate splits in CART (and contrary to C4.5 or its successor J48), can be found on the wiki section for Classification Trees; the use of imputation is also discussed in several articles, e.g.  Handling missing data in trees: surrogate splits or statistical imputation.

## Save .dta files in python

The scikits.statsmodels package includes a reader for Stata data files, which relies in part on PyDTA as pointed out by @Sven . In particular, genfromdta() will return an ndarray , e.g. from Python 2.7/statsmodels 0.3.1:

>>> import scikits.statsmodels.api as sm
>>> arr = sm.iolib.genfromdta('/Applications/Stata12/auto.dta')
>>> type(arr)
<type 'numpy.ndarray'>

The savetxt() function can be used in turn to save an array as a text file, which can be imported in Stata. For example, we can export the above as

>>> sm.iolib.savetxt('auto.txt', arr, fmt='%2s', delimiter=",")

and read it in Stata without a dictionary file as follows:

. insheet using auto.txt, clear

I believe a *.dta reader should be added in the near future.

## How can I rank observations within groups in Stata?

The following works for me:

bysort group_id: egen desired_rank=rank(var_to_rank)
enter image description here

## U-matrix and self organizing maps

The U-matrix stands for unified distance and contains in each cell the euclidean distance (in the input space) between neighboring cells. Small values in this matrix mean that SOM nodes are close together in the input space, whereas larger values mean that SOM nodes are far apart, even if they are close in the output space. As such, the U-matrix can be seen as summary of the probability density function of the input matrix in a 2D space. Usually, those distance values are discretized, color-coded based on intensity and displayed as a kind of heatmap.

Quoting the Matlab SOM toolbox,

 Compute and return the unified distance matrix of a SOM.
For example a case of 5x1 -sized map:
m(1) m(2) m(3) m(4) m(5)
where m(i) denotes one map unit. The u-matrix is a 9x1 vector:
u(1) u(1,2) u(2) u(2,3) u(3) u(3,4) u(4) u(4,5) u(5)
where u(i,j) is the distance between map units m(i) and m(j)
and u(k) is the mean (or minimum, maximum or median) of the
surrounding values, e.g. u(3) = (u(2,3) + u(3,4))/2.

Apart from the SOM toolbox, you may have a look at the kohonen R package (see help(plot.kohonen) and use  type="dist.neighbours").

## Sort xtable() output by p-value from glm model summary

Not necessarily the most elegant solution, but that should do the job:

data(birthwt, package="MASS")
glm.res <- glm(low ~ ., data=birthwt[,-10])
idx <- order(coef(summary(glm.res))[,4])  # sort out the p-values
out <- coef(summary(glm.res))[idx,]       # reorder coef, SE, etc. by increasing p
library(xtable)
xtable(out)
enter image description here

## how to add a general grid to a lattice xy.plot

Try this:

xyplot(yval ~ xval | p*cr, data=B, group=gval, type=c("l","g"), lwd=5,
main="Scatterplots by Cylinders and Gears",
ylab="Miles per Gallon", xlab="Car Weight")

I have simplified a little bit your syntax because you can use variable names from you data.frame if you the data= argument. The key is to use type=c("l","g"), which means lines + grid, and is equivalent to a panel function that would looks like

panel=function(...) {
panel.xyplot(...)
panel.grid()
}

In your case, this is because you forgot to add a  panel.xyplot() that no points or line were drawn. If you want a different grid, you can use the above code and customize the call to  panel.grid().

## Fine tuning a dotplot in R’s lattice package

Some ideas:

1. Use a smaller font size for y-axis labels, e.g.  scale=list(y=list(cex=.6)). An alternative would be to preserve uniform font size, but separate your output on several pages (this can be controlled with layout=), or, probably better, to show all data from the same dataset (A to F, hence 4 points for each algorithm) or by sample size (10 to 100, hence 6 points for each algorithm) with a group= option. I would personally create two factors, sample.size and dataset.type for that.

2. Relevel your factor Dataset, so that the dataset you are interested appear where layout will put them, or (better!) use index.cond to specify a particular arrangement for your 24 panels. E.g.,

dfrm <- data.frame(algo=gl(11, 1, 11*24, labels=paste("algo", 1:11, sep="")),
type=gl(24, 11, 11*24, labels=paste("type", 1:24, sep="")),
roc=runif(11*24))
p <- dotplot(algo ~ roc | type, dfrm, layout=c(4,6), scale=list(y=list(cex=.4)))

will arrange panels in sequential order, from bottom left to top right (type1 in bottom left panel, type24 in top right panel), while

    update(p, index.cond=list(24:1))

will arrange panels in the reverse order. Just specify a list with expected panel locations.

Here is an example of what I have in mind with Point 1 and the use of two factors instead of one. Let us generate another artificial dataset:

dfrm <- data.frame(algo=gl(11, 1, 11*24, labels=paste("algo", 1:11, sep="")),
dataset=gl(6, 11, 11*24, labels=LETTERS[1:6]),
ssize=gl(4, 11*6, 11*24, labels=c(10,25,50,100)),
roc=runif(11*24))
xtabs(~ dataset + ssize, dfrm)  # to check allocation of factor levels
dotplot(algo ~ roc | dataset, data=dfrm, group=ssize, type="l",
auto.key=list(space="top", column=4, cex=.8, title="Sample size",
cex.title=1, lines=TRUE, points=FALSE))
enter image description here

## Difference between regression tree and model tree

A linear model tree is a decision tree with a linear functional model in each leaf, whereas in classical regression tree (e.g., CART) it is the sample mean of the response variable for statistical units in each leaf (hence, a constant) that is being considered. Linear model trees can be seen as a a form of locally weighted regression, while regression tree are piecewise-constant regression.

Torgo, L. Functional models for regression tree leaves. In Proceedings of the 14th International Conference on Machine Learning, pp. 385–393. Morgan Kaufmann, 1997.

## Naive bayes in R

You seem to be using the e1071::naiveBayes algorithm, which expects a newdata argument for prediction, hence the two errors raised when running your code. (You can check the source code of the predict.naiveBayes function on CRAN; the second line in the code is expecting a newdata, as  newdata <- as.data.frame(newdata).) Also as pointed out by @Vincent, you’re better off converting your variables to factor before calling the NB algorithm, although this has certainly nothing to do with the above errors.

Using NaiveBayes from the klar package, no such problem would happen. E.g.,

data(spam, package="ElemStatLearn")
library(klaR)

# set up a training sample
train.ind <- sample(1:nrow(spam), ceiling(nrow(spam)*2/3), replace=FALSE)

# apply NB classifier
nb.res <- NaiveBayes(spam ~ ., data=spam[train.ind,])

# predict on holdout units
nb.pred <- predict(nb.res, spam[-train.ind,])

# but this also works on the training sample, i.e. without using a newdata
head(predict(nb.res))

## plotting bar graphs in R using ggplot2

Do you mean something like this? (I’m not sure whether you want to add value on top of the graph, so I added them, but you can safely remove the last line if you don’t need them.)

tmp <- c(150.3,84.61,31.45,11.08,-0.19,-57.83,-88.63,-98.39)
dd <- data.frame(y=tmp, x=LETTERS[1:8])
ggplot(dd, aes(x=x, y=y)) + geom_bar(fill="darkgrey") +
labs(x="Premium change", y="Total Expected Profit") +
geom_text(aes(x=x, y=ifelse(y>0, y+5, y-5),
label=y), size=4, colour="white")
enter image description here

It would be even nicer to add + coord_flip(), IMO.

1. The ggplot() function is expecting a data.frame, from which it can extract named variable, e.g. for the aesthetics parameters x= and y=. So, first of all, you need to convert your object into a proper data.frame and name it, for you can its value through aes():

toplot_noind <- as.data.frame(toplot_noind)
names(toplot_noind) <- y

which is better that using the same name as your data.frame. (Note, however, that it will inherit its name with the cast operation.)

1. Then, the x- and y-labels must be outside the aes() function. I don’t use qplot() but I think using  xlab= and ylab= works fine there. With ggplot , I prefer the labs() or xlab()/ylab() functions. E.g.

2. You need to have x represented as a factor.

3. The dodge aspect does not seem necessary here, because you don’t have a second cross-classifying factor (see an example of use in  help(position_dodge)).

In sum, your corrected code would look like:

toplot_noind <- as.data.frame(toplot_noind)
ggplot(toplot_noind, aes(x = as.factor(1:8), y = toplot_noind)) +
geom_bar() +
ylab("Total Expected Profit")

## With R, loop over data frames, and assign appropriate names to objects created in the loop

Use a list to store the results of your regression models as well, e.g.

foo <- function(n) return(transform(X <- as.data.frame(replicate(2, rnorm(n))),
y = V1+V2+rnorm(n)))
write.csv(foo(10), file="dat1.csv")
write.csv(foo(10), file="dat2.csv")
csvdat <- list.files(pattern="dat.*csv")
lm.res <- list()
for (i in seq(along=csvdat))
lm.res[[i]] <- lm(y ~ ., data=read.csv(csvdat[i]))
names(lm.res) <- csvdat

## creating a more continuous color palette in r, ggplot2, lattice, or latticeExtra

As far as lattice is concerned, you can set up your colors palette with RColorBrewer (or even colorspace). Using the example provided by @Chase, but with positive value for z:

dat <- data.frame(x = rnorm(1000), y = rnorm(1000), z = sample(0:40, 1000, TRUE))
library(RColorBrewer)
# see, e.g.
# display.brewer.all(9, type="seq")
# display.brewer.pal(11, "RdBu")
my.col <- colorRampPalette(brewer.pal(11, "RdBu"))(diff(range(dat$z))) xyplot(y ~ x, data=dat, col=my.col[dat$z], pch=19, alpha=.5)

Note that it is also necessary here to increase the range of available colors by interpolation. Also, with levelplot(), you might want to play with cut= and pretty=.

enter image description here

## Is there a sensible way to do something like docstrings in R?

You can add any attributes you like to R objects, including function. So something like

describe <- function(obj) attr(obj, "help")
foo <- function(t=NULL) ifelse(!is.null(t), t^2, "Or maybe not")
attr(foo, "help") <- "Here is the help message"

produces more or less the desired output

> foo(2)
[1] 4
> foo()
[1] "Or maybe not"
> describe(foo)
[1] "Here is the help message"

## How to add LibSVM class to WEKA classpath on a Mac

You can put libsvm.jar in a folder of your choice, e.g.  ~/Library/Java, and then run weka from the command-line as follows:

$java -Xmx512m -classpath /Users/chl/weka/weka.jar:/Library/Java/libsvm.jar weka.gui.GUIChooser You can also update your CLASSPATH to reflect locations where *.jar files can be found. I have the following in my .profile: export CLASSPATH="/Users/chl/weka/weka.jar:~/Library/Java/*" (You will need to replace /Users/chl/weka to reflect the correct location of your weka.jar; usually, it is located at the top of the directory if you downloaded the source files, or under weka-3-7-3.app/Contents/Resources/Java/ if you use the bundled app.) This way, the first command to start weka GUI simplifies to $ java -classpath $CLASSPATH:weka.jar:libsvm.jar weka.gui.GUIChooser Don’t use java -jar since it will override the  CLASSPATH, as discussed here. I also have an alias in my .bash_aliases for wrapping all that stuff: alias weka='java -Xmx512m -classpath$CLASSPATH:weka.jar'

in order to use weka from the command-line as, e.g.

'' using ($0+0.25):2:3 with boxerrorbars title col The idea is just to offset one of the two measures on the x-axis. enter image description here ## Variables Overview with xtable in R Another package to look at is reporttools. Here is an short piece of code to illustrate its usage on the tips dataset from reshape package. Both the summary statements produce latex code which can be copy pasted into a document, or used for weaving. library(reporttools) data(tips, package = 'reshape') # summarize numeric variables tableContinuous(tips[,sapply(tips, is.numeric)]) # summarize non-numeric variables tableNominal(tips[,!sapply(tips, is.numeric)]) EDIT. If you really MUST use str, then here is one way to go about it str_cars = capture.output(str(cars)) xtable(data.frame(str_cars)) OUTPUT FROM REPORTTOOLS: ## Centering title in R viewport with multiple graphs ggplot2 I think you might be interested in the gridExtra package, which provides the grid.arrange() function that fulfills everything you wonder about. With @Kevin’s example, the command would be grid.arrange(plots[[1]], plots[[2]], ncol=2, main="test main", sub="subtitle test") ## How do I compute the number of occurrences of a particular value in a row in R There’re two ways to use the apply family functions. Either you do apply(mat, 1, sum, na.rm=TRUE) if you want to apply the function sum()to each row, passing additional parameters like na.rm=TRUE. Or you can do apply(mat, 1, foo) where foo() is a function of your own, defined externally, e.g. foo <- function(x) sum(x==0, na.rm=TRUE) Note that NA handling might also be dealt with a parameter of the function itself, with default value set to TRUE, in the above definition, as in foo2 <- function(x, no.na=TRUE) sum(x==0, na.rm=no.na) and you can call it as apply(mat, 1, foo2, no.na=F) although it doesn’t really make sense with the sum() function (unless you want to check if there’re NA values, but in this case it’s better to use is.na() :-). Finally, you can define your function directly inline as apply(mat, 1, function(x) sum(x==0, na.rm=TRUE)) In your case, it gives me > apply(mat, 1, function(x) sum(x==0, na.rm=TRUE)) 1 2 3 4 5 6 22 4 9 8 7 2 which is equivalent to apply(ex, 1, foo). ## R: Print two tables with xtable () I would recommend saving the results as two separate tables in different files (see the file= option to  print.xtable()), and then input them into your LaTeX document with any command you find appropriate for your layout ( tabular, subfloat, minipage, etc.). This is what I do in general, although I generally rely on LaTeX facilities in the Hmisc package. Should you want only to print them as a standalone PDF, use the standalone class for your document. So, here is an example: data(tli) fm1 <- aov(tlimth ~ sex + ethnicty + grade + disadvg, data=tli) print(xtable(fm1), file="ta.tex", floating=FALSE) print(xtable(head(tli, n=5)), file="tb.tex", floating=FALSE) then, a quick tex wrapper (compile with pdflatex): \documentclass{article} \usepackage{subfig} \usepackage{graphicx} \begin{document} \begin{table}[ht] \centering \subfloat[Table x(a)]{\label{tab:tab1a}\scalebox{.5}{\input{./ta}}}\quad \subfloat[Table x(b)]{\label{tab:tab1b}\scalebox{.5}{\input{./tb}}} \caption{Caption about here} \label{tab:tab1} \end{table} \end{document} Here is the result: enter image description here Remove the \scalebox command for default (stacked) layout, unless they are narrow enough to fit at their default size, as noted by @David. enter image description here ## Multiple histograms in ggplot2 You can try grid.arrange() from the gridExtra package; i.e., store your plots in a list (say qplt ), and use do.call(grid.arrange, qplt) Other ideas: use facetting within ggplot2 (sex*variable ), by considering a data.frame (use melt). As a sidenote, it would be better to use stacked barchart or Cleveland’s dotplot for displaying items response frequencies, IMO. (I gave some ideas on CrossValidated.) For the sake of completeness, here are some implementation ideas: # simple barchart ggplot(melt(dat), aes(x=as.factor(value), fill=as.factor(value))) + geom_bar() + facet_grid (variable ~ sex) + xlab("") + coord_flip() + scale_fill_discrete("Response") enter image description here my.df <- ddply(melt(dat), c("sex","variable"), summarize, count=table(value)) my.df$resp <- gl(3, 1, length=nrow(my.df), labels=0:2)

# stacked barchart
ggplot(my.df, aes(x=variable, y=count, fill=resp)) +
geom_bar() + facet_wrap(~sex) + coord_flip()
enter image description here
# dotplot
ggplot(my.df, aes(x=count, y=resp, colour=sex)) + geom_point() +
facet_wrap(~ variable)
enter image description here

## Visualizing Weka classification tree

Look here, for example.

First you have to fit your decision tree (I used the J48 classifier on the iris dataset), in the usual way. In the results list panel (bottom left on Weka explorer), right click on the corresponding output and select “Visualize tree” as shown below.

enter image description here enter image description here

If you have installed the Prefuse plugin, you can even visualize your tree on a more pretty layout.

## How to display a stacked barchart in Gnuplot?

Assuming your data are stored in the file 1.dat, stacked barcharts might be generated as follows:

set style data histograms
set style histogram rowstacked
set boxwidth 1 relative
set style fill solid 1.0 border -1
set yrange [0:1.5]
set datafile separator " "
plot '1.dat' using 2 t "Var 1", '' using 3:xticlabels(1) t "Var 2"

As you can see, barcharts are no different from histograms (at least, from within Gnuplot). More information can be found on gnuplot demo page.

enter image description here

## python (bool) ? then : else syntax?

You can use (x if cond else y), e.g.

>>> x = 0
>>> y = 1
>>> print("a" if x < y else "b")
a

That will work will lambda function too.

## Get last record of a table in Postgres

you can use

SELECT timestamp, value, card
FROM my_table
ORDER BY timestamp DESC
LIMIT 1

assuming you want also to sort by timestamp?

## How to use subscripts in ggplot2 legends [R]

The following should work (remove your line with names(temp) <- …):

ggplot(temp.m, aes(x = value, linetype = variable)) +
geom_density() + facet_wrap(~ a) +
scale_linetype_discrete(breaks=levels(temp.m$variable), labels=c(expression(b[1]), expression(c[1]))) See help(scale_linetype_discrete) for available customization (e.g. legend title via name=). ## Sending an email from R using the sendmailR package Are you able to send email via the command-line? So, first of all, fire up a Terminal and then $ echo “Test 123” | mail -s “Test” user@domain.com

Look into /var/log/mail.log, or better use

$tail -f /var/log/mail.log in a different window while you send your email. If you see something like ... setting up TLS connection to smtp.gmail.com[xxx.xx.xxx.xxx]:587 ... Trusted TLS connection established to smtp.gmail.com[xxx.xx.xxx.xxx]:587:\ TLSv1 with cipher RC4-MD5 (128/128 bits) then you succeeded. Otherwise, it means you have to configure you mailing system. I use postfix with Gmail for two years now, and I never had have problem with it. Basically, you need to grab the Equifax certificates, Equifax_Secure_CA.pem from here: http://www.geotrust.com/resources/root-certificates/. (They were using Thawtee certificates before but they changed last year.) Then, assuming you used Gmail, 1. Create relay_password in /etc/postfix and put a single line like this (with your correct login and password):  smtp.gmail.com login@gmail.com:password then in a Terminal,  $ sudo postmap /etc/postfix/relay_password

to update Postfix lookup table.

2. Add the certificates in /etc/postfix/certs, or any folder you like, then

 $sudo c_rehash /etc/postfix/certs/ (i.e., rehash the certificates with Openssl). 3. Edit /etc/postfix/main.cf so that it includes the following lines (adjust the paths if needed): relayhost = smtp.gmail.com:587 smtp_sasl_auth_enable = yes smtp_sasl_password_maps = hash:/etc/postfix/relay_password smtp_sasl_security_options = noanonymous smtp_tls_security_level = may smtp_tls_CApath = /etc/postfix/certs smtp_tls_session_cache_database = btree:/etc/postfix/smtp_scache smtp_tls_session_cache_timeout = 3600s smtp_tls_loglevel = 1 tls_random_source = dev:/dev/urandom 4. Finally, just reload the Postfix process, with e.g. $ sudo postfix reload

(a combination of start/stop works too).

You can choose a different port for the SMTP, e.g. 465. It’s still possible to use SASL without TLS (the above steps are basically the same), but in both case the main problem is that your login informations are available in a plan text file… Also, should you want to use your MobileMe account, just replace the Gmail SMTP server with smtp.me.com.

## How can I generate ascii “graphical output” from R?

You should look at the fairly recent txtplot package. Currently, it includes scatterplot, line plot, density plot, acf, and bar chart.

> txtplot(cars[,1], cars[,2])
+----+------------+------------+-----------+------------+--+
120 +                                                     *    +
|                                                          |
100 +                                                          +
|                                                     * *  |
80 +                           *         *                    +
|                                        *       *    *    |
60 +                           *              *               +
|                              *    * *    *       *       |
40 +                         *      *  * *  *                 +
|                 *       * *    *  *    * *               |
20 +         *       *  * *  * *  *                           +
|            *    *  * *                                   |
|  *      *    *                                           |
0 +----+------------+------------+-----------+------------+--+
5           10           15          20           25

## Outlier detection in data mining

I will limit myself to what I think is essential to give some clues about all of your questions, because this is the topic of a lot of textbooks and they might probably be better addressed in separate questions.

1. I wouldn’t use k-means for spotting outliers in a multivariate dataset, for the simple reason that the k-means algorithm is not built for that purpose: You will always end up with a solution that minimizes the total within-cluster sum of squares (and hence maximizes the between-cluster SS because the total variance is fixed), and the outlier(s) will not necessarily define their own cluster. Consider the following example in R:

set.seed(123) sim.xy <- function(n, mean, sd) cbind(rnorm(n, mean 1, sd1), rnorm(n, mean2,sd 2))

# generate three clouds of points, well separated in the 2D plane

xy <- rbind(sim.xy(100, c(0,0), c(.2,.2)), sim.xy(100, c(2.5,0), c(.4,.2)), sim.xy(100, c(1.25,.5), c(.3,.2))) xy[1,] <- c(0,2) # convert 1st obs. to an outlying value km3 <- kmeans(xy, 3) # ask for three clusters km4 <- kmeans(xy, 4) # ask for four clusters

As can be seen in the next figure, the outlying value is never recovered as such: It will always belong to one of the other clusters.

enter image description here

One possibility, however, would be to use a two-stage approach where one’s removing extremal points (here defined as vector far away from their cluster centroids) in an iterative manner, as described in the following paper: Improving K-Means by Outlier Removal (Hautamäki, et al.).

This bears some resemblance with what is done in genetic studies to detect and remove individuals which exhibit genotyping error, or individuals that are siblings/twins (or when we want to identify population substructure), while we only want to keep unrelated individuals; in this case, we use multidimensional scaling (which is equivalent to PCA, up to a constant for the first two axes) and remove observations above or below 6 SD on any one of say the top 10 or 20 axes (see for example, Population Structure and Eigenanalysis, Patterson et al., PLoS Genetics 2006 2(12)).

A common alternative is to use ordered robust mahalanobis distances that can be plotted (in a QQ plot) against the expected quantiles of a Chi-squared distribution, as discussed in the following paper:

R.G. Garrett (1989). The chi-square plot: a tools for multivariate outlier recognition. Journal of Geochemical Exploration 32(1/3): 319-341.

(It is available in the mvoutlier R package.)

1. It depends on what you call user input. I interpret your question as whether some algorithm can process automatically a distance matrix or raw data and stop on an optimal number of clusters. If this is the case, and for any distance-based partitioning algorithm, then you can use any of the available validity indices for cluster analysis; a good overview is given in

Handl, J., Knowles, J., and Kell, D.B. (2005). Computational cluster validation in post-genomic data analysis. Bioinformatics 21(15): 3201-3212.

that I discussed on Cross Validated. You can for instance run several instances of the algorithm on different random samples (using bootstrap) of the data, for a range of cluster numbers (say, k=1 to 20) and select k according to the optimized criteria taht was considered (average silhouette width, cophenetic correlation, etc.); it can be fully automated, no need for user input.

There exist other forms of clustering, based on density (clusters are seen as regions where objects are unusually common) or distribution (clusters are sets of objects that follow a given probability distribution). Model-based clustering, as it is implemented in Mclust, for example, allows to identify clusters in a multivariate dataset by spanning a range of shape for the variance-covariance matrix for a varying number of clusters and to choose the best model according to the BIC criterion.

1. This is a hot topic in classification, and some studies focused on SVM to detect outliers especially when they are misclassified. A simple Google query will return a lot of hits, e.g.  Support Vector Machine for Outlier Detection in Breast Cancer Survivability Prediction by Thongkam et al. (Lecture Notes in Computer Science 2008 4977/2008 99-109; this article includes comparison to ensemble methods). The very basic idea is to use a one-class SVM to capture the main structure of the data by fitting a multivariate (e.g., gaussian) distribution to it; objects that on or just outside the boundary might be regarded as potential outliers. (In a certain sense, density-based clustering would perform equally well as defining what an outlier really is is more straightforward given an expected distribution.)

Other approaches for unsupervised, semi-supervised, or supervised learning are readily found on Google, e.g.

A related topic is anomaly detection, about which you will find a lot of papers.

1. That really deserves a new (and probably more focused) question :-)

## Variable width bars in ggplot2 barplot in R

How about using width= after rescaling your d vector, say by a constant amount?

ggplot(dat, aes(x=a, y=b, width=d/100)) +
geom_bar(aes(fill=a), stat="identity", position="identity")
enter image description here

## Zip or enumerate in R?

There have been some discussions around list comprehension for R, e.g.  here or there. The hash package even offers dictionary-like structure. However, as others said, it is difficult to try to map one language facilities onto another (even if this is what Comparison of programming languages actually offers) without a clear understanding of what it is supposed to be used to. For example, I can mimic Python zip() in R as follows:

Python

In [1]: x = [1,2,3]
In [2]: y = [4,5,6]
In [3]: zip(x, y)
Out[3]: [(1, 4), (2, 5), (3, 6)]

R

> x <- 1:3
> y <- 4:6
> list(x, y)                     # gives a simple list
> as.list(paste(x, y))           # three tuples, as a list of characters
> mapply(list, x, y, SIMPLIFY=F) # gives a list of 3 tuples
> rbind(x, y)                    # gives a 2x3 matrix

As can be seen, this really depends on what you want to do with the result afterwards.

## gnuplot conditional plotting: plot col A:col B if col C == x

Consider the following dataset (1.dat),

1 0.8 0
2 0.6 0
3 0.9 1
4 1.1 0
5 0.7 0
6 0.6 1

where we want to plot the first two columns only when the third one equals zero. Then you can try this:

plot '1.dat' using 1:($3==0?$2:1/0)

(Credit to markjoe on Gnuplot mailing-list.)

## Subset dataframe by multiple logical conditions of rows to remove

Try this

subset(data, !(v1 %in% c("b","d","e")))

## How to compute error rate from a decision tree?

Assuming you mean computing error rate on the sample used to fit the model, you can use printcp(). For example, using the on-line example,

> library(rpart)
> fit <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis)
> printcp(fit)

Classification tree:
rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis)

Variables actually used in tree construction:
[1] Age   Start

Root node error: 17/81 = 0.20988

n= 81

CP nsplit rel error  xerror    xstd
1 0.176471      0   1.00000 1.00000 0.21559
2 0.019608      1   0.82353 0.82353 0.20018
3 0.010000      4   0.76471 0.82353 0.20018

The Root node error is used to compute two measures of predictive performance, when considering values displayed in the  rel error and xerror column, and depending on the complexity parameter (first column):

• 0.76471 x 0.20988 = 0.1604973 (16.0%) is the resubstitution error rate (i.e., error rate computed on the training sample) – this is roughly

  class.pred <- table(predict(fit, type="class"), kyphosis$Kyphosis) 1-sum(diag(class.pred))/sum(class.pred) • 0.82353 x 0.20988 = 0.1728425 (17.2%) is the cross-validated error rate (using 10-fold CV, see xval in  rpart.control(); but see also xpred.rpart() and  plotcp() which relies on this kind of measure). This measure is a more objective indicator of predictive accuracy. Note that it is more or less in agreement with classification accuracy from tree: > library(tree) > summary(tree(Kyphosis ~ Age + Number + Start, data=kyphosis)) Classification tree: tree(formula = Kyphosis ~ Age + Number + Start, data = kyphosis) Number of terminal nodes: 10 Residual mean deviance: 0.5809 = 41.24 / 71 Misclassification error rate: 0.1235 = 10 / 81 where Misclassification error rate is computed from the training sample. ## How to increase the space between the bars in a bar plot in ggplot2? You can always play with the width parameter, as shown below: df <- data.frame(x=factor(LETTERS[1:4]), y=sample(1:100, 4)) library(ggplot2) ggplot(data=df, aes(x=x, y=y, width=.5)) + geom_bar(stat="identity", position="identity") + opts(title="width = .5") + labs(x="", y="") + theme_bw() Compare with the following other settings for width: alt text So far, so good. Now, suppose we have two factors. In case you would like to play with evenly spaced juxtaposed bars (like when using  space together with beside=TRUE in barplot() ), it’s not so easy using geom_bar(position="dodge"): you can change bar width, but not add space in between adjacent bars (and I didn’t find a convenient solution on Google). I ended up with something like that: df <- data.frame(g=gl(2, 1, labels=letters[1:2]), y=sample(1:100, 4)) x.seq <- c(1,2,4,5) ggplot(data=transform(df, x=x.seq), aes(x=x, y=y, width=.85)) + geom_bar(stat="identity", aes(fill=g)) + labs(x="", y="") + scale_x_discrete(breaks = NA) + geom_text(aes(x=c(sum(x.seq[1:2])/2, sum(x.seq[3:4])/2), y=0, label=c("X","Y")), vjust=1.2, size=8) The vector used for the x -axis is “injected” in the data.frame, so that so you change the outer spacing if you want, while width allows to control for inner spacing. Labels for the x -axis might be enhanced by using scale_x_discrete(). alt text ## Read all files in a folder and apply a function to each data frame On the contrary, I do think working with list makes it easy to automate such things. Here is one solution (I stored your four dataframes in folder  temp/). filenames <- list.files("temp", pattern="*.csv", full.names=TRUE) ldf <- lapply(filenames, read.csv) res <- lapply(ldf, summary) names(res) <- substr(filenames, 6, 30) It is important to store the full path for your files (as I did with  full.names), otherwise you have to paste the working directory, e.g. filenames <- list.files("temp", pattern="*.csv") paste("temp", filenames, sep="/") will work too. Note that I used substr to extract file names while discarding full path. You can access your summary tables as follows: > res$df4.csv
A              B
Min.   :0.00   Min.   : 1.00
1st Qu.:1.25   1st Qu.: 2.25
Median :3.00   Median : 6.00
Mean   :3.50   Mean   : 7.00
3rd Qu.:5.50   3rd Qu.:10.50
Max.   :8.00   Max.   :16.00

If you really want to get individual summary tables, you can extract them afterwards. E.g.,

for (i in 1:length(res))
assign(paste(paste("df", i, sep=""), "summary", sep="."), res[[i]])

<!---
-->