A Return to Reliable R

The saga with Statistica continues:

Statistica kept crashing on me while I was doing my data processing.  One of the big problems was a wonderful bug that occurred when some of my text variables were coded (unsurprisingly) as text!  Under this condition, I could only add a small number of new variables, and after that any further variable I tried to add would crash the program!

I was told that this is a known bug in Statistica and that they’re hoping to fix it with an update due around the end of the year.  In the meantime, a workaround is to go into the “Variable Specs” for any variable coded as Text, recode it as “Double”, save the worksheet, then try again.  That seemed to get rid of the crashing, but then my biographical ID column, which held the original database IDs for the individuals in my dataset, got messed up.  Numerous IDs that were previously unique were spontaneously reassigned to more than one person.  I can’t have that, because once I’m done with the dataset I have to return important parts of it to the clients I work with so they can put certain new columns into their database.  So it was a bit of a catch-22.

My supervisor advised me to make a new, strictly numeric, ID column outside of Statistica, and import only the new ID column, and not the old one, back into the program.  I did that, and all seemed well until finally it crashed, yet again!  This time, I had no clue whatsoever why the crash happened.
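For anyone wanting to try the same trick, here is a rough sketch of how the numeric ID column could be built in R.  The data frame and the column names (orig_id, num_id, gift) are made up for illustration and are not from my actual dataset.

```r
# Hypothetical donor table with text IDs (all names here are illustrative)
donors <- data.frame(
  orig_id = c("A-001", "B-17", "C-9x", "D-004"),
  gift    = c(50, 20, 75, 10),
  stringsAsFactors = FALSE
)

# Confirm the original IDs really are unique before replacing them
stopifnot(anyDuplicated(donors$orig_id) == 0)

# Create a strictly numeric ID, one per row
donors$num_id <- seq_len(nrow(donors))

# Keep a lookup table outside Statistica so the original IDs can be restored
id_lookup <- donors[, c("num_id", "orig_id")]

# Only the numeric ID (and the analysis columns) would be imported
for_import <- donors[, c("num_id", "gift")]

# Later, merge the original IDs back on before returning data to clients
results <- merge(for_import, id_lookup, by = "num_id")
```

The lookup table is what makes this safe: the text IDs never enter Statistica, but every row can still be traced back to the client's database.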

That’s when I told myself “screw it, I’m wasting time in Statistica and am going to do the rest of this analysis in R”.  Man, is it ever nice to be back in R.  Ironically, things are much simpler and flow a lot faster for me.  The only problem is that I have a few projects coming up soon that really need a data analysis program that can handle humongous datasets.  For that reason, I’m probably going to have to see if reinstalling Statistica makes it more reliable to work with.  If not, I suppose I’ll have to move on to other options!

Processing Data from a Statistica Worksheet Using R

Context: I work with data from non-profit organizations, and so a big concern in many of my analyses is whether, and by how much, people's donations change from one year to the next.  One of the things I normally like to do in my analyses is get a value for each person that represents how much their yearly donations are increasing or decreasing on average over 5 years (a simple slope from the regression of their giving values on the years that they gave).  It was pretty simple and quick to do this in R for previous projects, so there was no hassle there.  Now that we have Statistica in the office, my supervisor wants me to use it for our current project.
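To make the slope idea concrete, here is a minimal sketch for a single donor.  The years and giving amounts are invented for illustration.

```r
# Hypothetical giving history for one donor over five years
years  <- 2007:2011
giving <- c(100, 120, 150, 130, 180)

# Regress giving on year; the slope is the average change in dollars per year
fit   <- lm(giving ~ years)
slope <- unname(coef(fit)["years"])
slope  # 17: this donor's giving rises by about $17 per year on average
```

A positive slope means giving is trending up, a negative slope means it is trending down.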

Problem: I was looking for a way to do the above slope calculation in Statistica for each row in a dataset of roughly 82,000 rows, and could not find one.

Solution:  As I mentioned in my last post, it’s possible to feed your Statistica dataset into R using the Statconn DCOM server, so that you can process or analyze it in R and then output your results back into Statistica.  So, I fed my dataset of 82,000 rows and 264 columns into R, and used some code I had used previously to calculate 5-year giving slopes for each row in the dataset and to output a new worksheet with the newly calculated slopes column.  Although the code is pretty simple, the entire process seemed to take about 5 minutes, which was unbearably slow!!  It’s a pretty important part of my analysis, so going without it isn’t an option.
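I haven't pinned down how much of those 5 minutes is the data transfer and how much is the R code itself.  On the R side, though, one way to avoid fitting a separate regression per row is the closed-form slope, sum((x - mean(x)) * y) / sum((x - mean(x))^2), computed for all rows at once with a matrix product.  The sketch below uses simulated data with made-up column names, and assumes every donor has a value (possibly zero) in each of the 5 years, with no missing values.

```r
# Simulated giving matrix: one row per donor, one column per year
set.seed(1)
n      <- 1000
years  <- 2007:2011
giving <- matrix(round(runif(n * 5, 0, 500)), nrow = n,
                 dimnames = list(NULL, paste0("give_", years)))

# Centre the years once; the slope for each row is then a weighted sum
x_dev  <- years - mean(years)
slopes <- as.vector(giving %*% x_dev) / sum(x_dev^2)

# Spot check the first donor against an ordinary lm() fit
check <- unname(coef(lm(giving[1, ] ~ years))[2])
stopifnot(isTRUE(all.equal(slopes[1], check)))
```

Because this is a single matrix multiplication, it scales to tens of thousands of rows without a loop.  It would not fix any slowness that comes from moving the data between Statistica and R.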

I sent an email to one of the Statistica support guys, so hopefully they have a way of doing this kind of data processing natively, instead of having to wait all that time for the data to be processed through R.

Using R from Inside Statistica

I’ve been spending a lot of time in the last month or so on projects at work that aren’t statistics related, hence the lack of posts!  In the interim, I had to do some serious research on handling datasets bigger than the last one I worked with (the one that kept threatening to max out my 8 gigs of RAM!).  I kept trying to practice working with R packages like bigmemory and ff (with its ffdf data frames), but nothing completely satisfied my need to handle a big dataset with different data types in different columns.  So, after reading up on different commercial stats packages, I determined that getting Statistica would be best for my supervisor and me (she’s insanely busy and wouldn’t have the time for the learning curve of Revolution R, if we were to buy that).

In speaking with my supervisor about Statistica, she mentioned that it can interface with R.  So once we got our copies of Version 11 Advanced, I went ahead and learned how the interface works.

Setup/Installation: The setup and installation of the R integration was really annoying.  There is a COM server application you have to download and install, and you have to run its installation in administrator mode.  Then you have to make sure that R is also installed in administrator mode.  You also have to get the rscproxy package in R and make sure it is installed in the R Home directory that sits in your Program Files folder.  It was quite a hassle.  Statistica put a white paper on their website explaining the process.

Memory Usage:  When you actively use the R integration in Statistica, take a look at your memory usage (I’m using a Windows 7 computer for work).  What you will notice is that any time you run an R function in Statistica, the R connector program starts taking up more and more memory, reflecting the fact that data is being passed from Statistica to R to be processed.  The upshot is that you should be careful about how much data you’re passing to an R procedure from Statistica, so that you don’t max out your memory.
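Once the data is on the R side, one thing that may help is to keep only the columns a procedure actually needs and drop the rest.  The sketch below uses a simulated data frame as a stand-in; it shows the idea in plain R and does not measure what the Statistica connector itself does.

```r
# Simulated stand-in for a wide dataset (1000 rows, 50 numeric columns)
full <- as.data.frame(matrix(rnorm(1000 * 50), nrow = 1000))

# Keep only the columns the analysis needs
needed <- full[, c("V1", "V2", "V3")]

# Compare the in-memory footprint of the two objects
format(object.size(full),   units = "Kb")
format(object.size(needed), units = "Kb")

# Drop the wide copy and ask R to release the memory
rm(full)
invisible(gc())
```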

Syntax: Check out the screenshot below.  Typing R syntax into Statistica is, thankfully, pretty easy.  As you can see in the screenshot, if you want to access the active dataset to do something with it, you treat it as a data frame called ActiveDataSet, and then you can use the $ sign followed by the variable name from your Statistica dataset, just as you would in R.  The only catch seems to be variables with spaces in their names.  For those, it seems that you have to refer to them by column number instead of by name.
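In text form, the pattern looks roughly like this.  Inside Statistica the ActiveDataSet data frame is supplied for you; here I build a small stand-in with made-up variables so the snippet runs in plain R.

```r
# Stand-in for the data frame Statistica provides (variables are made up)
ActiveDataSet <- data.frame(
  Age            = c(34, 51, 47),
  `Total Giving` = c(120, 300, 80),
  check.names    = FALSE   # keep the space in the variable name
)

# A variable without spaces can be reached by name with $
mean_age <- mean(ActiveDataSet$Age)

# A variable with a space in its name is reached by column number instead
total_giving <- sum(ActiveDataSet[, 2])
```

The downside of column numbers is that the code breaks silently if the column order changes, so it is worth checking names(ActiveDataSet) first.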

Functionality: So far, it looks like data only flows one way: from the Statistica spreadsheet to R, and then back to either the Statistica report output or a new Statistica spreadsheet.  It would be nice if I could modify data from R within an existing spreadsheet, but that seems to be out of the question.

Main advantage: Statistica being a commercial product, the good folks at StatSoft aren’t just going to give you all of the statistical procedures they came up with for free.  For example, since I now have Statistica Advanced, I can do some cool multivariate procedures, but I can’t generate random forests unless I get Statistica Data Miner.  The advantage that the R integration brings, then, is that it gives me advanced statistical procedures like random forests, or graphing packages like ggplot2, without having to pay extra.  I show an example of having used a random forest procedure in Statistica using R in the screenshot above.
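For reference, a random forest call in R looks roughly like the sketch below.  It assumes the randomForest package is installed and uses R's built-in iris data as a stand-in; inside Statistica the data argument would be ActiveDataSet, and the formula would use your own variable names.

```r
# Sketch only: runs the model if the randomForest package is available
rf <- NULL
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(42)  # random forests are stochastic, so fix the seed

  # Predict species from the four flower measurements
  rf <- randomForest::randomForest(Species ~ ., data = iris, ntree = 100)

  print(rf)                              # out-of-bag error and confusion matrix
  print(randomForest::importance(rf))    # which predictors matter most
}
```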