Reading an international SPSS file into R

Background I admit I have a certain weakness for SPSS: I used it extensively during my first university job (when R was still at version 0.63) and, more importantly, the first course I was fully responsible for was an SPSS course for graduate students. This was the first time I had to explain statistical concepts to people with little or no formal training but (as a rule) very real data analysis problems. Consequently, I had to re-think a lot of theory for myself in order to be able to present a consistent and convincing story to my students, so that the course was a real education for me (as well as for them, hopefully).

In this context, the SPSS manuals were rather helpfully didactic, with interesting example data, the graphical user interface reasonably reasonable, and even the principle of “everything and the kitchen sink” in terms of statistics available in the menus was useful in that it would make a young statistician consult his Agresti in some depth. So all in all, despite the horrible marketing, the rental licences, the eagerness to split statistical functionality into countless shrink-wrapped and separately sold modules, I do have a soft spot for the old warhorse, and it still evokes some of the joy of standard errors I felt when we met the first time…

The problem … but of course, when I recently received an SPSS .sav file from a colleague, the first thing I did was to read it into R. This worked at the first try as suggested by the documentation:

require(foreign)
dat = read.spss("nonaffected.sav", to.data.frame=TRUE)

This generated a warning about some value labels it did not like, but the data (and almost all of the labels) made it into R OK. Sweet!

This, however, was when running R on Windows 7. When I switched to Ubuntu and tried to re-run my analysis, the exact same code and the exact same file generated an error message:

> dat = read.spss("nonaffected.sav", to.data.frame=TRUE)
Error in read.spss("nonaffected.sav", to.data.frame=TRUE) :
  error reading system-file header

Tales of woe And this is where the fun started. Some googling showed that this has been an intermittent problem for other people in the past, but the suggested solutions varied: exporting to SPSS portable format (extension .por), exporting to a text file (duh!), using the reencode argument to read.spss, commercial conversion software etc. I was a bit muddle-headed that day with an incipient cold, so I tried all of that in no special order.

Seeing that I have no SPSS installed either under Windows or under Linux, I should have started with reencode, which allows you to specify a different character encoding for reading the file. The problem with that is that on my Ubuntu, there are ca. 1000 different encodings available (as shown via iconvlist() at the R prompt). I tried some likely candidates for Windows encoding, prime among them ISO-8859-1, but that did not work.
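As a minimal sketch of what re-encoding means here (not the code used for this post): the three bytes below are how å, ä and ö would be stored in an ISO-8859-1 file, and iconv() converts them to UTF-8. The byte values and the search pattern are purely illustrative, and the commented-out read.spss call assumes that reencode accepts an encoding name, as the foreign documentation describes.

```r
# Bytes for "åäö" as they would be stored in an ISO-8859-1 (Latin-1) file
latin1_bytes <- as.raw(c(0xe5, 0xe4, 0xf6))
latin1_text  <- rawToChar(latin1_bytes)

# Re-encode to UTF-8, the usual encoding on a current Ubuntu
utf8_text <- iconv(latin1_text, from = "ISO-8859-1", to = "UTF-8")

# Each character now maps to its Unicode code point
code_points <- utf8ToInt(utf8_text)
print(code_points)

# Narrow the long list from iconvlist() down to a few likely candidates
candidates <- grep("8859-1$|1252$", iconvlist(), value = TRUE, ignore.case = TRUE)
print(candidates)

# With the real file, a candidate would be passed on like this (not run):
# dat <- foreign::read.spss("nonaffected.sav", to.data.frame = TRUE,
#                           reencode = "ISO-8859-1")
```

Trying a handful of candidates this way is cheaper than working through the full list by hand, though in this case it did not solve the problem.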

So the next step should have been to ask myself, so what if I have no SPSS?! There is a GNU version by the acronym of PSPP (hilarious, innit), which can easily be installed via sudo apt-get install pspp. However, calling pspp exposes you directly to the horrors of the SPSS command language… which are pretty horrible, as these things go. Consulting the PSPP documentation, you will learn the command for reading in a system file (GET) and writing to a portable file (EXPORT). Alternatively, you can start the GUI that the PSPP people have thoughtfully included with their software, via the command psppire at the command line, and use the File/Open/Data and File/Save as menus.
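For those who prefer to stay at the R prompt, the conversion can be sketched as a two-line PSPP syntax file written and run from R. This assumes that PSPP's GET command reads a system file and EXPORT writes a portable file, as the PSPP manual describes, and that pspp accepts a syntax file as its argument; the name sav2por.sps is made up for the example.

```r
# Two PSPP commands: read the system file, write a portable file.
# The data file names are the ones used in this post.
syntax <- c(
  "GET FILE='nonaffected.sav'.",
  "EXPORT OUTFILE='nonaffected.por'."
)

# Hypothetical name for the syntax file, kept in R's temporary directory
syntax_file <- file.path(tempdir(), "sav2por.sps")
writeLines(syntax, syntax_file)

# Run PSPP only if it is installed and the input file is actually there
if (nzchar(Sys.which("pspp")) && file.exists("nonaffected.sav")) {
  system2("pspp", shQuote(syntax_file))
}
```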

Regardless of the path travelled, I was happy to find that the original file was read without problems, even though there were numerous special characters of a Swedish nature (å and ä) in both variable names and value labels. What I did not find was a way of exporting the data to a non-SPSS format, like a tab-delimited text file (always a comfort in situations like these) – either it’s not there, or I’m too stupid. However, generating a so-called portable file was easy, using either the PSPP command line or GUI, so I did that and tried reading it into R. This is what I got:

> dat = read.spss("nonaffected.por", to.data.frame=TRUE)
Error in read.spss("nonaffected.por", to.data.frame=TRUE) :
  error reading portable-file dictionary

A new error message – sweet!

Solution 1 (ugly) At this point I was fairly sure that the problem was with the special characters (you think?!). In a spirit of experimentation, I decided to eliminate them from the variable names only, using the Variable view in psppire. Given that the file had ca. 80 variables, with 10 or so containing special characters, this was somewhat bothersome (and boring), but next to nothing compared to the time I had already burnt on this. Saving the data with the ASCII-only variable names worked: read.spss could now read both the .sav and the .por files.

Note that I did not touch the value labels defined for the factor variables, as there was a whole bunch of them and I’m lazy. The problem here was the special characters in the variable names themselves; knowing this can save you a lot of time messing around with the variable editor.
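If the names are available as an R character vector (for instance from names(dat) on a machine where the file could be read), the offending ones can be flagged automatically. This is a minimal sketch; the helper has_non_ascii and the Swedish example names are invented for illustration.

```r
# TRUE for every string containing a character outside the ASCII range.
# utf8ToInt() works on one string at a time, hence the vapply().
has_non_ascii <- function(x) {
  vapply(enc2utf8(x),
         function(s) any(utf8ToInt(s) > 127L),
         logical(1), USE.NAMES = FALSE)
}

# Hypothetical variable names, written with \u escapes so that the
# script itself stays ASCII-only
var_names <- c("alder", "k\u00f6n", "l\u00e4n", "inkomst")

flagged <- has_non_ascii(var_names)
print(data.frame(name = var_names, non_ascii = flagged))
```

With ca. 80 variables, a listing like this shows at a glance which names need editing in psppire.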

Solution 2 (pro) Among the suggested solutions not mentioned above was one by Peter Dalgaard, which was to switch the locale (i.e. the internationalization settings) for R, at least temporarily, to read in data with some Nordic issues (Danish in that case). That’s cool; however, the original code did not work for me: firstly, my data is Swedish, so I would need the locale sv_SE instead of da_DK, and secondly, my Ubuntu requires that I also specify the encoding, as in sv_SE.UTF-8 or sv_SE.ISO-8859-1. Anyway, after some more googling (this must be the most boring form of procrastination known to mankind) I was convinced that the following code was worth a try:

lc = Sys.getlocale("LC_CTYPE")                ## store the original locale
Sys.setlocale("LC_CTYPE", "sv_SE.ISO-8859-1") ## switch to Swedish/Windows encoding
dat = read.spss("nonaffected.sav", to.data.frame=TRUE) ## do it
Sys.setlocale("LC_CTYPE", lc)                 ## restore original locale

Except that the second line threw an error message (actually a warning) stating that the OS could not honor my locale request, and read.spss bombed as before. And sure enough, listing the locales on my Ubuntu via locale -a at the command line, I found that sv_SE.ISO-8859-1 was indeed not available. So, of course, back to the search engine.

Luckily, this turned out to be a standard problem, and the solution a simple two-step:

1. Add a line to a file, telling Ubuntu which combination of language and encoding to generate when forced to do so: the file is /var/lib/locales/supported.d/local, and the line in our case is sv_SE.ISO-8859-1 ISO-8859-1 (at the end).

2. Force Ubuntu to generate the locale via sudo dpkg-reconfigure locales.

After this two-step, switching the locale and reading the original (unmodified) SPSS file worked like a dream.
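The store, switch, read, restore pattern can be wrapped in a small helper so that the original locale comes back even if the read fails. This is only a sketch: with_ctype is a made-up name, and the demonstration uses the "C" locale because it exists on every system, whereas the Swedish locale may not.

```r
# Evaluate `code` with LC_CTYPE temporarily set to `locale`,
# restoring the previous setting even if `code` throws an error.
with_ctype <- function(locale, code) {
  old <- Sys.getlocale("LC_CTYPE")
  on.exit(suppressWarnings(Sys.setlocale("LC_CTYPE", old)), add = TRUE)

  # Sys.setlocale() returns "" (plus a warning) if the OS cannot honor the request
  new <- suppressWarnings(Sys.setlocale("LC_CTYPE", locale))
  if (!nzchar(new)) stop("locale not available on this system: ", locale)

  force(code)  # lazily evaluated, i.e. only now, under the new locale
}

# Harmless demonstration with the always-available "C" locale
before <- Sys.getlocale("LC_CTYPE")
inside <- with_ctype("C", Sys.getlocale("LC_CTYPE"))
after  <- Sys.getlocale("LC_CTYPE")

# With the real file this would read (not run):
# dat <- with_ctype("sv_SE.ISO-8859-1",
#                   foreign::read.spss("nonaffected.sav", to.data.frame = TRUE))
```

Stopping with an error when the locale is unavailable avoids the situation described above, where the switch fails with a mere warning and read.spss then fails for the old reason.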

Summary If read.spss throws an error message, the reason may be non-ASCII characters among the variable names. You can check this even without having SPSS on your machine, by using the PSPP program psppire. If you find special characters in the variable names, removing them manually and saving the modified file should work. Just converting the file to portable format alone did not work for me.

Alternatively, you can temporarily switch the locale that R uses for internationalization and encoding, using Sys.setlocale as shown above. If the SPSS file comes from Windows, it’s a fair bet that the encoding is ISO-8859-1; combine this with your language and country code, as in the example, and you should be good to go. If not, check whether this specific locale exists on your machine (locale -a) and force Ubuntu to generate it as outlined above if required.
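The locale -a check can also be done from within R. A sketch under two assumptions: the locale command is on the PATH, and it prints encodings in a normalised form (lower case, no hyphens, e.g. sv_SE.iso88591), which is why the comparison below normalises both sides. The helper names and the example listing are invented.

```r
# Normalise a locale name roughly the way `locale -a` prints it:
# "sv_SE.ISO-8859-1" -> "sv_se.iso88591"
normalise_locale <- function(x) tolower(gsub("-", "", x, fixed = TRUE))

locale_listed <- function(wanted, available) {
  normalise_locale(wanted) %in% normalise_locale(available)
}

# Ask the OS where possible; elsewhere fall back to an empty listing
available <- if (.Platform$OS.type == "unix" && nzchar(Sys.which("locale"))) {
  system2("locale", "-a", stdout = TRUE)
} else {
  character(0)
}
print(locale_listed("sv_SE.ISO-8859-1", available))

# Illustration with a made-up listing
example_listing <- c("C", "en_US.utf8", "sv_SE.iso88591")
found <- locale_listed("sv_SE.ISO-8859-1", example_listing)
```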

Details R 2.15.1 on Windows 7 and Ubuntu 11.04


Building ROracle on Windows 7

Background ROracle is an add-on package that allows you to easily access an existing Oracle server from the command line of the R statistical environment: you can list existing tables, display their structure, download data – either on a fairly elementary level (like me) or fully SQL-powered.

This is of course very convenient, especially if someone else is taking care of the Oracle server in the background (lucky me). However, installation of ROracle is not totally straightforward, as no pre-compiled binaries are available from CRAN, only the package source: in order to get this to run, you will have to compile the package and link it against an Oracle client yourself. And while it’s not exactly rocket science, and the installation instructions are reasonably clear, there is no way in hell I will be remembering all the steps should I have to re-build at some point in the future. Therefore this post.

Details The recipe below has worked for me on a Windows 7 machine with R 2.15.1 and ROracle 1.1-4

Prerequisites You will need the Rtools matching your R installation (for me Rtools 2.16). If you need to install them, this previous post may be helpful. You will also need the Oracle Instant Client (currently version 11.2), as described in the ROracle installation guide: you need both the basic (or basic lite) and the SDK package (in the 32-bit or 64-bit version or both, as required).

Getting stuff from Oracle The downloads are zero cost, but not free (if they were, presumably ROracle would be distributed with the client source and as a pre-compiled package, right?). If you already have an account with Oracle, go to www.oracle.com and sign in. If you don’t have an account, still go to www.oracle.com and sign up for one (link at the top left): it’s free, they don’t require your firstborn or their personal information, and they are fairly restrained as far as (direct) spamming goes – I think I got more paper letters (ca. 1/year) than emails. Note that for the download to work, you need to be both signed in and have JavaScript activated.

The Oracle website is about as bad as you would expect from a large commercial entity. Googling instead “Instant client” took me to the current download page at

www.oracle.com/technetwork/database/features/instant-client/index-097480.html

in one go. Once there,

  1. Select 32-bit MS Windows (top choice). This brings you to the download page where you can select either Instant Client Package – Basic (for full international support) or Basic Lite (English only). Don’t forget to read (and accept) the licence agreement, after which you can save a .zip file to your disk.
  2. Repeat the same process for the Instant Client Package SDK, which gives you another .zip file.
  3. Unpack both of the .zip files.
  4. Create c:\instantclient\i386 on your disk, and drop the directory you have unzipped from the instantclient_11_2 basic (or basic lite) package there.
  5. Drop the sdk subdirectory from the unzipped SDK package into the same place.

If you want to install the 64-bit client instead or in addition, repeat Steps 1-5, but choose the 64-bit MS Windows option (of course) and use the subdirectory c:\instantclient\x64.
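Before building, the directory layout from the steps above can be checked from R. This is a rough sketch with two assumptions: the sdk directory ends up directly below the i386 or x64 directory, and the SDK package provides the header sdk/include/oci.h. The helper names client_subdir and check_client are made up.

```r
# Map the pointer size of the running R to the subdirectory names used above
client_subdir <- function(pointer_bytes = .Machine$sizeof.pointer) {
  if (pointer_bytes == 8) "x64" else "i386"
}

# Check that the client directory exists and holds the SDK header
# (assumed location: <root>/<arch>/sdk/include/oci.h)
check_client <- function(root = "c:/instantclient") {
  dir <- file.path(root, client_subdir())
  c(client_dir = dir.exists(dir),
    sdk_header = file.exists(file.path(dir, "sdk", "include", "oci.h")))
}

print(check_client())
```

Both entries should be TRUE on a machine set up as described; a FALSE points to a mismatch between the R architecture and the client that was unpacked, or to a different nesting of the unzipped directories.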

Building ROracle At the R command line, execute

install.packages("ROracle", type="source")

If that does not return an error message, try

require(ROracle)

and if that does not give an error message, it seems you have built a loadable Oracle client: presumably either 32- or 64-bit, depending on your installation of the client, though in all honesty, I have only tested the 32-bit client so far. If you want to build both, you may have luck using the argument INSTALL_opts="--merge-multiarch" with install.packages, or prefer to consult the installation guide again.

Testing You need to have the service name of your Oracle server (as well as username and password) to connect. Your local Oracle administrator would be a great person to ask, or you may want to look for the tnsnames.ora file on your machine (note that if that file exists, you probably have already some Oracle client software on your machine, and it is quite likely that you wouldn’t have had to jump through all of the hoops above; sorry about that – I’m just a trained monkey who got lucky surfing, not an expert).

Anyhoo, assuming you have all that you need, just continue after require(ROracle) with

ora = Oracle()
con = dbConnect(ora, username = "USER", dbname = "SERVICENAME", password = "PASSWORD")
summary(con)
dbListTables(con, "MYSCHEMA") ## the schema you want to work with
dbDisconnect(con)

as seen in ?Oracle. Enjoy!
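For the "SQL-powered" route, a query can be wrapped so that the connection is always closed again. This is a hedged sketch rather than anything from the ROracle documentation: fetch_query is a made-up helper, the credentials and table name are placeholders, and it assumes the DBI generics dbConnect, dbGetQuery and dbDisconnect work with the ROracle driver as they do for other DBI back ends.

```r
# Run one query and always close the connection afterwards.
# Nothing here touches the database until the function is called.
fetch_query <- function(sql, username, password, dbname) {
  con <- DBI::dbConnect(ROracle::Oracle(),
                        username = username,
                        password = password,
                        dbname   = dbname)
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  DBI::dbGetQuery(con, sql)
}

# Placeholder credentials and table name (not run):
# first_rows <- fetch_query("SELECT * FROM MYTABLE WHERE ROWNUM <= 10",
#                           "USER", "PASSWORD", "SERVICENAME")
```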


The why and the how of installing Rtools

Some background One of the great things about the R statistical language & environment is that it is open source and cross-platform. Admittedly, these are kind of two things, though you might argue that any sufficiently popular open source software will be ported widely.

Now even if you don’t feel strongly one way or the other about open source (though you should), the cross-platform aspect is very useful indeed: wherever you go, Windows or Mac or Linux, R will be there for you – just go to CRAN, pick the appropriate installer for your system, and you’re good to go, with virtually no differences.

Except, of course, for the obvious ones, in that the GUI may look quite different on a Mac compared to Windows – but that’s all just superficial, mere surface, wrapping the underlying crunchy goodness of the R language slightly differently: the command line is the same, and if you want to have GUI consistency, a more elaborate wrapper like JGR or RStudio is going to be your friend. What counts is that under the hood, everything is pretty much the same: you run the same R commands, you read and write the same .RData files, and you install the same packages from CRAN.

Well, at least it feels that way: install.packages("xtable") would work on any system. However, if you visit the home page of the package at CRAN you will find there (currently)

  • xtable_1.7-0.tar.gz, the package source code
  • xtable_1.7-0.tgz, the package built for Mac
  • xtable_1.7-0.zip, the package built for Windows

So what really happens when you install the package on Mac or Windows is that R actually downloads and installs the pre-compiled binary for your operating system, without ever bothering you about the details. Sweeet!
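Which of the three variants R prefers on a given machine can be queried directly. A small sketch: .Platform$pkgType is the platform's preferred package type, and the extension table simply restates the three files listed above.

```r
# The package type this build of R prefers: "source" on Linux,
# a binary type such as "win.binary" on Windows
default_type <- .Platform$pkgType
cat("Preferred package type here:", default_type, "\n")

# File extensions of the three variants found on a CRAN package page
extension <- c(source = ".tar.gz", mac = ".tgz", windows = ".zip")
print(extension)

# Asking for the source variant explicitly, whatever the platform (not run):
# install.packages("xtable", type = "source")
```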

What it is Now on Linux (and presumably on most other Unices), R downloads the package source and builds the ready-to-use package that is installed on your system right there on your machine. Note that this is a non-trivial process, involving making different kinds of documentation, checking R code for sanity, possibly compiling C/C++ or Fortran code and so on. This in turn requires a whole series of external tools like make, sed, tar, gzip, a C/C++ compiler etc. collectively known as Rtools.

Note that all these little helpers will be available on, say, an out-of-the-box Ubuntu installation. These tools are of course not available on a standard Windows box (or only as cheap commercial knockoffs that will not work). That means, if you want to build your own packages from source on that Winbox, you will need to install all these tools in the correct version.

Which is a world of pain, which is why the good people of the R project have made available an installer which does all that very conveniently, as outlined below. Which is only possible because the whole tool chain for doing this is open source, again.

Why should I care, again? As long as you stick to base R and add-on packages that are available as pre-compiled binaries at CRAN or Bioconductor (and that is the vastly overwhelming majority of packages), you don’t need to care at all. However, there is a small minority of packages, e.g. ROracle for, yes, accessing data on an Oracle server from the command line, that need to be compiled on the specific machine you are working with (more about that another day). If you need one of these, you’d better get cracking with the Rtools.

Also, should you want to make your own R packages, even just the tar.gz source version, you will need Rtools. And if you’re serious about using R, the day will come sooner or later when you want to wrap the crunchy goodness of your R code and make it available to the masses easily and freely, and open-sourced (technically, R packages are of course always open source, like in, you get to see the source; legally, of course, a totally different cup of poison).

You will also need the Rtools (extended version, see below) if you want to build R from scratch, which admittedly ups the nerd factor somewhat.

Please, can we get on with it?! Yes, of course.

  1. Go to http://cran.r-project.org/bin/windows/Rtools/ and download the latest version (for me: Rtools216.exe)
  2. Run the installer; if you are only interested in building packages, you can accept the defaults throughout.
  3. Confirm and finish. You should now have a new directory C:\Rtools on your computer.
  4. Test your installation: type install.packages("xtable", type="source") at the R command line.
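Before running the test build in Step 4, it may be worth checking that R can actually see the tools. A minimal sketch using Sys.which; the list of tool names is illustrative rather than the complete set a build may need.

```r
# A few tools a source build typically calls; "" means "not found on the PATH"
tools_needed <- c("make", "gcc", "tar", "sed", "gzip")
found <- Sys.which(tools_needed)
print(found)

# Report anything that could not be located
not_found <- names(found)[!nzchar(found)]
if (length(not_found) > 0) {
  message("Not found on the PATH: ", paste(not_found, collapse = ", "))
}
```

If tools are reported as not found right after installing Rtools, the PATH seen by R is the first thing to look at.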

If you additionally want to build R from source, keep an eye open for the point where the installer asks what type of installation you want: by default, the parts of the Rtools necessary to build a fully capable R binary are not included, but you can select the extras for either a 32- or a 64-bit binary build (or both). This will add the directories C:\R (for 32-bit R) and/or C:\R64 (for 64-bit R).

Details Windows 7, R 2.15.1, Rtools 2.16

Further reading R Installation and Administration Manual (especially Appendix D, The Windows Toolset)


Hello world!

I guess you have to start somewhere, which in my case is here. Let me propose the following question for today: why is it butterfingered, but ham-fisted, and not butter-fingered and hamfisted?

Though I kind of see why hamfingered and especially butter-fisted would not fly (damn you, Marlon!).
