Background

I admit I have a certain weakness for SPSS: I used it extensively during my first university job (when R was still at version 0.63) and, more importantly, the first course I was fully responsible for was an SPSS course for graduate students. This was the first time I had to explain statistical concepts to people with little or no formal training, but (as a rule) very real data analysis problems. Consequently, I had to re-think a lot of theory for myself in order to be able to present a consistent and convincing story to my students, so that the course was a real education for me (as well as for them, hopefully).
In this context, the SPSS manuals were rather helpfully didactic, with interesting example data, the graphical user interface reasonably reasonable, and even the principle of “everything and the kitchen sink” in terms of statistics available in the menus was useful in that it would make a young statistician consult his Agresti in some depth. So all in all, despite the horrible marketing, the rental licences, the eagerness to split statistical functionality into countless shrink-wrapped and separately sold modules, I do have a soft spot for the old warhorse, and it still evokes some of the joy of standard errors I felt when we met the first time…
The problem

… but of course, when I recently received an SPSS .sav
file from a colleague, the first thing I did was to read it into R. This worked at the first try as suggested by the documentation:
require(foreign)
dat = read.spss("nonaffected.sav", to.data.frame=TRUE)
This generated a warning about some value labels it did not like, but the data (and almost all of the labels) made it OK into R. Sweet!
This, however, was when running R on Windows 7. When I switched to Ubuntu and tried to re-run my analysis, the exact same code and the exact same file generated an error message:
> dat = read.spss("nonaffected.sav", to.data.frame=TRUE)
Error in read.spss("nonaffected.sav", to.data.frame=TRUE) :
  error reading system-file header
Tales of woe

And this is where the fun started. Some googling showed that this has been an intermittent problem for other people in the past, but the suggested solutions varied: exporting to SPSS portable format (extension .por
), exporting to a text file (duh!), using the reencode
argument to read.spss
, commercial conversion software etc. I was a bit muggle-headed that day with an incipient cold, so I tried all of that in no special order.
Seeing that I have no SPSS installed either under Windows or under Linux, I should have started with reencode
, which allows you to specify a different character encoding for reading the file. The problem with that is that on my Ubuntu, there are ca. 1000 different encodings available (as shown via iconvlist()
at the R prompt). I tried some likely candidates for Windows encoding, prime among them ISO-8859-1, but that did not work.
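The trial-and-error with reencode can be scripted rather than typed. What follows is a minimal sketch, not a fix: try_reencode is a hypothetical helper (it is not part of the foreign package), and the pattern used to shortlist encodings is only an assumption about how iconvlist() names them on a given system. If the failure happens while reading the file header, as it did here, every candidate may well fail.

```r
# Hypothetical helper: try each candidate encoding in turn and return the
# first data frame that the reader produces without an error.
# The default reader is foreign::read.spss; it is only looked up when used.
try_reencode <- function(file, candidates, reader = foreign::read.spss) {
  for (enc in candidates) {
    res <- tryCatch(
      reader(file, to.data.frame = TRUE, reencode = enc),
      error = function(e) NULL   # swallow the error and move on
    )
    if (!is.null(res)) {
      message("Succeeded with reencode = ", enc)
      return(res)
    }
  }
  stop("none of the candidate encodings worked")
}

# Shortlist a few plausible Western European encodings out of the full list
# (assumption: their names end in "8859-1" or "1252" on this system)
candidates <- grep("8859-1$|1252$", iconvlist(), value = TRUE, ignore.case = TRUE)

# Usage, assuming the file from the text is in the working directory:
# dat <- try_reencode("nonaffected.sav", candidates)
```

Catching only errors (not warnings) means that a read which succeeds with a complaint about value labels still counts as a success.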
So the next step should have been to ask myself, so what if I have no SPSS?! There is a GNU version by the acronym of PSPP (hilarious, innit), which can easily be installed via sudo apt-get install pspp
. However, calling pspp
exposes you directly to the horrors of the SPSS command language… which are pretty horrible, as these things go. Consulting the PSPP documentation, you will learn the command for reading in a system file (IMPORT
) and writing to a portable file (EXPORT
). Alternatively, you can start the GUI that the PSPP people have thoughtfully included with their software, via the command psppire
at the command line, and use the File/Open/Data and File/Save as menus.
Regardless of the path travelled, I was happy to find that the original file was read without problems, even though there were numerous special characters of a Swedish nature (å and ä) in both variable names and value labels. What I did not find was a way of exporting the data to a non-SPSS format, like a tab-delimited text file (always a comfort in situations like these) – either it’s not there, or I’m too stupid. However, generating a so-called portable file was easy, using either the PSPP command line or GUI, so I did that and tried reading it into R. This is what I got:
> dat = read.spss("nonaffected.por", to.data.frame=TRUE)
Error in read.spss("nonaffected.por", to.data.frame=TRUE) :
  error reading portable-file dictionary
A new error message – sweet!
Solution 1 (ugly)

At this point I was fairly sure that the problem was with the special characters (you think?!). In a spirit of experimentation, I decided to eliminate them from the variable names only, using the Variable view in psppire
: with ca. 80 variables in the file and 10 or so containing special characters, this was somewhat bothersome (and boring), but next to nothing compared to the time I had already burnt on this. Once the data were saved with ASCII-only variable names, read.spss
could read both the .sav
and the .por
files.
Note that I did not touch the value labels defined for the factor variables, as there was a whole bunch of them and I’m lazy. The problem here was only the special characters in the variable names themselves; knowing this can save you a lot of time messing around with the variable editor.
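On a machine where the file does load (Windows, in this story), the offending variable names can be listed from R before opening the variable editor. The sketch below uses a made-up vector of names in place of names(dat); the replacement table covers only the Swedish letters and is an assumption about what counts as an acceptable ASCII substitute.

```r
# Hypothetical variable names, standing in for names(dat)
vars <- c("id", "\u00e5lder", "k\u00f6n", "vikt")

# Flag every name that contains a code point outside the ASCII range
has_special <- vapply(
  vars,
  function(v) any(utf8ToInt(v) > 127L),
  logical(1),
  USE.NAMES = FALSE
)

# These are the names that would need editing in psppire
to_fix <- vars[has_special]

# Suggested ASCII replacements (assumed mapping for Swedish letters only)
from <- c("\u00e5", "\u00e4", "\u00f6", "\u00c5", "\u00c4", "\u00d6")
to   <- c("a", "a", "o", "A", "A", "O")

clean <- vars
for (i in seq_along(from)) {
  clean <- gsub(from[i], to[i], clean, fixed = TRUE)
}
```

This only tells you which names to change and what to change them to; the renaming itself still has to happen in the SPSS file, for example via psppire.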
Solution 2 (pro)

Among the suggested solutions not mentioned above was one by Peter Dalgård, which was to switch the locale (i.e. the internationalization settings) for R, at least temporarily, to read in data with some Nordic issues (Danish in that case). That’s cool; however, the original code did not work for me: firstly, my data is Swedish, so I would need the locale sv_SE
instead of da_DK
, and secondly, my Ubuntu requires that I also specify the encoding, as in sv_SE.UTF-8
or sv_SE.ISO-8859-1
. Anyway, after some more googling (this must be the most boring way of procrastination known to mankind) I was convinced that the following code was worth a try:
lc = Sys.getlocale("LC_CTYPE")                          ## store the original locale
Sys.setlocale("LC_CTYPE", "sv_SE.ISO-8859-1")           ## switch to Swedish/Windows encoding
dat = read.spss("nonaffected.sav", to.data.frame=TRUE)  ## do it
Sys.setlocale("LC_CTYPE", lc)                           ## restore original locale
Except that the second line threw an error message (actually a warning) stating that the OS could not honor my locale request, and read.spss
bombed as before. And sure enough, listing the locales on my Ubuntu via locale -a
at the command line, I found that sv_SE.ISO-8859-1
was indeed not available. So, of course, back to the search engine.
Luckily, this turned out to be a standard problem, and the solution a simple two-step:
1. Add a line to a file, telling Ubuntu which combination of language and encoding to generate when forced to do so: the file is /var/lib/locales/supported.d/local
, and the line in our case is sv_SE.ISO-8859-1 ISO-8859-1
(at the end).
2. Force Ubuntu to generate the locale via sudo dpkg-reconfigure locales
.
After this two-step, switching the locale and reading the original (unmodified) SPSS file worked like a dream.
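The locale switch can also be wrapped in a small function that refuses to continue when the OS does not honor the request, and that restores the original locale even if the read fails. This is a sketch under stated assumptions: with_ctype is a hypothetical name, and it relies on the documented behaviour of Sys.setlocale, which returns an empty string (with a warning) when the locale cannot be set.

```r
# Hypothetical helper: evaluate `expr` under a temporary LC_CTYPE locale
with_ctype <- function(locale, expr) {
  old <- Sys.getlocale("LC_CTYPE")
  # Restore the original locale on exit, whether or not `expr` succeeds
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)

  new <- suppressWarnings(Sys.setlocale("LC_CTYPE", locale))
  if (!nzchar(new)) {
    stop("locale '", locale, "' is not available; check `locale -a`")
  }

  # `expr` is evaluated lazily, i.e. only now, under the new locale
  force(expr)
}

# Usage, assuming the file and locale from the text:
# dat <- with_ctype(
#   "sv_SE.ISO-8859-1",
#   foreign::read.spss("nonaffected.sav", to.data.frame = TRUE)
# )
```

Failing early on an unavailable locale avoids the confusing situation above, where the warning from Sys.setlocale is followed by the same old error from read.spss.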
Summary

If read.spss
throws an error message, the reason may be non-ASCII characters among the variable names. You can check this even without having SPSS on your machine, by using the PSPP program psppire
. If you find special characters in the variable names, removing them manually and saving the modified file should work. Just converting the file to portable format alone did not work for me.
Alternatively, you can temporarily switch the locale that R uses for internationalization and encoding, using Sys.setlocale
as shown above. If the SPSS file comes from Windows, it’s a fair bet that the encoding is ISO-8859-1
; combine this with your language and country code, as in the example, and you should be good to go. If not, check whether this specific locale exists on your machine (locale -a
) and force Ubuntu to generate it as outlined above if required.
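The availability check can be done without leaving R. The following sketch assumes a Linux system where the locale command exists and where locale -a prints names in the normalized glibc style (sv_SE.iso88591 rather than sv_SE.ISO-8859-1); locale_available is a hypothetical helper name.

```r
# Hypothetical helper: is `locale` among the locales installed on this system?
# `installed` defaults to the output of `locale -a` and is only evaluated
# when no list is supplied by the caller.
locale_available <- function(locale,
                             installed = system2("locale", "-a", stdout = TRUE)) {
  # Compare case-insensitively and ignore hyphens, so that
  # "sv_SE.ISO-8859-1" matches "sv_SE.iso88591"
  normalize <- function(x) tolower(gsub("-", "", x, fixed = TRUE))
  normalize(locale) %in% normalize(installed)
}

# Usage on the machine in question:
# locale_available("sv_SE.ISO-8859-1")
```

If this returns FALSE, the locale has to be generated first, as described in the two-step above.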
Details

R 2.15.1 on Windows 7 and Ubuntu 11.04