Montag, 29. Oktober 2012

XSLT script to transform xml tree to table for import into sql database

This XSLT script transforms a Tree from freemind to a CSV file in the form of
PARENT_ID, CHILD_ID, CHILD Text

I need this to get it into a postgresql database.



<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="text" encoding="UTF-8"/>
 <xsl:template match="/">
 <xsl:apply-templates/>
 </xsl:template>
  <xsl:template name="linebreak">
   <xsl:text>
</xsl:text>
  </xsl:template>
  <xsl:template match="map">
   <xsl:apply-templates select="child::node"/>
  </xsl:template>
 
  <xsl:template match="node">
   <xsl:param name="commaCount">0</xsl:param>
  
  
    <xsl:if test="$commaCount > 0">
      <xsl:call-template name="writeCommas">
       <xsl:with-param name="commaCount" select="$commaCount"/>
      </xsl:call-template>
    </xsl:if>
   
   <xsl:value-of select="../@ID"/>,<xsl:value-of select="@ID"/>,<xsl:value-of select="@TEXT"/>
   
     <xsl:call-template name="linebreak"/>
      <xsl:apply-templates select="child::node">
        <xsl:with-param name="commaCount" select="$commaCount + 1"/>
      </xsl:apply-templates>
     </xsl:template>
     <xsl:template name="writeCommas">
      <xsl:param name="commaCount">0</xsl:param>
       <xsl:if test="$commaCount > 0"> <xsl:call-template name="writeCommas">
         <xsl:with-param name="commaCount" select="$commaCount - 1"/>
     </xsl:call-template>
    </xsl:if>
 </xsl:template>
</xsl:stylesheet>

          

Dienstag, 23. Oktober 2012

household savings is not household savings

 

http://epp.eurostat.ec.europa.eu/tgm/table.do?tab=table&init=1&plugin=1&language=en&pcode=tsdec240

is the household savings rate calculated by eurostat which is defined as:

“(ESA95 code: B8G) divided by gross disposable income (B6G), with the latter being adjusted for the change in the net equity of households in pension funds reserves (D8net)”

I have documented the differences here:http://rpubs.com/johanneskutsam/savingsrate

Texreg and discourse analysis

Phillip Leifeld has developed two useful tools: texreg – a tool to print latex tables from different types  regressions in R and a software for discourse analysis that looks useful for my political science amigos.

Mittwoch, 10. Oktober 2012

multiple regressions in one table

 

You often want to compare coefficients between different models. mtable from the memisc package does exactly that. As its author (Martin Elff) just pointed out to me it is possible to use mtable of the number of regression results is variable. See the following example:

library(memisc)
x=rnorm(10)
y=rnorm(10)
z=rnorm(10)
l1=lm(x~z)
l2=lm(z~y)
li=list(l1,l2)
attributes(li)=list(names=list(1,2))
do.call(mtable,li)

In a future version of mtable Martin said there will be a functionality to include lists directly

mtable(list=list("1"=l1,"2"=l2))

Thanks Maritn for the great tool!

Montag, 8. Oktober 2012

Eurostat in R in GIT

 

I have published the source for the free read.eurostat() in R at https://github.com/toprach/eurostat_r.git

Users of STATA can buy my stata version  from here: https://sites.google.com/site/eustatuse/

 

Update: I have just found out that the rdatamarket library can also download data from EUROSTAT so you may have a look at it too.

install.packages(“rdatamarket”)
library(rdatamarket)
dmlist(http://data.is/PllzbZ
)

My impression is that rdatamarket is a little slower than my implementation – but rdatamarket gives access to UN, Eurostat, Worldbank, FAO, and other providers – via a single Interface. I guess I will use my library for EUROSTAT and rdatamarket for the other sources.

There is also a library WDI  to access Worldbank data directly from R.

Samstag, 6. Oktober 2012

Downloading datasets from EUROSTAT with R

 

read.eurostat(datasetname, LANGUAGE=”en”)

Is a function I have written to facilitate the downloading and creation of data.table objects from eurostat data.

 

library is used to download data from eurostat and to find datasets
#later extend to extend to find datasets with certain dimensions

#download data from eurostat
#unpack and convert to dataframe
#load label descriptions
#load factors
#save as r data object

datasetname

#this="jvs_a_nace2"

LANGUAGE="en"

 

read.eurostat=function(datasetname,LANGUAGE="en",save=FALSE){
  library(RCurl)
  library(data.table)
  library(reshape)
  library(stringr)
 
  baseurl="
http://epp.eurostat.ec.europa.eu/NavTree_prod/everybody/BulkDownloadListing?sort=1&file=data%2F"
 
  fullfilename=paste(datasetname,".tsv.gz",sep="")
  temp <- paste(tempfile(),".gz",sep="")
  download.file(paste(baseurl,fullfilename,sep=""),temp)
  dataconnection <- gzfile(temp)
  d=read.delim(dataconnection)
  longdata=melt(d,id=colnames(d)[1])
 
  firstname=colnames(d)[1] # remove .time and count how many headings are there
  firstname=substr(firstname,1,nchar(firstname)-nchar(".time"))
  headings=toupper(strsplit(firstname,".",fixed=TRUE)[[1]])
  headingcount=length(headings)
  colnames(longdata)=c("dimensions","time","value")
 
 
  #get the data on the dimension tables
  df=data.frame(dimensions=as.character(longdata[,"dimensions"]))
  df = transform(df, dimensions= colsplit(dimensions, split = "\\,",names=headings))
  dimensions=data.table(df$dimensions)
 
  #download the dimension labels - save headings as better variable
  dimfile=paste("
http://epp.eurostat.ec.europa.eu/NavTree_prod/everybody/BulkDownloadListing?sort=1&file=dic%2F",LANGUAGE,"%2Fdimlst.dic",sep="")
 
  temp <- paste(tempfile(),".gz",sep="")
  download.file(dimfile,temp)
  dataconnection <- gzfile(temp)
  dimdata=read.delim(dataconnection,header=FALSE)
  colnames(dimdata)=c("colname","desc")
  lab=dimdata$desc
  names(lab)=dimdata$colname
 
  #create  headings that speak for themselves for columns
  speakingheadings=as.character(lab[headings])
 
  #download factors for each heading and add
  for(heading in headings){
    factorfile=paste("
http://epp.eurostat.ec.europa.eu/NavTree_prod/everybody/BulkDownloadListing?sort=1&file=dic%2F",LANGUAGE,"%2F",tolower(heading),".dic",sep="")
    temp <- paste(tempfile(),".gz",sep="")
    download.file(factorfile,temp)
    dataconnection <- gzfile(temp)
    factordata=read.delim(dataconnection,header=FALSE)
    colnames(factordata)=c(heading,paste(heading,"_desc",sep=""))
    #join the heading to the heading dataset
    dimensions=merge(dimensions,factordata,by=heading,all.x=TRUE)
  }
 
 
    #at the end at speaking headings
  setnames(dimensions,colnames(dimensions)[1:length(speakingheadings)],speakingheadings)
 
    #add data columns by writing and reading again---FASTER ;-)
  temp=tempfile()
  values=data.frame(value=as.character(longdata$value),stringsAsFactors=FALSE)
  v=try(transform(values, value= colsplit(value, split = "\\ ",names=c("value","flag"))),silent=TRUE)
 
 
  if(class(v)=="try-error"){
    print("there are no flags")
    values=data.table(values)
    values$value=as.character(values$value)
    values$value=as.double(values$value)
  }else{
    values=v$value
    values=data.table(values)
    values$value=as.character(values$value)
    values$value=as.double(values$value)
    values$flag=as.character(values$flag)
    values[value==flag,flag:=NA]
  }
 
  #fix the time column
  identifytimedim=as.character(longdata$time[1])
 
  #TODO continue here: identify Q for quarter, M for month, - if non YEAR
  if found Q then zoo -> Q  else zoo -> M
  as.yearqtr("200706", "%Y%m")
  grep(,"\\")
 
 
  eurostatdata=cbind(dimensions,time=longdata$time,values)
  if(save) save(eurostatdata,file=paste(datasetname,".RData",sep=""))
  eurostatdata
}