Start with the data

Check the distribution of the data and the relationships between factors

Marie Vaugoyeau

2 minutes read

After verifying that there are no typos in the data, and before starting the actual statistical analysis, I check the distribution of the data and the relationships between factors.

For this blog article, I use the air quality data set available in R (airquality).

The distribution of each variable

To see the distribution of each variable, I use ggplot2.


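library(dplyr)
library(ggplot2)

# D is assumed to be the built-in airquality data set
D <- airquality

# First, reshape the data into tidy (long) format
# Time is an approximate day index (30.5 days per month)
D_tidy <- D %>%
  tidyr::gather(VAR, value, -Month, -Day) %>%
  mutate(Time = Month * 30.5 + Day)

ggplot(D_tidy) +
  geom_line(aes(x = Time, y = value, color = Month)) +
  scale_color_distiller("Month", palette = "Spectral") +
  facet_wrap(~VAR, scales = "free_y") +
  theme_minimal() +
  ylab("")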

We can see that there are missing values in Ozone, but this is not an issue here.
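
To check the missing values numerically rather than by eye, a minimal sketch could look like the following. It assumes that D is the built-in airquality data set; the name na_counts is only used for illustration.

```r
# Assumption: the data used in this article is the built-in airquality data set
D <- airquality

# Count the missing values in each column
na_counts <- colSums(is.na(D))
print(na_counts)

# Proportion of rows where Ozone is missing
ozone_na_share <- mean(is.na(D$Ozone))
print(ozone_na_share)
```

Counting per column this way also shows whether variables other than Ozone contain gaps before the correlation step.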

The multi-test of correlations

To see which variables vary together, I test the correlations with cor.test.
I use the following R code, modified from Zuur et al. (2010).


par(bty = "n") # I prefer plots without a box, but this is optional

# I rewrote panel.smooth to change the shape of the points and the thickness of the red line
panel.smooth2 <- function(x, y, col = par("col"), bg = NA, pch = par("pch"),
    cex = 1, col.smooth = "red", span = 2/3, iter = 3, ...) {
  points(x, y, pch = 20, col = col, bg = bg, cex = cex)
  ok <- is.finite(x) & is.finite(y)
  if (any(ok))
    lines(stats::lowess(x[ok], y[ok], f = span, iter = iter),
          col = col.smooth, lwd = 2, ...)
}

# I modified panel.cor to show the estimate only when cor.test is significant
panel.cor <- function(x, y, digits = 1, prefix = "", cex.cor) {
  usr <- par("usr"); on.exit(par(usr = usr))
  par(usr = c(0, 1, 0, 1))
  r1 <- cor.test(x, y) # correlation test: r1$estimate and r1$p.value are used below
  r <- abs(cor(x, y, use = "pairwise.complete.obs"))
  txt <- format(c(r1$estimate, 0.123456789), digits = digits)[1]
  txt <- paste(prefix, txt, sep = "")
  cex <- if (missing(cex.cor)) 0.9 / strwidth(txt) else cex.cor
  # Leave the panel blank when the test is not significant
  if (r1$p.value < 0.005) text(0.5, 0.5, txt, cex = cex * r)
}

pairs(D, lower.panel = panel.smooth2, upper.panel=panel.cor, cex.labels=1.3)

My main modification is to leave the panel blank when the correlation test is not significant.

In the upper triangle, each number is the estimated correlation between the row variable and the column variable.
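
To make the content of a single upper panel concrete, the sketch below runs cor.test on one pair of variables and applies the same rule as panel.cor: the estimate is shown only when the p-value is below 0.005. It assumes that D is the built-in airquality data set; the pair Ozone and Temp and the names ct, est, p and label are chosen only for illustration.

```r
# Assumption: the data used in this article is the built-in airquality data set
D <- airquality

# Test the correlation for one pair of variables
# (cor.test keeps only the rows where both values are present)
ct <- cor.test(D$Ozone, D$Temp)

est <- unname(ct$estimate) # estimated correlation coefficient
p <- ct$p.value            # p-value of the test

# Same rule as in panel.cor: show the estimate only if the test is significant
label <- if (p < 0.005) format(est, digits = 1) else ""
print(label)
```

Running this for a few pairs is a quick way to check that the numbers printed in the pairs plot match what cor.test returns.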

And you, how do you run multiple correlation tests?