Directories, files and paths

Learning objectives:

  1. How to do basic navigation in directories and files using the R console
  2. Writing a for-loop to iterate over directories
  3. Vectorizing the for-loop using lapply()
  4. Applying any function automatically to a set of images inside folders using the above

Contents:

  1. Background
  2. Iteration through for loops
  3. Vectorization in R
  4. Functions
  5. Conclusions

Background


When you open R the default working directory is usually the user's home directory (typically C:/Users/yourname/Documents/ on Windows, and ~, e.g. /Users/yourname, on Mac OS X or other *nix based systems).

If nothing else is specified, the working directory is where files are loaded from and saved to. To get the current working directory, type getwd() into the console and press return:

> getwd()
[1] "/Users/joeschmoe"

To set the working directory to another directory, like /Users/joeschmoe/Documents, use the setwd() command.

> setwd('/Users/joeschmoe/Documents')
> getwd()
[1] "/Users/joeschmoe/Documents"
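
A minimal sketch of changing the working directory and going back again. It assumes that a temporary directory from tempdir() can stand in for a real project folder; setwd() invisibly returns the previous working directory, which makes it easy to restore:

```r
# tempdir() is only a stand-in for a real project folder
old.wd <- setwd(tempdir())   # setwd() returns the previous working directory invisibly
getwd()                      # we are now inside the temporary directory
setwd(old.wd)                # go back to where we started
restored <- identical(getwd(), old.wd)
restored
```

Storing the old directory this way avoids typing the original path a second time.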

R follows the Unix convention, so the current directory is "./" and you can always navigate one directory up by using "../" as input to setwd(), as in setwd('../').
To list all the content of a directory simply use dir():

> dir()
  [1] "figure1.pdf"                                                        
  [2] "section0023_EGFP_tdTomato.tif"                                            
  [3] "amazing_manuscript.doc"

A common command in terminal or console applications is ls. In R, however, ls() lists all the R objects currently held in memory:

> ls()
 [1] "model.1"               "model.2"                  "dataset"   

To get only the directories and not the file names, use the command list.dirs(). Importantly, turn off recursion (recursive=FALSE), because otherwise the command will return all the directories and their subdirectories:

> list.dirs('.', recursive=FALSE)
 [1] "./data"                                       "./manuscripts"                                
 [3] "./photos"                                               "./stuff" 
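
The difference between dir() and list.dirs() with and without recursion can be tried out safely on a throwaway folder. The sketch below assumes a small, hypothetical directory tree created under tempdir(); none of the names come from the experiment above:

```r
# Build a small throwaway directory tree (hypothetical names) under tempdir()
root <- file.path(tempdir(), "listing_demo")
dir.create(file.path(root, "data", "raw"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "photos"), showWarnings = FALSE)
file.create(file.path(root, "notes.txt"))

everything <- dir(root)                                            # files and folders at the top level
top.level  <- list.dirs(root, recursive = FALSE, full.names = FALSE) # folders at the top level only
all.levels <- list.dirs(root, recursive = TRUE, full.names = FALSE)  # folders at every depth

everything  # "data" "notes.txt" "photos"
top.level   # "data" "photos"
all.levels  # additionally contains the nested folder "data/raw"
```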

R is an object oriented language, so most outputs from a command can be saved into an object stored in RAM and manipulated further. The output of dir() and similar commands is a character vector, and each element of this vector is a string.

Let's go to the directory of my experiment, Experiment001, where I've saved a bunch of tiff images into directories, each directory containing the tiffs for a specific animal, e.g. Experiment001/mouse001/sections001.tif ....

> setwd('/Users/joeschmoe/Documents/Experiment001')

Or if I'm already in /Users/joeschmoe/Documents/ I can simply use:

> setwd('Experiment001')

In this directory I already know that I have no files, only folders for specific animals, so dir() is enough to get the list of directories:

> dir()
[1] "mouse001"  "mouse002"  "mouse003"

I have data from three mice. If I had files in this directory as well, I could get only the directory names and store them in an object called subjects using the assignment operator <-:

> subjects<-list.dirs('.', recursive=FALSE)
> subjects
[1] "./mouse001"  "./mouse002"  "./mouse003"

We now have a new object in RAM called subjects:

> ls()
 [1] "model.1"               "model.2"                  "dataset"                  "subjects"    

Notice that here list.dirs() returns paths relative to the current working directory, because we passed it the relative path '.', where "./" means the current directory. To get the full path names, pass an absolute path as input, such as the output of getwd():

> subjects<-list.dirs(getwd(), recursive=FALSE)
> subjects
[1] "/Users/joeschmoe/Documents/Experiment001/mouse001"
[2] "/Users/joeschmoe/Documents/Experiment001/mouse002" 
[3] "/Users/joeschmoe/Documents/Experiment001/mouse003"

Notice here how the new assignment overwrites anything that was previously contained in subjects. There is no Undo operation in R, so be careful when assigning new data to an already existing object. Make sure to write down every step in a script, both for reproducibility and to be able to trace back any errors.

We can then subset the subjects object to only display elements of it by using indexing:

> subjects[1]
[1] "/Users/joeschmoe/Documents/Experiment001/mouse001"

This displays the first element. Likewise:

> subjects[2:3]
[1] "/Users/joeschmoe/Documents/Experiment001/mouse002"
[2] "/Users/joeschmoe/Documents/Experiment001/mouse003"

This displays elements 2 to 3.
A negative index excludes elements, so we can display all elements except the second one (a logical vector such as c(TRUE, FALSE, TRUE) would select the same elements):

> subjects[-2]
[1] "/Users/joeschmoe/Documents/Experiment001/mouse001"
[2] "/Users/joeschmoe/Documents/Experiment001/mouse003"
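
The different indexing styles can be compared side by side. This is a small sketch on a hypothetical vector called ids (shortened names rather than the full paths) so that it does not overwrite subjects:

```r
ids <- c("mouse001", "mouse002", "mouse003")  # hypothetical, shortened names

ids[c(1, 3)]                 # positive indices select elements
ids[-2]                      # a negative index drops that element
ids[c(TRUE, FALSE, TRUE)]    # a logical vector keeps the TRUE positions
ids[ids != "mouse002"]       # logical vector built from a comparison
ids[!2]                      # !2 evaluates to FALSE, so nothing is selected
```

The first four lines all return "mouse001" and "mouse003"; the last one returns an empty character vector, which is why ! on a number is not a way to exclude an element.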

To get only part of the path name one can use the "substring" function, substr(x, start, stop), which takes as input the positions of the first and the last character to extract:

> substr(subjects, 1, 3)
[1] "/Us" "/Us" "/Us"

To get the number of characters in total we use the nchar(x, type = "chars", allowNA = FALSE) function:

> nchar(subjects)
[1] 49 49 49

All full path names have 49 characters.
We can combine substr() and nchar() to get the ID of each animal and save it into a new object called animals:

> animals<-substr(subjects, (nchar(subjects)-7),nchar(subjects))
> animals
[1] "mouse001" "mouse002" "mouse003"

This assumes that our animal IDs have a fixed length of 8 characters. An easier way is to get the basename() of the file or folder:


> animals<-basename(subjects)
> animals
[1] "mouse001" "mouse002" "mouse003"

Likewise, if we just want the folder that contains each subject folder, we can use the dirname() command:

> dirname(subjects)
[1] "/Users/joeschmoe/Documents/Experiment001"
[2] "/Users/joeschmoe/Documents/Experiment001"
[3] "/Users/joeschmoe/Documents/Experiment001"
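
basename() and dirname() can also be chained, and file.path() builds a path from its parts. The sketch below uses a hypothetical image path for illustration only; the file does not need to exist for these functions to work:

```r
# Hypothetical path used only for illustration; nothing is read from disk
p <- file.path("/Users/joeschmoe/Documents/Experiment001",
               "mouse001", "mouse001section001.tif")

basename(p)            # "mouse001section001.tif"
dirname(p)             # "/Users/joeschmoe/Documents/Experiment001/mouse001"
basename(dirname(p))   # "mouse001", the animal folder the image sits in
```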

In our mouse001 directory we have 3 tiff files and one text file:

> dir(subjects[1])
[1] "mouse001section001.tif"
[2] "mouse001section002.tif"
[3] "mouse001section003.tif"
[4] "coordinates.txt"

To check the file extension of a file we can use the strsplit() command for example:

> strsplit("name1.jpeg", "\\.")[[1]]
[1] "name1" "jpeg" 

We can apply tail() (which returns the last element of a vector, i.e. the file extension) to every split file name through sapply(), and then use which() to get the index of the files whose extension is equal to (==) tif:

> file.extensions<-sapply(strsplit( dir(subjects[1]), '\\.'), tail, n=1)
> file.extensions
[1] "tif" "tif" "tif" "txt"
>  which( file.extensions == 'tif')
[1] 1 2 3
>  which( file.extensions == 'txt')
[1] 4

If we load the tools package we have a simpler command file_ext() to get the same result:

> library(tools)
> file_ext(dir(subjects[1]))
[1] "tif" "tif" "tif" "txt"
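
Because file_ext() works on a whole vector of file names at once, it can be used directly as a filter. A minimal sketch, assuming a hand-typed vector of file names in place of a real dir() listing:

```r
library(tools)  # provides file_ext()

# Hand-typed file names standing in for the output of dir()
files <- c("mouse001section001.tif", "mouse001section002.tif", "coordinates.txt")

is.tif <- file_ext(files) == "tif"   # one TRUE/FALSE per file name
files[is.tif]                        # keep only the tif files

# The same filter written as a regular expression anchored at the end of the name
files[grepl("\\.tif$", files)]
```

dir() also accepts such a regular expression through its pattern argument, e.g. dir(subjects[1], pattern = "\\.tif$"), which filters while listing.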

In the next section we will use what we have learnt to create a for-loop that iterates through our folders and performs some function on every image in each folder. After that we will show how to speed up the for-loop by vectorizing the code using lapply() instead of for().

Iteration through for loops


In R, for-loops are simple to write and easy to debug when you need to iterate over several functions and data manipulations. However, R, like Matlab, is known for executing for-loops slowly, especially when several for-loops are nested within each other or when an object is grown inside the loop.

Therefore this section demonstrates the use of for-loops, and the next section shows R's equivalent of the vectorization Matlab uses to speed up iteration: the apply() family of functions.

The generic structure of any for-loop in R can be seen as:

for(iterator in vector){commands}

Here the iterator takes on the value of the nth element of vector on iteration n, and commands is executed once per iteration.

To make this concrete let's try the following for-loop:

ourvector<-c(1, 6, 3, 2)
for(iterator in ourvector){
 print('Iterator: ')
 print(iterator)
}

Which will print the following as output into the console:

[1] "Iterator: "
[1] 1
[1] "Iterator: "
[1] 6
[1] "Iterator: "
[1] 3
[1] "Iterator: "
[1] 2

It is very common to use the iterator as an index when executing the commands.
For example, we might write a for-loop that iterates through our folders and uses append() to add each mouse ID to a variable named animals, which we create as a character vector but do not populate:

 animals<-character() #create empty object for storing animal IDs
 for(j in subjects ){
 	ID<-basename(j) #extract ID of mouse folder
 	animals<-append(animals, ID) #append ID of the jth mouse to 'animals'
 }
 animals #display the result of animals
[1] "mouse001" "mouse002" "mouse003"
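
An alternative to growing animals with append() is to create the result vector at its final length and fill it by position. This is only a sketch with hypothetical paths; seq_along() generates the positions 1, 2, 3:

```r
subject.paths <- c("/exp/mouse001", "/exp/mouse002", "/exp/mouse003")  # hypothetical paths

animal.ids <- character(length(subject.paths))   # pre-allocate the result vector
for (k in seq_along(subject.paths)) {            # k runs over the positions 1, 2, 3
  animal.ids[k] <- basename(subject.paths[k])    # fill position k instead of appending
}
animal.ids
```

Filling a pre-allocated vector avoids copying the whole vector on every iteration, which matters once the number of iterations gets large.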

For-loops can also be nested so we can iterate over two or more indices at the same time. We can use this to loop through every animal folder and then add the name of each image contained in the folder to a variable called images:

animals<-character() #create empty object for storing animal IDs
images<-character() #create empty object for storing image basenames
for(j in subjects){
	ID<-basename(j) #extract ID of mouse folder
	animals<-append(animals, ID) #append ID of the jth mouse to 'animals'
	
	#then loop through each image in animal folder
	for(i in dir(j)){
		if( file_ext(i) == 'tif'){ #check if the file is a tif file
			images<-append(images, basename(i)) #if it is a tif file then add the ith image of that animal to the images vector
		} 
	}
	
}
animals
[1] "mouse001" "mouse002" "mouse003"
images
[1] "mouse001section001.tif" "mouse001section002.tif" "mouse001section003.tif" "mouse002section001.tif" "mouse002section002.tif" "mouse002section003.tif" "mouse003section001.tif"

We've now almost created a data set of animals and the images belonging to each animal. However this data set is not tidy in the sense that each row is an observation and each column is a variable describing a measurement on that observation. The reason is that the length() of the variables differ:

length(animals)
[1] 3
length(images)
[1] 7

We need to repeat entries in animals so that every element of images has a matching animal ID. We can do this by repeating the mouse ID as many times as there are tif images in that directory:

animals<-character() #create empty object for storing animal IDs
images<-character() #create empty object for storing image basenames
for(j in subjects){
	ID<-basename(j) #extract ID of mouse folder
	repetitions<-length(which(file_ext(dir(j)) == 'tif')) #compute how many repetitions we need
	ID<-rep(ID, repetitions) #ID is repeated once for every tif image in the jth directory
	animals<-append(animals, ID) #append ID of the jth mouse to 'animals'
	
	#then loop through each image in animal folder
	for(i in dir(j)){
		if( file_ext(i) == 'tif'){ #check if the file is a tif file
			images<-append(images, basename(i)) #if it is a tif file then add the ith image of that animal to the images vector
		} 
	}
	
}
animals
[1] "mouse001" "mouse001" "mouse001" "mouse002" "mouse002" "mouse002" "mouse003"
images
[1] "mouse001section001.tif" "mouse001section002.tif" "mouse001section003.tif" "mouse002section001.tif" "mouse002section002.tif" "mouse002section003.tif" "mouse003section001.tif"
length(animals)
[1] 7
length(images)
[1] 7

We can now combine animals and images into a data.frame() object. In R data frames are very useful since they can combine several different types of data, such as numeric and string/character data.

> mydata<-data.frame(animals, images)
> mydata
   animals                 images
1 mouse001 mouse001section001.tif
2 mouse001 mouse001section002.tif
3 mouse001 mouse001section003.tif
4 mouse002 mouse002section001.tif
5 mouse002 mouse002section002.tif
6 mouse002 mouse002section003.tif
7 mouse003 mouse003section001.tif

We've now stored our data frame object as mydata. To access individual variables in the data frame, subset using the $ symbol:

mydata$animals
[1] "mouse001" "mouse001" "mouse001" "mouse002" "mouse002" "mouse002" "mouse003"

The data frame can also be treated as a two-dimensional array. So we can access individual observations by mydata[row, column] for example to select the observation on the 4th row and the second column:

mydata[4,2]
[1] "mouse002section001.tif"
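
The two access styles can be combined with a logical condition to pull out all rows for one animal. A small sketch on a hand-made data frame called demo (hypothetical values, so that mydata is left untouched):

```r
# Small hand-made data frame (hypothetical values) so the example runs on its own
demo <- data.frame(
  animals = c("mouse001", "mouse001", "mouse002"),
  images  = c("mouse001section001.tif", "mouse001section002.tif",
              "mouse002section001.tif"),
  stringsAsFactors = FALSE   # keep the columns as plain character vectors
)

demo$images[3]                       # column by name, then element 3
demo[3, "images"]                    # same cell, addressed as [row, column]
demo[demo$animals == "mouse001", ]   # all rows belonging to one animal
n.mouse001 <- nrow(demo[demo$animals == "mouse001", ])
n.mouse001                           # number of images for that animal
```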

R has several functions for computing both descriptive and inferential statistics on data frames directly. For example, if we would like to know how many images we have per animal we can easily use the table() function:

> table(mydata$animals)

mouse001 mouse002 mouse003 
       3        3        1 

We can here see that we have three images for mouse001 as well as mouse002 but mouse003 only has one single image.
For-loops are useful for trying things out and for debugging, but with respect to performance they can be slow in R, especially if they are nested like the one above. Next we'll demonstrate how to use the apply() functions to reduce processing time.

Vectorization in R


Let's begin by understanding apply() and how it easily vectorizes statistical computations on matrices. We start by making a matrix object with 8 rows and 8 columns filled with random data, using the rnorm(n, mean=0, sd=1) function, which draws n samples from a normal distribution with mean 0 and standard deviation 1 by default:

> mymatrix<-matrix(rnorm(8*8), ncol=8, nrow=8)
> mymatrix
           [,1]        [,2]        [,3]       [,4]       [,5]       [,6]        [,7]       [,8]
[1,]  1.4778741 -1.39498759  1.34237059 -0.4327339 -2.1676488 -2.0462595  0.50607432 -1.4378282
[2,]  0.9249665 -0.11479861  0.63942608  0.6793854 -0.6353729 -0.2977904  0.42098053 -1.7151697
[3,] -0.4696242  1.97434985 -0.09665343  1.7676422  0.2326265 -1.1574151  0.06376697  0.1258279
[4,] -0.3808475 -0.18415909 -0.25677780 -0.7994633  0.8527967 -0.5482186 -1.14500185 -0.7568367
[5,]  0.2440967 -0.07499484 -1.39588671 -0.1643734 -0.9165985  0.4222664  1.79283103 -1.2640698
[6,]  0.2247876 -0.50179447 -1.27837027 -1.0444757  0.9282552 -1.5257497 -0.33052443 -0.8296681
[7,] -1.4142843  0.40906447 -0.10544004 -0.7578096 -0.6829120  0.4134579 -0.21303588 -0.9952677
[8,] -0.6111092  0.17054424 -1.09200401 -1.4288085  1.5746793  0.9501571 -0.45377496  0.0461885

If we want to compute the mean for each row we might do it with a for-loop:

row.means<-numeric()
for(i in 1:nrow(mymatrix)){
  row.means<-append(row.means, mean(mymatrix[i,]) )
}
row.means
[1] -0.51914236 -0.01229664  0.30506508 -0.40231351 -0.16959114 -0.54469249 -0.41827839 -0.10551593

However, this is less efficient and requires more code than using the apply() function, which applies a function such as mean() across any given dimension of mymatrix. So to get the mean of each row we apply it to the first dimension, and to get the mean of each column we apply it to the second dimension:

apply(mymatrix, 1, mean) # mean for each row
[1] -0.51914236 -0.01229664  0.30506508 -0.40231351 -0.16959114 -0.54469249 -0.41827839 -0.10551593
apply(mymatrix, 2, mean) # mean for each column
[1] -0.0005175308  0.0354029945 -0.2804169511 -0.2725795997 -0.1017718029 -0.4736939904  0.0801644661 -0.8533529629

We can plug any function into apply(), so we can for example compute the standard deviation, sd(), or the sum, sum(), instead:

apply(mymatrix, 1, sd)
[1] 1.4722475 0.8741831 1.0637945 0.5977127 1.0447976 0.8142598 0.6575420 1.0098017
apply(mymatrix, 1, sum)
[1] -4.1531389 -0.0983731  2.4405206 -3.2185081 -1.3567291 -4.3575399 -3.3462271 -0.8441275
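
For the specific case of row and column means and sums, base R also ships dedicated functions (rowMeans(), colMeans(), rowSums(), colSums()). The sketch below uses its own random matrix m with a fixed seed, so its numbers differ from mymatrix above, and checks that both routes agree:

```r
set.seed(1)                                   # fixed seed so the run is reproducible
m <- matrix(rnorm(8 * 8), nrow = 8, ncol = 8) # separate demo matrix, not mymatrix

same.row.means <- all.equal(apply(m, 1, mean), rowMeans(m))
same.col.means <- all.equal(apply(m, 2, mean), colMeans(m))
same.row.sums  <- all.equal(apply(m, 1, sum),  rowSums(m))

same.row.means
same.col.means
same.row.sums
```

apply() remains the general tool, since it accepts any function, while the dedicated functions only cover means and sums.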

The apply() function is great when used on arrays. But often the objects we want to manipulate are better thought of as lists whose elements have different lengths. Remember that we have three folders and an unequal number of images in each folder. For this kind of situation the more generic "apply to lists" function, lapply(), exists.

But before using this we need to create our own function.

Functions


Functions in R are treated as objects as well. You can easily create your own function with the function() command, using the following structure:

myfunction<-function(x){x+2}
myfunction(2)
[1] 4

This function, for example, takes any number, x, as input and outputs that number plus 2. Once the function is assigned to an object, in this case myfunction, it can be called from the console simply as myfunction().

A function may also take multiple inputs:

myfunction<-function(x,y){x+2*y}
myfunction(2, 3)
[1] 8
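
Arguments can also be given default values, and arithmetic inside a function works elementwise on vectors. A minimal sketch with a hypothetical helper called add.scaled(), which is not used elsewhere in this lesson:

```r
# Hypothetical helper: scale y by a factor and add it to x
add.scaled <- function(x, y = 1, factor = 2) {
  x + factor * y
}

add.scaled(2, 3)               # 2 + 2*3 = 8
add.scaled(2)                  # y defaults to 1: 2 + 2*1 = 4
add.scaled(2, 3, factor = 10)  # arguments can be named: 2 + 10*3 = 32
add.scaled(c(1, 2, 3))         # vector input gives 3 4 5
```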

At this point we might stop for a while and think about what kind of function we want to create. We want to create a function that given a list of folders searches through each folder and extracts the name of each tiff image inside and adds this together with the folder name into a data frame.

So at the very basic level the input to our function is going to be a single string pointing to the absolute path of a directory of a mouse, e.g. "/Users/joeschmoe/Documents/Experiment001/mouse001". In our practical application the input is going to be a vector of strings, for example:

x<-c("/Users/joeschmoe/Documents/Experiment001/mouse001",
"/Users/joeschmoe/Documents/Experiment001/mouse002",
"/Users/joeschmoe/Documents/Experiment001/mouse003")
myfunction(x)

This might be a call to our function.
We might also want to give our function a more concrete name, such as get.images().
One level down in our function, we can consider the input to be the list of files in a directory, which we iterate through to get a list of tif files. Let's start at this level.

So our function is going to take as input a filename, e.g. "mouse001section001.tif" or "coordinates.txt", and if this matches to a tif file then it should return the base of the filename otherwise return NULL. Using the file_ext() from library(tools) we can create a function we name get.images.basename() which looks like this:

get.images.basename<-function(x){
	if( file_ext(x) == 'tif' ){
        return(basename(x))
    }
}

Now we can demonstrate the use of the lapply() function. Let's set up a character vector with some made-up file names:

filenames<-c("image.tif", "textdocument.txt", "otherdocument.raw", "secondimage.tif")
lapply(filenames, get.images.basename)
[[1]]
[1] "image.tif"

[[2]]
NULL

[[3]]
NULL

[[4]]
[1] "secondimage.tif"

The object returned is a list. Without touching too much on the topic of how list objects work in R we can always unlist the list into a vector by using unlist():

tiffFiles<-lapply(filenames, get.images.basename)
unlist(tiffFiles)
[1] "image.tif"       "secondimage.tif"

unlist() drops the NULL elements, so we end up with a vector whose length equals the number of tif files.
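
For this particular filtering step, two shorter routes give the same vector. The sketch below assumes the same four made-up file names, stored under the hypothetical name some.files, and relies on file_ext() being vectorized:

```r
library(tools)  # provides file_ext()

some.files <- c("image.tif", "textdocument.txt", "otherdocument.raw", "secondimage.tif")

# Route 1: logical indexing, no explicit iteration needed
tif.direct <- some.files[file_ext(some.files) == "tif"]

# Route 2: Filter() keeps the elements for which the function returns TRUE
tif.filter <- Filter(function(f) file_ext(f) == "tif", some.files)

tif.direct
tif.filter
```

lapply() is still the more general pattern, because the function applied to each file can do arbitrary work before returning.
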
Now we have the first part of the function we want in the end. As we said earlier, the function should take as input a list of folders, check each folder for tiff files, and return a table with two columns: one with the animal ID and another with the image names:

get.animal.and.images<-function(x){
   filenames <-dir(x)

   tiffFiles<-lapply(filenames, get.images.basename)
   
   images<-basename(unlist(tiffFiles))
   animals<-rep(basename(x), length(images))
   
   return(cbind(animals, images))
}

Here we use cbind() to bind two vectors of equal length into a two-dimensional array with the vectors as columns; rbind() does the same but binds row-wise.

Let's apply this function to the folder where the mouse001, mouse002 and mouse003 folders are located:


folders<-dir("/Users/joeschmoe/Documents/Experiment001", full.names=TRUE) #full paths to the animal folders in our experiment
ourlist<-lapply(folders, get.animal.and.images)
ourlist
[[1]]
     animals images              
[1,] "mouse001"  "mouse001section001.tif"
[2,] "mouse001"  "mouse001section002.tif"
[3,] "mouse001"  "mouse001section003.tif"

[[2]]
     animals images              
[1,] "mouse002"  "mouse002section001.tif"
[2,] "mouse002"  "mouse002section002.tif"
[3,] "mouse002"  "mouse002section003.tif"

[[3]]
     animals images              
[1,] "mouse003"  "mouse003section001.tif"

As you can see, lapply() returns a list object where each item in the list is a character matrix with the columns animals and images.

The standard way in base R to stack a list of such tables, which all have the same columns, is do.call() with rbind(). This is not optimized and can be slow for long lists (packages such as dplyr offer faster alternatives for data frames), but it will do the job fine here:

mydata<-do.call(rbind, ourlist)
mydata
     animals    images                  
[1,] "mouse001" "mouse001section001.tif"
[2,] "mouse001" "mouse001section002.tif"
[3,] "mouse001" "mouse001section003.tif"
[4,] "mouse002" "mouse002section001.tif"
[5,] "mouse002" "mouse002section002.tif"
[6,] "mouse002" "mouse002section003.tif"
[7,] "mouse003" "mouse003section001.tif"

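This produces the same table as our for-loop, but with a single function applied to each folder. The result is a character matrix; as.data.frame(mydata) converts it into a data frame. We are not restricted to collecting file names:
you can easily apply any function to the image once its file name is provided as input.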
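
The whole pipeline can be tried without touching real data. The following is a sketch under stated assumptions: it creates a throwaway experiment folder under tempdir() with empty placeholder files, and it uses a hypothetical helper list.one.animal() that returns a data frame instead of the matrix produced by cbind():

```r
library(tools)  # provides file_ext()

# Build a throwaway experiment folder (hypothetical layout) under tempdir()
exp.dir <- file.path(tempdir(), "Experiment_demo")
for (mouse in c("mouse001", "mouse002")) {
  dir.create(file.path(exp.dir, mouse), recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(exp.dir, mouse, paste0(mouse, "section001.tif")))  # empty placeholder image
  file.create(file.path(exp.dir, mouse, "coordinates.txt"))                # non-image file to be skipped
}

# Hypothetical helper: one small data frame per animal folder
list.one.animal <- function(folder) {
  files  <- dir(folder)                       # bare file names in the folder
  images <- files[file_ext(files) == "tif"]   # keep only the tif files
  data.frame(animals = rep(basename(folder), length(images)),
             images  = images,
             stringsAsFactors = FALSE)
}

folders  <- list.dirs(exp.dir, recursive = FALSE)               # full paths of the animal folders
demo.set <- do.call(rbind, lapply(folders, list.one.animal))    # stack the per-animal tables
demo.set
```

Because list.dirs() returns full paths by default, this version does not depend on the current working directory.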

Conclusions


Finally, let's define a get.experiment() function which takes as input the folder where our experiment is stored, e.g. "/Users/joeschmoe/Documents/Experiment001":


library(tools)
get.images.basename<-function(x){
 	if( file_ext(x) == 'tif' ){
         return(basename(x))
     }
 }

get.animal.and.images<-function(x){
   filenames <-dir(x)

   tiffFiles<-lapply(filenames, get.images.basename)
   
   images<-basename(unlist(tiffFiles))
   animals<-rep(basename(x), length(images))
   
   return(cbind(animals, images))
}

get.experiment<-function(experimental.folder){
	folders<-dir(experimental.folder, full.names=TRUE) #full paths, so this works from any working directory
	ourlist<-lapply(folders, get.animal.and.images)
	dataset<-do.call(rbind, ourlist)
	return(dataset)
}

Once this is loaded into the R console you can run it on any experiment folder you want to get images from:

get.experiment("/Users/joeschmoe/Documents/Experiment001")

If we would like to, for example, unmerge() every image in the directory into tiles with a size of approximately 1024 and 10% overlap, we change the get.images.basename() function to:


get.images.basename<-function(x){
 	if( file_ext(x) == 'tif' ){
         unmerge(x, 1024, 10) #before storing filename in list apply unmerge()
         return(basename(x))
     }
 }

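Do not forget to load the wholebrain package first with library(wholebrain), since unmerge() comes from wholebrain. Note that dir(x) returns bare file names, so unmerge() may need a full path to find the image, e.g. by using dir(x, full.names=TRUE) in get.animal.and.images().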

Even after this short lesson in vectorization using apply(), there are still situations where it may make sense to use for-loops instead of vectorized functions:

  1. Using functions that don’t take vector arguments
  2. Loops where each iteration is dependent on the results of previous iterations
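
The second case can be illustrated with a sequence in which every value is computed from the two values before it. This is a minimal sketch using the Fibonacci numbers as a stand-in example; the loop cannot be handed to lapply() element by element because iteration k needs the results of iterations k-1 and k-2:

```r
# Each value depends on the two before it, so the iterations are not independent
n <- 10
fib <- numeric(n)      # pre-allocate the result vector
fib[1:2] <- 1          # the first two values are given
for (k in 3:n) {
  fib[k] <- fib[k - 1] + fib[k - 2]
}
fib   # 1 1 2 3 5 8 13 21 34 55
```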