# preface.R
# (c) 2011 Mary E. Beckman & Bridget Smith, Ohio State University
# Code from the R code section of the preface to Beckman, Foltz,
# and Smith "Analyzing the Sounds of Languages."

#### Getting started
# When opening an R script, if you are running Windows, you
# must first start the R program. Then from within the R
# program, choose the File menu, then select “open script”
# and browse to where you have saved the R script. For
# example, a copy of this R note is saved inside a script
# called preface.R, which can be found on the course website.
# If you open the script by double-clicking on it, it will
# open in WordPad or a text editor, so you will not be
# able to use it in R.
# On a Mac, you can open a script the same way, or you can
# double-click directly on the script, and it should open in
# R.
# Once you have opened the script, you should have two
# windows running R. The window that contains the text and
# code from this section is called by the name of the R
# script. The main window, where all of the output is listed,
# is called the R console. The difference is that you can
# write and erase as much as you want inside the script, and
# it doesn’t do anything until you transfer a line of code to
# the console. Inside the console window, you cannot change a
# line of code once it is entered; you instead have to
# re-enter the changed line of code. To run any line of code,
# you can do any of the following:
# 1) Type the line directly into the R console window
#    followed by a carriage return.
# 2) Use the keyboard or mouse to copy the line (including
#    the carriage return at the end) from the script and paste
#    it into the R console window.
# 3) Highlight the part of the code you want to use from
#    the script by clicking and dragging your mouse over it.
#    Then, for a Mac, hold down the Command key (the
#    swirlygig button, formerly the apple button) on the left
#    of the keyboard, and then also hit Return (as you are
#    continuing to hold the Command key).
#    If you use R on a PC running Windows (or Linux), you
#    will highlight the code and press Ctrl and the letter "r".
# The last option is the fastest way of getting a line of
# code to work because it transfers it to the console and
# executes the command all at once. Even if you are writing
# the commands or functions yourself, you should write them
# in a script window first and then execute them from the
# script. That way, it is easier to go back and make changes
# to your code (which you will probably have to do often at
# first).

#### Set the working directory
# Next, it is a good habit to always set the working
# directory to the place where the files that you will be
# using are located. For example, to execute the command to
# read in the data file of names and nicknames that we will be
# working with in the exercises, you should have downloaded
# the nicknames.txt file to some directory such as the place
# where you are storing your notes from class for this
# course. For example, if you were an Ohio State University
# student and you were working in the computer lab where we
# teach the course in Columbus, you would download the file
# to the desktop. To set the working directory to the desktop
# then, you can use the command:
setwd("/Users/buckeye/Desktop")
# Or if you were working at home, and your home computer is
# a PC and you’re saving your notes in a directory called
# “DataAnalysis” in the “Documents” directory on your hard
# drive, the path might be something like the following:
setwd("C:/Documents/DataAnalysis")
# If you are using a Mac, you also can go to Misc in the top
# menu bar, and choose “Change Working Directory” and then
# browse until you find the directory where you want to be
# working. If you are working on a PC running Windows, this
# menu choice will be under the File heading, rather than
# Misc.
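# A quick check that may help here (a minimal sketch, not part
# of the original exercises): the getwd() function reports the
# current working directory, and the file.exists() function
# returns TRUE if a file with the given name can be found
# there. This assumes that you saved the data file under the
# name nicknames.txt.
getwd()                        # Shows the current working directory.
file.exists("nicknames.txt")   # TRUE if the data file is in that directory.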

#### Command lines versus comment lines
# In the R Note box above and in every R code section
# (including this one), we will be distinguishing our
# descriptions of R commands from the code proper by using
# different fonts. For example, the following line is the R
# code for subtracting 2 from 7:
7 - 2
# If you are using the script preface.R, you will probably
# have noticed by now (but it doesn’t hurt to point out) that
# strings of text preceded by the hash mark # are notes
# inside the R script. R ignores anything that is preceded by
# a hash mark #, so that’s why we use these to “hide” notes
# in a script. For example, the following is the same line of
# R code, with a note about what it does.
7 - 2 # Subtracts 2 from 7
# You can write additional notes for yourself inside the
# script and save it to look over later.

#### Arithmetic operators
# Now the easiest thing to do in R is basic arithmetic
# operations. Basic operator signs are as follows:
459 + 51 # Adds 51 to 459
459 - 51 # Subtracts 51 from 459
459 * 51 # Multiplies 459 by 51
459 / 51 # Divides 459 by 51
51 ^ 3 # Calculates 51 to the power of 3 -- i.e., the same as:
51 * 51 * 51
# Remember, to evaluate each of these expressions, you could 1)
# type it into the console then hit return, 2) copy and paste
# it into the console and press return, or 3) highlight it
# from the portion of the script directly above this note,
# and press Command + Return (or Ctrl + R for a PC
# running Windows). Try executing the commands in each of the
# three ways.
# If it has been a while since you've done a lot of this kind
# of arithmetic, it might be good to review some other
# basics, such as the difference between the following two
# commands.
(459 + 51) / 3 # Adds 51 to 459 and then divides the result by 3.
459 + (51 / 3) # Adds 459 to the result of dividing 51 by 3.
# Notice that when you run any of the commands above, the
# next line on the R console window is the result of
# performing the operation specified by the command.
# The symbols “+” and “*” and so on are called operators,
# because they are what cue the operation.

#### Functions
# Addition also can be performed using the R function sum().
# A function is a named command that can stand for an
# arbitrarily long sequence of operations. That is, it is
# like a shortcut for more complicated mathematical functions
# or processes, including operating the graphical device,
# that have been pre-programmed into R. A function will be
# followed by parentheses where you will specify further bits
# of information that R needs in order to perform the
# selected function. Each bit of information is an argument.
# If there are several arguments, they are separated from
# each other by commas. For example, the sum() function adds
# its arguments together, so the following two commands
# return the same result:
459 + 51 + 327 # Adds together 459, 51, and 327.
sum(459, 51, 327)

#### The R assignment operator
# Another very important special symbol is "=", the
# assignment operator. This operator tells R to assign the
# value on its right to the thing on its left. In the simplest
# case, this is just like giving a name to the value. So for
# example:
x1 = 459 + 51 # Adds 51 to 459 and assigns the result the name "x1".
x2 = sum(459, 51, 327) # Stores the sum of these 3 numbers in "x2".
x3 = 7 - 2 # Takes 2 from 7 and stores the result in "x3".
x4 = sum(7, -2) # Stores the sum of 7 and -2 in "x4" (same as x3).
# Once you have stored the result of an operation in this
# way, you can retrieve the computed value just by typing the
# "name" that you’ve given to the value. So, for example, if
# you type x1 or x2 or x3 or x4 in the R console window after
# running the above four commands, the next line on the R
# console window is the same value that you would have got by
# running the original command again. This is especially
# convenient if you want to store more complicated values,
# such as a vector of numbers instead of a single number.
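# As a small illustration of the point above (a sketch that
# reuses the names x1 and x3 from the assignment examples),
# a stored value can be used in further calculations just
# like the number that it stands for:
x1 = 459 + 51
x3 = 7 - 2
x1 # Echoes the stored value, 510.
x1 / x3 # Divides 510 by 5, which returns 102.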

#### The R vector function
# First of all, a vector is basically a single row of items.
# In R, you can specify that a set of values is a vector by
# typing them, separated by commas, as arguments to the c()
# function, like this:
c(459, 51, 327)
# The c() part of this command is a function that tells R to
# "concatenate" the arguments, which means "to group these
# items together in order," which is a simple definition of a
# vector. Again, a function is like a shortcut for more
# complicated mathematical functions or processes. The name
# of the function will be followed by parentheses where you
# will specify further information that R needs in order to
# perform the selected function. In this case, the ()
# parentheses enclose the items you want grouped together,
# and the items are separated by commas, to show where one
# item ends and the next begins. If you assign the vector a
# name, like this:
x5 = c(459, 51, 327)
# you have a way of referring back to it, so that you don’t
# have to keep typing the same numbers in over and over
# again. So, after you run the above command, the following
# two lines of code return the same value.
sum(459, 51, 327)
sum(x5)
# The length() function lets you count the number of items in
# a vector. So the value that is returned by the following
# command is the number of items in the vector x5 that you
# created earlier.
length(x5)
# You can refer to a value at any position in a vector by
# following the vector with the position number enclosed in
# square brackets. So after you have defined x5 as above, the
# following three commands all return the same result.
51 # Just type the number and R will echo it.
c(459, 51, 327)[2] # Specify the second item in the vector.
x5[2]
# The following two commands also are equivalent ways of
# adding 459 and 51.
459 + 51
x5[1] + x5[2]
# This equivalence may seem a bit boring and trivial now, but
# wait until you see what this buys you when you’re dealing
# with longer vectors or more complicated items.

#### Numbers versus character strings
# The values that we’ve been talking about so far have all
# been numbers, such as 459 or the result of summing up the
# numbers 459, 51, and 327 that were assigned to the vector
# x5. But R lets you also make observations about character
# strings, such as the strings of letters that constitute the
# spelled forms of the names Cynthia and Elizabeth. You
# distinguish the two types of values by enclosing the latter
# in quotation marks, like this:
name1 = "Robert"
name2 = "Jonathan"
name3 = "Cynthia"
name4 = "Elizabeth"
# These four lines of code assign the character strings "Robert",
# "Jonathan", "Cynthia", and "Elizabeth" to the variables name1,
# name2, name3, and name4. You can check to see whether these
# are numbers or character strings using the is.numeric() and
# is.character() functions, like this:
is.numeric("Robert")
is.character("Robert")
is.numeric(name1)
is.character(name1)
# Compare the results of the above four commands to the results
# of running the following:
is.numeric(6)
is.character(6)
is.numeric("6")
is.character("6")
# You can also make vectors of several character strings,
# like this:
somenames = c("Robert", "Jonathan", "Cynthia", "Elizabeth")
somenicknames = c("Rob", "Jon", "Cindy", "Lisa")
# The first of these two lines makes a vector of the
# character strings that are the spelled forms of these four
# names and assigns it to the variable "somenames". The
# second line does a similar thing for the associated
# nicknames.
# If you have assigned each of the four names
# to variables, as shown above, then the first line is
# equivalent to the following:
somenames = c(name1, name2, name3, name4)
# Also, once you have defined these vector variables, you can
# tell R to count the number of names or nicknames, like
# this:
length(somenames)
length(somenicknames)
# You can also check to see if the vector contains numbers or
# character strings, like this:
is.numeric(somenames)
is.character(somenames)

#### Counting the number of characters
# You can use the nchar() function to count the number of
# characters in a character string, like this:
nchar("Cynthia")
nchar(name3)
nchar("Cindy")
# You can confirm that the result of using this function is
# a number by embedding it in the is.numeric() command, like
# this:
is.numeric(nchar("Cynthia"))
is.numeric(nchar(name3))
# Also, once you have defined the two vectors of names and
# nicknames above, you can also refer to the items by
# their positions in the vector, so the following two commands
# will give you the same result, specifically the number of
# letters in the name Cynthia.
nchar("Cynthia")
nchar(somenames[3])
# and the following three commands will all give you the same
# result – i.e., the difference in number of letters between
# the name Cynthia and the nickname Cindy:
7 - 5
nchar("Cynthia") - nchar("Cindy")
nchar(somenames[3]) - nchar(somenicknames[3])
# The nchar() function can take a vector of character strings
# as its argument, in which case it returns a vector of
# numbers. Try typing the following commands, to see this.
nchar(somenames)
nchar(somenicknames)
# You can also combine commands in more complicated ways. For
# example, try typing the following command.
nchar(somenames) - nchar(somenicknames)
# This gives you a vector of four numbers that are the values of
# the difference in length for each of the four pairs of full
# name and nickname.
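# The pieces introduced so far can be combined. The following
# is a minimal sketch that stores the vector of differences
# under a name so that it can be reused; the variable name
# "lengthdiffs" is made up for this illustration.
somenames = c("Robert", "Jonathan", "Cynthia", "Elizabeth")
somenicknames = c("Rob", "Jon", "Cindy", "Lisa")
lengthdiffs = nchar(somenames) - nchar(somenicknames)
lengthdiffs # Returns the four differences: 3 5 2 5
lengthdiffs[3] # The difference for Cynthia and Cindy: 2
sum(lengthdiffs) / length(lengthdiffs) # The average difference: 3.75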

#### Data frames
# Sometimes a dataset is a simple vector of values, like the
# vector somenicknames here. It is far more common, however, for
# datasets to be two-dimensional combinations of vectors, in
# which the items at the same position in the different
# vectors are associated. For example, the first nickname in
# somenicknames is associated with the first name in somenames,
# and so on. The data file nicknames.txt contains a much
# larger number of such pairs of names and nicknames, copied
# from the web page
# http://www.censusdiggins.com/nicknames.htm
# and edited so that each row of the file
# contains a given name and one of its associated nicknames.
# (There is also a third column, specifying whether the name
# is from the list of "male names" on the left of this page
# or from the list of "female names" on the right.)
# In R, such a two-dimensional combination is called a
# dataframe. If you have set the working directory to the
# directory where you stored the nicknames.txt file, you can
# use the function read.delim() to read the data into this
# kind of two-dimensional tabular structure, like this:
nicknames = read.delim("nicknames.txt", as.is=TRUE)
# This reads in the content of the file and assigns it to the
# variable nicknames. The first argument is the name of the
# file that you want to read in. Note that the file name is a
# character string and needs to be enclosed in quotation
# marks. The second argument (i.e., “as.is=TRUE”) tells R to
# keep the format of the original vectors, rather than
# converting character strings into factors. That is, here, we
# want the first column to be a vector of character strings
# and we also want this to be true for the second column.
# You can look at the content of the dataframe in various
# ways. For example, the following command returns the first
# six rows of the dataframe.
head(nicknames)
# And the following command lets you open the dataframe in a
# spreadsheet-like data editor, in a format that will be
# familiar if you’ve used programs such as Excel.
edit(nicknames)

#### Retrieving values at specific positions in a data frame
# Finally, you can look at values at particular positions in
# the dataframe by specifying their row numbers or their
# column numbers or both. You specify these positions by
# enclosing the numbers in square brackets, just as you did
# for the positions in a vector. But since a dataframe has
# two dimensions, now there can be two numbers, separated by a
# comma, for the row position and the column position. For
# example, the following command returns the value in the
# second row and the first column of the dataframe -- i.e.,
# the name for the second row:
nicknames[2,1]
# You can leave out the column number to return all of the
# values in a particular row. For example, the following
# command returns the values from all three columns (name,
# nickname, and gender) for the second row of the dataframe:
nicknames[2,]
# Since any row holds one value from each of the
# columns of the dataframe, you can ask R to tell you how
# many columns there are, like this:
length(nicknames[2,])
# Similarly, you can leave out the row number. For example,
# the following command returns all of the values from the
# second column (the nicknames column) for all rows of the
# dataframe:
nicknames[,2]
# Since this column is a vector, too, you can ask R to tell
# you the total number of rows in the dataframe, like this:
length(nicknames[,2])
# There is also a function that returns a vector with the number
# of rows followed by the number of columns. That’s the dim()
# function, which takes the name of the dataframe as its argument,
# like this:
dim(nicknames)
# Also, now that you know how to return a row vector or a column
# vector, you can calculate the difference in length using the
# same command that you used for the shorter vectors, like this:
nchar(nicknames[,1]) - nchar(nicknames[,2])
# This gives you the difference in length for all 1271 pairs
# of full name and nickname.
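# The same indexing can be tried without the data file. The
# following is a minimal sketch that builds a small stand-in
# dataframe by hand with the data.frame() function. The
# variable name "mini" and the column names are made up for
# this illustration and are not taken from nicknames.txt. The
# stringsAsFactors=FALSE argument is used here on the
# assumption that, like as.is=TRUE above, we want the columns
# kept as character strings.
mini = data.frame(name = c("Robert", "Jonathan", "Cynthia", "Elizabeth"),
                  nickname = c("Rob", "Jon", "Cindy", "Lisa"),
                  stringsAsFactors = FALSE)
dim(mini) # Returns 4 2 (four rows, two columns).
mini[3,1] # The name in the third row: "Cynthia"
mini[3,] # Both values in the third row.
nchar(mini[,1]) - nchar(mini[,2]) # Returns 3 5 2 5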