Data description

Starting R for the first time

The recommended way to use R is to create a new, empty directory (named something like “Digging Numbers”) and start R from the command line in that directory. This way, the data and command history are saved for that workspace only. Here is an example for UNIX-like operating systems.

$ mkdir diggingnumbers # just the first time 
$ cd diggingnumbers
$ R

After that, your R session is active in the current directory. You can always check the directory you are working in with the command

> getwd()
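If R was started somewhere else, the working directory can also be changed from inside the session. The lines below are a minimal sketch of that idea; the temporary path is only a stand-in for the diggingnumbers directory, chosen so that the example runs anywhere.

```r
# Stand-in for the "diggingnumbers" directory (hypothetical path under tempdir()).
workdir <- file.path(tempdir(), "diggingnumbers")
dir.create(workdir, showWarnings = FALSE)  # plays the role of mkdir

old <- setwd(workdir)   # plays the role of cd; returns the previous directory
getwd()                 # now reports the new working directory
setwd(old)              # go back to where the session started
```

Note that setwd() only affects the running session; starting R from the directory, as shown above, is still the simplest way to keep one workspace per project.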

Importing data

To read data from the raw data file into R:

> spearheads <- read.csv("spearheads.csv", header=TRUE)

From that moment on you can access the dataset with the data frame object named spearheads.

You can save time and typing by entering

> attach(spearheads)

every time you start a new session in that workspace. This lets you refer to variables directly, for example Maxle instead of spearheads$Maxle.
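The difference between the two ways of referring to a variable can be tried out on a small example. This is only a sketch: spearheads_demo is a made-up data frame that borrows two column names from the tutorial, and its values are invented.

```r
# Toy data frame; column names borrowed from the tutorial, values invented.
spearheads_demo <- data.frame(Num = 1:3, Maxle = c(12.4, 20.1, 15.3))

mean(spearheads_demo$Maxle)         # explicit form, works without attach()

attach(spearheads_demo)             # put the columns on the search path
mean(Maxle)                         # same value, shorter to type
detach(spearheads_demo)             # undo attach() when you are done

with(spearheads_demo, mean(Maxle))  # one-off alternative that needs no attach()
```

The with() form is handy when you only need the short names for a single command.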

Once you have read in a dataset, you can check the names of its variables with the names() function:

> names(spearheads)

This displays the column names of the table. It is also a handy way to check the capitalization and spelling of the field (column) names, since R is case sensitive and a missing or added capital in a field name will result in an error.
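A quick way to test a name before using it is to look it up in the output of names(). The sketch below uses a made-up data frame (spearheads_demo, invented values) rather than the real dataset.

```r
# Toy data frame; column names borrowed from the tutorial, values invented.
spearheads_demo <- data.frame(Num = 1:3, Maxle = c(12.4, 20.1, 15.3))

"Maxle" %in% names(spearheads_demo)   # TRUE: exact spelling and case
"maxle" %in% names(spearheads_demo)   # FALSE: the capital is missing

# With the $ form a wrong name gives NULL instead of an error,
# so a check like this can catch a mistake early.
is.null(spearheads_demo$maxle)        # TRUE
```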

For more information about the data set, enter:

> str(spearheads)

This displays a more detailed listing of the data, as follows:

'data.frame':   40 obs. of  14 variables:
 $ Num   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Mat   : int  2 2 2 2 2 2 2 2 2 2 ...
 $ Con   : int  3 3 3 3 3 3 3 2 2 1 ...
 $ Loo   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Peg   : int  2 2 2 NA 1 2 2 2 2 2 ...
 ... etc.

The output shows, first, that the data is stored in memory as a data frame. It also tells you that there are 40 records (observations) of 14 variables. For each variable, the output then lists the name, the data type, and the first few values stored after import. This is particularly important because some of the variables listed as “int” are not really numerical data. Material type (Mat), for example, is categorical data that has been entered as a numeric code. For some commonly used statistical routines, R needs to be told that such a variable contains the levels of a factor (a categorical variable); otherwise it can yield nonsensical results. There is no point, for example, in asking for the average value of Mat.
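As an illustration of the point about coded categories, the sketch below converts an integer code into a factor with factor(). The data frame is a toy one: the codes are invented and no meaning is assumed for them.

```r
# Toy data frame; the codes are invented and their meaning is not assumed.
spearheads_demo <- data.frame(Num = 1:6, Mat = c(2L, 2L, 1L, 2L, 1L, 2L))

mean(spearheads_demo$Mat)    # R computes it, but the number is meaningless

# Tell R that Mat holds category codes rather than quantities.
spearheads_demo$Mat <- factor(spearheads_demo$Mat)

str(spearheads_demo$Mat)     # now reported as a Factor with 2 levels
table(spearheads_demo$Mat)   # counts per category are meaningful
```

Once Mat is a factor, mean() on it returns NA with a warning, which is a useful safeguard.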

A note about importing data from external sources

Especially when you are importing files that you did not produce yourself, always inspect text-format data with a text editor (e.g. vi, emacs, gedit, WordPad). Don't make assumptions based on the file extension (such as “.csv”); look at the data first. This is good practice for any user of external data.
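The same kind of inspection can be done without leaving R, by reading the first few lines as plain text with readLines(). This is a sketch only: demo_file is a temporary stand-in file with invented contents, written here so that the example runs anywhere.

```r
# Write a small stand-in file; its contents are invented for illustration.
demo_file <- tempfile(fileext = ".csv")
writeLines(c("Num;Maxle", "1;12,4", "2;20,1"), demo_file)

# Look at the raw text before trusting the ".csv" extension.
first_lines <- readLines(demo_file, n = 3)
print(first_lines)

# In this file the fields are separated by semicolons, not commas.
grepl(";", first_lines[1], fixed = TRUE)
```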

Files produced in a different country may use different locale conventions, in particular for the decimal separator (comma vs. point). By default, read.csv() expects a comma as the field separator and a point as the decimal separator. If your file doesn't load correctly, inspect it and use the sep (field separator) and dec (decimal separator) options of read.csv().
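As a sketch of how sep and dec are used, the lines below read a small semicolon-separated file with decimal commas. The file is a temporary stand-in with invented contents, not the spearheads data.

```r
# Stand-in file using ";" between fields and "," as the decimal mark.
demo_file <- tempfile(fileext = ".csv")
writeLines(c("Num;Maxle", "1;12,4", "2;20,1"), demo_file)

# State both separators explicitly.
right <- read.csv(demo_file, header = TRUE, sep = ";", dec = ",")
str(right)   # Maxle is read as a number (12.4, 20.1)

# read.csv2() uses these same separators as its defaults.
same <- read.csv2(demo_file, header = TRUE)
identical(right, same)
```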

Quitting R

When you are done with your first tutorial, quit the R session with the q() command, and answer y to the Save workspace image question.

> q()
Save workspace image? [y/n/c]:

This keeps all the variables you created available for your next session in this directory.

<note tip>To check that the R data was actually saved in that directory, run ls -a after quitting R: you should find the two files .RData and .Rhistory.</note>
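What saving and restoring amounts to can be tried on a small scale with save() and load(). This is a sketch under stated assumptions: session_note is a hypothetical object, and a temporary file is used instead of the .RData file that q() writes.

```r
# Hypothetical object standing in for anything created during a session.
session_note <- "first tutorial done"

# Save the named object to a chosen file (q() uses .RData in the
# working directory for the whole workspace).
image_file <- file.path(tempdir(), "example.RData")
save(session_note, file = image_file)

rm(session_note)               # simulate a fresh session
restored <- load(image_file)   # load() returns the names it restored
print(restored)
print(session_note)
```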


diggingnumbers/data_description.txt · Last modified: 2012/11/30 09:58 (external edit)