# Coding factors with numerical levels instead of character ones

## Problem:

Often, the categorical variables in a data set we have read in have levels denoted by character strings (e.g. "No" and "Yes"). Sometimes we want these same categorical variables to have numerically denoted levels instead (e.g. 0 and 1).

## Data specifics:

• The data file you have read in contains categorical variables whose levels are denoted with character values, not numerical ones.

## "Solution":

First, let's read in a dummy data file to illustrate the problem ("file.txt" is attached at the bottom of this page):

# Read in data file with column assigned No/Yes
x
#    id weight smoker
# 1   1    120     No
# 2   2    125    Yes
# 3   3    130    Yes
# 4   4    135     No
# 5   5    140     No
# 6   6    145    Yes
# 7   7    150    Yes
# 8   8    155     No
# 9   9    160     No
# 10 10    165    Yes
# 11 11    170    Yes
# 12 12    175     No
# 13 13    180     No
# 14 14    185    Yes
# 15 15    190    Yes
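If file.txt is not at hand, the same data frame can be rebuilt by hand. This is only a sketch that copies the values from the printout above. It sets stringsAsFactors = TRUE explicitly, which is our assumption about how the file was read, so that smoker is a factor in any R version:

```r
# Rebuild the example data frame from the values in the printout above.
# stringsAsFactors = TRUE makes `smoker` a factor regardless of R version.
x <- data.frame(
  id     = 1:15,
  weight = seq(120, 190, by = 5),
  smoker = c("No", "Yes", "Yes", "No", "No", "Yes", "Yes", "No",
             "No", "Yes", "Yes", "No", "No", "Yes", "Yes"),
  stringsAsFactors = TRUE
)

# Inspect the column types: smoker should be a factor with 2 levels.
str(x)
```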

If we look at the class of the smoker column, we see that it is a "factor".

class(x$smoker)
# [1] "factor"

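When a vector of character strings is included as a column of a data frame, R versions before 4.0.0 turn the vector into a factor by default. Since R 4.0.0 the default is stringsAsFactors = FALSE, so the conversion has to be requested explicitly (for example with stringsAsFactors = TRUE when reading the file, or with factor() afterwards).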
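A minimal sketch of requesting the conversion explicitly, so that the result does not depend on the R version in use. The names smoker_chr, d_default and d_factor are made up for this illustration:

```r
smoker_chr <- c("No", "Yes", "Yes", "No")

# Version-dependent default: the column may be character or factor.
d_default <- data.frame(smoker = smoker_chr)

# Ask for the conversion explicitly when building the data frame.
d_factor <- data.frame(smoker = smoker_chr, stringsAsFactors = TRUE)
class(d_factor$smoker)
# [1] "factor"

# Or convert an existing column after the fact.
d_default$smoker <- factor(d_default$smoker)
class(d_default$smoker)
# [1] "factor"
```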

There are a few ways we can extract the codes 1, 2, ... from the categorical variable.

The easiest way to extract the codes 1, 2, ... is to use the as.numeric() function:

as.numeric(x$smoker)
#  [1] 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2

The unclass() function returns a copy of its argument with the class attribute removed. So, if we "unclass" the smoker column, we get the same codes as from as.numeric(), with the levels attribute still attached:

unclass(x$smoker)
#  [1] 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2
# attr(,"levels")
# [1] "No"  "Yes"

As the output shows, a factor has a levels attribute that holds the level names.
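To make the link between codes and level names concrete, here is a small sketch on a short stand-alone factor (the vector smoker below is illustrative, not the column from file.txt):

```r
smoker <- factor(c("No", "Yes", "Yes", "No"))

# The level names, sorted alphabetically by default.
levels(smoker)
# [1] "No"  "Yes"

# The integer codes: "No" is code 1 and "Yes" is code 2.
codes <- as.integer(smoker)
codes
# [1] 1 2 2 1

# Indexing the levels by the codes recovers the original labels.
levels(smoker)[codes]
# [1] "No"  "Yes" "Yes" "No"
```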

We can do arithmetic on the as.numeric() or unclass() output to get the numeric codes we want, such as 0, 1, ... instead of 1, 2, ...:

as.numeric(x$smoker) - 1
#  [1] 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
unclass(x$smoker) - 1
#  [1] 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
# attr(,"levels")
# [1] "No"  "Yes"
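Subtracting 1 relies on "No" being the first level. As an alternative sketch (not the approach used on this page), we can compare against the level we want coded as 1, which gives the same 0/1 result whatever the level order is. The short factor below is illustrative:

```r
smoker <- factor(c("No", "Yes", "Yes", "No"))

# The comparison yields TRUE/FALSE, which as.integer() turns into 1/0.
smoker01 <- as.integer(smoker == "Yes")
smoker01
# [1] 0 1 1 0

# Matches the subtraction approach when "No" is the first level.
identical(smoker01, as.integer(smoker) - 1L)
# [1] TRUE
```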

So, if we wanted to, we could define a new column in our data frame containing the numeric codes extracted from our factor variable:

# Define a new column which is assigned 0/1
x$smoker2 <- as.numeric(x$smoker) - 1
x
#    id weight smoker smoker2
# 1   1    120     No       0
# 2   2    125    Yes       1
# 3   3    130    Yes       1
# 4   4    135     No       0
# 5   5    140     No       0
# 6   6    145    Yes       1
# 7   7    150    Yes       1
# 8   8    155     No       0
# 9   9    160     No       0
# 10 10    165    Yes       1
# 11 11    170    Yes       1
# 12 12    175     No       0
# 13 13    180     No       0
# 14 14    185    Yes       1
# 15 15    190    Yes       1
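The codes follow the order of the levels, which is alphabetical unless we say otherwise. With more than two levels that order may not be the one we want. The sketch below uses a hypothetical three-level variable dose (not part of file.txt) to show how the levels argument of factor() controls which code each label receives:

```r
# Hypothetical three-level variable, for illustration only.
dose <- c("Low", "High", "Medium", "Low")

# Default alphabetical levels: High = 1, Low = 2, Medium = 3.
as.integer(factor(dose))
# [1] 2 1 3 2

# Explicit level order: Low = 1, Medium = 2, High = 3.
dose_f <- factor(dose, levels = c("Low", "Medium", "High"))

# Shift to 0-based codes, as was done for smoker2.
dose_code <- as.integer(dose_f) - 1L
dose_code
# [1] 0 2 1 0
```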

For more information on factors and their potentially surprising characteristics, see the following book:
• Data Analysis and Graphics Using R by John Maindonald and John Braun