Coding factors with numerical levels instead of character ones

Problem:

Often, the categorical variables in a data set we have read in have levels denoted by character strings (e.g. `"No"` and `"Yes"`). Sometimes we want these same categorical variables to have numerically denoted levels instead (e.g. `0` and `1`).

Data specifics:

• The data file you have read in contains categorical variables whose levels are denoted with character values, not numerical ones.

Solution:

First let's read in a dummy data file to illustrate the problem (`"file.txt"` is attached at the bottom of this page):

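```
# Read in data file with column assigned No/Yes
# Assumption: "file.txt" is whitespace-delimited with a header row
x <- read.table("file.txt", header = TRUE, stringsAsFactors = TRUE)
x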
#    id weight smoker
# 1   1    120     No
# 2   2    125    Yes
# 3   3    130    Yes
# 4   4    135     No
# 5   5    140     No
# 6   6    145    Yes
# 7   7    150    Yes
# 8   8    155     No
# 9   9    160     No
# 10 10    165    Yes
# 11 11    170    Yes
# 12 12    175     No
# 13 13    180     No
# 14 14    185    Yes
# 15 15    190    Yes
```

If we look at the class of the `smoker` column, we see that it is a `"factor"`.

```
class(x$smoker)
# [1] "factor"
```

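When a vector of character strings is included as a column of a data frame, `R` versions before 4.0.0 turn the vector into a factor by default. From `R` 4.0.0 onward the default is to keep the strings as characters, so the conversion has to be requested with `stringsAsFactors = TRUE` (or by calling `factor()` on the column).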
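To see the conversion without the data file, the sketch below builds a small data frame by hand and compares the class of the column with and without `stringsAsFactors = TRUE`. The names `d_chr` and `d_fac` and the three rows are illustrative stand-ins, not part of `file.txt`.

```r
# Hypothetical stand-in for the first three rows of the data shown above
d_chr <- data.frame(id = 1:3,
                    smoker = c("No", "Yes", "Yes"),
                    stringsAsFactors = FALSE)
d_fac <- data.frame(id = 1:3,
                    smoker = c("No", "Yes", "Yes"),
                    stringsAsFactors = TRUE)

class(d_chr$smoker)  # character strings kept as they are
class(d_fac$smoker)  # character strings converted to a factor
```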

There are a few ways we can extract the codes 1, 2, ... from the categorical variable.

The easiest way to extract the codes 1, 2, ... is to use the `as.numeric()` function:

```
as.numeric(x$smoker)
#  [1] 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2
```

The `unclass()` function returns a copy of its argument with the class attribute removed. So, if we "unclass" the `smoker` column, we get the same codes as from `as.numeric()`, with the `levels` attribute still attached:

```
unclass(x$smoker)
#  [1] 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2
# attr(,"levels")
# [1] "No"  "Yes"
```

As seen, factors have an attribute `levels` which holds the level names.
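The correspondence between level names and codes can also be inspected directly. A minimal sketch, using a hand-built factor `f` as a hypothetical stand-in for `x$smoker`:

```r
# Hypothetical stand-in for x$smoker
f <- factor(c("No", "Yes", "Yes", "No"))

levels(f)                 # level names, sorted by default
as.integer(f)             # integer codes: position of each value in levels(f)
levels(f)[as.integer(f)]  # indexing the levels by the codes recovers the labels
```

Each code is simply the position of the value's label within `levels(f)`, which is why the codes start at 1.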

We can manipulate this `as.numeric()` or `unclass()` output in order to extract the numeric codes we want, such as 0, 1, ... instead of 1, 2, ...:

```
as.numeric(x$smoker) - 1
#  [1] 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
unclass(x$smoker) - 1
#  [1] 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
# attr(,"levels")
# [1] "No"  "Yes"
```
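Subtracting 1 gives `0` for `"No"` and `1` for `"Yes"` here because `"No"` happens to be the first level. When the coding should not depend on the order of the levels, one option is to compare against the level of interest explicitly. A sketch, with `f` and `f_rev` as hypothetical stand-ins for the `smoker` column:

```r
f <- factor(c("No", "Yes", "Yes", "No"))

# 1 where the value is "Yes", 0 otherwise, regardless of level order
as.integer(f == "Yes")

# Same values, but with the levels declared in the opposite order
f_rev <- factor(c("No", "Yes", "Yes", "No"), levels = c("Yes", "No"))
as.integer(f_rev == "Yes")  # unchanged: still 1 for "Yes"
as.integer(f_rev) - 1       # flipped: "Yes" is now coded 0 and "No" is coded 1
```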

So, if we wanted to, we could define a new column in our data frame that contains the numerical codes extracted from our factor variable:

```
# Define a new column which is assigned 0/1
x$smoker2 <- as.numeric(x$smoker) - 1
x
#    id weight smoker smoker2
# 1   1    120     No       0
# 2   2    125    Yes       1
# 3   3    130    Yes       1
# 4   4    135     No       0
# 5   5    140     No       0
# 6   6    145    Yes       1
# 7   7    150    Yes       1
# 8   8    155     No       0
# 9   9    160     No       0
# 10 10    165    Yes       1
# 11 11    170    Yes       1
# 12 12    175     No       0
# 13 13    180     No       0
# 14 14    185    Yes       1
# 15 15    190    Yes       1
```
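A quick way to confirm that the new column lines up with the original factor is to cross-tabulate the two. The sketch below does this on a small hand-built data frame `d` (a hypothetical stand-in for `x`):

```r
# Hypothetical stand-in for the data frame x
d <- data.frame(smoker = factor(c("No", "Yes", "Yes", "No")))
d$smoker2 <- as.numeric(d$smoker) - 1

# Every "No" should fall in column 0 and every "Yes" in column 1
tab <- table(d$smoker, d$smoker2)
tab
```

Non-zero counts should appear only in the `No`/`0` and `Yes`/`1` cells.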

For more information on factors and their potentially surprising characteristics, see the following book:
• Data Analysis and Graphics Using R by John Maindonald and John Braun

Acknowledgements:

I would like to thank Richard Urbano for posing this problem.
