R Internals

This page is about how R stores objects internally. I've had to re-figure this out several times, so I thought a TWiki page about this might be prudent.

SEXP

All R objects are accessible in C via the SEXP type. SEXP is just a pointer defined by a typedef, and you can access the actual data inside an SEXP object by using macros which call other macros which use some typedefs of macros and other typedefs, etc. (basically, a real pain to understand). Here's an attempted trace of what SEXP actually is:

SEXP is a pointer to an SEXPREC object:

typedef struct SEXPREC *SEXP;

SEXPREC is a struct:

typedef struct SEXPREC {
    SEXPREC_HEADER;
    union {
    struct primsxp_struct primsxp;
    struct symsxp_struct symsxp;
    struct listsxp_struct listsxp;
    struct envsxp_struct envsxp;
    struct closxp_struct closxp;
    struct promsxp_struct promsxp;
    } u;
} SEXPREC, *SEXP;

SEXPREC_HEADER is a preprocessor-defined variable:
#define SEXPREC_HEADER  \
struct sxpinfo_struct sxpinfo; \
struct SEXPREC *attrib; \
struct SEXPREC *gengc_next_node, *gengc_prev_node

spxinfo_struct looks like this:

struct sxpinfo_struct {
    SEXPTYPE type      :  5;
    unsigned int obj   :  1;
    unsigned int named :  2;
    unsigned int gp    : 16;
    unsigned int mark  :  1;
    unsigned int debug :  1;
    unsigned int trace :  1;
    unsigned int fin   :  1;  /* has finalizer installed */
    unsigned int gcgen :  1;  /* old generation number */
    unsigned int gccls :  3;  /* node class */
}; /*           Tot: 32 */

SEXPTYPE is an integer that defines what this object is (list, string vector, numeric vector, etc.). The other variables in sxpinfo_struct seem to be flags of some kind.

The union in SEXPREC (called u) can contain several structs, which I assume depend on what type the R object happens to be. Here they are for reference:

struct primsxp_struct {
    int offset;
};

struct symsxp_struct {
    struct SEXPREC *pname;
    struct SEXPREC *value;
    struct SEXPREC *internal;
};

struct listsxp_struct {
    struct SEXPREC *carval;
    struct SEXPREC *cdrval;
    struct SEXPREC *tagval;
};

struct envsxp_struct {
    struct SEXPREC *frame;
    struct SEXPREC *enclos;
    struct SEXPREC *hashtab;
};

struct closxp_struct {
    struct SEXPREC *formals;
    struct SEXPREC *body;
    struct SEXPREC *env;
};

struct promsxp_struct {
    struct SEXPREC *value;
    struct SEXPREC *expr;
    struct SEXPREC *env;
};


DATAPTR

The DATAPTR macro is used to access the actual data of an SEXP object. Here's what I found by tracing DATAPTR.

DATAPTR is called from lots of macros such as REAL and INTEGER, where x is an SEXP object:

#define CHAR(x)     ((char *) DATAPTR(x))
#define LOGICAL(x)  ((int *) DATAPTR(x))
#define INTEGER(x)  ((int *) DATAPTR(x))
#define RAW(x)      ((Rbyte *) DATAPTR(x))
#define COMPLEX(x)  ((Rcomplex *) DATAPTR(x))
#define REAL(x)     ((double *) DATAPTR(x))

DATAPTR is also a preprocessor-defined macro:

#define DATAPTR(x)  (((SEXPREC_ALIGN *) (x)) + 1)

SEXPREC_ALIGN is a typedef'd union:

typedef union { VECTOR_SEXPREC s; double align; } SEXPREC_ALIGN;

VECTOR_SEXPREC is a struct that is a mini-version of SEXPREC for use with vectors:

typedef struct VECTOR_SEXPREC {
    SEXPREC_HEADER;
    struct vecsxp_struct vecsxp;
} VECTOR_SEXPREC, *VECSEXP;

Getting Data

So how do you get at the actual data? It depends on the type of R object you're dealing with.

Reals and Integers

double *myArray = REAL(some_SEXP_object);

which is equivalent to:

double *myArray = ((double *) (((SEXPREC_ALIGN *) (some_SEXP_object)) + 1))

What this code does is take an SEXP object, which is really a pointer to an SEXPREC struct, and cast it as an SEXPREC_ALIGN pointer. SEXPREC_ALIGN is either another pointer or a double. So what I think is happening here, is that by using pointer arithmetic, you magically get a pointer of type double that points to the actual data you're trying to access. I don't completely understand this yet, but I'm working on it.


Memory Allocation

The information above might lead to the question, "How does R allocate memory for its objects?" Maybe the answer will also answer some questions about how to handle SEXP objects.

R_alloc

When you create an object in R, a function named R_alloc that lives in src/main/memory.c is called. R_alloc is a small function that looks like this:

char *R_alloc(long nelem, int eltsize)
{
    R_size_t size = nelem * eltsize;
    double dsize = nelem * eltsize;
    if (dsize > 0) { /* precaution against integer overflow */
    SEXP s;
#if SIZEOF_LONG > 4
    if(dsize < R_LEN_T_MAX)
        s = allocString(size); /**** avoid extra null byte?? */
    else if(dsize < sizeof(double) * (R_LEN_T_MAX - 1))
        s = allocVector(REALSXP, (int)(0.99+dsize/sizeof(double)));
    else {
        s = R_NilValue; /* -Wall */
        error(_("cannot allocate memory block of size %.0f"), dsize);
    }
#else
    if(dsize > R_LEN_T_MAX)
        error(_("cannot allocate memory block of size %.0f"), dsize);
    s = allocString(size); /**** avoid extra null byte?? */
#endif
    ATTRIB(s) = R_VStack;
    R_VStack = s;
#if VALGRIND_LEVEL > 0
    VALGRIND_MAKE_WRITABLE(CHAR(s), (int) dsize);
#endif
    return CHAR(s);
    }
    else return NULL;
}

R_alloc either calls allocString or allocVector, but since allocString is nothing more than a wrapper around allocVector, allocVector seems to handle most object allocation requests.


allocVector

This is where the rubber meets the road, if you will. Its prototype looks like this:

SEXP allocVector(SEXPTYPE type, R_len_t length);

So you tell it what you want and how big you want it, and it spits out an allocated R object (as long as there's memory available and you don't exceed R's limits). First, there's a large switch statement that calculates how much memory space it needs based on what kind of object you want and how big you want it. It does this through some macros like BYTE2VEC and INT2VEC that are defined as such (where VECREC is a struct/union that helps with stack maintenance):

#define BYTE2VEC(n)   (((n)>0)?(((n)-1)/sizeof(VECREC)+1):0)
#define INT2VEC(n)   (((n)>0)?(((n)*sizeof(int)-1)/sizeof(VECREC)+1):0)

Next, R determines whether or not it needs to garbage collect (along with doing some other stuff related to garbage collection). If there isn't enough space, R will clean up objects that aren't in use anymore to make room for new objects. As far as I can tell there are two different ways to classify R objects: large and small. Objects deemed as "small" are stored as nodes on a stack designated for "small" objects, and objects deemed as "large" are stores as vectors on a different stack (vector stack) from the "small" objects. Both stacks are maintained independently.

From the code:
Node Classes. Non-vector nodes are of class zero. Small vector nodes are in classes 1, ..., NUM_SMALL_NODE_CLASSES, and large vector nodes are in class LARGE_NODE_CLASS. For vector nodes the node header is followed in memory by the vector data, offset from the header by SEXPREC_ALIGN.

The node header is VECREC, which looks like this (from src/include/Defn.h):

typedef struct {
    union {
        SEXP        backpointer;
        double      align;
    } u;
} VECREC, *VECP;

There is a lot more information in the source code comments about the heap structure, and it's pretty interesting stuff, although I won't go into detail about it here.


Garbage Collection

As I was stepping through allocVector I noticed some code dedicated to garbage collection, and I found some helpful comments at the top of src/main/memory.c that explain what's happening.

There are 3 levels of garbage collection in R. A level 0 collection deals with the "youngest generation". Level 1 collections occur after a certain number of level 0 collections (i.e. 20), and level 2 collections occur after a certain number of level 1 collections (i.e. 5). So level 2 collections happen after every 100 level 0 collections. If a level N collection fails to free a certain amount of space, the next collection will be level N + 1.

There is a maximum and minimum size for the two object stacks (nodes and vectors), and these stacks grow and shrink according to a 6 or so constants named R_NGrowFac, R_VGrowFac, R_NShrinkFac, etc.


Relevant files

Here is a list of R files I waded through to get this information (relative to an unpacked R source tree): The attached files are taken from the R-2.2.1 source tree.
Topic attachments
I Attachment Action Size Date Who Comment
Defn.hh Defn.h manage 29.5 K 13 Jan 2006 - 11:11 JeremyStephens Defn.h (found in src/include)
Rinternals.hh Rinternals.h manage 34.0 K 12 Jan 2006 - 17:11 JeremyStephens Rinternals.h (found in src/include)
memory.cc memory.c manage 72.6 K 12 Jan 2006 - 17:12 JeremyStephens memory.c (found in src/main)
Topic revision: r3 - 13 Jan 2006, JeremyStephens
 

This site is powered by FoswikiCopyright © 2013-2017 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Vanderbilt Biostatistics Wiki? Send feedback