Reweighted Mahalanobis Distance Matching in Observational Studies and Randomized Trials

Matched Randomization R Package nbpMatching

The current version of the R package nbpMatching is available on CRAN. Active development occurs on R-Forge and GitHub, with stable versions pushed to CRAN periodically. Installing the current version of nbpMatching may require updating R.

# installing from CRAN
install.packages( "nbpMatching" ) 

# installing from R-Forge
install.packages("nbpMatching", repos="")

# installing from GitHub

# check your version
library( nbpMatching )
library( help='nbpMatching' )

nbpMatching Tutorials

The first tutorial focuses on using nbpMatching for an observational study. We recommend doing this tutorial first even if you are planning a randomized trial, because it teaches several important features of the nbpMatching package.

This tutorial shows how to use nbpMatching to create matched triplets for a randomized trial.

This tutorial shows how to use the fill.missing() function to impute missing values and create missingness indicator variables.

What is that "Note: Distances scaled" warning all about?

What's New (version history)

  • The latest improvements and source code may be viewed at R-Forge.

Known Bugs and Desired Features

Note: many of the bugs listed below have since been fixed, and many of the desired features have been added. We will cull this list in the future.
  1. Create bib so citation('nbpMatching') works. In the meantime, the standard format is encouraged, e.g.
    Beck C, Lu B, Greevy R. (2015). nbpMatching: Functions for Optimal Non-Bipartite Matching. R package version 1.4.5.
  2. Documentation for nbpMatching needs to be updated and should point to this page and include the PDS paper reference. Tutorial examples on this page should be expanded.
  3. The qom() function calculates the quantiles of the absolute mean differences for two treatment arms over the randomization space. It offers the option of including or excluding subjects matched to phantoms. The calculations when including subjects matched to phantoms are slightly off.
  4. Modify qom() to calculate the AMD_100 by placing the max of each pair in group A and the min of each pair in group B, and calculating the AMD on that worst case; as opposed to taking the max of the sampled randomization set which will likely miss that worst case.
  5. It would be nice for the quality of matches function, qom(), to return the standard deviation of the AMDs and to take the number of simulations to run as an input, e.g. someone could input 100,000 sims if 10,000 sims were not enough for the needed level of precision.
  6. Consider adding the SEs for the AMDs to the automatically generated benchmarking balance tables. This could get cluttered, but maybe a max SE for each variable could be included as a footnote, excluding the SEs for AMD_0 and AMD_100. The command hdquantile in Hmisc is pretty fast and will give the quantile with its SE.
  7. Allow individual weight specification of missingness indicators. If length(missing.weight)==1, use the same weight for all missingness indicators. If length(missing.weight) == the number of missingness indicators, use the respective weights. Else, warn the user that "the number of elements in missing.weight does not equal the number of variables with missingness".
  8. We want to revisit how smoothly the package and webapp handle perfectly collinear variables. This is most likely to occur in the generated missingness indicators, e.g. systolic.missing and diastolic.missing are likely to be perfectly collinear.
  9. Look at how fill.missing() handles the id column. Allow fill.missing() to take idcol=# as an option so it can be more easily used independently from gendistance().
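
The proposals in items 4 and 5 can be illustrated with a small base-R sketch. This is not the code behind qom(); the helper name amd_sketch, its arguments, and the toy values are hypothetical and chosen only for illustration. The sketch assumes a single covariate and that every unit belongs to exactly one pair (no phantoms).

```r
# Hypothetical helper, not part of nbpMatching: summarises the absolute mean
# difference (AMD) of one covariate over random within-pair assignments.
# x1, x2: covariate values for the first and second member of each pair.
amd_sketch <- function(x1, x2, nsim = 10000, probs = c(0.5, 0.95), seed = 101) {
  stopifnot(length(x1) == length(x2), nsim >= 2)
  set.seed(seed)
  d <- x1 - x2                       # within-pair differences
  n <- length(d)
  # Each simulation flips a fair coin per pair: +1 keeps the first member in
  # group A, -1 swaps it. mean(A) - mean(B) then equals mean(sign * d).
  signs <- matrix(sample(c(-1, 1), n * nsim, replace = TRUE), nrow = n)
  amd <- abs(colMeans(signs * d))
  list(
    quantiles   = quantile(amd, probs),
    sd          = sd(amd),            # spread of the simulated AMDs (item 5)
    sampled_max = max(amd),           # what sampling alone would report
    worst_case  = mean(abs(d))        # max of each pair in A, min in B (item 4)
  )
}

# Toy values for five pairs
x1 <- c(58.3, 61.1, 61.8, 62.5, 62.7)
x2 <- c(61.6, 62.3, 62.6, 62.9, 60.0)
res <- amd_sketch(x1, x2, nsim = 2000)
```

Because the worst case is computed directly from the within-pair differences, it does not depend on nsim, whereas sampled_max can only approach it from below as the number of simulations grows.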

Matched Randomization Web Application

The web application allows users to upload a dataset of covariates on which to match (in csv format) and creates the set of optimally matched pairs that minimizes the average reweighted Mahalanobis distance between pairs. Users may choose the weights for each covariate, may select covariates to be transformed to ranks, may prevent certain matches from forming, and may select a number of units to optimally discard. If the dataset contains missing values, users may control whether to match on imputed values, match on missingness patterns, or a weighted combination of the two. Optionally, users may directly upload a distance matrix on which to match.

The web application provides links to download the generated distance matrix and both a full and a reduced table of the optimal matches. It also provides links to assess the quality of the matching, if it is being used for a randomized trial, and to perform the randomization within pairs. When randomizing, the application assigns treatments "A" and "B" and allows the user to specify a randomization seed for reproducibility.

WebApp Tutorial

Uploading Your Covariate Matrix

Unless you have already created your own distance matrix, you begin by uploading a dataset with your covariates. The data need to be in a comma-separated file, e.g. ClusterRandomizedExample2.csv. With the exception of the variable names and ID column values, all values should be numeric. Any non-numeric value, including "NA", ".", and a blank "", will be treated as missing. Categorical variables should be broken into indicators. For example, a location variable with four levels (Northern, Southern, Western, MidWestern) should be made into three indicator variables (Southern, Western, MidWestern; Northern is the referent).
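
A minimal base-R sketch of preparing such a file before upload follows. The inline CSV text and its column names (SiteID, Location, and so on) are made up for illustration and are not part of the example dataset; only read.csv() and standard base functions are used.

```r
# Toy CSV text standing in for a file on disk; column names are hypothetical.
csv_text <- "SiteID,AgeMean,LDLMean,Location
1,58.3,117.5,Northern
2,NA,114.3,Southern
3,61.1,.,Western
4,61.6,,MidWestern
5,61.8,103.4,Southern"

# Read "NA", "." and blank fields as missing, mirroring how the web
# application is described as treating them.
dat <- read.csv(text = csv_text, na.strings = c("NA", ".", ""),
                stringsAsFactors = FALSE)

# Expand the four-level categorical variable into three 0/1 indicators,
# leaving Northern as the referent level.
lv <- c("Northern", "Southern", "Western", "MidWestern")
for (l in lv[-1]) dat[[l]] <- as.integer(dat$Location == l)
dat$Location <- NULL

# Every remaining column except the ID should now be numeric.
stopifnot(all(sapply(dat[-1], is.numeric)))
```

The resulting data frame could then be written out with write.csv(dat, "prepared.csv", row.names = FALSE), where the file name is again only a placeholder.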

The example dataset ClusterRandomizedExample2.csv can be opened with any statistical software, spreadsheet program, or text editor. The first ten rows look like the following.
SiteStudyID AgeMean PercentPosCVDHistory LDLMean A1cMean A1c90thPercentile PercentOnStatins PercentSulfonylureaUsers PercentAfricanAmerican NumberOfPatients
1 58.3 20.8 117.5 7.6 9.9 58.5 35.8 34.6 1421
2 NA 19.8 114.3 7.5 9.7 47.7 25.1 13.7 1975
3 61.1 18.1 106.3 7.3 9.2 47.2 23.9 4.6 1371
4 61.6 27.6 109 7.6 10 52.4 44.5 28.7 1793
5 61.8 22.2 103.4 7.7 10.3 54.7 44.7 10.4 2218
6 62.3 34.6 NA 7.4 9.2 58.8 40.9 2 1793
7 62.5 27.1 115.5 6.9 8.3 56.5 31 21.6 667
8 62.6 24.5 100.8 7.3 8.9 53.8 28.2 2.2 649
9 62.7 28.2 98.5 7.8 10.3 52.5 27.3 13.3 2241
10 62.9 28.2 107.3 7.3 9 64.4 40.4 17.1 3066

Create Your Distance Matrix

If you have an ID column, click the appropriate radio button to identify that column. Check any columns you wish to be transformed to ranked values before matching. Check any columns indicating matches to prevent. Set desired weights for variables and missingness; weights may be any number greater than or equal to 0. Set the number of units you wish to drop. These units will be matched to "phantoms" as indicated in the results.
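
To make the roles of the rank and weight options concrete, here is a rough base-R sketch of a weighted Mahalanobis distance matrix. It is not the package's gendistance() function, and the way the weights enter the distance is an assumption made for illustration. The helper name weighted_mahal and its arguments are hypothetical, and the sketch ignores missing values, prevented matches, and phantoms. The toy data are the complete rows of three columns from the example table above.

```r
# Hypothetical helper, not the package's gendistance(): a weighted
# Mahalanobis distance matrix with optional rank transformation.
weighted_mahal <- function(X, weights = rep(1, ncol(X)), rankcols = integer(0)) {
  X <- as.matrix(X)
  stopifnot(length(weights) == ncol(X), all(weights >= 0))
  # Replace the chosen columns by their ranks (ties get the average rank).
  for (k in rankcols) X[, k] <- rank(X[, k])
  S_inv <- solve(cov(X))             # inverse covariance of the covariates
  # Assumption: weights rescale the covariate differences, so that
  # d^2 = (xi - xj)' W^(1/2) S^(-1) W^(1/2) (xi - xj), with W = diag(weights).
  W_half <- diag(sqrt(weights), ncol(X))
  M <- W_half %*% S_inv %*% W_half
  n <- nrow(X)
  D <- matrix(0, n, n)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      dif <- X[i, ] - X[j, ]
      D[i, j] <- D[j, i] <- sqrt(max(0, sum(dif * (M %*% dif))))
    }
  }
  D
}

# Rows 1, 3, 4, 5, 7, 8, 9, 10 of the example table (complete cases only)
X <- cbind(
  AgeMean          = c(58.3, 61.1, 61.6, 61.8, 62.5, 62.6, 62.7, 62.9),
  LDLMean          = c(117.5, 106.3, 109, 103.4, 115.5, 100.8, 98.5, 107.3),
  NumberOfPatients = c(1421, 1371, 1793, 2218, 667, 649, 2241, 3066)
)
D_plain  <- weighted_mahal(X)                           # unit weights
D_ranked <- weighted_mahal(X, weights = c(2, 1, 1), rankcols = 3)
```

With all weights equal to 1 and no ranked columns, this reduces to the ordinary Mahalanobis distance, which gives a simple sanity check against R's built-in mahalanobis().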

Create Your Matches

Using your distance matrix, create your matched pairs via optimal nonbipartite matching.
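
The idea of optimal nonbipartite matching can be shown on a toy problem with a brute-force search. This sketch is only an illustration of the objective (the pairing with the smallest total distance); it enumerates every possible pairing, so it is feasible only for a handful of units and is not the algorithm used by the package or the web application. The helper name best_pairs is hypothetical.

```r
# Hypothetical brute-force matcher for a small, even number of units.
# D: symmetric distance matrix. Returns the pairing with minimum total distance.
best_pairs <- function(D, units = seq_len(nrow(D))) {
  if (length(units) == 0) return(list(total = 0, pairs = NULL))
  stopifnot(length(units) %% 2 == 0)
  first <- units[1]
  best <- list(total = Inf, pairs = NULL)
  # The first unmatched unit must pair with someone: try every partner and
  # solve the remaining units recursively.
  for (partner in units[-1]) {
    rest <- best_pairs(D, setdiff(units, c(first, partner)))
    total <- D[first, partner] + rest$total
    if (total < best$total) {
      best <- list(total = total, pairs = rbind(c(first, partner), rest$pairs))
    }
  }
  best
}

# Six units on a line; nearby units should end up paired together.
x <- c(1, 2, 10, 11, 20, 22)
D <- as.matrix(dist(x))
fit <- best_pairs(D)
fit$pairs   # one row per matched pair
fit$total   # total distance of the optimal pairing
```

Unlike bipartite matching, no unit is labelled treated or control in advance: any unit may be paired with any other, which is what makes the approach suitable for forming pairs before randomization.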

Examine The Quality of Your Matches

This will show the upper percentiles for the absolute mean differences for your covariates over 10,000 possible randomizations.


Create the official randomization, setting the seed or making note of the default seed for reproducibility.
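
Randomizing within pairs with a fixed seed can be sketched in base R as follows. This is an illustration of the idea, not the web application's code; the helper name randomize_pairs, the output column names, and the seed value are hypothetical.

```r
# Hypothetical helper: randomise treatments "A" and "B" within matched pairs.
# pairs: two-column matrix or data frame, one row per pair of unit IDs.
randomize_pairs <- function(pairs, seed) {
  set.seed(seed)                            # fixed seed makes the result repeatable
  pairs <- as.matrix(pairs)
  first_gets_A <- runif(nrow(pairs)) < 0.5  # one fair coin flip per pair
  data.frame(
    id        = c(pairs[, 1], pairs[, 2]),
    pair      = rep(seq_len(nrow(pairs)), times = 2),
    treatment = c(ifelse(first_gets_A, "A", "B"),
                  ifelse(first_gets_A, "B", "A")),
    stringsAsFactors = FALSE
  )
}

pairs <- rbind(c(1, 2), c(3, 4), c(5, 6))
alloc <- randomize_pairs(pairs, seed = 68)
```

Calling the function again with the same pairs and seed reproduces the same allocation, which is why recording the seed matters for an auditable randomization.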

Topic revision: r30 - 17 Dec 2015, RobertGreevy
