Converting Documents Produced by Sweave and knitr

Converting from LaTeX to html

The recommended format for statistical reports sent to collaborators is pdf produced by pdflatex. Sometimes it is necessary to give collaborators a version of a report that can be edited outside of LaTeX, or to post a report on a web site. Experiments with latex2rtf and hevea have shown that these are not adequate for reports that incorporate advanced features such as latex(describe()) output. One of the most reliable approaches is to use TtH to convert from LaTeX to html (but see TeX4ht below, which is probably better). However, if you need to convert tables that, unlike latex(describe(...)) output, do not contain pictures, hevea is the fastest and usually the best approach. hevea is available as an option in the Hmisc package's html function.

Collaborator Communication Strategy

The gold standard graphics file format at present is pdf, and you can send all the individual pdf graphics files to collaborators or send the single large pdf report file. For reproducible research, collaborators should not edit graphics files; statisticians should attempt to make graphics publication-ready if they are to be used in a manuscript or grant proposal. Tables usually present the bigger challenge, because collaborators often need to extract tables into a Word document and sometimes need to reformat the tables. This is not what reproducible research is about. Statisticians need to produce tables in nearly final format so that if any data or computational methods are changed the table can be re-exported and re-inserted into a collaborator's document with very little manual intervention.

LaTeX is the gold standard for producing advanced tables. No other approach can handle the nuances that LaTeX can handle, and we have an abundance of R functions for producing LaTeX tables. (As an aside, pandoc is an excellent way to convert simple tables, but table formats are constrained to the very simple patterns supported by the markdown language.) Converting tables from LaTeX to html is the best approach for working with non-LaTeX users. There are two global choices:
  1. After running knitr or Sweave, convert the whole report's .tex file to html and send this to the collaborator along with all graphics files and the html css stylesheet produced by the conversion software. Although rendering in html is good overall, often the R code and some of the pdf graphics do not come out right, which will cause confusion.
  2. For tables produced by R code that you think will need to be extracted by the collaborator, have those tables both appear inside the knitr or Sweave report and be produced a second time onto a .tex file that is external to the main report file. Here is an example:

<<summary, results='asis'>>=
require(Hmisc)
f <- summaryM(age + sex + sbp + Symptoms ~ treatment + country,
              groups='treatment', test=TRUE)
latex(f, file='', npct='slash', middle.bold=TRUE, prmsd=TRUE)
fi <- 'tables.tex'
cat('\\documentclass{report}\\begin{document}\n', file=fi)
w <- latex(f, file=fi, npct='slash', middle.bold=TRUE, prmsd=TRUE, append=TRUE)
# Note: npct='slash' option will be in the next release of Hmisc around 2014-10-07
# Until then the old npct='both' can be used; htlatex will convert the fractions to small png files
# that are included in the table
@

. . .
<<summary2, results='asis'>>=
f <- summaryM(. . .)
latex(f, file='')
w <- latex(f, file=fi, append=TRUE)
cat('\\end{document}\n', file=fi, append=TRUE)  # after last latex() call
@

Then run htlatex tables.tex to produce tables.html, tables.css, and any needed graphics files. Zip these new files and send the archive to the collaborator, who can open the html file directly in Word. See the comment below about instructions for unzipping the archive. htlatex is part of the TeX4ht package, which is available for linux, Mac OS, and Windows.

Example output may be found in the attached htlatex.pdf, a pdf file produced just by printing to pdf from a browser. This output will look the same in Word once the html file is inserted into a Word document or opened whole, and all table elements are editable.

You can even do some of the latter steps from within R:

system('htlatex tables.tex')                 # run TeX4ht on the external tables file
# Sys.glob expands the wildcard in R before zip is called
zip('/tmp/tables.zip', Sys.glob('tables*'))  # package all files to send to collaborator
unlink(c('tables*.png', 'tables.dvi', 'tables.html', 'tables.css', 'tables.4ct', 'tables.tmp')) # remove htlatex-created files

Beginning with Hmisc version 3.15-1 you can use the html function to run htlatex automatically to create html files:

require(Hmisc)
getHdata(pbc)
s <- summaryM(bili + albumin + stage + protime + sex + age + spiders ~ drug,
              data=pbc, test=TRUE)
w <- latex(s, npct='slash', file='s.tex')
z <- html(w)
browseURL(z$file)

d <- describe(pbc)
w <- latex(d, file='d.tex')
z <- html(w)
browseURL(z$file)

Producing Individual html Tables for Insertion by Collaborators into Word

When the table does not contain pictures, e.g. when using tables produced by an Hmisc summary* function without specifying dotchart=TRUE to latex.summary.formula, the fastest and best way to convert a table in LaTeX that is in its own .tex file is to use hevea. There is an option in the Hmisc html.latex function for using hevea. Here is an example:

ht <- function(x) {
  # x is the object returned by latex(); x$file is the .tex file it wrote
  base <- sub('\\.tex$', '', x$file)
  invisible(html(x, method='hevea', file=paste(base, 'html', sep='.')))
}
w <- latex(summaryM(age + height + type ~ sex , data=dbase, overall=TRUE,
                    test=TRUE),
      long=TRUE, prmsd = TRUE, npct='slash', middle.bold=TRUE,
      caption="Descriptive Statistics",
      msdsize='scriptsize', round=2, digits=2, prtest='P', pdig=2,
      file='/tmp/a.tex',
      label="table:summary")
h <- ht(w)

If this were in a knitr document you could have the following after the @ that ends the chunk to also include the LaTeX typeset table:
\input{/tmp/a}

TtH

A linux script is needed to translate Sweave's LaTeX output for use with TtH (thanks to Ben Bolker for providing most of the code). Here are the steps needed to get going.
  1. Install necessary packages (instructions for Debian/Ubuntu variants):
    • Install package 'tth'
      • Using terminal: sudo apt-get install tth
    • Install package 'netpbm', which provides the ppmtogif executable, if your documents include LaTeX picture code such as that produced by latex.describe
      • Using terminal: sudo apt-get install netpbm
  2. Download 'sweave2html' script
    • Using the mouse
      1. Using the system menu, open the /home/username/bin/ folder.
      2. Create and save a script file called 'sweave2html' (without the .txt extension) by right clicking in the folder and selecting 'Create New\Text File.'
      3. Copy and paste the sweave2html commands (found in the sweave2html link) into the script file.
      4. Mark this script as executable by right clicking on the sweave2html file, choosing 'Properties,' and the 'Permissions' tab. Check the 'Is executable' box.
    • Alternatively, the script may be created in a terminal by typing the commands:
      1. cd ~/bin
      2. wget http://biostat.mc.vanderbilt.edu/wiki/pub/Main/SweaveConvert/sweave2html -nv (Note: type the web address, not the contents of the link.)
      3. chmod u+x sweave2html
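
Scripts placed in ~/bin are only found by name if that folder is on the shell's search path. The following is a minimal sketch of a check for this; the function name on_path is hypothetical and not part of any package mentioned here.

```shell
# on_path DIR [SEARCH]: succeed if DIR is one of the colon-separated
# entries of SEARCH (default: the current $PATH).
on_path() {
  dir=$1
  search=${2:-$PATH}
  case ":$search:" in
    *":$dir:"*) return 0 ;;
    *)          return 1 ;;
  esac
}

# Example: warn if ~/bin is not searched for commands.
on_path "$HOME/bin" || echo "Add $HOME/bin to PATH, e.g. in ~/.profile" >&2
```
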
To convert your .tex file to html and create all the needed graphics files, do the following.

  1. Create a folder called 'graphics' in your project directory and have Sweave use it by putting the following command in your .Rnw or .nw file: \SweaveOpts{prefix.string=graphics/plot}
  2. In a terminal (shell), change to the (project) directory where the .tex file is located (using the command 'cd ~username/pathname').
    1. Run Sweave by typing 'Sweave filename' (without the .Rnw or .nw extension).
    2. Type the command 'sweave2html tex_filename' (without the .tex extension).
You can view the filename.html output in a browser such as konqueror, then copy and paste it into an OpenOffice document and save it in a variety of formats including Word. Use Select All (control+a), then copy and paste, and all graphics and table formats will be preserved.
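
The first step above can be checked before running Sweave. This is a sketch only: prepare_graphics is a hypothetical helper that creates the 'graphics' folder next to the source file and verifies that the source contains a prefix.string setting pointing into it.

```shell
# prepare_graphics SOURCE: create a 'graphics' folder next to SOURCE
# (a .Rnw or .nw file) and check that SOURCE directs Sweave output there.
prepare_graphics() {
  src=$1
  mkdir -p "$(dirname "$src")/graphics" || return 1
  # Look for the prefix.string=graphics/... option described above
  if grep -q 'prefix\.string=graphics/' "$src"; then
    return 0
  fi
  printf '%s: no prefix.string=graphics/ setting found\n' "$src" >&2
  return 1
}

# Usage (report.Rnw is a placeholder name):
#   prepare_graphics report.Rnw && echo "ready to run Sweave"
```
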

It is important to give your collaborator all the .pdf files in the graphics directory to use in manuscripts; do not let them use the lower resolution graphs that will be included in the filename.html document. Bundle all the necessary files to send to the collaborator, using for example
zip /tmp/z.zip foo.pdf foo.html *.gif graphics/*.pdf graphics/*.png
E-mail /tmp/z.zip as an attachment.

Using TeX4ht

The TeX4ht package is a comprehensive LaTeX to html converter. It may be installed easily using apt-get. For Windows, go here and note that many of the changes discussed there are not needed.

In one test TeX4ht performed well (including greek letters and superscripts and LaTeX picture environments) although I did not see how to get postscript or pdf graphics to appear in the final output. Advanced summary.formula.reverse tables are handled nearly perfectly, including those that contain micro dot charts. TeX4ht is used as follows:
htlatex foo.tex            # produces foo.html
mk4ht oolatex foo.tex      # produces an OpenOffice .sxw file
Note that the tth package has to be installed for htlatex to run completely.
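
Because the conversion depends on several separately installed programs, it can help to confirm they are available first. A minimal sketch, assuming the executables are named htlatex, mk4ht, and tth as in the commands on this page; missing_tools is a hypothetical helper name.

```shell
# missing_tools NAME...: print each NAME that is not an available command.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example: list anything still to be installed before converting.
missing_tools htlatex mk4ht tth
```
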

My test of the oolatex option resulted in output that was not as good as running htlatex and opening the resulting .html file in OpenOffice. See StatReport for more information and example output, and note its comment about turning off picture links in the OpenOffice document.

Current Best Approach for Converting from LaTeX to Word

The following approach works well in many cases (e.g., documents with greek letters, simple math expressions, and bibliographies). Define the following script as l2h in your ~/bin directory:
#!/bin/sh
# l2h: convert $1.tex to html with htlatex, zip the results, open in OpenOffice
htlatex "$1.tex"
rm -f "$1.idv" "$1.lg" "$1.tmp" "$1.4tc" "$1.xref" "$1.4ct"   # remove htlatex work files
zip /tmp/$$.zip "$1.html" "$1.css" "$1"*x.png                  # bundle for the collaborator
oowriter "$1.html"
# Tcl/Tk popup with reminders; closes when clicked
echo "pack [button .h -text \"/tmp/$$.zip contains html and related files for\ncollaborator to unpack into one folder, or:\n\nClick Edit ... Links ... Break Link\nClick View and uncheck Notes, then\nSave as Word 97/2000/XP and exit OpenOffice\" -command exit]" | wish
rm -f "$1"*x.png "${1}2.html" "$1.dvi"

Run l2h my to convert my.tex to html and open OpenOffice to save it in Word 97/2000/XP. A popup will give you some pointers, such as unlinking pictures so that the document is self-contained if you e-mail it to someone. Examples are attached (see below for intro.tex and intro.doc). This process gives you two options. First, you can e-mail your collaborator the .zip file the script creates in /tmp. Second, you can go ahead and save the result as a Word 97 document.
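
The script expects the file name without its .tex extension. If you would like it to accept either form, shell parameter expansion can strip the extension. A small sketch; tex_base is a hypothetical helper name and not part of the script above.

```shell
# tex_base NAME: print NAME with a trailing .tex removed, if present.
tex_base() {
  printf '%s\n' "${1%.tex}"
}

# Inside l2h one could then start with:  base=$(tex_base "$1")
tex_base my.tex   # prints: my
tex_base my       # prints: my
```
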

Try having your collaborator use the html approach first. html files can be opened directly in Word, and Word will use the html style sheet (css file) that is included in the zip file. In addition to the html and css files, you will need to include any png image files created by htlatex. Further investigation has shown that OpenOffice and LibreOffice lose some font attributes but that opening the html directly in Word preserves them. So zipping the html, css, and png files and sending the zip file to the collaborator is the best approach. Be sure to tell the collaborator not to open the html file from within the zip archive but to extract all the files from the archive into one folder; otherwise Word will not find the image files.
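
Before zipping, or after the collaborator has extracted the archive, it may be worth checking that every image the html file refers to is actually in the folder. The sketch below assumes images are referenced as src="..." with double quotes and local relative paths; list_missing_images is a hypothetical helper name.

```shell
# list_missing_images HTML: print each image referenced by HTML through
# src="..." that does not exist relative to the folder containing HTML.
list_missing_images() {
  html=$1
  dir=$(dirname "$html")
  # Extract every src="..." value, one per line, then test each file
  grep -o 'src="[^"]*"' "$html" | sed 's/^src="//; s/"$//' |
  while IFS= read -r img; do
    [ -e "$dir/$img" ] || printf '%s\n' "$img"
  done
}

# Usage (tables.html as produced by htlatex tables.tex):
#   list_missing_images tables.html
```

No output means that all referenced images were found.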

If you do not use many LaTeX packages, your tables are not complex, and you do not make major use of equations, a faster approach is to install the latex2rtf package to very quickly convert from LaTeX to rich text format (rtf), using a command such as latex2rtf -o my.rtf my.tex.

You can create a file that can be opened in Firefox that beautifully renders equations without resorting to graphics by using MathML. The attached intro.xhtml was created by running mk4ht xhmlatex intro.tex then renaming intro.html as intro.xhtml. We don't currently know how to make OpenOffice open such files. To properly view intro.xhtml you have to save it to a local file so you can point to it outside of foswiki.

Using OpenOffice Exclusively

The odfWeave package by Max Kuhn can be used to produce reports directly in open document format, and the output can be saved in Word format. At present, graphics are somewhat low resolution. Source code is similar to what is used with Sweave. Here is how to run an example (in linux), after installing the odfWeave package and the latest OpenOffice. The file can then be exported to open document or Word format.

R
library(odfWeave)
odfWeave('/usr/local/lib/R/site-library/odfWeave/examples/examples.odt', '/tmp/out.odt')

You can then open /tmp/out.odt in OpenOffice Writer. Note: On some systems the correct file name will be /usr/lib/R/site-library/odfWeave/examples/examples.odt.

This approach does not allow you to use the advanced table making capabilities of Hmisc that rely on LaTeX.

Weaving with Raw HTML

Greg Snow has written a document showing how to use raw HTML and the R2HTML package to produce .html reports.

Batch Conversion of Document Formats

cd /tmp                            # assumes the ooconvert archive was downloaded here
bunzip2 ooconvert-*
tar xvf ooconvert-*
# You may need to edit the script to change python2.3 to python
sudo chmod a+x ooconvert
sudo mv ooconvert /usr/local/bin   # or move it to ~/bin

One-step Conversion of LaTeX Documents to Word

  • Install ooconvert, tth, tex4ht
  • Put the following script in ~/bin and chmod +x to make it executable
  • Run it by saying ltx2doc foo to convert foo.tex to foo.doc

#!/bin/sh
# ltx2doc: convert $1.tex to $1.doc
mk4ht oolatex "$1.tex"
rm -f "$1.css" "$1.idv" "$1.lg" "$1.tmp" "$1.4tc" "$1.xref" "$1.4ct"
ooconvert "$1.odt" "$1.doc"
rm "$1.odt"

But see above for a better approach through html and OpenOffice.

Converting to Word or OpenOffice by Converting from PDF

http://pdftoword.com does a surprisingly good job in many cases, including good handling of graphics. Convert your LaTeX document to pdf then use this server to convert to .doc or .rtf which will be e-mailed to you. Here is a great example: pdflatex output converted to Word (zip file) then back to pdf using Word or using pdftoword.com

Also try http://zamzar.com. The result on the above test file was not nearly as good as with pdftoword (except for the complex summary 'reverse' tables!)

A test on 2015-11-14 on spaper.pdf got better results with pdftoword than with zamzar, but ggplot2 graphics were converted to editable characters (parts of which rendered correctly and parts of which did not). http://pdfonline.com produced perfect html, but using Word on the html file was a total mess. It claims to convert to Word directly but really converts to defective rtf. On spaper.tex, tth (using a new shell script knitr2html) rendered well but would not recognize \begin{supp}...\end{supp}. htlatex has a bug related to BibLaTeX.

Doesn't work: freepdfconvert, smallpdf.com (partially worked), doc.zone, convertonlinefree.com, pdf2doc, convertpdftoword.net, pdfpublisher, wondershare.net pdfelement (failed to run using wine).

formswift.com rendered perfectly in their online editor after converting but required a credit card in order to download the Word document.

More information is available at http://www.freewaregenius.com/2010/03/06/how-to-convert-pdf-to-word-doc-for-free-a-comparative-test. See especially Nuance which costs $ and runs only on Windows and Mac.

Converting from Word to PDF

http://www.pdfonline.com/convert-pdf

Topic attachments

  • htlatex.pdf (31.9 K, 17 Sep 2014 - 18:00, FrankHarrell): Output of htlatex after printing the page to pdf from a browser
  • htmlWeave.pdf (60.8 K, 26 Jul 2006 - 22:59, FrankHarrell): Automating Reports with Sweave by Greg Snow
  • intro.doc (69.5 K, 06 May 2009 - 19:20, FrankHarrell): Result of l2h intro after telling OpenOffice to save in Word format
  • intro.tex (8.5 K, 06 May 2009 - 19:19, FrankHarrell): LaTeX test file to try with l2h
  • intro.xhtml (41.7 K, 17 May 2009 - 11:57, FrankHarrell): Result of mk4ht xhmlatex intro.tex then renaming intro.html to intro.xhtml
  • sweave2html (0.6 K, 14 May 2009 - 10:44, WillGray): Sweave LaTeX to html converter

Topic revision: r46 - 08 Dec 2015, FrankHarrell
 

Copyright © 2013 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.