Goodbye file-format woes: Using Pandoc to export LaTeX documents to word processors
Published on Apr 14, 2021.
I really like writing papers and complex documents in LaTeX. The results look very nice (with a little tweaking) and things tend to behave even when the text gets long. Emacs makes editing the document a smooth experience. For collaboration, I can rely on the awesome power of git.
While the use of LaTeX is quite widespread in my discipline, some of my colleagues prefer to use WYSIWYG-style word processors. In that case, I still want to at least be able to draft the document in a format that I work efficiently in before converting it to something else. So, here is how I turned a physics paper draft from LaTeX into MS Word with a very satisfying end result!
The go-to tool for these purposes is pandoc. It describes itself as a swiss-army knife for markup file format conversions and has a long list of supported formats. Many of these come with bidirectional capabilities, meaning you can convert both to and from that particular format. LaTeX itself, however, can be rather complex, at least behind the scenes. That complexity is what makes the format so flexible and the typesetting so clean, but it also makes LaTeX notoriously difficult to convert into other formats with anywhere near correct results.
Pandoc does come with its own, simplified LaTeX parser which supports enough of the format to produce good-looking and complete conversions including figures, formulas and references. You have to help it along a little though by removing and/or replacing certain non-supported packages and classes.
My main.tex LaTeX document from the Elsevier template package, which includes all header information, looked roughly like this:
1: \documentclass[draft]{elsarticle}
2:
3: \usepackage{lineno,hyperref}
4: \modulolinenumbers[5]
5:
6: \journal{Journal of \LaTeX\ Templates}
7: \usepackage{fixltx2e}
8: \usepackage[binary-units = true]{siunitx} % Enable SI units
9: \DeclareSIUnit\neutron{neutron}
10: \DeclareSIUnit\Bq{Bq}
11: \DeclareSIUnit\uranium{U}
12: \DeclareSIUnit\n{n}
13: \usepackage{tikz}
14: \usetikzlibrary{backgrounds,positioning,fit,decorations.pathmorphing,arrows,shapes,calc,shadows,fadings}
15:
16: \usepackage{xfrac} % for nice (inline) fractions
17: \usepackage[utf8]{inputenc}
18: %% `Elsevier LaTeX' style
19: \bibliographystyle{elsarticle-num}
20:
21: \begin{document}
22: %% Settings for how to display units using the SI and SIrange commands
23: \sisetup{range-phrase=-,range-units=single,product-units = power}
24:
25: \begin{frontmatter}
26:
27: \title{My paper's title}
28:
29: \author[mysecondaryaddress]{Author 1}
30: \author[mymainaddress]{Author 2\corref{mycorrespondingauthor}}
31: \cortext[mycorrespondingauthor]{Corresponding author}
32: \ead{email@example.com}
33:
34: \begin{abstract}
35: Add text here, summarizing the paper.
36: \end{abstract}
37:
38: \begin{keyword}
39: add\sep Text\sep here\sep
40: \end{keyword}
41:
42: \end{frontmatter}
43:
44: \linenumbers
45:
46: \input{article_text}
47: \bibliography{references}
48:
49: \end{document}
In the document above, the main text of the paper lives in article_text, which is included in the document's body on line 46. Running pandoc over this, I encountered issues mostly with the metadata (title and authors, due to the elsarticle class commands on line 25ff) and with the units (siunitx package on line 8). The siunitx package helps with handling various units and typesetting them in a consistent manner. For the word-processor document that we want to create, though, none of this has much of a visible effect. So, I created a separate export.tex where I replaced many of the style settings and set up my own, much reduced unit commands:
1: \documentclass{article}
2:
3: \usepackage{tikz}
4: \usetikzlibrary{backgrounds,positioning,fit,decorations.pathmorphing,arrows,shapes,calc,shadows,fadings}
5:
6: \usepackage{xfrac} % for nice (inline) fractions
7: \usepackage[utf8]{inputenc}
8:
9: \bibliographystyle{usrtnum}
10:
11: \newcommand\SI[2]{#1~#2}
12: \newcommand\SIrange[3]{#1~#3 - #2~#3}
13:
14: \newcommand\neutron{\textrm{n}}
15: \newcommand\uranium{\textrm{U}}
16: \newcommand\n{\textrm{n}}
17: \newcommand\nano{\textrm{n}}
18: \newcommand\kilo{\textrm{k}}
19: \newcommand\mega{\textrm{M}}
20: \newcommand\giga{\textrm{G}}
21: \newcommand\tera{\textrm{T}}
22: \newcommand\meter{\textrm{m}}
23: \newcommand\m{\textrm{m}}
24: \newcommand\mm{\textrm{mm}}
25: \newcommand\cm{\textrm{cm}}
26: \newcommand\volt{\textrm{V}}
27: \newcommand\micro{\mu}
28: \newcommand\second{\textrm{s}}
29: \newcommand\hour{\textrm{h}}
30: \newcommand\keV{\textrm{keV}}
31: \newcommand\MeV{\textrm{MeV}}
32: \newcommand\per{/}
33: \newcommand\ampere{\textrm{A}}
34: \newcommand\Hz{\textrm{Hz}}
35: \newcommand\hertz{\textrm{Hz}}
36: \newcommand\byte{\textrm{B}}
37: \newcommand\gray{\textrm{Gy}}
38: \newcommand\Bq{\textrm{Bq}}
39: \newcommand\sievert{\textrm{Sv}}
40:
41: \title{My paper's title}
42: \author{Author 1, Author 2}
43:
44: \begin{document}
45: %% Settings for how to display units using the SI and SIrange commands
46:
47: \begin{abstract}
48: Add text here, summarizing the paper.
49: \end{abstract}
50:
51: % \input{article_text}
52: \input{article_text}
53: \bibliography{references}
54:
55: \end{document}
On line 11 I create my own \SI command for setting units; the units themselves are defined in the lines that follow. Yes, I do need that many different units in the same text… ;)
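To make the effect of these stand-in commands concrete, here is a minimal sketch of how they expand. The three sentences are hypothetical and not taken from the actual paper; only the macro definitions are copied from export.tex, and only the handful needed for the example.

```latex
\documentclass{article}

% Simplified stand-ins for siunitx, copied from export.tex
\newcommand\SI[2]{#1~#2}
\newcommand\SIrange[3]{#1~#3 - #2~#3}
\newcommand\kilo{\textrm{k}}
\newcommand\meter{\textrm{m}}
\newcommand\second{\textrm{s}}
\newcommand\hertz{\textrm{Hz}}
\newcommand\MeV{\textrm{MeV}}
\newcommand\per{/}

\begin{document}
% Hypothetical example sentences (assumptions, for illustration only)
The neutrons have an energy of \SI{14}{\MeV}.     % expands to 14~MeV
The source moves at \SI{3}{\meter\per\second}.    % expands to 3~m/s
Rates range over \SIrange{1}{10}{\kilo\hertz}.    % expands to 1~kHz - 10~kHz
\end{document}
```

Because the stand-ins are plain \newcommand definitions without any siunitx options, the text in article_text can stay untouched and still be shared between main.tex and export.tex.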
This can now be converted with pandoc:
pandoc -s export.tex -o export.docx --bibliography references.bib --csl elsevier-vancouver.csl
The needed citation style file (and many others) can be downloaded from https://citationstyles.org/authors/.
Voilà, we now have an export.docx document! It does not look identical to the LaTeX output (how could it?), but it includes the same information and doesn't look too shabby!
If you also want to have your figures and equations numbered as you would expect in LaTeX, you need an additional pandoc filter: pandoc-xnos. It can be installed via pip:
pip install pandoc-fignos pandoc-eqnos pandoc-tablenos \
pandoc-secnos --user
Then, call pandoc using the corresponding filter:
pandoc -s export.tex -o export.docx --filter pandoc-xnos --bibliography references.bib --csl elsevier-vancouver.csl
But beware that labels for equations, figures and such cannot contain underscore characters ('_'); they have to consist of single words (camel case is OK though). That was at least the case at the time of testing.
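As an illustration of that naming restriction, here is a small sketch of a label written in camel case. The equation and the label name are hypothetical, and the eq: prefix is an assumption on my part, so adjust it to whatever your own labels look like.

```latex
\documentclass{article}
\begin{document}

% Hypothetical example; equation and label name are illustrative only.
\begin{equation}
  N(t) = N_0 \, e^{-\lambda t}
  \label{eq:decayLaw}   % camel case instead of eq:decay_law
\end{equation}

Equation~\ref{eq:decayLaw} describes the decay of the sample.

\end{document}
```

Renaming a label means updating every \ref that points to it, so a project-wide search and replace per label is the safest route.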
Once you have sent the document out, you might get files back in unwieldy proprietary formats. In that case, you can even extract diffs between the versions sent out and received back, or even from the track-changes feature of MS Word, using pandiff:
https://github.com/davidar/pandiff
Hope this helps your workflow – and keeps your colleagues happy while allowing you to use the tools of your choice!
Update 20210610: Added info on numbering equations and figures.