
Converting my PhD thesis into HTML


Finishing a PhD is a weird emotional experience. All the hard work, the joys, the pains, the pulled hairs, everything gets condensed into a scary-looking PDF and then you're just… done? What? This makes no sense whatsoever. Or rather, this makes sense on paper, but then you feel this weird sense of grief somehow. And you're not quite at the acceptance stage yet. So instead, you decide to deal with those feelings in a perfectly normal and healthy way, and you embark on a journey to compile said thesis into a series of HTML pages.

HTML, by the way, is a much better way of disseminating information than PDF. Pretty much all recent scientific research is recorded in PDF files, for historical reasons that are largely irrelevant today. PDFs are difficult to browse, impossible to read on a phone, uncomfortable to read on a tablet, hostile to screen readers, impractical for search engines, and the list goes on. It's just a terrible format, unless you're trying to print things on paper. Printing things is a perfectly reasonable thing to do, but it's really not the main use case we should be optimizing for.

Anyway. I converted my thesis to HTML and this is my story. A story of false hopes, perseverance, pain, and futility. I hope this can be useful to other people, as a guide on how to do this for your own thesis or other large, complex LaTeX documents, or as an encouragement to do something better with your time instead.

False hopes

"Convert LaTeX to HTML", I type in my search engine of choice. Ooooh, I have options! There's pandoc, lwarp, LaTeXML, TeX4ht, and probably others. This looks excellent. Converting LaTeX to HTML is clearly a problem that other people have already solved for me before. I will just have to run an existing tool, and iron out the kinks.

I download the tools in question, run them on my thesis, and look at the initial results. TeX4ht fires off a bunch of compilation errors and warnings, but it outputs something that kinda looks reasonable from a distance. All the others fail completely. So I go, "OK, let's try to fix the TeX4ht problems, to get a feeling for how difficult this is". It turns out not to be too difficult to fix the most common issues: LaTeX Stack Exchange answers most of my questions, so I make progress. I also notice that there is a nice-looking build system for TeX4ht called make4ht, which looks really nifty; I imagine it's going to be similar to latexmk, which I love.

So, things are going alright. I make progress. Here are some of the problems I found at first and how I fixed them.

  • A bunch of packages or commands don't make much sense in an HTML context: page breaks, PDF anchors, page numbers, floats, landscape layouts, margins or other types of spacing… Some of them (like floatrow) throw compilation errors; most are simply ignored. I made a pass at all the packages I used and removed the ones that were obviously irrelevant for HTML (see the preamble sketch after this list).
  • One special case is longtable: since a regular table can be as long as you need it to be in HTML, you don't need it either. Replacing it with a regular tabular, and ThreePartTable (from threeparttablex) with the regular threeparttable, fixed the problem.
  • Importing an image originally stored in a PDF rendered it as a tiny, unreadable thumbnail. Adding a "config file" with some dark magic in it did the trick.
  • SVGs generated with tikzpicture were very wrong (missing text, blank graphs…). Apparently the "driver" included in htlatex is not good, but for some reason it's still in use. Including the line that calls a different driver wasn't enough: even though the file was already present on my system, I still got some bugs (text not in the right place). Importing the file directly from GitHub worked (see the driver sketch after this list).
  • Some commands don't work for reasons I didn't really understand, but are easily fixable: for example, \notin works fine with pdflatex, but tex4ht complains about it. Replacing it by \not{\in} everywhere fixes it.
  • Each footnote is, by default, put in its own separate HTML file. It gets fixed by creating a .make4ht file that contains something like:

        settings_add { tex4ht_sty_par = ",fn-in" }

    This tells make4ht to pass additional arguments (here, fn-in) to tex4ht, which change its behavior. There are many available options.
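
For the package cleanup, the pattern that ends up working well is to branch the preamble on whether TeX4ht is running. Here is a minimal sketch; it relies on \HCode being defined during a TeX4ht run (a common detection trick), and the package names are just examples from the categories above, not my actual preamble.

    % Sketch: one set of sources, two builds.
    % \HCode is defined when tex4ht is running, so we can use it
    % to skip print-only packages in the HTML build.
    \ifdefined\HCode
      % HTML build: no landscape pages, no float tweaking.
      % (longtable environments were replaced by plain tabular
      % directly in the source, since HTML tables can be any length.)
    \else
      \usepackage{pdflscape}  % landscape pages only make sense on paper
      \usepackage{floatrow}   % threw compilation errors under TeX4ht
    \fi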
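
For the TikZ problem, the fix is to point PGF at a newer driver before loading tikz. A sketch, assuming the driver file is the pgfsys-dvisvgm4ht.def one I downloaded from GitHub and dropped next to my sources:

    % Use the updated dvisvgm4ht PGF driver under TeX4ht, so that
    % tikzpicture environments render into correct SVGs.
    \ifdefined\HCode
      \def\pgfsysdriver{pgfsys-dvisvgm4ht.def}
    \fi
    \usepackage{tikz}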

I should probably have noticed the early warning signs. One is that the default behavior often makes zero sense: take this footnote problem… who would want footnotes in separate HTML files when all the rest is in a single HTML file? Why is that a reasonable thing to do?

Also, compilation errors don't give you a clear picture of what actually goes wrong. LaTeX is bad at this in general, but TeX4ht is definitely worse. The error messages are often classical LaTeX errors like ! Extra }, or forgotten \endgroup, but that's almost never the actual problem, since the same file compiles fine into PDF. So looking up the error messages online doesn't help. Instead, I fixed those early problems by bisecting the error, or by asking the internet how to do a certain thing.

Still, I'm making quick progress. I wonder things like "can I put the different sections on different HTML pages rather than having one monolithic document" and find out that all I need to do is pass an option to TeX4ht and it works. The option is unbelievably badly named: to tell it "make one page per subsection", you tell it "3", because that's three subdivision levels (chapter / section / subsection). Yes, I really mean "3". The option has no other name. You just pass a single number to the command line.

But whatever. It works. I make progress. I invest time fixing things. Surely, if I just spend a few more hours fixing things, I'll be done. The sunk cost fallacy starts taking hold of me. I don't notice a thing.

Perseverance

I start stumbling into some issues that are more difficult to fix. The first big one is how equations display. By default, TeX4ht converts each equation into an image, and includes the image in the HTML file. I imagine it's pretty awful for accessibility, and it's also really ugly: the images are low-quality, they stand out in the middle of the text, and zooming in or out is a visual nightmare. After some testing, I decide that the best solution is to pass the mathml option to tex4ht, and the html5+mathjaxnode option to make4ht to tell it to post-process all of the pages with mathjax-node-page, which converts the MathML equations into… prettier-looking equations, I think. I don't exactly understand how it works, but MathML alone is ugly, and this is pretty. Ship it. This requires me to install Node.js, which, urgh, but whatever.

I realize only afterwards that this package is deprecated, and that TeX4ht's GitHub repository recommends using the mjcli option instead. That option isn't recognized on my machine, probably because I don't have a recent enough version. What I have works, so I don't look further.

I also start cleaning up my build process. And this is where I start noticing some behaviors of these tools that are just wrong, and frustrating for no reason.

  • One example is the -d option of make4ht, which is supposed to tell it "put all output files in this specific subdirectory". This option is lying to you. The files are merely copied over to this directory, and only some of them at that. So your working directory is still cluttered with intermediary files, logs, and HTML files.
  • I initially thought that it would be kind of like latexmk, running the compilation commands multiple times until it gets the bibliography references right. It does not do that. You have to do it manually.
  • When you realize you didn't compile what you wanted to, pressing Ctrl-C doesn't seem to stop the process. It does, however, make the command-line output hang. So you have to close the terminal and open a new one.

None of these things is a huge deal-breaker. I am still making progress. I also fix a bunch of other problems that start looking more like real weird bugs than understandable annoyances.

  • \autoref did not work. I tried pretty hard to fix it, and finally gave up and changed all the \autorefs into regular \refs using sed.
  • LaTeX expressions that are perfectly fine according to pdflatex, like a_\text{b} or a^\mycommand{b} (where \mycommand is a custom command), failed to compile. This could be fixed by adding brackets: a_{\text{b}} works, as does a^{\mycommand{b}}. Alas, fixing all compilation problems isn't enough: simple expressions like e^\eps, where \eps is defined as a simple alias of \varepsilon, compile fine but display incorrectly, so they must also be detected and changed to e.g. e^{\eps}.
  • But wait, it gets even worse: expressions like e^{\eps} are fine in text, but if they are put in macros, then they no longer work. Sometimes. To solve that final problem, I replaced all _ and ^ in my macros by \sb and \sp (see the sketch after this list). Gross.
  • The itemized list of tablenotes in threeparttable environments did not correctly put line breaks between items. You have to add line breaks manually.
  • Speaking of tables, multirow doesn't work. A workaround is to use \newline within cells. There is probably a better option.
  • Having multiple \tikzpicture commands in a single figure resulted in really weird visual bugs, without a compilation error: only a single picture being shown, random text in absurd places. Putting each \tikzpicture in its own cell in a tabular environment is a quick workaround. There is probably a better option (subfigure with the right arguments maybe?).
  • \hat{D} looked reasonable; \hat{O} displayed like the French Ô in equations. Whyyyy. I fixed it by using \hat{{O}}. No clue why this works, nor why it happened in the first place.
  • Regular parentheses in equations are automatically sized to the biggest thing on the same line. So if you have an equation like \(f(x)=\frac{tallthing}{alsotallthing}\), the parentheses around the \(x\) are comically large. You need to replace all of these with \left(x\right) to get the correct behavior.
  • Having a cases* environment nested inside of an align* environment failed to compile. Replacing the align* environment by \[ … \] compiles, but the line breaks within the cases* environment are ignored. I solved it by using a matrix* environment instead (with the [l] option for correct alignment), surrounded by \left\{...\right. to emulate the big left brace (see the sketch after this list).
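
To make the subscript and superscript weirdness concrete, here is a minimal before/after sketch; \eps is the alias mentioned above, and \errterm is a hypothetical macro standing in for my real ones.

    % The alias, as in the thesis:
    \newcommand{\eps}{\varepsilon}

    % Before: compiles under pdflatex, breaks or mis-renders under TeX4ht.
    %   $a_\text{b}$, $e^\eps$
    % After: explicit braces fix both compilation and rendering.
    %   $a_{\text{b}}$, $e^{\eps}$

    % Inside macros, even braced ^ and _ sometimes failed, so the macros
    % use the plain-TeX equivalents \sp (superscript) and \sb (subscript).
    % \errterm is a made-up example, not a macro from the thesis.
    \newcommand{\errterm}[1]{e\sp{#1}}  % instead of e^{#1}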
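
And here is the cases* workaround, with a made-up formula; it assumes mathtools, whose starred matrix* environment takes an optional column-alignment argument.

    % Instead of align* + cases*, which TeX4ht mishandled:
    \[
      f(x) =
      \left\{
      \begin{matrix*}[l]  % [l] = left-aligned column, from mathtools
        0 & \text{if } x < 0 \\
        1 & \text{otherwise}
      \end{matrix*}
      \right.             % invisible right delimiter emulates cases*
    \]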

This is where I started doing some really ugly things to get around such bugs. Using grep and sed to do large-scale changes, or doing gross things like replacing horizontal spaces by non-breaking spaces, became routine. At that point though, I was in too deep to reconsider my choices. So I kept going, even as the bugs got progressively more arcane.

Pain

The serious problems happened as I was trying to figure out how to get the table of contents working as expected. It seemed to be truncated for no reason, with very weird errors on the command line, referencing some intermediary files. I bisected it to a % symbol in the caption of a figure. You read that right: I had a correctly-escaped % in the legend of one of my figures, it compiled and displayed perfectly fine, but it broke the regular table of contents. Not the "list of figures", mind you! I didn't even have a list of figures!

Another problem was with chapter- or section-specific tables of contents, which are a good thing to have when everything is separated across many HTML pages. Sadly, they sometimes had the wrong sections or subsections in them; Section 4.2 would have a few subsections from Section 4.3 in its table of contents. I tried for a while to build a minimal working example to figure out where the problem came from, but the behavior didn't look very deterministic, so I gave up and simply removed these altogether.

Captions also have their share of bizarre, non-deterministic bugs. For example, using a formula like \left[a\middle|b\right] inside of a caption made compilation fail. Removing the \middle part, which does not cause any issue anywhere else, fixes it. Except that macros also sometimes fail to display the desired formula inside captions, with e.g. a subscript being ignored. But the exact same code without a macro would work fine, or the same macro outside a \caption{} would also work fine. Bizarre stuff.
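
Concretely, the \middle bug looked something like this (schematic captions, placeholder math):

    % This caption failed to compile under TeX4ht:
    \caption{The quantity \(\left[a\middle|b\right]\) over time.}
    % Dropping \middle, which causes no issue anywhere else, fixed it:
    \caption{The quantity \(\left[a|b\right]\) over time.}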

Eventually, I stopped trying to fix the bugs, and simply learned to work around them, by either removing the thing entirely, or post-processing the output. This happened, uh, a number of times.

  • Using \intertext between lines of an align* equation, a trick to keep equations aligned even when you put a paragraph of text between them, resulted in the entire thing being ridiculously shifted to the right. I solved it by changing the \intertext into a normal paragraph.
  • Algorithms from the algorithm2e package display really strangely. Removing line numbers kind of helps, but it's not great, and the official advice seems to be "convert it as an image", which, ew. I only used this environment once, so I simply converted it into a listing.
  • The TeX4ht config file did not work as expected. The internet tells me that adding lines starting with \Configure{@HEAD} is supposed to add corresponding lines in the <head> element of the generated HTML files, and that you add multiple such lines to add multiple elements. There are plenty of examples online of this pattern being used (it's sketched after this list). Somehow, on my machine, only the first such command was added to <head>; the others appeared in the <body> instead (which, of course, does not have the expected semantics). After a few hours trying to debug this, I trashed that whole idea and instead made a Python script that replaced the beginning and the end of each HTML page entirely.
  • A series of underscores got added after some of the citations at the beginning of each chapter. I added a few lines to my Python script to get rid of them without even trying to understand where that particular weirdness came from.
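
For reference, here is the pattern as documented online, in a TeX4ht config file; the stylesheet name is a placeholder, and this is exactly the pattern that refused to behave on my machine.

    \Preamble{xhtml}
    % Each \Configure{@HEAD} call is supposed to append one more line
    % to the <head> element of every generated page.
    \Configure{@HEAD}{\HCode{<link rel="stylesheet" href="extra.css" />\Hnewline}}
    \Configure{@HEAD}{\HCode{<meta name="viewport" content="width=device-width, initial-scale=1" />\Hnewline}}
    \begin{document}
    \EndPreamble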

The CSS part of this whole build process is also broken in interesting ways. Two style files are generated: a main one that I think is part of TeX4ht, and another one added by mathjax-node-page.

  • The main CSS has the same commands repeated many times for no reason. It also has styles that are "obviously wrong": class="indent" ends up disabling the text indentation, while there are elements with the "noindent" class, which isn't defined anywhere in the CSS, so these inherit the global behavior (which is "add an indentation" on my website).
  • The mathjax CSS is fine, but the build process copies it over to the output directory every time a file is generated. When the file doesn't contain any equation, though, that CSS is empty! So if that's the case for the last file generated by the build process, its empty CSS file overwrites the correct CSS file and all of a sudden, the equations look terrible. I fixed it by manually adding the "right" CSS in a fixed place.

Futility

So, it's done. I'm pretty happy about how it looks. The entire exercise was entirely futile, of course: it's not like anyone will, y'know, actually read the damn thing. But I'm weirdly glad it exists.

Obviously though, I'm not at all impressed by the road that was needed to get to this point. It's infuriating that doing something like this was so hard. LaTeX is the main way scientific research gets written up. HTML might be the main format used by pretty much everyone on the planet to consume written content. Why is converting one to the other not a solved problem?

Of course, we know why. Incentives in academia are irremediably broken, so we're stuck with old practices, bad formats, a lack of funds for projects that would make everyone's life better, and a structural impossibility to do much about it. My friend a3nm lays out all of these root causes much better than I possibly could, and this LaTeX-to-HTML story is a good illustration. Imagine that we lived in a world where it was trivial to make beautiful web pages out of scientific papers. Wouldn't that encourage more researchers to share their work more widely? Wouldn't that create whole new categories of readership, given that most people consume content on their phone? If HTML was the default format for research, would more people realize how ridiculous it is that paywalled research papers are still a thing in 2021?

Anyway. I'm complaining, but I still want to finish off on a positive note: the people who are actually doing the work of building and maintaining this tooling are heroes. The many bugs and annoyances I complained about should in no way be interpreted as a criticism of the authors of the software. Converting LaTeX to HTML is absurdly hard because LaTeX was never designed for such a thing, because the input language is forever stuck in the 80's, and because the complexity of the package ecosystem is out of control. The more you dive into how these converters work, the more you realize that the fact that they work at all is actually pretty impressive! Massive respect to folks like Michal Hoftich, who are creating software that solves a fundamentally difficult problem and spending massive amounts of time and energy answering people's questions. Genuinely inspiring.

I hope that some day, that kind of work can be properly funded and rewarded. I don't really know how we get there.

Additional thoughts (added in December 2023)

Time has passed since I originally wrote this blog post, and a few things have happened since then.

  • I presented my thesis and this blog post as an exhibit at a workshop called Rethinking ML Papers. I recorded a short talk about it, and a recording is now available on YouTube.
  • Deyan Ginev, one of the maintainers of LaTeXML, reached out to me to tell me that they've landed patches to avoid fatal errors during conversion. Versions of LaTeXML from 2022 onwards now produce a partial output when run on the original sources of my PhD thesis. He's now involved in the ar5iv project, whose goal is to convert all papers on arXiv into HTML. Super cool progress!
  • Brian Dunn, the main author of lwarp, reached out to me to ask for the original sources of my PhD thesis, and to understand more about the problems I originally encountered. He then fixed all the issues in lwarp until my thesis compiled cleanly (on v0.897 and above). How impressive is this! My original blog post said that the folks who build & maintain conversion software are heroes, and I could not have been more right.

This made me want to understand more about the technical complexity of this kind of work, and get a better overview of the different tools that are out there (something I wish I'd done at the beginning of this project rather than at the end). I found some good discussions available online, for example here or there. It made me realize that there are some profound differences with how different tools tackle the problem.

If I had to do this again, I would probably use lwarp. I like its straightforward technical approach: it uses LaTeX itself to parse the source files and directly generate HTML. This means it can't handle unknown LaTeX packages that implement brand new things… but also that it's less likely to lead to super arcane errors, and that adding support for new packages is easier. It also tries to suggest alternative packages when it encounters an unsupported one, allowing users to solve the error at the source. I would also be cautiously optimistic about my ability to patch lwarp itself if necessary.
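
From a skim of the lwarp documentation, the workflow also looks pleasantly boring. A minimal sketch, untested on the thesis itself; real documents will need more package configuration than this.

    % Minimal lwarp document (a sketch based on the lwarp docs, not my thesis):
    \documentclass{book}
    \usepackage{lwarp}  % pulls in the HTML-generation machinery
    \begin{document}
    \chapter{Introduction}
    Hello, web.
    \end{document}
    % Build with lwarp's own tool: `lwarpmk html` for the HTML version,
    % `lwarpmk print` for the usual PDF.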

If you've had some experience doing big LaTeX-to-HTML conversion projects like the one described in this blog, let me know!
