Cobra's bits (Posts about presentations)

latexdiff-vc

Cobra — Mon, 03 Oct 2022 12:47:34 GMT

Since my first post on version control systems (VCS) and latexdiff, the whole kaboodle has become much easier for the user. Instead of looking for some python scripts to ease the generation of diffs, latexdiff itself offers this functionality. For example,

latexdiff-vc --hg --pdf -r 0 example.tex

will create a diff between the original (-r 0) and the current version of example.tex handled by mercurial (--hg), and compile the resulting diff with pdflatex (--pdf).

Even better, latexdiff supports multi-file documents, the use of which is highly advisable for dissertations and books or other long documents. The workflow is simple: as usual, we create the repository and add files to it, but this time all files have to be included:

hg init
hg add master.tex file1.tex ... filei.tex ... filen.tex
hg commit -m "Initial version"

And after editing an arbitrary file:

hg commit -m "Did this or that"
latexdiff-vc --hg --pdf --flatten -r 0 master.tex

Voilà, here's your compiled diff. 😎

As a sidenote: I have seen way too many students trying to compile the active chapter of their dissertation, just to realize that only the master file including their active chapter can be compiled. Hence the classical sequence: compile – oops, switch tabs, compile, switch tabs, edit, compile – oops, etc. etc. ad nauseam. Except for masochistically inclined writers, there's a simpler and better way.

Indeed, many specialized LaTeX editors (TeXWorks, TeXStudio) and also several of the most prominent general programming editors with a LaTeX extension (Sublime Text with LaTeX Tools, VS Code with LaTeX Workshop, Vim with Vimtex ...) respect the following magic command at the top of the active chapter:

% !TeX root = master.tex

Only Emacs with auctex needs a slightly different syntax:

% -*- TeX-master: "master.tex" -*-

After setting this command, you can just type and compile, and your focus will not be distracted any longer by the compilation cycle sketched above.
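To make the mechanism a bit more tangible, here is a minimal Python sketch of how a tool could resolve the magic comment. It is not taken from any of the editors named above; the function name `find_root`, the 20-line scan limit, and the example paths are assumptions made purely for illustration.

```python
import re
from pathlib import Path
from typing import Optional

# Matches lines such as "% !TeX root = master.tex" (case-insensitive).
MAGIC_ROOT = re.compile(r"^\s*%\s*!\s*TeX\s+root\s*=\s*(.+?)\s*$", re.IGNORECASE)


def find_root(chapter_text: str, chapter_path: str, max_lines: int = 20) -> Optional[Path]:
    """Return the master file named by a '!TeX root' comment, if any.

    Only the first `max_lines` lines are scanned (an assumption of this
    sketch), and the root is resolved relative to the chapter's directory.
    """
    for line in chapter_text.splitlines()[:max_lines]:
        match = MAGIC_ROOT.match(line)
        if match:
            return Path(chapter_path).parent / match.group(1)
    return None
```

For a chapter stored as thesis/chapters/intro.tex whose first line is the magic comment shown above, `find_root` returns thesis/chapters/master.tex; a chapter without the comment yields `None`, which is the case where the editor would try to compile the chapter itself.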

Upgrading virtualenvs in fish

Cobra — Wed, 29 Dec 2021 15:55:32 GMT

Python major version upgrades such as the one from 3.9 to 3.10 a few weeks ago require rebuilding any virtual environments created earlier. The generic one-liner I gave in an earlier post works in all shells, but as an avid user of the fish shell, I'm of course employing virtualfish for managing my virtual environments. And upgrading them in fish is even easier than with the one-liner above:

vf upgrade --rebuild

Prior to that, one also needs to rebuild virtualfish itself against the new Python version:

yay --rebuild -S virtualfish

Afterwards, one can see to the update of the content of the virtualenv as documented in my earlier post. Compared to the entire recreation of the virtualenv, this whole procedure is as painless as it is fast – which makes the whole concept of virtualenvs an eminently practical one.

Better annotations

Cobra — Mon, 29 Mar 2021 15:57:52 GMT

A significant part of my daily work consists of critically reading drafts of publications or project proposals. I usually place hand-written comments on a printout of the respective document and discuss them with the author in my office, but that isn't such a good idea in the time of SARS-CoV-2. We hold these discussions now in video meetings, with the document in question being looked at together by sharing someone's screen showing an annotated pdf. Now, I'm using evince to annotate pdfs, and didn't like the fact that all annotations seem to come from ‘Unknown’. In principle, that can be changed by editing the author in the annotation's properties, but I certainly would not have enjoyed doing that for each of the 80+ comments I had made for the present manuscript.

Alas, the official help told me that setting a different default author would not be possible. And that seemed final, since it came from the most authoritative source – the developers themselves. But I finally found a surprisingly simple solution in the place where, at this time, I had expected it least: the ArchWiki. Shouldn't the developers know that evince looks into /etc/passwd? In any case, a simple

usermod -c "Deus ex machina" cobra

ensured that my comments would now be easily distinguishable from those of the other coauthors.
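What usermod -c changes is the fifth, colon-separated comment (GECOS) field of the user's entry in /etc/passwd. The following Python sketch only illustrates where that field sits; it is not evince's actual lookup code, and the home directory and login shell in the usage example are hypothetical values.

```python
from typing import Optional


def gecos_full_name(passwd_line: str) -> Optional[str]:
    """Extract the full name from one /etc/passwd line.

    The fifth colon-separated field is the comment (GECOS) field that
    `usermod -c` rewrites; by convention its first comma-separated
    entry holds the user's full name.
    """
    fields = passwd_line.rstrip("\n").split(":")
    if len(fields) < 7:
        return None  # not a well-formed passwd entry
    full_name = fields[4].split(",")[0].strip()
    return full_name or None
```

With the entry "cobra:x:1000:1000:Deus ex machina:/home/cobra:/usr/bin/fish" the function returns "Deus ex machina". With an empty comment field it returns `None`, which is presumably the situation in which the annotations ended up being attributed to ‘Unknown’.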

What you don't want to use, revisited

Cobra — Sat, 01 Jun 2019 16:58:12 GMT

A decade ago, I advised my readers to stay away from OpenOffice for the preparation of professional presentations, primarily because of the poor support of vector graphics formats at that time. In view of the difficulties we have recently encountered when working with collaborators on the same document with different Office versions, I had now placed great hopes in LibreOffice for the preparation of our next project proposal. First of all, I thought that using platform-independent open source software, it should be straightforward to guarantee that all collaborators are using the same version of the software. Second, the support for SVG has been much improved in recent versions (>6) of LibreOffice, and I believed that we finally should be able to import vector graphics directly from Inkscape into an Office document. Third, the TexMaths extension allows one to use LaTeX for typesetting equations and to insert them as SVG, promising a much improved math rendering at a fraction of the time needed to enter it compared to the native equation editor. Fourth, Mendeley offers a citation plugin for LibreOffice, which I hoped would make the management of the bibliography and inserting citations as simple as with BibTeX in a LaTeX document.

Well, all of these hopes were in vain. What we (I) had chosen for preparing the proposal (the latest LibreOffice, TexMaths extension, and Mendeley plugin) proved to be one of the buggiest software combos of all time.

ad (i): Not the fault of the software, but still kind of sobering: our external collaborator declared that he had never heard about LibreOffice, and that he wouldn't know how to install it. Well, we thought, now only two people have to stay compatible with each other. We installed the same version of LibreOffice (first Still, then Fresh), I on Linux, he on Windows. But the different operating systems probably had little to do with what followed.

ad (ii): I was responsible for all display items in the proposal, and I've used a combination of Mathematica, Python, Gimp, and Inkscape to create the seven figures contained in it. The final SVG, however, was always generated by Inkscape. I've experienced two serious problems with these figures. First, certain line art elements such as arrows were simply not shown in LibreOffice or in PDFs created by it. Second, the figures tended to “disappear”: when trying to move one of them, another would suddenly be invisible. The caption numbering showed that they were still part of the document, and simply inserting them again messed up the numbering. We've managed to find one of these hidden figures in the nowhere between two pages (like being trapped between dimensions 😱), but others stayed mysteriously hidden. We had to go back to the previous version to resolve these issues, and in the end I converted all figures to bitmaps. D'Oh!

ad (iii): I wrote a large part of my text in one session and inserted all symbols and equations using TeXMaths. Worked perfectly, and after saving the document, I went home, quite satisfied with my achievements that day. When I tried to continue the next day, LibreOffice told me the document was corrupted, and was subsequently unable to open it. I finally managed to open it with TextMaker, which didn't complain, but also didn't show any of the equations I had inserted the day before. Well, I saved the document anyway to at least restore the text. Opening the file saved by TextMaker with Writer worked, and even all symbols and equations showed up as SVG graphics, but without the possibility to edit them by TeXMaths.

ad (iv): Since my colleague had previously used the Mendeley plugin for Word, it was he who had the task of inserting our various references (initially about 40). That seemed to work very well, although he found the plugin irritatingly slow (40 references take something like a minute to process). However, when he tried to enter additional references a few days later, Mendeley claimed that the previous ones had been edited manually and displayed a dialogue asking whether we would like to keep this manual edit or disregard it. Regardless of the choice, the previous citations were now generated twice. And with any further citation, twice more, so that after adding three more citations, [1] became [1][1][1][1][1][1][1][1]. The plugin also took proportionally longer for processing the file, so in the last example, it took about 10 min. Well, we went one version back. But what worked so nicely the day before was now inexplicably broken. It turned out that a simple sync of Mendeley (which is carried out automatically when you start this software) can be sufficient for triggering this behavior. We finally inserted the last references manually, overriding and actually irreversibly damaging the links between the citations and the bibliography.

In the final stages, working on the proposal felt like skating on atomically thin ice (Icen 😎). We always expected the worst, and instead of concentrating on the content, we treated the document like a piece of prehistoric art which could be damaged by anything, including just viewing the document on the screen. That feeling was very distracting. I would have loved to correct my position, really, but LibreOffice in its present state is clearly no alternative to LaTeX for preparing the documents and presentations required in my professional environment. I will check again in another ten years. 😉

In principle, I would have no problem with being solely responsible for the document if I could use LaTeX and would get the contribution from the collaborators simply as plain text. It is they who have a problem with that, since they don't know what plain text is. In this context, I increasingly understand the trend to collaborative software: it's not that people really work at the same time, simultaneously, on a document, but it's the fact that people work on it with the guaranteed same software which counts.

A one-liner to upgrade your virtualenvs

Cobra — Sun, 23 Sep 2018 11:02:35 GMT

This blog is powered by Nikola, a Python-based static blog compiler, which I've installed in a Python virtualenv to separate it from the system-wide Python installation of my notebook. Updates of the major version of Python (like from 3.6 to 3.7) inevitably break these virtualenvs, and I have so far accepted that there's no other way to get them back than to rebuild them from scratch. In fact, that's what you get to hear even from experienced Linux developers.

The recent update to Python 3.7 brought that topic back to my attention, and I kind of lost my patience. I just couldn't accept that there shouldn't be a better way, and indeed found a solution for those using the venv module of Python 3.3+:

python -m venv --upgrade <path_of_existing_venv>

Despite the fact that I'm using virtualenv instead of venv, this command worked exactly as I had hoped. ☺
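For those who would rather script this step, the standard library's venv module offers the same switch through its `EnvBuilder` class. The sketch below is only a rough counterpart of the command above: the helper names are made up for illustration, and the `symlinks` setting is my attempt to mirror the command-line default on non-Windows systems.

```python
import os
import venv


def make_upgrade_builder() -> venv.EnvBuilder:
    """Builder roughly corresponding to `python -m venv --upgrade`.

    upgrade=True leaves the environment's contents alone and only
    refreshes it for the Python interpreter running this script.
    """
    # The command-line tool uses symlinks everywhere except on Windows.
    return venv.EnvBuilder(upgrade=True, symlinks=(os.name != "nt"))


def upgrade_venv(env_dir: str) -> None:
    """Upgrade the existing virtual environment at env_dir in place."""
    make_upgrade_builder().create(env_dir)
```

Calling `upgrade_venv` with the path of an existing environment, from the newly installed Python, should have the same effect as the one-liner. I have only the author's report that this also works for environments created by virtualenv, so treat that part as untested here.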

The virtualenv can now be updated as usual. Well, almost – both pip and pip-tools got a lot more conservative and actually have to be told explicitly that they really should upgrade to the latest version. For a particular package, that looks like this:

pip install --upgrade Nikola --upgrade-strategy=eager

A rather weird behavior, if you ask me, but what do I know. ☺

Back to our virtualenv. To really, genuinely and truly update all requirements, the following sequence of commands is necessary:

pip install --upgrade setuptools
pip install --upgrade pip
pip install --upgrade pip-tools
pip-compile --rebuild --upgrade --output-file requirements.txt requirements.in
pip-sync requirements.txt

The update to Nikola 8.0.0 broke my old theme (based on bootstrap2), and it was about time: too many things were not working as desired. I'm now using an essentially unmodified bootblog4, as before with Kreon as the main font, and Muli for the headlines (from 15.12.2018: Oswald for the latter).

The office suites disaster

Cobra — Sun, 15 Jul 2018 10:57:34 GMT

I was recently involved in the preparation of a project proposal, i.e., the attempt to obtain external funding for a certain research project. This particular project is a joint initiative of two academic institutions, and after having agreed that we would like to collaborate, a practical question arose. How should we prepare the documents required for the proposal?

The youngest of us proposed to use Google Docs (“it's so convenient”), but this proposal just received hems and haws from all others. My colleague then remarked, with an apologetic smile, that I would work exclusively with LaTeX. Our benjamin cheerfully chimed in and suggested Authorea. Once again, his proposal was met with limited enthusiasm. I gleefully added CoCalc as another LaTeX-based collaborative cloud service and briefly enjoyed the embarrassed silence that followed. After a few seconds I hastened to add that I, being foreseeably only responsible for a minor part of the proposal, would agree to anything with which the majority would feel comfortable (which, in hindsight, was entirely beside the point).

In the end, we decided to use a Microsoft Sharepoint server run by our partner institution. I didn't connect to it myself, but was told by my colleague that the comments and tracking features didn't work correctly. And so we ended up with good ol' (cough) Microsoft Word as the least common denominator.

Well, I have LibreOffice (LO) on all of my systems, which should do for a simple document like the one we would create. To be on the safe side, I've ordered a license for Softmaker Office (SMO) which is praised for being largely compatible with MS Office (MSO), and which I planned to test anyway in my function as IT strategy officer (muhaha). Besides, in the office I have Microsoft Word 2007 available in a virtual machine running Windows 7. So what could go wrong?

At the very beginning, our document contained only text with minimal formatting, and was displayed with only insignificant differences on MSO and LO. After a few iterations, comments and the track of changes became longer than the actual text. That's when the problems started.

We were in the middle of an intense discussion when I noticed that, from my view, it just didn't seem to make any sense. My coworkers were discussing item (1) of an enumerated list, but their arguments revolved around an entirely different subject. When the discussion moved to item (2), I suddenly understood the origin of this confusion. LO, apparently triggered by a deletion just prior to the list, had removed the first entry, and my list thus only consisted of items (2)–(7), but LO enumerated them as (1)–(6). Gnarf.

Since my SMO order had not yet been approved (it's no problem to order MSO licences for €50,000, but an unknown product for €49.99 stirs up the entire administration), I decided to open and edit the proposal with MSO 2007. Only to discover that my colleagues, using MSO 2013 and 2016, talked about a paragraph that simply didn't exist in my version of the document. The excessive tracking also broke compatibility with MSO 2007. I wasn't too surprised, but at this point I really looked forward to SMO.

When the license finally arrived, we were busy with creating figures. A simple sketch created by our youngster in Powerpoint 2016 looked more like surreal modern art in LO, and not much better in MSO 2007. But in SMO, it was missing altogether. What the hell?

It turned out that the much acclaimed SMO has serious deficiencies with vector graphics, particularly under Linux. This flaw, together with the missing formula editor, makes SMO basically unusable in a scientific environment. That's a pity, since SMO is responsive and has a flexible and intuitive user interface.

Anyway, with the deadline approaching rapidly, my colleagues started to work on the proposal over the weekend at home. It turned out that none of them had MSO at home, so they all used LO. Now everyone's printed version deviated in one way or the other from the master copy on ownCloud, and we had to carefully compare the content, line by line, in the subsequent discussion.

Miserable pathetic fools! Why on earth didn't we agree at the outset of this endeavour to use LO? We could have agreed even on the same exact version to use. Our work on the proposal would have been easier and much more efficient. And instead of creating figures the hard way with Powerpoint, we could have used a full-scale vector graphics suite such as Inkscape, since recent versions of LO support the svg format. Equations could have been handled with TeXMaths, which allows users to insert any kind of math losslessly in an LO document. Here's a slide composed exclusively of vector graphic elements:

No, I did not suddenly become an enthusiastic user and advocate of LO. On the contrary, I still firmly believe that for the task at hand, LaTeX would have been the most appropriate and convenient solution. However, LO would have obviously been a much more rational choice than MSO.

When I said that aloud, my colleagues gave me this 'oh-my-god-now-he's-going-mad' look. Normal people accept LO for personal use, but not for a professional one. Why? Well, they have this deeply internalized belief that what you get is what you pay for. Free software is for amateurs and nerds, but as a professional, one uses professional software! Oh wait ...

Xpdf (4!)

Cobra — Sun, 17 Sep 2017 11:44:17 GMT

Desktop publishing (DTP) was initiated by two ground-breaking developments of Adobe. They first established PostScript in 1984, which, after being quickly adopted by Apple in 1985 in their first laser printer, became the de-facto standard in the DTP world for a long time to come. Second, they developed the portable document format (pdf) in 1993, which is now not only dominating DTP, but also all electronic publishing activities.

I don't remember when pdf became relevant for me. For publishing, most journals still prefer figures in eps format, although some accept pdf as well. I also don't remember whether Xpdf or gv was serving as my pdf reader in the 1990s. In any case, Xpdf was (as far as I know) the first dedicated pdf reader for Linux, and came with a Motif interface popular at that time (after all, the popular Unix desktop CDE used Motif!). This archaic interface hasn't changed since 1995, and is certainly one of the main reasons why nobody uses Xpdf any more.

Well, I do, but only for a single purpose: I use Xpdf to extract vector graphics from pdf files. A few days ago, I planned to do exactly that, opened the paper with Xpdf and ... but wait a second, that's not Xpdf!

And yet, the window title says Xpdf. What's going on?

A new logo, and a new toolkit. Had I been asked, I would have bet anything on this never happening. Fortunately, nobody asked...

In any case, it's nice. But does my script work? Of course not. At least nothing happens when I hit Ctrl-e. Starting Xpdf from a terminal shows that the script is started all right, but the filename is put in additional quotation marks. Ha, that's easy: in ~/.xpdfrc, the line

bind ctrl-e any "run(pdfsnap '%f' %p %x %y %X %Y)"

just has to read

bind ctrl-e any "run(pdfsnap %f %p %x %y %X %Y)"

and it works!

By the way: this update illustrates the difference between Archlinux and Debian Sid very nicely. For both systems, the update came on almost the same day, but had different content: 4.00 for Arch and 3.04-4+b1 for Debian. Sid is not vicious, but a snail. 😉

Better graphic formats

Cobra — Mon, 26 Dec 2016 14:08:43 GMT

The most frequently used (and abused) raster image format—JPEG—recently celebrated its 25th anniversary. Its cousins are mostly even older: TIFF stems from 1986, GIF from 1987, and only PNG, the latter's intended replacement, was developed a few years later, namely, in 1995.

What kind of computer did I have in 1995? A Pentium 90 with 16 MB RAM and a 512 MB HDD. And that's what these formats were designed for. Today, 20 years later, we enjoy a factor of about 1000 with regard to CPU speed, memory, and storage size, but despite this enormous difference, our image file formats have so far remained the same.

Several new formats have been proposed in the past few years, such as Google's WEBP in 2010, BPG (better portable graphics), which is essentially owned by the MPEGLA, in 2014, and FLIF (free lossless image format) in 2015. Only WEBP is supported to a degree that allows one to actually use it, while BPG and FLIF are essentially still on the level of technology demonstrations.

This page offers a most illustrative comparison between the different lossy image formats, among them JPEG and its intended successors as well as BPG and WEBP. There's absolutely no question about the winner. Just look at Tennis or Steinway, for Pete's sake. No question, were it not for the sodding patents. sigh

But forget the patents for the moment, let's rather look at something interesting. In this post, I look at these new image formats from a different perspective. How well can they compress an essentially black-and-white line art?

Not that one should ever even consider doing that. Line art should always be stored as vector graphics, that much is obvious to anyone with even the faintest knowledge of graphic formats. Even a few scientific publishers know that. In the author guide to Nature Communications, for example, we find the statement:

All line art, graphs, charts and schematics should be supplied in vector format [...].

The author guides of most other publishers lack such explicit statements and rather breathe the spirit of the 1990s. For example, in an Elsevier FAQ we can read:

Why don't you accept PNG files?

We will constantly review technological developments in the graphics industry including emerging file formats - new recommended formats will be introduced where appropriate. PNG files do not cause issues in processing, but our submission systems are in progress of updating to allow for this useful new format.

In practice, however, most publishers have no problem with accepting vector graphics in EPS or PDF format and, most importantly, also use it 1:1 for the final publication. With one prominent exception: the American Chemical Society (ACS). Vector graphics submitted to any of the numerous ACS journals are invariably converted to a raster image. Some of their author guides even include a corresponding note:

NOTE: While EPS files are accepted, the vectorbased graphics will be rasterized for production.

Regarding the format and resolution of this raster graphics, we find the following exemplary recommendation in this guide:

Figures containing photographic images must be at least 300 dpi tif files in CMYK format; line art should be at least 1200 dpi eps files.

To specify a resolution for EPS files demonstrates a complete lack of understanding of vector graphics. And in the same spirit, we read:

Cover images should be 21.5 cm in width and 28 cm in height, with a resolution of 300 dpi at this size (this should be a file of at least 8 MB).

Oh, we cannot even handle compressed TIFFs? How wonderful to work with professionals.
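The "at least 8 MB" in the guide can be checked with a little arithmetic. The sketch below computes the uncompressed size of a raster image from its physical dimensions and resolution; the function name and the choice of bytes per pixel are mine, not the guide's.

```python
CM_PER_INCH = 2.54


def uncompressed_bytes(width_cm: float, height_cm: float, dpi: int,
                       bytes_per_pixel: int) -> int:
    """Size in bytes of an uncompressed raster image of the given print size."""
    width_px = round(width_cm / CM_PER_INCH * dpi)
    height_px = round(height_cm / CM_PER_INCH * dpi)
    return width_px * height_px * bytes_per_pixel
```

For the cover format, `uncompressed_bytes(21.5, 28, 300, 1)` gives 8396473 bytes, i.e. about 8.4 MB for a single 8-bit channel. With three bytes per pixel (RGB) the result is 25189419 bytes, with four (CMYK) 33585892 bytes. The guide's threshold thus lies close to the uncompressed single-channel figure, which fits the suspicion that compressed files are not expected.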

Perhaps as a direct consequence of the resulting size of 1200 dpi bitmaps, I have never seen any figure in an ACS journal whose resolution would have exceeded 300 dpi. At least these figures are compressed, contrary to the implicit recommendation in the author guide. Depending on the preference of the technical staff at the respective ACS journal, the figures are included in the manuscript either as overcompressed JPEGs, exhibiting plainly visible compression artefacts, or as insufficiently compressed PNG files.

Insufficiently compressed? Yes: in contrast to JPEG, PNG employs lossless compression, and one can and should thus always employ the maximum compression level (9). Not doing so only increases the file size. The technical staff at ACS typically invokes only the minimum compression level 1. Furthermore, the file format is invariably 8 bit/color RGB, even for black and white line art. As a result, the 692 kB of a 295 dpi figure (extracted as described here) in one of my recent ACS publications could have been easily reduced to 138 kB. Or, alternatively, one could have produced a 1200 dpi version with a file size of only 787 kB, barely larger than that included in the galley proofs.
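PNG compresses its pixel data with deflate, so the point about compression levels can be tried out with Python's zlib module alone. The following is a sketch on a synthetic black-and-white grid, not on the ACS figure, and it skips the row filters a real PNG encoder applies; the image generator and its dimensions are assumptions made for illustration.

```python
import zlib


def synthetic_line_art(width: int = 1200, height: int = 400) -> bytes:
    """8-bit grayscale test image: white background with a black grid."""
    rows = []
    for y in range(height):
        if y % 50 == 0:
            rows.append(bytes(width))  # horizontal black line
        else:
            row = bytearray(b"\xff" * width)
            row[::100] = bytes(len(row[::100]))  # vertical black lines
            rows.append(bytes(row))
    return b"".join(rows)


def deflate_sizes(raw: bytes) -> dict:
    """Compressed size in bytes for a few zlib compression levels."""
    return {level: len(zlib.compress(raw, level)) for level in (1, 6, 9)}
```

Whatever level is chosen, `zlib.decompress` returns exactly the original bytes, so a higher level costs only CPU time and never image quality. Calling `deflate_sizes(synthetic_line_art())` shows how the size varies with the level for such data; level 9 is typically the smallest, although the margin depends on the image.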

And for all this “professional” service, we even pay handsomely. Why, then, do we publish there at all? Because of the impact factor, of course. I'll write more about this much too powerful incentive in the near future.

But let's come back now to the actual topic of this post, and consider the following grayscale line art that has been created with the help of graph and inkscape:

The original SVGZ has 21.6 kB, a PDF saved by inkscape 52 kB. Now let's see what happens if we convert this PDF into various raster image formats with a resolution of 1200 dpi.

PNG:

The obvious choice of format is PNG. We can convert the SVGZ or the PDF in various ways. We could export a PNG directly from inkscape, of course. Alternatively, we could open the PDF by gimp and export it as PNG. Both are viable ways, but the CLI is actually more flexible and powerful. So let's open a terminal and enter

pdftocairo -png -scale-to-x 4000 -scale-to-y -1 -gray -antialias gray valence_bands.pdf valence_bands_cairo.png

That would be my usual way. Results in a nice grayscale png with 356 kB.

Another possibility is

convert -verbose -density 483.87 valence_bands.pdf -depth 8 valence_bands_convert.png

Equivalent to '-depth 8' is '-colorspace gray' (in this particular case). In any case, we get a file with 330 kB. Can we do better? Oh yes, by tuning the PNG compression parameters:

convert -density 483.87 valence_bands.pdf -colorspace gray -define png:compression-filter=1 -define png:compression-level=9 -define png:compression-strategy=1 def.png

300 kB! For the parameters, see here.
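How the numbers in these commands relate to the 1200 dpi mentioned above is not spelled out, so here is one possible reading as a small Python sketch. It assumes that both commands aim at an image about 4000 pixels wide and that the PDF page is 595.2 pt wide (roughly A4); that page width is my guess, chosen because it reproduces the density value, and is not stated anywhere in the post.

```python
POINTS_PER_INCH = 72.0
CM_PER_INCH = 2.54


def density_for_width(target_px: int, page_width_pt: float) -> float:
    """Rendering density (dpi) at which a page of the given width in
    points comes out target_px pixels wide."""
    return target_px * POINTS_PER_INCH / page_width_pt


def printed_width_cm(width_px: int, dpi: float) -> float:
    """Physical width of width_px pixels when printed at the given dpi."""
    return width_px / dpi * CM_PER_INCH
```

Under these assumptions, `density_for_width(4000, 595.2)` rounds to 483.87, the value passed to -density, and `printed_width_cm(4000, 1200)` rounds to 8.47. In other words, 4000 pixels correspond to 1200 dpi when the figure is printed about 8.5 cm wide, a common single-column width.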

Now, that seems to be a fairly optimized PNG, but it is still almost six times larger than its predecessor, the PDF. Time for the PNG optimizers! Let's apply them to the smallest PNG we have obtained so far, the one with 300 kB.

optipng

optipng def.png -out opti.png

225 kB.

pngquant

pngquant def.png

In contrast to the other optimizers, pngquant converts to a color palette! But with unexpected success:

220 kB.

pngout

pngout def.png out.png

189 kB. Needs ages. But it's the tool of the duke.

zopflipng

zopflipng def.png zopfli.png

190 kB. Google vs Ken Silverman: 0:1!

That's about the limit for PNG.

Let's check other lossless formats.

TIFF:

convert -verbose -density 483.87 valence_bands.pdf -depth 8 -flatten -compress lzma valence_bands.tiff

188 kB. Surprise, surprise: basically equal in size to the smallest PNG.

WEBP:

convert def.png -define webp:lossless=true def.webp

159 kB! Not bad at all.

BPG:

bpgenc -lossless def.png -o def.bpg

387 kB. Not a format for lossless compression.

FLIF:

flif def.png def.flif

92 kB. Now that's a statement!

But still way larger than the PDF. Is there perhaps a lossy algorithm capable of creating a 1200 dpi image smaller in file size than the PDF? Note that the present graphics with its hard contrasts is a worst case scenario for JPEG and, I presume, for essentially all lossy image formats.

JPEG (libjpeg-turbo)

convert def.png -flatten -quality 1 def_default.jpeg

165 kB. Hardly smaller than the lossless variants and with the characteristic ringing and quilting artefacts surrounding every edge and corner (see below).

JPEG (mozjpeg)

convert def.png -flatten -quality 1 def_moz.jpeg

83 kB. Better than the default above, but still larger than the PDF. The compression artefacts are different from those of the default JPEG implementation, but the image is still of terrible quality (see below).

WEBP:

convert def.png def_lossy.webp

203 kB. Worse than lossless (but I didn't explore the various parameters convert offers for WEBP).

BPG:

convert def.png -flatten def_spec.png
bpgenc -q 44 def_spec.png -o def_lossy.bpg

50 kB. I had to preprocess the image since I needed a screenshot of the final BPG for the comparison below. The result is indeed smaller than the PDF, and exhibits (compared to the JPEG) only moderate compression artefacts (see below). Very impressive.

Here's a comparison of a section of the above graphics.

BPG is certainly a major improvement over JPEG also for line art. However, nothing beats vector formats: the PDF is of similar size and is arbitrarily scalable. A version for an A0 poster would still be 54 kB in size, whereas a corresponding BPG of the same quality as shown above would be truly gigantic.
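To put the individual results side by side, the sizes reported above can be collected and ranked relative to the 52 kB PDF. The numbers are those given in this post; the labels and the helper function are only for illustration.

```python
PDF_KB = 52.0  # PDF saved by inkscape, as reported in the post

# File sizes in kB of the 1200 dpi raster versions, as reported in the post.
SIZES_KB = {
    "PNG (pdftocairo)": 356,
    "PNG (convert)": 330,
    "PNG (convert, tuned)": 300,
    "PNG (optipng)": 225,
    "PNG (pngquant, palette)": 220,
    "PNG (zopflipng)": 190,
    "PNG (pngout)": 189,
    "TIFF (lzma)": 188,
    "WEBP (lossless)": 159,
    "BPG (lossless)": 387,
    "FLIF": 92,
    "JPEG (libjpeg-turbo)": 165,
    "JPEG (mozjpeg)": 83,
    "WEBP (lossy)": 203,
    "BPG (lossy, -q 44)": 50,
}


def ranked(sizes: dict, reference_kb: float) -> list:
    """Return (name, size, size/reference) tuples, smallest file first."""
    return [(name, kb, round(kb / reference_kb, 2))
            for name, kb in sorted(sizes.items(), key=lambda item: item[1])]
```

`ranked(SIZES_KB, PDF_KB)` puts the lossy BPG first at 0.96 times the PDF size, the only raster version below the reference, followed by mozjpeg and FLIF. The tuned PNG comes out at 5.77 times the PDF, the "almost six times" mentioned earlier.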

An ideal strategy for scientific artwork would look like this: line art, labels, and annotations as vector graphics (SVG or PDF), photography as BPG, stored together in a PDF or SVGZ container. That's imagery for the 21st century. And, in case you didn't notice, I didn't find any reason to mention WEBP or FLIF. For either of them, there's always a better alternative. If we disregard the patents. 😉

LaTeX vs. Unicode

Cobra — Sun, 13 Nov 2016 10:03:52 GMT

I'm using matplotlib to create figures for my publications. For axes labels, legends, and everything else requiring text and symbols in a figure, I've so far used the excellent LaTeX support of matplotlib, and the results are (obviously) highly satisfactory:

There's a disadvantage, though: there are not too many fonts to choose from. Naively, I thought that this limitation would be lifted if I used Unicode instead of LaTeX:

And wouldn't XeLaTeX even combine the advantages of both?

As you can see, matplotlib allows you to use any of these options, but what you don't see is that the desired results can be achieved only with a very limited set of fonts. For example, there are only a few fonts that include the unicode character for a 'superscript minus' (for an overview, see here). Sadly, most of these are part of the ClearType Font Collection, which was introduced by Microsoft with Windows Vista. Free fonts with a 'superscript minus' include Dejavu Sans, Free Sans, and Free Serif. If the 'superscript minus' is included instead as a command by employing the internal LaTeX support of matplotlib, many more fonts become accessible. Examples are shown in the table below. But even then one can't make any assumptions: while Source Sans Pro works fine, Source Serif Pro doesn't. I have no idea why.

You see from my last statement that this post is not in the least authoritative. I'm just toddling around, and if you find a better way, I'd appreciate corrections and additions. That's particularly true for the case of XeLaTeX, the use of which seems to require OTF-only fonts with math table support. I wasn't even able to find a single Sans Serif font with this profile 😞. Others have similar problems.

Renderer	Serif	Sans Serif
LaTeX	Palatino, Fourier	Kurier, CM Bright
Unicode	Noto, Gentium Plus	Open Sans, Source Sans Pro
XeLaTeX	Libertinus, XITS	?

Finally, here's an archive containing the three scripts I've used to create the figures above. In each case, I let matplotlib render a pdf, convert that into an svg with pdftocairo, and compress the svg file with gzip:

./plot_uc.py
pdftocairo -svg plot_uc.pdf plot_uc.svg
gzip -S z plot_uc.svg

The results are compressed scalable vector graphics that are fully compatible with inkscape if post-processing should be necessary. That's how I got the unicode logo in, by the way. 😉
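Since the plotting scripts are Python anyway, the gzip step could also be done from within Python. This is a sketch using only the standard library, not part of the archived scripts; the function name and the `keep_original` switch are assumptions.

```python
import gzip
import shutil
from pathlib import Path


def compress_to_svgz(svg_path: str, keep_original: bool = True) -> Path:
    """Gzip an SVG file to the same name with 'z' appended (.svg -> .svgz)."""
    source = Path(svg_path)
    target = source.with_name(source.name + "z")
    with source.open("rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    if not keep_original:
        source.unlink()  # the gzip command line removes its input as well
    return target
```

`compress_to_svgz("plot_uc.svg")` writes plot_uc.svgz next to the input. Unlike the gzip call above, the sketch keeps the uncompressed file unless `keep_original=False` is passed.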

No magic

Cobra — Sun, 06 Nov 2016 13:52:11 GMT

In times of rain, fog and drizzles, I take the U-Bahn for commuting to and from work. I'm not as regular as your proverbial clockwork, but still punctual enough to see certain co-commuters on an almost daily basis. Some of them are individual enough to stick out of the crowd. Sir David, for example, a long thin figure with short grey hair and a short, accurately trimmed grey beard, invariably dressed in a checkered grey suit with a light grey shirt and a gunmetal grey tie, dark grey suede shoes and charcoal grey socks. Even his beyerdynamic headphones, which he's sure to wear, are grey, and while listening to Vivaldi, he's studying the daily Frankfurter Allgemeine with great interest and concentration.

So intense is his concentration that he does not even notice Mike sitting next to him with a battered Nokia cell phone and talking very loudly, as every day, in an unidentifiable language with what I assume to be one of his subordinates. Mike is ebony black, 6'4'', 150+ kg, wears a gold Rolex and other golden accessories over a black Savile Row suit but still manages to look like a very serious and very worried businessman. When the U2 goes underground at Potsdamer Platz and his phone loses the connection (as every day), his worries grow troublesome for his health, as accentuated by repeated violent bursts of shouting at his unfortunate subordinate who, luckily for him or her, can't hear the verbal assault because of the severed connection.

Despite this acoustic disturbance, Sir David remains concentrated on his newspaper and entirely misses Audrey stepping into the train. Audrey is a seriously cute American girl in her twenties with dazzling blue eyes and jet black hair she's wearing as an asymmetric bob. Unlike many Americans, she has a soft, melodious voice which is as pleasant to listen to as she's a pleasure to look at, despite her constant fumbling with a white Apple Watch. Today, however, she's discussing a technical subject with one of her colleagues from the advertising agency she's working for. The guy is the living prototype of a Berlin-Mitte hipster with the trade-mark combination of an undercut leaving a 2 cm ponytail stub at the back and a beard of Abrahamian dimensions in front, together with the indispensable oversized horn-rimmed black glasses fulfilling no medical function but representing a fashion statement.

He seems to believe, like many non-native speakers, that an excessive use of the F-word documents an intimate familiarity with both the American language and the American culture, and thus demonstrates that you are on the level. Know what I'm sayin'? Fuck, eh. Right now, he keeps whining, in a high wheezy voice not fitting the beard, that his brochure wasn't accepted because the fucking pdf was too fucking big, fuck eh. Know what I fucking mean, eh? Audrey knows, and contrary to what I would have expected, she's reacting in a totally enthusiastic way. Oh so true, she cheers, and adds immediately that she's recently found the solution for big pdfs, which seem to be a major, if not the major, problem troubling her agency. A wonderful, an absolutely fabulous web service! Her original pdf of 500 MB reduced to 10 MB without any loss of quality! Pure magic! But ... who ... ?, her hipster colleague manages to ask, and she's shouting at him, full of delight: ILOVEPDF DOT COM!

Two days later one of my colleagues (with a PhD in physics) tells me that he used ilovepdf.com to compress the pdf of his recent publication, so that it's of a size suitable for uploading to arXiv. He shows me the result, and I'm impressed: there are no immediately obvious compression artifacts, although the file size has been reduced from more than eight to just one MB. Now I'm really getting curious. Is it possible that these ilovepdf guys are doing something ... clever? Perhaps they employ one of the new image formats, such as webp, bpg, or even flif? That would be most interesting, and I thus set out to get to the core of this business.

Several web services promise to compress images (or, more precisely, pixel graphics or, synonymously, raster image files, or, for short, bitmaps or pixmaps) of various formats or entire pdf documents (where the compression, of course, boils down to exactly the same thing: compressing the pixmaps embedded in the pdf container). All of them also promise to respect our privacy. For example, ilovepdf.com (and iloveimg.com) states:

Absolutely all uploaded files on ilovepdf.com are deleted from our servers one, two or twenty-four hours (depending on if a user is non registered, registered or premium) after been processed.

Hmmm. Why only the uploaded files? What about the compressed ones? I like the statement of smallpdf.com better:

Please note that uploaded and processed files are never stored longer than an hour on our servers and then are deleted permanently. During this hour your files are not accessed, copied, analyzed or anything else except we have the explicit permission of the user for example for a support case.

“Analyzed” is the key term here. Even if the uploaded and processed files are deleted, it takes only fractions of a second to extract the text of pdf documents, or the raw pixmaps embedded in them:

        less upload.pdf > text.txt
        mutool extract upload.pdf

On tinypng.com, we only read:

Submitted content will not be shared with third parties other than Voormedia’s service providers, unless required to comply with the law or requests of governmental entities. Voormedia uses service providers based in the USA.

For the present use case (getting a file size acceptable for arXiv), all of that doesn't seem to matter. After all, our intention is to publish our content, not to keep it confidential. The same goes for Audrey and her agency. Still ... if you can do it yourself, why should you become dependent on others? And after you have made yourself dependent on this service, what will you do when it really matters?

But can we do it ourselves? Are the results of these services within our reach, or are their makers truly magicians with capabilities beyond the John and Susan Does of the interwebs? To narrow it down to the point which matters most for me: can these services, given a file that I deem to be suitable for publication, significantly compress it further? After all, I know how to treat my images, don't I? Well, at least I believe I do.

I've made a comparison using several pdf documents (including the publication of my colleague above) as well as a few raster images. To my simultaneous relief and disappointment, the most frequent statement I got from the web services under consideration included:

We are sorry, your file is very well compressed and we can't compress it without quality loss.

Or, as honest as cute:

We compressed your file from 30.36 kB to 29.57 kB. That's not that much. Sorry.

No magic, no new formats. What a pity! At the same time, I was impressed by the technical quality of these services. All documents and images returned from them were an excellent compromise between image quality and size. Furthermore, I actually never managed to produce an image of equal quality but smaller file size. But I was always close with very little effort.

To give explicit examples: the publication of my colleague was 8.3 MB in size. The sole reason for the large size was a couple of images embedded as uncompressed tiff. I would normally compress these images prior to generating the pdf with pdfLaTeX, of course, but we can equally well compress them afterwards. Our two services return files of 0.95 (ilove) and 0.86 MB (small). I have to magnify them five- or even tenfold to see the effects of compression. Yes, these services use lossy compression schemes, specifically jpeg, but they do that expertly.

What could John Doe do against specialized services jam-packed with expertise and knowledge on image compression? Well, I just tried to apply the little I know, like this (well-known) ghostscript one-liner:

            gs -sDEVICE=pdfwrite -dPDFSETTINGS=/default -dNOPAUSE -dQUIET \
               -dBATCH -sOutputFile=out.pdf in.pdf

resulting in

                Tool                Size        Reduction
                gs (default)        1.41 MB     83%
                gs (ebook)          1.05 MB     87%
                ilovepdf            0.95 MB     89%
                gs (screen)         0.94 MB     89%
                smallpdf            0.86 MB     90%
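
The percentages are simply the size reduction relative to the original 8.3 MB. For the record, a quick check of that arithmetic in Python (the function name is made up for this sketch):

```python
ORIGINAL_MB = 8.3  # size of the original publication, as stated in the text

# Output sizes in MB, copied from the comparison above.
results = {
    "gs (default)": 1.41,
    "gs (ebook)": 1.05,
    "ilovepdf": 0.95,
    "gs (screen)": 0.94,
    "smallpdf": 0.86,
}


def reduction_percent(compressed_mb, original_mb=ORIGINAL_MB):
    """Size reduction relative to the original file, in percent."""
    return 100.0 * (1.0 - compressed_mb / original_mb)


for name, size in results.items():
    print(f"{name:<14}{size:5.2f} MB {reduction_percent(size):4.0f}%")
```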

With the default setting, image quality is basically indistinguishable from the original and thus even better than the files produced by ilove and small. The more aggressive ghostscript presets (ebook and screen), whose compression ratios rival those of ilove and small, produce visible artifacts in this particular case, but are always worth a glance when trying to compress a pdf.
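
If you want to give all presets that glance in one go, a small wrapper around the one-liner above saves some typing. The following is only a sketch: the helper names (gs_command, compress_with_presets) and the naming scheme for the output files are made up for illustration, and it assumes that gs is available on the PATH.

```python
import os
import subprocess

# The presets compared above; ghostscript knows a few more.
PRESETS = ("default", "ebook", "screen")


def gs_command(src, dst, preset):
    """Build the ghostscript call from the one-liner for a given preset."""
    return [
        "gs",
        "-sDEVICE=pdfwrite",
        f"-dPDFSETTINGS=/{preset}",
        "-dNOPAUSE",
        "-dQUIET",
        "-dBATCH",
        f"-sOutputFile={dst}",
        src,
    ]


def compress_with_presets(src, presets=PRESETS):
    """Run ghostscript once per preset and return {preset: size in bytes}."""
    stem = os.path.splitext(src)[0]
    sizes = {}
    for preset in presets:
        dst = f"{stem}_{preset}.pdf"
        # check=True raises an exception if ghostscript reports an error.
        subprocess.run(gs_command(src, dst, preset), check=True)
        sizes[preset] = os.path.getsize(dst)
    return sizes
```

Calling compress_with_presets("in.pdf") would leave in_default.pdf, in_ebook.pdf, and in_screen.pdf next to the original. The numbers only tell you the file sizes, of course; checking for visual artifacts is still up to you.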

The conclusion of this exercise is obvious, I believe. Pixmaps for inclusion in documents should be compressed, most preferably losslessly. At the moment of writing, the most suitable format for images in my trade is png. These images can be compressed further (but lossily!) by reducing the range of colors (with pngquant, for example, or simply by converting them to greyscale). For photographs with soft contrasts and 16 M colors, jpeg remains the reigning format, but we have to be very careful not to reduce the image quality in an obvious way. And if you forgot all that and are facing a giant pdf that cannot be uploaded anywhere, use ghostscript.
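
For the color reduction step, a call to pngquant might look like the sketch below. The choice of 64 colors is an arbitrary placeholder, the helper names are made up, and the whole thing assumes that pngquant is installed and on the PATH.

```python
import subprocess


def pngquant_command(src, dst, ncolors=64):
    """Build a pngquant call that reduces src to at most ncolors colors."""
    # --force overwrites an existing dst, --output names the result file.
    return ["pngquant", "--force", "--output", dst, str(ncolors), "--", src]


def quantize(src, dst, ncolors=64):
    """Write a palette-reduced (lossy!) copy of src to dst."""
    subprocess.run(pngquant_command(src, dst, ncolors), check=True)
```

How few colors an image tolerates depends on its content, so it's worth comparing the result with the original before throwing the latter away.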

That's it. That's all there is to know (at least concerning the current topic). I do plan to look at new image formats, but those will be the subject of future posts. 😉