[wlug] Microsoft Screws Up Linux

17 Jul 2022

Remember that version 1 of WSL (“Windows Subsystem for Linux”) tried
to emulate Linux APIs on top of the Windows kernel, but that turned out
to be impossible to get right. (Why not? Windows kernel not functional
enough?) So WSL 2 abandoned this approach and instead brought in a
true-blue Linux kernel running under a cut-down version of Microsoft’s
Hyper-V virtualization system.

Problem solved, right? You’d think having an actual Linux kernel would
be sufficient to implement actual Linux APIs. And so Microsoft can
offer regular Linux distros like Ubuntu to run on Windows, under WSL.

But it turns out WSL actually relies on a specially-patched Microsoft
version of the Linux kernel. So you can’t just take any old Linux
distro and install it. Which is why you only have the option of distros
specially packaged for the Microsoft Store. Furthermore, this special
WSL kernel has some bugs in it that a regular Linux kernel does not.

I discovered this while browsing through the open-issues list for
libffi <https://github.com/libffi/libffi>. This library offers a
“foreign-function interface”, which high-level languages can use to
call functions written in C. For example, this library is the basis of
Python’s ever-handy ctypes module.

Look at this issue <https://github.com/libffi/libffi/issues/552>,
where the script for building libffi fails under WSL. One
participant has narrowed the problem down to this:

    The following shell script which performs the issued part in a
    generated configure file is as follows:

    #!/bin/bash

    exec 5>>my.log
    {
        touch my.log
        mkdir -p workdir
        cd workdir
        test -f ../my.log && mv ../my.log .
        stat my.log
    }

    On a Linux system single partition, the stat call will show
    expected data. In a WSL environment on an NTFS mount, the call will
    fail.

The only real subtlety is that the script holds my.log open for
appending (on file descriptor 5) while the code inside the block moves
that file to a different directory. But all this has perfectly-defined
behaviour according to the POSIX spec. This applies even if the
destination directory is on a separate filesystem, so that the “mv”
command has to turn the rename into a copy-then-delete.
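
As an illustration of the descriptor point (this is not taken from the
libffi issue), here is a minimal sketch. The function name
demo_rename_keeps_fd and the scratch directory are made up for the
example, and it only covers the simple case where source and
destination are on the same filesystem, so that “mv” is a plain rename.
In that case, anything written through the still-open descriptor after
the move should end up in the moved file.

```shell
#!/bin/bash
# Sketch: a descriptor opened before a rename keeps referring to the
# same file afterwards. demo_rename_keeps_fd is a hypothetical name.

demo_rename_keeps_fd() {
    local tmp
    tmp=$(mktemp -d) || return 1
    (
        cd "$tmp" || exit 1
        exec 5>>my.log          # open my.log for appending on descriptor 5
        echo "before move" >&5
        mkdir -p workdir
        mv my.log workdir/      # rename while descriptor 5 is still open
        echo "after move" >&5   # written through the old descriptor
        exec 5>&-               # close descriptor 5
        cat workdir/my.log      # show what ended up in the moved file
    )
    local status=$?
    rm -rf "$tmp"               # clean up the scratch directory
    return "$status"
}

demo_rename_keeps_fd
```

On a conforming system with both directories on one filesystem, this
should print both lines. If workdir were on a different filesystem, the
late write would go to the unlinked original rather than to the copy,
which is still defined behaviour, just a different outcome.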

I tried that script on my genuine Linux system, with and without
“workdir” being a separate filesystem. I even tried formatting the
separate filesystem as an NTFS volume. The code worked fine in all
cases.
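
For anyone wanting to repeat that experiment, the quoted script can be
wrapped in a small harness along these lines. This is only a sketch:
the name run_repro and its two optional arguments (absolute paths for
the directory holding my.log and for the work directory) are
assumptions introduced here, not part of the original report. Pointing
the second argument at a directory on another mounted filesystem
exercises the copy-then-delete path.

```shell
#!/bin/bash
# Hypothetical harness around the quoted reproduction.
# usage: run_repro [LOGDIR [WORKDIR]]   (absolute paths)
# Prints PASS if the final stat succeeds, FAIL otherwise.

run_repro() {
    local logdir=$1 workdir=$2 scratch=""
    if [ -z "$logdir" ]; then
        logdir=$(mktemp -d) || return 2   # no LOGDIR given: use a scratch dir
        scratch=$logdir
    fi
    [ -n "$workdir" ] || workdir=$logdir/workdir
    (
        cd "$logdir" || exit 2
        exec 5>>my.log                    # keep my.log open, as configure does
        {
            touch my.log
            mkdir -p "$workdir"
            cd "$workdir" || exit 2
            test -f "$logdir/my.log" && mv "$logdir/my.log" .
            stat my.log >/dev/null 2>&1   # the step reported to fail under WSL
        }
    )
    local status=$?
    [ -z "$scratch" ] || rm -rf "$scratch"  # only remove what we created
    if [ "$status" -eq 0 ]; then echo "PASS"; else echo "FAIL"; fi
    return "$status"
}

run_repro
```

Called with no arguments, everything happens inside one temporary
directory. Something like run_repro /tmp/logs /mnt/other/workdir (both
paths hypothetical) would test the two-filesystem case.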

So what’s going wrong? It could be the age-old Windows bugbear, where
you cannot access a file that has already been opened somewhere else,
because it is “locked”. If so, Microsoft has somehow transplanted this
limitation into its own version of the Linux kernel.

This may be a “feature” under Windows, but it does not conform to
POSIX filesystem semantics!

Lawrence D'Oliveiro