Wednesday, March 21, 2018

Splice up your life

While I'm no longer an Oracle employee, there are still a few projects that landed in Solaris 11.4 that I'd like to talk about. The one that has occupied most of my last few years is definitely Ksplice on Solaris. Back in 2011, Oracle bought Ksplice, a company that provided runtime patching to the Linux kernel. Ksplice on Linux, today, is many things:

  • for customers, it's a service that provides updates to both the kernel and userland libraries for Linux CVEs.
  • for administrators, a set of interfaces to install, revert and manage splices, rooted in the uptrack tool.
  • for patch developers, a set of tools that allow (semi)automatic extraction and generation of splices from code changes.
  • for Ksplice developers, the code that makes all this possible, shared between the kernel framework that handles splices, the tools that generate them and the userland infrastructure.
The last two bullets happen behind Oracle walls and generate what customers and administrators ultimately see. Ksplice on Solaris is, today, in a different situation. If you sift through the 11.4 packages you'll find a kernel module (/kernel/drv/$arch/ksplice), a userland tool (spliceadm, for which there is a publicly accessible man page) and, if you look a bit further under the covers, a new SMF service (svc:/system/splice:default). On top of that, platinum customers have had a chance to experience the framework at play, receiving a couple of test splices for bugs encountered during the program. In a nutshell, the strictly technical foundations for generating runtime kernel patches in Solaris are there, but nothing is set in stone yet (and I personally don't know) about what the service will look like.

In this blog post, I'll walk through the technical side of Ksplice on Solaris and its evolution from the initial "hey, we should probably have this, too" conversation with Jan, through the legal evaluation to make sure that we were doing all the right things (necessary disclaimer: we did!), to what is in the repository today.

Why Ksplice


Before delving into the technical details, a small digression on why we embarked on the whole effort. Patching is a key step of every deployment/security strategy and one of those that rank highest on the risk scale. There are many horror stories of systems that do not come back successfully after patching, of legacy software that just breaks down, or of critical, unexpected security fixes that need to be rolled out quickly across an organization.

Solving patching pain and providing seamless updates is one of the greatest things that modern operating systems can do for users. At the same time, customers' needs also have to be captured: you can't expect someone to disrupt their operations every week for a patching window, just as much as you don't want someone else sitting on outdated software for too long.

With Solaris 11, we've done a tremendous amount of work to modernize and improve the patching experience and you can see it touching pretty much any area of the system. We have a new packaging system, IPS, which ensures that things move forward coherently, and we leverage ZFS copy-on-write to provide lightweight boot environments that allow for easy rollback/fallback. We have SMF handling the job of restarting services on updates, so that you never end up running stale code, and fast reboot to quickly move across environments, skipping long firmware POSTs.

Ksplice was just a great fit in this overall story, opening up the possibility of both improving the IDR experience (one-off patches that fix a specific customer issue) and offering customers a minimal-reboot train of security and critical fixes. As I've previously mentioned, at the time of writing there is no commitment by Oracle that any of the above will eventually be provided.

Basic Blocks


Ksplice is composed of four key parts: the generation tools, which compare and extract differences between compilation units and create the necessary metadata; the splices, which are the fundamental patching blobs; the kernel framework, which loads and applies splices in memory; and the administrative tools, which let you configure the system for splice application/reversal and manually inspect splice state.

On the surface, Ksplice on Linux and Ksplice on Solaris look very similar: they both use a two-pass build process to create pre- and post-patch compilation units that are later compared, and the splice contents have corresponding metadata names (if you dump the ELF sections you'll see the familiar .ksplice_relocs, .ksplice_symbols, etc. sections). The splice format is also similar, with the so-called new_code and old_code pairs for each module target. But the similarities kind of stop there.

The ON build infrastructure is fundamentally different from the Linux one and is controlled by lullaby. The work that Tim, Mark and James did there is a tremendous improvement over the old nightly world and is the foundation of our extraction process. The generation tools have also been, for the most part, rewritten and are based on our libelf implementation. libelf is basically the assembly language of ELF files: it gives you useful primitives to manipulate, read and generate ELF files, but doesn't do anything fancy on top of that (if you're used to the GNU libbfd way, you know what I mean). The kernel core is of course different and even the compilers are, since we use Oracle Developer Studio rather than GCC. We also have our own delivery mechanism, through IPS/pkg, and our own configuration (SMF) and reporting (FMA) interfaces, which spliceadm and the kernel framework consume.

In a nutshell, this was not so much a port, but rather, as Scott Michael put it, "a technology transplant". Notwithstanding this, the help we got from the Ksplice team was huge. I've lost count of the number of chats/mails/random pings that I've sent to Jamie and others while working on this and, in retrospect, keeping some of the building blocks in common (metadata, patch generation, validation and application steps, etc.) hugely helped.

While we were busy playing catch-up with the kernel world, the Ksplice folks also introduced userland splicing, which is a great addition towards a rebootless world, as you can now fix your behemoth applications at runtime when the next blockbuster library bug comes out. At the time of writing, this is not available in Solaris.

Preparing the Kernel


To simplify patch extraction and application, and for good measure, we want to reduce the changes introduced by a software fix to a minimum. In particular, the waterfall effect of relative offsets changing can be particularly nasty. To avoid that, we follow the Ksplice on Linux approach of building with fragmentation, separating each function and variable into its own section and so transforming relative jumps/memory accesses into relocations (much easier to process and compare). The Studio idiom to enable fragmentation is -xF=func -xF=gbldata -xF=lcldata.
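For example, compiling the splicetest module source with fragmentation enabled would look roughly like this (an illustrative sketch; the real invocation is driven by lullaby):

$ cc -m64 -xF=func -xF=gbldata -xF=lcldata -c splicetest.c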

Running elfdump -c over a unit built this way shows fragmentation in action, as highlighted by the section names:
Section Header[4]:  sh_name: .text%splicetest_unused_func
 sh_addr:      0                   sh_flags:   [ SHF_ALLOC SHF_EXECINSTR ]
 sh_size:      0x1f                sh_type:    [ SHT_PROGBITS ]
 sh_offset:    0xce0               sh_entsize: 0
 sh_link:      0                   sh_info:    0
 sh_addralign: 0x20              

Section Header[5]:  sh_name: .text%splicetest_attach
 sh_addr:      0                   sh_flags:   [ SHF_ALLOC SHF_EXECINSTR ]
 sh_size:      0x68                sh_type:    [ SHT_PROGBITS ]
 sh_offset:    0xd00               sh_entsize: 0
 sh_link:      0                   sh_info:    0
[...]
Section Header[27]:  sh_name: .rodata%splicetest_string
 sh_addr:      0                   sh_flags:   [ SHF_ALLOC ]
 sh_size:      0x8                 sh_type:    [ SHT_PROGBITS ]
 sh_offset:    0x1318              sh_entsize: 0
 sh_link:      0                   sh_info:    0
 sh_addralign: 0x8               
[...]
Section Header[32]:  sh_name: .data%splicetest_dev_ops
 sh_addr:      0                   sh_flags:   [ SHF_WRITE SHF_ALLOC ]
 sh_size:      0x58                sh_type:    [ SHT_PROGBITS ]
 sh_offset:    0x1620              sh_entsize: 0
 sh_link:      0                   sh_info:    0
 sh_addralign: 0x10              
The above output is from an internal testing module, which we call splicetest to demonstrate that programmers shine thanks to their originality.

Fun story about fragmentation: the first time we enabled it for the SPARC kernel, we got greeted with an early boot panic. Turns out that SPARC uses a very simple boot allocator that has an imposed limit on the number - not total size - of allocations. In krtld (the kernel runtime linker) we use the boot allocator when parsing genunix, since better memory management will come from genunix itself later on. Parsing genunix means parsing an ELF file and allocating space for its sections: the driven-up number of sections, especially .rela sections, just exceeded the total number of available memory slots.

Luckily, we didn't have to modify the boot allocator, but just collapse the sections back together, as krtld would end up doing that anyway. We did this first through a linker script and later the linker aliens promoted it to a linking feature for -ztype=kmod objects.

Fun story number two about reducing the footprint of changes: we build ON in two different ways, debug and non-debug. Normally you'd run the non-debug bits, but you can get the others through pkg change-variant debug.osnet=true. Internally, developers tend to run on the slower, but mdb-friendly, debug bits. In any case, we wanted splices for both, but for a long time we only worked with the non-debug bits. At some point, we started testing our preliminary splice tools on debug units and the number of detected changes just exploded. Thank you very much, ASSERT() and VERIFY().

These developer-loved macros include the line number in their output, via __LINE__, which of course changes with each source patch, waterfalling into all the functions that use either ASSERT() or VERIFY() and that follow the fixed one. There are a number of cumbersome ways to reduce the noise, from playing games with blank lines to coding things up in funny ways, but we didn't really like that. Kuriakose and Jonathan came to the rescue by stealing a page from DTrace SDT probes and the special relocations that we use to signal them to the kernel runtime linker.

In practice, instead of placing the line number directly in the macro, we create a global variable with a reserved name that encodes the line number. This creates a relocation against a symbol whose name carries enough information for krtld to cleverly patch the generated assembly code so that the number is returned directly. It also allows the Ksplice tools and core framework to properly identify the relocation to the special symbol and just skip it during comparison, bringing us back to a sane number of detected changes.
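To make the idea concrete, here is a minimal, hypothetical sketch (the names are invented; the real macros are different):

/*
 * Instead of expanding __LINE__ into the code, reference a reserved-name
 * symbol.  The resulting relocation carries the line information for
 * krtld, which patches the generated code so the number is materialized
 * directly, while the Ksplice tools recognize the reserved name and skip
 * the relocation during comparison.
 */
extern const unsigned long __ksplice_assert_line;       /* reserved name */

#define ASSERT(EX)      ((void)((EX) || \
        assfail(#EX, __FILE__, (int)__ksplice_assert_line)))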

A central part of this implementation is visible in sys/debug.h, which is a publicly delivered file. Go take a look for some pure engineering joy.

Splices


The fundamental unit of patching is the splice. Splices are identified by a monotonically increasing, eight-digit id number. We do this for a very specific reason: to prevent dim sum. We don't want customers to create unicorn configurations that we haven't tested in house, and so we look at Ksplice fixes as a stream of changes, one on top of the other, rather than a collection that you can pick from. The idea is that this should also simplify your workflow. If a previous splice doesn't successfully apply for whatever reason, the framework won't allow the next one to go in.

Splices are regular kernel modules that get into the system through modload. We produce a pair of modules for each target module that we want to fix: a new_code module, which contains the updated bits, and an old_code module, which contains the contents we expect to find on the running system and verify before attempting any splice operation. new_code and old_code need to be loaded in a specific order, but instead of stuffing this logic into a script or a tool, we use module dependencies to link them, and to link the whole splice together, thanks to an extra, Solaris-specific module that we call the dependency module. If a splice is delivered to your system, you can find the dependency module in /kernel/ksplice/$arch/splice-$id.

Recursively dumping this module's dependencies shows the interconnections and the targets (courtesy of our fresh new kldd tool, here in all its glory):
root@kzx42-01-z15:/kernel/ksplice/amd64# kldd ./splice-90000001 
        drv/splicetest_90000001_old =>  /kernel/drv/amd64/splicetest_90000001_old
        genunix_90000001_old => /kernel/amd64/genunix_90000001_old
        drv/splicetest_90000001_new =>  /kernel/drv/amd64/splicetest_90000001_new
        genunix_90000001_new => /kernel/amd64/genunix_90000001_new
        unix  (parent) =>       /platform/i86pc/kernel/amd64/unix
        genunix  (parent dependency) => /kernel/amd64/genunix
root@kzx42-01-z15:/kernel/ksplice/amd64# 

By virtue of modloading splice-90000001, splicetest_90000001_old and genunix_90000001_old get brought in as dependencies and each one brings in its _new counterpart. Later on, this chain allows us to leave only the new_code modules in memory and get rid of the old_code and dependency modules to save some space.

Splices also come with one extra module, known as module.kid or target.id, depending on whether you talk with a Linux or a Solaris person. This module is an updated copy of the target module that contains the fix. The Ksplice framework interposes into the module loading code so that if you try to load a module that wasn't in memory at the time of splicing, we pick up the updated copy.

target.id can be a bit annoying in a reverse situation, because if the module was pulled in as a dependency or is otherwise locked (e.g. a userspace application holding a descriptor to the device that the module provides), we can't unload it and, hence, can't reverse the splice. Reversing splices is something customers expressed fondness for, so we try to limit this situation as much as possible by loading any target module before running a splice application, de facto forcing an in-memory patch every time.

Could we have gotten rid of target.id, then? Unfortunately not, as it is still necessary for edge cases where we deliver a splice that fixes a module that isn't installed. If, later on, the module gets installed and loaded, we'd have no chance to splice it 'at runtime' (just imagine the can of worms that opens up if this operation fails for whatever reason), so we let the interposing code pick up the right target.id copy.

Kernel Framework


The kernel framework is the heart and soul of Ksplice on Solaris. Splice operations start from an ioctl to the /dev/ksplice device, which is provided by the ksplice kernel module. This module contains the Solaris implementation of the run-pre algorithm, the preflight safety checks and the patching support. Along with the kernel module, a small portion of the framework is provided by genunix, mostly to maintain metadata and state information about the loaded splices. This split allows the ksplice module to be loaded/unloaded at will, so that we can update it at runtime.

Function patching is performed by placing a trampoline from the original function to the patched one. The trampoline is 5 bytes on x86 (jmp offset) and 12 bytes on SPARC (sethi, jmpl, nop) and so, by the sacred rules of self-modifying code, cannot be placed safely without stopping all the CPUs except the one running the patching code. While the world is stopped, the framework also takes the chance to walk through all the existing thread stacks, looking for any target pointer stored there, as that might lead to inconsistencies or crashes after the patching. This operation, internally referred to as stack-check, needs to run fast, to prevent any network or cluster timeout/heartbeat from hitting.
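As an illustration, the x86 trampoline is just a jmp rel32 aimed at the replacement function. A hedged sketch (function names invented; the real framework writes this while the world is stopped):

#include <stdint.h>
#include <string.h>

/*
 * Overwrite the first 5 bytes of the old function with a jmp rel32
 * (opcode 0xE9) that redirects execution to the new function.  The
 * 32-bit displacement is relative to the end of the 5-byte instruction.
 */
static void
place_x86_trampoline(uint8_t *old_func, const uint8_t *new_func)
{
        int32_t disp = (int32_t)(new_func - (old_func + 5));

        old_func[0] = 0xE9;
        memcpy(&old_func[1], &disp, sizeof (disp));
}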

Fun story about stack-check. For a while we just hadn't paid attention to how long the operation was taking, because testing machines tend not to have much traffic or network-sensitive applications on them (the operation time grows linearly with the number of processes). The original stack-check algorithm was kind of simplistic, starting from the top of the stack and comparing 8 bytes at a time all the way down, but effective. It also felt fast enough.

Later on, reality kicked in, especially on SPARC, where stacks are significantly larger than on x86. Our clustering code started panicking here and there with heartbeat timeouts and that very quickly became a P1 bug. We worked out a quicker, but slightly riskier, algorithm in which we walked the stack frame by frame and only evaluated function linkage data (e.g. return addresses or passed-in parameters). That relieved the problem, but was still somewhat close to the time limit when testing with a very large number of processes. On top of that, for splices removing a symbol, we still had to make sure somehow that no local variable contained a reference to it, or fully embrace the yolo mentality. Basically, we had duct-taped the issue, but not really solved it.

Turns out that there is a third, much better way: instead of performing the whole stack check while CPUs are stopped, we perform an initial pass while the world is running. If we hit a conflict we back off for a bit and try again. Rinse and repeat up to three times before definitively bailing out. If we pass this step, then we stop the world and re-run the stack-check, but this time we skip all the threads that haven't had any CPU time since the last check, as they haven't had any chance to make progress. This takes away a huge chunk of stack walking and makes things fast, so fast that we default to the full stack check again (but keep frame checking around for good measure and even compare the two on debug kernels).
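In rough pseudocode, the flow looks something like this (a hedged sketch; every name below is hypothetical and the real code is considerably more involved):

/*
 * Two-phase stack check: an optimistic pass while the world is running,
 * then a stop-the-world pass restricted to threads that ran in between.
 */
static int
splice_stack_check(splice_t *sp)
{
        hrtime_t last_pass = 0;
        int attempt, error;

        /* Phase one: check while the world is still running. */
        for (attempt = 0; attempt < 3; attempt++) {
                last_pass = gethrtime();
                if (stack_check_all_threads(sp) == 0)
                        break;                  /* no conflicts found */
                delay(SPLICE_BACKOFF_TICKS);    /* back off and retry */
        }
        if (attempt == 3)
                return (ESPLICE_STACKCHECK);    /* definitively bail out */

        /*
         * Phase two: stop the world and re-check, but only the threads
         * that got CPU time after the last running-world pass; the
         * others cannot have made any progress.  (In reality the
         * patching happens before the CPUs are restarted.)
         */
        stop_other_cpus();
        error = stack_check_threads_run_since(sp, last_pass);
        start_other_cpus();
        return (error);
}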

Fun story about stack-check and SPARC, take two. At some point, all splice applications on SPARC started failing with a stack-check violation. Every single one of them had the issuing process (spliceadm) hitting a false positive in its call chain. We hadn't made any recent significant change to the algorithm, just some code reordering, so this was even more puzzling. First came the frustration-induced, draconian idea: always ignore stack-check failures that come from the thread that is running within the ksplice code path. Basically functioning, but really not pretty - so we kept debugging.

Oh beloved register windows, we meet again. Turns out that our code reordering led to the compiler leaving some of the to-be-checked pointers in registers that survived across a register window and ended up in the next one, happily saved onto the stack right before the full stack-check. We solved this by adding a clear_window() routine that basically exhausted all the register windows and repeatedly set all registers to 0, so that we could start from a clean state. Small, cute and elegant - this worked for a while, until at some point false positives started popping up again.

On SPARC there is extra stack space reserved for aggregate return values and an extra area for the callee to store register arguments. If this extra space ends up unused and unluckily aligned over some dead stack that contains the pointers we played with in the framework to prepare the check, a false positive arises again. As much as we had ways to solve this by rearranging the code, it felt fragile over time, so on top of the register window clearing, we now also zero out all the dead stack before walking down the stack-checking algorithm, making sure to do that from a call site that is higher than the shortest depth that the algorithm can hit.

Ksplice and DTrace


Along with stack-check, the most interesting safety check that we run is the one that guarantees interoperability between Ksplice and DTrace. Actually, this is more than just a safety check, as these two guys really like to fiddle with the .text segments and have to communicate to avoid stepping on each other's toes.

The story of DTrace support is fairly tortuous and spans a few years before we got to its final form, with various people alternating and, occasionally, walking down deep and dark alleys. If there is one thing that I've learned from this, it is that failure is, indeed, progress. We had to prove to ourselves that some of the ideas were batshit crazy to really reach the final state we're now happy with.

Let's start with the problem to solve. DTrace has two unstable providers that interact with the text/symbols layout: FBT (Function Boundary Tracing) and SDT (Statically Defined Tracing). The former places probes at each function entry and return point, while the latter needs to be explicitly written into the source code and allows the programmer to collect data at arbitrary points within a function. They are both "unstable" as they are intimately tied with the kernel implementation, which we reserve the right to change at will.

One of the key ideas behind Ksplice is that things get updated, but you really don't notice it. As an example, we take care not to change user/kernel interfaces with it. When it comes to DTrace scripts, ideally we'd want something written prior to a splice to keep working even if the splice has detoured execution of one of the traced points. Defining "working" is the big deal. The instability of the SDT and FBT providers gives us a bit of leeway, but we have internal products that we want to splice, and that rely on SDT/FBT behavior (e.g. ZFSSA). Also, it would be silly not to strive for the best possible experience with one of Solaris' finest tools, of course always factoring in the complexity.

Here is what we came up with. First of all, we need to distinguish between two macro scenarios: a script is running, or a script has been written but will be started later. In the first case, if it is currently enabling SDT or FBT probes within units that we need to evaluate or consume (e.g. run-pre/splice framework), we abort the splice operation and return the list of such scripts/pids to the admin. Trying to do anything on our own only leads to too much complexity. Say that we temporarily stop the script, do the patching, and the logic of the function changes: would the script still make sense? What if the script tries to access a parameter that we no longer pass? What if the function was even deleted? Better to have the admin relaunch the script and let DTrace catch all these situations. This also solves the problem of DTrace modifying the .text segment of functions that we need to compare, as we ensure that no DTrace script will ever interfere during a splice operation.

For the second scenario, whereby a script exists but will be (re)launched after the splice operation, there are a few troublesome situations:

  • Every patched function is inside a new module (the new_code) and part of the 4-tuple that identifies a DTrace probe point (provider:module:function:name) relies on the module name. A script may think it's enabling the right SDT point, but it might be the "old" one and never fire.
  • DTrace providers are loadable kernel modules and build their list of probe points when loaded, by parsing all the already loaded modules. On top of that, there are hooks at every modload/modunload. Building the list means, for FBT, walking the symbol table and finding entry/exit points by pattern matching on known prologues/epilogues. Ksplice patches the prologue, so the pre- and post-splice views of a module have a different number of entries, which can lead to stale contents. Stale contents with DTrace are a panic waiting to happen.
  • Users might be confused if all of a sudden more than a single probe is enabled for a tuple that doesn't specify the module name (new_code functions keep the same name as the target ones).

We solve these problems differently for SDT and FBT. For SDT we implement what we call probe translation, so that the new_code SDT probe, if present and identical, overwrites the one from the patched function. The opposite operation happens during reversal, restoring the old SDT chain.

For FBT, we bite the bullet of letting the tuple change with respect to the module definition. Say you have a script that hooks on fbt:splicetest:splicetest_math:entry and we patch splicetest_math; that script won't work anymore, because after the splice, splicetest:splicetest_math no longer has an expected prologue and is not recognized by DTrace as a valid point. Similarly, splicetest_math:return also goes away, solving the problem of an FBT return probe that never fires. Scripts in the form fbt::splicetest_math:{entry|return} instead just work seamlessly, as the last new_code module in the chain will be the only one providing the symbol. This form is by far the most common and the one that we use internally, so we "optimize" for it.
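To make the distinction concrete, here are two illustrative one-liners against the splicetest example (hypothetical invocations). The first pins the module name and stops matching once splicetest_math is spliced; the second keeps working, as the new_code module ends up being the sole provider of the symbol:

# dtrace -n 'fbt:splicetest:splicetest_math:entry { @calls = count(); }'
# dtrace -n 'fbt::splicetest_math:entry { @calls = count(); }'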

The above sort of works on x86 with the existing code, just by calling into the DTrace modload/modunload callbacks, but is a total mess on SPARC. This is because on SPARC probes are set up through a two-pass algorithm, whereby the first pass counts the number of probes and allocates the necessary handling structures, and the second pass populates them. The simplistic calls into the modload/modunload routines would find a pre-allocated table and things would go south from there. It's also a bit gross, reflecting the attempt of a Ksplice person doing DTrace-y things, which is a classic sentinel of bad.

Thankfully, Tomas Jedlicka and Tomas Kotal came to the rescue by designing and implementing a much better interface in DTrace, which introduces a new probe state, HIDDEN, that behaves like DISABLED but cannot ever be enabled. Its whole point is to stay around keeping metadata information. The only transitions allowed are from HIDDEN to DISABLED and vice versa.

This HIDDEN state captures all the splice interaction scenarios: the target module is spliced and later parsed by FBT? All the spliced points get included in the list of probes, but marked HIDDEN. The splice is lifted? The probe points become DISABLED. The list has already been built, but we apply a splice? No problem, just get the list of targets from Ksplice and make the associated probes HIDDEN.

The HIDDEN concept lives at the framework level and the same goes for the new refresh callback, introduced so as not to overload modload/modunload and now consumed by Ksplice. By making these changes at the framework level, any future provider that might need to react to splice operations already has all the necessary entry points in place. On top of that, we also provide a couple of helper functions to request the original function contents (in case one wants to walk the .text segment as if the splice wasn't there) or the list of targets/symbols of a splice operation.

As of today, FBT and SDT are the only two consumers of the above.

User Experience


All the architecture, code, cute designs and long debugging sessions are pointless if you don't make your stuff usable. Staying with the idea that things get updated but you really don't notice it, applying a splice to the system is as simple as installing/updating any other package, which, not to brag, is so damn cool (I might be biased by the amount of manual loading that I've done during development). This is achieved through the SMF svc:/system/ksplice:default service, which coordinates automatic splice operations.

This service is responsible for four main things:
  • apply splices on delivery, by getting refreshed by pkg
  • control freezing and unfreezing of splices
  • on a reboot, apply all the splices at boot time
  • collect and store splice logs
Freezing is a Ksplice on Solaris-specific concept, rooted in the fact that splices have a monotonically increasing id. At any point in time, an admin can specify a maximum ID value that the system can be at. If there are splices with a bigger ID currently applied, they get reversed; if new splices with a bigger ID get delivered, they are not loaded. The idea of freezing is to capture scenarios where admins want to download splices, but still apply them during a quiet period (to maximize the chances of success) or a potential downtime window (for a new technology such as ours, some testing of the waters has to be expected). It also provides a very simple instrument to temporarily blacklist a problematic splice, while we frantically work on fixing it. Of course, we never release problematic splices, so you will never need that - right? If that were ever to happen, though, we also leverage the freezing concept to prevent reboot loops, by leaving a grace period before a freshly applied splice is also applied on reboot.

Freezing is controlled by spliceadm(1M), through the freeze <id> and unfreeze commands, and highlighted by the status command. These three commands, along with log, are the only ones you should ever have to interact with for regular administration of Ksplice on Solaris, but we also provide a few more for our support folks to troubleshoot issues and manually interact with splices (apply/reverse/cleanup).

Lastly, there is spliceadm sync, which is what the SMF method calls. Its job is to walk the list of existing splices on the system and compare it with the freeze configuration to establish the list of splices to apply or reverse.
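Purely as an illustration (the splice id below is made up and any output is omitted), a regular administration session could look like:

# spliceadm status
# spliceadm freeze 90000002
# spliceadm unfreeze
# spliceadm sync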

The spliceadm man page describes the command in detail and you can bet that, when the first splice is out, a lot more documentation with examples and screenshots will be available. Since I'm now a user and no longer a developer, I'm really looking forward to that.

Closing Words/Shoutouts


This project was huge and a number of people joined in at various stages to help along, since the early days when Jan Setje-Eilers dragged me into this under Scott Michael's managerial supervision. Kuriakose Kuruvilla and Pete Dennis have been a stable part of the "Solaris Ksplice Team", Rod Evans and Ali Bahrami (the linker aliens) joined mid-way and made the tooling and krtld so much better, and Mark J Nelson is one of the three people in the organization who understand everything lullaby does and can express desires in Makefile form; if the infrastructure has gotten this efficient and anywhere near sustainable, it's mostly thanks to his magic-fu. Xinliang Li and Raja Tummalapalli have both tolerated our occasional "what if we do that?" and turned it into code. The testing infrastructure was Albert White's work and the gate autografting and management was Adam Paul's and Gabriel Carrillo's bread and butter.

Bottom line, I mostly just got to tell the story :-)

Friday, March 16, 2018

Sunset

As of today, I'm no longer an Oracle employee and no longer work on the Solaris (or, briefly, Linux) kernel. I'm not very good with goodbyes; even my 'out of here' mail had just one line about the past 9 years: "Was Fun".

And it really was. I've had a blast and learned a ton. There are five things that I ridiculously and perhaps irrationally loved about Solaris and the organization:

  1. Code should be beautiful: as in every big project, there are strict rules about the C style, to keep the overall aspect coherent. On top of those, there are a few extra optional rules, such as those that the VM guys have used in the VM2 code. As much as I love those, the important takeaway for me has been to strive to make the code not just functional, but visually and technically pleasant. The extra time you spend there pays dividends down the line, when you have to look at it again. I really hope ON will be open one day, as there are pieces written by Blake, Jonathan or the linker aliens (just to name a few) that are pure art. There are even folks that can make Forth look nice. How cool is that?
  2. Integrate the hell out of everything: every piece of Solaris technology leverages the existing ones as much as possible. The ARC (architectural) Committee won't let you integrate something that reinvents the wheel. Well, it will try not to, as we're guilty of some "needs to go in, will come back and fix after $major_deadline" (as an example, we still have way too many malloc() implementations), but overall, the integration is great. If you invent a new command/subsystem today, you'll have to store the configuration in SMF, follow the output formatting rules, report issues through FMA and make sure that, if necessary, there are the proper Analytics hooks. This also ensures that you consume and have consumers, which keeps you honest as you write stuff (personal mantra: never, ever, integrate something that doesn't have a consumer).
  3. SPARC relationship: building a chip in house and supporting it opens up tons of fascinating chances for learning about hardware and software interaction. Software in Silicon is perhaps the best-known and most successful public example, but many of the hardware meetings I attended with my software hat on have been incredibly fascinating. Getting the hardware-side perspective on things helps in understanding certain design decisions and in relating to a whole different set of issues. My love for low-level stuff just grew a little bit further there.
  4. Kernel and User Land unite: the ON code base contains both the kernel code and the key userland pieces, such as the system libraries, some commands and the linker. This means that in a single project you can modify all of them and have them working together, without having to move across different consolidations/organizations and respect different putback timings. I understand this sounds (and probably is) quite obvious, but I can't shake the happiness I felt adding the secsys() system call, the secext framework, sxadm, the linker changes and modifying libc in a single push. It felt like having control over the whole world.
  5. SCT meetings: every year, all engineers from around the globe met for a week of presentations and hallway chats. A truly exciting and exhausting week, which gave me, year over year, a thermometer of how much I was really contributing (the first year I almost didn't participate in any conversation; towards the last years, I was grabbed here and there). Oh, and it had the traditional Beer and Sausage outing with the Ksplice, Security and Linker folks in San Jose, at Original Gravity. This last one is a tradition I hope to maintain every time I hit the Bay in the future.

I'm leaving the popular "the people" out of the list, as I believe (with proof ;)) that there are great people in every company and I'm truly looking forward to the next round of meeting and learning. In Sun and then Oracle, I've had the luck of always working in cool teams, with cool people (a lot of this luck has to be attributed to Jan, who's constantly been my mentor), many of whom have been around for 10+ years, creating pretty strong relationships. I could fill a whole page with names and stories about different folks, but I would surely forget someone and be called out on it, so I take the easy way out. Just like I did with the mail and the single line: "Was Fun". Which really meant, "Thank you".

Sunday, February 25, 2018

libc:malloc meets ADIHEAP

Oracle Solaris 11.4 comes with ADIHEAP, a new security extension that acts as a management interface for allocators that implement ADI-based defenses. In this blog entry we'll walk through the implementation of ADIHEAP within libc:malloc in Solaris.

Background


So why ADIHEAP at all? Shortly before the advent of libadimalloc, I started thinking about a better way to integrate ADI in the Solaris ecosystem. I love the technology and libadimalloc, while doing its job as a testing library, wasn't cutting it for production environments. LD_PRELOAD is a poor controlling interface and relinking existing applications just wasn't happening. It didn't really help that libadimalloc wasn't seen as a viable non-ADI allocator and hence had no consumer out of the box.

My vision for ADI was a bit different, involving the main Solaris allocators (libc:malloc, libumem and libmtmalloc) supporting ADI-based defenses and the Security Extensions Framework acting as a more advanced and coherent controlling interface. That's when ADIHEAP was born.

ADIHEAP brought all the usual security extensions goodness:
  • progressive introduction to sensitive binaries, through the tagged-files model and binary tagging (ld -z sx=adiheap=enable)
  • ease of switching between enabled/disabled state, especially system wide (capture different production scenarios)
  • simplified/advanced testing through sxadm exec -i and the ability to unleash ADI checks over the entire system with model=all
  • reporting through the Compliance framework 
  • kernel and user process cooperation. The kernel knows whether the extension will be enabled on the target process and can do operations on its behalf. In particular, this vastly simplifies ADI support in brk-based allocators, since the kernel now pre-enables ADI over the brk pages.
Of course, ADIHEAP by itself does very little without library support. The choice for the first ADIHEAP consumer (never introduce something without a consumer!) fell on libc:malloc, as it was/is small, self-contained, (still) widely used across the system and implemented as a non-slab, brk-based allocator, which provides an interesting alternative to the mmap-and-slab-based libadimalloc example.

libc:malloc


The implementation of libc:malloc hasn't changed much over the years and an older version can be found from the old open-source days. We'll use this public implementation as a reference, since the main goal of this entry is to show how an allocator can be evolved to incorporate ADI checks. Keep in mind that some of the code described here may not strictly apply to the 11.4 Oracle Solaris codebase.

libc:malloc is composed of two main files, mallint.h (which contains general definitions for the allocator) and malloc.c (which has the bulk of the implementation). It's a best-fit allocator based on a self-adjusting tree of free elements grouped by size. Element information is contained inside the chunk itself and described by the TREE structure:
 110 /* structure of a node in the free tree */
 111 typedef struct _t_ {
 112         WORD    t_s;    /* size of this element */
 113         WORD    t_p;    /* parent node */
 114         WORD    t_l;    /* left child */
 115         WORD    t_r;    /* right child */
 116         WORD    t_n;    /* next in link list */
 117         WORD    t_d;    /* dummy to reserve space for self-pointer */
 118 } TREE;
Free objects use all the elements of the TREE structure and also have a pointer to the start of the chunk at the end of the buffer (which basically mimics the t_d member for chunks larger than sizeof(TREE)).

Allocated objects, instead, only use the first element, which contains the size of the chunk. The first element is ensured to be ALIGN bytes in size:
 103 /* the proto-word; size must be ALIGN bytes */
 104 typedef union _w_ {
 105         size_t          w_i;            /* an unsigned int */
 106         struct _t_      *w_p;           /* a pointer */
 107         char            w_a[ALIGN];     /* to force size */
 108 } WORD;
so that the data portion is guaranteed to start at the required alignment boundary, which is 16 bytes on 64-bit systems.

Since every chunk is guaranteed to be aligned to and a multiple of ALIGN, the last bits of the size member are equally guaranteed to be zero. libc:malloc exploits this to store further information in the last two bits: BIT0 specifies whether the block is in use (1) or not, and BIT1 specifies, for blocks in use, whether the preceding block is free. This information is used at free time to handle the coalescing of free blocks (adjacent free blocks are always coalesced, in order to reduce fragmentation).
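In rough terms (the macro names below are paraphrased, not lifted from mallint.h), the flag bits ride in the size word like this:

#define BIT0            0x1     /* this block is in use */
#define BIT1            0x2     /* the preceding block is free */

#define ISBIT0(w)       ((w) & BIT0)
#define ISBIT1(w)       ((w) & BIT1)
#define REALSIZE(w)     ((w) & ~(BIT0 | BIT1))  /* size with flags masked off */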

Allocations smaller than the TREE structure itself would waste a lot of space on the header, so the allocator provides a specialized path: smalloc(). Small allocations are based on an array of linked lists of free chunks. Each entry of the array satisfies a multiple-of-WORDSIZE requirement, so that List[0] contains 16-byte chunks, List[1] 32-byte chunks and so forth up to sizeof(TREE).

libc:malloc grows its memory footprint through sbrk() in the _morecore() function, which extends the Bottom block, the last free block available. As is common in this kind of allocator, the Bottom block is treated a bit specially.

Introducing ADI defenses to libc:malloc


Porting an allocator to use ADI requires solving three main problems: buffer alignment/size, which versioning scheme to use, and minimizing performance degradation. The first two end up directly or indirectly affecting the third. It's also paramount for the allocator to run with basically the same speed and memory consumption when the ADI protections are disabled, so that applications that do not have ADIHEAP enabled do not suffer any potential regression.

The first step in opening the allocator to the ADI world is to change the minimum unit of operation to 64 bytes. Looking at the libc:malloc code, this means moving ALIGN to a value of 64. Unfortunately, in libc:malloc, the meaning of ALIGN is overloaded, since it's both the alignment that malloc() expects for the returned pointer and the alignment that chunks have in memory within the allocator. While the two normally align (no pun intended), when it comes to ADI they are decoupled: malloc() still lives happily with 16-byte aligned pointers, but we need 64-byte aligned chunks backing them. This can be a common scenario in allocators, but it is luckily also a fairly simple one to fix, by capturing the decoupling in different variables.

Actually, the best thing to do here is to rely as little as possible on static defines and have the alignment and other requirements extracted at runtime, through an init function. The init function is also the best place to learn whether we are going to enable ADI protections or not, by consuming the sx_enabled(3C) interface from the security extensions framework:
    if (sx_enabled(SX_ADIHEAP)) {
         ...do ADI specific initialization...
         adi_enabled = B_TRUE;
    }
The ADI-specific initialization should probably leverage the ADI API to collect information about the versioning space, the default block size, etc. (see the examples in the mentioned blog entry).

Done with the alignment, we need to concentrate on the allocator metadata and how it affects chunk size and behavior. In the reference libc:malloc implementation, all the metadata concentrates in the TREE header, which is WORDSIZE * 6 = 16 * 6 = 96 bytes in size. This would dictate a minimum unit of allocation of 128 bytes, which is a bit scary for workloads that allocate lots of small objects. Ideally we'd like the minimum unit to be 64 bytes and we can actually achieve that here with a more clever use of the data types. In fact, only the very first member of the TREE structure needs to be 16 bytes in size, while all the others (with the exception of some implications for the last member as well) don't. This allows us to get rid of the WORD union padding and just declare the members as struct pointers, which are 8 bytes in size. By keeping the first and last members as proto-WORDs and the rest as pointers we get to 16 * 2 + 8 * 4 = 64 bytes, which is exactly our goal.
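A hedged sketch of what the slimmed-down header could look like (the actual 11.4 definition may well differ; note that WORD stays 16 bytes, since it tracks the malloc() alignment, not the chunk alignment):

/* 16 * 2 + 8 * 4 = 64 bytes, one ADI cache line */
typedef struct _t_ {
        WORD            t_s;    /* size of this element + flag bits (16 bytes) */
        struct _t_      *t_p;   /* parent node, now a plain 8-byte pointer */
        struct _t_      *t_l;   /* left child */
        struct _t_      *t_r;   /* right child */
        struct _t_      *t_n;   /* next in link list */
        WORD            t_d;    /* self-pointer slot (16 bytes) */
} TREE;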

Now that we have the minimum unit down to the cache-line size, we can start thinking about small allocations in the smalloc() path. There are basically two options: keep smaller buffers together under a common version, hoping not to have small controlled overflows between them, or ditch the smalloc() algorithm altogether and declare that any allocation will get at least one 64-byte chunk. Simplicity trumps memory usage by a fair bit here and we get the added benefit of a more secure implementation, so we say goodbye to smalloc() when ADIHEAP is enabled. With the proper adjustment at runtime for MINSIZE (another good candidate for a variable set up during the init path), the smalloc() path is just never entered.

With the size/alignment solved, it's time for the versioning scheme, which again requires evaluating the metadata behavior to make a decision. Either we isolate the header into its own 64-byte chunk, but then we de facto extend the minimum allocation unit back to 128 bytes, and that's not good, or we just accept in the threat model that an underflow can target the size member. It's a tradeoff, but we can make a case for underflows being significantly less common than overflows and for the fact that it has to be a controlled underflow (anything beyond 16 bytes would trap). It's not uncommon for security defenses to have to take tradeoffs and certainly we could provide a paranoid mode where the metadata is decoupled. I'm usually very complexity- and knob-averse, but that must not come at the cost of ineffective implementations. In this case, the major feature of ADI (detecting and invariantly preventing linear overflows) is maintained, so we can go for the simpler/faster implementation.

The versioning scheme is pretty standard for allocators that don't have isolated metadata, as we just need to reserve one version (1 is a popular choice) for free objects. At allocation time, instead, we're free to cycle through the rest of the versioning space. There are usually two different algorithms for versioning objects, depending on whether the relative position between chunks is known and fixed at allocation time or not. For libc:malloc it clearly isn't, so we go for "randomization" plus verification. The fake randomization is based on RDTICK (the timestamp register) and improved with left and right verification.

Each time a version is selected, it is compared with the antecedent chunk and, if identical, it's incremented by one. The resulting value is then compared with the following chunk and, if identical, it's incremented again, taking care of looping back to 2 when hitting the end of the versioning space. This ensures the most important property for an ADIHEAP-capable allocator to act as an invariant against linear overflows: no two adjacent allocated buffers should ever carry the same version.
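A hedged sketch of the selection logic (the names are invented and the real implementation differs in its details; version 1 is reserved for free chunks, so allocated chunks cycle in [2, max]):

#include <stdint.h>
#include <sys/time.h>

static uint8_t
adi_pick_version(uint8_t prev, uint8_t next, uint8_t max)
{
        /* fake randomization; the real code reads the %tick register */
        uint8_t v = 2 + (uint8_t)(gethrtime() % (max - 1));

        if (v == prev) {        /* clash with the preceding chunk */
                if (++v > max)
                        v = 2;
        }
        if (v == next) {        /* clash with the following chunk */
                if (++v > max)
                        v = 2;
        }
        return (v);
}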


Run, Crash, Fix, Rinse and Repeat


ADIHEAP and the above prototype come along pretty nicely and survive the most obvious testing of doing a bunch of malloc()s and checking the returned versions. It's with a lot of confidence that I type sxadm exec -s adiheap=enable /bin/ls, already imagining what the mail showcasing ADIHEAP to my colleagues will look like.

Segmentation Fault (core dump). Almost immediately and within libc:malloc.

Turns out that any non-trivial sequence of memory operations requires coalescing and tree rebalancing, which make heavy use of the LAST() and NEXT() macros.
 145 #define LAST(b)         (*((TREE **)(((uintptr_t)(b)) - WORDSIZE)))
 146 #define NEXT(b)         ((TREE *)(((uintptr_t)(b)) + SIZE(b) + WORDSIZE))
LAST() takes a TREE pointer as an argument and subtracts WORDSIZE. This effectively accesses the last 16 bytes of the previous chunk which, when the chunk is free, contain the self-pointer to the chunk itself. In other words, LAST() allows reaching back to the previous free chunk without knowing its size. NEXT() takes a TREE pointer, adds the size of the chunk plus the extra WORDSIZE that corresponds to the size metadata at the start, effectively accessing the next adjacent chunk. This is generally fine and dandy, except that with ADIHEAP the previous and next chunks most likely have mismatching versions, and the segmentation fault is inevitable.

Once again we have two ways to fix this: we can either declare NEXT() and LAST() trusted paths and have them use non-faulting loads that ignore ADI tagging, or we can read the previous/next ADI tag and fix up the pointer ourselves before accessing it. The first solution is a bit faster, but feels and looks more hackish in an allocator, so we go for the latter, rewrite NEXT() and LAST() as functions (sketched below) and wait to see how bad the performance numbers look. Turns out they are not that bad, so we go for it.
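A hedged sketch of what the rewritten NEXT() might look like (adi_fixup_ptr() and adi_enabled are hypothetical stand-ins, not real libc interfaces):

/*
 * Compute the address of the adjacent chunk, then rewrite the version
 * bits in the pointer to match the physical tag of the destination
 * line before it is ever dereferenced.
 */
static TREE *
next_chunk(TREE *b)
{
        TREE *n = (TREE *)((uintptr_t)b + SIZE(b) + WORDSIZE);

        if (adi_enabled)
                n = adi_fixup_ptr(n);   /* re-tag the pointer before use */
        return (n);
}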
darkhelmet$ sxadm exec -s adiheap=enable /bin/ls
0_sun_studio_12.4  ssh-XXXX.xbadd     volatile-user
hmptemp            ssh-xauth-xSGDfd
darkhelmet$
Woot. I'm the coolest boy in town. Let's see what else we can run with ADIHEAP enabled. /bin/w ? Done. /sbin/ping ? Yeah. /bin/bash ?

Segmentation Fault (core dump). Once again within libc:malloc.

A few imprecations later, the lightning bolt on the road to Damascus. While our brk space is backed by ADI-enabled pages, other adjacent memory isn't. Our versioning algorithm, when it allocates a new version, checks the previous and next chunk, but what if there is no previous or next chunk at all, because we are right at the start or at the end of the brk space (something that ASLR had hidden from previous tests)? We core dump attempting to use an ADI API over non-ADI pages.

libc:malloc only keeps track of where the brk ends, but we also need to know where it started. In general, you may have to keep track of the boundaries of your allocated space, especially if you have it fragmented in memory. In this case it's simple: we just store the starting address in our init function.

The above are just two examples of the many joy, segmentation fault, despair, debug, fix loops that touching an allocator brings. I'll spare you most of the others: the off-by-one in the versioning code (that led to certain chunks getting their adjacent ones retagged), fixing and then fixing again memalign(), discovering that emacs takes an entire copy of the lisp interpreter at build time, dumps its memory and restores it back when executed on another system, and that you've got to work around that to make it work (ignoring free()s and replaying any realloc() as if it were a new malloc() does the trick), and fixing Python's clever memory management code (see the objmalloc patch for an example of using a non-faulting load in a performance-sensitive code path).

Python, why you so slow?


The first performance numbers out of libc:malloc are encouraging, but while testing the objmalloc patch on Python things look ridiculously slow. A pkg solving operation that takes 30-40 seconds with the standard libc bumps up to minutes with ADIHEAP enabled, a 5x regression that is as bad as it is unexpected.

Oracle Studio Performance Analyzer is a friend here and I start collecting samples to figure out where the problem lies. One outlier shows up: some PyList_Append() calls seem to spend an awful lot of time in realloc() compared to other paths that call into it, and even compared to other invocations of the same function. Some DTrace allows me to further isolate the realloc() sequences in PyList_Append(), identifying an interesting pattern, whereby PyList_Append() allocates a small amount of memory and reallocates it upward multiple times in small increments.

I take a long stare at the realloc() code, but nothing seems wrong and there aren't really any changes from the ADIHEAP work: adjacent blocks are checked and, if a fitting one is found, it is merged with the existing block and the rest is freed.

And then it hits me.

Some of these reallocation calls extend into the Bottom chunk, which is several kilobytes in size.
Here is what happens:

  • a small 64 byte chunk is allocated
  • the chunk ends up being right next to the Bottom chunk
  • the chunk is reallocated extending its size to 128/256 bytes
  • the bottom chunk (which is 8/16K bytes) is split into two parts, the extra 64/192 bytes needed for the reallocation and the remaining 8/16K - 64/192 bytes
  • the chunk version is extended into the extra 64/192 bytes
  • the free version is stored on the remaining 8/16K bytes

Multiply the above by a couple of incremental loops and we are doing kilobytes and kilobytes of unnecessary tagging! No wonder this is killing performance.

I rewrite the free tagging code with an extra check, which evaluates the first and last 64-byte block of the to-be-freed chunk: if both versions match the free version, it means that we have been called on a split free block, so we have nothing to do and can just return (sketched below). With this simple check alone, the 5x regression goes away and performance is almost on par with the non-ADIHEAP case.
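A hedged sketch of the check (the helper name is hypothetical, not a real libc or ADI interface):

/*
 * Return true when the chunk was carved out of an already-free block:
 * both its first and last 64-byte lines still carry the free version,
 * so there is nothing left to retag.
 */
static boolean_t
free_block_already_tagged(void *chunk, size_t size)
{
        return (adi_version_at(chunk) == FREE_VERSION &&
            adi_version_at((char *)chunk + size - 64) == FREE_VERSION);
}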

Almost, though. There is still a bit too much time spent in _morecore() compared to the vanilla libc case. The impact is significantly smaller, but it's nonetheless noticeable in the Studio analyzer output.

The lesson learned with realloc() sort of applies to _morecore() as well: when we extend the brk space, we ask for a couple of pages and mark them as free. Once again, we're tagging kilobytes and kilobytes of memory unnecessarily, because those pages come fresh (and empty) from the kernel. Just as we used the presence of the free tag as a sentinel in realloc(), here we use the presence of the universal tag as a sentinel, and limit ourselves to tagging only the first and last 64-byte chunks of the freshly received memory. Similarly, during a split, we check again for an underlying zero tag and convert the first chunk to the free tag.

This works because ADI allows memory that is physically tagged with the universal match version (0 or 0xF) to be accessed with an arbitrary tag in the pointer, which is what the kernel hands back by virtue of zeroing the pages out. With this extra change there are no more unexpected slowdowns and the pkg operation runs almost as fast as with the vanilla libc.

Debugging Easter Eggs


The ADIHEAP libc:malloc implementation is geared towards production scenarios, but since I was there, I left a couple of debugging easter eggs that you might find interesting. You can turn these on through the following environment variables:
  • _LIBC_ADI_PRECISE: when set, the application runs with ADI set to precise mode. This is handy when you have a store version mismatch and you want to pinpoint exactly where it happened. 
  • _LIBC_MALLOC_ZEROFILL: when set, malloc() behaves just like calloc() and zeroes out a buffer before handing it out (useful in certain cases where memory is freed and later reused, but not initialized properly). This is implemented regardless of whether ADIHEAP is enabled, but is kind of cute with ADIHEAP, since we can use adi_memset() rather than adi_set_version() and get it basically for free.
  • SXADM_EXTEND: ADIHEAP and ADISTACK can uncover a large number of latent bugs and hence bring disruption to software that otherwise "just works" (think of an off-by-one read outside the allocated chunk, which is generally non-fatal for an application, but immediately traps under ADIHEAP). For this reason, model=all is not exposed for ADIHEAP and ADISTACK. When this variable is set, model=all can be used there too, e.g. through SXADM_EXTEND=1 sxadm enable -c model=all adiheap. I'm a huge fan of model=all for testing and that's how I run my T7 environments, in order to catch bugs left and right (and it did catch many).

Final Words


I've been waiting to write this blog post for quite some time now and I'm glad this stuff is finally out. The most important message that should get conveyed is that, while there might be tricky corner cases, any default/system allocator can be converted to use ADI and benefit from the added security checks that come with it.

If you have a T7 and a copy of Oracle Solaris 11.4, you can now go out and play with sxadm and ADIHEAP (and ADISTACK, more about it coming). I look forward to hearing about your experience!