Wednesday, March 21, 2018

Splice up your life

While I'm no longer an Oracle employee, there are still a few projects that landed in Solaris 11.4 that I'd like to talk about. The one that has occupied most of my last few years is definitely Ksplice on Solaris. Back in 2011, Oracle bought Ksplice, a company that provided runtime patching to the Linux kernel. Ksplice on Linux, today, is many things:

  • for customers, it's a service that provides updates to both the kernel and userland libraries for all Linux CVEs.
  • for administrators, a set of interfaces to install, revert and manage such splices, rooted in the uptrack tool.
  • for patch developers, a set of tools that allow (semi)automatic extraction and generation of splices from code changes.
  • for Ksplice developers, the code that makes all this possible, shared between the kernel framework that handles splices, the tools that generate them and the userland infrastructure.
The last two bullets happen behind Oracle walls and generate what customers and administrators ultimately see. Ksplice on Solaris is, today, in a different situation. If you dig through the 11.4 packages you'll find a kernel module (/kernel/drv/$arch/ksplice), a userland tool (spliceadm, for which there is a publicly accessible man page) and, if you look a bit further under the covers, a new SMF service (svc:/system/splice:default). On top of that, platinum customers have had a chance to experience the framework at play, receiving a couple of test splices for bugs encountered during the program. In a nutshell, the strictly technical foundations to generate runtime kernel patches in Solaris are there, but nothing is set in stone yet (and I personally don't know) about what the service will look like.

In this blog post, I'll walk through the technical side of Ksplice on Solaris and its evolution from the initial "hey, we should probably have this, too" conversation with Jan, through the legal evaluation to make sure that we were doing all the right things (necessary disclaimer: we did!), to what is in the repository today.

Why Ksplice


Before delving into the technical details, a small digression on why we embarked on the whole effort. Patching is a key step of every deployment/security strategy and one of those that rank highest on the risk analysis scale. Many are the horror stories of systems that do not come back successfully after patching, of legacy software that just breaks down, or of critical, unexpected security fixes that need to be rolled out quickly across an organization.

Solving patching pain and providing seamless updates is one of the greatest things that modern operating systems can do for users. At the same time, customers' needs also have to be captured: you can't expect someone to disrupt their operations every week for a patching window, just as you don't want anyone sitting on outdated software for too long.

With Solaris 11, we've done a tremendous amount of work to modernize and improve the patching experience, and you can see it touching pretty much every area of the system. We have a new packaging system, IPS, which ensures that things move forward coherently, and we leverage ZFS copy-on-write to provide lightweight boot environments that allow for easy rollback/fallback. We have SMF handling the job of restarting services on updates, so that you never end up running stale code, and fast reboot to quickly move across environments, saving long firmware POSTs.

Ksplice was just a great fit in this overall story, opening up the possibility of both improving the IDR experience (one-off patches that fix a specific customer issue) and offering customers a minimal-reboot train with security and critical fixes. As I've previously mentioned, at the time of writing there is no commitment by Oracle that any of the above will eventually be provided.

Basic Blocks


Ksplice is composed of four key parts: the generation tools, which compare and extract differences between compilation units, creating the necessary metadata to build the splices; the splices themselves, which are the fundamental patching blobs; the kernel framework, which loads and applies splices in memory; and the administrative tools, which let you configure the system for splice application/reversal and manually inspect splice state.

On the surface, Ksplice on Linux and Ksplice on Solaris look very similar: they both use a two-pass build process, creating compilation units pre and post the patch that are later compared, and the splice contents have corresponding metadata names (if you dump the ELF sections you'll see the familiar .ksplice_relocs, .ksplice_symbols, etc. sections). The splice format is also similar, with the so-called new_code and old_code pairs for each module target. But the similarities pretty much stop there.

The ON build infrastructure is fundamentally different from the Linux one and is controlled by lullaby. The work that Tim, Mark and James did there is a tremendous improvement over the old nightly world and is the foundation of our extraction process. The generation tools have also been, for the most part, rewritten and are based on our libelf implementation. libelf is basically the assembly language of ELF files: it gives you useful primitives to manipulate, read and generate ELF files, but doesn't do anything fancy on top of that (if you're used to the GNU libbfd way, you know what I mean). The kernel core is of course different and even the compilers are, since we use Oracle Developer Studio rather than GCC. We also have our own delivery mechanism, through IPS/pkg, and our own configuration (SMF) and reporting (FMA) interfaces, which spliceadm and the kernel framework consume.

In a nutshell, this was not so much a port but rather, as Scott Michael put it, "a technology transplant". Notwithstanding this, the help we got from the Ksplice team was huge. I've lost count of the number of chats/mails/random pings that I've sent up to Jamie and others while working on this and, in retrospect, keeping some of the building blocks in common (metadata, patch generation, validation and application steps, etc.) helped hugely.

As we were busy playing catch-up with the kernel world, the Ksplice folks also introduced userland splicing, which is a great addition towards a rebootless world, as you can now fix your behemoth applications at runtime when the next blockbuster library bug comes out. At the time of writing, this is not available in Solaris.

Preparing the Kernel


To simplify patch extraction and application, and for good measure, we want to reduce the changes resulting from a software fix to a minimum. In particular, the waterfall effect of relative offsets changing can be particularly nasty. To avoid that, we follow the Ksplice on Linux steps of building with fragmentation, separating each function and variable into its own section and thus transforming relative jumps/memory accesses into relocations (much easier to process and compare). The Studio idiom to enable fragmentation is -xF=func -xF=gbldata -xF=lcldata.
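
For illustration, compiling a unit with fragmentation enabled might look like this (the file name and the remaining flags are made up for the example; only the -xF idiom comes from above):

$ cc -m64 -xO3 -xF=func -xF=gbldata -xF=lcldata -c splicetest.c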

Running elfdump -c over a unit built this way shows fragmentation in action, as highlighted by the section names:
Section Header[4]:  sh_name: .text%splicetest_unused_func
 sh_addr:      0                   sh_flags:   [ SHF_ALLOC SHF_EXECINSTR ]
 sh_size:      0x1f                sh_type:    [ SHT_PROGBITS ]
 sh_offset:    0xce0               sh_entsize: 0
 sh_link:      0                   sh_info:    0
 sh_addralign: 0x20              

Section Header[5]:  sh_name: .text%splicetest_attach
 sh_addr:      0                   sh_flags:   [ SHF_ALLOC SHF_EXECINSTR ]
 sh_size:      0x68                sh_type:    [ SHT_PROGBITS ]
 sh_offset:    0xd00               sh_entsize: 0
 sh_link:      0                   sh_info:    0
[...]
Section Header[27]:  sh_name: .rodata%splicetest_string
 sh_addr:      0                   sh_flags:   [ SHF_ALLOC ]
 sh_size:      0x8                 sh_type:    [ SHT_PROGBITS ]
 sh_offset:    0x1318              sh_entsize: 0
 sh_link:      0                   sh_info:    0
 sh_addralign: 0x8               
[...]
Section Header[32]:  sh_name: .data%splicetest_dev_ops
 sh_addr:      0                   sh_flags:   [ SHF_WRITE SHF_ALLOC ]
 sh_size:      0x58                sh_type:    [ SHT_PROGBITS ]
 sh_offset:    0x1620              sh_entsize: 0
 sh_link:      0                   sh_info:    0
 sh_addralign: 0x10              
The above output is from an internal testing module, which we call splicetest, demonstrating that programmers shine thanks to their originality.

Fun story about fragmentation: the first time we enabled it for the SPARC kernel, we were greeted with an early boot panic. Turns out that SPARC uses a very simple boot allocator that has an imposed limit on the number - not total size - of allocations. In krtld (the kernel runtime linker) we use the boot allocator when parsing genunix, since better memory management will come from genunix itself later on. Parsing genunix means parsing an ELF file and allocating space for its sections: the driven-up number of sections, especially .rela ones, just exceeded the total number of available memory slots.

Luckily, we didn't have to modify the boot allocator, just collapse the sections back together again, as krtld would end up doing that anyway. We did this first through a linker script, and later the linker aliens promoted it to a linking feature for -ztype=kmod objects.

Fun story number two about reducing the footprint of changes: we build ON in two different ways, debug and non-debug. Normally you'd run the non-debug bits, but you can get the others through pkg change-variant debug.osnet=true. Internally, developers tend to run on the slower, but mdb-friendly, debug bits. In any case, we wanted splices for both, but for a long time we only worked with non-debug bits. At some point, we started testing our preliminary splice tools on debug units and the number of detected changes just exploded. Thank you very much, ASSERT() and VERIFY().

These developer-beloved macros include the line number in their output, via __LINE__, which of course changes with each source patch, waterfalling into all the functions that use either ASSERT() or VERIFY() and follow the fixed one. There are a number of cumbersome ways to reduce the noise, from playing games with blank lines to coding things up in funny ways, but we didn't really like that. Kuriakose and Jonathan came to the rescue by stealing a page from DTrace SDT probes and the special relocations that we use to signal them to the kernel runtime linker.

In practice, instead of placing the line number directly in the macro, we create a global variable with a reserved name that encodes the line number. This creates a relocation to a symbol whose name carries enough information for krtld to do a clever patching of the generated assembly code so that the number is directly returned. Similarly, this allows the Ksplice tools and core framework to properly identify the relocation to the special symbol and just skip it during comparison, bringing us back to a sane number of detected changes.

A central part of this implementation is visible in sys/debug.h, which is a publicly delivered file. Go take a look for some pure engineering joy.
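
To make the trick a bit more concrete, here is a hedged sketch of the shape such a macro could take. Every name below is invented for illustration (only assfail() is the real kernel routine); the actual construction, and its krtld counterpart, live in sys/debug.h:

/*
 * Hypothetical sketch: encode __LINE__ in the name of a reserved-prefix
 * symbol instead of embedding it as an immediate. The reference only
 * produces a relocation; krtld recognizes the prefix, extracts the line
 * number from the symbol name and patches the code so the value is
 * materialized directly, and the splice tools can skip the relocation
 * during comparison.
 */
#define	__LINENO_SYM(l)		__relsym_lineno_##l
#define	__LINENO_SYM_X(l)	__LINENO_SYM(l)	/* expand __LINE__ first */
#define	__LINENO()						\
	({ extern const char __LINENO_SYM_X(__LINE__)[];	\
	    (uintptr_t)&__LINENO_SYM_X(__LINE__); })

#define	ASSERT(EX)	((void)((EX) ||				\
	    assfail(#EX, __FILE__, (int)__LINENO())))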

Splices


The fundamental unit of patching is the splice. Splices are identified by a monotonically increasing, eight-digit ID number. We do this for a very specific reason: to prevent dim sum. We don't want customers to create unicorn configurations that we haven't tested in house, so we look at Ksplice fixes as a stream of changes, one on top of the other, rather than a collection to pick from. The idea is that this should also simplify your workflow. If a previous splice doesn't successfully apply for whatever reason, the framework won't allow the next one to go in.

Splices are regular kernel modules that get into the system through modload. We produce a pair of modules for each target module that we want to fix: a new_code module, which contains the updated contents, and an old_code module, which contains the contents expected on the running system, verified before attempting any splice operation. new_code and old_code need to be loaded in a specific order, but instead of stuffing this logic into a script or a tool, we use module dependencies to link them and to tie the whole splice together, thanks to an extra, Solaris-specific module that we call the dependency module. If a splice is delivered to your system, you can find the dependency module in /kernel/ksplice/$arch/splice-$id.

Recursively dumping this module's dependencies shows the interconnections and the targets (as outlined by our fresh new kldd tool, here in all its glory):
root@kzx42-01-z15:/kernel/ksplice/amd64# kldd ./splice-90000001 
        drv/splicetest_90000001_old =>  /kernel/drv/amd64/splicetest_90000001_old
        genunix_90000001_old => /kernel/amd64/genunix_90000001_old
        drv/splicetest_90000001_new =>  /kernel/drv/amd64/splicetest_90000001_new
        genunix_90000001_new => /kernel/amd64/genunix_90000001_new
        unix  (parent) =>       /platform/i86pc/kernel/amd64/unix
        genunix  (parent dependency) => /kernel/amd64/genunix
root@kzx42-01-z15:/kernel/ksplice/amd64# 

By virtue of modloading splice-90000001, splicetest_90000001_old and genunix_90000001_old get brought in as dependencies and each one brings in its _new counterpart. Later on, this chain allows us to keep only the new_code modules in memory and get rid of the old_code and dependency modules to save some space.

Splices also come with one extra module, known as module.kid or target.id depending on whether you talk to a Linux or a Solaris person. This module is an updated copy of the target module that contains the fix. The Ksplice framework interposes on the module loading code so that if you try to load a module that wasn't in memory at the time of splicing, we pick up the updated copy.

target.id can be a bit annoying in a reversal situation, because if the module has joined in as a dependency or is otherwise locked (e.g. a userspace application holding a descriptor to the device that the module provides), we can't unload it and, hence, can't reverse the splice. Reversing splices is something customers expressed fondness for, so we try to limit this situation as much as possible by loading any target module before running a splice application, de facto forcing a memory patch every time.

Could we have gotten rid of target.id, then? Unfortunately not, as it is still necessary for edge cases where we deliver a splice that fixes a module that isn't installed. If, later on, the module gets installed and loaded, we'd have no chance to splice it 'at runtime' (just imagine the can of worms that opens up if this operation fails for whatever reason), so we let the interposing code pick the right target.id copy.

Kernel Framework


The kernel framework is the heart and soul of Ksplice on Solaris. Splice operations start from an ioctl to the /dev/ksplice device, which is provided by the ksplice kernel module. This module contains the Solaris implementation of the run-pre algorithm, the preflight safety checks and the patching support. Along with the kernel module, a small portion of the framework is provided by genunix, mostly to maintain metadata and state information about the loaded splices. This split allows the ksplice module to be loaded/unloaded at will, so that we can update it at runtime.

Function patching is performed by placing a trampoline from the original function to the patched one. The trampoline is 5 bytes on x86 (jmp offset) and 12 bytes on SPARC (sethi, jmpl, nop) and so, by the sacred rules of self-modifying code, cannot be placed safely without stopping all the cpus except the one running the patching code. While the world is stopped, the framework also takes the chance to walk through all the existing thread stacks, looking for any target pointer stored there, as that might lead to inconsistencies or crashes after the patching. This operation, internally referred to as stack-check, needs to run fast, to prevent any network or cluster timeout/heartbeat from hitting.
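
As a hedged sketch (function name invented, x86 only), planting the trampoline boils down to:

#include <stdint.h>
#include <string.h>

/*
 * Redirect old_func to new_func with a 5-byte jmp rel32; rel32 is
 * relative to the first byte after the jmp instruction. On SPARC the
 * equivalent is the 12-byte sethi/jmpl/nop sequence. The real framework
 * of course only does this while every other cpu is stopped.
 */
static void
plant_trampoline(uint8_t *old_func, uint8_t *new_func)
{
	int32_t rel = (int32_t)(new_func - (old_func + 5));

	old_func[0] = 0xe9;			/* jmp rel32 opcode */
	(void) memcpy(&old_func[1], &rel, sizeof (rel));
}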

Fun story about stack-check. For a while we just hadn't paid attention to how long the operation was taking, because testing machines tend not to have much traffic or network-sensitive applications on them (the operation time grows linearly with the number of processes). The original stack-check algorithm was kind of simplistic, starting from the top of the stack and comparing 8 bytes at a time all the way down, but effective. It also felt fast enough.

Later on, reality kicked in, especially on SPARC where stacks are significantly larger compared to x86. Our clustering code started panicking here and there with heartbeat timeouts and that very quickly became a P1 bug. We worked out a quicker, but slightly riskier, algorithm in which we walked the stack frame by frame and only evaluated function linkage data (e.g. return addresses or passed-in parameters). That relieved the problem, but was still somewhat close to the time limit when testing with a very large number of processes. On top of that, for splices removing a symbol, we still had to somehow make sure that no local variable contained a reference to it, or fully embrace the yolo mentality. Basically, we had duct-taped the issue, but not really solved it.

Turns out that there is a third, much better way: instead of performing the whole stack check while cpus are stopped, we perform an initial pass while the world is running. If we hit a conflict we back off for a bit and try again, rinsing and repeating up to three times before definitively bailing out. If we pass this step, then we stop the world and re-run the stack-check, but this time we skip all the threads that haven't had any cpu time since the last check, as they haven't had a chance to make progress. This takes away a huge chunk of stack walking and makes things fast, so fast that we default to the full stack check again (but keep frame checking around for good measure and even compare the two on debug kernels).
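
In pseudo-C, the final flow looks something like this (all names invented for the sketch; the real code obviously has a lot more going on):

/*
 * Two-phase stack-check: optimistic passes while the world runs, then
 * a stop-the-world pass limited to threads that got cpu time since
 * the live check.
 */
int
stack_check(splice_t *sp)
{
	int attempt;

	for (attempt = 0; attempt < 3; attempt++) {
		snapshot_thread_runtimes();
		if (check_all_stacks(sp) == 0)
			break;			/* no conflicting pointers */
		back_off_a_bit();		/* conflict: retry shortly */
	}
	if (attempt == 3)
		return (-1);			/* definitively bail out */

	stop_the_world();
	/* only threads that ran since the snapshot can have changed */
	if (check_stacks_of_threads_that_ran(sp) != 0) {
		start_the_world();
		return (-1);
	}
	apply_patches(sp);			/* plant the trampolines */
	start_the_world();
	return (0);
}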

Fun story about stack-check and SPARC, take two. At some point, all splice applications on SPARC started failing with a stack-check violation. Every single one of them had the issuing process (spliceadm) hitting a false positive in its own call chain. We hadn't made any significant recent change to the algorithm, just some code reordering, so this was even more puzzling. First came the frustration-induced, draconian idea: always ignore stack-check failures that come from the thread that is running within the ksplice code path. Basically functioning, but really not pretty - so we kept debugging.

Oh beloved register windows, we meet again. Turns out that our code reordering led to the compiler leaving some of the to-be-checked pointers in registers that survived across a register window and ended up in the next one, happily saved onto the stack right before the full stack-check. We solved this by adding a clear_window() routine that exhausted all the register windows and repeatedly set all registers to 0, so that we could start from a clean state. Small, cute and elegant - this worked for a while, until at some point false positives started popping up again.

On SPARC there is extra stack space reserved for aggregate return values and an extra area for the callee to store register arguments. If this extra space ends up unused and unluckily aligned over some dead stack that contains the pointers we played with in the framework to prepare the check, a false positive arises again. As much as we had ways to solve this by rearranging the code, it felt fragile over time, so on top of the register window clearing we now also zero out all the dead stack before walking down the stack-checking algorithm, making sure to do that from a call site that is higher than the shortest depth the algorithm can hit.

Ksplice and DTrace


Along with stack-check, the most interesting safety check that we run is the one that guarantees interoperability between Ksplice and DTrace. Actually, this is more than just a safety check, as these two really like to fiddle with .text segments and have to communicate to avoid stepping on each other's toes.

The story of DTrace support is fairly tortuous and spans a few years before we got to its final form, with various people alternating and, occasionally, walking down deep and dark alleys. If there is one thing that I've learned from this, it is that failure is, indeed, progress. We had to prove to ourselves that some of the ideas were batshit crazy to really reach the final state we're now happy with.

Let's start with the problem to solve. DTrace has two unstable providers that interact with the text/symbols layout: FBT (Function Boundary Tracing) and SDT (Statically Defined Tracing). The former places probes at each function entry and return point, while the latter needs to be explicitly written into the source code and allows the programmer to collect data at arbitrary points within a function. They are both "unstable" as they are intimately tied with the kernel implementation, which we reserve the right to change at will.

One of the key ideas behind Ksplice is that things get updated, but you really don't notice. As an example, we take care not to change user/kernel interfaces with it. When it comes to DTrace scripts, ideally we'd want something written prior to a splice to keep working even if the splice has detoured execution of one of the traced points. Defining "working" is the big deal. The instability of the SDT and FBT providers gives us a bit of leeway, but we have internal products that we want to splice and that rely on SDT/FBT behavior (e.g. ZFSSA). Also, it would be silly not to strive for the best possible experience with one of Solaris' finest tools, of course always factoring in the complexity.

Here is what we came up with. First of all, we need to distinguish between two macro scenarios: a script is running, or a script has been written but will be started later. In the first case, if it is currently enabling SDT or FBT probes within units that we need to evaluate or consume (e.g. run-pre/splice framework), we abort the splice operation and return the list of such scripts/pids to the admin. Trying to do anything on our own only leads to too much complexity. Say that we temporarily stop the script, do the patching and the logic of the function changes - would the script still make sense? What if the script tries to access a parameter that we no longer pass? What if the function was even deleted? Better to have the admin relaunch the script and DTrace catch all these situations. This also solves the problem of DTrace modifying the .text segment of functions that we need to compare, as we ensure that no DTrace script will ever interfere during a splice operation.

For the second scenario, whereby a script exists but will be (re)launched after the splice operation, there are a couple of troublesome situations:

  • Every patched function is inside a new module (the new_code) and part of the 4-tuple that identifies a DTrace probe point (provider:module:function:name) relies on the module name. A script may think it's enabling the right SDT point, but it might be the "old" one and never fire.
  • DTrace providers are loadable kernel modules and build the list of probe points when loaded, by parsing all the already loaded modules. On top of that, there are hooks at every modload/modunload. Building the list means, for FBT, walking the symbol table and finding entry/exit points by pattern matching on known prologues/epilogues. Ksplice patches the prologue, so the view of a module pre and post a splice has a different number of entries and can lead to stale contents. Stale contents with DTrace are a panic waiting to happen.
  • Users might be confused if all of a sudden more than a single probe is enabled for a tuple that doesn't specify the module name (new_code functions maintain the same name as the target ones).

We solve these problems differently for SDT and FBT. For SDT we implement what we call probe translation, so that the new_code SDT probe, if present and identical, overwrites the one from the patched function. The opposite operation happens during reversal, restoring the old SDT chain.

For FBT, we bite the bullet of letting the tuple change with respect to the module definition. Say you have a script that hooks on fbt:splicetest:splicetest_math:entry and we patch splicetest_math; that script won't work anymore, because after the splice splicetest:splicetest_math no longer has an expected prologue and is not recognized by DTrace as a valid point. Similarly, splicetest_math:return also goes away, solving the problem of an FBT return probe that never fires. Scripts in the form fbt::splicetest_math:{entry|return} instead just work seamlessly, as the last new_code module in the chain will be the only one providing the symbol. This form is by far the most common and the one we use internally, so we "optimize" for it.

The above sort of works on x86 with the existing code, just by calling into the DTrace modload/modunload callbacks, but is a total mess on SPARC. This is because on SPARC probes are set up through a two-pass algorithm: the first pass counts the number of probes and allocates the necessary handling structures, the second populates them. The simplistic calls into the modload/modunload routines would find a pre-allocated table and things would go south from there. It was also a bit gross, reflecting the attempt of a Ksplice person at doing DTrace-y things, which is a classic marker of bad design.

Thankfully, Tomas Jedlicka and Tomas Kotal came to the rescue by designing and implementing a much better interface in DTrace, which introduces a new probe state, HIDDEN, that behaves like DISABLED but cannot ever be enabled. Its whole point is to stay around keeping metadata information. The only transitions allowed are from HIDDEN to DISABLED and vice versa.

The HIDDEN state captures all the splice interaction scenarios. The target module is spliced and later parsed by FBT? All the spliced points get included in the list of probes, but marked HIDDEN. The splice is lifted? The probes become DISABLED. The list has already been built, but we apply a splice? No problem: just get the list of targets from Ksplice and mark the associated probes HIDDEN.

The HIDDEN concept lives at the framework level and the same goes for the new refresh callback, introduced so as not to overload modload/modunload and now consumed by Ksplice. By making these changes at the framework level, any future provider that needs to react to splice operations already has all the necessary entry points in place. On top of that, we also provide a couple of helper functions to request the original function contents (in case one wants to walk the .text segment as if the splice wasn't there) or the list of targets/symbols of a splice operation.

As of today, FBT and SDT are the only two consumers of the above.

User Experience


All the architecture, code, cute designs and long debugging sessions are pointless if you don't make your stuff usable. Staying with the idea that things get updated but you really don't notice, applying a splice to the system is as simple as installing/updating any other package, which, not to brag, is so damn cool (I might be biased by the amount of manual loading I've done during development). This is achieved through the SMF svc:/system/ksplice:default service, which coordinates automatic splice operations.

This service is responsible for four main things:
  • apply splices on delivery, by getting refreshed by pkg
  • control freezing and unfreezing of splices
  • on a reboot, apply all the splices at boot time
  • collect and store splice logs
Freezing is a Ksplice-on-Solaris-specific concept, rooted in the fact that splices have a monotonically increasing ID. At any point in time, an admin can specify the maximum ID the system can be at. If splices with a bigger ID are currently applied, they get reversed; if new splices with a bigger ID get delivered, they are not loaded. The idea of freezing is to capture scenarios where admins want to download the splices, but still apply them during a quiet period (to maximize the chances of success) or a potential downtime window (for a new technology such as ours, some testing of the waters has to be expected). It also provides a very simple instrument to temporarily blacklist a problematic splice while we frantically work on fixing it. Of course, we never release problematic splices, so you will never need that - right? If that were ever to happen, though, we also leverage the freezing concept to prevent reboot loops, by leaving a grace period before a freshly applied splice is also applied on reboot.

Freezing is controlled by spliceadm(1M), through the freeze <id> and unfreeze commands, and highlighted by the status command. These three commands, along with log, are the only ones you should ever have to interact with for regular administration of Ksplice on Solaris, but we also provide a few more for our support folks to troubleshoot issues and manually interact with splices (apply/reverse/cleanup).

Lastly, there is spliceadm sync, which is what the SMF method calls. Its job is to walk the list of existing splices on the system and compare it with the freeze configuration to establish the list of splices to apply or reverse.
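
To give a flavor of the workflow, here is a plausible session sketched from the subcommands above; the ID and the elided output are made up, so treat it as illustration rather than real spliceadm output:

# spliceadm status
[... applied splices and the current freeze point ...]
# spliceadm freeze 90000005
# spliceadm sync

Freezing at the (hypothetical) splice 90000005 caps the system there; the sync that follows reverses anything newer and applies anything at or below the freeze point that is still pending.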

The spliceadm man page describes the command in detail and you can bet that, when the first splice is out, a lot more documentation with examples and screenshots will be available. Since I'm now a user and no longer a developer, I'm really looking forward to that.

Closing Words/Shoutouts


This project was huge and a number of people joined in at various stages to help along, since the early days when Jan Setje-Eilers dragged me into this under Scott Michael's managerial supervision. Kuriakose Kuruvilla and Pete Dennis have been a stable part of the "Solaris Ksplice Team", Rod Evans and Ali Bahrami (the linker aliens) joined mid-way and made the tooling and krtld so much better, and Mark J Nelson is one of the three people in the organization who understand everything lullaby does and who can express desires in Makefile form; if the infrastructure has gotten this efficient and anywhere near sustainable, it's mostly thanks to his magic-fu. Xinliang Li and Raja Tummalapalli have both tolerated our occasional "what if we do that?" and turned it into code. The testing infrastructure was Albert White's work, and gate autografting and management were Adam Paul's and Gabriel Carrillo's bread and butter.

Bottom line, I mostly just got to tell the story :-)

Friday, March 16, 2018

Sunset

As of today, I'm no longer an Oracle employee and no longer work on the Solaris (or, briefly, Linux) kernel. I'm not very good with goodbyes; even my 'out of here' mail had just one line about the past 9 years: "Was Fun".

And it really was. I've had a blast and learned a ton. There are five things that I ridiculously and perhaps irrationally loved about Solaris and the organization:

  1. Code should be beautiful: as in every big project, there are strict rules about the C style, to keep the overall aspect coherent. On top of those, there are a few extra optional rules, such as those that the VM guys have used in the VM2 code. As much as I love those, the important takeaway for me has been to strive to make the code not just functioning, but visually and technically pleasant. The extra time you spend there pays back in dividends down the line, when you have to look at it again. I really hope ON will be open one day, as there are pieces written by Blake, Jonathan or the linker aliens (just to name a few) that are pure art. There are even folks that can make Forth look nice. How cool is that?
  2. Integrate the hell out of everything: every piece of Solaris technology leverages the existing ones as much as possible. The ARC (Architecture Review Committee) won't let you integrate something that reinvents the wheel. Well, it will try not to, as we're guilty of some "needs to go in, will come back and fix after $major_deadline" (as an example, we still have way too many malloc() implementations), but overall, the integration is great. If you invent a new command/subsystem today, you'll have to store the configuration in SMF, follow the output formatting rules, report issues through FMA and make sure that, if necessary, there are the proper Analytics hooks. This also ensures that you both consume and are consumed, which keeps you honest as you write stuff (personal mantra: never, ever integrate something that doesn't have a consumer).
  3. SPARC relationship: building a chip in house and supporting it opens up tons of fascinating chances for learning about hardware and software interaction. Software in Silicon is perhaps the best-known and most successful public example, but many of the hardware meetings I attended with my software hat on were incredibly fascinating. Getting the hardware-side perspective on things helps in understanding certain design decisions and in relating to a whole different set of issues. My love for low-level stuff just grew a little bit further there.
  4. Kernel and User Land unite: the ON code base contains both the kernel code and the key userland pieces, such as the system libraries, some commands and the linker. This means that in a single project you can modify all of them and have them working together, without having to move across different consolidations/organizations and respect different putback timings. I understand this sounds (and probably is) quite obvious, but I can't shake the happiness I felt adding the secsys() system call, the secext framework, sxadm, the linker changes and the libc modifications in a single push. It felt like having control over the whole world.
  5. SCT meetings: every year, all engineers from around the globe met for a week of presentations and hallway chats. A truly exciting and exhausting week, which gave me, year over year, a thermometer of how much I was really contributing (the first year I almost didn't participate in any conversation; towards the last years, I was grabbed here and there). Oh, and it had the traditional Beer and Sausage outing with the Ksplice, Security and Linker folks in San Jose, at Original Gravity. This last one is a tradition I hope to maintain every time I hit the Bay in the future.

I'm leaving the popular "the people" out of the list, as I believe (with proof ;)) that there are great people in every company and I'm truly looking forward to the next round of meeting and learning. At Sun and then Oracle, I've had the luck of always working in cool teams, with cool people (a lot of this luck has to be attributed to Jan, who's constantly been my mentor), many of whom have been around for 10+ years, creating pretty strong relationships. I could fill a whole page with names and stories about different folks, but I would surely forget someone and be called out on it, so I take the easy way out. Just like I did with the mail and its single line: "Was Fun". Which really meant, "Thank you".

Sunday, February 25, 2018

libc:malloc meets ADIHEAP

Oracle Solaris 11.4 comes with ADIHEAP, a new security extension that acts as a management interface for allocators that implement ADI-based defenses. In this blog entry we'll walk through the implementation of ADIHEAP within libc:malloc in Solaris.

Background


So why ADIHEAP at all? Shortly before the advent of libadimalloc, I started thinking about a better way to integrate ADI in the Solaris ecosystem. I love the technology, but libadimalloc, while doing its job as a testing library, wasn't cutting it for production environments. LD_PRELOAD is a poor controlling interface and relinking existing applications just wasn't happening. It didn't really help that libadimalloc wasn't seen as a viable non-ADI allocator and hence had no consumer out of the box.

My vision for ADI was a bit different, involving the main Solaris allocators (libc:malloc, libumem and libmtmalloc) supporting ADI-based defenses and the Security Extensions Framework acting as a more advanced and coherent controlling interface. That's when ADIHEAP was born.

ADIHEAP brought all the usual security extensions goodness:
  • progressive introduction to sensitive binaries, through the tagged-files model and binary tagging (ld -z sx=adiheap=enable)
  • ease of switching between enabled/disabled state, especially system wide (capture different production scenarios)
  • simplified/advanced testing through sxadm exec -i and the ability to unleash ADI checks over the entire system with model=all
  • reporting through the Compliance framework 
  • kernel and user process cooperation. The kernel knows whether the extension will be enabled on the target process and can do operations on its behalf. In particular, this vastly simplifies ADI support in brk-based allocators, since the kernel now pre-enables ADI over the brk pages.
Of course, ADIHEAP by itself does very little without library support. The choice of the first ADIHEAP consumer (never introduce something without a consumer!) fell on libc:malloc, as it was/is small, self-contained, (still) widely used across the system and implemented as a non-slab, brk-based allocator, which provides an interesting alternative to the mmap-and-slab-based libadimalloc example.

libc:malloc


The implementation of libc:malloc hasn't changed much over the years and an older version can be found from the old open-source days. We'll use this public implementation as a reference, since the main goal of this entry is to show how an allocator can be evolved to incorporate ADI checks. Keep in mind that some of the code described here may not strictly apply to the 11.4 Oracle Solaris codebase.

libc:malloc is composed of two main files, mallint.h (which contains general definitions for the allocator) and malloc.c (which has the bulk of the implementation). It's a best-fit allocator based on a self-adjusting tree of free elements grouped by size. Element information is contained inside the chunk itself and described by the TREE structure:
 110 /* structure of a node in the free tree */
 111 typedef struct _t_ {
 112         WORD    t_s;    /* size of this element */
 113         WORD    t_p;    /* parent node */
 114         WORD    t_l;    /* left child */
 115         WORD    t_r;    /* right child */
 116         WORD    t_n;    /* next in link list */
 117         WORD    t_d;    /* dummy to reserve space for self-pointer */
 118 } TREE;
Free objects use all the elements of the TREE structure and also have a pointer to the start of the chunk at the end of the buffer (which basically mimics the t_d member for chunks larger than sizeof(TREE)).

Allocated objects, instead, only use the first element, which contains the size of the chunk. The first element is ensured to be ALIGN bytes in size:
 103 /* the proto-word; size must be ALIGN bytes */
 104 typedef union _w_ {
 105         size_t          w_i;            /* an unsigned int */
 106         struct _t_      *w_p;           /* a pointer */
 107         char            w_a[ALIGN];     /* to force size */
 108 } WORD;
so that the data portion is guaranteed to start at the required alignment boundary, which is 16 bytes on 64-bit systems.

Since every chunk is guaranteed to be aligned to, and a multiple of, ALIGN, the last bits of the size member are equally guaranteed to be zero. libc:malloc exploits this to store further information in the last two bits: BIT0 specifies whether the block is in use (1) or not, and BIT1 specifies, for blocks in use, whether the preceding block is free. This information is used at free time to handle the coalescing of free blocks (adjacent free blocks are always coalesced, in order to reduce fragmentation).
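
In code, the bit games look roughly like this (a sketch in the spirit of the public mallint.h; the actual macro definitions there differ slightly):

/* the last two bits of the size word are flags, the rest is the size */
#define	BIT0		0x1	/* this block is in use */
#define	BIT1		0x2	/* in-use block: the preceding block is free */
#define	SIZE(b)		((b)->t_s.w_i & ~(BIT0 | BIT1))
#define	ISBIT0(w)	((w) & BIT0)	/* block in use? */
#define	ISBIT1(w)	((w) & BIT1)	/* previous block free? */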

Allocations smaller than the TREE structure itself would waste a lot of space on the header, so the allocator provides a specialized path: smalloc(). Small allocations are based on an array of linked lists of free chunks. Each entry of the array satisfies a multiple-of-WORDSIZE requirement, so that List[0] contains 16-byte chunks, List[1] 32-byte chunks and so forth, up to sizeof(TREE).

libc:malloc grows its memory footprint through sbrk() in the _morecore() function, which extends the Bottom block, the last free block available. As is common in this kind of allocator, the Bottom block is treated a bit specially.

Introducing ADI defenses to libc:malloc


Porting an allocator to use ADI requires solving three main problems: buffer alignment/size, the versioning scheme to use, and minimizing performance degradation. The first two end up directly or indirectly affecting the third. It's also paramount for the allocator to run with basically the same speed and memory consumption when the ADI protections are disabled, so that applications that do not have ADIHEAP enabled do not suffer any potential regression.

The first step in opening the allocator to the ADI world is to change the minimum unit of operation to 64 bytes. Looking at the libc:malloc code, this means moving ALIGN to a value of 64. Unfortunately, in libc:malloc, the meaning of ALIGN is overloaded: it's both the alignment that malloc() guarantees for the returned pointer and the alignment that chunks have in memory within the allocator. While these normally align (no pun intended), with ADI they are decoupled: malloc() still lives happily with 16-byte aligned pointers, but we need 64-byte aligned chunks backing them. This can be a common scenario in allocators, but it is luckily also a fairly simple one to fix, by capturing the decoupling into different variables.

Actually, the best thing to do here is to rely as little as possible on static defines and have the alignment and other requirements extracted at runtime, through an init function. The init function is also the best place to learn whether we are going to enable ADI protections or not, by consuming the sx_enabled(3C) interface from the security extensions framework:
    if (sx_enabled(SX_ADIHEAP)) {
         ...do ADI specific initialization...
         adi_enabled = B_TRUE;
    }
The ADI specific initialization should probably leverage the ADI API to collect information about the versioning space, the default block size, etc (see the examples in the mentioned blog entry).

Done with the alignment, we need to concentrate on the allocator metadata and how it affects chunk size and behavior. In the reference libc:malloc implementation, all the metadata concentrates in the TREE header, which is WORDSIZE * 6 = 16 * 6 = 96 bytes in size. This would dictate a minimum unit of allocation of 128 bytes, which is a bit scary for workloads that allocate lots of small objects. Ideally we'd like the minimum unit to be 64 bytes, and we can actually achieve that here with a more clever use of the data types. In fact, only the very first member of the TREE structure needs to be 16 bytes in size, while all the others (with the exception of some implications for the last member) don't. This allows us to get rid of the WORD union padding and just declare the members as struct pointers, which are 8 bytes in size. By keeping the first and last members as proto WORDs and the rest as pointers we get to 16 * 2 + 8 * 4 = 64 bytes, which is exactly our goal.
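
A sketch of the resulting 64-byte layout (field names mirror the public structure shown earlier; the actual 11.4 declaration may of course differ):

/* 64-byte TREE: only the first and last members keep the WORD padding */
typedef struct _t_ {
	WORD		t_s;	/* 16 bytes: size of this element + flag bits */
	struct _t_	*t_p;	/*  8 bytes: parent node */
	struct _t_	*t_l;	/*  8 bytes: left child */
	struct _t_	*t_r;	/*  8 bytes: right child */
	struct _t_	*t_n;	/*  8 bytes: next in link list */
	WORD		t_d;	/* 16 bytes: self-pointer slot */
} TREE;				/* 16 * 2 + 8 * 4 = 64 bytes */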

Now that we have the minimum unit down to the cache-line size, we can start thinking about small allocations in the smalloc() path. There are basically two options: keep smaller buffers together under a common version, hoping not to have small controlled overflows between them, or ditch the smalloc() algorithm altogether and declare that any allocation gets at least one 64-byte chunk. Simplicity trumps memory usage by a fair bit here, and we get the added benefit of a more secure implementation, so we say goodbye to smalloc() when ADIHEAP is enabled. With the proper runtime adjustment of MINSIZE (another good candidate for a variable set up during the init path), the smalloc() path is just never entered.

With size/alignment solved, it's time for the versioning scheme, which again requires evaluating the metadata behavior to make a decision. Either we isolate the header into its own 64-byte chunk, but then we de facto extend the minimum allocation unit back to 128 bytes, and that's not good, or we just accept in the threat model that an underflow can target the size member. It's a tradeoff, but we can make a case that underflows are significantly less common than overflows and that it has to be a controlled underflow (anything above 16 bytes would trap). It's not uncommon for security defenses to take tradeoffs, and certainly we could provide a paranoid mode where the metadata is decoupled. I'm usually very averse to complexity and knobs, but that must not come at the cost of ineffective implementations. In this case, the major feature of ADI (detecting and invariantly preventing linear overflows) is maintained, so we can go for the simpler/faster implementation.

The versioning scheme is pretty standard for allocators that don't have isolated metadata: we just need to reserve one version (1 is a popular choice) for free objects. At allocation time, instead, we're free to cycle through the rest of the versioning space. There are usually two different algorithms for versioning objects, depending on whether the relative position between chunks is known and fixed at allocation time or not. For libc:malloc it clearly isn't, so we go for "randomization" plus verification. The fake randomization is based on RDTICK (the timestamp register) and improved with left and right verification.

Each time a version is selected, it is compared with the preceding chunk's and, if identical, incremented by one. The resulting value is then compared with the following chunk's and, if identical, incremented again, taking care to loop back to 2 when hitting the end of the versioning space. This ensures the most important property for an ADIHEAP-capable allocator to act as an invariant against linear overflows: no two adjacent allocated buffers are ever found with the same version.
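
Put into code, the selection could look roughly like this (a sketch: rdtick() is an invented stand-in for the RDTICK read and maxversion comes from the init path; the real code also has to cope with chunks at the edge of the brk space, as we'll painfully discover below):

#include <sys/types.h>

/* versions 0/15 are universal matches and 1 is reserved for free
 * chunks, so allocated chunks cycle through [2, maxversion] */
static uint_t
pick_version(uint_t prev, uint_t next, uint_t maxversion)
{
	uint_t v = 2 + (uint_t)(rdtick() % (maxversion - 1));

	if (v == prev)				/* left collision: bump */
		v = (v == maxversion) ? 2 : v + 1;
	if (v == next)				/* right collision: bump again */
		v = (v == maxversion) ? 2 : v + 1;
	return (v);
}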


Run, Crash, Fix, Rinse and Repeat


ADIHEAP and the above prototype come along pretty nicely and seem to pass the most obvious testing of doing a bunch of malloc() calls and checking the returned versions. It's with a lot of confidence that I type sxadm exec -s adiheap=enable /bin/ls, already imagining how the mail showcasing ADIHEAP to my colleagues will look.

Segmentation Fault (core dump). Almost immediately and within libc:malloc.

Turns out that any non-trivial sequence of memory operations requires coalescing and tree rebalancing, which make heavy use of the LAST() and NEXT() macros.
 145 #define LAST(b)         (*((TREE **)(((uintptr_t)(b)) - WORDSIZE)))
 146 #define NEXT(b)         ((TREE *)(((uintptr_t)(b)) + SIZE(b) + WORDSIZE))
LAST() takes as an argument a TREE pointer and subtracts WORDSIZE. This effectively accesses the last 16 bytes of the previous chunk which, when free, contain the self-pointer to the chunk itself. In other words, LAST() reaches back to the previous free chunk without knowing its size. NEXT() takes a TREE pointer, adds the size of the chunk and the extra WORDSIZE that corresponds to the size metadata at the start, effectively accessing the next adjacent chunk. This is generally fine and dandy, except that with ADIHEAP the previous and next chunk most likely have mismatching versions and the segmentation fault is inevitable.

Once again we have two ways to fix this: we can either declare NEXT() and LAST() trusted paths and have them use non-faulting loads that ignore ADI tagging, or we can read the previous/next ADI tag and fix up the pointer ourselves before accessing it. The first solution is a bit faster, but feels and looks more hackish in an allocator, so we go for the latter, rewrite NEXT() and LAST() as functions and wait to see how bad the performance numbers look. Turns out they are not that bad, so we go for it.
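
For illustration, the version-aware NEXT() amounts to something like this; tag_of() and with_version() are invented stand-ins for reading the physical tag behind an address and planting a version in a pointer's top bits:

/* NEXT(), rewritten as a function: read the adjacent chunk's ADI tag
 * and fix up the pointer before anyone dereferences it */
static TREE *
adi_safe_next(TREE *b)
{
	caddr_t raw = (caddr_t)b + SIZE(b) + WORDSIZE;

	return ((TREE *)with_version(raw, tag_of(raw)));
}
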
darkhelmet$ sxadm exec -s adiheap=enable /bin/ls
0_sun_studio_12.4  ssh-XXXX.xbadd     volatile-user
hmptemp            ssh-xauth-xSGDfd
darkhelmet$
Woot. I'm the coolest boy in town. Let's see what else we can run with ADIHEAP enabled. /bin/w ? Done. /sbin/ping ? Yeah. /bin/bash ?

Segmentation Fault (core dump). Once again within libc:malloc.

A few imprecations later, the fulguration on the road to Damascus: while our brk space is backed by ADI-enabled pages, other adjacent memory isn't. Our versioning algorithm, when it allocates a new version, checks the previous and next chunk, but what if there is no previous or next chunk at all, because we are right at the start or at the end of the brk space (something that ASLR had hidden from previous tests)? We core dump attempting to use an ADI API over non-ADI-enabled pages.

libc:malloc only keeps track of where the brk ends, but we also need to know where it started. In general, you may have to keep track of the boundaries of your allocated space, especially if it is fragmented in memory. In this case it's simple: we just store the starting address in our init function.

The above are just two examples of the many joy, segmentation fault, despair, debug, fix loops that touching an allocator brings. I'll spare you most of the others: the off-by-one in the versioning code (which led to certain chunks getting adjacent ones retagged), fixing and then fixing again memalign(), discovering that emacs takes an entire copy of the lisp interpreter at build time, dumps its memory and restores it back when executed on another system, so you've got to work around that to make it work (ignoring free()s and replaying any realloc() as if it were a new malloc() does the trick), and fixing python's clever memory management code (see the objmalloc patch for an example of using a non-faulting load in a performance-sensitive code path).

Python, why you so slow?


The first performance numbers out of libc:malloc are encouraging, but while testing the objmalloc patch on Python things look ridiculously slow. A pkg solving operation that takes 30-40 seconds with the standard libc bumps up to minutes with ADIHEAP enabled, a 5x regression which is as bad as it is unexpected.

Oracle Studio Performance Analyzer is a friend here and I start collecting samples to figure out where the problem lies. One outlier shows up: some PyList_Append() calls seem to spend an awful lot of time in realloc() compared to other paths that call into it, even other invocations of the same function. Some DTrace further isolates the realloc() sequences in PyList_Append(), identifying an interesting pattern, whereby PyList_Append() allocates a small amount of memory and reallocates it upward multiple times in small chunks.

I take a long stare at the realloc() code, but nothing seems wrong and there aren't really any changes from the ADIHEAP code: adjacent blocks are checked and if a fitting one is found, it is merged with the existing one and the rest is freed.

And then it hits me.

Some of these reallocation calls extend into the Bottom chunk, which is a couple of kilobytes in size.
Here is what happens:

  • a small 64 byte chunk is allocated
  • the chunk ends up being right next to the Bottom chunk
  • the chunk is reallocated extending its size to 128/256 bytes
  • the bottom chunk (which is 8/16K bytes) is split into two parts: the extra 64/192 bytes needed for the reallocation and the remaining 8/16K - 64/192 bytes
  • the chunk version is extended into the extra 64/192 bytes
  • the free version is stored on the remaining 8/16K bytes

Multiply the above by a couple of incremental loops and we are doing kilobytes and kilobytes of unnecessary tagging! No kidding this is killing performance.

I rewrite the free-tagging code with an extra check, which evaluates the first and last 64-byte block of the to-be-freed chunk: if both versions match the free version, it means we have been called on a split free block, so we have nothing to do and can just return. With this simple check alone, the 5x regression goes away and performance is almost on par with the non-ADIHEAP case.
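
The check itself is tiny; sketched here with the same invented tag_of() helper as above, and equally invented first_blk()/last_blk() stand-ins:

/* both ends already carry the free version: we were handed the split
 * of an already-free block, so there is nothing left to retag */
if (tag_of(first_blk(ptr)) == free_version &&
    tag_of(last_blk(ptr, size)) == free_version)
	return;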

Almost, though. There is still a bit too much time spent in _morecore() compared to the vanilla libc case. The impact is significantly smaller, but it's nonetheless noticeable in the Studio analyzer output.

The lesson learned with realloc() sort of applies to _morecore() as well: when we extend the brk space, we ask for a couple of pages and mark them as free. Once again, we're tagging kilobytes and kilobytes of memory unnecessarily, because those pages come fresh (and empty) from the kernel. Just as we used the presence of the free tag as a sentinel in realloc(), here we use the presence of the universal tag, and limit ourselves to tagging only the first and last 64-byte chunks of the freshly received memory. Similarly, during a split, we check again for an underlying zero tag and convert the first chunk to the free tag.

This works because ADI allows memory that is physically tagged as universal match (0 or 0xF) to be accessed through a pointer carrying any tag, and tag 0 is what the kernel hands back by virtue of zeroing the pages out. With this extra change there are no more unexpected slowdowns and the pkg operation runs almost as fast as with the vanilla libc.

Debugging Easter Eggs


The ADIHEAP libc:malloc implementation is geared towards production scenarios, but since I was there, I left a couple of debugging easter eggs that you might find interesting. You can turn these on through the following environment variables:
  • _LIBC_ADI_PRECISE: when set, the application runs with ADI set to precise mode. This is handy when you have a store version mismatch and you want to pinpoint exactly where it happened. 
  • _LIBC_MALLOC_ZEROFILL: when set, malloc() behaves just like calloc() and zeroes out a buffer before handing it out (useful in certain cases where memory is freed and later reused, but not initialized properly). This is implemented regardless of whether ADIHEAP is enabled, but is kind of cute with ADIHEAP, since we can use adi_memset() rather than adi_set_version() and get it basically for free.
  • SXADM_EXTEND: ADIHEAP and ADISTACK can uncover a large number of latent bugs and hence bring disruption to software that otherwise "just works" (think of an off-by-one read outside the allocated chunk, which is generally non-fatal for an application, but immediately traps under ADIHEAP). For this reason, model=all is not exposed for ADIHEAP and ADISTACK. When this variable is set, model=all can be used there too, e.g. through SXADM_EXTEND=1 sxadm enable -c model=all adiheap. I'm a huge fan of model=all for testing and that's how I run my T7 environments, in order to catch bugs left and right (and it did catch many).

Final Words


I've been waiting to write this blog post for quite some time now and I'm glad this stuff is finally out. The most important message that should get conveyed is that, while there might be tricky corner cases, any default/system allocator can be converted to use ADI and benefit from the added security checks that come with it.

If you have a T7 and an Oracle Solaris 11.4 copy, you can now go out and play with sxadm and ADIHEAP (and ADISTACK, more about that coming). I look forward to hearing about your experience!

Monday, September 18, 2017

Getting started with ADI

After a number of entries on different uses of ADI, it's time to get our hands dirty and walk through the C API that lets us experiment with it directly. The whole API set is quite simple and we'll use the Solaris implementation as a walkthrough. Linux is getting ADI support as well and, while the interfaces are going to be a bit different (as an example, Linux is fond of mprotect() whereas Solaris prefers memcntl()), most of what we discuss here is going to apply there as well.

If you own or have access to a SPARC T7/M7 system running Solaris 11.3 or later, then, lucky you, you can get there and play. Otherwise, you may try the developer trial account on swisdev.oracle.com.

Enabling ADI on the current thread


If you are on proper hardware, ADI is supported by the kernel and available to every 64-bit process on the system (unless you're running in a Kernel Zone and haven't enabled the proper host compatibility, adi or native). What this means is that your process can decide to start using ADI, but it still has to state that explicitly. This is provided by:

             int adi_set_enabled(int arg) 

arg is either ADI_ENABLE or ADI_DISABLE, to, respectively, enable and disable ADI. Under the covers, this has the kernel set/clear the MCDE bit in the PSTATE register, which stands for Memory Corruption Detection Enable, a trace of the original name under which the technology was created.

At any point in time, a userland thread can inquire whether it has ADI enabled, and the interface is, not surprisingly:

            int adi_get_enabled(void)

which will return either ADI_ENABLE or ADI_DISABLE.

Learning about ADI properties


Magic numbers are the worst and, as much as possible, a well-written piece of software should dynamically derive the properties of the running system and adjust accordingly. The ADI API is not here to have you write poor software and hence we have:

           int adi_blksz(void)
           int adi_version_max(void)
           int adi_version_nbits(void)

adi_blksz() returns the granularity at which the versioning applies. Today, ADI operates on 64-byte cache lines, so 64 bytes is also the necessary alignment. adi_version_nbits() returns the number of bits, starting from the topmost bit in the 64-bit virtual address, that are used to represent the associated ADI version, and adi_version_max() the largest color that is reliably usable on the architecture. A common initialization routine for a piece of software (e.g. a malloc() implementation) would collect these values to tune itself, with code along these lines:

void
initialize_adi(void)
{
        if (adi_set_enabled(ADI_ENABLE) < 0) {
                perror("ADI initialization failed");
                adi_enabled = B_FALSE;
                return;
        }

        adi_enabled = B_TRUE;

        alignment = adi_blksz();
        nbits = adi_version_nbits();
                
        if (alignment < 0 || nbits < 0) {
                adi_enabled = B_FALSE;
                return;
        }

        maxversion = (uint_t)adi_version_max();
}

If we build and run it on an M7 and dump the values, we get something along these lines:

darkhelmet$ ./adi_base_test 
Block size and alignment: 64 bytes.
Available version bits: 4
Maximum usable version: 13
[...]
darkhelmet$

64 and 4 match what we know and expect about the ADI implementation. The maximum usable version is, instead, a tiny bit surprising: we know that "all zeroes" or "all ones" (so, 0 and 15) are universal matches which allow a load/store with any arbitrary tag, but that should still leave 14 as a usable version. The reason why it isn't comes from an architecture implementation detail.

On M7, we keep version bits separately protected in the L2$ (all 8 ways of a set in a check-word). If an Unrecoverable Error (UE) happens, they are flushed and, if a dirty line is present, it's written back with version 14, regardless of the original version. Since a userland program might decide to rely on ADI for correctness, it would be cumbersome to figure out whether an exception raised on version 14 was a consequence of a legit condition or of a UE, so we simply restrict the versioning space.

The upcoming M8 architecture lifts this limitation by doing the right thing on UE and writing back the correct color. This is a good example of why one should never rely on magic numbers: by gathering the maximum version at runtime, your software is guaranteed to take advantage of all the available versioning space.
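As a hedged sketch of what gathering these values at runtime buys you, an allocator might draw its tags from whatever space the platform reports (arc4random() here is just one possible entropy source):

        /*
         * Pick a random usable version in [1, maxversion]: version 0
         * is the universal match and must be skipped.
         */
        uint_t version = 1 + (arc4random() % maxversion);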

Setting/Getting Tags for VA ranges


Now that we have enabled ADI for the running thread (PSTATE.mcde) and have all the necessary information to produce and place a tag on a virtual address (alignment, number of version bits, maximum version), all that is left is to effectively start tagging memory.

The first mandatory step is to enable ADI on the target pages. In fact, ADI is not implicitly enabled for a TTE (Translation Table Entry); instead, a new bit (TTE.mcd) specifies whether it's on or off. The main reason for this is that the processor disables store merging for ADI-enabled pages, which might translate into some (generally minimal) performance impact. Setting TTE.mcd is up to the kernel and Solaris offers two different APIs to gently ask for it:

              mmap(hint_addr, len, prot, MAP_ADI, ...)
              memcntl(addr, len, MC_ENABLE_ADI, ...)

In both cases, assuming that everything is right, TTE.mcd will be set for all the pages that make up the requested len range. The kernel makes no promises on the versions that will be set for those pages initially (you might observe that, e.g., anonymous pages come zeroed out, but please don't rely on it) and leaves it up to the userland application.
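As an example of the memcntl() flavor, here is a hedged sketch of enabling ADI on an already existing anonymous mapping (error handling reduced to the bare minimum):

        caddr_t addr;

        addr = mmap(NULL, 8192, PROT_READ|PROT_WRITE,
            MAP_ANON|MAP_PRIVATE, -1, 0);
        if (addr == (caddr_t)MAP_FAILED) {
                perror("mmap");
                exit(EXIT_FAILURE);
        }

        /* Ask the kernel to set TTE.mcd on the pages backing the range. */
        if (memcntl(addr, 8192, MC_ENABLE_ADI, 0, 0, 0) < 0) {
                perror("memcntl");
                exit(EXIT_FAILURE);
        }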

Setting versions is a two-step process. First of all, we need to set the proper tag on the cache line bits, which are exposed through dedicated Address Space Identifiers (ASI), and then mirror the tag onto the pointer used to access the memory range. ASIs are basically the SPARC architecture's swiss army knife: they allow exposing different address spaces (e.g. selecting between the primary or secondary context with a load/store), I/O ranges, internal registers, or otherwise influencing the load/store behavior. Alternate versions of load/store, identified by an 'a' at the end of the mnemonic (e.g. stxa), allow directly specifying the ASI to operate on.

Two ASIs are of interest for ADI setting/getting of versions by unprivileged software:

  • ASI_MCDP (0x90): Memory Corruption Detection Primary, takes the virtual address as relative to the primary context and sets/gets the specified version.
  • ASI_MCDSBP (0x92): Memory Corruption Detection Block Init Store Primary, which optimizes the operation of zeroing a block (64 bytes) while also setting the version. 

Of course, one doesn't need to do any of this manually: proper APIs are provided that also deal correctly with misaligned addresses:

             caddr_t adi_clr_version(caddr_t addr, size_t size)
             int adi_get_version(caddr_t addr)
             caddr_t adi_set_version(caddr_t addr, size_t size, int version)
             caddr_t adi_memset(caddr_t addr, int c, size_t size, int version)

Their names are quite self-explanatory and so should be the behavior, given the introduction above (which might actually come in handy in case you find yourself needing to optimize some hot path). All the setting functions return an address that is already properly tagged, as we can see in the next example:

        caddr_t buffer;
        buffer = mmap(NULL, 8192, PROT_READ|PROT_WRITE,
            MAP_ANON|MAP_PRIVATE|MAP_ADI, -1, 0);
        if (buffer == (caddr_t)MAP_FAILED) {
                perror("mmap");
                exit(EXIT_FAILURE);
        }

        printf("Buffer address before versioning: %p\n", buffer);
        buffer = adi_set_version(buffer, 128, 7);
        printf("Versioned buffer address: %p\n", buffer);       

which, once run, gives us:

darkhelmet$ ./adi_base_test
[...]
Buffer address before versioning: ffffffff7f5ec000
Versioned buffer address: 7fffffff7f5ec000
[...]
darkhelmet$

with the topmost nbits properly set to reflect the ADI version we specified. If we attach and disassemble the call to adi_set_version(), being careful to pick the symbol with the proper capability tag, eventually we see:

adi_base_test:580319*> ::nm ! grep adi_set_version
0x00000001001013c0|0x0000000000000000|FUNC |GLOB |0x0  |UNDEF   |adi_set_version
0xffffffff7ef289f0|0x00000000000001a8|FUNC |LOCL |0x0  |19      |adi_set_version%sun4v-adi
0xffffffff7edea9e0|0x000000000000001c|FUNC |GLOB |0x3  |19      |adi_set_version
adi_base_test:580319*> 0xffffffff7ef289f0::dis ! ggrep -B 2 stxa
libc.so.1`adi_set_version%sun4v-adi+0x108:      ldsb      [%l2], %o2
libc.so.1`adi_set_version%sun4v-adi+0x10c:      sra       %i2, 0x0, %o0
libc.so.1`adi_set_version%sun4v-adi+0x110:      stxa      %o0, [%i4] 0x90       
--
libc.so.1`adi_set_version%sun4v-adi+0x16c:      mov       %l5, %o5
libc.so.1`adi_set_version%sun4v-adi+0x170:      sra       %i2, 0x0, %o7
libc.so.1`adi_set_version%sun4v-adi+0x174:      stxa      %o7, [%o5] 0x90       
adi_base_test:580319*> 

with the calls to store-alternate with the expected ASI.

adi_memset() operates basically in the same way, although it is faster than just doing a memset() followed by an adi_set_version().
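For instance, an allocator that wants to hand out a zeroed, tagged chunk could do it in a single pass (the values here are purely illustrative):

        /*
         * Zero the first two cachelines and tag them with version 7 in
         * one shot, instead of memset() followed by adi_set_version().
         */
        buffer = adi_memset(buffer, 0, 128, 7);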

What happens on an ADI mismatch?


Let's trigger an ADI mismatch from within our code:

       /* Tag the first two cachelines */
        tagged_buffer = adi_set_version(buffer, 128, 7);
        printf("Versioned buffer address: %p\n", tagged_buffer);        

        /* Tag the next two cachelines with a different value */
        nextbuffer = adi_set_version(buffer + 128, 128, 8);
        printf("Versioned next buffer address: %p\n", nextbuffer);

        /* Access with correct version goes through */
        tagged_buffer[0] = 'a';
        printf("first store acces: %c\n", tagged_buffer[0]);
        /* Access with incorrect version traps */
        tagged_buffer[130] = 'b';
        /* segfault...*/        
        printf("second store access: %c\n", nextbuffer[2]);

The second access from tagged_buffer, performed with a pointer carrying version 7 over a memory line tagged with version 8, is detected by ADI, which nullifies the store (mismatching stores never go through) and raises an exception, whose final outcome is to send a SIGSEGV down to the process. Let's see it in action:

darkhelmet$ ./adi_base_test
[...]
Buffer address before versioning: ffffffff7f5ec000
Versioned buffer address: 7fffffff7f5ec000
Versioned next buffer address: 8fffffff7f5ec080
first store access: a
Segmentation Fault (core dumped)
darkhelmet$

We got the expected Segmentation Fault and a nice core dump. Let's dig into it a little bit:

darkhelmet$ mdb ./core
Loading modules: [ libc.so.1 ld.so.1 ]
adi_base_test:core> ::status
debugging core file of adi_base_test (64-bit) from darkhelmet
file: /tmp/adi_base_test
initial argv: ./adi_base_test
threading model: raw lwps
status: process terminated by SIGSEGV (Segmentation Fault)
, ADI deferred mismatch, pc=100001188
adi_base_test:core> 100001188::dis
main+0x130:                     call      +0x100320     
[...]
main+0x158:                     stb       %l0, [%i4 + 0x82]
[...]
main+0x17c:                     restore
_init:                          save      %sp, -0xb0, %sp
adi_base_test:core> ::regs
%g0 = 0x0000000000000000                 %l0 = 0x0000000000000062 
%g1 = 0x0000000000000004                 %l1 = 0x0000000100000c80 
%g2 = 0x0000000000000000                 %l2 = 0x7fffffff7f5ec000 
%g3 = 0x0000000000000000                 %l3 = 0x0000000000000000 
%g4 = 0x00000000782f296a                 %l4 = 0x0000000000000061 
%g5 = 0xffffffff7f0486e4 libc.so.1`_sobuf+0x18 %l5 = 0x0000000100000cf0 
%g6 = 0x0000000000000000                 %l6 = 0x0000000000002800 
%g7 = 0xffffffff7f5c2a40                 %l7 = 0xffffffff7f548f08 
%o0 = 0x0000000100000d78                 %i0 = 0x0000000000000001 
%o1 = 0x0000000000000000                 %i1 = 0xffffffff7ffff808 
%o2 = 0x0000000000000000                 %i2 = 0xffffffff7ffff818 
%o3 = 0xffffffff7f030000                 %i3 = 0x8fffffff7f5ec080 
%o4 = 0x0000000000000000                 %i4 = 0x7fffffff7f5ec000 
%o5 = 0xffffffff7f037fdc libc.so.1`_iob+0xa4 %i5 = 0xffffffff7f5ec000 
%o6 = 0xffffffff7fffee71                 %i6 = 0xffffffff7fffef51 
%o7 = 0x0000000100001190      main+0x160 %i7 = 0x0000000100000ea8 _start+0x108

 %ccr = 0x99 xcc=NzvC icc=NzvC
   %y = 0x0000000000000000
  %pc = 0x0000000100101480 PLT=libc.so.1`printf
 %npc = 0x0000000100101484 PLT=libc.so.1`printf
  %sp = 0xffffffff7fffee71
  %fp = 0xffffffff7fffef51

 %asi = 0x82
%fprs = 0x07
adi_base_test:core> 

Two things are immediately evident:

  • ADI tells us that a trap was raised by the execution at pc=100001188 (main+0x158), which is where we expected it to happen, at the store into the mismatching cache line.
  • There is surprisingly scarce information about the trap: at what virtual address did it happen? Because of what mismatching value? All we get reported is a program counter and that 'deferred' definition, which seems to be intuitively confirmed by the fact that %pc and %npc are not at the reported faulting instruction, but somewhere further down, into the next printf() call.

For a feature that was born to be a debugging tool, this lack of detail is fairly unexpected. So, what's going on here? There are two types of traps that can be raised by an ADI mismatch: a precise trap and a deferred trap. Precise traps stop the thread immediately, so the architecture and the kernel can conspire to send to userland all the relevant information about the exception. Deferred traps happen some time later, based on a number of conditions that we'll see in a bit.

Loads always raise a precise trap; stores, instead, raise a deferred one. This is because on SPARC we have a store buffer that queues (up to 64) committed stores before they are performed by the L2$, and the version mismatch is not detected until the L2$ performs the store. This can happen on an explicit flush (membar) or when the buffer fills up and drains. There are a number of implicit situations that lead to a membar, e.g. the userland application performing a system call, but by the time the deferred trap is delivered (and the program stopped), some instructions have passed and the hardware has no way to keep any information about the original condition beyond the PC at which it happened.
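If you're chasing a deferred trap and want it delivered closer to the offending store without paying for full precise mode, one trick (sketched here with gcc-style inline assembly) is to sprinkle explicit flushes around the suspect code:

        /*
         * Drain the store buffer: any pending mismatching store is
         * forced through the L2$ here, so the deferred trap (if any)
         * is delivered near this point rather than much later.
         */
        __asm__ __volatile__("membar #Sync" : : : "memory");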

Disabling store buffering has a significant performance penalty and is recommended only in debugging scenarios, to further drill down into the details of a bug. We offer an API for what we call precise mode, which allows controlling it:

          int adi_get_precise(void)
          int adi_set_precise(int mode)

where mode is either ADI_PRECISE_ENABLE or ADI_PRECISE_DISABLE. The simplest approach when writing a memory allocator (or some other debugging/protection tool) with ADI is to provide a tweakable knob (e.g. an environment variable or a security extension property) to control the ADI behavior. As mentioned, you'll want to run with precise mode disabled in production.
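A sketch of such a knob follows; the environment variable name (MY_ADI_PRECISE) is made up for the example:

        /*
         * Enable precise mode only when explicitly requested: version
         * mismatches will then trap synchronously, at the cost of
         * disabling store buffering.
         */
        if (getenv("MY_ADI_PRECISE") != NULL) {
                if (adi_set_precise(ADI_PRECISE_ENABLE) < 0)
                        perror("adi_set_precise");
        }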

Let's see an example of the amount of information we can gather when we hit a precise trap. We'll do so by tweaking our code a little, making it trap on a load rather than a store.

       /*
         * Do a proper store access and modify the printf to read from
         * tagged_buffer, which has an incorrect version.
         */ 
        nextbuffer[2] = 'b';
        printf("second store access (correct): %c\n", nextbuffer[2]);
        printf("second store access (mismatch): %c\n", tagged_buffer[130]);

And run it, then examine the expected core dump:

darkhelmet$ ./adi_base_test
[...]
first store access: a
second store access (correct): b
Segmentation Fault (core dumped)
darkhelmet$ mdb ./core
Loading modules: [ libc.so.1 ld.so.1 ]
adi_base_test:core> ::status
debugging core file of adi_base_test (64-bit) from darkhelmet
file: /tmp/adi_base_test
initial argv: ./adi_base_test
threading model: raw lwps
status: process terminated by SIGSEGV (Segmentation Fault), pc=1000011e4
, ADI version 8 mismatch for VA ffffffff7f5ec082
adi_base_test:core> ::regs
%g0 = 0x0000000000000000                 %l0 = 0x0000000100101000 
%g1 = 0x0000000000000004                 %l1 = 0x0000000100000cf0 
%g2 = 0x0000000000000000                 %l2 = 0x8fffffff7f5ec082 
%g3 = 0x0000000000000000                 %l3 = 0x0000000000000062 
%g4 = 0x00000000222f2967                 %l4 = 0x0000000000000061 
%g5 = 0xffffffff7f0486f0 libc.so.1`_sobuf+0x24 %l5 = 0x0000000000002880 
%g6 = 0x0000000000000000                 %l6 = 0x0000000000002800 
%g7 = 0xffffffff7f5c2a40                 %l7 = 0xffffffff7f548f08 
%o0 = 0x0000000100000da0                 %i0 = 0x0000000000000001 
%o1 = 0x0000000000000021                 %i1 = 0xffffffff7ffff7f8 
%o2 = 0x0000000000000000                 %i2 = 0xffffffff7ffff808 
%o3 = 0xffffffff7f030000                 %i3 = 0x8fffffff7f5ec080 
%o4 = 0x0000000000000000                 %i4 = 0x7fffffff7f5ec000 
%o5 = 0xffffffff7f037fdc libc.so.1`_iob+0xa4 %i5 = 0xffffffff7f5ec000 
%o6 = 0xffffffff7fffee61                 %i6 = 0xffffffff7fffef41 
%o7 = 0x00000001000011e0      main+0x170 %i7 = 0x0000000100000ee8 _start+0x108

 %ccr = 0x99 xcc=NzvC icc=NzvC
   %y = 0x0000000000000000
  %pc = 0x00000001000011e4 main+0x174
 %npc = 0x0000000100101480 PLT=libc.so.1`printf
  %sp = 0xffffffff7fffee61
  %fp = 0xffffffff7fffef41

 %asi = 0x82
%fprs = 0x07
adi_base_test:core> 

We get all the nice detailed debugging information that we expected: the instruction pointer, the ADI version mismatch that happened and for what virtual address. The virtual address is always reported normalized, although we can inquire about the associated ADI version through the ::adiver command in mdb:

adi_base_test:core> ffffffff7f5ec082::adiver
addr: ffffffff7f5ec082 cache ver: 8
adi_base_test:core> ffffffff7f5ec070::adiver
addr: ffffffff7f5ec070 cache ver: 7
adi_base_test:core> 

Since we have an exact picture of the status of registers at the time of the trap, we can further look at the instruction pointer and registers to see from what virtual address and with what version the access was performed:

%o4 = 0x0000000000000000                 %i4 = 0x7fffffff7f5ec000 

adi_base_test:core> 0x00000001000011e4::dis
[...]
main+0x174:                     ldsb      [%i4 + 0x82], %o1

And, as expected, a version 7 address starting two cache lines before shows up.

Source Code Example


As a general reference, here is the full source code from the last example we discussed, with the mismatching trap on a load access.
#include <sys/types.h>
#include <sys/mman.h>
#include <adi.h>        /* adi_set_enabled() and friends */
#include <stdio.h>
#include <stdlib.h>

/*
 * Global variables that control ADI behavior for the application.
 */
boolean_t       adi_enabled;
int             alignment;
int             nbits;
uint_t          maxversion;
uint_t          mask;

/*
 * Gather information at runtime about ADI
 */
void
initialize_adi(void)
{
        if (adi_set_enabled(ADI_ENABLE) < 0) {
                perror("ADI initialization failed");
                adi_enabled = B_FALSE;
                return;
        }

        adi_enabled = B_TRUE;

        alignment = adi_blksz();
        nbits = adi_version_nbits();
                
        if (alignment < 0 || nbits < 0) {
                adi_enabled = B_FALSE;
                return;
        }

        maxversion = (uint_t)adi_version_max();
        mask = (1 << nbits) - 1;
}

int main(int argc, char **argv)
{
        initialize_adi();

        if (adi_enabled) {
                printf("Block size and alignment: %d bytes.\n", alignment);
                printf("Available version bits: %d\n", nbits);
                printf("Maximum usable version: %d\n", maxversion);
        }

        caddr_t buffer;
        caddr_t nextbuffer;
        caddr_t tagged_buffer;

        buffer = mmap(NULL, 8192, PROT_READ|PROT_WRITE,
            MAP_ANON|MAP_PRIVATE|MAP_ADI, -1, 0);
        if (buffer == (caddr_t)MAP_FAILED) {
                perror("mmap");
                exit(EXIT_FAILURE);
        }

        printf("Buffer address before versioning: %p\n", buffer);
        /* Tag the first two cachelines */
        tagged_buffer = adi_set_version(buffer, 128, 7);
        printf("Versioned buffer address: %p\n", tagged_buffer);        

        /* Tag the next two cachelines with a different value */
        nextbuffer = adi_set_version(buffer + 128, 128, 8);
        printf("Versioned next buffer address: %p\n", nextbuffer);

        /* Access with correct version goes through */
        tagged_buffer[0] = 'a';
        printf("first store acces: %c\n", tagged_buffer[0]);
        
        /*
         * Do a proper store access and modify the printf to read from
         * tagged_buffer, which has an incorrect version.
         */ 
        nextbuffer[2] = 'b';
        printf("second store access (correct): %c\n", nextbuffer[2]);
        printf("second store access (mismatch): %c\n", tagged_buffer[130]);
}

Tuesday, September 12, 2017

ADI vs ROP

A couple of entries ago, I covered how we planned to use ADI to protect against heap attacks. If you've been following the stream of patches for the Solaris userland gate, you may have noticed this commit a few months ago. This commit added the necessary macros to the userland gate to enable ADIHEAP and ADISTACK, two new security defenses that will show up in the upcoming release of 11.4.

The ADIHEAP extension allows extending system libraries (e.g. libc:malloc()) with ADI-based protections, while still retaining the ability to control whether the extension is enabled through binary tagging and sxadm exec (both covered in this previous blog entry). A more detailed discussion of the implementation for specific libraries is deferred to the release of 11.4.

The ADISTACK extension, instead, applies ADI transparently to any arbitrary process/binary to mitigate Return Oriented Programming (ROP) attacks. While I'm deferring a detailed analysis of this one to the release of 11.4 as well, Steve Sistare (one of the folks behind our ADISTACK implementation, along with Anthony Yznaga, myself and Steve Chessin) recently commented on the stack defense on the sparclinux mailing list, so I figured it might be a good time to recap it a bit in a blog entry.

Lookin' Through The (Register) Windows


ROP attacks, as the name suggests, target the return address of a procedure and build a sequence of entry points to gadgets (brief instruction sequences in the process address space that conclude with a return instruction) that compose the attacker payload. ROP attacks are popular today and are the direct consequence of the defensive side taking away the ability to store an attacker-controlled payload and jump to it, through what is traditionally known as W^X/DEP (ensuring that no mapping that is simultaneously writable and executable exists in the process address space).

ROP attacks on the SPARC architecture were thoroughly covered back in 2008 by the "When Good Instructions Go Bad: Generalizing Return-Oriented Programming to RISC" paper. In this entry, I'm just going to briefly summarize the characteristics of the SPARC architecture that make them, and the respective mitigation, possible.

The whole function call/return model on SPARC is predicated on the concept of register windows. A register window consists of 24 directly accessible registers, divided into three parts: 8 local (private to the function), 8 in and 8 out registers. In and out registers are the heart of parameter passing and value returning across functions. In fact, register windows overlap: when function A calls function B, the out registers of function A become the in registers of function B, which also gets a fresh set of local and out registers. Later, when B returns, it can use its in registers to make the return values pop up in A's register window. Some registers have special meanings: %i6 contains the frame pointer (%fp) and %i7 contains the return address. This document, although centered on the V8 architecture (remember to double the register size), nicely recaps the whole register window design and operation, and so does the aforementioned paper.

The number of physical registers cannot grow indefinitely, so there is an upper limit to the number of register windows. This means that an application can find itself about to do a call, but with no register window to shift to. This situation generates a SPILL trap, transferring control to the operating system and asking it to please save the existing register window somewhere and buy a new one for the process. This "somewhere" is, of course, the stack (partially defeating the original purpose of register windows), whereby enough space was conveniently left aside by the save instruction in the function prologue. If we ignore for a moment local variables and any other use of the stack (and just concentrate on register windows), it's easy to realize that the save instruction needs to reserve at least 16 * register_size bytes (8 in and 8 local) which, on SPARC V9 (register_size == 8 bytes), puts us conveniently at 128 bytes (two cache lines). The sister condition to a SPILL trap happens when the program tries to restore back to a previous register window, but that window is currently invalid. In this case, the OS receives a FILL trap and needs to recover the saved registers from the stack and "provide" them back to the caller.

The whole stream of FILL and SPILL traps happens transparently to the application.

Leveraging SPILL/FILL traps with ADI


SPILL and FILL traps provide something pretty unique in the OS/arch landscape: the operating system has a chance to check the saved register values right at the prologue and right after the epilogue of a function. This possibility hasn't gone unnoticed in the past: back in 2002 a stack protection known as StackGhost was presented for OpenBSD. It proposed two different approaches, which more or less minimize the performance impact and more or less effectively defeat stack smashing attacks: 
  • obfuscate the return address: the OS encrypts the return address on SPILL and decrypts it on FILL. XOR is the obvious choice for speed and simplicity, but has some significant limitations (a sufficiently large infoleak allows recovering the process cookie, and it's easily subject to perturbation of the lower bits), while a more sophisticated algorithm may require more instructions (despite the speedup brought by crypto instructions since 2002). In either case, this approach directly affects userland: debuggers and the like need to be extended in order to understand the 'encrypted' return address (an extra cost that is better avoided for adoption's sake). The frame pointer and the other register contents are not protected at all.
  • create a shadow stack: the kernel keeps an in-kernel shadow stack and updates it at every SPILL/FILL event. This is a much more robust defense and one that has had a number of unfinished attempts (internally) for Solaris as well. There are two tricky parts to it: the first one is the performance impact. SPILL/FILL traps are frequent and hence hugely optimized; adding extra code to copy things to a shadow stack can impose a significant penalty. The second one is related to the well known CFI nightmares of longjmp(), setjmp() and makecontext(), which can create non-linear modifications to the control flow and require further complexity to clean the shadow stack properly. On top of this, the memory management of the shadow stack space can also add further food for thought, especially if one wants to protect all the saved registers. 
Enter ADI. Once again, it acts as a game changer, since it can defer to the architecture the thankless job of evaluating memory accesses. In particular, the kernel can tag the register save area on the stack with a dedicated (randomized) color on SPILL and clear it on FILL. Any attempt to overflow from an existing buffer on the stack into the save area during program execution is then automatically caught and stopped, leading to a SIGSEGV. Similarly, attempts to infoleak the save area contents are detected and stopped as well.

Adding a color requires significantly fewer instructions than a full shadow stack and, through the ADISTACK security extension, we know whether we have to pre-enable ADI over the stack pages of a process. This further helps reduce the impact of the protection on binaries/processes that have it disabled. On top of that, Steve Sistare and Anthony Yznaga came up with a pretty cool trick to completely eliminate any overhead of ADISTACK-specific instructions from unaffected processes (and speed up ADISTACK itself), but I'm leaving that one to a future (perhaps their own) entry. For this blog post, just consider that ADISTACK has basically no impact when disabled, something that is, indeed, pretty nice for a lightweight CFI solution.

longjmp(), setjmp() and makecontext() are augmented with proper version-clearing paths, and similar code is exposed, through APIs, to all those pieces of software (Java, I'm looking at you) that want to do internal stack management. Lastly, through the use of non-faulting loads, trusted paths can be created in userland (e.g. when issuing a system call with a variadic number of parameters, and hence arbitrarily hitting stack pieces, or when inspecting the stack from within the process). The trusted path code doesn't need to know the color value in advance, nor does it need to go through the dance of retrieving the color, fixing the accessing pointer and then doing the load: it can instead simply use the ASI_NOFAULT identifier and access the value with a single, direct load. The ability to create a trusted path is an often overlooked property of a security feature and one that can prove crucial when trying to meet acceptable performance results (as an example of this, some time ago I fixed Python to work nicely with ADI through the use of non-faulting loads).
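A hedged sketch of what such a trusted-path access can look like, with gcc-style inline assembly (0x82 is the V9 ASI_PRIMARY_NOFAULT; the real code presumably uses the proper symbolic name, and addr here is just a stand-in for the stack location being inspected):

        /*
         * Non-faulting load from a potentially mismatching address: a
         * version mismatch does not raise a trap, so the trusted path
         * doesn't need to know or reconstruct the color in advance.
         */
        long value;
        __asm__ __volatile__("ldxa [%1] 0x82, %0"
            : "=r" (value)
            : "r" (addr));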

The use of ADI doesn't come without limitations, which prevent this protection from rising to the holy grail of an invariant. First of all, an ADI-protected region can be subject to an arbitrary write attack, whereby the attacker is capable of constructing the target pointer with the proper version. Randomizing the save area version helps a bit, but the randomization range is extremely small and easy to bruteforce/guess. Life would be fantastic if, instead of being able to tag with a color, we were able to temporarily mark the region as read-only, but that's a different story, which I hope to tell one day.

Second, the SPARC ABI mandates a 16-byte alignment for the stack, which translates to the register save area not always being 64-byte aligned in existing software. Since the save area is 128 bytes, we're always guaranteed to version at least one cacheline's worth of space. This may mean that the two key pieces we want to protect (%i6 and %i7), which sit at the top of the save area, end up exposed (but with a tagged cacheline in between that catches linear overflows). This limitation can be reduced by imposing a 64-byte alignment: the kernel, the compiler and the userland crt objects can conspire to not create misalignment (or reduce it to a minimum, depending on performance requirements).

Leveraging ADI through the Compiler


ADISTACK has a few characteristics that make me a huge fan of it: it's transparent to the target process (which means that it can be easily applied to third party software), has a very low performance impact when enabled and zero performance impact when disabled. ADISTACK focuses on one specific target: the saved registers, in particular %fp and the return address. This is both really good (do one thing, but do it right) and "bad" (as it ignores, by design, stack smashing attacks that target other adjacent local variables).

A more complex, but more comprehensive, protection can be designed to include detection of linear overflows across local variables, by having the compiler either separate each one into dedicated cachelines or leave redzone "gaps" between them, and providing an entry point to each function to call into a tagging subroutine. This is not dissimilar, impact-wise, from other cookie-based stack protection solutions, like GCC StackGuard/-fstack-protector and can implement similar solutions to improve performance, for example through the identification of functions that truly need a protection and those that don't have any overflowing object in their frame. In the very same way, such compiler based protection do leave some performance impact also when disabled, by virtue of adding some extra code to some/all functions.