How to use Media Source Extensions with AirPlay

Media Source Extensions (MSE) is a popular way to provide streaming video on the web. It gives JavaScript control over the way bytes are sent to the browser for playback. At the 2023 Worldwide Developers Conference, Apple announced a new Managed Media Source API that improves on MSE with more efficient video streaming and longer battery life for iOS and other devices.

However, MMS and MSE, by nature, are not compatible with AirPlay, which requires a unique playback URL. AirPlay allows you to start playback of your favorite videos on your phone, move them to your TV, and then switch off that phone. An AirPlay-compatible URL can be in a number of formats, such as MP4, MPEG-TS, or HTTP Live Streaming (HLS).

This post will guide you through providing both sources and, in the process, build out a demo example.

WebKit MSE + AirPlay Demo

Since MMS/MSE uses binary blobs appended to a SourceBuffer, it won’t work with AirPlay. But if you create an alternative source that can be served as an AirPlay-compatible URL, there is a way to get them to work together.

When it comes to an AirPlay-compatible alternative, HLS is an option that offers a great deal of efficiency for users. There are numerous resources that can guide you in converting your video content to serve it with HLS. Apple offers a toolkit you can use and there are many other options as well.

const airplayURL = './video/demo.m3u8';
const videoURL = './video/demo.mp4';
const mediaType = 'video/mp4; codecs="avc1.640028"';

// Create a video element
const video = document.createElement('video');

// Set video element properties for the demo
video.controls = true;
video.loop = true;
video.muted = true;
video.autoplay = true;

In setting up the MediaSource, it’s easy to use feature detection to take advantage of the Managed Media Source API for power-efficient streaming on browsers that support it, and gracefully fall back to MSE where it’s not available:

// Feature detect MMS and fallback to MSE
const MediaSource = self.ManagedMediaSource || self.MediaSource;
const source = new MediaSource();

We also need a way to offer both our Media Source and AirPlay sources at the same time. The HTML video element’s ability to define multiple sources will do just that. It was originally intended to allow a user-agent to choose the best format of the video to be played, and to fall back should a format not be supported.

<video>
  <source src="format/video.m3u8" type="application/x-mpegURL">
  <source src="format/video.ogg" type="video/ogg">
  <source src="format/video.mp4" type="video/mp4">
  <source src="format/video.webm" type="video/webm">
</video>

The browser will look over the list of available sources from top to bottom and pick the first one it can play. If a source isn’t supported, or if decoding fails along the way, it moves on to the next source in the list.

We can make use of this behavior, combining the ability for the user-agent to choose the best format and allowing AirPlay to select a playable source. You’ll create a URL from your Media Source and add it to a video element as the first source. This URL is local to the user-agent and can’t be shared, as it has no meaning outside the local context. Then you add the AirPlay-compatible URL as the second source.

// Add MSE/MMS streaming source
const videoSource1 = document.createElement('source');
videoSource1.type = 'video/mp4';
videoSource1.src = URL.createObjectURL(source); // Create URL from MediaSource
video.appendChild(videoSource1);

// Add AirPlay-compatible HLS source
const videoSource2 = document.createElement('source');
videoSource2.type = 'application/x-mpegURL';
videoSource2.src = airplayURL;
video.appendChild(videoSource2);

Now when Safari detects that an alternative source is available in addition to the MediaSource URL object, it will display the familiar AirPlay icon in the video player controls. Should the user select AirPlay, playback will switch over from MSE to the AirPlay-compatible URL.

The streaming code for this demo is very basic and serves as an example of getting the bytes from the video to the browser.

window.onload = async () => {
  if (!MediaSource.isTypeSupported(mediaType)) {
    return console.log('Media type is not supported.');
  }

  await new Promise((resolve) => {
    source.addEventListener("sourceopen", resolve, { once: true });
    document.body.appendChild(video);
  });

  const sourceBuffer = source.addSourceBuffer(mediaType);

  async function loadSegment() {
    const response = await fetch(videoURL);
    const buffer = await response.arrayBuffer();
    await new Promise((resolve) => {
      sourceBuffer.addEventListener("updateend", resolve, { once: true });
      sourceBuffer.appendBuffer(buffer);
    });
  }

  if (typeof(source.onstartstreaming) !== 'undefined') {
    source.onstartstreaming = async () => {
      loadSegment();
    };
  } else loadSegment();
};

Feedback

Offering support for MMS/MSE and AirPlay gives users the best of all worlds, and the video element makes it easy to offer multiple sources. You can learn more about the Managed Media Source API proposal at the W3C, and learn about HTTP Live Streaming from Apple’s documentation.

We love hearing from you. Send a tweet to @webkit to share your thoughts on this feature. Find us on Mastodon at @jensimmons@front-end.social and @jondavis@mastodon.social. You can also follow WebKit on LinkedIn. If you run into any issues, we welcome your WebKit bug reports on WebKit-powered features like this. Reporting issues and sharing your feedback makes an enormous difference.

WebGPU now available for testing in Safari Technology Preview

WebGPU is a new standards-compliant API that enables high-performance 3D graphics and general-purpose computations on the Web. WebGPU programs are written in JavaScript but expose GPU functionality, allowing GPU computing to be used in Web content for the first time. Starting in Safari Technology Preview 185, WebGPU can be enabled for early testing and development.

To enable WebGPU, turn on the “WebGPU”, “GPU Process: DOM Rendering” and “GPU Process: Canvas Rendering” feature flags in the Feature Flags tab in Safari Preferences. If you don’t see the Feature Flags tab, you need to first check “Show features for web developers” in the Advanced tab.

Once you have WebGPU enabled in Safari Technology Preview 185, try out this example of WebGPU, which utilizes many of its best features.

WebGPU JavaScript API

The WebGPU API is accessed through JavaScript, similar to WebGL.

Creating a GPUDevice

In order to use WebGPU, a device must be created. Resources and pipeline state are created from a GPUDevice instance. To create a device with the default limits and features supported on all WebGPU devices, we can pass zero arguments to the invocations of requestAdapter and requestDevice.

const adapter = await navigator.gpu.requestAdapter();
const device = await adapter.requestDevice();

Configuring a GPUCanvasContext

The GPUCanvasContext is an interface that allows you to configure how your content will be displayed in the corresponding HTML canvas element on the page.

// 'canvas' is assumed to be an HTMLCanvasElement already on the page
const context = canvas.getContext('webgpu');
const canvasFormat = "bgra8unorm";

const contextConfiguration = {
    device: device,
    format: canvasFormat,
    alphaMode: 'opaque',
};
context.configure(contextConfiguration);

Creating a GPURenderPipeline

A GPURenderPipeline or a corresponding GPUComputePipeline is used to configure the pipeline state of the graphics driver. This pipeline state is then used in a GPURenderPassEncoder or GPUComputePassEncoder, as illustrated later.

const shaderModule = device.createShaderModule({ code: wgslSource });
const vertexStageDescriptor = { module: shaderModule, entryPoint: "vsmain" };
const fragmentStageDescriptor = { module: shaderModule, entryPoint: "fsmain" };
const renderPipelineDescriptor = {
    layout: 'auto',
    vertex: vertexStageDescriptor,
    fragment: fragmentStageDescriptor,
    primitive: {topology: "triangle-list" },
};
const renderPipeline = device.createRenderPipeline(renderPipelineDescriptor);

Issuing draw calls

A GPURenderPassEncoder is created to send draw calls to the graphics driver. In the example below, we draw a simple triangle consisting of three vertices. A GPURenderPassEncoder can also draw multiple instances of the same geometry or draw from an offset of a vertex buffer.

// Render into the current texture of the canvas context
const renderAttachment = context.getCurrentTexture().createView();

const colorAttachmentDescriptor = {
    view: renderAttachment,
    loadOp: "clear",
    storeOp: "store",
    clearValue: { r: 0.15, g: 0.15, b: 0.5, a: 1 }
};
const renderPassDescriptor = { colorAttachments: [colorAttachmentDescriptor] };
const commandEncoder = device.createCommandEncoder();
const renderPassEncoder = commandEncoder.beginRenderPass(renderPassDescriptor);
renderPassEncoder.setPipeline(renderPipeline);
const vertexBufferSlot = 0;
// 'vertexBuffer' is assumed to be a GPUBuffer created earlier (not shown here)
renderPassEncoder.setVertexBuffer(vertexBufferSlot, vertexBuffer, 0);
renderPassEncoder.draw(3, 1, 0, 0); // 3 vertices, 1 instance, 0th vertex, 0th instance.
renderPassEncoder.end();
const commandBuffer = commandEncoder.finish();
const queue = device.queue;
queue.submit([commandBuffer]);

WebGPU Shading Language

WebGPU introduces WGSL, a platform-independent shading language for the web. Here is an example of a WGSL shader source that would be passed in place of wgslSource in the above API call:

const wgslSource = `
    struct Vertex {
        @builtin(position) Position: vec4<f32>,
        @location(0) color: vec4<f32>,
    }

    @vertex fn vsmain(@builtin(vertex_index) VertexIndex: u32) -> Vertex
    {
        var pos: array<vec2<f32>, 3> = array<vec2<f32>, 3>(
            vec2<f32>( 0.0,  0.5),
            vec2<f32>(-0.5, -0.5),
            vec2<f32>( 0.5, -0.5));
        var vertex_out : Vertex;
        vertex_out.Position = vec4<f32>(pos[VertexIndex], 0.0, 1.0);
        vertex_out.color = vec4<f32>(pos[VertexIndex] + vec2<f32>(0.5, 0.5), 0.0, 1.0);
        return vertex_out;
    }

    @fragment fn fsmain(in: Vertex) -> @location(0) vec4<f32>
    {
        return in.color;
    }
`;

Try WebGPU and file bugs!

We’re very excited to have an early version of WebGPU and WGSL in the latest version of Safari Technology Preview. Please do try it out. Check out the public repository of WebGPU samples. And file bugs or issues you discover at bugs.webkit.org.

Understanding Garbage Collection in JavaScriptCore From Scratch

JavaScript relies on garbage collection (GC) to reclaim memory. In this post, we will dig into JSC’s garbage collection system.

Before we start, let me briefly introduce myself. I am Haoran Xu, a PhD student at Stanford University. While I have not yet contributed a lot to JSC, I found JSC a treasure of elegant compiler designs and efficient implementations, and my research is exploring ways to transfer JSC’s design to support other programming languages like Lua at a low engineering cost. This post was initially posted on my blog — great thanks to the WebKit project for cross-posting it on their official blog!

Filip Pizlo’s blog post on GC is great at explaining the novelties of JSC’s GC, and also positions it within the context of various GC schemes in academia and industry. However, as someone with little GC background, I felt the blog post alone was insufficient for me to get a solid understanding of the algorithm and the motivation behind the design. Through digging into the code, and with some great help from Saam Barati, one of JSC’s lead developers, I wrote up this blog post in the hope that it can help more people understand this beautiful design.

The garbage collector in JSC is non-compacting, generational, and mostly[1] concurrent. On top of being concurrent, JSC’s GC heavily employs lock-free programming for better performance.

As you can imagine, JSC’s GC design is quite complex. Instead of diving into the complex invariants and protocols, we will start with a simple design, and improve it step by step to converge at JSC’s design. This way, we not only understand why JSC’s design works, but also how JSC’s design was constructed over time.

But first of all, let’s get into some background.

Memory Allocation in JSC

Memory allocators and GCs are tightly coupled by nature – the allocator allocates memory to be reclaimed by the GC, and the GC frees memory to be reused by the allocator. In this section, we will briefly introduce JSC’s memory allocators.

At the core of the memory allocation scheme in JSC is the data structure BlockDirectory[2]. It implements a fixed-size allocator, that is, an allocator that only allocates memory chunks of some fixed size S. The allocator keeps track of a list of fixed-size (in the current code, 16KB) memory pages (“blocks”) it owns, and a free list. Each block is divided into cells of size S, and has a footer at its end[3], which contains metadata needed for the GC and allocation, e.g., which cells are free. By aggregating and sharing metadata in the footer, it both saves memory and improves the performance of related operations: we will go into the details later in this post.

When a BlockDirectory needs to make an allocation, it tries to allocate from its free list. If the free list is empty, it iterates through the blocks it owns[4] to see if it can find a block containing free cells (which are marked free by the GC). If yes, it scans the block footer metadata to find all the free cells[5] in this block, and puts them into the free list. Otherwise, it allocates a new block from malloc[6]. Note that this implies a BlockDirectory’s free list only contains cells from one block: this block is called m_currentBlock in the code, and we will revisit this later.
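To make the flow concrete, here is a rough pseudocode sketch of the allocation path described above (the names are simplified stand-ins, not JSC’s actual definitions):

// Pseudocode sketch of the allocation path described above. The names
// are simplified stand-ins, not JSC's actual definitions.
void* BlockDirectory::allocate() {
    // Fast path: the free list holds the free cells of m_currentBlock
    if (void* cell = m_freeList.pop())
        return cell;

    // Free list is empty: look for an owned block that has cells marked
    // free by the GC, and rebuild the free list from its footer metadata
    if (Block* block = findBlockWithFreeCells()) {
        m_currentBlock = block;
        m_freeList = buildFreeListFromFooter(block);
        return m_freeList.pop();
    }

    // No owned block has free cells: grab a new 16KB block from malloc
    m_currentBlock = allocateNewBlockFromMalloc();
    m_blocks.push_back(m_currentBlock);
    m_freeList = buildFreeListFromFooter(m_currentBlock);
    return m_freeList.pop();
}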

BlockDirectory is used as the building block to build the memory allocators in JSC. JSC employs three kinds of allocators:

  1. CompleteSubspace: this is a segregated allocator responsible for allocating small objects (max size about 8KB). Specifically, there is a pre-defined list of exponentially-growing size-classes[7], and one BlockDirectory is used to handle allocation for each size class. So to allocate an object, you find the smallest size class large enough to hold the object, and allocate from the directory for that size class.
  2. PreciseAllocation: this is used to handle large allocations that cannot be handled by the CompleteSubspace allocator[8]. It simply relies on the standard (malloc-like) memory allocator, though in JSC a custom malloc implementation called libpas is used. The downside is that since a PreciseAllocation is created on a per-object basis, the GC cannot aggregate and share metadata information of multiple objects together to save memory and improve performance (as MarkedBlock’s block footer did). Therefore, every PreciseAllocation comes with a whopping overhead of a 96-byte GC header to store the various metadata information needed for GC for this object (though this overhead is justified since each allocation is already at least 8KB).
  3. IsoSubspace: each IsoSubspace is used to allocate objects of a fixed type with a fixed size. So each IsoSubspace simply holds a BlockDirectory to do allocation (though JSC also has an optimization for small IsoSubspace by making them backed by PreciseAllocation[9]). This is a security hardening feature that makes use-after-free-based attacks harder[10].

IsoSubspace is mostly a simplified CompleteSubspace, so we will ignore it for the purpose of this post. CompleteSubspace is the one that handles the common case: small allocations, and PreciseAllocation is mostly the rare slow path for large allocations.
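As a rough sketch (with illustrative names and threshold, not JSC’s actual definitions), the routing between these two common paths looks like this:

// Pseudocode sketch of how an allocation request is routed.
// Names and the size threshold are illustrative only.
void* CompleteSubspace::allocateObject(size_t bytes) {
    if (bytes > maxSmallObjectSize)                 // too large for the size
        return PreciseAllocation::allocate(bytes);  // classes: malloc-backed path

    // Find the smallest size class that fits 'bytes', and allocate from
    // the BlockDirectory dedicated to that size class
    size_t sizeClassIndex = smallestSizeClassFor(bytes);
    return m_directories[sizeClassIndex]->allocate();
}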

Generational GC Basics

In JSC’s generational GC model, the heap consists of a small “new space” (eden), holding the newly allocated objects, and a large “old space” holding the older objects that have survived one GC cycle. Each GC cycle is either an eden GC or a full GC. New objects are allocated in the eden. When the eden is full, an eden GC is invoked to garbage-collect the unreachable objects in eden. All the surviving objects in eden are then considered to be in the old space[11]. To reclaim objects in the old space, a full GC is needed.

The effectiveness of the above scheme relies on the so-called “generational hypothesis”:

  1. Most objects collected by the GC are young objects (died when they are still in eden), so an eden GC (which only collects the eden) is sufficient to reclaim most newly allocated memory.

  2. Pointers from old space to eden are much rarer than pointers from eden to old space or pointers from eden to eden, so an eden GC’s runtime is approximately linear in the size of the eden, as it only needs to start from a small subset of the old space. This implies that the cost of GC can be amortized by the cost of allocation.

Inlined vs. Outlined Metadata: Why?

Practically every GC scheme uses some kind of metadata to track which objects are alive. In this section, we will explain how the GC metadata is stored in JSC, and the motivation behind its design.

In JSC, every object managed by the GC carries the following metadata:

  1. Every object managed by the GC inherits the JSCell class, which contains a 1-byte member cellState. This cellState is a color marker with two colors: white and black[12].

  2. Every object also has two out-of-object metadata bits: isNew[13] and isMarked. For objects allocated by PreciseAllocation, the bits reside in the GC header. For objects allocated by CompleteSubspace, the bits reside in the block footer.

This may seem odd at first glance since isNew and isMarked could have been stored in the unused bits of cellState. However, this is intentional.

The inlined metadata cellState is easy to access for the mutator thread (the thread executing JavaScript code), since it is just a field in the object. However, it has bad memory locality for the GC and allocators, which need to quickly traverse through all the metadata of all objects in some block owned by CompleteSubspace (which is the common case). Outlined metadata have the opposite performance characteristics: they are more expensive to access for the mutator thread, but since they are aggregated into bitvectors and stored in the block footer of each block, GC and allocators can traverse them really fast.

So JSC keeps both inlined and outlined metadata to get the better of both worlds: the mutator thread’s fast path will only concern the inlined cellState, while the GC and allocator logic can also take advantage of the memory locality of the outlined bits isNew and isMarked.
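To picture the layout, here is an illustrative sketch (the field and type names are made up for illustration, not JSC’s actual definitions):

// Illustrative layout only, not JSC's actual definitions
struct JSCell {
    // ... other object header fields ...
    uint8_t cellState;  // inlined: one load away for the mutator
};

struct BlockFooter {
    // Outlined, aggregated per block: one bit per cell, so the GC and
    // allocator can traverse them in bulk with good memory locality
    Bitvector isNew;
    Bitvector isMarked;
};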

Of course, the cost of this is a more complex design… so we have to unfold it bit by bit.

A Really Naive Stop-the-World Generational GC

Let’s start with a really naive design just to understand what is needed. We will design a generational, but stop-the-world (i.e., not incremental nor concurrent) GC, with no performance optimizations at all. In this design, the mutator side transfers control to the GC subsystem at a “safe point”[14] to start a GC cycle (eden or full). The GC subsystem performs the GC cycle from beginning to end (as a result, the application cannot run during this potentially long period, thus “stop-the-world”), and then transfers control back to the mutator side.

For this purpose, let’s temporarily forget about CompleteSubspace: it is an optimized version of PreciseAllocation for small allocations, and while it is an important optimization, it’s easier to understand the GC algorithm without it.

It turns out that in this design, all we need is one isMarked bit. The isMarked bit will indicate if the object is reachable at the end of the last GC cycle (and consequently, is in the old space, since any object that survived a GC cycle is in old space). All objects are born with isMarked = false.

The GC will use a breadth-first search to scan and mark objects. For a full GC, we want to reset all isMarked bits to false at the beginning, and do a BFS to scan and mark all objects reachable from GC roots. Then all the unmarked objects are known to be dead. For an eden GC, we only want to scan the eden space. Fortunately, all objects in the old space are already marked at the end of the previous GC cycle, so they are naturally ignored by the BFS, and we can simply reuse the same BFS algorithm as the full GC. In pseudo-code:

Eden GC preparation phase: no work is needed.

Full GC preparation phase[15]:

for (JSCell* obj : heap)
    obj->isMarked = false;

Eden/Full GC marking phase:

while (!queue.empty()) {
    JSCell* obj = queue.pop();
    obj->ForEachChild([&](JSCell* child) {
        if (!child->isMarked) {
            child->isMarked = true;
            queue.push(child);
        }
    });
}

Eden/Full GC collection phase:

// One can easily imagine an optimization to make eden collection
// traverse only the eden space. We ignore it for simplicity.
for (JSCell* obj : heap) {
    if (!obj->isMarked)
        free(obj);
}

But where does the scan start, so that we can scan through every reachable object? For full GC, the answer is clear: we just start the scan from all GC roots[16]. However, for an eden GC, in order to reliably scan through all reachable objects, the situation is slightly more complex:

  1. Of course, we still need to push the GC roots to the initial queue.

  2. If an object in the old space contains a pointer to an object in eden, we need to put the old space object into the initial queue[17].

The invariant for the second case is maintained on the mutator side. Specifically, whenever one writes a pointer slot of some object A in the heap to point to another object B, one needs to check if A.isMarked is true and B.isMarked is false. If so, one needs to put A into a “remembered set”. An eden GC must treat the objects in the remembered set as if they were GC roots. This is called a WriteBarrier. In pseudo-code:

// Executed after writing a pointer to 'dst' into a field of 'obj'
if (obj->isMarked && !dst->isMarked)
    addToRememberedSet(obj);

Getting Incremental

A stop-the-world GC isn’t optimal for production use. A GC cycle (especially a full GC cycle) can take a long time. Since the mutator (application logic) cannot run during the stop-the-world period, the application would appear irresponsive to the user, which can be a very bad user experience for long pauses.

A natural way to shorten this irresponsive period is to run the GC incrementally: at safe points, the mutator transfers control to the GC. The GC only runs for a short time, doing a portion of the work for the current GC cycle (eden or full), then returns control to the mutator. This way, each GC cycle is split into many small steps, so the irresponsive periods are less noticeable to the user.

Incremental GC poses a few new challenges to the GC scheme.

The first challenge is the extra interference between the GC and the mutator: the mutator, namely the allocator and the WriteBarrier, must be prepared to see states arising from a partially-completed GC cycle. And the GC side must correctly mark all reachable objects despite changes made by the mutator side in between.

Specifically, our full GC must change: imagine that the full GC scanned some object o and handed back control to mutator, then the mutator changed a field of o to point to some other object dst. The object dst must not be missed from scanning. Fortunately, in such a case o will be isMarked and dst will be !isMarked (if dst has isMarked then it has been scanned, so there’s nothing to worry about), so o will be put into the remembered set.

Therefore, for a full GC to function correctly in the incremental GC scheme, it must consider the remembered set as a GC root as well, just like the eden GC.

The other parts of the algorithm as of now can remain unchanged (we leave the proof of correctness as an exercise for the reader). Nevertheless, “what happens if a GC cycle is run partially?” is something that we must keep in mind as we add more optimizations.

The second challenge is that the mutator side can repeatedly put an old space object into the remembered set, resulting in redundant work for the GC: for example, the GC popped some object o from the remembered set, traversed from it, and handed over control to the mutator. The mutator modified o again, putting it back into the remembered set. If this happens too often, the incremental GC could do a lot more work than a stop-the-world GC.

The situation will get even worse once we make our GC concurrent: in a concurrent GC, since the GC is no longer stealing CPU time from the mutator, the mutator gets higher throughput, thus will add even more objects into the remembered set. In fact, JSC observed up to 5x higher memory consumption without any mitigation. Therefore, two techniques are employed to mitigate this issue.

The first and obvious mitigation is to have the GC scan the remembered set last: only when the queue has otherwise been emptied do we start popping from the remembered set. The second mitigation employed by JSC is a technique called the Space-Time Scheduler. In short, if it observes that the mutator is allocating too fast, the mutator gets a decreasing time quota to run so the GC can catch up (and in the extreme case, the mutator gets zero time quota to run, so the scheme falls back to the stop-the-world approach). Filip Pizlo’s blog post has explained it very clearly, so feel free to take a look if you are interested.

Anyways, let’s update the pseudo-code for the eden/full GC marking phase:

while (!queue.empty() || !rmbSet.empty()) {
    // Both eden GC and full GC need to consider the remembered set
    // Prioritize popping from queue, pop remembered set last
    JSCell* obj = !queue.empty() ? queue.pop() : rmbSet.pop();
    obj->ForEachChild([&](JSCell* child) {
        if (!child->isMarked) {
            child->isMarked = true;
            queue.push(child);
        }
    });
}

Incorporate in CompleteSubspace

It’s time to get our CompleteSubspace allocator back so we don’t have to suffer the huge per-object GC header overhead incurred by PreciseAllocation.

For PreciseAllocation, the actual memory management work is done by malloc: when the mutator wants to allocate an object, it just mallocs it, and when the GC discovers a dead object, it just frees it.

CompleteSubspace introduces another complexity, as it only allocates/deallocates memory from malloc at 16KB-block level, and does memory management itself to divide the blocks into cells that it serves to the application. Therefore, it has to track whether each of its cells is available for allocation. The mutator allocates from the available cells, and the GC marks dead cells as available for allocation again.

The isMarked bit is not enough for the CompleteSubspace allocator to determine if a cell contains a live object or not: newly allocated objects have isMarked = false but are clearly live objects. Therefore, we need another bit.

In fact, there are other good reasons why we need to support checking whether a cell contains a live object. A canonical example is conservative stack scanning: JSC does not precisely understand the layout of the stack, so it needs to treat everything on the stack that could be a pointer to a live object as a GC root, and this involves checking whether a heap pointer points to a live object.
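Roughly, conservative stack scanning looks like the sketch below (not JSC’s actual code; the helper names are made up for illustration):

// Rough sketch of conservative stack scanning, with made-up helper names.
// Every word on the stack that might be a pointer into the GC heap is
// checked; if it points at a live cell, that cell is treated as a GC root.
void scanStackConservatively(void** stackBegin, void** stackEnd) {
    for (void** slot = stackBegin; slot < stackEnd; slot++) {
        void* candidate = *slot;
        if (!looksLikePointerIntoGcHeap(candidate))
            continue;
        if (pointsToLiveObject(candidate))  // the liveness check this
            addToRootSet(candidate);        // section is building up to
    }
}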

One can easily imagine some kind of isLive bit that is true for all live objects, which is only flipped to false by the GC when the object is dead. However, JSC employed a slightly different scheme, which is needed to facilitate optimizations that we will mention later.

As you have seen earlier, the bit used by JSC is called isNew.

However, keep in mind: you should not think of isNew as a bit that tells you anything related to its name, or indicates anything by itself. You should think of it as a helper bit, whose sole purpose is that, when working together with isMarked, it tells you whether a cell contains a live object. This mode of thinking will be more important in the next section when we introduce logical versioning.

The core invariant around isNew and isMarked is:

  1. At any moment, an object is dead iff its isNew = false and isMarked = false.

If we were a stop-the-world GC, then to maintain this invariant, we only need the following:

  1. When an object is born, it has isNew = true and isMarked = false.

  2. At the end of each eden or full GC cycle, we set isNew of all objects to false.

Then, all newly-allocated objects are live because their isNew is true. At the end of each GC cycle, an object is live iff its isMarked is true, so after we set isNew to false (due to rule 2), the invariant on what is a dead object is maintained, as desired.

However, in an incremental GC, since the state of a partially-run GC cycle can be exposed to mutator, we need to ensure that the invariant holds in this case as well.

Specifically, in a full GC, we reset all isMarked to false at the beginning. Then, during a partially-run GC cycle, the mutator may see a live object with both isMarked = false (because it has not been marked by the GC yet), and isNew = false (because it has survived one prior GC cycle). This violates our invariant.

To fix this, at the beginning of a full GC, we additionally do isNew |= isMarked before clearing isMarked. Now, during the whole full GC cycle, all live objects must have isNew = true[18], so our invariant is maintained. At the end of the cycle, all isNew bits are cleared, and as a result, all the unmarked objects become dead, so our invariant is still maintained as desired. So let’s update our pseudo-code:

Eden GC preparation phase: no work is needed.

Full GC preparation phase:

// Do 'isNew |= isMarked, isMarked = false' for all
// PreciseAllocation and all cells in CompleteSubspace

for (PreciseAllocation* pa : allPreciseAllocations) {
    pa->isNew |= pa->isMarked;
    pa->isMarked = false;
}

for (BlockFooter* block : allCompleteSubspaceBlocks) {
    for (size_t cellId = 0; cellId < block->numCells; cellId++) {
        block->isNew[cellId] |= block->isMarked[cellId];
        block->isMarked[cellId] = false;
    }
}

Eden/Full GC collection phase:

// Update 'isNew = false' for CompleteSubspace cells
for (BlockFooter* block : allCompleteSubspaceBlocks) {
    for (size_t cellId = 0; cellId < block->numCells; cellId++) {
        block->isNew[cellId] = false;
    }
}

// For PreciseAllocation, in addition to updating 'isNew = false',
// we also need to free the dead objects
for (PreciseAllocation* pa : allPreciseAllocations) {
    pa->isNew = false;
    if (!pa->isMarked)
        free(pa);
}

In CompleteSubspace allocator, to check if a cell in a block contains a live object (if not, then the cell is available for allocation):

bool cellContainsLiveObject(BlockFooter* block, size_t cellId) {
    return block->isMarked[cellId] || block->isNew[cellId];
}

Logical Versioning: Do Not Sweep!

We are doing a lot of work at the beginning of a full GC cycle and at the end of any GC cycle, since we have to iterate through all the blocks in CompleteSubspace and update their isMarked and isNew bits. Even though the bits in one block are clustered into bitvectors and thus have good memory locality, this could still be an expensive operation, especially after we make the GC concurrent (as this stage cannot be made concurrent). So we want something better.

The optimization JSC employs is logical versioning. Instead of physically clearing all bits in all blocks for every GC cycle, we only bump a global “logical version”, indicating that all the bits are logically cleared (or updated). Only when we actually need to mark a cell in a block during the marking phase do we then physically clear (or update) the bitvectors in this block.

You may ask: why bother with logical versioning, if in the future we still have to update the bitvectors physically anyway? There are two good reasons:

  1. If all cells in a block are dead (either they died during this GC cycle[19], or they were already dead before this GC cycle), then we will never mark anything in the block, so logical versioning enables us to avoid the work altogether. This also implies that at the end of each GC cycle, it’s unnecessary to figure out which blocks have become completely empty, as logical versioning makes sure that these empty blocks will not cause overhead to future GC cycles.

  2. The marking phase can be done concurrently with multiple threads and while the mutator thread is running (our scheme isn’t concurrent now, but we will do it soon), while the preparation / collection phase must be performed with the mutator stopped. Therefore, shifting the work to the marking phase reduces GC latency in a concurrent setting.

There are two global version numbers, g_markVersion and g_newVersion[20]. Each block footer also stores its local version numbers, l_markVersion and l_newVersion.

Let’s start with the easier case: the logical versioning for the isNew bit.

If you revisit the pseudo-code above, in GC there is only one place where we write isNew: at the end of each GC cycle, we set all the isNew bits to false. Therefore, we simply bump g_newVersion there instead. A local version l_newVersion smaller than g_newVersion means that all the isNew bits in this block have been logically cleared to false.

When the CompleteSubspace allocator allocates a new object, it needs to start with isNew = true. One can clearly do this directly, but JSC did it in a trickier way that involves a block-level bit named allocated for slightly better performance. This is not too interesting, so I deferred it to the end of the post, and our scheme described here right now will not employ this optimization (but is otherwise intentionally kept semantically equivalent to JSC):

  1. When a BlockDirectory starts allocating from a new block, it updates the block’s l_newVersion to g_newVersion, sets isNew to true for all already-allocated cells (as the block may not be fully empty), and sets it to false for all free cells.

  2. Whenever it allocates a cell, it sets its isNew to true.

Why do we want to bother setting isNew to true for all already-allocated cells in the block? This is to provide a good property. Since we bump g_newVersion at the end of every GC cycle, due to the scheme above, for any block with latest l_newVersion, a cell is live if and only if its isNew bit is set. Now, when checking if a cell is live, if its l_newVersion is the latest, then we can just return isNew without looking at isMarked, so our logic is simpler.
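In pseudocode, a sketch of the two rules above (without JSC’s allocated-bit optimization; the helper names are illustrative) might look like this:

// Pseudocode sketch of the two rules above; helper names are illustrative
void BlockDirectory::startAllocatingInBlock(BlockFooter* block) {
    // Note: the liveness check below must run before we bump the block's
    // l_newVersion, since it may consult the old isNew bits
    for (size_t cellId = 0; cellId < block->numCells; cellId++)
        block->isNew[cellId] = cellContainsLiveObject(block, cellId);
    block->l_newVersion = g_newVersion;
}

void BlockDirectory::didAllocateCell(BlockFooter* block, size_t cellId) {
    block->isNew[cellId] = true;  // every newly allocated cell starts as new
}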

The logical versioning for the isMarked bit is similar. At the beginning of a full GC cycle, we bump g_markVersion to indicate that all mark bits are logically cleared. Note that the global version is not bumped for an eden GC, since an eden GC does not clear isMarked bits.

There is one extra complexity: the above scheme would break down in an incremental GC. Specifically, during a full GC cycle, we have logically cleared the isMarked bit, but we also didn’t do anything to the isNew bit, so all cells in the old space would appear dead to the allocator. In our old scheme without logical versioning, this case is prevented by doing isNew |= isMarked at the start of the full GC, but we cannot do it now with logical versioning.

JSC solves this problem with the following clever trick: during a full GC, we should also accept l_markVersion that is off-by-one. In that case, we know the isMarked bit accurately reflects whether or not a cell is live, since that is the result of the last GC cycle. If you are a bit confused, take a look at footnote[21] for a more elaborated case discussion. It might also help to take a look at the comments in the pseudo-code below:

bool cellContainsLiveObject(BlockFooter* block, size_t cellId) {
    if (block->l_newVersion == g_newVersion) {
        // A latest l_newVersion indicates that the cell is live if
        // and only if its 'isNew' bit is set, so we don't need to
        // look at the 'isMarked' bit even if 'isNew' is false
        return block->isNew[cellId];
    }

    // Now we know isNew bit is logically false, so we should
    // look at the isMarked bit to determine if the object is live
    if (isMarkBitLogicallyCleared(block)) {
        // The isMarked bit is logically false
        return false;
    }

    // The isMarked bit is valid and accurately tells us if
    // the object is live or not
    return block->isMarked[cellId];
}


// Return true if the isMarked bitvector is logically cleared
bool isMarkBitLogicallyCleared(BlockFooter* block) {
    if (block->l_markVersion == g_markVersion) {
        // The mark version is up-to-date, so not cleared
        return false;
    }

    if (IsFullGcRunning() && IsGcInMarkingPhase() &&
        block->l_markVersion == g_markVersion - 1) {
        // We are halfway inside a full GC cycle's marking phase,
        // and the mark version is off-by-one, so the old isMarked bit
        // should be accepted, and it accurately tells us if the
        // object is live or not
        return false;
    }

    return true;
}

Before we mark an object in CompleteSubspace, we need to update the l_markVersion of the block holding the cell to the latest, and materialize the isMarked bits of all cells in the block. That is, we need to run the logic at the full GC preparation phase in our old scheme: isNew |= isMarked, isMarked = false for all cells in the block. This is shown below.

// Used by GC marking phase to mark an object in CompleteSubspace
void markObject(BlockFooter* block, size_t cellId) {
    aboutToMark(block);
    block->isMarked[cellId] = true;
}


// Materialize 'isMarked' bits if needed
// To do this, we need to execute the operation at full GC
// prepare phase: isNew |= isMarked, isMarked = false
void aboutToMark(BlockFooter* block) {
    if (block->l_markVersion == g_markVersion) {
        // Our mark version is already up-to-date,
        // which means it has been materialized before
        return;
    }

    // Check if the isMarked bit is logically cleared to false.
    // The function is defined in the previous snippet.
    if (isMarkBitLogicallyCleared(block)) {
        // This means that the isMarked bitvector should
        // be treated as all false. So operation isNew |= isMarked
        // is no-op, so all we need to do is isMarked = false
        for (size_t cellId = 0; cellId < block->numCells; cellId++) {
            block->isMarked[cellId] = false;
        }
    } else {
        // The 'isMarked' bit is not logically cleared. Now let's
        // check if the 'isNew' bit is logically cleared.
        if (block->l_newVersion < g_newVersion) {
            // The isNew bitvector is logically cleared and should be
            // treated as false. So operation isNew |= isMarked becomes
            // isNew = isMarked (note that executing |= is incorrect
            // because isNew could physically contain true!)
            for (size_t cellId = 0; cellId < block->numCells; cellId++) {
                block->isNew[cellId] = block->isMarked[cellId];
                block->isMarked[cellId] = false;
            }

            // We materialized isNew, so update it to latest version
            block->l_newVersion = g_newVersion;
        } else {
            // The l_newVersion is latest, which means that the cell is
            // live if and only if its isNew bit is set.
            // Since isNew already reflects liveness, we do not have to
            // perform the operation isNew |= isMarked (and in fact, it
            // must be a no-op since no dead cell can have isMarked =
            // true). So we only need to do isMarked = false
            for (size_t cellId = 0; cellId < block->numCells; cellId++) {
                block->isMarked[cellId] = false;
            }
        }
    }

    // We finished materializing isMarked, so update the version
    block->l_markVersion = g_markVersion;
}

A fun fact: although what we conceptually want to do above is isNew |= isMarked, the above code never performs a |= at all 🙂

And also, let’s update the pseudo-code for the preparation GC logic:

Eden GC preparation phase: no work is needed.

Full GC preparation phase:

// For PreciseAllocation, we still need to manually do
// 'isNew |= isMarked, isMarked = false' for every allocation
for (PreciseAllocation* pa : allPreciseAllocations) {
    pa->isNew |= pa->isMarked;
    pa->isMarked = false;
}

// For CompleteSubspace, all we need to do is bump the
// global version for the 'isMarked' bit
g_markVersion++;

Eden/Full GC collection phase:

// For PreciseAllocation, we still need to manually
// update 'isNew = false' for each allocation, and also
// free the object if it is dead
for (PreciseAllocation* pa : allPreciseAllocations) {
    pa->isNew = false;
    if (!pa->isMarked)
        free(pa);
}

// For CompleteSubspace, all we need to do is bump the
// global version for the 'isNew' bit
g_newVersion++;

With logical versioning, the GC no longer sweeps the CompleteSubspace blocks to reclaim dead objects: the reclamation happens lazily, when the allocator starts to allocate from the block. This, however, introduces an unwanted side-effect. Some objects use manual memory management internally: they own additional memory that is not managed by the GC, and have C++ destructors to free that memory when the object is dead. This improves performance as it reduces the work of the GC. However, now we may not immediately sweep dead objects and run their destructors, so the memory that is supposed to be freed by the destructor could be kept around indefinitely if the block is never allocated from. To mitigate this issue, JSC also periodically sweeps blocks and runs the destructors of dead objects. This is implemented by the IncrementalSweeper, but we will not go into details.
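Conceptually, such a sweep does something like the following sketch (the real IncrementalSweeper is timer-driven and more involved; the helper names are made up):

// Conceptual sketch only; helper names are made up, not JSC's actual code
void sweepBlock(BlockFooter* block) {
    for (size_t cellId = 0; cellId < block->numCells; cellId++) {
        if (cellContainsLiveObject(block, cellId))
            continue;
        JSCell* cell = cellAt(block, cellId);
        if (cellHasDestructor(cell))
            runDestructor(cell);  // frees the non-GC memory the object owns
        markCellAsAvailableForAllocation(block, cellId);
    }
}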

To conclude, logical versioning provides two important optimizations to the GC scheme:

  1. The so-called “sweep” phase of the GC (to find and reclaim dead objects) is removed for CompleteSubspace objects. The reclamation is done lazily. This is clearly better than sweeping through the block again and again in every GC cycle.
  2. The full GC does not need to reset all isMarked bits in the preparation phase, but only lazily reset them in the marking phase by aboutToMark: this not only reduces work, but also allows the work to be done in parallel and concurrently while the mutator is running, after we make our GC scheme concurrent.

Optimizing WriteBarrier: The cellState Bit

As we have explained earlier, whenever the mutator modified a pointer of a marked object o to point to an unmarked object, it needs to add o to the “remembered set”, and this is called the WriteBarrier. In this section, we will dig a bit deeper into the WriteBarrier and explain the optimizations around it.

The first problem with our current WriteBarrier is that the isMarked bit resides in the block footer, so retrieving its value requires quite a few computations from the object pointer. It also doesn’t sit in the same CPU cache line as the object, which makes the access even slower. This is undesirable as the cost is paid for every WriteBarrier, regardless of whether we add the object to the remembered set.

The second problem is our WriteBarrier will repeatedly add the same object o to the remembered set every time it is run. The obvious solution is to make rememberedSet a hash set to de-duplicate the objects it contains, but doing a hash lookup to check if the object already exists is far too expensive.

This is where the last piece of metadata that we haven’t explained yet, the cellState bit, comes in: it solves both problems.

Instead of making rememberedSet a hash table, we reserve a byte (though we only use 1 bit of it) named cellState in every object’s object header, to indicate if we might need to put the object into the remembered set in a WriteBarrier. Since this bit resides in the object header as an object field (instead of in the block footer), it’s trivially accessible to the mutator who has the object pointer.

cellState has two possible values: black and white. The most important two invariants around cellState are the following:

  1. For any object with cellState = white, it is guaranteed that the object does not need to be added to remembered set.
  2. Except during a full GC cycle, all black (live) objects have isMarked = true.

Invariant 1 serves as a fast-path: WriteBarrier can return immediately if our object is white, and checking it only requires one load instruction (to load cellState) and one comparison instruction to validate it is white.

However, if the object is black, a slow-path is needed to check whether it is actually needed to add the object to the remembered set.

Let’s look at our new WriteBarrier:

// Executed after writing a pointer to 'dst' into a field of 'obj'
void WriteBarrier(JSCell* obj) {
    if (obj->cellState == black)
        WriteBarrierSlowPath(obj);
}

The first thing to notice is that the WriteBarrier is no longer checking if dst (the object that the pointer points to) is marked or not. Clearly this does not affect the correctness: we are just making the criteria less restrictive. However, the performance impact of removing this dst check is a tricky question without a definite answer, even for JSC developers. Through some preliminary testing, their conclusion is that adding back the !isMarked(dst) check slightly regresses performance. They have two hypotheses. First, by not checking dst, more objects are put into the remembered set and need to be scanned by the GC, so the total amount of work increased. However, the mutator’s work probably decreased, as it does fewer checks and touches fewer cache lines (by not touching the outlined isMarked bit). Of course such benefit is offset because the mutator is adding more objects into the remembered set, but this isn’t too expensive either, as the remembered set is only a segmented vector. The GC has to do more work, as it needs to scan and mark more objects. However, after we make our scheme concurrent, the marking phase of the GC can be done concurrently as the mutator is running, so the latency is probably[22] hidden. Second, JSC’s DFG compiler has an optimization pass that coalesces barriers on the same object together, and it is harder for such barriers to check all the dsts.

The interesting part is how the invariants above are maintained by the relevant parties. As always, there are three actors: the mutator (WriteBarrier), the allocator, and the GC.

The interaction with the allocator is the simplest. All objects are born white. This is correct because newly-born objects are not marked, so have no reason to be remembered.

The interaction with GC is during the GC marking phase:

  1. When we mark an object and push it into the queue, we set its cellState to white.
  2. When we pop an object from the queue, before we start to scan its children, we set its cellState to black.

In pseudo-code, the Eden/Full GC marking phase now looks like the following (the lines marked “newly added” handle cellState; the other lines are unchanged):

while (!queue.empty() || !rmbSet.empty()) {
    // Both eden GC and full GC need to consider the remembered set
    // Prioritize popping from queue, pop remembered set last
    JSCell* obj = !queue.empty() ? queue.pop() : rmbSet.pop();
    obj->cellState = black;           // <----------------- newly added

    obj->ForEachChild([&](JSCell* child) {
        if (!child->isMarked) {
            markObject(child);
            child->cellState = white; // <----------------- newly added
            queue.push(child);
        }
    });
}

Let’s argue why the invariant is maintained by the above code.

  1. For invariant 1, note that in the above code, an object is white only if it is inside the queue (as once it’s popped out, it becomes black again), pending scanning of its children. Therefore, it is guaranteed that the object will still be scanned by the GC later, so we don’t need to add the object to remembered set, as desired.
  2. For invariant 2, at the end of any GC cycle, any live object is marked, which means it has been scanned, so it is black, as desired.

Now let’s look at what WriteBarrierSlowPath should do. Clearly, it’s correct if it simply unconditionally adds the object to the remembered set, but that also defeats most of the purpose of cellState as an optimization mechanism: we want something better. A key use case of cellState is to prevent adding an object to the remembered set when it is already there. Therefore, after we put the object into the remembered set, we will set its cellState to white, as shown below.

void WriteBarrierSlowPath(JSCell* obj) { 
    obj->cellState = white;
    addToRememberedSet(obj);
}

Let’s prove why the above code works. Once we add an object to the remembered set, we set it to white. We don’t need to add the same object to the remembered set again until it gets popped from the set by the GC. But when the GC pops the object, it sets its cellState back to black, so we are good.

JSC employed one more optimization. During a full GC, we might see a black object that has isMarked = false (note that this is the only possible case that the object is unmarked, due to invariant 2). In this case, it’s unnecessary to add the object to remembered set, since the object will eventually be scanned in the future (or it became dead some time later before it was scanned, in which case we are good as well). Furthermore, we can flip it back to white, so we don’t have to go into this slow path the next time a WriteBarrier on this object runs. To sum up, the optimized version is as below:

void WriteBarrierSlowPath(JSCell* obj) {
    if (IsFullGcRunning()) {
        if (!isMarked(obj)) {
            // Do not add the object to remembered set
            // In addition, set cellState to white so this
            // slow path is not triggered on the next run
            obj->cellState = white;
            return;
        }
    } else {
        assert(isMarked(obj)); // due to invariant 2
    }

    obj->cellState = white;
    addToRememberedSet(obj);
}

Getting Concurrent and Getting Wild

At this point, we already have a very good incremental and generational garbage collector: the mutator, allocator and GC all have their respective fast-paths for the common cases, and with logical versioning, we avoided redundant work as much as possible. In my humble opinion, this is a good balance point between performance and engineering complexity.

However, because JSC is one of the core drivers of performance in Safari, it’s unsurprising that performance is a top priority, even at the cost of engineering complexity. To squeeze out every bit of performance, JSC made their GC concurrent. This is no easy feat: due to the nature of GCs, it’s often too slow to use locks to protect against certain race conditions, so extensive lock-free programming is employed.

But once lock-free programming is involved, one starts to get into all sorts of architecture-dependent memory reordering problems. x86-64 is the stricter architecture: it provides TSO-like semantics, so only StoreLoadFence() is ever required. JSC also supports ARM64 CPUs, which provide even fewer guarantees: load-load, load-store, store-load, and store-store can all be reordered by the CPU, so a lot more operations need fences. As if things were not bad enough, for performance reasons, JSC often avoids using memory fences on ARM64. It has the so-called Dependency class, which creates an implicit CPU data dependency on ARM64 through some scary assembly hacks, so it can get the desired memory ordering for a specific data flow without paying the cost of a memory fence. As you can imagine, with all of these complications and optimizations, the code can become difficult to read.

So due to my limited expertise, it’s unsurprising if I failed to explain or mis-explained some important race conditions in the code, especially some ARM64-specific ones: if you spot any issue in this post, please let me know.

Let’s go through the concurrency assumptions first. JavaScript is a single-threaded language, so there is always only one mutator thread[23]. Apart from the mutator thread, JSC has a bunch of compilation threads, a GC thread, and a bunch of marking threads. Only the GC marking phase is concurrent: during it, the mutator thread, the compiler threads, and a bunch of marking threads all run concurrently (yes, the marking itself is also done in parallel). However, all the other GC phases are run with the mutator thread and compilation threads stopped.

Some Less Interesting Issues

First of all, clearly the isMarked and isNew bitvectors must be made safe for concurrent access, since multiple threads (including marking threads and the mutator) may concurrently update them. Using CAS with an appropriate retry/bail mechanism is enough for the bitvectors themselves.
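For example, setting a single bit of a shared bitvector word with CAS could look roughly like this (a minimal sketch, not JSC’s actual code):

#include <atomic>
#include <cstdint>

// Minimal sketch: atomically set one bit of a shared bitvector.
// Returns false (bails) if another thread already set the bit.
bool concurrentTestAndSetBit(std::atomic<uint64_t>* words, size_t bitIndex) {
    std::atomic<uint64_t>& word = words[bitIndex / 64];
    uint64_t mask = uint64_t(1) << (bitIndex % 64);
    uint64_t oldValue = word.load(std::memory_order_relaxed);
    do {
        if (oldValue & mask)
            return false;          // lost the race: someone else set it
    } while (!word.compare_exchange_weak(oldValue, oldValue | mask));
    return true;                   // we are the thread that set the bit
}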

BlockFooter is harder, and needs to be protected with a lock: multiple threads could be simultaneously calling aboutToMark(), so aboutToMark() must be guarded. For the reader side (the isMarked() function, which involves first checking if l_markVersion is latest, then reading the isMarked bitvector), in x86-64 thanks to x86-TSO, one does not need a lock or any memory fence (as long as aboutToMark takes care to update l_markVersion after the bitvector). In ARM64, since load-load reordering is allowed, a Dependency is required.

Making the cellContainsLiveObject (or in JSC jargon, isLive) check lock-free is harder, since it involves potentially reading both the isMarked bit and the isNew bit. JSC employs optimistic locking to provide a fast-path. This is not very different from an optimistic locking scheme you can find in a textbook, so I won’t dive into the details.
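For reference, a textbook-style optimistic read looks roughly like the following (this is a generic sketch, not JSC’s actual protocol; the per-block version counter and helpers are assumptions for illustration):

// Generic textbook-style sketch, not JSC's actual protocol.
// Assumes a per-block version counter that writers bump (odd while writing).
bool isLiveOptimistic(BlockFooter* block, size_t cellId) {
    for (;;) {
        uint64_t version = block->footerVersion.load();
        if (version & 1)                              // a writer is in progress:
            return isLiveHoldingLock(block, cellId);  // take the locked slow path
        bool result = cellContainsLiveObject(block, cellId);
        if (block->footerVersion.load() == version)
            return result;                            // no writer interfered: valid
        // otherwise a writer got in between: retry the optimistic read
    }
}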

Of course, there are a lot more subtle issues to handle. Almost all the pseudo-code above needs to be adapted for concurrency, either by using a lock or CAS, or by using some sort of memory barrier and concurrency protocol to ensure that the code works correctly under races. But now let’s turn to some more important and tricky issues.

The Race Between WriteBarrier and Marking

One of the most important races is the race between WriteBarrier and GC’s marking threads. The marking threads and the mutator thread can access the cellState of an object concurrently. For performance reasons, a lock is infeasible, so a race condition arises.

It’s important to note that we call WriteBarrier after we have written the pointer into the object. This is not only more convenient to use (especially for JIT-generated code), but also allows a few optimizations: for example, in certain cases, multiple writes to the same object may only call WriteBarrier once at the end.
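For instance (an illustrative sketch, not actual JIT output):

// Illustrative sketch: because the barrier runs after the stores and only
// inspects o's cellState, several stores to the same object can share it
o->fieldA = x;
o->fieldB = y;
o->fieldC = z;
WriteBarrier(o);   // one barrier at the end covers all three stores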

With this in mind, let’s analyze why our current implementation is buggy. Suppose o is an object, and the mutator wants to store a pointer to another object target into a field f of o. The marking logic of the GC wants to scan o and append its children into the queue. We need to make sure that GC will observe the o->target pointer link.

Let’s first look at the correct logic:

Mutator (WriteBarrier):

  1. Store(o.f, target)
  2. StoreLoadFence()                 // WriteBarrier begins here
  3. t1 = Load(o.cellState)
  4. if (t1 == black): WriteBarrierSlowPath(o)

GC (Marker):

  1. Store(o.cellState, black)
  2. StoreLoadFence()
  3. t2 = Load(o.f)                   // Load a child of o
  4. Do some checks on t2 and push it to the queue

This is mostly just a copy of the pseudocode in the above sections, except that we have two StoreLoadFence(). A StoreLoadFence() guarantees that no LOAD after the fence may be executed by the CPU out-of-order engine until all STORE before the fence have completed. Let’s first analyze what could go wrong without either of the fences.

Just to make things perfectly clear, the precondition is o.cellState = white (because o is in the GC’s queue) and o.f = someOldValue.

What could go wrong if the mutator WriteBarrier doesn’t have the fence? Without the fence, the CPU can execute the LOAD in line 3 before the STORE in line 1. Then, in the following interleaving:

  1. [Mutator Line 3] t1 = Load(o.cellState)    // t1 = white
  2. [GC Line 1] Store(o.cellState, black)
  3. [GC Line 3] t2 = Load(o.f)                 // t2 = some old value
  4. [Mutator Line 1] Store(o.f, target)

Now, the mutator did not add o to the remembered set (because t1 is white, not black), and t2 in the GC is the old value of o.f instead of target, so the GC did not push target into the queue. So the pointer link from o to target is missed by the GC. This can result in target being wrongly reclaimed even though it is live.

And what could go wrong if the GC marking logic doesn’t have the fence? Similarly, without it, the CPU may execute the LOAD in GC Line 3 before the STORE in GC Line 1. Then, in the following interleaving:

  1. [GC Line 3] t2 = Load(o.f)                 // t2 = some old value
  2. [Mutator Line 1] Store(o.f, target)
  3. [Mutator Line 3] t1 = Load(o.cellState)    // t1 = white
  4. [GC Line 1] Store(o.cellState, black)

As above, the mutator sees t1 = white and the GC sees t2 = oldValue. So o is not added to the remembered set, target is not pushed into the queue, and the pointer link is missed.

Finally, let’s analyze why the code behaves correctly if both fences are present. Unfortunately, there is no better way than manually enumerating all the interleavings. Thanks to the fences, Mutator Line 1 must execute before Mutator Line 3, and GC Line 1 must execute before GC Line 3, but the four lines can otherwise be reordered arbitrarily. So there are 4! / 2! / 2! = 6 possible interleavings. So let’s go!

Interleaving 1:

  1. [Mutator Line 1] Store(o.f, target)
  2. [Mutator Line 3] t1 = Load(o.cellState)    // t1 = white
  3. [GC Line 1] Store(o.cellState, black)
  4. [GC Line 3] t2 = Load(o.f)                 // t2 = target

In this interleaving, the mutator did not add o to the remembered set, but the GC sees target, so it’s fine.

Interleaving 2:

  1. [GC Line 1] Store(o.cellState, black)
  2. [GC Line 3] t2 = Load(o.f)                // t2 = some old value
  3. [Mutator Line 1] Store(o.f, target)
  4. [Mutator Line 3] t1 = Load(o.cellState)    // t1 = black

In this interleaving, the GC saw the old value, but the mutator added o to the remembered set, so the GC will eventually drain from the remembered set and scan o again, at which time it will see the correct new value target, so it’s fine.

Interleaving 3:

  1. [Mutator Line 1] Store(o.f, target)
  2. [GC Line 1] Store(o.cellState, black)
  3. [Mutator Line 3] t1 = Load(o.cellState)    // t1 = black
  4. [GC Line 3] t2 = Load(o.f)                 // t2 = target

In this interleaving, the GC saw the new value target; nevertheless, the mutator saw t1 = black and added o to the remembered set. This is unfortunate since the GC will scan o again, but it doesn’t affect correctness.

Interleaving 4:

  1. [Mutator Line 1] Store(o.f, target)
  2. [GC Line 1] Store(o.cellState, black)
  3. [GC Line 3] t2 = Load(o.f)                 // t2 = target
  4. [Mutator Line 3] t1 = Load(o.cellState)    // t1 = black

Same as Interleaving 3.

Interleaving 5:

  1. [GC Line 1] Store(o.cellState, black)
  2. [Mutator Line 1] Store(o.f, target)
  3. [Mutator Line 3] t1 = Load(o.cellState)    // t1 = black
  4. [GC Line 3] t2 = Load(o.f)                // t2 = target

Same as Interleaving 3.

Interleaving 6:

  1. [GC Line 1] Store(o.cellState, black)
  2. [Mutator Line 1] Store(o.f, target)
  3. [GC Line 3] t2 = Load(o.f)                 // t2 = target
  4. [Mutator Line 3] t1 = Load(o.cellState)    // t1 = black

Same as Interleaving 3.

This proves that with the two StoreLoadFence(), our code is no longer vulnerable to the above race condition.
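
If you would rather not trust a manual enumeration, the same argument can be checked mechanically. The following sketch (plain JavaScript, not JSC code) brute-forces every ordering permitted by the fences and verifies the invariant: either the GC loads target, or the mutator observes black and will therefore add o to the remembered set.

const ops = [
    ["M1", s => { s.f = "target"; }],         // Mutator Line 1: Store(o.f, target)
    ["M3", s => { s.t1 = s.cellState; }],     // Mutator Line 3: t1 = Load(o.cellState)
    ["G1", s => { s.cellState = "black"; }],  // GC Line 1: Store(o.cellState, black)
    ["G3", s => { s.t2 = s.f; }],             // GC Line 3: t2 = Load(o.f)
];

function* permutations(list) {
    if (list.length <= 1) { yield list; return; }
    for (let i = 0; i < list.length; i++)
        for (const rest of permutations(list.filter((_, j) => j !== i)))
            yield [list[i], ...rest];
}

for (const order of permutations(ops)) {
    const names = order.map(([name]) => name);
    // The fences force Mutator Line 1 before Line 3, and GC Line 1 before Line 3.
    if (names.indexOf("M1") > names.indexOf("M3")) continue;
    if (names.indexOf("G1") > names.indexOf("G3")) continue;
    const state = { f: "oldValue", cellState: "white", t1: null, t2: null };
    for (const [, step] of order) step(state);
    const safe = state.t2 === "target" || state.t1 === "black";
    console.log(names.join(" -> "), safe ? "OK" : "BUG");  // prints OK for all 6 orderings
}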

Another Race Condition Between WriteBarrier and Marking

The above fix alone is not enough: there is another race between WriteBarrier and GC marking threads. Recall that in WriteBarrierSlowPath, we attempt to flip the object back to white if we saw it is not marked (this may happen during a full GC), as illustrated below:

... omitted ...
if (!isMarked(obj)) {
    obj->cellState = white;
    return;
}
... omitted ...

It turns out that, after setting the object white, we need to do a StoreLoadFence(), and check again if the object is marked. If it becomes marked, we need to set obj->cellState back to black.

Without the fix, the code is vulnerable to the following race:

  1. [Precondition] o.cellState = black and o.isMarked = false
  2. [WriteBarrier] Check isMarked()                 // see false
  3. [GC Marking] CAS(o.isMarked, true), Store(o.cellState, white), pushed ‘o’ into queue
  4. [GC Marking] Popped ‘o’ from queue, Store(o.cellState, black)
  5. [WriteBarrier] Store(o.cellState, white)
  6. [Postcondition] o.cellState = white and o.isMarked = true

The post-condition is bad because o will not be added to the remembered set in the future, even though it needs to be (as the GC has already scanned it).

Let’s now prove why the code is correct when the fix is applied. Now the WriteBarrier logic looks like this:

  1. [WriteBarrier] Store(o.cellState, white)
  2. [WriteBarrier] t1 = isMarked()
  3. [WriteBarrier] if (t1 == true): Store(o.cellState, black)

Note that we omitted the first “Check isMarked()” line because it must be the first thing executed in the interleaving, as otherwise the if-check won’t pass at all.

The three WriteBarrier lines cannot be reordered by the CPU: lines 1-2 cannot be reordered because of the StoreLoadFence(), and lines 2-3 cannot be reordered because line 3 is a store that only executes if line 2 is true. The two GC lines cannot be reordered by the CPU because line 2 stores to the same field o.cellState that line 1 writes.

In addition, note that it’s fine if, at the end of WriteBarrier, the object is black but the GC has only executed its line 1: this is unfortunate, because the next WriteBarrier on this object will add it to the remembered set even though that is unnecessary, but it does not affect correctness. So now, let’s enumerate all the interleavings again!

Interleaving 1.

  1. [WriteBarrier] Store(o.cellState, white)
  2. [WriteBarrier] t1 = isMarked() // t1 = false
  3. [WriteBarrier] if (t1 == true): Store(o.cellState, black)   // not executed

Object is not marked and white, OK.

Interleaving 2.

  1. [WriteBarrier] Store(o.cellState, white)
  2. [WriteBarrier] t1 = isMarked() // t1 = false
  3. [GC Marking] CAS(o.isMarked, true), Store(o.cellState, white), pushed ‘o’ into queue
  4. [WriteBarrier] if (t1 == true): Store(o.cellState, black)   // not executed

Object is in queue and white, OK.

Interleaving 3.

  1. [WriteBarrier] Store(o.cellState, white)
  2. [GC Marking] CAS(o.isMarked, true), Store(o.cellState, white), pushed ‘o’ into queue
  3. [WriteBarrier] t1 = isMarked() // t1 = true
  4. [WriteBarrier] if (t1 == true): Store(o.cellState, black)   // executed

Object is in queue and black, unfortunate but OK.

Interleaving 4.

  1. [GC Marking] CAS(o.isMarked, true), Store(o.cellState, white), pushed ‘o’ into queue
  2. [WriteBarrier] Store(o.cellState, white)
  3. [WriteBarrier] t1 = isMarked() // t1 = true
  4. [WriteBarrier] if (t1 == true): Store(o.cellState, black)   // executed

Object is in queue and black, unfortunate but OK.

Interleaving 5.

  1. [WriteBarrier] Store(o.cellState, white)
  2. [WriteBarrier] t1 = isMarked() // t1 = false
  3. [GC Marking] CAS(o.isMarked, true), Store(o.cellState, white), pushed ‘o’ into queue
  4. [GC Marking] Popped ‘o’ from queue, Store(o.cellState, black)
  5. [WriteBarrier] if (t1 == true): Store(o.cellState, black)   // not executed

Object is marked and black, OK.

Interleaving 6.

  1. [WriteBarrier] Store(o.cellState, white)
  2. [GC Marking] CAS(o.isMarked, true), Store(o.cellState, white), pushed ‘o’ into queue
  3. [WriteBarrier] t1 = isMarked() // t1 = true
  4. [GC Marking] Popped ‘o’ from queue, Store(o.cellState, black)
  5. [WriteBarrier] if (t1 == true): Store(o.cellState, black)   // executed

Object is marked and black, OK.

Interleaving 7.

  1. [GC Marking] CAS(o.isMarked, true), Store(o.cellState, white), pushed ‘o’ into queue
  2. [WriteBarrier] Store(o.cellState, white)
  3. [WriteBarrier] t1 = isMarked() // t1 = true
  4. [GC Marking] Popped ‘o’ from queue, Store(o.cellState, black)
  5. [WriteBarrier] if (t1 == true): Store(o.cellState, black)   // executed

Object is marked and black, OK.

Interleaving 8.

  1. [WriteBarrier] Store(o.cellState, white)
  2. [GC Marking] CAS(o.isMarked, true), Store(o.cellState, white), pushed ‘o’ into queue
  3. [GC Marking] Popped ‘o’ from queue, Store(o.cellState, black)
  4. [WriteBarrier] t1 = isMarked() // t1 = true
  5. [WriteBarrier] if (t1 == true): Store(o.cellState, black)   // executed

Object is marked and black, OK.

Interleaving 9.

  1. [GC Marking] CAS(o.isMarked, true), Store(o.cellState, white), pushed ‘o’ into queue
  2. [WriteBarrier] Store(o.cellState, white)
  3. [GC Marking] Popped ‘o’ from queue, Store(o.cellState, black)
  4. [WriteBarrier] t1 = isMarked() // t1 = true
  5. [WriteBarrier] if (t1 == true): Store(o.cellState, black)   // executed

Object is marked and black, OK.

Interleaving 10.

  1. [GC Marking] CAS(o.isMarked, true), Store(o.cellState, white), pushed ‘o’ into queue
  2. [GC Marking] Popped ‘o’ from queue, Store(o.cellState, black)
  3. [WriteBarrier] Store(o.cellState, white)
  4. [WriteBarrier] t1 = isMarked() // t1 = true
  5. [WriteBarrier] if (t1 == true): Store(o.cellState, black)   // executed

Object is marked and black, OK.

So let’s update our pseudo-code. Note, however, that JSC’s implementation does not use a StoreLoadFence() after obj->cellState = white. Instead, it performs the obj->cellState = white store as a CAS from black to white (with memory_order_seq_cst ordering). This is stronger than a StoreLoadFence(), so that logic is also correct. Just in case my analysis above missed some other race with other components, our pseudo-code will stick to JSC’s logic.

Mutator WriteBarrier pseudo-code:

void WriteBarrier(JSCell* obj) {
    StoreLoadFence(); // Note the fence!
    if (obj->cellState == black)
        WriteBarrierSlowPath(obj);
}

void WriteBarrierSlowPath(JSCell* obj) {
    if (IsGcRunning()) {
        if (!isMarked(obj)) {
            if (SUCCESS == 
                CompareAndSwap(obj->cellState, black /*from*/, white /*to*/)) {
                if (isMarked(obj)) {
                    obj->cellState = black;
                }
            }
            return;
        }
    } else {
        assert(isMarked(obj));
    }

    obj->cellState = white;
    // Add 'obj' to remembered set
    rmbSet.push(obj);
}

Eden/Full GC Marking phase:

while (!queue.empty() || !rmbSet.empty()) {
    JSCell* obj = !queue.empty() ? queue.pop() : rmbSet.pop();
    obj->cellState = black;

    StoreLoadFence(); // Note the fence!

    obj->ForEachChild([&](JSCell* child) {
        if (!child->isMarked) {
            markObject(child);
            child->cellState = white;
            queue.push(child);
        }
    });
}

Remove Unnecessary Memory Fence In WriteBarrier

The WriteBarrier is now free of hazardous race conditions. However, we are executing a StoreLoadFence() for every WriteBarrier, which is a very expensive CPU instruction. Can we optimize it?

The idea is the following: the fence is used to protect against race with GC. Therefore, we definitely need the fence if the GC is concurrently running. However, the fence is unnecessary if the GC is not running. Therefore, we can check if the GC is running first, and only execute the fence if the GC is indeed running.

JSC is even more clever: instead of having two checks (one that checks if the GC is running and one that checks if the cellState is black), it combines them into a single check for the fast-path where the GC is not running and the object is white. The trick is the following:

  1. Assume black = 0 and white = 1 in the cellState enum.
  2. Create a global variable called blackThreshold. This blackThreshold is normally 0, but at the beginning of a GC cycle, it will be set to 1, and it will be reset back to 0 at the end of the GC cycle.
  3. Now, check if obj->cellState > blackThreshold.

Then, if the check succeeds, we know we can immediately return: the only way this check can succeed is when the GC is not running and we are white (because blackThreshold = 0 and cellState = 1 is the only combination that passes the check). This way, the fast path executes only one check. If the check fails, we fall back to the slow path, which performs the full procedure: check if the GC is running, execute a fence if needed, then check again if cellState is black. In pseudo-code:

void WriteBarrier(JSCell* obj) {
    if (obj->cellState > g_blackThreshold) {
        // Fast-path: the only way to reach here is when
        // the GC is not running and the cellState is white
        return;
    }

    if (!IsGcRunning()) {
        // g_blackThreshold is 0, so our object is
        // actually black, we need to go to WriteBarrierSlowPath
        WriteBarrierSlowPath(obj);
    } else {
        // GC is running so we need to execute the fence
        // and check cellState again
        StoreLoadFence();
        if (obj->cellState == black) {
            WriteBarrierSlowPath(obj);
        }
    }
}

Note that there is no race between WriteBarrier and GC setting/clearing IsGcRunning() flag and changing the g_blackThreshold value, because the mutator is always stopped at a safe point (of course, halfway inside WriteBarrier is not a safe point) when the GC starts/finishes.

“Obstruction-Free Double Collect Snapshot”

The concurrent GC also introduced new complexities for the ForEachChild function used by the GC marking phase to scan all objects referenced by a certain object. Each JavaScript object has a Structure (a.k.a. hidden class) that describes how the content of this object shall be interpreted into object fields. Since the GC marking phase runs concurrently with the mutator, and the mutator may change the Structure of the object, and may even change the size of the object’s butterfly, the GC must be sure that despite the race conditions, it will never crash by dereferencing invalid pointers and never fail to scan a child. Using a lock is clearly infeasible for performance reasons. JSC uses a so-called obstruction-free double collect snapshot to solve this problem. Please refer to Filip Pizlo’s GC blog post to see how it works.

Some Minor Design Details and Optimizations

You might find this section helpful if you want to actually read and understand the code of JSC, but otherwise feel free to skip it: these details are not central to the design, and are not particularly interesting either. I mention them only to bridge the gap between the GC scheme explained in this post and the actual implementation in JSC.

As explained earlier, each CompleteSubspace owns a list of BlockDirectory to handle allocations of different sizes; each BlockDirectory has an active block m_currentBlock where it allocates from, and it achieves this by holding a free list of all available cells in the block. But how does it work exactly?

As it turns out, each BlockDirectory has a cursor, which is reset to point at the beginning of the block list at the end of an eden or full GC cycle. Until it is reset, it can only move forward. The BlockDirectory will move the cursor forward, until it finds a block containing available cells, and allocate from it. If the cursor reaches the end of the list, it will attempt to steal a 16KB block from another BlockDirectory and allocate from it. If that also fails, it will allocate a new 16KB block from malloc and allocate from it.

I also mentioned that a BlockDirectory uses a free list to allocate from the currently active block m_currentBlock. It’s important to note that in the actual implementation of JSC, the cells in m_currentBlock do not respect the rule for the isNew bit. Therefore, to check liveness, one either needs to do a special-case check to see if the cell is from m_currentBlock (for example, see HeapCell::isLive), or, for the GC[24], stop the mutator, destroy the free list (and populate isNew in the process), do whatever inspection, then rebuild the free list and resume the mutator. The latter is implemented by two functions named stopAllocating() and resumeAllocating(), which are automatically called whenever the world is stopped or resumed.

The motivation of allowing m_currentBlock to not respect the rule for isNew is (a tiny bit of) performance. Instead of manually setting isNew to true for every allocation, a block-level bit allocated (aggregated as a bitvector in BlockDirectory) is used to indicate if a block is full of live objects. When the free list becomes empty (i.e., the block is fully allocated), we simply set allocated to true for this block. When querying cell liveness, we check this bit first and directly return true if it is set. The allocated bitvector is cleared at the end of each GC cycle, and since the global logical version for isNew is also bumped, this effectively clears all the isNew bits, just as we desired.

JSC’s design also supports the so-called constraint solver, which allows specification of implicit reference edges (i.e., edges not represented as pointer in the object). This is mainly used to support JavaScript interaction with DOM. This part is not covered in this post.

Weak references have multiple implementations in JSC. The general (but less efficient) implementation is WeakImpl, denoting a weak reference edge. The data structure managing them is WeakSet, and you can see it in every block footer, and in every PreciseAllocation GC header. However, JSC also employs more efficient specialized implementations to handle the weak map feature in JavaScript. The details are not covered in this post.

In JSC, objects may also have destructors. There are three ways destructors are run. First, when we begin allocating from a block, destructors of the dead cells are run. Second, the IncrementalSweeper periodically scans the blocks and runs destructors. Finally, when the VM shuts down, the lastChanceToFinalize() function is called to ensure that all destructors are run at that time. The details of lastChanceToFinalize() are not covered in this post.

JSC employs a conservative approach for pointers on the stack and in registers: the GC uses UNIX signals to interrupt the mutator thread so that it can copy the thread’s stack contents and CPU register values and search them for data that looks like pointers. However, it’s important to note that the UNIX signal is not what suspends the mutator for collection: the mutator always actively suspends itself at a safe point. This is critical, as otherwise it could be paused at weird places, for example, in a HeapCell::isLive check after it has read isNew but before it has read isMarked, and then the GC did isNew |= isMarked, isMarked = false, and boom. So it seems like the only reason to interrupt the thread is for the GC to get the CPU register values, including the SP register value so the GC knows where the stack ends. It’s unclear to me if it’s possible to do so in a cooperative manner instead of using costly UNIX signals.

Acknowledgements

I thank Saam Barati from Apple’s JSC team for his enormous help on this blog post. Of course, any mistakes in this post are mine.


Footnotes


  1. A brief stop-the-world pause is still required at the start and end of each GC cycle, and may be intentionally performed if the mutator thread (i.e. the thread running JavaScript code) is producing garbage too fast for the GC thread to keep up with. ↩︎
  2. The actual allocation logic is implemented in LocalAllocator. Although in the code BlockDirectory holds a linked list of LocalAllocator (at the time of writing, for the codebase version linked in this blog), the linked list always contains exactly one element, so BlockDirectory and LocalAllocator are one-to-one and can be viewed as a single integrated component. This relationship might change in the future, but it doesn’t matter for the purposes of this post anyway. ↩︎
  3. Since the footer resides at the end of a 16KB block, and the block is also 16KB aligned, one can do a simple bit math from any object pointer to access the footer of the block it resides in. ↩︎
  4. Just as per-cell information is aggregated and stored in the block footer, per-block information is aggregated as bitvectors and stored in the BlockDirectory for fast lookup. Specifically, two bitvectors empty and canAllocateButNotEmpty track whether a block is empty or partially empty. The code is relatively confusing because the bitvectors are laid out in a non-standard way to make resizing easier, but conceptually it’s just one bitvector for each boolean per-block property. ↩︎
  5. While seemingly straightforward, it is not straightforward at all (as you can see in the code). The free cells are marked free by the GC, and due to concurrency and performance optimization the logic becomes very tricky: we will revisit this later. ↩︎
  6. In fact, it also attempts to steal blocks from other allocators, and the OS memory allocator may have some special requirements required for the VM, but we ignore those details for simplicity. ↩︎
  7. In the current implementation, the list of sizes (in bytes) is 16, 32, 48, 64, 80, then 80 * 1.4 ^ n for n >= 1 up to about 8KB. Exponential growth guarantees that the overhead due to internal fragmentation is at most a fraction (in this case, 40%) of the total allocation size. ↩︎
  8. An interesting implementation detail is that IsoSubspace and CompleteSubspace always return memory aligned to 16 bytes, but PreciseAllocation always returns a memory address that has remainder 8 modulo 16. This allows identifying whether an object is allocated by PreciseAllocation with simple bit math. ↩︎
  9. JSC has another small optimization here. Sometimes an IsoSubspace contains so few objects that it’s a waste to hold them in a 16KB memory page (the block size of BlockDirectory). So the first few memory pages of an IsoSubspace use the so-called “lower-tier”: smaller memory pages allocated by PreciseAllocation. In this post, we will ignore this design detail for simplicity. ↩︎
  10. Memory of an IsoSubspace is only used by this IsoSubspace, never stolen by other allocators. As a result, a memory address in IsoSubspace can only be reused to allocate objects of the same type. So for any type A allocated by IsoSubspace, even if there is a use-after-free bug on type A, it is impossible to allocate A, free it, allocate type B at the same address, and exploit the bug to trick the VM into interpreting an integer field in B controlled by attacker as a pointer field in A. ↩︎
  11. In some GC schemes, an eden object is required to survive two (instead of one) eden GCs to be considered in old space. The purpose of such a design is to make sure that any old space object is at least one eden-GC-gap old. In contrast, in JSC’s design, an object created immediately before an eden collection will be considered to be in old space immediately, and can then only be reclaimed via a full GC. The performance difference between the two designs is unclear to me. I conjecture JSC chose its current design because it’s easier to make concurrent. ↩︎
  12. There is one additional color, Grey, in the code. However, it turns out that White and Grey make no difference (you can verify this by grepping all uses of cellState and observing that the only comparison on cellState is checking whether it is Black). The comments explaining what the colors mean do not fully capture all the invariants. In my opinion JSC should really clean this up and update the comments, as they can easily cause confusion for readers who intend to understand the design. ↩︎
  13. The bit is actually called isNewlyAllocated in the code. We shorten it to isNew for convenience in this post. ↩︎
  14. A safe point is GC terminology. At a safe point, the heap and stack are in a coherent state understandable by the GC, so the GC can correctly trace out which objects are dead or live. ↩︎
  15. For PreciseAllocation, all allocated objects are chained into a linked list, so we can traverse all objects (live or dead) easily. This is not efficient: we will explain the optimizations for CompleteSubspace later. ↩︎
  16. Keep in mind that while this is true for now, as we add more optimizations to the design, this will no longer be true. ↩︎
  17. Note that we push the old space object into the queue, not the eden object, because this pointer could have been overwritten at the start of the GC cycle, making the eden object potentially collectable. ↩︎
  18. Also note that all objects dead before this GC cycle, i.e. the free cells of a block in CompleteSubspace, still have isNew = false and isMarked = false, as desired. ↩︎
  19. Recall that under the generational hypothesis, most objects die young. Therefore, it is completely plausible for all objects in an eden block to be found dead during an eden GC. ↩︎
  20. In JSC, the version is stored in a uint32_t and they have a bunch of logic to handle the case that it overflows uint32_t. In my humble opinion, this is an overoptimization that results in very hard-to-test edge cases, especially in a concurrent setting. So we will ignore this complexity: one can easily avoid these by spending 8 more bytes per block footer to have uint64_t version number instead. ↩︎
  21. Note that any number of eden GC cycles may have run between the last full GC cycle and the current full GC cycle, but eden GC does not bump mark version. So for any object born before the last GC cycle (no matter eden or full), the isMarked bit honestly reflects if it is live, and we will accept the bit as its mark version must be off-by-one. For objects born after the last GC cycle, it must have a latest isNew version, so we can know it’s alive through isNew. In both cases, the scheme correctly determines if an object is alive, just as desired. ↩︎
  22. And probably not: first, true sharing and false sharing between the GC and the mutator can cause slowdowns. Second, as we have covered before, JSC uses a Time-Space Scheduler to prevent the mutator from allocating too fast while the GC is running. Specifically, the mutator will be intentionally suspended for at least 30% of the duration. So as long as the GC is running, the mutator suffers a 30%-or-more “performance tax”. ↩︎
  23. The real story is a bit more complicated. JSC actually reuses the same VM for different JavaScript scripts. However, at any moment, at most one of those scripts can be running. So technically, there are multiple mutually exclusive mutator threads, but this doesn’t affect our GC story. ↩︎
  24. The GC needs to inspect a lot of cells, and its logic is already complex enough, so having one less special-case branch is probably beneficial for both engineering and performance. ↩︎
Meet Web Push https://webkit.org/blog/12945/meet-web-push/ Tue, 07 Jun 2022 15:00:55 +0000 https://webkit.org/?p=12945 a push notification on macOS

Websites have many reasons to notify their users of time-sensitive or high-priority events, even if the user does not currently have the site open. This feature is called Web Push, and is enabled by the W3C standards for Push API, Notifications API, and Service Workers, all working together. WebKit now supports the relevant parts of those standards to enable Web Push.

Apple has made changes to macOS that deeply integrate with WebKit’s support to provide a great user experience, and we’re excited to announce that Web Push is supported in Safari 16 on macOS Ventura.

Keep an eye out for Web Push on iOS and iPadOS in 2023.

As long as you’ve coded your web application to the standards, you will be able to reach Safari 16 users on macOS Ventura. You don’t need to join the Apple Developer Program to send Web Push notifications.

If you exclude Safari through browser detection, now would be a great time to switch to feature detection, which lets you take advantage of new features as soon as they’re supported. Additionally, if you tightly manage push endpoints on your server, be sure to allow URLs from any subdomain of push.apple.com.
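
A feature check might look like the following sketch; it only tests for the relevant APIs and is not an official snippet, so adapt it to your app’s needs.

function supportsWebPush() {
    return "serviceWorker" in navigator &&
           "PushManager" in window &&
           "Notification" in window;
}

if (supportsWebPush()) {
    // Show the UI that lets the user opt in to notifications.
}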

All of this and more is covered in Meet Web Push (15 minute video) at WWDC22.

Standards overview

Most features of the web platform are described in a single web standard. Web Push is an exception, with multiple standards describing implementation requirements.

There are many resources on the web to help web application authors get up and running with these standards. But to further cover how WebKit’s support works, it is useful to cover the web standards at a high level.

The Push API standard is the most directly relevant to start with. It describes the JavaScript interface that allows a website to register a push subscription. That subscription enables sending push messages to your user’s browser using a push service.
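
In practice, requesting a subscription looks roughly like this sketch. The service worker path, the /register-push endpoint, and VAPID_PUBLIC_KEY are placeholders for your own setup, and the call should happen from a user gesture (more on that below).

async function subscribeToPush() {
    const registration = await navigator.serviceWorker.register("/service-worker.js");
    const subscription = await registration.pushManager.subscribe({
        userVisibleOnly: true,                   // required: every push must show a notification
        applicationServerKey: VAPID_PUBLIC_KEY,  // placeholder: your server's public key
    });
    // Send the subscription to your server so it can push to this browser later.
    await fetch("/register-push", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(subscription),
    });
}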

The ServiceWorker API is extended to support these push messages. Once a push message is received from a domain, that domain’s registered service worker script receives an event representing the push message.

The Notifications API is extended to allow service worker scripts to post a notification even without an open browser tab.

When a web application registers a push subscription, it promises that pushes will always be user visible. When the service worker handles a push message, it is required to use the Notifications API to display a user-visible notification. Finally, when the user activates that notification, the service worker is sent an event representing the notification activation.
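
Inside the service worker, that flow might look like the following sketch. The JSON payload shape and the "/inbox" URL are assumptions for illustration only.

self.addEventListener("push", (event) => {
    const data = event.data ? event.data.json() : {};  // assumes the server sends a JSON payload
    event.waitUntil(
        self.registration.showNotification(data.title || "New message", {
            body: data.body,
        })
    );
});

self.addEventListener("notificationclick", (event) => {
    event.notification.close();
    event.waitUntil(clients.openWindow("/inbox"));     // assumed destination page
});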

Power and privacy

Both the WebKit open source project and Apple treat privacy as a fundamental human right. As with other privileged features of the web platform, requesting a push subscription requires an explicit user gesture. It also requires you set the userVisibleOnly flag to true, and fulfill that promise by always showing a notification in response to a push message.

The Web Push API is not an invitation for silent background runtime, as that would both violate a user’s trust and impact a user’s battery life.

Violations of the userVisibleOnly promise will result in a push subscription being revoked.

A little bit about WebKit

Some of you are interested in the implementation details of Web Push in WebKit.

One goal of the WebKit open source project is to make it easy to deliver a modern browser engine that integrates well with any modern platform.

Many web-facing features are implemented entirely within WebKit, and the maintainers of a given WebKit port do not have to do any additional work to add support on their platforms.

Occasionally features require relatively deep integration with a platform. That means a WebKit port needs to write a lot of custom code inside WebKit or integrate with platform specific libraries. For example, to support the HTML <audio> and <video> elements, Apple’s port leverages Apple’s Core Media framework, whereas the GTK port uses the GStreamer project.

A feature might also require deep enough customization on a per-application basis that WebKit can’t do the work itself.

For example web content might call window.alert(). In a general purpose web browser like Safari, the browser wants to control the presentation of the alert itself. But an e-book reader that displays web content might want to suppress alerts altogether.

From WebKit’s perspective, supporting Web Push requires deep per-platform and per-application customization.

Web Push in Apple’s WebKit port

Apple’s WebKit port includes a new daemon called webpushd. It is installed as a LaunchAgent in macOS Ventura to support Web Push. This daemon takes push subscription requests from webpages in Safari 16 and turns them into actual push subscriptions with the Apple Push Notification service.

Incoming pushes to the system are delivered to webpushd, which then wakes the appropriate application to hand off any pending push messages to a service worker.

The promise of Web Push is that you can reach your users even if they don’t have your website open in a browser tab. Because of how we integrated webpushd with built-in push support in macOS Ventura, Safari doesn’t even need to be running for a push message to be delivered.

The requirement to display user visible notifications is another platform specific point. Different browsers might implement Notifications API support in different ways. Safari has always supported local notifications by relying on the macOS Notification Center and has made additional changes to handle activating these notifications when Safari is not running.

Integrating Apple Push Notification service’s new Web Push support with webpushd and supporting notifications while Safari isn’t running are both system-level changes, making our implementation require macOS Ventura and later.

More resources

Apple has a few more resources to learn more about Web Push support in Safari 16 on macOS Ventura:

MDN has some great resources on Web Push. You should start out with Web Push API Notifications best practices.

And of course you can always reference the W3C standards directly:

Wide Gamut 2D Graphics using HTML Canvas https://webkit.org/blog/12058/wide-gamut-2d-graphics-using-html-canvas/ Tue, 14 Dec 2021 17:00:35 +0000 https://webkit.org/?p=12058

Most colors used on the Web today are sRGB colors. These are the colors that you specify with the familiar #rrggbb and rgb(r, g, b) CSS syntax, and whose individual color components are given as values in the range [0, 255]. For example, rgb(255, 0, 0) is the most saturated, pure red in the sRGB color space. But the range of colors in sRGB — its color gamut — does not encompass all colors that can be perceived by the human visual system, and there are displays that can produce a broader range of colors.

sRGB is based on the color capabilities of computer monitors that existed at the time of its standardization, in the late 1990s. Since then, other, wider gamut color spaces have been defined for use in digital content, and which cover more of the colors that humans can perceive. One such color space is Display P3, which contains colors with significantly higher saturation than sRGB.

Conic gradients showing fully saturated sRGB (left) and Display P3 (right) colors. Viewed in a browser and on a display supporting Display P3, the colors in the circle on the right will show as more intense than those on the left. (View as standalone page.)
For a more in depth introduction to color spaces, see Dean Jackson’s earlier post, Improving Color on the Web.

Today, there are many computer and mobile devices on the market with displays that can reproduce all the colors of the Display P3 gamut, and the Web platform has been evolving over the last few years to allow authors to make best use of these displays. WebKit has supported wide color images and video since 2016, and last year became the first browser engine to implement the new color syntax defined in CSS Color Module Level 4 where colors can be specified in a given color space (like color(display-p3 1 0 0), a fully saturated Display P3 red).

One notable omission in wide gamut color support, until now, has been in the HTML canvas element. The 2D canvas API was introduced before wide gamut displays were common, and until now has only handled drawing and manipulating sRGB pixel values. Earlier this year, a proposal for creating canvas contexts using other color spaces was added to the HTML standard, and we’ve recently added support for this to WebKit.

Drawing on a wide gamut canvas rendering context

The getContext method on a canvas element, which is used to create a rendering context object with 2D drawing APIs, accepts a new option to set the canvas backing store’s color space.

<canvas id="canvas" width="400" height="300"></canvas>
<script>
let canvas = document.getElementById("canvas");
let context = canvas.getContext("2d", { colorSpace: "display-p3" });
// ... draw on context ...
</script>

The default color space remains sRGB, rather than having the browser automatically use the wider color space, to avoid the performance overhead of color space conversions with existing content. The two explicit color spaces that can be requested are "srgb" and "display-p3".

Fill and stroke styles can be specified using any supported CSS color syntax.

let position = 0;
for (let green of [1, 0]) {
    for (let blue of [1, 0]) {
        for (let red of [1, 0]) {
            context.fillStyle = `color(display-p3 ${red} ${green} ${blue})`;
            context.fillRect(position, position, 40, 40);
            position += 20;
        }
    }
}
Display P3 colors used as fill styles on an sRGB (left) and Display P3 (right) canvas. Colors on the left are clamped to remain within the sRGB gamut. (View as standalone page.)

Any drawing that uses a color outside the color space of the canvas will be clamped so that it is in gamut. For example, filling a rectangle with color(display-p3 1 0 0) on an sRGB canvas will end up using a fully saturated sRGB red. Similarly, drawing on a Display P3 canvas with color(rec2020 0.9 0 0.9), an almost full magenta in the Rec.2020 color space, will result in pixels of approximately color(display-p3 1.0 0 0.923) being used, since that is the closest in the Display P3 color gamut.

const COLORS = ["#0f0", "color(display-p3 0 1 0)"];
for (let y = 20; y < 180; y += 20) {
    context.fillStyle = COLORS[(y / 20) % 2];
    context.fillRect(20, y, 160, 20);
}
Stripes of interleaved Display P3 and sRGB colors on an sRGB (left) and Display P3 (right) canvas. Because colors are clamped to remain within the gamut of the canvas, the two shades of green are indistinguishable on the sRGB canvas. (View as standalone page.)
On macOS, you can use the ColorSync Utility to convert color values between sRGB, Display P3, Rec.2020, and some other predefined color spaces.

Wide gamut colors are usable in all canvas drawing primitives:

  • as the fill and stroke of rectangles, paths, and text
  • in gradient stops
  • as a shadow color
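
For example, gradient stops and shadows can take Display P3 colors just like fills; here is a small sketch, assuming the Display P3 context created earlier.

const gradient = context.createLinearGradient(0, 0, 200, 0);
gradient.addColorStop(0, "color(display-p3 1 0 0)");
gradient.addColorStop(1, "color(display-p3 0 0 1)");
context.fillStyle = gradient;
context.fillRect(0, 0, 200, 40);

context.shadowColor = "color(display-p3 0 1 0)";
context.shadowBlur = 8;
context.fillStyle = "color(display-p3 1 0 1)";
context.fillText("Wide gamut text", 10, 80);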

Pixel manipulation in sRGB and Display P3

getImageData and putImageData can be used to get and set pixel values on a wide gamut canvas. By default, getImageData will return an ImageData object with pixel values in the color space of the canvas, but it is possible to specify an explicit color space that does not match the canvas, and a conversion will be performed.

let context = canvas.getContext("2d", { colorSpace: "display-p3" });
context.fillStyle = "color(display-p3 0.5 0 0)";
context.fillRect(0, 0, 100, 100);

let imageData;

// Get ImageData in the canvas color space (Display P3).
imageData = context.getImageData(0, 0, 1, 1);
console.log(imageData.colorSpace);  // "display-p3"
console.log([...imageData.data]);   // [128, 0, 0, 255]

// Get ImageData in Display P3 explicitly.
imageData = context.getImageData(0, 0, 1, 1, { colorSpace: "display-p3" });
console.log(imageData.colorSpace);  // "display-p3"
console.log([...imageData.data]);   // [128, 0, 0, 255]

// Get ImageData converted to sRGB.
imageData = context.getImageData(0, 0, 1, 1, { colorSpace: "srgb" });
console.log(imageData.colorSpace);  // "srgb"
console.log([...imageData.data]);   // [141, 0, 0, 255]

The ImageData constructor similarly takes an optional options object with a colorSpace key.

let context = canvas.getContext("2d", { colorSpace: "display-p3" });

// Create and fill an ImageData with full Display P3 yellow.
let imageData = new ImageData(10, 10, { colorSpace: "display-p3" });
for (let i = 0; i < 10 * 10 * 4; ++i)
    imageData.data[i] = [255, 255, 0, 255][i % 4];

context.putImageData(imageData, 0, 0);

As when drawing shapes using colors of a different color space, any mismatch between the ImageData and the target canvas color space will cause putImageData to perform a conversion and potentially clamp the resulting pixels.

Serializing canvas content

The toDataURL and toBlob methods on a canvas DOM element produce a raster image with the canvas contents. In WebKit, these methods now embed an appropriate color profile in the generated PNG or JPEG when called on a Display P3 canvas, ensuring that the full range of color is preserved.
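
For example, assuming canvas is the Display P3 canvas from earlier, both export paths preserve the wide gamut colors.

// PNG with an embedded Display P3 color profile.
canvas.toBlob((blob) => {
    const url = URL.createObjectURL(blob);
    // e.g. offer `url` for download or assign it to an <img> element
}, "image/png");

// toDataURL works the same way; the quality argument only applies to JPEG.
const dataURL = canvas.toDataURL("image/jpeg", 0.9);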

Drawing wide gamut images

Like putImageData, the drawImage method will perform any color space conversion needed when drawing an image whose color space differs from that of the canvas. Any color profile used by a raster image referenced by an img, and any color space information in a video referenced by a video (be it a video file or a WebRTC stream), will be honored when drawn to a canvas. This ensures that when drawing into a canvas whose color space matches the display’s (be that Display P3 or sRGB), the source image/video and the canvas pixels will look the same.
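
A minimal sketch of that conversion, assuming an img element referencing a wide gamut JPEG is already in the page:

const image = document.querySelector("img");
const canvas = document.querySelector("canvas");
const context = canvas.getContext("2d", { colorSpace: "display-p3" });

image.decode().then(() => {
    // drawImage converts from the image's color space to Display P3 as needed.
    context.drawImage(image, 0, 0, canvas.width, canvas.height);
});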

Here is an interactive demonstration of using canvas to make a sliding tile puzzle. The tiles are drawn by applying a clip path and calling drawImage pointing to the img element on the left, which references a wide gamut JPEG. Toggling the checkbox shows how the colors are muted when an sRGB canvas is used.

Sliding tile puzzle. Toggling the checkbox will change whether an sRGB or a Display P3 canvas is used. (View as standalone page.)

Web Inspector support

Web Inspector also now shows color space information for canvases to help ensure your canvases’ backing stores are in the expected color space.

In the Graphics tab, the Canvases Overview will display the color space for each canvas next to the context type (e.g. 2D) on each canvas overview tile.

After clicking on a Canvas overview tile to inspect it, the color space is shown in the Details Sidebar in the Attributes section.

Browser support

Wide gamut canvas is supported in the macOS and iOS ports of WebKit as of r283541, and is available in Safari on:

  • macOS Monterey 12.1 and above
  • iOS 15.1 and above

Safari is the first browser to support drawing shapes, text, gradients, and shadows with wide gamut CSS colors on Display P3 canvases. All other features, including getImageData, putImageData, and drawImage on Display P3 canvases, are supported in Safari and in Chrome 94 and above.

Feature detection

There are a few techniques you can use to detect whether wide gamut display and canvas support is available.

Display support: To check whether the display supports Display P3 colors, use the color-gamut media query.

function displaySupportsP3Color() {
    return matchMedia("(color-gamut: p3)").matches;
}

Canvas color space support: To check whether the browser supports wide gamut canvases, try creating one and checking the resulting color space.

function canvasSupportsDisplayP3() {
    let canvas = document.createElement("canvas");
    try {
        // Safari throws a TypeError if the colorSpace option is supported, but
        // the system requirements (minimum macOS or iOS version) for Display P3
        // support are not met.
        let context = canvas.getContext("2d", { colorSpace: "display-p3" });
        return context.getContextAttributes().colorSpace == "display-p3";
    } catch {
    }
    return false;
}

CSS Color Module Level 4 syntax support: To check whether the browser supports specifying wide gamut colors on canvas, try setting one and checking it wasn’t ignored.

function canvasSupportsWideGamutCSSColors() {
    let context = document.createElement("canvas").getContext("2d");
    let initialFillStyle = context.fillStyle;
    context.fillStyle = "color(display-p3 0 1 0)";
    return context.fillStyle != initialFillStyle;
}

Future work

There are a few areas where wide gamut canvas support could be improved.

  • 2D canvas still exposes image data as 8 bit RGBA values through ImageData objects. It may be useful to support other pixel formats for a greater color depth, such as 16 bit integers, or single precision or half precision floating point values, especially when wider color gamuts are used, since increased precision can help avoid banding artifacts. This has been proposed in an HTML Standard issue.
  • The two predefined color spaces that are supported are sRGB and Display P3, but as High Dynamic Range videos and displays that support HDR become more common, it’s worth considering allowing 2D canvas to use these and other color spaces too. See this presentation at the W3C Workshop on Wide Color Gamut and High Dynamic Range for the Web from earlier this year, which talks about proposed new color space and HDR support.
  • Canvas can be used with context types other than 2D, such as WebGL and WebGPU. A proposal for wide gamut and HDR support in these contexts was presented at that same workshop.

In summary

WebKit now has support for creating 2D canvas contexts using the Display P3 color space, allowing authors to make best use of the displays that are becoming increasingly common. This feature is enabled in Safari on macOS Monterey 12.1 and iOS 15.1.

If you have any comments or questions about the feature, please feel free to send me a message at @heycam, and more general comments can be sent to the @webkit Twitter account.

Further reading

Optimizing JavaScript Standard Library Functions in JSC https://webkit.org/blog/11934/optimizing-javascript-standard-library-functions-in-jsc/ Wed, 28 Jul 2021 18:13:21 +0000 https://webkit.org/?p=11934 After three years working on JavaScriptCore (JSC), I recently had the opportunity to work on optimizing one of our standard library functions for the first time. I thought it’d be interesting to share what I learned about how they work in JSC and how we make them faster.

How are standard library functions implemented in JSC?

The JavaScript standard library functions include all the prototype functions for Array, Object, etc, and in this post we’ll look at Function.prototype.toString.

In JSC, standard library functions can either be implemented in native code (C++) or in JavaScript. When using JavaScript, we can use a special syntax for operations that are not allowed for regular JavaScript programs. Standard library functions written in JavaScript can be called like any other JavaScript function, and will also run through the same optimizations in each of our tiers, including inlining. Meanwhile, standard library functions written in C++ require the VM to make calls to native code, which is more expensive, and while the implementation will be optimized by the C++ compiler, its internals are completely opaque to the caller JS function, so there can be no inlining. The reason I mention inlining is that it allows the caller to “see” inside the function body, which unlocks many further optimizations to be done beyond the function boundary, some of which we’ll see in this example.

Making standard library functions faster

Let’s start with a reduced test case to see how can we optimize Function.prototype.toString:

function f() { /* ... some code here ... */ }

function g() {
  return f.toString();
}

When we call g, JSC will start by executing it in the interpreter (called LLInt), and if we continue calling it enough times it will eventually be promoted through the 3 compilers in JSC: the non-optimizing Baseline compiler and our optimizing DFG and FTL compilers. You can read more about our execution tiers on this post from back when the FTL was first introduced, but since then the FTL has gotten a new backend.

The first step before we run any code in any tier is to convert it from source code to bytecode (again, you can read a lot more about our bytecode here). After that, our DFG and FTL compilers also have their own representation of the code, that we call an Intermediate Representation or IR for short. For this post I’ll take the liberty of using some pseudocode that is somewhere in between our initial bytecode and DFG IR. This way we can see some of the details that are only available in DFG but without exposing too much low level complexity that is not relevant for our example. Here’s what our pretend IR could look like for g:

function g():
    f = Lookup("f") // look for the variable "f" in the lexical scope
    toString = Get(f, "toString") // access the "toString" property of `f`,
                                  // respecting the semantics of JavaScript property
                                  // lookup, including looking up the prototype, etc.
    result = Call(toString, f) // call the toString function with `f` as `this`
    Return(result) // return our result to the caller

Caching

The first optimization implemented wasn’t specific to standard library functions at all: the result of calling toString on a function never changes, so we can cache it.

Our implementation of Function.prototype.toString is written in C++ and has to handle a few special cases. One such case is when we are calling toString on a native function implemented in C++, but for the common case where it’s a regular JavaScript function, we have to look at the source code for the function. Since the source can’t be changed while it’s being executed, and the name of the function also cannot be changed, this result can be cached. This is completely transparent to our IR, which means we don’t have to change anything in the compiler and we already get a great speedup.

Speculation

When we start trying to optimize any JavaScript program we quickly face several challenges, since pretty much everything in JavaScript can be mutated at runtime, including our standard library functions. One way that modern JavaScript VMs get around these challenges is by speculating that certain facts about the program will not change during its execution, but since we can’t ever be sure of that, we need a way to fall back if our assumptions become invalid. We have an amazing post that goes deep into how we speculate in JSC, but for our current purposes all we need to know is that we can compile a function and say the code we generated is only valid so long as some assumptions are valid. This also isn’t specific to standard library functions, but it’ll be important later to see how all these optimizations interact with each other yielding more than the sum of their parts.

In this case we can speculate that f will never change; for our example, let’s assume that it’s always called with an object allocated at address 0x123456. This is done through a mechanism called watchpoints, which is beyond the scope of this post, but the TL;DR is that this code will immediately become invalid if anyone assigns to f, which would produce a different result. Using other watchpoints we can also speculate that f.toString will always resolve to Function.prototype.toString, which already leads to much better code:

function g():
    f = 0x123456 // speculated value of Lookup("f")
    toString = Function.prototype.toString // speculated value of Get(f, "toString")
    result = Call(toString, f)
    Return(result)

Intrinsics

Next, JSC has the concept of intrinsics. Intrinsics in JSC are when the compiler utilizes the knowledge that it’s calling a well-known standard library function to emit optimized code inline. This is even more powerful than regular function inlining, since it can emit only the fast path inline and it works even if the standard library function was written in C++. In this case, the intrinsic tells us we are calling Function.prototype.toString and we can emit a new instruction, FunctionToString, instead of the generic function call. Our new pseudo instruction will try to load the cached value we computed in our very first optimization, but for the slow case, where we don’t have a cached value yet, it will still call the C++ implementation. Computing the toString for the first time requires some really complex code that we don’t want to emit inline for every call to toString. Our code with our new instruction might look like this:

function g():
    f = 0x123456
    result = FunctionToString(f)
    Return(result)

And diving a little deeper, the implementation of our new FunctionToString instruction might look like the following:

function g():
    f = 0x123456
    result = Load(f, "cachedToString") // Load a specific property of the object,
                                       // much faster since it's just a memory
                                       // access, not a JS property access like Get
    if (!result) {
        // For the slow case where we don't have a cached value we just fallback
        // to calling the C++ code
        toString = Function.prototype.toString
        result = Call(toString, f)
    }
    Return(result)

Abstract Interpretation

The next step in speeding up our example is using our Abstract Interpreter (AI). The idea behind the AI is that it understands what our IR instructions will do to its operands at runtime, and if all of its operands are known (i.e. we proved they will always have the same value), we can try to compute the result of our instruction at compile time. In this case, since we already speculated that f will be 0x123456, when we’re compiling our FunctionToString instruction we can try to load the cachedToString property from the object at 0x123456. If that succeeds our generated code will look more like this:

function g():
    result = "function f() { /* … some code here .. */ }"
    Return(result)

That’s much better!

Recap

We can speed up the f.toString() call in our example by:

  • Caching the result
  • Speculating f will always be the same object
  • Speculating that f.toString will always resolve to Function.prototype.toString
  • Adding an Intrinsic and a new instruction, FunctionToString, that loads the cached value directly whenever it’s available
  • Teaching our abstract interpreter that if we already know which function we are stringifying, and its toString value has already been computed, we can just use the cached value as a constant.

If you want to dive deeper into the actual implementation you can find the commit here and if you have any questions feel free to reach out to me on Twitter.

Speculation in JavaScriptCore https://webkit.org/blog/10308/speculation-in-javascriptcore/ Wed, 29 Jul 2020 17:00:40 +0000 https://webkit.org/?p=10308 This post is all about speculative compilation, or just speculation for short, in the context of the JavaScriptCore virtual machine. Speculative compilation is ideal for making dynamic languages, or any language with enough dynamic features, run faster. In this post, we will look at speculation for JavaScript. Historically, this technique or closely related variants have been applied successfully to Smalltalk, Self, Java, .NET, Python, and Ruby, among others. Starting in the 90’s, intense benchmark-driven competition between many Java implementations helped to create an understanding of how to build speculative compilers for languages with small amounts of dynamism. Despite being a lot more dynamic than Java, the JavaScript performance war that started in the naughts has generally favored increasingly aggressive applications of the same speculative compilation tricks that worked great for Java. It seems like speculation can be applied to any language implementation that uses runtime checks that are hard to reason about statically.

This is a long post that tries to demystify a complex topic. It’s based on a two hour compiler lecture (slides also available in PDF). We assume some familiarity with compiler concepts like intermediate representations (especially Static Single Assignment Form, or SSA for short), static analysis, and code generation. The intended audience is anyone wanting to understand JavaScriptCore better, or anyone thinking about using these techniques to speed up their own language implementation. Most of the concepts described in this post are not specific to JavaScript and this post doesn’t assume prior knowledge about JavaScriptCore.

Before going into the details of speculation, we’ll provide an overview of speculation and an overview of JavaScriptCore. This will help provide context for the main part of this post, which describes speculation by breaking it down into five parts: bytecode (the common IR), control, profiling, compilation, and OSR (on stack replacement). We conclude with a small review of related work.

Overview of Speculation

The intuition behind speculation is to leverage traditional compiler technology to make dynamic languages as fast as possible. Construction of high-performance compilers is a well-understood art, so we want to reuse as much of that as we can. But we cannot do this directly for a language like JavaScript because the lack of type information means that the compiler can’t do meaningful optimizations for any of the fundamental operations (even things like + or ==). Speculative compilers use profiling to infer types dynamically. The generated code uses dynamic type checks to validate the profiled types. If the program uses a type that is different from what we profiled, we throw out the optimized code and try again. This lets the optimizing compiler work with a statically typed representation of the dynamically typed program.

Types are a major theme of this post even though the techniques we are describing are for implementing dynamically typed languages. When languages include static types, it can be to provide safety properties for the programmer or to help give an optimizing compiler leverage. We are only interested in types for performance and the speculation strategy in JavaScriptCore can be thought of in broad strokes as inferring the kinds of types that a C program would have, but using an internal type system purpose-built for our optimizing compiler. More generally, the techniques described in this post can be used to enable any kind of profile-guided optimizations, including ones that aren’t related to types. But both this post and JavaScriptCore focus on the kind of profiling and speculation that is most natural to think of as being about types (whether a variable is an integer, what object shapes a pointer points to, whether an operation has effects, etc.).

To dive into this a bit deeper, we first consider the impact of types. Then we look at how speculation gives us types.

Impact of Types

We want to give dynamically typed languages the kind of optimizing compiler pipeline that would usually be found in ahead-of-time compilers for high-performance statically typed languages like C. The input to such an optimizer is typically some kind of internal representation (IR) that is precise about the type of each operation, or at least a representation from which the type of each operation can be inferred.

To understand the impact of types and how speculative compilers deal with them, consider this C function:

int foo(int a, int b)
{
    return a + b;
}

In C, types like int are used to describe variables, arguments, return values, etc. Before the optimizing compiler has a chance to take a crack at the above function, a type checker fills in the blanks so that the + operation will be represented using an IR instruction that knows that it is adding 32-bit signed integers (i.e. ints). This knowledge is essential:

  • Type information tells the compiler’s code generator how to emit code for this instruction. We know to use integer addition instructions (not double addition or something else) because of the int type.
  • Type information tells the optimizer how to allocate registers for the inputs and outputs. Integers mean using general purpose registers. Floating point means using floating point registers.
  • Type information tells the optimizer what optimizations are possible for this instruction. Knowing exactly what it does allows us to know what other operations can be used in place of it, allows us to do some algebraic reasoning about the math the program is doing, and allows us to fold the instruction to a constant if the inputs are constants. If there are types for which + has effects (like in C++), then the fact that this is an integer + means that it’s pure. Lots of compiler optimizations that work for + would not work if it wasn’t pure.

Now consider the same program in JavaScript:

function foo(a, b)
{
    return a + b;
}

We no longer have the luxury of types. The program doesn’t tell us the types of a or b. There is no way that a type checker can label the + operation as being anything specific. It can do a bunch of different things based on the runtime types of a and b:

  • It might be a 32-bit integer addition.
  • It might be a double addition.
  • It might be a string concatenation.
  • It might be a loop with method calls. Those methods can be user-defined and may perform arbitrary effects. This’ll happen if a or b are objects.
Figure 1. The best that a nonspeculative compiler can do if given a JavaScript plus operation. This figure depicts a control flow graph as a compiler like JavaScriptCore’s DFG might see it. The Branch operation is like an if and has outgoing edges for the then/else outcomes of the condition.
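
To make these cases concrete, here is a quick sketch in plain JavaScript showing the same + doing four different things depending on the runtime types of its operands:

1 + 2;                              // 3: int32 addition
0.1 + 0.2;                          // 0.30000000000000004: double addition
"1" + "2";                          // "12": string concatenation
({ valueOf() { return 5; } }) + 2;  // 7: calls a user-defined method with arbitrary effects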

Based on this, it’s not possible for an optimizer to know what to do. Instruction selection means emitting either a function call for the whole thing or an expensive control flow subgraph to handle all of the various cases (Figure 1). We won’t know which register file is best for the inputs or results; we’re likely to go with general purpose registers and then do additional move instructions to get the data into floating point registers in case we have to do a double addition. It’s not possible to know if one addition produces the same results as another, since they have loops with effectful method calls. Anytime a + happens we have to allow for the possibility that the whole heap might have been mutated.

In short, it’s not practical to use optimizing compilers for JavaScript unless we can somehow provide types for all of the values and operations. For those types to be useful, they need to help us avoid basic operations like + seeming like they require control flow or effects. They also need to help us understand which instructions or register files to use. Speculative compilers get speed-ups by applying this kind of reasoning to all of the dynamic operations in a language — ranging from those represented as fundamental operations (like + or memory accesses like o.f and o[i]) to those that involve intrinsics or recognizable code patterns (like calling Function.prototype.apply).

Speculated Types

This post focuses on those speculations where the collected information can be most naturally understood as type information, like whether or not a variable is an integer and what properties a pointed-to object has (and in what order). Let’s appreciate two aspects of this more deeply: when and how the profiling and optimization happen and what it means to speculate on type.

Figure 2. Optimizing compilers for C and JavaScript.

Let’s consider what we mean by speculative compilation for JavaScript. JavaScript implementations pretend to be interpreters; they accept JS source as input. But internally, these implementations use a combination of interpreters and compilers. Initially, code starts out running in an execution engine that does no speculative type-based optimizations but collects profiling about types. This is usually an interpreter, but not always. Once a function has a satisfactory amount of profiling, the engine will start an optimizing compiler for that function. The optimizing compiler is based on the same fundamentals as the one found in a C compiler, but instead of accepting types from a type checker and running as a command-line tool, here it accepts types from a profiler and runs in a thread in the same process as the program it’s compiling. Once that compiler finishes emitting optimized machine code, we switch execution of that function from the profiling tier to the optimized tier. Running JavaScript code has no way of observing this happening to itself except if it measures execution time. (However, the environment we use for testing JavaScriptCore includes many hooks for introspecting what has been compiled.) Figure 2 illustrates how and when profiling and optimization happens when running JavaScript.

Roughly, speculative compilation means that our example function will be transformed to look something like this:

function foo(a, b)
{
    speculate(isInt32(a));
    speculate(isInt32(b));
    return a + b;
}

The tricky thing is what exactly it means to speculate. One simple option is what we call diamond speculation. This means that every time that we perform an operation, we have a fast path specialized for what the profiler told us and a slow path to handle the generic case:

if (is int)
    int add
else
    Call(slow path)

To see how that plays out, let’s consider a slightly different example:

var tmp1 = x + 42;
... // things
var tmp2 = x + 100;

Here, we use x twice, both times adding it to a known integer. Let’s say that the profiler tells us that x is an integer but that we have no way of proving this statically. Let’s also say that x’s value does not change between the two uses and we have proved that statically.

Figure 3. Diamond speculation that x is an integer.

Figure 3 shows what happens if we speculate on the fact that x is an integer using a diamond speculation: we get a fast path that does the integer addition and a slow path that bails out to a helper function. Speculations like this can produce modest speed-ups at modest cost. The cost is modest because if the speculation is wrong, only the operations on x pay the price. The trouble with this approach is that repeated uses of x must recheck whether it is an integer. The rechecking is necessary because of the control flow merge that happens at the things block and again at more things.
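
In the same pseudocode style as before, the diamond-speculated version of that snippet looks roughly like this; note that x must be rechecked at the second use:

if (isInt32(x))
    tmp1 = Int32Add(x, 42)
else
    tmp1 = Call(slow path)
... // things
if (isInt32(x))           // rechecked because of the control flow merge above
    tmp2 = Int32Add(x, 100)
else
    tmp2 = Call(slow path)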

The original solution to this problem was splitting, where the region of the program between things and more things would get duplicated to avoid the branch. An extreme version of this is tracing, where the entire remainder of a function is duplicated after any branch. The trouble with these techniques is that duplicating code is expensive. We want to minimize the number of times that the same piece of code is compiled so that we can compile a lot of code quickly. The closest thing to splitting that JavaScriptCore does is tail duplication, which optimizes diamond speculations by duplicating the code between them if that code is tiny.

A better alternative to diamond speculations or splitting is OSR (on stack replacement). When using OSR, a failing type check exits out of the optimized function back to the equivalent point in the unoptimized code (i.e. the profiling tier’s version of the function).

Figure 4. OSR speculation that x is an integer.

Figure 4 shows what happens when we speculate that x is an integer using OSR. Because there is no control flow merge between the case where x is an int and the case where it isn’t, the second check becomes redundant and can be eliminated. The lack of a merge means that the only way to reach the second check is if the first check passed.
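
Sketched the same way, the OSR-speculated version has a single check; a failure exits to the profiling tier instead of merging back in:

speculate(isInt32(x))     // on failure: OSR exit back to the profiling tier
tmp1 = Int32Add(x, 42)
... // things
tmp2 = Int32Add(x, 100)   // no recheck: the only way to get here is through the check above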

OSR speculations are what gives our traditional optimizing compiler its static types. After any OSR-based type check, the compiler can assume that the property that was checked is now fact. Moreover, because OSR check failure does not affect semantics (we exit to the same point in the same code, just with fewer optimizations), we can hoist those checks as high as we want and infer that a variable always has some type simply by guarding all assignments to it with the corresponding type check.

Note that what we call OSR exit in this post and in JavaScriptCore is usually called deoptimization elsewhere. We prefer to use the term OSR exit in our codebase because it emphasizes that the point is to exit an optimized function using an exotic technique (OSR). The term deoptimization makes it seem like we are undoing optimization, which is only true in the narrow sense that a particular execution jumps from optimized code to unoptimized code. For this post we will follow the JavaScriptCore jargon.

JavaScriptCore uses OSR or diamond speculations depending on our confidence that the speculation will be right. OSR speculation has higher benefit and higher cost: the benefit is higher because repeated checks can be eliminated but the cost is also higher because OSR is more expensive than calling a helper function. However, the cost is only paid if the exit actually happens. The benefits of OSR speculation are so superior that we focus on that as our main speculation strategy, with diamond speculation being the fallback if our profiling indicates lack of confidence in the speculation.

Figure 5. Speculating with OSR and exiting to bytecode.

OSR-based speculation relies on the fact that traditional compilers are already good at reasoning about side exits. Trapping instructions (like for null check optimization in Java virtual machines), exceptions, and multiple return statements are all examples of how compilers already support exiting from a function.

Assuming that we use bytecode as the common language shared between the unoptimizing profiled tier of execution and the optimizing tier, the exit destinations can just be bytecode instruction boundaries. Figure 5 shows how this might work. The machine code generated by the optimizing compiler contains speculation checks against unlikely conditions. The idea is to do lots of speculations. For example, the prologue (the enter instruction in the figure) may speculate about the types of the arguments — that’s one speculation per argument. An add instruction may speculate about the types of its inputs and about the result not overflowing. Our type profiling may tell us that some variable tends to always have some type, so a mov instruction whose source is not proved to have that type may speculate that the value has that type at runtime. Accessing an array element (what we call get_by_val) may speculate that the array is really an array, that the index is an integer, that the index is in bounds, and that the value at the index is not a hole (in JavaScript, loading from a never assigned array element means walking the array’s prototype chain to see if the element can be found there — something we avoid doing most of the time by speculating that we don’t have to). Calling a function may speculate that the callee is the one we expected or at least that it has the appropriate type (that it’s something we can call).
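
As a sketch of what this looks like in practice (the helper names here are illustrative, not JSC’s real IR), the optimizing tier might emit checks like these for a handful of bytecodes:

speculate(isInt32(arg1))               // enter: one speculation per argument
speculate(isInt32(arg2))
sum = Int32Add(arg1, arg2)             // add: also speculates that the sum does not overflow
speculate(isArray(o))                  // get_by_val: the base is really an array...
speculate(isInt32(i))                  // ...the index is an integer...
speculate(isInBounds(o, i))            // ...the index is in bounds...
element = LoadArrayElement(o, i)       // ...and the loaded element is not a hole
speculate(callee === expectedCallee)   // call: the callee is the function we profiled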

While exiting out of a function is straightforward without breaking fundamental assumptions in optimizing compilers, entering turns out to be super hard. Entering into a function somewhere other than at its primary entrypoint pessimises optimizations at any merge points between entrypoints. If we allowed entering at every bytecode instruction boundary, this would negate the benefits of OSR exit by forcing every instruction boundary to make worst-case assumptions about type. Even allowing OSR entry just at loop headers would break lots of loop optimizations. This means that it’s generally not possible to reenter optimized execution after exiting. We only support entry in cases where the reward is high, like when our profiler tells us that a loop has not yet terminated at the time of compilation. Put simply, the fact that traditional compilers are designed for single-entry multiple-exit procedures means that OSR entry is hard but OSR exit is easy.

JavaScriptCore and most speculative compilers support OSR entry at hot loops, but since it’s not an essential feature for most applications, we’ll leave understanding how we do it as an exercise for the reader.

Figure 6. Speculation broken into the five topics of this post.

The main part of this post describes speculation in terms of its five components (Figure 6): the bytecode, or common IR, of the virtual machine that allows for a shared understanding about the meaning of profiling and exit sites between the unoptimized profiling tier and the optimizing tier; the unoptimized profiling tier that is used to execute functions at start-up, collect profiling about them, and to serve as an exit destination; the control system for deciding when to invoke the optimizing compiler; the optimizing tier that combines a traditional optimizing compiler with enhancements to support speculation based on profiling; and the OSR exit technology that allows the optimizing compiler to use the profiling tier as an exit destination when speculation checks fail.

Overview of JavaScriptCore

Figure 7. The tiers of JavaScriptCore.

JavaScriptCore embraces the idea of tiering and has four tiers for JavaScript (and three tiers for WebAssembly, but that’s outside the scope of this post). Tiering has two benefits: the primary benefit, described in the previous section, of enabling speculation; and a secondary benefit of allowing us to fine-tune the throughput-latency tradeoff on a per-function basis. Some functions run for so short a time — like straight-line run-once initialization code — that running any compiler on those functions would be more expensive than interpreting them. Some functions get invoked so frequently, or have such long loops, that their total execution time far exceeds the time to compile them with an aggressive optimizing compiler. But there are also lots of functions in the grey area in between: they don’t run long enough to make an aggressive compiler profitable, but they run long enough that some intermediate compiler designs can provide speed-ups. JavaScriptCore has four tiers as shown in Figure 7:

  • The LLInt, or low-level interpreter, which is an interpreter that obeys JIT compiler ABI. It runs on the same stack as the JITs and uses a known set of registers and stack locations for its internal state.
  • The Baseline JIT, also known as a bytecode template JIT, which emits a template of machine code for each bytecode instruction without trying to reason about relationships between multiple instructions in the function. It compiles whole functions, which makes it a method JIT. Baseline does no OSR speculations but does have a handful of diamond speculations based on profiling from the LLInt.
  • The DFG JIT, or data flow graph JIT, which does OSR speculation based on profiling from the LLInt, Baseline, and in some rare cases even using profiling data collected by the DFG JIT and FTL JIT. It may OSR exit to either baseline or LLInt. The DFG has a compiler IR called DFG IR, which allows for sophisticated reasoning about speculation. The DFG avoids doing expensive optimizations and makes many compromises to enable fast code generation.
  • The FTL JIT, or faster than light JIT, which does comprehensive compiler optimizations. It’s designed for peak throughput. The FTL never compromises on throughput to improve compile times. This JIT reuses most of the DFG JIT’s optimizations and adds lots more. The FTL JIT uses multiple IRs (DFG IR, DFG SSA IR, B3 IR, and Assembly IR).

An ideal example of this in action is this program:

"use strict";

let result = 0;
for (let i = 0; i < 10000000; ++i) {
    let o = {f: i};
    result += o.f;
}

print(result);

Thanks to the object allocation inside the loop, it will run for a long time until the FTL JIT can compile it. The FTL JIT will kill that allocation, so then the loop finishes quickly. The long running time before optimization virtually guarantees that the FTL JIT will take a stab at this program’s global function. Additionally, because the function is clean and simple, all of our speculations are right and there are no OSR exits.

Figure 8. Example timeline of a simple long loop executing in JavaScriptCore. Execution times recorded on my computer one day.

Figure 8 shows the timeline of this benchmark executing in JavaScriptCore. The program starts executing in the LLInt. After about a thousand loop iterations, the loop trigger causes us to start a baseline compiler thread for this code. Once that finishes, we do an OSR entry into the baseline JITed code at the for loop’s header. The baseline JIT also counts loop iterations, and after about a thousand more, we spawn the DFG compiler. The process repeats until we are in the FTL. When I measured this, I found that the DFG compiler needs about 4× the time of the baseline compiler, and the FTL needs about 6× the time of the DFG. While this example is contrived and ideal, the basic idea holds for any JavaScript program that runs long enough since all tiers of JavaScriptCore support the full JavaScript language.

Figure 9. JavaScriptCore tier architecture.

JavaScriptCore is architected so that having many tiers is practical. Figure 9 illustrates this architecture. All tiers share the same bytecode as input. That bytecode is generated by a compiler pipeline that desugars many language features, such as generators and classes, among others. In many cases, it’s possible to add new language features just by modifying the bytecode generation frontend. Once linked, the bytecode can be understood by any of the tiers. The bytecode can be interpreted by the LLInt directly or compiled with the baseline JIT, which mostly just converts each bytecode instruction into a preset template of machine code. The LLInt and Baseline JIT share a lot of code, mostly in the slow paths of bytecode instruction execution. The DFG JIT converts bytecode to its own IR, the DFG IR, and optimizes it before emitting code. In many cases, operations that the DFG chooses not to speculate on are emitted using the same code generation helpers as the Baseline JIT. Even operations that the DFG does speculate on often share slow paths with the Baseline JIT. The FTL JIT reuses the DFG’s compiler pipeline and adds new optimizations to it, including multiple new IRs that have their own optimization pipelines. Despite being more sophisticated than the DFG or Baseline, the FTL JIT shares slow path implementations with those JITs and in some cases even shares code generation for operations that we choose not to speculate on. Even though the various tiers try to share code whenever possible, they aren’t required to. Take the get_by_val (access an array element) instruction in bytecode. This has duplicate definitions in the bytecode liveness analysis (which knows the liveness rules for get_by_val), the LLInt (which has a very large implementation that switches on a bunch of the common array types and has good code for all of them), the Baseline (which uses a polymorphic inline cache), and the DFG bytecode parser. The DFG bytecode parser converts get_by_val to the DFG IR GetByVal operation, which has separate definitions in the DFG and FTL backends as well as in a bunch of phases that know how to optimize and model GetByVal. The only thing that keeps those implementations in agreement is good convention and extensive testing.

To give a feeling for the relative throughput of the various tiers, I’ll share some informal performance data that I’ve gathered over the years out of curiosity.

Figure 10. Relative performance of the four tiers on JetStream 2 on my computer at the time of that benchmark’s introduction.

We’re going to use the JetStream 2 benchmark suite since that’s the main suite that JavaScriptCore is tuned for. Let’s first consider an experiment where we run JetStream 2 with the tiers progressively enabled starting with the LLInt. Figure 10 shows the results: the Baseline and DFG are more than 2× better than the tier below them and the FTL is 1.1× better than the DFG.

The FTL’s benefits may be modest but they are unique. If we did not have the FTL, we would have no way of achieving the same peak throughput. A great example is the gaussian-blur subtest. This is the kind of compute test that the FTL is built for. I managed to measure the benchmark’s performance when we first introduced it and did not yet have a chance to tune for it. So, this gives a glimpse of the speed-ups that we expect to see from our tiers for code that hasn’t yet been through the benchmark tuning grind. Figure 11 shows the results. All of the JITs achieve spectacular speed-ups: Baseline is 3× faster than LLInt, DFG is 6× faster than Baseline, and FTL is 1.6× faster than DFG.

Figure 11. Relative performance of the four tiers on the gaussian-blur subtest of JetStream 2.

The DFG and FTL complement one another. The DFG is designed to be a fast-running compiler and it achieves this by excluding the most aggressive optimizations, like global register allocation, escape analysis, loop optimizations, or anything that needs SSA. This means that the DFG will always get crushed on peak throughput by compilers that have those features. It’s the FTL’s job to provide those optimizations if a function runs long enough to warrant it. This ensures that there is no scenario where a hypothetical competing implementation could outperform us unless they had the same number of tiers. If you wanted to make a compiler that compiles faster than the FTL then you’d lose on peak throughput, but if you wanted to make a compiler that generates better code than the DFG then you’d get crushed on start-up times. You need both to stay in the game.

Another way of looking at the performance of these tiers is to ask: how much time does a bytecode instruction take to execute in each of the tiers on average? This tells us just about the throughput that a tier achieves without considering start-up at all. This can be hard to estimate, but I made an attempt at it by repeatedly running each JetStream 2 benchmark and having it limit the maximum tier of each function at random. Then I employed a stochastic counting mechanism to get an estimate of the number of bytecode instructions executed at each tier in each run. Combined with the execution times of those runs, this gave a simple linear regression problem of the form:

ExecutionTime = (Latency of LLInt) * (Bytecodes in LLInt)
              + (Latency of Baseline) * (Bytecodes in Baseline)
              + (Latency of DFG) * (Bytecodes in DFG)
              + (Latency of FTL) * (Bytecodes in FTL)

Where the Latency of LLInt means the average amount of time it takes to execute a bytecode instruction in LLInt.

After excluding benchmarks that spent most of their time outside JavaScript execution (like regexp and wasm benchmarks) and fiddling with how to weight benchmarks (I settled on solving each benchmark separately and computing the geomean of the coefficients since this matches JetStream 2 weighting), the solution I arrived at was:

Execution Time = (3.97 ns) * (Bytecodes in LLInt)
               + (1.71 ns) * (Bytecodes in Baseline)
               + (.349 ns) * (Bytecodes in DFG)
               + (.225 ns) * (Bytecodes in FTL)

In other words, Baseline executes code about 2× faster than LLInt, DFG executes code about 5× faster than Baseline, and the FTL executes code about 1.5× faster than DFG. Note how this data is in the same ballpark as what we saw for gaussian-blur. That makes sense since that was a peak throughput benchmark.
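
The ratios quoted above follow directly from the fitted coefficients; as a quick sanity check:

const nsPerBytecode = { LLInt: 3.97, Baseline: 1.71, DFG: 0.349, FTL: 0.225 };
nsPerBytecode.LLInt / nsPerBytecode.Baseline;  // ≈ 2.3× (Baseline vs. LLInt)
nsPerBytecode.Baseline / nsPerBytecode.DFG;    // ≈ 4.9× (DFG vs. Baseline)
nsPerBytecode.DFG / nsPerBytecode.FTL;         // ≈ 1.55× (FTL vs. DFG)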

Although this isn’t a garbage collection blog post, it’s worth understanding a bit about how the garbage collector works. JavaScriptCore picks a garbage collection strategy that makes the rest of the virtual machine, including all of the support for speculation, easier to implement. The garbage collector has the following features that make speculation easier:

  • The collector scans the stack conservatively. This means that compilers don’t have to worry about how to report pointers to the collector.
  • The collector doesn’t move objects. This means that if a data structure (like the compiler IR) has many possible ways of referencing some object, we only have to report one of them to the collector.
  • The collector runs to fixpoint. This makes it possible to invent precise rules for whether objects created by speculation should be kept alive.
  • The collector’s object model is expressed in C++. JavaScript objects look like C++ objects, and JS object pointers look like C++ pointers.

These features make the compiler and runtime easier to write, which is great, since speculation requires us to write a lot of compiler and runtime code. JavaScript is a slow enough language even with the optimizations we describe in this post that garbage collector performance is rarely the longest pole in the tent. Therefore, our garbage collector makes many tradeoffs to make it easier to work on the performance-critical parts of our engine (like speculation). It would be unwise, for example, to make it harder to implement some compiler optimization as a way of getting a small garbage collector optimization, since the compiler has a bigger impact on performance for typical JavaScript programs.

To summarize: JavaScriptCore has four tiers, two of which do speculative optimizations, and all of which participate in the collection of profiling. The first two tiers are an interpreter and bytecode template JIT while the last two are optimizing compilers tuned for different throughput-latency trade-offs.

Speculative Compilation

Now that we’ve established some basic background about speculation and JavaScriptCore, this section goes into the details. First we will discuss JavaScriptCore’s bytecode. Then we show the control system for launching the optimizing compiler. Next will be a detailed section about how JavaScriptCore’s profiling tiers work, which focuses mostly on how they collect profiling. Finally we discuss JavaScriptCore’s optimizing compilers and their approach to OSR.

Bytecode

Speculation requires having a profiling tier and an optimizing tier. When the profiling tier reports profiling, it needs to be able to say what part of the code that profiling is for. When the optimizing compiler wishes to compile an OSR exit, it needs to be able to identify the exit site in a way that both tiers understand. To solve both issues, we need a common IR that is:

  • Used by all tiers as input.
  • Persistent for as long as the function that it represents is still live.
  • Immutable (at least for those parts that all tiers are interested in).

In this post, we will use bytecode as the common IR. This isn’t required; abstract syntax trees or even SSA could work as a common IR. We offer some insights into how we designed our bytecode for JavaScriptCore. JavaScriptCore’s bytecode is register-based, compact, untyped, high-level, directly interpretable, and transformable.

Our bytecode is register-based in the sense that operations tend to be written as:

add result, left, right

Which is taken to mean:

result = left + right

Where result, left, and right are virtual registers. Virtual registers may refer to locals, arguments, or constants in the constant pool. Functions declare how many locals they need. Locals are used both for named variables (like var, let, or const variables) and temporaries arising from expression tree evaluation.

Our bytecode is compact: each opcode and operand is usually encoded as one byte. We have wide prefixes to allow 16-bit or 32-bit operands. This is important since JavaScript programs can be large and the bytecode must persist for as long as the function it represents is still live.

Our bytecode is untyped. Virtual registers never have static type. Opcodes generally don’t have static type except for the few opcodes that have a meaningful type guarantee on their output (for example, the | operator always returns int32, so our bitor opcode returns int32). This is important since the bytecode is meant to be a common source of truth for all tiers. The profiling tier runs before we have done type inference, so the bytecode can’t have any more types than the JavaScript language.

Our bytecode is almost as high-level as JavaScript. While we use desugaring for many JavaScript features, we only do that when implementation by desugaring isn’t thought to cost performance. So, even the “fundamental” features of our bytecode are high level. For example, the add opcode has all of the power of the JavaScript + operator, including that it might mean a loop with effects.

Our bytecode is directly interpretable. The same bytecode stream that the interpreter executes is the bytecode stream that we will save in the cache (to skip parsing later) and feed to the compiler tiers.

Finally, our bytecode is transformable. Normally, intermediate representations use a control flow graph and make it easy to insert and remove instructions. That’s not how bytecode works: it’s an array of instructions encoded using a nontrivial variable-width encoding. But we do have a bytecode editing API and we use it for generatorification (our generator desugaring bytecode-to-bytecode pass). We can imagine this facility also being useful for other desugarings or for experimenting with bytecode instrumentation.

Compared to non-bytecode IRs, the main advantages of bytecode are that it’s easy to:

  • Identify targets for OSR exit. OSR exit in JavaScriptCore requires entering into an unoptimized bytecode execution engine (like an interpreter) at some arbitrary bytecode instruction. Using bytecode instruction index as a way of naming an exit target is intuitive since it’s just an integer.
  • Compute live state at exit. Register-based bytecode tends to have dense register numberings so it’s straightforward to analyze liveness using bitvectors. That tends to be fast and doesn’t require a lot of memory. It’s practical to cache the results of bytecode liveness analysis, for example.

JavaScriptCore’s bytecode format is independently implemented by the execution tiers. For example, the baseline JIT doesn’t try to use the LLInt to create its machine code templates; it just emits those templates itself and doesn’t try to match the LLInt exactly (the behavior is identical but the implementation isn’t). The tiers do share a lot of code – particularly for inline caches and slow paths – but they aren’t required to. It’s common for bytecode instructions to have algorithmically different implementations in the four tiers. For example the LLInt might implement some instruction with a large switch that handles all possible types, the Baseline might implement the same instruction with an inline cache that repatches based on type, and the DFG and FTL might try to do some combination of inline speculations, inline caches, and emitting a switch on all types. This exact scenario happens for add and other arithmetic ops as well as get_by_val/put_by_val. Allowing this independence allows each tier to take advantage of its unique properties to make things run faster. Of course, this approach also means that adding new bytecodes or changing bytecode semantics requires changing all of the tiers. For that reason, we try to implement new language features by desugaring them to existing bytecode constructs.

It’s possible to use any sensible IR as the common IR for a speculative compiler, including abstract syntax trees or SSA, but JavaScriptCore uses bytecode so that’s what we’ll talk about in the rest of this post.

Control

Speculative compilation needs a control system to decide when to run the optimizing compiler. The control system has to balance competing concerns: compiling functions as soon as it’s profitable, avoiding compiling functions that aren’t going to run long enough to benefit from it, avoiding compiling functions that have inadequate type profiling, and recompiling functions if a prior compilation did speculations that turned out to be wrong. This section describes JavaScriptCore’s control system. Most of the heuristics we describe were necessary, in our experience, to make speculative compilation profitable. Otherwise the optimizing compiler would kick in too often, not often enough, or not at the right rate for the right functions. This section describes the full details of JavaScriptCore’s tier-up heuristics because we suspect that to reproduce our performance, one would need all of these heuristics.

JavaScriptCore counts executions of functions and loops to decide when to compile. Once a function is compiled, we count exits to decide when to throw away compiled functions. Finally, we count recompilations to decide how much to back off from recompiling a function in the future.

Execution Counting

JavaScriptCore maintains an execution counter for each function. This counter gets incremented as follows:

  • Each call to the function adds 15 points to the execution counter.
  • Each loop execution adds 1 point to the execution counter.

We trigger tier-up once the counter reaches some threshold. Thresholds are determined dynamically. To understand our thresholds, first consider their static versions and then let’s look at how we modulate these thresholds based on other information.

  • LLInt→Baseline tier-up requires 500 points.
  • Baseline→DFG tier-up requires 1000 points.
  • DFG→FTL tier-up requires 100000 points.

Over the years we’ve found ways to dynamically adjust these thresholds based on other sources of information, like:

  • Whether the function got JITed the last time we encountered it (according to our cache). Let’s call this wasJITed.
  • How big the function is. Let’s call this S. We use the number of bytecode opcodes plus operands as the size.
  • How many times it has been recompiled. Let’s call this R.
  • How much executable memory is available. Let’s use M to say how much executable memory we have total, and U is the amount we estimate that we would use (total) if we compiled this function.
  • Whether profiling is “full” enough.

We select the LLInt→Baseline threshold based on wasJITed. If we don’t know (the function wasn’t in the cache) then we use the basic threshold, 500. Otherwise, if the function wasJITed then we use 250 (to accelerate tier-up) otherwise we use 2000. This optimization is especially useful for improving page load times.
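
A tiny sketch of that decision, using a hypothetical cache-lookup shape that isn’t JSC’s actual code:

function llintToBaselineThreshold(cacheEntry) {
    if (cacheEntry === undefined)       // the function wasn't in the cache
        return 500;
    return cacheEntry.wasJITed ? 250    // JITed last time: tier up sooner
                               : 2000;  // ran before but never got JITed: wait longer
}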

Baseline→DFG and DFG→FTL use the same scaling factor based on S, R, M, and U. The scaling factor is defined as follows:

(0.825914 + 0.061504 * sqrt(S + 1.02406)) * pow(2, R) * M / (M - U)

We multiply this by 1000 for Baseline→DFG and by 100000 for DFG→FTL. Let’s break down what this scaling factor does:

First we scale by the square root of the size. The expression 0.825914 + 0.061504 * sqrt(S + 1.02406) gives a scaling factor that is between 1 and 2 for functions smaller than about 350 bytecodes, which we consider to be “easy” functions to compile. The scaling factor uses square root so it grows somewhat gently. We’ve also tried having the scaling factor be linear, but that’s much worse. It is worth it to delay compilations of large functions a bit, but it’s not worth it to delay them too much. Note that the ideal delay doesn’t just have to do with the cost of compilation. It’s also about running long enough to get good profiling. Maybe there is some deep reason why square root works well here, but all we really care about is that scaling by this amount makes programs run faster.

Then we introduce exponential backoff based on the number of times that the function has been recompiled. The pow(2, R) expression means that each recompilation doubles the thresholds.

After that we introduce a hyperbolic scaling factor, M / (M - U), to help avoid cases where we run out of executable memory altogether. This is important since some configurations of JavaScriptCore run with a small available pool of executable memory. This expression means that if we use half of executable memory then the thresholds are doubled. If we use 3/4 of executable memory then the thresholds are quadrupled. This makes filling up executable memory a bit like going at the speed of light: the math makes it so that as you get closer to filling it up the thresholds get closer to infinity. However, it’s worth noting that this is imperfect for truly large programs, since those might have other reasons to allocate executable memory not covered by this heuristic. The heuristic is also imperfect in cases of multiple things being compiled in parallel. Using this factor increases the maximum program size we can handle with small pools of executable memory, but it’s not a silver bullet.
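
Putting the three factors together, the dynamic threshold described above could be sketched like this (illustrative code, using the S, R, M, and U notation from this section):

function tierUpThreshold(base, S, R, M, U) {
    const sizeFactor = 0.825914 + 0.061504 * Math.sqrt(S + 1.02406);  // roughly 1 to 2 below ~350 bytecodes
    const backoff = Math.pow(2, R);                                   // doubles with each recompilation
    const memoryPressure = M / (M - U);                               // 2× at half full, 4× at 3/4 full
    return base * sizeFactor * backoff * memoryPressure;              // base: 1000 (Baseline→DFG) or 100000 (DFG→FTL)
}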

Finally, if the execution count does reach this dynamically computed threshold, we check that some kinds of profiling (specifically, value and array profiling, discussed in detail in the upcoming profiling section) are full enough. We say that profiling is full enough if more than 3/4 of the profiling sites in the function have data. If this threshold is not met, we reset the execution counters. We let this process repeat five times. The optimizing compilers tend to speculate that unprofiled code is unreachable. This is profitable if that code really won’t ever run, but we want to be extra sure before doing that, hence we give functions with partial profiling 5× the time to warm up.

This is an exciting combination of heuristics! These heuristics were added early in the development of tiering in JSC. They were all added before we built the FTL, and the FTL inherited those heuristics just with a 100× multiplier. Each heuristic was added because it produced either a speed-up or a memory usage reduction or both. We try to remove heuristics that are not known to be speed-ups anymore, and to our knowledge, all of these still contribute to better performance on benchmarks we track.

Exit Counting

After we compile a function with the DFG or FTL, it’s possible that one of the speculations we made is wrong. This will cause the function to OSR exit back to LLInt or Baseline (we prefer Baseline, but may throw away Baseline code during GC, in which case exits from DFG and FTL will go to LLInt). We’ve found that the best way of dealing with a wrong speculation is to throw away the optimized code and try optimizing again later with better profiling. We detect if a DFG or FTL function should be recompiled by counting exits. The exit count thresholds are:

  • For a normal exit, we require 100 * pow(2, R) exits to recompile.
  • If the exit causes the Baseline JIT to enter its loop trigger (i.e. we got stuck in a hot loop after exit), then it’s counted specially. We only allow 5 * pow(2, R) of those kinds of exits before we recompile. Note that this can mean exiting five times and tripping the loop optimization trigger each time or it can mean exiting once and tripping the loop optimization trigger five times.
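
Or, as a small sketch of the jettison rule (a hypothetical helper, not JSC’s actual code):

function shouldJettison(normalExits, loopTriggerExits, R) {
    return normalExits >= 100 * Math.pow(2, R)
        || loopTriggerExits >= 5 * Math.pow(2, R);
}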

The first step to recompilation is to jettison the DFG or FTL function. That means that all future calls to the function will call the Baseline or LLInt function instead.

Recompilation

If a function is jettisoned, we increment the recompilation counter (R in our notation) and reset the tier-up functionality in the Baseline JIT. This means that the function will keep running in Baseline for a while (twice as long as it did before it was optimized last time). It will gather new profiling, which we will be able to combine with the profiling we collected before to get an even more accurate picture of how types behave in the function.

It’s worth looking at an example of this in action. We already showed an idealized case of tier-up in Figure 8, where a function gets compiled by each compiler exactly once and there are no OSR exits or recompilations. We will now show an example where things don’t go so well. This example is picked because it’s a particularly awful outlier. This isn’t how we expect our engine to behave normally. We expect amusingly bad cases like the following to happen occasionally since the success or failure of speculation is random and random behavior means having bad outliers.

_handlePropertyAccessExpression = function (result, node)
{
    result.possibleGetOverloads = node.possibleGetOverloads;
    result.possibleSetOverloads = node.possibleSetOverloads;
    result.possibleAndOverloads = node.possibleAndOverloads;
    result.baseType = Node.visit(node.baseType, this);
    result.callForGet = Node.visit(node.callForGet, this);
    result.resultTypeForGet = Node.visit(node.resultTypeForGet, this);
    result.callForAnd = Node.visit(node.callForAnd, this);
    result.resultTypeForAnd = Node.visit(node.resultTypeForAnd, this);
    result.callForSet = Node.visit(node.callForSet, this);
    result.errorForSet = node.errorForSet;
    result.updateCalls();
}

This function belongs to the WSL subtest of JetStream 2. It’s part of the WSL compiler’s AST walk. It ends up being a large function after inlining Node.visit. When I ran this on my computer, I found that JSC did 8 compilations before hitting equilibrium for this function:

  1. After running the function in LLInt for a bit, we compile this with Baseline. This is the easy part since Baseline doesn’t need to be recompiled.
  2. We compile with DFG. Unfortunately, the DFG compilation exits 101 times and gets jettisoned. The exit is due to a bad type check that the DFG emitted on this.
  3. We again compile with the DFG. This time, we exit twice due to a check on result. This isn’t enough times to trigger jettison and it doesn’t prevent tier-up to the FTL.
  4. We compile with the FTL. Unfortunately, this compilation gets jettisoned due to a failing watchpoint. Watchpoints (discussed in greater detail in later sections) are a way for the compiler to ask the runtime to notify it when bad things happen rather than emitting a check. Failing watchpoints cause immediate jettison. This puts us back in Baseline.
  5. We try the DFG again. We exit seven times due to a bad check on result, just like in step 3. This still isn’t enough times to trigger jettison and it doesn’t prevent tier-up to the FTL.
  6. We compile with the FTL. This time we exit 402 times due to a bad type check on node. We jettison and go back to Baseline.
  7. We compile with the DFG again. This time there are no exits.
  8. We compile with the FTL again. There are no further exits or recompilations.

This sequence of events has some intriguing quirks in addition to the number of compilations. Notice how in steps 3 and 5, we encounter exits due to a bad check on result, but none of the FTL compilations encounter those exits. This seems implausible since the FTL will do at least all of the speculations that the DFG did and a speculation that doesn’t cause jettison also cannot pessimise future speculations. It’s also surprising that the speculation that jettisons the FTL in step 6 wasn’t encountered by the DFG. It is possible that the FTL does more speculations than the DFG, but that usually only happens in inlined functions, and this speculation on node doesn’t seem to be in inlined code. A possible explanation for all of these surprising quirks is that the function is undergoing phase changes: during some parts of execution, it sees one set of types, and during another part of execution, it sees a somewhat different set. This is a common issue. Types are not random and they are often a function of time.

JavaScriptCore’s compiler control system is designed to get good outcomes both for functions where speculation “just works” and for functions like the one in this example that need some extra time. To summarize, control is all about counting executions, exits, and recompilations, and either launching a higher tier compiler (“tiering up”) or jettisoning optimized code and returning to Baseline.

Profiling

This section describes the profiling tiers of JavaScriptCore. The profiling tiers have the following responsibilities:

  • To provide a non-speculative execution engine. This is important for start-up (before we do any speculation) and for OSR exits. OSR exit needs to exit to something that does no speculation so that we don’t have chains of exits for the same operation.
  • To record useful profiling. Profiling is useful if it enables us to make profitable speculations. Speculations are profitable if doing them makes programs run faster.

In JavaScriptCore, the LLInt and Baseline are the profiling tiers while DFG and FTL are the optimizing tiers. However, DFG and FTL also collect some profiling, usually only when it’s free to do so and for the purpose of refining profiling collected by the profiling tiers.

This section is organized as follows. First we explain how JavaScriptCore’s profiling tiers execute code. Then we explain the philosophy of how to profile. Finally we go into the details of JavaScriptCore’s profiling implementation.

How Profiled Execution Works

JavaScriptCore profiles using the LLInt and Baseline tiers. LLInt interprets bytecode while Baseline compiles it. The two tiers share a nearly identical ABI so that it’s possible to jump from one to the other at any bytecode instruction boundary.

LLInt: The Low Level Interpreter

The LLInt is an interpreter that obeys JIT ABI (in the style of HotSpot‘s interpreter). To that end, it is written in a portable assembly language called offlineasm. Offlineasm has a functional macro language (you can pass macro closures around) embedded in it. The offlineasm compiler is written in Ruby and can compile to multiple CPUs as well as C++. This section tells the story of why this crazy design produces a good outcome.

The LLInt simultaneously achieves multiple goals for JavaScriptCore:

  • LLInt is JIT-friendly. The LLInt runs on the same stack that the JITs run on (which happens to be the C stack). The LLInt even agrees on register conventions with the JITs. This makes it cheap for LLInt to call JITed functions and vice versa. It makes LLInt→Baseline and Baseline→LLInt OSR trivial and it makes any JIT→LLInt OSR possible.
  • LLInt allows us to execute JavaScript code even if we can’t JIT. JavaScriptCore in no-JIT mode (we call it “mini mode”) has some advantages: it’s harder to exploit and uses less memory. Some JavaScriptCore clients prefer the mini mode. JSC is also used on CPUs that we don’t have JIT support for. LLInt works great on those CPUs.
  • LLInt reduces memory usage. Any machine code you generate from JavaScript is going to be big. Remember, there’s a reason why they call JavaScript “high level” and machine code “low level”: it refers to the fact that when you lower JavaScript to machine code, you’re going to get many instructions for each JavaScript expression. Having the LLInt means that we don’t have to generate machine code for all JavaScript code, which saves us memory.
  • LLInt starts quickly. LLInt interprets our bytecode format directly. It’s designed so that we could map bytecode from disk and point the interpreter at it. The LLInt is essential for achieving great page load time in the browser.
  • LLInt is portable. It can be compiled to C++.

It would have been natural to write the LLInt in C++, since that’s what most of JavaScriptCore is written in. But that would have meant that the interpreter would have a C++ stack frame constructed and controlled by the C++ compiler. This would have introduced two big problems:

  1. It would be unclear how to OSR from the LLInt to the Baseline JIT or vice versa, since OSR would have to know how to decode and reencode a C++ stack frame. We don’t doubt that it’s possible to do this with enough cleverness, but it would create constraints on exactly how OSR works and it’s not an easy piece of machinery to maintain.
  2. JS functions running in the LLInt would have two stack frames instead of one. One of those stack frames would have to go onto the C++ stack (because it’s a C++ stack frame). We have multiple choices of how to manage the JS stack frame (we could try to alloca it on top of the C++ frame, or allocate it somewhere else) but this inevitably increases cost: calls into the interpreter would have to do twice the work. A common optimization to this approach is to have interpreter→interpreter calls reuse the same C++ stack frame by managing a separate JS stack on the side. Then you can have the JITs use that separate JS stack. This still leaves cost when calling out of interpreter to JIT or vice versa.

A natural way to avoid these problems is to write the interpreter in assembly. That’s basically what we did. But a JavaScript interpreter is a complex beast. It would be awful if porting JavaScriptCore to a new CPU meant rewriting the interpreter in another assembly language. Also, we want to use abstraction to write it. If we wrote it in C++, we’d probably have multiple functions, templates, and lambdas, and we would want all of them to be inlined. So we designed a new language, offlineasm, which has the following features:

  • Portable assembly with our own mnemonics and register names that match the way we do portable assembly in our JIT. Some high-level mnemonics require lowering. Offlineasm reserves some scratch registers to use for lowering.
  • The macro construct. It’s best to think of this as a lambda that takes some arguments and returns void. Then think of the portable assembly statements as print statements that output that assembly. So, the macros are executed for effect and that effect is to produce an assembly program. These are the execution semantics of offlineasm at compile time.

Macros allow us to write code with rich abstractions. Consider this example from the LLInt:

macro llintJumpTrueOrFalseOp(name, op, conditionOp)
    llintOpWithJump(op_%name%, op, macro (size, get, jump, dispatch)
        get(condition, t1)
        loadConstantOrVariable(size, t1, t0)
        btqnz t0, ~0xf, .slow
        conditionOp(t0, .target)
        dispatch()

    .target:
        jump(target)

    .slow:
        callSlowPath(_llint_slow_path_%name%)
        nextInstruction()
    end)
end

This is a macro that we use for implementing both the jtrue and jfalse opcodes. There are only three lines of actual assembly in this listing: the btqnz (branch test quad not zero) and the two labels (.target and .slow). This also shows the use of first-class macros: on the second line, we call llintOpWithJump and pass it a macro closure as the third argument. The great thing about having a lambda-like construct like macro is that we don’t need much else to have a pleasant programming experience. The LLInt is written in about 5000 lines of offlineasm (if you only count the 64-bit version).

To summarize, LLInt is an interpreter written in offlineasm. LLInt understands JIT ABI so calls and OSR between LLInt and JIT are cheap. The LLInt allows JavaScriptCore to load code more quickly, use less memory, and run on more platforms.

Baseline: The Bytecode Template JIT

The Baseline JIT achieves a speed-up over the LLInt at the cost of some memory and the time it takes to generate machine code. Baseline’s speed-up is thanks to two factors:

  • Removal of interpreter dispatch. Interpreter dispatch is the costliest part of interpretation, since the indirect branches used for selecting the implementation of an opcode are hard for the CPU to predict. This is the primary reason why Baseline is faster than LLInt.
  • Comprehensive support for polymorphic inline caching. It is possible to do sophisticated inline caching in an interpreter, but currently our best inline caching implementation is the one shared by the JITs.

The Baseline JIT compiles bytecode by turning each bytecode instruction into a template of machine code. For example, a bytecode instruction like:

add loc6, arg1, arg2

Is turned into something like:

0x2f8084601a65: mov 0x30(%rbp), %rsi
0x2f8084601a69: mov 0x38(%rbp), %rdx
0x2f8084601a6d: cmp %r14, %rsi
0x2f8084601a70: jb 0x2f8084601af2
0x2f8084601a76: cmp %r14, %rdx
0x2f8084601a79: jb 0x2f8084601af2
0x2f8084601a7f: mov %esi, %eax
0x2f8084601a81: add %edx, %eax
0x2f8084601a83: jo 0x2f8084601af2
0x2f8084601a89: or %r14, %rax
0x2f8084601a8c: mov %rax, -0x38(%rbp)

The only parts of this code that would vary from one add instruction to another are the references to the operands. For example, 0x30(%rbp) (that’s x86 for the memory location at frame pointer plus 0x30) is the machine code representation of arg1 in bytecode.

The Baseline JIT does few optimizations beyond just emitting code templates. It does no register allocation between instruction boundaries, for example. The Baseline JIT does some local optimizations, such as specializing a math operation when one of its operands is a constant, or using profiling information collected by the LLInt to emit better code. Baseline also has good support for code repatching, which is essential for implementing inline caching. We discuss inline caching in detail later in this section.

To summarize, the Baseline JIT is a mostly unoptimized JIT compiler that focuses on removing interpreter dispatch overhead. This is enough to make it a ~2× speed-up over the LLInt.

Profiling Philosophy

Profiling in JSC is designed to be cheap and useful.

JavaScriptCore’s profiling aims to incur little or no cost in the common case. Running with profiling turned on but never using the results to do optimizations should result in throughput that is about as good as if all of the profiling was disabled. We want profiling to be cheap because even in a long-running program, lots of functions will only run once or for too short a time to make an optimizing JIT profitable. Some functions might finish running in less time than it takes to optimize them. The profiling can’t be so expensive that it makes functions like that run slower.

Profiling is meant to help the compiler make the kinds of speculations that cause the program to run faster when we factor in both the speed-ups from speculations that are right and the slow-downs from speculations that are wrong. It’s possible to understand this formally by thinking of speculation as a bet. We say that profiling is useful if it turns the speculation into a value bet. A value bet is one where the expected value (EV) is positive. That’s another way of saying that the average outcome is profitable, so if we repeated the bet an infinite number of times, we’d be richer. Formally the expected value of a bet is:

p * B - (1 - p) * C

Where p is the probability of winning, B is the benefit of winning, and C is the cost of losing (both B and C are positive). A bet is a value bet iff:

p * B - (1 - p) * C > 0

Let’s view speculation using this formula. The scenario in which we have the choice to make a bet or not is that we are compiling a bytecode instruction, we have some profiling that implies that we should speculate, and we have to choose whether to speculate or not. Let’s say that B and C both have to do with the latency, in nanoseconds, of executing a bytecode instruction once. B is the improvement to that latency if we do some speculation and it turns out to be right. C is the regression to that latency if the speculation we make is wrong. Of course, after we have made a speculation, it will run many times and may be right sometimes and wrong sometimes. But B is just about the speed-up in the right cases, and C is just about the slow-down in the wrong cases. The baseline relative to which B and C are measured is the latency of the bytecode instruction if it was compiled with an optimizing JIT but without that particular OSR-exit-based speculation.

For example, we may have a less-than operation, and we are considering whether to speculate that neither input is double. We can of course compile less-than without making that speculation, so that’s the baseline. If we do choose to speculate, then B is the speed-up to the average execution latency of that bytecode in those cases when neither input is double. Meanwhile, C is the slow-down to the average execution latency of that bytecode in those cases when at least one input is a double.

For B, let’s just compute some bounds. The lower bound is zero, since some speculations are not profitable. A pretty good first order upper bound for B is the difference in per-bytecode-instruction latency between the baseline JIT and the FTL. Usually, the full speed-up of a bytecode instruction from baseline to FTL is the result of multiple speculations as well as nonspeculative compiler optimizations. So, a single speculation being responsible for the full difference in performance between baseline and FTL is a fairly conservative upper bound for B. Previously, we said that on average in the JetStream 2 benchmark on my computer, a bytecode instruction takes 1.71 ns to execute in Baseline and .225 ns to execute in FTL. So we can say:

B <= 1.71 ns - .225 ns = 1.48 ns

Now let’s estimate C. C is how many more nanoseconds it takes to execute the bytecode instruction if we have speculated and we experience speculation failure. Failure means executing an OSR exit stub and then reexecuting the same bytecode instruction in baseline or LLInt. Then, all subsequent bytecodes in the function will execute in baseline or LLInt rather than DFG or FTL. Every 100 exits or so, we jettison and eventually recompile. Compiling is concurrent, but running a concurrent compiler is sure to slow down the main thread even if there is no lock contention. To fully capture C, we have to account for the cost of the OSR exit itself and then amortize the cost of reduced execution speed of the remainder of the function and the cost of eventual recompilation. Fortunately, it’s pretty easy to measure this directly by hacking the DFG frontend to randomly insert pointless OSR exits with low probability and by having JSC report a count of the number of exits. I did an experiment with this hack for every JetStream 2 benchmark. Running without the synthetic exits, we get an execution time and a count of the number of exits. Running with synthetic exits, we get a longer execution time and a larger number of exits. The slope between these two points is an estimate of C. This is what I found, on the same machine that I used for running the experiments to compute B:

[DFG] C = 2499 ns
[FTL] C = 9998 ns

Notice how C is way bigger than B! This isn’t some slight difference. We are talking about three orders of magnitude for the DFG and four orders of magnitude for the FTL. This paints a clear picture: speculation is a bet with tiny benefit and enormous cost.

For the DFG, this means that we need:

p > 0.9994

For speculation to be a value bet. p has to be even closer to 1 for FTL. Based on this, our philosophy for speculation is we won’t do it unless we think that:

p ~ 1

Since the cost of speculation failure is so enormous, we only want to speculate when we know that we won’t fail. The speed-up of speculation happens because we make lots of sure bets and only a tiny fraction of them ever fail.
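
To make the arithmetic concrete, here is a minimal standalone sketch (not JSC code) that rearranges the EV inequality into the break-even success probability p > C / (B + C) and plugs in the measured numbers:

#include <cstdio>

// Speculation is a value bet iff p * B - (1 - p) * C > 0,
// which rearranges to p > C / (B + C).
static double breakEvenProbability(double benefitNs, double costNs)
{
    return costNs / (benefitNs + costNs);
}

int main()
{
    const double B = 1.48; // ns, upper bound on the benefit of one speculation
    std::printf("DFG: p > %.4f\n", breakEvenProbability(B, 2499)); // prints ~0.9994
    std::printf("FTL: p > %.5f\n", breakEvenProbability(B, 9998)); // prints ~0.99985
    return 0;
}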

It’s pretty clear what this means for profiling:

  • Profiling needs to focus on noting counterexamples to whatever speculations we want to do. We don’t want to speculate if profiling tells us that the counterexample ever happened, since if it ever happened, then the EV of this speculation is probably negative. This means that we are not interested in collecting probability distributions. We just want to know if the bad thing ever happened.
  • Profiling needs to run for a long time. It’s common to wish for JIT compilers to compile hot functions sooner. One reason why we don’t is that we need about 3-4 “nines” of confidence that the counterexamples didn’t happen. Recall that our threshold for tiering up into the DFG is about 1000 executions. That’s probably not a coincidence.

Finally, since profiling is a bet, it’s important to approach it with a healthy gambler’s philosophy: the fact that a speculation succeeded or failed in a particular program does not tell us if the speculation is good or bad. Speculations are good or bad only based on their average behavior. Focusing too much on whether profiling does a good job for a particular program may result in approaches that cause it to perform badly on average.

Profiling Sources in JavaScriptCore

JavaScriptCore gathers profiling from multiple different sources. These profiling sources use different designs. Sometimes, a profiling source is a unique source of data, but other times, profiling sources are able to provide some redundant data. We only speculate when all profiling sources concur that the speculation would always succeed. The following sections describe our profiling sources in detail.

Case Flags

Case flags are used for branch speculation. This applies anytime the best way to implement a JS operation involves branches and multiple paths, like a math operation having to handle either integers or doubles. The easiest way to profile and speculate is to have the profiling tiers implement both sides of the branch and set a different flag on each side. That way, the optimizing tier knows that it can profitably speculate that only one path is needed if the flags for the other paths are not set. In cases where there is clearly a preferred speculation — for example, speculating that an integer add did not overflow is clearly preferred over speculating that it did overflow — we only need flags on the paths that we don’t like (like the overflow path).

Let’s consider two examples of case flags in more detail: integer overflow and property accesses on non-object values.

Say that we are compiling an add operation that is known to take integers as inputs. Usually the way that the LLInt interpreter or Baseline compiler would “know” this is that the add operation we’ll talk about is actually part of a larger add implementation after we’ve already checked that the inputs are integers. Here’s the logic that the profiling tier would use, written as if it were C++ code to make it easy to parse:

int32_t left = ...;
int32_t right = ...;
ArithProfile* profile = ...; // This is the thing with the case flags.
int32_t intResult;
JSValue result; // This is a tagged JavaScript value that carries type.
if (UNLIKELY(addOverflowed(left, right, &intResult))) {
    result = jsNumber(static_cast<double>(left) +
                      static_cast<double>(right));

    // Set the case flag indicating that overflow happened.
    profile->setObservedInt32Overflow();
} else
    result = jsNumber(intResult);

When optimizing the code, we will inspect the ArithProfile object for this instruction. If !profile->didObserveInt32Overflow(), we will emit something like:

int32_t left = ...;
int32_t right = ...;
int32_t result;
speculate(!addOverflowed(left, right, &result));

I.e. we will add and branch to an exit on overflow. Otherwise we will just emit the double path:

double left = ...;
double right = ...;
double result = left + right;

Unconditionally doing double math is not that expensive; in fact on benchmarks that I’ve tried, it’s cheaper than doing integer math and checking overflow. The only reason why integers are profitable is that they are cheaper to use for bit operations and pointer arithmetic. Since CPUs don’t accept floats or doubles for bit and pointer math, we need to convert the double to an integer first if the JavaScript program uses it that way (pointer math arises when a number is used as an array index). Such conversions are relatively expensive even on CPUs that support them natively. Usually it’s hard to tell, using profiling or any static analysis, whether a number that a program computed will be used for bit or pointer math in the future. Therefore, it’s better to use integer math with overflow checks so that if the number ever flows into an operation that requires integers, we won’t have to pay for expensive conversions. But if we learn that any such operation overflows — even occasionally — we’ve found that it’s more efficient overall to unconditionally switch to double math. Perhaps the presence of overflows is strongly correlated with the result of those operations not being fed into bit math or pointer math.

A simpler example is how case flags are used in property accesses. As we will discuss in the inline caches section, property accesses have associated metadata that we use to track details about their behavior. That metadata also has flags, like the sawNonCell bit, which we set to true if the property access ever sees a non-object as the base. If the flag is set, the optimizing compilers know not to speculate that the property access will see objects. This typically forces all kinds of conservatism for that property access, but that’s better than speculating wrong and exiting in this case. Lots of case flags look like sawNonCell: they are casually added as a bit in some existing data structure to help the optimizing compiler know which paths were taken.
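
As a rough sketch, in the same spirit as the ArithProfile example above (the type names and helpers here are illustrative, not the exact ones in JSC), the profiling side and the speculation decision look like this:

struct GetByIdProfile {
    bool sawNonCell { false }; // case flag: did the base ever turn out not to be an object?
};

// Profiling tiers (LLInt/Baseline): record the counterexample on the slow path.
JSValue profiledGetById(JSValue base, const Identifier& name, GetByIdProfile& profile)
{
    if (!base.isCell())
        profile.sawNonCell = true;      // the bad thing happened at least once
    return genericGetById(base, name);  // illustrative fully dynamic lookup
}

// Optimizing tiers: only speculate "base is an object" if the flag was never set.
bool canSpeculateObjectBase(const GetByIdProfile& profile)
{
    return !profile.sawNonCell;
}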

To summarize, case flags are used to record counterexamples to the speculations that we want to do. They are a natural way to implement profiling in those cases where the profiling tiers would have had to branch anyway.

Case Counts

A predecessor to case flags in JavaScriptCore is case counts. It’s the same idea as flags, but instead of just setting a bit to indicate that a bad thing happened, we would count. If the count never got above some threshold, we would speculate.
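
Sketched in the same style (the threshold is a parameter here purely for illustration):

struct OverflowCountProfile {
    uint32_t overflowCount { 0 }; // incremented by the profiling tiers on the overflow path
};

// Optimizing tiers: speculate only while the bad case stayed below the threshold.
bool canSpeculateNoOverflow(const OverflowCountProfile& profile, uint32_t threshold)
{
    return profile.overflowCount < threshold;
}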

Case counts were written before we realized that the EV of speculation is awful unless the probability of success is basically 1. We thought that we could speculate in cases where we knew we’d be right a majority of the time, for example. Initial versions of case counts had variable thresholds — we would compute a ratio with the execution count to get a case rate. That didn’t work as well as fixed thresholds, so we switched to a fixed count threshold of 100. Over time, we lowered the threshold to 20 or 10, and then eventually found that the threshold should really be 1, at which point we switched to case flags.

Some functionality still uses case counts. We still have case counts for determining if the this argument is exotic (some values of this require the function to perform a possibly-effectful conversion in the prologue). We still have case counts as a backup for math operations overflowing, though that is almost certainly redundant with our case flags for math overflow. It’s likely that we will remove case counts from JavaScriptCore eventually.

Value Profiling

Value profiling is all about inferring the types of JavaScript values (JSValues). Since JS is a dynamic language, JSValues have a runtime type. We use a 64-bit JSValue representation that uses bit encoding tricks to hold either doubles, integers, booleans, null, undefined, or pointers to cells, which may be JavaScript objects, symbols, or strings. We refer to the act of encoding a value in a JSValue as boxing it and the act of decoding as unboxing (note that boxing is a term used in other engines to refer specifically to the act of allocating a box object in the heap to hold a value; our use of the term boxing is more like what others call tagging). In order to effectively optimize JavaScript, we need to have some way of inferring the type so that the compiler can assume things about it statically. Value profiling tracks the set of values that a particular program point saw so that we can predict what types that program point will see in the future.

Figure 12. Value profiling and prediction propagation for a sample data flow graph.

We combine value profiling with a static analysis called prediction propagation. The key insight is that prediction propagation can infer good guesses for the types for most operations if it is given a starting point for certain opaque operations:

  • Arguments incoming to the function.
  • Results of most load operations.
  • Results of most calls.

There’s no way that a static analysis running just on some function could guess what types loads from plain JavaScript arrays or calls to plain JavaScript functions could have. Value profiling is about trying to help the static analysis guess the types of those opaque operations. Figure 12 shows how this plays out for a sample data flow graph. There’s no way static analysis can tell the type of most GetByVal and GetById operations, since those are loads from dynamically typed locations in the heap. But if we do know what those operations return, then we can infer types for this entire graph by using simple type rules for Add (like that if it takes integers as inputs and the case flags tell us there was no overflow then it will produce integers).

Let’s break down value profiling into the details of how exactly values are profiled, how prediction propagation works, and how the results of prediction propagation are used.

Recording value profiles. At its core, value profiling is all about having some program point (either a point in the interpreter or something emitted by the Baseline JIT) log the value that it saw. We log values into a single bucket so that each time the profiling point runs, it overwrites the last seen value. The code looks like this in the LLInt:

macro valueProfile(op, metadata, value)
    storeq value, %op%::Metadata::profile.m_buckets[metadata]
end

Let’s look at how value profiling works for the get_by_val bytecode instruction. Here’s part of the code for get_by_val in LLInt:

llintOpWithMetadata(
    op_get_by_val, OpGetByVal,
    macro (size, get, dispatch, metadata, return)
        macro finishGetByVal(result, scratch)
            get(dst, scratch)
            storeq result, [cfr, scratch, 8]
            valueProfile(OpGetByVal, t5, result)
            dispatch()
        end

        ... // more code for get_by_val

The implementation of get_by_val includes a finishGetByVal helper macro that stores the result in the right place on the stack and then dispatches to the next instruction. Note that it also calls valueProfile to log the result just before finishing.

Each ValueProfile object has a pair of buckets and a predicted type. One bucket is for normal execution. The valueProfile macro in the LLInt uses this bucket. The other bucket is for OSR exit: if we exit due to a speculation on a type that we got from value profiling, we feed the value that caused OSR exit back into the second bucket of the ValueProfile.

Each time that our execution counters (used for controlling when to invoke the next tier) count about 1000 points, the execution counting slow path updates all predicted types for the value profiles in that function. Updating value profiles means computing a predicted type for the value in the bucket and merging that type with the previously predicted type. Therefore, after repeated predicted type updates, the type will be broad enough to be valid for multiple different values that the code saw.
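
In rough terms (the field and function names here are approximations of the real ones), a value profile and its periodic update look like this:

struct ValueProfile {
    JSValue buckets[2];       // [0]: written by normal execution, [1]: written on OSR exit
    SpeculatedType prediction { SpecNone };
};

// Called from the execution counting slow path, roughly every 1000 points.
void updatePrediction(ValueProfile& profile)
{
    for (JSValue value : profile.buckets) {
        if (!value)
            continue;
        // Compute the SpeculatedType of the last-seen value and union it into the
        // running prediction, so the prediction only ever gets broader over time.
        profile.prediction = mergeSpeculations(profile.prediction, speculationFromValue(value));
    }
}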

Predicted types use the SpeculatedType type system. A SpeculatedType is a 64-bit integer in which we use the low 40 bits to represent a set of 40 fundamental types. The fundamental types, shown in Figure 13, represent non-overlapping sets of possible JSValues. 2^40 SpeculatedTypes are possible by setting any combination of bits.

Figure 13. All of the fundamental SpeculatedTypes.

This allows us to invent whatever types are useful for optimization. For example, we distinguish between 32-bit integers whose value is either 0 or 1 (BoolInt32) versus whose value is anything else (NonBoolInt32). Together these form the Int32Only type, which just has both bits set. BoolInt32 is useful for cases where integers are converted to booleans.
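
Because a SpeculatedType is just a set of bits, operations on predictions are cheap bit math. Here is a sketch; the concrete bit positions are made up for illustration, and JSC defines many more fundamental types:

typedef uint64_t SpeculatedType;

static const SpeculatedType SpecBoolInt32    = 1ull << 0; // int32 whose value is 0 or 1
static const SpeculatedType SpecNonBoolInt32 = 1ull << 1; // any other int32
static const SpeculatedType SpecInt32Only    = SpecBoolInt32 | SpecNonBoolInt32;

// Merging two predictions is set union.
SpeculatedType mergeSpeculations(SpeculatedType a, SpeculatedType b) { return a | b; }

// "Is every value this prediction admits an Int32?"
bool isInt32Speculation(SpeculatedType value)
{
    return value && !(value & ~SpecInt32Only);
}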

Prediction propagation. We use value profiling to fill in the blanks for the prediction propagation pass of the DFG compiler pipeline. Prediction propagation is an abstract interpreter that tracks the set of types that each variable in the program can have. It’s unsound, since the types it produces are just predictions (it can produce any combination of types and at worst we will just OSR exit too much). That said, we try to make it as sound as we can: the more sound it is, the fewer OSR exits we have. Prediction propagation fills in the things that the abstract interpreter can’t reason about (loads from the heap, results returned by calls, arguments to the function, etc.) using the results of value profiling. On the topic of soundness, we would consider it to be a bug if prediction propagation was unsound in a world where value profiling is never wrong. Of course, in reality, we know that value profiling will be wrong, so we know that prediction propagation is unsound.

Let’s consider some of the cases where prediction propagation can be sure about the result type of an operation based on the types of its inputs.

Figure 14. Some of the prediction propagation rules for Add. This figure doesn’t show the rules for string concatenation and objects.
Figure 15. Some of the prediction propagation rules for GetByVal (the DFG opcode for subscript access like array[index]). This figure only shows a small sample of the GetByVal rules.

Figure 14 shows some of the rules for the Add operation in DFG IR. Prediction propagation and case flags tell us everything we want to know about the output of Add. If the inputs are integers and the overflow flag isn’t set, the output is an integer. If the inputs are any other kinds of numbers or there are overflows, the output is a double. We don’t need anything else (like value profiling) to understand the output type of Add.
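
A sketch of that Add rule (heavily simplified; as the figure notes, the real rule also covers strings, objects, and other numeric representations, and SpecDoubleReal here stands in for the double prediction):

SpeculatedType predictAdd(SpeculatedType left, SpeculatedType right, const ArithProfile& profile)
{
    if (isInt32Speculation(left) && isInt32Speculation(right)
        && !profile.didObserveInt32Overflow())
        return SpecInt32Only;   // integers in, no observed overflow: integer out
    // Any other kinds of numbers, or an observed overflow: the result is a double.
    return SpecDoubleReal;
}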

Figure 15 shows some of the rules for GetByVal, which is the DFG representation of array[index]. In this case, there are types of arrays that could hold any type of value. So, even knowing that it is a JSArray isn’t enough to know the types of values inside the array. Also, if the index is a string, then this could be accessing some named property on the array object or one of its prototypes and those could have any type. It’s in cases like GetByVal that we leverage value profiling to guess what the result type is.

Prediction propagation combined with value profiling allows the DFG to infer a predicted type at every point in the program where a variable is used. This allows operations that don’t do any profiling on their own to still perform type-based speculations. It’s of course possible to also have bytecode instructions that can speculate on type collect case flags (or use some other mechanism) to drive those speculations — and that approach can be more precise — but value profiling means that we don’t have to do this for every operation that wants type-based speculation.

Using predicted types. Consider the CompareEq operation in DFG IR, which is used for the DFG lowering of the eq, eq_null, neq, neq_null, jeq, jeq_null, jneq, and jneq_null bytecodes. These bytecodes do no profiling of their own. But CompareEq is one of the most aggressive type speculators in all of the DFG. CompareEq can speculate on the types it sees without doing any profiling of its own because the values it uses will either have value profiling or will have a predicted type filled in by prediction propagation.

Type speculations in the DFG are written like:

CompareEq(Int32:@left, Int32:@right)

This example means that the CompareEq will speculate that both operands are Int32. CompareEq supports the following speculations, plus others we don’t list here:

CompareEq(Boolean:@left, Boolean:@right)
CompareEq(Int32:@left, Int32:@right)
CompareEq(Int32:BooleanToNumber(Boolean:@left), Int32:@right)
CompareEq(Int32:BooleanToNumber(Untyped:@left), Int32:@right)
CompareEq(Int32:@left, Int32:BooleanToNumber(Boolean:@right))
CompareEq(Int32:@left, Int32:BooleanToNumber(Untyped:@right))
CompareEq(Int52Rep:@left, Int52Rep:@right)
CompareEq(DoubleRep:DoubleRep(Int52:@left), DoubleRep:DoubleRep(Int52:@right))
CompareEq(DoubleRep:DoubleRep(Int52:@left), DoubleRep:DoubleRep(RealNumber:@right))
CompareEq(DoubleRep:DoubleRep(Int52:@left), DoubleRep:DoubleRep(Number:@right))
CompareEq(DoubleRep:DoubleRep(Int52:@left), DoubleRep:DoubleRep(NotCell:@right))
CompareEq(DoubleRep:DoubleRep(RealNumber:@left), DoubleRep:DoubleRep(RealNumber:@right))
CompareEq(DoubleRep:..., DoubleRep:...)
CompareEq(StringIdent:@left, StringIdent:@right)
CompareEq(String:@left, String:@right)
CompareEq(Symbol:@left, Symbol:@right)
CompareEq(Object:@left, Object:@right)
CompareEq(Other:@left, Untyped:@right)
CompareEq(Untyped:@left, Other:@right)
CompareEq(Object:@left, ObjectOrOther:@right)
CompareEq(ObjectOrOther:@left, Object:@right)
CompareEq(Untyped:@left, Untyped:@right)

Some of these speculations, like CompareEq(Int32:, Int32:) or CompareEq(Object:, Object:), allow the compiler to just emit an integer compare instruction. Others, like CompareEq(String:, String:), emit a string compare loop. We have lots of variants to optimally handle bizarre comparisons that are not only possible in JS but that we have seen happen frequently in the wild, like comparisons between numbers and booleans and comparisons between one value that is always a number and another that is either a number or a boolean. We provide additional optimizations for comparisons between doubles, comparisons between strings that have been hash-consed (so-called StringIdent, which can be compared using comparison of the string pointer), and comparisons where we don’t know how to speculate (CompareEq(Untyped:, Untyped:)).

The basic idea of value profiling — storing a last-seen value into a bucket and then using that to bootstrap a static analysis — is something that we also use for profiling the behavior of array accesses. Array profiles and array allocation profiles are like value profiles in that they save the last result in a bucket. Like value profiling, data from those profiles is incorporated into prediction propagation.

To summarize, value profiling allows us to predict the types of variables at all of their use sites by just collecting profiling at those bytecode instructions whose output cannot be predicted with abstract interpretation. This serves as the foundation for how the DFG (and FTL, since it reuses the DFG’s frontend) speculates on the types of JSValues.

Inline Caches

Property accesses and function calls are particularly difficult parts of JavaScript to optimize:

  • Objects behave as if they were just ordered mappings from strings to JSValues. Lookup, insertion, deletion, replacement, and iteration are possible. Programs do these operations a lot, so they have to be fast. In some cases, programs use objects the same way that programs in other languages would use hashtables. In other cases, programs use objects the same way that they would in Java or some sensibly-typed object-oriented language. Most programs do both.
  • Function calls are polymorphic. You can’t make static promises about what function will be called.

Both of these dynamic features are amenable to optimization with Deutsch and Schiffman’s inline caches (ICs). For dynamic property access, we combine this with structures, based on the idea of maps in Chambers, Ungar, and Lee’s Self implementation. We also follow Hölzle, Chambers, and Ungar: our inline caches are polymorphic and we use data from these caches as profiling of the types observed at a particular property access or call site.

It’s worth dwelling a bit on the power of inline caching. Inline caches are great optimizations separately from speculative compilation. They make the LLInt and Baseline run faster. Inline caches are our most powerful profiling source, since they can precisely collect information about every type encountered by an access or call. Note that we previously said that good profiling has to be cheap. We think of inline caches as negative cost profiling since inline caches make the LLInt and Baseline faster. It doesn’t get cheaper than that!

This section focuses on inline caching for dynamic property access, since it’s strictly more complex than for calls (accesses use structures, polymorphic inline caches (PICs), and speculative compilation; calls only use polymorphic inline caches and speculative compilation). We organize our discussion of inline caching for dynamic property access as follows. First we describe how structures work. Then we show the JavaScriptCore object model and how it incorporates structures. Next we show how inline caches work. Then we show how profiling from inline caches is used by the optimizing compilers. After that we show how inline caches support polymorphism and polyvariance. Finally we talk about how inline caches are integrated with the garbage collector.

Structures. Objects in JavaScript are just mappings from strings to JSValues. Lookup, insertion, deletion, replacement, and iteration are all possible. We want to optimize those uses of objects that would have had a type if the language had given the programmer a way to say it.

Figure 16. Some JavaScript objects that have x and y properties. Some of them have exactly the same shape (only x and y in the same order).

Consider how to implement a property access like:

var tmp = o.x;

Or:

o.x = tmp;

One way to make this fast is to use hashtables. That’s certainly a necessary fallback mechanism when the JavaScript program uses objects more like hashtables than like objects (i.e. it frequently inserts and deletes properties). But we can do better.

This problem frequently arises in dynamic programming languages and it has a well-understood solution. The key insight of Chambers, Ungar, and Lee’s Self implementation is that property access sites in the program will typically only see objects of the same shape. Consider the objects in Figure 16 that have x and y properties. Of course it’s possible to insert x and y in two possible orders, but folks will tend to pick some order and stick to it (like x first). And of course it’s possible to also have objects that have a z property, but it’s less likely that a property access written as part of the part of the program that works with {x, y} objects will be reused for the part that uses {x, y, z}. It’s possible to have shared code for many different kinds of objects but unshared code is more common. Therefore, we split the object representation into two parts:

  • The object itself, which only contains the property values and a structure pointer.
  • The structure, which is a hashtable that maps property names (strings) to indices in the objects that have that structure.
Figure 17. The same objects as in Figure 16, but using structures.

Figure 17 shows objects represented using structures. Objects only contain object property values and a pointer to a structure. The structure tells the property names and their order. For example, if we wanted to ask the {1, 2} object in Figure 17 for the value of property x, we would load the pointer to its structure, {x, y}, and ask that structure for the index of x. The index is 0, and the value at index 0 in the {1, 2} object is 1.

A key feature of structures is that they are hash consed. If two objects have the same properties in the same order, they are likely to have the same structure. This means that checking if an object has a certain structure is O(1): just load the structure pointer from the object header and compare the pointer to a known value.

Structures can also indicate that objects are in dictionary or uncacheable dictionary mode, which are basically two levels of hashtable badness. In both cases, the structure stops being hash consed and is instead paired 1:1 with its object. Dictionary objects can have new properties added to them without the structure changing (the property is added to the structure in-place). Uncacheable dictionary objects can have properties deleted from them without the structure changing. We won’t go into these modes in too much detail in this post.

To summarize, structures are hashtables that map property names to indices in the object. Object property lookup uses the object’s structure to find the index of the property. Structures are hash consed to allow for fast structure checks.
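
Sketched in C++ (greatly simplified; real structures also track prototypes, transitions, property attributes, and much more):

struct Structure {
    HashMap<String, unsigned> indexOfProperty; // property name -> index into the object's slots
};

struct JSObject {
    Structure* structure;
    JSValue slots[2]; // property values, stored in the order the structure dictates
};

// Fully dynamic lookup: one hash lookup, but in the structure, not the object.
JSValue getProperty(JSObject* object, const String& name)
{
    return object->slots[object->structure->indexOfProperty.get(name)];
}

// Structure check: because structures are hash consed, "does this object have
// the {x, y} shape?" is a single load and pointer comparison.
bool hasStructure(JSObject* object, Structure* expected)
{
    return object->structure == expected;
}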

Figure 18. The JavaScriptCore object model.

JavaScriptCore object model. JavaScriptCore uses objects with a 64-bit header that includes a 32-bit structure ID and 32 bits worth of extra state for GC, type checks, and arrays. Figure 18 shows the object model. Named object properties may end up either in the inline slots or the out-of-line slots. Objects get some number of inline slots based on simple static analysis around the allocation site. If a property is added that doesn’t fit in the inline slots, we allocate a butterfly to store additional properties out-of-line. Accessing out-of-line properties in the butterfly costs one extra load.

Figure 19 shows an example object that only has two inline properties. This is the kind of object you would get if you used the object literal {f:5, g:6} or if you assigned to the f and g properties reasonably close to the allocation.

Figure 19. Example JavaScriptCore object together with its structure.

Simple inline caches. Let’s consider the code:

var v = o.f;

Let’s assume that all of the objects that flow into this have structure 42 like the object in Figure 19. Inline caching this property access is all about emitting code like the following:

if (o->structureID == 42)
    v = o->inlineStorage[0]
else
    v = slowGet(o, "f")

But how do we know that o will have structure 42? JavaScript does not give us this information statically. Inline caches get this information by filling it in once the code runs. There are a number of techniques for this, all of which come down to self-modifying code. Let’s look at how the LLInt and Baseline do it.

In the LLInt, the metadata for get_by_id contains a cached structure ID and a cached offset. The cached structure ID is initialized to an absurd value that no structure can have. The fast path of get_by_id loads the property at the cached offset if the object has the cached structure. Otherwise, we take a slow path that does the full lookup. If that full lookup is cacheable, it stores the structure ID and offset in the metadata.
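
Written as C++ for readability (the real code is offlineasm, and the helper names here are illustrative):

struct GetByIdMetadata {
    uint32_t cachedStructureID { 0 }; // initialized to an impossible structure ID
    unsigned cachedOffset { 0 };
};

JSValue llintGetById(JSObject* base, const Identifier& name, GetByIdMetadata& metadata)
{
    // Fast path: one compare and one load.
    if (base->structureID() == metadata.cachedStructureID)
        return base->loadPropertyAtOffset(metadata.cachedOffset);

    // Slow path: fully dynamic lookup, then cache it if we can.
    JSValue result = slowGet(base, name);
    if (isCacheableLookup(base, name)) {                      // illustrative predicate
        metadata.cachedStructureID = base->structureID();
        metadata.cachedOffset = offsetOfProperty(base, name); // illustrative helper
    }
    return result;
}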

The Baseline JIT does something more sophisticated. When emitting a get_by_id, it reserves a slab of machine code space that the inline caches will later fill in with real code. The only code in this slab initially is an unconditional jump to a slow path. The slow path does the fully dynamic lookup. If that is deemed cacheable, the reserved slab is replaced with code that does the right structure check and loads at the right offset. Here’s an example of a get_by_id initially compiled with Baseline:

0x46f8c30b9b0: mov 0x30(%rbp), %rax
0x46f8c30b9b4: test %rax, %r15
0x46f8c30b9b7: jnz 0x46f8c30ba2c
0x46f8c30b9bd: jmp 0x46f8c30ba2c
0x46f8c30b9c2: o16 nop %cs:0x200(%rax,%rax)
0x46f8c30b9d1: nop (%rax)
0x46f8c30b9d4: mov %rax, -0x38(%rbp)

The first thing that this code does is check that o (stored in %rax) is really an object (using a test and jnz). Then notice the unconditional jmp followed by two long nop instructions. This jump goes to the same slow path that we would have branched to if o was not an object. After the slow path runs, this is repatched to:

0x46f8c30b9b0: mov 0x30(%rbp), %rax
0x46f8c30b9b4: test %rax, %r15
0x46f8c30b9b7: jnz 0x46f8c30ba2c
0x46f8c30b9bd: cmp $0x125, (%rax)
0x46f8c30b9c3: jnz 0x46f8c30ba2c
0x46f8c30b9c9: mov 0x18(%rax), %rax
0x46f8c30b9cd: nop 0x200(%rax)
0x46f8c30b9d4: mov %rax, -0x38(%rbp)

Now, the is-object check is followed by a structure check (using cmp to check that the structure is 0x125) and a load at offset 0x18.

Inline caches as a profiling source. The metadata we use to maintain inline caches makes for a fantastic profiling source. Let’s look closely at what this means.

Figure 20. Timeline of using an inline cache at each JIT tier. Note that we end up having to generate code for this get_by_id six times in the best case that each tier compiles this only once.

Figure 20 shows a naive use of inline caches in a multi-tier engine, where the DFG JIT forgets everything that we learned from the Baseline inline cache and just compiles a blank inline cache. This is reasonably efficient and we fall back on this approach when the inline caches from the LLInt and Baseline tell us that there is unmanageable polymorphism. Before we go into how polymorphism is profiled, let’s look at how a speculative compiler really wants to handle simple monomorphic inline caches like the one in Figure 20, where we only see one structure (S1) and the code that the IC emits is trivial (load at offset 10 from %rax).

When the DFG frontend (shared by DFG and FTL) sees an operation like get_by_id that can be implemented with ICs, it reads the state of all ICs generated for that get_by_id. By “all ICs” we mean all ICs that are currently in the heap. This usually just means reading the LLInt and Baseline ICs, but if there exists a DFG or FTL function that generated an IC for this get_by_id then we will also read that IC. This can happen if a function gets compiled multiple times due to inlining — we may be compiling function bar that inlines a call to function foo and foo already got compiled with FTL and the FTL emitted an IC for our get_by_id.

If all ICs for a get_by_id concur that the operation is monomorphic and they tell us the structure to use, then the DFG frontend converts the get_by_id into inline code that does not get repatched. This is shown in Figure 21. Note that this simple get_by_id is lowered to two DFG operations: CheckStructure, which OSR exits if the given object does not have the required structure, and GetByOffset, which is just a load with known offset and field name.

Figure 21. Inlining a simple monomorphic inline cache in DFG and FTL.

CheckStructure and GetByOffset are understood precisely in DFG IR:

  • CheckStructure is a load to get the structure ID of an object and a branch to compare that structure ID to a constant. The compiler knows what structures are. After a CheckStructure, the compiler knows that it’s safe to execute loads to any of the properties that the structure says that the object has.
  • GetByOffset is a load from either an inline or out-of-line property of a JavaScript object. The compiler knows what kind of property is being loaded, what its offset is, and what the name of the property would have been.

The DFG knows all about how to model these operations and the dependency between them:

  • The DFG knows that neither operation causes a side effect, but that the CheckStructure represents a conditional side exit, and both operations read the heap.
  • The DFG knows that two CheckStructures on the same structure are redundant unless some operation between them could have changed object structure. The DFG knows a lot about how to optimize away redundant structure checks, even in cases where there is a function call between two of them (more on this later).
  • The DFG knows that two GetByOffsets that speak of the same property and object are loading from the same memory location. The DFG knows how to do alias analysis on those properties, so it can precisely know when a GetByOffset’s memory location got clobbered.
  • The DFG knows that if it wants to hoist a GetByOffset then it has to ensure that the corresponding CheckStructure gets hoisted first. It does this using abstract interpretation, so there is no need to have a dependency edge between these operations.
  • The DFG knows how to generate either machine code (in the DFG tier) or B3 IR (in the FTL tier) for CheckStructure and GetByOffset. In B3, CheckStructure becomes a Load, NotEqual, and Check, while GetByOffset usually just becomes a Load.
Figure 22. Inlining two monomorphic inline caches, for different properties on the same object, in DFG and FTL. The DFG and FTL are able to eliminate the CheckStructure for the second IC.

The biggest upshot of lowering ICs to CheckStructure and GetByOffset is the redundancy elimination. The most common redundancy we eliminate is multiple CheckStructures. Lots of code will do multiple loads from the same object, like:

var f = o.f;
var g = o.g;

With ICs, we would check the structure twice. Figure 22 shows what happens when the speculative compilers inline these ICs. We are left with just a single CheckStructure instead of two thanks to the fact that:

  • CheckStructure is an OSR speculation.
  • CheckStructure is not an IC. The compiler knows exactly what it does, so that it can model it, so that it can eliminate it.

Let’s pause to appreciate what this technique gives us so far. We started out with a language in which property accesses seem to need hashtable lookups. A o.f operation requires calling some procedure that is doing hashing and so forth. But by combining inline caches, structures, and speculative compilation we have landed on something where some o.f operations are nothing more than load-at-offset like they would have been in C++ or Java. But this assumes that the o.f operation was monomorphic. The rest of this section considers minimorphism, polymorphism, and polyvariance.

Minimorphism. Certain kinds of polymorphic accesses are easier to handle than others. Sometimes an access will see two or more structures but all of those structures have the property at the same offset. Other times an access will see multiple structures and those structures do not agree on the offset of the property. We say that an access is minimorphic if it sees more than one structure and all structures agree on the offset of the property.

Our inline caches handle all forms of polymorphism by generating a stub that switches on the structure. But in the DFG, minimorphic accesses are special because they still qualify for full inlining. Consider an access o.f that sees structures S1 and S2, and both agree that f is at offset 0. Then we would have:

CheckStructure(@o, S1, S2)
GetByOffset(@o, 0)

This minimorphic CheckStructure will OSR exit if @o has none of the listed structures. Our optimizations for CheckStructure generally work for both monomorphic and minimorphic variants. So, minimorphism usually doesn’t hurt performance much compared to monomorphism.

Polymorphism. But what if an access sees different structures, and those structures have the property at different offsets? Consider an access to o.f that sees structures S1 = {f, g}, S2 = {f, g, h}, and S3 = {g, f}. This would be a minimorphic access if it was just S1 or S2, but S3 has f at a different offset. In this case, the FTL will convert this to:

MultiGetByOffset(@o, [S1, S2] => 0, [S3] => 1)

in DFG IR and then lower it to something like:

if (o->structureID == S1 || o->structureID == S2)
    result = o->inlineStorage[0]
else
    result = o->inlineStorage[1]

in B3 IR. In fact, we would use B3’s Switch since that’s the canonical form for this code pattern in B3.

Note that we only do this optimization in the FTL. The reason is that we want polymorphic accesses to remain ICs in the DFG so that we can use them to collect refined profiling.

Figure 23. Polyvariant inlining of an inline cache. The FTL can inline the inline cache in foo-inlined-into-bar after DFG compiles bar and uses an IC to collect polyvariant profiling about the get_by_id.

Polyvariance. Polyvariance is when an analysis is able to reason about a function differently depending on where it is called from. We achieve this by inlining in the DFG tier and keeping polymorphic ICs as ICs. Consider the following example. Function foo has an access to o.f that is polymorphic and sees structures S1 = {f, g}, S2 = {f, g, h}, and S3 = {g, f}:

function foo(o)
{
    // o can have structure S1, S2, or S3.
    return o.f;
}

This function is small, so it will be inlined anytime our profiling tells us that we are calling it (or may be calling it, since call inlining supports inlining polymorphic calls). Say that we have another function bar that always passes objects with structure S1 = {f, g} to foo:

function bar(p)
{
    // p.g always happens to have structure S1.
    return foo(p.g);
}

Figure 23 shows what happens. When the DFG compiles bar (step 3), it will inline foo based on the profiling of its call opcode (in step 2). But it will leave foo’s get_by_id as an IC because foo’s Baseline version told us that it’s polymorphic (also step 2). But then, since the DFG’s IC for foo’s get_by_id is in the context of that call from bar, it only ever sees S1 (step 4). So, when the FTL compiles bar and inlines foo, it knows that this get_by_id can be inlined with a monomorphic structure check for just S1 (step 5).

Inline caches also support more exotic forms of property access, like loading from objects in the prototype chain, calling accessors, adding/replacing properties, and even deleting properties.

Inline caches, structures, and garbage collection. Inline caching results in objects that are allocated and referenced only for inline caching. Structures are the most notorious example of these kinds of objects. Structures are particularly problematic because they need strong references to both the object’s prototype and its global object. In some cases, a structure will only be reachable from some inline cache, that inline cache will never run again (but we can’t prove it), and there is a large global object only referenced by that structure. It can be difficult to determine if that means that the structure has to be deleted or not. If it should be deleted, then the inline cache must be reset. If any optimized code inlined that inline cache, then that code must be jettisoned and recompiled. Fortunately, our garbage collector allows us to describe this case precisely. Since the garbage collector runs to fixpoint, we simply add the constraint that the pointer from an inline cache to a structure only marks the structure if the structure’s global object and prototype are already marked. Otherwise, the pointer behaves like a weak pointer. So, an inline cache will only be reset if the only way to reach the structure is through inline caches and the corresponding global object and prototype are dead. This is an example of how our garbage collector is engineered to make speculation easy.
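
Sketched as pseudocode (the type and accessor names are illustrative; the real implementation is expressed as a marking constraint that the collector reruns until it reaches a fixpoint):

// Rerun until no new objects get marked. The IC's pointer to the structure
// behaves like a weak pointer unless everything the structure would keep
// alive is already live for other reasons.
void visitInlineCache(InlineCache& ic, GCVisitor& visitor)
{
    Structure* structure = ic.cachedStructure();
    if (!structure)
        return;
    if (visitor.isMarked(structure->globalObject()) && visitor.isMarked(structure->prototype()))
        visitor.mark(structure);
    // Otherwise leave it unmarked for now; if the structure stays unmarked at
    // the end of GC, the inline cache gets reset and dependent code jettisoned.
}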

To summarize, inline caching is an optimization employed by all of our tiers. In addition to making code run faster, inline caching is a high-precision profiling source that can tell us about the type cases that an operation saw. Combined with structures, inline caches allow us to turn dynamic property accesses into easy-to-optimize instructions.

Watchpoints

We allow inline caches and speculative compilers to set watchpoints on the heap. A watchpoint in JavaScriptCore is nothing more than a mechanism for registering for notification that something happened. Most watchpoints are engineered to trigger only the first time that something bad happens; after that, the watchpoint just remembers that the bad thing had ever happened. So, if an optimizing compiler wants to do something that is valid only if some bad thing never happened, and the bad thing has a watchpoint, the compiler just checks if the watchpoint is still valid (i.e. the bad thing hasn’t happened yet) and then associates its generated code with the watchpoint (so the code will only get installed if the watchpoint is still valid when the code is done getting compiled, and will be jettisoned as soon as the watchpoint is fired). The runtime allows for setting watchpoints on a large number of activities. The following stick out:

  • It’s possible to set a watchpoint on structures to get a notification whenever any object switches from that structure to another one. This only works for structures whose objects have never transitioned to any other structure. This is called a structure transition watchpoint. It establishes a structure as a leaf in the structure transition tree.
  • It’s possible to set a watchpoint on properties in a structure to get a notification whenever the property is overwritten. Overwriting a property is easy to detect because the first time this happens, it usually involves repatching a put_by_id inline cache so that it’s in the property replacement mode. This is called a property replacement watchpoint.
  • It’s possible to set a watchpoint on the mutability of global variables.
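
The core mechanism can be sketched like this (simplified; the real WatchpointSet also has to coordinate with concurrent compiler threads and the garbage collector):

struct Watchpoint {
    virtual void fire() = 0; // e.g. jettison optimized code that relied on this
    Watchpoint* next { nullptr };
};

struct WatchpointSet {
    bool stillValid { true };
    Watchpoint* head { nullptr };

    bool isStillValid() const { return stillValid; }

    // A compiler that wants to rely on "the bad thing never happens" registers here.
    void add(Watchpoint* watchpoint)
    {
        watchpoint->next = head;
        head = watchpoint;
    }

    // Called by the runtime the first time the bad thing happens.
    void fireAll()
    {
        stillValid = false;
        for (Watchpoint* w = head; w; w = w->next)
            w->fire();
        head = nullptr;
    }
};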

Putting these watchpoints together gives the speculative compiler the ability to constant-fold object properties that happen to be immutable. Let’s consider a simple example:

Math.pow(42, 2)

Here, Math is a global property lookup. The base object is known to the compiler: it’s the global object that the calling code belongs to. Then, Math.pow is a lookup of the pow property on the Math object. It’s extremely unlikely that the Math property of the global object or the pow property of the Math object had ever been overwritten. Both the global object and the Math object have structures that are unique to them (both because those structures have special magic since those are special objects and because those objects have what is usually a globally unique set of properties), which guarantees that they have leaf structures, so the structure transition watchpoint can be set. Therefore, except for pathological programs, the expression Math.pow is compiled to a constant by the speculative compiler. This makes lots of stuff fast:

  • It’s common to have named and scoped enumerations using objects and object properties, like TypeScript.NodeType.Error in the typescript compiler benchmark in JetStream 2. Watchpoints make those look like a constant to the speculative compiler.
  • Method calls like o.foo(things) are usually turned just into a structure check on o and a direct call. Once the structure is checked, watchpoints establish that the object’s prototype has a property called foo and that this property has some constant value.
  • Inline caches use watchpoints to remove some checks in their generated stubs.
  • The DFG can use watchpoints to remove redundant CheckStructures even when there is a side effect between them. If we set the structure transition watchpoint then we know that no effect can change the structure of any object that has this structure.
  • Watchpoints are used for lots of miscellaneous corner cases of JavaScript, like having a bad time.

To summarize, watchpoints let inline caches and the speculative compilers fold certain parts of the heap’s state to constants by getting a notification when things change.

Exit Flags

All of the profiling sources in our engine have a chance of getting things wrong. Profiling sources get things wrong because:

  • The program may change behavior between when we collected the profiling and when we speculated on it.
  • The profiling has some stochastic element and the program is getting unlucky, leading to wrong profiling.
  • The profiling source has a logic bug that makes it not able to see that something happened.
  • We neglected to implement a profiler for something and instead just speculated blind.

The first of these issues – behavior change over time – is inevitable and is sure to happen for some functions in any sufficiently large program. Big programs tend to experience phase changes, like some subroutine going from being called from one part of a larger library that uses one set of types, to being called from a different part with different types. Those things inevitably cause exits. The other three issues are all variants of the profiling being broken. We don’t want our profiling to be broken, but we’re only human. Recall that for speculation to have good EV, the probability of being right has to be about 1. So, it’s not enough to rely on profiling that was written by imperfect lifeforms. Exit flags are a check on the rest of the profiling and are there to ensure that we get things right eventually for all programs.

In JavaScriptCore, every OSR exit is tagged with an exit kind. When a DFG or FTL function exits enough times to get jettisoned, we record all of the exit kinds that happened along with the bytecode locations that semantically caused the exits (for example if we do a type check for add at bytecode #63 but then hoist the check so that it ends up exiting to bytecode #45, then we will blame #63 not #45). Whenever the DFG or FTL decide whether to perform a kind of speculation, they are expected to check whether there is an exit flag for that speculation at the bytecode that we’re compiling. Our exit flag checking discipline tends to be strictly better than our profiling discipline, and it’s way easier to get right — every phase of the DFG has fast access to exit flags.

Here’s an example of an actual OSR exit check in DFG:

speculationCheck(
    OutOfBounds, JSValueRegs(), 0,
    m_jit.branch32(
        MacroAssembler::AboveOrEqual,
        propertyReg,
        MacroAssembler::Address(storageReg, Butterfly::offsetOfPublicLength())));

Note that the first argument is OutOfBounds. That’s an example exit kind. Here’s another example, this time from the FTL:

speculate(NegativeZero, noValue(), nullptr, m_out.lessThan(left, m_out.int32Zero));

Again, the first argument is the exit kind. This time it’s NegativeZero. We have 26 exit kinds, most of which describe a type check condition (some are used for other uses of OSR, like exception handling).

We use the exit kinds by querying if an exit had happened at the bytecode location we are compiling when choosing whether to speculate. We typically use the presence of an exit flag as an excuse not to speculate at all for that bytecode. We effectively allow ourselves to overcompensate a bit. The exit flags are a check on the rest of the profiler. They are telling the compiler that the profiler had been wrong here before, and as such, shouldn’t be trusted anymore for this code location.
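
A sketch of that query (the method name and signature are approximations, not the exact JSC API):

// Before emitting a speculative integer add, the compiler asks whether an
// Overflow exit was ever recorded at this bytecode location by a previous,
// now-jettisoned DFG or FTL compilation of the same code.
bool canSpeculateInt32Add(Graph& graph, CodeOrigin origin)
{
    return !graph.hasExitSite(origin, Overflow);
}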

Summary of Profiling

JavaScriptCore’s profiling is designed to be cheap and useful. Our best profiling sources tend to either involve minimal instrumentation (like just setting a flag or storing a value to a known location) or be intertwined with optimizations (like inline caching). Our profilers gather lots of rich information and in some cases we even collect information redundantly. Our profiling is designed to help us avoid making speculative bets that turn out to be wrong even once.

Compilation and OSR

Now that we have covered bytecode, control, and profiling, we can get to the really fun part: how to build a great speculative optimizing compiler. We will discuss the OSR aspect of speculation in tandem with our descriptions of the two optimizing compilers.

This section is organized into three parts. First we give a quick and gentle introduction to DFG IR, the intermediate representation used by both the DFG and FTL tiers. Then we describe the DFG tier in detail, including how it handles OSR. Finally we describe how the FTL tier works.

DFG IR

The most important component of a powerful optimizing compiler is the IR. We want to have the best possible speculative optimizing compiler for JavaScript, so we have the following goals for our IR:

  • The IR has to describe all of the parts of the program that are interesting to the optimizer. Like other high quality optimizing IRs, DFG IR has good support for talking about data flow, aliasing, effects, control flow, and debug information. Additionally, it’s also good at talking about profiling data, speculation decisions, and OSR.
  • The IR has to be mutable. Anything that is possible to express when first lowering a program to the IR should also be expressible during some later optimization. We prefer that decisions made during lowering to the IR can be refined by optimizations later.
  • The IR has to have some validation support. It’s got to be possible to catch common mistakes in a validator instead of debugging generated code.
  • The IR has to be purpose-built. If there exists an optimization whose most comprehensive implementation requires a change to the IR or one of its core data structures, then we need to be able to make that change without asking anyone for permission.

Note that IR mutability is closely tied to how much it describes and how easy it is to validate. Any optimization that tries to transform one piece of code into a different, better, piece of code needs to be able to determine if the new code is a valid replacement for the old code. Generally, the more information the IR carries and the easier it is to validate, the easier it is to write the analyses that guard optimizations.

Let’s look at what the DFG IR looks like using a simple example:

function foo(a, b)
{
    return a + b;
}

This results in bytecode like:

[   0] enter             
[   1] get_scope         loc3
[   3] mov               loc4, loc3
[   6] check_traps       
[   7] add               loc6, arg1, arg2
[  12] ret               loc6

Note that only the last two lines (add and ret) are important. Let’s look at the DFG IR that we get from lowering those two bytecode instructions:

  23:  GetLocal(Untyped:@1, arg1(B<Int32>/FlushedInt32), R:Stack(6), bc#7)
  24:  GetLocal(Untyped:@2, arg2(C<BoolInt32>/FlushedInt32), R:Stack(7), bc#7)
  25:  ArithAdd(Int32:@23, Int32:@24, CheckOverflow, Exits, bc#7)
  26:  MovHint(Untyped:@25, loc6, W:SideState, ClobbersExit, bc#7, ExitInvalid)
  28:  Return(Untyped:@25, W:SideState, Exits, bc#12)

In this example, we’ve lowered the add opcode to four operations: two GetLocals to get the argument values from the stack (we load them lazily and this is the first operation that needs them), a speculative ArithAdd instruction, and a MovHint to tell the OSR part of the compiler about the ArithAdd. The ret opcode is just lowered to a Return.

In DFG jargon, the instructions are usually called nodes, but we use the terms node, instruction, and operation interchangeably. DFG nodes are simultaneously nodes in a data flow graph and instructions inside of a control flow graph, with semantics defined “as if” they executed in a particular order.

Figure 24. Explanation of an example ArithAdd DFG instruction.

Let’s consider the ArithAdd in greater detail (Figure 24). This instruction is interesting because it’s exactly the sort of thing that the DFG is designed to optimize: it represents a JavaScript operation that is dynamic and impure (it may call functions) but here we have inferred it to be free of side effects using the Int32: type speculations. These indicate that before doing anything else, this instruction will check that its inputs are Int32’s. Note that the type speculations of DFG instructions should be understood like function overloads. ArithAdd also allows for both operands to be double or other kinds of integer. It’s as if ArithAdd was a C++ function that had overloads that took a pair of integers, a pair of doubles, etc. It’s not possible to add any type speculation to any operand, since that may result in an instruction overload that isn’t supported.

Another interesting feature of this ArithAdd is that it knows exactly which bytecode instruction it originated from and where it will exit to. These are separate fields in the IR (the semantic and forExit origins) but when they are equal we dump them as one, bc#7 in the case of this instruction.

Any DFG node that may exit will have the Exits flag. Note that we set this flag conservatively. For example, the Return in our example has it set not because Return exits but because we haven’t found a need to make the exit analysis any more precise for that instruction.

Figure 25. Example data flow graph.

DFG IR can be simultaneously understood as a sequence of operations that should be performed as if in the given order and as a data flow graph with backwards pointers. The data flow graph view of our running example is shown in Figure 25. This view is useful since lots of optimizations are concerned with asking questions like: “what instructions produce the values consumed by this instruction?” These data flow edges are the main way that values move around in DFG IR. Also, representing programs this way makes it natural to add SSA form, which we do in the FTL.

Figure 26. DFG and FTL compiler architecture. The pass pipeline depicted above the dotted line is shared between the DFG and FTL compilers. Everything below the dotted line is specialized for DFG or FTL.

DFG IR, in both its non-SSA and SSA forms, makes up the bulk of the DFG and FTL compilers. As shown in Figure 26, both JITs share the same frontend for parsing bytecode and doing some optimizations. The difference is what happens after the DFG optimizer. In the DFG tier, we emit machine code directly. In the FTL tier, we convert to DFG SSA IR (which is almost identical to DFG IR but uses SSA to represent data flow), do more optimizations, and then lower through two additional optimizers (B3 and Assembly IR, also known as Air). The remaining sections talk about the DFG and FTL compilers. The section on the DFG compiler covers the parts of DFG and FTL that are common.

DFG Compiler

The point of the DFG compiler is to remove lots of type checks quickly. Fast compilation is the DFG feature that differentiates it from the FTL. To get fast compilation, the DFG lacks SSA, can only do very limited code motion, and uses block-local versions of most optimizations (common subexpression elimination, register allocation, etc). The DFG has two focus areas where it does a great job despite compiling quickly: how it handles OSR and how it uses static analysis.

This section explains the DFG by going into these three concepts in greater detail:

  • OSR exit as a first-class concept in the compiler.
  • Static analysis as the main driver of optimization.
  • Fast compilation so that we get the benefits of optimization as soon as possible.

OSR Exit

OSR is all about flattening control flow by making failing checks exit sideways. OSR is a difficult optimization to get right. It’s especially difficult to reason about at a conceptual level. This section tries to demystify OSR exit. We’re going to explain the DFG compiler’s approach to OSR, which includes both parts that are specific to the DFG tier and parts that are shared with the FTL. The FTL section explains extensions to this approach that we use to do more aggressive optimizations.

Our discussion proceeds as follows. First we use a high-level example to illustrate what OSR exit is all about. Then we describe what OSR exit means at the machine level, which will take us into the details of how optimizing compilers handle OSR. We will show a simple OSR exit IR idea based on stackmaps to give a sense of what we’re trying to achieve and then we describe how DFG IR compresses stackmaps. Finally we talk about how OSR exit is integrated with watchpoints and invalidation.

High-level OSR example. To start to demystify DFG exit, let’s think of it as if it was an optimization we were doing to a C program. Say we had written code like:

int foo(int* ptr)
{
    int w, x, y, z;
    w = ... // lots of stuff
    x = is_ok(ptr) ? *ptr : slow_path(ptr);
    y = ... // lots of stuff
    z = is_ok(ptr) ? *ptr : slow_path(ptr);
    return w + x + y + z;
}

Let’s say we wanted to optimize out the second is_ok check. We could do that by duplicating all of the code after the first is_ok check, and having one copy statically assume that is_ok is true while another copy either assumes it’s false or makes no assumptions. This might make the fast path look like:

int foo(int* ptr)
{
    int w, x, y, z;
    w = ... // lots of stuff
    if (!is_ok(ptr))
        return foo_base1(ptr, w);
    x = *ptr;
    y = ... // lots of stuff
    z = *ptr;
    return w + x + y + z;
}

Where foo_base1 is the original foo function after the first is_ok check. It takes the live state at that point as an argument and looks like this:

int foo_base1(int* ptr, int w)
{
    int x, y, z;
    x = is_ok(ptr) ? *ptr : slow_path(ptr);
    y = ... // lots of stuff
    z = is_ok(ptr) ? *ptr : slow_path(ptr);
    return w + x + y + z;
}

What we’ve done here is OSR exit. We’re optimizing control flow on the fast path (removing one is_ok check) by exiting (tail-calling foo_base1) if !is_ok. OSR exit requires:

  • Somewhere to exit, like foo_base1 in this case. It should be a thing that can complete execution of the current function without getting stuck on the same speculation.
  • The live state at exit, like ptr and w in this case. Without that, the exit target can’t pick up where we left off.

That’s OSR exit at a high level. We’re trying to allow an optimizing compiler to emit checks that exit out of the function on failure so that the compiler can assume that the same check won’t be needed later.

OSR at the machine level. Now let’s look at what OSR exit looks like at a lower level. Figure 27 shows an example of OSR at a particular bytecode index.

Figure 27. OSR exit at the machine level for an example bytecode instruction.

OSR is all about replacing the current stack frame and register state, which correspond to some bytecode index in the optimizing tier, with a different frame and register state, which correspond to the same point in the profiling tier. This is all about shuffling live data from one format to another and jumping to the right place.

Knowing where to jump to is easy: each DFG node (aka instruction or operation) has a forExit, or just exit, origin that tells us which bytecode location to exit to. This may even be a bytecode stack in case of inlining.

The live data takes a bit more effort. We have to know what the set of live data is and what its format is in both the profiling and optimizing tiers. It turns out that knowing what the set of live data is and how to represent it for the profiling tiers is easy, but extracting that data from the optimizing tier is hard.

First let’s consider what’s live. The example in Figure 27 says that we’re exiting at an add and it has loc3, loc4, and loc8 live before. We can solve for what’s live at any bytecode instruction by doing a liveness analysis. JavaScriptCore has an optimized bytecode liveness analysis for this purpose.

Note that the frame layout in the profiling tier is an orderly representation of the bytecode state. In particular, locN just means framePointer - 8 * N and argN just means framePointer + FRAME_HEADER_SIZE + 8 * N, where FRAME_HEADER_SIZE is usually 40. The only difference between frame layouts between functions in the profiling tier is the frame size, which is determined by a constant in each bytecode function. Given the frame pointer and the bytecode virtual register name, it’s always possible to find out where on the stack the profiling tiers would store that variable. This makes it easy to figure out how to convert any bytecode live state to what the Baseline JIT or LLInt would expect.
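
That arithmetic can be written down directly. Here’s a minimal sketch, assuming 8-byte slots and the usual 40-byte header mentioned above; the helper names are made up:

#include <cstdint>

static constexpr intptr_t FRAME_HEADER_SIZE = 40; // the usual value, per the text

// Hypothetical helpers: where the profiling tiers keep a bytecode virtual
// register, given only the frame pointer and the register's index.
inline uint8_t* addressOfLocal(uint8_t* framePointer, intptr_t n)
{
    return framePointer - 8 * n; // locN
}

inline uint8_t* addressOfArgument(uint8_t* framePointer, intptr_t n)
{
    return framePointer + FRAME_HEADER_SIZE + 8 * n; // argN
}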

The hard part is the optimizing tier’s state. The optimizing compiler might:

  • Allocate the stack in any order. Even if a variable is on the stack, it may be anywhere.
  • Register-allocate a variable. In that case there may not be any location on the stack that contains the value of that variable.
  • Constant-fold a variable. In that case there may not be any location on the stack or in the register file that contains the value of that variable.
  • Represent a variable’s value in some creative way. For example, your program might have had a statement like x = y + z but the compiler chose to never actually emit the add except lazily at points of use. This can easily happen because of pattern-matching instruction selection on x86 or ARM, where some instructions (like memory accesses) can do some adds for free as part of address computation. We do an even more aggressive version of this for object allocations: some program variable semantically points to an object, but because our compiler is smart, we never actually allocated any object and the object’s fields may be register-allocated, constant-folded, or represented creatively.

We want to allow the optimizing compiler to do things like this, since we want OSR exit to be an enabler of optimization rather than an inhibitor. This turns out to be tricky: how do we let the optimizing compiler do all of the optimizations that it likes to do while still being able to tell us how to recover the bytecode state?

The trick to extracting the optimized-state-to-bytecode-state shuffle from the optimizing compiler is to leverage the original bytecode→IR conversion. The main difference between an SSA-like IR (like DFG IR) and bytecode is that it represents data flow relationships instead of variables. While bytecode says add x, y, z, DFG IR would have an Add node that points to the nodes that produced y and z (like in Figure 25). The conversion from bytecode to DFG IR looks like this pseudocode:

case op_add: {
    VirtualRegister result = instruction->result();
    VirtualRegister left   = instruction->left();
    VirtualRegister right  = instruction->right();

    stackMap[result] = createAdd(
        stackMap[left], stackMap[right]);
    break;
}

This uses a standard technique for converting variable-based IRs to data-flow-based IRs: the converter maintains a mapping from variables in the source IR to data flow nodes in the target IR. We’re going to call this the stackMap for now. Each bytecode instruction is handled by modeling the bytecode’s data flow: we load the left and right operands from the stackMap, which gives us the DFG nodes for those locals’ values. Then we create an ArithAdd node and store it into the result local in the stackMap to model the fact that the bytecode wanted to store the result to that local. Figure 28 shows the before-and-after of running this on the add bytecode in our running example.

Figure 28. Example of stackMap before and after running the SSA conversion on add at bc#42 along with an illustration of the data flow graph around the resulting ArithAdd.

The stackMap, pruned to bytecode liveness as we are doing in these examples, represents the set of live state that would be needed to be recovered at any point in bytecode execution. It tells us, for each live bytecode local, what DFG node to use to recover the value of that local. A simple way to support OSR would be to give each DFG node that could possibly exit a data flow edge to each node in the liveness-pruned stackMap.

This isn’t what the DFG actually does; DFG nodes do not have data flow edges for the stackmap. Doing literally that would be too costly in terms of memory usage since basically every DFG node may exit and stackmaps have O(live state) entries. The DFG’s actual approach is based on delta-compression of stackmaps. But it’s worth considering exactly how this uncompressed stackmap approach would work because it forms part of the FTL’s strategy and it gives a good mental model for understanding the DFG’s more sophisticated approach. So, we will spend some time describing the DFG IR as if it really did have stackmaps. Then we will show how the stackmap is expressed using delta compression.

OSR exit with uncompressed stackmaps. Imagine that DFG nodes really had extra operands for the stackmap. Then we would have an ArithAdd like the following, assuming that bc#42 is the exit origin and that loc3, loc4, and loc8 are live, as they are in Figures 27 and 28:

c: ArithAdd(@a, @b, loc3->@s, loc4->@a, loc8->@b, bc#42)

In this kind of IR, we’d let the first two operands of ArithAdd behave the expected way (they are the actual operands to the add), and we’d treat all of the other operands as the stackmap. The exit origin, bc#42, is a control flow label. Together, this tells the ArithAdd where to exit (bc#42) and the stackmap (@s, @a, and @b). The compiler treats the ArithAdd, and the stackmap operands, as if the ArithAdd had a side exit from the function the compiler was compiling.

One way to think about it is in terms of C pseudocode. We are saying that the semantics of ArithAdd and any other instruction that may exit are as if they did the following before any of their effects:

if (some conditions)
    return OSRExit(bc#42, {loc3: @s, loc4: @a, loc8: @b});

Where the return statement is an early return from the compiled function. So, this terminates the execution of the compiled function by tail-calling (jumping to) the OSRExit. That operation will transfer control to bc#42 and pass it the given stackmap.

Figure 29. Example of control flow in a compiler with OSR exit. OSR exit means having an additional implicit set of control flow edges that come out of almost every instruction and represent a side exit from the control flow graph.

This is easy to model in a compiler. We don’t allocate any kind of control flow constructs to represent the condition check and side exit but we assume it to exist implicitly when analyzing ArithAdd or any other node that may exit. Note that in JavaScript, basically every instruction is going to possibly exit, and JavaScriptCore’s may exit analysis defaults to true for most operations. Figure 29 illustrates what this looks like. We are going to have three kinds of control flow edges instead of the usual two:

  1. The normal control flow edges between basic blocks. This is what you normally think of as “control flow”. These edges are explicitly represented in the IR: there is an actual data structure (usually a vector of successors and a vector of predecessors) that each block uses to tell what control flow edges it participates in.
  2. The implicit fall-through control flow for instructions within blocks. This is standard for compilers with basic blocks.
  3. A new kind of control flow edge due to OSR, which goes from instructions in blocks to OSR exit landing sites. This means changing the definition of basic blocks slightly. Normally the only successors of basic blocks are the ones in the control flow graph. Our basic blocks have a bunch of OSR exit successors as well. Those successors don’t exist in the control flow graph, but we have names for them thanks to the exit origins found in the exiting instructions. The edges to those exit origins exit out of the middle of blocks, so they may terminate the execution of blocks before the block terminal.

The OSR landing site is understood by the compiler as having the following behaviors:

  • It ends execution of this function in DFG IR. This is key, since it means that there is no merge point in our control flow graph that has to consider the consequences of exit.
  • It possibly reads and writes the whole world. The DFG has to care about the reads (since they may observe whatever happened just before the exit) but not the writes (since they affect execution after execution exited DFG).
  • It reads some set of values, namely those passed as the stackmap.

This understanding is abstract, so the compiler will just assume the worst case (after exit every location in memory is read and written and all of the bits in all of the values in the stackmap are etched into stone).

This approach is great because it allows precise reconstruction of baseline state when compiling OSR exit and it mostly doesn’t inhibit optimization because it “only” involves adding a new kind of implicit control flow edge to the control flow graph.

This approach allows for simple reconstruction of state at exit because the backend that compiles the DFG nodes would have treated the stackmap data flow edges (things like loc3->@s in our example) the same way it would have treated all other edges. So, at the ArithAdd, the backend would know which registers, stack slots, or constant values to use to materialize the stackmap values. It would know how to do this for the same reason that it would know how to materialize the two actual add operands.

If we survey the most common optimizations that we want the compiler to do, we find that only one major optimization is severely inhibited by this approach to OSR exit. Let’s first review the optimizations this doesn’t break. It’s still possible to perform common subexpression elimination (CSE) on ArithAdd. It’s still possible to hoist it out of loops, though if we do that then we have to edit the exit metadata (the exit destination and stackmap will have to be overwritten to be whatever they are at the loop pre-header). It’s still possible to model the ArithAdd to be pure in lots of the ways that matter, like that if there are two loads, one before and one after the ArithAdd, then we can assume them to be redundant. The ArithAdd could only cause effects on the exit path, in which case the second load doesn’t matter. It’s still possible to eliminate the ArithAdd if it’s unreachable.

The only thing we cannot easily do is what compilers call dead code elimination, i.e. the elimination of instructions if their results are not used. Note that the compiler terminology is confusing here. Outside the compiler field we use the term dead code to mean something that compilers call unreachable code. Code is unreachable if control flow doesn’t reach it and so it doesn’t execute. Outside the compiler field, we would say that such code is dead. It’s important that compilers be able to eliminate unreachable code. Happily, our approach to OSR has no impact on unreachable code elimination. What compilers call dead code is code that is reached by control flow (so live in the not-compiler sense) but that produces a result that no subsequent code uses. Here’s an example of dead code in the compiler sense:

int tmp = a + b;
// nobody uses tmp.

Dead code elimination (DCE) is the part of a compiler that removes this kind of code. Dead code elimination doesn’t quite work for the ArithAdd because:

  • ArithAdd’s speculation checks must be assumed live even if the result of the add is unused. We may do some optimization to a later check because we find that it is subsumed by checks done by this ArithAdd. That’s a pretty fundamental optimization that we do for OSR checks and it’s the reason why OSR ultimately flattens control flow. But we don’t bother recording whenever this ArithAdd’s check is used to unlock a later optimization, so we have to assume that some later operation is already depending on the ArithAdd doing all of its checks. This means that if the result of some operation A is used by a dead operation B, then B still has to do whatever checks it was doing on its inputs, which keeps A alive even though B is dead. This is particularly devastating for ArithAdd, since ArithAdd usually does an overflow check. You have to do the add to check overflow. So, ArithAdd is never really dead. Consider the alternative: if we did not consider the ArithAdd’s overflow check’s effect on abstract state, then we wouldn’t be able to do our range analysis, which uses the information inferred from overflow checks to remove array bounds checks and vice versa.
  • The ArithAdd is almost sure to end up in the stackmap of some later operation, as is basically every node in the DFG program, unless the node represents something that was dead in bytecode. Being dead in bytecode is particularly unlikely because in bytecode we must assume that everything is polymorphic and possibly effectful. Then the add is really not dead: a generic bytecode add might invoke valueOf or toString, so it could amount to a loop with function calls, after all.

The DFG and FTL still do DCE, but it’s hard and usually only worth the effort for the most expensive constructs. We support decaying an operation just to its checks, for those rare cases where we can prove that the result is not used. We also support sinking to OSR, where an operation is replaced by a phantom version of itself that exists only to tell OSR how to perform the operation for us. We mainly use this complex feature for eliminating object allocations.

To summarize the effect on optimizations: we can still do most of the optimizations. The optimization most severely impacted is DCE, but even there, we have found ways to make it work for the most important cases.

The only real downside of this simple approach is repetition: almost every DFG operation may exit and the state at exit may easily have tens or hundreds of variables, especially if we have done significant inlining. Storing the stackmap in each DFG node would create a case of O(n²) explosion in memory usage and processing time within the compiler. Note that the fact that this explosion happens is somewhat of a JavaScript-specific problem, since JavaScript is unusual in the sheer number of speculations we have to make per operation (even simple ones like add or get_by_id). If the speculations were something we did seldom, like in Java where they are mostly used for virtual calls, then the simple approach would be fine.

Stackmap compression in DFG IR. Our solution to the size explosion of repeated stackmaps is to use a delta encoding. The stackmaps don’t change much. In our running example, the add just kills loc8 and defines loc7. The kill can be discovered by analyzing bytecode, so there’s no need to record it. All we have to record about this operation is that it defines loc7 to be the ArithAdd node.

We use an operation called MovHint as our delta encoding. It tells which bytecode variable is defined by which DFG node. For example, let’s look at the MovHint we would emit for the add in Figure 28:

c: ArithAdd(@a, @b, bc#42)
   MovHint(@c, loc7, bc#42)

We need to put some care into how we represent MovHints so that they are easy to preserve and modify. Our approach is two-fold:

  • We treat MovHint as a store effect.
  • We explicitly label the points in the IR where we expect it to be valid to exit based on the state constructed out of the MovHint deltas.

Let’s first look at how we use the idea of store effects to teach the compiler about MovHint. Imagine a hypothetical DFG IR interpreter and how it would do OSR exit. The key idea is that in that interpreter, the state of the DFG program comprises not just the mapping from DFG nodes to their values, but also an OSR exit state buffer containing values indexed by bytecode variable name. That OSR exit state buffer contains exactly the stack frame that the profiling tiers would use. MovHint’s interpreter semantics are to store the value of its operand into some slot in the OSR exit state buffer. This way, the DFG interpreter is able to always maintain an up-to-date bytecode stack frame in tandem with the optimized representation of program state. Although no such interpreter exists, we make sure that the way we compile MovHint produces something with semantics consistent with what this interpreter would have done.

MovHint is not compiled to a store. But any phase operating on MovHints or encountering MovHints just needs to understand it as a store to some abstract location. The fact that it’s a store means that it’s not dead code. The fact that it’s a store means that it may need to be ordered with other stores or loads. Lots of desirable properties we need for soundly preserving MovHints across compiler optimizations fall out naturally from the fact that we tell all the phases that it’s just a store.

The compiler emits zero code for MovHint. Instead, we use a reaching defs analysis of MovHints combined with a bytecode liveness analysis to rebuild the stackmaps that we would have had if each node carried a stackmap. We perform this analysis in the backend and as part of any optimization that needs to know what OSR is doing. In the DFG tier, the reaching defs analysis happens lazily (when the OSR exit actually occurs, which could be long after the DFG compiled the code), which ensures that the DFG never experiences the O(n²) blow-up of stackmaps. OSR exit analysis is not magical: in the “it’s just a store” model of MovHint, this analysis reduces to load elimination.
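
To give a feel for what that reconstruction looks like, here’s a minimal block-local sketch of the reaching defs walk, using simplified stand-in types rather than the real DFG data structures:

#include <cstddef>
#include <map>
#include <vector>

// Simplified stand-ins for illustration; not the real DFG types.
struct Node {
    bool isMovHint = false;
    int local = 0;           // bytecode variable defined by the MovHint
    Node* operand = nullptr; // DFG node whose value the MovHint records
};

// Walk backward from the exiting node to the start of its block, keeping the
// most recent MovHint for each bytecode local, then prune to what the bytecode
// liveness analysis says is live at the exit origin. (A full version would
// also consult predecessor blocks.)
std::map<int, Node*> reconstructStackmap(
    const std::vector<Node*>& block, size_t exitIndex,
    const std::vector<int>& liveLocals)
{
    std::map<int, Node*> reachingDefs;
    for (size_t i = exitIndex; i-- > 0;) {
        Node* node = block[i];
        if (node->isMovHint && !reachingDefs.count(node->local))
            reachingDefs[node->local] = node->operand;
    }
    std::map<int, Node*> stackmap;
    for (int local : liveLocals) {
        auto it = reachingDefs.find(local);
        if (it != reachingDefs.end())
            stackmap[local] = it->second;
    }
    return stackmap;
}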

DFG IR’s approach to OSR means that OSR exit is possible at some points in DFG IR and not at others. Consider some examples:

  • A bytecode instruction may define multiple bytecode variables. When lowered to DFG IR, we would have two or more MovHints. It’s not possible to have an exit between those MovHints, since the OSR exit state is only partly updated at that point.
  • It’s not possible to exit after a DFG operation that does an observable effect (like storing to a JS object property) but before its corresponding MovHint. If we exit to the current exit origin, we’ll execute the effect again (which is wrong), but if we exit to the next exit origin, we’ll neglect to store the result into the right bytecode variable.

We need to make it easy for DFG transformations to know if it’s legal to insert operations that may exit at any point in the code. For example, we may want to write instrumentation that adds a check before every use of @x. If that use is a MovHint, then we need to know that it may not be OK to add that check right before that MovHint. Our approach to this is based on the observation that the lowering of a bytecode instruction produces two phases of execution in DFG IR of that instruction:

  • The speculation phase: at the start of execution of a bytecode, it’s both necessary and possible to speculate. It’s necessary to speculate since those speculations guard the optimizations that we do in the subsequent DFG nodes for that bytecode instruction. It’s possible to speculate because we haven’t done any of the instruction’s effects, so we can safely exit to the start of that bytecode instruction.
  • The effects phase: as soon as we perform any effect, we are no longer able to do any more speculations. That effect could be an actual effect (like storing to a property or making a call) or an OSR effect (like MovHint).

To help validate this, all nodes in DFG IR have an exitOK flag that they use to record whether they think that they are in the speculative phase (exitOK is true) or if they think that they might be in the effects phase (exitOK is false). It’s fine to say that exitOK is false if we’re not sure, but to say exitOK is true, we have to be completely sure. The IR validation checks that exitOK must become false after operations that do effects, that it becomes true again only at prescribed points (like a change in exit origin suggesting that we’ve ended the effects phase of one instruction and begun the speculation phase of the next one), and that no node that may exit has exitOK set to false. This validator helps prevent errors, like when dealing with bytecode operations that can be lowered to multiple effectful DFG nodes. One example is when put_by_id (i.e. something like o.f = v) is inferred to be a transition (the property f doesn’t exist on o so we need to add it), which results in two effects:

  • Storing a value v into the memory location for property o.f.
  • Changing o’s structure to indicate that it now has an f.

The DFG IR for this will look something like:

CheckStructure(@o, S1)
PutByOffset(@o, @v, f)
PutStructure(@o, S2, ExitInvalid)

Note that PutStructure will be flagged with ExitInvalid, which is the way we say that exitOK is false in IR dumps. Failing to set exitOK to false for PutStructure would cause a validation error since PutByOffset (right before it) is an effect. This prevents us from making mistakes like replacing all uses of @o with some operation that could speculate, like:

a: FooBar(@o, Exits)
   CheckStructure(@a, S1)
b: FooBar(@o, Exits)
   PutByOffset(@b, @v, f)
c: FooBar(@o, Exits)
   PutStructure(@c, S2, ExitInvalid)

In this example, we’ve used some new FooBar operation, which may exit, as a filter on @o. It may seem absurd to instrument code this way, but it is a goal of DFG IR to:

  • Allow replacing uses of nodes with uses of other nodes that produce an equivalent value. Let’s assume that FooBar is an identity that also does some checks that may exit.
  • Allow inserting new nodes anywhere.

Therefore, the only bug here is that @c is right after the PutByOffset. The validator will complain that it is not marked ExitInvalid. It should be marked ExitInvalid because the previous node (PutByOffset) has an effect. But if you add ExitInvalid to @c, then the validator will complain that a node may exit with ExitInvalid. Any phase that tries to insert such a FooBar would have all the API it needs to realize that it will run into these failures. For example, it could ask the node that it’s inserting itself in front of (the PutStructure) whether it has ExitInvalid. Since it is ExitInvalid, we could do either of these things instead of inserting @c just before the PutStructure:

  1. We could use some other node that does almost what FooBar does but without the exit.
  2. We could insert @c earlier, so it can still exit.

Let’s look at what the second option would look like:

a: FooBar(@o, Exits)
   CheckStructure(@a, S1)
b: FooBar(@o, Exits)
c: FooBar(@o, Exits)
   PutByOffset(@b, @v, f)
   PutStructure(@c, S2, ExitInvalid)

Usually this is all it takes to deal with regions of code with !exitOK.
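
To make the exitOK rules concrete, here’s a minimal sketch of the kind of block-local check a validator could perform. It uses simplified stand-in fields and assumes that exiting is allowed at the top of the block; the real validator is more thorough:

#include <cassert>
#include <vector>

struct NodeInfo {
    int exitOrigin;  // bytecode index this node would exit to
    bool exitOK;     // the node's claim that exiting is still allowed here
    bool mayExit;    // could this node OSR exit?
    bool hasEffects; // does it write state (including OSR state via MovHint)?
};

void validateExitOK(const std::vector<NodeInfo>& block)
{
    if (block.empty())
        return;
    bool exitOK = true; // assume the block starts in a speculation phase
    int origin = block[0].exitOrigin;
    for (const NodeInfo& node : block) {
        if (node.exitOrigin != origin) {
            // New exit origin: the previous instruction's effects phase is
            // over, so it becomes legal to exit again.
            exitOK = true;
            origin = node.exitOrigin;
        }
        assert(!node.exitOK || exitOK);       // may under-claim, never over-claim
        assert(!node.mayExit || node.exitOK); // nodes that may exit need exitOK
        if (node.hasEffects)
            exitOK = false; // after any effect, exits are invalid
    }
}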

Note that in cases where something like FooBar absolutely needs to do a check after an effect, DFG IR does support exiting into the middle of a bytecode instruction. In some cases, we have no choice but to use that feature. This involves introducing extra non-bytecode state that can be passed down OSR exit, issuing OSR exit state updates before/after effects, using an exit origin that indicates that we’re exiting to some checkpoint in the middle of a bytecode instruction’s execution, and implementing a way to execute a bytecode starting at a checkpoint during OSR exit. It’s definitely possible, but not the sort of thing we want to have to do every time that some DFG node needs to do an effect. For this reason, canonical DFG IR use implies having !exitOK phases (aka effect phases) during some bytecode instructions’ execution.

Watchpoints and Invalidation. So far we have considered OSR exit for checks that the compiler emits. But the DFG compiler is also allowed to speculate by setting watchpoints in the JavaScript heap. If it finds something desirable — like that Math.sqrt points to the sqrt intrinsic function — it can often incorporate it into optimization without emitting checks. All that is needed is for the compiler to set a watchpoint on what it wants to prove (that the Math and sqrt won’t change). When the watchpoint fires, we want to invalidate the compiled code. That means making it so that the code never runs again:

  • no new calls to that function go to the optimized version and
  • all returns into that optimized function are redirected to go to baseline code instead.

Ensuring that new calls avoid optimized code is easy: we just patch all calls to the function to call the profiled code (Baseline, if available, or LLInt) instead. Handling returns is the interesting part.

One approach to handling invalidation is to walk the stack to find all returns to the invalidated code, and repoint those returns to an OSR exit. This would be troublesome for us due to our use of effects phases: it’s possible for multiple effects to happen in a row in a phase of DFG IR execution where it is not possible to exit. So, the DFG approach to invalidation involves letting the remaining effects of the current bytecode instruction finish executing in optimized code and then triggering an OSR exit right before the start of the next bytecode instruction.

Figure 30. How OSR exit and invalidation might work for hypothetical bytecodes.

Invalidation in DFG IR is enabled by the InvalidationPoint instruction, which is automatically inserted by the DFG frontend at the start of every exit origin that is preceded by effects that could cause a watchpoint to fire. InvalidationPoint is modeled as if it was a conditional OSR exit, and is given an OSR exit jump label as if there was a branch to link it to. But, InvalidationPoint emits no code. Instead, it records the location in the machine code where the InvalidationPoint would have been emitted. When a function is invalidated, all of those labels are overwritten with unconditional jumps to the OSR exit.

Figure 30 shows how OSR exit concepts like speculation and effect phases combine with InvalidationPoint for three hypothetical bytecode instructions. We make up intentionally absurd instructions because we want to show the range of possibilities. Let’s consider wat in detail. The first DFG IR node for wat is an InvalidationPoint, automatically inserted because the previous bytecode (foo) had an effect. Then wat does a CheckArray, which may exit but has no effects. So, the next DFG node, Wat, is still in the speculation phase. Wat is in a sort of perfect position in DFG IR: it is allowed to perform speculations and effects. It can perform speculations because no previous node for wat’s exit origin has performed effects. It can also perform effects, but then the nodes after it (Stuff and Derp) cannot speculate anymore. But, they can perform more effects. Since wat has effects, an InvalidationPoint is immediately inserted at the start of the next bytecode (bar). Note that in this example, Foo, Wat, and StartBar are all in the perfect position (they can exit and have effects). Since Stuff, Derp, and FinishBar are in the effects region, the compiler will assert if they try to speculate.

Note that InvalidationPoint makes code layout tricky. On x86, the unconditional jump used by invalidation is five bytes. So, we must ensure that there are no other jump labels in the five bytes after an invalidation label. Otherwise, it would be possible for invalidation to cause one of those labels to point into the middle of a 5-byte invalidation jump. We solve this by adding nop padding to create at least a 5-byte gap between a label used for invalidation and any other kind of label.
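
Here’s an illustration of the mechanism, not the actual JavaScriptCore implementation (which goes through its assembler and repatching abstractions and also handles details like instruction cache flushes):

#include <cstddef>
#include <cstdint>
#include <vector>

constexpr size_t jumpSize = 5; // x86 jmp rel32: 1-byte opcode + 4-byte offset

struct InvalidationLabel {
    uint8_t* location;   // where nop padding reserved space for the jump
    uint8_t* exitTarget; // the OSR exit handler for this InvalidationPoint
};

void invalidate(const std::vector<InvalidationLabel>& labels)
{
    for (const InvalidationLabel& label : labels) {
        // rel32 is relative to the end of the jump instruction.
        intptr_t delta = label.exitTarget - (label.location + jumpSize);
        label.location[0] = 0xE9; // jmp rel32
        for (size_t i = 0; i < 4; ++i)
            label.location[1 + i] = uint8_t(static_cast<uintptr_t>(delta) >> (8 * i));
    }
}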

To summarize, DFG IR has extensive support for OSR exit. We have a compact delta encoding of changes to OSR exit state. Exit destinations are encoded as an exit origin field in every DFG node. OSR exit due to invalidation is handled by automatic InvalidationPoint insertion.

Static Analysis

The DFG uses lots of static analysis to complement how it does speculation. This section covers three static analyses in the DFG that have particularly high impact:

  • We use prediction propagation to fill in predicted types for all values based on value profiling of some values. This helps us figure out where to speculate on type.
  • We use the abstract interpreter (or just AI for short in JavaScriptCore jargon) to find redundant OSR speculations. This helps us emit fewer OSR checks. Both the DFG and FTL include multiple optimization passes in their pipelines that can find and remove redundant checks but the abstract interpreter is the most powerful one. The abstract interpreter is the DFG tier’s primary optimization and it is reused with small enhancements in the FTL.
  • We use clobberize to get aliasing information about DFG operations. Given a DFG instruction, clobberize can describe the aliasing properties. In almost all cases that description is O(1) in time and space. That description implicitly describes a rich dependency graph.

Both the prediction propagator and the abstract interpreter work by forward-propagating type information. They’re both built on the principles of abstract interpretation. It’s useful to understand at least some of that theory, so let’s do a tiny review. Abstract interpreters are like normal interpreters, except that they describe program state abstractly rather than considering exact values. A classic example due to Kildall involves just remembering which variables have known constant values and forgetting any variable that may have more than one value. Abstract interpreters are run to fixpoint: we keep executing every instruction until we no longer observe any changes. We can execute forward (like Kildall) or backward (like liveness analysis). We can either have sets that shrink as we learn new things (like Kildall, where variables get removed if we learn that they may have more than one value) or we can have sets that grow (like liveness analysis, where we keep adding variables to the live set).
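
As a tiny illustration of the Kildall-style analysis, here’s the join that such an analysis performs at a control flow merge (illustrative types; a full analysis also applies a per-instruction transfer function and keeps iterating until the states stop changing):

#include <map>
#include <string>

// Abstract state: variable name -> known constant value.
// A missing entry means the variable may have more than one value.
using AbstractState = std::map<std::string, int>;

// At a merge point, keep a variable only if both incoming states agree on its
// constant; otherwise forget it (the sets shrink as we learn more).
AbstractState join(const AbstractState& a, const AbstractState& b)
{
    AbstractState result;
    for (const auto& [variable, value] : a) {
        auto it = b.find(variable);
        if (it != b.end() && it->second == value)
            result[variable] = value;
    }
    return result;
}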

Now let’s go into more details about the two abstract interpreters and the alias analysis.

Prediction propagation. The prediction propagator’s abstract state comprises variable to speculated type (Figure 13) mappings. The speculated type is a set of fundamental types. The sets tell which types a value is predicted to have. The prediction propagator is not flow sensitive; it has one copy of the abstract state for all program statements. So, each execution of a statement considers the whole set of input types (even from program statements that can’t reach us) and joins the result with the speculated type of the result variable. Note that the input to the prediction propagator is a data flow IR, so multiple assignments to the same variable aren’t necessarily joined.

The prediction propagator doesn’t have to be sound. The worst case outcome of the prediction propagator being wrong is that we either:

  • do speculations that are too strong, and so we exit too much and then recompile.
  • do speculations that are too weak, so we run slower than we could forever.

Note that the second of those outcomes is generally worse. Recompiling and then speculating less at least means that the program eventually runs with the optimal set of speculations. Speculating too weakly and never recompiling means that we never get to optimal. Therefore, the prediction propagator is engineered to sometimes be unsound instead of conservative, since unsoundness can be less harmful.
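
Here’s a minimal sketch of that union-style, flow-insensitive propagation, with made-up type bits and an add-only transfer function; the real propagator handles many node types and subtleties like overflow:

#include <cstdint>
#include <vector>

using SpeculatedType = uint64_t; // bitmap of fundamental types (illustrative)
constexpr SpeculatedType SpecInt32  = 1 << 0;
constexpr SpeculatedType SpecDouble = 1 << 1;

struct PredNode {
    SpeculatedType prediction = 0; // only ever grows (union)
    PredNode* left = nullptr;      // inputs of an add-like node
    PredNode* right = nullptr;
};

// Transfer function for an add-like node, based on its inputs' predictions.
SpeculatedType predictAdd(SpeculatedType a, SpeculatedType b)
{
    SpeculatedType result = 0;
    if ((a & SpecInt32) && (b & SpecInt32))
        result |= SpecInt32;
    if ((a | b) & SpecDouble)
        result |= SpecDouble;
    return result;
}

// Not flow-sensitive: one prediction per node, unioned in until fixpoint.
void propagate(const std::vector<PredNode*>& addNodes)
{
    bool changed = true;
    while (changed) {
        changed = false;
        for (PredNode* node : addNodes) {
            SpeculatedType merged = node->prediction
                | predictAdd(node->left->prediction, node->right->prediction);
            if (merged != node->prediction) {
                node->prediction = merged;
                changed = true;
            }
        }
    }
}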

The abstract interpreter. The DFG AI is the DFG tier’s most significant optimization. While there are many abstract interpreters throughout JavaScriptCore, this one is the biggest in terms of total code and the number of clients — hence to us it is the abstract interpreter.

The DFG AI’s abstract state comprises variable to abstract value mappings where each abstract value represents a set of possible JSValues that the variable could have. Those sets describe what type information we have proved from past checks. We join abstract states at control flow merge points. The solution after the fixpoint is a minimal solution (smallest possible sets that have a fixpoint). The DFG AI is flow-sensitive: it maintains a separate abstract state per instruction boundary. AI looks at the whole control flow graph at once but does not look outside the currently compiled function and whatever we inlined into it. AI is also sparse conditional.

The DFG abstract value representation has four sub-values:

  • Whether the value is known to be a constant, and if so, what that constant is.
  • The set of possible types (i.e. a SpeculatedType bitmap, shown in Figure 13).
  • The set of possible indexing types (also known as array modes) that the object pointed to by this value can have.
  • The set of possible structures that the object pointed to by this value can have. This set has special infinite set powers.

The last two sub-values can be mutated by effects. DFG AI assumes that all objects have escaped, so if an effect happens that can change indexing types and structures, then we have to clobber those parts of all live abstract values.

We interpret the four sub-values as follows: the abstract value represents the set of JSValues that reside in the intersection of the four sub-value sets. This means that when interpreting abstract values, we have the option of just looking at whichever sub-value is interesting to us. For example, an optimization that removes structure checks only needs to look at the structure set field.
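
Here’s a rough sketch of the shape of such an abstract value, with simplified stand-ins; the real AbstractValue also deals with infinite structure sets, clobbering by effects, and more:

#include <cstdint>
#include <optional>
#include <set>

struct Structure;                // stand-in for a JSC structure
using SpeculatedType = uint64_t; // bitmap of possible types

struct AbstractValueSketch {
    std::optional<int64_t> constant;       // known constant, if any
    SpeculatedType types = ~0ull;          // set of possible types
    std::set<int> indexingTypes;           // possible array modes
    std::set<const Structure*> structures; // possible structures

    // The represented set of values is the intersection of the four
    // sub-values, so a client may look only at the part it cares about.
    // A passed check refines the value by intersecting:
    void filterTypes(SpeculatedType allowed) { types &= allowed; }
};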

Figure 31. Examples of check elimination with abstract interpretation.

The DFG AI gives us constant and type propagation simultaneously. The type propagation is used to remove checks, simplify checks, and replace dynamic operations with faster versions.

Figure 31 shows examples of checks that the DFG AI lets us remove. Note that in addition to eliminating obvious same-basic-block check redundancies (Figure 31(a)), AI lets us remove redundancies that span multiple blocks (like Figure 31(b) and (c)). For example, in Figure 31(c), the AI is able to prove that @x is an Int32 at the top of basic block #7 because it merges the Int32 states of @x from BB#5 and #6. Check elimination is usually performed by mutating the IR so that later phases know which checks are really necessary without having to ask the AI.

The DFG AI has many clients, including the DFG backend and the FTL-to-B3 lowering. Being an AI client means having access to its estimate of the set of JSValues that any variable or DFG node can have at any program point. The backends use this to simplify checks that were not removed. For example, the backend may see an Object-or-Undefined check, ask AI about it, and find that AI already proved that we must have either an object or a string. The backend will be able to combine those two pieces of information to only emit an is-object check and ignore the possibility of the value being undefined.
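
Here’s a hedged sketch of that kind of reasoning using SpeculatedType-style bitmaps; the bit assignments are made up:

#include <cstdint>

// Made-up bit assignments for illustration.
constexpr uint64_t SpecObject    = 1ull << 0;
constexpr uint64_t SpecString    = 1ull << 1;
constexpr uint64_t SpecUndefined = 1ull << 2;

// The emitted check must reject exactly those types that are possible
// (per AI) but not allowed by the speculation.
uint64_t typesToReject(uint64_t provenByAI, uint64_t allowedBySpeculation)
{
    return provenByAI & ~allowedBySpeculation;
}

// Example from the text: the speculation allows Object|Undefined, and AI
// proved Object|String. Only String must be rejected, so an is-object check
// (which admits objects and rejects strings) suffices; undefined cannot
// occur and need not be tested.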

Type propagation also allows us to replace dynamic heap accesses with inlined ones. Most fast property accesses in DFG IR arise from inline cache feedback telling us that we should speculate, but sometimes the AI is able to prove something stronger than the profiler told us. This is especially likely in inlined code.

Clobberize. Clobberize is the alias analysis that the DFG uses to describe what parts of the program’s state an instruction could read and write. This allows us to see additional dependency edges between instructions beyond just the ones expressed as data flow. Dependency information tells the compiler what kinds of instruction reorderings are legal. Clobberize has many clients in both the DFG and FTL. In the DFG, it’s used for common subexpression elimination, for example.

To understand clobberize, it’s worth considering what it is about a program’s control flow that a compiler needs to remember. The control flow graph shows us one possible ordering of the program and we know that this ordering is legal. But both the DFG and FTL tiers want to move code around. The DFG tier mostly only moves code around within basic blocks rather than between them while the FTL tier can also move code between basic blocks. Even with the DFG’s block-local code motion, it’s necessary to know more than just the current ordering of the program. It’s also necessary to know how that ordering can be changed.

Some of this is already solved by the data flow graph. DFG IR provides a data flow graph that shows some of the dependencies between instructions. It’s obvious that if one instruction has a data flow edge to another, then only one possible ordering (source executes before sink) is valid. But what about:

  • Stores to memory.
  • Loads from memory.
  • Calls that can cause any effects.
  • OSR effects (like MovHint).

Data flow edges don’t talk about those dependencies. Data flow also cannot tell which instructions have effects at all. So, the data flow graph cannot tell us anything about the valid ordering of instructions if those instructions have effects.

The issue of how to handle dependencies that arise from effects is particularly relevant to JavaScript compilation — and speculative compilation in general — because of the precision about aliasing that speculation gives us. For example, although the JavaScript o.f operation could have any effect, after speculation we often know that it can only affect properties named f. Additionally, JavaScript causes us to have to emit lots of loads to fields that are internal to our object model and it’s good to know exactly when those loads are redundant so that we can remove as many of them as possible. So, we need to have the power to ask, for any operation that may access internal VM state, whether that state could be modified by any other operation, and we want that answer to be as precise as it can while being O(1)-ish.

Clobberize is a static analysis that augments the data flow and control flow graphs by telling us constraints on how instructions can be reordered. The neat thing about clobberize is that it avoids storing dependency information in the instructions themselves. So, while the compiler is free to query dependency information anytime it likes by running the analysis, it doesn’t have to do anything to maintain it.

Figure 32. Some of the abstract heap hierarchy. All heaps are subsets of World, which is subdivided into Heap, Stack and SideState. For example, JS function calls say that they write(Heap) and read(World). Subheaps of Heap include things like JSObject_butterfly, which refer to fields that are internal to the JSC object model and are not directly user-visible, and things like NamedProperties, a heap that contains subheaps for every named property the function accesses.

For each DFG instruction, clobberize reports zero or more reads or writes. Each read or write says which abstract heaps it is accessing. Abstract heaps are sets of memory locations. A read (or write) of an abstract heap means that the program will read (or write) from zero or more actual locations in that abstract heap. Abstract heaps form a hierarchy with World at the top (Figure 32). A write to World means that the effect could write to anything, so any read might see that write. The hierarchy can get very specific. For example, fully inferred, direct forms of property access like GetByOffset and PutByOffset report that they read and write (respectively) an abstract heap that names the property. So, accesses to properties of different names are known not to alias. The heaps are known to alias if either one is a descendant of the other.

It’s worth appreciating how clobberize combined with control flow is just a way of encoding a dependence graph. To build a dependence graph from clobberize information, we apply the following rule. If instruction B appears after instruction A in control flow, then we treat B as having a dependence edge to A (B depends on A) if:

  • any heap read by B overlaps any heap written by A, or
  • any heap written by B overlaps any heap read or written by A.

Conversely, any dependence graph can be expressed using clobberize. An absurd but correct representation would involve giving each edge in the dependence graph its own abstract heap and having the source of the edge write the heap while the sink reads it. But what makes clobberize such an efficient representation of dependence graphs is that every dependence that we’ve tried to represent can be intuitively described by reads and writes to a small collection of abstract heaps.
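
The overlap rule and the dependence rule can be written down directly. Here’s a minimal sketch with simplified stand-in types:

#include <vector>

// Illustration only: abstract heaps form a hierarchy, and two heaps alias
// iff one is an ancestor of the other (or they are the same heap).
struct AbstractHeap {
    const AbstractHeap* parent = nullptr;

    bool isAncestorOf(const AbstractHeap* other) const
    {
        for (; other; other = other->parent) {
            if (other == this)
                return true;
        }
        return false;
    }

    bool overlaps(const AbstractHeap& other) const
    {
        return isAncestorOf(&other) || other.isAncestorOf(this);
    }
};

struct Effects {
    std::vector<const AbstractHeap*> reads;
    std::vector<const AbstractHeap*> writes;
};

// B (later in control flow) depends on A if B reads what A writes,
// or B writes what A reads or writes.
bool dependsOn(const Effects& b, const Effects& a)
{
    auto anyOverlap = [](const auto& xs, const auto& ys) {
        for (auto* x : xs) {
            for (auto* y : ys) {
                if (x->overlaps(*y))
                    return true;
            }
        }
        return false;
    };
    return anyOverlap(b.reads, a.writes)
        || anyOverlap(b.writes, a.writes)
        || anyOverlap(b.writes, a.reads);
}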

Those abstract heaps are either collections of concrete memory locations (for example the "foo" abstract heap is the set of memory locations used to represent the values of properties named “foo”) or they are metaphorical. Let’s explore some metaphorical uses of abstract heaps:

  • MovHint wants to say that it is not dead code, that it must be ordered with other MovHints, and that it must be ordered with any real program effects. We say this in clobberize by having MovHint write SideState. SideState is a subheap of World but disjoint from other things, and we have any operation that wants to be ordered with OSR exit state either read or write something that overlaps SideState. Note that DFG assumes that operations that may exit implicitly read(World) even if clobberize doesn’t say this, so MovHint’s write of SideState ensures ordering with exits.
  • NewObject wants to say that it’s not valid to hoist it out of loops because two successive executions of NewObject may produce different results. But it’s not like NewObject clobbers the world; for example if we had two accesses to the same property on either side of a NewObject then we’d want the second one to be eliminated. DFG IR has many NewObject-like operations that also have this behavior. So, we introduce a new abstract heap called HeapObjectCount and we say that NewObject is metaphorically incrementing (reading and writing) the HeapObjectCount. HeapObjectCount is treated as a subheap of Heap but it’s disjoint from the subheaps that describe any state visible from JS. This is sufficient to block hoisting of NewObject while still allowing interesting optimizations to happen around it.

Figure 33. Sample sequence of DFG IR instructions and their dependence graph. DFG IR never stores the dependence graph in memory because we get the information implicitly by running clobberize.

The combination of clobberize and the control flow graph gives a scalable and intuitive way of expressing the dependence graph. It’s scalable because we don’t actually have to express any of the edges. Consider for example a dynamic access instruction that could read any named JavaScript property, like the Call instruction in Figure 33. Clobberize can say this in O(1) space and time. But a dependence graph would have to create an edge from that instruction to any instruction that accesses any named property before or after it. In short, clobberize gives us the benefit of a dependence graph without the cost of allocating memory to represent the edges.

The abstract heaps can also be efficiently collected into a set, which we use to summarize the aliasing effects of basic blocks and loops.

To summarize, the DFG puts a big emphasis on static analysis. Speculation decisions are made using a combination of profiling and an abstract interpreter called prediction propagation. Additionally, we have an abstract interpreter for optimization, simply called the DFG abstract interpreter, which serves as the main engine for redundant check removal. Abstract interpreters are a natural fit for the DFG because they give us a way to forward-propagate information about types. Finally, the DFG uses the clobberize analysis to describe dependencies and aliasing.

Fast Compilation

The DFG is engineered to compile quickly so that the benefits of OSR speculations can be realized quickly. To help reduce compile times, the DFG is focused on what optimizations it does and how it does them. The static analysis and OSR exit optimizations discussed so far represent the most powerful things that the DFG is capable of. The DFG does a quick and dirty job with everything else, like instruction selection, register allocation, and removal of redundant code that isn’t checks. Functions that benefit from the compiler doing a good job on those optimizations will get them if they run long enough to tier up into the FTL.

The DFG’s focus on fast compilation happened organically, as a result of many separate throughput-latency trade-offs. Initially, JavaScriptCore just had the Baseline JIT and then later Baseline as the profiling tier and DFG as the optimizing tier. The DFG experienced significant evolution during this time, and then experienced additional evolution after the FTL was introduced. While no single decision led to the DFG’s current design, we believe that it was most significantly shaped by tuning for short-running benchmarks and the introduction of the FTL.

The DFG was tuned for a diverse set of workloads. On the one hand, it was tuned for long-running tests in which one full second of warm-up was given to the speculative compiler for free (like the old V8 benchmarks, which live on in the form of Octane and JetStream, albeit without the freebie warmup), but on the other hand, it was also tuned for shorter-running benchmarks like SunSpider and page load tests. SunSpider focused on smallish programs running for very short bursts of time with little opportunity for warm-up. Compilers that do more optimizations than the DFG tend to lose to it on SunSpider because they fail to complete their optimizations before SunSpider finishes running. We continue to use tests that are in the spirit of SunSpider, like Speedometer and JetStream. Speedometer has a similar code-size-to-running-time ratio, so like SunSpider, it benefits a lot from DFG. JetStream includes a subset of SunSpider and puts a big emphasis on short-running code in all of its other tests. That’s not to say that we don’t also care about long-running code. It’s just that our methodology for improving the DFG was to try to get speed-ups on both short-running things and long-running things with the same engine. Since any long-running optimization would regress the short-running tests, we often avoided adding any long-running optimizations to the DFG. But we did add cheap versions of many sophisticated optimizations, giving respectable speed-ups on both short-running and long-running workloads.

The introduction of the FTL solidified the DFG’s position as the compiler that optimizes less. So long as the DFG generates reasonably good code quickly, we can get away with putting lots of expensive optimizations into the FTL. The FTL’s long compile times mean that many programs do not run long enough to benefit from the FTL. So, the DFG is there to give those programs a speculative optimization boost in way less time than an FTL-like compiler could do. Imagine a VM that only had one optimizing compiler. Unless that one compiler compiled as fast as the DFG and generated code that was as good as the FTL, it would end up being reliably slower than JavaScriptCore on some workloads. If that compiler compiled as fast as the DFG but didn’t have the FTL’s throughput then any program that ran long enough would run faster in JavaScriptCore. If that compiler generated code that was as good as the FTL but compiled slower than the DFG then any program that ran short enough to tier up into the DFG but not that compiler would run faster in JavaScriptCore. JavaScriptCore has multiple compiler tiers because we believe that it is not possible to build a compiler that compiles as fast as the DFG while generating code that is as good as the FTL.

To summarize, the DFG focuses on fast compilation because of the combination of the history of how it was tuned and the fact that it sits as the tier below the FTL JIT.

Figure 34. Illustration of a sample DFG IR program with all three graphs: local data flow, global data flow, and control flow.

The DFG compiler’s speed comes down to an emphasis on block-locality in the IR. The DFG IR used by the DFG tier has a two-level data flow graph:

  • Local data flow graph. The local data flow graph is used within basic blocks. This graph is a first-class citizen in the IR; when working with data flow in the DFG’s C++ code, it sometimes seems like this is the only data flow graph. DFG IR inside a basic block resembles SSA form in the sense that there’s a 1:1 mapping between instructions and the variables they assign and data flow is represented by having users of values point at the instructions (nodes) that produce those values. This representation does not allow you to use a value produced by an instruction in a different block except through tedious escape hatches.
  • Global data flow graph. We say global to mean the entire compilation unit, so some JS function and whatever the DFG inlined into it. So, global just means spanning basic blocks. DFG IR maintains a secondary data flow graph that spans basic blocks. DFG IR’s approach to global data flow is based on spilling: to pass a value to a successor block, you store it to a spill slot on the stack, and then that block loads it. But in DFG IR, we also thread data flow relationships through those loads and stores. This means that if you are willing to perform the tedious task of traversing this secondary data flow graph, you can get a global view of data flow.

Figure 34 shows an example of how this works. The compilation unit is represented as three graphs: a control flow graph, local data flow, and global data flow. Data flow graphs are represented with edges going from the user to the value being used. The local data flow graphs work like they do in SSA, so any SSA optimization can be run in a block-local manner on this IR. The global data flow graph is made of SetLocal/GetLocal nodes that store/load values into the stack. The data flow between SetLocal and GetLocal is represented completely in DFG IR, by threading data flow edges through special Phi nodes in each basic block where a local is live.

From the standpoint of writing outstanding high-throughput optimizations, this approach to IR design is like kneecapping the compiler. Compilers thrive on having actual SSA form, where there is a single data flow graph, and you don’t have to think about an instruction’s position in control flow when traversing data flow. The emphasis on locality is all about excellent compile times. We believe that locality gives us compile time improvements that we can’t get any other way:

  • Instruction selection and register allocation for a basic block can be implemented as a single pass over that basic block. The instruction selector can make impromptu register allocation decisions during that pass, like deciding that it needs any number of scratch registers to emit code for some DFG node. The combined instruction selector and register allocator (aka the DFG backend) compiles basic blocks independently of one another. This kind of code generation is good at register allocating large basic blocks but bad for small ones. For functions that only have a single basic block, the DFG often generates code that is as good as the FTL.
  • We never have to decompress the delta encoding of OSR exit. We just have the backend record a log of its register allocation decisions (the variable event stream). While the DFG IR for a function is thrown out after compilation, this log along with a minified version of the DFG IR (that only includes MovHints and the things they reference) is saved so that we can replay what the backend did whenever an OSR exit happens. This makes OSR exit handling super cheap in the DFG – we totally avoid the O(n²) complexity explosion of OSR stackmaps despite the fact that we speculate like crazy.
  • There is no need to enter or exit SSA. On the one hand, SSA conversion performance is a solved problem: it’s a nearly-linear-time operation. Even so, the constant factors are high enough that avoiding it entirely is profitable. Converting out of SSA is worse. If we wanted to combine SSA with our block-local backend, we’d have to add some sort of transformation that discovers how to load/store live state across basic blocks. DFG IR plays tricks where the same store that passes data flow to another block doubles as the OSR exit state update. It’s not obvious that exiting out of SSA would discover all of the cases where the same store can be reused for both OSR exit state update and the data flow edge. This suggests that any version of exiting out of SSA would make the DFG compiler either generate worse code or run slower. So, not having SSA makes the compiler run faster because entering SSA is not free and exiting SSA is awful.
  • Every optimization is faster if it is block-local. Of course, you could write block-local optimizations in an SSA IR. But having an IR that emphasizes locality is like a way to statically guarantee that we won’t accidentally introduce expensive compiler passes to the DFG.

The one case where global data flow is essential to the DFG’s mission is static analysis. This comes up in the prediction propagator and the abstract interpreter. Both of them use the global data flow graph in addition to the local data flow graphs, so that they can see how type information flows through the whole compilation unit. Fortunately, as shown in Figure 34, the global data flow graph is available. It’s in a format that makes it hard to edit but relatively easy to analyze. For example, it implicitly reports the set of live variables at each basic block boundary, which makes merging state in the abstract interpreter relatively cheap.

Figure 35. The DFG pipeline.

Figure 35 shows the complete DFG optimization pipeline. This is a fairly complete pipeline: it has classics like constant folding, control flow simplification, CSE, and DCE. It also has lots of JavaScript-specifics like deciding where to put checks (unification, prediction injection, prediction propagation, and fixup), a pass just to optimize common patterns of varargs, some passes for GC barriers, and passes that help OSR (CPS rethreading and phantom insertion). We can afford to do a lot of optimizations in the DFG so long as those optimizations are block-local and don’t try too hard. Still, this pipeline is way smaller than the FTL’s and runs much faster.

To summarize, the DFG compiler uses OSR exit and static analysis to emit an optimal set of type checks. This greatly reduces the number of type checks compared to running JavaScript in either of the profiled tiers. Because the benefit of type check removal is so big, the DFG compiler tries to limit how much time it spends doing other optimizations by restricting itself to a mostly block-local view of the program. This is a trade-off that the DFG makes to get fast compile times. Functions that run long enough that they’d rather pay the compile time to get those optimizations end up tiering up to the FTL, which just goes all out for throughput.

FTL Compiler

We’ve previously documented some aspects of the FTL’s architecture in the original blog post and when we introduced B3. This section provides an updated description of this JIT’s capabilities as well as a deep dive into how the FTL does OSR. We will structure our discussion of the FTL as follows. First we will enumerate what optimizations it is capable of. Then we will describe how it does OSR exit in detail. Finally we will talk about patchpoints — an IR operation based on a lambda.

All The Optimizations

The point of the FTL compiler is to run all the optimizations. This is a compiler where we never compromise on peak throughput. All of the DFG’s decisions that were known trade-offs in favor of compile time at the expense of throughput are reversed in the FTL. There is no upper limit on the number of cycles that a function compiled with the FTL will run for, so it’s the kind of compiler where even esoteric optimizations have a chance to pay off eventually. The FTL combines multiple optimization strategies:

  • We reuse the DFG pipeline, including the weird IR. This ensures that any good thing that the DFG tier ever does is also available in the FTL.
  • We add a new DFG SSA IR and DFG SSA pipeline. We adapt lots of DFG phases to DFG SSA (which usually makes them global rather than local). We add lots of new phases that are only possible in SSA (like loop invariant code motion).
  • We lower DFG SSA IR to B3 IR. B3 is an SSA-based optimizing JIT compiler that operates at the abstraction level of C. B3 has lots of optimizations, including global instruction selection and graph coloring register allocation. The FTL was B3’s first customer, so B3 is tailored for optimizing at the abstraction level where DFG SSA IR can’t.

Having multiple ways of looking at the program gives the FTL maximum opportunities to optimize. Some of the compiler’s functionality, particularly in the part that decides where to put checks, thrives on the DFG’s weird IR. Other parts of the compiler work best in DFG SSA, like the DFG’s loop-invariant code motion. Lots of things work best in B3, like most reasoning about how to simplify arithmetic. B3 is the first IR that doesn’t know anything about JavaScript, so it’s a natural place to implement textbook optimizations that would have difficulties with JavaScript’s semantics. Some optimizations, like CSE, work best when executed in every IR because they find unique opportunities in each IR. In fact, all of the IRs have the same fundamental optimization capabilities in addition to their specialized optimizations: CSE, DCE, constant folding, CFG simplification, and strength reductions (sometimes called peephole optimizations or instruction combining).
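As a rough illustration of the block-local flavor of one of those shared optimizations, here is a minimal value-numbering CSE over a toy three-address IR. This is a sketch under simplified assumptions (single assignment per destination, pure operations), not code from any of JSC’s IRs:

#include <map>
#include <tuple>
#include <vector>

// Toy three-address instruction: dst = lhs <op> rhs, over virtual registers.
struct Instruction {
    char op;           // '+', '*', ...
    int dst, lhs, rhs; // virtual register numbers; each dst is assigned once
    bool dead = false;
};

// Block-local CSE by value numbering: if (op, lhs, rhs) was already computed in
// this block, later uses are redirected to the earlier result and the duplicate
// becomes dead (DCE can then drop it).
void localCSE(std::vector<Instruction>& block)
{
    std::map<int, int> replacements; // eliminated dst -> surviving dst
    std::map<std::tuple<char, int, int>, int> available;
    for (auto& inst : block) {
        if (auto it = replacements.find(inst.lhs); it != replacements.end())
            inst.lhs = it->second;
        if (auto it = replacements.find(inst.rhs); it != replacements.end())
            inst.rhs = it->second;
        auto key = std::make_tuple(inst.op, inst.lhs, inst.rhs);
        if (auto it = available.find(key); it != available.end()) {
            replacements[inst.dst] = it->second;
            inst.dead = true;
        } else
            available[key] = inst.dst;
    }
}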

Figure 36. The FTL pipeline. Note that Lower DFG to B3 is in bold because it’s FTL’s biggest phase; sometimes when we say “FTL” we are just referring to this phase.

The no-compromise approach is probably best appreciated by looking at the FTL optimization pipeline in Figure 36. The FTL runs 93 phases on the code it encounters. This includes all phases from Figure 35 (the DFG pipeline), except Varargs Forwarding, only because it’s completely subsumed by the FTL’s Arguments Elimination. Let’s review some of the FTL’s most important optimizations:

  • DFG AI. This is one of the most important optimizations in the FTL. It’s mostly identical to the AI we run in the DFG tier. Making it work with SSA makes it slightly more precise and slightly more expensive. We run the AI a total of six times.
  • CSE (common subexpression elimination). We run this in DFG IR (Local Common Subexpression Elimination), DFG SSA IR (Global Common Subexpression Elimination), B3 IR (Reduce Strength and the dedicated Eliminate Common Subexpressions), and even in Air (Fix Obvious Spills, a CSE focused on spill code). Our CSEs can do value numbering and load/store elimination.
  • Object Allocation Sinking is a must-points-to analysis that we use to eliminate object allocations or sink them to slow paths. It can eliminate graphs of object allocations, including cyclic graphs.
  • Integer Range Optimization is a forward flow-sensitive abstract interpreter in which the state is a system of equations and inequalities that describe known relationships between variables. It can eliminate integer overflow checks and array bounds checks.
  • The B3 Reduce Strength phase runs a fixpoint that includes CFG simplification, constant folding, reassociation, SSA clean-up, dead code elimination, a light CSE, and lots of miscellaneous strength reductions.
  • Duplicate Tails, aka tail duplication, flattens some control flow diamonds, unswitches small loops, and undoes some cases of relooping. We duplicate small tails blindly over a CFG with critical edges broken. This allows us to achieve some of what splitting achieved for the original speculative compilers.
  • Lower B3 to Air is a global pattern matching instruction selector.
  • Allocate Registers By Graph Coloring implements the IRC and Briggs register allocators. We use IRC on x86 and Briggs on arm64. The difference is that IRC can find more opportunities for coalescing assignments into a single register in cases where there is high register pressure. Our register allocators have special optimizations for OSR exit, especially the OSR exits we emit for integer overflow checks.

OSR Exit in the FTL

Now that we have enumerated some of the optimizations that the FTL is capable of, let’s take a deep dive into how the FTL works by looking at how it compiles and does OSR. Let’s start with this example:

function foo(a, b, c)
{
    return a + b + c;
}

The relevant part of the bytecode sequence is:

[   7] add loc6, arg1, arg2
[  12] add loc6, loc6, arg3
[  17] ret loc6

Which results in the following DFG IR:

  24:  GetLocal(Untyped:@1, arg1(B<Int32>/FlushedInt32), R:Stack(6), bc#7)
  25:  GetLocal(Untyped:@2, arg2(C<BoolInt32>/FlushedInt32), R:Stack(7), bc#7)
  26:  ArithAdd(Int32:@24, Int32:@25, CheckOverflow, Exits, bc#7)
  27:  MovHint(Untyped:@26, loc6, W:SideState, ClobbersExit, bc#7, ExitInvalid)
  29:  GetLocal(Untyped:@3, arg3(D<Int32>/FlushedInt32), R:Stack(8), bc#12)
  30:  ArithAdd(Int32:@26, Int32:@29, CheckOverflow, Exits, bc#12)
  31:  MovHint(Untyped:@30, loc6, W:SideState, ClobbersExit, bc#12, ExitInvalid)
  33:  Return(Untyped:@30, W:SideState, Exits, bc#17)

The DFG data flow from the snippet above is illustrated in Figure 37 and the OSR exit sites are illustrated in Figure 38.

Figure 37. Data flow graph for FTL code generation example.
Figure 38. DFG IR example with the two exiting nodes highlighted along with where they exit and what state is live when they exit.

We want to focus our discussion on the MovHint @27 and how it impacts the code generation for the ArithAdd @30. That ArithAdd is going to exit to the second add in the bytecode, which requires restoring loc6 (i.e. the result of the first add), since it is live at that point in bytecode (it also happens to be directly used by that add).

This DFG IR is lowered to the following in B3:

Int32 @42 = Trunc(@32, DFG:@26)
Int32 @43 = Trunc(@27, DFG:@26)
Int32 @44 = CheckAdd(@42:WarmAny, @43:WarmAny, generator = 0x1052c5cd0,
                     earlyClobbered = [], lateClobbered = [], usedRegisters = [],
                     ExitsSideways|Reads:Top, DFG:@26)
Int32 @45 = Trunc(@22, DFG:@30)
Int32 @46 = CheckAdd(@44:WarmAny, @45:WarmAny, @44:ColdAny, generator = 0x1052c5d70,
                     earlyClobbered = [], lateClobbered = [], usedRegisters = [],
                     ExitsSideways|Reads:Top, DFG:@30)
Int64 @47 = ZExt32(@46, DFG:@32)
Int64 @48 = Add(@47, $-281474976710656(@13), DFG:@32)
Void @49 = Return(@48, Terminal, DFG:@32)

CheckAdd is the B3 way of saying: do an integer addition, check for overflow, and if it overflows, execute an OSR exit governed by a generator. The generator is a lambda that is given a JIT generator object (that it can use to emit code at the jump destination of the OSR exit) and a stackmap generation parameters object that tells it the B3 value representation for each stackmap argument. The B3 value reps tell you which register, stack slot, or constant to use to get the value. B3 doesn’t know anything about how OSR exit works except that it involves having a stackmap and a generator lambda. So, CheckAdd can take more than 2 arguments; the first two arguments are the actual add operands and the rest are the stackmap. It’s up to the client to decide how many arguments to pass to the stackmap and only the generator will ever get to see their values. In this example, only the second CheckAdd (@46) is using the stackmap. It passes one extra argument, @44, which is the result of the first add — just as we would expect based on MovHint @27 and the fact that loc6 is live at bc#12. This is the result of the FTL decompressing the delta encoding given by MovHints into full stackmaps for B3.
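The division of labor around a Check can be sketched with a self-contained model. This is schematic and the names are made up; it is not B3’s actual API. The point is only that the client appends extra stackmap arguments plus a generator lambda up front, and the generator is handed a value representation for each of those arguments only after register allocation has happened:

#include <functional>
#include <vector>

// Schematic stand-ins for B3 concepts; none of these are the real classes.
struct ValueRep {
    enum class Kind { Register, StackSlot, Constant } kind;
    int payload; // register number, stack offset, or constant value
};

struct CheckValue {
    int lhs, rhs;                       // the actual add operands
    std::vector<int> stackmapArguments; // extra values kept alive for OSR exit
    // Called after register allocation with one ValueRep per stackmap argument
    // (index 0 here corresponds to argument index 2 in the CheckAdd above).
    std::function<void(const std::vector<ValueRep>&)> generator;
};

// How the client wires up the second add in the example: @44, the first add's
// result (i.e. loc6), rides along as a stackmap argument.
CheckValue makeSecondCheckAdd(int firstAddResult, int thirdOperand)
{
    CheckValue check { firstAddResult, thirdOperand, { firstAddResult }, nullptr };
    check.generator = [](const std::vector<ValueRep>& reps) {
        // Emit the OSR exit ramp here: restore loc6 from wherever reps[0] says
        // the first add's result ended up (a register, a spill slot, or a constant).
        (void)reps;
    };
    return check;
}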

Figure 39. The stackmaps and stackmap-like mappings maintained by the FTL to enable OSR.

FTL OSR exit means tracking what happens with the values of bytecode locals through multiple stages of lowering. The various stages don’t know a whole lot about each other. For example, the final IRs, B3 and Air, know nothing about bytecode, bytecode locals, or any JavaScript concepts. We implement OSR exit by tracking multiple stackmap-like mappings per exit site that give us the complete picture when we glue them together (Figure 39):

  • The DFG IR stackmaps that we get by decompressing MovHint deltas. This gives a mapping from bytecode local to either a DFG node or a stack location. In some cases, DFG IR has to store some values to the stack to support dynamic variable introspection, such as via function.arguments. DFG OSR exit analysis is smart enough to recognize those cases, since it’s more optimal to handle them by having OSR exit extract the value from the stack. Hence, OSR exit analysis may report that a bytecode local is available through a DFG node or a stack location.
  • The B3 value reps array inside the stackmap generation parameters that B3 gives to the generator lambdas of Check instructions like CheckAdd. This is a mapping from B3 argument index to a B3 value representation, which is either a register, a constant, or a stack location. By argument index we mean index in the stackmap arguments to a Check. This is three pieces of information: some user value (like @46 = CheckAdd(@44, @45, @44)), some index within its argument list (like 2), and the value that index references (@44). Note that since this CheckAdd has two argument indices for @44, that means that they may end up having different value representations. It’s not impossible for one to be a constant and another to be a register or spill slot, for example (though this would be highly unlikely; if it happened then it would probably be the result of some sound-but-inefficient antipattern in the compiler). B3’s client gets to decide how many stackmap arguments it will pass and B3 guarantees that it will give the generator a value representation for each argument index in the stackmap (so starting with argument index 2 for CheckAdd).
  • The FTL OSR exit descriptor objects, which the FTL’s DFG→B3 lowering creates at each exit site and holds onto inside the generator lambda it passes to the B3 check. Exit descriptors are based on DFG IR stackmaps and provide a mapping from bytecode local to B3 argument index, constant, stack slot, or materialization. If the DFG IR stackmap said that a bytecode local is a Node that has a constant value, then the OSR exit descriptor will just tell us that value. If the DFG stackmap said that a local is already on the stack, then the OSR exit descriptor will just tell us that stack slot. It could be that the DFG stackmap tells us that the node is a phantom object allocation — an object allocation we optimized out but that needs to be rematerialized on OSR exit. If it is none of those things, the OSR exit descriptor will tell us which B3 argument index has the value of that bytecode local.
  • The FTL’s DFG→B3 lowering already maintains a mapping from DFG node to B3 value.
  • The FTL OSR Exit object, which is a mapping from bytecode local to register, constant, stack slot, or materialization. This is the final product of the FTL’s OSR exit handling and is computed lazily from the B3 value reps and FTL OSR exit descriptor.

These pieces fit together as follows. First we compute the DFG IR stackmap and the FTL’s DFG node to B3 value mapping. We get the DFG IR stackmap from the DFG OSR exit analysis, which the FTL runs in tandem with lowering. We get the DFG to B3 mapping implicitly from lowering. Then we use that to compute the FTL OSR exit descriptor along with the set of B3 values to pass to the stackmap. The DFG IR stackmap tells us which DFG nodes are live, so we turn that into B3 values using the DFG to B3 mapping. Some nodes will be excluded from the B3 stackmap, like object materializations and constants. Then the FTL creates the Check value in B3, passes it the stackmap arguments, and gives it a generator lambda that closes over the OSR exit descriptor. B3’s Check implementation figures out which value representations to use for each stackmap argument index (as a result of B3’s register allocator doing this for every data flow edge), and reports this to the generator as an array of B3 value reps. The generator then creates an FTL::OSRExit object that refers to the FTL OSR exit descriptor and value reps. Users of the FTL OSR exit object can figure out which register, stack slot, constant value, or materialization to use for any bytecode local by asking the OSR exit descriptor. That can tell us the constant, spill slot, or materialization script to use. It can also give a stackmap argument index, in which case we load the value rep at that index, and that tells us the register, spill slot, or constant.
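The final lookup can be summarized with a hedged, self-contained sketch. The types and names below are illustrative rather than JSC’s: the exit descriptor either answers directly (constant, stack slot, materialization) or hands back a stackmap argument index, which we then resolve through B3’s value reps.

#include <cstdint>
#include <vector>

// Where a bytecode local's value can be found at an OSR exit (illustrative).
struct ExitLocation {
    enum class Kind { Constant, StackSlot, Materialization, Register } kind;
    int64_t payload;
};

// What the FTL OSR exit descriptor knows about one bytecode local (illustrative).
struct DescriptorEntry {
    enum class Kind { Constant, StackSlot, Materialization, StackmapArgumentIndex } kind;
    int64_t payload; // the constant / slot / materialization id, or an argument index
};

// What B3 reports for one stackmap argument index after register allocation.
struct ValueRep {
    enum class Kind { Register, StackSlot, Constant } kind;
    int64_t payload;
};

ExitLocation resolveLocal(const DescriptorEntry& entry, const std::vector<ValueRep>& valueReps)
{
    switch (entry.kind) {
    case DescriptorEntry::Kind::Constant:
        return { ExitLocation::Kind::Constant, entry.payload };
    case DescriptorEntry::Kind::StackSlot:
        return { ExitLocation::Kind::StackSlot, entry.payload };
    case DescriptorEntry::Kind::Materialization:
        return { ExitLocation::Kind::Materialization, entry.payload };
    case DescriptorEntry::Kind::StackmapArgumentIndex: {
        // Only here do we need B3's answer about where the value actually lives.
        const ValueRep& rep = valueReps[static_cast<size_t>(entry.payload)];
        if (rep.kind == ValueRep::Kind::Register)
            return { ExitLocation::Kind::Register, rep.payload };
        if (rep.kind == ValueRep::Kind::StackSlot)
            return { ExitLocation::Kind::StackSlot, rep.payload };
        return { ExitLocation::Kind::Constant, rep.payload };
    }
    }
    return { ExitLocation::Kind::Constant, 0 }; // unreachable
}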

This approach to OSR exit gives us two useful properties. First, it empowers OSR-specific optimization. Second, it empowers optimizations that don’t care about OSR. Let’s go into these in more detail.

FTL OSR empowers OSR-specific optimizations. This happens in DFG IR and B3 IR. In DFG IR, OSR exit is a mutable part of the IR. Any operation can be optimized by adding more OSR exits and we even have the ability to move checks around. The FTL does sophisticated OSR-aware optimizations using DFG IR, like object allocation sinking. In B3 IR, OSR exit gets special register allocation treatment. The stackmap arguments of Check are understood by B3 to be cold uses, which means that it’s not expensive if those uses are spilled. This is powerful information for a register allocator. Additionally, B3 does special register allocation tricks for addition and subtraction with overflow checks (for example we can precisely identify when the result register can reuse a stackmap register and when we can coalesce the result register with one of the input registers to produce optimal two-operand form on x86).

FTL OSR also empowers optimizations that don’t care about OSR exit. In B3 IR, OSR exit decisions get frozen into stackmaps. This is the easiest representation of OSR exit because it requires no knowledge of OSR exit semantics to get right. It’s natural for compiler phases to treat extra arguments to an instruction opaquely. Explicit stackmaps are particularly affordable in B3 because of a combination of factors:

  1. the FTL is a more expensive compiler anyway so the DFG OSR delta encoding optimizations matter less,
  2. we only create stackmaps in B3 for exits that DFG didn’t optimize out, and
  3. B3 stackmaps only include a subset of live state (the rest may be completely described in the FTL OSR exit descriptor).

We have found that some optimizations are annoying, sometimes to the point of being impractical, to write in DFG IR because of explicit OSR exit (like MovHint deltas and exit origins). It’s not necessary to worry about those issues in B3. So far we have found that every textbook optimization for SSA is practical to do in B3. This means that we only end up having a bad time with OSR exit in our compiler when we are writing phases that benefit from DFG’s high-level knowledge; otherwise we write the phases in B3 and have a great time.

This has some surprising outcomes. Anytime the FTL emits a Check value in B3, B3 may duplicate the Check. B3 IR semantics allow any code to be duplicated during optimization and this usually happens due to tail duplication. Not allowing code duplication would restrict B3 more than we’re comfortable doing. So, when the duplication happens, we handle it by having multiple FTL OSR exits share the same OSR exit descriptor but get separate value reps. It’s also possible for B3 to prove that some Check is either unnecessary (always succeeds) or is never reached. In that case, we will have one FTL OSR exit descriptor but zero FTL OSR exits. This works in such a way that DFG IR never knows that the code was duplicated and B3’s tail duplication and unreachable code elimination know nothing about OSR exit.

Patchpoints: Lambdas in the IR

This brings us to the final point about the FTL. We think that what is most novel about this compiler is its use of lambdas in its IRs. Check is one example of this. The DFG has some knowledge about what a Check would do at the machine code level, but that knowledge is incomplete until we fill in some blanks about how B3 register-allocated some arguments to the Check. The FTL handles this by having one of the operands to a B3 Check be a lambda that takes a JIT code generator object and value representations for all of the arguments. We like this approach so much that we also have B3 support Patchpoint. A Patchpoint is like an inline assembly snippet in a C compiler, except that instead of a string containing assembly, we pass a lambda that will generate that assembly if told how to get its arguments and produce its result. The FTL uses this for a bunch of cases:

  • Anytime the B3 IR generated by the FTL interacts with JavaScriptCore’s internal ABI. This includes all calls and call-like instructions.
  • Inline caches. If the FTL wants to emit an inline cache, it uses the same inline cache code generation logic that the DFG and baseline use. Instead of teaching B3 how to do this, we just tell B3 that it’s a patchpoint.
  • Lazy slow paths. The FTL has the ability to only emit code for a slow path if that slow path executes. We implement that using patchpoints.
  • Instructions we haven’t added to B3 yet. If we find some JavaScript-specific CPU instruction, we don’t have to thread it through B3 as a new opcode. We can just emit it directly using a Patchpoint. (Of course, threading it through B3 is a bit better, but it’s great that it’s not strictly necessary.)

Here’s an example of the FTL using a patchpoint to emit a fast double-to-int conversion:

if (MacroAssemblerARM64::
    supportsDoubleToInt32ConversionUsingJavaScriptSemantics()) {
    PatchpointValue* patchpoint = m_out.patchpoint(Int32);
    patchpoint->appendSomeRegister(doubleValue);
    patchpoint->setGenerator(
        [=] (CCallHelpers& jit,
             const StackmapGenerationParams& params) {
            jit.convertDoubleToInt32UsingJavaScriptSemantics(
                params[1].fpr(), params[0].gpr());
        });
    patchpoint->effects = Effects::none();
    return patchpoint;
}

This tells B3 that it’s a Patchpoint that returns Int32 and takes a Double. Both are assumed to go in any register of B3’s choice. Then the generator uses a C++ lambda to emit the actual instruction using our JIT API. Finally, the patchpoint tells B3 that the operation has no effects (so it can be hoisted, killed, etc).

This concludes our discussion of the FTL. The FTL is our high throughput compiler that does every optimization we can think of. Because it is a speculative compiler, a lot of its design is centered around having a balanced handling of OSR exit, which involves a separation of concerns between IRs that know different amounts of things about OSR. A key to the FTL’s power is the use of lambdas in B3 IR, which allows B3 clients to configure how B3 emits machine code for some operations.

Summary of Compilation and OSR

To summarize, JavaScriptCore has two optimizing compilers, the DFG and FTL. They are based on the same IR (DFG IR), but the FTL extends this with lots of additional compiler technology (SSA and multiple IRs). The DFG is a fast compiler: it’s meant to compile faster than typical optimizing compilers. But, it generates code that is usually not quite optimal. If that code runs long enough, then it will also get compiled with the FTL, which tries to emit the best code possible.

Related Work

The idea of using feedback from cheap profiling to speculate was pioneered by the Hölzle, Chambers, and Ungar paper on polymorphic inline caches, which calls this adaptive compilation. That work used a speculation strategy based on splitting, which means having the compiler emit many copies of code, one for each possible type. The same three authors later invented OSR exit, though they called it dynamic deoptimization and only used it to enhance debugging. Our approach to speculative compilation means using OSR exit as our primary speculation strategy. We do use splitting in a very limited sense: we emit diamond speculations in those cases where we are not sure enough to use OSR and then we allow tail duplication to split the in-between code paths if they are small enough.

This speculative compilation technique, with OSR or diamond speculations but not so much splitting, first received extraordinary attention during the Java performance wars. Many wonderful Java VMs used combinations of interpreters and JITs with varied optimization strategies to profile virtual calls and speculatively devirtualize them, with the best implementations using inline caches, OSR exit, and watchpoints. Java implementations that used variants of this technique include (but are not limited to):

  • the IBM JIT, which combined an interpreter and an optimizing JIT and did diamond speculations for devirtualization.
  • HotSpot and HotSpot server, which combined an interpreter and an optimizing JIT and used diamond speculations, OSR exit, and lots of other techniques that JavaScriptCore uses. JavaScriptCore’s FTL JIT is similar to HotSpot server in the sense that both compilers put a big emphasis on great OSR support, comprehensive low-level optimizations, and graph coloring register allocation.
  • Eclipse J9, a major competitor to HotSpot that also uses speculative compilation.
  • Jikes RVM, a research VM that used OSR exit but combined a baseline JIT and an optimizing JIT. I learned most of what I know about this technique from working on Jikes RVM.

Like Java, JavaScript has turned out to be a great use case for speculative compilation. Early instigators in the JavaScript performance war included the Squirrelfish interpreter (predecessor to LLInt), the Squirrelfish Extreme JIT (what we now call the Baseline JIT), the early V8 engine that combined a baseline JIT with inline caches, and TraceMonkey. TraceMonkey used a cheap optimizing JIT strategy called tracing, which compiles lots of speculative paths. This JIT sometimes outperformed the baseline JITs, but often lost to them due to overspeculation. V8 upped the ante by introducing the speculative compilation approach to JavaScript, using the template that had worked so well in Java: a lower tier that does inline caches, then an optimizing JIT (called Crankshaft) that speculates based on the inline caches and exits to the lower tier. This version of V8 used a pair of JITs (baseline JIT and optimizing JIT), much like Jikes RVM did for Java. JavaScriptCore soon followed by hooking up the DFG JIT as an optimizing tier for the baseline JIT, then adding the LLInt and FTL JIT. During about the same time, TraceMonkey got replaced with IonMonkey, which uses similar techniques to Crankshaft and DFG. The ChakraCore JavaScript implementation also used speculative compilation. JavaScriptCore and V8 have continued to expand their optimizations with innovative compiler technology like B3 (a CFG SSA compiler) and TurboFan (a sea-of-nodes SSA compiler). Much like for Java, the top implementations have at least two tiers, with the lower one used to collect profiling that the upper one uses to speculate. And, like for Java, the fastest implementations are built around OSR speculation.

Conclusion

JavaScriptCore includes some exciting speculative compiler technology. Speculative compilation is all about speeding up dynamically typed programs by placing bets on what types the program would have had if it could have types. Speculation uses OSR exit, which is expensive, so we engineer JavaScriptCore to make speculative bets only if they are a sure thing. Speculation involves using multiple execution tiers, some for profiling, and some to optimize based on that profiling. JavaScriptCore includes four tiers to also get an ideal latency/throughput trade-off on a per-function basis. A control system chooses when to optimize code based on whether it’s hot enough and how many times we’ve tried to optimize it in the past. All of the tiers use a common IR (bytecode in JavaScriptCore’s case) as input and provide independent implementation strategies with different throughput/latency and speculation trade-offs.

This post is an attempt to demystify our take on speculative compilation. We hope that it’s a useful resource for those interested in JavaScriptCore and for those interested in building their own fast language implementations (especially the ones with really weird and funny features).

A Tour of Inline Caching with Delete https://webkit.org/blog/10298/inline-caching-delete/ Fri, 10 Apr 2020 17:11:29 +0000 https://webkit.org/?p=10298 If you search for any JavaScript performance advice, a very popular recommendation is to avoid the delete operator. Today, this seems to be good advice, but why should it be vastly more expensive to delete a property than to add it?

The goal of my internship at Apple this winter was to improve the performance of the delete operator in JavaScriptCore. This has given me the opportunity to learn about how the various pieces of JavaScriptCore’s inline caching optimizations work, and we will take a quick tour of these optimizations today.

First, we will look at how expensive deletion really is in the major engines, and discuss why we might want to optimize it. Then, we will learn what inline caching is by implementing it for property deletion. Finally, we will look at the performance difference these optimizations make on benchmarks and microbenchmarks.

How expensive is Deletion?

First of all, why should we even bother optimizing deletion? Deletion should be fairly rare, and many JavaScript developers already know not to use it when possible. At the same time, we generally try to avoid having large hidden performance cliffs. To demonstrate how a simple delete statement can have a surprising effect on performance, I wrote a small benchmark that renders a scene progressively, and measures the time it takes to render each frame. This benchmark is not designed to precisely measure performance, but rather to make it easy to see large performance differences.

You can run the program yourself by pressing the run button below, which will calculate a new color value for every pixel in the image. It will then display how long it took to render, in milliseconds.

Next, we can try adding a single delete statement to a hot part of the code. You can do this by checking the “Use Delete” checkbox above, and clicking run again. This is what the code looks like:

class Point {
    constructor(x, y, z) {
        this.x = x
        this.y = y
        this.z = z
        this.a = 0
        if (useDelete)
            delete this.a
    }
}

The following results were measured on my personal computer running Fedora 30, comparing tip of tree WebKit (r259643) with all of the delete optimizations both enabled and disabled. I also used Firefox 74 and Chromium 77 from the Fedora repositories.

Here, we can see that the addition of a delete statement to a hot section of code can send the performance off a cliff! The primary reason for this is that deletion in JavaScriptCore used to disable all of the inline caching optimizations for an object, including when putting and getting properties of the object. Let’s see why.

Structures and Transitions

JavaScript code can be extremely dynamic, making it tricky to optimize. While many other languages have fixed-size structs or classes to help the compiler make optimizations, objects in JavaScript behave like dictionaries that allow associating arbitrary values and keys. In addition, objects frequently change their shape over time. For performance, JavaScriptCore uses multiple internal representations for objects, choosing between them at runtime based on how a program uses them. The default representation of objects uses something called a Structure to hold the general shape of an object, allowing many instances that have the same shape to share a structure. If two objects have the same structure ID, we can quickly tell that they have the same shape.

class A {
  constructor() {
    this.x = 5
    this.y = 10
  }
}
let a = new A() // Structure: { x, y }
let b = new A() // same as a

Structures store the names, attributes (such as read-only), and offsets of all of the properties owned by the object, and often the object’s prototype too. While the structure stores the shape, the object maintains its property storage, and the offset gives a location inside this storage.

At a high level, adding a property to an object using this representation goes as follows:

1) Construct a new structure with the property added.
2) Place the value into the property storage.
3) Change the object’s structure ID to point to the new structure.

One important optimization is to cache these structure changes in something called a structure transition table. This is the reason why in the example above, both objects share the same structure ID rather than having separate but equivalent structure objects. Instead of creating a new structure for b, we can simply transition to the structure we already created for a.
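A minimal sketch of this caching, with grossly simplified stand-ins for Structure and property storage (this is not JSC’s real Structure class): adding a property first consults the old structure’s transition table and only builds a new Structure on a miss, so identical shapes end up sharing structures.

#include <cstdint>
#include <map>
#include <string>

// Simplified stand-in for a Structure: property offsets plus cached transitions.
struct Structure {
    std::map<std::string, int> offsets;            // property name -> storage offset
    std::map<std::string, Structure*> transitions; // cached "add this property" edges
};

struct Object {
    Structure* structure;
    std::map<int, int64_t> storage; // offset -> value (simplified property storage)
};

// Assumes `name` is not already a property of the object.
void addProperty(Object& object, const std::string& name, int64_t value)
{
    Structure*& cached = object.structure->transitions[name];
    if (!cached) {
        // 1) Construct a new structure with the property added (done once per shape).
        cached = new Structure { object.structure->offsets, {} };
        int newOffset = static_cast<int>(cached->offsets.size());
        cached->offsets[name] = newOffset;
    }
    // 2) Place the value into the property storage.
    object.storage[cached->offsets[name]] = value;
    // 3) Change the object's structure (ID) to point to the new structure.
    object.structure = cached;
}

With this in place, two objects built with the same sequence of property additions end up pointing at the same Structure, which is exactly what lets a and b in the example above share one.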

This representation improves performance for the case when you have a small number of object shapes that change predictably, but it is not always the best representation. For example, if there are many properties that are unlikely to be shared with other objects, then it is faster to create a new structure for every instance of this object. Since we know that there is a 1:1 relationship between this object and its structure, we no longer have to update the structure ID when making changes to the shape of the object. This also means that we can no longer cache most property accesses, since these optimizations rely on checking the structure ID.

In previous versions of JavaScriptCore, this is the representation that was chosen for any object that had a deleted property. This is why we see such a large performance difference when a delete is performed on a hot object.

The first optimization of my internship was to instead cache deletion transitions, allowing these objects to continue to be cached.

class A {
  constructor() {
    this.x = 5
    this.y = 10
    delete this.y
  }
}
let a = new A() // Structure: { x } from { x, y }
let b = new A() // same as a

Before this change, both a and b had distinct structures. Now they are the same. This not only removes our performance cliff, but will enable further optimizations to the deletion statement itself.

In the path tracing example above, we get the following performance results:

Inline Caching

Now that we can cache deletion transitions, we can further optimize the act of property deletion itself. Getting and putting properties both use something called an inline cache to do this, and now deletion does too. The way this works is by emitting a generic version of these operations that modifies itself over time to handle frequent cases faster.

I implemented this optimization in all three of our JIT compilers for delete, and our interpreter also supports inline caching. To learn more about how JavaScriptCore uses multiple compilers to trade off latency and throughput, see this blog post. The summary is that code is first executed by the interpreter. Once it is sufficiently hot, it gets compiled into increasingly optimized pieces of machine code by our compilers.

An inline cache uses structures to keep track of the object shapes it has seen. We will see how inline caching now works for delete, which will give us some insight into how it works for getting and putting properties too.

When we see a delete property statement, we will first emit something called a patchable jump. This is a jump instruction that can be changed later on. At first, it will simply jump to some code that calls the slow path for property deletion. This slow path will store a record of this call, however, and after we have performed a few accesses, we will attempt to emit our first inline cache. Let’s walk through an example:
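Conceptually, the bookkeeping on the slow path looks something like the sketch below. The names and the threshold are hypothetical, and the code generation itself is only hinted at; the real fast paths are the machine-code stubs shown later in this section.

#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical bookkeeping for one delete-by-id inline cache site.
struct DeleteIC {
    void* patchableJumpTarget = nullptr;  // initially points at the slow path itself
    std::vector<uint32_t> seenStructures; // structure IDs observed at this site
};

// Stand-in for the JIT helper that would emit a stub covering the recorded cases.
void* emitStubForCases(const std::vector<uint32_t>&) { return nullptr; }

bool slowPathDeleteById(DeleteIC& ic, uint32_t structureID)
{
    bool result = true; // the generic deletion itself would happen here
    if (std::find(ic.seenStructures.begin(), ic.seenStructures.end(), structureID)
            == ic.seenStructures.end())
        ic.seenStructures.push_back(structureID);
    // After a few observations, generate a stub that fast-paths the recorded
    // structures and repatch the jump so future deletes skip this call entirely.
    if (ic.seenStructures.size() >= 3)
        ic.patchableJumpTarget = emitStubForCases(ic.seenStructures);
    return result;
}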

function test() {
    for (let i = 0; i < 50; ++i) {
        let o = { }
        if (i < 10)
            o.a = 1
        else if (i < 20)
            o.b = 1
        else
            o.c = 1

        delete o.a
    }
}

for (let i = 0; i < 100; ++i)
    test()

First, we run test a few times causing it to tier up to the baseline compiler. Inside test(), we see that the code deletes properties from objects with three different structures, while giving enough time for us to emit a new inline cache for each. Let’s see what the baseline code looks like:

          jmp [slow path] ; This jump is patchable
continuation:
          [package up boolean result]
          ; falls through to rest of program
...
slow path:
          [call into C++ slow path]
          jmp [next bytecode]

This code performs a few checks, then jumps either to the slow path call or the patchable jump target. The slow path call is placed outside the main portion of the generated code to improve cache locality. Right now, we see that the patchable jump target also points to the slow path, but that will change.

At this point in time, deletion is a jump to the slow path plus a call into C++, which is quite expensive.

Every time the slow path call is made, it collects and saves information about its arguments. In this example, the first case that it decided to cache is the delete miss for the structure with shape { c }. It generates the following code, and repatches the patchable jump target to go here instead:

      0x7f3fa96ff6e0: cmp [structure ID for { c }], (%rsi)
      0x7f3fa96ff6e6: jnz [slow path]
      0x7f3fa96ff6ec: mov $0x1, %eax ; return value true
      0x7f3fa96ff6f1: jmp [continuation]
      0x7f3fa96ff6f6: jmp [slow path]

We see that if the structure ID matches, we simply return true and jump back to the continuation which is responsible for packaging up the boolean result and continuing execution. We save ourselves an expensive call into C++, running just a few instructions in its place.

Next, we see the following inline cache get generated as the engine decides to cache the next case:

      0x7f3fa96ff740: mov (%rsi), %edx
      0x7f3fa96ff742: cmp [structure ID for { c }], %edx
      0x7f3fa96ff748: jnz [case 2]
      0x7f3fa96ff74e: mov $0x1, %eax
      0x7f3fa96ff753: jmp [continuation]
case 2:
      0x7f3fa96ff758: cmp [structure ID for { a }], %edx
      0x7f3fa96ff75e: jnz [slow path]
      0x7f3fa96ff764: xor %rax, %rax; Zero out %rax
      0x7f3fa96ff767: mov %rax, 0x10(%rsi); Store to the property storage
      0x7f3fa96ff76b: mov [structure ID for { } from { a }], (%rsi)
      0x7f3fa96ff771: mov $0x1, %eax
      0x7f3fa96ff776: jmp [continuation]

Here, we see what happens if the property exists. In this case, we zero the property storage, change the structure ID, and return true (0x1), to the continuation for our result to be packaged up.

Finally, we see the complete inline cache:

      0x7f3fa96ff780: mov (%rsi), %edx
      0x7f3fa96ff782: cmp [structure ID for { c }], %edx
      0x7f3fa96ff788: jnz [case 2]
      0x7f3fa96ff78e: mov $0x1, %eax
      0x7f3fa96ff793: jmp [continuation]
case 2:
      0x7f3fa96ff798: cmp [structure ID for { a }], %edx
      0x7f3fa96ff79e: jnz [case 3]
      0x7f3fa96ff7a4: xor %rax, %rax
      0x7f3fa96ff7a7: mov %rax, 0x10(%rsi)
      0x7f3fa96ff7ab: mov $0x37d4, (%rsi)
      0x7f3fa96ff7b1: mov $0x1, %eax
      0x7f3fa96ff7b6: jmp [continuation]
case 3:
      0x7f3fa96ff7bb: cmp [structure ID for { b }], %edx
      0x7f3fa96ff7c1: jnz [slow path]
      0x7f3fa96ff7c7: mov $0x1, %eax
      0x7f3fa96ff7cc: jmp [continuation]

When we run the code above in different browsers, we get the following performance numbers with and without delete optimizations:

Inlining Delete

Now that we have seen how an inline cache is generated, we would like to see how we can feed back this profiling information into the compiler. We will see how our first optimizing compiler, the DFG, can attempt to inline delete statements like it already does for puts and gets. This will then allow all of the other optimizations we have to see inside the delete statement, rather than seeing a black box.

We will demonstrate this by looking at just one of these optimizations, our object allocation sinking and elimination phase. This phase attempts to prevent allocations of objects when they do not escape their scope, but it previously saw property deletion as an escape. Consider the following code:

function noEscape(i) {
    let foo = { i }
    delete foo.x
    return foo.i
}

for (let i = 0; i < 1000000; ++i)
    noEscape(5)

As the delete is run, it will first get an inline cache. As the code continues to tier up, the DFG tier will eventually look at the cases covered by the inline cache and see only one structure. It will decide to try inlining the delete.

It first speculates that the structure ID of the target object is for the structure with shape { i }. That is, we check that the structure of the object matches, and if it does not, we exit to lower-tier code. This means that our engine can assume that the structures match for any subsequent code. If this assumption turns out to be false, we will eventually recompile the code again without it.

D@25:   NewObject()
D@30:   CheckStructure(D@25, Structure for { })
D@31:   PutByOffset(D@25, D@25, D@36, id0{i})
D@32:   PutStructure(D@25, Structure for { i })
D@36:   CheckStructure(D@25, Structure for { i })
D@37:   JSConstant(True)
D@42:   CheckStructure(D@25, Structure for { i })
D@43:   GetByOffset(D@25, 0)
D@34:   Return(D@43)

We see here that we make a new object, and then see inlined versions of put, delete and get. Finally, we return.

If the delete statement were compiled into a DeleteByID node, later optimizations could not do much to optimize this code. They don’t understand what effects the delete has on the heap or the rest of the program. We see that once inlined however, the delete statement actually becomes very simple:

D@36:   CheckStructure(D@25, Structure for { i })
D@37:   JSConstant(True)

That is, the delete statement becomes simply a check and a constant! Next, while it often isn’t possible to prove every check is redundant, in this example our compiler will be able to prove that it can safely remove them all. Finally, our object allocation elimination phase can look at this code and remove the object allocation. The final code for noEscape() looks like this:

D@36:   GetStack(arg1)
D@25:   PhantomNewObject()
D@34:   Return(D@36)

PhantomNewObject does not cause any code to be emitted, making our final code trivial! We get the following performance results:

Results

The caching of delete transitions caused a 1% improvement overall on the speedometer benchmark. The primary reason for this is that the EmberJS Debug subtest uses the delete statement frequently, and delete transition caching progresses this benchmark by 6%. This subtest was included in the benchmark because it was discovered that many websites do not ship the release version of EmberJS. The other optimizations we discussed were all performance-neutral on the macro-benchmarks we track.

In conclusion, we have seen how inline caching works first-hand, and even progressed a few benchmarks in the process! The order that we learned about these optimizations (and indeed the order that I learned about them this term) follows closely the order that they were discovered too. If you would like to learn more, check out these papers on implementations of inline caching in Smalltalk and Self, two languages that inspired the design of JavaScript. We can see the evolution from monomorphic inline caches in Smalltalk to polymorphic inline caches with inlining support in Self, just like we saw on our tour today. Implementing inline caching for delete was an extremely educational experience for me, and I hope you enjoyed reading about it.

While deletion will still be rare, it still feels great to make JSC even more robust. You can get in touch by filing a bug or by joining the WebKit slack workspace. You can also consider downloading the code and hacking on it yourself!

A New Bytecode Format for JavaScriptCore https://webkit.org/blog/9329/a-new-bytecode-format-for-javascriptcore/ Fri, 21 Jun 2019 17:00:20 +0000 https://webkit.org/?p=9329 In revision r237547 we introduced a new bytecode format for JavaScriptCore (JSC). The goals of the new format were to improve memory usage and allow the bytecode to be cached on disk, while the previous format was optimized for interpreter throughput at the cost of memory usage.

In this post, we will start with a quick overview of JSC’s bytecode, key aspects of the old bytecode format and the optimizations it enabled. Next, we will look into the new format and how it affects interpreter execution. Finally, we will look at the impact of the new format on memory usage and performance and how this rewrite improved type safety in JavaScriptCore.

Background

Before JSC executes any JavaScript code, it must lex, parse and generate bytecode for it. JSC has 4 tiers of execution:

  • Low Level Interpreter (LLInt): the start-up interpreter
  • Baseline JIT: a template JIT
  • DFG JIT: a low-latency optimizing compiler
  • FTL JIT: a high-throughput optimizing compiler

Execution starts by interpreting the bytecode, at the lowest tier, and as the code gets executed more it gets promoted to a higher tier. This is described in a lot more detail in this blog post about the FTL.

The bytecode is the source of truth throughout the whole engine. The LLInt executes the bytecode. The baseline is a template JIT, which emits snippets of machine code for each bytecode instruction. Finally, the DFG and FTL parse the bytecode and emit DFG IR, which is then run through an optimizing compiler.

Because the bytecode is the source of truth, it tends to stay alive in memory throughout the whole program execution. In JavaScript-heavy websites, such as Facebook or Reddit, the bytecode is responsible for 20% of the overall memory usage.

The Bytecode

To make things more concrete, let’s look at a simple JavaScript program, learn how to inspect the bytecode generated by JSC and how to interpret the bytecode dump.

// double.js
function double(a) {
    return a + a;
}
double(2);

If you run the above program with jsc -d double.js, JSC will dump all the bytecode it generates to stderr. The bytecode dump will contain the bytecode generated for double:

[   0] enter
[   1] get_scope          loc4
[   3] mov                loc5, loc4
[   6] check_traps
[   7] add                loc7, arg1, arg1, OperandTypes(126, 126)
[  13] ret                loc7

Each line starts with the offset of the instruction in brackets, followed by the opcode name and its operands. Here we can see the operands loc for local variables, arg for function arguments and OperandTypes, which is metadata about the predicted types of the arguments.

Old Bytecode Format

The old bytecode format had a few issues that we wanted to fix:

  • It used too much memory.
  • The instruction stream was writable, which prevented memory-mapping the bytecode stream.
  • It had optimizations that we no longer benefited from, such as direct threading.

In order to better understand how we addressed these issues in the new format, we need a basic understanding of the old bytecode format. In the old format, instructions could be in one of two forms: unlinked, which is compact and optimized for storage, and linked, which is inflated and optimized for execution, containing memory addresses of runtime objects directly in the instruction stream.

Unlinked Instructions

The instructions were encoded using variable-width encoding. The opcode and each operand took as little space as possible, ranging from 1 to 5 bytes. Take the add instruction from the program above as an example; it would take 6 bytes: one for the opcode (add), one for each of the registers (loc7, arg1 and arg1 again), and two bytes for the operand types.

Unlinked instructions

Linking / Linked Instructions

Before executing the bytecode it needed to be linked. Linking inflated all the instructions, making the opcode and each of the operands pointer-sized. The opcode was replaced by the actual pointer to the implementation of the instruction, and the profiling-related metadata was replaced with the memory address of the newly allocated profiling data structures. The add from above took 40 bytes to represent:

Linked instructions
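As a rough model of the size difference (illustrative structs rather than JSC’s actual encoding), the unlinked add fits in six bytes, while the linked form inflates every field to pointer size:

#include <cstdint>

// Unlinked: 1-byte opcode + three 1-byte register operands + 2 bytes of
// operand-type metadata = 6 bytes.
struct UnlinkedAdd {
    uint8_t opcode;
    uint8_t dst, lhs, rhs;
    uint16_t operandTypes;
};
static_assert(sizeof(UnlinkedAdd) == 6, "packed as described above");

// Linked: the opcode becomes a pointer to the instruction's implementation and
// every operand becomes pointer-sized, so 5 * 8 = 40 bytes on a 64-bit target.
struct LinkedAdd {
    void* implementation;
    intptr_t dst, lhs, rhs;
    void* profilingMetadata;
};
static_assert(sizeof(void*) != 8 || sizeof(LinkedAdd) == 40, "40 bytes on a 64-bit target");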

Execution

The bytecode is executed by the LLInt, which is written in a portable assembly called offlineasm. Here are a few notes about offlineasm that may help understand the code snippets that will follow:

  • Temporary registers are t0-t5, argument registers are a0-a3 and return registers are r0 and r1. For floating point, the equivalent registers are ft0-ft5, fa0-fa3 and fr. cfp and sp are special registers that hold the call frame and stack pointer respectively.
  • Instructions have one of the following suffixes: b for byte, h for 16-bit word, i for 32-bit word, q for 64-bit word and p for pointer (either 32- or 64-bit depending on the architecture).
  • Macros are lambda expressions that take zero or more arguments and return code. Macros may be named or anonymous and can be passed as arguments to other macros.

If you want to learn more about offlineasm, LowLevelInterpreter.asm is the main offlineasm file in JSC and it contains a more in-depth explanation of the language at the top.

The old bytecode format had two big advantages related to interpreter throughput: direct threading and inline caching.

Direct Threading

The linked bytecode had the actual pointer to the offlineasm implementation of the instruction in place of the opcode, which made executing the next instruction as simple as advancing the program counter (PC) by the size of the current instruction and making an indirect jump. That is illustrated in the following offlineasm code:

macro dispatch(instructionSize)
    addp instructionSize * PtrSize, PC
    jmp [PC]
end

Notice that dispatch is a macro; this way the code is duplicated at the end of every instruction. This is very efficient, since it’s just an addition plus an indirect branch. The duplication reduces branch prediction pollution since we don’t have all instructions jumping to a common label and sharing the same indirect jump to the next instruction.

Inline Caching

Since the instruction stream is writable and all the arguments are pointer-sized, we can store metadata in the instruction stream itself. The best example of this is for the get_by_id instruction, which is emitted when we load from an object in JavaScript.

object.field

This emits a get_by_id for loading the property field from object. Since this is one of the most common operations in JavaScript, it’s crucial that it be fast. JSC uses inline caching as a way of speeding this up. The way this was done in the interpreter was by reserving space in the instruction stream to cache metadata about the load. More specifically, we recorded the StructureID of the object we are loading from and the memory offset from which we have to load the value. The LLInt implementation of get_by_id looked like the following:

_llint_op_get_by_id:
    // Read operand 2 from the instruction stream as a signed integer,
    // i.e. the virtual register of `object`
    loadisFromInstruction(2, t0)

    // Load `object` from the stack and its structureID
    loadConstantOrVariableCell(t0, t3, .opGetByIdSlow)
    loadi JSCell::m_structureID[t3], t1

    // Read the cached StructureID and verify that it matches the object's
    loadisFromInstruction(4, t2)
    bineq t2, t1, .opGetByIdSlow

    // Read the PropertyOffset of `field` and load the actual value from it
    loadisFromInstruction(5, t1)
    loadPropertyAtVariableOffset(t1, t3, t0)

    // Read the virtual register where the result should be stored and store
    // the value to it
    loadisFromInstruction(1, t2)
    storeq t0, [cfr, t2, 8]
    dispatch(constexpr op_get_by_id_length)

.opGetByIdSlow
    // Jump to C++ slow path
    callSlowPath(_llint_slow_path_get_by_id)
    dispatch(constexpr op_get_by_id_length)

New Bytecode Format

When designing the new bytecode, we had two major goals: it should be more compact and easily cacheable on disk. With these two goals, we expected significant improvements in memory usage and set out to improve runtime performance through caching. This shaped how we encoded instructions as well as runtime metadata.

The first and biggest change is that we no longer have a separate linked encoding for execution. This immediately means that the bytecode can no longer be direct threaded, since the address of the instruction could not be stored to disk, as it changes with every program invocation. Removing this optimization was a deliberate choice of the design.

In order to make the single format suitable for both storage and execution, each instruction can be encoded as either narrow or wide.

Narrow Instructions

In a narrow instruction, the opcode and its operands each take 1-byte. Here’s the add instruction again, this time as a narrow instruction in the new format:

Narrow

Ideally, all instructions would be encoded as narrow, but not all operands fit into a single byte. If any of the operands (or the opcode for that matter) requires more than 1 byte, the whole instruction is promoted to wide.

Wide Instructions

A wide instruction consists of a special 1-byte opcode, op_wide, followed by a series of 4-byte slots for the original opcode and each of its arguments.

Wide Instructions

An important thing to notice here is that it is not possible to have a single wide operand – if one operand overflows, the whole instruction becomes wide. This is an important compromise, because even though it might seem wasteful at first, it makes the implementation much simpler: the offset of any given operand is the same regardless of the instruction being narrow or wide. The only difference is whether the instruction stream is treated as an array of 4-byte values or 1-byte values.
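The payoff of that compromise can be sketched as follows. This is a simplified model rather than JSC’s actual accessors, and it ignores the 1-byte op_wide prefix: operand i always lives in slot i + 1, and only the slot width changes between the narrow and wide views of the stream.

#include <cstdint>
#include <cstring>

// Simplified model: an instruction is a run of equally sized slots, with slot 0
// holding the opcode and slot i + 1 holding operand i.
template<typename Slot> // uint8_t for narrow instructions, uint32_t for wide ones
uint32_t getOperand(const uint8_t* instruction, unsigned i)
{
    Slot slot;
    std::memcpy(&slot, instruction + (i + 1) * sizeof(Slot), sizeof(Slot));
    return slot;
}

// The interpreter picks the view once per instruction; the operand index and
// the decoding logic stay the same either way.
uint32_t getOperandAny(const uint8_t* instruction, unsigned i, bool isWide)
{
    return isWide ? getOperand<uint32_t>(instruction, i)
                  : getOperand<uint8_t>(instruction, i);
}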

Linking / Metadata Table

The other fundamental part of the new bytecode is the metadata table. During linking, instead of constructing a new instruction stream, we initialize a side table with all the writable data associated with any given instruction. Conceptually, the table is 2-dimensional: its first index is the opcode for the instruction, e.g. add, which points to a monomorphic array of metadata entries for that specific instruction. For example, we have a struct called OpAdd::Metadata to hold the metadata for the add instruction, so accessing metadataTable[op_add] will result in a pointer of type OpAdd::Metadata*. Additionally, each instruction has a special operand, metadataID, which is used as the second index in the metadata table.

Metadata

For compactness, the metadata table is laid out in memory as a single block of memory. Here’s a representation of what the table above would actually look like in memory.

Metadata Memory Table

At the start of the table is the header, an array containing an integer for each opcode representing the offset from the start of the table where the metadata for that opcode starts. The next region is the payload, which contains the actual metadata for each opcode.
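
Putting the last two sections together, a lookup into the table can be sketched roughly as follows; the names are illustrative stand-ins, not JSC’s real types:

#include <cstdint>

struct OpAddMetadata { /* writable, per-instruction data, e.g. the add's arith profile */ };

// metadataTable points at the single block of memory described above:
// a header of per-opcode offsets followed by the payload.
OpAddMetadata* metadataForAdd(uint8_t* metadataTable, unsigned opcode, unsigned metadataID)
{
    unsigned* header = reinterpret_cast<unsigned*>(metadataTable);
    uint8_t* entriesForOpcode = metadataTable + header[opcode];             // offset from the table start
    return reinterpret_cast<OpAddMetadata*>(entriesForOpcode) + metadataID; // this instruction's entry
}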

Execution

The biggest changes to execution are that the interpreter is now indirect threaded, and for any writable or runtime metadata, we need to read from the metadata table. Another interesting aspect is how wide instructions are executed.

Indirect Threading

The process of mapping from opcode to instruction address, which used to be done as part of linking in the old format, is now part of the interpreter dispatch. This means that extra loads are required, and the dispatch macro now looks like the following:

macro dispatch(instructionSize)
    // Advance PC past the current instruction
    addp instructionSize, PC
    // Load the 1-byte opcode of the next instruction
    loadb [PC], t0
    // Load the base address of the opcode-to-handler map and jump to the
    // handler for this opcode
    leap _g_opcodeMap, t1
    jmp [t1, t0, PtrSize]
end

Metadata Table

The same kind of “inline caching”† still applies to get_by_id, but given that we need to write the metadata at runtime and the instruction stream is now read-only, this data needs to live in the metadata table.

Loads from the metadata table are a bit costly, since the CodeBlock needs to be loaded from the call frame, the metadata table from the CodeBlock, the array for the current opcode from the table and then finally the metadata entry from the array based on the current instruction’s metadataID.

† I write “inline caching” in quotes here since it’s not technically inline anymore.

Wide Instruction Execution

The execution of wide instructions is surprisingly simple: there are two functions for each instruction, one for the narrow version and another for the wide. We take advantage of offlineasm to generate the two functions from a single implementation. (e.g. the implementation of op_add will automatically generate its wide version, op_add_wide.)

By default, the narrow version of the instruction is executed, until the special instruction op_wide is executed. All it does is read the next opcode, and dispatch to its wide version, e.g. op_add_wide. Once the wide opcode is done, it goes back to dispatching a narrow instruction, since every wide instruction needs the 1-byte op_wide prefix.

Memory Usage

The new bytecode format uses approximately 50% less memory, which means a reduction of 10% in overall memory usage for JavaScript-heavy websites, such as Facebook or Reddit.

Here’s a chart comparing the bytecode size for reddit.com, apple.com, facebook.com and gmail.com.

Bytecode size charts for reddit.com, apple.com, facebook.com and gmail.com

Notice that in the After case the Metadata has been merged with Linked to represent the memory used by the metadata table and Unlinked represents the actual bytecode instructions. The reason is that a significant part of what lives in the new metadata used to live in the old linked bytecode. Otherwise, the comparison would look skewed, since the whole Linked bar would be gone and the Metadata would seem to have doubled in size, when in reality it was part of the data from Linked that moved into Metadata.

Performance

As discussed earlier, indirect threading increases the overhead of the interpreter dispatch. However, given the average complexity of the bytecode instructions in JSC we did not expect the dispatch overhead to play a meaningful role in the overall performance of the interpreter. This is further amortized by the multiple JIT tiers. We profiled purely switching from direct to indirect threading in CLoop, the C++ backend for LLInt, and the results were neutral even with the JITs disabled.

Loads from the metadata table are costly, involving a large chain of loads. In order to reduce this chain, we pin a callee save register in the interpreter to hold the pointer to the metadata table at all times. Even with this optimization, it takes three loads to access the metadata entry. This was a 10% interpreter slowdown, but was neutral when running with all JIT tiers enabled.

Bonus: Type Safety

Changing the bytecode format required changes across the whole engine, so we decided to take this as an opportunity to improve the bytecode-related infrastructure, increasing the type safety, readability and maintainability of the code.

To support all these changes, we first needed to change how instructions were specified. Originally, instructions were declared in a JSON file, with their respective name and length (the number of operands plus one, for the opcode). Here is how the add instruction was declared:

{ "name": "op_add", "length": 5 }

When accessing the instruction and its operands from C++ the code would look roughly like this:

SLOW_PATH_DECL(slow_path_add)
{
    JSValue lhs = OP_C(2).jsValue();
    JSValue rhs = OP_C(3).jsValue();
    ...
}

Notice that operands were accessed by their offset: OP_C was a macro that would access the Instruction* pc argument (hidden here by the SLOW_PATH_DECL macro) at the given offset. However, the type of the data at the specified offset is unknown, so the result would have to be cast to the desired type. This is error-prone and makes the code harder to understand. The only way to learn which operands an instruction had (and their types) was to look for usages of it and look at variable names and type casts.

With the new infrastructure, when declaring an instruction, a name and type must be given for each of the operands, as well as declaring all the data that will be stored in the metadata table.

op :add,
    args: {
        dst: VirtualRegister,
        lhs: VirtualRegister,
        rhs: VirtualRegister,
        operandTypes: OperandTypes,
    },
    metadata: {
        arithProfile: ArithProfile,
    }

With this extra information, it is now possible to generate a fully typed struct for each instruction, which results in safer C++ code:

SLOW_PATH_DECL(slow_path_add)
{
    OpAdd bytecode = pc->as<OpAdd>();
    JSValue lhs = GET_C(bytecode.m_lhs);
    JSValue rhs = GET_C(bytecode.m_rhs);
    ...
}

In the example above, we first need to convert the generic Instruction* pc to the instruction we want to access, which will perform a runtime check. If the opcode matches, it returns an instance of the generated struct, OpAdd, which includes the fields m_lhs and m_rhs, both of type VirtualRegister as specified in our instruction declaration.
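
For illustration, the generated struct for add might look roughly like this; it is a hand-written sketch based on the declaration above, not the code the generator actually emits:

// VirtualRegister, OperandTypes, and ArithProfile are the JSC types named in the
// declaration above.
struct OpAdd {
    VirtualRegister m_dst;
    VirtualRegister m_lhs;
    VirtualRegister m_rhs;
    OperandTypes m_operandTypes;
    unsigned m_metadataID; // index of this instruction's entry in the metadata table

    struct Metadata {
        ArithProfile m_arithProfile; // runtime-writable, lives in the metadata table
    };
};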

The extra information also allowed us to replace a lot of mechanical code with safer, generated code. For example, we auto generate all the type safe code for writing instructions to the instruction stream, which automatically does the narrow/wide fitting. We also generate all the code to dump the instructions for debugging.

Conclusion

JavaScriptCore has a new bytecode format that uses 50% less memory on average and is available on Safari 12.1 and Safari Technology Preview. The work on the caching API for the new bytecode is already underway in the WebKit repository, and anyone interested is welcome to follow along and contribute on bugs.webkit.org! You can also get in touch with me on Twitter with any questions or comments.

Web High Level Shading Language https://webkit.org/blog/8482/web-high-level-shading-language/ Mon, 12 Nov 2018 23:21:46 +0000 https://webkit.org/?p=8482 This article introduces a new graphics shading language for the Web named Web High Level Shading Language (WHLSL, pronounced “whistle”). The language is inspired by HLSL, the dominant shading language for graphics app developers. It extends HLSL for the Web platform to be safe and secure. It’s easy to read and write, and is well-specified using formal techniques.

Background

Over the past few decades, 3D graphics have changed significantly, and the APIs programmers use to write 3D applications have also changed accordingly. Five years ago, state-of-the-art graphics applications would use OpenGL to perform their rendering. However, the past few years have seen a shift in the 3D graphics industry toward newer, lower-level graphics frameworks that better match the behavior of real hardware. In 2014, Apple created the Metal framework, which lets iOS and macOS apps use the full power of the GPU. In 2015, Microsoft created Direct3D 12, a major update to Direct3D which allows for console-level efficiency for rendering and compute. In 2016, the Khronos Group published the Vulkan API, which is primarily used on Android, that offers similar advantages.

Just like how WebGL brought OpenGL to the Web, the Web community is pursuing bringing this type of new, low-level 3D graphics API to the platform. Last year, Apple established the WebGPU Community Group inside the W3C to standardize a new 3D graphics API which provides the benefits of these native APIs, but is also suitable for the Web environment. This new Web API is implementable on top of Metal, Direct3D, and Vulkan. All of the major browser vendors are participating and contributing to the standardization effort.

Each of these modern 3D graphics APIs uses shaders, and WebGPU is no different. Shaders are programs that take advantage of the specialized architecture of GPUs. In particular, GPUs are better than CPUs at heavy parallel numerical processing. To take advantage of both architectures, modern 3D apps use a hybrid design, using both the CPU and the GPU for different tasks. By leveraging the best traits of each, modern graphics APIs provide a powerful framework for developers to create complex, rich, and fast 3D apps. Apps designed for Metal use the Metal Shading Language, apps designed for Direct3D 12 use HLSL, and apps designed for Vulkan use SPIR-V or GLSL.

Language Requirements

Just like its native counterparts, WebGPU needs a shader language. This language needs to meet several requirements that make it well-tailored for the Web platform.

The language needs to be safe. No matter what the application does, the shader must only be able to read or write data from the Web page’s domain. Without this guarantee, a malicious website could run a shader that reads pixels out of other parts of your screen, even from native apps.

The language needs to be well-specified. The language specification has to be explicit about whether every single possible string of characters is a valid program or not. As with all other Web formats, a shading language for the Web must be precisely specified to guarantee interoperability between browsers.

The language also needs to be well-specified so that it can be used as a compilation target. Many rendering teams write shaders in their own custom in-house language, and then cross-compile to whichever language is necessary. For this reason, the language should have a reasonably small set of unambiguous grammar and type checking rules that compiler writers can reference when emitting this language.

This language needs to be translatable to Metal Shading Language, HLSL (or DXIL), and SPIR-V. This is because WebGPU is designed to work on top of Metal, Direct3D 12, and Vulkan, so the shaders need to be able to be represented in a form that each of those APIs can accept.

The language needs to be performant. The entire reason developers want to use the GPU in the first place is for performance. The compiler itself needs to run quickly, and programs produced by the compiler need to run efficiently on real GPUs.

This language needs to evolve with the WebGPU API. WebGPU features such as the binding model and tessellation model interact deeply with the shading language. Though it is feasible to have the language developed independently of the API, having the WebGPU API and shading language in the same forum ensures the goals are shared, and makes development more streamlined.

The language needs to be easy for a developer to read and write. There are two pieces to this: firstly, the language should be familiar to both GPU programmers and CPU programmers. GPU programmers are important clients because they have experience writing shaders. CPU programmers are important because GPUs are increasingly being used for purposes beyond rendering, including machine learning, computer vision, and neural networks. For them, the language should be compatible with familiar programming language concepts and syntax.

The second piece to this is that the language should be human-readable. The culture of the Web is one where anyone can start writing a webpage with just a text editor and a browser. This democratization of content is one of the Web’s greatest strengths. This culture has created a rich ecosystem of tools and inspectors, where tinkerers can investigate how any webpage works simply by View-Source. Having a single canonical human-readable language will greatly aid in community adoption of the WebGPU API.

All the major languages used on the web today are human-readable, with one exception: WebAssembly. The WebAssembly Community Group expected that parsing a bytecode would be more performant than parsing a text language. However, that turned out not to be true; asm.js, which is JavaScript source, is still faster than WebAssembly for many use cases.

Similarly, using a bytecode format such as WebAssembly doesn’t spare the browser from running optimization passes over the source code: every major browser runs optimization passes on the bytecode prior to execution. Unfortunately, the hoped-for simplification of compilers never panned out.

There is active debate in the Community Group about whether or not this human-readable language should be the one that’s natively accepted by the API, but the group agrees that the language that shaders are written in should be easily readable and writable.

A New Language? Really?

While there are a number of existing languages, none have been designed with both the Web and modern graphics applications in mind, and none address the requirements listed above. Before we describe WHLSL, let’s look at some existing languages.

Metal Shading Language is very similar to C++, which means it has all the power of bit-casts and raw pointers. It’s extremely powerful; the same source code can even be compiled for CPUs and GPUs. It’s extremely easy to port existing CPU-side code to Metal Shading Language. Unfortunately, all this power has some downsides. In Metal Shading Language, you could, for example, write a shader that casts a pointer to an integer, adds 17, casts it back to a pointer, and dereferences it. This is a security problem because it means the shader could access any resource that happens to be in the address space of the application, which is contrary to the Web’s security model. Theoretically, it could be possible to specify a dialect of Metal Shading Language that doesn’t have raw pointers, but pointers are so fundamental to the C and C++ languages that the result would be completely unfamiliar. C++ also heavily relies on undefined behavior, so any effort to fully specify each of C++’s numerous features would be unlikely to be successful.

HLSL is the language that Direct3D shaders are written in. It’s currently the most popular realtime shading language in the world, and is therefore the most familiar language to graphics programmers. There are multiple implementations, but there is no formal specification, making it difficult to create consistent, interoperable implementations. Nonetheless, given HLSL’s ubiquity, it is valuable to adopt its syntax as much as possible in the design of WHLSL.

GLSL is the language used by WebGL, which brought it to the Web platform. However, reaching cross-browser interoperability was extremely difficult due to incompatibilities in GLSL compilers. There remains a long tail of security and portability bugs with GLSL still being investigated. Also, GLSL is showing its age. It’s limited in that it doesn’t have pointer-like objects, or the ability to have variable length arrays. Its inputs and outputs are global variables with hardcoded names.

SPIR-V was designed to be a low-level universal intermediate format for the actual shading languages that developers would use. People do not author SPIR-V; they use a human-readable language, and then convert it into SPIR-V bytecode using a tool.

There are a few challenges with adopting SPIR-V for the web. First, SPIR-V was not written with security as a first principle, and it’s unclear whether it can be modified to satisfy the security requirements of the web. Forking the SPIR-V language means developers would have to recompile their shaders, possibly being forced to rewrite their source code anyway. Additionally, browsers would still be unable to trust incoming bytecode, and would be required to validate programs to make sure they are not doing anything insecure. And since Windows and macOS/iOS don’t support Vulkan, the incoming SPIR-V would still need to be translated/compiled into another language. Weirdly, this would mean on those two platforms, the starting point and the ending point are both human readable, but the bit in between is obfuscated with no benefit.

Second, a significant amount of the SPIR-V specification exists inside separate documents known as “execution environments.” A SPIR-V execution environment currently doesn’t exist for the Web, and without one of these execution environments, many critical pieces of SPIR-V are undefined, such as which of the over 50 optional capabilities are supported.

Third, many graphics applications such as Babylon.js require dynamically modifying shaders at runtime. Using a bytecode format means that these applications would have to include a compiler written in JavaScript that runs in the browser to produce the bytecode from the dynamically created shader. This would significantly increase the bloat of these sites and would lead to worse performance.

Though JavaScript is the canonical language for the Web, its properties make it a poor candidate for a shading language. One of its strengths is its flexibility, but this dynamism leads to many conditionals and divergent control flow, which GPUs are not designed to execute efficiently. It is also garbage-collected, a process that isn’t well-suited to GPU hardware.

WebAssembly is another familiar possibility, but it also doesn’t map well to the architecture of GPUs. For example, WebAssembly assumes a single dynamically-sized heap, but GPU programs operate with access to multiple dynamically-sized buffers. There isn’t a high-performance way to map between the two models without recompiling.

Therefore, after a fairly exhaustive search for a suitable language, we couldn’t find one which adequately meets the requirements of the project. So, the Community Group is making a new language. Creating a new language is a large task, but we feel there is an opportunity to produce something new that uses modern programming language design principles and fulfills our requirements.

WHLSL

WHLSL is a new shading language that fits the Web platform. It’s being developed by the WebGPU Community Group at the W3C, and the group is working on a specification, a compiler, and a CPU-side interpreter to show correctness.

The language is based on HLSL, but simplifies and extends it. We’d really like existing HLSL shaders to just work as WHLSL shaders. Since WHLSL is a well-specified, powerful, and expressive shading language, some HLSL shaders will need tweaks to work, but in return WHLSL can guarantee safety and the other benefits outlined above.

For example, here is an example vertex shader from Microsoft’s DirectX-Graphics-Samples repository. It works as a WHLSL shader without any changes:

VSParticleDrawOut output;
output.pos = g_bufPosVelo[input.id].pos.xyz;
float mag = g_bufPosVelo[input.id].velo.w / 9;
output.color = lerp(float4(1.0f, 0.1f, 0.1f, 1.0f), input.color, mag);
return output;

And here’s the associated pixel shader, which works as a WHLSL shader completely unmodified:

float intensity = 0.5f - length(float2(0.5f, 0.5f) - input.tex);
intensity = clamp(intensity, 0.0f, 0.5f) * 2.0f;
return float4(input.color.xyz, intensity);

Basics

Let’s talk about the language itself.

Just like in HLSL, the primitive data types are bool, int, uint, float, and half. Doubles are not supported because they don’t exist in Metal, and software emulation would be too slow. Bools don’t have a particular bit representation and thus cannot be present in shader inputs/outputs or resources. This same restriction is present in SPIR-V, and we’d like to be able to use OpTypeBool in the generated SPIR-V code. WHLSL also includes smaller integral types of char, uchar, short, and ushort, which are available directly in Metal Shading Language, can be represented in SPIR-V using OpTypeInt with a width of 8 or 16, and can be emulated in HLSL. Emulation of these types is faster than emulation of doubles because the types are smaller and their bit representation is less complicated.

WHLSL doesn’t provide C-style implicit conversions. We’ve found implicit conversions to be a common source of errors in shaders, and forcing the programmer to be explicit about where the conversions occur eliminates this often frustrating and mysterious class of bugs. This is a similar approach that languages such as Swift have taken. Additionally, a lack of implicit conversions keeps the specification and the compiler simple.
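
For instance, in this small illustrative snippet (not taken from the standard library), the int has to be converted explicitly before it can participate in a float expression:

int count = 3;
float scale = float(count) * 0.5;
// The following would be a compile error: no implicit conversion from int to float
// float oops = count;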

Just like in HLSL, there are vector types and matrix types such as float4 and int3x4. Rather than add a bunch of “x1” single-element vectors and matrices, we opted to keep the standard library simple, since a single-element vector is already representable as a scalar and a single-element-matrix is already representable as a vector. This is consistent with the desire to eliminate implicit conversions, and requiring an explicit conversion between float1 and float is cumbersome and needlessly verbose.

So, the following is a valid snippet of a shader:

int a = 7;
a += 3;
float3 b = float3(float(a) * 5, 6, 7);
float3 c = b.xxy;
float3 d = b * c;

I mentioned earlier that no implicit conversions are allowed, but you may have noticed in the above snippet, 5 is not written as 5.0. This is because literals are represented as a special type that can be unified with other numeric types. When the compiler sees the above code, it knows the multiplication operator requires the arguments to be the same type, and the first argument is clearly a float. So, when the compiler sees float(a) * 5 it says “well, I know the first argument is a float, so that means I must be using the (float, float) overload, so let’s unify the 5 with the second argument, and thus the 5 becomes a float.” This works even when both arguments are literals, because literals have a preferred type. So, 5 * 5 will get the (int, int) overload, 5u * 5u will get the (uint, uint) overload, and 5.0 * 5.0 will get the (float, float) overload.

One difference between WHLSL and C is that WHLSL zero-initializes all uninitialized variables at their declaration site. This prevents non-portable behavior across OSes and drivers, or even worse, reading whatever value happened to be there before your shader started executing. It also means that all constructible types in WHLSL have a zero-value.
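
A quick illustration of the zero values for a few declarations (pointers are covered in more detail below):

int i;         // i is 0
float3 v;      // v is float3(0, 0, 0)
thread int* p; // p is null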

Enums

Because they don’t incur any runtime cost and are extremely useful, WHLSL has native support for enums.

enum Weekday {
    Monday,
    Tuesday,
    Wednesday,
    Thursday,
    PizzaDay
}

The underlying type for an enum defaults to int, but you can override the type, e.g. enum Weekday : uint. Similarly, enum values can have an underlying value like Tuesday = 72. Because enums have defined types and values, they can therefore be used in buffers, and they can be casted between their underlying type and the enum type. When you want to refer to a value in code, you qualify it like Weekday.PizzaDay similar to how enum classes work in C++. This means that enum values don’t pollute the global namespace, and values of independent enums won’t collide.
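
Pulling those pieces together, the following sketch shows an overridden underlying type, explicit values, qualified access, and a round-trip cast; the cast syntax mirrors the explicit scalar conversions shown earlier and should be read as illustrative:

enum Weekday : uint {
    Monday,
    Tuesday = 72,
    Wednesday,
    Thursday,
    PizzaDay
}

Weekday favorite = Weekday.PizzaDay;
uint raw = uint(favorite);        // cast to the underlying type
Weekday backAgain = Weekday(raw); // and back again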

Structs

Structs in WHLSL work similarly to HLSL and C.

struct Foo {
    int x;
    float y;
}

Simply designed, they avoid inheritance, virtual methods, and access control. It’s impossible to have a “private” member of a struct. Because structs don’t have access control, there is no need for structs to have member functions. Free functions can see every member of every struct.

Arrays

Like in other shading languages, arrays are value types that are passed to and returned from functions by value (aka “copy-in copy-out,” like regular scalars). You make one using the following syntax:

int[3] x;

Just like any variable declaration, this will zero-fill the contents of the array, and is therefore an O(n) operation. We wanted to put the brackets after the type instead of after the variable name for two reasons:

  1. Putting all type information in a single place makes the parser simpler (avoiding the clockwise/spiral rule)
  2. Avoiding ambiguity when multiple variables are declared in a single statement (e.g. int[10] x, y;)

One critical way we ensure safety of the language is by performing bounds checks on every array access. There are a number of ways we make this potentially expensive operation efficient. Array indices are uints, which reduces the check to a single comparison. Arrays are not sparse, and their length member is known at compile time, so accessing it has near-zero cost.
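
For example (an illustrative snippet, not from the specification), an out-of-bounds index is caught by the check instead of reading past the array:

int[3] values = int[3](4, 5, 6);
uint i = 3;
int x = values[i]; // out of bounds: the checked access traps or clamps (see Safety below)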

Whereas arrays are value types, WHLSL achieves reference semantics using two other types: safe pointers and array references.

Safe Pointers

The first is the safe pointer. Some form of reference semantics, which is the behavior pointers allow for, is used in almost every CPU-side programming language. Including pointers in WHLSL will make it easier for developers to migrate existing CPU-side code to the GPU, thereby allowing for easy porting of things like machine learning, computer vision, and signal processing applications.

To satisfy the safety requirement, WHLSL uses safe pointers, which are guaranteed to either point to something valid or be null. Like C, you can create a pointer to an lvalue by using the & operator and can dereference one by using the * operator. Unlike C, you can’t index through a pointer as-if it were an array. You can’t cast it to and from a scalar value, and it doesn’t have a specific bit pattern representation. Therefore, it can’t exist in a buffer or as a shader input/output.

Just like in OpenCL and Metal Shading Language, the GPU has different heaps, or address spaces that values can exist within. WHLSL has 4 different heaps: device, constant, threadgroup, and thread. All reference types must be tagged with the address space they point into.

The device address space corresponds to the majority of memory on the device. This memory is readable and writable, and corresponds to Unordered Access Views in Direct3D and device memory in Metal Shading Language. The constant address space corresponds to a read-only region of memory, typically optimized for data being broadcast to every thread. As such, writing to an lvalue that exists in the constant address space is a compile error. Lastly, the threadgroup address space corresponds to a readable and writable region of memory that is shared between each thread in a threadgroup. It can only be used in compute shaders.
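
As a hypothetical illustration of those rules (the helper function and its parameters are invented for this post), device memory can be written through a pointer while constant memory cannot:

void scale(device float* cell, constant float* factor) {
    // device memory is readable and writable
    *cell = *cell * *factor;
    // writing to constant memory would be a compile error:
    // *factor = 2.0;
}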

By default, values exist within the thread address space:

int i = 4;
thread int* j = &i;
*j = 7;
// i is now 7

Because all variables are zero-initialized, pointers are null-initialized. Therefore, the following is valid:

thread int* i;

Trying to dereference this pointer will cause either trapping or clamping, as described later.

Array References

Array references are similar to pointers, but they can be used with the subscript operator to access multiple elements in the array reference. Whereas arrays’ lengths are known at compile time and must be stated inside the type declaration, an array reference’s length is only known at runtime. Just like pointers, they must be associated with an address space, and they may be nullptr. Just like arrays, they are indexed using uints for single-comparison bounds checks, and they can’t be sparse.

They correspond to the OpTypeRuntimeArray type in SPIR-V and one of Buffer, RWBuffer, StructuredBuffer, or RWStructuredBuffer in HLSL. In Metal, it is represented as a tuple of a pointer and a length. Just like array accesses, all operations are checked against the array reference’s length. Buffers are passed into the entry points from the API via array references or pointers.

You can make an array reference from an lvalue by using the @ operator:

int i = 4;
thread int[] j = @i;
j[0] = 7;
// i is 7
// j.length is 1

Just as you might expect, using @ on pointer j creates an array reference that points to the same thing as j:

int i = 4;
thread int* j = &i;
thread int[] k = @j;
k[0] = 7;
// i is 7
// k.length is 1

Using @ on an array makes the array reference point to that array:

int[3] i = int[3](4, 5, 6);
thread int[] j = @i;
j[1] = 7;
// i[1] is 7
// j.length is 3

Functions

Functions look very similar to their C counterparts. For example, here is a function in the standard library:

float4 lit(float n_dot_l, float n_dot_h, float m) {
    float ambient = 1;
    float diffuse = max(0, n_dot_l);
    float specular = n_dot_l < 0 || n_dot_h < 0 ? 0 : n_dot_h * m;
    float4 result;
    result.x = ambient;
    result.y = diffuse;
    result.z = specular;
    result.w = 1;
    return result;
}

This example shows how similar WHLSL functions are to C: function declarations and calls (e.g. to max()) have similar syntax, arguments and parameters are matched up pairwise in order, and ternary expressions are supported.

Operators and Operator Overloading

However, something else is going on here, too. When the compiler sees n_dot_h * m, it doesn’t intrinsically know how to perform that multiplication. Instead, the compiler will turn that into a call to operator*(). Then, the specific operator*() is chosen via the standard function overload resolution algorithm. This is important because it means you can write your own operator*() function, and teach WHLSL how to multiply your own types.

This even works for operations like ++. Though pre- and post-increment have different behaviors, they both get overloaded to the same function: operator++(). Here’s an example from the standard library:

int operator++(int value) {
    return value + 1;
}

This operator will be called for both pre-increment and post-increment, and the compiler is smart enough to do the right thing with the result. This solves the problem that C++ runs into where those operators are distinct, and are differentiated using an extra dummy int argument. For post-increment, the compiler will emit code to save the value to an anonymous variable, call operator++(), assign the result, and use the saved value for further processing.
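
A small sketch of what that means for a caller:

int value = 5;
int before = value++; // saves value, calls operator++(value), stores the result back
// before is 5, value is 6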

Operator overloading is used all over the language. It’s how vector and matrix multiplication is implemented. It’s how arrays are indexed. It’s how swizzle operators work. Operator overloading provides power and simplicity; the core language doesn’t have to know about each of these operations directly because they are implemented by overloaded operators.

Generated Properties

WHLSL doesn’t just stop at operator overloading, though. An earlier example included b.xxy where b is a float3. This is an expression that means “make a 3-element vector where the first two elements have the same value as b.x and the third element has the same value as b.y.” So it’s sort of like a member of the vector, except it isn’t actually associated with any storage; instead, it’s computed at the time it’s accessed. These “swizzle operators” are present in every realtime shading language, and WHLSL is no exception. They are supported by marking them as generated properties, like in Swift.

Getters

The standard library includes many functions of the following form:

float3 operator.xxy(float3 v) {
    float3 result;
    result.x = v.x;
    result.y = v.x;
    result.z = v.y;
    return result;
}

When the compiler sees a property access to a member that doesn’t exist, it can call the operator passing the object as the first argument. Colloquially, we call this a getter.

Setters

The same approach even works for setters:

float4 operator.xyz=(float4 v, float3 c) {
    float4 result = v;
    result.x = c.x;
    result.y = c.y;
    result.z = c.z;
    return result;
}

Using setters is very natural:

float4 a = float4(1, 2, 3, 4);
a.xyz = float3(7, 8, 9);

The implementation of the setter creates a copy of the object with the new data. When the compiler sees an assignment to a generated property, it calls the setter and assigns the result to the original variable.
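
In other words, the assignment above behaves roughly as if the compiler had written the following; this is a desugaring sketch assuming the operator.xyz= overload shown earlier, not code you would normally write by hand:

a = operator.xyz=(a, float3(7, 8, 9));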

Anders

A generalization of getters and setters is the ander, which works with pointers. It exists as a performance optimization, so setters don’t have to create a copy of the object. Here’s an example:

thread float* operator.r(thread Foo* value) {
    return &value->x;
}

Anders are more powerful than either getters or setters, because the compiler can use anders to implement either reads or assignments. When reading from a generated property via an ander, the compiler invokes the ander and then dereferences the result. When writing to it, the compiler invokes the ander, dereferences the result, and assigns to the result of that. Any user-defined type can have any combination of getters, setters, anders, and indexers; if the same type has an ander and either a getter or a setter, the compiler will prefer using the ander.
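
Concretely, assuming the Foo and operator.r from the example above, both reads and writes go through the ander:

Foo foo;
foo.r = 1.5;       // write: the compiler calls operator.r(&foo) and assigns through the result
float red = foo.r; // read: the compiler calls operator.r(&foo) and dereferences the result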

Indexers

But what about matrices? In most realtime shading languages, matrices aren’t accessed with members corresponding to their columns or rows. Instead, they are accessed using array syntax, e.g. myMatrix[3][1]. Vector types also usually have this kind of syntax. So how does this work? More operator overloading!

float operator[](float2 v, uint index) {
    switch (index) {
        case 0:
            return v.x;
        case 1:
            return v.y;
        default:
            /* trap or clamp, more on this below */
    }
}

float2 operator[]=(float2 v, uint index, float a) {
    switch (index) {
        case 0:
            v.x = a;
            break;
        case 1:
            v.y = a;
            break;
        default:
            /* trap or clamp, more on this below */
    }
    return v;
}

As you can see, indexing uses operators too, and can therefore be overloaded. Vectors get these “indexers” too, so myVector.x and myVector[0] are synonyms for each other.

Standard Library

We designed the standard library based on the Microsoft Docs describing the HLSL standard library. The WHLSL standard library mostly includes math operations, which work both on scalar values and element-wise on vectors and matrices. All the standard arithmetic, logical, and bitwise operators you would expect are defined, like operator*() and operator<<(). All the swizzle operators, getters, and setters are defined for vectors and matrices, where applicable.

One of the design principles of WHLSL was to keep the language itself small so as much as possible could be defined in the standard library. Of course, not all the functions in the standard library can be expressed in WHLSL (like float operator*(float, float)) but almost everything else is implemented in WHLSL. For example, this function is part of the standard library:

float smoothstep(float edge0, float edge1, float x) {
    float t = clamp((x - edge0) / (edge1 - edge0), 0, 1);
    return t * t * (3 - 2 * t);
}

Because the standard library is designed to match HLSL as much as possible, most of the functions in it are already present in HLSL directly. So a compilation of WHLSL’s standard library to HLSL would choose to omit these functions and instead use the built-in versions. This will happen, for instance, for all the vector/matrix indexers — the GPU should never actually see the code above; the code generation step in the compiler should use the intrinsic instead. However, different shading languages have different intrinsics, so every function is defined to allow for correctness testing. Similarly, WHLSL includes a CPU-side interpreter, which uses the WHLSL implementations of these functions when executing WHLSL programs. This is true for every WHLSL function including the texture sampling functions.

Not every function in HLSL’s standard library is present in WHLSL. For example, HLSL supports printf(). However, implementing such a function in Metal Shading Language or SPIR-V would be quite difficult. We included as many functions from the HLSL standard library as is reasonable in the Web environment.

Variable Lifetime

But if there are pointers in the language, how should we deal with use-after-free problems? For example, consider the following snippet:

thread int* foo() {
    int a;
    return &a;
}
…
int b = *foo();

In languages like C, this code has undefined behavior. So, one solution is for WHLSL to just forbid this kind of structure and throw a compilation error when it sees something like this. However, this would require tracking the values that every pointer could possibly point to, which is a difficult analysis in the presence of loops and function calls. Instead, WHLSL makes every variable behave as if it has global lifetime.

This means that this WHLSL snippet is completely valid and well-defined, for two reasons:

  1. Declaring a without an initializer will zero-fill it. Therefore, the value of a is well-defined. This zero-filling will occur each time foo() is called.

  2. All variables have global lifetime (similar to C’s static keyword). Therefore, a never goes out of scope.

This global lifetime is only possible because recursion is disallowed (which is common for shading languages), which means there will not be any reentrancy problems. Similarly, shaders cannot allocate or free memory, so the compiler knows at compile-time every piece of memory that a shader could possibly access.

So, for example:

thread int* foo() {
    int a;
    return &a;
}
…
thread int* x = foo();
*x = 7;
thread int* y = foo();
// *x equals 0, because the variable got zero-filled again
*y = 8;
// *x equals 8, because x and y point to the same variable

Most variables won’t need to truly be made global, so there isn’t a big impact on performance. If the compiler can prove that it is unobservable whether or not a particular variable actually has global lifetime, the compiler is free to keep the variable local. Because the pattern of returning pointers to locals is discouraged in other languages (indeed, many other shading languages don’t even have pointers), examples like this one will be relatively rare.

Stages of Compilation

WHLSL doesn’t make use of a preprocessor as other languages do. In other languages, the preprocessor’s primary purpose is to include multiple source files together. On the web, however, there is no direct file access, and usually the entire shader is presented in one downloaded resource. In many shading languages, the preprocessor is used to conditionally enable rendering features inside a large ubershader, but WHLSL allows for this use case by using specialization constants instead. Moreover, the many variants of preprocessors are incompatible in subtle ways, so the benefit of a preprocessor for WHLSL doesn’t outweigh the complexity of creating a specification for one.

WHLSL is designed for a two-stage compilation. In our research, we’ve found that many 3D engines want to compile a large corpus of shaders, and each compilation includes large libraries of functions that are duplicated between the different compilations. Instead of compiling these support functions multiple times, a better solution is to compile the entire library once, and then allow a second stage to select which entry points from the library should be used together.

This two-stage compilation means that as much of the compilation should be done in the first pass, so it isn’t run multiple times for families of shaders. This is the reason entry points in WHLSL are marked as either vertex, fragment, or compute. Letting the first stage of the compilation know which functions are entry points of which type lets more of the compilation occur in the first stage rather than the second stage.

This second compilation stage also provides a convenient place to specify specialization constants. Recall that WHLSL doesn’t have a preprocessor, which is the traditional way for features to be enabled and disabled in HLSL. Engines often tailor a single shader to a particular situation by enabling a rendering effect or switching out a BRDF with the flip of a switch. The technique of including every rendering option in a single shader, and specializing the single shader based on which effects to enable, is so common it has a name: ubershaders. Instead of preprocessor macros, WHLSL programmers can use specialization constants, which work the same way as SPIR-V’s specialization constants. From the language’s point of view, they are just scalar constants. However, the values for these constants are supplied during this second compilation stage, making it super easy to configure the program at runtime.

Because a single WHLSL program can include multiple shaders, the inputs and outputs to the shader aren’t represented by global variables in the way that other shading languages do it. Instead, the inputs and the outputs for a particular shader are associated with that shader itself. Inputs are represented as arguments to the shader’s entry point and outputs are represented as the return value of the entry point.

The following shows how to describe a compute shader entry point:

compute void ComputeKernel(device uint[] b : register(u0)) {
   …
}

Safety

WHLSL is a safe language. This means that it is impossible to access information outside of a website’s origin. One of the ways WHLSL achieves this is by eliminating undefined behavior, as described above for zero-initialization and variable lifetime.

Another way WHLSL achieves safety is by performing bounds checks of array/pointer accesses. There are three ways these bounds checks may occur:

  1. Trapping. When a trap occurs in a program, the shader stage immediately exits, filling in 0s for all of the shader stage’s outputs. The draw call continues, and the next stage of the graphics pipeline gets run.
    Because trapping introduces new control flow, it has an effect on the uniformity of the program. Traps are issued inside bounds checks, which means they are necessarily present in non-uniform control flow. It may be okay for some programs that don’t use uniformity for anything, but in general this makes traps difficult to use.
  2. Clamping. Array index operations can clamp the index to the size of the array. This doesn’t involve new control flow, so it doesn’t have any effect on uniformity. It is even possible to “clamp” a pointer access or a zero-length array access by disregarding writes and returning 0s for reads. This is possible because the set of things you can do with a pointer in WHLSL is limited, so we can simply make each of those operations do some well-defined thing with a “clamped” pointer.
  3. Hardware and Driver Support. Some hardware and drivers already include a mode where out-of-bounds accesses can’t happen. With this method, the mechanism by which the hardware forbids out-of-bounds accesses is implementation-defined. One example is the ARB_robustness OpenGL Extension. Unfortunately, WHLSL needs to be runnable on almost all modern hardware, and not enough APIs and devices support these kinds of modes.

Whichever method the compiler uses, it should not affect the uniformity of the shader; in other words, it can’t turn an otherwise valid program into an invalid one.

In order to determine the best behavior for bounds checks, we ran some performance experiments. We took some of the kernels used in the Metal Performance Shaders framework and made two new versions: one that uses clamping, and one that uses trapping. The kernels we picked were ones that do lots of array accesses: for example, multiplying large matrices. We ran this benchmark on a variety of devices at varying data sizes. We made sure that none of the traps were actually hit and none of the clamps actually had any effect, so we can be sure we were measuring the common case of a correctly-written program.

We expected trapping to be generally faster, because redundant traps can be eliminated by the downstream compiler. However, we discovered that there wasn’t one clear winner. On some devices, trapping was significantly faster than clamping, and on other devices, clamping was significantly faster than trapping. These results show that the compiler should be able to choose which method is best for the particular device it’s being run on, rather than being forced to always choose one method.

Chart of iPhone 6 vs iPhone X runtime scores

Shader Signatures

WHLSL supports a language feature of HLSL called “semantics.” They are used to identify variables between shader stages and from the WebGPU API. There are four types of semantics:

  • Built-in variables, e.g. uint vertexID : SV_VertexID
  • Specialization constants, e.g. uint numlights : specialized
  • Stage in/out semantics, e.g. float2 coordinate : attribute(0)
  • Resource semantics, e.g. device float[] coordinate : register(u0)

As described above, WHLSL programs accept their inputs and outputs in the form of function parameters, not global variables.

However, shaders often have multiple outputs. The most common example of this is the vertex shader passing multiple output values to the interpolator to be fed as inputs into the fragment shader.

In order to accommodate this, the return value of a shader can be a struct, and the individual fields are treated independently. In fact, this works recursively — the struct can contain another struct, and its members are also treated independently. Nested structs are flattened, and all the fields which aren’t structs are gathered and treated as shader outputs.

Shader parameters work the same way. An individual parameter can be a shader input, or it can be a struct with a collection of shader inputs. Structs can also contain other structs. Variables inside these structs are treated independently, as if they were additional parameters to the shader.

After all these structs have been flattened into a set of inputs and a set of outputs, each item in the sets must have a semantic. Each built-in variable must have a particular type and must only be used in a particular shader stage. Specialization constants must only have simple scalar types.

Stage in/out variables have the attribute semantic rather than the traditional HLSL semantics because many shaders pass around data that don’t match the canned semantics HLSL provides. In HLSL, it’s common to see generic data packed into the COLOR semantic, because COLOR is a float4 and the data fits inside a float4. Instead, the approach both SPIR-V and Metal Shading Language (via [[user(n)]]) take is to assign an identifier to each stage in/out variable, and use the assignments to match the variables between shader stages.

Resource semantics should be familiar to HLSL programmers. WHLSL includes both resource semantics and address spaces, but both of these have different purposes. The address space of the variable is used to determine which cache and memory hierarchy it should be accessed within. The address space is necessary because it persists even through pointer operations; a device pointer can’t be set to point to a thread variable. In WHLSL, the resource semantic is only used to identify a variable from the WebGPU API. However, for consistency with HLSL, the resource semantic must “match” the address space of the variable it’s being put on. For example, you can’t put register(s0) on a texture. You can’t put register(u0) on a constant resource. Arrays in WHLSL don’t have address space (because they are value types, not reference types) so if an array appears as a shader argument, it is treated as if it was a device resource for the purposes of matching semantics.

Just like Direct3D, WebGPU has a two-level binding model. Resource descriptors are aggregated into sets, and sets can be switched out in the WebGPU API. WHLSL matches HLSL by modeling this by an optional space parameter inside resource semantics: register(u0, space1).
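
To tie several of these semantics together, here is a hypothetical compute entry point (the names are invented for illustration) that takes a buffer bound in descriptor set 1 and a specialization constant:

compute void CullLights(device uint[] visibleLights : register(u0, space1),
                        uint numLights : specialized)
{
    if (visibleLights.length > 0) {
        visibleLights[0] = numLights;
    }
}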

“Logical Mode” restrictions

WHLSL is designed with the requirement that it is compilable to Metal Shading Language, SPIR-V, and HLSL (or DXIL). SPIR-V has many different operating modes, targeted by different embedding APIs. Specifically, we’re interested in the flavor of SPIR-V that Vulkan targets.

This flavor of SPIR-V is called Logical Addressing Mode. In SPIR-V Logical Mode, variables cannot have pointer type. Similarly, a pointer cannot be used in a Phi operation. The result of this is that each pointer must point to exactly one thing for all time; a pointer is simply a name for a value.

Because WHLSL needs to be compilable to SPIR-V, WHLSL must not be more expressive than SPIR-V. Therefore, WHLSL has some restrictions to make it expressible in SPIR-V Logical Mode. These restrictions aren’t surfaced as an optional mode for WHLSL; instead, they’re part of the language itself. Eventually, we hope these restrictions can be lifted in a future version of the language, but until then, the language is restricted.

These restrictions are:

  • Pointers and array references must not occur inside device, constant, or threadgroup memory
  • Pointers and array references must not occur inside arrays or array references
  • Pointers and array references must not be assigned outside of their initializer (in their declaration)
  • Functions that return pointers or array references must only have a single return point
  • Ternary expressions must not result in pointers

With these restrictions, the compiler knows exactly what every pointer points to.

But not so fast! Recall from above that thread variables have global lifetime, which means they behave as-if they were declared at the beginning of the entry point. What if the runtime gathered all these local variables together, sorted by type, and aggregated all the variables with the same type into arrays? Then, a pointer could simply be an offset into the appropriate array. A pointer can’t be recast to point to a different type in WHLSL, which means the appropriate array is determined statically by the compiler. Therefore, thread pointers don’t need to abide by the restrictions above. But, this technique doesn’t work for the pointers in the other address spaces; it only works for thread pointers.

Resources

WHLSL supports textures, samplers, and array references for buffers. Just like in HLSL, texture types in WHLSL look like Texture2D<float4>. The presence of these angle brackets doesn’t imply templates or generics; the language doesn’t have facilities for those (for simplicity). The only types that are allowed to have them are a finite set of built-in types. This design is a middle ground: it keeps these familiar HLSL types while reserving the angle bracket characters for future development of the language by the Community Group.

Depth textures are distinct from non-depth-textures because they are different types in Metal Shading Language, so the compiler needs to know which one to emit when it’s emitting Metal Shading Language. Because WHLSL doesn’t support member functions, texture sampling isn’t done like texture.Sample(…); instead, it’s done with free functions like Sample(texture, …).

Samplers are not specialized; there is one sampler type for all use cases. You can use this sampler for both depth textures and non-depth textures. Depth textures support things like comparison operations in the sampler. If the sampler is configured to include a depth comparison and it’s used with a non-depth texture, the depth operation is ignored.

The WebGPU API will automatically emit some resource barriers at particular places, which means the API needs to know which resources are used in a shader. Therefore, the “bindless” model of resources can’t be used. This means that all resources are listed as explicit inputs to a shader. Similarly, the API wants to know which resources are used for reading and which are used for writing; the compiler knows this statically by inspecting the program. There is no language-level support for “const” or a distinction between StructuredBuffer and RWStructuredBuffer because that information is already present in the program.

Current Work

The WebGPU community group is working on a formal language specification written with OTT that describes WHLSL with the same level of rigor that other Web languages employ. We’re also working on a compiler that can produce Metal Shading Language, SPIR-V, and HLSL. In addition, the compiler includes a CPU-side interpreter to show correctness of an implementation. Please try it out!

Future Directions

WHLSL is still nascent, and there is still a long way to go before the design of the language is complete. We would love to hear from you about your desires, concerns, and use cases! Please feel free to file issues in our GitHub repository about your ideas and thoughts!

For the first proposal, we wanted to satisfy the constraints outlined at the beginning of this post, yet provide ample opportunity for expanding the language. One natural evolution of the language could add facilities for abstractions of types, like protocols or interfaces. WHLSL includes simple structs with no access control or inheritance. Other shading languages like Slang model type abstractions as a set of methods that must be present inside the struct. However, Slang runs into a problem where it is impossible to make an existing type adhere to a new interface. Once the struct is defined, you can’t add new methods to it; the curly brace has closed the struct forever. This problem is solved with extensions, similar to Objective-C or Swift, which can retroactively add methods into a struct after the struct has been defined. Java solved this problem by encouraging authors to add new classes, called adapters, which only exist to implement an interface, and plumb each call through to the implementing type.

The WHLSL approach is much simpler; by using free functions instead of struct methods, we can use a system like Haskell’s type classes. Here, a type class defines a set of arbitrary functions that must exist, and a type adheres to the type class by implementing them. A solution like this could potentially be added to the language in the future.

Wrapping Up

This post describes a new shading language named WHLSL, developed by the W3C’s WebGPU Community Group. The goals of the language are satisfied by its familiar, HLSL-based syntax, safety guarantees, and simple, extensible design. As such, it represents the best-supported way to write shaders to be used in the WebGPU API. However, the WebGPU Community Group is unsure whether or not WHLSL programs should be supplied to the WebGPU API directly, or whether they should be compiled to an intermediate form before delivery to the API. Either way, WebGPU programmers should be writing in WHLSL because it fits best with the API.

Please get involved! We’re doing this work on the WebGPU GitHub project. We’ve been working on a formal specification for the language, a reference compiler to emit Metal Shading Language and SPIR-V, and a CPU-side interpreter to validate correctness. We welcome everyone to try it out, and let us know how it goes!

For more information, you can contact me at mmaxfield@apple.com or @Litherum, or you can contact our evangelist, Jonathan Davis.
