JetStream 2.2
April 30, 2024
https://webkit.org/blog/15378/jetstream-2-2/

JetStream 2 is a suite of JavaScript and WebAssembly performance benchmarks that runs 64 subtests. We recently fixed several minor issues we discovered that impact running the benchmark.

One issue occurred when running JetStream on fast hardware: the time to run a subtest, as measured by the timing code, occasionally came back as 0 due to limited timer resolution, which caused problems when calculating subtest scores. Two other issues were found in the segmentation subtest: the code didn’t create the initial Float32Array for the test, and a race condition in that test’s task queue code was fixed.

Two other issues with running JetStream via the command line were found and fixed. The first was specific to the SpiderMonkey command-line shell, and the second was a fix to the Wasm command-line script when running with both the D8 and SpiderMonkey shells.

We are introducing JetStream 2.2, which fixes the issues mentioned above and includes the same subtests as JetStream 2.1 and the original JetStream 2. The fixes improve the stability of the benchmark. We measured JetStream 2.2’s benchmark scores in Chrome, Safari, and Firefox and found them to be as stable as JetStream 2.1’s. Measurements were performed on a 16-inch MacBook Pro with an M3 processor. The system was not running other applications, and the browser being tested, configured with no extensions enabled, was restarted before each JetStream run. A total of 10 runs were performed for each combination of browser and JetStream version. Scores produced by JetStream 2.2 are not comparable to those of any other version of the JetStream benchmark.

Optimizing WebKit & Safari for Speedometer 3.0 https://webkit.org/blog/15249/optimizing-webkit-safari-for-speedometer-3-0/ Wed, 10 Apr 2024 17:00:50 +0000 https://webkit.org/?p=15249 The introduction of Speedometer 3.0 is a major step forward in making the web faster for all, and allowing Web developers to make websites and web apps that were not previously possible. In this article, we explore ways the WebKit team made performance optimizations in WebKit and Safari based on the Speedometer 3.0 benchmark.

In order to make these improvements, we made extensive use of our performance testing infrastructure. It’s integrated with our continuous integration and provides the capability to schedule A/B tests. This allows engineers to quickly try out performance optimizations and catch new performance regressions.

Improving Tools

Proper tooling support is key to identifying and addressing performance bottlenecks. We defined a new internal JSON format for JavaScriptCore sampling profiler output so that it can be dumped and processed offline, along with a script that analyzes hot functions and hot bytecodes for JavaScriptCore. We also added a FlameGraph generation tool for the dumped sampling profiler output, which visualizes performance bottlenecks. In addition, we added support for JITDump generation on Darwin platforms to dump JIT-related information during execution, and we improved the generated JITDump information for ease of use. These tooling improvements allowed us to quickly identify bottlenecks across Speedometer 3.0.

Improving JavaScriptCore

Revising Megamorphic Inline Cache (IC)

Megamorphic IC offers faster property access when one property access site observes many different object types and/or property names. We observed that some frameworks such as React contain a megamorphic property access. This led us to continuously improve JavaScriptCore’s megamorphic property access optimizations: expanding put megamorphic IC, adding megamorphic IC for the in operation, and adding generic improvements for megamorphic IC.
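To illustrate what “megamorphic” means here, consider a sketch like the following (illustrative code, not taken from React): a single property access site that sees objects of many different shapes.

function getValue(object) {
    return object.value; // one access site, many different object shapes
}

// Each call passes an object with a different set of properties, so each
// has a distinct internal structure. Once an inline cache has seen many
// structures, a megamorphic IC handles the site instead.
getValue({ value: 1 });
getValue({ value: 2, extra: true });
getValue({ extra: true, value: 3 });
getValue({ a: 0, b: 0, value: 4 });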

Revising Call IC

Call IC offers faster function calls by caching call targets inline. We redesigned Call IC and integrated two different architectures into different tiers of the Just-In-Time (JIT) compilers: lower tiers use a Call IC that requires no JIT code generation, while the highest tier uses JIT code generation for the fastest Call IC. There is a tradeoff between code generation time and code efficiency, and JavaScriptCore performs a balancing act between them to achieve the best performance across the different tiers.

Optimizing JSON

Speedometer 3.0 also presented new optimization opportunities for our JSON implementation, as its tests contain more non-ASCII characters than before. We made our fast JSON stringifier work for Unicode characters, and after carefully analyzing profile data, we made JSON.parse faster than ever.
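For example (an illustrative snippet, not benchmark code), payloads like the following now stay on the fast paths even though they contain non-ASCII strings:

const payload = { city: "München", greeting: "こんにちは", ok: "✓" };
const json = JSON.stringify(payload);  // fast stringifier handles Unicode
const roundTripped = JSON.parse(json); // parsing is faster as well
console.log(roundTripped.greeting);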

Adjusting Inlining Heuristics

There are many tradeoffs when inlining functions in JavaScript. For example, aggressive inlining increases the total bytecode size and may cause memory bandwidth to become a new bottleneck. The amount of instruction cache available in the CPU can also influence how effective a given inlining strategy is. And the calculus of these tradeoffs changes over time as we make more improvements to JavaScriptCore, such as adding new bytecode instructions and changing the DFG’s numerous optimization phases. We took the release of the new Speedometer 3.0 benchmark as an opportunity to adjust inlining heuristics based on data collected on modern Apple silicon Macs with the latest JavaScriptCore.

Making JIT Code Destruction Lazy

Due to complicated conditions, JavaScriptCore eagerly destroyed CodeBlock and JIT code when the GC detected that they were dead. Since these destructions are costly, they should be deferred and processed while the browser is idle. We made changes so that they are now destroyed lazily, during idle time in most cases.

Opportunistic Sweeping and Garbage Collection

In addition, we noticed that a significant amount of time goes into performing garbage collection and incremental sweeping across all subtests in both Speedometer 2.1 and 3.0. In particular, if a subtest allocated a large number of JavaScript objects on the heap, we would often spend a significant amount of time in subsequent subtests collecting these objects. This had several effects:

  1. Increasing synchronous time intervals on many subtests due to on-demand sweeping and garbage collection when hitting heap size limits.
  2. Increasing asynchronous time intervals on many subtests due to asynchronous garbage collection or timer-based incremental sweeping triggering immediately after the synchronous timing interval.
  3. Increasing overall variance depending on whether timer-based incremental sweeping and garbage collection would fall in the synchronous or asynchronous timing windows of any given subtest.

At a high level, we realized that some of this work could be performed opportunistically in between rendering updates, that is, during idle time, instead of triggering in the middle of subtests. To achieve this, we introduced a new mechanism in WebCore to provide hints to JavaScriptCore to opportunistically perform scheduled work after the previous rendering update has completed, up until a given deadline (determined by the estimated remaining time until the next rendering update). The opportunistic task scheduler also accounts for imminently scheduled zero-delay timers or pending requestAnimationFrame callbacks: if it observes either, it’s less likely to schedule opportunistic work, in order to avoid interfering with imminent script execution. We currently perform a couple of types of opportunistically scheduled tasks (a simplified sketch of the idea follows the list):

  • Incremental Sweeping: Prior to the opportunistic task scheduler, incremental sweeping in JavaScriptCore was triggered by a periodically scheduled 100 ms timer. This had the effect of occasionally triggering incremental sweeping during asynchronous timing intervals, but it also wasn’t aggressive enough to prevent on-demand sweeping in the middle of script execution. Now that JavaScriptCore knows when to opportunistically schedule tasks, it can instead perform the majority of incremental sweeping in between rendering updates while there aren’t imminently scheduled timers. Sweeping is also granular down to each marked block, which allows us to halt opportunistic sweeping early if we’re about to exceed the deadline for the next estimated rendering update.
  • Garbage Collection: By tracking the amount of time spent performing garbage collection in previous cycles, we’re able to roughly estimate the amount of time needed to perform the next garbage collection based on the number of bytes visited or allocated since the last cycle. If the remaining duration for performing opportunistically scheduled tasks is longer than this estimated garbage collection duration, we immediately perform either an Eden collection or a full garbage collection. Furthermore, we integrated activity-based garbage collections into this new scheme so that they are scheduled at appropriate times.
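WebKit’s opportunistic task scheduler is internal to the engine, but the web-facing requestIdleCallback API exposes the same deadline-based idea to page authors. Here is a minimal sketch of the pattern, where pendingTasks is a hypothetical queue of small, deferrable chunks of work:

const pendingTasks = []; // hypothetical queue of granular, deferrable work items

function runOpportunisticWork(deadline) {
    // Run chunks only while time remains before the next rendering update.
    while (pendingTasks.length > 0 && deadline.timeRemaining() > 0) {
        const task = pendingTasks.shift();
        task();
    }
    if (pendingTasks.length > 0)
        requestIdleCallback(runOpportunisticWork);
}

requestIdleCallback(runOpportunisticWork);

Like WebKit’s scheduler, this halts early when the deadline is about to be exceeded, rather than pushing work into the next frame.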

Overall, this strategy yields a 6.5% total improvement in Speedometer 3.0*, decreasing the time spent in every subtest by a significant margin, and a 6.9% total improvement in Speedometer 2.1*, significantly decreasing the time spent in nearly all subtests.

* macOS 14.4, MacBook Air (M2, 2022)

Various Miscellaneous Optimizations for Real World Use Cases

We extensively reviewed all Speedometer 3.0 subtests and made many optimizations for realistic use cases. Examples include faster Object.assign with empty objects and improved object spread performance.

Improving DOM code

Improving DOM code is Speedometer’s namesake, and that’s exactly what we did. For example, we now store the NodeType in the Node object itself instead of relying on a virtual function call. We also made DOMParser use the fast parser, improved its support for li elements, and made DOMParser not construct a redundant DocumentFragment. Together, these changes improved TodoMVC-JavaScript-ES5 by ~20%. We also eliminated O(n^2) behavior in the fast parser, for a ~0.5% overall progression in Speedometer 3.0, and made input elements construct their user-agent shadow tree lazily during construction and cloning, the latter of which is newly exercised in Speedometer 3.0 due to the web components and Lit tests. We devirtualized many functions and inlined more functions to reduce function call overhead. Finally, we carefully reviewed performance profile data and removed inefficiencies in hot paths, such as repeated reparsing of the same URLs.

Improving Layout and Rendering

We landed a number of important optimizations in our layout and rendering code. First off, most type checks performed on RenderObject are now done using an inline enum class instead of virtual function calls; this alone is responsible for ~0.7% of the overall progression in Speedometer 3.0.

Improving Style Engine

We also optimized the way we compute the properties animated by Web Animations code. Previously, we enumerated every animatable property while resolving transition: all. We optimized this code to only enumerate the affected properties, a ~0.7% overall progression in Speedometer 3.0. Animating elements can now also be resolved without fully recomputing their style unless necessary for correctness.

Speedometer 3.0 content, like many modern web sites, uses CSS custom properties extensively. We implemented significant optimizations to improve their performance. Most custom property references are now resolved via fast cache lookups, avoiding expensive property parsing at style resolution time. Custom properties are now stored in a new hierarchical data structure that reduces memory usage as well.

One key component of WebKit styling performance is a cache (called “matched declarations cache”) that maps directly from a set of CSS declarations to the final element style, avoiding repeating expensive style building steps for identically styled elements. We significantly improved the hit rate of this cache.

We also improved styling performance of author shadow trees, allowing trees with identical styles to share style data more effectively.

Improving Inline Layout

We fixed a number of performance bottlenecks in the inline layout engine as well. Eliminating the complex text path in Editor-TipTap was a major ~7% overall improvement. To understand this optimization: WebKit has two different code paths for text layout, the simple text path, which uses low-level font APIs to access raw font data, and the complex text path, which uses CoreText for complex shaping and ligatures. The simple text path is faster, but it does not cover all the edge cases; the complex text path has full coverage but is slower.

Previously, we took the complex text path whenever a non-default value of font-feature-settings or font-variant was used, because historically the simple text path didn’t support these features. However, we noticed that the only such feature still missing from the simple text path was font-variant-caps. By implementing font-variant-caps support in the simple text path, we allowed it to handle the benchmark content. This resulted in a 4.5x improvement in the Editor-TipTap subtest, and a ~7% overall progression in Speedometer 3.0.

In addition to improving the handling of text content in WebKit, we also worked with the CoreText team to avoid unnecessary work in laying out glyphs. This resulted in a ~0.5% overall progression in Speedometer 3.0, and these performance gains will benefit not just WebKit but other frameworks and applications that use CoreText.

Improving SVG Layout

Another area where we landed many optimizations is SVG. Speedometer 3.0 contains a fair bit of SVG content in test cases such as React-Stockcharts-SVG. We were spending a lot of time computing the bounding box for repaint by creating a GraphicsContext, applying all styles, and actually drawing strokes in CoreGraphics. Here, we adopted Blink’s optimization of approximating the bounding box instead, yielding a ~6% improvement in the React-Stockcharts-SVG subtest. We also eliminated an O(n^2) algorithm in the SVG text layout code, which made some SVG content load a lot quicker.

Improving IOSurface Cache Hit Rate

Another optimization involved improving the cache hit rate of IOSurface objects. An IOSurface is a bitmap image buffer that we paint web content into. Since creating this object is rather expensive, we keep a cache of IOSurface objects based on their dimensions. We observed that the cache hit rate was rather low (~30%), so we increased the cache size from 64MB to 256MB on macOS, which improved the hit rate by 2.7x to ~80% and the overall Speedometer 3.0 score by ~0.7%. In practice, this means lower latency for canvas operations and other painting operations.

Reducing Wait Time for GPU Process

Previously, we required a synchronous IPC call from the Web Process to the GPU Process to determine which of the existing buffers had been released by CoreAnimation and was suitable to use for the next frame. We optimized this by having the GPU Process simply select (or allocate) an appropriate buffer and direct all incoming drawing commands to the right destination without requiring any response. We also changed the delivery of newly allocated IOSurface handles to go via a background helper thread, rather than blocking the Web Process’s main thread.

Improving Compositing

Updates to compositing layers are now batched, and flushed during rendering updates, rather than computed during every layout. This significantly reduces the cost of script-incurred layout flushes.

Improving Safari

In addition to optimizations we made in WebKit, there were a handful of optimizations for Safari as well.

Optimizing AutoFill Code

One area we looked at was Safari’s AutoFill code. Safari uses JavaScript to implement its AutoFill logic, and this execution time was showing up in the Speedometer 3.0 profile. We made this code significantly faster by waiting for the contents of the page to settle before performing some of the AutoFill work. This includes coalescing the handling of newly focused fields until after the page has finished loading when possible, and moving lower-priority work out of the critical path of loading and presenting the page for long-loading pages. This was responsible for a ~13% progression in TodoMVC-React-Complex-DOM and ~1% progressions in numerous other tests, improving the overall Speedometer 3.0 score by ~0.9%.

Profile Guided Optimizations

In addition to making the above code changes, we also adjusted our profile-guided optimizations to take Speedometer 3.0 into account. This allowed us to improve the overall Speedometer 3.0 score by 1% to 1.6%. It’s worth noting that we observed an intricate interaction between code changes and profile-guided optimizations. We sometimes don’t observe an immediate improvement in the overall Speedometer 3.0 score when we eliminate or reduce the runtime cost of a particular code path until the daily update of profile-guided optimizations kicks in. This is because the modified or newly added code has to benefit from profile-guided optimizations before it can show a measurable difference. In some cases, we even observed that a performance optimization initially resulted in a performance degradation until the profile-guided optimizations were updated.

Results

With all these optimizations and dozens more, we were able to improve the overall Speedometer 3.0 score by ~60% between Safari 17.0 and Safari 17.4. Even though individual progressions were often less than 1%, over time, they all stacked up together to make a big difference. Because some of these optimizations also benefited Speedometer 2.1, Safari 17.4 is also ~13% faster than Safari 17.0 on Speedometer 2.1. We’re thrilled to deliver these performance improvements to our users allowing web developers to build websites and web apps that are more responsive and snappier than ever.

Announcing MotionMark 1.3
January 10, 2024
https://webkit.org/blog/14908/motionmark-1-3/

We are glad to announce an update to MotionMark, Apple’s graphics benchmark for web browsers, taking it to version 1.3. This is a minor update, intended to improve the behavior of the benchmark on a wider variety of hardware devices and to improve the reliability of the scores.


The most significant behavior change is that the benchmark now adapts to the browser’s refresh rate, automatically detecting the frequency of requestAnimationFrame callbacks and using the result to set the target frame rate. Previously, it assumed that the target was 60 frames per second; if the browser called requestAnimationFrame at some different frequency, the results would be invalid.
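A sketch of how a page can detect the requestAnimationFrame rate (illustrative; MotionMark’s actual harness logic differs in its details):

function estimateFrameRate(sampleCount = 60) {
    return new Promise(resolve => {
        const timestamps = [];
        function tick(now) {
            timestamps.push(now);
            if (timestamps.length <= sampleCount) {
                requestAnimationFrame(tick);
                return;
            }
            // Average the deltas between consecutive callback timestamps.
            const deltas = timestamps.slice(1).map((t, i) => t - timestamps[i]);
            const averageDelta = deltas.reduce((a, b) => a + b, 0) / deltas.length;
            resolve(Math.round(1000 / averageDelta)); // e.g. ~60 or ~120
        }
        requestAnimationFrame(tick);
    });
}

estimateFrameRate().then(fps => console.log(`target frame rate: ${fps}`));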

With this change, comparing benchmark scores between browsers on devices with non-60Hz displays requires some care; a browser that chooses to run requestAnimationFrame at 120Hz has less time to do work per frame, so it will return a lower score. We advise that the display refresh rate be set to 60Hz for such comparisons, as described on the About page.

Minor changes were made to various sub-tests for improved reliability. The maximum complexity of the Multiply subtest was increased because newer, faster configurations could easily hit the cap. The way that elements are hidden in Multiply was changed to ensure only the visible elements contribute to the rendering cost. The Paths test was tweaked to ensure a more consistent workload. And finally, the Design subtest was changed to avoid text getting clipped out at the stage boundary, which would lead to artificially high scores.

There are two scoring-related changes. First, scoring previously chose between a “flat” profile and a “slope” profile based on which had the lowest variance, but the “flat” profile often produced inaccurate scores; it was removed. Second, this version changes which particular frames contribute to the score. The benchmark works by ramping the scene complexity and measuring the duration of some number of frames at each complexity level. In the HTML and SVG sub-tests, the changes in scene complexity involve DOM mutations, in the form of removing elements (since complexity ramps down as the test runs). So there are two types of frames: “mutation” frames and “animation” frames. Previously, the durations of both types of frames fed into the score computation. However, “mutation” frames generally take longer because DOM mutation tends to be more expensive than just animating existing elements, and this adds noise to the results. Since MotionMark is intended to be an animation benchmark, it makes sense for only “animation” frames to contribute to the score.

MotionMark development now happens on GitHub; we welcome your participation there.


MotionMark Moves to Open Governance
June 12, 2023
https://webkit.org/blog/14359/motionmark-moves-to-open-governance/

The WebKit team created the MotionMark benchmark in 2016 to measure graphics performance of web browsers. In the 7 years since then, it has proven to be a critical tool to help us guide our graphics performance improvements. It’s been successful not just for WebKit, but also for other browser engines, which have used it to help direct their own work. It is now the industry-standard benchmark for graphics on the web.

This is why we’re opening up MotionMark to the browser community and migrating to a new governance model. This new model is designed so that browser engine developers can collaboratively improve MotionMark together and is based on the governance model of the popular Speedometer benchmark.

We’re excited to work with the browser community to develop MotionMark further! Please feel free to file issues at github.com/WebKit/MotionMark, and please get involved!

Introducing JetStream 2.1
September 12, 2022
https://webkit.org/blog/13146/introducing-jetstream-2-1/

JetStream 2 is a JavaScript benchmark suite that runs 64 subtests. We recently discovered that the benchmark score of JetStream 2 may be influenced by pause times between individual subtests because of second-order effects, like CPU ramping. Network latency is one source of such pause times. As a result, depending on the speed of one’s network, JetStream 2’s benchmark score can vary. This is undesirable since the goal of JetStream 2 is to measure JavaScript performance, not the second-order effects of network latency.

We’re introducing JetStream 2.1, which runs the same subtests as JetStream 2. However, JetStream 2.1 updates the benchmark driver to pre-fetch network resources into Blobs and create URLs from them. The subtests now load resources from these blob URLs instead.
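A simplified sketch of the driver change (illustrative; the resource path is hypothetical, and the top-level await assumes module context):

// Fetch each resource once up front, so network latency happens before
// timing starts; subtests then load from a local blob: URL.
async function prefetchToBlobURL(url) {
    const response = await fetch(url);
    const blob = await response.blob();
    return URL.createObjectURL(blob);
}

const subtestResourceURL = await prefetchToBlobURL("payloads/example.js"); // hypothetical path
// Subtests now load subtestResourceURL instead of the original network URL.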

With this change, we measured JetStream 2.1’s benchmark scores on both Chrome and Safari to be more stable and less perturbed by pause times due to network latency. Scores produced by JetStream 2.1 are not comparable to those of any other version of the JetStream benchmark.

Speedometer 2.1
August 2, 2022
https://webkit.org/blog/13083/speedometer-2-1/

We are announcing an update to the Speedometer benchmark. This is a minor version bump that fixes the benchmark harness to improve the accuracy and stability of how we measure the speed of the rendering engine.

setTimeout nesting level and throttling

Speedometer was designed to include asynchronous work in its work time measurement when computing a score. This is needed because some browser engines use an optimization strategy of deferring some work to reduce the run time of synchronous operations. To measure the time used for asynchronous work, the Speedometer test harness was using setTimeout to compute the end time after the asynchronous work is done.

However, using setTimeout inside the test harness for each subtest piles up the setTimeout nesting level. According to the specification, this triggers the insertion of an artificial 4ms throttle between setTimeout callbacks for energy efficiency, so the measured work time ends up including this throttling. This is not ideal because (1) the deep nesting level is reached due to how the test harness is constructed, not due to test content, and (2) the intended purpose of using setTimeout in the harness is to measure the time of deferred asynchronous work, not artificial throttling.

This update inserts a window.requestAnimationFrame between test runs. This resets the setTimeout nesting level accumulated by the test harness and other subtests, and increases the accuracy and stability of the work time measurement by isolating each subtest more. We observed a 3-5% score improvement in the stable releases of Safari, Chrome, and Firefox.
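A sketch of the behavior being worked around (illustrative; runNextSubtest stands in for the harness’s next step): per the HTML specification, timeouts nested more than a handful of levels deep are clamped to a minimum of 4ms, while a task scheduled from a requestAnimationFrame callback starts from a fresh nesting level.

// Chained zero-delay timeouts accumulate nesting level; deeply nested
// timeouts are clamped to at least 4ms each.
const start = performance.now();
function chain(step) {
    if (step === 10) {
        console.log(`10 chained timeouts took ${performance.now() - start}ms`);
        return;
    }
    setTimeout(() => chain(step + 1), 0);
}
chain(0);

function runNextSubtest() { /* hypothetical: start the next subtest */ }

// Scheduling through requestAnimationFrame resets the nesting level, so
// the harness's setTimeout measurement no longer includes the throttle.
requestAnimationFrame(() => {
    setTimeout(runNextSubtest, 0);
});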

Conclusion

Speedometer 2.1 improves the accuracy and stability of the score measurement by ensuring that it captures the work time of asynchronous work without inducing web engine timer throttling.

MotionMark 1.2
June 7, 2021
https://webkit.org/blog/11685/motionmark-1-2/

Today we are announcing an update to the MotionMark benchmark. This is a relatively small update, aimed at increasing test reliability and reproducibility. The largest change is the removal of the Focus subtest, which was causing significant variance in test results and wasn’t measuring what it intended to measure.

Benchmark Harness

Most of the changes in MotionMark 1.2 aim to reduce variance between multiple runs of the benchmark. We increased the warm-up time between each test by 1900ms and required at least 30 frames to be rendered in between tests to reduce interference between adjacent tests. Different tests stress different parts of the graphics pipeline, and their processing can overlap if they aren’t more strongly segmented.

We also implemented a few different strategies to decrease the benchmark’s sensitivity to individual frame times. The first is to make sure that the benchmark never makes any ramping decisions based on the time of a single frame, but instead requires at least 9 frames before adjusting complexity. In addition, the benchmark now discards outlier frame times.

Focus Subtest

Modern browsers use a compositing architecture, where part of the graphics work is responsible for drawing individual elements into layers, and other graphics work is responsible for compositing layers together into a final image. The interface between these two parts behaves differently in different browsers, and may indeed be entirely asynchronous, possibly providing almost no backpressure if the compositor is running more slowly than element painting.

In browser engines which run the compositor asynchronously (like WebKit), the Focus subtest measured how fast descriptions of work can be delivered to the compositor, rather than how fast the compositor can actually execute that work, which is what the subtest was trying to measure. In addition, the backpressure in the Focus subtest is indirect, passes through the scheduler, and is therefore noisy. It was causing a huge variance in subtest score on machines which have relatively high element painting performance compared to their compositor performance.

Conclusion

MotionMark 1.2 produces significantly less score variance than previous versions of MotionMark on a wide variety of machines of varying relative performance. Because of the removal of the Focus subtest, MotionMark 1.2 is also more reflective of real-world graphics performance across browsers.


WebGPU and WSL in Safari
September 12, 2019
https://webkit.org/blog/9528/webgpu-and-wsl-in-safari/

WebGPU is a new API being developed by Apple and others in the W3C which enables high-performance 3D graphics and data-parallel computation on the Web. This API represents a significant improvement over the existing WebGL API in both performance and ease of use. Starting in Safari Technology Preview release 91, beta support is available for the WebGPU API and WSL, our proposal for the WebGPU shading language.

A few months ago we discussed a proposal for a new shading language called Web High-Level Shading Language, and began implementation as a proof of concept. Since then, we’ve shifted our approach to this new language, which I will discuss a little later in this post. With help from our friends at Babylon.js we were able to adapt a demo to work with WebGPU and Web Shading Language (WSL):

You can see the demo in action with Safari Technology Preview 91 or later, with the WebGPU experimental feature enabled.1 The demo utilizes many of the best features of WebGPU. Let’s dig in to see what makes WebGPU work so well, and why we think WSL is a great choice for WebGPU.

WebGPU JavaScript API

Just like WebGL, the WebGPU API is accessed through JavaScript because most Web developers are already familiar with working in JavaScript. We expect to create a WebGPU API that is accessed through WebAssembly in the future.

Pipeline State Objects and Bind Groups

Most 3D applications render more than a single object. In WebGL, each of those objects requires a collection of state-changing calls before it can be rendered. For example, rendering a single object in WebGL might look like:

gl.useProgram(program1);
gl.frontFace(gl.CW);
gl.cullFace(gl.FRONT);
gl.blendEquationSeparate(gl.FUNC_ADD, gl.MIN);
gl.blendFuncSeparate(gl.SRC_COLOR, gl.ZERO, gl.SRC_ALPHA, gl.ZERO);
gl.colorMask(true, false, true, true);
gl.depthMask(true);
gl.stencilMask(1);
gl.depthFunc(gl.GREATER);
gl.drawArrays(gl.TRIANGLES, 0, count);

On the other hand, rendering a single object in WebGPU might look like:

encoder.setPipeline(renderPipeline);
encoder.draw(count, 1, 0, 0);

All the pieces of state in the WebGL example are wrapped up into a single object in WebGPU, named a “pipeline state object.” Though validating state is expensive, with WebGPU it is done once, when the pipeline is created, outside of the core rendering loop. As a result, we can avoid performing expensive state analysis inside the draw call. Also, setting an entire pipeline state is a single function call, reducing the amount of “chatting” between JavaScript and WebKit’s C++ browser engine.

Resources have a similar story. Most rendering algorithms require a set of resources in order to draw a particular material. In WebGL, each resource would be bound one-by-one, leading to code that looks like:

gl.bindBufferRange(gl.UNIFORM_BUFFER, 0, materialBuffer1, 0, size1);
gl.bindBufferRange(gl.UNIFORM_BUFFER, 1, materialBuffer2, 0, size2);
gl.activeTexture(gl.TEXTURE0);
gl.bindTexture(gl.TEXTURE_2D, materialTexture1);
gl.activeTexture(gl.TEXTURE1);
gl.bindTexture(gl.TEXTURE_2D, materialTexture2);

However, in WebGPU, resources are batched up into “bind groups.” When using WebGPU, the behavior of the above code is represented as simply:

encoder.setBindGroup(0, bindGroup);

In both of these examples, multiple objects are gathered up together and baked into a hardware-dependent format, which is when the browser performs validation. Being able to separate object validation from object use means the application author has more control over when expensive operations occur in the lifecycle of their application.
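As a sketch using today’s standardized WebGPU API (which has evolved since the beta described in this post), building the bind group ahead of time might look like the following, assuming the device, pipeline, buffers, and textures from the earlier examples:

const bindGroup = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [
        { binding: 0, resource: { buffer: materialBuffer1 } },
        { binding: 1, resource: { buffer: materialBuffer2 } },
        { binding: 2, resource: materialTexture1.createView() },
        { binding: 3, resource: materialTexture2.createView() },
    ],
});
// Validation happens here, once, rather than inside every draw call.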

Run-time Performance

We expect WebGPU to perform faster and handle larger workloads than WebGL. To measure performance, we adopted the test harness from our 2D graphics benchmark MotionMark and wrote two versions of a performance test that had a simple yet realistic workload, one in WebGL and the other in WebGPU.

The test measures how many triangles with different properties can be drawn while maintaining 60 frames per second. Each triangle renders with a different draw call and bind group; this mimics the way many games render objects (like characters, explosions, bullets, or environment objects) in different draw calls with different resources. Because much of the validation logic in WebGPU is performed during the creation of the bind group instead of inside the draw call, both of these calls execute much faster than the equivalent calls in WebGL.

Web Shading Language

The WebGPU community group is actively discussing which shading language or languages should be supported by the specification. Last year, we proposed a new shading language for WebGPU. We are pleased to announce that a beta version of this language is available in Safari Technology Preview 91.

Because of community feedback, our approach toward designing the language has evolved. Previously, we designed the language to be source-compatible with HLSL, but we realized this compatibility was not what the community was asking for. Many wanted the shading language to be as simple and as low-level as possible. That would make it an easier compile target for whichever language their development shop uses.

There are many web developers using GLSL today in WebGL, so a browser that only accepted a different high-level language, like HLSL, wouldn’t suit their needs well. In addition, a high-level language such as HLSL can’t be executed faithfully on every platform and graphics API that WebGPU is designed to run on.

So, we decided to make the language more simple, low-level, and fast to compile, and renamed the language to Web Shading Language to match this pursuit.2

The name change reflects a shift in our approach, but the major tenets of WSL have not changed. It’s still a language that is high-performance. It’s still text-based, which integrates well with existing web technologies and culture. WSL’s semantics still bake in safety and portability. It still works great as a compile target. As expected for every web technology, the language is well-defined with a robust specification. And because the WebGPU Community Group owns the language, it matures in concert with the WebGPU API.

In an upcoming post, we’ll talk more about the technical details of this language update. Here, let’s take a closer look at WSL’s performance characteristics.

Wire Size

The Babylon.js demo above demonstrates an area where WSL really shines. The shaders in the demo use complex physically-based algorithms to draw with the highest realism possible. Babylon.js creates these complex shading algorithms by stitching together snippets of shader source code at run-time. This way, Babylon’s shaders are tuned at run-time to the specific type of content the application contains. Since WSL is a textual language like GLSL, this kind of workflow can be accomplished simply with string concatenation.
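A sketch of the pattern (the snippet strings below are hypothetical stand-ins for real shader fragments):

// Because WSL is textual, run-time shader assembly is just string work.
const commonHeader = "/* shared declarations */";
const pbrLighting = "/* physically-based lighting */";
const simpleLighting = "/* simple lighting */";
const entryPoint = "/* main entry point */";

const usePBR = true; // tuned at run-time to the content being rendered
const shaderSource =
    [commonHeader, usePBR ? pbrLighting : simpleLighting, entryPoint].join("\n");
// shaderSource can be handed straight to the API; no compiler to bytecode
// needs to be shipped with the page.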

Building shaders by string concatenation and compiling them directly is a dramatic improvement over languages which cannot do this kind of manipulation easily, like bytecode-based languages. For these other languages, generation of shaders at run-time requires the execution of an additional compiler, written in JavaScript or WebAssembly, to compile the generated source text into the bytecode form.

This extra step slows down performance in two ways. First, the execution of the additional compiler takes time. Second, the compiler source would have to be served with the rest of the web page’s resources, which increases the page size and lengthens the loading time of the web page. On the other hand, since WSL’s format is textual, there’s no additional compile step; you can just write it and run it!

As a point of comparison, let’s take the Babylon.js demo shown above. Babylon.js renders scenes by joining GLSL shader snippets at runtime. When using WSL, Babylon.js would simply serve the WSL snippets just like they’re serving GLSL snippets today. To do the same with SPIR-V, Babylon.js would serve their shader snippets along with a compiler that compiles these textual snippets down to SPIR-V. Chrome recently announced their support for WebGPU with SPIR-V using the same demo. Here is a comparison of the gzipped wire sizes for the shaders as WSL, and that of the shaders as GLSL and the SPIR-V compiler:

The top bar shows the wire size if Babylon.js used WSL directly. The middle bar shows the wire size of the GLSL source, plus the compiler that Babylon.js uses to compile their snippets into SPIR-V. The bottom bar shows the wire size if Babylon.js simply served the results of their SPIR-V compilation. Because of the dynamic nature of Babylon.js’s shaders, they assemble their shaders at runtime and compile them dynamically, so they require the SPIR-V compiler. However, other websites’ shaders may be more static, in which case they wouldn’t need the compiler and could serve the SPIR-V directly, like in the bottom bar in the graph above.

Compilation Performance

We’ve heard from developers that shader compilation performance is extremely important. One way the community group addressed this was designing WebGPU to support asynchronous compilation, which avoids blocking successive rendering commands. Asynchronous compilation isn’t a magic bullet, though. If the next rendering command uses a shader currently being compiled, that rendering command must wait for the compilation to complete. So, we’ve put significant effort into making WSL compilation times fast.

The WebGPU community created a compute shader to demonstrate the flocking characteristics of “boids,” and we turned that sample into a compiler performance benchmark. Here we compare the compilation time of WSL with the run-time of the GLSL compiler to SPIR-V running as a WebAssembly library:

As you can see, we’re continuing to optimize compilation performance, and it has been improving both throughout and since STP 91’s release.

Give WebGPU a try!

We’re very excited to have implemented a beta version of WebGPU and WSL in the latest version of Safari Technology Preview. Try it out and let us know how well it works for you! We’re interested in feedback so we can evolve the WebGPU and WSL language to better suit everyone’s needs.

And be sure to check out our gallery of WebGPU samples. We’ll be keeping this page updated with the latest demos. Many thanks to Sebastien Vandenberghe and David Catuhe from Babylon.js for their help with Babylon.js in Safari.

For more information, you can contact me at mmaxfield@apple.com or @Litherum, or you can contact our evangelist, Jonathan Davis.


1 To enable WebGPU beta support, in the Develop menu, select Experimental Features > WebGPU.
2 We still pronounce it “whistle”, though.

How Web Content Can Affect Power Usage
August 27, 2019
https://webkit.org/blog/8970/how-web-content-can-affect-power-usage/

Users spend a large proportion of their online time on mobile devices, and a significant fraction of the rest is users on untethered laptop computers. For both, battery life is critical. In this post, we’ll talk about factors that affect battery life, and how you, as a web developer, can make your pages more power efficient so that users can spend more time engaged with your content.

What Draws Power?

Most of the energy on mobile devices is consumed by a few major components:

  • CPU (Main processor)
  • GPU (Graphics processing)
  • Networking (Wi-Fi and cellular radio chips)
  • Screen

Screen power consumption is relatively constant and mostly under the user’s control (via screen on-time and brightness), but the other components (the CPU, GPU, and networking hardware) have a high dynamic range when it comes to power consumption.

The system adapts CPU and GPU performance based on the current tasks being processed, including, of course, rendering web pages that the user is interacting with in their web browser and other apps using web content. This is done by turning some components on or off and by changing their clock frequency. In broad terms, the more performance that is required from the chips, the lower their power efficiency. The hardware can ramp up to high performance very quickly (but at a large power cost), then rapidly go back to a more efficient low-power state.

General Principles for Good Power Usage

To maximize battery life, you therefore want to reduce the amount of time spent in high-power states, and let the hardware go back to idle as much as possible.

For web developers, there are three states of interaction to think about:

  • When the user is actively interacting with the content.
  • When the page is the frontmost, but the user is not interacting with it.
  • When the page is not the frontmost content.

Efficient user interaction

Obviously it’s good to expend power at times when the user is interacting with the page. You want the page to load fast and respond quickly to touch. In many cases, the same optimizations that reduce time to first paint and time to user interactive will also reduce power usage. However, be cautious about continuing to load resources and to run script after the initial page load. The goal should be to get back to idle as fast as possible. In general, the less JavaScript that runs, the more power-efficient the page will be, because script is work on top of what the browser has already done to layout and paint the page.

Once the page has loaded, user interactions like scrolling and tapping will also ramp up the hardware power (mainly the CPU and GPU), which again makes sense, but make sure to go back to idle as soon as the user stops interacting. Also, try to stay on the browser “fast paths” — for example, normal page scrolling will be much more power-efficient than custom scrolling implemented in JavaScript.

Drive idle power usage towards zero

When the user is not interacting with the page, try to make the page use as little power as possible. For example:

  • Minimize the use of timers to avoid waking up the CPU. Try to coalesce timer-based work into a few, infrequent timers. Lots of uncoordinated timers which trigger frequent CPU wake-ups are much worse than gathering that work into fewer chunks.
  • Minimize continually animating content, like animated images and auto-playing video. Be particularly vigilant to avoid “loading” spinner GIFs or CSS animations that continually trigger painting, even if you can’t see them. IntersectionObserver can be used to run animations only when they are visible (see the sketch after this list).
  • Use declarative animations (CSS Animations and Transitions) where possible. The browser can optimize these away when the animating content is not visible, and they are more efficient than script-driven animation.
  • Avoid network polling to obtain periodic updates from a server. Use WebSockets or Fetch with a persistent connection, instead of polling.
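As a sketch of the IntersectionObserver approach (the .spinner selector is hypothetical):

// Pause a continuously animating element whenever it leaves the viewport.
const spinner = document.querySelector(".spinner");
const observer = new IntersectionObserver(entries => {
    for (const entry of entries) {
        entry.target.style.animationPlayState =
            entry.isIntersecting ? "running" : "paused";
    }
});
if (spinner)
    observer.observe(spinner);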

A page that is doing work when it should be idle will also be less responsive to user interaction, so minimizing background activity also improves responsiveness as well as battery life.

Zero CPU usage while in the background

There are various scenarios where a page becomes inactive (not the user’s primary focus), for instance:

  • The user switches to a different tab.
  • The user switches to a different app.
  • The browser window is minimized.
  • The browser window is visible but is not the focused window.
  • The browser window is behind another window.
  • The space the window is on is not the current space.

When a page becomes inactive, WebKit automatically takes steps to save power.

In addition, WebKit takes advantage of features provided by the operating system to maximize efficiency:

  • On iOS, tabs are completely suspended when possible.
  • On macOS, tabs participate in App Nap, which means that the web process for a tab that is not visually updating gets lower priority and has its timers throttled.

However, pages can trigger CPU wake-ups via timers (setTimeout and setInterval), messages, network events, etc. You should avoid these when in the background as much as possible. There are two APIs that are useful for this:

  • The Page Visibility API provides a way to respond to a page transitioning to the background or foreground. Use it to avoid updating the UI while the page is in the background, then use the visibilitychange event to update the content when the page becomes visible again (see the sketch after this list).
  • blur events are sent whenever the page is no longer focused. In that case, a page may still be visible but it is not the currently focused window. Depending on the page, it can be a good idea to stop animations.
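A minimal sketch of the Page Visibility pattern (refreshContent is a hypothetical page function):

function refreshContent() { /* hypothetical: update the page's UI */ }

let updateTimer = null;
document.addEventListener("visibilitychange", () => {
    if (document.hidden) {
        // Stop periodic work entirely while in the background.
        clearInterval(updateTimer);
        updateTimer = null;
    } else {
        refreshContent(); // catch up once when the page becomes visible
        updateTimer = setInterval(refreshContent, 30000);
    }
});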

The easiest way to find problems is with Web Inspector’s Timelines. The recording should not show any events happening while the page is in the background.

Hunting Power Inefficiencies

Now that we know the main causes of power use by web pages and have given some general rules about creating power-efficient content, let’s talk about how to identify and fix issues that cause excessive power drain.

Scripting

As mentioned above, modern CPUs can ramp power use from very low, when the device is idle, to very high to meet the demands of user interaction and other tasks. Because of this, the CPU is a leading cause of battery life variance. CPU usage during page loading is a combination of work the browser engine does to load, parse and render resources, and in executing JavaScript. On many modern web pages, time spent executing JavaScript far exceeds the time spent by the browser in the rest of the loading process, so minimizing JavaScript execution time will have the biggest benefits for power.

The best way to measure CPU usage is with Web Inspector. As we showed in a previous post, the timeline now shows the CPU activity for any selected time range:

To use the CPU efficiently, WebKit distributes work over multiple cores when possible (and pages using Workers will also be able to make use of multiple cores). Web Inspector provides a breakdown of the threads running concurrently with the page’s main thread. For example, the following screenshot shows the threads while scrolling a page with complex rendering and video playback:

When looking for things to optimize, focus on the main thread, since that’s where your JavaScript is running (unless you’re using Workers), and use the “JavaScript and Events” timeline to understand what’s triggering your script. Perhaps you’re doing too much work in response to user or scroll events, or triggering updates of hidden elements from requestAnimationFrame. Be cognizant of work done by JavaScript libraries and third party scripts that you use on your page. To dig deeper, you can use Web Inspector’s JavaScript profiler to see where time is being spent.

Activity in “WebKit Threads” is mostly triggered by work related to JavaScript: JIT compilation and garbage collection, so reducing the amount of script that runs, and reducing the churn of JavaScript objects should lower this.

Various other system frameworks invoked by WebKit make use of threads, so “Other threads” include work done by those; the largest contributor to “Other thread” activity is painting, which we’ll talk about next.

Painting

Main thread CPU usage can also be triggered by lots of layout and painting; these are usually triggered by script, but a CSS animation of a property other than transform, opacity and filter can also cause them. Looking at the “Layout and Rendering” timeline will help you understand what’s causing activity.

If the “Layout and Rendering” timeline shows painting but you can’t figure out what’s changing, turn on Paint Flashing:

This will cause those paints to be briefly highlighted with a red overlay; you might have to scroll the page to see them. Be aware that WebKit keeps some “overdraw” tiles to allow for smooth scrolling, so paints that are not visible in the viewport can still be doing work to keep offscreen tiles up-to-date. If a paint shows in the timeline, it’s doing actual work.

In addition to causing power usage by the CPU, painting usually also triggers GPU work. WebKit on macOS and iOS uses the GPU for painting, and so triggering painting can cause significant increases in power. The additional CPU usage will often show under “Other threads” in the CPU Usage timeline.

The GPU is also used for <canvas> rendering, both 2D canvas and WebGL/WebGPU. To minimize drawing, don’t call into the <canvas> APIs if the canvas content isn’t changing, and try to optimize your canvas drawing commands.

Many Mac laptops have two GPUs: an “integrated” GPU, which is on the same die as the CPU and is less powerful but more power-efficient, and a more powerful but more power-hungry “discrete” GPU. WebKit uses the integrated GPU by default; you can request the discrete GPU using the powerPreference context creation parameter, but only do this if you can justify the power cost.
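For example, powerPreference is passed at context creation time; a sketch:

const canvas = document.createElement("canvas");
// "low-power" (or the default) lets the browser pick the integrated GPU;
// request "high-performance" only when the workload justifies the cost.
const gl = canvas.getContext("webgl", { powerPreference: "high-performance" });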

Networking

Wireless networking can affect battery life in unexpected ways. Phones are the most affected due to their combination of powerful radios (the Wi-Fi and cellular network chips) with a smaller battery. Unfortunately, measuring the power impact of networking is not easy outside of a lab, but it can be reduced by following some simple rules.

The most direct way to reduce networking power usage is to maximize the use of the browser’s cache. All of the best practices for minimizing page load time also benefit the battery by reducing how long the radios need to be powered on.

Another important aspect is to group network requests together temporally. Any time a new request comes in, the operating system needs to power on the radio, connect to the base station or cell tower, and transmit the bytes. After transmitting the packets, the radio remains powered on for a small amount of time in case more packets need to be transmitted.

If a page transmits small amounts of data infrequently, the overhead can become larger than the energy required to transmit the data:

Networking Power Overhead of transmitting 2 packets with a delay between them

Such issues can be discovered from Web Inspector in the Network Requests Timeline. For example, the following screenshot shows four separate requests (probably analytics) being sent over several seconds:

Sending all the requests at the same time would improve network power efficiency.
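A sketch of coalescing such beacons (queueAnalyticsEvent and the /analytics endpoint are hypothetical):

// Queue events and flush them together, so the radio powers up once per
// batch instead of once per event.
const queuedEvents = [];
let flushTimer = null;

function queueAnalyticsEvent(event) {
    queuedEvents.push(event);
    if (flushTimer === null)
        flushTimer = setTimeout(flushAnalyticsEvents, 5000);
}

function flushAnalyticsEvents() {
    navigator.sendBeacon("/analytics", JSON.stringify(queuedEvents));
    queuedEvents.length = 0;
    flushTimer = null;
}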

Conclusion

Webpages can be good citizens of battery life.

It’s important to measure the battery impact in Web Inspector and drive those costs down. Doing so improves the user experience and battery life.

The most direct way to improve battery life is to minimize CPU usage. The new Web Inspector provides a tool to monitor that over time.

To achieve great battery life the goals are:

  • Drive CPU usage to zero in idle
  • Maximize performance during user interaction to quickly get back to idle

If you have any questions, feel free to reach me, Joseph Pecoraro, Devin Rousso, and of course Jon Davis.

Note: Learn more about Web Inspector from the Web Inspector Reference documentation.
Introducing the JetStream 2 Benchmark Suite
March 27, 2019
https://webkit.org/blog/8685/introducing-the-jetstream-2-benchmark-suite/

Today we are announcing a new version of the JetStream JavaScript benchmark suite, JetStream 2. JetStream 2 combines a variety of JavaScript and WebAssembly benchmarks, covering an array of advanced workloads and programming techniques, and reports a single score that balances them using a geometric mean. JetStream 2 rewards browsers that start up quickly, execute code quickly, and run smoothly. Optimizing the overall performance of our JavaScript engine has always been a high priority for the WebKit team. We use benchmarks to motivate wide-reaching and maintainable optimizations to the WebKit engine, often leading us to make large architectural improvements, such as creating an entirely new optimizing compiler, B3, in 2015. JetStream 2 mixes together workloads we’ve been optimizing for years along with a new set of workloads we’ll be improving for years to come.

When we released JetStream 1 in 2014, it was a cross-cutting benchmark, measuring the performance of the latest features in the JavaScript ecosystem at that time. One of our primary goals with JetStream 1 was to have a single benchmark we could use to measure overall JavaScript performance. However, since 2014, a lot about the JavaScript ecosystem has changed. Two of the major changes were the release of an extensive update to the JavaScript language in ES6, and the introduction of an entirely new language in WebAssembly. Even though we created JetStream 1 as a benchmark to track overall engine performance, we found ourselves creating and using new benchmarks in the years since its release. We created ARES-6 in 2017 to measure our performance on ES6 workloads. For the last two years, we’ve also tracked two WebAssembly benchmarks internal to the WebKit team. In late 2017, the V8 team released the Web Tooling Benchmark, a benchmark we’ve used to improve regular expression performance in JavaScriptCore. Even though JetStream 1 was released after Kraken, we found ourselves continuing to track Kraken after its release because we liked certain subtests in it. In total, prior to creating JetStream 2, the WebKit team was tracking its performance across six different JavaScript and WebAssembly benchmarks.

Gating JavaScriptCore’s performance work on a relative ordering of six different benchmark suites is impractical. It proved non-obvious how to evaluate the value of a change based on the collective results of six different benchmark suites. With JetStream 2, we are moving back to having a single benchmark — which balances the scores of its subtests to produce a single overall score — to measure overall engine performance. JetStream 2 takes the best parts of these previous six benchmarks, adds a group of new benchmarks, and combines them into a single new benchmark suite. JetStream 2 is composed of:

  • Almost all JetStream 1 benchmarks.
  • New benchmarks inspired by subtests of Kraken.
  • All of ARES-6.
  • About half of the Web Tooling Benchmark.
  • New WebAssembly tests derived from the asm.js benchmarks of JetStream 1.
  • New benchmarks covering areas not tested by the other six benchmark suites.

New Benchmarks

JetStream 2 adds new benchmarks measuring areas not covered by these previous benchmark suites. An essential component of JetStream 1 was the asm.js subset of benchmarks. With the release of WebAssembly, the importance of asm.js has lessened since many users of asm.js are now using WebAssembly. We converted many of the asm.js benchmarks from JetStream 1 into WebAssembly in JetStream 2. We also created a few new WebAssembly benchmarks. One such new test is richards-wasm, which models a hybrid JS and WebAssembly application. Richards-wasm is an implementation of Martin Richards’ system language benchmark that is implemented partially in WebAssembly and partially in JavaScript. The benchmark models a JavaScript application that frequently calls into helper methods defined in WebAssembly.

JetStream 2 adds two new Web Worker tests. It’s essential for JetStream 2 to measure Web Worker performance given how popular they are across the web. The first worker benchmark we created is bomb-workers. Bomb-workers runs all of SunSpider in parallel — so it stresses how well the browser handles running a lot of JavaScript in parallel. The second benchmark we created is segmentation. Segmentation runs four Web Workers in parallel to compute the time series segmentation over a sample data set. This benchmark is derived from the same algorithm used in WebKit’s performance dashboard.

JetStream 2 adds three new benchmarks emphasizing regular expression performance: OfflineAssembler, UniPoker, and FlightPlanner. UniPoker and FlightPlanner stress the performance of Unicode regular expressions — a feature that was new in ES6. OfflineAssembler is the JavaScriptCore offline assembler parser and AST generator translated from Ruby into JavaScript. UniPoker is a 5 card stud poker simulation using the Unicode code points for playing cards to represent card values. FlightPlanner parses aircraft flight plans that consist of named segments, e.g. takeoff, climb, cruise, etc, that include waypoints, airways, or directions and times, and then processes those segments with an aircraft profile to compute course, distance and forecast times for each resulting leg in the flight plan, as well as the total flight plan. It exercises Unicode regular expression code paths because the segment names are in Russian.

JetStream 2 also adds two new general purpose JavaScript benchmarks: async-fs and WSL. Async-fs models performing various file system operations, such as adding and removing files, and swapping the byte-order of existing files. Async-fs stresses the performance of DataView, Promise, and async-iteration. WSL is an implementation of an early version of WHLSL — a new shading language proposal for the web. WSL measures overall engine performance, especially stressing various ES6 constructs and throw.

You can read about each benchmark in JetStream 2 on the summary page.

Benchmarking Methodology

Each benchmark in JetStream 2 measures a distinct workload, and no single optimization technique is sufficient to speed up all benchmarks. Some benchmarks demonstrate tradeoffs, and aggressive or specialized optimizations for one benchmark might make another benchmark slower. Each benchmark in JetStream 2 computes an individual score. JetStream 2 weighs each benchmark equally.

When measuring JavaScript performance on the web, it’s not enough to just consider the total running time of a workload. Browsers may perform differently for the same JavaScript workload depending on how many times it has run. For example, garbage collection runs periodically, making some iterations take longer than others. Code that runs repeatedly gets optimized by the browser, so the first iteration of any workload is usually more expensive than the rest.

For this reason, JetStream 1 categorized each benchmark into one of two buckets: latency or throughput. Latency tests measured either startup performance or worst case performance. Throughput tests measured sustained peak performance. Like JetStream 1, JetStream 2 measures startup, worst case, and peak performance. However, unlike JetStream 1, JetStream 2 measures these metrics for every benchmark.
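Recording per-iteration times is what makes these metrics measurable. A minimal, hypothetical harness, not JetStream's actual driver, might look like:

    // Hypothetical per-iteration timer; JetStream's real harness differs.
    function timeIterations(runIteration, n) {
        const times = [];
        for (let i = 0; i < n; i++) {
            const t0 = performance.now();
            runIteration();
            times.push(performance.now() - t0); // times[0] includes warmup
        }
        return times;
    }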

JetStream 2 scores JavaScript and WebAssembly differently. For all but one of the JavaScript benchmarks in JetStream 2, individual benchmark scores equally weigh startup performance, worst case performance, and average case performance. These three metrics are crucial to running performant JavaScript in the browser. Fast startup times let browsers load pages more quickly and allow users to interact with the page sooner. Good worst case performance ensures web applications run without hiccups or visual jank. Fast average case performance makes it possible for the most advanced web applications to run at all. To measure these three metrics, each of these benchmarks runs for N iterations, where N is usually 120 but may vary based on the total running time of each iteration. JetStream 2 reports the startup score as the time it takes to run the first iteration. The worst case score is the average of the worst M iterations, excluding the first iteration. M is always less than N, and is usually 4. For some benchmarks, M can be less than 4 when N is smaller than 120. The average case score is the average of all but the first of the N iterations. These three metrics are weighed equally using the geometric mean.
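Putting those rules together, a per-benchmark score could be computed roughly as follows. This is a hedged sketch: the mapping from times to scores (here an assumed constant divided by time) is illustrative, not JetStream's published formula:

    // Hedged sketch of per-benchmark scoring; the 5000/ms mapping is an
    // assumption for illustration only.
    function benchmarkScore(times, worstCount = 4) {
        const toScore = (ms) => 5000 / ms;
        const avg = (xs) => xs.reduce((sum, x) => sum + x, 0) / xs.length;
        const geomean = (xs) => Math.exp(avg(xs.map(Math.log)));
        const startup = times[0];                    // first iteration
        const rest = times.slice(1);                 // exclude the first
        const worst = [...rest].sort((a, b) => b - a).slice(0, worstCount);
        return geomean([toScore(startup), toScore(avg(worst)), toScore(avg(rest))]);
    }

Fed the output of a harness like the one sketched above, e.g. benchmarkScore(timeIterations(run, 120)), this yields the single number each JavaScript benchmark contributes to the suite.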

WSL is JetStream 2’s one JavaScript benchmark that is an exception to the above scoring technique. WSL uses a different scoring mechanism because it takes orders of magnitude longer than the other benchmarks to run through a single iteration. It instead computes its score as the geometric mean over two metrics: the time it takes to compile the WSL standard library, and the time it takes to run through the WSL specification test suite.

JetStream 2’s WebAssembly benchmarks are scored in two parts that equally weigh startup time and total execution time. The first part is the startup time, which is the time it takes until the WebAssembly module is instantiated. This is the time it takes the browser to put the WebAssembly code in a runnable state. The second part is the execution time. This is the time it takes to run the benchmark’s workload after it is instantiated. Both metrics are crucial for JetStream 2 to measure. Good startup performance makes it so that WebAssembly applications load quickly. Good execution time makes it so WebAssembly benchmarks run quickly and smoothly.

In total, JetStream 2 includes 64 benchmarks. Each benchmark’s score is computed as outlined above, and the final score is computed as the geometric mean of those 64 scores.

Optimizing Regular Expressions

The Web Tooling Benchmark made us aware of some performance deficiencies in JavaScriptCore's regular expression processing and provided a great way to measure the improvements we made. For some background, the JavaScriptCore regular expression engine, also known as YARR, has both a JIT and an interpreter matching engine. Several pattern types commonly used in the Web Tooling Benchmark tests had no support in the YARR JIT, including back references and nested greedy and non-greedy groups.

A back reference takes a form like /^(x*) 123 \1$/: the \1 refers back to the parenthesized group and must match exactly what that group matched earlier in the string. For the example given here, both "x 123 x" and "xxxxx 123 xxxxx" match: the pattern accepts any line that begins with zero or more 'x' characters, has " 123 " in the middle, and ends with the same number of 'x' characters it started with. JIT support for back references was added for both Unicode and non-Unicode patterns. We also added JIT support for patterns with the ignore case flag that process ASCII and Latin1 strings. Back reference matching of Unicode ignore case patterns is more problematic due to the large amount of case folding data required, so we revert to the YARR interpreter for Unicode regular expressions that contain back references and also ignore character case.
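To make the semantics concrete, a simple illustration runnable in any ES6 engine:

    // \1 must match exactly what group 1 captured.
    const re = /^(x*) 123 \1$/;
    re.test("xxxxx 123 xxxxx"); // true: both runs are five x's long
    re.test("xxx 123 xx");      // false: the run lengths differ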

Before discussing the next improvement we made to the regular expression engine, some background is in order. The YARR regular expression engine uses a backtracking algorithm. When we want to match a pattern like /a[bc]*c/, we process forward through the pattern, and when we fail to match a term, we backtrack to the prior term to see if we can try the match a little differently. The [bc]* term will match as many b's and/or c's in a row as possible. The string "abc" matches the pattern, but requires backtracking the first time we try to match the final 'c'. This happens because the middle [bc]* term greedily matches both the 'b' and the 'c', so when we try to match the final 'c' term in the pattern, the string has been exhausted. We backtrack to the [bc]* term, reduce the length of that term's match from "bc" to just "b", and then match the final 'c' term. This algorithm requires saving state information for various term types so that we can backtrack. Before we started this work, backtracking state consisted of counts and pointers into the term's progress in a subject string.
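A step-by-step trace of that match (conceptual; the engine's internal representation differs):

    // Conceptual trace of /a[bc]*c/ matching "abc":
    //   'a'    matches "a"            -> position 1
    //   [bc]*  greedily matches "bc"  -> position 3
    //   'c'    fails: input exhausted
    //   backtrack: [bc]* gives back a character, now matches "b" -> position 2
    //   'c'    matches "c"            -> overall match succeeds
    /a[bc]*c/.exec("abc"); // ["abc"]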

We had to extend how backtracking state was saved in order to add support for counted captured parenthesized groups that are nested within longer patterns. Consider a pattern like /a(b|c)*bc/. In addition to the backtracking information for the contents of the captured parenthesized group, we also need to save that group's match count and its start and end locations as part of its backtracking state. This state is saved in nodes on a singly linked list stack structure. The size of these parenthesized group backtracking nodes is variable, depending on the amount of state needed for all nested terms, including the nested capture group's extents. Whenever we begin matching a parenthesized group, whether for the first or a subsequent time, we save the state for all the terms nested within that group as a new node on this stack. When we backtrack to the beginning of a parenthesized group, either to try a shorter match or to backtrack into the prior terms, we pop the captured group's backtracking node and restore the saved state.
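As a concrete instance (illustrative only), the pattern above backtracks out of its final greedy iterations before succeeding, and the restored state determines what the group captured:

    // (b|c)* first consumes "bcbc", then gives back two iterations so the
    // trailing "bc" can match; the group's surviving last iteration is "c".
    /a(b|c)*bc/.exec("abcbc"); // ["abcbc", "c"]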

We also made two other performance improvements that benefited JetStream 2: one matches longer constant strings at once, and the other canonicalizes constructed character classes. We have long had an optimization that matches multiple adjacent fixed characters in a pattern as a group, comparing up to 32 bits of character data at once, e.g., four 8-bit characters. Consider the expression /starting/, which simply looks for the string "starting". This optimization allowed us to match "star" with one 32-bit load-compare-branch sequence and then the trailing "ting" with a second. The recent change, made for 64-bit platforms, allows us to match eight 8-bit characters at a time. With this change, the regular expression is now matched with a single 64-bit load-compare-branch sequence.

The latest regular expression performance improvement that benefited JetStream 2 was coalescing character classes. For background, character classes can be constructed from individual characters, ranges of characters, built-in escape characters, or built-in character classes. Prior to this change, the character class [\dABCDEFabcdef], which matches any hex digit, would check for digit characters by comparing a character value against '0' and '9', and then individually compare it against each of the alphabetic characters 'A', 'B', and so on. Although this character class could have been written as [\dA-Fa-f], it makes sense for YARR to perform this optimization on the JavaScript developer's behalf, and our character class coalescing now does: we merge individual adjacent characters into ranges, and adjacent ranges into larger ranges, which often reduces the number of compare and branch instructions. Before this optimization, /[\dABCDEFabcdef]/.exec("X") required 14 compares and branches to determine there wasn't a match; now it takes just 4.
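A quick way to convince yourself the coalesced form is equivalent (an illustrative check, not engine code):

    // The verbose class and its coalesced form accept exactly the same set.
    const verbose = /^[\dABCDEFabcdef]$/;
    const coalesced = /^[0-9A-Fa-f]$/;
    for (const ch of ["7", "b", "F", "X", "g"])
        console.assert(verbose.test(ch) === coalesced.test(ch));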

The performance impact of these optimizations was dramatic for some of the Web Tooling Benchmark tests. Adding the ability to JIT greedy nested parens improved the performance of coffeescript by 6.5x. JIT'ing non-greedy nested parens was a 5.8x improvement to espree and a 3.1x improvement to acorn. JIT'ing back references improved coffeescript by another 5x, for a total improvement of 33x on that test. These changes also had smaller, but still measurable, improvements on other JetStream 2 tests.

Performance Results

The best way for us to understand the state of performance in JavaScriptCore is both to track our performance over time and to compare it to other JavaScript engines. This section compares the previously released Safari 12.0.3, the newly released Safari 12.1 from macOS 10.14.4, and the latest released versions of Chrome (73.0.3683.86) and Firefox (66.0.1).

All numbers are gathered on a 2018 MacBook Pro (13-inch, Four Thunderbolt 3 Ports, MacBookPro15,2), with a 2.3GHz Intel Core i5 and 8GB of RAM. The numbers for Safari 12.0.3 were gathered on macOS 10.14.3. The numbers for Safari 12.1, Chrome 73.0.3683.86, and Firefox 66.0.1 were gathered on macOS 10.14.4. The numbers are the average of five runs of the JetStream 2 benchmark in each browser. Each browser was quit and relaunched between each run.

JetStream 2 Scores. Bigger is Better.

The above figure shows that Safari 12.1 is the fastest browser at running JetStream 2. It is 9% faster than Safari 12.0.3, 8% faster than Chrome 73.0.3683.86, and 68% faster than Firefox 66.0.1.

Conclusion

JetStream 2 is a major update to the JetStream benchmark suite, and we're excited to be sharing it with you today. JetStream 2 includes a diverse set of 64 JavaScript and WebAssembly benchmarks, making it the most extensive and representative JavaScript benchmark we've ever released. We believe that engines optimizing for this benchmark will lead to better JavaScript performance on the web. The JavaScriptCore team is focused on improving its performance on JetStream 2 and has already achieved a 9% improvement in Safari 12.1.

We’d love to hear any feedback you have on JetStream 2 or the optimizations we’ve made for it. Get in touch with Saam or Michael on Twitter with any feedback you have.
