This document is an explainer for a new class for the EWS of WebKit that will be able to cope with a tree that is not green

Why is this needed

Some WebKit ports (GTK and WPE) are not able to keep their tree always green (meaning that often there are unexpected layout tests results).

This is due to a number of reasons, but the two principal ones are:

There isn't any EWS that runs layout tests for this ports. A patch author has no way of knowing that their patch breakes things on this port without manually testing it on this ports.

We already tried (hard) adding a new EWS for GTK using the current code for the EWS, but it didn't worked. The current design of the EWS assumes that the tree is green (or almost green). But that assumption is not working for the GTK and WPE ports, we get constant whole step retries due to aborts early caused by reaching more than 30 unexpected failures. Also, even when we get the test step finish the EWS reports lots of false positive with GTK or WPE because of the flakiness of tests which we have not under so good control like the Mac port. And we already tried also hard to keep the flakies under control by marking them but is an endless herculean task.

commit-queue doesn't run layout tests on GTK or WPE, so it won't notice that a patch breakes something on this ports before landing. That means that breaking patches are constantly landing and the port gardeners have to mark the new failures or fix the issue after the fact, but that means the tree on this ports will be red until the gardeners have time to do that (which can be several days).

On top of that, the EWS should be a tool that would help the developers to get baselines for other ports without they having to build and run tests on other ports. But the current design of the EWS makes it unsuitable to be used as a tool for helping with re-baselines because it will abort at 30 unexpected failures. But when you are working on a patch that adds lot of new tests (like a WPT import) or modifies something that cause lot of unexpected results you end needing to re-baseline more than 30 tests (sometimes we are talking about hundreds of tests). So, if you are relying on the EWS to emit the baselines for other ports other than the one you are testing, the cut on 30 unexpected tests is clearly not enough.

How this new EWS will work

This patch adds a new class to the EWS and sets the GTK EWS bots to run with it (later WPE ones will follow). Mac/iOS ports will continue to use the current default EWS, so no behaviour change for them.

The new class is named RunWebKitTestsRedTree and is specially designed to cope with a tree that is not always green. It tries hard to avoid reporting any false positive, so it will only report new consistent failures (fail always with the patch and pass always without it).

The cut-off for aborting early on this class is raised to 500 (instead of 30). 500 unexpected failures is also the number that we use on the post-commit bots at build.webkit.org to abort early the layout-test step.

The simplified logic of this is the following:

Run layout tests with patch (abort early at 500 unexpected failures)

Run layout tests with patch 10 times for each test that failed consistently (non-flaky) on step 1.

Run layout tests without patch 10 times for each test that failed consistently (non-flaky) on step 2.

And it will report as new failures the tests that failed consistently on step 2 (failed all the 10 times with patch) minus the tests that were flaky or passed on step 3 (never failed without the patch on 10 runs).

It will also report to the bot watchers any flaky tests (with the patch or without it). But to the patch author it won't give this information. To the patch author it will only report consistent new failures (fail always with the patch and pass always without the patch, retrying 10 times). It tries hard to not report any false positive to the patch author (flakies are only reported to the bot watchers).

This is a more elaborated (but still simplified) flow diagram of how it works:

# Leyend:
# => obtain
# -> goto

step: layout-tests [class RunWebKitTestsRedTree]
  with-patch: run-layout-tests(all_tests) => failures1, flakies1

    if failures1 -> layout-tests-repeat-failures
    if flakies1 -> analyze-layout-tests-results
    if step_exit_ok -> report_green_patch
    -> run-layout-tests-without-patch


step: layout-tests-repeat-failures [class RunWebKitTestsRepeatFailuresRedTree] [Timeout: 5h]
  with-patch: run-layout-tests-10times(failures1) => failures2, flakies2

    if failures2 -> layout-tests-repeat-failures-without-patch
    if timeout -> layout-tests-repeat-failures-without-patch
    -> analyze-layout-tests-results


step: layout-tests-repeat-failures-without-patch [class RunWebKitTestsRepeatFailuresWithoutPatchRedTree] [Timeout: 3h]
  without-patch: run-layout-tests-10times(did_timeout(layout-tests-repeat-failures)?failures1:failures2) => wfailures3, wflakies3

    -> analyze-layout-tests-results


step: run-layout-tests-without-patch [class RunWebKitTestsWithoutPatchRedTree]
    with-patch: run-layout-tests(all_tests) => cleantreefailures, cleantreeflakies

    -> analyze-layout-tests-results


step: analyze-layout-tests-results [class AnalyzeLayoutTestsResultsRedTree]

    if retry_number > max_retries:
       -> report_warning('Unable to determine if patch is bad or there is a pre-existent infrastructure issue')

    if not failures1 and not flakies1:
        if step(run-layout-tests-without-patch) == exit_ok:
            -> report_unknown_failure(patch)
        -> raise_infrastructure_issue_retry_build(max_retries=3)

    if did_timeout(layout-tests-repeat-failures-without-patch):
      -> raise_infrastructure_issue_retry_build(max_retries=3

    if did_timeout(layout-tests-repeat-failures-with-patch):
      -> warn_about_step_timing_out(botwatchers)
      for any(test in (failures1-(wfailures3+wflakies3))):
        -> report_new_failure(test)

    if not failures2 and not flakies2:
        if step(layout-tests-repeat-failures) != exit_ok:
            -> raise_infrastructure_issue_retry_build(max_retries=3)

  for any(test in failures2 and not in (wfailures3+wflakies3)):
    -> report_new_failure(test)

  for any(test in (flakies1+flakies2+wflakies3)):
    -> warn_bot_watchers_flaky(test)

  -> report_green_patch

Design considerations

Some design considerations:

The first step layout-tests (with patch) will abort at 500 unexpected failures (rather than 30). This number was chosen because:

500 unexpected failures is also the number that we use on the post-commit bots at build.webkit.org to abort early the layout-test step.

Is a number big enough to make this EWS useful for (non-Linux) developers to get baselines for GTK and WPE when they are working on a patch that needs lot of new baselines (like a WPT import).

It gives enough margin for the EWS to continue working as usual when something is more broken than expected on the clean_tree. For example, if the clean_tree has 120 unexpected failures because some patch landed with a WPT import that didn't included re-baselines for the GTK port, then the EWS can still work testing other patches with a margin of 380 (500-120).

The steps layout-tests-repeat-failures and layout-tests-repeat-failures-without-patch run without the flag --exit-after-n-failures so they never abort early. This is by design, since they are repeating 10 times each failure and the maximum number of failures is set to 500 (from step layout-tests) they are potentially running 5000 failures. So it doesn't make sense to make them abort early, that will cause the testing to be interrupted without a good reason since having 5000 failures is expected when there are really 500 unexpected results.

However, we have to take into account the time it can take to run those 500 tests repeating each one 10 times. Assuming a worst case scenario in where every test of those 500 timeouts then the upper bound timing is 500*10*15 = 75000 seconds => 20.8 hours which is too much (and it can be 2x worse if those tests are marked as slow). For that reason a global timeout for this two steps is introduced. Buildbot will stop the step if they reach that timeout, which is set to 5h (with patch) and 3h (without patch).

When an unexpected issue is found, a infrastructure exception is raised. This sends a warning e-mail to the bot watchers and retries the whole testing from scratch. But it keeps a counter between retries, so when it reaches the third retry instead of retrying another time it will mark the patch as warnings ("Unable to determine if patch is bad or there is a pre-existent infrastructure issue"), will send an e-mail to the bot watchers and will stop the testing. That way we avoid infinite loops when something is really wrong either on the clean tree or with the patch.
Passing the tests on the command-line is OK, because:

The maximum number of failures on the first step (500) means that we are passing on the command-line a maximum of 500 tests.
So, the command line we will pass will be something like:
python3 Tools/Scripts/run-webkit-tests --no-build --no-show-results --no-new-test-results --clobber-old-results \
  --release --results-directory layout-test-results --debug-rwt-logging --skip-failing-tests --skipped=always \
  [ list of 500 long test names]
Breaking this command line down we have:

220 characters before the list of tests (let's more than double it to 500 for security)

For the list of tests we can use as maximum (before hitting the ARG_MAX limit)
MacOS (ARG_MAX==1048576):
1048576 - 500 = 1048076 for the 500 tests as maximum
1048076 / 500 => 2096 characters per test as maximum
On Linux the limit is 1/4 of max stack size, but never more than 6MB and never less than 128kb. The default value is 2097152 (1/4 of max stack size, with a stack of 8kb):
2097152 - 500 = 2096652 for the 500 tests as maximum
2096652 / 500 => 4193 characters per test as maximum
The current maximum length of a test name under the LayoutTests directory is currently 181 ( which is imported/w3c/web-platform-tests/html/browsers/the-window-object/apis-for-creating-and-navigating-browsing-contexts-by-name/no_window_open_when_term_nesting_level_nonzero.window.html) that is 11.5 times lower than the Mac case (2096) or 23 times lower than the Linux case (4193)

So, even when using the very worst possible case (500 tests at maximum length of 181) the security margin is big enough (more than 10 or 20 times higher) that is ok to pass the tests on the command-line for a maximum of 500 tests.