Summary: Bug 1679892 - add initial schedule of CI config changes in-tree. r=releng-reviewers,aki

add ci-configuration process and schedule of CI config changes in-tree

Differential Revision: https://phabricator.services.mozilla.com/D98252
Joel Maher 2020-12-08 21:17:24 +00:00
parent 419b002c14
commit aa94a1b6e4
3 changed files with 114 additions and 0 deletions


@ -34,6 +34,7 @@ categories:
- tools/moztreedocs
testing_doc:
- testing/testing-policy
- testing/ci-configs
- testing/marionette
- testing/geckodriver
- web-platform


@ -0,0 +1,65 @@
# Configuration Changes
This process outlines how Mozilla will handle configuration changes. For a list of configuration changes, please see the [schedule](schedule.html).
## Infrastructure setup (2-4 weeks)
This stage happens behind the scenes. When there is a need for a configuration change (an upgrade or the addition of a new platform), the first step
is to build a machine and get the OS working with taskcluster. This work on hardware/cloud is done by IT. Sometimes
this is as simple as installing a package or changing an OS setting on an existing machine, but even that requires automation and documentation.
In some cases there is little to no work because the CI change only runs tests with different runtime settings (environment variables or preferences).
## Setting up a pool on try server (1 week)
The next step is getting some machines available on try server. This is where we add some code in-tree to support the new config
(a new worker type, test variant, etc.) and validate that any setup done by IT works with the taskcluster client. Then Releng ensures the target tests
can run at a basic level (mozharness, test harness, OS environment, logging, at least some tests pass).
## Green up tests (1 week)
In this stage Releng will run all the target tests on try server and disable, skip, or mark as fail-if all tests that are not passing or are frequently
intermittent. Typically there are a dozen or so iterations of this because a crash on one test means we don't run the rest of the tests in the
manifest.
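
The need for repeated iterations can be illustrated with a small simulation. This is a hypothetical sketch, not Releng tooling: `run_manifest`, `green_up`, the test names, and the `crashes`/`fails` sets are all made up for illustration. It assumes a manifest run stops at the first crash, so each try push surfaces at most one crashing test.

```python
def run_manifest(tests, crashes, fails, skipped):
    """Simulate one try push: run tests in order, stopping at the first crash.

    Returns (failed, crashed): the tests that ran and failed, and the test
    that crashed (or None if nothing crashed).
    """
    failed = []
    for test in tests:
        if test in skipped:
            continue
        if test in crashes:
            return failed, test  # the rest of the manifest never runs
        if test in fails:
            failed.append(test)
    return failed, None


def green_up(tests, crashes, fails):
    """Repeat try pushes, skipping bad tests, until the manifest is green."""
    skipped = set()
    pushes = 0
    while True:
        pushes += 1
        failed, crashed = run_manifest(tests, crashes, fails, skipped)
        if not failed and crashed is None:
            return skipped, pushes
        skipped.update(failed)
        if crashed is not None:
            skipped.add(crashed)


# Hypothetical manifest: two crashing tests and one plain failure.
tests = ["test_a", "test_b", "test_c", "test_d", "test_e"]
skipped, pushes = green_up(tests, crashes={"test_b", "test_d"}, fails={"test_e"})
print(sorted(skipped), pushes)
```

In this toy manifest, the failure in `test_e` stays hidden until both crashes ahead of it have been skipped, which is why the number of pushes grows with the number of crashing tests.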
## Turn on new config as tier-2 (1/2 week)
We will time this at the start of a new release.
Releng will land changes to manifests for all non-passing tests and then schedule the new jobs by default. This will be tier-2 for a couple of reasons:
* it is a new config with a lot of tests that still need attention
* in many cases there is a previous config (let's say upgrading Windows 10 from 1803 -> 1903) which is still running in parallel as tier-1
This will now run on central and integration and be available on try server. In a few cases where there are limited machines (android phones),
we will need to turn off the old config, or hide the try server access behind `./mach try --full`.
## Turn on new backstop jobs which run the skipped tests (1/2 week)
Releng will turn on a new temporary job that will run the tests which are not green by default. These will run as tier-2 on mozilla-central and be sheriffed.
The goal here is to find tests that are now passing and should be run by default. By doing this we are effectively running all the tests instead of
disabling dozens of tests and forgetting about them.
## Handoff to developers (1 week)
Releng will file bugs for all failing tests (one bug per manifest) and needinfo the triage owner to raise awareness that one or more tests in their area need
attention. At this point, Releng is done and will move on to other work. Developers can reproduce the failures on try server and, once they are fixed, edit the manifest
as appropriate.
There will be at least 6 weeks to investigate and fix the tests before they are promoted to tier-1.
## Move config to tier-1 (6-7 weeks later)
After the config that has been running as tier-2 makes it to beta and then to the release branch (i.e. 2 new releases later), Releng will:
* turn off the old tier-1 tests (if applicable)
* promote the tier-2 jobs to tier-1
* turn off the backstop jobs
This allows developers to schedule time within a 6-week period to investigate and fix any test failures.
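
For planning purposes, the stage durations given in the headings can be added up. The sketch below assumes the stages run strictly one after another with no overlap, which the process does not state explicitly, so treat the total as a rough estimate rather than a commitment.

```python
# (stage, min_weeks, max_weeks), taken from the section headings above.
STAGES = [
    ("Infrastructure setup", 2, 4),
    ("Setting up a pool on try server", 1, 1),
    ("Green up tests", 1, 1),
    ("Turn on new config as tier-2", 0.5, 0.5),
    ("Turn on new backstop jobs", 0.5, 0.5),
    ("Handoff to developers", 1, 1),
    ("Move config to tier-1", 6, 7),
]


def total_weeks(stages):
    """Sum the minimum and maximum durations, assuming sequential stages."""
    low = sum(stage[1] for stage in stages)
    high = sum(stage[2] for stage in stages)
    return low, high


low, high = total_weeks(STAGES)
print(f"{low}-{high} weeks")
```

Under that sequential assumption a config change takes roughly 12 to 15 weeks from the start of infrastructure work to tier-1.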


@ -0,0 +1,48 @@
# Schedule
For each CI config change, we need to follow:
* scope of work (what will run, how frequently)
* capacity planning (cost, physical space limitations)
* will this replace anything or is this 100% new
* puppet/deployment scripts or documentation
* setup pool on try server
* documentation updated on this page; communicate with release management and others as appropriate
## Current / Future CI config changes
Start Date | Completed | Tracking Bug | Description
--- | --- | --- | ---
October 2020 | TBD | [Bug 1665012](https://bugzilla.mozilla.org/show_bug.cgi?id=1665012) | Add Samsung S7 phones for perf testing
November 2020 | TBD | [Bug 1676850](https://bugzilla.mozilla.org/show_bug.cgi?id=1676850) | Windows tests migrate from AWS -> Datacenter/Azure and 1803 -> 1903
November 2020 | TBD | TBD | Upgrade datacenter Linux perf machines from Ubuntu 16.04 to 18.04
TBD | TBD | [Bug 1665012](https://bugzilla.mozilla.org/show_bug.cgi?id=1665012) | Android phones upgrade from version 7 -> 10
October 2020 | TBD | [Bug 1673067](https://bugzilla.mozilla.org/show_bug.cgi?id=1673067) | Run tests on MacOSX BigSur (subset in parallel)
October 2020 | TBD | [Bug 1673067](https://bugzilla.mozilla.org/show_bug.cgi?id=1673067) | Run tests on MacOSX Aarch64 (subset in parallel)
December 2020 | TBD | TBD | Migrate OSX from Mac Mini R7, OSX 10.14 (Mojave) -> Mac Mini R8, OSX 10.15 (Catalina)
TBD | TBD | TBD | Migrate more coverage of OSX from 10.14 to BigSur/aarch64
TBD | TBD | TBD | Upgrade Ubuntu from 18.04 to 20.04
TBD | TBD | TBD | Upgrade Android emulators to modern version
September 2020 | TBD | [Bug 1548264](https://bugzilla.mozilla.org/show_bug.cgi?id=1548264) | Python 2.7 -> 3.6 migration in CI
TBD | TBD | [Bug 1665010](https://bugzilla.mozilla.org/show_bug.cgi?id=1665010) | Add more Android phone hardware (replace Moto G5 and probably Pixel 2)
TBD | TBD | TBD | Upgrade datacenter hardware for Windows/Linux (primarily perf)
TBD | TBD | TBD | Add Linux ARM64 worker in AWS (as it is close to Apple Silicon)
## Completed CI config changes
Start Date | Completed | Tracking Bug | Description
--- | --- | --- | ---
July 2020 | October 2020 | [Bug 1653344](https://bugzilla.mozilla.org/show_bug.cgi?id=1653344) | Remove EDID dongles from MacOSX machines
August 2020 | September 2020 | [Bug 1643689](https://bugzilla.mozilla.org/show_bug.cgi?id=1643689) | Schedule tests by test selection/manifest
June 2020 | August 2020 | [Bug 1486004](https://bugzilla.mozilla.org/show_bug.cgi?id=1486004) | Android hardware tests running without rooted phones
August 2019 | January 2020 | [Bug 1572242](https://bugzilla.mozilla.org/show_bug.cgi?id=1572242) | Upgrade Ubuntu from 16.04 to 18.04 (finished in January)
## Appendix:
* *OS*: base operating system such as Android, Linux, Mac OSX, Windows
* *Hardware*: specific cpu/memory/disk/graphics/display/inputs that we are using, could be physical hardware we own or manage, or it could be a cloud provider.
* *Platform*: a combination of hardware and OS
* *Configuration*: what we change on a platform (can be runtime with flags), installed OS software updates (service pack), tools (python/node/etc.), hardware or OS settings (anti aliasing, display resolution, background processes, clipboard), environment variables.
* *Test Failure*: a test doesn't report the expected result (if we expect fail and we crash, that is unexpected). Typically this is a failure, but it can be a timeout, crash, not run, or even pass.
* *Greening up*: Assuming all tests return expected results (passing), they are green. When tests fail, they are orange. We need to find a way to get all tests green by investigating test failures.
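
The *Test Failure* definition above can be sketched as a comparison between the expected and the actual status. The status names and the function `is_test_failure` are assumptions chosen for illustration; they are not the identifiers used by any particular test harness.

```python
# Assumed status vocabulary, based on the outcomes named in the definition.
STATUSES = {"pass", "fail", "timeout", "crash", "notrun"}


def is_test_failure(expected, actual):
    """A test failure is any actual status that differs from the expected one."""
    if expected not in STATUSES or actual not in STATUSES:
        raise ValueError(f"unknown status: {expected!r} / {actual!r}")
    return actual != expected


print(is_test_failure("pass", "pass"))   # expected result, so the test is green
print(is_test_failure("fail", "crash"))  # expected fail but crashed: unexpected
print(is_test_failure("fail", "pass"))   # an unexpected pass also counts
```

Treating an unexpected pass as a failure is what lets tests annotated as failing be noticed and re-enabled once they are fixed.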