Summary: Bug 1679892 - add initial schedule of CI config changes in-tree. r=releng-reviewers,aki
Add ci-configuration process and schedule of CI config changes in-tree.
Differential Revision: https://phabricator.services.mozilla.com/D98252
parent 419b002c14
commit aa94a1b6e4
3 changed files with 114 additions and 0 deletions

@@ -34,6 +34,7 @@ categories:
- tools/moztreedocs
testing_doc:
- testing/testing-policy
- testing/ci-configs
- testing/marionette
- testing/geckodriver
- web-platform

testing/docs/ci-configs/index.md
@@ -0,0 +1,65 @@
# Configuration Changes

This process outlines how Mozilla handles CI configuration changes. For a list of configuration changes, see the [schedule](schedule.html).

## Infrastructure setup (2-4 weeks)

This work happens behind the scenes. When a configuration change is needed (an upgrade or the addition of a new platform), the first step
is to build a machine and get the OS working with taskcluster. The hardware/cloud work is done by IT. Sometimes
this is as simple as installing a package or changing an OS setting on an existing machine, but even that requires automation and documentation.

In some cases there is little to no work, because the CI change only runs tests with different runtime settings (environment variables or preferences).

## Setting up a pool on try server (1 week)

The next step is to make some machines available on try server. Here we add code in-tree to support the new config
(a new worker type, test variant, etc.) and validate that the setup done by IT works with the taskcluster client. Releng then ensures the target tests
can run at a basic level (mozharness, testharness, OS environment, logging, at least something passes).

## Green up tests (1 week)

In this stage Releng runs all the target tests on try server and disables, skips, or marks as fail-if every test that is not passing or is frequently
intermittent. This typically takes a dozen or so iterations, because a crash in one test means the rest of the tests in the
manifest do not run.

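The need for repeated iterations can be illustrated with a small simulation. This is only a sketch under stated assumptions: `run_manifest`, `green_up`, and the outcome strings are hypothetical names invented for illustration, not part of any Mozilla harness. It models a manifest whose run stops at the first crash, so each try push can only reveal problems up to that point.

```python
def run_manifest(tests, outcomes, skipped):
    """Run tests in order; a crash aborts the rest of the manifest.

    `outcomes` maps a test name to "pass", "fail", or "crash"
    (hypothetical labels used only in this sketch).
    """
    results = {}
    for name in tests:
        if name in skipped:
            continue
        results[name] = outcomes[name]
        if outcomes[name] == "crash":
            break  # the remaining tests in this manifest never run
    return results


def green_up(tests, outcomes):
    """Repeat simulated try pushes until the manifest runs clean.

    Returns the set of tests that had to be skipped and the number
    of iterations it took to get there.
    """
    skipped = set()
    iterations = 0
    while True:
        iterations += 1
        results = run_manifest(tests, outcomes, skipped)
        bad = {name for name, result in results.items() if result != "pass"}
        if not bad:
            return skipped, iterations
        skipped |= bad


tests = ["a", "b", "c", "d", "e"]
outcomes = {"a": "pass", "b": "crash", "c": "fail", "d": "crash", "e": "pass"}
skipped, iterations = green_up(tests, outcomes)
print(sorted(skipped), iterations)
```

In this toy manifest the crash in `b` hides the problems in `c` and `d` during the first run, which is why one pass over the results is not enough.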
## Turn on new config as tier-2 (1/2 week)

We will time this for the start of a new release.

Releng will land manifest changes for all non-passing tests and then schedule the new jobs by default. The new jobs will be tier-2 for a couple of reasons:
* it is a new config with a lot of tests that still need attention
* in many cases there is a previous config (for example, when upgrading windows 10 from 1803 -> 1903) which is still running in parallel as tier-1

The new config will now run on mozilla-central and the integration branches and be available on try server. In the few cases where machines are limited (android phones),
we will need to turn off the old config or hide try server access behind `./mach try --full`.

## Turn on new backstop jobs which run the skipped tests (1/2 week)

Releng will turn on a new temporary job that runs the tests that are not green and are therefore skipped by default. These jobs will run as tier-2 on mozilla-central and be sheriffed.

The goal is to find tests that are now passing and should be run by default. This way we effectively run all the tests, instead of
disabling dozens of tests and forgetting about them.

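One way to think about the backstop results is as a per-test history from which candidates for re-enabling can be picked. The sketch below is an illustration only: the function name, the result strings, and the threshold of five consecutive passes are assumptions, not values defined by this process.

```python
def candidates_to_reenable(history, min_consecutive_passes=5):
    """Return skipped tests whose most recent backstop runs all passed.

    `history` maps a test name to its backstop results, oldest first.
    The default threshold is an arbitrary value chosen for this sketch.
    """
    ready = []
    for test, results in history.items():
        recent = results[-min_consecutive_passes:]
        if len(recent) == min_consecutive_passes and all(r == "pass" for r in recent):
            ready.append(test)
    return sorted(ready)


history = {
    "test_fixed.js": ["fail", "pass", "pass", "pass", "pass", "pass"],
    "test_flaky.js": ["pass", "pass", "fail", "pass", "pass", "pass"],
    "test_new.js": ["pass", "pass"],
}
print(candidates_to_reenable(history))
```

Requiring a full window of passes keeps frequently intermittent tests, and tests with too little history, out of the default set.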
## Handoff to developers (1 week)

Releng will file bugs for all failing tests (one bug per manifest) and needinfo the triage owner to raise awareness that one or more tests in their area need
attention. At this point Releng is done and will move on to other work. Developers can reproduce the failures on try server and, once they are fixed, edit the manifest
as appropriate.

There will be at least 6 weeks to investigate and fix the tests before they are promoted to tier-1.

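The "one bug per manifest" rule amounts to grouping failing tests by the manifest that lists them. A minimal sketch follows, assuming the failures are available as `(manifest, test)` pairs; the function name and the paths are hypothetical and used only for illustration.

```python
from collections import defaultdict


def bugs_per_manifest(failing_tests):
    """Group failing tests by manifest so that one bug covers each manifest.

    `failing_tests` is an iterable of (manifest_path, test_name) pairs.
    Returns a dict mapping each manifest to its sorted failing tests.
    """
    by_manifest = defaultdict(list)
    for manifest, test in failing_tests:
        by_manifest[manifest].append(test)
    return {m: sorted(names) for m, names in sorted(by_manifest.items())}


# Hypothetical paths, not real files in the tree.
failures = [
    ("dom/example/mochitest.ini", "test_b.html"),
    ("layout/example/reftest.list", "test_c.html"),
    ("dom/example/mochitest.ini", "test_a.html"),
]
for manifest, names in bugs_per_manifest(failures).items():
    print(f"{manifest}: {len(names)} failing test(s): {', '.join(names)}")
```

Each resulting entry corresponds to one bug, which would then be routed to the triage owner for that area.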
## Move config to tier-1 (6-7 weeks later)

After the config that has been running as tier-2 makes it to beta and then to the release branch (i.e. 2 new releases later), Releng will:
* turn off the old tier-1 tests (if applicable)
* promote the tier-2 jobs to tier-1
* turn off the backstop jobs

This gives developers a 6-week period in which to schedule time to investigate and fix any test failures.

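Adding up the durations in the section headings gives a rough end-to-end estimate. The sketch below assumes the stages run strictly one after another and that the 6-7 week wait before tier-1 promotion starts after handoff; both are assumptions made for this illustration, and the constant and function names are hypothetical.

```python
# Stage durations in weeks as (name, min, max), taken from the section headings.
STAGES = [
    ("Infrastructure setup", 2, 4),
    ("Setting up a pool on try server", 1, 1),
    ("Green up tests", 1, 1),
    ("Turn on new config as tier-2", 0.5, 0.5),
    ("Turn on new backstop jobs", 0.5, 0.5),
    ("Handoff to developers", 1, 1),
]

# Assumption: the wait before tier-1 promotion begins once handoff is done.
TIER1_WAIT = (6, 7)


def total_weeks(stages, extra=(0, 0)):
    """Return the (min, max) total duration in weeks for sequential stages."""
    low = sum(stage[1] for stage in stages) + extra[0]
    high = sum(stage[2] for stage in stages) + extra[1]
    return low, high


print("Through handoff:", total_weeks(STAGES))
print("Through tier-1 promotion:", total_weeks(STAGES, TIER1_WAIT))
```

Under these assumptions Releng's active work spans roughly 6 to 8 weeks, and a config reaches tier-1 about 12 to 15 weeks after work begins.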

testing/docs/ci-configs/schedule.md
@@ -0,0 +1,48 @@
# Schedule

For each CI config change, we need to address the following:
* scope of work (what will run, how frequently)
* capacity planning (cost, physical space limitations)
* whether this replaces anything or is 100% new
* puppet/deployment scripts or documentation
* setting up a pool on try server
* updating the documentation on this page and communicating with release management and others as appropriate

## Current / Future CI config changes

Start Date | Completed | Tracking Bug | Description
--- | --- | --- | ---
October 2020 | TBD | [Bug 1665012](https://bugzilla.mozilla.org/show_bug.cgi?id=1665012) | add samsung S7 phones for perf testing
November 2020 | TBD | [Bug 1676850](https://bugzilla.mozilla.org/show_bug.cgi?id=1676850) | Windows tests migrate from AWS -> Datacenter/Azure and 1803 -> 1903
November 2020 | TBD | TBD | upgrade datacenter linux perf machines from ubuntu 16.04 to 18.04
TBD | TBD | [Bug 1665012](https://bugzilla.mozilla.org/show_bug.cgi?id=1665012) | Android phones upgrade from version 7 -> 10
October 2020 | TBD | [Bug 1673067](https://bugzilla.mozilla.org/show_bug.cgi?id=1673067) | Run tests on MacOSX BigSur (subset in parallel)
October 2020 | TBD | [Bug 1673067](https://bugzilla.mozilla.org/show_bug.cgi?id=1673067) | Run tests on MacOSX Aarch64 (subset in parallel)
December 2020 | TBD | TBD | Migrate OSX from Mac Mini R7, OSX 10.14 (Mojave) -> Mac Mini R8, OSX 10.15 (Catalina)
TBD | TBD | TBD | Migrate more coverage of OSX from 10.14 to BigSur/aarch64
TBD | TBD | TBD | Upgrade ubuntu from 18.04 to 20.04
TBD | TBD | TBD | Upgrade android emulators to modern version
September 2020 | TBD | [Bug 1548264](https://bugzilla.mozilla.org/show_bug.cgi?id=1548264) | Python 2.7 -> 3.6 migration in CI
TBD | TBD | [Bug 1665010](https://bugzilla.mozilla.org/show_bug.cgi?id=1665010) | Add more android phone hardware (replace moto g5 and probably pixel 2)
TBD | TBD | TBD | Upgrade datacenter hardware for windows/linux (primarily perf)
TBD | TBD | TBD | Add Linux ARM64 worker in AWS (as it is close to Apple Silicon)

## Completed CI config changes

Start Date | Completed | Tracking Bug | Description
--- | --- | --- | ---
July 2020 | October 2020 | [Bug 1653344](https://bugzilla.mozilla.org/show_bug.cgi?id=1653344) | Remove EDID dongles from MacOSX machines
August 2020 | September 2020 | [Bug 1643689](https://bugzilla.mozilla.org/show_bug.cgi?id=1643689) | Schedule tests by test selection/manifest
June 2020 | August 2020 | [Bug 1486004](https://bugzilla.mozilla.org/show_bug.cgi?id=1486004) | Android hardware tests running without rooted phones
August 2019 | January 2020 | [Bug 1572242](https://bugzilla.mozilla.org/show_bug.cgi?id=1572242) | Upgrade Ubuntu from 16.04 to 18.04 (finished in January)

## Appendix

* *OS*: base operating system, such as Android, Linux, Mac OSX, or Windows
* *Hardware*: the specific cpu/memory/disk/graphics/display/inputs that we are using; this could be physical hardware we own or manage, or a cloud provider
* *Platform*: a combination of hardware and OS
* *Configuration*: what we change on a platform (this can be done at runtime with flags): installed OS software updates (service pack), tools (python/node/etc.), hardware or OS settings (anti-aliasing, display resolution, background processes, clipboard), environment variables
* *Test Failure*: a test doesn't report the expected result (if we expect a fail and we crash, that is unexpected). Typically this is a failure, but it can be a timeout, a crash, a test that did not run, or even a pass
* *Greening up*: When all tests return expected results (passing), they are green. When tests fail, they are orange. We need to get all tests green by investigating test failures.

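The *Test Failure* definition above compares the actual result with the expected one, so a pass can count as a failure. A minimal sketch of that comparison follows; the function name and the status strings are hypothetical labels chosen for illustration, not harness constants.

```python
def is_test_failure(expected, actual):
    """A test fails, in the sense defined above, when the actual
    result differs from the expected result."""
    return actual != expected


# Expected to fail (for example, annotated fail-if) but it crashed: unexpected.
print(is_test_failure(expected="fail", actual="crash"))
# Expected to fail but it passed: also unexpected.
print(is_test_failure(expected="fail", actual="pass"))
# Expected to fail and it failed: not a test failure.
print(is_test_failure(expected="fail", actual="fail"))
```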