This improves the integrity of downloads of upstream artifacts when using fetch-content. If `verify-hash: True` is set on the fetch config, then the chain-of-trust.json of the upstream task is used to retrieve the expected sha256 of the artifact, and the downloaded artifact is checked against it.
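A minimal sketch of the check this enables (illustrative names, not the actual fetch-content code), assuming the expected digest has already been pulled out of the upstream task's chain-of-trust.json:

    import hashlib

    def verify_sha256(path, expected):
        # Hash the downloaded artifact in chunks and compare against the
        # sha256 recorded in the upstream chain-of-trust.json.
        h = hashlib.sha256()
        with open(path, 'rb') as fh:
            for chunk in iter(lambda: fh.read(1024 * 1024), b''):
                h.update(chunk)
        if h.hexdigest() != expected:
            raise ValueError('sha256 mismatch for %s: expected %s, got %s'
                             % (path, expected, h.hexdigest()))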
Differential Revision: https://phabricator.services.mozilla.com/D87725
There are cases where --recurse-submodules breaks things (e.g. when
newer versions of the repository remove a submodule). So don't pass
--recurse-submodules at all at clone or checkout time; instead,
initialize and update submodules after the checkout.
Also don't checkout at clone time: it's redundant with the explicit
checkout that follows, and that explicit checkout is the only one we
really trust anyway.
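Roughly, the resulting flow looks like the following sketch (a simplification, not the exact fetch-content code):

    import subprocess

    def git_checkout(repo, commit, dest):
        # Clone without checking out and without --recurse-submodules; the
        # explicit checkout below is the only one we rely on.
        subprocess.run(['git', 'clone', '--no-checkout', repo, dest], check=True)
        subprocess.run(['git', 'checkout', commit], cwd=dest, check=True)
        # Initialize and update submodules only after the checkout, so a
        # submodule removed in a newer revision can't break the clone.
        subprocess.run(['git', 'submodule', 'update', '--init'],
                       cwd=dest, check=True)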
Differential Revision: https://phabricator.services.mozilla.com/D73353
The win64-aarch64 jobs have a kind of nasty trick that makes fetch-content
download artifacts of a dependent task directly as artifacts of the task
itself. For some reason, while this pattern works on native Windows
jobs, it doesn't on Linux. What happens is essentially that
`pathlib.Path(path).joinpath('../foo').mkdir(parents=True, exist_ok=True)`
fails when `path` doesn't exist first. I guess the fetches directory
already exists on the Windows workers or something.
Unfortunately, os.path.normpath doesn't take `pathlib.Path`s in the
still-supported Python 3.5, so we have to convert to str first.
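An illustrative version of the workaround (not the exact change):

    import os
    import pathlib

    def ensure_subdir(base, relative):
        # pathlib doesn't collapse the '..' component, so mkdir() can fail on
        # Linux when the intermediate directory doesn't exist yet. Normalize
        # the path first; os.path.normpath() needs a str on Python 3.5.
        path = os.path.normpath(str(pathlib.Path(base).joinpath(relative)))
        pathlib.Path(path).mkdir(parents=True, exist_ok=True)
        return pathlib.Path(path)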
Differential Revision: https://phabricator.services.mozilla.com/D66518
--HG--
extra : moz-landing-system : lando
Note: while we can use time.monotonic in fetch-content, we can't in
mach artifact toolchain yet, because the latter is still Python 2.
Differential Revision: https://phabricator.services.mozilla.com/D65690
--HG--
extra : moz-landing-system : lando
It will be relanded once these are complete. This prevents those tasks
from getting scheduled for every push until the initial ones have completed.
Using git-archive for the fetch task means that we don't get the
submodules of a git repository included in the archive. There isn't a
straightforward way to get submodules from a bare repo included with
git-archive, so instead we can simply clone & checkout with
--recurse-submodules and then use a standard tar command to bundle up
the tree.
Adding --recurse-submodules to the commands has no effect on a repo
without submodules, so we can add it to all invocations for simplicity.
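A sketch of that approach (the function shape and the .git exclusion are illustrative, not the actual task commands):

    import subprocess

    def archive_with_submodules(repo, ref, workdir, out_tar):
        # git-archive skips submodules, so materialize a full checkout...
        subprocess.run(['git', 'clone', '--recurse-submodules', repo, workdir],
                       check=True)
        subprocess.run(['git', 'checkout', '--recurse-submodules', ref],
                       cwd=workdir, check=True)
        # ...then bundle the working tree, including submodules, with a
        # standard tar invocation, leaving out the .git metadata.
        subprocess.run(['tar', '--exclude=.git', '-cf', out_tar, '.'],
                       cwd=workdir, check=True)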
Differential Revision: https://phabricator.services.mozilla.com/D46827
--HG--
extra : moz-landing-system : lando
Python's `urllib.request.urlopen(url)` can fail when a system doesn't know how to verify a CA certificate. This patch makes use of the cafile provided by the `certifi` module, if/when it is installed, to verify certificates.
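A minimal sketch of the idea (not the exact patch):

    import urllib.request

    def open_url(url):
        # Use certifi's CA bundle when it's installed; otherwise fall back to
        # the system's default certificate verification.
        try:
            import certifi
            cafile = certifi.where()
        except ImportError:
            cafile = None
        return urllib.request.urlopen(url, cafile=cafile)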
Differential Revision: https://phabricator.services.mozilla.com/D47044
--HG--
extra : source : 92b9ffc8f37ddd16ca3f426d64df059eea38d5fa
MANUAL PUSH: to allow docker images to build without closing autoland
Differential Revision: https://phabricator.services.mozilla.com/D41038
--HG--
extra : rebase_source : 60ae00549917411d1839b6e3f8e6ae962d217470
extra : amend_source : a2531b115f5732345f8c34c88669428510d100a4
Bug 1479533 proposed adding similar functionality, but this
iteration avoids actually unpacking anything, and ensures
reproducibility by relying on the reproducible bits from the original
archives: file ordering, flags, etc. (since the archives are checksummed,
those bits are never going to change for a given archive).
Another notable difference is that this applies the repack in the fetch
task itself, rather than creating a separate task to apply the repack. The
latter has advantages, in that it allows changing the repacking without
redownloading the original file from a third-party server, but in
practice, most changes to the repacking would trigger the download tasks
anyway.
This patch only takes care of changing the archive type (zip->tar) and
the compression type (anything->zstandard).
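A simplified sketch of such a repack, assuming the third-party python-zstandard module and glossing over permission/metadata handling:

    import tarfile
    import zipfile

    import zstandard  # third-party python-zstandard module

    def repack_zip_to_tar_zst(zip_path, out_path):
        # Convert a zip to a zstandard-compressed tar without extracting to
        # disk, preserving the original member order so the output depends
        # only on the (checksummed) input archive.
        cctx = zstandard.ZstdCompressor()
        with open(out_path, 'wb') as out, \
                cctx.stream_writer(out) as writer, \
                tarfile.open(mode='w|', fileobj=writer) as tar, \
                zipfile.ZipFile(zip_path) as zf:
            for info in zf.infolist():
                member = tarfile.TarInfo(info.filename)
                member.size = info.file_size
                if info.is_dir():
                    member.type = tarfile.DIRTYPE
                with zf.open(info) as fh:
                    tar.addfile(member, fh)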
Differential Revision: https://phabricator.services.mozilla.com/D40740
This is loosely based on what was in bug 1467359, but simplified to
handle git only, and simply using git-archive because, at least for now,
it's deterministic (it uses the commit date as the timestamp in tar
archives).
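For reference, the git-archive invocation is essentially the following (a sketch, not the exact task command):

    import subprocess

    def git_archive(repo_dir, ref, prefix, out_path):
        # git-archive output is deterministic for a given commit: tar entries
        # are timestamped with the commit date.
        with open(out_path, 'wb') as out:
            subprocess.run(
                ['git', 'archive', '--format=tar', '--prefix=%s/' % prefix, ref],
                cwd=repo_dir, stdout=out, check=True)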
This also adds 4 tasks for some of the things we use for toolchains, but
doesn't hook them up yet.
This also upgrades the fetch docker image to Debian buster, and installs
the required packages in it.
Differential Revision: https://phabricator.services.mozilla.com/D39480
Rather than trying to parse strings, just pass a JSON blob. This will allow us
to easily do things like mark artifacts to be left unextracted.
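For illustration only (the key names below are assumed, not the actual schema):

    import json
    import os

    # A JSON MOZ_FETCHES value is trivially extensible compared to an ad-hoc
    # string format, e.g. with a per-artifact flag to skip extraction.
    fetches = json.loads(os.environ.get('MOZ_FETCHES', '[]'))
    for fetch in fetches:
        task_id = fetch['task']          # assumed key
        artifact = fetch['artifact']     # assumed key
        extract = fetch.get('extract', True)  # assumed key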
Differential Revision: https://phabricator.services.mozilla.com/D3553
--HG--
extra : rebase_source : 4e762c65d1c9f13361d5bae2e4608ba09bb39a91
This also moves the call to 'fetch_artifacts' in run-task down inside the
try/finally block. This way, if something goes wrong, we'll still clean up
MOZ_FETCHES_DIR.
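The resulting shape, sketched with stand-in callables for the real run-task steps:

    import os
    import shutil

    def run_with_fetches(fetch_artifacts, run_task):
        # fetch_artifacts and run_task stand in for the real steps; the point
        # is that fetching happens inside the try block, so MOZ_FETCHES_DIR is
        # cleaned up even when fetching itself fails.
        fetches_dir = os.environ.get('MOZ_FETCHES_DIR')
        try:
            fetch_artifacts()
            return run_task()
        finally:
            if fetches_dir and os.path.isdir(fetches_dir):
                shutil.rmtree(fetches_dir)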
Differential Revision: https://phabricator.services.mozilla.com/D4152
--HG--
extra : moz-landing-system : lando
Otherwise it can't be used as a context manager since it
doesn't have __enter__ or __exit__.
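For reference, the minimal protocol a class needs before it can be used in a `with` statement (the class name here is a placeholder):

    class Downloader:
        def __enter__(self):
            return self

        def __exit__(self, exc_type, exc_value, traceback):
            self.close()
            return False  # don't swallow exceptions

        def close(self):
            pass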
Differential Revision: https://phabricator.services.mozilla.com/D2672
--HG--
extra : moz-landing-system : lando
This is what a lot of programs do.
We do logging in a helper function so we can flush after every write.
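A sketch of such a helper (illustrative, not the exact code):

    import sys

    def log(msg):
        # Route all output through one helper and flush immediately, so
        # messages show up in order in CI logs even when stdout is a pipe.
        print(msg)
        sys.stdout.flush()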
Differential Revision: https://phabricator.services.mozilla.com/D2526
--HG--
extra : rebase_source : 98563aee129c16662a783122241623b8ed2fe457
Previously, we told `tar` or `unzip` to operate on an explicit file.
This worked when `tar` understood the compression format of the file.
And this worked in the majority of cases.
But `tar` does not support zstandard compression (at least not outside
extremely new versions, which aren't yet widely deployed). And not all
versions of `tar` support the `-a` argument.
This commit changes our invocation of `tar` so input data is piped
to it from Python. In the case of `tar`, we perform decompression in
Python, if possible. This allows us to support zstandard, as well as `tar`
binaries that don't support `-a` to auto-detect the compression format.
I wanted to be consistent and always pipe the raw data via stdin.
But `unzip` doesn't appear to like this. Oh well.
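A simplified sketch of the piping approach (only zstandard decompression is shown, and the third-party python-zstandard module is assumed):

    import subprocess

    import zstandard  # third-party python-zstandard module

    def extract_tar(path, dest):
        # Open the archive in Python and pipe plain tar data to tar's stdin,
        # so the tar binary doesn't need to understand the compression format
        # (or support -a).
        with open(path, 'rb') as raw:
            fh = (zstandard.ZstdDecompressor().stream_reader(raw)
                  if path.endswith('.zst') else raw)
            proc = subprocess.Popen(['tar', '-xf', '-'], cwd=dest,
                                    stdin=subprocess.PIPE)
            while True:
                chunk = fh.read(131072)
                if not chunk:
                    break
                proc.stdin.write(chunk)
            proc.stdin.close()
            if proc.wait():
                raise Exception('tar extraction failed for %s' % path)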
We also refactor the logic around detecting archives. We have a
function to identify the archive type based on a filename. We then
pass the archive type to the extraction function and key off it
within that function. We also conditionally call extract_archive() and
fail hard in extract_archive() when things fail. This will make
future archive code easier to reason about.
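The filename-based detection amounts to something like the following (the suffix list is illustrative, not exhaustive):

    def archive_type(path):
        # Identify the archive type from the filename so extraction can key
        # off it.
        if path.endswith('.zip'):
            return 'zip'
        if path.endswith(('.tar', '.tar.gz', '.tgz', '.tar.bz2',
                          '.tar.xz', '.tar.zst')):
            return 'tar'
        return None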
Differential Revision: https://phabricator.services.mozilla.com/D1576
--HG--
extra : rebase_source : 1c66396cced1b2a94a959386eecc3f512b033308
Currently 'fetch' artifacts are all extracted into the same directory, which
could make the extdir messy or, in the worst case, cause file name collisions.
Some artifacts are OK to extract into the same directory, as they're already
bundled within the archive. But other artifacts are not. This patch keeps the
default behaviour (extracting everything into the same directory), but allows
task authors to specify per-artifact directories to extract into.
The syntax is:
path[>dest]@<task>
The 'dest' value will be a subdirectory of the MOZ_FETCHES_DIR environment
variable.
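Parsing that syntax is straightforward; a sketch (the function name is illustrative):

    def parse_fetch_spec(spec):
        # "path[>dest]@<task>": dest defaults to the top of MOZ_FETCHES_DIR
        # when omitted.
        path, _, task = spec.rpartition('@')
        if '>' in path:
            path, dest = path.rsplit('>', 1)
        else:
            dest = ''
        return task, path, dest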
Depends on D2102.
Differential Revision: https://phabricator.services.mozilla.com/D2166
--HG--
extra : moz-landing-system : lando
Currently, many tasks fetch content from the Internets. A problem with
that is fetching from the Internets is unreliable: servers may have
outages or be slow; content may disappear or change out from under us.
The unreliability of 3rd party services poses a risk to Firefox CI.
If services aren't available, we could potentially not run some CI tasks.
In the worst case, we might not be able to release Firefox. That would
be bad. In fact, as I write this, gmplib.org has been unavailable for
~24 hours and Firefox CI is unable to retrieve the GMP source code.
As a result, building GCC toolchains is failing.
A solution to this is to make tasks more hermetic by depending on
fewer network services (which by definition aren't reliable over time
and therefore introduce instability).
This commit attempts to mitigate some external service dependencies
by introducing the *fetch* task kind.
The primary goal of the *fetch* kind is to obtain remote content and
re-expose it as a task artifact. By making external content available
as a cached task artifact, we allow dependent tasks to consume this
content without touching the service originally providing that
content, thus eliminating a run-time dependency and making tasks more
hermetic and reproducible over time.
We introduce a single "fetch-url" "using" flavor to define tasks that
fetch single URLs and then re-expose that URL as an artifact. Powering
this is a new, minimal "fetch" Docker image that contains a
"fetch-content" Python script that does the work for us.
We have added tasks to fetch source archives used to build the GCC
toolchains.
Fetching remote content and re-exposing it as an artifact is not
very useful by itself: the value is in having tasks use those
artifacts.
We introduce a taskgraph transform that allows tasks to define an
array of "fetches." Each entry corresponds to the name of a "fetch"
task kind. When present, the corresponding "fetch" task is added as a
dependency. And the task ID and artifact path from that "fetch" task
are added to the MOZ_FETCHES environment variable of the task depending
on it. Our "fetch-content" script has a "task-artifacts"
sub-command that tasks can execute to perform retrieval of all
artifacts listed in MOZ_FETCHES.
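In rough terms, the consumer side looks like this sketch (the entry format and queue URL layout are assumptions for illustration, not the exact implementation):

    import os
    import urllib.request

    QUEUE = 'https://queue.taskcluster.net/v1'  # assumed queue root URL

    def task_artifacts(dest_dir):
        # Each MOZ_FETCHES entry names an artifact path and the task that
        # produced it; download each into MOZ_FETCHES_DIR.
        for entry in os.environ.get('MOZ_FETCHES', '').split():
            path, _, task_id = entry.rpartition('@')
            url = '%s/task/%s/artifacts/%s' % (QUEUE, task_id, path)
            dest = os.path.join(dest_dir, os.path.basename(path))
            urllib.request.urlretrieve(url, dest)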
To prove all of this works, the code for fetching dependencies when
building GCC toolchains has been updated to use `fetch-content`. The
now-unused legacy code has been deleted.
This commit improves the reliability and efficiency of GCC toolchain
tasks. Dependencies now all come from task artifacts and should always
be available in the common case. In addition, `fetch-content` downloads
and extracts files concurrently. This makes it faster than the serial
approach we were previously using.
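The concurrency itself is the standard `concurrent.futures` pattern; a sketch with the per-artifact work passed in as a callable:

    import concurrent.futures

    def fetch_all(fetch_one, items, max_workers=4):
        # Run fetch_one(item) for every item concurrently and surface the
        # first failure, instead of downloading and extracting serially.
        with concurrent.futures.ThreadPoolExecutor(max_workers) as executor:
            futures = [executor.submit(fetch_one, item) for item in items]
            for future in concurrent.futures.as_completed(futures):
                future.result()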
There are some things I don't like about this commit.
First, a new Docker image and Python script for downloading URLs feels
a bit heavyweight. The Docker image is definitely overkill as things
stand. I can eventually justify it because I want to implement support
for fetching and repackaging VCS repositories and for caching Debian
packages. These will require more packages than what I'm comfortable
installing on the base Debian image, therefore justifying a dedicated
image.
The `fetch-content static-url` sub-command could definitely be
implemented as a shell script. But Python is readily available and
is more pleasant to maintain than shell, so I wrote it in Python.
`fetch-content task-artifacts` is more advanced and writing it in
Python is more justified, IMO. FWIW, the script is Python 3 only,
which conveniently gives us access to `concurrent.futures`, which
facilitates concurrent download.
`fetch-content` also duplicates functionality found elsewhere.
generic-worker's task payload supports a "mounts" feature which
facilitates downloading remote content, including from a task
artifact. However, this feature doesn't exist on docker-worker.
So we have to implement downloading inside the task rather than
at the worker level. I concede that if all workers had generic-worker's
"mounts" feature and supported concurrent download, `fetch-content`
wouldn't need to exist.
`fetch-content` also duplicates functionality of
`mach artifact toolchain`. I probably could have used
`mach artifact toolchain` instead of writing
`fetch-content task-artifacts`. However, I didn't want to introduce
the requirement of a VCS checkout. `mach artifact toolchain` has its
origins in providing a feature to the build system. And "fetching
artifacts from tasks" is a more generic feature than that. I think
it should be implemented as a generic feature and not something that is
"toolchain" specific.
I think the best place for a generic "fetch content" feature is in
the worker, where content can be defined in the task payload. But as
explained above, that feature isn't universally available. The next
best place is probably run-task. run-task already performs generic,
very-early task preparation steps, such as performing a VCS checkout.
I would like to fold `fetch-content` into run-task and make it all
driven by environment variables. But run-task is currently Python 2
and achieving concurrency would involve a bit of programming (or
adding package dependencies). I may very well port run-task to Python
3 and then fold fetch-content into it. Or maybe we leave
`fetch-content` as a standalone script.
MozReview-Commit-ID: AGuTcwNcNJR
--HG--
extra : source : 0b941cbdca76fb2fbb98dc5bbc1a0237c69954d0
extra : histedit_source : a3e43bdd8a9a58550bef02fec3be832ca304ea93