Bug 1861516 - Update Translations language-identification source docs r=gregtatum
Updates the Firefox source docs related to Translations language identification to reflect that fastText is no longer used by Translations, and that we use CLD2 only.

Differential Revision: https://phabricator.services.mozilla.com/D192660
parent
c90f8889ed
commit
e10b593f4f
2 changed files with 11 additions and 294 deletions
|
|
@@ -80,20 +80,16 @@ architecture to identify content as being written in a detected language.
|
|||
|
||||
### Technology
|
||||
|
||||
Firefox Translations utilizes a [WASM] version of the [fastText] library to identify in which
|
||||
language content is written.
|
||||
Firefox Translations utilizes a [CLD2] language detector to identify in which language content is written.
|
||||
|
||||
### Models
|
||||
|
||||
Unlike the language translations models in the [section](#language-translations) above, the [fastText]
|
||||
model is a one-to-many model that is capable of detecting all of our supported languages
|
||||
from the single model.
|
||||
No models are currently used for language identification, since [CLD2] exists in the Firefox source tree.
|
||||
|
||||
---
|
||||
## Remote Settings
|
||||
|
||||
Firefox Translations utilizes [Remote Settings] to download [WASM] binaries, [Language Translation](#language-translation)
|
||||
models and [Language Identification](#language-identification) models to use locally on your system.
|
||||
Remote Settings is not currently used for language identification, since [CLD2] exists in the Firefox source tree.
|
||||
|
||||
---
|
||||
## Using Firefox Translations
|
||||
|
|
@@ -139,7 +135,7 @@ It is, however, useful and fun, so it is documented here.
|
|||
|
||||
<!-- Hyperlinks -->
|
||||
[Bergamot]: https://browser.mt/
|
||||
[fastText]: https://fasttext.cc/
|
||||
[CLD2]: https://github.com/CLD2Owners/cld2
|
||||
[Firefox Nightly]: https://www.mozilla.org/en-US/firefox/channel/desktop/
|
||||
[Marian]: https://aclanthology.org/P18-4020/
|
||||
[Remote Settings]: https://remote-settings.readthedocs.io/en/latest/
|
||||
|
|
|
|||
|
|
@@ -13,11 +13,11 @@ to provide helpful information regarding contributing to Firefox Translations.
|
|||
- [Versioning](#versioning)
|
||||
- [Non-Breaking Changes](#non-breaking-changes)
|
||||
- [Breaking Changes](#breaking-changes)
|
||||
- [Building fastText](#building-fasttext)
|
||||
- [Downloading The Models](#downloading-the-models)
|
||||
- [Building the WASM Binary](#building-the-wasm-binary)
|
||||
- [Dependencies](#dependencies)
|
||||
- [Modifying the EMCXXFLAGS](#modifying-the-emcxxflags)
|
||||
- [Language Identification](#language-identification)
|
||||
- [Building Bergamot](#building-bergamot)
|
||||
|
||||
---
|
||||
|
|
@@ -127,290 +127,11 @@ Tying breaking changes to releases in this way frees up Firefox Translations to
|
|||
switching one third-party library for another in the compiled source code, while allowing older versions of Firefox to continue utilizing the old library and allowing newer versions of Firefox to utilize the new library.
|
||||
|
||||
---
|
||||
## Building fastText
|
||||
## Language Identification
|
||||
|
||||
### Downloading the Models
|
||||
Translations currently uses the [CLD2] language detector.
|
||||
|
||||
The fastText model that we use can be downloaded directly from the fastText website:<br>
|
||||
> [https://fasttext.cc/docs/en/language-identification.html](https://fasttext.cc/docs/en/language-identification.html)
|
||||
|
||||
Firefox Translations uses the compressed, **`lid.176.ftz`** model.
|
||||
|
||||
### Building the WASM Binary
|
||||
|
||||
To build the fastText [WASM] binary, we can follow the steps in the [Requirements] section of the fastText website.
|
||||
|
||||
#### Dependencies
|
||||
|
||||
**C++ Compiler**<br>
|
||||
Any of the C++ compilers from [Getting Set Up To Work On The Firefox Codebase] will be sufficient for this.
|
||||
|
||||
**emsdk**<br>
|
||||
Follow the [Download and Install] instructions for setting up the emscripten sdk.
|
||||
|
||||
#### Modifying the EMCXXFLAGS
|
||||
|
||||
At the time of writing, the latest commit on the fastText repo ([3697152e0fd772d9185697fdbd4a1d340ca5571d])
|
||||
is not compatible by default with the latest version of [emscripten (3.1.35)].
|
||||
|
||||
A few changes need to be made to the Makefile in order to generate the fastText [WASM] for use in Firefox.
|
||||
|
||||
**1) Disable DYNAMIC_EXECUTION**<br>
|
||||
In the `Makefile` for the fastText repo, there is a variable called **`EMCXXFLAGS`**.<br>
|
||||
We need to add the following flag to this variable:
|
||||
|
||||
```
|
||||
-s "DYNAMIC_EXECUTION=0"
|
||||
```
|
||||
|
||||
If this flag is not set to **`0`**, then emscripten will [generate functions] that use the [eval()] function.
|
||||
[eval()] is not allowed in the context in which fastText runs in Firefox, for security reasons.
|
||||
|
||||
**2) Rename EXTRA_EXPORTED_RUNTIME_METHODS**<br>
|
||||
In [emscripten (2.0.18)], **`EXTRA_EXPORTED_RUNTIME_METHODS`** was deprecated in favor of **`EXPORTED_RUNTIME_METHODS`**.
|
||||
The fastText Makefile still has the old flag, so we need to update the name.
|
||||
|
||||
**3) Use the -r Flag When Appropriate**<br>
|
||||
In [emscripten (2.0.3)] the following change was made:
|
||||
|
||||
> "The default output format is now executable JavaScript. Previously we would default to output objecting files unless, for example, the output name ended in **`.js`**. This is contrary to behavior of clang and gcc. Now emscripten will always produce and executable unless the **`-c`**, **`-r`** or **`-shared`** flags are given. This is true even when the name of the output file ends in **`.o`**. e.g, **`emcc foo.c -o foo.o`** will produce a JavaScript file called **`foo.o`**. This might surprise some users (although it matches the behavior of existing toolchains) so we now produce a warning in this case."
|
||||
|
||||
The Makefile needs to be modified to use the **`-r`** flag when appropriate. These changes are modeled after comments on this [GitHub Issue].
|
||||
|
||||
**Cumulative Changes**<br>
|
||||
Here is a diff of the full changes needed for the Makefile at the time of writing:
|
||||
|
||||
```diff
|
||||
diff --git a/Makefile b/Makefile
|
||||
index e246f79..396ae0b 100644
|
||||
--- a/Makefile
|
||||
+++ b/Makefile
|
||||
@@ -73,7 +73,9 @@ clean:
|
||||
|
||||
EMCXX = em++
|
||||
-EMCXXFLAGS = --bind --std=c++11 -s WASM=1 -s ALLOW_MEMORY_GROWTH=1 -s "EXTRA_EXPORTED_RUNTIME_METHODS=['addOnPostRun', 'FS']" -s "DISABLE_EXCEPTION_CATCHING=0" -s "EXCEPTION_DEBUG=1" -s "FORCE_FILESYSTEM=1" -s "MODULARIZE=1" -s "EXPORT_ES6=1" -s 'EXPORT_NAME="FastTextModule"' -Isrc/
|
||||
+EMCXXFLAGS_BASE = --bind --std=c++11 -s WASM=1 -s ALLOW_MEMORY_GROWTH=1 -s "EXPORTED_RUNTIME_METHODS=['addOnPostRun', 'FS']" -s "DISABLE_EXCEPTION_CATCHING=0" -s "EXCEPTION_DEBUG=0" -s "DYNAMIC_EXECUTION=0" -s "FORCE_FILESYSTEM=1" -s "MODULARIZE=1" -s "EXPORT_ES6=1" -s 'EXPORT_NAME="FastTextModule"' -Isrc/
|
||||
+EMCXXFLAGS = $(EMCXXFLAGS_BASE) -r
|
||||
+EMCXXFLAGS_JS = $(EMCXXFLAGS_BASE)
|
||||
EMOBJS = args.bc autotune.bc matrix.bc dictionary.bc loss.bc productquantizer.bc densematrix.bc quantmatrix.bc vector.bc model.bc utils.bc meter.bc fasttext.bc main.bc
|
||||
|
||||
|
||||
@@ -120,6 +122,6 @@ fasttext.bc: src/fasttext.cc src/*.h
|
||||
$(EMCXX) $(EMCXXFLAGS) src/fasttext.cc -o fasttext.bc
|
||||
|
||||
webassembly/fasttext_wasm.js: $(EMOBJS) webassembly/fasttext_wasm.cc Makefile
|
||||
- $(EMCXX) $(EMCXXFLAGS) $(EMOBJS) -o webassembly/fasttext_wasm.js
|
||||
+ $(EMCXX) $(EMCXXFLAGS_JS) $(EMOBJS) -o webassembly/fasttext_wasm.js
|
||||
```
|
||||
|
||||
After modifying the Makefile as described in the previous section, running **`make wasm`** in the fastText repo should complete without warnings or errors and generate the following files in the **`webassembly`** directory:
|
||||
|
||||
```
|
||||
webassembly
|
||||
├── fasttext.js
|
||||
├── fasttext_wasm.js
|
||||
└── fasttext_wasm.wasm
|
||||
```
|
||||
|
||||
#### Modifying fasttext_wasm.js
|
||||
|
||||
There are a few changes we need to make to the **`fasttext_wasm.js`** file to make it compatible with use in Firefox.
|
||||
|
||||
**1) Define a function, not a module**<br>
|
||||
The generated code exports a module, but this needs to be modified into a function for use in [importScripts()] in a worker.
|
||||
|
||||
At the top of the file we need to make the following changes:
|
||||
|
||||
```diff
|
||||
diff --git a/toolkit/components/translations/fasttext/fasttext_wasm.js b/toolkit/components/translations/fasttext/fasttext_wasm.js
|
||||
index 64c6184a85851..4802343da2a03 100644
|
||||
--- a/toolkit/components/translations/fasttext/fasttext_wasm.js
|
||||
+++ b/toolkit/components/translations/fasttext/fasttext_wasm.js
|
||||
@@ -1,9 +1,6 @@
|
||||
|
||||
-var FastTextModule = (() => {
|
||||
- var _scriptDir = import.meta.url;
|
||||
-
|
||||
- return (
|
||||
-async function(FastTextModule = {}) {
|
||||
+async function loadFastTextModule(FastTextModule = {}) {
|
||||
+ const _scriptDir = null;
|
||||
|
||||
// include: shell.js
|
||||
// The Module object: Our interface to the outside world. We import
|
||||
```
|
||||
|
||||
Here we are defining a function rather than a variable, and we are setting **`_scriptDir`** to null
|
||||
because **`import.meta.url`** is only available for use within modules.
|
||||
|
||||
Next we need to modify the bottom of the file to match these changes:
|
||||
|
||||
```diff
|
||||
diff --git a/toolkit/components/translations/fasttext/fasttext_wasm.js b/toolkit/components/translations/fasttext/fasttext_wasm.js
|
||||
index 64c6184a85851..0a6fca3f524e4 100644
|
||||
--- a/toolkit/components/translations/fasttext/fasttext_wasm.js
|
||||
+++ b/toolkit/components/translations/fasttext/fasttext_wasm.js
|
||||
@@ -8287,7 +8287,3 @@ run();
|
||||
|
||||
return FastTextModule.ready
|
||||
}
|
||||
-
|
||||
-);
|
||||
-})();
|
||||
-export default FastTextModule;
|
||||
```
|
||||
|
||||
**2) Remove unneeded environment checks**<br>
|
||||
Next we need to remove unneeded checks for different environments:
|
||||
|
||||
```JavaScript
|
||||
if (ENVIRONMENT_IS_NODE) {
|
||||
// ...
|
||||
} else
|
||||
if (ENVIRONMENT_IS_SHELL) {
|
||||
// ...
|
||||
} else
|
||||
if (ENVIRONMENT_IS_WEB || ENVIRONMENT_IS_WORKER) {
|
||||
// ...
|
||||
} else
|
||||
{
|
||||
throw new Error('environment detection error');
|
||||
}
|
||||
```
|
||||
|
||||
Since this code will only be run inside of a worker, we want to delete the blocks that deal with **`ENVIRONMENT_IS_NODE`** and **`ENVIRONMENT_IS_SHELL`**. In fact, this code will fail to be imported by [importScripts()] if we don't do this.
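
As a rough illustration of the result, the check might reduce to something like the sketch below once the Node and shell branches are deleted. The flag values are hard-coded here only so the snippet runs on its own, and the `environment` variable is hypothetical; the generated file derives the flags itself and keeps its own web/worker setup code in the first branch.

```javascript
// Hard-coded for illustration; the generated file computes these at load time.
const ENVIRONMENT_IS_WEB = false;
const ENVIRONMENT_IS_WORKER = true;

// Hypothetical variable, present only to make the outcome observable.
let environment;

if (ENVIRONMENT_IS_WEB || ENVIRONMENT_IS_WORKER) {
  // ... the web/worker setup from the generated file stays here ...
  environment = "web-or-worker";
} else {
  throw new Error("environment detection error");
}
```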
|
||||
|
||||
**3) Remove the use of `import.meta.url`**<br>
|
||||
Finally, there is a use of **`import.meta.url`** that we need to remove.
|
||||
|
||||
```diff
|
||||
diff --git a/toolkit/components/translations/fasttext/fasttext_wasm.js b/toolkit/components/translations/fasttext/fasttext_wasm.js
|
||||
index 64c6184a85851..746cbae2ec952 100644
|
||||
--- a/toolkit/components/translations/fasttext/fasttext_wasm.js
|
||||
+++ b/toolkit/components/translations/fasttext/fasttext_wasm.js
|
||||
@@ -746,7 +746,7 @@ if (Module['locateFile']) {
|
||||
}
|
||||
} else {
|
||||
// Use bundler-friendly `new URL(..., import.meta.url)` pattern; works in browsers too.
|
||||
- wasmBinaryFile = new URL('fasttext_wasm.wasm', import.meta.url).href;
|
||||
+ wasmBinaryFile = null;
|
||||
}
|
||||
|
||||
function getBinary(file) {
|
||||
```
|
||||
|
||||
As mentioned before, **`import.meta.url`** is not allowed outside of modules and cannot be used with [importScripts()]
|
||||
in the worker code that we are creating.
|
||||
|
||||
It is okay to set this to null here, because we will be providing the **`wasmBinaryFile`** via [Remote Settings].
|
||||
|
||||
**4) Minifying the file**<br>
|
||||
The generated **`fasttext_wasm.js`** file is very large. To minimize the impact on the size of the code in the Firefox source tree, we want to minify the file using the [minify] tool.
|
||||
|
||||
```
|
||||
Size Name
|
||||
291k ├── fasttext_wasm.js (original)
|
||||
109k └── fasttext_wasm.js (minified)
|
||||
```
|
||||
|
||||
**5) Adding the license**<br>
|
||||
Finally, we should add a copy of the current fastText MIT license to the top of the minified **`fasttext_wasm.js`** file.
|
||||
You should be able to paste this from the generated **`fasttext.js`** file.
|
||||
|
||||
#### Modifying fasttext.js
|
||||
|
||||
```{note}
|
||||
It is likely that the source file in tree already has these changes and is already sufficient,
|
||||
even if **`fasttext_wasm.js`** has been recently updated. Try running it first as-is before replacing
|
||||
and re-modifying.
|
||||
```
|
||||
|
||||
Next we need to modify **`fasttext.js`** to utilize the changes that we made to **`fasttext_wasm.js`** and also to
|
||||
not be a module so that we can import it using [importScripts()].
|
||||
|
||||
These changes do the following:
|
||||
|
||||
1) Define a variable called **`fastTextModule`** for use in the worker scripts.
|
||||
2) Utilize the **`loadFastTextModule()`** function that we defined in **`fasttext_wasm.js`**
|
||||
3) Add a function **`loadModelBinary()`** that takes the wasm binary directly, which we will provide through [Remote Settings].
|
||||
4) Remove any module exports.
|
||||
|
||||
```diff
|
||||
diff --git a/toolkit/components/translations/fasttext/fasttext.js b/toolkit/components/translations/fasttext/fasttext.js
|
||||
index 86600b9ac9e28..2c49b3faaeedc 100644
|
||||
--- a/toolkit/components/translations/fasttext/fasttext.js
|
||||
+++ b/toolkit/components/translations/fasttext/fasttext.js
|
||||
@@ -6,20 +6,30 @@
|
||||
* LICENSE file in the root directory of this source tree.
|
||||
*/
|
||||
|
||||
-import fastTextModularized from './fasttext_wasm.js';
|
||||
-const fastTextModule = fastTextModularized();
|
||||
+let fastTextModule;
|
||||
+
|
||||
+const _initFastTextModule = async function (wasmModule) {
|
||||
+ try {
|
||||
+ fastTextModule = await loadFastTextModule(wasmModule);
|
||||
+ } catch(e) {
|
||||
+ console.error(e);
|
||||
+ }
|
||||
+ return true
|
||||
+}
|
||||
|
||||
let postRunFunc = null;
|
||||
const addOnPostRun = function(func) {
|
||||
postRunFunc = func;
|
||||
};
|
||||
|
||||
-fastTextModule.addOnPostRun(() => {
|
||||
- if (postRunFunc) {
|
||||
- postRunFunc();
|
||||
- }
|
||||
-});
|
||||
|
||||
+const loadFastText = (wasmModule) => {
|
||||
+ _initFastTextModule(wasmModule).then((res) => {
|
||||
+ if (postRunFunc) {
|
||||
+ postRunFunc();
|
||||
+ }
|
||||
+ })
|
||||
+}
|
||||
const thisModule = this;
|
||||
const trainFileInWasmFs = 'train.txt';
|
||||
const testFileInWasmFs = 'test.txt';
|
||||
@@ -41,7 +51,7 @@ const getFloat32ArrayFromHeap = (len) => {
|
||||
const heapToFloat32 = (r) => new Float32Array(r.buffer, r.ptr, r.size);
|
||||
|
||||
class FastText {
|
||||
- constructor() {
|
||||
+ constructor(fastTextModule) {
|
||||
this.f = new fastTextModule.FastText();
|
||||
}
|
||||
|
||||
@@ -77,6 +87,15 @@ class FastText {
|
||||
});
|
||||
}
|
||||
|
||||
+ loadModelBinary(buffer) {
|
||||
+ const fastTextNative = this.f;
|
||||
+ const byteArray = new Uint8Array(buffer);
|
||||
+ const FS = fastTextModule.FS;
|
||||
+ FS.writeFile(modelFileInWasmFs, byteArray);
|
||||
+ fastTextNative.loadModel(modelFileInWasmFs);
|
||||
+ return new FastTextModel(fastTextNative);
|
||||
+ }
|
||||
+
|
||||
_train(url, modelName, kwargs = {}, callback = null) {
|
||||
const fetchFunc = (thisModule && thisModule.fetch) || fetch;
|
||||
const fastTextNative = this.f;
|
||||
@@ -515,6 +534,3 @@ class FastTextModel {
|
||||
});
|
||||
}
|
||||
}
|
||||
-
|
||||
-
|
||||
-export {FastText, addOnPostRun};
|
||||
```
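
For orientation, the call pattern that these changes enable can be sketched as follows. This is a minimal, self-contained illustration rather than the in-tree worker code: the generated **`loadFastTextModule()`** is replaced by a stub named `stubLoadFastTextModule`, and the `initialized` and `ready` names are hypothetical. Only `addOnPostRun` and `loadFastText` mirror the diff above, and returning the promise from `loadFastText` is an addition made here so that callers can wait for completion.

```javascript
// Stand-in for the generated loadFastTextModule() from fasttext_wasm.js.
// Assumption: the real function resolves to the initialized Emscripten module.
async function stubLoadFastTextModule(wasmModule) {
  return { ...wasmModule, FS: {}, FastText: class {} };
}

let fastTextModule;
let postRunFunc = null;

// Register a callback to run once the module has finished loading.
const addOnPostRun = function (func) {
  postRunFunc = func;
};

// Mirrors loadFastText() from the diff: load the module, log any error,
// then fire the registered callback.
const loadFastText = (wasmModule) =>
  stubLoadFastTextModule(wasmModule)
    .then((module) => {
      fastTextModule = module;
    })
    .catch((e) => console.error(e))
    .then(() => {
      if (postRunFunc) {
        postRunFunc();
      }
    });

// Usage: register the callback first, then start loading.
// The empty object stands in for the Emscripten module overrides.
let initialized = false;
addOnPostRun(() => {
  initialized = true;
});
const ready = loadFastText({});
```

Registering the callback before calling `loadFastText()` matters in this sketch, because the callback is read only once, when loading finishes.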
|
||||
We have previously experimented with using the [fastText] language detector, but we opted to use [CLD2] due to complications with [fastText] [WASM] runtime performance. The benefit of the [CLD2] language detector is that it already exists in the Firefox source tree. In the future, we would still like to explore moving to a more modern language detector such as [CLD3], or perhaps something else.
|
||||
|
||||
---
|
||||
## Building Bergamot
|
||||
|
|
@@ -419,20 +140,21 @@ TODO
|
|||
|
||||
|
||||
<!-- Hyperlinks -->
|
||||
[3697152e0fd772d9185697fdbd4a1d340ca5571d]: https://github.com/facebookresearch/fastText/tree/3697152e0fd772d9185697fdbd4a1d340ca5571d
|
||||
[Bugzilla]: https://bugzilla.mozilla.org/enter_bug.cgi?product=Cloud%20Services&component=Server%3A%20Remote%20Settings
|
||||
[Child]: https://searchfox.org/mozilla-central/search?q=TranslationsChild
|
||||
[CLD2]: https://github.com/CLD2Owners/cld2
|
||||
[CLD3]: https://github.com/google/cld3
|
||||
[Download and Install]: https://emscripten.org/docs/getting_started/downloads.html#download-and-install
|
||||
[emscripten (2.0.3)]: https://github.com/emscripten-core/emscripten/blob/main/ChangeLog.md#203-09102020
|
||||
[emscripten (2.0.18)]: https://github.com/emscripten-core/emscripten/blob/main/ChangeLog.md#2018-04232021
|
||||
[emscripten (3.1.35)]: https://github.com/emscripten-core/emscripten/blob/main/ChangeLog.md#3135---040323
|
||||
[Environments]: https://remote-settings.readthedocs.io/en/latest/getting-started.html#environments
|
||||
[eval()]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/eval
|
||||
[fastText]: https://fasttext.cc/
|
||||
[Filter Expressions]: https://remote-settings.readthedocs.io/en/latest/target-filters.html#filter-expressions
|
||||
[Firefox Release Schedule]: https://wiki.mozilla.org/Release_Management/Calendar
|
||||
[generate functions]: https://emscripten.org/docs/api_reference/emscripten.h.html?highlight=dynamic_execution#functions
|
||||
[Getting Set Up To Work On The Firefox Codebase]: https://firefox-source-docs.mozilla.org/setup/index.html
|
||||
[GitHub Issue]: https://github.com/facebookresearch/fastText/pull/1227#issuecomment-1353830003
|
||||
[importScripts()]: https://developer.mozilla.org/en-US/docs/Web/API/WorkerGlobalScope/importScripts
|
||||
[JSWindowActors]: https://firefox-source-docs.mozilla.org/dom/ipc/jsactors.html#jswindowactor
|
||||
[minify]: https://github.com/tdewolff/minify
|
||||
|
|
@@ -440,7 +162,6 @@ TODO
|
|||
[Step 3]: https://remote-settings.readthedocs.io/en/latest/getting-started.html#create-a-new-official-type-of-remote-settings
|
||||
[remote-settings-devtools]: https://github.com/mozilla-extensions/remote-settings-devtools/releases
|
||||
[Remote Settings]: https://remote-settings.readthedocs.io/en/latest/
|
||||
[Requirements]: https://fasttext.cc/docs/en/webassembly-module.html#requirements
|
||||
[toolkit/components/translations]: https://searchfox.org/mozilla-central/search?q=toolkit%2Fcomponents%2Ftranslations
|
||||
[WASM]: https://webassembly.org/
|
||||
[Workers]: https://searchfox.org/mozilla-central/search?q=%2Ftranslations.*worker&path=&case=false&regexp=true
|
||||
|
|
|
|||