Bug 1861516 - Update Translations language-identification source docs r=gregtatum

Updates the Firefox source docs related to Translations
language identification to reflect that fastText is no
longer used by Translations, and that we use CLD2 only.

Differential Revision: https://phabricator.services.mozilla.com/D192660
Erik Nordin 2023-11-07 01:34:02 +00:00
parent c90f8889ed
commit e10b593f4f
2 changed files with 11 additions and 294 deletions

@@ -80,20 +80,16 @@ architecture to identify content as being written in a detected language.
### Technology
Firefox Translations utilizes a [WASM] version of the [fastText] library to identify in which
language content is written.
Firefox Translations utilizes a [CLD2] language detector to identify in which language content is written.
### Models
Unlike the language translations models in the [section](#language-translations) above, the [fastText]
model is a one-to-many model that is capable of detecting all of our supported languages
from the single model.
No models are currently used for language identification, since [CLD2] exists in the Firefox source tree.
---
## Remote Settings
Firefox Translations utilizes [Remote Settings] to download [WASM] binaries, [Language Translation](#language-translation)
models and [Language Identification](#language-identification) models to use locally on your system.
Remote Settings is not currently used for language identification, since [CLD2] exists in the Firefox source tree.
---
## Using Firefox Translations
@@ -139,7 +135,7 @@ It is, however, useful and fun, so it is documented here.
<!-- Hyperlinks -->
[Bergamot]: https://browser.mt/
[fastText]: https://fasttext.cc/
[CLD2]: https://github.com/CLD2Owners/cld2
[Firefox Nightly]: https://www.mozilla.org/en-US/firefox/channel/desktop/
[Marian]: https://aclanthology.org/P18-4020/
[Remote Settings]: https://remote-settings.readthedocs.io/en/latest/

@@ -13,11 +13,11 @@ to provide helpful information regarding contributing to Firefox Translations.
- [Versioning](#versioning)
- [Non-Breaking Changes](#non-breaking-changes)
- [Breaking Changes](#breaking-changes)
- [Building fastText](#building-fasttext)
- [Downloading The Models](#downloading-the-models)
- [Building the WASM Binary](#building-the-wasm-binary)
- [Dependencies](#dependencies)
- [Modifying the EMCXXFLAGS](#modifying-the-emcxxflags)
- [Language Identification](#language-identification)
- [Building Bergamot](#building-bergamot)
---
@@ -127,290 +127,11 @@ Tying breaking changes to releases in this way frees up Firefox Translations to
switching one third-party library for another in the compiled source code, while allowing older versions of Firefox to continue utilizing the old library and allowing newer versions of Firefox to utilize the new library.
---
## Building fastText
## Language Identification
### Downloading the Models
Translations currently uses the [CLD2] language detector.
The fastText model that we use can be downloaded directly from the fastText website:<br>
> [https://fasttext.cc/docs/en/language-identification.html](https://fasttext.cc/docs/en/language-identification.html)
Firefox Translations uses the compressed, **`lid.176.ftz`** model.
### Building the WASM Binary
To build the fastText [WASM] binary, we can follow the steps in the [Requirements] section of the fastText website.
#### Dependencies
**C++ Compiler**<br>
Any of the C++ compilers from [Getting Set Up To Work On The Firefox Codebase] will be sufficient for this.
**emsdk**<br>
Follow the [Download and Install] instructions for setting up the emscripten sdk.
#### Modifying the EMCXXFLAGS
At the time of writing, the latest commit on the fastText repo ([3697152e0fd772d9185697fdbd4a1d340ca5571d])
is not compatible by default with the latest version of [emscripten (3.1.35)].
A few changes need to be made to the Makefile in order to generate the fastText [WASM] for use in Firefox.
**1) Disable DYNAMIC_EXECUTION**<br>
In the `Makefile` for the fastText repo, there is a variable called **`EMCXXFLAGS`**.<br>
We need to add the following flag to this variable:
```
-s "DYNAMIC_EXECUTION=0"
```
If this flag is not set to **`0`**, then emscripten will [generate functions] that use the [eval()] function.
[eval()] is not allowed in the context that fastText runs in Firefox for security reasons.
**2) Rename EXTRA_EXPORTED_RUNTIME_METHODS**<br>
In [emscripten (2.0.18)], **`EXTRA_EXPORTED_RUNTIME_METHODS`** was deprecated in favor of **`EXPORTED_RUNTIME_METHODS`**.
The fastText Makefile still has the old flag, so we need to update the name.
**3) Use the -r Flag When Appropriate**<br>
In [emscripten (2.0.3)] the following change was made:
> "The default output format is now executable JavaScript. Previously we would default to output objecting files unless, for example, the output name ended in **`.js`**. This is contrary to behavior of clang and gcc. Now emscripten will always produce and executable unless the **`-c`**, **`-r`** or **`-shared`** flags are given. This is true even when the name of the output file ends in **`.o`**. e.g, **`emcc foo.c -o foo.o`** will produce a JavaScript file called **`foo.o`**. This might surprise some users (although it matches the behavior of existing toolchains) so we now produce a warning in this case."
The Makefile needs to be modified to use the **`-r`** flag when appropriate. These changes are modeled after comments on this [GitHub Issue].
**Cumulative Changes**<br>
Here is a diff of the full changes needed for the Makefile at the time of writing:
```diff
diff --git a/Makefile b/Makefile
index e246f79..396ae0b 100644
--- a/Makefile
+++ b/Makefile
@@ -73,7 +73,9 @@ clean:
EMCXX = em++
-EMCXXFLAGS = --bind --std=c++11 -s WASM=1 -s ALLOW_MEMORY_GROWTH=1 -s "EXTRA_EXPORTED_RUNTIME_METHODS=['addOnPostRun', 'FS']" -s "DISABLE_EXCEPTION_CATCHING=0" -s "EXCEPTION_DEBUG=1" -s "FORCE_FILESYSTEM=1" -s "MODULARIZE=1" -s "EXPORT_ES6=1" -s 'EXPORT_NAME="FastTextModule"' -Isrc/
+EMCXXFLAGS_BASE = --bind --std=c++11 -s WASM=1 -s ALLOW_MEMORY_GROWTH=1 -s "EXPORTED_RUNTIME_METHODS=['addOnPostRun', 'FS']" -s "DISABLE_EXCEPTION_CATCHING=0" -s "EXCEPTION_DEBUG=0" -s "DYNAMIC_EXECUTION=0" -s "FORCE_FILESYSTEM=1" -s "MODULARIZE=1" -s "EXPORT_ES6=1" -s 'EXPORT_NAME="FastTextModule"' -Isrc/
+EMCXXFLAGS = $(EMCXXFLAGS_BASE) -r
+EMCXXFLAGS_JS = $(EMCXXFLAGS_BASE)
EMOBJS = args.bc autotune.bc matrix.bc dictionary.bc loss.bc productquantizer.bc densematrix.bc quantmatrix.bc vector.bc model.bc utils.bc meter.bc fasttext.bc main.bc
@@ -120,6 +122,6 @@ fasttext.bc: src/fasttext.cc src/*.h
$(EMCXX) $(EMCXXFLAGS) src/fasttext.cc -o fasttext.bc
webassembly/fasttext_wasm.js: $(EMOBJS) webassembly/fasttext_wasm.cc Makefile
- $(EMCXX) $(EMCXXFLAGS) $(EMOBJS) -o webassembly/fasttext_wasm.js
+ $(EMCXX) $(EMCXXFLAGS_JS) $(EMOBJS) -o webassembly/fasttext_wasm.js
```
After modifying the Makefile in the previous section, running **`make wasm`** in the fastText repo should run without warnings or errors and the following files will be generated in the **`webassembly`** directory:
```
webassembly
├── fasttext.js
├── fasttext_wasm.js
└── fasttext_wasm.wasm
```
#### Modifying fasttext_wasm.js
There are a few changes we need to make to the **`fasttext_wasm.js`** file to make it compatible with use in Firefox.
**1) Define a function, not a module**<br>
The generated code exports a module, but this needs to be modified into a function for use in [importScripts()] in a worker.
At the top of the file we need to make the following changes:
```diff
diff --git a/toolkit/components/translations/fasttext/fasttext_wasm.js b/toolkit/components/translations/fasttext/fasttext_wasm.js
index 64c6184a85851..4802343da2a03 100644
--- a/toolkit/components/translations/fasttext/fasttext_wasm.js
+++ b/toolkit/components/translations/fasttext/fasttext_wasm.js
@@ -1,9 +1,6 @@
-var FastTextModule = (() => {
- var _scriptDir = import.meta.url;
-
- return (
-async function(FastTextModule = {}) {
+async function loadFastTextModule(FastTextModule = {}) {
+ const _scriptDir = null;
// include: shell.js
// The Module object: Our interface to the outside world. We import
```
Here we are defining a function rather than a variable, and we are setting **`_scriptDir`** to null
because **`import.meta.url`** is only available for use within modules.
Next we need to modify the bottom of the file to match these changes:
```diff
diff --git a/toolkit/components/translations/fasttext/fasttext_wasm.js b/toolkit/components/translations/fasttext/fasttext_wasm.js
index 64c6184a85851..0a6fca3f524e4 100644
--- a/toolkit/components/translations/fasttext/fasttext_wasm.js
+++ b/toolkit/components/translations/fasttext/fasttext_wasm.js
@@ -8287,7 +8287,3 @@ run();
return FastTextModule.ready
}
-
-);
-})();
-export default FastTextModule;
```
**2) Remove unneeded environment checks**<br>
Next we need to remove unneeded checks for different environments:
```JavaScript
if (ENVIRONMENT_IS_NODE) {
// ...
} else
if (ENVIRONMENT_IS_SHELL) {
// ...
} else
if (ENVIRONMENT_IS_WEB || ENVIRONMENT_IS_WORKER) {
// ...
} else
{
throw new Error('environment detection error');
}
```
Since this code will only be run inside of a worker, we want to delete the blocks that deal with **`ENVIRONMENT_IS_NODE`** and **`ENVIRONMENT_IS_SHELL`**. In fact, this code will fail to be imported by [importScripts()] if we don't do this.
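As a rough illustration of what remains once the Node and shell branches are deleted, the check reduces to something like the sketch below. The function name `detectWorkerEnvironment` and its `scope` parameter are hypothetical, introduced only so the logic can be run in isolation; the generated file performs this check inline against its own `ENVIRONMENT_IS_*` flags.

```javascript
// Hypothetical helper: mirrors the web/worker branch that is kept.
// `scope` stands in for the global object (`self` inside a worker).
function detectWorkerEnvironment(scope) {
  const isWeb = typeof scope.window === "object";
  const isWorker = typeof scope.importScripts === "function";
  if (isWeb || isWorker) {
    return isWorker ? "worker" : "web";
  }
  // Same fallback as the generated code: anything else is an error.
  throw new Error("environment detection error");
}
```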
**3) Remove the use of `import.meta.url`**<br>
Finally, there is a use of **`import.meta.url`** that we need to remove.
```diff
diff --git a/toolkit/components/translations/fasttext/fasttext_wasm.js b/toolkit/components/translations/fasttext/fasttext_wasm.js
index 64c6184a85851..746cbae2ec952 100644
--- a/toolkit/components/translations/fasttext/fasttext_wasm.js
+++ b/toolkit/components/translations/fasttext/fasttext_wasm.js
@@ -746,7 +746,7 @@ if (Module['locateFile']) {
}
} else {
// Use bundler-friendly `new URL(..., import.meta.url)` pattern; works in browsers too.
- wasmBinaryFile = new URL('fasttext_wasm.wasm', import.meta.url).href;
+ wasmBinaryFile = null;
}
function getBinary(file) {
```
As mentioned before, **`import.meta.url`** is not allowed outside of modules and cannot be used with [importScripts()]
in the worker code that we are creating.
It is okay to set this to null here, because we will be providing the **`wasmBinaryFile`** via [Remote Settings].
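One way to supply the binary when `wasmBinaryFile` is null is Emscripten's `wasmBinary` module option, which hands the runtime the bytes directly instead of a URL. The sketch below assumes the bytes have already been retrieved as an `ArrayBuffer`. The helper name `buildModuleOptions` is hypothetical, and whether the worker code passes the binary in exactly this way is an assumption.

```javascript
// Hypothetical helper: builds the options object handed to loadFastTextModule().
// `wasmBinary` is the Emscripten module option for passing the WASM bytes directly.
function buildModuleOptions(wasmArrayBuffer) {
  if (!(wasmArrayBuffer instanceof ArrayBuffer)) {
    throw new TypeError("expected the WASM binary as an ArrayBuffer");
  }
  return { wasmBinary: wasmArrayBuffer };
}
```

The result could then be passed along as `loadFastTextModule(buildModuleOptions(buffer))`.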
**4) Minifying the file**<br>
The generated **`fasttext_wasm.js`** file is very large. To minimize the impact on the size of the code in the Firefox source tree, we want to minify the file using the [minify] tool.
```
Size Name
291k ├── fasttext_wasm.js (original)
109k └── fasttext_wasm.js (minified)
```
**5) Adding the license**<br>
Finally, we should add a copy of the current fastText MIT license to the top of the minified **`fasttext_wasm.js`** file.
You should be able to paste this from the generated **`fasttext.js`** file.
#### Modifying fasttext.js
```{note}
It is likely that the source file in tree already has these changes and is already sufficient,
even if **`fasttext_wasm.js`** has been recently updated. Try running it first as-is before replacing
and re-modifying.
```
Next we need to modify **`fasttext.js`** to utilize the changes that we made to **`fasttext_wasm.js`** and also to
not be a module so that we can import it using [importScripts()].
These changes do the following:
1) Define a variable called **`fastTextModule`** for use in the worker scripts.
2) Utilize the **`loadFastTextModule()`** function that we defined in **`fasttext_wasm.js`**
3) Add a function **`loadModelBinary()`** that takes the wasm binary directly, which we will provide through [Remote Settings].
4) Remove any module exports.
```diff
diff --git a/toolkit/components/translations/fasttext/fasttext.js b/toolkit/components/translations/fasttext/fasttext.js
index 86600b9ac9e28..2c49b3faaeedc 100644
--- a/toolkit/components/translations/fasttext/fasttext.js
+++ b/toolkit/components/translations/fasttext/fasttext.js
@@ -6,20 +6,30 @@
* LICENSE file in the root directory of this source tree.
*/
-import fastTextModularized from './fasttext_wasm.js';
-const fastTextModule = fastTextModularized();
+let fastTextModule;
+
+const _initFastTextModule = async function (wasmModule) {
+ try {
+ fastTextModule = await loadFastTextModule(wasmModule);
+ } catch(e) {
+ console.error(e);
+ }
+ return true
+}
let postRunFunc = null;
const addOnPostRun = function(func) {
postRunFunc = func;
};
-fastTextModule.addOnPostRun(() => {
- if (postRunFunc) {
- postRunFunc();
- }
-});
+const loadFastText = (wasmModule) => {
+ _initFastTextModule(wasmModule).then((res) => {
+ if (postRunFunc) {
+ postRunFunc();
+ }
+ })
+}
const thisModule = this;
const trainFileInWasmFs = 'train.txt';
const testFileInWasmFs = 'test.txt';
@@ -41,7 +51,7 @@ const getFloat32ArrayFromHeap = (len) => {
const heapToFloat32 = (r) => new Float32Array(r.buffer, r.ptr, r.size);
class FastText {
- constructor() {
+ constructor(fastTextModule) {
this.f = new fastTextModule.FastText();
}
@@ -77,6 +87,15 @@ class FastText {
});
}
+ loadModelBinary(buffer) {
+ const fastTextNative = this.f;
+ const byteArray = new Uint8Array(buffer);
+ const FS = fastTextModule.FS;
+ FS.writeFile(modelFileInWasmFs, byteArray);
+ fastTextNative.loadModel(modelFileInWasmFs);
+ return new FastTextModel(fastTextNative);
+ }
+
_train(url, modelName, kwargs = {}, callback = null) {
const fetchFunc = (thisModule && thisModule.fetch) || fetch;
const fastTextNative = this.f;
@@ -515,6 +534,3 @@ class FastTextModel {
});
}
}
-
-
-export {FastText, addOnPostRun};
```
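To show how a worker script might drive the modified file, here is a minimal sketch of the callback flow. `loadFastTextModule()` is replaced by a stub so the snippet runs on its own, and unlike the version in the diff, `loadFastText()` here returns its promise so the caller can wait on it. Both are assumptions made for illustration.

```javascript
// Stub standing in for loadFastTextModule() from fasttext_wasm.js (illustration only).
async function loadFastTextModule(moduleOptions = {}) {
  return { options: moduleOptions };
}

let fastTextModule;
let postRunFunc = null;

// Register a callback to run once the module has finished loading.
const addOnPostRun = (func) => {
  postRunFunc = func;
};

// Load the module, then fire the registered callback.
// Returning the promise is a deviation from the diff, made so callers can await it.
const loadFastText = (moduleOptions) =>
  loadFastTextModule(moduleOptions).then((loadedModule) => {
    fastTextModule = loadedModule;
    if (postRunFunc) {
      postRunFunc();
    }
    return loadedModule;
  });
```

A caller would register its callback with `addOnPostRun()` before loading completes, since the callback is only read at that point.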
We have previously experimented with using the [fastText] language detector, but we opted to use [CLD2] due to complications with [fastText] [WASM] runtime performance. The benefit of the [CLD2] language detector is that it already exists in the Firefox source tree. In the future, we would still like to explore moving to a more modern language detector such as [CLD3], or perhaps something else.
---
## Building Bergamot
@@ -419,20 +140,21 @@ TODO
<!-- Hyperlinks -->
[3697152e0fd772d9185697fdbd4a1d340ca5571d]: https://github.com/facebookresearch/fastText/tree/3697152e0fd772d9185697fdbd4a1d340ca5571d
[Bugzilla]: https://bugzilla.mozilla.org/enter_bug.cgi?product=Cloud%20Services&component=Server%3A%20Remote%20Settings
[Child]: https://searchfox.org/mozilla-central/search?q=TranslationsChild
[CLD2]: https://github.com/CLD2Owners/cld2
[CLD3]: https://github.com/google/cld3
[Download and Install]: https://emscripten.org/docs/getting_started/downloads.html#download-and-install
[emscripten (2.0.3)]: https://github.com/emscripten-core/emscripten/blob/main/ChangeLog.md#203-09102020
[emscripten (2.0.18)]: https://github.com/emscripten-core/emscripten/blob/main/ChangeLog.md#2018-04232021
[emscripten (3.1.35)]: https://github.com/emscripten-core/emscripten/blob/main/ChangeLog.md#3135---040323
[Environments]: https://remote-settings.readthedocs.io/en/latest/getting-started.html#environments
[eval()]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/eval
[fastText]: https://fasttext.cc/
[Filter Expressions]: https://remote-settings.readthedocs.io/en/latest/target-filters.html#filter-expressions
[Firefox Release Schedule]: https://wiki.mozilla.org/Release_Management/Calendar
[generate functions]: https://emscripten.org/docs/api_reference/emscripten.h.html?highlight=dynamic_execution#functions
[Getting Set Up To Work On The Firefox Codebase]: https://firefox-source-docs.mozilla.org/setup/index.html
[GitHub Issue]: https://github.com/facebookresearch/fastText/pull/1227#issuecomment-1353830003
[importScripts()]: https://developer.mozilla.org/en-US/docs/Web/API/WorkerGlobalScope/importScripts
[JSWindowActors]: https://firefox-source-docs.mozilla.org/dom/ipc/jsactors.html#jswindowactor
[minify]: https://github.com/tdewolff/minify
@@ -440,7 +162,6 @@ TODO
[Step 3]: https://remote-settings.readthedocs.io/en/latest/getting-started.html#create-a-new-official-type-of-remote-settings
[remote-settings-devtools]: https://github.com/mozilla-extensions/remote-settings-devtools/releases
[Remote Settings]: https://remote-settings.readthedocs.io/en/latest/
[Requirements]: https://fasttext.cc/docs/en/webassembly-module.html#requirements
[toolkit/components/translations]: https://searchfox.org/mozilla-central/search?q=toolkit%2Fcomponents%2Ftranslations
[WASM]: https://webassembly.org/
[Workers]: https://searchfox.org/mozilla-central/search?q=%2Ftranslations.*worker&path=&case=false&regexp=true