Bug 1806793 - Update documentation on how to manage en-US dictionary, r=sylvestre

Depends on D165303

Differential Revision: https://phabricator.services.mozilla.com/D165304
Francesco Lodolo (:flod) 2022-12-28 15:06:24 +00:00
parent df55534a51
commit d974a31757
4 changed files with 150 additions and 78 deletions


@ -1,28 +1,158 @@
======================================
Managing the built-in en-US dictionary
======================================
The en-US build of Firefox includes a built-in Hunspell dictionary based on the
`SCOWL`_ dataset. This document describes the process to add new words to the
dictionary, or update it to the current upstream version.
For more information about Hunspell or the affix file format, you can check
`the Ubuntu man page for hunspell
<https://manpages.ubuntu.com/manpages/bionic/man5/hunspell.5.html>`_.
Requesting to add new words to the en-US dictionary
===================================================
If you'd like to add new words to the dictionary, you can `file a bug`_. Try to
provide information on the terms you want to add, in particular references to
external sources that confirm the usage of the term.
Adding new words to the en-US dictionary
========================================
Occasionally bugs are filed pointing out situations where perfectly
legitimate words are missing from the English spell check dictionary in
Firefox. This article describes the process for adding a word to the
dictionary.
This section describes the process for adding a word to the dictionary:
The process is pretty straight-forward:
#. Get a clone of mozilla-central (see :ref:`Firefox Contributors' Quick Reference`), if
you don't already have one, and make sure you can build it
#. Get a clone of mozilla-central (see :ref:`Firefox Contributors' Quick
Reference`), if you don't already have one, and make sure you can build it
successfully.
#. Get into the dictionary sources directory using this command:
``cd extensions/spellcheck/locales/en-US/hunspell/dictionary-sources``
#. There's a special script used for editing dictionaries. The script
#. There's a special script used for editing dictionaries. The script
only works if you have the environment variable ``EDITOR`` set to the
executable of an editor program; if you don't have it set, you can use
``EDITOR=vim sh edit-dictionary`` to edit using vim (or you can
substitute some other editor), or you can just type
executable of an editor program; if you don't have it set, you can use
``EDITOR=vim sh edit-dictionary`` to edit using ``vim`` (or you can
substitute it with another editor), or you can just type
``sh edit-dictionary`` if you have an ``EDITOR`` already specified.
#. Add and remove words in the dictionary file, then quit the editor.
#. Use ``sh merge-dictionaries`` to process the dictionary changes you've
made.
#. Move the revised dictionary file into position: ``mv en-US.dic ..``
#. Build Firefox and test your updated dictionary. Once you're
#. Build Firefox and test your updated dictionary. Once you're
satisfied, use the process described in :ref:`write_a_patch` to create a
patch.
Note that the update script will modify 2 files, and both need to be committed:
* ``en-US.dic``: the dictionary that actually ships in the build. It uses
ISO-8859-1 encoding.
* ``utf8/en-US.dic``: a version of the same dictionary with UTF-8 encoding. This
is used to work around issues with Phabricator, and it allows the diff to
display the actual changes.
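
Because both files must be committed together, it can be useful to confirm
that they still hold the same text. The following is a minimal sketch, not part
of the Mozilla scripts: the helper name ``dictionaries_in_sync`` is
hypothetical, and it only assumes the two encodings described above.

```python
from pathlib import Path


def dictionaries_in_sync(latin1_path, utf8_path):
    """Return True when both files decode to exactly the same text.

    Hypothetical helper: it assumes the first file is ISO-8859-1 encoded
    and the second one is UTF-8 encoded, as described in this document.
    """
    # Read raw bytes so that no newline translation happens.
    latin1_text = Path(latin1_path).read_bytes().decode("iso-8859-1")
    utf8_text = Path(utf8_path).read_bytes().decode("utf-8")
    return latin1_text == utf8_text
```

For example, ``dictionaries_in_sync("en-US.dic", "utf8/en-US.dic")`` could be
run from the parent folder of ``dictionary-sources`` before creating a patch.
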
Upgrading dictionary to a new upstream version of SCOWL
=======================================================
The English dictionary available in mozilla-central is based on the
`SCOWL`_ dictionary. Some scripts distributed with the SCOWL package are
used to generate the files for the en-US dictionary.
The working directory for this process is
``extensions/spellcheck/locales/en-US/hunspell/dictionary-sources``.
#. Download the latest version of the dictionary from the `SCOWL`_ homepage or
`SourceForce`_ as a tarball (tar.gz) and unpack it in the working directory.
Rename the resulting folder from ``scowl-YYYY.MM.DD`` to ``scowl``.
#. Run the script ``sh make-new-dict`` to generate a new dictionary and make
sure it runs without any errors. For more details on this script, see the
`make-new-dict`_ section.
#. Do a sanity check on the resulting dictionary file ``en_US-mozilla.dic``. For
example, make sure that the size is about the same as the original dictionary
(or slightly larger).
#. If everything looks correct, use ``sh install-new-dict`` to copy the
generated file into the right position and use the process described in
:ref:`write_a_patch` to create a patch.
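
The sanity check in the third step can be partly automated. The sketch below
is an illustration only and is not shipped with the scripts. It relies on the
Hunspell convention that the first line of a ``.dic`` file holds the
(approximate) number of entries. The function names and the 5% tolerance are
assumptions chosen for this example.

```python
from pathlib import Path


def dic_summary(path):
    """Return (size_in_bytes, declared_entry_count) for a .dic file."""
    data = Path(path).read_bytes()
    # The first line of a Hunspell .dic file is the approximate word count.
    first_line = data.split(b"\n", 1)[0].strip()
    return len(data), int(first_line)


def looks_sane(old_path, new_path, tolerance=0.05):
    """Return True unless the new dictionary shrank by more than `tolerance`.

    The tolerance value is an arbitrary assumption, not a documented limit.
    """
    old_size, old_count = dic_summary(old_path)
    new_size, new_count = dic_summary(new_path)
    return (
        new_size >= old_size * (1 - tolerance)
        and new_count >= old_count * (1 - tolerance)
    )
```

A ``False`` result does not prove that the new dictionary is broken. It only
signals that the file should be inspected manually before installing it.
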
Info about the file structure
=============================
mozilla-exclusions.txt
----------------------
``mozilla-exclusions.txt`` is used to explicitly exclude some words from
suggestions. The ``make-new-dict`` script will add them to the dictionary file
with the ``/!`` flag.
Terms should be added to this file in exactly the same format used in the
``.dic`` file, including affix rules if available.
mozilla-specific.txt
--------------------
This file contains Mozilla-specific words that should not be submitted
upstream. For example, ``Firefox`` should go in this file (see `bug 237921`_).
Note that the file ``5-mozilla-specific.txt`` is generated by expanding
``mozilla-specific.txt`` and should not be edited directly.
utf8 folder
-----------
``dictionary-sources/utf8`` is used to store a UTF-8 encoded copy of the
dictionary files. This is used to work around limitations in Phabricator, which
treats ISO-8859-1 files as binary and won't display a diff when updating them.
Info about the included scripts
===============================
make-new-dict
-------------
The dictionary upgrade script ``make-new-dict`` works by expanding (i.e.
“unmunching”) the affix compression dictionaries to create wordlists, and
then uses those wordlists to generate a new dictionary.
The upgrade script expects the current upstream version to be kept in the
directory ``orig``.
The script will create a few files in ``dictionary-sources/support_files`` in
the following order:
* ``0-special.txt`` contains numbers and ordinals expanded from SCOWL
``en.dic.supp``.
* ``1-base.txt`` contains words expanded from ``en_US-custom.dic`` in the
**previous** version of SCOWL (from the ``orig`` folder).
* ``2-mozilla.txt`` contains words expanded from the current Mozilla dictionary.
* ``3-upstream.txt`` contains words expanded from ``en_US-custom.dic`` in the
**new** version of SCOWL (from the ``scowl/speller`` folder).
* ``2-mozilla-removed.txt`` contains words that are only available in the SCOWL
dictionary, i.e. removed by Mozilla.
* ``2-mozilla-added.txt`` contains words that are only available in the current
Mozilla dictionary, i.e. added by Mozilla.
* ``4-patched.txt`` contains words from the new SCOWL dictionary
(``3-upstream.txt``), with the words listed in ``2-mozilla-removed.txt``
removed and the words listed in ``2-mozilla-added.txt`` added.
* ``5-mozilla-specific.txt`` is expanded from ``mozilla-specific.txt`` using the
current affix rules from the Mozilla dictionary.
* ``5-mozilla-removed.txt`` and ``5-mozilla-added.txt`` contain words that are
respectively removed and added by Mozilla compared to the **new** SCOWL
version. These files could be used to submit upstream changes, but words
included in ``5-mozilla-specific.txt`` should be removed from this list.
The new dictionary is available as ``en_US-mozilla.dic`` and should be copied
over using the ``install-new-dict`` script.
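
The relationship between the ``1-``, ``2-``, ``3-`` and ``4-`` files listed
above boils down to set arithmetic on the expanded wordlists. The following
sketch only illustrates that logic and is not how the script itself is
implemented. The function name ``patch_wordlist`` and its parameters are
hypothetical.

```python
def patch_wordlist(base, mozilla, upstream):
    """Illustrate how the patched wordlist is derived.

    base     -- words from the previous SCOWL version (1-base.txt)
    mozilla  -- words from the current Mozilla dictionary (2-mozilla.txt)
    upstream -- words from the new SCOWL version (3-upstream.txt)
    """
    base, mozilla, upstream = set(base), set(mozilla), set(upstream)
    # Only in the previous SCOWL dictionary: removed by Mozilla.
    removed = base - mozilla
    # Only in the current Mozilla dictionary: added by Mozilla.
    added = mozilla - base
    # New upstream list with Mozilla's removals and additions reapplied.
    patched = (upstream - removed) | added
    return removed, added, patched
```

With ``base = {"a", "b", "c"}``, ``mozilla = {"a", "c", "d"}`` and
``upstream = {"a", "b", "c", "e"}``, the word ``b`` stays removed, ``d`` stays
added, and the new upstream word ``e`` is picked up.
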
install-new-dict
----------------
The script:
* Creates a copy of ``orig`` as ``support_files/orig-bk`` and copies the new
upstream version to ``orig``.
* Copies the existing Mozilla dictionary to ``support_files/mozilla-bk``.
* Converts the dictionary (.dic) generated by ``make-new-dict`` from UTF-8 to
ISO-8859-1 and moves it to the parent folder.
* Sets the affix file (.aff) to use ``ISO8859-1`` as ``SET`` instead of the
original ``UTF-8``, and removes ``ICONV`` patterns (input conversion tables).
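
The last two steps can be pictured with the following sketch. It is an
illustration under stated assumptions, not the code used by
``install-new-dict``: the function names are hypothetical, and the affix
handling assumes that the ``SET`` and ``ICONV`` directives start at the
beginning of a line.

```python
def convert_dic(utf8_bytes):
    """Re-encode dictionary content from UTF-8 to ISO-8859-1.

    Raises UnicodeEncodeError if a character has no ISO-8859-1 equivalent.
    """
    return utf8_bytes.decode("utf-8").encode("iso-8859-1")


def convert_aff(utf8_aff_text):
    """Switch the SET directive to ISO8859-1 and drop ICONV lines."""
    kept = []
    for line in utf8_aff_text.splitlines():
        if line.startswith("ICONV"):
            # Input conversion tables are removed entirely.
            continue
        if line.startswith("SET "):
            line = "SET ISO8859-1"
        kept.append(line)
    return "\n".join(kept) + "\n"
```

Note that the strict re-encoding fails loudly on characters outside
ISO-8859-1, such as a typographic apostrophe, which is preferable to silently
writing a corrupted dictionary.
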
.. _SCOWL: http://wordlist.aspell.net
.. _file a bug: https://bugzilla.mozilla.org/enter_bug.cgi?product=Core&component=Spelling%20checker
.. _SourceForce: https://sourceforge.net/projects/wordlist/files/SCOWL/
.. _bug 237921: https://bugzilla.mozilla.org/show_bug.cgi?id=237921


@ -1,6 +1,2 @@
README_mozilla
To edit the dictionary use "dictionary-sources/edit-dictionary".
For additional info see dictionary-sources/README.
See Firefox Source Docs for information about these scripts, and how to add new words.
https://firefox-source-docs.mozilla.org/extensions/spellcheck/index.html


@ -1,56 +0,0 @@
ADDING OR REMOVING ENTRIES IN THE DICTIONARY:
To edit the dictionary use "edit-dictionary" and than copy the
resulting "en-US.dic" file info place.
UPGRADING TO A NEW UPSTREAM VERSION:
In order to upgrade to the latest dictionary some scripts found in
SCOWL (the source of the en_US Hunspell dictionary) are used. The
en_US dictionary is also generated from the SCOWL source.
1) Unpack the tarball (tar.gz) version of the latest version of SCOWL
in the current directory and rename the directory from
"scowl-YYYY.MM.DD" to "scowl". You can find the latest version at
http://wordlist.aspell.net/ or
http://sourceforge.net/projects/wordlist/files/SCOWL/
2) Run the script "./make-new-dict" to generate a new dictionary and
make sure it runs without any errors.
3) Do a quick sanity check on the resulting dictionary
"en_US-mozilla.dic". For example make sure the size is about the same
(it should likely be slightly large) as the original dictionary.
4) Once everything is okay copy the new dictionary in place using
"./install-new-dict" and commit the changes.
NOTES ON UPGRADE PROCESS:
The dictionary upgrade scripts work by expanding (i.e. unmunching) the
affix compression dictionaries to create simple wordlists and use
those to generate a new dictionary.
The upgrade script expects the original upstream version to be kept in
the directory "orig".
The install script renames "orig" to "orig-bk" and copies the new
upstream version to "orig". The install script also copies the
original Mozilla dictionary to the "mozilla-bk".
SUBMITTING MOZILLA SPECIFIC CHANGES UPSTREAM:
The upgrade script creates two files that can be reviewed and possible
submitted upstream. The file "5-mozilla-removed" lists words that were
removed in the Mozilla dictionary and the file "5-mozilla-added"
contains the list of words that were added. When submitting new words
upstream Mozilla specific words that are found in "5-mozilla-specific"
(expanded from mozilla-specific.txt) should likely be removed from the list.
ABOUT mozilla-specific.txt:
This file contains Mozilla-specific words that should not be submitted
upstream. For example, "Firefox" goes here. (See bug 237921).
Note that the file 5-mozilla-specific is generated by expanding
mozilla-specific.txt and should not be edited directly.


@ -0,0 +1,2 @@
See Firefox Source Docs for information about these scripts, and how to add new words.
https://firefox-source-docs.mozilla.org/extensions/spellcheck/index.html