======================================
Managing the built-in en-US dictionary
======================================

The en-US build of Firefox includes a built-in Hunspell dictionary based on the
`SCOWL`_ dataset. This document describes the process to add new words to the
dictionary, or update it to the current upstream version.

For more information about Hunspell or the affix file format, you can check
`the Ubuntu man page for hunspell
<https://manpages.ubuntu.com/manpages/bionic/man5/hunspell.5.html>`_.

Requesting to add new words to the en-US dictionary
===================================================

If you’d like to add new words to the dictionary, you can `file a bug`_. Try to
provide information on the terms you want to add, in particular references to
external sources that confirm the usage of the term.

Adding new words to the en-US dictionary
========================================

This section describes the process for adding a word to the dictionary:

#. Get a clone of mozilla-central (see :ref:`Firefox Contributors' Quick
   Reference`), if you don’t already have one, and make sure you can build it
   successfully.
#. Get into the dictionary sources directory using this command:
   ``cd extensions/spellcheck/locales/en-US/hunspell/dictionary-sources``
#. Run the dedicated script for editing the dictionary. The script only works
   if the environment variable ``EDITOR`` is set to the executable of an editor
   program. If ``EDITOR`` is already set, type ``sh edit-dictionary.sh``;
   otherwise, use ``EDITOR=vim sh edit-dictionary.sh`` to edit using ``vim``
   (or substitute it with another editor).
#. Add and remove words in the dictionary file, then quit the editor.
#. Build Firefox and test your updated dictionary. Once you’re
   satisfied, use the process described in :ref:`write_a_patch` to create a
   patch.

Note that the script modifies two files, and both need to be committed:

* ``en-US.dic``: the dictionary that actually ships in the build. It uses
  ISO-8859-1 encoding.
* ``utf8/en-US.dic``: a version of the same dictionary with UTF-8 encoding. This
  is used to work around issues with Phabricator, and it makes it possible to
  display the actual changes in the diff.

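The relationship between the two files can be sketched as follows. This is a
minimal illustration, not the actual ``edit-dictionary.sh``: the temporary
directory, the sample words, and the simulated edit are assumptions made for
the example, and it relies on the standard ``iconv`` tool being available.

```shell
#!/bin/sh
# Hypothetical sketch of the convert, edit, convert-back cycle.
# This is NOT the real edit-dictionary.sh; paths and words are made up.
set -eu

workdir=$(mktemp -d)
mkdir -p "$workdir/utf8"

# Stand-in for the shipped en-US.dic, stored as ISO-8859-1.
# \351 is the single byte for "é" in ISO-8859-1. (A real .dic file also
# starts with a word-count line, omitted here for brevity.)
printf 'caf\351\nFirefox\n' > "$workdir/en-US.dic"

# 1. Convert to UTF-8 so that editors and review tools can display it.
iconv -f ISO-8859-1 -t UTF-8 "$workdir/en-US.dic" > "$workdir/utf8/en-US.dic"

# 2. Simulate an edit on the UTF-8 copy by appending "naïve"
#    (\303\257 is the two-byte UTF-8 sequence for "ï").
printf 'na\303\257ve\n' >> "$workdir/utf8/en-US.dic"

# 3. Convert back to ISO-8859-1 for the file that ships in the build.
iconv -f UTF-8 -t ISO-8859-1 "$workdir/utf8/en-US.dic" > "$workdir/en-US.dic"

# Both files now hold the same words in different encodings.
wc -c "$workdir/en-US.dic" "$workdir/utf8/en-US.dic"
```

The UTF-8 copy is two bytes larger than the ISO-8859-1 file here, because each
accented character takes two bytes in UTF-8 and one byte in ISO-8859-1.
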
Upgrading dictionary to a new upstream version of SCOWL
=======================================================

The English dictionary available in mozilla-central is based on the
`SCOWL`_ dictionary. Some scripts distributed with the SCOWL package are
used to generate the files for the en-US dictionary.

The working directory for this process is
``extensions/spellcheck/locales/en-US/hunspell/dictionary-sources``.

#. Download the latest version of the dictionary from the `SCOWL`_ homepage or
   `SourceForge <SourceForce_>`_ as a tarball (tar.gz) and unpack it in the
   working directory. Rename the resulting folder from ``scowl-YYYY.MM.DD`` to
   ``scowl``.
#. Run ``sh make-new-dict.sh`` to generate a new dictionary and make
   sure it runs without any errors. For more details on this script, see the
   `make-new-dict.sh`_ section.
#. Do a sanity check on the resulting dictionary file ``en_US-mozilla.dic``. For
   example, make sure that the size is about the same as the original dictionary
   (or slightly larger).
#. If everything looks correct, use ``sh install-new-dict.sh`` to copy the
   generated file to the right location and use the process described in
   :ref:`write_a_patch` to create a patch.

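The size sanity check can be partly automated. The helper below is only a
sketch: the function name ``check_dict_size`` is hypothetical, the 10% growth
limit is an arbitrary assumption rather than a project rule, and it compares
line counts rather than file sizes so that the result does not depend on the
encoding of either file.

```shell
#!/bin/sh
# Hypothetical helper: compare entry counts of the old and new dictionary.
# The 10% growth limit is an arbitrary assumption, not a project rule.
set -eu

check_dict_size() {
    # $(( ... )) normalizes the output of wc, which may contain spaces.
    old_count=$(($(wc -l < "$1")))
    new_count=$(($(wc -l < "$2")))
    # Accept a new file that is the same size or up to 10% larger.
    max_count=$((old_count + old_count / 10))
    if [ "$new_count" -lt "$old_count" ]; then
        echo "WARNING: new dictionary is smaller ($new_count < $old_count lines)"
        return 1
    elif [ "$new_count" -gt "$max_count" ]; then
        echo "WARNING: new dictionary grew by more than 10% ($new_count lines)"
        return 1
    fi
    echo "OK: $old_count -> $new_count lines"
}

# Demo with two stand-in files of 20 and 21 lines.
demo=$(mktemp -d)
awk 'BEGIN { for (i = 1; i <= 20; i++) print "word" i }' > "$demo/old.dic"
awk 'BEGIN { for (i = 1; i <= 21; i++) print "word" i }' > "$demo/new.dic"
check_dict_size "$demo/old.dic" "$demo/new.dic"
```

In the working directory, the same function could presumably be called as
``check_dict_size ../en-US.dic en_US-mozilla.dic``, assuming the shipped
dictionary sits in the parent folder.
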
Info about the file structure
=============================

mozilla-exclusions.txt
----------------------

``mozilla-exclusions.txt`` is used to explicitly exclude some words from
suggestions. The ``make-new-dict.sh`` script will add them to the dictionary file
with the ``/!`` flag.

Terms should be added to this file with exactly the same format used in the .dic
file, including affix rules if available.

mozilla-specific.txt
--------------------

This file contains Mozilla-specific words that should not be submitted
upstream. For example, ``Firefox`` should go in this file (see `bug 237921`_).

Note that the file ``5-mozilla-specific.txt`` is generated by expanding
``mozilla-specific.txt`` and should not be edited directly.

utf8 folder
-----------

``dictionary-sources/utf8`` stores a UTF-8 encoded copy of the dictionary
files. This works around a limitation in Phabricator, which treats ISO-8859-1
files as binary and won’t display a diff when they are updated.

Info about the included scripts
===============================

make-new-dict.sh
----------------

The dictionary upgrade script ``make-new-dict.sh`` works by expanding (i.e.
“unmunching”) the affix-compressed dictionaries to create wordlists, and then
uses those wordlists to generate a new dictionary.

The upgrade script expects the upstream version currently in use (i.e. the one
the existing Mozilla dictionary is based on) to be kept in the directory
``orig``.

The script will create a few files in ``dictionary-sources/support_files`` in the
following order:

* ``0-special.txt`` contains numbers and ordinals expanded from SCOWL
  ``en.dic.supp``.
* ``1-base.txt`` contains words expanded from ``en_US-custom.dic`` in the
  **previous** version of SCOWL (from the ``orig`` folder).
* ``2-mozilla.txt`` contains words expanded from the current Mozilla dictionary.
* ``3-upstream.txt`` contains words expanded from ``en_US-custom.dic`` in the
  **new** version of SCOWL (from the ``scowl/speller`` folder).
* ``2-mozilla-removed.txt`` contains words that are only available in the SCOWL
  dictionary, i.e. removed by Mozilla.
* ``2-mozilla-added.txt`` contains words that are only available in the current
  Mozilla dictionary, i.e. added by Mozilla.
* ``4-patched.txt`` contains words from the new SCOWL dictionary
  (``3-upstream.txt``), with the words in ``2-mozilla-removed.txt`` removed and
  the words in ``2-mozilla-added.txt`` added.
* ``5-mozilla-specific.txt`` is expanded from ``mozilla-specific.txt`` using the
  current affix rules from the Mozilla dictionary.
* ``5-mozilla-removed.txt`` and ``5-mozilla-added.txt`` contain words that are
  respectively removed and added by Mozilla compared to the **new** SCOWL
  version. These files could be used to submit upstream changes, but words
  included in ``5-mozilla-specific.txt`` should be removed from this list.

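The relationship between these wordlists amounts to a three-way merge, which
can be sketched with standard tools. The example below is an illustration under
assumptions, not an excerpt from ``make-new-dict.sh``: the sample words are
invented, the lists are assumed to hold one word per line, the
``2-mozilla-*`` lists are assumed to be differences between ``1-base.txt`` and
``2-mozilla.txt``, and ``comm`` and ``sort`` are used here only because they
express set differences compactly.

```shell
#!/bin/sh
# Hypothetical sketch of how the numbered wordlists relate to each other.
# NOT the real make-new-dict.sh; the words below are invented.
set -eu
LC_ALL=C; export LC_ALL   # comm needs both inputs sorted in the same order

lists=$(mktemp -d)
printf '%s\n' apple banana cherry        | sort > "$lists/1-base.txt"
printf '%s\n' apple cherry webpage       | sort > "$lists/2-mozilla.txt"
printf '%s\n' apple banana cherry damson | sort > "$lists/3-upstream.txt"

# Words only in the previous SCOWL list: removed by Mozilla.
comm -23 "$lists/1-base.txt" "$lists/2-mozilla.txt" \
    > "$lists/2-mozilla-removed.txt"
# Words only in the Mozilla list: added by Mozilla.
comm -13 "$lists/1-base.txt" "$lists/2-mozilla.txt" \
    > "$lists/2-mozilla-added.txt"

# New upstream list, minus the removed words, plus the added words.
comm -23 "$lists/3-upstream.txt" "$lists/2-mozilla-removed.txt" \
    | sort -u - "$lists/2-mozilla-added.txt" > "$lists/4-patched.txt"

cat "$lists/4-patched.txt"
```

With these sample lists, ``banana`` stays out of the patched list even though
upstream still ships it, ``webpage`` is carried over from the Mozilla list, and
``damson`` is picked up as a new upstream word.
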
The new dictionary is available as ``en_US-mozilla.dic`` and should be copied
over using the ``install-new-dict.sh`` script.

install-new-dict.sh
-------------------

The script:

* Creates a copy of ``orig`` as ``support_files/orig-bk`` and copies the new
  upstream version to ``orig``.
* Copies the existing Mozilla dictionary to ``support_files/mozilla-bk``.
* Converts the dictionary (.dic) generated by ``make-new-dict.sh`` from UTF-8 to
  ISO-8859-1 and moves it to the parent folder.
* Sets the affix file (.aff) to use ``ISO8859-1`` as ``SET`` instead of the
  original ``UTF-8``, and removes ``ICONV`` patterns (input conversion tables).

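The affix file adjustments in the last step can be sketched with ``sed``. This
is an assumption-laden illustration, not the actual ``install-new-dict.sh``:
the file names ``en_US-mozilla.aff`` and ``en-US.aff`` and the sample affix
content are invented for the example.

```shell
#!/bin/sh
# Hypothetical sketch of the affix-file adjustments described above.
# NOT the real install-new-dict.sh; the sample .aff content is invented.
set -eu

out=$(mktemp -d)

# Minimal UTF-8 affix file. \342\200\231 is the UTF-8 encoding of a
# typographic apostrophe, which ISO-8859-1 cannot represent.
printf 'SET UTF-8\nTRY esia\nICONV 1\nICONV \342\200\231 %s\n' "'" \
    > "$out/en_US-mozilla.aff"

# Switch the declared encoding and drop every ICONV line.
sed -e 's/^SET UTF-8$/SET ISO8859-1/' -e '/^ICONV /d' \
    "$out/en_US-mozilla.aff" > "$out/en-US.aff"

cat "$out/en-US.aff"
```

The resulting file declares ``SET ISO8859-1`` and no longer contains any
``ICONV`` lines, so it is plain ASCII in this small example.
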
.. _SCOWL: http://wordlist.aspell.net
.. _file a bug: https://bugzilla.mozilla.org/enter_bug.cgi?product=Core&component=Spelling%20checker
.. _SourceForce: https://sourceforge.net/projects/wordlist/files/SCOWL/
.. _bug 237921: https://bugzilla.mozilla.org/show_bug.cgi?id=237921