Bug 1806793 - Update scripts for en-US dictionary, r=sylvestre

Added .sh extension to all scripts.

edit-dictionary.sh:
* Convert to utf-8 before editing, and back to iso-8859-1 before saving
* Place a copy of the utf-8 dictionary inside the utf8 folder, and store the iso-8859-1 in place
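
The encoding round trip described for edit-dictionary.sh can be sketched as follows. This is a minimal illustration under stated assumptions, not the script itself: the sample word and the file names (`sample-latin1.dic`, `sample-utf8.dic`, `roundtrip.dic`) are made up, and only the two `iconv` invocations mirror what the commit does.

```shell
#!/usr/bin/env sh
# Sketch only: sample data and file names are hypothetical.
set -e
tmpdir=$(mktemp -d)

# "caf" followed by e-acute in ISO-8859-1: a single byte 0xE9 (octal 351).
printf 'caf\351\n' > "$tmpdir/sample-latin1.dic"

# Convert to UTF-8 before editing; e-acute becomes two bytes (0xC3 0xA9).
iconv -f iso-8859-1 -t utf-8 "$tmpdir/sample-latin1.dic" > "$tmpdir/sample-utf8.dic"

# (an editor would change sample-utf8.dic here)

# Convert back to ISO-8859-1 before saving.
iconv -f utf-8 -t iso-8859-1 "$tmpdir/sample-utf8.dic" > "$tmpdir/roundtrip.dic"

latin1_bytes=$(wc -c < "$tmpdir/sample-latin1.dic" | tr -d '[:blank:]')
utf8_bytes=$(wc -c < "$tmpdir/sample-utf8.dic" | tr -d '[:blank:]')
if cmp -s "$tmpdir/sample-latin1.dic" "$tmpdir/roundtrip.dic"; then
  roundtrip=unchanged
else
  roundtrip=changed
fi
echo "latin1=$latin1_bytes utf8=$utf8_bytes roundtrip=$roundtrip"
rm -rf "$tmpdir"
```

An unedited file should survive the round trip byte for byte. A character added in the editor that has no ISO-8859-1 equivalent would make the second `iconv` call fail.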

make-new-dict.sh:
* Use .txt extension for support wordlists, and place them in a subfolder
* Exclude words in mozilla-exclusions.txt from the generated dictionary
* Save 5-mozilla-*.txt files to utf-8
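
The exclusion rule used by make-new-dict.sh (append `!` when an entry already carries affix flags, otherwise append `/!`) can be illustrated with a small sketch. `flag_entry` and the sample entries are hypothetical names introduced here; the script in this commit applies the rule with `sed` on the generated `.dic` file instead.

```shell
#!/usr/bin/env sh
# Sketch only: flag_entry is a hypothetical helper, not part of make-new-dict.sh.
set -e

# Print a dictionary entry with the "!" flag appended.
flag_entry() {
  case "$1" in
    */*) printf '%s!\n' "$1" ;;   # entry already has affix flags: append !
    *)   printf '%s/!\n' "$1" ;;  # entry has no flags yet: append /!
  esac
}

with_affix=$(flag_entry "example/SM")
without_affix=$(flag_entry "sample")
echo "$with_affix"
echo "$without_affix"
```

A `case` pattern is used in this sketch because it works in any POSIX `sh`.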

Depends on D165304

Differential Revision: https://phabricator.services.mozilla.com/D165305
Francesco Lodolo (:flod) 2022-12-28 15:06:25 +00:00
parent d974a31757
commit 7afa41bb66
10 changed files with 201 additions and 152 deletions

@ -267,3 +267,8 @@ toolkit/components/certviewer/content/package-lock.json
 ^tools/esmify/jscodeshift.cmd
 ^tools/esmify/jscodeshift.ps1
 ^tools/esmify/package-lock.json
+# Ignore support files for en-US dictionary updates
+^extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/scowl
+^extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/support_files/
+^extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/*en_US-mozilla*

@ -30,9 +30,9 @@ This section describes the process for adding a word to the dictionary:
 #. There's a special script used for editing dictionaries. The script
    only works if you have the environment variable ``EDITOR`` set to the
    executable of an editor program; if you don't have it set, you can use
-   ``EDITOR=vim sh edit-dictionary`` to edit using ``vim`` (or you can
+   ``EDITOR=vim sh edit-dictionary.sh`` to edit using ``vim`` (or you can
    substitute it with another editor), or you can just type
-   ``sh edit-dictionary`` if you have an ``EDITOR`` already specified.
+   ``sh edit-dictionary.sh`` if you have an ``EDITOR`` already specified.
 #. Add and remove words in the dictionary file, then quit the editor.
 #. Build Firefox and test your updated dictionary. Once you're
    satisfied, use the process described in :ref:`write_a_patch` to create a
@ -59,13 +59,13 @@ The working directory for this process is
 #. Download the latest version of the dictionary from `SCOWL`_ homepage or
    `SourceForce`_ as a tarball (tar.gz) and unpack it in the working directory.
    Rename the resulting folder from ``scowl-YYYY.MM.DD`` to ``scowl``.
-#. Run the script ``sh make-new-dict`` to generate a new dictionary and make
+#. Run the script ``sh make-new-dict.sh`` to generate a new dictionary and make
    sure it runs without any errors. For more details on this script, see the
-   `make-new-dict`_ section.
+   `make-new-dict.sh`_ section.
 #. Do a sanity check on the resulting dictionary file ``en_US-mozilla.dic``. For
    example, make sure that the size is about the same as the original dictionary
    (or slightly larger).
-#. If everything looks correct, use ``sh install-new-dict`` to copy the
+#. If everything looks correct, use ``sh install-new-dict.sh`` to copy the
    generated file in the right position and use the process described in
    :ref:`write_a_patch` to create a patch.
@ -76,7 +76,7 @@ mozilla-exclusions.txt
 ----------------------

 ``mozilla-exclusions.txt`` is used to explicitly exclude some words from
-suggestions. The ``make-new-dict`` script will add them to the dictionary file
+suggestions. The ``make-new-dict.sh`` script will add them to the dictionary file
 with the ``/!`` flag.

 Terms should be added to this file with exactly the same format used in the .dic
@ -101,10 +101,10 @@ treats ISO-8859-1 files as binary and won't display a diff when updating them.
 Info about the included scripts
 ===============================

-make-new-dict
--------------
+make-new-dict.sh
+----------------

-The dictionary upgrade scripts ``make-new-dict`` works by expanding (i.e.
+The dictionary upgrade scripts ``make-new-dict.sh`` works by expanding (i.e.
 “unmunching”) the affix compression dictionaries to create wordlists and
 use those to generate a new dictionary.
@ -136,17 +136,17 @@ following order:
 included in ``5-mozilla-specific.txt`` should be removed from this list.

 The new dictionary is available as ``en_US-mozilla.dic`` and should be copied
-over using the ``install-new-dict`` script.
+over using the ``install-new-dict.sh`` script.

-install-new-dict
-----------------
+install-new-dict.sh
+-------------------

 The script:

 * Creates a copy of ``orig`` as ``support_files/orig-bk`` and copies the new
   upstream version to ``orig``.
 * Copies the existing Mozilla dictionary in ``support_files/mozilla-bk``.
-* Converts the dictionary (.dic) generated by ``make-new-dict`` from UTF-8 to
+* Converts the dictionary (.dic) generated by ``make-new-dict.sh`` from UTF-8 to
   ISO-8859-1 and moves it to the parent folder.
 * Sets the affix file (.aff) to use ``ISO8859-1`` as ``SET`` instead of the
   original ``UTF-8``, removes ``ICONV`` patterns (input conversion tables).
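
The affix-file step in the last bullet can be tried on a throwaway file. The two `sed` expressions below are the ones used by install-new-dict.sh in this commit; the sample `.aff` content and file names are invented for illustration. The sketch writes to a new file rather than editing in place, because the argument syntax of `sed -i` differs between GNU and BSD sed.

```shell
#!/usr/bin/env sh
# Sketch only: the sample affix content and file names are hypothetical.
set -e
tmpdir=$(mktemp -d)

cat > "$tmpdir/sample-utf8.aff" <<'EOF'
SET UTF-8
ICONV 1
ICONV x y
NOSUGGEST !
EOF

# Drop ICONV lines and switch the declared encoding.
sed -e '/^ICONV/d' -e 's/^SET UTF-8$/SET ISO8859-1/' \
  "$tmpdir/sample-utf8.aff" > "$tmpdir/sample-latin1.aff"

first_line=$(head -n 1 "$tmpdir/sample-latin1.aff")
# grep -c exits non-zero when the count is 0, hence the "|| true".
iconv_lines=$(grep -c '^ICONV' "$tmpdir/sample-latin1.aff" || true)
total_lines=$(wc -l < "$tmpdir/sample-latin1.aff" | tr -d '[:blank:]')
echo "first_line=$first_line iconv_lines=$iconv_lines total_lines=$total_lines"
rm -rf "$tmpdir"
```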

@ -1,31 +0,0 @@
#!/bin/sh
# This Source Code Form is subject to the terms of the Mozilla Public
# License, v. 2.0. If a copy of the MPL was not distributed with this
# file, You can obtain one at http://mozilla.org/MPL/2.0/.
#
# edit-dictionary
set -e
if [ -z "$EDITOR" ]; then
  echo 'Need to set the $EDITOR environment variable to your favorite editor!'
  exit 1
fi
# Strip the first line that contains the count
tail -n +2 ../en-US.dic > en-US.stripped
# Open the patched hunspell editor and let the user edit it
echo "Now the dictionary is going to be opened for you to edit. When you're done, just quit the editor"
echo -n "Press Enter to begin."
read foo
$EDITOR en-US.stripped
# Add back the line count
wc -l < en-US.stripped | tr -d '[:blank:]' > en-US.dic
LC_ALL=C sort en-US.stripped >> en-US.dic
# Clean up
rm -f en-US.stripped

@ -0,0 +1,37 @@
#! /usr/bin/env sh
# This Source Code Form is subject to the terms of the Mozilla Public
# License, v. 2.0. If a copy of the MPL was not distributed with this
# file, You can obtain one at http://mozilla.org/MPL/2.0/.
set -e
if [ -z "$EDITOR" ]; then
  echo 'Need to set the $EDITOR environment variable to your favorite editor.'
  exit 1
fi
# Copy the current en-US dictionary and strip the first line that contains
# the count.
tail -n +2 ../en-US.dic > en-US.stripped
# Convert the file to UTF-8
iconv -f iso-8859-1 -t utf-8 en-US.stripped > en-US.utf8
rm en-US.stripped
# Open the hunspell dictionary and let the user edit it
echo "Now the dictionary is going to be opened for you to edit. Quit the editor to finish editing."
echo "Press Enter to begin."
read foo
$EDITOR en-US.utf8
# Add back the line count and sort the lines
wc -l < en-US.utf8 | tr -d '[:blank:]' > en-US.dic
LC_ALL=C sort en-US.utf8 >> en-US.dic
rm -f en-US.utf8
# Convert back to ISO-8859-1
iconv -f utf-8 -t iso-8859-1 en-US.dic > ../en-US.dic
# Keep a copy of the UTF-8 file in /utf8
mv en-US.dic utf8/en-US-utf8.dic
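
The "add back the line count and sort" step relies on the hunspell `.dic` layout, where the first line holds the number of entries. A small sketch of that step on invented data follows; the word list and file names are hypothetical, while the `wc`, `tr` and `sort` calls are the same ones used in the script above.

```shell
#!/usr/bin/env sh
# Sketch only: the word list and file names are hypothetical.
set -e
tmpdir=$(mktemp -d)

# Entries as they might look after editing: unsorted, no count line.
printf 'zebra\napple/S\nmango\n' > "$tmpdir/words.txt"

# First line of the .dic: the entry count, with wc's padding removed.
wc -l < "$tmpdir/words.txt" | tr -d '[:blank:]' > "$tmpdir/sample.dic"

# Then the entries, sorted by byte value so the order is locale-independent.
LC_ALL=C sort "$tmpdir/words.txt" >> "$tmpdir/sample.dic"

header=$(head -n 1 "$tmpdir/sample.dic")
second=$(sed -n '2p' "$tmpdir/sample.dic")
echo "header=$header second=$second"
rm -rf "$tmpdir"
```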

@ -1,39 +0,0 @@
#!/bin/sh
#
# This script copies the new dictionary created by make-new-dict in
# place.
#
set -e
WKDIR="`pwd`"
export SCOWL="$WKDIR/scowl/"
SPELLER="$SCOWL/speller"
set -x
if [ -e orig-bk ]; then echo "$0: directory 'orig-bk' exists." 1>&2 ; exit 0; fi
mv orig orig-bk
mkdir orig
cp $SPELLER/en_US-custom.dic $SPELLER/en_US-custom.aff $SPELLER/README_en_US-custom.txt orig
mkdir mozilla-bk
mv ../en-US.dic ../en-US.aff ../README_en_US.txt mozilla-bk
# Convert the affix file to ISO8859-1
sed -i=bak -e '/^ICONV/d' -e 's/^SET UTF-8$/SET ISO8859-1/' en_US-mozilla.aff
# Convert the dictionary to ISO8859-1
mv en_US-mozilla.dic en_US-mozilla-utf8.dic
iconv -f utf-8 -t iso-8859-1 < en_US-mozilla-utf8.dic > en_US-mozilla.dic
cp en_US-mozilla.aff ../en-US.aff
cp en_US-mozilla.dic ../en-US.dic
cp README_en_US-mozilla.txt ../README_en_US.txt
set +x
echo "New dictionary copied into place. Please commit the changes."

@ -0,0 +1,41 @@
#! /usr/bin/env sh
# This Source Code Form is subject to the terms of the Mozilla Public
# License, v. 2.0. If a copy of the MPL was not distributed with this
# file, You can obtain one at http://mozilla.org/MPL/2.0/.
# This script copies the new dictionary created by make-new-dict in
# place.
set -e
WKDIR="`pwd`"
export SCOWL="$WKDIR/scowl/"
SUPPORT_DIR="$WKDIR/support_files/"
SPELLER="$SCOWL/speller"
if [ -e "$SUPPORT_DIR/orig-bk" ]; then
  echo "$0: directory '$SUPPORT_DIR/orig-bk' exists." 1>&2
  exit 0
fi
mv orig "$SUPPORT_DIR/orig-bk"
mkdir orig
cp $SPELLER/en_US-custom.dic $SPELLER/en_US-custom.aff $SPELLER/README_en_US-custom.txt orig
mkdir "$SUPPORT_DIR/mozilla-bk"
mv ../en-US.dic ../en-US.aff ../README_en_US.txt "$SUPPORT_DIR/mozilla-bk"
# Convert the affix file to ISO-8859-1
cp en_US-mozilla.aff utf8/en-US-utf8.aff
sed -i "" -e '/^ICONV/d' -e 's/^SET UTF-8$/SET ISO8859-1/' en_US-mozilla.aff
# Convert the dictionary to ISO-8859-1
mv en_US-mozilla.dic utf8/en-US-utf8.dic
iconv -f utf-8 -t iso-8859-1 < utf8/en-US-utf8.dic > en_US-mozilla.dic
cp en_US-mozilla.aff ../en-US.aff
cp en_US-mozilla.dic ../en-US.dic
mv README_en_US-mozilla.txt ../README_en_US.txt
echo "New dictionary copied into place. Please commit the changes."

@ -1,69 +0,0 @@
#!/bin/sh
#
# This script creates a new dictionary by expanding the original,
# Mozilla's, and the upstream dictionary to remove affix flags and
# then doing the wordlist equivalent of diff3 to create a new
# dictionary.
#
# The files 2-mozilla-add and 2-mozilla-rem contain words added and
# removed, receptively in the Mozilla dictionary. The final
# dictionary will be in hunspell-en_US-mozilla.zip.
#
set -e
export LANG=C
export LC_ALL=C
export LC_CTYPE=C
export LC_COLLATE=C
WKDIR="`pwd`"
export SCOWL="$WKDIR/scowl/"
ORIG="$WKDIR/orig/"
SPELLER="$SCOWL/speller"
expand() {
  grep -v '^[0-9]\+$' | $SPELLER/munch-list expand $1 | sort -u
}
cd $SPELLER
MK_LIST="../mk-list -v1 --accents=both en_US 60"
cat <<EOF > params.txt
With Input Command: $MK_LIST
EOF
# note: output of make-hunspell-dict is utf-8
$MK_LIST | ./make-hunspell-dict -one en_US-custom params.txt > ./make-hunspell-dict.log
cd $WKDIR
# Note: Input and output of "expand" is always iso-8859-1.
# All expanded word list files are thus in iso-8859-1.
expand $SPELLER/en.aff < $SPELLER/en.dic.supp > 0-special # input: ASCII
# input in utf-8, expand expects iso-8859-1 so use iconv
iconv -f utf-8 -t iso-8859-1 $ORIG/en_US-custom.dic | expand $SPELLER/en_US-custom.aff > 1-base.txt
expand ../en-US.aff < ../en-US.dic > 2-mozilla.txt # input: iso-8850-1
# input in utf-8, expand expects iso-8859-1 so use iconv
iconv -f utf-8 -t iso-8859-1 $SPELLER/en_US-custom.dic | expand $SPELLER/en_US-custom.aff > 3-upstream.txt
comm -23 1-base.txt 2-mozilla.txt > 2-mozilla-rem
comm -13 1-base.txt 2-mozilla.txt > 2-mozilla-add
comm -23 3-upstream.txt 2-mozilla-rem | cat - 2-mozilla-add | sort -u > 4-patched.txt
# note: output of make-hunspell-dict is utf-8
cat 4-patched.txt | comm -23 - 0-special | $SPELLER/make-hunspell-dict -one en_US-mozilla /dev/null
# sanity check should yield identical results
#comm -23 1-base.txt 3-upstream.txt > 3-upstream-rem
#comm -13 1-base.txt 3-upstream.txt > 3-upstream-add
#comm -23 2-mozilla.txt 3-upstream-rem | cat - 3-upstream-add | sort -u > 4-patched-v2.txt
expand ../en-US.aff < mozilla-specific.txt > 5-mozilla-specific
comm -12 3-upstream.txt 2-mozilla-rem > 5-mozilla-removed
comm -13 3-upstream.txt 2-mozilla-add > 5-mozilla-added

@ -0,0 +1,102 @@
#! /usr/bin/env sh
# This Source Code Form is subject to the terms of the Mozilla Public
# License, v. 2.0. If a copy of the MPL was not distributed with this
# file, You can obtain one at http://mozilla.org/MPL/2.0/.
# This script creates a new dictionary by expanding the original,
# Mozilla's, and the upstream dictionary to remove affix flags and
# then doing the wordlist equivalent of diff3 to create a new
# dictionary.
#
# The files 2-mozilla-added.txt and 2-mozilla-removed.txt (stored in the
# support_files folder) contain words added and removed, respectively, in the
# Mozilla dictionary. The final dictionary will be in hunspell-en_US-mozilla.zip.
set -e
export LANG=C
export LC_ALL=C
export LC_CTYPE=C
export LC_COLLATE=C
WKDIR="`pwd`"
export SCOWL="$WKDIR/scowl/"
ORIG="$WKDIR/orig/"
SUPPORT_DIR="$WKDIR/support_files/"
SPELLER="$SCOWL/speller"
expand() {
  grep -v '^[0-9]\+$' | $SPELLER/munch-list expand $1 | sort -u
}
mkdir -p $SUPPORT_DIR
cd $SPELLER
MK_LIST="../mk-list -v1 --accents=both en_US 60"
cat <<EOF > params.txt
With Input Command: $MK_LIST
EOF
# Note: the output of make-hunspell-dict is UTF-8
$MK_LIST | ./make-hunspell-dict -one en_US-custom params.txt > ./make-hunspell-dict.log
cd $WKDIR
# Note: Input and output of "expand" is always ISO-8859-1.
# All expanded word list files are thus in ISO-8859-1.
expand $SPELLER/en.aff < $SPELLER/en.dic.supp > $SUPPORT_DIR/0-special.txt
# Input is UTF-8, expand expects ISO-8859-1 so use iconv
iconv -f utf-8 -t iso-8859-1 $ORIG/en_US-custom.dic | expand $ORIG/en_US-custom.aff > $SUPPORT_DIR/1-base.txt
# The existing Mozilla dictionary is already in ISO-8859-1
expand ../en-US.aff < ../en-US.dic > $SUPPORT_DIR/2-mozilla.txt
# Input is UTF-8, expand expects ISO-8859-1 so use iconv
iconv -f utf-8 -t iso-8859-1 $SPELLER/en_US-custom.dic | expand $SPELLER/en_US-custom.aff > $SUPPORT_DIR/3-upstream.txt
# Suppress common lines and lines only in the 2nd file, leaving words that are
# only available in the 1st file (SCOWL), i.e. were removed by Mozilla.
comm -23 $SUPPORT_DIR/1-base.txt $SUPPORT_DIR/2-mozilla.txt > $SUPPORT_DIR/2-mozilla-removed.txt
# Suppress common lines and lines only in the 1st file, leaving words that are
# only available in the 2nd file (current Mozilla dictionary), i.e. were added
# by Mozilla.
comm -13 $SUPPORT_DIR/1-base.txt $SUPPORT_DIR/2-mozilla.txt > $SUPPORT_DIR/2-mozilla-added.txt
# Suppress common lines and lines only in the 2nd file, leaving words that are
# only available in the 1st file (words from the new upstream SCOWL dictionary).
# The result is upstream, minus the words removed, plus the words added.
comm -23 $SUPPORT_DIR/3-upstream.txt $SUPPORT_DIR/2-mozilla-removed.txt | cat - $SUPPORT_DIR/2-mozilla-added.txt | sort -u > $SUPPORT_DIR/4-patched.txt
# Note: the output of make-hunspell-dict is UTF-8
cat $SUPPORT_DIR/4-patched.txt | comm -23 - $SUPPORT_DIR/0-special.txt | $SPELLER/make-hunspell-dict -one en_US-mozilla /dev/null
# Exclude specific words from suggestions
while IFS= read -r line
do
  # If the string already contains an affix, just add !, otherwise add /!
  if [[ "$line" == *"/"* ]]; then
    sed -i "" "s|^$line$|$line!|" en_US-mozilla.dic
  else
    sed -i "" "s|^$line$|$line/!|" en_US-mozilla.dic
  fi
done < "mozilla-exclusions.txt"
# Sanity check should yield identical results
#comm -23 $SUPPORT_DIR/1-base.txt $SUPPORT_DIR/3-upstream.txt > $SUPPORT_DIR/3-upstream-removed.txt
#comm -13 $SUPPORT_DIR/1-base.txt $SUPPORT_DIR/3-upstream.txt > $SUPPORT_DIR/3-upstream-added.txt
#comm -23 $SUPPORT_DIR/2-mozilla.txt $SUPPORT_DIR/3-upstream-removed.txt | cat - $SUPPORT_DIR/3-upstream-added.txt | sort -u > $SUPPORT_DIR/4-patched-v2.txt
expand ../en-US.aff < mozilla-specific.txt > 5-mozilla-specific.txt
# Update Mozilla removed and added wordlists based on the new upstream
# dictionary, save them as UTF-8 and not ISO-8859-1
comm -12 $SUPPORT_DIR/3-upstream.txt $SUPPORT_DIR/2-mozilla-removed.txt > $SUPPORT_DIR/5-mozilla-removed.txt
iconv -f iso-8859-1 -t utf-8 $SUPPORT_DIR/5-mozilla-removed.txt > 5-mozilla-removed.txt
comm -13 $SUPPORT_DIR/3-upstream.txt $SUPPORT_DIR/2-mozilla-added.txt > $SUPPORT_DIR/5-mozilla-added.txt
iconv -f iso-8859-1 -t utf-8 $SUPPORT_DIR/5-mozilla-added.txt > 5-mozilla-added.txt
# Clean up some files
rm hunspell-en_US-mozilla.zip
rm nosug
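
The `comm`-based merge in make-new-dict.sh (the "wordlist equivalent of diff3" mentioned in its header) is easier to follow on a toy example. The sketch below reuses the script's file-name pattern, but the words and the temporary directory are invented for illustration.

```shell
#!/usr/bin/env sh
# Sketch only: toy wordlists standing in for the expanded dictionaries.
set -e
export LC_ALL=C   # comm expects input sorted in the current collation order
tmpdir=$(mktemp -d)

printf 'apple\nbanana\ncherry\n'       > "$tmpdir/1-base.txt"      # old upstream
printf 'apple\ncherry\nfirefox\n'      > "$tmpdir/2-mozilla.txt"   # current Mozilla
printf 'apple\nbanana\ncherry\ndate\n' > "$tmpdir/3-upstream.txt"  # new upstream

# Only in base: words Mozilla removed.
comm -23 "$tmpdir/1-base.txt" "$tmpdir/2-mozilla.txt" > "$tmpdir/2-mozilla-removed.txt"
# Only in Mozilla: words Mozilla added.
comm -13 "$tmpdir/1-base.txt" "$tmpdir/2-mozilla.txt" > "$tmpdir/2-mozilla-added.txt"
# New upstream, minus the removed words, plus the added words.
comm -23 "$tmpdir/3-upstream.txt" "$tmpdir/2-mozilla-removed.txt" \
  | cat - "$tmpdir/2-mozilla-added.txt" | sort -u > "$tmpdir/4-patched.txt"

removed=$(paste -s -d ' ' "$tmpdir/2-mozilla-removed.txt")
added=$(paste -s -d ' ' "$tmpdir/2-mozilla-added.txt")
patched=$(paste -s -d ' ' "$tmpdir/4-patched.txt")
echo "removed: $removed"
echo "added: $added"
echo "patched: $patched"
rm -rf "$tmpdir"
```

In this example `banana` stays removed and `firefox` stays added, while the new upstream word `date` is picked up.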

@ -0,0 +1,2 @@
nigga/SM
niggaz

@ -2,6 +2,7 @@
 shellcheck:
     description: Shell script linter
     include:
+        - extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/
         - taskcluster/docker/
     exclude: []
     # 1090: https://github.com/koalaman/shellcheck/wiki/SC1090