Bug 1806793 - Update scripts for en-US dictionary, r=sylvestre
Added .sh extension to all scripts.

edit-dictionary.sh:
* Convert to utf-8 before editing, and back to iso-8859-1 before saving
* Place a copy of the utf-8 dictionary inside the utf8 folder, and store the iso-8859-1 in place

make-new-dict.sh:
* Use .txt extension for support wordlists, and place them in a subfolder
* Exclude words in mozilla-exclusions.txt from the generated dictionary
* Save 5-mozilla-*.txt files to utf-8

Depends on D165304

Differential Revision: https://phabricator.services.mozilla.com/D165305
parent d974a31757
commit 7afa41bb66
10 changed files with 201 additions and 152 deletions
|
|
@ -267,3 +267,8 @@ toolkit/components/certviewer/content/package-lock.json
|
|||
^tools/esmify/jscodeshift.cmd
|
||||
^tools/esmify/jscodeshift.ps1
|
||||
^tools/esmify/package-lock.json
|
||||
|
||||
# Ignore support files for en-US dictionary updates
|
||||
^extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/scowl
|
||||
^extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/support_files/
|
||||
^extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/*en_US-mozilla*
|
||||
|
|
|
|||
|
|
@ -30,9 +30,9 @@ This section describes the process for adding a word to the dictionary:
|
|||
#. There’s a special script used for editing dictionaries. The script
|
||||
only works if you have the environment variable ``EDITOR`` set to the
|
||||
executable of an editor program; if you don’t have it set, you can use
|
||||
``EDITOR=vim sh edit-dictionary`` to edit using ``vim`` (or you can
|
||||
``EDITOR=vim sh edit-dictionary.sh`` to edit using ``vim`` (or you can
|
||||
substitute it with another editor), or you can just type
|
||||
``sh edit-dictionary`` if you have an ``EDITOR`` already specified.
|
||||
``sh edit-dictionary.sh`` if you have an ``EDITOR`` already specified.
|
||||
#. Add and remove words in the dictionary file, then quit the editor.
|
||||
#. Build Firefox and test your updated dictionary. Once you’re
|
||||
satisfied, use the process described in :ref:`write_a_patch` to create a
|
||||
|
|
@ -59,13 +59,13 @@ The working directory for this process is
|
|||
#. Download the latest version of the dictionary from `SCOWL`_ homepage or
|
||||
`SourceForce`_ as a tarball (tar.gz) and unpack it in the working directory.
|
||||
Rename the resulting folder from ``scowl-YYYY.MM.DD`` to ``scowl``.
|
||||
#. Run the script ``sh make-new-dict`` to generate a new dictionary and make
|
||||
#. Run the script ``sh make-new-dict.sh`` to generate a new dictionary and make
|
||||
sure it runs without any errors. For more details on this script, see the
|
||||
`make-new-dict`_ section.
|
||||
`make-new-dict.sh`_ section.
|
||||
#. Do a sanity check on the resulting dictionary file ``en_US-mozilla.dic``. For
|
||||
example, make sure that the size is about the same as the original dictionary
|
||||
(or slightly larger).
|
||||
#. If everything looks correct, use ``sh install-new-dict`` to copy the
|
||||
#. If everything looks correct, use ``sh install-new-dict.sh`` to copy the
|
||||
generated file in the right position and use the process described in
|
||||
:ref:`write_a_patch` to create a patch.
|
||||
|
||||
|
|
@ -76,7 +76,7 @@ mozilla-exclusions.txt
|
|||
----------------------
|
||||
|
||||
``mozilla-exclusions.txt`` is used to explicitly exclude some words from
|
||||
suggestions. The ``make-new-dict`` script will add them to the dictionary file
|
||||
suggestions. The ``make-new-dict.sh`` script will add them to the dictionary file
|
||||
with the ``/!`` flag.
|
||||
|
||||
Terms should be added to this file with exactly the same format used in the .dic
|
||||
|
|
@ -101,10 +101,10 @@ treats ISO-8859-1 files as binary and won’t display a diff when updating them.
|
|||
Info about the included scripts
|
||||
===============================
|
||||
|
||||
make-new-dict
|
||||
-------------
|
||||
make-new-dict.sh
|
||||
----------------
|
||||
|
||||
The dictionary upgrade scripts ``make-new-dict`` works by expanding (i.e.
|
||||
The dictionary upgrade script ``make-new-dict.sh`` works by expanding (i.e.
|
||||
“unmunching”) the affix compression dictionaries to create wordlists and
|
||||
use those to generate a new dictionary.
|
||||
|
||||
|
|
@ -136,17 +136,17 @@ following order:
|
|||
included in ``5-mozilla-specific.txt`` should be removed from this list.
|
||||
|
||||
The new dictionary is available as ``en_US-mozilla.dic`` and should be copied
|
||||
over using the ``install-new-dict`` script.
|
||||
over using the ``install-new-dict.sh`` script.
|
||||
|
||||
install-new-dict
|
||||
----------------
|
||||
install-new-dict.sh
|
||||
-------------------
|
||||
|
||||
The script:
|
||||
|
||||
* Creates a copy of ``orig`` as ``support_files/orig-bk`` and copies the new
|
||||
upstream version to ``orig``.
|
||||
* Copies the existing Mozilla dictionary in ``support_files/mozilla-bk``.
|
||||
* Converts the dictionary (.dic) generated by ``make-new-dict`` from UTF-8 to
|
||||
* Converts the dictionary (.dic) generated by ``make-new-dict.sh`` from UTF-8 to
|
||||
ISO-8859-1 and moves it to the parent folder.
|
||||
* Sets the affix file (.aff) to use ``ISO8859-1`` as ``SET`` instead of the
|
||||
original ``UTF-8``, removes ``ICONV`` patterns (input conversion tables).
|
||||
|
|
|
|||
|
|
@ -1,31 +0,0 @@
|
|||
#!/bin/sh
|
||||
# This Source Code Form is subject to the terms of the Mozilla Public
|
||||
# License, v. 2.0. If a copy of the MPL was not distributed with this
|
||||
# file, You can obtain one at http://mozilla.org/MPL/2.0/.
|
||||
|
||||
#
|
||||
# edit-dictionary
|
||||
|
||||
set -e
|
||||
|
||||
if [ -z "$EDITOR" ]; then
|
||||
echo 'Need to set the $EDITOR environment variable to your favorite editor!'
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Strip the first line that contains the count
|
||||
tail -n +2 ../en-US.dic > en-US.stripped
|
||||
|
||||
# Open the patched hunspell editor and let the user edit it
|
||||
echo "Now the dictionary is going to be opened for you to edit. When you're done, just quit the editor"
|
||||
echo -n "Press Enter to begin."
|
||||
read foo
|
||||
$EDITOR en-US.stripped
|
||||
|
||||
# Add back the line count
|
||||
wc -l < en-US.stripped | tr -d '[:blank:]' > en-US.dic
|
||||
LC_ALL=C sort en-US.stripped >> en-US.dic
|
||||
|
||||
# Clean up
|
||||
rm -f en-US.stripped
|
||||
|
||||
|
|
@ -0,0 +1,37 @@
|
|||
#! /usr/bin/env sh
|
||||
|
||||
# This Source Code Form is subject to the terms of the Mozilla Public
|
||||
# License, v. 2.0. If a copy of the MPL was not distributed with this
|
||||
# file, You can obtain one at http://mozilla.org/MPL/2.0/.
|
||||
|
||||
set -e
|
||||
|
||||
if [ -z "$EDITOR" ]; then
|
||||
echo 'Need to set the $EDITOR environment variable to your favorite editor.'
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Copy the current en-US dictionary and strip the first line that contains
|
||||
# the count.
|
||||
tail -n +2 ../en-US.dic > en-US.stripped
|
||||
|
||||
# Convert the file to UTF-8
|
||||
iconv -f iso-8859-1 -t utf-8 en-US.stripped > en-US.utf8
|
||||
rm en-US.stripped
|
||||
|
||||
# Open the hunspell dictionary and let the user edit it
|
||||
echo "Now the dictionary is going to be opened for you to edit. Quit the editor to finish editing."
|
||||
echo "Press Enter to begin."
|
||||
read foo
|
||||
$EDITOR en-US.utf8
|
||||
|
||||
# Add back the line count and sort the lines
|
||||
wc -l < en-US.utf8 | tr -d '[:blank:]' > en-US.dic
|
||||
LC_ALL=C sort en-US.utf8 >> en-US.dic
|
||||
rm -f en-US.utf8
|
||||
|
||||
# Convert back to ISO-8859-1
|
||||
iconv -f utf-8 -t iso-8859-1 en-US.dic > ../en-US.dic
|
||||
|
||||
# Keep a copy of the UTF-8 file in the utf8/ folder
|
||||
mv en-US.dic utf8/en-US-utf8.dic
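
The "add back the line count and sort the lines" step above can be tried in isolation. The following is only a minimal sketch: `rebuild_dic` is a hypothetical helper name that does not exist in the patch, and the encoding conversion done by `iconv` in the real script is deliberately left out.

```shell
#!/usr/bin/env sh
# Hypothetical helper (not part of the patch): rebuild_dic WORDLIST OUTPUT
# Writes the number of entries on the first line of OUTPUT, followed by
# the entries of WORDLIST in bytewise order, as a hunspell .dic expects.
rebuild_dic() {
  wordlist=$1
  output=$2
  # Some wc implementations pad the count with blanks, so strip them.
  wc -l < "$wordlist" | tr -d '[:blank:]' > "$output"
  # Sort in the C locale so the order does not depend on the user's locale.
  LC_ALL=C sort "$wordlist" >> "$output"
}
```

Calling `rebuild_dic en-US.utf8 en-US.dic` would mirror the two commands in the script. Note that `wc -l` counts newline characters, so a word list whose last line lacks a trailing newline would be undercounted by one.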
|
||||
|
|
@ -1,39 +0,0 @@
|
|||
#!/bin/sh
|
||||
|
||||
#
|
||||
# This script copies the new dictionary created by make-new-dict in
|
||||
# place.
|
||||
#
|
||||
|
||||
set -e
|
||||
|
||||
WKDIR="`pwd`"
|
||||
export SCOWL="$WKDIR/scowl/"
|
||||
SPELLER="$SCOWL/speller"
|
||||
|
||||
set -x
|
||||
|
||||
if [ -e orig-bk ]; then echo "$0: directory 'orig-bk' exists." 1>&2 ; exit 0; fi
|
||||
mv orig orig-bk
|
||||
mkdir orig
|
||||
cp $SPELLER/en_US-custom.dic $SPELLER/en_US-custom.aff $SPELLER/README_en_US-custom.txt orig
|
||||
|
||||
mkdir mozilla-bk
|
||||
mv ../en-US.dic ../en-US.aff ../README_en_US.txt mozilla-bk
|
||||
|
||||
# Convert the affix file to ISO8859-1
|
||||
sed -i=bak -e '/^ICONV/d' -e 's/^SET UTF-8$/SET ISO8859-1/' en_US-mozilla.aff
|
||||
|
||||
# Convert the dictionary to ISO8859-1
|
||||
mv en_US-mozilla.dic en_US-mozilla-utf8.dic
|
||||
iconv -f utf-8 -t iso-8859-1 < en_US-mozilla-utf8.dic > en_US-mozilla.dic
|
||||
|
||||
cp en_US-mozilla.aff ../en-US.aff
|
||||
cp en_US-mozilla.dic ../en-US.dic
|
||||
cp README_en_US-mozilla.txt ../README_en_US.txt
|
||||
|
||||
set +x
|
||||
|
||||
echo "New dictionary copied into place. Please commit the changes."
|
||||
|
||||
|
||||
|
|
@ -0,0 +1,41 @@
|
|||
#! /usr/bin/env sh
|
||||
|
||||
# This Source Code Form is subject to the terms of the Mozilla Public
|
||||
# License, v. 2.0. If a copy of the MPL was not distributed with this
|
||||
# file, You can obtain one at http://mozilla.org/MPL/2.0/.
|
||||
|
||||
# This script copies the new dictionary created by make-new-dict.sh in
|
||||
# place.
|
||||
|
||||
set -e
|
||||
|
||||
WKDIR="`pwd`"
|
||||
export SCOWL="$WKDIR/scowl/"
|
||||
SUPPORT_DIR="$WKDIR/support_files/"
|
||||
SPELLER="$SCOWL/speller"
|
||||
|
||||
if [ -e "$SUPPORT_DIR/orig-bk" ]; then
|
||||
echo "$0: directory '$SUPPORT_DIR/orig-bk' exists." 1>&2
|
||||
exit 0
|
||||
fi
|
||||
|
||||
mv orig "$SUPPORT_DIR/orig-bk"
|
||||
mkdir orig
|
||||
cp $SPELLER/en_US-custom.dic $SPELLER/en_US-custom.aff $SPELLER/README_en_US-custom.txt orig
|
||||
|
||||
mkdir "$SUPPORT_DIR/mozilla-bk"
|
||||
mv ../en-US.dic ../en-US.aff ../README_en_US.txt "$SUPPORT_DIR/mozilla-bk"
|
||||
|
||||
# Convert the affix file to ISO-8859-1
|
||||
cp en_US-mozilla.aff utf8/en-US-utf8.aff
|
||||
sed -i "" -e '/^ICONV/d' -e 's/^SET UTF-8$/SET ISO8859-1/' en_US-mozilla.aff
|
||||
|
||||
# Convert the dictionary to ISO-8859-1
|
||||
mv en_US-mozilla.dic utf8/en-US-utf8.dic
|
||||
iconv -f utf-8 -t iso-8859-1 < utf8/en-US-utf8.dic > en_US-mozilla.dic
|
||||
|
||||
cp en_US-mozilla.aff ../en-US.aff
|
||||
cp en_US-mozilla.dic ../en-US.dic
|
||||
mv README_en_US-mozilla.txt ../README_en_US.txt
|
||||
|
||||
echo "New dictionary copied into place. Please commit the changes."
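
A note on the affix conversion above: `sed -i ""` is the form BSD/macOS sed expects, while GNU sed takes the backup suffix attached to `-i` and would likely treat the empty string as an input file name. The sketch below shows one way to get the same result without `-i`, by writing to a temporary file. `aff_to_latin1` is a hypothetical name used for illustration only; it is not in the patch.

```shell
#!/usr/bin/env sh
# Hypothetical helper (not part of the patch): aff_to_latin1 AFF_FILE
# Removes ICONV lines and switches the declared encoding of a hunspell
# affix file from UTF-8 to ISO8859-1, without relying on "sed -i".
aff_to_latin1() {
  aff=$1
  # Same two sed expressions as in the script, written to a temporary
  # file first and then moved over the original.
  sed -e '/^ICONV/d' -e 's/^SET UTF-8$/SET ISO8859-1/' "$aff" > "$aff.tmp"
  mv "$aff.tmp" "$aff"
}
```

In the script this would be called as `aff_to_latin1 en_US-mozilla.aff`. Like the original, it only rewrites the header lines and does not re-encode the rest of the affix file.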
|
||||
|
|
@ -1,69 +0,0 @@
|
|||
#!/bin/sh
|
||||
|
||||
#
|
||||
# This script creates a new dictionary by expanding the original,
|
||||
# Mozilla's, and the upstream dictionary to remove affix flags and
|
||||
# then doing the wordlist equivalent of diff3 to create a new
|
||||
# dictionary.
|
||||
#
|
||||
# The files 2-mozilla-add and 2-mozilla-rem contain words added and
|
||||
# removed, receptively in the Mozilla dictionary. The final
|
||||
# dictionary will be in hunspell-en_US-mozilla.zip.
|
||||
#
|
||||
|
||||
set -e
|
||||
|
||||
export LANG=C
|
||||
export LC_ALL=C
|
||||
export LC_CTYPE=C
|
||||
export LC_COLLATE=C
|
||||
|
||||
WKDIR="`pwd`"
|
||||
|
||||
export SCOWL="$WKDIR/scowl/"
|
||||
|
||||
ORIG="$WKDIR/orig/"
|
||||
SPELLER="$SCOWL/speller"
|
||||
|
||||
expand() {
|
||||
grep -v '^[0-9]\+$' | $SPELLER/munch-list expand $1 | sort -u
|
||||
}
|
||||
|
||||
cd $SPELLER
|
||||
MK_LIST="../mk-list -v1 --accents=both en_US 60"
|
||||
cat <<EOF > params.txt
|
||||
With Input Command: $MK_LIST
|
||||
EOF
|
||||
# note: output of make-hunspell-dict is utf-8
|
||||
$MK_LIST | ./make-hunspell-dict -one en_US-custom params.txt > ./make-hunspell-dict.log
|
||||
cd $WKDIR
|
||||
|
||||
# Note: Input and output of "expand" is always iso-8859-1.
|
||||
# All expanded word list files are thus in iso-8859-1.
|
||||
|
||||
expand $SPELLER/en.aff < $SPELLER/en.dic.supp > 0-special # input: ASCII
|
||||
|
||||
# input in utf-8, expand expects iso-8859-1 so use iconv
|
||||
iconv -f utf-8 -t iso-8859-1 $ORIG/en_US-custom.dic | expand $SPELLER/en_US-custom.aff > 1-base.txt
|
||||
|
||||
expand ../en-US.aff < ../en-US.dic > 2-mozilla.txt # input: iso-8859-1
|
||||
|
||||
# input in utf-8, expand expects iso-8859-1 so use iconv
|
||||
iconv -f utf-8 -t iso-8859-1 $SPELLER/en_US-custom.dic | expand $SPELLER/en_US-custom.aff > 3-upstream.txt
|
||||
|
||||
comm -23 1-base.txt 2-mozilla.txt > 2-mozilla-rem
|
||||
comm -13 1-base.txt 2-mozilla.txt > 2-mozilla-add
|
||||
comm -23 3-upstream.txt 2-mozilla-rem | cat - 2-mozilla-add | sort -u > 4-patched.txt
|
||||
|
||||
# note: output of make-hunspell-dict is utf-8
|
||||
cat 4-patched.txt | comm -23 - 0-special | $SPELLER/make-hunspell-dict -one en_US-mozilla /dev/null
|
||||
|
||||
# sanity check should yield identical results
|
||||
#comm -23 1-base.txt 3-upstream.txt > 3-upstream-rem
|
||||
#comm -13 1-base.txt 3-upstream.txt > 3-upstream-add
|
||||
#comm -23 2-mozilla.txt 3-upstream-rem | cat - 3-upstream-add | sort -u > 4-patched-v2.txt
|
||||
|
||||
expand ../en-US.aff < mozilla-specific.txt > 5-mozilla-specific
|
||||
|
||||
comm -12 3-upstream.txt 2-mozilla-rem > 5-mozilla-removed
|
||||
comm -13 3-upstream.txt 2-mozilla-add > 5-mozilla-added
|
||||
102
extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/make-new-dict.sh
Executable file
|
|
@ -0,0 +1,102 @@
|
|||
#! /usr/bin/env sh
|
||||
|
||||
# This Source Code Form is subject to the terms of the Mozilla Public
|
||||
# License, v. 2.0. If a copy of the MPL was not distributed with this
|
||||
# file, You can obtain one at http://mozilla.org/MPL/2.0/.
|
||||
|
||||
# This script creates a new dictionary by expanding the original,
|
||||
# Mozilla's, and the upstream dictionary to remove affix flags and
|
||||
# then doing the wordlist equivalent of diff3 to create a new
|
||||
# dictionary.
|
||||
#
|
||||
# The files 2-mozilla-added.txt and 2-mozilla-removed.txt contain words added and
|
||||
# removed, respectively, in the Mozilla dictionary. The final
|
||||
# dictionary will be in hunspell-en_US-mozilla.zip.
|
||||
|
||||
set -e
|
||||
|
||||
export LANG=C
|
||||
export LC_ALL=C
|
||||
export LC_CTYPE=C
|
||||
export LC_COLLATE=C
|
||||
|
||||
WKDIR="`pwd`"
|
||||
|
||||
export SCOWL="$WKDIR/scowl/"
|
||||
|
||||
ORIG="$WKDIR/orig/"
|
||||
SUPPORT_DIR="$WKDIR/support_files/"
|
||||
SPELLER="$SCOWL/speller"
|
||||
|
||||
expand() {
|
||||
grep -v '^[0-9]\+$' | $SPELLER/munch-list expand $1 | sort -u
|
||||
}
|
||||
|
||||
mkdir -p $SUPPORT_DIR
|
||||
cd $SPELLER
|
||||
MK_LIST="../mk-list -v1 --accents=both en_US 60"
|
||||
cat <<EOF > params.txt
|
||||
With Input Command: $MK_LIST
|
||||
EOF
|
||||
# Note: the output of make-hunspell-dict is UTF-8
|
||||
$MK_LIST | ./make-hunspell-dict -one en_US-custom params.txt > ./make-hunspell-dict.log
|
||||
cd $WKDIR
|
||||
|
||||
# Note: Input and output of "expand" is always ISO-8859-1.
|
||||
# All expanded word list files are thus in ISO-8859-1.
|
||||
expand $SPELLER/en.aff < $SPELLER/en.dic.supp > $SUPPORT_DIR/0-special.txt
|
||||
|
||||
# Input is UTF-8, expand expects ISO-8859-1 so use iconv
|
||||
iconv -f utf-8 -t iso-8859-1 $ORIG/en_US-custom.dic | expand $ORIG/en_US-custom.aff > $SUPPORT_DIR/1-base.txt
|
||||
|
||||
# The existing Mozilla dictionary is already in ISO-8859-1
|
||||
expand ../en-US.aff < ../en-US.dic > $SUPPORT_DIR/2-mozilla.txt
|
||||
|
||||
# Input is UTF-8, expand expects ISO-8859-1 so use iconv
|
||||
iconv -f utf-8 -t iso-8859-1 $SPELLER/en_US-custom.dic | expand $SPELLER/en_US-custom.aff > $SUPPORT_DIR/3-upstream.txt
|
||||
|
||||
# Suppress common lines and lines only in the 2nd file, leaving words that are
|
||||
# only available in the 1st file (SCOWL), i.e. were removed by Mozilla.
|
||||
comm -23 $SUPPORT_DIR/1-base.txt $SUPPORT_DIR/2-mozilla.txt > $SUPPORT_DIR/2-mozilla-removed.txt
|
||||
|
||||
# Suppress common lines and lines only in the 1st file, leaving words that are
|
||||
# only available in the 2nd file (current Mozilla dictionary), i.e. were added
|
||||
# by Mozilla.
|
||||
comm -13 $SUPPORT_DIR/1-base.txt $SUPPORT_DIR/2-mozilla.txt > $SUPPORT_DIR/2-mozilla-added.txt
|
||||
|
||||
# Suppress common lines and lines only in the 2nd file, leaving words that are
|
||||
# only available in the 1st file (words from the new upstream SCOWL dictionary).
|
||||
# The result is upstream, minus the words removed, plus the words added.
|
||||
comm -23 $SUPPORT_DIR/3-upstream.txt $SUPPORT_DIR/2-mozilla-removed.txt | cat - $SUPPORT_DIR/2-mozilla-added.txt | sort -u > $SUPPORT_DIR/4-patched.txt
|
||||
|
||||
# Note: the output of make-hunspell-dict is UTF-8
|
||||
cat $SUPPORT_DIR/4-patched.txt | comm -23 - $SUPPORT_DIR/0-special.txt | $SPELLER/make-hunspell-dict -one en_US-mozilla /dev/null
|
||||
|
||||
# Exclude specific words from suggestions
|
||||
while IFS= read -r line
|
||||
do
|
||||
# If the string already contains an affix, just add !, otherwise add /!
|
||||
if printf '%s\n' "$line" | grep -q '/'; then
|
||||
sed -i "" "s|^$line$|$line!|" en_US-mozilla.dic
|
||||
else
|
||||
sed -i "" "s|^$line$|$line/!|" en_US-mozilla.dic
|
||||
fi
|
||||
done < "mozilla-exclusions.txt"
|
||||
|
||||
# Sanity check should yield identical results
|
||||
#comm -23 $SUPPORT_DIR/1-base.txt $SUPPORT_DIR/3-upstream.txt > $SUPPORT_DIR/3-upstream-removed.txt
|
||||
#comm -13 $SUPPORT_DIR/1-base.txt $SUPPORT_DIR/3-upstream.txt > $SUPPORT_DIR/3-upstream-added.txt
|
||||
#comm -23 $SUPPORT_DIR/2-mozilla.txt $SUPPORT_DIR/3-upstream-removed.txt | cat - $SUPPORT_DIR/3-upstream-added.txt | sort -u > $SUPPORT_DIR/4-patched-v2.txt
|
||||
|
||||
expand ../en-US.aff < mozilla-specific.txt > 5-mozilla-specific.txt
|
||||
|
||||
# Update Mozilla removed and added wordlists based on the new upstream
|
||||
# dictionary, save them as UTF-8 and not ISO-8859-1
|
||||
comm -12 $SUPPORT_DIR/3-upstream.txt $SUPPORT_DIR/2-mozilla-removed.txt > $SUPPORT_DIR/5-mozilla-removed.txt
|
||||
iconv -f iso-8859-1 -t utf-8 $SUPPORT_DIR/5-mozilla-removed.txt > 5-mozilla-removed.txt
|
||||
comm -13 $SUPPORT_DIR/3-upstream.txt $SUPPORT_DIR/2-mozilla-added.txt > $SUPPORT_DIR/5-mozilla-added.txt
|
||||
iconv -f iso-8859-1 -t utf-8 $SUPPORT_DIR/5-mozilla-added.txt > 5-mozilla-added.txt
|
||||
|
||||
# Clean up some files
|
||||
rm hunspell-en_US-mozilla.zip
|
||||
rm nosug
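
The three `comm` calls in this script amount to a three-way merge of sorted word lists. The sketch below reproduces just that step on arbitrary input files so it can be tried with small lists. `merge_wordlists` and its argument order are assumptions made for illustration; the real script works on the files in `support_files/` directly.

```shell
#!/usr/bin/env sh
# Hypothetical helper (not part of the patch):
#   merge_wordlists BASE MOZILLA UPSTREAM OUT
# All inputs must be sorted in the C locale, one word per line.
merge_wordlists() {
  base=$1
  mozilla=$2
  upstream=$3
  out=$4
  removed=$(mktemp)
  added=$(mktemp)
  # Words only in BASE: Mozilla removed them from the old upstream list.
  LC_ALL=C comm -23 "$base" "$mozilla" > "$removed"
  # Words only in MOZILLA: Mozilla added them on top of the old upstream.
  LC_ALL=C comm -13 "$base" "$mozilla" > "$added"
  # New upstream, minus the removed words, plus the added words.
  LC_ALL=C comm -23 "$upstream" "$removed" | cat - "$added" | LC_ALL=C sort -u > "$out"
  rm -f "$removed" "$added"
}
```

For example, with a base list of apple, banana, cherry, a Mozilla list of apple, cherry, firefox, and a new upstream list of apple, banana, cherry, date, the result is apple, cherry, date, firefox: banana stays removed, firefox stays added, and date is picked up from upstream.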
|
||||
|
|
@ -0,0 +1,2 @@
|
|||
nigga/SM
|
||||
niggaz
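
Entries in this file are matched against whole lines of the generated .dic and flagged with `!`. A minimal POSIX sh sketch of that flagging step is shown here, using `case` for the affix test and a temporary file in place of `sed -i`. `flag_exclusions` is a hypothetical name and is not part of the patch.

```shell
#!/usr/bin/env sh
# Hypothetical helper (not part of the patch):
#   flag_exclusions EXCLUSIONS_FILE DIC_FILE
# Appends the "!" flag to every DIC_FILE line that exactly matches an
# entry in EXCLUSIONS_FILE.
flag_exclusions() {
  exclusions=$1
  dic=$2
  while IFS= read -r line; do
    # Skip blank lines so an empty pattern never matches anything.
    [ -n "$line" ] || continue
    case $line in
      */*) flagged="$line!" ;;   # entry already has affix flags: add "!"
      *)   flagged="$line/!" ;;  # entry has no flags yet: add "/!"
    esac
    # Caveat: the entry is used as a sed pattern, so regular expression
    # metacharacters or "|" in an entry would need escaping.
    sed "s|^$line\$|$flagged|" "$dic" > "$dic.tmp"
    mv "$dic.tmp" "$dic"
  done < "$exclusions"
}
```

With a dictionary containing `foo`, `bar/SM`, and `baz`, and an exclusions file listing `foo` and `bar/SM`, the dictionary ends up with `foo/!`, `bar/SM!`, and an unchanged `baz`.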
|
||||
|
|
@ -2,6 +2,7 @@
|
|||
shellcheck:
|
||||
description: Shell script linter
|
||||
include:
|
||||
- extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/
|
||||
- taskcluster/docker/
|
||||
exclude: []
|
||||
# 1090: https://github.com/koalaman/shellcheck/wiki/SC1090
|
||||
|
|
|
|||