forked from mirrors/gecko-dev
Remove extension/universalchardet/doc/. Bug 471480, r=smontagu
parent
c764b6beb3
commit
d22af6cdd4
2 changed files with 0 additions and 231 deletions
@@ -1,231 +0,0 @@
<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="GENERATOR" content="Mozilla/4.76 [en] (WinNT; U) [Netscape]">
<title>Charset Detector Interface</title>
</head>
<body>
<font face="Arial">Charset Detector Interface</font>
<p><font size=-1>This is the interface that the charset detector exposes
to the outside world, in our case the browser. The caller first calls the
detector's "Init()" method to tell the detector how it wants to be notified
of the detection result; the observer pattern is used for this. The caller
then feeds text data to the detector through "DoIt()". The data can be
passed in a series of "DoIt()" calls, each containing only part of the
data, which is useful when the data is only partially available at any one
time. In our case the data comes from the network, so detection can start
long before the transfer finishes. When the detector is confident enough
about one encoding, it notifies its caller and stops detecting. If all
data has been fed to the detector and it is still not confident about any
encoding, calling "Done()" tells the detector to make its best guess.</font>
<p><font face="Courier New"><font size=-1>class nsICharsetDetector : public
nsISupports {</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;public:</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;NS_DEFINE_STATIC_IID_ACCESSOR(NS_ICHARSETDETECTOR_IID)</font></font>
<p><font face="Courier New"><font size=-1>&nbsp;&nbsp;// Set up the observer so the detector knows how to report its answer.</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;NS_IMETHOD Init(nsICharsetDetectionObserver* observer) = 0;</font></font>
<p><font face="Courier New"><font size=-1>&nbsp;&nbsp;// Feed a block of bytes to the detector.</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;// It calls the Notify function of the nsICharsetDetectionObserver</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;// once it finds the answer.</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;//&nbsp;&nbsp;aBytesArray - array of bytes</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;//&nbsp;&nbsp;aLen - length of aBytesArray</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;//&nbsp;&nbsp;oDontFeedMe - set to PR_TRUE if the detector does not need the</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;//&nbsp;&nbsp;&nbsp;&nbsp;following blocks, PR_FALSE if it needs more bytes.</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;//&nbsp;&nbsp;&nbsp;&nbsp;This is used to improve performance.</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;NS_IMETHOD DoIt(const
char* aBytesArray, PRUint32 aLen, PRBool* oDontFeedMe) = 0;</font></font>
<p><font face="Courier New"><font size=-1>&nbsp;&nbsp;// Tell the detector that this is its last chance to make a decision.</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;NS_IMETHOD Done() = 0;</font></font>
<br><font face="Courier New"><font size=-1>};</font></font>
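<p><font size=-1>The calling sequence described above can be sketched as
follows. This is a minimal, self-contained illustration and not the Mozilla
code: "Observer", "Detector", "RecordingObserver", "ToyDetector" and
"FeedChunks" are hypothetical stand-ins that use plain C++ types in place of
the XPCOM ones, and the toy detector's rule (report a made-up charset name on
the first 8-bit byte) exists only to make the control flow observable.</font>

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for nsICharsetDetectionObserver.
struct Observer {
  virtual ~Observer() = default;
  virtual void Notify(const std::string& charset) = 0;
};

// Hypothetical stand-in for nsICharsetDetector, without XPCOM types.
struct Detector {
  virtual ~Detector() = default;
  virtual void Init(Observer* observer) = 0;
  virtual void DoIt(const char* bytes, std::size_t len, bool* dontFeedMe) = 0;
  virtual void Done() = 0;
};

// Observer that simply records the reported charset name.
struct RecordingObserver : Observer {
  std::string charset;
  void Notify(const std::string& name) override { charset = name; }
};

// Toy detector (assumption, for illustration only): reports "x-toy-8bit" as
// soon as it sees a byte with the high bit set, else "x-toy-ascii" in Done().
class ToyDetector : public Detector {
 public:
  void Init(Observer* observer) override {
    mObserver = observer;
    mReported = false;
  }
  void DoIt(const char* bytes, std::size_t len, bool* dontFeedMe) override {
    assert(dontFeedMe != nullptr);
    for (std::size_t i = 0; i < len && !mReported; ++i) {
      if (static_cast<unsigned char>(bytes[i]) & 0x80) Report("x-toy-8bit");
    }
    *dontFeedMe = mReported;  // true once an answer has been reported
  }
  void Done() override {
    if (!mReported) Report("x-toy-ascii");  // best guess at end of data
  }

 private:
  void Report(const std::string& name) {
    mReported = true;
    if (mObserver) mObserver->Notify(name);
  }
  Observer* mObserver = nullptr;
  bool mReported = false;
};

// Feeds chunks one by one, stops early when the detector asks, then calls
// Done(). Returns the number of chunks actually fed.
std::size_t FeedChunks(Detector& detector, Observer& observer,
                       const std::vector<std::string>& chunks) {
  detector.Init(&observer);
  std::size_t fed = 0;
  for (const std::string& chunk : chunks) {
    bool dontFeedMe = false;
    detector.DoIt(chunk.data(), chunk.size(), &dontFeedMe);
    ++fed;
    if (dontFeedMe) break;
  }
  detector.Done();
  return fed;
}
```

<p><font size=-1>Stopping as soon as "dontFeedMe" is set is what lets a
caller avoid scanning the rest of a network stream once the answer is
known.</font>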
<br>
<br>
<p><font face="Arial">Inside Charset Detector</font>
<p><font size=-1>Inside the charset detector, most of the work is done by the
function "HandleData()". In fact, "DoIt()" does little more than call
"HandleData()". The algorithm is outlined here in C-like pseudocode. Some
details are omitted to keep the main points clear.</font>
<p><font face="Courier New"><font size=-1>HandleData(batch_of_text)</font></font>
<br><font face="Courier New"><font size=-1>{</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;if (batch_of_text contains BOM)</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;report UCS2;</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;if ((inputState is PureAscii) || (inputState is EscAscii))</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;if (batch_of_text contains 8-bit byte)</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;inputState = HighByte;</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;else if ((inputState is PureAscii) &amp;&amp; (batch_of_text contains Esc_Sequence))</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;inputState = EscAscii;</font></font>
<p><font face="Courier New"><font size=-1>&nbsp;&nbsp;if (inputState is HighByte)</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;{</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;Remove ASCII characters that are not adjacent to an 8-bit byte</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;For each prober in multibyte_probers</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Prober.HandleData(batch_of_text);</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;For each prober in singlebyte_probers</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Prober.HandleData(batch_of_text);</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;}</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;else if (inputState is EscAscii)</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;{</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;For each prober in (ISO2022_XX or HZ)</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Prober.HandleData(batch_of_text);</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;}</font></font>
<br><font face="Courier New"><font size=-1>}</font></font>
<p><i><font face="Courier New"><font size=-1>nsUniversalDetector.h</font></font></i>
<br><i><font face="Courier New"><font size=-1>nsUniversalDetector.cpp</font></font></i>
<p><i><font face="Courier New"><font size=-1>These files implement the
high-level control logic.</font></font></i>
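<p><font size=-1>The input-state part of the pseudocode can be sketched in
C++ as a small pure function. This is only an illustration under stated
assumptions: the name "NextInputState" is hypothetical, and treating ESC
(0x1B) or the pair "~{" as the start of an escape sequence is an assumption,
since the text above does not define "Esc_Sequence".</font>

```cpp
#include <cassert>
#include <cstddef>

enum class InputState { PureAscii, EscAscii, HighByte };

// Returns the input state after scanning one batch of bytes.
// Assumptions (not stated in the design text): an escape sequence is
// signalled by ESC (0x1B) or by the pair "~{"; HighByte is never left once
// entered; a "~{" pair split across two batches is not recognized here.
InputState NextInputState(InputState state, const char* buf, std::size_t len) {
  assert(buf != nullptr || len == 0);
  char prev = '\0';
  for (std::size_t i = 0; i < len; ++i) {
    if (state == InputState::HighByte) break;  // nothing left to decide
    const char c = buf[i];
    if (static_cast<unsigned char>(c) & 0x80) {
      state = InputState::HighByte;  // any 8-bit byte wins over EscAscii
    } else if (state == InputState::PureAscii &&
               (c == '\x1B' || (c == '{' && prev == '~'))) {
      state = InputState::EscAscii;
    }
    prev = c;
  }
  return state;
}
```

<p><font size=-1>The caller would then dispatch the batch to the multi-byte
and single-byte probers in the HighByte state, or to the escape-sequence
probers in the EscAscii state, as in the pseudocode.</font>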
<br>
<br>
<p>Charset Prober
<p><font size=-1>A charset prober verifies whether the input data belongs
to a certain encoding or group of encodings. It keeps its state in the member
"mState", which has three possible values. "eDetecting" means it has not
found a sure answer yet; "eFoundIt" and "eNotMe" mean what their names
suggest: the prober is sure the data is in its encoding, or sure that it is
not. The method "GetCharSetName" gives the caller the prober's sure answer
or best guess.</font>
<p><font size=-1>Generally, we implemented one charset prober for each
encoding. Several probers can be wrapped together by a wrapper prober, and a
single prober can also "probe" several encodings. Each charset prober is
designed and implemented independently and works independently. This lets
the caller skip certain probers when it has prior knowledge. For example,
if the user knows that an HTML page is in some kind of Japanese encoding,
non-Japanese charset probers will not be run. Users who have no interest
in certain languages can also eliminate the probers for those languages.
These measures lead to a smaller footprint and better performance.</font>
<p><font face="Courier New"><font size=-1>typedef enum {</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;eDetecting = 0,</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;eFoundIt = 1,</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;eNotMe = 2</font></font>
<br><font face="Courier New"><font size=-1>} nsProbingState;</font></font>
<p><font face="Courier New"><font size=-1>class nsCharSetProber {</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;public:</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;nsCharSetProber(){};</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;virtual const
char* GetCharSetName() {return "";};</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;virtual nsProbingState
HandleData(const char* aBuf, PRUint32 aLen) = 0;</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;nsProbingState
GetState(void) {return mState;};</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;virtual void
Reset(void) {mState = eDetecting;};</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;virtual float
GetConfidence(void) = 0;</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;virtual void
SetOpion() {};</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;protected:</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;nsProbingState
mState;</font></font>
<br><font face="Courier New"><font size=-1>};</font></font>
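<p><font size=-1>A wrapper prober of the kind mentioned above might look
roughly like the following sketch. All names here ("Prober", "MarkerProber",
"GroupProber") are hypothetical, the interface is a simplified version of the
class above using standard C++ types, and the toy "MarkerProber" scoring rule
is invented purely so that the wrapper has something to compare. It is not
the logic of any real prober.</font>

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

enum ProbingState { eDetecting = 0, eFoundIt = 1, eNotMe = 2 };

// Simplified, hypothetical version of the prober base class.
class Prober {
 public:
  virtual ~Prober() = default;
  virtual const char* GetCharSetName() const = 0;
  virtual ProbingState HandleData(const char* buf, std::size_t len) = 0;
  virtual float GetConfidence() const = 0;
  ProbingState GetState() const { return mState; }

 protected:
  ProbingState mState = eDetecting;
};

// Toy prober (invented rule): gives up on a forbidden byte; otherwise its
// confidence is the fraction of bytes seen so far that equal `marker`.
class MarkerProber : public Prober {
 public:
  MarkerProber(std::string name, char marker, char forbidden)
      : mName(std::move(name)), mMarker(marker), mForbidden(forbidden) {}
  const char* GetCharSetName() const override { return mName.c_str(); }
  ProbingState HandleData(const char* buf, std::size_t len) override {
    for (std::size_t i = 0; i < len && mState == eDetecting; ++i) {
      if (buf[i] == mForbidden) {
        mState = eNotMe;
      } else {
        ++mTotal;
        if (buf[i] == mMarker) ++mHits;
      }
    }
    return mState;
  }
  float GetConfidence() const override {
    if (mState == eNotMe || mTotal == 0) return 0.0f;
    return static_cast<float>(mHits) / static_cast<float>(mTotal);
  }

 private:
  std::string mName;
  char mMarker;
  char mForbidden;
  std::size_t mTotal = 0;
  std::size_t mHits = 0;
};

// Wrapper: feeds every prober that has not given up and answers with the
// most confident one. It does not own the probers it wraps.
class GroupProber : public Prober {
 public:
  explicit GroupProber(std::vector<Prober*> probers)
      : mProbers(std::move(probers)) {}
  const char* GetCharSetName() const override {
    const Prober* best = Best();
    return best ? best->GetCharSetName() : "";
  }
  ProbingState HandleData(const char* buf, std::size_t len) override {
    std::size_t active = 0;
    for (Prober* p : mProbers) {
      assert(p != nullptr);
      if (p->GetState() == eNotMe) continue;  // already eliminated
      const ProbingState st = p->HandleData(buf, len);
      if (st == eFoundIt) return mState = eFoundIt;
      if (st != eNotMe) ++active;
    }
    if (active == 0) mState = eNotMe;  // every wrapped prober gave up
    return mState;
  }
  float GetConfidence() const override {
    const Prober* best = Best();
    return best ? best->GetConfidence() : 0.0f;
  }

 private:
  const Prober* Best() const {
    const Prober* best = nullptr;
    for (const Prober* p : mProbers) {
      if (p->GetState() == eNotMe) continue;
      if (!best || p->GetConfidence() > best->GetConfidence()) best = p;
    }
    return best;
  }
  std::vector<Prober*> mProbers;
};
```

<p><font size=-1>Because the wrapper only holds pointers, a caller with prior
knowledge can simply leave unwanted probers out of the list it passes
in.</font>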
<br>
<br>
<p><font face="Arial">How multi-byte encoding charset probers work</font>
<p><font size=-1>For the charset probers that verify the SJIS, EUC-JP,
EUC-KR, EUC-CN (or GB2312), EUC-TW and Big5 encodings, each prober embeds a
state machine (mCodingSM) that identifies legal byte sequences based on its
encoding scheme. If an illegal byte sequence is met, the state machine
reaches the "eError" state. That signifies a failure for this prober, and
the prober reports a negative answer to its caller. Whenever the state
machine returns to the "eStart" state, a sequence of bytes has been
identified as a character. This character is sent to the character
distribution analyzer (mDistributionAnalyser) and the 2-char sequence
analyzer (mContextAnalyser) for statistical sampling. A call to
"GetConfidence" tells the caller how likely it is that the input is in this
encoding.</font>
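<p><font size=-1>The role of the embedded state machine can be illustrated
with a deliberately simplified sketch. The class "ToyTwoByteSM" and the
function "CountCharacters" are hypothetical, and the byte ranges (ASCII as
single bytes, lead and trail bytes both in 0xA1 to 0xFE) are a rough
EUC-style assumption, not the tables used by the real probers.</font>

```cpp
#include <cassert>
#include <string>

enum SMState { eStart = 0, eError = 1, eNeedTrail = 2 };

// Toy state machine for an EUC-like two-byte scheme (assumed ranges):
// bytes below 0x80 are single-byte characters; a lead byte in 0xA1..0xFE
// must be followed by a trail byte in 0xA1..0xFE.
class ToyTwoByteSM {
 public:
  SMState NextState(unsigned char byte) {
    const bool inRange = byte >= 0xA1 && byte <= 0xFE;
    if (mState == eError) return mState;  // errors are sticky
    if (mState == eStart) {
      if (byte < 0x80) {
        mState = eStart;  // complete single-byte character
      } else {
        mState = inRange ? eNeedTrail : eError;
      }
    } else {
      assert(mState == eNeedTrail);
      mState = inRange ? eStart : eError;  // eStart: character complete
    }
    return mState;
  }

 private:
  SMState mState = eStart;
};

// Returns the number of complete characters recognized, or -1 if an
// illegal byte sequence was met. A trailing lead byte is not an error,
// because more data may still arrive.
int CountCharacters(const std::string& bytes) {
  ToyTwoByteSM sm;
  int count = 0;
  for (char c : bytes) {
    const SMState st = sm.NextState(static_cast<unsigned char>(c));
    if (st == eError) return -1;
    if (st == eStart) ++count;  // hand the character to the analyzers here
  }
  return count;
}
```

<p><font size=-1>In a real prober, the point where the counter is
incremented is where the completed character would be passed on for
statistical sampling.</font>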
<p><font size=-1>Inside the "HandleData" method, a shortcut judgement is made
each time a batch of text has been processed. If the prober has received
enough data and reached a certain confidence level, it sets its state to
"eFoundIt" and gives its caller an immediate sure answer.</font>
<p><font size=-1>For encodings like ISO-2022 and HZ, the embedded state
machine can do an almost perfect job alone, so no statistical sampling is
done.</font>
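<p><font size=-1>The shortcut judgement amounts to a two-part threshold test,
which could be sketched as below. The function name "ShortcutJudgement" and
both threshold constants are hypothetical placeholders chosen for
illustration; the text above does not state the actual values.</font>

```cpp
#include <cassert>
#include <cstddef>

enum ProbingState { eDetecting = 0, eFoundIt = 1, eNotMe = 2 };

// Hypothetical thresholds, chosen only for illustration.
constexpr std::size_t kEnoughChars = 1024;
constexpr float kShortcutConfidence = 0.95f;

// Returns the prober state after one batch: a prober that is still
// detecting switches to eFoundIt only when it has seen enough characters
// AND its confidence is high enough. Decided states are left unchanged.
ProbingState ShortcutJudgement(ProbingState state, std::size_t charsSeen,
                               float confidence) {
  assert(confidence >= 0.0f && confidence <= 1.0f);
  if (state != eDetecting) return state;
  if (charsSeen > kEnoughChars && confidence > kShortcutConfidence) {
    return eFoundIt;
  }
  return eDetecting;
}
```

<p><font size=-1>Requiring both conditions keeps a prober from declaring
victory on a short but lucky sample.</font>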
<p><i><font size=-1>Big5Freq.tab</font></i>
<p><i><font size=-1>EUCKRFreq.tab</font></i>
<p><i><font size=-1>EUCTWFreq.tab</font></i>
<p><i><font size=-1>GB2312Freq.tab</font></i>
<p><i><font size=-1>JISFreq.tab</font></i>
<p><i><font size=-1>These files define the frequency table (character
to frequency-order mapping) for each language. Because Big5 and EUC-TW are
not based on the same charset standard, unlike EUC-JP and SJIS, two separate
tables are defined for them.</font></i>
<p><i><font size=-1>CharDistribution.h</font></i>
<p><i><font size=-1>CharDistribution.cpp</font></i>
<p><i><font size=-1>Implementation of the character distribution analyzer.</font></i>
<p><i><font size=-1>nsPkgInt.h</font></i>
<p><i><font size=-1>nsCodingStateMachine.h</font></i>
<p><i><font size=-1>These are the basis of the state machine implementation.</font></i>
<p><i><font size=-1>nsEscSM.cpp</font></i>
<p><i><font size=-1>State machines for ISO-2022-XX and HZ.</font></i>
<p><i><font size=-1>nsMBCSSM.cpp</font></i>
<p><i><font size=-1>State machines for Big5, EUC-JP, EUC-KR, EUC-TW, GB2312,
SJIS, and UTF8.</font></i>
<p><i><font size=-1>JpCntx.h</font></i>
<p><i><font size=-1>JpCntx.cpp</font></i>
<p><i><font size=-1>Japanese hiragana sequence analyzer.</font></i>
<p><i><font size=-1>nsBig5Prober.h</font></i>
<p><i><font size=-1>nsBig5Prober.cpp</font></i>
<p><i><font size=-1>nsEUCKRProber.h</font></i>
<p><i><font size=-1>nsEUCKRProber.cpp</font></i>
<p><i><font size=-1>nsEUCJPProber.h</font></i>
<p><i><font size=-1>nsEUCJPProber.cpp</font></i>
<p><i><font size=-1>nsEUCTWProber.h</font></i>
<p><i><font size=-1>nsEUCTWProber.cpp</font></i>
<p><i><font size=-1>nsSJISProber.h</font></i>
<p><i><font size=-1>nsSJISProber.cpp</font></i>
<p><i><font size=-1>nsGB2312Prober.h</font></i>
<p><i><font size=-1>nsGB2312Prober.cpp</font></i>
<p><i><font size=-1>nsUTF8Prober.h</font></i>
<p><i><font size=-1>nsUTF8Prober.cpp</font></i>
<p><i><font size=-1>Charset prober class definitions and implementations
for each encoding. Each prober has an embedded state machine and a character
distribution analyzer, except UTF8, whose state machine is good enough on
its own.</font></i>
<p><i><font size=-1>nsMBCSProber.h</font></i>
<p><i><font size=-1>nsMBCSProber.cpp</font></i>
<p><i><font size=-1>This is a wrapper for all the MBCS probers. In the
beginning I expected to put some high-level logic here that is based on
knowledge of multiple encodings. That might still be needed in the
future.</font></i>
<br>
<br>
<p><font face="Arial">How single-byte encoding charset probers work</font>
<p><font size=-1>For each encoding, a table maps each character to an
encoding-independent identification number. These identification numbers
come from the characters' frequency order, with some adjustments. For each
language, a 2-D matrix is defined as the language model. If cell &lt;x,
y&gt; is 0, the sequence &lt;character(x), character(y)&gt; is rarely used
in this language, where character(x) is the character whose identification
number is x. The 2-D matrix only defines sequences for a subset of all the
characters; characters whose identification number is outside this range
are ignored. Some sequences, such as ASCII-to-ASCII sequences, have no
relation to the language we are trying to verify and should not be counted.
In the current implementation, a sequence is counted if both characters are
8-bit ones. In some situations, a sequence with only one 8-bit character is
also expected to be counted.</font>
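<p><font size=-1>The matrix lookup described above can be sketched as
follows. Everything here is a hypothetical simplification: the names
"LangMatrix", "SequenceCounts" and "CountSequences" are invented, the input
is assumed to be already mapped to identification numbers, the matrix size
of 3 is only a toy value, and the sketch assumes that an out-of-range
character also breaks the pair on either side of it.</font>

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy size of the language model; only identification numbers below this
// value take part in sequence counting (hypothetical value).
constexpr std::size_t kSampleSize = 3;

// matrix[x][y] == 0 means the sequence <character(x), character(y)> is
// rarely used in the language.
using LangMatrix = std::array<std::array<int, kSampleSize>, kSampleSize>;

struct SequenceCounts {
  int total = 0;   // sequences that were looked up in the matrix
  int likely = 0;  // of those, sequences with a non-zero cell
};

// Counts adjacent pairs of identification numbers against the model.
// Characters outside the model are skipped and break the adjacent pairs
// (an assumption of this sketch).
SequenceCounts CountSequences(const std::vector<std::size_t>& orders,
                              const LangMatrix& matrix) {
  SequenceCounts counts;
  bool havePrev = false;
  std::size_t prev = 0;
  for (std::size_t order : orders) {
    if (order >= kSampleSize) {
      havePrev = false;  // character is not covered by the model
      continue;
    }
    if (havePrev) {
      assert(prev < kSampleSize);
      ++counts.total;
      if (matrix[prev][order] != 0) ++counts.likely;
    }
    prev = order;
    havePrev = true;
  }
  return counts;
}
```

<p><font size=-1>A prober could derive its confidence from the ratio of
"likely" to "total"; how the real prober weighs these counts is not
specified in this document.</font>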
<p><i><font size=-1>LangCyrillicModel.cpp: this file defines a mapping
table for each encoding and a 2-D matrix for all Cyrillic languages. A
"SequenceModel" structure is also defined for each encoding. This structure
is used to initialize a single-byte charset prober class. All Cyrillic
encodings share the same prober class implementation.</font></i>
<p><i><font size=-1>nsSBCharSetProber.h</font></i>
<p><i><font size=-1>nsSBCharSetProber.cpp: these two files define and
implement the single-byte charset prober.</font></i>
</body>
</html>
Binary file not shown.