Remove extension/universalchardet/doc/. Bug 471480, r=smontagu

2010-03-13 12:08:57 -08:00 · 2010-03-13 12:08:57 -08:00 · d22af6cdd4
commit d22af6cdd4
parent c764b6beb3
2 changed files with 0 additions and 231 deletions
--- a/extensions/universalchardet/doc/ChardetInterface.htm
+++ b/extensions/universalchardet/doc/ChardetInterface.htm
@@ -1,231 +0,0 @@
 <!doctype html public "-//w3c//dtd html 4.0 transitional//en">
 <html>
 <head>
   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
   <meta name="GENERATOR" content="Mozilla/4.76 [en] (WinNT; U) [Netscape]">
   <title>Charset Detector Interface</title>
 </head>
 <body>
 <font face="Arial">Charset Detector Interface</font>
 <p><font size=-1>This is the charset detector’s interface that is exposed
 to the outside world, in our case, the browser. At the very beginning, the caller
 calls the detector’s "Init()" method to let the detector know how it would like
 to be notified of the detection result. The observer pattern is used in
 this case. Then the caller just needs to feed the charset detector with text
 data through "DoIt()". This can be done through a series of "DoIt()" calls,
 with each call containing only part of the data. This can be very useful
 if the data is only partially available at one time. In our case, since
 the data comes from the network, we can start detecting long before the network
 finishes transferring all the data. When the detector is confident enough about
 one encoding, it will notify its caller and stop detecting. If all the data
 has been fed to the detector but the detector is still not confident enough about
 any encoding, the "Done()" method tells the detector to make a best guess.</font>
 <p><font face="Courier New"><font size=-1>class nsICharsetDetector : public
 nsISupports {</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp; public:</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp; NS_DEFINE_STATIC_IID_ACCESSOR(NS_ICHARSETDETECTOR_IID)</font></font>
 <p><font face="Courier New"><font size=-1>&nbsp; //Setup the observer so
 it know how to notify the answer</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp; NS_IMETHOD Init(nsICharsetDetectionObserver*
 observer) = 0;</font></font>
 <p><font face="Courier New"><font size=-1>&nbsp; //Feed a block of bytes
 to the detector.</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp; //It will call the Notify
 function of the nsICharsetObserver if it</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp; //find out the answer.</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp; // aBytesArray - array
 of bytes</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp; // aLen - length of aBytesArray</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp; // oDontFeedMe - return
 PR_TRUE if the detector do not need the</font></font>
 <br>&nbsp; <font face="Courier New"><font size=-1>// following block</font></font>
 <br>&nbsp; <font face="Courier New"><font size=-1>// PR_FALSE it need more
 bytes.</font></font>
 <br>&nbsp; <font face="Courier New"><font size=-1>// This is used to enhance
 performance</font></font>
 <br>&nbsp; <font face="Courier New"><font size=-1>NS_IMETHOD DoIt(const
 char* aBytesArray, PRUint32 aLen, PRBool* oDontFeedMe) = 0;</font></font>
 <p>&nbsp; <font face="Courier New"><font size=-1>//Tell the detector
 that this is its last chance to make a decision</font></font>
 <br>&nbsp; <font face="Courier New"><font size=-1>NS_IMETHOD Done() = 0;</font></font>
 <br><font size=-1>}<font face="Courier New">;</font></font>
 <br>&nbsp;
 <br>&nbsp;
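 <p><font size=-1>As an illustration of the calling sequence described above,
 here is a minimal self-contained C++ sketch. It does not use the real XPCOM
 interfaces: "Observer", "ToyDetector", "FeedChunks" and the toy detection rule
 (report the made-up label "x-toy-8bit" on the first byte with the high bit set,
 otherwise guess "US-ASCII" in "Done()") are all assumptions made for this example.</font>

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for nsICharsetDetectionObserver.
struct Observer {
  std::string charset;  // stays empty until the detector reports
  void Notify(const std::string& name) { charset = name; }
};

// Hypothetical toy detector; its detection rule is NOT the real algorithm.
class ToyDetector {
 public:
  void Init(Observer* observer) { mObserver = observer; }

  // Feed one block; *dontFeedMe becomes true once an answer was reported.
  void DoIt(const char* bytes, std::size_t len, bool* dontFeedMe) {
    for (std::size_t i = 0; i < len && !mDone; ++i) {
      if (static_cast<unsigned char>(bytes[i]) >= 0x80) {
        Report("x-toy-8bit");
      }
    }
    *dontFeedMe = mDone;
  }

  // Last chance: make a best guess if nothing was reported yet.
  void Done() {
    if (!mDone) Report("US-ASCII");
  }

 private:
  void Report(const std::string& name) {
    mDone = true;
    if (mObserver) mObserver->Notify(name);
  }

  Observer* mObserver = nullptr;
  bool mDone = false;
};

// Feeds chunks as they "arrive"; returns how many chunks were consumed.
std::size_t FeedChunks(ToyDetector& detector,
                       const std::vector<std::string>& chunks) {
  std::size_t fed = 0;
  for (const std::string& chunk : chunks) {
    bool dontFeedMe = false;
    detector.DoIt(chunk.data(), chunk.size(), &dontFeedMe);
    ++fed;
    if (dontFeedMe) return fed;  // answer already delivered, stop early
  }
  detector.Done();  // all data fed, ask for a best guess
  return fed;
}
```

 <p><font size=-1>The early return is the point of "oDontFeedMe": once the
 observer has been notified, the caller can skip the remaining blocks.</font>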
 <p><font face="Arial">Inside Charset Detector</font>
 <p><font size=-1>Inside the charset detector, the major work is done by the function
 "HandleData()". In fact, "DoIt()" has very little to do other
 than call "HandleData()". The following is the algorithm logic in a C-like
 pseudo-language. Some details are dropped in order to make the main point clearer.</font>
 <p><font face="Courier New"><font size=-1>HandleData(batch_of_text)</font></font>
 <br><font face="Courier New"><font size=-1>{</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp; if (batch_of_text contains
 BOM)</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; report UCS2;</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp; if ((inputState is PureAscii)
 || (inputState is EscAscii))</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; if (batch_of_text
 contains 8-bits-byte)</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
 inputState = HighByte;</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; else if ((inputState
 is PureAscii ) &amp;&amp; (batch_of_text contains Esc_Sequence) )</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
 inputState = EscAscii;</font></font>
 <p><font face="Courier New"><font size=-1>&nbsp; if (inputState is HighByte)</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp; {</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; Remove Ascii
 character that is not neighboring to 8-bits byte</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; For each
 prober in multibyte_probers</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; Prober.HandleData(batch_of_text);</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; For each
 prober in singlebyte_probers</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; Prober.HandleData(batch_of_text);</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp; }</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp; else if (inputState is
 EscAscii)</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp; {</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; For each
 prober in (ISO2022_XX or HZ)</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; Prober.HandleData(batch_of_text);</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp; }</font></font>
 <br><font face="Courier New"><font size=-1>}</font></font>
 <p><i><font face="Courier New"><font size=-1>nsUniversalDetector.h</font></font></i>
 <br><i><font face="Courier New"><font size=-1>nsUniversalDetector.cpp</font></font></i>
 <p><i><font face="Courier New"><font size=-1>These files implement the high-level
 control logic.</font></font></i>
 <br>&nbsp;
 <br>&nbsp;
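 <p><font size=-1>The input-state transitions in the pseudocode above can be
 sketched in C++ as follows. This is an illustration only: the function name
 "UpdateInputState" is hypothetical, and treating an ESC byte (0x1B) or the
 two characters "~{" as an escape sequence is an assumption of this sketch.
 Escape sequences split across two batches are not handled here.</font>

```cpp
#include <cassert>
#include <string>

enum class InputState { PureAscii, EscAscii, HighByte };

// Returns the input state after scanning one batch of text.
InputState UpdateInputState(InputState state, const std::string& batch) {
  // Once a high byte has been seen, the state never goes back.
  if (state == InputState::HighByte) return state;

  bool sawHighByte = false;
  for (unsigned char c : batch) {
    if (c & 0x80) {
      sawHighByte = true;
      break;
    }
  }
  if (sawHighByte) return InputState::HighByte;

  // Assumed escape markers: ESC for ISO-2022 style, "~{" for HZ style.
  bool sawEscape = batch.find('\x1B') != std::string::npos ||
                   batch.find("~{") != std::string::npos;
  if (state == InputState::PureAscii && sawEscape) {
    return InputState::EscAscii;
  }
  return state;
}
```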
 <p>Charset Prober
 <p><font size=-1>A charset prober verifies whether the input data belongs
 to a certain encoding or group of encodings. It maintains its state in the member
 "mState", which has 3 possible values. State "eDetecting" means it hasn’t
 found a sure answer yet; "eFoundIt" and "eNotMe" carry the meanings
 their names suggest. The method "GetCharSetName" tells the caller the prober’s
 sure answer or best guess.</font>
 <p><font size=-1>Generally, we implemented one charset prober for each encoding.
 Several probers can be wrapped together with a wrapper prober.
 It is also possible for a prober to "probe" several encodings. Each charset
 prober is designed and implemented independently, and works independently. This enables
 the prober’s caller to eliminate certain probers when it has prior knowledge.
 For example, if the user knows that an HTML page is in some kind of Japanese encoding,
 non-Japanese charset probers will not be fired. If users have no interest
 in certain languages, they can also eliminate those charset probers. These
 measures lead to a smaller footprint and faster performance.</font>
 <p><font face="Courier New"><font size=-1>typedef enum {</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp; eDetecting = 0,</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp; eFoundIt = 1,</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp; eNotMe = 2</font></font>
 <br><font size=-1>}<font face="Courier New"> nsProbingState;</font></font>
 <p><font face="Courier New"><font size=-1>class nsCharSetProber {</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp; public:</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; nsCharSetProber(){};</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; virtual const
 char* GetCharSetName() {return "";};</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; virtual nsProbingState
 HandleData(const char* aBuf, PRUint32 aLen) = 0;</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; nsProbingState
 GetState(void) {return mState;};</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; virtual void
 Reset(void) {mState = eDetecting;};</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; virtual float
 GetConfidence(void) = 0;</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; virtual void
 SetOpion() {};</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp; protected:</font></font>
 <br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; nsProbingState
 mState;</font></font>
 <br><font face="Courier New"><font size=-1>};</font></font>
 <br>&nbsp;
 <br>&nbsp;
 <p><font face="Arial">How multi-byte encoding charset prober works</font>
 <p><font size=-1>For the charset probers verifying the SJIS, EUC-JP, EUC-KR, EUC-CN
 (or GB2312), EUC-TW, and Big5 encodings, each prober embeds a state machine (mCodingSM),
 which identifies legal byte sequences based on its encoding scheme. If an illegal
 byte sequence is met, this state machine will reach the "eError" state. That
 signifies a failure for this prober, and the prober will report a negative answer
 to its caller. Once the state machine reaches the "eStart" state, it means a sequence
 of bytes has been identified as a character. This character will be sent
 to the character distribution analyzer (mDistributionAnalyser) and the 2-char sequence
 analyzer (mContextAnalyser) for statistical sampling. The "GetConfidence" call
 lets the caller know the likelihood that the input is in this
 encoding.</font>
 <p><font size=-1>Inside the "HandleData" method, each time after a batch of
 text has been processed, a shortcut judgement is performed. If the prober
 has received enough data and reached a certain confidence level, it will set
 its state to "eFoundIt" and give its caller an immediate sure answer.</font>
 <p><font size=-1>For encodings like ISO_2022 and HZ, since the embedded
 state machine can do an almost perfect job alone, no other statistical sampling
 is done.</font>
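 <p><font size=-1>To make the role of the coding state machine concrete, here is
 a small C++ sketch for a simplified two-byte scheme in which a character is
 either one ASCII byte or two bytes that both lie in 0xA1..0xFE. This is an
 assumption chosen for the example, not the actual tables in nsMBCSSM.cpp.
 "eStart" and "eError" follow the text above; the intermediate state "eTrail",
 the class "ToyEucStateMachine" and the helper "CountDoubleByteChars" are
 hypothetical names.</font>

```cpp
#include <cassert>
#include <string>

enum SMState { eStart, eError, eTrail };

// Toy state machine: ASCII, or a lead byte followed by one trail byte.
class ToyEucStateMachine {
 public:
  SMState NextState(unsigned char c) {
    switch (mState) {
      case eStart:
        if (c < 0x80) {
          mState = eStart;  // ASCII is a complete character by itself
        } else if (c >= 0xA1 && c <= 0xFE) {
          mState = eTrail;  // lead byte, one more byte expected
        } else {
          mState = eError;  // byte is illegal in this scheme
        }
        break;
      case eTrail:
        mState = (c >= 0xA1 && c <= 0xFE) ? eStart : eError;
        break;
      case eError:
        break;  // errors are final
    }
    return mState;
  }

 private:
  SMState mState = eStart;
};

// Returns the number of complete two-byte characters, or -1 if an
// illegal byte sequence is met.
int CountDoubleByteChars(const std::string& text) {
  ToyEucStateMachine sm;
  int count = 0;
  bool inDoubleByte = false;
  for (unsigned char c : text) {
    SMState state = sm.NextState(c);
    if (state == eError) return -1;
    if (state == eTrail) {
      inDoubleByte = true;
    } else if (inDoubleByte) {
      // Back at eStart after a lead byte: one character identified.
      // A real prober would hand it to its analyzers at this point.
      ++count;
      inDoubleByte = false;
    }
  }
  return count;
}
```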
 <p><i><font size=-1>Big5Freq.tab</font></i>
 <p><i><font size=-1>EUCKRFreq.tab</font></i>
 <p><i><font size=-1>EUCTWFreq.tab</font></i>
 <p><i><font size=-1>GB2312Freq.tab</font></i>
 <p><i><font size=-1>JISFreq.tab</font></i>
 <p><i><font size=-1>These files define the frequency table (character
 to frequency order mapping) for each language. Since Big5 and EUC-TW are
 not based on the same charset standard, as EUC-JP and SJIS are, 2 tables
 are defined.</font></i>
 <p><i><font size=-1>CharDistribution.h</font></i>
 <p><i><font size=-1>CharDistribution.cpp</font></i>
 <p><i><font size=-1>Implementation for Character distribution analyzer.</font></i>
 <p><i><font size=-1>nsPkgInt.h</font></i>
 <p><i><font size=-1>nsCodingStateMachine.h</font></i>
 <p><i><font size=-1>These are the bases of the state machine implementation.</font></i>
 <p><i><font size=-1>nsEscSM.cpp</font></i>
 <p><i><font size=-1>State machine for ISO-2022XX and HZ.</font></i>
 <p><i><font size=-1>nsMBCSSM.cpp</font></i>
 <p><i><font size=-1>State machines for Big5, EUC-JP, EUC-KR, EUC-TW, GB2312,
 SJIS, and UTF8.</font></i>
 <p><i><font size=-1>JpCntx.h</font></i>
 <p><i><font size=-1>JpCntx.cpp</font></i>
 <p><i><font size=-1>Japanese hiragana sequence analyzer.</font></i>
 <p><i><font size=-1>nsBig5Prober.h</font></i>
 <p><i><font size=-1>nsBig5Prober.cpp</font></i>
 <p><i><font size=-1>nsEUCKRProber.h</font></i>
 <p><i><font size=-1>nsEUCKRProber.cpp</font></i>
 <p><i><font size=-1>nsEUCJPProber.h</font></i>
 <p><i><font size=-1>nsEUCJPProber.cpp</font></i>
 <p><i><font size=-1>nsEUCTWProber.h</font></i>
 <p><i><font size=-1>nsEUCTWProber.cpp</font></i>
 <p><i><font size=-1>nsSJISProber.h</font></i>
 <p><i><font size=-1>nsSJISProber.cpp</font></i>
 <p><i><font size=-1>nsGB2312Prober.h</font></i>
 <p><i><font size=-1>nsGB2312Prober.cpp</font></i>
 <p><i><font size=-1>nsUTF8Prober.h</font></i>
 <p><i><font size=-1>nsUTF8Prober.cpp</font></i>
 <p><i><font size=-1>Charset prober class definitions and implementations
 for each encoding. Each prober has an embedded state machine and a character
 distribution analyzer, except UTF8, whose state machine is good enough on its own.</font></i>
 <p><i><font size=-1>nsMBCSProber.h</font></i>
 <p><i><font size=-1>nsMBCSProber.cpp</font></i>
 <p><i><font size=-1>This is a wrapper for all the MBCS probers. In the very
 beginning, I expected to put here some high-level logic based on knowledge
 of multiple encodings. That might still be needed in the future.</font></i>
 <br>&nbsp;
 <br>&nbsp;
 <p><font face="Arial">How single-byte encoding charset prober works</font>
 <p><font size=-1>For each encoding, a table is used to map a character
 to an encoding-independent identification number. Those identification
 numbers in fact come from the characters’ frequency order, with some adjustment.
 For each language, a 2-D matrix is defined as the language model. If cell &lt;x,
 y> is 0, it means the sequence &lt;character(x), character(y)> is a rarely
 used sequence in this language, with character(x) representing the character
 whose identification number is x. The 2-D matrix only defines sequences
 of a subset of all the characters. Characters whose identification
 number is out of this range are ignored. Some of
 the sequences, like ASCII-to-ASCII sequences, have no relation to the
 language we try to verify, so those sequences should not be counted. In the
 current implementation, a sequence will be counted if both characters are
 8-bit ones. In some situations, a sequence containing only one 8-bit character
 is expected to be counted.</font>
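 <p><font size=-1>The sequence counting described above can be sketched in C++
 as follows. The mapping in "OrderOf", the 4x4 matrix "kLikely" and all other
 names are made up for this example and are not taken from any real language
 model; only bytes 0xE0..0xE3 receive an identification number here, so ASCII
 bytes never contribute to a counted sequence.</font>

```cpp
#include <cassert>
#include <string>

constexpr int kSampleSize = 4;    // ids 0..3 are covered by the matrix
constexpr int kOutOfRange = 255;  // id for characters the model ignores

// Hypothetical character-to-id table: 0xE0..0xE3 map to ids 0..3.
int OrderOf(unsigned char c) {
  return (c >= 0xE0 && c <= 0xE3) ? c - 0xE0 : kOutOfRange;
}

// Hypothetical language model: kLikely[x][y] == 0 marks a rare sequence.
constexpr int kLikely[kSampleSize][kSampleSize] = {
    {1, 1, 0, 0},
    {0, 1, 1, 0},
    {0, 0, 1, 1},
    {1, 0, 0, 1},
};

struct SequenceCounts {
  int total = 0;   // sequences where both ids are inside the matrix
  int likely = 0;  // of those, sequences the model does not mark as rare
};

SequenceCounts CountSequences(const std::string& text) {
  SequenceCounts counts;
  int prev = kOutOfRange;
  for (unsigned char c : text) {
    int order = OrderOf(c);
    if (prev < kSampleSize && order < kSampleSize) {
      ++counts.total;
      if (kLikely[prev][order]) ++counts.likely;
    }
    prev = order;
  }
  return counts;
}
```

 <p><font size=-1>A prober built this way could derive its confidence from the
 ratio of likely sequences to counted sequences; the exact formula is left
 out of this sketch.</font>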
 <p><i><font size=-1>LangCyrillicModel.cpp : this file defines a mapping
 table for each encoding and a 2-D matrix for all Cyrillic languages. A
 "SequenceModel" structure is also defined for each encoding. This structure
 is used to initialize a single-byte charset prober class. All Cyrillic
 encodings share the same prober class implementation.</font></i>
 <p><i><font size=-1>nsSBCharSetProber.h</font></i>
 <p><i><font size=-1>nsSBCharSetProber.cpp : These 2 files define and implement
 the single-byte charset prober.</font></i>
 </body>
 </html>
--- a/extensions/universalchardet/doc/UniversalCharsetDetection.doc
+++ b/extensions/universalchardet/doc/UniversalCharsetDetection.doc