Text Tokenizer.
#include <gtokenizer.h>
Public Member Functions
| | GTokenizer (GSession *session, GPlugInFactory *fac) |
| void | AddChar (const R::RChar &car) |
| R::RString | Extract (size_t begin, size_t end) |
| size_t | GetPos (void) const |
| virtual void | Start (void) |
| virtual bool | TreatChar (GDocAnalyze *analyzer, const R::RChar &car)=0 |
Detailed Description
Text Tokenizer.
The GTokenizer class provides methods that break a sequence of characters into tokens.
It offers a framework for a finite-state machine with memory.
It is used during the analysis of a document to determine how to extract the basic elements (words, abbreviations, e-mail addresses, etc.).
Constructor & Destructor Documentation
GTokenizer (GSession *session, GPlugInFactory *fac)
Construct the tokenizer.
- Parameters:
- session: Session.
- fac: Factory.
Member Function Documentation
void AddChar (const R::RChar &car)
Add a character to the memory.
- Parameters:
- car: Character to save.
R::RString Extract (size_t begin, size_t end)
Extract a string from the memory.
- Parameters:
- begin: Beginning position.
- end: Ending position. If it is cNoRef, the end is the last character of the memory. Otherwise, the ending position is copied too.
- Returns:
- the extracted string.
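The extraction rule can be illustrated with a minimal sketch. It does not use the GALILEI types: `MemorySketch` and `kNoRef` are hypothetical stand-ins for the tokenizer memory and for cNoRef, `std::string` replaces R::RString, and the code assumes that the ending position is inclusive.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>

// Hypothetical stand-in for the library's cNoRef sentinel.
constexpr std::size_t kNoRef = std::numeric_limits<std::size_t>::max();

// Minimal stand-in for the tokenizer memory (not the GALILEI class).
class MemorySketch
{
    std::string Memory;   // characters saved so far
public:
    // Counterpart of AddChar: append one character to the memory.
    void AddChar(char car) { Memory.push_back(car); }

    // Counterpart of Extract, assuming 'end' is an inclusive position.
    std::string Extract(std::size_t begin, std::size_t end) const
    {
        if (begin >= Memory.size())
            return std::string();      // nothing stored at 'begin'
        if (end == kNoRef || end >= Memory.size())
            end = Memory.size() - 1;   // run to the last character in memory
        assert(begin <= end);
        return Memory.substr(begin, end - begin + 1);
    }
};
```

With "hello world" stored in the memory, `Extract(0, 4)` yields "hello" and `Extract(6, kNoRef)` yields "world" under these assumptions.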
size_t GetPos (void) const
Get the position currently treated.
- Returns:
- position.
virtual void Start (void) [virtual]
Method called each time the tokenizer starts analyzing some text. Every overriding method in an inheriting class must call it.
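The chaining requirement can be sketched as follows. Both classes are hypothetical stand-ins, not the GALILEI implementation: the sketch assumes that the base Start resets the memory and the current position, and that AddChar advances the position by one.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the base class (not the GALILEI implementation).
class TokenizerSketch
{
    std::string Memory;    // characters saved so far
    std::size_t Pos = 0;   // position currently treated (assumed to follow AddChar)
public:
    virtual ~TokenizerSketch() = default;
    void AddChar(char car) { Memory.push_back(car); ++Pos; }
    std::size_t GetPos() const { return Pos; }

    // Assumed behaviour: forget everything kept from the previous text.
    virtual void Start() { Memory.clear(); Pos = 0; }
};

// Hypothetical derived tokenizer with one extra piece of state.
class DerivedSketch : public TokenizerSketch
{
    bool InWord = false;   // state owned by the derived class
public:
    void Enter() { InWord = true; }
    bool IsInWord() const { return InWord; }

    void Start() override
    {
        TokenizerSketch::Start();   // reset the base-class state first
        InWord = false;             // then reset the derived state
    }
};
```

If the override skipped the call to the base method, the memory and position left over from the previous text would leak into the next analysis.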
virtual bool TreatChar (GDocAnalyze *analyzer, const R::RChar &car) [pure virtual]
This method is called each time the analyzer treats a character.
The method should call the AddToken method of the analyzer to add valid tokens.
- Parameters:
- analyzer: Analyzer.
- car: Character treated.
- Returns:
- true if the character starts a token.
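A minimal two-state machine shows how an implementation of this contract might look. Everything here is a hypothetical stand-in: `AnalyzerSketch` replaces GDocAnalyze and only collects tokens, `char` replaces R::RChar, and the rule "a token is a run of alphanumeric characters" is an assumption made for the example.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for GDocAnalyze: it only collects tokens.
struct AnalyzerSketch
{
    std::vector<std::string> Tokens;
    void AddToken(const std::string& token) { Tokens.push_back(token); }
};

// Two-state machine: outside a word or inside one.
class WordTokenizerSketch
{
    std::string Memory;    // characters of the token being built
    bool InWord = false;   // current state
public:
    void Start() { Memory.clear(); InWord = false; }

    // Returns true if 'car' starts a token.
    bool TreatChar(AnalyzerSketch* analyzer, char car)
    {
        const bool letter = std::isalnum(static_cast<unsigned char>(car)) != 0;
        if (letter)
        {
            const bool starts = !InWord;   // first letter after a separator
            InWord = true;
            Memory.push_back(car);
            return starts;
        }
        if (InWord)
        {
            analyzer->AddToken(Memory);    // a separator ends the current word
            Memory.clear();
            InWord = false;
        }
        return false;
    }
};

// Hypothetical helper: feed a whole text, then one separator to flush.
inline std::vector<std::string> Tokenize(const std::string& text)
{
    AnalyzerSketch analyzer;
    WordTokenizerSketch tokenizer;
    tokenizer.Start();
    for (char car : text)
        tokenizer.TreatChar(&analyzer, car);
    tokenizer.TreatChar(&analyzer, ' ');
    return analyzer.Tokens;
}
```

Under these assumptions, `Tokenize("Hello, big world")` produces the three tokens "Hello", "big" and "world". The memory is what lets the machine emit a complete token only when the separator that ends it arrives.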