Text Tokenizer.
#include <gtokenizer.h>
Public Member Functions | |
GTokenizer (GSession *session, GPlugInFactory *fac) | |
void | AddChar (const R::RChar &car) |
R::RString | Extract (size_t begin, size_t end) |
size_t | GetPos (void) const |
virtual void | Start (void) |
virtual bool | TreatChar (GDocAnalyze *analyzer, const R::RChar &car)=0 |
Public Member Functions inherited from GPlugIn | |
GPlugIn (GSession *session, GPlugInFactory *fac) | |
virtual void | ApplyConfig (void) |
void | InsertParam (R::RParam *param) |
template<class T > | |
T * | FindParam (const R::RString &name) |
R::RCursor< R::RParam > | GetParams (const R::RString &cat=R::RString::Null) |
void | GetCategories (R::RContainer< R::RString, true, false > &cats) |
virtual void | Init (void) |
virtual void | CreateConfig (void) |
virtual void | Reset (void) |
GPlugInFactory * | GetFactory (void) const |
int | Compare (const GPlugIn &plugin) const |
int | Compare (const R::RString &plugin) const |
R::RString | GetName (void) const |
R::RString | GetDesc (void) const |
GSession * | GetSession (void) const |
virtual void | Done (void) |
virtual | ~GPlugIn (void) |
Private Attributes | |
R::RString | Buffer |
size_t | CurPos |
Additional Inherited Members | |
Protected Attributes inherited from GPlugIn | |
GPlugInFactory * | Factory |
GSession * | Session |
size_t | Id |
Detailed Description
Text Tokenizer.
The GTokenizer class provides methods that break a sequence of characters into tokens.
It offers a framework for a finite-state machine with memory.
It is used during the analysis of a document to determine how to extract the basic elements (words, abbreviations, e-mail addresses, etc.).
See the documentation related to GPlugIn for more general information.
Constructor & Destructor Documentation
GTokenizer | ( | GSession * | session, |
GPlugInFactory * | fac | ||
) |
Construct the tokenizer.
- Parameters
-
session Session.
fac Factory.
Member Function Documentation
void AddChar | ( | const R::RChar & | car | ) |
Add a character to the memory.
- Parameters
-
car Character to save.
R::RString Extract | ( | size_t | begin, |
size_t | end | ||
) |
Extract a string from the memory.
- Parameters
-
begin Beginning position.
end Ending position. If it is cNoRef, the extraction runs to the last character of the memory. Otherwise, the character at the ending position is also copied.
- Returns
- the extracted string.
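The end-position rule described for Extract can be illustrated with a minimal, self-contained sketch. It does not use the library: `std::string` stands in for R::RString, `kNoRef` is a hypothetical stand-in for cNoRef, and `ExtractSketch` is an illustrative name. It also assumes that the ending position is inclusive, which is one reading of the parameter description.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>

// Hypothetical stand-in for the library's cNoRef constant.
constexpr std::size_t kNoRef = std::numeric_limits<std::size_t>::max();

// Copy memory[begin..end], treating `end` as inclusive (an assumption).
// When `end` is kNoRef (or past the memory), copy up to the last character.
std::string ExtractSketch(const std::string& memory, std::size_t begin, std::size_t end)
{
    assert(begin <= memory.size());
    if (begin == memory.size())
        return std::string();          // nothing left to copy
    if (end == kNoRef || end >= memory.size())
        end = memory.size() - 1;       // clamp to the last character
    assert(begin <= end);
    return memory.substr(begin, end - begin + 1);
}
```

With this reading, `ExtractSketch("tokenizer", 0, 4)` copies positions 0 through 4, and passing `kNoRef` as the end copies everything from `begin` onward.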
size_t GetPos | ( | void | ) | const |
Get the position currently treated.
- Returns
- position.
virtual void Start | ( | void | ) |
Method called each time the tokenizer starts analyzing some text. Every overriding method in an inheriting class must call it.
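The requirement to call the base Start method from an overriding one can be sketched as follows. The classes are simplified stand-ins, not the library's: `TokenizerSketch` and `StatefulTokenizer` are hypothetical names, `char` replaces R::RChar, and the assumption that Start clears the memory and resets the position is a guess about the base class, not documented behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplified stand-in for GTokenizer: only the memory and the position.
class TokenizerSketch
{
    std::string Buffer;   // memory
    std::size_t CurPos;   // current position

public:
    TokenizerSketch() : CurPos(0) {}
    virtual ~TokenizerSketch() = default;

    // Assumption: saving a character also advances the position.
    void AddChar(char car) { Buffer += car; ++CurPos; }
    std::size_t GetPos() const { return CurPos; }

    // Assumption: starting a new text clears the memory and the position.
    virtual void Start() { Buffer.clear(); CurPos = 0; }
};

// A derived tokenizer that keeps extra state of its own.
class StatefulTokenizer : public TokenizerSketch
{
    int State;

public:
    StatefulTokenizer() : State(0) {}
    void Enter(int state) { State = state; }
    int GetState() const { return State; }

    void Start() override
    {
        TokenizerSketch::Start();  // let the base class reset its memory first
        State = 0;                 // then reset the state owned by this class
    }
};
```

If the derived Start skipped the base call, the memory of the previous text would leak into the next analysis in this sketch.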
virtual bool TreatChar | ( | GDocAnalyze * | analyzer, |
const R::RChar & | car | ||
) | pure virtual
This method is called each time the analyzer treats a character.
The method should call the AddToken method of the analyzer to add valid tokens.
- Parameters
-
analyzer Analyzer.
car Character treated.
- Returns
- true if the character starts a token.
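As an illustration of the per-character contract, here is a minimal two-state word tokenizer written against stand-in types. Nothing here is the library's API: `AnalyzerSketch` replaces GDocAnalyze, its `AddToken(const std::string&)` signature is hypothetical, `char` replaces R::RChar, and the `Flush` helper is an addition of this sketch for emitting the last pending token.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for GDocAnalyze: it only collects tokens.
struct AnalyzerSketch
{
    std::vector<std::string> Tokens;
    void AddToken(const std::string& token) { Tokens.push_back(token); }
};

// Two-state machine with memory: outside a word, or inside one.
class WordTokenizerSketch
{
    enum class State { Outside, InWord };
    State Cur = State::Outside;
    std::string Buffer;  // memory holding the characters of the current token

public:
    void Start() { Cur = State::Outside; Buffer.clear(); }

    // Returns true if `car` starts a token.
    bool TreatChar(AnalyzerSketch& analyzer, char car)
    {
        if (std::isalnum(static_cast<unsigned char>(car)) != 0)
        {
            const bool starts = (Cur == State::Outside);
            Cur = State::InWord;
            Buffer += car;       // remember the character
            return starts;
        }
        Flush(analyzer);         // a separator ends the current token, if any
        return false;
    }

    // Emit the pending token, if there is one, and return to the idle state.
    void Flush(AnalyzerSketch& analyzer)
    {
        if (Cur == State::InWord)
            analyzer.AddToken(Buffer);
        Cur = State::Outside;
        Buffer.clear();
    }
};

// Drive the tokenizer over a whole text, one character at a time.
std::vector<std::string> TokenizeSketch(const std::string& text)
{
    AnalyzerSketch analyzer;
    WordTokenizerSketch tokenizer;
    tokenizer.Start();
    for (char car : text)
        tokenizer.TreatChar(analyzer, car);
    tokenizer.Flush(analyzer);
    return analyzer.Tokens;
}
```

A real tokenizer for abbreviations or e-mail addresses would need more states, but the shape stays the same: each character either extends the memory, starts a token, or closes one.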
Member Data Documentation
R::RString Buffer | private
Memory.
size_t CurPos | private
Current position.