Text Tokenizer.

#include <gtokenizer.h>

Public Member Functions

 GTokenizer (GSession *session, GPlugInFactory *fac)
void AddChar (const R::RChar &car)
R::RString Extract (size_t begin, size_t end)
size_t GetPos (void) const
virtual void Start (void)
virtual bool TreatChar (GDocAnalyze *analyzer, const R::RChar &car)=0

Detailed Description

Text Tokenizer.

The GTokenizer class provides methods that break a sequence of characters into tokens.

It provides a framework for a finite-state machine with memory.

It is used during the analysis of a document to determine how to extract the basic elements (words, abbreviations, e-mails, etc.).


Constructor & Destructor Documentation

GTokenizer ( GSession * session,
GPlugInFactory * fac
)

Construct the tokenizer.

Parameters:
session  Session.
fac  Factory.

Member Function Documentation

void AddChar ( const R::RChar & car )

Add a character to the memory.

Parameters:
car  Character to save.

R::RString Extract ( size_t  begin,
size_t  end 
)

Extract a string from the memory.

Parameters:
begin  Beginning position.
end  Ending position. If it is cNoRef, the end is the last character of the memory. Otherwise, the character at the ending position is copied as well.
Returns:
the extracted string.

size_t GetPos ( void  ) const

Get the position currently treated.

Returns:
position.

virtual void Start ( void  ) [virtual]

Method called each time the tokenizer starts analyzing some text. Every derived class that overrides it must call this base method.
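
The requirement to call the base method can be sketched as follows. This is only an illustration: TokenizerBase and WordTokenizer are hypothetical stand-ins written for this example, not the real GTokenizer, and the assumption that Start() resets the memory is not stated by the documentation above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for GTokenizer: it only models the memory and Start().
class TokenizerBase
{
    std::string Memory;   // characters saved since the last Start()
public:
    virtual ~TokenizerBase() = default;
    void AddChar(char car) { Memory += car; }
    std::size_t GetPos() const { return Memory.size(); }
    // Assumed behaviour: Start() resets the internal memory.
    virtual void Start() { Memory.clear(); }
};

// Derived tokenizer with its own state that must be reset as well.
class WordTokenizer : public TokenizerBase
{
    bool InWord = false;
public:
    void EnterWord() { InWord = true; }
    bool IsInWord() const { return InWord; }
    void Start() override
    {
        TokenizerBase::Start();   // call the base method first
        InWord = false;           // then reset the derived state
    }
};

// Starts two analyses in a row and returns the position after the second Start().
std::size_t PosAfterRestart()
{
    WordTokenizer tok;
    tok.Start();
    tok.AddChar('a');
    tok.EnterWord();
    tok.Start();               // both the base and the derived state are reset
    assert(!tok.IsInWord());
    return tok.GetPos();
}
```

If the override omitted the call to the base method, the stand-in memory would still hold the characters of the previous text when the next analysis begins.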

virtual bool TreatChar ( GDocAnalyze * analyzer,
const R::RChar & car
) [pure virtual]

This method is called each time the analyzer treats a character.

The method should call the AddToken method of the analyzer to add valid tokens.

Parameters:
analyzer  Analyzer.
car  Character treated.
Returns:
true if the character starts a token.
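
A minimal sketch of how a derived class might implement TreatChar as a two-state machine that extracts words. Everything here is an assumption made for illustration: Analyzer and Tokenizer are simplified stand-ins for GDocAnalyze and GTokenizer, plain char replaces R::RChar, the stand-in Extract copies the half-open range [begin, end), and the Tokenize helper is hypothetical. The real signatures and semantics may differ.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for GDocAnalyze: it only collects tokens.
struct Analyzer
{
    std::vector<std::string> Tokens;
    void AddToken(const std::string& token) { Tokens.push_back(token); }
};

// Hypothetical stand-in for GTokenizer, using char instead of R::RChar.
class Tokenizer
{
    std::string Memory;
public:
    virtual ~Tokenizer() = default;
    void AddChar(char car) { Memory += car; }
    std::size_t GetPos() const { return Memory.size(); }
    // Copies [begin, end) here; the real Extract may treat 'end' differently.
    std::string Extract(std::size_t begin, std::size_t end) const
    {
        return Memory.substr(begin, end - begin);
    }
    virtual void Start() { Memory.clear(); }
    virtual bool TreatChar(Analyzer* analyzer, char car) = 0;
};

// Two-state machine: outside a word, or inside a word.
class WordTokenizer : public Tokenizer
{
    enum class State { Outside, InWord };
    State Cur = State::Outside;
    std::size_t Begin = 0;   // position where the current word started
public:
    void Start() override
    {
        Tokenizer::Start();
        Cur = State::Outside;
        Begin = 0;
    }
    bool TreatChar(Analyzer* analyzer, char car) override
    {
        assert(analyzer != nullptr);
        const bool letter = std::isalpha(static_cast<unsigned char>(car)) != 0;
        bool starts = false;
        if (Cur == State::Outside && letter) {
            Begin = GetPos();          // remember where the word begins
            Cur = State::InWord;
            starts = true;
        } else if (Cur == State::InWord && !letter) {
            analyzer->AddToken(Extract(Begin, GetPos()));  // word is complete
            Cur = State::Outside;
        }
        AddChar(car);                  // save the character in the memory
        return starts;                 // true if the character starts a token
    }
};

// Hypothetical driver: feeds a text character by character.
std::vector<std::string> Tokenize(const std::string& text)
{
    Analyzer analyzer;
    WordTokenizer tok;
    tok.Start();
    for (char c : text)
        tok.TreatChar(&analyzer, c);
    tok.TreatChar(&analyzer, ' ');     // a final separator flushes a trailing word
    return analyzer.Tokens;
}
```

With this stand-in, Tokenize("hi, you") yields the two tokens "hi" and "you". Recognizing abbreviations or e-mails would require more states, but the division of work stays the same: the memory holds the characters seen so far, and the state decides when a span of it becomes a token.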