#include <gdocanalyze.h>
Public Member Functions
| GDocAnalyze (GSession *session) | |
| GDoc * | GetDoc (void) const |
| GSession * | GetSession (void) const |
| const GDescription & | GetDescription (void) const |
| const GConceptTree & | GetTree (void) const |
| size_t | SkipToken (void) |
| GTokenOccur * | AddToken (const R::RString &token, tTokenType type) |
| GTokenOccur * | AddToken (const R::RString &token, tTokenType type, GConcept *concept, double weight, GConcept *metaconcept, size_t pos, size_t depth=0, size_t spos=SIZE_MAX) |
| void | ExtractTextual (const R::RString &text, GConcept *metaconcept, size_t pos, size_t depth=0, size_t spos=SIZE_MAX) |
| void | ExtractDCMI (const R::RString &element, const R::RString &value, size_t pos, size_t depth=0, size_t spos=SIZE_MAX) |
| void | ExtractBody (const R::RString &content, size_t pos, size_t depth=0) |
| void | AssignPlugIns (void) |
| R::RCursor< GToken > | GetTokens (void) const |
| R::RCursor< GTokenOccur > | GetOccurs (void) const |
| void | DeleteToken (GToken *token) |
| void | ReplaceToken (GToken *token, R::RString value) |
| void | MoveToken (GTokenOccur *occur, R::RString value) |
| void | MoveToken (GTokenOccur *occur, GConcept *concept) |
| GLang * | GetLang (void) const |
| void | SetLang (GLang *lang) |
| void | Analyze (GDoc *doc, bool ram=true) |
| virtual | ~GDocAnalyze (void) |
Detailed Description
The GDocAnalyze class analyzes a given document by coordinating the following steps:
- It determines the filter corresponding to the type of the document to analyze.
- It uses the current tokenizer to extract the tokens from the text provided by the filter (a child class of GFilter).
- The tokens are then passed to the analyzers, in the order specified in the configuration, for processing (stemming, filtering, etc.).
In practice, it manages the tokens extracted from the documents by the filter and their occurrences (position, depth and syntactic position). Once the analysis steps are finished, it builds a vector and a concept tree using the tokens with which a concept is associated.
It is assumed that each token of a given type in a document has a unique name and corresponds to one concept only. It is the responsibility of the filter to ensure this.
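The coordination described above (tokenize the text, then run the analyzers in their configured order) can be illustrated with a small stand-alone sketch. Nothing here is GALILEI code: Tokens, Analyzer, Tokenize, Lowercase, DropStopWords and RunPipeline are hypothetical names, and the splitting rule and the stop list are assumptions chosen only to show that the order of the analyzers matters.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <functional>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-ins, not the GALILEI types.
using Tokens = std::vector<std::string>;
using Analyzer = std::function<void(Tokens&)>;

// Assumed tokenizer: split on every non-alphanumeric character.
Tokens Tokenize(const std::string& text)
{
    Tokens tokens;
    std::string current;
    for (unsigned char c : text)
    {
        if (std::isalnum(c))
            current.push_back(static_cast<char>(c));
        else if (!current.empty())
        {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty())
        tokens.push_back(current);
    return tokens;
}

// Example analyzer: normalize every token to lower case.
void Lowercase(Tokens& tokens)
{
    for (std::string& token : tokens)
        for (char& c : token)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

// Example analyzer: remove tokens found in a (made-up) stop list.
void DropStopWords(Tokens& tokens)
{
    static const std::set<std::string> stop = {"a", "of", "the"};
    tokens.erase(std::remove_if(tokens.begin(), tokens.end(),
                                [](const std::string& t) { return stop.count(t) != 0; }),
                 tokens.end());
}

// Tokenize, then apply the analyzers strictly in the order given.
Tokens RunPipeline(const std::string& text, const std::vector<Analyzer>& analyzers)
{
    Tokens tokens = Tokenize(text);
    for (const Analyzer& analyzer : analyzers)
    {
        assert(analyzer && "an analyzer must be callable");
        analyzer(tokens);
    }
    return tokens;
}
```

With the input "The Analysis of the Document", running Lowercase before DropStopWords leaves two tokens, while the reverse order leaves three, because the capitalized "The" is not yet in the form the stop list expects.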
Once a document is analyzed, a 'DocAnalyzed' notification is sent. It can be caught by any class inheriting from R::RObject that registers as an observer with the following call (generally in the constructor):
InsertObserver(HANDLER(TheClass::Handle),"DocAnalyzed");
A handler method must then be created:
void TheClass::Handle(const R::RNotification& notification)
{
    GDocAnalyze* Analyzer(GetData<GDocAnalyze*>(notification));
    cout<<Analyzer->GetLang()->GetCode()<<endl;
}
Constructor & Destructor Documentation
| GDocAnalyze | ( | GSession * | session | ) |
Constructor of the document analysis method.
- Parameters:
-
session Session.
| virtual ~GDocAnalyze | ( | void | ) | [virtual] |
Destructor of the document analyzer.
Member Function Documentation
| GSession* GetSession | ( | void | ) | const |
- Returns:
- the session.
| const GDescription& GetDescription | ( | void | ) | const |
- Returns:
- the description that was just computed.
| const GConceptTree& GetTree | ( | void | ) | const |
- Returns:
- the tree that was just computed (if asked).
| size_t SkipToken | ( | void | ) |
Inform the document analysis process that a potential token is skipped. In practice, it increments the current syntactic position.
Typically, it is called by the current tokenizer to indicate that an existing character sequence is not considered as a valid token.
- Returns:
- the syntactic position skipped.
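How skipped and accepted tokens share a single syntactic position counter can be sketched as follows. This is a minimal model, not the GALILEI implementation: SyntacticCounter and its members are hypothetical, and both the starting position 0 and the rejection of empty tokens are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in that only models the evolution of the
// current syntactic position.
struct SyntacticCounter
{
    std::size_t current = 0;  // next free syntactic position (assumed to start at 0)
    std::vector<std::pair<std::string, std::size_t>> kept;  // accepted tokens and their positions

    // A character sequence is rejected: consume its position and report it.
    std::size_t SkipToken()
    {
        return current++;
    }

    // A character sequence is accepted: store it at the current position,
    // then advance the position by one.
    std::size_t AddToken(const std::string& token)
    {
        assert(!token.empty() && "assumption: a token is never empty");
        kept.emplace_back(token, current);
        return current++;
    }
};
```

In this model a skipped sequence leaves a gap in the positions of the kept tokens, so two accepted tokens separated by a rejected sequence are not treated as adjacent.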
| GTokenOccur* AddToken | ( | const R::RString & | token, | tTokenType | type | ) |
Add a token to the current vector.
The current syntactic position is incremented by one.
- Warning:
- This method should only be called by child classes of GTokenizer.
- Parameters:
-
token Token to add.
type Token type.
- Returns:
- the occurrence of the token added.
| GTokenOccur* AddToken | ( | const R::RString & | token, | tTokenType | type, | GConcept * | concept, | double | weight, | GConcept * | metaconcept, | size_t | pos, | size_t | depth = 0, | size_t | spos = SIZE_MAX | ) |
Add a token of a given type and representing a concept. It is added to a vector associated with a given meta-concept.
The current syntactic position is incremented by one.
- Parameters:
-
token Token to add. The name must be unique for a given document whatever its type.
type Token type.
concept Concept to add.
weight Weight associated with the concept.
metaconcept Meta-concept of the vector associated with the concept.
pos Position of the concept.
depth Depth of the concept.
spos Syntactic position. If SIZE_MAX, the token is assumed to immediately follow the previous one.
- Returns:
- the occurrence of the token added.
| void ExtractTextual | ( | const R::RString & | text, | GConcept * | metaconcept, | size_t | pos, | size_t | depth = 0, | size_t | spos = SIZE_MAX | ) |
Extract some tokens from a given text and add them to the vector associated with a given meta-concept. If the vector already exists, the new content is added to it.
The current syntactic position is incremented by the number of tokens extracted.
- Parameters:
-
text Text to add.
metaconcept Meta-concept of the vector associated with the text.
pos Position of the text.
depth Depth of the text.
spos Syntactic position. If SIZE_MAX, the first token is assumed to immediately follow the previous one.
| void ExtractDCMI | ( | const R::RString & | element, | const R::RString & | value, | size_t | pos, | size_t | depth = 0, | size_t | spos = SIZE_MAX | ) |
Extract some tokens from a given text and add them to the vector associated with a given metadata element defined by the Dublin Core. In practice, each metadata element corresponds to one vector. If several contents are associated with the same element, they are simply added to that vector.
The only allowed elements are: contributor, coverage, creator, date, description, format, identifier, language, publisher, relation, rights, source, subject, title, type.
The current syntactic position is incremented by the number of tokens extracted.
- Parameters:
-
element Element of the DCMI (without namespace and/or prefix).
value Value of the metadata.
pos Position of the metadata.
depth Depth of the metadata.
spos Syntactic position. If SIZE_MAX, the first token is assumed to immediately follow the previous one.
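A caller may want to check an element name against the fifteen allowed values before passing it to ExtractDCMI. The helper below is a hypothetical sketch and not part of GALILEI: IsAllowedDcmiElement and kDcmiElements are invented names, and the exact, case-sensitive comparison is an assumption. How the method itself reacts to an unknown element is not stated in this documentation.

```cpp
#include <algorithm>
#include <iterator>
#include <string>

// The fifteen element names listed in the description of ExtractDCMI.
static const char* const kDcmiElements[] = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type"};

// Hypothetical helper: true if `element` is one of the allowed names.
// The name is expected without namespace or prefix, so "dc:title" is
// rejected; the comparison is case-sensitive by assumption.
bool IsAllowedDcmiElement(const std::string& element)
{
    return std::any_of(std::begin(kDcmiElements), std::end(kDcmiElements),
                       [&element](const char* name) { return element == name; });
}
```

A filter that reads prefixed names such as "dc:title" would have to strip the prefix before calling such a check.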
| void ExtractBody | ( | const R::RString & | content, | size_t | pos, | size_t | depth = 0 | ) |
Extract some tokens from a given text, and add them to the 'body' meta-concept. Each time the method is called, the content is added to the vector corresponding to the 'body' meta-concept.
The current syntactic position is incremented by the number of tokens extracted.
- Parameters:
-
content Content.
pos Position of the content.
depth Depth of the content.
| void AssignPlugIns | ( | void | ) |
Assign the plug-ins. An exception is generated if no plug-ins are defined.
| R::RCursor<GToken> GetTokens | ( | void | ) | const |
Get a cursor over the tokens extracted. The order of the container reflects the order of the first occurrence of each token.
- Returns:
- a cursor.
| R::RCursor<GTokenOccur> GetOccurs | ( | void | ) | const |
Get a cursor over the occurrences of the different tokens extracted as they appear in the document.
- Returns:
- a cursor.
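The difference between the two cursors (distinct tokens in order of first occurrence versus every occurrence in document order) can be illustrated with plain strings. This is a sketch with a hypothetical function name, DistinctInFirstOccurrenceOrder, and it says nothing about how GALILEI stores its containers.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Given the occurrences in document order (the view GetOccurs describes),
// derive the distinct tokens ordered by their first occurrence (the view
// GetTokens describes).
std::vector<std::string> DistinctInFirstOccurrenceOrder(
    const std::vector<std::string>& occurrences)
{
    std::vector<std::string> distinct;
    std::unordered_set<std::string> seen;
    for (const std::string& occurrence : occurrences)
    {
        // insert() reports whether the value was new to the set.
        if (seen.insert(occurrence).second)
            distinct.push_back(occurrence);
    }
    return distinct;
}
```

For the occurrence sequence "to be or not to be", the distinct view contains four entries while the occurrence view contains six.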
| void DeleteToken | ( | GToken * | token | ) |
Delete a given token. In practice, it modifies its type to ttDeleted.
- Warning:
- This method may modify the cursor over the tokens.
- Parameters:
-
token Token to delete.
| void ReplaceToken | ( | GToken * | token, | R::RString | value | ) |
Replace a given token by a given value (for example a word by its stem). If the new value corresponds to an existing token, the occurrences are merged and the type of the current token is set to ttDeleted.
- Warning:
- This method may modify the cursor over the tokens.
- Parameters:
-
token Token to replace.
value New value.
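The replace-or-merge behaviour described for ReplaceToken can be modelled with a small table. This is a sketch under stated assumptions, not the GALILEI implementation: TokenTable, Token and TokenState are hypothetical, tokens are identified by an index, occurrences are reduced to syntactic positions, and keeping merged positions sorted is a choice made here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins, not the GALILEI types.
enum class TokenState { Active, Deleted };  // Deleted plays the role of ttDeleted

struct Token
{
    std::string value;
    std::vector<std::size_t> occurrences;  // syntactic positions
    TokenState state = TokenState::Active;
};

class TokenTable
{
public:
    // Record one occurrence of `value`; returns the identifier of its token.
    std::size_t Add(const std::string& value, std::size_t pos)
    {
        auto it = index_.find(value);
        if (it == index_.end())
        {
            Token token;
            token.value = value;
            tokens_.push_back(token);
            it = index_.emplace(value, tokens_.size() - 1).first;
        }
        tokens_[it->second].occurrences.push_back(pos);
        return it->second;
    }

    // Rename token `id`; if `value` already names another token, merge the
    // occurrences into that token and flag `id` as deleted.
    void Replace(std::size_t id, const std::string& value)
    {
        assert(id < tokens_.size());
        Token& token = tokens_[id];
        if (token.value == value)
            return;
        index_.erase(token.value);
        auto it = index_.find(value);
        if (it == index_.end())
        {
            token.value = value;
            index_[value] = id;
            return;
        }
        Token& target = tokens_[it->second];
        target.occurrences.insert(target.occurrences.end(),
                                  token.occurrences.begin(), token.occurrences.end());
        std::sort(target.occurrences.begin(), target.occurrences.end());
        token.occurrences.clear();
        token.state = TokenState::Deleted;
    }

    const Token& Get(std::size_t id) const
    {
        assert(id < tokens_.size());
        return tokens_[id];
    }

private:
    std::vector<Token> tokens_;
    std::map<std::string, std::size_t> index_;  // value -> identifier of the active token
};
```

The deleted token stays in the container with a changed state instead of being erased, which mirrors the documented use of ttDeleted and keeps existing identifiers valid in this model.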
| void MoveToken | ( | GTokenOccur * | occur, | R::RString | value | ) |
Move a token occurrence associated with a particular token to another token given by a value. If the new value corresponds to an existing token, the occurrence is added to it. If the current token has no more occurrences, its type is set to ttDeleted.
- Warning:
- This method may modify the cursor over the tokens.
- Parameters:
-
occur Token occurrence to change.
value New value.
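Moving a single occurrence, as opposed to replacing a whole token, can be sketched with a map from token value to positions. The names Occurrences and MoveOccurrence are hypothetical, and creating the destination token when it does not exist yet is an assumption of this model, not something the description above states.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical model: token value -> syntactic positions of its occurrences.
using Occurrences = std::map<std::string, std::vector<std::size_t>>;

// Move the occurrence at `pos` from token `from` to token `to`.
// Returns true when `from` is left without occurrences, which is the case
// where the real token would be flagged ttDeleted. Returns false when the
// source still has occurrences or when the occurrence was not found.
bool MoveOccurrence(Occurrences& tokens, const std::string& from,
                    std::size_t pos, const std::string& to)
{
    auto source = tokens.find(from);
    if (source == tokens.end())
        return false;
    std::vector<std::size_t>& positions = source->second;
    auto hit = std::find(positions.begin(), positions.end(), pos);
    if (hit == positions.end())
        return false;
    positions.erase(hit);
    // operator[] creates `to` if needed; std::map insertion keeps
    // `positions` valid.
    tokens[to].push_back(pos);
    return positions.empty();
}
```

Only the selected occurrence changes owner, so a token that appears several times keeps its remaining occurrences and stays active.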
| void MoveToken | ( | GTokenOccur * | occur, | GConcept * | concept | ) |
Move a token occurrence associated with a particular token to another existing concept. If necessary, a new token is created. If the current token has no more occurrences, its type is set to ttDeleted.
- Warning:
- This method may modify the cursor over the tokens.
- Parameters:
-
occur Token occurrence to change.
concept Concept.