#include <gdocanalyze.h>
Public Member Functions | |
GDocAnalyze (GSession *session) | |
GDoc * | GetDoc (void) const |
void * | GetData (void) const |
void | SetData (void *data) |
GSession * | GetSession (void) const |
const GDescription & | GetDescription (void) const |
size_t | SkipToken (void) |
GVector * | GetCurrentVector (void) const |
void | SetCurrentVector (GVector *vector) |
GTokenOccur * | AddToken (const R::RString &token, tTokenType type=ttUnknown, double weight=0.0) |
GTokenOccur * | AddToken (const R::RString &token, GConcept *metaconcept, tTokenType type=ttUnknown, double weight=0.0) |
GTokenOccur * | AddToken (const R::RString &token, tTokenType type, GConcept *concept, double weight, GConcept *metaconcept, size_t pos, size_t depth=0, size_t spos=SIZE_MAX) |
GTokenOccur * | AddDefaultNamedEntityToken (const R::RString &token, double weight, size_t pos, size_t depth=0, size_t spos=SIZE_MAX) |
void | ExtractText (const R::RString &text, GConcept *metaconcept, size_t pos, size_t depth=0, size_t spos=SIZE_MAX) |
void | ExtractText (const R::RString &text, tTokenType type, double weight, GConcept *metaconcept, size_t pos, size_t depth=0, size_t spos=SIZE_MAX) |
void | ExtractDCMI (const R::RString &element, const R::RString &value, size_t pos, size_t depth=0, size_t spos=SIZE_MAX) |
void | ExtractDefaultText (const R::RString &content, size_t pos, size_t depth=0, size_t spos=SIZE_MAX) |
void | ExtractDefaultText (const R::RString &content, tTokenType type, double weight, size_t pos, size_t depth=0, size_t spos=SIZE_MAX) |
void | ExtractDefaultURI (const R::RString &uri, size_t pos, size_t depth=0, size_t spos=SIZE_MAX) |
void | AssignPlugIns (void) |
R::RCursor< GToken > | GetTokens (void) const |
R::RCursor< GTokenOccur > | GetOccurs (void) const |
void | DeleteToken (GToken *token) |
void | ReplaceToken (GToken *token, R::RString value) |
void | MoveToken (GTokenOccur *occur, R::RString value) |
void | MoveToken (GTokenOccur *occur, GConcept *concept) |
GLang * | GetLang (void) const |
void | SetLang (GLang *lang) |
void | Analyze (GDoc *doc, bool force, bool download) |
virtual | ~GDocAnalyze (void) |
Private Member Functions | |
GToken * | CreateToken (const R::RString &token, tTokenType type) |
void | BuildTensor (void) |
void | BuildRecords (GTokenOccur *occur) |
void | Print (GTokenOccur *occur) |
Private Member Functions inherited from RDownloadFile | |
RDownloadFile (void) | |
void | Download (const RURI &uri, const R::RURI &local) |
Private Member Functions inherited from RDownload | |
RDownload (void) | |
void | Download (const RURI &uri) |
RString | GetMIMEType (void) |
virtual | ~RDownload (void) |
Private Attributes | |
GDoc * | Doc |
void * | Data |
GSession * | Session |
GDescription | Description |
R::RContainer< GConceptRecord, false, true > | Records |
size_t | NbRecords |
GLang * | Lang |
GTokenizer * | Tokenizer |
R::RCastCursor< GPlugIn, GAnalyzer > | Analyzers |
R::RContainer< GToken, true, false > | MemoryTokens |
size_t | NbMemoryTokensUsed |
R::RContainer< GTokenOccur, true, false > | MemoryOccurs |
size_t | NbMemoryOccursUsed |
R::RHashContainer< GToken, false > | OrderTokens |
R::RContainer< GToken, false, false > | Tokens |
R::RContainer< GTokenOccur, false, false > | Occurs |
R::RContainer< GTokenOccur, false, false > | Top |
R::RStack< GTokenOccur, false, true, true > | Depths |
GVector * | CurVector |
size_t | CurPos |
size_t | CurDepth |
bool | DepthError |
size_t | CurSyntacticPos |
tTokenType | CurTokenType |
double | CurTokenWeight |
R::RContainer < R::RNumContainer< size_t, false >, true, false > | SyntacticPos |
size_t | NbTopRecords |
size_t | NbRefs |
Detailed Description
The GDocAnalyze class analyzes a given document by coordinating the following steps:
- It determines the filter corresponding to the type of the document to analyze.
- It uses the current tokenizer to extract the tokens from the text provided by the filter (child classes of GFilter).
- The tokens are then passed to the analyzers in the order specified in the configuration to be treated (stemming, filtering, etc.).
In practice, it manages the tokens extracted from the document by the filter, together with their occurrences (position, depth and syntactic position). Once the analysis steps are finished, it builds a vector and a concept tree from the tokens that have an associated concept.
It is assumed that each token of a given type in a document has a unique name and corresponds to one concept only. It is the responsibility of the filter to ensure this.
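The three steps above can be pictured with a small standalone sketch. It does not use any GALILEI type: `Tokens`, `Analyzer`, `Tokenize`, `RunPipeline` and `DropShortTokens` are hypothetical names chosen for illustration, and the tokenizer is deliberately naive (ASCII only).

```cpp
#include <cctype>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-ins; none of these are GALILEI types.
using Tokens = std::vector<std::string>;
using Analyzer = std::function<void(Tokens&)>;

// Step 2: a naive tokenizer that splits on non-alphanumeric characters
// and lowercases what it keeps.
Tokens Tokenize(const std::string& text) {
    Tokens tokens;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            current += static_cast<char>(std::tolower(c));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}

// Step 3: pass the tokens to the analyzers in their configured order.
Tokens RunPipeline(const std::string& filteredText,
                   const std::vector<Analyzer>& analyzers) {
    Tokens tokens = Tokenize(filteredText);
    for (const Analyzer& analyzer : analyzers) analyzer(tokens);
    return tokens;
}

// Example analyzer: drop tokens shorter than three characters.
void DropShortTokens(Tokens& tokens) {
    Tokens kept;
    for (const std::string& t : tokens)
        if (t.size() >= 3) kept.push_back(t);
    tokens.swap(kept);
}
```

In this sketch the text passed to `RunPipeline` plays the role of what a filter would provide (step 1), and the order of the `analyzers` vector stands in for the order given in the configuration.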
Constructor & Destructor Documentation
GDocAnalyze | ( | GSession * | session | ) |
Constructor of the document analysis method.
- Parameters
-
session Session.
virtual ~GDocAnalyze | ( | void | ) |
virtual |
Destroy the document analyzer.
Member Function Documentation
GDoc* GetDoc | ( | void | ) | const |
Get the document currently analyzed.
- Returns
- pointer to the document.
void* GetData | ( | void | ) | const |
Get the data assigned to the analyzer.
It is the responsibility of the caller of this function to cast the pointer correctly.
- Returns
- a raw pointer.
void SetData | ( | void * | data | ) |
Assign some data to the analyzer.
- Parameters
-
data Raw pointer to the data.
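A minimal sketch of the untyped-pointer pattern behind `SetData` and `GetData`, assuming the caller owns the pointed-to object and knows its type. `FilterState`, `DataHolder` and `CountParagraph` are hypothetical and only mirror the two accessors documented here; they are not part of the library.

```cpp
#include <cassert>
#include <string>

// Hypothetical per-analysis state that a caller might attach.
struct FilterState {
    std::string currentTag;
    int nbParagraphs = 0;
};

// Minimal stand-in for the SetData/GetData pair; not the real GDocAnalyze.
class DataHolder {
    void* Data = nullptr;
public:
    void SetData(void* data) { Data = data; }
    void* GetData() const { return Data; }
};

// The caller stored a FilterState, so the caller casts back to FilterState.
int CountParagraph(DataHolder& holder) {
    auto* state = static_cast<FilterState*>(holder.GetData());
    assert(state != nullptr);  // nothing attached: a programming error in this sketch
    return ++state->nbParagraphs;
}
```

Casting to any type other than the one originally stored is undefined behavior, which is why the responsibility lies with the caller.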
GSession* GetSession | ( | void | ) | const |
- Returns
- the session.
const GDescription& GetDescription | ( | void | ) | const |
- Returns
- the description that was just computed.
size_t SkipToken | ( | void | ) |
Inform the document analysis process that a potential token is skipped. In practice, it increments the current syntactic position.
Typically, it is called by the current tokenizer to indicate that an existing character sequence is not considered as a valid token.
- Returns
- the syntactic position skipped.
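The effect of skipping on syntactic positions can be sketched with a toy counter. This is an illustration under stated assumptions, not the library's code: it assumes positions start at 0 and that the returned value is the position before the increment. `PositionTracker` is a hypothetical name.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy position tracker; assumes positions start at 0 and that a skipped
// character sequence still consumes one syntactic position.
class PositionTracker {
    std::size_t CurSyntacticPos = 0;
    std::vector<std::size_t> TokenPositions;
public:
    // Record a valid token at the current position, then advance.
    std::size_t AddToken(const std::string&) {
        TokenPositions.push_back(CurSyntacticPos);
        return CurSyntacticPos++;
    }
    // Advance without recording anything; return the position that was skipped.
    std::size_t SkipToken() { return CurSyntacticPos++; }
    const std::vector<std::size_t>& Positions() const { return TokenPositions; }
};
```

With this model, a token that follows a skipped sequence keeps a gap in its position, so the distance between two tokens still reflects what stood between them in the text.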
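GToken* CreateToken | ( | const R::RString & | token, |
tTokenType | type | ||
) |
private |
Create a token with a given name and a given type. In practice, if a free token exists, it is reused.
- Parameters
-
token Name of the token.
type Type of the token.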
- Returns
- a pointer to the created token.
GVector* GetCurrentVector | ( | void | ) | const |
Get the current vector.
- Returns
- a pointer to the current vector.
void SetCurrentVector | ( | GVector * | vector | ) |
Set the current vector.
Be careful with this method.
- Parameters
-
vector Pointer to the vector.
GTokenOccur* AddToken | ( | const R::RString & | token, |
tTokenType | type = ttUnknown, | ||
double | weight = 0.0 | ||
) |
Add a token to the current vector.
The current syntactic position is incremented by one.
- Warning
- This method should only be called by child classes of GTokenizer.
- Parameters
-
token Token to add.
type Token type. If ttUnknown, the current token type is used.
weight Weight associated with the concept. If zero, the current weight is used.
- Returns
- the occurrence of the token added.
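The fallback to the current type and weight can be sketched as follows. This is a toy model, not the library's implementation: `Occurrence` and `CurrentState` are hypothetical, and the enumerators `ttText` and `ttLink` are placeholders (only `ttUnknown` is taken from the text above).

```cpp
#include <string>

// Hypothetical subset of token types; only ttUnknown comes from the reference.
enum tTokenType { ttUnknown, ttText, ttLink };

struct Occurrence {
    std::string Token;
    tTokenType Type;
    double Weight;
};

// Toy model of the "current" state consulted when arguments keep their defaults.
struct CurrentState {
    tTokenType CurTokenType = ttText;
    double CurTokenWeight = 1.0;

    Occurrence AddToken(const std::string& token,
                        tTokenType type = ttUnknown,
                        double weight = 0.0) const {
        if (type == ttUnknown) type = CurTokenType;  // fall back to the current type
        if (weight == 0.0) weight = CurTokenWeight;  // fall back to the current weight
        return Occurrence{token, type, weight};
    }
};
```

One consequence of this convention is that a weight of exactly zero cannot be requested explicitly, since zero is the sentinel for "use the current weight".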
GTokenOccur* AddToken | ( | const R::RString & | token, |
GConcept * | metaconcept, | ||
tTokenType | type = ttUnknown, | ||
double | weight = 0.0 | ||
) |
Add a token to a given vector.
The current syntactic position is incremented by one.
- Warning
- This method should only be called by child classes of GTokenizer.
- Parameters
-
token Token to add.
metaconcept Meta-concept of the vector associated with the concept.
type Token type. If ttUnknown, the current token type is used.
weight Weight associated with the concept. If zero, the current weight is used.
- Returns
- the occurrence of the token added.
GTokenOccur* AddToken | ( | const R::RString & | token, |
tTokenType | type, | ||
GConcept * | concept, | ||
double | weight, | ||
GConcept * | metaconcept, | ||
size_t | pos, | ||
size_t | depth = 0, | ||
size_t | spos = SIZE_MAX | ||
) |
Add a token of a given type and representing a concept. It is added to a vector associated with a given meta-concept.
The current syntactic position is incremented by one.
- Parameters
-
token Token to add. The name must be unique for a given document whatever its type.
type Token type.
concept Concept to add.
weight Weight associated with the concept.
metaconcept Meta-concept of the vector associated with the concept.
pos Position of the concept.
depth Depth of the concept.
spos Syntactic position. If SIZE_MAX, the token is assumed to directly follow the previous one.
- Returns
- the occurrence of the token added.
GTokenOccur* AddDefaultNamedEntityToken | ( | const R::RString & | token, |
double | weight, | ||
size_t | pos, | ||
size_t | depth = 0, | ||
size_t | spos = SIZE_MAX | ||
) |
Create a named-entity token and add it to the vector whose meta-concept corresponds to named entities. The method verifies that each part starts with an uppercase character and that the parts are separated by exactly one space.
- Parameters
-
token Token to add. The name must be unique for a given document whatever its type.
weight Weight associated with the concept.
pos Position of the concept.
depth Depth of the concept.
spos Syntactic position. If SIZE_MAX, the token is assumed to directly follow the previous one.
- Returns
- the occurrence of the token added.
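The format check described above (each part starts with an uppercase character, parts separated by a single space) could look like the following sketch. It is an illustration only: `LooksLikeNamedEntity` is a hypothetical helper, it works on ASCII `std::string` rather than `R::RString`, and how the library reacts to a failed check is not covered here.

```cpp
#include <cctype>
#include <string>

// Returns true if every space-separated part starts with an uppercase letter
// and parts are separated by exactly one space (no leading, trailing or
// doubled spaces). ASCII only; an assumption made for this sketch.
bool LooksLikeNamedEntity(const std::string& token) {
    if (token.empty()) return false;
    bool startOfPart = true;
    for (unsigned char c : token) {
        if (c == ' ') {
            if (startOfPart) return false;  // leading space or two spaces in a row
            startOfPart = true;
        } else {
            if (startOfPart && !std::isupper(c)) return false;
            startOfPart = false;
        }
    }
    return !startOfPart;  // reject a trailing space
}
```

For example, "New York" passes this check, while "New  York" (two spaces) and "New york" do not.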
void ExtractText | ( | const R::RString & | text, |
GConcept * | metaconcept, | ||
size_t | pos, | ||
size_t | depth = 0, | ||
size_t | spos = SIZE_MAX | ||
) |
Extract some tokens from a given text, and add them to a vector associated with a given meta-concept. If the vector already exists, the content is added to it.
The current syntactic position is incremented by the number of tokens extracted.
- Parameters
-
text Text to add.
metaconcept Meta-concept of the vector associated with the text.
pos Position of the text.
depth Depth of the text.
spos Syntactic position. If SIZE_MAX, the first token is assumed to directly follow the previous one.
void ExtractText | ( | const R::RString & | text, |
tTokenType | type, | ||
double | weight, | ||
GConcept * | metaconcept, | ||
size_t | pos, | ||
size_t | depth = 0, | ||
size_t | spos = SIZE_MAX | ||
) |
Extract some tokens from a given text, and add them to a vector associated with a given meta-concept. If the vector already exists, the content is added to it.
The current syntactic position is incremented by the number of tokens extracted.
- Parameters
-
text Text to add.
type Token type.
weight Weight associated with the concept.
metaconcept Meta-concept of the vector associated with the text.
pos Position of the text.
depth Depth of the text.
spos Syntactic position. If SIZE_MAX, the first token is assumed to directly follow the previous one.
void ExtractDCMI | ( | const R::RString & | element, |
const R::RString & | value, | ||
size_t | pos, | ||
size_t | depth = 0, | ||
size_t | spos = SIZE_MAX | ||
) |
Extract some tokens from a given text, and add them to a vector associated with a given metadata element defined by the Dublin Core. In practice, each metadata element has one vector; several values associated with the same element are simply added to that vector.
The only allowed elements are: contributor, coverage, creator, date, description, format, identifier, language, publisher, relation, rights, source, subject, title, type.
The current syntactic position is incremented by the number of tokens extracted.
- Parameters
-
element Element of the DCMI (without namespace and/or prefix).
value Value of the metadata.
pos Position of the metadata.
depth Depth of the metadata.
spos Syntactic position. If SIZE_MAX, the first token is assumed to directly follow the previous one.
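A caller may want to check an element name against the list above before calling the method. The sketch below is one way to do that under stated assumptions: `IsAllowedDCMIElement` and `StripPrefix` are hypothetical helpers, the comparison is assumed to be case-sensitive, and nothing here says how the library itself treats a name outside the list.

```cpp
#include <set>
#include <string>

// The fifteen element names listed above, without namespace or prefix.
// Case-sensitive matching is an assumption of this sketch.
bool IsAllowedDCMIElement(const std::string& element) {
    static const std::set<std::string> allowed = {
        "contributor", "coverage", "creator", "date", "description",
        "format", "identifier", "language", "publisher", "relation",
        "rights", "source", "subject", "title", "type"};
    return allowed.count(element) != 0;
}

// Hypothetical helper: strip a prefix such as "dc:" before the lookup.
std::string StripPrefix(const std::string& qualifiedName) {
    const std::string::size_type colon = qualifiedName.rfind(':');
    return colon == std::string::npos ? qualifiedName
                                      : qualifiedName.substr(colon + 1);
}
```

A filter reading `dc:title` from a document would pass `StripPrefix("dc:title")`, that is `title`, as the element name.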
void ExtractDefaultText | ( | const R::RString & | content, |
size_t | pos, | ||
size_t | depth = 0, | ||
size_t | spos = SIZE_MAX | ||
) |
Extract some tokens from a given text, and add them to the '*' (neutral) meta-concept of the type 'text block'. Each time the method is called, the content is added to the vector corresponding to the '*' meta-concept. The tokens are assumed to be text.
The current syntactic position is incremented by the number of tokens extracted.
- Parameters
-
content Content.
pos Position of the content.
depth Depth of the content.
spos Syntactic position. If SIZE_MAX, the first token is assumed to directly follow the previous one.
void ExtractDefaultText | ( | const R::RString & | content, |
tTokenType | type, | ||
double | weight, | ||
size_t | pos, | ||
size_t | depth = 0, | ||
size_t | spos = SIZE_MAX | ||
) |
Extract some tokens from a given text, and add them to the '*' (neutral) meta-concept of the type 'text block'. Each time the method is called, the content is added to the vector corresponding to the '*' meta-concept. The tokens are assumed to be text.
The current syntactic position is incremented by the number of tokens extracted.
- Parameters
-
content Content.
type Token type.
weight Weight associated with the concept.
pos Position of the content.
depth Depth of the content.
spos Syntactic position. If SIZE_MAX, the first token is assumed to directly follow the previous one.
void ExtractDefaultURI | ( | const R::RString & | uri, |
size_t | pos, | ||
size_t | depth = 0, | ||
size_t | spos = SIZE_MAX | ||
) |
Extract a token that represents a URI, and add it to the '*' (neutral) meta-concept of the type 'URI'. Each time the method is called, the content is added to the vector corresponding to the '*' meta-concept.
The current syntactic position is incremented by one.
- Parameters
-
uri URI.
pos Position of the URI.
depth Depth of the URI.
spos Syntactic position. If SIZE_MAX, the URI is assumed to directly follow the previous token.
void AssignPlugIns | ( | void | ) |
Assign the plug-ins. An exception is generated if no plug-ins are defined.
R::RCursor<GToken> GetTokens | ( | void | ) | const |
Get a cursor over the tokens extracted. The order of the container reflects the order of the first occurrence of each token.
- Returns
- a cursor.
R::RCursor<GTokenOccur> GetOccurs | ( | void | ) | const |
Get a cursor over the occurrences of the different tokens extracted as they appear in the document.
- Returns
- a cursor.
void DeleteToken | ( | GToken * | token | ) |
Delete a given token. In practice, it modifies its type to ttDeleted.
- Warning
- This method may modify the cursor over the tokens.
- Parameters
-
token Token to delete.
void ReplaceToken | ( | GToken * | token, |
R::RString | value | ||
) |
Replace a given token by a given value (for example a word by its stem). If the new value corresponds to an existing token, the occurrences are merged and the type of the current token is set to ttDeleted.
- Warning
- This method may modify the cursor over the tokens.
- Parameters
-
token Token to replace.
value New value.
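The merge behavior described for `ReplaceToken` can be modeled with a small standalone table. This is a sketch of the idea only, not the library's data structures: `Token`, `TokenTable` and the enumerator `ttText` are hypothetical, and tokens are keyed by name in a `std::map` for simplicity.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

enum tTokenType { ttText, ttDeleted };  // ttText is a hypothetical placeholder

struct Token {
    std::string Name;
    tTokenType Type = ttText;
    std::vector<std::size_t> Occurs;  // syntactic positions of the occurrences
};

class TokenTable {
    std::map<std::string, Token> Tokens;
public:
    Token& Add(const std::string& name, std::size_t pos) {
        Token& t = Tokens[name];
        t.Name = name;
        t.Occurs.push_back(pos);
        return t;
    }
    Token* Find(const std::string& name) {
        auto it = Tokens.find(name);
        return it == Tokens.end() ? nullptr : &it->second;
    }
    // Replace `name` by `value`. If `value` already exists, merge the
    // occurrences into it and mark the old token as deleted; else rename.
    void Replace(const std::string& name, const std::string& value) {
        Token* old = Find(name);
        if (!old || name == value) return;
        if (Token* target = Find(value)) {
            target->Occurs.insert(target->Occurs.end(),
                                  old->Occurs.begin(), old->Occurs.end());
            old->Occurs.clear();
            old->Type = ttDeleted;
        } else {
            Token renamed = *old;
            renamed.Name = value;
            Tokens.erase(name);
            Tokens[value] = renamed;
        }
    }
};
```

Marking the old token as deleted instead of erasing it keeps existing pointers to it valid, which is one plausible reason for the ttDeleted convention; the warning about the cursor still applies to the real class.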
void MoveToken | ( | GTokenOccur * | occur, |
R::RString | value | ||
) |
Move a token occurrence from its current token to another one identified by a value. If the new value corresponds to an existing token, the occurrence is added to it. If the current token has no more occurrences, its type is set to ttDeleted.
- Warning
- This method may modify the cursor over the tokens.
- Parameters
-
occur Token occurrence to change.
value New value.
void MoveToken | ( | GTokenOccur * | occur, |
GConcept * | concept | ||
) |
Move a token occurrence from its current token to another existing concept. If necessary, a new token is created. If the current token has no more occurrences, its type is set to ttDeleted.
- Warning
- This method may modify the cursor over the tokens.
- Parameters
-
occur Token occurrence to change.
concept Concept.
GLang* GetLang | ( | void | ) | const |
Get the language currently determined.
void SetLang | ( | GLang * | lang | ) |
Set the language for the document currently analyzed.
- Parameters
-
lang Pointer to the language.
void BuildTensor | ( | void | ) |
private |
Create the descriptions.
void BuildRecords | ( | GTokenOccur * | occur | ) |
private |
Build the records starting with a given token occurrence.
- Parameters
-
parent Parent node.
occur Token occurrence.
void Print | ( | GTokenOccur * | occur | ) |
private |
Print the concept tree starting with a given token occurrence.
- Parameters
-
occur Token occurrence.
void Analyze | ( | GDoc * | doc, |
bool | force, | ||
bool | download | ||
) |
Analyze a document.
- Parameters
-
doc Pointer to the document to analyze.
force Force the analysis of the document?
download Try to download the document locally?
Member Data Documentation
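GDoc* Doc |
private |
Document currently analyzed.
void* Data |
private |
Some data that can be assigned to the analyzer.
GSession* Session |
private |
Corresponding session.
GDescription Description |
private |
Description to build during the analysis.
R::RContainer<GConceptRecord, false, true> Records |
private |
Records to build during the analysis.
size_t NbRecords |
private |
Number of records actually used to represent the document.
GLang* Lang |
private |
Language associated with the document.
GTokenizer* Tokenizer |
private |
The tokenizer.
R::RCastCursor<GPlugIn, GAnalyzer> Analyzers |
private |
The analyzers.
R::RContainer<GToken, true, false> MemoryTokens |
private |
Memory of tokens.
size_t NbMemoryTokensUsed |
private |
Number of tokens from the memory used.
R::RContainer<GTokenOccur, true, false> MemoryOccurs |
private |
Memory of occurrences.
size_t NbMemoryOccursUsed |
private |
Number of occurrences from the memory used.
R::RHashContainer<GToken, false> OrderTokens |
private |
Ordered list of the tokens currently added.
R::RContainer<GToken, false, false> Tokens |
private |
List of tokens currently added.
R::RContainer<GTokenOccur, false, false> Occurs |
private |
The occurrences of the tokens.
R::RContainer<GTokenOccur, false, false> Top |
private |
Top occurrences.
R::RStack<GTokenOccur, false, true, true> Depths |
private |
A stack representing the "active" tokens at each depth.
GVector* CurVector |
private |
Vector for which new concepts should be added.
size_t CurPos |
private |
Current position.
size_t CurDepth |
private |
Current depth.
bool DepthError |
private |
Is there a depth error in the current situation?
size_t CurSyntacticPos |
private |
Current syntactic position.
tTokenType CurTokenType |
private |
Current token type.
double CurTokenWeight |
private |
Current token weight.
R::RContainer<R::RNumContainer<size_t, false>, true, false> SyntacticPos |
private |
size_t NbTopRecords |
private |
Number of top records.
size_t NbRefs |
private |
Number of valid concepts referenced.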