TYPO3 documentation by Ameos

tx_indexedsearch_indexer Class Reference

Public Member Functions

 hook_indexContent (&$pObj)
 init ()
 initExternalReaders ()
 indexTypo3PageContent ()
 splitHTMLContent ($content)
 bodyDescription ($contentArr)
 extractLinks ($content)
 getJumpurl ($query)
 splitPdfInfo ($pdfInfoArray)
 indexRegularDocument ($file)
 readFileContent ($ext, $absFile, $cPKey)
 fileContentParts ($ext, $absFile)
 embracingTags ($string, $tagName, &$tagContent, &$stringAfter, &$paramList)
 indexAnalyze ($content)
 analyzeHeaderinfo (&$retArr, $content, $key, $offset)
 analyzeBody (&$retArr, $content)
 typoSearchTags (&$body)
 split2words (&$string)
 wordOK ($w)
 metaphone ($word)
 strtolower_all ($str)
 freqMap ($freq)
 getRootLineFields (&$fieldArr)
 removeIndexedPhashRow ($phashList, $clearPageCache=1)
 checkMtimeTstamp ($mtime, $maxAge, $minAge, $phash)
 update_grlist ($phash, $phash_x)
 is_grlist_set ($phash_x)
 checkContentHash ()
 removeLoginpagesWithContentHash ()
 removeOldIndexedPages ($phash)
 checkExternalDocContentHash ($hashGr, $content_md5h)
 updateTstamp ($phash, $mtime=0)
 updateParsetime ($phash, $parsetime)
 updateRootline ()
 submitPage ()
 submit_grlist ($hash, $phash_x)
 submit_section ($hash, $hash_t3)
 submitFilePage ($hash, $file, $subinfo, $ext, $mtime, $ctime, $size, $content_md5h, $contentParts)
 submitFile_grlist ($hash)
 submitFile_section ($hash)
 checkWordList ($wl)
 submitWords ($wl, $phash)
 setT3Hashes ()
 setExtHashes ($file, $subinfo=array())
 md5inthash ($str)

Public Attributes

 $reasons
 $convChars
 $excludeSections = 'script,style'
 $supportedExtensions
 $pdf_mode = -20
 $app
 $defaultGrList = '0,-1'
 $tstamp_maxAge = 0
 $tstamp_minAge = 0
 $defaultContentArray
 $wordcount = 0
 $Itypes
 $conf = array()
 $hash = array()
 $contentParts = array()
 $pObj = ''
 $content_md5h = ''
 $cHashParams = array()
 $mtime = 0
 $rootLine = array()
 $freqRange = 65000
 $freqMax = 0.1

Detailed Description

Definition at line 118 of file class.indexer.php.


Member Function Documentation

tx_indexedsearch_indexer::analyzeBody (&$retArr, $content)

Calculates relevant information for the body content.

Parameters:
[type] $retArr: ...
[type] $content: ...
Returns:
[type] ...

Definition at line 820 of file class.indexer.php.

tx_indexedsearch_indexer::analyzeHeaderinfo (&$retArr, $content, $key, $offset)

Calculates relevant information for the header content.

Parameters:
[type] $retArr: ...
[type] $content: ...
[type] $key: ...
[type] $offset: ...
Returns:
[type] ...

Definition at line 801 of file class.indexer.php.

tx_indexedsearch_indexer::bodyDescription ($contentArr)

Returns bodyDescription

Parameters:
[type] $contentArr: ...
Returns:
[type] ...

Definition at line 482 of file class.indexer.php.

References t3lib_div::intInRange().

tx_indexedsearch_indexer::checkContentHash ()

Checks the content hash. Returns true if the page needs to be indexed (that is, there was no result).

Returns:
[type] ...

Definition at line 1140 of file class.indexer.php.

Referenced by indexTypo3PageContent().

tx_indexedsearch_indexer::checkExternalDocContentHash ($hashGr, $content_md5h)

Checks the content hash for external documents. Returns true if the document needs to be indexed (that is, there was no result).

Parameters:
[type] $hashGr: ...
[type] $content_md5h: ...
Returns:
[type] ...

Definition at line 1190 of file class.indexer.php.

tx_indexedsearch_indexer::checkMtimeTstamp ($mtime, $maxAge, $minAge, $phash)

Checks the mtime / tstamp of the currently indexed page or file (based on phash). Returns a positive integer if the page needs to be indexed.

Parameters:
integer $mtime: mtime value to test against limits and indexed page.
integer $maxAge: Maximum age in seconds.
integer $minAge: Minimum age in seconds.
integer $phash: "phash" used to select any already indexed page to see what its mtime is.
Returns:
integer Result integer. Generally: <0 = no indexing, >0 = do indexing (see $this->reasons):
-2) Min age was NOT exceeded, so indexing cannot occur.
-1) Mtimes matched, so there is no need to reindex the page.
0) N/A
1) Max age exceeded, page must be indexed again.
2) mtime of the indexed page doesn't match the mtime given for the current content, so the page must be indexed.
3) No mtime was set, so the page is indexed.
4) No indexed page found, so the page is indexed.

Definition at line 1083 of file class.indexer.php.

Referenced by indexTypo3PageContent().
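
The return codes listed above can be read as a small decision tree. The following is a minimal sketch of that reading, not the code in class.indexer.php: the function name, the $row array (standing in for the already indexed record, with assumed keys 'tstamp' and 'item_mtime') and the explicit $now argument are all hypothetical, and the database lookup by phash is left out.

```php
<?php
// Hypothetical helper: mirrors the documented return codes of
// checkMtimeTstamp() without touching the database.
// $row is the already indexed record or null; the keys 'tstamp' and
// 'item_mtime' are assumed names, not taken from the source.
function sketchCheckMtimeTstamp($mtime, $maxAge, $minAge, $row, $now)
{
    if ($row === null) {
        return 4;   // No indexed page found.
    }
    if ($maxAge && ($row['tstamp'] + $maxAge) < $now) {
        return 1;   // Max age exceeded.
    }
    if ($minAge && ($row['tstamp'] + $minAge) >= $now) {
        return -2;  // Min age not yet exceeded.
    }
    if (!$mtime) {
        return 3;   // No mtime was set.
    }
    // mtime differs from the indexed one: reindex (2); otherwise skip (-1).
    return ($row['item_mtime'] != $mtime) ? 2 : -1;
}
```

A caller could then use the result as a key into $this->reasons to log why a page was or was not indexed.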

tx_indexedsearch_indexer::checkWordList ($wl)

Adds new words to db

Parameters:
[type] $wl: ...
Returns:
[type] ...

Definition at line 1436 of file class.indexer.php.

Referenced by indexTypo3PageContent().

tx_indexedsearch_indexer::embracingTags ($string, $tagName, &$tagContent, &$stringAfter, &$paramList)

Finds the first occurrence of embracing tags and returns the embraced content and the original string with the tag removed in the two passed variables. Returns false if no match is found. Useful, for example, for finding the <title> of a document or removing <script> sections.

Parameters:
[type] $string: ...
[type] $tagName: ...
[type] $tagContent: ...
[type] $stringAfter: ...
[type] $paramList: ...
Returns:
[type] ...

Definition at line 754 of file class.indexer.php.

Referenced by splitHTMLContent().
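
To illustrate the by-reference contract described above, here is a minimal sketch of such a tag extractor. It is an assumption about the behaviour, not the code in class.indexer.php: the function name is hypothetical, the match is a plain case-insensitive substring search, and unlike the original it simply returns false when the closing tag is missing.

```php
<?php
// Hypothetical re-implementation for illustration only.
// Fills $tagContent, $stringAfter and $paramList by reference and
// returns true on a match, false otherwise.
function sketchEmbracingTags($string, $tagName, &$tagContent, &$stringAfter, &$paramList)
{
    $startTag = '<' . $tagName;
    $endTag = '</' . $tagName . '>';

    // Locate the opening tag (case-insensitive). Note that this naive
    // search would also match a longer tag name with the same prefix.
    $start = stripos($string, $startTag);
    if ($start === false) {
        return false;
    }
    // Locate the '>' that ends the opening tag.
    $openEnd = strpos($string, '>', $start + strlen($startTag));
    if ($openEnd === false) {
        return false;
    }
    // Locate the closing tag after the opening tag.
    $close = stripos($string, $endTag, $openEnd + 1);
    if ($close === false) {
        return false;
    }

    $attrStart = $start + strlen($startTag);
    $paramList = substr($string, $attrStart, $openEnd - $attrStart);
    $tagContent = substr($string, $openEnd + 1, $close - $openEnd - 1);
    $stringAfter = substr($string, 0, $start) . substr($string, $close + strlen($endTag));
    return true;
}
```

Called with an HTML string and 'title', the sketch leaves the title text in $tagContent, the attribute string of the opening tag in $paramList, and the document without the title element in $stringAfter.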

tx_indexedsearch_indexer::extractLinks ($content)

Extracts links and, if indexable media is found, indexes it.

Parameters:
[type] $content: ...
Returns:
[type] ...

Definition at line 499 of file class.indexer.php.

References t3lib_div::htmlspecialchars_decode(), and t3lib_div::makeInstance().

Referenced by indexTypo3PageContent().

tx_indexedsearch_indexer::fileContentParts ($ext, $absFile)

[Describe function...]

Parameters:
[type] $ext: ...
[type] $absFile: ...
Returns:
[type] ...

Definition at line 711 of file class.indexer.php.

References t3lib_div::intInRange().

tx_indexedsearch_indexer::freqMap ($freq)

Maps frequency from a real number in [0;1] to an integer in [0;$this->freqRange], with anything above $this->freqMax as 1, and back.

Parameters:
[type] $freq: ...
Returns:
[type] ...

Definition at line 985 of file class.indexer.php.
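
The description is terse, so the following sketch shows only one possible reading of the forward direction: frequencies are capped at the maximum and then scaled linearly onto the integer range. The function name is hypothetical, the default arguments reuse the documented initial values of $freqRange (65000) and $freqMax (0.1), and the reverse mapping mentioned in the description is not covered.

```php
<?php
// Hypothetical helper: one reading of the forward mapping only.
// Frequencies above $freqMax are treated as the maximum, then the
// capped value is scaled linearly onto [0; $freqRange].
function sketchFreqMap($freq, $freqRange = 65000, $freqMax = 0.1)
{
    $capped = min($freq, $freqMax);
    return (int) round($capped / $freqMax * $freqRange);
}
```

Under this reading a word that makes up 5 percent of a page lands in the middle of the range, and anything at or above 10 percent saturates at $freqRange.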

tx_indexedsearch_indexer::getJumpurl ($query)

[Describe function...]

Parameters:
[type] $query: ...
Returns:
[type] ...

Definition at line 531 of file class.indexer.php.

tx_indexedsearch_indexer::getRootLineFields (&$fieldArr)

[Describe function...]

Parameters:
[type] $fieldArr: ...
Returns:
[type] ...

Definition at line 1003 of file class.indexer.php.

tx_indexedsearch_indexer::hook_indexContent (&$pObj)

Parent Object (TSFE)

Parameters:
object $pObj: Parent Object (frontend TSFE object), passed by reference
Returns:
void

Definition at line 200 of file class.indexer.php.

References $pObj, indexTypo3PageContent(), and init().

tx_indexedsearch_indexer::indexAnalyze ($content)

Analyzes content to use for indexing. The parameter must be an array with the keys title, keywords, description and body, each of which contains an array of words.

Parameters:
[type] $content: ...
Returns:
[type] ...

Definition at line 780 of file class.indexer.php.

Referenced by indexTypo3PageContent().

tx_indexedsearch_indexer::indexRegularDocument ($file)

Indexing a regular document given as $file (relative to PATH_site, local file)

Parameters:
[type] $file: ...
Returns:
[type] ...

Definition at line 564 of file class.indexer.php.

References t3lib_div::milliseconds().

tx_indexedsearch_indexer::indexTypo3PageContent ()

Start indexing of the TYPO3 page

Returns:
void

Definition at line 325 of file class.indexer.php.

References checkContentHash(), checkMtimeTstamp(), checkWordList(), extractLinks(), indexAnalyze(), md5inthash(), t3lib_div::milliseconds(), splitHTMLContent(), submitPage(), submitWords(), update_grlist(), updateParsetime(), updateRootline(), and updateTstamp().

Referenced by hook_indexContent().

tx_indexedsearch_indexer::init ()

Initializes the object

Returns:
void

Definition at line 242 of file class.indexer.php.

References initExternalReaders(), and setT3Hashes().

Referenced by hook_indexContent().

tx_indexedsearch_indexer::initExternalReaders ()

Initializes external readers, if any

Returns:
void

Definition at line 271 of file class.indexer.php.

References t3lib_div::intInRange().

Referenced by init().

tx_indexedsearch_indexer::is_grlist_set ($phash_x)

Parameters:
[type] $phash_x: ...
Returns:
[type] ...

Definition at line 1129 of file class.indexer.php.

tx_indexedsearch_indexer::md5inthash ($str)

md5 integer hash

Parameters:
[type] $str: ...
Returns:
[type] ...

Definition at line 1563 of file class.indexer.php.

Referenced by indexTypo3PageContent().
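
The description only says "md5 integer hash". One common way to build such a hash is to keep the leading hex digits of the MD5 digest and convert them to an integer. The sketch below does that; the function name and the choice of seven hex digits (28 bits, which fits a signed 32-bit integer column) are assumptions, not something this reference confirms.

```php
<?php
// Hypothetical helper: derives a non-negative integer from an MD5 digest
// by keeping only its first seven hex digits (28 bits).
function sketchMd5IntHash($str)
{
    return hexdec(substr(md5($str), 0, 7));
}
```

The same input always yields the same integer, which makes the value usable as a compact lookup key; collisions are possible because most of the digest is discarded.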

tx_indexedsearch_indexer::metaphone ($word)

metaphone

Parameters:
[type] $word: ...
Returns:
[type] ...

Definition at line 942 of file class.indexer.php.

tx_indexedsearch_indexer::readFileContent ($ext, $absFile, $cPKey)

[Describe function...]

Parameters:
[type] $ext: ...
[type] $absFile: ...
[type] $cPKey: ...
Returns:
[type] ...

Definition at line 647 of file class.indexer.php.

References t3lib_div::tempnam().

tx_indexedsearch_indexer::removeIndexedPhashRow ($phashList, $clearPageCache=1)

Removes ALL data regarding a certain indexed phash-row

Parameters:
[type] $phashList: ...
[type] $clearPageCache: ...
Returns:
[type] ...

Definition at line 1043 of file class.indexer.php.

References t3lib_div::trimExplode().

tx_indexedsearch_indexer::removeLoginpagesWithContentHash ()

Removes any indexed pages with user logins that have the same contentHash.

Returns:
[type] ...

Definition at line 1154 of file class.indexer.php.

tx_indexedsearch_indexer::removeOldIndexedPages ($phash)

Removes records for the indexed page, $phash

Parameters:
[type] $phash: ...
Returns:
[type] ...

Definition at line 1172 of file class.indexer.php.

tx_indexedsearch_indexer::setExtHashes ($file, $subinfo=array())

Get search hash, external files

Parameters:
[type] $file: ...
[type] $subinfo: ...
Returns:
[type] ...

Definition at line 1540 of file class.indexer.php.

tx_indexedsearch_indexer::setT3Hashes ()

Get search hash, T3 pages

Returns:
[type] ...

Definition at line 1517 of file class.indexer.php.

Referenced by init().

tx_indexedsearch_indexer::split2words (&$string)

Splits the incoming string into words. The $string parameter is a reference and will be turned into an array!

Parameters:
[type] $string: ...
Returns:
[type] ...

Definition at line 891 of file class.indexer.php.

tx_indexedsearch_indexer::splitHTMLContent ($content)

Splits HTML content and returns an associative array, with title, a list of metatags, and a list of words in the body.

Parameters:
[type] $content: ...
Returns:
[type] ...

Definition at line 400 of file class.indexer.php.

References embracingTags(), and t3lib_div::get_tag_attributes().

Referenced by indexTypo3PageContent().

tx_indexedsearch_indexer::splitPdfInfo ($pdfInfoArray)

Splitting PDF info

Parameters:
[type] $pdfInfoArray: ...
Returns:
[type] ...

Definition at line 544 of file class.indexer.php.

tx_indexedsearch_indexer::strtolower_all ($str)

Converts string-to-lower including special characters.

Parameters:
[type] $str: ...
Returns:
[type] ...

Definition at line 954 of file class.indexer.php.

tx_indexedsearch_indexer::submit_grlist ($hash, $phash_x)

Stores gr_list

Parameters:
[type] $hash: ...
[type] $phash_x: ...
Returns:
[type] ...

Definition at line 1317 of file class.indexer.php.

tx_indexedsearch_indexer::submit_section ($hash, $hash_t3)

Stores section

Parameters:
[type] $hash: ...
[type] $hash_t3: ...
Returns:
[type] ...

Definition at line 1335 of file class.indexer.php.

tx_indexedsearch_indexer::submitFile_grlist ($hash)

Stores file gr_list for a file IF it does not exist

Parameters:
[type] $hash: ...
Returns:
[type] ...

Definition at line 1402 of file class.indexer.php.

tx_indexedsearch_indexer::submitFile_section ($hash)

Stores file section for a file IF it does not exist

Parameters:
[type] $hash: ...
Returns:
[type] ...

Definition at line 1419 of file class.indexer.php.

tx_indexedsearch_indexer::submitFilePage ($hash, $file, $subinfo, $ext, $mtime, $ctime, $size, $content_md5h, $contentParts)

Updates db with information about the file

Parameters:
[type] $hash: ...
[type] $file: ...
[type] $subinfo: ...
[type] $ext: ...
[type] $mtime: ...
[type] $ctime: ...
[type] $size: ...
[type] $content_md5h: ...
[type] $contentParts: ...
Returns:
[type] ...

Definition at line 1361 of file class.indexer.php.

tx_indexedsearch_indexer::submitPage ()

Updates db with information about the page

Returns:
[type] ...

Definition at line 1264 of file class.indexer.php.

Referenced by indexTypo3PageContent().

tx_indexedsearch_indexer::submitWords ($wl, $phash)

Submits information about words on the page to the db

Parameters:
[type] $wl: ...
[type] $phash: ...
Returns:
[type] ...

Definition at line 1473 of file class.indexer.php.

Referenced by indexTypo3PageContent().

tx_indexedsearch_indexer::typoSearchTags (&$body)

Removes content that shouldn't be indexed according to TYPO3SEARCH-tags.

Parameters:
[type] $body: ...
Returns:
[type] ...

Definition at line 840 of file class.indexer.php.

tx_indexedsearch_indexer::update_grlist ($phash, $phash_x)

Checks whether a grlist entry for this hash exists and, if not, writes one.

Parameters:
[type] $phash: ...
[type] $phash_x: ...
Returns:
[type] ...

Definition at line 1117 of file class.indexer.php.

Referenced by indexTypo3PageContent().

tx_indexedsearch_indexer::updateParsetime ($phash, $parsetime)

Update parsetime

Parameters:
[type] $phash: ...
[type] $parsetime: ...
Returns:
[type] ...

Definition at line 1221 of file class.indexer.php.

Referenced by indexTypo3PageContent().

tx_indexedsearch_indexer::updateRootline ()

Update section rootline for the page

Returns:
[type] ...

Definition at line 1234 of file class.indexer.php.

Referenced by indexTypo3PageContent().

tx_indexedsearch_indexer::updateTstamp ($phash, $mtime=0)

Update tstamp

Parameters:
[type] $phash: ...
[type] $mtime: ...
Returns:
[type] ...

Definition at line 1205 of file class.indexer.php.

Referenced by indexTypo3PageContent().

tx_indexedsearch_indexer::wordOK ($w)

Checks whether a word is supposed to be indexed. The word must be between 1 and 50 characters long. The more exotic rule is that at most 30 percent of the word may be non-alphanumeric characters, which excludes binary nonsense. This is done with a little trick: it counts how many characters are converted by a rawurlencode call. This is not really an exact method, but I guess it's fast.

Parameters:
[type] $w: ...
Returns:
[type] ...

Definition at line 924 of file class.indexer.php.
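
The rawurlencode trick described above can be sketched as follows. This is an illustration under assumptions, not the code in class.indexer.php: the function name is hypothetical, lengths are counted in bytes, and the 30 percent limit is read as an upper bound. rawurlencode() turns every byte it escapes into a three-character %XX sequence, so the growth in length divided by two gives the number of escaped bytes.

```php
<?php
// Hypothetical helper: accepts a word only if it is 1 to 50 bytes long
// and at most 30 percent of its bytes get escaped by rawurlencode().
function sketchWordOK($w)
{
    $len = strlen($w);
    if ($len < 1 || $len > 50) {
        return false;
    }
    // Each escaped byte grows from 1 to 3 characters, i.e. by 2.
    $escaped = (strlen(rawurlencode($w)) - $len) / 2;
    return ($escaped / $len) <= 0.3;
}
```

The count is only approximate as a measure of "non-alphanumeric": rawurlencode() also leaves a few punctuation characters such as "-", "_" and "." untouched.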


Member Data Documentation

tx_indexedsearch_indexer::$app
 

Initial value:

 array(
                'pdftotext' => '/usr/local/bin/pdftotext',
                'pdfinfo' => '/usr/local/bin/pdfinfo',
                'catdoc' => '/usr/local/bin/catdoc'
        )

Definition at line 150 of file class.indexer.php.

tx_indexedsearch_indexer::$convChars
 

Initial value:

array(
                '',
                ''
        )

Definition at line 129 of file class.indexer.php.

tx_indexedsearch_indexer::$defaultContentArray
 

Initial value:

array(
                'title' => '',
                'description' => '',
                'keywords' => '',
                'body' => '',
        )

Definition at line 164 of file class.indexer.php.

tx_indexedsearch_indexer::$Itypes
 

Initial value:

 array(
                'html' => 1,
                'htm' => 1,
                'pdf' => 2,
                'doc' => 3,
                'txt' => 4
        )

Definition at line 171 of file class.indexer.php.

tx_indexedsearch_indexer::$reasons
 

Initial value:

 array(
                -1 => 'mtime matched the document, so no changes detected and no content updated',
                -2 => 'The minimum age was not exceeded',
                1 => "The configured max-age was exceeded for the document and thus it's indexed.",
                2 => 'The minimum age was exceed and mtime was set and the mtime was different, so the page was indexed.',
                3 => 'The minimum age was exceed, but mtime was not set, so the page was indexed.',
                4 => 'Page has never been indexed (is not represented in the index_phash table).'
        )

Definition at line 121 of file class.indexer.php.

tx_indexedsearch_indexer::$supportedExtensions
 

Initial value:

 array(
                        'pdf' => 1,
                        'doc' => 1,
                        'txt' => 1,
                        'html' => 1,
                        'htm' => 1
                )

Definition at line 138 of file class.indexer.php.


The documentation for this class was generated from the following file:

class.indexer.php


Generated by L'expert TYPO3 with doxygen 1.4.6