Grammalecte  Check-in [3339da6424]

Overview
Comment: [graphspell] tokenizer: add option for <start> and <end> tokens
SHA3-256: 3339da64247027f881ef64d14bb6621537ce2b5fd55fe9b421d0af9e1f19f8a4
User & Date: olr on 2018-06-02 13:47:06
Context
2018-06-02
14:01  [core] token offset for correct token positioning (check-in: 38cd64c0b9, user: olr, tags: core, rg)
13:47  [graphspell] tokenizer: add option for <start> and <end> tokens (check-in: 3339da6424, user: olr, tags: graphspell, rg)
2018-06-01
10:51  [core] gc engine update (check-in: 102180fb1d, user: olr, tags: core, rg)
Changes

Modified graphspell/tokenizer.py from [b3cbfe75ea] to [b723a02695].

@@ -40,10 +40,15 @@
 
     def __init__ (self, sLang):
         self.sLang = sLang
         if sLang not in _PATTERNS:
             self.sLang = "default"
         self.zToken = re.compile( "(?i)" + '|'.join(sRegex for sRegex in _PATTERNS[sLang]) )
 
-    def genTokens (self, sText):
+    def genTokens (self, sText, bStartEndToken=False):
+        if bStartEndToken:
+            yield { "i": 0, "sType": "INFO", "sValue": "<start>", "nStart": 0, "nEnd": 0 }
         for i, m in enumerate(self.zToken.finditer(sText), 1):
             yield { "i": i, "sType": m.lastgroup, "sValue": m.group(), "nStart": m.start(), "nEnd": m.end() }
+        if bStartEndToken:
+            iEnd = len(sText)
+            yield { "i": i+1, "sType": "INFO", "sValue": "<end>", "nStart": iEnd, "nEnd": iEnd }
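
The effect of the new bStartEndToken option can be illustrated with a small standalone sketch. It is not the code from graphspell/tokenizer.py. The _PATTERNS table below is a hypothetical two-rule stand-in, because the real table is not part of this diff. The sketch also makes two defensive changes that the check-in does not: it initializes i before the loop, so the <end> token can still be numbered when nothing matches, and it looks up _PATTERNS with self.sLang rather than sLang, so an unknown language falls back to "default".

```python
import re

# Hypothetical, simplified patterns (assumption): the real _PATTERNS
# table in graphspell/tokenizer.py is not shown in this check-in.
_PATTERNS = {
    "default": (
        r"(?P<WORD>\w+)",
        r"(?P<PUNC>[.,;:!?])",
    )
}


class Tokenizer:

    def __init__(self, sLang):
        # Fall back to "default" for unknown languages, then use the
        # resolved key for the lookup.
        self.sLang = sLang if sLang in _PATTERNS else "default"
        self.zToken = re.compile("(?i)" + "|".join(_PATTERNS[self.sLang]))

    def genTokens(self, sText, bStartEndToken=False):
        i = 0  # assumption: keeps i defined when no token matches
        if bStartEndToken:
            # Zero-width marker at the very beginning of the text.
            yield {"i": 0, "sType": "INFO", "sValue": "<start>", "nStart": 0, "nEnd": 0}
        for i, m in enumerate(self.zToken.finditer(sText), 1):
            yield {"i": i, "sType": m.lastgroup, "sValue": m.group(),
                   "nStart": m.start(), "nEnd": m.end()}
        if bStartEndToken:
            # Zero-width marker positioned just past the last character.
            iEnd = len(sText)
            yield {"i": i + 1, "sType": "INFO", "sValue": "<end>", "nStart": iEnd, "nEnd": iEnd}


lTokens = list(Tokenizer("fr").genTokens("Bonjour, monde", bStartEndToken=True))
for dToken in lTokens:
    print(dToken["i"], dToken["sType"], repr(dToken["sValue"]), dToken["nStart"], dToken["nEnd"])
```

With the option enabled, the regular tokens keep their 1-based indexes and are bracketed by two zero-width INFO tokens, <start> at offset 0 and <end> at len(sText). With the default bStartEndToken=False, the output is the same as before the change.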