The interface of "TinySegmenter in Python" is compatible with NLTK's TokenizerI, although the distribution file below does not directly depend on NLTK. If you'd like to use it as a tokenizer in NLTK, modify the first few lines of the code as follows:
import nltk
import re
from nltk.tokenize.api import *

class TinySegmenter(TokenizerI):
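NLTK's TokenizerI is, in practice, a duck-typed interface: any class whose tokenize(text) method returns a flat list of token strings can be dropped in wherever NLTK expects a tokenizer. As a minimal sketch of that contract (using a hypothetical toy regex tokenizer in Python 3 syntax, not TinySegmenter itself):

```python
import re

class ToyTokenizer:
    """Toy stand-in for TinySegmenter: splits on runs of word characters.

    Illustrative only -- real Japanese segmentation needs TinySegmenter's
    character-type features and trained weights.
    """

    def tokenize(self, text):
        # Return a flat list of tokens, the contract TokenizerI expects.
        return re.findall(r"\w+", text)

tokenizer = ToyTokenizer()
print(" | ".join(tokenizer.tokenize("my name is Nakano")))
# my | name | is | Nakano
```

Because only the tokenize method is required here, subclassing TokenizerI (as in the modified header above) mainly buys you NLTK's default implementations of related methods.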
Download the source code from here. TinySegmenter in Python is freely distributable under the terms of the New BSD License.
No installation is needed: just copy it anywhere, import it, and use it as in the following example:
from tinysegmenter import *
segmenter = TinySegmenter()
print ' | '.join(segmenter.tokenize(u"私の名前は中野です"))
私 | の | 名前 | は | 中野 | です
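Since tokenize() returns an ordinary list of strings, the output composes with standard Python tooling. For example, counting token frequencies with the standard library (Python 3 syntax; the token list is written out literally here rather than produced by the segmenter):

```python
from collections import Counter

# Token list as produced by segmenter.tokenize() in the example above.
tokens = ["私", "の", "名前", "は", "中野", "です"]

counts = Counter(tokens)
print(counts["の"])   # 1
print(len(tokens))    # 6
```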
I thank Mr. Kudo for his effort on this wonderful piece of software.