"TinySegmenter in Python" is a Python re-implementation of TinySegmenter, which is an extremely compact (23KB) Japanese tokenizer originally written in JavaScript by Mr. Taku Kudo. It works on Python 2.5 or above.
The interface of "TinySegmenter in Python" is compatible with NLTK's TokenizerI, although the distribution file below does not directly depend on NLTK. If you'd like to use it as a tokenizer in NLTK, modify the first few lines of the code as follows:
import nltk
import re
from nltk.tokenize.api import *
class TinySegmenter(TokenizerI):
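With that change in place, the tokenizer can be combined with other NLTK utilities. Below is a minimal sketch (assuming the modified tinysegmenter.py is on your import path) that counts token frequencies with NLTK's FreqDist:
# -*- coding: utf-8 -*-
from tinysegmenter import TinySegmenter
from nltk.probability import FreqDist

segmenter = TinySegmenter()
# Tokenize a sentence, then count how often each token appears
tokens = segmenter.tokenize(u"私の名前は中野です")
fdist = FreqDist(tokens)
print fdist[u'の']  # -> 1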
Download the source code from here. TinySegmenter in Python is freely distributable under the terms of the New BSD License.
No need to install it - just copy it anywhere, import it, and use it as in the following example:
from tinysegmenter import *
segmenter = TinySegmenter()
print ' | '.join(segmenter.tokenize(u"私の名前は中野です"))
私 | の | 名前 | は | 中野 | です
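Note that the example above uses Python 2 syntax (the print statement and the u string prefix). Under Python 3, the same call would look roughly like this, assuming the module itself runs on your Python 3 interpreter:
from tinysegmenter import TinySegmenter

segmenter = TinySegmenter()
# Python 3 strings are Unicode by default, so no u prefix is needed
print(' | '.join(segmenter.tokenize("私の名前は中野です")))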
If you are interested in more accurate analysis of Japanese and Chinese text in JavaScript, check out Rakuten MA, my other open source project: a morphological analyzer written in 100% JavaScript that supports PoS tagging in both Japanese and Chinese.
I thank Mr. Kudo for his effort on this wonderful piece of software.