TinySegmenter in Python
What is this?
The interface of "TinySegmenter in Python" is compatible with NLTK's TokenizerI, although the distribution file below does not directly depend on NLTK. To use it as a tokenizer in NLTK, modify the first few lines of the code as follows:
import nltk
import re
from nltk.tokenize.api import *

class TinySegmenter(TokenizerI):
Download and Usage
Download the source code from here. TinySegmenter in Python is freely distributable under the terms of a new BSD license.
No need to install it. Just copy it anywhere, import it, and use it as in the following example:
from tinysegmenter import *

segmenter = TinySegmenter()
print(' | '.join(segmenter.tokenize(u"私の名前は中野です")))
私 | の | 名前 | は | 中野 | です
Features (from the original TinySegmenter)
- Around 95% segmentation precision for Japanese news articles.
- Segmentation units are compatible with MeCab + IPADic.
- Only 23 KB of source code. Just copy it anywhere; nothing else is required.
- No dependency on any dictionaries - character-based segmentation (Features: character, character N-grams, character types).
- Feature selection by boosting with L1-norm regularization.
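The feature types named in the list above (characters, character N-grams, character types) can be illustrated with a small sketch. This is not the code from TinySegmenter itself: the function names char_type and boundary_features, the feature keys, and the exact character classes are assumptions chosen for illustration, and the sketch is written for Python 3. It only shows what kind of information is available at each candidate boundary when no dictionary is used.

```python
import re

# Simplified character classes (an assumption for illustration, not the
# exact classes used by TinySegmenter). The first matching pattern wins.
CHAR_TYPES = [
    (re.compile("[一-龠々〆ヵヶ]"), "H"),  # kanji
    (re.compile("[ぁ-ん]"), "I"),          # hiragana
    (re.compile("[ァ-ヴー]"), "K"),        # katakana
    (re.compile("[a-zA-Z]"), "A"),         # ASCII letters
    (re.compile("[0-9]"), "N"),            # ASCII digits
]


def char_type(ch):
    """Return a one-letter class label for a single character."""
    for pattern, label in CHAR_TYPES:
        if pattern.match(ch):
            return label
    return "O"  # anything else (punctuation, symbols, ...)


def boundary_features(text, i):
    """Features for the candidate boundary between text[i-1] and text[i].

    Hypothetical helper: requires 1 <= i < len(text).
    """
    left, right = text[i - 1], text[i]
    return {
        "char_left": left,
        "char_right": right,
        "char_bigram": left + right,                        # character N-gram (N=2)
        "type_left": char_type(left),
        "type_right": char_type(right),
        "type_bigram": char_type(left) + char_type(right),  # character-type N-gram
    }


if __name__ == "__main__":
    # Inspect the boundary between the first and second characters.
    print(boundary_features("私の名前は中野です", 1))
```

A classifier trained on features like these can decide, position by position, whether to split, which is why no dictionary is needed at run time.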
A Better Alternative
I thank Mr. Kudo for his work on this wonderful software.