API Reference

nagisa.Tagger(vocabs=None, params=None, hp=None, single_word_list=None)

This class has a word segmentation function and a POS-tagging function for Japanese.

nagisa.wakati(text, lower=False)

Segment the given sentence into words and return them.

args:

text (str): An input sentence.
lower (bool): If True, all uppercase characters in the words are converted to lowercase.

return:

words (list): A list of the words.

nagisa.tagging(text, lower=False)

Return the words with POS-tags of the given sentence.

args:

text (str): An input sentence.
lower (bool): If True, all uppercase characters in the words are converted to lowercase.

return:

object: An object holding the words and their POS-tags, available as the attributes words and postags.
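As a minimal usage sketch, the result of nagisa.tagging can be paired up word by word. This assumes the returned object exposes parallel lists as the attributes words and postags (as shown in nagisa's README); word_tag_pairs and tag_sentence are hypothetical helper names, not part of the library.

```python
def word_tag_pairs(tagged):
    """Zip a tagging result into (word, POS-tag) tuples.

    Assumes `tagged` has parallel `.words` and `.postags` lists.
    """
    return list(zip(tagged.words, tagged.postags))


def tag_sentence(text, lower=False):
    """Segment and tag `text` with nagisa, returning (word, tag) pairs."""
    # Third-party package, imported lazily so that word_tag_pairs
    # can be used and tested without nagisa installed.
    import nagisa

    return word_tag_pairs(nagisa.tagging(text, lower=lower))
```

Calling tag_sentence requires nagisa to be installed. When the tags are not needed, nagisa.wakati(text) returns just the list of words.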

nagisa.filter(text, lower=False, filter_postags=None)

Return the filtered words with POS-tags of the given sentence.

args:

text (str): An input sentence.
lower (bool): If True, all uppercase characters in the words are converted to lowercase.
filter_postags (list): POS-tags to filter out. Words tagged with any of them are removed from the result.

return:

object: An object holding the remaining words and their POS-tags.

nagisa.extract(text, lower=False, extract_postags=None)

Return the extracted words with POS-tags of the given sentence.

args:

text (str): An input sentence.
lower (bool): If True, all uppercase characters in the words are converted to lowercase.
extract_postags (list): POS-tags to extract. Only words tagged with one of them are kept in the result.

return:

object: An object holding the extracted words and their POS-tags.
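The difference between nagisa.filter and nagisa.extract is easiest to see on data that is already tagged. The sketch below mimics the two selection rules in plain Python. It illustrates the behaviour described above and is not nagisa's implementation; the helper names, the sample pairs, and the treatment of None as an empty list are assumptions made for this example.

```python
def filter_pairs(pairs, filter_postags=None):
    """Drop every (word, tag) pair whose tag is listed in filter_postags."""
    blocked = set(filter_postags or [])  # None is treated as "block nothing" here
    return [(w, t) for w, t in pairs if t not in blocked]


def extract_pairs(pairs, extract_postags=None):
    """Keep only the (word, tag) pairs whose tag is listed in extract_postags."""
    wanted = set(extract_postags or [])  # None is treated as "keep nothing" here
    return [(w, t) for w, t in pairs if t in wanted]


# Hand-written sample; in practice the pairs would come from nagisa.tagging.
pairs = [("Python", "名詞"), ("で", "助詞"), ("使える", "動詞")]

without_particles = filter_pairs(pairs, ["助詞"])  # removes the particle
nouns_only = extract_pairs(pairs, ["名詞"])        # keeps only the noun
```

In this sketch the two functions are complementary: for the same tag list, every pair ends up in exactly one of the two results.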

nagisa.decode(words, lower=False)

Return the words with tags of the given words.

args:

words (list): Input words.
lower (bool, optional): If True, all uppercase characters in the words are converted to lowercase.

return:

object : The object of the words with tags.

nagisa.fit(train_file, dev_file, test_file, model_name, dict_file=None, emb_file=None, delimiter='\t', newline='EOS', layers=1, min_count=2, decay=1, epoch=10, window_size=3, dim_uni=32, dim_bi=16, dim_word=16, dim_ctype=8, dim_tagemb=16, dim_hidden=100, learning_rate=0.1, dropout_rate=0.3, seed=1234)

Train a joint word segmentation and sequence labeling (e.g., POS-tagging, NER) model.

args:

train_file (str): Path to a train file.
dev_file (str): Path to a development file for early stopping.
test_file (str): Path to a test file for evaluation.
model_name (str): Output model filename.
dict_file (str, optional): Path to a dictionary file.
emb_file (str, optional): Path to a pre-trained embedding file (word2vec format).
delimiter (str, optional): String that separates the word and its tag on each line.
newline (str, optional): Marker line that separates sentences in the file.
layers (int, optional): Number of RNN layers.
min_count (int, optional): Ignore all words whose total frequency is lower than this.
decay (int, optional): Learning rate decay.
epoch (int, optional): Number of training epochs.
window_size (int, optional): Window size of the context characters for word segmentation.
dim_uni (int, optional): Dimensionality of the char-unigram vectors.
dim_bi (int, optional): Dimensionality of the char-bigram vectors.
dim_word (int, optional): Dimensionality of the word vectors.
dim_ctype (int, optional): Dimensionality of the character-type vectors.
dim_tagemb (int, optional): Dimensionality of the tag vectors.
dim_hidden (int, optional): Dimensionality of the BiLSTM’s hidden layer.
learning_rate (float, optional): Learning rate of SGD.
dropout_rate (float, optional): Dropout rate of the input vector for BiLSTMs.
seed (int, optional): Random seed.

return:

Nothing. After training finishes, however, three model files (*.vocabs, *.params, *.hp) are saved in the current directory.
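To make the delimiter and newline parameters concrete, the following sketch writes a tiny corpus in the layout they describe: one word and its tag per line, separated by delimiter, with a line containing only the newline marker after each sentence. The helper names write_corpus and train, the file name, and the sample tags are assumptions for illustration, and two sentences are far too few for real training.

```python
import os
import tempfile


def write_corpus(path, sentences, delimiter="\t", newline="EOS"):
    """Write sentences as one `word<delimiter>tag` pair per line.

    Each sentence is a list of (word, tag) tuples and is followed by a
    line that contains only the `newline` marker.
    """
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            for word, tag in sentence:
                f.write(f"{word}{delimiter}{tag}\n")
            f.write(f"{newline}\n")


# Hypothetical toy data; a real corpus needs many annotated sentences.
sentences = [[("Python", "NOUN"), ("で", "ADP")], [("使える", "VERB")]]

corpus_path = os.path.join(tempfile.mkdtemp(), "sample.train")
write_corpus(corpus_path, sentences)

# Read the file back to inspect the exact layout that was written.
with open(corpus_path, encoding="utf-8") as f:
    corpus_text = f.read()


def train(train_file, dev_file, test_file, model_name="sample"):
    """Call nagisa.fit with its default hyperparameters (requires nagisa)."""
    import nagisa  # third-party package, imported lazily

    nagisa.fit(train_file=train_file, dev_file=dev_file,
               test_file=test_file, model_name=model_name)
```

With model_name="sample", training would be expected to leave sample.vocabs, sample.params and sample.hp in the current directory, which can then be passed to nagisa.Tagger through its vocabs, params and hp arguments.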