API Reference

nagisa.Tagger(vocabs=None, params=None, hp=None, single_word_list=None)

This class provides word segmentation and POS-tagging for Japanese text.
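The constructor arguments vocabs, params, and hp correspond to the three model files (*.vocabs, *.params, *.hp) that nagisa.fit saves. The following is a minimal sketch of loading such a model, assuming the three files share one base name and sit in the current directory. The helpers model_paths and load_tagger are illustrative names, not part of nagisa, and the import is deferred so the path logic can be checked without the package installed.

```python
def model_paths(model_name):
    """Map a model base name to the three file arguments of nagisa.Tagger."""
    return {
        "vocabs": model_name + ".vocabs",
        "params": model_name + ".params",
        "hp": model_name + ".hp",
    }


def load_tagger(model_name):
    """Load a trained model (hypothetical helper; requires nagisa)."""
    import nagisa  # deferred import: needs `pip install nagisa`

    return nagisa.Tagger(**model_paths(model_name))
```

For a model trained with model_name="sample", load_tagger("sample") would look for sample.vocabs, sample.params, and sample.hp.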

nagisa.wakati(text, lower=False)

Segment the given sentence into words and return them.

args:
  • text (str): An input sentence.

  • lower (bool): If True, all uppercase characters in the words are converted to lowercase.

return:
  • words (list): A list of the words.

nagisa.tagging(text, lower=False)

Return the words with POS-tags of the given sentence.

args:
  • text (str): An input sentence.

  • lower (bool): If True, all uppercase characters in the words are converted to lowercase.

return:
  • object : The object of the words with POS-tags.
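The sketch below pairs each word with its POS-tag. It assumes the returned object exposes words and postags attributes, as in the project's README; that detail is not stated in this reference. The helper name tag_pairs is illustrative.

```python
def tag_pairs(text, lower=False, tagger=None):
    """Return a list of (word, POS-tag) tuples (hypothetical helper)."""
    if tagger is None:
        import nagisa  # deferred import: needs `pip install nagisa`

        tagger = nagisa
    tagged = tagger.tagging(text, lower=lower)
    # Assumption: the result has parallel `words` and `postags` lists.
    return list(zip(tagged.words, tagged.postags))
```

Returning plain tuples makes the result easy to iterate over or to serialize, independent of the tagger's own result type.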

nagisa.filter(text, lower=False, filter_postags=None)

Return the filtered words with POS-tags of the given sentence.

args:
  • text (str): An input sentence.

  • lower (bool): If True, all uppercase characters in the words are converted to lowercase.

  • filter_postags (list): POS-tags to filter out. Words tagged with any of these POS-tags are removed from the result.

return:
  • object : The object of the words with POS-tags.
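As a sketch of a typical use, the helper below drops function words before further processing. The helper name remove_postags is illustrative, and the example tags in the usage note (助詞 for particles, 助動詞 for auxiliary verbs) are assumed to be present in the model's tag set. The words and postags attributes on the result are likewise an assumption taken from the project's README.

```python
def remove_postags(text, postags, lower=False, tagger=None):
    """Return (word, POS-tag) tuples, excluding the given POS-tags."""
    if tagger is None:
        import nagisa  # deferred import: needs `pip install nagisa`

        tagger = nagisa
    result = tagger.filter(text, lower=lower, filter_postags=postags)
    # Assumption: the result has parallel `words` and `postags` lists.
    return list(zip(result.words, result.postags))
```

For example, remove_postags(text, ["助詞", "助動詞"]) would keep everything except particles and auxiliary verbs.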

nagisa.extract(text, lower=False, extract_postags=None)

Return the extracted words with POS-tags of the given sentence.

args:
  • text (str): An input sentence.

  • lower (bool): If True, all uppercase characters in the words are converted to lowercase.

  • extract_postags (list): POS-tags to extract. Only words tagged with one of these POS-tags are returned.

return:
  • object : The object of the words with POS-tags.
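One common use of extract is counting content words across several sentences. The sketch below does this with collections.Counter; the helper name count_words_with_postags is illustrative, and it assumes the returned object has a words attribute (as in the project's README).

```python
from collections import Counter


def count_words_with_postags(texts, postags, tagger=None):
    """Count words whose POS-tag is in `postags` across `texts`."""
    if tagger is None:
        import nagisa  # deferred import: needs `pip install nagisa`

        tagger = nagisa
    counts = Counter()
    for text in texts:
        extracted = tagger.extract(text, extract_postags=postags)
        # Assumption: the result exposes the kept words as `words`.
        counts.update(extracted.words)
    return counts
```

For example, count_words_with_postags(sentences, ["名詞"]) would give noun frequencies, assuming 名詞 is the noun tag in the model's tag set.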

nagisa.decode(words, lower=False)

Tag a list of already segmented words and return the words with their tags.

args:
  • words (list): Input words.

  • lower (bool, optional): If True, all uppercase characters in the words are converted to lowercase.

return:
  • object : The object of the words with tags.
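decode is useful when the text has already been segmented by another tool. The sketch below assumes whitespace-separated input and passes the return value of decode through unchanged, since its exact shape is not specified here. The helper name tag_pretokenized is illustrative.

```python
def tag_pretokenized(sentence, lower=False, tagger=None):
    """Tag a whitespace-segmented sentence (hypothetical helper)."""
    if tagger is None:
        import nagisa  # deferred import: needs `pip install nagisa`

        tagger = nagisa
    words = sentence.split()  # decode expects a list of words, not a string
    return words, tagger.decode(words, lower=lower)
```

Keeping the input words alongside the result makes it easy to line the tags up with the caller's own segmentation.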

nagisa.fit(train_file, dev_file, test_file, model_name, dict_file=None, emb_file=None, delimiter='\t', newline='EOS', layers=1, min_count=2, decay=1, epoch=10, window_size=3, dim_uni=32, dim_bi=16, dim_word=16, dim_ctype=8, dim_tagemb=16, dim_hidden=100, learning_rate=0.1, dropout_rate=0.3, seed=1234)

Train a joint word segmentation and sequence labeling (e.g., POS-tagging, NER) model.

args:
  • train_file (str): Path to a train file.

  • dev_file (str): Path to a development file for early stopping.

  • test_file (str): Path to a test file for evaluation.

  • model_name (str): Output model filename.

  • dict_file (str, optional): Path to a dictionary file.

  • emb_file (str, optional): Path to a pre-trained embedding file (word2vec format).

  • delimiter (str, optional): Delimiter that separates the word and its tag on each line.

  • newline (str, optional): Marker line that separates sentences in the file.

  • layers (int, optional): Number of RNN layers.

  • min_count (int, optional): Ignores all words with total frequency lower than this.

  • decay (int, optional): Learning rate decay.

  • epoch (int, optional): Number of training epochs.

  • window_size (int, optional): Window size of the context characters for word segmentation.

  • dim_uni (int, optional): Dimensionality of the char-unigram vectors.

  • dim_bi (int, optional): Dimensionality of the char-bigram vectors.

  • dim_word (int, optional): Dimensionality of the word vectors.

  • dim_ctype (int, optional): Dimensionality of the character-type vectors.

  • dim_tagemb (int, optional): Dimensionality of the tag vectors.

  • dim_hidden (int, optional): Dimensionality of the BiLSTM’s hidden layer.

  • learning_rate (float, optional): Learning rate of SGD.

  • dropout_rate (float, optional): Dropout rate of the input vector for BiLSTMs.

  • seed (int, optional): Random seed.

return:
  • Nothing. When training finishes, however, the three model files (*.vocabs, *.params, *.hp) are saved in the current directory.
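The sketch below prepares a data file and starts training. It assumes the file layout implied by the delimiter and newline arguments: one word and its tag per line, separated by delimiter, with a newline marker line ('EOS' by default) closing each sentence. The helpers write_corpus and train_model are illustrative names, not part of nagisa.

```python
def write_corpus(path, sentences, delimiter="\t", newline="EOS"):
    """Write sentences, each a list of (word, tag) pairs, to `path`."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            for word, tag in sentence:
                f.write(word + delimiter + tag + "\n")
            # Assumed layout: one marker line closes each sentence.
            f.write(newline + "\n")


def train_model(train_file, dev_file, test_file, model_name, **options):
    """Call nagisa.fit with the documented arguments (requires nagisa)."""
    import nagisa  # deferred import: needs `pip install nagisa`

    # `options` may carry any documented keyword, e.g. epoch=20 or seed=1.
    nagisa.fit(train_file=train_file, dev_file=dev_file,
               test_file=test_file, model_name=model_name, **options)
```

After train_model("sample.train", "sample.dev", "sample.test", "sample") finishes, the files sample.vocabs, sample.params, and sample.hp should be present in the current directory.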