Tutorial (Stopwords for nagisa)

How to use stopwords for nagisa

This tutorial shows how to use stopwords for Japanese text with nagisa.

Install Python libraries

Before we get started, please run the following command to install the libraries used in this tutorial.

pip install nagisa
pip install datasets

Get stopwords

The stopword list contains frequently used Japanese words, tokenized according to the rules of nagisa, a Japanese text analysis library.

The list was built by extracting the 100 most frequently used words from the CC-100 dataset and Wikipedia.
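As a rough illustration of what "extracting the most commonly used words" can look like, here is a minimal sketch that counts tokens across already-tokenized documents and keeps the most frequent ones. It is not the procedure actually used to build the list: the helper name `top_k_words` and the toy corpus are hypothetical, and a real run would presumably tokenize the CC-100 and Wikipedia text with nagisa first.

```python
from collections import Counter


def top_k_words(tokenized_docs, k=100):
    """Return the k most frequent tokens across all documents."""
    counts = Counter()
    for tokens in tokenized_docs:
        counts.update(tokens)
    return [word for word, _ in counts.most_common(k)]


# Toy corpus for illustration only (hypothetical, not taken from CC-100 or Wikipedia).
toy_corpus = [
    ["猫", "は", "魚", "が", "好き"],
    ["犬", "は", "肉", "が", "嫌い"],
    ["鳥", "は", "かわいい"],
]

top_words = top_k_words(toy_corpus, k=2)
print(top_words)  # ['は', 'が']
```

Function words such as particles dominate the counts even in this tiny corpus, which is why a frequency-based list works reasonably well as a stopword list.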

To access the list, run the script below.

python tutorial_stopwords.py

tutorial_stopwords.py

from datasets import load_dataset

dataset = load_dataset("taishi-i/nagisa_stopwords")

# the top 100 most commonly used words
words = dataset["nagisa_stopwords"]["words"]

# the part-of-speech list for the top 100 most commonly used words
postags = dataset["nagisa_stopwords"]["postags"]
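Once the list is loaded, a typical next step is to drop stopwords from tokenized text. The sketch below shows one way to do that, under some assumptions: `remove_stopwords` is a hypothetical helper, and the sample values are hard-coded so the snippet runs without a download. In practice the stopwords would be the `words` list loaded above, and the tokens would come from something like `nagisa.tagging(text).words`.

```python
def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not stopwords, preserving their order."""
    stopword_set = set(stopwords)  # set lookup is O(1) per token
    return [token for token in tokens if token not in stopword_set]


# Illustrative values only (assumptions, not the contents of the real list).
sample_stopwords = ["は", "の", "です", "。"]
sample_tokens = ["今日", "は", "天気", "の", "良い", "日", "です", "。"]

filtered = remove_stopwords(sample_tokens, sample_stopwords)
print(filtered)  # ['今日', '天気', '良い', '日']
```

Converting the list to a set first keeps the filter fast when it is applied to many documents.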