Tutorial (Stopwords for nagisa)

How to use stopwords for nagisa

This tutorial shows how to use stopwords for Japanese text with nagisa.

Install Python libraries

Before getting started, run the following commands to install the libraries used in this tutorial.

pip install nagisa
pip install datasets

Get stopwords

The nagisa stopword list contains frequently used Japanese words, tokenized according to the rules of nagisa, a Japanese text analysis library.

The list was constructed by extracting the 100 most frequently used words from the CC-100 dataset and Wikipedia.
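
As a rough illustration of how a frequency-based list like this can be built, the sketch below counts tokens over an already tokenized corpus and keeps the most common ones. The function name `top_k_words` and the tiny sample corpus are hypothetical. The real list was derived from CC-100 and Wikipedia, and the exact procedure used to build it is not described in this tutorial.

```python
from collections import Counter


def top_k_words(tokenized_sentences, k=100):
    """Return the k most frequent tokens across all sentences."""
    counts = Counter()
    for tokens in tokenized_sentences:
        counts.update(tokens)
    return [word for word, _ in counts.most_common(k)]


# Hypothetical, already tokenized sample corpus (not real CC-100 data).
corpus = [
    ["私", "は", "猫", "が", "好き", "です"],
    ["猫", "は", "魚", "が", "好き", "です"],
    ["犬", "は", "肉", "を", "食べ", "ます"],
]

# Keep only the two most frequent tokens in the sample corpus.
print(top_k_words(corpus, k=2))
```

In practice the token lists would come from a tokenizer such as nagisa rather than being written by hand, and `k` would be set to 100 to match the size of the published list.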

To access the word list, run the program below.

python tutorial_stopwords.py
tutorial_stopwords.py
from datasets import load_dataset

dataset = load_dataset("taishi-i/nagisa_stopwords")

# the top 100 most commonly used words
words = dataset["nagisa_stopwords"]["words"]

# the part-of-speech list for the top 100 most commonly used words
postags = dataset["nagisa_stopwords"]["postags"]
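
A stopword list is typically used to drop those words from tokenized text. The sketch below is a minimal, library-free illustration of that step: `remove_stopwords` is a hypothetical helper, and the token list and the small stopword list are hard-coded samples standing in for tokenizer output and for the `words` list loaded from the dataset.

```python
def remove_stopwords(tokens, stopwords):
    """Return the tokens that do not appear in the stopword list."""
    stopword_set = set(stopwords)  # set lookup is faster than a list scan
    return [token for token in tokens if token not in stopword_set]


# Hypothetical samples; in real use, tokens would come from the tokenizer
# and the stopwords from the dataset loaded in the tutorial.
sample_tokens = ["Python", "で", "簡単", "に", "使える", "ツール", "です"]
sample_stopwords = ["で", "に", "です", "は", "の"]

filtered = remove_stopwords(sample_tokens, sample_stopwords)
print(filtered)
```

With nagisa installed, the tokens could be obtained from `nagisa.tagging(text).words` and the stopwords from the `words` list, assuming it is a plain list of strings.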