Japanese Topic Modeling

Run topic modeling using the Livedoor News Corpus

This tutorial shows how to run topic modeling on Japanese news articles, using nagisa for tokenization and scikit-learn’s LDA (Latent Dirichlet Allocation) implementation.

Install Python libraries

Before we get started, please run the following commands to install the libraries used in this tutorial.

pip install nagisa
pip install scikit-learn
pip install requests
pip install tqdm

Run the script

Run the following command to download the corpus and extract topics. On the first run, the Livedoor News Corpus (~45 MB) is downloaded automatically and extracted into the ldcc/ directory; later runs reuse that directory. By default, the model is built from the title of each article (TEXT_TYPE = "title"). To use the full article body instead, change TEXT_TYPE to "body" in tutorial_topic_model.py.

python tutorial_topic_model.py
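
To make the TEXT_TYPE switch concrete, here is a minimal sketch of splitting one article into its parts. It assumes the layout described in the corpus README (line 1: article URL, line 2: timestamp, line 3: title, remaining lines: body). The helper name split_article and the sample text are hypothetical and are not part of the tutorial script.

```python
def split_article(raw_text):
    """Split one article file into (url, date, title, body).

    Assumes line 1 is the URL, line 2 the timestamp, line 3 the title,
    and everything after that the body.
    """
    lines = raw_text.splitlines()
    url = lines[0].strip() if len(lines) > 0 else ""
    date = lines[1].strip() if len(lines) > 1 else ""
    title = lines[2].strip() if len(lines) > 2 else ""
    body = "\n".join(lines[3:]).strip()
    return url, date, title, body


# Made-up sample in the assumed layout (not a real article).
sample = (
    "http://news.example.com/article/detail/0000000/\n"
    "2012-01-01T00:00:00+0900\n"
    "サンプル記事のタイトル\n"
    "本文の1行目。\n"
    "本文の2行目。\n"
)

url, date, title, body = split_article(sample)

# Pick the text the same way a "title" / "body" setting would.
text_by_type = {"title": title, "body": body}
print(text_by_type["title"])
print(text_by_type["body"])
```

Titles are short, so a title-only run tokenizes quickly; article bodies give the model more words per document at the cost of a longer tokenization step.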

tutorial_topic_model.py

import io
import os
import tarfile

import nagisa
import requests
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from tqdm import tqdm

CORPUS_URL = "https://www.rondhuit.com/download/ldcc-20140209.tar.gz"
CORPUS_DIR = "ldcc"
N_TOPICS = 9
N_TOP_WORDS = 10
MAX_ITER = 20
TEXT_TYPE = "title"  # "title" or "body"


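def download_corpus():
    """Download and extract the corpus once; later runs reuse CORPUS_DIR."""
    if os.path.isdir(CORPUS_DIR):
        return CORPUS_DIR

    resp = requests.get(CORPUS_URL, timeout=120)
    resp.raise_for_status()

    # filter="data" rejects unsafe archive members; it needs a Python
    # version whose tarfile module supports extraction filters.
    with tarfile.open(fileobj=io.BytesIO(resp.content), mode="r:*") as tar:
        tar.extractall(path=".", filter="data")

    # The archive unpacks into a directory named "text"; rename it.
    extracted = "text"
    os.rename(extracted, CORPUS_DIR)
    return CORPUS_DIR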


def load_documents(corpus_dir, text_type=TEXT_TYPE):
    documents = []
    for category in sorted(os.listdir(corpus_dir)):
        cat_path = os.path.join(corpus_dir, category)
        if not os.path.isdir(cat_path):
            continue
        for fname in sorted(os.listdir(cat_path)):
            if not fname.endswith(".txt") or fname.startswith("LICENSE"):
                continue
            fpath = os.path.join(cat_path, fname)
            with open(fpath, encoding="utf-8") as f:
                lines = f.readlines()

            # Corpus file layout: line 1 = URL, line 2 = timestamp,
            # line 3 = title, line 4 onward = body.
            if text_type == "title":
                text = lines[2].strip() if len(lines) > 2 else ""
            else:
                text = "".join(lines[3:]).strip()

            if text:
                documents.append(text)
    return documents


def tokenize_documents(documents):
    stopwords = nagisa.stopwords
    tokenized = []
    for doc in tqdm(documents):
        tokens = nagisa.tagging(doc)
        words = [w for w in tokens.words if len(w) > 1 and w not in stopwords]
        tokenized.append(" ".join(words))
    return tokenized


def run_lda(tokenized_docs):
    # Drop terms found in more than 85% of documents or in fewer than 5.
    vectorizer = CountVectorizer(max_df=0.85, min_df=5)
    dtm = vectorizer.fit_transform(tokenized_docs)
    feature_names = vectorizer.get_feature_names_out()

    lda = LatentDirichletAllocation(
        n_components=N_TOPICS,
        max_iter=MAX_ITER,
        learning_method="online",
        random_state=1234,
    )
    lda.fit(dtm)

    for topic_idx, topic in enumerate(lda.components_):
        top_indices = topic.argsort()[-N_TOP_WORDS:][::-1]
        top_words = [feature_names[i] for i in top_indices]
        print(f"\nTopic {topic_idx + 1}:")
        print(f"\t{', '.join(top_words)}")


def main():
    corpus_dir = download_corpus()
    documents = load_documents(corpus_dir)
    tokenized_documents = tokenize_documents(documents)
    run_lda(tokenized_documents)


if __name__ == "__main__":
    main()
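
One detail worth spelling out: tokenize_documents joins the tokens with spaces because CountVectorizer splits the text again on its own. The sketch below uses only the standard library to imitate scikit-learn's documented default token_pattern, (?u)\b\w\w+\b, which keeps tokens of two or more word characters. The sample string is made up, and the real vectorizer additionally lowercases text and applies min_df / max_df.

```python
import re

# Default token_pattern documented for scikit-learn's CountVectorizer.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Hypothetical space-joined tokenizer output for one title.
joined = "映画 の 公開 日 が 決定"

tokens = TOKEN_PATTERN.findall(joined)
print(tokens)  # ['映画', '公開', '決定']
```

Single-character tokens are dropped by this pattern as well, so the len(w) > 1 filter in tokenize_documents and the vectorizer's default behavior agree.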

Running the script prints output similar to the following.

100%|████████████████████████| 7367/7367 [00:28<00:00, 261.03it/s]

Topic 1:
    スマートフォン, Android, アプリ, iPhone, スマホ, 端末, 対応, 無料, 利用, サービス

Topic 2:
    映画, 公開, 主演, 出演, 作品, 監督, 俳優, 女優, ドラマ, 役

Topic 3:
    選手, 試合, チーム, 得点, 優勝, サッカー, 野球, 監督, シーズン, 日本

Topic 4:
    家電, 発売, 価格, 製品, メーカー, 対応, 機能, 搭載, カメラ, テレビ

Topic 5:
    女性, 男性, 生活, 仕事, 結婚, 子供, 美容, ファッション, 料理, おすすめ

Topic 6:
    サービス, 企業, 事業, 市場, 展開, ビジネス, 提供, 投資, 成長, 戦略

Topic 7:
    政府, 政治, 経済, 日本, 問題, 社会, 国, 対策, 発表, 制度

Topic 8:
    料理, レシピ, 食材, 食べ, 味, 食事, 野菜, 作り, 簡単, おいしい

Topic 9:
    音楽, ライブ, アルバム, 曲, アーティスト, 歌, リリース, コンサート, シングル, ツアー

Several topics line up with the 9 news categories in the Livedoor News Corpus, and the top words make the theme of each topic easy to read. LDA is unsupervised and never sees the category labels, so the correspondence is approximate: some topics blend more than one category, and some categories (for example dokujo-tsushin) have no topic of their own. The table below lists a likely category for each topic.

Topic	Likely category	Key words
Topic 1	smax (smartphones)	スマートフォン, Android, アプリ, iPhone
Topic 2	movie-enter (movies)	映画, 公開, 主演, 監督
Topic 3	sports-watch (sports)	選手, 試合, チーム, サッカー, 野球
Topic 4	kaden-channel (home appliances)	家電, 発売, 価格, 製品, カメラ
Topic 5	peachy / livedoor-homme (lifestyle)	女性, 美容, ファッション, おすすめ
Topic 6	it-life-hack (IT / business)	企業, 市場, ビジネス, サービス
Topic 7	topic-news (general news)	政府, 政治, 経済, 社会
Topic 8	no clear match (food / culture)	料理, レシピ, 食材, 食事
Topic 9	movie-enter (entertainment)	音楽, ライブ, アルバム, アーティスト
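
One way to check a mapping like the one in the table is to count, for each topic, which categories its documents come from. The sketch below does this with the standard library only. The doc_topic rows and the categories list are made-up stand-ins: in the script, the former would come from lda.transform(dtm), and the latter would have to be recorded in load_documents, which the tutorial code does not currently do.

```python
from collections import Counter, defaultdict

# Hypothetical per-document topic weights (one row per document).
doc_topic = [
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.10, 0.85, 0.05],
    [0.20, 0.30, 0.50],
    [0.15, 0.75, 0.10],
]
# Hypothetical category label for each document, in the same order.
categories = ["smax", "smax", "movie-enter", "sports-watch", "movie-enter"]


def dominant_topic(weights):
    """Return the index of the largest weight in one row."""
    return max(range(len(weights)), key=lambda i: weights[i])


# For each topic, count the categories of the documents it dominates.
topic_to_categories = defaultdict(Counter)
for weights, category in zip(doc_topic, categories):
    topic_to_categories[dominant_topic(weights)][category] += 1

for topic_idx in sorted(topic_to_categories):
    counts = topic_to_categories[topic_idx].most_common()
    print(f"Topic {topic_idx + 1}: {counts}")
```

A topic whose counts are spread across several categories is a sign that it does not map cleanly onto a single category.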