Japanese Topic Modeling

Run topic modeling using the Livedoor News Corpus

This tutorial provides an example of running topic modeling on Japanese news articles by using nagisa for tokenization and scikit-learn’s LDA (Latent Dirichlet Allocation).

Install Python libraries

Before we get started, please run the following commands to install the libraries used in this tutorial.

pip install nagisa
pip install scikit-learn
pip install requests
pip install tqdm

Run the script

Run the following command to download the corpus and extract topics. The Livedoor News Corpus (~45 MB) is downloaded automatically on the first run and stored in the ldcc/ directory; later runs reuse that directory. By default, the model is built from the title of each article (TEXT_TYPE = "title"). To use the full article body instead, set TEXT_TYPE to "body" near the top of the script.

python tutorial_topic_model.py

tutorial_topic_model.py

import io
import os
import tarfile

import nagisa
import requests
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from tqdm import tqdm

CORPUS_URL = "https://www.rondhuit.com/download/ldcc-20140209.tar.gz"
CORPUS_DIR = "ldcc"
N_TOPICS = 9
N_TOP_WORDS = 10
MAX_ITER = 20
TEXT_TYPE = "title"  # or "body"


def download_corpus():
    # Reuse the corpus if it has already been downloaded and extracted.
    if os.path.isdir(CORPUS_DIR):
        return CORPUS_DIR

    resp = requests.get(CORPUS_URL, timeout=120)
    resp.raise_for_status()

    # filter="data" needs Python 3.12+ (or a patched 3.8-3.11 release).
    with tarfile.open(fileobj=io.BytesIO(resp.content), mode="r:*") as tar:
        tar.extractall(path=".", filter="data")

    # The archive unpacks into a directory named "text".
    extracted = "text"
    os.rename(extracted, CORPUS_DIR)
    return CORPUS_DIR


def load_documents(corpus_dir, text_type=TEXT_TYPE):
    documents = []
    for category in sorted(os.listdir(corpus_dir)):
        cat_path = os.path.join(corpus_dir, category)
        # Skip top-level files; only category directories hold articles.
        if not os.path.isdir(cat_path):
            continue
        for fname in sorted(os.listdir(cat_path)):
            if not fname.endswith(".txt") or fname.startswith("LICENSE"):
                continue
            fpath = os.path.join(cat_path, fname)
            with open(fpath, encoding="utf-8") as f:
                lines = f.readlines()

            # File layout: line 1 = URL, line 2 = date, line 3 = title,
            # line 4 onward = body.
            if text_type == "title":
                text = lines[2].strip() if len(lines) > 2 else ""
            else:
                # Title line plus the article body.
                text = "".join(lines[2:]).strip()

            if text:
                documents.append(text)
    return documents


def tokenize_documents(documents):
    stopwords = nagisa.stopwords
    tokenized = []
    for doc in tqdm(documents):
        tokens = nagisa.tagging(doc)
        # Drop single-character tokens and stopwords.
        words = [w for w in tokens.words if len(w) > 1 and w not in stopwords]
        # Join with spaces so that CountVectorizer can split the words again.
        tokenized.append(" ".join(words))
    return tokenized


def run_lda(tokenized_docs):
    # Ignore words found in more than 85% of documents or in fewer than 5.
    vectorizer = CountVectorizer(max_df=0.85, min_df=5)
    dtm = vectorizer.fit_transform(tokenized_docs)
    feature_names = vectorizer.get_feature_names_out()

    lda = LatentDirichletAllocation(
        n_components=N_TOPICS,
        max_iter=MAX_ITER,
        learning_method="online",
        random_state=1234,
    )
    lda.fit(dtm)

    # Print the N_TOP_WORDS highest-weighted words of each topic.
    for topic_idx, topic in enumerate(lda.components_):
        top_indices = topic.argsort()[-N_TOP_WORDS:][::-1]
        top_words = [feature_names[i] for i in top_indices]
        print(f"\nTopic {topic_idx + 1}:")
        print(f"\t{', '.join(top_words)}")


def main():
    corpus_dir = download_corpus()
    documents = load_documents(corpus_dir)
    tokenized_documents = tokenize_documents(documents)
    run_lda(tokenized_documents)


if __name__ == "__main__":
    main()
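
The title/body switch in load_documents relies on the layout of each article file in the corpus: the first line is the article URL, the second the date, the third the title, and the remaining lines the body. The following sketch shows that split on an in-memory string. Note that split_article is a hypothetical helper that is not part of the tutorial script, and the sample text is made up purely to mimic the layout.

```python
def split_article(raw_text):
    """Split the text of one article file into URL, date, title, and body."""
    lines = raw_text.splitlines()
    # Pad short or truncated input so the indexing below cannot fail.
    lines += [""] * max(0, 3 - len(lines))
    return {
        "url": lines[0].strip(),
        "date": lines[1].strip(),
        "title": lines[2].strip(),
        "body": "\n".join(lines[3:]).strip(),
    }


# Made-up sample that only mimics the file layout (not real corpus data).
sample = (
    "http://news.example.com/article/0000000/\n"
    "2012-01-01T09:00:00+0900\n"
    "サンプル記事のタイトル\n"
    "本文の1行目。\n"
    "本文の2行目。\n"
)

article = split_article(sample)
print(article["title"])
print(article["body"])
```

With TEXT_TYPE = "title" the script keeps only the third line of each file, which is why a run on titles is much faster than a run on full articles.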

Running the script prints the tokenization progress bar, followed by the top words of each topic. This is an example of the output.

100%|████████████████████████| 7367/7367 [00:28<00:00, 261.03it/s]

Topic 1:
    スマートフォン, Android, アプリ, iPhone, スマホ, 端末, 対応, 無料, 利用, サービス

Topic 2:
    映画, 公開, 主演, 出演, 作品, 監督, 俳優, 女優, ドラマ, 役

Topic 3:
    選手, 試合, チーム, 得点, 優勝, サッカー, 野球, 監督, シーズン, 日本

Topic 4:
    家電, 発売, 価格, 製品, メーカー, 対応, 機能, 搭載, カメラ, テレビ

Topic 5:
    女性, 男性, 生活, 仕事, 結婚, 子供, 美容, ファッション, 料理, おすすめ

Topic 6:
    サービス, 企業, 事業, 市場, 展開, ビジネス, 提供, 投資, 成長, 戦略

Topic 7:
    政府, 政治, 経済, 日本, 問題, 社会, 国, 対策, 発表, 制度

Topic 8:
    料理, レシピ, 食材, 食べ, 味, 食事, 野菜, 作り, 簡単, おいしい

Topic 9:
    音楽, ライブ, アルバム, 曲, アーティスト, 歌, リリース, コンサート, シングル, ツアー
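
Each list above holds the N_TOP_WORDS vocabulary entries with the largest weights in one row of lda.components_. The ranking step can be sketched in plain Python as follows. Here top_words is a hypothetical helper, and the vocabulary and weights are made-up numbers chosen for illustration, not values from a fitted model.

```python
def top_words(weights, vocabulary, n_top):
    """Return the n_top words with the largest weights, highest first."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [vocabulary[i] for i in ranked[:n_top]]


# Made-up vocabulary and topic-word weights (one row per topic).
vocabulary = ["映画", "試合", "公開", "選手", "監督"]
topic_word_weights = [
    [9.1, 0.2, 7.4, 0.1, 3.3],
    [0.3, 8.8, 0.2, 6.5, 2.9],
]

topic_summaries = [top_words(row, vocabulary, 3) for row in topic_word_weights]
for topic_idx, words in enumerate(topic_summaries):
    print(f"Topic {topic_idx + 1}: {', '.join(words)}")
```

A word such as 監督 can rank highly in more than one topic, as it does in Topics 2 and 3 of the example output, because LDA assigns every word a weight in every topic.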

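The number of topics (N_TOPICS = 9) matches the number of news categories in the Livedoor News Corpus. LDA is unsupervised, so topics are not guaranteed to align one-to-one with categories, but the top words show that the model picks up recognizable themes. The table below gives a likely category for each topic.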

| Topic | Likely category | Key words |
|---|---|---|
| Topic 1 | smax (smartphones) | スマートフォン, Android, アプリ, iPhone |
| Topic 2 | movie-enter (movies) | 映画, 公開, 主演, 監督 |
| Topic 3 | sports-watch (sports) | 選手, 試合, チーム, サッカー, 野球 |
| Topic 4 | kaden-channel (home appliances) | 家電, 発売, 価格, 製品, カメラ |
| Topic 5 | peachy / livedoor-homme (lifestyle) | 女性, 美容, ファッション, おすすめ |
| Topic 6 | it-life-hack (IT / business) | 企業, 市場, ビジネス, サービス |
| Topic 7 | topic-news (general news) | 政府, 政治, 経済, 社会 |
| Topic 8 | people (food / culture) | 料理, レシピ, 食材, 食事 |
| Topic 9 | movie-enter / people (entertainment) | 音楽, ライブ, アルバム, アーティスト |
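
The category labels in the table are assigned by eye from the top words. One way to check such a mapping is to count which category dominates among the documents assigned to each topic. The sketch below assumes two parallel lists that the tutorial script does not produce as written: categories, the directory name each document came from (load_documents would have to return it as well), and topic_ids, the most probable topic per document (for example the row-wise argmax of lda.transform(dtm)). The function name and the values shown are made up for illustration.

```python
from collections import Counter, defaultdict


def dominant_category_per_topic(topic_ids, categories):
    """Map each topic id to (most frequent category, its share of the topic)."""
    counts = defaultdict(Counter)
    for topic_id, category in zip(topic_ids, categories):
        counts[topic_id][category] += 1

    result = {}
    for topic_id, counter in sorted(counts.items()):
        category, n_docs = counter.most_common(1)[0]
        result[topic_id] = (category, n_docs / sum(counter.values()))
    return result


# Made-up assignments: one entry per document.
topic_ids = [0, 0, 0, 1, 1, 2, 2, 2, 2]
categories = [
    "smax", "smax", "it-life-hack",
    "movie-enter", "movie-enter",
    "sports-watch", "sports-watch", "sports-watch", "topic-news",
]

summary = dominant_category_per_topic(topic_ids, categories)
for topic_id, (category, share) in summary.items():
    print(f"Topic {topic_id + 1}: {category} ({share:.0%})")
```

A share close to 100% suggests that a topic lines up with a single category, while a low share points to a topic that mixes several categories.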