Japanese Topic Modeling
Run topic modeling using the Livedoor News Corpus
This tutorial provides an example of running topic modeling on Japanese news articles by using nagisa for tokenization and scikit-learn’s LDA (Latent Dirichlet Allocation).
Install python libraries
Before we get started, please run the following commands to install the libraries used in this tutorial.
pip install nagisa
pip install scikit-learn
pip install requests
pip install tqdm
Run the script
Run the following command to download the corpus and extract topics.
The Livedoor News Corpus (~45 MB) is downloaded automatically on the first run into the ldcc/ directory.
By default, the model is built from the title of each article (TEXT_TYPE = "title").
To use the full article body instead, change the value to "body".
python tutorial_topic_model.py
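The TEXT_TYPE switch depends on how each article file is laid out. The sketch below is a minimal illustration, assuming the commonly described layout of one article per file: the URL on line 1, the timestamp on line 2, the title on line 3, and the body from line 4 onward. The helper `split_article` and the sample text are hypothetical and are not part of the tutorial script.

```python
def split_article(raw_text):
    """Split one article file into URL, timestamp, title, and body.

    Assumed layout (not stated in this tutorial): line 1 is the article URL,
    line 2 the timestamp, line 3 the title, and the rest is the body.
    """
    lines = raw_text.splitlines()
    url = lines[0].strip() if len(lines) > 0 else ""
    timestamp = lines[1].strip() if len(lines) > 1 else ""
    title = lines[2].strip() if len(lines) > 2 else ""
    body = "\n".join(lines[3:]).strip()
    return {"url": url, "timestamp": timestamp, "title": title, "body": body}


# Hypothetical article text, written only to illustrate the layout.
sample = (
    "http://news.example.com/article/detail/0000000/\n"
    "2012-01-01T09:00:00+0900\n"
    "サンプル記事のタイトル\n"
    "本文の1行目です。\n"
    "本文の2行目です。\n"
)

article = split_article(sample)
print(article["title"])
print(article["body"])
```

Titles are short, so a title-only model tokenizes quickly but sees few words per document. Switching to the body gives LDA more co-occurrence evidence at the cost of a much longer tokenization step.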
import io
import os
import tarfile

import nagisa
import requests
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from tqdm import tqdm

CORPUS_URL = "https://www.rondhuit.com/download/ldcc-20140209.tar.gz"
CORPUS_DIR = "ldcc"
N_TOPICS = 9
N_TOP_WORDS = 10
MAX_ITER = 20
TEXT_TYPE = "title"  # or "body"


def download_corpus():
    # Skip the download if the corpus has already been extracted.
    if os.path.isdir(CORPUS_DIR):
        return CORPUS_DIR

    resp = requests.get(CORPUS_URL, timeout=120)
    resp.raise_for_status()

    # filter="data" needs a Python version that supports tarfile extraction filters.
    with tarfile.open(fileobj=io.BytesIO(resp.content), mode="r:*") as tar:
        tar.extractall(path=".", filter="data")

    # The archive unpacks into "text/"; rename it to the corpus directory.
    extracted = "text"
    os.rename(extracted, CORPUS_DIR)
    return CORPUS_DIR


def load_documents(corpus_dir, text_type=TEXT_TYPE):
    documents = []
    for category in sorted(os.listdir(corpus_dir)):
        cat_path = os.path.join(corpus_dir, category)
        if not os.path.isdir(cat_path):
            continue
        for fname in sorted(os.listdir(cat_path)):
            if not fname.endswith(".txt") or fname.startswith("LICENSE"):
                continue
            fpath = os.path.join(cat_path, fname)
            with open(fpath, encoding="utf-8") as f:
                lines = f.readlines()

            # Each file holds the URL, the timestamp, the title, then the body.
            if text_type == "title":
                text = lines[2].strip() if len(lines) > 2 else ""
            else:
                # Title line plus the body text.
                text = "".join(lines[2:]).strip()

            if text:
                documents.append(text)
    return documents


def tokenize_documents(documents):
    stopwords = nagisa.stopwords
    tokenized = []
    for doc in tqdm(documents):
        tokens = nagisa.tagging(doc)
        # Drop single-character tokens and stopwords.
        words = [w for w in tokens.words if len(w) > 1 and w not in stopwords]
        # Join with spaces so CountVectorizer can split the tokens again.
        tokenized.append(" ".join(words))
    return tokenized


def run_lda(tokenized_docs):
    # Ignore terms in more than 85% of documents or in fewer than 5 documents.
    vectorizer = CountVectorizer(max_df=0.85, min_df=5)
    dtm = vectorizer.fit_transform(tokenized_docs)
    feature_names = vectorizer.get_feature_names_out()

    lda = LatentDirichletAllocation(
        n_components=N_TOPICS,
        max_iter=MAX_ITER,
        learning_method="online",
        random_state=1234,
    )
    lda.fit(dtm)

    # Print the highest-weighted words of each topic, largest weight first.
    for topic_idx, topic in enumerate(lda.components_):
        top_indices = topic.argsort()[-N_TOP_WORDS:][::-1]
        top_words = [feature_names[i] for i in top_indices]
        print(f"\nTopic {topic_idx + 1}:")
        print(f"\t{', '.join(top_words)}")


def main():
    corpus_dir = download_corpus()
    documents = load_documents(corpus_dir)
    tokenized_documents = tokenize_documents(documents)
    run_lda(tokenized_documents)


if __name__ == "__main__":
    main()
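The script passes max_df=0.85 and min_df=5 to CountVectorizer, which prunes the vocabulary by document frequency before LDA sees it. The following is a minimal pure-Python sketch of that pruning rule, not scikit-learn's implementation. The function `prune_vocabulary` and the toy corpus are hypothetical, and the sketch splits on whitespace only, which is simpler than CountVectorizer's default token pattern.

```python
from collections import Counter


def prune_vocabulary(tokenized_docs, min_df=5, max_df=0.85):
    """Return the sorted terms that survive document-frequency pruning.

    min_df is an absolute document count and max_df a fraction of all
    documents, matching how the two values are used in this tutorial.
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        # Count each term once per document, however often it repeats.
        doc_freq.update(set(doc.split()))

    max_count = max_df * n_docs
    return sorted(
        term for term, df in doc_freq.items()
        if df >= min_df and df <= max_count
    )


# Hypothetical toy corpus of space-joined tokens (10 documents).
docs = ["映画 公開"] * 5 + ["映画 試合"] * 4 + ["映画 選手"]
print(prune_vocabulary(docs, min_df=4, max_df=0.85))
# prints ['公開', '試合']
```

In this toy corpus, 映画 appears in every document and is dropped as too common, while 選手 appears only once and is dropped as too rare. With short titles as input, min_df in particular can remove a large share of the vocabulary, so it may be worth adjusting when switching between "title" and "body".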
This is an example of the output.
100%|████████████████████████| 7367/7367 [00:28<00:00, 261.03it/s]
Topic 1:
スマートフォン, Android, アプリ, iPhone, スマホ, 端末, 対応, 無料, 利用, サービス
Topic 2:
映画, 公開, 主演, 出演, 作品, 監督, 俳優, 女優, ドラマ, 役
Topic 3:
選手, 試合, チーム, 得点, 優勝, サッカー, 野球, 監督, シーズン, 日本
Topic 4:
家電, 発売, 価格, 製品, メーカー, 対応, 機能, 搭載, カメラ, テレビ
Topic 5:
女性, 男性, 生活, 仕事, 結婚, 子供, 美容, ファッション, 料理, おすすめ
Topic 6:
サービス, 企業, 事業, 市場, 展開, ビジネス, 提供, 投資, 成長, 戦略
Topic 7:
政府, 政治, 経済, 日本, 問題, 社会, 国, 対策, 発表, 制度
Topic 8:
料理, レシピ, 食材, 食べ, 味, 食事, 野菜, 作り, 簡単, おいしい
Topic 9:
音楽, ライブ, アルバム, 曲, アーティスト, 歌, リリース, コンサート, シングル, ツアー
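The words listed for each topic are the vocabulary entries with the largest weights in that topic's row of lda.components_. As a small illustration of that selection step, the sketch below ranks a hand-written weight list in plain Python. The function `top_words_for_topic`, the five-word vocabulary, and the weights are hypothetical and do not come from a fitted model.

```python
def top_words_for_topic(topic_weights, feature_names, n_top=10):
    """Return the n_top terms with the largest weights, highest first."""
    ranked = sorted(
        range(len(topic_weights)),
        key=lambda i: topic_weights[i],
        reverse=True,
    )
    return [feature_names[i] for i in ranked[:n_top]]


# Hypothetical weights for one topic over a five-word vocabulary.
feature_names = ["アプリ", "映画", "試合", "端末", "料理"]
topic_weights = [12.5, 0.3, 0.1, 9.8, 0.2]

print(top_words_for_topic(topic_weights, feature_names, n_top=2))
# prints ['アプリ', '端末']
```

The weights in lda.components_ are unnormalized, so they are best read as a ranking within one topic and not as probabilities that can be compared directly across topics.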
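The Livedoor News Corpus contains 9 news categories, and N_TOPICS is set to 9 to match. LDA is unsupervised, so the topics are not tied to those categories, and the mapping is not strictly one-to-one. Even so, the top words show that most topics line up with a recognizable theme. The table below gives a likely category for each topic.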
| Topic | Likely category | Key words |
|---|---|---|
| Topic 1 | smax (smartphones) | スマートフォン, Android, アプリ, iPhone |
| Topic 2 | movie-enter (movies) | 映画, 公開, 主演, 監督 |
| Topic 3 | sports-watch (sports) | 選手, 試合, チーム, サッカー, 野球 |
| Topic 4 | kaden-channel (home appliances) | 家電, 発売, 価格, 製品, カメラ |
| Topic 5 | peachy / livedoor-homme (lifestyle) | 女性, 美容, ファッション, おすすめ |
| Topic 6 | it-life-hack (IT / business) | 企業, 市場, ビジネス, サービス |
| Topic 7 | topic-news (general news) | 政府, 政治, 経済, 社会 |
| Topic 8 | no clear match (food / cooking) | 料理, レシピ, 食材, 食事 |
| Topic 9 | movie-enter (entertainment) | 音楽, ライブ, アルバム, アーティスト |
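A mapping like the one in the table can be checked by counting, for each category, which topic is most often the dominant one among its documents. The sketch below shows only that counting step on hand-written data. The function `topic_by_category` and both input lists are hypothetical. In a real run, the category labels would have to be collected alongside the documents (the script as written does not keep them), and the dominant topics could be taken from something like `lda.transform(dtm).argmax(axis=1)`, which would also require run_lda to return the fitted model.

```python
from collections import Counter, defaultdict


def topic_by_category(categories, dominant_topics):
    """Count how often each topic is the dominant one within each category."""
    table = defaultdict(Counter)
    for category, topic in zip(categories, dominant_topics):
        table[category][topic] += 1
    return table


# Hypothetical labels and dominant-topic numbers for six documents.
categories = ["smax", "smax", "smax", "movie-enter", "movie-enter", "sports-watch"]
dominant_topics = [1, 1, 4, 2, 2, 3]

table = topic_by_category(categories, dominant_topics)
for category in sorted(table):
    topic, count = table[category].most_common(1)[0]
    share = count / sum(table[category].values())
    print(f"{category}: Topic {topic} ({share:.0%} of documents)")
```

A category whose documents spread across several topics, or a topic that dominates no category, is a sign that the topics and the editorial categories cut the corpus along different lines.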