
Pythonデータサイエンスハンドブック ―Jupyter、NumPy、pandas、Matplotlib、scikit-learnを使ったデータ分析、機械学習

by Jake VanderPlas, 菊池彰

May 2018

Intermediate to advanced

556 pages

13h 21m

Japanese

O'Reilly Japan, Inc.


Content preview from Pythonデータサイエンスハンドブック ―Jupyter、NumPy、pandas、Matplotlib、scikit-learnを使ったデータ分析、機械学習


To vectorize this data based on word counts, we create a column for the word "problem", a column for the word "evil", a column for the word "horizon", and so on. This can be done by hand, but scikit-learn's CountVectorizer spares us the tedious work.

In[7]: from sklearn.feature_extraction.text import CountVectorizer

       # `sample` is the list of short text strings defined before this excerpt begins
       vec = CountVectorizer()
       X = vec.fit_transform(sample)

Out[7]: <3x5 sparse matrix of type '<class 'numpy.int64'>'
            with 7 stored elements in Compressed Sparse Row format>
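As a quick check on the summary shown in Out[7], the following minimal sketch prints the matrix shape and the number of stored entries. The contents of `sample` are an assumption: its definition is not part of this excerpt, and the three strings below were inferred from the table in Out[8].

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumed corpus: inferred from the Out[8] table, not shown in this excerpt.
sample = ["problem of evil", "evil queen", "horizon problem"]

vec = CountVectorizer()
X = vec.fit_transform(sample)

# Shape is (number of documents, number of distinct words).
print(X.shape)

# nnz is the number of explicitly stored (non-zero) counts.
print(X.nnz)

# Dense view of the same counts, one row per document.
dense = X.toarray()
print(dense)
```

Only the non-zero counts are stored, which is why the sparse form pays off once the vocabulary grows far beyond five words.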

The result is a sparse matrix that records the number of times each word appears. Converting it to a DataFrame with labeled columns makes the contents easy to inspect.

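In[8]: import pandas as pd

       # get_feature_names() was removed in scikit-learn 1.2;
       # get_feature_names_out() is its replacement
       pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())

Out[8]:    evil  horizon  of  problem  queen
        0     1        0   1        1      0
        1     1        0   0        0      1
        2     0        1   0        1      0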
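For comparison, the by-hand vectorization mentioned earlier can be sketched with the standard library alone. This is an illustration under stated assumptions, not the book's code: the `sample` strings are inferred from the table in Out[8], and the helper `tokenize` is a hypothetical name whose regular expression only approximates CountVectorizer's default tokenization (lowercased words of two or more characters).

```python
import re
from collections import Counter

# Assumed corpus: inferred from the Out[8] table, not shown in this excerpt.
sample = ["problem of evil", "evil queen", "horizon problem"]

def tokenize(text):
    """Hypothetical helper: lowercase the text and keep words of 2+ characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

# One Counter of word frequencies per document.
counts = [Counter(tokenize(doc)) for doc in sample]

# The column labels are the sorted set of all words seen in any document.
vocabulary = sorted(set().union(*counts))

# One row per document; a Counter returns 0 for words the document lacks.
rows = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in rows:
    print(row)
```

The rows match the DataFrame above, which shows that each column is simply a per-document count of one vocabulary word.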

However, this approach has several problems. When the raw word counts are used as they are, frequently occurring words ...


Publisher Resources

ISBN: 9784873118413