Word2Vec is a model for computing word vectors proposed by Mikolov and colleagues at Google.
The significance of word vectors is that they turn natural language into vectors a computer can work with. Compared with bag-of-words or TF-IDF representations, word vectors capture a word's context and semantics and can measure similarity between words, which makes them important in text classification, sentiment analysis, and many other natural language processing tasks.
The classic word-vector example:
$\vec{man} - \vec{woman} \approx \vec{king} - \vec{queen}$
gensim already provides a Python wrapper around a word2vec implementation; if you have a corpus you can train directly (see, for example, the word2vec experiments on the Chinese and English Wikipedia corpora).
Being able to train word vectors with gensim does not mean you have really mastered word2vec; it only means you can read documentation and call an API.
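As a minimal sketch of that workflow (not the full recipe from the Wikipedia experiments above): the corpus file name below is a placeholder, and the keyword names follow gensim 4.x, where the dimension argument is vector_size (older versions call it size).

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# 'corpus.txt' is a hypothetical file with one whitespace-tokenized sentence per line.
sentences = LineSentence('corpus.txt')
model = Word2Vec(sentences,
                 vector_size=300,  # embedding dimension (gensim 4.x; older versions use size=)
                 window=5,         # context window size
                 min_count=5,      # ignore rare words
                 sg=1,             # 1 = skip-gram, 0 = CBOW
                 negative=5,       # number of negative samples
                 workers=4)
print(model.wv.most_similar('king', topn=5))  # nearest neighbors by cosine similarity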
Word2vec implementation in detail
In short, word2vec is implemented as a three-layer neural network. The prerequisites for understanding the implementation are neural networks and logistic regression.
Network structure

[Figure: word2vec architecture diagram]
The figure above is a simplified flow diagram of word2vec. Assume the vocabulary contains 10,000 words and the word vectors are 300-dimensional (according to Stanford CS224d, word vectors typically have 25-1000 dimensions, and 300 is a good choice). Taking a single training sample as an example, the parts of the network are as follows.
1. Input layer: the input is the one-hot representation of a word, a vector of length 10,000. Suppose the word is "ants" and its ID in the vocabulary is i; then the i-th component of the input vector is 1 and all other components are 0: [0, 0, ..., 0, 0, 1, 0, 0, ..., 0, 0].
2. Hidden layer: the number of hidden units equals the length of the word vector, and the hidden-layer parameters form a [10000, 300] matrix. This parameter matrix is, in fact, the table of word vectors. Recall matrix multiplication: a one-hot row vector times a matrix simply picks out the i-th row of that matrix. Passing through the hidden layer therefore maps the 10,000-dimensional one-hot vector to the 300-dimensional word vector we ultimately want (a small NumPy sketch follows the figure below).

[Figure: matrix multiplication]
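To make the lookup concrete, here is a tiny NumPy sketch with the sizes scaled down (a 10-word vocabulary and 4-dimensional vectors standing in for 10,000 and 300; the weights are random):

import numpy as np

vocab_size, embed_dim = 10, 4                       # stand-ins for 10000 and 300
W_hidden = np.random.rand(vocab_size, embed_dim)    # hidden-layer weights = the word-vector table

one_hot = np.zeros(vocab_size)
one_hot[6] = 1                                      # suppose "ants" has ID 6

# Multiplying a one-hot row vector by the matrix just selects row 6,
# i.e. the hidden layer is nothing more than a lookup of the word's embedding.
assert np.allclose(one_hot @ W_hidden, W_hidden[6])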
3. Output layer: the output layer has as many neurons as there are words, 10,000, and its parameter matrix has shape [300, 10000]. The word vector is multiplied by this matrix and passed through a softmax, yielding a 10,000-dimensional vector again; each component is the probability that the corresponding vocabulary word appears in the context of the input word (here, "ants").

[Figure: output layer]
The figure above computes the probability that "car" co-occurs with "ants"; the 300-dimensional column vector corresponding to "car" is one column of the output-layer parameter matrix. Since that matrix is [300, 10000], the network computes the co-occurrence probability of every vocabulary word with "ants". Once training is finished, the output-layer parameter matrix is no longer needed.
4. Training: a training sample (x, y) has both an input and an output. We know which word actually co-occurs with "ants", so y is also a 10,000-dimensional vector. As in logistic regression, the loss function is the cross-entropy between the network's final output vector and y, and the model is trained with stochastic gradient descent (a toy sketch follows the figure below).

[Figure: cross-entropy loss]
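Putting steps 1-4 together, the following toy NumPy sketch runs the forward pass and computes the cross-entropy loss for a single (input, label) pair. The weights are random and the vocabulary is shrunk, so the numbers mean nothing; it only illustrates the shapes and the computation:

import numpy as np

vocab_size, embed_dim = 10, 4
W_in  = np.random.rand(vocab_size, embed_dim)   # hidden-layer (embedding) matrix
W_out = np.random.rand(embed_dim, vocab_size)   # output-layer matrix

input_id, label_id = 6, 3                       # hypothetical IDs, e.g. ("ants", "car")

h = W_in[input_id]                              # hidden layer: just the embedding of the input word
logits = h @ W_out                              # one score per vocabulary word
probs = np.exp(logits) / np.exp(logits).sum()   # softmax over the whole vocabulary

y = np.zeros(vocab_size)
y[label_id] = 1                                 # one-hot label for the observed context word
loss = -np.sum(y * np.log(probs))               # cross-entropy, reduces to -log(probs[label_id])
print(loss)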
The steps above describe one word as input and one context word as output, but the real situation is more involved. What counts as context? Do we predict the surrounding words from one word, or predict one word from its many neighbors? This is where the two training schemes used in practice, skip-gram and CBOW, come in.
skip-gram: the core idea is to predict the surrounding words from the center word. Suppose the center word is "cat" and the window size is 2; then we predict the two words to its left and the two to its right from "cat". Here "cat" is the network input and each predicted word is a label. The figure below shows an example:

[Figure: skip-gram training samples]
With a window size of 2, the center word moves over the text one position at a time; each move of the center word produces at most 4 training pairs (input, label), as the short enumeration below shows.
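For concreteness, a few lines of plain Python that enumerate the (input, label) pairs a window of size 2 produces while sliding over a short sentence (this is only an illustration, separate from the batching code used later):

sentence = 'the quick brown fox jumps over the lazy dog'.split()
window = 2

pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))   # (input, label)

print(pairs[:6])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ...]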
CBOW (continuous bag-of-words): once you understand skip-gram, CBOW is simply the reverse: the surrounding words are used to predict the center word. Now each move of the center word yields only one training sample. Using the same example as above, CBOW produces the following 4 training samples:
([quick, brown], the)
([the, brown, fox], quick)
([the, quick, fox, jumps], brown)
([quick, brown, jumps, over], fox)
The input may thus be 4 words while the label is a single word. How do we handle that? Simply by averaging: after the hidden layer, the 4 input words are mapped to four 300-dimensional vectors, their average is taken, and that average is fed to the next layer (see the sketch right below).
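The averaging step is literally a mean over the context embeddings; a short NumPy sketch with random vectors standing in for real embeddings:

import numpy as np

embed_dim = 300
context_vectors = np.random.rand(4, embed_dim)  # embeddings of e.g. [the, quick, fox, jumps]
hidden = context_vectors.mean(axis=0)           # average them into a single 300-dim hidden vector
print(hidden.shape)                             # (300,)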
Comparing the two models: skip-gram produces more training samples and captures more fine-grained semantic relations between words, so under ideal conditions with a large, high-quality corpus skip-gram outperforms CBOW. With a smaller corpus there is not enough data to capture those fine-grained relations, and CBOW's averaging can actually work better.
In actual training, still assuming a vocabulary of 10,000 words and 300-dimensional word vectors, each layer of the network has 10,000 x 300 = 3,000,000 parameters, and the output layer amounts to a multi-class classification problem with 10,000 possible classes. As you can imagine, the computational cost is enormous.
Mikolov and colleagues proposed several optimizations; here we focus on negative sampling.
The idea behind negative sampling is outrageously simple. The network's final softmax output is a vector in which only one component, the one with the highest probability, corresponds to the correct word; the remaining components are negative samples. If we pick only 5 negative samples, the output effectively becomes a 6-dimensional vector, and the number of parameters involved drops from 3,000,000 to 1,800 (6 x 300). It looks like a lazy shortcut, but it works very well in practice and greatly improves training efficiency.
We know that each training step of a neural network makes small adjustments to its parameters. In word2vec, however, a single training sample does not touch every parameter. If the input word is "cat", the hidden layer has 3,000,000 parameters, but this training step only updates the 300 parameters in the row corresponding to "cat", because the hidden-layer output depends only on those 300 parameters.
Negative sampling is effective, and we do not need many negative samples. Mikolov et al. report in the paper that 5-20 negative samples work well for small datasets and 2-5 for large ones.
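To illustrate what the per-sample computation looks like with 5 negative samples, here is a toy NumPy sketch of the skip-gram-with-negative-sampling objective from the paper (maximize sigmoid(v_input . v_label) for the true pair and sigmoid(-v_input . v_neg) for each sampled negative). The vectors are random, so it only shows why just (1 + 5) x 300 = 1,800 output parameters are involved per step:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

embed_dim, k = 300, 5                      # 5 negative samples, as suggested for small datasets
v_input = np.random.rand(embed_dim)        # hidden-layer vector of the input word
v_label = np.random.rand(embed_dim)        # output vector of the true context word
v_neg = np.random.rand(k, embed_dim)       # output vectors of the 5 sampled negative words

# Only 1 positive + 5 negative output vectors are touched: (1 + 5) * 300 = 1800 parameters,
# instead of the full 300 * 10000 output matrix.
loss = -np.log(sigmoid(v_input @ v_label)) - np.log(sigmoid(-(v_neg @ v_input))).sum()
print(loss)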
How exactly are the negative-sample words chosen? The paper gives the following formula:

$P(w_i) = \dfrac{f(w_i)^{3/4}}{\sum_{j=0}^{n} f(w_j)^{3/4}}$
where f(w) is the word's frequency. As you can see, the choice of negative samples depends only on frequency: the more frequent a word, the more likely it is to be selected.
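A short NumPy sketch of drawing negative samples from this distribution; the frequency counts are toy values for illustration:

import numpy as np

word_freq = {'the': 1061396, 'of': 593677, 'cat': 1200, 'ants': 300}   # toy frequency counts
words = list(word_freq)
f = np.array([word_freq[w] for w in words], dtype=np.float64)

p = f ** 0.75            # raise raw frequency to the 3/4 power, as in the formula above
p /= p.sum()             # normalize: P(w_i) = f(w_i)^{3/4} / sum_j f(w_j)^{3/4}

negatives = np.random.choice(words, size=5, p=p)   # draw 5 negative samples
print(dict(zip(words, p.round(3))), negatives)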
Finally, let's get hands-on with TensorFlow, following an assignment from Udacity's Deep Learning course.
Here we only train 128-dimensional word vectors and visualize them with t-SNE. It is good practice for understanding word2vec in depth; for real applications, gensim is still recommended.
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
%matplotlib inline
from __future__ import print_function
import collections
import math
import numpy as np
import os
import random
import tensorflow as tf
import zipfile
from matplotlib import pylab
from six.moves import range
from six.moves.urllib.request import urlretrieve
from sklearn.manifold import TSNE
Download the data from the source website if necessary.
url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
    """Download a file if not present, and make sure it's the right size."""
    if not os.path.exists(filename):
        filename, _ = urlretrieve(url + filename, filename)
    statinfo = os.stat(filename)
    if statinfo.st_size == expected_bytes:
        print('Found and verified %s' % filename)
    else:
        print(statinfo.st_size)
        raise Exception(
            'Failed to verify ' + filename + '. Can you get to it with a browser?')
    return filename

filename = maybe_download('text8.zip', 31344016)

Found and verified text8.zip
Read the data into a string.
def read_data(filename):
    """Extract the first file enclosed in a zip file as a list of words"""
    with zipfile.ZipFile(filename) as f:
        data = tf.compat.as_str(f.read(f.namelist()[0])).split()
    return data

words = read_data(filename)
print('Data size %d' % len(words))

Data size 17005207
Build the dictionary and replace rare words with UNK token.
vocabulary_size = 50000

def build_dataset(words):
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count = unk_count + 1
        data.append(index)
    count[0][1] = unk_count
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reverse_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(words)
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10])
del words  # Hint to reduce memory.

Most common words (+UNK) [['UNK', 418391], ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764)]
Sample data [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156]
Function to generate a training batch for the skip-gram model.
data_index = 0

def generate_batch(batch_size, num_skips, skip_window):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window target skip_window ]
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // num_skips):
        target = skip_window  # target label at the center of the buffer
        targets_to_avoid = [skip_window]
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels

print('data:', [reverse_dictionary[di] for di in data[:8]])

for num_skips, skip_window in [(2, 1), (4, 2)]:
    data_index = 0
    batch, labels = generate_batch(batch_size=8, num_skips=num_skips, skip_window=skip_window)
    print('\nwith num_skips = %d and skip_window = %d:' % (num_skips, skip_window))
    print('    batch:', [reverse_dictionary[bi] for bi in batch])
    print('    labels:', [reverse_dictionary[li] for li in labels.reshape(8)])

data: ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first']

with num_skips = 2 and skip_window = 1:
    batch: ['originated', 'originated', 'as', 'as', 'a', 'a', 'term', 'term']
    labels: ['anarchism', 'as', 'originated', 'a', 'as', 'term', 'a', 'of']

with num_skips = 4 and skip_window = 2:
    batch: ['as', 'as', 'as', 'as', 'a', 'a', 'a', 'a']
    labels: ['originated', 'term', 'anarchism', 'a', 'of', 'as', 'originated', 'term']
Train a skip-gram model.
batch_size = 128
embedding_size = 128  # Dimension of the embedding vector.
skip_window = 1       # How many words to consider left and right.
num_skips = 2         # How many times to reuse an input to generate a label.
# We pick a random validation set to sample nearest neighbors. here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent.
valid_size = 16     # Random set of words to evaluate similarity on.
valid_window = 100  # Only pick dev samples in the head of the distribution.
valid_examples = np.array(random.sample(range(valid_window), valid_size))
#######important#########
num_sampled = 64  # Number of negative examples to sample.

graph = tf.Graph()

with graph.as_default(), tf.device('/cpu:0'):
    # Input data.
    train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

    # Variables.
    embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    softmax_weights = tf.Variable(
        tf.truncated_normal([vocabulary_size, embedding_size],
                            stddev=1.0 / math.sqrt(embedding_size)))
    softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))

    # Model.
    # Look up embeddings for inputs.
    embed = tf.nn.embedding_lookup(embeddings, train_dataset)
    # Compute the softmax loss, using a sample of the negative labels each time.
    loss = tf.reduce_mean(
        tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases,
                                   inputs=embed, labels=train_labels,
                                   num_sampled=num_sampled, num_classes=vocabulary_size))

    # Optimizer.
    # Note: The optimizer will optimize the softmax_weights AND the embeddings.
    # This is because the embeddings are defined as a variable quantity and the
    # optimizer's `minimize` method will by default modify all variable quantities
    # that contribute to the tensor it is passed.
    # See docs on `tf.train.Optimizer.minimize()` for more details.
    optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)

    # Compute the similarity between minibatch examples and all embeddings.
    # We use the cosine distance:
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm
    valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
    similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
num_steps = 100001

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    average_loss = 0
    for step in range(num_steps):
        batch_data, batch_labels = generate_batch(
            batch_size, num_skips, skip_window)
        feed_dict = {train_dataset: batch_data, train_labels: batch_labels}
        _, l = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += l
        if step % 2000 == 0:
            if step > 0:
                average_loss = average_loss / 2000
            # The average loss is an estimate of the loss over the last 2000 batches.
            print('Average loss at step %d: %f' % (step, average_loss))
            average_loss = 0
        # note that this is expensive (~20% slowdown if computed every 500 steps)
        if step % 10000 == 0:
            sim = similarity.eval()
            for i in range(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 8  # number of nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                log = 'Nearest to %s:' % valid_word
                for k in range(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log = '%s %s,' % (log, close_word)
                print(log)
    final_embeddings = normalized_embeddings.eval()
num_points = 400

tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
two_d_embeddings = tsne.fit_transform(final_embeddings[1:num_points + 1, :])
def plot(embeddings, labels):
    assert embeddings.shape[0] >= len(labels), 'More labels than embeddings'
    pylab.figure(figsize=(15, 15))  # in inches
    for i, label in enumerate(labels):
        x, y = embeddings[i, :]
        pylab.scatter(x, y)
        pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
                       ha='right', va='bottom')
    pylab.show()

words = [reverse_dictionary[i] for i in range(1, num_points + 1)]
plot(two_d_embeddings, words)

[Figure: t-SNE visualization of the skip-gram embeddings]
data_index_cbow = 0

def get_cbow_batch(batch_size, num_skips, skip_window):
    global data_index_cbow
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window target skip_window ]
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index_cbow])
        data_index_cbow = (data_index_cbow + 1) % len(data)
    for i in range(batch_size // num_skips):
        target = skip_window  # target label at the center of the buffer
        targets_to_avoid = [skip_window]
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        buffer.append(data[data_index_cbow])
        data_index_cbow = (data_index_cbow + 1) % len(data)
    # turn the skip-gram pairs around: the context words become the input,
    # the center word becomes the label
    cbow_batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    cbow_labels = np.ndarray(shape=(batch_size // (skip_window * 2), 1), dtype=np.int32)
    for i in range(batch_size):
        cbow_batch[i] = labels[i]
    cbow_batch = np.reshape(cbow_batch, [batch_size // (skip_window * 2), skip_window * 2])
    for i in range(batch_size // (skip_window * 2)):
        # center word
        cbow_labels[i] = batch[2 * skip_window * i]
    return cbow_batch, cbow_labels
# actual batch_size = batch_size // (2 * skip_window)
batch_size = 128
embedding_size = 128  # Dimension of the embedding vector.
skip_window = 1       # How many words to consider left and right.
num_skips = 2         # How many times to reuse an input to generate a label.
# We pick a random validation set to sample nearest neighbors. here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent.
valid_size = 16     # Random set of words to evaluate similarity on.
valid_window = 100  # Only pick dev samples in the head of the distribution.
valid_examples = np.array(random.sample(range(valid_window), valid_size))
#######important#########
num_sampled = 64  # Number of negative examples to sample.

graph = tf.Graph()

with graph.as_default(), tf.device('/cpu:0'):
    # Input data.
    train_dataset = tf.placeholder(tf.int32, shape=[batch_size // (skip_window * 2), skip_window * 2])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size // (skip_window * 2), 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

    # Variables.
    embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    softmax_weights = tf.Variable(
        tf.truncated_normal([vocabulary_size, embedding_size],
                            stddev=1.0 / math.sqrt(embedding_size)))
    softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))

    # Model.
    # Look up embeddings for inputs: shape (batch, context, embedding_size).
    embed = tf.nn.embedding_lookup(embeddings, train_dataset)
    # move the context dimension to the front
    embed = tf.transpose(embed, [1, 0, 2])
    # average embed over the context words
    embed = tf.reduce_mean(embed, 0)
    # Compute the softmax loss, using a sample of the negative labels each time.
    loss = tf.reduce_mean(
        tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases,
                                   inputs=embed, labels=train_labels,
                                   num_sampled=num_sampled, num_classes=vocabulary_size))

    # Optimizer.
    # Note: The optimizer will optimize the softmax_weights AND the embeddings.
    # This is because the embeddings are defined as a variable quantity and the
    # optimizer's `minimize` method will by default modify all variable quantities
    # that contribute to the tensor it is passed.
    # See docs on `tf.train.Optimizer.minimize()` for more details.
    optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)

    # Compute the similarity between minibatch examples and all embeddings.
    # We use the cosine distance:
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm
    valid_embeddings = tf.nn.embedding_lookup(
        normalized_embeddings, valid_dataset)
    similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
num_steps = 100001

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    average_loss = 0
    for step in range(num_steps):
        batch_data, batch_labels = get_cbow_batch(
            batch_size, num_skips, skip_window)
        feed_dict = {train_dataset: batch_data, train_labels: batch_labels}
        _, l = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += l
        if step % 2000 == 0:
            if step > 0:
                average_loss = average_loss / 2000
            # The average loss is an estimate of the loss over the last 2000 batches.
            print('Average loss at step %d: %f' % (step, average_loss))
            average_loss = 0
        # note that this is expensive (~20% slowdown if computed every 500 steps)
        if step % 10000 == 0:
            sim = similarity.eval()
            for i in range(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 8  # number of nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                log = 'Nearest to %s:' % valid_word
                for k in range(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log = '%s %s,' % (log, close_word)
                print(log)
    final_embeddings = normalized_embeddings.eval()
num_points = 400

tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
two_d_embeddings = tsne.fit_transform(final_embeddings[1:num_points + 1, :])
words = [reverse_dictionary[i] for i in range(1, num_points + 1)]
plot(two_d_embeddings, words)

[Figure: t-SNE visualization of the CBOW embeddings]
1. Le Q V, Mikolov T. Distributed Representations of Sentences and Documents. 2014, 4: II-1188.
2. Mikolov T, Sutskever I, Chen K, et al. Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems, 2013, 26: 3111-3119.
3. Word2Vec Tutorial - The Skip-Gram Model
4. Udacity Deep Learning
5. Stanford CS224d, Lectures 2-3
Original article: https://www.jianshu.com/p/b779f8219f74