models = [
    {
        # Multi-lingual model of Universal Sentence Encoder for 15 languages:
        # Arabic, Chinese, Dutch, English, French, German, Italian, Korean,
        # Polish, Portuguese, Russian, Spanish, Turkish.
        "name": "distiluse-base-multilingual-cased-v1",
        "dims": 512,
        "metric": "angular",
    },
    {
        # Multi-lingual model of Universal Sentence Encoder for 50 languages.
        "name": "distiluse-base-multilingual-cased-v2",
        "dims": 512,
        "metric": "angular",
    },
    {
        # Multi-lingual model of paraphrase-multilingual-MiniLM-L12-v2,
        # extended to 50+ languages.
        "name": "paraphrase-multilingual-MiniLM-L12-v2",
        "dims": 384,
        "metric": "angular",
    },
    {
        # Multi-lingual model of paraphrase-mpnet-base-v2,
        # extended to 50+ languages.
        "name": "paraphrase-multilingual-mpnet-base-v2",
        "dims": 768,
        "metric": "angular",
    },
    {
        # This model was tuned for semantic search:
        # given a query/question, it can find relevant passages.
        # It was trained on a large and diverse set of (question, answer) pairs:
        # 215M (question, answer) pairs from diverse sources.
        "name": "multi-qa-mpnet-base-dot-v1",
        "dims": 768,
        "metric": "dot",
    },
    {
        # This model was tuned for semantic search:
        # given a query/question, it can find relevant passages.
        # It was trained on a large and diverse set of (question, answer) pairs:
        # 215M (question, answer) pairs from diverse sources.
        "name": "multi-qa-mpnet-base-cos-v1",
        "dims": 768,
        "metric": "angular",
    },
]
def find_model_with_name(models, name):
    for model in models:
        if model["name"] == name:
            return model
    raise NameError(f"Could not find model {name}.")
Below, we implement a simple SemanticSearch class with the basic functionality: loading a model, loading a corpus, encoding sentences into vectors, building a vector tree index, and searching for the N nearest neighbors.
from sentence_transformers import SentenceTransformer, util
from simpleneighbors import SimpleNeighbors
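The class definition and constructor are not shown above, so here is a minimal sketch of them, assuming the constructor loads the encoder from the model name, creates a SimpleNeighbors index from the model's dims and metric, and picks a score function from sentence_transformers.util (the util.dot_score / util.cos_sim mapping is an assumption of this sketch, not taken from the original post):

class SemanticSearch:
    def __init__(self, model):
        # Load the pre-trained sentence encoder by its model name.
        self.encoder = SentenceTransformer(model["name"])
        # Approximate nearest-neighbor index sized to the model's embedding dimension.
        self.index = SimpleNeighbors(model["dims"], metric=model["metric"])
        # Score function used when reporting distances (assumed mapping):
        # dot product for "dot" models, cosine similarity for "angular" models.
        self.metric_func = util.dot_score if model["metric"] == "dot" else util.cos_sim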
    def load_corpus(self, filename):
        with open(f"corpus/{filename}") as f:
            self.feed(f.read().split("\n"))
    def feed(self, sentences):
        for sentence in sentences:
            vector = self.encoder.encode(sentence)
            self.index.add_one(sentence, vector)
        # Build the index once after all sentences have been added.
        self.index.build()
    def find_nearest(self, query, n=5):
        vector = self.encoder.encode(query)
        nearests = self.index.nearest(vector, n)
        res = []
        for neighbor in nearests:
            dist = self.metric_func(vector, self.index.vec(neighbor))
            res.append((neighbor, float(dist)))
        return res
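As a side note (not from the original post), the two metrics in the model table correspond to scoring helpers in sentence_transformers.util; a quick standalone check with illustrative sentences:

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
a = encoder.encode("I took my dog for a walk")
b = encoder.encode("I took my cat for a walk")
print(float(util.cos_sim(a, b)))    # cosine (angle-based) similarity, for "angular" models
print(float(util.dot_score(a, b)))  # unnormalized dot product, for "dot" models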
Now let's try issuing a query right away.
if __name__ == "__main__":
    model = find_model_with_name(
        models, "distiluse-base-multilingual-cased-v2")
    ss = SemanticSearch(model)
    ss.load_corpus("future.txt")
    res = ss.find_nearest("フューチャーはいつ創立されましたか。")
    for r in res:
        print(r)
Text embeddings and semantic search. In this video we’ll explore how Transformer models represent text as embedding vectors and how these vectors can be used to find similar documents in a corpus. Text embeddings are just a fancy way of saying that we can represent text as an array of numbers called a vector. To create these embeddings we usually use an encoder-based model like BERT. In this example, you can see how we feed three sentences to the encoder and get three vectors as the output. Reading the text, we can see that walking the dog seems to be most similar to walking the cat, but let's see if we can quantify this.

The trick to do the comparison is to compute a similarity metric between each pair of embedding vectors. These vectors usually live in a high-dimensional space, so a similarity metric can be anything that measures some sort of distance between vectors. One popular metric is cosine similarity, which uses the angle between two vectors to measure how close they are. In this example, our embedding vectors live in 3D and we can see that the orange and grey vectors are close to each other and have a smaller angle.

Now one problem we have to deal with is that Transformer models like BERT will actually return one embedding vector per token. For example in the sentence "I took my dog for a walk", we can expect several embedding vectors, one for each word. For example, here we can see the output of our model has produced 9 embedding vectors per sentence, and each vector has 384 dimensions. But what we really want is a single embedding vector for the whole sentence. To deal with this, we can use a technique called pooling. The simplest pooling method is to just take the token embedding of the CLS token. Alternatively, we can average the token embeddings which is called mean pooling. With mean pooling only thing we need to make sure is that we don't include the padding tokens in the average, which is why you can see the attention mask being used here. This now gives us one 384 dimensional vector per sentence which is exactly what we want.

And once we have our sentence embeddings, we can compute the cosine similarity for each pair of vectors. In this example we use the function from scikit-learn and you can see that the sentence "I took my dog for a walk" has an overlap of 0.83 with "I took my cat for a walk". Hooray. We can take this idea one step further by comparing the similarity between a question and a corpus of documents. For example, suppose we embed every post in the Hugging Face forums. We can then ask a question, embed it, and check which forum posts are most similar. This process is often called semantic search, because it allows us to compare queries with context.

To create a semantic search engine is quite simple in Datasets. First we need to embed all the documents. In this example, we take a small sample from the SQUAD dataset and apply the same embedding logic as before. This gives us a new column called "embeddings" that stores the embedding of every passage. Once we have our embeddings, we need a way to find nearest neighbours to a query. Datasets provides a special object called a FAISS index that allows you to quickly compare embedding vectors. So we add the FAISS index, embed a question and voila. we've now found the 3 most similar articles which might store the answer.
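The transcript explains mean pooling over token embeddings and cosine similarity between sentence embeddings. A minimal sketch of those two steps, assuming the 384-dimensional sentence-transformers/all-MiniLM-L6-v2 model and illustrative sentences (neither is taken from the post), could look like this:

import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.metrics.pairwise import cosine_similarity

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

sentences = ["I took my dog for a walk",
             "Today is going to rain",
             "I took my cat for a walk"]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state  # (batch, tokens, 384)

# Mean pooling: average the token embeddings, excluding padding tokens
# by weighting with the attention mask.
mask = inputs["attention_mask"].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(1) / mask.sum(1)

# Pairwise cosine similarity between the three sentence embeddings.
print(cosine_similarity(sentence_embeddings.numpy()))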
In the same way, we load this transcript as a corpus (semantic_search.txt) and query it in Japanese.
if __name__ == "__main__":
    model = find_model_with_name(
        models, "paraphrase-multilingual-MiniLM-L12-v2")
    ss = SemanticSearch(model)
    ss.load_corpus("semantic_search.txt")
    res = ss.find_nearest("埋め込みベクトルでのエンコーディングについて、どんなモデルを使えますか")
    for r in res:
        print(r)
Output 5
We are getting reasonably good hits.
('To create these embeddings we usually use an encoder-based model like BERT.', 0.6005619764328003)
('In this video we’ll explore how Transformer models represent text as embedding vectors and how these vectors can be used to find similar documents in a corpus.', 0.5864262580871582)
('For example, here we can see the output of our model has produced 9 embedding vectors per sentence, and each vector has 384 dimensions.', 0.5198760032653809)
('In this example, we take a small sample from the SQUAD dataset and apply the same embedding logic as before.', 0.4749892055988312)
('In this example, our embedding vectors live in 3D and we can see that the orange and grey vectors are close to each other and have a smaller angle.', 0.46906405687332153)
('Text embeddings and semantic search.', 0.3169878125190735)
('To create a semantic search engine is quite simple in Datasets.', 0.22516131401062012)
('To deal with this, we can use a technique called pooling.', 0.19742435216903687)
('This process is often called semantic search, because it allows us to compare queries with context.', 0.1717163324356079)
('Once we have our embeddings, we need a way to find nearest neighbours to a query.', 0.1544724851846695)