Update on RSS Brain to Find Related Articles With Machine Learning | Bin Wang - My Personal Blog
@tags:: #lit✍/📰️article/highlights
@links::
@ref:: Update on RSS Brain to Find Related Articles With Machine Learning | Bin Wang - My Personal Blog
@author:: binwang.me
=this.file.name
Reference
=this.ref
Notes
The basic idea is to replace the tf-idf algorithm with text embeddings to represent the articles as vectors, and use ElasticSearch to store and query those vectors.
- No location available
-
With the advancement of machine learning, a new method to represent words as vectors was developed in the paper Efficient Estimation of Word Representations in Vector Space. The vector is called a word embedding. Then, based on that idea, Distributed Representations of Sentences and Documents explores representing paragraphs as vectors. Without going into the details, the basic idea is to take a layer from a neural network trained for an NLP task.
- No location available
-
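The post doesn’t include code for this, but the paragraph-vector idea from the second paper is easy to try with gensim’s Doc2Vec implementation. A minimal sketch with a made-up toy corpus, just to illustrate the concept:

```python
# Sketch of paragraph vectors (the "Distributed Representations of Sentences and
# Documents" idea) using gensim's Doc2Vec; the toy corpus here is made up.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    TaggedDocument(words=["rss", "brain", "finds", "related", "articles"], tags=[0]),
    TaggedDocument(words=["word", "embeddings", "represent", "words", "as", "vectors"], tags=[1]),
]

# vector_size is the dimensionality of the learned document vectors.
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=20)

# infer_vector produces a fixed-size vector for an unseen document.
vector = model.infer_vector(["related", "articles", "with", "embeddings"])
print(vector.shape)  # (50,)
```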
Also surprisingly, quoted from the paper Efficient Estimation of Word Representations in Vector Space: “To find a word that is similar to small in the same sense as biggest is similar to big, we can simply compute vector X = vector(biggest) − vector(big) + vector(small).” What a beautiful result!
- No location available
-
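A quick way to see this arithmetic in action is gensim’s analogy query, which computes exactly vector(biggest) − vector(big) + vector(small) and looks up the nearest words. A minimal sketch, assuming a pretrained word2vec model file (the path is a placeholder):

```python
# Illustration only: reproduce the "biggest - big + small" analogy with gensim.
# Assumes a pretrained word2vec model in word2vec binary format (path is a placeholder).
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("word2vec-vectors.bin", binary=True)

# most_similar computes vector(biggest) - vector(big) + vector(small)
# and returns the nearest words to the resulting vector.
result = vectors.most_similar(positive=["biggest", "small"], negative=["big"], topn=3)
print(result)  # the top hit should be close to "smallest"
```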
(highlight:: There is also Llama. But it’s not really a multilingual model. I have seen some attempts to train it on other languages, but the results are not that good in my experience. The license of the model is not commercially friendly either. And there is no easy-to-use API to get the embeddings.
In the end I found SentenceTransformers. There are lots of pretrained models. I selected the model paraphrase-multilingual-mpnet-base-v2 since it is a multilingual model. But it’s called “sentence” transformers for a reason: there is a limit on the length of the document that you can feed into the models. I ended up just getting the embeddings for the article title, which I think is good enough for my use case.)
- No location available
-
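For reference, getting the title embeddings with SentenceTransformers only takes a few lines. A minimal sketch using the model named in the post (the example titles are made up):

```python
# Minimal sketch: embed article titles with the multilingual model named in the post.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

titles = [
    "Update on RSS Brain to Find Related Articles With Machine Learning",
    "机器学习简介",  # the model is multilingual, so non-English titles work too
]

# encode() returns one fixed-size vector (768 dimensions for this model) per title.
embeddings = model.encode(titles, normalize_embeddings=True)
print(embeddings.shape)  # (2, 768)
```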
The library SentenceTransformers is very easy to use. However, it is implemented in Python, so it needs a way to communicate with the RSS Brain server, which is written in Scala. Since this is a computation-heavy task, the first thought is to have a buffer queue in between so that the Python program can process the articles at a speed it can handle. Kafka is a good choice for an external task queue, but I don’t think it is worth the complexity of introducing another component into the system. So I created buffer queues at both ends to avoid creating too many requests while maintaining some parallelism. Here is what the whole architecture looks like:
- No location available
-
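The post doesn’t show the queue code itself, but the batching idea on the Python side could look roughly like the following sketch (embed_batch and publish_results are hypothetical hooks, not RSS Brain’s actual interfaces):

```python
# Rough sketch of the Python-side buffer queue idea (not the actual RSS Brain code).
# Incoming titles are buffered and embedded in batches, so the Python side can
# process articles at its own pace without a request per article.
import queue
import threading

BATCH_SIZE = 32
buffer = queue.Queue(maxsize=1000)  # bounded queue applies back-pressure to the producer

def worker(embed_batch, publish_results):
    # embed_batch and publish_results are hypothetical hooks: embed a list of titles,
    # then hand the vectors back to the Scala server (e.g. over HTTP or gRPC).
    while True:
        batch = [buffer.get()]                     # block until at least one item arrives
        while len(batch) < BATCH_SIZE:
            try:
                batch.append(buffer.get_nowait())  # drain whatever else is already waiting
            except queue.Empty:
                break
        publish_results(embed_batch(batch))

threading.Thread(target=worker, args=(lambda titles: titles, print), daemon=True).start()
buffer.put("Some article title")
```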
There are a few vector databases that can store vectors and query the nearest vectors given one. ElasticSearch added vector field support in 7.0 and approximate nearest neighbor (ANN) search in 8.0. Since RSS Brain already uses ElasticSearch heavily for searching, I can just use it without adding another database as a dependency. It also supports running machine learning models so that you don’t need to insert the embedding vectors from the outside, but I find that less flexible.
- No location available
-
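As a rough sketch of what the index side can look like (the index and field names here are made up, not RSS Brain’s actual schema), a dense_vector field with ANN indexing enabled can be declared in ElasticSearch 8 like this:

```python
# Sketch of creating an index with a dense_vector field for the title embeddings
# (index and field names are made up, not RSS Brain's actual schema).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="articles",
    mappings={
        "properties": {
            "title": {"type": "text"},
            "title_embedding": {
                "type": "dense_vector",
                "dims": 768,              # output size of paraphrase-multilingual-mpnet-base-v2
                "index": True,            # enable ANN (HNSW) indexing, available since 8.0
                "similarity": "cosine",
            },
        }
    },
)
```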
Once the vectors are inserted into ElasticSearch, it’s just an API call to get the most similar documents. The details of vector insertion and querying are in the ElasticSearch kNN search documentation. One tricky part is that even though ElasticSearch supports combining ANN search with other features like term searches (the tf-idf algorithm) by using a boost factor, it doesn’t work well unless you are willing to tune it. That’s because the embedding vector and the term vector mean different things, so their similarity scores are not really comparable. So I ended up enabling vector search only for finding related articles, instead of combining it with term searches.
- No location available
-
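The query side of that API call looks roughly like the following sketch, which reuses the made-up index and field names from the mapping sketch above:

```python
# Sketch of querying the nearest titles with ElasticSearch kNN search
# (continues the made-up index/field names from the mapping sketch above).
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

es = Elasticsearch("http://localhost:9200")
model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

query_vector = model.encode("Update on RSS Brain", normalize_embeddings=True)

response = es.search(
    index="articles",
    knn={
        "field": "title_embedding",
        "query_vector": query_vector.tolist(),
        "k": 10,                 # number of related articles to return
        "num_candidates": 100,   # candidates per shard considered by the ANN search
    },
)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```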