Keep-Current :: Machine Learning Seminar #2


Details
Machine Learning Seminar #2 - Document Distance
Level: Advanced
This is the second event in a series of seminars for approaching, understanding, and working with machine learning from different perspectives.
These events are not lectures, but rather discussions that aim to expand the know-how and understanding of machine learning.
It is well known that the best way to learn and understand something fully is to teach it to others. This is therefore an opportunity for you to 'show off' what you have learned, while at the same time deepening your knowledge of the field by teaching it to the other members of the group.
Yet this is not a competition. Gaps in the material can and should be filled by other members of the group. We're here to learn from each other - without judging.
--
We remain in the field of Natural Language Processing. With this event we move from word representations to documents, focusing on document distance for clustering and classification.
We will discuss and explore similarities and differences across several methodologies - from cosine similarity through Word Mover's Distance to Kullback-Leibler divergence and Hellinger distance - in an attempt to better understand the best uses and limitations of these tools.
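To set the stage, the simplest of these measures - cosine similarity between bag-of-words vectors - can be sketched in a few lines of plain Python. The toy documents below are invented for illustration:

```python
from collections import Counter
from math import sqrt

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two documents given as token lists."""
    a, b = Counter(doc_a), Counter(doc_b)
    # Dot product over the shared vocabulary only.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

doc1 = "the cat sat on the mat".split()
doc2 = "the dog sat on the log".split()
print(round(cosine_similarity(doc1, doc2), 3))  # → 0.75
```

Note that cosine similarity only sees exact term overlap - "dog" and "cat" contribute nothing to the score despite being related words - which is exactly the gap the embedding-based measures below try to close.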
The seminar format works best if you come prepared. Please check the reading list below and bring your own insights, questions, and perplexities to the table!
Note: The meetup's end time is approximate. After the meetup, we will continue to a restaurant nearby for drinks and/or dinner.
## Recommended reading list:
# Background - Information Theory:
https://web.stanford.edu/class/stats311/Lectures/lec-02.pdf
# Word-Movers-Distance:
http://proceedings.mlr.press/v37/kusnerb15.pdf - From Word Embeddings to Document Distances
https://arxiv.org/pdf/1805.04437.pdf - Cross-lingual Document Retrieval using Regularized Wasserstein Distance
https://medium.com/@stephenhky/word-movers-distance-as-a-linear-programming-problem-6b0c2658592e
(Optional read - an extended method - http://papers.nips.cc/paper/6138-supervised-word-movers-distance)
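The full Word Mover's Distance solves an optimal-transport linear program, but the Kusner et al. paper also gives a cheap "relaxed" lower bound in which each word's mass moves entirely to its nearest word in the other document. A minimal sketch of that relaxation, using tiny made-up 2-D "embeddings" (real word vectors would come from word2vec or similar):

```python
from math import sqrt

def euclid(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def relaxed_wmd(doc_a, doc_b, emb):
    """Symmetrised relaxed WMD lower bound: every word moves wholly to
    its nearest counterpart in the other document (uniform word weights,
    a simplification of the paper's nBOW weighting)."""
    def one_way(src, dst):
        return sum(min(euclid(emb[s], emb[d]) for d in dst) for s in src) / len(src)
    return max(one_way(doc_a, doc_b), one_way(doc_b, doc_a))

# Toy 2-D embeddings, invented purely for illustration.
emb = {
    "obama":    (1.0, 0.9), "president": (0.9, 1.0),
    "speaks":   (0.1, 0.8), "greets":    (0.2, 0.9),
    "media":    (0.8, 0.1), "press":     (0.9, 0.2),
    "illinois": (0.5, 0.5), "chicago":   (0.55, 0.5),
    "banana":   (0.0, 0.0),
}
d1 = ["obama", "speaks", "media", "illinois"]
d2 = ["president", "greets", "press", "chicago"]
d3 = ["banana", "banana", "banana", "banana"]
# Paraphrased documents score closer than an unrelated one:
print(relaxed_wmd(d1, d2, emb) < relaxed_wmd(d1, d3, emb))  # → True
```

Unlike cosine similarity over bags of words, this scores d1 and d2 as close even though they share no terms - the distance is computed in embedding space, not over vocabulary overlap.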
# Kullback-Leibler Divergence:
https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
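For documents represented as probability distributions over terms (or topics), KL divergence is a natural comparison - but it is asymmetric, which is worth seeing concretely. A sketch with two invented toy distributions:

```python
from math import log

def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i), in nats.
    Assumes q_i > 0 wherever p_i > 0 (otherwise KL is infinite)."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy word distributions over a 3-term vocabulary.
p = [0.8, 0.15, 0.05]
q = [0.4, 0.4, 0.2]
# KL is a divergence, not a metric: it is not symmetric.
print(kl_divergence(p, q), kl_divergence(q, p))
```

That asymmetry (and the blow-up when q assigns zero probability to a term p uses) is one reason symmetric, bounded alternatives like the Hellinger distance below are often preferred for clustering.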
# Doc2Vec
https://arxiv.org/pdf/1405.4053.pdf - Distributed Representations of Sentences and Documents
https://medium.com/scaleabout/a-gentle-introduction-to-doc2vec-db3e8c0cce5e
# Differences between KL-Divergence, Bhattacharyya and Hellinger distance:
See the corresponding Wikipedia articles on these measures.
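The Hellinger distance is built on the Bhattacharyya coefficient and, unlike KL divergence, is a symmetric metric bounded in [0, 1]. A minimal sketch, reusing the same style of toy distributions as above:

```python
from math import sqrt

def bhattacharyya_coeff(p, q):
    """BC(P, Q) = sum_i sqrt(p_i * q_i), in [0, 1]."""
    return sum(sqrt(pi * qi) for pi, qi in zip(p, q))

def hellinger(p, q):
    """H(P, Q) = sqrt(1 - BC(P, Q)). The max() guards against
    tiny negative values from floating-point rounding."""
    return sqrt(max(0.0, 1.0 - bhattacharyya_coeff(p, q)))

p = [0.8, 0.15, 0.05]
q = [0.4, 0.4, 0.2]
print(hellinger(p, q) == hellinger(q, p))  # → True: symmetric
```

Because it is symmetric and bounded, Hellinger distance plugs directly into standard clustering algorithms, whereas KL divergence needs symmetrising (or a variant such as Jensen-Shannon) first.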
# Document Clustering
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.332.4480&rep=rep1&type=pdf - Similarity Measures for Text Document Clustering
--
As always, if you have more sources, please share them in the comments or the discussions/forum section of the meetup.
We look forward to seeing you!
