Merge branch 'release-4.1.0'

piskvorky · Aug 14, 2021 · 109c88e · 109c88e
2 parents b4f64a9 + 1bb426a
commit 109c88e
Show file tree

Hide file tree

Showing 88 changed files with 8,577 additions and 922 deletions.
diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
@@ -39,7 +39,8 @@ jobs:
       #
       - name: Update sbt
         run: |
-          echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
+          echo "deb https://repo.scala-sbt.org/scalasbt/debian all main" | sudo tee /etc/apt/sources.list.d/sbt.list
+          echo "deb https://repo.scala-sbt.org/scalasbt/debian /" | sudo tee /etc/apt/sources.list.d/sbt_old.list
           curl -sL "https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x2EE0EA64E40A89B84B2DF73499E82A75642AC823" | sudo apt-key add
           sudo apt-get update -y
           sudo apt-get install -y sbt

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,6 +1,126 @@
 Changes
 =======
 
+## Unreleased
+
+## 4.1.0, 2021-08-15
+
+Gensim 4.1 brings two major new functionalities:
+
+* [Ensemble LDA](https://radimrehurek.com/gensim/auto_examples/tutorials/run_ensemblelda.html) for robust training, selection and comparison of LDA models.
+* [FastSS module](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/similarities/fastss.pyx) for super fast Levenshtein "fuzzy search" queries. Used e.g. for ["soft term similarity"](https://github.com/RaRe-Technologies/gensim/pull/3146) calculations.
+
+There are several minor changes that are **not** backwards compatible with previous versions of Gensim.
+The affected functionality is relatively less used, so it is unlikely to affect most users, so we have opted to not require a major version bump.
+Nevertheless, we describe them below.
+
+### Improved parameter edge-case handling in KeyedVectors most_similar and most_similar_cosmul methods
+
+We now handle both ``positive`` and ``negative`` keyword parameters consistently.
+They may now be either:
+
+1. A string, in which case the value is reinterpreted as a list of one element (the string value)
+2. A vector, in which case the value is reinterpreted as a list of one element (the vector)
+3. A list of strings
+4. A list of vectors
+
+So you can now simply do:
+
+```python
+    model.most_similar(positive='war', negative='peace')
+```
+
+instead of the slightly more involved
+
+```python
+model.most_similar(positive=['war'], negative=['peace'])
+```
+
+Both invocations remain correct, so you can use whichever is most convenient.
+If you were somehow expecting gensim to interpret the strings as a list of characters, e.g.
+
+```python
+model.most_similar(positive=['w', 'a', 'r'], negative=['p', 'e', 'a', 'c', 'e'])
+```
+
+then you will need to specify the lists explicitly in gensim 4.1.
+### Deprecated obsolete `step` parameter from doc2vec
+
+With the newer version, do this:
+
+```python
+model.infer_vector(..., epochs=123)
+```
+
+instead of this:
+
+```python
+model.infer_vector(..., steps=123)
+```
+
+Plus a large number of smaller improvements and fixes, as usual.
+
+**⚠️ If migrating from old Gensim 3.x, read the [Migration guide](https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4) first.**
+
+### :+1: New features
+
+* [#3169](https://github.com/RaRe-Technologies/gensim/pull/3169): Implement `shrink_windows` argument for Word2Vec, by [@M-Demay](https://github.com/M-Demay)
+* [#3163](https://github.com/RaRe-Technologies/gensim/pull/3163): Optimize word mover distance (WMD) computation, by [@flowlight0](https://github.com/flowlight0)
+* [#3157](https://github.com/RaRe-Technologies/gensim/pull/3157): New KeyedVectors.vectors_for_all method for vectorizing all words in a dictionary, by [@Witiko](https://github.com/Witiko)
+* [#3153](https://github.com/RaRe-Technologies/gensim/pull/3153): Vectorize word2vec.predict_output_word for speed, by [@M-Demay](https://github.com/M-Demay)
+* [#3146](https://github.com/RaRe-Technologies/gensim/pull/3146): Use FastSS for fast kNN over Levenshtein distance, by [@Witiko](https://github.com/Witiko)
+* [#3128](https://github.com/RaRe-Technologies/gensim/pull/3128): Materialize and copy the corpus passed to SoftCosineSimilarity, by [@Witiko](https://github.com/Witiko)
+* [#3115](https://github.com/RaRe-Technologies/gensim/pull/3115): Make LSI dispatcher CLI param for number of jobs optional, by [@robguinness](https://github.com/robguinness)
+* [#3091](https://github.com/RaRe-Technologies/gensim/pull/3091): LsiModel: Only log top words that actually exist in the dictionary, by [@kmurphy4](https://github.com/kmurphy4)
+* [#2980](https://github.com/RaRe-Technologies/gensim/pull/2980): Added EnsembleLda for stable LDA topics, by [@sezanzeb](https://github.com/sezanzeb)
+* [#2978](https://github.com/RaRe-Technologies/gensim/pull/2978): Optimize performance of Author-Topic model, by [@horpto](https://github.com/horpto)
+* [#3000](https://github.com/RaRe-Technologies/gensim/pull/3000): Tidy up KeyedVectors.most_similar() API, by [@simonwiles](https://github.com/simonwiles)
+
+### :books: Tutorials and docs
+
+* [#3155](https://github.com/RaRe-Technologies/gensim/pull/3155): Correct parameter name in documentation of fasttext.py, by [@bizzyvinci](https://github.com/bizzyvinci)
+* [#3148](https://github.com/RaRe-Technologies/gensim/pull/3148): Fix broken link to mycorpus.txt in documentation, by [@rohit901](https://github.com/rohit901)
+* [#3142](https://github.com/RaRe-Technologies/gensim/pull/3142): Use more permanent pdf link and update code link, by [@dymil](https://github.com/dymil)
+* [#3141](https://github.com/RaRe-Technologies/gensim/pull/3141): Update link for online LDA paper, by [@dymil](https://github.com/dymil)
+* [#3133](https://github.com/RaRe-Technologies/gensim/pull/3133): Update link to Hoffman paper (online VB LDA), by [@jonaschn](https://github.com/jonaschn)
+* [#3129](https://github.com/RaRe-Technologies/gensim/pull/3129): [MRG] Add bronze sponsor: TechTarget, by [@piskvorky](https://github.com/piskvorky)
+* [#3126](https://github.com/RaRe-Technologies/gensim/pull/3126): Fix typos in make_wiki_online.py and make_wikicorpus.py, by [@nicolasassi](https://github.com/nicolasassi)
+* [#3125](https://github.com/RaRe-Technologies/gensim/pull/3125): Improve & unify docs for dirichlet priors, by [@jonaschn](https://github.com/jonaschn)
+* [#3123](https://github.com/RaRe-Technologies/gensim/pull/3123): Fix hyperlink for doc2vec tutorial, by [@AdityaSoni19031997](https://github.com/AdityaSoni19031997)
+* [#3121](https://github.com/RaRe-Technologies/gensim/pull/3121): [MRG] Add bronze sponsor: eaccidents.com, by [@piskvorky](https://github.com/piskvorky)
+* [#3120](https://github.com/RaRe-Technologies/gensim/pull/3120): Fix URL for ldamodel.py, by [@jonaschn](https://github.com/jonaschn)
+* [#3118](https://github.com/RaRe-Technologies/gensim/pull/3118): Fix URL in doc string, by [@jonaschn](https://github.com/jonaschn)
+* [#3107](https://github.com/RaRe-Technologies/gensim/pull/3107): Draw attention to sponsoring in README, by [@piskvorky](https://github.com/piskvorky)
+* [#3105](https://github.com/RaRe-Technologies/gensim/pull/3105): Fix documentation links: Travis to Github Actions, by [@piskvorky](https://github.com/piskvorky)
+* [#3057](https://github.com/RaRe-Technologies/gensim/pull/3057): Clarify doc comment in LdaModel.inference(), by [@yocen](https://github.com/yocen)
+* [#2964](https://github.com/RaRe-Technologies/gensim/pull/2964): Document that preprocessing.strip_punctuation is limited to ASCII, by [@sciatro](https://github.com/sciatro)
+
+
+### :red_circle: Bug fixes
+
+* [#3178](https://github.com/RaRe-Technologies/gensim/pull/3178): Fix Unicode string incompatibility in gensim.similarities.fastss.editdist, by [@Witiko](https://github.com/Witiko)
+* [#3174](https://github.com/RaRe-Technologies/gensim/pull/3174): Fix loading Phraser models stored in Gensim 3.x into Gensim 4.0, by [@emgucv](https://github.com/emgucv)
+* [#3136](https://github.com/RaRe-Technologies/gensim/pull/3136): Fix indexing error in word2vec_inner.pyx, by [@bluekura](https://github.com/bluekura)
+* [#3131](https://github.com/RaRe-Technologies/gensim/pull/3131): Add missing import to NMF docs and models/__init__.py, by [@properGrammar](https://github.com/properGrammar)
+* [#3116](https://github.com/RaRe-Technologies/gensim/pull/3116): Fix bug where saved Phrases model did not load its connector_words, by [@aloknayak29](https://github.com/aloknayak29)
+* [#2830](https://github.com/RaRe-Technologies/gensim/pull/2830): Fixed KeyError in coherence model, by [@pietrotrope](https://github.com/pietrotrope)
+
+
+### :warning: Removed functionality & deprecations
+
+* [#3176](https://github.com/RaRe-Technologies/gensim/pull/3176): Eliminate obsolete step parameter from doc2vec infer_vector and similarity_unseen_docs, by [@rock420](https://github.com/rock420)
+* [#2965](https://github.com/RaRe-Technologies/gensim/pull/2965): Remove strip_punctuation2 alias of strip_punctuation, by [@sciatro](https://github.com/sciatro)
+* [#3180](https://github.com/RaRe-Technologies/gensim/pull/3180): Move preprocessing functions from gensim.corpora.textcorpus and gensim.corpora.lowcorpus to gensim.parsing.preprocessing, by [@rock420](https://github.com/rock420)
+
+### 🔮 Testing, CI, housekeeping
+
+* [#3156](https://github.com/RaRe-Technologies/gensim/pull/3156): Update Numpy minimum version to 1.17.0, by [@PrimozGodec](https://github.com/PrimozGodec)
+* [#3143](https://github.com/RaRe-Technologies/gensim/pull/3143): replace _mul function with explicit casts, by [@mpenkov](https://github.com/mpenkov)
+* [#2952](https://github.com/RaRe-Technologies/gensim/pull/2952): Allow newer versions of the Morfessor module for the tests, by [@pabs3](https://github.com/pabs3)
+* [#2965](https://github.com/RaRe-Technologies/gensim/pull/2965): Remove strip_punctuation2 alias of strip_punctuation, by [@sciatro](https://github.com/sciatro)
+
+
+
 ## 4.0.1, 2021-04-01
 
 Bugfix release to address issues with Wheels on Windows:

diff --git a/README.md b/README.md
@@ -19,12 +19,8 @@ and *similarity retrieval* with large corpora. Target audience is the
 *natural language processing* (NLP) and *information retrieval* (IR)
 community.
 
-<!--
-## :pizza: Hacktoberfest 2019 :beer:
+## ⚠️  Please [sponsor Gensim](https://github.com/sponsors/piskvorky) to help sustain this open source project ❤️
 
-We are accepting PRs for Hacktoberfest!
-See [here](HACKTOBERFEST.md) for details.
--->
 
 Features
 --------
@@ -57,10 +53,10 @@ scientific computing. You must have them installed prior to installing
 gensim.
 
 It is also recommended you install a fast BLAS library before installing
-NumPy. This is optional, but using an optimized BLAS such as [ATLAS] or
+NumPy. This is optional, but using an optimized BLAS such as MKL, [ATLAS] or
 [OpenBLAS] is known to improve performance by as much as an order of
-magnitude. On OS X, NumPy picks up the BLAS that comes with it
-automatically, so you don’t need to do anything special.
+magnitude. On OSX, NumPy picks up its vecLib BLAS automatically,
+so you don’t need to do anything special.
 
 Install the latest version of gensim:
 
@@ -77,7 +73,8 @@ package:
 
 For alternative modes of installation, see the [documentation].
 
-Gensim is being [continuously tested](https://travis-ci.org/RaRe-Technologies/gensim) under Python 3.6, 3.7 and 3.8.
+Gensim is being [continuously tested](http://radimrehurek.com/gensim/#testing) under all
+[supported Python versions](https://github.com/RaRe-Technologies/gensim/wiki/Gensim-And-Compatibility).
 Support for Python 2.7 was dropped in gensim 4.0.0 – install gensim 3.8.3 if you must use Python 2.7.
 
 How come gensim is so fast and memory efficient? Isn’t it pure Python, and isn’t Python slow and greedy?
@@ -110,9 +107,12 @@ Documentation
 Support
 -------
 
-Ask open-ended or research questions on the [Gensim Mailing List](https://groups.google.com/forum/#!forum/gensim).
+For commercial support, please see [Gensim sponsorship](https://github.com/sponsors/piskvorky).
+
+Ask open-ended questions on the public [Gensim Mailing List](https://groups.google.com/forum/#!forum/gensim).
+
+Raise bugs on [Github](https://github.com/RaRe-Technologies/gensim/blob/develop/CONTRIBUTING.md) but please **make sure you follow the [issue template](https://github.com/RaRe-Technologies/gensim/blob/develop/ISSUE_TEMPLATE.md)**. Issues that are not bugs or fail to provide the requested details will be closed without inspection.
 
-Raise bugs on [Github](https://github.com/RaRe-Technologies/gensim/blob/develop/CONTRIBUTING.md) but **make sure you follow the [issue template](https://github.com/RaRe-Technologies/gensim/blob/develop/ISSUE_TEMPLATE.md)**. Issues that are not bugs or fail to follow the issue template will be closed without inspection.
 
 ---------
 
@@ -162,15 +162,12 @@ BibTeX entry:
 
   [citing gensim in academic papers and theses]: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=9vG_kV0AAAAJ&citation_for_view=9vG_kV0AAAAJ:NaGl4SEjCO4C
 
-  [Travis CI for automated testing]: https://travis-ci.org/RaRe-Technologies/gensim
   [design goals]: http://radimrehurek.com/gensim/about.html
   [RaRe Technologies]: http://rare-technologies.com/wp-content/uploads/2016/02/rare_image_only.png%20=10x20
   [rare\_tech]: //rare-technologies.com
   [Talentpair]: https://avatars3.githubusercontent.com/u/8418395?v=3&s=100
   [citing gensim in academic papers and theses]: https://scholar.google.cz/citations?view_op=view_citation&hl=en&user=9vG_kV0AAAAJ&citation_for_view=9vG_kV0AAAAJ:u-x6o8ySG0sC
 
-
-
   [documentation and Jupyter Notebook tutorials]: https://github.com/RaRe-Technologies/gensim/#documentation
   [Vector Space Model]: http://en.wikipedia.org/wiki/Vector_space_model
   [unsupervised document analysis]: http://en.wikipedia.org/wiki/Latent_semantic_indexing

diff --git a/docs/notebooks/ensemble_lda_with_opinosis.ipynb b/docs/notebooks/ensemble_lda_with_opinosis.ipynb
@@ -0,0 +1,178 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "scrolled": false
+   },
+   "outputs": [],
+   "source": [
+    "import logging\n",
+    "from gensim.models import EnsembleLda, LdaMulticore\n",
+    "from gensim.models.ensemblelda import rank_masking\n",
+    "from gensim.corpora import OpinosisCorpus\n",
+    "import os"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "enable the ensemble logger to show what it is doing currently"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "elda_logger = logging.getLogger(EnsembleLda.__module__)\n",
+    "elda_logger.setLevel(logging.INFO)\n",
+    "elda_logger.addHandler(logging.StreamHandler())"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def pretty_print_topics():\n",
+    "    # note that the words are stemmed so they appear chopped off\n",
+    "    for t in elda.print_topics(num_words=7):\n",
+    "        print('-', t[1].replace('*',' ').replace('\"','').replace(' +',','), '\\n')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Experiments on the Opinosis Dataset\n",
+    "\n",
+    "Opinosis [1] is a small (but redundant) corpus that contains 289 product reviews for 51 products. Since it's so small, the results are rather unstable.\n",
+    "\n",
+    "[1] Kavita Ganesan, ChengXiang Zhai, and Jiawei Han, _Opinosis: a graph-based approach to abstractive summarization of highly redundant opinions [online],_ Proceedings of the 23rd International Conference on Computational Linguistics, Association for Computational Linguistics, 2010, pp. 340–348. Available from: https://kavita-ganesan.com/opinosis/"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Preparing the corpus\n",
+    "\n",
+    "First, download the opinosis dataset. On linux it can be done like this for example:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!mkdir ~/opinosis\n",
+    "!wget -P ~/opinosis https://github.com/kavgan/opinosis/raw/master/OpinosisDataset1.0_0.zip\n",
+    "!unzip ~/opinosis/OpinosisDataset1.0_0.zip -d ~/opinosis"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "path = os.path.expanduser('~/opinosis/')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Corpus and id2word mapping can be created using the load_opinosis_data function provided in the package.\n",
+    "It preprocesses the data using the PorterStemmer and stopwords from the nltk package.\n",
+    "\n",
+    "The parameter of the function is the relative path to the folder, into which the zip file was extracted before. That folder contains a 'summaries-gold' subfolder."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "opinosis = OpinosisCorpus(path)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Training"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**parameters**\n",
+    "\n",
+    "**topic_model_kind** ldamulticore is highly recommended for EnsembleLda. ensemble_workers and **distance_workers** are used to improve the time needed to train the models, as well as the **masking_method** 'rank'. ldamulticore is not able to fully utilize all cores on this small corpus, so **ensemble_workers** can be set to 3 to get 95 - 100% cpu usage on my i5 3470.\n",
+    "\n",
+    "Since the corpus is so small, a high number of **num_models** is needed to extract stable topics. The Opinosis corpus contains 51 categories, however, some of them are quite similar. For example there are 3 categories about the batteries of portable products. There are also multiple categories about cars. So I chose 20 for num_topics, which is smaller than the number of categories."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "elda = EnsembleLda(\n",
+    "    corpus=opinosis.corpus, id2word=opinosis.id2word, num_models=128, num_topics=20,\n",
+    "    passes=20, iterations=100, ensemble_workers=3, distance_workers=4,\n",
+    "    topic_model_class='ldamulticore', masking_method=rank_masking,\n",
+    ")\n",
+    "pretty_print_topics()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The default for **min_samples** would be 64, half of the number of models and **eps** would be 0.1. You basically play around with them until you find a sweetspot that fits for your needs."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "elda.recluster(min_samples=55, eps=0.14)\n",
+    "pretty_print_topics()"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/docs/src/_static/images/eaccidents-logo.png b/docs/src/_static/images/eaccidents-logo.png
diff --git a/docs/src/_static/images/techtarget-logo.png b/docs/src/_static/images/techtarget-logo.png