WordStorms generator for Projects in Ecosystems

Research keywords: Latent Semantic Indexing, WordNet

Uses the Scrapy and NLTK libraries to automatically generate a wordcloud for each project listed on the Eclipse IoT site.

1. Steps
2. Install package
3. Config Java VM argument
4. Run the project
5. Debug

1. Steps

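Define the following inputs:
1. A project list: a JSON file containing an array of projects. The steps that follow are repeated for every project in the list.
2. A protocol keyword list: a JSON file containing an array of items.

For example, a project list entry followed by a protocol keyword list entry: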

[
    {
        "IsCrawled": true,
        "CrawlDepthLevel": 1,
        "IsWordcloudGenerated": true,
        "SiteUrl": "http://www.eclipse.org/paho/",
        "ProjectName": "paho"
    }
]

[
    {
        "id": 1,
        "description": "Message Queuing Telemetry Transport",
        "keys": [
            "MQTT",
            "ZMQ",
            "RabbitMQ"]
    }
]
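As a rough illustration of how these two inputs could be read, here is a minimal sketch that uses only the standard library. The field names come from the examples above; the `Project` class, the function names, and the idea of selecting "pending" projects are assumptions made for this sketch, not the project's actual code. The sample flips `IsWordcloudGenerated` to false so that the selection returns something.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Project:
    """One entry of the project list (hypothetical helper type)."""
    name: str
    site_url: str
    crawl_depth: int
    is_crawled: bool
    is_wordcloud_generated: bool


def parse_projects(raw: str) -> List[Project]:
    """Parse the project list JSON text into Project records."""
    return [
        Project(
            name=item["ProjectName"],
            site_url=item["SiteUrl"],
            crawl_depth=item["CrawlDepthLevel"],
            is_crawled=item["IsCrawled"],
            is_wordcloud_generated=item["IsWordcloudGenerated"],
        )
        for item in json.loads(raw)
    ]


def parse_keywords(raw: str) -> Dict[str, List[str]]:
    """Map each protocol description to its list of keys."""
    return {item["description"]: item["keys"] for item in json.loads(raw)}


def pending(projects: List[Project]) -> List[Project]:
    """Projects that still need a crawl or a wordcloud."""
    return [p for p in projects
            if not (p.is_crawled and p.is_wordcloud_generated)]


# Sample data shaped like the examples above (one flag flipped for illustration).
PROJECTS_JSON = """
[{"IsCrawled": true, "CrawlDepthLevel": 1, "IsWordcloudGenerated": false,
  "SiteUrl": "http://www.eclipse.org/paho/", "ProjectName": "paho"}]
"""
KEYWORDS_JSON = """
[{"id": 1, "description": "Message Queuing Telemetry Transport",
  "keys": ["MQTT", "ZMQ", "RabbitMQ"]}]
"""

projects = parse_projects(PROJECTS_JSON)
keywords = parse_keywords(KEYWORDS_JSON)
print([p.name for p in pending(projects)])
print(keywords)
```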

Crawl the site to the pre-defined depth level and extract all text into a .txt file (a series of paragraphs, which can be used later to find the relationships between projects)
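To make the idea of a depth-limited crawl concrete, here is a dependency-free sketch. It is not the project's Scrapy spider: it uses the standard library's `html.parser`, and the `crawl` function, the `fetch` callback, and the in-memory `FAKE_SITE` are all hypothetical. It also assumes that a depth level of 1 means "the start page plus the pages it links to". In Scrapy itself, the `DEPTH_LIMIT` setting plays a similar role.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, List
from urllib.parse import urljoin


class TextAndLinks(HTMLParser):
    """Collect visible text and href targets from one HTML page."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.text: List[str] = []
        self.links: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore whitespace-only chunks and anything inside script/style.
        if not self._skip_depth and data.strip():
            self.text.append(data.strip())


def crawl(start_url: str, max_depth: int, fetch: Callable[[str], str]) -> List[str]:
    """Breadth-first crawl from start_url, following links up to max_depth hops."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    paragraphs: List[str] = []
    while queue:
        url, depth = queue.popleft()
        parser = TextAndLinks()
        parser.feed(fetch(url))
        parser.close()
        paragraphs.extend(parser.text)
        if depth < max_depth:
            for href in parser.links:
                target = urljoin(url, href)
                # Stay on the project site and skip pages already queued.
                if target.startswith(start_url) and target not in seen:
                    seen.add(target)
                    queue.append((target, depth + 1))
    return paragraphs


# A tiny in-memory "site" so the sketch runs without network access.
FAKE_SITE = {
    "http://example.org/": '<p>Home</p><a href="docs/">docs</a><script>x()</script>',
    "http://example.org/docs/": '<p>Docs page</p><a href="deep/">deep</a>',
    "http://example.org/docs/deep/": "<p>Deep page</p>",
}

depth_one = crawl("http://example.org/", 1, FAKE_SITE.__getitem__)
print("\n".join(depth_one))
```

Writing `"\n".join(...)` of the result to a file would give the .txt file of paragraphs described in this step.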
Preprocess the crawled data
1. Tokenize the paragraphs into sentences
2. Tokenize the sentences into words
3. Stem the words (at this point, the other forms of a word are not needed)
4. Remove all stop words (using the English stop word list plus our own pre-defined stop words, for example: eclips, github, project, etc.)
5. Extract all programming languages (to include programming languages in the wordcloud, we must choose between the language the project is written in and the languages the project supports)
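The first four preprocessing steps can be sketched without any third-party package. This is a simplified stand-in, not the project's NLTK pipeline: the regex tokenizers replace NLTK's tokenizers, `toy_stem` is a deliberately crude substitute for a real stemmer, and both stop word sets are tiny illustrative samples. Note that the custom stop words are matched after stemming, which is why they appear in stemmed form (`eclips` rather than `eclipse`).

```python
import re
from typing import List

# Small illustrative subsets; real stop word lists are much longer.
ENGLISH_STOP_WORDS = {"the", "is", "a", "an", "of", "for", "and", "to", "in", "on", "it"}
CUSTOM_STOP_WORDS = {"eclips", "github", "project"}  # stemmed forms, as in the list above
STOP_WORDS = ENGLISH_STOP_WORDS | CUSTOM_STOP_WORDS


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_words(sentence: str) -> List[str]:
    """Lowercase the sentence and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", sentence.lower())


def toy_stem(word: str) -> str:
    """Strip one common suffix; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "es", "ed", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str) -> List[str]:
    """Sentence-tokenize, word-tokenize, stem, then drop stop words."""
    stems = [
        toy_stem(word)
        for sentence in split_sentences(text)
        for word in split_words(sentence)
    ]
    return [stem for stem in stems if stem not in STOP_WORDS]


SAMPLE = "Eclipse Paho is a project on GitHub. It provides MQTT clients for messaging."
tokens = preprocess(SAMPLE)
print(tokens)
```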
Draw the wordcloud
1. Get the frequency distribution of the keywords from the preprocessing step and select the 50 most common keywords (any number works, not just 50 😃)
2. Feed the frequencies to the Python drawing library to create the wordcloud picture, then save or serve the picture
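The frequency step can be sketched with `collections.Counter` from the standard library. The function name `top_keywords` and the sample tokens are made up for this illustration, and the project may well use NLTK's own frequency class instead.

```python
from collections import Counter
from typing import Dict, Iterable


def top_keywords(tokens: Iterable[str], limit: int = 50) -> Dict[str, int]:
    """Count the tokens and keep the `limit` most frequent ones."""
    return dict(Counter(tokens).most_common(limit))


# Hypothetical tokens, as they might come out of preprocessing.
SAMPLE_TOKENS = ["mqtt", "client", "mqtt", "broker", "client", "mqtt"]
frequencies = top_keywords(SAMPLE_TOKENS, limit=2)
print(frequencies)
```

A dictionary shaped like this is what the `generate_from_frequencies` method of the wordcloud package's `WordCloud` class accepts, and its `to_file` method can then save the image. Whether the project calls the library in exactly this way is an assumption.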
Analyze the crawled data with NLP
1. Split the text into sentences and tokenize them
2. Find the sentences containing the keywords
3. Use grammar rules to identify the relationship implied by each sentence
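Step 2 of this analysis can be sketched as follows. The function name, the regex-based sentence splitter, and the sample text are assumptions for illustration only (the sample sentences are invented, not crawled data), and the grammar-rule step is left out. Keys are matched case-insensitively on word boundaries, so `ZMQ` does not match inside `RabbitMQ`.

```python
import re
from typing import Dict, Iterable, List


def find_keyword_sentences(text: str, keys: Iterable[str]) -> Dict[str, List[str]]:
    """Return, for each key that occurs, the sentences mentioning it."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    hits: Dict[str, List[str]] = {}
    for key in keys:
        pattern = re.compile(r"\b" + re.escape(key) + r"\b", re.IGNORECASE)
        matched = [s for s in sentences if pattern.search(s)]
        if matched:
            hits[key] = matched
    return hits


# Invented sample text; the keys come from the keyword list example.
SAMPLE_TEXT = (
    "Paho implements MQTT. The site also hosts tutorials. "
    "Bridges to RabbitMQ are supported."
)
hits = find_keyword_sentences(SAMPLE_TEXT, ["MQTT", "ZMQ", "RabbitMQ"])
print(hits)
```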
Generate a graph of the relationship
Draw the graph
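One possible shape for the graph step is sketched below. It assumes the relationship is simply "this project mentions this keyword", so projects that share a keyword end up connected through it; the real relationship types are not specified here. The function names, the second project name, and the choice of Graphviz DOT text as output are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def build_edges(project_hits: Dict[str, Iterable[str]]) -> List[Tuple[str, str]]:
    """One edge per (project, keyword) pair, sorted for stable output."""
    return sorted(
        {(project, key) for project, keys in project_hits.items() for key in keys}
    )


def to_dot(edges: List[Tuple[str, str]]) -> str:
    """Render the edges as an undirected graph in Graphviz DOT syntax."""
    lines = ["graph relationships {"]
    lines += [f'    "{left}" -- "{right}";' for left, right in edges]
    lines.append("}")
    return "\n".join(lines)


# Hypothetical result of the keyword search for two projects.
PROJECT_HITS = {"paho": ["MQTT"], "other-project": ["MQTT", "RabbitMQ"]}
edges = build_edges(PROJECT_HITS)
print(to_dot(edges))
```

The DOT text can be rendered by any Graphviz-compatible tool, which keeps the drawing step separate from building the graph.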

For now, drawing the grammar tree only works on Windows.

2. Install package

After each run, the ptidejWordcloud/sitelist.json file marks each project as crawled or wordcloud-generated with a true value. Set these values back to false if you want to re-run a project.
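Editing the flags by hand works, but a small helper can do the same thing. The sketch below assumes that sitelist.json is a JSON array with the same fields as the project list example (`ProjectName`, `IsCrawled`, `IsWordcloudGenerated`); the function names are hypothetical and not part of the project.

```python
import json
from typing import List


def reset_project(sites: List[dict], project_name: str) -> List[dict]:
    """Return a copy of the site list with both flags cleared for one project."""
    updated = []
    for site in sites:
        site = dict(site)  # copy, so the caller's data is left untouched
        if site["ProjectName"] == project_name:
            site["IsCrawled"] = False
            site["IsWordcloudGenerated"] = False
        updated.append(site)
    return updated


def reset_in_file(path: str, project_name: str) -> None:
    """Load the site list, clear the flags for one project, and write it back."""
    with open(path, encoding="utf-8") as handle:
        sites = json.load(handle)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(reset_project(sites, project_name), handle, indent=4)


# In-memory example; "other-project" is a made-up entry.
SITES = [
    {"ProjectName": "paho", "IsCrawled": True, "IsWordcloudGenerated": True},
    {"ProjectName": "other-project", "IsCrawled": True, "IsWordcloudGenerated": True},
]
print(reset_project(SITES, "paho"))
```

Under those assumptions, calling `reset_in_file("ptidejWordcloud/sitelist.json", "paho")` from the project root would mark paho for a fresh crawl and a new wordcloud on the next run.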

Requirements:

Python 3.7 or above

Java 8 or above

Install Ghostscript

Ghostscript is needed to export images from the .ps files produced by the nltk grammar scan.

# for Windows: using Chocolatey package manager
choco install ghostscript

# for Linux:
[sudo] apt-get install ghostscript

Install python packages

[sudo] python3 -m pip install Twisted
# for windows:
# pip install Twisted[windows_platform]
[sudo] python3 -m pip install Scrapy
[sudo] python3 -m pip install beautifulsoup4
[sudo] python3 -m pip install matplotlib
[sudo] python3 -m pip install Pillow
[sudo] python3 -m pip install Wordcloud
[sudo] python3 -m pip install tabulate
[sudo] python3 -m pip install pandas
[sudo] python3 -m pip install --upgrade gensim
[sudo] python3 -m pip install jsonpickle

Install the Natural Language Toolkit (nltk)

Install package

[sudo] python3 -m pip install nltk

Download the nltk data

python3

>>> import nltk
>>> nltk.download('stopwords')
>>> nltk.download('punkt')
>>> nltk.download('averaged_perceptron_tagger')
>>> nltk.download('wordnet')
>>> quit()

For Windows machines

Using Python on a Windows machine requires Microsoft Visual C++ Build Tools.

You can get the build tools at https://visualstudio.microsoft.com/downloads/.

See the nltk documentation for more about downloading nltk data.

3. Config Java VM argument

The Stanford POS Tagger is resource-intensive. You will need to increase the Java heap size to avoid a java.lang.OutOfMemoryError exception.

Add or modify this parameter in the Java settings of Visual Studio Code:

"java.jdt.ls.vmargs": "-Xmx4G -Xms512m [existing settings]"

4. Run the project

cd rootProjectFolder
[sudo] python3 auto_runner.py

5. Debug

For debugging with Visual Studio Code:

Choose "Python: Run Scrapy and NLTK" from the debug configuration list

Put a breakpoint anywhere in the Python code
Press F5 or the debug button to start debugging


huntertran/seco-storms-maker
