Code style and composition course in Python

Code style and composition course for junior ML developers (in Python)

Lectures

Lecture 1: slides
Lecture 2: slides in progress

Test task

Input

You can find the input.txt file at the root of the repository. It contains several records, each consisting of a tweet body followed by a JSON-encoded tags array. Note that the sample uses single-quoted strings, which strict JSON parsers reject. A sample of 3 lines is shown below:

$ABBV why price is going down, despite good results?,['@price']
$CMA max pain is 87.5 for expiry 2018-11-16 Source: http://sweep.ly/maxpain.html,['@source']
#STAAnalystAlert for $BLL : KeyCorp Reiterates with a rating of Hold. Our own verdict is Strong Buy http://www.stocktargetadvisor.com/toprating,['@keycorp']
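
Because a tweet body can itself contain commas, splitting a line on the first comma would cut the body short. A minimal sketch of one way to separate the two fields, assuming the tags array is always the last thing on the line and is introduced by `,[` (the function name `split_record` is hypothetical, not part of the task):

```python
from __future__ import annotations


def split_record(line: str) -> tuple[str, str]:
    """Split one input line into (tweet body, raw tags string).

    Assumption: the tags array is the final field and starts at the
    LAST occurrence of ",[" in the line.
    """
    line = line.rstrip("\n")
    body, sep, tail = line.rpartition(",[")
    if not sep:
        # No tags array found: keep the whole line as the body.
        return line, "[]"
    # rpartition consumed the opening bracket, so put it back.
    return body, "[" + tail


body, raw_tags = split_record(
    "$ABBV why price is going down, despite good results?,['@price']"
)
print(body)      # $ABBV why price is going down, despite good results?
print(raw_tags)  # ['@price']
```

The raw tags string is returned untouched so that the clean-up rules can be applied as a separate step.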

Your job is to write a Python script that will tidy up this file according to a set of rules.

Processing rules

For the tags array:

Remove the quotes and square brackets around each tag (the sample output below also drops the leading @)
Place the cleaned-up tags into a separate array in each tweet's resulting record (under the metadata key)
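
The two tag rules can be sketched with plain string operations. This is only one possible reading: the helper name `clean_tags` is hypothetical, and dropping the leading `@` is an assumption taken from the sample output rather than from the rules themselves.

```python
from __future__ import annotations


def clean_tags(raw: str) -> list[str]:
    """Turn a raw tags string such as "['@price']" into ["price"]."""
    # Drop the surrounding square brackets.
    inner = raw.strip().strip("[]")
    tags = []
    for token in inner.split(","):
        # Drop whitespace and either kind of quote around the tag.
        token = token.strip().strip("'\"")
        # Assumption based on the sample output: the leading "@" goes too.
        token = token.lstrip("@")
        if token:
            tags.append(token)
    return tags


print(clean_tags("['@keycorp']"))  # ['keycorp']
```

The returned list is what would be stored under the metadata key, before any URLs from the tweet body are appended to it.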

For the text:

Remove words starting with the $ sign
Place all words starting with @ or # into a separate array in each tweet's resulting record (under the body_tags key)
If the tweet body contains a URL, add it to the array holding the cleaned-up tags (metadata)
Tokenize the tweet body: split on whitespace and strip punctuation from the end of each token; drop tokens that consist only of punctuation (for example, why? becomes why, and a standalone : is removed)
Skip all tokens starting with $, @ or #, and skip URLs
Check the remaining tokens for inclusion in the WordNet corpus using the Lesk algorithm from nltk.wsd
Place all tokens that are not found in the corpus into a separate array under the orphan_tokens key in each tweet's resulting record
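
The text rules can be sketched as a single pass over the whitespace-separated tokens. The sketch below makes several assumptions: the function name `process_body` is hypothetical, a URL is recognised only by an `http://` or `https://` prefix, the `#`/`@` prefix is stripped from body tags (as in the sample output), and the corpus check is passed in as a callable so the tokenizing logic can run without NLTK data installed.

```python
from __future__ import annotations

import string
from typing import Callable

URL_PREFIXES = ("http://", "https://")  # assumption: only these mark a URL


def process_body(
    body: str, in_corpus: Callable[[str], bool]
) -> tuple[list[str], list[str], list[str]]:
    """Return (body_tags, urls, orphan_tokens) for one tweet body."""
    body_tags, urls, orphans = [], [], []
    for raw in body.split():
        if raw.startswith(URL_PREFIXES):
            urls.append(raw)  # later appended to metadata
            continue
        if raw.startswith("$"):
            continue  # words starting with $ are dropped
        if raw[0] in "@#":
            tag = raw[1:].rstrip(string.punctuation)
            if tag:
                body_tags.append(tag)
            continue
        token = raw.rstrip(string.punctuation)
        if not token:
            continue  # pure punctuation, e.g. a standalone ":"
        if not in_corpus(token):
            orphans.append(token)
    return body_tags, urls, orphans


# Stand-in corpus check for illustration only. With NLTK and its WordNet
# data installed, a check in the spirit of the task could be:
#   from nltk.wsd import lesk
#   in_corpus = lambda tok: lesk(body.split(), tok) is not None
known = {"why", "price", "is", "going", "down", "good", "results"}
tags, urls, orphans = process_body(
    "$ABBV why price is going down, despite good results?",
    lambda tok: tok.lower() in known,
)
print(tags, urls, orphans)  # [] [] ['despite']
```

Injecting the corpus check keeps tokenizing separate from the WordNet lookup, so each can be tested on its own. The stand-in set above is not WordNet, so its orphan list is not expected to match the sample output.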

Output

The script should write its full output to an output.json file at the root of the project, overwriting any existing file with the same name. The JSON in this file should have the following structure:

{
  "records": [
    {
      "body": "...", // the entire tweet body, untouched
      "body_tags": [], // array of strings: your body tags (#this or @this)
      "metadata": [], // array of strings: your cleaned-up tags from the tags field
      "orphan_tokens": [] // array of strings: your tokens that are missing from the words corpus
    }
  ]
}

For the sample input given above, the output fragment should look like this:

{
  "records": [
    {
      "body": "$ABBV why price is going down, despite good results?",
      "body_tags": [],
      "metadata": ["price"],
      "orphan_tokens": []
    },
    {
      "body": "$CMA max pain is 87.5 for expiry 2018-11-16 Source: http://sweep.ly/maxpain.html",
      "body_tags": [],
      "metadata": ["source", "http://sweep.ly/maxpain.html"],
      "orphan_tokens": ["87.5", "for", "2018-11-16", "Source:"]
    },
    {
      "body": "#STAAnalystAlert for $BLL : KeyCorp Reiterates with a rating of Hold. Our own verdict is Strong Buy http://www.stocktargetadvisor.com/toprating",
      "body_tags": ["STAAnalystAlert"],
      "metadata": ["keycorp", "http://www.stocktargetadvisor.com/toprating"],
      "orphan_tokens": ["for", "KeyCorp", "with", "of", "our"]
    }
  ]
}
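
Writing the result can be sketched with the standard json module. The helper name `write_output` and its default path are assumptions for illustration. Opening the file in "w" mode truncates any existing file, which matches the overwrite requirement.

```python
from __future__ import annotations

import json


def write_output(records: list[dict], path: str = "output.json") -> None:
    """Write all records under a top-level "records" key, replacing any old file."""
    payload = {"records": records}
    # Mode "w" truncates an existing file with the same name.
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    write_output(
        [
            {
                "body": "$ABBV why price is going down, despite good results?",
                "body_tags": [],
                "metadata": ["price"],
                "orphan_tokens": [],
            }
        ]
    )
```

Passing `ensure_ascii=False` keeps any non-ASCII characters in tweet bodies readable in the file instead of escaping them.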

The process

Make a public fork of this repo in your GitHub account
Add your commits to the master branch of your fork
Make a pull request from your master branch to the source repo's master branch
Let's discuss your work!

vloooo/code-style-and-composition-course-python

Folders and files

Latest commit

History

Repository files navigation

Code style and composition course in Python

Lectures

Test task

Input

Processing rules

Output

The process

About

Resources

License

Stars

Watchers

Forks

Languages