This repository has been archived by the owner on Nov 11, 2018. It is now read-only.

Initial commit
p-e-w committed Jul 19, 2015 · commit 790aa56 (0 parents)
Showing 7 changed files with 448 additions and 0 deletions.
34 changes: 34 additions & 0 deletions .gitignore
@@ -0,0 +1,34 @@
# Based on .gitignore from https://github.com/pypa/sampleproject

# Backup files
*.~

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]

# C extensions
*.so

# Distribution / packaging
bin/
build/
develop-eggs/
dist/
eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Translations
*.mo
61 changes: 61 additions & 0 deletions README.md
@@ -0,0 +1,61 @@
# Want news? Just `krill` it!

[Krill](https://en.wikipedia.org/wiki/Krill) are [*filter feeders*](https://en.wikipedia.org/wiki/Filter_feeder). True to its namesake, `krill` :fried_shrimp: filters feeds. It is not picky about its diet, and will happily consume **RSS, Atom, CDF** and even **Twitter** :bird: feeds (*no credentials required!*). It aggregates feed items from all sources you specify, filters for those that interest you, and displays them as a **live stream** :fire: of clean, legible command line output.

![Screenshot](screenshot.png)

`krill` is beautifully minimal. `krill` is extremely easy to set up and use, and runs anywhere Python runs. `krill` is a refreshingly different way of consuming news :newspaper: and updates from anywhere on the web. **`krill` is the hacker's way of keeping up with the world.** :globe_with_meridians:


## Installation

`krill` requires [Python](https://www.python.org/) 2.7+/3.2+ :snake:. If you have the [pip](https://pip.pypa.io) package manager, all you need to do is run

```
pip install krill
```

either as a superuser or from a [virtualenv](https://virtualenv.pypa.io) environment.

Of course, you can also [download the script](krill/krill.py) directly from this repository, in which case you will need to install the dependencies [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/) (*what a library!* :star:), [feedparser](https://github.com/kurtmckee/feedparser) and [blessings](https://github.com/erikrose/blessings) manually.
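
If you install the script manually, its dependencies can typically be pulled in with pip as well (the names below are their usual PyPI package names):

```
pip install beautifulsoup4 feedparser blessings
```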


## Usage

### Command line

```
krill [-h] [-s URL [URL ...]] [-S FILE] [-f REGEX [REGEX ...]]
[-F FILE] [-u SECONDS]
-s URL [URL ...], --sources URL [URL ...]
URLs to pull data from
-S FILE, --sources-file FILE
file from which to load source URLs
-f REGEX [REGEX ...], --filters REGEX [REGEX ...]
patterns used to select feed items to print
-F FILE, --filters-file FILE
file from which to load filter patterns
-u SECONDS, --update-interval SECONDS
time between successive feed updates (default: 300
seconds, 0 for single pull only)
```

### Example

```
krill -s "https://twitter.com/nasa" -f "new ?horizons"
```

will follow NASA's :rocket: Twitter stream, printing only tweets that mention the [*New Horizons* probe](https://en.wikipedia.org/wiki/New_Horizons).

`krill` automatically determines whether to treat a web document as a Twitter or an XML feed. If multiple sources and/or filters are loaded from a file with the `-S` and `-F` options, each entry must be on a separate line. Empty lines and lines starting with `#` (comments) are ignored.

Inline and file specifications may be combined freely. If more than one filter is given, items matching *any* of the filters are printed. If no filter is given, all items are printed.
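
For example, a sources file (hypothetically named `sources.txt` here) could look like this:

```
# One source URL per line; empty lines and comments are ignored
https://twitter.com/nasa
https://example.com/feed.xml
```

Running `krill -S sources.txt -f "new ?horizons" "mars"` would then combine the file-based sources with two inline filter patterns.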


## License

Copyright &copy; 2015 Philipp Emanuel Weidmann (<pew@worldwidemann.com>)

Released under the terms of the [GNU General Public License, version 3](https://gnu.org/licenses/gpl.html)
Empty file added krill/__init__.py
300 changes: 300 additions & 0 deletions krill/krill.py
@@ -0,0 +1,300 @@
#!/usr/bin/env python

# krill - the hacker's way of keeping up with the world
#
# Copyright (c) 2015 Philipp Emanuel Weidmann <pew@worldwidemann.com>
#
# Nemo vir est qui mundum non reddat meliorem.
#
# Released under the terms of the GNU General Public License, version 3
# (https://gnu.org/licenses/gpl.html)


try:
# Python 3
from urllib.request import urlopen
except ImportError:
# Python 2
from urllib2 import urlopen

import re
import sys
import codecs
import hashlib
import argparse
from time import sleep, mktime
from datetime import datetime
from collections import namedtuple

import feedparser
from bs4 import BeautifulSoup
from blessings import Terminal



StreamItem = namedtuple("StreamItem", ["source", "time", "text", "link"])



class StreamParser:
def _html_to_text(self, html):
# Hack to prevent Beautiful Soup from collapsing space-keeping tags
# until no whitespace remains at all
html = re.sub("<(br|p)", " \\g<0>", html, flags=re.IGNORECASE)
text = BeautifulSoup(html, "html.parser").get_text()
# Idea from http://stackoverflow.com/a/1546251
return " ".join(text.strip().split())


def get_tweets(self, html):
document = BeautifulSoup(html, "html.parser")

for tweet in document.find_all("p", class_="tweet-text"):
header = tweet.find_previous("div", class_="stream-item-header")

name = header.find("strong", class_="fullname").string
username = header.find("span", class_="username").b.string

time_string = header.find("span", class_="_timestamp")["data-time"]
time = datetime.fromtimestamp(int(time_string))

# For Python 2 and 3 compatibility
to_unicode = unicode if sys.version_info[0] < 3 else str
# Remove ellipsis characters added by Twitter
text = self._html_to_text(to_unicode(tweet).replace(u"\u2026", ""))

link = "https://twitter.com%s" % header.find("a", class_="tweet-timestamp")["href"]

yield StreamItem("%s (@%s)" % (name, username), time, text, link)


def get_feed_items(self, xml):
feed_data = feedparser.parse(xml)

for entry in feed_data.entries:
time = datetime.fromtimestamp(mktime(entry.published_parsed))
text = "%s - %s" % (entry.title, self._html_to_text(entry.description))
yield StreamItem(feed_data.feed.title, time, text, entry.link)



class TextExcerpter:
# Clips the text to the position succeeding the first whitespace string
def _clip_left(self, text):
return re.sub("^\S*\s*", "", text, 1)


# Clips the text to the position preceding the last whitespace string
def _clip_right(self, text):
return re.sub("\s*\S*$", "", text, 1)


# Returns a portion of text at most max_length in length
# and containing the first match of pattern, if specified
def get_excerpt(self, text, pattern=None, max_length=300):
if len(text) <= max_length:
return text, False, False

if pattern is None:
return self._clip_right(text[0:max_length]), False, True
else:
match = pattern.search(text)
start, end = match.span()
match_text = match.group()
remaining_length = max_length - len(match_text)
            if remaining_length <= 0:
                # The match itself is never clipped, even if it exceeds max_length;
                # still return the (text, clipped_left, clipped_right) tuple callers expect
                return match_text, start > 0, end < len(text)

excerpt_start = max(start - (remaining_length // 2), 0)
excerpt_end = min(end + (remaining_length - (start - excerpt_start)), len(text))
# Adjust start of excerpt in case the string after the match was too short
excerpt_start = max(excerpt_end - max_length, 0)
excerpt = text[excerpt_start:excerpt_end]
if excerpt_start > 0:
excerpt = self._clip_left(excerpt)
if excerpt_end < len(text):
excerpt = self._clip_right(excerpt)

return excerpt, excerpt_start > 0, excerpt_end < len(text)



class Application:
_known_hashes = set()


def __init__(self, args):
self.args = args


def _print_error(self, error):
print("")
print(Terminal().bright_red(error))


def _get_stream_items(self, url):
try:
data = urlopen(url).read()
except Exception as error:
self._print_error("Unable to retrieve data from URL '%s': %s" % (url, str(error)))
# The problem might be temporary, so we do not exit
return list()

parser = StreamParser()
if "//twitter.com/" in url:
return parser.get_tweets(data)
else:
return parser.get_feed_items(data)


def _read_file(self, filename):
try:
with open(filename, "r") as myfile:
lines = [line.strip() for line in myfile.readlines()]
except Exception as error:
self._print_error("Unable to read file '%s': %s" % (filename, str(error)))
sys.exit(1)

# Discard empty lines and comments
return [line for line in lines if line and not line.startswith("#")]


def _print_stream_item(self, item, pattern=None):
print("")

term = Terminal()
time_label = "%s at %s" % (term.yellow(item.time.strftime("%a, %d %b %Y")),
term.yellow(item.time.strftime("%H:%M")))
print("%s on %s:" % (term.bright_cyan(item.source), time_label))

excerpter = TextExcerpter()
excerpt, clipped_left, clipped_right = excerpter.get_excerpt(item.text, pattern)

# Hashtag or mention
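        # ((?<!\w) keeps the "@" in e.g. "user@example.com" from being treated as a mention)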
excerpt = re.sub("(?<!\w)([#@])(\w+)",
term.green("\\g<1>") + term.bright_green("\\g<2>") + term.bright_white,
excerpt)
# URL in one of the forms commonly encountered on the web
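        # (the conditional group (?(1)|/) requires a "/" after the domain unless
        # a scheme such as "http://" was matched by group 1)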
excerpt = re.sub("(\w+://)?[\w.-]+\.[a-zA-Z]{2,4}(?(1)|/)[\w#?&=%/:.-]*",
term.bright_magenta_underline("\\g<0>") + term.bright_white,
excerpt)

if pattern is not None:
# TODO: This can break previously applied highlighting (e.g. URLs)
excerpt = pattern.sub(term.black_on_bright_yellow("\\g<0>") + term.bright_white,
excerpt)

print(" %s%s%s" % ("... " if clipped_left else "",
term.bright_white(excerpt),
" ..." if clipped_right else ""))
print(" %s" % term.bright_blue_underline(item.link))


def update(self):
# Reload sources and filters to allow for live editing
sources = list()
if self.args.sources is not None:
sources.extend(self.args.sources)
if self.args.sources_file is not None:
sources.extend(self._read_file(self.args.sources_file))
if not sources:
self._print_error("No source specifications found")
sys.exit(1)

filters = list()
if self.args.filters is not None:
filters.extend(self.args.filters)
if self.args.filters_file is not None:
filters.extend(self._read_file(self.args.filters_file))

patterns = list()
for filter_string in filters:
try:
patterns.append(re.compile(filter_string, re.IGNORECASE))
except Exception as error:
self._print_error("Error while compiling regular expression '%s': %s" %
(filter_string, str(error)))
sys.exit(1)

items = list()
def add_item(item, pattern=None):
# Note that item.time is excluded from duplicate detection
# as it sometimes changes without affecting the content
hash_code = hashlib.md5((item.source + item.text + item.link)
.encode("utf-8")).hexdigest()
if hash_code in self._known_hashes:
# Do not print an item more than once
return
self._known_hashes.add(hash_code)
items.append((item, pattern))

for source in sources:
for item in self._get_stream_items(source):
if patterns:
for pattern in patterns:
if pattern.search(item.text):
add_item(item, pattern)
break
else:
# No filter patterns specified; simply print all items
add_item(item)

# Print latest news last
items.sort(key=lambda item: item[0].time)

for item in items:
self._print_stream_item(item[0], item[1])


def run(self):
term = Terminal()
print("%s (%s)" % (term.bold("krill 0.1.0"),
term.underline("https://github.com/p-e-w/krill")))

while True:
try:
self.update()
if self.args.update_interval <= 0:
break
sleep(self.args.update_interval)
except KeyboardInterrupt:
# Do not print stacktrace if user exits with Ctrl+C
sys.exit()



def main():
# Force UTF-8 encoding for stdout as we will be printing Unicode characters
# which will fail with a UnicodeEncodeError if the encoding is not set,
# e.g. because stdout is being piped.
# See http://www.macfreek.nl/memory/Encoding_of_Python_stdout and
# http://stackoverflow.com/a/4546129 for extensive discussions of the issue.
if sys.stdout.encoding != "UTF-8":
# For Python 2 and 3 compatibility
prev_stdout = sys.stdout if sys.version_info[0] < 3 else sys.stdout.buffer
sys.stdout = codecs.getwriter("utf-8")(prev_stdout)

arg_parser = argparse.ArgumentParser(prog="krill", description="Read and filter web feeds.")
arg_parser.add_argument("-s", "--sources", nargs="+",
help="URLs to pull data from", metavar="URL")
arg_parser.add_argument("-S", "--sources-file",
help="file from which to load source URLs", metavar="FILE")
arg_parser.add_argument("-f", "--filters", nargs="+",
help="patterns used to select feed items to print", metavar="REGEX")
arg_parser.add_argument("-F", "--filters-file",
help="file from which to load filter patterns", metavar="FILE")
arg_parser.add_argument("-u", "--update-interval", default=300, type=int,
help="time between successive feed updates " +
"(default: 300 seconds, 0 for single pull only)", metavar="SECONDS")
args = arg_parser.parse_args()

if args.sources is None and args.sources_file is None:
arg_parser.error("either a source URL (-s) or a sources file (-S) must be given")

Application(args).run()



if __name__ == "__main__":
main()
Binary file added screenshot.png
