This repository has been archived by the owner on Nov 11, 2018. It is now read-only.

Initial commit
p-e-w committed Jul 19, 2015 · commit 790aa56 (0 parents)
Showing 7 changed files with 448 additions and 0 deletions.
34 changes: 34 additions & 0 deletions .gitignore
@@ -0,0 +1,34 @@
# Based on .gitignore from https://github.com/pypa/sampleproject

# Backup files
*.~

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]

# C extensions
*.so

# Distribution / packaging
bin/
build/
develop-eggs/
dist/
eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Translations
*.mo
61 changes: 61 additions & 0 deletions README.md
@@ -0,0 +1,61 @@
# Want news? Just `krill` it!

[Krill](https://en.wikipedia.org/wiki/Krill) are [*filter feeders*](https://en.wikipedia.org/wiki/Filter_feeder). True to its namesake, `krill` :fried_shrimp: filters feeds. It is not picky about its diet, and will happily consume **RSS, Atom, CDF** and even **Twitter** :bird: feeds (*no credentials required!*). It aggregates feed items from all sources you specify, filters for those that interest you, and displays them as a **live stream** :fire: of clean, legible command line output.

![Screenshot](screenshot.png)

`krill` is beautifully minimal. `krill` is extremely easy to set up and use, and runs anywhere Python runs. `krill` is a refreshingly different way of consuming news :newspaper: and updates from anywhere on the web. **`krill` is the hacker's way of keeping up with the world.** :globe_with_meridians:


## Installation

`krill` requires [Python](https://www.python.org/) 2.7+/3.2+ :snake:. If you have the [pip](https://pip.pypa.io) package manager, all you need to do is run

```
pip install krill
```

either as a superuser or from a [virtualenv](https://virtualenv.pypa.io) environment.

Of course, you can also [download the script](krill/krill.py) directly from this repository, in which case you will need to install the dependencies [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/) (*what a library!* :star:), [feedparser](https://github.com/kurtmckee/feedparser) and [blessings](https://github.com/erikrose/blessings) manually.
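
If you install the script manually, its dependencies can typically be pulled in with pip as well (the names below are their usual PyPI package names):

```
pip install beautifulsoup4 feedparser blessings
```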


## Usage

### Command line

```
krill [-h] [-s URL [URL ...]] [-S FILE] [-f REGEX [REGEX ...]]
[-F FILE] [-u SECONDS]
-s URL [URL ...], --sources URL [URL ...]
URLs to pull data from
-S FILE, --sources-file FILE
file from which to load source URLs
-f REGEX [REGEX ...], --filters REGEX [REGEX ...]
patterns used to select feed items to print
-F FILE, --filters-file FILE
file from which to load filter patterns
-u SECONDS, --update-interval SECONDS
time between successive feed updates (default: 300
seconds, 0 for single pull only)
```

### Example

```
krill -s "https://twitter.com/nasa" -f "new ?horizons"
```

will follow NASA's :rocket: Twitter stream, printing only tweets that mention the [*New Horizons* probe](https://en.wikipedia.org/wiki/New_Horizons).

`krill` automatically determines whether to treat a web document as a Twitter or an XML feed. If multiple sources and/or filters are loaded from a file with the `-S` and `-F` options, each entry must be on a separate line. Empty lines and lines starting with `#` (comments) are ignored.

Inline and file specifications may be combined freely. If more than one filter is given, items matching *any* of the filters are printed. If no filter is given, all items are printed.
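
For example, a sources file (hypothetically named `sources.txt` here) could look like this:

```
# One source URL per line; empty lines and comments are ignored
https://twitter.com/nasa
https://example.com/feed.xml
```

Running `krill -S sources.txt -f "new ?horizons" "mars"` would then combine the file-based sources with two inline filter patterns.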


## License

Copyright &copy; 2015 Philipp Emanuel Weidmann (<pew@worldwidemann.com>)

Released under the terms of the [GNU General Public License, version 3](https://gnu.org/licenses/gpl.html)
Empty file added krill/__init__.py
300 changes: 300 additions & 0 deletions krill/krill.py
@@ -0,0 +1,300 @@
#!/usr/bin/env python

# krill - the hacker's way of keeping up with the world
#
# Copyright (c) 2015 Philipp Emanuel Weidmann <pew@worldwidemann.com>
#
# Nemo vir est qui mundum non reddat meliorem.
#
# Released under the terms of the GNU General Public License, version 3
# (https://gnu.org/licenses/gpl.html)


try:
# Python 3
from urllib.request import urlopen
except ImportError:
# Python 2
from urllib2 import urlopen

import re
import sys
import codecs
import hashlib
import argparse
from time import sleep, mktime
from datetime import datetime
from collections import namedtuple

import feedparser
from bs4 import BeautifulSoup
from blessings import Terminal



StreamItem = namedtuple("StreamItem", ["source", "time", "text", "link"])



class StreamParser:
def _html_to_text(self, html):
# Hack to prevent Beautiful Soup from collapsing space-keeping tags
# until no whitespace remains at all
html = re.sub("<(br|p)", " \\g<0>", html, flags=re.IGNORECASE)
text = BeautifulSoup(html, "html.parser").get_text()
# Idea from http://stackoverflow.com/a/1546251
return " ".join(text.strip().split())


def get_tweets(self, html):
document = BeautifulSoup(html, "html.parser")

for tweet in document.find_all("p", class_="tweet-text"):
header = tweet.find_previous("div", class_="stream-item-header")

name = header.find("strong", class_="fullname").string
username = header.find("span", class_="username").b.string

time_string = header.find("span", class_="_timestamp")["data-time"]
time = datetime.fromtimestamp(int(time_string))

# For Python 2 and 3 compatibility
to_unicode = unicode if sys.version_info[0] < 3 else str
# Remove ellipsis characters added by Twitter
text = self._html_to_text(to_unicode(tweet).replace(u"\u2026", ""))

link = "https://twitter.com%s" % header.find("a", class_="tweet-timestamp")["href"]

yield StreamItem("%s (@%s)" % (name, username), time, text, link)


def get_feed_items(self, xml):
feed_data = feedparser.parse(xml)

for entry in feed_data.entries:
time = datetime.fromtimestamp(mktime(entry.published_parsed))
text = "%s - %s" % (entry.title, self._html_to_text(entry.description))
yield StreamItem(feed_data.feed.title, time, text, entry.link)



class TextExcerpter:
# Clips the text to the position succeeding the first whitespace string
def _clip_left(self, text):
return re.sub("^\S*\s*", "", text, 1)


# Clips the text to the position preceding the last whitespace string
def _clip_right(self, text):
return re.sub("\s*\S*$", "", text, 1)


# Returns a portion of text at most max_length in length
# and containing the first match of pattern, if specified
def get_excerpt(self, text, pattern=None, max_length=300):
if len(text) <= max_length:
return text, False, False

if pattern is None:
return self._clip_right(text[0:max_length]), False, True
else:
match = pattern.search(text)
start, end = match.span()
match_text = match.group()
remaining_length = max_length - len(match_text)
            if remaining_length <= 0:
                # The match itself is never clipped, even if it exceeds max_length;
                # still return the (text, clipped_left, clipped_right) tuple callers expect
                return match_text, start > 0, end < len(text)

excerpt_start = max(start - (remaining_length // 2), 0)
excerpt_end = min(end + (remaining_length - (start - excerpt_start)), len(text))
# Adjust start of excerpt in case the string after the match was too short
excerpt_start = max(excerpt_end - max_length, 0)
excerpt = text[excerpt_start:excerpt_end]
if excerpt_start > 0:
excerpt = self._clip_left(excerpt)
if excerpt_end < len(text):
excerpt = self._clip_right(excerpt)

return excerpt, excerpt_start > 0, excerpt_end < len(text)



class Application:
_known_hashes = set()


def __init__(self, args):
self.args = args


def _print_error(self, error):
print("")
print(Terminal().bright_red(error))


def _get_stream_items(self, url):
try:
data = urlopen(url).read()
except Exception as error:
self._print_error("Unable to retrieve data from URL '%s': %s" % (url, str(error)))
# The problem might be temporary, so we do not exit
return list()

parser = StreamParser()
if "//twitter.com/" in url:
return parser.get_tweets(data)
else:
return parser.get_feed_items(data)


def _read_file(self, filename):
try:
with open(filename, "r") as myfile:
lines = [line.strip() for line in myfile.readlines()]
except Exception as error:
self._print_error("Unable to read file '%s': %s" % (filename, str(error)))
sys.exit(1)

# Discard empty lines and comments
return [line for line in lines if line and not line.startswith("#")]


def _print_stream_item(self, item, pattern=None):
print("")

term = Terminal()
time_label = "%s at %s" % (term.yellow(item.time.strftime("%a, %d %b %Y")),
term.yellow(item.time.strftime("%H:%M")))
print("%s on %s:" % (term.bright_cyan(item.source), time_label))

excerpter = TextExcerpter()
excerpt, clipped_left, clipped_right = excerpter.get_excerpt(item.text, pattern)

# Hashtag or mention
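        # ((?<!\w) keeps the "@" in e.g. "user@example.com" from being treated as a mention)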
excerpt = re.sub("(?<!\w)([#@])(\w+)",
term.green("\\g<1>") + term.bright_green("\\g<2>") + term.bright_white,
excerpt)
# URL in one of the forms commonly encountered on the web
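        # (the conditional group (?(1)|/) requires a "/" after the domain unless
        # a scheme such as "http://" was matched by group 1)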
excerpt = re.sub("(\w+://)?[\w.-]+\.[a-zA-Z]{2,4}(?(1)|/)[\w#?&=%/:.-]*",
term.bright_magenta_underline("\\g<0>") + term.bright_white,
excerpt)

if pattern is not None:
# TODO: This can break previously applied highlighting (e.g. URLs)
excerpt = pattern.sub(term.black_on_bright_yellow("\\g<0>") + term.bright_white,
excerpt)

print(" %s%s%s" % ("... " if clipped_left else "",
term.bright_white(excerpt),
" ..." if clipped_right else ""))
print(" %s" % term.bright_blue_underline(item.link))


def update(self):
# Reload sources and filters to allow for live editing
sources = list()
if self.args.sources is not None:
sources.extend(self.args.sources)
if self.args.sources_file is not None:
sources.extend(self._read_file(self.args.sources_file))
if not sources:
self._print_error("No source specifications found")
sys.exit(1)

filters = list()
if self.args.filters is not None:
filters.extend(self.args.filters)
if self.args.filters_file is not None:
filters.extend(self._read_file(self.args.filters_file))

patterns = list()
for filter_string in filters:
try:
patterns.append(re.compile(filter_string, re.IGNORECASE))
except Exception as error:
self._print_error("Error while compiling regular expression '%s': %s" %
(filter_string, str(error)))
sys.exit(1)

items = list()
def add_item(item, pattern=None):
# Note that item.time is excluded from duplicate detection
# as it sometimes changes without affecting the content
hash_code = hashlib.md5((item.source + item.text + item.link)
.encode("utf-8")).hexdigest()
if hash_code in self._known_hashes:
# Do not print an item more than once
return
self._known_hashes.add(hash_code)
items.append((item, pattern))

for source in sources:
for item in self._get_stream_items(source):
if patterns:
for pattern in patterns:
if pattern.search(item.text):
add_item(item, pattern)
break
else:
# No filter patterns specified; simply print all items
add_item(item)

# Print latest news last
items.sort(key=lambda item: item[0].time)

for item in items:
self._print_stream_item(item[0], item[1])


def run(self):
term = Terminal()
print("%s (%s)" % (term.bold("krill 0.1.0"),
term.underline("https://github.com/p-e-w/krill")))

while True:
try:
self.update()
if self.args.update_interval <= 0:
break
sleep(self.args.update_interval)
except KeyboardInterrupt:
# Do not print stacktrace if user exits with Ctrl+C
sys.exit()



def main():
# Force UTF-8 encoding for stdout as we will be printing Unicode characters
# which will fail with a UnicodeEncodeError if the encoding is not set,
# e.g. because stdout is being piped.
# See http://www.macfreek.nl/memory/Encoding_of_Python_stdout and
# http://stackoverflow.com/a/4546129 for extensive discussions of the issue.
if sys.stdout.encoding != "UTF-8":
# For Python 2 and 3 compatibility
prev_stdout = sys.stdout if sys.version_info[0] < 3 else sys.stdout.buffer
sys.stdout = codecs.getwriter("utf-8")(prev_stdout)

arg_parser = argparse.ArgumentParser(prog="krill", description="Read and filter web feeds.")
arg_parser.add_argument("-s", "--sources", nargs="+",
help="URLs to pull data from", metavar="URL")
arg_parser.add_argument("-S", "--sources-file",
help="file from which to load source URLs", metavar="FILE")
arg_parser.add_argument("-f", "--filters", nargs="+",
help="patterns used to select feed items to print", metavar="REGEX")
arg_parser.add_argument("-F", "--filters-file",
help="file from which to load filter patterns", metavar="FILE")
arg_parser.add_argument("-u", "--update-interval", default=300, type=int,
help="time between successive feed updates " +
"(default: 300 seconds, 0 for single pull only)", metavar="SECONDS")
args = arg_parser.parse_args()

if args.sources is None and args.sources_file is None:
arg_parser.error("either a source URL (-s) or a sources file (-S) must be given")

Application(args).run()



if __name__ == "__main__":
main()
Binary file added screenshot.png
