
Speeple Core

Tagged Syndication


  1. Resource: Speeple News Statistics Page

    I’ve put together a statistics page for Speeple News. The page provides overall statistics for the Speeple News service, including health statistics such as the crawl rate, and top sources grouped by domain and by individual feed.

    The stats cover news item totals, feed counts, feed types and type versions, and content languages. The page is updated every 30 minutes.

  2. Speeple NewsBot: ETag (If-None-Match) and Last-Modified (If-Modified-Since) Implemented

    To further improve the performance of the Speeple News “NewsBot” I have implemented support for the ETag and Last-Modified HTTP headers. NewsBot now sends the stored values back as If-None-Match and If-Modified-Since, so if the feed hasn’t changed since the last time NewsBot accessed it, the server can reply with headers only (a 304 Not Modified response) rather than the full body content.

    This not only improves the efficiency of fetching content for Speeple News, it also benefits webmasters and site owners because less bandwidth is used.

    Initial statistics show that supporting HTTP ETag / Last-Modified headers, along with handling gzip-encoded content, has reduced bandwidth costs by over 60%.
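
    As a rough illustration of the conditional request described above, here is a minimal sketch using the JDK's java.net.http client. It is not NewsBot's actual code: the class name ConditionalFetch, its method names and the idea of passing the cached validators in as plain strings are all assumptions made for the example.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Optional;

/** Hypothetical sketch of a conditional GET; not NewsBot's real implementation. */
class ConditionalFetch {

    /**
     * Builds a GET request that carries the validators remembered from the
     * previous fetch. Either value may be null if the server never sent it.
     */
    static HttpRequest buildRequest(URI feedUri, String cachedEtag, String cachedLastModified) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(feedUri).GET();
        if (cachedEtag != null) {
            builder.header("If-None-Match", cachedEtag);
        }
        if (cachedLastModified != null) {
            builder.header("If-Modified-Since", cachedLastModified);
        }
        return builder.build();
    }

    /** Returns the feed body, or empty when the server answers 304 Not Modified. */
    static Optional<String> fetchIfChanged(HttpClient client, URI feedUri,
                                           String cachedEtag, String cachedLastModified)
            throws IOException, InterruptedException {
        HttpRequest request = buildRequest(feedUri, cachedEtag, cachedLastModified);
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() == 304) {
            return Optional.empty(); // unchanged: the server sent no body
        }
        return Optional.of(response.body());
    }
}
```

    After a successful fetch, the crawler would store the response's ETag and Last-Modified header values so they can be sent back on the next visit.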


  3. Speeple NewsBot Update

    The Speeple News “NewsBot” has been updated to support content compressed with the gzip compression algorithm. I should have supported gzip HTTP content encoding all along, but my recent bandwidth logs on the server have brought it to my immediate attention. Averaging 80 GB per day for 80 thousand XML news feeds just isn’t an economical use of bandwidth.
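
    For illustration, gzip handling on the client side usually means sending an “Accept-Encoding: gzip” request header and unwrapping the body when the response says “Content-Encoding: gzip”. The sketch below shows only the unwrapping step. The class GzipBody and its methods are hypothetical names for this example, and the gzip helper exists purely so the round trip can be tried without a server.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch of decoding a gzip-encoded feed body. */
class GzipBody {

    /** Decodes the raw response bytes, unwrapping gzip when the header says so. */
    static String decodeToString(byte[] body, String contentEncoding) {
        try {
            InputStream in = new ByteArrayInputStream(body);
            if (contentEncoding != null && contentEncoding.trim().equalsIgnoreCase("gzip")) {
                in = new GZIPInputStream(in);
            }
            // Assumes UTF-8 for simplicity; a real feed parser would honour the XML encoding.
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Helper for experiments: compresses text the way a server would. */
    static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                gz.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```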

    The next step in improving the economy of the Speeple News “NewsBot” is to give each feed a score based on how frequently it updates, so that feeds which rarely update are downloaded less often.

    In conclusion, I am hoping that a mixture of gzip support, a score for feed update frequency and “If-None-Match” (ETag) and “If-Modified-Since” support will produce a very efficient news crawler.
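
    One possible way to turn update frequency into a schedule is sketched below: poll a feed roughly as often as it has been updating, within fixed limits. This is only an assumption about how such a score could work. The class FeedSchedule, the 30 minute floor and the 24 hour ceiling are invented for the example and are not taken from NewsBot.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

/** Hypothetical sketch: derive a poll interval from a feed's observed update times. */
class FeedSchedule {
    // Assumed limits, chosen only for illustration.
    static final Duration MIN_INTERVAL = Duration.ofMinutes(30);
    static final Duration MAX_INTERVAL = Duration.ofHours(24);

    /**
     * Returns the average gap between updates, clamped to the limits above.
     * updateTimes must be sorted oldest first.
     */
    static Duration pollInterval(List<Instant> updateTimes) {
        if (updateTimes.size() < 2) {
            return MAX_INTERVAL; // not enough history: poll rarely
        }
        Instant first = updateTimes.get(0);
        Instant last = updateTimes.get(updateTimes.size() - 1);
        Duration average = Duration.between(first, last).dividedBy(updateTimes.size() - 1);
        if (average.compareTo(MIN_INTERVAL) < 0) {
            return MIN_INTERVAL;
        }
        if (average.compareTo(MAX_INTERVAL) > 0) {
            return MAX_INTERVAL;
        }
        return average;
    }
}
```

    With this sketch, a feed that updated hourly would be polled about once an hour, while a feed that updated once in ten days would fall back to the daily ceiling.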

  4. Speeple NewsBot 2.0

    Version 2.0 of the Java-based news crawler (Speeple NewsBot) for Speeple News is officially in action.

    The primary reason for the code rewrite was to increase the number of feeds crawled per hour. This requirement has been fulfilled, with the news crawler now processing 80K news feeds within the hour:

    0.56 hours taken to crawl 84367 feeds. 100162 items today, 9369.2 items per hour (224860.91 projected, progress: 44.54%)
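
    The figures in that log line can be checked with a little arithmetic. The sketch below assumes the projection is simply the hourly item rate multiplied by 24 and that progress is today's items divided by the projection; both are my reading of the output, and the class CrawlStats is a hypothetical name.

```java
/** Hypothetical sketch reproducing the arithmetic behind the log line. */
class CrawlStats {

    /** Feeds crawled per hour, e.g. 84367 feeds in 0.56 hours. */
    static double feedsPerHour(long feeds, double hours) {
        return feeds / hours;
    }

    /** Assumed projection: the hourly item rate sustained for 24 hours. */
    static double projectedDailyItems(double itemsPerHour) {
        return itemsPerHour * 24;
    }

    /** Share of the projected daily total already collected, in percent. */
    static double progressPercent(long itemsToday, double projectedItems) {
        return 100.0 * itemsToday / projectedItems;
    }
}
```

    Under those assumptions, 84367 feeds in 0.56 hours works out to roughly 150,000 feeds per hour, 9369.2 items per hour projects to about 224,860 items per day, and 100162 items is about 44.5% of that projection, which is in line with the logged values.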


  5. Speeple News Update: Multithreaded News Bot

    Before my visit to Russia I was working primarily on the Speeple control panel and blogging services, but because of my limited internet access I worked on a new version of the news bot for Speeple News.

    The new and vastly improved version is multithreaded and makes use of the ROME RSS/Atom syndication and publishing tools library rather than my own RSS/Atom parser. I can't be entirely sure of the performance difference; the main reason for using the ROME library was to get development rolling along quickly.

    The new bot performs amazingly well in comparison to the old variant. It manages to crawl 50K news feeds within the hour on a server with 8 CPU cores (using 4 threads per core).
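
    To give an idea of the threading setup, here is a minimal sketch of a fixed thread pool sized at 4 threads per core, which gives 32 threads on the 8 core server. It is not the bot's actual code: the class CrawlPool is hypothetical, and the fetchOne function stands in for whatever downloads and parses a single feed (for example with ROME).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical sketch of crawling feeds on a fixed-size thread pool. */
class CrawlPool {
    static final int THREADS_PER_CORE = 4;

    static int poolSize(int cores) {
        return cores * THREADS_PER_CORE;
    }

    /** Runs fetchOne for every URL and returns how many fetches reported success. */
    static int crawl(List<String> feedUrls, Function<String, Boolean> fetchOne, int cores) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize(cores));
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (String url : feedUrls) {
                Callable<Boolean> task = () -> fetchOne.apply(url);
                results.add(pool.submit(task));
            }
            int succeeded = 0;
            for (Future<Boolean> result : results) {
                try {
                    if (Boolean.TRUE.equals(result.get())) {
                        succeeded++;
                    }
                } catch (ExecutionException e) {
                    // A single failing feed should not stop the whole crawl.
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            return succeeded;
        } finally {
            pool.shutdown();
        }
    }
}
```

    Using several threads per core makes sense for a crawler because most of each thread's time is spent waiting on the network rather than using the CPU.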
