Published: Mon, 28 July 2008, 15:04, also tagged: technology, news, development, internet, search, statistics, graphs, speeple news, historical data
Graphs are now displayed for certain searches, with results grouped by day or by month. The graphs show activity over a period of 100 days or 100 months.
The Speeple News graphs help outline when keywords were popular in the index, normally by showing a spike. This is most visible during sporting and seasonal events along with major world issues.
Published: Tue, 1 July 2008, 16:43, also tagged: technology, rss, news, development, internet, programming, statistics, syndication, speeple news
I’ve put together a source for Speeple News Statistics. The page provides overall statistics of the Speeple News service, including health statistics such as crawl rate and top sources grouped by domain and individual feeds.
The stats cover news item totals, feed count, feed types and versions, and content languages. The page is updated every 30 minutes.
Published: Fri, 27 June 2008, 10:28, also tagged: technology, rss, news, development, internet, programming, xml, php, http, java, syndication, speeple news, newsbot, etag, last modified, if modified since, if none match
To further improve the performance of the Speeple News “NewsBot” I have implemented support for the ETag and Last-Modified HTTP headers. NewsBot now sends conditional requests, so if a feed hasn’t changed since the last time NewsBot accessed it, the server can answer with headers only (a 304 Not Modified response) rather than the full body content.
This not only improves the efficiency of fetching content for Speeple News, it also benefits webmasters and site owners because less bandwidth is used.
Initial statistics show that supporting HTTP ETag / Last-Modified headers, along with handling gzip-encoded content, has reduced bandwidth costs by over 60%.
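For anyone curious how a conditional request looks in practice, here is a minimal Python sketch. It is not the NewsBot code; the function names and the shape of the returned dictionary are made up for illustration. It assumes the crawler has stored the ETag and Last-Modified values from the previous fetch of each feed.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build request headers from validators saved on the previous fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None):
    """Fetch a feed, skipping the body when the server reports no change.

    Hypothetical helper: returns a dict with a 'changed' flag, the body
    (or None), and the validators to store for the next request.
    """
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return {
                "changed": True,
                "body": response.read(),
                "etag": response.headers.get("ETag"),
                "last_modified": response.headers.get("Last-Modified"),
            }
    except urllib.error.HTTPError as err:
        # urllib reports 304 Not Modified as an HTTPError; it has no body.
        if err.code == 304:
            return {
                "changed": False,
                "body": None,
                "etag": etag,
                "last_modified": last_modified,
            }
        raise
```

On the first visit to a feed no validators are known, so the request goes out without conditional headers and the full body is downloaded as usual.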
Published: Thu, 26 June 2008, 07:53, also tagged: technology, rss, news, development, internet, programming, xml, java, syndication, gzip, speeple news, newsbot
The Speeple News “NewsBot” has been updated to support content compressed with gzip. I should have supported gzip HTTP content encoding all along, but recent bandwidth logs on the server brought it to my immediate attention. Averaging 80 GB per day for 80 thousand XML news feeds just isn’t an economical use of bandwidth.
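As a rough illustration of gzip handling (a sketch only, not the NewsBot implementation; `decode_body` and `REQUEST_HEADERS` are names invented for this example), the crawler advertises gzip in the request and decompresses the body only when the server says it was compressed:

```python
import gzip

# Sent with each request to tell the server gzip responses are accepted.
REQUEST_HEADERS = {"Accept-Encoding": "gzip"}


def decode_body(raw_body, content_encoding):
    """Return the feed bytes, decompressing when Content-Encoding is gzip."""
    if content_encoding and content_encoding.strip().lower() == "gzip":
        return gzip.decompress(raw_body)
    # No encoding header (or an unrecognised one): pass the bytes through.
    return raw_body
```

XML feeds are repetitive text, so they tend to compress well, which is where the bandwidth saving comes from.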
The next step in improving the economy of the Speeple News “NewsBot” is to give each feed a score based on its update frequency, so that feeds which rarely update are downloaded less often.
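The scoring scheme isn't designed yet, so the following is only one possible sketch. It assumes the score is the fraction of past fetches that found new content, and that the polling interval is stretched in inverse proportion to that score. The function names and the 30-minute and 24-hour limits are assumptions for the example, not settled values.

```python
def update_score(changed_fetches, total_fetches):
    """Fraction of past fetches that returned new content (0.0 to 1.0)."""
    if total_fetches == 0:
        # No history yet: treat the feed as active until proven otherwise.
        return 1.0
    return changed_fetches / total_fetches


def poll_interval_minutes(score, min_interval=30, max_interval=1440):
    """Map a score to a polling interval: lower score, longer wait."""
    if score <= 0:
        return max_interval
    interval = round(min_interval / score)
    # Keep the interval within the assumed lower and upper limits.
    return max(min_interval, min(max_interval, interval))
```

With these assumed limits, a feed that changes on every fetch is polled every 30 minutes, one that changes on half of its fetches every 60 minutes, and one that never changes once a day.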
In conclusion, I am hoping that a mixture of gzip support, a score for feed update frequency and some “If-None-Match” (ETag) and “If-Modified-Since” support thrown in will produce a very efficient news crawler.
Published: Wed, 18 June 2008, 14:55, also tagged: technology, news, development, internet, search, speeple news
The news service provided by Speeple has been indexing content for over a year and a half now, and in this post I will outline some basic statistics.
News in 50+ languages, top 5 languages:
The news crawler retrieves 8000+ news items per hour, taking 0.5 hours to process the full feed list.