Twitter says it is planning to open source Storm, its Hadoop-like real-time data processing tool. In a blog post Thursday, the microblogging network said it plans to release the Storm code on Sept. 19 at the Strange Loop event in St. Louis, Mo.
Read here.
Cassandra, CouchDB, MongoDB, Redis, Riak, Neo4J, and FlockDB reinvent the data store
Read here.
In this post, the author argues that NoSQL is a premature optimization and should be considered only when really needed. A relational database can handle the required load in most cases.
Read full post here
A super-computer architecture that crunches big data for banks, police, and spooks will soon be open sourced as a super-fast alternative to the Googlesque Hadoop.
LexisNexis Risk Solutions is opening up its High Performance Computing Cluster (HPCC), a system written in C++ that it claims is four times faster than Hadoop when running data-intensive queries on ordinary Linux servers.
Source
How do large-scale sites and applications remain SQL-based?
Source
Spotify is a music streaming service offering low-latency access to a library of over 8 million music tracks. Streaming is performed by a combination of client-server access and a peer-to-peer protocol. In this paper, we give an overview of the protocol and peer-to-peer architecture used and provide measurements of service performance and user behavior. The service currently has a user base of over 7 million and has been available in six European countries since October 2008.
Data collected indicates that the combination of the client-server and peer-to-peer paradigms can be applied to music streaming with good results. In particular, 8.8% of music data played comes from Spotify’s servers while the median playback latency is only 265 ms (including cached tracks). We also discuss the user access patterns observed and how the peer-to-peer network affects the access patterns as they reach the server.
Full Article
Read this interesting article here.
This post walks through the internal process of tuning a particular page on Stack Overflow.
** Source **
SQL is actually the name of a declarative query language, while more precisely this article concerns traditional relational database systems. Since it is common to talk about NoSQL as the opposite of relational database systems, we have taken the editorial liberty of using SQL as a synonym for relational database systems.
Read full article here.
R is an open source statistical programming language. The easiest way to think about it is the largest commercial competitor in the states is a company called SAS, and while it’s not a perfect analogy, one way to think about R is as an open source version of SAS. It’s not perfectly correct, but for people who have not heard of R, that’s one way to explain it.
It’s used to analyze data – any kind of data that exists. That’s really why R is becoming so popular.
** Source **
Consider, for instance, the makers of Stride Rite shoes, who imagined in the book that when you went to the shoe store, a wireless device would measure your foot and the way you walk. It would select your shoe size, produce your shoe from parts the store had on-site, and share data about your foot with supply chain vendors, non-competing retailers, health-care providers and medical researchers.
Read full article here…
An ebook meant to help people get familiar with MongoDB and answer some of the more common questions they have.
Read here.
Drizzle – a lightweight fork of Oracle’s MySQL database for cloud computing – has been released by open sourcers.
Drizzle tarball version 2011.03.13 has been released as a general availability (GA) version. It comes nearly three years after the project was announced by Brian Aker, one of MySQL’s key architects.
Drizzle aims to be different from MySQL, stripping out “unnecessary” features loved by enterprise and OEMs in the name of greater speed and simplicity and for reduced management overhead.
Drizzle has no stored procedures, triggers, or views – three staples of MySQL and other relational databases – and, in a blow to a large chunk of the computing and IT establishment, it doesn’t run on Microsoft’s Windows. Also, there’s no embedded server.
** Source **
A database optimized for a collection of PNG images. The idea is to split each PNG image into many blocks and store each block in the database. If several blocks are identical, that block is stored only once. A hash table makes the lookup for such blocks fast.
** Source **
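The block-deduplication idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the linked project's implementation: the `BlockStore` class, the 4-byte block size, and the use of SHA-256 digests as hash-table keys are all assumptions made for the example, and it splits raw bytes rather than decoded PNG pixel blocks.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical block size in bytes, kept tiny for the demo


class BlockStore:
    """Stores each distinct block once, keyed by its SHA-256 digest."""

    def __init__(self, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.blocks = {}  # digest -> block bytes (the hash table)
        self.images = {}  # image name -> ordered list of block digests

    def put(self, name, data):
        """Split data into fixed-size blocks and store only unseen blocks."""
        digests = []
        for start in range(0, len(data), self.block_size):
            block = data[start:start + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            # setdefault leaves an already stored identical block untouched.
            self.blocks.setdefault(digest, block)
            digests.append(digest)
        self.images[name] = digests

    def get(self, name):
        """Reassemble an image from its ordered list of block digests."""
        return b"".join(self.blocks[d] for d in self.images[name])


store = BlockStore()
store.put("a.png", b"AAAABBBBAAAA")  # placeholder bytes, not real PNG data
store.put("b.png", b"BBBBCCCC")
print(len(store.blocks))  # number of distinct blocks actually stored
```

The two sample inputs contain five blocks in total but only three distinct ones, so the store keeps three entries while both inputs can still be reassembled exactly.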
After HBase and Hypertable, another BigTable clone – Cloudata
As per their homepage, Cloudata has the following features.
** Basic data service
o Single row operation (get, put)
o Multi row operation (like, between, scanner)
o Data uploader (DirectUploader)
o MapReduce (TabletInputFormat)
o Simple Cloudata query and JDBC driver support
** Table Management
o split
o distribution
o compaction
** Utility
o Web based Monitor
o CLI Shell
** Failover
o Master failover
o TabletServer failover
** Change log Server
o Reliable fast appendable change log server
** Supported languages
o Java, RESTful API, Thrift
Park Kieun, CUBRID Cluster Architect, gives an overview of popular large scale database technologies.
* Massively Parallel Processing (MPP) or parallel DBMS – A system that parallelizes the query execution of a DBMS, and splits queries and allocates them to multiple DBMS nodes in order to process massive amounts of data concurrently.
Examples: eBay DW, Yahoo! Everest Architecture, Greenplum, AsterData
* Column-oriented database – A system that stores the values of the same field together as a column, as opposed to the conventional row method that stores them as individual records.
Examples: Vertica, Sybase IQ, MonetDB
* Streaming processing (ESP or CEP) – A system that processes a constant stream of data (or events), or a concept in which the content of a database is continuously changing over time.
Examples: Truviso
* Key-value storage (with MapReduce programming model) – A storage system that focuses on enhancing the performance when reading a single record by adopting the key-value data model, which is simpler than the relational data model.
Examples: NoSQL databases
** Source **
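To make the row versus column distinction in the list above concrete, here is a small sketch in Python. The sample records, the field names, and the `to_columns` helper are hypothetical and only illustrate the layout difference. Real column stores such as those named above add compression, indexing, and on-disk formats that are not shown here.

```python
# Row-oriented layout: all fields of one record are stored together.
rows = [
    {"id": 1, "country": "SE", "plays": 120},
    {"id": 2, "country": "US", "plays": 300},
    {"id": 3, "country": "SE", "plays": 80},
]


def to_columns(records):
    """Pivot row records into one list of values per field."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


# Column-oriented layout: all values of one field are stored together.
columns = to_columns(rows)

# Aggregating one field: the row layout walks every full record,
# while the column layout reads a single contiguous list.
total_from_rows = sum(record["plays"] for record in rows)
total_from_columns = sum(columns["plays"])
print(total_from_rows, total_from_columns)
```

Both totals are the same. The difference is how much data has to be touched to compute them, which is why column layouts tend to suit analytic queries that scan a few fields across many records.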
Here are the next ten things you should know about big data:
1. Big data means the amount of data you’re working with today will look trivial within five years.
2. Huge amounts of data will be kept longer and have way more value than today’s archived data.
3. Business people will covet a new breed of alpha geeks. You will need new skills around data science, new types of programming, more math and statistics skills and data hackers…lots of data hackers.
4. You are going to have to develop new techniques to access, secure, move, analyze, process, visualize and enhance data in near real time.
5. You will be minimizing data movement wherever possible by moving function to the data instead of data to function. You will be leveraging or inventing specialized capabilities to do certain types of processing (e.g. early recognition of images or content types) so you can do some processing close to the head.
6. The cloud will become the compute and storage platform for big data which will be populated by mobile devices and social networks.
7. Metadata management will become increasingly important.
8. You will have opportunities to separate data from applications and create new data products.
9. You will need orders of magnitude cheaper infrastructure that emphasizes bandwidth, not iops and data movement and efficient metadata management.
10. You will realize sooner or later that data and your ability to exploit it are going to change your business, social and personal life, permanently.
** Source **
The source code of a simple benchmark of NoSQL databases, covering both read/update and MapReduce performance, is available on GitHub.
The latest benchmark covers the latest versions of Cassandra (0.6.10), HBase (0.20.6), MongoDB (1.6.5), and Riak (0.14.0). The results are interesting.
Check here