CSC/ECE 517 Spring 2014/ch1a 1o sr
This page covers the usage of Big Data with respect to Ruby on Rails.
Background
What is Big Data?
Big data refers to massive volumes of both structured and unstructured data that are so large they are difficult to process using traditional database and software techniques. In most enterprise scenarios the data is too big, moves too fast, or exceeds current processing capacity. <ref> http://www.webopedia.com/TERM/B/big_data.html </ref> The term may refer to the volumes of data themselves, as well as to the tools and techniques used to process, manage, and analyze that data.
Challenges
As the cost of storage decreases, it becomes trivial to store huge amounts of data, which leads to a much bigger challenge: determining relevance within those volumes of data and using analytics to create value from the relevant data.
The real problem is not acquiring huge amounts of data, but making sense of it so that useful deductions can be drawn. For example, if Google records each and every search query its users make, indexed by Google account (where signed in) and by IP address otherwise, the problem is not storage. The problem is predicting browsing habits, optimizing search results, creating profiles of Google users, letting those profiles evolve with additional data (but only relevant data), deciding which data should be considered relevant, and so on.
Big data is often received at very high speed. Consider the same example as above: not only is data pouring in at huge volumes, it also has to be structured, categorized, stored, and managed at a very high rate.
Another challenge is that there is no set structure for big data (why would there be? big data is just data, only in huge volumes). For example, an international car parts supplier can index and store the sale of parts by make, model, item number, manufacturing date, location of purchase, and so on. This is a very structured form of big data, but in the earlier example, where we only had a text string from which to derive all the variables, extracting variables from unstructured big data can be a challenge.
Another challenge of big data, as with all emerging technologies, is elastic scalability. If we are recording huge amounts of data, say page views for a news website or purchases on a shopping portal, the inflow will not be constant. There will be predictable occasions (Christmas for the shopping site) and unpredictable ones (any newsworthy outside event) that cause peaks and troughs in the inflow of data. Any system designed to store, manage, and analyze big data should account for these as well.
Example
eBay stores almost 90PB of data about customer transactions and behaviors to support some $3,500 of product sales every second.
Data is stored in three systems, with about 7.5PB in a Teradata enterprise data warehouse, 40PB on commodity Hadoop clusters and 40PB on ‘Singularity’: a custom system for performing deep-dive analysis on semi-structured and relational data.
As of May 2013, eBay had 500 million live auction listings, split into more than 50,000 categories. The site has more than 100 million active users, generating up to 100TB of new data each day to be stored and used by more than 6000 eBay staff.
The users of this data are, of course, the company's key personnel whose job is to make important business decisions. They range from expert data scientists to non-technical business people, including about 50 executives who need access to top-line reports.
eBay has three separate teams just to manage this data. One team looks after the technological details, while another decides which kinds of analytics are needed for the problems at hand. <ref> http://www.itnews.com.au/News/342615,inside-ebay8217s-90pb-data-warehouse.aspx#ixzz31bJMJXRX </ref>
Big Data Usage in Rails Applications
Ruby is a dynamically typed, interpreted programming language in the style of PHP or Python. Ruby on Rails is a framework that simplifies building typical web applications, particularly CRUD apps. Neither of them is in itself a "big data" or "data mining" framework. Instead, other tools do the back-end storage and analytics work, and the results are kept in databases that Ruby code can then access. Here we discuss RoR's compatibility with these different tools.
These tools can be broadly divided into two categories:
Tools for Storage
- Cassandra (http://cassandra.apache.org/): An open-source distributed DBMS designed to handle large amounts of data; a short Ruby usage sketch follows this list
- MongoDB (http://www.mongodb.org/): A distributed NoSQL database
- CouchDB (http://couchdb.apache.org/): Also a NoSQL database; it stores documents as JSON and uses JavaScript for queries
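To illustrate how a Ruby application can talk to one of these stores, below is a minimal sketch using the cassandra gem (listed in the gems section further down). It is only a sketch under assumptions: the keyspace ('Analytics'), the :Events column family, and the column names are made up for this example, and the exact connection details will vary with your Cassandra and gem versions.

<pre>
require 'cassandra'

# Connect to a hypothetical 'Analytics' keyspace on a local Cassandra node.
# The keyspace and the :Events column family are assumed to already exist.
client = Cassandra.new('Analytics', '127.0.0.1:9160')

# Write one event row keyed by user, as simple string columns.
client.insert(:Events, 'user:42', { 'page' => '/products/7', 'ts' => Time.now.to_i.to_s })

# Read the row back as a hash of column name => value.
row = client.get(:Events, 'user:42')
puts row.inspect
</pre>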
Tools for Data Mining or Analytics

- Teradata (http://www.teradata.com/about-us/#tabbable=0&tab1=1)
- Vertica (http://www.vertica.com/)
Using RoR to solve big data problems is a matter of identifying your particular use case and then deciding on the right storage and analytics tools for it. Please note that the above lists are not complete. For a novel web app, the right tool might exist with no Ruby or Rails support yet, in which case its libraries would need to be ported into a gem.
The RoR community's support for the above-mentioned tools is mature but constantly evolving, and gems are available for all of them.
- For MongoDB: mongomapper (http://github.com/jnunemaker/mongomapper); a minimal model sketch follows this list
- For CouchDB, used in a typical RoR RESTful way: couchrest (https://github.com/couchrest/couchrest)
- For Cassandra: cassandra (https://github.com/cassandra-rb/cassandra)
- For Teradata: activerecord-jdbcteradata-adapter, a driver that enables the use of ActiveRecord with Teradata (http://rubydoc.info/gems/activerecord-jdbcteradata-adapter/0.5.1/frames)
- For Vertica: the vertica gem (https://github.com/wvanbergen/vertica)
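As a rough idea of what a Rails model backed by MongoDB could look like, here is a minimal sketch using mongomapper. The model name (PageView), its keys, and the database name are hypothetical; in a real Rails app the connection setup normally lives in an initializer or config/mongo.yml rather than inline.

<pre>
require 'mongo_mapper'

# Point MongoMapper at a hypothetical database; the connection itself
# defaults to a local MongoDB server unless configured otherwise.
MongoMapper.database = 'bigdata_demo'

class PageView
  include MongoMapper::Document

  key :url,       String
  key :user_id,   Integer
  key :viewed_at, Time
end

# Record one page view, then count how many views a URL has received.
PageView.create(url: '/products/42', user_id: 7, viewed_at: Time.now)
puts PageView.where(url: '/products/42').count
</pre>

Heavier analytics (the Teradata or Vertica side) would typically run outside the Rails application, with Rails reading aggregated results back through the corresponding gem.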
Rails versus other frameworks for processing big data
Tools implemented in Python
Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain-specific language like R.
Combined with the excellent IPython toolkit and other libraries, the environment for doing data analysis in Python excels in performance, productivity, and the ability to collaborate.
pandas does not implement significant modeling functionality beyond linear and panel regression; this functionality is provided by statsmodels and scikit-learn. <ref> http://pandas.pydata.org/ </ref>
Other Python-based tools that cater to the needs of big data are Anaconda and Wakari. Anaconda is a free, enterprise-ready Python distribution for large-scale data processing, predictive analytics, and scientific computing; it bundles libraries that facilitate running Python scripts to manage and analyze big data. Wakari is a collaborative data analytics platform that includes tools to explore data, develop analytics scripts, collaborate with IPython notebooks, visualize, and share data analysis and findings. <ref> http://continuum.io/wakari </ref>
References
<references/>

- http://www.sas.com/en_us/insights/big-data/what-is-big-data.html