Posts

Showing posts from November, 2013

MapReduce Overview

What is MapReduce? MapReduce is a programming model for processing huge amounts of data in a faster, parallel way. As its name suggests, it divides the work into Map and Reduce. A MapReduce job usually splits the input data-set into independent chunks. Map task: processes these chunks in a completely parallel manner (one node can process one or more chunks); the framework then sorts the outputs of the maps. Reduce task: takes the sorted map output as its input and produces the final result. Your business logic is written in the map task and the reduce task. Typically both the input and the output of the job are stored in a file system (not a database). The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks. MapReduce Overview 'MapReduce' is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are o...
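The map, sort/shuffle, and reduce phases described above can be sketched outside Hadoop with a minimal word-count example in plain Python. The function names (map_task, reduce_task) and the sample chunks are illustrative only, not Hadoop APIs:

```python
from itertools import groupby
from operator import itemgetter

def map_task(chunk):
    # Emit a (word, 1) pair for each word in this input chunk.
    return [(word, 1) for word in chunk.split()]

def reduce_task(word, counts):
    # Sum all counts for one key; this is the "business logic".
    return (word, sum(counts))

# The input data-set split into independent chunks, as the framework would do.
chunks = ["deer bear river", "car car river", "deer car bear"]

# Map phase: every chunk can be processed independently, in parallel.
mapped = [pair for chunk in chunks for pair in map_task(chunk)]

# Shuffle/sort phase: the framework sorts map output and groups it by key.
mapped.sort(key=itemgetter(0))

# Reduce phase: one call per key produces the final result.
result = dict(reduce_task(key, (count for _, count in group))
              for key, group in groupby(mapped, key=itemgetter(0)))
print(result)  # {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}
```

In real Hadoop the same logic lives in Mapper and Reducer classes, and the framework runs many map tasks concurrently across the cluster instead of a single list comprehension.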

Dancing with Sqoop

What is Sqoop? Sqoop is a command-line interface application for transferring data between relational databases and Hadoop; in other words, it imports/exports data between an RDBMS and Hadoop (HDFS). You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data with Hadoop MapReduce, and then export the data back into an RDBMS.
Step 1: Consider I have a table in MySQL (emp). First create the database:
mysql> create database <database name>;
mysql> use <database>;
Step 2: Now we need to grant permissions on our created database as follows:
mysql> grant all privileges on *.* to '<database username>'@'%' identified by '<database password>';
mysql> flush privileges;
Step 3: Now we need to create one table and insert values into it:
mysql> create table emp(id int, name varchar(20), sal float);
mysql> insert ...
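What a Sqoop import does can be roughly illustrated in plain Python: select every row of a table and render it as comma-separated text, the way Sqoop writes delimited part files into HDFS. This sketch uses an in-memory SQLite table standing in for the MySQL emp table, and the sample rows are invented for illustration; it is an analogy, not Sqoop's actual implementation:

```python
import sqlite3

# Stand-in for the MySQL emp table created in the steps above.
conn = sqlite3.connect(":memory:")
conn.execute("create table emp(id int, name varchar(20), sal float)")
conn.executemany("insert into emp values (?, ?, ?)",
                 [(1, "ravi", 30000.0), (2, "sita", 45000.0)])

def import_table(conn, table):
    # Sqoop-style import: read every row over a database connection and
    # turn it into comma-separated lines of text.
    rows = conn.execute(f"select * from {table}")
    return [",".join(str(col) for col in row) for row in rows]

lines = import_table(conn, "emp")
print(lines)  # ['1,ravi,30000.0', '2,sita,45000.0']
```

The real tool does this at scale: each parallel map task reads a slice of the table over JDBC and writes its own part file into an HDFS directory.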

The Hadoop Distributed File System

Introduction HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes) and provide high-throughput access to this information. Files are stored in a redundant fashion across multiple machines to ensure their durability in the face of failure and their high availability to highly parallel applications. This module introduces the design of this distributed file system and instructions on how to operate it. A distributed file system is designed to hold a large amount of data and provide access to this data to many clients distributed across a network.  How to solve the traditional system's problems by using Big Data. Traditional system problem: data is too big to store in one computer. Today's big data is 'too big' to store in ONE single computer -- no matter how powerful it is and how much storage it has. This rules out a lot of storage systems and databases that were built for single machines. So we are ...
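The two ideas above -- splitting a file into fixed-size blocks and storing each block redundantly on several machines -- can be sketched in a few lines of Python. The block size, node names, and round-robin placement here are toy assumptions (HDFS uses much larger blocks, typically 64 or 128 MB, three replicas by default, and rack-aware placement):

```python
def split_into_blocks(data, block_size):
    # Break the file into fixed-size blocks, as HDFS does on write.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, nodes, replication=3):
    # Assign each block to `replication` distinct nodes. Round-robin is a
    # toy policy; real HDFS placement also considers racks and load.
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

data = b"x" * 1000          # a "file" of 1000 bytes
blocks = split_into_blocks(data, block_size=256)
nodes = ["node1", "node2", "node3", "node4"]
plan = place_replicas(len(blocks), nodes)

print(len(blocks))  # 4 blocks (256 + 256 + 256 + 232 bytes)
print(plan[0])      # ['node1', 'node2', 'node3']
```

Because every block lives on several machines, losing one node loses no data, and many clients can read different replicas of the same file in parallel.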

What is Unstructured Data

The phrase "unstructured data" usually refers to information that doesn't reside in a traditional row-column database. As you might expect, it's the opposite of structured data -- the data stored in fields in a database. Unstructured data files often include text and multimedia content. Examples include e-mail messages, word processing documents, videos, photos, audio files, presentations, webpages and many other kinds of business documents. Note that while these sorts of files may have an internal structure, they are still considered "unstructured" because the data they contain doesn't fit neatly in a database. Experts estimate that 80 to 90 percent of the data in any organization is unstructured. And the amount of unstructured data in enterprises is growing significantly -- often many times faster than structured databases are growing. Mining Unstructured Data Many organizations believe that their unstr...