Posts

Featured Post

Neo4j Overview

Neo4j is an open-source graph database implemented in Java. The developers describe Neo4j as an "embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs rather than in tables". Neo4j is one of the most popular graph databases.

Data Structure: In Neo4j, everything is stored in the form of nodes and relationships. Each node and relationship can have any number of attributes. Nodes can carry labels, and relationships carry a type that plays a similar role. Labelling is useful because it lets you narrow down the search area. Neo4j supports node indexing.

What is Neo4j? Neo4j is an open-source graph database supported by Neo Technology. It stores data as nodes and relationships, both of which can have properties; relationships connect nodes and are written with a direction (-> or <-) or without one (-).

Features: intuitive, using a graph model for data representation; reliable, with full ACID transactions; durable and fast, using a custom disk-based, native storage engine; massively scalable, up to sev...
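The node/relationship model described in this excerpt can be pictured with a small in-memory sketch. This is not Neo4j's API or storage engine; the class and function names (`Node`, `Relationship`, `PropertyGraph`, `build_example`) and the sample data are hypothetical and exist only to illustrate labelled nodes, typed and directed relationships, and properties on both.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    labels: set = field(default_factory=set)        # e.g. {"Person"}
    properties: dict = field(default_factory=dict)  # arbitrary attributes


@dataclass
class Relationship:
    start: int     # id of the node the relationship leaves
    end: int       # id of the node the relationship points to
    rel_type: str  # e.g. "KNOWS"
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Toy property graph (hypothetical, for illustration only)."""

    def __init__(self):
        self.nodes = {}
        self.relationships = []

    def add_node(self, node_id, labels=(), **properties):
        node = Node(node_id, set(labels), dict(properties))
        self.nodes[node_id] = node
        return node

    def add_relationship(self, start, end, rel_type, **properties):
        if start not in self.nodes or end not in self.nodes:
            raise KeyError("both endpoints must exist before linking them")
        rel = Relationship(start, end, rel_type, dict(properties))
        self.relationships.append(rel)
        return rel

    def nodes_with_label(self, label):
        # Labels narrow the search to one group of nodes.
        return [n for n in self.nodes.values() if label in n.labels]

    def neighbours(self, node_id, rel_type=None):
        # Follow outgoing relationships, optionally filtered by type.
        return [
            self.nodes[r.end]
            for r in self.relationships
            if r.start == node_id and (rel_type is None or r.rel_type == rel_type)
        ]


def build_example():
    # Made-up sample data.
    g = PropertyGraph()
    g.add_node(1, ["Person"], name="Alice")
    g.add_node(2, ["Person"], name="Bob")
    g.add_node(3, ["Movie"], title="Example Film")
    g.add_relationship(1, 2, "KNOWS", since=2010)
    g.add_relationship(1, 3, "WATCHED", rating=5)
    return g
```

Calling `build_example().nodes_with_label("Person")` returns only the two person nodes, which is the "narrowing" effect of labels mentioned in the excerpt.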

Pig UDF

Pig provides extensive support for user defined functions (UDFs) as a way to specify custom processing. Pig UDFs can currently be implemented in five languages: Java, Python, JavaScript, Ruby and Groovy. The most extensive support is provided for Java functions. You can customize all parts of the processing, including data load/store, column transformation, and aggregation. Java functions are also more efficient because they are implemented in the same language as Pig and because additional interfaces are supported, such as the Algebraic Interface and the Accumulator Interface.

Limited support is provided for Python, JavaScript, Ruby and Groovy functions. These functions are new, still-evolving additions to the system. Currently only the basic interface is supported; load/store functions are not supported. Furthermore, JavaScript, Ruby and Groovy are provided as experimental features because they did not go through the same amount of testing as Java or Python. At runtime note t...
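As a rough illustration of the "basic interface" for non-Java UDFs, here is a minimal sketch of a Python column-transformation function. The function name `to_upper` and the schema string are made up for this example. The `outputSchema` decorator is normally supplied by Pig (injected for Jython scripts, or importable from `pig_util` for streaming Python); the fallback below is an assumption added only so the file also runs outside Pig.

```python
try:
    outputSchema  # already injected when Pig runs the script under Jython
except NameError:
    try:
        from pig_util import outputSchema  # streaming Python UDFs
    except ImportError:
        # Assumed stand-in so the module can be run and tested without Pig.
        def outputSchema(schema):
            def decorator(func):
                func.outputSchema = schema
                return func
            return decorator


@outputSchema("upper:chararray")
def to_upper(value):
    """Return the upper-cased string, passing nulls through unchanged."""
    if value is None:
        return None
    return value.upper()
```

Assuming the file is saved as `udfs.py` (a hypothetical name), it could be registered from a Pig script with something like `REGISTER 'udfs.py' USING jython AS myudfs;` and called as `myudfs.to_upper(name)`.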

Hive Architecture

Command line interface: The default and most common way of accessing Hive.
Hiveserver: Runs Hive as a server exposing a Thrift service, enabling access from a range of clients written in different languages.
HWI: The Hive web interface.
Shell: The command line interface. It allows interactive queries, much like a MySQL shell connected to a database. It also supports web and JDBC clients.

The driver, compiler and execution engine take HiveQL scripts and run them in the Hadoop environment.

Driver: The component that receives the queries. It implements the notion of session handles and provides execute and fetch APIs modeled on JDBC/ODBC interfaces.
Compiler: The component that parses the query, does semantic analysis on the different query blocks and query expressions, and eventually generates an execution plan with the help of the table and partition metadata looked up from the metastore.
Execution engine: The component which executes the execution plan cr...
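The hand-off between driver, compiler, metastore and execution engine can be sketched as a toy pipeline. None of this is Hive code: the classes, the one-form query grammar (`SELECT col FROM table`), the dictionary-shaped plan and the sample data are all hypothetical, chosen only to show which component does what.

```python
class Metastore:
    """Holds table metadata (here: just column names per table)."""

    def __init__(self, tables):
        self._tables = tables

    def columns(self, table):
        return self._tables[table]


class Compiler:
    """Parses a query, checks it against the metastore, emits a plan."""

    def __init__(self, metastore):
        self.metastore = metastore

    def compile(self, query):
        tokens = query.strip().rstrip(";").split()
        if len(tokens) != 4 or tokens[0].upper() != "SELECT" or tokens[2].upper() != "FROM":
            raise ValueError("toy grammar only accepts: SELECT col FROM table")
        column, table = tokens[1], tokens[3]
        if column not in self.metastore.columns(table):  # semantic check
            raise ValueError("unknown column: " + column)
        return {"scan": table, "project": column}


class ExecutionEngine:
    """Runs a plan against in-memory rows standing in for Hadoop jobs."""

    def __init__(self, data):
        self.data = data

    def run(self, plan):
        return [row[plan["project"]] for row in self.data[plan["scan"]]]


class Driver:
    """Receives queries and exposes execute/fetch, as in the description."""

    def __init__(self, compiler, engine):
        self.compiler = compiler
        self.engine = engine
        self._results = []

    def execute(self, query):
        plan = self.compiler.compile(query)
        self._results = self.engine.run(plan)

    def fetch(self):
        return list(self._results)


def build_driver():
    # Made-up table and rows.
    metastore = Metastore({"users": ["id", "name"]})
    data = {"users": [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]}
    return Driver(Compiler(metastore), ExecutionEngine(data))
```

A client would call `driver.execute("SELECT name FROM users;")` followed by `driver.fetch()`, mirroring the execute and fetch APIs the driver is said to provide.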

Pig Overview

Hive Vs Pig

Feature                          Hive               Pig
Language                         SQL-like           PigLatin
Schemas/Types                    Yes (explicit)     Yes (implicit)
Partitions                       Yes                No
Server                           Optional (Thrift)  No
User Defined Functions (UDF)     Yes (Java)         Yes (Java)
Custom Serializer/Deserializer   Yes                Yes
DFS Direct Access                Yes (implicit)     Yes (explicit)
Join/Order/Sort                  Yes                Yes
Shell                            Yes                Yes
Streaming                        Yes                Yes
Web Interface                    Yes                No
JDBC/ODBC                        Yes (limited)      No

Apache Pig and Hive are two projects that layer on top of Hadoop, and provide a higher-level language for using Hadoop's MapReduce library. Apache Pig provides a scripting language for...