Integrating HDFS and PostgreSQL through Apache Spark.

Introduction The HDFS (Hadoop Distributed File System) and PostgreSQL databases are both powerful tools for data storage, querying, and analysis. Each has its own unique strengths that make it well suited for specific tasks. The HDFS, being distributed across several computing nodes, is robust and amenable to storing massive datasets, provided your computing infrastructure has the requisite width (the number of nodes in your cluster) and depth (the available memory on each individual node). The HDFS is optimized for batch processing of massive datasets, making it suitable for big data applications like data warehousing, log processing, and large-scale data analytics. In fact, Spark, the HDFS’ natural companion, has its own machine learning library, MLlib, making large-scale data analytics very much possible. ...
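As a minimal sketch of the kind of integration this post describes, the snippet below reads a dataset from HDFS with Spark and writes it to PostgreSQL over JDBC. Every hostname, path, table name, and credential here is a placeholder, and the `jdbc_options` helper is a hypothetical convenience function, not something from the original post.

```python
# Hypothetical sketch: moving a dataset from HDFS into PostgreSQL with Spark.
# All connection details below are placeholders for illustration only.

def jdbc_options(host: str, port: int, db: str, user: str, password: str) -> dict:
    """Build the option map that Spark's JDBC writer expects."""
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{db}",
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",  # PostgreSQL JDBC driver class
    }

if __name__ == "__main__":
    # Requires a Spark installation and the PostgreSQL JDBC jar on the classpath.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-to-postgres").getOrCreate()

    # Read a CSV stored on HDFS (placeholder namenode address and path).
    df = spark.read.csv(
        "hdfs://namenode:9000/data/events.csv", header=True, inferSchema=True
    )

    # Append the DataFrame into a PostgreSQL table over JDBC.
    (df.write.format("jdbc")
        .options(**jdbc_options("dbhost", 5432, "analytics", "spark_user", "secret"))
        .option("dbtable", "events")
        .mode("append")
        .save())
```

The `spark.jars.packages` or `--jars` mechanism would normally supply the PostgreSQL driver; the exact setup depends on how the cluster from the post is configured.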

March 18, 2024 · 6 min · Naveen Kannan

Installing and configuring Hadoop and Spark on a 4-node cluster.

A brief introduction to Hadoop Apache Hadoop is an open-source software framework that allows for the distributed processing of large datasets across a computing cluster. Hadoop shines at processing massive volumes of data that are too big to fit on a single computing system. Hadoop’s Main Components Hadoop has several components integral to its functioning. These include: The HDFS (Hadoop Distributed File System) HDFS is a distributed file system that provides high-throughput access to application data, and is used to scale a single Apache Hadoop cluster to hundreds (and even thousands) of nodes. ...

August 21, 2023 · 7 min · Naveen Kannan