Integrating HDFS and PostgreSQL through Apache Spark.

Introduction The HDFS (Hadoop Distributed File System) and PostgreSQL databases are both powerful tools for data storage, querying, and analysis. Each has its own unique strengths that make it well suited for specific tasks. The HDFS, being distributed across several computing nodes, is robust and amenable to storing massive datasets, provided your computing infrastructure has the requisite width (the number of nodes in your cluster) and depth (the available memory on each individual node). The HDFS is optimized for batch processing of massive datasets, making it suitable for big data applications like data warehousing, log processing, and large-scale data analytics. In fact, Spark, the HDFS’ natural companion, has its own machine learning library, MLlib, making large-scale data analytics very much possible. ...
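As a minimal sketch of the kind of integration this post describes, the snippet below reads a dataset from HDFS with Spark and writes it to PostgreSQL over JDBC. Every hostname, path, table name, and credential here is a placeholder, and the `jdbc_options` helper is a hypothetical convenience function, not something from the original post.

```python
# Hypothetical sketch: moving a dataset from HDFS into PostgreSQL with Spark.
# All connection details below are placeholders for illustration only.

def jdbc_options(host: str, port: int, db: str, user: str, password: str) -> dict:
    """Build the option map that Spark's JDBC writer expects."""
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{db}",
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",  # PostgreSQL JDBC driver class
    }

if __name__ == "__main__":
    # Requires a Spark installation and the PostgreSQL JDBC jar on the classpath.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-to-postgres").getOrCreate()

    # Read a CSV stored on HDFS (placeholder namenode address and path).
    df = spark.read.csv(
        "hdfs://namenode:9000/data/events.csv", header=True, inferSchema=True
    )

    # Append the DataFrame into a PostgreSQL table over JDBC.
    (df.write.format("jdbc")
        .options(**jdbc_options("dbhost", 5432, "analytics", "spark_user", "secret"))
        .option("dbtable", "events")
        .mode("append")
        .save())
```

The `spark.jars.packages` or `--jars` mechanism would normally supply the PostgreSQL driver; the exact setup depends on how the cluster from the post is configured.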

March 18, 2024 · 6 min · Naveen Kannan

Installing and configuring Hadoop and Spark on a 4-node cluster.

A brief introduction to Hadoop Apache Hadoop is an open-source software framework that allows for the distributed processing of large datasets across a computing cluster. Hadoop shines at processing massive volumes of data that are too big to fit on a single computing system. Hadoop’s Main Components Hadoop has several components integral to its functioning. These include: The HDFS (Hadoop Distributed File System) HDFS is a distributed file system that provides high-throughput access to application data, and is used to scale a single Apache Hadoop cluster to hundreds (and even thousands) of nodes. ...

August 21, 2023 · 7 min · Naveen Kannan