Integrating HDFS and PostgreSQL through Apache Spark.

Introduction The HDFS (Hadoop Distributed File System) and PostgreSQL databases are both powerful tools for data storage, querying, and analysis. Each has its own strengths that make it well suited to specific tasks. The HDFS, being distributed across several computing nodes, is robust and amenable to storing massive datasets, provided your computing infrastructure has the prerequisite width (the number of nodes in your cluster) and depth (the available memory on each individual node). The HDFS is optimized for batch processing of massive datasets, making it suitable for big data applications like data warehousing, log processing, and large-scale data analytics. In fact, Spark, the HDFS’ natural companion, has its own machine learning library, MLlib, which puts large-scale data analytics well within reach. ...

March 18, 2024 · 6 min · Naveen Kannan

Introduction to PXE boot servers.

Introduction What is a PXE boot server? The term PXE stands for Preboot Execution Environment. This environment consists of a server that serves multiple clients. The server hosts software images, and the clients boot these images by retrieving them from the server over the network. Essentially, this allows the clients to boot over the network instead of from physical media such as a CD-ROM or hard disk, provided they are PXE boot capable. This typically includes BIOS and UEFI PCs. ...

March 18, 2024 · 4 min · Naveen Kannan

Mamba implementation in Scientific Pipelines.

What is Mamba? Mamba is intended to be a drop-in replacement for, and reimplementation of, conda, written in C++. I have integrated Mamba into all of my pipelines, since it trivializes the package management process. I do almost all of my work within containers and virtual environments, and Mamba makes my work life so much easier. Previously, Conda was my package manager of choice, and I relied on it to cut down the time I would otherwise need to build an environment using pip as an installer. Once I discovered Mamba, however, I never went back to Conda. ...

November 18, 2023 · 9 min · Naveen Kannan

Moving Docker's Data directory to another location.

Introduction Docker is a container service that we have discussed previously in my blog posts. The default storage location for Docker is /var/lib/docker. Speaking from experience, as images and containers are built over time, especially when multiple users share the Docker service, the Docker storage directory can grow large enough to fill the root filesystem, causing out-of-space crises and significantly degrading overall system performance. ...

October 30, 2023 · 7 min · Naveen Kannan

Installing and configuring the HIVE metastore with a MySQL backend.

Hive Metastore Hive, a powerful data warehousing system built on top of Hadoop, relies on a component known as the metastore to efficiently manage metadata about the data stored within it. This metadata is crucial for organizing, querying, and processing data in Hive. In this blog post, we’ll explore the role of the Hive Metastore and the significance of selecting the right relational database management system (RDBMS) for it. Metadata Management The Hive Metastore serves as a central repository for storing and managing metadata related to datasets within Hive. This metadata includes essential information such as: ...

September 22, 2023 · 10 min · Naveen Kannan