Integrating HDFS and PostgreSQL through Apache Spark.

Introduction The HDFS (Hadoop Distributed File System) and PostgreSQL are both powerful tools for data storage, querying, and analysis. Each has its own unique strengths that make it well suited for specific tasks. The HDFS, being distributed across several computing nodes, is robust and amenable to storing massive datasets, provided your computing infrastructure has the requisite width (the number of nodes in your cluster) and depth (the available memory on each individual node). The HDFS is optimized for batch processing of massive datasets, making it suitable for big data applications like data warehousing, log processing, and large-scale data analytics. In fact, Spark, the HDFS' natural companion, has its own machine learning library, MLlib, making large-scale data analytics very much possible. ...
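As a rough illustration of the kind of integration the post covers, here is a minimal PySpark sketch that reads a Parquet dataset from HDFS and writes it to PostgreSQL over JDBC. The namenode address, JDBC URL, table name, credentials, and driver version are hypothetical placeholders, not values from the post.

```python
# Minimal sketch: read a dataset from HDFS and write it to PostgreSQL over JDBC.
# All hostnames, paths, table names, and credentials below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hdfs-to-postgres")
    # Assumes the PostgreSQL JDBC driver is made available to the driver and executors.
    .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")
    .getOrCreate()
)

# Read a Parquet dataset stored on HDFS (namenode host/port are placeholders).
df = spark.read.parquet("hdfs://namenode:9000/data/events.parquet")

# Write the DataFrame into a PostgreSQL table via JDBC.
(
    df.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/analytics")
    .option("dbtable", "public.events")
    .option("user", "spark_user")
    .option("password", "change-me")
    .option("driver", "org.postgresql.Driver")
    .mode("append")
    .save()
)

spark.stop()
```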

March 18, 2024 · 6 min · Naveen Kannan

Introduction to PXE boot servers.

Introduction What is a PXE boot server? The term PXE stands for Preboot Execution Environment. This environment consists of a server that serves multiple clients. The server hosts software images, and the clients it serves are able to boot these images by retrieving them from the server over the network. Essentially, this allows the clients to boot over the network instead of from physical media such as a CD-ROM or hard disk, provided they are PXE boot capable. This typically includes BIOS and UEFI PCs. ...

March 18, 2024 · 4 min · Naveen Kannan

Mamba implementation in Scientific Pipelines.

What is Mamba? Mamba is intended to be a drop-in replacement for, and reimplementation of, conda (written in C++). Mamba is something I have incorporated into all of my pipelines, since it trivializes the package management process. I do almost all of my work within containers and virtual environments, and Mamba makes my work life so much easier. Previously, I used Conda as my package manager of choice, relying on it to cut down the amount of time I would need to build an environment with pip as an installer. Once I discovered Mamba, however, I never went back to Conda. ...

November 18, 2023 · 9 min · Naveen Kannan

Moving Docker's data directory to another location.

Introduction Docker is a container service that I have discussed previously on this blog. The default storage location for Docker is /var/lib/docker. Speaking from experience, as images and containers are built up over time, especially when multiple users share the Docker service, the Docker storage directory can grow large enough to exhaust space on the root filesystem and significantly degrade overall system performance. ...
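For reference, one common way to relocate that directory is to point the daemon at a new path via the "data-root" key in /etc/docker/daemon.json, copy the existing contents across, and restart the daemon. A minimal sketch of just the config change follows; the destination path is hypothetical, and the snippet assumes root privileges and a stopped daemon.

```python
# Minimal sketch: point the Docker daemon at a new data directory by setting
# "data-root" in /etc/docker/daemon.json. Run as root with the daemon stopped;
# the existing /var/lib/docker contents still need to be copied to the new path.
import json
from pathlib import Path

daemon_config = Path("/etc/docker/daemon.json")
new_data_root = "/mnt/bigdisk/docker"  # hypothetical destination on a larger filesystem

# Load the existing config if present, otherwise start from an empty one.
config = json.loads(daemon_config.read_text()) if daemon_config.exists() else {}
config["data-root"] = new_data_root
daemon_config.write_text(json.dumps(config, indent=2) + "\n")
```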

October 30, 2023 · 7 min · Naveen Kannan

Installing and configuring the Hive metastore with a MySQL backend.

Hive Metastore Hive, a powerful data warehousing system built on top of Hadoop, relies on a component known as the metastore to efficiently manage metadata about the data stored within it. This metadata is crucial for organizing, querying, and processing data in Hive. In this blog post, we’ll explore the role of the Hive Metastore and the significance of selecting the right relational database management system (RDBMS) for it. Metadata Management The Hive Metastore serves as a central repository for storing and managing metadata related to datasets within Hive. This metadata includes essential information such as: ...
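To make the "central repository" idea concrete, here is a rough sketch of peeking at the metadata Hive keeps in a MySQL-backed metastore. The connection details are placeholders; DBS and TBLS are tables created by the standard metastore schema, holding one row per database and per table respectively.

```python
# Rough sketch: inspect the metadata Hive stores in its MySQL-backed metastore.
# Connection details are placeholders; DBS and TBLS come from the standard
# metastore schema (as initialized by Hive's schematool).
import pymysql

conn = pymysql.connect(
    host="metastore-db",   # hypothetical MySQL host
    user="hive",
    password="change-me",
    database="metastore",  # hypothetical schema name
)

with conn.cursor() as cur:
    cur.execute(
        """
        SELECT d.NAME AS db_name, t.TBL_NAME, t.TBL_TYPE
        FROM TBLS t
        JOIN DBS d ON t.DB_ID = d.DB_ID
        """
    )
    for db_name, tbl_name, tbl_type in cur.fetchall():
        print(f"{db_name}.{tbl_name} ({tbl_type})")

conn.close()
```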

September 22, 2023 · 10 min · Naveen Kannan

Using Ansible to install Hive on a Spark cluster.

What is Hive? Apache Hive is a distributed, fault-tolerant data warehouse system, built on top of Hadoop, designed to simplify and streamline the processing of large datasets. Through Hive, a user can manage and analyze massive volumes of data by organizing it into tables, much like a traditional relational database. Hive uses the HiveQL (HQL) language, which is very similar to SQL. These SQL-like queries are translated into MapReduce tasks, leveraging the power of Hadoop's MapReduce functionality without requiring the user to program MapReduce jobs by hand. ...
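As a small illustration (not taken from the post), Hive-managed tables can also be reached from Spark's side with PySpark and Hive support enabled; the database and table names below are made up, and the sketch assumes hive-site.xml is on Spark's classpath.

```python
# Minimal sketch: run HiveQL-style statements through Spark SQL.
# Assumes Spark is configured with Hive support; names are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-on-spark-demo")
    .enableHiveSupport()  # lets Spark SQL use the Hive metastore and Hive DDL
    .getOrCreate()
)

spark.sql("CREATE DATABASE IF NOT EXISTS demo")
spark.sql("CREATE TABLE IF NOT EXISTS demo.visits (user_id STRING, ts TIMESTAMP)")
spark.sql("SELECT user_id, COUNT(*) AS n FROM demo.visits GROUP BY user_id").show()

spark.stop()
```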

September 4, 2023 · 11 min · Naveen Kannan

Installing and configuring Hadoop and Spark on a 4-node cluster.

A brief introduction to Hadoop Apache Hadoop is an open-source software library that allows for the distributed processing of large datasets across a computing cluster. Hadoop shines at processing massive amounts of data that are too big to fit on a single computing system. Hadoop's Main Components Hadoop has several components integral to its functioning. These include: The HDFS (Hadoop Distributed File System) HDFS is a distributed file system that provides high-throughput access to application data, and is used to scale a single Apache Hadoop cluster to hundreds (and even thousands) of nodes. ...
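As a toy example of the distributed processing described above, the classic word count can be run with PySpark against a file stored in HDFS; the namenode address and input path below are hypothetical.

```python
# Toy illustration of distributed processing: a word count over a file in HDFS.
# The namenode address and input path are placeholders.
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-wordcount").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs://namenode:9000/data/corpus.txt")  # partitioned across the cluster
counts = (
    lines.flatMap(lambda line: line.split())  # map: emit individual words
         .map(lambda word: (word, 1))
         .reduceByKey(add)                    # reduce: sum the counts per word
)

for word, n in counts.take(10):
    print(word, n)

spark.stop()
```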

August 21, 2023 · 7 min · Naveen Kannan

Using Ansible to remotely configure a cluster.

What is Ansible? Ansible is an open-source IT automation tool that enables automated management of remote systems. A basic Ansible environment has the following three components: Control Node: the system on which Ansible is installed, from which Ansible commands such as ansible-inventory are issued, and where Ansible playbooks and configuration files are stored. Managed Node: a remote system that Ansible manages and configures. ...
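For a flavor of how the control node drives the managed nodes, here is an illustrative sketch that launches a playbook from Python using the ansible-runner library. The directory layout and playbook name are placeholders; it assumes the usual ansible-runner structure (project/site.yml, inventory/hosts) on the control node.

```python
# Illustrative sketch: run a playbook from the control node via ansible-runner.
# Paths and playbook names are hypothetical placeholders.
import ansible_runner

result = ansible_runner.run(
    private_data_dir="/opt/cluster-automation",  # contains project/ and inventory/
    playbook="site.yml",                         # playbook that configures the managed nodes
)

print("status:", result.status)  # e.g. "successful" or "failed"
print("return code:", result.rc)
for host, ok_count in (result.stats or {}).get("ok", {}).items():
    print(f"{host}: {ok_count} tasks ok")
```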

June 24, 2023 · 10 min · Naveen Kannan

Docker, Singularity, and HPC.

Containers Containers are intended to be lightweight, standalone software environments, isolated from the host machine, which ensures that containers work uniformly across different staging and development instances. Containers share the host OS's kernel and do not require a separate OS per application; this is a key difference from virtual machines, which containers otherwise resemble in many ways. A very basic overview of container architecture. ...
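As a quick demonstration of the shared-kernel point, a sketch using the Docker SDK for Python: a throwaway container reports the host's kernel version rather than one of its own. The image name is just an example, and a local Docker daemon is assumed.

```python
# Small sketch with the Docker SDK for Python (docker-py): run a short-lived
# container and show that the kernel it reports is the host's kernel.
import docker

client = docker.from_env()  # talks to the local Docker daemon

output = client.containers.run("alpine:latest", ["uname", "-r"], remove=True)
print("kernel inside container:", output.decode().strip())
```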

June 17, 2023 · 7 min · Naveen Kannan

SLURM and HPC.

SLURM Workload Manager SLURM (formerly known as the Simple Linux Utility for Resource Management) is an open-source job scheduling system for Linux clusters. It does not require kernel modifications and is relatively self-contained. It has three key functions: allocating access to resources (compute nodes) to users for a defined period of time; providing a framework for starting and executing jobs, including parallel computing processes; and managing a queue to arbitrate resource contention. ...
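To illustrate job submission, here is an illustrative sketch that submits a small batch script to SLURM from Python by shelling out to sbatch; the job name, partition, and resource limits are placeholders for whatever a given cluster defines.

```python
# Illustrative sketch: submit a batch job to SLURM by piping a script to sbatch.
# The partition name and resource limits are hypothetical.
import subprocess

batch_script = """#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --partition=compute
#SBATCH --ntasks=4
#SBATCH --time=00:10:00

srun hostname
"""

# sbatch reads the job script from stdin and prints the assigned job ID.
result = subprocess.run(
    ["sbatch"], input=batch_script, text=True, capture_output=True, check=True
)
print(result.stdout.strip())  # e.g. "Submitted batch job 12345"
```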

June 15, 2023 · 5 min · Naveen Kannan