Installing and configuring the Hive metastore with a MySQL backend.

Hive Metastore
Hive, a powerful data warehousing system built on top of Hadoop, relies on a component known as the metastore to efficiently manage metadata about the data stored within it. This metadata is crucial for organizing, querying, and processing data in Hive. In this blog post, we’ll explore the role of the Hive Metastore and the significance of selecting the right relational database management system (RDBMS) for it.
Metadata Management
The Hive Metastore serves as a central repository for storing and managing metadata related to datasets within Hive. This metadata includes essential information such as: ...
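The post walks through backing the metastore with MySQL; the essential wiring lives in hive-site.xml. A minimal sketch is below — the hostname, database name, and credentials are placeholders, and it assumes the MySQL JDBC connector jar is already on Hive's classpath:

```xml
<configuration>
  <!-- JDBC URL of the metastore database (host and db name are placeholders) -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-host:3306/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <!-- Driver class from the MySQL Connector/J jar -->
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
  </property>
  <!-- Credentials for the metastore database user (placeholders) -->
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive-password</value>
  </property>
</configuration>
```

With the database and user created in MySQL, the schema itself is typically initialized with `schematool -dbType mysql -initSchema`.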

September 22, 2023 · 10 min · Naveen Kannan

Using Ansible to install Hive on a Spark cluster.

What is Hive?
Apache Hive is a distributed, fault-tolerant data warehouse system built on top of Hadoop, designed to simplify and streamline the processing of large datasets. Through Hive, a user can manage and analyze massive volumes of data by organizing it into tables, resembling a traditional relational database. Hive uses the HiveQL (HQL) language, which is very similar to SQL. These SQL-like queries get translated into MapReduce tasks, leveraging the power of Hadoop’s MapReduce functionalities while bypassing the need to know how to program MapReduce jobs. ...
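To illustrate how close HiveQL sits to SQL, here is a short sketch — the table and column names are invented for illustration, and the queries assume a running Hive deployment:

```sql
-- Create a managed table; the DDL mirrors standard SQL
CREATE TABLE IF NOT EXISTS page_views (
  user_id BIGINT,
  url     STRING,
  ts      TIMESTAMP
)
STORED AS ORC;

-- An aggregate query; Hive compiles this into MapReduce stages
SELECT url, COUNT(*) AS hits
FROM page_views
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```

The user writes only the declarative query; the planning and execution of the underlying MapReduce jobs is handled entirely by Hive.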

September 4, 2023 · 11 min · Naveen Kannan

Installing and configuring Hadoop and Spark on a 4 node cluster.

A brief introduction to Hadoop
Apache Hadoop is an open-source software library that allows for the distributed processing of large datasets across a computing cluster. Hadoop shines at processing datasets too large to fit on a single computing system.
Hadoop’s Main Components
Hadoop has several components integral to its functioning. These include:
The HDFS (Hadoop Distributed File System): HDFS is a distributed file system that provides high-throughput access to application data, and is used to scale a single Apache Hadoop cluster to hundreds (and even thousands) of nodes. ...
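In a multi-node setup like the 4-node cluster described here, every node needs to know where the HDFS NameNode lives. A minimal sketch of the relevant core-site.xml entry — the hostname is a placeholder:

```xml
<!-- core-site.xml: points each node at the NameNode (hostname is a placeholder) -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
</configuration>
```

The same file is distributed to all nodes in the cluster, which is exactly the kind of repetitive task the Ansible posts below automate.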

August 21, 2023 · 7 min · Naveen Kannan

Using Ansible to remotely configure a cluster.

What is Ansible?
Ansible is an open-source IT automation tool that allows for automated management of remote systems. A basic Ansible environment has the following three components:
Control Node: This is a system on which Ansible is installed, and the system from which Ansible commands such as ansible-inventory are issued. This is also where Ansible playbooks and configuration files are stored.
Managed Node: This is a remote system that Ansible intends to manage and configure. ...
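The control node finds its managed nodes through an inventory file. A minimal sketch for a small cluster — the hostnames and group names are placeholders:

```ini
# inventory.ini — managed nodes grouped by role (hostnames are placeholders)
[namenodes]
node1.cluster.local

[workers]
node2.cluster.local
node3.cluster.local
node4.cluster.local
```

From the control node, `ansible all -i inventory.ini -m ping` contacts every host in the inventory and verifies that Ansible can reach and manage it before any playbooks are run.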

June 24, 2023 · 10 min · Naveen Kannan

SLURM and HPC.

SLURM Workload Manager
SLURM (formerly known as Simple Linux Utility for Resource Management) is an open-source job scheduling system for Linux clusters. It does not require kernel modification and is relatively self-contained. It has three key functions:
- Allocation of access to resources (compute nodes) to users for a defined period of time.
- Providing a framework that allows for starting and executing jobs, including parallel computing processes.
- Queue management to arbitrate resource contention. ...
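Users typically interact with all three functions through a batch script submitted with `sbatch`. A minimal sketch — job name, task count, and time limit are illustrative values:

```bash
#!/bin/bash
#SBATCH --job-name=hello        # job name shown in squeue
#SBATCH --nodes=1               # request one compute node
#SBATCH --ntasks=4              # four parallel tasks
#SBATCH --time=00:05:00         # wall-clock limit for the allocation
#SBATCH --output=hello_%j.out   # %j expands to the job ID

# srun launches the command once per allocated task
srun hostname
```

Submitting with `sbatch hello.sh` places the job in the queue; `squeue` shows its position while SLURM arbitrates contention for the requested nodes.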

June 15, 2023 · 5 min · Naveen Kannan