
Posts

Build Data Platform

I'd appreciate your likes and comments. Additionally, it will be my lengthiest blog post to date. However, regardless of the length, it's a significant achievement that I'm eager to share with you. I'll refrain from delving into unnecessary details and get straight to the main points to prevent this post from turning into a 100-minute read :). As always, I'll strive to simplify everything so that even those who aren't tech-savvy can easily follow along. Why? Everything has a why, and this project is no exception: I had been learning DevOps for data engineering and needed to apply those skills in an end-to-end project. Of course, this project is not the best one out there, but it helps me iterate quickly and make mistakes. (And it reflects the reality of Modern Data Engineering, with beautiful tool icons everywhere.) End Goal: The end goal of this project is a fully functional data platform/pipeline that will refresh our analytics tables/dashboards daily. The whole infrastructu...
Recent posts
Seven Short Money Lessons That Can Unlock (Realistic) Financial Freedom

Your $4 latte addiction isn’t what’s stopping you from reaching financial freedom. This short story is sure to punch you in the face…then inspire you. In 2012, a young man sat down at a Portland cafe. An elderly gentleman walked past him and sat at the table next to him. The young man was working on his MacBook like the cool digital nomad he thought he was. “Do you like Apple?” the old man asked. The young man instantly became dismissive of him. Ohh great, why’d I have to sit next to this old geezer, he thought to himself. “I don’t like Apple because of what their iPad did to society. People just use them to consume, whereas a real computer can help you create,” the old man proclaimed. He went on … “Too many people never try to do things that have never been done before. It’s sad. But if they give a few things a crack, they soon learn they too can do things that have never been done.” The old man quietly did a half-smil...

Machine Learning — Logistic Regression with Python

# Importing packages
import pandas as pd                                    # data processing
import numpy as np                                     # working with arrays
import itertools
import matplotlib.pyplot as plt                        # visualizations
from matplotlib import rcParams                        # plot size customization
from termcolor import colored as cl                    # text customization
from sklearn.model_selection import train_test_split   # splitting the data
from sklearn.linear_model import LogisticRegression    # model algorithm
from sklearn.preprocessing import StandardScaler       # data normalization
from sklearn.metrics import jaccard_similarity_score as jss  # evaluation metric (renamed to jaccard_score in scikit-learn >= 0.23)
from sklearn.metrics import precision_score            # evaluation metric
from sklearn.metrics import classification_report      # evaluation metric
from sklearn.metrics import confusion_matrix           # evaluation metric
from sklearn.metrics import log_loss                   # evaluation metric

rcParams['figure.figsize'] = (20, 10)

# Importing the data and EDA
df = pd.read_csv('tele_customer_data.csv')
df.drop(['Unnamed: 0', 'l...
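The preview cuts off before the modeling step, but the rough flow those imports set up looks like the minimal sketch below. It is an illustration only: the 'churn' target column, and the assumption that the remaining columns are numeric, are placeholders rather than details from the original notebook.

# Minimal sketch, assuming a numeric feature matrix and a binary 'churn' target column
X = StandardScaler().fit_transform(df.drop('churn', axis=1))   # normalize the features
y = df['churn'].values

# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression()            # default L2-regularized logistic regression
model.fit(X_train, y_train)             # learn the coefficients
y_pred = model.predict(X_test)          # class predictions for the held-out rows

print(classification_report(y_test, y_pred))             # precision/recall/F1 per class
print(confusion_matrix(y_test, y_pred))                  # raw error counts
print(log_loss(y_test, model.predict_proba(X_test)))     # probabilistic loss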

AWS Cloudformation

AWS CloudFormation is a service that helps you model and set up your Amazon Web Services resources so that you can spend less time managing those resources and more time focusing on the applications that run in AWS. In the given screenshot, a template is a .JSON or .YAML file with parameter definitions, resources, and configuration actions. CloudFormation works as a framework for creating a new stack, updating a stack, and handling error detection and/or rollback. A stack is basically used to configure AWS services. Why CloudFormation? Getting started: log in to the AWS Management Console, enter your username and password, go to Services, and search for CloudFormation under Management & Governance. You will see the running stacks there, along with an option for creating a new stack. What is a stack? The CloudFormation stack provides the ability to deploy, update and delete a template and its associated collection of resources by using the AWS Management Consol...
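As a hedged illustration of the template-plus-stack workflow described above (this is not code from the post itself), the sketch below creates a stack from a tiny inline YAML template using boto3; the stack name, bucket name and template are placeholders.

import boto3

# A tiny illustrative template: one parameter and one S3 bucket resource
template_body = """
AWSTemplateFormatVersion: '2010-09-09'
Parameters:
  BucketName:
    Type: String
Resources:
  DemoBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Ref BucketName
"""

cloudformation = boto3.client('cloudformation')

# Create the stack; CloudFormation provisions the resources and rolls back on error
cloudformation.create_stack(
    StackName='demo-stack',                        # placeholder name
    TemplateBody=template_body,
    Parameters=[{'ParameterKey': 'BucketName',
                 'ParameterValue': 'my-demo-bucket-12345'}],
)

# Block until the stack reaches CREATE_COMPLETE (or fails and rolls back)
cloudformation.get_waiter('stack_create_complete').wait(StackName='demo-stack')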

ACL Deep Dive

In general, plain Unix permissions aren’t sufficient when you have permission requirements that don’t map cleanly to an enterprise’s natural hierarchy of users and groups. HDFS ACLs are available in Apache Hadoop 2.4.0 and give you the ability to specify fine-grained file permissions for specific named users or named groups, not just the file’s owner and group. HDFS ACLs are modeled after POSIX ACLs. Best practice is to rely on traditional permission bits to implement most permission requirements, and define a smaller number of ACLs to augment the permission bits with a few exceptional rules. To use ACLs, first we’ll need to enable them on the NameNode by adding the following configuration property to hdfs-site.xml and restarting the NameNode: set dfs.namenode.acls.enabled to true. Most users will interact with ACLs using two new commands added to the HDFS CLI: setfacl and getfacl. Here are some examples of how HDFS ACLs can help implement complex security requirements. EXAMPLE 1: G...
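The post’s own examples are cut off in this preview, so purely as an illustrative sketch (not the post’s EXAMPLE 1), here is how the setfacl and getfacl commands might be driven from Python; the user name and HDFS path below are made up.

import subprocess

path = '/data/sales/monthly.csv'   # hypothetical HDFS path

# Grant read access to one named user beyond the file's owner/group permission bits
subprocess.run(['hdfs', 'dfs', '-setfacl', '-m', 'user:bravo:r--', path], check=True)

# Inspect the resulting ACL entries (owner, group, named user, mask, other)
result = subprocess.run(['hdfs', 'dfs', '-getfacl', path],
                        capture_output=True, text=True, check=True)
print(result.stdout)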

Kubernetes Configuration Provider to load data from Secrets and Config Maps

When running Apache Kafka on Kubernetes, you will sooner or later probably need to use Config Maps or Secrets, either to store something in them or to load them into your Kafka configuration. That is true regardless of whether you use Strimzi to manage your Apache Kafka cluster or something else. Kubernetes has its own ways of using Secrets and Config Maps from Pods, but they might not always be sufficient. That is why, in Strimzi, we created the Kubernetes Configuration Provider for Apache Kafka, which we will introduce in this blog post. Usually, when you need to use data from a Config Map or Secret in your Pod, you will either mount it as a volume or map it to an environment variable. Both methods are configured in the spec section of the Pod resource, or in the spec.template.spec section when using higher-level resources such as Deployments or StatefulSets. When mounted as a volume, the contents of the Secr...
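For readers who want to see those two standard mechanisms side by side, here is a minimal sketch of a Pod definition written as a Python dict that mirrors the YAML manifest; the my-config, my-secret and key/path names are placeholder assumptions, not values from the post.

# A Pod spec (as a Python dict mirroring the YAML) that uses both mechanisms:
# a Secret mounted as a volume and a ConfigMap key mapped to an environment variable.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "demo-pod"},
    "spec": {
        "containers": [{
            "name": "app",
            "image": "example/app:latest",          # placeholder image
            "env": [{
                "name": "LOG_LEVEL",
                "valueFrom": {"configMapKeyRef": {"name": "my-config",
                                                  "key": "log.level"}},
            }],
            "volumeMounts": [{"name": "certs", "mountPath": "/etc/certs"}],
        }],
        # The Secret's keys show up as files under /etc/certs inside the container
        "volumes": [{"name": "certs", "secret": {"secretName": "my-secret"}}],
    },
}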

Python and Parquet Performance

In Pandas, PyArrow, fastparquet, AWS Data Wrangler, PySpark and Dask. This post outlines how to use all the common Python libraries to read and write the Parquet format while taking advantage of columnar storage, columnar compression and data partitioning. Used together, these three optimizations can dramatically accelerate I/O for your Python applications compared to CSV, JSON, HDF or other row-based formats. Parquet makes applications possible that are simply impossible using a text format like JSON or CSV. Introduction: I have recently gotten more familiar with how to work with Parquet datasets across the six major tools used to read and write from Parquet in the Python ecosystem: Pandas, PyArrow, fastparquet, AWS Data Wrangler, PySpark and Dask. My work of late in algorithmic trading involves switching between these tools a lot and, as I said, I often mix up the APIs. I use Pandas and PyArrow for in-RAM comput...
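As a quick, hedged taste of those three optimizations before the tool-by-tool deep dive (this sketch is mine, not from the post), here is a minimal pandas example that writes a partitioned, snappy-compressed Parquet dataset and reads back only the columns it needs; the column names and paths are placeholders.

import pandas as pd

df = pd.DataFrame({
    "symbol": ["AAPL", "AAPL", "MSFT"],
    "price": [150.1, 151.2, 300.5],
    "trade_date": ["2021-01-04", "2021-01-05", "2021-01-04"],
})

# Columnar compression + data partitioning: one directory per trade_date,
# snappy-compressed column chunks inside each file (uses the pyarrow engine).
df.to_parquet("trades/", engine="pyarrow", compression="snappy",
              partition_cols=["trade_date"])

# Columnar storage pays off on read: fetch only the columns you actually need.
prices = pd.read_parquet("trades/", engine="pyarrow", columns=["symbol", "price"])
print(prices.head())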