
Posts

Build Data Platform

I'd appreciate your likes and comments. Additionally, it will be my lengthiest blog post to date. However, regardless of the length, it's a significant achievement that I'm eager to share with you. I'll refrain from delving into unnecessary details and get straight to the main points to prevent this post from turning into a 100-minute read :). As always, I'll strive to simplify everything so that even those who aren't tech-savvy can easily follow along. Why? Everything has a why, and this project is no exception: I had been learning DevOps for data engineering and needed to apply those skills in an end-to-end project. Of course, this project is not the best one out there, but it helps me iterate quickly and make mistakes. (And it reflects the reality of Modern Data Engineering, with beautiful tool icons everywhere.) End Goal: The end goal of this project is a fully functional data platform/pipeline that will refresh our analytics tables/dashboards daily. The whole infrastructu...
Recent posts
Seven Short Money Lessons That Can Unlock (Realistic) Financial Freedom

Your $4 latte addiction isn’t what’s stopping you from reaching financial freedom. This short story is sure to punch you in the face…then inspire you. In 2012, a young man sat down at a Portland cafe. An elderly gentleman walked past him and sat at the table next to him. The young man was working on his MacBook like the cool digital nomad he thought he was. “Do you like Apple?” the old man asked. The young man instantly became dismissive of him. Ohh great, why’d I have to sit next to this old geezer, he thought to himself. “I don’t like Apple because of what their iPad did to society. People just use them to consume, whereas a real computer can help you create,” the old man proclaimed. He went on … “Too many people never try to do things that have never been done before. It’s sad. But if they give a few things a crack, they soon learn they too can do things that have never been done.” The old man quietly did a half-smil...

Machine Learning — Logistic Regression with Python

# Importing packages
import pandas as pd                                    # data processing
import numpy as np                                     # working with arrays
import itertools
import matplotlib.pyplot as plt                        # visualizations
from matplotlib import rcParams                        # plot size customization
from termcolor import colored as cl                    # text customization
from sklearn.model_selection import train_test_split   # splitting the data
from sklearn.linear_model import LogisticRegression    # model algorithm
from sklearn.preprocessing import StandardScaler       # data normalization
from sklearn.metrics import jaccard_similarity_score as jss  # evaluation metric (renamed to jaccard_score in scikit-learn >= 0.23)
from sklearn.metrics import precision_score            # evaluation metric
from sklearn.metrics import classification_report      # evaluation metric
from sklearn.metrics import confusion_matrix           # evaluation metric
from sklearn.metrics import log_loss                   # evaluation metric

rcParams['figure.figsize'] = (20, 10)

# Importing the data and EDA
df = pd.read_csv('tele_customer_data.csv')
df.drop(['Unnamed: 0', 'l...
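The preview cuts off before the modeling step, but the rough flow those imports set up looks like the minimal sketch below. It is an illustration only: the 'churn' target column, and the assumption that the remaining columns are numeric, are placeholders rather than details from the original notebook.

# Minimal sketch, assuming a numeric feature matrix and a binary 'churn' target column
X = StandardScaler().fit_transform(df.drop('churn', axis=1))   # normalize the features
y = df['churn'].values

# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression()            # default L2-regularized logistic regression
model.fit(X_train, y_train)             # learn the coefficients
y_pred = model.predict(X_test)          # class predictions for the held-out rows

print(classification_report(y_test, y_pred))             # precision/recall/F1 per class
print(confusion_matrix(y_test, y_pred))                  # raw error counts
print(log_loss(y_test, model.predict_proba(X_test)))     # probabilistic loss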

AWS Cloudformation

AWS CloudFormation is a service that helps you model and set up your Amazon Web Services resources so that you can spend less time managing those resources and more time focusing on the applications that run in AWS. In the given screenshot, a template is a .JSON or .YAML file with parameter definitions, resources, and configuration actions. CloudFormation works as a framework for creating a new stack, updating a stack, and handling error detection and/or rollback. A stack is basically used to configure AWS services. Why CloudFormation? Getting started: log in to the AWS Management Console, enter your username and password, go to Services, and search for CloudFormation under Management & Governance. You will see the running stacks there, along with an option for creating a new stack. What is a stack? The CloudFormation stack provides the ability to deploy, update and delete a template and its associated collection of resources by using the AWS Management Consol...
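As a hedged illustration of the template-plus-stack workflow described above (this is not code from the post itself), the sketch below creates a stack from a tiny inline YAML template using boto3; the stack name, bucket name and template are placeholders.

import boto3

# A tiny illustrative template: one parameter and one S3 bucket resource
template_body = """
AWSTemplateFormatVersion: '2010-09-09'
Parameters:
  BucketName:
    Type: String
Resources:
  DemoBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Ref BucketName
"""

cloudformation = boto3.client('cloudformation')

# Create the stack; CloudFormation provisions the resources and rolls back on error
cloudformation.create_stack(
    StackName='demo-stack',                        # placeholder name
    TemplateBody=template_body,
    Parameters=[{'ParameterKey': 'BucketName',
                 'ParameterValue': 'my-demo-bucket-12345'}],
)

# Block until the stack reaches CREATE_COMPLETE (or fails and rolls back)
cloudformation.get_waiter('stack_create_complete').wait(StackName='demo-stack')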

ACL Deep Dive

In general, plain Unix permissions aren’t sufficient when you have permission requirements that don’t map cleanly to an enterprise’s natural hierarchy of users and groups. HDFS ACLs are available in Apache Hadoop 2.4.0 and give you the ability to specify fine-grained file permissions for specific named users or named groups, not just the file’s owner and group. HDFS ACLs are modeled after POSIX ACLs. Best practice is to rely on traditional permission bits to implement most permission requirements, and define a smaller number of ACLs to augment the permission bits with a few exceptional rules. To use ACLs, first we’ll need to enable them on the NameNode by adding the following configuration property to hdfs-site.xml and restarting the NameNode: set dfs.namenode.acls.enabled to true. Most users will interact with ACLs using two new commands added to the HDFS CLI: setfacl and getfacl. Here are some examples of how HDFS ACLs can help implement complex security requirements. EXAMPLE 1: G...
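The post’s own examples are cut off in this preview, so purely as an illustrative sketch (not the post’s EXAMPLE 1), here is how the setfacl and getfacl commands might be driven from Python; the user name and HDFS path below are made up.

import subprocess

path = '/data/sales/monthly.csv'   # hypothetical HDFS path

# Grant read access to one named user beyond the file's owner/group permission bits
subprocess.run(['hdfs', 'dfs', '-setfacl', '-m', 'user:bravo:r--', path], check=True)

# Inspect the resulting ACL entries (owner, group, named user, mask, other)
result = subprocess.run(['hdfs', 'dfs', '-getfacl', path],
                        capture_output=True, text=True, check=True)
print(result.stdout)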

Kubernetes Configuration Provider to load data from Secrets and Config Maps

When running Apache Kafka on Kubernetes, you will sooner or later probably need to use Config Maps or Secrets, either to store something in them or to load them into your Kafka configuration. That is true regardless of whether you use Strimzi to manage your Apache Kafka cluster or something else. Kubernetes has its own ways of using Secrets and Config Maps from Pods, but they might not always be sufficient. That is why, in Strimzi, we created the Kubernetes Configuration Provider for Apache Kafka, which we will introduce in this blog post. Usually, when you need to use data from a Config Map or Secret in your Pod, you will either mount it as a volume or map it to an environment variable. Both methods are configured in the spec section of the Pod resource, or in the spec.template.spec section when using higher-level resources such as Deployments or StatefulSets. When mounted as a volume, the contents of the Secr...
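For readers who want to see those two standard mechanisms side by side, here is a minimal sketch of a Pod definition written as a Python dict that mirrors the YAML manifest; the my-config, my-secret and key/path names are placeholder assumptions, not values from the post.

# A Pod spec (as a Python dict mirroring the YAML) that uses both mechanisms:
# a Secret mounted as a volume and a ConfigMap key mapped to an environment variable.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "demo-pod"},
    "spec": {
        "containers": [{
            "name": "app",
            "image": "example/app:latest",          # placeholder image
            "env": [{
                "name": "LOG_LEVEL",
                "valueFrom": {"configMapKeyRef": {"name": "my-config",
                                                  "key": "log.level"}},
            }],
            "volumeMounts": [{"name": "certs", "mountPath": "/etc/certs"}],
        }],
        # The Secret's keys show up as files under /etc/certs inside the container
        "volumes": [{"name": "certs", "secret": {"secretName": "my-secret"}}],
    },
}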

Python and Parquet Performance

In Pandas, PyArrow, fastparquet, AWS Data Wrangler, PySpark and Dask. This post outlines how to use all the common Python libraries to read and write the Parquet format while taking advantage of columnar storage, columnar compression and data partitioning. Used together, these three optimizations can dramatically accelerate I/O for your Python applications compared to CSV, JSON, HDF or other row-based formats. Parquet makes applications possible that are simply impossible using a text format like JSON or CSV. Introduction: I have recently gotten more familiar with how to work with Parquet datasets across the six major tools used to read and write from Parquet in the Python ecosystem: Pandas, PyArrow, fastparquet, AWS Data Wrangler, PySpark and Dask. My work of late in algorithmic trading involves switching between these tools a lot and, as I said, I often mix up the APIs. I use Pandas and PyArrow for in-RAM comput...
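As a quick, hedged taste of those three optimizations before the tool-by-tool deep dive (this sketch is mine, not from the post), here is a minimal pandas example that writes a partitioned, snappy-compressed Parquet dataset and reads back only the columns it needs; the column names and paths are placeholders.

import pandas as pd

df = pd.DataFrame({
    "symbol": ["AAPL", "AAPL", "MSFT"],
    "price": [150.1, 151.2, 300.5],
    "trade_date": ["2021-01-04", "2021-01-05", "2021-01-04"],
})

# Columnar compression + data partitioning: one directory per trade_date,
# snappy-compressed column chunks inside each file (uses the pyarrow engine).
df.to_parquet("trades/", engine="pyarrow", compression="snappy",
              partition_cols=["trade_date"])

# Columnar storage pays off on read: fetch only the columns you actually need.
prices = pd.read_parquet("trades/", engine="pyarrow", columns=["symbol", "price"])
print(prices.head())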