Table of contents

Spreadsheet

SQL

R

Python

Dashboard

Statistics

CERTIFICATION

datarockie-cert

Spreadsheet

Created a Google Sheets project using built-in functions to practice filtering and sorting, creating conditions, creating advanced conditions, conditional formatting, querying and summarizing data, and regular expressions. Link

SQL

Used SQL to answer business questions with basic queries, join operations, and aggregation. Link

Created tables and inserted data, then practiced aggregate functions, subqueries, common table expressions (CTEs), and CASE conditions. Link


create table customers(
  customer_id int,
  name varchar,
  order_id int
);

create table orders(
  order_id int,
  order_num int
);

create table menus(
  order_num int,
  name varchar,
  price int  
);

Insert Into customers values
  (1, 'John', 1),
  (2, 'Mary', 2),
  (3, 'David', 3);

Insert Into orders values
  (1, 1), (1, 2),
  (2, 2), (2, 3),
  (3, 3);

Insert Into menus values
  (1, 'Singha', 300),
  (2, 'Budweiser', 400),
  (3, 'Craft_Beer', 500);

.mode box
  --- JOIN, GROUP BY, ORDER BY, SELECTING Columns
select  
  o.order_id,
  c.name,
  sum(m.price) as total
from menus as m
JOIN orders as o
  ON m.order_num = o.order_num
JOIN customers as c 
  ON o.order_id = c.order_id
GROUP BY o.order_id, c.name
ORDER BY o.order_id;


--- Aggregate Function
SELECT 
  m.name,
  max(price),
  min(price),
  avg(price),
  count(*),
  avg(price)*count(*) AS sum_price
FROM orders as o
JOIN menus as m
ON o.order_num  = m.order_num
group by m.name
ORDER BY sum_price DESC;

--- Basic Subqueries
SELECT *
FROM (select * from orders where order_id = 1) as o
Join menus as m
  ON o.order_num = m.order_num ;

--- WITH Subqueries
WITH o as (select * from orders where order_id = 1)
SELECT *
FROM  o
Join menus as m
  ON o.order_num = m.order_num 
WHERE name = 'Budweiser';

--- CASE CONDITION
SELECT
  name,
  CASE WHEN price > 450 THEN 'Expensive'
       WHEN price > 350 THEN 'Average'
       ELSE 'Cheap'
  END AS segment
FROM menus;

Python

Get data from an open API with the requests library, loop over the responses to map keys and values into a dictionary of lists, and build a DataFrame with the pandas library so the data is easy to work with; then visualize it with matplotlib.pyplot, including a correlation plot (a minimal sketch follows the SWAPI example below).

Import churn.csv into a DataFrame, split it with train_test_split to prepare training and test sets, train a logistic regression model for binary classification and score it, then evaluate the model's performance with a confusion matrix to summarize the results.

Get Data from SWAPI.


# import module
from requests import get
from time import  sleep
import pandas as pd
import json

url = "https://swapi.dev/api/people/"

sw_people = {
    'name' : [],
    'height' : [],
    'gender' : [],
    'mass' : [],
    'homeworld' : [],
}

# get data from api with for
for i in range(10):
    index = i + 1
    new_url = url + str(index)
    resp = get(new_url).json()
    name = resp['name']
    height = resp['height']
    mass = resp['mass']
    gender = resp['gender'].replace("n/a","unknown")
    homeworld_url = resp['homeworld']
    homeworld = get(homeworld_url).json()['name']
    print(name, height, mass, gender, homeworld)
    sw_people['name'].append(name)
    sw_people['height'].append(height)
    sw_people['mass'].append(mass)
    sw_people['gender'].append(gender)
    sw_people['homeworld'].append(homeworld)
    sleep(1)

# create DataFrame
df = pd.DataFrame(sw_people)

## Export to CSV
df.to_csv('sw_people.csv')
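
The project summary above also mentions a correlation plot; the exact notebook code is not shown here, so the lines below are only a minimal sketch that reads the exported sw_people.csv back in and draws a simple correlation heatmap with matplotlib (height and mass arrive from the API as strings, so they are converted first).


# correlation plot sketch (illustrative; reuses the CSV exported above)
import pandas as pd
import matplotlib.pyplot as plt

people = pd.read_csv('sw_people.csv')

# height and mass are strings in the API response; convert to numbers
num = people[['height', 'mass']].apply(pd.to_numeric, errors='coerce')
corr = num.corr()
print(corr)

# draw the correlation matrix as a simple heatmap
plt.figure(figsize=(5, 4))
plt.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
plt.colorbar(label='Correlation (r)')
plt.xticks(range(len(corr.columns)), corr.columns)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.title('Correlation plot of height and mass')
plt.show()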

Get cryptocurrency data from the Coinbase API.


from requests import get
from time import  sleep
import pandas as pd
import json

#import library for visualize
import matplotlib.pyplot as plt

url2 = "https://api.coinbase.com/v2/currencies/crypto"

resp2 = get(url2).json()

crypto = {
    'name' : [],
    'code' : [],
    'sort_index' : [],
}

for i in resp2['data']:
    crypto['name'].append(i['name'])
    crypto['code'].append(i['code'])
    crypto['sort_index'].append(i['sort_index'])

df = pd.DataFrame(crypto)
print(df)

# connect to retrieve BTC exchange rate data from Coinbase
url3 = "https://api.coinbase.com/v2/exchange-rates?currency=BTC"
resp3 = get(url3).json()

price_btc = {
    'coin' : [],
    'price' : []
}


#Loop for insert to price_btc dict
for i in resp3['data']['rates'].keys():
    price_btc['coin'].append(i)

for i in resp3['data']['rates'].values():
    price_btc['price'].append(i)

df2 = pd.DataFrame(price_btc)
# the API returns rates as strings; convert so sorting is numeric
df2['price'] = df2['price'].astype(float)

# Join to find the full coin names
join_df = pd.merge(df,df2, left_on = 'code', right_on ='coin', how='inner')

# Arrange ascending
sort_df = join_df.sort_values(by = 'price', ascending=True).reset_index()

# select the first 20 rows
first_20 = sort_df[:20]

plt.figure(figsize=(12, 8))
plt.plot(first_20['code'], first_20['price'])
plt.title('Chart 1 BTC price on Other Currency')
plt.xlabel('Currency')
plt.ylabel('Price')
plt.xticks(rotation=45)
plt.show()

Machine learning to predict churn.


# ML
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

churn = pd.read_csv('churn.csv')

df = pd.DataFrame(churn)

y = df['churn']
X = df[['accountlength','numbercustomerservicecalls', 'totaldaycalls','totalintlminutes']]

#split > train > score > evaluate

#Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state= 42)

#Train
model = LogisticRegression()
model.fit(X_train, y_train)

#Score
#Test
pred = model.predict(X_test)

#Evaluate
#Accuracy
accuracy = np.mean(pred == y_test)

print('Accuracy :', accuracy)

# Compute confusion matrix
cm = confusion_matrix(y_test, pred)
print("Confusion Matrix:")
print(cm)
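
# precision / recall / F1 from the metrics imported above
# (a minimal sketch; it assumes the positive churn label in this dataset is 'yes')
print("Precision:", precision_score(y_test, pred, pos_label='yes'))
print("Recall   :", recall_score(y_test, pred, pos_label='yes'))
print("F1 score :", f1_score(y_test, pred, pos_label='yes'))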

R

Data transformation with the dplyr library. Link


library("readr")
install.packages("dplyr")

## read csv file into Rstudio
imdb <- read_csv("imdb.csv")

## review data structure
glimpse(imdb)

## print head and tail of data
head(imdb,10)
tail(imdb,10)

## select columns
select(imdb, MOVIE_NAME, RATING)
select(imdb,1,5)

select(imdb, movie_name = MOVIE_NAME, released_year = YEAR)

## pipe operator
imdb %>% 
  select (movie_name = MOVIE_NAME, released_year = YEAR) %>%
  head(10)

## filter Data
filter(imdb, SCORE >= 9.0)

imdb %>% filter(SCORE >= 9.0)

## lower col names
names(imdb) <- tolower(names(imdb))

imdb %>%
  select(movie_name, year, score)%>%
  filter(score >= 9 & year > 2000)

imdb %>%
  select(movie_name, length, score) %>%
  filter(score == 8.8 | score == 8.3)

imdb %>%
  select(movie_name, length, score) %>%
  filter(score %in% c(8.3,8.8,9.0))

##filter string in columns
imdb %>% 
  select(movie_name, genre, rating) %>%
  filter(grepl("Drama", imdb$genre))

## create new columns
imdb %>%
  mutate(score_group = if_else(score >= 9, "High Rating", "Low Rating"),
         length_group = if_else(length >= 120, "Long Film", "Short Film"))

imdb %>%
  select(movie_name, score) %>%
  mutate(score = score + 0.1) %>%
  head(10)

## Arrange Data
head(imdb)

imdb %>% 
  arrange(desc(length)) %>% 
  head(10)

imdb %>% 
  arrange(rating,desc(length))

## Summarise and group_by
imdb %>%
  filter(rating != "") %>% 
  group_by(rating) %>% 
  summarise(mean_length = mean(length),
            sum_length = sum(length),
            sd_length = sd(length),
            min_length = min(length),
            max_length = max(length),
            n = n())

## Join data

favourite_films <- data.frame(id = c(5,10,25,30,98))

favourite_films %>% 
  inner_join(imdb, by = c("id" = "no"))


## Write csv file (export result)
imdb_prep <- imdb %>% 
  select(movie_name, released_year = year, rating, length, score) %>% 
  filter(rating == "R" & released_year > 2000)

## export file
write_csv(imdb_prep,"imdb_prep.csv")

Data visualization with the ggplot2 library. Link


## ggplot2
## library tidyverse
library(tidyverse)

##First Plot
ggplot(data = mtcars, mapping = aes(x = hp, y = mpg)) +
  geom_point() +
  geom_smooth() +
  geom_rug()

ggplot(mtcars, aes(hp, mpg)) +
  geom_point(size = 3, col = "blue", alpha = 0.3) #alpha 0-1

ggplot(mtcars, aes(hp)) +
  geom_histogram(bins = 10, fill = "red", alpha = 0.5)

p <- ggplot(mtcars, aes(hp))
p + geom_histogram(bins = 10)
p + geom_density()
p + geom_boxplot()

## Bar plots by groups
diamonds %>% 
  count(cut)

ggplot(diamonds, aes(cut)) +
  geom_bar(fill = "lightblue")

ggplot(diamonds, mapping = aes(cut, fill = color)) +
  geom_bar(position = "stack")

ggplot(diamonds, mapping = aes(cut, fill = color)) +
  geom_bar(position = "dodge")

ggplot(diamonds, mapping = aes(cut, fill = color)) +
  geom_bar(position = "fill")

###Scatter plot
set.seed(70)
small_diamonds <- sample_n(diamonds, 5000)

ggplot(small_diamonds, aes(carat, price)) +
  geom_point()

### FACET : small multiples
ggplot(small_diamonds, aes(carat, price)) +
  geom_point() +
  geom_smooth(method = "lm", col = "red") +
  facet_wrap(~color, ncol = 2) +
  theme_minimal() +
  labs(title = "Relationship between carat and price by color",
       x = "Carat",
       y = "Price USD",
       caption = "Source : Diamonds from ggplot2 package")

### Final Example 

ggplot(small_diamonds, aes(carat, price, col = cut)) +
  geom_point(size = 3, alpha = 0.2) +
  facet_wrap(~color, ncol = 2) +
  theme_minimal()

### mapping inside aes(), Setting outside aes() 

Build a machine learning model to classify diabetes patients (binary classification). Link

## build ML to classify diabetes patients
## Binary classification

library(tidyverse)
library(caret)
library(mlbench)
library(MLmetrics)
## load dataset
data("PimaIndiansDiabetes")
df <- PimaIndiansDiabetes

## explore dataset
glimpse(df)

## no missing values
mean(complete.cases(df))

## select variables
df_starter <- df %>% 
  select(2, 5, 6, 8, 9)

## split data
set.seed(42)
n <- nrow(df_starter)
id <- sample(1:n, size = 0.8*n)
train_df <- df_starter[id, ]
test_df <- df_starter[-id, ]

## train
set.seed(42)

## USE FOR recall, Precision, F1, AUC
ctrl <- trainControl(method = "cv",
                     number = 5,
                     ## pr = precision + recall
                     summaryFunction = prSummary,
                     classProbs = TRUE)
## USE FOR Accuracy
ctrl <- trainControl(method = "cv",
                     number = 5)
knn_model <- train(diabetes ~ .,
                     data = train_df,
                     method = "knn",
                     metric = "Accuracy",
                     trControl = ctrl)

## score
p <- predict(knn_model, newdata = test_df)


## evaluate
mean(p == test_df$diabetes)

## confusion matrix
confusionMatrix(p, test_df$diabetes, 
                positive = "pos",
                mode = "prec_recall")

Dashboard

Power BI. Link

Looker Studio. Link


Tableau Public. Link


Statistics

  • What is Statistics
    • Statistics helps us understand the truth from data and information.
  • Population vs. Sample
    • Population = all units of interest; Sample = a subset of the population selected to represent it.
  • Sampling & Bias
    • Probability Sampling
      • Simple Random Sampling
      • Systematic Sampling
      • Stratified Random Sampling
      • Cluster Sampling
    • Non-probability Sampling
      • Convenience or haphazard sampling
      • Volunteer sampling
      • Judgement sampling
      • Quota sampling
      • Snowball or network sampling
      • Web panels
  • Descriptive Statistics (see the Python sketch after this list)
    • Measures of Central Tendency [Mean, Median, Mode]
    • Measures of Spread [SD, Variance, IQR, Range]
    • Measures of Position [Min, Max, Percentile]
    • Measures of Relationship [Correlation (r)]
  • Normal Distribution & The Empirical Rule
    • Normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean. The normal distribution appears as a "bell curve" when graphed.
    • Empirical Rule, also sometimes called the three-sigma or 68-95-99.7 rule, is a statistical rule which states that for normally distributed data, about 68%, 95%, and 99.7% of observations fall within ±1 SD, ±2 SD, and ±3 SD of the mean, respectively.
  • Z-Score (Normalization), illustrated together with the empirical rule in a sketch after this list
    • Z-score is a statistical measurement that describes a value's relationship to the mean of a group of values. Z-score is measured in terms of standard deviations from the mean. If a Z-score is 0, it indicates that the data point's score is identical to the mean score. A Z-score of 1.0 would indicate a value that is one standard deviation from the mean. Z-scores may be positive or negative, with a positive value indicating the score is above the mean and a negative score indicating it is below the mean.
  • Central Limit Theorem (see the simulation sketch after this list)
    • The central limit theorem (CLT) states that the distribution of a sample variable approximates a normal distribution (i.e., a "bell curve") as the sample size becomes larger, assuming that all samples are identical in size, and regardless of the population's actual distribution shape.
    • Simple Random Sampling
    • Sample size >= 30
    • n not more than 10% of population
  • Standard Error (SE), covered in the CLT sketch after this list
    • Standard error (SE) is a statistic that reveals how accurately sample data represents the whole population. It measures the accuracy with which a sample distribution represents a population by using standard deviation. In statistics, a sample mean deviates from the actual mean of a population; this deviation is the standard error of the mean.

  • Confidence Interval
    • A range of values, computed from sample data, that is likely to contain the true population parameter; analysts commonly use 95% or 99% confidence levels (see the t-test sketch after this list).
  • Testing hypotheses | p-value | CI | critical region
    • Critical Region, also known as the rejection region, is a set of values for the test statistic for which the null hypothesis is rejected.
    • P-value: the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true. p-value < alpha ⇒ reject H0.
    • Confidence Interval: if the hypothesized value falls inside the interval, do not reject the null hypothesis; if the CI does not contain the H0 value ⇒ reject H0.
  • What is Regression vs Correlation
    • Regression is a statistical technique that relates a dependent variable to one or more independent (explanatory) variables.
    • Correlation is a statistical measure of the relationship between two numeric variables, ranging from -1 to +1 (see the sketch after this list).
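
A minimal Python sketch of the descriptive statistics above, computed with pandas and numpy on a small made-up series (the numbers are purely illustrative):


import numpy as np
import pandas as pd

x = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])   # toy data
y = pd.Series([1, 3, 2, 5, 4, 6, 6, 8])   # second variable for correlation

# Measures of central tendency
print(x.mean(), x.median(), x.mode().tolist())

# Measures of spread: SD, variance, range, IQR
print(x.std(), x.var(), x.max() - x.min())
print(x.quantile(0.75) - x.quantile(0.25))

# Measures of position: min, max, percentile
print(x.min(), x.max(), np.percentile(x, 90))

# Measure of relationship: Pearson correlation (r)
print(x.corr(y))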
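
A sketch of the empirical rule and z-score normalization on simulated normal data (the simulated mean of 50 and SD of 10 are arbitrary choices):


import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=10, size=100_000)   # simulated bell-curve data

# Empirical rule: share of values within 1, 2, 3 SD of the mean (~68%, 95%, 99.7%)
mu, sd = data.mean(), data.std()
for k in (1, 2, 3):
    share = np.mean(np.abs(data - mu) <= k * sd)
    print(f"within {k} SD: {share:.3f}")

# Z-score: distance from the mean in SD units; ~0 mean and ~1 SD after normalization
z = (data - mu) / sd
print(round(z.mean(), 3), round(z.std(), 3))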
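
A sketch of the central limit theorem and the standard error: the population below is deliberately skewed (exponential), yet the means of repeated samples of size 50 are approximately normal and their spread matches SD/sqrt(n). All numbers are illustrative.


import numpy as np

rng = np.random.default_rng(7)
population = rng.exponential(scale=2.0, size=1_000_000)   # skewed population

n = 50   # sample size >= 30
sample_means = [rng.choice(population, size=n).mean() for _ in range(5_000)]

# CLT: the sample means centre on the population mean
print(np.mean(sample_means), population.mean())

# SE: SD of the sample means is close to population SD / sqrt(n)
print(np.std(sample_means), population.std() / np.sqrt(n))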
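
A sketch of a 95% confidence interval and a one-sample t-test p-value, assuming scipy is available (the true mean of 52 and hypothesized mean of 50 are made-up numbers):


import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=52, scale=10, size=40)   # one observed sample, n = 40

# 95% confidence interval for the mean (normal approximation, z = 1.96)
mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))
print("95% CI:", (mean - 1.96 * se, mean + 1.96 * se))

# One-sample t-test: H0 says the true mean is 50
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
print("p-value:", p_value)   # p-value < 0.05 => reject H0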
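
A sketch contrasting correlation and simple linear regression with numpy on made-up x/y data:


import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100)
y = 2.5 * x + 4 + rng.normal(0, 2, size=100)   # y depends on x plus noise

# Correlation: strength and direction of the linear relationship, between -1 and +1
r = np.corrcoef(x, y)[0, 1]
print("r =", round(r, 3))

# Regression: least-squares fit of y = slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
print("slope =", round(slope, 2), "intercept =", round(intercept, 2))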