Chendi Xue

I am a Linux software engineer, currently working on Spark, Arrow, Kubernetes, Ceph, C/C++, and more.

BLOG


Weekly Plan
Monday       Tuesday      Wednesday    Thursday     Friday       Saturday     Sunday
Workday      Workday      Workday      Workday      Workday      Playday      Playday
Not started  Not started  Not started  Not started  Not started  Not started  Not started

Using VIM as Cpp IDE
I prefer to write code in Vim, since I can compile it directly on the same node, and it is free. This post lists my must-haves when using Vim as a C++ IDE.
read more >>

How to build IKEv2 VPN server on Amazon AWS EC2
A very nice guide to setting up an IKEv2 VPN server on EC2 using strongSwan (IPsec).
read more >>

Apache Arrow enabling HDFS Parquet support
Enable Apache Arrow with HDFS and Parquet support, then implement a new Java interface for loading Parquet from HDFS.
read more >>

Prepare TPCDS data for spark
A step-by-step guide to installing a TPC-DS kit and then preparing TPC-DS data.
read more >>

Apache Arrow Gandiva on LLVM(Installation and evaluation)
Installation and evaluation of Apache Arrow and Gandiva.
read more >>

Spark WholeStageCodeGen
What WholeStageCodegen is and how it works in Spark.
read more >>

Spark Sql DataFrame processing Deep Dive
This post is a deep dive into DataFrame processing in both the vectorized/columnar data format and the row-based format.
read more >>

Spark and Hadoop build from Source
The Spark version is 3.0.0 (master as of Apr 2019) and the Hadoop version is 3.2.0 (claimed to be supported in Spark's pom.xml).
read more >>

TensorFlowOnSpark: Install Tutorial Step by Step (spark on Yarn)
TensorFlowOnSpark installation and verification step by step.
read more >>

Optimize Spark (pyspark) with Apache Arrow
Apache Arrow is a standardized, language-independent columnar memory format, implemented in C++ with interfaces in Python, Java, and other languages. Its aim is to provide a unified data structure shared across different projects and different process memory spaces.
read more >>

Difference between Spark Shuffle vs. Spill
What are Spark shuffle and spill, why are there two categories in the Spark UI, and how do they differ? And how can I tell why the system shuffled or spilled that much data to my spark.local.dir? This post tries to answer all of these questions.
read more >>

Persistent Memory Development Kit(PMDK) Notes 2: Benchmark examples for multiple interfaces(c/c++/java)
PMDK is super cool. If you missed the introduction, please go to PMDK Notes 0: What it is and Quick Examples to ramp up. In this post, three benchmark programs using the C, C++ and Java PMDK interfaces are demonstrated to suit different applications.
read more >>

S3A Committer review: how to enable, how to verify and performance
The S3A Committer is a new feature available in Hadoop 3.1.1; it eliminates the rename operation that is disastrous for S3A performance. This post covers how to enable the S3A Committer, how to verify that it is working, and its performance.
read more >>

Running Spark on kubernetes Step by steps
I will cover how to deploy Spark on Kubernetes and how to run Spark examples, from the simplest, such as calculating pi, to examples that require input/output through HDFS and examples with Hive.
read more >>

[Solved]Hadoop no credential provider issue with Ceph as object store(hadoop 3.1.1)
I hit a "no credential provider" issue after upgrading Hadoop to 3.1.1 with Hive 3.1.0. Since it took quite a while to find a solution, I hope others can benefit from my findings.
read more >>

Java Program Profiling and Optimization
This post collects good online Java documentation references; I may add an original write-up later.
read more >>

Coding Basics Note 0: Refreshing
I am always afraid of getting rusty on coding basics, data structures, and algorithms. I hope this cheat sheet helps me refresh whenever I need it.
read more >>

Coding Advanced Note 0
In this series, I will keep adding hard problems I have tried on LeetCode, updated weekly.
read more >>

Persistent Memory Development Kit(PMDK) Notes 1: How to install
PMDK is super cool. If you missed the introduction, please go to PMDK Notes 0: What it is and Quick Examples to ramp up. In this post, I wrote down all my steps for building PMDK with its C++ and Java APIs on my CentOS 7.3 system.
read more >>

Persistent Memory Development Kit(PMDK) Notes 0: What it is and Quick Examples
PMDK is a super cool open source library developed and owned by Intel. It helps users implement applications on persistent memory just as they would with normal memory. Basically, to manage memory we allocate a chunk and use pointers to locate data, build a logical layout, and read and write; using persistent memory through PMDK works exactly the same way. There is a new pointer type, persistent_ptr, that can be used just like a char*. Super cool!
read more >>