Essential Data Structures and Algorithms

Following the Pareto principle, I’ve filtered down the essential data structures and algorithms based on my experience. Within this already-trimmed list, I’ve marked the *most* important item in each category by placing it before a pipe (" | "), so tackle those first and foremost. The easiest way I’ve found to learn each one is to read the theory behind it, implement it yourself, and work through a couple of practice questions.

  • Data Structure: Binary Search Tree | Hash table
  • Search: Binary search | Order statistics (kth smallest/largest)
  • Graph: BFS | DFS, Dijkstra, Bellman-Ford
  • Sort: Quicksort | Mergesort, Heapsort, Insertion sort
  • Algorithm design: Recursion | Divide and Conquer, Dynamic Programming, Greedy
  • Tree traversal: Inorder | Preorder, Postorder

Again, this is *not* a comprehensive list, but rather an intentionally curated one meant to separate the essential from the less essential.
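To show what the implement-it-yourself step can look like, here is a minimal sketch of binary search (the item I marked most important under Search) in Python; the function name and test values are just illustrative.

#!python
def binary_search(sorted_list, target):
    """Return the index of target in sorted_list, or -1 if absent."""
    lo, hi = 0, len(sorted_list) - 1
    while lo <= hi:
        mid = (lo + hi) // 2              # middle of the current window
        if sorted_list[mid] == target:
            return mid
        elif sorted_list[mid] < target:
            lo = mid + 1                  # discard the left half
        else:
            hi = mid - 1                  # discard the right half
    return -1

print(binary_search([1, 3, 5, 7, 9], 7))   # 3
print(binary_search([1, 3, 5, 7, 9], 4))   # -1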

Common beginner questions about Apache Spark

For a beginner, Apache Spark can be a lot to take in. While it’s mostly elegant and intuitive, there are some parts of Apache Spark that beginners tend to have questions about.

Why is RDD immutable?

The RDD is the basic data structure of Apache Spark, and it is immutable, meaning an RDD cannot be changed once created. Whether you apply a function to every entry or filter out just a couple of them from the original RDD, Apache Spark always creates a new destination RDD rather than modifying the source.
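To make this concrete, here is a minimal PySpark sketch (assuming a local SparkContext; the variable names are just illustrative) showing that transformations like map and filter return brand-new RDDs while the original stays untouched.

#!python
from pyspark import SparkContext

sc = SparkContext("local", "immutability-demo")

numbers = sc.parallelize([1, 2, 3, 4, 5])      # original RDD
doubled = numbers.map(lambda x: x * 2)         # new RDD; numbers is unchanged
small   = numbers.filter(lambda x: x < 3)      # another new RDD

print(numbers.collect())   # [1, 2, 3, 4, 5]  -- still the original data
print(doubled.collect())   # [2, 4, 6, 8, 10]
print(small.collect())     # [1, 2]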

As Apache Spark is commonly marketed as a high-performance big data processing engine, it’s understandable to question whether creating an entirely new RDD on every transformation is really optimal. Not only are you spending extra compute to build a fresh RDD when you could have just modified the original one, you also potentially create large duplicates, wasting storage, especially when a filter removes only a small number of entries.

However, while those might be valid concerns for smaller data, Apache Spark’s strength comes from scaling operations to “Big Data.” Immutability is a crucial technical feature that lets Apache Spark stay fast at that scale. Apache Spark achieves its famous scalability by managing many worker nodes in a cluster effectively; in other words, it is optimized for distributed computing. From that perspective, the immutability of an RDD relieves Apache Spark of the responsibility of tracking whether the RDD was mutated. When dozens of workers are churning through data, it’s easier to build a dataset from scratch by partitioning it than to have the workers check back each time to see whether the dataset was modified in the meantime.

Another benefit of immutability is that, since an RDD can never change, Apache Spark can recreate it from scratch at any point. This is a significant benefit in a distributed environment where any node can go down. If the node that was supposed to build a partition of an RDD fails, another node can take over and build that partition instead, because Spark knows exactly what it is supposed to look like.
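Continuing the hypothetical sketch above (reusing the same sc and numbers), you can actually inspect the recipe, called the lineage, that Spark would replay to rebuild a lost partition; note that in some PySpark versions toDebugString() returns bytes, hence the decode().

#!python
# Each transformation only records a step in the lineage; nothing is computed yet.
pipeline = numbers.map(lambda x: x * 2).filter(lambda x: x > 4)

# The lineage is the deterministic recipe Spark replays to rebuild any lost partition.
print(pipeline.toDebugString().decode())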

To illustrate the pain of not having immutability in distributed computing, consider Git, which is a distributed version control system. Git guarantees that a commit stays the same, much as Apache Spark guarantees that an RDD stays the same. Imagine a world where Git made no such guarantee and you and 20 other developers are working on the same file: whenever someone modifies the code, the commit itself changes. In a sense, it would be like editing code in a Google Doc with 20 other developers. One obvious fix would be to lock the entire document so no one else can edit it, but that hampers the productivity of the other 20 developers. Likewise, immutability lets Apache Spark keep every worker running as effectively as possible.

In conclusion, you can think of immutability as the invariant of an RDD. Just as an invariant in an algorithm exists to eliminate side effects, the invariant that an RDD never changes eliminates a whole class of headaches when Apache Spark scales its operations across a cluster.

Serverless Computing: The Perfect Solution for Your Startup

Is building an MVP as soon as possible your primary goal? Are you looking for a backend engineering solution that can keep scaling after the MVP?

Then you should consider building your project with serverless computing.

What is Serverless Computing?

In serverless computing, the developer does not have to concern themselves with the typical “server” tasks that are traditionally considered backend software engineering. Hence the illusion of being server-less. For example, the developer does not have to manually set up a database or run a cron job to perform repetitive tasks. These can be done with a click of a button on the console of a serverless computing provider such as Google Firebase or AWS Lambda, which automates those tasks for you instead.

However, there’s definitely a server involved; it’s just not “visible” to the developer, so you don’t have to care about it. Serverless computing provides an abstraction over the server-related tasks so that developers can interact with a nicely designed UI rather than grinding through server code and networking.
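For a sense of what that abstraction feels like in practice, here is a minimal sketch of a function deployed on AWS Lambda using the Python runtime; the provider invokes the handler and manages the servers, scaling, and routing for you. The event field shown is illustrative, not a fixed schema.

#!python
import json

def lambda_handler(event, context):
    """Entry point that AWS Lambda calls for each incoming event."""
    name = event.get("name", "world")          # illustrative input field
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }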

You can consider serverless computing a subset of cloud computing. Apart from inheriting the misleading naming scheme, it delegates even more of the work and decision-making to a third party. If the leap to cloud computing automated the database administration that used to manage data on local servers, the leap to serverless computing automates the backend engineering that runs the logic behind the front end.

Why Is Serverless Computing So Great?

Serverless computing eliminates a large chunk of the work needed to produce your MVP. While some database administration and backend engineering still needs to be done, a large part of the backend work is automated with serverless computing. You can simply focus on the features that drive value and let the third party handle scaling the server as your user base grows. In other words, your developer has one less thing to worry about since a large part of what “had to be done” is now automated, which in turn lets your company finish the MVP faster. You no longer have to agonize over which database to use (which can be a pretty heated topic of discussion on some teams); you just use the one provided. Less fishing for solutions, more developing actual value.

Furthermore, serverless computing allows your developers to wear multiple hats and lets your startup stay lean. For a startup, hiring an overly specialized niche developer is a huge investment and often an unwanted one. Startups need people who can work wherever they are needed, whether that’s front-end, back-end, DevOps, or something else. The abstraction of server tasks enables startups to simply hire generalist software engineers to build a scalable MVP without deep expertise in databases or backend engineering. In other words, you can reduce hiring costs while growing your startup.

At the same time, the talented, specialized engineers at reputable companies are effectively working on your scalable solution. Imagine getting engineering support from the talent at Google and Amazon; that is essentially what you get when you use a serverless computing platform from these companies. Vetted engineers with expertise in databases and backend engineering work to ensure that your server does not go down. You can tap into that talent pool and enjoy the top-quality products those engineers produce.

What’s The Catch?

However, with all the benefits come the shortcomings. The first is the lack of choice in your tools, such as the database. While I see this as a benefit for the majority of startups whose success does not depend on the type of database, if customizing and optimizing the database is crucial for your startup, then serverless might be problematic.

But for other startups, most of these “technical” concerns are not the limiting factor as long as things work. And “as long as things work” aligns with the financial motivation of the serverless computing platforms, because as long as things work, they get paid. So you can rest assured that your serverless platform, staffed with talented engineers, is hard at work making sure your app does not go down.

The second possible shortcoming is that all these goodies don’t come free. As your usage grows, the cost might grow beyond what you planned for. Depending on your usage pattern, serverless computing can end up more expensive than more traditional cloud computing or in-house solutions. At that point, it might make sense to move to a computing platform under your own control.

Yet for most startups, those concerns are premature. If you are at a scale where serverless computing no longer makes sense, you probably have enough budget to switch over. Nevertheless, make sure you are prepared to move if necessary.

Lastly, while we are on the topic of cost, you might also see unexpected spikes in your bill if there’s a sudden jump in usage, or even a bug that keeps triggering your serverless functions. So expect the unexpected, and make sure your budget can absorb it.

So…

Serverless computing has its downsides. However, most of them don’t hit startups very hard, and the current pricing structure (where many serverless services are free until you have an actual user base) makes perfect sense for companies that are strapped for cash but need to develop an MVP fast, which is basically every startup.

To P or NP

Originally written on 2015-05-30 at my previous blog

NOTE: The following article is intended as a gentle introduction to the subject, so I have intentionally left out some information that I considered out of scope. However, if you are familiar with the subject and think I have presented factually wrong statements, I would greatly appreciate an email with corrections. Also, if you are confused by my explanation, please feel free to send me an email with details of where you got lost, so I can make it clearer.


What exactly are P and NP? Why do we need them?

In order to understand the P vs. NP problem, we must first understand P and NP. They arose from the need of mathematicians and computer scientists to describe how many resources, such as time, are needed to solve a decision problem. A decision problem is any problem that, given an input, can be answered with “yes” or “no.” For example, given a path and a budget as inputs, a decision problem might output “yes” if the path cost is at or below the budget and “no” if the path cost exceeds it.

To be more precise, P is the set of decision problems that can be solved in polynomial time (and therefore also verified in polynomial time). For the example above, a verification algorithm may sum the costs of the edges in linear time ( O(n) ) and compare the total with the budget, while a search algorithm such as BFS or DFS finds the path in linear time as well. On the other hand, NP is the set of decision problems that can be verified in polynomial time but not necessarily solved in polynomial time.
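As a toy illustration of how cheap verification can be, here is a sketch of a linear-time verifier for the budget example above; the edge costs and budget are made-up values.

#!python
def verify_within_budget(edge_costs, budget):
    """Verify, in linear time, that a proposed path's total cost fits the budget."""
    return sum(edge_costs) <= budget    # one pass over the edges: O(n)

# The proposed path acts as the "certificate": checking it is easy,
# even when finding a qualifying path might be hard.
print(verify_within_budget([4, 2, 7], 15))   # True
print(verify_within_budget([4, 2, 7], 10))   # False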

A common misunderstanding is to read NP as “not P,” but that notion is greatly misleading. NP problems may or may not be solvable in polynomial time, so NP and P are not mutually exclusive classes. Rather, P is a subset of NP: membership in NP only requires polynomial-time verification, and every P problem satisfies that (it can be verified simply by solving it in polynomial time), so every P problem is an NP problem as well.

For more technical details on P and NP, refer to the extra note at the end.

P versus NP

Now we can tackle the P vs. NP problem. The question simply asks whether P equals NP. In other words: if a problem can be verified in polynomial time, can it always be solved in polynomial time as well? That is, are some NP problems inherently difficult or even impossible to solve in polynomial time, or have we just not found the fast algorithms yet? So far, no proof has been recognized by the Clay Mathematics Institute, which has promised one million dollars to anyone who resolves the P vs. NP problem. Although the question is simple, the implications are significant.

Implications of P = NP

If P = NP is proven, then every problem that can be verified in polynomial time can also be solved in polynomial time. The impact extends beyond academia, since many industrial systems, including cryptography, are built under the assumption that P != NP.

For example, RSA, a widely used cryptosystem for securing data transmission, exploits this asymmetry: decryption runs in polynomial time only for those who hold the right access (the private key). If you do not have the private key (meaning you do not have access), you would need to solve the RSA problem, which is in NP but is believed to be hard; no algorithm is known that can break the encryption in polynomial time.

However, if P = NP is proven true, then a polynomial-time algorithm for the RSA problem exists, meaning any encrypted message could, in principle, be decrypted even by those who should not have access, once they find that polynomial-time algorithm, an algorithm that is not supposed to exist under the P != NP assumption.
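As a toy illustration of this verify-easy, solve-hard asymmetry, consider integer factoring, which underlies RSA’s security; the numbers below are tiny and purely for demonstration (a real RSA modulus is hundreds of digits long).

#!python
from math import isqrt

def verify_factors(p, q, n):
    """Verification is trivial: one multiplication and a comparison."""
    return p > 1 and q > 1 and p * q == n

def find_factors(n):
    """Solving by brute force: trial division, hopeless for huge n."""
    for p in range(2, isqrt(n) + 1):
        if n % p == 0:
            return p, n // p
    return None

n = 3233                                 # = 53 * 61, a toy "RSA modulus"
print(verify_factors(61, 53, n))         # True, checked instantly
print(find_factors(n))                   # (53, 61) -- easy here, infeasible at real sizes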

Obviously, this is a significant security threat, since the “encrypted message” could be a private message you do not wish to share or even financial information that could be hijacked.

Therefore, if P = NP is true, many current systems that assume P != NP become insecure or invalid, and people could maliciously exploit them for personal gain.

Implications of P != NP

If P != NP is true, then some NP problems will never move into P, meaning there will never be algorithms that solve those problems in polynomial time.

Although the impact of this proof would not be as dramatic as P = NP, since most of academia and industry already assume P != NP, it would still influence how researchers spend their time. Knowing that no polynomial-time solution exists for certain NP problems, researchers would no longer need to waste time searching for fast algorithms that solve them exactly.


Extra Note

NP stands for “nondeterministic polynomial time,” while P stands for “polynomial time” (on a deterministic machine). A nondeterministic polynomial problem is one that could be solved in polynomial time if the algorithm were allowed some lucky guesses. For example, finding a least-cost cycle that visits every node in a graph is the Travelling Salesman Problem, which is notoriously difficult; but if our algorithm happened to make the right (lucky) choice at every step while building the tour, the optimal tour could be found in polynomial time.

Meanwhile, a P problem can be solved in polynomial time without any luck at all. And as mentioned before, since P is a subset of NP, every P problem can trivially also be solved in polynomial time “with some lucky guesses”; the guesses just aren’t needed.

4 Ways You Can Parallelize Your Bash Script

Originally written on 2015-08-09 at my previous blog

Parallel programming basically means writing scripts that can process multiple tasks at the same time. This is in contrast to running tasks in serial order, where each task must complete before the next one starts.

Parallelization can be useful for boosting performance, but it requires the tasks being parallelized to be independent of one another. Throughout this tutorial, we will treat the function do-something as a task that can safely run in parallel. For didactic purposes, we define do-something as follows:

#!bash
# Sleep for the given number of seconds, then echo that number.
function do-something(){
    sleep "$1"
    echo "$1"
}

There are several ways of parallelizing a bash script, and each has its own pros and cons.

1.

#!bash
for i in $(seq 5 -1 1); do    # count down from 5 to 1
    do-something $i &         # fork each task into the background
done

This is the simplest form of parallelization in bash. The trailing & forks each do-something call into the background as soon as it’s launched.

If you run this parallel script, your output should look something like this:

1
2
3
4
5

Even though our loop counts down from 5 to 1, all five tasks start at essentially the same time, so the do-something task with the shortest sleep finishes first.

Also, note the speed improvement. The serial version runs in 5+4+3+2+1 = 15 seconds, while the parallel version should theoretically finish in 5 seconds (the longest single task). That’s a 3x performance improvement just from attaching &!

However, there is one critical problem. Imagine that, instead of looping through 5 iterations, your script has to process 5 billion tasks, and do-something does something far more intensive. When you run the script, it will attempt to launch all of those tasks at once, which will easily overload the machine. Our next method helps prevent that.

ADVANTAGE: simple

DISADVANTAGE: can potentially overload the system

2.

#!bash
wait_turn=4
j=1
for i in $(seq 10 -1 1); do    # count down from 10 to 1
    do-something $i &
    if ((j++ % wait_turn == 0)); then
        wait    # block until the current batch of background tasks finishes
    fi
done
wait    # make sure the final, partial batch also finishes
echo "after loop"

Running this, our result should look something like this:

7
8
9
10
3
4
5
6
1
2
after loop

Can you see the pattern?

Once wait is called, the shell “waits” until all background tasks have finished. In this case, a batch of four tasks is forked into the background at a time, as specified by wait_turn, and the script pauses until all four processes are complete.

By doing this, we limit the number of tasks processed at any one time. While this also means the whole job takes longer than fully unconstrained parallelism, it won’t take down your machine.

The value of wait_turn here is arbitrary; fiddle with it to find the right number for your own script. Usually, a multiple of the number of CPU cores you have is a good starting point.

A keen observer might have noticed the extra wait after the loop. To illustrate why it’s needed, consider a similar script without the after-loop wait:

#!bash
wait_turn=4
j=1
for i in $(seq 10 -1 1); do
    do-something $i &
    if ((j++ % wait_turn == 0)); then
        wait
    fi
done
echo "after loop"

Before scrolling down, think about what this would output.

7
8
9
10
3
4
5
6
after loop
1
2

Notice how our “after loop” string is printed before do-something 1 and do-something 2 have finished. This could be critical if you have after-loop code that needs the for-loop’s work to be completely done. Likewise, add a wait after the for-loop in the first method if necessary.

Yet even this method can be restrictive. Consider a case where task completion times vary greatly. For our example above, say every fifth task (do-something 5 and do-something 10), for some odd reason, takes 5 and 10 minutes to complete instead of a mere 5 or 10 seconds, respectively. If we run our script again under this new condition, it becomes very inefficient: in the first batch, even though tasks 7, 8, and 9 have already completed, the script still has to wait for the single straggler, do-something 10, to finish! Likewise, in the second batch, the processes that handled tasks 3, 4, and 6 sit idle because nothing can move forward until do-something 5 completes. In other words, if a batch contains an unusually slow task, it becomes the bottleneck and slows down the entire run, almost defeating the purpose of parallelism.

ADVANTAGE: parallelization without overloading the system

DISADVANTAGE: bottlenecks occur and can potentially slow down the entire process

3.

Fortunately, Unix/Linux systems come with a command called xargs that keeps idle workers busy instead of letting them wait on a batch. xargs is not specifically designed for parallelization, but it has a handy -P flag that sets the maximum number of parallel processes that can run at a time. xargs is extremely flexible, so I will not go into too much detail in this article, but here are the basics:

#!bash
seq 10 -1 1 | xargs -n1 -P4 -I {} bash -c "sleep {}; echo {};"

To break it down: seq feeds in the numbers 10 down to 1. xargs splits that input into individual pieces (as specified by -n1, meaning take at most one argument at a time). The -P flag, as mentioned, is what enables parallelization, here up to four processes at once; without it, xargs runs the commands serially. The -I {} option tells xargs to replace every occurrence of {} in the command with the current input chunk, which in our case is an individual number. Finally, xargs runs bash, which in turn executes the command string with all the {} placeholders filled in.

The above snippet produces:

7
8
9
10
5
4
3
6
1
2

Looking at the first batch, as we saw with #2, only up to 4 tasks are processed at a time since we limited xargs to four processes. However, looking at the output after that, you will see that the numbers are no longer grouped into neat batches as before. This is because xargs hands out a new task as soon as a process becomes available.

Indeed, if we allow as many processes as there are inputs, all tasks are handled at the same time, as we would expect.

#!bash
seq 10 -1 1 | xargs -n1 -P10 -I {} bash -c "sleep {}; echo {};"

which rightly produces

1
2
3
4
5
6
7
8
9
10

xargs should be sufficient for most users in most cases. Other than being a tad more complicated than the first two methods above, xargs provides an efficient, easy way to write a parallel script.

ADVANTAGE: eliminates bottleneck

DISADVANTAGE: none for most users. See #4 for more

4.

However, some users might need something more powerful, in which case I introduce you to GNU Parallel. Thankfully, the GNU Parallel documentation already details the reasons why you might want to switch over from xargs.

The parallel equivalent of

#!bash
seq 10 -1 1 | xargs -P4 -I {} bash -c "sleep {}; echo {};"

is

#!bash
seq 10 -1 1 | parallel -P4 "sleep {}; echo {};"

You can see that parallel’s syntax is a lot simpler than that of xargs.

ADVANTAGE: refer to the comparison between xargs and GNU Parallel; to name a few, flexibility and simplicity

DISADVANTAGE: external dependency

Hello World!

I am migrating from a GitHub Pages + Pelican setup to NameCheap + WordPress.

Given that my primary interest in this website has shifted from a portfolio meant to impress employers to a blog meant to impress clients, I decided the latter combo makes more sense, since WordPress is a well-renowned CMS that *specializes* in content creation.

Pelican worked well when I didn’t mind hacking, since I considered that part of learning. I would configure Arch Linux for days to achieve the “perfect” setup, or learn Lisp just so I could build my own “perfect” Emacs. But I have since switched to a MacBook and IDEs (+ Vim bindings) that just work and actually let me *work* instead of *fix*.

Likewise, I am hoping this switch will let me write more content and start grinding words out rather than fiddling with the terminal and running several commands to publish one article. With WordPress, it’s just a button click.