Q. Why is Spark faster than Hadoop?

A: Hadoop works on MapReduce. In MapReduce, the output of every task is written back to disk, which increases the number of I/O operations. In addition, the executor is killed after every task in MapReduce.

Spark, on the other hand, uses in-memory processing. When an action is triggered, it reads the data from disk and keeps it in memory until the job is completed. This reduces the number of I/O operations and increases execution speed. Executors are also not killed after every task in Spark; they stay alive for the whole application.
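As a rough illustration (a minimal Scala sketch, assuming Spark is on the classpath; the Parquet file at /data/events.parquet is hypothetical), the snippet below caches a DataFrame so that repeated actions reuse the in-memory copy instead of re-reading from disk:

```scala
import org.apache.spark.sql.SparkSession

object InMemoryDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("in-memory-demo")
      .getOrCreate()

    // Read once from disk (the path is hypothetical)
    val df = spark.read.parquet("/data/events.parquet")

    // Keep the data in memory so later actions reuse it
    // instead of going back to disk for every job
    df.cache()

    println(df.count()) // first action materializes the cache
    df.show(10)         // second action is served from memory (assuming it fits)

    spark.stop()
  }
}
```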


Q. What are the advantages of Spark over Hadoop?

A: 1. Spark uses in-memory processing, whereas Hadoop reads from and writes to the hard disk. This makes Spark process data much faster than Hadoop.

2. In Hadoop we need to depend on third-party tools:

for SQL -> Hive,

for streaming -> Flume,

for ML -> ( ).

Spark provides built-in libraries for all of these (Spark SQL, Spark Streaming, MLlib); a short sketch follows this list.

3. Hadoop natively supports only the Java programming language. Spark supports Java, Scala, Python, and R.
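To illustrate point 2, here is a minimal Scala sketch (the input path /data/sales.json and its columns are hypothetical) that runs a SQL query with Spark's built-in Spark SQL instead of a separate Hive installation:

```scala
import org.apache.spark.sql.SparkSession

object BuiltInSqlDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("builtin-sql-demo")
      .getOrCreate()

    // Spark SQL ships with Spark itself -- no third-party SQL engine needed.
    // The file path and column names below are hypothetical.
    val sales = spark.read.json("/data/sales.json")
    sales.createOrReplaceTempView("sales")

    spark.sql("SELECT product, SUM(amount) AS total FROM sales GROUP BY product")
      .show()

    spark.stop()
  }
}
```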

Q. What is Spark?

  • It is an open-source, distributed, in-memory big data processing engine.
  • It can process both batch and streaming data (see the sketch below).
  • Spark is written in Scala.
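As a small illustration of the batch-and-streaming point (a Scala sketch; the directories /data/batch/ and /data/incoming/ are hypothetical), the same DataFrame API can read a static dataset and a stream of newly arriving files:

```scala
import org.apache.spark.sql.SparkSession

object BatchAndStreamingDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("batch-and-streaming-demo")
      .getOrCreate()

    // Batch: read a static directory of CSV files (hypothetical path)
    val batchDf = spark.read.option("header", "true").csv("/data/batch/")
    batchDf.show(5)

    // Streaming: treat files arriving in a directory as an unbounded stream
    val streamDf = spark.readStream
      .option("header", "true")
      .schema(batchDf.schema) // streaming sources need an explicit schema
      .csv("/data/incoming/")

    // Print each micro-batch to the console as new files arrive
    val query = streamDf.writeStream
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```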
Q. Explain Spark Architecture?
A: The architecture of Spark is explained clearly in the link below; please follow it.

👉 Spark Architecture


Q. What is the difference between SparkContext (SC) and SparkSession (SS)?
A: From the architecture, you might have understood that every Spark application contains one SC, which lives in the driver and communicates with the cluster manager and the executors.
Suppose we come across a scenario where multiple users want to use the same cluster, and every user wants their own configurations, access to the same tables, and transformations that do not affect the others.
In Spark 1.x we would have created a separate SC for every user, which increases the infrastructure cost.
In Spark 2.x a concept called SparkSession was introduced. For one SparkContext, we can create multiple SS, and each user can use their own SS with their required properties and configurations.

Cluster-level resources are available to all the SS. Whatever transformations we do in an SS are isolated to that session and do not affect other sessions. Once a session is closed, everything created in it (such as temporary views and session configurations) disappears.

In Spark 1.x we had to create separate contexts for core Spark, Hive, SQL, and Streaming (SparkContext, HiveContext, SQLContext, StreamingContext). With SS we need not create these separately; everything is taken care of by the SparkSession.
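A minimal Scala sketch of this relationship (Spark 2.x or later): two sessions created with newSession() share one underlying SparkContext, but a temp view registered in one session is not visible in the other.

```scala
import org.apache.spark.sql.SparkSession

object SessionIsolationDemo {
  def main(args: Array[String]): Unit = {
    // One driver, one underlying SparkContext
    val spark = SparkSession.builder()
      .appName("session-isolation-demo")
      .getOrCreate()

    // Each user gets an isolated session: separate SQL configurations and
    // temp views, but the same SparkContext and cluster resources underneath
    val userA = spark.newSession()
    val userB = spark.newSession()

    userA.range(5).createOrReplaceTempView("numbers")

    // Visible in userA's session ...
    userA.sql("SELECT * FROM numbers").show()

    // ... but not in userB's session, even though both share one SparkContext
    assert(!userB.catalog.tableExists("numbers"))
    assert(userA.sparkContext eq userB.sparkContext)

    spark.stop()
  }
}
```

This is how multiple users can share the same cluster resources while keeping their configurations and transformations isolated from each other.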