
I have a hard time distinguishing the terminology around Spark SQL. While Spark SQL is quite flexible in terms of abstraction layers, it is really difficult for a beginner to navigate those options.

1. When we say "using Spark SQL to perform ...", does it mean we can query through any API/abstraction layer, such as Scala, Python, or HiveQL? As long as the underlying DataFrame is in Spark, we should be fine?

2. Can we manipulate data in both PySpark and Scala sequentially?

For example, could I clean up the data in Scala, then perform follow-up manipulation in PySpark, then go back to Scala?

3. As demonstrated in the tutorial, we can run a SQL command through the API spark.sql("My SQL command"). Does that count as SQL or as Spark?

 


1 Answer

Best answer
The fact is that the engine is the same regardless of which interface language you use. For some tasks, such as specialized cleaning, there may be no suitable SQL command, and we have to use Scala or Python. Using Zeppelin, you can switch back and forth among the languages the engine supports, though that is not a common practice. For some tasks you can use pure Spark SQL, and if you want to run SQL from PySpark or Scala, there are functions that help you achieve that.

I believe observing more examples will help you understand when you can use what.
