
Write composable Spark SQL analytics in JupyterLab | by Jean-Claude Cote | Jul, 2022



Put a bit of magic in your Spark SQL notebooks using JupyterLab SQL cell magic editor

Photo by Julia Kadel on Unsplash

jupyterlab-sql-editor is an IPython magic that empowers you to write composable analytics in Spark SQL. This JupyterLab extension has a rich set of features:

  • Auto-completion of
    – table names, column names, nested sub-fields
    – table joins
    – functions
  • SQL syntax highlighting and formatting
  • Multiple display outputs
    – HTML and plain text
    – interactive data grid and JSON tree
    – capture of dataframes and views

jupyterlab-sql-editor extends JupyterLab with SQL formatting and syntax highlighting.

jupyterlab-sql-editor also registers an LSP server which provides autocompletion. Press the Tab key to trigger autocompletion of table and column names, including nested sub-fields. Use the %%sparksql --refresh all command to update the local cache file used by the autocompleter. The refresh command enumerates the functions, tables, and columns found in the current Spark context.

Autocomplete Spark SQL functions with documentation and usage examples.

There are also a few power user features like auto filling all column names in the SELECT statement and auto suggesting JOIN conditions on matching column names.

The sparksql magic is a convenient way of executing Spark SQL statements. In this example we execute a statement and print the results as text.

%%sparksql --output text
SELECT * FROM range(1, 1000)

We can do the same thing in Python.

statement = 'select * from range(1, 1000)'
df = spark.sql(statement)
df.show()

As you can see, it’s not really magic. The sparksql magic takes the body of a cell as the statement, executes it, and prints the results. The extension simply encapsulates boilerplate Python code and makes it readily available via command-line options.
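To illustrate the idea, here is a minimal sketch (not the actual jupyterlab-sql-editor implementation) of how a cell magic can hide that boilerplate behind an output option. It assumes a running SparkSession named spark, and the magic name toysql is made up for this example:

from IPython.core.magic import register_cell_magic

@register_cell_magic
def toysql(line, cell):
    # Toy stand-in for %%sparksql: run the cell body as Spark SQL
    # and honor a --output text option on the magic line.
    df = spark.sql(cell)  # assumes a SparkSession named `spark` exists
    if '--output text' in line:
        df.show()
    else:
        return df.toPandas()  # Jupyter renders this as an HTML table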

Jupyterlab-sql-editor supports a rich set of output formats such as an interactive data grid. We use Bloomberg’s ipydatagrid which can post-process (sort, filter, and search) the results returned by the Spark query.
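For example, a usage sketch with the extension’s --output grid option:

%%sparksql --output grid
SELECT id, id % 7 AS bucket
FROM range(1, 1000)

The resulting widget lets you sort, filter, and search the rows without re-running the Spark query.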

A tree widget displays the schema of the results. This is particularly useful for columns containing complex nested data structures.

Use the JSON output to navigate the contents of complex nested columns. This output uses the standard IPython tree widget, which can search in any of the nested fields.
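For instance, a sketch using the --output json option on a synthetic nested column:

%%sparksql --output json
SELECT id, named_struct('square', id * id, 'even', id % 2 = 0) AS nested
FROM range(1, 10)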

So far we have mostly seen how to query data; however, you are not limited to queries. You can perform any SQL statement supported by Spark SQL.

%%sparksql
CREATE TABLE prod.db.sample (
    id bigint COMMENT 'unique id',
    data string)
USING iceberg

%%sparksql
ALTER TABLE prod.db.sample
ADD COLUMNS (
    new_column string COMMENT 'new_column docs'
)

Use the line magic %sparksql for simple one-liner statements:

%sparksql SET spark.sql.variable.substitute=false
%sparksql DROP TABLE prod.db.sample
%sparksql CALL prod.system.set_current_snapshot('db.sample', 1)

Common table expressions (CTEs) are a standard way of breaking large SQL statements into more manageable pieces. jupyterlab-sql-editor not only supports CTEs but also lets you capture or alias SQL statements as views, which can then be reused in later cells. This mechanism takes advantage of the Spark dataframe’s createOrReplaceTempView function. Use the --view option to create a temporary view. To prevent immediate execution of the query, use the --output skip option. When this option is used, only the view is created. A sketch of the pattern follows.
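Here the view name even_ids is just an illustration: the first cell only registers the view, and a later cell reuses it.

%%sparksql --view even_ids --output skip
SELECT id FROM range(1, 1000)
WHERE id % 2 = 0

%%sparksql
SELECT count(*) AS num_even FROM even_ids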

Autocompletion also works on temporary views. Use the %%sparksql --refresh local option to update your autocomplete cache file with any local views you have created.

So far we have seen how to use SQL views to create composable SQL analytics. However, you are not limited to SQL; you can switch from SQL to dataframes and back. Parts of your analysis might be better suited to the dataframe API, while others might lend themselves better to the SQL dialect. jupyterlab-sql-editor makes it very easy to switch between the two. Use the --dataframe option to capture the result of a SQL statement as a dataframe.
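For instance (the dataframe name df_even is hypothetical), capture a query as a dataframe:

%%sparksql --dataframe df_even
SELECT id FROM range(1, 1000)
WHERE id % 2 = 0

The variable df_even is now an ordinary Spark dataframe, so a later cell can continue in Python:

df_even.filter(df_even.id > 500).count()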

Given any dataframe, switch back to SQL by calling df.createOrReplaceTempView(). For example, you might be using Spark ML, Spark GraphFrames, or simply a data source like CSV. Many of the libraries in Spark create dataframes.

df = spark.read.csv(path)
df.createOrReplaceTempView('my_csv_table')

Then refer to the view you created in %%sparksql:

%%sparksql
SELECT * FROM my_csv_table

Remember to update your local cache by running %%sparksql --refresh local and enjoy the autocompleter!

jupyterlab-sql-editor has two main dependencies: jupyterlab-lsp and sql-language-server. Installing the extension is very easy.

pip install jupyterlab-lsp jupyterlab-sql-editor

Install the sql-language-server project to provide autocompletion.

sudo npm install -g sql-language-server

For a complete list of configuration options see the detailed installation guide.

In this article we showed how to leverage %%sparksql to easily write composable analytics in Spark SQL. We focused on Spark; however, jupyterlab-sql-editor also includes a %%trino magic! More SQL engines might be added in the future. Contributions are welcome! Here’s our git repo: CybercentreCanada/jupyterlab-sql-editor.

In a follow-up article we will cover jupyterlab-sql-editor’s support for SQL templating using Jinja and dbt.

Thanks to the contributors of these projects:

krassowski/jupyterlab-lsp
joe-re/sql-language-server
zeroturnaround/sql-formatter
cryeo/sparksql-magic
bloomberg/ipydatagrid



