Data Science Foundation

An Overview of the HEAVY.AI Integrated Data Science Foundation

HEAVY.AI provides an integrated data science foundation built on several open-source components of the PyData stack. This set of tools is integrated with Heavy Immerse and lets you switch from dashboards to a notebook environment connected to HeavyDB in the background. You can move from visual data exploration in Immerse to a deeper dive on a specific dataset, build predictive models using standard Python-based data science libraries and tools, and push results back into HeavyDB for use with Immerse.

Several components make up the HEAVY.AI data science foundation.

JupyterLab

HEAVY.AI provides deep integration with JupyterLab, the next-generation version of the most popular notebook environment and workflow used by data scientists for interactive computing. You can access JupyterLab by clicking an icon in Immerse.

In addition to the seamless integration with Immerse, you can also use JupyterLab with HEAVY.AI by creating an explicit connection object, either via the heavyai API:

>>> from heavyai import connect
>>> con = connect(user="admin", password="HyperInteractive", host="localhost",
...               dbname="heavyai")
>>> con
Connection(mapd://admin:***@localhost:6274/heavyai?protocol=binary)

or via the Ibis-heavyai API, which builds on heavyai:

import ibis  # requires the ibis-heavyai backend package to be installed

con = ibis.heavyai.connect(
    host='localhost',
    database='ibis_testing',
    user='admin',
    password='HyperInteractive',
)

For more information, see the JupyterLab documentation.

heavyai

The heavyai client interface provides a Python DB API 2.0-compliant HEAVY.AI interface. In addition, it provides methods to get results in the Apache Arrow-based GDF format for efficient data interchange.

Documentation

See the heavyai repository on GitHub for documentation.

Examples

Create a Cursor and Execute a Query

Step 1: Create a connection

>>> from heavyai import connect
>>> con = connect(user="heavyai", password= "HyperInteractive", host="my.host.com", dbname="heavyai")

Step 2: Create a cursor

>>> c = con.cursor()
>>> c

Step 3: Query database table of flight departure and arrival delay times

>>> c.execute("SELECT depdelay, arrdelay FROM flights LIMIT 100")

Step 4: Display number of rows returned

>>> c.rowcount
100

Step 5: Display the Description objects list

Each entry in the list is a named tuple with the attributes required by the DB API 2.0 specification. There is one entry per returned column, and heavyai fills in the name, type_code, and null_ok attributes.

>>> c.description
[Description(name=u'depdelay', type_code=0, display_size=None, internal_size=None, precision=None, scale=None, null_ok=True), Description(name=u'arrdelay', type_code=0, display_size=None, internal_size=None, precision=None, scale=None, null_ok=True)]

Step 6: Iterate over the cursor, returning a list of tuples of values

>>> result = list(c)
>>> result[:5]
[(1, 14), (2, 4), (5, 22), (-1, 8), (-1, -2)]
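
Because the cursor interface follows DB API 2.0, the steps above can be tried without a HeavyDB server. The following is a minimal sketch, not heavyai itself: it assumes Python's built-in sqlite3 module as a stand-in for heavyai.connect, and the flights table and its five rows are made-up sample data. It also shows fetchmany and fetchall, the standard DB API alternatives to iterating over the cursor.

```python
import sqlite3

# Stand-in for heavyai.connect(...): sqlite3 also implements DB API 2.0.
con = sqlite3.connect(":memory:")
c = con.cursor()

# Made-up sample data so the query below has something to return.
c.execute("CREATE TABLE flights (depdelay INTEGER, arrdelay INTEGER)")
c.executemany(
    "INSERT INTO flights VALUES (?, ?)",
    [(1, 14), (2, 4), (5, 22), (-1, 8), (-1, -2)],
)

# Same query shape as in Step 3.
c.execute("SELECT depdelay, arrdelay FROM flights LIMIT 100")

# The first field of each description entry is the column name.
column_names = [col[0] for col in c.description]

# Fetch results in batches instead of materializing everything at once.
rows = c.fetchmany(3)
remaining = c.fetchall()

con.close()
```

Only the connection line is specific to the stand-in; the cursor calls are the ones the DB API 2.0 specification defines.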

Select Data into a GpuDataFrame Provided by cuDF (Formerly pygdf)

Step 1: Create a connection to local HEAVY.AI instance

>>> from heavyai import connect
>>> con = connect(user="heavyai", password="HyperInteractive", host="localhost",
...               dbname="heavyai")

Step 2: Query the database table of flight departure and arrival delay times into a GpuDataFrame

>>> query = "SELECT depdelay, arrdelay FROM flights_2008_10k limit 100"
>>> df = con.select_ipc_gpu(query)

Step 3: Display results

>>> df.head()
  depdelay arrdelay
0       -2      -13
1       -1      -13
2       -3        1
3        4       -3
4       12        7
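
On a machine without a GPU, a similar call can return a CPU dataframe instead. The helper below is a hedged sketch: query_to_dataframe is a hypothetical name, and it assumes the connection also offers select_ipc, which returns a pandas DataFrame over CPU shared memory. Both IPC methods are expected to need the client and the HeavyDB server on the same host.

```python
def query_to_dataframe(con, query, use_gpu=True):
    """Return query results as a dataframe through heavyai's IPC methods.

    `con` is assumed to be a heavyai connection object.
    """
    if use_gpu:
        # GPU path from the example above: results arrive as a GPU dataframe.
        return con.select_ipc_gpu(query)
    # CPU path (assumption: select_ipc returns a pandas DataFrame).
    return con.select_ipc(query)
```

For example, query_to_dataframe(con, query, use_gpu=False).head() would show the same first rows as Step 3, held in host memory.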

Remote Backend Compiler (RBC)

Using Python, you can interact with databases in multiple ways. Libraries like SQLAlchemy provide a translation mechanism that converts Python to SQL; this is an example of an ORM (Object-Relational Mapping). With SQLAlchemy and similar approaches, user interactions with the database are simplified—and optimized—as a set of high-level functions provided by the ORM. Unfortunately, to run tasks not supported by the ORM, you need to write SQL code.

You can define your own SQL functions in HeavyDB, but to realize the full power of HeavyDB, you have to recompile the engine to add your functions. As an alternative that also lets your functions execute on GPUs, HeavyDB supports User Defined Functions (UDFs) and User Defined Table Functions (UDTFs). A UDF operates on individual elements of a table; a UDTF operates on an entire table.

The Remote Backend Compiler (RBC) package provides a Python interface to define UDFs and UDTFs easily. Any UDF or UDTF written in Python can be registered at run time on the HeavyDB server and subsequently used in any SQL query by any client.

Functions are not persisted in the database and must be registered again if the server is restarted.

Internally, the RBC converts the Python function to an intermediate representation (IR), which is then sent to the server. The IR is compiled for a CPU or a GPU, depending on the specified hardware resources.
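
The registration flow described above might look like the following sketch. It rests on assumptions: that the rbc package exposes RemoteHeavyDB in rbc.heavydb, that a signature string such as "double(double)" declares the SQL types, and that a HeavyDB server is reachable with the credentials shown. The names fahrenheit_to_celsius and register_udfs are hypothetical. The import is kept inside the function so the UDF body can be checked locally without a server.

```python
def fahrenheit_to_celsius(f):
    """Scalar UDF body: plain Python arithmetic applied to one value per row."""
    return (f - 32.0) * 5.0 / 9.0


def register_udfs(host="localhost", port=6274):
    """Register the function on a running HeavyDB server (not called here)."""
    # Assumption: the rbc package provides RemoteHeavyDB in rbc.heavydb.
    from rbc.heavydb import RemoteHeavyDB

    heavydb = RemoteHeavyDB(user="admin", password="HyperInteractive",
                            host=host, port=port)
    # The signature string declares a double -> double scalar function.
    heavydb("double(double)")(fahrenheit_to_celsius)
    # Send the compiled IR to the server; repeat this after a server restart.
    heavydb.register()
    return heavydb


# The UDF body is ordinary Python, so it can be sanity-checked locally first.
boiling_point_celsius = fahrenheit_to_celsius(212.0)
```

Once registered, the function could be called from SQL by name, for example in a SELECT over a temperature column (the table and column would be your own).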

Ibis is an ORM that supports defining UDFs in C++ for some types of databases. However, it doesn’t provide a Python interface for defining them.

Ibis

Ibis is a productivity API for working in Python and analyzing data in remote SQL-based data stores such as HeavyDB. Inspired by the pandas toolkit for data analysis, Ibis provides a Pythonic API that compiles to SQL. Combined with HeavyDB's scale and speed, Ibis offers a familiar but more powerful method for analyzing very large datasets "in-place."

Ibis supports multiple SQL database backends, and also supports pandas as a native backend. Combined with Altair, this integration allows you to explore multiple datasets across different data sources.
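
As an illustration of the pandas-like style, the sketch below builds a grouped aggregation as an Ibis expression. It is an assumption-laden example: mean_delay_by_carrier is a hypothetical helper, con is assumed to be an Ibis connection such as the one created with ibis.heavyai.connect, and carrier_name is a hypothetical grouping column alongside the depdelay and arrdelay columns used earlier.

```python
def mean_delay_by_carrier(con, table_name="flights_2008_10k"):
    """Build (but do not execute) an Ibis expression for average delays."""
    # Obtain a lazy table reference; no data is transferred at this point.
    t = con.table(table_name)
    # carrier_name is a hypothetical grouping column used for illustration.
    expr = t.group_by("carrier_name").aggregate(
        mean_depdelay=t.depdelay.mean(),
        mean_arrdelay=t.arrdelay.mean(),
    )
    return expr
```

Calling execute() on the returned expression would ask Ibis to generate the SQL, run it on the backend, and return the result as a pandas DataFrame, so only the aggregated rows leave the database.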

Altair

Altair is another key component of the HEAVY.AI data science foundation. Building on the same Vega data visualization engine used by Immerse for geospatial charts, Altair provides a Pythonic API over Vega-Lite, a subset of the full Vega specification for declarative charting based on the "Grammar of Graphics" paradigm. The HEAVY.AI data science foundation goes further and includes interface code that lets Altair transparently use Ibis expressions instead of pandas data frames. This allows data visualization over much larger datasets in HEAVY.AI without writing SQL code.

NVIDIA RAPIDS

The NVIDIA RAPIDS toolkit is a collection of foundational libraries for GPU-accelerated data science and machine learning. It includes popular algorithms for clustering, classification, and linear models, as well as a GPU-based dataframe (cuDF). HEAVY.AI allows configurable output to cuDF from any query (including via Ibis or heavyai), so you can quickly run machine-learning algorithms on top of query results from HEAVY.AI.

Other Tools and Utilities

In addition, the data science foundation Docker container includes Facebook's Prophet library for forecasting, and Prefect, a lightweight but powerful workflow engine that enables you to build and manage workflows in Python.
