
2020-12-31

Python-friendly dtypes for pyspark dataframes

When using pyspark, most of the JVM core of Apache Spark is hidden from the python user. A notable exception is the DataFrame.dtypes attribute, which contains JVM-format string representations of the data types of the DataFrame columns. While for the atomic data types the translation to python data types is trivial, for the composite data types the string representations can quickly become unwieldy (e.g. when using the elasticsearch-hadoop InputFormat).

from pyspark.sql import functions as f, types as t, Row

# Create DataFrame ('spark' is an existing SparkSession, e.g. from the pyspark shell)
ints = [1,2,3]
arrays = [[1,2,3], [4,5,6], [7,8,9]]
maps = [{'a':1}, {'b':2}, {'c':3}]
rows = [Row(x=1, y='1'), Row(x=2, y='2'), Row(x=3, y='3')]
composites = [Row(x=1, y=[1,2], z={'a':1, 'b':2}), Row(x=2, y=[3,4], z={'a':3, 'b':4})]
df = spark.createDataFrame(zip(ints, arrays, maps, rows, composites))

# Show standard dtypes
for x in df.dtypes:
    print(x)

# ('_1', 'bigint')
# ('_2', 'array<bigint>')
# ('_3', 'map<string,bigint>')
# ('_4', 'struct<x:bigint,y:string>')
# ('_5', 'struct<x:bigint,y:array<bigint>,z:map<string,bigint>>')

# Show python types in collected DataFrame
row = df.collect()[0]
for x in row:
    print(type(x))

# <class 'int'>
# <class 'list'>
# <class 'dict'>
# <class 'pyspark.sql.types.Row'>
# <class 'pyspark.sql.types.Row'>

# Show python types passed to a user defined function
def python_type(x):
    return str(type(x))

udf_python_type = f.udf(python_type, t.StringType())
row = df \
    .withColumn('_1', udf_python_type('_1')) \
    .withColumn('_2', udf_python_type('_2')) \
    .withColumn('_3', udf_python_type('_3')) \
    .withColumn('_4', udf_python_type('_4')) \
    .withColumn('_5', udf_python_type('_5')) \
    .collect()[0]
for x in row:
    print(x)

# <class 'int'>
# <class 'list'>
# <class 'dict'>
# <class 'pyspark.sql.types.Row'>
# <class 'pyspark.sql.types.Row'>

While the dtypes attribute shows the data types in terms of the JVM StructType, ArrayType and MapType classes, the python programmer gets to see the corresponding python types when collecting the DataFrame or passing a column to a user defined function.

To fill this gap in type representations, this blog presents a small utility that translates the content of the dtypes attribute to a data structure with string representations of the corresponding python types. The utility can be found as a gist on github, but is also listed below:

#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.
#
import re
import string

import pyparsing


def pysql_dtypes(dtypes):
    """Represents the spark-sql dtypes in terms of python [], {} and Row()
    constructs.
    :param dtypes: [(string, string)] result from pyspark.sql.DataFrame.dtypes
    :return: [(string, string)]
    """

    # Recursively translate the parse tree: array -> [], map -> {}, struct -> Row()
    def assemble(nested):
        cur = 0
        assembled = ''
        while cur < len(nested):
            parts = re.findall(r'[^:,]+', nested[cur])
            if not parts:
                parts = [nested[cur]]
            tail = parts[-1]
            if tail == 'array':
                assembled += nested[cur][:-5] + '['
                assembled += assemble(nested[cur+1])
                assembled += ']'
                cur += 2
            elif tail == 'map':
                assembled += nested[cur][:-3] + '{'
                assembled += assemble(nested[cur+1])
                assembled += '}'
                cur += 2
            elif tail == 'struct':
                assembled += nested[cur][:-6] + 'Row('
                assembled += assemble(nested[cur+1])
                assembled += ')'
                cur += 2
            else:
                assembled += nested[cur]
                cur += 1
        return assembled

    # Grammar: 'words' are runs of printable characters other than the angle
    # brackets; nested types appear between '<' and '>'
    chars = ''.join([x for x in string.printable if x not in ['<', '>']])
    word = pyparsing.Word(chars)
    parens = pyparsing.nestedExpr('<', '>', content=word)
    dtype = word + pyparsing.Optional(parens)

    result = []
    for name, schema in dtypes:
        tree = dtype.parseString(schema).asList()
        pyschema = assemble(tree).replace(',', ', ').replace(',  ', ', ')
        result.append((name, pyschema))
    return result

The pysql_dtypes() function starts with building a simple grammar using the pyparsing package, to parse the dtypes as given by pyspark. Central in the grammar are the special characters '<' and '>' that are used to recognize nested types in the array<>, map<> and struct<> constructs. Note that these characters cannot occur in JVM or python field names. The pyparsing.nestedExpr() method takes care of any multi-level nesting. Words are defined as arbitrary successions of printable characters with the exception of the angle brackets (because we use the output of DataFrame.dtypes as input, we assume that we will not encounter any weird characters). Finally, a word occurs either at the start of a JVM type representation (in the case of so-called atomic types) or within angle brackets.
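
As a quick check, re-using the grammar definitions from the listing above, parsing the nested dtype from the earlier example should print a nested list of 'words' along the following lines (the commented output is what the grammar is expected to produce):

import string
import pyparsing

chars = ''.join([x for x in string.printable if x not in ['<', '>']])
word = pyparsing.Word(chars)
parens = pyparsing.nestedExpr('<', '>', content=word)
dtype = word + pyparsing.Optional(parens)

print(dtype.parseString('struct<x:bigint,y:array<bigint>,z:map<string,bigint>>').asList())
# ['struct', ['x:bigint,y:array', ['bigint'], ',z:map', ['string,bigint']]]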

The assemble() function translates the parsed JVM type representation into corresponding python types and re-assembles them. This function is recursive because the nested expressions can have arbitrary depth. It splits the earlier defined 'words' into parts separated by a ',' or ':' character and then applies a simple recipe for the assembly of the parts for the various possible combinations of the recognized array, map and struct constructs. The gist on github also contains a test suite that provides many sample outputs of the pysql_dtypes() function. The last code sample in this post applies pysql_dtypes() to the DataFrame from the example used earlier.
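
To see the central splitting and assembly recipe in isolation, consider what happens to the 'x:bigint,y:array' word from the nested list above (both lines mirror the assemble() logic):

import re

re.findall(r'[^:,]+', 'x:bigint,y:array')
# ['x', 'bigint', 'y', 'array'] -> the tail 'array' selects the list recipe

'x:bigint,y:array'[:-5] + '[' + 'bigint' + ']'
# 'x:bigint,y:[bigint]'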

from pysql_dtypes import pysql_dtypes

for x in pysql_dtypes(df.dtypes):
    print(x)

# ('_1', 'bigint')
# ('_2', '[bigint]')
# ('_3', '{string, bigint}')
# ('_4', 'Row(x:bigint, y:string)')
# ('_5', 'Row(x:bigint, y:[bigint], z:{string, bigint})')

The pysql_dtypes() function will be proposed to the Apache Spark project, so if you would like the utility to become available as part of pyspark, be sure to star the gist on github and add yourself as a watcher to the corresponding issue (requires an Apache Jira account).



2020-06-11

Proposing authorization for Gremlin Server

Introduction

Gremlin Server is an important part of the Apache TinkerPop ecosystem, supporting various ways of accessing graph systems. In particular, Gremlin Server offers remote access over websocket connections to client applications based on a Gremlin Language Variant (GLV). In addition, Gremlin Server caters for various security mechanisms such as authentication, read-only access, subgraph access, timeouts and sandboxing.
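
As a quick illustration of this GLV-based remote access, the minimal gremlin-python sketch below connects to a Gremlin Server and runs a traversal; the server address and the traversal source name 'g' are assumptions for the example, not taken from any specific setup.

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# connect to a (hypothetical) Gremlin Server over a websocket connection
remote = DriverRemoteConnection('ws://localhost:8182/gremlin', 'g')
g = traversal().withRemote(remote)

print(g.V().count().next())  # run a simple traversal remotely
remote.close()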

While this sounds like a comprehensive suite of features, there is a catch: it is not possible to discriminate between authenticated users in providing or limiting access. In other words, Gremlin Server does not provide authorization.

Recent interest in authorization solutions became apparent in the TinkerPop-related user forums.

Some authentication and authorization details

The definitions of authentication (is this the user who he/she claims to be?) and authorization (what is this user allowed to do?) are generally known, but the following characteristics of the underlying processes are not always immediately realized:
  1. authentication is a one-off procedure when logging in to an application, while authorization is required at the start of each new usage context (e.g. a database query or selecting some feature of an application).
  2. authorization is implicit in an application that has a private store of authentication credentials of a limited set of users (like Gremlin Server's SimpleAuthenticator). In this case, the authorization has a binary nature: the user either is or is not allowed to run every possible operation inside the application.
  3. managing authorizations for an instance of an application in a dedicated store is time-consuming and error-prone. Scalable authorization solutions manage accessible resources and allowed operations per user in a centralized store (e.g. LDAP, Apache Ranger, Active Directory).

What to authorize

Before zooming in on specific characteristics of Gremlin Server, it is useful to consider the authorization needs of graph applications in general. Like for relational databases, access to data can be granted at a number of levels. Because authorization schemes in SQL are more developed than in graph systems (the SQL language standard even has the GRANT/REVOKE keywords), we will pursue the analogy a bit further; see the table below.

 Level    | SQL RDBMS                                  | graph system
 ---------|--------------------------------------------|-------------------------------------------
 database | GRANT CONNECT ON moviedb TO users;         | access to subset of graphs
 table    | GRANT SELECT ON titles TO users;           | access to subset of vertices and edges,
          |                                            | based on their labels
 column   | GRANT SELECT ON titles.director TO users;  | access to subset of vertex and edge
          |                                            | properties, based on their property keys
 row      | CREATE POLICY age_policy FOR SELECT        | access to vertices and edges in a graph,
          | ON titles TO child_users                   | based on a predicate on property values
          | USING (age_limit < 18);                    |

Some further comments on this table:
  • The row-level security construct shown is specific to PostgreSQL, but other RDBMSs have similar constructs.
  • Apart from the CONNECT operation at the database level and the SELECT operation at the table and row level, various other operations exist, e.g. for creation/insertion, deletion, etc.
  • The principle of using ACL-like (meta)properties on an object (vertex, edge, property) for column and row level security is already mentioned in the Apache TinkerPop reference documentation.
  • The fact that RDBMSs can be used to implement graph systems suggests that these authorization types should suffice for a graph system. Of course, there may be some additional possible permissions, like the number of results a query may return or the maximum duration of a query.

Impact on Gremlin Server

Gremlin Server already has some functionalities in place that can be leveraged to implement authorization. Gremlin Server exposes a graph to GLV clients in the form of a GraphTraversalSource. A GraphTraversalSource can have so-called strategies applied to it, i.e. automatically applied traversal steps, some of which can limit access to the graph. Of particular interest in this respect are the ReadOnlyStrategy and the SubgraphStrategy. The ReadOnlyStrategy raises an exception when a traversal intends to mutate the underlying graph. The SubgraphStrategy adds filter steps to a traversal, using predicates on the vertex and edge labels and vertex properties. Custom strategies for time limits or result size limits could easily be constructed.
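
As an illustration of how these strategies restrict a GraphTraversalSource, the gremlin-python sketch below applies them on the client side; in the authorization scenario the Gremlin Server admin would instead apply them server-side in the graph's initialization script. The server address and the 'person' label are assumptions for the example.

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.process.strategies import ReadOnlyStrategy, SubgraphStrategy
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

g = traversal().withRemote(
    DriverRemoteConnection('ws://localhost:8182/gremlin', 'g'))

# read-only access, limited to vertices with the 'person' label
g_restricted = g.withStrategies(
    ReadOnlyStrategy(),
    SubgraphStrategy(vertices=__.hasLabel('person')))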

The Gremlin Server admin can configure different GraphTraversalSource instances for different groups of users and apply suitable strategies per user group. Then, an authorization module can easily check whether granted permissions are consistent with the GraphTraversalSource a user requests to access.

Apart from the bytecode requests with GLVs over websockets, Gremlin Server also accepts older-style string-based requests, which are run as regular groovy scripts. This access mode does not lend itself to implementing authorization policies based on the ReadOnlyStrategy and SubgraphStrategy, because a groovy script running on Gremlin Server can simply access the underlying graph with a "g.getGraph()" call. While bytecode requests do not suffer from this direct link to the underlying TinkerPop APIs, they could still circumvent the authorization policies by using lambdas or by simply removing the controlling strategies. This implies that authorization in Gremlin Server requires some bytecode inspection to prevent usage of these mechanisms.
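
To sketch what such an inspection could look like, the fragment below examines the client-side gremlin-python Bytecode object for the two mechanisms mentioned above. This is only a conceptual illustration: the real inspection would run server-side (in Java) and would also have to descend into nested child traversals.

from gremlin_python.process.traversal import Bytecode

def request_is_suspect(bytecode: Bytecode) -> bool:
    """Flag traversals that remove strategies or carry lambdas (simplified)."""
    # removing the controlling strategies shows up as a 'withoutStrategies' source instruction
    for instruction in bytecode.source_instructions:
        if instruction[0] == 'withoutStrategies':
            return True
    # lambdas are passed as python callables among the bytecode arguments
    for instruction in bytecode.step_instructions:
        if any(callable(arg) for arg in instruction[1:]):
            return True
    return False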

Proposal

Introducing authorization to Gremlin Server need not, and probably should not, be a big-bang operation. The current proposal has a minimal first step in its scope, which includes:
  • Open up Gremlin Server to authorization plugins in a way analogous to the authentication plugin mechanism. This implies that for user requests that are not authorized, a suitable error message is returned to the user. Note that the presence of an authorization plugin would be optional, so the mechanism would be non-breaking for existing applications.
  • The authorization plugin would receive the authenticated user as ascertained by the authentication plugin, as well as the user request in the form of either a bytecode request or a string-based request. Typically, an authorization plugin would use the user request (traversal) to derive some relevant features, like the GraphTraversalSource accessed and the application of lambdas, OLAP, strategies, etc. It is a matter of taste whether to only provide the raw request on the plugin API or also the derived features that are of interest to multiple plugins.
  • To illustrate the use of the authorization plugin mechanism, a sample WhiteListAuthorizer is provided that allows for configuring a limited authorization policy through a yaml file (a hypothetical configuration fragment is sketched below). A typical use case is a data science team that wants to be able to directly query the graph store of a large graph application that only has authorization functionality on its GUI.

So, this first step supports an initial use case and lowers the barrier for more ambitious integrations with larger stores for authorization data, like an LDAP store, Apache Ranger and Active Directory.
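
For illustration, such a plugin could be configured in gremlin-server.yaml analogously to the existing authentication section; the class name and config keys below are hypothetical placeholders, not part of any released API.

# hypothetical fragment of gremlin-server.yaml
authorization: {
  authorizer: org.apache.tinkerpop.gremlin.server.authz.WhiteListAuthorizer,
  config: {
    whiteListFile: conf/white-list.yaml}}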

Readers wanting to follow the fate of this proposal can do so by visiting:

[Edited] A pull request implementing this proposal, along with additional comments from the TinkerPop community, was merged on December 24, 2020.

2020-03-22

Regulate supermarket access during crisis using IT

The world currently needs social distancing to keep the number of people infected with COVID-19 at a manageable level. This gives rise to two problems related to access to food in supermarkets:
  1. People try to get more food than they currently need, to protect their loved ones in the near future. This urge might easily increase as more people die and food supply chains are disrupted by the illness of employees.
  2. More people are present in the supermarkets than necessary. This is in turn related to more people being at home, half-filled supermarkets, parents bringing their children, etc.

Social distancing in supermarkets could be improved enormously by regulating access to supermarkets in the following way:
  • Each household gets two single-person entry tickets per week at allotted times for buying groceries up to a certain maximum amount.
  • The elderly or people needing care can authorize someone to do their shopping.

Providing and checking the entry tickets poses an IT challenge, but many building blocks are already present:
  • Museums issue entry tickets at allotted times for blockbuster exhibitions
  • Theaters, football stadiums, etc. already scan tickets
  • In the developed world, governments have an accurate overview of households and their members for at least 90% of the population
  • Many governments have issued digital identities that can be used for the described authorizations.

Of course, embarking on a course of regulating supermarket access during a crisis also has its risks:
  • People might get even more worried (but it might also give people more trust in the government's handling of the crisis)
  • Digital access is still a problem for many people, especially the elderly and the homeless.
  • Governments might later abuse the data gathered during the issuing of tickets, thus infringing on civil liberties.

Given these considerations, I personally feel that regulating and controlling supermarket access during the current COVID-19 crisis offers enormous potential to improve social distancing and slow the spread of the virus.