Python: Extracting All Keys from a JSON Object in a Hadoop Table with PySpark

阿里 · 2025-06-02 22:19:02

I have a Hadoop table named table_with_json_string.

For example:

+-----------------------------------+---------------------------------+ 
|      creation_date                |        json_string_colum        | 
+-----------------------------------+---------------------------------+ 
| 2020-01-29                        |  "{keys : {1 : 'a', 2 : 'b' }}" | 
+-----------------------------------+---------------------------------+ 

Desired output:

+-----------------------------------+----------------------------------+----------+ 
|      creation_date                |         json_string_colum        |   keys   | 
+-----------------------------------+----------------------------------+----------+ 
| 2020-01-29                        |  "{keys : {1 : 'a', 2 : 'b' }}"  |    1     | 
| 2020-01-29                        |  "{keys : {1 : 'a', 2 : 'b' }}"  |    2     | 
+-----------------------------------+----------------------------------+----------+ 

What I have tried:

from pyspark.sql.functions import from_json, col, explode
from pyspark.sql.types import StructType, StructField, StringType, MapType

# Schema: a top-level "keys" field holding a string-to-string map.
schema = StructType([
    StructField("keys", MapType(StringType(), StringType()), True)
])

df = spark.table('table_with_json_string').select(col("creation_date"), col("json_string_colum"))
df = df.withColumn("map_json_column", from_json("json_string_colum", schema))
df.show(1, False)
+--------------------+-------------------------------------+----------------------------------+ 
|       creation_date|        json_string_colum            |    map_json_column               | 
+--------------------+-------------------------------------+----------------------------------+ 
|   2020-01-29       |     "{keys : {1 : 'a', 2 : 'b' }}"  |    [Map(1 ->'a',2 ->'b')]        | 
+--------------------+-------------------------------------+----------------------------------+ 

1 - How do I extract the keys from this MapType object? I know I need to use the explode function to reach the table format I want, but I still don't know how to extract the keys of the JSON object as an array.

I'm open to other approaches if it's easier to reach my goal.

Please consider the following approach:

Building on what you have so far, you can extract the keys like this:

from pyspark.sql import functions as f

df = (df
      # Parse the JSON string into a struct with a "keys" map field.
      .withColumn("map_json_column", f.from_json("json_string_colum", schema))
      # Collect the map's keys into an array column.
      .withColumn("keys", f.map_keys("map_json_column.keys"))
      .drop("map_json_column")
      # One output row per key.
      .withColumn("keys", f.explode("keys"))
      )

Result:

+-------------+--------------------+----+ 
|creation_date|   json_string_colum|keys| 
+-------------+--------------------+----+ 
|   2020-01-29|{"keys" : {"1" : ...|   1| 
|   2020-01-29|{"keys" : {"1" : ...|   2| 
+-------------+--------------------+----+ 
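Since the question says other approaches are welcome, the same expansion can be sketched in plain Python with json.loads. One caveat: the sample string in the question, "{keys : {1 : 'a', 2 : 'b' }}", is not strict JSON (unquoted field names, single quotes), and json.loads would reject it; the answer's output suggests the real data is properly quoted, so this sketch assumes valid JSON strings:

```python
import json

def extract_keys(json_string):
    """Return the keys of the top-level "keys" object as a list of strings."""
    return list(json.loads(json_string)["keys"].keys())

row = ("2020-01-29", '{"keys": {"1": "a", "2": "b"}}')

# Expand into one output row per key, mirroring the explode step.
expanded = [(row[0], row[1], k) for k in extract_keys(row[1])]
for r in expanded:
    print(r)
```

This mirrors what from_json + map_keys + explode does row by row; in practice you would keep the Spark version so the work stays distributed, but the plain-Python form makes the per-row transformation easy to see.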

Here are the detailed steps that lead to the answer above:

>>> from pyspark.sql import functions as f 
>>> df.show() 
+-------------+--------------------+ 
|creation_date|   json_string_colum| 
+-------------+--------------------+ 
|   2020-01-29|{"keys" : {"1" : ...| 
+-------------+--------------------+ 
 
>>> df.withColumn("map_json_column", f.from_json("json_string_colum",schema)).show() 
+-------------+--------------------+------------------+ 
|creation_date|   json_string_colum|   map_json_column| 
+-------------+--------------------+------------------+ 
|   2020-01-29|{"keys" : {"1" : ...|[[1 -> a, 2 -> b]]| 
+-------------+--------------------+------------------+ 
 
>>> df.withColumn("map_json_column", f.from_json("json_string_colum",schema)).withColumn("keys", f.map_keys("map_json_column.keys")).show() 
+-------------+--------------------+------------------+------+ 
|creation_date|   json_string_colum|   map_json_column|  keys| 
+-------------+--------------------+------------------+------+ 
|   2020-01-29|{"keys" : {"1" : ...|[[1 -> a, 2 -> b]]|[1, 2]| 
+-------------+--------------------+------------------+------+ 
 
>>> df.withColumn("map_json_column", f.from_json("json_string_colum",schema)).withColumn("keys", f.map_keys("map_json_column.keys")).drop("map_json_column").show() 
+-------------+--------------------+------+ 
|creation_date|   json_string_colum|  keys| 
+-------------+--------------------+------+ 
|   2020-01-29|{"keys" : {"1" : ...|[1, 2]| 
+-------------+--------------------+------+ 
 
>>> df.withColumn("map_json_column", f.from_json("json_string_colum",schema)).withColumn("keys", f.map_keys("map_json_column.keys")).drop("map_json_column").withColumn("keys", f.explode("keys")).show() 
+-------------+--------------------+----+ 
|creation_date|   json_string_colum|keys| 
+-------------+--------------------+----+ 
|   2020-01-29|{"keys" : {"1" : ...|   1| 
|   2020-01-29|{"keys" : {"1" : ...|   2| 
+-------------+--------------------+----+ 

Note that the map_keys function used above is available in PySpark 2.3 and later.


Tags: Spark