
Elastic MapReduce

Meson Engine

Last updated: 2025-09-26 16:41:47
Meson Engine is a high-performance vectorized query engine built into EMR Spark. It transparently accelerates Spark SQL workloads and DataFrame API calls, reducing the overall cost of workloads. Compared with open-source Spark, it delivers a 2.7x performance improvement on the TPC-DS 1TB benchmark. Meson is fully compatible with Apache Spark APIs and requires no changes to existing business code; on EMR product versions that support the Meson Engine, only a small amount of configuration is needed to enable it.

How It Works

With the widespread adoption of SSDs and significant improvements in network interface card performance, the performance bottleneck of the Spark engine has shifted from I/O, as traditionally understood, to CPU-bound computation. However, JVM-based CPU optimization schemes (such as codegen) face many constraints, for example limits on bytecode length and the number of method parameters, and developers also find it difficult to exploit some features of modern CPUs from the JVM.
The Meson Engine transforms the Spark physical plan, executes computations with a vectorized acceleration library implemented in C++, and returns results in a columnar format, improving memory and bandwidth utilization. By breaking through this bottleneck, it can effectively improve the efficiency of Spark jobs.
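The columnar format mentioned above stores all values of one column contiguously, so a scan or aggregate touches a single dense buffer instead of striding across row objects. A minimal Python sketch of the idea (illustrative only; it is not Meson's C++ implementation):

```python
# Row layout: a list of per-row records.
rows = [{"id": i, "price": float(i) * 1.5} for i in range(5)]

# Columnar layout: one contiguous list per column.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# A column-wise aggregate scans a single dense buffer, which is what
# lets a vectorized engine use SIMD and stay cache-friendly.
total = sum(columns["price"])
print(columns["price"])  # [0.0, 1.5, 3.0, 4.5, 6.0]
print(total)             # 15.0
```
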

Usage Restrictions

The Meson Engine currently has usage scenario limits. In restricted scenarios, the Meson engine falls back to the native Spark engine for execution. Because each fallback requires converting data between columnar and row formats, too many fallbacks may make the total running time longer than that of the native Spark engine alone.
Please familiarize yourself with the main usage limits of the Meson Engine in advance:
Supports the Parquet data format. ORC support is not yet optimized. Other data formats are not supported.
ANSI mode is not supported.
Applications based on the RDD API are not supported.
Structured Streaming is not supported.
Custom Python code based on PySpark is not supported.
CacheTable with the MEMORY_ONLY storage level is not supported.
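If a job falls mostly into the restricted scenarios above, it can be switched back to the native Spark engine at the job level without removing the cluster-wide configuration. Because Meson is built on the Gluten plugin, the upstream Gluten toggle should apply; treat the parameter name below as an assumption and verify it against your EMR version:

```properties
# Disable Meson/Gluten acceleration for a single job.
# Parameter name assumed from upstream Apache Gluten; verify on your EMR version.
spark.gluten.enabled=false
```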

Applicable Scenarios

Support is provided on Spark 3.5.3 and later versions.
Note:
Storage formats, data types, operators, and functions that the Meson Engine does not support, or only partially supports, will fall back to native Spark engine execution.

Storage Format

Data storage formats supported by the Meson engine:
Supported data formats: Parquet, ORC
Supported table formats: Iceberg, Hive

Data Types

Data types supported by the Meson engine:
Byte, Short, Int, Long
Boolean
String, Binary
Decimal
Float, Double
Date, Timestamp

Operators

| Type | Supported Operators | Unsupported Operators |
|---|---|---|
| Source | FileSourceScanExec, HiveTableScanExec, BatchScanExec, InMemoryTableScanExec | - |
| Sink | DataWritingCommandExec, InsertIntoHiveTable | - |
| Common | FilterExec, ProjectExec, SortExec, UnionExec | - |
| Aggregate | HashAggregateExec | SortAggregateExec, ObjectHashAggregateExec |
| Join | BroadcastHashJoinExec, ShuffledHashJoinExec, SortMergeJoinExec, BroadcastNestedLoopJoinExec, CartesianProductExec | - |
| Window | WindowExec | WindowGroupLimitExec |
| Exchange | ShuffleExchangeExec, ReusedExchangeExec, BroadcastExchangeExec, CoalesceExec | CustomShuffleReaderExec |
| Limit | GlobalLimitExec, LocalLimitExec, TakeOrderedAndProjectExec, CollectLimitExec | - |
| Subquery | SubqueryBroadcastExec | - |
| Other | ExpandExec, GenerateExec, CollectTailExec, RangeExec | RangeExec, SampleExec |

Functions

| Type | Supported Functions |
|---|---|
| Generator Functions | explode,explode_outer,inline,inline_outer,posexplode,posexplode_outer,stack |
| Window Functions | cume_dist,dense_rank,lag,lead,nth_value,ntile,percent_rank,rank,row_number |
| Aggregate Functions | any,any_value,approx_count_distinct,approx_percentile,array_agg,avg,bit_and,bit_or,bit_xor,bool_and,bool_or,collect_list,collect_set,corr,count,count_if,covar_pop,covar_samp,every,first,first_value,grouping,grouping_id,kurtosis,last,last_value,max,max_by,mean,median,min,min_by,percentile,percentile_approx,regr_avgx,regr_avgy,regr_count,regr_intercept,regr_r2,regr_slope,regr_sxx,regr_sxy,regr_syy,skewness,some,std,stddev,stddev_pop,stddev_samp,sum,try_avg,try_sum,var_pop,var_samp,variance |
| Array Functions | array,array_append,array_compact,array_contains,array_distinct,array_except,array_insert,array_intersect,array_join,array_max,array_min,array_position,array_prepend,array_remove,array_repeat,array_union,arrays_overlap,arrays_zip,flatten,get,shuffle,slice,sort_array |
| Bitwise Functions | &,^,bit_count,bit_get,getbit,shiftright,\|,~ |
| Collection Functions | array_size,cardinality,concat,reverse,size |
| Conditional Functions | coalesce,if,ifnull,nanvl,nullif,nvl,nvl2,when |
| Conversion Functions | bigint,binary,boolean,cast,date,decimal,double,float,int,smallint,string,timestamp,tinyint |
| Date and Timestamp Functions | add_months,date_add,date_diff,date_format,date_from_unix_date,date_sub,date_trunc,dateadd,datediff,day,dayofmonth,dayofweek,dayofyear,extract,from_unixtime,from_utc_timestamp,hour,last_day,make_date,make_timestamp,make_ym_interval,minute,month,next_day,quarter,second,timestamp_micros,timestamp_millis,to_unix_timestamp,to_utc_timestamp,trunc,unix_date,unix_micros,unix_millis,unix_seconds,unix_timestamp,weekday,weekofyear,year |
| Hash Functions | crc32,hash,md5,sha,sha1,sha2,xxhash64 |
| JSON Functions | from_json,get_json_object,json_array_length,json_object_keys,json_tuple,schema_of_json,to_json |
| Lambda Functions | aggregate,array_sort,exists,filter,forall,map_filter,map_zip_with,reduce,transform,transform_keys,transform_values,zip_with |
| Map Functions | element_at,map,map_concat,map_contains_key,map_entries,map_keys,map_values,str_to_map,try_element_at |
| Mathematical Functions | %,*,+,-,/,abs,acos,acosh,asin,asinh,atan,atan2,atanh,bin,cbrt,ceil,ceiling,conv,cos,cosh,cot,csc,degrees,e,exp,expm1,factorial,floor,greatest,hex,hypot,least,log,log10,log1p,log2,mod,negative,pi,pmod,positive,pow,power,rand,random,rint,round,sec,shiftleft,sign,signum,sinh,sqrt,try_add,unhex,width_bucket |
| Misc Functions | assert_true,equal_null,spark_partition_id,uuid,version,\|\| |
| Predicate Functions | !,!=,<,<=,<=>,<>,=,==,>,>=,and,between,case,ilike,in,isnan,isnotnull,isnull,like,not,or,regexp,regexp_like |
| String Functions | ascii,base64,bit_length,btrim,char,char_length,character_length,chr,concat_ws,contains,endswith,find_in_set,format_number,format_string,initcap,instr,lcase,left,len,length,levenshtein,locate,lower,lpad,ltrim,luhn_check,mask,overlay,position,regexp_extract,regexp_extract_all,regexp_replace,repeat,replace,right,rpad,rtrim,soundex,split,split_part,startswith,substr,substring,substring_index,translate,trim,ucase,unbase64,upper |
| Struct Functions | named_struct,struct |
| URL Functions | url_decode,url_encode |

Enabling Meson Acceleration

EMR-V3.7.0

To enable this feature on an EMR-V3.7.0 cluster, use the configuration management feature in the EMR console to add the following configuration to the spark-defaults.conf configuration file:

| Parameter | Description |
|---|---|
| spark.plugins | The plugin used by Spark. Set the value to org.apache.gluten.GlutenPlugin (if spark.plugins is already configured, append org.apache.gluten.GlutenPlugin to it, using a comma "," as the separator). |
| spark.memory.offHeap.enabled | Set to true. Meson acceleration requires JVM off-heap memory. |
| spark.memory.offHeap.size | Set the off-heap memory size according to actual conditions. For details, see the recommended configurations for executors of varying specifications. |
| spark.shuffle.manager | The columnar shuffle manager used by Meson. Set the value to org.apache.spark.shuffle.sort.ColumnarShuffleManager. |
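Taken together, the parameters above correspond to a spark-defaults.conf fragment like the following. This is a sketch; the off-heap size shown matches a 4-core executor and should be adjusted to your workload:

```properties
# Enable the Meson Engine (Gluten plugin).
# Append to an existing spark.plugins value, comma-separated, if one is set.
spark.plugins=org.apache.gluten.GlutenPlugin
spark.memory.offHeap.enabled=true
# Sized for a 4-core executor; adjust per the recommendations for your specs.
spark.memory.offHeap.size=10g
spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager
```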
Recommended memory configurations for executors of varying specifications:

| executor-cores | spark.executor.memory | spark.memory.offHeap.size |
|---|---|---|
| 2 | 2GB | 4GB |
| 4 | 3GB | 10GB |
| 8 | 6GB | 20GB |
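The same settings can also be passed per job at submit time instead of cluster-wide. A sketch using the 4-core executor row above (the JAR name is a placeholder for your own application):

```shell
# Per-job Meson enablement; your_job.jar is a placeholder.
spark-submit \
  --conf spark.plugins=org.apache.gluten.GlutenPlugin \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=10g \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
  --executor-cores 4 \
  --executor-memory 3g \
  your_job.jar
```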

EMR-V3.6.1 (beta)

To enable this feature on an EMR-V3.6.1 cluster, use the configuration management feature in the EMR console to add the following configuration to the spark-defaults.conf configuration file:

| Parameter | Description |
|---|---|
| spark.plugins | The plugin used by Spark. Set the value to org.apache.gluten.GlutenPlugin (if spark.plugins is already configured, append org.apache.gluten.GlutenPlugin to it, using a comma "," as the separator). |
| spark.memory.offHeap.enabled | Set to true. Meson acceleration requires JVM off-heap memory. |
| spark.memory.offHeap.size | Set the off-heap memory size according to actual conditions. The initial size can be set to 1G. |
| spark.shuffle.manager | The columnar shuffle manager used by Meson. Set the value to org.apache.spark.shuffle.sort.ColumnarShuffleManager. |
| spark.driver.extraClassPath | The Gluten native JAR used by the Spark driver. The default path of the JAR is /usr/local/service/spark/gluten. |
| spark.executor.extraClassPath | The Gluten native JAR used by Spark executors. The default path of the JAR is /usr/local/service/spark/gluten. |
| spark.executorEnv.LIBHDFS3_CONF | Path of the integrated HDFS cluster configuration file. The default is /usr/local/service/hadoop/etc/hadoop/hdfs-site.xml. |
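For EMR-V3.6.1 (beta), the parameters above correspond to a spark-defaults.conf fragment along these lines. The `/*` glob on the classpath entries is an assumption for loading all JARs in the default directory; if extraClassPath values are already configured, extend them rather than overwriting:

```properties
spark.plugins=org.apache.gluten.GlutenPlugin
spark.memory.offHeap.enabled=true
# Suggested initial size; tune according to actual conditions.
spark.memory.offHeap.size=1g
spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager
# Default Gluten JAR location; the /* glob is an assumption to pick up all JARs.
spark.driver.extraClassPath=/usr/local/service/spark/gluten/*
spark.executor.extraClassPath=/usr/local/service/spark/gluten/*
spark.executorEnv.LIBHDFS3_CONF=/usr/local/service/hadoop/etc/hadoop/hdfs-site.xml
```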

