【NL2SQL】任务简介

问题别名

NL2SQL
Text to sql
Table QA
Semantic parsing

问题描述及定义

给定数据表or数据库的基本信息，通过一定方法，将对数据表的提问转为SQL语句并获取结果的任务。

举例

数据类型

问题类型	中文	英文
简单sql	NL2SQL挑战赛	WiKiSQL
复杂sql	CSpider; DuSQL；电网等	Spider

按照模型能解决的类型来看，可以粗略地分为两类问题：

单表上的查询，例如SELECT ... FROM ... WHERE ...，我们简称为简单sql
多表查询，或增加SQL语句关键词（如GROUP BY, LIMIT等），我们简称为复杂sql

免责声明：上述简称为个人拍脑子起名，不代表学术界常见用法。

简单sql

一、英文-WiKiSQL

问题
- json格式
  1
  2
  3
  4
  5
  6
  7
  8
  9
  10
  {
  "phase":1,
  "question":"who is the manufacturer for the order year 1998?",
  "sql":{
  "conds":[[0,0,"1998"]],
  "sel":1,
  "agg":0
  },
  "table_id":"1-10007452-3"
  }
- 说明
  - phase: the phase in which the dataset was collected. We collected WikiSQL in two phases.
  - question: the natural language question written by the worker.
  - table_id: the ID of the table to which this question is addressed.
  - sql: the SQL query corresponding to the question. This has the following subfields:
  - sel: the numerical index of the column that is being selected. You can find the actual column from the table.
  - agg: the numerical index of the aggregation operator that is being used. You can find the actual operator from Query.agg_ops in lib/query.py.
  - conds: a list of triplets (column_index, operator_index, condition) where:
    - column_index: the numerical index of the condition column that is being used. You can find the actual column from the table.
    - operator_index: the numerical index of the condition operator that is being used. You can find the actual operator from Query.cond_ops in lib/query.py.
    - condition: the comparison value for the condition, in either string or float type.

表格信息
- json格式
  1
  2
  3
  4
  5
  6
  7
  8
  9
  10
  {
  "id":"1-1000181-1",
  "header":[
  "State/territory",
  ],
  "types":[
  "text",
  ],
  "rows":[["Australian Capital Territory"],]
  }
- 格式
  - id: the table ID.
  - header: a list of column names in the table.
  - rows: a list of rows. Each row is a list of row entries.

相关link
- https://github.com/salesforce/WikiSQL

二、中文-NL2SQL挑战赛

问题

json格式

{
    "table_id": "a1b2c3d4", # 相应表格的id
    "question": "世茂茂悦府新盘容积率大于1，请问它的套均面积是多少？", # 自然语言问句
"sql":{ # 真实SQL
    "sel": [7], # SQL选择的列 
    "agg": [0], # 选择的列相应的聚合函数, '0'代表无
    "cond_conn_op": 0, # 条件之间的关系
    "conds": [
        [1,2,"世茂茂悦府"], # 条件列, 条件类型, 条件值，col_1 == "世茂茂悦府"
        [6,0,1]
    ]
}
}

SQL的表达字典说明如下：

1
2
3

op_sql_dict = {0:">", 1:"<", 2:"==", 3:"!="}
agg_sql_dict = {0:"", 1:"AVG", 2:"MAX", 3:"MIN", 4:"COUNT", 5:"SUM"}
conn_sql_dict = {0:"", 1:"and", 2:"or"}

表格信息

json格式

{
    "id":"a1b2c3d4", # 表格id
    "name":"Table_a1b2c3d4", # 表格名称
    "title":"表1：2019年新开工预测 ", # 表格标题
    "header":[ # 表格所包含的列名
        "300城市土地出让",
        "规划建筑面积(万㎡)",
        ……
    ],
    "types":[ # 表格列所相应的类型
        "text",
        "real",
        ……
    ],
    "rows":[ # 表格每一行所存储的值
        [
            "2009年7月-2010年6月",
            168212.4,
            ……
        ]
    ]
}

相关link
- 大赛简介：https://tianchi.aliyun.com/competition/entrance/231716/information
- 冠军方案：https://tianchi.aliyun.com/forum/postDetail?spm=5176.12282027.0.0.3473379cXdQbR7&postId=78781 天池NL2SQL冠军方案.pdf
- 亚军方案：https://tianchi.aliyun.com/forum/postDetail?spm=5176.12282027.0.0.3473379cXdQbR7&postId=78855
- 季军方案：https://github.com/beader/tianchi_nl2sql?spm=5176.12282029.0.0.324d324e0VRRfm
- 冠军论文：M-SQL_Multi-Task_Representation_Learning_for_Singl.pdf

复杂sql

一、英文—Spider

问题

json格式

{
    "db_id": "world_1",
    "query": "SELECT avg(LifeExpectancy) FROM country WHERE Name NOT IN (SELECT T1.Name FROM country AS T1 JOIN countrylanguage AS T2 ON T1.Code  =  T2.CountryCode WHERE T2.Language  =  \"English\" AND T2.IsOfficial  =  \"T\")",
    "query_toks": ["SELECT", "avg", "(", "LifeExpectancy", ")", "FROM", ...],
    "question": "What is average life expectancy in the countries where English is not the official language?",
    "question_toks": ["What", "is", "average", "life", ...],
    "sql": {}
}

说明
- question: the natural language question
- question_toks: the natural language question tokens
- db_id: the database id to which this question is addressed.
- query: the SQL query corresponding to the question.
- query_toks: the SQL query tokens corresponding to the question.
- sql: parsed results of this SQL query using process_sql.py. Please refer to parsed_sql_examples.sql in thepreprocess directory for the detailed documentation.

表格信息

json格式

{
  "column_names": [[0,"id"]],
  "column_names_original": [[0,"ID"]],
  "column_types": ["number"],
  "db_id": "world_1",
  "foreign_keys": [[3,8],[23,8]],
  "primary_keys": [1,8,23],
  "table_names": ["city","sqlite sequence","country","country language"],
  "table_names_original": ["city","sqlite_sequence","country","countrylanguage"]
}

说明
- db_id: database id
- table_names_original: original table names stored in the database.
- table_names: cleaned and normalized table names. We make sure the table names are meaningful. [to be changed]
- column_names_original: original column names stored in the database. Each column looks like: [0, "id"]. 0 is the index of table names in table_names, which is city in this case. "id" is the column name.
- column_names: cleaned and normalized column names. We make sure the column names are meaningful. [to be changed]
- column_types: data type of each column
- foreign_keys: foreign keys in the database. [3, 8] means column indices in the column_names. These two columns are foreign keys of two different tables.
- primary_keys: primary keys in the database. Each number is the index of column_names.

相关link
- https://github.com/taoyds/spider
- task site：https://yale-lily.github.io/spider

二、中文-CSpider

CSpider数据由Spider翻译而来，有一定难度。

相关link
- task site：https://taolusi.github.io/CSpider-explorer/
- https://link.zhihu.com/?target=https%3A//github.com/taolusi/chisp

三、中文-DuSQL

问题

json格式

{
    "db_id": "小说",
    "question": "出生日期在1812年2月7号之后的作者，按逝世日期倒序排列给出前10的作者以及国籍",
    "question_id": "qid000010",
    "sql": {
    "except": null,
    "from": {
        "conds": [],
        "table_units": [["table_unit",0]]
        },
    "groupBy": [],
    "having": [],
    "intersect": null,
    "limit": 10,
    "orderBy": [
        "asc", [[0,[0,[0,6],null]]]],
    "select": [
        [0,[0,[0,2],null]],
        [0,[0,[0,3],null]]
    ],
    "union": null,
    "where": [[0,3,[0,[0,4],null],"1812-02-07",null]]
    },
    "sql_query": "select 姓名 , 国籍 from 作者 where 出生日期 > '1812-02-07' order by 逝世日期 asc limit 10"
}

简介
- db_id: 当前样本对应的数据库ID
- question: 原始问题文本
- question_id: 样本的唯一标识
- sql_query: 问题对应的标注 SQL
- sql: sql_query 的结构化形式，其形式亦为一个dict类型，下文对齐详细介绍。

表格信息

json格式

{
    "column_names": [[-1,"*"],[0,"词条id"],[1,"词条id"]],
    "column_types": ["text","number","number"],
    "db_id": "小说",
    "foreign_keys": [[11,1],[16,8],[23,8]],
    "primary_keys": [1,8,14,21],
    "table_names": ["作者","小说","小说改编话剧","小说改编电视剧"]
}

相关link

Lic2020比赛：https://aistudio.baidu.com/aistudio/competition/detail/47

评价方法

Logic Form Accuracy: 预测完全正确的SQL语句。其中，列的顺序并不影响准确率的计算。
Execution Accuracy: 预测的SQL的执行结果与真实SQL的执行结果一致。

排行榜以 $Score_{lf}$ 与 $Score_{ex}$ 的平均值排序。

常见方法

简单sql

SQLNet: Generating Structured Queries From Natural Language Without Reinforcement Learning
X-SQL: reinforce schema representation with context / X-SQL: reinforce context info schema representation

复杂sql

【IRNet】Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation
【Valuenet】ValueNet: A Neural Text-to-SQL Architecture Incorporating Values
【RATSQL】RAT-SQL: Relation-Aware Schema Encoding and Linking

Blog of YQ

this is subtitle :)

【NL2SQL】任务简介

问题别名

问题描述及定义

举例

数据类型

简单sql

一、英文-WiKiSQL

二、中文-NL2SQL挑战赛

复杂sql

一、英文—Spider

二、中文-CSpider

三、中文-DuSQL

评价方法

常见方法

简单sql

复杂sql