问题别名

  • NL2SQL
  • Text to sql
  • Table QA
  • Semantic parsing

问题描述及定义

给定数据表or数据库的基本信息,通过一定方法,将对数据表的提问转为SQL语句并获取结果的任务。

举例

image

数据类型

问题类型 中文 英文
简单sql NL2SQL挑战赛 WiKiSQL
复杂sql CSpider; DuSQL;电网等 Spider

按照模型能解决的类型来看,可以粗略地分为两类问题:

  1. 单表上的查询,例如SELECT ... FROM ... WHERE ...,我们简称为简单sql
  2. 多表查询,或增加SQL语句关键词(如GROUP BY, LIMIT等),我们简称为复杂sql

免责声明: 上述简称为个人拍脑子起名,不代表学术界常见用法。


简单sql


一、英文-WiKiSQL

  1. 问题

    • json格式

      1
      2
      3
      4
      5
      6
      7
      8
      9
      10
      {
      "phase":1,
      "question":"who is the manufacturer for the order year 1998?",
      "sql":{
      "conds":[[0,0,"1998"]],
      "sel":1,
      "agg":0
      },
      "table_id":"1-10007452-3"
      }
    • 说明

      • phase: the phase in which the dataset was collected. We collected WikiSQL in two phases.
      • question: the natural language question written by the worker.
      • table_id: the ID of the table to which this question is addressed.
      • sql: the SQL query corresponding to the question. This has the following subfields:
      • sel: the numerical index of the column that is being selected. You can find the actual column from the table.
      • agg: the numerical index of the aggregation operator that is being used. You can find the actual operator from Query.agg_ops in lib/query.py.
      • conds: a list of triplets (column_index, operator_index, condition) where:
        • column_index: the numerical index of the condition column that is being used. You can find the actual column from the table.
        • operator_index: the numerical index of the condition operator that is being used. You can find the actual operator from Query.cond_ops in lib/query.py.
        • condition: the comparison value for the condition, in either string or float type.
  1. 表格信息

    • json格式

      1
      2
      3
      4
      5
      6
      7
      8
      9
      10
      {
      "id":"1-1000181-1",
      "header":[
      "State/territory",
      ],
      "types":[
      "text",
      ],
      "rows":[["Australian Capital Territory"],]
      }
    • 格式

      • id: the table ID.
      • header: a list of column names in the table.
      • rows: a list of rows. Each row is a list of row entries.
  1. 相关link

二、中文-NL2SQL挑战赛

  1. 问题

    • json格式

      1
      2
      3
      4
      5
      6
      7
      8
      9
      10
      11
      12
      13
      {
      "table_id": "a1b2c3d4", # 相应表格的id
      "question": "世茂茂悦府新盘容积率大于1,请问它的套均面积是多少?", # 自然语言问句
      "sql":{ # 真实SQL
      "sel": [7], # SQL选择的列
      "agg": [0], # 选择的列相应的聚合函数, '0'代表无
      "cond_conn_op": 0, # 条件之间的关系
      "conds": [
      [1,2,"世茂茂悦府"], # 条件列, 条件类型, 条件值,col_1 == "世茂茂悦府"
      [6,0,1]
      ]
      }
      }
    • SQL的表达字典说明如下:

      1
      2
      3
      op_sql_dict = {0:">", 1:"<", 2:"==", 3:"!="}
      agg_sql_dict = {0:"", 1:"AVG", 2:"MAX", 3:"MIN", 4:"COUNT", 5:"SUM"}
      conn_sql_dict = {0:"", 1:"and", 2:"or"}
  2. 表格信息

    • json格式
      1
      2
      3
      4
      5
      6
      7
      8
      9
      10
      11
      12
      13
      14
      15
      16
      17
      18
      19
      20
      21
      22
      {
      "id":"a1b2c3d4", # 表格id
      "name":"Table_a1b2c3d4", # 表格名称
      "title":"表1:2019年新开工预测 ", # 表格标题
      "header":[ # 表格所包含的列名
      "300城市土地出让",
      "规划建筑面积(万㎡)",
      ……
      ],
      "types":[ # 表格列所相应的类型
      "text",
      "real",
      ……
      ],
      "rows":[ # 表格每一行所存储的值
      [
      "2009年7月-2010年6月",
      168212.4,
      ……
      ]
      ]
      }
  1. 相关link

image


复杂sql


一、英文—Spider

  1. 问题

    • json格式

      1
      2
      3
      4
      5
      6
      7
      8
      {
      "db_id": "world_1",
      "query": "SELECT avg(LifeExpectancy) FROM country WHERE Name NOT IN (SELECT T1.Name FROM country AS T1 JOIN countrylanguage AS T2 ON T1.Code = T2.CountryCode WHERE T2.Language = \"English\" AND T2.IsOfficial = \"T\")",
      "query_toks": ["SELECT", "avg", "(", "LifeExpectancy", ")", "FROM", ...],
      "question": "What is average life expectancy in the countries where English is not the official language?",
      "question_toks": ["What", "is", "average", "life", ...],
      "sql": {}
      }
    • 说明

      • question: the natural language question
      • question_toks: the natural language question tokens
      • db_id: the database id to which this question is addressed.
      • query: the SQL query corresponding to the question.
      • query_toks: the SQL query tokens corresponding to the question.
      • sql: parsed results of this SQL query using process_sql.py. Please refer to parsed_sql_examples.sql in thepreprocess directory for the detailed documentation.
  1. 表格信息

    • json格式

      1
      2
      3
      4
      5
      6
      7
      8
      9
      10
      {
      "column_names": [[0,"id"]],
      "column_names_original": [[0,"ID"]],
      "column_types": ["number"],
      "db_id": "world_1",
      "foreign_keys": [[3,8],[23,8]],
      "primary_keys": [1,8,23],
      "table_names": ["city","sqlite sequence","country","country language"],
      "table_names_original": ["city","sqlite_sequence","country","countrylanguage"]
      }
    • 说明

      • db_id: database id
      • table_names_original: original table names stored in the database.
      • table_names: cleaned and normalized table names. We make sure the table names are meaningful. [to be changed]
      • column_names_original: original column names stored in the database. Each column looks like: [0, "id"]. 0 is the index of table names in table_names, which is city in this case. "id" is the column name.
      • column_names: cleaned and normalized column names. We make sure the column names are meaningful. [to be changed]
      • column_types: data type of each column
      • foreign_keys: foreign keys in the database. [3, 8] means column indices in the column_names. These two columns are foreign keys of two different tables.
      • primary_keys: primary keys in the database. Each number is the index of column_names.
  1. 相关link


二、中文-CSpider

CSpider数据由Spider翻译而来,有一定难度。


三、中文-DuSQL

  1. 问题

    • json格式

      1
      2
      3
      4
      5
      6
      7
      8
      9
      10
      11
      12
      13
      14
      15
      16
      17
      18
      19
      20
      21
      22
      23
      24
      25
      {
      "db_id": "小说",
      "question": "出生日期在1812年2月7号之后的作者,按逝世日期倒序排列给出前10的作者以及国籍",
      "question_id": "qid000010",
      "sql": {
      "except": null,
      "from": {
      "conds": [],
      "table_units": [["table_unit",0]]
      },
      "groupBy": [],
      "having": [],
      "intersect": null,
      "limit": 10,
      "orderBy": [
      "asc", [[0,[0,[0,6],null]]]],
      "select": [
      [0,[0,[0,2],null]],
      [0,[0,[0,3],null]]
      ],
      "union": null,
      "where": [[0,3,[0,[0,4],null],"1812-02-07",null]]
      },
      "sql_query": "select 姓名 , 国籍 from 作者 where 出生日期 > '1812-02-07' order by 逝世日期 asc limit 10"
      }
    • 简介

      • db_id: 当前样本对应的数据库ID
      • question: 原始问题文本
      • question_id: 样本的唯一标识
      • sql_query: 问题对应的标注 SQL
      • sql: sql_query 的结构化形式,其形式亦为一个dict类型,下文对齐详细介绍。
  2. 表格信息

    • json格式
      1
      2
      3
      4
      5
      6
      7
      8
      {
      "column_names": [[-1,"*"],[0,"词条id"],[1,"词条id"]],
      "column_types": ["text","number","number"],
      "db_id": "小说",
      "foreign_keys": [[11,1],[16,8],[23,8]],
      "primary_keys": [1,8,14,21],
      "table_names": ["作者","小说","小说改编话剧","小说改编电视剧"]
      }
  3. 相关link

评价方法

  • Logic Form Accuracy: 预测完全正确的SQL语句。其中,列的顺序并不影响准确率的计算。
  • Execution Accuracy: 预测的SQL的执行结果与真实SQL的执行结果一致。

排行榜以 的平均值排序。

image


常见方法


简单sql

  • SQLNet: Generating Structured Queries From Natural Language Without Reinforcement Learning
  • X-SQL: reinforce schema representation with context / X-SQL: reinforce context info schema representation


复杂sql

  • 【IRNet】Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation
  • 【Valuenet】ValueNet: A Neural Text-to-SQL Architecture Incorporating Values
  • 【RATSQL】RAT-SQL: Relation-Aware Schema Encoding and Linking