Unexpected behavior of data.table join and j-expression

Published: 2021-01-17 03:36:32 | Category: MsSql tutorials | Source: collected from the web


In R 2.15.0 and data.table 1.8.9:

library(data.table)
d = data.table(a = 1:5, value = 2:6, key = "a")

d[J(3), value]
#   a value
#   3     4

d[J(3)][, value]
# [1] 4

I expected both to produce the same output (the second one), and I believe they should.

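To make clear that this is not an issue with the J syntax, the same expectation applies to the following expressions (identical to the ones above):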

t = data.table(a = 3, key = "a")
d[t, value]
d[t][, value]

I would expect both of the above to return exactly the same output.

So let me rephrase the question: why is data.table designed so that the key column is printed automatically in d[t, value]?

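Update (based on the answers and comments below): Thanks @Arun et al., I now understand the reason for the design. The key is printed above because a hidden by is present every time you do a data.table merge via the X[Y] syntax, and that by is by the key. It seems to be designed this way for the following reason: since a by operation has to be performed during the merge anyway, one might as well take advantage of it rather than run another by on the merge key afterwards.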

That said, I believe this is a syntax design flaw. The way I read the data.table syntax d[i, j, by = b] is

take d, apply the i operation (be that subsetting or merging or whatnot), and then do the j expression "by" b

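The hidden by breaks this reading and introduces cases one has to think about specifically (am I merging in i, is by just the key of the merge, and so on). I believe handling this should be the job of data.table: the commendable effort to make data.table faster in the one particular merge case where by equals the key should be done in another way (for example, by checking internally whether the by expression is actually the key of the merge).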
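The hidden by described in the update can be made visible with a short sketch. It assumes data.table 1.9.4 or later, not the 1.8.9 used in the question: as far as I know, those later releases dropped the implicit grouping and offer it through an explicit by = .EACHI instead. The variable names plain, each_i and chained are illustrative only.

```r
library(data.table)

d <- data.table(a = 1:5, value = 2:6, key = "a")

# Assumption: data.table >= 1.9.4. Without a by, j is evaluated once over
# all matched rows, so a bare column name in j yields a plain vector.
plain <- d[J(3), value]

# by = .EACHI evaluates j once per row of i, which brings the join
# column back and reproduces the two-column result shown in the question.
each_i <- d[J(3), value, by = .EACHI]

# Joining first and applying j in a second step gives a plain vector.
chained <- d[J(3)][, value]

print(plain)
print(each_i)
print(chained)
```

Under that assumption, plain and chained should agree, while each_i is a data.table carrying the key column a next to value.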

Answer

Edit number infinity: FAQ 1.12 answers your question exactly (FAQ 1.13 is also useful and relevant, but is not pasted here).

1.12 What is the difference between X[Y] and merge(X, Y)?
X[Y] is a join, looking up X's rows using Y (or Y's key if it has one) as an index. Y[X] is a join, looking up Y's rows using X (or X's key if it has one) as an index. merge(X, Y) does both ways at the same time. The number of rows of X[Y] and Y[X] usually differ; whereas the number of rows returned by merge(X, Y) and merge(Y, X) is the same. BUT that misses the main point. Most tasks require something to be done on the data after a join or merge. Why merge all the columns of data, only to use a small subset of them afterwards?
You may suggest merge(X[, ColsNeeded1], Y[, ColsNeeded2]), but that takes copies of the subsets of data, and it requires the programmer to work out which columns are needed. X[Y, j] in data.table does all that in one step for you. When you write X[Y, sum(foo*bar)], data.table automatically inspects the j expression to see which columns it uses. It will only subset those columns only; the others are ignored. Memory is only created for the columns the j uses, and Y columns enjoy standard R recycling rules within the context of each group. Let's say foo is in X, and bar is in Y (along with 20 other columns in Y). Isn't X[Y, sum(foo*bar)] quicker to program and quicker to run than a merge followed by a subset?
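The comparison the FAQ draws can be sketched as follows. The tables X and Y, their columns and the values are made up for illustration, and the sketch assumes data.table 1.9.4 or later, where X[Y, sum(foo * bar)] without a by returns one overall sum (in 1.8.9 the same call would instead be evaluated per group of the join).

```r
library(data.table)

# Hypothetical tables: foo lives in X, bar lives in Y next to an unused column.
X <- data.table(id = 1:3, foo = c(10, 20, 30), key = "id")
Y <- data.table(id = c(1L, 3L), bar = c(2, 5), other = c("x", "y"), key = "id")

# Join and compute in one step; j only touches foo and bar.
# Assumption: data.table >= 1.9.4, so this is a single sum over all matches.
joined <- X[Y, sum(foo * bar)]

# The two-step alternative: materialise every merged column, then compute.
merged <- merge(X, Y)[, sum(foo * bar)]

print(joined)
print(merged)
```

Both routes should give the same number; the difference the FAQ points at is that the merge builds the full combined table first, including columns such as other that j never uses.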

Old answer, which did not answer the OP's question (according to the OP's comment), retained here because I believe it does.

When you give j a value such as d[, 4] or d[, value] in data.table, j is evaluated as an expression. From data.table FAQ 1.1, on accessing DT[, 5] (the very first FAQ):

Because, by default, unlike a data.frame, the 2nd argument is an expression which is evaluated within the scope of DT. 5 evaluates to 5.

So the first thing to understand is that, in your case:

d[,value] # produces a "vector"
# [1] 2 3 4 5 6

This is no different when the query in i is a basic index:

d[3,value] # produces a vector of length 1
# [1] 4
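To make the vector-versus-table distinction concrete, here is a minimal sketch using the same d as in the question. The names as_vector and as_table are illustrative only; the point is that a bare column name in j evaluates to the column itself, while wrapping it in list() keeps the result a data.table.

```r
library(data.table)

d <- data.table(a = 1:5, value = 2:6, key = "a")

# A bare column name in j is an expression that evaluates to the column,
# so the result is an ordinary vector.
as_vector <- d[3, value]

# Wrapping the column in list() asks for a data.table back instead.
as_table <- d[3, list(value)]

print(as_vector)
print(as_table)
```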

However, things are different when i is itself a data.table. From the data.table introduction (page 6):

d[J(3)] # is equivalent to d[data.table(a = 3)]

Here you are performing a join. If you just do d[J(3)], you get all the columns corresponding to that join. If you instead do

d[J(3),value] # which is equivalent to d[J(3),list(value)]

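then you get just that column, but since you are performing a join, the key column is output as well (because it is a join between two tables based on the key column). Since you say this answer does not address your question, this is where I believe the answer to your "rephrased" question lies.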

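Edit: After your second edit, if your question is "why so?", then I would reluctantly (or rather, ignorantly) answer that Matthew Dowle designed it this way to differentiate between a join-based subset and an index-based subset operation in data.table.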

Your second syntax is equivalent to:

d[J(3)][,value] # is equivalent to:

dd <- d[J(3)]
dd[,value]

Again, in dd[, value], j is evaluated as an expression, and therefore you get a vector.

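To answer your third, modified question: for the third time, it is because this is a JOIN between two data.tables based on the key column. If I join two data.tables, I expect a data.table.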

From the data.table introduction, once again:

Passing a data.table into a data.table subset is analogous to A[B] syntax in base R where A is a matrix and B is a 2-column matrix. In fact, the A[B] syntax in base R inspired the data.table package.
