
[Special Topic] Hardcore Emulator Research Series 3: PCSX2 64-bit Recompilation

SSforME


#1, posted 2021-7-9 12:17

Note: this article was originally written on October 29, 2006.

Original: http://www.pcsx2.net/blog.php?p=2
PCSX2 64-bit Recompilation

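Many 64-bit architectures have been proposed, but x86-64 (also called AMD64) has gained a lot of momentum since it was first proposed a few years ago. Most 64-bit CPUs today support it, so it looks like a good target for 64-bit recompilation. x86-64 provides more registers and has the potential to speed up games significantly. Until now Pcsx2 has paid little attention to 64-bit: x86-64 had many compatibility problems, the developers were not sure it was worth the time, and adding a new, reliable, fast recompiler to the existing code base is a painful process. Seriously proposing 64-bit work would have gotten you laughed out of the chat room. However, the upcoming 0.9.2 release looks very stable, and after some research we have decided to add x86-64 recompilation support for 64-bit Linux and Windows (yes, the Linux version is back).

Before getting into the technical details, let me first describe the current Pcsx2 recompilation model.
Pcsx2 Recompilation

Running a foreign instruction set on a PC requires either an interpreter or a recompiler, and both matter in emulation. Interpreters are written in ordinary high-level languages and are platform independent. They are easy to write and easy to debug, but slow. They are especially important for testing and debugging. For example, interpreting a simple 32-bit EE MIPS instruction (code) might look like this: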

switch (code >> 26) {
    case 0x02: // J - jump to
        pc = (code & 0x03ffffff) * 4; // change the program counter
        break;
    case 0x23: // LW - load word, sign extend
        gpr[Rt] = (long long)*(long*)(memory + gpr[Rs] + (short)code);
        break;
    ...
}

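Recompilers are different: they cut as many corners as possible. For example, if the instruction at address 0x1000 never changes, the CPU has no reason to run the switch statement and decode that instruction every time it executes. So a recompiler generates the minimum amount of assembly the CPU needs to emulate the instruction. Because we are working with assembly, recompilation is platform dependent.
Simple recompilers look at one instruction at a time and keep all target-platform (here, PS2) registers in memory. For each new instruction, the registers it uses are loaded from memory into real CPU registers, some instructions are executed, and the result register is stored back to memory. Before 0.9, Pcsx2 used this kind of recompilation.
More complex recompilers divide the code into simple blocks (no jumps or branches) and try to keep target-platform registers in real CPU registers across instructions. There are many kinds of register allocation algorithms based on graph coloring (translator's note: I don't understand this one). Such recompilers may also use constant propagation (translator's note: also known as constant folding). A common pattern on the MIPS Emotion Engine looks like this: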

lui s0, 0x1000
lw s0, 0x2000(s0)

Propagating the constants into the lw instruction gives a read address of 0x10002000.

[ Last edited by SSforME on 2021-7-10 15:17 ]


SSforME


#2, posted 2021-7-9 12:19

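A slightly more complex recompiler knows that 0x10002000 corresponds to the IPU, so it calls the IPU directly (without worrying about address translation).

There are many local optimizations like this, but they are not enough. At the end of every block, the registers have to be pushed out to memory, because the next simple block to execute cannot be predicted at recompilation time (for example, a branch taken if x >= 0 depends on the value of x at runtime).

A still more complex recompiler can work at global scope by figuring out how the simple blocks connect to one another. Once the connections are known, it no longer has to flush registers at the end of every simple block; it only needs to tell the next block which real CPU register holds which target-platform register. This technique is called global register allocation, and it sometimes uses Markov blankets for block synchronization. If you know Bayes nets, this is similar, except that it is applied to the global graph of simple blocks. Think about which other nodes are needed to make one node independent of the rest of the graph: the node's parents, its children, and its children's parents. Lost? Don't worry.
The Pcsx2 recompilers use MMX and SSE(1/2/3) interchangeably, so depending on the instructions involved, an EE register may live in an MMX, SSE, or regular x86 register (managing this is a nightmare).
Console emulators rarely needed recompilers this complex, because until a few years ago consoles were not that powerful. Starting with the PS2, consoles got powerful, and the Pcsx2 recompilers for the EmotionEngine and the Vector Units got complex very quickly. Pcsx2 0.9.1 supports all the optimizations mentioned above, plus many others. The VU recompiler (code-named SuperVU) is by far the most complex and the fastest; reading it is bad for your sanity.
Anyone who remembers the recompiler in 0.8.1 will appreciate how powerful the Pcsx2 0.9.1 optimizations are.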

x86-64
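So why isn't x86-32 enough? For starters, the Playstation 2 EE has 32 128-bit regular registers, 32 32-bit floating point registers, and some COP0 registers. Most instructions operate on 64 bits, and the multimedia instructions (MMI) operate on the full 128 bits. An x86 CPU, by contrast, has only 8 32-bit general purpose registers (one of which handles the stack), 8 64-bit registers (MMX), and 8 128-bit registers (SSE), and the three kinds cannot simply be mixed (for example, to add an x86 register to an SSE register you must first move the x86 value to SSE, or the SSE value to x86). So the gap in register capacity is large. Because x86 has so few registers, the recompiler keeps thrashing them (registers are spilled to memory very often). Memory reads and writes are slow, so the more thrashing, the slower the recompiler. In addition, x86-32 registers are natively 32-bit, so a 64-bit add takes two 32-bit instructions and four regular x86 registers for the source and result (two if reading from memory). The EE recompiler tries to relieve the register pressure with the 64-bit arithmetic of MMX, but the MMX instruction set architecture (ISA) is limited, and transfers between register sets badly hurt performance.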

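The x86-64 architecture has 16 64-bit general purpose registers, 8 64-bit MMX registers, and 16 128-bit SSE registers. The general purpose and SSE register counts have doubled, which means much less register thrashing. On top of that, 64-bit adds, shifts, and so on can each be done in one instruction.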

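But the story is not as simple as it sounds. The recompiler has to interface with regular C++ code (for example, when calling plugin functions), so the calling convention at the recompiler boundary must be followed exactly. The x86-64 specification can be found at www.x86-64.org and is quite clear, but Microsoft has its own specification (nobody knows why), so there are now two calling conventions, with two sets of rules for function arguments and registers and two ways of handling non-volatile data. (Thanks Microsoft, it wasn't annoying enough.)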

Because the register size changed, all pointers are now 64 bits, which complicates memory reads and writes, stack growth, and so on.

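Virtual memory is another obstacle on 64-bit operating systems. The AWE (Address Windowing Extensions) mapping trick, described in an earlier blog post, has to be reworked. The good news is that the address space is now much larger, so there are fewer limitations. The VM (virtual memory) builds for Linux also need a completely new implementation.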

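Finally, anyone who has read the Pcsx2 code knows that the recompilers use a lot of inline assembly. There are many reasons for using inline assembly rather than C++ code; in fact, some things, such as dynamic dispatch, are impossible to do in C++, so inline assembly is necessary... and it looks like Microsoft has disabled all inline assembly in 64-bit Visual C++!!!! (Thanks again Microsoft, you really know how to make trouble.)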

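Given all these challenges, it will take several months to produce a stable version. By then, more people will be running 64-bit operating systems. If our estimates are even half right, then once 64-bit recompilation is done, Pcsx2 will run much faster on a 64-bit OS than on a 32-bit OS on the same computer.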

zerofrog,

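Moral of the blog: most of the recompiler theory mentioned here comes straight from compiler theory. As long as new instruction set architectures (ISAs) keep appearing, compilers will always be needed. To learn how compilers work, I recommend Compilers: Principles, Techniques, and Tools by Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman.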

[ Last edited by SSforME on 2021-7-10 15:15 ]


SSforME


#3, posted 2021-7-9 12:21

PCSX2 64bit Recompilation
Many 64 bit architectures have been proposed; however, the x86-64 (aka AMD64) architecture has picked up a lot of speed since its initial proposal a couple of years ago. Most 64bit CPUs today support it, so it looks like a good candidate for 64bit recompilation. The x86-64 architecture offers many more registers and can potentially speed up games by a significant amount. Up to now, Pcsx2 has largely been ignoring the 64 bit arena because there have been massive compatibility issues, the developers weren't sure if it was really worth it, and adding a new bug-free and fast recompiler to the existing code base is a very painful process. Anyone seriously suggesting this to a dev would have been laughed out of the chat room. However, the upcoming 0.9.2 release is looking very stable and after doing some research, we have decided to add support for x86-64 recompilation, both for 64bit versions of Linux and Windows (yes, Linux support is returning).
Before going into technical details, I want to cover the current Pcsx2 recompilation model.
Pcsx2 Recompilation

Every different instruction set requires either an interpreter or a recompiler to execute it on the PC. Both are important in emulation. Interpreters are implemented with regular high-level languages and are platform independent. They are easy to program, easy to debug, but slow. They are extremely important for testing and debugging purposes. For example, interpreting a simple 32bit EE MIPS instruction (code) might look like:

switch (code >> 26) {
    case 0x02: // J - jump to
        pc = (code & 0x03ffffff) * 4; // change the program counter
        break;
    case 0x23: // LW - load word, sign extend
        gpr[Rt] = (long long)*(long*)(memory + gpr[Rs] + (short)code);
        break;
    ...
}
Recompilers, on the other hand, try to cut as many corners as possible. For example, we know the instruction at address 0x1000 will never change, so there is no reason why the CPU needs to execute the switch statement and decode the instruction every single time it executes it. So recompilers generate the minimal amount of assembly the CPU needs to execute to emulate that instruction. Because we're working with assembly, recompilation is a very platform dependent process.
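The decode-once idea can be illustrated without emitting any machine code. The following is only a sketch in portable C, assuming a toy two-instruction subset (ADDIU and LW) and hypothetical names (`Cpu`, `DecodedOp`, `predecode`, `run_block`) that do not come from Pcsx2. It caches a handler pointer and the extracted operand fields per instruction, which stands in for, and is much weaker than, the assembly a real recompiler would generate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical guest state: 32 GPRs (low 64 bits only) and a tiny RAM. */
typedef struct {
    int64_t gpr[32];
    uint8_t mem[256];
} Cpu;

typedef struct DecodedOp DecodedOp;
typedef void (*Handler)(Cpu *, const DecodedOp *);

/* One pre-decoded instruction: a handler plus already-extracted fields. */
struct DecodedOp {
    Handler fn;
    uint8_t rs, rt;
    int16_t imm;
};

static void op_lw(Cpu *c, const DecodedOp *d) {
    int32_t word;
    uint32_t addr = (uint32_t)(c->gpr[d->rs] + d->imm);
    assert(addr <= sizeof c->mem - sizeof word);
    memcpy(&word, c->mem + addr, sizeof word);
    c->gpr[d->rt] = word;                 /* sign-extends to 64 bits */
}

static void op_addiu(Cpu *c, const DecodedOp *d) {
    /* 32-bit wrapping add, result sign-extended to 64 bits */
    c->gpr[d->rt] = (int32_t)((uint32_t)c->gpr[d->rs] + (uint32_t)(int32_t)d->imm);
}

/* Decode once: the switch runs at translation time, not on every execution. */
static int predecode(const uint32_t *code, size_t n, DecodedOp *out) {
    for (size_t i = 0; i < n; i++) {
        out[i].rs  = (uint8_t)((code[i] >> 21) & 31);
        out[i].rt  = (uint8_t)((code[i] >> 16) & 31);
        out[i].imm = (int16_t)(code[i] & 0xffff);
        switch (code[i] >> 26) {
        case 0x09: out[i].fn = op_addiu; break;
        case 0x23: out[i].fn = op_lw;    break;
        default:   return -1;             /* not in this toy subset */
        }
    }
    return 0;
}

/* Execute a pre-decoded block; no opcode decoding happens here. */
static void run_block(Cpu *c, const DecodedOp *ops, size_t n) {
    for (size_t i = 0; i < n; i++)
        ops[i].fn(c, &ops[i]);
}
```

A caller would run `predecode` once per block and then call `run_block` as often as the block executes, so the cost of the `switch` is paid a single time.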
Simple recompilers look at one instruction at a time and keep all target platform (in this case, the PS2) registers in memory. For every new instruction, the used registers are read from memory and stored in real CPU registers, then some instructions are executed, and finally the register with the result is stored back in memory. Before 0.9, Pcsx2 used to employ this type of recompilation.
More complex recompilers divide the code into simple blocks (no jumps/branches) and try to preserve target platform registers across instructions in the real CPU registers. There are many different types of register allocation algorithms using graph coloring. Such recompilers might also do constant propagation. A common pattern in the MIPS Emotion Engine is something like:
lui s0, 0x1000
lw s0, 0x2000(s0)
If we propagated the constants at the lw, we know that the read address is 0x10002000.
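As a rough sketch of how that address could be derived at recompilation time, the C fragment below tracks which guest registers hold known constants within a block. The names (`ConstState`, `cp_lui`, `cp_lw`) are made up for illustration and are not Pcsx2's; a real tracker would also handle every other instruction that writes a register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-block tracker: which guest registers hold known constants. */
typedef struct {
    bool     known[32];
    uint32_t value[32];
} ConstState;

/* lui rt, imm: rt becomes the constant imm << 16. */
static void cp_lui(ConstState *s, int rt, uint16_t imm) {
    assert(rt >= 0 && rt < 32);
    s->known[rt] = true;
    s->value[rt] = (uint32_t)imm << 16;
}

/* lw rt, off(rs): returns true and writes the effective address when rs is a
   known constant. The loaded value is unknown at translation time, so rt
   stops being a known constant afterwards. */
static bool cp_lw(ConstState *s, int rt, int rs, int16_t off, uint32_t *addr) {
    assert(rt >= 0 && rt < 32 && rs >= 0 && rs < 32);
    bool ok = s->known[rs];
    if (ok)
        *addr = s->value[rs] + (uint32_t)(int32_t)off;  /* offset is sign-extended */
    s->known[rt] = false;
    return ok;
}
```

With s0 taken as register 16, calling `cp_lui(&s, 16, 0x1000)` followed by `cp_lw(&s, 16, 16, 0x2000, &addr)` yields `addr == 0x10002000`, matching the example in the text.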


SSforME


#4, posted 2021-7-9 12:22

A slightly more complex recompiler will know that 0x10002000 corresponds to the IPU, so the assembly will call the IPU straight away (without worrying about memory location translation).
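A sketch of that decision, assuming the address is already known to be constant at recompilation time. Only the address 0x10002000 comes from the text; the size of the IPU window (`IPU_SIZE`) and the names `Target` and `classify_const_addr` are assumptions made for this illustration.

```c
#include <stdint.h>

typedef enum { TARGET_RAM, TARGET_IPU } Target;

/* Assumed bounds for illustration; only 0x10002000 itself is from the text. */
#define IPU_BASE 0x10002000u
#define IPU_SIZE 0x40u

/* Decide at translation time which handler a constant address should use.
   The unsigned subtraction wraps for addresses below IPU_BASE, so a single
   comparison covers both bounds. */
static Target classify_const_addr(uint32_t addr) {
    return (addr - IPU_BASE < IPU_SIZE) ? TARGET_IPU : TARGET_RAM;
}
```

A recompiler could use the result to emit a direct call to the matching hardware handler instead of a generic memory access that has to translate the address at runtime.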

There are many such local optimizations; however, they aren't enough. At the end of every block, all the registers will have to be pushed to memory because the next simple block that needs to be executed can't be predicted at recompilation time (e.g., branch if x >= 0 depends on the value of x at runtime).
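The block-local scheme can be pictured as a small cache of guest registers held in host registers that is written back when a block ends. The sketch below is an illustration under stated assumptions, not Pcsx2's allocator: the names (`RegCache`, `rc_get`, `rc_flush_all`), the four-slot limit, and the round-robin eviction are all invented here. The `spills` counter shows how write-backs pile up once more guest registers are live than there are host slots.

```c
#include <assert.h>
#include <stdint.h>

#define HOST_REGS 4
#define NO_GUEST  (-1)

typedef struct {
    int64_t guest_mem[32];     /* guest registers as stored in memory */
    int64_t host[HOST_REGS];   /* stand-ins for real CPU registers */
    int     owner[HOST_REGS];  /* guest register cached in each slot */
    int     dirty[HOST_REGS];  /* slot modified since it was loaded */
    int     next_victim;       /* round-robin eviction cursor */
    int     spills;            /* evictions forced by register pressure */
} RegCache;

static void rc_init(RegCache *rc) {
    for (int i = 0; i < HOST_REGS; i++) {
        rc->owner[i] = NO_GUEST;
        rc->dirty[i] = 0;
    }
    rc->next_victim = 0;
    rc->spills = 0;
}

/* Store a slot back to guest memory if it was modified, then free it. */
static void rc_writeback(RegCache *rc, int slot) {
    if (rc->owner[slot] != NO_GUEST && rc->dirty[slot])
        rc->guest_mem[rc->owner[slot]] = rc->host[slot];
    rc->owner[slot] = NO_GUEST;
    rc->dirty[slot] = 0;
}

/* Return the slot holding guest register g, loading (and evicting) if needed. */
static int rc_get(RegCache *rc, int g) {
    assert(g >= 0 && g < 32);
    int slot = NO_GUEST;
    for (int i = 0; i < HOST_REGS; i++)
        if (rc->owner[i] == g) return i;
    for (int i = 0; i < HOST_REGS && slot == NO_GUEST; i++)
        if (rc->owner[i] == NO_GUEST) slot = i;
    if (slot == NO_GUEST) {                 /* no free slot: spill one */
        slot = rc->next_victim;
        rc->next_victim = (slot + 1) % HOST_REGS;
        rc->spills++;
        rc_writeback(rc, slot);
    }
    rc->owner[slot] = g;
    rc->host[slot] = rc->guest_mem[g];
    return slot;
}

/* End of a simple block: every cached register goes back to memory. */
static void rc_flush_all(RegCache *rc) {
    for (int i = 0; i < HOST_REGS; i++)
        rc_writeback(rc, i);
}
```

A caller modifies `rc->host[slot]` and sets `rc->dirty[slot]` after `rc_get`; the change only reaches `guest_mem` on eviction or on `rc_flush_all`, which is the per-block cost the paragraph above describes.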
An even more complex recompiler can work on the global scale by finding out which simple blocks are connected to which. Once it knows, it can get rid of the register flushing at the end of every simple block by simply telling the next blocks to allocate the same real CPU register to the same target platform register. This is called global register allocation and sometimes uses Markov blankets for block synchronization. For those people that know Bayes nets, this is very similar, except it applies to the global simple block graph. Just think about the nodes necessary for making a specific node independent with respect to the whole graph. This will include the node's parents, children, and the children's parents. For those that just got lost... don't worry.
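The node set described above (parents, children, and the children's other parents) is straightforward to compute on a block graph. The following sketch assumes a small adjacency-matrix representation with hypothetical names (`BlockGraph`, `markov_blanket`); how Pcsx2 actually uses the resulting set for block synchronization is not specified in the post and is not modeled here.

```c
#include <stdbool.h>

#define MAX_BLOCKS 8

/* edge[a][b] is true when block a can jump or fall through to block b. */
typedef struct {
    int  n;
    bool edge[MAX_BLOCKS][MAX_BLOCKS];
} BlockGraph;

/* Marks in out[] the parents, children, and children's other parents of v. */
static void markov_blanket(const BlockGraph *g, int v, bool out[MAX_BLOCKS]) {
    for (int i = 0; i < MAX_BLOCKS; i++)
        out[i] = false;
    for (int i = 0; i < g->n; i++) {
        if (g->edge[i][v])
            out[i] = true;                      /* parent of v */
        if (g->edge[v][i]) {
            out[i] = true;                      /* child of v */
            for (int p = 0; p < g->n; p++)
                if (g->edge[p][i])
                    out[p] = true;              /* another parent of that child */
        }
    }
    out[v] = false;   /* the node itself is not part of its own blanket */
}
```

For a graph with edges 0→2, 1→2, 2→3, and 4→3, the blanket of block 2 is {0, 1, 3, 4}: its two parents, its child, and the child's other parent.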
The Pcsx2 recompilers also use MMX and SSE(1/2/3) interchangeably. So an EE register can be in an MMX, SSE, or regular x86 register at any point in time depending on the current types of instructions (this is a nightmare to manage).
Console emulators rarely need to go through such complex recompilers because up until a couple of years ago, consoles weren't that powerful. But starting with the PS2, consoles got powerful and the Pcsx2 recompilers for the EmotionEngine and Vector Units got complex really fast. Pcsx2 0.9.1 supports all the above mentioned optimizations plus many more unmentioned ones. The VU recompiler (code named SuperVU) is by far the most complex and fastest. Anyone who wants to keep their sanity should stay away from it.
Those who remember what it was like in the 0.8.1 days can appreciate how powerful the 0.9.1 Pcsx2 optimizations are.


SSforME


#5, posted 2021-7-9 12:23

x86-64

So why isn't x86-32 enough? Well, for starters the Playstation 2 EE has 32 128bit regular registers, 32 32bit floating point registers, and some COP0 registers. Most instructions work on 64 bits, the MMI instructions work on the full 128bits. On the other hand, the x86 CPU has 8 32bit general purpose registers (one is for stack), 8 64bit registers (MMX), and 8 128bit registers (SSE). And you can't combine the three that easily (e.g., you can't add an x86 register with an SSE register before first transferring the x86 to SSE or vice versa). So there's a very big difference in register sizes. Because of the small number of x86 registers, the recompiler does a lot of register thrashing (registers are spilled to memory very frequently). Each memory read/write is pretty slow, so the more thrashing, the slower the recompiler becomes. Also, x86-32 is inherently 32bit, so a 64bit add would require 2 32bit instructions and 4 regular x86 registers for the source and result (2 if reading from memory). The EE recompiler tries to alleviate the register pressure by using the 64bit arithmetic capabilities of MMX, but MMX has a pretty limited ISA and transfers between register sets kill performance.
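The two-instruction 64bit add can be written out in C to make the cost visible. This is a minimal sketch with made-up names (`U64Pair`, `add64_via_32`); the two statements correspond to what an x86-32 ADD followed by ADC would do on the low and high halves.

```c
#include <stdint.h>

/* A 64-bit value held as two 32-bit halves, as on x86-32. */
typedef struct {
    uint32_t lo, hi;
} U64Pair;

/* 64-bit add built from two 32-bit adds plus a carry. */
static U64Pair add64_via_32(U64Pair a, U64Pair b) {
    U64Pair r;
    r.lo = a.lo + b.lo;                 /* ADD: low halves, may wrap */
    uint32_t carry = r.lo < a.lo;       /* wrapped if result is smaller */
    r.hi = a.hi + b.hi + carry;         /* ADC: high halves plus carry */
    return r;
}
```

On x86-64 the same operation is a single 64bit ADD on one register pair, which is the saving the following paragraph refers to.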

The registers on the x86-64 architecture are: 16 64bit general purpose registers, 8 64bit MMX registers, and 16 128bit SSE registers. That is twice as many general purpose and SSE registers, with each general purpose register twice as wide! This means much less register thrashing. On top of that, 64bit adds/shifts/etc can all be done in one instruction.

However, the story isn't as simple as it sounds. The recompiler has to interface with regular C++ code constantly (e.g., calling plugin functions), so the calling conventions on the recompiler boundaries must be followed exactly. The x86-64 specification can be found at www.x86-64.org and is pretty straightforward. However, Microsoft decided that it wanted its own specification (for reasons not quite known to anyone else), so now there are two different calling conventions with a different set of registers specifying arguments to functions and another different set acting as non-volatile data! (Thanks Microsoft, it wasn't difficult enough)
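To make the difference concrete, the two conventions can be summarized in a small table. The register lists below follow the published System V AMD64 and Microsoft x64 calling conventions; the C types and names (`CallConv`, `host_callconv`) are hypothetical and only show one way a code generator might select the right set at build time.

```c
#include <stddef.h>

typedef struct {
    const char *name;
    const char *int_args[6];   /* integer/pointer argument registers, in order */
    size_t      n_int_args;
    const char *callee_saved;  /* non-volatile registers a callee must preserve */
} CallConv;

static const CallConv SYSV_AMD64 = {
    "System V AMD64",
    { "rdi", "rsi", "rdx", "rcx", "r8", "r9" }, 6,
    "rbx rbp rsp r12 r13 r14 r15"
};

static const CallConv MS_X64 = {
    "Microsoft x64",
    { "rcx", "rdx", "r8", "r9" }, 4,
    "rbx rbp rdi rsi rsp r12 r13 r14 r15 xmm6-xmm15"
};

/* Pick the convention of the platform this code is compiled for. */
static const CallConv *host_callconv(void) {
#if defined(_WIN64)
    return &MS_X64;
#else
    return &SYSV_AMD64;
#endif
}
```

Note that `rdi` and `rsi` carry the first two arguments under System V but are non-volatile under the Microsoft convention, so generated code that calls into C++ cannot share one register assignment across both platforms.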

Because the size of the registers changed, all pointers are now 64 bits, which adds many difficulties to reading and writing from memory, incrementing the stack, etc.

Virtual memory is yet another obstacle to overcome with 64bit OSs. The AWE mapping trick (described in an earlier blog) has to be refined. But now that the address range is much bigger, there are fewer limitations. VM builds for Linux also need a completely new implementation.
Finally, if anyone has seen Pcsx2 code, they would know that inline assembly is pretty frequent in the recompilers. The reasons we use inline assembly rather than C++ code are many. Actually, some things like dynamic dispatching become impossible to do with C++ code. So, inline is necessary... and it looks like Microsoft has disabled all functionality for inline assembly in 64bit editions of Visual C++!!!! (Thanks again Microsoft, you just know where to strike hardest)

With all the mentioned challenges, it will take a couple of months to get things working reasonably stably. By that time, more people will have switched to 64bit OSs. If we're even half right in our estimates, Pcsx2 will run much faster on a 64bit OS than on a 32bit OS on the same computer once x86-64 recompilation is done.

zerofrog,

Moral of the blog: Most recompiler theory discussed here actually comes straight from compiler theory. Compilers will always be necessary as long as engineers keep coming up with new instruction set architectures (ISAs). Learn how a compiler works. I recommend Compilers: Principles, Techniques, and Tools by Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman.


greatliuli


#6, posted 2021-7-10 04:45

Too hardcore, I can't follow this.
