


[News] Latest from vgleaks! The 720's killer move? The long-rumored secret sauce: the data move engines block




  Moore’s Law imposes a design challenge: How to make effective use of ever-increasing numbers of transistors without breaking the bank on power consumption? Simply packing in more instances of the same components is not always the answer. Often, a more productive approach is to move easily encapsulated, math-intensive operations into hardware.

  The Durango GPU includes a number of fixed-function accelerators. Move engines are one of them.

  Durango hardware has four move engines for fast direct memory access (DMA).

  These accelerators are truly fixed-function, in the sense that their algorithms are embedded in hardware. They can usually be considered black boxes with no intermediate results that are visible to software. When used for their designed purpose, however, they can offload work from the rest of the system and obtain useful results at minimal cost.

  The following figure shows the Durango move engines and their sub-components.

  

  The four move engines all have a common baseline ability to move memory in any combination of the following ways:

  - From main RAM or from ESRAM
  - To main RAM or to ESRAM
  - From linear or tiled memory format
  - To linear or tiled memory format
  - From a sub-rectangle of a texture
  - To a sub-rectangle of a texture
  - From a sub-box of a 3D texture
  - To a sub-box of a 3D texture

  The move engines can also be used to set an area of memory to a constant value.

  DMA Performance

  Each move engine can read and write 256 bits of data per GPU clock cycle, which equates to a peak throughput of 25.6 GB/s both ways.  Raw copy operations, as well as most forms of tiling and untiling, can occur at the peak rate. The four move engines share a single memory path, yielding a total maximum throughput for all the move engines that is the same as for a single move engine. The move engines share their bandwidth with other components of the GPU, for instance, video encode and decode, the command processor, and the display output. These other clients are generally only capable of consuming a small fraction of the shared bandwidth.
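  The peak figure above can be checked with a little arithmetic. The sketch below is only an illustration: the article does not state the GPU clock, so the 800 MHz value is an inference from the two numbers it does give (256 bits per clock and 25.6 GB/s), and the function name is made up for this example.

```python
# Figures taken from the article.
BITS_PER_CLOCK = 256            # data read and written per GPU clock cycle
PEAK_BYTES_PER_SECOND = 25.6e9  # 25.6 GB/s in each direction

# Inferred, not stated in the article: the clock implied by those two figures.
bytes_per_clock = BITS_PER_CLOCK // 8
implied_clock_hz = PEAK_BYTES_PER_SECOND / bytes_per_clock


def peak_throughput_gb_s(bits_per_clock: int, clock_hz: float) -> float:
    """Peak one-way throughput in GB/s for a given bus width and clock."""
    return bits_per_clock / 8 * clock_hz / 1e9


print(f"bytes per clock: {bytes_per_clock}")
print(f"implied clock:   {implied_clock_hz / 1e6:.0f} MHz")
print(f"peak throughput: {peak_throughput_gb_s(BITS_PER_CLOCK, implied_clock_hz):.1f} GB/s")
```

  Because the four engines share one memory path, this is a ceiling for all of them together, not per engine.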

  The careful reader may deduce that raw performance of the move engines is less than could be achieved by a shader reading and writing the same data. Theoretical peak rates are displayed in the following table.

  Copy operation    Peak throughput using move engine(s)    Peak throughput using shader
  RAM -> RAM        25.6 GB/s                               34 GB/s
  RAM -> ESRAM      25.6 GB/s                               68 GB/s
  ESRAM -> RAM      25.6 GB/s                               68 GB/s
  ESRAM -> ESRAM    25.6 GB/s                               51.2 GB/s
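  To put the table's peak rates in perspective, here is a rough sketch of how long one copy would take at each rate. The surface size (a 1920 × 1080 target at 4 bytes per pixel) is a hypothetical example chosen for illustration and does not come from the article; real copies would also not reach theoretical peak.

```python
# Peak rates in GB/s, copied from the table in the article.
PEAK_GB_S = {
    ("RAM", "RAM"): {"move_engine": 25.6, "shader": 34.0},
    ("RAM", "ESRAM"): {"move_engine": 25.6, "shader": 68.0},
    ("ESRAM", "RAM"): {"move_engine": 25.6, "shader": 68.0},
    ("ESRAM", "ESRAM"): {"move_engine": 25.6, "shader": 51.2},
}

# Hypothetical surface, not from the article: 1080p at 4 bytes per pixel.
SURFACE_BYTES = 1920 * 1080 * 4


def copy_time_ms(num_bytes: float, gb_per_s: float) -> float:
    """Time in milliseconds to move num_bytes at a peak rate given in GB/s."""
    return num_bytes / (gb_per_s * 1e9) * 1e3


for (src, dst), rates in PEAK_GB_S.items():
    dma = copy_time_ms(SURFACE_BYTES, rates["move_engine"])
    shader = copy_time_ms(SURFACE_BYTES, rates["shader"])
    print(f"{src:>5} -> {dst:<5}  move engine {dma:.3f} ms   shader {shader:.3f} ms")
```

  The shader wins every row on raw time, but the shader copy occupies the GPU while it runs, whereas the move engine copy can proceed alongside other work.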

  The advantage of the move engines lies in the fact that they can operate in parallel with computation. During times when the GPU is compute bound, move engine operations are effectively free. Even while the GPU is bandwidth bound, move engine operations may still be free if they use different pathways. For example, a move engine copy from RAM to RAM would not be impacted by a shader that only accesses ESRAM.

  Generic lossless compression and decompression

  One move engine out of the four supports generic lossless encoding and one move engine supports generic lossless decoding. These operations act as extensions on top of the standard DMA modes. For instance, a title may decode from main RAM directly into a sub-rectangle of a tiled texture in ESRAM.

  The canonical use for the LZ decoder is decompression (or transcoding) of data loaded from off-chip sources, for instance the hard drive or the network. The canonical use for the LZ encoder is compression of data destined for off-chip. Conceivably, LZ compression might also be appropriate for data that will remain in RAM but may not be used again for many frames, for instance low-latency audio clips.

  The codec employed by the move engines is LZ77, the 1977 version of the Lempel-Ziv (LZ) algorithm for lossless compression. This codec is the same one used in zlib, gzip and other standard libraries. The specific standard that the encoder and decoder adhere to is RFC 1951 (the DEFLATE format). In other words, the encoder generates a compliant bit stream according to this standard, and the decoder can decompress certain compliant bit streams, and in particular, any bit stream generated by the encoder.

  LZ compression involves a sliding window and operates in blocks. The window represents the history available to pattern-match against. A block denotes a self-contained unit, which can be decoded independently of the rest of the stream. The window size and block size are parameters of the encoder. Larger window and block sizes imply better compression ratios, while smaller sizes require less calculation and working memory. The Durango hardware encoder and decoder can support block sizes up to 4 MB. The encoder uses a window size of 1 KB, and the decoder uses a window size of 4 KB. These facts impose a constraint on offline compressors. In order for the hardware decoder to interpret a compressed bit stream, that bit stream must have been created with a window size no larger than 4 KB and a block size no larger than 4 MB. When compression ratio is more important than performance, developers may instead choose to use a larger window size and decode in software.
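  The window constraint on offline compressors can be sketched with Python's standard zlib module, which produces RFC 1951 streams. This is only an illustration of the idea, not the console toolchain: the helper names are invented here, and the sketch handles the 4 KB window only (the 4 MB block limit is not addressed). A negative wbits value asks zlib for a raw DEFLATE stream, and 12 bits gives a 2^12 = 4 KB window.

```python
import zlib

# Assumption for this sketch: 4 KB decoder window, expressed as log2 for zlib.
HW_WINDOW_BITS = 12  # 2**12 bytes = 4 KB


def compress_small_window(data: bytes, window_bits: int = HW_WINDOW_BITS) -> bytes:
    """Compress to raw DEFLATE (RFC 1951) using a limited history window."""
    # Negative wbits means raw DEFLATE: no zlib header or checksum trailer.
    comp = zlib.compressobj(9, zlib.DEFLATED, -window_bits)
    return comp.compress(data) + comp.flush()


def decompress_raw(stream: bytes, window_bits: int = HW_WINDOW_BITS) -> bytes:
    """Software decode of a raw DEFLATE stream with the same window size."""
    decomp = zlib.decompressobj(-window_bits)
    return decomp.decompress(stream) + decomp.flush()


payload = b"durango move engine " * 1000
packed = compress_small_window(payload)
print(len(payload), "->", len(packed), "bytes")
print("round trip ok:", decompress_raw(packed) == payload)
```

  Compressing the same data with the default 32 KB window (wbits of -15) would typically give a somewhat better ratio, which is the trade-off the paragraph above describes.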

  The LZ decoder supports a raw throughput of 200 MB/s compressed data. The LZ encoder is designed to support a throughput of 150-200 MB/s for typical texture content. The actual throughput will vary depending on the nature of the data.

  

  JPEG decoding

  The same move engine that supports LZ decoding also supports JPEG decoding. Just as with LZ, JPEG decoding operates as an extension on top of the standard DMA modes. For instance, a title may decode from main RAM directly into a sub-rectangle of a tiled texture in ESRAM. The move engines contain no hardware JPEG encoder, only a decoder.

  The JPEG codec used by the move engine is known as ISO/IEC 10918-1, which was the 1994 JPEG committee standard. The hardware decoder does not support later standards, such as JPEG 2000 (wavelet encoding) or the format known variously as JPEG XR, HD Photo, or Windows Media Photo, which added a number of extensions to the base algorithm. There is no native support for grayscale-only textures or for textures with alpha.

  The move engine takes as input an entire JPEG stream, including the JFIF file header. It returns as output an 8-bit luma (Y or brightness) channel and two 8-bit subsampled chroma (CbCr or color) channels. The title must convert (if desired) from YCbCr to RGB using shader instructions.
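  As a reference for the conversion step left to the title, the sketch below converts a single sample using the full-range YCbCr equations from the JFIF specification. It is written in Python purely to show the arithmetic and assumes the decoder output follows JFIF conventions; in practice the same math would live in shader code, and the function name is made up for this example.

```python
def ycbcr_to_rgb(y: int, cb: int, cr: int) -> tuple:
    """Convert one full-range 8-bit YCbCr sample to 8-bit RGB (JFIF equations)."""

    def clamp(value: float) -> int:
        # Round to the nearest integer and clamp into the 0..255 range.
        return max(0, min(255, int(round(value))))

    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return clamp(r), clamp(g), clamp(b)


print(ycbcr_to_rgb(128, 128, 128))  # mid gray: chroma at 128 means no color
print(ycbcr_to_rgb(76, 85, 255))    # a strongly red sample
```

  With subsampled chroma, the same Cb and Cr values are shared by neighboring luma samples, so the conversion runs per luma texel after the chroma has been fetched or interpolated.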

  The JPEG decoder supports both 4:2:2 and 4:2:0 subsampling of chroma. For illustration, see Figures 2 and 3. 4:2:2 subsampling means that each chroma channel is half the resolution of luma in the x direction, which implies a footprint of 2 bytes per texel. 4:2:0 subsampling means that each chroma channel is half the resolution of luma in both the x and y directions, which implies a footprint of 1.5 bytes per texel. The subsampling mode is a property of the compressed image, specified at encoding time.
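  The per-texel footprints follow from one byte of luma plus two chroma channels at reduced resolution. A small sketch of that arithmetic, assuming chroma is halved in x only for 4:2:2 and in both x and y for 4:2:0; the function names and the example load are illustrative.

```python
from fractions import Fraction


def bytes_per_texel(h_div: int, v_div: int) -> Fraction:
    """Decoded footprint: 1 byte of luma plus two chroma channels, each
    subsampled by h_div horizontally and v_div vertically."""
    return 1 + 2 * Fraction(1, h_div * v_div)


def decoded_bytes_per_second(width, height, images_per_frame, fps, h_div, v_div) -> int:
    """Raw output rate of decoded image data for a steady per-frame load."""
    return int(width * height * images_per_frame * fps * bytes_per_texel(h_div, v_div))


print("4:2:2 bytes/texel:", float(bytes_per_texel(2, 1)))
print("4:2:0 bytes/texel:", float(bytes_per_texel(2, 2)))

# Example load: two 720p images per frame at 60 Hz with 4:2:2 subsampling.
rate = decoded_bytes_per_second(1280, 720, 2, 60, 2, 1)
print(f"{rate / 1e6:.0f} MB/s")
```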

  In the case of 4:2:2 subsampling, the luma and chroma channels are interleaved. The GPU supports special texture formats (DXGI_FORMAT_G8R8_G8B8_UNORM) and tiling modes to allow all three channels to be fetched using a single instruction, even though they are of different resolutions.

  JPEG decoder output, 4:2:2 subsampled, with chroma interleaved.

  

  In the case of 4:2:0 subsampling, the luma and chroma channels are stored separately. Two fetches are required to read a decoded pixel—one for the luma channel and another (with different texture coordinates) for the chroma channels.

  JPEG decoder output, 4:2:0 subsampled, with chroma stored separately.

  

  Throughput of JPEG decoding is naturally much less than throughput of raw data. The following table shows examples of processing loads that approach peak theoretical throughput for each subsampling mode.

  Peak theoretical rates for JPEG decoding.

  Subsampling mode    Peak performance                    Raw data rate
  4:2:2               two 720p images/frame at 60 Hz      2 × 1280 × 720 × 2 bytes × 60 Hz = 221 MB/s
  4:2:0               two 1080p images/frame at 60 Hz     2 × 1920 × 1080 × 1.5 bytes × 60 Hz = 373 MB/s

  System and title usage

  Move engines 1, 2 and 3 are for the exclusive use of the running title.

  Move engine 0 is shared between the title and the system. During the system’s GPU time slice, the system uses move engine 0. During the title’s GPU time slice, move engine 0 can be used by title code. It may also be used by Direct3D to assist in carrying out title commands. For instance, to complete a Map operation on a surface in ESRAM, Direct3D will use move engine 0 to move that surface to main memory.

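This is probably why the 720 is rumored to be not so easy to develop for.

There's a lot here I don't understand. Put simply, it's roughly about optimizing CPU/GPU scheduling and bandwidth as much as possible.

One fellow described it simply and effectively, which is quite interesting.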

Originally Posted by Drek:

You're moving to a new house, packing up everything you own and making a few trips to the new place to get it all moved.

With the PS4's solution, GDDR5, you basically have a box truck that drives 70 mph everywhere all the time and can hold 4,000 pounds of your shit in a trip. When you get to your destination you need to unpack all of this shit too, at 1.8 tons an hour.

With the next Xbox you have a big box truck that can carry 8,000 pounds of your shit but it only drives at 28 mph. You also have a car that can carry 320 pounds of your shit and can drive at 41 mph. Once you get to your destination you have to unpack all your shit at only 1.2 tons an hour, but thankfully when you go to unpack, your shit is already being sorted for you by a friend outside, helping to organize how you're receiving the stuff you unpack inside.

The Move Engines are that friend. They aren't going to do any of the unpacking (read: processing). They aren't going to carry any of your shit to the new place (read: memory). They just make everything run more smoothly, helping to reduce snags and bottlenecks.

If you were very diligent in how you packed either box truck you reduce the need for such a friend, but it's sure nice to have them no matter what. Chances are the PS4 will have some level of this same concept as well, though likely not to the same extent that MS is incorporating, since MS needs some way to get over the DDR3 bottleneck.

I'd say it's a further complication resulting from MS wanting 8 GB of memory and therefore going with clearly too slow DDR3. Move Engines, ESRam, etc. are all attempts to patch over that deficiency. This is also likely why MS is running them pre-programmed, because making developers have to manage this to avoid a bottleneck the PS4 doesn't have would be a pain in the ass.

It's like being able to transfer data from your main memory on PC to your GPU memory without using up CPU or GPU time. And while transparently doing some compression/decompression.


Tie this to the earlier news and it may well line up.
several development sources have told us that Sony’s solution is preferable when it comes to leveraging power. Studios working with the next-gen Xbox are currently being forced to work with only approved development libraries

[ Last edited by 倍舒爽 on 2013-2-7 03:04 ]


TOP

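Quote:
Originally posted by shiningfire on 2013-2-7 02:19
So this time it's the MS fans' turn to play the "unlimited potential" card?
Heh, if that's true it's not bad at all: let the MS fans get a taste of the frustration and disappointment Sony fans went through back then. Sony fans had to wait until E3 2009 before there were games that let them hold their heads up.
Getting to stand in the other side's shoes and live it firsthand is a rare thing.
As Sis would put it: it's practically good for the spiritual elevation of mankind.
Still, given Microsoft's software development strength, I believe they can get it optimized fairly thoroughly within a year.

But none of that matters anymore.
Because this time the big premise is human-computer interaction.

This time Sony lets you watch Iron Man in 4K, while MS lets you actually be Iron Man. :D

[ Last edited by 倍舒爽 on 2013-2-7 02:37 ]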



TOP

posted by wap, platform: Nokia (E71)
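Quote:
Originally posted by @讴歌123 on 2013-2-7 23:11
There's also a Siri-like natural speech recognition system: say "xbox on" to wake the console, say "what are my friends playing" to get a list of the games your friends are playing. Think about it, it's awesome.

http://www.theverge.com/2013/2/7 ... cogni ...
Yeah, I said this ages ago.
As for trivial stuff like ordering takeout while gaming, that's of course even less of a problem.
The 720 will absolutely be your butler, your assistant, even your mom.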


TOP

posted by wap, platform: Nokia (E71)

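Some people's moms do help clean their rooms, you know.
And the 720 can do it without a word of complaint.

Seen how Tony scales objects up and down?
Apart from the 720 not being a holographic projection, that is.
Pick up your SmartGlass, swipe up, and whoa, it's on your projection.

"I want to type!"
No problem!
The world's biggest keyboard is right before your eyes: type in the air!
A true virtual keyboard. Hookups are practically within arm's reach.

Casting spells?
so easy!
Take TG's recent hot keyword, face-slapping, as an example.
Slap the left cheek to cast fire, slap the right to cast ice!
Slapping your butt is, of course, the escape spell.
The harder you slap, the greater the power; the harder you smack, the farther you blink!
The ultimate full-screen spell? Invincibility in PK?
Sure! But you pay a price, much like the Sunflower Manual.
Poke out both your own eyes, with automatic contact of the nearest hospital thrown in!
And it doesn't break game balance at all.
This is how an MMORPG should be played. You still have an outdated concept like MP? Please!!!

Without high-precision capture you simply can't enjoy a revolution like this.
This is the true core spirit of "u r the controller".
Just thinking about it is almost too awesome to bear!

Back then, Apple's clever use of capacitive touch set off a phenomenon.
The 720 could do the same!

So why are you MS fans always frowning?
Isn't it rather sad if online pixel-counting wars take up most of your gaming life?
Think it through: do you want Microsoft to win, or yourselves?

That's the logic, isn't it?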

TOP

posted by wap, platform: Nokia (E71)
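Quote:
Originally posted by @BD on 2013-2-8 03:04
posted by wap, platform: iPhone

I can't tell whether you really think this or are just cosplaying an MS fan.
The replies from the NeoGAF crowd hit the weak spot every time.
Ideals are always nice; reality is often cruel.
For me it's simple: I treat the war zone as the chat zone.
Whatever the situation, I think everyone should stay optimistic; looking on the bright side is good for mental health.

Seriously though, Kinect really can realize a lot of ideas.
But the first generation's input latency severely limited what could be realized, which made it too hard for Kinect to blend into core-gamer titles.
It couldn't even do a QTE well.
Kinect 2's mission is not only to expand the user base; it should also be able
to win over core users and fit in with them.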

TOP

posted by wap, platform: Nokia (E71)

Nothing.

Last edited by 倍舒爽 on 2013-2-8 14:12 via the mobile version

TOP

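By the way, whose part was sourced for the 2D camera in the first-gen Kinect?


Do you think they'll use this thing this time?
It's expensive.
It should be the newest and best available right now. Would Microsoft source parts from Sony?

It supports 720p at 60 fps and 480p at 120 fps.

[ Last edited by 倍舒爽 on 2013-2-8 23:03 ]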

TOP

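Actually, when BD says you have to pile on hardware, that's beyond question! It's a must!
Beyond specs like resolution and frame rate, signal quality is critical.
It puts a hard limit on what software can extract.
Low-light image quality has long been one of the hard problems the imaging field is trying to crack.

Raising Kinect's precision, even to the point of recognizing fingers, is certainly not easy.
But I'm still optimistic.

Bear in mind that in Kinect 1's day there were no low-light technologies such as BSI CMOS, stacked CMOS, and RGBW coding.

Comparison of Sample pictures in low-light setting
(10 lux)

DSP processing speed has also grown enormously; the IMX135 mentioned earlier can do real-time HDR, which feels incredible!!
(Personally, I think this works by compositing two 30 fps video streams with different EV values in real time, which means this tiny DSP not only can handle 1080p
at 60 fps or more but can also encode in real time.)

Relying entirely on the host machine to post-process the image is impractical; 1080p raw data is very large.
No pipe, however fat, can compare with having the processing stacked right on the sensor.


That's why I've always felt that if Sony and Microsoft could collaborate on new human-computer interaction methods, progress might come five years sooner.

[ Last edited by 倍舒爽 on 2013-2-9 01:32 ]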

TOP
