Could the RTX 3090 possibly be a dual-GPU card??

yisen2994 wrote:
The 3090 Ti would be the one.
If you need it, just wait.

That would be wild. Dual-GPU cards making a comeback, and a single card breaking NT$100,000 wouldn't be a dream.
You can't tell from the Founders Edition card,
but judging from the board-partner cards it doesn't look like a dual-GPU design.
No benchmark no life
ya19881217 wrote:
It isn't really padding the spec; the floating-point throughput did genuinely go up.

You've got the definition of spec inflation wrong.

It's not enough that there is an improvement; the question is whether performance actually reaches what twice the CUDA cores would imply.

If performance goes up 30% while the CUDA count is listed as 2x, how is that not inflation?

Let's look at the latest numbers.

First DX12 game result for the RTX 3080 leaks: at stock clocks it is only 27% faster than the 2080 Ti
https://news.mydrivers.com/1/712/712050.htm

A leaker has jumped the gun and posted an RTX 3080 benchmark result from Ashes of the Singularity.

At the Crazy_4K preset, the i9-9900K + RTX 3080 system scored 8700 overall, with a frame rate of 88.3 FPS under the combined workload.

The result was reportedly obtained at stock clocks; at the same stock clocks the RTX 2080 Ti scores 69.9 FPS, the RTX 2070 46.9 FPS, and the RX 5700 XT 45.5 FPS.

In other words, judging by Ashes of the Singularity under DX12 alone, the RTX 3080 is about 27% faster than the RTX 2080 Ti.
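As a quick sanity check on that 27% figure, here is a minimal sketch in Python that just divides the FPS numbers quoted above (nothing beyond those numbers is assumed):

# Sanity check using only the Crazy_4K FPS figures quoted above.
fps = {"RTX 3080": 88.3, "RTX 2080 Ti": 69.9, "RTX 2070": 46.9, "RX 5700 XT": 45.5}
baseline = fps["RTX 2080 Ti"]
for card, value in fps.items():
    print(f"{card}: {value / baseline:.2f}x a 2080 Ti")
# The RTX 3080 comes out at about 1.26x, i.e. roughly the quoted 27%.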

---

Right now NVIDIA's performance claims for the RTX 30 series are all built on best-case numbers with ray tracing enabled plus DLSS 2.0.

Even if the 30-series CUDA cores have an improved, more efficient architecture, if performance does not actually double,

then in my view the official spec page should not list the CUDA count at twice the number; it carries a serious whiff of false, misleading advertising.
SKAP wrote:
You've got the definition of spec inflation wrong... (snipped)


I read the linked news article, and in the comments below even mainland Chinese gamers don't really buy it. One screenshot and everything else guesswork? A counterattack by merchants sitting on 2080 Ti stock? Let's wait until the actual product can be tested by ordinary gamers before treating anything as fact.

Ordinary gamers can't outplay the merchants! Hold your wallet tight, cash is king, and don't believe every story that comes along.
denmark123 wrote:
I read the linked news article... (snipped)

You mean real-world performance testing?

Digital Foundry should be fair enough, then. Although the embargo kept them from showing raw fps directly,
they could still compare 3080-versus-2080 performance across several games as percentages.


The 3080 is roughly 60-90% faster than the 2080, about 75% on average.

That works out to roughly a 30-50% improvement over the 2080 Ti.
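A rough sketch of how that conversion works. The 3080-versus-2080 range comes from the DF figures above; the 2080 Ti's uplift over a plain 2080 (here taken as 25-35%) is my own assumption for illustration:

# Converting "faster than a 2080" into "faster than a 2080 Ti".
# The 25%/35% uplift of a 2080 Ti over a 2080 is an assumed range, not from the DF video.
for gain_3080 in (1.60, 1.75, 1.90):          # 3080 relative to 2080 (from the DF range)
    for gain_ti in (1.25, 1.35):              # assumed 2080 Ti relative to 2080
        uplift = gain_3080 / gain_ti - 1
        print(f"3080 {gain_3080:.2f}x, Ti {gain_ti:.2f}x -> {uplift:.0%} over a 2080 Ti")
# The results land roughly in the 20-50% band, consistent with the 30-50% estimate above.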

To be clear, I know the 3080 is a big improvement and great value.

But what I am talking about is that "NVIDIA's official site lists the CUDA count at twice the physical number of units".

That practice smacks of misleading, inflated marketing.

Either list the physical count and add a note that two shading operations can be executed per clock cycle, hence the large performance jump,

or list it as roughly 1.5x, closer to the actual performance, and I would have no objection.

Labeling it flat-out as 2x is just laughable; there is no need to make excuses for NVIDIA.
SKAP wrote:
Either list the physical count and add a note that two shading operations can be executed per clock cycle, hence the large performance jump,


The physical count really did go from 64 floating-point units (CUDA Cores) per Streaming Multiprocessor in the Turing architecture to 128 floating-point units (CUDA Cores) per SM in Ampere. It is not, as some netizens wildly speculated before the official architecture details were published, some hyper-threading trick that lets it "execute two shading operations per clock cycle".


First, the "Cores" in NVIDIA's CUDA Cores are not cores in the usual CPU sense (the closest conceptual equivalent on an NVIDIA GPU is the Streaming Multiprocessor, which contains many CUDA Cores).

NVIDIA's definition of the term CUDA Core has never changed: to highlight that a GPU can run a huge number of floating-point operations in parallel, it calls every floating-point unit (FPU) a CUDA Core. The number of floating-point units in Ampere really did double, so under NVIDIA's definition the CUDA count naturally doubles.

The naming convention was established long ago, and what you are given really has doubled, so where is the inflation? At most, the sudden large architectural change causes some efficiency loss.
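For what it's worth, the doubled count is exactly the number NVIDIA's theoretical FP32 figures are built from. A minimal sketch of the standard peak-throughput formula; the boost clocks below are the published reference clocks quoted from memory, so treat them as approximate:

# Peak FP32 = CUDA cores x 2 ops per FMA x boost clock (GHz), expressed in TFLOPS.
# Boost clocks are reference values quoted from memory - approximate.
def peak_fp32_tflops(cuda_cores, boost_ghz):
    return cuda_cores * 2 * boost_ghz / 1000

print(peak_fp32_tflops(8704, 1.71))    # RTX 3080    -> ~29.8 TFLOPS
print(peak_fp32_tflops(4352, 1.545))   # RTX 2080 Ti -> ~13.4 TFLOPS

Whether that theoretical doubling shows up in games is of course a separate question, which is what this thread is arguing about.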
kkk123kkk123kkk wrote:
The physical count really did go from 64 per Turing SM... (snipped)

You want to get into the details? OK.

Claim 1:
https://www.pcmarket.com.hk/20200902-rtx-3090-3080-3070-announce/

The RTX 3090/3080 use the GA102 core with 28 billion transistors, yet the RTX 3090/3080 somehow have 10496 and 8704 CUDA Cores, even more than the 6912 CUDA Cores of the GA100 with its 54.2 billion transistors, which is intriguing. The reason is that the Ampere architecture greatly reworks the FP32 units to deliver 2X performance. Under the traditional counting method, the RTX 3090/3080 would only have 5248 and 4352 CUDA Cores respectively, but an aggressive Product Marketing team labels them with double the cores, i.e. 10496 and 8704.
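For reference, both counting schemes in that quote follow directly from the SM count. A quick sketch; the SM counts (82 for the RTX 3090, 68 for the RTX 3080) are the widely reported figures and are my addition, not part of the quoted article:

# CUDA core totals under both counting schemes (SM counts are added by me).
for name, sms in (("RTX 3090", 82), ("RTX 3080", 68)):
    print(name, "128 per SM:", sms * 128, "| 'traditional' 64 per SM:", sms * 64)
# -> 10496 / 5248 and 8704 / 4352, matching the figures in the quote.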

Claim 2:


Claim 3: NVIDIA's Senior Vice President of Content and Technology answered this question
https://www.nvidia.com/en-us/geforce/news/rtx-30-series-community-qa/

Q: Could you elaborate a little on this doubling of CUDA cores? How does it affect the general architectures of the GPCs? How much of a challenge is it to keep all those FP32 units fed? What was done to ensure high occupancy?

One of the key design goals for the Ampere 30-series SM was to achieve twice the throughput for FP32 operations compared to the Turing SM. To accomplish this goal, the Ampere SM includes new datapath designs for FP32 and INT32 operations. One datapath in each partition consists of 16 FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores. As a result of this new design, each Ampere SM partition is capable of executing either 32 FP32 operations per clock, or 16 FP32 and 16 INT32 operations per clock. All four SM partitions combined can execute 128 FP32 operations per clock, which is double the FP32 rate of the Turing SM, or 64 FP32 and 64 INT32 operations per clock.

Doubling the processing speed for FP32 improves performance for a number of common graphics and compute operations and algorithms. Modern shader workloads typically have a mixture of FP32 arithmetic instructions such as FFMA, floating point additions (FADD), or floating point multiplications (FMUL), combined with simpler instructions such as integer adds for addressing and fetching data, floating point compare, or min/max for processing results, etc. Performance gains will vary at the shader and application level depending on the mix of instructions. Ray tracing denoising shaders are good examples that might benefit greatly from doubling FP32 throughput.

Doubling math throughput required doubling the data paths supporting it, which is why the Ampere SM also doubled the shared memory and L1 cache performance for the SM. (128 bytes/clock per Ampere SM versus 64 bytes/clock in Turing). Total L1 bandwidth for GeForce RTX 3080 is 219 GB/sec versus 116 GB/sec for GeForce RTX 2080 Super.

Like prior NVIDIA GPUs, Ampere is composed of Graphics Processing Clusters (GPCs), Texture Processing Clusters (TPCs), Streaming Multiprocessors (SMs), Raster Operators (ROPS), and memory controllers.

The GPC is the dominant high-level hardware block with all of the key graphics processing units residing inside the GPC. Each GPC includes a dedicated Raster Engine, and now also includes two ROP partitions (each partition containing eight ROP units), which is a new feature for NVIDIA Ampere Architecture GA10x GPUs. More details on the NVIDIA Ampere architecture can be found in NVIDIA’s Ampere Architecture White Paper, which will be published in the coming days.
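A toy model may help show what that datapath description means in practice: per partition, one path is FP32-only and the other is FP32-or-INT32, so the share of integer instructions in a shader eats directly into the FP32 doubling. This is my own simplified reading of the Q&A above, not an official formula:

# Toy model of per-SM, per-clock issue on Ampere (4 partitions of 16+16 lanes),
# based on my reading of the Q&A above - a deliberate simplification.
def ampere_fp32_per_clock(int_fraction):
    # int_fraction: share of issued instructions that are INT32 (0.0 to 1.0)
    total_lanes = 128                                   # 64 FP32-only + 64 FP32-or-INT32
    int_lanes = min(int_fraction * total_lanes, 64)     # INT32 only runs on the shared path
    return total_lanes - int_lanes                      # remaining lanes issue FP32

for frac in (0.0, 0.25, 0.35):
    print(f"{frac:.0%} INT32 -> {ampere_fp32_per_clock(frac):.0f} FP32 ops/clock (Turing SM peak: 64)")
# Pure FP32 gives the full 2x (128 vs 64); with a realistic share of integer work the
# advantage shrinks, which is one reason games don't simply scale by 2x.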

Turing and Ampere architecture diagrams:
https://wccftech.com/nvidia-details-geforce-rtx-30-series-graphics-cards-reddit/




Based on the information Tony provided, Hardwareluxx created a block diagram representation of the Ampere SM. The new SM block looks close to the final one and you can note the dual FP32 units in two data paths. Each SM consists of 128 CUDA cores which is why we have seen a doubling of the core count on the Ampere GPU. We will have a more detailed article on the Ampere GPUs & the underlying architecture on 17th September so look forward to it.
SKAP wrote:
You want to get into the details? OK. Claim 1... (snipped)


First of all, thanks for posting the sources for me; my previous reply was based entirely on the official Q&A.

Your Claim 1 is pure speculation by the article's author. What exactly is this "traditional counting method" of his??? NVIDIA has always counted the number of FP32 execution units; did this writer redefine CUDA Cores on NVIDIA's behalf? I'm even more curious what his definition is actually supposed to be counting.

On top of that, dragging in the A100, a deep-learning-specialized card with no comparability whatsoever, to compare transistor counts is comparing apples and oranges; it backfires and only highlights the writer's ignorance. The previous-generation compute flagship V100 has only 21.1 billion transistors, versus the A100's 54.2 billion; with only 39% of the transistors it still reaches 75% of the CUDA Cores. Does that mean Volta, and likewise Pascal / Turing, have been inflating their numbers all along? All you need to know is that a chip contains many components, and the execution units are not just FP32: compute cards carry dedicated FP64 double-precision units, and on top of that the A100's L2 cache is 6.7 times the size of the V100's. Bringing up the A100 actually proves that, across architectures, the FP32 count (CUDA Cores) and the transistor count do not necessarily grow in a linear relationship.

Beyond that, did you know that Turing added a whole pile of new things, including dedicated INT32 execution units with their own datapath, Tensor Cores, RT Cores and so on, yet the increase in transistor count over Pascal is only about 10%? Adding twice the FP32 execution units does not mean the whole chip's transistor count has to double along with it.
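To make those ratios explicit (the V100's CUDA core count of 5120 is a figure I'm adding; the transistor counts and the A100's 6912 cores are from the discussion above):

# Ratio check for the V100-vs-A100 comparison.
# The V100's 5120 CUDA cores is my added figure; the rest come from the posts above.
v100_transistors, a100_transistors = 21.1e9, 54.2e9
v100_cores, a100_cores = 5120, 6912
print(f"transistor ratio: {v100_transistors / a100_transistors:.0%}")   # ~39%
print(f"CUDA core ratio:  {v100_cores / a100_cores:.0%}")               # ~74%, i.e. roughly the 75% cited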

As for the Claim 2 you posted, on the question of the CUDA Core count it is simply that netizen's misreading of the official reply, so let's go straight to the official answer in Claim 3:
One of the key design goals for the Ampere 30-series SM was to achieve twice the throughput for FP32 operations compared to the Turing SM. To accomplish this goal, the Ampere SM includes new datapath designs for FP32 and INT32 operations. One datapath in each partition consists of 16 FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores. As a result of this new design, each Ampere SM partition is capable of executing either 32 FP32 operations per clock, or 16 FP32 and 16 INT32 operations per clock. All four SM partitions combined can execute 128 FP32 operations per clock, which is double the FP32 rate of the Turing SM, or 64 FP32 and 64 INT32 operations per clock.

Let's lay out how the architecture has evolved over the last three generations.

One Pascal SM* contains:
64 CUDA Cores [FP32 or INT32]

One Turing SM contains:
64 dedicated INT32 units plus 64 CUDA Cores [FP32]

One Ampere SM contains:
64 CUDA Cores [FP32 or INT32] plus 64 CUDA Cores [FP32]
128 in total

From the comparison above you can see that the CUDA Cores Ampere adds relative to Turing are functionally essentially the same as Pascal's CUDA Cores. If you refuse to count these as CUDA Cores, then by that definition wouldn't a GTX 1080 Ti have a grand total of 0 CUDA Cores?
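As a rough illustration of why that distinction matters for throughput and not just for counting, here is a small sketch of per-SM, per-clock FP32 issue for the three layouts above, in a purely-FP32 shader versus one saturated with INT32 work (my own simplification):

# Per-SM, per-clock issue under the three SM layouts listed above - a simplification.
# flex = lanes that do FP32 or INT32, fp = FP32-only lanes, int32 = INT32-only lanes.
layouts = {
    "Pascal": dict(flex=64, fp=0,  int32=0),
    "Turing": dict(flex=0,  fp=64, int32=64),
    "Ampere": dict(flex=64, fp=64, int32=0),
}
for name, l in layouts.items():
    pure_fp32 = l["flex"] + l["fp"]   # all-FP32 shader: every capable lane issues FP32
    int_heavy = l["fp"]               # INT32-saturated shader: flex lanes busy with INT32
    print(f"{name}: {pure_fp32} FP32/clk pure, {int_heavy} FP32/clk when INT32-saturated")
# Pascal 64/0, Turing 64/64, Ampere 128/64: Ampere only reaches 2x Turing when the
# workload is nearly all FP32, which helps explain why real-game gains fall short of 2x.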

Also, the Wccftech image you posted at the end states clearly that it is their own mock-up, not an official diagram.

Below is the official Turing architecture diagram. It clearly shows one SM with only 64 FP32 units (CUDA Cores), well below the 128 CUDA Cores in the Ampere mock-up, and the excerpt you posted under the image also says the FP32 count has doubled:
and you can note the dual FP32 units in two data paths. Each SM consists of 128 CUDA cores which is why we have seen a doubling of the core count on the Ampere GPU.



Image source: page 12 of the official Turing architecture whitepaper, https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf

*The Pascal line is written that way for ease of understanding. According to page 11 of the Turing whitepaper linked above, strictly speaking the level one step above the SM in NVIDIA's GPU hierarchy is the TPC (Texture Processing Cluster):
—Pascal: each TPC has one SM containing 128 CUDA Cores
—Turing: each TPC has two SMs with 64 CUDA Cores each, so each TPC still has 128 CUDA Cores
—Ampere: the whitepaper covering its TPC has not been published yet; assuming the number of SMs per TPC is unchanged, it should have 256 CUDA Cores (as explained above, equivalent to 128 Pascal-style CUDA Cores + 128 Turing-style CUDA Cores)
The Turing architecture features a new SM design that incorporates many of the features introduced in our Volta GV100 SM architecture. Two SMs are included per TPC, and each SM has a total of 64 FP32 Cores and 64 INT32 Cores. In comparison, the Pascal GP10x GPUs have one SM per TPC and 128 FP32 Cores per SM. The Turing SM supports concurrent execution of FP32 and INT32 operations (more details below), independent thread scheduling similar to the Volta GV100 GPU.