US Semiconductors ISSCC Conference Takeaways: 5G, Accelerators, Emerging Memory Technology

Global Research
21 February 2019

US Semiconductors
2019 ISSCC Takeaways: 5G, Accelerators, 3D NAND, DDR5, Emerging Memory Tech

Key themes - 5G, Accelerators, 3D NAND, DDR5, Emerging Memory
This week we attended ISSCC in San Francisco, one of the premier technical conferences for the semiconductor industry. Key focus topics included 5G, Processors/AI and Memory, with presentations from Intel, Samsung, Toshiba/WD, SK Hynix, Hitachi, TSMC and academia, including research from MIT.

5G: Intel Key Note unveiled 5G initiatives at Intel Labs
Intel took the wraps off a number of 5G initiatives underway at Intel Labs. INTC's approach to 5G is comprehensive and spans devices and wireless tech at the edge, core and cloud. 5G is spawning a re-design of RF circuitry, and INTC unveiled designs of its millimeter wave beam forming circuitry, Low Power Wake up Radio, 5G NR Wireless Engine, PMICs, and Front end module + Antennae. All of this supports wider frequency bands and lower latency in 5G, and INTC should be a much bigger 5G player than many investors appreciate.

Heterogeneous compute architectures to the fore
The trend towards heterogeneous compute architectures (CPU + accelerators like GPUs) was clearly evident at ISSCC. IBM provided an architectural overview of Summit & Sierra (using 27,500 NVDA Volta GPUs), the top ranked supercomputers in the world. There were several presentations around ASIC designs optimized for specific AI/ML operations such as image processing/automation, and Intel featured a distributed multi robot system using its custom SoC.

Memory: 3D NAND scaling continues, ushering in DDR5
3D NAND scaling roadmaps continue to extend farther into the future. MU is ahead on QLC NAND, but Toshiba/WD discussed details of their 96L QLC 3D NAND for the first time. Layer migration remains key to cost downs, and Toshiba/WD also showed details of its 128L TLC NAND, while Samsung detailed its 6th Gen TLC 3D NAND. On the DRAM front, as the industry gets ready to usher in DDR5, managing high data rates (6.4Gbps/pin vs. 3.2Gbps/pin for DDR4) remains a major challenge. Samsung & Hynix detailed the design challenges involved in LPDDR5/DDR5. Intel presented at length on emerging memory technologies targeting the embedded space - Re-RAM and STT-MRAM, neither of which is likely to displace DRAM any time soon in our view. We also question INTC's long-term commitment to the memory sector.

Semiconductors, Global Equities
Timothy Arcuri, timothy.arcuri, +1-415-3525676
Pradeep, Associate Analyst, pradeep.ramani, +1-415-3525517
John, j.ahn, +1-415-3525624
/investmentresearch

This report has been prepared by UBS Securities LLC. ANALYST CERTIFICATION AND REQUIRED DISCLOSURES BEGIN ON PAGE 18. UBS does and seeks to do business with companies covered in its research reports. As a result, investors should be aware that the firm may have a conflict of interest that could affect the objectivity of this report. Investors should consider this report as only a single factor in making their investment decision.

Summary of Select Presentations

Intel 5G Key Note
5G Wireless comms at an inflection point, Vida Ilderem, Intel Labs

Background
- Need change in infrastructure for 5G
- IoT - 50B devices, 212B sensors, 85% unconnected, 44 Zettabytes - majority of sensors are not connected yet
- Cost of memory, compute and broadband has come down, driving IoT
- High bandwidth applications are overloading networks, but this will change in 5G - connected factory = 1MM GB/day, A/D car = 1 GB/sec

5G key attributes
- Human to human comms - high data rates
- Machine comms - massive number of devices that need battery power, hard to reach areas, low complexity, but slower data rates
- Mission critical - high reliability and low latency comms
- Need global spectrum harmonization; 5G NR is the global standard

Spectrum is everything for 5G - 3 segments of 5G
- IoT: 100 MHz
- 3-10 GHz for LTE + WiFi
- MM Wave 30-100 GHz, high capacity
- 5G NR will incorporate all 3, from 1 GHz to 100 GHz spectrum
- MIMO will be a
  big deal for 5G
- 5G needs to address a long list of protocols, including backward compatibility
- Key technologies - Phy processing + Apps processing + RF

Ongoing innovations at Intel Labs
- Low Energy Systems at INTC Labs
- MM wave for high capacity
- Beam forming architectures:
  - Analog beam forming
  - Hybrid beam forming
  - Fully digital beam forming, where each RF has 1 antenna element. It is expensive and power hungry, but this is the approach that INTC is following. They are trying to reduce power consumption by compression.
- Uses 22nm FFL process (co-optimized process for both Fmax MM Wave applications)
- Integrated - Analog + RF + Digital + Passives in INTC's mm Wave Phase Array

Low Power Wake up Radio - radio wakes up on external event
- When a packet arrives, main radio wakes up the base band and goes to sleep; can save 30% to 10x depending on protocol
- Advantageous for IoT devices, especially cellular IoT
- Built on 14nm process, consumes 95 uW of power at 2.4 GHz

5G-NR3 - Wireless Processing Engine
- Built a wireless processing engine, 2x 300MHz cores, up to 50Mbps data rate
- Co-design of algorithm and architecture
- Latency is very important here
- Used for V2X communications

Power Management (PMIC)
- Key enabler of 5G silicon
- Important for innovation in power, improved efficiency,
  area and cost

Front end module + Antenna - co-optimization and co-design
- Integrate antenna with packaging to limit losses
- Built RFIC on 28nm tech, including MIMO
- Targeting 42.2Gb/s data rate with 12b/symbol MIMO, 4.3pJ/bit

In summary:
- 5G is bringing in new M2M communications and M2H communication for the first time. This brings new reliability and latency challenges, must satisfy diverse specs, and needs innovation from system to core. It is much more than 4G
- INTC's portfolio spans devices + wireless tech + access and edge + core + cloud
- INTC is working on trial platforms + modems + processors + FPGAs
- 4G will be with us for a while; each protocol takes 10 years to peak

Processors/AI

Summit and Sierra - IBM AI/HPC Supercomputers, J Khale
Summit and Sierra are the world's #1 and #2 supercomputers

Motivation
- Complex interaction between a lot of different workloads. A homogeneous system could not deliver the performance at the required efficiency
- Needed a massive number of threads and coordination of parallelism
- Wanted to reduce time to scientific insights

Comparison of Supercomputers to Data Centers
- Supercomputers do synchronous computation, which requires:
  - Bare metal control of resources, coordinated parallel work
  - Large data sets
  - Intense computation periods from hours to days
  - Job length can be the reliability window
  - High performance interconnection network vs. commodity interconnect for DC

Constraints:
- Went to heterogeneous architecture - power, cost, scaling challenges beyond 10k nodes; single threaded processors were hitting limits, so went to GPU
- Used NVIDIA for GPU and MLNX interconnects
- Need to satisfy both massively parallel capability and analytics capability
- Data is properly positioned in the computer when computation is required - minimize data motion, enable compute in all levels of hierarchy, modularity, application driven design

Specs:
- 9216 Power 9 processors, 27648 NVDA Volta, 11.1PB memory in Summit, performance 200 Petaflops, node performance 43 Teraflops. Data movement time - DRAM takes 6.4 secs to read all of it. 300GB/s bandwidth to memory
- IBM Power 9 processor, 22 cores + 2 more for yield, 14nm FinFET, L3 cache with embedded DRAM
- PCIe 4.0, NVLink 4; NVIDIA Volta GPU, 7.8 Teraflops for double precision, 125 Teraflops on half precision for DL. Tight GPU integration

Flexible Streaming Processor for Real Time Image Processing
- Image processing - trend towards smaller and energy constrained platforms
- Traditional microprocessor
  has low throughput - can overcome the limitation with SIMD or task parallelism, but memory bandwidth scales with the number of processing elements, so that can be a limiter
- ASICs stream in the pixels in raster scan order, but offer no programmability
- Streaming processor - programmable processing elements to build up a pipeline with conditional execution of the pipeline
- 3 main components - processing element, line buffers for memory bandwidth, interconnect
- PE
  - Instruction register
  - Arithmetic unit
  - Guard control unit - conditional execution, run time flexibility
- Line Buffer Elements - SRAM FIFO
- Interconnection Network - cluster PEs to reduce the number of interconnections (4:1), 3 levels of hierarchy
- 22nm FDX tech

SoC for Robotics - Distributed Multi Robot System using SoC, Intel
- Energy efficient performance, integrate multiple sensors, low power SoC for compute, radio
- Algorithm flow: sensors capture data - fuse and compute - navigation & control - mechanical action
- Compute tasks carried out by 2x x86 processors and a Tensilica DSP
- Path planner is very important for robots
- Path planning is done at the edge and therefore needs to be power efficient, real time and robust (error minimization)
- CPU frequencies and accelerator frequencies are quite
  different

40x40 In-Memory Computing Graph ASIC, University of Minnesota
- 3 algorithms
  - Breadth first search
  - Dijkstra
  - A*
- Uses A* algorithm
- Uses time based computing and encodes the value in the delay of the block; low power consumption and high precision tunability
- 40x40, 4 neighbor array
- Vertex consists of a 12 b
  SRAM

Processing in Memory - PIM based Spin Multichip Scalable Processor, Hitachi
- Optimization is the major target application
- CMOS annealing processors solve large scale combinatorial optimization in various industries
- Digital circuits (spin operator) are tightly coupled to the SRAM and arranged as a basic unit. The spin state and coefficient are stored in SRAM cells, and this only requires local communication between SRAM cell and spin operator
- Challenges: increase bit coefficient without degrading accuracy
- Challenges: increase the number of spins
- Arch: 4 spins and coefficient are stored in the SRAM array
- Spin operator: used to update spin state by majority vote

Fully Visual SLAM CNN processor for Autonomous Exploration - Ziyun Li, University of Michigan
- SLAM - simultaneous localization and mapping
- 30 ms response, range 1 km, low power 100 mW
- Construct 3D maps of landmarks
- Sensing options for SLAM processor - Lidar / cameras / inertial sensors
- SLAM processing
  - Extract key points from current frame
  - As object moves, extract new key points and match with prior key points
  - 6 DoF of current frame is solved by optimization
  - Do bundle adjustment - raw poses and matching pairs are tracked by multiple frames
- SLAM requires
  - Massive compute, 250 GOPs
  - Double precision floating point operations
  - Therefore has low energy efficiency on CPU and GPU platforms
  - CPU and GPU use external DRAM, which costs power
- Uniqueness
  - Their approach uses on chip memories
  - CNN parameters held in 18 KB memory
  - Uses 4 layer triplet network for feature extraction
  - Key point detection runs on every frame; maximize data re-use for energy efficiency
  - Uses massive parallelism and caching for feature detection
  - Feature matching and PnP unit has matrix solver, filters etc. to solve for pose and uses 32 bit fixed point arithmetic
  - 1000 features extracted per frame
  - Predict new pose from poses of previous frame
  - Innovative search algorithm for match
  - Bundle adjustment unit has large graph memory (320kB) and matched keypoints are merged into a single entry. It uses a fixed point implementation

Memory

Toshiba Memory & Western Digital, 3D NAND, 96 WL, 1.33Tb Density, 4b/Cell (QLC)
- JV project with WD and Toshiba
- QLC
  technology, with new features

Technology overview
- 1.33Tb QLC, density 8.5 Gb/mm²
- 96L BiCS
- Programming throughput QLC vs. TLC: 9.7MB/s vs. 57MB/s for TLC
- Read time 160us vs. 58us for TLC
- Die size 158.4 mm²

QLC design technologies
- Source bias negative sense
- New 2 step programming algorithm
- Page state dependent WL overdrive
- Independent Plane Read

QLC techniques: source bias negative sense (SBNS)
- For QLC, a wide Vt window is required
- Negative Vt requires a triple well, which is expensive
- Source bias negative sense gets rid of the expensive triple well process required to generate negative Vt
- SBNS is controlled with a clock to expand the negative Vt region - CSBNS
- Negative Vt region is expanded while keeping low supply voltage
- Precharge to SRC + delVBL, boost phase, sense phase

New programming algorithm (18% tprog improvement)
- QLC type distribution needs a more sophisticated programming technique
- 3D uses a 2 step programming method to suppress Vt degradation
- But programming throughput degrades with the conventional 2 step process, as verify times are very long
- In the new technique
  - 8 coarse levels are used for programming in the first step
  - In the 2nd step the cell is programmed to 16 levels, and the two step programming method reduces verify times

Page state dependent WL overdrive
- QLC read times are slower as more bits need to be sensed
- Conventional WL overdrive technique reduces 3D WL RC delay, but the voltage overdrive is different for different WLs and so is not optimal
- In this solution, the amount and duration of WL overdrive are adjusted as a function of voltage change for each WL transition

New Features: Independent Plane Read (IPR)
- Conventional WL selection scheme - 24 CG drivers are needed to bias 24 WLs
- In IPR, the 24 CG drivers are divided into 2x 12 CG drivers. During program operation both sets operate in the same way as 24 WL. But during the read, which requires lower WL voltages, only 12 WLs are biased to various voltages and the other 84 WLs are biased
  to a fixed voltage. The 2 sets of CG drivers independently drive 12 WLs of each plane
- Area overhead: no control gate overhead; it has 2 independent read control paths, but the area overhead is still very small (0.1 mm²)

Samsung, 3D NAND, 512 Gb, 3 bit/cell, 6th gen
- 82MB/s throughput, 1.2Gb/s interface
- Growth rate of WL 110 layers, can still use 2 stack solution

WD, Toshiba Memory 512 Gb, TLC, 128 layer, 132 MB/s write perf
- 3 major innovations
  - 128 layers
  - 132 MB/s - highest write perf
  - Circuit under Array tech (CuA); last year's 96L was Circuit next to Array, not under it
- Arch
  - 4 arrays of 128 Gb each
  - WL is horizontal, BL is vertical
  - Die size 66 mm², 7.8 Gb/mm² density (31% higher than previous 96L)
  - tR = 56 us, prog = 132 MB/s
  - 1.066 Gb/s I/O
- 4 plane boosts perf by 2x compared to 2 plane, but uses CuA to reduce the penalty (would be 15% die size penalty without CuA); now only 1% penalty compared to 2P with CuA
- CuA - WL staircases and BL switches are stacked
- Both BL switches and column circuitry are placed under the array
- Signal/power routing is different - can route any signal/power line above the array, as there is no restriction with circuitry under the array

4 Plane area overhead reduction
- Orig penalty = 15%
- Col gets to 8%
- Row gets to 4%
- Center peri
  gets to 2%
- Signals routing gets to 1% penalty

Multi die peak power management
- Reduce both peak/average ICC
- High peak ICC from multiple die in a package can corrupt data
- Use existing ZQ calibration circuit, so no area overhead
- New approach
  - Estimate ICC peak bin for each operation for a die
  - Calculate final
    peak ICC
  - Use ZQ calibration to judge if accumulated current is acceptable or not. Can stagger ICC peaks. All the PPM control is done inside the die
  - Gets a 47% write throughput improvement with this technique

4kB page read mode
- Select only 4KB WL and activate only selected WL/BL; in ABL sense all BLs are enabled, so no BL cap
- In this approach there will be additional BL capacitance, as only 4KB of BLs are activated out of 16KB. So edge BLs will have slower characteristics and therefore tR degradation. So they introduce special selected BL-bias conditions for SS and edge BL to reduce edge BL capacitance
- Results: read ICC reduced by 40% compared to standard 16KB page, but you add 2 additional transistors; write throughput +47%; 4 plane arch 1% area; 66 mm² die size

DRAM

Samsung 7.5Gbps 8Gb LPDDR5, in 1x nm DRAM
- High speed enablers
  - WCK clocking, PSIJ reduction of 44%
  - NT-ODT - write SI improvement 32.5% at 6.4 Gb/s
  - Non Target ODT - NT-ODT has been removed in recent LPDDR4 spec
- Low power enablers
  - DVFS - read/write power 8%/9% lower at 1.6Gb/s
  - Write X - write power down 58% at 4.266 Gb/s
  - DSM reduces power consumption by 25%

SK Hynix 16 Gb DDR5 SDRAM, 1y nm, 6.4 Gb/s/pin
- Industry first 16Gb DDR5
- 1y nm, 4 metal DRAM process
- Die size 76.22 mm²
- Motivation
  - Need higher I/O speed for servers
  - Need RAS features to overcome DRAM scaling limitation
  - Need lower power consumption for thermal reasons
- Specs
  - DDR5 - 6.4 Gb/s/pin, 1.1V vs. 3.2Gb/s/pin for DDR4 at 1.2V Vdd
  - Get 30% lower power than DDR4 for the same density
  - Despite lower voltage, DDR5 is 2x faster than DDR4
  - I/O design is very challenging
- New innovations
  - New write training method
    - Delay block is placed in DQS path to match CLK and DQS relationship
    - Use unmatched DQS CLK relationship
    - External training: synchronize CLK and DQS at external pin, like DDR4
    - Internal training: synchronize internal command and DQS path
    - Improves DQS gating margin
  - DFE is adopted for signal integrity - cancel channel reflections
    - 4 tap DFE is used to equalize Rx and cancel the reflection noise of the channel
  - Use quad rate 4 phase internal DQS signal
  - Phase rotator DLL + ILO (Injection Locked Oscillator) is used
    - Conventional DLL is susceptible to tDQSCK drift and sensitive to supply voltage drift
    - 4 phase skew can be corrected by DCC cycle and ILO (using DCC has the same impact as using the ILO)
    - DDR5 uses phase rotator and ILO DLL to minimize drift from conventional
    - Results: tDQSCK variation is lowered by 40% vs. conventional delay line
  - 4:1 serializer
    - Synchronize 4 data bits with a 4 phase clock; clock skew will cause
      Tx jitter
    - Clock duty cycle controlled through specific code
  - BERT
    - Tests transceiver performance in various configurations

KAIST, 3 bit, 27 Gb/s PAM 3 single ended transceiver
- Increasing the data bandwidth of DRAM
  - Increase clock frequency, or more parallel I/O, or multilevel signaling (PAM-N)
- PAM 4 signaling has lower noise margins and therefore has not been commonly used in single ended signaling
- PAM 3 has 150% pin efficiency compared to NRZ and more noise margin than PAM 4
- Designed 3 bit PAM 3 transmitter and receiver
- Designed a 3 bit per 2 UI PAM 3 encoder; implemented low swing output driver
- Conventional approach for PAM 3 DFE needs 2 branches of current; they are using a tri-level driver, with lower hardware cost

SK Hynix, 512 GB 1.1V MDS - positioned between SCM and DRAM
- MDS DIMM, positioned between DRAM and SCM
- DRAM DIMM with highest capacity
- MDS DIMM
  - Media, media controller
  - 16GB DRAM, x4 operation, 2,133Mbps
  - Slices with 16 GP DRAM, ODP - uses wire bonding so lower cost than TSV; has 16 slices that are connected by wire bonding
  - Use a load reduced repeater scheme
  - N-bit fail wave scheme ignores failed bits in wafer or package test of media and improves yield of the media
  - Pre-CMD scheme to reduce standby current

1 Mb Re-RAM computing i
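
The KAIST notes above quote a "3 bit per 2 UI" PAM 3 encoder and a 150% pin efficiency relative to NRZ. The relationship between those two figures can be illustrated with a small sketch. The mapping table below is hypothetical (the notes do not give the actual code used in the transceiver); it only assumes that two three-level symbols offer 9 combinations, of which 8 are enough to carry 3 bits.

```python
from itertools import product

LEVELS = (-1, 0, 1)  # the three signal levels of a PAM-3 link

# Two unit intervals (UI) of a 3-level signal give 3 * 3 = 9 symbol pairs.
# A 3-bit word needs only 8 of them. Which pair stays unused is an
# assumption made for this sketch: we drop (0, 0).
PAIRS = [p for p in product(LEVELS, repeat=2) if p != (0, 0)]
ENCODE = dict(enumerate(PAIRS))                  # 3-bit value -> (sym0, sym1)
DECODE = {pair: value for value, pair in ENCODE.items()}


def encode(bits):
    """Map a list of bits (length a multiple of 3) to PAM-3 symbols."""
    if len(bits) % 3:
        raise ValueError("bit count must be a multiple of 3")
    symbols = []
    for i in range(0, len(bits), 3):
        value = (bits[i] << 2) | (bits[i + 1] << 1) | bits[i + 2]
        symbols.extend(ENCODE[value])
    return symbols


def decode(symbols):
    """Inverse of encode(): every 2 symbols yield 3 bits."""
    bits = []
    for i in range(0, len(symbols), 2):
        value = DECODE[(symbols[i], symbols[i + 1])]
        bits.extend([(value >> 2) & 1, (value >> 1) & 1, value & 1])
    return bits


def bits_per_ui(bits_per_group, ui_per_group):
    """Pin efficiency in bits per unit interval (NRZ carries 1)."""
    return bits_per_group / ui_per_group


if __name__ == "__main__":
    data = [1, 0, 1, 0, 1, 1]
    tx = encode(data)
    assert decode(tx) == data
    print(len(data), "bits sent in", len(tx), "UI")
    print("PAM-3 bits/UI:", bits_per_ui(3, 2))
```

Three bits in two unit intervals is 1.5 bits per UI, which is where the 150% figure against NRZ (1 bit per UI) comes from; PAM 4 would carry 2 bits per UI, at the cost of the smaller noise margin mentioned in the notes.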

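Several of the headline figures quoted in these notes can be cross-checked against each other with simple arithmetic. The sketch below is a reader's sanity check under stated assumptions, not something presented at the conference: the function and constant names are illustrative, the Summit figure counts GPU double-precision peak only, and the NAND check simply multiplies the quoted die area by the quoted bit density.

```python
# Figures as quoted in the notes above; all names here are illustrative.
SUMMIT_GPUS = 27_648
SUMMIT_CPUS = 9_216
VOLTA_FP64_TFLOPS = 7.8


def summit_gpu_peak_pflops():
    """GPU-only double-precision peak, ignoring the Power 9 CPUs."""
    return SUMMIT_GPUS * VOLTA_FP64_TFLOPS / 1000.0


def die_capacity_gb(area_mm2, density_gb_per_mm2):
    """Die capacity in Gb implied by die area and bit density."""
    return area_mm2 * density_gb_per_mm2


def link_power_mw(rate_gbps, energy_pj_per_bit):
    """(Gb/s) * (pJ/bit) = 1e9 * 1e-12 J/s = mW."""
    return rate_gbps * energy_pj_per_bit


if __name__ == "__main__":
    print("Summit GPU FP64 peak (PF):", round(summit_gpu_peak_pflops(), 1))
    print("GPUs per CPU:", SUMMIT_GPUS / SUMMIT_CPUS)
    print("96L QLC die (Gb):", round(die_capacity_gb(158.4, 8.5), 1))
    print("128L TLC die (Gb):", round(die_capacity_gb(66.0, 7.8), 1))
    print("DDR5 vs DDR4 per-pin rate:", 6.4 / 3.2)
    print("5G RFIC link power (mW):", round(link_power_mw(42.2, 4.3), 2))
```

Under these assumptions the GPU peak comes out at roughly 216 Petaflops, in line with the quoted 200 Petaflops; the two NAND dies come out at roughly 1,346 Gb and 515 Gb, close to the quoted 1.33Tb and 512 Gb; and the 42.2Gb/s, 4.3pJ/bit RFIC target implies about 181 mW for the link.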