TECHNICAL PROGRAM

Session: A1
Day: 08/08
Time: 13:30-15:00
Chair: Prof. 黃弘一, 國立台北大學
Room: 301

Digital Design (I)

13:30 A1-1　 Novel VLSI Design of Circular-Carry-Select (CCS) Based Diminished-One Modulo 2ⁿ+1 Adder

Su-Hon Lin, Ming-Hwa Sheu, Kuang-Hui Wang, Jun-Jie Zhu, and Si-Ying Chen, 國立雲林科技大學

　　
Diminished-one modulo 2ⁿ+1 addition is an important arithmetic operation for high-performance residue number systems (RNS). In this paper, we propose a novel Circular-Carry-Select (CCS) architecture for the diminished-one modulo 2ⁿ+1 adder. The resulting adder is built mainly from CCS addition blocks, which are simple and regular for all values of n. In a VLSI implementation in UMC 180 nm CMOS technology, the CCS-based diminished-one modulo 2ⁿ+1 adder achieves better Area×Time (AT) performance than well-known existing solutions. The area and clock rate of the CCS-based modulo 2¹⁶+1 adder chip are 26746 μm² and 476 MHz, respectively.
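As background for the arithmetic involved, the following is a minimal behavioral sketch of diminished-one modulo 2ⁿ+1 addition. It models only the textbook definition (an n-bit addition with an inverted end-around carry), not the CCS architecture proposed in the paper. The function name and the separate zero flag are illustrative assumptions.

```python
from typing import Tuple


def diminished_one_add(a_star: int, b_star: int, n: int) -> Tuple[int, bool]:
    """Add two nonzero residues modulo 2**n + 1 in diminished-one form.

    a_star and b_star encode A = a_star + 1 and B = b_star + 1, with
    0 <= a_star, b_star < 2**n. Returns (s_star, is_zero): s_star encodes
    the sum as S = s_star + 1, and is_zero flags a true sum of 0, which
    has no diminished-one encoding (assumed to be handled separately).
    """
    mask = (1 << n) - 1
    total = a_star + b_star            # (n+1)-bit intermediate sum
    carry_out = total >> n             # 1 if the n-bit addition overflowed
    is_zero = total == mask            # A + B == 2**n + 1, so the sum is 0
    s_star = (total + (1 - carry_out)) & mask  # inverted end-around carry
    return s_star, is_zero
```

When `is_zero` is set, the returned `s_star` is not meaningful and the zero result has to be represented by other means, for example a dedicated zero-indication bit.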

13:45 A1-2　 Self-Aware Medium-Grained Adaptive Power Control Using Current Monitoring Technique

Wei-Chih Hsieh and Wei Hwang, 國立交通大學

　　
In this paper, a novel current monitoring technique is proposed to provide a reference for voltage scaling. Instead of tracking the delay of a worst-case critical-path replica, the current characteristic of the target circuit is used to distinguish between its switching and stable states. A medium-grained adaptive power control technique that takes advantage of the low-overhead current monitor is also presented. Conventional voltage scaling applies a single (scaled) voltage that satisfies the critical path to the whole chip, wasting power in non-critical paths. The medium-grained technique exploits the unused slack in non-critical paths to recover this power. An example using multipliers of different widths exhibits over 40% power reduction on non-critical paths with only 7% overhead, most of which comes from unoptimized level converters and control-word buffers. The proposed current monitoring scheme contributes only about 7 μW. All simulations use Berkeley Predictive 65 nm technology.

14:00 A1-3　 Post-Chip Adjustable Low Power Delay Element

Jung-Lin Yang and Chih-Wei Chao, 南台科技大學

　　
Constructing specific delays on a chip is difficult in deep-submicron technology, and this paper addresses that problem. An extremely low-power delay element with a post-chip adjustment feature is introduced. The delay element was originally developed for self-timed datapath components, but the design also proved suitable for many other low-power applications. In addition to its tunable behavior, a post-chip delay adjustment feature is implemented. The circuit also adapts well to variations in operating temperature and tolerates technology variation. Excluding the current-mirror circuitry, the proposed tunable delay element consumes approximately the same average power as a 4-stage minimum-size inverter chain. All claims are verified with carefully set up post-layout simulations using TSMC 0.35 μm technology.

14:15 A1-4　 A High-Resolution All-Digital Phase-Locked Loop with its Application to Built-In Speed Grading for Memory

Hsuan-Jung Hsu, Chun-Chieh Tu, and Shi-Yu Huang, 國立清華大學

　　
In this paper we present a high-resolution and wide-range all-digital phase-locked loop (ADPLL), which is suitable to function as a clock generator. The digitally controlled oscillator (DCO) is able to operate from 70 to 725 MHz and achieves 5.2ps resolution. The Phase-Frequency Detector (PFD) is designed using a latch-based sense amplifier, leading to a nearly perfect PFD that is able to resolve a phase difference as minute as only 1ps. In addition, we use this ADPLL as a vehicle to perform built-in speed grading (BISG) for memory. Combining a binary search process with multiple runs of built-in self-test (BIST), the maximum operating speed can thus be tracked down on the chip with a high precision.
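The binary-search speed grading mentioned in the abstract can be sketched as follows. This is an illustration only, not the authors' implementation: `bist_passes` is a hypothetical callback standing in for one BIST run at a given clock setting, the integer frequency codes are an assumption, and the search assumes that a memory passing at one speed also passes at every lower speed.

```python
from typing import Callable, Optional


def grade_max_speed(bist_passes: Callable[[int], bool],
                    lo: int, hi: int) -> Optional[int]:
    """Return the highest frequency code in [lo, hi] at which BIST passes.

    Returns None if BIST fails even at the lowest code. Each loop
    iteration corresponds to one BIST run at the midpoint setting.
    """
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if bist_passes(mid):
            best = mid       # mid works, so try faster settings
            lo = mid + 1
        else:
            hi = mid - 1     # mid fails, so try slower settings
    return best
```

With this scheme the number of BIST runs grows only logarithmically with the number of selectable clock settings.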

14:30 A1-5　 All-Digital PLL Using Bulk-Controlled Varactor and Pulse-Based DCO

Hong-Yi Huang and Jen-Chieh Liu, 國立台北大學

　　
A 150–450 MHz all-digital phase-locked loop (ADPLL) in a 0.18 μm CMOS process is presented. The pulse-based digitally controlled oscillator (PB-DCO) achieves high resolution and a wide range. The bulk-controlled varactor minimizes jitter. The worst-case frequency acquisition time is 32 reference clock cycles. The multiplication factor is 2–63. The rms and peak-to-peak jitter are 6.7 ps and 44 ps, respectively, at 450 MHz. Power consumption is 16.2 mW at 450 MHz.

14:45 A1-6　 A Conditional Isolation Technique for Low-Power and High-speed Wide Domino Gates

Wei-Hao Chiu and How-Rern Lin, 大葉大學

　　
A new conditional isolation technique (CI-Domino) is proposed for wide domino gates. The technique not only reduces subthreshold and gate-oxide leakage currents simultaneously without sacrificing circuit performance, but also speeds up the evaluation of the domino gate. Simulations on high fan-in domino OR gates in a 0.18 μm process technology show that the proposed technique reduces total static power by 36%, dynamic power by 49.14%, and delay by 60.27% compared to the conventional domino gate. The proposed technique also improves leakage tolerance by about 48.14%.

Session: B1
Day: 08/08
Time: 13:30-15:00
Chair: Prof. 熊博安, 國立中正大學
Room: 305

VLSI System and Implementation

13:30 B1-1　 Novel Low-Power Bus Coding Method for Crosstalk Noise Reduction

Chia-Hao Fang and Chih-Peng Fan, 國立中興大學

　　
In deep-submicron technology, reducing the power dissipation and propagation delay of on-chip buses has become a key issue for low-power System-on-a-Chip (SoC) designs. In particular, the coupling effect causes serious problems such as crosstalk delay, noise, and power dissipation. In this paper, a new bus-coding method is proposed to substantially reduce both the dynamic power dissipation on buses and the crosstalk delay. Our method saves more coupling power than the bus inverter (BI), the coupling-driven bus inverter (CBI), and other schemes. Experimental results show that our method achieves 22–36% dynamic power savings for systems implemented in TSMC 0.18 μm CMOS technology.
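For context, the classic bus-invert (BI) baseline that the abstract compares against can be sketched as below. This is not the proposed coding method, and it is a simplified model: the initial bus state of zero and the function names are assumptions, and the invert line itself is not counted when deciding whether to invert.

```python
from typing import List, Tuple


def bus_invert_encode(words: List[int], width: int) -> List[Tuple[int, int]]:
    """Encode bus words so that at most width // 2 data lines toggle per transfer.

    Returns a list of (bus_value, invert_flag) pairs.
    """
    mask = (1 << width) - 1
    prev = 0  # assumed initial state of the data lines
    encoded = []
    for word in words:
        toggles = bin((word ^ prev) & mask).count("1")
        if toggles > width // 2:
            bus, invert = (~word) & mask, 1   # send the complement instead
        else:
            bus, invert = word & mask, 0
        encoded.append((bus, invert))
        prev = bus
    return encoded


def bus_invert_decode(encoded: List[Tuple[int, int]], width: int) -> List[int]:
    """Recover the original words from (bus_value, invert_flag) pairs."""
    mask = (1 << width) - 1
    return [((~bus) & mask) if invert else bus for bus, invert in encoded]
```

For the 8-bit sequence 0x00, 0xFF, 0x0F, 0xF0, this encoding causes 7 line toggles in total (4 on the data lines and 3 on the invert line) instead of 20 on an uncoded bus. BI targets self-transition activity only; reducing coupling activity between adjacent wires is the concern of schemes such as CBI and the method in this paper.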

13:45 B1-2　 Reconfigurable Hardware Module Sequencer for Dynamically Partially Reconfigurable Systems

Chin-Chieh Hung and Pao-Ann Hsiung, 國立中正大學

　　
Dynamically reconfigurable systems either adopt a processor-controlled networked architecture or a data sequencer-controlled data flow architecture. In the networked architecture, the processor is overloaded with data transfer requests, whereas in the data flow architecture, the burden is completely shifted from the processor to the data sequencer. As a tradeoff between these two extremes, this work proposes a novel module sequencer architecture, which not only allows the processor and the sequencer to share the heavy data communication load, but is also more coherent with the conventional processor-FPGA architecture. Further, the architecture is highly flexible because it can be tuned to fit a particular application. Application examples show how the proposed architecture is superior to the networked architecture in terms of lower communication load and to the data flow architecture in terms of reduced system complexity.

14:00 B1-3　 Implementing an FPGA Baseband Multipath Fading Channel Emulator Using High-Level Modular Design

Jeng-Kuang Hwang*, Kuei-Horng Lin, Jeng-Da Li, and Juinn-Horng Deng, 元智大學

　　
A baseband multipath fading channel emulator is implemented on the Xilinx XtremeDSP FPGA platform through modular high-level design. Important modules are described, including the white Gaussian noise generator (WGNG), Doppler filter, direct digital frequency synthesizer (DDFS), multi-rate interpolators, and multipath signal generator. Since all modules are designed as high-level Simulink models using Xilinx System Generator, the system parameters and configuration can be easily changed as desired. The FPGA emulator has been tested at a sampling rate of 30 Msps, and all measured signals agree well with the simulation results, verifying the correctness of the design.

14:15 B1-4　 HW/SW Co-Design of a Multi-Threaded Virtual Machine for a Scalable NoC Platform

李昀隆、陳泳超、周哲民, 國立成功大學

　　
In this paper, we design and implement a Multi-Threaded Java Virtual Machine (MTJVM) for a scalable NoC platform. It is composed of multiple processing elements (PEs) and can directly execute Java threads concurrently without any software/OS support. Threads are dynamically dispatched to PEs and run simultaneously to exploit thread-level parallelism (TLP). Thread processing mechanisms and instructions, such as real-time scheduling, sleep, wait, yield, and synchronization, are handled by two new global controllers, the thread-manager and the memory-manager. The software and hardware parts of the complete system have been coded in C and VHDL, respectively, and synthesized. Experimental results show that performance and area scale with the number of PEs used, and the system operates at 96.8 MHz.

14:30 B1-5　 High Speed and Low Cost Implementations in Mix-Column/InvMix-Column

Chung-Yi Li, Chih-Feng Chien, and Tsin-Yuan Chang, 國立清華大學

　　
Mixed Column and Inverse Mixed Column dominate the logic resources in Advanced Encryption Standard (AES) hardware implementations with direct-mapping S-boxes. In this paper, two byte-level resource-sharing circuits, a short-path design and a small-area design, are proposed to optimize the required delay and area. Theoretically, the proposed circuits have either the fastest speed (the same as previous work but with a 41% smaller gate count) or the smallest gate count (1% less than previous work with a 13% saving in delay). Synthesized in a TSMC 0.18 μm CMOS technology, the proposed schemes rank in the top two when measured by AT² (area-delay-squared product).
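As a functional reference for the two transforms discussed in the abstract, the sketch below computes the AES MixColumns and InvMixColumns operations on a single 4-byte column in software. It follows the standard AES definition over GF(2⁸) and says nothing about the resource-sharing circuits proposed in the paper. The helper names are illustrative.

```python
from typing import List, Sequence


def xtime(b: int) -> int:
    """Multiply a byte by x (i.e. by 2) in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b & 0xFF


def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in the AES field by shift-and-add."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


MIX = (2, 3, 1, 1)          # first row of the MixColumns circulant matrix
INV_MIX = (14, 11, 13, 9)   # first row of the InvMixColumns circulant matrix


def mix_single_column(col: Sequence[int], coeffs: Sequence[int] = MIX) -> List[int]:
    """Apply the circulant matrix whose first row is coeffs to one column."""
    out = []
    for row in range(4):
        acc = 0
        for k in range(4):
            acc ^= gf_mul(coeffs[(k - row) % 4], col[k])
        out.append(acc)
    return out
```

Calling `mix_single_column(col, INV_MIX)` on the output of `mix_single_column(col)` returns the original column. Both transforms use the same multiply-by-x primitive, which is the kind of commonality a shared hardware datapath can exploit.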

14:45 B1-6　 Combined Decoding and Flexible Transform Designs for Effective H.264/AVC Decoders

Yi-Chih Chao, Shih-Tse Wei, Jar-Ferr Yang, and Bin-Da Liu, 國立成功大學

　　
In this paper, we propose a combined decoding architecture and a high-throughput flexible transform design to effectively decode the residual data in H.264/AVC decoders. The inverse quantization (IQ) procedure is combined with the context-based adaptive variable length coding (CAVLC) decoder to simplify the design. A flexible transform architecture is also proposed for effective computation of all transforms needed in H.264/AVC decoders. Since all the transforms are realized in the same architecture, the flexible transform design, with a throughput of 8 pixels/sec, needs fewer logic gates. Simulation results show that the implemented gate count is 18.6k and the maximum operating frequency is 125 MHz. For real-time requirements, the proposed design achieves 4VGA (1280×960) at 30 frames/sec in the worst case.

　

Session: C1
Day: 08/08
Time: 13:30-15:00
Chair: Prof. 郭建宏, 國立台灣師範大學
Room: 315

Analog Techniques

13:30 C1-1　 Design and Realization of Ultra Low-Capacitance Bond Pad With Inductive Compensation for RF Circuits in CMOS Technology

Yuan-Wen Hsiao, Chun-Yu Lin, and Ming-Dou Ker, 國立交通大學

　　
A low-capacitance bond pad for gigahertz RF applications is proposed. Three kinds of on-chip spiral inductors embedded under the traditional bond pad are used to compensate for the parasitic capacitance of the bond-pad metals. Experimental results verify that the bond-pad capacitance is significantly reduced in a specific frequency band by the cancellation effect of the embedded inductor. The proposed bond pad is fully compatible with general CMOS processes without any process modification.

13:45 C1-2　 A compact square-root domain filter

Chia-Hsiung Kao, Ping-Yu Tsai, Wen-Pin Lin, and Wan Chen Lo, 國立中山大學

　　
A square-root domain filter based on operational transconductance amplifiers (OTAs) is proposed. The filter is compact and simple. The supply voltage is 1.5V and the power consumption is 116μW for a 10μA DC input current. Experimental results in a 0.35 μm CMOS process confirm the feasibility of the methodology.

14:00 C1-3　 Q-Factor Behavior Study of 90-nm RF-CMOS Inductors Using Transmission-Line Model

C.-H. Huang and T.-S. Horng, 國立中山大學

　　
This paper presents a novel transmission-line model for on-chip inductors to derive their quality (Q) factors in terms of the well-known transmission-line parameters in closed forms. The derived formulas are general and suitable for all kinds of integrated inductors when used to account for the frequency dependence of Q factors. The presented formulas can also uniquely distinguish the improvement in Q-factor frequency responses between the reduction of conductor loss and dielectric loss. Several 90-nm RF-CMOS inductors have been studied for their Q-factor behavior under different design configurations and process variations.

14:15 C1-4　 A 1-V CMOS Pseudo-Differential Amplifier with Multiple Common Mode Stabilization and Frequency Compensation Loops

Meng-Hung Shen, Po-Hsiang Lan, and Po-Chiun Huang, 國立清華大學

　　
In this paper, a 1-V three-stage operational amplifier using a standard 0.35-μm CMOS process is presented. The input stage is a pseudo-differential structure with common-mode feedforward (CMFF) bias to extend the input voltage headroom. In the second and third stages, parallel AC-boosting and signal feedforward paths are used to improve bandwidth and transient speed. One active feedback path is adopted to guarantee stability without the RHP zero. A current-sensing CMFB loop is used to set the common-mode point. With these techniques, experimental results show a 4.3 MHz gain-bandwidth product (GBW) with a 68° phase margin when driving a 100 pF load capacitance. The settling time for a 1-V_pp step signal is 1.1 μs. The complete circuit dissipates 249 μW from a single 1-V supply.

14:30 C1-5　 A 1-V Fully Differential Amplifier with Buffered Nested-Miller Compensation

Li-Wen Wang, Meng-Hung Shen, and Po-Chiun Huang, 國立清華大學

　　
This paper presents a 1-V three-stage fully differential amplifier with buffered nested-Miller compensation. A transconductance stage is inserted in the feedback path to eliminate the right-half-plane (RHP) zero. In addition, a feedforward transconductance is used to enhance the large-signal output response. Using standard 0.35-μm CMOS technology, measurement results demonstrate that a DC gain greater than 90 dB, a gain-bandwidth product of 4.57 MHz, and a phase margin of 55° are achieved with 100 pF output loads. The settling time for a 1-V_pp step is 2 μs. The complete circuit dissipates 110 μW from a single 1-V power supply.

14:45 C1-6　 A Low-Power High-Gain Rail-to-Rail Input/Output Operational Amplifier

Chien-Hung Kuo, Hwa-Ming Lu, and Wei-Hsien Fang, 淡江大學

　　
A low-power high-gain CMOS operational amplifier with rail-to-rail input/output ranges is presented in this paper. A constant-gm controller is employed in the input stage to achieve optimum bandwidth and settling response over a wide operating range. A differential-input single-output gain-boosting amplifier without common-mode feedback is used to minimize the power consumption and increase the dc gain of the opamp. Floating current sources are also introduced in the cascode stage to provide proper bias levels for the class-AB output stage. The proposed opamp can drive a large capacitive or small resistive load without losing gain or unity-gain bandwidth. It has been fabricated in a 0.35 μm 2P4M CMOS process. With a 50 pF output load capacitance, the opamp achieves a dc gain of 128 dB and a unity-gain frequency of 1.81 MHz. The total power dissipation is only 370 μW from a 3.3 V supply.

　

Session: D1
Day: 08/08
Time: 13:30-15:00
Chair: Prof. 蔡嘉明, 國立交通大學
Room: 308

Wireline Communication Circuits

13:30 D1-1　 A Quarter-Rate 2.56/3.2Gbps 16/20:1 SERDES Interface in 0.18μm CMOS technology

Ching-Te Chiu, Jen-Ming Wu, Shuo-Hung Hsu, YarSun Hsu, Ming-Hao Lu, Ping-Lin Yang, Fan-Ta Chen, You-Hung Li, Yu-Hao Hsu, and Min-Sheng Kao, 國立清華大學

　　
In this paper, we present a 2.56/3.2 Gbps 16/20:1 SERDES circuit for integration with the switch fabric in high-speed network applications. To meet the high-speed and low-jitter requirements, we propose a quarter-rate architecture, true single-phase clock (TSPC) circuits, and an LC-VCO-based PLL. The SERDES uses the quarter-rate architecture to reduce the high-frequency clock requirement. Shift-register and multiple-phase sampling structures are used to achieve 16/20:1 serializing and 1:16/20 deserializing. In circuit measurements, the SERDES operates up to 5.12 Gbps independent of the PLL. Fabricated in a 0.18 μm CMOS process, the 2.49 mm × 2.49 mm SERDES consumes 250 mW, including the PLL, at a 3.2 Gbps data rate.

13:45 D1-2　 A Low Power Tree-Type Multiplexer with Embedded Timing Skew Switch

HungWen Lu, 國立中央大學

　　
This work describes a tree-type multiplexer for gigabit serial I/O. The proposed architecture uses quadrature clocks as switch controls, eliminating the retiming circuit needed in the conventional design. Consequently, power consumption and circuit area are significantly reduced. Simulation results indicate that at 10 Gbps the modified multiplexer consumes 70% of the power and occupies 22% of the area of the traditional design. To verify the proposed MUX cells, a test chip occupying a circuit area of 200 μm × 150 μm was implemented in a TSMC 0.13 μm CMOS process. The measured power consumption of the 32-to-1 MUX circuit at a bit rate of 7.5 Gbps and a 1.2 V supply voltage is only 4.8 mW.

14:00 D1-3　 Transimpedance Amplifier with Enlarged Input Capacitance Tolerance for Optical Receiver

Jiann-Jiun Lu and Chia-Ming Tsai, 國立交通大學

　　
This paper presents a 5Gb/s transimpedance amplifier employing self-compensated architecture, inductor peaking, and negative impedance compensation technique in 0.18μm CMOS technology. The transimpedance amplifier exhibits 13kΩ differential transimpedance gain and sensitivity of -15.5dBm while dissipating 61mW from a 1.8V supply, and the chip size is 600μm×520μm.

14:15 D1-4　 A Low-jitter Phase-rotation Spread Spectrum Clock Generator for Serial ATA 6Gbps Clock and Data Recovery

Chi-Hsien Lin, Yen-Ying Huang, Shu-Rung Li, Yuan-Pu Cheng, and Shyh-Jye Jou, 國立交通大學

　　
A low-jitter phase-locked loop (PLL) and a spread-spectrum clocking method with phase rotation for Serial ATA are presented. To achieve low jitter, the PLL uses an error amplifier to resolve the current mismatch in the charge pump and adopts a 3rd-order loop filter to reduce the reference spur. A passive resistance is used to reduce K_VCO. The spread spectrum clock generator (SSCG) follows the Serial ATA specification: a 5000 ppm down spread with a triangular waveform at a modulation frequency of 30–33 kHz. The spread-spectrum technique uses a PLL with a ΔΣ modulator and a phase rotation algorithm. The proposed circuit has been designed in a 90-nm CMOS process. Without spread-spectrum clocking, the clock has a peak-to-peak jitter of 3.54 ps and the circuit consumes 5.87 mW at 1.4 GHz. The EMI reduction is about 18.22 dB.

14:30 D1-5　 A 2.5 Gbps CMOS Fully Integrated Optical Receiver with Lateral PIN Detector

Wei-Zen Chen and Shih-Hao Huang, 國立交通大學

　　
This paper presents the design of a monolithically integrated CMOS optical receiver, including a photo detector, a transimpedance amplifier, and a post limiting amplifier on a single chip. A novel PIN detector is proposed and adopted in this design without technology modification. The optical receiver is capable of delivering 420 mV_pp to 50Ω output load and operating up to 2.5 Gbps without an equalizer. Implemented in a generic 0.18μm CMOS technology, the total power dissipation is 138 mW. The chip size is 0.53 mm².

14:45 D1-6　 Inductorless CMOS Receiver Front-End Circuits for 10-Gb/s Optical Communications

Chih-Hao Chen, 淡江大學

　　
In this paper, a 10-Gb/s inductorless CMOS receiver front end is presented, including a transimpedance amplifier and a limiting amplifier. The transimpedance amplifier employs Regulated Cascode (RGC), active-inductor peaking, and dual third-order active feedback to achieve a transimpedance gain of 57 dBΩ and a bandwidth of 7.8 GHz with a power dissipation of 39 mW. The limiting amplifier incorporates third-order inter-leaving active feedback to achieve a voltage gain of 32 dB and a bandwidth of 10 GHz while consuming 145 mW. Both circuits are realized in 0.18-μm CMOS technology with a 1.8-V supply.

　

Session: E1
Day: 08/08
Time: 13:30-15:00
Chair: Prof. 張順志, 國立成功大學
Room: 318

Modeling and Synthesis

13:30 E1-1　 Topology Generation and Floorplanning for Low Power Application-Specific Network-on-Chips

Wan-Yu Lee and Iris Hui-Ru Jiang, 國立交通大學

　　
As process technology advances into the nanometer era, the number of cores and the amount of communication on a chip are rapidly increasing. Using a micro-network, a Network-on-Chip can overcome the communication inefficiency of the traditional shared-bus architecture. The system performance of application-specific Network-on-Chips is mostly measured by power, timing, and area. Power and timing depend strongly on how the network topology connects routers and cores and on how many routers are used, while area is determined by floorplanning. Unlike previous work, in this paper we propose a new methodology that performs network topology generation before floorplanning. Our method preserves the optimality of the topology through floorplanning. Moreover, it minimizes power while satisfying timing and area constraints, and it also guarantees deadlock freedom.

13:45 E1-2　 SAT Based Boolean Matching with Don't Cares

Kuo-Hua Wang and Chung-Ming Chan, 輔仁大學

　　
Boolean matching checks the equivalence of two functions under input permutation and input/output phase assignments. In this paper, we transform the Boolean matching problem into a Boolean satisfiability problem. Based on this transformation, a SAT-based matching algorithm for incompletely specified functions is proposed. Moreover, two signatures that exploit functional symmetries are provided to reduce the size of the SAT instance and thus expedite the matching process. Experimental results on a set of benchmark circuits show that our matching algorithm is effective and efficient at solving Boolean matching for incompletely specified functions. Compared with our prior work on Boolean matching, the SAT-based matching algorithm outperforms the old algorithm by several orders of magnitude for many large benchmark circuits.
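To make the problem statement concrete, the sketch below checks Boolean matching by exhaustive enumeration over truth tables. It is deliberately naive: it is not the SAT-based algorithm of the paper, it handles only completely specified functions (no don't cares), and its cost grows as n!·2ⁿ⁺¹ candidate transformations. The function names and the truth-table encoding (bit i of the table index is the value of input i) are assumptions made for illustration.

```python
from itertools import permutations, product
from typing import List, Optional, Sequence, Tuple


def _transform_index(x: int, perm: Sequence[int], in_phase: Sequence[int], n: int) -> int:
    """Map an input assignment of f to the corresponding assignment of g."""
    y = 0
    for i in range(n):
        bit = ((x >> i) & 1) ^ in_phase[i]   # optionally complement input i
        y |= bit << perm[i]                  # route it to input perm[i] of g
    return y


def boolean_match(f: List[int], g: List[int], n: int
                  ) -> Optional[Tuple[Tuple[int, ...], Tuple[int, ...], int]]:
    """Search for (perm, in_phase, out_phase) making f equal to transformed g.

    f and g are truth tables of length 2**n. Returns None if no input
    permutation and input/output phase assignment makes them equivalent.
    """
    for perm in permutations(range(n)):
        for in_phase in product((0, 1), repeat=n):
            for out_phase in (0, 1):
                if all(f[x] == (g[_transform_index(x, perm, in_phase, n)] ^ out_phase)
                       for x in range(1 << n)):
                    return perm, in_phase, out_phase
    return None
```

For example, a 2-input AND matches a 2-input OR with both inputs and the output complemented (De Morgan), while a 2-input XOR matches neither.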

14:00 E1-3　 Lithography-Aware Routing with Predictive OPC Formulae

Tai-Chen Chen, Guang-Wan Liao, and Yao-Wen Chang, 國立台灣大學

　　
Due to the sub-wavelength lithography, manufacturing the sub-90nm feature size requires intensive use of the resolution-enhancement techniques, among which optical proximity correction (OPC) is the most popular technique in industry. Considering the OPC effects during routing can significantly alleviate the cost of post-layout OPC operations. In this paper, we present an efficient, yet accurate analytical formula for intensity computation and develop the first modeling of the post-layout OPC based on the inverse-like lithographic technique. Extensive simulations with SPLAT, the golden lithography simulator in academia and industry, show that our intensity formula has high fidelity. Incorporating the OPC costs computed by the inverse-like lithographic technique for our post-layout OPC modeling into a router, the router can be guided to maximize the effects of the correction. Experimental results show that our approach can achieve up to 24% and 16% reductions in the respective total and maximum layout distortions, using reasonable running time.

14:15 E1-4　 An Efficient Energy Modeling Approach for VLIW DSP at Instruction-Level

Wen-Tsan Hsieh, Hsin-Ying Liao, Chien-Nan Jimmy Liu, Shu-Yu Cheng, and Ji-Jan Chen, 國立中央大學

　　
In this work, we develop a new instruction-level energy modeling approach for pipelined Very Long Instruction Word (VLIW) DSPs. The proposed approach accounts for both the base energy cost of each instruction and the additional energy cost of consecutive instructions in each pipeline stage. Therefore, the power estimate is much closer to the real pipelined behavior. The overall energy modeling procedure is separated into two phases: an energy extraction phase and a model reconstruction phase. Experimental results show that the average error of our approach is less than 3% compared to “PrimePower” gate-level power simulation. Thus, the proposed energy modeling approach can be easily used for software energy optimization.

14:30 E1-5　 An Automated Synthesis Tool for Fully Differential OPAMPs

Cheng-Wu Lin and Soon-Jyh Chang, 國立成功大學

　　
A design automation tool that determines the proper transistor sizes for fully differential OPAMPs is presented. The synthesis tool supports three OPAMP topologies: two-stage, folded-cascode, and two-stage cascode. Look-up tables are used for device sizing. Incorporating practical circuit design experience improves the efficiency of the synthesis process. A further subroutine is developed to speed up tool development when supporting new OPAMP topologies. For an OPAMP used in the first stage of an 8-bit 50-MS/s pipelined ADC, the synthesis time of the proposed tool is less than 3 minutes using two 1.2 GHz UltraSPARC-III+ processors and 2 GB of memory.

14:45 E1-6　 A Top-down, Mixed-level Design Methodology for CT BP ΔΣ Modulator Using Verilog-A

Hung-Yuan Chu and Chien-Hung Tsai, 國立成功大學

　　
This paper presents a design methodology for continuous-time (CT) band-pass (BP) ΔΣ modulators that improves the design procedure. The proposed top-down, mixed-level design platform is implemented in Cadence’s Spectre environment using Verilog-A. The platform is applied to a 2nd-order CT BP ΔΣ modulator for WCDMA applications. The center frequency of this modulator is 100 MHz, and the internal quantizer operates at a 400 MHz clock frequency. The modulator is simulated in TSMC 0.35 μm CMOS technology at a supply voltage of 3.3 V. The maximum SNDR is 40 dB for a 3.84 MHz bandwidth, which corresponds to a resolution of 6 bits.
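The stated correspondence between SNDR and resolution is consistent with the standard effective-number-of-bits relation for an ideal quantizer driven by a full-scale sine wave, ENOB = (SNDR − 1.76 dB) / 6.02 dB. A small sketch follows; the function name is illustrative and the abstract does not say which conversion the authors used.

```python
def enob_from_sndr(sndr_db: float) -> float:
    """Effective number of bits for a given SNDR in dB (full-scale sine assumption)."""
    return (sndr_db - 1.76) / 6.02
```

For the reported 40 dB, `enob_from_sndr(40.0)` evaluates to roughly 6.35 bits, in line with the quoted resolution of 6 bits.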

　

Session: P1A
Day: 08/08
Time: 13:30-16:00
Chair: Prof. 李博明, 南台科技大學
Room: 2F Banquet Hall

Analog

P1A-1　 Frequency Domain Analog Circuit Fault Diagnosis Based on Radial Basis Function Neural Network

林宗志、郭明仁、陳盈州, 逢甲大學

　
In this paper, a fault diagnosis methodology based on radial basis function neural networks (RBFNN) is proposed to analyze the signatures of analog circuits. To perform soft fault location, RBFNNs are used to process the circuit frequency responses and to build the fault dictionary. Experimental results show that the proposed technique diagnoses and locates faults quickly and accurately.

P1A-2　 A RF CMOS Low Noise Amplifier Using High-Q Active Inductor Loads with Binary Code for Multi-Band Applications

Jenn-Tzer Yang, Yuan-Hao Lee, Yi-Yuan Huang, Yu-Min Mu, and Yen-Ching Ho, 明新科技大學

　
In this paper, a radio frequency (RF) CMOS multi-band low noise amplifier is proposed that uses a high-Q active inductor load with a binary-code band selector and is suitable for multi-standard wireless applications. By employing an improved high-Q active inductor with a two-bit binary control code, a multi-band low noise amplifier operating in four different frequency bands is realized. The proposed amplifier circuit is designed in TSMC 0.18-μm CMOS technology. Based on the simulation results, the amplifier can operate at 900 MHz, 1.8 GHz, 1.9 GHz, and 2.4 GHz with a forward gain (S₂₁) of 31.15 dB, 30.82 dB, 30.61 dB, and 28.4 dB, and a noise figure (NF) of 0.563 dB, 0.558 dB, 0.578 dB, and 0.759 dB, respectively. Furthermore, the power dissipation of the amplifier remains constant across all operating frequency bands at around 11.66 mW from a 1.8-V power supply. The occupied area of the amplifier is about 158 × 76 μm².

P1A-3　 A Novel Precise Step-Shaped Soft-Start Technique for Integrated DC-DC Converter

Yung-Chun Chuang and Ke-Horng Chen, 國立交通大學

　
The proposed precise step-shaped soft-start technique for integrated dc-dc converters provides excellent peak-current limiting under any load condition, strongly suppressing the initial inrush current, and also solves the over-voltage or voltage-drop problem during the transition between start-up mode and normal operation. Furthermore, the on-chip implementation reduces the number of external I/O pins, lowering the cost of the converter. This soft-start technique is therefore more practical and smoother for integrated dc-dc converters than conventional soft-start techniques.

P1A-4　 A self-oscillating switching power amplifier

Chia-Hsiung Kao, Ping-Yu Tsai, Wen-Pin Lin, and Ming-Ching Chou, 國立中山大學

　
A self-oscillating switching power amplifier is proposed. Feedback is used to reduce the output DC bias current and to produce self-oscillation. Filters are added in the feedback loop to reduce the quantization noise. Experimental results show that the proposed circuit has 0.25% total harmonic distortion (THD) and that its efficiency reaches 90.1% at an output power of 840 μW, with a DC output bias current of 8 μA and a supply voltage of 1.5 V.

P1A-5　 Differential Feed-forward Transconductor Design for High Linearity WiMax Subharmonic Mixer

Ying-Ta Lu, Hsien-Yuan Liao, Shao-Liang Lu, Joseph D. S. Deng*, and Hwann-Kaeo Chiou, 國立中央大學

　
An active double-balanced subharmonic mixer was designed and fabricated for a direct-conversion receiver in Worldwide Interoperability for Microwave Access (WiMax) applications. A differential feed-forward transconductor design was applied to improve the third-order nonlinearity. The mixer achieved a conversion power gain of 6.1 dB and an input-referred third-order intercept point of 5 dBm with a power dissipation of 6.72 mW from a 3 V supply. The chip was fabricated in TSMC 0.35 μm SiGe HBT technology and occupies 0.86 mm × 0.78 mm.

P1A-6　 Modeling on the Mutual Inductance of On-Chip Transformers

Heng-Ming Hsu, Sih-Han Lai, and Hsien-Feng Liao, 國立中興大學

　
The behavior of the mutual inductance of on-chip transformers is discussed comprehensively in this work. The characterization covers both low- and high-frequency performance. Moreover, a corresponding equivalent circuit is proposed to describe the high-frequency characteristics.

P1A-7　 A 1.76 uW, 0.9V, 8-bit Successive Approximation Register ADC with Fully-Differential Input Capability

謝宗殷、洪浩喬, 國立交通大學

　
This paper presents a 0.9V, 8-bit successive approximation register (SAR) analog-to-digital converter (ADC) with a novel pseudo differential track-and-hold stage to accept fully differential inputs. Most of the circuits in the ADC design are single-ended to save the power consumption and silicon area. The ADC has been designed in a 0.18μm CMOS process. HSPICE simulation results show that at an output rate of 111KS/s, the proposed SAR ADC achieves a peak signal-to-noise-and-distortion ratio of 48.23dB, a rail-to-rail input range, and an effective resolution bandwidth no less than its Nyquist bandwidth. Its power consumption is as low as 1.76 μW.

P1A-8　 A 3.1–10.6 GHz Ultra-Wideband CMOS Low Noise Amplifier Using Bridged-Shunt-Series Peaking Technique

PDF Yu-Liang Lin, Feng-Lin Shiu, and Hwann-Kaeo Chiou, 國立中央大學　

　
An ultra-wideband 3.1–10.6-GHz low-noise amplifier (LNA) that adopts an inductive peaking technique for bandwidth extension is presented. Fabricated in a 0.18-μm CMOS process, the proposed circuit achieves both maximum bandwidth and a maximally flat response. The feedback resistor provides a good input match while contributing only a small noise figure (NF) degradation. The presented LNA achieves a maximum power gain of 14.1 dB within a 3-dB bandwidth from 2.2 to 11 GHz, a good NF of 3.4 to 4.5 dB over the entire UWB band, and an IIP3 better than –3 dBm while drawing 30 mW from a 1.5 V supply.

P1A-9　 A Novel Infrared Tracking System with Winner-Take-All Implementation

PDF Po-Hsiang Chang and Chih-Hsiung Shen, 國立彰化師範大學

　
This paper discusses a novel infrared tracking sensor array, which measures the number and size of thermal objects using a winner-take-all (WTA) circuit and a new preliminary level of on-chip thermopile array image processing. The sensor digitizes the thermal image by comparing current signals, which are controlled by the output voltages of the thermopile sensors, with a given threshold. The WTA circuit is used in combination with the readout circuit of an 8×8-pixel thermopile array. Realizing the WTA circuit with sharp selectivity makes it possible to pick only one winner from each object by exploiting the inherent mismatch in transistor characteristics. To simulate and demonstrate the infrared thermal sensor array, the array is fully integrated in a standard 2P4M 0.35 μm CMOS technology. The results so far show that a thermopile array integrated with a WTA circuit can reach a high level of development and reliability and is well suited to high-accuracy infrared tracking applications.

P1A-10　 A New Multi-Function Wave Generator Based on Multiple-Output Second-Generation Current Conveyors

PDF Yuh-Shyan Hwang, Yu-Wen Chen, Jiann-Jong Chen, and Wen-Ta Lee, 國立台北科技大學　

　
A new multi-function wave generator based on multiple-output CCII (second-generation current conveyor) is presented in this paper. With the control of the on-chip switches, the waveform of the output can be modified to achieve signal modulation like ASK, FSK, and PSK. The circuit consists of two multiple-output CCIIs, two resistors, and two grounded capacitors. The proposed circuit has been designed with TSMC 0.35μm DPQM CMOS process. The HSPICE simulation results are depicted to verify the theoretical prediction of ASK, FSK, and PSK.

P1A-11　 Design a Multiplicative type-II Fuzzy Cellular Neural Network with CMOS Image Sensor

PDF Jui-Lin Lai, Yuan-Hung Lo, Yan-Ting Chen, and Rong-Jian Chen, 國立聯合大學　

　
The architecture of a multiplicative type-II fuzzy cellular neural network (FCNN) with a CMOS image sensor is proposed; its local connectivity makes it well suited to VLSI implementation. The proposed FCNN structure includes the neuron, Min/Max, analog multiplier, pixel and CDS circuits, S/H circuit, and transfer and control circuits. The proposed FCNN, which performs specific functions based on the selected template, is successfully verified in TSMC 0.35 μm 2P4M CMOS technology. It has great potential for the VLSI implementation of neural network systems for binary and gray-level patterns in image-processing applications.

P1A-12　 A 2.4GHz Current-reused VCO with Degenerated Resistors

PDF Ruey-Lue Wang, Guo-Ruey Tsai, Yu-Feng Lin, and YuJo Tzeng, 崑山科技大學　

　
In this paper, we present a current-reused voltage-controlled oscillator (VCO) built from a pair of NMOS and PMOS transistors. The proposed VCO is designed for 2.4 GHz operation. The study is based on a TSMC 0.18-μm CMOS process. Measurement results show phase noise of -94.6 and -116 dBc/Hz at 100-kHz and 1-MHz offsets, respectively, when the oscillation frequency is 2.46 GHz. The current-reused VCO can reduce power dissipation to half that of conventional differential topologies (core: 0.35 mA from a 1.8-V supply). The tuning range is from 2.24 to 2.52 GHz for tuning voltages from 0 to 2 V.

　

Session: P1E
Day: 08/08
Time: 13:30-16:00
Chair: 黃宗柱教授, 國立彰化師範大學
Room: 2F Banquet Hall

EDA

P1E-1　 A Single-Clock Enhanced Random Access Scan

PDF Chen-An Chen, Wei-Yi He, and Tsung-Chu Huang, 國立彰化師範大學　

　
Random access scan architecture has been an effective approach to simultaneously reduce power, test data volume, and test time for stuck-at fault testing. In this paper, we develop a single-clock random access scan architecture combined with combinational output observable logic that further reduces peak power and control wires, using a single clock for both delay testing and diagnosis. Two scan cells are developed for stuck-at faults and path delay faults separately. The observable logic makes vector ordering efficient, so that the test data and test application time can be further reduced by 64%. In particular, this structure lets the flip-flop array avoid flash capture operations and reduces the peak power dissipation by up to 92%. Through experiments and verification, including post-layout timing analyses, we demonstrate a multipurpose solution that addresses many SoC testing issues simultaneously.

P1E-2　 Area-Driven Decoupling Capacitance Allocation Based on Space Sensitivity Analysis

PDF Jin-Tai Yan, Ming-Yuen Wu, and Zhi-Wei Chen, 中華大學　

　
Based on the space sensitivity of decoupling capacitors (decaps), an area-driven allocation approach is proposed that relieves the IR-drop constraint while minimizing the final area of a given floorplan. The approach integrates decap estimation and allocation to assign feasible decaps around or near all circuit modules so as to relieve all IR-drop noise in the floorplan. The experimental results show that the proposed area-driven allocation approach obtains very promising timing and area results for MCNC benchmark circuits.

P1E-3　 A Topology-Based Construction for X-Architecture Clock Routing

PDF Chia-Chun Tsai*, Chung-Chieh Kuo, Jan-Ou Wu, Trong-Yen Lee, and Rong-Shue Hsiao, 南華大學　

　
Wire delay plays a critical role in high-performance clock routing. Shortening wirelength to reduce the clock delay is an increasingly important objective. X-architecture outperforms the conventional Manhattan architecture in wirelength reduction, power consumption, and clock performance. In this paper, we present a DME-X algorithm that combines the DME method with X-architecture to create a clock tree with zero skew. The algorithm constructs a parallelogram for each pair of sinks or points to determine its X-topology wiring, and it simplifies the merging-segment procedure. Experimental results on benchmarks show an improvement of 7.9% in clock delay compared with other algorithms.

P1E-4　 Routability-Driven Track Routing for Coupling Capacitance Reduction

PDF Jin-Tai Yan, Zhi-Wei Chen, and Kuen-Ming Lin, 中華大學　

　
Given a routing panel, routability-driven ordering and location constraints are first set according to the pin positions of all the wire segments in the panel. Based on these constraints, an ASAP-based scheduling approach with efficient space insertion is then proposed to reduce the total coupling capacitance in routability-driven track routing. The experimental results show that the proposed ASAP-based scheduling approach obtains very promising routing results in reasonable CPU time for several benchmark circuits.

P1E-5　 A Timing-Driven X-Architecture Router with Obstacles

PDF Shu-Ping Chang, Hsin Hsiung Huang, Yu-Cheng Lin, and Tsai Ming Hsieh, 國立台東大學

　
In this paper, we formulate a new X-architecture routing problem in the presence of obstacles, and propose an X-architecture timing-driven routing tree construction algorithm that minimizes the maximum source-to-sink delay and the total wirelength simultaneously. First, we construct the spanning graph from the terminals and the corners of the obstacles. The minimal spanning tree is obtained by searching the spanning graph. A feasible X-architecture routing tree is constructed by transforming the minimal spanning tree. For the X-architecture routing tree, the delay of each two-pin net is estimated with a modified Elmore delay model. According to a user-defined delay threshold, an efficient rerouting algorithm is used to fix the nets that violate timing. The critical terminals are iteratively rerouted by splitting the tree into two subtrees and merging them into one tree. Compared with the routing result without rerouting, the maximum source-to-sink delay is improved by 61% at the cost of only 0.7% additional total wirelength.
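
As background for the delay estimate mentioned above, here is a minimal sketch of the standard (unmodified) Elmore delay for a simple RC chain. The abstract does not specify how the authors modify the model, so this is only the textbook form; the function name and the lumped per-segment values are illustrative assumptions.

```python
def elmore_delay_chain(segments):
    """Standard Elmore delay at the far end of an RC chain.

    segments: list of (resistance, capacitance) pairs ordered from source to sink,
    where each capacitance is lumped at the downstream end of its resistor.
    """
    delay = 0.0
    downstream_cap = sum(c for _, c in segments)
    for r, c in segments:
        # Each resistor charges all the capacitance located downstream of it.
        delay += r * downstream_cap
        downstream_cap -= c
    return delay


if __name__ == "__main__":
    # Two identical segments: R1*(C1+C2) + R2*C2
    print(elmore_delay_chain([(1.0, 1.0), (1.0, 1.0)]))
```

Because the delay grows with the product of upstream resistance and downstream capacitance, shortening a wire reduces both factors, which is one reason wirelength and delay are optimized together.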

P1E-6　 Test Generation for Transition Delay and RS-CFM Faults in Modified Booth Multipliers

PDF Hsing-Chung Liang and Pao-Hsin Huang, 中原大學　

　
In this paper, we propose a type of modified Booth multiplier and generate C-testable and linear-testable pattern pairs for transition delay faults (TDF) and realistic sequential cell faults (RS-CFM) in multipliers of various sizes. The patterns are generated at two description levels of the circuit: the cell level and the gate level. By analyzing the multipliers, we generate 18 constant test pairs that detect TDF at the cell level irrespective of the multiplier size. Similarly, only 20 test pairs are enough to detect TDF at the synthesized gate level. These test pairs are far fewer than those generated by commercial tools, which cannot generate constant test pairs either. Furthermore, to prepare test pairs independent of the interior structures of cells, we also generate 104 + N×10 SIC test pairs for an N×N multiplier. These patterns not only achieve very high fault coverage for TDF and RS-CFM but are also far fewer than those of a previous work for RS-CFM.

P1E-7　 Non-Slicing Floorplanning-Based Crosstalk Reduction on Gridless Track Assignment

PDF Win-Nai Zheng, Yu-Ning Zhang, and Yih-Lang Li, 國立交通大學　

　
Track assignment, which is an intermediate stage between global routing and detailed routing, provides a good platform for promoting performance and for imposing additional constraints during routing, such as crosstalk. Gridless track assignment (GTA) has not been addressed in the public literature. This work develops a gridless crosstalk-driven GTA. An initial assignment is produced rapidly with a left-edge-like algorithm. Crosstalk reduction on the assignment is then transformed into a restricted non-slicing floorplanning problem, and a deterministic O-tree-based algorithm is employed to re-assign each net segment. Finally, each panel is partitioned into several sub-panels, and the sub-panels are re-ordered using a branch-and-bound algorithm to decrease the crosstalk further. Experimental results demonstrate that the proposed gridless crosstalk-driven GTA achieves over 80% reduction in the overlapping length of adjacent wires.
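
The initial assignment step can be illustrated with a first-fit variant of the classic left-edge algorithm. This is a sketch of the textbook technique, not the authors' "left-edge-like" variant, whose details the abstract does not give; the interval representation and function name are assumptions.

```python
def left_edge_assign(segments):
    """Assign horizontal segments to tracks so that no two on a track overlap.

    segments: list of (left, right) intervals. Returns a list of tracks,
    each a list of intervals in left-to-right order.
    """
    tracks = []
    # Process segments in order of their left edge.
    for seg in sorted(segments):
        for track in tracks:
            # Reuse the first track whose last segment ends before this one starts.
            if track[-1][1] < seg[0]:
                track.append(seg)
                break
        else:
            tracks.append([seg])  # no existing track fits; open a new one
    return tracks


if __name__ == "__main__":
    print(left_edge_assign([(0, 3), (1, 4), (4, 6), (5, 8)]))
```

A purely overlap-driven assignment like this ignores coupling between neighbouring tracks, which is the gap the crosstalk-reduction stages described above are meant to close.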

P1E-8　 Modified Essential Spare Pivoting Algorithm for Embedded Memories with Global Block-Based Redundancy

PDF Chun-Lin Yang and Shyue-Kung Lu, 輔仁大學　

　
A block-based redundancy architecture is proposed in this paper. The redundant rows/columns are divided into row/column blocks. Therefore, the repair of faulty memory cells can be performed at the row/column block level. Moreover, the redundant row/column blocks can be used to replace faulty cells anywhere in the memory array. This global characteristic is helpful for repairing cluster faults. The proposed redundancy architecture can be easily integrated with the embedded memory cores. Based on the proposed global redundant architecture, a heuristic MESP (modified essential spare pivoting) algorithm suitable for built-in implementation is proposed. According to experimental results, the area overhead for implementing the MESP algorithm is negligible. The repair rate is 99.94% for a 1M-bit (1024×1024-bit) SRAM.

　

Session: P1D
Day: 08/08
Time: 13:30-16:00
Chair: 莊作彬教授, 國立屏東商業技術學院
Room: 2F Banquet Hall

Digital

P1D-1　 A Comparative Study of LNS and Floating-Point arithmetic

PDF Chih-Yen Fan and Chi-Chyang Chen, 逢甲大學　

　
The logarithmic number system (LNS) arithmetic is very efficient in computing complex operations such as multiplication, division, powering, and logarithmic functions. However, addition and subtraction in LNS arithmetic are difficult and require a large hardware cost. In this paper, we compare and analyze the performance and hardware cost of the arithmetic units in LNS and floating-point (FLP) arithmetic with various word lengths. The 16-bit and 20-bit LNS adders/subtractors are designed using lookup tables with table-reduction techniques. The architecture of the 24-bit LNS unit is based on the table-lookup architecture with a simple approximation method to reduce the table size. Finally, direct computation is adopted to design the 28-bit and 32-bit LNS adders/subtractors. We have designed the 16-bit, 20-bit, 24-bit, 28-bit, and 32-bit LNS and FLP units with adders, subtractors, multipliers, and dividers in VHDL and synthesized these units with a TSMC 0.18 μm CMOS cell library. From the synthesis and simulation results, we compare and analyze the advantages and disadvantages of these two number systems; the results can serve as a guideline for design engineers in deciding when LNS arithmetic can be adopted for efficient digital system design.
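
The asymmetry between multiplication and addition in LNS can be seen in a small floating-point model. This is a sketch for positive values only, using `math.log2` in place of the lookup tables or approximation hardware the abstract describes; all function names are hypothetical.

```python
import math


def lns_encode(x):
    """Represent a positive real by its base-2 logarithm."""
    return math.log2(x)


def lns_decode(e):
    return 2.0 ** e


def lns_mul(a, b):
    # Multiplication is a single addition of exponents.
    return a + b


def lns_div(a, b):
    # Division is a single subtraction of exponents.
    return a - b


def lns_add(a, b):
    # Addition needs the nonlinear function log2(1 + 2**d), with d <= 0.
    # In hardware this is the part typically served by tables or approximation.
    hi, lo = max(a, b), min(a, b)
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))


if __name__ == "__main__":
    six, seven = lns_encode(6.0), lns_encode(7.0)
    print(lns_decode(lns_mul(six, seven)), lns_decode(lns_add(six, seven)))
```

The table for log2(1 + 2**d) grows quickly with word length, which is consistent with the abstract's move from lookup tables at 16 to 20 bits toward direct computation at 28 to 32 bits.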

P1D-2　 Design of Low-Error Signed Fixed-Width Multipliers

PDF Jiun-Ping Wang and Shiann-Rong Kuang, 國立中山大學

　
A framework for designing a low-error signed fixed-width multiplier that receives two n-bit operands and generates an n-bit product is proposed. The proposed error compensation circuit not only gives signed fixed-width multipliers higher accuracy but can also be easily constructed with simple logic gates. Moreover, the proposed signed fixed-width multiplier is applied to the inverse discrete cosine transform (IDCT) computation in JPEG image compression. Experimental results demonstrate that the proposed circuit not only improves accuracy but also significantly reduces the hardware complexity and propagation delay compared with the previous solution.

P1D-3　 A Novel VLSI Iterative Division Algorithm for Fast Quotient Generation

PDF Tso-Bing Juang, 國立屏東商業技術學院　

　
In this paper, a novel VLSI iterative division algorithm for fast quotient generation that is based on radix-2 non-restoring division is proposed. To speed up the quotient generation, our method makes use of the magnitude difference between the partial dividend and the divisor for the next iteration so that the proper weight of the quotient can be obtained more rapidly than the conventional methods. Our proposed architecture is very simple compared to the multiplication-based methods such as those that are based on Newton-Raphson. Simulation results show that our proposed method can achieve less than half the number of iterations required by the conventional division (i.e. less than n/2 vs. n, where n is the bit-width of the dividend and the divisor). Our proposed algorithm can be employed in Digital Signal Processing and 3D graphic processing applications to accelerate the compute intensive division operations.
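
For reference, the conventional radix-2 non-restoring division that the proposed method builds on can be sketched as follows. This is the textbook baseline, which uses one iteration per quotient bit; it does not implement the authors' accelerated scheme. The function name and the unsigned-integer interface are assumptions.

```python
def nonrestoring_divide(dividend, divisor, n):
    """Unsigned n-bit radix-2 non-restoring division (conventional baseline).

    Returns (quotient, remainder) after exactly n iterations.
    """
    if divisor == 0:
        raise ZeroDivisionError("divisor must be nonzero")
    r = 0  # partial remainder, allowed to go negative
    q = 0
    for i in reversed(range(n)):
        bit = (dividend >> i) & 1
        # Shift in the next dividend bit, then subtract or add the divisor
        # depending on the sign of the current partial remainder.
        if r >= 0:
            r = 2 * r + bit - divisor
        else:
            r = 2 * r + bit + divisor
        q = (q << 1) | (1 if r >= 0 else 0)
    if r < 0:
        r += divisor  # final correction step
    return q, r


if __name__ == "__main__":
    print(nonrestoring_divide(13, 3, 4))
```

The loop always runs n times here, which is the iteration count the abstract reports reducing to less than n/2.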

P1D-4　 Reusing Cache for Real-Time Memory Address Trace Compression

PDF Chung-Fu Kao, Chun-Hung Lai, and Ing-Jer Huang, 國立中山大學

　
Program execution tracing is an efficient debugging approach for analyzing and verifying software programs and hardware architectures. However, a major problem of tracing is the high cost of storing the traces. Reducing the trace information or compressing the trace volume is therefore an important issue when debugging a system. A cache-reuse method for real-time program execution trace compression is proposed. The method exploits the temporal and spatial locality of programs and reuses the system cache to trace program addresses. Its advantage is that reusing the cache with minor hardware modification not only saves the overhead of a hardware compressor but also obtains a high compression ratio. Experimental results show that the proposed approach incurs little hardware area overhead yet achieves a compression ratio of approximately 90% in real time.

P1D-5　 A Novel Membership Function Approximation for Effective Digital Circuit Design of Neural Networks

PDF Che-Wei Lin and Jeen-Shing Wang, 國立成功大學　

　
This paper proposes a novel approximation approach for a commonly used membership function of neural networks. In our study, we focus on the approximation of a hyperbolic tangent sigmoid function implemented by a digital circuit. The average error and maximum error of the proposed approximation approach are in the order of 10^-3 and 10^-2, respectively. The hardware implementation of the proposed method consumes only one multiplication and one addition/subtraction ALU with the aid of effective scheduling and allocation.

P1D-6　 A Novel Architecture for Self-Reconfigurable Systems

PDF Trong-Yen Lee, Yung-Lin Hsu, Che-Cheng Hu, 國立台北科技大學　

　
Dynamically reconfigurable systems are expected to be used in consumer electronics. State-of-the-art FPGAs provide the capability for fast, partial reconfiguration. We propose a novel architecture for multi-region self-reconfigurable systems that can handle all reconfiguration operations. The new elements of the proposed architecture are the wrapper, bus macro, and arbiter. The wrapper and bus macro can connect to various kinds of hardware modules and transmit multiple data items. The arbiter manages the data flow between the hardware modules and the MicroBlaze, and decides which region will be reconfigured. Experimental results show that the proposed architecture supports multiple modules, direct detection of hardware module functions by the arbiter, 160 I/O ports in the wrapper, and multi-data transfer over the bus macro.

P1D-7　 VLSI Implementation for Block-Based Gradient Domain High Dynamic Range Compression

PDF Tsun Hsien Wang, Wei-Ming Ke, Chih-Hsueh Huang, Ding-Chuang Zwao, Fang-Chu Chen, and Ching-Te Chiu, 國立清華大學　

　
Due to rapid progress in high dynamic range (HDR) capture technology, HDR display on conventional LCD devices has become an important topic. Tone mapping algorithms have been proposed for rendering HDR images on conventional displays. However, they are impractical for video applications because of their intensive computation time. In this paper, we present a real-time block-based gradient domain HDR compression for image or video applications. The gradient domain HDR compression is selected as our tone mapping scheme for its high compression and detail preservation. We divide one HDR image/frame equally into several blocks and process each block with the modified gradient domain HDR compression. The gradients with small magnitudes are attenuated less in each block to maintain the local contrast and thus expose the detail. We reconstruct a low dynamic range image by solving the Poisson equation on the attenuated gradient field block by block. A real-time block-based Gradient Domain Compression with Discrete Sine Transform (DST) architecture is proposed to tone-map HDR video sequences, including solving the Poisson equation. Our synthesis and layout results show that our tone-mapping design can run at a 50 MHz clock and occupies an area of 5.29 mm² in TSMC 0.13 μm technology.

P1D-8　 High Performance Decoder Design for Convolutional LDPC Codes

PDF Mu-Chung Chen, Jun-Wei Lin, Yen-Shuo Chang, Jin-Hao Yu, and Tzi-Dar Chiueh, 國立台灣大學　

　
In this paper, a new Convolutional Low-Density Parity Check Code (LDPC-CC) decoder has been designed and implemented. The proposed design can reach the same bit error rate performance while having lower computation complexity. We have proposed a new parity-check matrix that leads to saving in clock cycles. In addition, some circuit techniques, such as Wallace tree structure and linear approximation, are applied in our design in order to further improve the throughput of the proposed decoder. Finally the decoder PE is designed and its layout is implemented. This work provides a solid foundation for efficient and effective LDPC-CC decoder design.

P1D-9　 A Novel Design for Computation of All Transforms in H.264/AVC Decoders

PDF Yi-Chih Chao, Hui-Hsien Tsai, Yu-Hsiu Lin, Jar-Ferr Yang, and Bin-Da Liu, 國立成功大學　

　
In this paper, we design a novel architecture for computing all transforms required in an H.264/AVC high-profile decoder. This flexible architecture computes all transforms, including the 8- and 4-point integer transforms as well as the 4- and 2-point Hadamard transforms, so that the implementation chip area is reduced dramatically. With a throughput of 8 pixels/cycle, the proposed design completes the computation for one macroblock in 95 clock cycles when the 8×8 inverse transform is involved, or in 54 clock cycles without it. Simulation results show that the implemented area is 18.5k gates and the maximum clock frequency is 125 MHz. For real-time operation, the architecture can handle all existing frame sizes in 4:2:0 format. For example, when operated at 106 MHz, it achieves 4096×2304 at 30 frames/sec.
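
The 106 MHz figure can be sanity-checked with a short calculation. The sketch below assumes 16×16 macroblocks and that every macroblock takes the worst-case 95 cycles quoted in the abstract; the function name is hypothetical.

```python
def required_clock_hz(width, height, fps, cycles_per_mb, mb_size=16):
    """Clock rate needed if every macroblock takes cycles_per_mb cycles (assumed worst case)."""
    mbs_per_frame = (width // mb_size) * (height // mb_size)
    return mbs_per_frame * fps * cycles_per_mb


if __name__ == "__main__":
    hz = required_clock_hz(4096, 2304, 30, 95)
    print(f"{hz / 1e6:.2f} MHz")  # clock needed for 4096x2304 at 30 frames/sec
```

Under these assumptions a 4096×2304 frame holds 256 × 144 = 36,864 macroblocks, so 30 frames/sec at 95 cycles each needs roughly 105.06 MHz, just under the 106 MHz operating point stated above.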

P1D-10　 Design of a 2X2 MIMO OFDM Transceiver With Correction of Different Carrier Frequency Offsets at Transmitter Antennas

PDF Li-Wen Hsu and Dah-Chung Chang, 國立中央大學

　
The combination of multiple-input multiple-output (MIMO) and orthogonal frequency division multiplexing (OFDM) is regarded as the next-generation wireless LAN technology. In this paper, the design of a 2×2 space-time block coded (STBC) MIMO-OFDM baseband transceiver is studied based on the IEEE 802.11n proposal. To avoid the different phase offset between the transmitter antennas due to antenna resistance match problem, we propose a new carrier frequency offset tracking algorithm at the receiver. The overall design is verified on the Xilinx FPGA with implementation loss of about 1.5 dB.

　

Session: A2
Day: 08/09
Time: 10:00-11:30
Chair: 黃穎聰教授, 國立中興大學
Room: 301

Digital Design (II)

10:00 A2-1　 A Redundancy Detection Algorithm for DCT and Quantization in H.264 Video Encoding

   　 PDF Ting-Wei Chen, Chang-Hsin Cheng, Yu Liu, and Chun-Lung Hsu, 國立東華大學　

　　
This paper proposes a novel, efficient algorithm that detects redundant computation early using the relationship between the pixel values in the time domain and the low-frequency DCT coefficients. The novel redundancy detection algorithm achieves high coding efficiency and low coding complexity without hampering image quality. Extensive performance evaluation shows that the proposed algorithm can reduce DCT and quantization time by up to 57.47%. This makes it an essential algorithm for making hardware/software H.264 codecs feasible.

10:15 A2-2　 Skip Control Algorithm of Motion Estimation for Power-scalable H.264 Video Encoder

   　 PDF Chieh Chien, Yu-Han Chen, and Liang-Gee Chen, 國立台灣大學

　　
In this paper, we present a power control algorithm for a video encoder that maximizes power saving while maintaining video quality. A pre-skip algorithm is adopted to provide power scalability. Then a skip ratio determination algorithm is provided to find the most suitable power-saving point that satisfies the video quality constraints. It saves power for mobile devices under user-defined quality constraints. Next, a skip threshold control algorithm is presented to steer the encoding power to this most suitable power-saving point. The simulation results show that the proposed algorithm accurately achieves the most suitable power saving under a defined quality constraint.

10:30 A2-3　 Efficiency-Enhanced Multilevel LINC System Design

   　 PDF Kai-Yuan Jheng, Yuan-Jyue Chen, and An-Yeu (Andy) Wu, 國立台灣大學　

　　
Linear amplifier with nonlinear components (LINC) is a power amplifier (PA) linearization method which offers both high PA efficiency and high linearity of wireless transmitters. While LINC increases the PA efficiency, LINC requires an extra power combiner which results in low system efficiency. To solve this problem, we propose a multilevel out-phasing (MOP) scheme and a corresponding architecture: multilevel LINC (MLINC) to increase power combiner efficiency of wireless transmitters. Under WCDMA specifications, we demonstrate a 3-level MLINC as a design example which enhances power combiner efficiency from 44.5% to 75.5%.
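
The basic out-phasing decomposition behind LINC can be sketched as follows. This shows only the conventional single-level split of a complex baseband sample into two constant-envelope components; the multilevel scheme proposed in the abstract is not reproduced, and the function name and the `a_max` parameter are assumptions.

```python
import cmath
import math


def linc_decompose(s, a_max):
    """Split a complex sample s (|s| <= a_max) into two constant-envelope signals.

    Both outputs have magnitude a_max / 2 and sum back to s.
    """
    amp = abs(s)
    phase = cmath.phase(s)
    # Out-phasing angle: larger when the instantaneous amplitude is small.
    theta = math.acos(min(amp / a_max, 1.0))
    s1 = (a_max / 2) * cmath.exp(1j * (phase + theta))
    s2 = (a_max / 2) * cmath.exp(1j * (phase - theta))
    return s1, s2


if __name__ == "__main__":
    s1, s2 = linc_decompose(0.3 + 0.4j, 1.0)
    print(abs(s1), abs(s2), s1 + s2)
```

When the amplitude is small relative to `a_max`, the two components nearly cancel and much of the amplifier output is lost in the combiner, which is the efficiency problem the abstract sets out to address.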

10:45 A2-4　 A Scalable Frame-Pipeline Motion Estimation Processor for Full-Search Algorithm

   　 PDF Yeong-Kang Lai, Lien-Fei Chen, Yin-Ruey Huang, and Sheng-Yu Huang, 國立中興大學　

　　
In this paper, a scalable two-dimensional pipelined motion estimation processor for the full-search block-matching algorithm (FSBMA) is proposed. The proposed 2-D motion estimation processor can smoothly perform the block-matching operations of consecutive frames without any processor idle time at frame boundaries. Moreover, we propose a scalable architecture to satisfy the throughput requirements and to reduce the external memory bandwidth with level C+ data reuse. The experimental results show that our architecture achieves 100% fully pipelined computation at the frame level and meets the performance-driven requirements of different video applications.
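
As a software reference for what the processor computes, here is a minimal sketch of full-search block matching with the sum of absolute differences (SAD) as the cost. It is a plain, unpipelined model with no data reuse; the function names, the list-of-lists frame layout, and the boundary handling are assumptions.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between an n x n block and its displaced candidate."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(n)
        for x in range(n)
    )


def full_search(cur, ref, bx, by, n, p):
    """Exhaustively test every displacement in [-p, p]^2 that stays inside ref.

    Returns (best_sad, (dx, dy)).
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            inside = (0 <= by + dy and by + dy + n <= height
                      and 0 <= bx + dx and bx + dx + n <= width)
            if not inside:
                continue  # skip candidates that fall outside the reference frame
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best
```

Each block requires (2p + 1)² candidate evaluations of n² pixels, and neighbouring candidates overlap heavily, which is why hardware designs of this kind focus on pipelining and on reusing reference data.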

11:00 A2-5　 High Throughput Embedded Compression Engine for High-End LCD Applications

   　 PDF Tsung-Han Tsai, Yu-Yu Lee, and Yu-Xuan Lee, 國立中央大學　

　　
As LCD panel technology advances into the high-definition (HD) series, the data rate in video post-processing increases drastically. Therefore, the memory bandwidth requirement and memory size become primary concerns. An embedded compression engine is utilized to reduce the memory bandwidth and memory size. In this paper, color difference pre-processing (CDP) is proposed to improve the coding efficiency, and simulation results show that CDP can improve the coding efficiency by as much as 40.4%. Moreover, for hardware parallelism, the proposed segment scan manner (SSM) provides a hardware scheme capable of flexible parallelism while sacrificing only 0.1%~0.42% of coding efficiency. Based on SSM, the hardware parallelism of our proposed VLSI architecture with ping-pong mode memory partition can be flexibly increased for various high-end LCD specifications.

11:15 A2-6　 A Novel Low Complexity Pulse-Triggered Flip-Flop Design with Dual Triggering Mode

   　 PDF Jin-Fa Lin, Yin-Tsung Hwang, Ming-Hwa Sheu, and Wei-Rong Ciou, 國立中興大學　

　　
In this paper, a novel dual-mode pulse generator design with the lowest transistor count known in the literature is first presented. The pass-transistor logic (PTL) based design successfully reduces the circuit complexity and input loading capacitance to lower the power consumption. Both the threshold voltage loss and the insufficient driving capability common in PTL are also resolved in our design to support low V_dd operation. Based on the proposed pulse generator circuitry, a pulse-triggered flip-flop with dual triggering mode (single/double edge) is presented. This design, called PET-FF (Programmable Edge-Triggered Flip-Flop), features functional versatility and low-voltage operation in addition to the circuit complexity and power consumption advantages inherent in pulse-triggered FF designs. Simulations in a TSMC 0.18 μm CMOS process show that the proposed FF designs, which provide extra triggering mode selection, outperform conventional FF designs and achieve power and delay performance similar to that of peer single-mode pulse-triggered FF designs.

　

Session: B2
Day: 08/09
Time: 10:00-11:30
Chair: 黃錫瑜教授, 國立清華大學
Room: 305

Memory/CPU/DSP Cores

10:00 B2-1　 A Protocol-Reconfigurable Double-Layer External Memory Management for H.264/AVC Decoder

　 PDF Chang-Hsuan Chang, Ming-Hung Chang, and Wei Hwang, 國立交通大學　

　　
In this paper, a protocol-reconfigurable double-layer external memory management scheme for an H.264/AVC decoder is proposed. A large amount of data must be transferred to and from the off-chip memory in an H.264/AVC decoder. Therefore, the data access latency and power consumption greatly affect the performance of the whole system. The proposed memory controller consists of two layers. The first layer is the address translation, which provides an efficient pixel data arrangement to reduce the occurrence of row misses. The second layer is the external memory interface (EMI), which can further reduce access latency by up to 70% by using a specific command FIFO and a unified FSM with generic scheduling. The EMI settings can be reconfigured to suit different external memory modules. In particular, combining the address translation layer with the external memory interface increases memory utilization by about three times compared with the traditional method.

10:15 B2-2　 A Energy-Efficient 256X144 TCAM Design

　 PDF Wen-Yen Liu, Po-Tsang Huang, and Wei Hwang, 國立交通大學　

　　
In this paper, a low-power and high-speed ternary content addressable memory (TCAM) is presented. For network routers, super cut-off and multi-mode data-retention power gating techniques are applied to reduce leakage currents. In addition, the match-lines are implemented with an XOR-based conditional keeper, butterfly connection, and don’t-care based power gating to achieve energy-efficient search operations. Power consumption is further reduced by a hierarchical search-line scheme. Based on the 65nm Berkeley Predictive Technology Model (PTM), simulation results show that the leakage power reduction is 70.7% and the energy metric of the TCAM is 0.047 fJ/bit/search.
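
The ternary matching that a TCAM performs can be modelled in a few lines. This is a behavioural sketch only: hardware compares all entries in parallel, whereas the loop below scans them in priority order. The (value, care_mask) entry format and the function name are assumptions.

```python
def tcam_search(entries, key):
    """Return the index of the first entry matching key, or None.

    entries: list of (value, care_mask) pairs; a 0 bit in care_mask marks
    that bit position as don't-care.
    """
    for index, (value, care_mask) in enumerate(entries):
        # XOR flags the differing bits; masking discards the don't-care positions.
        if ((value ^ key) & care_mask) == 0:
            return index
    return None


if __name__ == "__main__":
    table = [
        (0b10100000, 0b11110000),  # matches any key whose upper nibble is 1010
        (0b00000000, 0b00000000),  # all don't-care: default entry
    ]
    print(tcam_search(table, 0b10100110), tcam_search(table, 0b01011111))
```

Don't-care positions never affect the match result, which suggests why gating them, as the abstract describes, can save energy without changing the search outcome.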

10:30 B2-3　 Energy-Efficient and High-Performance Power Gating in Microprocessor Functional Units

　 PDF Chang-Ching Yeh, Kuei-Chung Chang, Tien-Fu Chen, and Chingwei Yeh, 國立中正大學　

　　
In current high-performance microprocessors, more functional units are required to satisfy increasing computation demands, but they also result in greater leakage energy consumption. Microarchitectural techniques for power gating of functional units detect suitable idle regions and turn the units off to reduce leakage energy consumption, but ready instructions always have to wait in the issue queue for the required functional units to wake up, so the wakeup overhead is incurred repeatedly. In this paper, we present a time-based power gating with reference pre-wakeup (PGRP) technique for an in-order processor to reduce leakage energy without degrading performance. We exploit code sequentiality and implement a branch-based execution history buffer (PGRP-buffer) for on-demand wakeup prediction. Simulation with benchmarks from SPEC2000 applications shows that the technique reduces leakage energy considerably with less than 1% performance impact.

10:45 B2-4　 A Mini Stereo Digital Audio Processor Design

　 PDF Po-Yu Kuo, Dian Zhou, and Zhi-Ming Lin, 德州大學達拉斯分校　

　　
This paper presents the implementation of a programmable finite impulse response (FIR) digital filter for a Mini Stereo Digital Audio Processor (MSDAP). The performance of the MSDAP is expected to match that of two general DSP (digital signal processing) chips implementing two-channel FIR digital filtering in audio applications. Implemented in TSMC 0.18 μm technology, the MSDAP runs at a data clock rate of 768 kHz and a system clock rate of 6.2 MHz. With a 1.98 V power supply, the power dissipation is about 2.539 mW.

11:00 B2-5　 Adaptive Sensing Control in SRAM Design Using Per-Column Timing Tracking Scheme

　 PDF Ya-Chun Lai, Ming-Yi Chang, and Shi-Yu Huang, 國立清華大學　

　　
This paper presents a new timing tracking scheme in an SRAM design for enhancing the tolerance of bitline delay variation. This scheme, which modifies the circuitry around each sense amplifier, allows an SRAM column to operate according to its own timing. Thus, each latch-type sense amplifier can be turned on at the right time and the pulse width of the active wordline can be tuned to its optimal width on the fly, no matter how severely the operating speed of one bitline differs from another. Monte Carlo analysis of both the proposed and the conventional timing tracking schemes in a 22-nm predictive technology model demonstrates that this scheme has 18% better parametric yield than the conventional dummy-bitline timing tracking scheme.

11:15 B2-6　 Compact Dual-Core Architecture

　 PDF Jih-Ching Chiu and Yu-Liang Chou, 國立中山大學　

　　
A novel architecture for chip-multiprocessor operation, called the compact dual-core architecture (CDA), is proposed in this paper. Unlike well-known chip multiprocessors, this architecture supports superscalar operation, so data sharing can be accomplished through register-file sharing. In the CDA, programmers can dynamically switch among three modes (the superscalar processing mode, the multithreaded mode, and the single-processing mode) using the proposed novel instruction set, which maintains the flexibility to support multithreaded applications. Compared with the 5-stage single-pipe architecture, the CDA obtains an average 33% performance speedup in the superscalar processing mode.

　


Session: C2
Day: 08/09
Time: 10:00-11:30
Chair: Prof. 郭建男, 國立交通大學
Room: 315

Wireless Circuits

10:00 C2-1　 A HQPM-Based Transmitter with Digital Predistortion Scheme for Enhancing Average Efficiency

　 PDF C.-T. Chen, C.-J. Li, T.-S. Horng, J.-K. Jau, J.-Y. Li, P.-K. Horng, and D.-S. Deng, 國立中山大學

　　
This paper presents the comparison of signal quality and efficiency performance between simulation and experiment for a linear RF transmitter based on a hybrid quadrature polar modulation (HQPM) architecture with digital predistortion scheme. The measurement results of the Class-E power amplifier and Class-S modulator are applied in the simulation. The path-delay difference between the envelope and the phase signal is also considered. Adjacent channel power ratios with and without digital predistortion are both shown to demonstrate improvement from the baseband predistortion algorithm. The simulated and experimental results have validated the key feature of the HQPM architecture, that its high efficiency characteristics are not sensitive to the output power level.

10:15 C2-2　 Limitation and Improvement of a Modified Precharge Phase Frequency Detector for Wireless Frequency Synthesizer Applications

　 PDF C.-J. Li, C.-B. Lo, S.-W. Li, T.-S. Horng, and K.-C. Peng, 國立中山大學　

　　
A modified precharge phase frequency detector (PFD) that is inherently high speed and dead-zone free is designed and fabricated in a 0.18 μm CMOS process. Through analysis and measurement, this kind of PFD is found for the first time to have a limitation on its minimum comparison frequency, which makes it difficult to apply to wireless frequency synthesizers. In this work, a novel charge compensation circuit is proposed and implemented on the same CMOS chip as the frequency synthesizer to solve this problem.

10:30 C2-3　 A 0.8V SOP-Based Wideband Fourth-Order Cascade Delta-Sigma Modulator

　 PDF Chien-Hung Kuo and Shuo-Chau Chen, 淡江大學　

　　
In this paper, a 0.8 V switched-opamp (SOP)-based wideband 2-2 cascade delta-sigma modulator is presented. Double sampling is used to improve the clock efficiency and relax the requirements on the SOP. Based on the low-distortion topology, a CIFF-CIFB structure is adopted to improve the resolution of the cascade modulator. The proposed modulator has been implemented in a 0.13μm 1P8M technology. The peak signal-to-noise-and-distortion ratio (SNDR) of the modulator in a 1.1 MHz bandwidth is 68 dB at a 20 MHz clock rate. The power dissipation of the modulator is 15.7 mW at a 0.8 V supply voltage.

10:45 C2-4　 Sub-mW 5-GHz Receiver Front-End Circuit Design

　 PDF Tatao Hsu, Yen-Lin Liu, Shu-Hui Yen, and Chien-Nan Kuo, 國立交通大學　

　　
In this work, a 5-GHz receiver front-end is designed for wireless sensor network applications. The circuit topology is chosen to operate at a low supply voltage below 1 V. The stability of the LNA circuit is ensured by adding reactive components. The total power consumption of the fabricated circuit is 0.86 mW, of which 0.7 mW goes to the LNA stage. The measured return loss and conversion gain are 11 dB and 25 dB, respectively. The noise figure is 12 dB and the IIP3 is around -6.5 dBm.

11:00 C2-5　 A Low Voltage Full-band Cascoded UWB LNA

　 PDF Ruey-Lue Wang, Min-Chhuien Lin, and Zhi-Cheng Lin, 崑山科技大學　

　　
In this paper, a 3.1-10.6 GHz ultra-wideband (UWB) low-voltage low noise amplifier (LNA) employing only a one-stage cascoded topology and an additional voltage-current feedback is presented. The research is based on the TSMC 0.18μm CMOS process. Measurement results show the following performance: maximum power gain of 9.18 dB, ±0.9 dB gain flatness over the full band, minimum noise figure of 4.1 dB, input-referred third-order intercept point (IIP3) of 7.25 dBm, and input-referred 1-dB compression point (P1dB) of –2.5 dBm. The power consumption is 23.5 mW under a 1.0 V supply voltage. The chip area is 0.995 mm × 0.780 mm.

11:15 C2-6　 80-S/s Delta Sigma Modulators For IR Thermometer

　 PDF Jen-Shiun Chiang, Hsin-Liang Chen, Yao-Tsung Chang, and Meng-Hsuan Ho, 淡江大學　

　　
In this paper, high-performance 80-S/s delta-sigma modulators for IR thermometer applications are presented. Three methods are used to implement these modulators. First, a general technique is used. Second, the chopper technique is used to cancel offset and remove 1/f noise. Third, correlated double sampling (CDS) is used mainly to cancel offset. These third-order, 1-bit-quantizer, single-loop delta-sigma modulators achieve dynamic ranges of 86, 89, and 88 dB and signal-to-noise-and-distortion ratios of 80, 82, and 83 dB, respectively. The circuits are implemented in a standard 0.35-μm 2P4M CMOS technology. The chip areas are 1.57 mm² (1.37 mm × 1.15 mm), 2.25 mm² (1.5 mm × 1.5 mm), and 2.25 mm² (1.5 mm × 1.5 mm), respectively, and the power consumption is 2.2 mW, 1 mW, and 1.1 mW at a 3-V supply.

　


Session: D2
Day: 08/09
Time: 10:00-11:30
Chair: Prof. 江蕙如, 國立交通大學
Room: 308

Physical Design

10:00 D2-1　 On Power-State-Aware Routing and Buffer Insertion

　 PDF Ming-Hua Wu and Iris Hui-Ru Jiang, 國立交通大學　

　　
Interconnect delay and low power are two of the main issues in nanotechnology. Buffer insertion during routing can reduce interconnect delay; multiple supply voltages can lower power consumption. However, buffering without considering power states may cause signal integrity problems. In this paper, we first propose an algorithm to construct a buffered routing tree considering power states for dual-supply-voltage designs. Our approach can simultaneously minimize power, satisfy timing constraints, and maintain signal integrity. The results show that this method is promising, e.g., it constructs a buffered routing tree with 37 sinks in less than 4 seconds while maintaining signal integrity.

10:15 D2-2　 An Obstacle-Avoiding Rectilinear Steiner Minimal Tree Construction Algorithm

　 PDF Ya Wen Tsai, Yung Tai Chang, Jun Cheng Chi, and Mely Chen Chi, 中原大學　

　　
We present a construction-by-correction approach to solve the Obstacle-Avoiding Rectilinear Steiner Minimal Tree (OARSMT) construction problem. We build an obstacle-weighted spanning tree as guidance to construct an OARSMT on an escape graph. We use Dijkstra’s algorithm for routing. A U-shaped removal refinement is applied during the routing process to further reduce the wirelength. Our experimental results show that, compared with several state-of-the-art works, this algorithm achieves the shortest average total wirelength. It also has a short run time for practical-size problems.
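As a rough illustration of the routing step only (not the authors' construction-by-correction algorithm), the sketch below runs Dijkstra's algorithm on a small uniform grid in which obstacle cells are blocked. The grid stands in for the escape graph, and the function name `shortest_rectilinear_path` and the sample layout are assumptions made for this example.

```python
import heapq

def shortest_rectilinear_path(width, height, obstacles, source, target):
    """Return the length of the shortest obstacle-avoiding rectilinear path
    between two grid points, or None if the target is unreachable.
    Hypothetical helper for illustration; unit edge weights are assumed."""
    blocked = set(obstacles)
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, (x, y) = heapq.heappop(heap)
        if (x, y) == target:
            return d
        if d > dist[(x, y)]:
            continue  # stale heap entry, a shorter path was already found
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if not (0 <= nx < width and 0 <= ny < height):
                continue  # outside the routing region
            if (nx, ny) in blocked:
                continue  # inside an obstacle
            nd = d + 1
            if nd < dist.get((nx, ny), float("inf")):
                dist[(nx, ny)] = nd
                heapq.heappush(heap, (nd, (nx, ny)))
    return None

# A 5x5 grid with a vertical wall at x = 2 covering y = 0..3,
# so any route from the left side to the right side must detour through y = 4.
wall = [(2, y) for y in range(4)]
detour_length = shortest_rectilinear_path(5, 5, wall, (0, 0), (4, 0))
direct_length = shortest_rectilinear_path(5, 5, [], (0, 0), (4, 0))
```

Comparing `detour_length` with `direct_length` shows the extra wirelength an obstacle forces on a single two-pin connection; a Steiner tree builder would repeat such searches for many pins and then refine the result.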

10:30 D2-3　 A Network-Flow Based Algorithm for Digital Microfluidic Biochip Routing

　 PDF Ping-Hung Yuh, Chia-Lin Yang, and Yao-Wen Chang, 國立台灣大學

　　
Due to recent advances in microfluidics, digital microfluidic biochips are expected to revolutionize laboratory procedures. One critical problem for biochips is the droplet routing problem. Unlike traditional VLSI routing problems, in addition to routing path selection, the biochip routing problem needs to address the issue of scheduling droplets under the practical constraints imposed by the fluidic property and the timing restrictions of the synthesis result. Therefore, the biochip routing problem is more complicated than traditional VLSI routing. In this paper, we present the first network-flow-based routing algorithm that can concurrently route a set of non-interfering nets for the droplet routing problem on biochips. We adopt a two-stage technique of global routing followed by detailed routing. In global routing, we first identify a set of non-interfering nets and then adopt the network-flow approach to generate optimal global-routing paths for the nets. In detailed routing, we present the first polynomial-time algorithm for simultaneous routing and scheduling based on the global-routing paths, with a negotiation-based routing scheme. The experimental results show the effectiveness and efficiency of our algorithm.

10:45 D2-4　 A Transitive-Closure-Graph-Based Macro Placement Algorithm

　 PDF Hsin-Chen Chen, Yi-Lin Chuang, Zhe-Wei Jiang, and Yao-Wen Chang, 國立台灣大學　

　　
In this paper, we propose a transitive-closure-graph-based (TCG-based) macro placement algorithm that removes macro overlaps and optimizes macro positions. Improving on TCG by working only on its essential edges without loss of solution quality, our algorithm can efficiently and effectively search for a high-quality macro geometric relation. Instead of packing macros along chip boundaries like the most recent previous work, our placer can determine a non-compacted macro placement by linear programming and placement region cost evaluation. Our macro placer is so flexible and versatile that its linear programming formulation can easily be extended to handle various placement constraints and objectives. Combined with various leading academic placers, our macro placer can consistently and significantly reduce the wirelength, implying that it is robust and produces very high quality results. For example, based on the ISPD’06 placement benchmarks, combined with our macro placer, the resulting wirelength of Capo 10.2, mPL6, and NTUplace3 can be further reduced by 5%, 6%, and 15% on average, respectively.

11:00 D2-5　 Optimal Redundant Via Insertion Using Mixed Integer Linear Programming

　 PDF Kuang-Yao Lee, Ting-Chi Wang, and Kai-Yuan Chao, 國立清華大學

　　
Redundant via insertion is highly recommended to improve chip yield and reliability. The well-studied double-cut via insertion (DVI) problem allows a single via in a chip to have at most one redundant via inserted next to it, but the solution to this problem is not good enough, particularly for high-activity and power nets, because those nets typically need more redundant vias to further enhance reliability. This motivates us to study a new problem in this paper, called the multiple-cut via insertion (MVI) problem, in which one or more redundant vias can be inserted next to a single via such that the number of single vias with redundant vias inserted next to them and the number of inserted redundant vias are both maximized. We formulate the MVI problem as a mixed integer linear programming (MILP) problem. To make the problem tractable, we further break the MILP problem into a set of much smaller MILP problems, each of which is solved independently and efficiently without sacrificing optimality. In addition, we show that the DVI problem is just a special case of the MVI problem, and therefore our MILP approach can be easily adapted to optimally solve the DVI problem as well. To the best of our knowledge, none of the existing DVI works can guarantee optimality. Extensive experimental results are provided to support the efficiency of our MILP approaches on both the MVI and DVI problems.

11:15 D2-6　 A Simple Yet Efficient Global Router with Mirrored Monotonic Routing and Reduced Multi-Source Multi-Sink Maze Routing

　 PDF Ke-Ren Dai, Jyun-Yi Lin, and Yih-Lang Li, 國立交通大學　

　　
The traditional VLSI physical design flow is composed of floorplanning, placement, global routing, and detailed routing. A fast global router can help placers accurately estimate wirelength and routability. A high-quality global router increases routability for detailed routers. In this work, we develop a high-performance congestion-driven global router that quickly produces better routing results than an ILP-based global router. Based on the routing flow of FastRoute 2.0, we develop an enhanced routing flow, a simplified multi-source multi-sink maze routing, and mirrored monotonic routing. Experimental results reveal that our router substantially reduces overflows at little cost in runtime.

　


Session: E2
Day: 08/09
Time: 10:00-11:30
Chair: Prof. 洪進華, 國立高雄大學
Room: 318

DFT and SOC Testing

10:00 E2-1　 Test Data and Test Time Reduction for LOS Transition Test in Multi-Mode Segmented Scan Architecture

　 PDF Sying-Jyan Wang, Po-Chang Tsai, Hung-Ming Weng, and Katherine Shu-Min Li, 國立中興大學　

　　
Launch-off-Shift (LOS) is a widely used technique for delay test in scan-based design. Test data compression for LOS patterns, however, is less efficient. In this paper, we first analyze the reason for low compression rate in LOS patterns, and present an LOS test enabled scan architecture that supports three operation modes: broadcast, multicast, and serial. Efficient LOS test data compression can be achieved under this architecture with limited hardware overhead. An ATPG method for LOS test patterns under the proposed architecture is also presented. Experimental results show that most of the serial scan operations can be replaced by multicast operations, and thus achieve much better compression rate.

10:15 E2-2　 Test Efficiency Analysis of SOC Test Platforms

　 PDF Tong-Yu Hsieh, Kuen-Jong Lee, and Jian-Jhih You, 國立成功大學　

　　
In this paper, we formally analyze the test efficiency of test platforms, which appear to be a promising method for SOC testing, and seek to optimize it. A test cycle estimation technique is proposed to evaluate the test efficiency of various test procedures and organizations of test platforms. It is shown that a test time difference of up to 24X is possible among test platforms with different dedicated designs and test procedures. Based on the analysis results, we can easily determine an appropriate test procedure and organization that achieves extremely high test efficiency with the minimum required area overhead.

10:30 E2-3　 A Novel High-Speed SOC Test Scheme Using Virtual TAMs

　 PDF Jiann-Chyi Rau, Chien-Hsu Wu, and Chung-Lin Wu, 淡江大學　

　　
This paper presents a framework associated with an efficient method to determine the optimal scheduling of SOC test. In addition to using both traditional scan chains and reconfigurable multiple scan chains, we increase the TAM width in the proposed framework. Experimental results for ITC’02 SOC benchmarks show that our work can obtain better test application time compared to the previously published algorithms.

10:45 E2-4　 Enhancing Compression Efficiency with Skewed-Probability Scan Chains

　 PDF Sying-Jyan Wang, Shih-Cheng Chen, and Katherine Shu-Min Li, 國立中興大學　

　　
Code-based test data compression schemes encode symbols in the test data with predetermined codewords so that data volume can be reduced. The compression efficiency is affected by the distribution of data symbols. In this paper, we first analyze the factors that affect the encoding efficiency of various codes, and then propose a skewed-probability scan chain partitioning scheme, in which the distribution of 0’s and 1’s is changed in different parts of the scan chain. Both analytical and experimental results confirm that the scheme can effectively improve compression efficiency, while the routing penalty due to the partitioning method is limited.
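Why a skewed distribution of 0's and 1's helps can be made concrete with Shannon's binary entropy, which bounds the average number of code bits needed per source bit for a memoryless source. The sketch below is a general information-theoretic illustration, not the analysis used in the paper; the function name `binary_entropy` and the sample probabilities are assumptions.

```python
import math

def binary_entropy(p_one):
    """Entropy in bits per symbol of a memoryless binary source
    that emits 1 with probability p_one."""
    if p_one in (0.0, 1.0):
        return 0.0  # a constant source carries no information
    p_zero = 1.0 - p_one
    return -(p_one * math.log2(p_one) + p_zero * math.log2(p_zero))

# A balanced segment cannot be compressed by any symbol code,
# while a segment with 90% ones needs fewer than half as many bits.
balanced = binary_entropy(0.5)
skewed = binary_entropy(0.9)
```

Under this simplified model, partitioning a scan chain so that each part is dominated by one symbol lowers the entropy of each part, which leaves more room for a code-based scheme to compress.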

11:00 E2-5　 Diagnosis of Multiple Scan Chain Timing Faults

　 PDF Wei-Shun Chuang, Wei-Chih Liu, and James Chien-Mo Li, 國立台灣大學　

　　
A diagnosis technique is presented to locate multiple timing faults in scan chains. Jump simulation is a novel parallel simulation technique that quickly searches for the upper bound and the lower bound of each individual fault. This technique requires only regular ATPG patterns, which is ideal for the production environment. Experiments on ISCAS’89 benchmark circuits show that this technique diagnoses every fault to a precision of no more than two scan cells (16 hold-time faults in total in more than 800 scan cells). The proposed technique is still effective when the failure data is limited or the faults are clustered.

11:15 E2-6　 Testing MRAM for Write Disturbance Fault

　 PDF Wan-Yu Lo, Ching-Yi Chen, Chin-Lung Su, and Cheng-Wen Wu, 國立清華大學　

　　
With the development of deep sub-micron technology, semiconductor memory has become larger and denser. Many applications require chips to integrate non-volatile memories. The industry has been trying to develop a new non-volatile memory to replace flash memories, and magnetic random access memory (MRAM) is a possible candidate. The write disturbance fault (WDF) model is a fault model specific to MRAM, in which the data stored in the MRAM cells is changed by an excessive magnetic field during a write operation. March tests have high coverage for conventional RAM faults; however, they do not detect all WDFs. To improve the quality and yield of MRAM, we suggest a new test algorithm to detect WDFs in MRAM, which is extended from the March-based test algorithm. It also keeps linear time complexity and can be implemented easily within built-in self-test (BIST).
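For readers unfamiliar with March tests, the sketch below runs the classic March C- sequence on a toy memory model. It only injects a simple stuck-at fault; it does not model write disturbance faults and is not the extended algorithm proposed in the paper. The `Memory` class, `run_march` function, and fault-injection interface are hypothetical names for this example.

```python
class Memory:
    """Toy bit-addressable memory; `stuck_at` maps an address to a forced value."""
    def __init__(self, size, stuck_at=None):
        self.cells = [0] * size
        self.stuck_at = dict(stuck_at or {})

    def write(self, addr, value):
        self.cells[addr] = self.stuck_at.get(addr, value)

    def read(self, addr):
        return self.stuck_at.get(addr, self.cells[addr])

# March C-: each element is (address order, operations applied to every cell).
# The first and last elements may use either order; "up" is chosen here.
MARCH_C_MINUS = [
    ("up",   ["w0"]),
    ("up",   ["r0", "w1"]),
    ("up",   ["r1", "w0"]),
    ("down", ["r0", "w1"]),
    ("down", ["r1", "w0"]),
    ("up",   ["r0"]),
]

def run_march(memory, size, elements=MARCH_C_MINUS):
    """Apply the March elements and return the sorted addresses whose reads mismatched."""
    failing = set()
    for order, ops in elements:
        addrs = range(size) if order == "up" else range(size - 1, -1, -1)
        for addr in addrs:
            for op in ops:
                value = int(op[1])
                if op[0] == "w":
                    memory.write(addr, value)
                elif memory.read(addr) != value:
                    failing.add(addr)
    return sorted(failing)

fault_free = run_march(Memory(8), 8)
faulty = run_march(Memory(8, stuck_at={3: 0}), 8)
```

Each cell sees a fixed number of operations (ten for March C-), so the test length grows linearly with memory size, which is the property the abstract says the extended algorithm preserves.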

　


Session: P2A
Day: 08/09
Time: 10:00-12:00
Chair: Prof. 薛雅馨, 國立雲林科技大學
Room: 2F Banquet Hall (宴會廳)

Analog

P2A-1　 A 5-bit 1 GSample/s Two-Stage ADC with a New Flash Folded Architecture

PDF Hung-Yu Huang, Ying-Zu Lin, and Soon-Jyh Chang, 國立成功大學　

　
A 5-bit 1 GSample/s two-stage ADC is designed and simulated in TSMC 0.18-μm CMOS technology. The new architecture combines the characteristics of flash, subranging, and folding ADCs. The analog front-end of this work is the same as that of a typical flash ADC. By replacing the folding amplifier with a current-mode multiplexer (MUX), the cyclic thermometer code, which is the digital output of a folding ADC, is obtained and the frequency multiplication effect is avoided. The slow switching of the reference voltage range is also avoided. The number of comparators is reduced to 16, compared with the typical 32. Operating at 1 GSample/s, the ENOB is 4.92 and 4.71 bits at input frequencies of 10 and 500 MHz, respectively. This ADC consumes 63 mW from a 1.8 V supply, achieving a FOM of 2.4 pJ/conversion-step at 1 GSample/s.
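The quoted figure of merit is consistent with the commonly used Walden definition, FOM = P / (2^ENOB · fs), evaluated with the ENOB at the 500 MHz input. That the authors used this definition is an assumption; the short check below, with the hypothetical helper `walden_fom`, simply reproduces the arithmetic from the numbers in the abstract.

```python
def walden_fom(power_w, enob_bits, sample_rate_hz):
    """Energy per conversion step in joules: P / (2**ENOB * fs)."""
    return power_w / (2 ** enob_bits * sample_rate_hz)

# Values taken from the abstract: 63 mW, 1 GSample/s,
# ENOB of 4.92 bits at 10 MHz input and 4.71 bits at 500 MHz input.
fom_at_10mhz = walden_fom(63e-3, 4.92, 1e9)
fom_at_500mhz = walden_fom(63e-3, 4.71, 1e9)

print(f"FOM at 10 MHz input:  {fom_at_10mhz * 1e12:.2f} pJ/conversion-step")
print(f"FOM at 500 MHz input: {fom_at_500mhz * 1e12:.2f} pJ/conversion-step")
```

Under this definition the 500 MHz case rounds to the 2.4 pJ/conversion-step stated in the abstract, while the 10 MHz case comes out lower because the ENOB is higher.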

P2A-2　 A CMOS Temperature Sensor Design for Implantable Bio-Medical Devices

PDF Ying-Hsiang Wang, Wen-Yaw Chung, Chiung-Cheng Chuang, and Chien-Hsi Kao, 中原大學　

　
This paper presents a fully integrated CMOS temperature sensing circuit for implantable bio-medical systems with low power and mixed-mode signal output. It also presents a new type of multi-level comparator whose power consumption remains fixed even when more stages are added. The circuit was verified using TSMC 0.35μm mixed-signal 2P4M polycide 3.3/5V models. The simulation results show that the proposed circuit is well suited to applications with a limited temperature range in implantable systems and consumes only 37.2μW at a 2.5V power supply.

P2A-3　 Low Dropout Voltage Regulator with Current-Limit Circuit

PDF Chien-Cheng Chen, Nan-Xiong Huang, Miin-Shyue Shiau, Hong-Chong Wu, and Don-Gey Liu, 逢甲大學　

　
This paper presents a protection circuit for the low-dropout (LDO) voltage regulator. This LDO provides high stability for load currents up to 800mA and has a circuit for limiting the output current. The die size is 1.38×0.48 mm². Moreover, this protection circuit needs only one comparator and one transistor. The comparator can be a simple two-stage CMOS operational amplifier. The transistor is used as a switch, so it does not need a large area. The proposed LDO regulator was designed using TSMC 0.35-μm CMOS technology. The main advantage of this approach is that the extra voltage can be used to limit the output current and protect the main circuit.

P2A-4　 Novel Devices Merging RITD and CMOS for Future VLSI Use

PDF Jyi-Tsong Lin, Wei-Chin Lin, and Chao-Yu Hou, 國立中山大學　

　
In this paper, we introduce a new device design that merges the MOSFET with the Resonance Intra-band Tunneling Diode (RITD). Based on this concept, we present several new designs in three categories. The first is a multi-bit memory. Such a device is equivalent to a circuit consisting of MOSFETs and RITDs. It still fits the standard MOS manufacturing process and can potentially store multiple bits in a single cell. The second is a multi-level current regulator: one MOSFET is used as the load and an RITD as the driver component, whose output can settle at two or more stable levels to switch the terminal MOSFET on and off with different current drive. Because the RITD offers low current density, high switching speed, and low power consumption, as shown in previous studies, this new device can eliminate the noise shooting and achieve faster switching even when the gate length is scaled down to 10nm. The third applies a modified MOS structure in which the RITD part is integrated. Owing to its high speed, such a logic gate may have greater potential for VLSI applications. In this article, we use ISE TCAD simulation to model the physical geometry and evaluate the electrical characteristics. We also use a MOS equivalent circuit to simulate the RITD function and use HSPICE to verify that all designs function correctly. These new devices show good results and demonstrate better device behavior.

P2A-5　 Using Output-Clamped Amplifier to Implement Time-Based Interface Circuit for Measuring Tiny Grounded Capacitance

PDF Wei-Hung Hsu and Meng-Lieh Sheu, 國立暨南國際大學　

　
A time-based interface circuit for measuring grounded on-chip capacitance is proposed. The measured capacitance is first converted to integration time by the interface circuit, and then to digital values by a counting circuit. Very compact circuit area and micro-power consumption are achieved. Linearity performance is also analyzed to conclude where and how non-ideal components affect the overall accuracy of the interface circuit. The simulation results give a good agreement with the proposed interface circuit.

P2A-6　 A Low Distortion Class-AB Power Amplifier With Active Tuning

PDF Ro-Min Weng, Chi-Wen Tsai, and Kuen-Yi Lin, 國立東華大學　

　
A class-AB power amplifier (PA) with active tuning is presented. An active inductor is added to adjust the output matching in order to obtain high signal integrity, low distortion, and high power efficiency. The active tuning PA with power control can achieve high efficiency and further decrease the third-order inter-modulation term (IM3). The maximum IM3 suppression is -33 dBc at an output power of 10 dBm, measured by the two-tone test with a 10 MHz offset at 2.4 GHz. The measurement results show a maximum PAE of 52.6% at an input power of 3 dBm. The maximum power gain of 24.38 dB is obtained at an input power of -5 dBm. The PAE is clearly improved for input powers from -9 dBm to 3 dBm.

P2A-7　 A Low Power 1V 10-bit Successive Approximation ADC

PDF Yi-Hung Chen, Wan-Tin Lin, and Hwang-Cherng Chow, 長庚大學　

　
A low-power 1V 10-bit successive approximation analog-to-digital converter (SA-ADC) implemented in the TSMC 0.18μm CMOS process is presented for biomedical applications. In the DAC capacitor arrays of this SA-ADC, a charge-recycling method is used for switching the capacitors. By splitting the MSB capacitor into binary-scaled sub-capacitors, the average switching energy can be reduced. In addition, a 1V rail-to-rail input comparator with a current-driven bulk technique and offset cancellation is proposed. The complete 1V ADC has a signal-to-noise ratio of 58.5dB and an effective number of bits of 9.4 based on post-layout simulations. The power consumption of the entire ADC is 32.6μW for normal signals and 29.5μW for ECG applications.
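The two resolution figures in this abstract can be related through the standard conversion ENOB = (SNDR − 1.76 dB) / 6.02 dB. Treating the quoted 58.5 dB as the ratio that enters this formula is an assumption, and `enob_from_sndr` is a hypothetical helper name; the check below only reproduces the arithmetic.

```python
def enob_from_sndr(sndr_db):
    """Effective number of bits for an ideal-quantizer-equivalent ADC."""
    return (sndr_db - 1.76) / 6.02

# 58.5 dB is the ratio quoted in the abstract.
enob = enob_from_sndr(58.5)
print(f"ENOB = {enob:.2f} bits")
```

The result rounds to the 9.4 bits reported from post-layout simulation.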

P2A-8　 New Low Supply-Bounce Current-Mode Shunt Regulator

PDF Che-Min Kung, Chan-Min Pan, Jiann-Jong Chen, Yuh-Shyan Hwang, and Wen-Ta Lee, 國立台北科技大學

　
Low-noise, low-power, low-dropout linear regulators are in great demand for electronic devices. In this paper, we present a new topology for a low-supply-bounce current-mode shunt regulator. To improve performance, we utilize the current-mode architecture to reduce the supply bounce and ground bounce. In the closed loop, we use the voltage V_o to supply the core circuit, so the system doesn't need an extra supply source. The proposed current-mode shunt regulator has been fabricated in the TSMC 0.35μm 2P4M CMOS process. The main elements of the proposed regulator are an error amplifier, a voltage buffer, current feedback, a pass element, and an off-chip capacitor and resistors. The experimental results show that the settling time is about 0.5μs with 0.5% error for heavy load current. The line and load regulations are 34.6μV/mA and 74.35ppm/mA, respectively. The active chip area is only 0.783×0.875mm².

P2A-9　 CMOS Bandgap Reference with Curvature Compensation on Higher Order Temperature Terms

PDF Hong-Yi Huang and Ru-Jie Wang, 國立台北大學　

　
This work presents a curvature-compensated bandgap reference without resistors in 0.18-μm CMOS technology. The circuit uses a new current generator circuit for curvature compensation of higher order temperature terms and a PMOS voltage divider for scaling down the reference voltage. A 605.6 mV output voltage is generated with a temperature coefficient of 1 ppm/℃ from –40 to 125℃. It dissipates 77μW at a supply voltage of 1.8 V.

P2A-10　 A Temperature-Compensation CMOS Subbandgap Reference with 1V Power Supply Operation

PDF Hung-Wei Chen, Jing-Yu Luo, and Wen-Cheng Yen, 國立聯合大學　

　
In this paper, a low-supply-voltage, temperature-compensated CMOS subbandgap reference is proposed and implemented. The circuit has been implemented in a standard 0.35μm TSMC CMOS process. The active area of the circuit is 762μm × 283μm. The circuit works properly with a minimum supply voltage of 1V. The power dissipation of the circuit is 74μW. Experimental results confirm that, with the minimum supply voltage of 1V, the circuit generates a reference VREF1 of 0.633V with a 26mV output voltage variation over the range of -10℃ to 60℃, and a reference VREF2 of 0.642V with a 22mV output voltage variation over the same range.

P2A-11　 6 Gb/s Digitally Phase Adjusted Clock Data Recovery for Spread Spectrum Clock

PDF Chin-Hsien Lin, Yuan-Pu Cheng, Yen-Ying Huang, and Shyh-Jye Jou, 國立交通大學　

　
This paper presents the design of a clock data recovery (CDR) circuit incorporating a feed-forward phase-adjustment algorithm. The CDR uses 3X oversampling to track incoming data and transforms the phase and frequency deviation information into a multi-phase selection signal. The phase-adjustment algorithm can be implemented digitally and provides second-order tracking that can handle frequency deviation and spread spectrum clock signals. The CDR is designed for 6 Gbps applications and is able to track a spread spectrum clock of up to 5000 ppm, as in the SATA-3 specification.

　


Session: P2E
Day: 08/09
Time: 10:00-12:00
Chair: Prof. 林榮彬, 元智大學
Room: 2F Banquet Hall (宴會廳)

EDA

P2E-1　 Don’t-Care Bits Filling for Reducing Capture Power

PDF Wang-Dauh Tseng, Lung-Jen Lee, and Chun-Kai Hsu, 元智大學　

　
In this paper, we propose a don’t-care-bit filling method to reduce test power dissipation during capture cycles. An induced activity function is exploited to obtain the optimal order for assigning the don’t-care bits in test vectors or responses so as to prevent large potential switching activity in the CUT during capture cycles. It is implemented by weighting the impact of each transition that occurs on each scan cell during capture cycles. The capture power consumption can be reduced significantly. As shown in the experimental results, the proposed method achieves a 40% reduction in capture power consumption compared with the random X-filling method and, in most cases, better results than the LCP X-filling method. The method incurs no area overhead or performance loss.

P2E-2　 Mismatch Address Index Encoding for Data Compression in Scan Test

PDF Lung-Jen Lee, Wang-Dauh Tseng, Rung-Bin Lin, and Hcc-Hang Jang, 元智大學　

　
To improve transmission efficiency between the ATE and the SOC under test, we present a new test data compression technique that reduces the amount of test data that must be stored on a tester. A simple yet efficient heuristic is introduced for sorting test data as a key stage in this method. Don’t-care bit assignments are analyzed to improve compression. A continuity mismatch property between two sorted test cubes is identified and exploited. Compression is implemented by encoding the mismatch bits in the test sequence. The decoding process is performed by a small amount of on-chip circuitry. Experimental results show that an average compression ratio of up to 82% is achieved, which is higher than that of PRL, 9C, and ARL encoding.

P2E-3　 Reduction of Power Dissipation during Scan Testing by Test Vector Ordering

PDF Wang-Dauh Tseng and Lung-Jen Lee, 元智大學　

　
Test vector ordering is recognized as a simple and non-intrusive approach to test power reduction. The simulation-based test vector ordering approach to minimizing circuit transitions requires exhaustive simulation of each test vector pair. However, the long simulation time makes this approach impractical for circuits with large test sets. In this paper, we present a calculation-based approach that orders test vectors faster to reduce test power for full-scan sequential circuits. Most calculation-based approaches are for combinational circuits, or for sequential circuits but considering only the portion of the circuit derived from the primary inputs. The proposed approach exploits the dependencies between internal circuits and the state inputs and achieves greater power reduction. Experiments performed on the ISCAS’89 benchmark circuits show that the improvement efficiency of the proposed approach reaches 91.55% and that it performs better than the existing calculation-based approaches.
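The general idea of reordering test vectors to cut switching activity can be sketched with a simple greedy heuristic that uses the Hamming distance between consecutive vectors as a crude stand-in for circuit transitions. This is not the calculation-based cost function of the paper, which also accounts for internal circuit dependencies; the function names and the sample test set below are assumptions for illustration.

```python
def hamming(a, b):
    """Number of bit positions in which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def total_transitions(vectors):
    """Sum of input bit flips between each pair of consecutive vectors."""
    return sum(hamming(a, b) for a, b in zip(vectors, vectors[1:]))

def greedy_order(vectors):
    """Nearest-neighbour ordering: always apply next the remaining vector
    closest to the one just applied. A heuristic, not an optimal ordering."""
    remaining = list(vectors)
    ordered = [remaining.pop(0)]
    while remaining:
        last = ordered[-1]
        nearest = min(range(len(remaining)),
                      key=lambda i: hamming(last, remaining[i]))
        ordered.append(remaining.pop(nearest))
    return ordered

tests = ["0000", "1111", "0001", "1110", "0011"]
reordered = greedy_order(tests)
before = total_transitions(tests)
after = total_transitions(reordered)
```

The reordered set applies the same vectors, so fault coverage is unchanged under this model, while the number of input bit flips between consecutive vectors drops.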

P2E-4　 A Simulation-based Redundancy Identification in Combinational Circuits

PDF Yi-Yuan Huang and Chun-Yao Wang, 國立清華大學　

　
Redundancy removal is an important operation in combinational logic optimization. Traditional redundancy identification algorithms are based on automatic test pattern generation algorithms. However, automatic test pattern generation algorithms spend much CPU time determining whether a fault on a wire is untestable, and thus redundant. Determining whether a wire is redundant is not easy; determining whether a wire is irredundant, however, is much easier. In this paper, we present an efficient redundancy identifier that easily filters out irredundant wires. The experimental results show that the presented method can identify all irredundant wires in most benchmark circuits.

P2E-5　 An Experimentation Suite for Education in Low-Noise Design

PDF You-wei Liang, Shinyu Chen, and Robert Rieger, 國立中山大學　

　
In the design of biomedical applications, the signal is on the order of microvolts, which is so small that noise cannot be neglected. Students know the theory of noise, but they are not familiar with noise in the real world. In this paper, a simple measurement system is proposed. The signal is amplified, fed to a DAQ card, and shown on the screen by a LabVIEW program. The screen shows not only the measured result but also a theoretical noise value that the student can type in. The resistors can be replaced easily, and the noise of amplifiers such as the 741-type IC can be measured, too.

P2E-6　 Performance Improvement using Application-Specific Instructions under Hardware Constrains

PDF Chijie Lin, Jiying Wu, Jerung Shiu, Desheng Chen, and Yiwen Wang, 逢甲大學　

　
Application-Specific Instruction-set Processors (ASIPs) have been widely used to balance the trade-off between cost and performance for a specific target application without creating a new processor. The generation and selection of application-specific instructions (ASIs) can dramatically affect the quality of an ASIP design under constraints such as the number of I/Os, hardware cost, ASI hardware latency, and total number of ASIs. In this paper, disjoint operations can be combined into an ASI to enrich the variety of selections. The operation cover ratio and a more accurate ASI latency model are used to select good ASIs so that performance can be improved. A design flow is developed to generate the ASIs automatically, and the experimental results show that a 1.64x speedup can be obtained on the sha benchmark under constraints of 5 inputs, 3 outputs, and a hardware cost of less than 8000 LEs in an Altera FPGA.

P2E-7　 Power-Aware Memory Mapping for FPGAs

PDF Tien-Yuan Hsu, Ting-Chi Wang, and Kuang-yao Lee, 國立清華大學

　
Embedded memory blocks on FPGAs allow designers to implement a variety of memory structures. With their increasing use, the power consumed by embedded memory blocks may form a significant part of the total dynamic power consumption. In this paper, we propose a power-aware memory mapping algorithm that takes resource constraints into account. The algorithm converts the memory mapping problem into a generalized network flow problem, which distributes resources to all logical memories at the same time. Our algorithm is compared with an existing power-aware memory mapping method. Experimental results show that our algorithm always efficiently generates optimal solutions, whereas the existing method does not.

P2E-8　 MFASE Multiple Functions SoCs Analysis Environment

PDF Ya-Shu Chen, Shih-Chun Chou, Chi-Sheng Shih, and Tei-Wei Kuo, 國立台灣大學

　
As more and more functions are integrated into one system, the design of embedded systems has become increasingly complicated. The Multiple Functions SoCs Analysis Environment (MFASE) is an integrated system-level design framework that includes tools for HW/SW co-design, performance optimization, HW/SW co-simulation, and performance analysis, with the aim of lowering system cost and evaluating system performance. Given a preliminary system design specification, MFASE explores the design space through proper timing analysis and provides a suitable scheduler/arbiter design. To verify the overhead of the scheduler/arbiter, MFASE also provides a framework for HW/SW simulation on a Transaction Level Model system, so the design can be verified before a hardware prototype is physically manufactured. The evaluation results of HW/SW co-simulation are analyzed to iteratively enhance the system design.

　

TECHNICAL PROGRAM

Session: P2D
Day: 08/09
Time: 10:00-12:00
Chair: Prof. 楊博惠, 國立雲林科技大學
Room: 2F Banquet Hall

Digital

P2D-1　 An Integrated Spatial-Temporal Sampling Rate Conversion Architecture by Motion Compensation for TV Display

PDF Chih-Hung Kuo, Li-Chuan Chang, Zheng-Wei Liu, and Bin-Da Liu, 國立成功大學　

　
To improve video quality in TV display, we propose a strategy that includes frame rate up-conversion (FRUC), spatial scaling, and adaptive edge sharpening. Motion compensation is applied to both frame interpolation and edge sharpening in order to improve visual quality. In conventional methods, bi-directional motion estimation is usually employed between two successive frames to interpolate a new frame. The proposed method considers not only temporal correlation but also spatial correlation to determine the covered and uncovered regions. The weighting of edge sharpness is determined by the magnitude of the motion vector (MV) and the mean absolute difference (MAD) of the current block to improve image quality. Simulation results show that our method has better PSNR and image quality.

P2D-2　 A Novel Look-up Table-Based Multiplication/Squaring Architecture for Cryptosystems over GF(2^m)

PDF Wen-Ching Lin, Jun-Hong Chen, and Ming-Der Shieh, 國立成功大學

　
This paper focuses on high-speed multiplier design over the finite field GF(2^m) for large m. We first extend the look-up table (LUT) based multiplication algorithm presented by Hasan to reduce the LUT generation time, and then show how to effectively incorporate the squaring operation into the developed multiplier. The unified multiplication/squaring module is well suited to cryptosystems such as Elliptic Curve Cryptography (ECC), in which these two types of operations are performed in a ping-pong fashion. Experimental results show that the presented sub-group, multiple look-up tables (SG-MLUT) based scheme achieves up to 29% improvement in the total computation time of multiplication compared with Hasan’s algorithm. When the unified multiplication/squaring module is employed instead of Hasan’s design in ECC applications, the scalar multiplication time improves further because no LUT generation is needed using our design, and the resulting area-time (AT) complexity is reduced by about 26%.
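
As an illustration of the arithmetic involved, the following is a minimal bit-serial sketch of multiplication and squaring in GF(2^m). It is not the LUT-based algorithm of Hasan nor the SG-MLUT scheme of this paper; it only shows the field operation those architectures accelerate. The function names and the choice of reduction polynomial are assumptions made for the example.

```python
def gf2m_mul(a: int, b: int, poly: int, m: int) -> int:
    """Multiply a and b in GF(2^m), reducing by the irreducible polynomial `poly`.

    Elements are ints whose bit i is the coefficient of x^i; both inputs must
    be below 2^m. `poly` includes the x^m term (e.g. 0x11B for
    x^8 + x^4 + x^3 + x + 1).
    """
    result = 0
    for _ in range(m):
        if b & 1:        # add (XOR) the current multiple of a
            result ^= a
        b >>= 1
        a <<= 1          # multiply a by x
        if a >> m:       # degree reached m: reduce modulo poly
            a ^= poly
    return result


def gf2m_square(a: int, poly: int, m: int) -> int:
    """Squaring is multiplication of an element by itself."""
    return gf2m_mul(a, a, poly, m)
```

For example, with the AES field polynomial 0x11B and m = 8, `gf2m_mul(0x57, 0x83, 0x11B, 8)` gives 0xC1, the worked example in FIPS 197. ECC fields use a much larger m, which is why table-based speedups matter there.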

P2D-3　 A DPA-Resistant AES Encryption Hardware Module

PDF Kuan Jen Lin, Shih Hsien Yang, and Chih Hsuan Hsu, 輔仁大學

　
Cryptographic embedded systems are vulnerable to Differential Power Analysis (DPA) attacks. In this paper, we use a logic design style called Pre-charge Masked Reed-Muller Logic (PMRML) to implement a DPA-resistant AES encryption hardware module. The PMRML design can overcome the glitch and Dissipation Timing Skew (DTS) problems, both of which significantly reduce DPA resistance. The PMRML-based AES module was implemented with TSMC 0.18μm standard cell libraries. The post-layout results show the efficiency and effectiveness of the PMRML design methodology.

P2D-4　 A Low-Hardware-Cost Logical OR Operation Log-SPA LDPC Decoder

PDF Ming-Yu Lin, Ching-Da Chan, Jung-Chieh Chen, and Po-Hui Yang, 國立雲林科技大學　

　
A low-hardware-cost Low-Density Parity-Check (LDPC) decoder is presented in this paper. Making use of the logical OR operation in the check nodes for the log sum-product algorithm (Log-SPA), we propose a new architecture for updating the check node messages. Synthesis and numerical results show that the proposed architecture achieves up to 21% total hardware reduction with fair BER performance compared with the traditional Log-SPA decoder. In addition, the proposed decoder outperforms the simplest known sign-min architecture in terms of hardware complexity and BER performance.

P2D-5　 Mixed-Vth (MVT) CMOS Circuit Design For Low Power Cell Libraries

PDF Jyun-Yi Lin, Li-Rong Wang, Chia-Lin Hu, and Shyh-Jye Jou, 國立交通大學　

　
The Mixed-Vth (MVT) technique has been proposed to resize MOS transistors and thereby reduce dynamic power in logic gates by applying a low threshold voltage to transistors in some critical paths, while a standard threshold voltage is used in non-critical paths. This paper presents 130nm 1.2V and 90nm 0.5V low power cell libraries using the MVT technique. The dynamic power consumption of the cells is reduced by about 5% to 30% with the same timing specifications.

P2D-6　 Symbol and Integer Carrier Frequency Offset Synchronization for IEEE802.16e

PDF Juan-Nan Lin, Hsiao-Yun Chen, and Shyh-Jye Jou, 國立交通大學　

　
IEEE 802.16e has been proposed as a standard for the next generation wireless communication system. Synchronization plays an important role in setting up the environment for the receiver. Because the repetition characteristic of the preamble is not apparent, a method such as a matched filter is used in orthogonal frequency division multiple access (OFDMA). In this paper, a reduced-complexity hardware architecture for correlation is proposed. By adopting the modified algorithm, a large number of multipliers are removed from the hardware implementation. The modified dissipation process reduces area cost by 28% and power by 22%.

P2D-7　 Register Processor for MMX instructions

PDF Jih-Ching Chiu, Shou-Xi Hong, and Kai-Ming Yang, 國立中山大學　

　
Multimedia processing algorithms operate on large, regular quantities of data. The key idea in multimedia extensions such as Intel’s MMX is the exploitation of sub-word parallelism in an SIMD fashion. Exploiting more data-level parallelism (DLP) is one of the most efficient ways to improve the performance of SIMD instructions. However, in an ordinary architecture all of the data must be processed by the execution stage, so increasing the degree of DLP causes a bottleneck there. In addition, the number of operands read from the register file makes higher DLP harder to achieve. These factors seriously limit DLP and affect SIMD performance. In this paper, we propose a special storage cell, called an operation cell, which differs from a data latch, and use it to construct a register file called the register processor. The operation cell has the dual capabilities of operation and storage for bit slicing. We therefore design a register file with operation cells for the MMX instruction set. All of the data for MMX operations are placed in the register file and can be handled simultaneously, as in SIMD operations. In summary, the operation cells in this special register file can operate on all of the data by themselves and do not depend on the execution stage. In media processing, this allows the bus bandwidth to be used efficiently at the best DLP for handling large quantities of data. According to simulation results, the register file with operation cells is better than Intel’s Pentium processor with MMX technology and the TI C64 in 4 groups, which are eight 64-bit registers; moreover, the higher the degree of DLP usage by the operation cells, the more the performance of MMX instructions can be improved.

P2D-8　 Performance Comparisons and Tradeoffs of Table-Based Arithmetic Function Evaluators

PDF Ping-Chung Wei, Ching-Pin Lin, and Shen-Fu Hsiao, 國立中山大學　

　
Function units that calculate elementary arithmetic functions play an important role in many applications, including scientific computing, digital signal processing, multimedia codecs, and 3D graphics animation. Thus, efficient design of arithmetic function units can significantly affect overall system performance and area cost. In this paper, we develop an automatic generator that produces hardware units computing various single-value functions using table-based methods, and compare their performance and area costs with alternative implementations. Experimental results show that the choice among these approaches depends on the tradeoffs among speed, area, and especially the accuracy constraint. In particular, the hardware function units produced by our generator have better delay and/or area cost than those obtained from the Synopsys DesignWare library.
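
To make the speed/area/accuracy tradeoff concrete, here is a minimal software sketch of one common table method: a uniformly spaced look-up table with linear interpolation between entries. It is not the generator described in the abstract. The function names, the 64-entry table size, and the choice of sin on [0, π/2] are assumptions picked for illustration.

```python
import math


def build_table(f, lo, hi, entries):
    """Sample f at entries+1 uniformly spaced points on [lo, hi]."""
    step = (hi - lo) / entries
    return [f(lo + i * step) for i in range(entries + 1)], step


def table_eval(table, step, lo, x):
    """Approximate f(x) by linear interpolation between adjacent table entries."""
    pos = (x - lo) / step
    i = min(int(pos), len(table) - 2)   # clamp so that i + 1 stays in range
    frac = pos - i
    return table[i] + frac * (table[i + 1] - table[i])


# Hypothetical configuration: sin(x) on [0, pi/2] with 64 intervals.
SIN_TABLE, SIN_STEP = build_table(math.sin, 0.0, math.pi / 2, 64)


def max_error(samples=10_000):
    """Largest absolute error of the table against math.sin on a dense grid."""
    xs = (k * (math.pi / 2) / samples for k in range(samples + 1))
    return max(abs(table_eval(SIN_TABLE, SIN_STEP, 0.0, x) - math.sin(x)) for x in xs)
```

With linear interpolation the error bound scales with the square of the step, so doubling the table size cuts the worst-case error by roughly a factor of four. That is the kind of area-versus-accuracy choice such a generator has to make.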

P2D-9　 Multiple-Input XOR/XNOR Circuit Design Using Pass-Transistor Logic and Its Application in Cryptography

PDF Ming-Yu Tsai, Chia-Sheng Wen, and Shen-Fu Hsiao, 國立中山大學　

　
The exclusive-OR (XOR) gate is one of the critical components in many applications such as cryptography. In this paper, we present an efficient multi-input XOR circuit design based on pass-transistor logic (PTL). A synthesis algorithm is developed to efficiently generate the PTL-based multi-input XOR circuits. Both pre-layout and post-layout simulation results show that our proposed multi-input XOR design outperforms the static CMOS design. The multi-input XOR circuits are also used to design the transformations in the Advanced Encryption Standard (AES).

P2D-10　 Efficient Design of Graphic Rasterization Module

PDF Chung-Hua Tsai and Yun-Nan Chang, 國立中山大學　

　
This paper presents an efficient design of a rasterization module suitable for tile-based 3D graphic rendering systems. The ordinary line drawing algorithm for scan-line boundary search and the direct in-out test approach are not efficient for scan conversion in a tile-based approach, since the shape of a triangle primitive may become irregular after tiling. Therefore, this paper transforms the general pixel in-out test function into a sign-directed scan-line boundary search method. The normal in-out test circuit for a single pixel can be modified to detect the two end-points of the scan-line simultaneously, so that hardware efficiency is greatly improved. Our experimental results show that the pixel fill-rate can be improved by about 60%. The proposed rasterization design also divides the architecture into two parts, scan-line generation and fragment generation. This division helps optimize and speed up each part individually to achieve the desired overall fill-rate goal.
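
The pixel in-out test mentioned above is commonly built from edge functions. The sketch below shows that test and a simple way to find the two end-points of a scan-line inside a tile. It uses a brute-force scan from each side of the tile, not the paper's sign-directed boundary search. All function names and the integer pixel-grid convention are assumptions.

```python
def edge(ax, ay, bx, by, px, py):
    """Edge function: its sign tells which side of edge A->B the point P lies on."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def inside(tri, px, py):
    """In-out test: P is inside when all three edge functions agree in sign."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    e0 = edge(x0, y0, x1, y1, px, py)
    e1 = edge(x1, y1, x2, y2, px, py)
    e2 = edge(x2, y2, x0, y0, px, py)
    return (e0 >= 0 and e1 >= 0 and e2 >= 0) or (e0 <= 0 and e1 <= 0 and e2 <= 0)


def scanline_span(tri, y, tile_x0, tile_x1):
    """Return (left, right) covered columns on row y within the tile, or None."""
    left = next((x for x in range(tile_x0, tile_x1 + 1) if inside(tri, x, y)), None)
    if left is None:
        return None
    # A triangle is convex, so the covered pixels on one row are contiguous.
    right = next(x for x in range(tile_x1, left - 1, -1) if inside(tri, x, y))
    return left, right
```

For the triangle (0,0), (8,0), (0,8) and an 8-pixel-wide tile, row 2 is covered from column 0 to column 6. A hardware implementation would evaluate the edge functions incrementally rather than recomputing them per pixel.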

　

TECHNICAL PROGRAM

Session: A3
Day: 08/10
Time: 10:00-11:30
Chair: Prof. 蔣元隆, 崑山科技大學
Room: 301

Signal Processing ICs

10:00 A3-1　 The Efficient VLSI Design on BI-CUBIC Interpolation for Real Time Digital Image Scaling

　 PDF 林正基(Chung-chi Lin), 國立雲林科技大學　

　　
This paper presents a VLSI design of bi-cubic interpolation for digital image processing. We propose an architecture that reduces the computational complexity of coefficient generation and the number of memory accesses. The proposed method yields a simple hardware architecture with low computation cost that is easy to implement. Using this technique, a high-speed VLSI architecture has been designed and implemented with the TSMC 0.13μm standard cell library. Simulation results demonstrate that the bi-cubic convolution interpolation architecture, running at 279MHz with 30643 gates in a 498×498μm² chip, can process digital image scaling for HDTV in real time.
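
For readers unfamiliar with the coefficients being generated, the sketch below computes the four weights of the standard cubic convolution kernel for a fractional pixel offset t. The kernel parameter a = -0.5 is a common choice and an assumption here; the abstract does not state which value or coefficient-generation scheme the design uses. Function names are illustrative.

```python
def cubic_kernel(x: float, a: float = -0.5) -> float:
    """Cubic convolution kernel; `a` is the free parameter (assumed -0.5)."""
    x = abs(x)
    if x <= 1.0:
        return (a + 2.0) * x**3 - (a + 3.0) * x**2 + 1.0
    if x < 2.0:
        return a * x**3 - 5.0 * a * x**2 + 8.0 * a * x - 4.0 * a
    return 0.0


def cubic_weights(t: float, a: float = -0.5):
    """Weights for the samples at offsets -1, 0, +1, +2, with 0 <= t < 1."""
    return [
        cubic_kernel(t + 1.0, a),
        cubic_kernel(t, a),
        cubic_kernel(1.0 - t, a),
        cubic_kernel(2.0 - t, a),
    ]


def interpolate_1d(p, t: float, a: float = -0.5) -> float:
    """Interpolate between p[1] and p[2] from four consecutive samples p[0..3]."""
    return sum(w * v for w, v in zip(cubic_weights(t, a), p))
```

Bi-cubic scaling applies this 1-D step along rows and then along columns, so each output pixel draws on a 4×4 neighborhood. That is where both the coefficient computation and the memory accesses come from.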

10:15 A3-2　 Design of 1x2 MB-OFDM UWB Receiver with Channel Shortening Technique

　 PDF Jen-Ming Wu and Hung-Wen Yang, 國立清華大學　

　　
In this paper, we present an adaptive channel equalization scheme with receiver diversity for the MB-OFDM Ultra-wide Band (MB-OFDM UWB) communication system. When the channel impulse response is longer than the guard interval, inter-symbol interference (ISI) heavily degrades system performance. Our scheme combines a time-domain impulse response shortening technique, to deal with ISI, with multiple receive antennas for extra diversity gain. We propose to over-constrain the desired length of the shortening window of interest, and the BER can be improved by at least 4.4dB over conventional shortening techniques whose shortening window equals the guard interval. Simulation results show that the proposed scheme has better bit error rate (BER) performance than traditional UWB systems, which use only frequency domain equalization. In addition, we present the proposed channel shortening technique within the context of a Single Input Multiple Output (SIMO) channel for additional receiver diversity gain.

10:30 A3-3　 A Scalable Graph-cut Engine Architecture for Real-time Vision

　 PDF Nelson Yen-Chung Chang and Tian-Sheuan Chang, 國立交通大學　

　　
This paper presents the world’s first scalable graph-cut engine architecture for real-time vision processing. This architecture implements an exhaustive search algorithm which has very high parallelism and is suitable for scalable hardware implementation. For an n-vertex two-variable graph, the estimated hardware cost of using 2^k sets of energy computation units is 2^(k-1)×(n²+n+1) equivalent adders. The corresponding latency of the proposed architecture is ⌈log₂(n²+n)-1⌉ + 2^(n-k) + k cycles, which is faster than the n⁴+6n³+11n² cycle latency of the traditional software approach when n<17. The proposed architecture may enable two-variable graph-cut methods, such as swap and expansion moves, to be applied to real-time vision applications.
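
The latency comparison can be checked numerically. The sketch below reads the flattened latency expression as ⌈log₂(n²+n)−1⌉ + 2^(n−k) + k cycles, which is an assumption about the original typesetting. It also assumes the n<17 claim refers to k = 0, that is, a single energy computation unit. Function names are illustrative.

```python
def hw_latency_cycles(n: int, k: int = 0) -> int:
    """Assumed reading: ceil(log2(n^2 + n) - 1) + 2^(n - k) + k, for n >= k."""
    m = n * n + n
    ceil_log2_m = (m - 1).bit_length()   # exact ceil(log2(m)) for integer m >= 1
    return (ceil_log2_m - 1) + 2 ** (n - k) + k


def sw_latency_cycles(n: int) -> int:
    """Software latency quoted in the abstract: n^4 + 6n^3 + 11n^2 cycles."""
    return n**4 + 6 * n**3 + 11 * n**2


def crossover(k: int = 0, n_max: int = 64):
    """Smallest n at which the hardware is no longer faster, or None if not found."""
    for n in range(max(1, k), n_max + 1):
        if hw_latency_cycles(n, k) >= sw_latency_cycles(n):
            return n
    return None
```

Under these assumptions the hardware takes 65,544 cycles at n = 16 against 92,928 for software, and it loses its advantage at n = 17, which agrees with the n<17 bound. Larger k shortens the exponential term and pushes the crossover to larger n at a higher adder cost.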

10:45 A3-4　 High-Quality Mipmapped Texture Compression

　 PDF Chih-Hao Sun and Shao-Yi Chien, 國立台灣大學　

　　
This paper presents a high-quality mipmapped texture compression (MTC) system for GPUs. Based on the wavelet transform, a hierarchical approach is adopted for mipmaps in YCbCr color space to embed three levels of mipmap in a single bitstream. Furthermore, a layer overlapping technique is proposed to reduce the memory bandwidth of MTC. MTC is integrated in a cycle-accurate GPU simulator with texture cache. Simulation results show that MTC can provide better image quality with similar memory bandwidth and a lower cache miss rate for textures. VLSI implementation results show that the hardware cost of MTC is similar to that of DXTC, and that MTC is suitable for integration in a GPU to provide high-quality textures with a low memory bandwidth requirement.

11:00 A3-5　 A Scalable Wavelet Image Coder Based on Zero-block and Array and Its Hardware Implementation

　 PDF Yuan-Long Jeang, Hung-Yu Wang, and Cyuan-Cheng Wong, 崑山科技大學　

　　
In this paper, we propose a highly Scalable Embedded image coder based on Zero-blocks and Array structures, called S-EZBA, which extends the image coder I-EZBA with higher performance and cost efficiency. S-EZBA not only achieves distortion scalability, resolution scalability, and region of interest (ROI) retrievability, but also inherits the cost savings of I-EZBA. We use a new formation of the Quality Layer to realize these properties. Compared with S-SPECK, S-EZBA omits the memory needed for counting the length of the bitstreams belonging to the codeunits. S-EZBA has been implemented in TSMC 0.18 μm technology. Experimental results of S-EZBA show excellent cost, power consumption, and PSNR (peak signal-to-noise ratio).

11:15 A3-6　 Efficient Fast Fourier Transform Processor Design for DVB-H System

　 PDF Yu-Ju Cho, Chi-Li Yu, Tzu-Hao Yu, Cheng-Zhou Zhan, and An-Yeu (Andy) Wu, 國立台灣大學　

　　
The fast Fourier transform (FFT) is the demodulation kernel in the DVB-H system. In this paper, we first propose an FFT processor that reduces power consumption by decreasing the usage of main memory and by turning off unused memory partitions in a timely manner for different FFT sizes. Second, a triple-mode conflict-free address generator is proposed to handle the address mapping of all storage in the three FFT sizes. Then two cost-efficient twiddle-factor coefficient design methods, “Sharing” and “Interpolation-then-Sharing”, are proposed to reduce the area of coefficient storage within the allowable loss of SQNR. These methods reduce the area occupied by coefficient storage by 67% at the price of a 0.6dB loss of SQNR in our design. Finally, our proposed FFT processor for the DVB-H system is implemented in TSMC 0.18μm 1P6M CMOS technology with a core size of 1.886×1.886mm². The minimum latency of an 8192-point FFT is 805μs at an 86MHz clock rate, consuming 75.51mW. For the DVB-H system, it processes the 8192-, 4096-, and 2048-point FFTs at clock rates of 79MHz, 75MHz, and 71MHz, and consumes 67.01mW, 53.16mW, and 39.45mW, respectively.

　

TECHNICAL PROGRAM

Session: B3
Day: 08/10
Time: 10:00-11:30
Chair: Prof. 吳仁銘, 國立清華大學
Room: 305

Communication ICs

10:00 B3-1　 A Partially Parallel Low-Density Parity Check Code Decoder with Reduced Memory for Long Code-Length

　 PDF Chin-Kuang Lian, Shin-Yo Lin, Tsung-Han Tsai, and Chin-Long Wey, 國立中央大學　

　　
Two partially parallel architectures have been commonly implemented for LDPC decoders: the share-memory architecture and the individual-memory architecture. This paper presents an alternative approach which significantly reduces the memory size requirement. The memory size reduction can be approximately 10% and 49% relative to the individual-memory and share-memory architectures, respectively, for an LDPC decoder with a code length of 1536 and a code rate of 1/2. The proposed LDPC decoder achieves a data rate of up to 79 Mbps at a clock frequency of 500 MHz.

10:15 B3-2　 Architecture of Adaptive Channel Equalizer in Dedicated Short Range Communication (DSRC) and Vehicle Infotainment Systems

　 PDF Yong-Hua Cheng, Yi-Hung Lu, and Chia-Ling Liu, 工業技術研究院　

　　
Dedicated Short Range Communication (DSRC) is the key component in Intelligent Transportation Systems (ITS). The goal of this technology is to help drivers and passengers get multimedia services via wireless communication equipment while the vehicle is moving, so as to improve traffic safety and enhance transportation efficiency. This paper focuses on DSRC baseband technical analysis and application development. Based on the WLAN 802.11a architecture, we develop a new adaptive channel estimation algorithm and architecture that uses decoder and decision feedback to enhance wireless access ability in vehicular environments. Finally, we integrate mobile communication with the vehicle multimedia network.

10:30 B3-3　 An Ultra-low Power Multi-mode LDPC Decoder Chip for Mobile WiMAX System

　 PDF Xin-Yu Shih, Cheng-Zhou Zhan, Cheng-Hung Lin, and An-Yeu (Andy) Wu, 國立台灣大學　

　　
This paper presents an ultra-low power multi-mode decoder design for Quasi-Cyclic LDPC codes for the Mobile WiMAX system. With the proposed overlapped decoding mechanism, the decoding latency is reduced to 68.75% of that of the non-overlapped method, and the hardware utilization ratio is enhanced from 50% to 75%. The new early termination strategy dynamically adjusts the iteration number for communication channels with different SNR values. In addition, we propose an Efficient Checkerboard Layout Scheme (ECLS) to reduce routing complexity in chip implementation. The multi-mode LDPC decoder design is implemented and fabricated in TSMC 0.13μm CMOS technology. The core size is 4.45mm² and the die area is only 8.29mm². The maximum measured operating frequency is 83.3MHz, with a power consumption of only 52mW.

10:45 B3-4　 Baseband OFDM Receiver Design for Fixed WiMAX Communication

　 PDF Chi-chie Chang and Jen-Ming Wu, 國立清華大學　

　　
In this paper, we present the implementation of an inner receiver for the IEEE 802.16-2004 Wireless Metropolitan Area Network (a.k.a. WiMAX), fabricated with TSMC 0.18μm technology. The chip consists of low power packet detection, low complexity carrier frequency compensation, a recursive FFT, and channel compensation. In the packet detection and carrier frequency compensation, we use the sign-bit method and propose a mapping function to achieve a low power design. In the FFT design, we use a radix-8 recursive FFT to achieve a small area. The total power consumption is about 114mW.

11:00 B3-5　 A Multi-Code Rate IEEE 802.16e LDPC Decoder Design

　 PDF Chih-Hao Hsiao and Yun-Nan Chang, 國立中山大學　

　　
This paper presents a VLSI design of a Low-Density Parity-Check (LDPC) code decoder for the IEEE 802.16e standard. In order to support all the code rates defined in the standard, we propose a programmable block-based edge-serial iterative architecture which performs the sequential check-node computation according to internal sequence update commands. Any complex and irregular parity-check matrix can be realized in the proposed architecture, provided the number of bit nodes connected to each check node does not exceed a certain bound. To achieve a fast clock speed, the proposed LDPC decoder is deeply pipelined, which, however, may prolong the execution cycles of each iteration due to the internal pipeline latency. The latency overhead can be reduced by scheduling the check-node update order so that the operations of different iterations overlap. The proposed architecture has been realized in 0.18μm technology with a total gate count of 900k. Our experimental results show that the proposed LDPC decoder can run at up to 235 MHz and deliver an average throughput of 135 Mbps. Furthermore, to save iterations, an early termination scheme based on a parity-check detection circuit is included in our architecture. It saves an average of 20% of the cycles for signal-to-noise ratios (SNR) between 1 and 3 dB.
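
Early termination based on parity-check detection can be sketched in software as follows: after each iteration the hard decisions are tested against every row of the parity-check matrix, and decoding stops as soon as all checks pass. The sketch is generic rather than the circuit of this paper. The `update` callback standing in for one decoding iteration, the sign convention (negative LLR means bit 1), and all names are assumptions.

```python
def parity_checks_satisfied(H, bits):
    """True when every row of H has even parity over the hard-decision bits."""
    return all(sum(h & b for h, b in zip(row, bits)) % 2 == 0 for row in H)


def decode_with_early_termination(H, llrs, update, max_iters=20):
    """Iterate `update` until all parity checks pass or max_iters is reached.

    `update` is a caller-supplied (hypothetical) function that performs one
    decoding iteration on the list of LLRs and returns the new list.
    Returns (hard_bits, iterations_used).
    """
    for iteration in range(max_iters + 1):
        bits = [1 if v < 0 else 0 for v in llrs]   # assumed sign convention
        if parity_checks_satisfied(H, bits) or iteration == max_iters:
            return bits, iteration
        llrs = update(llrs)
```

At high SNR most frames satisfy all checks after only a few iterations, so the remaining iterations can be skipped. The abstract reports this as an average saving of 20% of the cycles.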

11:15 B3-6　 Configurable Hierarchical Decoder Architectures for H-QC LDPC Codes

　 PDF Kuo-hsing Juan, Mong-kai Ku, and Yu-min Chang, 國立台灣大學　

　　
In this paper, a low-density parity-check (LDPC) decoder architecture using a fast-converging layered decoding algorithm is presented. This hierarchical architecture is highly scalable and configurable. Two-level hierarchical quasi-cyclic LDPC codes are used to provide good coding gain and a low error floor at long codeword lengths. We also develop a novel compensation method, the mixed-mode min-sum algorithm, which provides better BER performance and needs fewer iterations than scaling min-sum. Several designs are implemented on an Altera Stratix 2 EP2S130 FPGA. The LDPC decoder implementation with 2 first-level decoding blocks and 32 second-level decoding units can achieve close to 1 Gbps information throughput.

　

TECHNICAL PROGRAM

Session: C3
Day: 08/10
Time: 10:00-11:30
Chair: Prof. 陳科宏, 國立交通大學
Room: 315

Sensors and Power Electronics

10:00 C3-1　 A Dual Phase Charge Pump with Compact Size

　 PDF Po-Chin Fan and Ke-Horng Chen, 國立交通大學　

　　
In this paper, a regulated dual phase charge pump with compact size is presented. The charge pump uses the dual phase technique to reduce the output ripple and proposes a new power stage to define the stability of the overall system. It provides a 5V output voltage and a maximum load current of 50mA with constant frequency regulation. The design is based on TSMC 0.35μm 3.3V/5V CMOS technology.

10:15 C3-2　 A Dual-Mode Step-Up DC/DC Converter with Current-Limiting Technology

　 PDF Chun-Ting Kuo, Wan-Rone Liou, and Ping-Hsing Chen, 國立台灣海洋大學　

　　
This paper presents a novel dual-mode step-up DC/DC converter that switches between pulse-width modulation (PWM) and pulse-frequency modulation (PFM). The converter operates in PFM mode at light load, which improves light-load efficiency, and in PWM mode at heavy load. The maximum conversion efficiency reaches 96%, and the conversion efficiency is greatly improved when the load current is below 100 mA. Additionally, a novel soft-start circuit is proposed to avoid large switching current at start-up, and a novel current-limiting circuit limits the switching current to below 400 mA.

10:30 C3-3　 A SAR-Based Smart Temperature Sensor with Binary-Weighted Search Algorithm

　 PDF Chun-Chi Chen, Poki Chen, and Kai-Ming Wang, 國立台灣科技大學　

　　
A time-domain smart temperature sensor based on a successive approximation register (SAR) with a binary-weighted search algorithm is proposed in this paper. Without any bipolar transistor, a temperature sensor composed of a temperature-dependent delay line is utilized to generate a delay time proportional to the measured temperature. A timing reference delay line with a binary-weighted scheme is adopted for set-point programming, and SAR control logic selects the optimal delay time for digital output coding. The proposed 10-bit smart temperature sensor has a chip area of 0.6 mm² in the TSMC 0.35-μm digital process and a measurement error of ±0.3℃ over a test range of 0℃~90℃.
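
The binary-weighted search performed by SAR control logic can be sketched as below: one bit is resolved per step, starting from the most significant bit, so a 10-bit code needs ten comparisons. The comparator callback and its name are assumptions standing in for the comparison between the temperature-dependent delay and the binary-weighted reference delay. This is the generic successive-approximation procedure, not the specific circuit of the paper.

```python
def sar_convert(is_input_at_least, n_bits: int = 10) -> int:
    """Successive approximation: resolve one bit per step, MSB first.

    `is_input_at_least(code)` is a hypothetical comparator callback that
    returns True when the measured quantity is >= the reference selected
    by `code` (here, the binary-weighted reference delay).
    """
    code = 0
    for bit in range(n_bits - 1, -1, -1):
        trial = code | (1 << bit)       # tentatively set this bit
        if is_input_at_least(trial):    # keep it if the reference is not too large
            code = trial
    return code
```

For instance, `sar_convert(lambda c: 617 >= c, 10)` converges to 617 after exactly ten comparisons, whereas a linear sweep could need up to 1023.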

10:45 C3-4　 A New Self-Oscillating CMOS DC-DC Converter with Adaptive Mode-Switching Mechanism

　 PDF Sau-Mou Wu, Chung-Lin Wu, and Chia-Hsien Chang, 元智大學　

　　
This paper presents a new adaptive mode-switching mechanism for a synchronous, self-oscillating, fully integrated CMOS DC-DC converter. The proposed mechanism employs a current sensing technique to switch automatically between CCM and DCM according to the level of the load current, thus maintaining high conversion efficiency even when the load current changes during normal operation. Moreover, the load current level at which the mode switches is programmable to suit the application. The efficiency of the resulting DC-DC converter is up to 92%, the maximum peak-to-peak output voltage ripple is 18mV, and the output current ranges between 50mA and 500mA. The converter operates at a switching frequency from 250 to 3M Hz from a supply voltage ranging from 2.4 to 4.2 V. The new DC-DC converter was fabricated in the TSMC 0.35-μm 2P4M CMOS process with a die size of 0.85 mm². Except for the external inductor and capacitor, all devices, including the power switches, are on chip.

11:00 C3-5　 A Novel Log-Lin-Log Response CMOS Image Sensor with High Swing and Wide Dynamic Range

　 PDF Sau-Mou Wu and Ming-Wei Chen, 元智大學　

　　
A new CMOS image sensor with log-lin-log response is presented. The pixel cell has a logarithmic response at very low illumination intensity, a linear response at low and medium illumination intensity, and a logarithmic response at high illumination. In this scheme, the sensor is highly sensitive to very low light while retaining a high voltage swing of 0.53V (from a 1.8V supply) and a high dynamic range of 120dB. Furthermore, the CDS technique can be applied to the proposed sensor array to reduce fixed pattern noise. For demonstration, a prototype 75×54 image sensor array with readout circuit and CDS is designed for a 1.8V supply and realized in the TSMC 0.35μm CMOS 2P4M standard process.

11:15 C3-6　 A Novel CMOS Smart Temperature Sensor for On-Line Thermal Monitoring

　 PDF Wei-Cheng Lee, Hung-Chih Lin, and Tsin-Yuan Chang, 國立清華大學　

　　
A CMOS smart temperature sensor without a conventional ADC or bandgap reference is proposed for thermal management of VLSI systems. The accuracy is within ±0.8℃ over the temperature range of 0℃ to 125℃ after two-point calibration. The sensor consists of a ΔV_GS generator that utilizes the temperature characteristics of CMOS transistors, a voltage-to-time converter, and a time-to-digital converter that provides the digital output. A small die area of 0.05mm², an extremely low power consumption of 120μW, and a high conversion rate of 5K conversions/s make this temperature sensor very suitable for VLSI integration. The sensor features a finest resolution of 0.025℃/LSB with a 100MHz external reference clock.
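
Two-point calibration can be illustrated with a short sketch: the digital code is measured at two known temperatures, and a straight line through those points maps any later code to a temperature. This assumes the sensor response is linear over the range. The function name and the example codes are hypothetical and are not taken from the paper.

```python
def two_point_calibration(code_lo, temp_lo, code_hi, temp_hi):
    """Return a code-to-temperature function fitted through two reference points.

    Assumes a linear code-versus-temperature response between the points.
    """
    gain = (temp_hi - temp_lo) / (code_hi - code_lo)   # degrees C per LSB
    offset = temp_lo - gain * code_lo

    def to_temperature(code):
        return gain * code + offset

    return to_temperature


# Hypothetical reference codes, chosen so the slope is 0.025 degrees C per LSB.
to_temp = two_point_calibration(1000, 0.0, 6000, 125.0)
```

With these illustrative numbers, `to_temp(3500)` evaluates to about 62.5℃. Any residual nonlinearity of the real sensor shows up as the error that remains after calibration.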

　

TECHNICAL PROGRAM

Session: D3
Day: 08/10
Time: 10:00-11:30
Chair: Prof. 姚嘉瑜, 國立台灣科技大學
Room: 308

Timing and Clock Generators

10:00	D3-1	Stability Analysis of Fourth-Order Charge-Pump PLLs using Linearized Discrete-Time Models
	PDF	Chia-Yu Yao, Chun-Te Hsu, and Chih-Chun Hsieh, 國立台灣科技大學
		In this paper, we derive state equations for linearized discrete-time models of fourth-order charge-pump phase-locked loops. We solve the differential equations of the loop filter by using the initial conditions and the boundary conditions in a period. The solved equations are linearized and rearranged as discrete-time state equations for checking stability conditions. Behavioral simulations are performed to verify the proposed method. By examining the stability of loops under different conditions, we also propose an expression relating the lower bound of the reference frequency, the open-loop unity-gain bandwidth, and the phase margin.
10:15	D3-2	A Low Jitter 2.5-GHz Self-Calibration PLL
	PDF	鄭國興、蔡玉章、洪凱尉, 國立中央大學
		A 2.5-GHz 8-phase phase-locked loop (PLL) is proposed for 10Gbps system on chip (SoC) applications. The proposed self-calibration method adjusts the multi-band voltage-controlled oscillator (VCO) to compensate for PVT variations. The small K_VCO reduces the effect of power/ground (P/G) and substrate noise. The PLL is implemented in 0.13μm CMOS technology. The PLL output jitter is 18.55ps (p-p) where the reference clock jitter is 20ps (p-p). The total power dissipation is 21mW at 2.5 GHz and the core area is 0.08mm².
10:30	D3-3	A CMOS-MEMS Frequency Adaptive Resonator with Multiple Electrostatic Electrodes Driving.
	PDF	J. C. Chiou and L. J. Shieh, 國立交通大學
		In this paper, a prestress vertical comb drive resonator with frequency tuning capability is developed. The resonator consists of three sets of comb fingers which act as driving electrodes. The comb fingers are fabricated along with the composite beam. One end of the composite beam is clamped to the anchor, whereas the other end is elevated vertically by the residual stress. Actuation occurs when the electrostatic force, induced by the fringe effect, pulls the composite beam downward toward the substrate. By applying the driving voltage to different electrodes, the resonator exhibits different frequency responses. The device is fabricated through a standard 0.35μm 2P4M CMOS-MEMS process. Preliminary measurement results indicate that the initial resonant frequency of the device is 18.6 kHz, and a maximum frequency tuning range of up to 28.5% is obtained.
10:45	D3-4	An Efficient BMCS Approach to Accurately Predict Process Variation Effects of PLL Circuits
	PDF	Chin-Cheng Kuo, Meng-Jung Lee, I-Ching Tsai, Chien-Nan Jimmy Liu, and Ching-Ji Huang, 國立中央大學
		Hierarchical statistical analysis is often used with a regression-based approach to speed up the extremely expensive HSPICE Monte Carlo analysis. However, accurately fitting the regression equations requires many simulation samples. In this paper, a Behavioral Monte Carlo Simulation (BMCS) approach to analyze PLL designs under process variation is proposed, based on a bottom-up behavioral modeling approach with an efficient extraction process. Using the accurate model, we also propose a modified sensitivity analysis for process variation effects that provides sufficiently accurate results at a lower regression cost. In the experimental results, we reduce the simulation time for HSPICE MC analysis from several weeks to several hours while retaining similar statistical results.
11:00	D3-5	A Low Power Wide Range Duty Cycle Corrector Based on Pulse Shrinking/Stretching Mechanism
	PDF	Poki Chen, Shi-Wei Chen, and Juan-Shan Lai, 國立台灣科技大學
		A duty cycle correction circuit based on a pulse shrinking/stretching mechanism is presented. The proposed DCC has been fabricated in a TSMC 0.35μm standard CMOS process. An input duty cycle range of 30%~70% is achieved. The duty cycle error is between -1.0% and +1.0% over an operating frequency range of 3MHz~660MHz, the widest reported to date, which makes the circuit well suited for ultra-wideband applications. The chip area is merely 0.3×0.2 mm² and the power consumption is 1.1mW at 550 MHz.
11:15	D3-6	A Wide-Range Synchronous 50% Duty-Cycle Clock Generator
	PDF	Wei-Hao Chiu and Tsung-Hsien Lin, 國立台灣大學
		A CMOS synchronous 50% duty-cycle clock generator is presented in this paper. The proposed circuit comprises a clock-generation module and a phase error integration module. The clock-generation module senses the edges of an input signal to produce an output whose duty cycle is controlled to 50% by the phase error integration loop. The duty cycle control signal is generated by sensing the phase error between the input and the output. This work further proposes a calibration scheme to enhance the accuracy of the phase error integrator; hence, the residual phase error attributed to various non-idealities can be greatly reduced. The circuit is also capable of operating over a wide frequency range by implementing a cyclic delay topology. The proposed circuit is designed in the TSMC 0.18-μm CMOS process and operates from a 1.8-V supply voltage. The operating frequency ranges from 1 MHz to 1 GHz, and the circuit can accommodate a wide range of input duty cycles, from 6% to 94%, at 1 GHz. The duty-cycle error of the output signal is less than 0.5%, and the circuit draws 12 mA at 1 GHz.

Session: E3
Day: 08/10
Time: 10:00-11:30
Chair: Prof. 陳春僥 (國立高雄大學)
Room: 318

SoC Design Methodology

10:00 E3-1　 Throughput-Aware Floorplanning by Considering Multiple Critical Cycles

　 PDF Li-Ya Wang and Juinn-Dar Huang, 國立交通大學　

　　
Wire delay is gradually dominating the clock rate of a system and becoming an important issue for system design. However, it is hard to estimate the wire delay precisely in early design stages, before floorplanning is actually done. In this work, we show how the latency induced by wire delay dominates system performance and re-evaluate several floorplanning strategies that were previously considered to provide the same quality of result (QoR). We then propose a new throughput-aware floorplanning strategy that considers a set of the most critical performance loops simultaneously. The experimental results show that in some cases our approach can even double the system performance compared with the previous method.

10:15 E3-2　 SIMD Code Generation for Multimedia

　 PDF Cheng-Cho Jean, Guang-Huei Lin, Sao-Jie Chen, and Alan P. Su, 國立台灣大學　

　　
Multimedia extensions are ubiquitous in today's general-purpose processors. This has prompted the need to generate efficient simdized code from which SIMD architectures can benefit. This paper sets out to investigate compiler techniques that target short vector instructions effectively and automatically. The most common aspects of compilation are the effective management of memory alignment and the handling of mixed data lengths. Based on a code study of various multimedia workloads, we identify several new challenges that arise in simdizing for multimedia extensions and provide some solutions to these challenges. We then present a framework that addresses several of the simdization issues mentioned above.

10:30 E3-3　 H.264 Decoder Optimization – VLIW DSP Platform

　 PDF Pou-Hang Ian, Jia-Ming Chen, Hsin-Wen Wei, Jian-Liang Luo, and Wei-Kuan Shih, 國立清華大學

　　
This paper presents several optimization techniques for implementing an H.264/AVC decoder on a dual-core VLIW PAC DSP platform. The evaluation results show that video with D1 resolution can be decoded in real time.

10:45 E3-4　 H.264/AVC Baseline Profile Decoder Optimization on PAC DSP

　 PDF Chiu-Ling Chen, Jia-Ming Chen, Jian-Liang Luo, Tien-Wei Hsieh, and Wei-Kuan Shih, 國立清華大學　

　　
Optimization techniques for the major procedures of the H.264/AVC decoder on the PAC DSP are given in this paper, providing valuable experience for similar implementations.

11:00 E3-5　 SIMD Optimizations for PAC VLIW DSP Processors with Sub-word Instructions

　 PDF Ci-Bang Kuan and Jenq Kuen Lee, 國立清華大學　

　　
The rapid growth and evolution of multimedia applications have put considerable pressure on modern processors to deliver further performance enhancements within limited cost and power budgets. To meet the computing requirements, DSP processors are commonly equipped with sub-word instructions, a form of SIMD instructions, to boost performance for computation-intensive applications. Unfortunately, until now sub-word instructions have been accessible only through library routines, intrinsic functions, and in-line assembly, none of which apply to general C programs. This hinders the use of sub-word instructions in the deployment of software applications. In this paper, we present a flow that enables auto-vectorization in C compilers by utilizing sub-word instructions. The vectorizing compiler identifies data-level parallelism implicit in C programs and automatically generates assembly with sub-word instructions whenever possible. The target architecture in our experiment is based on PAC VLIW DSP processors. The performance of the vectorized programs is evaluated using a set of DSP loop kernels that are typical and representative of digital signal processing. The preliminary results reveal that our vectorizing compiler generates efficient code. The speedup ranges from 1.3 to 2.1 compared with code generated without our proposed optimizations.
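
As an illustration of what a sub-word instruction does, the following minimal Python sketch emulates a packed addition of two 16-bit lanes held in one 32-bit word. It is not the PAC DSP instruction set; the names `add16x2` and `pack` and the 2×16-bit lane layout are assumptions chosen only for this example.

```python
LANE_MSB = 0x80008000  # top bit of each 16-bit lane
LANE_LOW = 0x7FFF7FFF  # lower 15 bits of each 16-bit lane


def pack(hi: int, lo: int) -> int:
    """Pack two 16-bit values into one 32-bit word (hypothetical helper)."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def add16x2(a: int, b: int) -> int:
    """Add two 32-bit words as two independent 16-bit lanes with wrap-around."""
    # Adding only the low 15 bits of each lane cannot carry into the next lane.
    low_sum = (a & LANE_LOW) + (b & LANE_LOW)
    # Restore the top bit of each lane: a15 XOR b15 XOR carry-in.
    return (low_sum ^ ((a ^ b) & LANE_MSB)) & 0xFFFFFFFF


# The low lane wraps (0xFFFF + 1 = 0) without disturbing the high lane (1 + 2 = 3).
result = add16x2(pack(1, 0xFFFF), pack(2, 1))
print(hex(result))
```

A vectorizing compiler of the kind described would emit one such packed operation in place of two scalar additions when it can prove the lanes are independent.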

11:15 E3-6　 Standard Cell Like Via-Configurable Logic Block Design for Structured ASICs

　 PDF Mei-Chen Li, Chien-chung Lai, Hui-Hsiang Tung, and Rung-Bin Lin, 元智大學　

　　
A structured ASIC consisting of pre-fabricated yet via-configurable logic blocks (VCLBs) and a regular fabric can achieve timing performance comparable to that of an ASIC while using much less power and area than an FPGA. To reduce the tool development cost for structured ASICs, we propose in this paper a standard-cell-like VCLB so that existing tools can be leveraged to perform chip designs using our VCLBs. We create a standard cell library based on our VCLB and establish a design flow based on some commercial tools and our own tools. Experimental data show that our approach achieves a delay of 1.93 times that attained by designs using a commercial cell library, at the expense of a 320% increase in chip area. The product of delay and area achieved by our approach is on average 44% better than that achieved by some previous work.

　

Session: P3A
Day: 08/10
Time: 10:00-12:00
Chair: Prof. 李順裕 (國立中正大學)
Room: 2F Banquet Hall

Analog

P3A-1　 Voltage-Mode First Order All-Pass Filter using DDCC

PDF Wei-Yuan Chiu, Jiun-Wei Horng, and Chuan-Hsien Chang, 中原大學　

　
In this paper, a new voltage-mode first-order all-pass filter using a minimum number of active and passive components is presented. The proposed circuit employs only one differential difference current conveyor (DDCC), one grounded capacitor, and one resistor, and offers the following advantages: the use of only a grounded capacitor, which is attractive for integrated circuit implementation; low active and passive sensitivities; and no requirement for component matching conditions. PSPICE simulation results that verify the theoretical analyses are included.

P3A-2　 Analog Circuits Fault Diagnosis under Parameter Variations Based on Fuzzy Logic System

PDF 林宗志、陳盈州、郭明仁, 逢甲大學　

　
In this paper, an efficient and fast methodology for further fault diagnosis of linear analog circuits, based on a fuzzy logic system (FLS), is proposed. After fault diagnosis and location using a radial basis function neural network, fuzzy rule bases are constructed to characterize the behavior of the circuit under test in both faulty and fault-free situations and to evaluate the faulty element values. The results of our simulation confirm the validity and performance of the proposed fault diagnosis technique.

P3A-3　 A CMOS Low-Noise Amplifier with Shunt-Peaking for 3-5GHz Ultra-Wideband Wireless System

PDF Zhe-Yang Huang, Che-Cheng Huang, and Chung-Chih Hung, 國立交通大學　

　
This paper presents a low-noise amplifier (LNA) with a shunt-peaking load for MB-OFDM Group-A and DS-UWB low-band 3-5GHz ultra-wideband wireless radio systems. The LNA is designed and implemented in the TSMC 0.18μm RF CMOS process. Measurement results show a maximum power gain of 18.5dB, input and output matching lower than -6.6dB and -6.0dB, respectively, and a minimum NF of 2.9dB, while the power consumption is 18.7mW from a 1.8V power supply.

P3A-4　 Analytical Synthesis of Low-Sensitivity Voltage-Mode Odd-Nth-Order OTA-C Elliptic Filter Structure with the Minimum Number of Components

PDF Chun-Ming Chang, 中原大學

　
Although a current-mode odd-nth-order operational transconductance amplifier and capacitor (OTA-C) elliptic filter structure with the minimum number of active and passive components was presented recently, its voltage-mode counterpart has not yet been reported. Using a new analytical synthesis method, namely an innovative algebraic decomposition of a complex nth-order transfer function into n simple and feasible equations, this paper proposes a voltage-mode odd-nth-order OTA-C elliptic filter structure with the minimum number of components. HSPICE simulation with a 0.35μm process for a voltage-mode third-order OTA-C elliptic low-pass filter, employing only four OTAs and three grounded capacitors, validates not only the precise filtering parameters but also the low sensitivity and low power consumption.

P3A-5　 A 14-Bit Fourth-Order Sigma-Delta Modulator with Feedforward Architecture for Hearing Aid

PDF Shuenn-Yuh Lee, Jia-Hua Hong, Chi-Ching Lin, Chui-Kum Chiu, and Sheng-Jing Ku, 國立中正大學　

　
A fourth-order sigma-delta modulator (SDM) with feedforward (FF) structure is implemented for hearing aid. In this paper, the non-ideal circuit models are built for systematic analysis and the required circuit specifications can be produced by the behavioral simulation with the non-ideal circuit models. The circuit implementation based on the required circuit specifications is employed to design a fourth-order FF SDM with over-sampling ratio (OSR) of 64 and bandwidth of 10kHz using a 0.35μm TSMC CMOS process. Measurement results reveal that the SDM operating from a 3.3-V supply voltage can achieve dynamic range of 90 dB and spurious-free dynamic range (SFDR) of 87 dB with signal bandwidth of 10kHz at sampling frequency of 1.28 MHz, and they are in agreement with the behavior analysis.
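
The figures quoted in this abstract can be cross-checked with the usual oversampling relation fs = 2 × OSR × BW. The short Python sketch below is only an editorial arithmetic check; the function name is hypothetical.

```python
def sampling_frequency_hz(osr: int, bandwidth_hz: float) -> float:
    """Sampling frequency of an oversampled converter: fs = 2 * OSR * BW."""
    return 2 * osr * bandwidth_hz


# OSR = 64 and a 10 kHz signal bandwidth, as stated in the abstract.
fs = sampling_frequency_hz(64, 10e3)
print(f"{fs / 1e6:.2f} MHz")  # prints "1.28 MHz"
```

The result agrees with the 1.28 MHz sampling frequency reported in the measurements.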

P3A-6　 A UWB CMOS Power Amplifier With Differential to Single-Ended Converter

PDF Shuenn-Yuh Lee and Guan-Da Lu, 國立中正大學　

　
A UWB PA with a differential-to-single-ended converter (DSC) has been implemented and fabricated using the TSMC CMOS RF 0.18μm 1P6M process. Both the cascode structure and a two-stage amplifier are adopted to increase the bandwidth, gain, and gain flatness. Die-on-PCB measurements show that this PA provides an average power gain of 10dB and a P_1dB above 0dBm in the frequency range from 3.1GHz to 7GHz. Moreover, the PAE is 11% at 4GHz with a power consumption of 60mW.

P3A-7　 An 8-Bit 150-MS/s Fully Differential Dual-Channel Time-Interleaved Pipeline A/D Converter

PDF Chih-Hsiang Chang and Ching-Yuan Yang, 國立中興大學　

　
In this paper, a dual-channel, time-interleaved pipeline analog-to-digital converter (ADC) is presented. The ADC achieves a conversion rate of 150 MS/s with 8-bit resolution. Fabricated in 0.35-μm CMOS technology, the chip measures 1.8mm×1.8mm. It dissipates 212 mW from a 3.3-V supply.

P3A-8　 A Wide-Band Low-Power Quadrature VCO

PDF Ching-Yi Chen, 國立中正大學　

　
A wide-band low-power quadrature voltage-controlled oscillator (QVCO) using the TSMC 0.18μm CMOS process is proposed in this paper. The QVCO adopts a cross-coupled structure and uses a current-reuse technique. The architecture not only offers larger and more symmetrical amplitude but also reduces the power spur. Based on our measurements, the phase noise at a 1-MHz offset from the 4.9GHz carrier is -108 dBc/Hz, and the proposed QVCO has a tuning range from 3.6 to 4.9GHz. Moreover, the phase error and power imbalance are less than 5° and 1.5 dB, respectively, and the power consumption is 8mW at a 2-V supply voltage.

P3A-9　 Low Power Sigma Delta Modulator with Dynamic Biasing for Audio Applications

PDF Hsin-Liang Chen, Yi-Sheng Lee, and Jen-Shiun Chiang, 淡江大學　

　
In this paper, a low-power sigma-delta modulator with a dynamic biasing technique is presented. Based on an analysis of the operation of the switched-capacitor integrator, the folded-cascode operational amplifier can be designed with optimized biasing currents in three different phases to reduce power dissipation. The total power saving is 20% relative to the conventional design. A prototype fourth-order single-bit MASH 2-2 sigma-delta modulator is designed with the dynamic biasing technique to achieve a dynamic range of 95dB and a peak signal-to-noise-and-distortion ratio of 93dB. The experimental circuit is designed in 0.35μm 2P4M CMOS technology. The chip area is 3.11mm², and the power dissipation is only 5mW from a supply voltage of 3V.

P3A-10　 A New Current-Mode Wheatstone Bridge Based on Fully Differential Operational Transresistance Amplifiers

PDF Yuh-Shyan Hwang, Chun-Chi Shih, Jiann-Jong Chen, and Wen-Ta Lee, 國立台北科技大學　

　
A new current-mode Wheatstone bridge (CMWB) that uses a fully differential operational transresistance amplifier (FDOTRA) is presented in this paper. The proposed CMWB has been analyzed, simulated, and implemented. The advantages of the proposed CMWB are twofold. Firstly, it reduces the number of sensing passive and active elements. Secondly, the proposed CMWB circuit offers a significant improvement in accuracy compared to other CMWBs. Simulation results that confirm the theoretical analysis are obtained. The proposed circuit has been designed with TSMC 0.35μm DPQM CMOS processes.

P3A-11　 An Embedded 10-bit 200MHz DAC IP with Self-Calibrating Current Bias for SoC Applications

PDF Chung-Ming Pan and Chien-Hung Tsai, 國立成功大學　

　
In this paper, a 10-bit 200MHz DAC with a self-calibrating current bias is designed. With this architecture, the integrated device inaccuracy of internal load resistors can be improved and gain error of the output voltage swing can be reduced, making it suitable for embedded CMOS SoC applications. Dual-unary cell segments, which comprise a 6 MSB segment and a 4 LSB segment, are used to reduce the static nonlinearity. Several useful circuit techniques and a source degenerated current switch are adopted to enhance the performance. The prototype is implemented with a standard 0.35μm 2-poly 4-metal CMOS technology, occupying 0.99 mm² of die area. The simulation results show that DNL and INL are less than 0.25 LSB and 0.3 LSB, respectively. SNDR of 58 dB and SFDR of 63 dB are achieved for an input signal of 10MHz at 200MHz clock frequency with power dissipation of 70 mW.

　

Session: P3E
Day: 08/10
Time: 10:00-12:00
Chair: Prof. 蘇慶龍 (國立雲林科技大學)
Room: 2F Banquet Hall

EDA

P3E-1　 Simultaneous Module Selection and Clock Skew Scheduling for Minimizing Standby Leakage Current

PDF Shih-Hsu Huang, Da-Chen Tzeng, and Chun-Hua Cheng, 中原大學　

　
In event-driven applications, the standby leakage current accounts for a large fraction of total power dissipation. The power gating technique is one of the most effective ways to reduce the standby leakage current. However, when the power gating technique is applied, there is a delay-power tradeoff, which can be characterized by the sizes of the sleep transistors. As a result, for each functional unit, the largest allowable delay (due to the timing constraints) limits the smallest leakage current that the power gating technique can achieve. In this paper, we point out that, under the same clock period constraint, different clock skew schedules result in different standby leakage currents (due to different timing constraints). Based on that observation, we present an MILP (mixed integer linear programming) approach to formally formulate the problem of simultaneous module selection (i.e., power gating implementation selection) and clock skew scheduling. Experimental data show that, compared with the existing design flow, our approach reduces the standby leakage current by 29%.

P3E-2　 Totally Self-Checking Borden Code Checker Design Using Modulo Adders

PDF Wen-Feng Chang, Debaleena Das, and Cheng-Wen Wu, 萬能科技大學　

　
On-line checkers are receiving increasing attention with the advent of deep submicron VLSI technology and system-on-chip (SOC), where reliability and yield are becoming major issues for future generations of VLSI products. A technique for designing hardware-efficient totally self-checking (TSC) checkers for Borden codes is proposed. Borden codes, C(n,t), are optimal t-unidirectional error detecting codes for n-bit vectors. Borden codes have gained importance because a large number of errors in modern VLSI circuits are of the unidirectional type and have a limited multiplicity. The checker proposed here is based on the modulo property of Borden codes. The checker is composed of a modulo adder which maps all the Borden codewords to t+1 subsets. The adder is followed by a translator and a two-rail code checker, which detects these t+1 subsets. Compared with previous methods, this checker has a much lower hardware complexity: it reduces the hardware complexity from O(n (log n)²) to O(n log n/t).
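
The modulo property mentioned here can be illustrated with a small Python sketch. It assumes the standard definition of a Borden code C(n,t), which the abstract does not spell out: the set of n-bit words whose weight is congruent to ⌊n/2⌋ modulo t+1. The function names are hypothetical, and the brute-force check is an illustration only, not the checker circuit proposed in the paper.

```python
from itertools import product


def is_borden_codeword(word, t):
    """True if the weight of `word` is congruent to floor(n/2) modulo t+1 (assumed definition)."""
    n = len(word)
    return sum(word) % (t + 1) == (n // 2) % (t + 1)


def detects_all_unidirectional_errors(n, t):
    """Brute-force check that no unidirectional error of 1..t bits maps a codeword to a codeword."""
    code = [w for w in product((0, 1), repeat=n) if is_borden_codeword(w, t)]
    for c in code:
        for other in code:
            if other == c:
                continue
            ups = sum(1 for a, b in zip(c, other) if a == 0 and b == 1)
            downs = sum(1 for a, b in zip(c, other) if a == 1 and b == 0)
            # Unidirectional: all flipped bits go the same way (only 0->1 or only 1->0).
            unidirectional = (ups == 0) != (downs == 0)
            if unidirectional and ups + downs <= t:
                return False
    return True


print(detects_all_unidirectional_errors(6, 2))
```

A unidirectional error of multiplicity at most t changes the weight by between 1 and t, so the weight residue modulo t+1 always changes; this is why a modulo adder on the weight suffices as the first stage of a checker.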

P3E-3　 Analytical Aerial Imaging Simulation for OPC

PDF 陳中平、詹霖、曾俊貴、鍾士勇、王芝宇, 國立台灣大學　

　
Optical proximity correction (OPC) is essential to today's microlithography, especially for complicated IC layout structures. In this paper, we propose an analytical way to evaluate the light distribution from the imaging equation. It is easy to implement and provides sufficient accuracy for academic use and further optimization, with acceptable simulation speed compared with traditional numerical methods, which rely on discrete convolution or FFT and may lose accuracy because of discretization and sampling. The optical imaging evaluation discussed in this paper can also be applied in physical design to produce well-organized layout structures that are more OPC-friendly in advance. The resulting light intensity distribution, obtained with simulation time proportional to the number of slits, is shown and compared with the well-known simulator SPLAT.

P3E-4　 An Experiment of Test Plan Construction & Test Automation

PDF Tsung-Ju Yang, Ming-Chang Tung, Wei-Yu Lin, Zhi-Wei Lin, Chi-Hen Chang, and Farn Wang, 國立台灣大學　

　
We investigate the issue of test automation for embedded systems. We use a mobile phone as our experimental SUT (System Under Test) and identify the test tasks that can be automated with tool support. As a result, we have developed a graphical test case editor that allows users to draw high-level test cases as MSCs (Message Sequence Charts) and a test compiler that translates MSCs to test executables in C/C++. We have also developed a configurable mobile phone simulator that covers the general capabilities expected of a mobile phone, such as dialing, call answering, MP3 playing, and calculator operation. We then discuss how to use the international standard TTCN-3 to implement the SUT adaptor and platform adaptor, and how to construct a test matrix for testing the SUT against a number of specifications and criteria. Finally, we report the experimental data.

P3E-5　 A Flip-Flop Replacement Technique for IR Drop Reduction

PDF Jiun-Kuan Wu, Liang-Ying Lu, Kuang-Yao Chen, and Tsung-Yi Wu, 國立彰化師範大學　

　
As process technology progresses to ultra-deep sub-micron, IR drop becomes an important issue for circuit designers. Clock skew scheduling for peak current reduction is a popular technique for solving the IR-drop problem in the physical design stage. In this paper, we propose three kinds of long-delay flip-flops and an algorithm that replaces selected normal flip-flops of a circuit with the long-delay flip-flops. Because the replacement separates the switching times of the flip-flops, the peak current and the IR-drop effect can be reduced. Unlike traditional clock skew scheduling, our technique can be used not only in the physical design stage but also in the logic design stage. Another advantage of our technique over the clock skew optimization technique is lower area overhead, because our method does not increase routing resource demand, whereas the clock skew optimization technique may. Experimental results show that our technique can reduce the peak current and the voltage drop by up to 41.95% and 31.82%, respectively, with an area overhead of less than 1%.

P3E-6　 A Design Methodology for Application-Specific Instruction-set Processors with Memory Access Considerations

PDF Ji-Ying Wu, Chi-Jie Lin, Je-Rung Shiu, De-Sheng Chen, and Yi-Wen Wang, 逢甲大學　

　
System designers may add new instructions, called Application-Specific Instructions (ASIs), generated by an automatic algorithm, to optimize a specific target application program, improve system performance, and reduce ASIP design time. Previous ASIP research has focused mostly on instruction latency to improve performance, and the impact of memory access is often ignored. In this paper, a design flow is proposed to automatically generate ASIs to improve performance. The flow consists of translating a C program to CDFGs, selecting ASIs, and simulating on a MIPS R3000-based microarchitecture. We consider instruction latency and a simple memory parameter at the same time. Our experimental results show that adding a simple memory model yields a performance improvement of up to 22% and a memory access reduction of up to 24% compared with considering instruction latency only.

P3E-7　 Yield Analysis for the 65nm SRAM Cells Design with Resolution Enhancement Techniques (RET)

PDF J. J. Tang, C. L. Liao, P. C. Jheng, S. H. Chen, K. M. Lai, and L. J. Lin, 南台科技大學　

　
Photolithography remains the driving and enabling technology in the modern semiconductor industry for fabricating integrated circuits with ever-decreasing feature size. However, due to the wave properties of light, such as diffraction and interference, there will be no yield on downscaling of critical dimensions without using Resolution Enhancement Techniques (RET). Two major RETs, optical proximity correction (OPC) and phase shift masks (PSM), are often employed to improve the manufacturability and yield of nanometer ICs. RETs have even enabled optical lithography to reliably produce IC features 2 or even 3 times smaller than the optical wavelength used for pattern imaging. This paper presents photolithography simulation results and experience in designing a 65nm SRAM standard cell using various RETs for lithography at 193nm. The effects of OPC and PSM methodologies, together with off-axis illumination (OAI) and water immersion techniques, on photolithography image quality and yield are examined based on the yield analysis criteria Mask Error Factor (MEF), Exposure Latitude (EL), and Process Window (PW).

P3E-8　 Object-Oriented Hardware/Software Co-Design Using Java

PDF Chin-Tai Chou, Fu-Chiung Cheng, and Hung-Chi Wu, 大同大學　

　
Object-oriented design methodology helps to handle design complexity in software and has thus increased in popularity in hardware/software co-design. However, existing approaches focus on modelling and simulation. In this paper, we propose a novel approach to model and synthesize hardware/software co-design systems using Java as the specification language. This makes modelling and verification early in the design process both possible and easier. An example is given to demonstrate and evaluate the performance gain of our approach.

　

Session: P3D
Day: 08/10
Time: 10:00-12:00
Chair: Prof. 朱守禮 (中原大學)
Room: 2F Banquet Hall

Digital

P3D-1　 High-speed, Low Cost Parallel Memory-Based FFT Processors for OFDM Applications

PDF Shin-Yo Lin, Wei-Chien Tang, Muh-Tien Shiue, and Chin-Long Wey, 國立中央大學

　
Low-cost yet efficient FFT (Fast Fourier Transform) processors are greatly needed for real-time operation in many OFDM applications, such as xDSL, DAB, and DVB-T/H. This paper presents three radix-2 memory-based FFT (MBFFT) processors with a memory size of N words for an N-point complex FFT operation. Experimental results show that the core area of the developed MBFFT is 2.04mm² with a maximum working frequency of 198MHz for N=8192 points (24 bits per word).

P3D-2　 Self-Aligned Double Bits SONOS Cell and Its Memory Circuit Design

PDF Jyi-Tsong Lin, Wei-Ching Lin, and Ho-Lin Lee, 國立中山大學　

　
In this paper, a new two-bit SONOS cell structure is demonstrated. The cell is fabricated with a self-aligned process, and a pair of vertical trapping columns is located adjacent to the channel below the gate. The cell is compatible with state-of-the-art MOS manufacturing processes without special additional steps. The vertical trapping pattern not only supports multi-bit operation but also allows further miniaturization, and self-alignment is achieved within this design. In addition, a suitable distance between the trapping columns provides reliable isolation, so prolonged retention time and suppression of bit-to-bit interference can reasonably be expected. A trapping region well wrapped by oxide layers further improves retention time. Because each column is a long vertical stripe with a needle-shaped end close to the channel terminals, trapped carriers accumulate and concentrate in this needle zone, which amplifies the threshold voltage shift, widens the programming window, and makes multi-bit operation feasible. We use ISE TCAD/DESSIS simulation to build the device geometry and obtain the related characteristics. We also use PSPICE to draw an easily understood schematic diagram of a memory cell, but no device model has yet been established and imported; this, together with further research and assessment such as cell-to-cell interference and multi-level current comparator design, is left for future work.

P3D-3　 Computation Sharing Programmable FIR Filter Using Canonic Signed Digit Representation

PDF Shui-Wen Hsu and Yuan-Hao Huang, 國立清華大學　

　
This paper presents a low-cost and high-performance programmable digital finite impulse response (FIR) filter. The architecture employs the computation sharing algorithm to reduce the computation complexity. In the traditional computation sharing algorithm, the critical path constraint on the output summation stage is a bottleneck; thus, the canonic signed digit (CSD) representation is used for the filter coefficients to relax the timing constraints. Because the critical path timing is relaxed, the computation cost is reduced further, and the goal of low cost and high performance can be achieved. The synthesis results show that, compared with the traditional computation sharing FIR filter, the area cost reduction of the proposed architecture grows with the tap length.
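
For readers unfamiliar with the canonic signed digit representation, the following minimal Python sketch converts a non-negative integer coefficient into digits from {-1, 0, +1} with no two adjacent nonzero digits. It uses the textbook non-adjacent-form recoding and is not taken from the paper; the function name `to_csd` is hypothetical.

```python
def to_csd(value: int) -> list[int]:
    """Return the CSD digits of a non-negative integer, least significant digit first."""
    digits = []
    while value != 0:
        if value & 1:
            # value mod 4 == 1 gives +1, value mod 4 == 3 gives -1, so that the
            # remainder becomes divisible by 4 and the next digit is forced to 0.
            digit = 2 - (value & 3)
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits


# 7 = 8 - 1: two nonzero digits instead of the three ones in binary 111.
print(to_csd(7))  # prints [-1, 0, 0, 1]
```

Fewer nonzero digits mean fewer shift-and-add terms per coefficient, which is the general reason CSD coding is attractive for multiplierless FIR filters.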

P3D-4　 A Low-Complex Image Coding Algorithm Based on Wavelet Transform

PDF Trong-Yen Lee, Yang-Hsin Fan, and Su-Zhen Hong, 國立台北科技大學　

　
In this paper, we present a low-complexity image coding algorithm for compressing images with the zero-tree method. We propose backward scan and lowest tree coding (BSLTC), which efficiently constructs wavelet coefficient trees. Experimental results show that BSLTC achieves faster execution and a smaller memory footprint. Moreover, BSLTC is less complex than SPIHT, JasPer/JPEG 2000, and LTW.

P3D-5　 A Low-Complexity High-Performance Two-Dimensional Look-Up Table for LDPC Hardware Implementation

PDF Tzu-Wen Chung, Chen-Pang Chang, Jung-Chieh Chen, and Po-Hui Yang, 國立高雄師範大學　

　
In this paper, we propose a low-complexity and high-efficiency two-dimensional look-up table (2-D LUT) for performing the sum-product algorithm in the decoding of low-density parity-check (LDPC) codes. Instead of employing adders for the core operation when updating check node messages, the proposed scheme successfully merges the main term and the correction factor of the core operation into a compact 2-D LUT. Simulation results indicate that the proposed 2-D LUT not only attains close-to-optimal bit error rate performance but also has a low-complexity advantage that makes it suitable for hardware implementation.
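
The "main term and correction factor" can be made concrete with the well-known pairwise check-node operation of the sum-product algorithm. The Python sketch below shows the exact operation, its split into a min-based main term plus a logarithmic correction, and a naive table over quantized inputs. The function names, the quantization levels, and the table layout are assumptions for illustration and do not reproduce the LUT design of the paper.

```python
import math


def boxplus_exact(a: float, b: float) -> float:
    """Pairwise check-node update on log-likelihood ratios: 2*atanh(tanh(a/2)*tanh(b/2))."""
    return 2.0 * math.atanh(math.tanh(a / 2.0) * math.tanh(b / 2.0))


def boxplus_split(a: float, b: float) -> float:
    """Same operation written as main term plus correction factor."""
    main = math.copysign(1.0, a) * math.copysign(1.0, b) * min(abs(a), abs(b))
    correction = math.log1p(math.exp(-abs(a + b))) - math.log1p(math.exp(-abs(a - b)))
    return main + correction


def build_lut(levels):
    """Tabulate the merged operation over a grid of quantized inputs (hypothetical layout)."""
    return [[boxplus_split(a, b) for b in levels] for a in levels]


levels = [x * 0.5 for x in range(-8, 9)]  # assumed quantization: -4.0 to 4.0 in steps of 0.5
lut = build_lut(levels)
print(len(lut), len(lut[0]))  # prints "17 17"
```

Indexing one table by both quantized inputs replaces the separate min, sign, and correction computations with a single lookup, at the cost of table size growing with the square of the number of levels.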

P3D-6　 Hierarchical Decision Table for Bad Pixel Detection in Stereo Vision

PDF Tsung-Hsien Tsai, Nelson Yen-Chung Chang, and Tian-Sheuan Chang, 國立交通大學　

　
The detection of bad pixels is an important issue for quality restoration in computational stereo. This paper presents a bad pixel detection method using a hierarchical decision table approach, based on information from left-right checking, unreliability methods, and disparity smoothness. The proposed method integrates the features of the first two methods and additionally uses disparity smoothness to eliminate the noise effect in detected pixels. The experimental results show that the proposed method achieves a higher and more consistent detection rate (58.2%~82.7%) and accuracy (56.7%~84.4%) over different stereo image pairs when compared with previous approaches.
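
Left-right checking, one of the inputs to the decision table, can be sketched in a few lines of Python. The sketch assumes the common convention that a left-image pixel at column x with disparity d matches the right-image pixel at column x - d; the function name, the single-scanline interface, and the threshold parameter are hypothetical, and the hierarchical decision table itself is not reproduced.

```python
def left_right_check(disp_left, disp_right, threshold=1):
    """Flag pixels of one scanline whose left and right disparities disagree."""
    bad = []
    for x, d in enumerate(disp_left):
        xr = x - d  # matching column in the right image (assumed convention)
        if xr < 0 or xr >= len(disp_right):
            bad.append(True)  # the match falls outside the image
        else:
            bad.append(abs(d - disp_right[xr]) > threshold)
    return bad


left = [0, 1, 1, 2]
right = [1, 1, 9, 9]
print(left_right_check(left, right, threshold=0))  # prints [True, False, False, True]
```

Pixels that fail this check are typically occluded or mismatched, which is why the check is a natural first cue for bad pixel detection.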

P3D-7　 An Efficient Metric Normalization Architecture for High-speed Low-power Viterbi Decoder

PDF Kelvin Yi-Tse Lai, 國立雲林科技大學　

　
In this paper, a new efficient metric normalization architecture called High Bit Clear is proposed for a high-throughput, low-power Viterbi decoder (VD). The proposed High Bit Clear normalization circuit not only normalizes all of the survivor path metrics but also operates as close as possible to the Add-Compare-Select (ACS) iteration bound with a small area overhead. After verifying the function on an FPGA platform, we implemented the design using the UMC 0.18μm 1.8V 1P6M standard cell library. With this implementation, the proposed VD improves the data rate up to 834Mbps for decoding a (3,1,2) convolutional code. Compared with the traditional VD without normalization, the proposed VD improves decoding speed by 60% and reduces power consumption by 50%. Furthermore, the chip area of the new VD is reduced by 55% compared with the traditional one. The operational speed of the proposed VD is up to 278MHz. At 278MHz, the proposed VD consumes 2.48mW and occupies a chip area of about 110μm×110μm.

P3D-8　 Design and Implementation of a Real-Time Global Tone Mapping Processor for High Dynamic Range Video

PDF Tsun-Hsien Wang, Wei-Su Wong, Fang-Chu Chen, and Ching-Te Chiu, 國立清華大學　

　
Due to rapid progress in high dynamic range (HDR) video capture technology, displaying HDR video on conventional LCD devices has become an important topic. In this paper, we show that real-time HDR video display is possible. A tone-mapping-based HDR video architecture pipelined with a video CODEC is presented. The HDR video is compressed by the tone mapping processor. The compressed HDR video can then be encoded and decoded with video standards such as MPEG2, MPEG4, or H.264 for transmission and display. We propose and implement a modified photographic tone mapping algorithm for the tone mapping processor. The required luminance wordlength in the processor is analyzed and the quantization error is estimated. We also develop a digit-by-digit exponent and logarithm hardware architecture for the tone mapping processor. The synthesized result shows that our real-time tone mapping processor can process an NTSC video with 720*480 resolution at 50 MHz. The core area after layout is about 1.8225 mm² under TSMC 0.13 μm technology.
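For orientation, the sketch below shows the widely used global photographic tone mapping operator in its simplest form: scale each luminance by a key value over the log-average luminance, then compress with L/(1+L). It is not the authors' modified algorithm, whose changes the abstract does not specify. The function name `photographic_tone_map` and the defaults `key=0.18` and `delta=1e-6` are assumptions for this example.

```python
import math


def photographic_tone_map(luminance, key=0.18, delta=1e-6):
    """Map HDR luminance values into [0, 1) with a global operator.

    luminance: iterable of non-negative world luminance values.
    key:       target "middle grey" of the output (assumed default).
    delta:     small offset that keeps log() finite for black pixels.
    """
    values = list(luminance)
    # Log-average luminance of the frame: exp(mean(log(delta + Lw))).
    log_avg = math.exp(sum(math.log(delta + lw) for lw in values) / len(values))
    out = []
    for lw in values:
        scaled = key * lw / log_avg         # scale to the chosen key
        out.append(scaled / (1.0 + scaled))  # compress high luminances
    return out


if __name__ == "__main__":
    print(photographic_tone_map([0.01, 1.0, 100.0, 10000.0]))
```

The logarithm and exponential in the log-average step are the kind of operations for which dedicated exponent and logarithm hardware, as mentioned in the abstract, would be useful.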

P3D-9　 Design a Hardware Interprocessor Communication Mechanism for a Multi-core Computer System

PDF Slo-Li Chu, Chih-Chieh Hsiao, Pin-Hua Chiu, and Hsien-Chang Lin, 國立中原大學　

　
Multiprocessor architectures for multimedia embedded systems are becoming more popular because of advances in processor design and fabrication. However, interprocessor communication is still an important problem in multiprocessor environments. In this paper, we propose a hardware interprocessor communication mechanism for a multi-core FPGA chip. Although the hardware/software development tools for the target platform do not support multi-core design, we create a novel design flow to implement the multi-core system under Linux with a high-speed communication mechanism. In the experimental results, the Dhrystone benchmark achieves at least a 30% speedup when executed on the Xilinx ML310 platform redesigned with our mechanism.

P3D-10　 A HIGH PERFORMANCE CAVLC DECODER USING NON-ZERO SKIP AND MULTI-LEVEL DECODING

PDF Tsung-Han Tsai and De-Lung Fang, 國立中央大學　

　
In this paper, we propose a hybrid high-performance CAVLD algorithm for the MPEG-4 AVC/H.264 video standard in baseline profile. Two techniques, MLD (Multi Level Decoding) and NZS (Non Zero Skip for run_before decoding), are introduced to improve the throughput of the CAVLC decoder. Compared with a previous design, an improvement of around 5% in decoding one macroblock is achieved, and MLD is introduced to speed up the whole process further. With these two improvements, one macroblock is decoded in 137 cycles on average, which is equivalent to a throughput of 1.02*10⁶ macroblocks per second at 140MHz. This corresponds to a speedup of 57% over the same design. The hardware area is 18694 gates, and the performance meets the real-time requirement for Full HD.
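The throughput figure quoted above can be checked with a short calculation. The clock rate and cycle count come from the abstract; the Full HD frame size (1920x1088 coded pixels, 16x16 macroblocks) and the 30 frames-per-second target are assumptions added here to give the real-time claim a concrete reference point.

```python
CLOCK_HZ = 140e6        # from the abstract
CYCLES_PER_MB = 137     # average cycles per macroblock, from the abstract

# Average macroblock throughput at the stated clock rate.
mb_per_second = CLOCK_HZ / CYCLES_PER_MB

# Assumed real-time target: 1920x1088 coded frame, 16x16 macroblocks, 30 fps.
mb_per_frame = (1920 // 16) * (1088 // 16)
required_mb_per_second = mb_per_frame * 30

if __name__ == "__main__":
    print(f"throughput: {mb_per_second:.3e} macroblocks/s")
    print(f"required:   {required_mb_per_second} macroblocks/s")
    print(f"margin:     {mb_per_second / required_mb_per_second:.2f}x")
```

Under these assumptions the average throughput exceeds the requirement several times over; note that the 137-cycle figure is an average, so worst-case macroblocks may take longer.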