TECHNICAL PROGRAM
Session
A1 |
Day
08/08
|
Time
13:30-15:00
|
Chair Prof. Hong-Yi Huang
National Taipei University |
Room
301 |
13:30 |
A1-1 |
Novel VLSI Design of Circular-Carry-Select (CCS) Based Diminished-One Modulo 2^n+1 Adder |
Su-Hon Lin, Ming-Hwa Sheu, Kuang-Hui Wang, Jun-Jie Zhu, and Si-Ying Chen, National Yunlin University of Science and Technology |
Diminished-one modulo 2^n+1 addition is an important arithmetic operation for high-performance residue number systems (RNS). In this paper, we propose a novel Circular-Carry-Select (CCS) architecture for the diminished-one modulo 2^n+1 adder. The resulting adder is based mainly on a CCS addition block that is simple and regular for all values of n. In a VLSI implementation in UMC 180 nm CMOS technology, the CCS-based diminished-one modulo 2^n+1 adder achieves better Area×Time (AT) performance than well-known existing solutions. The area and clock rate of the CCS-based modulo 2^16+1 adder chip are 26746 μm² and 476 MHz, respectively. |
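For readers unfamiliar with the arithmetic named in A1-1, the following is a minimal behavioral sketch of textbook diminished-one modulo 2^n+1 addition. It is not the CCS architecture from the paper. The function name and the use of `None` to signal a zero result are assumptions made for illustration only.

```python
def diminished_one_add(a_star, b_star, n):
    """Add two diminished-one operands modulo 2**n + 1.

    a_star and b_star encode the values A = a_star + 1 and B = b_star + 1,
    so each lies in the range 0 .. 2**n - 1. The value zero has no
    diminished-one encoding, so a zero sum is reported as None here
    (an illustrative convention, not taken from the paper).
    """
    mask = (1 << n) - 1
    total = a_star + b_star
    carry_out = total >> n          # carry out of the n-bit addition
    if total == mask:
        return None                 # (A + B) mod (2**n + 1) == 0
    # Standard rule: add the inverted carry back in and keep n bits.
    return (total + (1 - carry_out)) & mask
```

For example, with n = 4 the operands 5 and 9 are encoded as 4 and 8, and the function returns 13, which encodes (5 + 9) mod 17 = 14.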
13:45 |
A1-2 |
Self-Aware Medium-Grained Adaptive Power Control Using Current Monitoring Technique |
Wei-Chih Hsieh and Wei Hwang, National Chiao Tung University |
In this paper, a novel current monitoring technique is proposed to provide a reference for voltage scaling. Instead of tracking the delay of a worst-case critical-path replica, the current characteristic of the target circuit is used to distinguish between its switching and stable states. A medium-grained adaptive power control technique is also presented that takes advantage of the low-overhead current monitoring technique. Conventional voltage scaling applies a single scaled voltage that satisfies the critical path to the whole chip, wasting power in non-critical paths. The medium-grained technique exploits the unused slack in non-critical paths to uncover further power-reduction potential. An example with multipliers of different widths exhibits over 40% power reduction on non-critical paths with only 7% overhead, most of which comes from unoptimized level converters and control-word buffers. The proposed current monitoring scheme contributes only about 7 μW. All simulations use the Berkeley Predictive 65 nm technology.
|
14:00 |
A1-3 |
Post-Chip Adjustable Low Power Delay Element |
Jung-Lin Yang and Chih-Wei Chao, Southern Taiwan University of Technology |
Constructing specific delays on a chip is a difficult task in deep-submicron technology, and it is the problem this paper addresses. An extremely low-power delay element with a post-chip adjustment feature is introduced. Our initial intention was to develop this delay element for self-timed datapath components, but we found that the design also suits many other low-power applications. In addition to its tunable behavior, a post-chip delay adjustment feature is implemented. The circuit also shows valuable characteristics, such as good adaptation of the delay to differences in operating temperature and tolerance of technology variation. Excluding the current-mirror circuitry, the proposed tunable delay element consumes approximately the same average power as a 4-stage minimum-size inverter chain. All claims are verified with carefully set up post-layout simulations using TSMC 0.35 μm technology. |
14:15 |
A1-4 |
A High-Resolution All-Digital Phase-Locked Loop with its Application to Built-In Speed Grading for Memory |
Hsuan-Jung Hsu, Chun-Chieh Tu, and Shi-Yu Huang, National Tsing Hua University |
In this paper we present a high-resolution, wide-range all-digital phase-locked loop (ADPLL) that is suitable for use as a clock generator. The digitally controlled oscillator (DCO) operates from 70 to 725 MHz and achieves 5.2 ps resolution. The Phase-Frequency Detector (PFD) is designed using a latch-based sense amplifier, leading to a nearly perfect PFD that can resolve a phase difference as small as 1 ps. In addition, we use this ADPLL as a vehicle to perform built-in speed grading (BISG) for memory. By combining a binary search process with multiple runs of built-in self-test (BIST), the maximum operating speed can be tracked down on the chip with high precision. |
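The binary search over repeated BIST runs mentioned in A1-4 can be sketched as follows. This is a hedged illustration, not the authors' on-chip procedure: `bist_passes` is a hypothetical callback standing in for one BIST run at a given clock frequency, and the sketch assumes the memory passes at every frequency below its maximum operating speed and fails above it.

```python
def grade_speed(bist_passes, f_min_mhz, f_max_mhz, resolution_mhz):
    """Return the highest passing frequency found, within resolution_mhz.

    bist_passes(freq_mhz) -> bool is a hypothetical stand-in for running
    built-in self-test with the clock generator set to freq_mhz.
    Returns None if the memory fails even at f_min_mhz.
    """
    if not bist_passes(f_min_mhz):
        return None
    if bist_passes(f_max_mhz):
        return f_max_mhz
    low, high = f_min_mhz, f_max_mhz    # invariant: low passes, high fails
    while high - low > resolution_mhz:
        mid = (low + high) / 2.0
        if bist_passes(mid):
            low = mid
        else:
            high = mid
    return low
```

Each iteration halves the search interval, so the number of BIST runs grows only logarithmically with the ratio of the frequency range to the desired resolution.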
14:30 |
A1-5 |
All-Digital PLL Using Bulk-Controlled Varactor and Pulse-Based DCO |
Hong-Yi Huang and Jen-Chieh Liu, National Taipei University |
A 150–450 MHz all-digital phase-locked loop (ADPLL) in a 0.18 μm CMOS process is presented. The pulse-based digitally controlled oscillator (PB-DCO) achieves high resolution and a wide range. The bulk-controlled varactor minimizes jitter. The worst-case frequency acquisition time is 32 reference clock cycles. The multiplication factor ranges from 2 to 63. The rms and peak-to-peak jitters are 6.7 ps and 44 ps at 450 MHz, respectively. Power consumption is 16.2 mW at 450 MHz. |
14:45 |
A1-6 |
A Conditional Isolation Technique for Low-Power and High-speed Wide Domino Gates |
Wei-Hao Chiu and How-Rern Lin, Da-Yeh University |
A new conditional isolation technique (CI-Domino) is proposed for wide domino gates. The technique not only reduces subthreshold and gate-oxide leakage currents simultaneously without sacrificing circuit performance, but also speeds up the evaluation of the domino gate. Simulations of high fan-in domino OR gates in a 0.18 μm process technology show that the proposed technique reduces total static power by 36%, dynamic power by 49.14%, and delay by 60.27% compared with the conventional domino gate. The proposed technique also improves leakage tolerance by about 48.14%.
|
|
TECHNICAL PROGRAM
Session
B1 |
Day
08/08
|
Time
13:30-15:00
|
Chair
Prof. Pao-Ann Hsiung
National Chung Cheng University |
Room
305 |
VLSI System and Implementation
13:30 |
B1-1 |
Novel Low-Power Bus Coding Method for Crosstalk Noise Reduction |
Chia-Hao Fang and Chih-Peng Fan, National Chung Hsing University |
In deep-submicron technology, reducing the power dissipation and propagation delay of on-chip buses has become a key issue for low-power System-on-a-Chip (SoC) designs. In particular, the coupling effect causes serious problems such as crosstalk delay, noise, and power dissipation. In this paper, a new bus-coding method is proposed to substantially reduce both the dynamic power dissipation on buses and the crosstalk delay. Our method saves more coupling power than the bus inverter (BI), the coupling-driven bus inverter (CBI), and other schemes. Experimental results show that our method achieves 22-36% dynamic power savings for systems implemented in TSMC 0.18 μm CMOS technology.
|
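As background for the BI baseline that B1-1 compares against, here is a minimal sketch of classic bus-invert coding. It is not the new method proposed in the paper. The sketch assumes the bus lines start at zero and, as a simplification, ignores transitions on the extra invert line when deciding whether to invert.

```python
def bus_invert_encode(words, width):
    """Encode a word sequence with classic bus-invert coding.

    Returns a list of (bus_value, invert_bit) pairs. A word is sent
    inverted whenever sending it as-is would toggle more than half of
    the data lines relative to the previous bus value.
    """
    mask = (1 << width) - 1
    previous = 0                      # assumption: bus lines start at zero
    encoded = []
    for word in words:
        toggles = bin((previous ^ word) & mask).count("1")
        if toggles > width // 2:
            sent, invert = (~word) & mask, 1
        else:
            sent, invert = word & mask, 0
        encoded.append((sent, invert))
        previous = sent
    return encoded


def bus_invert_decode(encoded, width):
    """Recover the original words from (bus_value, invert_bit) pairs."""
    mask = (1 << width) - 1
    return [(~sent) & mask if invert else sent for sent, invert in encoded]
```

With this rule no transfer toggles more than half of the data lines, which bounds the self-switching activity. Coupling-aware schemes such as the one in the abstract additionally consider transitions on neighbouring wires.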
13:45 |
B1-2 |
Reconfigurable Hardware Module Sequencer for Dynamically Partially Reconfigurable Systems |
Chin-Chieh Hung and Pao-Ann Hsiung, National Chung Cheng University |
Dynamically reconfigurable systems adopt either a processor-controlled networked architecture or a data sequencer-controlled data flow architecture. In the networked architecture, the processor is overloaded with data transfer requests, whereas in the data flow architecture, the burden is shifted completely from the processor to the data sequencer. As a tradeoff between these two extremes, this work proposes a novel module sequencer architecture, which not only allows the processor and the sequencer to share the heavy data communication load, but is also more coherent with the conventional processor-FPGA architecture. Further, the architecture is highly flexible because it can be tuned to fit a particular application. Application examples show that the proposed architecture is superior to the networked architecture in terms of lower communication load and to the data flow architecture in terms of reduced system complexity.
|
14:00 |
B1-3 |
Implementing an FPGA Baseband Multipath Fading Channel Emulator Using High-Level Modular Design |
Jeng-Kuang Hwang*, Kuei-Horng Lin, Jeng-Da Li, and Juinn-Horng Deng, Yuan Ze University |
A baseband multipath fading channel emulator is implemented on the Xilinx XtremeDSP FPGA platform through modular high-level design. Important modules are described, including the white Gaussian noise generator (WGNG), Doppler filter, direct digital frequency synthesizer (DDFS), multi-rate interpolators, and multipath signal generator. Since all modules are designed as high-level Simulink models using Xilinx System Generator, the system parameters and configuration can be easily changed as desired. The FPGA emulator has been tested at a sampling rate of 30 Msps, and all the measured signals agree well with the simulation results, thus verifying the correctness of the design.
|
14:15 |
B1-4 |
HW/SW Co-Design of a Multi-Threaded Virtual Machine for a Scalable NoC Platform |
§õ©û¶©¡B³¯ªa¶W¡B©Põ¥Á, National Cheng Kung University |
In this paper, we design and implement a Multi-Threaded Java Virtual Machine (MTJVM) for a scalable NoC platform. It is composed of multiple processing elements (PEs) and can directly execute Java threads concurrently without any software/OS support. Threads are dynamically dispatched to PEs and run simultaneously to exploit Thread-Level Parallelism (TLP). Thread processing mechanisms and instructions, such as real-time scheduling, sleep, wait, yield, and synchronization, are handled by two new global controllers, the thread-manager and the memory-manager. The software part of the system has been coded in C, and the hardware part has been coded and synthesized in VHDL. Experimental results show that performance and area scale with the number of PEs used, and that the system works at 96.8 MHz.
|
14:30 |
B1-5 |
High Speed and Low Cost Implementations in Mix-Column/InvMix-Column |
Chung-Yi Li, Chih-Feng Chien, and Tsin-Yuan Chang, National Tsing Hua University |
Mix-Column and Inverse Mix-Column dominate the logic resources in Advanced Encryption Standard (AES) hardware implementations with direct-mapping S-boxes. In this paper, two byte-level resource-sharing circuits, a short-path design and a small-area design, are proposed to optimize delay and area. Theoretically, the proposed circuits have either the fastest speed (same as previous work but 41% smaller in gate count) or the smallest gate count (1% less than previous work with a 13% saving in delay). Synthesized in a TSMC 0.18 μm CMOS technology, the proposed schemes rank in the top two in AT² (area-delay-squared product).
|
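For context on the operation that B1-5 optimizes, the sketch below computes the standard AES MixColumns transform on a single 4-byte column, as defined in the AES specification. It is a software reference only and does not reflect the resource-sharing circuits proposed in the paper. The helper names are illustrative.

```python
def xtime(byte):
    """Multiply a byte by 2 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    byte <<= 1
    if byte & 0x100:
        byte ^= 0x11B               # reduce by the AES field polynomial
    return byte & 0xFF


def mul3(byte):
    """Multiply a byte by 3 in GF(2^8), computed as (2 * b) xor b."""
    return xtime(byte) ^ byte


def mix_single_column(column):
    """Apply the AES MixColumns matrix to one column of four bytes."""
    a0, a1, a2, a3 = column
    return [
        xtime(a0) ^ mul3(a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ mul3(a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ mul3(a3),
        mul3(a0) ^ a1 ^ a2 ^ xtime(a3),
    ]
```

Every output byte reuses the same doubled terms, which is the kind of redundancy that byte-level resource sharing in hardware can exploit.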
14:45 |
B1-6 |
Combined Decoding and Flexible Transform Designs for Effective H.264/AVC Decoders |
Yi-Chih Chao, Shih-Tse Wei, Jar-Ferr Yang, and Bin-Da Liu, National Cheng Kung University |
In this paper, we propose a combined decoding architecture and a high-throughput flexible transform design to effectively decode the residual data in H.264/AVC decoders. The inverse quantization (IQ) procedure is combined with the context-based adaptive variable length coding (CAVLC) decoder to simplify the design. In addition, a flexible transform architecture is proposed for effective computation of all transforms needed in H.264/AVC decoders. Since all the transforms are realized in the same architecture, the flexible transform design with a throughput of 8 pixels/sec needs fewer logic gates. Simulation results show that the implemented gate count is 18.6k and the maximum operating frequency is 125 MHz. For real-time requirements, the proposed design achieves 4VGA (1280×960) at 30 frames/sec in the worst case. |
|
TECHNICAL PROGRAM
Session
C1 |
Day
08/08
|
Time
13:30-15:00
|
Chair
Prof. Chien-Hung Kuo
National Taiwan Normal University |
Room
315 |
13:30 |
C1-1 |
Design and Realization of Ultra Low-Capacitance Bond Pad With Inductive Compensation for RF Circuits in CMOS Technology |
Yuan-Wen Hsiao, Chun-Yu Lin, and Ming-Dou Ker, National Chiao Tung University |
A low-capacitance bond pad for gigahertz RF applications is proposed. Three kinds of on-chip spiral inductors embedded under the traditional bond pad are used to compensate for the parasitic capacitance of the bond-pad metals. Experimental results verify that the bond-pad capacitance can be significantly reduced in a specific frequency band due to the cancellation effect provided by the embedded inductor. The proposed bond pad is fully compatible with general CMOS processes without any process modification.
|
13:45 |
C1-2 |
A compact square-root domain filter |
Chia-Hsiung Kao, Ping-Yu Tsai, Wen-Pin Lin, and Wan Chen Lo, National Sun Yat-sen University |
A square-root domain filter based on operational transconductance amplifiers (OTAs) is proposed. The filter is compact and simple. The supply voltage is 1.5 V and the power consumption is 116 μW for a 10 μA DC input current. Experimental results in a 0.35 μm CMOS process confirm the feasibility of the methodology. |
14:00 |
C1-3 |
Q-Factor Behavior Study of 90-nm RF-CMOS Inductors Using Transmission-Line Model |
C.-H. Huang and T.-S. Horng, National Sun Yat-sen University |
This paper presents a novel transmission-line model for on-chip inductors that derives their quality (Q) factors in closed form in terms of the well-known transmission-line parameters. The derived formulas are general and suitable for all kinds of integrated inductors when used to account for the frequency dependence of Q factors. The presented formulas can also distinguish whether an improvement in the Q-factor frequency response comes from reduced conductor loss or from reduced dielectric loss. Several 90-nm RF-CMOS inductors have been studied for their Q-factor behavior under different design configurations and process variations. |
14:15 |
C1-4 |
A 1-V CMOS Pseudo-Differential Amplifier with Multiple Common Mode Stabilization and Frequency Compensation Loops |
Meng-Hung Shen, Po-Hsiang Lan, and Po-Chiun Huang, National Tsing Hua University |
In this paper, a 1-V three-stage operational amplifier in a standard 0.35-μm CMOS process is presented. The input stage is a pseudo-differential structure with common-mode feedforward (CMFF) bias to extend the input voltage headroom. In the second and third stages, parallel AC-boosting and signal feedforward paths are used to improve bandwidth and transient speed. One active feedback path is adopted to guarantee stability without the RHP zero. A current-sensing CMFB loop is used to set the common-mode point. With these techniques, experimental results show a 4.3 MHz gain-bandwidth product (GBW) with 68° phase margin when driving a 100 pF load capacitance. The settling time for a 1-Vpp step signal is 1.1 μs. The complete circuit dissipates 249 μW from a single 1-V supply. |
14:30 |
C1-5 |
A 1-V Fully Differential Amplifier with Buffered Nested-Miller Compensation |
Li-Wen Wang, Meng-Hung Shen, and Po-Chiun Huang, National Tsing Hua University |
This paper presents a 1-V three-stage fully differential amplifier with buffered nested-Miller compensation. A transconductance stage is inserted in the feedback path to eliminate the right-half-plane (RHP) zero. In addition, a feedforward transconductance is used to enhance the large-signal output response. Using standard 0.35-μm CMOS technology, measurement results demonstrate that a DC gain greater than 90 dB, a gain-bandwidth product of 4.57 MHz, and a phase margin of 55° are achieved with 100 pF output loads. The settling time for a 1-Vpp step is 2 μs. The complete circuit dissipates 110 μW from a single 1-V power supply. |
14:45 |
C1-6 |
A Low-Power High-Gain Rail-to-Rail Input/Output Operational Amplifier |
Chien-Hung Kuo, Hwa-Ming Lu, and Wei-Hsien Fang, Tamkang University |
A low-power high-gain CMOS operational amplifier with rail-to-rail input/output ranges is presented in this paper. A constant-gm controller is employed in the input stage to achieve optimum bandwidth and settling response over a wide operating range. A differential-input single-output gain-boosting amplifier without common-mode feedback is applied to minimize the power consumption and increase the DC gain of the opamp. Floating current sources are also introduced in the cascode stage to provide proper bias levels for the class-AB output stage. The proposed opamp can drive a large capacitive or a small resistive load without losing gain or unity-gain bandwidth. It has been fabricated in a 0.35 μm 2P4M CMOS process. With a 50 pF output load capacitance, the proposed opamp achieves a DC gain of 128 dB and a unity-gain frequency of 1.81 MHz. The total power dissipation is only 370 μW at a supply voltage of 3.3 V.
|
|
TECHNICAL PROGRAM
Session
D1 |
Day
08/08
|
Time
13:30-15:00
|
Chair
Prof. Chia-Ming Tsai
National Chiao Tung University |
Room
308 |
Wireline Communication Circuits
13:30 |
D1-1 |
A Quarter-Rate 2.56/3.2Gbps 16/20:1 SERDES Interface in 0.18 μm CMOS technology |
Ching-Te Chiu, Jen-Ming Wu, Shuo-Hung Hsu, YarSun Hsu, Ming-Hao Lu, Ping-Lin Yang, Fan-Ta Chen, You-Hung Li, Yu-Hao Hsu, and Min-Sheng Kao, National Tsing Hua University |
In this paper, we present a 2.56/3.2 Gbps 16/20:1 SERDES circuit for integration with switch fabrics in high-speed network applications. To meet the high-speed and low-jitter requirements, we propose a quarter-rate architecture, true single-phase clock (TSPC) circuits, and an LC-VCO based PLL. The SERDES uses the quarter-rate architecture to reduce the required clock frequency. Shift-register and multiple-phase sampling structures are used to achieve 16/20:1 serializing and 1:16/20 deserializing. In circuit measurements, the SERDES operates at up to 5.12 Gbps independent of the PLL. Fabricated in a 0.18 μm CMOS process, the 2.49 mm × 2.49 mm SERDES consumes 250 mW including the PLL at a 3.2 Gbps data rate. |
13:45 |
D1-2 |
A Low Power Tree-Type Multiplexer with Embedded Timing Skew Switch |
HungWen Lu, National Central University |
This work describes a tree-type multiplexer for gigabit serial I/O. The proposed architecture uses quadrature clocks as switch controls to eliminate the retiming circuit needed in the conventional design. Consequently, power consumption and circuit area are significantly reduced. Simulation results indicate that at 10 Gbps the power consumption of the modified multiplexer is 70% of that of the traditional design, and its area is 22% of that of the traditional design. To verify the proposed MUX cells, a test chip is implemented in a TSMC 0.13 μm CMOS process, occupying a circuit area of 200 μm × 150 μm. The measured power consumption of the 32-to-1 MUX circuit at a bit rate of 7.5 Gbps and a 1.2 V supply voltage is only 4.8 mW. |
14:00 |
D1-3 |
Transimpedance Amplifier with Enlarged Input Capacitance Tolerance for Optical Receiver |
Jiann-Jiun Lu and Chia-Ming Tsai, National Chiao Tung University |
This paper presents a 5 Gb/s transimpedance amplifier employing a self-compensated architecture, inductor peaking, and a negative impedance compensation technique in 0.18 μm CMOS technology. The transimpedance amplifier exhibits 13 kΩ differential transimpedance gain and a sensitivity of -15.5 dBm while dissipating 61 mW from a 1.8 V supply, and the chip size is 600 μm × 520 μm.
|
14:15 |
D1-4 |
A Low-jitter Phase-rotation Spread Spectrum Clock Generator for Serial ATA 6Gbps Clock and Data Recovery |
Chi-Hsien Lin, Yen-Ying Huang, Shu-Rung Li, Yuan-Pu Cheng, and Shyh-Jye Jou, National Chiao Tung University |
A low-jitter phase-locked loop (PLL) and a spread-spectrum clocking method with phase rotation for Serial ATA are presented. To achieve low jitter, the PLL uses an error amplifier to resolve the current mismatch in the charge pump, and a 3rd-order loop filter is adopted to reduce the reference spur. A passive resistance is used in our design to reduce the KVCO. Our spread spectrum clock generator (SSCG) for the Serial ATA specification provides a 5000 ppm down spread with a triangular waveform and a modulation frequency of 30–33 kHz. The spread-spectrum technique uses a PLL with a ΔΣ modulator and a phase rotation algorithm. The proposed circuit has been designed in a 90-nm CMOS process. Without spread-spectrum clocking, the clock has a peak-to-peak jitter of 3.54 ps and the circuit consumes 5.87 mW at 1.4 GHz. The EMI reduction is about 18.22 dB.
|
14:30 |
D1-5 |
A 2.5 Gbps CMOS Fully Integrated Optical Receiver with Lateral PIN Detector |
Wei-Zen Chen and Shih-Hao Huang, National Chiao Tung University |
This paper presents the design of a monolithically integrated CMOS optical receiver, including a photodetector, a transimpedance amplifier, and a post limiting amplifier on a single chip. A novel PIN detector is proposed and adopted in this design without technology modification. The optical receiver is capable of delivering 420 mVpp to a 50 Ω output load and operating at up to 2.5 Gbps without an equalizer. Implemented in a generic 0.18 μm CMOS technology, the receiver dissipates 138 mW in total. The chip size is 0.53 mm². |
14:45 |
D1-6 |
Inductorless CMOS Receiver Front-End Circuits for 10-Gb/s Optical Communications |
Chih-Hao Chen, Tamkang University |
In this paper, a 10-Gb/s inductorless CMOS receiver front end is presented, including a transimpedance amplifier and a limiting amplifier. The transimpedance amplifier employs a regulated cascode (RGC), active-inductor peaking, and dual third-order active feedback to achieve a transimpedance gain of 57 dBΩ and a bandwidth of 7.8 GHz with a power dissipation of 39 mW. The limiting amplifier incorporates third-order interleaving active feedback to achieve a voltage gain of 32 dB and a bandwidth of 10 GHz while consuming 145 mW. Both circuits are realized in 0.18-μm CMOS technology with a 1.8-V supply. |
|
TECHNICAL PROGRAM
Session
E1 |
Day
08/08
|
Time
13:30-15:00
|
Chair
Prof. Soon-Jyh Chang
National Cheng Kung University |
Room
318 |
13:30 |
E1-1 |
Topology Generation and Floorplanning for Low Power Application-Specific Network-on-Chips |
Wan-Yu Lee and Iris Hui-Ru Jiang, National Chiao Tung University |
As process technology advances into the nanometer era, the number of cores and the amount of communication on a chip are rapidly increasing. Using a micro-network, a Network-on-Chip can overcome the communication inefficiency of the traditional shared-bus communication architecture. The system performance of application-specific Network-on-Chips is mostly measured by power, timing, and area. Power and timing depend strongly on how the network topology connects routers and cores and on how many routers are used, while area is determined simply by floorplanning. Unlike previous endeavors, in this paper we propose a new methodology that performs network topology generation before floorplanning. Our method preserves the optimality of the topology through floorplanning. Moreover, our method simultaneously minimizes power, satisfies timing and area constraints, and guarantees deadlock freedom. |
13:45 |
E1-2 |
SAT Based Boolean Matching with Don't Cares |
Kuo-Hua Wang and Chung-Ming Chan, Fu Jen Catholic University |
Boolean matching checks the equivalence of two functions under input permutation and input/output phase assignments. In this paper, we transform the Boolean matching problem into the Boolean satisfiability problem. Based on this transformation, a SAT-based matching algorithm for incompletely specified functions is proposed. Moreover, two signatures that exploit functional symmetries are provided to reduce the size of the SAT instance and thus expedite the matching process. Experimental results on a set of benchmark circuits show that our matching algorithm is effective and efficient in solving Boolean matching for incompletely specified functions. Compared with our prior work on Boolean matching, the SAT-based matching algorithm outperforms the old algorithm by several orders of magnitude for many large benchmark circuits. |
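To make the matching problem in E1-2 concrete, the sketch below checks it by exhaustive enumeration for small, completely specified functions. It is not the SAT-based algorithm from the paper and does not handle don't cares. It only illustrates what "equivalent under input permutation and input/output phase assignment" means. All names are illustrative.

```python
from itertools import permutations, product


def boolean_match(f, g, n):
    """Search for a permutation and phases that make g equal to f.

    f and g take n arguments in {0, 1} and return 0 or 1. Returns a tuple
    (perm, in_phase, out_phase) such that, for every input assignment,
    f(x) == g(y) ^ out_phase with y[i] = x[perm[i]] ^ in_phase[i],
    or None if no such transformation exists.
    """
    assignments = list(product((0, 1), repeat=n))

    def equivalent(perm, in_phase, out_phase):
        for bits in assignments:
            mapped = [bits[perm[i]] ^ in_phase[i] for i in range(n)]
            if f(*bits) != (g(*mapped) ^ out_phase):
                return False
        return True

    for perm in permutations(range(n)):
        for in_phase in product((0, 1), repeat=n):
            for out_phase in (0, 1):
                if equivalent(perm, in_phase, out_phase):
                    return perm, in_phase, out_phase
    return None
```

The search space grows as n! · 2^(n+1) candidate transformations, each checked against 2^n input assignments, which is why practical methods rely on signatures and SAT solving instead of enumeration.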
14:00 |
E1-3 |
Lithography-Aware Routing with Predictive OPC Formulae |
Tai-Chen Chen, Guang-Wan Liao, and Yao-Wen Chang, National Taiwan University |
Because of sub-wavelength lithography, manufacturing sub-90 nm feature sizes requires intensive use of resolution-enhancement techniques, among which optical proximity correction (OPC) is the most popular in industry. Considering OPC effects during routing can significantly reduce the cost of post-layout OPC operations. In this paper, we present an efficient yet accurate analytical formula for intensity computation and develop the first model of post-layout OPC based on an inverse-like lithographic technique. Extensive simulations with SPLAT, the golden lithography simulator in academia and industry, show that our intensity formula has high fidelity. By incorporating the OPC costs computed with our post-layout OPC model into a router, the router can be guided to maximize the effects of the correction. Experimental results show that our approach achieves up to 24% and 16% reductions in total and maximum layout distortion, respectively, with reasonable running time.
|
14:15 |
E1-4 |
An Efficient Energy Modeling Approach for VLIW DSP at Instruction-Level |
Wen-Tsan Hsieh, Hsin-Ying Liao, Chien-Nan Jimmy Liu, Shu-Yu Cheng, and Ji-Jan Chen, National Central University |
In this work, we develop a new instruction-level energy modeling approach for pipelined Very Long Instruction Word (VLIW) DSPs. The proposed approach accounts for both the base energy cost of each instruction and the additional energy cost of consecutive instructions in each pipeline stage. The power estimate is therefore much closer to the real pipelined behavior. The overall procedure is separated into two phases: an energy extraction phase and a model reconstruction phase. Experimental results show that the average error of our approach is less than 3% compared with "PrimePower" gate-level power simulation. The proposed energy modeling approach can thus be used readily for software energy optimization. |
14:30 |
E1-5 |
An Automated Synthesis Tool for Fully Differential OPAMPs |
Cheng-Wu Lin and Soon-Jyh Chang, National Cheng Kung University |
A design automation tool that determines proper transistor sizes for fully differential OPAMPs is presented. The synthesis tool provides three OPAMP topologies: two-stage, folded-cascode, and two-stage cascode. Look-up tables are used for device sizing. By incorporating experience from practical circuit design, the efficiency of the synthesis process is improved. A further subroutine is developed to speed up tool development when supporting new OPAMP topologies. When designing an OPAMP for the first stage of an 8-bit 50-MS/s pipelined ADC, the synthesis time of the proposed tool is less than 3 minutes using two 1.2 GHz UltraSPARC-III+ processors and 2 GB of memory.
|
14:45 |
E1-6 |
A Top-down, Mixed-level Design Methodology for CT BP ΔΣ Modulator Using Verilog-A |
Hung-Yuan Chu and Chien-Hung Tsai, National Cheng Kung University |
This paper presents a design methodology for continuous-time (CT) band-pass (BP) ΔΣ modulators that improves the design procedure. The proposed top-down, mixed-level design platform is implemented in Cadence's Spectre environment using Verilog-A. A 2nd-order CT BP ΔΣ modulator for WCDMA applications is presented as a design example. The center frequency of this modulator is 100 MHz, and the internal quantizer operates at a 400 MHz clock frequency. The modulator is simulated in TSMC 0.35 μm CMOS technology at a supply voltage of 3.3 V. The maximum SNDR is 40 dB for a 3.84 MHz bandwidth, which corresponds to a resolution of 6 bits. |
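The step from 40 dB SNDR to roughly 6 bits in E1-6 follows from the usual effective-number-of-bits relation for a full-scale sine input. The abstract does not state which formula the authors used, so the relation below is an assumption:

```latex
\mathrm{ENOB} = \frac{\mathrm{SNDR_{dB}} - 1.76}{6.02}
             = \frac{40 - 1.76}{6.02}
             \approx 6.35\ \text{bits}
```

Rounded down, this gives the 6-bit resolution quoted in the abstract.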
|
TECHNICAL PROGRAM
Session
P1A |
Day
08/08
|
Time
13:30-16:00
|
Chair
Prof. 李博明
Southern Taiwan University of Technology |
Room
2F Banquet Hall |
P1A-1 |
Frequency Domain Analog Circuit Fault Diagnosis Based on Radial Basis Function Neural Network |
ªL©v§Ó¡B³¢©ú¤¯¡B³¯¬Õ¦{, Feng Chia University |
In this paper, a fault diagnosis methodology based on radial basis function neural networks (RBFNN) is proposed to analyze the signatures of analog circuits. To perform soft fault location, RBFNNs are used to process the circuit frequency responses and to build the fault dictionary. Experimental results show that the proposed technique succeeds in diagnosing and locating faults quickly and accurately. |
P1A-2 |
A RF CMOS Low Noise Amplifier Using High-Q Active Inductor Loads with Binary Code for Multi-Band Applications |
Jenn-Tzer Yang, Yuan-Hao Lee, Yi-Yuan Huang, Yu-Min Mu, and Yen-Ching Ho, Minghsin University of Science and Technology |
In this paper, a radio frequency (RF) CMOS multi-band low noise amplifier is proposed that uses a high-Q active inductor load with a binary-code band selector and is suitable for multi-standard wireless applications. By employing an improved high-Q active inductor controlled by a two-bit binary code, a multi-band low noise amplifier operating in four different frequency bands is realized. The proposed amplifier circuit is designed in TSMC 0.18-μm CMOS technology. Based on the simulation results, the amplifier operates at 900 MHz, 1.8 GHz, 1.9 GHz, and 2.4 GHz with forward gain (S21) of 31.15 dB, 30.82 dB, 30.61 dB, and 28.4 dB, and noise figure (NF) of 0.563 dB, 0.558 dB, 0.578 dB, and 0.759 dB, respectively. Furthermore, the power dissipation of the amplifier stays constant across all operating frequency bands at around 11.66 mW from a 1.8-V power supply. The amplifier occupies an area of about 158 × 76 μm². |
P1A-3 |
A Novel Precise Step-Shaped Soft-Start Technique for Integrated DC-DC Converter |
Yung-Chun Chuang and Ke-Horng Chen, National Chiao Tung University |
The novel precise step-shaped soft-start technique for integrated dc-dc converters offers excellent peak-current limiting under any load condition, which strongly suppresses the initial inrush current. It also solves the over-voltage and voltage-drop problems that occur during the transition between start-up mode and normal operation. Furthermore, the on-chip design reduces the number of external I/O pins, lowering the cost of the converter. The technique is therefore more practical and provides a smoother start-up for integrated dc-dc converters than the conventional soft-start technique. |
P1A-4 |
A self-oscillating switching power amplifier |
Chia-Hsiung Kao, Ping-Yu Tsai, Wen-Pin Lin, and Ming-Ching Chou, National Sun Yat-sen University |
A self-oscillating switching power amplifier is proposed. We use feedback to reduce the output DC bias current and to produce self-oscillation. Further, filters are added in the feedback loop to reduce the quantization noise. Experimental results show that the proposed circuit has 0.25% total harmonic distortion (THD) and that its efficiency reaches 90.1% at an output power of 840 μW, with a DC output bias current of 8 μA and a supply voltage of 1.5 V.
|
P1A-5 |
Differential Feed-forward Transconductor Design for High Linearity WiMax Subharmonic Mixer |
Ying-Ta Lu, Hsien-Yuan Liao, Shao-Liang Lu, Joseph D. S. Deng*, and Hwann-Kaeo Chiou, National Central University |
An active double-balanced subharmonic mixer was designed and fabricated for a direct-conversion receiver in worldwide interoperability for microwave access (WiMax) applications. A differential feed-forward transconductor design was applied to improve the third-order nonlinearity. The mixer achieved a conversion power gain of 6.1 dB and an input-referred third-order intercept point of 5 dBm with a power dissipation of 6.72 mW from a 3 V supply. The chip was fabricated in TSMC 0.35 μm SiGe HBT technology and occupies 0.86 mm × 0.78 mm. |
P1A-6 |
Modeling on the Mutual Inductance of On-Chip Transformers |
PDF |
Heng-Ming Hsu, Sih-Han Lai, and Hsien-Feng Liao, National Chung Hsing University |
The behavior of the mutual inductance of on-chip transformers is discussed comprehensively in this work; the characterization covers both low- and high-frequency performance. Moreover, a corresponding equivalent circuit is proposed to describe the high-frequency characteristics. |
P1A-7¡@ |
A 1.76 uW, 0.9V, 8-bit Successive Approximation Register ADC with
Fully-Differential Input Capability |
PDF |
Á©v®ï¡B¬x¯E³ì, °ê¥ß¥æ³q¤j¾Ç ¡@ |
¡@ |
This paper presents a 0.9V, 8-bit
successive approximation register (SAR) analog-to-digital converter
(ADC) with a novel pseudo differential track-and-hold stage to accept
fully differential inputs. Most of the circuits in the ADC design are
single-ended to save the power consumption and silicon area. The ADC has
been designed in a 0.18£gm CMOS process. HSPICE simulation results show that at an output rate of 111KS/s, the proposed SAR ADC achieves a peak signal-to-noise-and-distortion ratio of 48.23dB, a rail-to-rail input range, and an effective resolution bandwidth no less than its Nyquist bandwidth. Its power consumption is as low as 1.76
£gW. |
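The successive-approximation principle behind a SAR ADC can be sketched as a binary search over the output code. The following is a minimal behavioral model of an ideal single-ended converter, not the circuit of the paper; the function name `sar_convert` and the default reference of 0.9 V (taken from the supply voltage in the abstract) are assumptions made for illustration.

```python
def sar_convert(vin, vref=0.9, bits=8):
    """Ideal SAR conversion: return the largest code whose DAC level is <= vin.

    vin  : input voltage, assumed to lie in [0, vref]
    vref : full-scale reference (hypothetical value for this sketch)
    bits : resolution of the converter
    """
    code = 0
    # Test one bit per cycle, from the MSB down to the LSB.
    for bit in reversed(range(bits)):
        trial = code | (1 << bit)
        # Ideal DAC output for the trial code.
        vdac = vref * trial / (1 << bits)
        # Comparator decision: keep the bit if the input is at or above the DAC level.
        if vin >= vdac:
            code = trial
    return code
```

An 8-bit conversion takes eight comparison cycles regardless of the input, which is why the architecture suits low-power, moderate-speed designs.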
P1A-8 |
A 3.1–10.6 GHz Ultra-Wideband CMOS Low Noise Amplifier Using Bridged-Shunt-Series Peaking Technique |
PDF |
Yu-Liang Lin, Feng-Lin Shiu, and Hwann-Kaeo Chiou, National Central University |
An ultra-wideband 3.1–10.6 GHz low-noise amplifier (LNA) that adopts an inductive peaking technique for bandwidth extension is presented. Fabricated in a 0.18 μm CMOS process, the proposed circuit achieves both maximum bandwidth and a maximally flat response. The feedback resistor provides a good input match while contributing only slightly to noise figure (NF) degradation. The presented LNA achieves a maximum power gain of 14.1 dB with a 3-dB bandwidth from 2.2 to 11 GHz, an NF of 3.4 to 4.5 dB across the entire UWB band, and an IIP3 better than -3 dBm, while drawing 30 mW from a 1.5 V supply. |
P1A-9 |
A Novel Infrared Tracking System with Winner-Take-All Implementation |
PDF |
Po-Hsiang Chang and Chih-Hsiung Shen, National Changhua University of Education |
This paper discusses a novel infrared tracking sensor array that measures the number and size of thermal objects, using a winner-take-all (WTA) circuit and a new preliminary level of on-chip thermopile array image processing. The sensor digitizes the thermal image by comparing current signals, which are controlled by the output voltages of the thermopile sensors, with a given threshold. The WTA circuit is used in combination with the readout circuit for an 8 × 8 pixel thermopile array. Realizing the WTA circuit with sharp selectivity makes it possible to pick only one winner from each object by utilizing the inherent mismatch in transistor characteristics. To simulate and demonstrate the infrared thermal sensor array, it is fully integrated in a 2P4M 0.35 μm standard CMOS technology. Results so far show that an integrated thermopile array with a WTA circuit can reach a high level of development and reliability and is easy to apply in high-accuracy infrared tracking applications. |
P1A-10 |
A New Multi-Function Wave Generator Based on Multiple-Output Second-Generation Current Conveyors |
PDF |
Yuh-Shyan Hwang, Yu-Wen Chen, Jiann-Jong Chen, and Wen-Ta Lee, National Taipei University of Technology |
A new multi-function wave generator based on multiple-output second-generation current conveyors (CCIIs) is presented in this paper. Under the control of on-chip switches, the output waveform can be modified to achieve signal modulation such as ASK, FSK, and PSK. The circuit consists of two multiple-output CCIIs, two resistors, and two grounded capacitors. The proposed circuit has been designed in a TSMC 0.35 μm DPQM CMOS process. HSPICE simulation results are presented to verify the theoretical predictions for ASK, FSK, and PSK. |
P1A-11 |
Design a Multiplicative type-II Fuzzy Cellular Neural Network with CMOS Image Sensor |
PDF |
Jui-Lin Lai, Yuan-Hung Lo, Yan-Ting Chen, and Rong-Jian Chen, National United University |
An architecture for a multiplicative type-II fuzzy cellular neural network (FCNN) with a CMOS image sensor is proposed. Its local connectivity makes it well suited to VLSI implementation. The proposed FCNN structure comprises neuron, Min/Max, analog multiplier, pixel and CDS, S/H, transfer, and control circuits. The FCNN performs specific functions according to the selected template and has been successfully verified in TSMC 0.35 μm 2P4M CMOS technology. The architecture has great potential for VLSI implementation of neural network systems that handle binary and gray-level patterns in image-processing applications. |
P1A-12 |
A 2.4GHz Current-reused VCO with Degenerated Resistors |
PDF |
Ruey-Lue Wang, Guo-Ruey Tsai, Yu-Feng Lin, and YuJo Tzeng, Kun Shan University |
In this paper, we present a current-reused voltage-controlled oscillator (VCO) that consists of a pair of NMOS and PMOS transistors. The proposed VCO is designed for 2.4 GHz operation in a TSMC 0.18 μm CMOS process. Measurement results show phase noise of -94.6 and -116 dBc/Hz at 100 kHz and 1 MHz offsets, respectively, at an oscillation frequency of 2.46 GHz. The current-reused VCO reduces power dissipation to half that of conventional differential topologies (core: 0.35 mA from a 1.8 V supply). The tuning range is 2.24 to 2.52 GHz for a tuning voltage between 0 and 2 V. |
TECHNICAL PROGRAM
Session
P1E |
Day
08/08
|
Time
13:30-16:00
|
Chair
Prof. Tsung-Chu Huang
National Changhua University of Education |
Room
2F Banquet Hall |
P1E-1 |
A Single-Clock Enhanced Random Access Scan |
PDF |
Chen-An Chen, Wei-Yi He, and Tsung-Chu Huang, National Changhua University of Education |
Random access scan architecture is an effective approach to simultaneously reducing test power, test data volume, and test time for stuck-at fault testing. In this paper, we develop a single-clock random access scan architecture combined with combinational output-observable logic that further reduces peak power and control wiring, using a single clock for both delay test and diagnosis. Two scan cells are developed, one for stuck-at faults and one for path delay faults. The observable logic makes vector ordering efficient, so that test data and test application time can be further reduced by 64%. Owing to this structure, the flip-flop array avoids flash capture operations and reduces peak power dissipation by up to 92%. Experiments and verification, including post-layout timing analysis, show that this is a multipurpose solution that addresses many SoC testing issues simultaneously. |
P1E-2 |
Area-Driven Decoupling Capacitance Allocation Based on Space Sensitivity Analysis |
PDF |
Jin-Tai Yan, Ming-Yuen Wu, and Zhi-Wei Chen, Chung Hua University |
Based on the space sensitivity of decoupling capacitors (decaps), an area-driven allocation approach is proposed that relieves the IR-drop constraint while minimizing the final area of a given floorplan. The approach integrates decap estimation and allocation, assigning feasible decaps around or near all circuit modules to eliminate the IR-drop noise in the floorplan. Experimental results show that the proposed approach obtains very promising timing and area results for MCNC benchmark circuits. |
P1E-3 |
A Topology-Based Construction for X-Architecture Clock Routing |
PDF |
Chia-Chun Tsai*, Chung-Chieh Kuo, Jan-Ou Wu, Trong-Yen Lee, and Rong-Shue Hsiao, Nanhua University |
Wire delay plays a critical role in high-performance clock routing, and shortening wirelength to reduce clock delay is an increasingly important objective. X-architecture outperforms the conventional Manhattan architecture in wirelength, power consumption, and clock performance. In this paper, we present a DME-X algorithm that combines the DME method with X-architecture to create a zero-skew clock tree. The algorithm constructs a parallelogram for each pair of sinks or points to determine its X-topology wiring, which simplifies the merging-segment procedure. Experimental results on benchmarks show a 7.9% improvement in clock delay compared with other algorithms. |
P1E-4 |
Routability-Driven Track Routing for Coupling Capacitance Reduction |
PDF |
Jin-Tai Yan, Zhi-Wei Chen, and Kuen-Ming Lin, Chung Hua University |
Given a routing panel, routability-driven ordering and location constraints are first set according to the pin positions of all the wire segments in the panel. Based on these constraints, an ASAP-based scheduling approach with efficient space insertion is then proposed to reduce the total coupling capacitance in routability-driven track routing. Experimental results show that the proposed approach obtains very promising routing results in reasonable CPU time for several benchmark circuits. |
P1E-5 |
A Timing-Driven X-Architecture Router with Obstacles |
PDF |
Shu-Ping Chang, Hsin Hsiung Huang, Yu-Cheng Lin, and Tsai Ming Hsieh, National Taitung University |
In this paper, we formulate a new X-architecture routing problem in the presence of obstacles and propose an X-architecture timing-driven routing tree construction algorithm that minimizes the maximum source-to-sink delay and the total wirelength simultaneously. First, we construct a spanning graph from the terminals and the corners of the obstacles. A minimal spanning tree is obtained by searching the spanning graph, and a feasible X-architecture routing tree is constructed by transforming the minimal spanning tree. For the X-architecture routing tree, the delay of each two-pin net is estimated with a modified Elmore delay model. According to a user-defined delay threshold, an efficient rerouting algorithm fixes the nets that violate timing: the critical terminals are iteratively rerouted by splitting the tree into two subtrees and merging them back into one tree. Compared with the routing result without rerouting, the maximum source-to-sink delay is improved by 61% at the cost of only 0.7% additional total wirelength. |
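For readers unfamiliar with the delay estimate mentioned in the P1E-5 abstract, the standard Elmore delay of an RC tree can be sketched as follows. This is the textbook model, not the modified model of the paper, whose details the abstract does not give; the function name `elmore_delays` and the dictionary-based tree representation are assumptions made for illustration.

```python
def elmore_delays(children, r, c, root=0):
    """Standard Elmore delay from the root to every node of an RC tree.

    children : dict mapping a node to the list of its child nodes
    r[n]     : resistance of the wire segment entering node n
    c[n]     : capacitance lumped at node n
    """
    down = {}  # total capacitance at and below each node

    def subtree_cap(n):
        down[n] = c[n] + sum(subtree_cap(k) for k in children.get(n, []))
        return down[n]

    subtree_cap(root)

    # Each segment adds (its resistance) x (all capacitance it must charge).
    delay = {root: 0.0}
    stack = [root]
    while stack:
        n = stack.pop()
        for k in children.get(n, []):
            delay[k] = delay[n] + r[k] * down[k]
            stack.append(k)
    return delay
```

The maximum value in the returned dictionary over the sink nodes corresponds to the maximum source-to-sink delay that a timing-driven router tries to minimize.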
P1E-6 |
Test Generation for Transition Delay and RS-CFM Faults in Modified Booth Multipliers |
PDF |
Hsing-Chung Liang and Pao-Hsin Huang, Chung Yuan Christian University |
In this paper, we propose a type of modified Booth multiplier and generate C-testable and linear-testable pattern pairs for transition delay faults (TDF) and realistic sequential cell faults (RS-CFM) in multipliers of various sizes. The patterns are generated at two description levels of the circuit: cell level and gate level. By analyzing the multipliers, we can generate 18 constant test pairs that detect TDF at cell level irrespective of multiplier size. Similarly, only 20 test pairs are enough to detect TDF at the synthesized gate level. These are far fewer test pairs than commercial tools generate, and those tools cannot generate constant test pairs either. Furthermore, to prepare test pairs that are independent of the interior structure of the cells, we also generate 104 + N × 10 SIC test pairs for an N × N multiplier. These patterns not only achieve very high fault coverage for TDF and RS-CFM but are also far fewer than those of a previous work for RS-CFM. |
P1E-7 |
Non-Slicing Floorplanning-Based Crosstalk Reduction on Gridless Track Assignment |
PDF |
Win-Nai Zheng, Yu-Ning Zhang, and Yih-Lang Li, National Chiao Tung University |
Track assignment, an intermediate stage between global routing and detailed routing, provides a good platform for improving performance and for imposing additional constraints during routing, such as crosstalk. Gridless track assignment (GTA) has not been addressed in the public literature. This work develops a gridless crosstalk-driven GTA. An initial assignment is produced rapidly with a left-edge-like algorithm. Crosstalk reduction on the assignment is then transformed into a restricted non-slicing floorplanning problem, and a deterministic O-tree-based algorithm is employed to reassign each net segment. Finally, each panel is partitioned into several sub-panels, and the sub-panels are reordered using a branch-and-bound algorithm to decrease the crosstalk further. Experimental results demonstrate that the proposed gridless crosstalk-driven GTA achieves over 80% reduction in the overlapping length of adjacent wires. |
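The initial assignment step in P1E-7 is described only as "left-edge-like". As a point of reference, a greedy assignment in the spirit of the classic left-edge algorithm can be sketched as below. It is an illustrative variant under simple assumptions (segments are closed intervals on one axis, and two segments may share a track only if they do not touch); the function name `left_edge` is hypothetical and the gridless details of the paper are not modeled.

```python
def left_edge(intervals):
    """Assign (left, right) intervals to tracks greedily by left endpoint.

    Returns a list of tracks; each track is a list of non-overlapping
    intervals in left-to-right order.
    """
    tracks = []
    # Visit segments in order of increasing left edge.
    for seg in sorted(intervals):
        for track in tracks:
            # Reuse the first track whose last segment ends before this one starts.
            if track[-1][1] < seg[0]:
                track.append(seg)
                break
        else:
            # No existing track is free: open a new one.
            tracks.append([seg])
    return tracks
```

For interval inputs this greedy uses as many tracks as the maximum number of segments overlapping at any point, which is the lower bound; crosstalk is not considered at this stage.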
P1E-8 |
Modified Essential Spare Pivoting Algorithm for Embedded Memories with Global Block-Based Redundancy |
PDF |
Chun-Lin Yang and Shyue-Kung Lu, Fu Jen Catholic University |
A block-based redundancy architecture is proposed in this paper. The redundant rows and columns are divided into row and column blocks, so faulty memory cells can be repaired at the row/column block level. Moreover, the redundant row/column blocks can replace faulty cells anywhere in the memory array. This global characteristic is helpful for repairing cluster faults. The proposed redundancy architecture can be easily integrated with embedded memory cores. Based on the proposed global redundancy architecture, a heuristic modified essential spare pivoting (MESP) algorithm suitable for built-in implementation is proposed. According to experimental results, the area overhead of implementing the MESP algorithm is negligible, and the repair rate is 99.94% for a 1M-bit (1024 × 1024-bit) SRAM. |
TECHNICAL PROGRAM
Session
P1D |
Day
08/08
|
Time
13:30-16:00
|
Chair
Prof. Tso-Bing Juang
National Pingtung Institute of Commerce |
Room
2F Banquet Hall |
P1D-1 |
A Comparative Study of LNS and Floating-Point arithmetic |
PDF |
Chih-Yen Fan and Chi-Chyang Chen, Feng Chia University |
Logarithmic number system (LNS) arithmetic is very efficient for complex operations such as multiplication, division, powering, and logarithmic functions. However, addition and subtraction in LNS arithmetic are difficult and incur a large hardware cost. In this paper, we compare and analyze the performance and hardware cost of arithmetic units in LNS and floating-point (FLP) arithmetic for various word lengths. The 16-bit and 20-bit LNS adders/subtractors are designed using lookup tables with table-reduction techniques. The 24-bit LNS architecture is based on table lookup with a simple approximation method that reduces the table size. Finally, direct computation is adopted for the 28-bit and 32-bit LNS adders/subtractors. We have designed 16-bit, 20-bit, 24-bit, 28-bit, and 32-bit LNS and FLP units with adders, subtractors, multipliers, and dividers in VHDL and synthesized them with a TSMC 0.18 μm CMOS cell library. From the synthesis and simulation results, we compare and analyze the advantages and disadvantages of the two number systems; the results can serve as a guideline for design engineers deciding when LNS arithmetic can be adopted for efficient digital system design. |
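The asymmetry described in P1D-1 (cheap multiplication, expensive addition) follows directly from the LNS representation. The sketch below models positive values by their base-2 logarithms in floating point; it is a numerical illustration only, not the fixed-point table-based hardware of the paper, and the function names are hypothetical.

```python
import math

def lns_mul(a, b):
    """Multiply two LNS values: log2(x * y) = log2(x) + log2(y)."""
    return a + b

def lns_div(a, b):
    """Divide two LNS values: log2(x / y) = log2(x) - log2(y)."""
    return a - b

def lns_add(a, b):
    """Add two LNS values: log2(x + y) = max + log2(1 + 2**-(|a - b|)).

    The correction term log2(1 + 2**d) with d <= 0 is the nonlinear
    function that hardware designs typically approximate or tabulate.
    """
    hi, lo = max(a, b), min(a, b)
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))
```

For example, `2 ** lns_add(math.log2(3), math.log2(5))` evaluates to approximately 8, while `lns_mul` needs only a single addition of exponents.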
P1D-2 |
Design of Low-Error Signed Fixed-Width Multipliers |
PDF |
Jiun-Ping Wang and Shiann-Rong Kuang, National Sun Yat-sen University |
A framework for designing a low-error signed fixed-width multiplier that receives two n-bit operands and generates an n-bit product is proposed. The proposed error compensation circuit not only gives signed fixed-width multipliers higher accuracy but can also be easily constructed from simple logic gates. Moreover, the proposed signed fixed-width multiplier is applied to the inverse discrete cosine transform (IDCT) computation in JPEG image compression. Experimental results demonstrate that the proposed circuit not only improves accuracy but also significantly reduces hardware complexity and propagation delay compared with the previous solution. |
P1D-3 |
A Novel VLSI Iterative Division Algorithm for Fast Quotient Generation |
PDF |
Tso-Bing Juang, National Pingtung Institute of Commerce |
In this paper, a novel VLSI iterative division algorithm for fast quotient generation, based on radix-2 non-restoring division, is proposed. To speed up quotient generation, our method uses the magnitude difference between the partial dividend and the divisor for the next iteration, so that the proper weight of the quotient can be obtained more rapidly than with conventional methods. The proposed architecture is very simple compared with multiplication-based methods such as those based on Newton-Raphson. Simulation results show that the proposed method requires fewer than half the iterations of conventional division (i.e., fewer than n/2 versus n, where n is the bit-width of the dividend and the divisor). The proposed algorithm can be employed in digital signal processing and 3D graphics processing applications to accelerate compute-intensive division operations. |
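As a baseline for the comparison in P1D-3, conventional radix-2 non-restoring division can be sketched as follows. This is the standard n-iteration algorithm that the abstract compares against, not the accelerated method of the paper; the function name `nonrestoring_divide` and the unsigned-operand assumption are choices made for this illustration.

```python
def nonrestoring_divide(dividend, divisor, n):
    """Unsigned n-bit radix-2 non-restoring division.

    Returns (quotient, remainder). One quotient bit is produced per
    iteration, so the loop always runs n times.
    """
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    rem = 0
    quo = 0
    for i in reversed(range(n)):
        bit = (dividend >> i) & 1
        # Shift in the next dividend bit, then subtract or add the divisor
        # depending on the sign of the previous partial remainder.
        if rem >= 0:
            rem = 2 * rem + bit - divisor
        else:
            rem = 2 * rem + bit + divisor
        # The quotient bit is 1 when the new partial remainder is non-negative.
        quo = (quo << 1) | (1 if rem >= 0 else 0)
    # Final correction step: a negative remainder is restored once.
    if rem < 0:
        rem += divisor
    return quo, rem
```

Calling `nonrestoring_divide(13, 3, 4)` performs four iterations and returns `(4, 1)`.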
P1D-4 |
Reusing Cache for Real-Time Memory Address Trace Compression |
PDF |
Chung-Fu Kao, Chun-Hung Lai, and Ing-Jer Huang, National Sun Yat-sen University |
The program execution trace is an efficient debugging aid for analyzing and verifying software programs and hardware architectures. However, a major problem of tracing is the high cost of storing the traces, so reducing the trace information or compressing the trace volume is an important issue when debugging a system. A cache-reuse method for real-time program execution trace compression is proposed. The method exploits the temporal and spatial locality of programs and reuses the system cache to trace program addresses. The advantage is that reusing the cache with minor hardware modification not only saves the overhead of a hardware compressor but also obtains a high compression ratio. Experimental results show that the proposed approach incurs little hardware area overhead while achieving a compression ratio of approximately 90% in real time. |
P1D-5 |
A Novel Membership Function Approximation for Effective Digital Circuit Design of Neural Networks |
PDF |
Che-Wei Lin and Jeen-Shing Wang, National Cheng Kung University |
This paper proposes a novel approximation approach for a commonly used membership function of neural networks. We focus on the approximation of the hyperbolic tangent sigmoid function implemented in a digital circuit. The average error and maximum error of the proposed approach are on the order of 10^-3 and 10^-2, respectively. With the aid of effective scheduling and allocation, the hardware implementation of the proposed method uses only one multiplication unit and one addition/subtraction ALU. |
P1D-6 |
A Novel Architecture for Self-Reconfigurable Systems |
PDF |
Trong-Yen Lee, Yung-Lin Hsu, Che-Cheng Hu, National Taipei University of Technology |
Dynamically reconfigurable systems will be used in consumer electronics. State-of-the-art FPGAs provide the capability for fast, partial reconfiguration. We propose a novel architecture for multi-region self-reconfigurable systems that can handle all reconfiguration operations. The new elements of the proposed architecture are the wrapper, the bus macro, and the arbiter. The wrapper and bus macro can connect to various kinds of hardware modules and transmit multiple data items. The arbiter manages the data flow between the hardware modules and the MicroBlaze processor and decides which region will be reconfigured. Experimental results show that the proposed architecture supports multiple modules, direct detection of hardware module functions by the arbiter, 160 wrapper I/O ports, and multi-data transfer over the bus macro. |
P1D-7 |
VLSI Implementation for Block-Based Gradient Domain High Dynamic Range Compression |
PDF |
Tsun Hsien Wang, Wei-Ming Ke, Chih-Hsueh Huang, Ding-Chuang Zwao, Fang-Chu Chen, and Ching-Te Chiu, National Tsing Hua University |
Owing to rapid progress in high dynamic range (HDR) capture technology, displaying HDR content on conventional LCD devices has become an important topic. Tone mapping algorithms have been proposed for rendering HDR images on conventional displays, but they are impractical for video applications because of their intensive computation time. In this paper, we present a real-time block-based gradient domain HDR compression for image and video applications. Gradient domain HDR compression is selected as the tone mapping scheme for its high compression and detail preservation. We divide each HDR image or frame equally into several blocks and process each block with the modified gradient domain HDR compression. Gradients with small magnitudes are attenuated less in each block to maintain local contrast and thus expose detail. We reconstruct a low dynamic range image by solving the Poisson equation on the attenuated gradient field block by block. A real-time block-based gradient domain compression architecture with a discrete sine transform (DST) is proposed to tone-map HDR video sequences, including solving the Poisson equation. Synthesis and layout results show that the tone-mapping design runs at a 50 MHz clock and occupies an area of 5.29 mm² in TSMC 0.13 μm technology. |
P1D-8 |
High Performance Decoder Design for Convolutional LDPC Codes |
PDF |
Mu-Chung Chen, Jun-Wei Lin, Yen-Shuo Chang, Jin-Hao Yu, and Tzi-Dar Chiueh, National Taiwan University |
In this paper, a new Convolutional Low-Density Parity Check Code (LDPC-CC) decoder has been designed and implemented. The proposed design can reach the same bit error rate performance while having lower computation complexity. We have proposed a new parity-check matrix that leads to saving in clock cycles. In addition, some circuit techniques, such as Wallace tree structure and linear approximation, are applied in our design in order to further improve the throughput of the proposed decoder. Finally the decoder PE is designed and its layout is implemented. This work provides a solid foundation for efficient and effective LDPC-CC decoder design. |
P1D-9 |
A Novel Design for Computation of All Transforms in H.264/AVC Decoders |
PDF |
Yi-Chih Chao, Hui-Hsien Tsai, Yu-Hsiu Lin, Jar-Ferr Yang, and Bin-Da Liu, National Cheng Kung University |
In this paper, we design a novel architecture for computing all the transforms required in an H.264/AVC high-profile decoder. This flexible architecture can compute 8-point and 4-point integer transforms as well as 4-point and 2-point Hadamard transforms, so the implementation chip area can be reduced dramatically. With a throughput of 8 pixels/cycle, the proposed design completes the computation for one macroblock in 95 clock cycles when the 8 × 8 inverse transform is involved, or in 54 clock cycles without it. Simulation results show that the implemented area is 18.5k gates and the maximum clock frequency is 125 MHz. For real-time operation, the architecture can handle all existing frame sizes in 4:2:0 format. For example, operated at 106 MHz, it achieves 4096 × 2304 at 30 frames/s. |
P1D-10 |
Design of a 2X2 MIMO OFDM Transceiver With Correction of Different Carrier Frequency Offsets at Transmitter Antennas |
PDF |
Li-Wen Hsu and Dah-Chung Chang, National Central University |
The combination of multiple-input multiple-output (MIMO) and orthogonal frequency division multiplexing (OFDM) is regarded as the next-generation wireless LAN technology. In this paper, the design of a 2 × 2 space-time block coded (STBC) MIMO-OFDM baseband transceiver is studied based on the IEEE 802.11n proposal. To handle the different phase offsets between the transmitter antennas caused by antenna resistance matching problems, we propose a new carrier frequency offset tracking algorithm at the receiver. The overall design is verified on a Xilinx FPGA with an implementation loss of about 1.5 dB. |
TECHNICAL PROGRAM
Session
A2 |
Day
08/09 |
Time
10:00-11:30
|
Chair
Prof. Yin-Tsung Hwang
National Chung Hsing University |
Room
301 |
10:00 |
A2-1 |
A Redundancy Detection Algorithm for DCT and Quantization in H.264 Video Encoding |
PDF |
Ting-Wei Chen, Chang-Hsin Cheng, Yu Liu, and Chun-Lung Hsu, National Dong Hwa University |
This paper proposes a novel, efficient algorithm that detects redundant computation early by using the relationship between the time-domain pixel values and the low-frequency DCT coefficients. The redundancy detection algorithm achieves high coding efficiency and low coding complexity without hampering image quality. Extensive performance evaluation shows that the proposed algorithm can reduce DCT and quantization time by up to 57.47%. It is expected to be an essential algorithm for making hardware/software H.264 codecs feasible. |
10:15 |
A2-2 |
Skip Control Algorithm of Motion Estimation for Power-scalable H.264 Video Encoder |
PDF |
Chieh Chien, Yu-Han Chen, and Liang-Gee Chen, National Taiwan University |
In this paper, we present a power control algorithm for a video encoder that maximizes power saving while maintaining video quality. A pre-skip algorithm is adopted to provide power scalability. A skip ratio determination algorithm then finds the most suitable power-saving point that satisfies the video quality constraints, saving power for mobile devices under user-defined quality constraints. Next, a skip threshold control algorithm steers the encoding power to this power-saving point. Simulation results show that the proposed algorithm accurately achieves the most suitable power saving under a defined quality constraint. |
10:30 |
A2-3 |
Efficiency-Enhanced Multilevel LINC System Design |
PDF |
Kai-Yuan Jheng, Yuan-Jyue Chen, and An-Yeu (Andy) Wu, National Taiwan University |
Linear amplification with nonlinear components (LINC) is a power amplifier (PA) linearization method that offers both high PA efficiency and high linearity for wireless transmitters. While LINC increases the PA efficiency, it requires an extra power combiner, which results in low system efficiency. To solve this problem, we propose a multilevel out-phasing (MOP) scheme and a corresponding architecture, multilevel LINC (MLINC), to increase the power combiner efficiency of wireless transmitters. Under WCDMA specifications, we demonstrate a 3-level MLINC as a design example that raises power combiner efficiency from 44.5% to 75.5%. |
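The out-phasing idea underlying LINC can be illustrated with a short numerical sketch: a baseband sample with varying envelope is split into two constant-envelope components whose sum reproduces the sample. This shows only the classic single-level decomposition, not the multilevel (MOP) scheme of the paper; the function name `outphase` and the parameter `a_max` (the fixed combined amplitude) are assumptions made for illustration.

```python
import cmath
import math

def outphase(s, a_max):
    """Split complex sample s (|s| <= a_max) into two constant-envelope parts.

    Both outputs have magnitude a_max / 2, and s1 + s2 == s up to rounding.
    """
    a, phi = abs(s), cmath.phase(s)
    # Out-phasing angle: a small envelope needs a large angle between the parts.
    theta = math.acos(min(a / a_max, 1.0))
    half = a_max / 2.0
    s1 = cmath.rect(half, phi + theta)
    s2 = cmath.rect(half, phi - theta)
    return s1, s2
```

When the envelope is small relative to `a_max`, the two components largely cancel in the combiner, which is the source of the low combiner efficiency that a multilevel choice of amplitude aims to reduce.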
10:45 |
A2-4 |
A Scalable Frame-Pipeline Motion Estimation Processor for Full-Search Algorithm |
PDF |
Yeong-Kang Lai, Lien-Fei Chen, Yin-Ruey Huang, and Sheng-Yu Huang, National Chung Hsing University |
In this paper, a scalable two-dimensional pipelined motion estimation processor for the full-search block-matching algorithm (FSBMA) is proposed. The proposed 2-D motion estimation processor performs the block-matching operations of consecutive frames smoothly, without any processor idle time at frame boundaries. Moreover, we propose a scalable architecture that satisfies the throughput requirements and reduces the external memory bandwidth with level C+ data reuse. Experimental results show that the architecture accomplishes 100% fully pipelined computation at frame level and meets the performance requirements of different video applications. |
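The computation that an FSBMA processor accelerates can be sketched in software as an exhaustive search over candidate displacements using the sum of absolute differences (SAD). The sketch below is a plain reference model under simple assumptions (frames as lists of rows, candidates outside the reference frame skipped); the function names are hypothetical and no pipelining or data reuse is modeled.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    return sum(abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
               for y in range(n) for x in range(n))

def full_search(cur, ref, bx, by, n, search):
    """Try every displacement in [-search, search]^2; return (cost, dx, dy)
    of the best match, or None if no candidate lies inside the frame."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidates that fall outside the reference frame.
            if not (0 <= by + dy and by + dy + n <= h
                    and 0 <= bx + dx and bx + dx + n <= w):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

Every block requires (2 × search + 1)² SAD evaluations, which is why full search is usually mapped onto dedicated parallel hardware.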
11:00 |
A2-5 |
High Throughput Embedded Compression Engine for High-End LCD Applications |
PDF |
Tsung-Han Tsai, Yu-Yu Lee, and Yu-Xuan Lee, National Central University |
As LCD panel technology advances into the high-definition (HD) era, the data rate in video post-processing increases drastically, and memory bandwidth and memory size become primary concerns. An embedded compression engine is used to reduce both. In this paper, color difference pre-processing (CDP) is proposed to improve coding efficiency; simulation results show that CDP improves coding efficiency by as much as 40.4%. Moreover, the proposed segment scan manner (SSM) gives the hardware flexible parallelism while sacrificing only 0.1% to 0.42% of coding efficiency. Based on SSM, the hardware parallelism of the proposed VLSI architecture, which uses ping-pong mode memory partitioning, can be flexibly increased for various high-end LCD specifications. |
11:15 |
A2-6 |
A Novel Low Complexity Pulse-Triggered Flip-Flop Design with Dual Triggering Mode |
PDF |
Jin-Fa Lin, Yin-Tsung Hwang, Ming-Hwa Sheu, and Wei-Rong Ciou, National Chung Hsing University |
In this paper, a novel dual-mode pulse generator design with the lowest transistor count known in the literature is first presented. The pass-transistor logic (PTL) based design reduces circuit complexity and input loading capacitance, which lowers power consumption. Both the threshold voltage loss and the insufficient driving capability common in PTL are also resolved in our design to support low-Vdd operation. Based on the proposed pulse generator circuitry, a pulse-triggered flip-flop with dual triggering modes (single/double edge) is presented. This design, called PET-FF (Programmable Edge-Triggered Flip-Flop), features functional versatility and low-voltage operation in addition to the circuit complexity and power consumption advantages inherent in pulse-triggered FF designs. Simulations in a TSMC 0.18 μm CMOS process show that the proposed FF designs, which provide extra triggering mode selection, outperform conventional FF designs and achieve power and delay performance similar to that of peer single-mode pulse-triggered FF designs. |
TECHNICAL PROGRAM
Session
B2 |
Day
08/09 |
Time
10:00-11:30
|
Chair
Prof. Shi-Yu Huang
National Tsing Hua University |
Room
305 |
10:00 |
B2-1 |
A Protocol-Reconfigurable Double-Layer External Memory Management for H.264/AVC Decoder |
PDF |
Chang-Hsuan Chang, Ming-Hung Chang, and Wei Hwang, National Chiao Tung University |
In this paper, a protocol-reconfigurable double-layer external memory management scheme for an H.264/AVC decoder is proposed. An H.264/AVC decoder must transfer a large amount of data to and from off-chip memory, so data access latency and power consumption greatly affect the performance of the whole system. The proposed memory controller consists of two layers. The first layer is address translation, which provides an efficient pixel data arrangement to reduce row-miss occurrences. The second layer is the external memory interface (EMI), which further reduces access latency by up to 70% using a dedicated command FIFO and a unified FSM with generic scheduling. The EMI settings can be reconfigured to suit different external memory modules. In particular, combining the address translation layer with the EMI increases memory utilization about threefold compared with the traditional method. |
10:15 |
B2-2 |
A Energy-Efficient 256X144 TCAM Design |
PDF |
Wen-Yen Liu, Po-Tsang Huang, and Wei Hwang, National Chiao Tung University |
In this paper, a low-power, high-speed ternary content addressable memory (TCAM) is presented. For network routers, super cut-off and multi-mode data-retention power gating techniques are applied to reduce leakage currents. In addition, the match-lines are implemented with an XOR-based conditional keeper, butterfly connection, and don't-care-based power gating to achieve energy-efficient search operations. A hierarchical search-line scheme reduces power consumption further. Based on the 65 nm Berkeley Predictive Technology Model (PTM), simulation results show a leakage power reduction of 70.7% and a TCAM energy metric of 0.047 fJ/bit/search. |
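The search operation that a TCAM performs in hardware can be modeled in a few lines: each entry stores a value and a mask marking which bits must match, and the remaining bits are don't-care. The sketch below is a functional model only, with a sequential loop standing in for the parallel match-lines; the function name `tcam_search` and the (value, care_mask) entry format are assumptions made for illustration.

```python
def tcam_search(entries, key):
    """Return the index of the first entry matching `key`, or None.

    entries : list of (value, care_mask) pairs; a 0 bit in care_mask
              marks a don't-care position
    key     : search word as an integer
    """
    for i, (value, mask) in enumerate(entries):
        # XOR exposes the mismatching bits; the mask hides don't-care bits.
        if ((key ^ value) & mask) == 0:
            return i  # lowest index wins, modeling a priority encoder
    return None
```

In a router, entries are typically ordered so that longer prefixes come first, which makes the first match the longest-prefix match.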
10:30 |
B2-3 |
Energy-Efficient and High-Performance Power Gating in Microprocessor Functional Units |
PDF |
Chang-Ching Yeh, Kuei-Chung Chang, Tien-Fu Chen, and Chingwei Yeh, National Chung Cheng University |
In current high-performance microprocessors, more functional units are required to satisfy increasing computation demands, but they also result in greater leakage energy consumption. Microarchitectural power gating of functional units detects suitable idle regions and turns the units off to reduce leakage energy, but ready instructions always have to wait in the issue queue for the required functional units to wake up, so the wakeup overhead is incurred repeatedly. In this paper, we present a time-based power gating with reference prewakeup (PGRP) technique for an in-order processor that reduces leakage energy without degrading performance. We exploit code sequentiality and implement a branch-based execution history buffer (PGRP-buffer) for on-demand wakeup prediction. Simulation with benchmarks from SPEC2000 shows that considerable leakage energy can be saved with less than 1% performance impact. |
10:45 |
B2-4¡@ |
A Mini Stereo Digital Audio Processor Design |
¡@ |
PDF |
Po-Yu Kuo, Dian Zhou, and Zhi-Ming Lin, University of Texas at Dallas |
This paper presents the implementation of a programmable finite impulse response (FIR) digital filter for a Mini Stereo Digital Audio Processor (MSDAP). The performance of the MSDAP is expected to match that of two general DSP (Digital Signal Processing) chips implementing two-channel FIR digital filtering in audio applications. Implemented in TSMC 0.18 μm technology, the MSDAP runs at a data clock rate of 768 kHz and a system clock rate of 6.2 MHz. With a 1.98 V power supply, power dissipation is about 2.539 mW.
|
11:00 |
B2-5¡@ |
Adaptive Sensing Control in SRAM Design Using Per-Column Timing Tracking
Scheme |
¡@ |
PDF |
Ya-Chun Lai, Ming-Yi Chang, and Shi-Yu Huang, National Tsing Hua University |
This paper presents a new timing tracking scheme in an SRAM design for enhancing the tolerance of bitline delay variation. This scheme, which modifies the circuitry around each sense amplifier, allows each SRAM column to operate according to its own timing. Thus, each latch-type sense amplifier can be turned on at the right time and the pulse width of the active wordline can be tuned to its optimal width on the fly, no matter how much the operating speed of one bitline differs from that of the others. Monte-Carlo analysis of both the proposed and the conventional timing tracking schemes in a 22-nm predictive technology model demonstrates that this scheme has 18% better parametric yield than the conventional dummy bitline timing tracking scheme. |
11:15 |
B2-6¡@ |
Compact Dual-Core Architecture |
¡@ |
PDF |
Jih-Ching Chiu and Yu-Liang Chou, National Sun Yat-sen University |
A novel architecture for chip-multiprocessor operation, called the compact dual-core architecture (CDA), is proposed in this paper. Unlike well-known chip multiprocessors, this architecture supports superscalar operation, so data sharing can be accomplished through register-file sharing. In the CDA, programmers can use the proposed instruction set to switch dynamically among three modes: the superscalar processing mode, the multithreaded mode, and the single-processing mode, which maintains the flexibility to support multithreaded applications. Compared with the 5-stage single pipe architecture, the CDA obtains an average performance speedup of 33% in the superscalar processing mode. |
¡@
|
Session
C2 |
Day
08/09 |
Time
10:00-11:30
|
Chair
Prof. Chien-Nan Kuo
National Chiao Tung University |
Room
315 |
10:00 |
C2-1¡@ |
A HQPM-Based Transmitter with Digital Predistortion Scheme for Enhancing
Average Efficiency |
¡@ |
PDF |
C.-T. Chen, C.-J. Li, T.-S. Horng, J.-K. Jau, J.-Y. Li, P.-K. Horng,
and D.-S. Deng, National Sun Yat-sen University |
¡@ |
¡@ |
This paper presents the comparison of
signal quality and efficiency performance between simulation and
experiment for a linear RF transmitter based on a hybrid quadrature polar modulation (HQPM) architecture with digital predistortion scheme. The measurement results of the Class-E power amplifier and Class-S modulator are applied in the simulation. The path-delay difference between the envelope and the phase signal is also considered. Adjacent channel power ratios with and without digital predistortion are both shown to demonstrate improvement from the baseband predistortion algorithm. The simulated and experimental results have validated the key feature of the HQPM architecture, that its high efficiency characteristics are not sensitive to the output power level.
|
10:15 |
C2-2¡@ |
Limitation and Improvement of a Modified Precharge Phase Frequency
Detector for Wireless Frequency Synthesizer Applications |
¡@ |
PDF |
C.-J. Li, C.-B. Lo, S.-W. Li, T.-S. Horng, and K.-C. Peng, National Sun Yat-sen University |
A modified precharge phase frequency detector (PFD) that is inherently high speed and dead-zone free is designed and fabricated in a 0.18 μm CMOS process. Through analysis and measurement, this kind of PFD is found for the first time to have a limit on its minimum comparison frequency, which makes it difficult to apply to wireless frequency synthesizers. In this work, a novel charge compensation circuit is proposed and implemented on the same CMOS chip as the frequency synthesizer to solve this problem.
|
10:30 |
C2-3¡@ |
A 0.8V SOP-Based Wideband Fourth-Order Cascade Delta-Sigma Modulator |
¡@ |
PDF |
Chien-Hung Kuo and Shuo-Chau Chen, Tamkang University |
In this paper, a 0.8 V switched-opamp (SOP)-based wideband 2-2 cascade delta-sigma modulator is presented. Double sampling is used to improve the clock efficiency and relax the requirements on the SOP. Based on the low-distortion topology, a CIFF-CIFB structure is adopted to improve the resolution of the cascade modulator. The proposed modulator has been implemented in a 0.13 μm 1P8M technology. The peak signal-to-noise-and-distortion ratio (SNDR) of the modulator in a 1.1 MHz bandwidth is 68 dB at a 20 MHz clock rate. The power dissipation of the modulator is 15.7 mW at a 0.8 V supply voltage.
|
10:45 |
C2-4¡@ |
Sub-mW 5-GHz Receiver Front-End Circuit Design |
¡@ |
PDF |
Tatao Hsu, Yen-Lin Liu, Shu-Hui Yen, and Chien-Nan Kuo, National Chiao Tung University |
In this work, a 5-GHz receiver front-end is designed for wireless sensor network applications. The circuit topology is chosen to suit a low supply voltage below 1 V. The stability of the LNA circuit is ensured by adding reactive components. Total power consumption of the fabricated circuit is 0.86 mW, of which 0.7 mW goes to the LNA stage. The measured return loss and conversion gain are 11 dB and 25 dB, respectively. The noise figure is 12 dB and the IIP3 is around -6.5 dBm. |
11:00 |
C2-5¡@ |
A Low Voltage Full-band Cascoded UWB LNA |
¡@ |
PDF |
Ruey-Lue Wang, Min-Chhuien Lin, and Zhi-Cheng Lin, Kun Shan University |
In this paper, a 3.1-10.6 GHz ultra-wideband (UWB) low-voltage low noise amplifier (LNA) employing only a one-stage cascoded topology and an additional voltage-current feedback is presented. The research is based on the TSMC 0.18 μm CMOS process. Measurement results show the following performance: maximum power gain of 9.18 dB, ±0.9 dB gain flatness over the full band, minimum noise figure of 4.1 dB, input-referred third-order intercept point (IIP3) of 7.25 dBm, and input-referred 1-dB compression point (P1dB) of -2.5 dBm. The power consumption is 23.5 mW under a 1.0 V supply voltage. The chip area is 0.995 mm × 0.780 mm. |
11:15 |
C2-6¡@ |
80-S/s Delta Sigma Modulators For IR Thermometer |
¡@ |
PDF |
Jen-Shiun Chiang, Hsin-Liang Chen, Yao-Tsung Chang, and Meng-Hsuan Ho, Tamkang University |
In this paper, high-performance 80-S/s delta sigma modulators for IR thermometer applications are presented. Three methods are used to implement these modulators. The first uses a general technique. The second uses the chopper technique to cancel offset and remove 1/f noise. The third uses correlated double sampling (CDS), mainly to cancel offset. These third-order, 1-bit-quantizer, single-loop delta sigma modulators achieve dynamic ranges of 86, 89, and 88 dB and signal-to-noise-and-distortion ratios of 80, 82, and 83 dB, respectively. The circuits are implemented in a standard 0.35-μm 2P4M CMOS technology. The chip areas are 1.57 mm² (1.37 mm × 1.15 mm), 2.25 mm² (1.5 mm × 1.5 mm), and 2.25 mm² (1.5 mm × 1.5 mm), and the power consumption is 2.2 mW, 1 mW, and 1.1 mW, respectively, at a 3-V supply. |
¡@
|
Session
D2 |
Day
08/09 |
Time
10:00-11:30
|
Chair Prof. Iris Hui-Ru Jiang
National Chiao Tung University |
Room
308 |
10:00 |
D2-1¡@ |
On Power-State-Aware Routing and Buffer Insertion |
¡@ |
PDF |
Ming-Hua Wu and Iris Hui-Ru Jiang, National Chiao Tung University |
¡@ |
¡@ |
Interconnect delay and low power are two of
the main issues in nanotechnology. Buffer insertion during routing can
reduce interconnect delay; multiple supply voltage can lower power
consumption. However, buffering without considering power states may
cause the signal integrity problem. In this paper, we first propose an
algorithm to construct a buffered routing tree considering power states
for dual supply voltage designs. Our approach can simultaneously
minimize power, satisfy timing constraints and maintain signal
integrity. The results show that this method is promising: for example, it constructs
a buffered routing tree with 37 sinks in less than 4 seconds while
maintaining signal integrity. |
10:15 |
D2-2¡@ |
An Obstacle-Avoiding Rectilinear Steiner Minimal Tree Construction
Algorithm |
¡@ |
PDF |
Ya Wen Tsai, Yung Tai Chang, Jun Cheng Chi,
and Mely Chen Chi, Chung Yuan Christian University |
We present a construction-by-correction
approach to solve the Obstacle-Avoiding Rectilinear Steiner Minimal Tree
(OARSMT) construction problem. We build an
obstacle-weighted spanning tree as guidance for
constructing the OARSMT on an escape graph. We use
Dijkstra's algorithm for routing. A U-shaped
removal refinement is applied during the routing process to
further reduce the wire length. Our experimental results
show that, compared with several state-of-the-art works,
this algorithm achieves the shortest average total
wirelength. It also has a short run time for practical-size
problems. |
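To make the routing step concrete, here is a minimal Python sketch of Dijkstra's algorithm finding a shortest rectilinear route around obstacles. It is an illustration under simplifying assumptions: it searches a uniform unit grid rather than the escape graph used in the paper, treats obstacles as blocked grid points, and omits the obstacle-weighted spanning tree and the U-shaped refinement. The function name and parameters are hypothetical.

```python
import heapq
from typing import Dict, List, Optional, Set, Tuple

Point = Tuple[int, int]


def shortest_route_length(start: Point, goal: Point, obstacles: Set[Point],
                          width: int, height: int) -> Optional[int]:
    """Length of a shortest obstacle-avoiding rectilinear route, or None if blocked."""
    best: Dict[Point, int] = {start: 0}
    frontier: List[Tuple[int, Point]] = [(0, start)]
    while frontier:
        dist, (x, y) = heapq.heappop(frontier)
        if (x, y) == goal:
            return dist
        if dist > best[(x, y)]:
            continue  # stale heap entry; a shorter route to this point was found
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            inside = 0 <= nx < width and 0 <= ny < height
            if not inside or (nx, ny) in obstacles:
                continue
            new_dist = dist + 1  # every grid edge has unit wire length
            if new_dist < best.get((nx, ny), new_dist + 1):
                best[(nx, ny)] = new_dist
                heapq.heappush(frontier, (new_dist, (nx, ny)))
    return None
```

On a 5 by 5 grid with a two-point wall at x = 2, a route from (0, 0) to (4, 0) has to detour above the wall, which doubles its length from 4 to 8.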
10:30 |
D2-3¡@ |
A Network-Flow Based Algorithm for Digital Microfluidic Biochip Routing |
¡@ |
PDF |
Ping-Hung Yuh, Chia-Lin Yang, and Yao-Wen Chang, National Taiwan University |
Due to recent advances in microfluidics, digital microfluidic biochips are expected to revolutionize laboratory procedures.
One critical problem for biochips is the droplet routing problem.
Unlike traditional VLSI routing problems, in addition to routing path selection, the biochip routing problem needs to address the issue of scheduling droplets under the practical constraints imposed by fluidic property and timing restriction of synthesis result.
Therefore, the biochip routing problem is more complicated than traditional VLSI routing.
In this paper, we present the first network-flow based routing algorithm that can concurrently route a set of non-interfering nets for the droplet routing problem on biochips.
We adopt a two-stage technique of global routing followed by detailed routing.
In global routing, we first identify a set of non-interfering nets and then adopt the network-flow approach to generate optimal global-routing paths for the nets.
In detailed routing, we present the first polynomial-time algorithm for simultaneous routing and scheduling using the global-routing paths with a negotiation-based routing scheme.
The experimental results show the effectiveness and efficiency of our algorithm.
|
10:45 |
D2-4¡@ |
A Transitive-Closure-Graph-Based Macro Placement Algorithm |
¡@ |
PDF |
Hsin-Chen Chen, Yi-Lin Chuang, Zhe-Wei Jiang, and Yao-Wen Chang, National Taiwan University
¡@ |
¡@ |
¡@ |
In this paper, we propose a
transitive-closure-graph-based (TCG-based) macro placement algorithm that removes macro overlaps and optimizes macro positions. Improving over TCG by working only on its essential edges without
loss of the solution quality, our algorithm can efficiently and effectively search for a high quality macro geometric relation. Instead of packing macros along chip boundaries like the most recent previous work, our placer can determine a non-compacted macro placement by linear programming and placement region cost evaluation. Our macro placer is so flexible and versatile that it can easily extend the linear programming formulation to handle various placement constraints/objectives. Combined with various leading academic placers, our macro
placer can consistently and significantly reduce the wirelength, implying that our macro placer is robust and has very high quality. For example, based on the ISPD'06 placement benchmarks, combined with our macro placer, the resulting wirelength of Capo 10.2, mPL6, and NTUplace3 can further be reduced by 5%, 6%, and 15% on average, respectively. |
11:00 |
D2-5¡@ |
Optimal Redundant Via Insertion Using Mixed Integer Linear Programming |
¡@ |
PDF |
Kuang-Yao Lee, Ting-Chi Wang, and Kai-Yuan Chao,
National Tsing Hua University |
¡@ |
¡@ |
Redundant via insertion is highly
recommended to improve chip yield and reliability. The well-studied
double-cut via insertion (DVI) problem allows a single via in a chip to have at most one redundant via inserted next to it, but the solution to this problem is not good enough particularly for high-activity and power nets because those nets typically need more redundant vias to further enhance reliability. This motivates us to study in this paper a new problem, called the multiple-cut via insertion (MVI) problem, in which one redundant via or more can be inserted next to a single via such that the amount of single vias with redundant vias inserted next to them and the amount of inserted redundant vias are both maximized. We formulate the MVI problem as a mixed integer linear programming (MILP) problem. To make the problem tractable, we further break the MILP problem into a set of much smaller MILP problems each of which is solved independently and efficiently without sacrificing the optimality. Besides, we identify that the DVI problem is just a special case of the MVI problem, and therefore our MILP approach can be easily adapted to optimally solve the DVI problem as well. To the best of our knowledge, none of the existing DVI works can guarantee the optimality. The extensive experimental results are provided to support the efficiencies of our MILP approaches on both the MVI and DVI problems.
|
11:15 |
D2-6¡@ |
A Simple Yet Efficient Global Router with Mirrored Monotonic Routing and
Reduced Multi-Source Multi-Sink Maze Routing |
¡@ |
PDF |
Ke-Ren Dai, Jyun-Yi Lin, and Yih-Lang Li, National Chiao Tung University |
The traditional VLSI physical design flow is composed of floorplanning, placement, global routing and detailed routing. A fast global router can help placers accurately estimate wire length and routability. A high-quality global router increases routability for detailed routers. In this work, we develop a high-performance congestion-driven global router that quickly produces better routing results than an ILP-based global router. Based on the routing flow of FastRoute 2.0, we develop an enhanced routing flow, a simplified multi-source multi-sink maze routing, and mirrored monotonic routing. Experimental results reveal that our router eliminates many overflows at little runtime cost. |
¡@
|
Session
E2 |
Day
08/09 |
Time
10:00-11:30
|
Chair
Prof. Jin-Hua Hong
National University of Kaohsiung |
Room
318 |
10:00 |
E2-1¡@ |
Test Data and Test Time Reduction for LOS Transition Test in Multi-Mode
Segmented Scan Architecture |
¡@ |
PDF |
Sying-Jyan Wang, Po-Chang Tsai, Hung-Ming Weng,
and Katherine Shu-Min Li,
National Chung Hsing University |
¡@ |
¡@ |
Launch-off-Shift (LOS) is a widely used
technique for delay test in scan-based design. Test data compression for
LOS patterns, however, is less efficient. In this paper, we first
analyze the reason for low compression rate in LOS patterns, and present
an LOS test enabled scan architecture that supports three operation
modes: broadcast, multicast, and serial. Efficient LOS test data
compression can be achieved under this architecture with limited
hardware overhead. An ATPG method for LOS test patterns under the proposed architecture is also presented. Experimental results show that most of the serial scan operations can be replaced by multicast operations, and thus achieve much better compression rate.
|
10:15 |
E2-2¡@ |
Test Efficiency Analysis of SOC Test Platforms |
¡@ |
PDF |
Tong-Yu Hsieh, Kuen-Jong Lee, and Jian-Jhih You, National Cheng Kung University |
In this paper, we formally analyze the test
efficiency of test platforms, which appear to be a promising method for
SOC testing, and seek to optimize it. A test cycle estimation
technique is proposed to evaluate the test efficiency for various test
procedures/organizations of test platforms. It is shown that a test time
difference of up to 24X is possible among test platforms with different
dedicated designs/test procedures. Based on the analysis results, we
can easily determine an appropriate test procedure/organization that can
achieve extremely high test efficiency with minimum required area
overhead. |
10:30 |
E2-3¡@ |
A Novel High-Speed SOC Test Scheme Using Virtual TAMs |
¡@ |
PDF |
Jiann-Chyi Rau, Chien-Hsu Wu, and Chung-Lin Wu, Tamkang University |
¡@ |
¡@ |
This paper presents a framework associated
with an efficient method to determine the optimal scheduling of SOC
test. In addition to using both traditional scan chains and
reconfigurable multiple scan chains, we increase the TAM width in the
proposed framework. Experimental results for the ITC'02 SOC benchmarks show
that our work can obtain better test application time compared to the
previously published algorithms. |
10:45 |
E2-4¡@ |
Enhancing Compression Efficiency with Skewed-Probability Scan Chains |
¡@ |
PDF |
Sying-Jyan Wang, Shih-Cheng Chen, and Katherine Shu-Min Li, National Chung Hsing University |
Code-based test data compression schemes encode symbols in the test data with predetermined codewords so that data volume can be reduced. The compression efficiency is affected by the distribution of data symbols. In this paper, we first analyze the factors that affect the encoding efficiency of various codes, and then propose a skewed-probability scan chain partitioning scheme, in which the distribution of 0s and 1s is changed in different parts of the scan chain. Both analytical and experimental results confirm that the scheme can effectively improve compression efficiency, while the routing penalty due to the partitioning method is limited. |
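One general way to see why a skewed distribution of 0s and 1s helps compression is the binary entropy, which bounds the average number of bits needed per symbol of a memoryless source. The Python sketch below is a textbook information-theory illustration and not the analysis carried out in the paper; the function name is hypothetical.

```python
import math


def binary_entropy(p_one: float) -> float:
    """Entropy in bits per symbol of a memoryless bit source with P(1) = p_one."""
    if p_one <= 0.0 or p_one >= 1.0:
        return 0.0  # a constant stream carries no information
    p_zero = 1.0 - p_one
    return -(p_one * math.log2(p_one) + p_zero * math.log2(p_zero))


balanced = binary_entropy(0.5)  # evenly mixed 0s and 1s
skewed = binary_entropy(0.1)    # 1s are rare in this part of the chain
```

A balanced stream needs a full bit per symbol, while a stream in which only 10% of the bits are 1s needs less than half a bit per symbol on average, so partitions with skewed statistics leave more room for a code to shorten the data.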
11:00 |
E2-5¡@ |
DIAGNOSIS OF MULTIPLE SCAN CHAIN TIMING FAULTS |
¡@ |
PDF |
Wei-Shun Chuang, Wei-Chih Liu, and James Chien-Mo Li, National Taiwan University |
A diagnosis technique is presented to
locate multiple timing faults in scan chains. Jump simulation is a novel
parallel simulation technique that quickly searches for the upper bound
and the lower bound of each individual fault. This technique requires
only regular ATPG patterns, which is ideal for the production environment.
Experiments on ISCAS'89 benchmark circuits show that this technique diagnoses every fault
to a precision of no more than two scan cells (16 hold-time faults in total in more than 800
scan cells). The proposed technique remains effective when the failure data is limited
or the faults are clustered. |
11:15 |
E2-6¡@ |
Testing MRAM for Write Disturbance Fault |
¡@ |
PDF |
Wan-Yu Lo, Ching-Yi Chen, Chin-Lung Su, and Cheng-Wen Wu, National Tsing Hua University |
¡@ |
¡@ |
With the development of deep sub-micron
technology, the semiconductor memory has become larger and denser. Many
applications require the chips to integrate non-volatile memories. The
industry has been trying to develop a new non-volatile memory to replace
the flash memories, and the magnetic random access memory (MRAM) is a possible candidate. The write disturbance
fault (WDF) model is a fault model specific to
MRAM which implies that the data stored in the MRAM
cells is changed due to excessive magnetic field during
a write operation. March tests have high coverage for
conventional RAM faults; however, they do not detect
all WDFs. To improve quality and yield of MRAM, we
suggest a new test algorithm to detect WDF for MRAM,
which is extended from the March-based test algorithm.
It also keeps linear time complexity and can be implemented
easily within the built-in self-test (BIST). |
¡@
|
Session
P2A |
Day
08/09 |
Time
10:00-12:00
|
Chair Prof. Ya-Hsin Hsueh
National Yunlin University of Science and Technology |
Room
2F Banquet Hall |
P2A-1¡@ |
A 5-bit 1 GSample/s Two-Stage ADC with a New Flash Folded Architecture |
PDF |
Hung-Yu Huang, Ying-Zu Lin, and Soon-Jyh Chang, National Cheng Kung University |
A 5-bit 1 GSample/s two-stage ADC is designed and simulated in TSMC 0.18-μm CMOS technology. The new architecture combines the characteristics of flash, subranging and folding ADCs. The analog front-end of this work is the same as that of a typical flash ADC. By replacing the folding amplifier with a current-mode multiplexer (MUX), the cyclic thermometer code, which is the digital output of a folding ADC, is obtained and the frequency multiplication effect is avoided. The slow switching of the reference voltage range is also avoided. The number of comparators is reduced to 16, whereas 32 are typically needed. Operating at 1 GSample/s, the ENOB is 4.92 and 4.71 bits at input frequencies of 10 and 500 MHz, respectively. This ADC consumes 63 mW from a 1.8 V supply, achieving a FOM of 2.4 pJ/conversion-step at 1 GSample/s. |
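The reported figure of merit can be cross-checked against the commonly used definition FOM = P / (2^ENOB × fs). The abstract does not say which definition or which ENOB value the authors used, so the Python sketch below is an assumption; the function name is hypothetical.

```python
def walden_fom_pj_per_step(power_w: float, enob_bits: float,
                           sample_rate_hz: float) -> float:
    """Energy per conversion step in picojoules: P / (2**ENOB * fs)."""
    return power_w / (2.0 ** enob_bits * sample_rate_hz) * 1e12


# Figures quoted in the abstract: 63 mW at 1 GSample/s.
fom_low_input = walden_fom_pj_per_step(63e-3, 4.92, 1e9)   # ENOB at 10 MHz input
fom_high_input = walden_fom_pj_per_step(63e-3, 4.71, 1e9)  # ENOB at 500 MHz input
```

Under this definition the 500 MHz ENOB of 4.71 bits gives roughly 2.4 pJ/conversion-step, in line with the quoted value, while the 10 MHz ENOB of 4.92 bits gives roughly 2.1 pJ/conversion-step.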
P2A-2¡@ |
A CMOS Temperature Sensor Design for Implantable Bio-Medical Devices |
PDF |
Ying-Hsiang Wang, Wen-Yaw Chung, Chiung-Cheng Chuang, and Chien-Hsi Kao, Chung Yuan Christian University |
This paper presents a fully integrated CMOS temperature sensing circuit for implantable bio-medical systems with low power and mixed-mode signal output. It also presents a new type of multi-level comparator whose power consumption remains fixed even when more stages are added. The circuit was verified using TSMC 0.35 μm mixed-signal 2P4M polycide 3.3/5 V models. The simulation results show that the proposed circuit is well suited to the limited temperature range of implantable systems and consumes only 37.2 μW at a 2.5 V power supply. |
P2A-3¡@ |
Low Dropout Voltage Regulator with Current-Limit Circuit |
PDF |
Chien-Cheng Chen, Nan-Xiong Huang, Miin-Shyue Shiau, Hong-Chong Wu, and Don-Gey Liu, Feng Chia University |
This paper presents a protection circuit for the low-dropout (LDO) voltage regulator. This LDO provides high stability for load currents up to 800 mA, and has a circuit for limiting the output current. The die size is 1.38 × 0.48 mm². Moreover, this protection circuit needs only one comparator and one transistor. The comparator can be a simple two-stage CMOS operational amplifier. The transistor is used as a switch, so it does not need a large area. The proposed LDO regulator was designed using TSMC 0.35-μm CMOS technology. The main advantage of this approach is that the extra voltage can be used to limit the output current and protect the main circuit. |
P2A-4¡@ |
Novel Devices Merging RITD and CMOS for Future VLSI Use |
PDF |
Jyi-Tsong Lin, Wei-Chin Lin, and Chao-Yu Hou, National Sun Yat-sen University |
In this paper we propose a new device design that merges a MOSFET with a Resonance Intra-band Tunneling Diode (RITD). Based on this concept, we present several new designs in three categories. The first is a multi-bit memory. Such a device is equivalent to a circuit consisting of MOSFETs and RITDs. It is still compatible with the standard MOS manufacturing process and can store multiple bits in a single cell. The second is a multi-level current regulator: one MOSFET is used as the load and an RITD as the driver component, whose output can settle at two or more stable levels to switch the terminal MOSFET on and off with different current drive. Because previous studies show that the RITD offers low current density, high switching speed and low power consumption, this new device can eliminate the noise shooting and switch faster even when the gate length is scaled down to 10 nm. The third applies a modified MOS structure into which the RITD part is integrated. Given its high speed, such a logic gate may have greater potential for VLSI applications. In this article, we use ISE TCAD simulation to model the physical geometry and evaluate its electrical characteristics. We also use a MOS equivalent circuit to simulate the RITD function and use HSPICE to verify that all designs function correctly. These new devices show good results and demonstrate better device behavior.
|
P2A-5¡@ |
Using Output-Clamped Amplifier to Implement Time-Based Interface Circuit
for Measuring Tiny Grounded Capacitance |
PDF |
Wei-Hung Hsu and Meng-Lieh Sheu, National Chi Nan University |
¡@ |
A time-based interface circuit for
measuring grounded on-chip capacitance is proposed. The measured
capacitance is first converted to integration time by the interface
circuit, and then to digital values by a counting circuit. Very compact
circuit area and micro-power consumption are achieved. Linearity
performance is also analyzed to conclude where and how non-ideal
components affect the overall accuracy of the interface circuit. The
simulation results give a good agreement with the proposed interface
circuit. |
P2A-6¡@ |
A Low Distortion Class-AB Power Amplifier With Active Tuning |
PDF |
Ro-Min Weng, Chi-Wen Tsai, and Kuen-Yi Lin, National Dong Hwa University |
¡@ |
A class-AB power amplifier (PA) with active
tuning is presented. An active inductor is added to adjust the output
matching in order to obtain high signal integrity, low distortion, and
high power efficiency. The active tuning PA with the power control can
achieve high efficiency and further decrease the third order
inter-modulation term (IM3). The maximum IM3 suppression is -33dBc at
the output power of 10dBm measured by the two-tone test with 10MHz
offset at 2.4GHz. The measurement results show a maximum PAE of 52.6% at an input power of 3 dBm. The maximum power gain of 24.38 dB is obtained at an input power of -5 dBm. PAE is clearly improved for input power from -9 dBm to 3 dBm. |
P2A-7¡@ |
A Low Power 1V 10-bit Successive Approximation ADC |
PDF |
Yi-Hung Chen, Wan-Tin Lin, and Hwang-Cherng Chow, Chang Gung University |
A low-power 1 V 10-bit successive approximation analog-to-digital converter (SA-ADC) implemented in the TSMC 0.18 μm CMOS process is presented for biomedical applications. A charge-recycling method is used for switching the capacitors in the DAC capacitor arrays of this SA-ADC. By splitting the MSB capacitor into binary-scaled sub-capacitors, the average switching energy can be reduced. In addition, a 1 V rail-to-rail input comparator with a current-driven bulk technique and offset cancellation is proposed. The complete 1 V ADC has a signal-to-noise ratio of 58.5 dB and an effective number of bits of 9.4 based on post-layout simulations. The entire ADC power consumption is 32.6 μW for normal signals and 29.5 μW for ECG applications.
|
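The two resolution figures in this abstract are related by the standard conversion ENOB = (SNDR − 1.76 dB) / 6.02 dB. The Python sketch below applies it to the quoted 58.5 dB; it assumes that this value can be used as the signal-to-noise-and-distortion ratio, which the abstract does not state explicitly, and the function name is hypothetical.

```python
def enob_from_sndr(sndr_db: float) -> float:
    """Effective number of bits of an ideal quantizer with the given SNDR in dB."""
    return (sndr_db - 1.76) / 6.02


enob = enob_from_sndr(58.5)  # ratio quoted in the abstract
```

The result is about 9.4 bits, which agrees with the effective number of bits reported from post-layout simulation.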
P2A-8¡@ |
New Low Supply-Bounce Current-Mode Shunt Regulator |
PDF |
Che-Min Kung, Chan-Min Pan, Jiann-Jong Chen, Yuh-Shyan Hwang, and Wen-Ta Lee, National Taipei University of Technology |
Low-noise, low-power, low-dropout linear regulators for electronic devices are in great demand. In this paper we present a new topology for a low-supply-bounce current-mode shunt regulator. To improve performance, we use the current-mode architecture to reduce supply bounce and ground bounce. In the closed loop, the output voltage Vo supplies the core circuit, so the system does not need an extra supply source. The proposed current-mode shunt regulator has been fabricated in the TSMC 0.35 μm 2P4M CMOS process. The main elements of the proposed regulator are an error amplifier, voltage buffer, current feedback, pass element, and off-chip capacitor and resistors. The experimental results show that the settling time is about 0.5 μs with 0.5% error for heavy load current. The line and load regulations are 34.6 μV/mA and 74.35 ppm/mA, respectively. The active chip area is only 0.783 × 0.875 mm². |
P2A-9¡@ |
CMOS BANDGAP REFERENCE WITH CURVATURE COMPENSATION ON HIGHER ORDER
TEMPERATURE TERMS |
PDF |
Hong-Yi Huang and Ru-Jie Wang, National Taipei University |
This work presents a curvature-compensated bandgap reference without resistors in 0.18-μm CMOS technology. The circuit uses a new current generator circuit for curvature compensation of higher-order temperature terms and a PMOS voltage divider for scaling down the reference voltage. A 605.6 mV output voltage is generated with a temperature coefficient of 1 ppm/°C from -40 to 125 °C. It dissipates 77 μW at a supply voltage of 1.8 V.
|
P2A-10¡@ |
A Temperature-Compensation CMOS Subbandgap Reference with 1V Power
Supply Operation |
PDF |
Hung-Wei Chen, Jing-Yu Luo, and Wen-Cheng Yen, National United University |
In this paper, a low-supply-voltage temperature-compensated CMOS subbandgap reference is proposed and implemented in a standard 0.35 μm TSMC CMOS process. The active area of the circuit is 762 μm × 283 μm. The circuit works properly with a minimum supply voltage of 1 V, and its power dissipation is 74 μW. Experimental results at the minimum supply voltage of 1 V confirm that the circuit generates a reference VREF1 of 0.633 V with a 26 mV output voltage variation over the range of -10 °C to 60 °C. The circuit also generates a reference VREF2 of 0.642 V with a 22 mV output voltage variation over the same range.
|
P2A-11¡@ |
6 Gb/s Digitally Phase Adjusted Clock Data Recovery for Spread Spectrum
Clock |
PDF |
Chin-Hsien Lin, Yuan-Pu Cheng, Yen-Ying Huang, and Shyh-Jye Jou, National Chiao Tung University |
This paper presents the design of a clock data recovery (CDR) circuit incorporating a feed-forward phase-adjusted algorithm. The CDR uses 3X oversampling to track incoming data and transforms the phase and frequency deviation information into a multi-phase selection signal. The phase-adjusted algorithm can be implemented digitally and provides second-order tracking that can handle frequency deviation and spread spectrum clock signals. The CDR is designed for 6 Gbps applications and is able to track a spread spectrum clock of up to 5000 ppm as in the SATA-3 specification.
|
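To give a feel for the numbers in this abstract, the Python sketch below converts them into timing quantities. It is plain arithmetic on the quoted figures (6 Gbps, 3X oversampling, 5000 ppm) and says nothing about the authors' algorithm; all names are hypothetical.

```python
def unit_interval_ps(bit_rate_bps: float) -> float:
    """Duration of one bit (one unit interval) in picoseconds."""
    return 1e12 / bit_rate_bps


def frequency_offset_hz(bit_rate_bps: float, deviation_ppm: float) -> float:
    """Absolute frequency deviation that corresponds to a ppm offset."""
    return bit_rate_bps * deviation_ppm * 1e-6


BIT_RATE_BPS = 6e9
OVERSAMPLING = 3
DEVIATION_PPM = 5000.0

ui_ps = unit_interval_ps(BIT_RATE_BPS)             # about 166.7 ps per bit
sample_spacing_ps = ui_ps / OVERSAMPLING           # about 55.6 ps between samples
offset_hz = frequency_offset_hz(BIT_RATE_BPS, DEVIATION_PPM)  # about 30 MHz
bits_per_ui_of_drift = 1e6 / DEVIATION_PPM         # bits until drift reaches one UI
```

At the full 5000 ppm deviation the sampling phase drifts by one whole unit interval every 200 bits, which is why the phase selection has to follow frequency deviation as well as static phase offset.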
¡@
|
Session
P2E |
Day
08/09 |
Time
10:00-12:00
|
Chair Prof. Rung-Bin Lin
Yuan Ze University |
Room
2F Banquet Hall |
P2E-1¡@ |
Don't-Care Bits Filling for Reducing Capture Power |
PDF |
Wang-Dauh Tseng, Lung-Jen Lee, and Chun-Kai Hsu, Yuan Ze University |
In this paper, we propose a don't-care-bit
filling method to reduce the test power dissipation during capture
cycles. An induced activity function is exploited to obtain the optimal
order for assigning the don't-care bits in test vectors or responses so
as to prevent large potential switching activity in the CUT during capture
cycles. It is implemented by weighting the impact of each transition
occurring on each scan cell during capture cycles. The capture power
consumption can thus be reduced significantly. Experimental results show
that the proposed method achieves a 40% reduction in capture power
consumption compared with the random X-filling method and, in most
cases, better results than the LCP X-filling method. The method causes no area overhead or performance loss. |
P2E-2¡@ |
Mismatch Address Index Encoding for Data Compression in Scan Test |
PDF |
Lung-Jen Lee, Wang-Dauh Tseng, Rung-Bin Lin, and Hcc-Hang Jang, Yuan Ze University |
¡@ |
In order to improve transmission efficiency
between ATE and SOC under test, we present a new test data compression
technique to reduce the amount of test data that must be stored on a
tester. A simple yet efficient heuristic is introduced for sorting test
data as a key stage in this method. Don't-care bit assignments are
analyzed to improve compression. A continuity mismatch property
between two sorted test cubes is identified and exploited. Compression
is implemented by encoding the mismatch bits in the test sequence. The
decoding process is performed by a small amount of on-chip circuitry.
Experimental results show an average compression ratio up to 82% is
achieved which is higher than PRL, 9C and ARL encoding. |
P2E-3¡@ |
Reduction of Power Dissipation during Scan Testing by Test Vector
Ordering |
PDF |
Wang-Dauh Tseng and Lung-Jen Lee, Yuan Ze University |
¡@ |
Test vector ordering is recognized as a
simple and non-intrusive approach to assist test power reduction.
Simulation based test vector ordering approach to minimize circuit
transitions requires exhaustive simulation of each test vector pair.
However, long simulation time makes this approach impractical for
circuits with large test sets. In this paper we present a calculation-based
approach that orders test vectors faster to reduce test power for
full scan sequential circuits. Most calculation approaches are for
combinational circuits, or for sequential circuits but considering only
the portion of the circuit derived from the primary inputs. The proposed
approach exploits the dependencies between internal circuits and the
state inputs and achieves greater power reduction. Experiments performed
on the ISCAS 89 benchmark circuits show that the improvement efficiency of the proposed approach reaches 91.55% and that it outperforms the existing calculation-based approaches.
|
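As a rough illustration of calculation-based test vector ordering, the sketch below greedily orders scan vectors so that consecutive vectors differ in as few bits as possible. It is not the method of P2E-3: the Hamming distance between vectors is only a simple proxy for switching activity, and the function names (`hamming`, `greedy_order`, `total_transitions`) and the sample vectors are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of bit positions in which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_order(vectors):
    """Nearest-neighbour ordering: start from the first vector and repeatedly
    append the remaining vector closest (in Hamming distance) to the last one."""
    if not vectors:
        return []
    remaining = list(vectors)
    order = [remaining.pop(0)]
    while remaining:
        last = order[-1]
        nxt = min(remaining, key=lambda v: hamming(last, v))
        remaining.remove(nxt)
        order.append(nxt)
    return order


def total_transitions(seq):
    """Sum of Hamming distances between consecutive vectors (proxy for test power)."""
    return sum(hamming(a, b) for a, b in zip(seq, seq[1:]))


# Hypothetical 4-bit test set.
vectors = ["0000", "1111", "0001", "1110"]
ordered = greedy_order(vectors)
```

A greedy pass like this avoids simulating every vector pair on the circuit, at the cost of ignoring how transitions propagate through internal logic, which is the dependency the abstract says the proposed approach accounts for.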
P2E-4 |
A Simulation-based Redundancy Identification in Combinational Circuits |
PDF |
Yi-Yuan Huang and Chun-Yao Wang, National Tsing Hua University |
Redundancy removal is an important
operation in combinational logic optimization. Traditional redundancy
identification algorithms are based on automatic test pattern generation
algorithms. However, automatic test pattern generation algorithms spend
much CPU time determining whether a fault on a wire is untestable, and thus redundant. Determining that a wire is redundant is not easy; determining that a wire is irredundant is much easier. In this paper, we present an efficient redundancy identifier with which irredundant wires can be easily filtered out. The experimental results show that the presented method can identify all irredundant wires in most benchmark circuits.
|
P2E-5 |
An Experimentation Suite for Education in Low-Noise Design |
PDF |
You-wei Liang, Shinyu Chen, and Robert Rieger, National Sun Yat-sen University |
In the design of biomedical applications,
the signal is on the order of microvolts, which is so small that
noise cannot be neglected. Students know the theory of noise,
but they are not familiar with noise in the real world. In
this paper, an easy-to-use measurement system is proposed. The signal is
amplified, fed to a DAQ card, and shown on the screen by a LabVIEW program. The screen shows not only the measured result but also a theoretical noise value, which can be typed in by the student. The resistors can be replaced easily, and the noise of amplifiers such as the 741-type IC can be measured, too.
|
P2E-6 |
Performance Improvement using Application-Specific Instructions under
Hardware Constrains |
PDF |
Chijie Lin, Jiying Wu, Jerung Shiu, Desheng Chen, and Yiwen Wang, Feng Chia University |
Application-Specific Instruction-set
Processors (ASIPs) are widely used to balance the trade-off between cost and performance for a specific target application without creating a new processor. The generation and selection of application-specific instructions (ASIs) can dramatically affect the quality of an ASIP design under constraints such as the number of I/Os, hardware cost, ASI hardware latency, and total number of ASIs. In this paper, disjoint operations can be combined into an ASI to enrich the variety of candidates for selection. The operation cover ratio and a more accurate ASI latency model are used to select good ASIs so that performance can be improved. A design flow is developed to generate the ASIs automatically, and the experimental results show that a 1.64x speedup can be obtained on the sha benchmark under 5 inputs, 3 outputs, and a hardware cost of less than 8000 LEs in an Altera FPGA.
|
P2E-7 |
Power-Aware Memory Mapping for FPGAs |
PDF |
Tien-Yuan Hsu, Ting-Chi Wang, and Kuang-yao Lee, National Tsing Hua University |
Embedded memory blocks on FPGAs allow designers to implement a variety of memory structures. With the increasing use of them, the power consumed by embedded memory blocks may form a significant part of the total dynamic power consumption. In this paper, we propose a power-aware memory mapping algorithm considering resource constraint. This algorithm converts the memory mapping problem to a generalized network flow problem, which can distribute resources to all logical memories at the same time. Our algorithm is compared with an existing power-aware memory mapping method. The promising experimental results show that our algorithm can always efficiently generate the optimal solutions but the existing method does not. |
P2E-8 |
MFASE Multiple Functions SoCs Analysis Environment |
PDF |
Ya-Shu Chen, Shih-Chun Chou, Chi-Sheng Shih, and Tei-Wei Kuo, National Taiwan University |
When more and more functions are integrated
into one system, the designs of embedded systems have become more and
more complicated. Multiple functions SoCs analysis environment (MFASE) is a system-level design framework which includes the tools for HW/SW co-design, performance optimization, HW/SW co-simulation, and performance analysis. MFASE provides an integrated system-level design framework to lower the system cost and evaluate the system performance. Given preliminary system design specification, MFASE explores the design space in design problems by proper timing analysis, and provides suitable scheduler/arbiter design. To verify the overhead of scheduler/arbiter in the system, MFASE also provides a framework of HW/SW simulation on a Transaction Level Model system to verify the design before a hardware prototype is physically manufactured. The evaluation results of HW/SW co-simulation are analyzed to iteratively enhance the system design. |
TECHNICAL PROGRAM
Session
P2D |
Day
08/09 |
Time
10:00-12:00
|
Chair Prof. Po-Hui Yang
National Yunlin University of Science and Technology |
Room
2F Banquet Hall |
P2D-1 |
An Integrated Spatial-Temporal Sampling Rate Conversion Architecture by
Motion Compensation for TV Display |
PDF |
Chih-Hung Kuo, Li-Chuan Chang, Zheng-Wei Liu, and Bin-Da Liu, National Cheng Kung University |
To improve video quality in TV display, we
propose a strategy that includes frame rate up-conversion (FRUC), spatial scaling, and adaptive edge sharpening. Motion compensation is applied to both frame interpolation and edge sharpening in order to improve visual quality. In conventional methods, bi-directional motion estimation is usually employed between two successive frames to interpolate a new frame. The proposed method not only considers temporal correlation but also uses spatial correlation to determine the covered and uncovered regions. The weighting of edge sharpness is determined by the magnitude of the motion vector (MV) and the mean absolute difference (MAD) of the current block to improve image quality. Simulation results show that our method has better PSNR and image quality. |
P2D-2 |
A Novel Look-up Table-Based Multiplication/Squaring Architecture for
Cryptosystems over GF(2^m) |
PDF |
Wen-Ching Lin, Jun-Hong Chen, and Ming-Der Shieh, National Cheng Kung University |
This paper focuses on high-speed
multiplier design over the finite field GF(2^m) for large m. We first
extend the look-up table (LUT) based multiplication algorithm presented by Hasan to reduce the LUT generation time, and then show how to effectively incorporate the squaring operation into the developed multiplier. The unified multiplication/squaring module is well suited to cryptosystems such as Elliptic Curve Cryptography (ECC), in which these two types of operations are performed in a ping-pong fashion. Experimental results show that, using the presented sub-group, multiple look-up tables (SG-MLUT) based scheme, up to 29% improvement in the total computation time of multiplication can be achieved compared with Hasan's algorithm. When the unified multiplication/squaring module is employed instead of Hasan's design in ECC applications, we gain further improvement in the scalar multiplication time, because our design needs no LUT generation, and obtain about a 26% reduction in the resulting area-time (AT) complexity.
|
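For readers unfamiliar with arithmetic over GF(2^m), the following minimal sketch shows the baseline bit-serial (shift-and-add) multiplication and squaring that LUT-based designs aim to accelerate. It is not the SG-MLUT scheme or Hasan's algorithm; the function names and the example field GF(2^4) with reduction polynomial x^4 + x + 1 are assumptions chosen for illustration.

```python
def gf2m_mul(a: int, b: int, m: int, poly: int) -> int:
    """Multiply a and b in GF(2^m), elements encoded as m-bit integers.

    poly is the irreducible reduction polynomial including its x^m term,
    e.g. 0b10011 for x^4 + x + 1.
    """
    result = 0
    for _ in range(m):
        if b & 1:            # add (XOR) the current multiple of a
            result ^= a
        b >>= 1
        a <<= 1              # multiply a by x
        if (a >> m) & 1:     # degree reached m: reduce modulo poly
            a ^= poly
    return result


def gf2m_square(a: int, m: int, poly: int) -> int:
    """Squaring expressed as a multiplication of a by itself."""
    return gf2m_mul(a, a, m, poly)


# Hypothetical example field: GF(2^4) with x^4 + x + 1.
M, POLY = 4, 0b10011
product = gf2m_mul(0b0010, 0b1000, M, POLY)  # x * x^3 = x^4 = x + 1
```

Each multiplication here takes m iterations; table-based schemes trade memory for fewer iterations by processing several bits of one operand per step.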
P2D-3 |
A DPA-Resistant AES Encryption Hardware Module |
PDF |
Kuan Jen Lin, Shih Hsien Yang, and Chih Hsuan Hsu, Fu Jen Catholic University |
Cryptographic embedded systems are
vulnerable to Differential Power Analysis (DPA) attacks. In this paper, we use a logic design style called Pre-charge Masked Reed-Muller Logic (PMRML) to implement a DPA-resistant AES encryption hardware module. The PMRML design can overcome the glitch and Dissipation Timing Skew (DTS) problems, both of which significantly reduce DPA resistance. The PMRML-based AES module was implemented with TSMC 0.18 μm standard cell libraries. The post-layout results show the efficiency and effectiveness of the PMRML design methodology.
|
P2D-4 |
A Low-Hardware-Cost Logical OR Operation Log-SPA LDPC Decoder |
PDF |
Ming-Yu Lin, Ching-Da Chan, Jung-Chieh Chen, and Po-Hui Yang, National Yunlin University of Science and Technology |
A low-hardware-cost Low-Density
Parity-Check (LDPC) decoder is presented in this paper. Making use of the logical OR operation in the check nodes for the log sum-product algorithm (Log-SPA), we propose a new architecture for updating the check node messages. Synthesis and numerical results show that the proposed architecture achieves up to 21% total hardware reduction with fair BER performance compared with the traditional Log-SPA decoder. In addition, the proposed decoder outperforms the simplest known sign-min architecture in terms of hardware complexity and BER performance.
|
P2D-5 |
Mixed-Vth (MVT) CMOS Circuit Design For Low Power Cell Libraries |
PDF |
Jyun-Yi Lin, Li-Rong Wang, Chia-Lin Hu, and Shyh-Jye Jou, National Chiao Tung University |
The Mixed-Vth (MVT) technique has been proposed to resize MOS transistors and thereby reduce dynamic power in logic gates by applying a low threshold voltage to transistors in some critical paths, while a standard threshold voltage is used in non-critical paths. This paper presents 130 nm 1.2 V and 90 nm 0.5 V low-power cell libraries using the MVT technique. The dynamic power consumption of the cells is reduced by around 5% to 30% with the same timing specifications. |
P2D-6 |
Symbol and Integer Carrier Frequency Offset Synchronization for
IEEE802.16e |
PDF |
Juan-Nan Lin, Hsiao-Yun Chen, and Shyh-Jye Jou, National Chiao Tung University |
IEEE 802.16e has been proposed as a
standard for the next-generation wireless communication system.
Synchronization plays an important role in setting up the environment for
the receiver. Because the repetition characteristic of the preamble is
not apparent, methods such as the matched filter are used in orthogonal frequency
division multiple access (OFDMA). In this paper, a reduced-complexity hardware architecture for correlation is proposed. By adopting the modified algorithm, a large number of multipliers are removed from the hardware implementation. The modified dissipation process reduces area cost by 28% and power by 22%. |
P2D-7 |
Register Processor for MMX instructions |
PDF |
Jih-Ching Chiu, Shou-Xi Hong, and Kai-Ming Yang, National Sun Yat-sen University |
In multimedia processing, the algorithms
typically operate on large, regular quantities of data.
The key idea in multimedia extensions such as Intel's MMX is the exploitation of sub-word
parallelism in an SIMD fashion. Exploiting more data-level parallelism (DLP) is one of the most efficient ways to improve the performance of SIMD instructions. However, in an ordinary architecture all of the data must be processed by the execution stage, so increasing the degree of DLP usage causes a bottleneck at the execution stage. In addition, the number of operands read from the register file makes a higher degree of DLP more difficult to achieve. These factors seriously limit DLP and affect SIMD performance. In this paper, we propose a special storage cell, called an operation cell, which differs from a data latch, and use it to construct a register file called the register processor. The operation cell has the dual capabilities of operation and storage for bit slicing. We therefore design a register file with operation cells for the MMX instruction set. All of the data for MMX operations are placed in the register file and can be handled simultaneously, as in SIMD operations. In summary, the operation cells in this special register file can operate on all of the data by themselves and do not depend on the execution stage. In media processing, this allows the bus bandwidth to be used efficiently at the best DLP when handling large quantities of data. According to simulation results, the register file with operation cells performs better than Intel's Pentium processor with MMX technology and the TI C64 in 4 groups, which are eight 64-bit registers; the higher the degree of DLP usage by the operation cells, the more substantially the performance of MMX instructions can be improved.
|
P2D-8 |
Performance Comparisons and Tradeoffs of Table-Based Arithmetic Function
Evaluators |
PDF |
Ping-Chung Wei, Ching-Pin Lin, and Shen-Fu Hsiao, National Sun Yat-sen University |
Function units that calculate elementary
arithmetic functions play an important role in many applications
including scientific computing, digital signal processing, multimedia
codec, 3D graphics animation etc. Thus, efficient design of arithmetic
function units could significantly affect the overall system performance
and area cost. In this paper we develop an automatic generator to
produce hardware units that compute various single-value functions using
table-method approaches and compare their performance and area costs
with other alternative implementations. Experimental results show that
making choices between these approaches depends on the tradeoffs between
speed, area and especially the constraint of the accuracy. In
particular, the hardware function units produced using our proposed
generator have better delay and/or area cost compared with those
obtained from Synopsys DesignWare library. |
P2D-9 |
Multiple-Input XOR/XNOR Circuit Design Using Pass-Transistor Logic and
Its Application in Cryptography |
PDF |
Ming-Yu Tsai, Chia-Sheng Wen, and Shen-Fu Hsiao, National Sun Yat-sen University |
Exclusive-OR (XOR) gate is one of the critical components in many applications such as cryptography. In this paper, we present an efficient multi-input XOR circuit design based on pass-transistor logic (PTL). A synthesis algorithm is developed to efficiently generate the PTL-based multi-input XOR circuits. Both pre-layout and post-layout simulation results show that our proposed multi-input XOR design outperforms static CMOS design. The multi-input XOR circuits are also used to design the transformations in the Advanced Encryption Standard (AES). |
P2D-10 |
Efficient Design of Graphic Rasterization Module |
PDF |
Chung-Hua Tsai and Yun-Nan Chang, National Sun Yat-sen University |
This paper presents an efficient design of rasterization module suitable for the tile-based 3D graphic rendering systems. The ordinary line drawing algorithm for the scan-line boundary search or the direct in-out test approach is not efficient for the scan-conversion operation in tile-based approach since the shape of triangle primitive may become irregular after tiling. Therefore, this paper transforms the general pixel in-out test function into a sign-directed scan-line boundary search method. The normal in-out test circuit for single pixel can be modified to detect two end-points of the scan-line simultaneously such that the effective hardware efficiency can be largely improved. Our experimental results show that the pixel fill-rate can be improved by about 60%. The proposed rasterization design also divides the entire architecture into two parts including scan-line generation and the fragment generation. This division can help the optimization and speedup of the individual part to achieve the desired overall fill-rate goal. |
TECHNICAL PROGRAM
Session
A3 |
Day
08/10
|
Time
10:00-11:30
|
Chair Prof. Yuan-Long Jeang
Kun Shan University of Technology |
Room
301 |
10:00 |
A3-1 |
The Efficient VLSI Design on BI-CUBIC Interpolation for Real Time
Digital Image Scaling |
PDF |
Chung-chi Lin, National Yunlin University of Science and Technology |
This paper presents a VLSI design of
bi-cubic interpolation for digital image processing. An architecture that
reduces the computational complexity of generating coefficients and
decreases the number of memory accesses is proposed. The proposed
method provides a simple hardware architecture with low computation
cost that is easy to implement. Based on our technique, a high-speed
VLSI architecture has been successfully designed and implemented with the TSMC 0.13 μm standard cell library. The simulation results demonstrate that the high-performance bi-cubic convolution interpolation architecture, running at 279 MHz with 30643 gates in a 498 × 498 μm² chip, is able to perform digital image scaling for HDTV in real time.
|
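To make the coefficient-generation cost concrete, the sketch below evaluates the widely used cubic convolution kernel and interpolates one sample from four neighbours. The abstract does not state which kernel or parameter the design uses, so the kernel form with a = -0.5 and the function names are assumptions for illustration only.

```python
def cubic_kernel(x: float, a: float = -0.5) -> float:
    """Cubic convolution kernel; a = -0.5 is an assumed, commonly used value."""
    x = abs(x)
    if x <= 1.0:
        return (a + 2.0) * x**3 - (a + 3.0) * x**2 + 1.0
    if x < 2.0:
        return a * x**3 - 5.0 * a * x**2 + 8.0 * a * x - 4.0 * a
    return 0.0


def interpolate_1d(p, t: float) -> float:
    """Interpolate between p[1] and p[2] at fractional offset t in [0, 1).

    p holds four consecutive samples located at positions -1, 0, 1, 2.
    """
    return sum(p[i] * cubic_kernel(t - (i - 1)) for i in range(4))


# Four weights are needed per output sample in one dimension.
weights = [cubic_kernel(0.5 - (i - 1)) for i in range(4)]
value = interpolate_1d([1.0, 2.0, 3.0, 4.0], 0.5)
```

A 2-D bi-cubic result applies the same 1-D step along rows and then columns, so each output pixel reads a 4 × 4 neighbourhood; this is why reducing coefficient computation and memory accesses matters for a real-time design.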
10:15 |
A3-2 |
Design of 1x2 MB-OFDM UWB Receiver with Channel Shortening Technique |
PDF |
Jen-Ming Wu and Hung-Wen Yang, National Tsing Hua University |
In this paper, we present an adaptive
channel equalization scheme with receiver diversity for MB-OFDM ultra-wideband (MB-OFDM UWB) communication systems. When the channel impulse response is longer than the guard interval, inter-symbol interference (ISI) heavily degrades system performance. Our scheme combines a time-domain impulse response shortening technique, which deals with ISI, with multiple receive antennas for extra diversity gain. We propose to over-constrain the desired length of the shortening window of interest, which improves the bit error rate (BER) performance by at least 4.4 dB over conventional shortening techniques whose shortening window equals the guard interval. Simulation results show that the proposed scheme has better BER performance than traditional UWB systems, which use only frequency-domain equalization. In addition, we present the proposed channel shortening technique within the context of a Single Input Multiple Output (SIMO) channel for additional receiver diversity gain.
|
10:30 |
A3-3 |
A Scalable Graph-cut Engine Architecture for Real-time Vision |
PDF |
Nelson Yen-Chung Chang and Tian-Sheuan Chang, National Chiao Tung University |
This paper presents the world's first
scalable graph-cut engine architecture for real-time vision processing.
This architecture implements an exhaustive search algorithm which has
very high parallelism and is suitable for scalable hardware
implementation. For an n-vertex two-variable graph, the estimated
hardware cost of using $2^{k}$ sets of energy computation units is $2^{k-1} \times (n^{2}+n+1)$
equivalent adders. The corresponding latency of the proposed
architecture is $\lceil \log_{2}(n^{2}+n) - 1 \rceil + 2^{n-k} + k$ cycles,
which is faster than the $n^{4}+6n^{3}+11n^{2}$
cycle latency of the traditional software approach when $n<17$. The
proposed architecture may enable two-variable graph-cut methods, such as
swap and expansion moves, to be applied to real-time vision applications. |
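The latency comparison in A3-3 can be checked numerically. The sketch below assumes the flattened superscripts read as a hardware latency of ceil(log2(n^2 + n) - 1) + 2^(n-k) + k cycles and a software latency of n^4 + 6n^3 + 11n^2 cycles; that reading is an assumption, and the function names are hypothetical.

```python
import math


def hw_latency(n: int, k: int) -> int:
    """Assumed hardware latency in cycles with 2^k energy computation units."""
    return math.ceil(math.log2(n * n + n) - 1) + 2 ** (n - k) + k


def sw_latency(n: int) -> int:
    """Assumed latency in cycles of the traditional software approach."""
    return n**4 + 6 * n**3 + 11 * n**2


def largest_n_where_hw_wins(k: int, n_max: int = 64) -> int:
    """Largest n in [1, n_max] for which the hardware latency is lower."""
    best = 0
    for n in range(1, n_max + 1):
        if hw_latency(n, k) < sw_latency(n):
            best = n
    return best
```

Under this reading and with k = 0, the hardware is faster up to n = 16 and slower from n = 17, which matches the stated n < 17 bound; larger k moves the crossover to larger n because the exponential term shrinks by a factor of 2^k.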
10:45 |
A3-4 |
High-Quality Mipmapped Texture Compression |
PDF |
Chih-Hao Sun and Shao-Yi Chien, National Taiwan University |
This paper presents a high-quality mipmapped texture compression (MTC) system for GPUs.
Based on the wavelet transform, a hierarchical approach is adopted for mipmaps in the YCbCr color space to embed three mipmap levels in a single bitstream.
Furthermore, a layer overlapping technique is proposed to reduce the memory bandwidth of MTC.
MTC is integrated in a cycle-accurate GPU simulator with a texture cache.
Simulation results show that MTC can provide better image quality with similar memory bandwidth and a lower cache miss rate for textures.
VLSI implementation results show that the hardware cost of MTC is similar to that of DXTC, making it suitable for integration in a GPU to provide high-quality textures with a low memory bandwidth requirement. |
11:00 |
A3-5 |
A Scalable Wavelet Image Coder Based on Zero-block and Array and Its
Hardware Implementation |
PDF |
Yuan-Long Jeang, Hung-Yu Wang, and Cyuan-Cheng Wong, Kun Shan University of Technology |
In this paper, we propose a highly Scalable
Embedded image coder based on Zero-blocks and Array structures, called
S-EZBA, which extends the image coder I-EZBA with higher performance and cost efficiency. S-EZBA not only achieves distortion scalability, resolution scalability, and region of interest (ROI) retrievability, but also inherits the cost-saving property of I-EZBA. We use a new formation of the Quality Layer to realize these properties. Compared with S-SPECK, S-EZBA omits the memory needed for counting the length of the bitstreams belonging to the codeunits. S-EZBA has been implemented in TSMC 0.18 μm technology. Experimental results of S-EZBA show excellent cost, power consumption, and PSNR (peak signal-to-noise ratio).
|
11:15 |
A3-6 |
Efficient Fast Fourier Transform Processor Design for DVB-H System |
PDF |
Yu-Ju Cho, Chi-Li Yu, Tzu-Hao Yu, Cheng-Zhou Zhan, and An-Yeu (Andy) Wu, National Taiwan University |
The fast Fourier transform (FFT) is the demodulation kernel in the DVB-H system. In this paper, we first propose an FFT processor that reduces power consumption by decreasing the usage of main memory and turning off unused memory partitions in a timely manner for the different FFT sizes. Second, a triple-mode conflict-free address generator is proposed to handle the address mapping of all storage in the three FFT sizes. Then two cost-efficient twiddle-factor coefficient design methods, "Sharing" and "Interpolation-then-Sharing", are proposed to reduce the area of coefficient storage within the allowable loss of SQNR. In our design, these methods reduce the area occupied by coefficient storage by 67% at the price of a 0.6 dB loss of SQNR. Finally, our proposed FFT processor for the DVB-H system is implemented in TSMC 0.18 μm 1P6M CMOS technology with a core size of 1.886 × 1.886 mm². The minimum latency of an 8192-point FFT is 805 μs at an 86 MHz clock rate, consuming 75.51 mW. For the DVB-H system, it processes the 8192-, 4096-, and 2048-point FFTs at clock rates of 79 MHz, 75 MHz, and 71 MHz, consuming 67.01 mW, 53.16 mW, and 39.45 mW, respectively.
|
TECHNICAL PROGRAM
Session
B3 |
Day
08/10 |
Time
10:00-11:30
|
Chair Prof. Jen-Ming Wu
National Tsing Hua University |
Room
305 |
10:00 |
B3-1 |
A Partially Parallel Low-Density Parity Check Code Decoder with Reduced
Memory for Long Code-Length |
PDF |
Chin-Kuang Lian, Shin-Yo Lin, Tsung-Han Tsai, and Chin-Long Wey, National Central University |
Two partially parallel architectures have
been commonly implemented for LDPC decoders: the share-memory architecture and the individual-memory architecture. This paper presents an alternative approach which significantly reduces the memory size requirement. For an LDPC decoder with a code length of 1536 and a code rate of 1/2, the memory size is reduced by approximately 10% and 49% relative to the individual-memory and share-memory architectures, respectively. The proposed LDPC decoder achieves a data rate of up to 79 Mbps at a clock frequency of 500 MHz.
|
10:15 |
B3-2 |
Architecture of Adaptive Channel Equalizer in Dedicated Short Range
Communication (DSRC) and Vehicle Infotainment Systems |
PDF |
Yong-Hua Cheng, Yi-Hung Lu, and Chia-Ling Liu, Industrial Technology Research Institute |
Dedicated Short Range Communication (DSRC) is the key component in Intelligent Transportation Systems (ITS). The goal of this technology is to help drivers and passengers obtain multimedia services via wireless communication equipment while the vehicle is moving, so as to improve traffic safety and enhance transportation efficiency. This paper focuses on DSRC baseband technical analysis and application development. Based on the WLAN 802.11a architecture, we develop a new adaptive channel estimation algorithm and architecture that uses decoder and decision feedback to enhance wireless access ability in vehicular environments. Finally, we integrate mobile communication with the vehicle multimedia network.
|
10:30 |
B3-3 |
An Ultra-low Power Multi-mode LDPC Decoder Chip for Mobile WiMAX System |
PDF |
Xin-Yu Shih, Cheng-Zhou Zhan, Cheng-Hung Lin, and An-Yeu (Andy) Wu, National Taiwan University |
This paper presents an ultra-low-power
multi-mode decoder design for Quasi-Cyclic LDPC codes for the Mobile
WiMAX system. With the proposed overlapped decoding mechanism, the decoding latency is reduced to 68.75% of that of the
non-overlapped method, and the hardware utilization ratio is raised from 50% to 75%. The new early termination strategy dynamically adjusts the number of iterations for communication channels with different SNR values. In addition, we propose an Efficient Checkerboard Layout Scheme (ECLS) to reduce routing complexity in chip implementation. The multi-mode LDPC decoder design is implemented and fabricated in TSMC
0.13 μm CMOS technology. The core size is 4.45 mm² and the die area is only 8.29 mm². The maximum measured operating frequency is 83.3 MHz, with a power consumption of only 52 mW.
|
10:45 |
B3-4 |
Baseband OFDM Receiver Design for Fixed WiMAX Communication |
PDF |
Chi-chie Chang and Jen-Ming Wu, National Tsing Hua University |
In this paper, we present the
implementation of an inner receiver for the IEEE 802.16-2004 Wireless
Metropolitan Area Network (a.k.a. WiMAX), fabricated in TSMC 0.18 μm technology. The chip design consists of low-power packet detection, low-complexity carrier frequency compensation, a recursive FFT, and channel compensation. In the packet detection and carrier frequency compensation, we use the sign-bit method and propose a mapping function to achieve a low-power design. In the FFT design, we use a radix-8 recursive FFT to achieve a small-area design. The total power consumption is about 114 mW. |
11:00 |
B3-5 |
A Multi-Code Rate IEEE 802.16e LDPC Decoder Design |
PDF |
Chih-Hao Hsiao and Yun-Nan Chang, National Sun Yat-sen University |
This paper presents a VLSI design of a
Low-Density Parity-Check (LDPC) code decoder for the IEEE 802.16e standard. To support all the code rates defined in the standard, we propose a programmable block-based edge-serial iterative architecture which performs the sequential check-node computation according to internal sequence update commands. Any complex and irregular parity-check matrix can be realized in the proposed architecture as long as the number of bit nodes connected to each check node does not exceed a certain bound. To achieve a fast clock speed, the proposed LDPC decoder is deeply pipelined, which, however, may prolong the execution cycles of each iteration because of the internal pipeline latency. The latency overhead can be reduced by scheduling the check-node update order so that the operations of different iterations overlap. The proposed architecture has been realized in 0.18 μm technology with a total gate count of 900k. Our experiments show that the proposed LDPC decoder can run at up to 235 MHz and deliver an average throughput of 135 Mbps. Furthermore, to reduce the number of iterations, an early termination scheme based on a parity-check detection circuit is included in our architecture. It saves an average of 20% of the cycles for signal-to-noise ratios (SNR) between 1 and 3 dB. |
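Early termination based on parity-check detection generally means stopping the iterations once the current hard decisions satisfy every parity-check equation. The sketch below illustrates that general idea only; it does not model the decoder of B3-5, the message-passing update is replaced by a precomputed list of hard decisions, and the toy matrix (the (7,4) Hamming code) and function names are assumptions.

```python
def parity_checks_satisfied(H, bits) -> bool:
    """True if every row of H has even parity over the selected bits."""
    return all(sum(h & b for h, b in zip(row, bits)) % 2 == 0 for row in H)


def iterations_used(H, hard_decisions, max_iters: int) -> int:
    """Count iterations until the first valid codeword or the iteration limit.

    hard_decisions is an iterable of bit lists, one per decoding iteration,
    standing in for a real decoder's per-iteration output.
    """
    used = 0
    for bits in hard_decisions:
        if used == max_iters:
            break
        used += 1
        if parity_checks_satisfied(H, bits):
            break  # all checks pass: stop early
    return used


# Hypothetical toy parity-check matrix: the (7,4) Hamming code.
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]
trace = [[1, 1, 1, 0, 0, 0, 1], [1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0, 0]]
used = iterations_used(H, trace, max_iters=10)
```

The saving depends on channel quality: at higher SNR the hard decisions converge in fewer iterations, so the check passes sooner and more cycles are skipped.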
11:15 |
B3-6 |
Configurable Hierarchical Decoder Architectures for H-QC LDPC Codes |
PDF |
Kuo-hsing Juan, Mong-kai Ku, and Yu-min Chang, National Taiwan University |
In this paper, a low-density parity-check (LDPC) decoder architecture using a fast-converging layered decoding algorithm is presented. This hierarchical architecture is highly scalable and configurable. Two-level hierarchical quasi-cyclic LDPC codes are used to provide good coding gain and a low error floor at long codeword lengths. We also develop a novel compensation method, the mixed-mode min-sum algorithm, which provides better BER performance and needs fewer iterations than the scaling min-sum algorithm. Several designs are implemented on an Altera Stratix 2 EP2S130 FPGA. The LDPC decoder implementation with 2 first-level decoding blocks and 32 second-level decoding units achieves close to 1 Gbps information throughput.
|
TECHNICAL PROGRAM
Session
C3 |
Day
08/10 |
Time
10:00-11:30
|
Chair Prof. Ke-Horng Chen
National Chiao Tung University |
Room
315 |
Sensors and Power Electronics
10:00 |
C3-1 |
A Dual Phase Charge Pump with Compact Size |
PDF |
Po-Chin Fan and Ke-Horng Chen, National Chiao Tung University |
In this paper, a regulated dual-phase
charge pump with compact size is presented. The charge pump uses the
dual-phase technique to reduce the output ripple and proposes a new
power stage to define the stability of the overall system. The charge
pump provides an output voltage of 5 V and a maximum load current of 50 mA with
constant-frequency regulation. The design is based on TSMC 0.35 μm
3.3V/5V CMOS technology. |
10:15 |
C3-2 |
A Dual-Mode Step-Up DC/DC Converter with Current-Limiting Technology |
PDF |
Chun-Ting Kuo, Wan-Rone Liou, and Ping-Hsing Chen, National Taiwan Ocean University |
This paper presents a novel dual-mode
step-up DC/DC converter. Pulse-frequency modulation (PFM) is used to improve the efficiency at light load. This converter can operate between pulse-width modulation (PWM) and pulse-frequency modulation. The converter will operate in pulse-frequency-modulation mode at light load and in pulse-width modulation mode at heavy load. The maximum conversion efficiency of this converter can reach 96%. The conversion efficiency is greatly improved when load current is below 100 mA. Additionally, a novel soft-start circuit is proposed in this paper to avoid the large switching current at the start up of the converter. Furthermore, a novel current-limiting circuit is proposed in this paper. It can limit the switching current below 400 mA.
|
10:30 |
C3-3 |
A SAR-Based Smart Temperature Sensor with Binary-Weighted Search
Algorithm |
PDF |
Chun-Chi Chen, Poki Chen, and Kai-Ming Wang, National Taiwan University of Science and Technology |
A SAR-based (successive approximation
register) time-domain smart temperature sensor with a binary-weighted
search algorithm is proposed in this paper. Without any bipolar
transistor, a temperature sensor composed of a temperature-dependent delay
line is used to generate a delay time proportional to the measured
temperature. A timing reference delay line with a binary-weighted scheme
is adopted for set-point programming. SAR control logic is adopted for
selecting the optimal delay time for digital output coding. The proposed
10-bit smart temperature sensor has a chip area of 0.6 mm²
in the TSMC 0.35-μm digital process and a measurement error of ±0.3 °C
over a test range of 0 °C to 90 °C.
|
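The successive-approximation idea behind a binary-weighted search can be sketched in a few lines: the control logic tries one bit at a time, from the most significant down, and keeps a bit when the comparison says the trial code does not exceed the measured quantity. This is a generic SAR loop, not the circuit of C3-3; the comparator callback and the delay units are hypothetical.

```python
def sar_convert(comparator, bits: int = 10) -> int:
    """Generic successive-approximation search over a bits-wide code.

    comparator(trial) must return True when the measured quantity is at least
    as large as the reference selected by the trial code.
    """
    code = 0
    for i in reversed(range(bits)):   # MSB first: binary-weighted steps
        trial = code | (1 << i)
        if comparator(trial):
            code = trial              # keep the bit
    return code


# Hypothetical example: a sensor delay of 613 reference-delay units.
calls = []


def delay_comparator(trial: int) -> bool:
    calls.append(trial)
    return 613 >= trial


code = sar_convert(delay_comparator, bits=10)
```

A 10-bit result needs exactly 10 comparisons, compared with up to 1023 steps for a linear sweep of the reference delay line.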
10:45 |
C3-4 |
A New Self-Oscillating CMOS DC-DC Converter with Adaptive Mode-Switching
Mechanism |
PDF |
Sau-Mou Wu, Chung-Lin Wu, and Chia-Hsien Chang, Yuan Ze University |
This paper presents a new adaptive
mode-switching mechanism for a synchronous, self-oscillating, fully
integrated CMOS DC-DC converter. The proposed adaptive mode-switching mechanism employs a current sensing technique to switch automatically between CCM and DCM according to the level of the load current, thus maintaining high conversion efficiency even when the load current of an application changes during normal operation. Moreover, the load current level at which the mode switches is programmable to suit the application. The efficiency of the resulting DC-DC converter is up to 92%, the maximum peak-to-peak output voltage ripple is 18 mV, and the output current ranges between 50 mA and 500 mA. The DC-DC converter operates at a switching frequency from 250 to 3M Hz from a supply voltage ranging from 2.4 to 4.2 V. The new DC-DC converter was fabricated in the TSMC 0.35-μm 2P4M CMOS process with a die size of 0.85 mm². Except for the external inductor and capacitor, all the devices, including the power switches, are on chip.
|
11:00 |
C3-5 |
A Novel Log-Lin-Log Response CMOS Image Sensor with High Swing and Wide
Dynamic Range |
PDF |
Sau-Mou Wu and Ming-Wei Chen, Yuan Ze University |
A new CMOS image sensor with a log-lin-log response is presented. The pixel cell has a logarithmic response at very low illumination intensity, a linear response at low and medium illumination intensity, and a logarithmic response at high illumination. With this scheme, the sensor is highly sensitive to very low light while still offering a high voltage swing of 0.53 V (from a 1.8 V supply) and a high dynamic range of 120 dB. Furthermore, the CDS technique can be applied to the proposed sensor array to reduce fixed pattern noise. For demonstration, a prototype 75 × 54 image sensor array with readout circuit and CDS is designed for a 1.8 V supply and realized in the TSMC
0.35 μm CMOS 2P4M standard process. |
11:15 |
C3-6 |
A Novel CMOS Smart Temperature Sensor for On-Line Thermal Monitoring |
PDF |
Wei-Cheng Lee, Hung-Chih Lin, and Tsin-Yuan Chang, National Tsing Hua University |
A CMOS smart temperature sensor without a conventional ADC or bandgap reference is proposed for thermal management of VLSI systems. The accuracy is within ±0.8°C over the temperature range of 0°C to 125°C after two-point calibration. The sensor consists of a ΔVGS generator that utilizes the temperature characteristics of CMOS transistors, a voltage-to-time converter, and a time-to-digital converter that provides the digital output. A small die area of 0.05 mm², an extremely low power consumption of 120 μW, and a high conversion rate of 5k conversions/s make this temperature sensor very suitable for VLSI integration. With a 100 MHz external reference clock, the sensor achieves a finest resolution of 0.025°C/LSB. |
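The figures quoted in the C3-6 abstract can be cross-checked with a little arithmetic. The sketch below is only an illustration: the range, resolution, and clock values come from the abstract, while the idea that one LSB corresponds to one reference-clock period of the time-to-digital converter is an assumption made here, not something the abstract states.

```python
import math

RESOLUTION_C_PER_LSB = 0.025    # from the abstract
T_MIN_C, T_MAX_C = 0.0, 125.0   # from the abstract
F_REF_HZ = 100e6                # from the abstract


def output_codes(t_min=T_MIN_C, t_max=T_MAX_C, res=RESOLUTION_C_PER_LSB):
    """Number of LSB steps needed to span the temperature range."""
    return round((t_max - t_min) / res)


def counter_bits(codes):
    """Smallest counter width that can represent the values 0..codes."""
    return math.ceil(math.log2(codes + 1))


def lsb_time_ns(f_ref=F_REF_HZ):
    """ASSUMPTION: one LSB equals one reference-clock period."""
    return 1e9 / f_ref


codes = output_codes()
print(codes, counter_bits(codes), lsb_time_ns())
```

Under that assumption, spanning the full range takes 5000 counts of 10 ns each, which fits comfortably inside the 200 μs available per conversion at 5k conversions/s.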
TECHNICAL PROGRAM
Session
D3 |
Day
08/10 |
Time
10:00-11:30
|
Chair Prof. 姚嘉瑜
National Taiwan University of Science and Technology |
Room
308 |
Timing and Clock Generators
10:00 |
D3-1 |
Stability Analysis of Fourth-Order Charge-Pump PLLs using Linearized
Discrete-Time Models |
PDF |
Chia-Yu Yao, Chun-Te Hsu, and Chih-Chun Hsieh, National Taiwan University of Science and Technology |
In this paper, we derive state equations for linearized discrete-time models of fourth-order charge-pump phase-locked loops. We solve the differential equations of the loop filter by using the initial conditions and the boundary conditions in a period. The solved equations are linearized and rearranged as discrete-time state equations for checking stability conditions. Behavioral simulations are performed to verify the proposed method. By examining the stability of loops under different conditions, we also propose an expression relating the lower bound of the reference frequency, the open-loop unity-gain bandwidth, and the phase margin. |
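Once a linearized discrete-time state equation x[k+1] = A·x[k] is available, the loop is stable when every root of the characteristic polynomial of A lies strictly inside the unit circle. The sketch below shows one generic way to check this, using the Schur-Cohn step-down recursion on a real polynomial. It is not the authors' procedure, and the fourth-order polynomials in the example are built from made-up roots purely for illustration.

```python
def is_schur_stable(coeffs):
    """True if all roots of a real polynomial lie strictly inside the unit circle.

    coeffs = [a0, a1, ..., an] represents a0*z**n + a1*z**(n-1) + ... + an.
    """
    a = [c / coeffs[0] for c in coeffs]          # normalize to a monic polynomial
    while len(a) > 1:
        k = a[-1]                                # reflection coefficient
        if abs(k) >= 1.0:
            return False
        n = len(a) - 1
        # Step down to a monic polynomial of degree n - 1.
        a = [(a[i] - k * a[n - i]) / (1.0 - k * k) for i in range(n)]
    return True


def poly_from_roots(roots):
    """Expand prod(z - r) into descending-power coefficients (real roots only)."""
    coeffs = [1.0]
    for r in roots:
        coeffs = [c - r * p for c, p in zip(coeffs + [0.0], [0.0] + coeffs)]
    return coeffs


# Hypothetical fourth-order characteristic polynomials.
print(is_schur_stable(poly_from_roots([0.5, 0.6, 0.7, 0.8])))  # all roots inside
print(is_schur_stable(poly_from_roots([0.5, 0.6, 0.7, 1.1])))  # one root outside
```

Working on the polynomial coefficients avoids an eigenvalue solver, which keeps the check dependency-free; with a numerical library available, computing the eigenvalues of A directly would be the more common route.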
10:15 |
D3-2 |
A Low Jitter 2.5-GHz Self-Calibration PLL |
PDF |
鄭國興、蔡玉章、洪凱尉, National Central University |
A 2.5-GHz 8-phase phase-locked loop (PLL) is proposed for 10 Gbps system-on-chip (SoC) applications. The proposed self-calibration method adjusts the multi-band voltage-controlled oscillator (VCO) to compensate for PVT variations. The small KVCO reduces the effect of power/ground (P/G) and substrate noise. The PLL is implemented in 0.13 μm CMOS technology. The PLL output jitter is 18.55 ps (p-p) when the reference clock jitter is 20 ps (p-p). The total power dissipation is 21 mW at 2.5 GHz, and the core area is 0.08 mm².
|
10:30 |
D3-3 |
A CMOS-MEMS Frequency Adaptive Resonator with Multiple Electrostatic
Electrodes Driving |
PDF |
J. C. Chiou and L. J. Shieh, National Chiao Tung University |
In this paper, a prestress vertical comb drive resonator with frequency tuning capability is developed. The resonator consists of three sets of comb fingers which act as driving electrodes. The comb fingers are fabricated along with the composite beam. One end of the composite beam is clamped to the anchor, whereas the other end is elevated vertically by the residual stress. Actuation occurs when the electrostatic force, induced by the fringe effect, pulls the composite beam downward toward the substrate. By applying the driving voltage to different electrodes, the resonator exhibits different frequency responses. The device is fabricated through a standard 0.35 μm 2P4M CMOS-MEMS process. Preliminary measurement results indicate that the initial resonant frequency of the device is 18.6 kHz and that a frequency tuning range of up to 28.5% is obtained. |
10:45 |
D3-4 |
An Efficient BMCS Approach to Accurately Predict Process Variation
Effects of PLL Circuits |
PDF |
Chin-Cheng Kuo, Meng-Jung Lee, I-Ching Tsai, Chien-Nan Jimmy Liu, and
Ching-Ji Huang, National Central University |
Regression-based approaches often use hierarchical statistical analysis to speed up the extremely expensive HSPICE Monte Carlo analysis. However, accurately fitting the regression equations requires many simulation samples. In this paper, a Behavioral Monte Carlo Simulation (BMCS) approach for analyzing PLL designs under process variation is proposed, based on a bottom-up behavioral modeling approach with an efficient extraction process. Using the accurate model, we also propose a modified sensitivity analysis for process variation effects that provides sufficiently accurate results at a lower regression cost. In the experimental results, we reduce the simulation time of HSPICE MC analysis from several weeks to several hours while retaining similar statistical results. |
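The general idea of running Monte Carlo analysis on a behavioral model instead of a transistor-level netlist can be sketched in a few lines. Everything specific below is an assumption for illustration: the model is the textbook second-order charge-pump PLL natural frequency, ωn = sqrt(Icp·Kvco/(N·C)) with Kvco in Hz/V, and the nominal values and 5% Gaussian spread are invented. The behavioral model and extraction flow in the paper are not described in the abstract and are certainly richer than this.

```python
import math
import random
import statistics

# Hypothetical nominal parameters (not from the paper).
NOMINAL = {"icp": 100e-6, "kvco": 500e6, "c": 100e-12}   # A, Hz/V, F
DIVIDER_N = 32


def natural_frequency(icp, kvco, c, n=DIVIDER_N):
    """Textbook second-order charge-pump PLL natural frequency in rad/s."""
    return math.sqrt(icp * kvco / (n * c))


def behavioral_monte_carlo(samples=2000, sigma=0.05, seed=1):
    """Sample each parameter with a relative Gaussian spread and collect statistics."""
    rng = random.Random(seed)
    results = []
    for _ in range(samples):
        params = {name: value * rng.gauss(1.0, sigma) for name, value in NOMINAL.items()}
        results.append(natural_frequency(**params))
    return statistics.mean(results), statistics.stdev(results)


mean, std = behavioral_monte_carlo()
print(f"mean = {mean:.4g} rad/s, sigma = {std:.4g} rad/s")
```

Each sample here costs one closed-form evaluation rather than a circuit simulation, which is the reason behavioral Monte Carlo can be orders of magnitude faster, provided the behavioral model has been extracted accurately.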
11:00 |
D3-5 |
A Low Power Wide Range Duty Cycle Corrector Based on Pulse
Shrinking/Stretching Mechanism |
PDF |
Poki Chen, Shi-Wei Chen, and Juan-Shan Lai, National Taiwan University of Science and Technology |
A duty cycle correction (DCC) circuit based on a pulse shrinking/stretching mechanism is presented. The proposed DCC has been fabricated in a TSMC 0.35 μm standard CMOS process. An input duty cycle range of 30% to 70% is achieved. The duty cycle error stays between -1.0% and +1.0% over an operating frequency range of 3 MHz to 660 MHz, the widest achieved to date, which makes the circuit best suited for ultra-wideband applications. The chip area is merely 0.3×0.2 mm², and the power consumption is 1.1 mW at 550 MHz. |
11:15 |
D3-6 |
A Wide-Range Synchronous 50% Duty-Cycle Clock Generator |
PDF |
Wei-Hao Chiu and Tsung-Hsien Lin, National Taiwan University |
A CMOS synchronous 50% duty-cycle clock generator is presented in this paper. The proposed circuit comprises a clock-generation module and a phase error integration module. The clock-generation module senses the edges of an input signal to produce an output whose duty cycle is controlled to 50% by the phase error integration loop. The duty cycle control signal is generated by sensing the phase error between the input and the output. This work further proposes a calibration scheme to enhance the accuracy of the phase error integrator; hence, the residual phase error attributed to various non-idealities can be greatly reduced. The circuit is also capable of operating over a wide frequency range by implementing a cyclic delay topology. The proposed circuit is designed in the TSMC 0.18-μm CMOS process and operates from a 1.8-V supply voltage. The operating frequency ranges from 1 MHz to 1 GHz, and the circuit accommodates a wide range of input duty cycles, from 6% to 94%, at 1 GHz. The duty-cycle error of the output signal is less than 0.5%, and the circuit draws 12 mA at 1 GHz. |
TECHNICAL PROGRAM
Session
E3 |
Day
08/10 |
Time
10:00-11:30
|
Chair Prof. 陳春僥
National University of Kaohsiung |
Room
318 |
10:00 |
E3-1 |
Throughput-Aware Floorplanning by Considering Multiple Critical Cycles |
PDF |
Li-Ya Wang and Juinn-Dar Huang, National Chiao Tung University |
Wire delay is gradually coming to dominate the clock rate of a system and is becoming an important issue for system design. However, it is hard to estimate the wire delay precisely in early design stages, before floorplanning is actually done. In this work, we show how the latency induced by wire delay dominates system performance and re-evaluate several floorplanning strategies that were previously considered to provide the same quality of result (QoR). We then propose a new throughput-aware floorplanning strategy that considers a set of the most critical performance loops simultaneously. The experimental results show that our approach can even double the system performance of the previous method in some cases.
|
10:15 |
E3-2 |
SIMD Code Generation for Multimedia |
PDF |
Cheng-Cho Jean, Guang-Huei Lin, Sao-Jie Chen, and Alan P. Su, National Taiwan University |
Multimedia extensions are ubiquitous in today's general-purpose processors. This has prompted the need to generate efficient simdized code from which SIMD architectures can benefit. This paper investigates compiler techniques that target short vector instructions effectively and automatically. The most common aspects of compilation are the effective management of memory alignment and the handling of mixed data lengths. Based on a code study of various multimedia workloads, we identify several new challenges that arise in simdizing for multimedia extensions and provide some solutions to these challenges. We then present a framework that addresses several of the simdization issues mentioned above. |
10:30 |
E3-3 |
H.264 Decoder Optimization - VLIW DSP Platform |
PDF |
Pou-Hang Ian, Jia-Ming Chen, Hsin-Wen Wei, Jian-Liang Luo, and Wei-Kuan
Shih, National Tsing Hua University |
This paper presents several optimization techniques for an H.264/AVC decoder implementation on a dual-core VLIW PAC DSP platform. The evaluation results show that a video with D1 resolution can be decoded in real time. |
10:45 |
E3-4 |
H.264/AVC Baseline Profile Decoder Optimization on PAC DSP |
PDF |
Chiu-Ling Chen, Jia-Ming Chen, Jian-Liang Luo, Tien-Wei Hsieh, and
Wei-Kuan Shih, National Tsing Hua University |
Optimization techniques for the major procedures of the H.264/AVC decoder on the PAC DSP are given in this paper, which provides valuable experience for similar implementations. |
11:00 |
E3-5 |
SIMD Optimizations for PAC VLIW DSP Processors with Sub-word
Instructions |
PDF |
Ci-Bang Kuan and Jenq Kuen Lee, National Tsing Hua University |
The rapid growth and evolution of multimedia applications has been putting great pressure on modern processors to deliver further performance enhancements within limited cost and power budgets. To meet these computing requirements, DSP processors are commonly equipped with sub-word instructions, a form of SIMD instructions, to boost the performance of computation-intensive applications. Unfortunately, until now sub-word instructions have been accessible only through library routines, intrinsic functions, and in-line assembly, not from general C programs. This hinders the use of sub-word instructions in the deployment of software applications. In this paper, we present a flow that enables C compilers to perform auto-vectorization using sub-word instructions. The vectorizing compiler identifies the data-level parallelism implicit in C programs and automatically generates assembly with sub-word instructions whenever possible. The target architecture in our experiment is based on PAC VLIW DSP processors. The performance of the vectorized programs is evaluated using a set of DSP loop kernels that are typical and representative of digital signal processing. The preliminary results reveal that our vectorizing compiler generates efficient code, with speedups from 1.3 to 2.1 over code compiled without our proposed optimizations. |
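What a sub-word instruction buys can be illustrated by emulating one in software. The sketch below packs four 8-bit lanes into a 32-bit word and adds all four lanes in one pass while preventing carries from crossing lane boundaries. It is a generic illustration of sub-word (SIMD-within-a-register) arithmetic; the function names are hypothetical and nothing here reflects the actual PAC DSP instruction set or the authors' compiler.

```python
LANE_MSB = 0x80808080   # top bit of each 8-bit lane
LANE_LOW = 0x7F7F7F7F   # low seven bits of each 8-bit lane


def add_4x8(a, b):
    """Add four packed 8-bit lanes of two 32-bit words, wrapping within each lane."""
    low = (a & LANE_LOW) + (b & LANE_LOW)        # carries stop at bit 7 of each lane
    return (low ^ ((a ^ b) & LANE_MSB)) & 0xFFFFFFFF


def pack(lanes):
    """Pack four 8-bit values into one word, lane 0 in the least significant byte."""
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & 0xFF) << (8 * i)
    return word


def unpack(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


x = [1, 2, 250, 128]
y = [3, 4, 10, 128]
print(unpack(add_4x8(pack(x), pack(y))))          # one packed add
print([(p + q) & 0xFF for p, q in zip(x, y)])     # the equivalent scalar loop
```

An auto-vectorizer performs the same transformation on a C loop: it recognizes that the scalar loop body is independent across iterations and replaces several iterations with a single packed operation.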
11:15 |
E3-6 |
Standard Cell Like Via-Configurable Logic Block Design for Structured
ASICs |
PDF |
Mei-Chen Li, Chien-chung Lai, Hui-Hsiang Tung, and Rung-Bin Lin, Yuan Ze University |
A structured ASIC consisting of pre-fabricated yet via-configurable logic blocks (VCLBs) and a regular fabric can achieve timing performance comparable to that of an ASIC while using much less power and area than an FPGA. To reduce the tool development cost for structured ASICs, in this paper we propose a standard-cell-like VCLB so that existing tools can be leveraged to perform chip designs using our VCLBs. We create a standard cell library based on our VCLB and establish a design flow based on some commercial tools and our own tools. Experimental data show that our approach achieves a delay of 1.93 times that attained by designs using a commercial cell library, at the expense of a 320% increase in chip area. The product of delay and area achieved by our approach is on average 44% better than that achieved by some previous work. |
TECHNICAL PROGRAM
Session
P3A |
Day
08/10 |
Time
10:00-12:00
|
Chair Prof. 李順裕
National Chung Cheng University |
Room
2F Banquet Hall |
P3A-1 |
Voltage-Mode First Order All-Pass Filter using DDCC |
PDF |
Wei-Yuan Chiu, Jiun-Wei Horng, and Chuan-Hsien Chang, Chung Yuan Christian University |
In this paper, a new voltage-mode first-order all-pass filter using a minimum number of active and passive components is presented. The proposed circuit employs only one differential difference current conveyor (DDCC), one grounded capacitor, and one resistor, and offers the following advantages: the use of only a grounded capacitor, which is attractive for integrated circuit implementation; low active and passive sensitivities; and no requirement for component matching conditions. PSPICE simulation results that verify the theoretical analyses are included.
|
P3A-2 |
Analog Circuits Fault Diagnosis under Parameter Variations Based on
Fuzzy Logic system |
PDF |
林宗志、陳盈州、郭明仁, Feng Chia University |
In this paper, an efficient and fast fault diagnosis methodology based on a fuzzy logic system (FLS) is proposed for linear analog circuits. After faults are diagnosed and located using a radial basis function neural network, fuzzy rule bases are constructed to characterize the behavior of the circuit under test in both faulty and fault-free situations and to evaluate the faulty element values. The simulation results confirm the validity and performance of the advocated fault diagnosis technique.
|
P3A-3 |
A CMOS Low-Noise Amplifier with Shunt-Peaking for 3-5GHz Ultra-Wideband
Wireless System |
PDF |
Zhe-Yang Huang, Che-Cheng Huang, and Chung-Chih Hung, National Chiao Tung University |
This paper presents a low-noise amplifier (LNA) with a shunt-peaking load for MB-OFDM Group-A and DS-UWB low-band 3-5 GHz ultra-wideband wireless radio systems. The LNA is designed and implemented in the TSMC 0.18 μm RF CMOS process. Measurement results show that the maximum power gain is 18.5 dB, the input and output matching are lower than -6.6 dB and -6.0 dB, respectively, and a minimum NF of 2.9 dB can be achieved, while the power consumption is 18.7 mW from a 1.8 V power supply.
|
P3A-4 |
Analytical Synthesis of Low-Sensitivity Voltage-Mode Odd-Nth-Order OTA-C
Elliptic Filter Structure with the Minimum Number of Components |
PDF |
Chun-Ming Chang, Chung Yuan Christian University |
Although a current-mode odd-nth-order operational transconductance amplifier and capacitor (OTA-C) elliptic filter structure with the minimum number of active and passive components was presented recently, its voltage-mode counterpart has not yet been reported. Using a new analytical synthesis method, namely an innovative algebraic decomposition of a complex nth-order transfer function into n simple and feasible equations, this paper proposes a voltage-mode odd-nth-order OTA-C elliptic filter structure with the minimum number of components. An Hspice simulation with a 0.35 μm process of a voltage-mode third-order OTA-C elliptic low-pass filter, employing only four OTAs and three grounded capacitors, validates not only the precise filtering parameters but also the low sensitivity and low power consumption. |
P3A-5 |
A 14-Bit Fourth-Order Sigma-Delta Modulator with Feedforward
Architecture for Hearing Aid |
PDF |
Shuenn-Yuh Lee, Jia-Hua Hong, Chi-Ching Lin, Chui-Kum Chiu, and
Sheng-Jing Ku, National Chung Cheng University |
A fourth-order sigma-delta modulator (SDM) with a feedforward (FF) structure is implemented for hearing aids. In this paper, non-ideal circuit models are built for systematic analysis, and the required circuit specifications are obtained by behavioral simulation with these non-ideal circuit models. The circuit implementation based on the required specifications is a fourth-order FF SDM with an over-sampling ratio (OSR) of 64 and a bandwidth of 10 kHz in a 0.35 μm TSMC CMOS process. Measurement results reveal that the SDM, operating from a 3.3-V supply voltage, achieves a dynamic range of 90 dB and a spurious-free dynamic range (SFDR) of 87 dB with a signal bandwidth of 10 kHz at a sampling frequency of 1.28 MHz, in agreement with the behavioral analysis.
|
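The numbers in the P3A-5 abstract are mutually consistent, which the following sketch checks. It uses only two standard relations, the sampling frequency fs = 2·OSR·BW and the effective resolution ENOB = (DR − 1.76)/6.02, and the function names are chosen here for illustration.

```python
def sampling_frequency_hz(osr, bandwidth_hz):
    """Sampling frequency implied by an oversampling ratio and a signal bandwidth."""
    return 2 * osr * bandwidth_hz


def effective_bits(dynamic_range_db):
    """Effective resolution in bits for a given dynamic range in dB."""
    return (dynamic_range_db - 1.76) / 6.02


print(sampling_frequency_hz(64, 10e3))   # OSR and bandwidth from the abstract
print(effective_bits(90))                # measured dynamic range from the abstract
```

An OSR of 64 over a 10 kHz band gives the 1.28 MHz sampling frequency quoted in the abstract, and a 90 dB dynamic range corresponds to between 14 and 15 effective bits, in line with the 14-bit figure in the title.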
P3A-6 |
A UWB CMOS Power Amplifier With Differential to Single-Ended Converter |
PDF |
Shuenn-Yuh Lee and Guan-Da Lu, National Chung Cheng University |
A UWB PA with a differential to single-ended converter (DSC) has been implemented and fabricated using the TSMC CMOS RF 0.18 μm 1P6M process. Both a cascode structure and a two-stage amplifier are adopted to increase the bandwidth, gain, and gain flatness. Die-on-PCB measurements show that this PA provides an average power gain of 10 dB and a P1dB above 0 dBm in the frequency range from 3.1 GHz to 7 GHz. Moreover, the PAE is 11% at 4 GHz with a power consumption of 60 mW.
|
P3A-7 |
An 8-Bit 150-MS/s Fully Differential Dual-Channel Time-Interleaved
Pipeline A/D Converter |
PDF |
Chih-Hsiang Chang and Ching-Yuan Yang, National Chung Hsing University |
In this paper, a dual-channel, time-interleaved, pipeline analog-to-digital converter (ADC) is presented. The ADC achieves a conversion rate of 150 MS/s with 8-bit resolution. Fabricated in 0.35-μm CMOS technology, the chip measures 1.8 mm × 1.8 mm. It dissipates 212 mW from a 3.3-V supply. |
P3A-8 |
A Wide-Band Low-Power Quadrature VCO |
PDF |
Ching-Yi Chen, National Chung Cheng University |
A wide-band low-power quadrature voltage-controlled oscillator (QVCO) in the TSMC 0.18 μm CMOS process is proposed in this paper. The QVCO adopts a cross-coupled structure and uses a current-reuse technique. The architecture not only offers a larger and more symmetrical amplitude but also reduces the power spur. Based on our measurements, the phase noise at a 1-MHz offset from the 4.9-GHz carrier is -108 dBc/Hz, and the proposed QVCO has a tuning range from 3.6 to 4.9 GHz. Moreover, the phase error and power imbalance are less than 5° and 1.5 dB, respectively, and the power consumption is 8 mW at a 2-V power supply voltage.
|
P3A-9 |
Low Power Sigma Delta Modulator with Dynamic Biasing for Audio
Applications |
PDF |
Hsin-Liang Chen, Yi-Sheng Lee, and Jen-Shiun Chiang, Tamkang University |
In this paper, a low-power sigma-delta modulator with a dynamic biasing technique is presented. Based on an analysis of the operation of the switched-capacitor integrator, the folded-cascode operational amplifier is designed with optimized biasing currents in three different phases to reduce power dissipation. The total power saving is 20% compared with a conventional design. A prototype fourth-order single-bit MASH 2-2 sigma-delta modulator is designed with the dynamic biasing technique to achieve a dynamic range of 95 dB and a peak signal-to-noise-and-distortion ratio of 93 dB. The experimental circuit is designed in 0.35 μm 2P4M CMOS technology. The chip area is 3.11 mm², and the power dissipation is only 5 mW from a supply voltage of 3 V.
|
P3A-10 |
A New Current-Mode Wheatstone Bridge Based on Fully Differential
Operational Transresistance Amplifiers |
PDF |
Yuh-Shyan Hwang, Chun-Chi Shih, Jiann-Jong Chen,
and Wen-Ta Lee, National Taipei University of Technology |
A new current-mode Wheatstone bridge (CMWB) that uses a fully differential operational transresistance amplifier (FDOTRA) is presented in this paper. The proposed CMWB has been analyzed, simulated, and implemented. The advantages of the proposed CMWB are twofold. Firstly, it reduces the number of sensing passive and active elements. Secondly, the proposed CMWB circuit offers a significant improvement in accuracy compared to other CMWBs. Simulation results that confirm the theoretical analysis are obtained. The proposed circuit has been designed in the TSMC 0.35 μm DPQM CMOS process.
|
P3A-11 |
An Embedded 10-bit 200MHz DAC IP with Self-Calibrating Current Bias for
SoC Applications |
PDF |
Chung-Ming Pan and Chien-Hung Tsai, National Cheng Kung University |
In this paper, a 10-bit 200 MHz DAC with a self-calibrating current bias is designed. With this architecture, the inaccuracy of the integrated internal load resistors and the gain error of the output voltage swing can both be reduced, making the DAC suitable for embedded CMOS SoC applications. Dual-unary cell segments, comprising a 6-bit MSB segment and a 4-bit LSB segment, are used to reduce the static nonlinearity. Several useful circuit techniques and a source-degenerated current switch are adopted to enhance the performance. The prototype is implemented in a standard 0.35 μm 2-poly 4-metal CMOS technology and occupies 0.99 mm² of die area. The simulation results show that DNL and INL are less than 0.25 LSB and 0.3 LSB, respectively. An SNDR of 58 dB and an SFDR of 63 dB are achieved for a 10 MHz input signal at a 200 MHz clock frequency with a power dissipation of 70 mW. |
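A segmented unary decode of the kind mentioned in the P3A-11 abstract can be sketched as follows. The sketch assumes that "dual-unary" means both segments are thermometer-decoded, giving 63 MSB cells of weight 16 and 15 LSB cells of weight 1, and it models ideal cells only. The function names are hypothetical, and the actual cell arrangement in the paper may differ.

```python
MSB_CELLS = 63   # 2**6 - 1 unary cells, each worth 16 LSB (assumed)
LSB_CELLS = 15   # 2**4 - 1 unary cells, each worth 1 LSB (assumed)


def split_code(code):
    """Split a 10-bit input code into its 6-bit MSB and 4-bit LSB segments."""
    if not 0 <= code < 1024:
        raise ValueError("code must fit in 10 bits")
    return code >> 4, code & 0xF


def thermometer(value, cells):
    """Thermometer code: the first `value` cells are on, the rest are off."""
    return [1] * value + [0] * (cells - value)


def dac_output(code):
    """Ideal output in LSB units, summing the currents of all enabled cells."""
    msb, lsb = split_code(code)
    return 16 * sum(thermometer(msb, MSB_CELLS)) + sum(thermometer(lsb, LSB_CELLS))


print(split_code(0b1010110011), dac_output(0b1010110011))
```

Because a thermometer code only ever turns one more cell on as the segment value rises by one, each segment is monotonic by construction, which is the usual reason unary segments help static linearity.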
TECHNICAL PROGRAM
Session
P3E |
Day
08/10 |
Time
10:00-12:00
|
Chair Prof. 蘇慶龍
National Yunlin University of Science and Technology |
Room
2F Banquet Hall |
P3E-1 |
Simultaneous Module Selection and Clock Skew Scheduling for Minimizing
Standby Leakage Current |
PDF |
Shih-Hsu Huang, Da-Chen Tzeng, and Chun-Hua Cheng, Chung Yuan Christian University |
In event-driven applications, the standby leakage current accounts for a large fraction of total power dissipation. The power gating technique is one of the most effective ways to reduce the standby leakage current. However, when power gating is applied, there is a delay-power tradeoff, which can be characterized by the sizes of the sleep transistors. As a result, for each functional unit, the largest allowable delay (due to the timing constraints) limits the smallest leakage current that power gating can achieve. In this paper, we point out that, under the same clock period constraint, different clock skew schedules result in different standby leakage currents (due to different timing constraints). Based on that observation, we present an MILP (mixed integer linear programming) approach that formally formulates the problem of simultaneously applying module selection (i.e., power gating implementation selection) and clock skew scheduling. Experimental data show that, compared with the existing possible design flow, our approach reduces the standby leakage current by 29%.
|
P3E-2 |
Totally Self-Checking Borden Code Checker Design Using Modulo Adders |
PDF |
Wen-Feng Chang, Debaleena Das, and Cheng-Wen Wu, Vanung University |
On-line checkers are attracting increasing attention with the advent of deep submicron VLSI technology and system-on-chip (SOC), where reliability and yield are becoming big issues for future generations of VLSI products. A technique for designing hardware-efficient totally self-checking (TSC) checkers for Borden codes is proposed. Borden codes, C(n,t), are optimal t-unidirectional error detecting codes for n-bit vectors. Borden codes have gained importance because a large number of errors in modern VLSI circuits are of the unidirectional type and have a limited multiplicity. The checker proposed here is based on the modulo property of Borden codes. The checker is composed of a modulo adder which maps all the Borden codewords to t+1 subsets. The adder is followed by a translator and a two-rail code checker, which detects these t+1 subsets. Compared with previous methods, this checker has a much lower hardware complexity: it reduces the hardware complexity from O(n (log n)²) to O(n log n/t). |
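The modulo property that such a checker relies on can be demonstrated in software. The sketch below uses the usual textbook definition of a Borden code, stated here as an assumption because the abstract does not spell it out: C(n,t) contains every n-bit word whose weight is congruent to ⌊n/2⌋ modulo t+1. It then verifies by brute force that no unidirectional error of multiplicity 1 to t maps a codeword to another codeword. It models the code only, not the checker hardware in the paper.

```python
from itertools import combinations, product


def is_borden_codeword(bits, t):
    """Membership test: weight mod (t+1) must equal floor(n/2) mod (t+1)."""
    n = len(bits)
    return sum(bits) % (t + 1) == (n // 2) % (t + 1)


def borden_code(n, t):
    """Enumerate all codewords of C(n, t); only practical for small n."""
    return [w for w in product((0, 1), repeat=n) if is_borden_codeword(w, t)]


def detects_unidirectional_errors(n, t):
    """Check that every 0->1-only or 1->0-only error of size 1..t is detected."""
    for word in borden_code(n, t):
        for direction in (0, 1):                 # flip only zeros, or only ones
            positions = [i for i, b in enumerate(word) if b == direction]
            for m in range(1, t + 1):
                for flips in combinations(positions, m):
                    corrupted = list(word)
                    for i in flips:
                        corrupted[i] ^= 1
                    if is_borden_codeword(corrupted, t):
                        return False
    return True


print(len(borden_code(6, 2)), detects_unidirectional_errors(6, 2))
```

A unidirectional error of multiplicity m changes the weight by exactly m, so for 1 ≤ m ≤ t the residue modulo t+1 always changes, which is why a modulo-(t+1) weight computation is enough to detect it.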
P3E-3 |
Analytical Aerial Imaging Simulation for OPC |
PDF |
陳中平、詹霖、曾俊貴、鍾士勇、王芝宇, National Taiwan University |
Optical proximity correction (OPC) is essential to present-day microlithography, especially for complicated IC layout structures. In this paper, we propose an analytical way to evaluate the light distribution from the imaging equation. It is easy to implement, accurate enough for academic use and further optimization, and acceptably fast compared with traditional numerical methods, which evaluate the image by discrete convolution or FFT and may lose accuracy because of discrete data and sampling. The optical imaging evaluation discussed in this paper can also be applied in physical design to produce well-organized layout structures that are more OPC-friendly from the start. The resulting light intensity distributions, with simulation time proportional to the number of slits, are shown and compared with the well-known simulator SPLAT. |
P3E-4 |
An Experiment of Test Plan Construction & Test Automation |
PDF |
Tsung-Ju Yang, Ming-Chang Tung, Wei-Yu Lin, Zhi-Wei Lin, Chi-Hen
Chang, and Farn Wang, National Taiwan University |
We investigate the issue of test automation for embedded systems. We use a mobile phone as our experimental SUT (System Under Test) and identify the test tasks that can be automated with tool support. As a result, we have developed a graphical test case editor that allows users to draw high-level test cases as MSCs (Message Sequence Charts) and a test compiler that translates MSCs into test executables in C/C++. We have also developed a configurable mobile phone simulator that covers the general capabilities expected of a mobile phone, such as dialing, call answering, MP3 playing, and calculator operation. We then discuss how to use the international standard TTCN-3 to implement the SUT adaptor and platform adaptor, and how to construct a test matrix for testing the SUT against a number of specifications and criteria. Finally, we report the experimental data. |
P3E-5 |
A Flip-Flop Replacement Technique for IR Drop Reduction |
PDF |
Jiun-Kuan Wu, Liang-Ying Lu, Kuang-Yao Chen,
and Tsung-Yi Wu, National Changhua University of Education |
As process technology progresses to ultra-deep sub-micron, IR drop becomes an important issue for circuit designers. Clock skew scheduling for peak current reduction is a popular technique for solving the IR-drop problem in the physical design stage. In this paper, we propose three kinds of long-delay flip-flops and an algorithm that replaces selected normal flip-flops of a circuit with the long-delay flip-flops. Because the replacement separates the switching times of the flip-flops, the peak current and the IR drop effect can be reduced. Unlike traditional clock skew scheduling, our technique can be used not only in the physical design stage but also in the logic design stage. Another advantage of our technique over clock skew optimization is lower area overhead, because our method does not increase the demand for routing resources, whereas clock skew optimization may. Experimental results show that our technique can reduce the peak current and the voltage drop by up to 41.95% and 31.82%, respectively, with an area overhead of less than 1%.
|
P3E-6 |
A Design Methodology for Application-Specific Instruction-set Processors
with Memory Access Considerations |
PDF |
Ji-Ying Wu, Chi-Jie Lin, Je-Rung Shiu, De-Sheng Chen,
and Yi-Wen Wang, Feng Chia University |
System designers can use automatic algorithms to add new instructions, called Application-Specific Instructions (ASIs), that optimize a target application program, improving system performance and reducing the design time of an ASIP. Previous ASIP research has focused mostly on instruction latency as the way to improve performance, and the impact of memory access is often ignored. In this paper, a design flow is proposed that automatically generates ASIs to improve performance. The flow consists of translating a C program to CDFGs, selecting ASIs, and simulating on a MIPS R3000-based microarchitecture. We consider instruction latency and a simple memory parameter at the same time. Our experimental results show that adding a simple memory model improves performance by up to 22% and reduces memory accesses by up to 24% compared with considering instruction latency only.
|
P3E-7 |
Yield Analysis for the 65nm SRAM Cells Design with Resolution
Enhancement Techniques (RET) |
PDF |
J. J. Tang, C. L. Liao, P. C. Jheng, S. H. Chen, K. M. Lai, and L. J.
Lin, Southern Taiwan University of Science and Technology |
Photolithography remains the driving and enabling technology that allows the modern semiconductor industry to fabricate integrated circuits with ever-decreasing feature sizes. However, because of the wave properties of light, such as diffraction and interference, critical dimensions cannot be scaled down with acceptable yield unless Resolution Enhancement Techniques (RETs) are used. Two major RETs, optical proximity correction (OPC) and phase shift masks (PSM), are often employed to improve the manufacturability and yield of nanometer ICs. RETs have even enabled optical lithography to reliably produce IC features 2 or even 3 times smaller than the optical wavelength used for pattern imaging. This paper presents photolithography simulation results and experience from designing a 65 nm SRAM standard cell using various RETs for lithography at 193 nm. The effects of the OPC and PSM methodologies, together with off-axis illumination (OAI) and water immersion techniques, on photolithography image quality and yield are examined based on the yield analysis criteria Mask Error Factor (MEF), Exposure Latitude (EL), and Process Window (PW).
|
P3E-8 |
Object-Oriented Hardware/Software Co-Design Using Java |
PDF |
Chin-Tai Chou, Fu-Chiung Cheng, and Hung-Chi Wu, Tatung University |
Object-oriented design methodology helps to handle design complexity in software and has thus gained popularity in hardware/software co-design. However, the existing approaches focus on modelling and simulation. In this paper, we propose a novel approach to modelling and synthesizing hardware/software co-design systems using Java as the specification language. This makes modelling and verification possible, and easier, early in the design process. An example is given to demonstrate and evaluate the performance gain of our approach.
|
TECHNICAL PROGRAM
Session
P3D |
Day
08/10 |
Time
10:00-12:00
|
Chair Prof. 朱守禮
Chung Yuan Christian University |
Room
2F Banquet Hall |
P3D-1 |
High-speed, Low Cost Parallel Memory-Based FFT Processors for OFDM
Applications |
PDF |
Shin-Yo Lin, Wei-Chien Tang, Muh-Tien Shiue, and Chin-Long Wey, National Central University |
Low-cost yet efficient FFT (Fast Fourier Transform) processors are greatly needed for real-time operation in many OFDM applications, such as xDSL, DAB, and DVB-T/H. This paper presents three radix-2 memory-based FFT (MBFFT) processors with a memory size of N words for an N-point complex FFT. Experimental results show that the core area of the developed MBFFT is 2.04 mm², with a maximum working frequency of 198 MHz, for N = 8192 points (24 bits per word).
|
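The reason a radix-2 FFT can run in a memory of only N words is that each butterfly reads two words and writes its two results back to the same addresses. The sketch below is a plain software version of that in-place computation, offered only to illustrate the memory access pattern. It says nothing about the parallel architectures, addressing schemes, or fixed-point word format of the processors in the paper.

```python
import cmath


def fft_in_place(mem):
    """Radix-2 decimation-in-time FFT that overwrites `mem` (length: power of two)."""
    n = len(mem)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")

    # Bit-reversal permutation, done by swapping words inside the same memory.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            mem[i], mem[j] = mem[j], mem[i]

    # log2(n) butterfly stages; each butterfly reads and rewrites two words.
    size = 2
    while size <= n:
        half = size // 2
        w_step = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(half):
                a = mem[start + k]
                b = mem[start + k + half] * w
                mem[start + k] = a + b
                mem[start + k + half] = a - b
                w *= w_step
        size *= 2
    return mem


print([round(abs(v), 3) for v in fft_in_place([1, 1, 1, 1, 0, 0, 0, 0])])
```

No second buffer is allocated at any point, so the working storage stays at N complex words regardless of the number of stages.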
P3D-2 |
Self-Aligned Double Bits SONOS Cell and Its Memory Circuit Design |
PDF |
Jyi-Tsong Lin, Wei-Ching Lin, and Ho-Lin Lee, National Sun Yat-sen University |
In this paper, a new two-bit SONOS cell structure is demonstrated. The cell fabrication is self-aligned, and the pair of vertical trapping columns is located adjacent to the channel below the gate. The cell is compatible with state-of-the-art MOS manufacturing processes without any additional special steps. The vertical trapping pattern not only supports multi-bit operation but also allows further miniaturization, and self-alignment is achieved within this design.
In addition, a suitable distance between the trapping columns promises reliable isolation, so prolonged retention time and suppressed bit-to-bit interference can reasonably be expected. A trapping region well wrapped by oxide layers also extends the retention time.
The trapping region has a long vertical stripe shape with a needle-like end close to the channel terminals. Trapped carriers accumulate and concentrate in this needle zone, which amplifies the threshold voltage shift, gives a wider programming window, and makes multi-bit operation feasible.
In this paper, we use ISE TCAD/DESSIS simulation to establish the device geometry and its related characteristics. We also use PSPICE to draw an easily understood schematic diagram of a memory cell; model establishment and import for deeper research and assessment, such as cell-to-cell interference and multi-level current comparator design, are left for future work.
|
P3D-3 |
Computation Sharing Programmable FIR Filter Using Canonic Signed Digit
Representation |
PDF |
Shui-Wen Hsu and Yuan-Hao Huang, National Tsing Hua University |
|
This paper presents a low-cost and
high-performance programmable digital finite impulse response (FIR)
filter. The architecture employs the computation sharing algorithm to
reduce the computational complexity. In the traditional computation
sharing algorithm, the critical path constraint on the output summation
stage is a bottleneck. The canonic signed digit (CSD) representation is
therefore used for the filter coefficients to relax the timing
constraints. Because the critical path timing is relaxed, the
computation cost is reduced further, so the goals of low cost and high
performance can both be achieved. The synthesis results show that the
proposed architecture achieves a greater area cost reduction at larger
tap lengths than the traditional computation sharing FIR filter. |
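As background on the canonic signed digit representation, the sketch below converts an integer coefficient to CSD form, with digits in {-1, 0, +1} and no two adjacent nonzero digits. It shows only the number representation, not the authors' filter architecture; the names `to_csd` and `nonzero_count` are hypothetical.

```python
def to_csd(n):
    """Return the CSD digits of integer n, least significant digit first.

    Each digit is -1, 0, or +1, and no two adjacent digits are both nonzero.
    """
    digits = []
    while n != 0:
        if n & 1:
            # n mod 4 == 1 gives digit +1; n mod 4 == 3 gives digit -1.
            d = 2 - (n & 3)
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits

def nonzero_count(digits):
    """Number of nonzero digits, i.e. shift-and-add/subtract terms needed."""
    return sum(1 for d in digits if d != 0)
```

For example, the coefficient 7 is 111 in binary (three nonzero bits) but 8 - 1 in CSD (two nonzero digits), so a multiplication by 7 needs one fewer add or subtract term.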
P3D-4 |
A Low-Complex Image Coding Algorithm Based on Wavelet Transform |
PDF |
Trong-Yen Lee, Yang-Hsin Fan, and Su-Zhen Hong, National Taipei University of Technology |
|
In this paper, we present a low-complexity
image coding algorithm that compresses images using the zerotree method.
We propose backward scan and lowest tree coding (BSLTC), which
efficiently constructs wavelet coefficient trees. Experimental results
show that BSLTC achieves shorter execution time and smaller memory size.
Moreover, BSLTC has lower complexity than SPIHT, JasPer/JPEG 2000, and LTW.
|
P3D-5 |
A Low-Complexity High-Performance Two-Dimensional Look-Up Table for LDPC
Hardware Implementation |
PDF |
Tzu-Wen Chung, Chen-Pang Chang, Jung-Chieh Chen,
and Po-Hui Yang, National Kaohsiung Normal University |
|
In this paper, we propose a low-complexity
and high-efficiency two-dimensional look-up table (2-D LUT) for performing the sum-product algorithm in the decoding of low-density parity-check (LDPC) codes. Instead of employing adders for the core operation when updating check-node messages, the proposed scheme successfully merges the main term and the correction factor of the core operation into a compact 2-D LUT. Simulation results indicate that the proposed 2-D LUT not only attains close-to-optimal bit error rate performance but also has low complexity, which makes it suitable for hardware implementation.
|
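For readers unfamiliar with the "main term" and "correction factor", the sketch below uses the standard two-input check-node operation of the sum-product algorithm, which can be written as a sign-min main term plus two logarithmic correction terms. Tabulating it over quantized inputs gives a 2-D table. This is a generic illustration: the quantization parameters `levels` and `step` and all function names are assumptions, not the authors' LUT design.

```python
import math

def boxplus_exact(a, b):
    """Two-input check-node operation on log-likelihood ratios (tanh form)."""
    return 2.0 * math.atanh(math.tanh(a / 2.0) * math.tanh(b / 2.0))

def boxplus_main_plus_correction(a, b):
    """Same operation written as a main term plus a correction factor."""
    main = math.copysign(1.0, a) * math.copysign(1.0, b) * min(abs(a), abs(b))
    correction = (math.log1p(math.exp(-abs(a + b)))
                  - math.log1p(math.exp(-abs(a - b))))
    return main + correction

def build_2d_lut(levels, step):
    """Tabulate the operation for quantized inputs i*step and j*step.

    Keys are (i, j) with i, j in [-levels, levels]; values are the result
    rounded to the nearest multiple of step, stored as an integer index.
    """
    idx = range(-levels, levels + 1)
    return {(i, j): round(boxplus_main_plus_correction(i * step, j * step) / step)
            for i in idx for j in idx}
```

With the table in place, one check-node update step becomes a single lookup indexed by the two quantized inputs, instead of a comparison followed by additions of correction terms.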
P3D-6 |
Hierarchical Decision Table for Bad Pixel Detection in Stereo Vision |
PDF |
Tsung-Hsien Tsai, Nelson Yen-Chung Chang, and Tian-Sheuan Chang, National Chiao Tung University |
|
The detection of bad pixels is an important
issue for quality restoration in computational stereo. This paper
presents a bad pixel detection method using a hierarchical decision
table approach, based on the information of left-right checking,
unreliability methods and the disparity smoothness. The proposed method
integrates the features of the first two methods and additionally adds
the disparity smoothness to eliminate the noise effect in detected
pixels. Experimental results show that the proposed method achieves a
higher and more consistent detection rate (58.2%~82.7%) and accuracy
(56.7%~84.4%) over different stereo image pairs than
previous approaches. |
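Of the three cues named above, left-right checking is the simplest to show in code. The following is a generic single-scanline sketch of that check only; it does not reproduce the authors' hierarchical decision table. The sign convention (left pixel x matches right pixel x - d), the `threshold` parameter, and the function name are assumptions.

```python
def left_right_check(disp_left, disp_right, threshold=1):
    """Flag inconsistent pixels on one scanline of integer disparities.

    Returns a list of booleans, True where the left-image pixel is
    considered bad. A pixel is bad when its match falls outside the right
    image or when the two disparity maps disagree by more than threshold.
    """
    bad = []
    for x, d in enumerate(disp_left):
        xr = x - d  # matching column in the right image (assumed convention)
        if xr < 0 or xr >= len(disp_right):
            bad.append(True)
        else:
            bad.append(abs(d - disp_right[xr]) > threshold)
    return bad
```

A call such as `left_right_check([0, 1, 3, 1], [0, 1, 1, 1], threshold=0)` marks the second and third pixels as bad: one disagrees with the right-image disparity and the other maps outside the image.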
P3D-7 |
An Efficient Metric Normalization Architecture for High-speed Low-power
Viterbi Decoder |
PDF |
Kelvin Yi-Tse Lai, National Yunlin University of Science and Technology |
|
In this paper, a new efficient metric normalization architecture called High Bit Clear is proposed for a high-throughput, low-power Viterbi decoder (VD). The proposed High Bit Clear normalization circuit not only normalizes all of the survivor path metrics but also operates as close as possible to the Add-Compare-Select (ACS) iteration bound with a small area overhead. After verifying the function on an FPGA platform, we implemented the design with the UMC 0.18 µm 1.8 V 1P6M standard cell library.
Implemented with the UMC 0.18 µm 1.8 V standard cell library, the proposed VD raises the data rate to as much as 834 Mbps when decoding a (3,1,2) convolutional code. Compared with the traditional VD without normalization, the proposed VD improves decoding speed by 60% and reduces power consumption by 50%. Furthermore, the chip area of the new VD is reduced by 55% compared with the traditional one. The proposed VD operates at up to 278 MHz. At 278 MHz, it consumes 2.48 mW and occupies a chip area of about 110 µm × 110 µm.
|
P3D-8 |
Design and Implementation of a Real-Time Global Tone Mapping Processor
for High Dynamic Range Video |
PDF |
Tsun-Hsien Wang, Wei-Su Wong, Fang-Chu Chen, and Ching-Te Chiu, National Tsing Hua University |
|
Due to rapid progress in high dynamic range
(HDR) video capture technology, HDR video display on conventional LCD devices has become an important topic. In this paper, we show that real-time HDR video display is possible. A tone-mapping-based HDR video architecture pipelined with a video CODEC is presented. The HDR video is compressed by the tone mapping processor. The compressed HDR video can then be encoded and decoded with video standards such as MPEG-2, MPEG-4, or H.264 for transmission and display. We propose and implement a modified photographic tone mapping algorithm for the tone mapping processor. The required luminance wordlength in the processor is analyzed, and the quantization error is estimated. We also develop a digit-by-digit exponent and logarithm hardware architecture for the tone mapping processor. The synthesis results show that our real-time tone mapping processor can process NTSC video at 720×480 resolution at 50 MHz. The core area after layout is about 1.8225 mm² in TSMC 0.13 µm technology. |
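As context for why the processor needs exponent and logarithm hardware, the sketch below implements the widely used unmodified global photographic tone mapping operator: scale each luminance by a key value over the log-average luminance, then compress with L/(1+L). The authors' modification is not described in the abstract and is not reproduced here; the default `key=0.18`, the `eps` guard, and the function name are assumptions.

```python
import math

def photographic_tone_map(luminances, key=0.18, eps=1e-6):
    """Map HDR luminance values into the display range [0, 1).

    luminances: non-negative HDR luminance samples of one frame.
    key: target mid-grey value (assumed default).
    eps: small offset that avoids log(0) for black pixels.
    """
    if not luminances:
        return []
    # The log-average luminance needs one logarithm per pixel and one exponent.
    log_avg = math.exp(sum(math.log(eps + lum) for lum in luminances)
                       / len(luminances))
    mapped = []
    for lum in luminances:
        scaled = key * lum / log_avg
        mapped.append(scaled / (1.0 + scaled))
    return mapped
```

The operator is global in the sense that a single log-average value per frame drives the mapping of every pixel, which suits a streaming hardware pipeline.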
P3D-9 |
Design a Hardware Interprocessor Communication Mechanism for a
Multi-core Computer System |
PDF |
Slo-Li Chu, Chih-Chieh Hsiao, Pin-Hua Chiu,
and Hsien-Chang Lin, Chung Yuan Christian University |
|
Multiprocessor architectures for
multimedia embedded systems are becoming more popular because of advances in processor
design and fabrication. However, interprocessor communication remains an important problem in multiprocessor environments. In this paper, we propose a hardware interprocessor communication mechanism for a multi-core FPGA chip. Although the hardware/software development tools do not support multi-core design on the target platform, we create a novel design flow to implement the multi-core system under Linux with a high-speed communication mechanism. In the experimental results, the Dhrystone benchmark achieves a speedup of at least 30% on the Xilinx ML310 platform redesigned with our mechanism.
|
P3D-10 |
A HIGH PERFORMANCE CAVLC DECODER USING NON-ZERO SKIP AND MULTI-LEVEL
DECODING |
PDF |
Tsung-Han Tsai and De-Lung Fang, National Central University |
|
In this paper, we propose a hybrid high-performance
CAVLD algorithm for the MPEG-4 AVC/H.264 video standard in baseline profile. Two techniques, called MLD (Multi-Level Decoding) and NZS (Non-Zero Skip for run_before decoding), are introduced to improve the throughput of the CAVLC decoder. Compared with a previous design, an improvement of around 5% in decoding one macroblock is achieved, and MLD is introduced to speed up the whole process further. With these two improvements, we obtain an average of 137 cycles for decoding one macroblock,
which is equal to a throughput of 1.02×10^6 macroblocks per second at 140 MHz. This means a speedup of 57% is achieved in comparison with the same design. The hardware area is 18694 gates, and the performance can achieve the real-time requirement for Full HD size. |
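The throughput figure above follows directly from the clock rate and the cycle count, and can be checked with a few lines of arithmetic. The Full HD comparison below assumes 16×16 macroblocks and a frame rate of 30 frames per second; the abstract does not state the frame rate, so that value and the function names are assumptions.

```python
def macroblocks_per_second(clock_hz, cycles_per_macroblock):
    """Decoder throughput implied by a clock rate and an average cycle count."""
    return clock_hz / cycles_per_macroblock

def required_macroblocks_per_second(width, height, fps, mb_size=16):
    """Macroblock rate needed to decode a video of the given size in real time."""
    mbs_per_row = -(-width // mb_size)   # ceiling division
    mb_rows = -(-height // mb_size)
    return mbs_per_row * mb_rows * fps

# Figures from the abstract: 140 MHz clock, 137 cycles per macroblock.
achieved = macroblocks_per_second(140e6, 137)

# Assumed target: 1920x1080 at 30 frames per second.
required = required_macroblocks_per_second(1920, 1080, 30)
```

Under these assumptions `achieved` is about 1.02×10^6 macroblocks per second and `required` is 244,800, which is consistent with the real-time claim.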
|