Rapid Execution Engine
    
  
  The new architecture lets the Pentium 4 run its Arithmetic Logic Units
    (ALUs) at twice the frequency of the processor core itself. This means
    that the ALUs on a Pentium 4 running at 1.5GHz operate at 3GHz, with a
    latency that is half the duration of a core clock cycle. This translates
    directly into higher throughput and reduced execution latency.
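
    The clock arithmetic above can be checked with a few lines of Python. This is only a sketch of the numbers quoted in the text: the function name `alu_timing` and its parameters are illustrative assumptions, not anything defined by Intel.

```python
def alu_timing(core_ghz, alu_multiplier=2):
    """Return (ALU clock in GHz, core cycle in ns, ALU cycle in ns).

    alu_multiplier=2 models the double-clocked ALUs described in the text;
    the function itself is a hypothetical helper for illustration.
    """
    alu_ghz = core_ghz * alu_multiplier
    core_cycle_ns = 1.0 / core_ghz   # a 1 GHz clock has a 1 ns period
    alu_cycle_ns = 1.0 / alu_ghz
    return alu_ghz, core_cycle_ns, alu_cycle_ns


alu_ghz, core_ns, alu_ns = alu_timing(1.5)
print(f"ALU clock:  {alu_ghz:.1f} GHz")
print(f"Core cycle: {core_ns:.3f} ns")
print(f"ALU cycle:  {alu_ns:.3f} ns")
```

    For a 1.5GHz core, the ALU cycle comes out at half the core cycle, which is the "half the duration of the core clock" latency mentioned above.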
    
    
    400MHz Front Side Bus
    
    One of the most talked-about features of the Pentium 4 is its 400MHz bus. The
    Pentium III processor's 133MHz bus, which is 64 bits wide, is capable of
    delivering 1.06GB/s of data. The architecture of the Pentium 4 is somewhat
    different. The Pentium 4's bus is clocked at only 100MHz and is also 64 bits
    wide; what differs is that the 100MHz bus is quad-pumped, transferring data
    four times per clock, and is capable of achieving a whopping 3.2GB/s peak.
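
    Both bus figures follow from the same formula: clock rate times transfers per clock times bus width in bytes. A minimal sketch, assuming decimal gigabytes (1GB = 10^9 bytes); the helper name `peak_bandwidth_gb_s` is made up for this example.

```python
def peak_bandwidth_gb_s(clock_mhz, width_bits, transfers_per_clock=1):
    """Theoretical peak bus bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = width_bits // 8
    bytes_per_second = clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer
    return bytes_per_second / 1e9


# Pentium III: 133MHz, 64-bit, one transfer per clock
p3_bus = peak_bandwidth_gb_s(133, 64)
# Pentium 4: 100MHz, 64-bit, quad-pumped (four transfers per clock)
p4_bus = peak_bandwidth_gb_s(100, 64, transfers_per_clock=4)

print(f"Pentium III FSB: {p3_bus:.2f} GB/s")
print(f"Pentium 4 FSB:   {p4_bus:.2f} GB/s")
```

    These are peak numbers only; sustained bandwidth in practice depends on the chipset and memory behind the bus.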
  Advanced Transfer Cache 
  Intel’s Pentium 4 features 8KB of L1 data
    cache. This is half the size of what the Pentium III features. This may seem
    a bit confusing at first, but smaller caches have lower latencies. Intel
    did this to decrease the latency of the L1 cache, which should improve the
    effective transfer rate, but at the same time the small size (8KB) might
    not be enough for some specific tasks.
    
    This is where the L2 cache comes in. The Pentium 4, like the Pentium
    III (Coppermine), sports 256KB of on-die cache on a 256-bit bus. However,
    there is a difference between the two. The new architecture of the Pentium 4
    transfers data on every clock cycle, whereas the Pentium III (Coppermine)
    transfers data on every other cycle.
    
  
     
      | Intel Pentium 4 1.5GHz | 256-bit (32 bytes) x 1 x 1.5GHz = 48GB/s |
      | Intel Pentium III 1GHz | 256-bit (32 bytes) x 0.5 x 1GHz = 16GB/s |
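
    The two rows of the table can be reproduced with the short sketch below. The function name `l2_bandwidth_gb_s` is an assumption for illustration; the inputs are the values given in the text.

```python
def l2_bandwidth_gb_s(bus_bits, transfers_per_cycle, core_ghz):
    """Peak L2 bandwidth: bus width in bytes x transfers per cycle x clock."""
    return (bus_bits / 8) * transfers_per_cycle * core_ghz


# Pentium 4 at 1.5GHz: one transfer on every clock cycle
p4_l2 = l2_bandwidth_gb_s(256, 1, 1.5)
# Pentium III (Coppermine) at 1GHz: one transfer every other cycle
p3_l2 = l2_bandwidth_gb_s(256, 0.5, 1.0)

print(p4_l2)          # 48.0
print(p3_l2)          # 16.0
print(p4_l2 / p3_l2)  # 3.0
```

    The threefold gap comes from two factors: a 1.5x higher clock and twice as many transfers per cycle.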
  
  Execution Trace Cache 
    
    This technology caches decoded x86 instructions (micro-ops), thus removing 
    the latency associated with the instruction decoder from the execution loop. 
    The Execution Trace Cache stores the micro-ops in the path of program execution 
    flow, where the results of branches in the code are integrated into the same 
    cache line. 
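
    As a loose software analogy for the idea described above, the sketch below caches decoded micro-ops by the address of the trace they belong to, so the decoder runs only the first time a piece of code is seen. Everything here is hypothetical: `toy_decode`, `ToyTraceCache`, the one-micro-op-per-instruction mapping, and the address 0x1000 are invented for illustration and do not describe how the real hardware is organised.

```python
def toy_decode(instruction):
    """Hypothetical decoder: pretend each x86 instruction yields one micro-op."""
    return [f"uop<{instruction}>"]


class ToyTraceCache:
    """Toy model: maps a trace's start address to its decoded micro-ops."""

    def __init__(self, decoder):
        self.decoder = decoder
        self.traces = {}
        self.decode_calls = 0  # how many instructions went through the decoder

    def fetch(self, start_address, instructions):
        trace = self.traces.get(start_address)
        if trace is None:
            # Miss: decode once and store the micro-ops in execution order.
            trace = []
            for instruction in instructions:
                self.decode_calls += 1
                trace.extend(self.decoder(instruction))
            self.traces[start_address] = trace
        # Hit: the decoder is bypassed entirely.
        return trace


loop_body = ["add eax, ebx", "cmp eax, ecx", "jne loop_start"]
cache = ToyTraceCache(toy_decode)
for _ in range(1000):
    uops = cache.fetch(0x1000, loop_body)

print(cache.decode_calls)  # 3
```

    Although the loop body is fetched 1000 times, the decoder handles its three instructions only once; every later iteration is served from the cached micro-ops.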
    
    The Execution Trace Cache is another handy technique Intel implemented in
    its new architecture to ease the penalty of mispredicted branch instructions.
    On older Intel processors, based on previous architectures, if a branch
    was mispredicted, the processor needed to start the process from the
    beginning, fetching and decoding the instructions again. The NetBurst
    architecture can instead go directly to the Execution Trace Cache to
    retrieve the micro-ops and send them through the execution pipeline
    without having to restart the process from the first phase.