There are 64 "Stream Processors" per Compute Unit in every Graphics Core Next (GCN) architecture, Vega included.
GCN 4.0 (Polaris) and earlier, however, only have 32KB of Local Data Share (LDS), whereas GCN 5.0 (Vega) has 64KB of LDS.
It should be noted that GCN 3.0 and later were supposed to get 64KB of LDS as well, but doing so would have made no performance difference while costing more and taking up more silicon.
Keep in mind that GCN 4.0 and earlier are only capable of up to 16 cached instructions per cycle, while GCN 5.0 is capable of 64.
Each SIMD has 4x 128-bit (Vec4) wide registers, and with 4 SIMDs this results in a total of 256 registers and 64 threads per Compute Unit.
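If you want to see where the 64 threads figure comes from, here's a quick sketch of the arithmetic in Python. The register width and counts are the ones quoted above; the 32-bit lane width is my assumption (one 32-bit value per thread), and the constant names are just made up for illustration. It only covers the thread count, not the register total.

```python
# Figures quoted in the post.
REGISTER_BITS = 128       # one Vec4 register
REGISTERS_PER_SIMD = 4    # "4x 128-bit" per SIMD
SIMDS_PER_CU = 4

# Assumption for this sketch: each thread works on one 32-bit value.
LANE_BITS = 32

lanes_per_register = REGISTER_BITS // LANE_BITS            # 4 lanes per Vec4
lanes_per_simd = lanes_per_register * REGISTERS_PER_SIMD   # 16 lanes per SIMD
threads_per_cu = lanes_per_simd * SIMDS_PER_CU             # 64 threads per CU

print(lanes_per_simd, threads_per_cu)
```

So 16 lanes per SIMD times 4 SIMDs lines up with the 64 "Stream Processors" per Compute Unit mentioned at the start.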
Now this gets a little more complicated for Vega. While the above remains true, it is capable of Double Data Rate operations, very similar to how Ryzen handles SMT... mind you, it's identical technology that enables it.
This means that, traditionally speaking, 64 threaded operations per cycle actually break down into 32+32 threads for the rising and falling cycles. As such, GCN 4.0 and earlier require a constant stream of data to really achieve 100% utilisation, but it also means you can waste a lot of cycle instruction time when you're changing tasks... and as I noted above, the architecture isn't exactly designed to queue up a lot beforehand.
GCN 5.0, on the other hand, changes this so that each rise and fall is capable of 100% utilisation, meaning that changing tasks / threads is (almost) costless. You might have noticed this in the substantially better lows in frame times, especially during scenes with heavy streamed data, the classic example being alpha channel / pixel operations that typically stall most GPUs.
Strictly speaking, Vega isn't programmatically much of a change from the previous GCN ISA. In utilisation terms, however, its optimisations can provide (at peak) up to 2x the operations of a near-identical Fiji GPU. For the moment you only really see this in OpenCL applications, which is why they're excellent mining GPUs, but once you get used to how this works for game development it provides a substantial performance uplift in common scenarios that would normally be "problematic" for stable or decent framerates.
APIs like DirectX 12 and Vulkan already provide some natural performance uplift, especially with thread-heavy workloads... keep in mind that in a traditional pipeline approach you now have 128 in-flight (32-bit) threads as opposed to 64. But if you actually maintain a thread balance, say with things like TressFX or other particle solutions, then you'll find you get very close to said 2x throughput and substantially better framerates.
AMD did showcase this (but didn't really explain it very well) with their TressFX Vega demo, where Fiji was capable of ~500K strands vs. ~1.2M strands on Vega... at the same 1080p60 resolution and performance. Some of this was obviously the higher clock of Vega (avg. 1250MHz vs. 1025MHz), but more of it comes from said changes in thread processing capabilities.
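As a rough sanity check on those demo numbers, here's a small Python sketch that splits the gain into a clock part and a per-clock part. It assumes strand throughput scales linearly with clock speed and that the two factors simply multiply, which is a simplification on my part, so treat the result as a ballpark rather than a measurement.

```python
# Approximate figures quoted from the TressFX demo.
fiji_strands = 500_000
vega_strands = 1_200_000
fiji_clock_mhz = 1025
vega_clock_mhz = 1250

# Overall gain in strands at the same resolution and framerate.
total_gain = vega_strands / fiji_strands

# Share explained by clock alone (assumes linear scaling with clock).
clock_gain = vega_clock_mhz / fiji_clock_mhz

# Whatever is left over is the per-clock improvement.
per_clock_gain = total_gain / clock_gain

print(f"total: {total_gain:.2f}x")
print(f"clock: {clock_gain:.2f}x")
print(f"per clock: {per_clock_gain:.2f}x")
```

Under those assumptions the clock accounts for roughly 1.2x of the 2.4x total, leaving a little under 2x per clock, which is in the same region as the "up to 2x" peak figure mentioned earlier.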
The thing to keep in mind is that GCN 5.0 (Vega) doesn't provide a universal performance uplift over GCN 4.0 (Polaris) or earlier, but where it does, you have substantially better optimisation possibilities.
Hopefully this helps clear some things up about the ISA and architecture.