I've recently come across this description of a single processor cache saying that it's
"Word addressed (addresses are left shifted by 2 by adding “00” to end of address inside the processor, this implies that it can address 2^32*4 = 16GBytes of memory"
I understand that word addressing means that each consecutive address holds a word of data, whereas with byte addressing each address holds a byte of data. I further understand that left-shifting the address by 2 means multiplying it by 4, so we only ever obtain multiples of 4. But doesn't that imply that each address holds a byte of data, so this is in fact byte addressing rather than word addressing, and the processor simply has logic that only accesses a word at a time regardless of what the memory is like?
So far I am confused about whether "word addressed" just means we have logic in place to access a word at a time, or whether the actual memory is laid out so that each address holds a whole word rather than a byte.
It would help to know where this description came from but nevertheless, perhaps the following helps illuminate what might be going on.
When we talk about byte- versus word-addressable memories (caches, RAM, SSDs, or otherwise), we are referring to the minimum width of data that can be referenced by an address. Regardless of the addressable width of a single location, the memory still holds bytes, grouped in 4s for words, 8s for double-words, and so on, just as a byte is always made of bits (typically 8 bits per byte in modern systems, though that's not always the case).
For the following examples, let's assume we have 8 address bits. (Real systems usually have between 16 and 54 address bits; none that I know of have full 64-bit addresses.)
With a byte-addressable memory, 8 address bits allows us to reference 2^8 locations. To get the size of memory, we multiply by the size (in bytes) of each addressed location. That gives us 2^8 * 1byte = 256 bytes
With a word-addressable memory, 8 address bits allows us to reference the same number of locations (2^8.) However, the size (in bytes) of each addressed location is now 4 bytes. That gives us 2^8 * 4 = 1024 bytes (1KiB.)
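The arithmetic above can be checked with a short sketch. The function name `addressable_bytes` is made up for illustration, and the 4-byte word size is the same assumption used throughout this answer.

```python
def addressable_bytes(address_bits, bytes_per_location):
    """Capacity = number of addressable locations * bytes per location."""
    return (1 << address_bits) * bytes_per_location

# 8 address bits, byte-addressable: one byte per location
print(addressable_bytes(8, 1))

# 8 address bits, word-addressable: four bytes per location (assumed word size)
print(addressable_bytes(8, 4))

# The case from the question: 32 address bits, word-addressable
print(addressable_bytes(32, 4) // 2**30, "GiB")
```

The last line reproduces the 2^32 * 4 figure from the quoted description.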
Now, the confusion starts to arise when we look at where the address bits are coming from.
Let's say I have 2 instruction sets (ISAs). Both have load/store instructions that take an address to memory. In one ISA, the instructions use byte-addressing. In the other, the instructions use word-addressing.
Suppose each ISA has a load instruction that is given the address 0x12. Looking at the byte index in memory that such a load instruction accesses, what would we expect in the two ISAs?
In the byte-addressed ISA, it's easy: the byte address is 0x12 (decimal: 18.)
In the word-addressed ISA, it's different. The 0x12 is an address that refers to a word in memory, rather than an individual byte. To get the index in memory as a number of bytes, we multiply by 4 (bytes per word.) That is to say, we shift the address left by 2. So 0x12 becomes 0x48 (decimal: 72.)
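As a rough sketch of that conversion (the helper names are hypothetical, and 4-byte words are assumed as above):

```python
BYTES_PER_WORD = 4   # assumed word size
WORD_SHIFT = 2       # log2(BYTES_PER_WORD)

def word_to_byte_address(word_address):
    """Multiply by 4, i.e. append two zero bits to the address."""
    return word_address << WORD_SHIFT

def byte_to_word_address(byte_address):
    """Split a byte address into (word address, byte offset within the word)."""
    return byte_address >> WORD_SHIFT, byte_address & (BYTES_PER_WORD - 1)

print(hex(word_to_byte_address(0x12)))   # the word-addressed load from the example
print(byte_to_word_address(0x48))        # and back again
```

Note that every byte address produced by `word_to_byte_address` has its bottom two bits equal to zero, which is the "adding 00 to the end of the address" in the quoted description.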
Okay, so an Instruction Set Architecture (ISA) can define how a number is interpreted - whether it's a word-address or byte-address.
Additionally, when it comes to caches, hardware doesn't behave in the neat way the ISA presents to software engineers.
For example, modern caches may present a dword-addressed, word-addressed or byte-addressed interface to the processor core (or other variations.) In a word-addressed scenario, to store an individual byte, the cache controller must first load the whole word in which the byte resides into a temporary internal buffer, update it with the byte being stored, then store that word back into the cache's SRAM array. This is known as a read-modify-write (RMW.)
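To make the RMW sequence concrete, here is a minimal software model. It is only a sketch: `WordAddressedMemory` and `store_byte_rmw` are made-up names, words are assumed to be 32 bits, and the byte lanes are numbered little-endian style (lane 0 is the least significant byte). Real cache controllers do this in hardware and differ in detail.

```python
class WordAddressedMemory:
    """Toy memory that can only load/store whole 32-bit words."""
    def __init__(self, num_words):
        self.words = [0] * num_words

    def load_word(self, word_address):
        return self.words[word_address]

    def store_word(self, word_address, value):
        self.words[word_address] = value & 0xFFFFFFFF

def store_byte_rmw(mem, byte_address, value):
    word_address = byte_address >> 2       # which word holds the byte
    shift = (byte_address & 0b11) * 8      # bit position of the byte lane (assumed little-endian)
    word = mem.load_word(word_address)                             # read
    word = (word & ~(0xFF << shift)) | ((value & 0xFF) << shift)   # modify
    mem.store_word(word_address, word)                             # write

mem = WordAddressedMemory(4)
mem.store_word(1, 0x11223344)
store_byte_rmw(mem, 6, 0xAA)     # byte address 6 lives in word 1, lane 2
print(hex(mem.load_word(1)))
```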
See "Are there any modern CPUs where a cached byte store is actually slower than a word store?" for some real examples, such as ARM Cortex-A15, where ARM's manual explains that updating the ECC (Error Correction Code) data for the 32-bit chunk is part of the reason an RMW is needed for isolated byte or 16-bit stores. (ARM is a byte-addressable ISA, unlike the one described in the question.)
RMW is relatively straightforward for a single byte. It's more complex for, say, a half-word (i.e. 2 byte) load/store on a byte-addressable memory. This is because the 2 bytes may cross a natural word boundary. Thus, it may be that two whole words must be loaded/stored to access a single half-word.
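A quick way to see when an access straddles a word boundary is to compute which words it touches. This is an illustrative helper (the name `words_touched` is made up), again assuming 4-byte words:

```python
def words_touched(byte_address, size_bytes, bytes_per_word=4):
    """Return the word indices covered by an access of size_bytes at byte_address."""
    first = byte_address // bytes_per_word
    last = (byte_address + size_bytes - 1) // bytes_per_word
    return list(range(first, last + 1))

print(words_touched(2, 2))   # half-word at bytes 2..3: stays inside one word
print(words_touched(3, 2))   # half-word at bytes 3..4: crosses into the next word
```

Whenever the returned list has more than one entry, the hardware has to deal with two words to satisfy a single access.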
With word-addressable memory, the half-word will (only ever) be the bottom 2 bytes of a naturally aligned word, making such a 2-word RMW impossible/unnecessary. However, it would also be impossible to directly perform a half-word load/store on the upper half-word of any given word, since word-addressing doesn't allow for referencing that upper half-word directly. (Some ISAs offer separate load/store instructions to access the different positions within a word - such as load half-word/load-byte instructions that take a word address and an index within the word.)
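Such "word address plus index" accesses could be modelled like this. The function names and the index numbering (index 0 is the least significant part of the word) are assumptions for illustration; a real ISA defines its own encoding and endianness.

```python
def load_halfword(words, word_address, half_index):
    """half_index 0 = lower 16 bits, 1 = upper 16 bits (assumed numbering)."""
    return (words[word_address] >> (16 * half_index)) & 0xFFFF

def load_byte(words, word_address, byte_index):
    """byte_index 0..3, with 0 the least significant byte (assumed numbering)."""
    return (words[word_address] >> (8 * byte_index)) & 0xFF

words = [0x11223344, 0xAABBCCDD]
print(hex(load_halfword(words, 1, 0)))   # lower half of word 1
print(hex(load_halfword(words, 1, 1)))   # upper half of word 1
print(hex(load_byte(words, 0, 3)))       # top byte of word 0
```

Because the index selects a position inside one word, an access of this form can never span two words.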
Going back to the cache: on the other side of the cache (the interface to main memory or to other layers of cache), neither byte- nor word-addressing is used. Instead, row addresses are used, where a row may be 64, 128 or more bytes in size.
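The same shift-and-mask idea applies at row granularity. A small sketch, assuming a 64-byte row (one of the sizes mentioned above) and a hypothetical helper name:

```python
ROW_BYTES = 64                              # assumed row size
OFFSET_BITS = ROW_BYTES.bit_length() - 1    # log2(64) = 6

def split_row_address(byte_address):
    """Split a byte address into (row address, byte offset within the row)."""
    return byte_address >> OFFSET_BITS, byte_address & (ROW_BYTES - 1)

print(split_row_address(0x48))   # byte 72 sits in row 1, at offset 8
```

Only the row address needs to travel to the next level of the hierarchy; the offset bits are used inside the cache to pick data out of the row.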
So, what's the point of all this? At the end of the day, it all comes down to the physical hardware resources. If you know that something is word-addressed, you don't need to store the bottom 2 bits of the address because (if they were needed at all) they are guaranteed to always be zero. We can save silicon by not storing or communicating more bits than necessary. We can also simplify hardware logic by knowing certain scenarios are impossible (such as the split half-word load/store described earlier). Knowing those cases are impossible means we don't need logic (i.e. hardware) to handle them.
Additionally, if you consider the bits needed to encode an instruction (in either of the earlier 2 ISA examples), by using word-addressing, the ISA can access (/address) more memory without needing more encoding bits per instruction.
Word-addressing also offers other practical benefits. Ask yourself, why don't we have "bit addressing"? Most memory cannot be accessed on a bit-by-bit or byte-by-byte basis. At best, a whole word or whole row must be loaded/stored. So word-addressing saves hardware the effort of masking off the bottom two address bits, loading a word, then masking and shifting the loaded word to make it align to the address.
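The extra work that byte addressing imposes on word-wide memory looks roughly like this in software form. It is a sketch with a made-up function name, assuming 4-byte words and little-endian byte lanes:

```python
def load_byte_via_word(words, byte_address):
    aligned = byte_address & ~0b11         # mask off the bottom two address bits
    word = words[aligned >> 2]             # load the whole word
    lane = byte_address & 0b11             # which byte inside the word (assumed little-endian)
    return (word >> (8 * lane)) & 0xFF     # shift and mask to extract the byte

words = [0x11223344, 0xAABBCCDD]
print(hex(load_byte_via_word(words, 5)))   # byte 5 is lane 1 of word 1
```

With word addressing, the first, third and fourth steps disappear: the address already names the word, and the whole word is the result.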
For loads/stores, the issue of word- versus byte-addressing also overlaps with the issue of (natural) alignment of the address. But that's a topic for another question.