I still don't understand why you can't distribute a large LLM over many different processors, each holding a section of the parameters in memory.
Because in a dense layer every activation feeds into every unit of the next layer. So if you shard the parameters across machines, the devices have to exchange full activation vectors at every layer boundary, synchronously, for every single token. The bandwidth and especially the latency requirements for that are enormous, and regular networking is orders of magnitude too slow compared to the interconnects (NVLink, InfiniBand) that GPU clusters actually use for this.
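To get a feel for the numbers, here's a rough back-of-the-envelope sketch. All the figures in it are illustrative assumptions (roughly a 70B-parameter model shape: 80 layers, hidden size 8192, fp16 activations, and two all-reduces per layer as in Megatron-style tensor parallelism; the round-trip times are ballpark guesses, not measurements):

```python
# Back-of-the-envelope cost of tensor-parallel LLM inference over a network.
# All figures are illustrative assumptions (roughly a 70B-parameter model),
# not measurements of any specific system.

layers = 80                 # transformer blocks
hidden = 8192               # activation (hidden) dimension
bytes_per_act = 2           # fp16
sync_points = 2 * layers    # Megatron-style: one all-reduce after attention,
                            # one after the MLP, in every layer

# Each sync point all-reduces one activation vector across all shards.
bytes_per_token = sync_points * hidden * bytes_per_act
print(f"activation traffic per token: {bytes_per_token / 1e6:.1f} MB")  # ~2.6 MB

# The raw bandwidth is survivable, but every sync point is a blocking
# round trip, so the network latency gets multiplied by 160 per token:
for name, rtt_s in [("NVLink (~2 us RTT)", 2e-6),
                    ("datacenter Ethernet (~100 us RTT)", 100e-6),
                    ("internet peers (~50 ms RTT)", 50e-3)]:
    print(f"{name}: {sync_points * rtt_s * 1e3:.1f} ms of latency per token")
```

Under those assumptions the traffic per token is only a few MB, but the 160 synchronous round trips per token mean ~0.3 ms of latency overhead on NVLink, ~16 ms on ordinary Ethernet, and ~8 seconds per token between internet peers. That's why "just spread the weights over many machines" only works with datacenter-grade interconnects.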