Of these, option 1 is probably the best.
However, there is another option:
Create one buffer, use it as a single, large PBO, and on each frame simply adjust the offset into the buffer in whichever calls you are using (glTexSubImage, glReadPixels, or whatever).
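A rough sketch of the offset bookkeeping this implies, in plain C with no GL calls. `StreamRing` and `stream_ring_next` are hypothetical names made up for illustration, and the wrap-around policy is just one simple choice, not something the GL spec prescribes:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: hands out byte ranges from one large buffer in
   order, wrapping to the start when the next range would not fit. */
typedef struct {
    size_t capacity; /* total size of the buffer object in bytes */
    size_t head;     /* offset where the next range begins */
} StreamRing;

/* Returns the byte offset to use for a transfer of `size` bytes. */
static size_t stream_ring_next(StreamRing *ring, size_t size)
{
    assert(size <= ring->capacity);
    if (ring->head + size > ring->capacity) {
        /* Wrap around. The caller still has to make sure the GPU has
           finished with this region before reusing it. */
        ring->head = 0;
    }
    size_t offset = ring->head;
    ring->head += size;
    return offset;
}
```

With the PBO bound, the returned offset would be what you pass (cast to a pointer) as the data argument of glTexSubImage or glReadPixels.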
I also upload many of my vertices in small batches this way, but my vertices aren't aligned at all. I always memcpy the new vertices into the VBO, call glDraw, and adjust the offset for the next memcpy. It works really fast, but I'm not sure whether it's safe:
The spec doesn't say anything about GPU caching or read-ahead, only that the programmer has to take care of synchronization. I use glFenceSync and glClientWaitSync at byte granularity, so the same issues may apply here.
Do I have to skip some bytes or align to the next cache line? Or will the GPU cache get invalidated by writing into the pinned memory?
If the cache gets invalidated, would it be faster to skip some bytes? How many bytes would be worth skipping?
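For reference, if skipping to a boundary did turn out to help, the offset arithmetic would look something like this. `align_up` is a hypothetical helper, and any concrete alignment (64 bytes, say) would be my assumption about the hardware, not a value the spec gives:

```c
#include <assert.h>
#include <stddef.h>

/* Rounds `offset` up to the next multiple of `alignment`.
   `alignment` must be a power of two; which value is appropriate
   (if any) is hardware-dependent and only assumed here. */
static size_t align_up(size_t offset, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}
```

I would call this on the write offset before each memcpy, so every batch starts on a boundary and the skipped bytes are simply left unused.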