A comma is only a separator until it's inside quotes — so you parse with a state machine, not split().
CSV looks trivial: split on commas, split on newlines, done. But a field can be wrapped in double quotes, and a quoted field is allowed to contain commas, newlines, and even quote characters of its own. The moment "Doe, Jane" appears, a blind split(',') shears it in two.
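To see the failure concretely, here is a minimal sketch of the naive approach. The sample line is made up for illustration and is not taken from any particular dataset.

```javascript
// Naive approach: treat every comma as a field boundary.
const line = 'id,"Doe, Jane",42';
const naive = line.split(',');

console.log(naive.length); // 4, although the line holds only 3 fields
console.log(naive);        // [ 'id', '"Doe', ' Jane"', '42' ]
```

The quoted name is cut at its inner comma, and the quote characters are left stuck to the two halves.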
The robust way to honor the quoting rules that RFC 4180 describes is to read the text one character at a time, keeping track of which mode you're in. That mode, the parser state, decides what each character means. The same comma is a field boundary in one state and ordinary text in another.
Four states carry all the meaning. FIELD_START decides whether a field is quoted. IN_UNQUOTED reads plain text until a comma. IN_QUOTED reads anything until a quote. QUOTE_IN_QUOTED peeks: a second quote means an escaped ", anything else means the quoted field just ended.
function parseLine(s) {
  const out = [];
  let state = 'FIELD_START', buf = '';
  for (const c of s) {
    switch (state) {
      case 'FIELD_START':
        if (c === '"') { state = 'IN_QUOTED'; }
        else if (c === ',') { out.push(buf); buf = ''; } // empty field
        else { buf += c; state = 'IN_UNQUOTED'; }
        break;
      case 'IN_UNQUOTED':
        if (c === ',') { out.push(buf); buf = ''; state = 'FIELD_START'; }
        else { buf += c; }
        break;
      case 'IN_QUOTED':
        if (c === '"') { state = 'QUOTE_IN_QUOTED'; } // maybe end, maybe ""
        else { buf += c; } // comma/newline are literal here
        break;
      case 'QUOTE_IN_QUOTED':
        if (c === '"') { buf += '"'; state = 'IN_QUOTED'; } // "" -> one literal "
        else if (c === ',') { out.push(buf); buf = ''; state = 'FIELD_START'; }
        else { buf += c; state = 'IN_UNQUOTED'; } // tolerate stray text
        break;
    }
  }
  out.push(buf); // last field has no trailing comma to flush it
  return out;
}
Every character is visited exactly once — a single forward pass, no backtracking. The machine itself holds only a tiny state label; the only growing memory is the buffer for the field currently being built.
split(',') breaks on quoted commas: "Doe, Jane" becomes two fields instead of one.
A quoted field can contain \n; the same state machine has to carry across line boundaries.
"" inside a quoted field is one escaped quote character, not an empty field. It is easy to mistake for a missing value.
Line endings vary: \r\n (Windows) versus \n, plus a possible trailing newline that would otherwise emit a phantom empty final row.
Run the machine over name,"Doe, Jane","She said ""hi""",42 and four clean fields fall out:
name ← plain, unquoted
Doe, Jane ← quoted, so its comma stays inside
She said "hi" ← each "" pair collapses to a single "
42 ← plain again, flushed at end of line
Notice the comma between Doe and Jane never split anything: it arrived while the state was IN_QUOTED, where commas are just text. And each "" contributed exactly one " to the buffer, not a boundary.
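The single-line parser does not yet deal with record boundaries. One possible way to extend it to a whole file is sketched below, under stated assumptions: parseCsv, endField and endRow are hypothetical names introduced here, a bare \r is treated as a line break, and a blank line yields a row with one empty field. It reuses the same four states and should be read as an illustration rather than a complete RFC 4180 implementation.

```javascript
// Hypothetical helper: parses a whole CSV text into an array of rows.
// Same four states as the single-line parser, plus record handling.
function parseCsv(text) {
  const rows = [];
  let row = [];
  let buf = '';
  let state = 'FIELD_START';

  const endField = () => { row.push(buf); buf = ''; state = 'FIELD_START'; };
  const endRow = () => { endField(); rows.push(row); row = []; };

  for (let i = 0; i < text.length; i++) {
    const c = text[i];
    if (state === 'IN_QUOTED') {
      if (c === '"') state = 'QUOTE_IN_QUOTED';
      else buf += c;                     // commas, \r and \n are literal here
      continue;
    }
    if (state === 'QUOTE_IN_QUOTED' && c === '"') {
      buf += '"';                        // "" -> one literal "
      state = 'IN_QUOTED';
      continue;
    }
    // From here on the character sits outside any quoted region.
    if (c === ',') endField();
    else if (c === '\n') endRow();
    else if (c === '\r') {
      if (text[i + 1] === '\n') i++;     // treat \r\n as a single line break
      endRow();
    } else if (state === 'FIELD_START' && c === '"') state = 'IN_QUOTED';
    else { buf += c; state = 'IN_UNQUOTED'; }
  }

  // Flush the final record, unless the text ended right after a line break
  // (that would otherwise emit a phantom empty row).
  if (state !== 'FIELD_START' || buf !== '' || row.length > 0) endRow();
  return rows;
}

const sample = 'name,"Doe, Jane","She said ""hi""",42\r\nx,"line1\nline2",\r\n';
const rows = parseCsv(sample);
console.log(rows.length);    // 2
console.log(rows[0].length); // 4
console.log(rows[1].length); // 3 (the last field is empty)
```

The line-break checks run only outside quotes, so the \n inside "line1\nline2" stays part of the field, and the trailing \r\n closes the second row without starting a third.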
How many fields does a,"b,c",d parse into?
Inside a quoted field, what does the sequence "" become?