Parsing CSV, character by character

A comma is only a separator until it's inside quotes — so you parse with a state machine, not split().

The idea

CSV looks trivial: split on commas, split on newlines, done. But a field can be wrapped in double quotes, and a quoted field is allowed to contain commas, newlines, and even quote characters of its own. The moment "Doe, Jane" appears, a blind split(',') shears it in two.
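To see the failure concretely, here is a minimal sketch of the blind split on a record with one quoted field. The sample record is made up for illustration.

```javascript
// A naive split treats every comma as a boundary, quoted or not.
const line = 'name,"Doe, Jane",42';
const pieces = line.split(',');

console.log(pieces.length); // 4, although the record has only three fields
console.log(pieces);        // the quoted field comes back as '"Doe' and ' Jane"'
```

The quote characters also survive as literal text in the pieces, so even a field without an embedded comma would come back still wrapped in quotes.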

The robust way to handle the quoting rules that RFC 4180 describes is to read the text one character at a time, keeping track of which mode you're in. That mode, the parser state, decides what each character means. The same comma is a field boundary in one state and ordinary text in another.

How it works

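Four states carry all the meaning. FIELD_START decides whether a field is quoted. IN_UNQUOTED reads plain text until a comma. IN_QUOTED reads anything until a quote. QUOTE_IN_QUOTED is entered after a quote inside a quoted field and looks at the next character: a second quote means an escaped ", a comma means the quoted field just ended, and anything else is tolerated as stray text after the closing quote.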

function parseLine(s) {
  const out = [];
  let state = 'FIELD_START', buf = '';
  for (const c of s) {
    switch (state) {
      case 'FIELD_START':
        if (c === '"')      { state = 'IN_QUOTED'; }
        else if (c === ',') { out.push(buf); buf = ''; }      // empty field
        else                { buf += c; state = 'IN_UNQUOTED'; }
        break;
      case 'IN_UNQUOTED':
        if (c === ',')      { out.push(buf); buf = ''; state = 'FIELD_START'; }
        else                { buf += c; }
        break;
      case 'IN_QUOTED':
        if (c === '"')      { state = 'QUOTE_IN_QUOTED'; }     // maybe end, maybe ""
        else                { buf += c; }                      // comma/newline are literal here
        break;
      case 'QUOTE_IN_QUOTED':
        if (c === '"')      { buf += '"'; state = 'IN_QUOTED'; }   // "" -> one literal "
        else if (c === ',') { out.push(buf); buf = ''; state = 'FIELD_START'; }
        else                { buf += c; state = 'IN_UNQUOTED'; }   // tolerate stray text
        break;
    }
  }
  out.push(buf);   // last field has no trailing comma to flush it
  return out;
}
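parseLine handles a single record. Because a quoted field may contain newlines, splitting the text into lines first and then calling it on each line would cut such a field in half. One possible extension, sketched below, keeps the same four states and treats a newline as a record boundary only when it arrives outside quotes. The name parseCsv and the choice to drop carriage returns outside quotes are assumptions of this sketch, not part of the original code.

```javascript
// Sketch: whole-text parsing with the same four states.
// parseCsv is a hypothetical name; CR handling is an assumption for CRLF input.
function parseCsv(text) {
  const rows = [];
  let row = [], buf = '', state = 'FIELD_START';
  const endField = () => { row.push(buf); buf = ''; state = 'FIELD_START'; };
  const endRow = () => { endField(); rows.push(row); row = []; };

  for (const c of text) {
    if (state === 'IN_QUOTED') {
      if (c === '"') state = 'QUOTE_IN_QUOTED';   // maybe end, maybe ""
      else buf += c;                              // commas and newlines are literal here
    } else if (state === 'QUOTE_IN_QUOTED' && c === '"') {
      buf += '"'; state = 'IN_QUOTED';            // "" -> one literal "
    } else if (c === ',') {
      endField();
    } else if (c === '\n') {
      endRow();                                   // newline outside quotes ends the record
    } else if (c === '\r') {
      // assumption: ignore CR outside quotes so CRLF line endings work
    } else if (c === '"' && state === 'FIELD_START') {
      state = 'IN_QUOTED';
    } else {
      buf += c; state = 'IN_UNQUOTED';            // plain text, or stray text after a closing quote
    }
  }
  // Flush the final record unless the text ended cleanly on a newline.
  if (buf !== '' || row.length > 0 || state !== 'FIELD_START') endRow();
  return rows;
}

const rows = parseCsv('id,note\n1,"line one\nline two"\n2,"a ""b"""\n');
console.log(rows.length); // 3 records, although the text contains four newlines
console.log(rows[1][1]);  // the quoted newline is preserved inside the field
```

This sketch does not report an unterminated quoted field; it simply flushes whatever is in the buffer when the input ends.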

Cost

Time O(n)
Space O(1) state + O(field) buffer

Every character is visited exactly once: a single forward pass, no backtracking. The machine itself holds only a tiny state label. Apart from the output array, which necessarily grows with the number of fields, the only growing memory is the buffer for the field currently being built.
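The space bound is easiest to see when fields are handed off as soon as they are complete rather than collected. The generator below is a sketch of that idea under the same transitions as parseLine; fieldsOf is a hypothetical name, and the sample input is made up.

```javascript
// Sketch: yield each field when it completes, so only the state label
// and the current field's buffer are held at any moment.
function* fieldsOf(chars) {
  let state = 'FIELD_START', buf = '';
  for (const c of chars) {
    if (state === 'IN_QUOTED') {
      if (c === '"') state = 'QUOTE_IN_QUOTED';
      else buf += c;
    } else if (state === 'QUOTE_IN_QUOTED' && c === '"') {
      buf += '"'; state = 'IN_QUOTED';
    } else if (c === ',') {
      yield buf; buf = ''; state = 'FIELD_START';
    } else if (c === '"' && state === 'FIELD_START') {
      state = 'IN_QUOTED';
    } else {
      buf += c; state = 'IN_UNQUOTED';
    }
  }
  yield buf; // last field has no trailing comma to flush it
}

// The consumer keeps two numbers, not the fields themselves.
let count = 0, longest = 0;
for (const f of fieldsOf('x,"y,z,w",v')) {
  count++;
  longest = Math.max(longest, f.length);
}
console.log(count, longest); // 3 5
```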

Watch out for

Worked example

Run the machine over name,"Doe, Jane","She said ""hi""",42 and four clean fields fall out:

name              ← plain, unquoted
Doe, Jane         ← quoted, so its comma stays inside
She said "hi"     ← each "" pair collapses to a single "
42                ← plain again, flushed at end of line

Notice the comma between Doe and Jane never split anything: it arrived while the state was IN_QUOTED, where commas are just text. And each "" contributed exactly one " to the buffer, not a boundary.
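To check this kind of claim on any input, one option is to record the state in which each character is read. The helper below is a sketch that follows the same transitions as parseLine but keeps only the trace; traceStates is a hypothetical name, and the short input is made up.

```javascript
// Sketch: pair every character with the state the machine was in when it read it.
function traceStates(s) {
  let state = 'FIELD_START';
  const trace = [];
  for (const c of s) {
    trace.push([c, state]);
    if (state === 'IN_QUOTED') {
      state = (c === '"') ? 'QUOTE_IN_QUOTED' : 'IN_QUOTED';
    } else if (state === 'QUOTE_IN_QUOTED' && c === '"') {
      state = 'IN_QUOTED';
    } else if (c === ',') {
      state = 'FIELD_START';
    } else if (c === '"' && state === 'FIELD_START') {
      state = 'IN_QUOTED';
    } else {
      state = 'IN_UNQUOTED';
    }
  }
  return trace;
}

const trace = traceStates('x,"a,b"');
for (const [c, state] of trace) console.log(JSON.stringify(c), state);
```

In this trace the first comma is read in IN_UNQUOTED, where it ends the field, and the second is read in IN_QUOTED, where it is ordinary text.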

Check yourself

How many fields does a,"b,c",d parse into?

Inside a quoted field, what does the sequence "" become?