I'm currently trying to learn Scheme(specifically chicken Scheme) and want to get a better idea of the performance pitfalls of the language. I've written a CSV parser and have shared it bellow. The test file that I've attempted to use is 130MB and it takes around 7 minutes to parse. Simply reading in all the lines only takes milliseconds, so the issue lies in the parser. I'm trying to keep my code as "Lispy" as possible and want to avoid introducing too many of the low-level constructs that I am aware are available in Chicken Scheme. The main thing I hope to gain from improving this code is a better intuition behind the performance of Scheme.
Edit: I've taken some of the advice from the comments and now have two separate parsers. The first code block uses a list based solution and the second uses string-refs. I thought string-refs would be faster because of linear memory but the experiments only left me baffled. But this is the core issue I'm trying to solve. I truly have no understanding of why one solution performs better than the other. Any insight would be great.
I also went back and did some measurements of the performance:
List-Solution-Compiled:
37.658s CPU time, 1.2s GC time (major), 258575354/244106 mutations (total/tracked), 1160/317431 GCs (major/minor), maximum live heap: 306.33 KiB
String-Ref-Solution-Compiled: 644.939s CPU time, 1.308s GC time (major), 167854392/243430 mutations (total/tracked), 1281/1023116 GCs (major/minor), maximum live heap: 305.51 KiB
; List based solution
(import (chicken string))
(import utf8)
(import list-utils)
(define (lookahead-list ahead lst)
(cond [(null? ahead) #t]
[(and (not (null? lst)) (not (null? ahead)))
(if (eq? (car ahead) (car lst)) (lookahead-list (cdr ahead) (cdr lst)) #f)]
[else #f]
)
)
(define (null-blist? blist) (null? (car blist)))
(define (lookahead-blist ahead blst) (lookahead-list ahead (car blst)))
(define (read-blist blist)
(if (null? blist) #!eof
(let ([head (caar blist)]) (set-car! blist (cdar blist)) head))
)
; Csv parsing
(define (csv/read-atom blist)
(let loop ([res '()] [cmode #t])
(cond [(lookahead-blist '(#\" #\") blist) (read-blist blist) (loop (cons (read-blist blist) res) cmode)]
[(lookahead-blist '(#\") blist) (read-blist blist) (loop res (not cmode))]
[(and cmode (lookahead-blist '(#\,) blist)) (reverse-list->string res)]
[(null-blist? blist) (reverse-list->string res)]
[else (loop (cons (read-blist blist) res) cmode)]
))
)
(define (csv/parse-non-blank-line blist)
(reverse
(let loop ([res '()])
(let ([nres (cons (csv/read-atom blist) res)])
(if (lookahead-blist '(#\,) blist)
(begin (read-blist blist) (loop nres)) nres)
)
))
)
(define (csv/parse-line str)
(if (equal? str "") '() (csv/parse-non-blank-line (list (string->list str))))
)
(define (csv/parse func init port)
(let loop ([acc init] [line (read-line port)])
(if (eq? line #!eof) acc
(loop (func acc (csv/parse-line line)) (read-line port))
)
)
)
(define (csv/parse-table func init port)
(let ([format (csv/parse-line (read-line port))])
(define (shim acc elem)
(func acc (zip-alist format elem))
)
(csv/parse shim init port)
)
)
(define (csv/parse-table-file func init fname) (call-with-input-file (lambda (p) (csv/parse-table func init p))))
; String-ref based solution
(import (chicken io))
(import (chicken string))
(import utf8)
(import list-utils)
; Cursor string
(define (string->cstring str)
(cons 0 str)
)
(define (cstring/forward cstr)
(cons (+ 1 (car cstr)) (cdr cstr))
)
(define (cstring/peek cstr)
(if (cstring/null? cstr)
#!eof
(string-ref (cdr cstr) (car cstr))
)
)
(define (cstring/forward! cstr)
(let ([ret (cstring/peek cstr)])
(set-car! cstr (+ 1 (car cstr)))
ret
)
)
(define (cstring/null? cstr)
(>= (car cstr) (string-length (cdr cstr)))
)
(define (lookahead-cstring ahead cstr)
(cond [(null? ahead) #t]
[(and (not (cstring/null? cstr)) (not (null? ahead)))
(if (eq? (car ahead) (cstring/peek cstr)) (lookahead-cstring (cdr ahead) (cstring/forward cstr)) #f)]
[else #f]
)
)
; Csv parsing
(define (csv/read-atom cstr)
(let loop ([res '()] [cmode #t])
(cond [(lookahead-cstring '(#\" #\") cstr) (cstring/forward! cstr) (loop (cons (cstring/forward! cstr) res) cmode)]
[(lookahead-cstring '(#\") cstr) (cstring/forward! cstr) (loop res (not cmode))]
[(and cmode (lookahead-cstring '(#\,) cstr)) (reverse-list->string res)]
[(cstring/null? cstr) (reverse-list->string res)]
[else (loop (cons (cstring/forward! cstr) res) cmode)]
))
)
(define (csv/parse-non-blank-line cstr)
(reverse
(let loop ([res '()])
(let ([nres (cons (csv/read-atom cstr) res)])
(if (lookahead-cstring '(#\,) cstr)
(begin (cstring/forward! cstr) (loop nres)) nres)
)
))
)
(define (csv/parse-line str)
(if (equal? str "") '() (csv/parse-non-blank-line (string->cstring str)))
)
(define (csv/parse func init port)
(let loop ([acc init] [line (read-line port)])
(if (eq? line #!eof) acc
(loop (func acc (csv/parse-line line)) (read-line port))
)
)
)
(define (csv/parse-table func init port)
(let ([format (csv/parse-line (read-line port))])
(define (shim acc elem)
(func acc (zip-alist format elem))
)
(csv/parse shim init port)
)
)
(define (csv/parse-table-file func init fname) (call-with-input-file fname (lambda (p) (csv/parse-table func init p))))
CHICKEN Scheme can exhibit pathologically bad performance behavior with code that stresses the garbage collector too much. You can either attempt to tune GC parameters (see the output of the
-:?flag for a full list of runtime options you can pass to your program) or rewrite your program to not produce this much garbage. For example, both the suggestion from the comments to read in characters withread-char/peek-charand reading in whole lines would help with that.Personally, I do believe there to be nothing wrong with using implementation-specific features. You don't have to go all in on them. For example you could use
cond-expandto introduce implementation-specific code and define acsv/read-linehelper procedure that either usesread-linefrom(chicken io)or a variant written in portable Scheme usingread-charandpeek-char. Scheme lends itself very well to language experimentation, so don't be afraid to try out various strategies and see how different they are in terms of readability/portability/performance/...