Michael Lissner
At CourtListener, we’re developing a new system to convert scanned court
documents to text. As part of our development we’ve analyzed more than
1,000 court opinions to determine what fonts courts are using.
Now that we have this information, our next step is to create training
data for our OCR system so that it specializes in these fonts. For now,
we’ve attached a spreadsheet with our findings and a script that others
can use to extract font metadata from PDFs.
Unsurprisingly, the top font — drumroll please — is Times New Roman.
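For readers curious about how font metadata can be pulled out of a PDF, here is a minimal sketch. It is not the attached script. It assumes the `pdffonts` command-line tool from Poppler is installed, and the helper names `parse_pdffonts` and `count_fonts` are made up for illustration.

```python
import re
import subprocess
from collections import Counter

# Subset-embedded fonts carry a six-letter tag, e.g. "ABCDEF+TimesNewRomanPSMT".
SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def parse_pdffonts(output):
    """Return the font names listed in the text output of `pdffonts`."""
    names = []
    # The first two lines are the column header and a dashed separator.
    for line in output.splitlines()[2:]:
        fields = line.split()
        if not fields:
            continue
        # Assumption: the font name contains no spaces, so it is the first field.
        names.append(SUBSET_PREFIX.sub("", fields[0]))
    return names


def count_fonts(pdf_paths):
    """Count how many documents use each font (hypothetical helper)."""
    counts = Counter()
    for path in pdf_paths:
        # Assumption: `pdffonts` is on the PATH; check=True raises on failure.
        result = subprocess.run(
            ["pdffonts", path], capture_output=True, text=True, check=True
        )
        # Use a set so each font is counted once per document.
        counts.update(set(parse_pdffonts(result.stdout)))
    return counts
```

Calling `count_fonts(paths).most_common(10)` on a list of PDF paths would give a rough top-ten list. Scanned documents with no text layer report no fonts at all, so they simply contribute nothing to the counts.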
Attachments