Commit e59b3ce

petervwyatt and jneen authored
Add PDF syntax to Rouge (#2058)
* Initial PDF COS rouge lexer
* Update pdf.rb
* Create demo PDF (functional)
  Needs to be treated as binary for xref to remain valid
* Update pdf.rb
* Add basic spec checker
* Fixups
* Altered tokens for better color
* More complex PDF for visual test
* Added EOL to last line of PDF
  Added EOL to last line of PDF to pass linelint CI check used by Rouge. This is not required by real PDF files.
* Update lib/rouge/lexers/pdf.rb
  Co-authored-by: Jeanine Adkisson <[email protected]>
* Update lib/rouge/lexers/pdf.rb
  Co-authored-by: Jeanine Adkisson <[email protected]>
* Update lib/rouge/lexers/pdf.rb
  Co-authored-by: Jeanine Adkisson <[email protected]>
* Fix spelling. Ensure PERIOD in "%PDF-x.y". Comment added

---------

Co-authored-by: Jeanine Adkisson <[email protected]>
1 parent 50971f5 commit e59b3ce

4 files changed: +225 -0 lines changed

lib/rouge/demos/pdf

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+%PDF-1.6
+%©©©©
+
+1 0 obj<</Type/Catalog/Pages 2 0 R/StructTreeRoot null/MarkInfo<</Marked false>>>>
+endobj
+2 0 obj<</Type/Pages/Kids[3 0 R]/Count 1>>
+endobj
+3 0 obj<</Type/Page/Parent 2 0 R/MediaBox[.0 0 200 200]/Contents 4 0 R/Resources<<>>>>
+endobj
+4 0 obj<</Length 60>>
+stream
++8 w 1 j
+1.0 0 0 rg
+0 0 1 RG
+10 10 180 180 re B
+endstream
+endobj
+xref
+0 5
+0000000000 65535 f
+0000000021 00000 n
+0000000113 00000 n
+0000000165 00000 n
+0000000261 00000 n
+trailer
+<</Root 1 0 R/Size 5/ID[<18D6B641245C03F28E67D93AD879D6EC><18D6B641245C03F28E67D93AD879D6EC>]>>
+startxref
+371
+%%EOF
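
The commit message notes that the demo PDF "Needs to be treated as binary for xref to remain valid". The reason is that each in-use cross-reference entry, and the number after `startxref`, is a byte offset into the file, so any tool that rewrites line endings or re-encodes the text shifts the objects away from their recorded positions. The following is a minimal sketch of that idea using only the Ruby standard library. The helper names `xref_offsets_valid?` and `build_sample_pdf` are hypothetical and not part of Rouge, and the check assumes a single xref subsection as in the demo file.

```ruby
# Hypothetical helper (not part of Rouge): true when every in-use ("n")
# cross-reference entry points at the byte where "<num> <gen> obj" begins.
# Assumes a single xref subsection, as in the demo file.
def xref_offsets_valid?(bytes)
  data = bytes.dup.force_encoding(Encoding::BINARY)
  xref_at = data.index(/^xref\r?\n/)
  return false unless xref_at

  lines = data[xref_at..].lines.drop(1)         # drop the "xref" keyword line
  first, count = lines.shift.split.map(&:to_i)  # subsection header, e.g. "0 5"
  lines.first(count).each_with_index.all? do |entry, i|
    offset, generation, kind = entry.split
    next true if kind == 'f'                    # free entries reference no object
    data[offset.to_i, 32].to_s.start_with?("#{first + i} #{generation.to_i} obj")
  end
end

# Hypothetical builder: a tiny two-object file whose offsets are computed
# from the bytes written so far rather than typed by hand.
def build_sample_pdf
  body = +"%PDF-1.6\n"
  objects = [
    "1 0 obj<</Type/Catalog/Pages 2 0 R>>\nendobj\n",
    "2 0 obj<</Type/Pages/Kids[]/Count 0>>\nendobj\n"
  ]
  offsets = []
  objects.each do |object|
    offsets << body.bytesize
    body << object
  end
  xref_start = body.bytesize
  entries = ["0000000000 65535 f \n"] +
            offsets.map { |offset| format("%010d 00000 n \n", offset) }
  "#{body}xref\n0 #{entries.size}\n#{entries.join}" \
    "trailer\n<</Root 1 0 R/Size #{entries.size}>>\nstartxref\n#{xref_start}\n%%EOF\n"
end

pdf = build_sample_pdf
puts xref_offsets_valid?(pdf)                     # => true
puts xref_offsets_valid?(pdf.gsub("\n", "\r\n"))  # => false, offsets have shifted
```

Converting LF to CRLF adds one byte per line, which is enough to invalidate every offset after the header. This is the kind of silent change that treating the file as binary prevents.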

lib/rouge/lexers/pdf.rb

Lines changed: 109 additions & 0 deletions
@@ -0,0 +1,109 @@
+# -*- coding: utf-8 -*- #
+# frozen_string_literal: true
+# vim: set ts=2 sw=2 et:
+
+# PDF = Portable Document Format page description language
+# As defined by ISO 32000-2:2020 including resolved errata from https://pdf-issues.pdfa.org/
+#
+# The PDF syntax is also known as "COS" and can be used with FDF (Forms Data Format) files as
+# per ISO 32000-2:2020 clause 12.7.8.
+#
+# This is a token-based parser ONLY! It is intended to syntax highlight full or partial fragments
+# of well-formed, hand-written PDF syntax in documentation such as ISO specifications. It is NOT
+# intended to cope with real-world PDFs that will contain arbitrary binary data (which forms invalid
+# UTF-8 sequences and generates "ArgumentError: invalid byte sequence in UTF-8" Ruby errors) and
+# other types of malformations or syntax errors.
+#
+# Author: Peter Wyatt, CTO, PDF Association. 2024
+#
+module Rouge
+  module Lexers
+    class Pdf < RegexLexer
+      title "PDF"
+      desc "PDF - Portable Document Format (ISO 32000)"
+      tag 'pdf'
+      aliases "fdf", 'cos'
+      filenames '*.pdf', '*.fdf'
+      mimetypes 'application/pdf', 'application/fdf' # IANA registered media types
+
+      # PDF and FDF files must start with "%PDF-x.y" or "%FDF-x.y"
+      # where x is the single digit major version and y is the single digit minor version.
+      # For simplicity as a syntax highlighter, this assumes the header occurs at the start of a line.
+      def self.detect?(text)
+        return true if /\A%(P|F)DF-\d\.\d/.match?(text)
+      end
+
+      # PDF Delimiters (ISO 32000-2:2020, Table 1 and Table 2).
+      # Ruby whitespace "\s" is /[ \t\r\n\f\v]/ which does not include NUL (ISO 32000-2:2020, Table 1).
+      # PDF also supports 2-character EOL sequences.
+
+      state :root do
+        # Start-of-file header comment is special (comment is up to EOL)
+        rule %r/^%(P|F)DF-\d\.\d.*$/, Comment::Preproc
+
+        # End-of-file marker comment is special (comment is up to EOL)
+        rule %r/^%%EOF.*$/, Comment::Preproc
+
+        # PDF only has single-line comments: from "%" to EOL
+        rule %r/%.*$/, Comment::Single
+
+        # PDF Boolean and null object keywords
+        rule %r/(false|true|null)/, Keyword::Constant
+
+        # PDF Dictionary and array object start and end tokens
+        rule %r/(<<|>>|\[|\])/, Punctuation
+
+        # PDF Hex string - can contain whitespace and span multiple lines.
+        # This rule must be after "<<"/">>"
+        rule %r/<[0-9A-Fa-f\s]*>/m, Str::Other
+
+        # PDF literal strings are complex (multi-line, escapes, etc.). Use separate state machine.
+        rule %r/\(/, Str, :stringliteral
+
+        # PDF Name objects - can be empty (i.e., nothing after "/").
+        # No special processing required for 2-digit hex codes that start with "#".
+        rule %r/\/[^\(\)<>\[\]\/%\s]*/, Name::Other
+
+        # PDF objects and stream (no checking of object ID)
+        # Note that object number and generation numbers do not have sign.
+        rule %r/\d+\s\d+\sobj/, Keyword::Declaration
+        rule %r/endstream|endobj|stream/, Keyword::Declaration
+
+        # PDF conventional file layout keywords
+        rule %r/startxref|trailer|xref/, Keyword::Declaration
+
+        # PDF cross reference section entries (20 bytes including EOL).
+        # Explicit single SPACE separators.
+        rule %r/^\d{10} \d{5} (n|f)\s*$/, Keyword::Namespace
+
+        # PDF Indirect reference (lax, allows zero as the object number).
+        # Requires terminating delimiter lookahead to disambiguate from "RG" operator
+        rule %r/\d+\s\d+\sR(?=[\(\)<>\[\]\/%\s])/, Name::Decorator
+
+        # PDF Real object (requires a period, so plain integers fall through to the next rule)
+        rule %r/(\-|\+)?([0-9]+\.[0-9]*|[0-9]*\.[0-9]+)/, Num::Float
+
+        # PDF Integer object
+        rule %r/(\-|\+)?[0-9]+/, Num::Integer
+
+        # A run of non-delimiters is most likely a PDF content stream
+        # operator (ISO 32000-2:2020, Annex A).
+        rule %r/[^\(\)<>\[\]\/%\s]+/, Operator::Word
+
+        # Whitespace (except inside strings and comments) is ignored = /[ \t\r\n\f\v]/.
+        # Ruby doesn't include NUL as whitespace (vs ISO 32000-2:2020 Table 1)
+        rule %r/\s+/, Text::Whitespace
+      end
+
+      # PDF literal string. See ISO 32000-2:2020 clause 7.3.4.2 and Table 3
+      state :stringliteral do
+        rule %r/\(/, Str, :stringliteral # recursive for internal bracketed strings
+        rule %r/\\\(/, Str::Escape # escaped "(" does not affect nesting
+        rule %r/\)/, Str, :pop!
+        rule %r/\\\)/, Str::Escape # escaped ")" does not affect nesting
+        rule %r/\\([0-7]{1,3}|n|r|t|b|f|\\)/, Str::Escape
+        rule %r/[^\(\)\\]+/, Str
+      end
+    end
+  end
+end
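
Several comments in the lexer depend on rule order: the hex string rule "must be after" the `<<`/`>>` rule, and the indirect reference rule needs a lookahead so that `0 0 1 RG` is not read as a reference followed by `G`. Rouge's `RegexLexer` tries the rules of a state in order and takes the first one that matches at the current position. The sketch below imitates that behaviour with the standard library's `StringScanner`. It is an illustration only: the `RULES` table and `tokenize` helper are hypothetical, the token names are plain symbols rather than Rouge token classes, strings and comments are left out, and the real-number pattern here is simplified to require a period.

```ruby
require 'strscan'

# Ordered (pattern, token) pairs: a simplified, hypothetical subset of the
# lexer's root state. Earlier entries win when several could match.
RULES = [
  [/<<|>>|\[|\]/,                     :punctuation],
  [/<[0-9A-Fa-f\s]*>/,                :hex_string],  # only tried after "<<" has failed
  [/\/[^()<>\[\]\/%\s]*/,             :name],
  [/\d+\s\d+\sR(?=[()<>\[\]\/%\s])/,  :reference],   # lookahead rejects "RG"
  [/[-+]?(?:\d+\.\d*|\.\d+)/,         :real],
  [/[-+]?\d+/,                        :integer],
  [/[^()<>\[\]\/%\s]+/,               :operator],
  [/\s+/,                             :whitespace]
].freeze

# Returns [type, text] pairs, skipping whitespace. Raises when no rule in
# this reduced table applies (for example at "(" or "%").
def tokenize(source)
  scanner = StringScanner.new(source)
  tokens = []
  until scanner.eos?
    pattern, type = RULES.find { |(candidate, _)| scanner.match?(candidate) }
    raise ArgumentError, "no rule matches at offset #{scanner.pos}" unless pattern

    tokens << [type, scanner.scan(pattern)]
  end
  tokens.reject { |type, _| type == :whitespace }
end

p tokenize("<</P 2 0 R>>")
# => [[:punctuation, "<<"], [:name, "/P"], [:reference, "2 0 R"], [:punctuation, ">>"]]
p tokenize("0 0 1 RG")
# => [[:integer, "0"], [:integer, "0"], [:integer, "1"], [:operator, "RG"]]
```

Moving the `:hex_string` entry above `:punctuation` would make `<<` fail to tokenize as a dictionary opener whenever the text up to the next `>` happened to look like hex digits, which is the ordering hazard the lexer comment warns about.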

spec/lexers/pdf_spec.rb

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+# -*- coding: utf-8 -*- #
+# frozen_string_literal: true
+
+describe Rouge::Lexers::Pdf do
+  let(:subject) { Rouge::Lexers::Pdf.new }
+
+  describe 'guessing' do
+    include Support::Guessing
+
+    it 'guesses by filename' do
+      assert_guess :filename => 'foo.pdf'
+      assert_guess :filename => 'foo.fdf'
+    end
+
+    it 'guesses by mimetype' do
+      assert_guess :mimetype => 'application/pdf'
+      assert_guess :mimetype => 'application/fdf'
+    end
+
+    it 'guesses by source' do
+      assert_guess :source => '%PDF-1.6'
+      assert_guess :source => '%PDF-2.0'
+      assert_guess :source => '%PDF-0.3' # Fake PDF version
+      assert_guess :source => '%PDF-6.8' # Fake PDF version
+      assert_guess :source => '%FDF-1.2'
+    end
+  end
+
+end
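
The "fake" versions in the spec pass because source detection checks only the shape of the header (a single digit, a period, a single digit), not whether that PDF version exists. A small stand-alone sketch of the same check follows. The pattern is copied from the lexer's `detect?` method, while the constant `HEADER` and the helper `pdf_or_fdf_header?` are hypothetical names used for illustration.

```ruby
# Same pattern as the lexer's detect? method. \A anchors at the start of the
# whole string, so a header on a later line does not count.
HEADER = /\A%(P|F)DF-\d\.\d/

# Hypothetical helper: true when the source begins with "%PDF-x.y" or "%FDF-x.y".
def pdf_or_fdf_header?(source)
  HEADER.match?(source)
end

puts pdf_or_fdf_header?('%PDF-6.8')          # => true, shape only
puts pdf_or_fdf_header?('%FDF-1.2')          # => true
puts pdf_or_fdf_header?("junk\n%PDF-1.6")    # => false, not at start of input
puts pdf_or_fdf_header?('%PDF-1,6')          # => false, PERIOD is required
```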

spec/visual/samples/pdf

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
+%PDF-1.7
+%©©
+1 0 obj
+<</Type/Catalog/MarkInfo<<%comment after dictionary start
+/Marked true/Suspects true%comment after a boolean
+/UserProperties true>>/StructTreeRoot null/AA<</WP<</S/JavaScript/JS(//JavaScript comment
+app.alert\( "Document Will-Print Action!!"\))>>>>/Pages 3 0 R>>%comment after dictionary close
+endobj
+2 0 obj
+null%comment after null
+endobj
+3 0 obj
+<</FakeBigDataArray[true[[[]]]true<686931>null<686932>null[/Dummy](hi3)[(hi4)(hi5)true(hi6)null(hi7)12(hi8)]-1.<</ABC +.123/DEF +.0>>[](hi99)[]null[]<</DEF null>>true<</GHI/JKL>>[<</MNO +.0>>]<686933>1 0 R[.1 -2 +.3]6 0 R<686934>4 0 R(hi9)2 0 R<</QRS true>>[true]<</TUV true>><686935><</XYZ true>>3 0 R<</AAB true>>(hi10)<</AAC true>>null<686936>true(hi11)<686937>(hi12)+.0<686938>]
+/Type/Pages/Count 1/Kids[4 0 R%comment after indirect ref
+]>>endobj
+4 0 obj
+<</Type/Page/Parent 3 0 R/MediaBox[%comment after array start
++0 .0 999 999.]%comment after array end token
+/CropBox[+0 .0 999%comment after an integer
+999.]/Contents[5 0 R]/UserUnit +0.88
+/Resources<</Pattern<<>>/ProcSet[null]/ExtGState<</ 6 0 R>>/Font<</F1<</Type/Font/Subtype/Type1/BaseFont/Times-Bold/Encoding/WinAnsiEncoding>>>>>>>>
+endobj
+5 0 obj
+<</Length 757 >>
+stream
+BX /BreakMyParser <</FakeBigDataArray[true[[[]]]true<686931>null<686932>null[/Dummy](hi3)[(hi4)(hi5)true(hi6)null(hi7)12(hi8)]-1.<</ABC +.123/DEF +.0>>[](hi99)[]null[]<</DEF null>>true<</GHI/JKL>>[<</MNO +.0>>]<686933>[1 2 3]<686934>(hi9)<</QRS true>>[true]<</TUV true>><686935><</XYZ true>><</AAB true>>(hi10)<</AAC true>>null<686936>true(hi11)<686937>(hi12)+.0<686938>]>> DP EX
+BT/F1 30 Tf 0 Tr 1 0 0 1 10 950 Tm(PDF Ruby Rouge test file)Tj 1 0 0 1 10 900 Tm
+(This file must NOT be resaved or modified by any tool!!)Tj ET% 3 colored vector graphic squares that are clipped
+/ gs q 40 w 75 75 400 400 re W S % stroke then clip a path with a wide black border
+1 0. .0 rg 75 75 200 200 re f 0 1 0 rg 275 75 200 200 re f .0 0 1 rg 275 275 200 200 re f Q
+endstream
+endobj
+6 0 obj<</Type/ExtGState/ca 0.33/CA 0.66%comment after a real
+>>
+endobj
+7 0 obj
+<</Subject(Compacted Syntax v3.0)%comment after literal string end
+/Title<436f6d7061637465642073796e746178>%comment after hex string end
+/Keywords(PDF,Compacted,Syntax,ISO 32000-2:2020)/CreationDate(D:20200317)/Author(Peter Wyatt)/Creator< 48616e
+642d65646974>/Producer<48616e 6 4 2 d 6 5646974>>>
+endobj
+xref
+0 8
+0000000000 65535 f
+0000000017 00000 n
+0000000332 00000 n
+0000000374 00000 n
+0000000837 00000 n
+0000001198 00000 n
+0000002009 00000 n
+0000002084 00000 n
+trailer
+<</Root 1 0 R/Info%comment after name
+7 0 R/ID[<18D6B6412
+45C033A6E67D93AD879D6EC><18D 6B 641245C033A6E67D93AD879D6EC>]/Size 8>>
+startxref
+2403
+%%EOF
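
The JavaScript action in object 1 of this sample exercises the trickiest part of literal strings: the string contains both escaped parentheses (`\(` and `\)`) and a closing `)` that actually ends it. In PDF, unescaped parentheses inside a literal string must balance, and a reverse solidus removes the following character from that count. The sketch below finds the closing parenthesis under those two rules. The helper name `literal_string_end` is hypothetical, and the function is a stand-alone illustration rather than code from the lexer.

```ruby
# Hypothetical helper: returns the index of the ")" that closes the literal
# string opening at +start+, or nil when the string is unterminated.
def literal_string_end(source, start)
  raise ArgumentError, 'expected "(" at start' unless source[start] == '('

  depth = 0
  index = start
  while index < source.length
    case source[index]
    when '\\' then index += 1   # skip the escaped character entirely
    when '('  then depth += 1   # nested, unescaped opener
    when ')'
      depth -= 1
      return index if depth.zero?
    end
    index += 1
  end
  nil
end

js = '(app.alert\( "hi"\))'
puts literal_string_end(js, 0) == js.length - 1   # => true
puts literal_string_end('(a(b)c)', 0)             # => 6
puts literal_string_end('(never closed', 0).inspect  # => nil
```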
