New attempt to make markdown converter work
And that's, um... pretty hard. I don't know why, but I suspect my regexes have become so fucking ugly that I can't do this properly in Python without creating a pile of problems and bugs. Putting this in the commit for historical reference.
- Author
- Maarten 'Vngngdn' Vangeneugden
- Date
- Sept. 28, 2017, 11:50 p.m.
- Hash
- 5a08d014792d6a8417ef684c8e0a22eb6e6ac351
- Parent
- 772aedaf0099886a369576cc8f8ab9a6860ced2e
- Modified files
- markdown.py
- models.py
markdown.py
22 additions and 15 deletions.
1 |
1 |
import pygments |
2 |
2 |
|
3 |
3 |
""" So welcome to my Markdown module. Since the markdown library in PyPI is |
4 |
4 |
fucking shit, I've decided to write my own implementation. Contrary to the one in |
5 |
5 |
PyPI, my version handles **all** cases, and is a **full implementation** of the |
6 |
6 |
reference. |
7 |
7 |
|
8 |
8 |
Oh, and just so you know: You don't need an entire shitty object oriented system |
9 |
9 |
to make something decent. Sometimes the solution is a function. Period. |
10 |
10 |
""" |
11 |
11 |
|
12 |
12 |
""" |
13 |
13 |
Checklist about all shit that must be implemented: |
14 |
14 |
- headers need to have their ID's be the same as the title. BUT! id's |
15 |
15 |
mustn't have spaces, and need to be unique. The latter isn't that big of a |
16 |
16 |
deal, but spaces in the header title must be converted to dashes. |
17 |
17 |
- HTML code needs to be escaped; & must become &amp;, < and > become &lt; |
18 |
18 |
and &gt; and so on. This isn't necessary for UTF-8 symbols such as ©, |
19 |
19 |
which can be put in place as is, instead of converting to &copy;. |
20 |
20 |
- Some elements have to be placed in the tag itself, such as links in <a />. |
21 |
21 |
This is noted with the {#} tags. The context in which they are used in the |
22 |
22 |
defaults should give a good explanation on what number points to what. |
23 |
23 |
- Remember to support 2 trailing spaces as <br />! |
24 |
24 |
- There are also "closing ATX headers": "# title" is the same as |
25 |
25 |
"# title ####" and "# title #". (So it's purely cosmetic, remove the |
26 |
26 |
trailing whitespace in these cases) |
27 |
27 |
- When code is used, call Pygments to markup the code properly. If a code |
28 |
28 |
tag is provided (e.g. "Python", "C", ...), tell that to Pygments as well, |
29 |
29 |
so it can do a better job. If nothing is provided, leave it as is. When |
30 |
30 |
it's an inline code block (`CODE`), leave that always as is. |
31 |
31 |
Look how to do it at |
32 |
32 |
<http://docs.getpelican.com/en/stable/content.html#syntax-highlighting>. |
33 |
33 |
|
34 |
34 |
Future expansions: |
35 |
35 |
- Allow nesting of more elements. For example: Headers cannot be nested in |
36 |
36 |
blockquotes, but this is a nice thing to have. |
37 |
37 |
- Allow headers to follow a line wrapping, if the next line is preceded by |
38 |
38 |
the same amount of hashtags (=> same header level). |
39 |
39 |
- Allow the special p "Perseverance porn" stories, about how someone walks 10 miles to work every day, have the effect of normalizing the big disadvantages from society that make people do hard labor that society should not need. |
40 |
- | |
41 |
- | Marriage |
42 |
- | 7 August 2017aragraph blockquote style: |
43 |
- | https://daringfireball.net/projects/markdown/syntax#blockquote |
44 |
40 |
""" |
45 |
41 |
|
46 |
42 |
|
47 |
- | def toHTML( |
48 |
43 |
text, |
49 |
44 |
emphasis = r"<em>{text}</em>", |
50 |
45 |
strong = r"<strong>{text}</strong>", |
51 |
46 |
unordered_list = r"<ul>{items}</ul>", |
52 |
47 |
ordered_list = r"<ol>{items}</ol>", |
53 |
48 |
list_item = r"<li>{text}</li>", |
54 |
49 |
hyperlink = r'<a href="{link}" title="{title}">{text}</a>', |
55 |
50 |
image = r'<img src="{link}" alt="{alt}" title="{title}" />', |
56 |
51 |
paragraph = r"<p>{text}</p>", |
57 |
52 |
blockquote = r"<blockquote>{text}</blockquote>", |
58 |
53 |
header1 = r'<h1 id="{link}">{text}</h1>', |
59 |
54 |
header2 = r'<h2 id="{link}">{text}</h2>', |
60 |
55 |
header3 = r'<h3 id="{link}">{text}</h3>', |
61 |
56 |
header4 = r'<h4 id="{link}">{text}</h4>', |
62 |
57 |
header5 = r'<h5 id="{link}">{text}</h5>', |
63 |
58 |
header6 = r'<h6 id="{link}">{text}</h6>', |
64 |
59 |
code = r'<code lang="{language}">{code}</code>', |
65 |
- | incorrect = r"<s>{text}</s>", |
+ |
60 |
code = r'<code lang="\g<language>">\g<code></code>', |
+ |
61 |
incorrect = r"<s>{text}</s>", |
66 |
62 |
line_break = r"<br />", |
67 |
63 |
horizontal_rule = r"<hr />", |
68 |
64 |
): |
69 |
65 |
""" Translates Markdown code to HTML code. |
70 |
66 |
|
71 |
67 |
This is a pure function. |
72 |
68 |
|
73 |
69 |
This function will translate given Markdown code to HTML code. |
74 |
70 |
It follows the specification as well as possible, with a few custom additions: |
75 |
71 |
- Incorrect text can be marked with "~" around a text block. |
76 |
72 |
|
77 |
73 |
The default parameters have sane defaults, but can be customized if you wish |
78 |
74 |
to do so. Pay attention to the tags, as your custom value must also |
79 |
75 |
incorporate these. |
80 |
76 |
|
81 |
77 |
The function works in a simple way: |
82 |
78 |
1. Replace all redundant content with only 1 unique part |
83 |
79 |
1.1. For example: 5 blank lines mean the same as 2; a line with only spaces |
84 |
80 |
and tabs means the same as an empty line; hashtags at the end of a header |
85 |
81 |
line are meaningless; ... |
86 |
82 |
2. Handle blockquotes. Blockquotes have the highest precedence and can contain |
87 |
83 |
any other element, thus it's easiest to just handle these as soon as possible. |
88 |
84 |
3. Replace Setext with atx-style headers, to provide consistency for header handling. |
89 |
85 |
4. Handle block elements (paragraphs, code, ...). |
90 |
86 |
5. In all block elements, handle span elements (links, emphasis, ...). |
91 |
87 |
""" |
92 |
88 |
|
93 |
89 |
# Replacing some shit: |
94 |
90 |
text = re.sub(r"^[ \t]+$", "", text, flags=re.MULTILINE) # Make all blank lines consistent |
95 |
91 |
text = re.sub(r"\n{3,}", "\n\n", text) # Collapse runs of blank lines into a single blank line |
96 |
92 |
|
97 |
93 |
# XXX: Blockquotes have the highest precedence: **ANYTHING** can be nested |
98 |
94 |
# in a blockquote. So, handle these first, and convert them up front to |
99 |
95 |
# make it easier to handle the other text. |
100 |
96 |
|
101 |
97 |
|
102 |
98 |
""" About handling blockquotes: |
103 |
99 |
Every line that starts with "> " is a blockquote. As long as the next line |
104 |
100 |
starts in the same way, it's considered part of the same blockquote. |
105 |
101 |
**However**, there is 1 exception to this rule: |
106 |
102 |
paragraphs that are hard-wrapped only need 1 > for their first line, but can |
107 |
103 |
then be hard wrapped, and even start without prior spacing. |
108 |
104 |
""" |
109 |
105 |
blockquotes_left = True |
110 |
106 |
while blockquotes_left: |
111 |
107 |
blockquote = re.compile(r"(^> .+\n)+") |
112 |
108 |
quote = blockquote.search(text) |
113 |
109 |
if quote is None: |
114 |
110 |
blockquotes_left = False |
115 |
111 |
else: |
116 |
112 |
begin, end = quote.span() |
117 |
113 |
reworked = "<blockquote>" + text[begin:end].replace(r"\n> ", r"\n") + r"</blockquote>\n" |
118 |
114 |
text = text[:begin] + reworked + text[end:] |
119 |
115 |
|
120 |
116 |
# All blockquotes are now removed |
121 |
117 |
|
122 |
118 |
# Converting setext to atx headers |
123 |
119 |
text = re.sub(r"^(?P<title>.+)\n=+$", r"# \g<title>", text, flags=re.MULTILINE) |
124 |
120 |
text = re.sub(r"^(?P<title>.+)\n-+$", r"## \g<title>", text, flags=re.MULTILINE) |
125 |
121 |
# All are now converted to atx style headers |
126 |
122 |
# Transforming headers: |
127 |
123 |
for i in range(1,7): |
128 |
124 |
header = r"^#{"+str(i)+r"} (?P<title>.+)$" |
129 |
125 |
match = re.search(header, text, flags=re.MULTILINE) |
130 |
126 |
while match is not None: |
131 |
127 |
future_id = match['title'].lower() |
132 |
128 |
future_id = re.sub(r"[ _,.!]", r"-", future_id) |
133 |
- | dictionary = match.groupdict() |
+ |
129 |
future_id = re.sub(r" ", r"-", future_id) |
+ |
130 |
dictionary = match.groupdict() |
134 |
131 |
dictionary['link'] = future_id |
135 |
132 |
replacement = (r'<h'+str(i)+r' id="{link}">{title}</h'+str(i)+r'>').format_map(dictionary) |
136 |
133 |
text = text[:match.start()] + replacement + text[match.end():] |
137 |
134 |
match = re.search(header, text, flags=re.MULTILINE) |
138 |
135 |
|
139 |
136 |
# All headers transformed |
140 |
137 |
|
141 |
138 |
# Paragraphs |
142 |
139 |
text = re.sub(r"(?P<text>(?:^(?!<).+\n)+)", r"<p>\n\g<text></p>", text, flags=re.MULTILINE) |
143 |
140 |
|
144 |
141 |
|
145 |
142 |
# Doing inline hyperlinks |
146 |
143 |
text = re.sub(r"\[(?P<text>.+)\]\((?P<url>.+) \"(?P<title>.+)\"\)", r'<a href="\g<url>" title="\g<title>">\g<text></a>', text) |
147 |
- | text = re.sub(r"\[(?P<text>.+)\]\((?P<url>.+)\)", r'<a href="\g<url>">\g<text></a>', text) |
148 |
- | |
149 |
- | # Doing emphasis and strongs |
150 |
- | text = re.sub(r"\*\*(?P<text>[^*.]*)\*\*", r"<strong>\g<text></strong>", text) |
151 |
- | text = re.sub(r"__(?P<text>[^\_.]*)__", r"<strong>\g<text></strong>", text) |
152 |
- | text = re.sub(r"\*(?P<text>[^\*.]*)\*", r"<em>\g<text></em>", text) |
153 |
- | text = re.sub(r"_(?P<text>[^\_.]*)_", r"<em>\g<text></em>", text) |
154 |
- | |
+ |
144 |
text = re.sub(r"\[(?P<text>.+?)\]\((?P<url>.+?)\)", r'<a href="\g<url>">\g<text></a>', text, flags=re.S) |
+ |
145 |
|
+ |
146 |
# Doing strongs |
+ |
147 |
text = re.sub(r"\*\*(?P<text>.+?)\*\*", r"<strong>\g<text></strong>", text, flags=re.S) |
+ |
148 |
text = re.sub(r"__(?P<text>.+?)__", r"<strong>\g<text></strong>", text, flags=re.S) |
+ |
149 |
# Doing emphasis |
+ |
150 |
text = re.sub(r"\*(?P<text>.+?)\*", r"<em>\g<text></em>", text, flags=re.S) |
+ |
151 |
text = re.sub(r"_(?P<text>.+?)_", r"<em>\g<text></em>", text, flags=re.S) |
+ |
152 |
# Code blocks |
+ |
153 |
text = re.sub(r"^```(?P<language>.+?)\n(?P<code>.+?)\n```$", code, text, flags=re.S) |
+ |
154 |
# Doing inline code |
+ |
155 |
text = re.sub(r"``(?P<code>.+?)``", inline_code, text, flags=re.S) |
+ |
156 |
text = re.sub(r"`(?P<code>.+?)`", inline_code, text, flags=re.S) |
+ |
157 |
# Horizontal rules |
+ |
158 |
text = re.sub(r"^((\*|_|-) *){3,}$", horizontal_rule, text) |
+ |
159 |
# Line breaks |
+ |
160 |
text = re.sub(r" $", line_break, text) |
+ |
161 |
|
155 |
162 |
return text |
156 |
163 |
""" |
157 |
164 |
|
158 |
165 |
|
159 |
166 |
|
160 |
167 |
|
161 |
168 |
|
162 |
169 |
block_elements_table = { |
163 |
170 |
"code": r"```(?P<language>\w+)\n( .*\n)+", |
164 |
171 |
"blockquote": r"^> (?P<text>.+) |
165 |
172 |
"paragraph": r"(?P<text>(^.+\n)+)", |
166 |
173 |
"header": r"^#{1,6} (?P<title>(\w+ ?)+ *) ?#*$", |
167 |
174 |
|
168 |
175 |
|
169 |
176 |
element_table = { |
170 |
177 |
"emphasis": (r"\*(?P<text>[^\*.]*)\*|_(?P<text>[^\_.]*)_", emphasis, emphasis_end), |
171 |
178 |
"strong": (r"\*\*(?P<text>[^*.]*)\*\*|__(?P<text>[^\_.]*)__", strong, strong_end), |
172 |
179 |
"unordered list": (r"") |
173 |
180 |
"inline link": (r"\[(\w\s)+\]\( |
174 |
181 |
|
175 |
182 |
|
176 |
183 |
def translate(text, begin, end, parameters): |
177 |
184 |
|
178 |
185 |
if alpha: # If this contains no more nested elements: |
179 |
186 |
return begin.format(parameters) + text + end |
180 |
187 |
elif beta: # text contains nested elements: |
181 |
188 |
# Find parameters or something IDK |
182 |
189 |
return begin.format(parameters) + _ |
183 |
190 |
translate(text[alpha:beta], begin_tag, end_tag, found_parameters) + _ |
184 |
191 |
end |
185 |
192 |
|
186 |
193 |
# Zoom zoom insert magic code here |
187 |
194 |
|
188 |
195 |
# NOTE: Hyperlinks are handled specially in Markdown. Check the syntax page |
189 |
196 |
# for more information. That said, it's imperative to **first** collect all |
190 |
197 |
# information about hyperlinks, and remove it, so it can be used when |
191 |
198 |
# parsing hyperlinks. |
192 |
199 |
|
193 |
200 |
# Table of all elements and their respective regular expression: |
194 |
201 |
elements = { |
195 |
202 |
paragraph: r"", |
196 |
203 |
ordered_list_item: r"" |
197 |
204 |
hyperlink: |
198 |
205 |
header1: r"^# [*\(\n)] \n" |
199 |
206 |
} |
200 |
207 |
|
201 |
208 |
""" |
202 |
209 |
""" The reason the length is stored instead of the end, is because it is |
203 |
210 |
less error prone; if a parent node is updated, only the begin needs to be |
204 |
211 |
updated, as the length is still the same for the node. The begin can be |
205 |
212 |
relative to the parent node, so even that won't have to be updated. """ |
206 |
213 |
""" |
207 |
214 |
node = { |
208 |
215 |
"type": block_type, |
209 |
216 |
"begin": begin, |
210 |
217 |
"length": length, |
211 |
218 |
"children": children, |
212 |
219 |
} |
213 |
220 |
|
214 |
221 |
return markdown_code |
215 |
222 |
""" |
216 |
223 |
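The docstring's first steps (normalizing blank lines, turning Setext headers into atx headers, and giving every header a dash-separated, unique id) can be sketched on their own. This is only an illustration under assumptions, not the code from this commit: the function names `normalize_blank_lines`, `setext_to_atx`, `slugify` and `convert_headers` are hypothetical, and the slug rule (every run of non-alphanumeric characters becomes one dash, duplicates get a numeric suffix) is one possible reading of the checklist.

```python
import re


def normalize_blank_lines(text):
    """Empty whitespace-only lines, then collapse runs of 3+ newlines."""
    text = re.sub(r"^[ \t]+$", "", text, flags=re.MULTILINE)
    return re.sub(r"\n{3,}", "\n\n", text)


def setext_to_atx(text):
    """Rewrite 'Title' underlined with === or --- as '# Title' / '## Title'."""
    text = re.sub(r"^(?P<title>.+)\n=+[ \t]*$", r"# \g<title>",
                  text, flags=re.MULTILINE)
    return re.sub(r"^(?P<title>.+)\n-+[ \t]*$", r"## \g<title>",
                  text, flags=re.MULTILINE)


def slugify(title, seen):
    """Lowercase the title, replace non-alphanumeric runs with a dash,
    and append -2, -3, ... if the id was already handed out."""
    slug = re.sub(r"[\W_]+", "-", title.lower()).strip("-") or "section"
    candidate, n = slug, 1
    while candidate in seen:
        n += 1
        candidate = "{0}-{1}".format(slug, n)
    seen.add(candidate)
    return candidate


# One to six hashes, the title, and an optional cosmetic closing run of
# hashes that must be separated from the title by whitespace.
ATX_HEADER = re.compile(
    r"^(?P<hashes>#{1,6})[ \t]+(?P<title>.+?)(?:[ \t]+#+)?[ \t]*$",
    re.MULTILINE,
)


def convert_headers(text):
    """Run the three steps and emit <hN id="..."> tags."""
    seen = set()

    def render(match):
        level = len(match.group("hashes"))
        title = match.group("title")
        return '<h{0} id="{1}">{2}</h{0}>'.format(
            level, slugify(title, seen), title)

    return ATX_HEADER.sub(render, setext_to_atx(normalize_blank_lines(text)))


sample = "Title\n=====\n\n   \n\n\n# Title #\n## Sub part ##\n"
result = convert_headers(sample)
```

Using a replacement function with a single compiled pattern avoids the search-and-splice loop per header level, and `re.MULTILINE` is what lets `^` and `$` anchor on every line rather than only on the whole string.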
models.py
21 additions and 0 deletions.
1 |
1 |
from django.db import models |
2 |
2 |
import datetime |
3 |
3 |
import os |
4 |
4 |
|
5 |
5 |
def post_title_directory(instance, filename): |
6 |
6 |
""" Files will be uploaded to MEDIA_ROOT/blog/<year of publishing>/<blog |
7 |
7 |
title> |
8 |
8 |
The blog title is determined by the text before the first period (".") in |
9 |
9 |
the filename. So if the file has the name "Trains are bæ.en.md", the file |
10 |
10 |
will be stored in "blog/<this year>/Trains are bæ". Name your files |
11 |
11 |
properly! |
12 |
12 |
It should also be noted that all files are stored in the same folder if they |
13 |
13 |
belong to the same blogpost, regardless of language. The titles that are |
14 |
14 |
displayed to the user however, should be the titles of the files themselves, |
15 |
15 |
which should be in the native language. So if a blog post is titled |
16 |
16 |
"Universities of Belgium", its Dutch counterpart should be titled |
17 |
17 |
"Universiteiten van België", so the correct title can be derived from the |
18 |
18 |
filename. |
19 |
19 |
|
20 |
20 |
Recommended way to name the uploaded file: "<name of blog post in language |
21 |
21 |
it's written>.md". This removes the maximum amount of redundancy (e.g. the |
22 |
22 |
language of the file can be derived from the title, no ".fr.md" or something |
23 |
23 |
like that necessary), and can directly be used for the end user (the title |
24 |
24 |
is what should be displayed). |
25 |
25 |
""" |
26 |
26 |
english_file_name = os.path.basename(instance.english_file.name) # TODO: Test if this returns the file name! |
27 |
27 |
english_title = english_file_name.rpartition(".")[0] |
28 |
28 |
year = datetime.date.today().year |
29 |
29 |
|
30 |
30 |
return "blog/{0}/{1}/{2}".format(year, english_title, filename) |
31 |
31 |
|
32 |
32 |
class Post(models.Model): |
33 |
33 |
""" Represents a blog post. The title of the blog post is determined by the name |
34 |
34 |
of the files. |
35 |
35 |
A blog post can be in 5 different languages: German, Spanish, English, French, |
36 |
36 |
and Dutch. For all these languages, a separate field exists. Thus, a |
37 |
37 |
translated blog post has a separate file for each translation, and is |
38 |
38 |
separated from Django's internationalization/localization system. |
39 |
39 |
Only the English field is mandatory. The others may contain a value if a |
40 |
40 |
translated version exists, which will be displayed accordingly. |
41 |
41 |
""" |
42 |
42 |
published = models.DateTimeField(auto_now_add=True) |
43 |
43 |
english_file = models.FileField(upload_to=post_title_directory, unique=True, blank=False) |
44 |
44 |
dutch_file = models.FileField(upload_to=post_title_directory, blank=True) |
45 |
45 |
french_file = models.FileField(upload_to=post_title_directory, blank=True) |
46 |
46 |
german_file = models.FileField(upload_to=post_title_directory, blank=True) |
47 |
47 |
spanish_file = models.FileField(upload_to=post_title_directory, blank=True) |
48 |
48 |
# Only the English file can be unique, because apparently, there can't be |
49 |
49 |
# two blank fields in a unique column. Okay then. |
50 |
50 |
|
51 |
51 |
def __str__(self): |
52 |
52 |
return os.path.basename(self.english_file.name).rpartition(".")[0] |
53 |
53 |
|
+ |
54 |
class Comment(models.Model): |
+ |
55 |
""" Represents a comment on a blog post. |
+ |
56 |
|
+ |
57 |
Comments are not linked to an account or anything, I'm trusting the |
+ |
58 |
commenter that he is honest with his credentials. That being said: |
+ |
59 |
XXX: Remember to put up a notification that comments are not checked for |
+ |
60 |
identity, and, unless verified by a trustworthy source, cannot be seen as |
+ |
61 |
being an actual statement from the commenter. |
+ |
62 |
Comments are linked to a blogpost, and are not filtered by language. (So a |
+ |
63 |
comment made by someone reading the article in Dutch, that's written in |
+ |
64 |
Dutch, will show up (unedited) for somebody who's reading the Spanish |
+ |
65 |
version.) |
+ |
66 |
XXX: Remember to notify (tiny footnote or something) that comments showing |
+ |
67 |
up in a foreign language is by design, and not a bug. |
+ |
68 |
""" |
+ |
69 |
date = models.DateTimeField(auto_now_add=True) |
+ |
70 |
name = models.TextField() |
+ |
71 |
mail = models.EmailField() |
+ |
72 |
post = models.ForeignKey(Post) # TODO: Finish this class and the shit... |
+ |
73 |
|
+ |
74 |
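The docstring of post_title_directory says the title is the text before the first period, while the code uses rpartition, which splits at the last one. For a name such as "Trains are bæ.en.md" the two give different directories. The sketch below shows the difference without Django; `title_from_filename` and `upload_path` are hypothetical helpers, and taking the year as a parameter is an assumption made so the result does not depend on today's date.

```python
import os


def title_from_filename(name, first_period=True):
    """Derive the post title from an uploaded file name.

    first_period=True follows the docstring (text before the first ".");
    first_period=False mirrors the rpartition call in the diff above.
    """
    base = os.path.basename(name)
    if first_period:
        return base.partition(".")[0]
    return base.rpartition(".")[0]


def upload_path(english_name, filename, year):
    """Build the blog/<year>/<title>/<filename> path used for uploads."""
    return "blog/{0}/{1}/{2}".format(
        year, title_from_filename(english_name), filename)


by_docstring = title_from_filename("uploads/Trains are bæ.en.md")
by_rpartition = title_from_filename("uploads/Trains are bæ.en.md",
                                    first_period=False)
path = upload_path("Trains are bæ.en.md", "photo.png", 2017)
```

With the recommended "<title>.md" naming both variants agree. They only diverge when a language suffix such as ".en" is present, or when the name has no period at all, in which case rpartition yields an empty title.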