Discussion:
[libreoffice-users] Can LO build a TOC from a PDF file?
Gilles
2017-07-09 17:20:05 UTC
Hello,

This PDF file
<https://www.legifrance.gouv.fr/download_code_pdf.do?cidTexte=LEGITEXT000006074228&dlType=pdf>
has no Table of Contents, and I was wondering if LO could grab all the
headers and build a TOC.

Thank you.



--
View this message in context: http://nabble.documentfoundation.org/Can-LO-build-a-TOC-from-a-PDF-file-tp4217910.html
Sent from the Users mailing list archive at Nabble.com.
--
To unsubscribe e-mail to: users+***@global.libreoffice.org
Problems? http://www.libreoffice.org/get-help/mailing-lists/how-to-unsubscribe/
Posting guidelines + more: http://wiki.documentfoundation.org/Netiquette
List archive: http://listarchives.libreoffice.org/global/users/
All messages sent to this list will be publicly archived and cannot be deleted
Jean-Francois Nifenecker
2017-07-09 21:58:57 UTC
Hello Gilles,
In order to create a PDF with a TOC/index, you'll have to apply heading
styles to the appropriate paragraphs.

Opening the PDF in LibO won't get you anywhere: the tool that opens it is
Draw, which can't apply paragraph styles the way a text processor can.

I can't see a way to do that quickly, I'm afraid: a copy/paste from the
PDF document to Writer is possible, but you'll have to fix a lot of
things (e.g. useless carriage returns) and apply heading styles by hand.
On a 400+ page document this is a big PITA.
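
The "useless carriage returns" part, at least, can be scripted before the text is pasted into Writer. Below is a minimal Python sketch, not a tested workflow for this particular PDF: it assumes the copied text has been saved as plain text and that blank lines still mark the real paragraph breaks, which is not guaranteed for text copied from a PDF. The function name `unwrap_paragraphs` is made up for illustration.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines into single-line paragraphs.

    Assumption: one or more blank lines separate real paragraphs.
    Every other line break is treated as a leftover from the PDF layout.
    """
    # Split on blank (or whitespace-only) lines to get candidate paragraphs.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Inside each paragraph, collapse all runs of whitespace to one space.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)


if __name__ == "__main__":
    sample = "First line of a\nwrapped paragraph.\n\nSecond\nparagraph."
    print(unwrap_paragraphs(sample))
```

Heading styles would still have to be applied afterwards; this only removes the stray line breaks.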

Hopefully someone else will come up with brighter ideas.


Bien cordialement,
--
Jean-Francois Nifenecker, Bordeaux
Cley Faye
2017-07-09 22:54:51 UTC
You want brighter ideas? Say no more!

So... hmm... I'm afraid there won't be many fully automated tools that can
build a TOC for you. A PDF basically contains a lot of individual elements
that are arranged to look like something coherent.
From the document you linked, it would theoretically be possible to write a
tool that splits out every page, grabs the raw text, uses a regex to find the
actual titles, builds a TOC, and injects it into the PDF. This would assume:
- Text extraction works correctly (that's not always the case with PDF)
- Titles always follow the same format

But on this kind of document, you can definitely get some acceptable
results. I experimented a bit. The output is here:
http://www.cjoint.com/c/GGjw0OtPkGc
And for the curious, the "script" I used is here:
https://pastebin.com/icQSZxQr

As you'll see, it is VERY specific to this document, but it is possible to
do something.
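
To make the regex step concrete, here is a minimal Python sketch of the idea. It is not the script linked above. It assumes the text of each page has already been extracted by some other tool into a list of strings, and that headings sit on their own line and start with one of the words "Livre", "Titre", "Chapitre" or "Section". That pattern, the level mapping, and the function names are guesses for illustration only; they would have to be adapted to whatever the real document uses.

```python
import re

# Hypothetical heading format: a line starting with one of these words,
# followed by a number or label. Adjust to the actual document.
HEADING_RE = re.compile(r"^(Livre|Titre|Chapitre|Section)\s+\S.*$")

# Assumed nesting depth of each heading keyword.
LEVELS = {"Livre": 1, "Titre": 2, "Chapitre": 3, "Section": 4}


def build_toc(pages):
    """Return (level, title, page_number) tuples for every heading found.

    `pages` is a list of page texts; index 0 is taken to be page 1.
    """
    toc = []
    for page_number, text in enumerate(pages, start=1):
        for line in text.splitlines():
            match = HEADING_RE.match(line.strip())
            if match:
                toc.append((LEVELS[match.group(1)], match.group(0), page_number))
    return toc


def format_toc(toc):
    """Render the entries as indented plain text, one heading per line."""
    return "\n".join(
        f"{'  ' * (level - 1)}{title} ... {page}" for level, title, page in toc
    )


if __name__ == "__main__":
    pages = [
        "Livre Ier : Dispositions\nSome body text",
        "Chapitre II : Foo\nArticle L1\nbody text",
    ]
    print(format_toc(build_toc(pages)))
```

This only produces a plain-text listing. Injecting the entries into the PDF as bookmarks needs a PDF library and is not shown here. Both assumptions listed above still apply: if text extraction garbles a heading, or a body line happens to start with one of the keywords, the result will be wrong.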
gordon cooper
2017-07-10 04:56:57 UTC
There is a roundabout way of doing this using Nuance's PDF Converter, but
I have not used it since I abandoned Windows® several years ago. With
PDF Converter, one can make a Word file that LO can read, then
use LO's Insert ToC tool and export the result back to PDF.

Gordon

Tauranga N.Z.
Gilles
2017-07-10 07:12:29 UTC
Thanks very much, everyone. I naively thought it could simply be done by
converting the PDF into text in LO and running a few regexes to build a TOC :-/


