Word Attachment

rfazendeiro

I wrote a web application that allows users to search for words in files. These files can be .doc, .pdf, .txt, etc.

Right now the search engine is working, but the other day a user reported a bug: the search does not look inside files that are attached to (embedded in) a .doc file.

I've been searching the web and can't find anything about this. Can anyone tell me how I can extract an attachment from a Word document, especially a PDF file?

thx
 
hello again :)

Well, I've managed to extract attachments from a .doc file if they are .doc, .xls, or .ppt files.

I'm really lost on how to extract PDF files. Any help would be really appreciated.

thank you all
 
Well, actually extracting Office files is pretty easy, because the Office interop assemblies support those kinds of files.

Here's a sample of how I take the Office documents out of a Word document:

Code:
// Needs: using System; using System.Runtime.InteropServices; (for Marshal)
public static void SearchFileAttachments(Uri file)
        {
            object missing = Type.Missing;
            object fileName = file.ToString();
            object VerbIndex = Microsoft.Office.Interop.Word.WdOLEVerb.wdOLEVerbOpen;
            Microsoft.Office.Interop.Word.Application word = new Microsoft.Office.Interop.Word.Application();
            Microsoft.Office.Interop.Word.Document docs = word.Documents.Open(ref fileName, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing);

            try
            {
                docs.Activate();

                foreach (Microsoft.Office.Interop.Word.InlineShape inlineShape in docs.InlineShapes)
                {
                    if (inlineShape.OLEFormat.ProgID != null)
                    {
                        switch (inlineShape.OLEFormat.ProgID)
                        {
                            case "PowerPoint.Show.8":
                                Microsoft.Office.Interop.PowerPoint.Application powerpoint = new Microsoft.Office.Interop.PowerPoint.Application();
                                try
                                {
                                    powerpoint.WindowState =
                                        Microsoft.Office.Interop.PowerPoint.PpWindowState.ppWindowNormal;
                                    inlineShape.OLEFormat.DoVerb(ref VerbIndex);
                                    powerpoint = Marshal.GetActiveObject("PowerPoint.Application") as Microsoft.Office.Interop.PowerPoint.Application;

                                    if (powerpoint != null)
                                    {

                                        Guid guid = Guid.NewGuid();
                                        string presentationName = guid + ".ppt";

                                        powerpoint.ActivePresentation.SaveAs(presentationName, Microsoft.Office.Interop.PowerPoint.PpSaveAsFileType.ppSaveAsPresentation, Microsoft.Office.Core.MsoTriState.msoTrue);
                                    }
                                }
                                catch (Exception ex) { /* exception code here */ }
                                finally
                                {
                                    if (powerpoint != null)
                                    {
                                        powerpoint.ActivePresentation.Close();
                                        powerpoint.Quit();
                                    }
                                }
                                break;
                            case "Excel.Sheet.8":
                                Microsoft.Office.Interop.Excel.Application excel = new Microsoft.Office.Interop.Excel.Application();
                                try
                                {
                                    excel.Visible = false;
                                    excel.ScreenUpdating = false;
                                    excel.Left = -1000;

                                    inlineShape.OLEFormat.DoVerb(ref VerbIndex);

                                    excel = Marshal.GetActiveObject("Excel.Application") as Microsoft.Office.Interop.Excel.Application;

                                    if (excel != null)
                                    {

                                        Guid guid = Guid.NewGuid();
                                        object workBookName = guid + ".xls";

                                        excel.ActiveWorkbook.SaveAs(workBookName, missing, missing, missing, missing,
                                                                    missing, Microsoft.Office.Interop.Excel.XlSaveAsAccessMode.xlNoChange,
                                                                    missing, missing, missing, missing, missing);
                                    }
                                }
                                catch (Exception ex) { /* exception code here */ }
                                finally
                                {
                                    if (excel != null)
                                    {
                                        excel.Workbooks.Close();
                                        excel.Quit();
                                    }
                                }

                                break;
                            case "Word.Document.8":                            
                                Microsoft.Office.Interop.Word.Application wordDocument = new Microsoft.Office.Interop.Word.Application();
                                try
                                {
                                    wordDocument.Visible = false;
                                    wordDocument.ScreenUpdating = false;
                                    wordDocument.Left = -1000;

                                    inlineShape.OLEFormat.DoVerb(ref VerbIndex);
                                    wordDocument = Marshal.GetActiveObject("Word.Application") as Microsoft.Office.Interop.Word.Application;

                                    if (wordDocument != null)
                                    {
                                        Guid guid = Guid.NewGuid();
                                        object wordDocumentName = guid + ".doc";

                                        wordDocument.ActiveDocument.SaveAs(ref wordDocumentName, ref missing, ref missing, ref missing, ref missing,
                                                                            ref missing, ref missing, ref missing, ref missing, ref missing, ref missing,
                                                                            ref missing, ref missing, ref missing, ref missing, ref missing);

                                    }

                                }
                                catch (Exception ex) { /* exception code here */ }
                                finally
                                {
                                    if (wordDocument != null)
                                    {

                                        object saveChanges = Microsoft.Office.Interop.Word.WdSaveOptions.wdDoNotSaveChanges;
                                        object originalFormat = Microsoft.Office.Interop.Word.WdOriginalFormat.wdWordDocument;
                                        object routeDocument = true;
                                        wordDocument.ActiveDocument.Close(ref saveChanges, ref originalFormat, ref routeDocument);
                                        //wordDocument.Quit(ref saveChanges, ref originalFormat, ref routeDocument);
                                    }
                                }
                                break;

                            default:
                                break;

                        }

                    }

                }
            }
            catch (Exception ex) { /* exception code here */ }
            finally
            {
                object saveChanges = Microsoft.Office.Interop.Word.WdSaveOptions.wdDoNotSaveChanges;
                object originalFormat = Microsoft.Office.Interop.Word.WdOriginalFormat.wdWordDocument;
                object routeDocument = true;

                docs.Close(ref saveChanges, ref originalFormat, ref routeDocument);
                word.Quit(ref saveChanges, ref originalFormat, ref routeDocument);
            }
        }

For the Microsoft Office attachment types I get support by referencing these assemblies:
Code:
Microsoft.Office.Interop.Excel.dll
Microsoft.Office.Interop.PowerPoint.dll
Microsoft.Office.Interop.Word.dll

but I have no such luck with PDFs. I have tried importing a DLL from Acrobat, but with no success. Any idea how to extract the PDF?


As a side note, if it's a PDF, the inlineShape.OLEFormat.ProgID is AcroExch.Document.X, where "X" is the version of the PDF file.
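
The ProgID check described above could be wrapped in a small helper and called from the default branch of the switch. This is only a sketch: the class and method names (OleProgIds, IsPdfProgId) are made up for illustration, and it assumes every embedded PDF reports a ProgID starting with "AcroExch.Document", which is what I have seen so far but have not verified for every Acrobat version.

```csharp
using System;

public static class OleProgIds
{
    // Hypothetical helper: returns true when the ProgID looks like an
    // embedded Acrobat document, e.g. "AcroExch.Document.7".
    // Assumption: the version suffix varies, so only the prefix is compared.
    public static bool IsPdfProgId(string progId)
    {
        return progId != null &&
               progId.StartsWith("AcroExch.Document", StringComparison.OrdinalIgnoreCase);
    }
}
```

Detecting the PDF is the easy part; the helper does nothing to extract it.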
 
So I've been frantically searching the web for this problem (getting the PDF out of a Word attachment) and found this code.

The author says he's able to get 95% of Word attachments out, but the problem is that it's written in Perl :/

I've been trying to convert it to C# but without much success. Can anyone here help translate this to C#?


Code:
$byte = "";
$buffer = "";

# Open the infh filehandle with the "inname" file containing the OLE object.
#$infh = new FileHandle;
#sysopen $infh, "$explodeinto/$inname", O_RDONLY;

# Skip the first 6 bytes, these appear to be useless.
sysseek $infh, 6, SEEK_SET; # Skip 1st 6 bytes

# Read a null-terminated string of bytes,
# this becomes the output filename.
$outname = "";
$finished = 0;
$length = 0;
until ($byte eq "\0" || $finished || $length>1000) {
    # Read a C-string into $outname
    sysread($infh, $byte, 1) or $finished = 1;
    $outname .= $byte;
    $length++;
}
# If the filename was way too long, this is probably corrupt.
next OLEFILE if $length>1000; # Bail out if it went wrong

# Throw away the next null-terminated string of bytes.
$finished = 0;
$byte = 1;
$length = 0;
until ($byte eq "\0" || $finished || $length>1000) { # Throw away a C-string
    sysread($infh, $byte, 1) or $finished = 1;
    $length++;
}
# If the string was way too long, this is probably corrupt.
next OLEFILE if $length>1000; # Bail out if it went wrong

# Skip the next 4 bytes of the file.
sysseek $infh, 4, Fcntl::SEEK_CUR or next OLEFILE; # Skip next 4 bytes

# Read the next 4 bytes into a 4-byte int called "$number".
sysread $infh, $number, 4 or next OLEFILE;
$number = unpack V, $number;
#print STDERR "Skipping $number bytes of header filename\n";

# If the number $number was a reasonable size,
# skip that many bytes of the file.
if ($number>0 && $number<1_000_000) {
    sysseek $infh, $number, Fcntl::SEEK_CUR; # Skip the next bit of header (C-string)
} else {
    next OLEFILE;
}

# Read the next 4 bytes into a 4-byte int called "$number".
# This is the length of the real embedded file we want to extract.
sysread $infh, $number, 4 or next OLEFILE;
$number = unpack V, $number;
#print STDERR "Reading $number bytes of file data\n";

# Read the $number number of bytes into memory into a chunk
# of memory allocated which is at least $number bytes long.
# Do a sanity check that the number of bytes we have asked it to read
# is less than the total length of the input file.
sysread $infh, $buffer, $number
    if $number>0 && $number < $size; # Sanity check

# Create an output file with a filename which is a sanitised safe
# version of the filename we read at the top of this bit of code.
$outfh = new FileHandle;
$outsafe = $this->MakeNameSafe($outname, $explodeinto);
sysopen $outfh, "$explodeinto/$outsafe", (O_CREAT | O_WRONLY)
    or next OLEFILE;

# If the output file is less than 1Gbyte long, write out the data we just read.
# This creates the file containing the embedded file we wanted to extract.
# Then close that output file.
if ($number>0 && $number<1_000_000_000) { # Number must be reasonable!
    syswrite $outfh, $buffer, $number or next OLEFILE;
}
close $outfh;
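
For anyone attempting the same conversion, here is a rough C# sketch of the byte layout the Perl code walks through. It is not a tested port. It assumes the input stream already holds the raw bytes of the embedded OLE object (the file the Perl calls "$inname"); getting that stream out of the .doc is not covered. The names OlePackageReader and TryExtract are made up for illustration, the file name is decoded as ASCII (an assumption), and unlike the Perl it gives up as soon as the stream ends early.

```csharp
using System;
using System.IO;
using System.Text;

public static class OlePackageReader
{
    private const int MaxStringLength = 1000; // same cap as the Perl code

    // Reads a null-terminated string; returns null on end of stream
    // or when the string is longer than MaxStringLength.
    private static string ReadCString(Stream input)
    {
        var bytes = new MemoryStream();
        while (bytes.Length <= MaxStringLength)
        {
            int b = input.ReadByte();
            if (b < 0) return null;
            if (b == 0) return Encoding.ASCII.GetString(bytes.ToArray());
            bytes.WriteByte((byte)b);
        }
        return null;
    }

    // Hypothetical entry point: pulls the stored file name and payload
    // out of the stream, following the same steps as the Perl snippet.
    public static bool TryExtract(Stream input, out string fileName, out byte[] data)
    {
        fileName = null;
        data = null;
        try
        {
            using (var reader = new BinaryReader(input, Encoding.ASCII, true))
            {
                if (reader.ReadBytes(6).Length != 6) return false;   // skip first 6 bytes
                string name = ReadCString(input);                    // output file name
                if (name == null) return false;
                if (ReadCString(input) == null) return false;        // discard second string
                if (reader.ReadBytes(4).Length != 4) return false;   // skip 4 bytes

                uint headerLength = reader.ReadUInt32();             // little-endian, like unpack V
                if (headerLength == 0 || headerLength >= 1000000) return false;
                if (reader.ReadBytes((int)headerLength).Length != headerLength) return false;

                uint dataLength = reader.ReadUInt32();               // length of the embedded file
                if (dataLength == 0 || dataLength >= 1000000000) return false;
                // Sanity check against the remaining input, when it can be measured.
                if (input.CanSeek && dataLength > input.Length - input.Position) return false;

                byte[] payload = reader.ReadBytes((int)dataLength);
                if (payload.Length != dataLength) return false;

                fileName = name;
                data = payload;
                return true;
            }
        }
        catch (EndOfStreamException)
        {
            return false; // a length field was cut off
        }
    }
}
```

The caller would still have to sanitise the returned name (for example with Path.GetFileName) before writing the bytes to disk, which is the job MakeNameSafe does in the Perl.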

Thank you to all
 